{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping Data\n", "\n", "1. Skills we will develop\n", "2. Overview of different ways to get data \n", "3. Overview of Python packages we can use\n", "\n", "### What skills do I need to learn to be a master Hacker(wo)man?\n", "\n", "1. Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you\n", "2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs) \n", "3. Doing that for a large number of webpages (building a \"scraper\" or \"crawler\" or \"spider\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ways to get data from the web\n", "\n", "```{dropdown} 1: **Manually click and download.** \n", "\n", "The way you would have done it before this class.\n", "```\n", "\n", "```{dropdown} 2: **Let pandas download your data,** like pd.read_csv(url)\n", "\n", "Did you know? Pandas can often directly read tables on webpages! \n", "- Try `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')` (a short sketch of this appears right after this dropdown)\n", "- Very easy and fast! You don't even need to save the webpage to your hard drive.\n", "- Notes on `read_html`: \n", " 1. It can only handle basic HTML tables encoded directly in the page (no Javascript, e.g.) and **only grabs displayed text -- embedded URLs are lost.**\n", " 2. If the website changes the data, the next time you run it, you'll get the newer version of data. (Unstable, potentially, but also updates automatically.)\n", "\n", "```\n", "\n",
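"For a quick taste of option 2, here is a minimal sketch using the S&P 500 page linked above. `read_html` returns a *list* of DataFrames (one per table it finds); which element of that list holds the table you want can change if Wikipedia edits the page, so check a few.\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# read_html grabs every <table> on the page and returns a list of DataFrames\n", "tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')\n", "print(len(tables))   # how many tables did it find?\n", "sp500 = tables[0]    # on this page, the constituent list is usually the first table\n", "print(sp500.head())\n", "```\n", "\n",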
"```{dropdown} 3: **\"Install and play\" APIs,** like pandas_datareader \n", "\n", "API stands for **Application Programming Interface,** and it is a way for your computer to send a request (a query) to a server and get some response (hopefully useful data).\n", "\n", "Plug and play APIs let you interact with a website without specifying the exact API requests to send to the server.\n", "- The `pandas_datareader` plug-in for Yahoo stock prices is one version of this. \n", "- `datadotworld` was another. \n", "- Kaggle and most of the [data sources listed on our resources page](https://ledatascifi.github.io/ledatascifi-2021/content/about/resources.html#resources-tutorials-and-data) have API packages for Python. \n", "- I upload your peer reviews and manage assignment permissions using `PyGithub` to interact with GitHub.\n", "\n", "```\n", "\n", "```{tip} \n", "\n", "**Options 1-3 are BY FAR the easiest.** If you only need a handful of tables (the threshold depends on your coding speed, but roughly 10-20), download what you need manually. If you need more than that, it's time to scrape: abandon the manual option and go with `pandas` or a nice API package.\n", "\n", "Never ever try \\#4 or \\#5 without searching for \"\\<website name\\> python api\" first. \n", "```\n", "\n", "```{dropdown} 4: **Manual API queries** for websites without \"install and play\" APIs\n", "\n", "Many sites have an API port of some kind serving up the data they show visitors.\n", "\n", "The next few pages of this website (and the associated lectures) will cover this. If you can see your search query in the URL, like \"https://www.google.com/search?q=gme+stock+price\", then you can run the searches manually and get the data after opening the page using one of the approaches below (a tiny `requests` sketch follows this dropdown):\n", "1. If the data is in the HTML code, you can scrape it using the approaches we will discuss. You can look at the HTML code for any webpage by right-clicking and then selecting \"View Page Source\" (or similar, depending on the browser). After opening the HTML code, CTRL+F to look for some of the data. If the data is within the source code, you can scrape it in various ways.\n", "2. If you can see the data on the webpage, but the data isn't in the HTML code when you CTRL+F for it, you're in luck! You'll need to do a few tricks with your browser to find where the API is hidden and how to use it, but after that, you will be able to download the data without doing any HTML scraping! [See this example.](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/)\n", "3. Sometimes a webpage \"hides\" the way to run queries like this: you run a search, but the URL doesn't obviously look like a search. Often, though, that page contains a \"backdoor\" to an API you can query just like the example above. [This](https://nbviewer.jupyter.org/github/nealcaren/ScrapingData/blob/master/Notebooks/Bonus_Undocument_APIs.ipynb) tutorial shows one example of this and, more importantly, how the author found the API.\n", "\n", "```\n", "\n",
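"To make option 4 concrete, here is a minimal sketch of sending a query yourself with `requests`. The URL and the `q` parameter below are placeholders, not a documented API -- you would swap in whatever endpoint and parameter names you discover for the site you care about.\n", "\n", "```python\n", "import requests\n", "\n", "# placeholder endpoint and parameter -- replace with whatever the site actually uses\n", "url = 'https://www.example.com/search'\n", "r = requests.get(url, params={'q': 'gme stock price'})\n", "\n", "print(r.status_code)   # 200 means the server answered the request\n", "print(r.url)           # the full URL requests built for you, query string included\n", "# if the endpoint returns JSON, r.json() gives you a dict/list;\n", "# if it returns HTML, r.text is the raw page source, ready for parsing\n", "```\n", "\n",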

"````{dropdown} 5: **Scraping the data on the website by visiting each page and downloading the data needed**\n", "\n", "The last resort. You can't find the API serving the data, but your eyes see it. And you want it, because websites contain a lot of data, like [GoT's IMDB page](https://www.imdb.com/title/tt0944947/?ref_=nv_sr_srsg_0).\n", "\n", "```{warning}\n", "This is an essential tool, but it should be the last thing you try!\n", "```\n", "\n", "````\n", "\n", "```{note} Wisdom from [Greg Reda](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) about scraping data\n", "\n", "> 1. You should check a site's terms and conditions before you scrape them. It's their data and they likely have some rules to govern it.\n", "> 2. Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.\n", "> 3. Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.\n", "> 4. Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.\n", "\n", "```\n", "\n",
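"Greg Reda's \"be nice\" point usually boils down to one extra line of code: pause between requests. A minimal sketch, with placeholder URLs:\n", "\n", "```python\n", "from time import sleep\n", "\n", "import requests\n", "\n", "urls = ['https://example.com/page1', 'https://example.com/page2']   # placeholders\n", "\n", "for url in urls:\n", "    r = requests.get(url)\n", "    # ...parse or save r.text here...\n", "    sleep(3)   # wait a few seconds so you don't hammer the site's server\n", "```\n",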

"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Useful packages, tricks, and tips\n", "\n", "Web scraping packages are always developing and evolving. (A short \"open, then parse\" sketch follows the table below.)\n", "\n", "| Task | Thoughts |\n", "| :--- | :--- |\n", "| To \"open\" a page | `urllib` or `requests`. `requests` is probably the best for sending API queries. <br> <br> Warning: lots of walkthroughs online use `urllib2`, which worked for Python2 but not Python3. Use `urllib` instead, and you might have to include a few tweaks. For example, if you see `from urllib2 import urlopen`, replace it with `from urllib.request import urlopen`. |\n", "| To parse a page | `beautifulsoup`, `lxml`, or `pyquery` |\n", "| Combining opening/parsing | `requests_html` is a relatively new package and might be excellent. Its code is simply a combination of many of the above. |\n", "| Blocked because you look like a bot or need to accept cookies? | `selenium` is one way to \"impersonate\" a human, and it can also help develop scraping macros, but you might not need it except on difficult scraping projects. It opens a literal browser window. <br> <br> `requests_html` and `requests` can also store and use cookies. I'd recommend you try those before `selenium`. |\n", "| Blocked because you're sending requests too fast? | `from time import sleep` allows you to `sleep(<# of seconds>)` your code. |\n", "| Wonder what your current HTML looks like? | `from IPython.display import HTML`, then `HTML(<your html string>)` will render the HTML you have. <br> <br> E.g. if you're using `r = requests.get(url)`, then `HTML(r.text)` will render the page you just downloaded. |\n",
"| How do I find a particular \"piece\" of a webpage? | E.g. Q: Where is that table? <br> A: Oh, it's inside the HTML tag called \"table3\". <br> <br> You can search for elements via attributes, CSS selectors, XPath, and text. This will make more sense soon. <br> <br> To find that info: Right click on an element you're interested in and click \"Inspect Element\". (F12 is the Windows shortcut.) |\n",
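"\n", "To make the table concrete, here is a minimal \"open, then parse\" sketch that pairs `requests` with `beautifulsoup` (installed as `beautifulsoup4`). It reuses the Wikipedia page from the `read_html` example; the specific tags and CSS class it looks for are just illustrations, and other sites will need different selectors.\n", "\n", "```python\n", "import requests\n", "from bs4 import BeautifulSoup   # pip install beautifulsoup4\n", "\n", "url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'\n", "r = requests.get(url)                          # open the page\n", "\n", "soup = BeautifulSoup(r.text, 'html.parser')    # parse the raw HTML\n", "print(soup.find('h1').text)                    # find an element by tag name\n", "first_table = soup.find('table')               # or by CSS selector: soup.select('table.wikitable')\n", "print(len(first_table.find_all('tr')))         # how many rows are in that table?\n", "```\n",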

"\n", "#### My suggestion\n", "\n", "This is subject to change, but **I think you should pick ONE opening and ONE parsing module and stick with it _for now_**. `requests_html` is a pretty good option: it opens pages and can parse them, and it allows you to use `lxml` or `pyquery` within it.\n", "\n", "You can change and try other stuff as you go, but get as familiar with one package as you can (in a cheap/efficient way).\n", "\n", "Now to contradict myself: some of the packages above can't do things others can, or do them much slower, or the code is hard to write, read, and debug. Sometimes you're holding a hammer but you need a screwdriver. What I'm saying is: if another package can easily do the job, use it. (Just realize that learning a new package comes with a fixed cost, so be sure you need that screwdriver before grabbing it.)\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }