4.1. Scraping Data
Skills we will develop
Overview of different ways to get data
Overview of python packages we can use
4.1.1. What skills do I need to learn to be a master Hacker(wo)man?
Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you
How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs)
Doing that for a large number of webpages (building a “scraper” or “crawler” or “spider”)
4.1.2. Ways to get data from the web
1: Manually click and download.
The way you would have done it before this class.
2: Let pandas download your data, like pd.read_csv(url)
Did you know? Pandas can often directly read tables on webpages!
Try pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
Very easy and fast! You don’t even need to save the webpage to your hard drive.
Notes on read_html:
It can only handle basic HTML tables encoded directly in the page (no JavaScript, for example) and only grabs displayed text; embedded URLs are lost.
If the website changes the data, the next time you run it, you’ll get the newer version of data. (Unstable, potentially, but also updates automatically.)
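To make the notes on read_html concrete, here is a minimal sketch that runs on a made-up HTML snippet instead of a live page, so the result does not change between runs. The table contents and the example.com URLs are hypothetical. It assumes pandas plus an HTML parser such as lxml are installed, and the extract_links argument is, as far as I know, only available in pandas 1.5 and later.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for a downloaded page (hypothetical data).
html = """
<table>
  <tr><th>Ticker</th><th>Company</th></tr>
  <tr><td>AAA</td><td><a href="https://example.com/aaa">Alpha Co</a></td></tr>
  <tr><td>BBB</td><td><a href="https://example.com/bbb">Beta Co</a></td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)  # only the displayed text survives; the hrefs are gone

# Assumption: pandas >= 1.5. Body cells become (text, href) tuples,
# with href set to None for cells that contain no link.
df_links = pd.read_html(StringIO(html), extract_links="body")[0]
print(df_links["Company"].iloc[0])
```

Passing a URL instead of the StringIO object works the same way, but then the output depends on whatever the site serves that day.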
3: “Install and play” APIs, like pandas_datareader
API stands for Application Programming Interface, and it is a way for your computer to send a request (a query) to a server and get some response (hopefully useful data).
Plug and play APIs let you interact with a website without specifying the exact API requests to send to the server.
The pandas_datareader plug-in for Yahoo stock prices is one version of this. datadotworld was another.
Kaggle and most of the data sources listed on our resources page have API packages for Python.
I upload your peer reviews and manage assignment permissions using PyGithub to interact with GitHub.
Tip
If you need <20ish tables (the threshold depends on your coding speed), download what you need manually.
If you need more, it’s time to scrape.
Options 1-3 are BY FAR the easiest. If you want more than 10 tables or so (but the threshold depends on your coding speed), I’d abandon the manual option and go with pandas or a nice API package.
Never ever try #4 or #5 without searching for “<website> python api” first.
4: Manual API queries for websites without “install and play” APIs
Many sites have an API port of some kind serving up the data they show visitors.
The next few website pages and lectures will cover this. If you can see your search query in the URL, like “https://www.google.com/search?q=gme+stock+price”, then you can run the searches manually and get the data after opening the page using one of two approaches:
If the data is in the HTML code, you can scrape it using the approaches we will discuss. You can look at the HTML code for any webpage by right-clicking and then selecting “View Page Source” (or similar, depending on the browser). After opening the HTML code, CTRL+F to look for some of the data. If the data is within the source code, you can scrape it various ways, which we will cover.
If you can see the data on the webpage, but the data isn’t in the HTML code when you CTRL+F for it, you’re in luck! You’ll need to do a few tricks with your browser to find where the API is hidden and how to use it, but after that, you will be able to download the data without doing any HTML scraping! See this example.
And sometimes a webpage is “hiding” the way to run queries like this API. You run a search and the URL doesn’t look obviously like a search. But often, inside that page is a “backdoor” to an API you can search just like the above example. This tutorial shows one example of this and more importantly, how the author found the API.
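When the search query is visible in the URL, you can build that URL in code instead of typing it by hand. The sketch below uses only the standard library and the Google URL from the example above. The helper name build_search_url is hypothetical, and whether a site allows automated queries is a separate question to check first.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, **params):
    """Return base_url with params appended as a URL-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


url = build_search_url("https://www.google.com/search", q="gme stock price")
print(url)  # https://www.google.com/search?q=gme+stock+price

# Going the other way: pull the query parameters out of a URL
# you copied from the browser's address bar.
query = parse_qs(urlsplit(url).query)
print(query)  # {'q': ['gme stock price']}
```

Changing the value of q (or looping over a list of values) gives one URL per search, which is the starting point for running many queries.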
5: Scraping the data on the website by visiting each page and downloading the data needed
The last resort. You can’t find the API serving the data, but your eyes see it. And you want it, cause websites contain a lot of data, like GoT’s IMDB page.
Warning
This is an essential tool, but should be the last thing you try!
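To give a flavor of what parsing a page for specific elements involves, here is a minimal sketch that collects every link's text and URL from a made-up HTML snippet. It uses the standard library's html.parser so it runs without extra installs. The class name LinkCollector and the page contents are hypothetical, and the dedicated parsing packages discussed in this chapter do the same job with less code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> tag we are currently inside
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Made-up page source (hypothetical data); in practice this would be
# the HTML you see under "View Page Source".
page = """
<ul>
  <li><a href="/episode/1">Episode <b>One</b></a></li>
  <li><a href="/episode/2">Episode Two</a></li>
  <li>No link here</li>
</ul>
"""

parser = LinkCollector()
parser.feed(page)
links = parser.links
print(links)
```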
Note
Wisdom from Greg Reda about scraping data
You should check a site’s terms and conditions before you scrape them. It’s their data and they likely have some rules to govern it.
Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don’t hammer the site’s server.
Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.
Web pages are inconsistent - There’s sometimes some manual clean up that has to happen even after you’ve gotten your data.
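The "be nice" and "scrapers break" advice can be sketched as a small loop that pauses between requests and records failures instead of crashing. The function name fetch_all, its parameters, and the one-second default delay are all assumptions for illustration. The fetch argument is whatever function you use to open a page, so the loop itself never touches the network.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any function that takes a URL and returns the page contents.
    Failures are recorded as None rather than raised, so one bad page
    does not stop the whole run.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # space out requests so we do not hammer the server
        try:
            results[url] = fetch(url)
        except Exception as err:
            results[url] = None
            print(f"skipped {url}: {err}")
    return results


# Demo with a stand-in fetch function (no network involved).
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("page layout changed")
    return f"<html>contents of {url}</html>"


demo = fetch_all(["page1", "broken-page", "page2"], fake_fetch, delay_seconds=0)
```

In real use you would pass your own page-opening function as fetch and pick a delay that suits the site. Checking afterward which entries are None tells you which pages need a second look.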
4.1.3. Useful packages, tricks, and tips
Web scraping packages are always developing and evolving.
Task | Thoughts |
---|---|
To “open” a page | |
To parse a page | |
Combining opening/parsing | |
Blocked because you look like a bot or need to accept cookies? | |
Blocked because you’re sending requests too fast? | |
Wonder what your current HTML looks like? | |
How do I find a particular “piece” of a webpage? E.g. Q: Where is that table? | |
4.1.3.1. My suggestion
This is subject to change, but I think you should pick ONE opening and ONE parsing module and stick with it for now. requests_html is a pretty good option that opens pages and can parse them, and it allows you to use lxml or pyquery within it.
You can change and try other stuff as you go, but get as familiar with one package as you can (in a cheap/efficient way).
Now to contradict myself: Some of the packages above can’t do things others can, or do them much slower, or the code is hard to write, read, and debug. Sometimes, you’re holding a hammer but you need a screwdriver. What I’m saying is, if another package can easily do the job, use it. (Just realize that learning a new package comes with a fixed cost, so be sure you need that screwdriver before grabbing it.)