4.1. Scraping Data

Skills we will develop
Overview of different ways to get data
Overview of Python packages we can use

4.1.1. What skills do I need to learn to be a master Hacker(wo)man?

Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you
How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs)
Doing that for a large number of webpages (building a “scraper” or “crawler” or “spider”)
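"Passing specific queries to a server" usually comes down to attaching query parameters to a URL. Here is a minimal sketch using only the standard library. The endpoint `https://example.com/search` and the parameter names `q` and `page` are hypothetical, chosen only for illustration; a real site defines its own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url, params):
    """Return base_url with params appended as a URL-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameter names, for illustration only.
url = build_query_url("https://example.com/search", {"q": "10-K filings", "page": 2})
print(url)

# Decode the query string again to check what the server would receive.
print(parse_qs(urlsplit(url).query))
```

Changing the values in the dictionary (a different search term, the next page number) is how a scraper controls what the server sends back.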

4.1.2. Ways to get data from the web

Tip

If you need <20ish tables (the threshold depends on your coding speed), download what you need manually.

If you need more, it’s time to scrape.

Options 1-3 are BY FAR the easiest. If you want more than 10 tables or so (but the threshold depends on your coding speed), I’d abandon the manual option and go with pandas or a nice API package.

Never ever try #4 or #5 without searching for “<website> python api” first.

Note

Wisdom from Greg Reda about scraping data

You should check a site’s terms and conditions before you scrape them. It’s their data and they likely have some rules to govern it.

Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don’t hammer the site’s server.

Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.

Web pages are inconsistent - There’s sometimes some manual clean up that has to happen even after you’ve gotten your data.
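The "be nice" and "scrapers break" advice can be sketched as a small loop that pauses between requests and records failures instead of crashing. This is one possible shape, not a standard recipe: `fetch_all`, `fake_fetch`, and the example URLs are hypothetical names, and `fake_fetch` stands in for a real download function so the sketch runs without touching any website.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # space out requests to spare the server
        try:
            results[url] = fetch(url)
        except Exception as exc:
            # Scrapers break: keep the error and move on to the next page.
            results[url] = exc
    return results


def fake_fetch(url):
    """Hypothetical stand-in for a real download function."""
    if url.endswith("/broken"):
        raise ValueError(f"could not fetch {url}")
    return f"<html>{url}</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
    delay_seconds=0.1,
)
for url, result in pages.items():
    print(url, "->", "FAILED" if isinstance(result, Exception) else "ok")
```

In real use you would pass a function that actually downloads the page and pick a delay that respects the site's terms.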

4.1.3. Useful packages, tricks, and tips

Web scraping packages are always developing and evolving.

Task	Thoughts
To “open” a page	`urllib` or `requests`. `requests` is probably the best for sending API queries. Warning: lots of walkthroughs online use `urllib2`, which exists in Python 2 but not in Python 3. Use `urllib` instead; you might have to make a few tweaks. For example, if you see `from urllib2 import urlopen`, replace it with `from urllib.request import urlopen`.
To parse a page	`beautifulsoup`, `lxml`, or `pyquery`
Combining opening/parsing	`requests_html` is a relatively new package and might be excellent. Its code is simply a combination of many of the above.
Blocked because you look like a bot or need to accept cookies?	`selenium` is one way to “impersonate” a human, and also can help develop scraping macros, but you might not need it except on difficult scraping projects. It opens a literal browser window. `requests_html` and `requests` can also store and use cookies. I’d recommend you try this before selenium.
Blocked because you’re sending requests too fast?	`from time import sleep` lets you pause your code with `sleep(<# of seconds>)` between requests.
Wonder what your current HTML looks like?	`from IPython.display import HTML`, then `HTML(<html object>)` will render the HTML you have. E.g. if you’re using `r = requests.get(url)`, then `HTML(r.text)` displays the page content stored in the response object.
How do I find a particular “piece” of a webpage?	E.g. Q: Where is that table? A: Oh, it’s inside the HTML tag called “table3”. You can search for elements via attributes, CSS selectors, XPath, and text. This will make more sense soon. To find that info: right-click on an element you’re interested in and click “Inspect” or “Inspect Element”. (F12 opens the browser’s developer tools on Windows.)
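To make the "open" and "find a piece" rows concrete, here is a minimal sketch that opens a page with the Python 3 import and pulls out one table by its attributes. Several things here are assumptions: the page is a made-up snippet served from a `data:` URL so the code runs offline, “table3” is treated as the table's `id` attribute, and the standard library's `html.parser` stands in for `beautifulsoup`, `lxml`, or `pyquery`, which do the same job with far less code. `TableFinder` is a hypothetical helper, not part of any package.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import urlopen  # Python 3 replacement for urllib2's urlopen

# Made-up page, served from a data: URL so the sketch needs no network access.
SAMPLE_HTML = (
    "<html><body>"
    '<table id="table1"><tr><td>skip me</td></tr></table>'
    '<table id="table3"><tr><td>AAPL</td><td>150</td></tr>'
    "<tr><td>MSFT</td><td>300</td></tr></table>"
    "</body></html>"
)
url = "data:text/html;charset=utf-8," + quote(SAMPLE_HTML)

# "Open" the page and read its HTML as text.
with urlopen(url) as response:
    html = response.read().decode("utf-8")


class TableFinder(HTMLParser):
    """Collect cell text from the <table> whose id equals target_id.

    Deliberately simple: it does not handle nested tables.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_target = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.target_id:
            self.in_target = True
        elif self.in_target and tag == "tr":
            self.rows.append([])  # start a new row
        elif self.in_target and tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_target = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_target and self.in_cell and self.rows:
            self.rows[-1].append(data.strip())


finder = TableFinder("table3")
finder.feed(html)
print(finder.rows)
```

With a real site you would replace the `data:` URL with the page's address, and in practice a dedicated parsing package lets you express the same search as a one-line CSS selector or XPath query.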

4.1.3.1. My suggestion

This is subject to change, but I think you should pick ONE opening and ONE parsing module and stick with them for now. `requests_html` is a pretty good option that opens pages and can parse them, and it allows you to use `lxml` or `pyquery` within it.

You can change and try other stuff as you go, but get as familiar with one package as you can (in a cheap/efficient way).

Now to contradict myself: Some of the packages above can’t do things others can, or do them much slower, or the code is hard to write, read, and debug. Sometimes, you’re holding a hammer but you need a screwdriver. What I’m saying is, if another package can easily do the job, use it. (Just realize that learning a new package comes with a fixed cost, so be sure you need that screwdriver before grabbing it.)

LeDataSciFi-2022