{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping Data\n", "\n", "1. Skills we will develop\n", "2. Overview of different ways to get data \n", "3. Overview of Python packages we can use\n", "\n", "### What skills do I need to learn to be a master Hacker(wo)man?\n", "\n", "1. How to get the data: open/read a webpage, and pass specific queries to a server to control the content the server gives you\n", "2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs) \n", "3. How to do that for a large number of webpages (building a \"scraper\" or \"crawler\" or \"spider\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ways to get data from the web\n", "\n", "```{dropdown} 1: **Manually click and download.** \n", "\n", "The way you would have done it before this class.\n", "```\n", "\n", "```{dropdown} 2: **Let pandas download your data,** like pd.read_csv(url)\n", "\n", "Did you know? Pandas can often directly read tables on webpages! \n", "- Try `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')`\n", "- Very easy and fast! You don't even need to save the webpage to your hard drive.\n", "- Notes on `read_html`: \n", " 1. It can only handle basic HTML tables encoded directly in the page (e.g. nothing rendered by JavaScript) and **only grabs displayed text -- embedded URLs are lost.**\n", " 2. If the website changes the data, the next time you run it, you'll get the newer version of the data. 
(Unstable, potentially, but also updates automatically.)\n", "\n", "```\n", "\n", "```{dropdown} 3: **\"Install and play\" APIs,** like pandas_datareader \n", "\n", "API stands for **Application Programming Interface,** and it is a way for your computer to send a request (a query) to a server and get some response (hopefully useful data).\n", "\n", "Plug and play APIs let you interact with a website without specifying the exact API requests to send to the server.\n", "- The `pandas_datareader` plug in for Yahoo stock prices is one version of this. \n", "- `datadotworld` was another. \n", "- Kaggle and most of the [data sources listed on our resources page](https://ledatascifi.github.io/ledatascifi-2021/content/about/resources.html#resources-tutorials-and-data) have API packages for Python. \n", "- I upload your peer reviews and manage assignment permissions using `PyGithub` to interact with GH\n", "\n", "```\n", "\n", "```{tip} \n", "\n", "If you need <20ish tables (the threshold depends on your coding speed), download what you need manually.\n", "\n", "If you need more, it's time to scrape. \n", "\n", "**Options 1-3 are BY FAR the easiest.** If you want more than 10 tables or so (but the threshold depends on your coding speed), I'd abandon the manual option and go with `pandas` or a nice API package. \n", "\n", "Never ever try \\#4 or \\#5 without searching for \"\\ python api\" first. \n", "```\n", "\n", "```{dropdown} 4: **Manual API queries** for websites without \"install and play\" APIs\n", "\n", "Many sites have an API port of some kind serving up the data they show visitors.\n", "```\n", "\n", "````{dropdown} 5: **Scraping the data on the website by visiting each page and downloading the data needed**\n", "\n", "The last resort. You can't find the API serving the data, but your eyes see it. 
And you want it, cause websites contain a lot of data, like [GoT's IMDB page](https://www.imdb.com/title/tt0944947/?ref_=nv_sr_srsg_0).\n", "\n", "```{warning}\n", "This is an essential tool, but should be the last thing you try!\n", "```\n", "\n", "````\n", "\n", " \n", "```{note} Wisdom from [Greg Reda](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) about scraping data\n", " \n", "> 1. You should check a site's terms and conditions before you scrape them. It's their data and they likely have some rules to govern it.\n", "> 2. Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.\n", "> 3. Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.\n", "> 4. Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.\n", "\n", "```\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Useful packages, tricks, and tips\n", "\n", "Web scraping packages are always developing and evolving. \n", "\n", "| Task | Thoughts |\n", "| :--- | :--- |\n", "| To \"open\" a page | `urllib` or `requests`. `requests` is probably the best for sending API queries.

Warning: lots of walkthroughs online use `urllib2`, which worked for Python2 but not Python3. Use `urllib` instead, and you might have to include a few tweaks. For example, if you see `from urllib2 import urlopen` replace it with `from urllib.request import urlopen` |\n", "| To parse a page | `beautifulsoup`, `lxml`, or `pyquery` |\n", "| Combining opening/parsing | `requests_html` is a relatively new package and might be excellent. Its code is simply a combination of many of the above. |\n", "| Blocked because you look like a bot or need to accept cookies? | `selenium` is one way to \"impersonate\" a human, and also can help develop scraping macros, but you might not need it except on difficult scraping projects. It opens a literal browser window.

`requests_html` and `requests` can also store and use cookies. I'd recommend you try this before `selenium`. |\n", "| Blocked because you're sending requests too fast? | `from time import sleep` lets you pause your code between requests with `sleep(<# of seconds>)`. |\n", "| Wonder what your current HTML looks like? | `from IPython.display import HTML`, then `HTML(<your HTML string>)` will render the HTML you have inside the notebook.
E.g. if you're using `r = requests.get(url)`, then you can use `HTML(r.text)` to render the HTML that came back in the response. |\n", "| How do I find a particular \"piece\" of a webpage | E.g. Q: Where is that table?
A: Oh, it's inside the HTML element with the id \"table3\".

You can search for elements via attributes, CSS selectors, XPath, and text. This will make more sense soon.

