{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
 "# Scraping Data\n",
 "\n",
 "1. Skills we will develop\n",
 "2. Overview of different ways to get data\n",
 "3. Overview of Python packages we can use\n",
 "\n",
 "### What skills do I need to learn to be a master Hacker(wo)man?\n",
 "\n",
 "1. Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you\n",
 "2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs) \n",
 "3. Doing that for a large number of webpages (building a \"scraper\" or \"crawler\" or \"spider\")\n",
 "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### Ways to get data from the web\n",
 "\n",
 "```{dropdown} 1: **Manually click and download.** \n",
 "\n",
 "The way you would have done it before this class.\n",
 "```\n",
 "\n",
 "```{dropdown} 2: **Let pandas download your data,** like pd.read_csv(url)\n",
 "\n",
 "Did you know? Pandas can often directly read tables on webpages! \n",
 "- Try `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')` (there is a short sketch of this just below this box)\n",
 "- Very easy and fast! You don't even need to save the webpage to your hard drive.\n",
 "- Notes on `read_html`: \n",
 "    1. It can only handle basic HTML tables encoded directly in the page (no JavaScript-generated tables, for example) and **only grabs displayed text -- embedded URLs are lost.**\n",
 "    2. If the website changes the data, the next time you run it, you'll get the newer version of the data. (Unstable, potentially, but it also updates automatically.)\n",
 "\n",
 "```\n",
 "\n",
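 "To make option 2 concrete, here is a minimal sketch (the page is the one linked above; the variable names are just placeholders). `read_html` returns a *list* of DataFrames, one per table it finds:\n",
 "\n",
 "```python\n",
 "import pandas as pd\n",
 "\n",
 "# read_html returns a LIST of DataFrames, one per <table> it finds on the page\n",
 "tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')\n",
 "print(len(tables))       # how many tables did it find?\n",
 "sp500 = tables[0]        # the constituents list is currently the first table on this page\n",
 "sp500.head()\n",
 "```\n",
 "\n",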
 "```{dropdown} 3: **\"Install and play\" APIs,** like pandas_datareader \n",
 "\n",
 "API stands for **Application Programming Interface,** and it is a way for your computer to send a request (a query) to a server and get some response (hopefully useful data).\n",
 "\n",
 "These \"install and play\" APIs let you interact with a website without specifying the exact API requests to send to the server.\n",
 "- The `pandas_datareader` plug-in for Yahoo stock prices is one version of this. \n",
 "- `datadotworld` was another. \n",
 "- Kaggle and most of the [data sources listed on our resources page](https://ledatascifi.github.io/ledatascifi-2021/content/about/resources.html#resources-tutorials-and-data) have API packages for Python. \n",
 "- I upload your peer reviews and manage assignment permissions using `PyGithub` to interact with GitHub.\n",
 "\n",
 "A short sketch follows just below this box.\n",
 "\n",
 "```\n",
 "\n",
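 "A minimal sketch of option 3, assuming `pandas_datareader` is installed. The FRED source and the GDP series here are just an illustration -- swap in whichever source and series you actually need:\n",
 "\n",
 "```python\n",
 "import datetime\n",
 "\n",
 "import pandas_datareader.data as web\n",
 "\n",
 "start = datetime.datetime(2019, 1, 1)\n",
 "end = datetime.datetime(2020, 12, 31)\n",
 "\n",
 "# DataReader(series, source, start, end) sends the query and returns a DataFrame\n",
 "gdp = web.DataReader('GDP', 'fred', start, end)\n",
 "gdp.head()\n",
 "```\n",
 "\n",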

 "```{tip} \n",
 "\n",
 "**Options 1-3 are BY FAR the easiest.**\n",
 "\n",
 "If you need fewer than 20 tables or so (the exact threshold depends on your coding speed), just download what you need manually.\n",
 "\n",
 "If you need more than that, abandon the manual option and let `pandas` or a nice API package do the work, and only scrape (options 4-5) when those fail.\n",
 "\n",
 "Never ever try \\#4 or \\#5 without searching for \"\\<website name\\> python api\" first. \n",
 "```\n",
 "\n",
 "```{dropdown} 4: **Manual API queries** for websites without \"install and play\" APIs\n",
 "\n",
 "Many sites have an API endpoint of some kind serving up the data they show visitors. (There's a short sketch of this just below.)\n",
 "```\n",
 "\n",
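 "A minimal sketch of a manual API query using `requests`. GitHub's public repository endpoint is just a convenient illustration (no login needed for light use); the URL and the field names are specific to that API:\n",
 "\n",
 "```python\n",
 "import requests\n",
 "\n",
 "# ask GitHub's REST API about a public repo; the server replies with JSON, not a webpage\n",
 "url = 'https://api.github.com/repos/pandas-dev/pandas'\n",
 "r = requests.get(url)\n",
 "print(r.status_code)       # 200 means the request worked\n",
 "\n",
 "info = r.json()            # parse the JSON response into a python dict\n",
 "print(info['full_name'], info['stargazers_count'])\n",
 "```\n",
 "\n",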

 "````{dropdown} 5: **Scraping the data on the website by visiting each page and downloading the data needed**\n",
 "\n",
 "The last resort. You can't find the API serving the data, but your eyes see it. And you want it, because websites contain a lot of data, like [GoT's IMDB page](https://www.imdb.com/title/tt0944947/?ref_=nv_sr_srsg_0).\n",
 "\n",
 "```{warning}\n",
 "This is an essential tool, but should be the last thing you try!\n",
 "```\n",
 "\n",
 "````\n",
 "\n",
 "```{note} Wisdom from [Greg Reda](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) about scraping data\n",
 "\n",
 "> 1. You should check a site's terms and conditions before you scrape them. It's their data and they likely have some rules to govern it.\n",
 "> 2. Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.\n",
 "> 3. Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.\n",
 "> 4. Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.\n",
 "\n",
 "```\n",
 "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### Useful packages, tricks, and tips\n",
 "\n",
 "Web scraping packages are always developing and evolving. The table collects my current thoughts; a short example combining a few of these packages follows it.\n",
 "\n",
 "| Task | Thoughts |\n",
 "| :--- | :--- |\n",
 "| To \"open\" a page | `urllib` or `requests`. `requests` is probably the best for sending API queries. <br> <br> Warning: lots of walkthroughs online use `urllib2`, which worked for Python 2 but not Python 3. Use `urllib` instead, and you might have to include a few tweaks. For example, if you see `from urllib2 import urlopen`, replace it with `from urllib.request import urlopen`. |\n",
 "| To parse a page | `beautifulsoup`, `lxml`, or `pyquery` |\n",
 "| Combining opening/parsing | `requests_html` is a relatively new package and might be excellent. Its code is simply a combination of many of the above. |\n",
 "| Blocked because you look like a bot or need to accept cookies? | `selenium` is one way to \"impersonate\" a human, and it can also help develop scraping macros, but you might not need it except on difficult scraping projects. It opens a literal browser window. <br> <br> `requests_html` and `requests` can also store and use cookies. I'd recommend you try those before `selenium`. |\n",
 "| Blocked because you're sending requests too fast? | `from time import sleep` lets you pause your code with `sleep(<# of seconds>)` between requests. |\n",
 "| Wonder what your current HTML looks like? | `from IPython.display import HTML`, then `HTML()` will render whatever HTML string you give it. <br> <br> E.g., if you're using `r = requests.get(url)`, then `HTML(r.text)` will display the page you just downloaded. |\n",
 "| How do I find a particular \"piece\" of a webpage? | E.g., Q: Where is that table? <br> A: Oh, it's inside the HTML element whose id is \"table3\". <br> <br> You can search for elements via attributes, CSS selectors, XPath, and text. This will make more sense soon. <br> <br> To find that info: right click on an element you're interested in and click \"Inspect Element\". (F12 is the Windows shortcut.) |\n",
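 "\n",
 "To see how the \"open\" and \"parse\" pieces of the table fit together, here is a minimal sketch using `requests` and `beautifulsoup`. The URL and variable names are just placeholders:\n",
 "\n",
 "```python\n",
 "import requests\n",
 "from bs4 import BeautifulSoup\n",
 "from IPython.display import HTML\n",
 "\n",
 "url = 'https://en.wikipedia.org/wiki/Web_scraping'    # placeholder page\n",
 "r = requests.get(url)                                 # \"open\" the page\n",
 "print(r.status_code)                                  # 200 means the download worked\n",
 "# HTML(r.text)   # run this as the last line of a notebook cell to render the downloaded page\n",
 "\n",
 "soup = BeautifulSoup(r.text, 'html.parser')           # \"parse\" the page\n",
 "print(soup.title.text)                                # the page's <title>\n",
 "first_table = soup.find('table')                      # the first <table> element (None if there isn't one)\n",
 "links = [a.get('href') for a in soup.find_all('a')]   # every link on the page\n",
 "print(len(links))\n",
 "```\n",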

 "\n",
 "#### My suggestion \n",
 "\n",
 "This is subject to change, but **I think you should pick ONE opening and ONE parsing module and stick with it _for now_**. `requests_html` is a pretty good option that opens pages and can parse them, and it allows you to use `lxml` or `pyquery` within it. A short sketch of it follows.\n",
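 "\n",
 "The sketch below also uses `sleep` from the table above so you don't hammer the server. The URLs are placeholders -- swap in whatever pages you actually need:\n",
 "\n",
 "```python\n",
 "from time import sleep\n",
 "\n",
 "from requests_html import HTMLSession\n",
 "\n",
 "session = HTMLSession()\n",
 "\n",
 "urls = ['https://en.wikipedia.org/wiki/Web_scraping',     # placeholder pages\n",
 "        'https://en.wikipedia.org/wiki/Data_science']\n",
 "\n",
 "for url in urls:\n",
 "    r = session.get(url)                             # open the page\n",
 "    title = r.html.find('title', first=True).text    # find() takes CSS selectors\n",
 "    print(url, '->', title)\n",
 "    sleep(3)                                         # be nice: pause between requests\n",
 "```\n",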

 "\n",
 "You can change and try other stuff as you go, but get as familiar with one package as you can (in a cheap/efficient way).\n",
 "\n",
 "Now to contradict myself: Some of the packages above can't do things others can, or do them much slower, or the code is hard to write, read, and debug. Sometimes, you're holding a hammer but you need a screwdriver. What I'm saying is, if another package can easily do the job, use it. (Just realize that learning a new package comes with a fixed cost, so be sure you need that screwdriver before grabbing it.)\n",
 "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }