{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Opening + Parsing _a_ Webpage\n", "\n", "This page is about the first two of the three skills needed to become a master Hacker(wo)man:\n", "\n", "1. Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you\n", "2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, or URLs)\n", "3. Doing that for a large number of webpages (building a \"scraper\", \"crawler\", or \"spider\")\n", "\n", "So let's start building up those skills through demos and practice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skill 1: Retrieving URLs and sending manual API queries\n", "\n", "```{note} Notice how, when you google something, the URL of the resulting search contains the information about the search?\n", "\n", "For example: https://www.google.com/search?client=firefox-b-1-d&q=gme+stock_price\n", "\n", "We can leverage websites that work this way to programmatically run searches and collect data!\n", "\n", "```\n", "\n", "### Walkthrough\n", "\n", "[Go through the Neal Caren \"First API\" example](https://nbviewer.jupyter.org/github/nealcaren/ScrapingData/blob/master/Notebooks/5_APIs.ipynb)\n", "- **JSON is just a data storage structure, and in Python it's a dictionary.**\n", "- In this example, we make the search give us JSON because it is sometimes easier to extract data from JSON than from HTML.\n", "- HINTS:\n", "    - Dictionaries: You can see what's in a dictionary with `dict.keys()`, and then `dict['key']` will show you the value/object belonging to that key\n", "    - `pd.DataFrame()` can convert a dictionary to a data frame!\n", "\n", "```{margin}\n", "_For your reference: Sometimes a webpage is \"hiding\" the way to run queries like this API. You run a search and the URL doesn't look obviously like a search. But often, inside that page is a \"backdoor\" to an API you can search just like the above example. [This](https://nbviewer.jupyter.org/github/nealcaren/ScrapingData/blob/master/Notebooks/Bonus_Undocument_APIs.ipynb) tutorial shows one example of this and, more importantly, how the author found the API._\n", "```\n"
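, "\n", "\n", "To make this concrete, here is a minimal sketch of the open-a-URL, get-JSON, make-a-data-frame workflow. It uses Apple's public iTunes Search API purely as an illustration (it returns JSON and requires no key); the `term` and `limit` parameters and the `results` key are specific to that API, so adjust them for whichever API you query.\n", "\n", "```python\n", "import pandas as pd\n", "import requests\n", "\n", "# The query lives in the URL, just like a google search;\n", "# requests builds '?term=wilco&limit=5' from the params dict\n", "url = 'https://itunes.apple.com/search'\n", "r = requests.get(url, params={'term': 'wilco', 'limit': 5})\n", "\n", "data = r.json()                      # JSON -> python dictionary\n", "print(data.keys())                   # dict_keys(['resultCount', 'results'])\n", "\n", "df = pd.DataFrame(data['results'])   # list of dicts -> data frame\n", "```\n"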
, "\n", "### Practice: Exchange rates\n", "\n", "- Start with `url = 'https://api.exchangeratesapi.io/latest?base=NOK'`\n", "- Q1: What is the average exchange rate value this search returns?\n", "- [The documentation](https://exchangeratesapi.io/) for this database (how to search it, change parameters, etc.)\n", "- Q2: Change the base to euros, then tell me how many Japanese yen are in a euro.\n", "- Q3: What was the number of yen per euro on Jan 2, 2018?\n", "- Q4: Bonus: Get a time series of EURJPY from 2018 through 2019.\n", "- Q5: Bonus: Rewrite our code for Q1-Q4 using `requests_html` to the extent possible. If and when you succeed, email your solution to me!"
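, "\n", "\n", "As a starting point for Q1, here is a minimal sketch. It assumes the endpoint returns JSON with a `rates` key mapping currency codes to values, which is the shape described in the documentation linked above; check there if the API has changed.\n", "\n", "```python\n", "import requests\n", "\n", "url = 'https://api.exchangeratesapi.io/latest?base=NOK'\n", "data = requests.get(url).json()    # JSON -> dictionary\n", "\n", "rates = data['rates']              # e.g. {'USD': ..., 'JPY': ..., ...}\n", "avg_rate = sum(rates.values()) / len(rates)\n", "print(avg_rate)\n", "```\n"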
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skill 2: How to parse a (single) page\n", "\n", "You might parse a page to\n", "- isolate the data/text/URLs you need\n", "- find a link to collect (or click!)\n", "\n", "Data on a page is typically HTML, XML, or JSON.\n", "1. **J**ava**S**cript **O**bject **N**otation (JSON)\n", "1. e**X**tensible **M**arkup **L**anguage (XML)\n", "1. HTML - for [example](view-source:https://en.m.wikipedia.org/wiki/List_of_S%26P_500_companies).\n", "\n", "You can right click on a page and \"view source\" to see the underlying HTML, XML, or JSON.\n", "\n", "- Go to the [S&P500 wiki page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)\n", "- Right click and view the source. See the structure? That's HTML!\n", "\n", "From [Wikipedia](https://en.wikipedia.org/wiki/HTML):\n", "> HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references. HTML tags most commonly come in pairs like `<h1>` and `</h1>`, although some represent empty elements and so are unpaired, for example `<img>`. The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags).\n", ">\n", "> Another important component is the HTML document type declaration, which triggers standards mode rendering.\n", ">\n", "> The following is an example of the classic \"Hello, World!\" program:\n", ">\n", "> ```HTML\n", "> <!DOCTYPE html>\n", "> <html>\n", ">   <head>\n", ">     <title>This is a title</title>\n", ">   </head>\n", ">   <body>\n", ">     <div>\n", ">       <p>Hello world!</p>\n", ">     </div>\n", ">   </body>\n", "> </html>\n", "> ```\n", "\n", "Elements can nest inside elements (a paragraph within the body of the page, the title inside the header, the rows inside the table with columns inside them)."
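, "\n", "\n", "To see parsing in action, here is a minimal sketch that selects elements from the \"Hello, World!\" document above, assuming you have `requests_html` installed. The same `.find()` calls work on real pages fetched from the web.\n", "\n", "```python\n", "from requests_html import HTML\n", "\n", "doc = '''\n", "<!DOCTYPE html>\n", "<html>\n", "  <head><title>This is a title</title></head>\n", "  <body><div><p>Hello world!</p></div></body>\n", "</html>\n", "'''\n", "\n", "page = HTML(html=doc)                         # parse a string of HTML\n", "print(page.find('title', first=True).text)    # This is a title\n", "print(page.find('div p', first=True).text)    # Hello world!\n", "```\n"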
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Practice: Picking elements of a page\n", "\n", "[Let's play a game](https://flukeout.github.io/) about finding and selecting elements using CSS selectors.\n", "\n", "```{note}\n", "This will help on the next practice problem!\n", "```\n"
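, "\n", "A quick reference for how the game's selector patterns map to code. The snippet of HTML below is made up for illustration:\n", "\n", "```python\n", "from requests_html import HTML\n", "\n", "snippet = '<div id=\"main\"><table class=\"wikitable\"><tr><td><a href=\"/wiki/Apple\">Apple</a></td></tr></table></div>'\n", "page = HTML(html=snippet)\n", "\n", "page.find('table')        # by tag name\n", "page.find('.wikitable')   # by class\n", "page.find('#main')        # by id\n", "page.find('table a')      # descendant: links inside tables\n", "```\n"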
, "\n", "### Practice: Get a list of wiki pages for S&P500 firms\n", "\n", "- Revisit the S&P 500 wiki: `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')`\n", "- Let's try to grab all the URLs. Notice that the `pd.read_html` approach can't do this.\n", "- Use your browser's Inspector to find info you can use to select the table, and then we will try to extract the URLs (a starting sketch follows below).\n", "\n", "```{tip}\n", "This practice will help on the homework!\n", "```\n"
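, "\n", "If you get stuck, here is one possible starting point. It assumes the firm list is the first `wikitable` on the page; verify that with the Inspector, since the selector will need adjusting if the page layout changes.\n", "\n", "```python\n", "from requests_html import HTMLSession\n", "\n", "session = HTMLSession()\n", "r = session.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')\n", "\n", "table = r.html.find('table.wikitable', first=True)          # select the table\n", "urls = [u for u in table.absolute_links if '/wiki/' in u]   # keep wiki links\n", "print(len(urls), sorted(urls)[:3])\n", "```\n"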
, "\n", "### Extra resources\n", "\n", "I really, really like the tutorials Greg Reda wrote ([here](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) and [here](http://www.gregreda.com/2013/04/29/more-web-scraping-with-python/)). Note that they use `urllib2`, a Python 2 library that became `urllib.request` in Python 3, but otherwise the code works. Greg uses `urllib2` to open pages and `beautifulsoup` to parse them, but if you want to, you should be able to rewrite his code using `requests_html` pretty easily. When you succeed, please email me!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }