{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Opening + Parsing _a_ Webpage\n", "\n", "This page covers the first two of the three skills needed to become a master Hacker(wo)man:\n", "\n", "1. Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you\n", "2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs) \n", "3. Doing that for a large number of webpages (building a \"scraper\" or \"crawler\" or \"spider\")\n", "\n", "So let's start building up those skills through demos and practice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skill 1: Retrieving URLs and sending manual API queries\n", "\n", "```{note} Have you ever noticed how, when you google something, the URL of the resulting search has the information about the search?\n", "\n", "For example, suppose I Google \"gme stock price\" (without the quotes). The resulting URL might be: https://www.google.com/search?client=firefox-b-1-d&q=gme+stock+price\n", "\n", "This URL has several parts: \n", "1. The stem, ending in a question mark: \"https://www.google.com/search?\"\n", "2. Everything after the stem is the parameters of the search\n", "3. A search can contain many parameters, and they will be separated by \"&\"\n", "4. Each parameter is structured like: `name=value`\n", "5. Parameter 1: My browser client is a version of Firefox\n", "6. Parameter 2: My query (\"q\") is gme+stock+price (plus signs because spaces aren't allowed in URLs)\n", "\n", "We can leverage websites that do this - putting the search query in the URL - to programmatically run searches and collect data!\n", "\n", "```\n", "\n", "### Walkthrough\n", "\n", "Let's go through the [Neal Caren \"First API\" example](https://nbviewer.jupyter.org/github/nealcaren/ScrapingData/blob/master/Notebooks/5_APIs.ipynb). Some notes as we start:\n", "- **JSON is just a data storage format, and once loaded into Python it behaves like a dictionary.** \n", "- In this example, we make the search give us JSON because it can sometimes be easier to extract data from JSON than from HTML.\n", "- HINTS: \n", " - Dictionaries: You look for what's in a dictionary with `dict.keys()` and then `dict['key']` will show you the value/object belonging to the key\n", " - `pd.DataFrame()` can convert a dictionary to a data frame!\n", "\n", "### Practice: Exchange rates\n", "\n", "- Start with `url = 'https://api.exchangeratesapi.io/latest?base=NOK'`\n", "- Q1: What is the average exchange rate value this search returns?\n", "- [The documentation](https://exchangeratesapi.io/) for this database (how to search it, change parameters, etc.)\n", "- Q2: Change the base to Euros, then tell me how many Japanese Yen are in one Euro.\n", "- Q3: What was the number of Yen per Euro on Jan 2, 2018?\n", "- Q4: Bonus: Get a time series of EURJPY from 2018 through 2019.\n", "- Q5: Bonus: Rewrite our code for Q1-Q4 using `requests_html` to the extent possible. If and when you succeed, email your solution to me!\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skill 2: How to parse a (single) page\n", "\n", "You might parse a page to\n", "- isolate the data/text/urls you need\n", "- find a link to collect (or click!)\n", "\n", "Data on a page is typically HTML, XML, or JSON. \n", "1. **J**ava**S**cript **O**bject **N**otation (JSON)\n", "1. e**X**tensible **M**arkup **L**anguage (XML)\n", "1. HTML - for [example](view-source:https://en.m.wikipedia.org/wiki/List_of_S%26P_500_companies). \n", "\n", "You can right click on a page and \"view source\" to see the underlying HTML, XML, or JSON. \n", "- Go to the [S&P500 wiki page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)\n", "- Right click and view the source. 
See the structure? That's HTML!\n", "\n", "From [Wikipedia](https://en.wikipedia.org/wiki/HTML):\n", "> HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references. HTML tags most commonly come in pairs like `<h1>` and `</h1>`, although some represent empty elements and so are unpaired, for example `<img>`. The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags).\n", ">\n", "> Another important component is the HTML document type declaration, which triggers standards mode rendering.\n", ">\n", "> The following is an example of the classic \"Hello, World!\" program: \n", ">\n", "> ```HTML\n", "> <!DOCTYPE html>\n", "> <html>\n", ">   <head>\n", ">     <title>This is a title</title>\n", ">   </head>\n", ">   <body>\n", ">     <p>Hello world!</p>\n", ">   </body>\n", "> </html>\n", "> ```\n", " \n", "Elements can nest inside elements (a paragraph within the body of the page, the title inside the head, the rows inside the table with columns inside them). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Practice: Picking elements of a page \n", "\n", "[Let's play a game](https://flukeout.github.io/) about finding and selecting elements using HTML tags.\n", "\n", "```{note}\n", "This will help on the next practice problem!\n", "```\n", "\n", "### Practice: Get a list of wiki pages for S&P500 firms\n", "- Revisit the S&P 500 wiki: `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')`\n", "- Let's try to grab all the URLs. Notice the `pd.read_html` approach can't do this.\n", "- Use your browser's Inspector to find info you can use to select the table, and then we will try to extract the URLs.\n", "\n", "```{tip}\n", "This practice will help on the homework!\n", "```\n", "\n", "### Extra resources\n", "\n", "I really, really like the tutorials Greg Reda wrote ([here](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) and [here](http://www.gregreda.com/2013/04/29/more-web-scraping-with-python/)). One caveat: `urllib2` exists only in Python 2 (in Python 3, its functionality lives in `urllib.request`), but otherwise this code works. Greg uses `urllib` to open and `beautifulsoup` to parse, but if you want to, you should be able to rewrite his code using `requests_html` pretty easily. When you succeed, please email me! " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }