4.2. Opening + Parsing a Webpage

This page covers the first two of the three skills needed to become a master Hacker(wo)man:

  1. Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you

  2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs)

  3. Doing that for a large number of webpages (building a “scraper” or “crawler” or “spider”)

So let’s start building up those skills through demos and practice!

4.2.1. Skill 1: Retrieving URLs and sending manual API queries

Note

Notice how, when you google something, the URL of the resulting search has the information about the search?

For example: https://www.google.com/search?client=firefox-b-1-d&q=gme+stock_price

We can leverage websites that do that to programmatically run searches and collect data!
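To make that concrete, here is a minimal sketch of building (and taking apart) a search URL using only Python's standard library. The `q` parameter comes from the example URL above; everything else is just one way to do it, not the only way.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the search page (taken from the example URL above).
base_url = "https://www.google.com/search"

# The search terms live in the "q" parameter of the query string.
params = {"q": "gme stock_price"}

# urlencode turns the dict into "q=gme+stock_price" (spaces become "+").
search_url = base_url + "?" + urlencode(params)
print(search_url)

# Going the other way: pull the parameters back out of an existing URL.
example = "https://www.google.com/search?client=firefox-b-1-d&q=gme+stock_price"
query = parse_qs(urlsplit(example).query)
print(query["q"])  # parse_qs returns a list of values for each key
```

Changing the values in `params` changes the search, which is the whole trick behind sending "manual" queries to a server.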

4.2.1.1. Walkthrough

Go through the Neal Caren “First API” example

  • JSON is just a text format for storing data; once Python parses it, a JSON object becomes a dictionary.

  • In this example, we ask the search to return JSON because extracting data from JSON is often easier than extracting it from HTML.

  • HINTS:

    • Dictionaries: dict.keys() lists what’s in a dictionary, and dict['key'] returns the value/object belonging to that key

    • pd.DataFrame() can convert a dictionary to a data frame!
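Here is a small sketch of those hints in action. The JSON text below is made up purely for illustration (the keys, currency codes, and numbers are hypothetical, not real API output), so treat it as a stand-in for whatever a server actually sends back.

```python
import json
from statistics import mean

# Made-up response text, for illustration only (not real exchange rates).
raw_text = (
    '{"base": "XXX", "date": "2000-01-01", '
    '"rates": {"AAA": 1.5, "BBB": 2.5, "CCC": 0.5}}'
)

# json.loads parses JSON text into Python objects; a JSON object becomes a dict.
data = json.loads(raw_text)

print(type(data))         # a dict
print(list(data.keys()))  # the top-level keys

# dict['key'] returns the value stored under that key; here it is another dict.
rates = data["rates"]
print(rates["BBB"])

# Once the values are in a dict, ordinary Python tools work on them.
print(mean(rates.values()))
```

If you have pandas loaded, something like `pd.DataFrame(list(rates.items()), columns=["currency", "rate"])` should turn the inner dictionary into a two-column data frame.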

4.2.1.2. Practice: Exchange rates

  • Start with url = 'https://api.exchangeratesapi.io/latest?base=NOK'

  • Q1: What is the average exchange rate value this search returns?

  • The documentation for this database (how to search it, change parameters, etc.)

  • Q2: Change the base to euros, then tell me how many Japanese yen are in one euro.

  • Q3: What was the number of Yen per Euro on Jan 2, 2018?

  • Q4: Bonus: Get a time series of EURJPY from 2018 through 2019.

  • Q5: Bonus: Rewrite our code for Q1-Q4 using requests_html to the extent possible. If and when you succeed, email your solution to me!

4.2.2. Skill 2: How to parse a (single) page

You might parse a page to

  • isolate the data/text/urls you need

  • find a link to collect (or click!)

The data on a page is typically in one of three formats:

  1. JavaScript Object Notation (JSON)

  2. eXtensible Markup Language (XML)

  3. HyperText Markup Language (HTML)

You can right-click on a page and choose “view source” to see the underlying HTML, XML, or JSON.

  • Go to the S&P500 wiki page

  • Right-click and view the source. See the structure? That’s HTML!

From Wikipedia:

HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references. HTML tags most commonly come in pairs like <h1> and </h1>, although some represent empty elements and so are unpaired, for example <img>. The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags).

Another important component is the HTML document type declaration, which triggers standards mode rendering.

The following is an example of the classic “Hello, World!” program:

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

Elements can nest inside elements (a paragraph within the body of the page, the title inside the head, the rows inside a table with cells inside them).
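To see that nesting for yourself, here is a minimal sketch that walks the “Hello, World!” page and prints each tag indented by how deeply it is nested. It uses Python's built-in `html.parser` module; the class name `OutlineBuilder` is made up for this example.

```python
from html.parser import HTMLParser

page = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>"""


class OutlineBuilder(HTMLParser):
    """Record each start tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Simple bookkeeping: assumes every tag is paired, as in this page.
        # Unpaired tags such as <img> would need extra handling.
        self.depth -= 1


builder = OutlineBuilder()
builder.feed(page)

for depth, tag in builder.outline:
    print("  " * depth + tag)
```

The indentation in the printout mirrors the indentation in the HTML source: `title` sits inside `head`, and `p` sits inside `body`.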

4.2.2.1. Practice: Picking elements of a page

Let’s play a game about finding and selecting elements using the html tags.

Note

This will help on the next practice problem!

4.2.2.2. Practice: Get a list of wiki pages for S&P500 firms

  • Revisit the S&P 500 wiki: pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

  • Let’s try to grab all the URLs. Notice that pd.read_html, as called above, returns only the text in each cell, not the links behind it.

  • Use your browser’s Inspector to find info you can use to select the table, and then we will try to extract the URLs.
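If you want a nudge, here is one possible sketch of the idea on a tiny made-up table. Everything in the snippet is hypothetical (the `id="firms"`, the firm names, and the links), and it uses Python's built-in `html.parser` rather than requests_html so that it runs without installing anything. On the real page, use the Inspector to find what actually identifies the table.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up snippet for illustration; the id, names, and links are hypothetical.
snippet = """
<p><a href="/wiki/Not_in_table">A link outside the table</a></p>
<table id="firms">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>AAA</td><td><a href="/wiki/Alpha_Co">Alpha Co</a></td></tr>
  <tr><td>BBB</td><td><a href="/wiki/Beta_Inc">Beta Inc</a></td></tr>
</table>
"""


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that sit inside one chosen table."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_table = True
        elif tag == "a" and self.in_table and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes the chosen table has no tables nested inside it.
        if tag == "table":
            self.in_table = False


collector = TableLinkCollector("firms")
collector.feed(snippet)

# Relative links need the site address in front to become full URLs.
full_urls = [urljoin("https://en.wikipedia.org", link) for link in collector.links]
print(full_urls)
```

The link outside the table is skipped because the collector only records `<a>` tags seen after the chosen `<table>` opens and before it closes. The same select-the-table-then-grab-the-links logic carries over to whichever parsing library you use.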

Tip

This practice will help on the homework!

4.2.2.3. Extra resources

I really, really like the tutorials Greg Reda wrote (here and here). See the caveat about urllib2 above, but otherwise this code works. Greg uses urllib to open pages and BeautifulSoup to parse them, but you should be able to rewrite his code using requests_html pretty easily. When you succeed, please email me!