4.2. Opening + Parsing a Webpage

This page covers the first two of the three skills needed to become a master Hacker(wo)man:

  1. Get the data: open/read a webpage, and pass specific queries to a server to control the content the server gives you

  2. Parse a (single) page to find specific elements of interest (like tables, specific text, URLs)

  3. Do that for a large number of webpages (building a “scraper” or “crawler” or “spider”)

So let’s start building up those skills through demos and practice!

4.2.1. Skill 1: Retrieving URLs and sending manual API queries

Note

Have you ever noticed how, when you google something, the URL of the resulting search has the information about the search?

For example, suppose I Google “gme stock price” (without the quotes). The resulting URL might be: https://www.google.com/search?client=firefox-b-1-d&q=gme+stock+price

This URL has several parts:

  1. The stem, ending in a question mark: “https://www.google.com/search?”

  2. Everything after the stem makes up the parameters of the search

  3. A search can contain many parameters, and they will be separated by “&”

  4. Each parameter is structured like: <parameter name>=<parameter value>

  5. Parameter 1: My browser client is a version of Firefox

  6. Parameter 2: My query (“q”) is gme+stock+price (plus signs because spaces aren’t allowed in URLs)
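To make those parts concrete, here is a minimal sketch that splits the example URL using Python's standard urllib.parse module. The variable names (stem, params) are just illustrative choices.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.google.com/search?client=firefox-b-1-d&q=gme+stock+price"

parts = urlsplit(url)

# Rebuild the "stem": everything up to and including the question mark.
stem = f"{parts.scheme}://{parts.netloc}{parts.path}?"

# parse_qs splits the query on "&" and "=", and turns "+" back into spaces.
# Each value is a list because a parameter name can appear more than once.
params = parse_qs(parts.query)

print(stem)
for name, values in params.items():
    print(name, "=", values)
```
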

We can leverage websites that do this - putting the search query in the URL - to programmatically run searches and collect data!
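Going the other direction, a search URL can be assembled from a stem and a dictionary of parameters. This is a small sketch using the standard library; build_search_url is a hypothetical helper name, not part of any package.

```python
from urllib.parse import urlencode

def build_search_url(stem, params):
    """Append URL-encoded parameters to a stem that ends in '?'."""
    # urlencode joins parameters with "&" and replaces spaces with "+".
    return stem + urlencode(params)

search_url = build_search_url(
    "https://www.google.com/search?",
    {"q": "gme stock price"},
)
print(search_url)
```

Changing the value of "q" in the dictionary is all it takes to generate the URL for a different search.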

4.2.1.1. Walkthrough

Let’s go through the Neal Caren “First API” example. Some notes as we start:

  • JSON is just a text format for storing data; once parsed in Python, it typically becomes a dictionary (possibly containing lists and other dictionaries).

  • In this example, we make the search give us JSON because extracting data from JSON is often easier than extracting it from HTML.

  • HINTS:

    • Dictionaries: You look for what’s in a dictionary with dict.keys() and then dict['key'] will show you the value/object belonging to the key

    • pd.DataFrame() can convert a dictionary to a data frame!
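The dictionary hints can be tried out on a small, made-up JSON string. The payload below is hypothetical (it is not the output of the walkthrough's API); it only illustrates the keys-then-values pattern.

```python
import json

# Hypothetical JSON text, standing in for what a server might send back.
raw = """
{"query": "coffee",
 "results": [{"name": "Shop A", "rating": 4.5},
             {"name": "Shop B", "rating": 3.9}]}
"""

data = json.loads(raw)       # JSON text -> Python dictionary

print(data.keys())           # step 1: see which keys exist
records = data["results"]    # step 2: pull out the object stored under a key

# Here the value is a list of dictionaries, one per result.
names = [record["name"] for record in records]
print(names)
```

A list of dictionaries like records is also the kind of object that pd.DataFrame() can turn into a table, with one row per dictionary.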

4.2.1.2. Practice: Exchange rates

  • Start with url = 'https://api.exchangeratesapi.io/latest?base=NOK'

  • Q1: What is the average exchange rate value this search returns?

  • See the documentation for this database (how to search it, change parameters, etc.)

  • Q2: Change the base to euros, then tell me how many Japanese yen are in one euro.

  • Q3: What was the number of Yen per Euro on Jan 2, 2018?

  • Q4: Bonus: Get a time series of EURJPY from 2018 through 2019.

  • Q5: Bonus: Rewrite our code for Q1-Q4 using requests_html to the extent possible. If and when you succeed, email your solution to me!
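As a possible starting point (not a solution), the sketch below builds the practice URL from its parts and defines a helper that would download and parse the JSON. build_url and fetch_json are hypothetical helper names, and the sketch assumes the service still answers at the address given above; it may have changed or may now require extra parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.exchangeratesapi.io/"  # taken from the practice URL

def build_url(endpoint, **params):
    """Join the API root, an endpoint name, and query parameters."""
    query = urlencode(params)
    return f"{API_ROOT}{endpoint}?{query}" if query else f"{API_ROOT}{endpoint}"

def fetch_json(url, timeout=10):
    """Open the URL and parse the response body as JSON (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

url = build_url("latest", base="NOK")
print(url)
# data = fetch_json(url)   # uncomment to actually send the request
# print(data.keys())       # then explore the dictionary that comes back
```
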

4.2.2. Skill 2: How to parse a (single) page

You might parse a page to

  • isolate the data/text/urls you need

  • find a link to collect (or click!)

Data on a page is typically in one of three formats:

  1. JavaScript Object Notation (JSON)

  2. eXtensible Markup Language (XML)

  3. HyperText Markup Language (HTML)

You can right-click on a page and “view source” to see the underlying HTML, XML, or JSON.

  • Go to the S&P500 wiki page

  • Right-click and view the source. See the structure? That’s HTML!

From Wikipedia:

HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references. HTML tags most commonly come in pairs like <h1> and </h1>, although some represent empty elements and so are unpaired, for example <img>. The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags).

Another important component is the HTML document type declaration, which triggers standards mode rendering.

The following is an example of the classic “Hello, World!” program:

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

Elements can nest inside elements (a paragraph within the body of the page, the title inside the head, the rows inside the table with columns inside them).
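To see that nesting programmatically, here is a minimal sketch that walks the “Hello, World!” page with Python's built-in html.parser module and records the chain of open tags around each piece of text. OutlineParser is a hypothetical class written for this illustration; packages such as BeautifulSoup or requests_html offer more convenient tools for real pages.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>"""

class OutlineParser(HTMLParser):
    """Record each piece of text together with the tags it is nested in."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open, outermost first
        self.found = []   # (path of tags, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the whitespace between tags
            self.found.append(("/".join(self.stack), text))

parser = OutlineParser()
parser.feed(PAGE)
for path, text in parser.found:
    print(path, "->", text)
```
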

4.2.2.1. Practice: Picking elements of a page

Let’s play a game about finding and selecting elements using HTML tags.

Note

This will help on the next practice problem!

4.2.2.2. Practice: Get a list of wiki pages for S&P500 firms

  • Revisit the S&P 500 wiki: pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

  • Let’s try to grab all the URLs. Notice that pd.read_html, as called above, returns only the text in each cell, not the links.

  • Use your browser’s Inspector to find info you can use to select the table, and then we will try to extract the URLs.
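The general idea (select one table, then collect the links inside it) can be sketched on a toy page with the standard library alone. Everything in SAMPLE is made up, including the table id "firms" and the firm names; on the real page you would use whatever identifying attribute the Inspector shows you. TableLinkParser is a hypothetical class for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up HTML: one link outside the table, two links inside it.
SAMPLE = """
<p><a href="/wiki/Not_in_table">outside link</a></p>
<table id="firms">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>AAA</td><td><a href="/wiki/Alpha_Co">Alpha Co</a></td></tr>
  <tr><td>BBB</td><td><a href="/wiki/Beta_Inc">Beta Inc</a></td></tr>
</table>
"""

class TableLinkParser(HTMLParser):
    """Collect the href of every <a> inside the table with a given id."""

    def __init__(self, table_id, base_url):
        super().__init__()
        self.table_id = table_id
        self.base_url = base_url
        self.depth = 0    # > 0 while we are inside the target table
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self.depth > 0 or attrs.get("id") == self.table_id:
                self.depth += 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            # Turn relative links like /wiki/... into full URLs.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

parser = TableLinkParser("firms", "https://en.wikipedia.org/")
parser.feed(SAMPLE)
print(parser.links)
```
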

Tip

This practice will help on the homework!

4.2.2.3. Extra resources

I really, really like the tutorials Greg Reda wrote (here and here). See the caveat about urllib2 above, but otherwise this code works. Greg uses urllib to open pages and BeautifulSoup to parse them, but if you want to, you should be able to rewrite his code using requests_html pretty easily. When you succeed, please email me!