4.2. Opening + Parsing a Webpage
This page covers the first two of the three skills needed to become a master Hacker(wo)man:
Get the data: How to open/read a webpage, and pass specific queries to a server to control the content the server gives you
How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs)
Doing that for a large number of webpages (building a “scraper” or “crawler” or “spider”)
So let’s start building up those skills through demos and practice!
4.2.1. Skill 1: Retrieving URLs and sending manual API queries
Have you ever noticed how, when you google something, the URL of the resulting search has the information about the search?
For example, suppose I Google “gme stock price” (without the quotes). The resulting URL might be: https://www.google.com/search?client=firefox-b-1-d&q=gme+stock+price
This URL has several parts:
The stem, ending in a question mark: “https://www.google.com/search?”
Everything after the stem is the parameters of the search
A search can contain many parameters, and they will be separated by “&”
Each parameter is structured like: <parameter name>=<parameter value>
Parameter 1: My browser client is a version of firefox
Parameter 2: My query (“q”) is gme+stock+price (plus signs because spaces aren’t allowed in URLs)
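As a minimal sketch of the breakdown above, Python's standard library can split that same URL into its stem and its parameters. The variable names here are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.google.com/search?client=firefox-b-1-d&q=gme+stock+price"

parts = urlsplit(url)

# Rebuild the "stem": everything up to and including the question mark
stem = f"{parts.scheme}://{parts.netloc}{parts.path}?"

# parse_qs splits on "&" and "=", and decodes "+" back into a space.
# Each value is a list, because a parameter name may appear more than once.
params = parse_qs(parts.query)

print(stem)
print(params)
```

Note that the query comes back as "gme stock price": the plus signs are only how the spaces were encoded in the URL.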
We can leverage websites that do this - putting the search query in the URL - to programmatically run searches and collect data!
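Going the other direction, here is a small sketch of building a search URL from a stem and a dictionary of parameters. The helper name `build_search_url` is hypothetical (it is not part of any library); `urlencode` is the standard-library function doing the work.

```python
from urllib.parse import urlencode

def build_search_url(stem, params):
    """Join a stem (ending in '?') with URL-encoded parameters."""
    # urlencode joins parameters with "&" and encodes spaces as "+"
    return stem + urlencode(params)

search_url = build_search_url(
    "https://www.google.com/search?",
    {"q": "gme stock price"},
)
print(search_url)
```

Changing the dictionary (say, looping over a list of tickers) gives you a new URL for each search without typing anything into a browser.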
Let’s go through the Neal Caren “First API” example. Some notes as we start:
JSON is a text format for storing structured data; once parsed in Python, a JSON object becomes a dictionary.
In this example, we ask the search to return JSON because extracting data from JSON is sometimes easier than extracting it from HTML.
Dictionaries: You look up what’s in a dictionary by key.
dict['key'] will show you the value/object belonging to the key
pd.DataFrame() can convert a dictionary to a data frame!
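To make the JSON-to-dictionary idea concrete, here is a minimal sketch using only the standard library. The JSON string and every number in it are made up for illustration; they are not real exchange rates or a real server response.

```python
import json

# A made-up JSON response (illustrative values only)
raw = '{"base": "NOK", "date": "2021-01-04", "rates": {"EUR": 0.1, "USD": 0.12, "JPY": 12.0}}'

data = json.loads(raw)      # JSON text -> Python dictionary
rates = data["rates"]       # dict['key'] returns the value for that key
jpy = rates["JPY"]          # nested dictionaries are indexed the same way

# Reshape into one record per currency, a layout that tabular tools handle well
rows = [{"currency": name, "rate": value} for name, value in rates.items()]

print(data.keys())
print(jpy)
print(rows)
```

If pandas is installed, passing `rows` to `pd.DataFrame()` should give one row per currency.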
4.2.1.1. Practice: Exchange rates
url = 'https://api.exchangeratesapi.io/latest?base=NOK'
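One possible way to open a URL like this and get a dictionary back is sketched below. It uses the standard library's `urllib` rather than any particular package from the class demo, the helper name `fetch_json` is hypothetical, and it assumes the server answers with JSON. The API may also have changed since this page was written (for instance, it may now require an access key), so treat this as a starting point rather than a guaranteed-working call.

```python
import json
from urllib.request import urlopen

def fetch_json(url, timeout=10):
    """Open `url` and parse the response body as JSON (assumes a JSON reply)."""
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return json.loads(text)

# Example usage (needs network access, so it is left commented out):
# url = 'https://api.exchangeratesapi.io/latest?base=NOK'
# data = fetch_json(url)
# print(data.keys())
```

Once you have the dictionary, the practice questions become a matter of changing the parameters in the URL and indexing into the result.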
Q1: What is the average exchange rate value this search returns?
The documentation for this database (how to search it, change parameters, etc.)
Q2: Change the base to Euros, then tell me how many Japanese Yen is in a euro.
Q3: What was the number of Yen per Euro on Jan 2, 2018?
Q4: Bonus: Get a time series of EURJPY from 2018 through 2019.
Q5: Bonus: Rewrite our code for Q1-Q4 using requests_html to the extent possible. If and when you succeed, email your solution to me!
4.2.2. Skill 2: How to parse a (single) page
You might parse a page to
isolate the data/text/urls you need
find a link to collect (or click!)
The data on a page is typically HTML, XML, or JSON.
eXtensible Markup Language (XML)
HTML - for example.
You can right-click on a page and “view source” to see the underlying HTML, XML, or JSON.
Go to the S&P500 wiki page
Right-click and view the source. See the structure? That’s HTML!
HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references. HTML tags most commonly come in pairs like <h1> and </h1>, although some represent empty elements and so are unpaired, for example <img>. The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags).
Another important component is the HTML document type declaration, which triggers standards mode rendering.
The following is an example of the classic “Hello, World!” program:
<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
Elements can nest inside elements (a paragraph within the body of the page, the title inside the header, the rows inside the table with columns inside them).
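To see that nesting directly, here is a small sketch that walks the “Hello, World!” page with the standard library's `html.parser` and prints each tag indented by how deeply it is nested. The class name `OutlineParser` is made up for this example, and the depth counter assumes every tag is paired (true for this page, but not for unpaired tags like `<img>`).

```python
from html.parser import HTMLParser

HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>"""

class OutlineParser(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # anything that follows is inside this element

    def handle_endtag(self, tag):
        self.depth -= 1  # the element is closed; step back out

parser = OutlineParser()
parser.feed(HTML)

for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The printed outline shows `title` inside `head` and `p` inside `body`, both inside `html`.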
4.2.2.1. Practice: Picking elements of a page
Let’s play a game about finding and selecting elements using the html tags.
This will help on the next practice problem!
4.2.2.2. Practice: Get a list of wiki pages for S&P500 firms
Revisit the S&P 500 wiki:
Let’s try to grab all the URLs. Notice that the default pd.read_html approach can’t do this: it returns the text in each table cell, not the links behind it.
Use your browser’s Inspector to find info you can use to select the table, and then we will try to extract the URLs.
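As a rough sketch of the idea (not the solution to the practice), the snippet below pulls the links out of one specific table in a tiny, made-up HTML sample. It uses the standard library's `html.parser` instead of `requests_html` or `beautifulsoup`. The table id `constituents`, the sample links, and the class name `TableLinkParser` are all assumptions for illustration; check the Inspector for what the real page uses. It also assumes the target table has no tables nested inside it.

```python
from html.parser import HTMLParser

# Made-up sample: one link outside the table, two links inside it
SAMPLE = """
<p><a href="/wiki/Not_in_table">outside</a></p>
<table id="constituents">
  <tr><td><a href="/wiki/3M">3M</a></td></tr>
  <tr><td><a href="/wiki/Abbott_Laboratories">Abbott</a></td></tr>
</table>
"""

class TableLinkParser(HTMLParser):
    """Collect the href of every <a> inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.inside = False  # True while we are between <table id=...> and </table>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.inside = True
        elif tag == "a" and self.inside and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table":
            self.inside = False

link_parser = TableLinkParser("constituents")
link_parser.feed(SAMPLE)
print(link_parser.links)
```

The key step is the same whatever library you use: first select the table using something unique about it (here, its id), then collect the links inside that selection only.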
This practice will help on the homework!
4.2.2.3. Extra resources
I really, really like the tutorials Greg Reda wrote (here and here). See the caveat about urllib2 above, but otherwise this code works. Greg uses urllib to open and beautifulsoup to parse, but if you want to, you should be able to rewrite his code using requests_html pretty easily. When you succeed, please email me!