Getting Data

Outline

  1. Overview of different ways to get data
  2. API: description, demo, practice

Ways to get data from the web

  1. Manually click and download. The way you would have done it before this class.
  2. Let pandas download. E.g. our assignments often begin with pd.read_stata(<url>).
    • Did you know? Pandas can often directly read tables on webpages! Try pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies') (a minimal sketch follows after this list).
    • read_html can only handle basic HTML tables encoded directly in the page (no JavaScript, for example) and only grabs the displayed text -- no URLs.
    • You can use the data without saving it to your hard drive if you want. The good/bad part of this is that if the website changes the data, the next time you run your code you'll get the newer version. (Potentially unstable, but it also updates automatically.)
  3. "Install and play" APIs, which let you interact with a website without specifying the exact API requests. API stands for Application Programming Interface, and it is a way for your computer to send a request (a query) to a server and get some response (hopefully useful data).
    • The pandas_datareader plug-in for Yahoo stock prices is one version of this.
    • datadotworld was another.
    • Kaggle and most of the data sources listed on our main site have API packages for Python.
    • I download your peer reviews using PyGithub.
  4. Manual API queries for websites without "install and play" APIs. Many sites have an API port of some kind serving up the data they show visitors.
  5. Scraping the data implicitly on the website. The last resort. You can't find the API serving the data, but your eyes see it. And you want it, because websites contain a lot of data, like GoT's IMDB page.
    • AGAIN: The last resort.
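
To make option 2 concrete, here's a minimal sketch of grabbing that Wikipedia table with pandas (it assumes the constituents table is still the first table on the page):

import pandas as pd

# read_html returns a LIST of DataFrames, one for each HTML table it finds on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
sp500 = tables[0]       # assumption: the constituents table is the first one
print(sp500.shape)      # roughly 500 rows, plus however many columns the table has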

Wisdom from Greg Reda that applies to all of these:

  1. You should check a site's terms and conditions before you scrape them. It's their data and they likely have some rules to govern it.
  2. Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.
  3. Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.
  4. Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.

Which method should you choose?

Options 1-3 are BY FAR the easiest. If you want more than 10 tables or so (but the threshold depends on your coding speed), I'd abandon the manual option and go with pandas or a nice API package.

Never ever try #4 or #5 without searching for "<website> python api" first.

A scraping toolkit

Useful packages, tricks, and tips

Web scraping packages are always developing and evolving.

  • To "open" a page: urllib or requests. requests is probably the best for sending API queries.
    Warning: lots of walkthroughs online use urllib2, which worked for Python 2 but not Python 3. Use urllib instead, though you might have to include a few tweaks. For example, if you see from urllib2 import urlopen, replace it with from urllib.request import urlopen.
  • To parse a page: beautifulsoup, lxml, or pyquery.
  • Combining opening/parsing: requests_html is a relatively new package and might be excellent. Its code is essentially a combination of many of the above.
  • Blocked because you look like a bot? selenium is one way to "impersonate" a human, and it can also help develop scraping macros, but you might not need it except on difficult scraping projects. It opens a literal browser window. Note that requests_html and requests can also store and use cookies; I'd recommend you try those before selenium.
  • Blocked because you're sending requests too fast? from time import sleep allows you to sleep(<# of seconds>) between requests.
  • Wondering what your current HTML looks like? from IPython.display import HTML, then HTML(<html object>) will render the HTML you have. E.g., if you're using r = requests.get(url), then HTML(r.text) will show you the page you received.
  • How do I find a particular "piece" of a webpage? E.g., Q: Where is that table? A: Oh, it's inside the HTML tag called "table3". You can search for elements via attributes, CSS selectors, XPath, and text; this will make more sense next class. To find that info: right click on an element you're interested in and click "Inspect Element". (F12 is the Windows shortcut.)
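
To tie a few of those rows together, here's a minimal sketch (meant for a notebook) of opening a page with requests, peeking at the HTML you got back, and pausing before the next request; the URL is just a stand-in:

import requests
from time import sleep
from IPython.display import HTML   # the HTML() display trick only works in a notebook

url = 'https://en.wikipedia.org/wiki/Web_scraping'   # stand-in URL; use whatever page you need
r = requests.get(url)

print(r.status_code)   # 200 means the request worked
HTML(r.text)           # render the HTML you received (run as the last line of a notebook cell)

sleep(3)               # pause before your next request so you don't hammer the server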

My suggestion

This is subject to change, but I think you should pick ONE opening and ONE parsing module and stick with it for now. requests_html is a pretty good option that opens pages and can parse them, and it allows you to use lxml or pyquery within it.

You can change and try other stuff as you go, but get as familiar with one package as you can (in a cheap/efficient way).

Now to contradict myself: Some of the packages above can't do things others can, or do them much slower, or the code is hard to write, read, and debug. Sometimes, you're holding a hammer but you need a screwdriver. What I'm saying is, if another package can easily do the job, use it. (Just realize that learning a new package comes with a fixed cost, so be sure you need that screwdriver before grabbing it.)

What skills do I need to learn to be a master Hacker(wo)man?

  1. How to open/read a webpage, and pass specific queries to a server
  2. How to parse a (single) page, to find specific elements of interest (like tables, specific text, URLs)
  3. Doing that for a large number of webpages (building a "scraper" or "crawler" or "spider")
    • monitoring progress (tqdm)
    • dealing with and logging errors
    • slowing down
    • passing cookies and API tokens if needed
    • building a directory to store the pages/data from webpages
    • doing this in a repo without uploading it all to GitHub (.gitignore)

So let's start building up those skills through demos and practice!

Skill #1: Retrieving URLs and sending manual API queries

  1. Prof-led: The Neal Caren "First API" example
    • JSON is just a data storage structure, and in Python it becomes a dictionary. We ask the search to give us JSON because it is often easier to extract data from JSON than from HTML.
    • You can see what's in a dictionary with dict.keys(), and then dict['key'] will show you the value/object belonging to that key.
    • pd.DataFrame() can convert a dictionary to a data frame. (A generic sketch of this query-then-parse pattern appears after this list.)
  2. For your reference: Sometimes a webpage is "hiding" an API. You run a search and the URL doesn't look obviously like a search. But often, inside that page is a "backdoor" to an API you can search just like the above example. This tutorial shows one example of this and more importantly, how the author found the API.
  3. YOUR TURN: Exchange rates
    • Start with url = 'https://api.exchangeratesapi.io/latest?base=NOK'
    • Q1: What is the average exchange rate value this search returns?
    • The documentation for this database (how to search it, change parameters, etc.)
    • Q2: Change the base to Euros, then tell me how many Japanese Yen is in a euro.
    • Q3: What was the number of Yen per Euro on Jan 2, 2018?
    • Q4: Bonus, prof can show: Get a time series of EURJPY from 2018 through 2019.
  4. AFTER CLASS PRACTICE:
    • Rewrite our code for Q1-Q4 using requests_html to the extent possible. If and when you succeed, email your solution to me!
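
For reference, the general pattern for a manual API query looks like the sketch below. It uses the exchange-rate URL from the exercise, but the API may have changed since this was written (some endpoints now require a free key), so treat it as a template rather than a guaranteed answer to Q1:

import requests
import pandas as pd

url = 'https://api.exchangeratesapi.io/latest?base=NOK'   # the starting URL from the exercise
r = requests.get(url)

data = r.json()         # the JSON response becomes a Python dictionary
print(data.keys())      # look at what the dictionary contains
# per the API's documentation, the rates live under a 'rates' key (an assumption if the API changed)
rates = pd.Series(data['rates'])
# from here, .mean(), changing the base, adding date parameters, etc. are small tweaks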

Skill #2: How to parse a (single) page

You might parse a page to

  • isolate the data/text/urls you need
  • find a link to collect (or click!)

Data on a page is typically HTML, XML, or JSON.

  1. JavaScript Object Notation (JSON)
  2. eXtensible Markup Language (XML)
  3. HyperText Markup Language (HTML) - for example, the S&P 500 wiki page below.

You can right click on a page and "view source" to see the underlying HTML, XML, or JSON.

  • Go to the S&P500 wiki page
  • Right click and view the source. See the structure? That's HTML!

From Wikipedia:

HTML markup consists of several key components, including those called tags (and their attributes), character-based data types, character references and entity references. HTML tags most commonly come in pairs like <h1> and </h1>, although some represent empty elements and so are unpaired, for example <img>. The first tag in such a pair is the start tag, and the second is the end tag (they are also called opening tags and closing tags).

Another important component is the HTML document type declaration, which triggers standards mode rendering.

The following is an example of the classic "Hello, World!" program:

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

Elements can nest inside elements (a paragraph within the body of the page, the title inside the header, the rows inside the table with columns inside them).

Practice

Let's play a game about finding and selecting elements using their HTML tags.

Practice/challenge: Get a list of wiki pages for S&P500 firms

  • Revisit the S&P 500 wiki: pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
  • Let's try to grab all the URLs. Notice that the pd.read_html approach can't do this.
  • Use your browser's Inspector to find info you can use to select the table, and then we will try to extract the URLs (one possible sketch follows below).
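
One way to attack this (a sketch, not the official solution) uses requests_html; the 'table.wikitable' selector assumes the constituents table is still the first table with that class:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

# assumption: the constituents table is the first table with class "wikitable"
table = r.html.find('table.wikitable', first=True)

# .absolute_links collects every URL inside that element; keep only the wiki article links
firm_links = [u for u in table.absolute_links if '/wiki/' in u]
print(len(firm_links))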

AFTER CLASS PRACTICE

I really, really like the tutorials Greg Reda wrote (here and here). See the caveat about urllib2 above, but otherwise this code works. Greg uses urllib to open and beautifulsoup to parse, but if you want to, you should be able to rewrite his code using requests_html pretty easily. When you succeed, please email me!

Skill #3: Building a scraper

We can open a page, pass an API query, and parse a page for the elements we're interested in.

One API hit is cool, but do you know what's really cool?

One million API hits.

Ok, maybe not a million.* But now that you can write a request and modify search parameters, you might need to run a bunch of searches.

Scraping jobs typically fall into one of two camps:

  1. loop over URLs (predetermined list)
  2. navigate from an initial page through subsequent pages (e.g through search results)

Of course, both can be true: sometimes a scraper might have a list of URLs (search for crime 1, then crime 2) and for each URL (crime) click through all result pages.

When your job falls into the first camp - you want to loop over a list of URLs - a good way to do that is: Define a function to do one search, then call that for each search in a list of searches.

import requests
import pandas as pd

def search_itunes(search_term):
    '''Simplified iTunes search'''
    
    base_url = 'https://itunes.apple.com/search'
    search_parameters = {'term': search_term}
    
    r = requests.get(base_url, params=search_parameters)
    
    results_df = pd.DataFrame(r.json()['results'])
    
    return results_df

search_itunes('billie eilish')     # one search at a time
search_itunes('father john misty') # "another one" - dj khaled

artists = ['billie eilish','father john misty'] # you can loop over them!

# download the results and save locally
for artist in artists:
    df = search_itunes(artist)
    # you could do anything with the results here
    # a good idea in many projects: save the webpage/search results
    # even better: add the saving function inside the "search_itunes" fcn
    # but this is just a toy illustration, so nothing happens
    print(len(df)) 
    
# LATER, you will want to analyze those files. Just loop over the files again:
for artist in artists:
    # load the saved file
    # call a function you wrote to parse one file
    # do something with the output from the parser
    # but this is just a toy illustration, so nothing happens    
    pass   

The main web scraping problems (and workarounds)

Also, again, check out the table of useful packages and tips above.

Issue 1: The jobs are slow

In many web scraping projects, a lot of data needs to get scraped, over thousands (or millions) (or billions) of pages. It's unlikely that you can do this all in one session. (What if your WiFi disconnects, or Windows decides to do an update, or the webpage freezes you out for a period of time?)

Solutions:

  1. Write code that only hits the server one time, and saves the results to your computer. "Step 1" of the search_itunes example above does that. Then "step 2" uses/parses those files without going to the webpages again.
  2. You want your spider to resume, not restart. Ensure that your code can resume where it left off without having to restart from scratch. My usual solution:
    # as I'm looping over webpages:
    if not os.path.exists(<filename this page would get>):
        okay_do_the_download()  # whatever the download function is
    # if the file already exists, skip to the next webpage

  3. Your spider WILL fail - you don't want it to stop. I typically use a try-except-else block. The try part accesses the URL / sends the API request, the except part prints or logs a failure to a log file, and the else part only executes the code I need to run after the request if the try code was successful. For example, I could improve the search_itunes function:
    if not os.path.exists(<filename this page would get>):
        try:
            r = requests.get(base_url, params=search_parameters)
        except:
            print("hey this didn't work! prob print better info than")
            print("this string")
            # or... create strings and append them to an "error_list",
            # which you save to a text file or csv after the code finishes
            # and you can look at it then
        else:
            results_df = pd.DataFrame(r.json()['results'])

  4. Your spider WILL fail - and you will want to know what failed and why. You should log failures, warnings, and errors. The prior example can be adjusted to do this well.
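
Putting those pieces together, a skeleton for a resumable, failure-tolerant download loop might look like the sketch below; the url_list, the filename scheme, and the error log are all placeholders you'd adapt to your job:

import os
import requests
from time import sleep

os.makedirs('pages', exist_ok=True)          # folder to store the raw downloads
url_list = ['https://example.com/page1']     # placeholder: your real list of URLs
error_log = []

for url in url_list:
    fname = os.path.join('pages', url.split('/')[-1] + '.html')  # a simple (placeholder) filename scheme
    if os.path.exists(fname):                # resume, don't restart: skip pages we already have
        continue
    try:
        r = requests.get(url)
    except Exception as e:
        error_log.append(f'{url} failed: {e}')   # log the failure and keep going
    else:
        with open(fname, 'w', encoding='utf-8') as f:
            f.write(r.text)                  # save the RAW page to parse later
    sleep(3)                                 # be nice to the server

# after the loop, save or print error_log so you know what to retry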

Issue 2: Too much speed

Servers aren't free and can get overloaded. You've seen or heard of websites crashing due to high traffic - Fandango during the Star Wars: Rogue One presale, Black Friday sales, and the Canadian immigration site in November 2016.

As such, webmasters often throttle or block computers that are sending too much traffic.

Solutions:

  1. Slow your code down with sleep(#). This is the main solution.
  2. Get API access with special permissions.
  3. If you can't slow down your spider (the code crawling the site), use multiple computers/IP addresses.

Issue 3: So... I'm downloading a loooot of files

You are!

It's important to save them in an organized way. There is no "one way", and the directory/storage scheme I choose depends on the job. The main thing is that you probably want two abilities after the download:

  1. If you sequentially open all files, can you tell what they are? (E.g. the firm, the year, the form type.)
  2. If you want to only open some files, can you do that without opening all files? (E.g. only open 10-Ks but not 10-Qs.)

How you achieve these is somewhat up to you, but you basically have two choices (and they can work in tandem):

Solution 1: Build the folder structure so that the path to the file tells you what you need to know.

E.g. /gvkey_10145/10-ks/2008/934573495-923875934.txt is "obviously" the 2008 10-K for firm 10145, and you know this without needing to open the file and even though the filename itself is not very clear.
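
A tiny sketch of building such a path from metadata you already have (the gvkey, form type, year, and filename here are all hypothetical):

import os

gvkey, form, year = '10145', '10-ks', '2008'    # hypothetical metadata about one filing
fname = '934573495-923875934.txt'               # whatever the source calls the file

folder = os.path.join(f'gvkey_{gvkey}', form, year)
os.makedirs(folder, exist_ok=True)              # create the folders if they don't exist yet
filepath = os.path.join(folder, fname)          # gvkey_10145/10-ks/2008/934573495-923875934.txt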

Solution 2: Keep a master list of documents

Sometimes it's not possible or reasonable to know exactly how to build the directory in advance. For example, forms filed to the SEC in 2008 are often for fiscal year 2007. So what does the "2008" folder mean? How can you tell before running everything? So maybe you just download all the 10-ks for that firm inside the /gvkey_10145/10-ks/ folder.

To find the 2008 10-K, you'd open a master list of documents, which contains variables with enough info to assemble the path to each file, plus info about each file. Then you can query("form == '10-K' & fyear == 2008"), assemble the filename, and run your code.

This master list must be assembled either before you run your spider (as in Assignment 5), as you run the spider (collect the info and save it as you go), or after the download, by running some code one time to assemble it (either by using the paths a la /gvkey_10145/10-ks/2008/934573495-923875934.txt, or by opening every single file to extract the info about each document).
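
For example, if your master list is a CSV with one row per downloaded document (the file name and column names below are made up), finding the 2008 10-K is just a query:

import pandas as pd

master = pd.read_csv('master_list.csv')                  # hypothetical inventory: one row per document
hits = master.query("form == '10-K' & fyear == 2008")    # 'form' and 'fyear' are assumed column names
for path in hits['filepath']:                            # 'filepath' is an assumed column too
    pass                                                 # open the file here and parse it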

Summary

You can combine all this discussion into a "general structure" for spiders. For each page you want to visit, you need to know

  1. The URL
  2. The folder and filename you want to save it to

And then, for each page you want to visit:

if not os.path.exists(<filename this page would get>): 
    try:
        r = requests.get(<the url>)
    except:
        pass   # log the error somehow
    else:
        # save the results; I typically save the RAW source
        sleep(3) # be nice to the server

Would you like another tutorial to try?

Again, Greg Reda has a nice walkthrough on building robust code to download a list, and it incorporates many of the elements we've talked about.

Practice

Start a "repo" - this time I just mean a folder inside the participation folder, but treat it like a standalone repo - and inside it, crawl all the S&P 500 firms' wiki pages.

Footnotes

* I've done well over a million API hits in the name of science.

Credits