4.3. Skill 3: Building a spider¶
So now we can use Python to open a page, pass an API query, and parse the page for the elements we’re interested in.
4.3.1. One API hit is cool, but do you know what’s really cool?¶
One million API hits.
Ok, maybe not a million. But now that you can write a request
and modify search parameters, you might need to run a bunch of searches.
Scraping jobs typically fall into one of two camps:
- loop over URLs or some search parameters (like firm names)
- navigate from an initial page through subsequent pages (e.g., through search results)
Of course, both can be true: sometimes a spider might have a list of URLs (search for firms that filed an 8-K in 2000, then those that filed in 2001) and for each URL (year) click through all 8-Ks.
The trick they don’t want you to know
When your job falls into the first camp - you want to loop over a list of URLs - a good way to do that is to define a function that does one search, then call that function for each search in a list of searches.
4.3.2. A silly spider¶
For example:
```python
import pandas as pd
import requests

def search_itunes(search_term):
    '''Run one simple iTunes search'''
    base_url = 'https://itunes.apple.com/search'
    search_parameters = {'term': search_term}
    r = requests.get(base_url, params=search_parameters)
    results_df = pd.DataFrame(r.json()['results'])
    return results_df
```
We can run this one artist at a time:
```python
search_itunes('billie eilish')      # one search at a time
search_itunes('father john misty')  # "another one" - dj khaled
```
Or we can loop over them (saving each `results_df` to a file as you go is a good idea):
```python
artists = ['billie eilish', 'father john misty'] # you can loop over them!

# download the results and save locally
for artist in artists:
    df = search_itunes(artist)
    # you could do anything with the results here
    # a good idea in many projects: save the webpage/search results
    # even better: add the saving function inside the "search_itunes" fcn
    # but this is just a toy illustration, so nothing happens
    pass
```
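For instance, one (of many) ways to do the saving step is one CSV per search; a minimal sketch (the folder name and filename scheme here are just my choices, not a requirement):

```python
import os

os.makedirs('itunes_searches', exist_ok=True)   # folder for the raw results

for artist in artists:
    df = search_itunes(artist)
    # one csv per search, named after the artist
    filename = os.path.join('itunes_searches', artist.replace(' ', '_') + '.csv')
    df.to_csv(filename, index=False)
```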
LATER, you will want to analyze those files. Just loop over the files again:
```python
for artist in artists:
    # load the saved file
    # call a function you wrote to parse/analyze/reformat one file
    # do something with the output from the parser
    # but this is just a toy illustration, so nothing happens
    pass
```
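A sketch of that second step, assuming the files were saved as in the snippet above (`parse_itunes_results` is a hypothetical function you would write):

```python
import os
import pandas as pd

for artist in artists:
    # load the saved file (the path mirrors how it was saved earlier)
    filename = os.path.join('itunes_searches', artist.replace(' ', '_') + '.csv')
    df = pd.read_csv(filename)
    # output = parse_itunes_results(df)   # hypothetical parser/analyzer
    # ...do something with the output...
```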
4.3.3. The main web scraping problems (and workarounds)¶
Also, check out the table from a few pages ago on useful packages and tips.
Issue 1: The jobs are slow
In many web scraping projects, a lot of data needs to get scraped, over thousands (or millions) (or billions) of pages. It’s unlikely that you can do this all in one session. (What if your WiFi disconnects, or Windows decides to do an update, or the webpage freezes you out for a period of time?)
Solutions:
- Write code that only hits the server one time and saves the results to your computer. “Step 1” of the `search_itunes` example above does that. Then “step 2” uses/parses those files without going back to the webpages.
- You want your spider to resume, not restart. Ensure that your code can resume where it left off without having to restart from scratch. My usual solution:

```python
# as I'm looping over webpages:
if not os.path.exists(<filename this page would get>):
    okay_do_the_download() # whatever the function is
# if not, skip to the next webpage
```
- Your spider WILL fail - you don’t want it to stop. I typically use a `try-except-else` block. The `try` part accesses the URL / sends the API request, the `except` part prints or logs a failure to a log file, and the `else` part only executes the code I need to run after the URL request if the `try` code was successful. For example, I could improve the `search_itunes` function:

```python
if not os.path.exists(<filename this page would get>):
    try:
        r = requests.get(base_url, params=search_parameters)
    except:
        print("hey this didn't work! prob print better info than")
        print("this string")
        # or... create strings and append them to an "error_list",
        # which you save to a text file or csv after the code finishes
        # and you can look at it then
    else:
        results_df = pd.DataFrame(r.json()['results'])
```
- Your spider WILL fail - you will want to know what failed and why. You should log failures, warnings, and errors. The prior example can be adjusted to do this well; a sketch of one way follows.
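One way to do that logging, building on the `search_itunes` loop (a sketch; saving to `errors.csv` is just my choice):

```python
import pandas as pd

error_list = []

for artist in artists:
    try:
        df = search_itunes(artist)
    except Exception as e:
        # record which search failed and why, then keep going
        error_list.append({'artist': artist, 'error': str(e)})
    else:
        pass   # save/parse df as usual

# after the loop, save the log so you can inspect failures later
if error_list:
    pd.DataFrame(error_list).to_csv('errors.csv', index=False)
```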
Issue 2: Too much speed
Servers aren’t free and can get overloaded. You’ve seen or heard of websites crashing due to high traffic - Fandango during the Star Wars: Rogue One ticket presale, Black Friday sales, and the Canadian immigration site in Nov 2016.
As such, webmasters often throttle or block computers that are sending too much traffic.
Solutions:
- Slow your code down with `sleep(#)`. This is the main solution; see the sketch after this list.
- Get API access with special permissions.
- If you can’t slow down your spider (the code crawling the site), use multiple computers/IP addresses.
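A minimal sketch of that pacing idea, reusing the `search_itunes` loop from above (the three-second pause and the random jitter are arbitrary choices, not rules):

```python
import random
from time import sleep

for artist in artists:
    df = search_itunes(artist)
    # pause between requests so the server isn't hammered;
    # a little random jitter makes the traffic pattern less robotic
    sleep(3 + random.uniform(0, 2))
```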
Issue 3: So… I’m downloading a loooot of files
You are!
It’s important to save them in an organized way. There is no “one way”, and the directory/storage scheme I choose depends on the job. The main thing is that you probably want two abilities after the download:
- If you sequentially open all files, can you tell what they are? (E.g. the firm, the year, the form type.)
- If you want to only open some files, can you do that without opening all of them? (E.g. only open 10-Ks but not 10-Qs.)
How you achieve these is somewhat up to you, but you basically have two choices (and they can work in tandem):
Solution 1: Build the folder structure so that the path to the file tells you what you need to know.
E.g. `/gvkey_10145/10-ks/2008/934573495-923875934.txt` is “obviously” the 2008 10-K for firm 10145, and you know this without needing to open the file, even though the filename itself is not very clear.
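For instance, here is one way you might build that kind of path as you download (a sketch; the gvkey, form, year, and accession values are made up for illustration):

```python
import os

# made-up identifiers, just for illustration
gvkey, form, year, accession = '10145', '10-ks', '2008', '934573495-923875934'

folder = os.path.join(f'gvkey_{gvkey}', form, year)   # gvkey_10145/10-ks/2008
os.makedirs(folder, exist_ok=True)                    # create the nested folders if needed
filepath = os.path.join(folder, accession + '.txt')

with open(filepath, 'w') as f:
    f.write('<the raw filing you downloaded>')        # stand-in for the page's source
```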
Solution 2: Keep a master list of documents
Sometimes it’s not possible or reasonable to know exactly how to build the directory in advance. For example, forms filed to the SEC in 2008 are often for fiscal year 2007. So what does the “2008” folder mean? How can you tell before running everything? So maybe you just download all the 10-Ks for that firm inside the `/gvkey_10145/10-ks/` folder.
To find the 2008 10-K, you’d open up a master list of documents that contains variables with enough info to assemble the path to each file, plus info about each file. Then you can `query("form='10-K' & fyear=2008")`, assemble the filename, and run your code.
This master list must be assembled either before you run your spider (like in Assignment 5), as you run the spider (collect the info and save it as you go), or after the download by running some code one time to assemble it (either using the file paths, a la `/gvkey_10145/10-ks/2008/934573495-923875934.txt`, or by opening every single file to extract the info about the document).
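A sketch of how that master list might get used (the file name and column names here are my guesses, not a fixed standard):

```python
import pandas as pd

master = pd.read_csv('master_list.csv')   # one row per downloaded document

# find the 2008 10-K for firm 10145, then grab (or assemble) its path
match = master.query("form == '10-K' and fyear == 2008 and gvkey == 10145")
filepath = match['path'].iloc[0]          # or build it from the gvkey/form/fyear columns

with open(filepath, 'r') as f:
    filing = f.read()                     # now parse/analyze the document
```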
4.3.3.1. Summary¶
You can combine all this discussion into a “general structure” for spiders. For each page you want to visit, you need to know
- The URL (or the search term)
- The folder and filename you want to save it to
And then, for each page you want to visit you’ll run this:
```python
def one_search(<the url>, <filename this page would get>):
    if not os.path.exists(<filename this page would get>):
        try:
            r = requests.get(<the url>)
        except:
            # log the error somehow
        else:
            # save the results, I typically save the RAW source
            sleep(3) # be nice to server
```
And that gets run within some loop.
```python
for url in urls:
    filename_to_save = <some function of the url>
    one_search(url, filename_to_save)
```
This structure is pretty adaptable depending on the nature of the problem and the input data you have that yields the list of URLs to visit.
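As a concrete version of that template (a sketch, reusing the iTunes example; the folder name and the three-second pause are just my choices):

```python
import os
import requests
from time import sleep

def one_search(url, params, filename):
    '''Download one page/search and save the raw result, unless we already have it.'''
    if not os.path.exists(filename):                  # resume, don't restart
        try:
            r = requests.get(url, params=params)
        except Exception as e:
            print(f'failed: {params} ({e})')          # or append to an error log
        else:
            os.makedirs(os.path.dirname(filename), exist_ok=True)
            with open(filename, 'w') as f:
                f.write(r.text)                       # save the RAW source
            sleep(3)                                  # be nice to the server

artists = ['billie eilish', 'father john misty']
for artist in artists:
    filename = os.path.join('itunes_searches', artist.replace(' ', '_') + '.json')
    one_search('https://itunes.apple.com/search', {'term': artist}, filename)
```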
4.3.3.2. Would you like another tutorial to try?¶
Again, Greg Reda has a nice walkthrough on building a robust program to download a list of pages, incorporating many of the elements we’ve talked about here.