Assignment 5 - Our first real data science project


Warning/time guidance: This assignment is the first project where you will have to get data, clean it yourself, and parse the data to create variables to analyze. These steps take time!

Project Set Up

The nuts and bolts of the set up are:

  • Basic question: What “types” of firms were hurt more or less by covid?

  • Specific questions: What risk factors were associated with better/worse stock returns around the onset of covid?

    • This is a “cross-sectional event study”

    • Expected minimum output: Scatterplot (x = some “risk factors”, y = returns around March 2020) with regression lines; formatted well

    • Pro output: Regression tables, heatmaps, better scatterplots

  • New data science technique: Textual analysis. We will estimate “risk factors” from the text of S&P 500 firms’ Wikipedia pages.

    • More on this below

    • Adv method: Estimate them from the text of S&P 500 firms’ 10-K filings.

  • Data needed:

    • Returns: Stock returns for S&P 500 firms can be pulled from Yahoo Finance

    • Risk factors for each firm will be created from its Wikipedia page OR its 10-K filings. Using the 10-K filings is significantly more interesting and more likely to yield usable findings, but it is also more difficult, and as such will earn a grading premium.

  • So your main challenge… is to create variables that measure risks for each firm.
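The returns piece of the pipeline is mechanical once you have prices: measure a buy-and-hold return over the event window for each firm. A minimal sketch (the tickers, dates, and prices below are illustrative; in practice the price panel would come from `yfinance`’s `yf.download`, as noted in the comment):

```python
import pandas as pd

# In practice, prices come from Yahoo Finance, e.g.:
#   import yfinance as yf
#   prices = yf.download(["AAPL", "MMM"], start="2020-02-01", end="2020-04-01")["Adj Close"]
# A toy price panel keeps this sketch self-contained (values are made up).
prices = pd.DataFrame(
    {"AAPL": [80.0, 70.0, 60.0], "MMM": [160.0, 150.0, 140.0]},
    index=pd.to_datetime(["2020-02-21", "2020-03-09", "2020-03-23"]),
)

def event_return(prices, start, end):
    """Buy-and-hold return from `start` to `end` for each firm (column)."""
    window = prices.loc[start:end]
    return window.iloc[-1] / window.iloc[0] - 1

# Return over the covid crash window for each firm
ret = event_return(prices, "2020-02-21", "2020-03-23")
print(ret)  # one return per ticker; these become the y-axis of the scatterplot
```

The resulting Series (one return per ticker) is the y variable in the expected scatterplot; the x variable is whatever risk measure you build from the text.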
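For the risk-factor side, one simple textual-analysis measure is a normalized keyword count: hits from a risk-word dictionary per 1,000 words of the firm’s text. A minimal sketch (the word list and sample sentence are illustrative stand-ins, not the dictionary or Wikipedia/10-K text you would actually use):

```python
import re

# Illustrative risk dictionary; a real one would be much larger and
# tailored to the risk type (supply chain, pandemic, financial, ...).
RISK_WORDS = {"risk", "pandemic", "disruption", "uncertainty", "litigation"}

def risk_score(text):
    """Risk-word hits per 1,000 words of `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(w in RISK_WORDS for w in words)
    return 1000 * hits / len(words)

sample = ("The pandemic created supply disruption and heightened "
          "uncertainty, a material risk to our operations.")
print(round(risk_score(sample), 1))  # → 285.7 (4 hits in 14 words)
```

Computing this score for every firm’s page or filing gives you the x-axis variable to plot against the event returns.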

Steps to complete the assignment


Give yourself a big round of applause at this point!

By now, your code is probably very flexible and powerful. If you have the appetite, a larger list of EDGAR files to download, a large enough hard drive, and time, then you could download more than 100GB of 10-K filings and run textual analysis across 20+ years of data for all publicly traded firms.

Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of these filings. These four studies are variously provocative, great, and (in one case) mine: