Midterm aka Assignment 5 - Our first real data science project


  1. Read all instructions before starting.

  2. Start early. Work on the components of the project in parallel with related class discussions.



Per the syllabus, this project is 10% of your overall grade, which is about 2x the weight of a typical assignment. It will probably take 2-3x the time of a typical assignment.

Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then exploring the data.

BUT: It will take time! If you start the day before it is due, YOU WILL NOT FINISH IT. If you start two days before it is due, you might finish it, but it will not be done well.

Project Set Up

The nuts and bolts of the set up are:

  • Basic question: What “types” of firms were hurt more or less by covid?

  • Specific question: What risk factors were associated with better/worse stock returns around the onset of covid?

    • This is called a “cross-sectional event study”

    • Expected minimum output: Scatterplot (x = some “risk factors”, y = returns around March 2020) with regression lines; formatted well

    • Discussion of the economics linking your risk factors to the returns is expected

    • Pro output: Regression tables, heatmaps, better scatterplots

  • New data science technique: Textual analysis. We will estimate “risk factors” from the text of S&P 500 firms’ 10-K filings.

    • More on this below

  • Data needed:

    • Returns: Stock returns for S&P 500 firms can be pulled from Yahoo

    • Risk factors for each firm will be created from their 10-K filings.
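To make the "cross-sectional event study" idea concrete, here is a minimal sketch using only the Python standard library. It compounds each firm's daily returns over an event window into one number, then fits a least-squares line of those window returns on a risk score. The tickers, returns, and risk scores are made up for illustration, and the helper names `cumulative_return` and `ols_line` are hypothetical; in the actual project you would likely use pandas and a regression package instead.

```python
from math import prod


def cumulative_return(daily_returns):
    """Compound simple daily returns into a single return for the window."""
    return prod(1.0 + r for r in daily_returns) - 1.0


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up numbers for three hypothetical firms (NOT real data).
daily = {
    "AAA": [-0.10, -0.05, 0.02],
    "BBB": [-0.02, 0.01, 0.00],
    "CCC": [-0.06, -0.03, 0.01],
}
risk = {"AAA": 0.9, "BBB": 0.1, "CCC": 0.5}

firms = sorted(daily)
window_ret = [cumulative_return(daily[f]) for f in firms]  # y: one return per firm
risk_score = [risk[f] for f in firms]                      # x: one risk score per firm
intercept, slope = ols_line(risk_score, window_ret)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A negative slope here would mean that firms with a higher risk score had worse returns in the window. The scatterplot with a regression line is the picture of exactly this fit.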

So your main challenge… is to create variables that measure risks for each firm.
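One simple way to turn text into a risk variable is to count how often words from a topic list appear in a filing, scaled by the filing's length. The sketch below assumes that approach; the term list, the sample sentence, and the function name `risk_score` are all hypothetical, and the method the assignment ultimately asks for may differ (for example, matching phrases rather than single words).

```python
import re


def risk_score(text, terms):
    """Share of words in `text` that exactly match a term in `terms` (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty documents
    term_set = {t.lower() for t in terms}
    hits = sum(1 for w in words if w in term_set)
    return hits / len(words)


# Hypothetical term list and sample text, for illustration only.
SUPPLY_CHAIN_TERMS = ["supplier", "suppliers", "shortage", "shortages"]
sample = "We rely on a single supplier. A shortage could disrupt production."

score = risk_score(sample, SUPPLY_CHAIN_TERMS)
print(f"{score:.4f}")
```

Scaling by document length matters because 10-Ks vary a lot in size; a raw count would mostly measure how long the filing is.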

Steps to complete the assignment


Give yourself a big round of applause at this point!

Your code is probably very flexible and powerful by now. If you have the appetite, a larger list of EDGAR files to download, a large enough hard drive, and time, then you could download more than 100 GB of 10-K filings and run textual analysis across 20+ years of data for all publicly traded firms.

Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of these filings. These four studies are variously provocative, great, and (in one case) mine: