Midterm aka Assignment 5 - Our first real data science project


  1. Read all instructions before starting.

  2. Start early. Work on the components of the project in parallel with related class discussions.


  3. Look at my class notes repo. I worked on part of this project there, especially step 3 and parts of step 6.


Per the syllabus, this project is 10% of your overall grade, which is 3x the weight of a typical assignment. It will take 2-3x the time of a typical assignment.

Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then analyzing the data.

BUT: It will take time! If you start the day before it is due, YOU WILL NOT FINISH IT. If you start two days before it is due, you might finish it, but it will not be done well.

Project Set Up

First, let’s start with a high level overview. (More details will follow.)

  • Basic question: Do 10-K filings contain value-relevant information in the sentiment of the text?

    • Sentiment: Does the word have a positive or negative tone (e.g. “confident” vs “against”)

  • Specific question: Is positive or negative sentiment in a 10-K associated with better/worse stock returns?

    • This is called a “cross-sectional event study”

    • Expected minimum output: Scatterplot (x = document’s sentiment score, y = stock returns around the 10-K’s release) with regression lines; formatted well

    • Advanced output: Regression tables, heatmaps, better scatterplots/similar

  • New data science technique: Textual analysis and sentiment analysis.

  • Data needed:

    • Returns: Stock returns for firms around their 10-K dates

    • Sentiment measures will be created from the text in their 10-K filings. (We will need to download the 10-K files and get the text within them!)
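Sentiment measures of this kind are often simple dictionary counts. Here is a rough sketch of the idea; the word lists below are tiny placeholders (a real project would use a proper financial sentiment dictionary, and the scoring formula here is just one plausible choice):

```python
import re

# Placeholder word lists -- stand-ins for a real financial sentiment dictionary
POSITIVE = {"confident", "strong", "growth", "improved"}
NEGATIVE = {"against", "decline", "loss", "adverse"}

def sentiment_score(text):
    """(# positive words - # negative words) / total words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

sample = "Management is confident of strong growth despite a small loss."
print(sentiment_score(sample))  # 3 positive, 1 negative, 10 words -> 0.2
```

Your real version will run over full 10-K texts and much longer word lists, but the mechanics are the same: tokenize, count hits, normalize by document length.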

Notice: The question dictates what the analysis design should be, which in turn dictates the data requirements. Question-led research > data-driven research.
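You can prototype the minimum-output scatterplot before your real data exist. The sketch below uses made-up numbers (the 0.5 slope and noise levels are fabricated purely for illustration) so you can get the plotting and regression-line code working early:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up data standing in for your real sample:
# x = each 10-K's sentiment score, y = the firm's return around the filing date
sentiment = rng.normal(0, 0.02, 200)
returns = 0.5 * sentiment + rng.normal(0, 0.03, 200)

# Fitted regression line (degree-1 polynomial fit)
slope, intercept = np.polyfit(sentiment, returns, 1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(sentiment, returns, alpha=0.5)
xs = np.linspace(sentiment.min(), sentiment.max(), 50)
ax.plot(xs, slope * xs + intercept, color="red")
ax.set_xlabel("10-K sentiment score")
ax.set_ylabel("Return around 10-K release")
ax.set_title(f"Returns vs. sentiment (slope = {slope:.3f})")
fig.savefig("sentiment_vs_returns.png", dpi=150)
```

Once your sample is built, swapping the fake arrays for your real columns (and using `seaborn.regplot` or similar, if you prefer) is a one-line change.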

Project deliverables

Your repo and the code within should put the Golden Rules from Chapters 2, 3.3, and 4 into practice. You should review those. You will be graded on them in the following ways (not an exhaustive list):

  • The repo’s structure (files, organization, names, README)

  • Code style/clarity in the download and build_sample files (The teaching team will not read your build_sample file other than to grade/comment on code style and, potentially, errors. Answers to questions here will be ignored - put them in the report file!)

  • The main portion of your grade is the report file. Your answers to all questions, unless explicitly directed otherwise, should be in the report file. The report file should be written and formatted like an executive report.

    • It is not a technical document and not a bunch of hasty code. It should be well formatted and clean in terms of text, code, and output. Don’t show extraneous print statements you wrote while ABCD’ing and checking your code. Treat it like a Word document that happens to have some code (but just enough to do the analysis and show outputs).

    • There is no “page expectation” or “page limit”. Aim to provide sufficient analysis and explanation but in a concise and clear way. Bullet points are fine in places, but you should have a few places with paragraph-style discussion.

    • High quality and concise (but sufficient) reporting is an A1 emphasis. Here, pretty, smart, and effective tables and visualizations will receive higher grades.

Steps to complete the assignment


Give yourself a big round of applause at this point!

Your code is probably very flexible and powerful by now. If you have the appetite + a larger list of EDGAR files to download + a large enough hard drive + time, then you could download more than 100GB of 10-K filings and run textual analysis across 25+ years of data for all publicly traded firms.
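One caveat if you do scale up: EDGAR expects polite scrapers. The SEC asks automated traffic to identify itself via a User-Agent header and to keep request rates modest. A minimal, resume-friendly download loop might look like the sketch below (the User-Agent string is a placeholder you must replace, and the filename handling assumes each URL ends in a usable filename):

```python
import time
import urllib.request
from pathlib import Path

# Placeholder -- the SEC asks you to identify yourself; use your real name/email
HEADERS = {"User-Agent": "Your Name your_email@example.com"}

def download_filings(urls, out_dir="10k_files", pause=0.15):
    """Download each filing URL into out_dir, skipping files already saved."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    saved = 0
    for url in urls:
        fname = out / url.split("/")[-1]
        if fname.exists():        # resume-friendly: don't re-download on a restart
            continue
        req = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(req) as resp:
            fname.write_bytes(resp.read())
        saved += 1
        time.sleep(pause)         # stay well under EDGAR's fair-access rate limit
    return saved
```

The skip-if-exists check matters at 100GB scale: your download will crash or be interrupted at some point, and you want restarting to be cheap.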

Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of SEC filings. These four studies are variously provocative, great, and (in one case) mine: