Midterm aka Assignment 5 - Our first real data science project
Read all instructions before starting.
Start early. Work on the components of the project in parallel with related class discussions.
RECHECK THESE INSTRUCTIONS BEFORE SUBMITTING
Per the syllabus, this project is 10% of your overall grade, which is 3x the weight of a typical assignment. It will take 2-3x the time of a typical assignment.
Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then analyzing the data.
BUT: It will take time! If you start the day before it is due, YOU WILL NOT FINISH IT. If you start two days before it is due, you might finish it, but it will not be done well.
Project Set Up
First, let’s start with a high level overview. (More details will follow.)
Basic question: Do 10-K filings contain value-relevant information in the sentiment of the text?
Sentiment: Does the word have a positive or negative tone (e.g. “confident” vs “against”)
Specific question: Is positive or negative sentiment in a 10-K associated with better/worse stock returns?
This is called a “cross-sectional event study”
Expected minimum output: Scatterplot (x = document’s sentiment score, y = stock returns around the 10-K’s release) with regression lines; formatted well
Advanced output: Regression tables, heatmaps, better scatterplots/similar
New data science technique: Textual analysis and sentiment analysis.
Returns: Stock returns for firms around their 10-K dates
Sentiment measures will be created from the text in their 10-K filings. (We will need to download the 10-K files and get the text within them!)
Notice: The question dictates what the analysis design should be, which in turn dictates the data requirements. Question-led research > data-driven research.
Your repo and the code within should put the Golden Rules from Chapters 2, 3.3, and 4 into practice. You should review those. You will be graded on them in the following ways (not an exhaustive list):
The repo’s structure (files, organization, names, README)
Code style/clarity in the `build_sample` file. (The teaching team will not read your `build_sample` file other than to grade/comment on code style and, potentially, errors. Answers to questions here will be ignored - put them in the report file!)
The main portion of your grade is the `report` file. Your answers to all questions, unless explicitly directed otherwise, should be in the `report` file, which should be written and formatted like an executive report.
It is not a technical document and not a bunch of hasty code. It should be well formatted and clean in terms of text, code, and output. Don’t show extraneous print statements you wrote while ABCD’ing and checking your code. Treat it like a Word document that happens to have some code (but just enough to do the analysis and show outputs).
There is no “page expectation” or “page limit”. Aim to provide sufficient analysis and explanation but in a concise and clear way. Bullet points are fine in places, but you should have a few places with paragraph-style discussion.
High quality and concise (but sufficient) reporting is an A1 emphasis. Here, pretty, smart, and effective tables and visualizations will receive higher grades.
Steps to complete the assignment
1. Start the assignment
As usual, click the link I provide in the discussion board.
The repo is almost empty. This is a start-to-finish project, so I’m letting you dictate the structure of the files.
I have provided some input files and two papers that are relevant to the midterm
Clone this to your computer.
2. Edit .gitignore
The `download_text_files.ipynb` file will create a large data structure in a subfolder called `10k_files/` with all the downloaded 10-K files. Depending on the choices you/we make, there could be several gigs of data in this folder. We don’t want to save/push all these files to GitHub!
Add this directory (`10k_files/`) to your `.gitignore` before you proceed!
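Ignoring the folder only takes one line in `.gitignore` (the folder name assumes you create `10k_files/` as described in the next step):

```
# don't push the bulk 10-K downloads to GitHub
10k_files/
```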
3. Create download_text_files.ipynb
Should create a subfolder for inputs (`inputs/`), if one doesn’t exist. We will save background info about our sample firms there.
Should create another subfolder (`10k_files/`) to hold all the text files you download. Because scraping can generate a large number of files, I usually put them in a dedicated folder instead of the generic input folder we just made.
If you don’t already have it, get a list of firm tickers for our study.
Download the last 10-K filed during 2022 for each firm, if you haven’t already downloaded it.
Try to download just one 10-K at first. When you can successfully do that, try a few more, one at a time. Check the folders on your computer - did they download like you expected? Are the files correct? If yes, continue. If not, you have an error to fix.
The website has really good info on “building a spider.” Highly recommend!
We will spend time in class building this file up. Follow along!
When you are confident the program works,
Delete your whole `10k_files/` subfolder on your computer to make sure the program works from a “fresh start”
Rerun this from scratch.
Rerun the file AGAIN (but don’t delete the files you have). The file should not download any new files! Does the file still work after it has already been run, or after it partially completed its work? Real spiders have to resume where they left off, so you might need some conditional tweaks: the code shouldn’t actually re-download the data, but it should still run without error!
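The resume logic can be sketched like this. This is a minimal example, not the required implementation: `download_filing`, `url`, and `save_path` are hypothetical names, and you should adapt the folder layout to however you build your list of EDGAR links. (EDGAR expects a descriptive User-Agent on requests.)

```python
import os
import time

import requests

def download_filing(url, save_path):
    """Download one filing, but skip it if we already have it.

    Returns True if a download happened, False if the file already existed.
    """
    if os.path.exists(save_path):   # resume logic: already downloaded, skip
        return False
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    # EDGAR asks for a descriptive User-Agent identifying you
    r = requests.get(url, headers={"User-Agent": "Your Name your_email@school.edu"})
    r.raise_for_status()
    with open(save_path, "w", encoding="utf-8") as f:
        f.write(r.text)
    time.sleep(0.2)                 # be polite: pause between downloads
    return True
```

Because the existence check happens first, rerunning the loop over an already-downloaded sample finishes quickly without hitting the SEC servers again.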
4. IMPORTANT: Create screenshot.png
It’s not polite to upload so much data to GitHub. It takes up space on the server, and your collaborators/peer reviewers will have to download them all when they clone your repo.
That’s why you edited the gitignore before doing all those downloads. If you did it correctly and check GitHub Desktop, you won’t see any of the text files!
Now that your `download_text_files.ipynb` is done running, push the repo. Even though your computer has a `10k_files/` folder on it with many files and some hard drive space used, the repo in your browser doesn’t show this at all! Good job!
Create `screenshot.png`: open your `10k_files/` folder so it shows the number of files inside of it, take a screenshot showing this, and save it as `screenshot.png` inside your repo. The purpose of this is to upload proof of the downloaded files for your reviewers.
5. Download near_regex.py from the community codebook into your repo
This will be used in the next step.
6. Create build_sample.ipynb
The idea is that this code builds everything we need for the report. The report should be minimal (just the necessary code to produce the output tables and figures you will discuss) so that readers can focus on making conclusions. So, this file should create a dataset that has
one observation per firm
variables needed for the analysis
I know what should be in this file because I worked backwards from what I know I want in the report. If you’re reading this section and are confused, go look at the report section below.
Variables to create:
2 versions of a “buy and hold” around the 10-K date (“date t”)
Version 1: Measure from day t to day t+2 (inclusive)
Note: t is counted in business days, so ignore weekends and holidays!
Version 2: Measure from day t+3 to day t+10 (inclusive)
Calculate the firm’s buy-and-hold return over each time span, a la Assignment 2
Stock returns for 2022 can be found here. Open this file the same way we have opened similar data files in prior assignments.
10 sentiment variables
A positive sentiment variable is the fraction of words in a 10-K that are “positive” words
A negative sentiment variable is the fraction of words in a 10-K that are “negative” words
How do we define which words are positive and which are negative? The input folder contains two different sets of words.
The “LM” sentiment dictionary comes from two researchers named Loughran and McDonald
The “ML” sentiment lists come from a machine learning approach used in the Journal of Financial Economics this year
Background on both measures: You should read the abstract and the intro of the papers in the literature folder. Because “ML_JFE.pdf” is the more recent publication, it contrasts the two dictionaries
The first 4 of the 10 variables:
“LM Positive” and “LM Negative”
“ML Positive” and “ML Negative”
The last 6 of the 10 variables: “Contextual” sentiment. Basically: What is the (positive and negative) sentiment of the text in a 10-K around discussions of a particular topic.
Pick three topics. Each of those will get a positive and negative sentiment score.
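The basic (non-contextual) sentiment fraction boils down to counting dictionary hits and dividing by document length. Here is a minimal sketch, assuming you have loaded a dictionary into a plain Python list; the contextual measures replace the simple pattern below with one built by `near_regex.py`, which this sketch does not show.

```python
import re

def sentiment_fraction(text, word_list):
    """Fraction of words in `text` that appear in `word_list`.

    A simplified illustration: lowercase everything, then count
    whole-word matches with one alternation regex.
    """
    text = text.lower()
    doc_length = len(text.split())
    # one regex for the whole list, with word boundaries on each side
    pattern = r"\b(" + "|".join(re.escape(w.lower()) for w in word_list) + r")\b"
    n_hits = len(re.findall(pattern, text))
    return n_hits / doc_length

# hypothetical mini-example: 2 hits out of 8 words
print(sentiment_fraction("We are confident growth will continue despite losses",
                         ["confident", "growth"]))  # → 0.25
```

On a real 10-K you would pass in the cleaned filing text and the full LM or ML word list instead of this toy input.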
This file (broad steps):
Loads the initial dataset of sample firms saved inside of `inputs/`
For each firm,
load the corresponding 10-K. Clean the text.
Create the sentiment measurements, and save those new measurements to the correct row and column in the dataframe.
Bonus: Save the total length of the document (# of words)
Bonus: Save the # of unique words (similar to total length)
Calculate the two return measurements. Save those to the correct row and column in the dataframe
Downloads 2021 accounting data (2021 ccm_cleaned.dta) from the data repo (possibly useful in analysis) and adds them to the dataset
Save the whole thing to the `output/` folder
When you are confident the program works, delete your whole `output/` folder on your computer so you have a “fresh start” and then rerun this from scratch.
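The buy-and-hold return calculation above compounds daily returns over a window. A toy sketch of that arithmetic (the column names and the tiny return series are made up for illustration; your real series would extend through day t+10 and beyond):

```python
import pandas as pd

# toy daily returns for one firm; t = business days since the 10-K date
rets = pd.DataFrame({"t":   [0, 1, 2, 3, 4],
                     "ret": [0.01, -0.02, 0.03, 0.00, 0.01]})

def buy_hold_ret(df, start, end):
    """Compound daily returns from day `start` to day `end`, inclusive."""
    window = df.loc[df["t"].between(start, end), "ret"]
    return (1 + window).prod() - 1

v1 = buy_hold_ret(rets, 0, 2)    # version 1: t to t+2
print(round(v1, 6))              # → 0.019494
```

Version 2 is the same function called with `start=3, end=10` once your return series covers that window.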
7. Create exploration_ugly.ipynb.
Use this file to try to figure out how to do the analysis below. Play around in this file. No one will look at it. It’s a safe space.
If you find issues with your sentiment measurements or come up with improvements you think you should make, go back and work on the previous file more.
You can and should use this file to figure out what you want to include in the final report and how you want it to appear.
8. Create report.ipynb
This is the main portion of your grade!!!
The following outputs and discussion prompts are the focus of grading:
Summary section (brief, max 2 paragraphs)
Summarize your question, what you did, and your findings. You can model this on the abstracts in the literature folder.
What’s the sample?
How are the return variables built and modified? (Mechanical description.) Aim for rigor and do not skip steps. You can include text and formulas here.
How are the sentiment variables built and modified? (Mechanical description.) Aim for rigor and do not skip steps. You can include text and formulas here.
Why did you choose the three topics you did for the “contextual sentiment” measures?
Show and discuss summary stats of your final analysis sample
Do your “contextual sentiment” measures pass some basic smell tests?
Smell tests: Is something fishy? (What you look for depends on the setting.)
Do you have variation in the measures (i.e., a variable is not all the same value)?
Are the industries you expect to talk about your topics positively or negatively actually doing so?
This should be sufficient for a reader to understand the dataset and have context for interpreting results
Are there any caveats about the sample and/or data? If so, mention them and briefly discuss possible issues they raise with the analysis.
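For the summary stats, pandas’ `describe()` covers the basics in one line. The tiny `sample_df` below is a hypothetical stand-in for your real analysis sample:

```python
import pandas as pd

# hypothetical stand-in for the real analysis sample
sample_df = pd.DataFrame({"LM_positive": [0.004, 0.006, 0.005],
                          "ret_t0_t2":   [0.010, -0.020, 0.030]})

# count, mean, std, min, quartiles, max for every numeric column
print(sample_df.describe().round(4))
```

A nicely formatted `describe()` table (possibly grouped by industry) goes a long way toward the “smell tests” above.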
Make a table with the correlation of each (10) sentiment measure against both (2) return measures. (So: a 10x2 table.)
The return measures are the firm’s returns around the 10-K release. We have 2 versions of this measure because we will try two different windows of time around the 10-K.
You will make 5 sentiment measures, and each has a positive and negative component. Thus: 10 sentiment measures.
See step 6 for more details about the sentiment and return measures.
Include a scatterplot (or similar) of each sentiment measure against both return measures.
Better: Combining this into a single figure
Better: Skip the correlation table and include the numerical correlations on the figure
Better: Regress (Don’t worry about printing these regressions out “pretty”, just try them if you want!)
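One way to sketch the correlation table and an annotated scatter grid, using random placeholder data and hypothetical column names; swap in your real sample and extend the lists to all 10 sentiment measures:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical placeholder data standing in for the analysis sample
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=["LM_positive", "LM_negative", "ret_v1", "ret_v2"])

sentiments = ["LM_positive", "LM_negative"]   # extend to all 10 measures
returns = ["ret_v1", "ret_v2"]

# 10x2 correlation table: sentiment rows, return columns
corr_table = df.corr().loc[sentiments, returns]
print(corr_table)

# grid of scatterplots, with the correlation annotated on each panel
fig, axes = plt.subplots(len(sentiments), len(returns), figsize=(8, 6))
for i, s in enumerate(sentiments):
    for j, r in enumerate(returns):
        ax = axes[i, j]
        ax.scatter(df[s], df[r], s=10)
        ax.set_title(f"{s} vs {r} (corr={corr_table.loc[s, r]:.2f})", fontsize=8)
fig.tight_layout()
```

Putting the correlations directly on the panels is one way to satisfy the “skip the correlation table” suggestion above.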
Four discussion topics:
On (1), (2), and (3) below: Focus just on the first return variable (which will examine returns around the 10-K publication)
On (4) below: Focus on how the “ML sentiment” variables (positive and negative) are related to the two different return measures.
Compare / contrast the relationship between the returns variable and the two “LM Sentiment” variables (positive and negative) with the relationship between the returns variable and the two “ML Sentiment” variables (positive and negative). Focus on the patterns of the signs of the relationships and the magnitudes.
If your comparison/contrast conflicts with Table 3 of the Garcia, Hu, and Rohrer paper (ML_JFE.pdf, in the repo), discuss and brainstorm possible reasons why you think the results may differ. If your patterns agree, discuss why you think they bothered to include so many more firms and years and additional controls in their study? (It was more work than we did on this midterm, so why do it to get to the same point?)
Discuss your 3 “contextual” sentiment measures. Do they have a relationship with returns that looks “different enough” from zero to investigate further? If so, make an economic argument for why sentiment in that context can be value relevant.
Is there a difference in the sign and magnitude? Speculate on why or why not.
9. Finalize and polish
Unlike previous assignments, how clean your code and report are will factor into your grade. Additionally, your README file should be nice!
Edit the readme file - it should be “publication ready”
Make the readme file informative and professional. Use headers to separate sections within.
Inform readers of the order in which files should be run. And warn users that this folder will download X files of X MB or GB.
Change the title of it (not the filename, the title at the top)
Describe the purpose of this repo (what this repo is analyzing) and the key inputs
List any necessary packages (might a reader need to `pip install` anything?) or steps a visitor will need to run to make it work on their computer
Give yourself a big round of applause at this point!
Your code is probably very flexible and powerful at this point. If you have the appetite + a larger list of EDGAR files to download + a large enough hard drive + and time, then you could download more than 100GB of 10-K filings and run textual analysis across 25+ years of data for all publicly traded firms.
Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of SEC filings. These four studies are variously provocative, great, and (in one case) mine: