Midterm aka Assignment 5 - Our first real data science project¶

Tips

Read all instructions before starting.
Start early. Work on the components of the project in parallel with related class discussions.
RECHECK THESE INSTRUCTIONS BEFORE SUBMITTING
Look at my class notes repo. I worked on part of this project there, especially step 3 and parts of step 6.

Warning

Per the syllabus, this project is 10% of your overall grade, which is 3x the weight of a typical assignment. It will take 2-3x the time of a typical assignment.

Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then exploring the data.

BUT: It will take time! If you start the day before it is due, YOU WILL NOT FINISH IT. If you start two days before it is due, you might finish it, but it will not be done well.

Project Set Up¶

First, let’s start with a high level overview. (More details will follow.)

Basic question: Do 10-K filings contain value-relevant information in the sentiment of the text?
- Sentiment: Does the word have a positive or negative tone (e.g. “confident” vs “against”)
Specific questions: Is the positive or negative sentiment in a 10-K associated with better/worse stock returns?
- This is called a “cross-sectional event study”
- Expected minimum output: Scatterplot (x = document’s sentiment score, y = stock returns around the 10-K’s release) with regression lines; formatted well
- Advanced output: Regression tables, heatmaps, better scatterplots/similar
New data science technique: Textual analysis and sentiment analysis.
Data needed:
- Returns: Stock returns for firms around their 10-K dates
- Sentiment measures will be created from the text in their 10-K filings. (We will need to download the 10-K files and get the text within them!)

Notice: The question dictates what the analysis design should be, which in turn dictates the data requirements. Question-led research > data-driven research.

Project deliverables¶

Your repo and the code within should put the Golden Rules from Chapters 2, 3.3, and 4 into practice. You should review those. You will be graded on them in the following ways (not an exhaustive list):

The repo’s structure (files, organization, names, README)
Code style/clarity in the download and build_sample files (The teaching team will not read your build_sample file other than to grade/comment on code style and, potentially, errors. Answers to questions here will be ignored - put them in the report file!)
The main portion of your grade is the report file. Your answers to all questions, unless explicitly directed otherwise, should be in the report file. The report file should be written and formatted like an executive report.
- It is not a technical document and not a bunch of hasty code. It should be well formatted and clean in terms of text, code, and output. Don’t show extraneous print statements you wrote while ABCD’ing and checking your code. Treat it like a Word document that happens to have some code (but just enough to do the analysis and show outputs).
- There is no “page expectation” or “page limit”. Aim to provide sufficient analysis and explanation but in a concise and clear way. Bullet points are fine in places, but you should have a few places with paragraph-style discussion.
- High quality and concise (but sufficient) reporting is an A1 emphasis. Here, pretty, smart, and effective tables and visualizations will receive higher grades.

Steps to complete the assignment¶

6. Create build_sample.ipynb

The idea is that this code builds everything we need for the report. The report should be minimal (just the necessary code to produce the output tables and figures you will discuss) so that readers can focus on making conclusions. So, this file should create a dataset that has

one observation per firm
variables needed for the analysis

Important

I know what should be in this file because I worked backwards from what I know I want in the report. If you’re reading this section and are confused, go look at the report section below.

Variables to create:

3 versions of a “buy and hold” around the 10-K date (“date t”)
- Version 1: The return on the day of the filing. (This is easiest, and points will be reserved for students that figure out versions 2 and 3.)
- Version 2: Measure from day t to day t+2 (inclusive). Days are counted in business days, so ignore weekends and holidays!
- Version 3: Measure from the day t+3 to day t+10 (inclusive)
- Calculate the firm’s buy and hold return over each time span, a la Assignment 2
- Stock returns for 2022 can be found here. Open this the same way we open ccm.
- Merge the buy and hold stock returns to our dataset using ticker. (This is not the best option, but it’s simple.)
10 sentiment variables
- A positive sentiment variable is the fraction of words in a 10-K that are “positive” words
- A negative sentiment variable is the fraction of words in a 10-K that are “negative” words
- How do we define which words are positive and which are negative? The input folder contains two different sets of words.
  - The “LM” sentiment dictionary comes from two researchers named Loughran and McDonald
  - The “ML” sentiment lists come from a machine learning approach used in the Journal of Financial Economics this year
  - Background on both measures: You should read the abstract and the intro of the papers in the literature folder. Because “ML_JFE.pdf” is the more recent publication, it contrasts the two dictionaries
- The first 4 of the 10 variables:
  - “LM Positive” and “LM Negative”
  - “ML Positive” and “ML Negative”
- The last 6 of the 10 variables: “Contextual” sentiment. Basically: What is the (positive and negative) sentiment of the text in a 10-K around discussions of a particular topic.
  - Pick three topics. Each of those will get a positive and negative sentiment score - use the ML sentiment lists (not the LM lists).
  - A must visit: This page explains this step.
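The "fraction of words" definition above can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function name `sentiment_share`, the simple `[a-z]+` tokenizer, and the toy word lists and text are all assumptions made for the example. The real LM and ML lists live in the input folder and are much longer.

```python
import re

def sentiment_share(text, word_list):
    """Fraction of the words in `text` that appear in `word_list`.

    Assumption: a "word" is any run of letters after lowercasing.
    Your cleaning step may define words differently.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on an empty document
    vocab = {w.lower() for w in word_list}
    hits = sum(1 for w in words if w in vocab)
    return hits / len(words)

# Toy inputs (hypothetical: not the real dictionaries, not a real 10-K)
toy_positive = ["confident", "strong"]
toy_negative = ["against", "loss"]
toy_text = ("We are confident that demand is strong. "
            "A claim against us may cause a loss.")

pos = sentiment_share(toy_text, toy_positive)  # 2 hits out of 15 words
neg = sentiment_share(toy_text, toy_negative)  # 2 hits out of 15 words
```

Whatever tokenizer you choose, use the same one for the numerator and the denominator, and describe it in the data section of the report.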

This file (broad steps):

Creates an output/ folder
Loads the initial dataset of sample firms saved inside of inputs/.
For each firm,
- load the corresponding 10-K. Clean the text.
- Create the sentiment measurements, and save those new measurements to the correct row and column in the dataframe.
- Bonus: Save the total length of the document (# of words)
- Bonus: Save the # of unique words (similar to total length)
Calculate the return measurements. Save those to the correct row and column in the dataframe
Optional: Downloads 2021 accounting data (2021 ccm_cleaned.dta) from the data repo (possibly useful in analysis) and adds them to the dataset
Save the whole thing to output/analysis_sample.csv
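For the return measurements, one possible way to handle the "business days" requirement is to count positions in the firm's own series of trading-day returns, so weekends and holidays never appear. The sketch below assumes a pandas Series of daily returns for a single firm, indexed by sorted trading dates. The function name `buy_and_hold`, the toy data, and the choice to treat day t as the first trading day on or after the filing date are assumptions, not requirements from the instructions.

```python
import numpy as np
import pandas as pd

def buy_and_hold(firm_rets, filing_date, start, end):
    """Compound daily returns from trading day t+start to t+end (inclusive).

    firm_rets: daily returns for ONE firm, indexed by sorted trading dates.
    Assumption: day t is the first trading day on or after filing_date.
    Returns NaN when the window runs past the available data.
    """
    t = firm_rets.index.searchsorted(pd.Timestamp(filing_date))
    lo, hi = t + start, t + end
    if hi >= len(firm_rets):
        return np.nan
    window = firm_rets.iloc[lo:hi + 1]
    return (1 + window).prod() - 1

# Toy data (hypothetical): 15 business days, each with a 1% return
dates = pd.bdate_range("2022-03-01", periods=15)
toy_rets = pd.Series(0.01, index=dates)

filing = "2022-03-04"
ret_v1 = buy_and_hold(toy_rets, filing, 0, 0)    # day t only
ret_v2 = buy_and_hold(toy_rets, filing, 0, 2)    # t to t+2
ret_v3 = buy_and_hold(toy_rets, filing, 3, 10)   # t+3 to t+10
```

In the real dataset you would apply something like this per ticker and then merge the three results onto the firm-level dataframe. Note that `bdate_range` skips weekends but not market holidays, which is fine for toy data; the actual returns file only has rows for days the market was open.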

Note

There are more details about the 6 “contextual sentiment” variables here.
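To build intuition for what a "contextual" measure captures, here is a simplified stand-in that counts sentiment words appearing within a fixed number of words of a topic word. It does not reproduce the `near_regex` function referenced in the report section; the function name `contextual_share`, the `distance` parameter as used here, and the toy text and word lists are all hypothetical.

```python
import re

def contextual_share(text, topic_words, sentiment_words, distance=5):
    """Fraction of all words that are sentiment words located within
    `distance` words of any topic word.

    Assumption: words are runs of letters after lowercasing, and
    matches are whole words only (no partial matching).
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    topic = {w.lower() for w in topic_words}
    sentiment = {w.lower() for w in sentiment_words}
    topic_positions = [i for i, w in enumerate(words) if w in topic]
    hits = 0
    for i, w in enumerate(words):
        near_topic = any(abs(i - p) <= distance for p in topic_positions)
        if w in sentiment and near_topic:
            hits += 1
    return hits / len(words)

# Toy example (hypothetical text and word lists)
toy_text = ("Inflation pressures hurt margins this year. "
            "Separately, our new product line had strong sales.")
neg_near = contextual_share(toy_text, ["inflation"], ["hurt"], distance=3)
pos_near = contextual_share(toy_text, ["inflation"], ["strong"], distance=3)
```

Here "hurt" sits two words from "inflation" and counts, while "strong" is twelve words away and does not. The choice of `distance` drives the measure, which is why the report asks you to justify the value you pick.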

Important

When you are confident the program works, delete your whole output/ folder on your computer so you have a “fresh start” and then rerun this from scratch.

8. Create report.ipynb

Important

This is the main portion of your grade!!!

The following outputs and discussion prompts are the focus of grading:

Summary section (brief, max 2 paragraphs)
- Summarize your question, what you did, and your findings. You can model this on the abstracts in the literature folder.
Data section
- What’s the sample?
- How are the return variables built and modified? (Mechanical description.) Aim for rigor and do not skip steps. You can include text and formulas here.
- How are the sentiment variables built and modified? (Mechanical description.) Aim for rigor and do not skip steps. You can include text and formulas here.
- These datapoints about the sentiment variables:
  - How many words are in the LM positive dictionary?
  - How many words are in the LM negative dictionary?
  - How many words are in the ML positive dictionary?
  - How many words are in the ML negative dictionary?
  - A description of how you set up the near_regex function (partial = true or false, distance = what) and why you chose the values you did.
- Why did you choose the three topics you did for the “contextual sentiment” measures?
- Show and discuss summary stats of your final analysis sample
- Do your “contextual sentiment” measures pass some basic smell tests?
  - Smell tests: Is something fishy? (What you look for depends on the setting.)
  - Do you have variation in the measures (i.e., a variable is not all the same value)?
  - Are the industries you expect talking about your subject positively or negatively?
- This should be sufficient for a reader to understand the dataset and have context for interpreting results
- Are there any caveats about the sample and/or data? If so, mention them and briefly discuss possible issues they raise with the analysis.
Results
- Make a table with the correlation of each (10) sentiment measure against each (3) return measure. (So: a 10x3 table.)
  - The return measures are the firm’s returns around the 10-K release. We have 3 versions of this to examine different windows of time around the 10-K to see the speed at which information is priced into the stock.
  - You will make 5 sentiment measures, and each has a positive and negative component. Thus: 10 sentiment measures.
  - See step 6 for more details about the sentiment and return measures.
- Include a scatterplot (or binscatter, or similar) of each sentiment measure against each return measure.
  - Better: Combining this into a single figure
  - Better: Skip the correlation table and include the numerical correlations on the figure
  - Better: Regress (Don’t worry about printing these regressions out “pretty”, just try them if you want!). You can include accounting variables as controls here.
- Four discussion topics:
  - On (1), (2), and (3) below: Focus just on the first return variable (which will examine returns around the 10-K publication)
  - On (4) below: Focus on how the “ML sentiment” variables (positive and negative) are related to the different return measures.
  1. Compare / contrast the relationship between the returns variable and the two “LM Sentiment” variables (positive and negative) with the relationship between the returns variable and the two “ML Sentiment” variables (positive and negative). Focus on the patterns of the signs of the relationships and the magnitudes.
  2. If your comparison/contrast conflicts with Table 3 of the Garcia, Hu, and Rohrer paper (ML_JFE.pdf, in the repo), discuss and brainstorm possible reasons why you think the results may differ. If your patterns agree, discuss why you think they bothered to include so many more firms and years and additional controls in their study? (It was more work than we did on this midterm, so why do it to get to the same point?)
  3. Discuss your 3 “contextual” sentiment measures. Do they have a relationship with returns that looks “different enough” from zero to investigate further? If so, make an economic argument for why sentiment in that context can be value relevant.
  4. Is there a difference in the sign and magnitude? Speculate on why or why not.
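The correlation table in the Results section could be produced along these lines. This is only a sketch: the column names are hypothetical placeholders for whatever you named your variables in `output/analysis_sample.csv`, and the dataframe is filled with random numbers purely so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; substitute the names from your own sample
sentiment_cols = [
    "lm_pos", "lm_neg", "ml_pos", "ml_neg",
    "topic1_pos", "topic1_neg", "topic2_pos", "topic2_neg",
    "topic3_pos", "topic3_neg",
]
return_cols = ["ret_t", "ret_t_t2", "ret_t3_t10"]

# Placeholder data (random noise, NOT real results) so the sketch runs
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(50, len(sentiment_cols) + len(return_cols))),
    columns=sentiment_cols + return_cols,
)

# Full correlation matrix, then keep sentiment rows and return columns
corr_table = (
    df[sentiment_cols + return_cols]
    .corr()
    .loc[sentiment_cols, return_cols]
)
```

In the report you would load your saved sample instead of generating data, and you might round or style `corr_table` so it reads cleanly.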

Cheers!¶

Give yourself a big round of applause at this point!

Your code is probably very flexible and powerful at this point. If you have the appetite + a larger list of EDGAR files to download + a large enough hard drive + and time, then you could download more than 100GB of 10-K filings and run textual analysis across 25+ years of data for all publicly traded firms.

Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of SEC filings. These four studies are variously provocative, great, and (in one case) mine:

LeDataSciFi-2024
