Midterm aka Assignment 5 - Our first real data science project
Read all instructions before starting.
Start early. Work on the components of the project in parallel with related class discussions.
RECHECK THESE INSTRUCTIONS BEFORE SUBMITTING
Per the syllabus, this project is 10% of your overall grade, which is about 2x the weight of a typical assignment. It will probably take 2-3x the time of a typical assignment.
Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then exploring the data.
BUT: It will take time! If you start the day before it is due, YOU WILL NOT FINISH IT. If you start two days before it is due, you might finish it, but it will not be done well.
Project Set Up
The nuts and bolts of the set up are:
Basic question: What “types” of firms were hurt more or less by covid?
Specific questions: What risk factors were associated with better/worse stock returns around the onset of covid?
This is called a “cross-sectional event study”
Expected minimum output: Scatterplot (x = some “risk factors”, y = returns around March 2020) with regression lines; formatted well
Discussion of the economics linking your risk factors to the returns is expected
Pro output: Regression tables, heatmaps, better scatterplots
New data science technique: Textual analysis. We will estimate “risk factors” from the text of S&P 500 firms’ 10-K filings.
Returns: Stock returns for S&P 500 firms can be pulled from Yahoo
Risk factors for each firm will be created from their 10-K filings.
So your main challenge… is to create variables that measure risks for each firm.
Steps to complete the assignment
1. Start the assignment
As usual, click the link I provide in the discussion board.
But unlike before, the repo will be essentially empty. This is a start to finish project, so I’m letting you dictate the structure of the files.
Clone this to your computer.
2. Edit .gitignore
The download_text_files.ipynb file will create a large data structure in a subfolder called 10k_files/ with all the downloaded 10-K files. There will be several gigs of data in this folder. We don’t want to save/push all these files to GitHub!
So add this directory (10k_files/) to your gitignore before you proceed!
3. Create download_text_files.ipynb
Should create a subfolder for inputs (inputs/). You should probably save the S&P 500 list from the Wikipedia page there.
Should create another subfolder (text_files/) to hold all the text files you download. Because scraping can generate a large number of files, I usually put them in a dedicated input folder instead of the generic input folder we just made.
Try to download just one 10-K at first. When you can successfully do that, try a few more, one at a time. Check the folders on your computer - did they download like you expected? Are the files correct? If yes, continue. If not, you have an error to fix.
The website has really good info on “building a spider.” Highly recommend!
When you are confident the program works:
Delete your whole input/ subfolders on your computer so you have a “fresh start”
Rerun this from scratch.
Rerun the file AGAIN (but don’t delete the files you have). Does the file work after it has already been run, or has partially completed its work? Real spiders have to resume where they left off. You might need to make some conditional tweaks to the file to account for this. You don’t want the code to actually re-download the data, but the code should still run without error!
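The "resume where you left off" behavior described above can be sketched as a download helper that skips files already on disk. This is a minimal sketch, not the required design: the function names, the `.part` temporary file, and the User-Agent string are all assumptions made for illustration, and the default fetcher uses only the standard library.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_bytes(url: str) -> bytes:
    """Download one URL and return the raw bytes (hypothetical default fetcher)."""
    # The User-Agent value is a placeholder; check the data source's access rules.
    request = Request(url, headers={"User-Agent": "Student Name student@example.edu"})
    with urlopen(request, timeout=30) as response:
        return response.read()


def download_if_missing(url: str, dest: Path, fetch=fetch_bytes) -> bool:
    """Save `url` to `dest` unless `dest` already exists.

    Returns True if a download happened and False if the file was skipped.
    """
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already downloaded: reruns stay cheap and error-free
    dest.parent.mkdir(parents=True, exist_ok=True)  # create folders on demand
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(fetch(url))  # write to a temp name first...
    tmp.replace(dest)            # ...so an interrupted run never leaves a half file
    return True
```

Because the fetcher is a parameter, the skip logic can be tried on one file, then a few more, before pointing it at the full list. Rerunning the notebook then only downloads whatever is still missing.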
4. IMPORTANT: Create screenshot.png
It’s not polite to upload so much data to GitHub. It takes up space on the server, and your collaborators/peer reviewers will have to download them all when they clone your repo.
That’s why you edited the gitignore before doing all those downloads. If you did it correctly and check GitHub Desktop, you won’t see any of the text files!
Now that your download_text_files.ipynb is done running, push the repo. Even though your computer has a /text_files/* folder on it with many files and some hard drive space used, the repo in your browser doesn’t show this at all! Good job!
Create screenshot.png. The purpose of this is to upload proof of the files for your reviewers.
Right click your text_files folder so it shows the number of files inside of it, and take a screenshot showing this. Save it as screenshot.png inside your repo.
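The screenshot is what gets submitted. As a complementary check, the file count and total size of the download folder can also be computed in Python. This is a minimal sketch and the helper names are hypothetical:

```python
from pathlib import Path


def folder_summary(folder):
    """Return (number of files, total size in bytes) for everything under `folder`."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def human_size(num_bytes):
    """Format a byte count with binary units, capped at GB."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Hypothetical usage, assuming the downloads live in text_files/:
# n_files, n_bytes = folder_summary("text_files")
# print(f"{n_files} files, {human_size(n_bytes)}")
```

The same two numbers are handy when the README has to warn readers how much data the repo will download.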
5. Download near_regex.py from the community codebook into your repo
This will be used in the next step.
6. Create measure_risk.ipynb
The basic idea is to measure risks by counting the number of times a given risk topic is discussed in the 10-K.
This file (broad steps)
Loads the initial dataset of sample firms saved inside of
For each firm, load the corresponding 10-K and create (at least) 5 different risk measures, and save those new measurements to each of 5 new variables in that row.
Pick one risk type, and think of three ways to measure it. For example, there are many ways you could try to measure “antitrust risk”, so come up with 3 different ways to measure it from the text. You can try different terms, different combinations of terms, different limits on how close terms need to be, and more. Comparing these different ways might help you understand how your choices can improve or hurt the value of your measurement.
Pick a second risk type and create a single measure for it (you only need to do one measurement on this risk type, but you can do more)
Pick a third risk type and create a single measure for it (again, you only need to do one, but you can do more)
Bonus measures - interesting variables you could also measure:
The total length of the document (# of words)
The # of unique words (similar to total length)
The “tone” of the document
Downloads 2019 accounting data (2019 ccm_cleaned.dta) from the data folder in the class repo on S&P500 firms (possibly useful in analysis) and adds them to the dataset
Save the whole thing to
When you are confident the program works, delete your whole output/ folder on your computer so you have a “fresh start” and then rerun this from scratch.
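To make "counting how often a risk topic is discussed" concrete, here is a minimal sketch that uses only Python's built-in `re` module. It is a stand-in for illustration, not the near_regex.py helper from the community codebook, and every function name, search term, and word limit below is an assumption. It shows three variants of one risk measure plus the two length-based bonus measures.

```python
import re


def clean_text(raw: str) -> str:
    """Drop anything that looks like an HTML tag, collapse whitespace, lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip().lower()


def count_term(text: str, term: str) -> int:
    """Count whole-word occurrences of one term or phrase."""
    return len(re.findall(r"\b" + re.escape(term) + r"\b", text))


def count_near(text: str, first: str, second: str, max_words_between: int = 5) -> int:
    """Count places where `first` is followed by `second` within a few words.

    Both inputs are treated as word stems ('investigat' matches 'investigation').
    Only one ordering is checked, and matches may cross sentence boundaries.
    """
    gap = r"(?:\W+\w+){0," + str(max_words_between) + r"}?\W+"
    pattern = r"\b" + re.escape(first) + r"\w*" + gap + re.escape(second) + r"\w*"
    return len(re.findall(pattern, text))


def length_measures(text: str) -> dict:
    """Total number of words and number of unique words."""
    words = re.findall(r"[a-z']+", text)
    return {"n_words": len(words), "n_unique_words": len(set(words))}


# Toy document standing in for one firm's 10-K text.
doc = clean_text(
    "<p>Antitrust regulators may open an investigation. We face antitrust scrutiny.</p>"
)
measures = {
    "antitrust_v1": count_term(doc, "antitrust"),                         # bare term
    "antitrust_v2": count_near(doc, "antitrust", "investigat", 5),        # loose proximity
    "antitrust_v3": count_near(doc, "antitrust", "investigat", 2),        # tight proximity
    **length_measures(doc),
}
```

On this toy text the three antitrust variants disagree with each other, which is the kind of comparison the instructions suggest for seeing how search choices change a measurement.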
7. Create explore_ugly.ipynb to see if your risk factors were associated with higher or lower returns around covid.
Try to figure out how to do the analysis below, downloading and integrating return measures. Play around in this file. No one will look at it. It’s a safe space.
If you find issues with your risk measurements or come up with improvements you think you should make, go back and work on the previous file more.
You can and should use this file to figure out what you want to include in the final report and how you want it to appear.
8. Create analysis_report.ipynb
This is the main portion of your grade. It should be well formatted and clean in terms of text, code, and output. Don’t show extraneous print statements. Treat it like a Word document that happens to have some code (but just enough to do the analysis and show outputs). I’ve included more thoughts in the next dropdown.
First compute the returns for the 3/9-3/13 week. This will give you a dataset with one row per firm, and one number per row (the return for that week). Then merge this into the analysis dataset. Rinse and repeat if you try for the other return measures I describe below.
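The compute-then-merge step can be sketched with pandas as below. This is a sketch under stated assumptions: the column names (`ticker`, `date`, `ret`, `risk`) and the toy numbers are made up for illustration and will differ from the class dataset, and daily returns are assumed to be simple returns that compound within the window.

```python
import pandas as pd


def cumulative_return(rets, start, end, firm_col="ticker", date_col="date", ret_col="ret"):
    """One row per firm: compounded return over [start, end], inclusive."""
    dates = pd.to_datetime(rets[date_col])
    in_window = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    window = rets.loc[in_window]
    # Compound daily returns: prod(1 + r) - 1, separately for each firm.
    gross = (1 + window[ret_col]).groupby(window[firm_col]).prod()
    return (gross - 1).rename("ret_week").reset_index()


# Toy daily returns (hypothetical values); the 2020-03-06 row falls outside the window.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": ["2020-03-06", "2020-03-09", "2020-03-10", "2020-03-09", "2020-03-13"],
    "ret": [0.50, -0.10, 0.10, -0.05, -0.05],
})
week = cumulative_return(toy, "2020-03-09", "2020-03-13")

# Merge into a firm-level dataset; validate= raises if either side has duplicate firms.
firms = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"], "risk": [3, 0, 1]})
merged = firms.merge(week, on="ticker", how="left", validate="one_to_one")
```

A left merge keeps firms with no return data (here `CCC`) as missing values, which makes it easy to see how many observations drop out of the final sample. Other windows only need different `start` and `end` arguments.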
Explain and describe to readers your risk measurements
How were they measured? (Mechanical description)
Why did you choose them and what do you hope they capture? (Economic reasoning)
What are their statistical properties? (Do you have values for most/all firms? Do they vary across firms? Are they correlated with any accounting measures?)
Validation checks and discussion of the risk measurements. This step (validating the measurement) is very important in production-quality analysis!
Discuss briefly whether these measurements are likely “valid” in the sense they capture what you hope.
Present some evidence they do capture your hopes. There are many ways to do this, and they depend on the data you have and the risks you’re measuring.
You might print out a few examples of matches.
One option is to show sentences that will correctly be caught by the search, and sentences that will correctly not be caught. How easy is it for your search to match a sentence that it shouldn’t? (Hopefully: not too easy!) How easy is it for your search to miss a sentence that it should match?
One option is to output the list of firms that have high scores, or the industries that have high and low scores. Does the output make sense?
Describe the final sample for your tests, the set of observations where you have all the data you need.
This includes summary stats, the number of firms, and other things EDA would turn up
Are there any caveats about the sample and/or data? If so, mention them and briefly discuss possible issues they raise with the analysis.
Explore the correlation between your risk values and stock returns around key dates for the onset of covid.
Stock returns are in the class’s data folder (“2019-2020-stock_rets cleaned.zip”)
Get the firm’s returns for the week of Mar 9 - Mar 13, 2020 (the cumulative return for the week)
Bonus: repeat the analysis but use the cumulative returns from Feb 23-Mar 23 as the “collapse period”
Bonus: repeat the analysis but use Mar 24 as the “stimmy day” (stimulus was announced) … how does this change your results, and is it doing so in a predictable way?
Bonus: repeat the analysis, but use firm accounting variables: Some of these probably indicate that a firm should be more resilient to the crisis!
Present your findings visually and follow the lessons on effective visualization!
You should write brief summaries of your findings.
Bonus: Explore the risk-return relationship, but use regressions so that you can control for firm traits and market returns. Does this change your results?
Don’t worry about printing these regressions out “pretty”, just try them if you want!
Bonus: Use alpha as y, not returns, in your plots and/or regressions. This will likely change the results.
Step 1: Separately, for each firm, estimate the beta and factor loadings of each firm’s returns in 2019. Save that data.
Step 2: For firm i on date t, alpha(i,t) = ret(i,t) - beta(of firm i)*mkt_return(t) - SMB(of firm i)*SMB_port_ret(t) - HML(of firm i)*HML_port_ret(t)
SMB_port_ret(t) is the return on the SMB portfolio on date t, which you can get from the Fama-French datasets!
Just present the findings if you do this. Don’t worry about explaining it - but it might make more sense in a few weeks!
If you want to do any regressions, let me know. I’ll give you a few pointers.
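For the alpha bonus, the two steps can be sketched with NumPy as below. This is a sketch under stated assumptions: the loadings are estimated by plain OLS with an intercept, the factor columns are ordered market, SMB, HML, and all the numbers are synthetic. The alpha formula itself follows Step 2 as written.

```python
import numpy as np


def estimate_loadings(firm_ret, factors):
    """Step 1: OLS of one firm's returns on an intercept plus the factor returns.

    `factors` has shape (T, 3) with columns [mkt, SMB, HML]; returns the 3 slopes.
    """
    X = np.column_stack([np.ones(len(firm_ret)), factors])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(firm_ret), rcond=None)
    return coefs[1:]  # drop the intercept, keep beta and the SMB and HML loadings


def alpha(firm_ret, factors, loadings):
    """Step 2: alpha(i,t) = ret(i,t) - sum over factors of loading(i) * factor_ret(t)."""
    return np.asarray(firm_ret) - np.asarray(factors) @ loadings


# Synthetic "2019" data with known loadings, so the estimate can be checked.
rng = np.random.default_rng(0)
factors_2019 = rng.normal(size=(250, 3)) * 0.01
true_loadings = np.array([1.2, 0.3, -0.5])
ret_2019 = 0.0002 + factors_2019 @ true_loadings

loadings = estimate_loadings(ret_2019, factors_2019)

# One hypothetical event day: firm return and that day's factor returns.
factors_event = np.array([[-0.05, 0.01, -0.02]])
ret_event = np.array([-0.10])
alpha_event = alpha(ret_event, factors_event, loadings)
```

With real data, Step 1 runs once per firm on its 2019 returns, and Step 2 is applied to each date in the event window using the Fama-French factor returns for those dates.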
9. Finalize and polish
Unlike previous assignments, how clean your code and report are will factor into your grade. Additionally, your README file should be nice!
Edit the readme file - it should be “publication ready”
Make the readme file informative and professional.
Inform readers of the order in which files should be run. And warn users that this folder will download X files of X MB or GB.
Change the title of it (not the filename, the title at the top)
Describe the purpose of this repo (what this repo is analyzing) and the key inputs
List any necessary packages (might a reader need to pip install anything?) or steps a visitor will need to run to make it work on their computer
The analysis_report file should be written and formatted like an executive report.
There is no “page expectation” or “page limit”. Aim to provide sufficient analysis and explanation, but in a concise and clear way. Bullet points are fine in places, but you should have a few places with paragraph-style discussion, especially where you explain why you chose the specific risks, the way you defined them, and what issues you think they have (which points the way forward on “extensions”).
In other words: You will be graded on how much this looks like a professional report. Just “dumping” endless printouts is not as valuable as well-tailored tables and figures. High quality and concise reporting is an A1 emphasis. Here, pretty, smart, and effective tables and visualizations will receive higher grades.
The teaching team will not read your measure_risk file other than to comment on code style. So:
Any details in that file on search terms and descriptive information on your text-based measures should be copied into your analysis file (with appropriate adjustments to suit how a report would be presented).
Make the measurement code easy to read, because we will grade the code style.
Give yourself a big round of applause at this point!
Your code is probably very flexible and powerful at this point. If you have the appetite, a larger list of EDGAR files to download, a large enough hard drive, and time, then you could download more than 100GB of 10-K filings and run textual analysis across 20+ years of data for all publicly traded firms.
Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of these filings. These four studies are variously provocative, great, and (in one case) mine: