Assignment 5 - Our first real data science project

PLEASE RECHECK THESE INSTRUCTIONS BEFORE SUBMITTING

Warning/time guidance: This assignment is the first project where you will have to get data, clean it yourself, and parse the data to create variables to analyze. These steps take time!

Project Set Up

The nuts and bolts of the set up are:

Basic question: What “types” of firms were hurt more or less by covid?
Specific question: What risk factors were associated with better/worse stock returns around the onset of covid?
- This is a “cross-sectional event study”
- Expected minimum output: Scatterplot (x = some “risk factors”, y = returns around March 2020) with regression lines; formatted well
- Pro output: Regression tables, heatmaps, better scatterplots
New data science technique: Textual analysis. We will estimate “risk factors” from the text of S&P 500 firms’ Wikipedia pages.
- More on this below
- Adv method: Estimate them from the text of S&P 500 firms’ 10-K filings.
Data needed:
- Returns: Stock returns for S&P 500 firms can be pulled from Yahoo
- Risk factors for each firm will be created from their Wikipedia pages OR their 10-K filings. Using the 10-K filings is significantly more interesting and more likely to produce usable findings, but it is also more difficult and, as such, earns a grading premium.
So your main challenge… is to create variables that measure risks for each firm.
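
One simple way to turn text into a risk variable is to count how often risk-related terms appear, scaled by the length of the document. The sketch below is only an illustration: the function name `risk_score`, the term list, and the sample text are all hypothetical and do not come from the assignment. Choosing the terms is the real research decision.

```python
import re

def risk_score(text, terms):
    """Share of words in `text` that appear in `terms` (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in terms)
    return hits / len(words)

# Hypothetical term list for a "supply chain" risk; pick your own terms.
supply_chain_terms = {"supplier", "suppliers", "shipping", "logistics"}

# Made-up snippet standing in for a firm's page or filing text.
sample_text = "We rely on a single supplier. Shipping delays could hurt sales."
score = risk_score(sample_text, supply_chain_terms)
```

Scaling by word count keeps long documents from scoring high just because they are long.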

Steps to complete the assignment

6. Create analysis.ipynb to see if your risk factors were associated with higher or lower returns around covid.

Tip

First compute the returns for the 3/9-3/13 week. This will give you a dataset with one row per firm, and one number per row (the return for that week). Then merge this into the analysis dataset. Rinse and repeat if you try for the other measures.
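
The tip above can be sketched in pandas as follows. This is a minimal sketch on toy data: the column names (`ticker`, `date`, `ret`, `risk`) and the long, one-row-per-firm-day layout are assumptions, so adjust them to match the actual returns file. Daily returns are compounded within the week rather than summed.

```python
import pandas as pd

# Toy daily returns in long format; column names are assumptions.
daily = pd.DataFrame({
    "ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "date": pd.to_datetime(["2020-03-06", "2020-03-09", "2020-03-13"] * 2),
    "ret": [0.01, -0.10, 0.05, 0.00, -0.02, -0.03],
})

# Keep only the 3/9-3/13 week (both endpoints included).
in_week = daily["date"].between("2020-03-09", "2020-03-13")

# Compound daily returns within the week: prod(1 + r) - 1, one row per firm.
week_ret = (
    daily.loc[in_week]
    .assign(gross=lambda d: 1 + d["ret"])
    .groupby("ticker")["gross"]
    .prod()
    .sub(1)
    .rename("ret_mar9_13")
    .reset_index()
)

# Toy stand-in for the analysis dataset (one row per firm).
firms = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"], "risk": [0.2, 0.5, 0.1]})

# Left merge keeps every firm; validate guards against duplicate tickers.
merged = firms.merge(week_ret, on="ticker", how="left", validate="one_to_one")
```

Firms with no returns in the window (here `CCC`) end up with a missing value instead of being dropped, which makes coverage problems easy to spot.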

Load output/sp500_accting_plus_textrisks.csv
Explain and describe to readers your risk measurements
- How were they measured? (Mechanical description)
- Why did you choose them and what do you hope they capture? (Economic reasoning)
- What are their statistical properties? (Do you have values for most/all firms? They should vary across firms. Are they correlated with any accounting measures?)
Validation checks and discussion of the risk measurements
- Discuss briefly whether these measurements are likely “valid” in the sense they capture what you hope.
- Present some evidence that they do capture what you hope. (You probably don’t have enough data here to rigorously validate that your measure is associated with the “actual” risk. But if you do, show us!) This step (validating the measurement) is very important in production-quality analysis!
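
The statistical-property questions above (coverage, variation, correlation with accounting measures) can be checked with a few pandas calls. The sketch below uses a toy frame in place of output/sp500_accting_plus_textrisks.csv, and the column names `supply_risk` and `leverage` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the analysis dataset; column names are hypothetical.
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "supply_risk": [0.0, 2.0, None, 5.0],
    "leverage": [0.1, 0.4, 0.2, 0.6],
})

# Coverage: share of firms with a non-missing risk value.
coverage = df["supply_risk"].notna().mean()

# Variation: count, mean, std, and quantiles of the measure.
summary = df["supply_risk"].describe()

# Correlation with an accounting measure (rows with missing values are ignored).
corr = df[["supply_risk", "leverage"]].corr()
```

A measure that is missing for many firms, or that is nearly constant across firms, cannot explain differences in returns no matter how well motivated it is.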
Explore the correlation between your risk values and stock returns around key dates for the onset of covid.
- Stock returns are in the class’s data folder
- Get each firm’s returns for the week of Mar 9 - Mar 13, 2020
- Bonus: repeat the analysis but use Feb 23-Mar 23 as the “collapse period”
- Bonus: repeat the analysis but use Mar 24 as the “stimmy day” (stimulus was announced) … how does this change your results, and is it doing so in a predictable way?
- Bonus: repeat the analysis, but use firm accounting variables: Some of these probably indicate that a firm should be more resilient to the crisis!
- Present your findings visually and follow the lessons on effective visualization!
- You should write brief summaries of your findings.
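
Before plotting, it can help to compute the numbers a scatterplot with a regression line would show. This is a minimal sketch on made-up data, and the column names `risk` and `ret_mar9_13` are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up firm-level data: one risk value and one weekly return per firm.
df = pd.DataFrame({
    "risk": [0.0, 1.0, 2.0, 3.0, 4.0],
    "ret_mar9_13": [-0.05, -0.08, -0.07, -0.12, -0.13],
})

# Drop firms missing either value so both statistics use the same sample.
clean = df.dropna(subset=["risk", "ret_mar9_13"])

# Pearson correlation between the risk measure and returns.
corr = clean["risk"].corr(clean["ret_mar9_13"])

# Slope and intercept of the best-fit line (degree-1 polynomial fit).
slope, intercept = np.polyfit(clean["risk"], clean["ret_mar9_13"], deg=1)
```

The slope and intercept describe the line to draw over the scatterplot. If seaborn is installed, `regplot` draws both the points and the fitted line in one call.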
Bonus: Explore the risk-return relationship, but use regressions so that you can control for firm traits and market returns. Does this change your results?
- Don’t worry about printing these out “pretty”, just try them if you want!
Bonus: Use alpha as y, not returns, in your plots and regressions. This will likely change the results again.
- Step 1: Separately, for each firm, estimate the beta and factor loadings of each firm’s returns in 2019. Save that data.
- Step 2: For firm i on date t, alpha(i,t) = ret(i,t) - beta(of firm i)*mkt_return(t) - SMB(of firm i)*SMB_port_ret(t) - HML(of firm i)*HML_port_ret(t)
  - SMB_port_ret(t) is the return on the SMB portfolio on date t, which you can get from the Fama-French datasets!
- Just present the findings if you do this. Don’t worry about explaining it - but it might make sense in a few weeks!
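
The two alpha steps above can be sketched for a single firm as follows. This is an illustration under stated assumptions: the data are simulated, the column names (`mkt`, `smb`, `hml`) and the helper `estimate_loadings` are hypothetical, and the regression uses raw returns exactly as in the formula in Step 2. For the full sample, the same function would be applied firm by firm.

```python
import numpy as np
import pandas as pd

# Simulated 2019 daily factor returns; column names are assumptions.
rng = np.random.default_rng(0)
factors_2019 = pd.DataFrame(
    rng.normal(scale=0.01, size=(250, 3)), columns=["mkt", "smb", "hml"]
)

# Simulated firm returns with known loadings (1.2, 0.5, -0.3) and no noise.
ret_2019 = (
    1.2 * factors_2019["mkt"]
    + 0.5 * factors_2019["smb"]
    - 0.3 * factors_2019["hml"]
)

def estimate_loadings(ret, factors):
    """Step 1: OLS of returns on a constant and the three factors."""
    X = np.column_stack(
        [np.ones(len(factors)), factors[["mkt", "smb", "hml"]].to_numpy()]
    )
    coefs, *_ = np.linalg.lstsq(X, ret.to_numpy(), rcond=None)
    # Drop the constant; keep the three factor loadings.
    return pd.Series(coefs[1:], index=["beta", "smb", "hml"])

loadings = estimate_loadings(ret_2019, factors_2019)

# Step 2: alpha on one made-up event date = return minus factor-implied return.
event = {"ret": -0.08, "mkt": -0.05, "smb": 0.01, "hml": -0.02}
alpha = (
    event["ret"]
    - loadings["beta"] * event["mkt"]
    - loadings["smb"] * event["smb"]
    - loadings["hml"] * event["hml"]
)
```

Because the simulated returns contain no noise, the estimated loadings recover the values used to build them, which is a quick check that the regression is set up correctly.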

Note

If you want to do any regressions, let me know. I’ll give you a few pointers.

Cheers!

Give yourself a big round of applause at this point!

Your code is probably very flexible and powerful at this point. If you have the appetite + a larger list of EDGAR files to download + a large enough hard drive + time, then you could download more than 100GB of 10-K filings and run textual analysis across 20+ years of data for all publicly traded firms.

Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of these filings. These four studies are variously provocative, great, and (in one case) mine:

LeDataSciFi-2021
