Midterm aka Assignment 5 - Our first real data science project¶

Tips

Read all instructions before starting.
Start early. Work on the components of the project in parallel with related class discussions.
RECHECK THESE INSTRUCTIONS BEFORE SUBMITTING

Warning

Per the syllabus, this project is 10% if your overall grade, which is about 2x the weight of a typical assignment. It will probably take 2-3x the time of a typical assignment.

Really fun news: This is a end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those file, and then exploring the data.

BUT: It will take time! If you start the day before it is due, YOU WILL NOT FINISH IT. If you start two days before it is do, you might finish it, but it will not be done well.

Project Set Up¶

The nuts and bolts of the set up are:

Basic question: What “types” of firms were hurt more or less by covid?
Specific questions: What risk factors were associated with better/worse stock returns around the onset of covid?
- This is called a “cross-sectional event study”
- Expected minimum output: Scatterplot (x = some “risk factors”, y = returns around March 2020) with regression lines; formatted well
- Discussion of the economics linking the your risk factors to the returns is expected
- Pro output: Regression tables, heatmaps, better scatterplots
New data science technique: Textual analysis. We will estimate “risk factors” from the text of S&P 500 firm’s 10-K filings.
- More on this below
Data needed:
- Returns: Stock returns for S&P 500 firms can be pulled from Yahoo
- Risk factors for each firm will be created from their 10-K filings.

So your main challenge… is to create variables that measure risks for each firm.

Steps to complete the assignment¶

8. Create analysis_report.ipynb

Important

This is the main portion of your grade. It should be well formatted and clean in terms of text, code, and output. Don’t show extraneous print statements. Treat it like a Word document that happens to have some code (but just enough to do the analysis and show outputs). I’ve included more thoughts in the next dropdown.

Tip

First compute the returns for the 3/9-3/13 week. This will give you a dataset with one row per firm, and one number per row (the return for that week). Then merge this into the analysis dataset. Rinse and repeat if you try for the other return measures I describe below.

Load output/sp500_accting_plus_textrisks.csv
Explain and describe to readers your risk measurements
- How were they measured? (Mechanical description)
- Why did you choose them and what do you hope they capture? (Economic reasoning)
- What are their statistical properties? (Do you have values for most/all firms, they should have variation within them, are they correlated with any accounting measures)
Validation checks and discussion of the risk measurements This step (validating the measurement) is very important in production quality analysis!*
- Discuss briefly whether these measurements are likely “valid” in the sense they capture what you hope.
- Present some evidence they do capture your hopes. There are many ways to do this, and depend on the data you have and the risks you’re measuring.
  - You might print out a few examples of matches.
  - One option is to show sentences that will correctly be caught by the search, and correctly not caught. And how easy is it for your search to find a sentence that matches the search but shouldn’t.. (Hopefully: not too easy!) How easy is it for your search to miss a sentence that it should match…
  - One option is to output the list of firms that have high scores, or the industries that have high and low scores. Does the output make sense?
Describe the final sample for your tests, the set of observations where you have all the data you need.
- This includes summary stats,the number of firms, and other things EDA would turn up
- Are there any caveats about the sample and/or data? If so, mention them and briefly discuss possible issues they raise with the analysis.
Explore the correlation between your risk values and stock returns around key dates for the onset of covid.
- Stock returns are in the class’s data folder (“2019-2020-stock_rets cleaned.zip”)
- Get the firm’s returns for the week of Mar 9 - Mar 13, 2020 (the cumulative return for the week)
- Bonus: repeat the analysis but use the cumulative returns from Feb 23-Mar 23 as the “collapse period”
- Bonus: repeat the analysis but use Mar 24 as the “stimmy day” (stimulus was announced) … how does this change your results, and is it doing so in a predictable way?
- Bonus: repeat the analysis, but use firm accounting variables: Some of these probably indicate that a firm should be more resilient to the crisis!
- Present your findings visually and follow the lessons on effective visualization!
- You should write brief summaries of your findings.
Bonus: Explore the risk-return relationship, but use regressions so that you can control for firm traits and market returns. Does this change your results?
- Don’t worry about printing these regressions out “pretty”, just try them if you want!
Bonus: Use alpha as y, not returns, in your plots and/or regressions. This will likely change the results.
- Step 1: Separately, for each firm, estimate the beta and factor loadings of each firm’s returns in 2019. Save that data.
- Step 2: For firm i on date t, alpha(i,t) = ret(i,t) - beta(of firm i)*mkt_return(t) - SMB(of firm i)*SMB_port_ret(t) - HML(of firm i)*HML_port_ret(t)
  - SMB_port_ret(t) is the return on the SMB portfolio on date t, which you can get from the Fama-French datasets!
- Just present the findings if you do this. Don’t worry about explaining it - but it might make more sense in a few weeks!

Note

If you want to do any regressions, let me know. I’ll give you a few pointers.

Cheers!¶

Give yourself a big round of applause at this point!

Your code is probably very flexible and powerful at this point. If you have the appetite + a larger list of EDGAR files to download + a large enough hard drive + and time, then you could download more than 100GB of 10-K filings and run textual analysis across 20+ years of data for all publicly traded firms.

Seriously: You are in the ball park of pulling off any analysis you want that needs to harness the power of these filings. These four studies are variously provocative, great, and (in one case) mine:

LeDataSciFi-2022

Midterm aka Assignment 5 - Our first real data science project¶

Project Set Up¶

Steps to complete the assignment¶

Cheers!¶