Assignment 5 - Our first real data science project
PLEASE RECHECK THESE INSTRUCTIONS BEFORE SUBMITTING
Warning/time guidance: This assignment is the first project where you will have to get data, clean it yourself, and parse the data to create variables to analyze. These steps take time!
Project Set Up
The nuts and bolts of the set up are:
Basic question: What “types” of firms were hurt more or less by covid?
Specific questions: What risk factors were associated with better/worse stock returns around the onset of covid?
This is a “cross-sectional event study”
Expected minimum output: Scatterplot (x = some “risk factors”, y = returns around March 2020) with regression lines; formatted well
Pro output: Regression tables, heatmaps, better scatterplots
New data science technique: Textual analysis. We will estimate “risk factors” from the text of S&P 500 firms’ wiki pages.
Adv method: Estimate them from the text of S&P 500 firms’ 10-K filings.
Returns: Stock returns for S&P 500 firms can be pulled from Yahoo
Risk factors for each firm will be created from their Wikipedia pages OR their 10-K filings. Using the 10-K filings is significantly more interesting and more likely to produce usable findings, but it is also more difficult, so it will earn a grading premium.
So your main challenge… is to create variables that measure risks for each firm.
Steps to complete the assignment
1. Start the assignment
As usual, click the link in coursesite.
But unlike before, the repo will be essentially empty. This is a start to finish project, so I’m letting you dictate the structure.
Clone this to your computer.
2. Create download_wiki_pages.ipynb (or download_10Ks.ipynb) and edit .gitignore
Should create a subfolder for inputs (inputs/). You should probably save the S&P500 list from the wikipedia page. It would be especially useful to add columns to that with the URL to the wikipedia page for the firm, and one more column with the URL for the SEC filings for the firm.
Should create another subfolder (text_files/) to hold all the text files you download. Because scraping can generate large amounts of files, I usually put it in a dedicated input folder instead of the generic input folder we just made.
This file will create a large data structure with all the downloaded files. After you run this and download the files, but BEFORE you push the folder to the cloud, see the next dropdown.
Optional, if you want to go after the 10-K route: Manually try to navigate from the wiki page to the last 10-K filed before March 2020. (We don’t want to use 10-Ks AFTER the event!) Then try to get requests_html to replicate those steps until it can find that file for all firms.
The website has really good info on building a spider. Highly recommend!
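Once your spider can list a firm’s filings, picking “the last 10-K filed before March 2020” is just a date filter. Here is a minimal sketch of that one step. It assumes you have already collected (filing date, URL) pairs for a firm; the function name, the cutoff constant, and the example URLs are all hypothetical and only there for illustration.

```python
from datetime import date

# Assumption: "before March 2020" means filed strictly before this date.
EVENT_START = date(2020, 3, 1)


def last_filing_before(filings, cutoff=EVENT_START):
    """Return the most recent (filing_date, url) pair filed before cutoff, or None."""
    eligible = [f for f in filings if f[0] < cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda f: f[0])


# Made-up filing index for one firm (dates and URLs are placeholders).
example = [
    (date(2019, 2, 20), "https://example.com/10k-fy2018"),
    (date(2020, 2, 18), "https://example.com/10k-fy2019"),
    (date(2021, 2, 17), "https://example.com/10k-fy2020"),
]

chosen = last_filing_before(example)
print(chosen)
```

Returning None when nothing qualifies lets the calling loop log the firm and move on instead of crashing partway through the S&P 500.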
You’ll need to decide on where to save files. Should you create subfolders? Multiple levels of subfolders? What should the name of the file itself be?
Answer 1: It depends on the job!
Answer 2: Again - I highly recommend the website. :)
Try to download just one file at first. When you can successfully do that, try a few more, one at a time. Check the folders on your computer - did they download like you expected? Are the files correct? If yes, continue. If not, you have an error to fix.
When you are confident the program works, delete your whole inputs/ and text_files/ subfolders on your computer so you have a “fresh start” and then rerun this from scratch.
Then: I’d run the file AGAIN. Does the file work after it’s already been run, or after it has partially completed its work? Real spiders have to resume where they left off. You might need to make some conditional tweaks to the file to account for this.
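One way to make the download step resumable is to skip any file that is already on disk. The sketch below is an assumption about how you might structure it, not the required design: the function name download_if_missing is hypothetical, and fetch stands in for whatever function you write to turn a URL into page text. A fake fetch is used here so the example runs without touching the internet.

```python
import tempfile
from pathlib import Path


def download_if_missing(url, dest, fetch):
    """Save fetch(url) to dest unless dest already exists and is non-empty.

    Returns True if a download happened and False if the file was skipped.
    `fetch` is any callable that takes a URL and returns the page text.
    """
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already downloaded on an earlier run
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(fetch(url), encoding="utf-8")
    return True


def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical, for illustration only)."""
    return f"<html>contents of {url}</html>"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "text_files" / "AAPL.html"
    first_run = download_if_missing("https://example.com/AAPL", target, fake_fetch)
    second_run = download_if_missing("https://example.com/AAPL", target, fake_fetch)

print(first_run, second_run)
```

The size check means an empty file left behind by a crashed run gets downloaded again rather than silently kept.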
3. IMPORTANT: Create .gitignore and screenshot.png
It’s not polite to upload so much data to GitHub. It takes up space on the server, and your collaborators/peer reviewers will have to download them all when they clone your repo.
So, you should tell GitHub not to upload those from your computer to the website!
Assuming that you saved all the wiki pages/10-Ks inside a folder called “text_files”, you should add /text_files/* in the gitignore.
Step 2: Push the repo to origin
After you do this, open the repo in your browser. You should “see” two things:
.gitignore. Click on it and see that it says /text_files/*
Your computer has a text_files folder on it with many files and some hard drive space used, yet the repo in your browser doesn’t show this at all! Good job!
screenshot.png - Upload proof of the files for your reviewers.
Right click your text_files folder so it shows the number of files inside of it, and take a screenshot showing this. Save it as screenshot.png inside your repo.
4. Download near_regex.py from the community codebook into your repo
This will be used in the next step.
5. Create measure_risk.ipynb
The basic idea is to measure risks by counting the number of times a given risk topic is discussed in the wiki page (or 10-K).
This file (broad steps)
Loads the initial dataset saved inside of
Loops over the rows. For each row, load the corresponding wiki page/10-K and create (at least) 5 different risk measures, and save those new measurements to each of 5 new variables in that row. See below for more on this.
Pick one risk type, and think of three ways to measure it. For example, there are many ways you could try to measure “antitrust risk”, so come up with 3 different ways to measure it from the text. You can try different terms, different combinations of terms, different limits on how close terms need to be, and more. Comparing these different ways might help you understand how your choices can improve or hurt the value of your measurement.
Pick a second risk type and create a single measure for it (you only need to do one measurement on this risk type, but you can do more)
Pick a third risk type and create a single measure for it (again, you only need to do one, but you can do more)
Bonus measures - interesting variables you could also measure:
The total length of the document (# of words)
The # of unique words (similar to total length)
The “tone” of the document
Downloads 2019 accounting data on S&P500 firms from the data folder in the class repo (possibly useful in analysis) and adds it to the dataset
Save the whole thing to
There is a bunch more on this step below the dropdowns.
When you are confident the program works, delete your whole output/ folder on your computer so you have a “fresh start” and then rerun this from scratch.
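To make “three ways to measure one risk” concrete, here is a rough sketch using only Python’s built-in re module. It does not use near_regex.py (check that file for its own interface). The helper names, term lists, and sample sentence are all made up for illustration, and the proximity version only looks for a first-group term followed by a second-group term.

```python
import re


def count_terms(text, terms):
    """Count whole-word, case-insensitive hits for any of the terms."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_near(text, first_terms, second_terms, max_words_between=5):
    """Count a first term followed by a second term with at most
    max_words_between words in between (one direction only)."""
    first = "|".join(re.escape(t) for t in first_terms)
    second = "|".join(re.escape(t) for t in second_terms)
    gap = r"(?:\W+\w+){0," + str(max_words_between) + r"}?\W+"
    pattern = r"\b(?:" + first + r")\b" + gap + r"\b(?:" + second + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up text standing in for one firm's wiki page or 10-K.
sample = (
    "The company faces antitrust risk in Europe. "
    "Regulators opened an antitrust investigation under competition law. "
    "Antitrust policy is discussed in the annual report."
)

measures = {
    # v1: a single narrow term
    "antitrust_v1": count_terms(sample, ["antitrust"]),
    # v2: a broader list of related terms
    "antitrust_v2": count_terms(sample, ["antitrust", "competition law", "monopoly"]),
    # v3: topic terms that appear close to "risk-like" words
    "antitrust_v3": count_near(
        sample,
        ["antitrust", "competition law"],
        ["risk", "investigation", "litigation"],
        max_words_between=3,
    ),
}

# Bonus-style measures: document length and number of unique words.
words = re.findall(r"\w+", sample.lower())
measures["n_words"] = len(words)
measures["n_unique_words"] = len(set(words))

print(measures)
```

The three versions disagree even on this tiny example, which is the point: the broad list picks up more mentions, while the proximity version ignores passing mentions that are not tied to a risk-like word.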
6. Create analysis.ipynb to see if your risk factors were associated with higher or lower returns around covid.
First compute the returns for the 3/9-3/13 week. This will give you a dataset with one row per firm, and one number per row (the return for that week). Then merge this into the analysis dataset. Rinse and repeat if you try for the other measures.
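A minimal pandas sketch of that compute-then-merge step is below. The column names (ticker, date, ret), the function name window_return, and the toy numbers are all assumptions for illustration; check what the returns file in the class data folder actually calls things. The weekly return is the compounded daily return, i.e. the product of (1 + daily return) minus 1.

```python
import pandas as pd


def window_return(daily, start, end):
    """Compound daily returns into one return per firm over [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_window = daily[(daily["date"] >= start) & (daily["date"] <= end)]
    gross = (1 + in_window["ret"]).groupby(in_window["ticker"]).prod()
    return (gross - 1).rename("ret_window").reset_index()


# Toy daily returns for two made-up firms (the first date is outside the window).
daily = pd.DataFrame(
    {
        "ticker": ["AAA"] * 3 + ["BBB"] * 3,
        "date": pd.to_datetime(["2020-03-06", "2020-03-09", "2020-03-10"] * 2),
        "ret": [0.01, -0.10, 0.05, 0.02, -0.20, 0.10],
    }
)

weekly = window_return(daily, "2020-03-09", "2020-03-13")

# Toy risk dataset with one row per firm, then merge the returns in.
risk = pd.DataFrame({"ticker": ["AAA", "BBB"], "antitrust_v1": [3, 0]})
analysis = risk.merge(weekly, on="ticker", how="left", validate="one_to_one")

print(analysis)
```

Because the window is a pair of arguments, the same function can be reused for any other date range you want to try, and validate="one_to_one" raises an error if either dataset accidentally has duplicate firms.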
Explain and describe to readers your risk measurements
How were they measured? (Mechanical description)
Why did you choose them and what do you hope they capture? (Economic reasoning)
What are their statistical properties? (Do you have values for most/all firms? Do they have variation within them? Are they correlated with any accounting measures?)
Validation checks and discussion of the risk measurements
Discuss briefly whether these measurements are likely “valid” in the sense they capture what you hope.
Present some evidence they do capture your hopes. (You probably don’t have enough data here to rigorously validate that your measure is associated with the “actual” risk. But if you do, show us!) This step (validating the measurement) is very important in production quality analysis!
Explore the correlation between your risk values and stock returns around key dates for the onset of covid.
Stock returns are in the class’s data folder
Get the firm’s returns for the week of Mar 9 - Mar 13, 2020
Bonus: repeat the analysis but use Feb 23-Mar 23 as the “collapse period”
Bonus: repeat the analysis but use Mar 24 as the “stimmy day” (stimulus was announced) … how does this change your results, and is it doing so in a predictable way?
Bonus: repeat the analysis, but use firm accounting variables: Some of these probably indicate that a firm should be more resilient to the crisis!
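A quick first look at the risk-return relationship can be as simple as a correlation. This is a sketch on made-up numbers, with hypothetical column names (antitrust_v1, ret_window); it shows the standard Pearson correlation plus a rank-based version, which is less sensitive to a few firms with extreme counts.

```python
import pandas as pd

# Made-up analysis dataset: one row per firm.
analysis = pd.DataFrame(
    {
        "antitrust_v1": [0, 1, 2, 3, 4],
        "ret_window": [-0.05, -0.08, -0.10, -0.15, -0.16],
    }
)

# Pearson correlation between the risk measure and the event-window return.
pearson = analysis["antitrust_v1"].corr(analysis["ret_window"])

# Rank correlation: Pearson correlation computed on the ranks of each variable.
rank_corr = analysis["antitrust_v1"].rank().corr(analysis["ret_window"].rank())

print(round(pearson, 3), round(rank_corr, 3))
```

A correlation is only a starting point; it says nothing about firm traits or market moves, which is what the regression bonus is for.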
Present your findings visually and follow the lessons on effective visualization!
You should write brief summaries of your findings.
Bonus: Explore the risk-return relationship, but use regressions so that you can control for firm traits and market returns. Does this change your results?
Don’t worry about printing these out “pretty”, just try them if you want!
Bonus: Use alpha as y, not returns, in your plots and regressions. This will likely change the results again.
Step 1: Separately, for each firm, estimate the beta and factor loadings of each firm’s returns in 2019. Save that data.
Step 2: For firm i on date t, alpha(i,t) = ret(i,t) - beta(of firm i)*mkt_return(t) - SMB(of firm i)*SMB_port_ret(t) - HML(of firm i)*HML_port_ret(t)
SMB_port_ret(t) is the return on the SMB portfolio on date t, which you can get from the Fama-French datasets!
Just present the findings if you do this. Don’t worry about explaining it - but it might make sense in a few weeks!
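Step 2 is plain arithmetic once you have the loadings from Step 1. The sketch below just transcribes the formula into a function; the function and argument names are hypothetical, and the numbers are made up (they are not real loadings or factor returns).

```python
def alpha(ret, beta, mkt_ret, smb_loading, smb_ret, hml_loading, hml_ret):
    """alpha(i,t) = ret(i,t) - beta*mkt_return(t) - SMB*SMB_port_ret(t) - HML*HML_port_ret(t).

    beta, smb_loading, hml_loading are firm i's loadings estimated on 2019 data;
    ret, mkt_ret, smb_ret, hml_ret are returns on date t.
    """
    return ret - beta * mkt_ret - smb_loading * smb_ret - hml_loading * hml_ret


# Made-up example: a firm that fell 10% on a day the market fell 5%.
example_alpha = alpha(
    ret=-0.10,
    beta=1.2,
    mkt_ret=-0.05,
    smb_loading=0.5,
    smb_ret=-0.02,
    hml_loading=-0.3,
    hml_ret=0.01,
)

print(round(example_alpha, 4))
```

In this example the factor exposures account for most of the drop, so the alpha is much smaller in magnitude than the raw return.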
If you want to do any regressions, let me know. I’ll give you a few pointers.
7. Finalize and polish
Edit the readme file - it should be “publication ready”
Make the readme file informative and professional.
Inform readers of the order in which files should be run. And warn users that running the code will download X files of X MB or GB.
Change the title of it (not the filename, the title at the top)
Describe the purpose of this repo (what this repo is analyzing) and the key inputs
List any necessary packages (might a reader need to pip install anything?) or steps a visitor will need to run to make it work on their computer
The analysis file should be written and formatted like an executive report.
There is no “page expectation” or “page limit”. Aim to provide sufficient analysis and explanation, but in a concise and clear way. Bullet points are fine in places, but you should have a few places with paragraph-style discussion, especially where you explain why you chose the specific risks, the way you defined them, and what issues you think they have (which points the way forward on “extensions”).
In other words: You will be graded on how much this looks like a professional report. Just “dumping” endless printouts is not as valuable as well-tailored tables and figures. High quality and concise reporting is an A1 emphasis. Here, pretty, smart, and effective tables and visualizations will receive higher grades.
The teaching team will not read your measure_risk file other than to comment on code style. So:
Any details in that file on search terms and descriptive information on your text-based measures should be copied into your analysis file (with appropriate adjustments to suit how a report would be presented).
Make the measurement code easy to read, because we will grade the code style.
Give yourself a big round of applause at this point!
Your code is probably very flexible and powerful at this point. If you have the appetite, a larger list of EDGAR files to download, a large enough hard drive, and time, then you could download more than 100GB of 10-K filings and run textual analysis across 20+ years of data for all publicly traded firms.
Seriously: You are in the ballpark of pulling off any analysis you want that needs to harness the power of these filings. These four studies are variously provocative, great, and (in one case) mine: