3.2.5. (Golden Rules for) Exploratory Data Analysis (EDA)¶
What’s below is some hard-earned wisdom. Please accept this into your heart and go forth such that data never harms you as it has so many before you.
0. Before we begin
A good approach to new projects that follows our Project Golden Rules is to have two files:
download_and_clean: Gets messy external data into clean tables you can use
Downloads the raw data to a subfolder called /inputs_raw
Suggested: Use the profiling package to create a report on the raw data to guide your cleaning (save to outputs/profile_data_raw.html)
Does all the Golden Rule and Cleaning steps below
Saves the cleaned files to a subfolder called /inputs_clean
Suggested: Use the profiling package to create a report on the cleaned data to refer to during your analysis (save to outputs/profile_data_clean.html)
analysis: Load the cleaned data and off you go!
Everything on this page is about the first file (download_and_clean), not the analysis file.
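To make this concrete, here is a minimal sketch of what a download_and_clean script could look like. It assumes the ydata_profiling package for the profiling reports, and the URL and file names are placeholders, not real data:

# download_and_clean.py - a minimal sketch; the URL and file names below are placeholders
import os
import pandas as pd
from ydata_profiling import ProfileReport   # "the profiling package" (an assumption; any profiler works)

for folder in ['inputs_raw', 'inputs_clean', 'outputs']:
    os.makedirs(folder, exist_ok=True)

raw = pd.read_csv('https://example.com/some_raw_data.csv')      # placeholder source
raw.to_csv('inputs_raw/raw_data.csv', index=False)              # never overwrite this file later!
ProfileReport(raw, title='Raw data').to_file('outputs/profile_data_raw.html')

clean = raw.copy()
# ... the Golden Rule and Cleaning steps from the rest of this page go here ...

clean.to_csv('inputs_clean/clean_data.csv', index=False)
ProfileReport(clean, title='Clean data').to_file('outputs/profile_data_clean.html')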
Some guidelines to follow throughout:
ABCD - Always be checking (looking) at your data! Literally look at your data tables regularly, throughout the entire process, to spot issues you haven’t thought of.
Record all steps in a script (download_and_clean).
Never overwrite the original raw data file.
Whenever possible, make changes to values ONLY by logical conditions on one or more substantive variables. Do not make changes by observation ID, another key, or (even worse) row number. You want the changes you make to be rule-based, for 2 reasons:
So that they’re general - able to handle upstream changes to the data.
So that they’re principled - no one can accuse you of cherry-picking.
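For example, a rule-based fix looks like the first line below (a sketch; df and the variable names are hypothetical):

import numpy as np
# rule-based: change values only where a logical condition on a substantive variable holds
df.loc[df['sales'] < 0, 'sales'] = np.nan
# avoid this: hard-coding a row position or ID breaks as soon as the upstream data changes
# df.loc[8122, 'sales'] = np.nan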
1. GOLDEN RULES for EDA
You got new data - fun! But before you go speeding off into analysis, first you must learn some basic facts about the data. Reading the dataset’s documentation is 10/10 a great idea (sadly though many datasets have bad documentation).
Open the dataset’s documentation. (Save it in the folder alongside the raw data if possible.)
Reshape your data so that it is tall (“long” format), which keeps the data normalized
Identify the primary key and the unit of observation in this dataset
The key is a variable (or set of variables, like ticker + date) that should uniquely identify an observation.
The “unit level” can be increments of time (e.g. daily, monthly, or yearly), the type of entity (e.g. firm, person, state, country, industry), and also combinations of entity type and time (e.g. “firm” level, “firm-year” level)
Learn about the variables (The documentation will help.)
Print the column names
Remove irrelevant, garbage, or empty variables.
Understand the definition, origin, and units of each variable, and document as necessary.
Rename variables as needed (short but descriptive enough)
Clean up rows:
Delete duplicate or empty rows.
Are there missing observations or a gap in the time series? If so, you can either ignore it (but know that this might cause issues later) or try to fix it.
If there are duplicate keys, resolve (remove true duplicates, or redefine the primary key).
Example: Two 2001-GM observations might exist because GM changed its fiscal year in the middle of calendar year 2001. Depending on your analysis specifics, you will have to pick one of these. (A quick uniqueness check is sketched below.)
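A quick way to verify that your candidate key really is unique (a sketch; df and the ticker+date key are hypothetical):

# rows that share a key with at least one other row; 0 means ticker+date is a valid primary key
print(df.duplicated(subset=['ticker', 'date'], keep=False).sum())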
You should always, always, always open your datasets up to physically (digitally?) look at them, and then also generate summary statistics. Here are some basic outputs you should examine carefully.
What is the shape (# rows and variables) of the data?
df.shape
How much memory does it take, and what are the variable names/types?
df.info()
Explore individual variables:
Continuous variables: df.describe() - count, mean, sd, etc.
Categorical variables - frequency tables: df['var'].value_counts().head(10) - the most common values of a variable
Any variable: df['var'].nunique() - the number of unique values of a variable
Characterizing distributions (df['var'].plot())
Crosstabs (pandas pivot_table)
Think about how you should handle extreme values (fix errors, delete observation, or winsorize)
Think about how you should handle missing data (find, ignore, delete observation, or (rarely) impute)
Should you transform variables (natural log, z-score, other less common options)?
Explore how variables are related: pandas .corr() reports correlations
Tip
Automate parts of your initial EDA by putting something like the below in your codebook.
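For instance, a reusable first-pass function might look like this (a sketch, not the course’s official snippet):

import pandas as pd

def quick_eda(df, max_cats=10):
    # shape, memory/dtypes, summary stats, and frequency tables for categorical variables
    print(df.shape)
    df.info()                                # prints dtypes, memory, and non-null counts
    print(df.describe().T)                   # transpose so wide datasets stay readable
    for col in df.select_dtypes(include=['object', 'category']):
        print(col, '-', df[col].nunique(), 'unique values')
        print(df[col].value_counts().head(max_cats))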
Warning
Two things:
Don’t just run these tests and move on! Actually look at the output and check for possible issues. (Some possible issues are listed in the next section.)
This isn’t a comprehensive list of things I’d do to check datasets, merely a reasonable start!
2. Data cleaning
Start by looking for these things while cleaning your data. Make notes on what you find so you can make adjustments to the data. (A code sketch of a few of these fixes follows the list.)
Understand patterns of missing values.
Find out why they’re missing. (Problem with source or how you loaded it?)
Make sure they are not more widespread than you expect.
Convert other intended designations (e.g., -1 or -999) to NA.
Distinguish between missing values and true zeros.
Convert to numeric when variables are inappropriately stored as strings. Correct typos as necessary.
Convert to date/time format where appropriate.
Recode binary variables as 0/1 as necessary. (Often stored as “Yes”/”No” or 1/2.)
Convert to factors when strings take a limited set of possible values.
Make units and scales consistent. Avoid having in the same variable:
Some values in meters and others in feet.
Some values in USD and others in GBP.
Some percentages as 40% and others as 0.4.
Some values as millions and others as billions.
Do some variables have large outliers? Make notes for later. (You might need to winsorize, or drop, those observations.)
Perform logical checks on quantitative variables:
Define any range restrictions each variable should satisfy, and check them. (Sales can’t be negative!)
Variables shouldn’t contradict each other. If my birthday is 1/5/1999, then I must be 21 as of this writing.
Correct any violations that are indisputable data entry mistakes.
Create a flag variable to mark remaining violations.
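A few of those steps as code (a sketch only; every variable name below is hypothetical):

import numpy as np
import pandas as pd

df['income']   = df['income'].replace([-1, -999], np.nan)          # sentinel codes -> NA
df['sales']    = pd.to_numeric(df['sales'], errors='coerce')        # strings -> numbers (bad parses become NaN)
df['filed_on'] = pd.to_datetime(df['filed_on'], errors='coerce')    # strings -> dates
df['is_public'] = df['is_public'].map({'Yes': 1, 'No': 0})          # "Yes"/"No" -> 0/1
df['industry'] = df['industry'].astype('category')                  # limited string values -> factor
df['neg_sales_flag'] = (df['sales'] < 0).astype(int)                # flag (don't silently fix) logical violations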
3. Institutional knowledge is crucial
Tip
Lehigh offers students free subscriptions to WSJ, NYT, and FT!
For example, in my research on firm investment policies and financing decisions (e.g. leverage), I drop any firms from the finance or utility industries. I know to do this because those industries are subject to several factors that fundamentally change their leverage in ways unrelated to their investment policies. Thus, including them would contaminate any relationships I find.
When we work with 10-K text data, we need to know what data tables are available and how to remove boilerplate.
You only get institutional knowledge through reading supporting materials, documentation, related info (research papers, the WSJ, etc.), and… the documents themselves. For example, I’ve read in excess of 6,000 proxy statements for a single research project. (If that sounds exciting: Go to grad school and become a prof!)
4. Explore the data with your question in mind
Compute statistics by subgroups (by age or industry), or two-way (by gender and age)
Do a big correlation matrix to get a sense of possible relationships between variables in the data on a first pass. (We will do this later.)
This step reinforces institutional knowledge and your understanding of the data
Remember, data exploration and data cleaning are interrelated and you’ll go back and forth from one to the other.
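For example (a sketch; the variable names are hypothetical):

import pandas as pd

# one-way subgroup stats
df.groupby('industry')['leverage'].describe()
# two-way stats via a pivot table
pd.pivot_table(df, values='leverage', index='industry', columns='year', aggfunc='mean')
# a big first-pass correlation matrix of the numeric variables
df.corr(numeric_only=True)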
3.2.5.1. Acknowledgments¶
3.2.5.2. Toy EDA¶
I do all of the suggested steps here. The output isn’t pretty, and I’d clean it up if I were writing a report. But this gets the ball rolling.
Note
This data only has one categorical variable (species). You should do value_counts and nunique for all categorical variables.
# import a famous dataset, seaborn nicely contains it out of the box!
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.head(), '\n---')
print(iris.tail(), '\n---')
print(iris.columns, '\n---')
print("The shape is: ",iris.shape, '\n---')
print("Info:",iris.info(), '\n---') # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
print(iris.describe(), '\n---') # summary stats, and you can customize the list!
print(iris['species'].value_counts()[:10], '\n---')
print(iris['species'].nunique(), '\n---')
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
---
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
---
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
---
The shape is: (150, 5)
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Info: None
---
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
---
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64
---
3
---
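If a dataset has several categorical variables, a short loop covers the Note above (a sketch):

# value_counts and nunique for every categorical column (iris only has 'species')
for col in iris.select_dtypes(include=['object', 'category']):
    print(col, '-', iris[col].nunique(), 'unique values')
    print(iris[col].value_counts().head(10), '\n---')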
3.2.5.3. EDA of Compustat (Firm accounting variables)¶
Compustat is a professional grade dataset that contains accounting data from SEC filings. I use it a lot in my work. Let’s explore it!
First, let’s download our slice of it. The variables are listed and described in a csv file in the org’s data folder.
import pandas as pd
import numpy as np

# these three are used to download the file
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

url = 'https://github.com/LeDataSciFi/data/blob/main/Firm%20Year%20Datasets%20(Compustat)/CCM_cleaned_for_class.zip?raw=true'

# firms = pd.read_stata(url)
# <-- that code would work, but GH said it was too big and
# forced me to zip it, so here is the work around to download it:

with urlopen(url) as request:
    data = BytesIO(request.read())

with ZipFile(data) as archive:
    with archive.open(archive.namelist()[0]) as stata:
        ccm = pd.read_stata(stata)
There are too many variables to print .describe()
nicely. Here’s a trick: transpose the table:
ccm.describe().T.style.format('{:,.2f}') # by transposing it, I can see more variables
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---
gvkey | 223,001.00 | 32,958.76 | 47,791.28 | 1,000.00 | 6,461.00 | 11,946.00 | 29,306.00 | 316,056.00 |
fyear | 223,001.00 | 1,994.72 | 10.40 | 1,975.00 | 1,986.00 | 1,995.00 | 2,003.00 | 2,014.00 |
lpermno | 223,001.00 | 59,497.47 | 26,609.23 | 10,000.00 | 37,218.00 | 68,073.00 | 81,740.00 | 93,436.00 |
sic | 223,001.00 | 4,667.96 | 1,952.31 | 100.00 | 3,341.00 | 4,512.00 | 6,036.00 | 9,997.00 |
sic3 | 223,001.00 | 466.61 | 195.24 | 10.00 | 334.00 | 451.00 | 603.00 | 999.00 |
age | 220,369.00 | 8.78 | 8.33 | 0.00 | 2.00 | 6.00 | 13.00 | 39.00 |
at | 223,001.00 | 5,113.07 | 55,838.12 | 0.00 | 32.35 | 163.06 | 957.00 | 3,771,199.85 |
me | 219,397.00 | 2,041.15 | 11,469.68 | 0.00 | 23.91 | 110.92 | 608.58 | 626,550.38 |
l_a | 222,978.00 | 5.23 | 2.41 | -6.91 | 3.48 | 5.09 | 6.86 | 15.14 |
l_sale | 218,779.00 | 4.78 | 2.45 | -6.91 | 3.24 | 4.76 | 6.38 | 13.07 |
prof_a | 217,167.00 | 0.05 | 1.16 | -218.00 | 0.02 | 0.10 | 0.16 | 198.67 |
mb | 219,269.00 | 2.01 | 9.93 | 0.05 | 1.00 | 1.25 | 1.93 | 2,705.39 |
ppe_a | 218,189.00 | 0.27 | 0.25 | 0.00 | 0.06 | 0.20 | 0.42 | 1.00 |
capx_a | 206,399.00 | 0.07 | 0.09 | -2.09 | 0.02 | 0.04 | 0.08 | 8.88 |
xrd_a | 223,001.00 | 0.04 | 0.16 | -0.38 | 0.00 | 0.00 | 0.03 | 18.37 |
cash_a | 222,332.00 | 0.16 | 0.20 | -0.07 | 0.02 | 0.07 | 0.21 | 1.00 |
div_d | 206,688.00 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
td | 221,491.00 | 1,365.23 | 16,589.58 | -166.15 | 1.64 | 21.56 | 206.04 | 961,732.00 |
td_a | 221,468.00 | 0.24 | 0.38 | -0.05 | 0.05 | 0.20 | 0.37 | 75.26 |
td_mv | 217,961.00 | 0.28 | 0.26 | -0.05 | 0.04 | 0.21 | 0.46 | 1.00 |
dltt_a | 222,002.00 | 0.18 | 0.32 | 0.00 | 0.01 | 0.12 | 0.28 | 75.26 |
dv_a | 206,679.00 | 0.02 | 0.18 | -0.01 | 0.00 | 0.00 | 0.01 | 63.53 |
invopps_FG09 | 193,956.00 | 1.75 | 10.35 | -0.10 | 0.71 | 1.03 | 1.72 | 2,705.25 |
sales_g | 196,652.00 | 0.64 | 42.13 | -199.29 | -0.02 | 0.09 | 0.24 | 11,879.50 |
short_debt | 194,560.00 | 0.32 | 0.33 | -4.67 | 0.05 | 0.20 | 0.53 | 22.59 |
long_debt_dum | 222,025.00 | 0.82 | 0.38 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
atr | 222,168.00 | 0.53 | 0.36 | 0.00 | 0.30 | 0.40 | 1.00 | 1.00 |
smalltaxlosscarry | 148,275.00 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
largetaxlosscarry | 148,275.00 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
tnic3hhi | 96,951.00 | 0.22 | 0.22 | 0.01 | 0.08 | 0.14 | 0.28 | 1.00 |
tnic3tsimm | 96,951.00 | 9.40 | 18.42 | 0.00 | 1.33 | 2.46 | 6.64 | 132.94 |
prodmktfluid | 88,332.00 | 7.48 | 3.91 | 0.00 | 4.63 | 6.79 | 9.63 | 40.78 |
delaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.07 | -0.01 | 0.05 | 0.46 |
equitydelaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.07 | -0.02 | 0.05 | 0.48 |
debtdelaycon | 57,130.00 | 0.00 | 0.06 | -0.18 | -0.04 | -0.00 | 0.04 | 0.35 |
privdelaycon | 57,130.00 | -0.00 | 0.08 | -0.26 | -0.06 | -0.01 | 0.05 | 0.42 |
l_emp | 208,991.00 | 1.04 | 1.16 | 0.00 | 0.15 | 0.58 | 1.56 | 7.84 |
l_ppent | 218,212.00 | 3.46 | 2.38 | 0.00 | 1.55 | 3.11 | 5.02 | 12.51 |
l_laborratio | 199,821.00 | 3.47 | 1.57 | -6.59 | 2.47 | 3.26 | 4.21 | 15.36 |
3.2.5.3.1. Observations¶
There are a lot of missing values! Worthy of more investigation…
Some firms have zero (or essentially zero) total assets and market equity
What does it mean to have negative debt (td) or dividends (dv)?
Tax rates range from 0 to 1!
The taxlosscarry variables are booleans (1 or 0), as is long_debt_dum
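One quick way to check these observations directly (the columns below are in the dataset):

# how many observations are negative, and how many are exactly zero?
print((ccm[['at', 'me', 'td', 'dv_a', 'prof_a']] < 0).sum())
print((ccm[['at', 'me']] == 0).sum())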
3.2.5.3.2. Looking for outliers¶
There are a lot of ways to go about this.
A describe()-type table, but print out percentiles that focus on the tails and look for large jumps near the edges (like from p95 to max)
Make a table with skewness and kurtosis statistics (look for fat and/or long tails) - see the sketch after this list
Boxplots, but you’ll need to do a bunch of them
If you “standardize” all variables (subtract the mean and divide by the standard deviation - a common first step in many ML analyses!), you could plot the densities on a handful of charts or a tall ridgeline. Histograms might be better to spot outliers but would require ~40 figures.
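The skewness/kurtosis table mentioned above takes one line of pandas (a sketch):

# fat and/or long tails show up as large kurtosis and |skewness|; sort the worst offenders to the top
pd.concat([ccm.skew(numeric_only=True), ccm.kurt(numeric_only=True)],
          axis=1, keys=['skew', 'kurtosis']).sort_values('kurtosis', ascending=False)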
Some variables I’d be concerned about (not an exhaustive list!):
The leverage variables that start with td and dltt
Sales growth
Profitability
Market-to-Book
# a table approach to support the list of variables above
ccm.describe(percentiles=[.01,.05,.95,.99]).T.style.format('{:,.2f}')
 | count | mean | std | min | 1% | 5% | 50% | 95% | 99% | max |
---|---|---|---|---|---|---|---|---|---|---
gvkey | 223,001.00 | 32,958.76 | 47,791.28 | 1,000.00 | 1,235.00 | 2,085.00 | 11,946.00 | 156,613.00 | 186,230.00 | 316,056.00 |
fyear | 223,001.00 | 1,994.72 | 10.40 | 1,975.00 | 1,975.00 | 1,977.00 | 1,995.00 | 2,011.00 | 2,013.00 | 2,014.00 |
lpermno | 223,001.00 | 59,497.47 | 26,609.23 | 10,000.00 | 10,256.00 | 11,438.00 | 68,073.00 | 90,339.00 | 92,502.00 | 93,436.00 |
sic | 223,001.00 | 4,667.96 | 1,952.31 | 100.00 | 1,040.00 | 1,381.00 | 4,512.00 | 7,830.00 | 8,734.00 | 9,997.00 |
sic3 | 223,001.00 | 466.61 | 195.24 | 10.00 | 104.00 | 138.00 | 451.00 | 783.00 | 873.00 | 999.00 |
age | 220,369.00 | 8.78 | 8.33 | 0.00 | 0.00 | 0.00 | 6.00 | 26.00 | 35.00 | 39.00 |
at | 223,001.00 | 5,113.07 | 55,838.12 | 0.00 | 1.50 | 4.43 | 163.06 | 11,540.60 | 71,923.65 | 3,771,199.85 |
me | 219,397.00 | 2,041.15 | 11,469.68 | 0.00 | 1.10 | 3.66 | 110.92 | 7,204.16 | 38,432.21 | 626,550.38 |
l_a | 222,978.00 | 5.23 | 2.41 | -6.91 | 0.42 | 1.49 | 5.09 | 9.35 | 11.18 | 15.14 |
l_sale | 218,779.00 | 4.78 | 2.45 | -6.91 | -1.67 | 0.79 | 4.76 | 8.80 | 10.39 | 13.07 |
prof_a | 217,167.00 | 0.05 | 1.16 | -218.00 | -1.06 | -0.34 | 0.10 | 0.28 | 0.42 | 198.67 |
mb | 219,269.00 | 2.01 | 9.93 | 0.05 | 0.56 | 0.76 | 1.25 | 4.96 | 11.55 | 2,705.39 |
ppe_a | 218,189.00 | 0.27 | 0.25 | 0.00 | 0.00 | 0.01 | 0.20 | 0.80 | 0.91 | 1.00 |
capx_a | 206,399.00 | 0.07 | 0.09 | -2.09 | 0.00 | 0.00 | 0.04 | 0.22 | 0.41 | 8.88 |
xrd_a | 223,001.00 | 0.04 | 0.16 | -0.38 | 0.00 | 0.00 | 0.00 | 0.20 | 0.54 | 18.37 |
cash_a | 222,332.00 | 0.16 | 0.20 | -0.07 | 0.00 | 0.00 | 0.07 | 0.64 | 0.90 | 1.00 |
div_d | 206,688.00 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
td | 221,491.00 | 1,365.23 | 16,589.58 | -166.15 | 0.00 | 0.00 | 21.56 | 3,031.72 | 17,127.38 | 961,732.00 |
td_a | 221,468.00 | 0.24 | 0.38 | -0.05 | 0.00 | 0.00 | 0.20 | 0.66 | 0.94 | 75.26 |
td_mv | 217,961.00 | 0.28 | 0.26 | -0.05 | 0.00 | 0.00 | 0.21 | 0.79 | 0.93 | 1.00 |
dltt_a | 222,002.00 | 0.18 | 0.32 | 0.00 | 0.00 | 0.00 | 0.12 | 0.55 | 0.82 | 75.26 |
dv_a | 206,679.00 | 0.02 | 0.18 | -0.01 | 0.00 | 0.00 | 0.00 | 0.05 | 0.14 | 63.53 |
invopps_FG09 | 193,956.00 | 1.75 | 10.35 | -0.10 | 0.18 | 0.40 | 1.03 | 4.72 | 11.16 | 2,705.25 |
sales_g | 196,652.00 | 0.64 | 42.13 | -199.29 | -0.78 | -0.33 | 0.09 | 0.91 | 3.79 | 11,879.50 |
short_debt | 194,560.00 | 0.32 | 0.33 | -4.67 | 0.00 | 0.00 | 0.20 | 1.00 | 1.00 | 22.59 |
long_debt_dum | 222,025.00 | 0.82 | 0.38 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
atr | 222,168.00 | 0.53 | 0.36 | 0.00 | 0.00 | 0.00 | 0.40 | 1.00 | 1.00 | 1.00 |
smalltaxlosscarry | 148,275.00 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
largetaxlosscarry | 148,275.00 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
tnic3hhi | 96,951.00 | 0.22 | 0.22 | 0.01 | 0.02 | 0.04 | 0.14 | 0.73 | 1.00 | 1.00 |
tnic3tsimm | 96,951.00 | 9.40 | 18.42 | 0.00 | 1.00 | 1.03 | 2.46 | 53.75 | 95.21 | 132.94 |
prodmktfluid | 88,332.00 | 7.48 | 3.91 | 0.00 | 1.49 | 2.48 | 6.79 | 14.67 | 19.09 | 40.78 |
delaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.19 | -0.15 | -0.01 | 0.16 | 0.25 | 0.46 |
equitydelaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.18 | -0.14 | -0.02 | 0.16 | 0.25 | 0.48 |
debtdelaycon | 57,130.00 | 0.00 | 0.06 | -0.18 | -0.11 | -0.08 | -0.00 | 0.11 | 0.16 | 0.35 |
privdelaycon | 57,130.00 | -0.00 | 0.08 | -0.26 | -0.16 | -0.12 | -0.01 | 0.15 | 0.22 | 0.42 |
l_emp | 208,991.00 | 1.04 | 1.16 | 0.00 | 0.00 | 0.01 | 0.58 | 3.55 | 4.79 | 7.84 |
l_ppent | 218,212.00 | 3.46 | 2.38 | 0.00 | 0.00 | 0.19 | 3.11 | 7.90 | 9.60 | 12.51 |
l_laborratio | 199,821.00 | 3.47 | 1.57 | -6.59 | 0.26 | 1.34 | 3.26 | 6.47 | 8.15 | 15.36 |
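If we decide a tail is a real problem, winsorizing (mentioned in the cleaning checklist above) is one common treatment. A sketch for one flagged variable, clipping at the 1st/99th percentiles (the cutoffs are a choice, not a rule):

lo, hi = ccm['prof_a'].quantile([0.01, 0.99])              # tail cutoffs
ccm['prof_a_w'] = ccm['prof_a'].clip(lower=lo, upper=hi)   # winsorized copy (keep the original too)
ccm[['prof_a', 'prof_a_w']].describe().T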
3.2.5.3.3. Missing values¶
Below is a list of variables that are missing values for more than 10% of the dataset. Future work will need to understand why there are missing values and understand whether and how we should deal with them.
(
    ( # these lines do the calculation - what % of missing values are there for each var
        ccm.isna()        # ccm.isna() TURNS every obs/variable = 1 when its missing and 0 else
        .sum(axis=0)      # count the number of na for each variable (now data is 1 obs per column = # missing)
        / len(ccm)        # convert # missing to % missing
        * 100             # report as percentage
    )
    # you can stop here and report this...
    # but I wanted to format it a bit...
    .sort_values(ascending=False)[:13]
    .to_frame(name='% missing')   # the next line only works on a frame; pandas sees only 1 variable at this point
    .style.format("{:.1f}")       # and calls it a "series" object, so convert it to a dataframe object first
)
 | % missing |
---|---
privdelaycon | 74.4 |
debtdelaycon | 74.4 |
equitydelaycon | 74.4 |
delaycon | 74.4 |
prodmktfluid | 60.4 |
tnic3tsimm | 56.5 |
tnic3hhi | 56.5 |
smalltaxlosscarry | 33.5 |
largetaxlosscarry | 33.5 |
invopps_FG09 | 13.0 |
short_debt | 12.8 |
sales_g | 11.8 |
l_laborratio | 10.4 |
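One possible next step (a sketch): check whether the missingness is concentrated in particular years, which often explains why a variable is missing (e.g., it simply isn’t available before some date):

# % missing by fiscal year for one of the worst offenders
print(ccm.groupby('fyear')['prodmktfluid'].apply(lambda s: s.isna().mean() * 100).round(1))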