3.2.5. (Golden Rules for) Exploratory Data Analysis (EDA)¶
What’s below is some hard earned wisdom. Please accept this into your heart and go forth such that data never harms you as it has so many before you.
1. GOLDEN RULES for EDA
You got new data - fun! But before you go speeding off into analysis, first you must learn some basic facts about the data. Reading the dataset’s documentation is 10/10 a great idea (sadly though many datasets have bad documentation).
You should always, always, always open your datasets up to physically (digitally?) look at them and then also generate summary statistics. Here are some basic outputs you should examine carefully.
What is the unit of observation in this dataset?
The “unit levels” in the datasets we will look at are often increments of time (e.g. monthly or yearly), types of entity (e.g. firm, person, state, country, industry), or combinations of entity type and time.
E.g. “firm” level, “firm-year” level
Print the first and last five rows: df.head() and df.tail()
What is the shape (# rows and variables) of the data? df.shape
Print the column names: df.columns
How much memory does it take, and what are the variable names/types? df.info()
Summary stats on variables: df.describe() - count, mean, sd, etc.
df['var'].value_counts().head(10) - the most common values of a variable
df['var'].nunique() - the number of unique values of a variable
Tip
Automate your initial EDA by putting something like the below in your codebook. Double tip: Look in the community codebook :)
You can refine this as you go - you'll see more and more tips on how to improve your EDA as we go.
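For example, here is a rough sketch of what an automated first pass might look like. (The function name insta_check and the exact set of checks are just one possible choice - adapt it to your needs.)
import pandas as pd

def insta_check(df, max_uniques=10):
    '''A quick first-pass EDA: prints the basic facts suggested in the list above.'''
    print(df.head(), '\n---')
    print(df.tail(), '\n---')
    print("Columns:", list(df.columns), '\n---')
    print("Shape:", df.shape, '\n---')
    df.info()                              # memory usage, dtypes, # of non-null obs per variable
    print('---')
    print(df.describe().T, '\n---')        # summary stats, transposed so wide data still fits
    # for each non-numeric variable, show the most common values and the number of uniques
    for col in df.select_dtypes(['object', 'category']).columns:
        print(df[col].value_counts().head(max_uniques), '\n---')
        print(col, 'has', df[col].nunique(), 'unique values', '\n---')
Calling insta_check(iris) on the toy data below would reproduce most of the output shown there.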
Warning
Two things:
Don’t just run these and move on! Actually look at the output and check for possible issues. (Some possible issues are listed in the next drop down.)
This isn’t a comprehensive list of things I’d do to check datasets, merely a reasonable start!
2. Data cleaning
A thought not quite profound enough to be wisdom
Data cleaning, exploration, and analysis exist in a never ending feedback loop. Analysis projects are rarely linear. You’ll be doing some analysis and realize there is a data problem, or that you need new data, and you’re back at step one above.
Here are some things you might look for while cleaning your data (a few of these checks are sketched in code after the table):
Look for | Comment |
---|---|
Do some variables have large outliers? | You might need to winsorize, or drop, those observations. |
Do some variables have many missing values? | Maybe there was a problem with the source or how you loaded it. |
Do some variables have impossible values? | For example, sales can't be negative! Yet, some datasets use “-99” to indicate missing data. Or are there bad dates (June 31st)? |
Are there duplicate observations? | E.g. two 2001-GM observations because GM changed its fiscal year in the middle of calendar year 2001? |
Are there missing observations or a gap in the time series? | E.g. no observation in 2005 for Enron because the executive team was too “distracted” to file its 10-K? |
Do variables contradict each other? | E.g. if the birthday variable says 1/5/1999, then the age variable must be consistent with that date as of this writing. |
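Here is a minimal sketch of how a few of these checks might look in pandas. The column names (firm, fyear, sales) are placeholders - swap in whatever identifies your unit of observation:
import pandas as pd

def quick_cleaning_checks(df, id_cols=['firm', 'fyear']):
    # missing values: which variables have the highest share of NAs?
    print(df.isna().mean().sort_values(ascending=False).head(10))

    # impossible values: e.g. negative sales, or a "-99" missing-data code
    print((df['sales'] < 0).sum(), 'rows with negative sales')
    print((df['sales'] == -99).sum(), 'rows with sales coded as -99')

    # duplicate observations at the stated unit of observation (e.g. firm-year)
    print(df.duplicated(subset=id_cols).sum(), 'duplicated', '-'.join(id_cols), 'rows')

    # gaps in the time series: years missing between each firm's first and last year
    gaps = df.groupby('firm')['fyear'].apply(lambda y: y.max() - y.min() + 1 - y.nunique())
    print(gaps[gaps > 0], '\n^ firms with missing years')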
3. Institutional knowledge is crucial
Tip
Lehigh offers students free subscriptions to WSJ, NYT, and FT!
For example, in my research on firm investment policies and financing decisions (e.g. leverage), I drop any firms from the finance or utility industries. I know to do this because those industries are subject to several factors that fundamentally change their leverage in ways unrelated to their investment policies. Thus, including them would contaminate any relationships I find.
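In code, that choice might look something like the sketch below, using the sic column in the Compustat slice loaded later in this section. (The cutoffs - SIC 6000-6999 for financials and 4900-4999 for utilities - are the usual convention in this literature, but confirm they fit your question.)
# drop financial firms and utilities before studying leverage (a common convention;
# the exact SIC ranges are an assumption you should verify for your own project)
ccm_nonfin = ccm[~ccm['sic'].between(6000, 6999) & ~ccm['sic'].between(4900, 4999)]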
When we work with 10-K text data, we need to know what data tables are available and how to remove boilerplate.
You only get institutional knowledge through reading supporting materials, documentation, related info (research papers, the WSJ, etc.), and… the documents themselves. For example, I've read in excess of 6,000 proxy statements for a single research project. (If that sounds exciting: Go to grad school and become a prof!)
4. Explore the data with your question in mind
Compute statistics by subgroups (by age, or industry), or two-way (by gender and age)
Do a big correlation matrix to get a sense of possible relationships between variables in the data on a first pass. (We will do this later; a quick sketch follows this list.)
This step reinforces institutional knowledge and your understanding of the data
Remember, data exploration and data cleaning are interrelated and you’ll go back and forth from one to the other.
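A minimal sketch of both ideas, using the toy iris data from the next subsection (any dataset with a grouping variable works the same way):
import seaborn as sns

iris = sns.load_dataset('iris')

# statistics by subgroup: one row of means per species
print(iris.groupby('species').mean())

# a first-pass correlation matrix over the numeric variables
print(iris.drop(columns='species').corr())
Two-way splits work the same way - just pass a list of grouping variables, e.g. .groupby(['industry', 'year']).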
3.2.5.1. Toy EDA¶
I do all of the suggested steps here. The output isn't pretty, and I'd clean it up if I were writing a report. But this gets the ball rolling.
Note
This data only has one categorical variable (species). You should do value_counts and nunique for all categorical variables.
# import a famous dataset, seaborn nicely contains it out of the box!
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.head(), '\n---')
print(iris.tail(), '\n---')
print(iris.columns, '\n---')
print("The shape is: ",iris.shape, '\n---')
print("Info:",iris.info(), '\n---') # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
print(iris.describe(), '\n---') # summary stats, and you can customize the list!
print(iris['species'].value_counts()[:10], '\n---')
print(iris['species'].nunique(), '\n---')
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
---
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
---
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
---
The shape is: (150, 5)
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Info: None
---
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
---
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64
---
3
---
3.2.5.2. EDA of Compustat (Firm accounting variables)¶
Compustat is a professional grade dataset that contains accounting data from SEC filings. I use it a lot in my work. Let’s explore it!
First, let’s download our slice of it. The variables are listed and described in a csv file in the repo’s data folder.
import pandas as pd
import numpy as np
# these three are used to download the file
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
url = 'https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/data/CCM_cleaned_for_class.zip?raw=true'
#firms = pd.read_stata(url)
# <-- that code would work, but GH said it was too big and
# forced me to zip it, so here is the work around to download it:
with urlopen(url) as request:
    data = BytesIO(request.read())
    with ZipFile(data) as archive:
        with archive.open(archive.namelist()[0]) as stata:
            ccm = pd.read_stata(stata)
There are too many variables to print .describe() nicely. Here's a trick: transpose the table:
ccm.describe().T.style.format('{:,.2f}') # by transposing it, I can see more variables
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
gvkey | 223,001.00 | 32,958.76 | 47,791.28 | 1,000.00 | 6,461.00 | 11,946.00 | 29,306.00 | 316,056.00 |
fyear | 223,001.00 | 1,994.72 | 10.40 | 1,975.00 | 1,986.00 | 1,995.00 | 2,003.00 | 2,014.00 |
lpermno | 223,001.00 | 59,497.47 | 26,609.23 | 10,000.00 | 37,218.00 | 68,073.00 | 81,740.00 | 93,436.00 |
sic | 223,001.00 | 4,667.96 | 1,952.31 | 100.00 | 3,341.00 | 4,512.00 | 6,036.00 | 9,997.00 |
sic3 | 223,001.00 | 466.61 | 195.24 | 10.00 | 334.00 | 451.00 | 603.00 | 999.00 |
age | 220,369.00 | 8.78 | 8.33 | 0.00 | 2.00 | 6.00 | 13.00 | 39.00 |
at | 223,001.00 | 5,113.07 | 55,838.12 | 0.00 | 32.35 | 163.06 | 957.00 | 3,771,199.85 |
me | 219,397.00 | 2,041.15 | 11,469.68 | 0.00 | 23.91 | 110.92 | 608.58 | 626,550.38 |
l_a | 222,978.00 | 5.23 | 2.41 | -6.91 | 3.48 | 5.09 | 6.86 | 15.14 |
l_sale | 218,779.00 | 4.78 | 2.45 | -6.91 | 3.24 | 4.76 | 6.38 | 13.07 |
prof_a | 217,167.00 | 0.05 | 1.16 | -218.00 | 0.02 | 0.10 | 0.16 | 198.67 |
mb | 219,269.00 | 2.01 | 9.93 | 0.05 | 1.00 | 1.25 | 1.93 | 2,705.39 |
ppe_a | 218,189.00 | 0.27 | 0.25 | 0.00 | 0.06 | 0.20 | 0.42 | 1.00 |
capx_a | 206,399.00 | 0.07 | 0.09 | -2.09 | 0.02 | 0.04 | 0.08 | 8.88 |
xrd_a | 223,001.00 | 0.04 | 0.16 | -0.38 | 0.00 | 0.00 | 0.03 | 18.37 |
cash_a | 222,332.00 | 0.16 | 0.20 | -0.07 | 0.02 | 0.07 | 0.21 | 1.00 |
div_d | 206,688.00 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
td | 221,491.00 | 1,365.23 | 16,589.58 | -166.15 | 1.64 | 21.56 | 206.04 | 961,732.00 |
td_a | 221,468.00 | 0.24 | 0.38 | -0.05 | 0.05 | 0.20 | 0.37 | 75.26 |
td_mv | 217,961.00 | 0.28 | 0.26 | -0.05 | 0.04 | 0.21 | 0.46 | 1.00 |
dltt_a | 222,002.00 | 0.18 | 0.32 | 0.00 | 0.01 | 0.12 | 0.28 | 75.26 |
dv_a | 206,679.00 | 0.02 | 0.18 | -0.01 | 0.00 | 0.00 | 0.01 | 63.53 |
invopps_FG09 | 193,956.00 | 1.75 | 10.35 | -0.10 | 0.71 | 1.03 | 1.72 | 2,705.25 |
sales_g | 196,652.00 | 0.64 | 42.13 | -199.29 | -0.02 | 0.09 | 0.24 | 11,879.50 |
short_debt | 194,560.00 | 0.32 | 0.33 | -4.67 | 0.05 | 0.20 | 0.53 | 22.59 |
long_debt_dum | 222,025.00 | 0.82 | 0.38 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
atr | 222,168.00 | 0.53 | 0.36 | 0.00 | 0.30 | 0.40 | 1.00 | 1.00 |
smalltaxlosscarry | 148,275.00 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
largetaxlosscarry | 148,275.00 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
tnic3hhi | 96,951.00 | 0.22 | 0.22 | 0.01 | 0.08 | 0.14 | 0.28 | 1.00 |
tnic3tsimm | 96,951.00 | 9.40 | 18.42 | 0.00 | 1.33 | 2.46 | 6.64 | 132.94 |
prodmktfluid | 88,332.00 | 7.48 | 3.91 | 0.00 | 4.63 | 6.79 | 9.63 | 40.78 |
delaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.07 | -0.01 | 0.05 | 0.46 |
equitydelaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.07 | -0.02 | 0.05 | 0.48 |
debtdelaycon | 57,130.00 | 0.00 | 0.06 | -0.18 | -0.04 | -0.00 | 0.04 | 0.35 |
privdelaycon | 57,130.00 | -0.00 | 0.08 | -0.26 | -0.06 | -0.01 | 0.05 | 0.42 |
l_emp | 208,991.00 | 1.04 | 1.16 | 0.00 | 0.15 | 0.58 | 1.56 | 7.84 |
l_ppent | 218,212.00 | 3.46 | 2.38 | 0.00 | 1.55 | 3.11 | 5.02 | 12.51 |
l_laborratio | 199,821.00 | 3.47 | 1.57 | -6.59 | 2.47 | 3.26 | 4.21 | 15.36 |
3.2.5.2.1. Observations¶
There are a lot of missing values! Worthy of more investigation…
Some firms have negative assets and market equity
What does it mean to have negative debt (td) or dividends (dv)?
Tax rates range from 0 to 1!
The taxlosscarry variables are booleans (1 or 0), as is long_debt_dum. (A few quick checks of these points are sketched below.)
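A few of those observations are easy to double-check directly. A quick sketch, using the ccm frame loaded above:
print((ccm['td'] < 0).sum(), 'firm-years with negative total debt')
print((ccm['dv_a'] < 0).sum(), 'firm-years with negative dividends (scaled by assets)')
print(ccm['atr'].dropna().between(0, 1).all(), '<- are all (non-missing) tax rates between 0 and 1?')
print(ccm['long_debt_dum'].dropna().isin([0, 1]).all(), '<- is long_debt_dum really just 0/1?')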
3.2.5.2.2. Looking for outliers¶
There are a lot of ways to go about this.
Use the describe() table, but print percentiles that focus on the tails and look for large jumps near the edges (like from p95 to max)
Make a table with skewness and kurtosis for each variable (look for fat and/or long tails) - see the sketch after this list
boxplots, but you’ll need to do a bunch of them
If you “standardize” all variables (subtract the mean and divide by the standard deviation - a common first step in many ML analyses!), you could plot the densities on a handful of charts or a tall ridgeline. Histograms might be better to spot outliers but would require ~40 figures.
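Here is a sketch of two of those approaches - a skewness/kurtosis table and standardized densities. (The three variables plotted are just illustrative picks from the list below.)
import pandas as pd
import matplotlib.pyplot as plt

num = ccm.select_dtypes('number')            # numeric variables only

# skewness and kurtosis: large values flag long and/or fat tails worth a closer look
print(pd.concat({'skew': num.skew(), 'kurtosis': num.kurtosis()}, axis=1)
        .sort_values('kurtosis', ascending=False)
        .head(10))

# "standardize" (subtract the mean, divide by the std), then overlay a few densities
std = (num - num.mean()) / num.std()
std[['prof_a', 'mb', 'sales_g']].plot.density(xlim=(-5, 5))
plt.show()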
Some variables I’d be concerned about (not an exhaustive list!):
The leverage variables that start with td and dltt
Sales growth
Profitability
Market-to-Book
# a table approach to support the list of variables above
ccm.describe(percentiles=[.01,.05,.95,.99]).T.style.format('{:,.2f}')
 | count | mean | std | min | 1% | 5% | 50% | 95% | 99% | max |
---|---|---|---|---|---|---|---|---|---|---|
gvkey | 223,001.00 | 32,958.76 | 47,791.28 | 1,000.00 | 1,235.00 | 2,085.00 | 11,946.00 | 156,613.00 | 186,230.00 | 316,056.00 |
fyear | 223,001.00 | 1,994.72 | 10.40 | 1,975.00 | 1,975.00 | 1,977.00 | 1,995.00 | 2,011.00 | 2,013.00 | 2,014.00 |
lpermno | 223,001.00 | 59,497.47 | 26,609.23 | 10,000.00 | 10,256.00 | 11,438.00 | 68,073.00 | 90,339.00 | 92,502.00 | 93,436.00 |
sic | 223,001.00 | 4,667.96 | 1,952.31 | 100.00 | 1,040.00 | 1,381.00 | 4,512.00 | 7,830.00 | 8,734.00 | 9,997.00 |
sic3 | 223,001.00 | 466.61 | 195.24 | 10.00 | 104.00 | 138.00 | 451.00 | 783.00 | 873.00 | 999.00 |
age | 220,369.00 | 8.78 | 8.33 | 0.00 | 0.00 | 0.00 | 6.00 | 26.00 | 35.00 | 39.00 |
at | 223,001.00 | 5,113.07 | 55,838.12 | 0.00 | 1.50 | 4.43 | 163.06 | 11,540.60 | 71,923.65 | 3,771,199.85 |
me | 219,397.00 | 2,041.15 | 11,469.68 | 0.00 | 1.10 | 3.66 | 110.92 | 7,204.16 | 38,432.21 | 626,550.38 |
l_a | 222,978.00 | 5.23 | 2.41 | -6.91 | 0.42 | 1.49 | 5.09 | 9.35 | 11.18 | 15.14 |
l_sale | 218,779.00 | 4.78 | 2.45 | -6.91 | -1.67 | 0.79 | 4.76 | 8.80 | 10.39 | 13.07 |
prof_a | 217,167.00 | 0.05 | 1.16 | -218.00 | -1.06 | -0.34 | 0.10 | 0.28 | 0.42 | 198.67 |
mb | 219,269.00 | 2.01 | 9.93 | 0.05 | 0.56 | 0.76 | 1.25 | 4.96 | 11.55 | 2,705.39 |
ppe_a | 218,189.00 | 0.27 | 0.25 | 0.00 | 0.00 | 0.01 | 0.20 | 0.80 | 0.91 | 1.00 |
capx_a | 206,399.00 | 0.07 | 0.09 | -2.09 | 0.00 | 0.00 | 0.04 | 0.22 | 0.41 | 8.88 |
xrd_a | 223,001.00 | 0.04 | 0.16 | -0.38 | 0.00 | 0.00 | 0.00 | 0.20 | 0.54 | 18.37 |
cash_a | 222,332.00 | 0.16 | 0.20 | -0.07 | 0.00 | 0.00 | 0.07 | 0.64 | 0.90 | 1.00 |
div_d | 206,688.00 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
td | 221,491.00 | 1,365.23 | 16,589.58 | -166.15 | 0.00 | 0.00 | 21.56 | 3,031.72 | 17,127.38 | 961,732.00 |
td_a | 221,468.00 | 0.24 | 0.38 | -0.05 | 0.00 | 0.00 | 0.20 | 0.66 | 0.94 | 75.26 |
td_mv | 217,961.00 | 0.28 | 0.26 | -0.05 | 0.00 | 0.00 | 0.21 | 0.79 | 0.93 | 1.00 |
dltt_a | 222,002.00 | 0.18 | 0.32 | 0.00 | 0.00 | 0.00 | 0.12 | 0.55 | 0.82 | 75.26 |
dv_a | 206,679.00 | 0.02 | 0.18 | -0.01 | 0.00 | 0.00 | 0.00 | 0.05 | 0.14 | 63.53 |
invopps_FG09 | 193,956.00 | 1.75 | 10.35 | -0.10 | 0.18 | 0.40 | 1.03 | 4.72 | 11.16 | 2,705.25 |
sales_g | 196,652.00 | 0.64 | 42.13 | -199.29 | -0.78 | -0.33 | 0.09 | 0.91 | 3.79 | 11,879.50 |
short_debt | 194,560.00 | 0.32 | 0.33 | -4.67 | 0.00 | 0.00 | 0.20 | 1.00 | 1.00 | 22.59 |
long_debt_dum | 222,025.00 | 0.82 | 0.38 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
atr | 222,168.00 | 0.53 | 0.36 | 0.00 | 0.00 | 0.00 | 0.40 | 1.00 | 1.00 | 1.00 |
smalltaxlosscarry | 148,275.00 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
largetaxlosscarry | 148,275.00 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
tnic3hhi | 96,951.00 | 0.22 | 0.22 | 0.01 | 0.02 | 0.04 | 0.14 | 0.73 | 1.00 | 1.00 |
tnic3tsimm | 96,951.00 | 9.40 | 18.42 | 0.00 | 1.00 | 1.03 | 2.46 | 53.75 | 95.21 | 132.94 |
prodmktfluid | 88,332.00 | 7.48 | 3.91 | 0.00 | 1.49 | 2.48 | 6.79 | 14.67 | 19.09 | 40.78 |
delaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.19 | -0.15 | -0.01 | 0.16 | 0.25 | 0.46 |
equitydelaycon | 57,130.00 | -0.01 | 0.09 | -0.28 | -0.18 | -0.14 | -0.02 | 0.16 | 0.25 | 0.48 |
debtdelaycon | 57,130.00 | 0.00 | 0.06 | -0.18 | -0.11 | -0.08 | -0.00 | 0.11 | 0.16 | 0.35 |
privdelaycon | 57,130.00 | -0.00 | 0.08 | -0.26 | -0.16 | -0.12 | -0.01 | 0.15 | 0.22 | 0.42 |
l_emp | 208,991.00 | 1.04 | 1.16 | 0.00 | 0.00 | 0.01 | 0.58 | 3.55 | 4.79 | 7.84 |
l_ppent | 218,212.00 | 3.46 | 2.38 | 0.00 | 0.00 | 0.19 | 3.11 | 7.90 | 9.60 | 12.51 |
l_laborratio | 199,821.00 | 3.47 | 1.57 | -6.59 | 0.26 | 1.34 | 3.26 | 6.47 | 8.15 | 15.36 |
3.2.5.2.3. Missing values¶
Below is a list of variables that are missing values for more than 10% of the dataset. Future work will need to understand why there are missing values and understand whether and how we should deal with them.
(
    (   # these lines do the calculation - what % of values are missing for each variable
        ccm.isna()       # flags every obs/variable as True (1) when it's missing, False (0) otherwise
        .sum(axis=0)     # count the number of missing obs for each variable
        / len(ccm)       # convert # missing to fraction missing
        * 100            # report as a percentage
    )
    # you can stop here and report this...
    # but I wanted to format it a bit:
    .sort_values(ascending=False)[:13]
    .to_frame(name='% missing')   # .style only works on a frame; at this point pandas sees a
    .style.format("{:.1f}")       # "series" object, so convert it to a dataframe object first
)
 | % missing |
---|---|
privdelaycon | 74.4 |
debtdelaycon | 74.4 |
equitydelaycon | 74.4 |
delaycon | 74.4 |
prodmktfluid | 60.4 |
tnic3tsimm | 56.5 |
tnic3hhi | 56.5 |
smalltaxlosscarry | 33.5 |
largetaxlosscarry | 33.5 |
invopps_FG09 | 13.0 |
short_debt | 12.8 |
sales_g | 11.8 |
l_laborratio | 10.4 |