3.2.5. (Golden Rules for) Exploratory Data Analysis (EDA)¶

What’s below is some hard-earned wisdom. Please accept this into your heart and go forth such that data never harms you as it has so many before you.

https://media.giphy.com/media/Wn74RUT0vjnoU98Hnt/source.gif

Fig. 3.1 You, soon to be wise¶

1. GOLDEN RULES for EDA

You got new data - fun! But before you go speeding off into analysis, first you must learn some basic facts about the data. Reading the dataset’s documentation is 10/10 a great idea (sadly though many datasets have bad documentation).

Open the dataset’s documentation. (Save it in the folder alongside the raw data if possible.)
Reshape your data so that it is tall and allows us to have normalized data
Identify the primary key and the unit of observation in this dataset
- The key is a variable (or set of variables, like ticker + date) that should uniquely identify an observation.
- The “unit level” can be increments of time (e.g. daily, monthly, or yearly), the type of entity (e.g. firm, person, state, country, industry), and also combinations of entity type and time (e.g. “firm” level, “firm-year” level)
Learn about the variables (The documentation will help.)
1. Print the column names
2. Remove irrelevant, garbage, or empty variables.
3. Understand the definition, origin, and units of each variable, and document as necessary.
4. Rename variables as needed (short but descriptive enough)
Clean up rows:
- Delete duplicate or empty rows.
- Are there missing observations or a gap in the time series? If so, you can either ignore it (but know that this might cause issues later) or try to fix it.
If there are duplicate keys, resolve (remove true duplicates, or redefine the primary key).
- Two 2001-GM observations because they changed their fiscal year in the middle of calendar year 2001. Depending on your analysis specifics, you will have to pick one of these.

You should always, always, always open your datasets up to physically (digitally?) look at them1 and then also generate summary statistics. Here are some basic outputs you should examine carefully.

What is the shape (# rows and variables) of the data? df.shape
How much memory does it take, and what are the variable names/types? df.info()
Summary stats on variables:
- Continuous variables: df.describe() - count, mean, sd, etc
- Categorical variables: df['var'].value_counts().head(10) - the most common values of a variable
- Any variable: df['var'].nunique() - the number of unique values of a variable

Tip

Automate your initial EDA by putting something like the below in your codebook. Double tip: Look in the community codebook :)

You can refine this code as you learn - you’ll see more and more tips on how to improve your EDA as we go.

Warning

Two things:

Don’t just run these and move on! Actually look at the output and check for possible issues. (Some possible issues are listed in the next drop down.)
This isn’t a comprehensive list of things I’d do to check datasets, merely a reasonable start!

3.2.5.1. Acknowledgments¶

The material above was recently improved by integrating the data cleaning checklist Nick Hagerty assembled for his class.

3.2.5.2. Toy EDA¶

I do all of the suggested steps here. The output isn’t pretty, and I’d clean it up if I was writing a report. But this gets the ball rolling.

Note

This data only has one categorical variable (species). You should do value_counts and nunique for all categorical variables.

# import a famous dataset, seaborn nicely contains it out of the box!
import seaborn as sns 
iris = sns.load_dataset('iris') 

print(iris.head(),  '\n---')
print(iris.tail(),  '\n---')
print(iris.columns, '\n---')
print("The shape is: ",iris.shape, '\n---')
print("Info:",iris.info(), '\n---') # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
print(iris.describe(), '\n---') # summary stats, and you can customize the list!
print(iris['species'].value_counts()[:10], '\n---')
print(iris['species'].nunique(), '\n---')

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa 
---
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica 
---
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object') 
---
The shape is:  (150, 5) 
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Info: None 
---
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000 
---
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64 
---
3 
---

3.2.5.3. EDA of Compustat (Firm accounting variables)¶

Compustat is a professional grade dataset that contains accounting data from SEC filings. I use it a lot in my work. Let’s explore it!

First, let’s download our slice of it. The variables are listed and described in a csv file in the org’s data folder.

import pandas as pd
import numpy as np
# these three are used to download the file
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

url = 'https://github.com/LeDataSciFi/data/blob/main/Firm%20Year%20Datasets%20(Compustat)/CCM_cleaned_for_class.zip?raw=true'

#firms = pd.read_stata(url)   
# <-- that code would work, but GH said it was too big and
# forced me to zip it, so here is the work around to download it:

with urlopen(url) as request:
    data = BytesIO(request.read())

with ZipFile(data) as archive:
    with archive.open(archive.namelist()[0]) as stata:
        ccm = pd.read_stata(stata)

There are too many variables to print .describe() nicely. Here’s a trick: transpose the table:

ccm.describe().T.style.format('{:,.2f}') # by transposing it, I can see more variables

	count	mean	std	min	25%	50%	75%	max
gvkey	223,001.00	32,958.76	47,791.28	1,000.00	6,461.00	11,946.00	29,306.00	316,056.00
fyear	223,001.00	1,994.72	10.40	1,975.00	1,986.00	1,995.00	2,003.00	2,014.00
lpermno	223,001.00	59,497.47	26,609.23	10,000.00	37,218.00	68,073.00	81,740.00	93,436.00
sic	223,001.00	4,667.96	1,952.31	100.00	3,341.00	4,512.00	6,036.00	9,997.00
sic3	223,001.00	466.61	195.24	10.00	334.00	451.00	603.00	999.00
age	220,369.00	8.78	8.33	0.00	2.00	6.00	13.00	39.00
at	223,001.00	5,113.07	55,838.12	0.00	32.35	163.06	957.00	3,771,199.85
me	219,397.00	2,041.15	11,469.68	0.00	23.91	110.92	608.58	626,550.38
l_a	222,978.00	5.23	2.41	-6.91	3.48	5.09	6.86	15.14
l_sale	218,779.00	4.78	2.45	-6.91	3.24	4.76	6.38	13.07
prof_a	217,167.00	0.05	1.16	-218.00	0.02	0.10	0.16	198.67
mb	219,269.00	2.01	9.93	0.05	1.00	1.25	1.93	2,705.39
ppe_a	218,189.00	0.27	0.25	0.00	0.06	0.20	0.42	1.00
capx_a	206,399.00	0.07	0.09	-2.09	0.02	0.04	0.08	8.88
xrd_a	223,001.00	0.04	0.16	-0.38	0.00	0.00	0.03	18.37
cash_a	222,332.00	0.16	0.20	-0.07	0.02	0.07	0.21	1.00
div_d	206,688.00	0.45	0.50	0.00	0.00	0.00	1.00	1.00
td	221,491.00	1,365.23	16,589.58	-166.15	1.64	21.56	206.04	961,732.00
td_a	221,468.00	0.24	0.38	-0.05	0.05	0.20	0.37	75.26
td_mv	217,961.00	0.28	0.26	-0.05	0.04	0.21	0.46	1.00
dltt_a	222,002.00	0.18	0.32	0.00	0.01	0.12	0.28	75.26
dv_a	206,679.00	0.02	0.18	-0.01	0.00	0.00	0.01	63.53
invopps_FG09	193,956.00	1.75	10.35	-0.10	0.71	1.03	1.72	2,705.25
sales_g	196,652.00	0.64	42.13	-199.29	-0.02	0.09	0.24	11,879.50
short_debt	194,560.00	0.32	0.33	-4.67	0.05	0.20	0.53	22.59
long_debt_dum	222,025.00	0.82	0.38	0.00	1.00	1.00	1.00	1.00
atr	222,168.00	0.53	0.36	0.00	0.30	0.40	1.00	1.00
smalltaxlosscarry	148,275.00	0.17	0.38	0.00	0.00	0.00	0.00	1.00
largetaxlosscarry	148,275.00	0.26	0.44	0.00	0.00	0.00	1.00	1.00
tnic3hhi	96,951.00	0.22	0.22	0.01	0.08	0.14	0.28	1.00
tnic3tsimm	96,951.00	9.40	18.42	0.00	1.33	2.46	6.64	132.94
prodmktfluid	88,332.00	7.48	3.91	0.00	4.63	6.79	9.63	40.78
delaycon	57,130.00	-0.01	0.09	-0.28	-0.07	-0.01	0.05	0.46
equitydelaycon	57,130.00	-0.01	0.09	-0.28	-0.07	-0.02	0.05	0.48
debtdelaycon	57,130.00	0.00	0.06	-0.18	-0.04	-0.00	0.04	0.35
privdelaycon	57,130.00	-0.00	0.08	-0.26	-0.06	-0.01	0.05	0.42
l_emp	208,991.00	1.04	1.16	0.00	0.15	0.58	1.56	7.84
l_ppent	218,212.00	3.46	2.38	0.00	1.55	3.11	5.02	12.51
l_laborratio	199,821.00	3.47	1.57	-6.59	2.47	3.26	4.21	15.36

3.2.5.3.1. Observations¶

There are a lot of missing variables! Worthy of more investigation…
Some firms have negative assets, market equity
What does it mean to have negative debt (td) or dividends (dv)?
Tax rates range from 0 to 1!
Taxlosscarry variables are booleans (1 or 0) as is longdebtdum

3.2.5.3.2. Looking for outliers¶

There are a lot of ways to go about this.

table, but print out percentiles that focus on the tails and look for large jumps near the edges (like from p95 to max)
make a table with skewness and kurtosis variables (look for fat and/or long tails)
boxplots, but you’ll need to do a bunch of them
If you “standardize” all variables (subtract the mean and divide by the standard deviation - a common first step in many ML analyses!), you could plot the densities on a handful of charts or a tall ridgeline. Histograms might be better to spot outliers but would require ~40 figures.

Some variables I’d be concerned about (not an exhaustive list!):

The leverage variables that start with td and dltt
Sales growth
Profitability
Market-to-Book

# a table approach to support the list of variables above
ccm.describe(percentiles=[.01,.05,.95,.99]).T.style.format('{:,.2f}')

	count	mean	std	min	1%	5%	50%	95%	99%	max
gvkey	223,001.00	32,958.76	47,791.28	1,000.00	1,235.00	2,085.00	11,946.00	156,613.00	186,230.00	316,056.00
fyear	223,001.00	1,994.72	10.40	1,975.00	1,975.00	1,977.00	1,995.00	2,011.00	2,013.00	2,014.00
lpermno	223,001.00	59,497.47	26,609.23	10,000.00	10,256.00	11,438.00	68,073.00	90,339.00	92,502.00	93,436.00
sic	223,001.00	4,667.96	1,952.31	100.00	1,040.00	1,381.00	4,512.00	7,830.00	8,734.00	9,997.00
sic3	223,001.00	466.61	195.24	10.00	104.00	138.00	451.00	783.00	873.00	999.00
age	220,369.00	8.78	8.33	0.00	0.00	0.00	6.00	26.00	35.00	39.00
at	223,001.00	5,113.07	55,838.12	0.00	1.50	4.43	163.06	11,540.60	71,923.65	3,771,199.85
me	219,397.00	2,041.15	11,469.68	0.00	1.10	3.66	110.92	7,204.16	38,432.21	626,550.38
l_a	222,978.00	5.23	2.41	-6.91	0.42	1.49	5.09	9.35	11.18	15.14
l_sale	218,779.00	4.78	2.45	-6.91	-1.67	0.79	4.76	8.80	10.39	13.07
prof_a	217,167.00	0.05	1.16	-218.00	-1.06	-0.34	0.10	0.28	0.42	198.67
mb	219,269.00	2.01	9.93	0.05	0.56	0.76	1.25	4.96	11.55	2,705.39
ppe_a	218,189.00	0.27	0.25	0.00	0.00	0.01	0.20	0.80	0.91	1.00
capx_a	206,399.00	0.07	0.09	-2.09	0.00	0.00	0.04	0.22	0.41	8.88
xrd_a	223,001.00	0.04	0.16	-0.38	0.00	0.00	0.00	0.20	0.54	18.37
cash_a	222,332.00	0.16	0.20	-0.07	0.00	0.00	0.07	0.64	0.90	1.00
div_d	206,688.00	0.45	0.50	0.00	0.00	0.00	0.00	1.00	1.00	1.00
td	221,491.00	1,365.23	16,589.58	-166.15	0.00	0.00	21.56	3,031.72	17,127.38	961,732.00
td_a	221,468.00	0.24	0.38	-0.05	0.00	0.00	0.20	0.66	0.94	75.26
td_mv	217,961.00	0.28	0.26	-0.05	0.00	0.00	0.21	0.79	0.93	1.00
dltt_a	222,002.00	0.18	0.32	0.00	0.00	0.00	0.12	0.55	0.82	75.26
dv_a	206,679.00	0.02	0.18	-0.01	0.00	0.00	0.00	0.05	0.14	63.53
invopps_FG09	193,956.00	1.75	10.35	-0.10	0.18	0.40	1.03	4.72	11.16	2,705.25
sales_g	196,652.00	0.64	42.13	-199.29	-0.78	-0.33	0.09	0.91	3.79	11,879.50
short_debt	194,560.00	0.32	0.33	-4.67	0.00	0.00	0.20	1.00	1.00	22.59
long_debt_dum	222,025.00	0.82	0.38	0.00	0.00	0.00	1.00	1.00	1.00	1.00
atr	222,168.00	0.53	0.36	0.00	0.00	0.00	0.40	1.00	1.00	1.00
smalltaxlosscarry	148,275.00	0.17	0.38	0.00	0.00	0.00	0.00	1.00	1.00	1.00
largetaxlosscarry	148,275.00	0.26	0.44	0.00	0.00	0.00	0.00	1.00	1.00	1.00
tnic3hhi	96,951.00	0.22	0.22	0.01	0.02	0.04	0.14	0.73	1.00	1.00
tnic3tsimm	96,951.00	9.40	18.42	0.00	1.00	1.03	2.46	53.75	95.21	132.94
prodmktfluid	88,332.00	7.48	3.91	0.00	1.49	2.48	6.79	14.67	19.09	40.78
delaycon	57,130.00	-0.01	0.09	-0.28	-0.19	-0.15	-0.01	0.16	0.25	0.46
equitydelaycon	57,130.00	-0.01	0.09	-0.28	-0.18	-0.14	-0.02	0.16	0.25	0.48
debtdelaycon	57,130.00	0.00	0.06	-0.18	-0.11	-0.08	-0.00	0.11	0.16	0.35
privdelaycon	57,130.00	-0.00	0.08	-0.26	-0.16	-0.12	-0.01	0.15	0.22	0.42
l_emp	208,991.00	1.04	1.16	0.00	0.00	0.01	0.58	3.55	4.79	7.84
l_ppent	218,212.00	3.46	2.38	0.00	0.00	0.19	3.11	7.90	9.60	12.51
l_laborratio	199,821.00	3.47	1.57	-6.59	0.26	1.34	3.26	6.47	8.15	15.36

3.2.5.3.3. Missing values¶

Below is a list of variables that are missing values for more than 10% of the dataset. Future work will need to understand why there are missing values and understand whether and how we should deal with them.

(
    ( # these lines do the calculation - what % of missing values are there for each var
        ccm.isna()      # ccm.isna() TURNS every obs/variable = 1 when its missing and 0 else
       .sum(axis=0)     # count the number of na for each variable (now data is 1 obs per column = # missing)
        /len(ccm)       # convert # missing to % missing 
        *100            # report as percentage
    ) 
    # you can stop here and report this...
    # but I wanted to format it a bit...
    .sort_values(ascending=False)[:13]
    .to_frame(name='% missing') # the next line only works on a frame, and because pandas sees only 1 variable at this pt
    .style.format("{:.1f}")     # in the code, it calls this a "series" type object, so convert it to dataframe type object
)
#

	% missing
privdelaycon	74.4
debtdelaycon	74.4
equitydelaycon	74.4
delaycon	74.4
prodmktfluid	60.4
tnic3tsimm	56.5
tnic3hhi	56.5
smalltaxlosscarry	33.5
largetaxlosscarry	33.5
invopps_FG09	13.0
short_debt	12.8
sales_g	11.8
l_laborratio	10.4

1: Remember this?

LeDataSciFi-2023