3.2.5. (Golden Rules for) Exploratory Data Analysis (EDA)

What’s below is some hard-earned wisdom. Please accept it into your heart and go forth so that data never harms you as it has so many before you.

Fig. 3.1 You, soon to be wise

3.2.5.1. Toy EDA

I do all of the suggested steps here. The output isn’t pretty, and I’d clean it up if I were writing a report. But this gets the ball rolling.

Note

This data only has one categorical variable (species). You should run value_counts() and nunique() on every categorical variable in your data.
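If your data has many categorical variables, you don’t have to type those calls out one by one. Here is a minimal sketch of a loop that covers them all (df stands in for your dataframe, and categorical_summary is just an illustrative name, not something used elsewhere in this chapter):

import pandas as pd

def categorical_summary(df: pd.DataFrame) -> None:
    # select_dtypes grabs the string/categorical columns in one pass
    for col in df.select_dtypes(include=['object', 'category']).columns:
        print(f"{col}: {df[col].nunique()} unique values")
        print(df[col].value_counts()[:10], '\n---')  # top 10 only, like below

After loading the iris data below, categorical_summary(iris) would print the species counts you see in the output.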

# import a famous dataset, seaborn nicely contains it out of the box!
import seaborn as sns 
iris = sns.load_dataset('iris') 

print(iris.head(),  '\n---')
print(iris.tail(),  '\n---')
print(iris.columns, '\n---')
print("The shape is: ",iris.shape, '\n---')
print("Info:",iris.info(), '\n---') # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
print(iris.describe(), '\n---') # summary stats, and you can customize the list!
print(iris['species'].value_counts()[:10], '\n---')
print(iris['species'].nunique(), '\n---')
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa 
---
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica 
---
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object') 
---
The shape is:  (150, 5) 
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
---
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000 
---
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64 
---
3 
---

3.2.5.2. EDA of Compustat (Firm accounting variables)

Compustat is a professional-grade dataset that contains accounting data from SEC filings. I use it a lot in my work. Let’s explore it!

First, let’s download our slice of it. The variables are listed and described in a CSV file in the repo’s data folder.

import pandas as pd
import numpy as np
# these three are used to download the file
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

url = 'https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/data/CCM_cleaned_for_class.zip?raw=true'

# firms = pd.read_stata(url)
# <-- that code would work, but GitHub said the file was too big and
# forced me to zip it, so here is the workaround to download it:

with urlopen(url) as request:
    data = BytesIO(request.read())

with ZipFile(data) as archive:
    with archive.open(archive.namelist()[0]) as stata:
        ccm = pd.read_stata(stata)

There are too many variables for .describe() to print nicely. Here’s a trick: transpose the table:

ccm.describe().T.style.format('{:,.2f}') # by transposing it, I can see more variables
  count mean std min 25% 50% 75% max
gvkey 223,001.00 32,958.76 47,791.28 1,000.00 6,461.00 11,946.00 29,306.00 316,056.00
fyear 223,001.00 1,994.72 10.40 1,975.00 1,986.00 1,995.00 2,003.00 2,014.00
lpermno 223,001.00 59,497.47 26,609.23 10,000.00 37,218.00 68,073.00 81,740.00 93,436.00
sic 223,001.00 4,667.96 1,952.31 100.00 3,341.00 4,512.00 6,036.00 9,997.00
sic3 223,001.00 466.61 195.24 10.00 334.00 451.00 603.00 999.00
age 220,369.00 8.78 8.33 0.00 2.00 6.00 13.00 39.00
at 223,001.00 5,113.07 55,838.12 0.00 32.35 163.06 957.00 3,771,199.85
me 219,397.00 2,041.15 11,469.68 0.00 23.91 110.92 608.58 626,550.38
l_a 222,978.00 5.23 2.41 -6.91 3.48 5.09 6.86 15.14
l_sale 218,779.00 4.78 2.45 -6.91 3.24 4.76 6.38 13.07
prof_a 217,167.00 0.05 1.16 -218.00 0.02 0.10 0.16 198.67
mb 219,269.00 2.01 9.93 0.05 1.00 1.25 1.93 2,705.39
ppe_a 218,189.00 0.27 0.25 0.00 0.06 0.20 0.42 1.00
capx_a 206,399.00 0.07 0.09 -2.09 0.02 0.04 0.08 8.88
xrd_a 223,001.00 0.04 0.16 -0.38 0.00 0.00 0.03 18.37
cash_a 222,332.00 0.16 0.20 -0.07 0.02 0.07 0.21 1.00
div_d 206,688.00 0.45 0.50 0.00 0.00 0.00 1.00 1.00
td 221,491.00 1,365.23 16,589.58 -166.15 1.64 21.56 206.04 961,732.00
td_a 221,468.00 0.24 0.38 -0.05 0.05 0.20 0.37 75.26
td_mv 217,961.00 0.28 0.26 -0.05 0.04 0.21 0.46 1.00
dltt_a 222,002.00 0.18 0.32 0.00 0.01 0.12 0.28 75.26
dv_a 206,679.00 0.02 0.18 -0.01 0.00 0.00 0.01 63.53
invopps_FG09 193,956.00 1.75 10.35 -0.10 0.71 1.03 1.72 2,705.25
sales_g 196,652.00 0.64 42.13 -199.29 -0.02 0.09 0.24 11,879.50
short_debt 194,560.00 0.32 0.33 -4.67 0.05 0.20 0.53 22.59
long_debt_dum 222,025.00 0.82 0.38 0.00 1.00 1.00 1.00 1.00
atr 222,168.00 0.53 0.36 0.00 0.30 0.40 1.00 1.00
smalltaxlosscarry 148,275.00 0.17 0.38 0.00 0.00 0.00 0.00 1.00
largetaxlosscarry 148,275.00 0.26 0.44 0.00 0.00 0.00 1.00 1.00
tnic3hhi 96,951.00 0.22 0.22 0.01 0.08 0.14 0.28 1.00
tnic3tsimm 96,951.00 9.40 18.42 0.00 1.33 2.46 6.64 132.94
prodmktfluid 88,332.00 7.48 3.91 0.00 4.63 6.79 9.63 40.78
delaycon 57,130.00 -0.01 0.09 -0.28 -0.07 -0.01 0.05 0.46
equitydelaycon 57,130.00 -0.01 0.09 -0.28 -0.07 -0.02 0.05 0.48
debtdelaycon 57,130.00 0.00 0.06 -0.18 -0.04 -0.00 0.04 0.35
privdelaycon 57,130.00 -0.00 0.08 -0.26 -0.06 -0.01 0.05 0.42
l_emp 208,991.00 1.04 1.16 0.00 0.15 0.58 1.56 7.84
l_ppent 218,212.00 3.46 2.38 0.00 1.55 3.11 5.02 12.51
l_laborratio 199,821.00 3.47 1.57 -6.59 2.47 3.26 4.21 15.36

3.2.5.2.1. Observations

  • There are a lot of missing values! Worthy of more investigation…

  • Some firms have total assets (at) and market equity (me) of essentially zero

  • What does it mean to have negative debt (td) or dividends (dv_a)?

  • Tax rates range from 0 to 1!

  • The smalltaxlosscarry and largetaxlosscarry variables are dummies (1 or 0), as is long_debt_dum

3.2.5.2.2. Looking for outliers

There are a lot of ways to go about this.

  • table, but print out percentiles that focus on the tails and look for large jumps near the edges (like from p95 to max)

  • make a table with skewness and kurtosis statistics (look for fat and/or long tails); see the sketch after this list

  • boxplots, but you’ll need to do a bunch of them

  • If you “standardize” all variables (subtract the mean and divide by the standard deviation - a common first step in many ML analyses!), you could plot the densities on a handful of charts or a tall ridgeline, or throw every standardized variable into a single boxplot (also sketched below). Histograms might be better for spotting outliers, but would require ~40 figures.
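Here is a minimal sketch of the skewness/kurtosis table and the standardize-then-plot ideas. It assumes ccm is loaded as above and matplotlib is installed; the numeric-column selection and the sort order are my choices, not part of the original code:

import pandas as pd
import matplotlib.pyplot as plt

num = ccm.select_dtypes(include='number')  # keep only the numeric variables

# a skewness/kurtosis table - large values flag long and/or fat tails
tails = pd.DataFrame({'skew': num.skew(), 'kurtosis': num.kurtosis()})
print(tails.sort_values('kurtosis', ascending=False)[:10])

# standardize every variable, then a single boxplot covers them all
std = (num - num.mean()) / num.std()
std.boxplot(rot=90, figsize=(12, 4))       # outliers sit far from their boxes
plt.show()

Variables like sales_g and mb should jump out in both the table and the plot.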

Some variables I’d be concerned about (not an exhaustive list!):

  • The leverage variables that start with td and dltt

  • Sales growth

  • Profitability

  • Market-to-Book

# a table approach to support the list of variables above
ccm.describe(percentiles=[.01,.05,.95,.99]).T.style.format('{:,.2f}')
  count mean std min 1% 5% 50% 95% 99% max
gvkey 223,001.00 32,958.76 47,791.28 1,000.00 1,235.00 2,085.00 11,946.00 156,613.00 186,230.00 316,056.00
fyear 223,001.00 1,994.72 10.40 1,975.00 1,975.00 1,977.00 1,995.00 2,011.00 2,013.00 2,014.00
lpermno 223,001.00 59,497.47 26,609.23 10,000.00 10,256.00 11,438.00 68,073.00 90,339.00 92,502.00 93,436.00
sic 223,001.00 4,667.96 1,952.31 100.00 1,040.00 1,381.00 4,512.00 7,830.00 8,734.00 9,997.00
sic3 223,001.00 466.61 195.24 10.00 104.00 138.00 451.00 783.00 873.00 999.00
age 220,369.00 8.78 8.33 0.00 0.00 0.00 6.00 26.00 35.00 39.00
at 223,001.00 5,113.07 55,838.12 0.00 1.50 4.43 163.06 11,540.60 71,923.65 3,771,199.85
me 219,397.00 2,041.15 11,469.68 0.00 1.10 3.66 110.92 7,204.16 38,432.21 626,550.38
l_a 222,978.00 5.23 2.41 -6.91 0.42 1.49 5.09 9.35 11.18 15.14
l_sale 218,779.00 4.78 2.45 -6.91 -1.67 0.79 4.76 8.80 10.39 13.07
prof_a 217,167.00 0.05 1.16 -218.00 -1.06 -0.34 0.10 0.28 0.42 198.67
mb 219,269.00 2.01 9.93 0.05 0.56 0.76 1.25 4.96 11.55 2,705.39
ppe_a 218,189.00 0.27 0.25 0.00 0.00 0.01 0.20 0.80 0.91 1.00
capx_a 206,399.00 0.07 0.09 -2.09 0.00 0.00 0.04 0.22 0.41 8.88
xrd_a 223,001.00 0.04 0.16 -0.38 0.00 0.00 0.00 0.20 0.54 18.37
cash_a 222,332.00 0.16 0.20 -0.07 0.00 0.00 0.07 0.64 0.90 1.00
div_d 206,688.00 0.45 0.50 0.00 0.00 0.00 0.00 1.00 1.00 1.00
td 221,491.00 1,365.23 16,589.58 -166.15 0.00 0.00 21.56 3,031.72 17,127.38 961,732.00
td_a 221,468.00 0.24 0.38 -0.05 0.00 0.00 0.20 0.66 0.94 75.26
td_mv 217,961.00 0.28 0.26 -0.05 0.00 0.00 0.21 0.79 0.93 1.00
dltt_a 222,002.00 0.18 0.32 0.00 0.00 0.00 0.12 0.55 0.82 75.26
dv_a 206,679.00 0.02 0.18 -0.01 0.00 0.00 0.00 0.05 0.14 63.53
invopps_FG09 193,956.00 1.75 10.35 -0.10 0.18 0.40 1.03 4.72 11.16 2,705.25
sales_g 196,652.00 0.64 42.13 -199.29 -0.78 -0.33 0.09 0.91 3.79 11,879.50
short_debt 194,560.00 0.32 0.33 -4.67 0.00 0.00 0.20 1.00 1.00 22.59
long_debt_dum 222,025.00 0.82 0.38 0.00 0.00 0.00 1.00 1.00 1.00 1.00
atr 222,168.00 0.53 0.36 0.00 0.00 0.00 0.40 1.00 1.00 1.00
smalltaxlosscarry 148,275.00 0.17 0.38 0.00 0.00 0.00 0.00 1.00 1.00 1.00
largetaxlosscarry 148,275.00 0.26 0.44 0.00 0.00 0.00 0.00 1.00 1.00 1.00
tnic3hhi 96,951.00 0.22 0.22 0.01 0.02 0.04 0.14 0.73 1.00 1.00
tnic3tsimm 96,951.00 9.40 18.42 0.00 1.00 1.03 2.46 53.75 95.21 132.94
prodmktfluid 88,332.00 7.48 3.91 0.00 1.49 2.48 6.79 14.67 19.09 40.78
delaycon 57,130.00 -0.01 0.09 -0.28 -0.19 -0.15 -0.01 0.16 0.25 0.46
equitydelaycon 57,130.00 -0.01 0.09 -0.28 -0.18 -0.14 -0.02 0.16 0.25 0.48
debtdelaycon 57,130.00 0.00 0.06 -0.18 -0.11 -0.08 -0.00 0.11 0.16 0.35
privdelaycon 57,130.00 -0.00 0.08 -0.26 -0.16 -0.12 -0.01 0.15 0.22 0.42
l_emp 208,991.00 1.04 1.16 0.00 0.00 0.01 0.58 3.55 4.79 7.84
l_ppent 218,212.00 3.46 2.38 0.00 0.00 0.19 3.11 7.90 9.60 12.51
l_laborratio 199,821.00 3.47 1.57 -6.59 0.26 1.34 3.26 6.47 8.15 15.36

3.2.5.2.3. Missing values

Below is a list of variables that are missing for more than 10% of the observations. Future work will need to understand why these values are missing, and whether and how we should deal with them.

(
    ( # these lines do the calculation - what % of values are missing for each var
        ccm.isna()      # .isna() TURNS every obs/variable into 1 when it's missing and 0 else
       .sum(axis=0)     # count the number of NAs for each variable (now data is 1 obs per column = # missing)
        /len(ccm)       # convert # missing to fraction missing
        *100            # report as percentage
    )
    # you can stop here and report this...
    # but I wanted to sort and format it a bit:
    .sort_values(ascending=False)[:13]
    .to_frame(name='% missing') # .style only works on a dataframe, but pandas sees only 1 variable
    .style.format("{:.1f}")     # at this point (a "series"), so convert it to a dataframe first
)
  % missing
privdelaycon 74.4
debtdelaycon 74.4
equitydelaycon 74.4
delaycon 74.4
prodmktfluid 60.4
tnic3tsimm 56.5
tnic3hhi 56.5
smalltaxlosscarry 33.5
largetaxlosscarry 33.5
invopps_FG09 13.0
short_debt 12.8
sales_g 11.8
l_laborratio 10.4
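Once the reasons for missingness are understood, the usual pandas moves look something like this. It’s a sketch of the menu of options, not a recommendation for any particular variable here (the columns used are just for illustration):

# each option is a judgment call - the EDA above tells you which (if any) fits
ccm_drop = ccm.dropna(subset=['sales_g'])                      # drop rows missing a key variable
ccm_fill = ccm.fillna({'tnic3hhi': ccm['tnic3hhi'].median()})  # impute, e.g., with the median
ccm['has_delaycon'] = ccm['delaycon'].notna()                  # or flag missingness as its own variable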
