3.4.3. Dealing with Outliers
Outliers can indicate a problem with your dataset (e.g. data errors to fix or encoding choices to understand¹). And even if they are correct data, they might skew your analysis so that it’s less useful! For example, in the first plot on the “Role of Viz” page, outliers in datasets III and IV will cause your analysis (“What is the slope of the line describing how X and Y are related?”) to give an answer that doesn’t reflect a layman’s sense of the “typical” relationship in those datasets.
3.4.3.1. Finding outliers
How can we find outliers?
- Plot your data: scatterplot, hexplot, box plot, density, etc.
- Compute z-scores and .describe() the data (a quick sketch of this follows below)
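The second route boils down to standardizing each variable and eyeballing the tails. A minimal sketch, assuming df is a dataframe containing only numeric columns:

# z-score each column: subtract its mean, divide by its standard deviation
z = (df - df.mean()) / df.std()
# the min/max and extreme percentiles reveal which variables have huge |z| values
z.describe(percentiles=[.01, .99])

The outlier_report function we load below packages this same idea, with sorting and highlighting added.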
Plotting is essential in EDA and data cleaning, as we’ve covered. Still, I’m going to suggest the second route as a quick and systematic way to identify which variables you should look into more.
Let’s download our firm data (ccm).
# copied from 3.3.4.1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# these three are used to download the file
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
url = 'https://github.com/LeDataSciFi/ledatascifi-2024/blob/main/data/CCM_cleaned_for_class.zip?raw=true'
#firms = pd.read_stata(url)
# <-- that code would work, but GH said it was too big and
# forced me to zip it, so here is the work around to download it:
with urlopen(url) as request:
    data = BytesIO(request.read())
with ZipFile(data) as archive:
    with archive.open(archive.namelist()[0]) as stata:
        ccm = pd.read_stata(stata)
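If you want to start with the first route, a quick plot of any variable you’re worried about works; for example (a sketch, using prof_a, one of the variables we examine below):

# a box plot makes extreme values pop out immediately
sns.boxplot(x=ccm['prof_a'])
plt.show()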
And load a utility function called outlier_report from the community codebook.
def outlier_report(df, vars_to_examine=None, color='red', thres=4,
                   return_df=False, no_print=False):
    '''
    Parameters
    ----------
    df : DATAFRAME
        Input dataframe
    vars_to_examine : LIST, optional
        List of variables to examine from dataframe. The default is df.columns.
    color : STRING, optional
        Color for cell highlighting. The default is 'red'.
    thres : int, optional
        Highlight cells where the z-score is above thres. The default is 4.
    return_df : Boolean, optional
        If True, will return the df obj (without styling) for further use.
        The default is False.
    no_print : Boolean, optional
        If True, will not print.
        The default is False.

    Displays (if no_print=False)
    ----------------------------
    Table with the distribution of z-scores of the variables of interest.

    Returns (if return_df=True)
    ---------------------------
    Table with the distribution of z-scores of the variables of interest (without styling).
    '''
    def highlight_extreme(s):
        '''Highlight extreme values in a series.'''
        is_extreme = abs(s) > thres
        return ['background-color: ' + color if v else '' for v in is_extreme]

    if vars_to_examine is None:
        vars_to_examine = df.columns

    _tab = (
        # compute z-scores
        ((df[vars_to_examine] - df[vars_to_examine].mean()) / df[vars_to_examine].std())
        # output the distribution of z
        .describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]).T
        # add a new column = the larger of |min| and |max|
        .assign(max_z_abs=lambda x: x[['min', 'max']].abs().max(axis=1))
        # now sort on it
        .sort_values('max_z_abs', ascending=False)
    )

    if not no_print:
        fdict = {c: ('{:,.2f}' if c != 'count' else '{:,.0f}') for c in _tab.columns}
        display(_tab
                .style.format(fdict)
                .apply(highlight_extreme,
                       subset=['mean', 'std', 'min', '1%', '5%', '25%', '50%',
                               '75%', '95%', '99%', 'max', 'max_z_abs'])
                )

    if return_df:
        return _tab
Now we can pick the variables we want to check and use our utility function:
vars_to_check = ['l_a', 'l_sale', 'prof_a', 'mb', 'ppe_a', 'capx_a', 'xrd_a',
                 'cash_a', 'div_d', 'td_a', 'td_mv', 'dltt_a', 'dv_a', 'invopps_FG09',
                 'sales_g', 'short_debt', 'long_debt_dum', 'atr', 'smalltaxlosscarry',
                 'largetaxlosscarry', 'tnic3hhi', 'tnic3tsimm', 'prodmktfluid',
                 'delaycon', 'equitydelaycon', 'debtdelaycon', 'privdelaycon', 'l_emp',
                 'l_ppent', 'l_laborratio']
outlier_report(ccm,vars_to_check,thres=4)
 | count | mean | std | min | 1% | 5% | 25% | 50% | 75% | 95% | 99% | max | max_z_abs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dv_a | 206,679 | 0.00 | 1.00 | -0.12 | -0.08 | -0.08 | -0.08 | -0.08 | -0.01 | 0.19 | 0.70 | 356.75 | 356.75 |
sales_g | 196,652 | -0.00 | 1.00 | -4.75 | -0.03 | -0.02 | -0.02 | -0.01 | -0.01 | 0.01 | 0.07 | 281.98 | 281.98 |
mb | 219,269 | 0.00 | 1.00 | -0.20 | -0.15 | -0.13 | -0.10 | -0.08 | -0.01 | 0.30 | 0.96 | 272.32 | 272.32 |
invopps_FG09 | 193,956 | 0.00 | 1.00 | -0.18 | -0.15 | -0.13 | -0.10 | -0.07 | -0.00 | 0.29 | 0.91 | 261.27 | 261.27 |
dltt_a | 222,002 | 0.00 | 1.00 | -0.56 | -0.56 | -0.56 | -0.53 | -0.19 | 0.32 | 1.16 | 2.01 | 234.93 | 234.93 |
td_a | 221,468 | 0.00 | 1.00 | -0.78 | -0.65 | -0.65 | -0.52 | -0.13 | 0.33 | 1.10 | 1.85 | 199.83 | 199.83 |
prof_a | 217,167 | -0.00 | 1.00 | -187.91 | -0.96 | -0.34 | -0.03 | 0.04 | 0.10 | 0.20 | 0.32 | 171.16 | 187.91 |
xrd_a | 223,001 | -0.00 | 1.00 | -2.60 | -0.25 | -0.25 | -0.25 | -0.25 | -0.09 | 0.97 | 3.06 | 112.51 | 112.51 |
capx_a | 206,399 | 0.00 | 1.00 | -23.30 | -0.71 | -0.71 | -0.54 | -0.27 | 0.18 | 1.62 | 3.73 | 95.10 | 95.10 |
short_debt | 194,560 | 0.00 | 1.00 | -15.08 | -0.98 | -0.98 | -0.82 | -0.38 | 0.62 | 2.04 | 2.04 | 67.28 | 67.28 |
prodmktfluid | 88,332 | 0.00 | 1.00 | -1.91 | -1.53 | -1.28 | -0.73 | -0.18 | 0.55 | 1.84 | 2.97 | 8.52 | 8.52 |
l_laborratio | 199,821 | 0.00 | 1.00 | -6.40 | -2.04 | -1.36 | -0.63 | -0.13 | 0.48 | 1.91 | 2.98 | 7.57 | 7.57 |
tnic3tsimm | 96,951 | 0.00 | 1.00 | -0.51 | -0.46 | -0.45 | -0.44 | -0.38 | -0.15 | 2.41 | 4.66 | 6.71 | 6.71 |
debtdelaycon | 57,130 | 0.00 | 1.00 | -3.09 | -1.96 | -1.47 | -0.69 | -0.11 | 0.60 | 1.81 | 2.72 | 5.99 | 5.99 |
l_emp | 208,991 | -0.00 | 1.00 | -0.89 | -0.89 | -0.88 | -0.76 | -0.39 | 0.45 | 2.17 | 3.23 | 5.86 | 5.86 |
equitydelaycon | 57,130 | 0.00 | 1.00 | -2.94 | -1.84 | -1.43 | -0.71 | -0.11 | 0.58 | 1.87 | 2.82 | 5.28 | 5.28 |
privdelaycon | 57,130 | -0.00 | 1.00 | -3.15 | -1.93 | -1.48 | -0.70 | -0.10 | 0.59 | 1.83 | 2.74 | 5.08 | 5.08 |
l_a | 222,978 | -0.00 | 1.00 | -5.04 | -2.00 | -1.55 | -0.73 | -0.06 | 0.68 | 1.71 | 2.47 | 4.12 | 5.04 |
delaycon | 57,130 | 0.00 | 1.00 | -2.86 | -1.92 | -1.48 | -0.72 | -0.09 | 0.63 | 1.79 | 2.68 | 4.95 | 4.95 |
l_sale | 218,779 | 0.00 | 1.00 | -4.78 | -2.63 | -1.63 | -0.63 | -0.01 | 0.66 | 1.64 | 2.29 | 3.39 | 4.78 |
cash_a | 222,332 | -0.00 | 1.00 | -1.12 | -0.78 | -0.77 | -0.67 | -0.42 | 0.25 | 2.36 | 3.61 | 4.10 | 4.10 |
l_ppent | 218,212 | -0.00 | 1.00 | -1.46 | -1.46 | -1.37 | -0.80 | -0.14 | 0.66 | 1.87 | 2.59 | 3.81 | 3.81 |
tnic3hhi | 96,951 | 0.00 | 1.00 | -0.97 | -0.92 | -0.85 | -0.67 | -0.38 | 0.28 | 2.34 | 3.60 | 3.60 | 3.60 |
ppe_a | 218,189 | 0.00 | 1.00 | -1.09 | -1.09 | -1.06 | -0.84 | -0.29 | 0.58 | 2.08 | 2.53 | 2.89 | 2.89 |
td_mv | 217,961 | 0.00 | 1.00 | -1.25 | -1.07 | -1.07 | -0.93 | -0.25 | 0.70 | 1.95 | 2.47 | 2.75 | 2.75 |
smalltaxlosscarry | 148,275 | 0.00 | 1.00 | -0.46 | -0.46 | -0.46 | -0.46 | -0.46 | -0.46 | 2.19 | 2.19 | 2.19 | 2.19 |
long_debt_dum | 222,025 | -0.00 | 1.00 | -2.16 | -2.16 | -2.16 | 0.46 | 0.46 | 0.46 | 0.46 | 0.46 | 0.46 | 2.16 |
largetaxlosscarry | 148,275 | 0.00 | 1.00 | -0.59 | -0.59 | -0.59 | -0.59 | -0.59 | 1.69 | 1.69 | 1.69 | 1.69 | 1.69 |
atr | 222,168 | 0.00 | 1.00 | -1.48 | -1.48 | -1.48 | -0.63 | -0.36 | 1.33 | 1.33 | 1.33 | 1.33 | 1.48 |
div_d | 206,688 | 0.00 | 1.00 | -0.91 | -0.91 | -0.91 | -0.91 | -0.91 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 |
And you can extract a list of problematic variables from this function too:
vars_with_big_outliers = list(outlier_report(ccm, vars_to_check, thres=4, return_df=True, no_print=True)
                              .query('max_z_abs > 5').index)
print(vars_with_big_outliers)
['dv_a', 'sales_g', 'mb', 'invopps_FG09', 'dltt_a', 'td_a', 'prof_a', 'xrd_a', 'capx_a', 'short_debt', 'prodmktfluid', 'l_laborratio', 'tnic3tsimm', 'debtdelaycon', 'l_emp', 'equitydelaycon', 'privdelaycon', 'l_a']
3.4.3.2. Options for dealing with outliers:
- Fix the data: Look at the data and correct it. Sometimes this is costly or impossible to do, but it should be the first thing you look into.
- Winsorize: Change the outliers so that they are closer to the rest of the distribution.
  - Example: Any value above the 99th percentile for a variable is changed to equal the 99th percentile; do the same for the left tail.
  - This is a common and cheap ad-hoc correction.
  - Intuitively: Winsorizing downplays the weight of the outlier in your analysis because the values are reduced, without tossing out the observation altogether.
  - A tough question that depends on the data/application: What is the “right” amount of the distribution to winsorize? To the 99th percentile, or the 95th? In practice, 1%/99% is most common.
- Censor: Delete observations that are outliers. This is rarely done in practical applications unless the outlier suggests a critical data error for the observation. (A quick sketch of this route follows below.)
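For completeness, here is a minimal sketch of the censoring route, dropping any observation whose prof_a z-score exceeds 4 in absolute value (the variable and threshold here are just for illustration):

# censoring: delete observations with extreme z-scores
z = (ccm['prof_a'] - ccm['prof_a'].mean()) / ccm['prof_a'].std()
ccm_censored = ccm[z.abs() <= 4]
# caution: rows where prof_a is missing also get dropped here,
# because NaN fails the comparison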
3.4.3.3. Winsorizing
Use the function winsorizer_with_missing, available in the community codebook. (A skeleton version is below so I can do some quick examples.) By default:
- If a value is below the 1st percentile, change it to the 1st percentile.
- If a value is above the 99th percentile, change it to the 99th percentile.
The first consideration is what “cutoff values” to use. Sometimes researchers use different values, say 5th and 95th percentiles.
The second consideration is whether to winsorize over the whole dataset or within subgroups.
The latter is most commonly useful when the distribution of a variable changes over time. For example, the average firm does much more R&D now than in 2000. So the 99th percentile of R&D in 2020 and 2000 are very different.
If you winsorize R&D over both years at once, you’ll chop off the lower values in 2000 and the upper values in 2020. Perhaps it makes more sense to winsorize each year separately!
Code:

# over the whole dataset:
df = winsorizer_with_missing(df)  # adtl options available

# just some variables:
df = winsorizer_with_missing(df, cols=[<a list here>])

# winsorize each year separately:
df = df.groupby('year').apply(lambda x: winsorizer_with_missing(x))
# you can winsorize each year separately for only some variables
# by also using the "cols" argument
3.4.3.3.1. Example 1
First, let’s look at the min and max of each variable:
practice_df = ccm.copy()  # don't do this "copy" step in your code
# just run the remaining lines, replacing the
# practice_df's name with the name of the df
# you're working on

# output summary stats
(practice_df
    .describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]).T
    # add a new column = the larger of |min| and |max|
    .assign(abs_maxmin=lambda x: x[['min', 'max']].abs().max(axis=1))
    # now sort on it
    .sort_values('abs_maxmin', ascending=False)
    [['count', 'min', '1%', "99%", 'max']]  # only need these to see winsorizing in action
    [11:13]  # just print a few vars
    .style.format('{:,.2f}')
)
 | count | min | 1% | 99% | max |
---|---|---|---|---|---|
prof_a | 217,167.00 | -218.00 | -1.06 | 0.42 | 198.67 |
tnic3tsimm | 96,951.00 | 0.00 | 1.00 | 95.21 | 132.94 |
Now let’s load a version of the function:
def winsorizer_with_missing(df, low_=.01, hi_=.99, cols=None):
    # this function, with more features and explanation,
    # is in the community codebook!
    if not cols:  # if no cols provided, winsorize all columns
        cols = df.columns
    df[cols] = df[cols].clip(lower=df[cols].quantile(low_),
                             upper=df[cols].quantile(hi_),
                             axis=1)
    return df
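To see exactly what the clip step does, here’s a toy run with made-up numbers:

# toy data with one big outlier in each tail
toy = pd.DataFrame({'v': [-50, 1, 2, 3, 4, 5, 6, 7, 8, 900]})
# with 10%/90% cutoffs, -50 is clipped up to -4.1 and 900 down to 97.2;
# everything in between is untouched
winsorizer_with_missing(toy, low_=.1, hi_=.9)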
When we use winsorizer_with_missing, the data is changed, and now the min and max equal the 1st/99th percentiles!
(
    # winsorize - remember, I am showing you an example
    winsorizer_with_missing(practice_df,
                            cols=['prof_a', 'tnic3tsimm'])
    # print the output:
    [['prof_a', 'tnic3tsimm']]
    .describe(percentiles=[.01, .99]).T
    [['count', 'min', '1%', "99%", 'max']]  # only need these to see winsorizing in action
    .style.format('{:,.2f}')
)
 | count | min | 1% | 99% | max |
---|---|---|---|---|---|
prof_a | 217,167.00 | -1.06 | -1.06 | 0.42 | 0.42 |
tnic3tsimm | 96,951.00 | 1.00 | 1.00 | 95.21 | 95.21 |
When we winsorize by a group (here, the grouping is each year), the minimum no longer needs to equal the percentile threshold we used. Also notice that the 1st and 99th percentiles are a little different. This is OK when we winsorize by subgroups.
(
    # fyear is the "Fiscal year" variable
    practice_df.groupby('fyear').apply(lambda x: winsorizer_with_missing(x, cols=['prof_a', 'tnic3tsimm']))
    # print the output:
    [['prof_a', 'tnic3tsimm']]
    .describe(percentiles=[.01, .99]).T
    [['count', 'min', '1%', "99%", 'max']]  # only need these to see winsorizing in action
    .style.format('{:,.2f}')
)
 | count | min | 1% | 99% | max |
---|---|---|---|---|---|
prof_a | 217,167.00 | -1.06 | -0.96 | 0.40 | 0.42 |
tnic3tsimm | 96,951.00 | 1.00 | 1.00 | 94.96 | 95.21 |
In this example so far, I just ran the winsorization function and printed summary stats. If you actually want to use the data after winsorizing, you need to save the updated data. Just adapt the code suggestion above.
So instead of
df = winsorizer_with_missing(df, cols=[<a list here>])
you’ll write
ccm = winsorizer_with_missing(ccm,cols=vars_with_big_outliers)
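and then (optionally, but it’s a good habit) re-run the outlier report on the saved data to confirm the tails were pulled in:

# sanity check: max_z_abs should be far smaller than before
outlier_report(ccm, vars_with_big_outliers, thres=4)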
3.4.3.3.2. Example 2 - Visual Intuition
Here are datasets III and IV from the “Role of Viz” page, after applying the same winsorizing function. (I flipped the X and Y axes because the plotting function can’t fit a vertical line.)
You can play with the amount of winsorization by using the slider. Notice that as you winsorize more:
- The extreme data points (along the X and Y dimensions) shrink.
- The estimated slope changes.
df_orig = sns.load_dataset("anscombe").query('dataset in ["III","IV"]')
df_plot = pd.DataFrame()

# winsorize it a bunch of times and drop it into a df to use with plotly
for tails in [0, .01, .05, .1, .15, .2]:
    df = df_orig.copy()
    df[["x", "y"]] = df_orig.groupby("dataset").apply(
        lambda x: winsorizer_with_missing(x[["x", "y"]], low_=tails, hi_=1 - tails)
    )
    df['winsorized'] = tails
    df_plot = pd.concat([df_plot, df])  # df.append() was removed in pandas 2.0, so use concat
# the plotly mechanics
import warnings
warnings.filterwarnings("ignore", message="divide by zero encountered in double_scalars")

import plotly.express as px  # pip install plotly... the animation below is from the plotly module
from IPython.display import display, HTML  # IPython.core.display is deprecated
from plotly.offline import init_notebook_mode, plot

init_notebook_mode(connected=True)

fig = px.scatter(df_plot, trendline="ols", facet_row="dataset",
                 y='x', x='y', animation_frame="winsorized",
                 range_y=[4, 20], range_x=[4, 14], height=800,
                 title="Same datasets, but I flipped the X and Y axes")

plot(fig, filename='quartet.html', auto_play=False)
display(HTML('quartet.html'))
¹ In some datasets, a missing value might be represented by -1 or -99. If you don’t know that and start analyzing the “numbers”, your analysis will be seriously wrong!