{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dealing with Outliers\n",
"\n",
"## Options for dealing with them:\n",
"\n",
"1. Fix the data: Look at the data and correct it. Can be costly or impossible. \n",
"2. Censor: Delete observations that are outliers\n",
"3. Winsorize: Change the value so that it is closer to the rest of the distribution\n",
" - Example: Any value above the 99th percentile for a variable is changed to equal the 99th percentile\n",
" - This is a common, cheap, ad-hoc correction: it reduces the outlier's influence on your analysis by shrinking its value, without tossing out the observation altogether\n",
" - Tough question that depends on the data/application: What is the \"right\" amount of the distribution to winsorize?\n",
"\n",
"## Finding outliers\n",
"\n",
"How can we find outliers?\n",
" \n",
"1. Plot your data: scatterplot, hexplot, box plot, density, etc. \n",
"1. Compute z-scores and `.describe()` the data\n",
"\n",
"Plotting is essential in EDA and data cleaning, as we've covered. Still, I suggest the second route as a quick and systematic way to identify which variables you should look into more closely.\n",
"\n",
"Let's download our firm data (`ccm`).\n"
]
},
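{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the second route concrete before we touch real data, here is a minimal sketch on a made-up dataframe (`toy` and `x` are hypothetical names): z-score each column, then `.describe()` the result.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'x': [1, 2, 3, 2, 1, 500]})  # 500 is an obvious outlier\n",
"z = (toy - toy.mean()) / toy.std()               # z-score each column\n",
"print(z.describe())                              # the outlier dominates 'max'\n",
"```\n",
"\n",
"A value with a large absolute z-score sits many standard deviations from its variable's mean, which is exactly what the utility function below tabulates at scale.\n"
]
},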
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# copied from 3.3.4.1\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"# these three are used to download the file\n",
"from io import BytesIO\n",
"from zipfile import ZipFile\n",
"from urllib.request import urlopen\n",
"\n",
"url = 'https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/CCM_cleaned_for_class.zip?raw=true'\n",
"\n",
"#firms = pd.read_stata(url) \n",
"# <-- that code would work, but GH said it was too big and\n",
"# forced me to zip it, so here is the work around to download it:\n",
"\n",
"with urlopen(url) as request:\n",
"    data = BytesIO(request.read())\n",
"\n",
"with ZipFile(data) as archive:\n",
"    with archive.open(archive.namelist()[0]) as stata:\n",
"        ccm = pd.read_stata(stata)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And load a utility function from the [community codebook](https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/community_codebook/outlier_report.py)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"def outlier_report(df, vars_to_examine=None, color='red', thres=4,\n",
"                   return_df=False, no_print=False):\n",
"    '''\n",
"    Parameters\n",
"    ----------\n",
"    df : DATAFRAME\n",
"        Input dataframe\n",
"    vars_to_examine : LIST, optional\n",
"        List of variables to examine from dataframe. The default is df.columns.\n",
"    color : STRING, optional\n",
"        Color for cell highlighting. The default is 'red'.\n",
"    thres : INT, optional\n",
"        Highlight cells where the z-score is above thres. The default is 4.\n",
"    return_df : BOOLEAN, optional\n",
"        If True, returns the table (without styling) for further use.\n",
"        The default is False.\n",
"    no_print : BOOLEAN, optional\n",
"        If True, does not display the styled table.\n",
"        The default is False.\n",
"\n",
"    Displays (if no_print=False)\n",
"    ----------------------------\n",
"    Table with the distribution of z-scores of the variables of interest.\n",
"\n",
"    Returns (if return_df=True)\n",
"    ---------------------------\n",
"    Table with the distribution of z-scores of the variables of interest (without styling).\n",
"    '''\n",
"\n",
"    def highlight_extreme(s):\n",
"        '''Highlight extreme values in a series.'''\n",
"        is_extreme = abs(s) > thres\n",
"        return ['background-color: ' + color if v else '' for v in is_extreme]\n",
"\n",
"    if vars_to_examine is None:\n",
"        vars_to_examine = df.columns\n",
"\n",
"    _tab = (\n",
"        # compute z-scores\n",
"        ((df[vars_to_examine] - df[vars_to_examine].mean()) / df[vars_to_examine].std())\n",
"        # output the distribution of z\n",
"        .describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]).T\n",
"        # add a new column: the largest absolute z-score (max of |min| and |max|)\n",
"        .assign(max_z_abs=lambda x: x[['min', 'max']].abs().max(axis=1))\n",
"        # now sort on it\n",
"        .sort_values('max_z_abs', ascending=False)\n",
"    )\n",
"\n",
"    if not no_print:\n",
"        display(_tab\n",
"                .style.format('{:,.2f}')\n",
"                .format({\"count\": '{:,.0f}'})\n",
"                .apply(highlight_extreme,\n",
"                       subset=['mean', 'std', 'min', '1%', '5%', '25%', '50%', '75%', '95%', '99%', 'max', 'max_z_abs'])\n",
"                )\n",
"\n",
"    if return_df:\n",
"        return _tab\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can pick the variables we want to check and use our utility function:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th>count</th><th>mean</th><th>std</th><th>min</th><th>1%</th><th>5%</th><th>25%</th><th>50%</th><th>75%</th><th>95%</th><th>99%</th><th>max</th><th>max_z_abs</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>dv_a</th><td>206,679</td><td>0.00</td><td>1.00</td><td>-0.12</td><td>-0.08</td><td>-0.08</td><td>-0.08</td><td>-0.08</td><td>-0.01</td><td>0.19</td><td>0.70</td><td>356.75</td><td>356.75</td></tr>\n",
"<tr><th>sales_g</th><td>196,652</td><td>-0.00</td><td>1.00</td><td>-4.75</td><td>-0.03</td><td>-0.02</td><td>-0.02</td><td>-0.01</td><td>-0.01</td><td>0.01</td><td>0.07</td><td>281.98</td><td>281.98</td></tr>\n",
"<tr><th>mb</th><td>219,269</td><td>0.00</td><td>1.00</td><td>-0.20</td><td>-0.15</td><td>-0.13</td><td>-0.10</td><td>-0.08</td><td>-0.01</td><td>0.30</td><td>0.96</td><td>272.32</td><td>272.32</td></tr>\n",
"<tr><th>invopps_FG09</th><td>193,956</td><td>0.00</td><td>1.00</td><td>-0.18</td><td>-0.15</td><td>-0.13</td><td>-0.10</td><td>-0.07</td><td>-0.00</td><td>0.29</td><td>0.91</td><td>261.27</td><td>261.27</td></tr>\n",
"<tr><th>dltt_a</th><td>222,002</td><td>0.00</td><td>1.00</td><td>-0.56</td><td>-0.56</td><td>-0.56</td><td>-0.53</td><td>-0.19</td><td>0.32</td><td>1.16</td><td>2.01</td><td>234.93</td><td>234.93</td></tr>\n",
"<tr><th>td_a</th><td>221,468</td><td>0.00</td><td>1.00</td><td>-0.78</td><td>-0.65</td><td>-0.65</td><td>-0.52</td><td>-0.13</td><td>0.33</td><td>1.10</td><td>1.85</td><td>199.83</td><td>199.83</td></tr>\n",
"<tr><th>prof_a</th><td>217,167</td><td>-0.00</td><td>1.00</td><td>-187.91</td><td>-0.96</td><td>-0.34</td><td>-0.03</td><td>0.04</td><td>0.10</td><td>0.20</td><td>0.32</td><td>171.16</td><td>187.91</td></tr>\n",
"<tr><th>xrd_a</th><td>223,001</td><td>-0.00</td><td>1.00</td><td>-2.60</td><td>-0.25</td><td>-0.25</td><td>-0.25</td><td>-0.25</td><td>-0.09</td><td>0.97</td><td>3.06</td><td>112.51</td><td>112.51</td></tr>\n",
"<tr><th>capx_a</th><td>206,399</td><td>0.00</td><td>1.00</td><td>-23.30</td><td>-0.71</td><td>-0.71</td><td>-0.54</td><td>-0.27</td><td>0.18</td><td>1.62</td><td>3.73</td><td>95.10</td><td>95.10</td></tr>\n",
"<tr><th>short_debt</th><td>194,560</td><td>0.00</td><td>1.00</td><td>-15.08</td><td>-0.98</td><td>-0.98</td><td>-0.82</td><td>-0.38</td><td>0.62</td><td>2.04</td><td>2.04</td><td>67.28</td><td>67.28</td></tr>\n",
"<tr><th>prodmktfluid</th><td>88,332</td><td>0.00</td><td>1.00</td><td>-1.91</td><td>-1.53</td><td>-1.28</td><td>-0.73</td><td>-0.18</td><td>0.55</td><td>1.84</td><td>2.97</td><td>8.52</td><td>8.52</td></tr>\n",
"<tr><th>l_laborratio</th><td>199,821</td><td>0.00</td><td>1.00</td><td>-6.40</td><td>-2.04</td><td>-1.36</td><td>-0.63</td><td>-0.13</td><td>0.48</td><td>1.91</td><td>2.98</td><td>7.57</td><td>7.57</td></tr>\n",
"<tr><th>tnic3tsimm</th><td>96,951</td><td>0.00</td><td>1.00</td><td>-0.51</td><td>-0.46</td><td>-0.45</td><td>-0.44</td><td>-0.38</td><td>-0.15</td><td>2.41</td><td>4.66</td><td>6.71</td><td>6.71</td></tr>\n",
"<tr><th>debtdelaycon</th><td>57,130</td><td>0.00</td><td>1.00</td><td>-3.09</td><td>-1.96</td><td>-1.47</td><td>-0.69</td><td>-0.11</td><td>0.60</td><td>1.81</td><td>2.72</td><td>5.99</td><td>5.99</td></tr>\n",
"<tr><th>l_emp</th><td>208,991</td><td>-0.00</td><td>1.00</td><td>-0.89</td><td>-0.89</td><td>-0.88</td><td>-0.76</td><td>-0.39</td><td>0.45</td><td>2.17</td><td>3.23</td><td>5.86</td><td>5.86</td></tr>\n",
"<tr><th>equitydelaycon</th><td>57,130</td><td>0.00</td><td>1.00</td><td>-2.94</td><td>-1.84</td><td>-1.43</td><td>-0.71</td><td>-0.11</td><td>0.58</td><td>1.87</td><td>2.82</td><td>5.28</td><td>5.28</td></tr>\n",
"<tr><th>privdelaycon</th><td>57,130</td><td>-0.00</td><td>1.00</td><td>-3.15</td><td>-1.93</td><td>-1.48</td><td>-0.70</td><td>-0.10</td><td>0.59</td><td>1.83</td><td>2.74</td><td>5.08</td><td>5.08</td></tr>\n",
"<tr><th>l_a</th><td>222,978</td><td>-0.00</td><td>1.00</td><td>-5.04</td><td>-2.00</td><td>-1.55</td><td>-0.73</td><td>-0.06</td><td>0.68</td><td>1.71</td><td>2.47</td><td>4.12</td><td>5.04</td></tr>\n",
"<tr><th>delaycon</th><td>57,130</td><td>0.00</td><td>1.00</td><td>-2.86</td><td>-1.92</td><td>-1.48</td><td>-0.72</td><td>-0.09</td><td>0.63</td><td>1.79</td><td>2.68</td><td>4.95</td><td>4.95</td></tr>\n",
"<tr><th>l_sale</th><td>218,779</td><td>0.00</td><td>1.00</td><td>-4.78</td><td>-2.63</td><td>-1.63</td><td>-0.63</td><td>-0.01</td><td>0.66</td><td>1.64</td><td>2.29</td><td>3.39</td><td>4.78</td></tr>\n",
"<tr><th>cash_a</th><td>222,332</td><td>-0.00</td><td>1.00</td><td>-1.12</td><td>-0.78</td><td>-0.77</td><td>-0.67</td><td>-0.42</td><td>0.25</td><td>2.36</td><td>3.61</td><td>4.10</td><td>4.10</td></tr>\n",
"<tr><th>l_ppent</th><td>218,212</td><td>-0.00</td><td>1.00</td><td>-1.46</td><td>-1.46</td><td>-1.37</td><td>-0.80</td><td>-0.14</td><td>0.66</td><td>1.87</td><td>2.59</td><td>3.81</td><td>3.81</td></tr>\n",
"<tr><th>tnic3hhi</th><td>96,951</td><td>0.00</td><td>1.00</td><td>-0.97</td><td>-0.92</td><td>-0.85</td><td>-0.67</td><td>-0.38</td><td>0.28</td><td>2.34</td><td>3.60</td><td>3.60</td><td>3.60</td></tr>\n",
"<tr><th>ppe_a</th><td>218,189</td><td>0.00</td><td>1.00</td><td>-1.09</td><td>-1.09</td><td>-1.06</td><td>-0.84</td><td>-0.29</td><td>0.58</td><td>2.08</td><td>2.53</td><td>2.89</td><td>2.89</td></tr>\n",
"<tr><th>td_mv</th><td>217,961</td><td>0.00</td><td>1.00</td><td>-1.25</td><td>-1.07</td><td>-1.07</td><td>-0.93</td><td>-0.25</td><td>0.70</td><td>1.95</td><td>2.47</td><td>2.75</td><td>2.75</td></tr>\n",
"<tr><th>smalltaxlosscarry</th><td>148,275</td><td>0.00</td><td>1.00</td><td>-0.46</td><td>-0.46</td><td>-0.46</td><td>-0.46</td><td>-0.46</td><td>-0.46</td><td>2.19</td><td>2.19</td><td>2.19</td><td>2.19</td></tr>\n",
"<tr><th>long_debt_dum</th><td>222,025</td><td>-0.00</td><td>1.00</td><td>-2.16</td><td>-2.16</td><td>-2.16</td><td>0.46</td><td>0.46</td><td>0.46</td><td>0.46</td><td>0.46</td><td>0.46</td><td>2.16</td></tr>\n",
"<tr><th>largetaxlosscarry</th><td>148,275</td><td>0.00</td><td>1.00</td><td>-0.59</td><td>-0.59</td><td>-0.59</td><td>-0.59</td><td>-0.59</td><td>1.69</td><td>1.69</td><td>1.69</td><td>1.69</td><td>1.69</td></tr>\n",
"<tr><th>atr</th><td>222,168</td><td>0.00</td><td>1.00</td><td>-1.48</td><td>-1.48</td><td>-1.48</td><td>-0.63</td><td>-0.36</td><td>1.33</td><td>1.33</td><td>1.33</td><td>1.33</td><td>1.48</td></tr>\n",
"<tr><th>div_d</th><td>206,688</td><td>0.00</td><td>1.00</td><td>-0.91</td><td>-0.91</td><td>-0.91</td><td>-0.91</td><td>-0.91</td><td>1.10</td><td>1.10</td><td>1.10</td><td>1.10</td><td>1.10</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vars_to_check = ['l_a', 'l_sale', 'prof_a', 'mb', 'ppe_a', 'capx_a', 'xrd_a', \n",
" 'cash_a', 'div_d', 'td_a', 'td_mv', 'dltt_a', 'dv_a', 'invopps_FG09',\n",
" 'sales_g', 'short_debt', 'long_debt_dum', 'atr', 'smalltaxlosscarry', \n",
" 'largetaxlosscarry', 'tnic3hhi', 'tnic3tsimm', 'prodmktfluid', \n",
" 'delaycon', 'equitydelaycon', 'debtdelaycon', 'privdelaycon', 'l_emp',\n",
" 'l_ppent', 'l_laborratio']\n",
"\n",
"outlier_report(ccm,vars_to_check,thres=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you can extract a list of problematic variables from this function too:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"vars_with_big_outliers = list(outlier_report(ccm,vars_to_check,thres=4,return_df=True,no_print=True)\n",
" .query('max_z_abs > 5').index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Winsorizing\n",
"\n",
"In this example:\n",
"1. If a value is below the 1st percentile, change it to the 1st percentile\n",
"1. If a value is above the 99th percentile, change it to the 99th percentile\n",
"\n",
"Options:\n",
"1. Winsorize once over whole dataset\n",
"1. Winsorize over subgroups (e.g., winsorize by year)\n",
" - Useful when the distribution changes over time\n",
" - Suppose the distribution shifts right from one year to the next. If you winsorize both years at once, you'll chop off the lower values in year one and the upper values in year two. Perhaps it makes more sense to winsorize each year separately.\n",
"\n",
"### Code:\n",
"\n",
"```python\n",
"from scipy.stats.mstats import winsorize\n",
"\n",
"# note: winsorize's `limits` are the fractions trimmed from EACH tail,\n",
"# so limits=[.01,.01] implements the 1st/99th percentile rule above\n",
"\n",
"# option 1a:\n",
"for v in some_list_of_vars:\n",
"    df[v] = winsorize(df[v], limits=[.01, .01])\n",
"\n",
"# option 1b: same as 1a, but one line\n",
"df[some_list_of_vars] = df[some_list_of_vars].apply(lambda x: winsorize(x, limits=[.01, .01]))\n",
"\n",
"# option 2: winsorize by group (e.g., year)\n",
"df[some_list_of_vars] = df.groupby(some_group)[some_list_of_vars].transform(lambda x: winsorize(x, limits=[.01, .01]))\n",
"```\n",
"\n",
"### An example"
]
},
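{
"cell_type": "markdown",
"metadata": {},
"source": [
"One subtlety worth pinning down before the example: in `scipy`, `winsorize`'s `limits` argument is the *fraction trimmed from each tail*, not a pair of percentiles. A quick toy check (made-up data, not `ccm`):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.stats.mstats import winsorize\n",
"\n",
"arr = np.arange(100)                   # 0, 1, ..., 99\n",
"w = winsorize(arr, limits=[.01, .01])  # clip 1% from the bottom AND 1% from the top\n",
"print(w.min(), w.max())\n",
"```\n",
"\n",
"The lowest value (0) gets pulled up to 1 and the highest (99) gets pulled down to 98, matching the 1st/99th percentile rule described above.\n"
]
},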
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats.mstats import winsorize\n",
"\n",
"practice_df = ccm.copy() # don't do this in your code,\n",
" # just run the remaining lines:\n",
"\n",
"# winsorize one var at a time\n",
"vars_to_win = vars_to_check # decide which; for this example, all the ones we checked\n",
"for v in vars_to_win:\n",
"    practice_df[v] = winsorize(practice_df[v], limits=[.01, .01]) # trim 1% from each tail\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's verify it worked:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th>min</th><th>1%</th><th>99%</th><th>max</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>l_sale</th><td>-1.64</td><td>-1.64</td><td>-1.64</td><td>-1.64</td></tr>\n",
"<tr><th>prof_a</th><td>-1.05</td><td>-1.05</td><td>-1.05</td><td>-1.05</td></tr>\n",
"<tr><th>tnic3tsimm</th><td>1.01</td><td>1.01</td><td>1.01</td><td>1.01</td></tr>\n",
"<tr><th>sales_g</th><td>-0.73</td><td>-0.73</td><td>-0.73</td><td>-0.73</td></tr>\n",
"<tr><th>mb</th><td>0.56</td><td>0.56</td><td>0.56</td><td>0.56</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(practice_df\n",
" .describe(percentiles=[.01,.05,.25,.5,.75,.95,.99]).T\n",
" # add a new column = highest of min and max column\n",
" .assign(abs_maxmin = lambda x: x[['min','max']].abs().max(axis=1))\n",
" # now sort on it\n",
" .sort_values('abs_maxmin',ascending = False)\n",
" [['min','1%',\"99%\",'max']] # only need these to see winsorizing in action\n",
" [10:15] # just print a few vars\n",
" .style.format('{:,.2f}')\n",
") \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}