3.4.2. Dealing with Missing Data
Quick summary
You should focus more on the “whys” of dealing with missing data than on the mechanics. (You can look up mechanics later.)
These slides on missing data are quite good! This article has examples too.
On the “whys”: With the firm-level data that investment analysts deal with, the most common approach to missing data is to keep all valid observations (don’t drop anything) and, for each test you run, use all observations that have no missing values for any of the variables in that test. In the slides above, this is called “Complete-Case Analysis”.
“Complete-Case Analysis” works well, as long as the fact that a variable is missing doesn’t indicate a systematic difference between observations that are missing and those that aren’t. For example, I have research that characterizes innovation based on patent text, which gives researchers a powerful tool to examine the impacts of firm innovation. However, this dataset will have missing values for any firm without patents, and firms without patents are systematically different from firms with patents.
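For concreteness, here is a minimal sketch of what complete-case analysis looks like in pandas; the dataframe and variable names are made up for illustration:

```python
import pandas as pd

# hypothetical firm-level data with scattered missing values
firms = pd.DataFrame({"roa":  [0.05, None, 0.12, 0.07],
                      "lev":  [0.30, 0.25, None, 0.40],
                      "size": [6.1,  7.3,  5.8,  None]})

# complete-case analysis for a test involving roa and lev:
# keep every observation with no missing values for *those* variables,
# even if other variables (like size) are missing
test_sample = firms.dropna(subset=["roa", "lev"])
```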
It is less common in corporate finance to impute missing values. However, “deductive imputation” is common when the cost of doing so isn’t high, like the height example below.
Interpolation is done in asset pricing when it’s necessary to estimate the pricing of options or other derivatives that aren’t actually traded.
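As a hedged sketch of that idea, suppose we observe implied volatilities only at traded strikes (all numbers here are hypothetical) and want values for the strikes in between:

```python
import numpy as np
import pandas as pd

# hypothetical implied vols, observed only at traded strikes
vols = pd.Series([0.25, np.nan, 0.21, np.nan, 0.20],
                 index=[90, 95, 100, 105, 110])  # strike prices

# interpolate the untraded strikes from neighboring traded ones,
# weighting by the distance between strikes
vols_filled = vols.interpolate(method="index")
```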
`df.isnull().sum()` will report missing values by variable.

In general, when you can confidently deduce a value (my height this year is the same as last year, because I’m a fully grown adult, mostly), go ahead. That is valid data.
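A tiny sketch of that kind of deduction, using a hypothetical one-person panel:

```python
import pandas as pd

# hypothetical panel: one adult's height, missing in 2022
heights = pd.DataFrame({"year":   [2020, 2021, 2022],
                        "height": [71.0, 71.0, None]})

# an adult's height is (mostly) constant, so carrying the last
# observation forward here is a deduction, not a guess
heights["height"] = heights["height"].ffill()
```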
With new datasets, look out for “missing values” that aren’t actually missing. Some datasets use a sentinel number (e.g. -99) to indicate missing data. Convert these to NaNs with `replace`.
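For instance, a minimal sketch with a made-up column:

```python
import numpy as np
import pandas as pd

# hypothetical raw data where -99 is the "missing" sentinel
raw = pd.DataFrame({"beta": [1.1, -99, 0.8]})

# turn the sentinel into a real missing value pandas understands
clean = raw.replace(-99, np.nan)
```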
3.4.2.1. Pandas functions you might use to deal with missing values
- `fillna` - Replace missing values with any value you want (strings included), back fill, forward fill, and more
- `dropna` - Might not be explicitly needed; some functions ignore NaNs, but some don’t. I tend to drop only as needed for a given estimation, and only temporarily.
- `replace` - Some datasets use a certain number to indicate missing data (e.g. -99). Convert these to NaN with this function.
- `interpolate` - E.g. use values in surrounding time periods to fill in gaps
- Deduce. Suppose PPE this year is missing: \(PPE_t = PPE_{t-1} + CAPX_t - DP_t\). (Sketched below.)
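To make that last bullet concrete, here is a sketch of deductive imputation using the accounting identity above (the numbers are made up):

```python
import pandas as pd

# hypothetical firm accounting data; PPE is missing in year 2
acct = pd.DataFrame({"ppe":  [100.0, None, 130.0],
                     "capx": [ 20.0, 15.0,  25.0],
                     "dp":   [ 10.0, 12.0,  11.0]})

# deduce missing PPE from PPE_t = PPE_{t-1} + CAPX_t - DP_t
acct["ppe"] = acct["ppe"].fillna(acct["ppe"].shift(1)
                                 + acct["capx"] - acct["dp"])
```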
3.4.2.2. Practice
Play around with each of those functions on this dataframe:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# build a two-firm panel version of the same data
_df1 = df.copy()
_df1['firm'] = 1
_df1['date'] = _df1.index
_df2 = df.copy()
_df2['firm'] = 2
_df2['date'] = _df2.index
df2 = pd.concat([_df1, _df2])  # DataFrame.append was removed in pandas 2+
```
Questions for `df`:

1. Fill all missing values with -1
2. Fill missing values for variable “B” with -1
3. Fill all missing values with the mean for the variable
4. Fill all missing values with the median for the variable
5. Fill values by taking the most recent non-missing prior value
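If you get stuck, one possible set of answers (assuming the `df` defined above) looks like this:

```python
df.fillna(-1)           # 1. fill everything with -1
df.fillna({"B": -1})    # 2. fill only column B with -1
df.fillna(df.mean())    # 3. fill with each variable's mean
df.fillna(df.median())  # 4. fill with each variable's median
df.ffill()              # 5. carry the last valid value forward
```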
Questions for `df2`:

1. Carry missing values forward without carrying values from firm 1 to firm 2
2. Fill missing values with the average for firms on that date
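Again as a hedged sketch, one way to answer these (assuming `df2` from above):

```python
# 1. ffill within each firm, so firm 1's values never spill into firm 2
df2.groupby("firm").ffill()

# 2. fill each gap with the average across firms on that date
df2.groupby("date").transform(lambda x: x.fillna(x.mean()))
```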