3.4.2. Dealing with Missing Data¶
df.isnull().sum
will report missing values by variable.With new datasets, look out for “missing values” that aren’t missing. Some datasets use a certain number to indicate missing data (i.e. -99). Convert these to NaNs with
replace
.
These slides on missing data are quite good! This article has examples too.
You should focus on the whys and hows of dealing with missing data rather than mechanics. (You can look up mechanics later.)
For example, with firm level data that investment analysts deal with, the most common approach is to keep all valid data, and for each test you run, use all observations that have no missing values for the variables of interest. In the slides, this is called “Complete-Case Analysis”.
It is less common in my field of research to impute missing values. However, “Deductive imputation” is common when the cost of doing so isn’t high.
In general, when you can confidently deduce a value (my height this year is the same as last year because I’m a fully grown adult (mostly)), go ahead. That is valid data.
3.4.2.1. Options¶
These pandas functions can be very useful:
fillna
- any value (strings included) you want, back fill, forward, fill, and moredropna
- might not be explicitly needed, some functions ignore NaNs, but some don’t. I tend to drop only as needed for a given estimation, and only temporarily.replace
- some datasets use a certain number to indicate missing data (i.e. -99). Convert these to NaN with this function.interpolate
- e.g. use values in surrounding time periods to fill in gapsDeduce. Suppose PPE this year is missing. \(PPE_t = PPE_{t-1} + CAPX_t - DP_t \)
3.4.2.2. Practice¶
Play around with each of those functions on this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, None, 1],
"B":[None, 2, 54, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
_df1 = df.copy()
_df1['firm'] = 1
_df1['date'] = _df1.index
_df2 = df.copy()
_df2['firm'] = 2
_df2['date'] = _df2.index
df2 = _df1.append(_df2)
Questions for df
:
Fill all missing values with -1
Fill missing values for variable “B” with -1
Fill all values with the mean for the variable
Fill all values with the median for the variable
Fill values by taking the most recent non-missing prior value
Questions for df2
:
Carry missing values forward without carrying values from firm 1 to firm 2
Fill missing values with the average for firms on that date