5.4.5. Preprocessing

Preprocessing (cleaning and transforming) your data is utterly essential to ML. Some would say that after defining interesting problems to solve (which isn’t a modelling step), preprocessing (which isn’t a modelling step) is the next most important part of ML projects. So I wanted to have a little bit on preprocessing on the website. But preprocessing is a MASSIVE topic, so this is only a small taste of what you can do. As always, the official docs are useful!

Here is a great video on the topic from a core sklearn dev that I highly recommend:

Imputation / Missing Values

We talked about imputation a bit before in the context of pandas. These slides on missing data are quite good! This article has examples too.

Before modeling, you have to decide how to deal with missing values. You can

  1. Drop observations with any missing values,

  2. Impute missing values (mean, median, mode, interpolation, deduction, mean-of-group, etc.),

  3. Or model the missing values explicitly (e.g. in a regression, as an incremental intercept but with no impact on the slope).

What’s the right choice? It depends. On the data, the domain, the question, and economic theory. My choices change from project to project. You might use a combination of these!
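To make the three options concrete, here is a minimal pandas sketch. The toy data and the column names (industry, leverage, profit) are made up for illustration, and filling with 0 in option 3 is just one possible constant, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the column names and values are invented for this sketch
df = pd.DataFrame({
    "industry": ["tech", "tech", "tech", "retail", "retail", "retail"],
    "leverage": [0.10, np.nan, 0.30, 0.50, 0.70, np.nan],
    "profit":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Option 1: drop every observation that has any missing value
dropped = df.dropna()

# Option 2a: impute with the overall median of the observed values
median_filled = df["leverage"].fillna(df["leverage"].median())

# Option 2b: impute with the mean of the observation's own group
group_mean = df.groupby("industry")["leverage"].transform("mean")
group_filled = df["leverage"].fillna(group_mean)

# Option 3: keep a 0/1 flag for "was missing" and fill the hole with a constant,
# so a regression can estimate a separate intercept for the flagged rows
modeled = df.assign(
    leverage_missing=df["leverage"].isna().astype(int),
    leverage=df["leverage"].fillna(0),
)

print(len(df), "rows originally,", len(dropped), "after dropping")
print(modeled)
```

Note that option 1 throws away a third of this tiny sample, which is exactly the kind of cost you weigh when deciding.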

You should focus on the whys and hows of dealing with missing data rather than mechanics. (You can look up mechanics later.) You should have some livecoding from the prior lecture showing imputation in pandas.

sklearn comes with an impute module, described in the official docs:

# silly data
import numpy as np
X = np.array([[ np.nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   np.nan, 6  ],
              [ 8,   8,   1  ]])

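# it's this easy:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
print(X)              # the data before imputation
imp.fit_transform(X)  # each NaN is replaced by its column's mean

[[nan  0.  3.]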
 [ 3.  7.  9.]
 [ 3.  5.  2.]
 [ 4. nan  6.]
 [ 8.  8.  1.]] 
array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])

imp.fit_transform(X) is the combination of imp.fit(X) and imp.transform(X).
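A quick way to convince yourself of this, sketched on the same silly data: fit learns one mean per column (stored in the imputer's statistics_ attribute), and transform fills the holes with those stored means. The X_new row below is a hypothetical new observation added for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0,      3],
              [3,      7,      9],
              [3,      5,      2],
              [4,      np.nan, 6],
              [8,      8,      1]])

# Two steps: learn the column means, then fill the NaNs with them
imp = SimpleImputer(strategy='mean')
imp.fit(X)
two_step = imp.transform(X)

# One step: the same thing in a single call
one_step = SimpleImputer(strategy='mean').fit_transform(X)

print(imp.statistics_)                     # the column means learned by fit
print(np.array_equal(two_step, one_step))  # the two routes agree

# Because the means are stored, they can be applied to data not seen during fit
X_new = np.array([[np.nan, np.nan, np.nan]])  # hypothetical new row
print(imp.transform(X_new))
```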

If you have a train/test split, you shouldn’t use fit_transform on the full dataset or on the test data. Instead, use imp.fit(X_train) to get the means in the training sample and imp.transform(X_test) to apply those same means to the test data.

Standardization

Standardizing effectively means rescaling continuous variables so that each has a mean of 0 and a variance of 1.

The sklearn documentation on this is quite good.

“Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.”

Why does this matter? “If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”
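One way to see the point of that quote is the rough sketch below. It uses plain Euclidean distance as a stand-in for an objective function (an assumption made for illustration, since many estimators use something else), and the two features, revenue in dollars and a ratio between 0 and 1, are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is revenue in dollars, column 1 is a ratio in [0, 1]
X = np.array([[1_000_000.0, 0.10],
              [1_000_500.0, 0.90],   # very different ratio, nearly the same revenue
              [1_200_000.0, 0.12]])  # nearly the same ratio, different revenue

def dist(a, b):
    """Euclidean distance between two observations."""
    return np.linalg.norm(a - b)

# On the raw data, the dollar column decides everything:
# the ratio column barely moves the distance at all
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])
print('raw:         ', raw_01, raw_02)

# After standardizing, both features get a comparable say
Z = StandardScaler().fit_transform(X)
std_01 = dist(Z[0], Z[1])
std_02 = dist(Z[0], Z[2])
print('standardized:', std_01, std_02)
```

On the raw data, row 1 looks far closer to row 0 than row 2 does, purely because of units. After scaling, the two distances are of similar size.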


sklearn can scale variables in many ways. Some alternative transforms are faster and some (for example, PowerTransformer and QuantileTransformer) transform non-normal distributions into approximately normal distributions (which can improve the efficacy of many models).
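As a small sketch of what an alternative transform can do, the snippet below compares StandardScaler, RobustScaler, and PowerTransformer on simulated, heavily right-skewed data. The lognormal sample and the skewness helper are assumptions for illustration, not part of sklearn.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, RobustScaler, StandardScaler

# Simulated right-skewed variable (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

def skewness(a):
    """Sample skewness: roughly 0 for a symmetric distribution."""
    a = a.ravel()
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

standardized = StandardScaler().fit_transform(x)  # subtract mean, divide by std
robust = RobustScaler().fit_transform(x)          # subtract median, divide by IQR
powered = PowerTransformer().fit_transform(x)     # Yeo-Johnson power transform, then standardize

print('skew, raw:         ', skewness(x))
print('skew, standardized:', skewness(standardized))  # unchanged: only a shift and rescale
print('skew, power:       ', skewness(powered))       # much closer to 0
print('median, robust:    ', np.median(robust))
```

StandardScaler only shifts and rescales, so the shape of the distribution (and its skew) is untouched. The power transform actually changes the shape.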

Visit (you guessed it!) the documentation for more.

Here is a simple example using preprocessing.StandardScaler.

# a very simple example
from sklearn.preprocessing import StandardScaler 
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = StandardScaler() 
X_scaled = scaler.fit_transform(X_train)

print(' X_scaled\n',         '-'*40,'\n',X_scaled,'\n')
print(' Mean of each var:\n','-'*40,'\n',X_scaled.mean(axis=0),'\n')
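print(' STD of each var:\n', '-'*40,'\n',X_scaled.std(axis=0),'\n')

 X_scaled
 ---------------------------------------- 
 [[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]] 

 Mean of each var:
 ---------------------------------------- 
 [0. 0. 0.] 

 STD of each var:
 ---------------------------------------- 
 [1. 1. 1.]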
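As with the imputer, the scaler can be fit on the training data and then reused on other data. The sketch below checks that StandardScaler matches the by-hand formula (subtract the column mean, divide by the column standard deviation) and then applies the training means and standard deviations to a test row. X_test is a hypothetical observation added for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_test = np.array([[-1., 1., 0.]])  # hypothetical new observation

# Learn the mean and std of each column from the training data only
scaler = StandardScaler().fit(X_train)
print(' Learned means:', scaler.mean_)
print(' Learned stds: ', scaler.scale_)

# The by-hand version: (x - mean) / std, using the population std (numpy's default)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(' Matches by-hand formula:', np.allclose(manual, scaler.transform(X_train)))

# Apply the *training* means and stds to the test row
X_test_scaled = scaler.transform(X_test)
print(' X_test_scaled:', X_test_scaled)
```

The scaled test row will generally not have mean 0 or variance 1. That is expected, because it is measured relative to the training sample.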