
Preprocessing, The Cardinal Sin, and Pipes

After this lecture, you

  • Can prepare categorical variables for sklearn models
  • Know that different imputation strategies exist and can use them
  • Know that standardizing continuous variables can improve your models
  • Know that you shouldn't apply preprocessing transformations using info from the testing dataset; that's called "data leakage" and is akin to letting your model "see the future" while training
  • Know that you should apply the exact transformations to the testing data that you applied to the training data before making predictions

All of these can be accomplished by using pipelines. Pipelines are a crucial ingredient for any viable real-world ML project.

Preprocessing categorical variables

Depending on the variable you have, you can turn to

  • DictVectorizer is how you turn string categorical variables into usable numeric vars
  • OneHotEncoder takes array-like inputs instead of dicts

Let's start by borrowing a clear example from PDSH

data = [
    {'price': 850, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 650, 'rooms': 3, 'neighborhood': 'Queen Anne'},
    {'price': 700, 'rooms': 1, 'neighborhood': 'Wallingford'},
    {'price': 650, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 700, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 600, 'rooms': 2, 'neighborhood': 'Fremont'}
]
data
[{'price': 850, 'rooms': 4, 'neighborhood': 'Queen Anne'},
 {'price': 650, 'rooms': 3, 'neighborhood': 'Queen Anne'},
 {'price': 700, 'rooms': 1, 'neighborhood': 'Wallingford'},
 {'price': 650, 'rooms': 3, 'neighborhood': 'Wallingford'},
 {'price': 700, 'rooms': 3, 'neighborhood': 'Fremont'},
 {'price': 600, 'rooms': 2, 'neighborhood': 'Fremont'}]

sklearn can't use the string variable neighborhood in a regression the way statsmodels (sm) can:

import pandas as pd    
from statsmodels.formula.api import ols as sm_ols
print('The coefs from SM:')
print(sm_ols('price ~ neighborhood - 1', data = pd.DataFrame(data)).fit().params)
# ""-1" means no intercept. Don't do this! It's here for illustration
The coefs from SM:
neighborhood[Fremont]        650.0
neighborhood[Queen Anne]     750.0
neighborhood[Wallingford]    675.0
dtype: float64

So, we need to preprocess that data to run the same regression in sklearn. Depending on the variable you have, you can turn to

  • DictVectorizer is how you turn string categorical variables into usable numeric vars
  • OneHotEncoder takes array-like inputs instead of dicts
# create an object ("vec") that can do the transform
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int) 

# apply vec with ".fit_transform", save to new data obj
data2 = vec.fit_transform(data) 
print(data2, '\n')              
print(vec.get_feature_names())  # recovers the names (newer sklearn versions use .get_feature_names_out())

# now we can repeat the regression here
from sklearn.linear_model import LinearRegression
print('Reg coefs:')
LinearRegression(fit_intercept=False).fit(data2[:,:3],data2[:,3]).coef_
[[  0   1   0 850   4]
 [  0   1   0 650   3]
 [  0   0   1 700   1]
 [  0   0   1 650   3]
 [  1   0   0 700   3]
 [  1   0   0 600   2]] 

['neighborhood=Fremont', 'neighborhood=Queen Anne', 'neighborhood=Wallingford', 'price', 'rooms']
Reg coefs:
array([650., 750., 675.])
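The bullet list above also mentions OneHotEncoder, which does the same kind of dummy-variable encoding but expects array-like input (e.g., a DataFrame column) instead of a list of dicts. Here is a minimal sketch reusing the data list from above; the default output is a sparse matrix, so we convert it to print.

# a minimal sketch: OneHotEncoder on the 'neighborhood' column of the same data
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(data)                             # the list of dicts from above
enc = OneHotEncoder()                               # one dummy column per neighborhood
dummies = enc.fit_transform(df[['neighborhood']])   # expects a 2D array-like
print(dummies.toarray())                            # default output is sparse, so convert to print
print(enc.categories_)                              # recover the category labels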

Imputation / Missing Values

_We talked about imputation a bit before in the context of pandas. These slides on missing data are quite good! This article has examples too._

Before modeling, you have to decide how to deal with missing values. You can

  1. Drop observations with any missing values,
  2. Impute missing values (mean, median, mode, interpolation, deduction, mean-of-group, etc),
  3. Or model the missing values explicitly (e.g. in a regression, as an incremental intercept but with no impact on the slope).

What's the right choice? It depends. On the data, the domain, the question, and economic theory. My choices change from project to project. You might use a combination of these!

Focus on the whys and hows of dealing with missing data rather than the mechanics (you can look the mechanics up later). You should have some notes on imputation in pandas from the livecoding in the prior lecture.
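As a refresher, here's a minimal pandas sketch of a few of those strategies. The DataFrame and column names here are made up purely for illustration.

# hypothetical data, just to illustrate the pandas mechanics
import numpy as np
import pandas as pd

df = pd.DataFrame({'industry': ['tech', 'tech', 'retail', 'retail'],
                   'leverage': [0.5, np.nan, 0.9, np.nan]})

dropped    = df.dropna()                                   # option 1: drop rows with any missing value
mean_fill  = df['leverage'].fillna(df['leverage'].mean())  # option 2: impute the overall mean
group_fill = df.groupby('industry')['leverage'].transform(lambda x: x.fillna(x.mean()))  # option 2: mean within industry
df['leverage_was_missing'] = df['leverage'].isna()         # option 3: keep a flag so a model can use missingness itself
print(group_fill)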

sklearn comes with an impute class described in the official docs

# silly data
import numpy as np
X = np.array([[ np.nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   np.nan, 6  ],
              [ 8,   8,   1  ]])
print(X,'\n')

# it's this easy:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
imp.fit_transform(X) 
[[nan  0.  3.]
 [ 3.  7.  9.]
 [ 3.  5.  2.]
 [ 4. nan  6.]
 [ 8.  8.  1.]] 

array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])

imp.fit_transform(X) is the combination of imp.fit(X) and imp.transform(X).

If you have a train/test split, you shouldn't use fit_transform. Instead, use imp.fit(X_train) to get the means in the training sample and imp.transform(X_test) to apply those to the test data.
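For instance, continuing with the silly X above (the split point is arbitrary, just for illustration):

# fit the imputer on the training rows only, then apply those means to the test rows
X_train, X_test = X[:3], X[3:]         # arbitrary split of the silly data

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)                       # learns column means from the training rows only
print(imp.transform(X_test))           # fills the test-row NaN with the training mean (4.0), not the full-sample mean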

Standardization

Standardizing means rescaling your continuous variables so that they have a mean of zero and a variance of one.

The sklearn documentation on this is quite good.

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

Why does this matter? "If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."

In other words: STANDARDIZATION CAN IMPROVE YOUR PREDICTIONS.

# a very simple example
from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)

print(' X_scaled\n',         '-'*40,'\n',X_scaled,'\n')
print(' Mean of each var:\n','-'*40,'\n',X_scaled.mean(axis=0),'\n')
print(' STD of each var:\n', '-'*40,'\n',X_scaled.std(axis=0),'\n')
 X_scaled
 ---------------------------------------- 
 [[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]] 

 Mean of each var:
 ---------------------------------------- 
 [0. 0. 0.] 

 STD of each var:
 ---------------------------------------- 
 [1. 1. 1.] 

sklearn can scale variables in many ways. Some alternative transforms are faster and some transform non-normal distributions into proto-normal distributions (which can improve the efficacy of many models).

Visit (you guessed it!) the documentation for more.
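For example, here is a quick sketch of two alternatives: MinMaxScaler squashes a variable into [0, 1], and QuantileTransformer can map a skewed variable toward a normal shape. The skewed data here is made up just for illustration.

# made-up right-skewed data, just to illustrate two alternative scalers
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(size=(100, 1))          # heavily right-skewed variable

print(MinMaxScaler().fit_transform(skewed)[:3])                        # rescaled into [0, 1]
print(QuantileTransformer(output_distribution='normal',
                          n_quantiles=100).fit_transform(skewed)[:3])  # roughly standard normal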

The Cardinal Sin of ML: Data leakage

Now you know how to transform your data before training a model. You might be tempted to do something like:

import #a bunch of sklearn stuff
X, y = #load data
X = transform(X) # imputation, encode cat vars, standardize

# and then you either do these lines:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=9,train_size=.8)
model = # something
model.fit(Xtrain, ytrain)
y_predict = model.predict(Xtest) # using Xtest (out-of-sample data), predict ytest
accuracy_score(ytest, y_predict)

# or this:
cross_validate(model,X,y)

The problem here is that transform(X) used info from the ENTIRE dataset, including observations that ended up in Xtest!

This means that your cross-validation scores are unreliable: at the very least they will be overoptimistic, and in some cases the resulting models are downright invalid.


An illustrative aside

Here is a tiny example of that code in action. Suppose that the dataset has three observations, where X is time, and y is a stock price:


RAW DATA

  X     y   sample
  1     1   training
  nan   2   training
  3     3   test

Suppose transform(X) computes the mean of X and fills in missing values with that. So it figures out that the mean of X is 2 and fills it in. (Remember: Your code above calls transform on all of the data, including the test subsample!) So you have, after running transform(X):


AFTER transform(X), using the code above

  X     y   sample
  1     1   training
  2     2   training
  3     3   test

Now, you split the data up into training and test datasets, and in the training data, you estimate that $y=x$ (a 45 degree line). Thus, you load the test data, see that $X=3$, predict $y=3$, and voilà! A perfect model!

HOWEVER: the test sample is supposed to be data you do not have access to while training the model. In a real-world project trying to predict stock prices, $X=3$ occurs next month, so you could never have filled in the missing value with 2, because you never see $X=3$ while training. Instead, you probably would have filled in $X=1$, the average in the training set, as a best guess:


REAL WORLD DATA AFTER transform(X) WITHOUT SEEING THE FUTURE

  X     y   sample
  1     1   training
  1     2   training

So, this data would lead you to conclude that $y=1.5x$. Thus, when next month arrives and $X=3$, you predict $y=4.5$, and your prediction model is much less accurate than the code above would suggest.
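Here's that arithmetic with SimpleImputer, just to make the leak concrete (the three-row toy data is the example above):

# the toy data from the aside: X is time, the last row is the "future" test observation
import numpy as np
from sklearn.impute import SimpleImputer

X_toy = np.array([[1.0], [np.nan], [3.0]])
X_toy_train = X_toy[:2]                                    # the two rows you actually have while training

# leaky: fitting on ALL rows fills the NaN with (1+3)/2 = 2
print(SimpleImputer(strategy='mean').fit_transform(X_toy))

# honest: fitting on the training rows only fills it with 1
print(SimpleImputer(strategy='mean').fit_transform(X_toy_train))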


The absolute golden rule of prediction modeling is...

YOUR MODEL CAN'T HAVE ACCESS TO ANY DATA THAT IT WOULDN'T HAVE IN PRACTICE WHEN IT MAKES THE PREDICTION.

I know I already said that, and repetition is usually bad writing, but it must be said again. And again.

Data leakage can be tricky

Here are some more examples:

  • The outcome variable is a predictor (implicitly or explicitly)
  • Predictor variables that are recorded in response to the outcome (after the fact) or in anticipation of it
  • Example of the above: when predicting loan default, the data might include employee IDs for recent customer service contacts. But the most recent contact might be with trouble-loan specialists (because the firm anticipated possible default due to some other signal). Using that employee's customer contacts to predict default would add no value - the lender already knew to assign that employee!
  • The smell test - is it too good to be true? I've seen some asset pricing models with suspicious out-of-sample R2s. Predicting stock prices is hard! The best OOS predictive R2 for individual stocks in this paper is 1.80% per month.

The solution, or: Safety first, via Pipelines

Avoiding Data Leakage:

  1. Be very familiar with the data and how it was collected and built
  2. Do your data prep within CV folds

#2 is relatively easy to implement in sklearn: USE PIPES!

  • Pipelines apply all of their steps to the data they receive
  • In cross_validate's training folds, the entire pipeline (transformations and model) is fit on the training data
  • In cross_validate's testing folds, the saved transformations and the fitted model are applied to the test data
  • We will talk about pipelines for the next two lectures, so set expectations for yourself, work through all the examples, and try to follow the conceptual steps.

Today, let's quickly get our first pipe set up by following this walkthrough on scaling the iris data and building a classification model

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn import svm

iris = load_iris() # data

# set up the pipeline, which will, given a set of observations 
# 1. fit and apply these steps to the training fold
# 2. in the testing fold, apply the transform and model to predict (no estimation)

classifier_pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# ok, go!
scores = cross_validate(classifier_pipeline, iris.data, iris.target, cv=5)
scores
{'fit_time': array([0.00099659, 0.00099683, 0.00099683, 0.00099778, 0.00199175]),
 'score_time': array([0.        , 0.        , 0.        , 0.00099707, 0.        ]),
 'test_score': array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])}
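As a preview of where the next lectures go, the same pattern extends naturally: chain an imputer, a scaler, and a model, and cross_validate fits every step inside each training fold. (The SimpleImputer step is redundant here because the iris data has no missing values; it's just to show the chaining.)

# a longer pipe: impute, then scale, then classify
from sklearn.impute import SimpleImputer

longer_pipe = make_pipeline(SimpleImputer(strategy='median'),
                            preprocessing.StandardScaler(),
                            svm.SVC(C=1))

scores = cross_validate(longer_pipe, iris.data, iris.target, cv=5)
print(scores['test_score'].mean())   # average accuracy across the 5 folds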