5.4.4. Pipelines¶
Pipelines are just a series of steps you perform on data in sklearn. (The sklearn guide to them is here.)
A “typical” pipeline in ML projects:
- Preprocesses the data to clean and transform variables
- Possibly selects a subset of variables from among the features to avoid overfitting (see also this)
- Runs a model on those cleaned variables
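For instance, a pipeline with all three of those pieces might be sketched like this (purely an illustration: SelectKBest with f_regression stands in for the optional feature-selection step, and the other transformers are placeholders, not recommendations):
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
sketch_pipe = make_pipeline(SimpleImputer(),                 # preprocess: fill in missing values
                            StandardScaler(),                # preprocess: standardize the features
                            SelectKBest(f_regression, k=5),  # (optional) keep the 5 most informative features
                            Ridge())                         # run the model on the cleaned variables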
Tip
You can set up pipelines with make_pipeline.
5.4.4.1. Intro to pipes¶
For example, here is a simple pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
ridge_pipe = make_pipeline(SimpleImputer(),Ridge(1.0))
You put a series of steps inside make_pipeline, separated by commas.
The pipeline object (printed out below) is a list of steps, where each step has a name (e.g. “simpleimputer” ) and a task associated with that name (e.g. “SimpleImputer()”).
ridge_pipe
Pipeline(steps=[('simpleimputer', SimpleImputer()), ('ridge', Ridge())])
Tip
You can .fit() and .predict() pipelines like any model, and they can be used in cross_validate too!
Using it is the same as using any estimator! After I load the data we’ve been using on the last two pages (load code below), we can fit and predict like on the “one model intro” page:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_validate
url = 'https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']))
              .iloc[:, -11:]
             )
rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
ridge_pipe.fit(X_train,y_train)
ridge_pipe.predict(X_test)
array([5.95256433, 4.20060942, 3.9205946 , ..., 4.06401663, 5.30024985,
7.32600213])
Those are the same numbers as before - good!
We can use this pipeline in our cross validation in place of the estimator:
cross_validate(ridge_pipe, X_train, y_train,
               cv=KFold(5), scoring='r2')['test_score'].mean()
0.9030537085469961
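Once a pipeline is fit, you can also reach inside it and look at an individual step by name (the names are the ones printed in the pipeline object above). A quick sketch, assuming ridge_pipe was fit as above:
ridge_pipe.named_steps['ridge'].coef_                 # coefficients estimated by the Ridge step
ridge_pipe.named_steps['simpleimputer'].statistics_   # the fill-in values the imputer learned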
5.4.4.2. Preprocessing in pipes¶
Warning
(Virtually) All preprocessing should be done in the pipeline!
This is the link you should start with to see how you might clean and preprocess data. Key preprocessing steps include:
- Filling in missing values (imputation) or dropping those observations
- Standardization
- Encoding categorical data
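sklearn has a transformer for each of those steps. A quick sketch of the usual options (the specific parameter choices here are just examples, not requirements):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
imputer = SimpleImputer(strategy='median')        # fill missing values (mean, median, most_frequent, or constant)
scaler  = StandardScaler()                        # rescale each variable to mean 0, std 1
encoder = OneHotEncoder(handle_unknown='ignore')  # one dummy column per category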
With real-world data, you’ll have many data types. So the preprocessing steps you apply to one column won’t necessarily be what the next column needs.
I use ColumnTransformer to assemble the preprocessing portion of my full pipeline; it lets me process different variables differently.
The generic steps to preprocess in a pipeline:
1. Set up a pipeline for numerical data
2. Set up a pipeline for categorical variables
3. Set up the ColumnTransformer:
   - ColumnTransformer() is a class you call, so it needs the parentheses “()”
   - The first argument inside it is a list (so now it is ColumnTransformer([]))
   - Each element in that list is a tuple with three parts: the name of the step (you decide the name), the estimator/pipeline to use on that step, and which variables to use it on
   - Put the pipeline for each variable type as its own tuple inside ColumnTransformer([<here!>])
4. Use the ColumnTransformer you set up as the first step inside your glorious estimation pipeline.
So, let me put this together:
Tip
This is good pseudocode!
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
#############
# Step 1: how to deal with numerical vars
# pro-tip: you might set up several numeric pipelines, because
# some variables might need very different treatment!
#############
numer_pipe = make_pipeline(SimpleImputer())
# SimpleImputer() fills in missing values (by default, with the column mean)
# you might also standardize the vars in this numer_pipe (e.g., with StandardScaler())
#############
# Step 2: how to deal with categorical vars
#############
cat_pipe = make_pipeline(OneHotEncoder(drop='first'))
# notes on this cat pipe:
# OneHotEncoder is just one way to deal with categorical vars
# drop='first' drops one dummy per variable to avoid perfect multicollinearity in regression models
#############
# Step 3: combine the subparts
#############
preproc_pipe = ColumnTransformer(
    [ # arg 1 of ColumnTransformer is a list, so this starts the list
      # a tuple for the numerical vars: name, pipe, which vars to apply to
      ("num_impute", numer_pipe, ['l_credscore','TCMR']),
      # a tuple for the categorical vars: name, pipe, which vars to apply to
      ("cat_trans", cat_pipe, ['Property_state'])
    ],
    remainder='drop'  # you either drop or passthrough any vars not modified above
)
#############
# Step 4: put the preprocessing into an estimation pipeline
#############
new_ridge_pipe = make_pipeline(preproc_pipe,Ridge(1.0))
The data loaded above has no categorical variables, so I’m going to reload the data and keep new variables to illustrate what we can do:
- 'TCMR' and 'l_credscore' are numerical
- 'Property_state' is categorical
- 'l_LTV' is in the data, but should be dropped (because of remainder='drop')
So here is the raw data:
url = 'https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']))
              [['TCMR', 'Property_state', 'l_credscore', 'l_LTV']]
             )
rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
display(X_train.head())
display(X_train.describe().T.round(2))
|       | TCMR     | Property_state | l_credscore | l_LTV    |
|-------|----------|----------------|-------------|----------|
| 4326  | 4.651500 | IL             | 6.670766    | 4.499810 |
| 15833 | 4.084211 | TN             | 6.652863    | 4.442651 |
| 66753 | 3.675000 | MO             | 6.635947    | 4.442651 |
| 23440 | 3.998182 | MO             | 6.548219    | 4.553877 |
| 4155  | 4.651500 | CO             | 6.602588    | 4.442651 |
|             | count  | mean | std  | min  | 25%  | 50%  | 75%  | max  |
|-------------|--------|------|------|------|------|------|------|------|
| TCMR        | 7938.0 | 3.36 | 1.29 | 1.50 | 2.21 | 3.00 | 4.45 | 6.66 |
| l_credscore | 7938.0 | 6.60 | 0.07 | 6.27 | 6.55 | 6.61 | 6.66 | 6.72 |
| l_LTV       | 7938.0 | 4.51 | 0.05 | 4.25 | 4.49 | 4.50 | 4.55 | 4.57 |
We could .fit() and .transform() using the preproc_pipe from step 3 (or just .fit_transform() to do it in one command) to see how it transforms the data. But the output is tough to use:
preproc_pipe.fit_transform(X_train)
<7938x53 sparse matrix of type '<class 'numpy.float64'>'
with 23792 stored elements in Compressed Sparse Row format>
So I added a convenience function (df_after_transform) to the community codebook to show the dataframe after the ColumnTransformer step.
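(If you’re curious, a helper like that only takes a few lines. Here is a rough sketch of the idea, not necessarily the codebook’s exact implementation; it leans on sklearn’s get_feature_names_out() to label the columns:)
import pandas as pd
def sketch_df_after_transform(column_transformer, X):
    # fit/transform, densify the sparse output, and attach the generated column names
    out = column_transformer.fit_transform(X)
    if hasattr(out, 'toarray'):   # sparse matrix -> dense array
        out = out.toarray()
    return pd.DataFrame(out, columns=column_transformer.get_feature_names_out())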
Notice
- The l_LTV column is gone!
- The property state variable is now 50+ variables (one dummy for each state, and a few territories)
- The numerical variables aren’t changed (there are no missing values, so the imputation does nothing)
This is the transformed data:
from df_after_transform import df_after_transform
df_after_transform(preproc_pipe,X_train)
|   | l_credscore | TCMR | Property_state_AL | Property_state_AR | Property_state_AZ | Property_state_CA | Property_state_CO | Property_state_CT | Property_state_DC | Property_state_DE | ... | Property_state_SD | Property_state_TN | Property_state_TX | Property_state_UT | Property_state_VA | Property_state_VT | Property_state_WA | Property_state_WI | Property_state_WV | Property_state_WY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.670766 | 4.651500 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 6.652863 | 4.084211 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 6.635947 | 3.675000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 6.548219 | 3.998182 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 6.602588 | 4.651500 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7933 | 6.650279 | 1.556522 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7934 | 6.647688 | 2.416364 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7935 | 6.507278 | 6.054000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
7936 | 6.618739 | 2.303636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7937 | 6.639876 | 4.971304 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7938 rows × 53 columns
display(df_after_transform(preproc_pipe,X_train)
.describe().T.round(2)
.iloc[:7,:]) # only show a few variables for space...
|                   | count  | mean | std  | min  | 25%  | 50%  | 75%  | max  |
|-------------------|--------|------|------|------|------|------|------|------|
| l_credscore       | 7938.0 | 6.60 | 0.07 | 6.27 | 6.55 | 6.61 | 6.66 | 6.72 |
| TCMR              | 7938.0 | 3.36 | 1.29 | 1.50 | 2.21 | 3.00 | 4.45 | 6.66 |
| Property_state_AL | 7938.0 | 0.02 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Property_state_AR | 7938.0 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Property_state_AZ | 7938.0 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Property_state_CA | 7938.0 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Property_state_CO | 7938.0 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
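With the preprocessing confirmed, new_ridge_pipe from step 4 can be dropped into cross-validation exactly like the earlier pipeline (its score will differ from before because the feature set is different):
cross_validate(new_ridge_pipe, X_train, y_train,
               cv=KFold(5), scoring='r2')['test_score'].mean()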
5.4.4.3. Working with pipes¶
Using pipes is the same as using any model:
- .fit() and .predict(), put into CVs
- When modelling, you should spend time interrogating model predictions, plotting and printing. Does the model struggle predicting certain observations? Does it excel at some?
You’ll want to tweak parts of your pipeline. The next pages cover how we can do that.
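As one small preview: any parameter inside a pipeline can be changed with set_params, using the stepname__parameter naming convention. For example:
new_ridge_pipe.set_params(ridge__alpha=0.5)   # change the penalty strength of the Ridge step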