5.4.2. An intro to sklearn + Fitting One Model

Warning

This page only shows how sklearn fits ONE model, for ONE set of hyperparameters, on a generic X and y. The goal is to show you the flow of how we work through estimation. Do NOT copy this code wholesale for assignments: it deliberately omits several best practices while we build up your familiarity with developing an ML model. But the steps here are present in everything we do.

5.4.2.1. Five steps to fit a model

Step 1: Import class of model from sklearn

from sklearn.linear_model import Ridge

Step 2: Load data into y and X, and split off test data

# this cell is copied from the L17 lecture file
# EXCEPT: I put the interest rate in its own "y" variable
#         and remove the y variable from the fannie_mae data

import pandas as pd
import numpy as np

url        = 'https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip').dropna()
y          = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']),
                     )
              .iloc[:,-11:] # limit to these vars for the sake of this example
             )
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
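
To see what the split produces without downloading the Fannie Mae file, here is a minimal sketch on made-up data. The names X_toy, y_toy, and the column labels are hypothetical and are not part of the lecture file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features (NOT the Fannie Mae data)
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_toy = pd.Series(rng.normal(size=100), name="y")

# When test_size is not given, train_test_split holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_te.shape)   # (75, 3) (25, 3)

# X and y are shuffled together, so each row of X still lines up with its y
rows_aligned = (X_tr.index == y_tr.index).all()
print(rows_aligned)             # True
```

The test rows are set aside here and are not touched again until Step 5.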

Step 3: Choose the model you want and “instantiate” that class of model

Important: Many kinds of models have settings that you choose before fitting, as opposed to values the model learns from the data. These settings are referred to as “hyperparameters”.

# create ("instantiate") a model object from the class; here I set the hyperparameter alpha=1
ridge = Ridge(alpha=1.0)
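
As a small illustration of how hyperparameters live on the model object, the sketch below creates two Ridge objects and inspects them. The variable names ridge_default and ridge_strong are made up for this example; get_params() and set_params() are standard methods on sklearn estimators.

```python
from sklearn.linear_model import Ridge

# Hypothetical example objects (names are illustrative only)
ridge_default = Ridge()            # no arguments: alpha takes its default of 1.0
ridge_strong  = Ridge(alpha=10.0)  # a larger alpha means a stronger penalty

# get_params() returns the hyperparameters as a dictionary
print(ridge_default.get_params()["alpha"])   # 1.0
print(ridge_strong.get_params()["alpha"])    # 10.0

# set_params() changes a hyperparameter on an existing object
ridge_strong.set_params(alpha=0.1)
print(ridge_strong.alpha)                    # 0.1
```

Nothing has been estimated yet at this point: instantiating only records the settings the model will use when fit() is called.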

Step 4: fit() the model on training data

This is done by typing <modelname>.fit(X_train,y_train). Here, our model object is called ridge, so:

ridge.fit(X_train,y_train)
Ridge()
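
To make concrete what fit() does, here is a sketch on made-up data where the true relationship is known (y = 3 + 2*x1 - 1*x2 plus a little noise). The data and the names X_toy, y_toy, and toy_ridge are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data with a known relationship (NOT the Fannie Mae data)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_ridge = Ridge(alpha=1.0)
fitted_before = hasattr(toy_ridge, "coef_")   # False: nothing is learned until fit()

# fit() estimates the coefficients and returns the model object itself
returned = toy_ridge.fit(X_toy, y_toy)
same_object = returned is toy_ridge           # True

# Learned values are stored in attributes that end with an underscore
print(toy_ridge.coef_)        # close to [2, -1], shrunk slightly by the penalty
print(toy_ridge.intercept_)   # close to 3
```

Because fit() returns the model object, a notebook cell that ends with a fit() call echoes the estimator as its output.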

Step 5: Apply the model to new data. Either:

  • <modelname>.predict(X_test) will predict what \(y\) should be using \(X_{test}\), and is used in supervised learning tasks

  • <modelname>.transform(X_test) will change \(X_{test}\) using the model, and is common with preprocessing and unsupervised learning

ridge.predict(X_test)
array([5.95256433, 4.20060942, 3.9205946 , ..., 4.06401663, 5.30024985,
       7.32600213])
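
The difference between transform() and predict() can be sketched on made-up data as follows. StandardScaler is used here only as one common example of a transformer; the data and variable names are hypothetical, and this is not a recommended pipeline for assignments.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (NOT the Fannie Mae data)
rng = np.random.RandomState(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# transform(): fit the scaler on the training data, then use it to CHANGE X
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # same shape as X_te, rescaled values

# predict(): fit the model on the training data, then use it to PREDICT y
model = Ridge(alpha=1.0).fit(X_tr_scaled, y_tr)
y_pred = model.predict(X_te_scaled)    # one prediction per test row

print(X_te_scaled.shape)   # (50, 2)
print(y_pred.shape)        # (50,)
```

In both cases the object is fit on the training data only and then applied to the test data.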

The text here is adapted from the Python Data Science Handbook (PDSH).