5.4.2. An intro to SK-learn + Fitting One Model¶
This is just showing you how sklearn fits ONE model for ONE set of hyperparameters on a generic X and y. The idea is to show you the flow of how we work through estimation. Do NOT wholesale copy this code for assignments: it deliberately omits a number of best practices while we build up your familiarity with developing an ML model. But the steps here are present in everything we do.
5.4.2.1. Five steps to fit a model¶
Step 1: Import the model class from sklearn
from sklearn.linear_model import Ridge
Step 2: Load data into y and X, and split off test data
# this cell is copied from the L17 lecture file
# EXCEPT: I put the interest rate in its own "y" variable
# and remove the y variable from the fannie_mae data
import pandas as pd
import numpy as np
url = 'https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
.assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']),
)
.iloc[:,-11:] # limit to these vars for the sake of this example
)
from sklearn.model_selection import train_test_split
rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
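A quick sanity check never hurts here. The lines below are my addition, not part of the lecture code: they just confirm that train_test_split held out a quarter of the rows (its default test size) for the test set.
# my addition: confirm the sizes of the split
# by default, train_test_split reserves 25% of the observations for testing
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)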
Step 3: Choose initial model hyperparameters by instantiating this class with desired values
# create ("instantiate") the class, here I set hyper param alpha=1
ridge = Ridge(alpha=1.0)
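If you are not sure which hyperparameters a model exposes, every sklearn estimator has a get_params() method. A small sketch (my addition, not one of the five steps):
# list the hyperparameters this estimator exposes and their current values
ridge.get_params()
# returns a dict like {'alpha': 1.0, 'fit_intercept': True, ...}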
Step 4: fit() the model on the training data
ridge.fit(X_train,y_train)
Ridge()
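Once fit() has run, the estimated parameters live on the object under names ending in an underscore (sklearn's convention for fitted attributes). A minimal sketch of peeking at them, added here for illustration:
# fitted attributes end with an underscore by sklearn convention
print(ridge.intercept_)  # the estimated intercept
print(ridge.coef_)       # one estimated coefficient per column of X_train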
Step 5: Apply the model to new data. Either:
- <modelname>.predict(X_test) will predict what \(y\) should be using X_test, and is used in supervised learning tasks
- <modelname>.transform(X_test) will change X_test using the model, and is common with preprocessing and unsupervised learning
ridge.predict(X_test)
array([5.95256433, 4.20060942, 3.9205946 , ..., 4.06401663, 5.30024985,
7.32600213])
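The example stops at the predictions, but the natural next move is to compare them to y_test. A minimal sketch of doing so; note the metric choices (R² and mean squared error) are mine, not part of the original example:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = ridge.predict(X_test)
print(r2_score(y_test, y_pred))            # same R² as ridge.score(X_test, y_test)
print(mean_squared_error(y_test, y_pred))  # lower is better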
The text here is adapted from the Python Data Science Handbook (PDSH)