{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pipelines\n",
"\n",
"Pipelines are just a series of steps you perform on data in `sklearn`. (The `sklearn` [guide to them is here.](https://scikit-learn.org/stable/modules/compose.html))\n",
"\n",
"A \"typical\" pipeline in ML projects:\n",
"1. [Preprocesses the data](https://scikit-learn.org/stable/modules/preprocessing.html) to clean and transform variables \n",
"1. Possibly selects a subset of variables from among the features [to avoid overfitting](03a_ML_obj_and_tradeoff) (see also [this](https://scikit-learn.org/stable/modules/feature_selection.html))\n",
"1. Runs [a model](03e_whichModel) on those cleaned variables \n",
"\n",
"```{tip}\n",
"You can set up pipelines with `make_pipeline`.\n",
"```\n",
"\n",
"## Intro to pipes\n",
"\n",
"For example, here is a simple pipeline:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline \n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"# set_config(transform_output=\"pandas\") might slow down sklearn, but should be used during development to \n",
"# facilitate EDA/ABCD, because we can see the transformed data with variable names\n",
"from sklearn import set_config \n",
"set_config(transform_output=\"pandas\") \n",
"\n",
"ridge_pipe = make_pipeline(SimpleImputer(),Ridge(1.0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You put a series of steps inside `make_pipeline`, separated by commas.\n",
"\n",
"The pipeline object (printed out below) is a list of steps, where each step has a name (e.g. \"simpleimputer\" ) and a task associated with that name (e.g. \"SimpleImputer()\")."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('simpleimputer', SimpleImputer()), ('ridge', Ridge())])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ridge_pipe"
]
},
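{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, `make_pipeline` names each step by lowercasing its class name. Here is a small sketch of how you can grab a step (or its parameters) by that name:\n",
"\n",
"```python\n",
"pipe = make_pipeline(SimpleImputer(), Ridge(1.0))\n",
"list(pipe.named_steps)             # ['simpleimputer', 'ridge']\n",
"pipe.named_steps['ridge'].alpha    # 1.0, the penalty we passed in\n",
"```"
]
},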
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{tip}\n",
"You can `.fit()` and `.predict()` pipelines like any model, and they can be used in `cross_validate` too!\n",
"```\n",
"\n",
"Using it is the same as using any estimator! After I load the data we've been using [from the last two pages](04d_crossval) below (hidden), we can fit and predict like on the [\"one model intro\" page](04c_onemodel):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.model_selection import KFold, cross_validate\n",
"\n",
"url = 'https://github.com/LeDataSciFi/data/blob/main/Fannie%20Mae/Fannie_Mae_Plus_Data.gzip?raw=true'\n",
"fannie_mae = pd.read_csv(url,compression='gzip').dropna()\n",
"y = fannie_mae.Original_Interest_Rate\n",
"fannie_mae = (fannie_mae\n",
" .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),\n",
" l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']),\n",
" )\n",
" .iloc[:,-11:] \n",
" )\n",
"\n",
"rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly\n",
"X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([5.95256433, 4.20060942, 3.9205946 , ..., 4.06401663, 5.30024985,\n",
" 7.32600213])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ridge_pipe.fit(X_train,y_train)\n",
"ridge_pipe.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those are the same numbers as before - good! \n",
"\n",
"We can use this pipeline in our cross validation in place of the estimator:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9030537085469961"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_validate(ridge_pipe,X_train,y_train,\n",
" cv=KFold(5), scoring='r2')['test_score'].mean()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing in pipes\n",
"\n",
"```{warning}\n",
"(Virtually) all preprocessing should be done inside the pipeline! That way, steps like imputation are fit only on the training folds during cross-validation, which prevents data leakage.\n",
"```\n",
"\n",
"[This is the link you should start with to see how you might clean and preprocess data.](https://scikit-learn.org/stable/modules/preprocessing.html) Key preprocessing steps include\n",
"- Filling in missing values (imputation) or dropping those observations\n",
"- Standardization\n",
"- Encoding categorical data\n",
"\n",
"With real-world data, you'll have many data types. So the preprocessing steps you apply to one column won't necessarily be what the next column needs. \n",
"\n",
"I use [ColumnTransformer](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py) to assemble the preprocessing portion of my full pipeline; it lets me process different variables differently.\n",
"\n",
"---\n",
"\n",
"**The generic steps to preprocess in a pipeline:**\n",
"1. Set up a pipeline for numerical data\n",
"1. Set up a pipeline for categorical variables\n",
"1. Set up the ColumnTransformer:\n",
" - `ColumnTransformer()` is a function, so it needs the parentheses \"()\"\n",
" - The first argument inside it is a list (so now it is `ColumnTransformer([])`)\n",
" - Each element in that list is a tuple that has three parts: \n",
" - name of the step (you decide the name), \n",
" - estimator/pipeline to use on that step, \n",
" - and which variables to use it on\n",
" - **Put the pipeline for each variable type as its own tuple inside `ColumnTransformer([])`**\n",
"1. Use the `ColumnTransformer` set as the first step inside your glorious estimation pipeline. \n",
"\n",
"---\n",
"\n",
"So, let me put this together: \n",
"\n",
"```{tip}\n",
"This is good pseudocode to copy for your own projects!\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder \n",
"from sklearn.compose import ColumnTransformer, make_column_selector\n",
"\n",
"#############\n",
"# Step 1: how to deal with numerical vars\n",
"# pro-tip: you might set up several numeric pipelines, because\n",
"# some variables might need very different treatment!\n",
"#############\n",
"\n",
"numer_pipe = make_pipeline(SimpleImputer()) \n",
"# SimpleImputer() fills in missing values (with the column mean, by default)\n",
"# you might also standardize the vars in this numer_pipe (e.g. with StandardScaler)\n",
"\n",
"#############\n",
"# Step 2: how to deal with categorical vars\n",
"#############\n",
"\n",
"cat_pipe = make_pipeline(OneHotEncoder(drop='first',sparse_output=False))\n",
"\n",
"# notes on this cat pipe:\n",
"# OneHotEncoder is just one way to deal with categorical vars\n",
"# drop='first' avoids perfect multicollinearity (the \"dummy variable trap\") when the model is a regression\n",
"# sparse_output=False might slow down sklearn, BUT IT MUST BE USED WITH set_config(pandas) !!! \n",
"\n",
"#############\n",
"# Step 3: combine the subparts\n",
"#############\n",
"\n",
"preproc_pipe = ColumnTransformer( \n",
" [ # arg 1 of ColumnTransformer is a list, so this starts the list\n",
" # a tuple for the numerical vars: name, pipe, which vars to apply to\n",
" (\"num_impute\", numer_pipe, ['l_credscore','TCMR']),\n",
" # a tuple for the categorical vars: name, pipe, which vars to apply to\n",
" (\"cat_trans\", cat_pipe, ['Property_state'])\n",
" ]\n",
" , remainder = 'drop' # you either drop or passthrough any vars not modified above\n",
")\n",
"\n",
"#############\n",
"# Step 4: put the preprocessing into an estimation pipeline\n",
"#############\n",
"\n",
"new_ridge_pipe = make_pipeline(preproc_pipe,Ridge(1.0))"
]
},
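{
"cell_type": "markdown",
"metadata": {},
"source": [
"One more trick: instead of hard-coding column names in each tuple, `make_column_selector` (imported above) can pick columns by dtype. A minimal sketch (the pipes are redefined here so the snippet stands alone):\n",
"\n",
"```python\n",
"numer_pipe = make_pipeline(SimpleImputer())\n",
"cat_pipe = make_pipeline(OneHotEncoder(drop='first', sparse_output=False))\n",
"\n",
"preproc_by_dtype = ColumnTransformer(\n",
"    [('num', numer_pipe, make_column_selector(dtype_include='number')),\n",
"     ('cat', cat_pipe, make_column_selector(dtype_include=object))],\n",
"    remainder='drop')\n",
"```\n",
"\n",
"This applies the numeric pipe to every numeric column and the categorical pipe to every string column, without listing any of them by name."
]
},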
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data loaded above has no categorical variables, so I'm going to reload the data and keep a different set of variables to illustrate what we can do: \n",
"- `'TCMR','l_credscore'` are numerical\n",
"- `'Property_state'` is categorical\n",
"- `'l_LTV'` is in the data, but should be dropped (because of `remainder='drop'`)\n",
"\n",
"So here is the raw data:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/plain": [
" TCMR Property_state l_credscore l_LTV\n",
"4326 4.651500 IL 6.670766 4.499810\n",
"15833 4.084211 TN 6.652863 4.442651\n",
"66753 3.675000 MO 6.635947 4.442651\n",
"23440 3.998182 MO 6.548219 4.553877\n",
"4155 4.651500 CO 6.602588 4.442651"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" count mean std min 25% 50% 75% max\n",
"TCMR 7938.0 3.36 1.29 1.50 2.21 3.00 4.45 6.66\n",
"l_credscore 7938.0 6.60 0.07 6.27 6.55 6.61 6.66 6.72\n",
"l_LTV 7938.0 4.51 0.05 4.25 4.49 4.50 4.55 4.57"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"url = 'https://github.com/LeDataSciFi/data/blob/main/Fannie%20Mae/Fannie_Mae_Plus_Data.gzip?raw=true'\n",
"fannie_mae = pd.read_csv(url,compression='gzip').dropna()\n",
"y = fannie_mae.Original_Interest_Rate\n",
"fannie_mae = (fannie_mae\n",
" .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),\n",
" l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']),\n",
" )\n",
" [['TCMR', 'Property_state', 'l_credscore', 'l_LTV']]\n",
" )\n",
"\n",
"rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly\n",
"X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)\n",
"\n",
"display(X_train.head())\n",
"display(X_train.describe().T.round(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could `.fit()` and `.transform()` using the `preproc_pipe` from step 3 (or just `.fit_transform()` to do it in one command) to see how it transforms the data. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" num_impute__l_credscore num_impute__TCMR \\\n",
"4326 6.670766 4.651500 \n",
"15833 6.652863 4.084211 \n",
"66753 6.635947 3.675000 \n",
"23440 6.548219 3.998182 \n",
"4155 6.602588 4.651500 \n",
"... ... ... \n",
"123118 6.650279 1.556522 \n",
"69842 6.647688 2.416364 \n",
"51872 6.507278 6.054000 \n",
"128800 6.618739 2.303636 \n",
"46240 6.639876 4.971304 \n",
"\n",
" cat_trans__Property_state_AL cat_trans__Property_state_AR \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_AZ cat_trans__Property_state_CA \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_CO cat_trans__Property_state_CT \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 1.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_DC cat_trans__Property_state_DE ... \\\n",
"4326 0.0 0.0 ... \n",
"15833 0.0 0.0 ... \n",
"66753 0.0 0.0 ... \n",
"23440 0.0 0.0 ... \n",
"4155 0.0 0.0 ... \n",
"... ... ... ... \n",
"123118 0.0 0.0 ... \n",
"69842 0.0 0.0 ... \n",
"51872 0.0 0.0 ... \n",
"128800 0.0 0.0 ... \n",
"46240 0.0 0.0 ... \n",
"\n",
" cat_trans__Property_state_SD cat_trans__Property_state_TN \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 1.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_TX cat_trans__Property_state_UT \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 1.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_VA cat_trans__Property_state_VT \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_WA cat_trans__Property_state_WI \\\n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 1.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
" cat_trans__Property_state_WV cat_trans__Property_state_WY \n",
"4326 0.0 0.0 \n",
"15833 0.0 0.0 \n",
"66753 0.0 0.0 \n",
"23440 0.0 0.0 \n",
"4155 0.0 0.0 \n",
"... ... ... \n",
"123118 0.0 0.0 \n",
"69842 0.0 0.0 \n",
"51872 0.0 0.0 \n",
"128800 0.0 0.0 \n",
"46240 0.0 0.0 \n",
"\n",
"[7938 rows x 53 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformed_Xtrain = preproc_pipe.fit_transform(X_train)\n",
"transformed_Xtrain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice\n",
"- The `l_LTV` column is gone!\n",
"- The property state variable is now 50+ variables (one dummy for each state, and a few territories)\n",
"- The numerical variables aren't changed (there are no missing values here, so the imputation does nothing)\n"
]
},
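{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see exactly what `drop='first'` does, here is a toy example on a hypothetical three-state column:\n",
"\n",
"```python\n",
"toy = pd.DataFrame({'state': ['CO', 'IL', 'TN', 'IL']})\n",
"enc = OneHotEncoder(drop='first', sparse_output=False)\n",
"X_toy = enc.fit_transform(toy)\n",
"# the first category alphabetically (CO) becomes the omitted baseline,\n",
"# so CO rows are all zeros and only two dummy columns remain\n",
"enc.get_feature_names_out()   # ['state_IL', 'state_TN']\n",
"```"
]
},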
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count mean std min 25% 50% 75% max\n",
"num_impute__l_credscore 7938.0 6.60 0.07 6.27 6.55 6.61 6.66 6.72\n",
"num_impute__TCMR 7938.0 3.36 1.29 1.50 2.21 3.00 4.45 6.66\n",
"cat_trans__Property_state_AL 7938.0 0.02 0.12 0.00 0.00 0.00 0.00 1.00\n",
"cat_trans__Property_state_AR 7938.0 0.01 0.10 0.00 0.00 0.00 0.00 1.00\n",
"cat_trans__Property_state_AZ 7938.0 0.03 0.17 0.00 0.00 0.00 0.00 1.00\n",
"cat_trans__Property_state_CA 7938.0 0.07 0.25 0.00 0.00 0.00 0.00 1.00\n",
"cat_trans__Property_state_CO 7938.0 0.03 0.16 0.00 0.00 0.00 0.00 1.00"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(transformed_Xtrain\n",
" .describe().T.round(2)\n",
" .iloc[:7,:]) # only show a few variables for space..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with pipes\n",
"\n",
"- Using pipes is the same as using any model: `.fit()`, `.predict()`, and you can drop them into cross-validation\n",
"- When modeling, you should spend time interrogating model predictions, plotting, and printing. Does the model struggle to predict certain observations? Does it excel at some?\n",
"- You'll want to tweak parts of your pipeline. The next pages cover how we can do that."
]
}
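,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, one quick way to interrogate predictions is to rank observations by error size. A sketch on synthetic data (in this notebook, you would swap in `new_ridge_pipe`, `X_test`, and `y_test`; `nlargest` assumes the target is a pandas Series):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.RandomState(0)\n",
"demo_X = pd.DataFrame({'x': rng.normal(size=200)})\n",
"demo_y = pd.Series(2 * demo_X['x'] + rng.normal(size=200))\n",
"\n",
"pipe = make_pipeline(SimpleImputer(), Ridge(1.0))\n",
"pipe.fit(demo_X, demo_y)\n",
"\n",
"resid = demo_y - pipe.predict(demo_X)\n",
"resid.abs().nlargest(5)   # the five observations the model misses worst\n",
"```"
]
}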
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 4
}