5.4.1. Principles into Practice

Tip

Here is a template ipynb file you can use when putting together a project that follows the pseudocode below.

Let’s put the principles from last chapter into code. Here is the pseudocode:

  1. All of your import statements

  2. Load data

  3. Split your data into 2 subsamples: a “training” portion and a “holdout” (aka “test”) portion, as in this page, this page, or this page. This is the first arrow in the picture below. We will do all of our work on the “train” sample until the very last step.
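
    A minimal sketch of this step, assuming the raw data is loaded into a pandas DataFrame and the target is the v_Sale_Price column (the file path and DataFrame name below are placeholders, not part of the original):

        import pandas as pd
        from sklearn.model_selection import train_test_split

        # placeholder path: point this at your own data file
        housing = pd.read_csv("input/housing.csv")

        # y is the target, X is everything else
        y = housing["v_Sale_Price"]
        X = housing.drop(columns=["v_Sale_Price"])

        # hold out 20% of rows for the final evaluation; random_state makes the split reproducible
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0
        )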

  4. Before modeling, do EDA (on the training data only!)

    • Sample basics: What is the unit of observation? What time spans are covered?

    • Look for outliers, missing values, or data errors

    • Note which variables are continuous or discrete numbers and which are categorical (and whether the ordering of the categories is meaningful)

    • You should read up on what all the variables mean from the documentation in the data folder.

    • Visually explore the relationship between v_Sale_Price and other variables.

      • For continuous variables - take note of whether the relationship seems linear, quadratic, or some higher-order polynomial

      • For categorical variables - maybe try a box plot for the various levels?

    • Now decide how you’d clean the data (imputing missing values, scaling variables, encoding categorical variables). These lessons will go into the preprocessing portion of your pipeline below. The sklearn guide on preprocessing is very informative, as are this page and the video I link to therein.
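
    One way to start the EDA, assuming the X_train and y_train objects from the split above (the two column names used in the plots are placeholders for variables in your own data):

        import matplotlib.pyplot as plt
        import seaborn as sns

        # reattach the target so plots can use it (training data only!)
        train = X_train.assign(v_Sale_Price=y_train)

        # sample basics: size, variable types, summary stats, missing values
        print(train.shape)
        print(train.dtypes)
        print(train.describe())
        print(train.isna().sum().sort_values(ascending=False).head(10))

        # continuous variable vs. the target: does the relationship look linear?
        sns.scatterplot(data=train, x="v_Lot_Area", y="v_Sale_Price")  # placeholder column name
        plt.show()

        # categorical variable vs. the target: box plot across the levels
        sns.boxplot(data=train, x="v_Neighborhood", y="v_Sale_Price")  # placeholder column name
        plt.show()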

  5. Prepare to optimize a series of models (covered here)

    1. Set up one pipeline to clean each type of variable

    2. Combine those pipes into a “preprocessing” pipeline using ColumnTransformer

    3. Set up your cross validation method:

      • The picture below illustrates 5 folds, split by row number.

      • There are many CV splitters available, including TimeSeriesSplit (a starting point for asset price predictions) and GroupTimeSeriesSplit, which is in development (it addresses a core problem with TimeSeriesSplit and is shown in practice here)

    4. Set up your scoring metric as discussed here
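
    A minimal sketch of this setup, assuming the X_train object from above; the imputation, scaling, and encoding choices here are examples, not requirements:

        from sklearn.pipeline import make_pipeline
        from sklearn.compose import ColumnTransformer
        from sklearn.impute import SimpleImputer
        from sklearn.preprocessing import StandardScaler, OneHotEncoder
        from sklearn.model_selection import KFold

        # column lists you settled on during EDA
        num_cols = X_train.select_dtypes("number").columns
        cat_cols = X_train.select_dtypes("object").columns

        # one pipe per variable type
        numer_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
        cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                 OneHotEncoder(handle_unknown="ignore"))

        # combine the pipes into one preprocessing step
        preproc_pipe = ColumnTransformer(
            [("num", numer_pipe, num_cols), ("cat", cat_pipe, cat_cols)],
            remainder="drop",
        )

        # CV splitter: 5 folds here; TimeSeriesSplit is a starting point for asset price problems
        cv = KFold(n_splits=5)

        # scoring metric (pick one that matches your problem)
        scoring = "r2"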

  6. Optimize candidate model 1 on the training data

    1. Set up a pipeline that combines the preprocessing and the estimator

    2. Set up a hyperparameter grid

    3. Find the optimal hyperparameters (e.g., with GridSearchCV)

    4. Save the pipeline with the optimal parameters in place
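
    For example, with a Lasso as candidate model 1 (the estimator and grid values are illustrative choices, and preproc_pipe, cv, and scoring come from the previous step):

        from sklearn.linear_model import Lasso
        from sklearn.pipeline import Pipeline
        from sklearn.model_selection import GridSearchCV

        # preprocessing + estimator in one pipeline
        pipe1 = Pipeline([("preproc", preproc_pipe), ("est", Lasso())])

        # "est__" prefixes parameters that belong to the estimator step of the pipeline
        param_grid = {"est__alpha": [0.001, 0.01, 0.1, 1]}

        # exhaustive search over the grid using the CV splitter and scorer set up above
        grid1 = GridSearchCV(pipe1, param_grid=param_grid, cv=cv, scoring=scoring)
        grid1.fit(X_train, y_train)

        # the refit pipeline with the optimal params already in place
        model1 = grid1.best_estimator_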

  7. Repeat step 6 for other candidate models
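
    Continuing the sketch above, a second candidate just swaps in a different estimator while reusing the same preprocessing pipe (again, the estimator and grid are illustrative):

        from sklearn.ensemble import RandomForestRegressor

        pipe2 = Pipeline([("preproc", preproc_pipe),
                          ("est", RandomForestRegressor(random_state=0))])
        grid2 = GridSearchCV(pipe2,
                             param_grid={"est__max_depth": [3, 5, 10]},
                             cv=cv, scoring=scoring)
        grid2.fit(X_train, y_train)
        model2 = grid2.best_estimator_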

  8. Compare all of the optimized models

    # something like... (model1, model2, cv, and scoring come from the steps above)
    from sklearn.model_selection import cross_validate

    for name, model in [("lasso", model1), ("random forest", model2)]:
        scores = cross_validate(model, X_train, y_train, cv=cv, scoring=scoring)
        print(name, scores["test_score"].mean())

