# Principles into Practice

````{important}
```{tip}
[Here is a template ipynb file](https://github.com/LeDataSciFi/ledatascifi-2022/blob/main/handouts/ML/ML_template.ipynb) you can use when putting together a project that follows the pseudocode below.
```
````

Let's put the principles from the last chapter into code. Here is the pseudocode:

1. Import all of the functions you need
2. Load data 
3. Split your data into 2 subsamples: a "training" portion and a "holdout" (aka "test") portion, as shown [on this page](03_ML), [this page](03c_ModelEval), or [this page](03c1_OOS). This is the first arrow in the picture below.[^pic] We will do all of our work on the "train" sample until the very last step.
4. Before modeling, do EDA (**on the training data only!**)
    - Sample basics: What is the unit of observation? What time spans are covered?
    - Look for outliers, missing values, or data errors
    - Note which variables are continuous or discrete numbers and which are categorical (and whether the categorical ordering is meaningful)
    - You should read up on what all the variables mean from the documentation in the data folder.
    - Visually explore the relationship between `v_Sale_Price` and other variables.
        - For continuous variables - take note of whether the relationship seems linear or quadratic or polynomial
        - For categorical variables - maybe try a box plot for the various levels?
    - Now decide how you'd clean the data (imputing missing values, scaling variables, encoding categorical variables). These lessons will go into the preprocessing portion of your pipeline below. The [sklearn guide on preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) is very informative, as is [this page **and the video I link to therein.**](04e1_preprocessing)
    
5. Prepare to optimize a series of models ([covered here](04e_pipelines)) 
    1. Set up one pipeline to clean each type of variable
    2. Combine those pipes into a "preprocessing" pipeline using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer)
    1. [Set up your cross validation method](04d_crossval):
        - The picture below illustrates 5 folds, apparently based on the row number.
        - There are many [CV splitters](https://scikit-learn.org/stable/modules/classes.html#splitter-classes) available, including `TimeSeriesSplit` (a starting point for asset price predictions). [`GroupTimeSeriesSplit`](https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243) is in development; it addresses a core problem with `TimeSeriesSplit` and [is shown in practice here](https://www.kaggle.com/jorijnsmit/found-the-holy-grail-grouptimeseriessplit).
    1. Set up your scoring [metric](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) as discussed [here](03d_whatToMax)
6. [Optimize candidate model 1](04f_optimizing_a_model) _on the training data_
    1. Set up [a pipeline](04e_pipelines) that combines preprocessing and an estimator
    1. Set up a hyperparameter grid
    1. Find the optimal hyperparameters (e.g. with `GridSearchCV`)
    1. Save the pipeline with the optimal parameters in place
7. Repeat step 6 for other candidate models
8. Compare all of the optimized models
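    ```python
    # something like this, where `models` holds the optimized pipelines
    # and `cv` and `scoring` are the splitter and metric chosen in step 5
    from sklearn.model_selection import cross_validate

    for model in models:
        scores = cross_validate(model, X_train, y_train, cv=cv, scoring=scoring)
        print(scores["test_score"].mean())
    ```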
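
The pseudocode above can be sketched end to end on a toy dataset. This is only an illustration under loud assumptions: the data is synthetic, the column names (`sqft`, `age`, `style`) are hypothetical, and `Ridge`, `KFold`, the `"r2"` metric, and the `alpha` grid are stand-ins for whatever candidate model, splitter, metric, and grid your project calls for.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step 2: "load" data. Synthetic stand-in; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sqft": rng.normal(1500, 300, n),
    "age": rng.integers(0, 80, n).astype(float),
    "style": rng.choice(["ranch", "colonial", "split"], n),
})
y = 100 * df["sqft"] - 500 * df["age"] + rng.normal(0, 5000, n)
df.loc[rng.choice(n, 10, replace=False), "sqft"] = np.nan  # add missing values

# Step 3: hold out a test set; everything below uses only the training part.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0
)

# Step 5.1: one pipeline per type of variable.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

# Step 5.2: combine them into a preprocessing step with ColumnTransformer.
preproc = ColumnTransformer([
    ("num", num_pipe, ["sqft", "age"]),
    ("cat", cat_pipe, ["style"]),
])

# Steps 5.3 and 5.4: cross validation splitter and scoring metric.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = "r2"

# Step 6: pipeline = preprocessing + estimator, tuned on the training data.
pipe = Pipeline([("preproc", preproc), ("estimator", Ridge())])
param_grid = {"estimator__alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring=scoring)
search.fit(X_train, y_train)

# GridSearchCV refits the best pipeline on all of the training data.
best_pipe = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Because the imputer, scaler, and encoder live inside the pipeline, they are refit within each training fold during the search, so nothing from a validation fold (or the holdout) leaks into preprocessing.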

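To see why the choice of CV splitter matters, here is a small sketch comparing the folds produced by `KFold` and `TimeSeriesSplit`. It assumes 12 toy observations that are already sorted by time; the array `X` and the fold counts are illustrative choices, not part of the project data.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 toy observations, sorted by time

splitters = [
    ("KFold", KFold(n_splits=3)),
    ("TimeSeriesSplit", TimeSeriesSplit(n_splits=3)),
]
for name, splitter in splitters:
    print(name)
    for train_idx, test_idx in splitter.split(X):
        print("  train:", train_idx, "test:", test_idx)

# Keep the time series folds around to inspect them.
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
```

With `KFold`, some training rows come after the validation rows, which lets a model "see the future". With `TimeSeriesSplit`, every training row precedes every validation row in that fold.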

---

[^pic]: ![](img/feature_5_fold_cv.jpg)