{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Principles into Practice\n",
    "\n",
    "```{tip}\n",
    "[Here is a template ipynb file](https://github.com/LeDataSciFi/ledatascifi-2024/blob/main/handouts/ML/ML_template.ipynb) you can use when putting together a project that follows the pseudocode below.\n",
    "```\n",
    "\n",
    "Let's put the principles from last chapter into code. Here is the pseudocode:\n",
    "\n",
    "1. All of your import functions\n",
    "2. Load data \n",
    "3. Split your data into 2 subsamples: a \"training\" portion and a \"holdout\" (aka \"test\") portion as [in this page](03_ML) or [this page](03c_ModelEval) or [this page](03c1_OOS). We will do all of our work on the \"train\" sample until the very last step. \n",
    "4. Before modeling, do EDA (**on the training data only!**)\n",
    "    - Sample basics: What is the unit of observation? What time spans are covered?\n",
    "    - Look for outliers, missing values, or data errors\n",
    "    - Note what variables are continuous or discrete numbers, which variables are categorical variables (and whether the categorical ordering is meaningful)     \n",
    "    - You should read up on what all the variables mean from the documentation in the data folder.\n",
    "    - Visually explore the relationship between your outcome variable and other variables.\n",
    "        - For continuous variables - take note of whether the relationship seems linear or quadratic or polynomial\n",
    "        - For categorical variables - maybe try a box plot for the various levels?\n",
    "    - Now decide how you'd clean the data (imputing missing values, scaling variables, encoding categorical variables). These lessons will go into the preprocessing portion of your pipeline below. The [sklearn guide on preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) is very informative, as is [this page **and the video I link to therein.**](04e1_preprocessing)\n",
    "    \n",
    "5. Prepare to optimize a series of models ([pipelines introduced here](04e_pipelines)) \n",
    "    1. Set up one pipeline to clean each type of variable\n",
    "    2. Combine those pipes into a \"preprocessing\" pipeline using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer)\n",
    "    1. [Set up your cross validation method](04d_crossval)\n",
    "    1. Set up your scoring [metric](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) as discussed [here](03d_whatToMax)\n",
    "5. [Optimize candidate model 1](04f_optimizing_a_model) _on the training data_\n",
    "    1. Set up [a pipeline](04e_pipelines) that combines preprocessing, estimator\n",
    "    1. Set up a hyper param grid\n",
    "    1. Find optimal hyper params (e.g. gridsearchcv)\n",
    "    1. Save pipeline with optimal params in place\n",
    "6. Repeat step 6 for other candidate models\n",
    "7. Compare all of the optimized models.  This step might look like: \n",
    "    ```python\n",
    "    models = [best_ridge, tuned_lasso, nn_reg, poly_reg, ...] \n",
    "    for model in models:\n",
    "        cross_validate(model, X, y,...)\n",
    "    ```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}