{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Principles into Practice\n",
"\n",
"```{tip}\n",
"[Here is a template ipynb file](https://github.com/LeDataSciFi/ledatascifi-2023/blob/main/handouts/ML/ML_template.ipynb) you can use when putting together a project that follows the pseudocode below.\n",
"```\n",
"\n",
"Let's put the principles from the last chapter into code. Here is the pseudocode:\n",
"\n",
"1. All of your `import` statements\n",
"2. Load data \n",
"3. Split your data into 2 subsamples: a \"training\" portion and a \"holdout\" (aka \"test\") portion as [in this page](03_ML) or [this page](03c_ModelEval) or [this page](03c1_OOS). We will do all of our work on the \"train\" sample until the very last step. \n",
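"\n",
"    A minimal sketch of this split (`X`, `y`, the 80/20 split, and the `random_state` value here are illustrative):\n",
"\n",
"    ```python\n",
"    import numpy as np\n",
"    from sklearn.model_selection import train_test_split\n",
"\n",
"    X = np.arange(20).reshape(10, 2)  # toy feature matrix: 10 observations, 2 features\n",
"    y = np.arange(10)                 # toy outcome\n",
"\n",
"    # hold out 20% of the observations; random_state makes the split reproducible\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y, test_size=0.2, random_state=0\n",
"    )\n",
"    ```\n",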
"4. Before modeling, do EDA (**on the training data only!**)\n",
" - Sample basics: What is the unit of observation? What time spans are covered?\n",
" - Look for outliers, missing values, or data errors\n",
"    - Note which variables are continuous or discrete numbers and which are categorical (and whether the ordering of the categories is meaningful)\n",
" - You should read up on what all the variables mean from the documentation in the data folder.\n",
" - Visually explore the relationship between your outcome variable and other variables.\n",
" - For continuous variables - take note of whether the relationship seems linear or quadratic or polynomial\n",
" - For categorical variables - maybe try a box plot for the various levels?\n",
" - Now decide how you'd clean the data (imputing missing values, scaling variables, encoding categorical variables). These lessons will go into the preprocessing portion of your pipeline below. The [sklearn guide on preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) is very informative, as is [this page **and the video I link to therein.**](04e1_preprocessing)\n",
" \n",
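"\n",
"    A few `pandas` one-liners cover much of this checklist (a sketch; `train` and its columns are toy stand-ins for your training data):\n",
"\n",
"    ```python\n",
"    import pandas as pd\n",
"\n",
"    train = pd.DataFrame({'price': [10.0, 12.0, None],\n",
"                          'rooms': [2, 3, 2],\n",
"                          'city': ['A', 'B', 'A']})\n",
"\n",
"    train.describe()               # distributions and outlier hints for numeric columns\n",
"    missing = train.isna().sum()   # missing values per column\n",
"    train.dtypes                   # which columns are numeric vs. categorical\n",
"    train['city'].value_counts()   # levels of a categorical variable\n",
"    ```\n",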
"5. Prepare to optimize a series of models ([pipelines introduced here](04e_pipelines)) \n",
" 1. Set up one pipeline to clean each type of variable\n",
" 2. Combine those pipes into a \"preprocessing\" pipeline using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer)\n",
"    3. [Set up your cross validation method](04d_crossval)\n",
"    4. Set up your scoring [metric](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) as discussed [here](03d_whatToMax)\n",
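"\n",
"    A sketch of this setup (the column names, imputation strategies, 5 folds, and `r2` are illustrative choices):\n",
"\n",
"    ```python\n",
"    import pandas as pd\n",
"    from sklearn.compose import ColumnTransformer\n",
"    from sklearn.impute import SimpleImputer\n",
"    from sklearn.pipeline import Pipeline\n",
"    from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"\n",
"    # one pipe per variable type...\n",
"    numeric_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),\n",
"                             ('scale', StandardScaler())])\n",
"    cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),\n",
"                         ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n",
"\n",
"    # ...combined into one preprocessing step\n",
"    preproc = ColumnTransformer([('num', numeric_pipe, ['rooms']),\n",
"                                 ('cat', cat_pipe, ['city'])])\n",
"\n",
"    cv = 5           # or a KFold/StratifiedKFold object for more control\n",
"    scoring = 'r2'   # pick the metric that matches your problem\n",
"\n",
"    # quick sanity check on toy data: 1 scaled numeric column + 2 one-hot columns\n",
"    toy = pd.DataFrame({'rooms': [2, 3, None, 4], 'city': ['A', 'B', 'A', None]})\n",
"    Xt = preproc.fit_transform(toy)\n",
"    ```\n",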
"6. [Optimize candidate model 1](04f_optimizing_a_model) _on the training data_\n",
"    1. Set up [a pipeline](04e_pipelines) that combines the preprocessing and the estimator\n",
"    2. Set up a hyperparameter grid\n",
"    3. Find the optimal hyperparameters (e.g., with `GridSearchCV`)\n",
"    4. Save the pipeline with the optimal parameters in place\n",
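"\n",
"    For example, with ridge regression as the estimator (the toy data, the scaling step, and the alpha grid are illustrative; in the real project the first pipeline step would be the preprocessing `ColumnTransformer` from step 5 and the data would come from the split in step 3):\n",
"\n",
"    ```python\n",
"    import numpy as np\n",
"    from sklearn.linear_model import Ridge\n",
"    from sklearn.model_selection import GridSearchCV\n",
"    from sklearn.pipeline import Pipeline\n",
"    from sklearn.preprocessing import StandardScaler\n",
"\n",
"    # toy training data so the sketch runs standalone\n",
"    rng = np.random.default_rng(0)\n",
"    X_train = rng.normal(size=(50, 3))\n",
"    y_train = X_train @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)\n",
"\n",
"    pipe = Pipeline([('scale', StandardScaler()),   # stand-in for your preprocessing pipe\n",
"                     ('ridge', Ridge())])\n",
"    param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0]}  # <step name>__<param name>\n",
"\n",
"    grid = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')\n",
"    grid.fit(X_train, y_train)             # the search only ever sees training data\n",
"    best_ridge = grid.best_estimator_      # the tuned pipeline, saved for the comparison step\n",
"    ```\n",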
"7. Repeat step 6 for other candidate models\n",
"8. Compare all of the optimized models. This step might look like:\n",
" ```python\n",
"    from sklearn.model_selection import cross_validate\n",
"\n",
"    models = [best_ridge, tuned_lasso, nn_reg, poly_reg, ...]\n",
"    for model in models:\n",
"        # still training data only, with the cv and scoring set up earlier\n",
"        cross_validate(model, X_train, y_train, cv=cv, scoring=scoring)\n",
" ```\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}