8. Getting our hands dirty with ML

The objective of this chapter is to put the principles from the last chapter into practice. We will use sklearn, the go-to package for machine learning in Python. Go ahead and bookmark its user guide right now; you’ll be visiting it a lot.

By the end of this portion of the class (these pages plus the lectures), you should be able to:

  1. Build a pipeline that

    1. Preprocesses realistic data (i.e., data with multiple variable types), handling each variable type appropriately

    2. Estimates a model’s performance using cross-validation

    3. Tunes the model’s hyperparameters to improve its performance

    4. Finally, evaluates its performance on a test sample

  2. Use that pipeline within the best practice workflow to optimize several models and pick your preferred one

  3. Discuss key issues relating to

    1. The value of preprocessing

    2. The value of feature engineering

    3. How your hold-out split and folding method can cause unrealistic performance estimates
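
To make the first objective concrete, here is a minimal sketch of what such a pipeline might look like in sklearn. Everything specific in it is an assumption made for illustration and not something this chapter prescribes: the synthetic data, the column names (`income`, `age`, `region`), the choice of `Ridge` as the model, and the grid of `alpha` values. The structure is the point: hold out a test sample first, preprocess each variable type inside the pipeline, estimate performance with cross-validation, tune, and only then touch the test sample.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import (GridSearchCV, KFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric columns (one with missing values)
# and one categorical column. None of this comes from the chapter.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "income": rng.normal(50, 10, n),
    "age": rng.integers(18, 80, n).astype(float),
    "region": rng.choice(["north", "south", "west"], n),
})
y = (0.5 * df["income"] + 0.1 * df["age"]
     + 3 * (df["region"] == "west") + rng.normal(0, 1, n))
df.loc[rng.choice(n, 20, replace=False), "income"] = np.nan

# Hold out a test sample before doing anything else.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0)

# 1. Preprocess each variable type with its own steps.
numeric_cols = ["income", "age"]
categorical_cols = ["region"]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipe = Pipeline([("preprocess", preprocess), ("model", Ridge())])

# 2. Estimate performance with cross-validation on the training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="r2")

# 3. Tune a hyperparameter (an assumed grid for Ridge's alpha).
grid = GridSearchCV(pipe, {"model__alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=cv, scoring="r2")
grid.fit(X_train, y_train)

# 4. Only at the very end, evaluate the tuned pipeline on the test sample.
test_r2 = grid.score(X_test, y_test)
print("CV R2 per fold:", cv_scores.round(3))
print("Best alpha:", grid.best_params_["model__alpha"])
print("Test R2:", round(test_r2, 3))
```

Because the imputer, scaler, and encoder live inside the pipeline, they are refit on the training folds in every cross-validation split, so information from the validation fold does not leak into preprocessing. That is one reason to prefer this design over preprocessing the full dataset up front.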