5.4.7. Many (and Better) Models

At this point, we can

  1. Set up a pipeline that

    • preprocesses different variables differently

    • optionally includes a step in the middle to reduce the number of variables (too many variables can lead to overfitting)

    • ends in an estimator

  2. Set up a cross-validation method, and optimize the pipeline parameters via GridSearchCV (and save the best model to use on the test sample later)
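
Putting those two steps together might look something like the sketch below. The column names, the final estimator (Ridge), and the parameter grid are placeholders, not the "right" answers for the interest-rate problem; swap in your own choices.

```python
# A rough sketch of the setup above (illustrative names and parameters).
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["loan_amnt", "annual_inc"]   # hypothetical numeric columns
cat_cols = ["grade", "home_ownership"]   # hypothetical categorical columns

# 1. A pipeline that preprocesses different variables differently...
preproc = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# ...optionally reduces the number of variables, and ends in an estimator
pipe = Pipeline([
    ("preproc", preproc),
    ("select", SelectFromModel(LassoCV())),   # variable-reduction step
    ("est", Ridge()),
])

# 2. Cross-validation + GridSearchCV over the pipeline's parameters
cv = KFold(n_splits=5, shuffle=True, random_state=0)
params = {
    "est__alpha": [0.1, 1, 10],
    # preprocessing alternatives can be part of the search too:
    "preproc__num__scale": [StandardScaler(), "passthrough"],
}
grid = GridSearchCV(pipe, params, cv=cv, scoring="r2")
# grid.fit(X_train, y_train)            # fit on the training sample...
# best_model = grid.best_estimator_     # ...and save the winner for the test sample
```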

The best practices page outlined pseudocode that covers a full project. Following that code, what is left is to try many more models in pursuit of improving our predictions of the interest rate!

5.4.7.1. Better Models

Improving the performance of a model is contingent on the problem domain, data, and the models you’re considering, so generic advice is tough to offer with complete confidence. That said, these are usually good ideas:

  • Exploring your data endlessly

  • Preprocessing data inside a pipeline (to avoid data leakage!) that applies what you learned via EDA (imputation, scaling, and transformations)

  • Exploring preprocessing alternatives in your pipeline optimization

  • Feature engineering: Creating new variables via interactions (think: X3 = X1*X2; a small sketch follows after this list)

  • Feature selection/reduction: Which X variables to include. Too many variables can lead to overfitting.

    • Common options: SelectFromModel, LassoCV, RFECV (a short sketch follows after this list)

  • Gradient Boosting, discussed here and here, and ensemble + stacked predictors (a minimal sketch follows after this list)

    • xgboost and lightGBM are the go-to implementations, and HistGradientBoostingRegressor is the analogue in sklearn

    • If you use just sklearn for gradient boosting, look for the "Hist" in the function name! (Newer, much faster.)
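
As a small illustration of the interaction idea above, scikit-learn's PolynomialFeatures can generate products of columns for you (the X1/X2 names are generic placeholders):

```python
# Creating interaction terms (e.g., X3 = X1*X2).
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({"X1": [1.0, 2.0, 3.0], "X2": [4.0, 5.0, 6.0]})

# interaction_only=True adds X1*X2 but not X1**2 or X2**2
interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_new = interact.fit_transform(X)          # columns: X1, X2, X1*X2

# Or create a specific interaction by hand:
X["X1_times_X2"] = X["X1"] * X["X2"]
```

As a pipeline step, this slots in right after imputation so it runs inside the cross-validation loop.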
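For the feature selection bullet, any of the listed tools can be dropped into the pipeline between preprocessing and the estimator, so the selection is re-fit within each cross-validation fold. A rough sketch (the estimator choices are illustrative):

```python
# Two common feature-selection steps for a pipeline.
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression

# Option 1: keep variables whose Lasso coefficients are non-zero
lasso_select = SelectFromModel(LassoCV())

# Option 2: recursive feature elimination with cross-validation
rfe_select = RFECV(LinearRegression(), step=1, cv=5)

# Either one can replace the "select" step in the pipeline sketch above:
# pipe = Pipeline([("preproc", preproc), ("select", rfe_select), ("est", Ridge())])
```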
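Finally, a minimal gradient boosting sketch using sklearn's HistGradientBoostingRegressor (the parameter values are illustrative, not tuned recommendations):

```python
# Gradient boosting in sklearn: note the "Hist" in the name.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

boost = HistGradientBoostingRegressor(random_state=0)

params = {
    "learning_rate": [0.05, 0.1],
    "max_iter": [100, 300],       # number of boosting iterations
    "max_depth": [None, 6],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(boost, params, cv=cv, scoring="r2")
# grid.fit(X_train, y_train)   # expects numeric features, so put it after preprocessing
```

Like the other estimators, it sits at the end of the preprocessing pipeline, and xgboost or lightGBM models can be swapped into that slot the same way.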