8.7. Many (and Better) Models
At this point, we can:

- Set up a pipeline that
  - preprocesses different variables differently,
  - (we could add an estimator in the middle to reduce the number of variables, since too many variables can lead to overfitting), and
  - ends in an estimator.
- Set up a cross-validation method and optimize the pipeline parameters via `GridSearchCV` (and save the best model to use on the test sample later).
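Putting those pieces together, here is a minimal sketch of that whole setup. The dataset, column names, and grid values below are invented for illustration (they are not the loan data), so treat this as a template rather than the exact model:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical stand-in data; swap in your own DataFrame and outcome
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "loan_amount": rng.normal(15000, 5000, 200),
    "income": rng.normal(60000, 20000, 200),
    "grade": rng.choice(list("ABCDE"), 200),
})
y = rng.normal(0.1, 0.02, 200)  # stand-in for the interest rate

num_cols = ["loan_amount", "income"]
cat_cols = ["grade"]

# preprocess different variables differently
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# the pipeline ends in an estimator
pipe = Pipeline([("prep", preprocess), ("reg", Ridge())])

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)

# cross-validated search over the pipeline's parameters
grid = GridSearchCV(pipe, {"reg__alpha": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_  # hold on to this for the test sample later
```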
The best practices page outlined pseudo-code that covers a full project. Following that outline, what is left is for us to try many more models in pursuit of improving our predictions of the interest rate!
8.7.1. Better Models
Improving the performance of a model is contingent on the problem domain, data, and the models you’re considering, so generic advice is tough to offer with complete confidence. That said, these are usually good ideas:
- Exploring your data endlessly
- Preprocessing data in a pipeline (data leakage is bad!) that utilizes what you learned via EDA (imputation, scaling, and transformations)
- Exploring preprocessing alternatives in your pipeline optimization (first sketch after this list)
- Feature engineering: creating new variables via interactions (think: X3 = X1*X2; second sketch below)
- Feature selection/reduction: Which X variables to include? Too many variables will lead to overfitting. Common options: `SelectFromModel`, `LassoCV`, and `RFECV` (third sketch below)
- Gradient Boosting, discussed here and here, and ensemble + stacked predictors. `xgboost` and `lightGBM` are the go-to implementations, and `HistGradientBoostingRegressor` is the analogue in sklearn. If you are using sklearn for gradient boosting, use the class with "Hist" in the name! (Newer, much faster; last sketch below.)
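To explore preprocessing alternatives inside the optimization itself, you can put whole transformers in the parameter grid and let `GridSearchCV` try each. A hedged sketch that reuses the `pipe`, `X_train`, and `y_train` objects from the setup sketch above (the step names `prep`, `num`, `impute`, `scale`, and `reg` are assumptions that must match your own pipeline):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler

param_grid = {
    # try different imputation strategies for the numeric columns
    "prep__num__impute__strategy": ["mean", "median"],
    # try different scalers, or skip scaling entirely
    "prep__num__scale": [StandardScaler(), RobustScaler(), "passthrough"],
    "reg__alpha": [0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)  # which preprocessing + alpha combination won
```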
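For interactions, a small sketch with made-up numbers: sklearn's `PolynomialFeatures` with `interaction_only=True` generates every pairwise product, including the X3 = X1*X2 example above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
interact = PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False)
print(interact.fit_transform(X))
# [[ 2.  3.  6.]   -> X1, X2, and the new X3 = X1*X2
#  [ 4.  5. 20.]]
```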
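For feature selection/reduction, one common pattern (a sketch on synthetic data, not the loan data) is to make the selector the "estimator in the middle" mentioned at the top of this page:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data: 20 features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(LassoCV())),  # the estimator in the middle;
                                             # RFECV(Ridge(), cv=5) is a
                                             # drop-in alternative here
    ("reg", Ridge()),                        # the final estimator
])
pipe.fit(X, y)
print(pipe.named_steps["select"].get_support().sum(), "of 20 features kept")
```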
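Finally, a minimal sketch of the "Hist" gradient boosting class in sklearn, run on synthetic data for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# handles missing values natively, so the imputer can often be dropped
gbm = HistGradientBoostingRegressor(max_iter=200, learning_rate=0.1,
                                    random_state=0)
print(cross_val_score(gbm, X, y, cv=5).mean())  # mean cross-validated R^2
```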