
sklearn essentials

This page is a quick-hit reference for sklearn and a living document. It's not meant as a walkthrough. (Clearly!)

Modifications, additions, and suggestions are truly welcomed!

A smattering of key sklearn functions and skills. Parameters we talked about in class are explicitly included below, but you should look at the documentation to see the parameters for all functions before you use them!

  • Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=<#>, train_size=<%>)
  • cross_validate(model, X, y, cv, scoring), covered extensively in class
  • Optimizing the parameters of a specific model: GridSearchCV or RandomizedSearchCV
  • Scoring functions (per the last lecture) and how to pass them to cross_validate
  • How to compare different models by looping over them with cross_validate (see the first sketch after this list)
  • Post-model diagnostics:
    • Predicting a classification: confusion_matrix and classification_report (see the example at the bottom of this page)
  • Fold generators: KFold, StratifiedKFold, GroupKFold, TimeSeriesSplit (see the second sketch after this list)
    • StratifiedKFold is probably better when class proportions are lopsided
    • Do you have multiple obs per "group" (a firm, person, etc.) and want to ensure no group is in both training and testing? GroupKFold
    • If you shuffle or otherwise introduce randomization, set random_state!
    • Is time an important dimension? E.g., are you predicting stock prices? TimeSeriesSplit
  • make_pipeline for chaining the steps in an estimation sequence
    • ColumnTransformer for simultaneously processing categorical, text, and continuous variables (see the third sketch after this list)
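
First sketch: a minimal example (on toy data, with placeholder models) of passing a scoring rule to cross_validate and looping over candidate models to compare them.

# a minimal sketch: compare models with the same folds and scoring rule
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores['test_score'].mean():.3f}")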
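
Second sketch: swapping fold generators in via the cv argument, continuing from the sketch above. The "groups" array here is hypothetical: one group id per row, e.g., a firm id.

from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

# shuffling introduces randomization, so set random_state!
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring="accuracy")

# GroupKFold keeps every obs for a given group on one side of each split;
# "groups" is a hypothetical array with one group id (e.g., firm id) per row
# cross_validate(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)

# TimeSeriesSplit always trains on the past and tests on the future
# cross_validate(model, X, y, cv=TimeSeriesSplit(n_splits=5))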
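
Third sketch: make_pipeline chains preprocessing and an estimator, ColumnTransformer routes different columns to different preprocessing, and GridSearchCV tunes the whole thing. The column names below are hypothetical placeholders; swap in your own.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

num_cols = ["age", "income"]  # hypothetical continuous variables
cat_cols = ["industry"]       # hypothetical categorical variable

preprocess = ColumnTransformer(
    [
        ("num", make_pipeline(SimpleImputer(), StandardScaler()), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

# make_pipeline names each step after its lowercased class, e.g. "svc" below
pipe = make_pipeline(preprocess, SVC())

# grid keys follow the "<stepname>__<parameter>" convention
param_grid = {"svc__C": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# then: grid.fit(your_df, y), and check grid.best_params_ and grid.best_score_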

Great resources

  1. The best bookmark you can set
  2. On folds
  3. PDSH on SVM and on Random Forests (note some module calls are obsolete, so you might need to update code)

If you find other resources you think other people might benefit from, please email me!

A rough pseudo-code for ML prediction

VERY ROUGH. Learning isn't a straight line or sequence of code. It's a cycle, a loop.

# imports
# load data

## optimize a series of models
# set up pipeline for model type #1; set up hyperparameter grid; find optimal hyperparameters
#     the pipeline includes preprocessing steps and an estimator
#     you should spend time interrogating model predictions, plotting and printing
#     does the model struggle to predict certain obs? excel at some?
# set up pipeline for model type #2; set up hyperparameter grid; find optimal hyperparameters
# ...
# set up pipeline for model type #N; set up hyperparameter grid; find optimal hyperparameters

## compare the N optimized models
# build list of models
# for model in models:
#    cross_validate(model, X, y,...)
# pick the winner!
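
Fleshed out just a little, that pseudo-code might look like the sketch below. It uses iris as a stand-in dataset, and the candidate models and grids are placeholders; in a real project you'd also hold out a test set rather than tuning and comparing on the exact same data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

## optimize a series of models: (pipeline, hyperparameter grid) per candidate
candidates = {
    "logit": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
              {"logisticregression__C": [0.1, 1, 10]}),
    "svm":   (make_pipeline(StandardScaler(), SVC()),
              {"svc__C": [0.1, 1, 10]}),
}

optimized = []
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    optimized.append((name, search.best_estimator_))

## compare the N optimized models
for name, model in optimized:
    scores = cross_validate(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores['test_score'].mean():.3f}")
# pick the winner!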

A bunch of import statements

Any code you write might use none of these, some of these, or all of them!

# dataset loader
from sklearn import datasets

# model training and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold # this is one way to generate folds
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

# metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# preprocessing and feature extraction
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer

# feature selection

# models
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# toy data
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
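
A quick end-to-end check using that toy data and the imports above: split the data, fit a classifier, and run the post-model diagnostics mentioned earlier. The model choice here is arbitrary.

# split, fit, predict, then inspect the predictions
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=9, train_size=0.8)

model = LogisticRegression(max_iter=1000)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print(accuracy_score(ytest, ypred))
print(confusion_matrix(ytest, ypred))
print(classification_report(ytest, ypred))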