5.3.6. The Cardinal Sin of ML: Data Leakage

5.3.6.1. Illustration

The following catastrophe is brought to you by Data Leakage (TM), the #1 enemy of machine learners everywhere, recommended by 0 out of 10 dentists, and also, our sponsor Daisy Cottage Cheese:

(I’m sorry for putting that jingle in your head, so so sorry.)


  • INT. DINGY BASEMENT COMPUTER LAB
  • We open as a light casts a glow over the face of a sweating programmer. A crushed (finished) Red Bull can is the only thing visible on the desk, besides 8,000 pieces of scrap paper. In the corner of the frame, a portion of a whiteboard is visible - permanent smudges show that it has been heavily used.
  • (Unnamed programmer)
  • I have a plan. You and me, let's get rich! Who needs this grind?
  • He fidgets. The Red Bull has him a little fritzy.
  • (Unnamed programmer)
  • Seriously! I'm super good at coding, you stake me, and I'll build a stock prediction algo with fancy ML tools. I know all the super cool and trendy words that "Wow!" investors, so we'll probably get backers too!
  • His eyes search expectantly for any feedback from the unseen narrator. (TM-Christopher Nolan)
  • (Unnamed programmer)
  • Here, let me show you. I'll use a model to predict daily returns for Microsoft. After downloading the data, we can use this fancy model:
  • He swivels his chair towards the computer.
  • "Dramatic revelation" music, camera pans and zooms onto the computer screen
import pandas_datareader as pdr  # to install: !pip install pandas_datareader
from datetime import datetime
from sklearn.metrics import r2_score
from statsmodels.tsa.arima.model import ARIMA
import numpy as np
from tqdm import trange
import warnings # suppress warnings thrown inside the arima loop with these next 4 lines
from statsmodels.tools.sm_exceptions import ConvergenceWarning, ValueWarning
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.simplefilter('ignore', ValueWarning)

# load stock returns 
start = datetime(2004, 1, 1)
end = datetime(2007, 12, 31)
stock_prices = pdr.get_data_yahoo(['MSFT'], start=start, end=end) # Yahoo changes its API over time and can break this reader; the yfinance package is a common fallback
stock_prices = stock_prices.filter(like='Adj Close') # reduce to just columns with this in the name
stock_prices.columns = ['MSFT'] # put their tickers as column names
stock_prices = stock_prices.stack().swaplevel().sort_index().reset_index()
stock_prices.columns = ['Firm','Date','Adj Close']
stock_prices['ret'] = stock_prices.groupby('Firm')['Adj Close'].pct_change()
stock_prices = stock_prices.iloc[1:,:]

# fit the model and evaluate its apparent predictive power

series = stock_prices.ret
series.index = stock_prices.Date
series.index = series.index.to_period('D') # ARIMA wants a regular-frequency time index
model = ARIMA(series, order=(4,0,1))
model_fit = model.fit()
predictions = model_fit.predict(start='2006-01-03') # "predictions" for dates the model was fit on
print('Heck yes! Our patent pending* ARIMA(4,0,1) model predicts')
print(f'next day stock returns with an R2 of...\n\n        R2={r2_score(series[-len(predictions):],predictions).round(3)}!')
print('\nCan you FEEL the excitement!?')
print("\n\nLet's build this model and start trading our life's savings.\nWe can use your Grandma's heirlooms as collateral and lever up for extra earnings!")
print("She would understand. In fact, she would be PROUD to contribute to the **cause**!")
print('\n...The year is 2006, the day is Jan 3...\n\n...model loading...')
print('\n...model loading...')
print("\n...model ready! Let's start trading!...")
print('\n...Predicting Jan 4, buy/sell based on the prediction...')
print('\n...Now it is Jan 4, predicting Jan 5, buy/sell based on the prediction...')
print('\n...[A montage rolls, 2 years of Wolf of Wall Street insanity, as money flows from the coffers]...')
print("\n...We wake up on Dec 31 2007. Crazy two years. Let's see how we did!...")
print('\n===========================================================================')

history   = series[:-len(predictions)].to_list()
test_data = series[-len(predictions):]
model_predictions = []
for time_point in range(len(test_data)):
    model = ARIMA(history, order=(4,0,1)) # refit the SAME model, using only past data,
    model_fit = model.fit()               # to predict the future
    model_predictions.append(model_fit.forecast()[0])  # store the prediction
    history.append(test_data.iloc[time_point])         # and then reveal reality
    
print('\nOops... Our model, used in the real world, had an R2 of...')
print("\nThis can't be right!!!!")
print(f'\n        R2={r2_score(test_data,model_predictions).round(3)}')
print("\nGood thing Grandma won't know about this...")
Heck yes! Our patent pending* ARIMA(4,0,1) model predicts
next day stock returns with an R2 of...

        R2=0.013!

Can you FEEL the excitement!?


Let's build this model and start trading our life's savings.
We can use your Grandma's heirlooms as collateral and lever up for extra earnings!
She would understand. In fact, she would be PROUD to contribute to the **cause**!

...The year is 2006, the day is Jan 3...

...model loading...

...model loading...

...model ready! Let's start trading!...

...Predicting Jan 4, buy/sell based on the prediction...

...Now it is Jan 4, predicting Jan 5, buy/sell based on the prediction...

...[A montage rolls, 2 years of Wolf of Wall Street insanity, as money flows from the coffers]...

...We wake up on Dec 31 2007. Crazy two years. Let's see how we did!...

===========================================================================

Oops... Our model, used in the real world, had an R2 of...

This can't be right!!!!

        R2=-0.196

Good thing Grandma won't know about this...
  • QUICK FADE TO BLACK
  • IMMIGRANT SONG BY LED ZEPPELIN PLAYS OVER CREDITS

5.3.6.2. Definition

Data leakage

is when information that would not be available at prediction time is used when building the model. This results in overly optimistic performance estimates (for example, from cross-validation) and thus poorer performance when the model is used on genuinely novel data (for example, in production).

Lessons:

  1. Keep the test and train data subsets separate

  2. Never call fit on the test data

  3. Data cleaning and transformation steps applied to the training data should not be learned from the test data

The example above falls prey to data leakage because it ignores these lessons: the model is evaluated on the very same data it was fit on, so the impressive "predictions" are really just in-sample fit.
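To make Lesson 3 concrete, here is a minimal sketch (using sklearn and synthetic data, not the stock example above) of the wrong and right way to apply a preprocessing step like scaling:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# purely illustrative synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# WRONG: the scaler "learns" the mean/std of the test rows too (leakage)
scaler_bad = StandardScaler().fit(X)          # fit on ALL the data
X_train_bad = scaler_bad.transform(X_train)

# RIGHT: learn the transformation from the training data only...
scaler_ok = StandardScaler().fit(X_train)     # fit on train only
X_train_ok = scaler_ok.transform(X_train)
X_test_ok  = scaler_ok.transform(X_test)      # ...then apply it to test (never fit on it)
```

sklearn's `Pipeline` object automates exactly this fit-on-train-only discipline, which is why the next section leans on it.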

The next section of the book will explain how exactly to avoid all of these problems with code, but for now, let’s just state the following warning:

Warning

The absolute golden rule of prediction modeling is…

YOUR MODEL CAN’T HAVE ACCESS TO ANY DATA THAT IT WOULDN’T HAVE IN PRACTICE WHEN IT MAKES THE PREDICTION.

I know I already said that, and repetition is usually bad writing, but it must be said again. And again.

That said, knowing that data leakage is bad doesn’t mean it’s easy to avoid.

Data leakage can sneak into your analysis in tricky ways:

  • The outcome variable is a predictor (implicitly or explicitly)

  • Predictor variables that are recorded in response to the outcome (after the fact) or in anticipation of it

  • Predicting loan default, the data might include employee IDs for recent customer service contacts. But the most recent contact might be with trouble-loan specialists (because the firm anticipated possible default due to some other signal). Using that employee’s customer contacts to predict default would add no value - the lender already knew to assign that employee!

  • The smell test: Is it too good to be true? I’ve seen some asset pricing models with suspicious out-of-sample R2s. Predicting stock prices is hard! The best OOS predictive R2 for individual stocks in this paper is 1.80% per month.
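The loan-default bullet above can be sketched in code. Everything here is hypothetical: a made-up `specialist_contact` flag is set (with some noise) precisely because the lender already suspects default, so in evaluation it "predicts" default far better than a legitimate feature would, while adding no real forward-looking value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
default = rng.binomial(1, 0.2, size=n)   # the outcome we want to predict
income  = rng.normal(50, 10, size=n)     # a legitimate (here, uninformative) predictor

# hypothetical leaked feature: the firm routes likely defaulters to specialists,
# so the flag agrees with the outcome ~90% of the time - it's nearly a relabeling of y
specialist_contact = np.where(rng.random(n) < 0.9, default, 1 - default)

X_legit = income.reshape(-1, 1)
X_leaky = np.column_stack([income, specialist_contact])

clf = LogisticRegression(max_iter=1000)
acc_legit = cross_val_score(clf, X_legit, default).mean()
acc_leaky = cross_val_score(clf, X_leaky, default).mean()
print(f"accuracy without leaked flag: {acc_legit:.2f}, with leaked flag: {acc_leaky:.2f}")
```

The leaky model's cross-validated accuracy will sit near the 90% flag-outcome agreement rate, well above the legitimate model's, even though the flag tells the lender nothing it didn't already know.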