7.2. Data Leakage - Illustration
Important
Data leakage is one of the cardinal sins of ML.
The following catastrophe is brought to you by Data Leakage (TM), the #1 enemy of machine learners everywhere, recommended by 0 out of 10 dentists, and also, our sponsor Daisy Cottage Cheese:
(I’m sorry for putting that jingle in your head, so so sorry.)
Note
The code below might need to be modified to work as of Feb 2023. The fix is here.
- INT. DINGY BASEMENT COMPUTER LAB
- We open as a light casts a glow over the face of a sweating programmer. A crushed (finished) Red Bull is the only thing visible on the desk, besides 8,000 pieces of scrap paper. In the corner of the frame, a portion of a whiteboard is visible - permanent smudges show that it has been heavily used.
- (Unnamed programmer)
- I have a plan. You and me, let's get rich! Who needs this grind?
- He fidgets. The Red Bull has him a little fritzy. He's talking to someone just off screen.
- (Unnamed programmer)
- Seriously! I'm super good at coding, you stake me, and I'll build a stock prediction algo with fancy ML tools. I know all the super cool and trendy words that "Wow!" investors, so we'll probably get backers too!
- His eyes search wantingly for any feedback from the unseen narrator. (TM-Christopher Nolan)
- (Unnamed programmer)
- Here, let me show you. I'll use a model to predict daily returns for Microsoft. After downloading the data, we can use this fancy model:
- He swivels his chair towards the computer.
- "Dramatic revelation" music plays as the camera pans and zooms onto the computer screen.
import pandas_datareader as pdr # to install: !pip install pandas_datareader
import yfinance as yf
from datetime import datetime
from sklearn.metrics import r2_score
from statsmodels.tsa.arima.model import ARIMA
import numpy as np
from tqdm import trange
import warnings # suppress arima loop warning with these next 3 lines
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
# load stock returns
start = datetime(2004, 1, 1)
end = datetime(2007, 12, 31)
stocks = ['MSFT']
stock_prices = yf.download(stocks, start , end)
stock_prices.index = stock_prices.index.tz_localize(None) # change yf date format to match pdr
stock_prices = stock_prices.filter(like='Adj Close') # reduce to just columns with this in the name
if len(stocks) > 1: # this next line fails if len=1 bc yahoo gives back data with diff structure
    stock_prices.columns = stock_prices.columns.get_level_values(1) # tickers as col names, works no matter order of tics
else:
    stock_prices.columns = stocks
stock_prices = stock_prices.stack().swaplevel().sort_index().reset_index()
stock_prices.columns = ['Firm','Date','Adj Close']
stock_prices['ret'] = stock_prices.groupby('Firm')['Adj Close'].pct_change()
stock_prices = stock_prices.iloc[1:,:]
# fit model and evaluate it to see the predictive power
series = stock_prices.ret
series.index = stock_prices.Date#.to_period('D')
series.index = series.index.to_period('D')
model = ARIMA(series, order=(4,0,1))
model_fit = model.fit()
predictions = model_fit.predict(start='2006-01-03')
- (Unnamed programmer)
- It's done! Heck yes! Our patent pending* ARIMA(4,0,1) model predicts next day stock returns with an R2 of...
- Cut to the screen
print(f'R2={r2_score(series[-len(predictions):], predictions):.3f}!')
R2=0.013!
- He gets frenetically excited.
- (Unnamed programmer)
- Let's start trading! I have my life's savings. We can use the heirlooms your Grandma Fama left you as collateral and lever up for extra earnings! She would understand. In fact, she would be PROUD to contribute to the cause!
- Screen dialogue box: Jan 3, 2006, 5pm.
- (Unnamed programmer)
- Ok, let's see what the model says is going to happen tomorrow, and we'll buy or sell based on that...
- Screen dialogue box: Jan 4, 2006, 5pm.
- (Unnamed programmer)
- Ok, let's see what the model says is going to happen tomorrow, and we'll buy or sell based on that...
- Montage begins, Rolling Stones plays over it.
- Montage basically follows Wolf of Wall Street.
- Screen dialogue box: Dec 31, 2007, 5pm.
- Our programmer sits at his computer. He hasn't slept in days. He is disheveled and has coffee stains on his shirt. Unsigned divorce papers are on the desk. He looks at the screen.
- (Unnamed programmer)
- Crap.
- Pan and zoom in on the computer screen, revealing how their model performed in the real world.
history = series[:-len(predictions)].to_numpy() # everything known before the first trading day
test_data = series[-len(predictions):]
model_predictions = []
for time_point in range(len(test_data)):
    model = ARIMA(history, order=(4,1,0)) # set up the model using only the past
    model_fit = model.fit() # estimate it on that history
    model_predictions.append(model_fit.forecast()[0]) # store the one-step-ahead prediction
    history = np.append(history, test_data.iloc[time_point]) # then reveal what actually happened
print(f'R2={r2_score(test_data, model_predictions):.3f}')
R2=-0.196
- His eyes widen in horror as he begins to comprehend the magnitude of the disaster. He gulps and then turns to his unseen companion.
- (Unnamed programmer)
- I'm sorry. I hope your grandma doesn't know about this, wherever she is.
- Smash cut to a graveyard, at night. It's dark. The tombstone says "Eugenia Fama". Suddenly a skeleton of a hand bursts out of the ground!
- QUICK FADE TO BLACK
- IMMIGRANT SONG BY LED ZEPPELIN PLAYS OVER CREDITS
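For anyone who wants to replay the disaster without risking Grandma's heirlooms, here is a minimal sketch of the same mistake on made-up data. It is not the ARIMA model from the scenes above: it assumes pure-noise "returns" generated with NumPy, and it swaps in a plain least-squares autoregression as a stand-in. The helper names (`lagged_design`, `r2`), the 10 lags, and the 200-day test window are all hypothetical choices for illustration. The "leaky" fit estimates its coefficients on every row, including the test window it is later graded on; the "honest" fit sees only the rows before that window.

```python
import numpy as np

def lagged_design(returns, n_lags):
    """Build an AR(n_lags) regression: an intercept plus the previous n_lags returns."""
    rows = [returns[t - n_lags:t][::-1] for t in range(n_lags, len(returns))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = returns[n_lags:]
    return X, y

def r2(y_true, y_pred):
    """Plain R^2 = 1 - SSE/SST; it can go negative out of sample."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst

# ASSUMPTION: synthetic returns that are pure noise, so there is nothing real to predict
rng = np.random.default_rng(0)
fake_returns = rng.normal(0.0, 0.01, size=500)

X, y = lagged_design(fake_returns, n_lags=10)
n_test = 200  # hypothetical test window: the last 200 days
X_train, y_train = X[:-n_test], y[:-n_test]
X_test, y_test = X[-n_test:], y[-n_test:]

# Leaky: coefficients estimated on ALL rows, including the test window
beta_leaky, *_ = np.linalg.lstsq(X, y, rcond=None)
# Honest: coefficients estimated only on rows before the test window
beta_honest, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_leaky = r2(y_test, X_test @ beta_leaky)
r2_honest = r2(y_test, X_test @ beta_honest)
print(f"leaky  R2 = {r2_leaky:.3f}")
print(f"honest R2 = {r2_honest:.3f}")
```

Because the leaky coefficients were chosen to minimize squared error over a sample that contains the test window, their test-window error can never be worse than that of coefficients fit on the earlier rows alone, even when the series is noise. The walk-forward loop in the final scene is the stricter version of the honest fit: it re-estimates before each prediction, but still only on the past.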