7.2. Data Leakage - Illustration


Data leakage (training or evaluating a model with information it would not have at prediction time) is one of the cardinal sins of ML.

  • Here, let me show you. I'll use a model to predict daily returns for Microsoft. After downloading the data, we can use this fancy model:
import pandas_datareader as pdr  # to install: !pip install pandas_datareader
import yfinance as yf
from datetime import datetime
from sklearn.metrics import r2_score
from statsmodels.tsa.arima.model import ARIMA
import numpy as np
from tqdm import trange
import warnings # suppress arima loop warning with these next 3 lines
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)

# load stock returns 
start = datetime(2004, 1, 1)
end = datetime(2007, 12, 31)
stocks = ['MSFT']
stock_prices         = yf.download(stocks, start, end, auto_adjust=False)  # auto_adjust=False keeps 'Adj Close' in newer yfinance versions
stock_prices.index   = stock_prices.index.tz_localize(None)      # change yf date format to match pdr
stock_prices         = stock_prices.filter(like='Adj Close')     # reduce to just columns with this in the name
if len(stocks) > 1: # this next line fails if len=1 bc yahoo gives back data with diff structure
    stock_prices.columns = stock_prices.columns.get_level_values(1)  # tickers as col names, works no matter order of tics
    stock_prices.columns = stocks 
stock_prices = stock_prices.stack().swaplevel().sort_index().reset_index()
stock_prices.columns = ['Firm','Date','Adj Close']
stock_prices['ret'] = stock_prices.groupby('Firm')['Adj Close'].pct_change()
stock_prices = stock_prices.iloc[1:,:]

# fit model and evaluate it to see the predictive power

series = stock_prices.set_index('Date')['ret']   # daily returns indexed by date
series.index = series.index.to_period('D')       # daily periods so ARIMA accepts the index
model = ARIMA(series, order=(4,0,1))
model_fit = model.fit()                              # fit on the full 2004-2007 sample
predictions = model_fit.predict(start='2006-01-03')  # predictions from 2006-01-03 through the end of the sample
  • It's done! Heck yes! Our patent pending* ARIMA(4,0,1) model predicts next day stock returns with an R2 of...
  • Let's start trading! I have my life's savings. We can use the heirlooms your Grandma Fama left you as collateral and lever up for extra earnings! She would understand. In fact, she would be PROUD to contribute to the cause!
  • Ok, let's see what the model says is going to happen tomorrow, and we'll buy or sell based on that...
  • Pan and zoom in on the computer screen, revealing how their model performed in the real world.
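history   = series[:-len(predictions)].to_numpy()  # everything before the test window, as a plain array
test_data = series[-len(predictions):].to_numpy()  # the days we pretended to predict
model_predictions = []
for time_point in trange(len(test_data)):
    model = ARIMA(history, order=(4,0,1)) # use model on past (same order as the model above)
    model_fit = model.fit()               # predict the future
    model_predictions.append(model_fit.forecast()[0])   # store prediction
    history = np.append(history, test_data[time_point]) # and reality
print(r2_score(test_data, model_predictions))  # out-of-sample R2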
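
The difference between the two code blocks above (fit once on the whole sample and predict dates inside it, versus refit each day using only the past) can be sketched on synthetic data so it runs offline. This is only a rough illustration under stated assumptions: it swaps in a least-squares AR(4) for ARIMA, the "returns" are random noise rather than MSFT data, and `fit_ar`, `predict_next`, and `r2` are hypothetical helpers written for this sketch, not statsmodels or sklearn functions.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares AR(p) with intercept; returns [const, lag1, ..., lagp]."""
    n = len(y)
    lags = [y[p - k:n - k] for k in range(1, p + 1)]   # column k holds y[t-k]
    X = np.column_stack([np.ones(n - p)] + lags)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def predict_next(history, coef):
    """One-step-ahead forecast from the last p values of history."""
    p = len(coef) - 1
    recent = history[-1:-p - 1:-1]                     # most recent value first
    return coef[0] + coef[1:] @ recent

def r2(actual, predicted):
    """R2 measured against the mean of the evaluated sample."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=400)   # synthetic, unpredictable "returns"
split, p = 300, 4                         # first 300 days are the past, last 100 the test
test_days = range(split, len(returns))
actual = returns[split:]

# Leaky: coefficients are estimated on ALL days, including the test days.
coef_full = fit_ar(returns, p)
leaky = np.array([predict_next(returns[:t], coef_full) for t in test_days])

# Honest: before each test day, refit using only the days that came before it.
honest = np.array([predict_next(returns[:t], fit_ar(returns[:t], p))
                   for t in test_days])

print(f"leaky  R2: {r2(actual, leaky):.4f}")
print(f"honest R2: {r2(actual, honest):.4f}")
```

Because this series is pure noise, neither number should be meaningfully above zero. Any edge the leaky version shows over the honest one comes from coefficients that were tuned on the very days being scored.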
