7.4. Data Leakage Defined

Data leakage is when information that would not be available at prediction time is used when building the model. This results in overly optimistic performance estimates (for example, from cross-validation) and thus poorer performance when the model is used on genuinely novel data (for example, in production).
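To see how leakage inflates a cross-validated score, here is a minimal sketch (my own illustration, not code from the book): the features are pure noise, so honest accuracy should be a coin flip, yet selecting "good" features on the full data before cross-validating produces a score well above chance.

```python
# Demonstration (assumed setup, standard scikit-learn calls): leakage
# from selecting features on ALL the data before cross-validating.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))     # pure noise: nothing here predicts y
y = rng.integers(0, 2, size=100)     # random 0/1 outcome

# LEAKY: the selector peeks at every row (including future test folds)
# before the cross-validation ever runs.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
print(cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean())
# well above 0.5 -- wildly optimistic for pure noise

# CORRECT: put selection inside a pipeline so each CV fold refits it
# on that fold's training data only.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())
# roughly 0.5 -- honest: a coin flip, as it should be
```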

Lessons:

  1. Keep the test and train data subsets separate

  2. Never call fit on the test data

  3. Data cleaning and transformation steps should be learned from the training data only, never from the test data (see the sketch after this list)

  4. Have a baseline so you can use the smell test (see below)
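Here is what lessons 1 through 3 look like in practice. This is a minimal sketch using standard scikit-learn tools (the toy data is mine, not the book's):

```python
# Lessons 1-3 in code: split first, fit on train only, and let a
# Pipeline keep preprocessing from ever seeing the test data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # toy features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # toy labels

# Lesson 1: split first, then keep the subsets separate.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LEAKY version (don't do this): fitting the scaler on ALL the data
# lets test-set information (its mean and variance) leak into training.
# scaler = StandardScaler().fit(X)

# Lessons 2-3: every step is fit on the training data only. Bundling the
# scaler and the model in a Pipeline guarantees .fit only sees X_train.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)          # fit on train...
print(model.score(X_test, y_test))   # ...score (never fit!) on test
```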

The example on the last page falls prey to data leakage because the training data was the testing data. The model “had the answers to the test” already!

The next pages of the book discuss the main (but not the only) method for avoiding data leakage, and the following chapter shows how to implement it in code. For now, let’s just state the following warning:

Warning

The absolute golden rule of prediction modeling is…

YOUR MODEL CAN’T HAVE ACCESS TO ANY DATA THAT IT WOULDN’T HAVE IN PRACTICE WHEN IT MAKES THE PREDICTION.

I know I already said that, and repetition is usually bad writing, but it must be said again. And again.

That said, knowing that data leakage is bad doesn’t mean it’s easy to avoid.

Data leakage can sneak into your analysis in tricky ways:

  • The outcome variable is a predictor (implicitly or explicitly)

  • Predictor variables created in response to the outcome, either after the fact or in anticipation of it (a rough diagnostic sketch follows this list)

    Example: When predicting loan default, the data might include employee IDs for recent customer service contacts. But the most recent contact might be with a trouble-loan specialist, because the firm already anticipated possible default from some other signal. Using that employee’s customer contacts to predict default adds no real value: the lender already knew to assign that employee!
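There is no foolproof test for these subtle leaks, but one crude heuristic (my own sketch, not a method from the book) is to check whether any single column predicts the outcome suspiciously well on its own. The hypothetical helper `leakage_suspects` below does exactly that:

```python
# A crude leakage "screen" (an illustrative heuristic only): if one lone
# feature nearly perfectly predicts the outcome, treat it as a suspect.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def leakage_suspects(X, y, threshold=0.95):
    """Return (column, score) pairs whose one-feature CV accuracy beats threshold."""
    suspects = []
    for j in range(X.shape[1]):
        score = cross_val_score(LogisticRegression(), X[:, [j]], y, cv=5).mean()
        if score > threshold:
            suspects.append((j, score))
    return suspects

# Toy demo: column 0 is a thinly disguised copy of the outcome.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 4))
X[:, 0] = y + rng.normal(scale=0.01, size=300)  # leaky column
print(leakage_suspects(X, y))                   # flags column 0
```

A flagged column isn’t proof of leakage, but it is a prompt to ask whether that variable would truly be available at prediction time.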

Establish a baseline + Use the smell test!

Is it too good to be true? One way to tell is to establish a baseline: find best-in-class performance metrics for your problem. If your results are dramatically better than best-in-class, data leakage is a likely culprit.

For example: Predicting stock prices is hard! The best predictive monthly R2 for individual stocks in this paper (open access here) is just 1.80%. That is, an R2 of 0.018 is WORLD CLASS.

Knowing that, if someone tells you they can predict asset returns with an R2 of 2%+, you should be very skeptical.

If you can’t find a baseline from prior work, create a “dummy” model to beat. For example: Predict that returns are always 0, or always 8% a year. What’s the R2 on that?
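Here is one way to build that dummy baseline, sketched with scikit-learn’s DummyRegressor (the constant predictions are the book’s examples; the toy return data is my assumption). Recall that R2 = 1 − SS_residual/SS_total, so any constant prediction other than the sample mean scores at or below zero; a real model has to beat that to be worth anything.

```python
# A "dummy" baseline to beat: constant predictions, scored with R2.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                       # toy features
y = 0.08 / 12 + rng.normal(scale=0.05, size=500)    # toy monthly returns

# Baseline 1: predict that returns are always 0.
zero_model = DummyRegressor(strategy="constant", constant=0.0).fit(X, y)
# Baseline 2: predict always 8% a year (about 0.67% a month).
flat_model = DummyRegressor(strategy="constant", constant=0.08 / 12).fit(X, y)

print(r2_score(y, zero_model.predict(X)))  # R2 of "always 0" (negative)
print(r2_score(y, flat_model.predict(X)))  # R2 of "always 8%/yr" (near 0)
```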