Detecting data leakage in ML pipelines using NANs and complex numbers


This post was originally published by Abhay Pawar at Towards Data Science

A simple and precise way to detect data leakage

Data leakage in machine learning pipelines can wreak havoc on your model. In this post, I’m going to share an amazingly simple way to detect data leakage using NANs and complex numbers while treating your ML pipeline as a black box. I’ll talk very briefly about what data leakage is. I’ll also talk about leak-detect, a Python package I’m releasing to do all this in one line of code.

A quick intro to data leakage

The most precise way to describe data leakage could be this:

Data leakage in an ML model occurs when data used to create predictor variables during training time is unavailable at the time of inference.

Clearly, training on data (features) that is unavailable at inference time leads to the model underperforming in production. This underperformance could mean millions of lost dollars, depending on the scale of your company!

An example of leakage

What are some ways a feature creation pipeline can introduce data leakage?

  1. Using the target, or data used to create the target, for feature engineering.
  2. Using data from future periods for feature engineering.

The first is generally easier to detect and keep track of, so let’s try to understand the second one using an example. Suppose you are trying to predict the stock price of a company 5 days ahead. Our data contains the date and the daily open price.

We want to create various hand-made features for this task. Say, one feature we want is ‘price on the previous day’.

But instead of doing .shift(1), let’s say we did .shift(-1) by mistake and used values from the next row of open price as a feature. We just created ‘price on the next day’ instead. This is a leaky feature because it uses data from a future period.
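To make the mistake concrete, here is a minimal sketch in pandas. The dates, prices, and column names are made up for illustration and are not taken from the original data.

```python
import pandas as pd

# Hypothetical toy data: dates and prices are invented for illustration.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "open": [10.0, 11.0, 12.0, 13.0, 14.0],
})

# Intended feature: yesterday's open price (looks backward in time).
df["price_prev_day"] = df["open"].shift(1)

# The mistake: shift(-1) pulls the value from the NEXT row (looks forward).
df["price_next_day"] = df["open"].shift(-1)

print(df)
```

On 2-Jan-2020 the first feature holds the 1-Jan price, while the second holds the 3-Jan price, which would not be known yet at prediction time.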

There are many best practices to follow to avoid leakage, but none of these can make you 100% sure that your pipeline is not leaky. This is where NANs and complex numbers come in! This methodology can be looked at as a unit test for data leakage.

The Methodology

Before getting to the methodology, let’s do an analogy first :).

Let’s say you have two tanks connected through a pipe which is closed. How can you check that this pipe is indeed closed and is not leaky? You can add color to one tank and check if the other tank also gets that color. Notice that you didn’t even have to inspect the pipe.

Just like watercolors, NANs and complex numbers are ideal for leakage detection tasks because they persist through most operations with real numbers. Operations like addition, subtraction, etc. between a real number and a NAN or a complex number yield a NAN or a complex number respectively. Of course, there are exceptions to this and we will come to that later.
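The persistence property is easy to check with plain Python. This is a small sketch, not part of leak-detect. The last line shows one exception I am sure of, which is not necessarily among those the post has in mind.

```python
import math

nan = float("nan")
z = complex(0.0, 1.0)  # a purely imaginary "dye" value

# NaN survives basic arithmetic with real numbers.
nan_results = [nan + 1.0, nan - 5.0, nan * 2.0, nan / 3.0]
all_nan = all(math.isnan(r) for r in nan_results)

# The imaginary part survives these operations with real numbers too.
complex_results = [z + 1.0, z - 5.0, z * 2.0, z / 4.0]
all_complex = all(r.imag != 0 for r in complex_results)

# An exception: comparisons with NaN are always False, so Python's
# built-in max() keeps the first argument here and the NaN disappears.
dropped = max(1.0, nan)

print(all_nan, all_complex, dropped)
```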

In the stock price example, let’s say we set the open price on a specific day D to NAN and create our features using this data. ‘Price on next day’ (leaky feature) will have a NAN value for day D-1, whereas ‘price on previous day’ (non-leaky) will have it for day D+1.

So for the leaky feature, we will observe an extra NAN in the feature before day D. Essentially, if the open price on day D is being used to create a feature for days before D (which is leakage), we will see an extra NAN in that feature before day D. This holds as long as the feature is created using an operation that yields NAN when one of its inputs is NAN.
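Here is a small sketch of that experiment. The toy prices and the choice of 3-Jan-2020 as day D are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, indexed by date.
df = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

day_d = pd.Timestamp("2020-01-03")  # hypothetical choice of day D
df.loc[day_d, "open"] = np.nan      # add the "dye" on day D

features = pd.DataFrame({
    "price_prev_day": df["open"].shift(1),   # non-leaky
    "price_next_day": df["open"].shift(-1),  # leaky
})

# Dates on which each feature is NaN.
nan_dates = {
    col: list(features.index[features[col].isna().to_numpy()])
    for col in features.columns
}
print(nan_dates)
```

Besides the NANs that shifting always produces at the edges of the data, the leaky feature picks up a NAN on day D-1 (2-Jan), while the non-leaky one picks it up on day D+1 (4-Jan).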

What if we want to check if data from days D+1, D+2, etc. is being used to create features for days before D? Well, we set them all to NAN and just count NANs in the final features before day D.
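A sketch of this, again on made-up data: everything from a hypothetical day D onward is set to NAN, and NANs are counted only on rows strictly before D.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ten days of invented open prices.
raw = pd.DataFrame(
    {"open": np.arange(10.0, 20.0)},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)
day_d = pd.Timestamp("2020-01-06")  # hypothetical day D

# Poison day D and everything after it.
poisoned = raw.copy()
poisoned.loc[poisoned.index >= day_d, "open"] = np.nan

features = pd.DataFrame({
    "price_prev_day": poisoned["open"].shift(1),   # non-leaky
    "price_next_day": poisoned["open"].shift(-1),  # leaky
})

# Count NaNs per feature on rows strictly before day D.
nans_before_d = features.loc[features.index < day_d].isna().sum()
print(nans_before_d)
```

Both features show one NAN before D, but for different reasons. For ‘price on previous day’ it sits on the very first row, where no previous day exists. For ‘price on next day’ it sits on 5-Jan, and it can only have come from the poisoned future.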

This methodology can be summarized in 4 simple steps:

1. Define an imaginary leakage partition that splits your data into the upper and lower half. In our case, we don’t want lower half data (future) to be used to create features for dates in the upper half (past).

2. Run the data creation pipeline on raw data and count the number of NANs in all features in the upper half. Our data creation pipeline here creates both the leaky and non-leaky features described above. Our non-leaky feature has 1 NAN because our data starts from 1-Jan-2020, so the first row has no previous day, and the leaky one has 0 NANs.
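The two steps above can be sketched as follows. `make_features` is a hypothetical stand-in for your own pipeline, and placing the partition at the midpoint of ten invented rows is an assumption made for illustration.

```python
import pandas as pd


def make_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the data creation pipeline."""
    return pd.DataFrame({
        "price_prev_day": raw["open"].shift(1),   # non-leaky
        "price_next_day": raw["open"].shift(-1),  # leaky
    })


# Hypothetical raw data starting on 1-Jan-2020.
raw = pd.DataFrame(
    {"open": [float(v) for v in range(10, 20)]},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

# Step 1: the leakage partition, here simply the midpoint by row position.
split = len(raw) // 2

# Step 2: run the pipeline on untouched raw data and count NaNs
# per feature in the upper half (the past).
upper = make_features(raw).iloc[:split]
baseline_nans = upper.isna().sum()
print(baseline_nans)
```

On this toy data the counts match the ones described above: 1 for the non-leaky feature and 0 for the leaky one.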
