The demand sales forecast technique every Data Scientist should be using to reduce error


This post was originally published by Federico Riveroll at Towards Data Science

Create a Record-Breaking Demand Sales Forecast in a Step-by-Step Tutorial Using Python

Thousands of companies around the world, from small startups to global corporations, find great value in being able to accurately predict sales, and it’s almost always one of the priorities for their Data Science / Analytics teams.

However, all of them seem to attempt to increase accuracy (reduce error) by focusing on mainly two things:

1) Feature engineering (getting the most out of your features)

2) Model/parameter optimization (choosing the best model & best parameters)

Both of the above are very necessary indeed, but there is a third thing that adds value in a complementary way, and it’s wildly underused not only in this use case but in most data science projects:

  • Combining external information.

In this article, we’ll do a simple sales forecast model and then blend external variables (properly done).

These are the steps:

  • Step 1: Get the Data
  • Step 2: Make a Simple Forecast Model
  • Step 3: Add Financial Indicators and News
  • Step 4: Test the Models
  • Step 5: Measure Results

Screenshot from the Kaggle Competition

For this article, we’ll use the data and improve a model only by combining external information.

Walmart released data containing weekly sales for 99 departments (clothing, electronics, food…) in every physical store along with some other added features.

Walmart dataset screenshot

For this, we will create an ML model with ‘Weekly_Sales’ as the target, train it on the first 70% of observations, and test it on the later 30%.

The objective is to minimize the prediction error on future weekly sales.

We’ll add external variables that impact or have a relationship with sales such as dollar index, oil price, and news about Walmart.

We won’t use model/parameter optimization nor feature engineering so we can distinguish the benefit from adding the external features.

First, let’s install the needed libraries:

$ pip install pandas OpenBlender scikit-learn

Then, open a Python script (preferably Jupyter notebook), and let’s import the needed libraries.

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import OpenBlender
import json

Now, let’s define the methodology and model to be used in all experiments.

First, the date range of the data is from Jan 2010 to Dec 2012. Let’s use the first 70% of the data for training and the later 30% for testing (because we don’t want data leakage in our predictions).

Next, let’s define as our standard model a RandomForestRegressor with 50 estimators, which is a reasonably good option.

Finally, to keep things as simple as possible, let’s define the error as the sum of absolute errors.
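For instance (with made-up numbers), that metric is simply:

```python
import numpy as np

# Toy illustration of the metric: the sum of absolute errors
y_true = np.array([100.0, 250.0, 300.0])
y_pred = np.array([110.0, 240.0, 330.0])

error_sum = np.sum(np.abs(y_true - y_pred))
print(error_sum)  # -> 50.0
```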

Now, let’s put it in a Python class.

class StandardModel:

    model = RandomForestRegressor(n_estimators=50)

    def train(self, df, target):
        # Drop non numerics
        df = df.dropna(axis=1).select_dtypes(['number'])
        # Create train/test sets
        X = df.loc[:, df.columns != target].values
        y = df.loc[:, [target]].values
        # We take the first bit of the data as test and the
        # last as train because the data is ordered desc.
        div = int(round(len(X) * 0.29))
        X_train = X[div:]
        y_train = y[div:]
        print('Train Shape:')
        print(y_train.shape)
        # Random forest model specification
        self.model = RandomForestRegressor(n_estimators=50)
        # Train on data
        self.model.fit(X_train, y_train.ravel())

    def getMetrics(self, df, target):
        # Function to get the error sum from the trained model
        # Drop non numerics
        df = df.dropna(axis=1).select_dtypes(['number'])
        # Create train/test sets
        X = df.loc[:, df.columns != target].values
        y = df.loc[:, [target]].values
        div = int(round(len(X) * 0.29))
        X_test = X[:div]
        y_test = y[:div]
        print('Test Shape:')
        print(y_test.shape)
        # Predict on test
        y_pred_random = self.model.predict(X_test)
        # Gather absolute error
        error_sum = sum(abs(y_test.ravel() - y_pred_random))
        return error_sum

Above we have a class with 3 elements:

  • model (RandomForestRegressor)
  • train: A function to train that model with a dataframe and a target
  • getMetrics: A function to test with the trained model with the test data and retrieve the error

We will use this configuration for all of the experiments, although you can modify it as you want to test different models, parameters, configuration or whatever else. The value-added will remain and could potentially improve.
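As an illustration of swapping the model (a hypothetical variation, not part of the experiments below), any scikit-learn regressor is a drop-in replacement because they share the same fit/predict interface:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Toy data just to show both regressors share the same interface
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X.sum(axis=1) + rng.normal(0, 0.1, 100)

for reg in (RandomForestRegressor(n_estimators=50),
            GradientBoostingRegressor(n_estimators=50)):
    reg.fit(X, y)
    error = np.sum(np.abs(y - reg.predict(X)))
    print(type(reg).__name__, round(error, 2))
```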

Now, let’s get the Walmart data. You can get that csv here.

df_walmart = pd.read_csv('walmartData.csv')

There are 421,570 observations. As we said before, the observations are records of weekly sales by store and department.

Let’s plug the data into the model without tampering with it at all.

our_model = StandardModel()
our_model.train(df_walmart, 'Weekly_Sales')
total_error_sum = our_model.getMetrics(df_walmart, 'Weekly_Sales')
print("Error sum: " + str(total_error_sum))
# > Error sum: 967705992.5034052

The sum of all errors for the complete model is $ 967,705,992.5 USD from all the predictions vs. the real sales.

This doesn’t mean much by itself, the only reference is that the sum of all sales in that period is $ 6,737,218,987.11 USD.
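To put that figure in context, the aggregate error is roughly 14% of total sales for the period:

```python
error_sum = 967705992.50
total_sales = 6737218987.11

# Error as a share of total sales in the period
error_ratio = error_sum / total_sales
print(round(error_ratio * 100, 1))  # -> 14.4
```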

Since there is a whole lot of data, we will only focus on Store #1 for this tutorial, but the methodology is absolutely replicable for all stores.

Let’s take a look at the error generated by Store 1 alone.

# Select store 1
df_walmart_st1 = df_walmart[df_walmart['Store'] == 1]
error_sum_st1 = our_model.getMetrics(df_walmart_st1, 'Weekly_Sales')
print("Error sum error_sum_st1: " + str(error_sum_st1))
# > Error sum error_sum_st1: 24009404.060399983

So, Store 1 is responsible for the $24,009,404.06 USD error and this will be our threshold for comparison.

Now let’s break down the error by the department to have more visibility later.

error_summary = []
for i in range(1, 100):
    try:
        df_dept = df_walmart_st1[df_walmart_st1['Dept'] == i]
        error_sum = our_model.getMetrics(df_dept, 'Weekly_Sales')
        print("Error dept : " + str(i) + ' is: ' + str(error_sum))
        error_summary.append({'dept' : i, 'error_sum_normal_model' : error_sum})
    except:
        error_sum = 0
        print('No obs for Dept: ' + str(i))

error_summary_df = pd.DataFrame(error_summary)

Now we have a dataframe with the errors corresponding to each department on Store 1 with our threshold model.

Let’s improve these numbers.

action = 'API_createDataset'
token = 'YOUR_TOKEN_HERE'

parameters = { 
    'token' : token,
    'name' : 'Walmart Sales Store1',
    'description' : 'Weekly sales from Store 1',
    'visibility' : 'private',
    'tags' : ['walmart'],
    'insert_observations' : 'on',
    'dataframe' : df_walmart_st1.to_json(),
    'day_first' : 0
}

OpenBlender.call(action, parameters)

Note: To get a token you need to create an account on OpenBlender (free); you’ll find it in the ‘Account’ tab under your profile icon.

After we run this code, the dataset id will be printed; you’ll need it later (you can also find it in your dashboard).

Now, let’s try with department 1, and let’s blend the US Dollar Index and the Oil Index.

action = 'API_getObservationsFromDataset'
# ANCHOR: 'Walmart Sales St1'
# BLENDS: 'US Dollar Index', 'Oil Index'

week = 60 * 60 * 24 * 7  # A week in seconds

parameters = { 
    'token' : token,
    'id_dataset' : 'YOUR_DATASET_ID',  # the id printed when we created the dataset
    'filter_select' : {'feature' : 'dept', 'values' : [1]},
    'add_date' : 'date',
    'drop_features' : ["isholiday", "type", "store", "size", "dept"],
    'aggregate_in_time_interval' : {"output" : "avg",
                                    "empty_intervals" : "impute",
                                    "time_interval_size" : week},
    'blends' : [
        # US Dollar Index (placeholder: use the blend id from its OpenBlender page)
        {"id_blend" : "US_DOLLAR_INDEX_ID",
         "specifications" : {"time_interval_size" : week},
         "drop_features" : ["change_", "high", "low", "open"]},
        # Oil Index (placeholder id)
        {"id_blend" : "OIL_INDEX_ID",
         "specifications" : {"time_interval_size" : week}}
    ]
}

df_enhanced = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
df_enhanced.reset_index(drop=True, inplace=True)
print(df_enhanced.shape)
df_enhanced.head()
# > (143, 20)

We just pulled the weekly sales of Store 1, Dept 1 (143 observations and 20 features) exactly as they are into a pandas dataframe, along with the blended columns.

I’ll explain the important parts of the parameters. You can also consult the API documentation here.

  • id_dataset: The id of the Store 1 dataset we uploaded earlier
  • filter_select: We are only selecting observations from department 1, that’s why we only got 143 observations
  • drop_features: we filter out features which have the same values in all observations.
  • blends: These are time blended datasets, aggregated in one-week intervals, we added two datasets: dollar and oil indexes.

The blended features are prefixed with the source name in uppercase, followed by the feature name in lowercase (e.g. “OIL_INDEX_price”).
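So you can, for example, isolate the columns that came from one source by matching on that prefix (the column names here are illustrative):

```python
import pandas as pd

# Illustrative blended dataframe: original features plus prefixed blends
df = pd.DataFrame(columns=['weekly_sales', 'timestamp',
                           'US_DOLLAR_INDEX_price', 'OIL_INDEX_price'])

oil_cols = [c for c in df.columns if c.startswith('OIL_INDEX_')]
print(oil_cols)  # -> ['OIL_INDEX_price']
```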

Let’s apply the Standard Model and compare the error vs. the original.

our_model.train(df_enhanced, 'weekly_sales')
error_sum = our_model.getMetrics(df_enhanced, 'weekly_sales')
error_sum
# > 253875.30

The current model had a $253,875 error while the previous one had $290,037. That’s a 12% improvement.
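That improvement is just the relative error reduction:

```python
previous_error = 290037.0
enhanced_error = 253875.0

# Relative reduction in the sum of absolute errors
improvement = (previous_error - enhanced_error) / previous_error
print(round(improvement * 100))  # -> 12
```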

But this doesn’t prove much, it could be that the RandomForest got lucky. After all, the original model trained with over 299K observations. The current one only trained with 102!!
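One way to sanity-check the “lucky run” concern (a sketch on synthetic data, not part of the original experiments) is to repeat the fit with several random seeds and look at the spread of the error:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the small blended training set
rng = np.random.RandomState(42)
X = rng.rand(102, 5)
y = X @ np.array([3.0, 1.0, 0.5, 2.0, 1.5]) + rng.normal(0, 0.2, 102)

errors = []
for seed in range(5):
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X[30:], y[30:])       # train on the later 70%
    preds = model.predict(X[:30])   # test on the first 30%
    errors.append(np.sum(np.abs(y[:30] - preds)))

# If the spread across seeds is small relative to the mean error,
# the improvement is less likely to be a fluke.
print(round(np.mean(errors), 2), round(np.std(errors), 2))
```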

So let’s now try it with every department separately to measure how consistent the benefit is.

Also, to get a glimpse, we’re gonna experiment with the first 10 Departments first and compare the benefit of adding each additional source.

# Function to filter features from other sources
def excludeColsWithPrefix(df, prefixes):
    cols = df.columns
    for prefix in prefixes:
        cols = [col for col in cols if prefix not in col]
    return df[cols]

error_sum_enhanced = []
action = 'API_getObservationsFromDataset'
week_seconds = 60 * 60 * 24 * 7

# Loop through the first 10 Departments and test them.
for idx, row in error_summary_df[0:10].iterrows():
    print('Starting department ' + str(row['dept']))
    try:
        parameters = { 
            'token' : token,
            'id_dataset' : 'YOUR_DATASET_ID',  # the Store 1 dataset we uploaded
            'filter_select' : {'feature' : 'dept',
                               'values' : [row['dept']]},
            'add_date' : 'date',
            'drop_features' : ["isholiday", "type", "store", "size", "dept"],
            'aggregate_in_time_interval' : {"output" : "avg",
                                            "empty_intervals" : "impute",
                                            "time_interval_size" : week_seconds},
            'blends' : [
                # US Dollar Index (placeholder ids: look them up on OpenBlender)
                {"id_blend" : "US_DOLLAR_INDEX_ID",
                 "specifications" : {"time_interval_size" : week_seconds}},
                # Oil Index
                {"id_blend" : "OIL_INDEX_ID",
                 "specifications" : {"time_interval_size" : week_seconds}},
                # Consumer Sentiment
                {"id_blend" : "US_MONTHLY_CONSUMER_ID",
                 "specifications" : {"time_interval_size" : week_seconds}},
                # CNN News vectorizer (id as it appears in the filters below)
                {"id_blend" : "5e95ac5c951629562b9ff24b",
                 "specifications" : {"time_interval_size" : week_seconds}}
            ]
        }
        # Get the data
        df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
        df.reset_index(drop=True, inplace=True)

        error_sum = {}

        # Gather errors adding one source at a time
        # Dollar Index only
        df_selection = excludeColsWithPrefix(df, ['5e95ac5c951629562b9ff24b', 'US_MONTHLY_CONSUMER', 'OIL_INDEX'])
        our_model.train(df_selection, 'weekly_sales')
        error_sum['1_feature'] = our_model.getMetrics(df_selection, 'weekly_sales')

        # + Walmart News
        df_selection = excludeColsWithPrefix(df, ['US_MONTHLY_CONSUMER', 'OIL_INDEX'])
        our_model.train(df_selection, 'weekly_sales')
        error_sum['2_features'] = our_model.getMetrics(df_selection, 'weekly_sales')

        # + Oil Index
        df_selection = excludeColsWithPrefix(df, ['US_MONTHLY_CONSUMER'])
        our_model.train(df_selection, 'weekly_sales')
        error_sum['3_features'] = our_model.getMetrics(df_selection, 'weekly_sales')

        # + Consumer Sentiment (all features)
        df_selection = df
        our_model.train(df_selection, 'weekly_sales')
        error_sum['4_features'] = our_model.getMetrics(df_selection, 'weekly_sales')

        error_sum_enhanced.append(error_sum)
    except:
        print("No observations found for department: " + str(row['dept']))
        error_sum = 0

To get all of the data from the above code you will have to buy processing units in OpenBlender for about $35.00 USD. You can remove blends (such as Consumer Sentiment and Oil Index) to reduce cost.

Let’s get the results into a DataFrame and visualize.

separated_results = pd.DataFrame(error_sum_enhanced)
separated_results['original_error'] = error_summary_df[0:10]['error_sum_normal_model']
separated_results = separated_results[['original_error', '1_feature', '2_features', '3_features', '4_features']]

Departments 4 and 6 are on a higher order of magnitude than the rest. Let’s remove them to take a closer look at the others.

separated_results.drop([6, 4]).transpose().plot(kind='line')

We can see that on almost all departments the error decreased as we added new features. We can also see that the Oil Index (the third feature) was not only unhelpful but even harmful for some departments.

I excluded the Oil Index and ran the algorithm with the remaining 3 features on all the departments (which you can do by iterating over all of error_summary_df rather than just the first 10).

Let’s see the results.

Not only did the added features reduce the error on over 88% of the departments, some of the improvements were significant.

The original error (calculated at the beginning of the article) was $24,009,404.06 USD, and the final error is $9,596,215.21 USD, meaning it was reduced by over 60%.
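The reduction checks out arithmetically:

```python
original_error = 24009404.06
final_error = 9596215.21

# Relative reduction in the total error for Store 1
reduction = (original_error - final_error) / original_error
print(round(reduction * 100, 1))  # -> 60.0
```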

And this is just one store.

Thank you for reading.
