*This post was originally published by Adam Brownell at Towards Data Science*

**Using Aggregate Statistics to Judge the Strength of a Customer Behavior Model is a Bad Idea**

In this post, I hope to demonstrate, through both a toy and real-world example, **why using aggregate statistics to judge the strength of a customer behavior model is a bad idea**.

Instead, the best CLV Model is the one that has the strongest predictions on the individual level. Data Scientists exploring Customer Lifetime Value should primarily, and perhaps only, use individual level metrics to fully understand the strengths and weaknesses of a CLV model.

CLV Modelling is essentially guessing how frequently someone will shop at your store and how much they’ll spend. Photo by rupixen.com on Unsplash

While this is intended for Data Scientists, I wanted to address the business ramifications of this article, since understanding the business need will inform both why I hold certain opinions and why it is important for all of us to grasp the added benefit of a good CLV model.

**CLV**: How much a customer will spend in the future

CLV is a business KPI that has exploded in popularity over the past few years. The reason is obvious: if your company can accurately predict how much a customer will spend over the next couple months or years, you can tailor their experience to fit that budget. This has dramatic applications from marketing to customer service to overall business strategy.

Here's a quick list of business applications that accurate CLV predictions can help empower:

- Marketing Audience Generation
- Cohort Analysis
- Customer Service Ticket Ordering
- Marketing Lift Analysis
- CAC bid capping marketing
- Discount Campaigns
- VIP buying experiences
- Loyalty Programs
- Segmentation
- Board Reporting

There are plenty more; these are just the ones that come to mind fastest.

Great Digital Marketing stems from a Great Understanding of your customers. Photo by Campaign Creators on Unsplash

With so much business planning at stake, tech-savvy companies are scrambling to find the model that best captures the CLV of their customer base. The most popular and commonly used customer lifetime value models benchmark their strength on aggregate metrics, using statistics like aggregate revenue percent error (ARPE). I know this firsthand: many of my clients have compared their internal CLV models to mine using aggregate statistics.

I would argue that is a serious mistake.

The following two examples, one toy and one real, will hopefully demonstrate how aggregate statistics can both lead us astray and hide model shortcomings that are glaringly apparent at the individual level. This is especially important because **most business use cases require a strong CLV model at the individual level, not just at the aggregate.**

When you rely on aggregate metrics and ignore the individual-level inaccuracies, you are missing a large part of the technical narrative. Consider the following example of 4 customers and their 1 year CLV:

This example includes high, low, and medium CLV customers, as well as a churned customer, creating a nice distribution for a smart model to capture.

Now, consider the following validation metrics:

1. **MAE**: Mean absolute error (the average difference between predicted and actual values per customer)

2. **ARPE**: Aggregate revenue percent error (the percent difference between total actual revenue and total predicted revenue)

**MAE** is measured at the customer level, while **ARPE** is an aggregate statistic. For both, lower values are better.

This example will demonstrate how an aggregate statistic can bury the shortcomings of low-quality models.

To do so, we will compare a dummy model that always guesses the mean to a CLV model whose predictions are off by 20% across the board.

**Model 1: The Dummy**

The dummy model will only guess $40 for every customer.

**Model 2: CLV Model**

This model tries to make an accurate model prediction at the customer level.

We can use these numbers to calculate the validation metrics.
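As a minimal sketch of how the two metrics diverge, here is the calculation in plain Python, assuming actual 1-year values of $0, $10, $50, and $100 (the same means used in the noise example below); these exact figures are illustrative assumptions:

```python
# Toy example: four customers with actual 1-year CLV of $0, $10, $50, $100.
# The dummy predicts the $40 mean for everyone; the CLV model is off by 20%.
actual = [0, 10, 50, 100]
dummy_pred = [40, 40, 40, 40]
clv_pred = [0, 12, 60, 120]  # each actual inflated by 20%

def mae(actual, pred):
    """Mean absolute error: the average per-customer miss."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def arpe(actual, pred):
    """Aggregate revenue percent error: the miss on summed revenue."""
    return 100 * abs(sum(pred) - sum(actual)) / sum(actual)

print(f"Dummy -> MAE ${mae(actual, dummy_pred):.2f}, ARPE {arpe(actual, dummy_pred):.0f}%")
print(f"CLV   -> MAE ${mae(actual, clv_pred):.2f}, ARPE {arpe(actual, clv_pred):.0f}%")
```

The dummy wins on ARPE (0% vs. 20%) because its individual misses cancel out in the total, while the CLV model wins on MAE because it is closer for every single customer.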

This example illustrates that a model that is considerably worse in the aggregate (the CLV model is worse by over 20%) is actually better at the individual level.

To make this example even better, let’s add some noise to the predictions.

```python
# Dummy sampling: a normal distribution around $40 with an SD of $5
np.random.normal(40, 5, 4)
# OUT: (44.88, 40.63, 40.35, 42.16)

# CLV sampling: normal distributions around each true value with an SD of $15,
# floored at $0
(max(0, np.random.normal(0, 15)),
 max(0, np.random.normal(10, 15)),
 max(0, np.random.normal(50, 15)),
 max(0, np.random.normal(100, 15)))
# OUT: (0, 17.48, 37.45, 81.41)
```

The results above indicate that even if an individual error metric is higher than you would hope, the distribution of those CLV numbers is more in line with what we are looking for: a model that distinguishes high-CLV customers from low-CLV customers. If you only look at the aggregate metrics for a CLV model, you are missing a major part of the story, and you may end up choosing the wrong model for your business.

But even rolling up an error metric calculated at the individual level, such as MAE or alternatives like MAPE, can hide critical information about the strengths and weaknesses of your model. Namely, **its capacity to create an accurate distribution of CLV scores.**

To explore this further, let's move to a more realistic example.

Congratulations! You, the Reader, have been hired as a Data Scientist by BottleRocket Brewing Co, an eCommerce company I just made up. (The data we will use is based on a real eCommerce company that I scrubbed for this post)

Fun (Fake) Fact: BottleRocket Brewing is quite a popular brand in California. Photo by Helena Lopes on Unsplash

Your first task as a Data Scientist: Choose the best CLV model for BottleRocket’s business…

…but what does “best” mean?

Undeterred, you run an experiment with the following models:

**Pareto/NBD Model (PNBD)**

The Pareto/NBD model is a very popular choice, and is the model under the hood of most data-driven CLV predictions today. To quote the documentation:

*The Pareto/NBD model, introduced in 1987, combines the [Negative Binomial Distribution] for transactions of active customers with a heterogeneous dropout process, and to this date can still be considered a gold standard for buy-till-you-die models* *[Link]*

Put another way, the model learns two distributions, one for churn probability and the other for inter-transaction time (ITT), and makes CLV predictions by sampling from these distributions.

** Describing BTYD models in more technical detail is outside the scope of this article, which is focused on error metrics. Please drop a comment if you are interested in a more in-depth write-up about BTYD models and I'm happy to write a follow-on article!
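To make the sampling intuition concrete, here is a toy Monte Carlo sketch. This is not the actual Pareto/NBD likelihood, and every parameter is made up for illustration: purchase rates are drawn from a gamma distribution, dropout times from an exponential, and purchases only accrue while a customer is "alive."

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_expected_purchases(n_draws=10_000, horizon=365.0):
    """Toy BTYD-style simulation: each draw pairs a purchase rate with a
    random dropout ('death') time; purchases only accrue while alive."""
    # Heterogeneous purchase rates (purchases per day), gamma-distributed
    rate = rng.gamma(shape=2.0, scale=0.01, size=n_draws)
    # Heterogeneous lifetimes: exponential time-until-dropout, in days
    lifetime = rng.exponential(scale=200.0, size=n_draws)
    # Expected purchases over the horizon = rate * time alive within horizon
    alive_time = np.minimum(lifetime, horizon)
    return (rate * alive_time).mean()

print(simulate_expected_purchases())
```

Real implementations (e.g. the `lifetimes` Python package) fit these distributions to each customer's recency/frequency history rather than sampling blindly, but the core idea is the same: rate heterogeneity plus a dropout process.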

**2. Gradient Boosted Machines (GBM)**

Gradient Boosted Machines are a popular class of machine learning model in which many weak trees are trained and ensembled together to make a strong overall predictor.

** As with BTYD models, I won't go into detail about how GBMs work, but once again, comment below if you'd like me to write something up on these methods/models.

**3. Dummy CLV**

This model is defined as simply:

```
Calculate the average ITT for the business
Calculate the average spend over 1 year
If someone has not bought within 2x the average ITT:
    Predict $0
Else:
    Predict the average 1-year spend
```

**4. Very Dumb Dummy Model (Avg Dummy)**

This model only guesses the average spend over 1 year for all customers. It is included as a baseline for model performance.
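Both baselines can be sketched in a few lines; the thresholds follow the descriptions above, and the specific numbers in the usage lines are made up:

```python
def avg_dummy(avg_annual_spend):
    """Very dumb dummy: everyone gets the average 1-year spend."""
    return avg_annual_spend

def dummy_clv(days_since_last_purchase, avg_itt_days, avg_annual_spend):
    """If/else dummy: customers quiet for more than 2x the average
    inter-transaction time are treated as churned ($0)."""
    if days_since_last_purchase > 2 * avg_itt_days:
        return 0.0
    return avg_annual_spend

# Assuming an average ITT of 30 days and average 1-year spend of $120
print(dummy_clv(90, 30, 120))   # lapsed customer -> 0.0
print(dummy_clv(20, 30, 120))   # active customer -> 120
```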

**Ch.3.1 The Aggregate Metrics**

We can consolidate all of these models’ predictions into a nice little Pandas DataFrame `combined_result_pdf` that looks like:

```
combined_result_pdf.head()
```

Given this customer table, we can calculate error metrics using the following code:

```python
from sklearn.metrics import mean_absolute_error

actual = combined_result_pdf['actual']
rev_actual = sum(actual)

for col in combined_result_pdf.columns:
    if col in ['customer_id', 'actual']:
        continue
    pred = combined_result_pdf[col]
    mae = mean_absolute_error(y_true=actual, y_pred=pred)
    print(col + ": ${:,.2f}".format(mae))

    rev_pred = sum(pred)
    perc = 100 * rev_pred / rev_actual
    print(col + ": ${:,.2f} total spend ({:.2f}%)".format(rev_pred, perc))
```

We used these four models to predict 1-year CLV for BottleRocket customers; the results below are ranked by MAE score:

Here are some interesting insights from this table:

- GBM appears to be the best model for CLV
- PNBD, despite being a popular CLV model, seems to be the worst. In fact, it's worse than a simple if/else rule list, and only slightly better than a model that only guesses the mean!
- Despite GBM being the best, it’s only a few dollars better than a dummy if/else rule list model

Point #3 especially has some interesting ramifications if the Data Scientist/Client accepts it. If the interpreter of these results actually believes that a simple if/else model can capture nearly all the complexity a GBM could capture, and better than the commonly used PNBD model, then obviously the “best” model would be the Dummy CLV once cost, speed of training, and interpretability are all factored in.

This brings us back to the original claim — **that aggregate error metrics, even ones calculated on the individual level, hide some shortcomings of models.** To demonstrate this, let’s rework our DataFrame into Confusion Matrices.

Statistics is sometimes very confusing. For Confusion Matrices, they are literally confusing. Photo by Nathan Dumlao on Unsplash

**Mini Chapter: What is a Confusion Matrix?**

From its name alone, a confusion matrix sounds confusing and challenging to understand. But it is crucial for following the points made in this post, and it is a powerful tool to add to your Data Science toolkit.

A **Confusion Matrix** is a table that outlines the accuracy of a classifier and which misclassifications the model commonly makes. A simple confusion matrix may look like this:

The diagonal on the above confusion matrix, highlighted in green, reflects correct predictions, such as predicting Cat when it was actually a Cat. Each row adds up to 100%, giving us a nice snapshot of how well our model captures recall behavior: the probability our model guesses correctly given a specific true label.

What we can also tell from the above confusion matrix is:

- The model is excellent at predicting Cat given the true label is Cat (Cat Recall is 90%)
- The model has a difficult time distinguishing between Dogs and Cats, often misclassifying Dogs for Cats. This is the most common mistake made by the model.
- While it sometimes misclassifies a Cat as a Dog, it is far less common than other errors
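As a quick sketch, a row-normalized confusion matrix like the one above can be built with scikit-learn; the cat/dog labels and counts here are invented to mirror the example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a two-class cat/dog classifier
y_true = ["cat"] * 10 + ["dog"] * 10
y_pred = ["cat"] * 9 + ["dog"] + ["cat"] * 4 + ["dog"] * 6

cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
# Row-normalize so each true label's row sums to 100%
cm_pct = 100 * cm / cm.sum(axis=1, keepdims=True)
print(cm_pct)
```

Reading row by row: the model finds 90% of true cats, but misclassifies 40% of true dogs as cats, exactly the kind of asymmetry raw accuracy hides.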

With this in mind, let's explore how well our CLV models capture customer behavior using confusion matrices. A strong model would correctly classify low-value customers and high-value customers as such. I prefer this method of visualization over something like a histogram of CLV scores because it reveals which elements of the distribution the model captures well and which it captures poorly.

To achieve this, we will convert our monetary value predictions into quantile-based CLV predictions of Low, Medium, High, and Best. These will be drawn from the quantiles generated by each model's predictions.

The best model will correctly categorize customers into these four buckets of Low/Medium/High/Best. Therefore, for each model we will make a confusion matrix with the following structure:

And the best model will have the most predictions falling within this diagonal.
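One way to sketch this bucketing is with `pd.qcut`, which cuts a series at its own quartiles; the prediction values below are made up for illustration:

```python
import pandas as pd

# Hypothetical CLV predictions for eight customers
preds = pd.Series([5, 12, 18, 25, 40, 55, 90, 200.0])

# Quartile cut points are drawn from the predictions themselves
labels = ["Low", "Medium", "High", "Best"]
buckets = pd.qcut(preds, q=4, labels=labels)
print(buckets.value_counts().sort_index())
```

Because the cut points come from each model's own predictions, every model is graded on the shape of its distribution, not its dollar scale.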

**Ch. 3.2 The Individual Metrics**

These confusion matrices can be generated from our Pandas DF with the following code snippet:

```python
import itertools

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.metrics import confusion_matrix

# Helper function to bucket values using quantile thresholds
def get_quant_list(vals, quants):
    actual_quants = []
    for val in vals:
        if val > quants[2]:
            actual_quants.append(4)
        elif val > quants[1]:
            actual_quants.append(3)
        elif val > quants[0]:
            actual_quants.append(2)
        else:
            actual_quants.append(1)
    return actual_quants

# Create plot (one subplot per model)
num_plots = 4
class_names = ['Low', 'Medium', 'High', 'Best']
fig, axes = plt.subplots(nrows=int(num_plots / 2) + (num_plots % 2), ncols=2,
                         figsize=(10, 5 * (num_plots / 2) + 1))
fig.tight_layout(pad=6.0)
tick_marks = np.arange(len(class_names))
plt.setp(axes, xticks=tick_marks, xticklabels=class_names,
         yticks=tick_marks, yticklabels=class_names)

# Pick colors
cmap = plt.get_cmap('Greens')

# Generate quantile labels and one confusion matrix per model
plt_num = 0
for col in combined_result_pdf.columns:
    if col in ['customer_id', 'actual']:
        continue
    quants = combined_result_pdf[col].quantile(q=[0.25, 0.5, 0.75]).values
    pred_quants = get_quant_list(combined_result_pdf[col], quants)
    actual_quants = get_quant_list(combined_result_pdf['actual'], quants)

    # Generate confusion matrix
    cm = confusion_matrix(y_true=actual_quants, y_pred=pred_quants)
    ax = axes.flatten()[plt_num]
    accuracy = np.trace(cm) / float(np.sum(cm))

    # Row-normalize the matrix so each true label's row sums to 100%
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    ax.imshow(cm, interpolation='nearest', cmap=cmap)
    thresh = cm.max() / 1.5
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        ax.text(j, i, "{:.0f}%".format(cm[i, j]),
                horizontalalignment="center",
                color="white" if cm[i, j] > thresh else "black")

    # Clean up chart
    ax.set_ylabel('True label')
    ax.set_xlabel('Predicted label')
    ax.set_title('{}\naccuracy={:.0f}%'.format(col, 100 * accuracy))
    for side in ['top', 'right', 'bottom', 'left']:
        ax.spines[side].set_visible(False)

    # Outline the diagonal of correct classifications
    for i in [-0.5, 0.5, 1.5, 2.5]:
        ax.add_patch(patches.Rectangle((i, i), 1, 1, linewidth=2,
                                       edgecolor='k', facecolor='none'))
    plt_num += 1
```

This produces the following Charts:

The coloring of the chart shows how concentrated a given prediction/actual pairing is: the darker the green, the more examples fall within that square.

As with the example confusion matrix discussed above, the diagonal (highlighted with black lines) indicates appropriate classification of customers.

**Ch.3.3: Analysis**

**Dummy Models (Top Row)**

**Dummy1**, which is only predicting the average every time, has a distribution of ONLY the mean. It makes no distinction between high or low value customers.

Only slightly better, **Dummy2** predicts either $0 or the average. This means it can make some claim about the distribution and, in fact, captures 81% and 98% of the lowest and highest value customers respectively.

But the major issue with these models, which was not apparent when looking at MAE (but obvious if you know how their labels were generated), is that they have very little sophistication when it comes to distinguishing between customer segments. For all of the business applications listed in Ch.1, which are the entire point of building a strong CLV model, distinguishing between customer types is essential to success.

**CLV Models (Bottom Row)**

First, don't let the overall accuracy scare you. In the same way that we can hide the truth behind aggregate statistics, we can hide the strength of distribution modelling behind a rolled-up accuracy metric.

Second, it is pretty clear from this visual, as opposed to the previous table, that the dummy models are named as such for a reason: the second-row models are actually capturing a distribution. Even though Dummy2 captures a much higher percentage of low-value customers, this can just be an artifact of a long-tail CLV distribution. Clearly, the second-row models are the ones you want to choose between.

Looking at the diagonal, we can see that GBM shows major improvements in predicting most categories across the board. Major mislabellings, those that miss by two or more squares, are down considerably. The biggest improvement on the GBM side is in recognizing medium-value customers, which is a nice sign that the distribution is healthy and our predictions are realistic.

If you just skimmed this article, you may want to conclude that GBM is a better CLV model. And that may be true, but model selection is more complicated. Some questions you would want to ask:

- Do I want to predict many years into the future?
- Do I want to predict churn?
- Do I want to predict the number of transactions?
- Do I have enough data to run a supervised model?
- Do I care about explainability?

All of these questions, while not related to the thesis of this article, would need to be answered before you swap out your model for a GBM.

Choosing the right model requires a deep understanding of your business and use case. Photo by Brett Jordan on Unsplash

The first underlying variable to consider when choosing a model is the company data you are working with. Often BTYD models work well and are comparable to ML alternatives. But BTYD models make some strong assumptions about customer behavior, so when those assumptions are broken, they perform sub-optimally. Running a model comparison is crucial to making the right model decision.

While the issues at the individual level are apparent for the dummy models, companies often fall prey to the same issues by running a "naive"/"simple"/"Excel-based" model that does this exact thing: applies an aggregate number across the entire customer base. At some companies, CLV is defined as simply as dividing revenue equally among all customers. This may work for a board report or two, but it is not an adequate way to calculate such a complex number. The truth is, not all customers are created equal, and the sooner your company shifts its attention away from aggregate customer metrics to strong individual-level predictions, the more effectively you can market, strategize, and ultimately find your best customers.

Hope this was as enjoyable and informative to read as it was to write.

Thanks for reading!
