Standardizing machine learning results


The issue

Machine learning has become one of the central tools of data science, with new and valuable use cases being found for it every day. But making the output of a machine learning model accessible to the somewhat less technically-minded players in a business organization can be challenging. I want to illustrate how I approached this issue for a large machine learning system I helped to develop several months ago. Without getting into overly-revealing detail, the goal of the system in question is to produce ongoing forecasts of upcoming customer purchases.

A very common use of the output of a machine learning classifier is to identify a group of customers who are most likely to make a purchase from a given category of products in the near future. For example, let’s say we want 500 prospective consumers of dairy products in the state of Florida, and our predictive system stores customer propensity scores in a SQL table where the code for dairy products is “3491.” Then a query like the following would provide the information we want:

select top 500 * from CustomerScores
where StateCode = 'FL'
and CategoryCode = '3491'
order by RawScore desc;

The raw scores generated by classifiers have an important “ranking” property – the larger the raw score, the more likely a customer is to purchase. So raw scores are suitable for the one common purpose described above. However, raw scores typically lack any other useful properties or intrinsic meaning whatsoever. For example, if I tell you “Our model assigned a score of 0.142 to customer A and a score of 0.117 to customer B,” you can say that customer A is more likely to make a purchase, but you couldn’t guess much more than that. And if the purpose of the model is to prioritize targets for sales efforts, there is no clear or objective basis on which to say “Customers with a score above _______ are  good prospects.”

Possible approaches

My coworkers and I engaged in a great deal of discussion about how to formulate a more standardized scoring scheme, having a straightforward meaning that is not too difficult to understand or explain, while retaining the ranking property of raw scores. We considered a number of possibilities, some of which I’ll describe and illustrate in code below.


First of all, we will be using the following tools:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
import ML_snippets as ml

The last statement makes available the various tools I discussed in some previous writing. Now, we’ll load all available data for whatever binary classification task we’re interested in, and divide the data into “training” and “testing” sets:

X = ...   # get the training/testing features
y = ...   # get the training/testing labels
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify = y)

As you will soon see, it might be more appropriate to call these the “training” and “calibration” sets. In the data set I’m using, the base rate of positive examples is only around 5%. In such unbalanced situations, and when I have a large quantity of data to work with, I typically impose an artificial balance by taking all the positive examples and an equally large sample of the negative examples –

bal_index = pd.concat([
    Xtrain[ytrain > 0],                              # all positive examples
    Xtrain[ytrain <= 0].sample(int(ytrain.sum()))    # equally many negatives
]).index

– then the classifier is trained on this balanced subset:

import ...  # some classifier
clf = ...   # set up the classifier
clf.fit(Xtrain.loc[bal_index], ytrain.loc[bal_index])
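As a rough, self-contained illustration of the balancing step, here is a minimal sketch on synthetic data. The column name "feature", the 5% positive rate, and the fixed random seeds are assumptions made only for this example; they are not taken from the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: 2,000 rows, roughly 5% positives.
Xtrain = pd.DataFrame({"feature": rng.normal(size=2000)})
ytrain = pd.Series((rng.random(2000) < 0.05).astype(int), index=Xtrain.index)

# Keep every positive example and an equally large random sample of negatives.
pos_index = ytrain[ytrain > 0].index
neg_index = ytrain[ytrain <= 0].sample(len(pos_index), random_state=0).index
bal_index = pos_index.append(neg_index)

# Exactly half of the balanced subset is positive.
balanced_rate = ytrain.loc[bal_index].mean()
```

Working with the index rather than with copies of the rows means the same bal_index can be used to select matching rows from both Xtrain and ytrain.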

Basic ideas

Probably the most straightforward version of standardization would be to transform the classifier’s raw scores into something more uniform, such as a rank or percentile. Then we might say, for example,

  • “Out of 2,710,012 customers, our model ranked customer A in 351,782nd place, and customer B in 901,175th place,” or
  • “Our model placed customer A at the 87th percentile, and customer B at the 67th percentile.”
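A minimal sketch of both transformations in pandas, using five made-up raw scores. The percentile line mirrors the rank-based formula used for the Percentile column further down; the Rank column (1 = best) is an assumed convention for this example.

```python
import pandas as pd

# Five hypothetical raw scores, including one tie.
raw = pd.Series([0.142, 0.117, 0.030, 0.570, 0.117], name="RawScore")

# Rank 1 is the highest raw score; tied customers share the better rank.
rank = raw.rank(method="min", ascending=False).astype(int)

# Percentile: ascending rank as a percentage of the number of customers.
percentile = (raw.rank(method="min") / len(raw) * 100).round(2)

result = pd.DataFrame({"RawScore": raw, "Rank": rank, "Percentile": percentile})
```

Both columns preserve the ordering of the raw scores, which is the one property of the raw scores we want to keep.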


The most intuitive standardized score might be the probability of purchase. Suppose we could run a query like this:

select * from CustomerScores
where StateCode = 'FL'
and CategoryCode = '3491'
and Probability between 0.7 and 0.85;

If the “Probability” column is accurate, and the query returns 7,300 customers (for example), then we should expect somewhere between 5,110 and 6,205 of those customers to purchase dairy products in the near future.

Some machine learning classifiers are capable of producing decent probability estimates; in fact, in these cases, what I’ve referred to as the “raw score” is actually a modeled probability. However, this is not always true – and certainly not if the training data has been artificially balanced as I’ve done here!
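To see how much balancing can distort probabilities, consider a simplified case: if only a fraction beta of the negative examples is kept for training, the odds of a positive are inflated by a factor of 1 / beta. The sketch below reverses that inflation. It assumes the classifier is well calibrated on the balanced data, which may not hold in practice, and the function name is hypothetical. It is shown only to illustrate the size of the effect, not as the approach used in this post.

```python
import numpy as np

def undo_negative_undersampling(p_balanced, beta):
    """Map probabilities from a model trained on undersampled negatives
    back to the original population.

    beta is the fraction of negative examples kept for training.
    Undersampling multiplies the odds of a positive by 1 / beta,
    so multiplying the odds by beta reverses the effect.
    """
    p_balanced = np.asarray(p_balanced, dtype=float)
    return beta * p_balanced / (beta * p_balanced + 1.0 - p_balanced)

# With a 5% base rate, a 50/50 balance keeps about 5/95 of the negatives.
beta = 0.05 / 0.95
corrected = undo_negative_undersampling([0.5, 0.9], beta)
# corrected is approximately [0.05, 0.321]
```

Under these assumptions, a "coin flip" score of 0.5 from the balanced model corresponds to the 5% base rate, and even a score of 0.9 corresponds to a purchase probability of only about 32%.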

The scikit-learn library provides a tool for creating a “calibrated” classifier, which produces raw scores that conform to actual probabilities as closely as possible. We’ll create a calibrated classifier from our previously trained classifier as follows:

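# cv = 'prefit' keeps clf as already trained; recent scikit-learn releases
# deprecate this option in favor of wrapping clf in FrozenEstimator
cclf = CalibratedClassifierCV(clf, cv = 'prefit')
cclf.fit(Xtest, ytest);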

Local lift

Another approach to standardization that my coworkers and I explored would be to use the model’s lift as a score. Let’s examine the (local) lift curve for our classifier on the test data.

ml.plot_lift(clf, Xtest, ytest)

For the dataset and classifier I’m using for this demo, the curve looks like this:


Looking carefully at this somewhat sparsely-labeled plot, we can see that customers ranked near the 90th percentile have a lift of about 4, which means that they are 4 times as likely to purchase as a randomly-selected group of customers would be. And the very highest-ranked customers apparently have a lift of 12 or more. So if 5% of all customers purchase, we would expect 20% of those at the 90th percentile to purchase, and 60% of those ranked near the top to purchase.

Note that the lift at the 81st percentile is about 1, which means that customers ranked at this percentile by our model should have about the same purchase rate as a random group of customers. Meanwhile, customers below the 81st percentile should be less likely to purchase than randomly-selected customers! Let’s save this special value, where the lift crosses below 1, for later:

lift_cross_percentile = 81

An advantage of using (local) lift as a standard score is that a score of 1 has this special meaning: a sample of customers above this score should almost always outperform a fully random sample of customers. At one point it was suggested that we might use lift-minus-one (AKA “improvement”) as the standard score, so that customers ranked below the crossing point (the 81st percentile in this example) would receive negative scores, making it more obvious that these customers typically shouldn’t be targeted in sales campaigns.
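The internals of ml.plot_lift are not shown here, so the following is only a minimal sketch of how a local lift curve could be computed: rank the customers, cut them into equally sized bins, and divide each bin's purchase rate by the overall rate. The function name, the choice of 20 bins, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

def local_lift(raw_scores, outcomes, n_bins=20):
    """Purchase rate within each score-ranked bin, divided by the overall rate."""
    df = pd.DataFrame({"score": np.asarray(raw_scores),
                       "outcome": np.asarray(outcomes)})
    # Rank first so that tied scores cannot produce duplicate bin edges.
    ranks = df["score"].rank(method="first")
    df["bin"] = pd.qcut(ranks, n_bins, labels=False)
    lift = df.groupby("bin")["outcome"].mean() / df["outcome"].mean()
    lift.index = (lift.index + 1) * 100 // n_bins   # upper percentile of each bin
    return lift

rng = np.random.default_rng(1)
raw = rng.random(20000)
# Hypothetical ground truth: purchase probability grows with the raw score.
outcome = (rng.random(20000) < 0.15 * raw ** 3).astype(int)

lift = local_lift(raw, outcome)
crossing = lift[lift >= 1].index.min()   # lowest percentile bin with lift of at least 1
```

Because the bins are equally sized, the local lifts average to exactly 1, which is why some part of the ranking must always fall below the crossing point.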

Cumulative lift score

All of the above are valid and potentially valuable ways to standardize the output of a machine learning system, but in the end my coworkers and I, with feedback from stakeholders outside our immediate team, arrived at the idea of using cumulative lift as a standard score for our predictive model. To illustrate this with a concrete example, let’s say we used a query like the following:

select * from CustomerScores
where StateCode = 'FL'
and CategoryCode = '3491'
and LiftScore >= 10;

This should return a group of customers whose rate of purchasing dairy products in the near future is 10 times as high as the overall rate. So for example, if 5% of all customers will make purchases in the near future, then 50% of the customers identified by the query above should make purchases.

To convert raw scores to (cumulative) lift scores, we feed our classifier and testing data into a tool I introduced before:

spf = ml.plot_cumlift(clf, Xtest, ytest, True)


This visualizes the cumulative lift curve and produces a mathematical function that can be used to convert raw model output to lift scores, as will be shown below.
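For readers without access to ml.plot_cumlift, the underlying quantity is easy to compute by hand. The sketch below is an assumption about the general technique, not a copy of that tool: it sorts customers from best to worst raw score and, for each requested fraction, divides the purchase rate among the top-ranked group by the overall purchase rate.

```python
import numpy as np

def cumulative_lift(raw_scores, outcomes, fractions):
    """Lift among the top-ranked fraction of customers, for each fraction given."""
    raw_scores = np.asarray(raw_scores)
    outcomes = np.asarray(outcomes, dtype=float)
    order = np.argsort(-raw_scores, kind="stable")   # best-ranked first
    hits = np.cumsum(outcomes[order])                # purchases among the top k
    base_rate = outcomes.mean()
    n = len(outcomes)
    lifts = []
    for f in fractions:
        k = max(1, int(round(f * n)))                # size of the top group
        lifts.append(hits[k - 1] / k / base_rate)
    return np.array(lifts)

# Ten customers sorted by score; the four buyers are concentrated near the top.
raw = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
bought = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
lifts = cumulative_lift(raw, bought, [0.2, 0.5, 1.0])
# lifts is approximately [2.5, 1.5, 1.0]
```

The cumulative lift over the whole population is always 1, so the curve necessarily ends there.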

Generating standardized scores

Up to this point, we have only been working with the original, fully-labeled data set, as we created a classifier and prepared various ways to standardize the output of that classifier. Now, let’s say we are ready to use what we have put together in order to make predictions on a new set of data without known labels.

I’ll call the new features matrix “Xf” (the letter f stands for “future,” or “forecast,” or maybe “forward”):

# Load the data to be used for making new predictions
Xf = ...

Using this dataframe as a sort of backbone, let’s build the table of customer scores (which would later, hypothetically, be exported to SQL as the CustomerScores table):

scores = Xf.copy()[[]]
scores['RawScore'] = clf.predict_proba(Xf)[:,1]
scores['Probability'] = cclf.predict_proba(Xf)[:,1]
scores['Percentile'] = (
    scores.RawScore.rank(method = 'min')
    / len(scores.RawScore)*100
).apply(np.round, decimals = 2)
scores['LiftScore'] = (
    1 - scores.Percentile/100
).apply(spf).apply(np.round, decimals = 2)

Below is a somewhat random sample of rows from the resulting table. In practice, there would also have to be a column or index with some kind of customer identifier for each row.

RawScore  Probability  Percentile  LiftScore
    0.06     0.001544       33.78       1.51
    0.25     0.007014       70.21       3.30
    0.03     0.001215       18.21       1.22
    0.57     0.083621       88.89       7.85
    0.14     0.002922       55.09       2.23
    0.01     0.001035        4.50       1.05
    0.11     0.002301       48.73       1.95
    0.38     0.019582       80.55       4.86
    0.03     0.001215       18.21       1.22
    0.31     0.011283       75.76       3.97
    0.73     0.246981       93.47      11.08
    0.31     0.011283       75.76       3.97
    0.08     0.001811       40.60       1.69
    0.42     0.026765       82.91       5.44
    0.87     0.501166       96.44      12.26
    0.00     0.000956        0.01       1.00
       ⋮            ⋮           ⋮          ⋮

Checking performance

With a machine learning system designed to foresee the near future, it is important to check later how well it performed. For example, if the system is meant to predict purchases made over the next month, then once the month has passed we should look back and see how accurate the predictions turned out to be. This is often called a monthly “postmortem.” I have written elsewhere about this process, so here I’ll focus only on checking the correctness of the standardized scores, rather than the predictive power of the underlying raw scores.

At the end of the month, we should be able to retrieve the final result for each customer – i.e. a “1” for each customer that did end up purchasing and a “0” for each that did not. I’ll call these results “yf.” For convenience, let’s attach them to the table of scores we created a month earlier:

# Load the actual outcomes
yf = ...
scores['Outcome'] = yf

The correctness of the modeled probabilities is a bit subjective, but one way to check it would be to verify that among the customers assigned a purchase probability in a certain range (e.g. between 20% and 30%), the purchase rate did indeed end up in that range. Looking at ranges of 10 percentage points –

scores.groupby(
    (scores.Probability * 100) // 10 * 10
)[['Outcome']].mean()

Probability   Outcome
 0 – 10%      0.009382
10 – 20%      0.120521
20 – 30%      0.240260
30 – 40%      0.325000
40 – 50%      0.446809
50 – 60%      0.507576
60 – 70%      0.688000
70 – 80%      0.904348

The results look pretty good from this perspective. But with ranges of 5 percentage points, the results aren’t nearly as trustworthy –

scores.groupby(
    (scores.Probability * 100) // 5 * 5
)[['Outcome']].mean()

Probability   Outcome
 0 –  5%      0.007521
 5 – 10%      0.058997
10 – 15%      0.112821
15 – 20%      0.133929
20 – 25%      0.229885
25 – 30%      0.253731
30 – 35%      0.280000
35 – 40%      0.400000
40 – 45%      0.379310
45 – 50%      0.555556
50 – 55%      0.380952
55 – 60%      0.623188
60 – 65%      0.561404
65 – 70%      0.794118
70 – 75%      0.904348
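One likely reason the narrower ranges look worse is that each range contains fewer customers, so the observed purchase rate is noisier. A minimal sketch of a table that reports the bin size and a binomial standard error next to each observed rate is shown below. The function name and the synthetic probabilities are assumptions for this example; the outcomes are simulated to be consistent with the probabilities by construction.

```python
import numpy as np
import pandas as pd

def reliability_table(probability, outcome, width=10):
    """Observed purchase rate per probability bin, with bin size and standard error."""
    df = pd.DataFrame({"Probability": np.asarray(probability, dtype=float),
                       "Outcome": np.asarray(outcome, dtype=float)})
    lower_edge = (df["Probability"] * 100) // width * width
    table = df.groupby(lower_edge)["Outcome"].agg(Rate="mean", Count="count")
    # Binomial standard error of the observed rate in each bin.
    table["StdErr"] = np.sqrt(table["Rate"] * (1 - table["Rate"]) / table["Count"])
    return table

rng = np.random.default_rng(2)
p = rng.beta(0.5, 6, size=50000)            # mostly small probabilities
y = (rng.random(50000) < p).astype(int)     # outcomes consistent with p

wide = reliability_table(p, y, width=10)
narrow = reliability_table(p, y, width=5)
```

Comparing the Count and StdErr columns of the two tables gives a sense of whether a deviation in a narrow bin is large enough to matter or is within the noise expected from the bin's size.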

We aren’t going to fully explore the idea of using local lift as a standardized score here, but we can observe that the local lift curve of the model on the new data seems to match the previous curve pretty well:

ml.plot_lift(clf, Xf, yf)


As noted before, we should expect that a sample of customers from above the 81st percentile should almost always outperform a fully random sample of customers. To test this out a little bit, let’s run 10,000 experiments, as follows:

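# Fraction of trials in which a sample above the crossing point beats a random one
np.mean([
    scores[scores.Percentile > lift_cross_percentile].sample(100).Outcome.sum()
    > scores.sample(100).Outcome.sum()
    for _ in range(10000)
])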

The result is that a sample from above the lift curve’s crossing point outperforms a random sample very close to 100% of the time.

To check the correctness of the (cumulative) LiftScore, we should ask a series of questions like…

  • Among customers with a LiftScore of 2 or higher, was the purchase rate two times as large as the rate among all customers?
  • Among customers with a LiftScore of 3 or higher, was the purchase rate three times as large as the rate among all customers?
  • etc. …..

Doing this in a systematic and automated way is tricky, but the following code will accomplish it and visualize the results for all whole-number lift values:

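# Group customers by whole-number LiftScore (negated so that the
# highest scores sort first), counting customers and purchases
scores_agg = scores.groupby(
    -np.floor(scores.LiftScore)
).agg({
    'LiftScore' : 'count',
    'Outcome'   : 'sum'
})

scores_agg.columns = ['Total', 'Positives']
scores_agg = scores_agg.reset_index()

# Accumulate from the top so that each row covers "LiftScore >= k"
scores_agg[['Total', 'Positives']] = (
    scores_agg[['Total', 'Positives']].cumsum()
)

xlim = int(-scores_agg.LiftScore.min()) + 1

plt.plot([0, xlim], [0, xlim])    # where actual lift equals predicted lift
plt.scatter(
    -scores_agg.LiftScore,
    (scores_agg.Positives / scores_agg.Total)
    / scores.Outcome.mean(),
    c = 'red', s = 100
)
plt.xlabel('Minimum LiftScore')
plt.ylabel('Actual Lift')
plt.title('Predicted Lift vs. Actual Lift')
plt.show()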


The plot shows that we can generally get the lift we want. For example, if we pull all customers with LiftScore ≥ 10, one month later we’ll see that these were (almost) 10 times as likely to purchase as a random group of customers would be. And the relationship is closer for most of the other minimum LiftScores on the graph.

Real-life implementation

Monthly process

All of the above is merely a demonstration of how we might code a few approaches to standardizing a model’s predictions. In the real machine learning system that inspired this post, the last approach (based on cumulative lift) was implemented, and the entire system was set up to be “refreshed” once a month, by carrying out the following steps:

  1. For each of the predictions made about purchases during the preceding month, the actual outcome is recorded.
  2. The performance of last month’s predictions is measured and recorded in various ways.
  3. Relationships between raw predictions and actual outcomes in the preceding month are used to build a new set of standardizing or “spf” functions.
  4. The feature matrix needed to make predictions about next month is assembled.
  5. With next month’s features as input, the ML classifiers produce RawScores.
  6. With the raw scores as input, the standardizing functions produce the LiftScores for next month.
  7. All scores are stored in shared SQL space.

The ML classifiers used in step 5 can be re-trained on more recent labeled data at any time, especially if their performance (measured in step 2) is found to have degraded.

Final technical notes about the LiftScore

Aside from simply using rank or percentile, none of the other standardization approaches discussed above is as simple to implement in the real world as it might at first appear. I’ll focus on the (cumulative) lift score idea here, and illustrate some of the difficulties that had to be overcome in order to put it in place.

The graph below shows the cumulative lift curve for one of the many ML classifiers in the predictive system I’ve been referring to. (Without getting into too many specifics, each classifier in the system corresponds to one category of products, like dairy for example. I’ve removed the graph’s title and axis labels to conceal certain business details that I should probably keep to myself.) In this particular case, the relationship between the model’s ranking of customers (on the horizontal axis) and the cumulative lift (on the vertical axis) is nicely monotonic, so the standardization function (in red) is simply an interpolation of all the observations (in blue).


Here a customer at the bottom of the best-ranked 10% (i.e. at the 90th percentile) would be assigned a LiftScore around 6, because we have observed a six-fold lift in purchase rate among the top 10% of customers. A customer at the 70th percentile would receive a LiftScore of about 3.

To see why we go through all the trouble of using the lower convex envelope of the lift curve (as mentioned in my previous post), see the graph below, corresponding to a very different product category than the previous one. In this case, there is a transitory “jump” in the relationship between model ranking and lift. The lower convex envelope, seen in red below, essentially allows us to ignore this jump. There are a couple of good reasons to do this, but the main reason is that it seems unwise to change the order of the ranking produced by the model. When LiftScores for this product category are produced, a customer at the 85th percentile will be assigned a score of about 5, and at the 90th percentile a score of about 7.
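The envelope code itself is not reproduced in this post, so the following is only a sketch of one standard way to compute a lower convex envelope (the lower half of a monotone-chain convex hull). The function name and the example points are hypothetical; the x values stand for the fraction of top-ranked customers and the y values for the observed cumulative lift.

```python
def lower_convex_envelope(xs, ys):
    """Return the points on the lower convex hull of (xs, ys), ordered by x."""
    hull = []
    for x, y in sorted(zip(xs, ys)):
        # Drop the last hull point while it lies on or above the segment
        # joining its predecessor to the new point.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    xhull = [p[0] for p in hull]
    yhull = [p[1] for p in hull]
    return xhull, yhull

# Hypothetical lift curve with a transitory jump at x = 0.15.
x = [0.05, 0.10, 0.15, 0.20, 0.50, 1.00]
y = [8.0, 6.0, 6.5, 4.0, 2.0, 1.0]
xhull, yhull = lower_convex_envelope(x, y)
# xhull == [0.05, 0.10, 0.20, 0.50, 1.00]; yhull == [8.0, 6.0, 4.0, 2.0, 1.0]
```

In this example the point at 0.15 is discarded, so interpolating through the remaining points never assigns a lower-ranked customer a higher score than a better-ranked one.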


Another issue with using cumulative lift as a standard score is how to deal with the very highest ranked customers. Generally speaking, if our test data set has many thousands of instances, and if our classifier is any good at all, the purchase rate among the top 1,000 ranked customers will almost certainly be better than the rate among the top 5,000 or the top 10,000, etc. But will the purchase rate among the top 50 ranked customers be better than among the top 1,000? Will the rate among the top 10 ranked customers be better than among the top 50? I wouldn’t be too sure.

In the demonstration earlier, the first point on the cumulative lift chart is made by observing the top 5% of ranked customers. In practice, this is probably not fine enough detail; 5% of customers could be tens of thousands of customers, and we might want to know what lift to expect from, say, the top 100 customers. In developing the real-life implementation of the LiftScore, I asked a number of people within the organization, “If we use this system to generate sales leads, what is the smallest number of leads we could conceivably ever want to pull from it?” Let’s say (hypothetically) their answer was 100 customers. If that were the case, then we should force the first point of the cumulative lift curve to correspond to the 100 top ranked customers. We refrain from calculating the cumulative lift for a smaller group of customers than this, because the smaller the group used, the more our observation will be subject to unpredictable variability and not a good predictor of future model performance.
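A back-of-the-envelope calculation shows how quickly that variability grows as the group shrinks. The sketch below treats purchases in the top group as independent with a common rate and treats the base rate as known exactly; both are simplifying assumptions, and the function name is hypothetical.

```python
import math

def lift_standard_error(true_lift, base_rate, group_size):
    """Approximate standard error of the lift observed in a group of customers."""
    rate = true_lift * base_rate                        # purchase rate in the group
    rate_se = math.sqrt(rate * (1 - rate) / group_size)
    return rate_se / base_rate                          # express in units of lift

# A true lift of 10 at a 5% base rate means a 50% purchase rate in the group.
errors = {n: lift_standard_error(10, 0.05, n) for n in (10, 100, 1000)}
# errors is approximately {10: 3.16, 100: 1.0, 1000: 0.32}
```

Under these assumptions, a lift of 10 measured on 10 customers is uncertain by roughly 3 in either direction, while the same lift measured on 100 customers is uncertain by roughly 1.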

Since we do not calculate the cumulative lift for the top 99 ranked customers, what LiftScore do we assign them? There are a few reasonable answers, but in my view, the best thing to do is give all of them the same LiftScore as the 100th customer. This is easy to accomplish with the UnivariateSpline tool used within the plot_cumlift function. The specific line of code that generates the standardizing function is:

spf = UnivariateSpline(xhull, yhull, k = 1, s = 0, ext = 'const')

The lists “xhull” and “yhull” correspond to the points on the cumulative lift curve that will actually be used (i.e. after throwing away any transitory jumps). The parameter setting k = 1 creates a linear spline, which is sometimes known as a “polyline.” The setting s = 0 forces the spline to pass through all of the chosen points. The setting ext = 'const' returns the value at the nearest boundary point for any input outside the range of the given points, so that the customers on the extreme left side of the curve (the very highest ranked customers) all receive the same LiftScore.
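The effect of these settings can be checked on a small example. The hull points below are made up for illustration; only the UnivariateSpline call itself matches the line quoted above.

```python
from scipy.interpolate import UnivariateSpline

# Hypothetical hull points: fraction of top-ranked customers vs. cumulative lift.
xhull = [0.05, 0.10, 0.20, 0.50, 1.00]
yhull = [8.0, 6.0, 4.0, 2.0, 1.0]

spf = UnivariateSpline(xhull, yhull, k=1, s=0, ext="const")

top = float(spf(0.01))       # left of the first point: held at the boundary value, 8.0
middle = float(spf(0.15))    # halfway between 0.10 and 0.20: linear interpolation, 5.0
```

Here every customer ranked in the top 5% receives the LiftScore of the first observed point, and everyone else receives a value interpolated along the straight segments between the hull points.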

As part of the system’s monthly “postmortem,” the relationship between LiftScores assigned and actual lift accomplished is recorded graphically, in a way similar to that shown before. Since the real-life system makes predictions in many product categories across multiple countries, the graphs are grouped by country. Below and to the left is a plot representing all product categories for a certain country. On the right is an alternative view of the same points, but with each category represented as a single curve.

In the real-life system, the relationship between assigned LiftScore and the actual lift achieved is not perfect, but it’s also not generally too far off. In the very worst category represented above, if we requested customers with a predicted lift of 21, we would end up only achieving a lift of about 18. As with most things, there is room for improvement, but this will suffice for now.

