Tutorials

Lift Charts

After training a classifier to predict, say, which customers will respond to a direct-mail campaign, you typically can't contact everyone; you have a budget. The question becomes: if you contact the top 20% of customers ranked by predicted probability, how many of the actual responders will you reach? A lift chart answers this question. Lift is the ratio of positives captured by the model to positives captured by random selection at the same coverage. A lift of 2.5 at 20% means your model finds 2.5× as many positives as random selection does when you examine the top quintile. Cumulative gains charts (the unnormalised counterpart of lift charts) show what fraction of all positives you've captured as you work down the ranked list.
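The definition reduces to simple arithmetic. A back-of-the-envelope check, with illustrative numbers rather than the dataset used below:

```python
# Illustrative numbers, not from the tutorial's dataset:
# 1,000 customers, 100 responders (10% prevalence).
total_responders = 100

# Contact the top 20% ranked by model score; suppose 50 of them respond.
coverage = 0.20
responders_found = 50

gain = responders_found / total_responders  # fraction of positives captured
baseline = coverage                         # random selection captures 20% of positives at 20% coverage
lift = gain / baseline
print(f"gain={gain:.0%}, lift={lift:.1f}x")  # gain=50%, lift=2.5x
```

Random selection's expected gain always equals its coverage, which is why the baseline in the denominator is just the fraction of cases contacted.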

### Ranking predictions by score

The key operation is sorting test-set predictions by their positive-class probability, then computing cumulative statistics as you move down the list.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Sort by predicted score, highest first
order = np.argsort(scores)[::-1]
y_sorted = y_test[order]

n = len(y_sorted)
total_pos = y_sorted.sum()
print(f"Test samples: {n}, Total positives: {total_pos}, Prevalence: {total_pos/n:.2%}")
Test samples: 400, Total positives: 139, Prevalence: 34.75%
- `np.argsort(scores)[::-1]` gives the indices that would sort the array in descending order — index 0 is the sample the model is most confident is positive.
- `y_sorted` holds the actual labels in that same order. Taking its cumulative sum tells you how many true positives you've accumulated as you work down the list.
- The prevalence (34.75% in this case) is the random-selection baseline: pick cases at random and you'd expect roughly 35% of them to be positive at every coverage level.
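One subtlety worth knowing about, though the tutorial's data doesn't need it: `np.argsort`'s default sort isn't stable, so tied scores can land in an arbitrary order, and even a stable sort followed by `[::-1]` puts tied items in reversed input order. If deterministic, first-come-first-ranked tie-breaking matters (reproducible reports, for instance), one option is to stably sort the negated scores:

```python
import numpy as np

scores = np.array([0.9, 0.5, 0.5, 0.1])

# Stable ascending sort, then reversed: tied items (indices 1 and 2)
# come out in *reversed* input order.
order_reversed = np.argsort(scores, kind="stable")[::-1]

# Stable sort of the negated scores: tied items keep their input order.
order_stable = np.argsort(-scores, kind="stable")

print(order_reversed)  # [0 2 1 3]
print(order_stable)    # [0 1 2 3]
```

Ties barely move a lift curve built from continuous probabilities, but they matter when scores are coarse (e.g. decision-tree leaf probabilities shared by many samples).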

### Computing cumulative gains and lift

Cumulative gain at each position is the fraction of all positives found so far. Lift is gain divided by what random selection would achieve.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                            n_redundant=2, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

order = np.argsort(scores)[::-1]
y_sorted = y_test[order]
n = len(y_sorted)
total_pos = y_sorted.sum()

pct_cases = np.arange(1, n + 1) / n        # fraction of cases examined
cum_gain  = np.cumsum(y_sorted) / total_pos  # fraction of positives found
lift      = cum_gain / pct_cases             # lift over random

# Report at deciles
for pct in [0.10, 0.20, 0.30, 0.50]:
    idx = int(pct * n) - 1
    print(f"Top {pct:.0%}: gain={cum_gain[idx]:.2%}, lift={lift[idx]:.2f}x")
Top 10%: gain=23.74%, lift=2.37x
Top 20%: gain=47.48%, lift=2.37x
Top 30%: gain=66.91%, lift=2.23x
Top 50%: gain=92.81%, lift=1.86x
- `cum_gain[i]` is the fraction of all positives captured in the first `i+1` cases — this is the y-axis of the cumulative gains chart.
- `lift[i] = cum_gain[i] / pct_cases[i]` normalises for coverage: a gain of 60% at 20% coverage gives a lift of 3.0.
- The printed decile table is the most common way to present lift results in a business context — stakeholders pick a budget constraint and read off the expected efficiency.
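The table above is cumulative; a complementary view is per-bucket (interval) lift, which shows how quickly the model's edge decays as you move down the ranking. A small helper along these lines (the function name and bucketing scheme are my own, not something from scikit-learn):

```python
import numpy as np

def bucket_lift(y_sorted, n_buckets=10):
    """Per-bucket lift: each bucket's positive rate divided by overall prevalence.

    y_sorted: true labels sorted by descending model score.
    Returns one lift value per bucket, top bucket first.
    """
    y_sorted = np.asarray(y_sorted)
    prevalence = y_sorted.mean()
    buckets = np.array_split(y_sorted, n_buckets)
    return np.array([b.mean() / prevalence for b in buckets])

# Toy example: 10 labels already sorted by score, 4 positives overall.
y_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
print(bucket_lift(y_demo, n_buckets=5))  # top bucket 2.5x, bottom buckets 0x
```

Per-bucket lift dips below 1.0 in the lower buckets of any useful model: the positives those buckets would have contained under random selection have already been pulled to the top.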

### Plotting the gains and lift curves

Two side-by-side panels show cumulative gains (the fraction of positives captured) and lift (efficiency relative to random selection).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                            n_redundant=2, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

order = np.argsort(scores)[::-1]
y_sorted = y_test[order]
n = len(y_sorted)
total_pos = y_sorted.sum()
pct_cases = np.arange(1, n + 1) / n
cum_gain  = np.cumsum(y_sorted) / total_pos
lift      = cum_gain / pct_cases

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Cumulative gains chart
axes[0].plot(pct_cases, cum_gain, color="steelblue", linewidth=2, label="Model")
axes[0].plot([0, 1], [0, 1], "k--", linewidth=1, label="Random baseline")
axes[0].plot([0, total_pos/n, 1], [0, 1, 1], "g--", linewidth=1, label="Perfect model")
axes[0].set_xlabel("Fraction of cases examined")
axes[0].set_ylabel("Fraction of positives captured")
axes[0].set_title("Cumulative Gains Chart")
axes[0].legend()

# Lift chart
axes[1].plot(pct_cases, lift, color="tomato", linewidth=2, label="Model lift")
axes[1].axhline(1.0, color="k", linestyle="--", linewidth=1, label="Random (lift = 1)")
axes[1].set_xlabel("Fraction of cases examined")
axes[1].set_ylabel("Lift")
axes[1].set_title("Lift Chart")
axes[1].legend()
axes[1].set_ylim(0, lift[:int(0.05*n)].max() * 1.1)  # scale y-axis to the early-coverage lift peak

plt.tight_layout()
plt.show()
- The diagonal dashed line in the gains chart is random selection: examining x% of cases finds x% of positives. Any model above this line adds value.
- The "perfect model" line shows the theoretical maximum: all positives are concentrated at the very top, then gain hits 1.0 and stays flat.
- Lift starts very high at small coverage (the model's top-ranked cases are heavily positive) and converges to 1.0 as you examine the full dataset.
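One more framing that helps read the lift panel: at small coverage, lift is capped at 1/prevalence, because even a perfect model can't make a selection more than 100% positive. A quick sketch of that ceiling using the counts printed earlier:

```python
# Maximum achievable lift: a selected batch can be at most 100% positive,
# so lift <= (100% positive rate) / prevalence, regardless of the model.
prevalence = 139 / 400          # positives / test samples from the split above
max_lift = 1 / prevalence
print(f"Lift ceiling: {max_lift:.2f}x")  # ~2.88x
```

The observed 2.37x at the top decile is therefore a large share of what is theoretically possible on this test set; on rarer-positive problems (say 1% prevalence) ceilings of 100x are attainable and early lift values look far more dramatic.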

### Conclusion

Lift charts translate model performance into business terms: how many fewer cases do you need to examine to find the same number of positives? A lift of 3.0 at 20% coverage means your marketing team can reach the same number of likely-buyers while contacting only a third as many people. Always pair lift charts with the random and perfect-model baselines to frame how much improvement is theoretically possible.
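The "a third as many people" claim can be sanity-checked with a few lines (the helper name is mine, purely for illustration):

```python
def coverage_needed_random(gain):
    """Fraction of the population random selection must contact
    to capture the given fraction of positives."""
    return gain  # random selection's expected gain equals its coverage

model_coverage = 0.20
model_lift = 3.0
model_gain = model_lift * model_coverage              # 60% of positives captured

random_coverage = coverage_needed_random(model_gain)  # random needs 60% coverage
ratio = model_coverage / random_coverage
print(f"Model contacts {model_coverage:.0%}; random needs {random_coverage:.0%} "
      f"for the same positives -> {ratio:.2f}x the contacts")
```

At lift 3.0 the model captures 60% of positives from 20% of the population, while random selection needs 60% coverage for the same haul: one third the contact volume, exactly as stated.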

For threshold-independent ranking performance as a single number, see [ROC curves and AUC](/tutorials/roc-curves-and-auc). For the per-threshold breakdown of prediction errors, see [confusion matrix with scikit-learn](/tutorials/confusion-matrix-sklearn).