
# ROC Curves and AUC

A binary classifier outputs a score for each sample. Where you set the threshold that converts that score into a "positive" or "negative" prediction determines the trade-off between catching real positives (sensitivity) and avoiding false alarms (specificity). The ROC (Receiver Operating Characteristic) curve traces this trade-off across every possible threshold: the x-axis shows the false positive rate and the y-axis shows the true positive rate. The Area Under the Curve (AUC) collapses the entire curve into a single number — the probability that the model ranks a random positive higher than a random negative. A perfect model has AUC = 1.0; a random guess produces the diagonal line with AUC = 0.5.
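The ranking interpretation can be checked by hand. With a few illustrative scores (toy values, not from a model), the fraction of (positive, negative) pairs in which the positive sample scores higher equals exactly what `roc_auc_score` reports:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative values, not from a model)
y = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.30])

pos, neg = scores[y == 1], scores[y == 0]
# Count (positive, negative) pairs ranked correctly; ties count as half
correct = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise_auc = correct / (len(pos) * len(neg))

print(pairwise_auc)              # 7 of 9 pairs ordered correctly ≈ 0.778
print(roc_auc_score(y, scores))  # matches the pairwise computation
```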

### Simulating a binary classification problem

A synthetic dataset with 10 features lets you control exactly how separable the classes are, making it easy to connect the data properties to the resulting ROC shape.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Train size: {X_train.shape[0]}  Test size: {X_test.shape[0]}")
print(f"Positive rate in test set: {y_test.mean():.2f}")
```

```
Train size: 420  Test size: 180
Positive rate in test set: 0.53
```
- `n_informative=5` means only five of the ten features actually drive the class label — the rest add noise, making the problem realistically imperfect.
- Splitting before fitting is essential: the ROC curve must be computed on held-out data to give an honest picture of generalization.
- The positive rate printed here confirms the test set is roughly balanced. Note that the ROC curve itself is insensitive to class balance — the no-skill diagonal corresponds to AUC = 0.5 regardless of the positive rate.
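The 0.5 baseline is easy to verify empirically. A "classifier" that assigns scores at random has no ranking ability, so its AUC converges to 0.5 (a quick sketch with simulated labels and scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Random labels and random scores: no relationship between the two,
# so the AUC should converge to the no-skill value of 0.5
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=100_000)
random_scores = rng.random(100_000)

auc_random = roc_auc_score(y, random_scores)
print(f"AUC of random scores: {auc_random:.3f}")
```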

### Computing the ROC curve

`roc_curve` returns the false positive rates, true positive rates, and thresholds at each operating point. `roc_auc_score` summarises them in a single number.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print(f"AUC: {auc:.4f}")
print(f"Number of threshold points: {len(thresholds)}")
print(f"Threshold range: {thresholds.min():.3f} – {thresholds.max():.3f}")
```

```
AUC: 0.9344
Number of threshold points: 36
Threshold range: 0.005 – inf
```
- `predict_proba(X_test)[:, 1]` extracts the probability of class 1 for each test sample — these are the raw scores that `roc_curve` will threshold.
- `roc_curve` returns one (fpr, tpr) pair per unique threshold value, sorted by increasing fpr. The first point is always (0, 0) at threshold ∞.
- `roc_auc_score` can also accept raw probabilities directly, bypassing the intermediate `roc_curve` call when you only need the scalar.
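The `thresholds` array is also the natural starting point for choosing an operating threshold. One common heuristic (a sketch, not the only option) is Youden's J statistic, which picks the point maximizing `tpr - fpr`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
# Youden's J statistic: the operating point maximizing tpr - fpr
best = np.argmax(tpr - fpr)
print(f"Best threshold: {thresholds[best]:.3f}  "
      f"(tpr={tpr[best]:.3f}, fpr={fpr[best]:.3f})")
```

This weighs false positives and false negatives equally; if your costs are asymmetric, maximize a weighted difference instead.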

### Plotting the ROC curve

The standard ROC plot includes the model curve, the diagonal no-skill baseline, and the AUC in the legend.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, color="steelblue", linewidth=2, label=f"Logistic Regression (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], "k--", linewidth=1, label="No-skill baseline (AUC = 0.5)")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve")
ax.legend(loc="lower right")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()
```
- The diagonal dashed line represents a classifier that assigns scores at random — any model below it is worse than random.
- Points in the upper-left corner represent high true positive rate with low false positive rate — the ideal operating region.
- The curve's shape shows how aggressively you can recall positives before false alarms become unacceptable.

### Comparing classifiers

Plotting multiple ROC curves together lets you compare models fairly — they all operate on the same test set, so AUC differences are meaningful.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classifiers = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Decision Tree":       DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest":       RandomForestClassifier(n_estimators=100, random_state=42),
}
colors = ["steelblue", "tomato", "green"]

fig, ax = plt.subplots(figsize=(7, 6))
for (name, clf), color in zip(classifiers.items(), colors):
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    auc = roc_auc_score(y_test, scores)
    ax.plot(fpr, tpr, color=color, linewidth=2, label=f"{name} (AUC = {auc:.3f})")

ax.plot([0, 1], [0, 1], "k--", linewidth=1, label="No-skill baseline")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curves — model comparison")
ax.legend(loc="lower right", fontsize=9)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()
```
- Each classifier is fitted on the same training split and evaluated on the same test split, making the AUC values directly comparable.
- Random Forest typically achieves the highest AUC because it averages many decorrelated trees, smoothing out the jagged decision boundary of any single tree.
- Decision Tree with `max_depth=4` avoids overfitting while still capturing some non-linearity — without the depth limit it would likely overfit and score worse on the test set.
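A single train/test split can flatter or punish a model by chance. For a more robust comparison, cross-validated AUC (here via `cross_val_score` with `scoring="roc_auc"`, a sketch on the same synthetic data) reports a mean and a spread instead of one number:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)

cv_results = {}
for name, clf in [
    ("Logistic Regression", LogisticRegression(random_state=42)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
]:
    # scoring="roc_auc" computes the AUC on each held-out fold
    aucs = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    cv_results[name] = aucs
    print(f"{name}: AUC = {aucs.mean():.3f} ± {aucs.std():.3f}")
```

If two models' AUC intervals overlap heavily, the ranking from a single split should not be trusted.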

### Conclusion

The ROC curve and AUC are the standard tools for evaluating binary classifiers, and they are especially valuable when the class balance or the misclassification costs are unequal, which describes most real-world situations. AUC is threshold-independent, so it measures intrinsic ranking ability rather than performance at any one cut-off. When you need to choose an operating point, pick the threshold on the curve that matches your specific cost trade-off.

For the grid of actual prediction errors at a fixed threshold, see [confusion matrix with scikit-learn](/tutorials/confusion-matrix-sklearn). For ranking-based evaluation of how many positives you capture with minimal cases examined, see [lift charts](/tutorials/lift-charts).