# Confusion Matrix with scikit-learn

After training a classifier, accuracy alone can be misleading — a model that labels everything as "negative" in a 95% negative dataset achieves 95% accuracy while being completely useless. A confusion matrix shows exactly where the errors go: which positives were missed (false negatives), which negatives were falsely flagged (false positives), and which were correctly handled. From those four cells you can derive every standard metric — precision, recall, F1, and specificity — giving you a complete picture of where the classifier succeeds and fails.
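To see the accuracy trap concretely, here is a minimal sketch (using an illustrative ~5%-positive synthetic label vector, not the tutorial's dataset) of an always-negative "classifier":

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Roughly 5% positives, mirroring the imbalanced scenario described above
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)

# A "model" that predicts negative for every sample
y_all_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative, zero_division=0)
print(f"Accuracy: {acc:.3f}")  # high, despite the model being useless
print(f"Recall:   {rec:.3f}")  # zero — it never finds a single positive
```

High accuracy, zero recall: exactly the failure mode the confusion matrix makes visible.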

### Generating predictions

A logistic regression model trained on a synthetic dataset provides the predictions and ground-truth labels needed to build the matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=2, weights=[0.6, 0.4], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

correct = np.sum(y_test == y_pred)
print(f"Accuracy: {correct / len(y_test):.3f}")
print(f"Test positives: {y_test.sum()}  Test negatives: {(y_test == 0).sum()}")
```

```
Accuracy: 0.827
Test positives: 65  Test negatives: 85
```
- `weights=[0.6, 0.4]` creates a mildly imbalanced dataset (60% negative) — common in real classification problems and where confusion matrices are most informative.
- `model.predict` uses the default 0.5 probability threshold to produce hard labels. The confusion matrix depends on this threshold; changing it shifts counts between cells.
- Accuracy is printed here for reference — the confusion matrix will show why even a decent accuracy can hide problems.
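To illustrate the threshold point, the same model setup can be rerun with `predict_proba` and a few hand-picked thresholds (a sketch; exact counts may vary slightly across scikit-learn versions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42).fit(X_train, y_train)

# P(class=1) for each test sample; hard labels come from thresholding this
proba = model.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[threshold] = (tn, fp, fn, tp)
    print(f"threshold={threshold}: TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```

Raising the threshold moves counts out of the positive-prediction column (fewer true positives, but also fewer false positives), which is the trade-off ROC analysis later explores across all thresholds.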

### Building and displaying the confusion matrix

`confusion_matrix` counts the four cells; `ConfusionMatrixDisplay` wraps it into a labeled heatmap with one method call.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative", "Positive"])
disp.plot(colorbar=True)
plt.title("Confusion matrix — logistic regression")
plt.tight_layout()
plt.show()
```

```
Confusion matrix:
[[76  9]
 [17 48]]
```
- `confusion_matrix(y_test, y_pred)` returns a 2×2 array where `cm[i, j]` counts samples with true label `i` predicted as `j`. The diagonal is correct predictions.
- Row 0 = true negatives (left) and false positives (right); row 1 = false negatives (left) and true positives (right).
- `ConfusionMatrixDisplay` automatically colors cells by count and prints the raw numbers — blue intensity makes the dominant cells visually obvious.
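A tiny hand-checkable example confirms the layout convention:

```python
from sklearn.metrics import confusion_matrix

# True labels: three negatives, three positives.
# Predictions: one negative wrongly flagged, one positive missed.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]    row 0: TN=2, FP=1
#  [1 2]]   row 1: FN=1, TP=2
```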

### Precision, recall, and F1 from the matrix

Precision, recall, and F1 are simple ratios of the four cells; working through the formulas by hand makes the metrics intuitive to interpret.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
print(f"Precision:   {precision:.3f}  (of predicted positives, how many are real?)")
print(f"Recall:      {recall:.3f}  (of real positives, how many were found?)")
print(f"F1:          {f1:.3f}  (harmonic mean of precision and recall)")
print(f"Specificity: {specificity:.3f}  (of real negatives, how many were correctly ignored?)")
print()
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
```

```
TP=48  FP=9  FN=17  TN=76
Precision:   0.842  (of predicted positives, how many are real?)
Recall:      0.738  (of real positives, how many were found?)
F1:          0.787  (harmonic mean of precision and recall)
Specificity: 0.894  (of real negatives, how many were correctly ignored?)

              precision    recall  f1-score   support

    Negative       0.82      0.89      0.85        85
    Positive       0.84      0.74      0.79        65

    accuracy                           0.83       150
   macro avg       0.83      0.82      0.82       150
weighted avg       0.83      0.83      0.82       150
```

- **Precision** answers "when the model says positive, how often is it right?" — critical when false alarms are costly (e.g., fraud alerts).
- **Recall** answers "how many of the real positives did the model catch?" — critical when missing a positive is costly (e.g., disease detection).
- `classification_report` computes these per class and macro/weighted averages in a single call, matching the manual formulas above.
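As a sanity check, scikit-learn's `precision_score`, `recall_score`, and `f1_score` should reproduce the manual arithmetic exactly (a short sketch on the same setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Manual values from the four cells
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
p_manual = tp / (tp + fp)
r_manual = tp / (tp + fn)
f_manual = 2 * p_manual * r_manual / (p_manual + r_manual)

# Library values
p_sklearn = precision_score(y_test, y_pred)
r_sklearn = recall_score(y_test, y_pred)
f_sklearn = f1_score(y_test, y_pred)

print(f"precision: manual={p_manual:.6f}  sklearn={p_sklearn:.6f}")
print(f"recall:    manual={r_manual:.6f}  sklearn={r_sklearn:.6f}")
print(f"f1:        manual={f_manual:.6f}  sklearn={f_sklearn:.6f}")
```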

### Normalised confusion matrix

Normalising by the true count of each class shows recall per class regardless of class size — essential when classes are imbalanced.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, normalize, title in zip(axes,
                                [None, "true"],
                                ["Raw counts", "Normalised by true label"]):
    cm = confusion_matrix(y_test, y_pred, normalize=normalize)
    disp = ConfusionMatrixDisplay(cm, display_labels=["Negative", "Positive"])
    disp.plot(ax=ax, colorbar=False)
    ax.set_title(title)

plt.tight_layout()
plt.show()
```
- `normalize="true"` divides each row by its sum, so each cell shows the fraction of that true class that landed in each predicted class — the diagonal cells become per-class recall.
- In an imbalanced dataset the raw counts can make a rare class's errors look small just because the class is rare — normalisation reveals whether the model is actually learning the minority class.
- `normalize="pred"` (also available) would normalise by predicted column, turning diagonal cells into per-class precision.
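A short sketch of `normalize="pred"` on the same setup, confirming that the diagonal of the column-normalised matrix matches per-class precision:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Each column is divided by its sum, so every column adds up to 1
cm_pred = confusion_matrix(y_test, y_pred, normalize="pred")
print(np.round(cm_pred, 3))

# cm_pred[1, 1] should equal the positive-class precision
pos_precision = precision_score(y_test, y_pred)
print(f"Positive-class precision: {pos_precision:.3f}")
```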

### Conclusion

The confusion matrix is the foundation of classifier evaluation at a fixed threshold — every common metric (precision, recall, F1, specificity) is just arithmetic on its four cells. Always inspect both the raw matrix and the normalised version: raw counts expose absolute error magnitudes, normalisation exposes per-class failure rates.

For threshold-independent evaluation across all operating points, see [ROC curves and AUC](/tutorials/roc-curves-and-auc). For evaluating how efficiently the model ranks positives to the top of a scored list, see [lift charts](/tutorials/lift-charts).