When two doctors independently classify the same X-rays as "normal" or "abnormal", raw percent agreement looks encouraging: 85% sounds good. But if 80% of X-rays are normal and both doctors simply label at that base rate, they would agree 0.8 × 0.8 + 0.2 × 0.2 = 68% of the time by pure chance. Cohen's kappa corrects for this: it subtracts expected chance agreement from observed agreement, then scales by the maximum possible improvement over chance. The result ranges from −1 (perfect disagreement) to +1 (perfect agreement), with 0 meaning no better than chance. Kappa is the standard metric wherever human raters, model predictions, or repeated measurements need to be compared for reliability.

### Simulating two-rater binary labels

Two radiologists each classify 100 chest X-rays as normal (0) or abnormal (1). Both make occasional errors; Rater B is the noisier of the two.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
true_labels = rng.choice([0, 1], size=n, p=[0.7, 0.3])

# Rater A: 10% chance of flipping the true label
rater_a = np.where(rng.random(n) < 0.10, 1 - true_labels, true_labels)
# Rater B: 15% chance of flipping the true label (noisier)
rater_b = np.where(rng.random(n) < 0.15, 1 - true_labels, true_labels)

agree = np.mean(rater_a == rater_b)
print(f"Rater A: {rater_a.mean():.2f} positive rate")
print(f"Rater B: {rater_b.mean():.2f} positive rate")
print(f"Percent agreement: {agree:.2%}")
```

- `rng.choice([0, 1], p=[0.7, 0.3])` generates true labels with 30% prevalence of abnormal, realistic for a screening scenario.
- Each rater independently flips the true label with a fixed probability, simulating realistic but imperfect human judgement.
- The percent agreement printed here will look reasonable; kappa in the next section shows how much of that agreement is just chance.

### Computing Cohen's kappa from scratch

Kappa needs the observed agreement and the expected agreement under the assumption that raters chose labels independently based on their own marginal rates.
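Before the full implementation, the arithmetic is worth seeing once by hand, using the introduction's hypothetical 80/20 marginals (these numbers are the intro's illustration, not the simulated data above):

```python
# Hypothetical scenario: both raters label "normal" 80% of the time,
# "abnormal" 20% of the time, independently of each other.
p_normal, p_abnormal = 0.8, 0.2

# Chance agreement: both say normal, or both say abnormal.
pe = p_normal * p_normal + p_abnormal * p_abnormal  # 0.68

# Suppose the observed raw agreement is 85%.
po = 0.85
kappa = (po - pe) / (1 - pe)
print(f"pe = {pe:.2f}, kappa = {kappa:.3f}")
```

An 85% agreement that sounds impressive shrinks to kappa ≈ 0.53 against this baseline: only moderate once chance is accounted for.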
```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
true_labels = rng.choice([0, 1], size=n, p=[0.7, 0.3])
rater_a = np.where(rng.random(n) < 0.10, 1 - true_labels, true_labels)
rater_b = np.where(rng.random(n) < 0.15, 1 - true_labels, true_labels)

def cohen_kappa(y1, y2):
    classes = np.unique(np.concatenate([y1, y2]))
    po = np.mean(y1 == y2)  # observed agreement
    pe = sum(
        np.mean(y1 == c) * np.mean(y2 == c)
        for c in classes
    )  # expected chance agreement
    return (po - pe) / (1 - pe)

kappa = cohen_kappa(rater_a, rater_b)
po = np.mean(rater_a == rater_b)
print(f"Observed agreement (po): {po:.3f}")
print(f"Cohen's kappa: {kappa:.3f}")

# Qualitative benchmarks (Landis & Koch scale), strongest first
benchmarks = [(0.80, "Almost perfect"), (0.60, "Substantial"),
              (0.40, "Moderate"), (0.20, "Fair"), (0.0, "Slight"), (-1, "Poor")]
label = next(name for threshold, name in benchmarks if kappa >= threshold)
print(f"Interpretation: {label}")
```

- `po = np.mean(y1 == y2)` is the fraction of cases where both raters agreed: the raw percent agreement expressed as a proportion.
- `pe` is the expected agreement: if both raters assigned labels independently at their own marginal rates, this is the probability they would agree by coincidence. For each class, multiply the two raters' rates for that class and sum.
- `(po - pe) / (1 - pe)` is the kappa formula: the numerator is the improvement over chance, the denominator is the maximum possible improvement. When both raters always agree, kappa = 1.

### Visualising the confusion matrix

A confusion matrix between the two raters shows not just the total agreement but which types of disagreement dominate: false positives versus false negatives.
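As a cross-check of the from-scratch `cohen_kappa` defined above, scikit-learn ships the same metric as `sklearn.metrics.cohen_kappa_score` (assuming scikit-learn is installed); the two agree to floating-point precision on this data:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Regenerate the same simulated raters as before
rng = np.random.default_rng(42)
n = 100
true_labels = rng.choice([0, 1], size=n, p=[0.7, 0.3])
rater_a = np.where(rng.random(n) < 0.10, 1 - true_labels, true_labels)
rater_b = np.where(rng.random(n) < 0.15, 1 - true_labels, true_labels)

# Library implementation of the same formula
print(f"sklearn kappa: {cohen_kappa_score(rater_a, rater_b):.3f}")
```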
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 100
true_labels = rng.choice([0, 1], size=n, p=[0.7, 0.3])
rater_a = np.where(rng.random(n) < 0.10, 1 - true_labels, true_labels)
rater_b = np.where(rng.random(n) < 0.15, 1 - true_labels, true_labels)

# cm[i, j] counts cases where Rater A said i and Rater B said j
labels = [0, 1]
cm = np.array([[np.sum((rater_a == i) & (rater_b == j)) for j in labels]
               for i in labels])

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(cm, cmap="Blues")
plt.colorbar(im, ax=ax)
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(["Rater B: Normal", "Rater B: Abnormal"])
ax.set_yticklabels(["Rater A: Normal", "Rater A: Abnormal"])
for i in range(2):
    for j in range(2):
        color = "white" if cm[i, j] > cm.max() / 2 else "black"
        ax.text(j, i, str(cm[i, j]), ha="center", va="center",
                fontsize=14, color=color)
ax.set_title("Inter-rater confusion matrix")
plt.tight_layout()
plt.show()
```

- `cm[i, j]` counts how many cases Rater A labelled as class `i` and Rater B labelled as class `j`. The diagonal cells are agreements; off-diagonal cells are disagreements.
- The upper-right cell (A = Normal, B = Abnormal) and the lower-left cell (A = Abnormal, B = Normal) represent different failure modes: Rater B over-detecting versus Rater A over-detecting.
- A large kappa corresponds to a confusion matrix where the diagonal dominates and the off-diagonal counts are small.

### Kappa for multi-class labels

Cohen's kappa generalises directly to any number of classes. Here two pathologists classify 120 biopsy slides into three categories: benign, uncertain, and malignant.
```python
import numpy as np

rng = np.random.default_rng(42)
n = 120
true_class = rng.choice([0, 1, 2], size=n, p=[0.5, 0.3, 0.2])

def noisy_rater(true, flip_prob, rng):
    """Flip each label to a random *other* class with probability flip_prob."""
    out = true.copy()
    flip_mask = rng.random(len(true)) < flip_prob
    for i in np.where(flip_mask)[0]:
        choices = [c for c in [0, 1, 2] if c != true[i]]
        out[i] = rng.choice(choices)
    return out

rater_a = noisy_rater(true_class, 0.10, rng)
rater_b = noisy_rater(true_class, 0.20, rng)

def cohen_kappa(y1, y2):
    classes = np.unique(np.concatenate([y1, y2]))
    po = np.mean(y1 == y2)
    pe = sum(np.mean(y1 == c) * np.mean(y2 == c) for c in classes)
    return (po - pe) / (1 - pe)

kappa_ab = cohen_kappa(rater_a, rater_b)
kappa_at = cohen_kappa(rater_a, true_class)
kappa_bt = cohen_kappa(rater_b, true_class)
print(f"Kappa (A vs B): {kappa_ab:.3f}")
print(f"Kappa (A vs true): {kappa_at:.3f}")
print(f"Kappa (B vs true): {kappa_bt:.3f}")
```

- `noisy_rater` flips each label to a random other class with probability `flip_prob`: a simple model of diagnostic error that avoids systematic bias.
- The same `cohen_kappa` function works unchanged: `np.unique` finds however many classes exist, and the `pe` sum runs over all of them.
- Comparing A vs true and B vs true quantifies each rater's individual accuracy; comparing A vs B quantifies inter-rater reliability. These answer different questions.

### Kappa vs percent agreement: when they diverge

The gap between raw percent agreement and kappa grows large when class prevalence is skewed. This section shows the divergence across a range of noise levels.
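The chance baseline behind that divergence can be computed directly: with both raters following a 90/10 marginal split, they agree 0.9 × 0.9 + 0.1 × 0.1 = 82% of the time with no real skill at all. A quick sketch of that arithmetic:

```python
# Chance agreement when both raters follow a 90/10 marginal split
p_neg, p_pos = 0.9, 0.1
pe_skewed = p_neg**2 + p_pos**2      # both say negative, or both say positive

# Compare with a balanced 50/50 split
pe_balanced = 0.5**2 + 0.5**2

print(f"Chance agreement, 90/10 marginals: {pe_skewed:.2f}")
print(f"Chance agreement, 50/50 marginals: {pe_balanced:.2f}")
```

The more skewed the marginals, the more agreement two raters get for free, and the more raw percent agreement overstates reliability.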
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 500
# Heavily skewed: 90% negative class
true_labels = rng.choice([0, 1], size=n, p=[0.9, 0.1])

flip_probs = np.linspace(0, 0.5, 30)
agreements = []
kappas = []
for fp in flip_probs:
    rater_a = np.where(rng.random(n) < fp, 1 - true_labels, true_labels)
    rater_b = np.where(rng.random(n) < fp, 1 - true_labels, true_labels)
    po = np.mean(rater_a == rater_b)
    pe = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in [0, 1])
    # Guard against the degenerate pe == 1 case
    kappa = (po - pe) / (1 - pe) if pe < 1 else 0.0
    agreements.append(po)
    kappas.append(kappa)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(flip_probs, agreements, "steelblue", linewidth=2, label="Percent agreement")
ax.plot(flip_probs, kappas, "tomato", linewidth=2, label="Cohen's kappa")
ax.axhline(0, color="gray", linestyle="--", linewidth=1)
ax.set_xlabel("Flip probability (noise level)")
ax.set_ylabel("Agreement metric")
ax.set_title("Percent agreement vs kappa (skewed class, 90% negative)")
ax.legend()
plt.tight_layout()
plt.show()
```

- With a 90% negative prevalence and no noise (`flip_prob=0`), both raters reproduce the true labels exactly, so both metrics equal 1.
- As noise increases, percent agreement falls slowly because both raters still assign the dominant negative label most of the time and therefore coincide often. Kappa falls much faster because it measures only the improvement over that chance baseline (about 0.82 at low noise).
- At `flip_prob=0.5` each rater's output is independent of the truth, so the labels are effectively coin flips: agreement settles near 0.5 while kappa drops to about 0, correctly signalling that the raters are no better than chance.

### Conclusion

Cohen's kappa is the correct reliability metric whenever class prevalence is unequal or when you need to distinguish genuine agreement from accidental agreement. Always report kappa alongside percent agreement; the gap between them is itself informative about how much of the result is driven by the prevalence of the dominant class. For agreement across continuous measurements rather than categories, see [intraclass correlation coefficient](/tutorials/intraclass-correlation-coefficient). For the underlying idea of correlation without a chance correction, see [correlation analysis with SciPy](/tutorials/correlation-analysis-with-scipy).
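Following that advice, the two numbers can be reported side by side with a small helper. A minimal sketch (the name `agreement_report` is illustrative, not an established API):

```python
import numpy as np

def agreement_report(y1, y2):
    """Return raw percent agreement and Cohen's kappa for two label arrays."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    classes = np.unique(np.concatenate([y1, y2]))
    po = np.mean(y1 == y2)
    pe = sum(np.mean(y1 == c) * np.mean(y2 == c) for c in classes)
    return {"percent_agreement": float(po), "kappa": float((po - pe) / (1 - pe))}

# Tiny worked example: agreement is 5/6, kappa works out to 2/3
print(agreement_report([0, 0, 1, 1, 0, 1], [0, 0, 1, 0, 0, 1]))
```

Reporting both makes the chance-driven share of the agreement visible to the reader at a glance.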