When two clinicians both rate the same patients for pain severity, or when a lab instrument measures the same sample twice, you need more than a simple correlation to judge reliability. The intraclass correlation coefficient (ICC) measures how much of the total variance in scores is due to real differences between subjects rather than differences between raters or occasions. Unlike Pearson correlation, ICC treats all raters symmetrically and accounts for systematic biases between them. There are several ICC forms — the choice depends on whether raters are randomly sampled or fixed, and whether you care about *consistency* (same relative ranking) or *absolute agreement* (same numerical scores).

### Simulating inter-rater data

Three clinicians each rate 20 patients for pain on a 0–10 scale. Rater B consistently scores higher than Rater A, and Rater C scores lower — realistic systematic biases that will affect agreement but not consistency.
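Why is Pearson correlation not enough? Because a constant offset between two raters leaves r completely untouched. A minimal sketch with made-up scores (not the simulated data below):

```python
import numpy as np

scores_a = np.array([2.0, 4.0, 5.0, 7.0, 8.0])
scores_b = scores_a + 2.0  # rater B is uniformly 2 points higher

# Pearson correlation is blind to the constant offset
r = np.corrcoef(scores_a, scores_b)[0, 1]
print(f"Pearson r: {r:.3f}")  # 1.000, yet the raters never agree on a single score
print(f"Mean difference: {(scores_b - scores_a).mean():.1f}")
```

Pearson only measures linear association, so two raters who never assign the same number can still correlate perfectly; ICC's absolute-agreement forms penalise exactly this kind of offset.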
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_subjects = 20
true_pain = rng.uniform(1, 9, n_subjects)  # each patient's true pain level
biases = [0.0, 1.5, -1.0]  # Rater A, B, C systematic offsets

ratings = np.column_stack([
    np.clip(true_pain + b + rng.normal(0, 0.8, n_subjects), 0, 10)
    for b in biases
])

df = pd.DataFrame(ratings, columns=["Rater A", "Rater B", "Rater C"])
print(df.head(8).round(1))
print(f"\nRater means: {df.mean().round(2).to_dict()}")
print(f"Grand mean: {df.values.mean():.2f}")
```

- `true_pain` represents the actual pain level for each patient — the signal that all raters should ideally track equally.
- `biases = [0.0, 1.5, -1.0]` simulates a common real-world problem: Rater B uses the upper end of the scale while Rater C is conservative. Both are tracking the same signal, just shifted.
- The rater means printed at the bottom quantify these offsets numerically. Absolute-agreement ICC will penalise this spread; consistency ICC will not.

### One-way ICC: the simplest form

One-way ICC, written ICC(1,1), assumes raters are randomly drawn from a large pool — each subject may even be scored by a different set of raters — so rater effects cannot be separated from random error. It partitions total variance into *between-subject* and *within-subject* components using a one-way ANOVA on subjects.
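The one-way partition rests on an exact identity: the total sum of squares splits into between-subject and within-subject pieces. A quick check on a made-up 4×3 array (not the pain data above):

```python
import numpy as np

data = np.array([[3.0, 4.0, 3.5],
                 [6.0, 7.0, 6.5],
                 [2.0, 2.5, 3.0],
                 [8.0, 7.5, 8.5]])
n, k = data.shape
grand_mean = data.mean()
subject_means = data.mean(axis=1)

sst = np.sum((data - grand_mean) ** 2)               # total variation
ssb = k * np.sum((subject_means - grand_mean) ** 2)  # between subjects
ssw = np.sum((data - subject_means[:, None]) ** 2)   # within subjects (raters + noise)

print(f"SST = {sst:.3f}, SSB + SSW = {ssb + ssw:.3f}")  # identical
```

The identity SST = SSB + SSW holds for any rectangular data matrix; ICC(1,1) simply asks what fraction of that total is between-subject signal.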
```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects = 20
true_pain = rng.uniform(1, 9, n_subjects)
biases = [0.0, 1.5, -1.0]
ratings = np.column_stack([
    np.clip(true_pain + b + rng.normal(0, 0.8, n_subjects), 0, 10)
    for b in biases
])

def icc_oneway(data):
    n, k = data.shape
    grand_mean = data.mean()
    subject_means = data.mean(axis=1)
    ssb = k * np.sum((subject_means - grand_mean) ** 2)  # between subjects
    ssw = np.sum((data - subject_means[:, None]) ** 2)   # within subjects
    msb = ssb / (n - 1)
    msw = ssw / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

icc1 = icc_oneway(ratings)
print(f"ICC(1,1): {icc1:.3f}")

benchmarks = [(0.90, "Excellent"), (0.75, "Good"), (0.50, "Moderate"), (0.0, "Poor")]
# the default catches negative ICC values, which can occur when raters disagree badly
label = next((name for threshold, name in benchmarks if icc1 >= threshold), "Poor")
print(f"Interpretation: {label}")
```

- `ssb = k * Σ(subject_mean - grand_mean)²` is the between-subjects sum of squares — how much subjects genuinely differ from each other.
- `ssw = Σ(rating - subject_mean)²` is the within-subjects sum of squares — how much raters disagree for the same subject, capturing both random noise and systematic rater biases.
- `(msb - msw) / (msb + (k-1)*msw)` is the ICC formula: when `msb >> msw` (subjects vary much more than raters disagree), ICC approaches 1. Benchmarks: above 0.90 excellent, above 0.75 good, above 0.50 moderate.

### Two-way ICC: consistency vs absolute agreement

When the same raters always evaluate all subjects (a fixed or replicable set), a two-way model separates rater bias from random error. This gives two distinct estimates: *consistency* (do raters rank subjects the same way?) and *absolute agreement* (do they give the same numerical scores?).
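The separation of rater bias from random error rests on a finer ANOVA identity: SST = SSB + SSR + SSE. A minimal check on a made-up array where each rater is an exact constant shift of the first:

```python
import numpy as np

data = np.array([[3.0, 4.5, 2.5],
                 [6.0, 7.5, 5.5],
                 [2.0, 3.5, 1.5],
                 [8.0, 9.5, 7.5]])  # columns 2 and 3 are column 1 shifted by +1.5 and -0.5
n, k = data.shape
grand_mean = data.mean()
row_means = data.mean(axis=1)  # subject means
col_means = data.mean(axis=0)  # rater means

sst = np.sum((data - grand_mean) ** 2)
ssb = k * np.sum((row_means - grand_mean) ** 2)  # between subjects
ssr = n * np.sum((col_means - grand_mean) ** 2)  # systematic rater differences
sse = sst - ssb - ssr                            # residual error

print(f"SSB={ssb:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  (sum = SST = {sst:.2f})")
```

Because the columns here are perfectly additive shifts, SSE comes out as exactly zero: consistency ICC would be 1 for this array, while absolute agreement would still be penalised by the nonzero SSR.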
```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects = 20
true_pain = rng.uniform(1, 9, n_subjects)
biases = [0.0, 1.5, -1.0]
ratings = np.column_stack([
    np.clip(true_pain + b + rng.normal(0, 0.8, n_subjects), 0, 10)
    for b in biases
])

def icc_twoway(data, model="consistency"):
    n, k = data.shape
    grand_mean = data.mean()
    row_means = data.mean(axis=1)  # subject means
    col_means = data.mean(axis=0)  # rater means
    ssb = k * np.sum((row_means - grand_mean) ** 2)
    ssr = n * np.sum((col_means - grand_mean) ** 2)
    sst = np.sum((data - grand_mean) ** 2)
    sse = sst - ssb - ssr
    msb = ssb / (n - 1)
    msr = ssr / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    if model == "consistency":
        return (msb - mse) / (msb + (k - 1) * mse)
    else:  # absolute agreement
        return (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)

icc_cons = icc_twoway(ratings, model="consistency")
icc_agree = icc_twoway(ratings, model="agreement")
print(f"ICC(3,1) consistency: {icc_cons:.3f}")
print(f"ICC(2,1) absolute agreement: {icc_agree:.3f}")
print(f"\nDifference: {icc_cons - icc_agree:.3f} (reflects rater bias)")
```

- `ssr = n * Σ(rater_mean - grand_mean)²` captures systematic differences between raters. In the consistency model this term is partitioned out (ignored); in the agreement model it inflates the denominator.
- With no systematic rater biases (`biases = [0, 0, 0]`), consistency and agreement ICC would be nearly identical — equal in expectation, differing only by sampling noise. The difference here reflects the 1.5 and −1.0 offsets we built into the data.
- Choose **consistency** when raters will always be the same people and what matters is relative ordering (e.g., ranking patients). Choose **agreement** when absolute score values matter or when raters might be substituted.

### Visualising rater agreement

Scatter plots of each rater pair show agreement at a glance: points on the diagonal mean perfect agreement, while vertical spread reveals inconsistency.
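Before plotting, the pairwise offsets the panels will reveal can be quantified directly. A quick sketch re-running the same seeded simulation; the mean differences should roughly recover the built-in biases (+1.5, −1.0, −2.5), distorted slightly by noise and by clipping at the scale ends:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects = 20
true_pain = rng.uniform(1, 9, n_subjects)
biases = [0.0, 1.5, -1.0]
ratings = np.column_stack([
    np.clip(true_pain + b + rng.normal(0, 0.8, n_subjects), 0, 10)
    for b in biases
])

names = ["Rater A", "Rater B", "Rater C"]
for i, j in [(0, 1), (0, 2), (1, 2)]:
    diff = ratings[:, j] - ratings[:, i]
    print(f"{names[j]} - {names[i]}: mean {diff.mean():+.2f} (sd {diff.std(ddof=1):.2f})")
```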
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_subjects = 20
true_pain = rng.uniform(1, 9, n_subjects)
biases = [0.0, 1.5, -1.0]
ratings = np.column_stack([
    np.clip(true_pain + b + rng.normal(0, 0.8, n_subjects), 0, 10)
    for b in biases
])

rater_names = ["Rater A", "Rater B", "Rater C"]
pairs = [(0, 1), (0, 2), (1, 2)]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, pairs):
    ax.scatter(ratings[:, i], ratings[:, j], alpha=0.7, color="steelblue", s=40)
    lo = min(ratings[:, i].min(), ratings[:, j].min()) - 0.5
    hi = max(ratings[:, i].max(), ratings[:, j].max()) + 0.5
    ax.plot([lo, hi], [lo, hi], "r--", linewidth=1.2, label="Perfect agreement")
    ax.set_xlabel(rater_names[i])
    ax.set_ylabel(rater_names[j])
    ax.set_title(f"{rater_names[i]} vs {rater_names[j]}")
    ax.set_xlim(lo, hi)
    ax.set_ylim(lo, hi)

axes[0].legend(loc="upper left", fontsize=8)  # identify the dashed identity line
plt.suptitle("Pairwise rater agreement — pain severity ratings", y=1.02)
plt.tight_layout()
plt.show()
```

- Three side-by-side panels cover all unique rater pairs. Points close to the dashed red line indicate strong agreement; points systematically above or below it reveal a directional bias.
- Rater A vs Rater B and Rater A vs Rater C should show clear parallel offsets from the diagonal — the points run parallel to the identity line but shifted, exactly what systematic rater bias looks like.
- Rater B vs Rater C shows the combined 2.5-unit offset, the largest gap. This visual confirms what the ICC numbers captured numerically.

### Conclusion

ICC choice matters: on the same dataset, ICC(3,1) consistency is at least as large as ICC(2,1) absolute agreement whenever raters differ systematically (rater mean square above error mean square), because consistency simply ignores those differences. Always report which ICC form you used and justify the choice based on your study design. For the clinical pain rating scenario here, if you plan to substitute raters interchangeably, the lower absolute-agreement ICC is the appropriate and honest estimate.

For reliability of a multi-item questionnaire rather than rater agreement, see [Cronbach's alpha](/tutorials/cronbachs-alpha). For a full model of the between-subject and within-subject variance structure, see [linear mixed effects models](/tutorials/linear-mixed-effects-models).
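As a final sanity check on the consistency-vs-agreement distinction, re-running the two-way estimator on bias-free data shows the two forms converging. A sketch that redefines `icc_twoway` so the block runs standalone, with the rater offsets removed:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects = 20
true_pain = rng.uniform(1, 9, n_subjects)
ratings = np.column_stack([
    np.clip(true_pain + rng.normal(0, 0.8, n_subjects), 0, 10)
    for _ in range(3)  # three raters, no systematic offsets
])

def icc_twoway(data, model="consistency"):
    n, k = data.shape
    grand_mean = data.mean()
    row_means = data.mean(axis=1)
    col_means = data.mean(axis=0)
    ssb = k * np.sum((row_means - grand_mean) ** 2)
    ssr = n * np.sum((col_means - grand_mean) ** 2)
    sse = np.sum((data - grand_mean) ** 2) - ssb - ssr
    msb = ssb / (n - 1)
    msr = ssr / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    if model == "consistency":
        return (msb - mse) / (msb + (k - 1) * mse)
    return (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)

# without rater bias, the two estimates should differ only by sampling noise
print(f"consistency: {icc_twoway(ratings):.3f}")
print(f"agreement:   {icc_twoway(ratings, 'agreement'):.3f}")
```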