When a new blood pressure monitor is compared against a hospital-grade device, a high Pearson correlation does not tell you whether patients could safely use either device — two methods can correlate perfectly yet always disagree by 20 mmHg. The Bland-Altman method, introduced in a landmark 1986 paper, plots the *difference* between two measurements against their *mean* for every subject. The middle line shows the average bias; the limits of agreement (mean ± 1.96 SD of differences) show the spread of disagreement. If those limits fall within a clinically acceptable range, the methods can be used interchangeably. This approach has become the standard in medical device validation, physiological measurement studies, and any domain where two instruments measure the same quantity.

### Simulating two measurement methods

Forty patients each have their systolic blood pressure measured by a standard sphygmomanometer (reference) and a new wrist device (test). The test device has a small systematic bias and adds slightly more noise.
```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)            # true systolic BP (mmHg)
method_a = true_bp + rng.normal(0, 3, n)    # reference: unbiased, low noise
method_b = true_bp + rng.normal(3, 5, n)    # test: +3 mmHg bias, more noise

diff = method_b - method_a                  # difference (test minus reference)
means = (method_a + method_b) / 2           # average of the two measurements

print(f"Mean difference (bias): {diff.mean():.2f} mmHg")
print(f"SD of differences: {diff.std(ddof=1):.2f} mmHg")
print(f"Range of means: {means.min():.0f} – {means.max():.0f} mmHg")
```

- `true_bp` is the underlying true value; neither method knows it — they each observe it with noise.
- `diff = method_b - method_a` is the key quantity. Convention is test minus reference; keep the direction consistent so bias interpretation makes sense.
- The mean difference is the *systematic bias* — how much the test device reads higher or lower on average. The SD measures the scatter of individual disagreements around that bias.

### Computing limits of agreement

The limits of agreement (LoA) define the interval within which 95% of differences between the two methods will fall for a new subject, assuming the differences are approximately normally distributed.
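Since the 95% coverage rests on that normality assumption, it is worth checking it before computing the limits. A quick sketch using `scipy.stats.shapiro` on the simulated differences from above (a histogram or Q-Q plot of `diff` is an equally good informal check):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)
method_a = true_bp + rng.normal(0, 3, n)
method_b = true_bp + rng.normal(3, 5, n)
diff = method_b - method_a

# Shapiro-Wilk: the null hypothesis is that the differences are normal
stat, p = shapiro(diff)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
```

A small p-value (say, below 0.05) would flag non-normal differences; with purely Gaussian simulated noise, the test will usually not object.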
```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)
method_a = true_bp + rng.normal(0, 3, n)
method_b = true_bp + rng.normal(3, 5, n)
diff = method_b - method_a
means = (method_a + method_b) / 2

bias = diff.mean()
sd = diff.std(ddof=1)
loa_upper = bias + 1.96 * sd
loa_lower = bias - 1.96 * sd

print(f"Bias (mean diff): {bias:+.2f} mmHg")
print(f"SD of differences: {sd:.2f} mmHg")
print(f"Upper limit of agreement: {loa_upper:+.2f} mmHg")
print(f"Lower limit of agreement: {loa_lower:+.2f} mmHg")
print(f"Width of LoA interval: {loa_upper - loa_lower:.2f} mmHg")
```

- `bias = diff.mean()` estimates the average systematic offset. A bias near zero means the methods agree on average; a large bias means one consistently reads higher.
- `1.96 * sd` comes from the normal distribution: 95% of observations fall within ±1.96 standard deviations of the mean.
- The width of the interval (`loa_upper - loa_lower = 2 × 1.96 × sd ≈ 3.92 × sd`) is the key practical quantity — compare it against the maximum clinically acceptable difference for your application.

### The Bland-Altman plot

The classic plot puts the mean of the two methods on the x-axis and the difference on the y-axis. Horizontal lines mark the bias and limits of agreement.
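As a quick numeric companion to the plot, you can count how many differences actually fall inside the limits — with approximately normal differences, roughly 95% should. A small sketch on the same simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)
method_a = true_bp + rng.normal(0, 3, n)
method_b = true_bp + rng.normal(3, 5, n)
diff = method_b - method_a

bias = diff.mean()
sd = diff.std(ddof=1)

# Boolean mask: True where a difference lies between the two limits
inside = (diff >= bias - 1.96 * sd) & (diff <= bias + 1.96 * sd)
print(f"{inside.sum()} of {n} differences ({inside.mean():.0%}) lie within the LoA")
```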
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)
method_a = true_bp + rng.normal(0, 3, n)
method_b = true_bp + rng.normal(3, 5, n)
diff = method_b - method_a
means = (method_a + method_b) / 2

bias = diff.mean()
sd = diff.std(ddof=1)
loa_upper = bias + 1.96 * sd
loa_lower = bias - 1.96 * sd

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(means, diff, color="steelblue", alpha=0.75, s=40, zorder=3)
ax.axhline(bias, color="black", linewidth=1.8, label=f"Bias: {bias:+.2f} mmHg")
ax.axhline(loa_upper, color="tomato", linewidth=1.4, linestyle="--",
           label=f"Upper LoA: {loa_upper:+.2f} mmHg")
ax.axhline(loa_lower, color="tomato", linewidth=1.4, linestyle="--",
           label=f"Lower LoA: {loa_lower:+.2f} mmHg")
ax.axhline(0, color="gray", linewidth=0.8, linestyle=":")
ax.set_xlabel("Mean of Method A and Method B (mmHg)")
ax.set_ylabel("Difference: Method B − Method A (mmHg)")
ax.set_title("Bland-Altman plot — blood pressure measurement")
ax.legend(loc="upper left", fontsize=9)
plt.tight_layout()
plt.show()
```

- Points above the zero line mean Method B read higher than Method A for that patient; points below mean it read lower.
- The dashed red lines are the limits of agreement — about 95% of points should fall between them if the assumption of normally distributed differences holds.
- A visual check is as important as the numbers: points should be randomly scattered around the bias line with no fan shape (constant variance) and no curved trend (no proportional bias).

### Detecting proportional bias

Proportional bias occurs when the disagreement between methods depends on the magnitude of the measurement — larger values produce larger differences. This shows up as a slope in the Bland-Altman plot.
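Before the full example, a noise-free sketch of where that slope comes from: if the test method scales the true value by a constant factor (here 1.05, matching the simulation below), the difference grows in proportion to the measurement:

```python
import numpy as np

true_vals = np.array([90.0, 120.0, 150.0])   # three true pressures (mmHg)
method_b = 1.05 * true_vals                  # purely multiplicative test method
diff = method_b - true_vals                  # error grows with the magnitude
print(diff)                                  # [4.5 6.  7.5]
```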
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)
method_a = true_bp + rng.normal(0, 3, n)

# Proportional bias: error grows with the true value
method_b_prop = true_bp * 1.05 + rng.normal(0, 4, n)
diff_prop = method_b_prop - method_a
means_prop = (method_a + method_b_prop) / 2

r, p = pearsonr(means_prop, diff_prop)
coeffs = np.polyfit(means_prop, diff_prop, 1)  # [slope, intercept] of the trend

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(means_prop, diff_prop, color="steelblue", alpha=0.75, s=40)
x_line = np.linspace(means_prop.min(), means_prop.max(), 100)
ax.plot(x_line, np.polyval(coeffs, x_line), color="tomato", linewidth=2,
        label=f"Trend line (r={r:.2f}, p={p:.3f})")
ax.axhline(diff_prop.mean(), color="black", linewidth=1.5, linestyle="--",
           label=f"Mean bias: {diff_prop.mean():+.2f} mmHg")
ax.axhline(0, color="gray", linewidth=0.8, linestyle=":")
ax.set_xlabel("Mean of both methods (mmHg)")
ax.set_ylabel("Difference: B − A (mmHg)")
ax.set_title("Bland-Altman plot — proportional bias example")
ax.legend()
plt.tight_layout()
plt.show()

print(f"Correlation of means vs differences: r = {r:.3f}, p = {p:.4f}")
```

- `pearsonr(means_prop, diff_prop)` tests whether differences are correlated with the average — a significant result indicates proportional bias.
- A positive slope means Method B over-reads more at higher blood pressures; a negative slope means it under-reads at high values.
- When proportional bias is present, a single set of limits of agreement is misleading — the limits should vary with the magnitude, or a log transformation of the original data may stabilise the differences.

### Conclusion

Bland-Altman analysis shifts the question from "are these methods correlated?" to "are the differences small enough to matter clinically?" — a much more practically useful question. Always pair the plot with a pre-specified acceptable limit of agreement based on clinical or engineering requirements, and check visually for proportional bias before trusting the global limits. For comparing continuous measurements between two groups rather than two methods, see [independent samples t-test with SciPy](/tutorials/independent-samples-t-test-with-scipy). For a chance-corrected agreement metric for categorical ratings, see [Cohen's kappa](/tutorials/cohens-kappa).
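One refinement worth knowing before wrapping up: the bias and limits computed earlier are themselves estimates from a finite sample. Bland and Altman's 1986 paper gives approximate standard errors — `sd/√n` for the bias and `√(3·sd²/n)` for each limit — from which t-based confidence intervals follow. A sketch on the simulated data from the earlier sections:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)
n = 40
true_bp = rng.normal(120, 15, n)
method_a = true_bp + rng.normal(0, 3, n)
method_b = true_bp + rng.normal(3, 5, n)
diff = method_b - method_a

bias = diff.mean()
sd = diff.std(ddof=1)
loa_upper = bias + 1.96 * sd
loa_lower = bias - 1.96 * sd

t_crit = t.ppf(0.975, df=n - 1)     # two-sided 95% critical value
se_bias = sd / np.sqrt(n)           # standard error of the mean difference
se_loa = np.sqrt(3 * sd**2 / n)     # approximate SE of each limit of agreement

print(f"Bias:      {bias:+.2f} ± {t_crit * se_bias:.2f} mmHg")
print(f"Upper LoA: {loa_upper:+.2f} ± {t_crit * se_loa:.2f} mmHg")
print(f"Lower LoA: {loa_lower:+.2f} ± {t_crit * se_loa:.2f} mmHg")
```

If the confidence interval around a limit already crosses the clinically acceptable threshold, the sample is too small to conclude agreement, whatever the point estimates say.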