In a standard [one-way ANOVA](/tutorials/one-way-anova-with-scipy), different groups of people are measured once each. Repeated measures ANOVA is used when the *same* people are measured across multiple conditions or time points — for example, testing reaction time before, one week into, and four weeks into a training program. Because each person acts as their own control, the within-person variability is removed from the error term, making the test more sensitive to real effects. The tradeoff is the assumption of *sphericity* — that the variance of differences between all pairs of conditions is equal — which can be tested and corrected for.

### Creating within-subject data

Repeated measures data must be in **long format**: one row per observation, with columns for the subject ID, the within-subjects factor (e.g., time point), and the measurement. We'll simulate 20 participants measured at four time points during a typing speed study.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]  # words per minute

# Each participant has a persistent personal offset (individual differences)
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

group_means = df.groupby("time")["wpm"].mean().reindex(timepoints).round(1)
print(group_means)
print(f"\nTotal rows: {len(df)} ({n_subjects} subjects × {len(timepoints)} timepoints)")
```

- `subject_offsets` gives each participant a stable personal baseline — some people are naturally faster typists. This offset is shared across all their measurements, creating the within-person correlation that repeated measures ANOVA exploits.
- `rng.normal(0, 8)` adds small measurement noise on top of the true mean and the personal offset.
- Long format is required by `AnovaRM`; wide format (one column per timepoint) is NOT accepted.

### Visualizing individual profiles

A "spaghetti plot" — one line per participant — reveals both the overall trend and how consistently individuals follow it.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

fig, ax = plt.subplots(figsize=(8, 5))
for subj in df["subject"].unique():
    subj_data = df[df["subject"] == subj].set_index("time").reindex(timepoints)
    ax.plot(timepoints, subj_data["wpm"], color="steelblue", alpha=0.3, linewidth=0.9)
means = df.groupby("time")["wpm"].mean().reindex(timepoints)
ax.plot(timepoints, means, color="tomato", linewidth=2.5, marker="o", label="Group mean")
ax.set_xlabel("Time point")
ax.set_ylabel("Typing speed (wpm)")
ax.set_title("Individual profiles and group mean across time")
ax.legend()
plt.tight_layout()
plt.show()
```
- The faint blue lines are individual participants — even though absolute speeds vary widely between people, nearly everyone shows the same upward pattern.
- This parallel rise is exactly what repeated measures ANOVA tests: is the *shape* of the trajectory significant after removing person-to-person offsets?
- `set_index("time").reindex(timepoints)` guarantees the four points appear in chronological order regardless of how pandas sorted the groups.
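Before running the test, it's worth glancing at the sphericity assumption mentioned in the intro: the variance of the paired differences should be roughly equal for every pair of conditions. The sketch below is an informal eyeball check, not a formal Mauchly's test; the `wide` DataFrame is a helper introduced here for illustration and isn't used elsewhere in the tutorial. Pivoting to wide format (one row per subject, one column per time point) makes the pairwise differences easy to compute.

```python
import numpy as np
import pandas as pd
from itertools import combinations

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

# Helper view for this check only: one row per subject, one column per time point
wide = df.pivot(index="subject", columns="time", values="wpm")

# Sphericity (informally): these variances should all be in the same ballpark
for tp1, tp2 in combinations(timepoints, 2):
    diff_var = (wide[tp1] - wide[tp2]).var(ddof=1)
    print(f"Var({tp1} - {tp2}): {diff_var:.1f}")
```

If the six variances were wildly different, a correction such as Greenhouse–Geisser would be advisable; with simulated data like this, where every measurement has the same noise level, they should be broadly similar.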
### Running the repeated measures ANOVA
`statsmodels.stats.anova.AnovaRM` fits a repeated measures ANOVA directly from a long-format DataFrame. The `within` argument lists the within-subjects factors — here just time.
```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

result = AnovaRM(data=df, depvar="wpm", subject="subject", within=["time"]).fit()
print(result.summary())
```

- `depvar="wpm"` specifies the column containing the measurements.
- `subject="subject"` tells the model which column identifies each participant — it uses this to partition out individual differences.
- `within=["time"]` lists the within-subjects factors; for a one-factor design this is a single-element list.
- The output table shows the F-statistic, numerator and denominator degrees of freedom, and p-value (`Pr > F`) for the time factor. A significant p-value means typing speed changed significantly across the four time points.

### Post-hoc pairwise comparisons

A significant omnibus F-test tells you that *something* changed, but not which specific pairs of time points differ. Pairwise paired t-tests with Bonferroni correction answer that question.
```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

n_comparisons = len(list(combinations(timepoints, 2)))
alpha_corrected = 0.05 / n_comparisons
print(f"Bonferroni-corrected threshold: {alpha_corrected:.4f} ({n_comparisons} comparisons)\n")

for tp1, tp2 in combinations(timepoints, 2):
    scores1 = df[df["time"] == tp1]["wpm"].values
    scores2 = df[df["time"] == tp2]["wpm"].values
    stat, p = ttest_rel(scores1, scores2)
    sig = "*" if p < alpha_corrected else " "
    print(f"{sig} {tp1} vs {tp2}: t={stat:.2f}, p={p:.4f}")
```

- `combinations(timepoints, 2)` generates all C(4, 2) = 6 unique pairs without repetition.
- `ttest_rel` performs a paired t-test — it computes the difference for each participant and tests whether the mean difference is zero. This is valid here because each row for `tp1` corresponds to the same participant as the matching row for `tp2`.
- Bonferroni correction divides the significance threshold by the number of comparisons. This is conservative but simple; it controls the family-wise error rate so you don't get spurious "discoveries" from running many tests.
- Pairs marked `*` are significant after correction.

### Visualizing group means with confidence intervals

A final chart summarises which time points are meaningfully different at a glance.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

means = df.groupby("time")["wpm"].mean().reindex(timepoints)
sems = df.groupby("time")["wpm"].sem().reindex(timepoints)
n = df["subject"].nunique()
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = sems * t_crit

fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(timepoints, means, yerr=ci, capsize=5,
       color=["#d0e8f1", "#7cb9d4", "#3a87b0", "#1a5a80"], edgecolor="black")
ax.set_xlabel("Time point")
ax.set_ylabel("Mean typing speed (wpm)")
ax.set_title("Mean typing speed with 95% confidence intervals")
ax.set_ylim(40, 110)
plt.tight_layout()
plt.show()
```

- `sem()` computes the standard error of the mean per group — smaller standard error means a tighter estimate of the true mean.
- `t_crit = stats.t.ppf(0.975, df=n-1)` is the critical t-value for a 95% CI with n−1 degrees of freedom. Multiplying by the SEM gives the half-width of the confidence interval.
- Non-overlapping confidence intervals are a visual hint that two means differ significantly, though overlapping bars do not necessarily mean the difference is non-significant — always trust the test over the visual.
- The progressively darker bars reinforce the increasing typing speed trend.

### Conclusion

Repeated measures ANOVA is the right tool when the same subjects appear in all conditions — it removes individual baselines from the error, giving more power to detect real changes. Always follow a significant omnibus result with pairwise post-hoc tests, and correct for multiple comparisons to keep the false-positive rate under control. For comparing just two paired measurements (e.g. pre vs post), use `scipy.stats.ttest_rel` — a paired t-test rather than an independent-samples test. For a two-factor design where both factors are between-subjects, see [two-way ANOVA](/tutorials/two-way-anova).
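The two-measurement pre/post case mentioned above can be sketched in a few lines. This example reuses the simulation idea from this tutorial (persistent personal offsets plus measurement noise) but keeps only a "pre" and a "post" measurement per participant; the `pre`/`post` arrays are illustrative names, not part of any library API.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n_subjects = 20

# Same scheme as the main simulation: a stable personal offset plus noise,
# but only two measurements per participant (pre and post)
subject_offsets = rng.normal(0, 15, n_subjects)
pre = 60 + subject_offsets + rng.normal(0, 8, n_subjects)
post = 86 + subject_offsets + rng.normal(0, 8, n_subjects)

# Paired t-test: tests whether the mean within-person change is zero
stat, p = ttest_rel(pre, post)
print(f"t = {stat:.2f}, p = {p:.4g}")
```

Because the personal offsets cancel in the within-person differences, this test has the same power advantage over an independent-samples t-test that repeated measures ANOVA has over one-way ANOVA.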