# Repeated Measures ANOVA

In a standard [one-way ANOVA](/tutorials/one-way-anova-with-scipy), different groups of people are measured once each. Repeated measures ANOVA is used when the *same* people are measured across multiple conditions or time points — for example, testing reaction time before, one week into, and four weeks into a training program. Because each person acts as their own control, the within-person variability is removed from the error term, making the test more sensitive to real effects. The tradeoff is the assumption of *sphericity* — that the variance of differences between all pairs of conditions is equal — which can be tested and corrected for.
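
Sphericity concerns the variances of the pairwise *difference* scores between conditions: if those variances are roughly equal, the assumption holds. A formal check is Mauchly's test (available in third-party packages such as `pingouin`), but the idea can be sketched directly. The data below is purely illustrative — the dimensions, offsets, and noise levels are assumptions, not part of the study that follows:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Hypothetical within-subject data: 20 subjects x 3 conditions,
# each subject carrying a stable personal offset
n, k = 20, 3
offsets = rng.normal(0, 10, size=(n, 1))
scores = offsets + rng.normal(0, 5, size=(n, k)) + np.array([50.0, 55.0, 60.0])

# Sphericity: the variance of the difference scores should be
# roughly equal for every pair of conditions
for a, b in combinations(range(k), 2):
    diff = scores[:, a] - scores[:, b]
    print(f"var(cond{a} - cond{b}) = {diff.var(ddof=1):.1f}")
```

Because the same noise variance generates every condition here, the three variances should come out similar; markedly unequal variances would signal a violation, and the usual remedy is a Greenhouse-Geisser correction to the degrees of freedom.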

### Creating within-subject data

Repeated measures data must be in **long format**: one row per observation, with columns for the subject ID, the within-subjects factor (e.g., time point), and the measurement. We'll simulate 20 participants measured at four time points during a typing speed study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]  # words per minute

# Each participant has a persistent personal offset (individual differences)
subject_offsets = rng.normal(0, 15, n_subjects)

rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})

df = pd.DataFrame(rows)

group_means = df.groupby("time")["wpm"].mean().reindex(timepoints).round(1)
print(group_means)
print(f"\nTotal rows: {len(df)}  ({n_subjects} subjects × {len(timepoints)} timepoints)")
```

```
time
Baseline    57.7
Week 1      71.5
Week 4      81.8
Week 8      85.3
Name: wpm, dtype: float64

Total rows: 80  (20 subjects × 4 timepoints)
```

- `subject_offsets` gives each participant a stable personal baseline — some people are naturally faster typists. This offset is shared across all their measurements, creating the within-person correlation that repeated measures ANOVA exploits.
- `rng.normal(0, 8)` adds small measurement noise on top of the true mean and the personal offset.
- Long format is required by `AnovaRM`; wide format (one column per timepoint) is NOT accepted.
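
If your data starts out wide, pandas can reshape it before fitting. A minimal sketch, using hypothetical column names chosen for illustration:

```python
import pandas as pd

# Hypothetical wide-format data: one column per time point
wide = pd.DataFrame({
    "subject": ["P01", "P02", "P03"],
    "Baseline": [55.0, 61.2, 48.9],
    "Week 1": [68.3, 74.0, 62.5],
})

# melt() stacks the time-point columns into the long format AnovaRM expects
long_df = wide.melt(id_vars="subject", var_name="time", value_name="wpm")
print(long_df)
```

Each wide row becomes one long row per time point, with the former column names landing in the `time` column.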

### Visualizing individual profiles

A "spaghetti plot" — one line per participant — reveals both the overall trend and how consistently individuals follow it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)
rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

fig, ax = plt.subplots(figsize=(8, 5))

for subj in df["subject"].unique():
    subj_data = df[df["subject"] == subj].set_index("time").reindex(timepoints)
    ax.plot(timepoints, subj_data["wpm"], color="steelblue", alpha=0.3, linewidth=0.9)

means = df.groupby("time")["wpm"].mean().reindex(timepoints)
ax.plot(timepoints, means, color="tomato", linewidth=2.5, marker="o", label="Group mean")

ax.set_xlabel("Time point")
ax.set_ylabel("Typing speed (wpm)")
ax.set_title("Individual profiles and group mean across time")
ax.legend()
plt.tight_layout()
plt.show()
```

- The faint blue lines are individual participants — even though absolute speeds vary widely between people, nearly everyone shows the same upward pattern.
- This parallel rise is exactly what repeated measures ANOVA tests: is the *shape* of the trajectory significant after removing person-to-person offsets?
- `set_index("time").reindex(timepoints)` guarantees the four points appear in chronological order regardless of how pandas sorted the groups.

### Running the repeated measures ANOVA

`statsmodels.stats.anova.AnovaRM` runs a repeated measures ANOVA directly from a long-format DataFrame (a within-subjects ANOVA, not a full mixed-effects model). The `within` argument lists the within-subjects factors — here just time.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)
rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

result = AnovaRM(data=df, depvar="wpm", subject="subject", within=["time"]).fit()
print(result.summary())
```

```
              Anova
==================================
     F Value Num DF  Den DF Pr > F
----------------------------------
time 88.0532 3.0000 57.0000 0.0000
==================================
```

- `depvar="wpm"` specifies the column containing the measurements.
- `subject="subject"` tells the model which column identifies each participant — it uses this to partition out individual differences.
- `within=["time"]` lists the within-subjects factors; for a one-factor design this is a single-element list.
- The output table shows the F-statistic, numerator and denominator degrees of freedom, and p-value (`Pr > F`) for the time factor. A p-value below your chosen threshold indicates that mean typing speed differs across the four time points.
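
The summary reports significance but not effect size. Partial eta squared can be recovered from the F statistic and its degrees of freedom via F·df₁ / (F·df₁ + df₂); a quick calculation using the values printed above:

```python
# Values from the AnovaRM summary above
F, df1, df2 = 88.0532, 3, 57

# Partial eta squared: proportion of variance (after removing subject
# variance) attributable to the time factor
eta_p2 = (F * df1) / (F * df1 + df2)
print(f"partial eta^2 = {eta_p2:.3f}")  # ~0.82, a very large effect
```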

### Post-hoc pairwise comparisons

A significant omnibus F-test tells you that *something* changed, but not which specific pairs of time points differ. Pairwise paired t-tests with Bonferroni correction answer that question.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)
rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

n_comparisons = len(list(combinations(timepoints, 2)))
alpha_corrected = 0.05 / n_comparisons

print(f"Bonferroni-corrected threshold: {alpha_corrected:.4f} ({n_comparisons} comparisons)\n")

for tp1, tp2 in combinations(timepoints, 2):
    scores1 = df[df["time"] == tp1]["wpm"].values
    scores2 = df[df["time"] == tp2]["wpm"].values
    stat, p = ttest_rel(scores1, scores2)
    sig = "*" if p < alpha_corrected else " "
    print(f"{sig} {tp1} vs {tp2}: t={stat:.2f}, p={p:.4f}")
```

```
Bonferroni-corrected threshold: 0.0083 (6 comparisons)

* Baseline vs Week 1: t=-8.84, p=0.0000
* Baseline vs Week 4: t=-11.19, p=0.0000
* Baseline vs Week 8: t=-14.01, p=0.0000
* Week 1 vs Week 4: t=-4.98, p=0.0001
* Week 1 vs Week 8: t=-8.93, p=0.0000
  Week 4 vs Week 8: t=-1.95, p=0.0655
```
- `combinations(timepoints, 2)` generates all C(4, 2) = 6 unique pairs without repetition.
- `ttest_rel` performs a paired t-test — it computes the difference for each participant and tests whether the mean difference is zero. This is valid here because the rows within each time-point slice share the same subject order; if that order isn't guaranteed, sort both slices by subject first.
- Bonferroni correction divides the significance threshold by the number of comparisons. This is conservative but simple; it controls the family-wise error rate so you don't get spurious "discoveries" from running many tests.
- Pairs marked `*` are significant after correction.
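
Bonferroni is not the only option: Holm's step-down procedure controls the same family-wise error rate with somewhat more power. A sketch using `statsmodels.stats.multitest.multipletests`, fed hypothetical p-values shaped like the ones above:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from six paired comparisons
raw_p = [0.00001, 0.00001, 0.00001, 0.0001, 0.00001, 0.0655]

# method="holm" adjusts p-values step-down; reject flags significance at alpha
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for p, pa, r in zip(raw_p, p_adj, reject):
    print(f"raw p={p:.5f}  adjusted p={pa:.5f}  significant={r}")
```

With adjusted p-values you compare against plain 0.05 instead of dividing the threshold, which is often easier to report.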

### Visualizing group means with confidence intervals

A final chart summarises which time points are meaningfully different at a glance.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)
rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        wpm = mu + subject_offsets[i] + rng.normal(0, 8)
        rows.append({"subject": f"P{i+1:02d}", "time": tp, "wpm": wpm})
df = pd.DataFrame(rows)

means = df.groupby("time")["wpm"].mean().reindex(timepoints)
sems = df.groupby("time")["wpm"].sem().reindex(timepoints)
n = df["subject"].nunique()
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = sems * t_crit

fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(timepoints, means, yerr=ci, capsize=5,
       color=["#d0e8f1", "#7cb9d4", "#3a87b0", "#1a5a80"], edgecolor="black")
ax.set_xlabel("Time point")
ax.set_ylabel("Mean typing speed (wpm)")
ax.set_title("Mean typing speed with 95% confidence intervals")
ax.set_ylim(40, 110)
plt.tight_layout()
plt.show()
```

- `sem()` computes the standard error of the mean per group — smaller standard error means a tighter estimate of the true mean.
- `t_crit = stats.t.ppf(0.975, df=n-1)` is the critical t-value for a 95% CI with n−1 degrees of freedom. Multiplying by the SEM gives the half-width of the confidence interval.
- Non-overlapping confidence intervals are a visual hint that two means differ significantly, though overlapping bars do not necessarily mean the difference is non-significant — always trust the test over the visual.
- The progressively darker bars reinforce the increasing typing speed trend.
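
One caveat: the error bars above include between-subject variance, which the repeated measures design deliberately removes, so they can look wider than the test result suggests. A common alternative is the within-subject (Cousineau-Morey) interval: center each participant's scores on their own mean, add back the grand mean, compute per-condition SEMs on the centred scores, and scale by a small bias correction. A sketch on the same simulated data (the recipe is standard, but treat this as an illustration rather than a library call):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Rebuild the same simulated dataset
rng = np.random.default_rng(42)
n_subjects = 20
timepoints = ["Baseline", "Week 1", "Week 4", "Week 8"]
true_means = [60, 72, 82, 86]
subject_offsets = rng.normal(0, 15, n_subjects)
rows = []
for i in range(n_subjects):
    for tp, mu in zip(timepoints, true_means):
        rows.append({"subject": f"P{i+1:02d}", "time": tp,
                     "wpm": mu + subject_offsets[i] + rng.normal(0, 8)})
df = pd.DataFrame(rows)

# Centre each subject on their own mean, then restore the grand mean
centred = df["wpm"] - df.groupby("subject")["wpm"].transform("mean") + df["wpm"].mean()

# Per-timepoint SEM of the centred scores, with the Morey (2008) correction
k = len(timepoints)
sems_ws = centred.groupby(df["time"]).sem().reindex(timepoints)
t_crit = stats.t.ppf(0.975, df=n_subjects - 1)
ci_ws = sems_ws * np.sqrt(k / (k - 1)) * t_crit

print(ci_ws.round(2))
```

These intervals reflect only the within-person variability the F-test actually uses, so they line up better with the pairwise comparisons.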

### Conclusion

Repeated measures ANOVA is the right tool when the same subjects appear in all conditions — it removes individual baselines from the error, giving more power to detect real changes. Always follow a significant omnibus result with pairwise post-hoc tests, and correct for multiple comparisons to keep the false-positive rate under control.

For comparing just two paired measurements (e.g. pre vs post), use `scipy.stats.ttest_rel` — a paired t-test rather than an independent-samples test. For a two-factor design where both factors are between-subjects, see [two-way ANOVA](/tutorials/two-way-anova).