Standard [linear regression](/tutorials/linear-regression) assumes every observation is independent. That assumption breaks down when observations are clustered — students in the same school, animals in the same litter, or patients measured at multiple time points all share some unobserved group characteristic that makes them more similar to each other than to outsiders. A linear mixed effects model handles this by adding *random effects*: group-specific adjustments that capture the clustering. The result is more accurate standard errors, valid p-values, and the ability to estimate how much of the total variance comes from between-group differences versus within-group noise.

### Simulating nested data

We'll generate test scores for 150 students spread across 10 schools. Each school has its own random baseline, and homework hours have a fixed positive effect on scores.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)  # between-school random effects
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)
print(df.groupby("school")["score"].agg(["mean", "count"]).round(1))
```

- `school_intercepts = rng.normal(0, 8, n_schools)` creates one random shift per school, sampled from a normal distribution with standard deviation 8 — so the true between-school variance the model should recover is 8² = 64.
- The true data-generating process is `score = 50 + 3*hours + school_effect + noise`. The fixed intercept is 50, the fixed slope is 3.
- Each school's baseline is persistent: every student in School 03 benefits from (or is penalised by) the same school-level offset, regardless of their homework hours.

### Visualizing the group structure

Plotting all schools together reveals both the within-school trend and the between-school spread in baselines.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)

fig, ax = plt.subplots(figsize=(9, 5))
cmap = plt.get_cmap("tab10")
for i, (school, grp) in enumerate(df.groupby("school")):
    ax.scatter(grp["hours"], grp["score"], color=cmap(i), alpha=0.7,
               s=30, label=school)
ax.set_xlabel("Hours of homework per week")
ax.set_ylabel("Test score")
ax.set_title("Test scores by homework hours — 10 schools")
ax.legend(loc="upper left", fontsize=7, ncol=2)
plt.tight_layout()
plt.show()
```
- `df.groupby("school")` iterates over the data one school at a time, letting us give each school a distinct colour.
- Notice that the cloud of points for each school is shifted up or down as a block — that vertical offset is the random intercept the model will estimate.
- A naive OLS fit through all 150 points would ignore these offsets and produce biased standard errors, because students in the same school are not independent.
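The biased-standard-errors point can be demonstrated directly. A minimal sketch (data regenerated so the block is self-contained) fits OLS twice: once with naive standard errors and once with cluster-robust ones that account for the school grouping — the coefficients are identical, but the uncertainty changes:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Regenerate the simulated data from above
rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)

naive = ols("score ~ hours", data=df).fit()
clustered = ols("score ~ hours", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school"]})

# Same point estimates, different standard errors
print(f"Slope estimate:          {naive.params['hours']:.3f}")
print(f"Intercept SE (naive):    {naive.bse['Intercept']:.3f}")
print(f"Intercept SE (cluster):  {clustered.bse['Intercept']:.3f}")
```

The cluster-robust intercept standard error is substantially larger — the naive fit pretends it has 150 independent data points when, for the baseline level, it effectively has closer to 10.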
### Fitting a random intercept model
`statsmodels.formula.api.mixedlm` fits a linear mixed effects model using a formula. The `groups` argument identifies the clustering variable.
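One estimation detail worth knowing up front: `fit()` accepts a `reml` flag. REML (the default) corrects the variance estimates for the degrees of freedom consumed by the fixed effects, while plain maximum likelihood (`reml=False`) tends to underestimate variance components. A quick sketch contrasting the two (data regenerated so the block runs standalone):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import mixedlm

# Regenerate the simulated data from above
rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)

reml_fit = mixedlm("score ~ hours", data=df, groups=df["school"]).fit(reml=True)
ml_fit = mixedlm("score ~ hours", data=df, groups=df["school"]).fit(reml=False)

# Fixed effects barely move; the variance components are what differ
print(f"Slope (REML): {reml_fit.fe_params['hours']:.3f}   "
      f"Slope (ML): {ml_fit.fe_params['hours']:.3f}")
print(f"Group Var (REML): {reml_fit.cov_re.iloc[0, 0]:.2f}   "
      f"Group Var (ML): {ml_fit.cov_re.iloc[0, 0]:.2f}")
```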
```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import mixedlm

rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)

model = mixedlm("score ~ hours", data=df, groups=df["school"])
result = model.fit(reml=True)
print(result.summary())
```

- `"score ~ hours"` specifies the fixed-effect formula: one predictor, `hours`, plus an implicit intercept.
- `groups=df["school"]` tells the model that observations sharing a school label are clustered and should receive a shared random intercept.
- `reml=True` uses Restricted Maximum Likelihood — the default and generally preferred method for estimating variance components, as it produces less-biased estimates than standard ML.
- The summary shows two tables: **Fixed Effects** (the overall intercept and slope, with z-statistics and p-values) and **Random Effects** (the variance of the school-level intercepts, labelled `Group Var`).

### Extracting fixed and random effects

The fitted result object exposes both the population-level coefficients and each school's individual deviation.
```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import mixedlm

rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)

result = mixedlm("score ~ hours", data=df, groups=df["school"]).fit(reml=True)

print("Fixed effects:")
print(f"  Intercept: {result.fe_params['Intercept']:.2f} (true: 50)")
print(f"  hours:     {result.fe_params['hours']:.2f} (true: 3)")

print("\nRandom intercepts per school:")
for school, re in sorted(result.random_effects.items()):
    print(f"  {school}: {re.iloc[0]:+.2f}")

group_var = result.cov_re.iloc[0, 0]
resid_var = result.scale
icc = group_var / (group_var + resid_var)
print(f"\nGroup variance: {group_var:.2f}  Residual variance: {resid_var:.2f}")
print(f"ICC: {icc:.2f} (fraction of variance explained by school membership)")
```

- `result.fe_params` is a Series of fixed-effect estimates — the intercept and slope that apply to all students regardless of school.
- `result.random_effects` is a dict keyed by group label; `re.iloc[0]` extracts each school's estimated deviation from the fixed intercept.
- `result.cov_re.iloc[0, 0]` is the estimated variance of the random intercepts — how much schools differ from each other on average.
- The **ICC** (Intraclass Correlation Coefficient) measures what fraction of total variability is due to group membership. An ICC near 1 means knowing the school tells you almost everything; near 0 means schools are interchangeable.

### Visualizing group-specific fitted lines

Each school gets its own intercept (fixed intercept + school's random deviation), while sharing the same slope for homework hours.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import mixedlm

rng = np.random.default_rng(42)
n_schools = 10
n_per_school = 15
school_intercepts = rng.normal(0, 8, n_schools)
rows = []
for s in range(n_schools):
    hours = rng.uniform(1, 10, n_per_school)
    for h in hours:
        score = 50 + 3 * h + school_intercepts[s] + rng.normal(0, 5)
        rows.append({"school": f"School {s+1:02d}", "hours": h, "score": score})
df = pd.DataFrame(rows)

result = mixedlm("score ~ hours", data=df, groups=df["school"]).fit(reml=True)
fe_int = result.fe_params["Intercept"]
fe_slope = result.fe_params["hours"]
x_range = np.linspace(df["hours"].min(), df["hours"].max(), 100)

fig, ax = plt.subplots(figsize=(9, 5))
cmap = plt.get_cmap("tab10")
for i, (school, grp) in enumerate(df.groupby("school")):
    ax.scatter(grp["hours"], grp["score"], color=cmap(i), alpha=0.5, s=25)
    rand_int = result.random_effects[school].iloc[0]
    y_line = (fe_int + rand_int) + fe_slope * x_range
    ax.plot(x_range, y_line, color=cmap(i), linewidth=1.5)

# Population-level line
ax.plot(x_range, fe_int + fe_slope * x_range, color="black",
        linewidth=2.5, linestyle="--", label="Population mean")
ax.set_xlabel("Hours of homework per week")
ax.set_ylabel("Predicted test score")
ax.set_title("Random intercept model: school-specific fitted lines")
ax.legend()
plt.tight_layout()
plt.show()
```

- Each coloured line has the same slope (`fe_slope`) but a different y-intercept — `fe_int + rand_int` — making the lines parallel.
- The black dashed line is the population mean, i.e., what you'd predict for a student from a hypothetical average school.
- If the lines were identical, there would be no between-school variance; the fact that they're spread apart reflects the estimated `Group Var` from the model summary.

### Conclusion

Linear mixed effects models are the correct tool whenever your data has a grouped structure — schools, subjects, clinics, time series — that violates the independence assumption of ordinary regression. The fixed effects give you the overall relationship you care about, and the random effects absorb the clustering so your inference stays valid. For repeated measurements of the same subject over time, see [repeated measures ANOVA](/tutorials/repeated-measures-anova). For the formula syntax used here, see [Statsmodels formula API regression](/tutorials/statsmodels-formula-api-regression).