Suppose you find that more education predicts higher income. But *why*? One explanation is that education increases job skill, and job skill increases income — education's effect is *mediated* by skill. Mediation analysis decomposes the total effect of X on Y into a *direct* effect (X → Y, bypassing the mediator) and an *indirect* effect (X → M → Y, flowing through the mediator). The Baron-Kenny procedure is the classic approach: run three regressions, collect the path coefficients, and compute indirect = a × b. Because the sampling distribution of a × b is non-normal, bootstrapping is the standard method for confidence intervals.

### Simulating a mediation pathway

A dataset where study hours (X) influence test scores (Y) partly through knowledge acquired (M) — simulated so the true indirect effect is known.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)                        # study hours (standardised)
M = 0.7 * X + rng.normal(0, 0.7, n)            # knowledge: a = 0.7
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)  # b = 0.5, direct c' = 0.3
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# True effects
true_indirect = 0.7 * 0.5  # a * b
true_direct = 0.3          # c'
true_total = true_indirect + true_direct
print(f"True indirect (a×b): {true_indirect:.2f}")
print(f"True direct (c'): {true_direct:.2f}")
print(f"True total (c): {true_total:.2f}")
```

- The data-generating process makes the causal structure explicit: X causes M (a = 0.7), M causes Y (b = 0.5), and X also directly causes Y (c' = 0.3).
- The true indirect effect (0.35) is the product of a and b — mediation analysis should recover values close to these from the data.
- Knowing the ground truth lets you verify that the statistical procedure is working correctly.

### Baron-Kenny steps: three OLS regressions

The Baron-Kenny procedure estimates three path coefficients from sequential regressions. The indirect effect is their product.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# Step 1: total effect c (X → Y, ignoring M)
c = smf.ols("Y ~ X", data=df).fit().params["X"]
# Step 2: a path (X → M)
a = smf.ols("M ~ X", data=df).fit().params["X"]
# Step 3: direct effect c' and b path (X + M → Y)
m3 = smf.ols("Y ~ X + M", data=df).fit()
c_prime = m3.params["X"]  # direct effect
b = m3.params["M"]        # b path

indirect = a * b
print(f"Total effect (c): {c:.3f}")
print(f"a path (X→M): {a:.3f}")
print(f"b path (M→Y): {b:.3f}")
print(f"Direct effect (c'): {c_prime:.3f}")
print(f"Indirect (a×b): {indirect:.3f}")
print(f"Total ≈ direct + indirect: {c_prime + indirect:.3f} (vs c = {c:.3f})")
```

- Steps 1–3 each run a separate OLS regression; step 3 yields both `c'` and `b`. No special package is needed — just `statsmodels.formula.api.ols`.
- The estimated indirect effect `a × b` should land close to the true value of 0.35, and the total effect `c` should approximately equal `c' + a×b`.
- A `c'` near zero (small and non-significant) alongside a meaningful `a×b` suggests *full mediation* — M explains essentially all of X's effect on Y.

### Bootstrapped confidence interval for the indirect effect

The product `a × b` has a skewed sampling distribution, so standard errors based on normality are unreliable. Bootstrapping resamples the data many times and directly observes the distribution of `a × b`.
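To see what the normal-theory alternative actually computes, here is a minimal sketch of the Sobel test on the same simulated data. It plugs the coefficient standard errors (`.bse` from statsmodels) into a first-order delta-method formula and treats `a × b` as normally distributed — the very assumption the bootstrap below avoids:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Same simulated data as above
rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# a path and its standard error
fit_a = smf.ols("M ~ X", data=df).fit()
a, se_a = fit_a.params["X"], fit_a.bse["X"]
# b path and its standard error
fit_b = smf.ols("Y ~ X + M", data=df).fit()
b, se_b = fit_b.params["M"], fit_b.bse["M"]

# Delta-method standard error of a*b, then a z test against normality
se_ab = np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
z = (a * b) / se_ab
p = 2 * stats.norm.sf(abs(z))
print(f"Sobel z = {z:.2f}, p = {p:.4g}")
```

With a clean simulated effect this size, the Sobel test and the bootstrap agree; the two diverge in small samples or near the null, where the normality assumption fails — which is why the bootstrap is preferred.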
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

def indirect_effect(data):
    a = smf.ols("M ~ X", data=data).fit().params["X"]
    b = smf.ols("Y ~ X + M", data=data).fit().params["M"]
    return a * b

observed = indirect_effect(df)

n_boot = 1000
boot_samples = [
    indirect_effect(df.sample(n=len(df), replace=True, random_state=i))
    for i in range(n_boot)
]
boot_samples = np.array(boot_samples)
ci_lo, ci_hi = np.percentile(boot_samples, [2.5, 97.5])

print(f"Indirect effect: {observed:.3f}")
print(f"95% bootstrap CI: [{ci_lo:.3f}, {ci_hi:.3f}]")
print(f"Significant mediation: {'Yes' if ci_lo > 0 or ci_hi < 0 else 'No'}")
```

- `df.sample(n=len(df), replace=True)` draws a bootstrap resample: same size as the original but with replacement, so some rows appear multiple times.
- `np.percentile(boot_samples, [2.5, 97.5])` gives the percentile CI. If this interval excludes zero, the indirect effect is statistically significant.
- 1000 bootstrap samples is a common minimum; 5000 is more robust for final reporting.

### Conclusion

Mediation analysis decomposes a total effect into the portion explained by a pathway through a third variable. The Baron-Kenny product-of-coefficients approach is straightforward, but always use bootstrapped confidence intervals for the indirect effect rather than the Sobel test — the Sobel test assumes normality of `a × b`, which rarely holds. Full mediation (c' ≈ 0) is rare in practice; partial mediation (both c' and a × b are meaningful) is the more common finding. For controlling a confounding variable instead of testing a mediating one, see [partial correlation](/tutorials/partial-correlation). For testing whether a third variable changes the *size* of an effect (moderation), see [moderation analysis with statsmodels](/tutorials/moderation-analysis-statsmodels).
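As a postscript, partial mediation is often summarised by the *proportion mediated*, `a×b / c` — the share of the total effect flowing through M. A quick sketch on the same simulated data (the ratio is only interpretable when the direct and indirect effects share a sign and `c` is not near zero):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same simulated data as in the tutorial
rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

c = smf.ols("Y ~ X", data=df).fit().params["X"]      # total effect
a = smf.ols("M ~ X", data=df).fit().params["X"]      # a path
b = smf.ols("Y ~ X + M", data=df).fit().params["M"]  # b path

# Share of the total effect carried by the X → M → Y pathway
prop_mediated = (a * b) / c
print(f"Proportion mediated: {prop_mediated:.2f}")  # true value: 0.35 / 0.65 ≈ 0.54
```

Here roughly half the effect of study hours on test scores flows through knowledge — a textbook case of partial mediation.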