Suppose you find that more education predicts higher income. But *why*? One explanation is that education increases job skill, and job skill increases income — education's effect is *mediated* by skill. Mediation analysis decomposes the total effect of X on Y into a *direct* effect (X → Y, bypassing the mediator) and an *indirect* effect (X → M → Y, flowing through the mediator). The Baron-Kenny procedure is the classic approach: run three regressions, collect the path coefficients, and compute indirect = a × b. Because the sampling distribution of a × b is non-normal, bootstrapping is the standard method for confidence intervals.

### Simulating a mediation pathway

A dataset where study hours (X) influence test scores (Y) partly through knowledge acquired (M) — simulated so the true indirect effect is known.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)                        # study hours (standardised)
M = 0.7 * X + rng.normal(0, 0.7, n)            # knowledge: a = 0.7
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)  # b = 0.5, direct c' = 0.3
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# True effects
true_indirect = 0.7 * 0.5  # a * b
true_direct = 0.3          # c'
true_total = true_indirect + true_direct
print(f"True indirect (a×b): {true_indirect:.2f}")
print(f"True direct (c'): {true_direct:.2f}")
print(f"True total (c): {true_total:.2f}")
```

- The data-generating process makes the causal structure explicit: X causes M (a = 0.7), M causes Y (b = 0.5), and X also directly causes Y (c' = 0.3).
- The true indirect effect (0.35) is the product of a and b — mediation analysis should recover values close to these from the data.
- Knowing the ground truth lets you verify that the statistical procedure is working correctly.

### Baron-Kenny steps: three OLS regressions

The Baron-Kenny procedure estimates three path coefficients from sequential regressions. The indirect effect is their product.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# Step 1: total effect c (X → Y, ignoring M)
c = smf.ols("Y ~ X", data=df).fit().params["X"]
# Step 2: a path (X → M)
a = smf.ols("M ~ X", data=df).fit().params["X"]
# Step 3: direct effect c' and b path (X + M → Y)
m3 = smf.ols("Y ~ X + M", data=df).fit()
c_prime = m3.params["X"]  # direct effect
b = m3.params["M"]        # b path

indirect = a * b
print(f"Total effect (c): {c:.3f}")
print(f"a path (X→M): {a:.3f}")
print(f"b path (M→Y): {b:.3f}")
print(f"Direct effect (c'): {c_prime:.3f}")
print(f"Indirect (a×b): {indirect:.3f}")
print(f"Total ≈ direct + indirect: {c_prime + indirect:.3f} (vs c = {c:.3f})")
```

- Steps 1–3 each run a separate OLS regression; step 3 yields both `c'` and `b`. No special package is needed — just `statsmodels.formula.api.ols`.
- The estimated indirect effect `a × b` should land close to the true value of 0.35, and the total effect `c` should approximately equal `c' + a×b`.
- A `c'` near zero (small and non-significant) alongside a meaningful `a×b` suggests *full mediation* — M explains essentially all of X's effect on Y.

### Bootstrapped confidence interval for the indirect effect

The product `a × b` has a skewed sampling distribution, so standard errors based on normality are unreliable. Bootstrapping resamples the data many times and directly observes the distribution of `a × b`.
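To see what the normal-theory alternative actually computes, here is a minimal sketch of the Sobel test on the same simulated data. It plugs the coefficient standard errors (`.bse` from statsmodels) into a first-order delta-method formula and treats `a × b` as normally distributed — the very assumption the bootstrap below avoids:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Same simulated data as above
rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

# a path and its standard error
fit_a = smf.ols("M ~ X", data=df).fit()
a, se_a = fit_a.params["X"], fit_a.bse["X"]
# b path and its standard error
fit_b = smf.ols("Y ~ X + M", data=df).fit()
b, se_b = fit_b.params["M"], fit_b.bse["M"]

# Delta-method standard error of a*b, then a z test against normality
se_ab = np.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
z = (a * b) / se_ab
p = 2 * stats.norm.sf(abs(z))
print(f"Sobel z = {z:.2f}, p = {p:.4g}")
```

With a clean simulated effect this size, the Sobel test and the bootstrap agree; the two diverge in small samples or near the null, where the normality assumption fails — which is why the bootstrap is preferred.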
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

def indirect_effect(data):
    a = smf.ols("M ~ X", data=data).fit().params["X"]
    b = smf.ols("Y ~ X + M", data=data).fit().params["M"]
    return a * b

observed = indirect_effect(df)

n_boot = 1000
boot_samples = [
    indirect_effect(df.sample(n=len(df), replace=True, random_state=i))
    for i in range(n_boot)
]
boot_samples = np.array(boot_samples)
ci_lo, ci_hi = np.percentile(boot_samples, [2.5, 97.5])

print(f"Indirect effect: {observed:.3f}")
print(f"95% bootstrap CI: [{ci_lo:.3f}, {ci_hi:.3f}]")
print(f"Significant mediation: {'Yes' if ci_lo > 0 or ci_hi < 0 else 'No'}")
```

- `df.sample(n=len(df), replace=True)` draws a bootstrap resample: same size as the original but with replacement, so some rows appear multiple times.
- `np.percentile(boot_samples, [2.5, 97.5])` gives the percentile CI. If this interval excludes zero, the indirect effect is statistically significant.
- 1000 bootstrap samples is a common minimum; 5000 is more robust for final reporting.

### Conclusion

Mediation analysis decomposes a total effect into the portion explained by a pathway through a third variable. The Baron-Kenny product-of-coefficients approach is straightforward, but always use bootstrapped confidence intervals for the indirect effect rather than the Sobel test — the Sobel test assumes normality of `a × b`, which rarely holds. Full mediation (c' ≈ 0) is rare in practice; partial mediation (both c' and a × b are meaningful) is the more common finding. For controlling a confounding variable instead of testing a mediating one, see [partial correlation](/tutorials/partial-correlation). For testing whether a third variable changes the *size* of an effect (moderation), see [moderation analysis with statsmodels](/tutorials/moderation-analysis-statsmodels).
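As a postscript, partial mediation is often summarised by the *proportion mediated*, `a×b / c` — the share of the total effect flowing through M. A quick sketch on the same simulated data (the ratio is only interpretable when the direct and indirect effects share a sign and `c` is not near zero):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same simulated data as in the tutorial
rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, n)
M = 0.7 * X + rng.normal(0, 0.7, n)
Y = 0.5 * M + 0.3 * X + rng.normal(0, 0.7, n)
df = pd.DataFrame({"X": X, "M": M, "Y": Y})

c = smf.ols("Y ~ X", data=df).fit().params["X"]      # total effect
a = smf.ols("M ~ X", data=df).fit().params["X"]      # a path
b = smf.ols("Y ~ X + M", data=df).fit().params["M"]  # b path

# Share of the total effect carried by the X → M → Y pathway
prop_mediated = (a * b) / c
print(f"Proportion mediated: {prop_mediated:.2f}")  # true value: 0.35 / 0.65 ≈ 0.54
```

Here roughly half the effect of study hours on test scores flows through knowledge — a textbook case of partial mediation.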