Running a study that is too small is one of the most common mistakes in data analysis: the effect is real, but you don't have enough data to detect it reliably, so you declare "no significant result" and draw the wrong conclusion. Power analysis prevents this by letting you calculate the sample size you need *before* collecting data.

The four quantities are connected — fix any three and you can solve for the fourth: **α** (false positive rate, typically 0.05), **power** (1 − false negative rate, typically 0.80), **effect size** (how large the difference is), and **n** (sample size per group). Usually you set α and power, estimate an expected effect size, and solve for n.

### What statistical power means

Statistical power is the probability of correctly detecting a real effect. Running many simulated experiments makes this concrete: if the true effect exists and your test has 70% power, you'd reject the null hypothesis in roughly 70 out of 100 studies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations = 10_000
n_per_group = 50
true_d = 0.5  # Cohen's d: medium effect
alpha = 0.05

reject_count = 0
for _ in range(n_simulations):
    group_a = rng.normal(0, 1, n_per_group)
    group_b = rng.normal(true_d, 1, n_per_group)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < alpha:
        reject_count += 1

empirical_power = reject_count / n_simulations
print(f"Effect size (d): {true_d}")
print(f"n per group: {n_per_group}")
print(f"Empirical power: {empirical_power:.3f}")
```

- `rng.normal(true_d, 1, n_per_group)` generates Group B with a mean shifted by `true_d` standard deviations above Group A — that shift is Cohen's d when both SDs are 1.
- Each iteration simulates a complete experiment. `reject_count` accumulates how often the correct conclusion (p < 0.05) is reached.
- The empirical power should be close to the theoretical value of ~0.70 — meaning with n=50 per group and a medium effect, roughly 30% of experiments would miss the real signal.

### Solving for sample size

`statsmodels.stats.power.TTestIndPower` computes the exact analytical power for a two-sample t-test. Passing three of the four quantities to `solve_power` returns the fourth.
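Before solving for n, the same API can cross-check the simulation above: leave `power` as `None`, pass the other three quantities, and `solve_power` returns the analytical power for that design.

```python
from statsmodels.stats.power import TTestIndPower

# Analytical power for the simulated design: d=0.5, n=50 per group, alpha=0.05
analytical_power = TTestIndPower().solve_power(
    effect_size=0.5, nobs1=50, alpha=0.05, power=None
)
print(f"Analytical power: {analytical_power:.3f}")
```

The result should sit close to the empirical estimate from the 10,000 simulated experiments, which is a useful sanity check on both approaches.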
```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Standard planning: solve for n given alpha, power, and effect size
for d, label in [(0.2, "Small"), (0.5, "Medium"), (0.8, "Large")]:
    n = analysis.solve_power(effect_size=d, power=0.8, alpha=0.05)
    print(f"{label} effect (d={d}): {math.ceil(n)} participants per group needed")
```

- `solve_power` returns the value of whichever argument is left as `None` (here, `nobs1`).
- The returned `n` is fractional, so round it up with `math.ceil` — rounding to the nearest integer can leave you slightly underpowered.
- With default `ratio=1.0`, both groups are the same size. The returned `n` is the per-group count, so the total sample is `2 × n`.
- Notice how dramatically sample size increases as effect size shrinks: detecting a small effect requires many more participants than a large one.

### Power curves

A power curve plots statistical power against sample size, letting you see exactly where you cross the 80% threshold for your expected effect size.
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_range = np.arange(10, 201, 5)
effect_sizes = {"Small (d=0.2)": 0.2, "Medium (d=0.5)": 0.5, "Large (d=0.8)": 0.8}

fig, ax = plt.subplots(figsize=(9, 5))
for label, d in effect_sizes.items():
    powers = [analysis.solve_power(effect_size=d, nobs1=n, alpha=0.05) for n in n_range]
    ax.plot(n_range, powers, label=label, linewidth=2)

# Draw the target line once, after the loop, so the legend shows a single entry
ax.axhline(0.8, color="black", linestyle="--", linewidth=1.2, label="80% power target")
ax.set_xlabel("Sample size per group")
ax.set_ylabel("Statistical power")
ax.set_title("Power curves for a two-sample t-test (α = 0.05)")
ax.legend()
ax.set_ylim(0, 1.05)
plt.tight_layout()
plt.show()
```

- Each curve rises from near zero toward 1.0 as n grows. The steep part of the curve is where adding a few more participants buys a lot of power; the flat top is where adding more provides little benefit.
- The black dashed line marks the conventional 80% target. The x-coordinate where each curve crosses it is the required n for that effect size.
- A small effect (d=0.2) doesn't reach 80% even at n=200 per group — power is only around 0.5 there — so detecting subtle differences is expensive.

### Minimum detectable effect

Sometimes the sample size is fixed by practical constraints. `solve_power` can then tell you the smallest effect your study is capable of detecting at 80% power.
```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
print(f"{'n/group':>8} {'Min detectable d':>18} {'Interpretation'}")
print("-" * 55)
for n in [20, 50, 100, 200, 500]:
    min_d = analysis.solve_power(nobs1=n, power=0.8, alpha=0.05)
    if min_d >= 0.8:
        label = "large effects only"
    elif min_d >= 0.5:
        label = "medium–large effects"
    elif min_d >= 0.2:
        label = "small–medium effects"
    else:
        label = "very small effects"
    print(f"{n:>8} {min_d:>18.3f} {label}")
```

- Setting `power=0.8` and leaving `effect_size=None` makes `solve_power` return the minimum Cohen's d detectable at 80% power for that n.
- Cohen's benchmarks: d = 0.2 (small), 0.5 (medium), 0.8 (large). A study with n=20 per group can only detect large effects reliably.
- This calculation is most useful when reviewing whether a proposed study is adequately powered, or when writing up a study with a fixed sample.

### Power heatmap: n × effect size

Visualising power across a grid of sample sizes and effect sizes gives a complete picture of what your study design can and cannot detect.
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_grid = np.arange(0.1, 1.05, 0.1)
n_grid = np.arange(10, 210, 10)

power_matrix = np.array([
    [analysis.solve_power(effect_size=d, nobs1=n, alpha=0.05)
     for d in effect_grid]
    for n in n_grid
])

fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(power_matrix, aspect="auto", cmap="RdYlGn",
               vmin=0, vmax=1, origin="lower")
ax.set_xticks(range(len(effect_grid)))
ax.set_xticklabels([f"{d:.1f}" for d in effect_grid])
ax.set_yticks(range(len(n_grid)))
ax.set_yticklabels([str(n) for n in n_grid])
ax.set_xlabel("Cohen's d (effect size)")
ax.set_ylabel("Sample size per group")
ax.set_title("Statistical power — two-sample t-test (α = 0.05)")
plt.colorbar(im, ax=ax, label="Power")
plt.tight_layout()
plt.show()
```

- Green cells (power ≥ 0.8) indicate design combinations that are adequately powered; red cells are underpowered.
- The boundary between red and green traces out the same relationship as the power curves above, just shown as a 2D grid instead of separate lines.
- `origin="lower"` ensures the smallest n appears at the bottom of the y-axis, matching the natural reading direction (more data → higher power → move up).

### Conclusion

Power analysis is a planning tool: run it before data collection, not after. The key decision is estimating the minimum effect size you care about detecting — Cohen's d = 0.5 is a reasonable default for many behavioral studies, but subject-matter knowledge should guide your choice.

For the t-test used in this tutorial, see [independent samples t-test with SciPy](/tutorials/independent-samples-t-test-with-scipy). For more complex designs, see [one-way ANOVA with SciPy](/tutorials/one-way-anova-with-scipy).
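One practical way to ground that effect-size choice: if pilot data are available, Cohen's d can be estimated as the mean difference divided by the pooled standard deviation. A minimal sketch (the `cohens_d` helper and the pilot numbers below are invented for illustration):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled SD: weighted average of the two sample variances (ddof=1)
    pooled_sd = np.sqrt(
        ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    )
    return (b.mean() - a.mean()) / pooled_sd

# Invented pilot measurements
pilot_control = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.2]
pilot_treated = [5.1, 5.6, 4.7, 5.9, 5.3, 4.8, 5.7, 5.0]
print(f"Estimated d from pilot data: {cohens_d(pilot_control, pilot_treated):.2f}")
```

Pilot estimates of d are noisy, so treat the result as a rough anchor and consider powering the full study for a somewhat smaller effect than the pilot suggests.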