
# Multiple Hypothesis Testing with SciPy

If you run 100 hypothesis tests at alpha=0.05, you'd expect about 5 false positives even if every null hypothesis is true. This problem — called the multiple comparisons problem — is common in genomics (testing thousands of genes simultaneously), A/B testing platforms (running dozens of variants), clinical trials (checking many secondary endpoints), and machine learning (selecting features by p-value). The Benjamini-Hochberg (BH) procedure controls the *false discovery rate* (FDR): the expected fraction of your significant results that are actually false. It's generally less conservative than the Bonferroni correction, which controls the probability of making *any* false positive and becomes very strict with many tests.
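To see the difference concretely, here is a minimal sketch (six made-up p-values, chosen purely for illustration) that applies both corrections through statsmodels' `multipletests`:

```python
from statsmodels.stats.multitest import multipletests

# Six made-up p-values, purely for illustration.
p_values = [0.001, 0.008, 0.020, 0.041, 0.300, 0.700]

for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} rejections, adjusted p-values: {adjusted.round(3)}")
```

With these six values Bonferroni keeps two tests while BH keeps three; the gap between the two methods tends to widen as the number of tests grows.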

### Simulating Many P-Values

A realistic pool of p-values contains both nulls (uniformly distributed between 0 and 1) and true effects (concentrated near 0). The BH procedure identifies a threshold that keeps the FDR at or below your chosen level.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(120)

n_tests = 500
n_true_null = 450

p_values = np.concatenate([
    stats.uniform.rvs(size=n_true_null),
    stats.beta.rvs(0.3, 1, size=n_tests - n_true_null),
])

reject, adjusted_p_values, _, _ = multipletests(p_values, method="fdr_bh", alpha=0.10)

print(f"Rejected hypotheses: {reject.sum()}")
print(f"Adjusted p-values below 0.10: {(adjusted_p_values < 0.10).sum()}")
```

```
Rejected hypotheses: 3
Adjusted p-values below 0.10: 3
```

- `stats.uniform.rvs(size=n_true_null)` generates p-values for null hypotheses — they're uniformly distributed, so about 5% will be below 0.05 by chance alone.
- `stats.beta.rvs(0.3, 1, size=n_tests - n_true_null)` generates p-values for the 50 true effects; they are concentrated near 0, as you'd expect when the null is actually false.
- `multipletests(..., method="fdr_bh")` applies the Benjamini-Hochberg correction and returns a boolean array of rejected tests plus the adjusted p-values.
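Under the hood, `fdr_bh` is the step-up rule: sort the p-values, find the largest rank k (out of m tests) with p_(k) <= (k/m) × alpha, and reject the k smallest. Here is a minimal hand-rolled sketch; it regenerates the same p-values with the same seed, so it should agree with the `multipletests` result above:

```python
import numpy as np
from scipy import stats

np.random.seed(120)
p_values = np.concatenate([
    stats.uniform.rvs(size=450),      # null p-values
    stats.beta.rvs(0.3, 1, size=50),  # true-effect p-values
])

def bh_reject(p, alpha=0.10):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= (k / m) * alpha."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = passes.nonzero()[0].max()  # largest passing rank (0-based)
        reject[order[:k + 1]] = True
    return reject

print(f"Hand-rolled BH rejections: {bh_reject(p_values).sum()}")
```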

### Comparing Raw and Adjusted Significance Counts

Without correction, raw p-values below 0.05 will include many false positives from the null tests. The BH procedure trims that list so that only a controlled fraction of what survives is expected to be false; with effects as weak as these, that means keeping only the strongest signals.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(120)

n_tests = 500
n_true_null = 450

p_values = np.concatenate([
    stats.uniform.rvs(size=n_true_null),
    stats.beta.rvs(0.3, 1, size=n_tests - n_true_null),
])

raw_significant = (p_values < 0.05).sum()
reject, adjusted_p_values, _, _ = multipletests(p_values, method="fdr_bh", alpha=0.10)

print(f"Raw p-values below 0.05: {raw_significant}")
print(f"BH discoveries at FDR 0.10: {reject.sum()}")
```

```
Raw p-values below 0.05: 38
BH discoveries at FDR 0.10: 3
```

- The raw significant count (38 here) mixes chance hits from the nulls, roughly 22 to 23 (5% of 450), with the true effects whose p-values happen to land below 0.05, so the nulls contribute a substantial fraction of the list.
- The BH procedure at alpha=0.10 targets a list where at most 10% of discoveries are false — it finds fewer total discoveries but a higher proportion of them are genuine.
- The tradeoff between FDR alpha=0.05 and alpha=0.10 is detection power vs. purity of the significant list: a higher alpha finds more true effects at the cost of more false positives.
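To put numbers on that tradeoff, rerun the correction at both FDR levels on the same simulated p-values (a short sketch, regenerating the data exactly as in the block above):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(120)
p_values = np.concatenate([
    stats.uniform.rvs(size=450),
    stats.beta.rvs(0.3, 1, size=50),
])

# Same p-values, two different FDR targets.
for alpha in (0.05, 0.10):
    reject, _, _, _ = multipletests(p_values, method="fdr_bh", alpha=alpha)
    print(f"FDR level {alpha:.2f}: {reject.sum()} discoveries")
```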

### Visualizing the P-Value Distribution

A histogram of p-values is a diagnostic tool: under the null, p-values are uniform; true effects produce a spike near zero. The shape tells you roughly how many true effects are present.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(120)

n_tests = 500
n_true_null = 450

p_values = np.concatenate([
    stats.uniform.rvs(size=n_true_null),
    stats.beta.rvs(0.3, 1, size=n_tests - n_true_null),
])

plt.figure(figsize=(9, 5))
plt.hist(p_values, bins=30, alpha=0.75)
plt.xlabel("P-value")
plt.ylabel("Count")
plt.title("Distribution of P-Values Across Many Tests")
plt.show()
```

- A flat histogram with no spike near zero suggests most tests are null — you'd expect few real discoveries regardless of correction method.
- A large spike at very low p-values relative to the flat baseline indicates many true effects — the BH procedure will have high power in this case.
- If the histogram rises toward p = 1, it usually signals a conservative or misspecified test (for example, a one-sided test when the effects run in the opposite direction) rather than extra evidence against the null.
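The flat right-hand side of the histogram can also be turned into a rough estimate of how many tests are null. A common back-of-the-envelope sketch (the 0.5 cutoff is an arbitrary choice): assume p-values above the cutoff come almost entirely from uniform nulls, count them, and scale up.

```python
import numpy as np
from scipy import stats

np.random.seed(120)
p_values = np.concatenate([
    stats.uniform.rvs(size=450),
    stats.beta.rvs(0.3, 1, size=50),
])

# Assume p-values above the cutoff come almost entirely from uniform nulls.
cutoff = 0.5
pi0_hat = (p_values > cutoff).mean() / (1 - cutoff)
print(f"Estimated null proportion: {pi0_hat:.2f} (true value in this simulation: 0.90)")
```

The estimate should land near the true 90%, biased slightly upward because a few true-effect p-values also exceed the cutoff.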

### Estimating the False Discovery Share

After applying BH correction, you can check what fraction of your rejected tests come from the true nulls — this is the actual false discovery share, which should be near or below the FDR alpha you set.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(120)

n_tests = 500
n_true_null = 450

p_values = np.concatenate([
    stats.uniform.rvs(size=n_true_null),
    stats.beta.rvs(0.3, 1, size=n_tests - n_true_null),
])

reject, adjusted_p_values, _, _ = multipletests(p_values, method="fdr_bh", alpha=0.10)
false_discoveries = reject[:n_true_null].sum()
total_discoveries = reject.sum()

print(f"False discoveries among rejected tests: {false_discoveries}")
print(f"Total discoveries: {total_discoveries}")
print(f"Observed false discovery share: {false_discoveries / total_discoveries:.3f}")
```

```
False discoveries among rejected tests: 0
Total discoveries: 3
Observed false discovery share: 0.000
```

- `reject[:n_true_null]` selects only the null tests (the first 450) — any that were rejected are false discoveries.
- The false discovery share observed in any single run will vary; averaged over many repetitions it becomes the FDR proper, which the BH procedure keeps at or below the 0.10 target.
- In real data you don't know which tests are null — this simulation lets you verify the procedure's properties in a controlled setting.
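Because the FDR is an expectation, a single run (0 false discoveries out of 3 here) says little on its own. A sketch that repeats the simulation many times and averages the false discovery proportion (the number of repetitions and the seed are arbitrary):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
n_sims, n_tests, n_true_null = 200, 500, 450
fdp_values = []

for _ in range(n_sims):
    p = np.concatenate([
        rng.uniform(size=n_true_null),
        rng.beta(0.3, 1, size=n_tests - n_true_null),
    ])
    reject, _, _, _ = multipletests(p, method="fdr_bh", alpha=0.10)
    # False discovery proportion for this run (0 when nothing is rejected).
    fdp_values.append(reject[:n_true_null].sum() / max(reject.sum(), 1))

print(f"Average false discovery proportion: {np.mean(fdp_values):.3f}")
```

The average should come in at or below the 0.10 target; for independent tests BH controls it at roughly pi0 × alpha, about 0.09 with 90% nulls.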

### Practical Example: Screening Many Features

In feature selection, you might run a correlation or t-test for each feature and want to identify which features are genuinely associated without accumulating false positives.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(151)

n_tests = 300
n_true_null = 260

p_values = np.concatenate([
    stats.uniform.rvs(size=n_true_null),
    stats.beta.rvs(0.25, 1, size=n_tests - n_true_null),
])

reject, adjusted_p_values, _, _ = multipletests(p_values, method="fdr_bh", alpha=0.05)

print(f"Number of selected features: {reject.sum()}")
print(f"Smallest adjusted p-value: {adjusted_p_values.min():.6f}")

plt.figure(figsize=(9, 5))
plt.scatter(range(len(p_values)), np.sort(p_values), s=12)
plt.axhline(0.05, color="red", linestyle="--", linewidth=1)
plt.xlabel("Ordered test index")
plt.ylabel("P-value")
plt.title("Ordered P-Values from Many Feature Tests")
plt.show()
```

```
Number of selected features: 6
Smallest adjusted p-value: 0.000009
```

- Sorting the p-values and plotting them against their rank is the BH procedure visualized: it finds the largest rank k at which the k-th smallest p-value sits at or below (k/m) × alpha and rejects every test up to that rank.
- The red line at 0.05 marks the uncorrected threshold; tests below it but above the BH cutoff would be called significant without correction, yet are not retained under FDR control.
- Using FDR control rather than Bonferroni here lets you keep more features while still controlling the expected share of false selections.
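To make the screening step concrete, the sketch below builds a small synthetic two-group dataset, runs an independent-samples t-test per feature with `scipy.stats.ttest_ind`, and feeds the per-feature p-values into the same correction. The group sizes, feature counts, and the 0.8 mean shift are illustrative choices, not part of the example above.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
n_per_group, n_features, n_informative = 40, 100, 10

# Two groups of samples; only the first n_informative features truly differ.
group_a = rng.normal(size=(n_per_group, n_features))
group_b = rng.normal(size=(n_per_group, n_features))
group_b[:, :n_informative] += 0.8  # real mean shift for the informative features

t_stats, feature_p = stats.ttest_ind(group_a, group_b, axis=0)
reject, adjusted_p, _, _ = multipletests(feature_p, method="fdr_bh", alpha=0.05)

print(f"Features selected by BH: {reject.sum()} of {n_features} "
      f"({n_informative} are truly informative)")
```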

### Conclusion

Multiple testing correction is essential in any analysis involving many simultaneous tests. The BH procedure at a chosen FDR level is a practical balance between detecting real effects and keeping your significant list trustworthy. Always check the p-value histogram first — it tells you how many real effects to expect and whether correction will matter.

For a single pairwise comparison, see the [independent samples t-test](/tutorials/independent-samples-t-test-with-scipy). For [correlation analysis](/tutorials/correlation-analysis-with-scipy) across many variable pairs, multiple testing correction is equally important.