A standard [correlation analysis](/tutorials/correlation-analysis-with-scipy) can mislead you when a third variable drives both of the variables you are comparing. This is called confounding. For example, older adults tend to exercise less *and* weigh more, simply because both change with age. A direct correlation between exercise and weight will look negative, but it says more about ageing than about the exercise-weight relationship itself. Partial correlation removes the influence of the confounding variable before measuring the association, revealing whether the relationship between the two variables of interest is genuine or just a reflection of the shared driver.

### Simulating confounded data

We'll generate age, exercise, and weight data where age drives both exercise and weight, with no direct link between exercise and weight.
```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)

# Both driven by standardised age, with independent noise;
# no direct link between exercise and weight
exercise = -0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)
weight = 0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)

r_ew, p_ew = pearsonr(exercise, weight)
r_ea, _ = pearsonr(exercise, age)
r_wa, _ = pearsonr(weight, age)
print(f"Correlation(exercise, weight): r = {r_ew:.3f} p = {p_ew:.4f}")
print(f"Correlation(exercise, age): r = {r_ea:.3f}")
print(f"Correlation(weight, age): r = {r_wa:.3f}")
```

- The age terms are divided by `age.std()` so that age enters on a standardised scale. With unit-variance noise added on top, the correlation between exercise and age works out to roughly −0.8/√1.64 ≈ −0.62 in expectation, not −0.8 itself.
- The direct correlation between exercise and weight is clearly negative, but it is entirely inherited from the shared link to age: there is no causal path from exercise to weight in the data-generating process.
- Both predictor–confounder correlations are printed alongside the main correlation to show the full picture before controlling.

### Computing partial correlation via residuals

The most intuitive way to partial out a variable is to regress both variables on the confounder, then correlate what is left over.
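In symbols, with variables $x$ and $y$ and confounder $z$, the idea is to fit two simple regressions and correlate what they leave behind:

$$
\hat{x} = a_x + b_x z, \qquad \hat{y} = a_y + b_y z, \qquad r_{xy \cdot z} = \operatorname{corr}\left(x - \hat{x},\; y - \hat{y}\right)
$$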
```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)
exercise = -0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)
weight = 0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)

def residuals_after_regression(y, x):
    """Return residuals from regressing y on x (with intercept)."""
    x_with_const = np.column_stack([np.ones(len(x)), x])
    coeffs, _, _, _ = np.linalg.lstsq(x_with_const, y, rcond=None)
    return y - x_with_const @ coeffs

exercise_resid = residuals_after_regression(exercise, age)
weight_resid = residuals_after_regression(weight, age)

r_partial, p_partial = pearsonr(exercise_resid, weight_resid)
print(f"Partial correlation (controlling for age): r = {r_partial:.3f} p = {p_partial:.4f}")
```

- `np.linalg.lstsq` fits the least-squares line of y on x; subtracting the fitted values leaves residuals that are orthogonal to x, the part of y that x cannot explain.
- After removing age's influence from both exercise and weight, their residuals are nearly uncorrelated, confirming that the original negative correlation was spurious.
- The residual approach works for any number of control variables: regress both variables on all the controls, then correlate the residuals.

### The closed-form formula

For a single control variable, partial correlation can be computed directly from three Pearson correlations, without running any regressions.
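With $r_{xy}$, $r_{xz}$, and $r_{yz}$ denoting the pairwise Pearson correlations, the partial correlation of $x$ and $y$ controlling for $z$ is:

$$
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{\left(1 - r_{xz}^2\right)\left(1 - r_{yz}^2\right)}}
$$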
```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)
exercise = -0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)
weight = 0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)

r_ew, _ = pearsonr(exercise, weight)
r_ea, _ = pearsonr(exercise, age)
r_wa, _ = pearsonr(weight, age)
r_partial = (r_ew - r_ea * r_wa) / np.sqrt((1 - r_ea**2) * (1 - r_wa**2))

print(f"r(exercise, weight) = {r_ew:.4f}")
print(f"r(exercise, age) = {r_ea:.4f}")
print(f"r(weight, age) = {r_wa:.4f}")
print(f"Partial r(exercise, weight) = {r_partial:.4f} (controlling for age)")
```

- The formula subtracts the portion of the exercise–weight correlation that is mediated through age (`r_ea * r_wa`), then rescales so the result stays within [−1, 1].
- It agrees with the residual approach (up to floating-point precision) and is computationally cheaper for the single-control case, reassuring confirmation that both approaches implement the same mathematical concept.

### Testing significance

A partial correlation has its own significance test. The test statistic follows a t-distribution with `n − 2 − k` degrees of freedom, where `k` is the number of control variables.
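For a partial correlation $r_p$ computed from $n$ observations with $k$ controls, the statistic is:

$$
t = r_p \sqrt{\frac{n - 2 - k}{1 - r_p^2}}
$$

which is compared against a $t$ distribution with $n - 2 - k$ degrees of freedom.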
```python
import numpy as np
from scipy.stats import pearsonr, t as t_dist

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)
exercise = -0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)
weight = 0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)

r_ew, _ = pearsonr(exercise, weight)
r_ea, _ = pearsonr(exercise, age)
r_wa, _ = pearsonr(weight, age)
r_partial = (r_ew - r_ea * r_wa) / np.sqrt((1 - r_ea**2) * (1 - r_wa**2))

k = 1  # number of control variables
df = n - 2 - k
t_stat = r_partial * np.sqrt(df / (1 - r_partial**2))
p_value = 2 * t_dist.sf(abs(t_stat), df)

print(f"Partial r = {r_partial:.4f}")
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")

alpha = 0.05
verdict = "significant" if p_value < alpha else "not significant"
print(f"\nPartial correlation is {verdict} at α = {alpha}")
```

- `t_dist.sf(abs(t_stat), df)` computes the one-tailed p-value; multiplying by 2 gives the two-tailed result.
- Each control variable costs one degree of freedom: with `k = 1` control and `n = 200`, `df = 197`. More controls mean fewer degrees of freedom and therefore less power.
- A non-significant partial correlation here would confirm that exercise and weight are not directly related once age is accounted for.

### Visualizing raw vs controlled correlation

Plotting the scatter before and after partialling out age makes the effect of confounding immediately visible.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)
exercise = -0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)
weight = 0.8 * (age - age.mean()) / age.std() + rng.normal(0, 1, n)

def residuals(y, x):
    """Return residuals from regressing y on x (with intercept)."""
    x_c = np.column_stack([np.ones(len(x)), x])
    b, _, _, _ = np.linalg.lstsq(x_c, y, rcond=None)
    return y - x_c @ b

exercise_resid = residuals(exercise, age)
weight_resid = residuals(weight, age)
r_raw, _ = pearsonr(exercise, weight)
r_partial, _ = pearsonr(exercise_resid, weight_resid)

fig, axes = plt.subplots(1, 2, figsize=(11, 5))

sc = axes[0].scatter(exercise, weight, c=age, cmap="RdYlGn_r", alpha=0.7, s=25)
plt.colorbar(sc, ax=axes[0], label="Age (years)")
axes[0].set_xlabel("Exercise (standardised)")
axes[0].set_ylabel("Weight (standardised)")
axes[0].set_title(f"Raw correlation\nr = {r_raw:.3f}")

axes[1].scatter(exercise_resid, weight_resid, alpha=0.7, s=25, color="steelblue")
axes[1].axhline(0, color="gray", linewidth=0.8)
axes[1].axvline(0, color="gray", linewidth=0.8)
axes[1].set_xlabel("Exercise residual (age removed)")
axes[1].set_ylabel("Weight residual (age removed)")
axes[1].set_title(f"Partial correlation\nr = {r_partial:.3f}")

plt.tight_layout()
plt.show()
```

- The left panel is coloured by age, revealing that the negative slope is explained by the colour gradient: with the reversed `RdYlGn_r` colormap, older (red) points cluster in the low-exercise, high-weight corner while younger (green) points sit opposite.
- The right panel shows the residuals after age's effect is removed from both variables. The cloud is round with no visible trend; the partial correlation is near zero.
- This visual confirms that the original correlation was spurious: exercise and weight have no direct relationship in this dataset.

### Conclusion

Partial correlation is an essential diagnostic whenever you suspect a third variable might be inflating or masking a relationship. Computing it via residuals makes the concept explicit; the closed-form formula offers a quick shortcut for a single control variable. Always report partial correlations alongside the raw ones so readers can see how much the confound mattered.

For the raw Pearson correlation this tutorial builds on, see [correlation analysis with SciPy](/tutorials/correlation-analysis-with-scipy). To include multiple predictors in a regression framework instead of correlations, see [multiple linear regression with scikit-learn](/tutorials/multiple-linear-regression-sklearn).
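As a closing sketch, the residual method extends directly to several controls: stack the control variables into one design matrix and take residuals against all of them at once. The confounders `z1` and `z2` below are synthetic variables invented for illustration; they are not part of the tutorial's age/exercise/weight dataset.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 300

# Two confounders driving both x and y; no direct x-y link by construction
z1 = rng.normal(0, 1, n)
z2 = rng.normal(0, 1, n)
x = 0.7 * z1 - 0.5 * z2 + rng.normal(0, 1, n)
y = -0.6 * z1 + 0.4 * z2 + rng.normal(0, 1, n)

def residuals(y, X):
    """Residuals from regressing y on every column of X (with intercept)."""
    X_c = np.column_stack([np.ones(len(y)), X])
    b, _, _, _ = np.linalg.lstsq(X_c, y, rcond=None)
    return y - X_c @ b

controls = np.column_stack([z1, z2])  # design matrix of all control variables
r_raw, _ = pearsonr(x, y)
r_partial, p_partial = pearsonr(residuals(x, controls), residuals(y, controls))
print(f"Raw r = {r_raw:.3f}, partial r (controlling for z1, z2) = {r_partial:.3f}")
```

The raw correlation comes out clearly negative while the partial correlation, controlling for both confounders, sits near zero. Remember the degrees-of-freedom adjustment from the significance section: with two controls the test uses `n − 2 − 2` degrees of freedom.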