
# Correlation Analysis with SciPy

Correlation analysis answers a common question in data work: when one variable changes, does another tend to change with it? A positive correlation means both rise together; a negative correlation means one rises as the other falls. The coefficient tells you the strength of the relationship (0 = none, 1 or -1 = perfect) and its direction (the sign). SciPy provides three correlation methods — Pearson, Spearman, and Kendall — each suited to different kinds of data. Knowing which to use, and how to interpret the result alongside a p-value, is a core skill in any data analysis.

### Pearson Correlation

Pearson correlation measures the strength of a **linear** relationship between two continuous variables. It assumes both variables are roughly normally distributed and that the relationship between them is a straight line. It's the right choice when your data meets those assumptions — for example, comparing height and weight across a population.

```python
import numpy as np
from scipy import stats

np.random.seed(15)

x = np.random.normal(0, 1, 120)
y = 0.8 * x + np.random.normal(0, 0.5, 120)

r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.3f}")
print(f"P-value: {p_value:.6f}")
```

```
Pearson r: 0.875
P-value: 0.000000
```

- `y = 0.8 * x + noise` constructs a variable that's strongly but not perfectly correlated with `x` — the `0.8` controls the signal strength and `np.random.normal(0, 0.5, 120)` adds scatter.
- `stats.pearsonr(x, y)` returns `r` (the correlation coefficient) and a p-value testing whether the true correlation is zero.
- A p-value near zero here indicates the observed correlation would be extremely unlikely if the true correlation were zero, which is expected: the relationship was built in by construction.
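
Beyond the p-value, you can ask how precisely the sample pins down the coefficient. On SciPy 1.9 and newer, the result object from `pearsonr` can compute a confidence interval for `r`. A minimal sketch, reusing the data above:

```python
import numpy as np
from scipy import stats

np.random.seed(15)

x = np.random.normal(0, 1, 120)
y = 0.8 * x + np.random.normal(0, 0.5, 120)

result = stats.pearsonr(x, y)
ci = result.confidence_interval(confidence_level=0.95)

# The interval brackets the plausible values of the true correlation.
print(f"95% CI for r: [{ci.low:.3f}, {ci.high:.3f}]")
```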

### Comparing Pearson, Spearman, and Kendall

Pearson is sensitive to outliers and assumes linearity. **Spearman** relaxes both: it works on ranks rather than raw values, so it captures any monotonic relationship (one that consistently goes up or down, even if not in a straight line). **Kendall's tau** is also rank-based, but it is computed from concordant and discordant pairs of observations rather than rank differences — it's more robust with small samples or many tied values. When in doubt with real-world data, Spearman is a safer default than Pearson.

```python
import numpy as np
from scipy import stats

np.random.seed(15)

x = np.random.normal(0, 1, 120)
y = 0.8 * x + np.random.normal(0, 0.5, 120)

pearson = stats.pearsonr(x, y)
spearman = stats.spearmanr(x, y)
kendall = stats.kendalltau(x, y)

print(f"Pearson: r = {pearson.statistic:.3f}, p = {pearson.pvalue:.6f}")
print(f"Spearman: r = {spearman.statistic:.3f}, p = {spearman.pvalue:.6f}")
print(f"Kendall: tau = {kendall.statistic:.3f}, p = {kendall.pvalue:.6f}")
```

```
Pearson: r = 0.875, p = 0.000000
Spearman: r = 0.860, p = 0.000000
Kendall: tau = 0.684, p = 0.000000
```

- All three methods return a result object with `.statistic` and `.pvalue` attributes (unified in SciPy 1.9; older releases used method-specific names such as `.correlation`).
- For linearly correlated, normally distributed data like this, all three coefficients will be close — the differences matter more with non-linear or non-normal data, as the sketch after this list shows.
- Kendall's tau is scaled differently from Pearson and Spearman (its range is still -1 to 1, but the same underlying relationship will produce a smaller absolute value), so don't compare tau directly to `r`.
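
To see where the methods diverge, consider a hypothetical relationship (not part of the examples above) that is perfectly monotonic but far from linear: `y = exp(x)` rises with `x` at every point, so the rank-based coefficients report a perfect relationship while Pearson does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Monotonic but strongly non-linear: y grows exponentially with x.
x = rng.uniform(0, 4, 120)
y = np.exp(x)

# Pearson penalizes the curvature; Spearman and Kendall see only the ordering.
print(f"Pearson:  {stats.pearsonr(x, y).statistic:.3f}")   # noticeably below 1
print(f"Spearman: {stats.spearmanr(x, y).statistic:.3f}")  # exactly 1
print(f"Kendall:  {stats.kendalltau(x, y).statistic:.3f}") # exactly 1
```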

### Scatter Plot of the Relationship

A correlation coefficient is a single number — it can't tell you whether the relationship is linear or curved, whether there are outliers pulling the coefficient up, or whether the data clusters in unexpected ways. Plotting the two variables against each other before (or alongside) running the statistic is essential to interpreting it correctly.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(15)

x = np.random.normal(0, 1, 120)
y = 0.8 * x + np.random.normal(0, 0.5, 120)
r, _ = stats.pearsonr(x, y)

plt.figure(figsize=(8, 5))
plt.scatter(x, y, alpha=0.6)
plt.xlabel("X")
plt.ylabel("Y")
plt.title(f"Scatter Plot (Pearson r = {r:.3f})")
plt.grid(alpha=0.3)
plt.show()
```

- `r, _ = stats.pearsonr(x, y)` unpacks only the coefficient (discarding the p-value with `_`) so it can be embedded in the chart title.
- `alpha=0.6` makes overlapping points visible — without it, dense regions look identical to sparse ones.
- Embedding `r` in the title keeps the statistic and the visual together, which makes the chart self-explanatory.
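
One optional extension, using `np.polyfit` (which isn't part of the tutorial's examples above), is to overlay a least-squares line. This makes it easier to judge by eye whether a straight line actually describes the point cloud:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(15)

x = np.random.normal(0, 1, 120)
y = 0.8 * x + np.random.normal(0, 0.5, 120)
r, _ = stats.pearsonr(x, y)

# Fit a degree-1 polynomial (a straight line) to the points.
slope, intercept = np.polyfit(x, y, 1)
xs = np.linspace(x.min(), x.max(), 100)

plt.figure(figsize=(8, 5))
plt.scatter(x, y, alpha=0.6)
plt.plot(xs, slope * xs + intercept, color="red",
         label=f"fit: y = {slope:.2f}x + {intercept:.2f}")
plt.xlabel("X")
plt.ylabel("Y")
plt.title(f"Scatter Plot with Fitted Line (Pearson r = {r:.3f})")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
```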

### Correlation Matrix

When you have more than two variables, checking every pair individually doesn't scale. A correlation matrix computes all pairwise correlations at once, and a heatmap makes it easy to spot which variables are strongly related and which are independent.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(15)

x = np.random.normal(0, 1, 120)
y = 0.8 * x + np.random.normal(0, 0.5, 120)
z = np.random.normal(0, 1, 120)

matrix = np.corrcoef(np.column_stack([x, y, z]).T)

plt.figure(figsize=(6, 5))
plt.imshow(matrix, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks([0, 1, 2], ["x", "y", "z"])
plt.yticks([0, 1, 2], ["x", "y", "z"])
for i in range(3):
    for j in range(3):
        plt.text(j, i, f"{matrix[i, j]:.2f}", ha="center", va="center")
plt.colorbar(label="Correlation")
plt.title("Correlation Matrix")
plt.show()
```

- `np.column_stack([x, y, z]).T` arranges the variables as rows, which is the shape `np.corrcoef` expects.
- `cmap="coolwarm"` with `vmin=-1, vmax=1` pins the color scale so blue always means negative correlation and red always means positive — regardless of the actual range in your data.
- The diagonal is always 1.0 (each variable is perfectly correlated with itself), so focus on the off-diagonal cells.
- `z` was generated independently of `x` and `y`, so its row and column should show values near zero.
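
If your data already lives in a pandas DataFrame (an assumption here; pandas isn't used elsewhere in this tutorial), `DataFrame.corr()` produces the same matrix with labeled rows and columns, and its `method` argument switches to the rank-based coefficients:

```python
import numpy as np
import pandas as pd

np.random.seed(15)

df = pd.DataFrame({"x": np.random.normal(0, 1, 120)})
df["y"] = 0.8 * df["x"] + np.random.normal(0, 0.5, 120)
df["z"] = np.random.normal(0, 1, 120)

print(df.corr())                   # Pearson by default
print(df.corr(method="spearman"))  # or "kendall"
```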

### Practical Example: Study Time and Exam Score

This example applies correlation to a realistic scenario: measuring whether students who study more tend to score higher. It also shows how to run both Pearson and Spearman and compare them when you're unsure which is more appropriate.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(29)

study_hours = np.random.uniform(2, 12, 80)
exam_scores = 52 + 3.5 * study_hours + np.random.normal(0, 6, 80)

pearson = stats.pearsonr(study_hours, exam_scores)
spearman = stats.spearmanr(study_hours, exam_scores)

print(f"Pearson r: {pearson.statistic:.3f}, p-value: {pearson.pvalue:.6f}")
print(f"Spearman r: {spearman.statistic:.3f}, p-value: {spearman.pvalue:.6f}")

plt.figure(figsize=(8, 5))
plt.scatter(study_hours, exam_scores, alpha=0.65)
plt.xlabel("Study hours")
plt.ylabel("Exam score")
plt.title("Study Time vs Exam Score")
plt.grid(alpha=0.3)
plt.show()
```

```
Pearson r: 0.846, p-value: 0.000000
Spearman r: 0.847, p-value: 0.000000
```

- `exam_scores = 52 + 3.5 * study_hours + noise` simulates a realistic linear relationship — each additional study hour is worth about 3.5 points on average, with real-world variability added.
- Running both Pearson and Spearman here is intentional: if they agree closely, the linear assumption is reasonable; a large gap would suggest the relationship is monotonic but non-linear.
- The scatter plot confirms the direction and spread of the relationship, which the coefficient alone can't show.

### Conclusion

SciPy's `pearsonr`, `spearmanr`, and `kendalltau` give you a toolkit for measuring different kinds of relationships. Pearson is the standard choice for linear, normally distributed data; Spearman and Kendall are more robust when those assumptions don't hold. Always pair the coefficient with a scatter plot — a high `r` can still hide a non-linear relationship or influential outliers.
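
As a final caution, here is a hypothetical illustration of the outlier problem: fifty unrelated points plus one extreme value are enough to manufacture a sizable Pearson `r`, while Spearman, which works on ranks, barely moves.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two genuinely unrelated variables...
x = rng.normal(0, 1, 50)
y = rng.normal(0, 1, 50)

# ...plus a single extreme point far from the main cloud.
x = np.append(x, 10)
y = np.append(y, 10)

# The outlier dominates Pearson's covariance term; ranks cap its influence.
print(f"Pearson:  {stats.pearsonr(x, y).statistic:.3f}")
print(f"Spearman: {stats.spearmanr(x, y).statistic:.3f}")
```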

To go further, see [linear regression with SciPy](/tutorials/linear-regression-with-scipy) to model the relationship quantitatively, or [independent samples t-test with SciPy](/tutorials/independent-samples-t-test-with-scipy) to compare group means rather than measure association. For visualizing relationships, [matplotlib scatter plots](/tutorials/matplotlib-scatter) covers more chart customization options.