When a dataset has many variables that are correlated with each other, you often don't need all of them. For example, a person's height, arm length, and leg length all measure roughly the same underlying thing — overall body size. Principal Component Analysis (PCA) finds new axes (called *principal components*) aligned with the directions of greatest variance in the data. You can then keep the first two or three components and discard the rest, reducing noise and enabling visualisation of high-dimensional data in two dimensions. PCA is used widely in biology, image compression, and finance, and as a preprocessing step before machine learning.

### Generating correlated body measurements

Five body measurements — height, arm length, leg length, shoulder width, and hip width — all share a common driver: overall body size. This creates strong positive correlations between all features.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)  # latent body size factor
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),  # height (cm)
    70 + 4 * size + rng.normal(0, 2, n),    # arm length (cm)
    80 + 5 * size + rng.normal(0, 2, n),    # leg length (cm)
    40 + 3 * size + rng.normal(0, 2, n),    # shoulder width (cm)
    37 + 2 * size + rng.normal(0, 2, n),    # hip width (cm)
])
features = ["Height", "Arm", "Leg", "Shoulder", "Hip"]
df = pd.DataFrame(measurements, columns=features)
print(df.corr().round(2))
```

- `size` is the latent factor: it shifts all five measurements in the same direction, making them correlated. Taller people have longer arms, legs, and wider shoulders.
- Each measurement adds independent Gaussian noise (`rng.normal(0, ..., n)`) on top of the shared size effect, preventing perfect correlation.
- The printed correlation matrix shows all positive values, confirming that the features move together — exactly the scenario where PCA finds a useful low-dimensional representation.

### Standardising and fitting PCA

PCA measures variance, so features with larger numerical ranges dominate if you feed in raw values. `StandardScaler` centres each feature at zero and scales it to unit variance before fitting.
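The transformation itself is nothing exotic: a per-column z-score. As a quick sanity check, this minimal sketch reproduces `StandardScaler` with plain NumPy (scikit-learn uses the population standard deviation, `ddof=0`; the input matrix here is an arbitrary stand-in for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(100, 3))  # any numeric matrix works here

# Manual z-score: centre each column, then divide by its std (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
```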
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])

X_scaled = StandardScaler().fit_transform(measurements)
pca = PCA()
pca.fit(X_scaled)
cumulative = pca.explained_variance_ratio_.cumsum()
print("Component  Variance  Cumulative")
for i, (var, cum) in enumerate(zip(pca.explained_variance_ratio_, cumulative)):
    print(f"  PC{i+1}      {var:.3f}     {cum:.3f}")
```

- `StandardScaler().fit_transform(measurements)` computes the mean and standard deviation of each column from the data and applies the transformation in one step.
- `PCA()` with no arguments fits all five components. `pca.explained_variance_ratio_` contains the fraction of total variance captured by each component, summing to 1.
- The first component should capture around 75–80% of the variance by itself, reflecting how much of the variation in these five measurements is really just one thing: overall body size.

### Scree plot: choosing how many components to keep

A scree plot shows the explained variance per component alongside its cumulative total. The "elbow" in the bars — where adding more components gives diminishing returns — is a common heuristic for deciding how many to retain.
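The same decision can also be automated: passing a float between 0 and 1 as `n_components` tells scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction. A short sketch on the same synthetic data, using the 90% threshold that the plot below marks with a dashed line:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])
X_scaled = StandardScaler().fit_transform(measurements)

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
pca.fit(X_scaled)
print(pca.n_components_, pca.explained_variance_ratio_.sum().round(3))
```

`pca.n_components_` reports how many components were actually retained.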
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])

X_scaled = StandardScaler().fit_transform(measurements)
pca = PCA()
pca.fit(X_scaled)
x = np.arange(1, len(pca.explained_variance_ratio_) + 1)
cumulative = pca.explained_variance_ratio_.cumsum()

fig, ax1 = plt.subplots(figsize=(7, 5))
ax1.bar(x, pca.explained_variance_ratio_, color="steelblue", alpha=0.85, label="Individual")
ax1.set_xlabel("Principal component")
ax1.set_ylabel("Explained variance ratio", color="steelblue")
ax1.tick_params(axis="y", labelcolor="steelblue")

ax2 = ax1.twinx()
ax2.plot(x, cumulative, "o-", color="tomato", linewidth=2, label="Cumulative")
ax2.axhline(0.9, linestyle="--", color="gray", linewidth=1, alpha=0.7)
ax2.set_ylabel("Cumulative explained variance", color="tomato")
ax2.tick_params(axis="y", labelcolor="tomato")
ax2.set_ylim(0, 1.05)

ax1.set_title("Scree plot — body measurements")
plt.tight_layout()
plt.show()
```

- `ax1.twinx()` creates a second y-axis sharing the same x-axis — the left axis shows individual variance (bars) and the right shows cumulative (line).
- The dashed horizontal line at 0.9 marks the "90% of variance explained" threshold — a common retention criterion.
- For this dataset, the elbow is sharp after PC1: that first bar towers over the rest, confirming that most of the structure reduces to a single dimension.

### Visualising the scores in 2D

Projecting the 200 observations onto the first two PCs creates a 2D map. Colouring by the latent body size confirms that PC1 tracks what it should.
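Under the hood, the projection is a linear map: subtract the fitted mean, then multiply by the transposed component matrix. This sketch verifies, on the same synthetic data, that doing it by hand matches what `transform` returns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])
X_scaled = StandardScaler().fit_transform(measurements)

pca = PCA(n_components=2).fit(X_scaled)
scores = pca.transform(X_scaled)

# transform() centres on pca.mean_, then projects onto the components.
manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual))  # True
```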
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])

X_scaled = StandardScaler().fit_transform(measurements)
scores = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(7, 5))
sc = ax.scatter(scores[:, 0], scores[:, 1], c=size, cmap="RdYlGn", alpha=0.75, s=30)
plt.colorbar(sc, ax=ax, label="True body size (latent)")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("PCA scores — each point is one person")
plt.tight_layout()
plt.show()
```

- `PCA(n_components=2).fit_transform(X_scaled)` fits PCA and immediately projects the data, returning an (n, 2) array of scores.
- The colour gradient runs smoothly from red (small) to green (large) along the PC1 axis — confirming that PC1 captures the latent body size factor we built into the data.
- PC2 shows the residual variation not explained by overall size: people can have relatively longer legs versus arms at the same overall body size.

### Reading the component loadings

Loadings tell you how much each original feature contributes to each principal component — they are the weights in the linear combination that defines each PC.
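Because each PC is a linear combination of the features, the map can also be inverted approximately: `inverse_transform` rebuilds the five standardised measurements from the two retained scores, and the small residual quantifies what the discarded components held. A sketch on the same data (the exact error value depends on the random draw, so it is printed rather than assumed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])
X_scaled = StandardScaler().fit_transform(measurements)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Rebuild all five features from the 2D scores; the residual is the
# variance carried by the three discarded components.
recon = pca.inverse_transform(scores)
mse = np.mean((X_scaled - recon) ** 2)
print(round(float(mse), 3))
```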
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])
features = ["Height", "Arm", "Leg", "Shoulder", "Hip"]

X_scaled = StandardScaler().fit_transform(measurements)
pca = PCA(n_components=2)
pca.fit(X_scaled)
loadings = pca.components_.T  # (n_features, n_components)

print(f"{'Feature':>10} {'PC1':>8} {'PC2':>8}")
print("-" * 30)
for name, row in zip(features, loadings):
    print(f"{name:>10} {row[0]:>8.3f} {row[1]:>8.3f}")
```

- `pca.components_` has shape (n_components, n_features); transposing it gives (n_features, n_components) so each row corresponds to one feature.
- A large positive PC1 loading for every feature confirms that PC1 is a "size index" — all five measurements increase together along this axis.
- PC2 loadings will be mixed in sign: features where a person can be relatively long (e.g., legs) load one way, and features associated with width load the other way, capturing body *shape* independent of overall size.

### Conclusion

PCA is most useful when you have many correlated features and want either to visualise the data in fewer dimensions or to remove redundancy before fitting a model. The scree plot guides how many components to keep; the loadings explain what those components actually measure in terms of the original variables. For partial correlation, which addresses correlated features from a different angle, see [partial correlation](/tutorials/partial-correlation). To use the reduced components as input to a regression model, see [multiple linear regression with scikit-learn](/tutorials/multiple-linear-regression-sklearn).
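As a minimal sketch of that last workflow, scikit-learn's `Pipeline` chains the scaler, PCA, and a regression model so the whole preprocessing is fitted together. The target here, `weight`, is a hypothetical extra variable generated from the same latent size factor purely for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
size = rng.normal(0, 1, n)
measurements = np.column_stack([
    170 + 10 * size + rng.normal(0, 3, n),
    70 + 4 * size + rng.normal(0, 2, n),
    80 + 5 * size + rng.normal(0, 2, n),
    40 + 3 * size + rng.normal(0, 2, n),
    37 + 2 * size + rng.normal(0, 2, n),
])
# Hypothetical target: weight driven mostly by the same latent size factor.
weight = 70 + 8 * size + rng.normal(0, 3, n)

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("reg", LinearRegression()),
])
model.fit(measurements, weight)
print(round(model.score(measurements, weight), 3))  # in-sample R^2
```

Fitting the scaler and PCA inside the pipeline keeps them from leaking information when you later add cross-validation or a train/test split.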