When you give 300 people a 12-question personality survey, you don't expect 12 independent sources of variation. Questions about being "talkative", "energetic", and "sociable" all reflect the same underlying trait — extraversion. Factor analysis finds these hidden (*latent*) variables and tells you which survey questions tap into each one. Unlike [PCA](/tutorials/principal-component-analysis), which finds axes of maximum variance without assuming any underlying structure, factor analysis fits an explicit model: each observed variable equals a weighted combination of shared latent factors plus unique noise specific to that item. The result is a *loading matrix* that maps cleanly onto the theoretical constructs you are trying to measure.

### Simulating a personality survey

Twelve items measure three latent personality traits — extraversion, conscientiousness, and openness — with four items per trait. Each item is a noisy version of one latent factor, which creates strong within-group correlations and near-zero between-group correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
noise = 0.6  # unique noise per item

extrav = rng.normal(0, 1, n)
conscient = rng.normal(0, 1, n)
openness = rng.normal(0, 1, n)

items = np.column_stack([
    extrav + rng.normal(0, noise, n),     # E1
    extrav + rng.normal(0, noise, n),     # E2
    extrav + rng.normal(0, noise, n),     # E3
    extrav + rng.normal(0, noise, n),     # E4
    conscient + rng.normal(0, noise, n),  # C1
    conscient + rng.normal(0, noise, n),  # C2
    conscient + rng.normal(0, noise, n),  # C3
    conscient + rng.normal(0, noise, n),  # C4
    openness + rng.normal(0, noise, n),   # O1
    openness + rng.normal(0, noise, n),   # O2
    openness + rng.normal(0, noise, n),   # O3
    openness + rng.normal(0, noise, n),   # O4
])

names = [f"E{i}" for i in range(1, 5)] + \
        [f"C{i}" for i in range(1, 5)] + \
        [f"O{i}" for i in range(1, 5)]

corr = pd.DataFrame(items, columns=names).corr().round(2)
print(corr)
```

- Each item is `latent_factor + noise`, so two items sharing the same factor correlate at roughly `1 / (1 + noise²)` ≈ 0.74. Items from different factors are uncorrelated.
- The printed matrix should show a visible 3×3 block structure: the top-left 4×4 block (E items), the middle 4×4 block (C items), and the bottom-right 4×4 block (O items) each have high internal correlations, while the off-diagonal blocks are near zero.
- Factor analysis is designed to recover this block structure from the correlation matrix — without being told in advance that there are three groups of four.

### Fitting Factor Analysis with varimax rotation

`sklearn.decomposition.FactorAnalysis` fits the model using maximum likelihood. Varimax rotation then maximises the variance of squared loadings within each factor, making the solution easier to interpret by pushing each item towards loading strongly on one factor and weakly on the others.
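Before fitting, it helps to see what the varimax criterion actually measures. A minimal sketch, using a hypothetical `varimax_criterion` helper (not part of scikit-learn): the criterion is the summed variance of squared loadings, and it is largest when each item loads on exactly one factor.

```python
import numpy as np

def varimax_criterion(L):
    # Summed variance of the squared loadings, column by column:
    # high when each factor has a few large and many near-zero loadings.
    return float(np.sum((L**2).var(axis=0)))

# Simple structure: each item loads on exactly one factor
simple = np.array([[0.9, 0.0],
                   [0.9, 0.0],
                   [0.0, 0.9],
                   [0.0, 0.9]])

# The same loadings rotated by 45 degrees, smearing every item across both factors
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
mixed = simple @ R

print(varimax_criterion(simple) > varimax_criterion(mixed))  # True
```

Because the rotation is orthogonal, both matrices explain the data equally well; varimax simply picks the orientation with the cleanest simple structure.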
```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
noise = 0.6

extrav = rng.normal(0, 1, n)
conscient = rng.normal(0, 1, n)
openness = rng.normal(0, 1, n)

items = np.column_stack([
    extrav + rng.normal(0, noise, n),
    extrav + rng.normal(0, noise, n),
    extrav + rng.normal(0, noise, n),
    extrav + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n),
])

X = StandardScaler().fit_transform(items)
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=42)
fa.fit(X)

loadings = fa.components_.T  # (n_items, n_factors)
sum_sq = np.sum(loadings**2, axis=0)

names = [f"E{i}" for i in range(1, 5)] + \
        [f"C{i}" for i in range(1, 5)] + \
        [f"O{i}" for i in range(1, 5)]

print(f"{'Item':>5} {'F1':>7} {'F2':>7} {'F3':>7}")
print("-" * 35)
for name, row in zip(names, loadings):
    print(f"{name:>5} {row[0]:>7.3f} {row[1]:>7.3f} {row[2]:>7.3f}")
print("-" * 35)
print(f"{'SS':>5} {sum_sq[0]:>7.3f} {sum_sq[1]:>7.3f} {sum_sq[2]:>7.3f}")
```

- `StandardScaler().fit_transform(items)` matters: `FactorAnalysis` is fitted to covariances, and standardising puts every item on the same scale, so the model is effectively fitted to the correlation matrix and each item contributes equally regardless of its raw units.
- `fa.components_` has shape (n_factors, n_items); transposing gives the conventional loadings layout with items as rows and factors as columns.
- The sum-of-squared loadings (SS) row at the bottom measures how much total variance each factor accounts for — analogous to an eigenvalue. Factors with higher SS capture more shared variance.

### Visualising the loading matrix

A heatmap of the loading matrix immediately shows the block structure: red/blue cells where items load strongly on their intended factor, and white cells where they don't.
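The block structure can also be read off programmatically, without a plot, by assigning each item to its largest absolute loading. A sketch using a small, made-up loadings matrix (the values below are illustrative, not the fitted output):

```python
import numpy as np

# Made-up loadings (items x factors) resembling a rotated solution
loadings = np.array([
    [0.85,  0.05, -0.02],  # E-type item
    [0.80, -0.03,  0.06],
    [0.04,  0.82,  0.01],  # C-type item
    [-0.02, 0.79,  0.05],
    [0.03,  0.02,  0.84],  # O-type item
    [0.05, -0.01,  0.80],
])
names = ["E1", "E2", "C1", "C2", "O1", "O2"]

assigned = np.argmax(np.abs(loadings), axis=1)    # dominant factor per item
meaningful = np.abs(loadings).max(axis=1) > 0.40  # common threshold rule
for name, factor, ok in zip(names, assigned, meaningful):
    print(f"{name}: factor {factor + 1}" + ("" if ok else " (weak)"))
```

On a clean solution every item is assigned to its intended factor; items flagged as weak are the ones a heatmap would show as a row of washed-out cells.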
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
noise = 0.6

extrav = rng.normal(0, 1, n)
conscient = rng.normal(0, 1, n)
openness = rng.normal(0, 1, n)

items = np.column_stack([
    extrav + rng.normal(0, noise, n), extrav + rng.normal(0, noise, n),
    extrav + rng.normal(0, noise, n), extrav + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n), conscient + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n), conscient + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n), openness + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n), openness + rng.normal(0, noise, n),
])
names = [f"E{i}" for i in range(1, 5)] + \
        [f"C{i}" for i in range(1, 5)] + \
        [f"O{i}" for i in range(1, 5)]

X = StandardScaler().fit_transform(items)
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=42)
fa.fit(X)
loadings = fa.components_.T

fig, ax = plt.subplots(figsize=(6, 8))
im = ax.imshow(loadings, cmap="RdBu_r", vmin=-1, vmax=1, aspect="auto")
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(["Factor 1", "Factor 2", "Factor 3"])
ax.set_yticks(range(12))
ax.set_yticklabels(names)
plt.colorbar(im, ax=ax, label="Loading")
ax.set_title("Factor loadings — varimax rotation")
for i in range(12):
    for j in range(3):
        val = loadings[i, j]
        color = "white" if abs(val) > 0.5 else "black"
        ax.text(j, i, f"{val:.2f}", ha="center", va="center", fontsize=8, color=color)
plt.tight_layout()
plt.show()
```

- A well-rotated solution shows three columns that each have four large values and eight near-zero values — one column per latent trait.
- `cmap="RdBu_r"` maps large positive loadings to dark red, large negative loadings to dark blue, and near-zero loadings to white — the white cells are what you want for the items that belong to a *different* factor.
- Annotating each cell with its numeric value lets you apply the common threshold rule: absolute loadings above 0.40 (or sometimes 0.30) are considered meaningful.

### Communalities: how well do items measure the factors?

A *communality* is the proportion of an item's variance that is explained by all factors together. High communality means the item is well-captured by the factor structure; low communality suggests it carries mostly unique, item-specific variance.
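For this simulated survey the communalities have a known target. Each item is `latent_factor + noise` with a unit-variance factor, so after standardising, the shared variance is `1 / (1 + noise²)`, and the fitted communalities should cluster near this value:

```python
noise = 0.6  # same value as in the simulation

# Item model: x = f + e with Var(f) = 1 and Var(e) = noise**2.
# After standardising, the shared (communal) variance of each item is:
expected_communality = 1 / (1 + noise**2)
expected_uniqueness = 1 - expected_communality

print(round(expected_communality, 3))  # 0.735
print(round(expected_uniqueness, 3))   # 0.265
```

This closed-form target is a property of the simulation, not something available for real survey data, but it gives a useful benchmark for the bar chart below.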
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
noise = 0.6

extrav = rng.normal(0, 1, n)
conscient = rng.normal(0, 1, n)
openness = rng.normal(0, 1, n)

items = np.column_stack([
    extrav + rng.normal(0, noise, n), extrav + rng.normal(0, noise, n),
    extrav + rng.normal(0, noise, n), extrav + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n), conscient + rng.normal(0, noise, n),
    conscient + rng.normal(0, noise, n), conscient + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n), openness + rng.normal(0, noise, n),
    openness + rng.normal(0, noise, n), openness + rng.normal(0, noise, n),
])
names = [f"E{i}" for i in range(1, 5)] + \
        [f"C{i}" for i in range(1, 5)] + \
        [f"O{i}" for i in range(1, 5)]

X = StandardScaler().fit_transform(items)
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=42)
fa.fit(X)
loadings = fa.components_.T

communalities = np.sum(loadings**2, axis=1)
uniqueness = 1 - communalities

fig, ax = plt.subplots(figsize=(10, 4))
x = np.arange(12)
ax.bar(x, communalities, color="steelblue", label="Communality (shared variance)")
ax.bar(x, uniqueness, bottom=communalities, color="lightcoral", label="Uniqueness (item-specific)")
ax.set_xticks(x)
ax.set_xticklabels(names)
ax.axhline(0.4, color="black", linestyle="--", linewidth=1, label="Communality = 0.4 threshold")
ax.set_ylabel("Proportion of item variance")
ax.set_title("Communalities and uniqueness per survey item")
ax.legend()
plt.tight_layout()
plt.show()
```

- `communalities = np.sum(loadings**2, axis=1)` sums squared loadings across factors for each item — this is the standard communality formula.
- `uniqueness = 1 - communalities` is the proportion of variance not explained by any factor. `fa.noise_variance_` contains the model's own uniqueness estimates, which will closely match.
- Items with communality below 0.40 (dashed line) are weak indicators — they share little variance with the factors and might be candidates for removal from the survey.

### Conclusion

Factor analysis is the right tool when you have a theoretical expectation that observed variables reflect underlying constructs: intelligence, brand loyalty, clinical symptoms. Varimax rotation makes the factor structure interpretable by concentrating each item's loading on one factor, and communalities tell you which items are worth keeping. For variance-based dimensionality reduction without a latent-variable model, see [principal component analysis](/tutorials/principal-component-analysis). For regression with many correlated predictors, see [multiple linear regression with scikit-learn](/tutorials/multiple-linear-regression-sklearn).
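As a closing sanity check on the model from the introduction, the fitted loadings and uniquenesses should approximately reproduce the observed correlations: the implied matrix is `loadings @ loadings.T + diag(noise_variance_)`. A sketch reusing the simulation above (the item construction is compacted into a comprehension, which draws the same random values in the same order):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
noise = 0.6
extrav = rng.normal(0, 1, n)
conscient = rng.normal(0, 1, n)
openness = rng.normal(0, 1, n)
factors = [extrav] * 4 + [conscient] * 4 + [openness] * 4
items = np.column_stack([f + rng.normal(0, noise, n) for f in factors])

X = StandardScaler().fit_transform(items)
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=42)
fa.fit(X)

loadings = fa.components_.T  # Lambda: (12 items, 3 factors)
# Model-implied correlation matrix: Lambda @ Lambda.T + diag(Psi)
implied = loadings @ loadings.T + np.diag(fa.noise_variance_)
sample = np.corrcoef(X, rowvar=False)

# Residuals should be small; what remains is mostly sampling noise.
print(round(float(np.abs(implied - sample).mean()), 3))
```

Because varimax is an orthogonal rotation, `loadings @ loadings.T` is unchanged by the rotation, so this check works on the rotated loadings directly.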