A [histogram](/tutorials/seaborn-histogram) divides data into fixed bins and counts how many observations fall in each — the result depends on where you draw the bin edges and how wide you make them. Kernel density estimation (KDE) avoids both problems by placing a small smooth curve (the kernel) on each data point and averaging them. The result is a continuous density estimate that doesn't depend on arbitrary bin choices. This makes KDE especially useful for visualizing multimodal distributions (data with multiple peaks), comparing groups on the same axis, or estimating the relative likelihood of values near a specific point. The key parameter is bandwidth, which controls how much the density is smoothed.
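To make the idea concrete before reaching for scipy, here is a minimal hand-rolled version. The `manual_kde` helper and the fixed bandwidth `h=0.4` are illustrative choices for this sketch, not part of any library API:

```python
import numpy as np

def manual_kde(x, data, h=0.4):
    """Average a Gaussian kernel of width h centered on each observation."""
    # Shape (n_obs, n_eval): one kernel per data point, evaluated at every x
    kernels = np.exp(-0.5 * ((x - data[:, None]) / h) ** 2)
    kernels /= h * np.sqrt(2 * np.pi)  # normalize each kernel to unit area
    return kernels.mean(axis=0)        # averaging yields a valid density

data = np.array([-2.1, -1.9, -2.0, 1.8, 2.2, 2.0])
x = np.linspace(-4, 4, 9)
print(np.round(manual_kde(x, data), 3))
```

`stats.gaussian_kde` performs the same kind of kernel averaging, but chooses the bandwidth from the data.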
### Basic KDE

`stats.gaussian_kde` fits a KDE to your data and returns a callable object you can evaluate at any point to get the estimated density.

```python
import numpy as np
from scipy import stats
np.random.seed(90)
# Bimodal sample: two Gaussian clusters centered at -2 and 2
data = np.concatenate([
    np.random.normal(-2, 0.8, 200),
    np.random.normal(2, 0.8, 200),
])
# Fit the KDE; bandwidth is chosen by Scott's rule by default
kde = stats.gaussian_kde(data)
print("Density at x = 0:", round(float(kde([0])[0]), 4))
print("Density at x = 2:", round(float(kde([2])[0]), 4))- `np.concatenate([...])` creates a bimodal dataset with two peaks at -2 and 2 — this is where KDE shows its advantage over histograms. - `stats.gaussian_kde(data)` estimates the density using Scott's rule for bandwidth by default. - Calling `kde([0])` evaluates the density at x=0, which is between the two peaks and should be near zero. ### KDE with a Histogram Overlaying a KDE curve on a histogram gives you both the raw binned counts and a smooth summary of the distribution shape in a single plot.
### KDE with a Histogram

Overlaying a KDE curve on a histogram gives you both the raw binned counts and a smooth summary of the distribution shape in a single plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
np.random.seed(90)
data = np.concatenate([
    np.random.normal(-2, 0.8, 200),
    np.random.normal(2, 0.8, 200),
])
kde = stats.gaussian_kde(data)
x_eval = np.linspace(data.min() - 1, data.max() + 1, 200)
plt.figure(figsize=(9, 5))
# density=True puts the bars on the same density scale as the KDE curve
plt.hist(data, bins=30, density=True, alpha=0.6, label="Histogram")
plt.plot(x_eval, kde(x_eval), color="crimson", linewidth=2, label="KDE")
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Kernel Density Estimate")
plt.legend()
plt.show()
```

- `density=True` on the histogram normalizes bin heights to density so the y-axis matches the KDE curve — without this, the scales would be incompatible (the check below confirms both areas are close to 1).
- `kde(x_eval)` evaluates the density at 200 evenly spaced points, producing a smooth curve.
- The two peaks at -2 and 2 should be clearly visible in both the histogram and the KDE — a histogram with fewer bins might merge them into one.
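Because both curves are plotted as densities, each should enclose approximately unit area. Here is a quick sanity check, a sketch reusing the same data, with a simple Riemann sum standing in for proper numerical integration:

```python
import numpy as np
from scipy import stats

np.random.seed(90)
data = np.concatenate([
    np.random.normal(-2, 0.8, 200),
    np.random.normal(2, 0.8, 200),
])
kde = stats.gaussian_kde(data)
x_eval = np.linspace(data.min() - 1, data.max() + 1, 200)

# Riemann-sum approximation of the area under the KDE curve
dx = x_eval[1] - x_eval[0]
print("KDE area:", round(float(np.sum(kde(x_eval)) * dx), 3))

# With density=True, bin heights times bin widths sum to 1
heights, edges = np.histogram(data, bins=30, density=True)
print("Histogram area:", round(float(np.sum(heights * np.diff(edges))), 3))
```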
### Bandwidth Effects

Bandwidth controls how much each data point influences its neighbors. Too narrow and the density is jagged (overfitting individual points); too wide and distinct features like multiple peaks are smoothed away.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
np.random.seed(90)
data = np.concatenate([
    np.random.normal(-2, 0.8, 200),
    np.random.normal(2, 0.8, 200),
])
x_eval = np.linspace(data.min() - 1, data.max() + 1, 200)
kde_default = stats.gaussian_kde(data)  # Scott's rule picks the factor
# A scalar bw_method is used directly as the bandwidth factor
kde_narrow = stats.gaussian_kde(data, bw_method=0.2)
kde_wide = stats.gaussian_kde(data, bw_method=0.5)
plt.figure(figsize=(9, 5))
plt.plot(x_eval, kde_default(x_eval), label="Default bandwidth")
plt.plot(x_eval, kde_narrow(x_eval), label="Narrow bandwidth (0.2)")
plt.plot(x_eval, kde_wide(x_eval), label="Wide bandwidth (0.5)")
plt.title("Effect of KDE Bandwidth")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()
```

- `bw_method=0.2` applies a narrow bandwidth (a scalar is used directly as the factor multiplying the sample standard deviation) — the curve tracks individual data clusters closely and may show spurious small bumps.
- `bw_method=0.5` applies a wide bandwidth — the two peaks may merge into one, losing the bimodal structure.
- The default (Scott's rule) is a data-driven bandwidth that balances smoothness and detail — it's a good starting point, but you should visually verify it captures the structure you care about (the sketch below prints all three factors for comparison).
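To see where the default lands relative to the two manual choices, print each estimator's `factor` attribute (a sketch reusing the data above; for 400 points Scott's rule gives 400 ** (-1/5), roughly 0.30, between the two manual values):

```python
import numpy as np
from scipy import stats

np.random.seed(90)
data = np.concatenate([
    np.random.normal(-2, 0.8, 200),
    np.random.normal(2, 0.8, 200),
])

for label, bw in [("default", None), ("narrow", 0.2), ("wide", 0.5)]:
    kde = stats.gaussian_kde(data, bw_method=bw)
    # factor is the multiplier applied to the sample standard deviation
    print(f"{label:>7}: factor = {kde.factor:.3f}")
```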
### Practical Example: Exam Score Distribution

KDE is useful for visualizing grade distributions where you suspect multiple clusters — for instance, students who struggled and students who excelled.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
np.random.seed(101)
# Two clusters of students: one centered near 65, another near 82
scores = np.concatenate([
    np.random.normal(65, 6, 120),
    np.random.normal(82, 4, 80),
])
kde = stats.gaussian_kde(scores)
x_eval = np.linspace(scores.min() - 5, scores.max() + 5, 250)
plt.figure(figsize=(9, 5))
plt.hist(scores, bins=20, density=True, alpha=0.55, label="Scores")
plt.plot(x_eval, kde(x_eval), color="darkgreen", linewidth=2.5, label="KDE")
plt.xlabel("Score")
plt.ylabel("Density")
plt.title("Exam Score Density Estimate")
plt.legend()
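# Added sketch (not part of the original example): quantify the bullet
# points below by counting the KDE's modes with scipy.signal.find_peaks
# and comparing the densities at two specific scores.
from scipy.signal import find_peaks
peak_idx, _ = find_peaks(kde(x_eval))
print("Estimated mode locations:", np.round(x_eval[peak_idx], 1))
print("Density at 75 relative to 65:", round(float(kde([75])[0] / kde([65])[0]), 3))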
plt.show()
```

- Two groups of students (struggling around 65, succeeding around 82) create a bimodal distribution — the KDE makes both peaks visible even if the histogram bins happen to blur them; the `find_peaks` check at the end of the script counts the modes directly.
- The KDE curve lets you estimate the relative density at any specific score — for example, how common scores around 75 are compared to scores around 65.
- A unimodal KDE (single peak) on this kind of data might suggest the two groups aren't as separate as expected, which would be worth investigating.

### Conclusion

KDE is a versatile tool for visualizing distributions without committing to bin widths or distribution families. Use it when you want a smooth, interpretable density plot — especially when the data might be multimodal or when you're comparing multiple groups on the same axis. For a simpler binned view, see [histograms with Seaborn](/tutorials/seaborn-histogram). For formally testing whether two distributions differ, see the [Kolmogorov-Smirnov test](/tutorials/kolmogorov-smirnov-test-with-scipy).