Tutorials

Statistical Distributions with SciPy

A probability distribution describes the likelihood of different outcomes for a random variable. Working with distributions comes up constantly in data analysis: computing p-values, building confidence intervals, simulating data, or checking whether observed data follows a known distribution. `scipy.stats` gives every distribution a consistent interface — `rvs` to sample, `pdf`/`cdf` to evaluate probabilities, `ppf` to invert the CDF, and `fit` to estimate parameters from data. These methods work identically across all of SciPy's distributions.
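As a quick illustration of that shared interface (the three distributions below are arbitrary examples):

```python
import scipy.stats as stats

# The same methods work on any frozen distribution object.
for dist in [stats.norm(loc=0, scale=1), stats.expon(scale=2), stats.beta(a=2, b=5)]:
    samples = dist.rvs(size=5, random_state=0)  # draw 5 random variates
    p = dist.cdf(dist.median())                 # CDF at the median is 0.5
    q = dist.ppf(0.5)                           # PPF(0.5) recovers the median
    print(f"{dist.dist.name}: cdf(median)={p:.3f}, ppf(0.5)={q:.3f}")
```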

### Sampling and Visualizing Distributions

Every distribution object in `scipy.stats` has an `rvs` method (random variates) that draws samples. Plotting the samples alongside the theoretical PDF is a quick sanity check that your parameters make sense.

```python
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

fig, axes = plt.subplots(1, 3, figsize=(13, 4))
distributions = [
    (stats.norm(loc=0, scale=1),       "Normal(0, 1)",       np.linspace(-4, 4, 200)),
    (stats.expon(scale=2),             "Exponential(λ=0.5)", np.linspace(0, 12, 200)),
    (stats.beta(a=2, b=5),             "Beta(2, 5)",         np.linspace(0, 1, 200)),
]
for ax, (dist, name, x) in zip(axes, distributions):
    samples = dist.rvs(size=1000, random_state=rng)
    ax.hist(samples, bins=40, density=True, color="steelblue", alpha=0.5, label="Samples")
    ax.plot(x, dist.pdf(x), color="crimson", linewidth=2, label="PDF")
    ax.set_title(name)
    ax.set_xlabel("x")
    ax.legend(fontsize=8)
axes[0].set_ylabel("Density")
plt.tight_layout()
plt.show()
```
- `dist.rvs(size=1000, random_state=rng)` draws 1000 samples; passing `random_state` ensures reproducibility.
- `density=True` in `hist` normalizes the histogram to a probability density so it can be compared directly to the PDF.
- The same pattern works for any `scipy.stats` distribution — just swap the distribution object and x range.
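Discrete distributions follow the same pattern, except they expose `pmf` (probability mass function) in place of `pdf`. A minimal sketch with a Poisson:

```python
import numpy as np
import scipy.stats as stats

dist = stats.poisson(mu=4)  # discrete distribution with mean 4
samples = dist.rvs(size=1000, random_state=np.random.default_rng(0))

k = np.arange(0, 10)
print("P(X = k): ", np.round(dist.pmf(k), 4))  # pmf replaces pdf
print("P(X <= 4):", round(dist.cdf(4), 4))     # cdf, ppf, rvs work unchanged
```

When plotting, pair `pmf` evaluated at integer points with a histogram whose bin edges fall between integers (e.g. `bins=np.arange(-0.5, 15.5)`).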

### PDF, CDF, and the Percent-Point Function

The PDF gives the probability density at a point; the CDF gives the cumulative probability P(X ≤ x); and the PPF (percent-point function, or quantile function) inverts the CDF: given a probability p, it returns the value x such that P(X ≤ x) = p.

```python
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

dist = stats.norm(loc=0, scale=1)
x = np.linspace(-4, 4, 300)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].plot(x, dist.pdf(x), color="steelblue", linewidth=2)
axes[0].set_title("PDF — Probability Density")
axes[0].set_xlabel("x")
axes[0].set_ylabel("Density")

axes[1].plot(x, dist.cdf(x), color="crimson", linewidth=2)
axes[1].axhline(0.975, color="gray", linestyle="--", linewidth=1)
axes[1].axvline(dist.ppf(0.975), color="gray", linestyle="--", linewidth=1)
axes[1].set_title("CDF — Cumulative Probability")
axes[1].set_xlabel("x")
axes[1].set_ylabel("P(X ≤ x)")
plt.tight_layout()
plt.show()

print(f"P(X ≤ 1.96)     = {dist.cdf(1.96):.4f}")
print(f"P(-1.96 ≤ X ≤ 1.96) = {dist.cdf(1.96) - dist.cdf(-1.96):.4f}")
print(f"97.5th percentile   = {dist.ppf(0.975):.4f}")
```
```
P(X ≤ 1.96)     = 0.9750
P(-1.96 ≤ X ≤ 1.96) = 0.9500
97.5th percentile   = 1.9600
```
- `dist.cdf(1.96)` computes the area to the left of 1.96 under the standard normal — about 0.975.
- `dist.ppf(0.975)` is the inverse: "what x gives CDF = 0.975?" — the answer is ≈ 1.96, the critical value for a 95% confidence interval.
- The dashed lines in the right panel show how PPF inverts the CDF: start at p = 0.975 on the y-axis, read across, and drop down to x ≈ 1.96.
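The same inversion can be verified numerically with a round trip:

```python
import numpy as np
import scipy.stats as stats

dist = stats.norm(loc=0, scale=1)
x = np.linspace(-3, 3, 7)

roundtrip = dist.ppf(dist.cdf(x))  # ppf(cdf(x)) should return x
print(np.allclose(roundtrip, x))   # True (up to floating-point error)
```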

### Critical Values for Hypothesis Tests

The PPF is essential for computing critical values in hypothesis tests. Different distributions — normal, t, chi-squared, F — each have their own shape, and the critical values differ accordingly.

```python
import scipy.stats as stats

# z-scores (normal) for common confidence levels
print("Normal critical values (two-tailed):")
for ci in [0.90, 0.95, 0.99]:
    z = stats.norm.ppf((1 + ci) / 2)
    print(f"  {ci*100:.0f}%  z = ±{z:.4f}")

# t critical values shrink toward z as df increases
print("\nt critical values (two-tailed, df=10):")
for ci in [0.90, 0.95, 0.99]:
    t = stats.t.ppf((1 + ci) / 2, df=10)
    print(f"  {ci*100:.0f}%  t = ±{t:.4f}")
```
```
Normal critical values (two-tailed):
  90%  z = ±1.6449
  95%  z = ±1.9600
  99%  z = ±2.5758

t critical values (two-tailed, df=10):
  90%  t = ±1.8125
  95%  t = ±2.2281
  99%  t = ±3.1693
```
- As the degrees of freedom increase, the t-distribution converges to the normal; at df=10 the 95% critical value is 2.228 vs 1.960 for the normal — a meaningful difference for small samples.
- `stats.norm.ppf(...)` and `stats.t.ppf(..., df=10)` call the distribution classes directly, without creating a frozen distribution object first; shape parameters such as `df` are passed as keyword arguments.
- For chi-squared and F tests, use `stats.chi2.ppf(p, df)` and `stats.f.ppf(p, dfn, dfd)` with the same pattern.
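For instance, upper-tail critical values at α = 0.05 (the degrees of freedom below are arbitrary examples):

```python
import scipy.stats as stats

alpha = 0.05

# Chi-squared with 4 degrees of freedom (one-tailed)
chi2_crit = stats.chi2.ppf(1 - alpha, df=4)
print(f"chi2(df=4): {chi2_crit:.4f}")  # ≈ 9.4877

# F with 3 and 20 degrees of freedom (one-tailed)
f_crit = stats.f.ppf(1 - alpha, dfn=3, dfd=20)
print(f"F(3, 20):   {f_crit:.4f}")
```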

### Fitting a Distribution to Data

`dist.fit(data)` uses maximum likelihood estimation to find the parameters that best describe observed data. This is how you'd determine whether your data follows a gamma, lognormal, or other parametric distribution.

```python
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = stats.gamma.rvs(a=3, scale=2, size=500, random_state=rng)

# Fit gamma distribution (fix loc=0 since gamma is non-negative)
a_fit, loc_fit, scale_fit = stats.gamma.fit(data, floc=0)
print(f"Fitted:  shape={a_fit:.3f}, scale={scale_fit:.3f}")
print(f"True:    shape=3.000, scale=2.000")

x = np.linspace(0, 25, 300)
plt.figure(figsize=(7, 4))
plt.hist(data, bins=40, density=True, color="steelblue", alpha=0.5, label="Data")
plt.plot(x, stats.gamma.pdf(x, a=a_fit, loc=loc_fit, scale=scale_fit),
         color="crimson", linewidth=2, label="Fitted Gamma")
plt.title("Maximum Likelihood Fitting: Gamma Distribution")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.tight_layout()
plt.show()
```
```
Fitted:  shape=3.354, scale=1.740
True:    shape=3.000, scale=2.000
```
- `stats.gamma.fit(data, floc=0)` returns `(shape, loc, scale)` — the `floc=0` keyword fixes the location parameter to 0, since gamma-distributed data is non-negative.
- The fitted shape ≈ 3 and scale ≈ 2 should be close to the true values used to generate the data; the match improves with more samples.
- To compare which distribution fits best, use `stats.kstest(data, 'gamma', args=(a_fit, loc_fit, scale_fit))` to run a Kolmogorov-Smirnov goodness-of-fit test — see the [KS test tutorial](/tutorials/kolmogorov-smirnov-test-with-scipy).
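A minimal sketch of that goodness-of-fit check, regenerating the same synthetic gamma data as above:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
data = stats.gamma.rvs(a=3, scale=2, size=500, random_state=rng)

a_fit, loc_fit, scale_fit = stats.gamma.fit(data, floc=0)
result = stats.kstest(data, "gamma", args=(a_fit, loc_fit, scale_fit))
print(f"KS statistic = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
# A large p-value means the data is consistent with the fitted gamma
```

One caveat: because the parameters were estimated from the same sample being tested, the standard KS p-value is optimistic. Recent SciPy versions (1.10+) provide `stats.goodness_of_fit`, which accounts for estimated parameters via Monte Carlo resampling.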

`scipy.stats` distributions are a complete toolkit for probability. Once you've characterized a distribution, you can use it to run [normality tests](/tutorials/normality-tests-with-scipy) or compute [bootstrap confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy).