A probability distribution describes the likelihood of different outcomes for a random variable. Working with distributions comes up constantly in data analysis: computing p-values, building confidence intervals, simulating data, or checking whether observed data follows a known distribution. `scipy.stats` gives every distribution a consistent interface — `rvs` to sample, `pdf`/`cdf` to evaluate probabilities, `ppf` to invert the CDF, and `fit` to estimate parameters from data. The same interface works across distributions (discrete ones use `pmf` in place of `pdf`).

### Sampling and Visualizing Distributions

Every distribution object in `scipy.stats` has an `rvs` method (random variates) that draws samples. Plotting the samples alongside the theoretical PDF is a quick sanity check that your parameters make sense.
```python
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

fig, axes = plt.subplots(1, 3, figsize=(13, 4))
distributions = [
    (stats.norm(loc=0, scale=1), "Normal(0, 1)", np.linspace(-4, 4, 200)),
    (stats.expon(scale=2), "Exponential(λ=0.5)", np.linspace(0, 12, 200)),
    (stats.beta(a=2, b=5), "Beta(2, 5)", np.linspace(0, 1, 200)),
]
for ax, (dist, name, x) in zip(axes, distributions):
    samples = dist.rvs(size=1000, random_state=rng)
    ax.hist(samples, bins=40, density=True, color="steelblue", alpha=0.5, label="Samples")
    ax.plot(x, dist.pdf(x), color="crimson", linewidth=2, label="PDF")
    ax.set_title(name)
    ax.set_xlabel("x")
    ax.legend(fontsize=8)
axes[0].set_ylabel("Density")
plt.tight_layout()
plt.show()
```

- `dist.rvs(size=1000, random_state=rng)` draws 1000 samples; passing `random_state` ensures reproducibility.
- `density=True` in `hist` normalizes the histogram to a probability density so it can be compared directly to the PDF.
- The same pattern works for any `scipy.stats` distribution — just swap the distribution object and x range.

### PDF, CDF, and the Percent-Point Function

The PDF gives the probability density at a point; the CDF gives the cumulative probability P(X ≤ x); and the PPF (percent-point function, or quantile function) inverts the CDF: given a probability p, it returns the value x such that P(X ≤ x) = p.
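Because the PPF and CDF are inverses, a quick numeric round trip confirms the relationship; a minimal check using the standard normal:

```python
import scipy.stats as stats

dist = stats.norm(loc=0, scale=1)

# PPF then CDF should recover the original probability...
p = 0.975
x = dist.ppf(p)  # ≈ 1.96
assert abs(dist.cdf(x) - p) < 1e-9

# ...and CDF then PPF should recover the original value
x0 = 1.2345
assert abs(dist.ppf(dist.cdf(x0)) - x0) < 1e-9
```

The same round trip holds for any continuous distribution in `scipy.stats`.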
```python
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

dist = stats.norm(loc=0, scale=1)
x = np.linspace(-4, 4, 300)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].plot(x, dist.pdf(x), color="steelblue", linewidth=2)
axes[0].set_title("PDF — Probability Density")
axes[0].set_xlabel("x")
axes[0].set_ylabel("Density")
axes[1].plot(x, dist.cdf(x), color="crimson", linewidth=2)
axes[1].axhline(0.975, color="gray", linestyle="--", linewidth=1)
axes[1].axvline(dist.ppf(0.975), color="gray", linestyle="--", linewidth=1)
axes[1].set_title("CDF — Cumulative Probability")
axes[1].set_xlabel("x")
axes[1].set_ylabel("P(X ≤ x)")
plt.tight_layout()
plt.show()

print(f"P(X ≤ 1.96) = {dist.cdf(1.96):.4f}")
print(f"P(-1.96 ≤ X ≤ 1.96) = {dist.cdf(1.96) - dist.cdf(-1.96):.4f}")
print(f"97.5th percentile = {dist.ppf(0.975):.4f}")
```

- `dist.cdf(1.96)` computes the area to the left of 1.96 under the standard normal — about 0.975.
- `dist.ppf(0.975)` is the inverse: "what x gives CDF = 0.975?" — the answer is ≈ 1.96, the critical value for a 95% confidence interval.
- The dashed lines in the right panel show how PPF inverts the CDF: start at p = 0.975 on the y-axis, read across, and drop down to x ≈ 1.96.

### Critical Values for Hypothesis Tests

The PPF is essential for computing critical values in hypothesis tests. Different distributions — normal, t, chi-squared, F — each have their own shape, and the critical values differ accordingly.
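The chi-squared and F distributions follow the same PPF pattern as the normal and t examples; a sketch with degrees of freedom chosen purely for illustration:

```python
import scipy.stats as stats

alpha = 0.05

# Chi-squared: upper-tail critical value for a test with df=4
chi2_crit = stats.chi2.ppf(1 - alpha, df=4)

# F: upper-tail critical value with numerator df=3, denominator df=20
f_crit = stats.f.ppf(1 - alpha, dfn=3, dfd=20)

print(f"chi2(df=4) 5% critical value: {chi2_crit:.4f}")
print(f"F(3, 20)   5% critical value: {f_crit:.4f}")
```

Both tests are one-tailed by convention, so the critical value sits at `1 - alpha` rather than `1 - alpha/2`.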
```python
import scipy.stats as stats

# z-scores (normal) for common confidence levels
print("Normal critical values (two-tailed):")
for ci in [0.90, 0.95, 0.99]:
    z = stats.norm.ppf((1 + ci) / 2)
    print(f"  {ci*100:.0f}%  z = ±{z:.4f}")

# t critical values shrink toward z as df increases
print("\nt critical values (two-tailed, df=10):")
for ci in [0.90, 0.95, 0.99]:
    t = stats.t.ppf((1 + ci) / 2, df=10)
    print(f"  {ci*100:.0f}%  t = ±{t:.4f}")
```

- For large samples the t-distribution converges to the normal; at df=10 the 95% critical value is 2.228 vs 1.960 for the normal — a meaningful difference for small samples.
- `stats.norm.ppf(...)` calls the distribution class directly rather than a frozen distribution object; shape parameters such as `df` are passed as arguments to each method call instead.
- For chi-squared and F tests, use `stats.chi2.ppf(p, df)` and `stats.f.ppf(p, dfn, dfd)` with the same pattern.

### Fitting a Distribution to Data

`dist.fit(data)` uses maximum likelihood estimation to find the parameters that best describe observed data. This is the first step in checking whether your data follows a gamma, lognormal, or other parametric distribution.
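The `fit` interface is uniform across distributions. As a minimal sketch before the fuller gamma example, fitting a normal distribution to synthetic data (parameters and sample size chosen arbitrarily) recovers the generating mean and standard deviation:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
data = stats.norm.rvs(loc=5, scale=2, size=2000, random_state=rng)

# norm.fit returns (loc, scale): the MLE mean and standard deviation
loc_hat, scale_hat = stats.norm.fit(data)
print(f"loc ≈ {loc_hat:.3f}, scale ≈ {scale_hat:.3f}")  # close to 5 and 2
```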
```python
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = stats.gamma.rvs(a=3, scale=2, size=500, random_state=rng)

# Fit gamma distribution (fix loc=0 since gamma is non-negative)
a_fit, loc_fit, scale_fit = stats.gamma.fit(data, floc=0)
print(f"Fitted: shape={a_fit:.3f}, scale={scale_fit:.3f}")
print("True:   shape=3.000, scale=2.000")

x = np.linspace(0, 25, 300)
plt.figure(figsize=(7, 4))
plt.hist(data, bins=40, density=True, color="steelblue", alpha=0.5, label="Data")
plt.plot(x, stats.gamma.pdf(x, a=a_fit, loc=loc_fit, scale=scale_fit),
         color="crimson", linewidth=2, label="Fitted Gamma")
plt.title("Maximum Likelihood Fitting: Gamma Distribution")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.tight_layout()
plt.show()
```

- `stats.gamma.fit(data, floc=0)` returns `(shape, loc, scale)` — the `floc=0` keyword fixes the location parameter to 0, since gamma-distributed data is non-negative.
- The fitted shape ≈ 3 and scale ≈ 2 should be close to the true values used to generate the data; the match improves with more samples.
- To compare which distribution fits best, use `stats.kstest(data, 'gamma', args=(a_fit, loc_fit, scale_fit))` to run a Kolmogorov-Smirnov goodness-of-fit test — see the [KS test tutorial](/tutorials/kolmogorov-smirnov-test-with-scipy).

`scipy.stats` distributions are a complete toolkit for probability. Once you've characterized a distribution, you can use it to run [normality tests](/tutorials/normality-tests-with-scipy) or compute [bootstrap confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy).
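The goodness-of-fit check suggested above can be sketched end to end, regenerating the same gamma data and testing the fitted parameters:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
data = stats.gamma.rvs(a=3, scale=2, size=500, random_state=rng)
a_fit, loc_fit, scale_fit = stats.gamma.fit(data, floc=0)

# KS test compares the empirical CDF against the fitted gamma CDF.
# Caveat: estimating parameters from the same data makes the test conservative.
result = stats.kstest(data, "gamma", args=(a_fit, loc_fit, scale_fit))
print(f"KS statistic = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```

A large p-value here means the test finds no evidence against the gamma hypothesis; it does not prove the data is gamma-distributed.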