
# Kolmogorov-Smirnov Test with SciPy

The Kolmogorov-Smirnov (KS) test compares distributions by measuring the largest absolute difference between cumulative distribution functions. It can be used to compare a sample against a reference distribution (one-sample test) or to compare two samples directly (two-sample test).
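To make that definition concrete, here is a minimal sketch (data and variable names are illustrative) that computes the two-sample statistic by hand from the empirical CDFs; it agrees with SciPy's `ks_2samp` statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(0.5, 1.0, 200)

# Evaluate both empirical CDFs on the pooled sample points and take
# the largest absolute difference between them.
grid = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
ecdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
ks_stat = np.abs(ecdf_a - ecdf_b).max()

print(f"Hand-rolled KS statistic: {ks_stat:.3f}")
```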

### Two-Sample KS Test

```python
import numpy as np
from scipy import stats

np.random.seed(52)

sample_a = np.random.normal(loc=0, scale=1, size=120)
sample_b = np.random.normal(loc=0.6, scale=1.1, size=120)

result = stats.ks_2samp(sample_a, sample_b)

print(f"KS statistic: {result.statistic:.3f}")
print(f"P-value: {result.pvalue:.6f}")
```

Output:

```
KS statistic: 0.200
P-value: 0.016260
```

### Interpreting the Result

```python
import numpy as np
from scipy import stats

np.random.seed(52)

sample_a = np.random.normal(loc=0, scale=1, size=120)
sample_b = np.random.normal(loc=0.6, scale=1.1, size=120)

result = stats.ks_2samp(sample_a, sample_b)

if result.pvalue < 0.05:
    print("Reject the null hypothesis: the samples come from different distributions.")
else:
    print("Fail to reject the null hypothesis: the samples could come from the same distribution.")
```

Output:

```
Reject the null hypothesis: the samples come from different distributions.
```

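Besides the default two-sided test, `ks_2samp` accepts an `alternative` parameter (`"two-sided"`, `"less"`, or `"greater"`) for one-sided comparisons of the CDFs. A brief sketch using the same samples:

```python
import numpy as np
from scipy import stats

np.random.seed(52)

sample_a = np.random.normal(loc=0, scale=1, size=120)
sample_b = np.random.normal(loc=0.6, scale=1.1, size=120)

# Compare the default two-sided test with the one-sided variants.
for alt in ("two-sided", "less", "greater"):
    res = stats.ks_2samp(sample_a, sample_b, alternative=alt)
    print(f"{alt:>9}: statistic={res.statistic:.3f}, p={res.pvalue:.6f}")
```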
### Visualizing the Samples

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(52)

sample_a = np.random.normal(loc=0, scale=1, size=120)
sample_b = np.random.normal(loc=0.6, scale=1.1, size=120)

plt.figure(figsize=(9, 5))
plt.hist(sample_a, bins=18, alpha=0.6, density=True, label="Sample A")
plt.hist(sample_b, bins=18, alpha=0.6, density=True, label="Sample B")
plt.title("Two Samples Compared with KS Test")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()
```

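Because the KS statistic is the largest vertical gap between the two empirical CDFs, a step plot of the ECDFs can make the result more concrete than histograms. A sketch (the ECDF construction here is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(52)

sample_a = np.random.normal(loc=0, scale=1, size=120)
sample_b = np.random.normal(loc=0.6, scale=1.1, size=120)

# Empirical CDFs evaluated on the pooled sample points.
grid = np.sort(np.concatenate([sample_a, sample_b]))
ecdf_a = np.searchsorted(np.sort(sample_a), grid, side="right") / sample_a.size
ecdf_b = np.searchsorted(np.sort(sample_b), grid, side="right") / sample_b.size

# Locate the point where the ECDFs are furthest apart (the KS statistic).
gap = np.abs(ecdf_a - ecdf_b)
i = gap.argmax()

plt.figure(figsize=(9, 5))
plt.step(grid, ecdf_a, where="post", label="ECDF of Sample A")
plt.step(grid, ecdf_b, where="post", label="ECDF of Sample B")
plt.vlines(grid[i], min(ecdf_a[i], ecdf_b[i]), max(ecdf_a[i], ecdf_b[i]),
           color="red", label=f"KS statistic = {gap[i]:.3f}")
plt.xlabel("Value")
plt.ylabel("Cumulative probability")
plt.title("Empirical CDFs and the KS Statistic")
plt.legend()
plt.show()
```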
### One-Sample KS Test Against the Standard Normal

```python
import numpy as np
from scipy import stats

np.random.seed(70)

sample = np.random.normal(loc=0, scale=1, size=100)

# "norm" with no distribution arguments tests against the standard normal N(0, 1).
result = stats.kstest(sample, "norm")

print(f"One-sample KS statistic: {result.statistic:.3f}")
print(f"P-value: {result.pvalue:.6f}")
```

Output:

```
One-sample KS statistic: 0.084
P-value: 0.453159
```

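To test against a normal distribution with a different mean or scale, pass the parameters via `args`. One caveat: if you estimate those parameters from the same sample you are testing, the standard KS p-value is no longer exact (the Lilliefors test corrects for this), so treat that variant as a rough diagnostic. A sketch with illustrative data:

```python
import numpy as np
from scipy import stats

np.random.seed(70)

sample = np.random.normal(loc=5, scale=2, size=100)

# Against a fixed N(5, 2) reference, the standard KS p-value is valid.
fixed = stats.kstest(sample, "norm", args=(5, 2))
print(f"Fixed parameters:     statistic={fixed.statistic:.3f}, p={fixed.pvalue:.6f}")

# Estimating loc/scale from the same data biases the p-value upward;
# use this only as an informal check, not a strict test.
fitted = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
print(f"Estimated parameters: statistic={fitted.statistic:.3f}, p={fitted.pvalue:.6f}")
```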
### Practical Example: Comparing Load Time Distributions

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(83)

before = np.random.lognormal(mean=1.9, sigma=0.25, size=100)
after = np.random.lognormal(mean=1.7, sigma=0.20, size=100)

result = stats.ks_2samp(before, after)

print(f"KS statistic: {result.statistic:.3f}")
print(f"P-value: {result.pvalue:.6f}")

if result.pvalue < 0.05:
    print("Conclusion: load-time distributions differ significantly.")
else:
    print("Conclusion: no significant distribution difference detected.")

plt.figure(figsize=(9, 5))
plt.hist(before, bins=14, alpha=0.6, density=True, label="Before")
plt.hist(after, bins=14, alpha=0.6, density=True, label="After")
plt.xlabel("Load time")
plt.ylabel("Density")
plt.title("Before vs After Load Time Distributions")
plt.legend()
plt.show()
```

Output:

```
KS statistic: 0.380
P-value: 0.000001
Conclusion: load-time distributions differ significantly.
```

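As a sanity check on the p-value, the observed statistic can also be compared against the asymptotic two-sample critical value, D_crit ≈ c(α) · sqrt((n + m) / (n · m)), where c(0.05) ≈ 1.358 is the standard large-sample constant. A quick sketch for the sample sizes above:

```python
import numpy as np

# Asymptotic two-sample critical value at the 5% level:
# D_crit ≈ c(alpha) * sqrt((n + m) / (n * m)), with c(0.05) ≈ 1.358.
n = m = 100
d_crit = 1.358 * np.sqrt((n + m) / (n * m))
print(f"Approximate 5% critical value: {d_crit:.3f}")
# The observed statistic of 0.380 comfortably exceeds this threshold,
# matching the tiny p-value reported above.
```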
### Conclusion

The KS test is useful when you want to compare entire distributions rather than only their means. It works especially well alongside a plot of the sample distributions.