Fitting a regression model is only the first step. Diagnostics tell you whether assumptions are reasonable and whether a few unusual points are driving your conclusions. In this tutorial, you will run a practical diagnostics workflow in `statsmodels`.

## Checking regression assumptions with a baseline model
```python
import requests
import pandas as pd
import statsmodels.api as sm

# Download once so all diagnostic steps use the same dataset
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Build and fit baseline model for downstream diagnostics
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print(model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.51e-11
Time:                        16:48:05   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
This block downloads and saves `mtcars.csv` once and fits the baseline OLS model that all later diagnostics inspect. Using one shared baseline makes each diagnostic result comparable and easier to interpret.

## Residual analysis and residual plots
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot

# Refit the baseline model and compute fitted values + residuals
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
fitted = model.fittedvalues
residuals = model.resid

# Residuals vs fitted: look for curvature or a funnel shape
plt.figure(figsize=(8, 5))
plt.scatter(fitted, residuals, alpha=0.8)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.tight_layout()
plt.show()

# Q-Q plot: points should track the reference line if residuals are normal
qqplot(residuals, line="s")
plt.title("Q-Q Plot of Residuals")
plt.tight_layout()
plt.show()
```

The residuals-vs-fitted plot helps detect nonlinearity or variance patterns, while the Q-Q plot checks how close the residuals are to normal. Together, these two plots give a fast visual check of key OLS assumptions before you trust inference.

## Heteroskedasticity tests
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Run the Breusch-Pagan test on the model residuals
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM stat:", bp_stat)
print("Breusch-Pagan LM p-value:", bp_pvalue)
print("Breusch-Pagan F stat:", f_stat)
print("Breusch-Pagan F p-value:", f_pvalue)
```

```text
Breusch-Pagan LM stat: 2.0700718723832026
Breusch-Pagan LM p-value: 0.5579911149315504
Breusch-Pagan F stat: 0.6455301438255844
Breusch-Pagan F p-value: 0.5922887042132922
```
The Breusch-Pagan test evaluates whether residual variance depends on the predictors; if it does, the usual OLS standard errors are unreliable. Here both p-values are large, so there is no evidence of heteroskedasticity; had the test rejected, you would need robust standard errors or a different model specification.

## Variance Inflation Factor (VIF)
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute VIF for each predictor to quantify collinearity
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
vif_df = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif_df)
```

```text
  feature         VIF
0   const  341.407799
1      hp    4.921956
2      wt    2.530443
3    qsec    2.873804
```
VIF quantifies multicollinearity, helping you detect predictors that inflate coefficient uncertainty. High values (common rules of thumb put the cutoff around 5 to 10) warn that coefficient magnitudes and p-values can become unstable. The large VIF for `const` is an artifact of the uncentered intercept column and can be ignored; here all three predictors stay below 5.

## Influence and outlier detection
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Use model influence diagnostics to identify high-impact points
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

result = pd.DataFrame({"car": df["rownames"], "cooks_distance": cooks_d})
print(result.sort_values("cooks_distance", ascending=False).head(5))

plt.figure(figsize=(9, 4))
plt.stem(range(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel("Observation index")
plt.ylabel("Cook's distance")
plt.title("Influence by observation")
plt.tight_layout()
plt.show()
```

```text
                  car  cooks_distance
16  Chrysler Imperial        0.332849
19     Toyota Corolla        0.191211
30      Maserati Bora        0.130785
17           Fiat 128        0.118508
20      Toyota Corona        0.086320
```
Cook's distance surfaces influential observations, and the stem plot makes large-impact points easy to spot. Reviewing the top points helps you decide whether unusual rows are valid data, errors, or cases requiring separate treatment.