# Regression Diagnostics with Statsmodels

Fitting a regression model is only the first step. Diagnostics tell you whether assumptions are reasonable and whether a few unusual points are driving your conclusions.

In this tutorial, you will run a practical diagnostics workflow in `statsmodels`.

## Checking regression assumptions with a baseline model

```python
import requests
import pandas as pd
import statsmodels.api as sm

# Download once so all diagnostic steps use the same dataset
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Build and fit baseline model for downstream diagnostics
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()
print(model.summary())
```
```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.51e-11
Time:                        16:48:05   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
This block downloads and saves `mtcars.csv` once and fits the baseline OLS model that all later diagnostics inspect. Using one shared baseline makes each diagnostic result comparable and easier to interpret.

## Residual analysis and residual plots

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot

# Refit baseline model and compute fitted values + residuals
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

fitted = model.fittedvalues
residuals = model.resid

plt.figure(figsize=(8, 5))
plt.scatter(fitted, residuals, alpha=0.8)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.tight_layout()
plt.show()

qqplot(residuals, line="s")
plt.title("Q-Q Plot of Residuals")
plt.tight_layout()
plt.show()
```
The residuals-vs-fitted plot helps detect nonlinearity or non-constant variance, while the Q-Q plot checks how closely the residuals track a normal distribution. Together, these two plots give a fast visual check of key OLS assumptions before you trust inference.

## Heteroskedasticity tests

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Run Breusch-Pagan test on model residuals
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM stat:", bp_stat)
print("Breusch-Pagan LM p-value:", bp_pvalue)
print("Breusch-Pagan F stat:", f_stat)
print("Breusch-Pagan F p-value:", f_pvalue)
```
```
Breusch-Pagan LM stat: 2.0700718723832026
Breusch-Pagan LM p-value: 0.5579911149315504
Breusch-Pagan F stat: 0.6455301438255844
Breusch-Pagan F p-value: 0.5922887042132922
```
Breusch-Pagan tests whether residual variance depends on the predictors; here the large p-values give no evidence of heteroskedasticity. When the test does reject, the usual standard errors are unreliable, and you need robust standard errors or a different model specification.

## Variance Inflation Factor (VIF)

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute VIF for each predictor to quantify collinearity
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])

vif_df = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif_df)
```
```
  feature         VIF
0   const  341.407799
1      hp    4.921956
2      wt    2.530443
3    qsec    2.873804
```
VIF quantifies multicollinearity: it measures how much each coefficient's variance is inflated by correlation with the other predictors. The large VIF on the constant is expected with uncentered predictors and can be ignored; for the predictors, values above roughly 5-10 are a warning sign that coefficient magnitudes and p-values can become unstable.

## Influence and outlier detection

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Use model influence diagnostics to identify high-impact points
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

result = pd.DataFrame({"car": df["rownames"], "cooks_distance": cooks_d})
print(result.sort_values("cooks_distance", ascending=False).head(5))

plt.figure(figsize=(9, 4))
plt.stem(range(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel("Observation index")
plt.ylabel("Cook's distance")
plt.title("Influence by observation")
plt.tight_layout()
plt.show()
```
```
                  car  cooks_distance
16  Chrysler Imperial        0.332849
19     Toyota Corolla        0.191211
30      Maserati Bora        0.130785
17           Fiat 128        0.118508
20      Toyota Corona        0.086320
```
Cook's distance surfaces influential observations, and the stem plot makes high-impact points easy to inspect. Reviewing the top points helps you decide whether unusual rows are valid data, data-entry errors, or cases requiring separate treatment.