Regression Diagnostics with Statsmodels

Fitting a [regression model](/tutorials/statsmodels-linear-regression) gives you coefficient estimates and p-values, but those numbers are only trustworthy if the model's assumptions hold. OLS inference assumes that residuals are roughly normally distributed and have constant variance (homoskedasticity), and a sound fit should not let any single observation disproportionately control the result. When these assumptions are violated, standard errors can be wrong, p-values become unreliable, and one unusual data point can pull the entire fitted line off course. Diagnostics are the tool for catching these problems before you act on the results. A good diagnostics workflow covers four questions: are residuals well-behaved? Is variance constant? Are predictors too correlated? Are any observations unusually influential?

## Checking regression assumptions with a baseline model

First we fit the same [OLS model from the linear regression tutorial](/tutorials/statsmodels-linear-regression) — using `hp`, `wt`, and `qsec` to predict `mpg` in the `mtcars` dataset. All later diagnostic steps inspect this model.

import requests
import pandas as pd
import statsmodels.api as sm

# Download the dataset once; later code blocks reuse the local copy
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Fri, 10 Apr 2026   Prob (F-statistic):           4.51e-11
Time:                        12:30:05   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
- Saving the dataset locally once means all later blocks can start with `pd.read_csv("mtcars.csv")` without repeating the download.
- Looking at the summary first is useful — low R-squared or insignificant predictors are worth noting before interpreting residual patterns.

## Residual analysis and residual plots

Residuals are the differences between observed values and the model's predictions. Two diagnostic plots — residuals vs. fitted values and a Q-Q plot — reveal the most common assumption violations quickly.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

fitted = model.fittedvalues
residuals = model.resid

plt.figure(figsize=(8, 5))
plt.scatter(fitted, residuals, alpha=0.8)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.tight_layout()
plt.show()

qqplot(residuals, line="s")
plt.title("Q-Q Plot of Residuals")
plt.tight_layout()
plt.show()
- In the residuals-vs-fitted plot, a random scatter around the red zero line is what you want — a U-shape suggests nonlinearity, while a fan that widens with the fitted values suggests heteroskedasticity.
- `model.resid` returns the raw residuals; `model.fittedvalues` returns the predicted values — both are available directly on the fitted model object.
- In the Q-Q plot, points close to the diagonal line indicate normally distributed residuals. Heavy tails or curves at the ends suggest the residuals are not normal, which can affect inference; a quick numeric check follows below.
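
If you want a number to go with the picture, SciPy's Shapiro-Wilk test (see the [normality tests tutorial](/tutorials/normality-tests-with-scipy)) can be run on the same residuals. A minimal sketch, assuming the `mtcars.csv` file saved earlier:

import pandas as pd
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
model = sm.OLS(df["mpg"], X).fit()

# Shapiro-Wilk on the residuals: a low p-value (e.g. < 0.05) rejects normality
stat, pvalue = stats.shapiro(model.resid)
print("Shapiro-Wilk statistic:", stat, "p-value:", pvalue)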

## Heteroskedasticity tests

Heteroskedasticity means the variance of residuals is not constant across observations — it often appears as a funnel shape in the residuals plot. The Breusch-Pagan test turns that visual impression into a formal hypothesis test.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM stat:", bp_stat)
print("Breusch-Pagan LM p-value:", bp_pvalue)
print("Breusch-Pagan F stat:", f_stat)
print("Breusch-Pagan F p-value:", f_pvalue)
Breusch-Pagan LM stat: 2.0700718723832026
Breusch-Pagan LM p-value: 0.5579911149315504
Breusch-Pagan F stat: 0.6455301438255844
Breusch-Pagan F p-value: 0.5922887042132922
- `het_breuschpagan(model.resid, model.model.exog)` regresses the squared residuals on the predictors to test whether variance depends on the predictors' values.
- A low p-value means evidence of heteroskedasticity — standard errors from OLS may be incorrect, and you should consider robust standard errors via `model.get_robustcov_results(cov_type='HC3')`, as sketched after this list.
- `model.model.exog` is the design matrix (predictors including the intercept) as stored inside the fitted model — it's the same as the `X` you passed in.
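
If the Breusch-Pagan test had come back significant, a common next step is to recompute standard errors with a heteroskedasticity-robust covariance. A minimal sketch using the same fitted model:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
model = sm.OLS(df["mpg"], X).fit()

# HC3 robust covariance: the coefficients are unchanged, but standard errors,
# t-statistics, and p-values are recomputed to remain valid under
# non-constant variance
robust = model.get_robustcov_results(cov_type="HC3")
print(robust.summary())

The coefficient estimates are identical to the nonrobust fit; only the uncertainty estimates change, so this is a low-cost safeguard when the residual plot looks suspicious.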

## Variance Inflation Factor (VIF)

When predictors are correlated with each other, their coefficient estimates become imprecise — this is multicollinearity. VIF measures how much the variance of each coefficient is inflated by its correlation with the other predictors.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])

vif_df = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif_df)
  feature         VIF
0   const  341.407799
1      hp    4.921956
2      wt    2.530443
3    qsec    2.873804
- `variance_inflation_factor(X.values, i)` computes VIF for column `i` by regressing it on all other columns and measuring how well they predict it: the R² of that auxiliary regression enters as VIF = 1 / (1 − R²), so a high R² gives a high VIF. The sketch after this list reproduces one value by hand.
- VIF of 1 means no correlation with other predictors; values above 5–10 are typically a concern. The large value on the `const` row is an artifact of including the intercept in the design matrix and is normally ignored.
- High VIF doesn't make predictions worse, but it makes individual coefficient estimates unreliable — don't interpret magnitudes or significance of correlated predictors at face value.
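
To see where these numbers come from, you can reproduce one VIF by hand with the 1 / (1 − R²) formula. A quick sketch for the `hp` column, assuming the locally saved `mtcars.csv`:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])

# Auxiliary regression: hp on the remaining columns (const, wt, qsec)
aux = sm.OLS(X["hp"], X.drop(columns="hp")).fit()
print(1 / (1 - aux.rsquared))  # ~4.92, matching the hp row in the table above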

## Influence and outlier detection

Even a correctly specified model can produce misleading results if a few extreme observations dominate the fit. Cook's distance measures how much each observation shifts the fitted coefficients when removed.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]  # index 0 is the distances; index 1 holds p-values

result = pd.DataFrame({"car": df["rownames"], "cooks_distance": cooks_d})
print(result.sort_values("cooks_distance", ascending=False).head(5))

plt.figure(figsize=(9, 4))
plt.stem(range(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel("Observation index")
plt.ylabel("Cook's distance")
plt.title("Influence by observation")
plt.tight_layout()
plt.show()
                  car  cooks_distance
16  Chrysler Imperial        0.332849
19     Toyota Corolla        0.191211
30      Maserati Bora        0.130785
17           Fiat 128        0.118508
20      Toyota Corona        0.086320
- `model.get_influence().cooks_distance[0]` computes Cook's distance for each observation — a rough rule of thumb is that values above `4/n` (here `4/32 ≈ 0.125`) are worth investigating; the snippet after this list flags them.
- The stem plot makes it easy to spot which observations have unusually large Cook's distances compared to the rest.
- High Cook's distance doesn't automatically mean you should remove the observation — investigate whether it's a data entry error, a genuine outlier, or a legitimately extreme but valid case.
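
To turn the rule of thumb into a concrete list, you can filter on the `4/n` cutoff directly. A short sketch; per the table above, it flags Chrysler Imperial, Toyota Corolla, and Maserati Bora:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
model = sm.OLS(df["mpg"], X).fit()

cooks_d = model.get_influence().cooks_distance[0]
threshold = 4 / len(df)  # 4/n rule of thumb, 4/32 = 0.125 here
print(df.loc[cooks_d > threshold, "rownames"])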

## Conclusion

Diagnostics are not optional when you need to trust regression inference. A quick workflow — residual plot, Q-Q plot, Breusch-Pagan test, VIF check, Cook's distances — catches the most common assumption violations before results are published or acted on.
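
If you run these checks often, the numeric ones are easy to bundle into a small helper. A sketch (not a statsmodels function) that prints the headline numbers for any fitted OLS model:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

def quick_diagnostics(model):
    """Print Breusch-Pagan, VIF, and Cook's distance summaries for a fitted OLS model."""
    exog = model.model.exog
    _, bp_pvalue, _, _ = het_breuschpagan(model.resid, exog)
    print("Breusch-Pagan p-value:", bp_pvalue)
    vifs = [variance_inflation_factor(exog, i) for i in range(exog.shape[1])]
    print("VIFs:", dict(zip(model.model.exog_names, vifs)))
    cooks_d = model.get_influence().cooks_distance[0]
    print("Max Cook's distance:", cooks_d.max(), "vs 4/n =", 4 / len(cooks_d))

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
quick_diagnostics(sm.OLS(df["mpg"], X).fit())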

For fitting the initial model, see [linear regression with Statsmodels](/tutorials/statsmodels-linear-regression) or [multiple linear regression](/tutorials/statsmodels-multiple-linear-regression). For [normality tests](/tutorials/normality-tests-with-scipy) on residuals specifically, SciPy's Shapiro-Wilk test is a good complement to the Q-Q plot.