Fitting a [regression model](/tutorials/statsmodels-linear-regression) gives you coefficient estimates and p-values, but those numbers are only trustworthy if the model's assumptions hold. OLS assumes that residuals are roughly normally distributed, have constant variance (homoskedasticity), and that no single observation disproportionately controls the result. When these assumptions are violated, standard errors can be wrong, p-values become unreliable, and one unusual data point can pull the entire fitted line off course. Diagnostics are the tool for catching these problems before you act on the results.

A good diagnostics workflow answers four questions: are the residuals well-behaved? Is the variance constant? Are the predictors too correlated? Are any observations unusually influential?

## Checking regression assumptions with a baseline model

First we fit the same [OLS model from the linear regression tutorial](/tutorials/statsmodels-linear-regression) — using `hp`, `wt`, and `qsec` to predict `mpg` in the `mtcars` dataset. All later diagnostic steps inspect this model.
```python
import requests
import pandas as pd
import statsmodels.api as sm

# Download mtcars once and cache it locally
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Fit the baseline model: mpg ~ hp + wt + qsec
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print(model.summary())
```
- Saving the dataset locally once means all later blocks can start with `pd.read_csv("mtcars.csv")` without repeating the download.
- Looking at the summary first is useful — low R-squared or insignificant predictors are worth noting before interpreting residual patterns.
## Residual analysis and residual plots
Residuals are the differences between observed values and the model's predictions. Two diagnostic plots — residuals vs. fitted values and a Q-Q plot — reveal the most common assumption violations quickly.
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

fitted = model.fittedvalues
residuals = model.resid

# Residuals vs fitted values
plt.figure(figsize=(8, 5))
plt.scatter(fitted, residuals, alpha=0.8)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.tight_layout()
plt.show()

# Q-Q plot of residuals against a fitted normal distribution
qqplot(residuals, line="s")
plt.title("Q-Q Plot of Residuals")
plt.tight_layout()
plt.show()
```

- In the residuals-vs-fitted plot, a random scatter around the red zero line is what you want — a U-shape or fan pattern indicates nonlinearity or heteroskedasticity.
- `model.resid` returns the raw residuals; `model.fittedvalues` returns the predicted values — both are available directly on the fitted model object.
- In the Q-Q plot, points close to the diagonal line indicate normally distributed residuals. Heavy tails or curves at the ends suggest the residuals are not normal, which can affect inference.

## Heteroskedasticity tests

Heteroskedasticity means the variance of residuals is not constant across observations — it often appears as a funnel shape in the residuals plot. The Breusch-Pagan test gives you a formal test for this pattern.
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM stat:", bp_stat)
print("Breusch-Pagan LM p-value:", bp_pvalue)
print("Breusch-Pagan F stat:", f_stat)
print("Breusch-Pagan F p-value:", f_pvalue)
```

- `het_breuschpagan(model.resid, model.model.exog)` regresses the squared residuals on the predictors to test whether variance depends on the predictors' values.
- A low p-value means evidence of heteroskedasticity — standard errors from OLS may be incorrect, and you should consider using `model.get_robustcov_results(cov_type="HC3")` for robust standard errors.
- `model.model.exog` is the design matrix (predictors including the intercept) as stored inside the fitted model — it's the same as the `X` you passed in.

## Variance Inflation Factor (VIF)

When predictors are correlated with each other, their coefficient estimates become imprecise — this is multicollinearity. VIF measures how much the variance of each coefficient is inflated by its correlation with the other predictors.
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])

vif_df = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif_df)
```

- `variance_inflation_factor(X.values, i)` computes VIF for column `i` by regressing it on all other columns and measuring how well they predict it — a high R² in that regression gives a high VIF.
- A VIF of 1 means no correlation with the other predictors; values above 5–10 are typically a concern.
- High VIF doesn't make predictions worse, but it makes individual coefficient estimates unreliable — don't interpret the magnitudes or significance of correlated predictors at face value.

## Influence and outlier detection

Even a correctly specified model can produce misleading results if a few extreme observations dominate the fit. Cook's distance measures how much each observation shifts the fitted coefficients when removed.
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

result = pd.DataFrame({"car": df["rownames"], "cooks_distance": cooks_d})
print(result.sort_values("cooks_distance", ascending=False).head(5))

plt.figure(figsize=(9, 4))
plt.stem(range(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel("Observation index")
plt.ylabel("Cook's distance")
plt.title("Influence by observation")
plt.tight_layout()
plt.show()
```

- `model.get_influence().cooks_distance[0]` computes Cook's distance for each observation — a rough rule of thumb is that values above `4/n` (here `4/32 ≈ 0.125`) are worth investigating.
- The stem plot makes it easy to spot which observations have unusually large Cook's distances compared to the rest.
- A high Cook's distance doesn't automatically mean you should remove the observation — investigate whether it's a data entry error, a genuine outlier, or a legitimately extreme but valid case.

## Conclusion

Diagnostics are not optional when you need to trust regression inference. A quick workflow — residual plot, Q-Q plot, Breusch-Pagan test, VIF check, Cook's distances — catches the most common assumption violations before results are published or acted on. For fitting the initial model, see [linear regression with Statsmodels](/tutorials/statsmodels-linear-regression) or [multiple linear regression](/tutorials/statsmodels-multiple-linear-regression). For [normality tests](/tutorials/normality-tests-with-scipy) on residuals specifically, SciPy's Shapiro-Wilk test is a good complement to the Q-Q plot.