Linear Regression in Python with Statsmodels

Linear relationships appear everywhere: pricing, forecasting, operations, and experimentation. If you want interpretable models with statistical diagnostics, `statsmodels` is one of the best tools in Python.

This tutorial walks through linear regression in Python using the `statsmodels` OLS API, from loading data to fitting, interpreting, and visualizing the model.

## What OLS regression is

Ordinary Least Squares (OLS) estimates a line by minimizing the sum of squared residuals, where each residual is the difference between an observed value and the model prediction. In practice, OLS gives you coefficients, confidence intervals, p-values, and R-squared in one workflow.
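As a quick sanity check of the idea, the single-predictor OLS solution has a closed form: the slope is the covariance of x and y divided by the variance of x. A minimal sketch on a tiny synthetic dataset (the points below are illustrative, not the tutorial's data):

```python
import numpy as np

# Tiny noise-free synthetic example: the points lie exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form OLS: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # 2.0 1.0

# Sum of squared residuals at the OLS solution (zero here, since the fit is exact)
ssr = np.sum((y - (intercept + slope * x)) ** 2)
print(ssr)  # 0.0
```

Any other slope or intercept would make `ssr` strictly larger; that minimization is exactly what `statsmodels` performs for you, while also reporting the inference statistics.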

## Creating the dataset

```python
import requests
import pandas as pd

# Download the source CSV once for reuse in later code blocks
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save a local copy so later examples can start from the same file
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Load the data and preview the two columns used in this tutorial
df = pd.read_csv("mtcars.csv")
print(df[["mpg", "hp"]].head())
```

```
    mpg   hp
0  21.0  110
1  21.0  110
2  22.8   93
3  21.4  110
4  18.7  175
```
This code downloads the dataset once, saves it locally as `mtcars.csv`, and loads it into a DataFrame for the rest of the tutorial. Saving the file first keeps later code blocks simpler and repeatable, because every block can start from the same local data source instead of re-downloading.

## Adding the intercept (`add_constant`)

```python
import pandas as pd
import statsmodels.api as sm

# Reload the shared local dataset
df = pd.read_csv("mtcars.csv")
# Build the predictor matrix and add the intercept column
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

print(X.head())
```

```
   const   hp
0    1.0  110
1    1.0  110
2    1.0   93
3    1.0  110
4    1.0  175
```
`sm.add_constant()` adds an intercept column (`const`) so the fitted model can learn both the slope and the baseline level. Without it, the regression line is forced through the origin, which typically biases the slope and worsens the fit.
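If calling `add_constant` is easy to forget, the statsmodels formula API adds the intercept for you. A minimal sketch on a small inline DataFrame (the values are illustrative, not the mtcars data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny illustrative dataset (not the tutorial's mtcars file)
df = pd.DataFrame({"hp": [90, 110, 150, 200], "mpg": [26.0, 22.0, 18.0, 14.0]})

# The formula "mpg ~ hp" includes an intercept term automatically
model = smf.ols("mpg ~ hp", data=df).fit()
print(model.params)  # contains both an Intercept entry and an hp slope
```

Both APIs fit the same model; this tutorial sticks with `sm.OLS` plus `add_constant` so the design matrix stays explicit.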

## Fitting the OLS model

```python
import pandas as pd
import statsmodels.api as sm

# Prepare inputs in the shape expected by statsmodels
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

# Fit the ordinary least squares model
model = sm.OLS(y, X).fit()
print(model.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.602
Model:                            OLS   Adj. R-squared:                  0.589
Method:                 Least Squares   F-statistic:                     45.46
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           1.79e-07
Time:                        16:48:06   Log-Likelihood:                -87.619
No. Observations:                  32   AIC:                             179.2
Df Residuals:                      30   BIC:                             182.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         30.0989      1.634     18.421      0.000      26.762      33.436
hp            -0.0682      0.010     -6.742      0.000      -0.089      -0.048
==============================================================================
Omnibus:                        3.692   Durbin-Watson:                   1.134
Prob(Omnibus):                  0.158   Jarque-Bera (JB):                2.984
Skew:                           0.747   Prob(JB):                        0.225
Kurtosis:                       2.935   Cond. No.                         386.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
This block fits the model with `sm.OLS` and prints the full regression summary with fit quality and inference metrics. The summary is useful because it combines coefficient estimates, standard errors, confidence intervals, and global fit statistics in one place for quick model review.

## Understanding coefficients, p-values, and R²

```python
import pandas as pd
import statsmodels.api as sm

# Recreate the same fitted model used above
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()

print("Intercept:", model.params["const"])
print("Slope for hp:", model.params["hp"])
print("P-value for hp:", model.pvalues["hp"])
print("R-squared:", model.rsquared)
```

```
Intercept: 30.09886053962251
Slope for hp: -0.06822827807156366
P-value for hp: 1.7878352541210556e-07
R-squared: 0.602437341423934
```
The intercept gives the baseline prediction, the slope shows the change in `mpg` per unit of `hp`, the p-value tests statistical significance, and R² summarizes explained variance. These values are the core outputs you use to decide whether the relationship is meaningful and whether the model is strong enough for forecasting or decision support.

## Predicting new values

```python
import pandas as pd
import statsmodels.api as sm

# Train the model on the historical rows
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# New input values where we want predictions
new_points = pd.DataFrame({"hp": [90, 150, 220]})
new_points = sm.add_constant(new_points, has_constant="add")

# Return the prediction mean and confidence bounds
pred = model.get_prediction(new_points).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

```
        mean  mean_ci_lower  mean_ci_upper
0  23.958316      22.136924      25.779707
1  19.864619      18.468309      21.260928
2  15.088639      13.029384      17.147895
```
This creates predictions for unseen values and includes confidence intervals so you can report uncertainty with each estimate. Reporting intervals is important because point predictions alone can hide how much variation is plausible around each forecast.

## Visualizing regression results

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit the model on the original points
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# Build a sorted x-grid so the regression line is smooth
x_grid = pd.DataFrame({"hp": sorted(df["hp"].tolist())})
x_grid_const = sm.add_constant(x_grid, has_constant="add")
y_hat = model.predict(x_grid_const)

# Plot the observed points plus the fitted line
plt.figure(figsize=(8, 5))
plt.scatter(df["hp"], y, alpha=0.7, label="Observed")
plt.plot(x_grid["hp"], y_hat, color="red", linewidth=2, label="OLS fit")
plt.xlabel("Horsepower")
plt.ylabel("Miles per gallon")
plt.title("Statsmodels OLS: mpg vs hp")
plt.legend()
plt.tight_layout()
plt.show()
```
This plot overlays the fitted OLS line on the observed data so you can visually assess trend direction and fit quality. Visual checks help confirm whether a straight-line assumption is reasonable and can reveal patterns that summary metrics alone may miss.