# Linear Regression in Python with Statsmodels

Linear relationships appear everywhere: pricing, forecasting, operations, and experimentation. When you need to go beyond prediction and understand *whether* a relationship is statistically real, *how strong* it is, and *how uncertain* your estimates are, `statsmodels` is the right tool. Unlike scikit-learn (which focuses on prediction pipelines), statsmodels gives you a full inference summary: each coefficient's estimate, standard error, p-value, and confidence interval in one readable table. This tutorial focuses on Ordinary Least Squares (OLS) regression — the standard method that minimizes the sum of squared differences between the observed values and the fitted line.

## What OLS regression is

OLS estimates the slope and intercept that produce the best-fitting straight line through your data by minimizing the sum of squared residuals. Each residual is the vertical distance between an observed data point and the fitted line. In practice, OLS gives you coefficients, [confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy), p-values, and R-squared in one workflow.
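To make "minimize the sum of squared residuals" concrete, the slope and intercept for a single predictor can be computed by hand with the closed-form formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch on hypothetical numbers (not the mtcars data used below):

```python
import numpy as np

# Tiny illustrative dataset (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS: slope = cov(x, y) / var(x), intercept from the means
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()  # slope ≈ 1.96, intercept ≈ 0.14

# Residuals are the vertical distances to the fitted line;
# OLS makes the sum of their squares as small as possible
residuals = y - (intercept + slope * x)
print(slope, intercept, np.sum(residuals ** 2))
```

Statsmodels computes exactly these quantities (generalized to any number of predictors) and adds the inference layer on top.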

## Creating the dataset

The classic `mtcars` dataset contains measurements of 32 car models including horsepower (`hp`) and fuel efficiency in miles per gallon (`mpg`). Downloading it once and saving locally means every later code block can load from disk without repeating the network call.

```python
import requests
import pandas as pd

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")
print(df[["mpg", "hp"]].head())
```

```
    mpg   hp
0  21.0  110
1  21.0  110
2  22.8   93
3  21.4  110
4  18.7  175
```
- `response.raise_for_status()` raises an error immediately if the download fails, rather than silently producing an empty or partial file.
- Saving to `mtcars.csv` first means all later blocks can start with `pd.read_csv("mtcars.csv")` — each block is self-contained.

## Adding the intercept (`add_constant`)

The array-based `sm.OLS` API does not automatically include an intercept — you must add it explicitly (the formula API, `smf.ols`, adds one for you). Without it, the model is forced through the origin, which usually produces a worse fit and a biased slope estimate.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

print(X.head())
```

```
   const   hp
0    1.0  110
1    1.0  110
2    1.0   93
3    1.0  110
4    1.0  175
```
- `sm.add_constant(df[["hp"]])` adds a column named `const` filled with 1s — this column lets the model estimate the intercept as a separate coefficient.
- Without `add_constant`, `sm.OLS(y, X).fit()` would constrain the line to pass through the origin, which is rarely appropriate for real data.

## Fitting the OLS model

Once the predictor matrix has the intercept column, fitting OLS and printing the summary is a single readable chain.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()
print(model.summary())
```

```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.602
Model:                            OLS   Adj. R-squared:                  0.589
Method:                 Least Squares   F-statistic:                     45.46
Date:                Fri, 10 Apr 2026   Prob (F-statistic):           1.79e-07
Time:                        12:29:56   Log-Likelihood:                -87.619
No. Observations:                  32   AIC:                             179.2
Df Residuals:                      30   BIC:                             182.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         30.0989      1.634     18.421      0.000      26.762      33.436
hp            -0.0682      0.010     -6.742      0.000      -0.089      -0.048
==============================================================================
Omnibus:                        3.692   Durbin-Watson:                   1.134
Prob(Omnibus):                  0.158   Jarque-Bera (JB):                2.984
Skew:                           0.747   Prob(JB):                        0.225
Kurtosis:                       2.935   Cond. No.                         386.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
- `sm.OLS(y, X)` defines the model; `.fit()` estimates the coefficients by minimizing the sum of squared residuals.
- The summary table has three sections: overall fit statistics (R-squared, F-statistic), per-coefficient estimates, and residual diagnostics.
- A small p-value for `hp` (shown as `P>|t|` in the table) means an association this strong would be very unlikely to appear by chance if the true slope were zero — horsepower and fuel efficiency are genuinely related in this dataset.

## Understanding coefficients, p-values, and R²

These four numbers from the summary are the ones you'll cite most often when interpreting a regression result.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()

print("Intercept:", model.params["const"])
print("Slope for hp:", model.params["hp"])
print("P-value for hp:", model.pvalues["hp"])
print("R-squared:", model.rsquared)
```

```
Intercept: 30.09886053962251
Slope for hp: -0.06822827807156366
P-value for hp: 1.7878352541210556e-07
R-squared: 0.602437341423934
```
- `model.params["hp"]` is the slope: the expected change in `mpg` for each additional unit of `hp`. A negative value here means more powerful cars are less fuel-efficient.
- `model.pvalues["hp"]` tests the null hypothesis that the true slope is zero. A very small p-value means we can reject that — `hp` is a meaningful predictor.
- `model.rsquared` is R-squared: the fraction of variance in `mpg` explained by `hp` alone. A value near 0.6 means `hp` accounts for about 60% of the variation; the rest is explained by other factors not in this model.

## Predicting new values

After fitting, you can call `get_prediction()` to get point forecasts and confidence intervals for new inputs — this is more informative than a bare `predict()` call because it quantifies the uncertainty around each estimate.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

new_points = pd.DataFrame({"hp": [90, 150, 220]})
new_points = sm.add_constant(new_points, has_constant="add")

pred = model.get_prediction(new_points).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

```
        mean  mean_ci_lower  mean_ci_upper
0  23.958316      22.136924      25.779707
1  19.864619      18.468309      21.260928
2  15.088639      13.029384      17.147895
```
- `has_constant="add"` forces `add_constant` to insert the intercept column even though the DataFrame has no existing `const` column — required for new data not built from the training DataFrame.
- `summary_frame(alpha=0.05)` returns a DataFrame with the predicted mean and the 95% confidence interval bounds.
- A wider confidence interval means the prediction is more uncertain — this typically happens when the input value is far from the training data's center.

## Visualizing regression results

A scatter plot of the observed data with the fitted line drawn over it shows at a glance whether the linear model is a reasonable description of the relationship or whether a curve would fit better.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

x_grid = pd.DataFrame({"hp": sorted(df["hp"].tolist())})
x_grid_const = sm.add_constant(x_grid, has_constant="add")
y_hat = model.predict(x_grid_const)

plt.figure(figsize=(8, 5))
plt.scatter(df["hp"], y, alpha=0.7, label="Observed")
plt.plot(x_grid["hp"], y_hat, color="red", linewidth=2, label="OLS fit")
plt.xlabel("Horsepower")
plt.ylabel("Miles per gallon")
plt.title("Statsmodels OLS: mpg vs hp")
plt.legend()
plt.tight_layout()
plt.show()
```
- `sorted(df["hp"].tolist())` orders the x-values left to right so `plt.plot` draws one clean line; with unsorted values, the line would zigzag back and forth across the plot.
- If the data shows a curved pattern around the line (systematic residuals), it suggests a log transform or polynomial term might improve the model.

## Conclusion

Statsmodels OLS is the go-to tool when a regression needs to be *explained*, not just used for prediction. The summary table gives you everything needed to report a result: coefficient estimates, significance tests, and overall fit.

To add more predictors, see [multiple linear regression with Statsmodels](/tutorials/statsmodels-multiple-linear-regression). For binary outcomes, see [logistic regression with Statsmodels](/tutorials/statsmodels-logistic-regression). For diagnosing whether your model's assumptions hold, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).