Linear relationships appear everywhere: pricing, forecasting, operations, and experimentation. If you want interpretable models with statistical diagnostics, `statsmodels` is one of the best tools in Python. This tutorial focuses on **statsmodels linear regression** using **statsmodels OLS**.

## What OLS regression is

Ordinary Least Squares (OLS) estimates a line by minimizing the sum of squared residuals, where each residual is the difference between an observed value and the model prediction. In practice, OLS gives you coefficients, confidence intervals, p-values, and R-squared in one workflow.

## Creating the dataset
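If you rerun the tutorial often, a small guard that skips the download when `mtcars.csv` already exists keeps reruns fast and offline-friendly. This is an optional sketch; the `fetch_mtcars` helper name is mine, not part of any library:

```python
import os
import requests

URL = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"

def fetch_mtcars(path="mtcars.csv"):
    # Only hit the network when no local copy exists yet
    if not os.path.exists(path):
        response = requests.get(URL, timeout=30)
        response.raise_for_status()
        with open(path, "w", encoding="utf-8") as f:
            f.write(response.text)
    return path
```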
```python
import requests
import pandas as pd

# Download the source CSV once for reuse in later code blocks
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save a local copy so later examples can start from the same file
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Load only the columns we need for this tutorial step
df = pd.read_csv("mtcars.csv")
print(df[["mpg", "hp"]].head())
```

```
    mpg   hp
0  21.0  110
1  21.0  110
2  22.8   93
3  21.4  110
4  18.7  175
```
This code downloads the dataset once, saves it locally as `mtcars.csv`, and loads it into a DataFrame for the rest of the tutorial. Saving the file first keeps later code blocks simpler and repeatable, because every block can start from the same local data source instead of re-downloading.

## Adding the intercept (`add_constant`)
```python
import pandas as pd
import statsmodels.api as sm

# Reload the shared local dataset
df = pd.read_csv("mtcars.csv")

# Build predictor matrix and add intercept column
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
print(X.head())
```

```
   const   hp
0    1.0  110
1    1.0  110
2    1.0   93
3    1.0  110
4    1.0  175
```
`sm.add_constant()` adds an intercept column (`const`) so the fitted model can learn both the slope and the baseline level. Without this step, the regression would be forced through the origin, which often produces biased coefficients and a worse fit.

## Fitting the OLS model
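As an aside, the same model can also be written with statsmodels' formula interface, which adds the intercept automatically (reported as `Intercept` rather than `const`). A sketch on a small inline sample (illustrative values, not the full mtcars data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Inline sample standing in for mtcars (illustrative values)
df = pd.DataFrame({"hp": [62, 93, 110, 175, 245],
                   "mpg": [24.4, 22.8, 21.0, 18.7, 14.3]})

# The formula API adds the intercept for you -- no add_constant needed
model = smf.ols("mpg ~ hp", data=df).fit()
print(model.params.index.tolist())  # ['Intercept', 'hp']
```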
```python
import pandas as pd
import statsmodels.api as sm

# Prepare inputs in the shape expected by statsmodels
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

# Fit ordinary least squares model
model = sm.OLS(y, X).fit()
print(model.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.602
Model:                            OLS   Adj. R-squared:                  0.589
Method:                 Least Squares   F-statistic:                     45.46
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           1.79e-07
Time:                        16:48:06   Log-Likelihood:                -87.619
No. Observations:                  32   AIC:                             179.2
Df Residuals:                      30   BIC:                             182.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         30.0989      1.634     18.421      0.000      26.762      33.436
hp            -0.0682      0.010     -6.742      0.000      -0.089      -0.048
==============================================================================
Omnibus:                        3.692   Durbin-Watson:                   1.134
Prob(Omnibus):                  0.158   Jarque-Bera (JB):                2.984
Skew:                           0.747   Prob(JB):                        0.225
Kurtosis:                       2.935   Cond. No.                         386.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
This block fits `sm.OLS` and prints the full regression summary with fit quality and inference metrics. The summary is useful because it combines coefficient estimates, standard errors, confidence intervals, and global fit statistics in one place for quick model review.

## Understanding coefficients, p-values, and R²
```python
import pandas as pd
import statsmodels.api as sm

# Recreate the same fitted model used above
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

print("Intercept:", model.params["const"])
print("Slope for hp:", model.params["hp"])
print("P-value for hp:", model.pvalues["hp"])
print("R-squared:", model.rsquared)
```

```
Intercept: 30.09886053962251
Slope for hp: -0.06822827807156366
P-value for hp: 1.7878352541210556e-07
R-squared: 0.602437341423934
```
The intercept gives the baseline prediction, the slope shows the change in `mpg` per unit of `hp`, the p-value tests statistical significance, and R² summarizes explained variance. These values are the core outputs you use to decide whether the relationship is meaningful and whether the model is strong enough for forecasting or decision support.

## Predicting new values
```python
import pandas as pd
import statsmodels.api as sm

# Train model on historical rows
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# New input values where we want predictions
new_points = pd.DataFrame({"hp": [90, 150, 220]})
new_points = sm.add_constant(new_points, has_constant="add")

# Return prediction mean and confidence bounds
pred = model.get_prediction(new_points).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

```
        mean  mean_ci_lower  mean_ci_upper
0  23.958316      22.136924      25.779707
1  19.864619      18.468309      21.260928
2  15.088639      13.029384      17.147895
```
This creates predictions for unseen values and includes confidence intervals so you can report uncertainty with each estimate. Reporting intervals is important because point predictions alone can hide how much variation is plausible around each forecast.

## Visualizing regression results
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit model on the original points
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# Build a sorted x-grid so the regression line is smooth
x_grid = pd.DataFrame({"hp": sorted(df["hp"].tolist())})
x_grid_const = sm.add_constant(x_grid, has_constant="add")
y_hat = model.predict(x_grid_const)

# Plot observed points plus fitted line
plt.figure(figsize=(8, 5))
plt.scatter(df["hp"], y, alpha=0.7, label="Observed")
plt.plot(x_grid["hp"], y_hat, color="red", linewidth=2, label="OLS fit")
plt.xlabel("Horsepower")
plt.ylabel("Miles per gallon")
plt.title("Statsmodels OLS: mpg vs hp")
plt.legend()
plt.tight_layout()
plt.show()
```

This plot overlays the fitted OLS line on the observed data so you can visually assess trend direction and fit quality. Visual checks help confirm whether a straight-line assumption is reasonable and can reveal patterns that summary metrics alone may miss.