Linear relationships appear everywhere: pricing, forecasting, operations, and experimentation. When you need to go beyond prediction and understand *whether* a relationship is statistically real, *how strong* it is, and *how uncertain* your estimates are, `statsmodels` is the right tool. Unlike scikit-learn (which focuses on prediction pipelines), statsmodels gives you a full inference summary: each coefficient's estimate, standard error, p-value, and confidence interval in one readable table. This tutorial focuses on Ordinary Least Squares (OLS) regression — the standard method that minimizes the sum of squared differences between the observed values and the fitted line.

## What OLS regression is

OLS estimates the slope and intercept that produce the best-fitting straight line through your data by minimizing the sum of squared residuals. Each residual is the vertical distance between an observed data point and the fitted line. In practice, OLS gives you coefficients, [confidence intervals](/tutorials/bootstrap-confidence-intervals-with-scipy), p-values, and R-squared in one workflow.

## Creating the dataset

The classic `mtcars` dataset contains measurements of 32 car models, including horsepower (`hp`) and fuel efficiency in miles per gallon (`mpg`). Downloading it once and saving it locally means every later code block can load from disk without repeating the network call.

```python
import requests
import pandas as pd
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
f.write(response.text)
df = pd.read_csv("mtcars.csv")
print(df[["mpg", "hp"]].head())
```
- `response.raise_for_status()` raises an error immediately if the download fails, rather than silently producing an empty or partial file.
- Saving to `mtcars.csv` first means all later blocks can start with `pd.read_csv("mtcars.csv")` — each block is self-contained. (An optional caching guard is sketched below.)
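If you rerun the section often, an optional guard can skip the repeated download when the file already exists on disk. A minimal sketch, using the same URL and filename as above:

```python
import os

import requests

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"

# Only hit the network if the local copy is missing.
if not os.path.exists("mtcars.csv"):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open("mtcars.csv", "w", encoding="utf-8") as f:
        f.write(response.text)
```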
## Adding the intercept (`add_constant`)
When you build the design matrix yourself with `sm.OLS`, statsmodels does not automatically include an intercept — you must add it explicitly. Without it, the model is forced through the origin, which usually produces a worse fit and biased slope estimates.
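To make that concrete, an optional check is to fit the model both ways and compare the slope and the residual sum of squares (`model.ssr`); the restricted through-origin fit can never have a smaller residual sum of squares than the fit with an intercept. A sketch, assuming `mtcars.csv` was saved by the download step:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
y = df["mpg"]

# Through the origin: no constant column, so no intercept is estimated.
no_const = sm.OLS(y, df[["hp"]]).fit()
# With the constant column: intercept and slope are both estimated.
with_const = sm.OLS(y, sm.add_constant(df[["hp"]])).fit()

print("Slope without intercept:", no_const.params["hp"])
print("Slope with intercept:   ", with_const.params["hp"])
# The through-origin fit always has at least as large a residual sum of squares.
print("SSR without intercept:", no_const.ssr)
print("SSR with intercept:   ", with_const.ssr)
```

With that motivation, here is the standard workflow with `add_constant`: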
```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
print(X.head())
```

- `sm.add_constant(df[["hp"]])` adds a column named `const` filled with 1s — this column lets the model estimate the intercept as a separate coefficient.
- Without `add_constant`, `sm.OLS(y, X).fit()` would constrain the line to pass through the origin, which is rarely appropriate for real data (the formula interface sketched below sidesteps this step entirely).
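As an aside, statsmodels also ships a formula interface (`statsmodels.formula.api`) that adds the intercept automatically, so `add_constant` is unnecessary there. A minimal sketch of the equivalent model (the rest of this tutorial keeps the array interface):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")

# The formula interface includes an intercept (named "Intercept") by default.
model = smf.ols("mpg ~ hp", data=df).fit()
print(model.params)  # Intercept and hp
```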
## Fitting the OLS model

Once the predictor matrix has the intercept column, fitting OLS and printing the summary is a single readable chain.

```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print(model.summary())
```

- `sm.OLS(y, X)` defines the model; `.fit()` estimates the coefficients by minimizing the sum of squared residuals (the sketch below cross-checks this against NumPy).
- The summary table has three sections: overall fit statistics (R-squared, F-statistic), per-coefficient estimates, and residual diagnostics.
- A small p-value for `hp` (shown as `P>|t|` in the table) means a slope this far from zero would be unlikely if horsepower truly had no linear relationship with fuel efficiency.
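Since `.fit()` minimizes the sum of squared residuals, its coefficients should agree with a direct least-squares solve. As an optional sanity check, here is a sketch comparing against NumPy's `lstsq`:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()

# Solve the same least-squares problem directly with NumPy.
beta, *_ = np.linalg.lstsq(X.to_numpy(), y.to_numpy(), rcond=None)
print("statsmodels:", model.params.to_numpy())
print("numpy lstsq:", beta)
```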
## Understanding coefficients, p-values, and R²

These four numbers from the summary are the ones you'll cite most often when interpreting a regression result.

```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print("Intercept:", model.params["const"])
print("Slope for hp:", model.params["hp"])
print("P-value for hp:", model.pvalues["hp"])
print("R-squared:", model.rsquared)- `model.params["hp"]` is the slope: the expected change in `mpg` for each additional unit of `hp`. A negative value here means more powerful cars are less fuel-efficient. - `model.pvalues["hp"]` tests the null hypothesis that the true slope is zero. A very small p-value means we can reject that — `hp` is a meaningful predictor. - `model.rsquared` is R-squared: the fraction of variance in `mpg` explained by `hp` alone. A value near 0.6 means `hp` accounts for about 60% of the variation; the rest is explained by other factors not in this model. ## Predicting new values After fitting, you can call `get_prediction()` to get point forecasts and confidence intervals for new inputs — this is more informative than a bare `predict()` call because it quantifies the uncertainty around each estimate.
## Predicting new values

After fitting, you can call `get_prediction()` to get point forecasts and confidence intervals for new inputs — this is more informative than a bare `predict()` call because it quantifies the uncertainty around each estimate.

```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
new_points = pd.DataFrame({"hp": [90, 150, 220]})
new_points = sm.add_constant(new_points, has_constant="add")
pred = model.get_prediction(new_points).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

- `has_constant="add"` forces `add_constant` to insert the intercept column even though the DataFrame has no existing `const` column — required for new data not built from the training DataFrame.
- `summary_frame(alpha=0.05)` returns a DataFrame with the predicted mean and the 95% confidence interval bounds; it also carries prediction intervals, sketched below.
- A wider confidence interval means the prediction is more uncertain — this typically happens when the input value is far from the training data's center.
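The same `summary_frame` output includes prediction intervals (`obs_ci_lower`, `obs_ci_upper`): bounds for a single car's `mpg` rather than for the average `mpg` at that horsepower, so they are necessarily wider. A sketch:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

new_points = sm.add_constant(pd.DataFrame({"hp": [90, 150, 220]}), has_constant="add")
pred = model.get_prediction(new_points).summary_frame(alpha=0.05)
# obs_ci_* bound a single observation; mean_ci_* bound the average response.
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
```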
## Visualizing regression results

A scatter plot of the observed data with the fitted line drawn over it shows at a glance whether the linear model is a reasonable description of the relationship or whether a curve would fit better.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
x_grid = pd.DataFrame({"hp": sorted(df["hp"].tolist())})
x_grid_const = sm.add_constant(x_grid, has_constant="add")
y_hat = model.predict(x_grid_const)
plt.figure(figsize=(8, 5))
plt.scatter(df["hp"], y, alpha=0.7, label="Observed")
plt.plot(x_grid["hp"], y_hat, color="red", linewidth=2, label="OLS fit")
plt.xlabel("Horsepower")
plt.ylabel("Miles per gallon")
plt.title("Statsmodels OLS: mpg vs hp")
plt.legend()
plt.tight_layout()
plt.show()
```

- `sorted(df["hp"].tolist())` orders the x-axis grid so the line is drawn left to right instead of zig-zagging between unsorted points.
- If the data shows a curved pattern around the line (systematic residuals), it suggests a log transform or polynomial term might improve the model.

## Conclusion

Statsmodels OLS is the go-to tool when a regression needs to be *explained*, not just used for prediction. The summary table gives you everything needed to report a result: coefficient estimates, significance tests, and overall fit. To add more predictors, see [multiple linear regression with Statsmodels](/tutorials/statsmodels-multiple-linear-regression). For binary outcomes, see [logistic regression with Statsmodels](/tutorials/statsmodels-logistic-regression). For diagnosing whether your model's assumptions hold, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).