[Simple linear regression](/tutorials/statsmodels-linear-regression) uses one predictor to explain the outcome. In most real problems, outcomes depend on several factors at once. Multiple linear regression extends OLS to handle this: each coefficient becomes a **partial effect** — the estimated change in the outcome from a one-unit increase in that predictor *while all other predictors are held constant*. This lets you separate the contributions of correlated variables. For example, a car's fuel efficiency depends on both weight and horsepower — including both in the model lets you isolate the effect of each. The tradeoff is multicollinearity: when predictors are strongly correlated with each other, coefficient estimates become unstable and harder to interpret.
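Concretely, with the three predictors used throughout this tutorial, the fitted model has the form

`mpg = b0 + b1*hp + b2*wt + b3*qsec + error`

where each slope `b` is a partial effect: the expected change in `mpg` for a one-unit increase in that predictor, holding the other two fixed.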
## Preparing multiple predictors

The `mtcars` dataset has several continuous predictors. Here we download it once and identify which columns to use as features for modeling fuel efficiency.

```python
import requests
import pandas as pd
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
f.write(response.text)
df = pd.read_csv("mtcars.csv")
features = ["hp", "wt", "qsec"]
X = df[features]
y = df["mpg"]
print(X.head())
print(y.head())
```

- Saving `mtcars.csv` locally once means all later blocks load from disk without repeating the network call (see the caching sketch after this list).
- `features = ["hp", "wt", "qsec"]` selects horsepower, weight, and quarter-mile time — three theoretically distinct aspects of a car's performance, making them reasonable candidates as joint predictors.
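If you rerun the notebook often, it can be convenient to guard the download so reruns skip the network entirely. A minimal sketch using only `os.path.exists`, with the same URL and filename as above:

```python
import os

import requests

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"

# Download only if a local copy does not already exist
if not os.path.exists("mtcars.csv"):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open("mtcars.csv", "w", encoding="utf-8") as f:
        f.write(response.text)
```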
## Building the feature matrix and adding an intercept

Before passing the predictors to OLS, you must add an intercept column. In multiple regression, this is exactly the same as in simple regression — statsmodels requires it explicitly.

```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
print(X.head())
```

- `sm.add_constant(...)` prepends a `const` column of 1s, allowing OLS to estimate the intercept as a free parameter.
- The result is a matrix with four columns (`const`, `hp`, `wt`, `qsec`) and one row per car — each row is one observation in the regression.
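If you prefer to see exactly what `sm.add_constant` does, the equivalent manual step is to insert a column of ones yourself. A small sketch, assuming `mtcars.csv` is already saved:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")

# Manual equivalent of sm.add_constant: a leading column of 1.0s
X_manual = df[["hp", "wt", "qsec"]].copy()
X_manual.insert(0, "const", 1.0)

X_auto = sm.add_constant(df[["hp", "wt", "qsec"]])
print(X_manual.equals(X_auto))  # should print True
```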
## Fitting the regression model

Fitting a multiple regression works the same as fitting a simple one — the only difference is the number of columns in `X`.

```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
print(model.summary())
```

- The summary now has a row per predictor — each row shows that predictor's partial coefficient, standard error, t-statistic, p-value, and 95% confidence interval.
- The overall R-squared measures how much of `mpg`'s variance all three predictors together explain.
- The F-statistic at the bottom tests whether the predictors are jointly significant (the null hypothesis is that all slope coefficients are zero) — a small p-value means the model as a whole is better than predicting the mean.
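The headline numbers from the summary are also exposed as attributes on the fitted results object, which is useful when you want them programmatically rather than as printed text. A short sketch, refitting the same model:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

# The same statistics the summary reports, as plain numbers
print(f"R-squared:          {model.rsquared:.3f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.3f}")
print(f"F-statistic:        {model.fvalue:.2f} (p = {model.f_pvalue:.2e})")
```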
## Interpreting coefficients and significance

Extracting the key inference columns into a tidy table makes it easier to compare predictors at a glance.

```python
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()
summary_table = pd.DataFrame(
{
"coefficient": model.params,
"p_value": model.pvalues,
"ci_lower": model.conf_int()[0],
"ci_upper": model.conf_int()[1],
}
)
print(summary_table)
```

- Each coefficient is a partial effect: the `wt` coefficient tells you how much `mpg` changes per unit increase in weight *after accounting for* `hp` and `qsec`. This is what makes multiple regression more useful than running three separate simple regressions.
- A coefficient whose confidence interval crosses zero (`ci_lower` < 0 < `ci_upper`) is not statistically distinguishable from zero at the 5% level (the filtering sketch after this list selects the rows that pass this check).
- `model.conf_int()[0]` and `[1]` extract the lower and upper bounds of the 95% confidence intervals for all coefficients.
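Since `summary_table` is an ordinary DataFrame, pandas filtering applies directly. As an illustration, this sketch keeps only the rows whose 95% interval excludes zero:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

summary_table = pd.DataFrame(
    {
        "coefficient": model.params,
        "p_value": model.pvalues,
        "ci_lower": model.conf_int()[0],
        "ci_upper": model.conf_int()[1],
    }
)

# Rows whose 95% CI excludes zero: significant at the 5% level
significant = summary_table[
    (summary_table["ci_lower"] > 0) | (summary_table["ci_upper"] < 0)
]
print(significant)
```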
## Multicollinearity check with VIF

When predictors are highly correlated with each other, their coefficients become imprecise — small changes in the data can cause large swings in the estimates. The Variance Inflation Factor (VIF) quantifies how much each coefficient's variance is inflated by that predictor's correlation with the others.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
vif = pd.DataFrame(
{
"feature": X.columns,
"VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
}
)
print(vif)
```

- `variance_inflation_factor(X.values, i)` computes the VIF for column `i` by regressing it on all other columns and measuring how well they predict it.
- A VIF of 1 means no correlation with the other predictors; a VIF above 5–10 is typically considered a warning sign that the coefficient estimate may be unstable. (The `const` row can be ignored — it is included only so the auxiliary regressions have an intercept.)
- If you see high VIF for two predictors, consider removing one, combining them (e.g., as a ratio), or using a dimensionality reduction approach.

### Conclusion

Multiple linear regression lets you estimate each predictor's independent contribution after controlling for the others — but this only works well when predictors aren't too highly correlated. Always check VIF before interpreting individual coefficients as causal effects. For a more compact model syntax using R-style formulas, see the [Statsmodels Formula API](/tutorials/statsmodels-formula-api-regression). To verify that the model's assumptions hold, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).