# Multiple Linear Regression with Statsmodels

[Simple linear regression](/tutorials/statsmodels-linear-regression) uses one predictor to explain the outcome. In most real problems, outcomes depend on several factors at once. Multiple linear regression extends OLS to handle this: each coefficient becomes a **partial effect** — the estimated change in the outcome from a one-unit increase in that predictor *while all other predictors are held constant*. This lets you separate the contributions of correlated variables. For example, a car's fuel efficiency depends on both weight and horsepower — including both in the model lets you isolate the effect of each. The main pitfall is multicollinearity: when predictors are strongly correlated with each other, coefficient estimates become unstable and harder to interpret.

## Preparing multiple predictors

The `mtcars` dataset has several continuous predictors. Here we download it once and identify which columns to use as features for modeling fuel efficiency.

```python
import requests
import pandas as pd

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")

features = ["hp", "wt", "qsec"]
X = df[features]
y = df["mpg"]

print(X.head())
print(y.head())
```

```
    hp     wt   qsec
0  110  2.620  16.46
1  110  2.875  17.02
2   93  2.320  18.61
3  110  3.215  19.44
4  175  3.440  17.02
0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64
```
- Saving `mtcars.csv` locally once means all later blocks load from disk without repeating the network call.
- `features = ["hp", "wt", "qsec"]` selects horsepower, weight, and quarter-mile time — three theoretically distinct aspects of a car's performance, making them reasonable candidates as joint predictors.
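One quick sanity check before fitting is the pairwise correlation among the candidate predictors. A minimal sketch with a tiny hand-made frame (the numbers below are illustrative, not the full dataset):

```python
import pandas as pd

# Toy data standing in for a few mtcars-like rows (values are made up)
toy = pd.DataFrame({
    "hp":   [110, 93, 175, 105, 245],
    "wt":   [2.62, 2.32, 3.44, 3.46, 3.57],
    "qsec": [16.46, 18.61, 17.02, 20.22, 15.84],
})

# Pairwise Pearson correlations among the candidate predictors
print(toy.corr().round(2))
```

Correlations near ±1 between two predictors are an early warning that their partial effects will be hard to separate.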

## Building the feature matrix and adding intercept

Before passing the predictors to OLS, you must add an intercept column. In multiple regression, this is exactly the same as in simple regression — statsmodels requires it explicitly.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]

print(X.head())
```

```
   const   hp     wt   qsec
0    1.0  110  2.620  16.46
1    1.0  110  2.875  17.02
2    1.0   93  2.320  18.61
3    1.0  110  3.215  19.44
4    1.0  175  3.440  17.02
```
- `sm.add_constant(...)` prepends a `const` column of 1s, allowing OLS to estimate the intercept as a free parameter.
- The result is a matrix with four columns (`const`, `hp`, `wt`, `qsec`) and one row per car — each row is one observation in the regression.

## Fitting the regression model

Fitting multiple regression works the same as simple regression — the only difference is the number of columns in `X`.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]

model = sm.OLS(y, X).fit()
print(model.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Fri, 10 Apr 2026   Prob (F-statistic):           4.51e-11
Time:                        12:29:59   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
- The summary now has a row per predictor — each row shows that predictor's partial coefficient, standard error, t-statistic, p-value, and 95% confidence interval.
- The overall R-squared measures how much of `mpg`'s variance all three predictors together explain.
- The F-statistic at the bottom tests whether the predictors are jointly significant — that is, whether at least one coefficient is nonzero. A small p-value means the model as a whole predicts better than simply using the mean of `mpg`.

## Interpreting coefficients and significance

Extracting the key inference columns into a tidy table makes it easier to compare predictors at a glance.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])
y = df["mpg"]
model = sm.OLS(y, X).fit()

summary_table = pd.DataFrame(
    {
        "coefficient": model.params,
        "p_value": model.pvalues,
        "ci_lower": model.conf_int()[0],
        "ci_upper": model.conf_int()[1],
    }
)
print(summary_table)
```

```
       coefficient   p_value   ci_lower   ci_upper
const    27.610527  0.002785  10.363085  44.857968
hp       -0.017822  0.244176  -0.048510   0.012865
wt       -4.358797  0.000003  -5.900634  -2.816960
qsec      0.510834  0.254628  -0.388871   1.410538
```
- Each coefficient is a partial effect: the `wt` coefficient tells you how much `mpg` changes per unit increase in weight *after accounting for* `hp` and `qsec`. This is what makes multiple regression more useful than running three separate simple regressions.
- A coefficient whose confidence interval crosses zero (ci_lower < 0 < ci_upper) is not statistically distinguishable from zero at the 5% level.
- `model.conf_int()[0]` and `[1]` extract the lower and upper bounds of the 95% confidence intervals for all coefficients.

## Multicollinearity check with VIF

When predictors are highly correlated with each other, their coefficients become imprecise — small changes in the data can cause large swings in estimates. Variance Inflation Factor (VIF) quantifies how much each predictor's variance is inflated by its correlation with the others.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["hp", "wt", "qsec"]])

vif = pd.DataFrame(
    {
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }
)
print(vif)
```

```
  feature         VIF
0   const  341.407799
1      hp    4.921956
2      wt    2.530443
3    qsec    2.873804
```
- `variance_inflation_factor(X.values, i)` computes the VIF for column `i` by regressing it on all other columns and measuring how well they predict it.
- A VIF of 1 means no correlation with the other predictors; values above roughly 5–10 are typically read as a warning that coefficient estimates may be unstable. The large VIF on the `const` row is an artifact of the intercept column and can be ignored — only the predictor rows matter.
- If you see high VIF for two predictors, consider removing one, combining them (e.g., a ratio), or using a dimensionality reduction approach.

## Conclusion

Multiple linear regression lets you estimate each predictor's contribution after controlling for the others — but this only works well when the predictors are not too highly correlated. Always check VIF before putting weight on individual coefficient estimates, and remember that statistical control by itself does not establish a causal effect.

For a more compact model syntax using R-style formulas, see the [Statsmodels Formula API](/tutorials/statsmodels-formula-api-regression). To verify that the model's assumptions hold, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).