# Using the Statsmodels Formula API for Regression

As models grow, manually assembling design matrices becomes harder to read and maintain. `statsmodels.formula.api` keeps model definitions compact and expressive.

In this tutorial, you will use formula syntax for OLS and Logit workflows.

## Formula syntax with `statsmodels.formula.api`

```python
import requests
import pandas as pd
import statsmodels.formula.api as smf

# Download once and keep a local copy for all later formula examples
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Fit OLS using formula notation instead of manual matrix building
df = pd.read_csv("mtcars.csv")
model = smf.ols("mpg ~ hp + wt + qsec", data=df).fit()
print(model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.51e-11
Time:                        16:48:02   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
The formula `y ~ x1 + x2` style is concise and automatically includes an intercept unless you explicitly remove it. This approach reduces boilerplate and makes it easier for readers to understand your model definition at a glance.
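To see the intercept behavior concretely, here is a minimal sketch on synthetic data (the data frame, column names, and seed below are illustrative, not from mtcars): appending `- 1` (or `+ 0`) to a formula removes the automatic intercept term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: y is roughly 2 * x plus noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=50)

# Default formula: an Intercept column is added automatically
with_intercept = smf.ols("y ~ x", data=df).fit()

# "- 1" (equivalently "+ 0") suppresses the intercept
no_intercept = smf.ols("y ~ x - 1", data=df).fit()

print("Intercept" in with_intercept.params.index)  # True
print("Intercept" in no_intercept.params.index)    # False
```

Only force the model through the origin when the domain genuinely implies a zero response at zero inputs; otherwise the default intercept is usually the right choice.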

## Categorical predictors with `C()`

```python
import pandas as pd
import statsmodels.formula.api as smf

# Treat `cyl` as categorical so statsmodels creates indicator variables
df = pd.read_csv("mtcars.csv")
cat_model = smf.ols("mpg ~ hp + wt + C(cyl)", data=df).fit()
print(cat_model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.857
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     40.53
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.87e-11
Time:                        16:48:02   Log-Likelihood:                -71.235
No. Observations:                  32   AIC:                             152.5
Df Residuals:                      27   BIC:                             159.8
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      35.8460      2.041     17.563      0.000      31.658      40.034
C(cyl)[T.6]    -3.3590      1.402     -2.396      0.024      -6.235      -0.483
C(cyl)[T.8]    -3.1859      2.170     -1.468      0.154      -7.639       1.268
hp             -0.0231      0.012     -1.934      0.064      -0.048       0.001
wt             -3.1814      0.720     -4.421      0.000      -4.658      -1.705
==============================================================================
Omnibus:                        2.972   Durbin-Watson:                   1.790
Prob(Omnibus):                  0.226   Jarque-Bera (JB):                1.864
Skew:                           0.569   Prob(JB):                        0.394
Kurtosis:                       3.320   Cond. No.                     1.08e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
`C(cyl)` encodes cylinder count as categorical levels instead of treating it as one continuous numeric trend. This matters when you expect group-level differences rather than a single linear effect across numeric values.
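By default, `C()` uses the lowest level as the reference category, so each coefficient is a contrast against that baseline. You can change the baseline with the `Treatment` contrast. Here is a minimal sketch on a hypothetical toy data frame (the `grp` and `y` columns below are invented for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: three groups with different mean responses
df = pd.DataFrame({
    "y":   [1.0, 2.0, 1.5, 4.0, 4.5, 3.8, 7.0, 7.2, 6.9],
    "grp": [4, 4, 4, 6, 6, 6, 8, 8, 8],
})

# Default: the lowest level (4) is the reference category
default_fit = smf.ols("y ~ C(grp)", data=df).fit()

# Treatment(reference=8) makes level 8 the baseline instead
relevel_fit = smf.ols("y ~ C(grp, Treatment(reference=8))", data=df).fit()

print(list(default_fit.params.index))  # contrasts vs. group 4
print(list(relevel_fit.params.index))  # contrasts vs. group 8
```

Picking the baseline deliberately makes the reported coefficients line up with the comparison you actually care about, instead of whichever level sorts first.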

## Interaction terms

```python
import pandas as pd
import statsmodels.formula.api as smf

# `*` includes both main effects and their interaction term
df = pd.read_csv("mtcars.csv")
interaction_model = smf.ols("mpg ~ wt * hp", data=df).fit()
print(interaction_model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.885
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     71.66
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           2.98e-13
Time:                        16:48:02   Log-Likelihood:                -67.805
No. Observations:                  32   AIC:                             143.6
Df Residuals:                      28   BIC:                             149.5
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     49.8084      3.605     13.816      0.000      42.424      57.193
wt            -8.2166      1.270     -6.471      0.000     -10.818      -5.616
hp            -0.1201      0.025     -4.863      0.000      -0.171      -0.070
wt:hp          0.0278      0.007      3.753      0.001       0.013       0.043
==============================================================================
Omnibus:                        2.221   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.329   Jarque-Bera (JB):                1.736
Skew:                           0.407   Prob(JB):                        0.420
Kurtosis:                       2.200   Cond. No.                     6.35e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.35e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
`wt * hp` expands to main effects plus `wt:hp`, so the effect of one variable can depend on the value of the other. Interaction terms are useful when relationships are not additive and one feature changes the impact of another.
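The difference between `*` and `:` is easy to verify directly. As a minimal sketch on synthetic data (the columns `a`, `b`, `y` and the seed below are illustrative), `a * b` expands to `a + b + a:b`, while `a:b` alone adds only the interaction column:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with a genuine interaction effect
rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.normal(size=40), "b": rng.normal(size=40)})
df["y"] = df["a"] + df["b"] + 0.5 * df["a"] * df["b"] + rng.normal(scale=0.1, size=40)

full = smf.ols("y ~ a * b", data=df).fit()      # main effects + interaction
only_int = smf.ols("y ~ a:b", data=df).fit()    # interaction term only

print(list(full.params.index))      # ['Intercept', 'a', 'b', 'a:b']
print(list(only_int.params.index))  # ['Intercept', 'a:b']
```

In practice you almost always want `*`: dropping the main effects while keeping `a:b` makes the interaction coefficient hard to interpret.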

## Formula API with logistic regression

```python
import pandas as pd
import statsmodels.formula.api as smf

# The same formula workflow works for binary outcomes with logit
df = pd.read_csv("mtcars.csv")
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.summary())
```

```text
                           Logit Regression Results
==============================================================================
Dep. Variable:                     am   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Wed, 11 Mar 2026   Pseudo R-squ.:                  0.7972
Time:                        16:48:02   Log-Likelihood:                -4.3831
converged:                       True   LL-Null:                       -21.615
Covariance Type:            nonrobust   LLR p-value:                 1.581e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -15.7214     40.003     -0.393      0.694     -94.125      62.683
mpg            1.2293      1.581      0.778      0.437      -1.870       4.328
hp             0.0839      0.082      1.020      0.308      -0.077       0.245
wt            -6.9549      3.353     -2.074      0.038     -13.527      -0.383
==============================================================================

Possibly complete quasi-separation: A fraction 0.25 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
```
The same formula style works for `logit`, making it easy to switch between regression types while keeping readable model definitions. That consistency is helpful when you compare linear and classification models in the same project.
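Two common follow-ups after a logit fit are exponentiating coefficients into odds ratios and calling `predict()` for probabilities on new data. A minimal sketch on synthetic data (the column `x`, the seed, and the true coefficients below are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: binary outcome driven by one predictor
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=200)})
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * df["x"])))
df["y"] = rng.binomial(1, p)

fit = smf.logit("y ~ x", data=df).fit(disp=False)

# Exponentiated coefficients are odds ratios: the multiplicative
# change in odds for a one-unit increase in the predictor
odds_ratios = np.exp(fit.params)
print(odds_ratios)

# predict() returns fitted probabilities for new observations
new = pd.DataFrame({"x": [-1.0, 0.0, 1.0]})
print(fit.predict(new))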