# Using the Statsmodels Formula API for Regression

When you use `statsmodels.api` directly, you assemble predictor matrices by hand: selecting columns, calling `add_constant`, and tracking which columns correspond to which variables. As models grow with categorical variables, interaction terms, and transformations, this becomes verbose and error-prone. `statsmodels.formula.api` solves this with R-style formula syntax: you write `"mpg ~ hp + wt"` and the library handles matrix construction, the intercept, and dummy encoding automatically. The formulas are concise, self-documenting, and easy to modify: adding or dropping predictors is an edit to a single string. This tutorial covers [OLS](/tutorials/statsmodels-linear-regression) and [Logit](/tutorials/statsmodels-logistic-regression) workflows using formula syntax.

## Formula syntax with `statsmodels.formula.api`

The formula string follows the convention `response ~ predictor1 + predictor2`. The intercept is included automatically; you don't need `add_constant`.

```python
import requests
import pandas as pd
import statsmodels.formula.api as smf

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")
model = smf.ols("mpg ~ hp + wt + qsec", data=df).fit()
print(model.summary())
```
```text
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Fri, 10 Apr 2026   Prob (F-statistic):           4.51e-11
Time:                        12:30:03   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
- `smf.ols("mpg ~ hp + wt + qsec", data=df)` defines the model entirely from the formula and the DataFrame — no separate `X` and `y` variables needed.
- The intercept is included by default. To remove it (force through origin), append `- 1` to the formula: `"mpg ~ hp + wt - 1"`.
- The `data=df` argument means column names in the formula refer directly to DataFrame columns — typos in column names raise a clear error instead of silently producing wrong results.
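To make the intercept behavior concrete, here is a minimal sketch using a small made-up DataFrame in place of the full mtcars data (the values below are illustrative, not the real dataset). It fits the same formula with and without the `- 1` suffix and compares the resulting parameter names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Small illustrative DataFrame standing in for mtcars (values are made up)
df = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3, 24.4, 19.2, 17.8, 16.4],
    "hp":  [110, 93, 175, 245, 62, 123, 123, 180],
    "wt":  [2.62, 2.32, 3.44, 3.57, 3.19, 3.44, 3.44, 4.07],
})

# Default: the intercept is added automatically
with_intercept = smf.ols("mpg ~ hp + wt", data=df).fit()
print(with_intercept.params.index.tolist())   # ['Intercept', 'hp', 'wt']

# Appending "- 1" drops the intercept (regression through the origin)
no_intercept = smf.ols("mpg ~ hp + wt - 1", data=df).fit()
print(no_intercept.params.index.tolist())     # ['hp', 'wt']
```

The fitted coefficients live in `model.params`, indexed by the names the formula generated, so you never have to remember which matrix column was which.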

## Categorical predictors with `C()`

Numeric columns are treated as continuous by default. When a column contains discrete groups (like cylinder count), wrapping it in `C()` tells statsmodels to create indicator variables instead of treating it as a single numeric trend.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")
cat_model = smf.ols("mpg ~ hp + wt + C(cyl)", data=df).fit()
print(cat_model.summary())
```
```text
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.857
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     40.53
Date:                Fri, 10 Apr 2026   Prob (F-statistic):           4.87e-11
Time:                        12:30:03   Log-Likelihood:                -71.235
No. Observations:                  32   AIC:                             152.5
Df Residuals:                      27   BIC:                             159.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      35.8460      2.041     17.563      0.000      31.658      40.034
C(cyl)[T.6]    -3.3590      1.402     -2.396      0.024      -6.235      -0.483
C(cyl)[T.8]    -3.1859      2.170     -1.468      0.154      -7.639       1.268
hp             -0.0231      0.012     -1.934      0.064      -0.048       0.001
wt             -3.1814      0.720     -4.421      0.000      -4.658      -1.705
==============================================================================
Omnibus:                        2.972   Durbin-Watson:                   1.790
Prob(Omnibus):                  0.226   Jarque-Bera (JB):                1.864
Skew:                           0.569   Prob(JB):                        0.394
Kurtosis:                       3.320   Cond. No.                     1.08e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
- `C(cyl)` encodes cylinder count (4, 6, 8) as two indicator variables with one level as the reference — the coefficients represent the mean difference in `mpg` between each level and the reference group, after controlling for `hp` and `wt`.
- Without `C()`, `cyl` would be treated as continuous, implying that going from 4 to 6 cylinders has the same effect as going from 6 to 8 — which may not be true.
- The reference level is chosen automatically (usually the lowest value); you can change it with `C(cyl, Treatment(reference=6))`.
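The reference-level switch can be sketched with a toy DataFrame (the values below are made up; the real tutorial uses mtcars). Compare the dummy columns patsy generates under the default coding versus `Treatment(reference=6)`:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with a discrete group column (values are made up)
df = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3, 24.4, 19.2, 17.8, 16.4, 30.4, 15.2],
    "cyl": [6, 4, 8, 8, 4, 6, 6, 8, 4, 8],
})

# Default coding: the lowest level (4) is the reference
default_ref = smf.ols("mpg ~ C(cyl)", data=df).fit()
print(default_ref.params.index.tolist())
# ['Intercept', 'C(cyl)[T.6]', 'C(cyl)[T.8]']

# Treatment(reference=6) makes 6-cylinder cars the baseline instead:
# the dummies now cover levels 4 and 8, and the intercept is the mean for cyl=6
custom_ref = smf.ols("mpg ~ C(cyl, Treatment(reference=6))", data=df).fit()
print(custom_ref.params.index.tolist())
```

Changing the reference does not change the model's fit or predictions, only which group the coefficients are compared against.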

## Interaction terms

An interaction term tests whether the effect of one predictor depends on the value of another. For example, the impact of weight on fuel efficiency might be different for high-horsepower vs. low-horsepower cars.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")
interaction_model = smf.ols("mpg ~ wt * hp", data=df).fit()
print(interaction_model.summary())
```
```text
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.885
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     71.66
Date:                Fri, 10 Apr 2026   Prob (F-statistic):           2.98e-13
Time:                        12:30:03   Log-Likelihood:                -67.805
No. Observations:                  32   AIC:                             143.6
Df Residuals:                      28   BIC:                             149.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     49.8084      3.605     13.816      0.000      42.424      57.193
wt            -8.2166      1.270     -6.471      0.000     -10.818      -5.616
hp            -0.1201      0.025     -4.863      0.000      -0.171      -0.070
wt:hp          0.0278      0.007      3.753      0.001       0.013       0.043
==============================================================================
Omnibus:                        2.221   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.329   Jarque-Bera (JB):                1.736
Skew:                           0.407   Prob(JB):                        0.420
Kurtosis:                       2.200   Cond. No.                     6.35e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.35e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
- `wt * hp` expands to `wt + hp + wt:hp` — it includes both main effects and their product term automatically.
- The `wt:hp` coefficient measures how much the effect of `wt` on `mpg` changes per unit increase in `hp`. A significant interaction means the two variables don't act independently.
- To include only the interaction without main effects (rarely appropriate), use `wt:hp` directly.
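One way to read an interaction is to compute the implied slope of one predictor at different values of the other. A minimal sketch with made-up data (the real tutorial uses mtcars): since the model is `mpg = b0 + b1*wt + b2*hp + b3*(wt*hp)`, the marginal effect of `wt` is `b1 + b3*hp`.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (values are made up); "wt * hp" expands to wt + hp + wt:hp
df = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3, 24.4, 19.2, 17.8, 16.4, 30.4, 15.2],
    "hp":  [110, 93, 175, 245, 62, 123, 123, 180, 66, 150],
    "wt":  [2.62, 2.32, 3.44, 3.57, 3.19, 3.44, 3.44, 4.07, 1.62, 3.73],
})

m = smf.ols("mpg ~ wt * hp", data=df).fit()
p = m.params

# The effect of wt on mpg depends on hp: d(mpg)/d(wt) = wt coef + (wt:hp coef * hp)
for hp in (75, 200):
    slope = p["wt"] + p["wt:hp"] * hp
    print(f"slope of wt at hp={hp}: {slope:.3f}")
```

If the `wt:hp` coefficient were zero, the two slopes would be identical; the gap between them is exactly what the interaction term captures.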

## Formula API with logistic regression

The same formula syntax works for binary outcomes: just swap `smf.ols` for `smf.logit`. The fitting workflow and the structure of the summary are otherwise the same.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.summary())
```
```text
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                     am   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Fri, 10 Apr 2026   Pseudo R-squ.:                  0.7972
Time:                        12:30:03   Log-Likelihood:                -4.3831
converged:                       True   LL-Null:                       -21.615
Covariance Type:            nonrobust   LLR p-value:                 1.581e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -15.7214     40.003     -0.393      0.694     -94.125      62.683
mpg            1.2293      1.581      0.778      0.437      -1.870       4.328
hp             0.0839      0.082      1.020      0.308      -0.077       0.245
wt            -6.9549      3.353     -2.074      0.038     -13.527      -0.383
==============================================================================

Possibly complete quasi-separation: A fraction 0.25 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
```
- `smf.logit(...)` fits a logistic regression using the same formula syntax — switching between regression types is a one-word change.
- The intercept is added automatically here too, so there is no `add_constant` call; append `- 1` to the formula if you need to remove it.
- The summary output is identical in structure to the matrix-based `sm.Logit` version — all the same coefficient tables and fit statistics are present.
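Once fitted, `predict()` returns probabilities for new rows with the same column names as the formula. A minimal sketch with a made-up binary outcome (the data below is illustrative, loosely mtcars-like, not the real dataset):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy binary-outcome data (values are made up); am=1 is more common at low wt
df = pd.DataFrame({
    "am": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    "wt": [2.62, 2.32, 3.44, 3.57, 3.19, 2.88, 4.07, 3.30, 3.73, 2.20],
})

logit_model = smf.logit("am ~ wt", data=df).fit(disp=False)

# predict() takes a DataFrame whose columns match the formula's predictors
new_cars = pd.DataFrame({"wt": [2.0, 3.5]})
probs = logit_model.predict(new_cars)
print(probs.round(3))  # predicted probability of am=1 for each new row
```

Because the formula travels with the fitted model, prediction needs no manual `add_constant` or column reordering on the new data.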

## Conclusion

The Formula API makes models easier to write, read, and iterate on — especially when adding categorical variables or interaction terms. The underlying math and output are identical to the matrix-based API; it's purely a syntax convenience.

For interpreting the regression output in detail, see [OLS linear regression](/tutorials/statsmodels-linear-regression) and [logistic regression](/tutorials/statsmodels-logistic-regression). To check whether your model's assumptions hold after fitting, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).