When you use `statsmodels.api` directly, you assemble predictor matrices by hand: selecting columns, calling `add_constant`, and tracking which columns correspond to which variables. As models grow to include categorical variables, interaction terms, and transformations, this becomes verbose and error-prone. `statsmodels.formula.api` solves this with R-style formula syntax: you write `"mpg ~ hp + wt"` and the library handles matrix construction, the intercept, and dummy encoding automatically. Formulas are concise, self-documenting, and easy to modify; growing a model from three predictors to five is a small edit to a single string. This tutorial covers [OLS](/tutorials/statsmodels-linear-regression) and [Logit](/tutorials/statsmodels-logistic-regression) workflows using formula syntax.

## Formula syntax with `statsmodels.formula.api`

The formula string follows the convention `response ~ predictor1 + predictor2`. The intercept is included automatically; you don't need `add_constant`.
```python
import requests
import pandas as pd
import statsmodels.formula.api as smf

# Download the mtcars dataset and cache it locally
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")

# Fit OLS directly from the formula string
model = smf.ols("mpg ~ hp + wt + qsec", data=df).fit()
print(model.summary())
```
- `smf.ols("mpg ~ hp + wt + qsec", data=df)` defines the model entirely from the formula and the DataFrame — no separate `X` and `y` variables needed.
- The intercept is included by default. To remove it (force through origin), append `- 1` to the formula: `"mpg ~ hp + wt - 1"`.
- The `data=df` argument means column names in the formula refer directly to DataFrame columns — typos in column names raise a clear error instead of silently producing wrong results.
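To see the intercept behavior concretely, here is a minimal sketch using a small invented DataFrame (the values are illustrative, not the real mtcars data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny illustrative DataFrame (values invented for this sketch)
toy = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3, 24.4, 19.2],
    "hp":  [110, 93, 175, 245, 62, 123],
    "wt":  [2.62, 2.32, 3.44, 3.57, 3.19, 3.52],
})

with_intercept = smf.ols("mpg ~ hp + wt", data=toy).fit()
no_intercept = smf.ols("mpg ~ hp + wt - 1", data=toy).fit()

print("Intercept" in with_intercept.params.index)  # True
print("Intercept" in no_intercept.params.index)    # False
```

Dropping the intercept changes the meaning of every coefficient, so reserve `- 1` for cases where a zero-intercept model is theoretically justified.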
## Categorical predictors with `C()`
Numeric columns are treated as continuous by default. When a column contains discrete groups (like cylinder count), wrapping it in `C()` tells statsmodels to create indicator variables instead of treating it as a single numeric trend.
```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")

# Wrap cyl in C() so it is dummy-encoded rather than fit as a single slope
cat_model = smf.ols("mpg ~ hp + wt + C(cyl)", data=df).fit()
print(cat_model.summary())
```

- `C(cyl)` encodes cylinder count (4, 6, 8) as two indicator variables with one level as the reference — the coefficients represent the mean difference in `mpg` between each level and the reference group, after controlling for `hp` and `wt`.
- Without `C()`, `cyl` would be treated as continuous, implying that going from 4 to 6 cylinders has the same effect as going from 6 to 8 — which may not be true.
- The reference level is chosen automatically (usually the lowest value); you can change it with `C(cyl, Treatment(reference=6))`.

## Interaction terms

An interaction term tests whether the effect of one predictor depends on the value of another. For example, the impact of weight on fuel efficiency might be different for high-horsepower vs. low-horsepower cars.
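In formula syntax, `a * b` is shorthand for `a + b + a:b`. A quick way to convince yourself is to fit both forms and compare; here is a hedged sketch on synthetic data (generated purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a built-in wt x hp interaction (values invented)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"wt": rng.uniform(1.5, 5.5, 40),
                    "hp": rng.uniform(50, 300, 40)})
toy["mpg"] = (37 - 3 * toy["wt"] - 0.03 * toy["hp"]
              + 0.005 * toy["wt"] * toy["hp"] + rng.normal(0, 1, 40))

star = smf.ols("mpg ~ wt * hp", data=toy).fit()              # shorthand
expanded = smf.ols("mpg ~ wt + hp + wt:hp", data=toy).fit()  # spelled out

print(list(star.params.index))                    # ['Intercept', 'wt', 'hp', 'wt:hp']
print(np.allclose(star.params, expanded.params))  # True
```

Both formulas build the same four-column design matrix, so the fits are identical; `*` is purely notational convenience.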
```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")

# wt * hp expands to wt + hp + wt:hp
interaction_model = smf.ols("mpg ~ wt * hp", data=df).fit()
print(interaction_model.summary())
```

- `wt * hp` expands to `wt + hp + wt:hp` — it includes both main effects and their product term automatically.
- The `wt:hp` coefficient measures how much the effect of `wt` on `mpg` changes per unit increase in `hp`. A significant interaction means the two variables don't act independently.
- To include only the interaction without main effects (rarely appropriate), use `wt:hp` directly.

## Formula API with logistic regression

The same formula syntax works for binary outcomes — just swap `smf.ols` for `smf.logit`. The model interface and output format are identical.
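As a minimal demonstration of that swap, the sketch below fits a logistic model on synthetic data (invented for illustration) and scores predicted probabilities with `.predict()`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic binary outcome: heavier cars less likely to be manual (am = 1)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"wt": rng.uniform(1.5, 5.5, 200)})
p_manual = 1 / (1 + np.exp(toy["wt"] - 3.2))
toy["am"] = (rng.uniform(0, 1, 200) < p_manual).astype(int)

# Same formula syntax as smf.ols -- only the function name changes
logit_fit = smf.logit("am ~ wt", data=toy).fit(disp=False)

probs = logit_fit.predict(toy)    # predicted probabilities, one per row
print(probs.between(0, 1).all())  # True
```

A useful side effect of the formula interface: `.predict()` reapplies the stored formula to new data automatically, including any `C()` encodings or transformations, so you pass raw columns rather than a hand-built design matrix.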
```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")

# Logistic regression: am (0 = automatic, 1 = manual) as the response
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.summary())
```

- `smf.logit(...)` fits a logistic regression using the same formula syntax — switching between regression types is a one-word change.
- The intercept is added automatically here too, with no `add_constant` step; it stays in the model unless you append `- 1`.
- The summary output is identical in structure to the matrix-based `sm.Logit` version — all the same coefficient tables and fit statistics are present.

### Conclusion

The Formula API makes models easier to write, read, and iterate on — especially when adding categorical variables or interaction terms. The underlying math and output are identical to the matrix-based API; it's purely a syntax convenience. For interpreting the regression output in detail, see [OLS linear regression](/tutorials/statsmodels-linear-regression) and [logistic regression](/tutorials/statsmodels-logistic-regression). To check whether your model's assumptions hold after fitting, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).