As models grow, manually assembling design matrices becomes harder to read and maintain. `statsmodels.formula.api` keeps model definitions compact and expressive. In this tutorial, you will use formula syntax for OLS and Logit workflows.

## Formula syntax with `statsmodels.formula.api`
```python
import requests
import pandas as pd
import statsmodels.formula.api as smf

# Download once and keep a local copy for all later formula examples
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Fit OLS using formula notation instead of manual matrix building
df = pd.read_csv("mtcars.csv")
model = smf.ols("mpg ~ hp + wt + qsec", data=df).fit()
print(model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     47.15
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.51e-11
Time:                        16:48:02   Log-Likelihood:                -73.571
No. Observations:                  32   AIC:                             155.1
Df Residuals:                      28   BIC:                             161.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.6105      8.420      3.279      0.003      10.363      44.858
hp            -0.0178      0.015     -1.190      0.244      -0.049       0.013
wt            -4.3588      0.753     -5.791      0.000      -5.901      -2.817
qsec           0.5108      0.439      1.163      0.255      -0.389       1.411
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   1.422
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                3.368
Skew:                           0.786   Prob(JB):                        0.186
Kurtosis:                       3.230   Cond. No.                     3.00e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
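Formulas like `mpg ~ hp + wt + qsec` include an intercept automatically, and appending `- 1` removes it. A minimal sketch of both behaviors, using a small synthetic DataFrame (the column names echo `mtcars`, but the values are made up here to keep the example self-contained):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (not mtcars) just to illustrate intercept handling
rng = np.random.default_rng(0)
df = pd.DataFrame({"hp": rng.uniform(50, 300, 40),
                   "wt": rng.uniform(1.5, 5.5, 40)})
df["mpg"] = 30 - 3 * df["wt"] - 0.02 * df["hp"] + rng.normal(0, 1, 40)

with_intercept = smf.ols("mpg ~ hp + wt", data=df).fit()
no_intercept = smf.ols("mpg ~ hp + wt - 1", data=df).fit()  # "- 1" drops it

print("Intercept" in with_intercept.params.index)  # True
print("Intercept" in no_intercept.params.index)    # False
```

Writing `mpg ~ 0 + hp + wt` is an equivalent way to suppress the intercept.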
The formula `y ~ x1 + x2` style is concise and automatically includes an intercept unless you explicitly remove it. This approach reduces boilerplate and makes it easier for readers to understand your model definition at a glance.

## Categorical predictors with `C()`
```python
import pandas as pd
import statsmodels.formula.api as smf

# Treat `cyl` as categorical so statsmodels creates indicator variables
df = pd.read_csv("mtcars.csv")
cat_model = smf.ols("mpg ~ hp + wt + C(cyl)", data=df).fit()
print(cat_model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.857
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     40.53
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.87e-11
Time:                        16:48:02   Log-Likelihood:                -71.235
No. Observations:                  32   AIC:                             152.5
Df Residuals:                      27   BIC:                             159.8
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      35.8460      2.041     17.563      0.000      31.658      40.034
C(cyl)[T.6]    -3.3590      1.402     -2.396      0.024      -6.235      -0.483
C(cyl)[T.8]    -3.1859      2.170     -1.468      0.154      -7.639       1.268
hp             -0.0231      0.012     -1.934      0.064      -0.048       0.001
wt             -3.1814      0.720     -4.421      0.000      -4.658      -1.705
==============================================================================
Omnibus:                        2.972   Durbin-Watson:                   1.790
Prob(Omnibus):                  0.226   Jarque-Bera (JB):                1.864
Skew:                           0.569   Prob(JB):                        0.394
Kurtosis:                       3.320   Cond. No.                     1.08e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
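By default, `C()` uses treatment coding with the lowest level as the reference, which is why the summary above shows `C(cyl)[T.6]` and `C(cyl)[T.8]` but no term for 4 cylinders. The formula language also accepts `Treatment(reference=...)` to pick a different baseline; a sketch with a tiny synthetic dataset (hypothetical values, not the `mtcars` download):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny synthetic dataset with a 3-level "cyl" column (made-up values)
df = pd.DataFrame({
    "cyl": [4, 4, 4, 6, 6, 6, 8, 8, 8, 4, 6, 8],
    "mpg": [30, 28, 27, 21, 20, 19, 15, 14, 13, 29, 22, 16],
})

# Default treatment coding: the lowest level (4) is the reference group
default_fit = smf.ols("mpg ~ C(cyl)", data=df).fit()
print(default_fit.params.index.tolist())
# ['Intercept', 'C(cyl)[T.6]', 'C(cyl)[T.8]']

# Treatment(reference=8) makes 8 cylinders the baseline instead, so the
# remaining dummies compare 4- and 6-cylinder groups against it
ref8_fit = smf.ols("mpg ~ C(cyl, Treatment(reference=8))", data=df).fit()
print(ref8_fit.params.index.tolist())
```

Each dummy coefficient is then the mean difference from the chosen reference group, holding any other predictors fixed.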
`C(cyl)` encodes cylinder count as categorical levels instead of treating it as one continuous numeric trend. This matters when you expect group-level differences rather than a single linear effect across numeric values.

## Interaction terms
```python
import pandas as pd
import statsmodels.formula.api as smf

# `*` includes both main effects and their interaction term
df = pd.read_csv("mtcars.csv")
interaction_model = smf.ols("mpg ~ wt * hp", data=df).fit()
print(interaction_model.summary())
```

```text
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.885
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     71.66
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           2.98e-13
Time:                        16:48:02   Log-Likelihood:                -67.805
No. Observations:                  32   AIC:                             143.6
Df Residuals:                      28   BIC:                             149.5
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     49.8084      3.605     13.816      0.000      42.424      57.193
wt            -8.2166      1.270     -6.471      0.000     -10.818      -5.616
hp            -0.1201      0.025     -4.863      0.000      -0.171      -0.070
wt:hp          0.0278      0.007      3.753      0.001       0.013       0.043
==============================================================================
Omnibus:                        2.221   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.329   Jarque-Bera (JB):                1.736
Skew:                           0.407   Prob(JB):                        0.420
Kurtosis:                       2.200   Cond. No.                     6.35e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.35e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
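The formula language distinguishes `*` (main effects plus interaction) from `:` (the interaction alone). A quick way to see the difference is to compare the parameter names each formula produces; the data here is a small synthetic frame, not the `mtcars` download:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a built-in wt x hp interaction (made-up coefficients)
rng = np.random.default_rng(1)
df = pd.DataFrame({"wt": rng.uniform(1.5, 5.5, 60),
                   "hp": rng.uniform(50, 300, 60)})
df["mpg"] = (50 - 8 * df["wt"] - 0.1 * df["hp"]
             + 0.02 * df["wt"] * df["hp"] + rng.normal(0, 1, 60))

full = smf.ols("mpg ~ wt * hp", data=df).fit()      # main effects + interaction
only_int = smf.ols("mpg ~ wt:hp", data=df).fit()    # interaction term only

print(full.params.index.tolist())      # ['Intercept', 'wt', 'hp', 'wt:hp']
print(only_int.params.index.tolist())  # ['Intercept', 'wt:hp']
```

Dropping the main effects with `:` is rarely what you want for inference, so `*` is the usual choice unless you have a specific reason to constrain the main effects to zero.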
`wt * hp` expands to main effects plus `wt:hp`, so the effect of one variable can depend on the value of the other. Interaction terms are useful when relationships are not additive and one feature changes the impact of another.

## Formula API with logistic regression
```python
import pandas as pd
import statsmodels.formula.api as smf

# The same formula workflow works for binary outcomes with logit
df = pd.read_csv("mtcars.csv")
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.summary())
```

```text
                           Logit Regression Results
==============================================================================
Dep. Variable:                     am   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Wed, 11 Mar 2026   Pseudo R-squ.:                  0.7972
Time:                        16:48:02   Log-Likelihood:                -4.3831
converged:                       True   LL-Null:                       -21.615
Covariance Type:            nonrobust   LLR p-value:                 1.581e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -15.7214     40.003     -0.393      0.694     -94.125      62.683
mpg            1.2293      1.581      0.778      0.437      -1.870       4.328
hp             0.0839      0.082      1.020      0.308      -0.077       0.245
wt            -6.9549      3.353     -2.074      0.038     -13.527      -0.383
==============================================================================

Possibly complete quasi-separation: A fraction 0.25 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
```
The same formula style works for `logit`, making it easy to switch between regression types while keeping readable model definitions. That consistency is helpful when you compare linear and classification models in the same project.
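One payoff of the shared workflow is that a fitted logit result's `predict` method returns probabilities on the 0 to 1 scale, just as an OLS result's `predict` returns fitted values. A sketch with synthetic data (a made-up binary `am` column driven by `wt`, not the `mtcars` download):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic binary outcome whose probability falls as weight rises
rng = np.random.default_rng(2)
df = pd.DataFrame({"wt": rng.uniform(1.5, 5.5, 80)})
true_logits = 6.0 - 2.0 * df["wt"]
df["am"] = (rng.uniform(size=80) < 1 / (1 + np.exp(-true_logits))).astype(int)

fit = smf.logit("am ~ wt", data=df).fit(disp=False)

# predict() on new data returns P(am = 1), not raw log-odds
new = pd.DataFrame({"wt": [2.0, 4.5]})
probs = fit.predict(new)
print(probs.round(3))  # lighter car gets the higher probability
```

Because prediction, summaries, and formulas all follow the same pattern, swapping `smf.ols` for `smf.logit` changes the model family without changing how the surrounding code is written.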