Many practical modeling tasks have binary outcomes: a customer converts or doesn't, a loan defaults or is repaid, an email is spam or not. Logistic regression is built for these situations. Instead of fitting a line to the raw 0/1 values (which can produce predictions outside the valid 0–1 range), it models the *log-odds* of the outcome and then maps that through a logistic curve to get probabilities between 0 and 1. The result is a model that's both interpretable — each coefficient tells you how a predictor shifts the log-odds — and actionable: you choose a probability threshold to convert predictions into decisions. Statsmodels' `Logit` adds inference on top: p-values and confidence intervals for every coefficient, so you can judge which predictors actually matter.

## Logistic regression vs [linear regression](/tutorials/statsmodels-linear-regression)

Linear regression predicts unbounded numeric values. Logistic regression predicts probabilities between `0` and `1` using a logistic curve, then maps those probabilities to classes using a threshold. The coefficients have a different interpretation: each coefficient represents the change in log-odds per unit increase in the predictor, not a direct change in the outcome.

## Preparing binary target data

The `am` column in `mtcars` is 1 for manual transmission and 0 for automatic — a natural binary target. We'll use `mpg`, `hp`, and `wt` as predictors.
```python
import requests
import pandas as pd

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")
y = df["am"]
X = df[["mpg", "hp", "wt"]]
print(X.head())
print(y.head())
```

- Saving to `mtcars.csv` once lets all later blocks load from disk cleanly.
- `df["am"]` is already coded as 0/1, which is exactly what `Logit` expects — no encoding needed here.
- Printing both `X` and `y` confirms the shapes and that the target has the expected binary values before you invest time fitting the model.

## Adding an intercept

Statsmodels does not include an intercept automatically, so you must add one explicitly before fitting. Without it, the model forces the log-odds to be zero when all predictors are zero (a baseline probability of exactly 0.5), which is rarely a meaningful constraint.
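The intercept's coefficient is a baseline log-odds. To make that concrete, here is a small sketch (the value `-1.5` is made up purely for illustration) of how the logistic function turns a baseline log-odds into a baseline probability:

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any log-odds value to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical baseline log-odds, i.e. what an intercept coefficient represents
baseline_log_odds = -1.5
baseline_prob = sigmoid(baseline_log_odds)
print(round(baseline_prob, 3))  # ≈ 0.182
```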
```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
print(X.head())
```

- `sm.add_constant(...)` adds a `const` column of 1s — the corresponding coefficient is the baseline log-odds when all other predictors are zero.
- The same `add_constant` call works identically for `OLS` and `Logit` — same pattern, same requirement.

## Fitting the Logit model

Fitting `Logit` is syntactically identical to fitting `OLS` — pass the target and predictor matrix, then call `.fit()`.
```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)
print(logit_model.summary())
```

- `disp=False` suppresses the optimizer's convergence output — the summary will still print.
- The summary reports pseudo R-squared (McFadden's R²) rather than standard R-squared — values between 0.2 and 0.4 are generally considered a good fit for logistic regression.
- The `Log-Likelihood` and `LLR p-value` in the header compare the full model against an intercept-only null model that assigns every observation the same baseline probability — a low p-value means the predictors collectively improve the fit.

## Interpreting coefficients and odds ratios

Logit coefficients are in log-odds space, which is hard to interpret directly. Exponentiating them converts them to odds ratios, which are easier to communicate: an odds ratio of 2 means the odds of the outcome double for each one-unit increase in the predictor.
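As a quick numeric sanity check of that interpretation (the coefficient value ln 2 is chosen purely for illustration):

```python
import numpy as np

log_odds_coef = np.log(2)         # a coefficient of ln(2) in log-odds space
print(np.exp(log_odds_coef))      # odds ratio of ~2: odds double per unit increase
print(np.exp(2 * log_odds_coef))  # two units: odds ratios multiply, so ~4
```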
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)

coef_df = pd.DataFrame(
    {
        "coef": logit_model.params,
        "odds_ratio": np.exp(logit_model.params),
        "p_value": logit_model.pvalues,
    }
)
print(coef_df)
```

- `np.exp(logit_model.params)` converts each log-odds coefficient to an odds ratio — an odds ratio > 1 means the predictor increases the odds of the outcome; < 1 means it decreases them.
- A p-value below 0.05 for a coefficient means that predictor is a statistically significant contributor to the log-odds of the outcome.
- The intercept's odds ratio usually isn't meaningful to interpret directly — focus on the predictor coefficients.

## Predicting probabilities and classification thresholds

After fitting, use `.predict()` to get probabilities for new observations. Converting probabilities to class labels requires choosing a threshold — 0.5 is the default, but you can adjust it to favor precision or recall depending on the cost of each type of error.
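Under the hood, `.predict()` computes the linear predictor from the coefficients and passes it through the logistic function. A minimal hand-rolled sketch with made-up coefficients (not the values fitted above):

```python
import numpy as np

def predict_prob(params, row):
    # dot product gives the log-odds; the logistic function maps it to a probability
    z = float(np.dot(params, row))
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients in the order [const, mpg, hp, wt]
params = np.array([-8.0, 0.3, 0.01, -1.0])
row = np.array([1.0, 28.0, 95.0, 2.1])  # const=1, mpg=28, hp=95, wt=2.1
print(round(predict_prob(params, row), 3))  # ≈ 0.321
```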
```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)

new_cars = pd.DataFrame(
    {
        "mpg": [18.0, 28.0],
        "hp": [150, 95],
        "wt": [3.4, 2.1],
    }
)
new_cars = sm.add_constant(new_cars, has_constant="add")

prob = logit_model.predict(new_cars)
class_at_05 = (prob >= 0.5).astype(int)
class_at_07 = (prob >= 0.7).astype(int)
print("Probabilities:", prob.values)
print("Classes at threshold 0.5:", class_at_05.values)
print("Classes at threshold 0.7:", class_at_07.values)
```

- `has_constant="add"` forces the intercept column to be added to the new DataFrame — required when new data doesn't already have a `const` column.
- `(prob >= 0.5).astype(int)` converts probabilities to 0/1 labels — observations with probability ≥ 0.5 are classified as 1 (manual transmission).
- Using a higher threshold (0.7) makes the model more conservative about predicting class 1 — some previously predicted 1s become 0s. This tradeoff matters in practice: a fraud model might use a very low threshold to catch more fraud, at the cost of more false alarms.

### Conclusion

Statsmodels `Logit` is ideal when you need to both classify and *explain* — the odds ratios and p-values tell you which predictors matter and by how much. For pure classification without interpretation, scikit-learn's `LogisticRegression` integrates better into ML pipelines. For more complex model specifications, see the [Statsmodels Formula API](/tutorials/statsmodels-formula-api-regression). To verify that the model is well-calibrated and assumptions hold, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).
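As a point of comparison, a minimal sketch of the scikit-learn route (assumes scikit-learn ≥ 1.2 is installed; `penalty=None` disables its default L2 regularization so the coefficients are roughly comparable to `Logit`'s):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
df = pd.read_csv(url)
X = df[["mpg", "hp", "wt"]]
y = df["am"]

# no summary, p-values, or confidence intervals here:
# scikit-learn optimizes for prediction, not inference
clf = LogisticRegression(penalty=None, max_iter=5000)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)
print(clf.predict_proba(X)[:3, 1])  # P(am=1) for the first three cars
```

Note the trade: you gain pipeline tools (cross-validation, grid search, transformers) and lose the inferential summary that motivated using `Logit` in the first place.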