Many practical modeling tasks are binary: convert or not, churn or stay, default or repay. Logistic regression is built for these outcomes and keeps interpretation straightforward. This tutorial focuses on **statsmodels logistic regression** with **statsmodels Logit**.

## Logistic regression vs linear regression

Linear regression predicts unbounded numeric values. Logistic regression predicts probabilities between `0` and `1` using a logistic curve, then maps those probabilities to classes using a threshold.

## Preparing binary target data
```python
import requests
import pandas as pd

# Download once and persist locally for later blocks
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

# Binary target and feature matrix
df = pd.read_csv("mtcars.csv")
y = df["am"]
X = df[["mpg", "hp", "wt"]]
print(X.head())
print(y.head())
```

Output:

```
    mpg   hp     wt
0  21.0  110  2.620
1  21.0  110  2.875
2  22.8   93  2.320
3  21.4  110  3.215
4  18.7  175  3.440
0    1
1    1
2    1
3    0
4    0
Name: am, dtype: int64
```
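Before moving on, the logistic curve described earlier is worth seeing directly. The sketch below defines a `sigmoid` helper (our own name, not part of statsmodels) to show how any real-valued log-odds score is squeezed into the `(0, 1)` probability range:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued log-odds z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Log-odds of 0 corresponds to a probability of exactly 0.5,
# which is why 0.5 is the natural default classification threshold
print(sigmoid(0.0))   # 0.5
print(sigmoid(2.0))   # ~0.88
print(sigmoid(-2.0))  # ~0.12
```

Symmetric inputs give probabilities that sum to 1, which is the shape that lets a single threshold split the curve into two classes.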
This block downloads and saves `mtcars.csv` once, then prepares the binary target and predictor matrix for classification. Defining `y` and `X` explicitly up front makes the modeling flow clearer and ensures the same dataset is reused consistently in later blocks.

## Adding intercept
```python
import pandas as pd
import statsmodels.api as sm

# Add intercept for baseline log-odds
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
print(X.head())
```

Output:

```
   const   mpg   hp     wt
0    1.0  21.0  110  2.620
1    1.0  21.0  110  2.875
2    1.0  22.8   93  2.320
3    1.0  21.4  110  3.215
4    1.0  18.7  175  3.440
```
`add_constant()` adds the intercept term required for a baseline log-odds estimate. Including an intercept lets the model represent the base probability level instead of forcing all predictors to explain it.

## Fitting the Logit model
```python
import pandas as pd
import statsmodels.api as sm

# Fit logistic regression model
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)
print(logit_model.summary())
```

Output:

```
                           Logit Regression Results
==============================================================================
Dep. Variable:                     am   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Wed, 11 Mar 2026   Pseudo R-squ.:                  0.7972
Time:                        16:48:03   Log-Likelihood:                -4.3831
converged:                       True   LL-Null:                       -21.615
Covariance Type:            nonrobust   LLR p-value:                 1.581e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -15.7214     40.003     -0.393      0.694     -94.125      62.683
mpg            1.2293      1.581      0.778      0.437      -1.870       4.328
hp             0.0839      0.082      1.020      0.308      -0.077       0.245
wt            -6.9549      3.353     -2.074      0.038     -13.527      -0.383
==============================================================================

Possibly complete quasi-separation: A fraction 0.25 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
```
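One summary line worth unpacking: `Pseudo R-squ.` is McFadden's pseudo R-squared, computed as `1 - LL/LL-Null`. Plugging in the log-likelihood values reported above reproduces the figure:

```python
# McFadden's pseudo R-squared from the log-likelihoods in the summary
ll_model = -4.3831   # Log-Likelihood of the fitted model
ll_null = -21.615    # LL-Null: the intercept-only model
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 4))  # 0.7972
```

A value near 0.8 indicates the predictors improve the log-likelihood substantially over the intercept-only baseline, though McFadden's measure is not comparable to the R-squared of a linear regression.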
This fits `Logit` and prints coefficient significance, confidence intervals, and fit statistics. The summary helps you decide which predictors carry useful classification signal before you move to prediction.

## Interpreting coefficients and odds ratios
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Train model then convert log-odds coefficients to odds ratios
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)
coef_df = pd.DataFrame(
    {
        "coef": logit_model.params,
        "odds_ratio": np.exp(logit_model.params),
        "p_value": logit_model.pvalues,
    }
)
print(coef_df)
```

Output:

```
            coef    odds_ratio   p_value
const -15.721371  1.486947e-07  0.694315
mpg     1.229302  3.418843e+00  0.436861
hp      0.083893  1.087513e+00  0.307900
wt     -6.954924  9.539266e-04  0.038056
```
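As a worked check, exponentiating the `wt` coefficient by hand (the value is copied from the table above) reproduces its odds ratio:

```python
import numpy as np

# wt coefficient from the summary: log-odds change per unit of wt
coef_wt = -6.954924
odds_ratio = np.exp(coef_wt)
print(odds_ratio)  # ~0.00095
```

In mtcars, `wt` is measured in thousands of pounds, so each additional 1000 lb of weight multiplies the odds of a manual transmission (`am = 1`) by roughly 0.001, holding `mpg` and `hp` fixed.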
Logit coefficients are in log-odds space; exponentiating them gives odds ratios, which are easier to interpret operationally. Odds ratios let you explain model behavior in practical terms, such as how much odds change when a feature increases.

## Predicting probabilities and classification thresholds
```python
import pandas as pd
import statsmodels.api as sm

# Fit model on training rows
df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)

new_cars = pd.DataFrame(
    {
        "mpg": [18.0, 28.0],
        "hp": [150, 95],
        "wt": [3.4, 2.1],
    }
)
new_cars = sm.add_constant(new_cars, has_constant="add")

# Predict probabilities, then map to classes at two thresholds
prob = logit_model.predict(new_cars)
class_at_05 = (prob >= 0.5).astype(int)
class_at_07 = (prob >= 0.7).astype(int)
print("Probabilities:")
print(prob)
print("Classes at threshold 0.5:")
print(class_at_05)
print("Classes at threshold 0.7:")
print(class_at_07)
```

Output:

```
Probabilities:
0    0.009409
1    0.999994
dtype: float64
Classes at threshold 0.5:
0    0
1    1
dtype: int64
Classes at threshold 0.7:
0    0
1    1
dtype: int64
```
This computes probabilities first and then shows how changing thresholds changes class assignments. Comparing thresholds demonstrates the precision/recall tradeoff you make when turning probabilities into hard yes/no decisions.
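That tradeoff can be made concrete with a toy set of predicted probabilities and true labels (the values below are illustrative, not from mtcars):

```python
import numpy as np

# Toy predicted probabilities and ground-truth labels
prob = np.array([0.95, 0.80, 0.60, 0.55, 0.10])
y_true = np.array([1, 1, 1, 0, 0])

for threshold in (0.5, 0.7):
    y_pred = (prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
    fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
    fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(threshold, round(precision, 2), round(recall, 2))
```

Raising the threshold from 0.5 to 0.7 removes the one false positive but drops a true positive: precision rises from 0.75 to 1.0 while recall falls from 1.0 to about 0.67. Which threshold is right depends on the relative cost of false alarms versus missed positives in your application.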