# Logistic Regression in Python with Statsmodels

Many practical modeling tasks have binary outcomes: a customer converts or doesn't, a loan defaults or is repaid, an email is spam or not. Logistic regression is built for these situations. Instead of fitting a line to the raw 0/1 values (which can produce predictions outside the valid 0–1 range), it models the *log-odds* of the outcome and then maps that through a logistic curve to get probabilities between 0 and 1. The result is a model that's both interpretable — each coefficient tells you how a predictor shifts the log-odds — and actionable: you choose a probability threshold to convert predictions into decisions. Statsmodels' `Logit` adds inference on top: p-values and confidence intervals for every coefficient, so you can judge which predictors actually matter.
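To make that mapping concrete, here is a minimal sketch (the `logistic` helper and the example values are ours, purely for illustration) of how the logistic curve turns any log-odds value into a probability strictly between 0 and 1:

```python
import numpy as np

def logistic(log_odds):
    # Squashes any real-valued log-odds into a probability in (0, 1)
    return 1 / (1 + np.exp(-log_odds))

for z in [-4.0, 0.0, 2.0]:
    print(f"log-odds {z:+.1f} -> probability {logistic(z):.3f}")
```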

## Logistic regression vs [linear regression](/tutorials/statsmodels-linear-regression)

Linear regression predicts unbounded numeric values. Logistic regression predicts probabilities between `0` and `1` using a logistic curve, then maps those probabilities to classes using a threshold. The coefficients have a different interpretation: each coefficient represents the change in log-odds per unit increase in the predictor, not a direct change in the outcome.
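A quick worked example (the numbers are hypothetical, chosen only to illustrate the interpretation): if a logistic coefficient is 0.7, a one-unit increase in that predictor adds 0.7 to the log-odds, which multiplies the odds by exp(0.7), about 2.01:

```python
import numpy as np

beta = 0.7            # hypothetical log-odds coefficient
base_log_odds = -1.0  # hypothetical log-odds before the increase

odds_before = np.exp(base_log_odds)
odds_after = np.exp(base_log_odds + beta)  # one-unit increase adds beta
print(odds_after / odds_before)  # equals np.exp(beta), about 2.01
```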

## Preparing binary target data

The `am` column in `mtcars` is 1 for manual transmission and 0 for automatic — a natural binary target. We'll use `mpg`, `hp`, and `wt` as predictors.

```python
import requests
import pandas as pd

url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("mtcars.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("mtcars.csv")
y = df["am"]
X = df[["mpg", "hp", "wt"]]

print(X.head())
print(y.head())
```

```
    mpg   hp     wt
0  21.0  110  2.620
1  21.0  110  2.875
2  22.8   93  2.320
3  21.4  110  3.215
4  18.7  175  3.440
0    1
1    1
2    1
3    0
4    0
Name: am, dtype: int64
```
- Saving to `mtcars.csv` once lets all later blocks load from disk cleanly.
- `y = df["am"]` is already coded as 0/1, which is exactly what `Logit` expects — no encoding needed here.
- Printing both `X` and `y` confirms the shapes and confirms the target has the expected binary values before you invest time fitting the model.
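If your target were stored as labels rather than numbers (say a hypothetical `"manual"`/`"automatic"` string column), one extra line maps it into the 0/1 form `Logit` expects:

```python
import pandas as pd

# Hypothetical string-coded target, for illustration only
raw = pd.Series(["manual", "automatic", "manual"])
y = raw.map({"manual": 1, "automatic": 0}).astype(int)
print(y.tolist())  # [1, 0, 1]
```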

## Adding intercept

When you build the design matrix yourself, statsmodels does not include an intercept automatically, so you must add it explicitly before fitting. Without it, the model pins the log-odds at zero (probability 0.5) whenever all predictors are zero, which is rarely what you want.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
print(X.head())
```

```
   const   mpg   hp     wt
0    1.0  21.0  110  2.620
1    1.0  21.0  110  2.875
2    1.0  22.8   93  2.320
3    1.0  21.4  110  3.215
4    1.0  18.7  175  3.440
```
- `sm.add_constant(...)` adds a `const` column of 1s — the corresponding coefficient is the baseline log-odds when all other predictors are zero.
- The same `add_constant` call works identically for `OLS` and `Logit` — same pattern, same requirement. (The formula-interface sketch below sidesteps the step entirely.)
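If you would rather not manage the constant yourself, the formula interface adds an intercept automatically. A minimal sketch (see the formula API tutorial linked in the conclusion for details):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mtcars.csv")
# The formula interface adds the intercept term for you
logit_model = smf.logit("am ~ mpg + hp + wt", data=df).fit(disp=False)
print(logit_model.params)
```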

## Fitting the Logit model

Fitting `Logit` is syntactically identical to fitting `OLS` — pass the target and predictor matrix, call `.fit()`.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]

logit_model = sm.Logit(y, X).fit(disp=False)
print(logit_model.summary())
```

```
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                     am   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Fri, 10 Apr 2026   Pseudo R-squ.:                  0.7972
Time:                        12:30:01   Log-Likelihood:                -4.3831
converged:                       True   LL-Null:                       -21.615
Covariance Type:            nonrobust   LLR p-value:                 1.581e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -15.7214     40.003     -0.393      0.694     -94.125      62.683
mpg            1.2293      1.581      0.778      0.437      -1.870       4.328
hp             0.0839      0.082      1.020      0.308      -0.077       0.245
wt            -6.9549      3.353     -2.074      0.038     -13.527      -0.383
==============================================================================

Possibly complete quasi-separation: A fraction 0.25 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
```
- `disp=False` suppresses the optimizer's convergence output — the summary will still print.
- The summary uses pseudo R-squared (McFadden's R²) rather than standard R-squared — values between 0.2 and 0.4 are generally considered a good fit for logistic regression. The sketch after this list shows how it is computed from the log-likelihoods.
- The `Log-Likelihood` and `LLR p-value` in the header test whether the model as a whole fits better than an intercept-only null model that assigns the same probability to every observation.
- The quasi-separation warning means the model predicts a quarter of the observations almost perfectly. With only 32 rows this can inflate the standard errors (note how wide they are in the table above), so treat the individual p-values with some caution.
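McFadden's pseudo R-squared is just a ratio of log-likelihoods, so you can reproduce the summary's 0.7972 yourself, which makes clear what the statistic measures:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
logit_model = sm.Logit(df["am"], X).fit(disp=False)

# McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)
manual = 1 - logit_model.llf / logit_model.llnull
print(manual, logit_model.prsquared)  # both ≈ 0.7972
```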

## Interpreting coefficients and odds ratios

Logit coefficients are in log-odds space, which is hard to interpret directly. Exponentiating them converts to odds ratios, which are easier to communicate: an odds ratio of 2 means the odds of the outcome double for each one-unit increase in the predictor.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)

coef_df = pd.DataFrame(
    {
        "coef": logit_model.params,
        "odds_ratio": np.exp(logit_model.params),
        "p_value": logit_model.pvalues,
    }
)
print(coef_df)
```

```
            coef    odds_ratio   p_value
const -15.721371  1.486947e-07  0.694315
mpg     1.229302  3.418843e+00  0.436861
hp      0.083893  1.087513e+00  0.307900
wt     -6.954924  9.539266e-04  0.038056
```
- `np.exp(logit_model.params)` converts each log-odds coefficient to an odds ratio — an odds ratio > 1 means the predictor increases the odds of the outcome; < 1 means it decreases them.
- A p-value below 0.05 is the conventional cutoff for statistical significance; here only `wt` clears it, so weight is the one predictor with a clearly measurable association with transmission type in this small sample.
- The intercept's odds ratio usually isn't meaningful to interpret directly — focus on the predictor coefficients. (The sketch after this list adds confidence intervals on the odds-ratio scale.)
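With only 32 observations, point estimates alone can mislead; exponentiating the coefficient confidence intervals gives interval estimates on the odds-ratio scale:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
logit_model = sm.Logit(df["am"], X).fit(disp=False)

# Exponentiate the 95% confidence intervals onto the odds-ratio scale
or_ci = np.exp(logit_model.conf_int())
or_ci.columns = ["2.5%", "97.5%"]
print(or_ci)
```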

## Predicting probabilities and classification thresholds

After fitting, use `.predict()` to get probabilities for new observations. Converting probabilities to class labels requires choosing a threshold — 0.5 is the default, but you can adjust it to favor precision or recall depending on the cost of each type of error.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)

new_cars = pd.DataFrame(
    {
        "mpg": [18.0, 28.0],
        "hp": [150, 95],
        "wt": [3.4, 2.1],
    }
)
new_cars = sm.add_constant(new_cars, has_constant="add")

prob = logit_model.predict(new_cars)
class_at_05 = (prob >= 0.5).astype(int)
class_at_07 = (prob >= 0.7).astype(int)

print("Probabilities:", prob.values)
print("Classes at threshold 0.5:", class_at_05.values)
print("Classes at threshold 0.7:", class_at_07.values)
```

```
Probabilities: [0.0094088  0.99999423]
Classes at threshold 0.5: [0 1]
Classes at threshold 0.7: [0 1]
```
- `has_constant="add"` forces the intercept column to be added to the new DataFrame — required when new data doesn't already have a `const` column.
- `(prob >= 0.5).astype(int)` converts probabilities to 0/1 labels — observations with probability ≥ 0.5 are classified as 1 (manual transmission).
- Using a higher threshold (0.7) makes the model more conservative about predicting class 1. Here both probabilities are so extreme that the labels don't change, but with borderline probabilities some predicted 1s would flip to 0. This tradeoff matters in practice: a fraud model might use a very low threshold to catch more fraud, at the cost of more false alarms. The in-sample confusion table sketched after this list is one way to inspect the tradeoff.
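To see the threshold's effect in aggregate, you can cross-tabulate actual against predicted classes on the training data (in-sample, so this is a sanity check rather than an honest performance estimate):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mtcars.csv")
X = sm.add_constant(df[["mpg", "hp", "wt"]])
y = df["am"]
logit_model = sm.Logit(y, X).fit(disp=False)

# In-sample confusion table at the default 0.5 threshold
pred = (logit_model.predict(X) >= 0.5).astype(int)
print(pd.crosstab(y, pred, rownames=["actual"], colnames=["predicted"]))
```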

## Conclusion

Statsmodels `Logit` is ideal when you need to both classify and *explain* — the odds ratios and p-values tell you which predictors matter and by how much. For pure classification without interpretation, scikit-learn's `LogisticRegression` integrates better into ML pipelines.

For more complex model specifications, see the [Statsmodels Formula API](/tutorials/statsmodels-formula-api-regression). To verify that the model is well-calibrated and assumptions hold, see [regression diagnostics](/tutorials/statsmodels-regression-diagnostics).