Tutorials

SARIMAX Time Series Forecasting with Statsmodels

Forecasting is useful when you need to plan inventory, staffing, or budgets ahead of demand changes. SARIMA models capture the four key features of a time series in one framework: trend (a long-run upward or downward drift), autoregression (today's value depends on recent past values), moving-average errors (today's value depends on recent forecast errors), and seasonality (a repeating pattern at a fixed period like 12 months or 7 days). The `SARIMAX` class in statsmodels handles all of this — "S" adds seasonal terms on top of the base [ARIMA model](/tutorials/statsmodels-arima-models), and "X" allows optional exogenous (external) predictors. For monthly data like airline passenger counts, seasonality is the dominant feature and SARIMAX captures it directly.

## Preparing time series data

The classic airline passengers dataset has monthly counts from 1949 to 1960 — a clear upward trend with strong annual seasonality. Setting the frequency explicitly tells statsmodels how to interpret forecast steps.

import requests
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("airline-passengers.csv", "w", encoding="utf-8") as f:
    f.write(response.text)

df = pd.read_csv("airline-passengers.csv")
df["Month"] = pd.to_datetime(df["Month"])
df = df.set_index("Month")
series = df["Passengers"].asfreq("MS")

print(series.head())
print(series.index.freq)
Month
1949-01-01    112
1949-02-01    118
1949-03-01    132
1949-04-01    129
1949-05-01    121
Freq: MS, Name: Passengers, dtype: int64
<MonthBegin>
- `pd.to_datetime(df["Month"])` converts the string dates to proper datetime objects so they can be used as a time index.
- `.asfreq("MS")` sets the frequency to "Month Start" — without an explicit frequency, statsmodels can't determine the spacing between observations and may warn or fail during forecasting.
- Saving locally once means all later blocks avoid repeated network calls.

## Defining SARIMAX parameters

A SARIMAX model is defined by two tuples: the non-seasonal order `(p, d, q)` and the seasonal order `(P, D, Q, s)`. Understanding what each component does helps you choose sensible values.

import pandas as pd

df = pd.read_csv("airline-passengers.csv")
df["Month"] = pd.to_datetime(df["Month"])
df = df.set_index("Month")
series = df["Passengers"].asfreq("MS")

order = (1, 1, 1)
seasonal_order = (1, 1, 1, 12)

print("order:", order)
print("seasonal_order:", seasonal_order)
order: (1, 1, 1)
seasonal_order: (1, 1, 1, 12)
- `order=(p, d, q)`: `p` is the number of autoregressive lags (how many past values to use), `d` is the number of differences to apply (1 removes a linear trend), `q` is the number of moving-average lags.
- `seasonal_order=(P, D, Q, s)`: the same three parameters applied at the seasonal level, with `s=12` meaning the pattern repeats every 12 months.
- `(1, 1, 1, 12)` is a common starting point for monthly seasonal data — one seasonal difference removes the growing seasonal amplitude, and seasonal AR/MA terms capture the residual pattern.

## Fitting the model

Fitting SARIMAX requires passing both order tuples and a couple of stability flags that prevent numerical issues with certain parameter combinations.

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("airline-passengers.csv")
df["Month"] = pd.to_datetime(df["Month"])
df = df.set_index("Month")
series = df["Passengers"].asfreq("MS")

model = SARIMAX(
    series,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit(disp=False)
print(results.summary())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                         Passengers   No. Observations:                  144
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 12)   Log Likelihood                -456.103
Date:                            Fri, 10 Apr 2026   AIC                            922.205
Time:                                    12:30:08   BIC                            936.016
Sample:                                01-01-1949   HQIC                           927.812
                                     - 12-01-1960                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.2298      0.401     -0.573      0.567      -1.016       0.557
ma.L1         -0.0987      0.374     -0.264      0.792      -0.832       0.634
ar.S.L12      -0.5460      0.299     -1.825      0.068      -1.133       0.041
ma.S.L12       0.3959      0.352      1.125      0.261      -0.294       1.086
sigma2       140.2945     17.997      7.795      0.000     105.020     175.569
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 5.42
Prob(Q):                              0.95   Prob(JB):                         0.07
Heteroskedasticity (H):               2.51   Skew:                             0.12
Prob(H) (two-sided):                  0.01   Kurtosis:                         4.03
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
- `enforce_stationarity=False` and `enforce_invertibility=False` allow the optimizer to explore the full parameter space without numerical constraints — useful when starting parameters are uncertain.
- `disp=False` suppresses the optimizer's iteration output, leaving only the final summary.
- The summary shows AIC and BIC (lower is better for model comparison), log-likelihood, and coefficient estimates with p-values for each AR/MA term.

## Forecasting and visualization

A train-test split lets you evaluate forecast quality on held-out data before using the model for real future forecasts.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("airline-passengers.csv")
df["Month"] = pd.to_datetime(df["Month"])
df = df.set_index("Month")
series = df["Passengers"].asfreq("MS")

train = series.iloc[:-12]
test = series.iloc[-12:]

model = SARIMAX(
    train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit(disp=False)

forecast = results.get_forecast(steps=12)
forecast_df = forecast.summary_frame(alpha=0.05)
forecast_df.index = test.index

plt.figure(figsize=(10, 5))
plt.plot(train.index, train.values, label="Train")
plt.plot(test.index, test.values, label="Actual", color="black")
plt.plot(forecast_df.index, forecast_df["mean"], label="Forecast", color="blue")
plt.fill_between(
    forecast_df.index,
    forecast_df["mean_ci_lower"],
    forecast_df["mean_ci_upper"],
    color="blue",
    alpha=0.2,
    label="95% CI",
)
plt.title("SARIMAX Forecast vs Actual")
plt.xlabel("Month")
plt.ylabel("Passengers")
plt.legend()
plt.tight_layout()
plt.show()

print(forecast_df[["mean", "mean_ci_lower", "mean_ci_upper"]])
Passengers        mean  mean_ci_lower  mean_ci_upper
Month                                               
1960-01-01  423.220775     402.813162     443.628389
1960-02-01  406.433566     380.708655     432.158478
1960-03-01  467.547433     436.021002     499.073865
1960-04-01  457.478940     421.736394     493.221486
1960-05-01  480.937601     441.101984     520.773218
1960-06-01  534.599304     491.215120     577.983488
1960-07-01  609.414970     562.670581     656.159359
1960-08-01  621.009678     571.171892     670.847464
1960-09-01  523.363515     470.592559     576.134470
1960-10-01  468.643694     413.104946     524.182443
1960-11-01  423.339480     365.158953     481.520006
1960-12-01  465.073605     404.369009     525.778201
- `series.iloc[:-12]` and `series.iloc[-12:]` split the last 12 months as a holdout — fitting on `train` only means the model never sees the test period.
- `get_forecast(steps=12)` forecasts 12 steps ahead; `.summary_frame(alpha=0.05)` returns the point forecast and 95% confidence interval bounds in a DataFrame.
- `fill_between` shades the confidence interval — widening bands toward the end of the forecast period is expected and reflects compounding uncertainty.
- Visually comparing the blue forecast line to the black actual line shows whether the model captures the seasonal pattern and level correctly.

### Conclusion

SARIMAX is the standard tool for monthly or quarterly data with both trend and seasonality. Fitting on a training set and comparing to a holdout is the most reliable way to evaluate whether the chosen order actually generalizes.

For non-seasonal or daily time series, see [ARIMA models with Statsmodels](/tutorials/statsmodels-arima-models). For simpler linear relationships over time, [linear regression with Statsmodels](/tutorials/statsmodels-linear-regression) with a time variable as a predictor can also work.