[Linear regression](/tutorials/linear-regression) draws the best straight line through your data, but many real-world relationships are curved: plant growth accelerates then plateaus, temperature rises and falls across a day, drag force grows with the square of speed. Polynomial regression handles these cases by adding powers of your input variable — x², x³, and so on — as extra features. The model is still a linear equation underneath (linear in the *coefficients*), so the same fitting machinery applies. The only new decision is choosing the polynomial degree: too low and you underfit (the model misses the curve), too high and you overfit (the model chases noise).

### Generating curved data

We'll simulate a simple quadratic relationship — fuel consumption as a function of vehicle speed — to have a concrete, interpretable example throughout.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
# True relationship: consumption = 0.003*speed² - 0.3*speed + 12 + noise
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
print(f"Speed range: {x.min():.0f}–{x.max():.0f} km/h")
print(f"Consumption range: {y.min():.2f}–{y.max():.2f} L/100km")
```

- `np.linspace(10, 130, 80)` creates 80 evenly spaced speed values between 10 and 130 km/h.
- The true relationship is quadratic: fuel consumption drops as you accelerate out of low gears, reaches a minimum, then rises again at high speeds due to air resistance.
- `rng.normal(0, 0.5, len(x))` adds small Gaussian noise to make the data look realistic.

### Fitting a polynomial with NumPy

`np.polyfit` finds the polynomial coefficients that best fit the data in the least-squares sense. `np.poly1d` wraps those coefficients into a callable function.
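Because the model is linear in the coefficients, the same fit can be reproduced by hand with an ordinary least-squares solve on a design matrix. The sketch below is a minimal illustration of what `np.polyfit` does internally (the column ordering matches `polyfit`'s high-to-low convention):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))

# Design matrix with columns [x², x, 1], matching polyfit's high-to-low ordering
A = np.column_stack([x**2, x, np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print("Least-squares coefficients:", coeffs.round(5))
```

The result should agree with `np.polyfit(x, y, deg=2)` to floating-point precision. In practice you would call `polyfit` directly, but building the design matrix yourself makes the "linear in the coefficients" point concrete.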
```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))

coeffs = np.polyfit(x, y, deg=2)
poly = np.poly1d(coeffs)
print("Coefficients (high to low degree):", coeffs.round(5))
print(f"Fitted: {coeffs[0]:.4f}x² + {coeffs[1]:.4f}x + {coeffs[2]:.4f}")
```

- `np.polyfit(x, y, deg=2)` fits a degree-2 polynomial and returns three coefficients `[a, b, c]` for ax² + bx + c, ordered from highest to lowest degree.
- `np.poly1d(coeffs)` creates a polynomial object — calling `poly(value)` evaluates the polynomial at that point.
- The recovered coefficients should be close to the true values (0.003, −0.3, 12) with small differences due to the noise.

### Visualizing the polynomial fit

Plotting the scatter data alongside the fitted curve shows immediately how well the model captures the underlying shape.
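Besides being plotted, the fitted `np.poly1d` object can be interrogated directly: it supports calculus helpers such as `.deriv()`, which lets you locate the most economical speed analytically. A short sketch, re-creating the fit so the snippet stands alone:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
poly = np.poly1d(np.polyfit(x, y, deg=2))

# The parabola's minimum sits where the first derivative crosses zero
best_speed = poly.deriv().roots[0]
print(f"Most economical speed: {best_speed:.1f} km/h")
print(f"Consumption there: {poly(best_speed):.2f} L/100km")
```

For the true coefficients the minimum is at 0.3 / (2 × 0.003) = 50 km/h, so the estimate should land nearby.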
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))

coeffs = np.polyfit(x, y, deg=2)
poly = np.poly1d(coeffs)
x_smooth = np.linspace(x.min(), x.max(), 300)
y_smooth = poly(x_smooth)

plt.figure(figsize=(8, 5))
plt.scatter(x, y, alpha=0.6, label="Observed data", color="steelblue")
plt.plot(x_smooth, y_smooth, color="tomato", linewidth=2, label="Degree-2 fit")
plt.xlabel("Speed (km/h)")
plt.ylabel("Fuel consumption (L/100km)")
plt.title("Polynomial Regression Fit")
plt.legend()
plt.tight_layout()
plt.show()
```

- `x_smooth = np.linspace(x.min(), x.max(), 300)` creates 300 closely spaced points so the plotted curve looks smooth rather than jagged.
- `poly(x_smooth)` evaluates the fitted polynomial at all 300 points — this is the curve shown in red.
- Using a denser grid for plotting than for fitting is standard practice: you fit on your actual data, but draw the line on a finer resolution.

### Using scikit-learn's Pipeline

For larger projects, scikit-learn's `PolynomialFeatures` transformer plus `LinearRegression` keeps polynomial regression inside the standard `.fit()` / `.predict()` workflow. Wrapping them in a `Pipeline` ensures the feature expansion is always applied consistently.
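To see exactly what `PolynomialFeatures` feeds the linear model, it helps to run the transformer on a tiny input first. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy samples, one feature each
X_tiny = np.array([[2.0], [3.0]])
expansion = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_tiny)
print(expansion)  # each row becomes [x, x²]
```

Each input row `[x]` is expanded to `[x, x²]`, so the downstream `LinearRegression` simply sees two ordinary features.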
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
X = x.reshape(-1, 1)

model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("linear", LinearRegression()),
])
model.fit(X, y)
y_pred = model.predict(X)
print(f"R²: {r2_score(y, y_pred):.4f}")
print("Coefficients:", model.named_steps["linear"].coef_.round(5))
print("Intercept:", round(model.named_steps["linear"].intercept_, 4))
```

- `x.reshape(-1, 1)` converts the 1D array to a (80, 1) column vector — scikit-learn expects a 2D feature matrix.
- `PolynomialFeatures(degree=2, include_bias=False)` expands `[x]` into `[x, x²]`; the linear model then fits coefficients for each expanded feature.
- `include_bias=False` omits the column of ones because `LinearRegression` adds the intercept itself.
- An R² near 1.0 means the model explains almost all the variance in the data.

### Comparing degrees: underfitting and overfitting

Choosing the wrong degree is the most common mistake in polynomial regression. A degree too low misses the curve; a degree too high memorises the noise.
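One way to see why the degree cannot be chosen from training fit alone: training error only ever improves as the degree grows, because each higher-degree feature set nests the lower ones. A small sketch reusing the simulated data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
X = x.reshape(-1, 1)

train_mse = {}
for deg in [1, 2, 5, 10]:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
        ("linear", LinearRegression()),
    ])
    model.fit(X, y)
    # Error measured on the same data the model was fitted to
    train_mse[deg] = mean_squared_error(y, model.predict(X))
    print(f"Degree {deg:2d} — training MSE: {train_mse[deg]:.4f}")
```

The numbers shrink (or at worst plateau) as the degree rises, so training error alone says nothing about which degree generalizes; that question needs data the model has not seen.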
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
X = x.reshape(-1, 1)
x_smooth = np.linspace(x.min(), x.max(), 300).reshape(-1, 1)

degrees = [1, 2, 10]
labels = ["Degree 1 (underfit)", "Degree 2 (good fit)", "Degree 10 (overfit)"]
colors = ["royalblue", "tomato", "green"]

plt.figure(figsize=(9, 5))
plt.scatter(x, y, alpha=0.5, color="gray", zorder=3, label="Data")
for deg, label, color in zip(degrees, labels, colors):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
        ("linear", LinearRegression()),
    ])
    model.fit(X, y)
    y_smooth = model.predict(x_smooth)
    plt.plot(x_smooth, y_smooth, label=label, color=color, linewidth=2)
plt.xlabel("Speed (km/h)")
plt.ylabel("Fuel consumption (L/100km)")
plt.title("Effect of Polynomial Degree")
plt.legend()
plt.ylim(y.min() - 1, y.max() + 2)
plt.tight_layout()
plt.show()
```

- Degree 1 draws a straight line that systematically misses the curve — this is underfitting.
- Degree 2 follows the true quadratic shape closely — this is the right choice because we know the data is quadratic.
- Degree 10 wiggles through the noise and would generalize poorly to new data — this is overfitting.
- `plt.ylim(y.min() - 1, y.max() + 2)` clips the y-axis to the data range so the degree-10 curve's extreme edge wiggles don't distort the scale.

### Evaluating on a held-out test set

The real test of the right degree is how well it predicts data it has never seen. Splitting the data before fitting reveals the generalization gap.
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
for deg in [1, 2, 5, 10]:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
        ("linear", LinearRegression()),
    ])
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    print(f"Degree {deg:2d} — Train R²: {train_r2:.4f}  Test R²: {test_r2:.4f}")
```

- As degree increases, train R² keeps climbing because a higher-degree polynomial has more freedom to fit every point exactly.
- Test R² peaks around degree 2 and then drops — the model is memorising the training noise instead of learning the true shape.
- Choosing the degree with the highest *test* R² is the standard way to avoid overfitting when you don't know the true degree in advance.

### Conclusion

Polynomial regression extends linear regression to curved relationships by adding powers of the input as extra features — the fitting is still linear in the coefficients, which keeps it fast and interpretable. The key practical skill is picking the polynomial degree using a held-out test set or cross-validation, not training accuracy alone. For a statistically richer output with p-values and confidence intervals, see [linear regression with Statsmodels](/tutorials/statsmodels-linear-regression). To add multiple predictors alongside the polynomial terms, see [multiple linear regression](/tutorials/multiple-linear-regression-sklearn).
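The cross-validation route mentioned in the conclusion can be sketched with scikit-learn's `cross_val_score`. The folds are shuffled because `x` was generated in sorted order, and the exact scores depend on the random seeds:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
x = np.linspace(10, 130, 80)
y = 0.003 * x**2 - 0.3 * x + 12 + rng.normal(0, 0.5, len(x))
X = x.reshape(-1, 1)

# Shuffle folds: x is sorted, so unshuffled folds would each cover one speed band
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = {}
for deg in [1, 2, 3, 5, 10]:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
        ("linear", LinearRegression()),
    ])
    cv_r2[deg] = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    print(f"Degree {deg:2d} — mean CV R²: {cv_r2[deg]:.4f}")
```

Averaging the generalization estimate over five different splits makes the choice of degree less sensitive to one lucky or unlucky test set than a single train/test split.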