When a single predictor isn't enough to explain your outcome, multiple linear regression lets you include several features at once. For example, predicting a house price from its size alone ignores the number of bedrooms, age, and location. Multiple linear regression models all of these together: each coefficient tells you how much the outcome changes when that one feature increases by one unit while every other feature stays fixed. scikit-learn's `LinearRegression` handles the math and integrates cleanly with the rest of its ML ecosystem — the same `.fit()` / `.predict()` interface you'd use for any other model.

### Creating a dataset

We'll simulate house price data with three predictors: size in square meters, number of bedrooms, and building age. Adding noise makes it realistic.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)

# True relationship: price = 50*size + 8000*bedrooms - 300*age + 20000 + noise
price = (
    50 * size
    + 8000 * bedrooms
    - 300 * age
    + 20000
    + rng.normal(0, 3000, n)
)

print(f"Size range: {size.min():.0f}–{size.max():.0f} m²")
print(f"Bedrooms range: {bedrooms.min():.0f}–{bedrooms.max():.0f}")
print(f"Age range: {age.min():.1f}–{age.max():.1f} years")
print(f"Price range: €{price.min():.0f}–€{price.max():.0f}")
```

- `rng = np.random.default_rng(42)` creates a reproducible random number generator — the seed `42` ensures the same numbers every time.
- The true coefficients (50, 8000, −300) are the values we expect the model to recover.
- `rng.normal(0, 3000, n)` adds Gaussian noise that makes the relationship realistic — real data is never perfectly linear.

### Building the feature matrix

scikit-learn expects features as a 2D array where each column is a variable and each row is an observation.
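The 2D requirement holds even with a single predictor, where the feature must become a one-column matrix. A minimal sketch of the shape convention (the small arrays here are illustrative, not the tutorial's data):

```python
import numpy as np

size = np.array([50.0, 80.0, 120.0])   # 1D array: shape (3,)
bedrooms = np.array([1.0, 2.0, 3.0])

# A single feature must be reshaped into one column: shape (3, 1)
X_single = size.reshape(-1, 1)

# Several features are stacked side by side as columns: shape (3, 2)
X_multi = np.column_stack([size, bedrooms])

print(X_single.shape)  # (3, 1)
print(X_multi.shape)   # (3, 2)
```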
```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)

X = np.column_stack([size, bedrooms, age])
y = price

print("X shape:", X.shape)
print("y shape:", y.shape)
print("First row:", X[0])
```

- `np.column_stack(...)` takes three 1D arrays and stacks them as columns, producing a (200, 3) matrix.
- `y` is a 1D array of 200 prices — one target value per observation.
- There's no need to add an intercept column manually; `LinearRegression` adds it automatically via its `fit_intercept=True` default.

### Fitting the model

With the feature matrix ready, fitting the model is a single call.
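Under the hood, that call solves an ordinary least squares problem: find the coefficients minimizing the sum of squared residuals. A minimal sketch of the same solution via `np.linalg.lstsq`, on a tiny noise-free dataset (the arrays below are illustrative, not the tutorial's data):

```python
import numpy as np

# Tiny noise-free dataset: y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

# Append a column of ones so lstsq also estimates the intercept
X_aug = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(coef)  # ≈ [2. 3. 5.]
```

With noise-free data the least-squares solution recovers the true coefficients and intercept exactly; `LinearRegression` does the same work and additionally exposes them as `coef_` and `intercept_`.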
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", round(model.intercept_, 2))
print("Coefficients:", dict(zip(["size", "bedrooms", "age"], model.coef_.round(2))))
```

- `train_test_split` holds out 20% of the data (40 rows) as a test set so we can evaluate on examples the model has never seen.
- `model.fit(X_train, y_train)` finds the coefficients that minimize the sum of squared residuals on the training set.
- The printed coefficients should be close to the true values (50, 8000, −300) — small differences come from the added noise.

### Evaluating model performance

Three metrics give a complete picture: R² measures how much variance is explained, MAE reports average error in the original units, and RMSE penalizes large errors more heavily.
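All three follow directly from their definitions, so they can also be computed by hand as a cross-check. A minimal sketch (the small `y_true` / `y_hat` arrays are illustrative):

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 195.0, 260.0])

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares

r2 = 1 - ss_res / ss_tot                 # share of variance explained
mae = np.mean(np.abs(residuals))         # average absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # squares errors, so outliers weigh more

print(f"R²: {r2:.4f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```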
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: €{mean_absolute_error(y_test, y_pred):.0f}")
print(f"RMSE: €{root_mean_squared_error(y_test, y_pred):.0f}")
```

- `R²` is 1 for a perfect fit — a value near 1 means the model explains most of the variance, a value near 0 means it barely outperforms predicting the mean, and on test data it can even go negative when the model does worse than that baseline.
- `MAE` (Mean Absolute Error) is the average distance between predictions and true values in euros — easy to interpret directly.
- `RMSE` (Root Mean Squared Error) squares the errors before averaging, so it reacts more strongly to large outliers than MAE. Note that `root_mean_squared_error` requires scikit-learn 1.4 or newer; on older versions, use `mean_squared_error(..., squared=False)`.
- Evaluating on `X_test` / `y_test` — data the model never trained on — gives an honest estimate of real-world performance.

### Visualizing predictions vs actual values

A scatter plot of predicted vs actual values quickly shows whether the model is systematically wrong anywhere.
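Two quick numeric checks flag the same problems the plots reveal: residuals should average out near zero, and their magnitude should not grow with the prediction. A minimal sketch using synthetic, well-behaved residuals (the arrays here are illustrative stand-ins for real predictions and residuals):

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.uniform(50_000, 150_000, 100)   # illustrative predicted prices
residuals = rng.normal(0, 3000, 100)        # well-behaved residuals: no trend

# Check 1: residuals should average out near zero
print(f"Mean residual: {residuals.mean():.1f}")

# Check 2: |residual| should not correlate with the prediction
# (a strong positive correlation hints at heteroscedasticity)
corr = np.corrcoef(preds, np.abs(residuals))[0, 1]
print(f"Corr(prediction, |residual|): {corr:.3f}")
```

A mean residual far from zero suggests systematic over- or under-prediction; a sizable positive correlation suggests a funnel shape in the residual plot.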
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Predicted vs actual
axes[0].scatter(y_test, y_pred, alpha=0.7, color="steelblue")
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
axes[0].plot(lims, lims, "r--", label="Perfect prediction")
axes[0].set_xlabel("Actual price (€)")
axes[0].set_ylabel("Predicted price (€)")
axes[0].set_title("Predicted vs Actual")
axes[0].legend()

# Residuals
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals, alpha=0.7, color="coral")
axes[1].axhline(0, color="black", linewidth=1)
axes[1].set_xlabel("Predicted price (€)")
axes[1].set_ylabel("Residual (€)")
axes[1].set_title("Residual Plot")

plt.tight_layout()
plt.show()
```

- The **predicted vs actual** plot should show points clustered tightly around the red dashed line — deviations indicate where the model over- or under-predicts.
- The **residual plot** shows errors as a function of predicted value. Random scatter around zero is good; any curve or funnel shape suggests the model is missing a non-linear relationship or that variance is growing with price (heteroscedasticity).
- `alpha=0.7` makes overlapping points visible without obscuring the overall trend.

### Comparing individual feature impacts

Standardizing the coefficients shows which feature has the strongest influence, regardless of units.
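Scaling only the features rescales the coefficients in a predictable way: for ordinary least squares with an intercept, each standardized coefficient equals the raw coefficient times that feature's standard deviation. A minimal sketch of this equivalence (the data here is illustrative, not the tutorial's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [10.0, 1.0, 5.0]      # features on different scales
y = X @ [2.0, -4.0, 1.0] + 7 + rng.normal(0, 0.1, 100)

raw = LinearRegression().fit(X, y).coef_
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

# Standardized coefficient = raw coefficient × feature standard deviation
print(np.allclose(scaled, raw * X.std(axis=0)))  # True
```

This is why standardization helps comparison: it expresses every coefficient as "price change per one standard deviation of the feature", putting all features on the same footing.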
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression().fit(X_train_scaled, y_train)

feature_names = ["Size (m²)", "Bedrooms", "Age (years)"]
coefs = model_scaled.coef_

plt.figure(figsize=(7, 4))
colors = ["steelblue" if c > 0 else "tomato" for c in coefs]
plt.barh(feature_names, coefs, color=colors)
plt.axvline(0, color="black", linewidth=0.8)
plt.xlabel("Standardized coefficient")
plt.title("Feature Impact (standardized)")
plt.tight_layout()
plt.show()
```

- `StandardScaler` rescales each feature to have zero mean and unit variance, making coefficients from different features directly comparable.
- A larger absolute standardized coefficient means that feature has a stronger effect on price relative to its natural variation.
- Blue bars are positive effects (larger → higher price), red bars are negative (larger → lower price).
- Note: these standardized coefficients are for interpretation only; the scaler is fitted on training data only (`fit_transform`) to avoid data leakage from the test set.

### Conclusion

scikit-learn's `LinearRegression` makes it straightforward to fit, evaluate, and interpret multiple linear regression models. The key steps are building a 2D feature matrix, splitting data before fitting, and checking both accuracy metrics and residual plots to confirm the model is well-behaved.

For a statistically richer output with p-values and confidence intervals, see [Multiple Linear Regression with Statsmodels](/tutorials/statsmodels-multiple-linear-regression). To extend this further with regularization, explore [linear regression with scikit-learn's Ridge or Lasso](/tutorials/linear-regression).