When a single predictor isn't enough to explain your outcome, multiple linear regression lets you include several features at once. For example, predicting a house price from its size alone ignores the number of bedrooms, age, and location. Multiple linear regression models all of these together: each coefficient tells you how much the outcome changes when that one feature increases by one unit while every other feature stays fixed. scikit-learn's `LinearRegression` handles the math and integrates cleanly with the rest of its ML ecosystem — the same `.fit()` / `.predict()` interface you'd use for any other model.

### Creating a dataset

We'll simulate house price data with three predictors: size in square meters, number of bedrooms, and building age. Adding noise makes it realistic.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)

# True relationship: price = 50*size + 8000*bedrooms - 300*age + 20000 + noise
price = (
    50 * size
    + 8000 * bedrooms
    - 300 * age
    + 20000
    + rng.normal(0, 3000, n)
)

print(f"Size range: {size.min():.0f}–{size.max():.0f} m²")
print(f"Bedrooms range: {bedrooms.min():.0f}–{bedrooms.max():.0f}")
print(f"Age range: {age.min():.1f}–{age.max():.1f} years")
print(f"Price range: €{price.min():.0f}–€{price.max():.0f}")
```

- `rng = np.random.default_rng(42)` creates a reproducible random number generator — the seed `42` ensures the same numbers every time.
- The true coefficients (50, 8000, −300) are the values we expect the model to recover.
- `rng.normal(0, 3000, n)` adds Gaussian noise that makes the relationship realistic — real data is never perfectly linear.

### Building the feature matrix

scikit-learn expects features as a 2D array where each column is a variable and each row is an observation.
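The 2D requirement holds even with a single predictor, where the feature must become a one-column matrix. A minimal sketch of the shape convention (the small arrays here are illustrative, not the tutorial's data):

```python
import numpy as np

size = np.array([50.0, 80.0, 120.0])   # 1D array: shape (3,)
bedrooms = np.array([1.0, 2.0, 3.0])

# A single feature must be reshaped into one column: shape (3, 1)
X_single = size.reshape(-1, 1)

# Several features are stacked side by side as columns: shape (3, 2)
X_multi = np.column_stack([size, bedrooms])

print(X_single.shape)  # (3, 1)
print(X_multi.shape)   # (3, 2)
```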
```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)

X = np.column_stack([size, bedrooms, age])
y = price

print("X shape:", X.shape)
print("y shape:", y.shape)
print("First row:", X[0])
```

- `np.column_stack(...)` takes three 1D arrays and stacks them as columns, producing a (200, 3) matrix.
- `y` is a 1D array of 200 prices — one target value per observation.
- There's no need to add an intercept column manually; `LinearRegression` adds it automatically via its `fit_intercept=True` default.

### Fitting the model

With the feature matrix ready, fitting the model is a single call.
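Under the hood, that call solves an ordinary least squares problem: find the coefficients minimizing the sum of squared residuals. A minimal sketch of the same solution via `np.linalg.lstsq`, on a tiny noise-free dataset (the arrays below are illustrative, not the tutorial's data):

```python
import numpy as np

# Tiny noise-free dataset: y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

# Append a column of ones so lstsq also estimates the intercept
X_aug = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(coef)  # ≈ [2. 3. 5.]
```

With noise-free data the least-squares solution recovers the true coefficients and intercept exactly; `LinearRegression` does the same work and additionally exposes them as `coef_` and `intercept_`.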
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", round(model.intercept_, 2))
print("Coefficients:", dict(zip(["size", "bedrooms", "age"], model.coef_.round(2))))
```

- `train_test_split` holds out 20% of the data (40 rows) as a test set so we can evaluate on examples the model has never seen.
- `model.fit(X_train, y_train)` finds the coefficients that minimize the sum of squared residuals on the training set.
- The printed coefficients should be close to the true values (50, 8000, −300) — small differences come from the added noise.

### Evaluating model performance

Three metrics give a complete picture: R² measures how much variance is explained, MAE reports average error in the original units, and RMSE penalizes large errors more heavily.
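All three follow directly from their definitions, so they can also be computed by hand as a cross-check. A minimal sketch (the small `y_true` / `y_hat` arrays are illustrative):

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 195.0, 260.0])

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares

r2 = 1 - ss_res / ss_tot                 # share of variance explained
mae = np.mean(np.abs(residuals))         # average absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # squares errors, so outliers weigh more

print(f"R²: {r2:.4f}, MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```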
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: €{mean_absolute_error(y_test, y_pred):.0f}")
print(f"RMSE: €{root_mean_squared_error(y_test, y_pred):.0f}")
```

- `R²` is 1 for a perfect fit — a value near 1 means the model explains most of the variance, a value near 0 means it barely outperforms predicting the mean, and on test data it can even go negative when the model does worse than that baseline.
- `MAE` (Mean Absolute Error) is the average distance between predictions and true values in euros — easy to interpret directly.
- `RMSE` (Root Mean Squared Error) squares the errors before averaging, so it reacts more strongly to large outliers than MAE. Note that `root_mean_squared_error` requires scikit-learn 1.4 or newer; on older versions, use `mean_squared_error(..., squared=False)`.
- Evaluating on `X_test` / `y_test` — data the model never trained on — gives an honest estimate of real-world performance.

### Visualizing predictions vs actual values

A scatter plot of predicted vs actual values quickly shows whether the model is systematically wrong anywhere.
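Two quick numeric checks flag the same problems the plots reveal: residuals should average out near zero, and their magnitude should not grow with the prediction. A minimal sketch using synthetic, well-behaved residuals (the arrays here are illustrative stand-ins for real predictions and residuals):

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.uniform(50_000, 150_000, 100)   # illustrative predicted prices
residuals = rng.normal(0, 3000, 100)        # well-behaved residuals: no trend

# Check 1: residuals should average out near zero
print(f"Mean residual: {residuals.mean():.1f}")

# Check 2: |residual| should not correlate with the prediction
# (a strong positive correlation hints at heteroscedasticity)
corr = np.corrcoef(preds, np.abs(residuals))[0, 1]
print(f"Corr(prediction, |residual|): {corr:.3f}")
```

A mean residual far from zero suggests systematic over- or under-prediction; a sizable positive correlation suggests a funnel shape in the residual plot.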
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Predicted vs actual
axes[0].scatter(y_test, y_pred, alpha=0.7, color="steelblue")
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
axes[0].plot(lims, lims, "r--", label="Perfect prediction")
axes[0].set_xlabel("Actual price (€)")
axes[0].set_ylabel("Predicted price (€)")
axes[0].set_title("Predicted vs Actual")
axes[0].legend()

# Residuals
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals, alpha=0.7, color="coral")
axes[1].axhline(0, color="black", linewidth=1)
axes[1].set_xlabel("Predicted price (€)")
axes[1].set_ylabel("Residual (€)")
axes[1].set_title("Residual Plot")

plt.tight_layout()
plt.show()
```

- The **predicted vs actual** plot should show points clustered tightly around the red dashed line — deviations indicate where the model over- or under-predicts.
- The **residual plot** shows errors as a function of predicted value. Random scatter around zero is good; any curve or funnel shape suggests the model is missing a non-linear relationship or that variance is growing with price (heteroscedasticity).
- `alpha=0.7` makes overlapping points visible without obscuring the overall trend.

### Comparing individual feature impacts

Standardizing the coefficients shows which feature has the strongest influence, regardless of units.
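Scaling only the features rescales the coefficients in a predictable way: for ordinary least squares with an intercept, each standardized coefficient equals the raw coefficient times that feature's standard deviation. A minimal sketch of this equivalence (the data here is illustrative, not the tutorial's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [10.0, 1.0, 5.0]      # features on different scales
y = X @ [2.0, -4.0, 1.0] + 7 + rng.normal(0, 0.1, 100)

raw = LinearRegression().fit(X, y).coef_
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

# Standardized coefficient = raw coefficient × feature standard deviation
print(np.allclose(scaled, raw * X.std(axis=0)))  # True
```

This is why standardization helps comparison: it expresses every coefficient as "price change per one standard deviation of the feature", putting all features on the same footing.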
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
size = rng.uniform(40, 200, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 40, n)
price = (
    50 * size + 8000 * bedrooms - 300 * age + 20000
    + rng.normal(0, 3000, n)
)
X = np.column_stack([size, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression().fit(X_train_scaled, y_train)

feature_names = ["Size (m²)", "Bedrooms", "Age (years)"]
coefs = model_scaled.coef_

plt.figure(figsize=(7, 4))
colors = ["steelblue" if c > 0 else "tomato" for c in coefs]
plt.barh(feature_names, coefs, color=colors)
plt.axvline(0, color="black", linewidth=0.8)
plt.xlabel("Standardized coefficient")
plt.title("Feature Impact (standardized)")
plt.tight_layout()
plt.show()
```

- `StandardScaler` rescales each feature to have zero mean and unit variance, making coefficients from different features directly comparable.
- A larger absolute standardized coefficient means that feature has a stronger effect on price relative to its natural variation.
- Blue bars are positive effects (larger → higher price), red bars are negative (larger → lower price).
- Note: these standardized coefficients are for interpretation only; the scaler is fitted on training data only (`fit_transform`) to avoid data leakage from the test set.

### Conclusion

scikit-learn's `LinearRegression` makes it straightforward to fit, evaluate, and interpret multiple linear regression models. The key steps are building a 2D feature matrix, splitting data before fitting, and checking both accuracy metrics and residual plots to confirm the model is well-behaved.

For a statistically richer output with p-values and confidence intervals, see [Multiple Linear Regression with Statsmodels](/tutorials/statsmodels-multiple-linear-regression). To extend this further with regularization, explore [linear regression with scikit-learn's Ridge or Lasso](/tutorials/linear-regression).