# Linear Regression with Scikit-Learn

Linear regression is one of the most widely used tools in data analysis and machine learning. With a single feature, it fits a straight line through your data to estimate how a change in one variable (the feature) corresponds to a change in another (the target). The two key outputs are the **coefficient** (the slope: how much the target changes per unit of the feature) and the **intercept** (the predicted value when the feature is zero). Scikit-learn is a good choice when you want a prediction pipeline that integrates with other models and preprocessing steps. For statistical inference with p-values, confidence intervals, and diagnostic tests, see [statsmodels linear regression](/tutorials/statsmodels-linear-regression).
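As a rough illustration of what "integrates with preprocessing steps" can look like, the sketch below chains a scaler and a regressor with `make_pipeline`. The choice of `StandardScaler` and the synthetic data (true slope 3, true intercept 4) are assumptions made for this example, not requirements of linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): true slope 3, true intercept 4.
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Chain a preprocessing step and the regressor into a single estimator.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)

# The pipeline exposes the same fit/predict interface as a bare model.
X_new = np.array([[0.0], [2.0]])
pipeline_pred = pipeline.predict(X_new)

# For ordinary least squares with an intercept, standardizing the feature
# rescales the coefficient but leaves the predictions unchanged.
plain_pred = LinearRegression().fit(X, y).predict(X_new)
print(np.allclose(pipeline_pred, plain_pred))
```

Scaling makes no difference to plain least-squares predictions, so the scaler here only demonstrates the mechanics. It starts to matter once you swap in a regularized model.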

### Basic Linear Regression

To fit a linear regression model, you create a `LinearRegression` object, call `.fit()` with your training data, and then use `.predict()` to get predictions for new inputs.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

model = LinearRegression()
model.fit(X, y)

X_new = np.array([[0], [2]])
y_predict = model.predict(X_new)

plt.scatter(X, y)
plt.plot(X_new, y_predict, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.show()
```

- `y = 4 + 3 * X + noise` builds a dataset where the true slope is 3 and the intercept is 4 — after fitting, `model.coef_` and `model.intercept_` should be close to those values.
- `X` must be a 2D array (shape `(100, 1)`) because scikit-learn expects feature matrices, not flat vectors.
- `X_new = np.array([[0], [2]])` creates two new points to predict — the model returns the fitted line value at each.
- The red line connects the two predicted points, showing the regression line across the data range.
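To check the first two points above, a short sketch like the following prints the fitted parameters and reshapes a flat vector before predicting. The names `slope`, `intercept`, and `flat_values` are illustrative, and the indexing assumes `y` has shape `(100, 1)` as in the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

model = LinearRegression().fit(X, y)

# Because y is 2D with one column, coef_ has shape (1, 1)
# and intercept_ has shape (1,).
slope = model.coef_[0, 0]
intercept = model.intercept_[0]
print(f"Estimated slope: {slope:.3f} (true value 3)")
print(f"Estimated intercept: {intercept:.3f} (true value 4)")

# A flat vector of feature values must become a single column
# before it is passed to predict().
flat_values = np.array([0.0, 1.0, 2.0])
X_flat = flat_values.reshape(-1, 1)  # shape (3, 1)
print(model.predict(X_flat).ravel())
```

The estimates will not match 3 and 4 exactly because of the added noise. With more samples they tend to move closer to the true values.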

### Data Preparation

In practice, you evaluate a model on data it hasn't seen before. Splitting into train and test sets gives you an honest estimate of how well the model generalizes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

y_test_pred = model.predict(X_test)
print(f"Trained on {len(X_train)} samples, testing on {len(X_test)}")
```

```
Trained on 80 samples, testing on 20
```

- `train_test_split(..., test_size=0.2)` reserves 20% of the data for evaluation — the model never sees these rows during training.
- `random_state=0` makes the split reproducible so the same rows go to train and test every time.
- Fitting on `X_train` only (not the full dataset) is critical: evaluating on the same rows the model was trained on gives an overly optimistic estimate of its performance.
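The reproducibility claim can be checked directly. The sketch below, using the same synthetic data, splits twice with the same `random_state` and once with a different one. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Two splits with the same random_state select identical rows.
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)
same_seed_identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A different random_state shuffles the rows differently.
split_c = train_test_split(X, y, test_size=0.2, random_state=1)
different_seed_identical = np.array_equal(split_a[1], split_c[1])

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)
print(same_seed_identical, different_seed_identical)

# No feature value appears in both the training and the test set.
overlap = np.intersect1d(X_train, X_test)
print(len(overlap))
```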

### Model Evaluation

Two common metrics for regression are Mean Squared Error (MSE) and R-squared. Together they tell you how large the errors are and how well the model explains the variation in the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_test_pred)
r2 = r2_score(y_test, y_test_pred)
print(f"Mean Squared Error: {mse:.3f}")
print(f"R-squared: {r2:.3f}")
```

```
Mean Squared Error: 1.043
R-squared: 0.742
```

- **MSE** is the average squared difference between predictions and actual values — lower is better, and squaring penalizes large errors more than small ones.
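- **R-squared** is at most 1: a value of 1 means the model explains all variance in the target, and 0 means it does no better than always predicting the mean. On held-out data it can be negative when the model does worse than that baseline. What counts as a strong fit depends on the domain and the noise level, so treat fixed cutoffs such as 0.8 as a rule of thumb rather than a standard.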
- Both metrics are computed on the test set, not the training set, so they reflect real predictive performance.
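To make the two definitions concrete, the following sketch recomputes both metrics with plain NumPy and compares them with scikit-learn's functions. The constant `bad_pred` predictor is a made-up example, included only to show how R-squared behaves for predictions that are worse than the mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# MSE: the mean of the squared residuals.
residuals = y_test - y_test_pred
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_test, y_test_pred)))
print(np.isclose(r2_manual, r2_score(y_test, y_test_pred)))

# Hypothetical bad predictor: a constant 5 units above the test-set mean.
# Its squared error exceeds the total sum of squares, so R-squared is negative.
bad_pred = np.full_like(y_test, y_test.mean() + 5.0)
r2_bad = r2_score(y_test, bad_pred)
print(r2_bad < 0)
```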

### Visualizing Results

Plotting the regression line over both training and test data lets you visually confirm that the model fits well and isn't systematically off in one region.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)

X_line = np.array([[0], [2]])
y_line = model.predict(X_line)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].scatter(X_train, y_train, color='blue', alpha=0.6, label='Training data')
axes[0].plot(X_line, y_line, color='red', linewidth=2, label='Regression line')
axes[0].set_title('Training Data')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].legend()

axes[1].scatter(X_test, y_test, color='green', alpha=0.6, label='Test data')
axes[1].plot(X_line, y_line, color='red', linewidth=2, label='Regression line')
axes[1].set_title('Test Data')
axes[1].set_xlabel('X')
axes[1].set_ylabel('y')
axes[1].legend()

plt.tight_layout()
plt.show()
```

- Using `fig, axes = plt.subplots(1, 2)` puts training and test plots side by side so you can compare the fit visually without switching charts.
- The same regression line (fitted on training data only) is drawn on both plots — if it fits the test data well too, the model generalizes.
- Systematic deviation from the line at high or low X values would suggest a non-linear relationship that a straight-line model can't capture.
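A numeric complement to the visual check is to look at the residuals (actual minus predicted) in different regions of X. The sketch below uses three equal-width bands. The band edges are an arbitrary choice for illustration, not a standard diagnostic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Residual = actual minus predicted, flattened to 1D for easy indexing.
train_residuals = (y_train - model.predict(X_train)).ravel()

# With an intercept, least-squares residuals on the training data average to zero.
print(f"Mean training residual: {train_residuals.mean():.2e}")

# Average residual in three equal-width bands of X (arbitrary band edges).
edges = np.linspace(0, 2, 4)
band = np.digitize(X_train.ravel(), edges[1:-1])  # band index 0, 1, or 2
for i in range(3):
    in_band = band == i
    print(f"X from {edges[i]:.2f} to {edges[i + 1]:.2f}: "
          f"mean residual {train_residuals[in_band].mean():+.3f} "
          f"({in_band.sum()} points)")
```

If the band averages are all small relative to the noise and show no clear trend, a straight line is a reasonable description of the data. Averages that run positive at both ends and negative in the middle (or the reverse) would hint at curvature.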

### Conclusion

Scikit-learn's `LinearRegression` covers the full prediction workflow: fit, predict, and evaluate. Use MSE and R-squared together: MSE measures the size of the errors in squared units of the target, while R-squared tells you how much of the target's variation the model explains.
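One way to report the metrics together is a small helper like the one sketched here. `summarize_fit` is a hypothetical function written for this example, not part of scikit-learn. It also returns the square root of the MSE (RMSE), which is an addition to the metrics discussed above and is expressed in the same units as the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def summarize_fit(model, X, y):
    """Hypothetical helper: MSE, RMSE, and R-squared for a fitted model."""
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    return {"mse": mse, "rmse": float(np.sqrt(mse)), "r2": r2_score(y, y_pred)}


np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Report the same metrics on both splits so they can be compared side by side.
report = {
    "train": summarize_fit(model, X_train, y_train),
    "test": summarize_fit(model, X_test, y_test),
}
for name, scores in report.items():
    print(f"{name}: MSE={scores['mse']:.3f}  "
          f"RMSE={scores['rmse']:.3f}  R-squared={scores['r2']:.3f}")
```

A large gap between the training and test rows would suggest the model does not generalize well. For a one-feature linear model on this kind of data, the two rows are usually similar.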

For more control over statistical output (p-values, confidence intervals, diagnostic plots), see [statsmodels linear regression](/tutorials/statsmodels-linear-regression). To measure the strength of the relationship before fitting a model, see [correlation analysis with SciPy](/tutorials/correlation-analysis-with-scipy).