Tutorials

Linear Regression with SciPy

Linear regression answers the question: for every one-unit increase in X, how much does Y tend to change? The slope gives you that rate of change; the intercept gives you the predicted Y when X is zero; R-squared tells you what fraction of Y's variation is explained by X; and the p-value tests whether the slope is significantly different from zero. SciPy's `linregress` is the fastest way to get all of these for a simple two-variable relationship. For more complex models with multiple predictors, see [statsmodels linear regression](/tutorials/statsmodels-linear-regression).

### Basic Linear Regression

`linregress` fits a least-squares line — it finds the slope and intercept that minimize the sum of squared vertical distances from each point to the line.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)

print(f"Slope: {result.slope:.3f}")
print(f"Intercept: {result.intercept:.3f}")
print(f"R-squared: {result.rvalue**2:.3f}")
print(f"P-value: {result.pvalue:.6f}")
```

```
Slope: 3.219
Intercept: 3.658
R-squared: 0.961
P-value: 0.000000
```

- `result.slope` is the estimated change in Y per unit increase in X — here it should be close to the true value of 3.2.
- `result.rvalue**2` is R-squared: a value of 1.0 means X perfectly predicts Y; 0.0 means X explains nothing.
- `result.pvalue` tests the null hypothesis that the true slope is zero — a very small value means X and Y have a real linear relationship.
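For quick scripts, the result can also be unpacked positionally: `linregress` returns a named tuple whose first five fields are the slope, intercept, rvalue, pvalue, and the standard error of the slope. A minimal sketch on the same simulated data:

```python
import numpy as np
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

# Unpack the five tuple fields directly instead of using attribute access
slope, intercept, rvalue, pvalue, stderr = stats.linregress(x, y)

print(f"Slope: {slope:.3f}, R-squared: {rvalue**2:.3f}")
```

Attribute access (`result.slope`, etc.) is clearer in longer code, but tuple unpacking is convenient when you only need one or two of the values.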

### Visualizing the Fitted Line

Plotting the scatter of observations alongside the regression line shows whether the linear model is a reasonable description of the data or whether a curve would fit better.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)

plt.figure(figsize=(9, 5))
plt.scatter(x, y, alpha=0.6, label="Observed data")
plt.plot(x, result.slope * x + result.intercept, color="crimson", label="Fitted line")
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Simple Linear Regression")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
```

- `result.slope * x + result.intercept` computes predicted Y values across the full range of X — this is the regression line equation.
- If points fan out as X increases (heteroscedasticity), or curve systematically away from the line, the linear model may not be the best choice.
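A residual plot makes both of those problems easier to spot than the scatter alone: plot observed minus predicted values against X and look for fans or systematic curves. A sketch using the same simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)

# Residuals are observed minus predicted; for a good linear fit they
# should scatter randomly around zero with roughly constant spread
residuals = y - (result.slope * x + result.intercept)

plt.figure(figsize=(9, 4))
plt.scatter(x, residuals, alpha=0.6)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual")
plt.title("Residual Plot")
plt.grid(alpha=0.3)
plt.show()
```

A widening band of residuals suggests heteroscedasticity; a U-shape or arch suggests the relationship is curved rather than linear.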

### Making Predictions

Once you have the slope and intercept, you can predict the Y value for any new X — even values not in the original dataset.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)
x_new = 7.5
y_pred = result.slope * x_new + result.intercept

print(f"Predicted y at x = {x_new}: {y_pred:.2f}")
```

```
Predicted y at x = 7.5: 27.80
```

- Here `x_new = 7.5` lies inside the range of the training data (0 to 10), so this is interpolation. Predicting outside that range is called extrapolation, and predictions become less reliable the further you extrapolate.
- The predicted value is a point estimate. For a range that accounts for uncertainty, use a confidence interval for the slope (shown in the next section).
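Because the line equation is just arithmetic on NumPy values, a whole batch of new X values can be predicted in one step. A short sketch (the values in `x_new` are arbitrary examples):

```python
import numpy as np
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)

# Passing an array instead of a scalar predicts several points at once
x_new = np.array([2.0, 5.0, 7.5])
y_pred = result.slope * x_new + result.intercept

for xi, yi in zip(x_new, y_pred):
    print(f"x = {xi:.1f} -> predicted y = {yi:.2f}")
```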

### Confidence Interval for the Slope

The slope estimate will differ slightly each time you collect new data. A 95% confidence interval gives you a range of plausible values for the true slope based on this sample.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)
df = len(x) - 2
t_crit = stats.t.ppf(0.975, df)
ci = (
    result.slope - t_crit * result.stderr,
    result.slope + t_crit * result.stderr,
)

print(f"Slope estimate: {result.slope:.3f}")
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

```
Slope estimate: 3.219
95% CI for slope: (3.072, 3.366)
```

- `df = len(x) - 2` is the degrees of freedom for simple linear regression — two parameters (slope and intercept) are estimated from the data.
- `stats.t.ppf(0.975, df)` gives the t critical value for a two-sided 95% interval — `0.975` because 2.5% is in each tail.
- `result.stderr` is the standard error of the slope estimate; a narrower interval means the slope is estimated more precisely.
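The same recipe gives an interval for the intercept. In SciPy 1.6 and later, the result object exposes `intercept_stderr` as an attribute (note that it is not part of the five-element tuple you get from unpacking):

```python
import numpy as np
from scipy import stats

np.random.seed(42)

x = np.linspace(0, 10, 80)
y = 3.2 * x + 4 + np.random.normal(0, 2, size=len(x))

result = stats.linregress(x, y)
df = len(x) - 2
t_crit = stats.t.ppf(0.975, df)

# intercept_stderr is attribute-only (SciPy >= 1.6); the same
# t-critical-value construction applies as for the slope
ci_intercept = (
    result.intercept - t_crit * result.intercept_stderr,
    result.intercept + t_crit * result.intercept_stderr,
)

print(f"95% CI for intercept: ({ci_intercept[0]:.3f}, {ci_intercept[1]:.3f})")
```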

### Practical Example: Advertising Spend and Sales

This example fits a regression to simulated ad spend and sales data, then interprets the slope in business terms.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(8)

ad_spend = np.linspace(5, 50, 60)
sales = 1.8 * ad_spend + 20 + np.random.normal(0, 6, size=len(ad_spend))

result = stats.linregress(ad_spend, sales)

print(f"Slope: {result.slope:.3f}")
print(f"Intercept: {result.intercept:.3f}")
print(f"R-squared: {result.rvalue**2:.3f}")
print(f"P-value: {result.pvalue:.6f}")

plt.figure(figsize=(9, 5))
plt.scatter(ad_spend, sales, alpha=0.65)
plt.plot(ad_spend, result.slope * ad_spend + result.intercept, color="darkorange", linewidth=2)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.title("Advertising Spend vs Sales")
plt.grid(alpha=0.3)
plt.show()
```

```
Slope: 1.747
Intercept: 21.253
R-squared: 0.918
P-value: 0.000000
```

- The slope here estimates how many additional sales are associated with each extra unit of ad spend — the real-world interpretation is always specific to the units of X and Y.
- R-squared near 0.9 means advertising spend accounts for ~90% of the variation in sales in this simulated dataset — real marketing data typically shows much weaker relationships.
- The p-value near zero confirms the slope is significantly different from zero, meaning ad spend is a meaningful predictor here.
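To turn the fit into a concrete forecast, plug a proposed budget into the line equation. The budget value below is a hypothetical example, chosen inside the observed spend range so the prediction is interpolation rather than extrapolation:

```python
import numpy as np
from scipy import stats

np.random.seed(8)

ad_spend = np.linspace(5, 50, 60)
sales = 1.8 * ad_spend + 20 + np.random.normal(0, 6, size=len(ad_spend))

result = stats.linregress(ad_spend, sales)

# Hypothetical proposed budget, within the observed range of 5 to 50
budget = 35.0
expected_sales = result.slope * budget + result.intercept

print(f"Expected sales at spend {budget}: {expected_sales:.1f}")
```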

### Conclusion

SciPy's `linregress` gives you the essentials of simple linear regression in one call: slope, intercept, R-squared, and significance. It's best suited to two-variable relationships — if you have multiple predictors or need diagnostic tools, see [statsmodels linear regression](/tutorials/statsmodels-linear-regression).

To measure the strength of a linear relationship without fitting a model, see [correlation analysis with SciPy](/tutorials/correlation-analysis-with-scipy).