Linear Regression in Python

Linear regression is a fundamental tool in data science and machine learning for modeling the relationship between one or more independent variables (features) and a dependent variable (target). In this tutorial, we'll implement simple linear regression, the single-feature case, using the `scikit-learn` library in Python.

We'll cover a basic implementation, data preparation, model evaluation, and visualization.

### Basic Linear Regression

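Linear regression fits a linear model to the data by choosing the intercept and slope that minimize the sum of squared differences between the observed and predicted targets. Here's how to get started with a simple example.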

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating some sample data: y = 4 + 3x plus Gaussian noise
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Predicting at the two ends of the X range
X_new = np.array([[0], [2]])
y_predict = model.predict(X_new)

# Plotting the data and the fitted line
plt.scatter(X, y)
plt.plot(X_new, y_predict, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.show()
```

- **`np.random.seed(0)`**: Sets the seed for reproducibility.
- **`X` and `y`**: Generate the sample data. `X` holds 100 values drawn uniformly from [0, 2), and `y` follows `4 + 3 * X` plus standard normal noise.
- **`LinearRegression()`**: Initializes the linear regression model from `sklearn`.
- **`model.fit(X, y)`**: Fits the model to the data.
- **`model.predict(X_new)`**: Predicts `y` at `X = 0` and `X = 2`, the two endpoints needed to draw the fitted line.
- **`plt.scatter` and `plt.plot`**: Plots the data and the regression line.
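To see what the fit actually learned, it can help to inspect the estimated parameters. The following is a minimal sketch rather than part of the original walkthrough: it rebuilds the same synthetic data, reads the fitted values from `intercept_` and `coef_`, and cross-checks them against a direct least-squares solve with `np.linalg.lstsq`. The names `intercept`, `slope`, `X_b`, and `theta` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the example above (true intercept 4, true slope 3)
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

model = LinearRegression()
model.fit(X, y)

# Because y is 2-D here, intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = model.intercept_[0]
slope = model.coef_[0, 0]
print(f"Fitted intercept: {intercept:.3f}, fitted slope: {slope:.3f}")

# Cross-check: solve the same least-squares problem on the design matrix [1, x]
X_b = np.hstack([np.ones((100, 1)), X])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(f"lstsq intercept: {theta[0, 0]:.3f}, lstsq slope: {theta[1, 0]:.3f}")
```

The two pairs of numbers should agree closely, and both should land near the values 4 and 3 used to generate the data, with the gap coming from the added noise.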

### Data Preparation

Before training a model, it's essential to prepare the data. Here, that means splitting the data into training and testing sets so the model can be evaluated on data it did not see during fitting.

```python
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting the model on the training data
model.fit(X_train, y_train)

# Predicting on the test data
y_test_pred = model.predict(X_test)
```

- **`train_test_split`**: Splits the data into training and testing sets.
- **`test_size=0.2`**: Specifies that 20% of the data will be used for testing.
- **`random_state=0`**: Fixes the shuffle so the same split is produced on every run.
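The effect of these parameters can be checked directly. The sketch below is an illustration added here, assuming the same 100-row synthetic data as earlier; the names `X_train2` and `X_test2` are hypothetical and exist only to compare two calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same synthetic data as in the first example
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# 20% of 100 rows go to the test set, the rest to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)

# Repeating the call with the same random_state reproduces the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_test, X_test2))  # True
```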

### Model Evaluation

Evaluating the performance of the linear regression model is crucial. We'll use two metrics: Mean Squared Error (MSE) and R-squared.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Mean Squared Error
mse = mean_squared_error(y_test, y_test_pred)
print(f"Mean Squared Error: {mse}")

# R-squared
r2 = r2_score(y_test, y_test_pred)
print(f"R-squared: {r2}")
```

```
Mean Squared Error: 1.0434333815695171
R-squared: 0.7424452332071367
```

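- **`mean_squared_error`**: Computes the average squared difference between actual and predicted values. Lower is better, and the result is in squared units of the target.
- **`r2_score`**: Measures the proportion of variance in the target that the model explains. A score of 1.0 is a perfect fit, and the score can be negative for a model that does worse than always predicting the mean.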
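Both metrics are short formulas, and computing them by hand can make the definitions concrete. The following is a minimal sketch under the same data and split as above; the names `residuals`, `mse_manual`, `ss_res`, `ss_tot`, `r2_manual`, and `rmse` are introduced here for illustration and do not appear in the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data, split, and fit as in the earlier sections
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# MSE: mean of the squared residuals
residuals = y_test - y_test_pred
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is the square root of MSE, expressed in the same units as y
rmse = np.sqrt(mse_manual)

print(np.isclose(mse_manual, mean_squared_error(y_test, y_test_pred)))  # True
print(np.isclose(r2_manual, r2_score(y_test, y_test_pred)))  # True
print(f"RMSE: {rmse:.3f}")
```

Taking the square root of the MSE is often convenient for reporting, since the result can be read on the same scale as the target.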

### Visualizing Results

Plotting the regression line over the training and test data shows how well the fitted line tracks each set and whether it generalizes to points it was not trained on.

```python
# Plotting training data and the model
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_new, model.predict(X_new), color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Training Data')
plt.legend()
plt.show()

# Plotting test data and the model
plt.scatter(X_test, y_test, color='green', label='Test data')
plt.plot(X_new, model.predict(X_new), color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Test Data')
plt.legend()
plt.show()
```
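A residual plot is one further view that can complement the two plots above. The following is a minimal sketch, not part of the original walkthrough: it rebuilds the same synthetic data and split, and the names `train_residuals` and `test_residuals` are introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic data, split, and fit as in the earlier sections
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Residual = actual value minus predicted value
train_residuals = y_train - model.predict(X_train)
test_residuals = y_test - model.predict(X_test)

# Residuals scattered evenly around zero suggest a straight line is a reasonable fit
plt.scatter(X_train, train_residuals, color='blue', label='Training residuals')
plt.scatter(X_test, test_residuals, color='green', label='Test residuals')
plt.axhline(0, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('Residual')
plt.title('Residuals vs. X')
plt.legend()
plt.show()
```

Because the model includes an intercept, the training residuals average to zero by construction, so the informative part of the plot is their shape: a curve or a funnel pattern would hint that a straight line is not capturing the relationship.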

### Conclusion

You've learned how to prepare data, fit the model, evaluate it, and visualize the results. Linear regression serves as a foundational technique in data science and machine learning, offering valuable insights into data relationships.