Linear regression is a fundamental tool in data science and machine learning for modeling the relationship between one or more independent variables (features) and a dependent variable (target). In this tutorial, we'll focus on the single-feature case and guide you through implementing linear regression using the `scikit-learn` library in Python. We'll cover a basic implementation, data preparation, model evaluation, and visualization.

### Basic Linear Regression

Linear regression aims to fit a linear model to the data. Here's how to get started with a simple example.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generating some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Predicting
X_new = np.array([[0], [2]])
y_predict = model.predict(X_new)

# Plotting
plt.scatter(X, y)
plt.plot(X_new, y_predict, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.show()
```
- **`np.random.seed(0)`**: Sets the seed for reproducibility.
- **`X` and `y`**: Generate random sample data for the independent and dependent variables.
- **`LinearRegression()`**: Initializes the linear regression model from `sklearn`.
- **`model.fit(X, y)`**: Fits the model to the data.
- **`model.predict(X_new)`**: Makes predictions using the fitted model.
- **`plt.scatter` and `plt.plot`**: Plot the data and the regression line.
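Once the model is fitted, you can also read the learned parameters directly from the estimator. Here's a minimal sketch, assuming the model was fitted on the `X` and `y` generated above; the exact numbers depend on the random sample, but they should land close to the true intercept of 4 and slope of 3 used to create the data.

```python
# Inspect the parameters the fitted model learned
print(f"Intercept: {model.intercept_[0]:.3f}")  # should be close to the true intercept of 4
print(f"Slope: {model.coef_[0][0]:.3f}")        # should be close to the true slope of 3
```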
### Data Preparation

Before training a model, it's essential to prepare the data. This involves splitting the data into training and testing sets.

```python
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting the model on the training data
model.fit(X_train, y_train)

# Predicting on the test data
y_test_pred = model.predict(X_test)
```
- **`train_test_split`**: Splits the data into training and testing sets.
- **`test_size=0.2`**: Specifies that 20% of the data will be used for testing.
- **`random_state=0`**: Ensures reproducibility of the split.
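As a quick sanity check, you can print the shapes of the split arrays. Assuming the 100-sample dataset generated in the first example, an 80/20 split should give 80 training rows and 20 test rows.

```python
# With 100 samples and test_size=0.2, expect 80 training and 20 test rows
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
print(y_train.shape, y_test.shape)  # (80, 1) (20, 1)
```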
### Model Evaluation

Evaluating the performance of the linear regression model is crucial. We'll use metrics such as Mean Squared Error (MSE) and R-squared.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Mean Squared Error
mse = mean_squared_error(y_test, y_test_pred)
print(f"Mean Squared Error: {mse}")

# R-squared
r2 = r2_score(y_test, y_test_pred)
print(f"R-squared: {r2}")
```
```
Mean Squared Error: 1.0434333815695171
R-squared: 0.7424452332071367
```
- **`mean_squared_error`**: Computes the average squared difference between the actual and predicted values.
- **`r2_score`**: Measures the proportion of variance in the target explained by the model.
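Note that MSE is expressed in squared units of the target, which can be hard to interpret directly. If you'd like an error in the same units as `y`, one common option is the root mean squared error; here's a small sketch using NumPy (it simply takes the square root of the `mse` computed above).

```python
import numpy as np  # already imported earlier; repeated here for completeness

# RMSE: square root of MSE, in the same units as y
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.3f}")
```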
### Visualizing Results

Visualizing the regression line over the training and test data can provide useful insight into how well the model fits.

```python
# Plotting training data and the model
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_new, model.predict(X_new), color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Training Data')
plt.legend()
plt.show()

# Plotting test data and the model
plt.scatter(X_test, y_test, color='green', label='Test data')
plt.plot(X_new, model.predict(X_new), color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Test Data')
plt.legend()
plt.show()
```
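Beyond plotting the fitted line, a residual plot is another common way to inspect model performance. The sketch below is an optional extra, not part of the steps above; it reuses the test predictions from the evaluation step and plots the prediction errors, which should scatter around zero with no obvious pattern if a straight line is a good fit.

```python
# Residuals: difference between actual and predicted test targets
residuals = y_test - y_test_pred

plt.scatter(X_test, residuals, color='purple')
plt.axhline(y=0, color='black', linestyle='--')  # reference line at zero error
plt.xlabel('X')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals on Test Data')
plt.show()
```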
### Conclusion

You've learned how to prepare data, fit a linear regression model, evaluate it, and visualize the results. Linear regression serves as a foundational technique in data science and machine learning, offering valuable insights into data relationships.