Difference Between Linear and Multiple Regression
Linear regression examines the relationship between one predictor and an outcome, while multiple regression delves into how several predictors influence that outcome. Both are essential tools in predictive analytics, but knowing their differences ensures effective and accurate modelling. Dive in to discover the core distinctions and when to use each approach.
Linear Regression vs Multiple Regression
| Parameter | Linear (Simple) Regression | Multiple Regression |
| --- | --- | --- |
| Definition | Models the relationship between one dependent and one independent variable. | Models the relationship between one dependent and two or more independent variables. |
| Equation | Y = C0 + C1X + e | Y = C0 + C1X1 + C2X2 + C3X3 + … + CnXn + e |
| Complexity | Simpler, as it deals with only one relationship. | More complex due to multiple relationships. |
| Use Cases | Suitable when there is one clear predictor. | Suitable when multiple factors affect the outcome. |
| Assumptions | Linearity, Independence, Homoscedasticity, Normality | Same as linear regression, with the added concern of multicollinearity. |
| Visualization | Typically visualized with a 2D scatter plot and a line of best fit. | Requires 3D or multi-dimensional space, often represented using partial regression plots. |
| Risk of Overfitting | Lower, as it deals with only one predictor. | Higher, especially if too many predictors are used without adequate data. |
| Multicollinearity Concern | Not applicable, as there’s only one predictor. | A primary concern; correlated predictors can affect the model’s accuracy and interpretation. |
| Applications | Basic research, simple predictions, understanding a singular relationship. | Complex research, multifactorial predictions, studying interrelated systems. |
What is Linear Regression?
Linear regression is a statistical method used to model the relationship between a dependent variable and one independent variable. It aims to establish a linear relationship between these variables and can be used for both prediction and understanding the nature of the relationship.
Mathematical Equation
The mathematical representation of simple linear regression is:
Y = C0 + C1X + e
where,
- Y: Dependent Variable (target variable)
- X: Independent Variable (input variable)
- C0: Intercept (value of Y when X=0)
- C1: Slope of line
- e: Error term
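To make the equation concrete, here is a minimal sketch (not part of the original walkthrough) that estimates C0 and C1 with the ordinary least squares formulas using NumPy; the data values are made up purely for illustration.

import numpy as np

# Hypothetical data: one predictor X and one outcome Y (illustrative values only)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Ordinary least squares estimates of the slope (C1) and intercept (C0)
C1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
C0 = Y.mean() - C1 * X.mean()

# Residuals (the error term e for each observation)
e = Y - (C0 + C1 * X)

print(f"Intercept C0 = {C0:.3f}, Slope C1 = {C1:.3f}")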
Assumptions of Linear Regression
Here are the assumptions that must be satisfied for a linear regression model to be valid; a short sketch of how to check them follows the list.
- Linearity: The relationship between the independent and dependent variables should be linear.
- Independence: Observations should be independent of each other.
- Homoscedasticity: The variance of the errors should be the same across all levels of the independent variables.
- Normality: The dependent variable is normally distributed for a fixed value of the independent variable.
- No Multicollinearity: This concern applies mainly to multiple regression, where the independent variables should not be strongly correlated with each other.
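In practice, these assumptions are usually checked on the residuals of a fitted model. The sketch below is only an illustration on synthetic data: it plots residuals against fitted values to eyeball linearity and homoscedasticity, and runs a Shapiro-Wilk test on the residuals as a rough normality check.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative data: replace with your own predictor X and outcome Y
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100)
Y = 3 + 2 * X + rng.normal(0, 1, 100)

# Fit a simple linear model and compute residuals
C1, C0 = np.polyfit(X, Y, 1)
residuals = Y - (C0 + C1 * X)

# Linearity / homoscedasticity: residuals should scatter evenly around zero
plt.scatter(C0 + C1 * X, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()

# Normality of residuals: Shapiro-Wilk test (a large p-value gives no strong evidence against normality)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")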
Limitations of Linear Regression
- Outliers: These can significantly impact the slope and intercept of the regression line.
- Non-linearity: Linear regression assumes a linear relationship, and this assumption may not hold for many real-world datasets.
- Correlation ≠ Causation: Just because two variables have a linear relationship doesn’t mean changes in one cause changes in the other.
What is Multiple Regression?
Multiple regression is an extension of simple linear regression. It models the relationship between one dependent variable and two or more independent variables. The primary purpose is to understand how the dependent variable changes as the independent variables change.
Mathematical Equation
The mathematical representation of multiple regression is:
Y = C0 + C1X1 + C2X2 + C3X3 + ….. + CnXn + e
where,
- Y: Dependent Variable (target variable)
- X1, X2, X3, …, Xn: Independent Variables (input variables)
- C0: Intercept (value of Y when all independent variables are 0)
- C1, C2, C3, …, Cn: Coefficients (slopes) of the respective independent variables
- e: Error term
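As a quick illustration (again with made-up data, not part of the original walkthrough), the coefficients C0, C1, …, Cn can be estimated in one shot by solving a least-squares problem on a design matrix whose first column is all ones for the intercept:

import numpy as np

# Hypothetical data: two predictors (X1, X2) and one outcome Y
rng = np.random.default_rng(1)
X1 = rng.uniform(0, 10, 50)
X2 = rng.uniform(0, 5, 50)
Y = 4 + 2.0 * X1 + 0.5 * X2 + rng.normal(0, 1, 50)

# Design matrix with a column of ones for the intercept C0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares solution: coefficients [C0, C1, C2]
coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(f"C0 = {coeffs[0]:.2f}, C1 = {coeffs[1]:.2f}, C2 = {coeffs[2]:.2f}")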
Assumptions of Multiple Regression
- Linearity: A linear relationship exists between the dependent and independent variables.
- Independence: Observations are independent of each other.
- No multicollinearity: Independent variables aren’t too highly correlated with each other (a quick way to check this is shown after the list).
- Homoscedasticity: Constant variance of the errors.
- No Autocorrelation: The residuals (errors) are independent.
- Normality: The dependent variable is normally distributed for any fixed value of the independent variables.
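A common way to check the “no multicollinearity” assumption is the variance inflation factor (VIF). The sketch below uses statsmodels on a small synthetic predictor matrix in which X3 is deliberately built from X1, so its VIF should come out high; the column names and data are illustrative only.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictor matrix; replace with your own independent variables
rng = np.random.default_rng(2)
df = pd.DataFrame({
    'X1': rng.normal(size=100),
    'X2': rng.normal(size=100),
})
df['X3'] = df['X1'] * 0.9 + rng.normal(scale=0.1, size=100)  # deliberately correlated with X1

# Add an intercept column, then compute VIF for each predictor
# (values well above roughly 5-10 are often read as problematic multicollinearity)
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)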
Limitations of Multiple Regression
- Overfitting: Including too many independent variables can lead to a model that fits the training data too closely.
- Omitted Variable Bias: Leaving out a significant independent variable can bias the coefficients of other variables.
- Endogeneity occurs when an independent variable is correlated with the error term, leading to biased coefficient estimates.
By now, you should have a clear picture of what linear and multiple regression are, their mathematical equations, assumptions, and limitations, as well as how the two differ from each other. Now it’s time for an example that shows how to fit linear and multiple regression models in Python.
Example of Linear and Multiple Regression
Problem Statement: Suppose we have data for a retail company. The company wants to understand how its advertising expenses across various channels (e.g., TV, Radio) impact sales.
- Linear Regression: Predict sales using only TV advertising expenses.
- Multiple Regression: Predict sales using both TV and Radio advertising expenses.
Step-1: Generate a random dataset
import numpy as np
import pandas as pd

# Sample data generation
np.random.seed(0)
tv = 100 + 50 * np.random.rand(100)
radio = 50 + 25 * np.random.rand(100)
sales = 200 + 3 * tv + 1.5 * radio + 30 * np.random.randn(100)

data = pd.DataFrame({'TV': tv, 'Radio': radio, 'Sales': sales})

# Show the first five rows
data.head()
Output
Step-2: Split the dataset into training and test dataset
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=0)
Step-3: Evaluating Root Mean Squared Error (RMSE) for Linear Regression
# Linear Regression
# Using only TV expenses for prediction
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train_tv = train[['TV']]
y_train = train['Sales']
X_test_tv = test[['TV']]
y_test = test['Sales']

linear_model = LinearRegression().fit(X_train_tv, y_train)
linear_pred = linear_model.predict(X_test_tv)

# Evaluation
linear_rmse = np.sqrt(mean_squared_error(y_test, linear_pred))
Step-4: Evaluating Root Mean Squared Error (RMSE) for Multiple Regression
# Multiple Regression
# Using both TV and Radio expenses for prediction
X_train_multi = train[['TV', 'Radio']]
X_test_multi = test[['TV', 'Radio']]

multiple_model = LinearRegression().fit(X_train_multi, y_train)
multiple_pred = multiple_model.predict(X_test_multi)

# Evaluation
multiple_rmse = np.sqrt(mean_squared_error(y_test, multiple_pred))
Step-5: Print the results
# Error Metrics
print(f"Linear Regression RMSE: {linear_rmse:.2f}")
print(f"Multiple Regression RMSE: {multiple_rmse:.2f}")
Output
Linear Regression RMSE: 27.18
Multiple Regression RMSE: 25.27
Explanation
From the above results, the RMSE for linear regression is greater than the RMSE for multiple regression, which implies that multiple regression gives a better fit to this data.
Typically, adding more relevant predictors (features) can enhance a model’s performance, but you must be cautious about overfitting. Also, if the features are correlated with each other, it can introduce multicollinearity.
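As a quick sanity check on the example above (an illustrative addition, not part of the original walkthrough), you can inspect the correlation between the two predictors:

# Correlation between the two predictors from the example above
print(data[['TV', 'Radio']].corr())

# In this synthetic dataset TV and Radio are generated independently,
# so their correlation should be close to zero and multicollinearity is not a concern here.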
Now, let’s see what the plots of linear and multiple regression look like:
Linear Regression
import matplotlib.pyplot as plt

# For Linear Regression
plt.scatter(X_test_tv, y_test, color='blue', label='True values')
plt.scatter(X_test_tv, linear_pred, color='red', label='Predicted values')
plt.xlabel('TV Expenses')
plt.ylabel('Sales')
plt.title('Linear Regression: TV vs Sales')
plt.legend()
plt.show()
Output
Multiple Regression
from mpl_toolkits.mplot3d import Axes3D

# Setting up the 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot of actual data
ax.scatter(train['TV'], train['Radio'], train['Sales'], color='blue', marker='o', alpha=0.5, label='True values')

# Creating a meshgrid for the plane
x_surf = np.linspace(train['TV'].min(), train['TV'].max(), 100)
y_surf = np.linspace(train['Radio'].min(), train['Radio'].max(), 100)
x_surf, y_surf = np.meshgrid(x_surf, y_surf)

# Predicting the values from the meshed grid
vals = pd.DataFrame({'TV': x_surf.ravel(), 'Radio': y_surf.ravel()})
predicted_sales = multiple_model.predict(vals)
ax.plot_surface(x_surf, y_surf, predicted_sales.reshape(x_surf.shape), color='None', alpha=0.3)

# Labeling the axes
ax.set_xlabel('TV Expenses')
ax.set_ylabel('Radio Expenses')
ax.set_zlabel('Sales')
ax.set_title('Multiple Regression: Sales predicted by TV and Radio Expenses')
ax.legend()

plt.show()
Output