Multiple linear regression

Updated on Feb 3, 2023 14:57 IST

Multiple linear regression is a statistical technique used to predict the outcome of one variable based on the values of two or more other variables. This article explains multiple linear regression using a real-life example and then walks through a Python programming example.

Many people confuse linear regression with multiple linear regression. Multiple linear regression is an extension of simple linear regression, and both are used for regression problems. We have already covered linear regression, so let's learn about multiple linear regression now.

What is Multiple linear regression?

Multiple linear regression is a statistical technique for predicting the outcome of one variable based on the values of two or more variables. Sometimes called multiple regression, it is an extension of linear regression. The variable we want to predict is called the dependent variable, and the variables we use to predict the value of the dependent variable are called the independent or explanatory variables.

In other words, MLR analyses how multiple independent variables are related to the dependent variable. Once the independent variables have been chosen, the model estimates how much each one contributes to the outcome and uses them together to predict it. The model fits the linear relationship that best matches the individual data points. Multiple linear regression can be used to determine:

  1. The strength of the relationship between one dependent variable and two or more independent variables (e.g., how salary relates to years of experience, number of certifications, and level of education).
  2. The value of the dependent variable at given values of the independent variables (e.g., the expected salary for a given number of years of experience, number of certifications, and level of education).

Real-life example of multiple linear regression

Suppose we have a dataset with one independent feature, work experience (in years), and salary as the dependent feature, so salary is predicted from work experience alone. Here we can use simple linear regression.

But what if we have more than one independent feature in our dataset?

For example, suppose the dataset has independent features such as the number of certifications, education level, and work experience, and the dependent feature is salary. These independent features (the factors you suspect affect your dependent variable) are used together to predict the salary. Can we use a simple linear regression model for this dataset? The answer is no. Here we have to use a multiple linear regression model for salary prediction.

The example dataset is small, but with many entries in the employee_salary dataset the plot would look like the one shown in the figure. One difference between the two plots is that simple linear regression gives a 2-D plot, while multiple linear regression with two independent variables gives a 3-D plot, here with x-axis = salary, y-axis = education level, and z-axis = work experience. Another difference is that simple linear regression fits a regression line, whereas multiple linear regression fits a plane (a hyperplane when there are more than two independent variables), as shown in the figure.
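The salary example can be sketched in a few lines of Python. This is a minimal illustration, not the article's dataset: the employee rows and the salary rule below are made up, and the names `X`, `salary`, and `new_employee` are assumptions introduced only for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical employee data (made up for illustration only).
# Columns: years of experience, number of certifications, education level (1-3).
X = np.array([
    [1, 0, 1],
    [2, 1, 1],
    [3, 1, 2],
    [5, 2, 2],
    [7, 3, 3],
    [10, 4, 3],
], dtype=float)

# Salaries generated from an assumed exact linear rule so the fit is easy to check:
# salary = 20000 + 3000*experience + 1500*certifications + 5000*education
salary = 20000 + 3000 * X[:, 0] + 1500 * X[:, 1] + 5000 * X[:, 2]

# One model uses all three independent features at once.
model = LinearRegression().fit(X, salary)
print("intercept:", round(model.intercept_, 2))
print("coefficients:", np.round(model.coef_, 2))

# Predict the salary of a new (hypothetical) employee:
# 4 years of experience, 2 certifications, education level 2.
new_employee = np.array([[4, 2, 2]], dtype=float)
predicted = model.predict(new_employee)[0]
print("predicted salary:", round(predicted, 2))
```

Because the made-up salaries follow the linear rule exactly, the fitted coefficients should match the rule. Real salary data would be noisy and the fit would only be approximate.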

How to perform a multiple linear regression

Multiple linear regression formula

The formula for multiple linear regression is:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

  • $y$ = the predicted value of the dependent variable
  • $\beta_0$ = the y-intercept (the value of y when all independent variables are set to 0)
  • $\beta_1 X_1$ = the regression coefficient ($\beta_1$) of the first independent variable ($X_1$), i.e. the effect that increasing the value of that independent variable has on the predicted y value when the other variables are held fixed
  • $\beta_n X_n$ = the regression coefficient of the last independent variable
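To make the formula concrete, here is a small worked sketch that evaluates it with NumPy. The coefficient values and the two observations are made up for illustration; they do not come from any fitted model.

```python
import numpy as np

# Assumed (made-up) coefficients for a model with three independent variables.
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # beta_1, beta_2, beta_3

# Two observations, one per row; columns are X1, X2, X3.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 0.0, 1.0],
])

# y = beta_0 + beta_1*X1 + beta_2*X2 + beta_3*X3, evaluated for every row at once.
y_hat = beta_0 + X @ beta
print(y_hat)  # first row: 2 + 0.5 - 2 + 9 = 9.5, second row: 2 + 2 - 0 + 3 = 7.0
```

The matrix product `X @ beta` computes the weighted sum of the independent variables for each row, which is why the same formula scales to any number of variables.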

Multiple Linear Regression Assumptions

Multiple linear regression makes the same assumptions as simple linear regression. One of them is normality: the residuals (prediction errors) are assumed to follow a normal distribution. The other assumptions are:

1. Observation Independence

Observations in the dataset were collected using statistically valid methods, and there are no hidden relationships among them. In multiple linear regression, some independent variables may also be correlated with each other, so it is essential to check for this before developing the regression model. If two independent variables are too strongly correlated (r² > ~0.6), only one of them should be used in the regression model.
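One possible way to run this check is to square the pairwise Pearson correlations between the independent variables and flag any pair above the threshold. The sketch below uses synthetic data; the column names `length_a`, `length_b`, and `width` are hypothetical, and the 0.6 cutoff is simply the rule of thumb quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical predictors (made up): length_b is almost a rescaled copy of length_a.
rng = np.random.default_rng(0)
length_a = rng.normal(30, 5, size=100)
features = pd.DataFrame({
    "length_a": length_a,
    "length_b": length_a * 1.1 + rng.normal(0, 0.5, size=100),
    "width": rng.normal(4, 1, size=100),
})

# Squared Pearson correlation between every pair of predictors.
r_squared = features.corr() ** 2

threshold = 0.6  # rule-of-thumb cutoff from the text
cols = list(features.columns)
flagged = [
    (a, b, round(r_squared.loc[a, b], 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if r_squared.loc[a, b] > threshold
]
print(flagged)  # pairs of predictors that are candidates for dropping one member
```

With this synthetic data only the `length_a` / `length_b` pair should be flagged, since `width` is generated independently of both.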

2. Homogeneity of variance (homoscedasticity)

The magnitude of the prediction error does not vary significantly across the values of the independent variables.

3. Linearity

The line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
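As a rough sketch of how the homogeneity-of-variance assumption might be checked numerically, the residuals of a fitted model can be split by low and high fitted values and their spreads compared. The data here is synthetic and generated with constant noise, and the low/high split is an illustrative choice rather than a formal test. A plot of residuals against fitted values is the more common visual check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): a linear signal plus noise with constant variance.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Compare the residual spread for the lower and upper half of the fitted values.
order = np.argsort(fitted)
low, high = residuals[order[:100]], residuals[order[100:]]
spread_ratio = high.std() / low.std()

print("mean residual:", residuals.mean())
print("spread ratio (high / low):", spread_ratio)
```

A ratio close to 1 is consistent with homogeneous variance. A ratio far from 1 would suggest that the prediction error grows or shrinks with the predicted value.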

Multiple linear regression with sklearn

About data set

This dataset contains market-sale data for 7 species of fish. We have to predict the weight of a fish, and we will do it using Python. The dataset used is Multiple Linear Regression Fish Weight Prediction, which is freely available on Kaggle. Its columns are:

1. Species - species name of the fish

2. Weight - weight of the fish in grams

3. Length1 - vertical length in cm

4. Length2 - diagonal length in cm

5. Length3 - cross length in cm

6. Height - height in cm

7. Width - diagonal width in cm

1. Importing Libraries and reading dataset

 
import pandas as pd
import numpy as np

# Read the fish dataset (adjust the path to wherever Fish.csv is stored)
dataset = pd.read_csv('/content/Fish.csv')
dataset

2. Plotting the graph between height and Weight

 
# Plot Weight against Height for a single species (Bream)
import matplotlib.pyplot as plt

df = dataset[dataset['Species'] == 'Bream']
plt.scatter(df['Weight'], df['Height'])
plt.xlabel('Weight')
plt.ylabel('Height')
plt.show()

Here we make a scatter plot of weight against height for one species (Bream) to check whether the relationship between the two is roughly linear.

3. Drop the categorical feature

 
# Drop the categorical feature
dataset = dataset.drop(['Species'], axis=1)

This dataset has one categorical feature, 'Species', so we have two options. We can convert it into numerical features using one-hot encoding, which we have explained in detail in our blog. Please refer to:

Handling categorical variables with one-hot encoding

One hot encoding for multi categorical variables

One hot encoding vs label encoding in Machine Learning

Alternatively, we can simply drop it if it does not have much relevance in predicting the output.
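For readers who prefer the first option, one-hot encoding can be sketched with pandas' `get_dummies`. The four rows below are made up and only reuse the dataset's column names; the choice of `drop_first=True` is an assumption of this sketch, not something the article's walkthrough does.

```python
import pandas as pd

# Tiny made-up sample that reuses the fish dataset's column names.
sample = pd.DataFrame({
    "Species": ["Bream", "Roach", "Pike", "Bream"],
    "Height": [11.5, 6.2, 7.3, 12.4],
    "Weight": [242.0, 120.0, 300.0, 340.0],
})

# Replace 'Species' with one indicator column per category.
# drop_first=True drops one category so the indicator columns are not
# perfectly collinear with the model's intercept.
encoded = pd.get_dummies(sample, columns=["Species"], drop_first=True)
print(encoded.columns.tolist())
```

The encoded frame keeps `Height` and `Weight` and adds indicator columns for the remaining species, so it can be passed to a regression model in place of the original frame.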

4. Dependent and independent variables

 
# Get the target data
y = dataset['Weight']
# Load the X variables into a pandas DataFrame (all columns except the target)
X = dataset.drop(['Weight'], axis=1)

5. Splitting dataset into Training and Testing Set

 
# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Here both X and y are split into training and testing sets with the train_test_split() function. With test_size = 0.2, 20% of the rows are held out for testing and the remaining 80% are used for training.
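The effect of the split can be seen on a small stand-in array. This sketch uses made-up data of 100 rows (the names `X_demo` and `y_demo` are hypothetical) rather than the fish dataset, so the resulting sizes are easy to verify by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 100 rows with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # 80 training rows and 20 test rows

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the reported score repeatable from one run to the next.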

6. Importing and fitting Linear Regression to the Training set

 
from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on the training data
regressor = LinearRegression()
model = regressor.fit(X_train, y_train)

7. Predicting the Test set results

 
from sklearn.metrics import r2_score

# Predict on the held-out test set and compute the R-squared score
y_pred = regressor.predict(X_test)
score = r2_score(y_test, y_pred)
score
# Output: 0.863
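To show what `r2_score` measures, the sketch below computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with scikit-learn's result. The four true and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: squared prediction errors.
ss_res = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0.25 + 0 + 0.25 = 0.75
# Total sum of squares: squared deviations from the mean of y_true (mean = 6).
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20

r2_manual = 1 - ss_res / ss_tot                 # 1 - 0.75 / 20 = 0.9625
print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means the predictions match the true values exactly, while a score near 0 means the model does no better than always predicting the mean.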

8. Find correlation to check if we can remove features

 
import seaborn as sns
import matplotlib.pyplot as plt
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = dataset.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()
[Figure: Pearson correlation heatmap of the dataset features]

From this heatmap we can conclude that the Weight feature has a high correlation with all the other features, so removing features here is unlikely to improve the R-squared score.

Conclusion

Multiple linear regression is used to evaluate predictors for continuously distributed outcome variables. This procedure computes a coefficient for each independent variable (predictor) that best fits the observed data in the sample.
