Multicollinearity in machine learning


Multicollinearity appears when two or more independent variables in a regression model are correlated. In this article, we will discuss multicollinearity in machine learning, covering topics such as the types of multicollinearity and how to remove it.


It’s worth remembering that one factor affecting the standard error of a partial regression coefficient is the degree to which that independent variable is correlated with the other independent variables in the regression equation. All other things being equal, when an independent variable is highly correlated with one or more other independent variables, its partial regression coefficient has a relatively large standard error. That is, the partial regression coefficients are unstable and fluctuate widely from one sample to the next. This situation is known as Multicollinearity.


What is Multicollinearity? 

Multicollinearity occurs when two or more predictors in one regression model are highly correlated. We almost always have Multicollinearity in the data. The question is whether we can get away with it; and what to do if Multicollinearity is so severe that we cannot ignore it.

Extreme Multicollinearity occurs when an independent variable is very highly correlated with one or more other independent variables. In that case, the standard error of the partial regression coefficient for that independent variable is relatively large, and neither variable is likely to be statistically significant, even though each may be highly correlated with the dependent variable. The coefficients for these two independent variables become unreliable: collinearity has a significant impact on coefficient accuracy.

Consequences of Multicollinearity 

  1. It widens variances and covariances, making statistical tests of the null and alternative hypotheses difficult. 
  2. In the presence of Multicollinearity, standard errors increase and t-test values decrease, so a null hypothesis that should be rejected ends up being accepted (see the sketch after this list).
  3. It also increases R-squared, affecting the model’s goodness of fit.
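To see consequences 1 and 2 in action, here is a minimal synthetic sketch (all numbers below are made up for illustration) using statsmodels: adding a nearly identical copy of a predictor inflates its standard error and shrinks its t-value.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # almost a copy of x1
y = 3 * x1 + rng.normal(size=200)

# Fit y on x1 alone, then on the highly correlated pair (x1, x2)
alone = sm.OLS(y, sm.add_constant(x1)).fit()
together = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(alone.bse)         # standard errors with x1 only
print(together.bse)      # standard errors balloon once x2 joins
print(together.tvalues)  # and the t-values shrink accordingly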


Multicollinearity real-life example

Consider an example of a dataset in which we have to calculate the LPA based on IQ and CGPA grades. IQ and CGPA grades are the independent features, and LPA is the dependent feature.

Suppose LPA is dependent on IQ and CGPA. In that case, we will use multiple linear regression, whose equation looks like

y = β0 + β1X1 + β2X2 + … + βnXn

Where

β1 and β2 are the coefficients (β0 is the intercept)

X1 and X2 are the variables

We can write it as

LPA = β0 + β1·IQ + β2·CGPA

So we want to check: if we increase IQ while keeping CGPA constant, by how much does LPA increase? But suppose there is collinearity, meaning IQ and CGPA are correlated with each other. Then if we increase IQ, CGPA increases as well, which means these coefficients will no longer be reliable.

In short, in this case, it would be difficult to check the contribution of related features (features having Multicollinearity) to the output feature.
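Here is a minimal sketch of this example with synthetic data (the IQ, CGPA, and LPA numbers are invented for illustration): because CGPA is almost determined by IQ, the fitted coefficients swing from one subsample to the next, even though the overall predictions stay similar.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
iq = rng.normal(100, 15, n)
cgpa = 0.05 * iq + rng.normal(0, 0.1, n)                  # CGPA almost fully determined by IQ
lpa = 2 + 0.03 * iq + 1.5 * cgpa + rng.normal(0, 0.5, n)  # made-up relationship
X = np.column_stack([iq, cgpa])

# Refit the same model on two random subsamples and compare coefficients
for seed in (0, 1):
    idx = np.random.default_rng(seed).choice(n, size=100, replace=False)
    model = LinearRegression().fit(X[idx], lpa[idx])
    print(f"sample {seed}: coef(IQ)={model.coef_[0]:.3f}, coef(CGPA)={model.coef_[1]:.3f}")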

Problems if you have Multicollinearity

The idea behind regression is that you can change the value of one independent variable without affecting the others. However, when the independent variables are correlated, changes in one variable are associated with changes in another. The stronger the correlation, the more difficult it is to change one variable without changing another. The independent variables tend to vary together, making it difficult for the model to estimate the relationship between each independent variable and the dependent variable independently. Now a very important question arises here:

Is Multicollinearity always a problem? The answer is NO. Suppose you are using linear regression purely for prediction. Then Multicollinearity will not matter much (even if it is there). But on the other hand, if you are using linear regression to check feature importance, then Multicollinearity will affect your model, and you should remove the Multicollinearity first. And in most cases, linear regression is used to check feature importance.

Types of Multicollinearity

1. Structural Multicollinearity 

It is a mathematical artifact caused by creating new predictors from other predictors (for example, creating predictor x² from predictor x). It also appears when a data scientist adds new features to the dataset while using one-hot encoding, for example a City feature with the values City A, City B, and City C.

So, in this case, we can remove one of the columns City A, City B, or City C; if we don’t, that can lead to structural Multicollinearity.
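A small sketch of this with pandas (the City column below is made up): if you one-hot encode City and keep all three dummy columns, they always sum to 1, which is exactly the structural Multicollinearity described above; drop_first=True drops one column to avoid it.

import pandas as pd

cities = pd.DataFrame({'City': ['City A', 'City B', 'City C', 'City A']})

all_dummies = pd.get_dummies(cities['City'])                     # City A + City B + City C == 1 in every row
safe_dummies = pd.get_dummies(cities['City'], drop_first=True)   # one column dropped, no structural collinearity

print(all_dummies)
print(safe_dummies)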

2. Data multicollinearity

This type of Multicollinearity exists in the data itself and is not an artifact of the model. Observational studies tend to show this type of Multicollinearity, meaning it was already present when the data was collected.

Removing Multicollinearity 

1. Domain knowledge

Based on domain knowledge, you can identify pairs of independent variables that are likely to measure the same thing, confirm this by examining the correlation matrix between the independent variables, and keep only one of each pair.

2. Draw a Scatter plot

 
import pandas as pd

# Load the iris dataset and take a quick look at it
df = pd.read_csv('Iris.csv')
df.head()

Now plot two of the independent variables against each other:

import matplotlib.pyplot as plt

# Scatter plot of sepal length vs petal width
plt.scatter(df['SepalLengthCm'], df['PetalWidthCm'])
plt.show()

In this plot, you can see a roughly linear relationship between these two variables, so we can say that Multicollinearity is present here.

3. Correlation matrix

Make a correlation matrix and plot it as a heatmap. We have taken the iris dataset and written this code to build the correlation matrix:

 
import seaborn as sns

plt.figure(figsize=(8, 6))
# Correlations between the numeric iris columns only
iris_corr = df.corr(numeric_only=True)
sns.heatmap(iris_corr, annot=True)
plt.show()

From the graph, we can see a strong correlation between petal length and petal width and a good correlation between petal width and sepal length.

4. Variance inflation factor(VIF)

We have a very simple test to assess the Multicollinearity of a regression model. Variance Inflation Factor (VIF) identifies the correlation between independent variables and the strength of that correlation.

Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no upper limit. A value of 1 indicates no correlation between this independent variable and the other independent variables. A VIF between 1 and 5 indicates a moderate correlation that is not strong enough to warrant corrective action. A VIF above 5 represents a critical level of Multicollinearity with poorly estimated coefficients and questionable p-values. So if a variable shows a VIF of more than 5, you can delete one of the correlated independent features.

VIF Value    Conclusion
1            No correlation
1-5          Moderate correlation
>5           Multicollinearity (can delete that feature)
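One common way to compute VIF in Python is statsmodels’ variance_inflation_factor. The sketch below assumes the same Iris.csv file and column names used in the earlier code:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
X = add_constant(X)   # VIF is usually computed with an intercept term included

vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif[vif['feature'] != 'const'])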

5. Principal Component analysis

Perform a principal component analysis (PCA). PCA reduces the number of variables in your data, but it doesn’t decide which ones to remove; instead, it combines the existing predictors so that only the most informative parts are kept. This gives you uncorrelated predictors at the end.
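A minimal sketch with scikit-learn, again assuming the iris df loaded earlier: the correlated measurement columns are combined into uncorrelated principal components.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
scaled = StandardScaler().fit_transform(features)   # PCA is sensitive to scale

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)              # new, uncorrelated predictors
print(pca.explained_variance_ratio_)                # share of information each component keeps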

6. LASSO and Ridge regression 

They are advanced regression analyses that can handle Multicollinearity. If you know how to do a least-squares linear regression, you can do these analyses with a little extra study.
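A hedged sketch with scikit-learn’s Ridge and Lasso, predicting petal width from the other iris columns loaded earlier; the penalty strength alpha is an illustrative choice, not a recommended value. The penalty shrinks the coefficients and keeps them stable even when the predictors are correlated.

from sklearn.linear_model import Ridge, Lasso

X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]
y = df['PetalWidthCm']

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print('Ridge coefficients:', ridge.coef_)
print('Lasso coefficients:', lasso.coef_)   # Lasso can push some coefficients all the way to 0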

How to check for multicollinearity?

You must have understood from the topics above that we can detect multicollinearity by making a

  • Scatter plot
  • Variance inflation factor (VIF)
  • Correlation matrix

Conclusion

It is considered one of the main problems in linear regression analysis because strong correlations between the independent variables distort their coefficient estimates, which then change whenever the other variables change. This affects the entire model that researchers are preparing for analysis. Therefore, it is recommended to identify possible collinearity before performing regression analysis.
