Multicollinearity in machine learning


Multicollinearity appears when two or more independent variables in a regression model are correlated. In this article, we will discuss multicollinearity in machine learning, covering topics such as the types of multicollinearity and how to remove it.


It’s worth remembering that one factor affecting the standard error of a partial regression coefficient is the degree to which that independent variable is correlated with the other independent variables in the regression equation. All other things being equal, when an independent variable is highly correlated with one or more other independent variables, its partial regression coefficient has a relatively large standard error. That is, the partial regression coefficients are unstable and fluctuate widely from one sample to the next. This situation is known as Multicollinearity.


What is Multicollinearity? 

Multicollinearity occurs when two or more predictors in one regression model are highly correlated. We almost always have Multicollinearity in the data. The question is whether we can get away with it; and what to do if Multicollinearity is so severe that we cannot ignore it.

Extreme Multicollinearity occurs when an independent variable is very highly correlated with one or more other independent variables. In that case, the standard error of the partial regression coefficient for that independent variable is relatively large, and neither variable is likely to be statistically significant, even though each may be highly correlated with the dependent variable. The coefficients for these two independent variables become unreliable: collinearity has a significant impact on coefficient accuracy.

Consequences of Multicollinearity 

  1. It widens variances and covariances, making statistical tests of the null and alternative hypotheses difficult. 
  2. In the presence of Multicollinearity, standard errors increase and t-test values decrease, so a null hypothesis that should be rejected ends up being accepted (see the sketch after this list).
  3. It also increases R-squared, affecting the model’s goodness of fit.
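To see consequences 1 and 2 in action, here is a minimal synthetic sketch (all numbers below are made up for illustration) using statsmodels: adding a nearly identical copy of a predictor inflates its standard error and shrinks its t-value.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # almost a copy of x1
y = 3 * x1 + rng.normal(size=200)

# Fit y on x1 alone, then on the highly correlated pair (x1, x2)
alone = sm.OLS(y, sm.add_constant(x1)).fit()
together = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(alone.bse)         # standard errors with x1 only
print(together.bse)      # standard errors balloon once x2 joins
print(together.tvalues)  # and the t-values shrink accordingly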


Multicollinearity real-life example

Consider an example of a dataset in which we have to calculate the LPA based on IQ and CGPA grades. IQ and CGPA grades are the independent features, and LPA is the dependent feature.

Suppose LPA is dependent on IQ and CGPA. In that case, we will use multiple linear regression, whose equation looks like

y = β0 + β1X1 + β2X2 + … + βnXn

Where

β1 and β2 are the coefficients (β0 is the intercept)

X1 and X2 are the variables

We can write it as

LPA = β0 + β1·IQ + β2·CGPA

So we want to check: if we increase IQ while keeping CGPA constant, by how much does LPA increase? But suppose there is collinearity, meaning IQ and CGPA are correlated with each other. Then if we increase IQ, CGPA increases as well, which means these coefficients will no longer be reliable.

In short, in this case, it would be difficult to check the contribution of related features (features having Multicollinearity) to the output feature.
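Here is a minimal sketch of this example with synthetic data (the IQ, CGPA, and LPA numbers are invented for illustration): because CGPA is almost determined by IQ, the fitted coefficients swing from one subsample to the next, even though the overall predictions stay similar.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
iq = rng.normal(100, 15, n)
cgpa = 0.05 * iq + rng.normal(0, 0.1, n)                  # CGPA almost fully determined by IQ
lpa = 2 + 0.03 * iq + 1.5 * cgpa + rng.normal(0, 0.5, n)  # made-up relationship
X = np.column_stack([iq, cgpa])

# Refit the same model on two random subsamples and compare coefficients
for seed in (0, 1):
    idx = np.random.default_rng(seed).choice(n, size=100, replace=False)
    model = LinearRegression().fit(X[idx], lpa[idx])
    print(f"sample {seed}: coef(IQ)={model.coef_[0]:.3f}, coef(CGPA)={model.coef_[1]:.3f}")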

Problems if you have Multicollinearity

The idea behind regression is that you can change the value of one independent variable without affecting the others. However, when the independent variables are correlated, changes in one variable are associated with changes in another. The stronger the correlation, the more difficult it is to change one variable without changing another. The independent variables tend to vary together, making it difficult for the model to estimate the relationship between each independent variable and the dependent variable independently. Now a very important question arises here:

Is Multicollinearity always a problem? The answer is NO. Suppose you are using linear regression purely for prediction. Then Multicollinearity will not matter much (even if it is there). But on the other hand, if you are using linear regression to check feature importance, then Multicollinearity will affect your model, and you should remove the Multicollinearity first. And in most cases, linear regression is used to check feature importance.

Types of Multicollinearity

1. Structural Multicollinearity 

It is a mathematical artifact caused by creating new predictors from other predictors (for example, creating predictor x² from predictor x). It also appears when a data scientist adds new features to the dataset while using one-hot encoding, for example a City feature with the values City A, City B, and City C.

So, in this case, we can remove one of the columns City A, City B, or City C; if we don’t, that can lead to structural Multicollinearity.
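A small sketch of this with pandas (the City column below is made up): if you one-hot encode City and keep all three dummy columns, they always sum to 1, which is exactly the structural Multicollinearity described above; drop_first=True drops one column to avoid it.

import pandas as pd

cities = pd.DataFrame({'City': ['City A', 'City B', 'City C', 'City A']})

all_dummies = pd.get_dummies(cities['City'])                     # City A + City B + City C == 1 in every row
safe_dummies = pd.get_dummies(cities['City'], drop_first=True)   # one column dropped, no structural collinearity

print(all_dummies)
print(safe_dummies)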

2. Data multicollinearity

This type of Multicollinearity exists in the data itself and is not an artifact of the model. Observational studies tend to show this type of Multicollinearity, meaning it was already present when the data was collected.

Removing Multicollinearity 

1. Domain knowledge

Based on domain knowledge, you can identify pairs of independent variables that are likely to measure the same thing, confirm this by examining the correlation matrix between the independent variables, and keep only one of each pair.

2. Draw a Scatter plot

 
import pandas as pd

# Load the iris dataset and take a quick look at it
df = pd.read_csv('Iris.csv')
df.head()

Now plot two of the independent variables against each other:

import matplotlib.pyplot as plt

# Scatter plot of sepal length vs petal width
plt.scatter(df['SepalLengthCm'], df['PetalWidthCm'])
plt.show()

In this plot, you can see a roughly linear relationship between these two variables, so we can say that Multicollinearity is present here.

3. Correlation matrix

Make a correlation matrix and plot it as a heatmap. We have taken the iris dataset and written this code to build the correlation matrix:

 
import seaborn as sns

plt.figure(figsize=(8, 6))
# Correlations between the numeric iris columns only
iris_corr = df.corr(numeric_only=True)
sns.heatmap(iris_corr, annot=True)
plt.show()

From the graph, we can see a strong correlation between petal length and petal width and a good correlation between petal width and sepal length.

4. Variance inflation factor(VIF)

We have a very simple test to assess the Multicollinearity of a regression model. Variance Inflation Factor (VIF) identifies the correlation between independent variables and the strength of that correlation.

Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no upper limit. A value of 1 indicates no correlation between this independent variable and the other independent variables. A VIF between 1 and 5 indicates a moderate correlation that is not strong enough to warrant corrective action. A VIF above 5 represents a critical level of Multicollinearity with poorly estimated coefficients and questionable p-values. So if a variable shows a VIF of more than 5, you can delete one of the correlated independent features.

VIF Value    Conclusion
1            No correlation
1-5          Moderate correlation
>5           Multicollinearity (can delete that feature)
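One common way to compute VIF in Python is statsmodels’ variance_inflation_factor. The sketch below assumes the same Iris.csv file and column names used in the earlier code:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
X = add_constant(X)   # VIF is usually computed with an intercept term included

vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif[vif['feature'] != 'const'])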

5. Principal Component analysis

Perform a principal component analysis (PCA). PCA reduces the number of variables in your data, but it doesn’t decide which ones to remove; instead, it combines the existing predictors so that only the most informative parts are kept. This gives you uncorrelated predictors at the end.
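A minimal sketch with scikit-learn, again assuming the iris df loaded earlier: the correlated measurement columns are combined into uncorrelated principal components.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
scaled = StandardScaler().fit_transform(features)   # PCA is sensitive to scale

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)              # new, uncorrelated predictors
print(pca.explained_variance_ratio_)                # share of information each component keeps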

6. LASSO and Ridge regression 

They are advanced regression analyses that can handle Multicollinearity. If you know how to do a least-squares linear regression, you can do these analyses with a little extra study.
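A hedged sketch with scikit-learn’s Ridge and Lasso, predicting petal width from the other iris columns loaded earlier; the penalty strength alpha is an illustrative choice, not a recommended value. The penalty shrinks the coefficients and keeps them stable even when the predictors are correlated.

from sklearn.linear_model import Ridge, Lasso

X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]
y = df['PetalWidthCm']

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print('Ridge coefficients:', ridge.coef_)
print('Lasso coefficients:', lasso.coef_)   # Lasso can push some coefficients all the way to 0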

How to check for multicollinearity?

You must have understood from the topics above that we can detect multicollinearity by making a

  • Scatter plot
  • Variance inflation factor (VIF)
  • Correlation matrix

Conclusion

It is considered one of the main problems in linear regression analysis because strong correlations between the independent variables distort their coefficient estimates, which then change whenever the other variables change. This affects the entire model that researchers are preparing for analysis. Therefore, it is recommended to identify possible collinearity before performing regression analysis.
