Assumptions of Linear Regression

Vikram Singh
Assistant Manager - Content
Updated on Nov 21, 2022 11:30 IST

There are certain assumptions of the linear regression algorithm that must be satisfied before building any model; otherwise, the results will be insignificant.


Linear regression is a supervised machine-learning algorithm that models a linear relationship between two or more continuous variables (dependent and independent). The linear regression algorithm finds the best-fit line, i.e., the line that minimizes the difference between the actual and the predicted (estimated) values.

Equation of Linear Regression:

Y = mX + C 

where

Y: dependent variable

X: independent variable

C: y-intercept

m: slope or regression coefficient
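
For readers who prefer code, here is a minimal sketch of fitting this equation with scikit-learn, assuming a small synthetic dataset (the variable names and numbers are only illustrative):

```python
# Minimal sketch: fit Y = mX + C on synthetic data and recover m and C.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))                # independent variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)    # Y = mX + C plus noise

model = LinearRegression().fit(X, y)
print("slope (m):", model.coef_[0])                  # estimated regression coefficient
print("intercept (C):", model.intercept_)            # estimated y-intercept
```

The fitted slope and intercept should land close to the true values (3 and 5) used to generate the data.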

The linear regression algorithm has a set of assumptions that must be satisfied to build a model that produces the best-fit line (regression line) for a given dataset.


The linear regression algorithm learns a set of parameters from the dataset, and because it is a parametric method, it comes with some restrictions. If the assumptions are not satisfied, the algorithm will fail to find the best-fit line.

Linear Regression Algorithm has the following assumptions:

  • Linearity
  • No hidden or missing values
  • No multicollinearity between the independent variables
  • No autocorrelation in the residuals (error terms)
  • Normality of the residuals
  • Homoscedasticity

Now, let’s discuss them one-by-one:

1. Linearity

As the name suggests, in linear regression the relationship between the dependent and independent variables must be linear.
The general linear equation can be written as

Y = C0 + C1X1 + C2X2 + C3X3 + ……+CnXn 

where,
Ci: constants
Xi: independent variables

How to check whether a given function is linear or not: a function f is linear if

f(ax + by) = af(x) + bf(y)

where,
a, b: constants
x, y: independent variables

If the linear regression algorithm fails the linearity assumption, it will fail to capture the trend, which will lead to a false prediction.
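
One common visual check of linearity is a plot of the residuals against the fitted values: a patternless cloud around zero is consistent with a linear relationship, while a clear curve suggests it is not. A minimal sketch, assuming synthetic data, scikit-learn, and matplotlib (all names are illustrative):

```python
# Minimal sketch: residuals-vs-fitted plot as a visual linearity check.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # illustrative (linear) data

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals)            # should look like a random cloud
plt.axhline(0, linestyle="--")                      # reference line at zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted (linearity check)")
plt.show()
```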

Also read: How to Calculate R squared in Linear Regression

Also read: r-squared vs. adjusted r-squared


2. No Hidden or Missing Values

All the variables used in the linear regression model must be relevant and must not contain any hidden or missing values. If a variable contains missing values, the model will make false or insignificant predictions. These missing values can be handled by:

  • Deleting rows with missing values
  • Replacing them with an arbitrary value (imputation)
  • Interpolation
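
A minimal pandas sketch of the three options above (the column names and values are made up purely for illustration):

```python
# Minimal sketch: three common ways to handle missing values before modelling.
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [2.1, np.nan, 6.2, 8.1, 9.9]})

dropped      = df.dropna()            # 1. delete rows with missing values
imputed      = df.fillna(df.mean())   # 2. replace with an arbitrary value (here, the column mean)
interpolated = df.interpolate()       # 3. interpolate between neighbouring values

print(dropped, imputed, interpolated, sep="\n\n")
```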

3. Multicollinearity

There should not be any correlation between the independent variables. Collinearity between the independent variables increases the complexity of the model. Since the linear regression algorithm estimates the effect of each independent variable, it becomes difficult to isolate the impact of an individual variable on the dependent variable when the independent variables are correlated with each other.
In simple terms, if multicollinearity exists between the variables, it is difficult to determine which independent variable has a significant impact on the dependent variable (i.e., which feature contributes most to the prediction).

Multicollinearity can be tested using the correlation matrix and the tolerance (tolerance = 1 − R²); the variance inflation factor (VIF) is the reciprocal of the tolerance.
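
A minimal sketch of both checks, assuming statsmodels is installed: it prints the correlation matrix of three synthetic predictors (two of them deliberately correlated) and the VIF and tolerance of each:

```python
# Minimal sketch: detect multicollinearity via the correlation matrix and VIF/tolerance.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)          # deliberately correlated with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())                                           # pairwise correlation matrix

X_const = sm.add_constant(X)                              # VIF is computed with an intercept included
for i, col in enumerate(X_const.columns):
    if col == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)    # VIF = 1 / (1 - R^2)
    print(f"{col}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")
```

Here x1 and x2 should show a high correlation and large VIFs (low tolerance), flagging multicollinearity.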


4. Autocorrelation in the residuals (Error terms):

Autocorrelation occurs when the error terms depend on each other. The error terms in linear regression should be independent of one another (no autocorrelation) and identically distributed.

  • Autocorrelation mostly occurs in time-series data, where each instant depends on the previous instant.
  • Due to autocorrelation, the estimated standard error tends to underestimate the true standard error.
  • Due to autocorrelation, the confidence interval and prediction interval become narrower than they should be.
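
A common numerical check is the Durbin-Watson statistic; a minimal sketch using statsmodels on synthetic data is shown below. A value near 2 suggests no autocorrelation, while values towards 0 or 4 suggest positive or negative autocorrelation, respectively.

```python
# Minimal sketch: Durbin-Watson test on the residuals of an OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)          # illustrative data with independent errors

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)                    # ~2: no autocorrelation,
print(f"Durbin-Watson statistic: {dw:.2f}")        # ->0: positive, ->4: negative autocorrelation
```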

5. Normality (Gaussian Distribution)

The residuals (error terms) must follow a normal distribution. The normality condition can be relaxed when there are many observations, but with a small number of observations the standard errors of the model will be unreliable.

Normality can be checked with a histogram or a Q-Q plot of the residuals. If the assumption holds, most of the residual values will lie near zero in the histogram, and the points in the Q-Q plot will lie close to the reference line.

Normality in the linear regression can be fixed by:

  • If there are outliers, remove them, or use the least-squares method.
  • If possible, add more observations.
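
A minimal sketch of both plots, assuming statsmodels, scipy, and matplotlib, with residuals taken from an illustrative OLS fit:

```python
# Minimal sketch: histogram and Q-Q plot of regression residuals.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)

residuals = sm.OLS(y, sm.add_constant(x)).fit().resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=20)                      # histogram: should look bell-shaped around 0
axes[0].set_title("Histogram of residuals")
stats.probplot(residuals, dist="norm", plot=axes[1])  # Q-Q plot: points should hug the line
plt.show()
```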

Also Read: Normal Distribution – Definition, and Example

6. Homoscedasticity

Homoscedasticity means that the variance of the error term should remain constant across the fitted values of the model. It can be checked with a scatter plot of the residuals against the fitted values. Heteroscedasticity may be caused by the presence of outliers or an incorrectly specified model.
If the data do not satisfy the assumption of homoscedasticity, it can often be fixed by applying a log or square-root transformation to the dependent variable.
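
A minimal sketch combining the visual check (residuals against fitted values) with the Breusch-Pagan test from statsmodels; the data are synthetic, and a small p-value would indicate heteroscedasticity:

```python
# Minimal sketch: residuals-vs-fitted plot and Breusch-Pagan test for homoscedasticity.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

plt.scatter(fit.fittedvalues, fit.resid)           # spread should stay roughly constant
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")   # small p-value -> evidence of heteroscedasticity
```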

Conclusion

In this article, we have discussed some of the important assumptions of the linear regression algorithm that must be satisfied before fitting a model; otherwise, the predicted values may be insignificant.
Hope this article is useful to you.

