How to Calculate Adjusted R-Squared
Ever wondered how well your regression model truly fits your data, especially when multiple variables come into play? Adjusted R-squared is a metric that goes beyond traditional R-squared to offer deeper insight. But what makes it different from R-squared? This article covers it all.
In the previous article, we discussed how to calculate the R-squared value for a machine learning model. In this article, we will discuss another evaluation metric, adjusted R-squared, and walk through some examples to see why we need it.
But before that, let's have a quick introduction to R-squared.
Table of Contents
- What is R-Squared?
- Why Do We Need Adjusted R-Squared?
- How to Calculate the Adjusted R-Squared?
- Difference Between R-Squared and Adjusted R-Squared
What is R-Squared?
R-squared, also known as the coefficient of determination, describes the proportion of the variance in a dependent variable that is explained by one or more independent variables in a linear regression model.
It is calculated by dividing the explained variation by the total variation, or equivalently as 1 − (Unexplained Variation / Total Variation).
Mathematical Formula of R-Squared
R-Squared = 1 − (SSR / SST)
where,
SSR: Sum of Squared Residuals (the sum of squared errors)
SST: Total Sum of Squares (the sum of squared deviations from the mean)
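To make the formula concrete, here is a minimal sketch (with made-up numbers, purely for illustration) that computes R-squared directly from SSR and SST:
import numpy as np

# Hypothetical observed values and model predictions (made-up numbers)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ssr = np.sum((y_true - y_pred) ** 2)         # Sum of Squared Residuals
sst = np.sum((y_true - y_true.mean()) ** 2)  # Total Sum of Squares
r_squared = 1 - ssr / sst
print(r_squared)  # ≈ 0.991, since the predictions track the data closely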
Note:
- The value of R-squared ranges between 0 and 1.
- 0 means that the model doesn’t explain any variation in the dependent variable.
- 1 means that the model explains all the variations.
Limitations of R-Squared
- The value of R-squared increases as more independent variables are added, regardless of whether they are relevant. This can lead to overfitting.
- It is not the best metric for comparing models, especially when the models have a different number of predictors.
- A high value of r-squared doesn’t necessarily mean the model is adequate.
- R-squared is highly sensitive to outliers. A few outliers can significantly decrease the R-squared value, as the sketch below illustrates.
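The sketch below (synthetic data, not the article's dataset) shows how a single extreme outlier can pull R-squared down:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

np.random.seed(1)
x = np.random.uniform(0, 10, 50).reshape(-1, 1)
y = 2 * x.ravel() + np.random.normal(0, 1, 50)

# Fit on clean data
model = LinearRegression().fit(x, y)
print("R-squared without outlier:", r2_score(y, model.predict(x)))

# Add one extreme outlier, far from the underlying line, and refit
x_out = np.vstack([x, [[5.0]]])
y_out = np.append(y, 100.0)
model_out = LinearRegression().fit(x_out, y_out)
print("R-squared with outlier:", r2_score(y_out, model_out.predict(x_out)))
The second score is typically far lower, because the single bad point inflates the squared residuals much faster than it inflates the total variation.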
Why Do We Need Adjusted R-Squared?
As we mentioned earlier, the value of R-squared increases whenever new variables are added, whether or not they are relevant. To overcome this, the adjusted R-squared metric was introduced, providing a more accurate measure of the model's goodness of fit.
As the name suggests, adjusted R-squared adjusts for the number of predictors in the model, ensuring that only significant predictors enhance its value.
It penalizes the model for the inclusion of irrelevant predictors. This makes it a more robust metric, especially when evaluating models with many predictors.
Adjusted R-Squared Formula
Adjusted R-Squared = 1 − [(1 − R²) (n − 1) / (n − k − 1)]
where,
n: number of data points
k: number of independent variables
R²: R-squared value
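As a quick worked example with hypothetical values, suppose a model with k = 2 predictors is fit to n = 100 data points and achieves R² = 0.85:
Adjusted R-Squared = 1 − [(1 − 0.85) (100 − 1) / (100 − 2 − 1)]
= 1 − [0.15 × 99 / 97]
≈ 1 − 0.1531
≈ 0.8469
The adjusted value is slightly lower than the raw R², reflecting the small penalty for using two predictors.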
Interpretation of Adjusted R-Squared Formula
- If R-squared does not increase significantly when a new independent variable is added, the adjusted R-squared value will decrease.
- If R-squared increases significantly when a new independent variable is added, the adjusted R-squared value will also increase.
Note: It is recommended to use adjusted R-squared when the regression model contains multiple variables, as it allows us to compare models with different numbers of independent variables.
By now, you should have a clear understanding of what adjusted R-squared is, its formula, and why it is needed over R-squared to evaluate the performance of a machine learning model.
How to Calculate the Adjusted R-Squared?
Problem Statement: Create a dataset, build two linear regression models (a simple linear regression model and a multiple regression model), and then calculate the R² and adjusted R² values in both cases.
Solution
Step-1: Create a Sample dataset
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create a synthetic dataset
np.random.seed(0)
n_samples = 100
StudyHours = np.random.uniform(1, 10, n_samples)
Extracurricular = np.random.randint(0, 5, n_samples)
FinalExamScores = 50 + 3 * StudyHours + 2 * Extracurricular + np.random.normal(0, 5, n_samples)

# Create a DataFrame from the data
data = pd.DataFrame({'StudyHours': StudyHours, 'Extracurricular': Extracurricular, 'FinalExamScores': FinalExamScores})
data.head()
Output
Step-2: Split the data into predictors (X) and target (Y)
# Split the data into predictors (X) and target (y)
X = data[['StudyHours', 'Extracurricular']]
y = data['FinalExamScores']
Step-3: Create a Linear Regression Model with one Predictor
# Create and fit a simple linear regression model with one predictor (StudyHours)
model_simple = LinearRegression()
model_simple.fit(X[['StudyHours']], y)
y_pred_simple = model_simple.predict(X[['StudyHours']])

# Calculate R-squared for the simple model
mse_simple = mean_squared_error(y, y_pred_simple)
r_squared_simple = 1 - (mse_simple / np.var(y))

# Calculate Adjusted R-squared for the simple model
n = len(y)
p_simple = 1  # Number of predictors in the simple model
adjusted_r_squared_simple = 1 - (1 - r_squared_simple) * (n - 1) / (n - p_simple - 1)

# Print R-squared and Adjusted R-squared values for the simple model
print("Simple Model:")
print(f"R-squared: {r_squared_simple:.4f}")
print(f"Adjusted R-squared: {adjusted_r_squared_simple:.4f}\n")
Output
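As an optional sanity check (not part of the original steps), scikit-learn's built-in r2_score should agree with the manual calculation above; this snippet reuses y and y_pred_simple from Step-3:
from sklearn.metrics import r2_score

# Should print the same value as r_squared_simple computed manually above
print(f"r2_score (sklearn): {r2_score(y, y_pred_simple):.4f}")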
Step-4: Create a Linear Regression Model with Two Predictors
# Create and fit a more complex linear regression model with two predictors (StudyHours and Extracurricular)
model_complex = LinearRegression()
model_complex.fit(X, y)
y_pred_complex = model_complex.predict(X)

# Calculate R-squared for the complex model
mse_complex = mean_squared_error(y, y_pred_complex)
r_squared_complex = 1 - (mse_complex / np.var(y))

# Calculate Adjusted R-squared for the complex model
p_complex = 2  # Number of predictors in the complex model
adjusted_r_squared_complex = 1 - (1 - r_squared_complex) * (n - 1) / (n - p_complex - 1)

print("Complex Model:")
print(f"R-squared: {r_squared_complex:.4f}")
print(f"Adjusted R-squared: {adjusted_r_squared_complex:.4f}")
Output
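If you would rather not compute these by hand, the statsmodels library (assumed installed; it is not used elsewhere in this article) reports both metrics directly. A sketch reusing X and y from Step-2:
import statsmodels.api as sm

# OLS with an explicit intercept term; statsmodels exposes rsquared and rsquared_adj
X_const = sm.add_constant(X)
ols_results = sm.OLS(y, X_const).fit()
print(f"R-squared (statsmodels): {ols_results.rsquared:.4f}")
print(f"Adjusted R-squared (statsmodels): {ols_results.rsquared_adj:.4f}")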
Explanation
From the output above, we can see that both the R-squared and adjusted R-squared values increase significantly with the addition of one more variable ("Extracurricular"). This implies that the added variable is genuinely correlated with the target variable.
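To see the penalty work in the opposite direction, here is a sketch (continuing with the variables from the steps above, plus a hypothetical "Noise" column) that adds a purely random predictor. R-squared still inches up, while adjusted R-squared should typically stay flat or drop:
# Add an irrelevant random predictor and refit the model
data['Noise'] = np.random.normal(0, 1, n_samples)
X_noise = data[['StudyHours', 'Extracurricular', 'Noise']]
model_noise = LinearRegression().fit(X_noise, y)
y_pred_noise = model_noise.predict(X_noise)

# R-squared and Adjusted R-squared with three predictors
r_squared_noise = 1 - (mean_squared_error(y, y_pred_noise) / np.var(y))
p_noise = 3  # Number of predictors, now including the irrelevant one
adjusted_r_squared_noise = 1 - (1 - r_squared_noise) * (n - 1) / (n - p_noise - 1)
print(f"R-squared with noise predictor: {r_squared_noise:.4f}")
print(f"Adjusted R-squared with noise predictor: {adjusted_r_squared_noise:.4f}")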
Difference Between R-Squared and Adjusted R-Squared
| Parameter | R-Squared | Adjusted R-Squared |
|---|---|---|
| Definition | Proportion of variance in the dependent variable explained by the independent variable(s). | R-squared adjusted for the number of predictors in the model. |
| Value Range | Between 0 and 1. | Can be negative, but typically between 0 and 1. |
| Response to Adding Predictors | Always increases or remains the same. | Can increase or decrease based on the usefulness of the added predictor. |
| Purpose | Measures overall goodness of fit. | Measures goodness of fit while accounting for model complexity. |
| Calculation | R-Squared = 1 − (SSR / SST) | Adjusted R-Squared = 1 − [(1 − R²) (n − 1) / (n − k − 1)] |
| Best for | Simple linear regression with one predictor. | Multiple regression models with several predictors. |
| Interpretation | Higher value indicates more variance explained by the model. | Higher value indicates a better fit, especially when comparing models with different numbers of predictors. |