The Game of Increasing R-squared in a Regression Model

20 July 2024

5

This article was published as a part of the Data Science Blogathon.

Introduction

After building a Machine Learning model, the next and very crucial step is to evaluate the model performance on the unseen or test data and see how good our model is against a benchmark model.

The evaluation metric to be used would depend upon the type of problem you are trying to solve —whether it is a supervised, unsupervised problem, or a mix of these (like semi-supervised), and if it is a classification or a regression task.

In this article, we will discuss two important evaluation metrics used for regression problem statements and we will try to find the key difference between them and learn why these metrics are preferred over Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) for a regression problem statement.

Some Important questions which we are trying to understand in this article are as follows:

👉 The Game of increasing R-squared (R²)

👉 Why we go for adjusted-R²?

👉 When to use which from R² and adjusted-R²?

Let’s first understand what exactly is R Squared?

R-squared, which sometimes is also known as the coefficient of determination, defines the degree to which the variance in the dependent variable (target or response) can be explained by the independent variable (features or predictors).

Let us understand this with an example — say the R² value for a regression model having Income as an Independent variable (predictor) and, Expenditure as a dependent variable (response) comes out to be 0.76.

– In general terms, this means that 76% of the variation in the dependent variable is explained by the independent variables.

But for our defined regression problem statement, it can be understood as,

👉 76% variability in expenditure is associated or related with the regression equation and 24% variations are due to other factors.

👉76% variability in expenditure is explained by its linear relationship with income while 24% variations are uncounted for.

👉 76% variation in expenditure due to variation in income while we can’t say anything about the 24% variations. God knows better about it.

Image Source: link

Important points about R Squared

👉 Ideally, we would want the independent variables to explain the complete variations in the target variable. In that scenario, the R² value would be equal to 1. Thus we can say that the higher the R² value, the better is our model.

👉 In simple terms, the higher the R², the more variation is explained by your input variables, and hence better is your model. Also, the R² would range from [0,1]. Here is the formula for calculating R²–

The R² is calculated by dividing the sum of squares of residuals from the regression model (given by SS_RES) by the total sum of squares of errors from the average model (given by SS_TOT) and then subtracting it from 1.

Fig. Formula for Calculating R²

Image Source: link

Drawbacks of using R Squared :

👉 Every time if we add X_i(independent/predictor/explanatory) to a regression model, R² increases even if the independent variable is insignificant for our regression model.

👉 R² assumes that every independent variable in the model helps to explain variations in the dependent variable. In fact, some independent variables don’t help to explain the dependent variable. In simple words, some variables don’t contribute to predicting the dependent variable.

👉 So, if we add new features to the data (which may or may not be useful), the R² value for the model would either increase or remain the same but it would never decrease.

So, to overcome all these problems, we have adjusted-R² which is a slightly modified version of R².

Let’s understand what is Adjusted R²?

👉 Similar to R², Adjusted-R² measures the proportion of variations explained by only those independent variables that really help in explaining the dependent variable.

👉 Unlike R², the Adjusted-R² punishes for adding such independent variables that don’t help in predicting the dependent variable (target).

Let us mathematically understand how this feature is accommodated in Adjusted-R². Here is the formula for adjusted R²

Fig. Formula for Calculating adjusted-R²

Image Source: link

Let’s take an example to understand the values changes of these metrics in a Regression model

For Example,

Independent Variable	R²	Adjusted-R²
X₁	67.8	67.1
X₂	88.3	85.6
X₃	92.5	82.7

In this example for a regression problem statement, we observed that the independent variable X₃ is insignificant or it doesn’t contribute to explain the variation in the dependent variable. Hence, adjusted-R² is decreased because the involvement of in-significant variable harms the predicting power of other variables that are already included in the model and declared significant.

R² vs Adjusted-R²

👉 Adjusted-R² is an improved version of R².

👉 Adjusted-R² includes the independent variable in the model on merit.

👉 Adjusted-R² < R²

👉 R² includes extraneous variations whereas adjusted-R² includes pure variations.

👉 The difference between R² and adjusted-R² is only the degrees of freedom.

The Game of Increasing R²

Sometimes researchers tried their best to increase R² in every possible way.

👉 One way to include more and more explanatory (independent) variables in the model because:

R² is an increasing function of the number of independent variables i.e, with the inclusion of one more independent variable R² is likely to increase or at least will not decrease.

When to use which?

Comparing models using R²

Comparing two models just based on R² is dangerous as,

👉 Models having a different number of independent variables may have an equal value of R².

👉 Total sample size and respective degrees of freedom are ignored.

Hence, there is a likelihood that one would choose the wrong model.

Problem solved by adjusted-R²

To compare two different models, or choose the best model, the adjusted-R²is used because:

👉 It is adjusted for the respective degree of freedom.

👉 It takes into account the total sample size and number of independent variables.

👉 It is not an increasing function of the number of independent variables.

👉 It only increases if newly independent variables have an impact on the dependent variable.

CONCLUSION:

So, concluding the discussion we say that,

👉 R²can be used to access the goodness of fit of a single model whereas,

👉Adjusted-R² is used to compare two models and to see the real impact of newly added independent variables.

👉 Adjusted-R² should be used while selecting important predictors for the regression model.

End Notes

Thanks for reading!

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.

The media shown in this article on Top Machine Learning Libraries in Julia are not owned by Analytics Vidhya and is used at the Author’s discretion.

The Game of Increasing R-squared in a Regression Model

Introduction