The implication of R-Squared

The R squared value, also called the coefficient of determination, measures how well a regression equation fits the data points. More specifically, it is the proportion of the variance in the dependent variable that is explained by the independent variables in the regression equation. The value of R squared can change with the inclusion or removal of variables in the regression model. R squared values are typically used as a measure of the effectiveness of a model. As a rough rule of thumb, a high R squared value (say, anything above 55%) can be an indicator of a capable model, although what counts as high depends heavily on the field and the dataset.

However, R squared is not meant to reflect the reliability of the statistical model. It merely reflects how much of the variation in the dependent variable is accounted for by the regression line; it does not count how many data points lie on that line. Hence, a model with an R squared of 0.75 explains 75% of the variance in the dependent variable, with the remaining 25% left in the residuals.
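As a minimal sketch of that definition, the snippet below computes R squared as one minus the ratio of the residual sum of squares to the total sum of squares. The five data points are made up for illustration and are not taken from any dataset discussed here.

```python
# Toy data (hypothetical, for illustration only): y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares fit for a single predictor.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * xi for xi in x]

# R squared = 1 - (residual sum of squares) / (total sum of squares).
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1.0 - ss_res / ss_tot

# Count the points that sit exactly on the fitted line.
points_on_line = sum(1 for yi, fi in zip(y, fitted) if abs(yi - fi) < 1e-9)

print(f"R squared: {r_squared:.4f}")
print(f"points exactly on the line: {points_on_line} of {n}")
```

In this toy case the R squared is above 0.99 even though none of the five points lies exactly on the fitted line, which is why the statistic is better read as explained variance than as a count of points.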

The number of predictor variables in a regression equation affects the value of the R squared statistic. If another predictor variable is added to a model with 4 variables, the R squared cannot go down and will almost always go up, even if the new variable is unrelated to the outcome. The adjusted R squared acts as a buffer against this inflationary effect by penalizing each additional predictor, and it is never higher than the R squared. Hence, both the R squared and the adjusted R squared indicate how much of the variance the model explains, with the adjusted R squared accounting for the number of predictor variables.
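The penalty can be sketched with the usual formula for a model that includes an intercept: adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The function name and the numbers below are illustrative assumptions, not results from a fitted model.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared for a regression model that includes an intercept."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof

# Hypothetical scenario: R squared held fixed at 0.75 with 30 observations,
# while the number of predictors grows.
for p in (1, 4, 5, 10):
    print(p, round(adjusted_r_squared(0.75, 30, p), 4))
```

With the R squared held fixed, the adjusted value falls as predictors are added. A new variable therefore has to raise the R squared by enough to offset its penalty before the adjusted R squared improves.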

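It should be noted that the R squared does not hold up well in some cases. For example, a dataset pertaining to stocks will show a time series effect: successive observations are correlated and often trend together, which can inflate the R squared. In such a situation, other measures such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) should be used to compare models.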
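For reference, here is a rough sketch of how the AIC and the BIC can be computed for a least-squares model with Gaussian errors, using the common form that drops additive constants. The function name, the residual sums of squares, and the parameter counts are hypothetical, and conventions for counting parameters differ between software packages.

```python
import math

def gaussian_aic_bic(rss, n_obs, n_params):
    """AIC and BIC for a least-squares fit with Gaussian errors (constants dropped)."""
    fit_term = n_obs * math.log(rss / n_obs)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * math.log(n_obs)
    return aic, bic

# Hypothetical comparison on 100 observations: the larger model has a slightly
# smaller residual sum of squares (and so a higher R squared), but uses more parameters.
small_model = gaussian_aic_bic(rss=120.0, n_obs=100, n_params=3)
large_model = gaussian_aic_bic(rss=118.0, n_obs=100, n_params=6)

print("small model (AIC, BIC):", small_model)
print("large model (AIC, BIC):", large_model)
```

Lower values are better for both criteria. In this made-up comparison the smaller model is preferred on both, even though the larger model would report the higher R squared.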

It is easy for the R squared to be inflated and unreliable, as with time series data or a model with a large number of variables. Hence, it is necessary to understand the nature of the dataset and the model before attempting to interpret the R squared.

The nature of the dataset is one of the foremost factors that affect the value of the R squared. A dataset pertaining to psychological behavior is bound to contain a large amount of variation that is volatile and hence unaccounted for by the line of fit, which leaves large residuals and a low R squared. On the other hand, data from the assembly line of a manufacturing unit that emphasizes quality and consistency would show a high R squared value. In this case the residuals would be small, and the observations would show a stronger tendency to align with the line of fit. For this reason, it is of foremost importance that we understand the nature of the data before interpreting the value of the R squared or the adjusted R squared. The adjusted R squared only attempts to counter the effect of additional variables, and not the inherent nature of the dataset.
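The effect of inherent noise can be illustrated with a small simulation. The sketch below uses made-up data: the same underlying straight line is observed once with little noise and once with a lot of noise. The helper name, the seed, and the noise levels are arbitrary assumptions.

```python
import random

def simple_r_squared(x, y):
    """R squared of a one-predictor least-squares fit (the squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [float(i) for i in range(1, 101)]

# Same true relationship (y = 2 * x), observed with small and with large noise.
consistent = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
volatile = [2.0 * xi + rng.gauss(0.0, 100.0) for xi in x]

r2_consistent = simple_r_squared(x, consistent)
r2_volatile = simple_r_squared(x, volatile)

print(f"low-noise data:  R squared = {r2_consistent:.3f}")
print(f"high-noise data: R squared = {r2_volatile:.3f}")
```

The model form is identical in both runs. Only the noise level differs, so the gap between the two R squared values reflects the data and not the quality of the modelling.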

Let us study the R squared and the adjusted R squared with the help of two separate regression models built on two datasets from the ‘datasets’ package for R. The first dataset, called ‘cars’, records the speeds of 50 cars and the distances they took to stop; the second, called ‘women’, consists of 15 paired observations of the heights and weights of women.

A simple regression model predicting a car's speed from its stopping distance is fitted to the first dataset, and another simple regression, predicting height from weight, is fitted to the second. These two models help us highlight and contrast the R squared values for the two datasets.

We see that the regression fitted to the cars dataset has an R squared of about 0.65 and an adjusted R squared of about 0.64. The second regression model, using the women dataset, has an R squared and an adjusted R squared of about 99%. It must be noted here that the adjusted R squared and the R squared are nearly identical in each case because each model has a single independent variable, so the adjustment penalty is small.
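The figures for the women dataset can be reproduced without R. The sketch below hard-codes the 15 height and weight pairs as they appear in R's ‘women’ dataset (heights in inches, weights in pounds) and fits height on weight by ordinary least squares. The variable names are illustrative.

```python
# The 15 rows of R's 'women' dataset: height in inches, weight in pounds.
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]

n = len(height)
mean_w = sum(weight) / n
mean_h = sum(height) / n

# Least-squares fit of height on weight (one predictor plus an intercept).
sww = sum((w - mean_w) ** 2 for w in weight)
swh = sum((w - mean_w) * (h - mean_h) for w, h in zip(weight, height))
slope = swh / sww
intercept = mean_h - slope * mean_w

fitted = [intercept + slope * w for w in weight]
ss_res = sum((h - f) ** 2 for h, f in zip(height, fitted))
ss_tot = sum((h - mean_h) ** 2 for h in height)

r_squared = 1.0 - ss_res / ss_tot
# Adjusted R squared with p = 1 predictor.
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - 1 - 1)

print(f"R squared:          {r_squared:.4f}")
print(f"adjusted R squared: {adj_r_squared:.4f}")
```

Both values round to 0.99, with the adjusted figure slightly lower than the plain R squared.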

The values of the R squared are driven largely by the nature of the datasets. Since a car's speed only partly determines its stopping distance, the line of fit accounts for only about 65% of the variance in the observations. On the other hand, since there is a strong, nearly linear connection between the heights and the weights of the women, the model shows an R squared and an adjusted R squared of about 99%; the data align closely with the line of fit.

This does not mean that stopping distance predicts speed correctly 65% of the time; it means that the linear relationship between the two accounts for about 65% of the variation in speed across the observations.

Data analysts should use the R squared as a measure of how much of the variance in the dataset the model explains, and not as a measure of the predictive capability of the model. Read this way, the R squared can be a helpful aid in understanding the nature of the dataset and the regression model.