Basic predictions are one of the most fundamental aspects of data science. They provide insight that enables functional units such as marketing, logistics, and finance to take action on the data. Regression models are among the most basic prediction tools. A regression model takes a number of “predictor” variables, i.e. variables that will be used to predict some outcome, and one or more “response” variables, i.e. the outcomes being predicted. While the technique is very powerful, using it improperly can lead to disastrous results.
Many basic statistical tools, such as SAS, R, and Excel, can run regressions, and you should find it fairly easy to get a result out of any of them. However, the more powerful tools, such as SAS and R, make it much easier to check whether a regression equation is actually valid and trustworthy.
While many factors go into understanding whether a regression model is valid to use, here are a few quick tips you can use as a “white glove” test to see whether your regression model is valid.
1. Response variable should be continuous
When you create your regression model, the outcome variable, or response, should be continuous. A continuous variable is purely numeric, and the values between any two numbers have meaning. For example, if a person’s height can be 67 or 68 inches, then 67.5 inches also has meaning. By contrast, for a variable such as Male or Female coded as 0 or 1, the midpoint 0.5 has no meaning associated with it. Therefore, the response variable in your regression equation should be continuous.
There are other methods to use if the response variable is not continuous, such as logistic regression; however, those are beyond the scope of this article.
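To see why a 0/1 response causes trouble, here is a minimal Python sketch that fits an ordinary straight-line regression to a yes/no outcome. The data, the variable names (`hours`, `renewed`), and the `fit_line` helper are all made up for illustration; they are assumptions, not output from any particular tool.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: hours of product use (predictor) and whether the
# customer renewed (response coded 0 = no, 1 = yes).
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
renewed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

intercept, slope = fit_line(hours, renewed)

# Predict the "renewed" value for a few usage levels.
for h in (0, 5, 15):
    prediction = intercept + slope * h
    print(f"hours={h:>2}  predicted renewed = {prediction:.2f}")
```

With this made-up data, the fitted line predicts a value below 0 at zero hours and above 1 at fifteen hours, and neither number can be read as a yes or a no.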
2. Regression explanatory power
While the equation you derive from your regression may appear to be good, you should ensure that it has strong enough explanatory power, i.e. that enough of the variance in the response variable is explained by the predictor (x) variables. You check this by examining the R² statistic. When you run a model, your software application should show you the R² or the Adjusted R². The difference between the two is simple: the Adjusted R² takes the number of predictor variables (x’s) into account and penalizes you for each additional one. Therefore, the Adjusted R² is really the number you want, because it gives a more conservative estimate.
The Adjusted R², like the regular R², is read as a percentage of variance explained, so in general the higher the percentage, the better the model. You ultimately have to be the judge of whether the model suits your needs, based on the Adjusted R². In some cases, explanatory power of 35% is good, whereas in other cases you may require a more robust model that explains 80% of the variance. In either case, your decision should be informed by your needs and by the variables in question.
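If you want to see where the penalty comes from, the usual textbook formula is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors. The short Python sketch below applies it to made-up numbers; the function names and the data are illustrative assumptions, and in practice your statistical package reports both values for you.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical observed responses and the values a fitted model predicted.
actual = [10.0, 12.0, 15.0, 19.0, 24.0, 26.0, 31.0, 35.0]
predicted = [11.0, 12.5, 16.0, 18.0, 22.0, 27.5, 30.0, 35.0]

r2 = r_squared(actual, predicted)
print(f"R-squared: {r2:.4f}")

# Same fit quality, but pretend it took 1, 3, or 5 predictors to get there.
for k in (1, 3, 5):
    adj = adjusted_r_squared(r2, len(actual), k)
    print(f"predictors={k}  adjusted R-squared: {adj:.4f}")
```

For the same R², the adjusted value drops as the number of predictors grows, which is the "penalty" described above.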
3. Inclusion of important variables
When you form a regression equation, statistical packages will let you know which variables are important. They do this by evaluating the t-statistic of each coefficient and reporting the associated p-value. If the p-value is less than .05, the variable is statistically significant and should be included in any prediction. Including variables in a model that aren’t statistically significant can result in incorrect conclusions about what is important for predicting the response.
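As a rough illustration of what the software is doing, the Python sketch below computes the t-statistic and two-sided p-value for the slope of a one-predictor regression. Everything in it is an assumption made for the example: the data, the variable names (`ad_spend`, `shelf_slot`, `sales`), and the helper functions, which integrate the t distribution numerically instead of calling a statistics library. Treat it as a sketch of the idea and rely on your statistical package for real work.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value, using Simpson's rule on the t density (even `steps`)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_zero_to_t)


def slope_t_test(x, y):
    """Return (slope, t_statistic, p_value) for a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2  # two estimated parameters: intercept and slope
    std_error = math.sqrt(residual_ss / df / sxx)
    t_stat = slope / std_error
    return slope, t_stat, two_sided_p_value(t_stat, df)


# Hypothetical data: sales, a predictor that drives them, and one that does not.
sales = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
ad_spend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shelf_slot = [5, 3, 8, 1, 9, 2, 7, 4, 6, 10]

for name, predictor in (("ad_spend", ad_spend), ("shelf_slot", shelf_slot)):
    slope, t_stat, p_value = slope_t_test(predictor, sales)
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{name}: slope={slope:.3f}, t={t_stat:.2f}, p={p_value:.4f} ({verdict})")
```

With this made-up data, `ad_spend` falls below the .05 threshold and `shelf_slot` does not, so only the former would be kept under the rule above.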