Understanding Regression – Part 2

The last article focused on what regression is and how its results can be interpreted. It mentioned that a number of assumptions must hold for the model to be valid. These assumptions matter because they relate to the reasons a regression line works well as a prediction. They are based on the residuals, which are the differences between the values of the dependent variable predicted by the regression and the actual y values in the data.

Residual analysis is a critical component of ensuring that any regression model is adequate for use.

For example, the model in Part 1 predicted a y value based on the formula 27.5 + 1.4 * age. The difference between the value predicted by the formula for a given age and the actual value for that age in the dataset is known as the residual.
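As a small illustration of that definition, the sketch below computes residuals from the Part 1 formula. The three (age, y) observations are hypothetical values made up for this example and are not taken from the Part 1 dataset. The sign convention used here (actual minus predicted) is the common one, but it is an assumption, since the article does not state a direction.

```python
# Coefficients from the Part 1 model quoted above: y = 27.5 + 1.4 * age
INTERCEPT = 27.5
SLOPE = 1.4


def predict(age):
    """Predicted y for a given age, using the Part 1 formula."""
    return INTERCEPT + SLOPE * age


def residual(age, actual_y):
    """Residual as actual minus predicted (assumed sign convention)."""
    return actual_y - predict(age)


# Hypothetical observations (age, actual y), invented for illustration only.
observations = [(20, 57.0), (30, 68.0), (40, 83.5)]

residuals = [residual(age, y) for age, y in observations]

for (age, y), r in zip(observations, residuals):
    print(f"age={age}  actual={y}  predicted={predict(age):.1f}  residual={r:+.1f}")
```

A positive residual means the actual value sits above the regression line, and a negative one means it sits below.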

There are 4 assumptions for regression:

  1. Mean, or average, of the residuals is zero
  2. Residuals are normally distributed, around the regression line
  3. Variance of the residuals is constant
  4. Residuals are independent of each other

When we conduct the regression and obtain a model, we need to look at the residuals to determine how good the model actually is. The smaller the residuals, the better our model is at predicting the value of y. This leads to the first assumption: the mean of the errors is 0. If the mean of the residuals is zero, or very close to it, the regression line is cutting through the middle of the data points. Some of the residuals will be above the line and some will be below it, so when we take the average, we should obtain 0.
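The sketch below checks the first assumption on a tiny example. It fits a line by ordinary least squares using the textbook formulas and then averages the residuals. The ages and y values are hypothetical, and `fit_line` is a helper written for this illustration rather than part of any library. Note that a least-squares line fitted with an intercept produces residuals that average to zero by construction, so this check mainly confirms that the fit was done properly.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, invented for illustration only.
ages = [22, 25, 31, 38, 44, 52]
ys = [58.1, 63.9, 70.2, 81.5, 88.0, 101.3]

intercept, slope = fit_line(ages, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(ages, ys)]
mean_residual = mean(residuals)

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * age")
print(f"mean of residuals: {mean_residual:.6f}")
```

Any value printed for the mean should differ from zero only by floating-point rounding.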

Next, we should check to see if the residuals are normally distributed around zero. Since the expected value, or mean, of the residuals is zero, we hope to see a balance of residuals, some positive and some negative. In addition to this balance, we want the majority of the residuals to be near the mean, i.e. normally distributed.
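One rough, informal way to look at this is to count how many residuals fall within one and two standard deviations of their mean. For a normal distribution, about 68% of values fall within one standard deviation and about 95% within two. The sketch below does that count on a hypothetical set of residuals. It is an eyeball check under those assumptions, not a formal normality test, and `share_within` is a helper name invented for this example.

```python
from statistics import mean, stdev


def share_within(residuals, k):
    """Fraction of residuals within k sample standard deviations of their mean."""
    centre = mean(residuals)
    spread = stdev(residuals)
    inside = [r for r in residuals if abs(r - centre) <= k * spread]
    return len(inside) / len(residuals)


# Hypothetical residuals, invented for illustration only.
residuals = [-2.1, -1.4, -0.9, -0.6, -0.3, -0.1, 0.0,
             0.2, 0.4, 0.7, 1.0, 1.3, 1.8]

within_one = share_within(residuals, 1)
within_two = share_within(residuals, 2)

print(f"within 1 standard deviation: {within_one:.0%} (normal: about 68%)")
print(f"within 2 standard deviations: {within_two:.0%} (normal: about 95%)")
```

With so few points the percentages are only indicative. A histogram of the residuals gives a fuller picture.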

Next, the variance of the residuals should be constant. This means that the way the residuals vary around zero should not follow a pattern. In other words, you don't want to see the residuals gradually increasing or decreasing, or following a curved pattern. Such patterns are easy to spot in residual plots.
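A residual plot is the usual tool here, but the same idea can be sketched numerically. The example below sorts residuals by their fitted value and compares the variance of the residuals in the upper half with the variance in the lower half. A ratio near 1 is consistent with constant variance, while a large ratio suggests the residuals fan out. Both data sets are hypothetical, `spread_ratio` is a helper invented for this illustration, and the comparison is a crude heuristic rather than a formal test.

```python
from statistics import variance


def spread_ratio(fitted, residuals):
    """Residual variance for the upper half of fitted values over the lower half."""
    pairs = sorted(zip(fitted, residuals), key=lambda pair: pair[0])
    ordered = [r for _, r in pairs]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    return variance(high) / variance(low)


# Hypothetical fitted values and two hypothetical sets of residuals.
fitted = [50, 55, 60, 65, 70, 75, 80, 85]
steady = [1.0, -1.2, 0.8, -0.9, 1.1, -1.0, 0.9, -0.7]   # similar spread throughout
fanning = [0.2, -0.3, 0.5, -0.6, 1.5, -1.8, 2.6, -3.0]  # spread grows with fitted value

steady_ratio = spread_ratio(fitted, steady)
fanning_ratio = spread_ratio(fitted, fanning)

print(f"steady residuals:  variance ratio = {steady_ratio:.2f}")
print(f"fanning residuals: variance ratio = {fanning_ratio:.2f}")
```

The second set is the kind of gradually increasing pattern described above, and its ratio comes out far larger than 1.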

Finally, we must check that the residuals are independent. This means that they are not correlated with each other. Correlation among the residuals is known as serial correlation. If the errors are correlated, a residual is related to earlier residuals, either the one directly before it or those at some periodic interval. This occurs most commonly in time series models, or in very poorly fitted models.
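To make the idea of residuals being related to the one directly before them concrete, the sketch below computes the lag-1 autocorrelation of a sequence of residuals, assuming they are listed in the order the observations were collected. Values near 0 are consistent with independence, and values near +1 or -1 suggest serial correlation. Both residual series are hypothetical, and `lag1_autocorrelation` is a helper written for this example.

```python
from statistics import mean


def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one directly before it."""
    centre = mean(residuals)
    deviations = [r - centre for r in residuals]
    numerator = sum(a * b for a, b in zip(deviations[1:], deviations[:-1]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


# Hypothetical residuals in time order, invented for illustration only.
trending = [-2.0, -1.5, -1.0, -0.4, 0.3, 0.9, 1.6, 2.1]   # each value close to the last
scattered = [0.8, 0.5, -1.1, -0.6, 0.9, -0.2, -1.0, 0.7]  # no obvious ordering

trending_r1 = lag1_autocorrelation(trending)
scattered_r1 = lag1_autocorrelation(scattered)

print(f"trending residuals:  lag-1 autocorrelation = {trending_r1:.2f}")
print(f"scattered residuals: lag-1 autocorrelation = {scattered_r1:.2f}")
```

The same function could be adapted to a longer lag to look for the periodic cycles mentioned above.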

In an upcoming article we will show how to run statistical tests for the above assumptions.

Alexander Pelaez, Ph.D., is President of Five Element Analytics, an analytics consulting firm. He has served as a senior executive at a number of firms in healthcare, retail and media. He is also a professor of Information Systems and Business Analytics at Hofstra University.