Beware of Correlations

Beware of correlations because they can bite! Actually, correlations are a great tool for measuring the relationship between one variable and another, and they are widely used to understand the strength and direction of that relationship. However, there are important differences between correlation methods that every data analyst should understand. Knowing which method to use and how to interpret the result is essential for companies that rely on correlation analysis in their data science or analytics practice.

If you took a basic statistics course in college, at the graduate or undergraduate level, you were probably told, quite often I assume, that correlation doesn’t imply causation. This is most certainly true. Companies that use correlation are nevertheless looking for causes, so correlation becomes the first step toward identifying cause and effect, and sometimes the analysis ends there. Acceptable? Probably, but not always. Companies should be cognizant of what a correlation actually means and should look for the true cause of an effect by examining the data more closely and, where possible, running controlled experiments, although this is not always feasible in a corporate setting. Additionally, many people in analytics groups today may not realize that there are different correlation methods for different types of data.

When people talk about correlation, they are generally referring to the Pearson correlation. The CORREL function in Excel, for example, computes the Pearson correlation. This method comes with a number of underlying assumptions. It measures the strength of a linear relationship, so the relationship between the variables is assumed to be linear, and the data are usually expected to be approximately normally distributed, which matters most when you test the coefficient for significance. There are ways to check the normality of the data, but we won’t address that in this article. Basically, if your numbers are continuous, i.e. the distance between values is meaningful, as with daily stock prices or the weight of an individual, then the Pearson method will likely work well for you.
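To make the linearity assumption concrete, here is a minimal sketch of the Pearson coefficient written out by hand in Python. The function name `pearson_r` and both small datasets are made up for illustration; in practice you would more likely call a library routine.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]          # deviations from the mean of x
    dy = [b - mean_y for b in y]          # deviations from the mean of y
    cov = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: y is an exact linear function of x.
x = [1, 2, 3, 4, 5]
linear_y = [2 * v + 1 for v in x]
print(round(pearson_r(x, linear_y), 6))      # 1.0

# Hypothetical data: y depends on x exactly, but not linearly.
sym_x = [-2, -1, 0, 1, 2]
curved_y = [v * v for v in sym_x]
print(round(pearson_r(sym_x, curved_y), 6))  # 0.0
```

The second result is the one to remember: a coefficient of zero does not mean the variables are unrelated, only that there is no linear relationship between them.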

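If you are using a Likert scale, or if you have some ranking of information such as “How likely are you to recommend our product on a scale of 1-10”, then you are working with ordinal values: the order of the responses is meaningful, but the distance between adjacent points is not necessarily equal. For this type of data, you would want your correlation method to be Spearman, which is computed on the ranks of the values, and not Pearson.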

There are a number of resources available that will show you how to compute both coefficients in Excel, R, and other statistical packages. It’s important that you understand your data in order to choose the appropriate method. In some cases the two methods give nearly the same result, but they can also be very different. More importantly, the right method will provide an analyst with the correct effect size. Making decisions based on correlation analysis is great, but as this brief overview shows, it’s important to understand the different methods and when each one applies in order to make the appropriate decision.