Are you Correlating Correctly?

Well…are you? In a conversation between a few colleagues, the concept of a correlation came up. I commented that we couldn’t be certain the correlation formula was being applied correctly by our software, and therefore we shouldn’t use it (a typical black box). To my surprise, my colleague asked me “what’s the difference?”, not realizing that applying certain correlation formulas to certain data sets is inappropriate and produces inaccurate results. It prompted me to write this article about the four common ways to correlate data and their practical use.

In this article, all the examples use the South Korean COVID-19 Dataset available on Kaggle.

Following Along

All the images and correlations were produced in Python using Jupyter notebooks. If you plan to replicate any code, here are the imports we will be using:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

The Basics

The concept of a correlation is simple: it’s a bivariate analysis that measures the strength of association between two variables and the direction of that relationship (Magiya 2019). Simply put, it is a way to measure the strength of a relationship between two variables. If a correlation is statistically significant, it means the relationship is unlikely to have occurred by chance.

That is not to say that statistical significance is the same as practical significance. We can take a great lesson from my favorite website and realize that just because we found a statistically significant correlation, it doesn’t mean the finding is meaningful.

We can have a statistically significant finding, but the implications of that finding may have no practical application. The researcher must always examine both the statistical and the practical significance of any research finding (“Tests of Statistical Significance” n.d.).

Correlation Formulas and Their Use

When we correlate data, the shape and type of the data come into play. If the data are normally distributed (parametric) and continuous, then there are two tests we should be taking advantage of.

Pearson’s Correlation Coefficient

Pearson’s correlation is the statistic you have probably heard about the most. It’s intended for linear relationships where both variables are continuous. The formula uses covariation to measure how far the data points fall from the best-fitting line. The coefficient, r, is the covariance of the two variables normalized by their standard deviations, so it draws on the means and standard deviations of the data, which is what makes it so informative.
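To make that concrete, here is a minimal sketch of the calculation on made-up numbers (nothing from the Kaggle data), checked against scipy.stats, using the imports from the top of the article:

# Toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 3.8, 5.4])

# r = cov(x, y) / (std(x) * std(y))
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_scipy, p = stats.pearsonr(x, y)
print(r_manual, r_scipy)  # the two values match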

It does have drawbacks, however. If influential outliers are present, the coefficient can be badly distorted and may even report an erroneous correlation. So, check your data!

In this example we evaluate the relationship between confirmed cases and the total number of deaths. As expected, we see a very strong positive correlation of 0.9386 with a p-value of 2.712e-76 (so it is extremely unlikely this is by chance). I like to reference the p-value with all my correlations because it helps the audience think about the likelihood that the result could be random. The smaller the dataset, the more likely an apparent relationship is due to chance. Take care with this dataset: some of its smaller tables may throw you off.

Speaking of p-values, if you are not familiar with this statistic, check out a short video that explains exactly what this number means.
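To see why small samples deserve extra caution, here is a quick hypothetical simulation (made-up noise, not the Kaggle data): a single run won’t prove the point, but re-running with different seeds shows that small samples routinely produce sizable coefficients from pure noise while large ones rarely do.

rng = np.random.default_rng(42)  # seeded so the run is repeatable

# Pure noise: any "relationship" found here is by chance
for n in (10, 1000):
    r, p = stats.pearsonr(rng.normal(size=n), rng.normal(size=n))
    print(n, round(r, 3), round(p, 3))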

To calculate this example, we will want to take data from Time.csv

time = pd.read_csv('kaggle/Time.csv')

There isn’t much work here if we use seaborn; a single line of code will produce the graphic above.

sns.regplot(data=time, x='confirmed', y='deceased')

We can also use scipy.stats to get the exact coefficient and p-value. A correlation plot without the coefficient is not very informative. If you are going to draw a line on your data, be sure the coefficient is visible so the image proportions don’t skew the perceived results.

stats.pearsonr(time['confirmed'], time['deceased'])

Why are we so easily able to produce a coefficient and visualization with this data? Because this data set has many records, and each record holds a continuous value for both variables. That allows the formula to be applied without much data manipulation. If the data were aggregated, the above code would not be sufficient.

Point-Biserial

This is my personal favorite, as I find it the most practical for day-to-day work. Often I am comparing a dichotomous variable (like a coin flip, or biological gender) against continuous data (like time or money). Point-biserial is mathematically equivalent to Pearson’s r, but several assumptions about the data must hold for the correlation to be valid, and outliers will make it unreliable.
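Since the two are mathematically equivalent, a quick check on made-up numbers shows pointbiserialr and pearsonr returning the same result once the dichotomous variable is coded 0/1:

# Hypothetical data: a 0/1 group label and a continuous outcome
group = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 0])
value = np.array([2.1, 1.8, 2.5, 3.9, 4.2, 3.5, 2.0, 4.0, 3.7, 2.3])

print(stats.pointbiserialr(group, value))  # identical coefficient and p-value
print(stats.pearsonr(group, value))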

Using this coefficient, let’s see if there is a correlation between biological gender and the death rate. Is one sex more at risk than another?

Here we can see a very weak correlation between being male and the deceased count; we would say there is no meaningful relationship. Since the correlation is so small, there’s no point in calculating a p-value.

To start, we need the PatientInfo.csv file because it has the two data points we can use: biological gender and the deceased date.

Let’s take a look at what we are dealing with…

We are going to need to transform this data so that each record carries a deceased count, turning our timestamp variable “deceased_date” into a continuous value. We will also need a binary signal for our plotting package, so male/female will be encoded as well.

patient = pd.read_csv('kaggle/PatientInfo.csv')

First, we need to choose an aggregation step for our data: what will become our rows? We want to produce as many records as we can to avoid the pitfalls of a small data set. Our best option is to count the number of deceased records by date. The “confirmed_date” column contains no nulls, so it will work nicely for us.

patient_gender_group = patient[patient['deceased_date'].isnull()==False].groupby(['confirmed_date','sex'], as_index=False).count()

Here we first filter the data set to only deceased patients. We don’t want to count 0s as values in this set, so we throw out anyone who is alive with patient[patient['deceased_date'].isnull()==False]. Then we use .groupby() to count by the date and gender of the patient. You’ll notice I added as_index=False; this makes a few steps more explicit with visualization, which I prefer for readability. Finally, we use .count() to aggregate the total number of records. You can see how this nicely transforms our data.
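If you are following along without the original screenshots, a quick peek at the result (column names as they appear in the Kaggle file) confirms the shape of the transformation:

# one row per confirmed_date/sex pair; 'deceased_date' now holds
# the count of deceased patients in that group
print(patient_gender_group[['confirmed_date', 'sex', 'deceased_date']].head())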

Next, we should encode our male/female values in a way that both scipy.stats and seaborn will like.

patient_gender_group['sex'] = patient_gender_group['sex'].apply(lambda x: 1 if x=='male' else 0)

Now all that is left is to plot this and calculate our coefficient so it can be annotated on our graph. Remember, a line drawn on a visualization is meaningless without reference to its coefficient.

fig_dims = (8, 5)
fig, ax = plt.subplots(figsize=fig_dims)
sns.regplot(data=patient_gender_group, x='sex', y='deceased_date', ci=None)
ax.set_xlabel('Female                                       Male', fontsize=10)
ax.set_ylabel('# Deceased')
sns.despine(right=True)
# pad the x-axis so the two groups aren't flush with the plot edges
ax.set(xlim=(-1, 2), xticks=[0, 1])

rpb, p = stats.pointbiserialr(patient_gender_group['sex'], patient_gender_group['deceased_date'])
t = str(round(rpb, 4))

# place the coefficient in the upper-left corner of the plot
ax.annotate(f'correlation = {t}', (0, patient_gender_group['deceased_date'].max()))

Spearman and Kendall Coefficients

If our data ends up being non-parametric or ordinal, or the relationship is monotonic rather than strictly linear, then we would use one of:

  • Spearman’s rank formula
  • Kendall’s Tau coefficient

Spearman’s coefficient uses rank order to determine the coefficient. One thing we should note is that this coefficient is resistant to outliers, which is one reason you would use it for non-parametric data. Your data should still have a monotonic relationship, though. This calculation is less desirable for smaller data sets and for sets with multiple tied pairs. It also uses less of the information in the data than Pearson’s r, since it only considers rank order.
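In fact, Spearman’s rho is just Pearson’s r applied to the rank-transformed data, which this toy sketch (made-up numbers) demonstrates:

x = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 9.0])
y = np.array([2.0, 1.0, 5.0, 0.5, 8.0, 9.5])

# rank both variables, then apply Pearson's formula
rho, p = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(rho, r_on_ranks)  # identical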

Finally we get to Kendall’s, which serves essentially the same use cases as Spearman’s. You should prefer Kendall’s in most situations because it has all the benefits of Spearman’s rank formula but handles small data sets and data sets with tied pairs better. Overall it is considered more robust, and it makes fewer assumptions about the data. Unlike Spearman’s rank formula, Kendall’s works by evaluating concordant and discordant pairs, as the sketch below shows.
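To illustrate what a concordant pair is, here is a rough hand-rolled tau-a on toy numbers (made up for illustration), checked against scipy. Note that scipy’s kendalltau computes tau-b by default, which equals tau-a when there are no ties:

from itertools import combinations

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# a pair of observations is concordant when both variables
# move in the same direction, discordant when they disagree
concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    s = np.sign(x2 - x1) * np.sign(y2 - y1)
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) / 2
tau, p = stats.kendalltau(x, y)
print((concordant - discordant) / n_pairs, tau)  # both 0.6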

Let’s evaluate if age impacts the number of patients infected by COVID-19. As we know, the elderly are more susceptible to the virus so I would expect to see higher numbers in the older age groups. Is that true?

Now you may be thinking “but age is a continuous value!” Not in this data set: the age column records the decade bucket each patient falls into (20s, 30s, and so on), not an exact age. While you can be 25.68 years old, these buckets have a natural order but no exact values, which makes the variable ordinal. Since we are pairing an ordinal value with a continuous one, we should be employing the Kendall or Spearman coefficients.

Here we see that the correlations show a weak-to-moderate negative relationship, the opposite of what I suspected (that’s odd, but less so if you consider the South Korean COVID-19 experience). You can see that the numbers are similar, but Kendall’s is more conservative. You will also notice that Spearman’s p-value is higher; with a data set this small, Spearman’s is less reliable, which is another reason we should rely on Kendall’s Tau for this analysis.

To get this visualization and coefficient, we will make use of the PatientInfo.csv data again, since it has an age group. We’ll need to clean that up though.

patient = pd.read_csv('kaggle/PatientInfo.csv')

When we import this data, you’ll notice a trailing “s” at the end of our age bucket.

Let’s remove that.

patient['age'] = patient['age'].astype(str)
patient = patient[patient['age'] != 'nan']
# strip the trailing 's' (e.g. '20s' -> 20) and store the bucket as an int
patient['age2'] = patient['age'].str.replace('s', '').astype(int)

We also need to group by our age group and count the total patients. Here we will use the patient_id as the count variable.

patient_count = patient.groupby('age2').count()
patient_count.reset_index(inplace=True)

You can see this nicely pivots the data for us and puts it into a usable format.

Once again we can call seaborn’s regplot to easily draw the regression line.

fig_dims = (8, 8)
fig, ax = plt.subplots(figsize=fig_dims)
sns.regplot(y='patient_id', x='age2', data=patient_count, color='#039BE5', ci=None)

Now to calculate our coefficients we can use scipy.stats again:

stats.kendalltau(patient_count['age2'], patient_count['patient_id'])
KendalltauResult(correlation=-0.41818181818181815, pvalue=0.08656124739458072)

stats.spearmanr(patient_count['age2'], patient_count['patient_id'])
SpearmanrResult(correlation=-0.45454545454545453, pvalue=0.16014543725525882)

We can also make use of the pandas .corr() method. While I prefer scipy.stats, this has its place; the downside is that you won’t get a p-value. But if you are building a SPLOM (scatterplot matrix), certainly take advantage of it, as sketched below.

patient_count[['age2', 'patient_id']].corr(method = 'kendall')
patient_count[['age2', 'patient_id']].corr(method = 'spearman')
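As a minimal sketch of that idea, reusing the patient_count frame from above, you can compute the matrix once and render it as an annotated heatmap:

# correlation matrix in a single call; no p-values, but handy for a SPLOM
corr_matrix = patient_count[['age2', 'patient_id']].corr(method='kendall')
print(corr_matrix)

# visualize the matrix directly, annotated with the coefficients
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap='coolwarm')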

Conclusion

There are many ways to correlate data, depending on its type and structure. Knowing when and how to employ the right coefficient will make the difference in your analysis. These four formulas are the most common, but there are other variations out there. Also, be careful not to mistake correlation for causation: just because two things are related (even statistically), it doesn’t mean there is a practical reason for it.

References and Additional Material

• Magiya, Joseph. 2019. “Kendall Rank Correlation Explained.” Medium. November 23, 2019. https://towardsdatascience.com/kendall-rank-correlation-explained-dee01d99c535.

• “Tests of Statistical Significance.” n.d. Accessed August 19, 2021. https://web.csulb.edu/~msaintg/ppa696/696stsig.htm.

• Jaadi, Zakaria. 2019. “Eveything You Need to Know about Interpreting Correlations.” Medium. October 16, 2019. https://towardsdatascience.com/eveything-you-need-to-know-about-interpreting-correlations-2c485841c0b8.

• Geographer Online. n.d. Explaining: Spearman’s Rank Correlation Coefficient. Accessed August 19, 2021. https://www.youtube.com/watch?v=pFzNaxJIpiU.

• “Measuring Item Reliability – What’s the Point of Point Biserial?” 2019. Maxinity. September 17, 2019. https://www.maxinity.co.uk/blog/point-biserial.

• “Kendall’s Tau-a.” n.d. Accessed August 22, 2021. http://www.statistics4u.com/fundstat_eng/ee_kendall_rank_correlation.html.
