Why Visualize Data? Anscombe’s quartet and the many lives that could have been saved…
Any intellectual worth their salt can look at a data set and see the trends, right? Why bother with the “prettying” of data when the numbers already give us the cold, hard facts we are looking for? What if I told you that the difference between a good data presentation and a poor one could be the difference between someone living or dying? We need to visualize data to see the whole picture. Whether we are looking at temperature data or maternal death rates after childbirth in the 1840s, plotting the data helps us avoid fundamental mistakes that many before us have made. Today we are going to explore a data phenomenon known as “Anscombe’s Quartet” and the sad story of a physician who was ahead of his time.
Anscombe’s Quartet
Born in England on May 13th, 1918, a statistician would change how we look at data. Francis John “Frank” Anscombe, a Fellow of the American Statistical Association who taught at Princeton University and later at Yale, gave us a demonstration that has stood the test of time. He was interested in experiments that emphasized randomization, and his best-known work is his account of the formal properties of residuals in linear regression. Later, he became interested in statistical computing, arguing that graphs were an essential part of any analysis. In 1973 he published what would become known as “Anscombe’s Quartet”, four small data sets whose summary statistics are nearly identical but whose plots look nothing alike, which makes the case for computer-aided visualization.
Let us take a look at this famous data set and see if there is anything interesting to observe…
At first glance there is nothing particularly interesting to observe. The averages, medians, and standard deviations all look stable from one set to the next, and summarizing each group this way does not give us much more information. So, what is so impressive about this data set?
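To make the "nothing interesting" claim concrete, here is a minimal sketch that computes the usual summary statistics for each of the four sets. It uses only the standard library and types in the published quartet values directly, so it does not depend on the Excel file used later in this post. The `X1`..`Y4` labels mirror the column names in this post; the `pearson_r` helper and the `summary` dictionary are names made up for this illustration.

```python
from statistics import mean, stdev

# Published values of Anscombe's quartet. Sets 1-3 share the same x values.
# Keys follow the X1..Y4 naming used in this post (an assumption about the file).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


# Collect mean, standard deviation, and correlation for every set.
summary = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "sd_x": stdev(xs),
        "sd_y": stdev(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, every row prints essentially the same numbers, which is exactly why the table alone gives us no reason to suspect the four sets differ.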
Let’s use Python to load the data set and find out.
import pandas as pd
import numpy as np
import seaborn as sns
We will use pandas to read the Excel file into a data frame. There isn’t anything complex about the data, so we will let pandas infer the header row and column types for us.
df = pd.read_excel('ANSCOMBESQ.xlsx')
x1 = df[['X1','Y1']]
x2 = df[['X2','Y2']]
x3 = df[['X3','Y3']]
x4 = df[['X4','Y4']]
Here we will use the seaborn library to generate some scatterplots easily. Pass the X column to the x axis and the Y column to the y axis. This will show us how y changes with x.
Set #1
sns.scatterplot(data=x1, x='X1', y='Y1')
Set #2
sns.scatterplot(data=x2, x='X2', y='Y2')
Set #3
sns.scatterplot(data=x3, x='X3', y='Y3')
Set #4
sns.scatterplot(data=x4, x='X4', y='Y4')
The data sets are distinctly different. How can that be? After all, didn’t we see that the summary statistics were nearly all the same? The point is that a single measure or “view” of our data will not tell the entire story. You have to visualize your data to get the full meaning.
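The same trap applies to a fitted trend line. As a rough sketch, the snippet below fits an ordinary least-squares line to each set using only the standard library. The values are the published quartet values typed in by hand rather than read from the Excel file, and `fit_line` and `fits` are hypothetical names chosen for this example.

```python
from statistics import mean

# Published values of Anscombe's quartet. Sets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Fit one line per set and print the coefficients side by side.
fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}

for name, (slope, intercept) in fits.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.2f} * x")
```

All four fits come out at roughly y = 3 + 0.5x, even though the scatterplots show a noisy line, a curve, a line with one outlier, and a vertical stack with a single far-off point. The regression coefficients alone cannot tell these apart; the plot can.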
Ok neat, data sets can be different, and visualizations can show that. Now what?
Let’s go back to the Vienna General Hospital in the 1840s. Here we find a promising doctor who made an important observation but lacked the presentation to convince his colleagues. Ignaz Semmelweis was a Hungarian physician who observed that mothers whose babies were delivered by physicians died at a higher rate than those attended by midwives in an adjacent clinic.
Semmelweis had an epiphany when a friend was pricked by a scalpel while performing an autopsy and died from the wound. His friend died of a fever very similar to the one killing the mothers in the physicians’ clinic. From here Semmelweis observed that the physicians spent the morning performing autopsies and the afternoon performing deliveries. He suspected the two were connected, and he set up an experiment to test his hypothesis.
Semmelweis introduced a hand washing policy using a chlorinated lime solution, which was strong enough to remove the smell of the autopsy room from the physicians’ hands. He didn’t have the science to explain what was happening, but the solution was removing deadly pathogens that were infecting the mothers. Soon the death rates plummeted, and he had his answer.
While he made a few mistakes when presenting his results, the biggest was that he used tables of numbers to show the outcome of his experiments. Humans are notoriously bad at extracting trends from numerical data, and without an understanding of the microbiology at work, he was met with harsh criticism and the rejection of his work. It was not until years later, after more deaths had occurred, that his results were accepted. It is speculated that had he provided a simple visualization like the one below, it would have made the difference to his peers.