What is correlation?


This post was originally published by Cassie Kozyrkov at Towards Data Science

Not causation.


Experiments allow you to talk about cause and effect. Without them, all you have is correlation.

What is correlation?


Sure, you’ve probably already heard us statisticians yelling that at you. But what is correlation? It’s when the variables in a dataset look like they’re moving together in some way.

Two variables X and Y are correlated if they seem to be moving together in some way.

For example, “when X is higher, Y tends to be higher” (this is called positive correlation) or “when X is higher, Y tends to be lower” (this is called negative correlation).
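In symbols, the population version is just covariance rescaled by both standard deviations:

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
           = \frac{\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y}
```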

Thanks, Wikipedia.

If you’re looking for the formula for (population) correlation, your friend Wikipedia has everything you need. But if you wanted that, why didn’t you go there straight away? Why are you here? Ah, you want the intuitive explanation? Cool. Here’s a hill:

On the left, height and (left-to-right) distance are positively correlated. When one goes up, so does the other. On the right, height and distance are negatively correlated.

When most people hear the word correlation, they tend to think of perfect linear correlation: taking a horizontal step (X) to the right on the hill above gets you the same change in altitude (Y) everywhere on the same slope. As long as you’re going up from left to right (positive correlation), there are no surprise jagged/curved bits.
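If you’d like to see that boundary case for yourself, here’s a quick R sketch (the slope numbers are invented): any perfectly straight line gives a correlation of exactly +1 or -1.

```r
distance <- 1:100                  # horizontal steps, left to right
height_up <- 3 * distance + 7      # a perfectly steady climb
height_down <- -2 * distance + 50  # a perfectly steady descent

print(cor(distance, height_up))    # 1: perfect positive correlation
print(cor(distance, height_down))  # -1: perfect negative correlation
```

The intercepts and slopes don’t matter; only the straightness and the direction do.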

Bear in mind that going up is positive only if you’re hiking left-to-right, same way as you read English. If you approach hills from the right, statisticians won’t know what to do with you. I suppose what statisticians are trying to tell you is never to approach a hike from the right. That will only confuse us.

But if you hike properly, then “up” is “positive.”

In reality, this hill is not perfect, so the correlation magnitude between height and distance will be less than 100%. (You’ll pop a +/- sign in front depending on whether you’re going up or down, so correlation lives between -1 and 1.)

That’s because the correlation formula divides by each variable’s standard deviation, thereby removing the magnitude of its dispersion. Without that denominator, you’d struggle to see that the strength of the relationship is the same regardless of whether you measure height in inches or centimetres. Whenever you see scaling/normalization in statistics, it’s usually there to help you compare apples and oranges that were measured in different units.
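To see that unit-invariance in action, try this R sketch (the height and weight numbers are invented for illustration):

```r
height_in <- c(60, 62, 65, 68, 70, 72)     # heights in inches
height_cm <- height_in * 2.54              # the same heights in centimetres
weight <- c(115, 120, 140, 155, 165, 180)  # weights in pounds

# Dividing by the standard deviations absorbs the change of units,
# so the correlation comes out the same either way.
print(cor(height_in, weight))
print(cor(height_cm, weight))
```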

What does a correlation of zero look like? Are you thinking of a messy cloud with no discernible patterns inside? Something like:

Sure, that works. You know how I know X and Y truly have nothing to do with one another? Because I created them that way. If you want to simulate a similar plot of two uncorrelated variables, try running this basic code snippet in R online:

X <- runif(100) # 100 uniform random numbers between 0 and 1
Y <- rnorm(100) # Another 100 random numbers, this time from a bell curve
plot(X, Y, main = "X and Y have nothing to do with one another")

But there’s another way. The less linear the relationship, the closer your correlation is to zero. In fact, if you look at the hill as a whole (not just one of its slopes at a time), you’ll find a zero correlation even though there’s a clear relationship between height and distance (duh, it’s a hill).

X <- seq(-1, 1, 0.01) # Go from -1 to 1 in increments of 0.01
Y <- -X^2 # Secret formula for the ideal hill
plot(X, Y, main = "The linear correlation is zero")
print(cor(X, Y)) # Check the correlation is (essentially) zero

The presence of a linear correlation means that data move together in a somewhat linear fashion. It does not mean that X causes Y (or the other way around). They might both be moving due to something else entirely.

Want proof of this? Imagine you and I invested in the same stock. Let’s call it ZOOM, because I find it hilarious that pandemic investors intended to buy ZM (the video communications company) but accidentally bought ZOOM (the Chinese micro-cap) instead, leading to a 900% increase in the price of the wrong Zoom, while the real ZM didn’t even double. *wipes away laugh-tears* Anyways — in honor of that comedy — imagine that you and I invested a small amount in ZOOM.

Since we’re both holding ZOOM, the value of your stock portfolio ($X) is correlated with my stock portfolio value ($Y). If ZOOM goes up, we both profit. That does not mean that my portfolio’s value causes your portfolio’s value. I cannot dump all my stock in a way that punishes you — if my portfolio value suddenly becomes zero because I sell everything to buy a pile of cupcakes, that doesn’t mean that yours is now worthless.
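Here’s a little R simulation of that situation (all numbers invented): a shared ZOOM price drives both portfolio values, so they’re strongly correlated even though neither causes the other.

```r
set.seed(42)
zoom_price <- 100 + cumsum(rnorm(250))          # a random walk standing in for ZOOM
your_portfolio <- 10 * zoom_price + rnorm(250)  # you hold more shares than I do
my_portfolio <- 3 * zoom_price + rnorm(250)     # my modest stake in the same stock

# Strong correlation, driven entirely by the shared ZOOM price;
# neither portfolio value causes the other.
print(cor(your_portfolio, my_portfolio))
```

Both series inherit their movement from zoom_price, the hidden third variable here.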

Many decision-makers fall flat on their faces for precisely this reason. Seeing two correlated variables, they invest resources in affecting thing 1 to try to move thing 2… and the results are not what they expect. Without an experiment, they had no business assuming that thing 1 drives thing 2 in the first place.

Correlation is not causation.

The lovely term “spurious correlation” refers to the situation where there’s no direct causal relationship between two correlated variables. Their correlation might be due to coincidence or to the effect of a third (usually unseen, a.k.a. “latent”) variable that influences both. Never take correlation at face value; in data, things often aren’t what they seem.

For fun with spurious correlations, check out the website that collects examples like these.

To summarize: if you want to talk about causes and effects, you need a (real!) experiment. Without experiments, all you have is correlation, and for many decisions (the ones based on causal reasoning) that is not helpful.

What is regression? It’s putting lines through stuff. Think of it as, “Oh, hey! These things are correlated, so let’s use one to predict the other…” Here’s me telling you all about it.
