6.6. Significant warnings about “statistical significance”

The “classical” approach to assessing whether X is related to y is to check whether the t-stat on X is above 1.96 in absolute value or, roughly equivalently, whether the p-value is below 0.05. These thresholds are important and part of a sensible approach to learning from data, but when you read about a “statistically significant relationship” on some website, it often comes across as far more conclusive than it is.


And suddenly, you see an article saying that 10 cups of coffee/2 bars of chocolate/3 glasses of wine/etc. a day leads to longer lives, or that breastfeeding for up to two years causes better outcomes.[1]

6.6.1. “Correlation is not causation”

Surely, you’ve heard that. I prefer this version:

Everyone who confuses correlation with causation eventually ends up dead.


The “default” interpretation you should have of a regression is that you’re seeing a correlation, not that X causes Y. You need to rule out some alternative possibilities first.

6.6.2. Alternatives to causation

Here are some reasons the (statistically significant) correlation might not be causal:

  • Spurious correlation: If you look at enough Xs and enough ys, you can find “significant” relationships by chance alone, even where none exist.

  • Sampling bias: The famous Dewey Defeats Truman headline happened because of bad polls (and an analyst who had called 4 of the 5 prior elections correctly).

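  • Survivorship bias: If you evaluate the trading strategy “buy and hold current S&P companies” for the last 50 years, you’ll discover that this trading strategy did great! But that is only because the sample leaves out every company that failed or fell out of the index along the way.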

  • Reusing the data aka “p-hacking”: If you torture the data, it will confess! Play this fun game and you’ll see that (1) The choices you make about what variables to include or focus on can change the sign and p-values. (2) If you play with a dataset long enough, you’ll find “results”.

  • Sample selection: The sample only exists for some subset of possible X or Y values.

  • Reverse causation: Y causes X.

  • Omitted variables: W causes X and Y to go up, but if you run a test using just X and Y (not W) you’ll find that X and Y are related. “Ability” and “quality” cannot be measured directly, but are often important to control for.

  • Simultaneity: Think of this as “equilibrium effects”. X and Y are determined together, like price and quantity.

If you see a regression or a study where these might come up, it’s time to think critically about whether you should trust and act on that finding, or do additional tests to prove the relationship is causal.
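To make the omitted variables problem concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the data are simulated, the coefficients are made up, and the small `ols` helper is written just for this illustration (in practice you would likely use a regression library). By construction, X has no effect on Y at all; W drives both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: W drives both X and Y. X has NO causal effect on Y.
W = rng.normal(size=n)
X = W + rng.normal(size=n)
Y = 2 * W + rng.normal(size=n)

def ols(y, *regressors):
    """Illustrative helper: OLS coefficients and t-stats (intercept first)."""
    A = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return beta, beta / se

beta_short, t_short = ols(Y, X)     # omits W
beta_long, t_long = ols(Y, X, W)    # controls for W

print(f"Y ~ X:     coef on X = {beta_short[1]:.2f}, t = {t_short[1]:.1f}")
print(f"Y ~ X + W: coef on X = {beta_long[1]:.2f}, t = {t_long[1]:.1f}")
```

The regression that omits W should report a large, highly “significant” coefficient on X, while the coefficient shrinks toward zero once W is included. The catch in real data is that W (like “ability”) is often something you cannot observe, so you cannot simply add it as a control.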

6.6.3. Getting to causation

Generally, the intuition behind approaches to establishing causality is to find or create randomness in X. If variation in X is truly random, then we can attribute different outcomes Y to the differences in X.
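Here is a small simulated sketch of why randomness in X helps. The setup is an assumption made purely for illustration: the true effect of X on Y is set to 0.5, and an unobserved confounder W also affects Y. In the “observational” data X depends on W; in the “randomized” data X is assigned independently of W.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = 0.5  # hypothetical causal effect of X on Y

W = rng.normal(size=n)  # unobserved confounder

def slope(x, y):
    """Slope from a simple regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Observational data: X is partly determined by W
X_obs = W + rng.normal(size=n)
Y_obs = true_effect * X_obs + 2 * W + rng.normal(size=n)

# Randomized data: X is assigned independently of W
X_rand = rng.normal(size=n)
Y_rand = true_effect * X_rand + 2 * W + rng.normal(size=n)

slope_obs = slope(X_obs, Y_obs)
slope_rand = slope(X_rand, Y_rand)

print(f"True effect:           {true_effect}")
print(f"Observational slope:   {slope_obs:.2f}")
print(f"Randomized slope:      {slope_rand:.2f}")
```

In this sketch the observational slope should land far from 0.5 because it also picks up the effect of W, while the randomized slope should land close to 0.5 even though W is never measured.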

The most common methods that can establish causality are:

  • Randomized trials

  • Difference-in-differences

  • Instrumental variables

I emphasized “can” because these methods do not always suffice to establish causality. Designing studies to deal with these issues is a massive topic you can pursue in other classes. I can’t do it justice here.

Humility is good

Until you learn about the advanced techniques above, focus on humility as you report regressions:

  1. Our standard fill-in-the-blank interpretation sentence calls the relationship an “association”.

  2. Emphasize in discussion of findings what you found (a statistical association) and didn’t (“We acknowledge that this finding isn’t causal.” “One limitation of our study is that…”)

  3. Discuss alternative explanations (some may apply in your setting, some may not)

  4. Banned words: impact, causes, causality, because of, leads to, etc.

6.6.4. Help me help you


When you run a regression, your focus should be on testing and evaluating a hypothesis, not “finding a result”.

Remember, if you torture the data enough, it will confess and produce a “statistical” result. Meaning: It’s often “easy” to find results.
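To see how “easy” it can be, here is a minimal simulation sketch. The numbers (200 observations, 1,000 candidate variables) are arbitrary choices for illustration, and every variable is pure noise, so any “significant” relationship it finds is false by construction.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_candidates = 200, 1_000

y = rng.normal(size=n_obs)                   # outcome: pure noise
X = rng.normal(size=(n_obs, n_candidates))   # 1,000 unrelated "predictors"

# Correlation of y with each column of X
y_std = (y - y.mean()) / y.std()
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
r = X_std.T @ y_std / n_obs

# t-stat on the slope from a simple regression of y on each X
t_stats = r * np.sqrt((n_obs - 2) / (1 - r**2))

n_significant = int(np.sum(np.abs(t_stats) > 1.96))
print(f"{n_significant} of {n_candidates} noise variables have |t| > 1.96")
```

With a 1.96 cutoff, you should expect roughly 5% of the noise variables to clear the bar. An analyst who tries many variables and reports only the ones that “worked” will always have something to report.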

The focus on p-values can be dangerous because it distorts the incentives of analysts. If you’re paid to publish research, and journals have a bias towards publishing non-null results (they do), then your incentive is to “find something.” This 538 article mentions that about 2/3 of retractions are due to misconduct.

However, it doesn’t take ill intent: You, or friends, or strangers might find a false result and trumpet it due to motivated reasoning, cognitive dissonance, or confirmation bias. Analyses in many domains are fraught with these temptations; the game above has a political valence.

Additionally, the focus on p-values shifts attention towards statistical significance, which implies neither causation nor economic significance (i.e., large or important relationships).
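The gap between statistical and economic significance is easy to illustrate with simulated data. In this sketch (all numbers are hypothetical), the true slope is a tiny 0.01, but the sample is very large, so the t-stat can still clear 1.96 comfortably.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)   # hypothetical: a tiny true relationship

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()                 # simple-regression slope
t_stat = r * np.sqrt((n - 2) / (1 - r**2))    # t-stat on that slope
r_squared = r**2

print(f"slope = {slope:.4f}, t-stat = {t_stat:.1f}, R-squared = {r_squared:.5f}")
```

The relationship here should be “statistically significant” by the usual threshold, yet X explains almost none of the variation in y, and a one standard deviation change in x moves y by about a hundredth of a standard deviation. Always look at the size of the coefficient, not just its t-stat.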

Tips to avoid p-hacking

  1. Your focus should be on testing and evaluating a hypothesis, not “finding a result”

  2. Null results are fine! Famously, Edison and his teams found a lot of wire filaments that did not work for a lightbulb, and this information was valuable!

  3. “Preregister” your ideas


[1] The AAP recently started suggesting breastfeeding for two years, in part due to some studies finding a correlation between long breastfeeding and better maternal outcomes. However, moms who breastfeed that long are different from those who don’t. One difference: They tend to be richer. (Please pardon the sassy joke: Perhaps the AAP should suggest bringing your child home in a Mercedes.) Even if the study can control for wealth, it’s easy to worry about other confounding factors.