6.6. Significant warnings about “statistical significance”¶
The “classical” approach to assessing whether X is related to y: Is the t-stat on X above 1.96? (Equivalently: Is the p-value below 0.05?) These thresholds are important and part of a sensible approach to learning from data, but when you read about a “statistically significant relationship” on some website, it often comes across like
MY NEW, AND MEANINGFUL, FINDING
AND KNOWING IT ...
WILL CHANGE YOUR LIFE!
And suddenly, you see an article saying that 10 cups of coffee, 2 bars of chocolate, and 3 glasses of wine a day leads to longer lives, or that breastfeeding for up to two years causes better outcomes.1
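Before going further, it helps to see where those thresholds come from. Below is a minimal simulation (not part of the original notes; the data and variable names are made up for illustration) that computes the slope, t-stat, and p-value of a bivariate regression by hand:

```python
import numpy as np
from scipy import stats

# Hypothetical simulated data: y truly depends on x
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# OLS slope and its standard error (bivariate regression with intercept)
xc = x - x.mean()
beta = (xc * y).sum() / (xc**2).sum()
resid = y - y.mean() - beta * xc
se = np.sqrt((resid**2).sum() / (n - 2) / (xc**2).sum())

t_stat = beta / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)  # "significant" if |t| > 1.96, i.e., p < 0.05
```

Here the relationship is real by construction, so the t-stat clears 1.96 easily. The warnings below are about the many ways a t-stat can clear that bar when the relationship is not what you think it is.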
6.6.1. “Correlation is not causation”¶
Surely, you’ve heard that. I prefer this version:
Everyone who confuses correlation with causation eventually ends up dead.
The “default” interpretation you should have of a regression result is that you’re seeing a correlation, not that X causes Y. You need to rule out some alternative possibilities first.
6.6.2. Alternatives to causation¶
Here are some reasons the (statistically significant) correlation might not be causal:
Spurious correlation: If you look at enough Xs and enough ys, you can, by chance alone, find “significant” relationships where none exist.
Sampling bias: The famous Dewey Defeats Truman headline happened because of bad polls (conducted by an analyst who had gotten 4 of the prior 5 elections correct).
Survivorship bias: If you evaluate the trading strategy “buy and hold current S&P companies” over the last 50 years, you’ll discover that this strategy did great! (Of course it did: the sample only includes firms that survived long enough to be in the index today.)
Reusing the data aka “p-hacking”: If you torture the data, it will confess! Play this fun game and you’ll see that (1) the choices you make about which variables to include or focus on can change the sign and p-values, and (2) if you play with a dataset long enough, you’ll find “results.”
Sample selection: The sample only exists for some subset of possible X or Y values.
Reverse causation: Y causes X.
Omitted variables: W causes both X and Y to go up, but if you run a test using just X and Y (not W), you’ll find that X and Y are related. “Ability” and “quality” cannot be measured directly, but are often important to control for.
Simultaneity: Think of this as “equilibrium effects”. X and Y are determined together, like price and quantity.
If you see a regression or a study where these might come up, it’s time to think critically about whether you should trust and act on that finding, or do additional tests to prove the relationship is causal.
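The first item on that list, spurious correlation, is easy to see in a simulation. The sketch below (my own illustration, not from the chapter) regresses a purely random y on 400 purely random x variables. None of them is truly related to y, yet roughly 5% come back “significant” at the 0.05 level, exactly as the threshold promises:

```python
import numpy as np
from scipy import stats

# Hypothetical demo: y is pure noise, unrelated to every x we try
rng = np.random.default_rng(1)
n, trials = 100, 400
y = rng.normal(size=n)

false_hits = 0
for _ in range(trials):
    x = rng.normal(size=n)       # a brand-new, unrelated variable
    r, p = stats.pearsonr(x, y)  # correlation and its p-value under the null
    if p < 0.05:
        false_hits += 1

print(false_hits / trials)  # close to 0.05: "findings" by chance alone
```

If an analyst only reports the “hits” and discards the rest, the published result looks impressive and is pure noise. This is the mechanical core of p-hacking.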
6.6.3. Getting to causation¶
Techniques to prove causality rely on the same intuition: Find (or create) randomness in X. If variation in X is truly random, then we can attribute different outcomes Y to the differences in X.
The canonical technique: Randomly give some people a drug, and others a placebo.
Randomized trials are rarely possible in finance research due to feasibility or ethical concerns.
The most common methods that can establish causality are:
Difference-in-differences
Instrumental variables
Regression discontinuity
A great resource to learn about these methods is Paul Goldsmith-Pinkham’s applied class materials.
I emphasized that these methods can establish causality because simply using them does not always suffice. Designing studies that deal with these issues is a massive topic you can pursue in other classes; I can’t do it justice here.
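The “find or create randomness in X” intuition can be illustrated with a toy simulation (my own sketch, with made-up variables, not from the chapter). When an unobserved W drives both X and Y, regressing Y on X finds a strong slope even though X has no true effect; when X is assigned at random, the regression recovers the true effect of zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    xc = x - x.mean()
    return (xc * y).sum() / (xc**2).sum()

# Confounded world: W ("ability") drives both X and Y; X has NO true effect
w = rng.normal(size=n)
x_obs = w + rng.normal(size=n)
y_obs = 2 * w + rng.normal(size=n)
naive = slope(x_obs, y_obs)          # close to 1, despite a true effect of 0

# Randomized world: X is assigned at random, independent of W
x_rand = rng.normal(size=n)
y_rand = 2 * w + 0 * x_rand + rng.normal(size=n)
randomized = slope(x_rand, y_rand)   # close to 0, the true effect

print(naive, randomized)
```

Randomization breaks the link between X and everything else that affects Y, which is why differences in outcomes can then be attributed to X.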
Humility is good
Until you learn about the advanced techniques above, focus on humility as you report regressions:
Our standard fill-in-the-blank interpretation sentence calls the relationship an “association” and avoids the banned words below.
In discussion of findings, emphasize what you found (a statistical association) and didn’t (“We acknowledge that this finding isn’t causal.” “One limitation of our study is that…”)
Discuss alternative explanations (some may apply in your setting, some may not)
Banned words: effect, impact, causes, causality, because of, leads to, etc.
6.6.4. Help me help you¶
When you run a regression, your focus should be on testing and evaluating a hypothesis, not “finding a result”
Remember: if you torture the data enough, it will confess and produce a “statistically significant” result. Meaning: it’s often “easy” to find results.
The focus on p-values can be dangerous because it distorts the incentives of analysts. If you’re paid to publish research, and journals have a bias towards publishing non-null results (they do), then your incentive is to “find something.” This 538 article mentions that about 2/3 of retractions are due to misconduct.
However, it doesn’t take ill intent: You, friends, or strangers might find a false result and trumpet it due to motivated reasoning, cognitive dissonance, or confirmation bias. Analyses in many domains are fraught with these temptations; the game above has a political valence.
Additionally, the focus on p-values shifts attention towards statistical significance, which implies neither causation nor economic significance (i.e., large/important relationships).
Tips to avoid p-hacking
Your focus should be on testing and evaluating a hypothesis, not “finding a result”
Null results are fine! Famously, Edison and his teams found a lot of wire filaments that did not work for a light bulb, and this information was valuable!
“Preregister” your ideas
The simplest version of this: Write down your data, theory, and hypothesis (it can be short!) BEFORE you run your tests.
This Science article covers the reasoning and intuition for it
This article by Brian Nosek, one of the key voices pushing for ways to improve research credibility, is an instant classic.
The AAP recently started suggesting breastfeeding for two years, in part due to some studies finding a correlation between longer breastfeeding and better maternal outcomes. However, moms who breastfeed that long are different from those who don’t. One difference: they tend to be richer. (Please pardon the sassy joke: Perhaps the AAP should suggest bringing your child home in a Mercedes.) Even if a study can control for wealth, it’s easy to worry about other confounding factors.