{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical significance\n", "\n", "[Previously,](02b_mechanics.html#including-interaction-terms) we estimated \n", "\n", "$$\n", "\\log(\\text{price})= 8.2+ 1.53 *\\log(\\text{carat}) + 0.33* \\text{Ideal} + 0.18* \\log(\\text{carat})\\cdot \\text{Ideal}\n", "$$\n", "\n", "Those coefficients are _estimates_, not gospel. There is some uncertainty about what the \"true\" value should be.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 8.1954 0.007 1232.871 0.000 8.182 8.208
ideal[T.True] 0.3302 0.007 46.677 0.000 0.316 0.344
lcarat 1.5282 0.015 103.832 0.000 1.499 1.557
lcarat:ideal[T.True] 0.1822 0.015 12.101 0.000 0.153 0.212
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load some data to practice regressions\n", "import seaborn as sns\n", "import numpy as np\n", "from statsmodels.formula.api import ols as sm_ols # need this\n", "\n", "diamonds = sns.load_dataset('diamonds')\n", "\n", "# this alteration is not strictly necessary to practice a regression\n", "# but we use this in livecoding\n", "diamonds2 = (diamonds.query('carat < 2.5') # censor/remove outliers\n", " .assign(lprice = np.log(diamonds['price'])) # log transform price\n", " .assign(lcarat = np.log(diamonds['carat'])) # log transform carats\n", " .assign(ideal = diamonds['cut'] == 'Ideal') \n", " \n", " # some regression packages want you to explicitly provide \n", " # a variable for the constant\n", " .assign(const = 1) \n", " ) \n", "\n", "(\n", "sm_ols('lprice ~ lcarat + ideal + lcarat*ideal', \n", " data=diamonds2.query('cut in [\"Fair\",\"Ideal\"]'))\n", ").fit().summary().tables[1] # the summary is multipe tables stitched together. let's only print the params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, the table that regression produces (which you can see above by clicking the plus button) also displays information next to each coefficient about the level of precision associated with the estimated coefficients.\n", "\n", "| Column | Meaning |\n", "| :-- | :-- |\n", "| \"std err\" |\tThe standard error of the coefficient estimate |\n", "| \"t\" \t | The _**\"t-stat\"**_ = $\\beta$ divided by its standard error |\n", "| \"P>t\" \t| The _**\"p-value\"**_ is the probability that the coefficient is different than zero by random chance. |\n", "| \"[0.025 \t0.975]\" | The 95% confidence interval for the coefficient. | \n", "\n", "\n", "```{note}\n", "We use these columns, particularly the \"t-stat\" and \"p-value\", to assess the probability that the coefficient is different than zero by random chance. \n", "- A t-stat of 1.645 corresponds to a p-value of 0.10; meaning only 10% of the time would you get that coefficient randomly\n", "- A t-stat of 1.96 corresponds to a p-value of 0.05; this is a common \"threshold\" to say a \"relationship is _**statistically**_ significant\" and that \"the relationship between X and y is not zero\"\n", "- A t-stat of 2.58 corresponds to a p-value of 0.01\n", "```\n", "\n", "The \"classical\" approach to assess whether X is related to y is to see if the t-stat on X is above 1.96. These thresholds are important and part of a sensible approach to learning from data, but...\n", "\n", "When people say a \"relationship is statistically significant\", what they often mean is \n", "
\n", "
\"LOOK YONDER, I HAVE FOUND A NEW, MEANINGFUL
\n", " CORRELATION THAT IS TRUE
(TRULY!) (VERILY!)
\n", " AND KNOWING IT ...
WILL CHANGE THE WORLD!\"
\n", "\n", "Which leads me to the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Significant_ warnings about \"statistical significance\"\n", "\n", "The focus on t-stats p-values can be dangerous. As it turns out, \"p-hacking\" (finding a significant relationship) is easy for a number of reasons:\n", "- If you look at enough Xs and enough ys, you, by chance alone, can find \"significant\" relationships where [none exist](https://www.tylervigen.com/spurious-correlations )\n", "- Your data can be biased. [Dewey Defeats Truman](https://en.wikipedia.org/wiki/Dewey_Defeats_Truman) may be the most famous example\n", " \n", " \n", "- Your sample restrictions can generate bias: If you evaluate the trading strategy \"buy and hold current S&P companies\" for the last 50 years, you'll discover that this trading strategy was unbelievable!\n", "- Reusing the data: If you torture the data, it will confess!\n", "\n", "\n", "```{admonition} Torturing your data\n", ":class: warning\n", "There is a [fun game in here](https://fivethirtyeight.com/features/science-isnt-broken/#part1). \n", "\n", "The point: \n", "1. The choices you make about what variables to include or focus on can change the _statistical_ results. \n", "2. **And if you play with a dataset long enough, you'll find \"results\"... ***\n", "```\n", "\n", "### The focus on p-values can be dangerous because it distorts the incentives of analysts\n", "\n", "- Again: **If you torture the data, it will confess (regardless of whether you or it wanted to)**\n", "- **Motivated reasoning/cognitive dissonance/confirmation bias**: Analysis like [the game above](https://fivethirtyeight.com/features/science-isnt-broken/#part1) is fraught with temptations for humans. That 538 article mentions that about 2/3 of retractions are due to misconduct.\n", "- Focus on p-value **shifts attention: Statistical signicance does not mean causation nor economic significance (i.e. large/important relationships)**\n", "\n", "```{admonition} Tips to avoid p-hacking\n", ":class: tips\n", "1. Your focus should be on testing and evaluating a hypothesis, not \"finding a result\"\n", "1. Null results are fine! Famously, Edison and his teams found a lot of wire filaments that did _not_ work for a lightbulb, and this information was valuable!\n", "1. \"Preregister\" your ideas\n", " - The simplest version of this: Write down your data, theory, and hypothesis (it can be short!) **BEFORE** you run your tests.\n", " - [This Science article](https://www.science.org/content/article/more-and-more-scientists-are-preregistering-their-studies-should-you) covers the reasoning and intuition for it\n", " - [This article by Brian Nosek, one of the key voices pushing for ways to improve credibility](https://www.pnas.org/content/115/11/2600) is an instant classic\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \"Correlation is not causation\"\n", "\n", "Surely, you've heard that. I prefer this version:\n", "\n", "```{epigraph}\n", "Everyone who confuses correlation with causation eventually ends up dead.\n", "\n", "_@rauscher_emily_\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Common analytical errors\n", "\n", "Let me repost this figure from [an earlier page](01a_MLgonewrong). Notice how a focus on finding significant results can incentivize or subtly lead you towards several of these:\n", "\n", "```{image} img/data_fallacies_to_avoid.jpg\n", ":alt: flowchart\n", ":width: 800px\n", "```" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }