{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Statistical significance\n",
    "\n",
    "```{note}\n",
    "This page is very concise and avoids derivations. The focus here is on a working exposure to the topic. The corresponding lecture will add intuition.\n",
    "```\n",
    "\n",
    "[Previously,](02b_mechanics.html#including-interaction-terms) we estimated \n",
    "\n",
    "$$\n",
    "\\log(\\text{price})= 8.2+ 1.53 *\\log(\\text{carat}) + 0.33* \\text{Ideal} + 0.18* \\log(\\text{carat})\\cdot \\text{Ideal}\n",
    "$$\n",
    "\n",
    "Those coefficients are _estimates_, not gospel. They come from the sample of data we have. There is some uncertainty about what the \"true\" value of the coefficients is in the unseen \"population.\" \n",
    "\n",
    "<img src=https://media.giphy.com/media/TIjVQiwWQFDMzjk4gU/giphy.gif width=\"300\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, when we run the regression, it would be nice to get some extra info about those estimates. `sm_ols` does just that. For that regression (click the `+` sign to see the code) we get the following info:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "            <td></td>              <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>            <td>    8.1954</td> <td>    0.007</td> <td> 1232.871</td> <td> 0.000</td> <td>    8.182</td> <td>    8.208</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>ideal[T.True]</th>        <td>    0.3302</td> <td>    0.007</td> <td>   46.677</td> <td> 0.000</td> <td>    0.316</td> <td>    0.344</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>lcarat</th>               <td>    1.5282</td> <td>    0.015</td> <td>  103.832</td> <td> 0.000</td> <td>    1.499</td> <td>    1.557</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>lcarat:ideal[T.True]</th> <td>    0.1822</td> <td>    0.015</td> <td>   12.101</td> <td> 0.000</td> <td>    0.153</td> <td>    0.212</td>\n",
       "</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.table.SimpleTable'>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load some data to practice regressions\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "from statsmodels.formula.api import ols as sm_ols # need this\n",
    "\n",
    "diamonds = sns.load_dataset('diamonds')\n",
    "\n",
    "# this alteration is not strictly necessary to practice a regression\n",
    "# but we use this in livecoding\n",
    "diamonds2 = (diamonds.query('carat < 2.5')               # censor/remove outliers\n",
    "            .assign(lprice = np.log(diamonds['price']))  # log transform price\n",
    "            .assign(lcarat = np.log(diamonds['carat']))  # log transform carats\n",
    "            .assign(ideal = diamonds['cut'] == 'Ideal') \n",
    "             \n",
    "             # some regression packages want you to explicitly provide \n",
    "             # a variable for the constant\n",
    "            .assign(const = 1)                           \n",
    "            )  \n",
    "\n",
    "(\n",
    "sm_ols('lprice ~ lcarat + ideal + lcarat*ideal', \n",
    "       data=diamonds2.query('cut in [\"Fair\",\"Ideal\"]'))\n",
    ").fit().summary().tables[1] # the summary() is multiple tables stitched together. I only care to print the params here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The information next to each coefficient in this table is about the level of precision associated with the estimated coefficients.\n",
    "\n",
    "| Column | Meaning |\n",
    "| :-- | :-- |\n",
    "| \"std err\" |\tThe standard error of the coefficient estimate. This gives an indication of how much the estimated coefficient likely varies from the population coefficient. _There are several ways to compute a standard error, and the choice is important! However, it's beyond the scope of this class, and we will use the default option throughout._ |\n",
    "| \"t\" \t    |  The _**\"t-stat\"**_ = $\\beta$ divided by its standard error. |\n",
    "| \"P>t\" \t|  The _**\"p-value\"**_ is the probability that the coefficient is different than zero by random chance. |\n",
    "| \"[0.025 \t0.975]\" | The 95% confidence interval for the coefficient. | \n",
    "\n",
    "```{important}\n",
    "We use these columns, particularly the \"t-stat\" and \"p-value\", to assess the probability that the coefficient is different from zero by random chance. \n",
    "- A t-stat of 1.645 corresponds to a p-value of 0.10; meaning only 10% of the time would you get that coefficient randomly\n",
    "- A t-stat of 1.96 corresponds to a p-value of 0.05; this is a common \"threshold\" to say a \"relationship is _**statistically**_ significant\" and that \"the relationship between X and y is not zero\"\n",
    "- A t-stat of 2.58 corresponds to a p-value of 0.01\n",
    "```\n",
    "\n",
    "## Some practical guidance\n",
    " \n",
    "1. You can focus on the p-values rather than the t-stats. However, knowing the threshold values of the t-stat (those above) is useful, as many people distribute research by discussing t-stats instead.\n",
    "2. If a p-value for the coefficient in a regression is below 0.05, \n",
    "    - We say that the relationship between that variable and Y is **\"statistically significant\"** at the 5% level.\n",
    "    - Now, **consider the direction**: Is the coefficient positive or negative? Does this align with your intuition and economic theory?\n",
    "    - Now, [**consider the \"size\" of the coefficient**](02d_interpretingCoefs.html#comparing-the-size-of-two-coefficients): Is a \"reasonable\" change in X leading to a \"small\" or \"big\" change in Y? If the relationship is small, it may not be important enough to care about even if true.\n",
    "    - Now, take a step back: **Statistically significant** does NOT mean X causes Y. You need additional information to make these claims. That's what the next page is about.\n",
    "3. Practically, if the p-value is above 0.05, most researchers consider completely disregard the coefficient (and ignore the sign and the value). Because you can't say the coefficient is statistically distinguishable from zero, they basically interpret the coefficient for that variable as being zero. Meaning: Ignore the sign, ignore the value, assume the coefficient is zero. \n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}