6.5. Statistical significance

Note

This page is very concise and avoids derivations. The focus here is on a working exposure to the topic. The corresponding lecture will add intuition.

Previously, we estimated

\[ \log(\text{price}) = 8.2 + 1.53 \cdot \log(\text{carat}) + 0.33 \cdot \text{Ideal} + 0.18 \cdot \log(\text{carat}) \cdot \text{Ideal} \]
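To see what those point estimates imply, you can plug values into the fitted equation by hand. The helper below is purely illustrative (the function name and the 0.5-carat example are mine, and the coefficients are the rounded ones above):

# evaluate the fitted equation for a hypothetical diamond
import numpy as np

# rounded coefficients from the fitted equation above
b0, b_lcarat, b_ideal, b_interact = 8.2, 1.53, 0.33, 0.18

def predicted_lprice(carat, ideal):
    """Evaluate the fitted equation for a given carat size and cut (ideal = 0 or 1)."""
    lcarat = np.log(carat)
    return b0 + b_lcarat * lcarat + b_ideal * ideal + b_interact * lcarat * ideal

# a hypothetical 0.5-carat diamond, non-Ideal vs. Ideal cut
print(np.exp(predicted_lprice(0.5, ideal=0)))   # implied price in dollars
print(np.exp(predicted_lprice(0.5, ideal=1)))   # an Ideal cut implies a higher price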

Those coefficients are estimates, not gospel. They come from the sample of data we have. There is some uncertainty about what the “true” value of the coefficients is in the unseen “population.”

So, when we run the regression, it would be nice to get some extra information about those estimates. sm_ols provides just that. For that regression (the code is shown below), we get the following output:

# load some data to practice regressions
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols # need this

diamonds = sns.load_dataset('diamonds')

# this alteration is not strictly necessary to practice a regression
# but we use this in livecoding
diamonds2 = (diamonds.query('carat < 2.5')               # censor/remove outliers
            .assign(lprice = np.log(diamonds['price']))  # log transform price
            .assign(lcarat = np.log(diamonds['carat']))  # log transform carats
            .assign(ideal = diamonds['cut'] == 'Ideal') 
             
             # some regression packages want you to explicitly provide 
             # a variable for the constant
            .assign(const = 1)                           
            )  

(
sm_ols('lprice ~ lcarat + ideal + lcarat*ideal', 
       data=diamonds2.query('cut in ["Fair","Ideal"]'))
).fit().summary().tables[1] # the summary() is multiple tables stitched together. I only care to print the params here.
                          coef    std err          t      P>|t|     [0.025     0.975]
Intercept               8.1954      0.007   1232.871      0.000      8.182      8.208
ideal[T.True]           0.3302      0.007     46.677      0.000      0.316      0.344
lcarat                  1.5282      0.015    103.832      0.000      1.499      1.557
lcarat:ideal[T.True]    0.1822      0.015     12.101      0.000      0.153      0.212

The information next to each coefficient in this table describes how precisely that coefficient is estimated.

  • “std err”: The standard error of the coefficient estimate. This gives an indication of how much the estimated coefficient likely varies from the population coefficient. There are several ways to compute a standard error, and the choice is important! However, it’s beyond the scope of this class, and we will use the default option throughout.

  • “t”: The “t-stat” = the estimated coefficient divided by its standard error.

  • “P>|t|”: The “p-value” = the probability of getting an estimate at least this far from zero by random chance if the true coefficient were zero.

  • “[0.025  0.975]”: The 95% confidence interval for the coefficient.
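If you want these numbers as data rather than a printed table, the fitted results object exposes each column directly. Below is a minimal sketch, assuming `diamonds2` and the `sm_ols` import from the code above are in memory; `params`, `bse`, `tvalues`, `pvalues`, and `conf_int()` are the standard statsmodels attribute names for these quantities:

import pandas as pd

res = sm_ols('lprice ~ lcarat + ideal + lcarat*ideal',
             data=diamonds2.query('cut in ["Fair","Ideal"]')).fit()

# rebuild the summary table ourselves, one column at a time
stats = pd.DataFrame({'coef':    res.params,     # estimated coefficients
                      'std err': res.bse,        # standard errors
                      't':       res.tvalues,    # t-stats
                      'p-value': res.pvalues})   # p-values
ci = res.conf_int()                              # 95% confidence intervals
stats['2.5%']  = ci[0]
stats['97.5%'] = ci[1]

print(stats.round(4))
print((stats['coef'] / stats['std err']).round(3))         # reproduces the "t" column
print((stats['coef'] - 1.96 * stats['std err']).round(3))  # ~ the lower CI bound

The last two lines confirm the table’s internal logic: the t-stat is the coefficient divided by its standard error, and the 95% confidence interval is roughly the coefficient plus or minus 1.96 standard errors.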

Important

We use these columns, particularly the “t-stat” and “p-value”, to assess whether the coefficient is statistically distinguishable from zero, or whether an estimate that large could plausibly arise by random chance.

  • A t-stat of 1.645 corresponds to a p-value of 0.10, meaning that if the true coefficient were zero, you would get a t-stat that large (in absolute value) only 10% of the time

  • A t-stat of 1.96 corresponds to a p-value of 0.05; this is a common “threshold” for saying a “relationship is statistically significant” and that “the relationship between X and Y is not zero”

  • A t-stat of 2.58 corresponds to a p-value of 0.01
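Those threshold pairs come from the normal distribution, which the t-distribution matches closely in samples as large as this one. A quick check (this snippet is illustrative and not part of the original code) using scipy:

from scipy.stats import norm

# two-sided p-value implied by each t-stat threshold, assuming the true
# coefficient is zero: p = P(|t| > threshold) = 2 * P(t > threshold)
for t in [1.645, 1.96, 2.58]:
    print(t, round(2 * norm.sf(t), 3))   # prints roughly 0.10, 0.05, 0.01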

6.5.1. Some practical guidance

  1. You can focus on the p-values rather than the t-stats. However, knowing the threshold values of the t-stat (those above) is useful, since many researchers report and discuss t-stats instead of p-values.

  2. If the p-value for a coefficient in a regression is below 0.05,

    • We say that the relationship between that variable and Y is “statistically significant” at the 5% level.

    • Now, consider the direction: Is the coefficient positive or negative? Does this align with your intuition and economic theory?

    • Now, consider the “size” of the coefficient: Is a “reasonable” change in X leading to a “small” or “big” change in Y? If the relationship is small, it may not be important enough to care about even if true.

    • Now, take a step back: Statistically significant does NOT mean X causes Y. You need additional information to make these claims. That’s what the next page is about.

  3. Practically, if the p-value is above 0.05, most researchers completely disregard the coefficient. Because you can’t say the coefficient is statistically distinguishable from zero, they basically interpret it as being zero. Meaning: ignore the sign, ignore the value, assume the coefficient is zero. (The sketch below flags which coefficients clear the 5% threshold.)
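Here is a minimal sketch of that checklist in code, assuming the fitted `res` results object from the earlier sketch; the “treat it as zero” step is just this page’s rule of thumb, not a statsmodels feature:

import pandas as pd

check = pd.DataFrame({'coef': res.params, 'p-value': res.pvalues})

# guidance #2: flag coefficients that are statistically significant at the 5% level
check['sig_5pct'] = check['p-value'] < 0.05

# guidance #3 (rule of thumb): before interpreting signs and magnitudes,
# treat statistically insignificant coefficients as zero
check['coef_to_interpret'] = check['coef'].where(check['sig_5pct'], 0)

print(check)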