6.4. Interpreting regression coefficients¶
Revisiting the regression objectives: After this page, you can interpret the mechanical meaning of the coefficients for
- continuous variables
- categorical (a.k.a. qualitative) variables with two or more values, including "dummy" (binary) variables
- interaction terms between two X variables, which change the interpretation
- variables in models with other controls included (including categorical variables)
Tip
Regressions of \(y\) on \(N\) different variables take the form

\[
y = a + b_1 X_1 + b_2 X_2 + \dots + b_N X_N + u
\]

The generic interpretation of any of the \(b\) coefficients is a sentence in three parts:

"A 1 unit increase in \(X_i\)…
…is associated with a \(b_i\) change in \(y\),…
…holding all other X constant."
6.4.1. If X is a continuous variable¶
| If the model is… | then \(\beta\) means (approx. in log cases) |
|---|---|
| \(y=a+\beta X\) | If \(X \uparrow \) 1 unit, then \(y \uparrow\) by \(\beta\) units |
| \(\log y=a+\beta X\) | If \(X \uparrow \) 1 unit, then \(y \uparrow\) by about \(100*\beta\)% |
| \(y=a+\beta \log X\) | If \(X \uparrow \) 1%, then \(y \uparrow\) by about \(\beta / 100\) units |
| \(\log y=a+\beta \log X\) | If \(X \uparrow \) 1%, then \(y \uparrow\) by \(\beta\)% |
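To make the last row concrete, here is a minimal sketch using the diamonds data from seaborn (the same data used in the examples below): in a log-log regression, the coefficient reads directly as an elasticity.

# a minimal sketch: in a log-log model, the coefficient is an elasticity
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols

diamonds = sns.load_dataset('diamonds')
diamonds = diamonds.assign(lprice=np.log(diamonds['price']),
                           lcarat=np.log(diamonds['carat']))

b = sm_ols('lprice ~ lcarat', data=diamonds).fit().params['lcarat']
print(f"A 1% increase in carats is associated with a ~{b:.2f}% higher price")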
Note
This table should help you see why log transformations are useful: They model proportional relationships. That is, instead of focusing on 1 unit changes in X and Y (i.e. “linear” changes), they model percent changes in X and/or Y!
6.4.2. If X is a binary variable¶
This is a categorical or qualitative variable with two values (a.k.a. a "dummy"). E.g., gender in Census data, or the "ideal" variable used in the diamonds examples on this page.
Usually, we encode one value as zero, and the other as one before we include it in the regression. This makes interpretation simple, as it just follows from the previous table, since a “1 unit change in X” simply means changing from the baseline group encoded as zero to the other group encoded as one.
| If the model is… | then \(\beta\) means |
|---|---|
| \(y=a+\beta X\) | \(y\) is \(\beta\) units higher for cases when \(X=1\) than when \(X=0\) |
| \(\log y=a+\beta X\) | \(y\) is about \(100*\beta\)% higher for cases when \(X=1\) than when \(X=0\) |
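For instance, here is a quick sketch (again with the seaborn diamonds data) confirming that the coefficient on a dummy equals the difference in average \(\log y\) between the two groups:

# sketch: the coefficient on a dummy equals the difference in group means
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols

diamonds = sns.load_dataset('diamonds')
diamonds = diamonds.assign(lprice=np.log(diamonds['price']),
                           ideal=(diamonds['cut'] == 'Ideal').astype(int))

print(sm_ols('lprice ~ ideal', data=diamonds).fit().params)
# the 'ideal' coefficient should match the difference in group averages:
print(diamonds.groupby('ideal')['lprice'].mean().diff())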
6.4.3. If X is a categorical variable¶
Suppose X has three categories, and let’s just call them 0, 1, and 2. To run this regression, first create two variables: \(X_1\) and \(X_2\), which are binary variables indicating if an observation’s value of X equals the subscript. So:
| If X (original variable) is | Then \(X_1=\) | and \(X_2=\) |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
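In pandas, this encoding step is one function call. A minimal sketch (the column name X and the data are made up):

# sketch: creating the X1/X2 dummy columns with pandas (data is made up)
import pandas as pd

df = pd.DataFrame({'X': [0, 1, 2, 1, 0]})
# drop_first=True omits X=0, which becomes the baseline ("omitted") group
dummies = pd.get_dummies(df['X'], prefix='X', drop_first=True)
print(df.join(dummies))  # columns X_1 and X_2 match the table above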
Then, we run a regression of \(y\) on \(X_1\) and \(X_2\). The way we interpret the coefficients is:
| If the model is… | \(a\) means | then \(\beta_1\) means | then \(\beta_2\) means |
|---|---|---|---|
| \(y=a+\beta_1 X_1 +\beta_2 X_2\) | the average value of \(y\) for group 0 (because \(X_1=X_2=0\) if \(X=0\)) | \(y\) is \(\beta_1\) units higher on average for cases when \(X=1\) than when \(X=0\) | \(y\) is \(\beta_2\) units higher on average for cases when \(X=2\) than when \(X=0\) |
| \(\log y=a+\beta_1 X_1 +\beta_2 X_2\) | the average value of \(\log y\) for group 0 (because \(X_1=X_2=0\) if \(X=0\)) | \(y\) is about \(100*\beta_1\)% higher on average for cases when \(X=1\) than when \(X=0\) | \(y\) is about \(100*\beta_2\)% higher on average for cases when \(X=2\) than when \(X=0\) |
Tip
The interpretation of \(\beta_{oneLevelOfACategoricalVariable}\) is the same as for a binary variable (use the table above, depending on whether the model uses \(y\) or \(\log y\)), except that it captures the jump from the "omitted group" (\(X=0\) above) to whichever level that particular \(\beta\) captures.
Students often get confused by this at first, so let’s do an example.
Suppose we model the price of a diamond as a function of its cut and nothing else. This is close to what we did previously. This ends up looking like

\[
\log(\text{price}) = a + \beta \cdot C(\text{cut}) + u
\]

To do this, you take the cut variable (cut={Fair,Good,Very Good,Premium,Ideal}) and create a dummy variable for "Fair", a dummy variable for "Good", a dummy variable for "Very Good", and a dummy variable for "Premium". (But not for "Ideal"!) The "statsmodels formula" approach to specifying the regression does this step automatically for you!

So now, your model can be rewritten in one line and used in a regression as

\[
\log(\text{price}) = a + \beta_{Fair}\,\text{Fair} + \beta_{Good}\,\text{Good} + \beta_{VeryGood}\,\text{VeryGood} + \beta_{Premium}\,\text{Premium} + u
\]
And we interpret these like this:
| \(\beta\)… | means | or |
|---|---|---|
| \(\beta_{Premium}\) | The average log(price) for Premium diamonds is \(100*\beta_{Premium}\)% higher than Ideal diamonds | \(avg_{Premium}-avg_{Ideal}\) |
| \(\beta_{VeryGood}\) | The average log(price) for Very Good diamonds is \(100*\beta_{VeryGood}\)% higher than Ideal diamonds | \(avg_{VeryGood}-avg_{Ideal}\) |
| \(\beta_{Good}\) | The average log(price) for Good diamonds is \(100*\beta_{Good}\)% higher than Ideal diamonds | \(avg_{Good}-avg_{Ideal}\) |
| \(\beta_{Fair}\) | The average log(price) for Fair diamonds is \(100*\beta_{Fair}\)% higher than Ideal diamonds | \(avg_{Fair}-avg_{Ideal}\) |
So if we run this regression, we get these coefficients:
# load some data to practice regressions
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols # need this
diamonds = sns.load_dataset('diamonds')
# this alteration is not strictly necessary to practice a regression
# but we use this in livecoding
diamonds2 = (diamonds.query('carat < 2.5') # censor/remove outliers
.assign(lprice = np.log(diamonds['price'])) # log transform price
.assign(lcarat = np.log(diamonds['carat'])) # log transform carats
.assign(ideal = diamonds['cut'] == 'Ideal')
# some regression packages want you to explicitly provide
# a variable for the constant
.assign(const = 1)
)
print(sm_ols('lprice ~ C(cut)', data=diamonds2).fit().params)
Intercept 7.636921
C(cut)[T.Premium] 0.307769
C(cut)[T.Very Good] 0.158754
C(cut)[T.Good] 0.199155
C(cut)[T.Fair] 0.431911
dtype: float64
THE MAIN THING TO REMEMBER IS THAT \(\beta_{value}\) COMPARES THAT \(value\) TO THE OMITTED CATEGORY!
So, \(\beta_{Good}=0.199\) implies that “Good” cut diamonds are about 20% more expensive than “Ideal” diamonds. (Weird?)
If we add \(\alpha\) (the average log price of "Ideal" diamonds) to \(\beta_{Good}\), we get \(\beta_{Good}+\alpha=0.199+7.637=7.836\). This should be the average log price of "Good" diamonds.
Let’s check:
diamonds2.groupby('cut')['lprice'].mean() # avg lprice by cut
cut
Ideal 7.636921
Premium 7.944690
Very Good 7.795675
Good 7.836076
Fair 8.068832
Name: lprice, dtype: float64
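One practical note: which category gets omitted depends on how the categorical variable is ordered in the data. If you want to choose the baseline yourself, the patsy formula language used by statsmodels lets you set it with Treatment. A sketch, reusing diamonds2 from above:

# sketch: pick the omitted (baseline) category explicitly with Treatment
# here each coefficient compares that cut to "Fair" instead of "Ideal"
print(sm_ols('lprice ~ C(cut, Treatment(reference="Fair"))',
             data=diamonds2).fit().params)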
6.4.4. If X is an interaction term¶
Previously, we estimated

\[
\log(\text{price}) = a + 1.53 \cdot \log(\text{carat}) + 0.33 \cdot \text{Ideal} + 0.18 \cdot \log(\text{carat}) \cdot \text{Ideal}
\]
There are two natural questions:

1. What is the impact of X on y? (Specifically, what is the total impact of diamond size on price?)
2. What does the interaction term's coefficient mean?

To answer Q1, take the derivative of y with respect to X:
- Relationship of size on price: \(\frac{\partial \log(\text{price})}{\partial \log(\text{carat})} = 1.53 + 0.18 \cdot \text{Ideal}\)
  - A 1% increase in size is associated with a 1.53% higher price for non-ideal diamonds
  - A 1% increase in size is associated with a 1.71% higher price for ideal diamonds
- Relationship of cut on price: \(\frac{\partial \log(\text{price})}{\partial \text{Ideal}} = 0.33 + 0.18 \cdot \log(\text{carat})\)
  - For 1 carat diamonds (\(\log(1)=0\)), ideal diamonds are 33% more expensive than non-ideal diamonds
  - For 2 carat diamonds (\(\log(2)=0.693\)), ideal diamonds are about 45% more expensive than non-ideal diamonds
Q2: How do you interpret \(\beta_3=0.18\)? I recommend revisiting that link above too, but I'll summarize as:

- \(\beta_3 \neq 0\) implies that the relationship between carat size and price is different for ideal and non-ideal diamonds.
  - Mathematically: a \(1\% \uparrow\) in carat \(\rightarrow\) price increases by 1.53% for non-ideal but 1.71% for ideal diamonds
  - Graphically: the difference in the slope of the carat-price relationship for ideal vs. non-ideal diamonds is \(\beta_3\)
  - Economically: you might say that a larger diamond is even more valuable when the cut is better
- \(\beta_3 \neq 0\) implies that the relationship between cut quality and price is different for diamonds of different sizes.
  - Mathematically: 1 carat ideal diamonds are 33% more expensive than non-ideal diamonds, but 2 carat ideal diamonds are 45% more expensive than non-ideal diamonds
  - Graphically: the difference in the cut quality-price relationship for small vs. large diamonds is \(\beta_3\)
  - Economically: you might say that a better cut is even more valuable for larger diamonds
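Here is a minimal sketch of estimating this interaction model with the statsmodels formula interface, reusing diamonds2 from above (in a formula, `lcarat * ideal` expands to both variables plus their interaction):

# sketch: lcarat * ideal expands to lcarat + ideal + lcarat:ideal
print(sm_ols('lprice ~ lcarat * ideal', data=diamonds2).fit().params)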
6.4.5. If other controls are included¶
Tip
Always keep the “holding all other controls constant” mantra in mind!
In reality, independent variables in X often move together. So the marginal effect of X (i.e. \(\beta\)) is not the same as the total effect of X.
If you have many control variables (up to N controls), your model is

\[
y = a + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_N X_N + u
\]
\(\beta_1\) estimates the expected change in Y for a 1 unit increase in \(X_1\) (as we covered above), holding all other controls constant!
As a hypothetical illustration, if Y = number of tackles by a football player in a year, \(W\) is weight, and \(H\) is height, suppose we estimate that

\[
\widehat{\text{tackles}} = \hat{a} + \hat{\beta_1} W + \hat{\beta_2} H
\]
How do you interpret \(\hat{\beta_2} < 0 \) on H?
This regression implies that, for a given weight (holding weight fixed), taller players average fewer tackles. In other words, skinny football players get fewer tackles.
This regression does NOT imply that taller players average fewer tackles. In the real world, what we call "independent variables" in a regression often move together: taller players are likely to be heavier. So if a 1 inch increase in height typically comes with some weight gain, the total impact of height on tackles (i.e., not holding other factors constant) combines the direct effect of height (\(\hat{\beta_2} < 0 \)) with the effect of the accompanying weight change via \(\hat{\beta_1}\).
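To see this mechanically, here is a simulated sketch (all numbers are made up) where height's marginal effect on tackles is negative, yet its total effect is positive because height and weight move together:

# sketch with made-up data: height's total effect on tackles is positive,
# even though its marginal effect (holding weight fixed) is negative
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(1)
n = 1000
H = rng.normal(72, 3, n)                 # height (inches)
W = 3.0 * H + rng.normal(0, 10, n)       # taller players tend to be heavier
tackles = 0.5 * W - 1.0 * H + rng.normal(0, 5, n)
players = pd.DataFrame({'tackles': tackles, 'W': W, 'H': H})

print(sm_ols('tackles ~ W + H', data=players).fit().params)  # H coef ≈ -1 (marginal)
print(sm_ols('tackles ~ H', data=players).fit().params)      # H coef ≈ +0.5 (total)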
6.4.6. If other categorical controls are included¶
Tip
Check out this page to see how to print out regression tables when your categorical variables have lots of levels. E.g., if you use states, your regression will by default show 50 rows of output for just that one categorical variable.
Suppose you estimate \(profits=a+b*investment+c*X+u\), and you want to focus on \(b\) to capture how investments translate to profits. You’ve added some control variables X, but you’re still worried that this regression will get the relationship wrong because different industries have different profit margins for reasons that have nothing to do with investment levels.
In other words, you want to “control for industry”. So you estimate \(profits=a+b*investment+c*X+d*C(gsector)+u\), by including the firm’s industry as a categorical control.
Note
When you add industry to a regression as a categorical variable, it is called including “industry fixed effects”.
What does \(b\) mean now? The lessons above still apply: It is the relationship between investment and profits, but now the model also absorbs each industry's average profit level (across all firms in the industry, over all years of our sample). So we are estimating \(b\) while holding the industry's profit level (again, over the whole time period of analysis) fixed.
Intuitively, you can think of \(b\) in this model as “comparing firms in the same industry” or “controlling for industry factors”.
This should go a decent way towards solving your worry above.
Similarly, you might be worried that some years are at high points in the business cycle, and these years have concurrently high investment and profits simply because of the business cycle. This would cause \(b\) to be positive even if investment does not lead to profits.
So you might estimate \(profits=a+b*investment+c*X+d*C(year)+u\). This is often referred to as “year fixed effects”, and it means that your estimate of \(b\) removes the impact of years, and presumably, the business cycle.
Intuitively, you can think of \(b\) in this model as “comparing firms in the same year” or “controlling for time trends in profits”.
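As a sketch of what these fixed effects look like in code (the firms DataFrame and its columns are made up, standing in for a real firm-year panel):

# sketch: industry and year fixed effects as categorical controls
# the `firms` DataFrame is made up, standing in for a firm-year panel
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(0)
n = 300
firms = pd.DataFrame({
    'investment': rng.normal(size=n),
    'X': rng.normal(size=n),                       # some other control
    'gsector': rng.choice(['10', '20', '45'], n),  # industry code
    'year': rng.choice([2018, 2019, 2020], n),
})
firms['profits'] = 0.5 * firms['investment'] + rng.normal(size=n)

# C(...) adds a dummy per industry and per year ("fixed effects")
print(sm_ols('profits ~ investment + X + C(gsector) + C(year)',
             data=firms).fit().params)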
Here is an example to see how categorical controls can interact with “normal” continuous variables.
Remember our weird result earlier? That better cut diamonds had lower average prices?
The answer to that puzzle is pretty simple: Better cut diamonds tend to be smaller, and size is the most important determinant of a diamond's price. The model results are shown below.

By adding carat size back to our model, we get a sensible result: going from an Ideal cut to a Fair cut (a big downgrade) is associated with a roughly 31% lower price, as long as we compare similarly sized diamonds ("controlling for diamond size").
print(sm_ols('lprice ~ lcarat + C(cut)', data=diamonds2).fit().summary())
OLS Regression Results
==============================================================================
Dep. Variable: lprice R-squared: 0.937
Model: OLS Adj. R-squared: 0.937
Method: Least Squares F-statistic: 1.613e+05
Date: Thu, 25 Mar 2021 Prob (F-statistic): 0.00
Time: 12:49:44 Log-Likelihood: -2389.9
No. Observations: 53797 AIC: 4792.
Df Residuals: 53791 BIC: 4845.
Df Model: 5
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 8.5209 0.002 4281.488 0.000 8.517 8.525
C(cut)[T.Premium] -0.0790 0.003 -28.249 0.000 -0.084 -0.074
C(cut)[T.Very Good] -0.0770 0.003 -26.656 0.000 -0.083 -0.071
C(cut)[T.Good] -0.1543 0.004 -38.311 0.000 -0.162 -0.146
C(cut)[T.Fair] -0.3111 0.007 -46.838 0.000 -0.324 -0.298
lcarat 1.7014 0.002 889.548 0.000 1.698 1.705
==============================================================================
Omnibus: 792.280 Durbin-Watson: 1.261
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1178.654
Skew: 0.168 Prob(JB): 1.14e-256
Kurtosis: 3.643 Cond. No. 7.20
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
6.4.7. Comparing the size of two coefficients¶
Earlier, we estimated that \(\log price = \hat{8.41} + \hat{1.69} \log carat + \hat{0.10} ideal\).
So… I have questions:

- Does that mean that the size of the diamond (\(\log carat\)) has a 17 times larger impact on price than the cut (\(ideal\))?
- How do we compare those magnitudes?
- More generally, how do we compare the magnitudes of any 2 control variables?

To which, I'd say that how "big" a coefficient is depends on the variable!

- For some variables, an increase of 1 unit is common (e.g. our \(ideal\) dummy is one 40% of the time)
- For some variables, an increase of 1 unit is rare (e.g. \(cash/assets\))
- \(\rightarrow\) the meaning of a coefficient's magnitude depends on the corresponding variable's variation!
- \(\rightarrow\) so rescale variables so that a "1 unit increase" implies the same amount of movement in each
A great trick for comparing coefficient size
Scale continuous variables by their standard deviation!
Warning
Only scale continuous variables! Don’t scale dummy variables or categorical variables!
Here is that solution in action:
standardize = lambda x: x/x.std() # standardize(df['x']) will divide all 'x' by the std deviation of 'x'
print("Divide lcarat by its std dev:\n")
print(sm_ols('lprice ~ lcarat + ideal',
# for **just** this regression, divide
data=diamonds2.assign(lcarat = standardize(diamonds2['lcarat']))
# this doesn't change the diamonds2 data permanently, so the next time you call on
# diamonds2, you can use lcarat as if nothing changed. if you want to repeat this
# a bunch, you might instead create and save a permanent variable called "lcarat_std"
# where "_std" indicates that you divided it by the std dev.
).fit().params)
print("\n\nThe original reg:\n")
print(sm_ols('lprice ~ lcarat + ideal',data=diamonds2 ).fit().params)
Divide lcarat by its std dev:
Intercept 8.418208
ideal[T.True] 0.100013
lcarat 0.985628
dtype: float64
The original reg:
Intercept 8.418208
ideal[T.True] 0.100013
lcarat 1.696259
dtype: float64
So a 1 standard deviation increase in \(\log carat\) is associated with a 98% increase in price. Compared to \(ideal\), we can say that a reasonable variation in carat size is associated with a price increase about 10 times larger than the impact of cut, not 17 times larger.
Also, notice that the new coefficient (0.98) is about 58% of the original coefficient (1.69).
Q: Why is it 58% of the previous coefficient?
A: Because the standard deviation of \(\log carat\) is 0.58!
This works because, if "\(std\)" stands for the standard deviation of \(\log carat\), each of these steps is valid and doesn't change the estimation:

\[
\begin{aligned}
\log price &= a + 1.69 \cdot \log carat + 0.10 \cdot ideal \\
           &= a + 1.69 \cdot \left(std \cdot \frac{\log carat}{std}\right) + 0.10 \cdot ideal \\
           &= a + \left(1.69 \cdot std\right) \cdot \frac{\log carat}{std} + 0.10 \cdot ideal \\
           &\approx a + 0.98 \cdot \frac{\log carat}{std} + 0.10 \cdot ideal
\end{aligned}
\]
So what that last line shows is that if we divide the variable by its standard deviation, the coefficient will change by an offsetting amount.
If a variable has a small standard deviation (e.g. 0.10), dividing the variable by 0.10 before running the regression will reduce the coefficient by 90%.
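You can verify this relationship directly with diamonds2 from above; a quick check (values in the comments are approximate):

# check: scaled coefficient = original coefficient * std dev of the variable
std = diamonds2['lcarat'].std()
print(std)              # roughly 0.58
print(1.696259 * std)   # roughly 0.986, matching the scaled coefficient above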