5.2.4. Interpreting regression coefficients

Revisiting the regression objectives: After this page,

  1. You can interpret the mechanical meaning of the coefficients for
    • continuous variables
    • binary variables (a.k.a. “dummy” variables)
    • categorical a.k.a. qualitative variables with two or more values
    • interaction terms between two X variables (which change the interpretation)
    • variables in models with other controls included (including categorical controls)

Tip

A regression of \(y\) on \(N\) different variables takes the form

\[ y = a+b_1*X_1+b_2*X_2+...+b_N*X_N+u \]

The generic interpretation of any of the \(b\) coefficients is a sentence in three parts:

  1. “A 1 unit increase in \(X_i\)

  2. is associated with a \(b_i\) change in \(y\),…

  3. holding all other X constant.”

5.2.4.1. If X is a continuous variable

| If the model is…           | then \(\beta\) means (approx. in log cases)                                                              |
|----------------------------|----------------------------------------------------------------------------------------------------------|
| \(y=a+\beta X\)            | If \(X \uparrow \) 1 unit, then \(y \uparrow\) by \(\beta\) units                                          |
| \(\log y=a+\beta X\)       | If \(X \uparrow \) 1 unit, then \(y \uparrow\) by about \(100*\beta\)%. (Exact: \(100*(\exp(\beta)-1)\)%)  |
| \(y=a+\beta \log X\)       | If \(X \uparrow \) 1%, then \(y \uparrow\) by about \(\beta / 100\) units                                  |
| \(\log y=a+\beta \log X\)  | If \(X \uparrow \) 1%, then \(y \uparrow\) by \(\beta\)%                                                   |

Note

This table should help you see why log transformations are useful: They model proportional relationships. That is, instead of focusing on 1 unit (i.e. “linear”) changes, they model percent changes in X and/or y!
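To make this table concrete, here is a minimal sketch estimating all four functional forms on seaborn’s diamonds data (the same dataset we use below; the price/carat pairing is just an illustrative choice):

# estimate each of the four functional forms on the diamonds data
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols

diamonds = sns.load_dataset('diamonds')

print(sm_ols('price ~ carat', data=diamonds).fit().params)                 # level-level
print(sm_ols('np.log(price) ~ carat', data=diamonds).fit().params)         # log-level
print(sm_ols('price ~ np.log(carat)', data=diamonds).fit().params)         # level-log
print(sm_ols('np.log(price) ~ np.log(carat)', data=diamonds).fit().params) # log-log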

5.2.4.2. If X is a binary variable

This is a categorical or qualitative variable with two values (a.k.a. a “dummy” variable). E.g., gender in Census data, or the "ideal" variable above.

Usually, we encode one value as zero, and the other as one before we include it in the regression. This makes interpretation simple, as it just follows from the previous table, since a “1 unit change in X” simply means changing from the baseline group encoded as zero to the other group encoded as one.

| If the model is…        | then \(\beta\) means                                                            |
|-------------------------|----------------------------------------------------------------------------------|
| \(y=a+\beta X\)         | \(y\) is \(\beta\) units higher for cases when \(X=1\) than when \(X=0\).         |
| \(\log y=a+\beta X\)    | \(y\) is about \(100*\beta\)% higher for cases when \(X=1\) than when \(X=0\).    |
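As a quick illustration, the sketch below (building a 0/1 "ideal" dummy from the diamonds data, much like we do later on this page) shows that the coefficient on a dummy is exactly the difference in the two groups’ average \(y\):

# beta on a 0/1 dummy equals the difference in group averages of y
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols

df = (sns.load_dataset('diamonds')
      .assign(lprice = lambda d: np.log(d['price']),
              ideal  = lambda d: (d['cut'] == 'Ideal').astype(int)))  # 0/1 dummy

print(sm_ols('lprice ~ ideal', data=df).fit().params)
print(df.groupby('ideal')['lprice'].mean())  # beta = (mean when ideal=1) - (mean when ideal=0)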

5.2.4.3. If X is a categorical variable

Suppose X has three categories, and let’s just call them 0, 1, and 2. To run this regression, first create two variables: \(X_1\) and \(X_2\), which are binary variables indicating if an observation’s value of X equals the subscript. So:

| If X (original variable) is | Then \(X_1=\) | and \(X_2=\) |
|-----------------------------|---------------|--------------|
| 0                           | 0             | 0            |
| 1                           | 1             | 0            |
| 2                           | 0             | 1            |

Then, we run a regression of \(y\) on \(X_1\) and \(X_2\). The way we interpret the coefficients is:

| If the model is…                          | \(a\) means | then \(\beta_1\) means | then \(\beta_2\) means |
|-------------------------------------------|-------------|------------------------|------------------------|
| \(y=a+\beta_1 X_1 +\beta_2 X_2\)          | the average value of \(y\) is \(a\) for group 0 (because \(X_1=X_2=0\) if \(X=0\)) | \(y\) is \(\beta_1\) units higher on average for cases when \(X=1\) than when \(X=0\). | \(y\) is \(\beta_2\) units higher on average for cases when \(X=2\) than when \(X=0\). |
| \(\log y=a+\beta_1 X_1 +\beta_2 X_2\)     | the average value of \(\log y\) is \(a\) for group 0 (because \(X_1=X_2=0\) if \(X=0\)) | \(y\) is about \(100*\beta_1\)% higher on average for cases when \(X=1\) than when \(X=0\). | \(y\) is about \(100*\beta_2\)% higher on average for cases when \(X=2\) than when \(X=0\). |

Tip

The interpretation of \(\beta_{oneLevelOfACategoricalVariable}\) is the same as for a binary variable (use the table above, depending on whether the model uses \(y\) or \(\log y\)), except that it captures the jump from the “omitted group” (X=0 above) to whichever level that particular \(\beta\) represents.


Students often get confused by this at first, so let’s do an example.

Suppose we model the price of a diamond as a function of its cut and nothing else. This is close to what we did previously. This ends up looking like

\[\begin{split} \log(\text{price})= \begin{cases} a, & \text{if cut is Ideal} \\ a +\beta_{Premium}, & \text{if cut is Premium} \\ a +\beta_{Very Good}, & \text{if cut is Very Good} \\ a +\beta_{Good}, & \text{if cut is Good} \\ a +\beta_{Fair}, & \text{if cut is Fair} \end{cases} \end{split}\]

To do this, you take the cut variable (cut={Fair,Good,Very Good,Premium,Ideal}) and create a dummy variable for “Fair”, a dummy variable for “Good”, a dummy variable for “Very Good”, and a dummy variable for “Premium”. (But not for “Ideal”!) The “statsmodels formula” approach to specifying the regression does this step automatically for you!
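If you want to see (or do) that step manually, a minimal pandas sketch might look like this (dropping the “Ideal” column makes it the omitted baseline group):

# manually create the dummy columns that the formula approach makes for you
import pandas as pd
import seaborn as sns

diamonds = sns.load_dataset('diamonds')

dummies = pd.get_dummies(diamonds['cut'], prefix='cut')  # one 0/1 column per cut level
dummies = dummies.drop(columns='cut_Ideal')              # omit "Ideal" as the baseline group
print(dummies.head())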

So now, your model can be rewritten in one line and used in a regression as

\[ \log(\text{price}) = a + \beta_{Premium}*X_{Premium} + \beta_{Very Good}*X_{Very Good} + \beta_{Good}*X_{Good} + \beta_{Fair}*X_{Fair} + u \]

And we interpret these like this:

| \(\beta\)…              | means                                                                                                                           | or                                |
|-------------------------|----------------------------------------------------------------------------------------------------------------------------------|------------------------------------|
| \(\beta_{Premium}\)     | The average log(price) for Premium diamonds is \(\beta_{Premium}\) higher than for Ideal diamonds, i.e. prices are about \(100*\beta_{Premium}\)% higher. | \(avg_{Premium}-avg_{Ideal}\)      |
| \(\beta_{Very Good}\)   | The average log(price) for Very Good diamonds is \(\beta_{Very Good}\) higher than for Ideal diamonds, i.e. prices are about \(100*\beta_{Very Good}\)% higher. | \(avg_{Very Good}-avg_{Ideal}\)    |
| \(\beta_{Good}\)        | The average log(price) for Good diamonds is \(\beta_{Good}\) higher than for Ideal diamonds, i.e. prices are about \(100*\beta_{Good}\)% higher. | \(avg_{Good}-avg_{Ideal}\)         |
| \(\beta_{Fair}\)        | The average log(price) for Fair diamonds is \(\beta_{Fair}\) higher than for Ideal diamonds, i.e. prices are about \(100*\beta_{Fair}\)% higher. | \(avg_{Fair}-avg_{Ideal}\)         |

(Here, \(avg_{g}\) denotes the average log(price) for diamonds with cut \(g\).)

So if we run this regression, we get these coefficients:

# load some data to practice regressions
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols as sm_ols # need this

diamonds = sns.load_dataset('diamonds')

# this alteration is not strictly necessary to practice a regression
# but we use this in livecoding
diamonds2 = (diamonds.query('carat < 2.5')               # censor/remove outliers
            .assign(lprice = np.log(diamonds['price']))  # log transform price
            .assign(lcarat = np.log(diamonds['carat']))  # log transform carats
            .assign(ideal = diamonds['cut'] == 'Ideal') 
             
             # some regression packages want you to explicitly provide 
             # a variable for the constant
            .assign(const = 1)                           
            )  
print(sm_ols('lprice ~ C(cut)', data=diamonds2).fit().params)
Intercept              7.636921
C(cut)[T.Premium]      0.307769
C(cut)[T.Very Good]    0.158754
C(cut)[T.Good]         0.199155
C(cut)[T.Fair]         0.431911
dtype: float64

THE MAIN THING TO REMEMBER IS THAT \(\beta_{value}\) COMPARES THAT \(value\) TO THE OMITTED CATEGORY!

So, \(\beta_{Good}=0.199\) implies that “Good” cut diamonds are about 20% more expensive than “Ideal” diamonds. (Weird?)

If we add \(a\) (the average log price of “Ideal” diamonds) to \(\beta_{Good}\), we get \(\beta_{Good}+a=0.199+7.637=7.836\). This should be the average log price of “Good” diamonds.

Let’s check:

diamonds2.groupby('cut')['lprice'].mean() # avg lprice by cut
cut
Ideal        7.636921
Premium      7.944690
Very Good    7.795675
Good         7.836076
Fair         8.068832
Name: lprice, dtype: float64

5.2.4.4. If X is an interaction term

Previously, we estimated

\[ \log(\text{price})= 8.2+ 1.53 *\log(\text{carat}) + 0.33* \text{Ideal} + 0.18* \log(\text{carat})\cdot \text{Ideal} \]

There are two natural questions:

  1. What is the impact of X on y? (Specifically, what is the total impact of diamond size on price?)

  2. What does the interaction term’s coefficient mean?

To answer Q1 (the relationship of X on y), take the derivative of \(y\) with respect to \(X\); the sketch after these bullets double-checks the numbers:

  • Relationship of size on price: \(1.53 + 0.18*Ideal\).

    • A 1% increase in size is associated with a 1.53% higher price for non-ideal diamonds

    • A 1% increase in size is associated with a 1.71% higher price for ideal diamonds

  • Relationship of cut on price: \(0.33 + 0.18*\log(\text{carat})\).

    • For 1 carat diamonds (\(\log(1)=0\)), ideal diamonds are 33% more expensive than non-ideal diamonds

    • For 2 carat diamonds (\(\log(2)=0.693\)), ideal diamonds are 45% more expensive than non-ideal diamonds
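Here is a minimal sketch verifying those slopes by re-estimating the interaction model on the diamonds2 data constructed above (the `*` in the formula adds both variables and their interaction; it should roughly reproduce the coefficients above, and the parameter name 'lcarat:ideal[T.True]' follows the labeling shown in the earlier output):

# re-estimate the interaction model and recover the implied slopes
inter = sm_ols('lprice ~ lcarat * ideal', data=diamonds2).fit()
print(inter.params)

b = inter.params
print('size-price slope, non-ideal:', b['lcarat'])
print('size-price slope, ideal:    ', b['lcarat'] + b['lcarat:ideal[T.True]'])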

Q2: How do you interpret \(\beta_2=0.18\)? I recommend revisiting that link above too, but I’ll summarize as:

  1. \(\beta_2 \neq 0\) implies that the relationship between carat size and price is different for ideal and non-ideal diamonds.

    • Mathematically: \(1\% \uparrow\) in carat \(\rightarrow\) price increases by 1.53% for non-ideal but 1.71% for ideal

    • Graphically, the difference in the slope of those carat-price relationships for ideal/non-ideal diamonds is \(\beta_2\)

    • Economically, you might say that a larger stone is even more valuable for better cut diamonds

  2. \(\beta_2 \neq 0\) implies that the relationship between cut quality and price is different for diamonds of different sizes.

    • Mathematically: 1 carat diamonds that are ideal are 33% more expensive than non-ideal diamonds, but 2 carat ideal diamonds are 45% more expensive than non-ideal diamonds

    • Graphically, the difference in the slope of those cut quality-price relationships for small/large diamonds is \(\beta_2\)

    • Economically, you might say that a better cut is even more valuable for larger diamonds

5.2.4.5. If other controls are included

Tip

  1. Always keep the “holding all other controls constant” mantra in mind!

  2. In reality, independent variables in X often move together. So the marginal effect of X (i.e. \(\beta\)) is not the same as the total effect of X.

If you have many control variables (\(N\) controls):

\[ y = a +\beta_1 X_1+ \beta_2 X_2 + ... +\beta_N X_N+ u \]

\(\beta_1\) estimates the expected change in Y for a 1 unit increase in \(X_1\) (as we covered above), holding all other controls constant!

As an illustration, if \(y\) = number of tackles by a football player in a year, \(W\) is weight, and \(H\) is height, suppose we estimate that

\[ y = a + 0.5 W - 0.1 H \]

How do you interpret the negative coefficient (\(-0.1\)) on \(H\)?

This regression implies that, for a given weight (holding weight fixed), taller players average fewer tackles. In other words, skinny football players get fewer tackles.

Also remember that predictors often change together!

For example, taller players are likely to be heavier. So if a 1 inch increase in height typically comes with a weight gain, the total impact of height on tackles (i.e. not holding weight constant) will include the estimated impact of that extra weight.
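Here is a small simulation of this point (all numbers entirely made up for illustration): the marginal effect of height is negative, but because height and weight move together, the total effect of height is positive.

# simulate made-up tackle data where height and weight are correlated
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(0)
n = 1000
height  = rng.normal(74, 3, n)                         # inches
weight  = 3.5*height + rng.normal(0, 15, n)            # taller players tend to be heavier
tackles = 0.5*weight - 0.1*height + rng.normal(0, 5, n)
df = pd.DataFrame({'tackles': tackles, 'height': height, 'weight': weight})

# marginal effect of height (holding weight fixed): about -0.1
print(sm_ols('tackles ~ height + weight', data=df).fit().params)
# total effect of height (weight free to move with it): positive
print(sm_ols('tackles ~ height', data=df).fit().params)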

5.2.4.6. If other categorical controls are included

Suppose you estimate \(profits=a+b*investment+c*X+u\), and you want to focus on \(b\) to capture how investments translate to profits. You’ve added some control variables X, but you’re still worried that this regression will get the relationship wrong, because different industries have different profit margins for reasons that have nothing to do with investment levels.

In other words, you want to “control for industry”. So you estimate \(profits=a+b*investment+c*X+d*C(gsector)+u\), by including the firm’s industry as a categorical control.

What does \(b\) mean now? Well, the lessons above apply: It is still the relationship between investment and profits, but now the model also captures each industry's average profit level (across all firms in the industry, over all years of our sample). So we are holding the industry profit level (again, over the whole time period of analysis) fixed.

Intuitively, you can think of \(b\) in this model as “comparing firms in the same industry” or “controlling for industry factors”.

This should go a decent way toward solving your worry above.


Similarly, you might be worried that some years are at high points in the business cycle, and these years have concurrently high investment and profits simply because of the business cycle. This would cause \(b\) to be positive even if investment does not lead to profits.

So you might estimate \(profits=a+b*investment+c*X+d*C(year)+u\). This is often referred to as “year fixed effects”, and it means that your estimate of \(b\) removes the impact of years, and presumably, the business cycle.

Intuitively, you can think of \(b\) in this model as “comparing firms in the same year” or “controlling for time trends in profits”.


Here is an example to see how categorical controls can interact with “normal” continuous variables.

Remember our weird result earlier? That better cut diamonds had lower average prices?

The answer to that puzzle is pretty simple: Better cut diamonds tend to be smaller, and size is the most important aspect of diamond price. The model results are shown below.
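You can verify the first claim directly (using the diamonds2 data from above):

diamonds2.groupby('cut')['carat'].mean() # better cuts should show smaller average carats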

By adding carat size back into our model, we get the sensible result: going from an Ideal cut to a Fair cut diamond (a big downgrade) is associated with a 31% decrease in price, as long as we compare similarly sized diamonds (“controlling for diamond size”).

print(sm_ols('lprice ~ lcarat + C(cut)', data=diamonds2).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 lprice   R-squared:                       0.937
Model:                            OLS   Adj. R-squared:                  0.937
Method:                 Least Squares   F-statistic:                 1.613e+05
Date:                Thu, 25 Mar 2021   Prob (F-statistic):               0.00
Time:                        12:49:44   Log-Likelihood:                -2389.9
No. Observations:               53797   AIC:                             4792.
Df Residuals:                   53791   BIC:                             4845.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               8.5209      0.002   4281.488      0.000       8.517       8.525
C(cut)[T.Premium]      -0.0790      0.003    -28.249      0.000      -0.084      -0.074
C(cut)[T.Very Good]    -0.0770      0.003    -26.656      0.000      -0.083      -0.071
C(cut)[T.Good]         -0.1543      0.004    -38.311      0.000      -0.162      -0.146
C(cut)[T.Fair]         -0.3111      0.007    -46.838      0.000      -0.324      -0.298
lcarat                  1.7014      0.002    889.548      0.000       1.698       1.705
==============================================================================
Omnibus:                      792.280   Durbin-Watson:                   1.261
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1178.654
Skew:                           0.168   Prob(JB):                    1.14e-256
Kurtosis:                       3.643   Cond. No.                         7.20
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

5.2.4.7. Comparing the size of two coefficients

Earlier, we estimated that \(\log price = \hat{8.41} + \hat{1.69} \log carat + \hat{0.10} ideal\).

So… I have questions:

  1. Does that mean that the size of the diamond (\(\log carat\)) has a 17 times larger impact than the cut (\(ideal\)) in terms of price impact?

  2. How do we compare those magnitudes?

  3. More generally, how do we compare the magnitudes of any 2 control variables?

To which, I’d say that how “big” a coefficient is depends on the variable!

  • For some variables, an increase of 1 unit is common (e.g. our \(ideal\) dummy is one 40% of the time)

  • For some variables, an increase of 1 unit is rare (e.g. \(cash/assets\))

  • \(\rightarrow\) the meaning of the coefficient’s magnitude depends on the corresponding variable’s variation!

  • \(\rightarrow\) so change variables so that a “1 unit increase” implies the same amount of movement

A great trick for comparing coefficient size

Scale continuous variables by their standard deviation!

Warning

(Only continuous variables! Don’t do this for dummy variables or categorical variables)

Here is that solution in action:

standardize = lambda x: x/x.std() # standardize(df['x']) will divide all 'x' by the std deviation of 'x'

print("Divide lcarat by its std dev:\n")
print(sm_ols('lprice ~ lcarat + ideal', 
       # for **just** this regression, divide 
       data=diamonds2.assign(lcarat = standardize(diamonds2['lcarat'])) 
       # this doesn't change the diamonds2 data permanently, so the next time you call on
       # diamonds2, you can use lcarat as if nothing changed. if you want to repeat this
       # a bunch, you might instead create and save a permanent variable called "lcarat_std"
       # where "_std" indicates that you divided it by the std dev.
      ).fit().params)

print("\n\nThe original reg:\n")
print(sm_ols('lprice ~ lcarat + ideal',data=diamonds2 ).fit().params)
Divide lcarat by its std dev:

Intercept        8.418208
ideal[T.True]    0.100013
lcarat           0.985628
dtype: float64


The original reg:

Intercept        8.418208
ideal[T.True]    0.100013
lcarat           1.696259
dtype: float64

So a 1 standard deviation increase in \(\log carat\) is associated with about a 98% increase in price. Compared to \(ideal\), we can say that a reasonable variation in carat size is associated with a price impact about 10 times larger than the impact of cut, not 17 times larger.

Also, notice that the new coefficient (0.98) is about 58% of the original coefficient (1.69). That is no accident: dividing \(\log carat\) by its standard deviation multiplies its coefficient by that standard deviation, so 0.58 must be the standard deviation of \(\log carat\).
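A quick check of that claim:

print(diamonds2['lcarat'].std()) # should be about 0.58 = 0.986/1.696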