6.7. Fixed effects, categorical variables, and prettier regression tables

The summary_col function in statsmodels makes nice regression tables easy to create.

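When you add a categorical variable to your model, statsmodels automatically adds a dummy variable for each level of that variable (except one level, which is omitted and serves as the reference category). Sometimes, these coefficients have meaning and are of interest.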
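To see that expansion directly, here is a minimal sketch on made-up data. The `toy` DataFrame, its column names, and the numbers in it are invented for illustration. With an intercept in the formula, one level is absorbed as the baseline, so a three-level variable should show up as two dummy coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, invented for illustration: 3 sectors with 4 firms each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sector": np.repeat(["energy", "retail", "tech"], 4),
    "investment": rng.normal(size=12),
})
toy["profits"] = 1.0 + 0.5 * toy["investment"] + rng.normal(scale=0.1, size=12)

# C(sector) tells the formula parser to treat sector as categorical
fit = smf.ols("profits ~ investment + C(sector)", data=toy).fit()

# One coefficient per non-baseline level, plus Intercept and investment
coef_names = list(fit.params.index)
print(coef_names)
```

The names of the dummy coefficients look like `C(sector)[T.retail]`: the level in brackets is compared against the omitted baseline level.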

However, this isn’t always true. For example, an earlier page noted that you can modify a model from \(profits=a+b*investment+c*X+u\), where the focus is on understanding how investments translate to profits, to \(profits=a+b*investment+c*X+d*C(gsector)+e*C(year)+u\). The latter model is better, but the coefficients on gsector and year are not the focus (and are difficult to interpret).

Aside: When a categorical variable has many levels, it is often called a “fixed effect”. So the latter model, which adds industry and year to the regression as categorical variables, is said to include “industry fixed effects” and “year fixed effects”. The point of industry fixed effects is usually not to understand the coefficients on the industry dummy variables. It is to “control for industry”, which changes the interpretation of \(b\): it is the relationship between investment and profits, holding the industry fixed. The same goes for the year fixed effects. Thus, in the improved model, \(b\) shows the relationship for two firms in the same industry in the same year.
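A small simulation can make the “holding fixed the industry” interpretation concrete. This is only a sketch under assumptions I made up: four hypothetical industries, each with a baseline that raises both investment and profits, and a true within-industry slope of 0.5. None of these numbers come from real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_per_industry = 500

# Hypothetical industry baselines: higher baseline -> more investment AND more profits
baselines = {"A": 0.0, "B": 2.0, "C": 4.0, "D": 6.0}

frames = []
for industry, base in baselines.items():
    investment = base + rng.normal(size=n_per_industry)
    # True within-industry effect of investment on profits is 0.5
    profits = 3.0 * base + 0.5 * investment + rng.normal(size=n_per_industry)
    frames.append(pd.DataFrame({"industry": industry,
                                "investment": investment,
                                "profits": profits}))
sim = pd.concat(frames, ignore_index=True)

# Without fixed effects, b mixes within- and across-industry differences
b_pooled = smf.ols("profits ~ investment", data=sim).fit().params["investment"]

# With industry fixed effects, b compares firms within the same industry
b_fe = smf.ols("profits ~ investment + C(industry)", data=sim).fit().params["investment"]

print(f"b without industry FE: {b_pooled:.2f}")
print(f"b with industry FE:    {b_fe:.2f}")
```

In this setup the pooled estimate is pulled far above 0.5 because high-investment industries are also high-profit industries, while the fixed effects estimate should land close to the within-industry slope that was built into the data.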

When a categorical variable has a lot of levels, and seeing those values is not important, the output tables are easier to read if you drop those coefficients.

You can do that with summary_col as it exists now (April 2024) by listing every variable you want to show in regressor_order and setting drop_omitted to True. But this hides the fact that your model includes all these other variables, which is not great.
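For reference, here is a sketch of that stock approach on made-up data, using only the regular summary_col from statsmodels. The `has_term` helper and the hand-written "industry FE" row in info_dict are my own workaround for flagging the hidden variables, not a statsmodels feature, and the data and column names are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Made-up data: 8 hypothetical industries, 400 firms
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "industry": rng.choice(list("ABCDEFGH"), size=n),
    "investment": rng.normal(size=n),
})
df["profits"] = 1.0 + 0.5 * df["investment"] + rng.normal(size=n)

m1 = smf.ols("profits ~ investment", data=df).fit()
m2 = smf.ols("profits ~ investment + C(industry)", data=df).fit()

def has_term(term):
    """Hypothetical helper: 'Yes' if `term` appears in a model's coefficient names."""
    return lambda res: "Yes" if any(term in name for name in res.params.index) else "No"

table = summary_col(
    [m1, m2],
    model_names=["(1)", "(2)"],
    stars=True,
    regressor_order=["Intercept", "investment"],  # the only coefficients to show
    drop_omitted=True,                            # hide everything not listed above
    info_dict={
        "No. observations": lambda res: f"{int(res.nobs):d}",
        "industry FE": has_term("C(industry)"),   # manual indicator row
    },
)
table_text = table.as_text()
print(table_text)
```

The drawback is that each indicator row has to be written by hand, and it is easy to forget one, which is the gap the modified function is meant to close.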

So @adrianmoss came up with a solution, by modifying the summary_col function. Below, I show an example of it in action.

You can use this by downloading summary_colFE.py from the community codebook and putting it in the folder you are using.

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from summary_colFE import summary_col   # summary_colFE.py available at https://github.com/LeDataSciFi/ledatascifi-2024/tree/main/community_codebook
                                        # it replaces: from statsmodels.iolib.summary2 import summary_col
                                        # pending PR in statsmodels: https://github.com/statsmodels/statsmodels/pull/9191 

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Adapted regressions for the diamonds dataset
regressions = [
    (smf.ols('np.log(price) ~ carat', data=diamonds).fit(), 'log(Price) ~ Carat'),
    (smf.ols('np.log(price) ~ np.log(carat)', data=diamonds).fit(), 'log(Price) ~ log(Carat)'),
    (smf.ols('np.log(price) ~ C(cut)', data=diamonds).fit(), 'log(Price) ~ C(Cut)'),
    (smf.ols('np.log(price) ~ C(clarity)', data=diamonds).fit(), 'log(Price) ~ C(Clarity)'),
    (smf.ols('np.log(price) ~ carat + C(cut) + C(clarity)', data=diamonds).fit(), 'log(Price) ~ Carat + C(Cut) + C(Clarity)')
]

# Extra rows to report at the bottom of the table
info_dict = {'No. observations': lambda x: f"{int(x.nobs):d}"}

summary = summary_col([reg[0] for reg in regressions],
                    model_names=[f'{i}. '+reg[1] for i, reg in enumerate(regressions, 1)],
                    stars=True, info_dict=info_dict, 
                    fixed_effects=['cut', 'clarity'],
                    )
summary
Models:
  1. log(Price) ~ Carat
  2. log(Price) ~ log(Carat)
  3. log(Price) ~ C(Cut)
  4. log(Price) ~ C(Clarity)
  5. log(Price) ~ Carat + C(Cut) + C(Clarity)

                    1.          2.          3.          4.          5.
Intercept           6.2150***   8.4487***   7.6395***   7.4052***   6.3613***
                    (0.0033)    (0.0014)    (0.0068)    (0.0234)    (0.0090)
carat               1.9698***                                       2.0851***
                    (0.0036)                                        (0.0036)
np.log(carat)                   1.6758***
                                (0.0019)
No. observations    53940       53940       53940       53940       53940
cut FE                                      Yes                     Yes
clarity FE                                              Yes         Yes
R-squared           0.8468      0.9330      0.0181      0.0511      0.8668
R-squared Adj.      0.8468      0.9330      0.0181      0.0510      0.8667

Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01