{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Fixed effects, categorical variables, and prettier regression tables\n", "\n", "The `summary_col` function in `statsmodels` makes it easy to create nice regression tables.\n", "\n", "When you add a categorical variable to your model, `statsmodels` automatically adds a dummy variable for each level (apart from an omitted baseline level). Sometimes, these coefficients have meaning and are of interest.\n", "\n", "However, this isn't always true. For example, [an earlier page](02d_interpretingCoefs.html#if-other-categorical-controls-are-included) noted that you can modify a model from $profits=a+b*investment+c*X+u$, where the focus is on understanding how investments translate to profits, to $profits=a+b*investment+c*X+d*C(gsector)+e*C(year)+u$. The latter model is better, but the coefficients on `gsector` and `year` are not the focus (and are difficult to interpret).\n", "\n", "_Aside: When a categorical variable has many levels, it is often called a \"fixed effect\". So the latter model, which adds industry and year to a regression as categorical variables, is said to include \"industry **fixed effects**\" and \"year fixed effects\". The point of industry fixed effects is usually not to understand the coefficients on the industry dummy variables. It is to \"control for industry\", and it changes the interpretation of $b$: It is the relationship between investment and profits, holding the industry fixed. The same goes for the year fixed effects. Thus, in the improved model, $b$ shows the relationship for two firms in the same industry in the same year._\n", "\n", "When a categorical variable has a lot of levels, and the individual coefficients are not of interest, the output tables are easier to read if you omit them.\n", "\n", "You can do that with `summary_col` as it exists now (April 2024) by specifying `regressor_order` with all variables you want to show and setting `drop_omitted` to `True`. 
But this hides the fact that your model includes all these other variables, which is not great.\n", "\n", "So @adrianmoss came up with a solution by modifying the `summary_col` function. Below, I show an example of it in action.\n", "\n", "To use it, download `summary_colFE.py` [from the community codebook](https://github.com/LeDataSciFi/ledatascifi-2025/tree/main/community_codebook) and put it in the folder you are working in." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
1. log(Price) ~ Carat 2. log(Price) ~ log(Carat) 3. log(Price) ~ C(Cut) 4. log(Price) ~ C(Clarity) 5. log(Price) ~ Carat + C(Cut) + C(Clarity)
Intercept 6.2150*** 8.4487*** 7.6395*** 7.4052*** 6.3613***
(0.0033) (0.0014) (0.0068) (0.0234) (0.0090)
carat 1.9698*** 2.0851***
(0.0036) (0.0036)
np.log(carat) 1.6758***
(0.0019)
No. observations 53940 53940 53940 53940 53940
cut FE Yes Yes
clarity FE Yes Yes
R-squared 0.8468 0.9330 0.0181 0.0511 0.8668
R-squared Adj. 0.8468 0.9330 0.0181 0.0510 0.8667

\n", "Standard errors in parentheses.
\n", "* p<.1, ** p<.05, ***p<.01" ], "text/latex": [ "\\begin{table}\n", "\\caption{}\n", "\\label{}\n", "\\begin{center}\n", "\\begin{tabular}{llllll}\n", "\\hline\n", " & 1. log(Price) ~ Carat & 2. log(Price) ~ log(Carat) & 3. log(Price) ~ C(Cut) & 4. log(Price) ~ C(Clarity) & 5. log(Price) ~ Carat + C(Cut) + C(Clarity) \\\\\n", "\\hline\n", "Intercept & 6.2150*** & 8.4487*** & 7.6395*** & 7.4052*** & 6.3613*** \\\\\n", " & (0.0033) & (0.0014) & (0.0068) & (0.0234) & (0.0090) \\\\\n", "carat & 1.9698*** & & & & 2.0851*** \\\\\n", " & (0.0036) & & & & (0.0036) \\\\\n", "np.log(carat) & & 1.6758*** & & & \\\\\n", " & & (0.0019) & & & \\\\\n", "No. observations & 53940 & 53940 & 53940 & 53940 & 53940 \\\\\n", "cut FE & & & Yes & & Yes \\\\\n", "clarity FE & & & & Yes & Yes \\\\\n", "R-squared & 0.8468 & 0.9330 & 0.0181 & 0.0511 & 0.8668 \\\\\n", "R-squared Adj. & 0.8468 & 0.9330 & 0.0181 & 0.0510 & 0.8667 \\\\\n", "\\hline\n", "\\end{tabular}\n", "\\end{center}\n", "\\end{table}\n", "\\bigskip\n", "Standard errors in parentheses. \\newline \n", "* p<.1, ** p<.05, ***p<.01" ], "text/plain": [ "\n", "\"\"\"\n", "\n", "===============================================================================================================================================================\n", " 1. log(Price) ~ Carat 2. log(Price) ~ log(Carat) 3. log(Price) ~ C(Cut) 4. log(Price) ~ C(Clarity) 5. log(Price) ~ Carat + C(Cut) + C(Clarity)\n", "---------------------------------------------------------------------------------------------------------------------------------------------------------------\n", "Intercept 6.2150*** 8.4487*** 7.6395*** 7.4052*** 6.3613*** \n", " (0.0033) (0.0014) (0.0068) (0.0234) (0.0090) \n", "carat 1.9698*** 2.0851*** \n", " (0.0036) (0.0036) \n", "np.log(carat) 1.6758*** \n", " (0.0019) \n", "No. observations 53940 53940 53940 53940 53940 \n", "cut FE Yes Yes \n", "clarity FE Yes Yes \n", "R-squared 0.8468 0.9330 0.0181 0.0511 0.8668 \n", "R-squared Adj. 
0.8468 0.9330 0.0181 0.0510 0.8667 \n", "===============================================================================================================================================================\n", "Standard errors in parentheses.\n", "* p<.1, ** p<.05, ***p<.01\n", "\"\"\"" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import statsmodels.formula.api as smf\n", "from summary_colFE import summary_col # summary_colFE.py available at https://github.com/LeDataSciFi/ledatascifi-2025/tree/main/community_codebook\n", " # it replaces: from statsmodels.iolib.summary2 import summary_col\n", " # pending PR in statsmodels: https://github.com/statsmodels/statsmodels/pull/9191 \n", "\n", "# Load the diamonds dataset\n", "diamonds = sns.load_dataset('diamonds')\n", "\n", "# Adapted regressions for the diamonds dataset\n", "regressions = [\n", " (smf.ols('np.log(price) ~ carat', data=diamonds).fit(), 'log(Price) ~ Carat'),\n", " (smf.ols('np.log(price) ~ np.log(carat)', data=diamonds).fit(), 'log(Price) ~ log(Carat)'),\n", " (smf.ols('np.log(price) ~ C(cut)', data=diamonds).fit(), 'log(Price) ~ C(Cut)'),\n", " (smf.ols('np.log(price) ~ C(clarity)', data=diamonds).fit(), 'log(Price) ~ C(Clarity)'),\n", " (smf.ols('np.log(price) ~ carat + C(cut) + C(clarity)', data=diamonds).fit(), 'log(Price) ~ Carat + C(Cut) + C(Clarity)')\n", "]\n", "\n", "info_dict={\n", " 'No. observations' : lambda x: f\"{int(x.nobs):d}\"}\n", "\n", "summary = summary_col([reg[0] for reg in regressions],\n", " model_names=[f'{i}. 
'+reg[1] for i, reg in enumerate(regressions, 1)],\n", " stars=True, info_dict=info_dict, \n", " fixed_effects=['cut', 'clarity'],\n", " )\n", "summary" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }