{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Common tasks\n", "\n", "```{important}\n", "\n", "Yes, this page is kind of long. But that's because it has a lot of useful info!\n", "\n", "Use the page's table of contents to the right to jump to what you're looking for. \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping data\n", "\n", "In the [shape of data](02b_pandasVocab) page, I explained the concept of wide vs. tall data with this example: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tall:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FirmYearSales
0Ford200010
1Ford200112
2Ford200214
3Ford200316
4GM200011
5GM200113
6GM200213
7GM200315
\n", "
" ], "text/plain": [ " Firm Year Sales\n", "0 Ford 2000 10\n", "1 Ford 2001 12\n", "2 Ford 2002 14\n", "3 Ford 2003 16\n", "4 GM 2000 11\n", "5 GM 2001 13\n", "6 GM 2002 13\n", "7 GM 2003 15" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "\n", "df = (pd.Series({ ('Ford',2000):10,\n", " ('Ford',2001):12,\n", " ('Ford',2002):14,\n", " ('Ford',2003):16,\n", " ('GM',2000):11,\n", " ('GM',2001):13,\n", " ('GM',2002):13,\n", " ('GM',2003):15})\n", " .to_frame()\n", " .rename(columns={0:'Sales'})\n", " .rename_axis(['Firm','Year'])\n", " .reset_index()\n", " )\n", "print(\"Tall:\")\n", "display(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "To reshape dataframes, you have to work with index and column names. \n", "```\n", "\n", "So before we use `stack` and `unstack` here, put the firm and year into the index." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "tall = df.set_index(['Firm','Year'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To convert a tall dataframe to wide: `df.unstack()`.\n", "\n", "If your index has multiple levels, the level parameter is used to pick which to unstack. \"0\" is the innermost level of the index. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Unstack (make it shorter+wider) on level 0/Firm:\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sales
FirmFordGM
Year
20001011
20011213
20021413
20031615
\n", "
" ], "text/plain": [ " Sales \n", "Firm Ford GM\n", "Year \n", "2000 10 11\n", "2001 12 13\n", "2002 14 13\n", "2003 16 15" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Unstack (make it shorter+wider) on level 1/Year:\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sales
Year2000200120022003
Firm
Ford10121416
GM11131315
\n", "
" ], "text/plain": [ " Sales \n", "Year 2000 2001 2002 2003\n", "Firm \n", "Ford 10 12 14 16\n", "GM 11 13 13 15" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"\\n\\nUnstack (make it shorter+wider) on level 0/Firm:\\n\") \n", "display(tall.unstack(level=0))\n", "print(\"\\n\\nUnstack (make it shorter+wider) on level 1/Year:\\n\") \n", "display(tall.unstack(level=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To convert a wide dataframe to tall/long: `df.stack()`.\n", "\n", "```{tip}\n", "Pay attention after reshaping to the order of your index variables and how they are sorted. \n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Stack it back (make it tall): wide_year.stack()\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sales
YearFirm
2000Ford10
GM11
2001Ford12
GM13
2002Ford14
GM13
2003Ford16
GM15
\n", "
" ], "text/plain": [ " Sales\n", "Year Firm \n", "2000 Ford 10\n", " GM 11\n", "2001 Ford 12\n", " GM 13\n", "2002 Ford 14\n", " GM 13\n", "2003 Ford 16\n", " GM 15" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Year-then-firm doesn't make much sense.\n", "Reorder to firm-year: wide_year.stack().swaplevel()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sales
FirmYear
Ford200010
GM200011
Ford200112
GM200113
Ford200214
GM200213
Ford200316
GM200315
\n", "
" ], "text/plain": [ " Sales\n", "Firm Year \n", "Ford 2000 10\n", "GM 2000 11\n", "Ford 2001 12\n", "GM 2001 13\n", "Ford 2002 14\n", "GM 2002 13\n", "Ford 2003 16\n", "GM 2003 15" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Year-then-firm sorting make much sense.\n", "Sort to firm-year: wide_year.stack().swaplevel().sort_index()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sales
FirmYear
Ford200010
200112
200214
200316
GM200011
200113
200213
200315
\n", "
" ], "text/plain": [ " Sales\n", "Firm Year \n", "Ford 2000 10\n", " 2001 12\n", " 2002 14\n", " 2003 16\n", "GM 2000 11\n", " 2001 13\n", " 2002 13\n", " 2003 15" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# save the wide df above to this name for subseq examples\n", "wide_year = tall.unstack(level=0) \n", "\n", "print(\"\\n\\nStack it back (make it tall): wide_year.stack()\\n\") \n", "display(wide_year.stack())\n", "print(\"\\n\\nYear-then-firm doesn't make much sense.\\nReorder to firm-year: wide_year.stack().swaplevel()\") \n", "display(wide_year.stack().swaplevel())\n", "print(\"\\n\\nYear-then-firm sorting make much sense.\\nSort to firm-year: wide_year.stack().swaplevel().sort_index()\") \n", "display(wide_year.stack().swaplevel().sort_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Beautiful!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lambda (in `assign` or after `groupby`)\n", "\n", "You will see this inside pandas chains a lot: `lambda x: someFunc(x)`, e.g.:\n", "- `.assign(lev = lambda x: (x['dltt']+x['dlc'])/x['at'] )`\n", "- `.groupby('industry').assign(avglev = lambda x: x['lev'].mean() )`\n", "\n", "Q1: What is that \"lambda\"?\n", "\n", "A1: A lambda function is an anonymous function that is usually one line and usually defined without a name. 
You write it like this:\n", "\n", "```py\n", "lambda <arguments> : <expression>\n", "```\n", "\n", "Here, you can see how the lambda function takes inputs and creates output the same way a function does:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dumb_prog = lambda a: a + 10 # I added \"dumb_prog =\" to name the lambda function and use it\n", "dumb_prog(5)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we could define a function to do the exact same thing\n", "def dumb_prog(a):\n", " return a + 10\n", "dumb_prog(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q2: Why is that lambda there? \n", "\n", "A2: We use lambdas when we need a function for a short period of time and when the name of the function doesn't matter. \n", "\n", "\n", " \n", "In the example above, `df.groupby('industry').apply(lambda x: x['lev'].mean() )`, \n", "1. groupby **splits** the dataframe into groups, \n", "2. then, within each group, it **applies** a function (here: the mean), \n", "3. and then returns a new object with one observation for each group (the average leverage for the industry). Visually, this **split-apply-combine**[^ref] process looks like this:\n", "\n", "![](https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png)\n", "\n", "[^ref]: This figure is yet another resource I'm borrowing from the awesome [PythonDataScienceHandbook](https://jakevdp.github.io/PythonDataScienceHandbook).\n", "\n", "But notice! The lambda function is working on these tiny, split-up pieces of the dataframe created by `df.groupby('industry')`. Those pieces are dataframe objects that don't have names! 
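To make the split-apply-combine steps above concrete, here is a minimal, self-contained sketch. The toy Ford/GM numbers are hypothetical stand-ins (echoing the data at the top of this page), not real firm data:

```python
import pandas as pd

# hypothetical toy data: two firms, two years of sales each
df = pd.DataFrame({'Firm':  ['Ford', 'Ford', 'GM', 'GM'],
                   'Sales': [10, 12, 11, 13]})

# split into per-firm pieces, apply the lambda to each (unnamed!) piece,
# then combine the results into one object with one row per firm
avg = df.groupby('Firm').apply(lambda x: x['Sales'].mean())
print(avg)  # Ford -> 11.0, GM -> 12.0
```

Inside the lambda, `x` is each firm's chunk of the dataframe, which is exactly the "unnamed object" issue discussed next.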
\n", "\n", "**So lambda functions let us refer to an unnamed dataframe object!** When you type `.assign(newVar = lambda x: someFunc(x))`, `x` is the object (\"some df object\") that assign is working on. Ta da!\n", "\n", "```python\n", "# common syntax within pandas\n", ".assign( = lambda : ) \n", "\n", "# often, tempname is just \"x\" for short\n", ".assign( = lambda x: ) \n", "\n", "# example:\n", ".assign(lev = lambda x: (x['dltt']+x['dlc'])/x['at'] )\n", "\n", "```\n", "\n", "```{note}\n", "It turns out that lambda functions are very useful in python programming, and not just within pandas. For example, some functions take functions as inputs, like [csnap()](#printing-inside-of-chains), `map()`, and `filter()`, and lambda functions let us give them custom functions quickly. \n", "\n", "But pandas is where we will use lambda functions most in this class.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `.transform()` after groupby\n", "\n", "Sometimes you get a statistic for a group, but you want that statistic in every single row of your original dataset.\n", "\n", "But `groupby` creates a new dataframe that is smaller, with only one row per row.\n", "\n", "```{admonition}\n", ":class: tip\n", "\n", "Use `.transform()` after `groupby` to \"cast\" those statistics back to the original \n", "\n", "```\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
key
A1
A4
B2
B5
C3
C6
\n", "
" ], "text/plain": [ " data\n", "key \n", "A 1\n", "A 4\n", "B 2\n", "B 5\n", "C 3\n", "C 6" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd \n", "import numpy as np\n", "df = pd.DataFrame({'key':[\"A\",'B','C',\"A\",'B','C'],\n", " 'data':np.arange(1,7)}).set_index('key').sort_index()\n", "\n", "display(df) # the input" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
key
A5
B7
C9
\n", "
" ], "text/plain": [ " data\n", "key \n", "A 5\n", "B 7\n", "C 9" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# groupby().sum() shrinks the dataset\n", "display(df.groupby(level='key')['data'].sum()\n", " .to_frame() ) # just added this line bc df prints prettier than series" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
key
A5
A5
B7
B7
C9
C9
\n", "
" ], "text/plain": [ " data\n", "key \n", "A 5\n", "A 5\n", "B 7\n", "B 7\n", "C 9\n", "C 9" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# groupby().transform(sum) does NOT shrink the dataset\n", "\n", "df.groupby(level='key').transform(sum) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last trick: Let's add that new variable to the original dataset!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datagroupsum
key
A15
A45
B27
B57
C39
C69
\n", "
" ], "text/plain": [ " data groupsum\n", "key \n", "A 1 5\n", "A 4 5\n", "B 2 7\n", "B 5 7\n", "C 3 9\n", "C 6 9" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# option 1: create the var\n", "df['groupsum'] = df.groupby(level='key').transform(sum)\n", "\n", "# option 2: create the var with assign (can be used inside chains)\n", "df = df.assign(groupsum = df.groupby(level='key')['data'].transform(sum))\n", "\n", "display(df) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using non-pandas functions inside chains \n", "\n", "One problem with writing chains on dataframes is that you can only use methods that work on the object (a dataframe) that is getting chained. \n", "\n", "So for example, you've formatted dataframe to plot. You can't directly add a seaborn function to the chain: _Seaborn functions are methods of the package seaborn, not the dataframe._ (It's `sns.lmplot`, not `df.lmplot`.) \n", "\n", "`.pipe()` allows you to hand a dataframe to functions that don't work directly on dataframes. \n", "\n", "\n", "````{admonition} The syntax of .pipe()\n", "```python\n", "df.pipe(<'outside function'>, \n", " <'if the first parameter of the outside function isnt the df, '\n", " 'the name of the parameter that is expecting the dataframe'>,\n", " <'any other parameters youd give the outside function'>\n", "```\n", "\n", "Note that the object after the pipe command is run might not be a dataframe anymore! It's whatever object the piped function produces!\n", "````\n", "\n", "### Example 1\n", "\n", "[From one of the pandas devs:](https://tomaugspurger.github.io/method-chaining)\n", "\n", "> ```python\n", "> jack_jill = pd.DataFrame()\n", "> (jack_jill.pipe(went_up, 'hill')\n", "> .pipe(fetch, 'water')\n", "> .pipe(fell_down, 'jack')\n", "> .pipe(broke, 'crown')\n", "> .pipe(tumble_after, 'jill')\n", "> )\n", "> ```\n", "> \n", "> This really is just right-to-left function execution. 
The first argument to pipe, a callable, is called with the DataFrame on the left as its first argument, and any additional arguments you specify.\n", "> \n", "> I hope the analogy to data analysis code is clear. Code is read more often than it is written. When you or your coworkers or research partners have to go back in two months to update your script, having the story of raw data to results be told as clearly as possible will save you time.\n", "\n", "### Example 2\n", "\n", "[From Steven Morse:](https://stmorse.github.io/journal/tidyverse-style-pandas.html)\n", "\n", "> ```python\n", "> (sns.load_dataset('diamonds')\n", "> .query('cut in [\"Ideal\", \"Good\"] & \\\n", "> clarity in [\"IF\", \"SI2\"] & \\\n", "> carat < 3')\n", "> .pipe((sns.FacetGrid, 'data'),\n", "> row='cut', col='clarity', hue='color',\n", "> hue_order=list('DEFGHIJ'),\n", "> height=6,\n", "> legend_out=True)\n", "> .map(sns.scatterplot, 'carat', 'price', alpha=0.8)\n", "> .add_legend())\n", "> ```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Printing inside of chains\n", "\n", "```{tip}\n", "One thing about chains is that it's sometimes hard to know what's going on within them without just commenting out all the code and running it bit-by-bit. \n", "\n", "This function, `csnap` (meaning \"C\"hain \"SNAP\"shot), will let you print messages from inside the chain, by exploiting the `.pipe()` function we just covered!\n", "```\n", "\n", "![](https://media.giphy.com/media/Buy7YdhkyHBCM/source.gif)\n", "\n", "Copy this into your code:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def csnap(df, fn=lambda x: x.shape, msg=None):\n", " \"\"\" Custom help function to print things in method chaining. 
\n", " Will also print a message, which helps if you're printing a bunch of these, so that you know which csnap print happens at which point.\n", " Returns back the df to further use in chaining.\n", " \n", " Usage examples - within a chain of methods:\n", " df.pipe(csnap)\n", " df.pipe(csnap, lambda x: )\n", " df.pipe(csnap, msg=\"Shape here\")\n", " df.pipe(csnap, lambda x: x.sample(10), msg=\"10 random obs\")\n", " \"\"\"\n", " if msg:\n", " print(msg)\n", " display(fn(df))\n", " return df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example of this in use:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape before describe\n" ] }, { "data": { "text/plain": [ "(6, 2)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Shape after describe and pick one var\n" ] }, { "data": { "text/plain": [ "(8,)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Random sample of df at point #3\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dataones
max6.01
min1.01
\n", "
" ], "text/plain": [ " data ones\n", "max 6.0 1\n", "min 1.0 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dataonestwosthrees
count6.000000123
mean3.500000123
std1.870829123
min1.000000123
25%2.250000123
50%3.500000123
75%4.750000123
max6.000000123
\n", "
" ], "text/plain": [ " data ones twos threes\n", "count 6.000000 1 2 3\n", "mean 3.500000 1 2 3\n", "std 1.870829 1 2 3\n", "min 1.000000 1 2 3\n", "25% 2.250000 1 2 3\n", "50% 3.500000 1 2 3\n", "75% 4.750000 1 2 3\n", "max 6.000000 1 2 3" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df\n", " .pipe(csnap, msg=\"Shape before describe\")\n", " .describe()['data'] # get the distribution stats of a variable (I'm just doing something to show csnap off)\n", " .pipe(csnap, msg=\"Shape after describe and pick one var\") # see, it prints a message from within the chain!\n", " .to_frame()\n", " .assign(ones = 1)\n", " .pipe(csnap, lambda x: x.sample(2), msg=\"Random sample of df at point #3\") # see, it prints a message from within the chain! \n", " .assign(twos=2,threes=3)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prettier pandas output\n", "\n", "A few random things:\n", "\n", "- Want to change the order of rows in an output table? `.reindex()`\n", "- Want to format the numbers shown by pandas?\n", " 1. Permanent: Add this line of code to the top of your file: `pd.set_option('display.float', '{:.2f}'.format)`\n", " 2. Temp: Add `style.format` to the end of your table command. E.g.: `df.describe().style.format(\"{:.2f}\")`\n", "- Want to control the number of columns / rows pandas shows? \n", " 1. `pd.set_option('display.max_columns', 50)`\n", " 2. `pd.set_option('display.max_rows', 50)`\n", "- More formatting controls: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }