{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Merging\n", "\n", "You've already merged datasets. But so far, our examples have been \"well-behaved\" so it was easy to just proceed. But real world datasets are messy (bad variable names, poor documentation) and big, and so merging isn't always as easy as \"just do it\".\n", "\n", "\n", "\n", "## Important parameters of `pd.merge()`\n", "\n", "Read through the [parameters of the function here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html). \n", "\n", "- `right = `, `left = ` - name of your datasets\n", "- `on = ` or `on = [,,...]` - what variable(s) to use to match from the left and right datasets. These are your \"keys\". \n", " - if the variable names aren't the same in the datasets (e.g. \"ID\" in one and \"identity\" in the other), use `left_on` and `right_on` instead of `on`\n", " - if the variables are the index variables, use `left_index = True` and/or `right_index = True` instead of `on`\n", "- `how` - what observations are in the resulting dataset, [see below](#how-:-left-v.-right-v.-inner-v.-outer)\n", "- `suffix` - if a variable is in both datasets, how should we name each. \n", " - **It's a good idea to always use this option and specify the source, because the default option makes uninformative variable names! **\n", "- `indicator=True` will create a variable saying which dataset the variable came from\n", "- `validate` = \"one_to_one\", \"one_to_many\", or \"many_to_one\". Will check if the merge is actually what you think it is. Useful!\n", "\n", "```{admonition} Be careful with the validate option \n", ":class: warning\n", "\n", "- Always use the \"smallest\" option you think is correct: If you're doing a 1:1, type \"one_to_one\"\n", "- Don't simply change it until the merge \"works\"! 
**Errors are good!** \n", "    - either you're merging \"wrong\" (you picked the wrong \"on\" vars), \n", "    - or the data has problems (duplicates where there shouldn't be), \n", "    - or you misunderstand the data\n", "- M:M merges can be dangerous: the \"many_to_many\" option will _**always**_ \"work\", and the result is usually meaningless \n", "- NEVER USE THE MANY TO MANY OPTION (unless you're super sure it's correct for your application)\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categories of joins\n", "\n", "An important concept in merging is whether the keys you're merging on are unique in each dataset. Be prepared that your merge might be:\n", "- **1:1**: The keys in each dataset are all unique\n", "- **1:M**: The keys in the right dataset have some duplicates\n", "- **M:1**: The keys in the left dataset have some duplicates \n", "- **M:M**: The keys in both datasets have some duplicates\n", "\n", "What the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging) says is:\n", "\n", "> In SQL / standard relational algebra, if a key combination appears more than once in [either or both] tables, the resulting table will have the Cartesian product of the associated data. \n", "\n", "That sounds complicated. What it means is: For a given row of data in the left dataset, we find all rows in the right dataset with the same key, and create that many rows in the merged dataset. \n", "\n", "That might sound only a little less complicated than the quote from the pandas docs. Let me just show you:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmv1
0CitiAlpha
1CitiBravo
\n", "
" ], "text/plain": [ " firm v1\n", "0 Citi Alpha\n", "1 Citi Bravo" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "left_df = pd.DataFrame({\n", " 'firm': ['Citi','Citi',],\n", " 'v1': ['Alpha', 'Bravo']})\n", "display(left_df)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmv2
0CitiCharlie
1CitiDelta
2WFEcho
\n", "
" ], "text/plain": [ " firm v2\n", "0 Citi Charlie\n", "1 Citi Delta\n", "2 WF Echo" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "right_df = pd.DataFrame({\n", " 'firm': ['Citi','Citi','WF'],\n", " 'v2': ['Charlie','Delta','Echo']}) \n", "display(right_df)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmv1v2
0CitiAlphaCharlie
1CitiAlphaDelta
2CitiBravoCharlie
3CitiBravoDelta
\n", "
" ], "text/plain": [ " firm v1 v2\n", "0 Citi Alpha Charlie\n", "1 Citi Alpha Delta\n", "2 Citi Bravo Charlie\n", "3 Citi Bravo Delta" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(pd.merge(left_df,right_df,how='inner'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See? Each row on the left was matched with every row in the right data with the same key. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{tip}\n", "\n", "1. Run this helper function before a merge. Does it think you're doing the type of merge you think you are?\n", "2. Set the `pd.merge(validate=...` argument to the type of merge you think you're doing. It will produce an error before you start if you're wrong, at which point you should look at your data more.\n", "3. Set the `_indicator=True` option inside `pd.merge()`. It\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def merge_type(df1,df2,on):\n", " '''\n", " This function could use case test. E.g.: multiple key vars, missing vars \n", " '''\n", " # if there are duplicates, dropping them will shrink the key vector\n", " if len(df1[on]) > len(df1[on].drop_duplicates()):\n", " _l = \"many\"\n", " else:\n", " _l = \"one\"\n", " if len(df2[on]) > len(df2[on].drop_duplicates()):\n", " _r = \"many\"\n", " else:\n", " _r = \"one\"\n", " return \"{}_to_{}\".format(_l,_r)\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'many_to_many'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merge_type(left_df,right_df,'firm')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firmv1v2_merge
0CitiAlphaCharlieboth
1CitiAlphaDeltaboth
2CitiBravoCharlieboth
3CitiBravoDeltaboth
\n", "
" ], "text/plain": [ " firm v1 v2 _merge\n", "0 Citi Alpha Charlie both\n", "1 Citi Alpha Delta both\n", "2 Citi Bravo Charlie both\n", "3 Citi Bravo Delta both" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(pd.merge(left_df,right_df,on='firm',how='inner',\n", " validate='many_to_many',\n", " indicator=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `indicator=True`\n", "\n", "The post-merge variable created by `indicator=True` can be very useful to have around. After a merge, you can count the number of rows from each dataset.\n", "\n", "```python\n", "some_informative_name = pd.merge(df1,df1,indicator=True)\n", "some_informative_name['_merge'].value_counts() \n", "```\n", "\n", "In action:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "both 4\n", "right_only 1\n", "left_only 0\n", "Name: _merge, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left_df,right_df,on='firm', how='outer', \n", " # NOTICE! \"how\" has changed from prior example to outer\n", " validate='many_to_many',\n", " indicator=True\n", " )['_merge'].value_counts() # ['_merge'].value_counts() \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `how` : left v. right v. inner v. 
outer\n", "\n", "| option | observations in resulting dataset |\n", "| :--- | :--- |\n", "`how = \"inner\"`| Keys (`on` variables) that are in both datasets \n", "`how = \"left\"` | \"inner\" + all unmatched obs in left \n", "`how = \"right\"` | \"inner\" + all unmatched obs in right\n", "`how = \"outer\"` | \"inner\" + all unmatched obs in left and right\n", "\n", "Let's illustrate that:\n", "\n", "\n", "![](img/merges.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Practice\n", "\n", "Do a left, right, inner, and outer merge on these datasets to confirm the illustration on `how`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "left_df = pd.DataFrame({\n", " 'firm': ['Accenture','Citi','GS'],\n", " 'varA': ['A1', 'A2', 'A3']})\n", "\n", "right_df = pd.DataFrame({\n", " 'firm': ['GS','Chase','WF'],\n", " 'varB': ['B1', 'B2', 'B3'],\n", " 'varc': ['C1', 'C2', 'C3']}) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merging in new variables to your analysis\n", "\n", "A typical \"use case\" for merging: Working on some data you want to analyze (let's call this the \"left\" data), and want to merge in another dataset (the \"right\" data) in order to add a variable to your analysis. \n", "\n", "```{admonition} **HOT TIPS**\n", ":class: tip\n", "\n", "**The number of rows in your dataset shouldn't change!** Do both of these and you'll be set:\n", "1. Set `how = 'left'`\n", " - _From the table above, this means that the post-merge data will have a length equal to the number of unmatched rows from the left plus the number of rows from the intersection (\"inner\")_\n", "2. Set `validate='one_to_one'` or `validate='many_to_one'`\n", " - This means that any observations from the left that are matched to the right (the intersection of the two datasets) can only be matched to one observation. 
In other words, the intersection will not be larger than it is in the left data.\n", "```\n", "\n", "You can verify that your dataset is the same length before and after the merge in many ways. Here is an easy one: \n", " \n", "```python\n", "# a way to check:\n", "len(df)   # note the length before the merge\n", "df = df.merge(someNewData, \n", "              on=listofvars,   # the key variable(s)\n", "              how='left',\n", "              validate='one_to_one')   # or validate='many_to_one'\n", "len(df)   # should match the length from before the merge\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating variables around a merge\n", "\n", "Suppose you want to merge in a new variable. The variable is a function of a variable in another dataset. Should we create the variable before the merge or after the merge?\n", "\n", "```python\n", "# before: create the variable in the right dataset, then merge\n", "right['newVar'] = function(right['x'])\n", "merged = left.merge(right)\n", "\n", "# after: merge, then create the variable\n", "merged = left.merge(right)\n", "merged['newVar'] = function(merged['x'])\n", "```\n", "\n", "Both are feasible! So which should you choose?\n", "\n", "```{note}\n", "[Golden Rule 4.B from the Good Analysis Practices chapter](../02/10_Golden_3) is \"Keep data normalized as far into your code pipeline as you can.\" \n", "```\n", "\n", "One implication of that is to create variables on the \"smallest\" dataset where you can validly compute the variable. By \"smallest\", I mean: the dataset where the units exactly correspond to the statistic you have in mind.\n", "\n", "If you want to calculate the variance of annual industry sales, open an industry-level dataset with annual sales, and take the variance. Do not merge annual sales into an industry-day level dataset (with many days for each industry) and then take the variance. \n", "\n", "Let's illustrate:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
industryx
017
121
230
342
452
564
\n", "
" ], "text/plain": [ " industry x\n", "0 1 7\n", "1 2 1\n", "2 3 0\n", "3 4 2\n", "4 5 2\n", "5 6 4" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "STD if you calculate it before the merge: 2.503331114069145\n" ] } ], "source": [ "right = pd.DataFrame({\n", " \"industry\":[1,2,3,4,5,6], \n", " }) \n", "right['x'] = pd.DataFrame(np.random.randint(0,10,size=(6, 1)), columns=list('x'))\n", "display(right)\n", "print(\"STD if you calculate it before the merge:\",right.x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that is the correct standard deviation.\n", "\n", "If you use this as the right dataframe in a M:1 merge with an industry-day dataset as the left, our right df will show up in the post-merge dataset many times: Eeach 1 of the right is matched to the left M times." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "STD if you calculate it after the merge: 2.3176342617912282\n" ] } ], "source": [ "after_merge = right\n", "after_merge = after_merge.append(right).append(right).append(right).append(right).append(right)\n", "print(\"STD if you calculate it after the merge:\",after_merge.x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tips and best practices\n", "\n", "````{warning}\n", "```{tip}\n", "⭐⭐⭐ **THESE ARE IMPORTANT** ⭐⭐⭐\n", "```\n", "````\n", "\n", "1. **Pick the \"keys\" you'll merge on**\n", " 1. What are the observation units in each of your datasets? \n", " 1. What variable (**or variables**) should you be merging on? For example: Should you merge based on the firm, or the firm AND the year? \n", "
\n", " **I usually merge with as many levels of units as are in both datasets.** So if I have two firm-year datasets, I merge on firm year. If I have an asset-day dataset (stock returns) and a daily dataset (returns for the overall stock market), merge on the date. \n", "1. **Before (before!) your merge, examine the keys**\n", " 1. Drop any observations with missing keys in each dataset\n", " 2. How many unique keys are in each dataset? Simply replace `df` and `keyvars` in this: `len(df[].drop_duplicates()`\n", " 3. What will the observation unit be after your merge? E.g., if you have a firm dataset, and a weekly dataset, the resulting dataset might be a firm-week data. \n", "1. **Always specify `how`, `on`, `indicator`, and `validate` inside `pd.merge()`**\n", " 1. This will force you to think about the observation levels in each dataset you're merging before you try the merge, and whether the merge you're doing is 1:1, 1:M, M:M, or M:1.\n", " 2. Guess how many observations you'll have (more or less than left? more or less than right or left?) and then check afterwards. \n", "1. **After the merge**, check that it did what you expected, and give the resulting dataframe a _good_ name. **Don't name it \"merged\"!!!**\n", " 1. Look at a good chunk of data before and after the merge. This is just following the \"look at the data\" golden rule. \n", " 1. Examine the `_merge` variable (value_counts, e.g.)\n", " 1. Good names: I often actively name the dataframe to the new observation level. \n", " \n", " _For example, I know exactly how `state_industry_year_df` and `state_industry_df` should differ._ \n", " \n", "````{warning}\n", "If I see assignments with a line like `merged=pd.merge()`, I'll mark that down automatically. 
\n", "```` \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A nice overview\n", "\n", "The [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html) has a wonderful breakdown of the mechanics of merging. You should read it!\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }