{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas Vocab and the Shape of Data\n",
"\n",
"Pandas is a library that helps you work with data! \n",
"\n",
"At the top of your python code, load Pandas like this:\n",
"```py\n",
"import pandas as pd\n",
"```\n",
"Now you can use pandas throughout your file via the `pd` object. \n",
"\n",
"\n",
"\n",
"## Pandas Vocab\n",
"\n",
"- The key object in the pandas library is that you put data into **dataframes**, which are like Excel spreadsheets\n",
"- Variables are in columns (which have a **name** that identifies the column)\n",
"- Observations are in rows (which have an **index** that identifies the row)\n",
" - _In our \"Golden Rules\" chapter we used the term \"key\" (which I prefer), but pandas uses Index._\n",
"- If you create an object with a single variable, pandas might store it as a **series** object\n",
"- **\"Wide data\"** vs. **\"Long data\"**: See the [next section](#The-shape-of-data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"popout"
]
},
"source": [
"## The shape of data\n",
"\n",
"> _**The fundamental principle of database design is that the physical structure of a database should communicate its logical structure**_\n",
"\n",
"Data can be logically stored in many ways. Let's start by showing one dataset, and three ways it can be stored.\n",
"\n",
"### Wide vs. Long (or Tall) Data\n",
"\n",
"Here is a **long dataset** - the \"key\" or \"index\" is the _combination of year and firm_:\n",
"\n",
"| Year | Firm | Sales | Profits |\n",
"| :--- | :--- | :--- | :--- |\n",
"| 2000 | Ford | 10 | 1 |\n",
"| 2001 | Ford | 12 | 1 | \n",
"| 2002 | Ford | 14 | 1 |\n",
"| 2003 | Ford | 16 | 1 | \n",
"| 2000 | GM | 11 | 0 |\n",
"| 2001 | GM | 13 | 2 | \n",
"| 2002 | GM | 13 | 0 | \n",
"| 2003 | GM | 15 | 2 | \n",
"\n",
"The exact same data, stored as a **wide dataset** - - the \"key\" or \"index\" is the _year_, and each variable is duplicated for each firm:\n",
"\n",
"|
Year | Sales
GM|
Ford | Profits
GM|
Ford |\n",
"| :--- | :--- | :--- | :--- | :--- |\n",
"| 2000 | 11 | 10 | 0 | 1 |\n",
"| 2001 | 13 | 12 | 2 | 1 |\n",
"| 2002 | 13 | 14 | 0 | 1 |\n",
"| 2003 | 15 | 16 | 2 | 1 |\n",
"\n",
"```{admonition} \"MultiColumnIndex\" and \"MultiIndex\" in pandas\n",
"Notice here how the variables have multiple levels for the variable name: level 0 is \"Sales\" which applies to the level 1 \"GM\" and \"Ford\". Thus, column 2 is Sales of GM and column 3 is Sales of Ford.\n",
"\n",
"This combination of the two levels of the variable name is called a \"MultiColumnIndex\" in pandas.\n",
"\n",
"A similar case can occur for the row names/numbers. If there is one level of the row name, that's the Index. If there are multiple levels, it is called a MultiIndex. \n",
"```\n",
"\n",
"The exact same data, stored as a **wide dataset** - - the \"key\" or \"index\" is the _firm_, and each variable is duplicated for each year:\n",
"\n",
"|
Firm | Sales
2000 |
2001 |
2002 |
2003 | Profits
2000 |
2001 |
2002 |
2003 | \n",
"| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |\n",
"| Ford | 10 | 12 | 14 | 16 | 1 | 1 | 1 | 1 |\n",
"| GM | 11 | 13 | 13 | 15 | 0 | 2 | 0 | 2 |\n",
"\n",
"### Which shape should I use?\n",
"\n",
"**A, nuanced:** It depends on what you're using the data for!\n",
"- I try to make my data \"tidy\" at the start of analysis. Tidy data is quicker to analyze!\n",
" - The \"long\" dataset is the only one above that is tidy\n",
" - [This is a good description](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) of what \"tidy\" is, what \"messy\" often is, and how to tidy messy data. Focus not on the code, but the textual ideas and the data (shown as code comments)\n",
" - On the main [resources page](../about/resoruces), there is a link to a resource I highly recommend, [Coding best practices, and project management](https://web.stanford.edu/~gentzkow/research/CodeAndData.xhtml). \"Chapter 5\" is about tidy data, even though they never use the phrase.\n",
"- Seaborn likes plotting long data, while pandas likes plotting wide data\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}