3.2.2. Pandas Vocab and the Shape of Data¶
Pandas is a library that helps you work with data!
At the top of your python code, load Pandas like this:
import pandas as pd
Now you can use pandas throughout your file via the pd
object.
3.2.2.1. Pandas Vocab¶
The key object in the pandas library is that you put data into dataframes, which are like Excel spreadsheets
Variables are in columns (which have a name that identifies the column)
Observations are in rows (which have an index that identifies the row)
In our “Golden Rules” chapter we used the term “key” (which I prefer), but pandas uses Index.
If you create an object with a single variable, pandas might store it as a series object
“Wide data” vs. “Long data”: See the next section
3.2.2.2. The shape of data¶
The fundamental principle of database design is that the physical structure of a database should communicate its logical structure
Data can be logically stored in many ways. Let’s start by showing one dataset, and three ways it can be stored.
3.2.2.2.1. Wide vs. Long (or Tall) Data¶
Here is a long dataset - the “key” or “index” is the combination of year and firm:
Year |
Firm |
Sales |
Profits |
---|---|---|---|
2000 |
Ford |
10 |
1 |
2001 |
Ford |
12 |
1 |
2002 |
Ford |
14 |
1 |
2003 |
Ford |
16 |
1 |
2000 |
GM |
11 |
0 |
2001 |
GM |
13 |
2 |
2002 |
GM |
13 |
0 |
2003 |
GM |
15 |
2 |
The exact same data, stored as a wide dataset - - the “key” or “index” is the year, and each variable is duplicated for each firm:
|
Sales |
|
Profits |
|
---|---|---|---|---|
2000 |
11 |
10 |
0 |
1 |
2001 |
13 |
12 |
2 |
1 |
2002 |
13 |
14 |
0 |
1 |
2003 |
15 |
16 |
2 |
1 |
“MultiColumnIndex” and “MultiIndex” in pandas
Notice here how the variables have multiple levels for the variable name: level 0 is “Sales” which applies to the level 1 “GM” and “Ford”. Thus, column 2 is Sales of GM and column 3 is Sales of Ford.
This combination of the two levels of the variable name is called a “MultiColumnIndex” in pandas.
A similar case can occur for the row names/numbers. If there is one level of the row name, that’s the Index. If there are multiple levels, it is called a MultiIndex.
The exact same data, stored as a wide dataset - - the “key” or “index” is the firm, and each variable is duplicated for each year:
|
Sales |
|
|
|
Profits |
|
|
|
---|---|---|---|---|---|---|---|---|
Ford |
10 |
12 |
14 |
16 |
1 |
1 |
1 |
1 |
GM |
11 |
13 |
13 |
15 |
0 |
2 |
0 |
2 |
3.2.2.2.2. Which shape should I use?¶
A, nuanced: It depends on what you’re using the data for!
I try to make my data “tidy” at the start of analysis. Tidy data is quicker to analyze!
The “long” dataset is the only one above that is tidy
This is a good description of what “tidy” is, what “messy” often is, and how to tidy messy data. Focus not on the code, but the textual ideas and the data (shown as code comments)
On the main resources page, there is a link to a resource I highly recommend, Coding best practices, and project management. “Chapter 5” is about tidy data, even though they never use the phrase.
Seaborn likes plotting long data, while pandas likes plotting wide data