3.2.2. Pandas Vocab and the Shape of Data¶

Pandas is a library that helps you work with data!

At the top of your python code, load Pandas like this:

import pandas as pd

Now you can use pandas throughout your file via the pd object.

3.2.2.1. Pandas Vocab¶

The key object in the pandas library is that you put data into dataframes, which are like Excel spreadsheets
Variables are in columns (which have a name that identifies the column)
Observations are in rows (which have an index that identifies the row)
- In our “Golden Rules” chapter we used the term “key” (which I prefer), but pandas uses Index.
If you create an object with a single variable, pandas might store it as a series object
“Wide data” vs. “Long data”: See the next section

3.2.2.2. The shape of data¶

Recall

The fundamental principle of database design is that the physical structure of a database should communicate its logical structure

Data can be logically stored in many ways. Let’s start by showing one dataset, and three ways it can be stored.

“Long” Data

Here is a long dataset. The “key” or “index” is the combination of year and firm. This is what I’d refer to as a “firm-year” dataset.

Year	Firm	Sales	Profits
2000	Ford	10	1
2001	Ford	12	1
2002	Ford	14	1
2003	Ford	16	1
2000	GM	11	0
2001	GM	13	2
2002	GM	13	0
2003	GM	15	2

Same data, but “Wide”

This is the exact same data, but stored as a wide dataset. Now, the “key” or “index” is just the year, and each variable is duplicated for each firm. So there is a Sales variable for GM, and a Sales variable for Ford.

Year	Sales GM	Ford	Profits GM	Ford
2000	11	10	0	1
2001	13	12	2	1
2002	13	14	0	1
2003	15	16	2	1

“MultiColumnIndex” and “MultiIndex” in pandas

Notice here how the variables have multiple levels for the variable name: level 0 is “Sales” which applies to the level 1 “GM” and “Ford”. Thus, column 2 is Sales of GM and column 3 is Sales of Ford.

This combination of the two levels of the variable name is called a “MultiColumnIndex” in pandas.

A similar case can occur for the row names/numbers. If there is one level of the row name, that’s the Index. If there are multiple levels, it is called a MultiIndex.

Same data, but differently “Wide”

Now, the “key” or “index” is just the firm, and each variable is duplicated for each year. So there is a Sales variable for 2000, and a Sales variable for 2001.

Firm	Sales 2000	2001	2002	2003	Profits 2000	2001	2002	2003
Ford	10	12	14	16	1	1	1	1
GM	11	13	13	15	0	2	0	2

3.2.2.2.1. Which shape should I use?¶

A, nuanced: It depends on what you’re using the data for!

I try to make my data “tidy” when I start analysis. Tidy data is quicker to analyze!
- The “long” dataset is the only one above that is tidy
- This is a good description of what “tidy” is, what “messy” often is, and how to tidy messy data. Focus not on the code, but the ideas in the text and the data (shown as code comments)
- On the main resources page, there is a link to a resource I highly recommend, Coding best practices, and project management. “Chapter 5” is about tidy data, even though they never use the phrase.
Seaborn likes plotting long data, while pandas likes plotting wide data

3.2.2.2.2. How do I convert between wide and long data?¶

That’s covered in the common tasks page.

LeDataSciFi-2024