3.2.2. Pandas Vocab and the Shape of Data

Pandas is a library that helps you work with data!

At the top of your python code, load Pandas like this:

import pandas as pd

Now you can use pandas throughout your file via the pd object.

3.2.2.1. Pandas Vocab

  • The key object in the pandas library is that you put data into dataframes, which are like Excel spreadsheets

  • Variables are in columns (which have a name that identifies the column)

  • Observations are in rows (which have an index that identifies the row)

    • In our “Golden Rules” chapter we used the term “key” (which I prefer), but pandas uses Index.

  • If you create an object with a single variable, pandas might store it as a series object

  • “Wide data” vs. “Long data”: See the next section

3.2.2.2. The shape of data

Recall

The fundamental principle of database design is that the physical structure of a database should communicate its logical structure

Data can be logically stored in many ways. Let’s start by showing one dataset, and three ways it can be stored.

Here is a long dataset. The “key” or “index” is the combination of year and firm. This is what I’d refer to as a “firm-year” dataset.

Year

Firm

Sales

Profits

2000

Ford

10

1

2001

Ford

12

1

2002

Ford

14

1

2003

Ford

16

1

2000

GM

11

0

2001

GM

13

2

2002

GM

13

0

2003

GM

15

2

This is the exact same data, but stored as a wide dataset. Now, the “key” or “index” is just the year, and each variable is duplicated for each firm. So there is a Sales variable for GM, and a Sales variable for Ford.


Year

Sales
GM


Ford

Profits
GM


Ford

2000

11

10

0

1

2001

13

12

2

1

2002

13

14

0

1

2003

15

16

2

1

“MultiColumnIndex” and “MultiIndex” in pandas

Notice here how the variables have multiple levels for the variable name: level 0 is “Sales” which applies to the level 1 “GM” and “Ford”. Thus, column 2 is Sales of GM and column 3 is Sales of Ford.

This combination of the two levels of the variable name is called a “MultiColumnIndex” in pandas.

A similar case can occur for the row names/numbers. If there is one level of the row name, that’s the Index. If there are multiple levels, it is called a MultiIndex.

Now, the “key” or “index” is just the firm, and each variable is duplicated for each year. So there is a Sales variable for 2000, and a Sales variable for 2001.


Firm

Sales
2000


2001


2002


2003

Profits
2000


2001


2002


2003

Ford

10

12

14

16

1

1

1

1

GM

11

13

13

15

0

2

0

2

3.2.2.2.1. Which shape should I use?

A, nuanced: It depends on what you’re using the data for!

  • I try to make my data “tidy” when I start analysis. Tidy data is quicker to analyze!

    • The “long” dataset is the only one above that is tidy

    • This is a good description of what “tidy” is, what “messy” often is, and how to tidy messy data. Focus not on the code, but the ideas in the text and the data (shown as code comments)

    • On the main resources page, there is a link to a resource I highly recommend, Coding best practices, and project management. “Chapter 5” is about tidy data, even though they never use the phrase.

  • Seaborn likes plotting long data, while pandas likes plotting wide data

3.2.2.2.2. How do I convert between wide and long data?

That’s covered in the common tasks page.