3.2.7. Common tasks

This page is kind of long. (It’s got a lot of useful info!) Use the page’s table of contents to the right to jump to what you’re looking for.

3.2.7.1. Reshaping data

In the shape of data page, I explained the concept of wide vs. tall data with this example:

import pandas as pd

df = (pd.Series({   ('Ford',2000):10,
                   ('Ford',2001):12,
                   ('Ford',2002):14,
                   ('Ford',2003):16,
                   ('GM',2000):11,
                   ('GM',2001):13,
                   ('GM',2002):13,
                   ('GM',2003):15})
      .to_frame()
      .rename(columns={0:'Sales'})
      .rename_axis(['Firm','Year'])
      .reset_index()
     )
print("Tall:")
display(df)
Tall:
Firm Year Sales
0 Ford 2000 10
1 Ford 2001 12
2 Ford 2002 14
3 Ford 2003 16
4 GM 2000 11
5 GM 2001 13
6 GM 2002 13
7 GM 2003 15

Note

To reshape dataframes, you have to work with index and column names.

So before we use stack and unstack here, put the firm and year into the index.

tall = df.set_index(['Firm','Year'])

3.2.7.1.1. To convert a tall dataframe to wide: df.unstack().

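If your index has multiple levels, the level parameter picks which one to unstack. Level 0 is the outermost level of the index (here, Firm). The default, level=-1, is the innermost level (here, Year). You can also pass the level's name, e.g. level='Firm'.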

print("\n\nUnstack (make it shorter+wider) on level 0/Firm:\n") 
display(tall.unstack(level=0))
print("\n\nUnstack (make it shorter+wider) on level 1/Year:\n") 
display(tall.unstack(level=1))
Unstack (make it shorter+wider) on level 0/Firm:
Sales
Firm Ford GM
Year
2000 10 11
2001 12 13
2002 14 13
2003 16 15
Unstack (make it shorter+wider) on level 1/Year:
Sales
Year 2000 2001 2002 2003
Firm
Ford 10 12 14 16
GM 11 13 13 15
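
To make the level argument concrete, here is a small sketch that rebuilds the tall dataframe from above and checks that a level can be picked by position or by name. The names by_position, by_name, and default are just labels I made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Firm':  ['Ford']*4 + ['GM']*4,
                   'Year':  [2000, 2001, 2002, 2003]*2,
                   'Sales': [10, 12, 14, 16, 11, 13, 13, 15]})
tall = df.set_index(['Firm', 'Year'])

by_position = tall.unstack(level=0)       # level 0 = outermost index level = Firm
by_name     = tall.unstack(level='Firm')  # same result, but easier to read later
default     = tall.unstack()              # default is level=-1 = innermost = Year

print(by_position.equals(by_name))        # True
print(by_position.shape, default.shape)   # (4, 2) (2, 4)
```

Referring to levels by name tends to be more robust than using positions, because the code keeps working if the order of the index levels changes.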

3.2.7.1.2. To convert a wide dataframe to tall/long: df.stack().

Tip

Pay attention after reshaping to the order of your index variables and how they are sorted.

# save the wide df above to this name for subseq examples
wide_year = tall.unstack(level=0) 

print("\n\nStack it back (make it tall): wide_year.stack()\n") 
display(wide_year.stack())
print("\n\nYear-then-firm doesn't make much sense.\nReorder to firm-year: wide_year.stack().swaplevel()") 
display(wide_year.stack().swaplevel())
print("\n\nYear-then-firm sorting doesn't make much sense.\nSort to firm-year: wide_year.stack().swaplevel().sort_index()") 
display(wide_year.stack().swaplevel().sort_index())
Stack it back (make it tall): wide_year.stack()
Sales
Year Firm
2000 Ford 10
GM 11
2001 Ford 12
GM 13
2002 Ford 14
GM 13
2003 Ford 16
GM 15
Year-then-firm doesn't make much sense.
Reorder to firm-year: wide_year.stack().swaplevel()
Sales
Firm Year
Ford 2000 10
GM 2000 11
Ford 2001 12
GM 2001 13
Ford 2002 14
GM 2002 13
Ford 2003 16
GM 2003 15
Year-then-firm sorting doesn't make much sense.
Sort to firm-year: wide_year.stack().swaplevel().sort_index()
Sales
Firm Year
Ford 2000 10
2001 12
2002 14
2003 16
GM 2000 11
2001 13
2002 13
2003 15

Beautiful!
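
As a quick sanity check, the sketch below strings the steps from this section into one chain and confirms that going wide and back to tall recovers the original data. It rebuilds the tall dataframe so that it runs on its own; roundtrip is just a name chosen for this example.

```python
import pandas as pd

tall = (pd.DataFrame({'Firm':  ['Ford']*4 + ['GM']*4,
                      'Year':  [2000, 2001, 2002, 2003]*2,
                      'Sales': [10, 12, 14, 16, 11, 13, 13, 15]})
        .set_index(['Firm', 'Year']))

roundtrip = (tall
             .unstack(level='Firm')   # wide: one column per firm
             .stack()                 # tall again, but indexed year-then-firm
             .swaplevel()             # put Firm first in the index
             .sort_index()            # sort rows by Firm, then Year
            )

print(list(roundtrip.index.names))                          # ['Firm', 'Year']
print(roundtrip['Sales'].tolist() == tall['Sales'].tolist())  # True
```

Depending on your pandas version, .stack() may print a warning about its implementation changing. The result for this small example is the same either way.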

3.2.7.2. Lambda (in assign or after groupby)

You will see this inside pandas chains a lot: lambda x: someFunc(x), e.g.:

  • .assign(lev = lambda x: (x['dltt']+x['dlc'])/x['at']  )

  • .groupby('industry').apply(lambda x: x['lev'].mean()  )

What is that “lambda” and why is it there? Normally, to reference a variable you type the dataframe name and the variable name, like df['lev'].

But inside a chain, the dataframe you are working on is often an intermediate result that was never saved to a name, so there is nothing to type.

In the example above, [df].groupby('industry').apply(lambda x: x['lev'].mean()  ), pandas splits the dataframe into groups, applies a function within each group (here: the mean), and then returns a new object with one observation for each group (the average leverage for the industry). (After groupby, use .apply() to run the lambda on each group; groupby objects don’t have an .assign() method.) Visually, this split-apply-combine [1] process looks like this:

[Figure: the split-apply-combine process]

So, the .apply() portion is working on these tiny pieces of the dataframe. Those pieces are dataframe objects that don’t have names!

So how do you refer to an unnamed dataframe object?

Answer: Lambda functions. When you type <some df object>.assign(newVar = lambda x: someFunc(x)), x is the object (“some df object”) that assign is working on. Ta da!

# common syntax within pandas
.assign(<newvarname> = lambda <tempnameforpriorobj>:  <do stuff to tempnameforpriorobj>   )       

# often, tempname is just "x" for short
.assign(<newvarname> = lambda x: <someFunc(x)> )       

Note

It turns out that lambda functions are very useful in python programming, and not just within pandas. But pandas is where we will use them most in this class.
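
Here is a minimal runnable sketch of both uses. The numbers are made up, and I am assuming dltt, dlc, and at carry their usual Compustat meanings (long-term debt, debt in current liabilities, total assets). The helper leverage is hypothetical; it is only there to show that a lambda is an ordinary function without a name.

```python
import pandas as pd

# made-up numbers, purely for illustration
firms = pd.DataFrame({'industry': ['auto', 'auto', 'tech', 'tech'],
                      'dltt':     [10, 20, 30, 40],
                      'dlc':      [5, 5, 10, 0],
                      'at':       [100, 50, 200, 80]})

# a named function and a lambda do the same job inside assign
def leverage(x):
    return (x['dltt'] + x['dlc']) / x['at']

with_def    = firms.assign(lev = leverage)
with_lambda = firms.assign(lev = lambda x: (x['dltt'] + x['dlc']) / x['at'])
print(with_def.equals(with_lambda))   # True

# lambda inside a chain: x is whatever object the chain has built so far
avglev = (firms
          .assign(lev = lambda x: (x['dltt'] + x['dlc']) / x['at'])
          .groupby('industry')['lev']
          .apply(lambda x: x.mean())  # here x is one industry's 'lev' values
         )
print(avglev)
```

In the groupby step I select the 'lev' column before .apply(), so each x is a small unnamed Series rather than a small unnamed dataframe. The idea is the same.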

3.2.7.3. .transform() after groupby

Sometimes you get a statistic for a group, but you want that statistic in every single row of your original dataset.

But groupby creates a new dataframe that is smaller, with only one row per group.

import pandas as pd 
import numpy as np
df = pd.DataFrame({'key':["A",'B','C',"A",'B','C'],
                   'data':np.arange(1,7)}).set_index('key').sort_index()

display(df) # the input
data
key
A 1
A 4
B 2
B 5
C 3
C 6
# groupby().sum() shrinks the dataset
display(df.groupby(level='key')['data'].sum()
       .to_frame() ) # just added this line bc df prints prettier than series
data
key
A 5
B 7
C 9
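# groupby().transform('sum') does NOT shrink the dataset
# (passing the name 'sum' as a string avoids a warning that newer pandas gives for the builtin sum)

df.groupby(level='key').transform('sum')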
data
key
A 5
A 5
B 7
B 7
C 9
C 9

One last trick: Let’s add that new variable to the original dataset!

# option 1: create the var (pick the 'data' column so transform returns a single series)
df['groupsum'] = df.groupby(level='key')['data'].transform('sum')

# option 2: create the var with assign (can be used inside chains)
df = df.assign(groupsum = df.groupby(level='key')['data'].transform('sum'))

display(df) 
data groupsum
key
A 1 5
A 4 5
B 2 7
B 5 7
C 3 9
C 6 9
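
A common reason to want a group statistic on every row is to compare each observation to its group. The sketch below is my own illustration on the same toy data: it combines .transform() with the lambda trick from the previous section to subtract the group mean inside a single chain. The column names groupmean and demeaned are made up for this example.

```python
import pandas as pd
import numpy as np

df = (pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': np.arange(1, 7)})
      .set_index('key')
      .sort_index())

out = (df
       .assign(
           # group mean, repeated on every row of the group
           groupmean = lambda x: x.groupby(level='key')['data'].transform('mean'),
           # later assignments can use columns created earlier in the same assign
           demeaned  = lambda x: x['data'] - x['groupmean'])
      )
print(out)
```

Because transform returns an object with the same index as the input, the new column lines up with the original rows without any merging.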

3.2.7.4. .pipe()

One problem with chains on dataframes is that you can only use methods that work on the object (a dataframe) that is getting chained.

So for example, suppose you’ve formatted a dataframe to plot. You can’t directly add a seaborn function to the chain: seaborn functions belong to the seaborn package; they are not methods of the dataframe. (It’s sns.lmplot, not df.lmplot.)

.pipe() allows you to hand a dataframe to functions that aren’t dataframe methods.

The syntax of .pipe()

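df.pipe(<'outside function'>,
        <'any other parameters youd give the outside function'>)

# if the first parameter of the outside function isnt the df, pass a tuple instead:
df.pipe((<'outside function'>,
         <'the name of the parameter that is expecting the dataframe'>),
        <'any other parameters youd give the outside function'>)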

Note that the object after the pipe command is run might not be a dataframe anymore! It’s whatever object the piped function produces!
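
Both forms of the syntax can be tried out with the small sketch below. The two helper functions, keep_above and describe_rows, are hypothetical and exist only for this illustration.

```python
import pandas as pd

def keep_above(df, col, cutoff):
    """Hypothetical helper: the dataframe is the FIRST parameter."""
    return df[df[col] > cutoff]

def describe_rows(label, data):
    """Hypothetical helper: the dataframe arrives via the parameter named 'data'."""
    return f"{label}: {len(data)} rows"

df = pd.DataFrame({'Firm':  ['Ford', 'Ford', 'GM', 'GM'],
                   'Sales': [10, 12, 11, 13]})

result = (df
          .pipe(keep_above, 'Sales', 10)               # same as keep_above(df, 'Sales', 10)
          .pipe((describe_rows, 'data'), 'big sales')  # same as describe_rows('big sales', data=<the df>)
         )
print(result)        # big sales: 3 rows
print(type(result))  # a string, not a dataframe
```

Since the last piped function returns a string, you could not continue the chain with dataframe methods after it.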

3.2.7.4.1. Example 1

From one of the pandas devs:

jack_jill = pd.DataFrame()
(jack_jill.pipe(went_up, 'hill')
    .pipe(fetch, 'water')
    .pipe(fell_down, 'jack')
    .pipe(broke, 'crown')
    .pipe(tumble_after, 'jill')
)

This really is just right-to-left function execution. The first argument to pipe, a callable, is called with the DataFrame on the left as its first argument, and any additional arguments you specify.

I hope the analogy to data analysis code is clear. Code is read more often than it is written. When you or your coworkers or research partners have to go back in two months to update your script, having the story of raw data to results be told as clearly as possible will save you time.

3.2.7.4.2. Example 2

From Steven Morse:

import seaborn as sns

(sns.load_dataset('diamonds')
 .query('cut in ["Ideal", "Good"] & \
         clarity in ["IF", "SI2"] & \
         carat < 3')
 .pipe((sns.FacetGrid, 'data'),
        row='cut', col='clarity', hue='color',
        hue_order=list('DEFGHIJ'),
        height=6,
        legend_out=True)
 .map(sns.scatterplot, 'carat', 'price', alpha=0.8)
 .add_legend())

3.2.7.5. Printing inside of chains

Tip

Sometimes it’s hard to know what’s going on inside a chain without commenting out the code and running it bit by bit.

This function will let you print messages from inside the chain, by exploiting the .pipe() method we just covered!

Copy this into your code:

def csnap(df, fn=lambda x: x.shape, msg=None):
    """ Custom Help function to print things in method chaining.    
        Will also print a message, which helps if you're printing a bunch of these, so that you know which csnap print happens at which point.
        Returns back the df to further use in chaining.
        
        Usage examples - within a chain of methods:
            df.pipe(csnap)
            df.pipe(csnap, lambda x: <do stuff>)
            df.pipe(csnap, msg="Shape here")
            df.pipe(csnap, lambda x: x.sample(10), msg="10 random obs")
    """
    if msg:
        print(msg)
    display(fn(df))
    return df

An example of this in use:

(df
 .pipe(csnap, msg="Shape before describe")
 .describe()['data']  # get the distribution stats of a variable (I'm just doing something to show csnap off)
 .pipe(csnap, msg="Shape after describe and pick one var") # see, it prints a message from within the chain!
 .to_frame()
 .assign(ones = 1)
 .pipe(csnap, lambda x: x.sample(2), msg="Random sample of df at point #3") # see, it prints a message from within the chain! 
 .assign(twos=2,threes=3)
)
Shape before describe
(6, 2)
Shape after describe and pick one var
(8,)
Random sample of df at point #3
data ones
75% 4.750000 1
std 1.870829 1
data ones twos threes
count 6.000000 1 2 3
mean 3.500000 1 2 3
std 1.870829 1 2 3
min 1.000000 1 2 3
25% 2.250000 1 2 3
50% 3.500000 1 2 3
75% 4.750000 1 2 3
max 6.000000 1 2 3
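
One caveat: display() comes from IPython/Jupyter, so csnap as written expects to run in a notebook. If you want the same trick in a plain .py script, one possible adaptation (a sketch, not part of the original function; the name csnap_print is made up) swaps display for print:

```python
import pandas as pd

def csnap_print(df, fn=lambda x: x.shape, msg=None):
    """Like csnap, but uses print() so it also runs outside Jupyter."""
    if msg:
        print(msg)
    print(fn(df))
    return df   # hand the object back so the chain can continue

df = pd.DataFrame({'data': [1, 4, 2, 5, 3, 6]})

out = (df
       .pipe(csnap_print, msg="Shape at start")
       .query('data > 2')
       .pipe(csnap_print, msg="Shape after query")
       .pipe(csnap_print, lambda x: x['data'].mean(), msg="Mean after query")
       .assign(ones = 1)
      )
```

The key design point is the same as in csnap: the function returns the object it was given, so adding or removing a .pipe(csnap_print, ...) line never changes what the chain produces.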

3.2.7.6. Prettier pandas output

A few random things:

  • Want to change the order of rows in an output table? .reindex()

  • Want to format the numbers shown by pandas?

    1. Permanent: Add this line of code to the top of your file: pd.set_option('display.float_format', '{:.2f}'.format)

    2. Temp: Add style.format to the end of your table command. E.g.: df.describe().style.format("{:.2f}")

  • Want to control the number of columns / rows pandas shows?

    1. pd.set_option('display.max_columns', 50)

    2. pd.set_option('display.max_rows', 50)

  • More formatting controls: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html
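
A short sketch of two of these tricks on made-up data. It uses pd.option_context, which is not mentioned above: it is pandas' way of applying an option only inside a with block, which can be handy when you don't want to change the setting for the whole file.

```python
import pandas as pd

df = pd.DataFrame({'x': [3.14159, 2.71828, 1.41421]})

# the float format only applies inside the "with" block
with pd.option_context('display.float_format', '{:.2f}'.format):
    formatted = df.to_string()
print(formatted)

# reorder the rows of an output table with .reindex()
stats = df.describe().reindex(['mean', 'min', 'max', 'count'])
print(stats.index.tolist())   # ['mean', 'min', 'max', 'count']
```

Note that .reindex() keeps only the rows you list, so it reorders and subsets the table in one step.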


[1] This figure is yet another resource I’m borrowing from the awesome PythonDataScienceHandbook.