Answers to Pandas Exercises¶
There are many ways to solve each of these!
Run each cell below in little incremental bits to really see how the code blocks work.
import pandas as pd
import pandas_datareader as pdr # IF NECESSARY, from terminal: pip install pandas_datareader
import datetime
import numpy as np
start = datetime.datetime(2017, 1, 1) # you can specify start and end dates this way
end = datetime.datetime(2021, 1, 27)
macro_df = pdr.data.DataReader(['GDP','CPIAUCSL','UNRATE'], 'fred', start, end)
Part 1¶
During class, I used this dataframe to go over Pandas vocab, and we showed how to:

- access one variable (note: pd calls this a "Series" object, which is a 1D object instead of a 2D object)
- access multiple variables
- access, print, and change column names
- access, print, reset, and set the index
Questions:
Q0: Do each of the four new golden rules for initial data exploration, from the lecture.
Q1: What is the second series above?
Q2: What is the frequency of the series?
Q3: What is the average ANNUAL GDP, based on the data?
# do your work here
def insufficient_but_starting_eda(df, cat_vars_list=None):
    '''
    Parameters
    ----------
    df : DATAFRAME
    cat_vars_list : LIST, optional
        A list of strings containing variable names in the dataframe
        for variables where you want to see the number of unique values
        and the 10 most common values. Likely used for categorical variables.

    Returns
    -------
    None. It simply prints.

    Description
    -------
    This function will print a MINIMUM amount of info about a new dataframe.
    You should ****look**** at all this output below and consider the data
    exploration and cleaning questions from
    https://ledatascifi.github.io/ledatascifi-2021/content/03/02e_eda_golden.html#member
    Also LOOK at more of the data manually.
    Then write up anything notable you observe.

    TIP: put this function in your codebook to reuse easily.
    PROTIP: improve this function (better outputs, better formatting).
    FEATURE REQUEST: optionally print the nunique and top 10 values under the describe matrix
    FEATURE REQUEST: optionally print more stats (percentiles)
    '''
    print(df.head(), '\n---')
    print(df.tail(), '\n---')
    print(df.columns, '\n---')
    print("The shape is: ", df.shape, '\n---')
    print("Info:", df.info(), '\n---') # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
    print(df.describe(), '\n---')      # summary stats, and you can customize the list!
    if cat_vars_list is not None:
        for var in cat_vars_list:
            print(var, "has", df[var].nunique(), "values and its top 10 most common are:")
            print(df[var].value_counts().head(10), '\n---')

insufficient_but_starting_eda(macro_df, ['UNRATE'])
GDP CPIAUCSL UNRATE
DATE
2017-01-01 19237.435 243.620 4.7
2017-02-01 NaN 243.872 4.6
2017-03-01 NaN 243.766 4.4
2017-04-01 19379.232 244.274 4.5
2017-05-01 NaN 244.069 4.4
---
GDP CPIAUCSL UNRATE
DATE
2020-09-01 NaN 260.149 7.8
2020-10-01 21479.529 260.462 6.9
2020-11-01 NaN 260.927 6.7
2020-12-01 NaN 261.560 6.7
2021-01-01 NaN 262.231 6.3
---
Index(['GDP', 'CPIAUCSL', 'UNRATE'], dtype='object')
---
The shape is: (49, 3)
---
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49 entries, 2017-01-01 to 2021-01-01
Freq: MS
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GDP 16 non-null float64
1 CPIAUCSL 49 non-null float64
2 UNRATE 49 non-null float64
dtypes: float64(3)
memory usage: 1.5 KB
Info: None
---
GDP CPIAUCSL UNRATE
count 16.000000 49.000000 49.000000
mean 20630.206313 252.878469 5.034694
std 862.963655 5.557894 2.524179
min 19237.435000 243.620000 3.500000
25% 19857.794250 248.721000 3.800000
50% 20826.288000 252.899000 4.000000
75% 21367.290000 257.387000 4.500000
max 21747.394000 262.231000 14.800000
---
UNRATE has 22 values and its top 10 most common are:
3.8 7
4.0 5
3.6 5
4.4 4
3.7 4
3.5 3
4.1 3
4.2 2
4.3 2
6.7 2
Name: UNRATE, dtype: int64
---
Some answers¶
Q0: What have we learned about the data? Anything to keep track of? (The dataset is monthly; GDP is only reported quarterly, so it has many missing values)
Q1: Inflation (CPI)
Q2: Monthly, but GDP is only available quarterly
Q3: About $20,630 billion (roughly $20.6 trillion)
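FRED's GDP series is reported in billions of dollars at a seasonally adjusted annual rate, so the Q3 answer is just the mean of the non-missing quarterly observations. A minimal sketch of that calculation with a hypothetical mini version of macro_df (the GDP values here are made up):

```python
import numpy as np
import pandas as pd

# hypothetical mini version of macro_df: a monthly index where GDP
# only shows up in the first month of each quarter (as FRED delivers it)
idx = pd.date_range("2017-01-01", "2018-12-01", freq="MS")
gdp = pd.Series(np.nan, index=idx, name="GDP")
gdp[idx.month.isin([1, 4, 7, 10])] = list(range(100, 108))  # 8 quarterly obs

# .mean() skips NaN by default, so this is the average of the
# quarterly (annualized) GDP figures -- the Q3 calculation
avg_gdp = gdp.mean()
print(avg_gdp)  # 103.5
```

The same `.mean()` on the real macro_df's GDP column produces the 20,630 figure above.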
Part 2¶
Q4: Download the annual real gdp from 1960 to 2018 from FRED and compute the average annual percent change
Q5: Compute the average gdp percent change within each decade
# do your work here
# v1:
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['real_gdp_pct'] = part2_df['GDPCA'].pct_change()
part2_df['real_gdp_pct'].mean()
# prof notes: after students present, go through, bit by bit
# then add comments (a la pseudocode),
# reiter access var: df[var], how methods apply to obj df[var].func()
0.030766069353016783
# v1.1: chain the last 2 lines together
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['GDPCA'].pct_change().mean()
0.030766069353016783
# v1.2: that last line could be:
(part2_df
 .pct_change()
 .mean()
)
# explain reasons for chaining: readable, no temp objects
# breaking this up into 3 lines is silly, but it shows chaining over multiple lines
# how can we add the first line (the DL) into this? (next block)
GDPCA 0.030766
dtype: float64
# v2.0: chaining - but involves a lambda function workaround
# don't cover until we go over lambdas
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create the var
    # syntax is: .assign(newname = fcn(df[var]))
    # try this:
    # .assign(sillyvar=1)
    # but this df doesn't have a name! how do we access it?
    # trick: a "lambda" function.
    # here x is the object that .assign is working on,
    # meaning whatever is produced right before .assign,
    # which is just the **df** we DLed on the line above
    #
    # this is VERY COMMON in pandas chaining:
    # .assign(newvarname = lambda <tempnameforpriorobj>: <do stuff to tempnameforpriorobj>)
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # grab the var and take its mean
    ['real_gdp_pct'].mean()
)
0.030766069353016783
# v2.1: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get mean value
    ['real_gdp_pct'].mean()
)
0.030766069353016783
# Q5:
import numpy as np

# v2.1: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var: pct change
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get the decade from the index (do this BIT BY BIT)
    .reset_index() # turn the index into a variable
    # how to pull the year out of the date?
    # DATE is a datetime series; .dt is a way to access properties of the date
    .assign(decade = lambda x: np.floor(x.DATE.dt.year/10)*10)
    # "for each decade" = groupby!
    .groupby('decade')
    # take the mean
    ['real_gdp_pct'].mean()
)
decade
1960.0 0.047426
1970.0 0.032352
1980.0 0.031240
1990.0 0.032220
2000.0 0.019099
2010.0 0.023165
Name: real_gdp_pct, dtype: float64
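A variant of the Q5 answer: the reset_index() / dt.year detour can be skipped by grouping on an array computed directly from the DatetimeIndex. A minimal sketch with a hypothetical stand-in for the FRED download (the GDPCA values are made up):

```python
import pandas as pd

# hypothetical stand-in for the FRED download: a few annual GDP levels
df = pd.DataFrame(
    {"GDPCA": [100.0, 103.0, 106.09, 110.0, 113.3]},
    index=pd.to_datetime(["1968-01-01", "1969-01-01", "1970-01-01",
                          "1971-01-01", "1972-01-01"]),
)
df.index.name = "DATE"

# integer division on the index's year gives the decade directly,
# so no reset_index() is needed
decade_means = (
    df.assign(real_gdp_pct=lambda x: x["GDPCA"].pct_change())
      .groupby(df.index.year // 10 * 10)["real_gdp_pct"]
      .mean()
)
print(decade_means)
```

Grouping by an array (here, the decade of each row) works because its length matches the dataframe; `np.floor(year/10)*10` and `year // 10 * 10` give the same decades, but the latter stays integer-typed.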
Part 3¶
First, I’ll load January data on unemployment rates, Case-Shiller house price indexes, and real median household income for three states (CA/MI/PA). No Case-Shiller index exists for a PA city, so I use LA, Detroit, and DC as stand-ins.
# LOAD DATA AND CONVERT TO ANNUAL
start = 1990 # pandas datareader can infer these are years
end = 2018
macro_data = pdr.data.DataReader(['CAUR','MIUR','PAUR', # unemployment rates
                                  'LXXRSA','DEXRSA','WDXRSA', # Case-Shiller index in LA, Detroit, DC (no PA available!)
                                  'MEHOINUSCAA672N','MEHOINUSMIA672N','MEHOINUSPAA672N'], # real median household income
                                 'fred', start, end)
macro_data = macro_data.resample('Y').first() # gets the first observation of each variable in a given year
# CLEAN UP THE FORMATTING SOMEWHAT
macro_data.index = macro_data.index.year
print("\n\n DATA BEFORE FORMATTING: \n\n")
print(macro_data[:20]) # see how the data looks now? ugly variable names, but it's an annual dataset at least
macro_data.columns = pd.MultiIndex.from_tuples([
    ('Unemployment','CA'),('Unemployment','MI'),('Unemployment','PA'),
    ('HouseIdx','CA'),('HouseIdx','MI'),('HouseIdx','PA'),
    ('MedIncome','CA'),('MedIncome','MI'),('MedIncome','PA')
])
print("\n\n DATA AFTER FORMATTING: \n\n")
print(macro_data[:20]) # this is a dataset that is "wide", and now
                       # the column variable names have 2 levels: the var name,
                       # and the unit/state that variable applies to
DATA BEFORE FORMATTING:
CAUR MIUR PAUR LXXRSA DEXRSA WDXRSA MEHOINUSCAA672N \
DATE
1990 5.2 7.8 5.1 100.471193 NaN 93.362855 63333.0
1991 7.1 8.8 6.6 95.569015 58.420806 89.706871 61797.0
1992 8.5 9.4 7.5 92.786926 59.748947 88.573807 62517.0
1993 9.7 7.6 7.3 85.246295 61.564205 89.065118 59529.0
1994 9.2 7.1 6.6 77.395052 64.526663 88.988467 60464.0
1995 7.7 5.2 5.8 76.376389 68.973289 89.670303 61846.0
1996 7.7 4.9 5.9 73.919989 73.582339 88.655369 63176.0
1997 6.8 4.7 5.3 74.536884 79.347642 88.775224 63245.0
1998 6.0 4.0 4.7 81.035037 85.183613 90.108149 64349.0
1999 5.6 3.9 4.5 92.140086 92.433567 94.188054 67156.0
2000 5.0 3.3 4.1 101.031209 100.145625 100.871832 69696.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 68413.0
2002 6.5 6.4 5.5 122.706507 111.468549 127.245701 67600.0
2003 6.8 6.6 5.8 145.752401 115.704198 146.676851 68695.0
2004 6.5 6.8 5.5 178.773752 119.772720 169.535303 66782.0
2005 5.8 7.2 5.2 221.471364 123.264823 210.799937 67918.0
2006 5.0 6.8 4.7 268.208844 126.872678 250.272528 70316.0
2007 4.9 6.9 4.3 270.804920 118.163891 241.411126 68890.0
2008 6.0 7.1 4.7 226.111966 100.378575 215.656581 67865.0
2009 9.7 10.9 7.0 167.831216 77.690858 174.183926 67051.0
MEHOINUSMIA672N MEHOINUSPAA672N
DATE
1990 56954.0 55181.0
1991 58957.0 55744.0
1992 57795.0 53523.0
1993 57064.0 54151.0
1994 60384.0 54877.0
1995 60872.0 57693.0
1996 63849.0 56807.0
1997 61728.0 59776.0
1998 65744.0 61333.0
1999 70942.0 58119.0
2000 67755.0 62789.0
2001 65206.0 62966.0
2002 60871.0 60562.0
2003 62734.0 59823.0
2004 57331.0 59841.0
2005 60278.0 60760.0
2006 61835.0 61619.0
2007 61024.0 59870.0
2008 59264.0 61185.0
2009 54939.0 57540.0
DATA AFTER FORMATTING:
Unemployment HouseIdx MedIncome \
CA MI PA CA MI PA CA
DATE
1990 5.2 7.8 5.1 100.471193 NaN 93.362855 63333.0
1991 7.1 8.8 6.6 95.569015 58.420806 89.706871 61797.0
1992 8.5 9.4 7.5 92.786926 59.748947 88.573807 62517.0
1993 9.7 7.6 7.3 85.246295 61.564205 89.065118 59529.0
1994 9.2 7.1 6.6 77.395052 64.526663 88.988467 60464.0
1995 7.7 5.2 5.8 76.376389 68.973289 89.670303 61846.0
1996 7.7 4.9 5.9 73.919989 73.582339 88.655369 63176.0
1997 6.8 4.7 5.3 74.536884 79.347642 88.775224 63245.0
1998 6.0 4.0 4.7 81.035037 85.183613 90.108149 64349.0
1999 5.6 3.9 4.5 92.140086 92.433567 94.188054 67156.0
2000 5.0 3.3 4.1 101.031209 100.145625 100.871832 69696.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 68413.0
2002 6.5 6.4 5.5 122.706507 111.468549 127.245701 67600.0
2003 6.8 6.6 5.8 145.752401 115.704198 146.676851 68695.0
2004 6.5 6.8 5.5 178.773752 119.772720 169.535303 66782.0
2005 5.8 7.2 5.2 221.471364 123.264823 210.799937 67918.0
2006 5.0 6.8 4.7 268.208844 126.872678 250.272528 70316.0
2007 4.9 6.9 4.3 270.804920 118.163891 241.411126 68890.0
2008 6.0 7.1 4.7 226.111966 100.378575 215.656581 67865.0
2009 9.7 10.9 7.0 167.831216 77.690858 174.183926 67051.0
MI PA
DATE
1990 56954.0 55181.0
1991 58957.0 55744.0
1992 57795.0 53523.0
1993 57064.0 54151.0
1994 60384.0 54877.0
1995 60872.0 57693.0
1996 63849.0 56807.0
1997 61728.0 59776.0
1998 65744.0 61333.0
1999 70942.0 58119.0
2000 67755.0 62789.0
2001 65206.0 62966.0
2002 60871.0 60562.0
2003 62734.0 59823.0
2004 57331.0 59841.0
2005 60278.0 60760.0
2006 61835.0 61619.0
2007 61024.0 59870.0
2008 59264.0 61185.0
2009 54939.0 57540.0
Q6: for each decade and state, report the average annual CHANGE (level, not percent) in unemployment
Q7: for each decade and state, report the average annual PERCENT CHANGE in house prices and household income
# do your work here:
# let's pseudocode
# q6
# get decade variable
# get annual change/difference (level) in unemploy for each state
# average unemploy for each state within decade
# q7
# get decade variable
# get annual pct change in house price and income for each state
# average those for each state within decade
# HEY! those are similar - let's combine:
# get decade variable
# get annual change in unemploy for each state
# get annual pct change in house price and income for each state
# average unemploy for each state within decade
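The combined pseudocode could be sketched on a tiny hand-made tall dataframe like this (all numbers hypothetical, only two variables kept for brevity). The subtle point is that the diffs and percent changes are computed within each state via groupby, so the first row of one state is never differenced against the last row of the previous state:

```python
import pandas as pd

# hand-made tall data, one row per (state, year); all numbers hypothetical
tall = pd.DataFrame({
    "state":        ["CA"] * 3 + ["MI"] * 3,
    "DATE":         [1991, 1992, 1993] * 2,
    "Unemployment": [7.1, 8.5, 9.7, 8.8, 9.4, 7.6],
    "HouseIdx":     [95.0, 92.0, 85.0, 58.0, 59.0, 61.0],
})

out = (
    tall
    .assign(
        # get decade variable
        decade = lambda x: 10 * (x["DATE"] // 10),
        # compute changes WITHIN each state, so the first MI row is
        # never differenced against the last CA row
        unemploy_diff  = lambda x: x.groupby("state")["Unemployment"].diff(),
        HouseIdx_pctch = lambda x: x.groupby("state")["HouseIdx"].pct_change(),
    )
    # average within each (state, decade)
    .groupby(["state", "decade"])[["unemploy_diff", "HouseIdx_pctch"]]
    .mean()
)
print(out)
```

With real data you would add the MedIncome percent change the same way; the structure of the pipeline doesn't change.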
Error easter-egg hunt¶
This code is allllllllmost correct. It actually generates incorrect answers. If you figure out the problem and/or the solution, please submit an error fix via the “edit” button at the top of this page.
pd.set_option('display.float_format', '{:,.2f}'.format)
(
    # reformat the data to tall:
    macro_data.stack().swaplevel().sort_index().reset_index().rename(columns={'level_0':'state'})
    # create vars <---- this is not even needed to explain this block!
    .assign(
        decade          = lambda x: 10*np.floor(x['DATE']/10).astype(int),
        unemploy_diff   = lambda x: x['Unemployment'].diff(),
        HouseIdx_pctch  = lambda x: x['HouseIdx'].pct_change(),
        MedIncome_pctch = lambda x: x['MedIncome'].pct_change()
    )
    # opt A for output:
    .pivot_table(index='decade',
                 columns='state',
                 values=['unemploy_diff','HouseIdx_pctch','MedIncome_pctch'])
    .multiply(100) # for more meaningful displays
    # opt B for output + formatting (here, as percentages)
    # .groupby(['state','decade'])
    # [['unemploy_diff','HouseIdx_pctch','MedIncome_pctch']].mean()
    # .multiply(100)
    # # note about this: unemploy_diff isn't a % change, but a change in percentage points,
    # # so I make that explicit
    # .style.format({'HouseIdx_pctch': '{0:,.2f}%',
    #                'MedIncome_pctch': '{0:,.2f}%',
    #                'unemploy_diff': '{0:,.2f} p.p.'})
)
       HouseIdx_pctch             MedIncome_pctch             unemploy_diff
state              CA    MI    PA              CA    MI    PA            CA     MI     PA
decade
1990            -0.71 -3.13 -2.07            0.69  0.22 -0.46          4.44  -5.00   0.00
2000             7.52 -1.20  7.22            0.01 -2.42 -0.03         41.00  70.00  25.00
2010             5.84  5.16  2.83            0.84  1.42  1.54        -58.89 -71.11 -26.67