Answers to Pandas Exercises¶
There are many ways to solve each of these!
Run each cell below in little incremental bits to really see how the code blocks work.
import pandas as pd
import pandas_datareader as pdr # IF NECESSARY, from terminal: pip install pandas_datareader
import datetime
import numpy as np
start = datetime.datetime(2017, 1, 1) # you can specify start and end dates this way
end = datetime.datetime(2021, 1, 27)
macro_df = pdr.data.DataReader(['GDP','CPIAUCSL','UNRATE'], 'fred', start, end)
Part 1¶
During class, I used this dataframe to go over Pandas vocab, and we showed how to:
- access one variable (note: pd calls this a "Series" object, which is a 1D object instead of a 2D object)
- access multiple variables
- access, print, and change column names
- access, print, reset, and set the index
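The vocab above can be sketched on a toy frame (hypothetical numbers standing in for the FRED download):

```python
import pandas as pd

# toy stand-in for macro_df (hypothetical numbers, not real FRED data)
df = pd.DataFrame({'GDP': [19153.9, 19322.9], 'UNRATE': [4.7, 4.5]},
                  index=pd.to_datetime(['2017-01-01', '2017-04-01']))
df.index.name = 'DATE'

one_var = df['GDP']               # a Series: 1D, one variable
two_vars = df[['GDP', 'UNRATE']]  # a DataFrame: note the double brackets

print(df.columns)                                # access/print column names
df = df.rename(columns={'UNRATE': 'UnempRate'})  # change a column name

df = df.reset_index()      # the index becomes a regular column...
df = df.set_index('DATE')  # ...and back again
```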
Questions:
Q0: Do each of the four new golden rules for initial data exploration, from the lecture.
Q1: What is the second series above?
Q2: What is the frequency of the series?
Q3: What is the average ANNUAL GDP, based on the data?
# do your work here
def insufficient_but_starting_eda(df, cat_vars_list=None):
    '''
    Parameters
    ----------
    df : DATAFRAME
    cat_vars_list : LIST, optional
        A list of strings containing variable names in the dataframe
        for variables where you want to see the number of unique values
        and the 10 most common values. Likely used for categorical variables.

    Returns
    -------
    None. It simply prints.

    Description
    -------
    This function will print a MINIMUM amount of info about a new dataframe.

    You should ****look**** at all this output below and consider the data
    exploration and cleaning questions from
    https://ledatascifi.github.io/ledatascifi-2022/content/03/02e_eda_golden.html#member

    Also LOOK at more of the data manually.

    Then write up anything notable you observe.

    TIP: put this function in your codebook to reuse easily.

    PROTIP: Improve this function (better outputs, better formatting).

    FEATURE REQUEST: optionally print the nunique and top 10 values under the describe matrix
    FEATURE REQUEST: optionally print more stats (percentiles)
    '''
    print(df.head(), '\n---')
    print(df.tail(), '\n---')
    print(df.columns, '\n---')
    print("The shape is: ", df.shape, '\n---')
    print("Info:", df.info(), '\n---')  # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
    print(df.describe(), '\n---')  # summary stats, and you can customize the list!
    if cat_vars_list is not None:
        for var in cat_vars_list:
            print(var, "has", df[var].nunique(), "values and its top 10 most common are:")
            print(df[var].value_counts().head(10), '\n---')

insufficient_but_starting_eda(macro_df, ['UNRATE'])
GDP CPIAUCSL UNRATE
DATE
2017-01-01 19153.912 243.620 4.7
2017-02-01 NaN 243.872 4.6
2017-03-01 NaN 243.766 4.4
2017-04-01 19322.920 244.274 4.5
2017-05-01 NaN 244.069 4.4
---
GDP CPIAUCSL UNRATE
DATE
2020-09-01 NaN 260.149 7.8
2020-10-01 21477.597 260.462 6.9
2020-11-01 NaN 260.927 6.7
2020-12-01 NaN 261.560 6.7
2021-01-01 22038.226 262.231 6.3
---
Index(['GDP', 'CPIAUCSL', 'UNRATE'], dtype='object')
---
The shape is: (49, 3)
---
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49 entries, 2017-01-01 to 2021-01-01
Freq: MS
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GDP 17 non-null float64
1 CPIAUCSL 49 non-null float64
2 UNRATE 49 non-null float64
dtypes: float64(3)
memory usage: 1.5 KB
Info: None
---
GDP CPIAUCSL UNRATE
count 17.000000 49.000000 49.000000
mean 20654.744824 252.878469 5.034694
std 913.690696 5.557894 2.524179
min 19153.912000 243.620000 3.500000
25% 19882.965000 248.721000 3.800000
50% 20813.325000 252.899000 4.000000
75% 21477.597000 257.387000 4.500000
max 22038.226000 262.231000 14.800000
---
UNRATE has 22 values and its top 10 most common are:
3.8 7
3.6 5
4.0 5
4.4 4
3.7 4
4.1 3
3.5 3
6.7 2
4.3 2
4.2 2
Name: UNRATE, dtype: int64
---
Some answers¶
Q0: What have we learned about the data? Anything to keep track of? (The dataset is monthly, but GDP is only reported quarterly)
Q1: Inflation (CPI)
Q2: Monthly, but GDP is only quarterly
Q3: About 20,630 (in billions of dollars, i.e., roughly $20.6 trillion)
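For Q3, one recipe: resample the monthly frame to annual frequency, average the GDP observations within each year, then average across the years. A sketch on made-up data mirroring macro_df's structure (monthly index, GDP reported only quarterly):

```python
import pandas as pd
import numpy as np

# monthly index with GDP reported only every third month (hypothetical values)
idx = pd.date_range('2017-01-01', '2018-12-01', freq='MS')
gdp = pd.Series(np.nan, index=idx)
gdp.iloc[::3] = [19000, 19100, 19200, 19300,   # 2017's four quarterly obs
                 19400, 19500, 19600, 19700]   # 2018's four quarterly obs

annual_avg = gdp.resample('YS').mean()  # mean of each year's non-missing obs
overall = annual_avg.mean()             # average ANNUAL GDP
```

On the real macro_df, the same pattern applies to the GDP column.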
Part 2¶
Q4: Download annual real GDP from 1960 to 2018 from FRED and compute the average annual percent change
Q5: Compute the average GDP percent change within each decade
# do your work here
# v1:
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['real_gdp_pct'] = part2_df['GDPCA'].pct_change()
part2_df['real_gdp_pct'].mean()
# prof notes: after students present, go through, bit by bit
# then add comments (a la pseudocode),
# reiterate accessing a var: df[var], and how methods apply to the object: df[var].func()
0.03067733411159815
# v1.1: chain the last 2 lines together
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['GDPCA'].pct_change().mean()
0.03067733411159815
# v1.2 that last line could be:
(part2_df
.pct_change()
.mean()
)
# explain reasons for chaining: readable, no temp objects
# breaking this up into 3 lines is silly but shows chaining over multiple lines
# how can we add the first line (the DL) into this? (next block)
GDPCA 0.030677
dtype: float64
# v2.0: chaining - but involves a lambda function workaround
# don't cover until we go over lambdas
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create the var
    # syntax is: .assign(newname = fcn(df[var]))
    # try this:
    #     .assign(sillyvar=1)
    # but this df doesn't have a name! how do we access it?
    # trick: a "lambda" function.
    # here x is the object that assign is working on,
    # meaning whatever is produced right before .assign,
    # which is just the **df** we DLed on the line above
    #
    # this is VERY COMMON in pandas chaining:
    # .assign(newvarname = lambda <tempnameforpriorobj>: <do stuff to tempnameforpriorobj>)
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # grab the var and take its mean
    ['real_gdp_pct'].mean()
)
0.03067733411159815
# v2.1: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get mean value
    ['real_gdp_pct'].mean()
)
0.03067733411159815
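The same chain runs offline if you swap a hand-made frame (hypothetical GDP levels, growing 3% a year by construction) in for the DataReader call:

```python
import pandas as pd

# stand-in for the FRED download (hypothetical levels, 3% growth per year)
fake_fred = pd.DataFrame({'GDPCA': [100.0, 103.0, 106.09]},
                         index=pd.to_datetime(['2016-01-01', '2017-01-01', '2018-01-01']))

avg_growth = (
    fake_fred
    # create the var with the same assign-lambda trick
    .assign(real_gdp_pct=lambda x: x['GDPCA'].pct_change())
    # grab the var and take its mean
    ['real_gdp_pct'].mean()
)
```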
# Q5:
import numpy as np

# v2.2: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var: pct change
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get the decade from the index (do this BIT BY BIT)
    .reset_index()  # turn the index into a variable
    # how to pull the year out of the date?
    # DATE is a datetime series; .dt is the way to access properties of the date
    .assign(decade = lambda x: np.floor(x.DATE.dt.year/10)*10)
    # "for each decade" = groupby!
    .groupby('decade')
    # take the mean
    ['real_gdp_pct'].mean()
)
decade
1960.0 0.047426
1970.0 0.032352
1980.0 0.031240
1990.0 0.032261
2000.0 0.019175
2010.0 0.022464
Name: real_gdp_pct, dtype: float64
Part 3¶
First, I’ll load unemployment, the Case-Shiller housing index, and median household income for three states (CA/MI/PA), keeping each year’s January observation.
# LOAD DATA AND CONVERT TO ANNUAL
start = 1990  # pandas datareader can infer these are years
end = 2018

macro_data = pdr.data.DataReader(['CAUR','MIUR','PAUR',  # unemployment
                                  'LXXRSA','DEXRSA','WDXRSA',  # Case-Shiller index in LA, Detroit, DC (no PA available!)
                                  'MEHOINUSCAA672N','MEHOINUSMIA672N','MEHOINUSPAA672N'],  # median household income
                                 'fred', start, end)
macro_data = macro_data.resample('Y').first()  # gets the first observation for each variable in a given year

# CLEAN UP THE FORMATTING SOMEWHAT
macro_data.index = macro_data.index.year

print("\n\n DATA BEFORE FORMATTING: \n\n")
print(macro_data[:20])  # see how the data looks now? ugly variable names, but it's an annual dataset at least
macro_data.columns=pd.MultiIndex.from_tuples([
('Unemployment','CA'),('Unemployment','MI'),('Unemployment','PA'),
('HouseIdx','CA'),('HouseIdx','MI'),('HouseIdx','PA'),
('MedIncome','CA'),('MedIncome','MI'),('MedIncome','PA')
])
print("\n\n DATA AFTER FORMATTING: \n\n")
print(macro_data[:20]) # this is a dataset that is "wide", and now
# the column variable names have 2 levels - var name,
# and unit/state that variable applies to
DATA BEFORE FORMATTING:
CAUR MIUR PAUR LXXRSA DEXRSA WDXRSA MEHOINUSCAA672N \
DATE
1990 5.2 7.7 5.2 100.471193 NaN 93.362855 64124.0
1991 7.1 8.8 6.5 95.569015 58.420806 89.706871 62568.0
1992 8.6 9.5 7.4 92.786926 59.748947 88.573807 63297.0
1993 9.8 7.6 7.2 85.246295 61.564205 89.065118 60272.0
1994 9.3 7.2 6.7 77.395052 64.526663 88.988467 61219.0
1995 7.8 5.2 5.7 76.376389 68.973289 89.670303 62618.0
1996 7.8 4.9 5.9 73.919989 73.582339 88.655369 63965.0
1997 6.9 4.7 5.3 74.536884 79.347642 88.775224 64035.0
1998 6.0 4.0 4.8 81.035037 85.183613 90.108149 65153.0
1999 5.6 4.0 4.5 92.140086 92.433567 94.188054 67994.0
2000 5.0 3.3 4.2 101.031209 100.145625 100.871832 70566.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 69267.0
2002 6.6 6.3 5.5 122.706507 111.468549 127.245701 68444.0
2003 6.9 6.5 5.8 145.752401 115.704198 146.676851 69553.0
2004 6.6 6.7 5.4 178.773750 119.772720 169.535303 67616.0
2005 5.8 7.2 5.2 221.471348 123.264823 210.799935 68766.0
2006 5.0 6.8 4.8 268.208801 126.872677 250.272523 71194.0
2007 4.9 7.0 4.4 270.804836 118.163888 241.411114 69750.0
2008 5.9 7.2 4.8 226.111844 100.378575 215.656550 68712.0
2009 9.9 10.9 7.0 167.831065 77.690839 174.183883 67888.0
MEHOINUSMIA672N MEHOINUSPAA672N
DATE
1990 57665.0 55870.0
1991 59693.0 56440.0
1992 58517.0 54191.0
1993 57776.0 54827.0
1994 61138.0 55562.0
1995 61632.0 58413.0
1996 64646.0 57516.0
1997 62499.0 60523.0
1998 66564.0 62098.0
1999 71828.0 58844.0
2000 68601.0 63573.0
2001 66020.0 63752.0
2002 61631.0 61318.0
2003 63517.0 60570.0
2004 58047.0 60588.0
2005 61031.0 61518.0
2006 62607.0 62388.0
2007 61785.0 60618.0
2008 60004.0 61949.0
2009 55625.0 58259.0
DATA AFTER FORMATTING:
Unemployment HouseIdx MedIncome \
CA MI PA CA MI PA CA
DATE
1990 5.2 7.7 5.2 100.471193 NaN 93.362855 64124.0
1991 7.1 8.8 6.5 95.569015 58.420806 89.706871 62568.0
1992 8.6 9.5 7.4 92.786926 59.748947 88.573807 63297.0
1993 9.8 7.6 7.2 85.246295 61.564205 89.065118 60272.0
1994 9.3 7.2 6.7 77.395052 64.526663 88.988467 61219.0
1995 7.8 5.2 5.7 76.376389 68.973289 89.670303 62618.0
1996 7.8 4.9 5.9 73.919989 73.582339 88.655369 63965.0
1997 6.9 4.7 5.3 74.536884 79.347642 88.775224 64035.0
1998 6.0 4.0 4.8 81.035037 85.183613 90.108149 65153.0
1999 5.6 4.0 4.5 92.140086 92.433567 94.188054 67994.0
2000 5.0 3.3 4.2 101.031209 100.145625 100.871832 70566.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 69267.0
2002 6.6 6.3 5.5 122.706507 111.468549 127.245701 68444.0
2003 6.9 6.5 5.8 145.752401 115.704198 146.676851 69553.0
2004 6.6 6.7 5.4 178.773750 119.772720 169.535303 67616.0
2005 5.8 7.2 5.2 221.471348 123.264823 210.799935 68766.0
2006 5.0 6.8 4.8 268.208801 126.872677 250.272523 71194.0
2007 4.9 7.0 4.4 270.804836 118.163888 241.411114 69750.0
2008 5.9 7.2 4.8 226.111844 100.378575 215.656550 68712.0
2009 9.9 10.9 7.0 167.831065 77.690839 174.183883 67888.0
MI PA
DATE
1990 57665.0 55870.0
1991 59693.0 56440.0
1992 58517.0 54191.0
1993 57776.0 54827.0
1994 61138.0 55562.0
1995 61632.0 58413.0
1996 64646.0 57516.0
1997 62499.0 60523.0
1998 66564.0 62098.0
1999 71828.0 58844.0
2000 68601.0 63573.0
2001 66020.0 63752.0
2002 61631.0 61318.0
2003 63517.0 60570.0
2004 58047.0 60588.0
2005 61031.0 61518.0
2006 62607.0 62388.0
2007 61785.0 60618.0
2008 60004.0 61949.0
2009 55625.0 58259.0
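The two formatting steps above (annualizing with resample, then attaching two-level column names) can be sketched on a toy monthly panel with hypothetical values:

```python
import pandas as pd
import numpy as np

# toy monthly panel for two states (hypothetical unemployment numbers)
idx = pd.date_range('1990-01-01', '1991-12-01', freq='MS')
df = pd.DataFrame({'CAUR': np.arange(24.0),
                   'MIUR': np.arange(24.0) + 100}, index=idx)
df.index.name = 'DATE'

annual = df.resample('YS').first()  # keep each year's first (January) observation
annual.index = annual.index.year    # replace datetimes with plain year numbers

# two-level columns: (variable name, state the variable applies to)
annual.columns = pd.MultiIndex.from_tuples([('Unemployment', 'CA'),
                                            ('Unemployment', 'MI')])
```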
Q6: for each decade and state, report the average annual CHANGE (level, not percent) in unemployment
Q7: for each decade and state, report the average annual PERCENT CHANGE in house prices and household income
# do your work here:
# let's pseudocode
# q6
# get decade variable
# get annual change/difference (level) in unemploy for each state
# average unemploy for each state within decade
# q7
# get decade variable
# get annual pct change in house price and income for each state
# average those for each state within decade
# HEY! those are similar - let's combine:
# get decade variable
# get annual change in unemploy for each state
# get annual pct change in house price and income for each state
# average unemploy for each state within decade
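One way to implement the combined pseudocode, sketched on a hand-made tall panel with hypothetical numbers. The key move: group by state before differencing, so changes never cross from one state's rows into another's:

```python
import pandas as pd

# toy tall panel: one row per (state, year), hypothetical values
tall = pd.DataFrame({
    'state':        ['CA'] * 4 + ['MI'] * 4,
    'DATE':         [1998, 1999, 2000, 2001] * 2,
    'Unemployment': [6.0, 5.6, 5.0, 4.8,   4.0, 4.0, 3.3, 4.4],
    'HouseIdx':     [81.0, 92.1, 101.0, 112.0,   85.2, 92.4, 100.1, 107.3],
})

out = (
    tall
    .assign(
        # get decade variable
        decade=lambda x: 10 * (x['DATE'] // 10),
        # annual level change in unemployment, within each state
        unemploy_diff=lambda x: x.groupby('state')['Unemployment'].diff(),
        # annual percent change in house prices, within each state
        HouseIdx_pctch=lambda x: x.groupby('state')['HouseIdx'].pct_change(),
    )
    # average within each (state, decade) cell
    .groupby(['state', 'decade'])
    [['unemploy_diff', 'HouseIdx_pctch']].mean()
)
```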
Error easter-egg hunt¶
This code is allllllllmost correct. It actually generates incorrect answers. If you figure out the problem and/or the solution, please submit an error fix via the “edit” button at the top of this page. HINT: ABCD
pd.set_option('display.float_format', '{:,.2f}'.format)
(
# reformat the data to tall:
macro_data.stack().swaplevel().sort_index().reset_index().rename(columns={'level_0':'state'})
# create vars <---- this is not even needed to explain this block!
.assign(
decade = lambda x: 10*np.floor(x['DATE']/10).astype(int),
unemploy_diff = lambda x: x['Unemployment'].diff(),
HouseIdx_pctch = lambda x: x['HouseIdx'].pct_change(),
MedIncome_pctch = lambda x: x['MedIncome'].pct_change()
)
# opt A for output:
.pivot_table(index='decade',
columns='state',
values=['unemploy_diff','HouseIdx_pctch','MedIncome_pctch'])
.multiply(100) # for more meaningful displays
# opt B for output + formatting (here, as percentages)
# .groupby(['state','decade'])
# [['unemploy_diff','HouseIdx_pctch','MedIncome_pctch']].mean()
# .multiply(100)
# # note about this: unemp isn't a % diff, but a p.p
# # so I make it explicit
# .style.format({'HouseIdx_pctch': '{0:,.2f}%',
# 'MedIncome_pctch': '{0:,.2f}%',
# 'unemploy_diff': '{0:,.2f} p.p.'})
)
                HouseIdx_pctch        MedIncome_pctch         unemploy_diff
state               CA    MI    PA        CA    MI    PA         CA     MI     PA
decade
1990             -0.71 -3.13 -2.06      0.69  0.22 -0.46       4.44  -4.00   0.00
2000              7.52 -1.20  7.22      0.01 -2.42 -0.03      43.00  69.00  25.00
2010              5.84  5.15  2.82      0.84  1.42  1.54     -61.11 -71.11 -24.44