Answers to Pandas Exercises¶
There are many ways to solve each of these!
Run each cell below in little incremental bits to really see how the code blocks work.
import pandas as pd
import pandas_datareader as pdr # IF NECESSARY, from terminal: pip install pandas_datareader
import datetime
import numpy as np
start = datetime.datetime(2017, 1, 1) # you can specify start and end dates this way
end = datetime.datetime(2021, 1, 27)
macro_df = pdr.data.DataReader(['GDP','CPIAUCSL','UNRATE'], 'fred', start, end)
Part 1¶
During class, I used this dataframe to go over Pandas vocab, and we showed how to:

- access one variable (note: pd calls this a "Series" object, which is a 1D object instead of a 2D object)
- access multiple variables
- access, print, and change column names
- access, print, reset, and set the index
Questions:
Q0: Do each of the four new golden rules for initial data exploration, from the lecture.
Q1: What is the second series above?
Q2: What is the frequency of the series?
Q3: What is the average ANNUAL GDP, based on the data?
# do your work here
def insufficient_but_starting_eda(df, cat_vars_list=None):
    '''
    Parameters
    ----------
    df : DATAFRAME
    cat_vars_list : LIST, optional
        A list of strings containing variable names in the dataframe
        for variables where you want to see the number of unique values
        and the 10 most common values. Likely used for categorical variables.

    Returns
    -------
    None. It simply prints.

    Description
    -------
    This function will print a MINIMUM amount of info about a new dataframe.
    You should ****look**** at all this output below and consider the data
    exploration and cleaning questions from
    https://ledatascifi.github.io/ledatascifi-2021/content/03/02e_eda_golden.html#member
    Also LOOK at more of the data manually.
    Then write up anything notable you observe.

    TIP: put this function in your codebook to reuse easily.
    PROTIP: improve this function (better outputs, better formatting).
    FEATURE REQUEST: optionally print the nunique and top 10 values under the describe matrix
    FEATURE REQUEST: optionally print more stats (percentiles)
    '''
    print(df.head(), '\n---')
    print(df.tail(), '\n---')
    print(df.columns, '\n---')
    print("The shape is: ", df.shape, '\n---')
    print("Info:", df.info(), '\n---') # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
    print(df.describe(), '\n---')      # summary stats, and you can customize the list!
    if cat_vars_list is not None:
        for var in cat_vars_list:
            print(var, "has", df[var].nunique(), "values and its top 10 most common are:")
            print(df[var].value_counts().head(10), '\n---')

insufficient_but_starting_eda(macro_df, ['UNRATE'])
GDP CPIAUCSL UNRATE
DATE
2017-01-01 19237.435 243.620 4.7
2017-02-01 NaN 243.872 4.6
2017-03-01 NaN 243.766 4.4
2017-04-01 19379.232 244.274 4.5
2017-05-01 NaN 244.069 4.4
---
GDP CPIAUCSL UNRATE
DATE
2020-09-01 NaN 260.149 7.8
2020-10-01 21479.529 260.462 6.9
2020-11-01 NaN 260.927 6.7
2020-12-01 NaN 261.560 6.7
2021-01-01 NaN 262.231 6.3
---
Index(['GDP', 'CPIAUCSL', 'UNRATE'], dtype='object')
---
The shape is: (49, 3)
---
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49 entries, 2017-01-01 to 2021-01-01
Freq: MS
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GDP 16 non-null float64
1 CPIAUCSL 49 non-null float64
2 UNRATE 49 non-null float64
dtypes: float64(3)
memory usage: 1.5 KB
Info: None
---
GDP CPIAUCSL UNRATE
count 16.000000 49.000000 49.000000
mean 20630.206313 252.878469 5.034694
std 862.963655 5.557894 2.524179
min 19237.435000 243.620000 3.500000
25% 19857.794250 248.721000 3.800000
50% 20826.288000 252.899000 4.000000
75% 21367.290000 257.387000 4.500000
max 21747.394000 262.231000 14.800000
---
UNRATE has 22 values and its top 10 most common are:
3.8 7
4.0 5
3.6 5
4.4 4
3.7 4
3.5 3
4.1 3
4.2 2
4.3 2
6.7 2
Name: UNRATE, dtype: int64
---
Some answers¶
Q0: What have we learned about the data? Anything to keep track of? (The dataset is monthly; GDP is only reported quarterly, so it has many missing values)
Q1: Inflation (CPI)
Q2: Monthly, but GDP is only available quarterly
Q3: About $20,630 billion (roughly $20.6 trillion)
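FRED's GDP series is reported in billions of dollars at a seasonally adjusted annual rate, so the Q3 answer is just the mean of the non-missing quarterly observations. A minimal sketch of that calculation with a hypothetical mini version of macro_df (the GDP values here are made up):

```python
import numpy as np
import pandas as pd

# hypothetical mini version of macro_df: a monthly index where GDP
# only shows up in the first month of each quarter (as FRED delivers it)
idx = pd.date_range("2017-01-01", "2018-12-01", freq="MS")
gdp = pd.Series(np.nan, index=idx, name="GDP")
gdp[idx.month.isin([1, 4, 7, 10])] = list(range(100, 108))  # 8 quarterly obs

# .mean() skips NaN by default, so this is the average of the
# quarterly (annualized) GDP figures -- the Q3 calculation
avg_gdp = gdp.mean()
print(avg_gdp)  # 103.5
```

The same `.mean()` on the real macro_df's GDP column produces the 20,630 figure above.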
Part 2¶
Q4: Download the annual real gdp from 1960 to 2018 from FRED and compute the average annual percent change
Q5: Compute the average gdp percent change within each decade
# do your work here
# v1:
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['real_gdp_pct'] = part2_df['GDPCA'].pct_change()
part2_df['real_gdp_pct'].mean()
# prof notes: after students present, go through, bit by bit
# then add comments (a la pseudocode),
# reiter access var: df[var], how methods apply to obj df[var].func()
0.030766069353016783
# v1.1: chain the last 2 lines together
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['GDPCA'].pct_change().mean()
0.030766069353016783
# v1.2: that last line could be:
(part2_df
 .pct_change()
 .mean()
)
# explain reasons for chaining: readable, no temp objects
# breaking this up into 3 lines is silly, but it shows chaining over multiple lines
# how can we add the first line (the DL) into this? (next block)
GDPCA 0.030766
dtype: float64
# v2.0: chaining - but involves a lambda function workaround
# don't cover until we go over lambdas
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create the var
    # syntax is: .assign(newname = fcn(df[var]))
    # try this:
    # .assign(sillyvar=1)
    # but this df doesn't have a name! how do we access it?
    # trick: a "lambda" function.
    # here x is the object that .assign is working on,
    # meaning whatever is produced right before .assign,
    # which is just the **df** we DLed on the line above
    #
    # this is VERY COMMON in pandas chaining:
    # .assign(newvarname = lambda <tempnameforpriorobj>: <do stuff to tempnameforpriorobj>)
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # grab the var and take its mean
    ['real_gdp_pct'].mean()
)
0.030766069353016783
# v2.1: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get mean value
    ['real_gdp_pct'].mean()
)
0.030766069353016783
# Q5:
import numpy as np

# v2.1: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var: pct change
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get the decade from the index (do this BIT BY BIT)
    .reset_index() # turn the index into a variable
    # how to pull the year out of the date?
    # DATE is a datetime series; .dt is a way to access properties of the date
    .assign(decade = lambda x: np.floor(x.DATE.dt.year/10)*10)
    # "for each decade" = groupby!
    .groupby('decade')
    # take the mean
    ['real_gdp_pct'].mean()
)
decade
1960.0 0.047426
1970.0 0.032352
1980.0 0.031240
1990.0 0.032220
2000.0 0.019099
2010.0 0.023165
Name: real_gdp_pct, dtype: float64
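A variant of the Q5 answer: the reset_index() / dt.year detour can be skipped by grouping on an array computed directly from the DatetimeIndex. A minimal sketch with a hypothetical stand-in for the FRED download (the GDPCA values are made up):

```python
import pandas as pd

# hypothetical stand-in for the FRED download: a few annual GDP levels
df = pd.DataFrame(
    {"GDPCA": [100.0, 103.0, 106.09, 110.0, 113.3]},
    index=pd.to_datetime(["1968-01-01", "1969-01-01", "1970-01-01",
                          "1971-01-01", "1972-01-01"]),
)
df.index.name = "DATE"

# integer division on the index's year gives the decade directly,
# so no reset_index() is needed
decade_means = (
    df.assign(real_gdp_pct=lambda x: x["GDPCA"].pct_change())
      .groupby(df.index.year // 10 * 10)["real_gdp_pct"]
      .mean()
)
print(decade_means)
```

Grouping by an array (here, the decade of each row) works because its length matches the dataframe; `np.floor(year/10)*10` and `year // 10 * 10` give the same decades, but the latter stays integer-typed.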
Part 3¶
First, I’ll load January data on unemployment rates, Case-Shiller house price indexes, and real median household income for three states (CA/MI/PA). No Case-Shiller index exists for a PA city, so I use LA, Detroit, and DC as stand-ins.
# LOAD DATA AND CONVERT TO ANNUAL
start = 1990 # pandas datareader can infer these are years
end = 2018
macro_data = pdr.data.DataReader(['CAUR','MIUR','PAUR', # unemployment rates
                                  'LXXRSA','DEXRSA','WDXRSA', # Case-Shiller index in LA, Detroit, DC (no PA available!)
                                  'MEHOINUSCAA672N','MEHOINUSMIA672N','MEHOINUSPAA672N'], # real median household income
                                 'fred', start, end)
macro_data = macro_data.resample('Y').first() # gets the first observation of each variable in a given year
# CLEAN UP THE FORMATTING SOMEWHAT
macro_data.index = macro_data.index.year
print("\n\n DATA BEFORE FORMATTING: \n\n")
print(macro_data[:20]) # see how the data looks now? ugly variable names, but it's an annual dataset at least
macro_data.columns = pd.MultiIndex.from_tuples([
    ('Unemployment','CA'),('Unemployment','MI'),('Unemployment','PA'),
    ('HouseIdx','CA'),('HouseIdx','MI'),('HouseIdx','PA'),
    ('MedIncome','CA'),('MedIncome','MI'),('MedIncome','PA')
])
print("\n\n DATA AFTER FORMATTING: \n\n")
print(macro_data[:20]) # this is a dataset that is "wide", and now
                       # the column variable names have 2 levels: the var name,
                       # and the unit/state that variable applies to
DATA BEFORE FORMATTING:
CAUR MIUR PAUR LXXRSA DEXRSA WDXRSA MEHOINUSCAA672N \
DATE
1990 5.2 7.8 5.1 100.471193 NaN 93.362855 63333.0
1991 7.1 8.8 6.6 95.569015 58.420806 89.706871 61797.0
1992 8.5 9.4 7.5 92.786926 59.748947 88.573807 62517.0
1993 9.7 7.6 7.3 85.246295 61.564205 89.065118 59529.0
1994 9.2 7.1 6.6 77.395052 64.526663 88.988467 60464.0
1995 7.7 5.2 5.8 76.376389 68.973289 89.670303 61846.0
1996 7.7 4.9 5.9 73.919989 73.582339 88.655369 63176.0
1997 6.8 4.7 5.3 74.536884 79.347642 88.775224 63245.0
1998 6.0 4.0 4.7 81.035037 85.183613 90.108149 64349.0
1999 5.6 3.9 4.5 92.140086 92.433567 94.188054 67156.0
2000 5.0 3.3 4.1 101.031209 100.145625 100.871832 69696.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 68413.0
2002 6.5 6.4 5.5 122.706507 111.468549 127.245701 67600.0
2003 6.8 6.6 5.8 145.752401 115.704198 146.676851 68695.0
2004 6.5 6.8 5.5 178.773752 119.772720 169.535303 66782.0
2005 5.8 7.2 5.2 221.471364 123.264823 210.799937 67918.0
2006 5.0 6.8 4.7 268.208844 126.872678 250.272528 70316.0
2007 4.9 6.9 4.3 270.804920 118.163891 241.411126 68890.0
2008 6.0 7.1 4.7 226.111966 100.378575 215.656581 67865.0
2009 9.7 10.9 7.0 167.831216 77.690858 174.183926 67051.0
MEHOINUSMIA672N MEHOINUSPAA672N
DATE
1990 56954.0 55181.0
1991 58957.0 55744.0
1992 57795.0 53523.0
1993 57064.0 54151.0
1994 60384.0 54877.0
1995 60872.0 57693.0
1996 63849.0 56807.0
1997 61728.0 59776.0
1998 65744.0 61333.0
1999 70942.0 58119.0
2000 67755.0 62789.0
2001 65206.0 62966.0
2002 60871.0 60562.0
2003 62734.0 59823.0
2004 57331.0 59841.0
2005 60278.0 60760.0
2006 61835.0 61619.0
2007 61024.0 59870.0
2008 59264.0 61185.0
2009 54939.0 57540.0
DATA AFTER FORMATTING:
Unemployment HouseIdx MedIncome \
CA MI PA CA MI PA CA
DATE
1990 5.2 7.8 5.1 100.471193 NaN 93.362855 63333.0
1991 7.1 8.8 6.6 95.569015 58.420806 89.706871 61797.0
1992 8.5 9.4 7.5 92.786926 59.748947 88.573807 62517.0
1993 9.7 7.6 7.3 85.246295 61.564205 89.065118 59529.0
1994 9.2 7.1 6.6 77.395052 64.526663 88.988467 60464.0
1995 7.7 5.2 5.8 76.376389 68.973289 89.670303 61846.0
1996 7.7 4.9 5.9 73.919989 73.582339 88.655369 63176.0
1997 6.8 4.7 5.3 74.536884 79.347642 88.775224 63245.0
1998 6.0 4.0 4.7 81.035037 85.183613 90.108149 64349.0
1999 5.6 3.9 4.5 92.140086 92.433567 94.188054 67156.0
2000 5.0 3.3 4.1 101.031209 100.145625 100.871832 69696.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 68413.0
2002 6.5 6.4 5.5 122.706507 111.468549 127.245701 67600.0
2003 6.8 6.6 5.8 145.752401 115.704198 146.676851 68695.0
2004 6.5 6.8 5.5 178.773752 119.772720 169.535303 66782.0
2005 5.8 7.2 5.2 221.471364 123.264823 210.799937 67918.0
2006 5.0 6.8 4.7 268.208844 126.872678 250.272528 70316.0
2007 4.9 6.9 4.3 270.804920 118.163891 241.411126 68890.0
2008 6.0 7.1 4.7 226.111966 100.378575 215.656581 67865.0
2009 9.7 10.9 7.0 167.831216 77.690858 174.183926 67051.0
MI PA
DATE
1990 56954.0 55181.0
1991 58957.0 55744.0
1992 57795.0 53523.0
1993 57064.0 54151.0
1994 60384.0 54877.0
1995 60872.0 57693.0
1996 63849.0 56807.0
1997 61728.0 59776.0
1998 65744.0 61333.0
1999 70942.0 58119.0
2000 67755.0 62789.0
2001 65206.0 62966.0
2002 60871.0 60562.0
2003 62734.0 59823.0
2004 57331.0 59841.0
2005 60278.0 60760.0
2006 61835.0 61619.0
2007 61024.0 59870.0
2008 59264.0 61185.0
2009 54939.0 57540.0
Q6: for each decade and state, report the average annual CHANGE (level, not percent) in unemployment
Q7: for each decade and state, report the average annual PERCENT CHANGE in house prices and household income
# do your work here:
# let's pseudocode
# q6
# get decade variable
# get annual change/difference (level) in unemploy for each state
# average unemploy for each state within decade
# q7
# get decade variable
# get annual pct change in house price and income for each state
# average those for each state within decade
# HEY! those are similar - let's combine:
# get decade variable
# get annual change in unemploy for each state
# get annual pct change in house price and income for each state
# average unemploy for each state within decade
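The combined pseudocode could be sketched on a tiny hand-made tall dataframe like this (all numbers hypothetical, only two variables kept for brevity). The subtle point is that the diffs and percent changes are computed within each state via groupby, so the first row of one state is never differenced against the last row of the previous state:

```python
import pandas as pd

# hand-made tall data, one row per (state, year); all numbers hypothetical
tall = pd.DataFrame({
    "state":        ["CA"] * 3 + ["MI"] * 3,
    "DATE":         [1991, 1992, 1993] * 2,
    "Unemployment": [7.1, 8.5, 9.7, 8.8, 9.4, 7.6],
    "HouseIdx":     [95.0, 92.0, 85.0, 58.0, 59.0, 61.0],
})

out = (
    tall
    .assign(
        # get decade variable
        decade = lambda x: 10 * (x["DATE"] // 10),
        # compute changes WITHIN each state, so the first MI row is
        # never differenced against the last CA row
        unemploy_diff  = lambda x: x.groupby("state")["Unemployment"].diff(),
        HouseIdx_pctch = lambda x: x.groupby("state")["HouseIdx"].pct_change(),
    )
    # average within each (state, decade)
    .groupby(["state", "decade"])[["unemploy_diff", "HouseIdx_pctch"]]
    .mean()
)
print(out)
```

With real data you would add the MedIncome percent change the same way; the structure of the pipeline doesn't change.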
Error easter-egg hunt¶
This code is allllllllmost correct. It actually generates incorrect answers. If you figure out the problem and/or the solution, please submit an error fix via the “edit” button at the top of this page.
pd.set_option('display.float_format', '{:,.2f}'.format)
(
    # reformat the data to tall:
    macro_data.stack().swaplevel().sort_index().reset_index().rename(columns={'level_0':'state'})
    # create vars <---- this is not even needed to explain this block!
    .assign(
        decade          = lambda x: 10*np.floor(x['DATE']/10).astype(int),
        unemploy_diff   = lambda x: x['Unemployment'].diff(),
        HouseIdx_pctch  = lambda x: x['HouseIdx'].pct_change(),
        MedIncome_pctch = lambda x: x['MedIncome'].pct_change()
    )
    # opt A for output:
    .pivot_table(index='decade',
                 columns='state',
                 values=['unemploy_diff','HouseIdx_pctch','MedIncome_pctch'])
    .multiply(100) # for more meaningful displays
    # opt B for output + formatting (here, as percentages)
    # .groupby(['state','decade'])
    # [['unemploy_diff','HouseIdx_pctch','MedIncome_pctch']].mean()
    # .multiply(100)
    # # note about this: unemploy_diff isn't a % change, but a change in percentage points,
    # # so I make that explicit
    # .style.format({'HouseIdx_pctch': '{0:,.2f}%',
    #                'MedIncome_pctch': '{0:,.2f}%',
    #                'unemploy_diff': '{0:,.2f} p.p.'})
)
       HouseIdx_pctch             MedIncome_pctch             unemploy_diff
state              CA    MI    PA              CA    MI    PA            CA     MI     PA
decade
1990            -0.71 -3.13 -2.07            0.69  0.22 -0.46          4.44  -5.00   0.00
2000             7.52 -1.20  7.22            0.01 -2.42 -0.03         41.00  70.00  25.00
2010             5.84  5.16  2.83            0.84  1.42  1.54        -58.89 -71.11 -26.67