Answers to Pandas Exercises¶
There are many ways to solve each of these!
Run each cell below in little incremental bits to really see how the code blocks work.
import pandas as pd
import pandas_datareader as pdr # IF NECESSARY, from terminal: pip install pandas_datareader
import datetime
import numpy as np
start = datetime.datetime(2017, 1, 1) # you can specify start and end dates this way
end = datetime.datetime(2021, 1, 27)
macro_df = pdr.data.DataReader(['GDP','CPIAUCSL','UNRATE'], 'fred', start, end)
Part 1¶
During class, I used this dataframe to go over Pandas vocab, and we showed how to:
- access one variable (note: pd calls this a "Series" object, which is a 1D object instead of a 2D object)
- access multiple variables
- access, print, and change column names
- access, print, reset, and set the index
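The vocab above can be sketched on a toy frame (hypothetical numbers standing in for the FRED download):

```python
import pandas as pd

# toy stand-in for macro_df (hypothetical numbers, not real FRED data)
df = pd.DataFrame({'GDP': [19153.9, 19322.9], 'UNRATE': [4.7, 4.5]},
                  index=pd.to_datetime(['2017-01-01', '2017-04-01']))
df.index.name = 'DATE'

one_var = df['GDP']               # a Series: 1D, one variable
two_vars = df[['GDP', 'UNRATE']]  # a DataFrame: note the double brackets

print(df.columns)                                # access/print column names
df = df.rename(columns={'UNRATE': 'UnempRate'})  # change a column name

df = df.reset_index()      # the index becomes a regular column...
df = df.set_index('DATE')  # ...and back again
```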
Questions:
Q0: Do each of the four new golden rules for initial data exploration, from the lecture.
Q1: What is the second series above?
Q2: What is the frequency of the series?
Q3: What is the average ANNUAL GDP, based on the data?
# do your work here
def insufficient_but_starting_eda(df, cat_vars_list=None):
    '''
    Parameters
    ----------
    df : DATAFRAME
    cat_vars_list : LIST, optional
        A list of strings containing variable names in the dataframe
        for variables where you want to see the number of unique values
        and the 10 most common values. Likely used for categorical variables.

    Returns
    -------
    None. It simply prints.

    Description
    -------
    This function will print a MINIMUM amount of info about a new dataframe.

    You should ****look**** at all this output below and consider the data
    exploration and cleaning questions from
    https://ledatascifi.github.io/ledatascifi-2022/content/03/02e_eda_golden.html#member

    Also LOOK at more of the data manually.

    Then write up anything notable you observe.

    TIP: put this function in your codebook to reuse easily.

    PROTIP: Improve this function (better outputs, better formatting).

    FEATURE REQUEST: optionally print the nunique and top 10 values under the describe matrix
    FEATURE REQUEST: optionally print more stats (percentiles)
    '''
    print(df.head(), '\n---')
    print(df.tail(), '\n---')
    print(df.columns, '\n---')
    print("The shape is: ", df.shape, '\n---')
    print("Info:", df.info(), '\n---')  # memory usage, name, dtype, and # of non-null obs (--> # of missing obs) per variable
    print(df.describe(), '\n---')  # summary stats, and you can customize the list!
    if cat_vars_list is not None:
        for var in cat_vars_list:
            print(var, "has", df[var].nunique(), "values and its top 10 most common are:")
            print(df[var].value_counts().head(10), '\n---')

insufficient_but_starting_eda(macro_df, ['UNRATE'])
GDP CPIAUCSL UNRATE
DATE
2017-01-01 19153.912 243.620 4.7
2017-02-01 NaN 243.872 4.6
2017-03-01 NaN 243.766 4.4
2017-04-01 19322.920 244.274 4.5
2017-05-01 NaN 244.069 4.4
---
GDP CPIAUCSL UNRATE
DATE
2020-09-01 NaN 260.149 7.8
2020-10-01 21477.597 260.462 6.9
2020-11-01 NaN 260.927 6.7
2020-12-01 NaN 261.560 6.7
2021-01-01 22038.226 262.231 6.3
---
Index(['GDP', 'CPIAUCSL', 'UNRATE'], dtype='object')
---
The shape is: (49, 3)
---
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49 entries, 2017-01-01 to 2021-01-01
Freq: MS
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GDP 17 non-null float64
1 CPIAUCSL 49 non-null float64
2 UNRATE 49 non-null float64
dtypes: float64(3)
memory usage: 1.5 KB
Info: None
---
GDP CPIAUCSL UNRATE
count 17.000000 49.000000 49.000000
mean 20654.744824 252.878469 5.034694
std 913.690696 5.557894 2.524179
min 19153.912000 243.620000 3.500000
25% 19882.965000 248.721000 3.800000
50% 20813.325000 252.899000 4.000000
75% 21477.597000 257.387000 4.500000
max 22038.226000 262.231000 14.800000
---
UNRATE has 22 values and its top 10 most common are:
3.8 7
3.6 5
4.0 5
4.4 4
3.7 4
4.1 3
3.5 3
6.7 2
4.3 2
4.2 2
Name: UNRATE, dtype: int64
---
Some answers¶
Q0: What have we learned about the data? Anything to keep track of? (The dataset is monthly, but GDP is only reported quarterly)
Q1: Inflation (CPI)
Q2: Monthly, but GDP is only quarterly
Q3: About 20,630 (in billions of dollars, i.e., roughly $20.6 trillion)
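For Q3, one recipe: resample the monthly frame to annual frequency, average the GDP observations within each year, then average across the years. A sketch on made-up data mirroring macro_df's structure (monthly index, GDP reported only quarterly):

```python
import pandas as pd
import numpy as np

# monthly index with GDP reported only every third month (hypothetical values)
idx = pd.date_range('2017-01-01', '2018-12-01', freq='MS')
gdp = pd.Series(np.nan, index=idx)
gdp.iloc[::3] = [19000, 19100, 19200, 19300,   # 2017's four quarterly obs
                 19400, 19500, 19600, 19700]   # 2018's four quarterly obs

annual_avg = gdp.resample('YS').mean()  # mean of each year's non-missing obs
overall = annual_avg.mean()             # average ANNUAL GDP
```

On the real macro_df, the same pattern applies to the GDP column.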
Part 2¶
Q4: Download annual real GDP from 1960 to 2018 from FRED and compute the average annual percent change
Q5: Compute the average GDP percent change within each decade
# do your work here
# v1:
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['real_gdp_pct'] = part2_df['GDPCA'].pct_change()
part2_df['real_gdp_pct'].mean()
# prof notes: after students present, go through, bit by bit
# then add comments (a la pseudocode),
# reiterate accessing a var: df[var], and how methods apply to the object: df[var].func()
0.03067733411159815
# v1.1: chain the last 2 lines together
part2_df = pdr.data.DataReader(['GDPCA'], 'fred', 1960, 2018)
part2_df['GDPCA'].pct_change().mean()
0.03067733411159815
# v1.2 that last line could be:
(part2_df
.pct_change()
.mean()
)
# explain reasons for chaining: readable, no temp objects
# breaking this up into 3 lines is silly but shows chaining over multiple lines
# how can we add the first line (the DL) into this? (next block)
GDPCA 0.030677
dtype: float64
# v2.0: chaining - but involves a lambda function workaround
# don't cover until we go over lambdas
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create the var
    # syntax is: .assign(newname = fcn(df[var]))
    # try this:
    #     .assign(sillyvar=1)
    # but this df doesn't have a name! how do we access it?
    # trick: a "lambda" function.
    # here x is the object that assign is working on,
    # meaning whatever is produced right before .assign,
    # which is just the **df** we DLed on the line above
    #
    # this is VERY COMMON in pandas chaining:
    # .assign(newvarname = lambda <tempnameforpriorobj>: <do stuff to tempnameforpriorobj>)
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # grab the var and take its mean
    ['real_gdp_pct'].mean()
)
0.03067733411159815
# v2.1: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get mean value
    ['real_gdp_pct'].mean()
)
0.03067733411159815
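The same chain runs offline if you swap a hand-made frame (hypothetical GDP levels, growing 3% a year by construction) in for the DataReader call:

```python
import pandas as pd

# stand-in for the FRED download (hypothetical levels, 3% growth per year)
fake_fred = pd.DataFrame({'GDPCA': [100.0, 103.0, 106.09]},
                         index=pd.to_datetime(['2016-01-01', '2017-01-01', '2018-01-01']))

avg_growth = (
    fake_fred
    # create the var with the same assign-lambda trick
    .assign(real_gdp_pct=lambda x: x['GDPCA'].pct_change())
    # grab the var and take its mean
    ['real_gdp_pct'].mean()
)
```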
# Q5:
import numpy as np

# v2.2: chaining - clean
(
    # DL data
    pdr.data.DataReader(['GDPCA','GDPA'], 'fred', 1960, 2018)
    # create var: pct change
    .assign(real_gdp_pct = lambda x: x['GDPCA'].pct_change())
    # get the decade from the index (do this BIT BY BIT)
    .reset_index()  # turn the index into a variable
    # how to pull the year out of the date?
    # DATE is a datetime series; .dt is the way to access properties of the date
    .assign(decade = lambda x: np.floor(x.DATE.dt.year/10)*10)
    # "for each decade" = groupby!
    .groupby('decade')
    # take the mean
    ['real_gdp_pct'].mean()
)
decade
1960.0 0.047426
1970.0 0.032352
1980.0 0.031240
1990.0 0.032261
2000.0 0.019175
2010.0 0.022464
Name: real_gdp_pct, dtype: float64
Part 3¶
First, I’ll load unemployment, the Case-Shiller housing index, and median household income for three states (CA/MI/PA), keeping each year’s January observation.
# LOAD DATA AND CONVERT TO ANNUAL
start = 1990  # pandas datareader can infer these are years
end = 2018

macro_data = pdr.data.DataReader(['CAUR','MIUR','PAUR',  # unemployment
                                  'LXXRSA','DEXRSA','WDXRSA',  # Case-Shiller index in LA, Detroit, DC (no PA available!)
                                  'MEHOINUSCAA672N','MEHOINUSMIA672N','MEHOINUSPAA672N'],  # median household income
                                 'fred', start, end)
macro_data = macro_data.resample('Y').first()  # gets the first observation for each variable in a given year

# CLEAN UP THE FORMATTING SOMEWHAT
macro_data.index = macro_data.index.year

print("\n\n DATA BEFORE FORMATTING: \n\n")
print(macro_data[:20])  # see how the data looks now? ugly variable names, but it's an annual dataset at least
macro_data.columns=pd.MultiIndex.from_tuples([
('Unemployment','CA'),('Unemployment','MI'),('Unemployment','PA'),
('HouseIdx','CA'),('HouseIdx','MI'),('HouseIdx','PA'),
('MedIncome','CA'),('MedIncome','MI'),('MedIncome','PA')
])
print("\n\n DATA AFTER FORMATTING: \n\n")
print(macro_data[:20]) # this is a dataset that is "wide", and now
# the column variable names have 2 levels - var name,
# and unit/state that variable applies to
DATA BEFORE FORMATTING:
CAUR MIUR PAUR LXXRSA DEXRSA WDXRSA MEHOINUSCAA672N \
DATE
1990 5.2 7.7 5.2 100.471193 NaN 93.362855 64124.0
1991 7.1 8.8 6.5 95.569015 58.420806 89.706871 62568.0
1992 8.6 9.5 7.4 92.786926 59.748947 88.573807 63297.0
1993 9.8 7.6 7.2 85.246295 61.564205 89.065118 60272.0
1994 9.3 7.2 6.7 77.395052 64.526663 88.988467 61219.0
1995 7.8 5.2 5.7 76.376389 68.973289 89.670303 62618.0
1996 7.8 4.9 5.9 73.919989 73.582339 88.655369 63965.0
1997 6.9 4.7 5.3 74.536884 79.347642 88.775224 64035.0
1998 6.0 4.0 4.8 81.035037 85.183613 90.108149 65153.0
1999 5.6 4.0 4.5 92.140086 92.433567 94.188054 67994.0
2000 5.0 3.3 4.2 101.031209 100.145625 100.871832 70566.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 69267.0
2002 6.6 6.3 5.5 122.706507 111.468549 127.245701 68444.0
2003 6.9 6.5 5.8 145.752401 115.704198 146.676851 69553.0
2004 6.6 6.7 5.4 178.773750 119.772720 169.535303 67616.0
2005 5.8 7.2 5.2 221.471348 123.264823 210.799935 68766.0
2006 5.0 6.8 4.8 268.208801 126.872677 250.272523 71194.0
2007 4.9 7.0 4.4 270.804836 118.163888 241.411114 69750.0
2008 5.9 7.2 4.8 226.111844 100.378575 215.656550 68712.0
2009 9.9 10.9 7.0 167.831065 77.690839 174.183883 67888.0
MEHOINUSMIA672N MEHOINUSPAA672N
DATE
1990 57665.0 55870.0
1991 59693.0 56440.0
1992 58517.0 54191.0
1993 57776.0 54827.0
1994 61138.0 55562.0
1995 61632.0 58413.0
1996 64646.0 57516.0
1997 62499.0 60523.0
1998 66564.0 62098.0
1999 71828.0 58844.0
2000 68601.0 63573.0
2001 66020.0 63752.0
2002 61631.0 61318.0
2003 63517.0 60570.0
2004 58047.0 60588.0
2005 61031.0 61518.0
2006 62607.0 62388.0
2007 61785.0 60618.0
2008 60004.0 61949.0
2009 55625.0 58259.0
DATA AFTER FORMATTING:
Unemployment HouseIdx MedIncome \
CA MI PA CA MI PA CA
DATE
1990 5.2 7.7 5.2 100.471193 NaN 93.362855 64124.0
1991 7.1 8.8 6.5 95.569015 58.420806 89.706871 62568.0
1992 8.6 9.5 7.4 92.786926 59.748947 88.573807 63297.0
1993 9.8 7.6 7.2 85.246295 61.564205 89.065118 60272.0
1994 9.3 7.2 6.7 77.395052 64.526663 88.988467 61219.0
1995 7.8 5.2 5.7 76.376389 68.973289 89.670303 62618.0
1996 7.8 4.9 5.9 73.919989 73.582339 88.655369 63965.0
1997 6.9 4.7 5.3 74.536884 79.347642 88.775224 64035.0
1998 6.0 4.0 4.8 81.035037 85.183613 90.108149 65153.0
1999 5.6 4.0 4.5 92.140086 92.433567 94.188054 67994.0
2000 5.0 3.3 4.2 101.031209 100.145625 100.871832 70566.0
2001 4.8 4.4 4.3 112.031476 107.327208 113.794411 69267.0
2002 6.6 6.3 5.5 122.706507 111.468549 127.245701 68444.0
2003 6.9 6.5 5.8 145.752401 115.704198 146.676851 69553.0
2004 6.6 6.7 5.4 178.773750 119.772720 169.535303 67616.0
2005 5.8 7.2 5.2 221.471348 123.264823 210.799935 68766.0
2006 5.0 6.8 4.8 268.208801 126.872677 250.272523 71194.0
2007 4.9 7.0 4.4 270.804836 118.163888 241.411114 69750.0
2008 5.9 7.2 4.8 226.111844 100.378575 215.656550 68712.0
2009 9.9 10.9 7.0 167.831065 77.690839 174.183883 67888.0
MI PA
DATE
1990 57665.0 55870.0
1991 59693.0 56440.0
1992 58517.0 54191.0
1993 57776.0 54827.0
1994 61138.0 55562.0
1995 61632.0 58413.0
1996 64646.0 57516.0
1997 62499.0 60523.0
1998 66564.0 62098.0
1999 71828.0 58844.0
2000 68601.0 63573.0
2001 66020.0 63752.0
2002 61631.0 61318.0
2003 63517.0 60570.0
2004 58047.0 60588.0
2005 61031.0 61518.0
2006 62607.0 62388.0
2007 61785.0 60618.0
2008 60004.0 61949.0
2009 55625.0 58259.0
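The two formatting steps above (annualizing with resample, then attaching two-level column names) can be sketched on a toy monthly panel with hypothetical values:

```python
import pandas as pd
import numpy as np

# toy monthly panel for two states (hypothetical unemployment numbers)
idx = pd.date_range('1990-01-01', '1991-12-01', freq='MS')
df = pd.DataFrame({'CAUR': np.arange(24.0),
                   'MIUR': np.arange(24.0) + 100}, index=idx)
df.index.name = 'DATE'

annual = df.resample('YS').first()  # keep each year's first (January) observation
annual.index = annual.index.year    # replace datetimes with plain year numbers

# two-level columns: (variable name, state the variable applies to)
annual.columns = pd.MultiIndex.from_tuples([('Unemployment', 'CA'),
                                            ('Unemployment', 'MI')])
```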
Q6: for each decade and state, report the average annual CHANGE (level, not percent) in unemployment
Q7: for each decade and state, report the average annual PERCENT CHANGE in house prices and household income
# do your work here:
# let's pseudocode
# q6
# get decade variable
# get annual change/difference (level) in unemploy for each state
# average unemploy for each state within decade
# q7
# get decade variable
# get annual pct change in house price and income for each state
# average those for each state within decade
# HEY! those are similar - let's combine:
# get decade variable
# get annual change in unemploy for each state
# get annual pct change in house price and income for each state
# average unemploy for each state within decade
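One way to implement the combined pseudocode, sketched on a hand-made tall panel with hypothetical numbers. The key move: group by state before differencing, so changes never cross from one state's rows into another's:

```python
import pandas as pd

# toy tall panel: one row per (state, year), hypothetical values
tall = pd.DataFrame({
    'state':        ['CA'] * 4 + ['MI'] * 4,
    'DATE':         [1998, 1999, 2000, 2001] * 2,
    'Unemployment': [6.0, 5.6, 5.0, 4.8,   4.0, 4.0, 3.3, 4.4],
    'HouseIdx':     [81.0, 92.1, 101.0, 112.0,   85.2, 92.4, 100.1, 107.3],
})

out = (
    tall
    .assign(
        # get decade variable
        decade=lambda x: 10 * (x['DATE'] // 10),
        # annual level change in unemployment, within each state
        unemploy_diff=lambda x: x.groupby('state')['Unemployment'].diff(),
        # annual percent change in house prices, within each state
        HouseIdx_pctch=lambda x: x.groupby('state')['HouseIdx'].pct_change(),
    )
    # average within each (state, decade) cell
    .groupby(['state', 'decade'])
    [['unemploy_diff', 'HouseIdx_pctch']].mean()
)
```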
Error easter-egg hunt¶
This code is allllllllmost correct. It actually generates incorrect answers. If you figure out the problem and/or the solution, please submit an error fix via the “edit” button at the top of this page. HINT: ABCD
pd.set_option('display.float_format', '{:,.2f}'.format)
(
# reformat the data to tall:
macro_data.stack().swaplevel().sort_index().reset_index().rename(columns={'level_0':'state'})
# create vars <---- this is not even needed to explain this block!
.assign(
decade = lambda x: 10*np.floor(x['DATE']/10).astype(int),
unemploy_diff = lambda x: x['Unemployment'].diff(),
HouseIdx_pctch = lambda x: x['HouseIdx'].pct_change(),
MedIncome_pctch = lambda x: x['MedIncome'].pct_change()
)
# opt A for output:
.pivot_table(index='decade',
columns='state',
values=['unemploy_diff','HouseIdx_pctch','MedIncome_pctch'])
.multiply(100) # for more meaningful displays
# opt B for output + formatting (here, as percentages)
# .groupby(['state','decade'])
# [['unemploy_diff','HouseIdx_pctch','MedIncome_pctch']].mean()
# .multiply(100)
# # note about this: unemp isn't a % diff, but a p.p
# # so I make it explicit
# .style.format({'HouseIdx_pctch': '{0:,.2f}%',
# 'MedIncome_pctch': '{0:,.2f}%',
# 'unemploy_diff': '{0:,.2f} p.p.'})
)
                HouseIdx_pctch        MedIncome_pctch         unemploy_diff
state               CA    MI    PA        CA    MI    PA         CA     MI     PA
decade
1990             -0.71 -3.13 -2.06      0.69  0.22 -0.46       4.44  -4.00   0.00
2000              7.52 -1.20  7.22      0.01 -2.42 -0.03      43.00  69.00  25.00
2010              5.84  5.15  2.82      0.84  1.42  1.54     -61.11 -71.11 -24.44