3.2.3. Common Functions/Methods

Note

Some pandas methods are called on a dataframe, like df.assign(feet=df['height']//12). These methods take a dataframe and return a transformed version of it, and you use them like this: <dfname>.<method>(<arguments>).

Note: Delete the < and > after you type in the dataframe name, method, and arguments. Those are just indicating the text inside them is a placeholder.

Some pandas functions are called on the pandas module itself (e.g. pd.merge). These functions do tasks that are not tied to a single dataframe (like loading or merging datasets), and you use them like this: pd.<function>(<arguments>).
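To make the two calling styles concrete, here is a minimal sketch. The dataframes `df` and `ages` and their contents are made up for illustration; only `assign` and `pd.merge` come from the text above.

```python
import pandas as pd

# Hypothetical toy data; heights are in inches.
df = pd.DataFrame({"name": ["Ann", "Bo"], "height": [62, 71]})
ages = pd.DataFrame({"name": ["Ann", "Bo"], "age": [34, 29]})

# Dataframe method: called on the dataframe itself.
with_feet = df.assign(feet=df["height"] // 12)

# Module-level function: called on pd, with the dataframes passed as arguments.
merged = pd.merge(df, ages, on="name")

print(with_feet)
print(merged)
```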

Remember the SHIFT+TAB trick to see function help!

Type import pandas as pd, then run that to load pandas. Then type pd.merge( as if you want to merge two dataframes, but you don’t remember the arguments to use. Press SHIFT+TAB to see the function’s documentation!
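SHIFT+TAB is a notebook feature. If you are working somewhere it is not available, one possible alternative (a sketch, not the only way) is to inspect the function from code using Python's built-in `inspect` module and `help`:

```python
import inspect
import pandas as pd

# List the argument names that pd.merge accepts.
params = list(inspect.signature(pd.merge).parameters)
print(params)

# Print the full documentation (the same text SHIFT+TAB shows).
help(pd.merge)
```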

Loading and saving data

Function	Pandas method	Example (see official syntax for more)
loading data	read_csv, read_stata, etc	`pd.read_csv('wine.csv')`
saving data	to_csv, to_stata, etc	`df.to_csv('wine.csv')`
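Note that loading is a `pd.` function while saving is a dataframe method. A minimal round-trip sketch, assuming a made-up `wine` dataframe and a temporary folder so that no existing file is overwritten:

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for a real wine dataset.
wine = pd.DataFrame({"label": ["A", "B"], "rating": [88, 92]})

# Save: to_csv is a dataframe method. index=False skips writing the row numbers.
path = os.path.join(tempfile.mkdtemp(), "wine.csv")
wine.to_csv(path, index=False)

# Load: read_csv is a function on the pandas module.
reloaded = pd.read_csv(path)
print(reloaded)
```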

Manipulating data ⭐

Warning

df.assign(feet=df['height']//12) will not add a “feet” variable to df permanently. This is true of almost all dataframe methods (e.g. filter, rename, …). If you want to save the new variable, you need to type df = df.assign(feet=df['height']//12). See the next page for more.
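A short sketch of this behavior, using a made-up dataframe: the first call computes the new column but throws the result away, and only the reassignment keeps it.

```python
import pandas as pd

df = pd.DataFrame({"height": [62, 71, 68]})  # hypothetical heights in inches

# Returns a new dataframe, which is discarded here; df is unchanged.
df.assign(feet=df["height"] // 12)
has_feet_before = "feet" in df.columns

# Reassigning to df keeps the new column.
df = df.assign(feet=df["height"] // 12)
has_feet_after = "feet" in df.columns

print(has_feet_before, has_feet_after)
```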

Note

Remember: replace df below with the name of the dataframe you’re working on!

Function	Pandas method	Examples (see official syntax for more)
new variables or replace existing	assign	`df.assign(feet=df['height']//12)` `df['feet'] = df['height']//12`
filter, get a subset of observations, or “drop rows”	⭐ query / loc / iloc	`df.query('height > 68')` `df.loc[df['gender']=='F']` `df.iloc[1:]`
get subset of columns	filter	`df.filter(['height','weight'])` `df[['height','weight']]`
rename columns	rename	`df.rename(columns={"height": "how tall"})`
sort	sort_values	`df.sort_values(['gender','weight'])`
do an operation on groups of observations	groupby ⭐	`df.groupby(['gender'])`, see common tasks for more. If you think “I’d like to do a for-loop on this dataframe”, the answer is usually groupby.
summary stats	agg / pivot_table	`df.agg({'height':['max','min','mean']})` `df.pivot_table(index='age', columns='gender', values='weight')`
summary stats on groups	agg / pivot_table	`df.groupby(['gender']).agg({'height':['max','min','mean']})` `df.pivot_table(index='age', columns='gender', values='weight')`
create a variable based on its group	groupby + transform	`df['ind_avg_lev'] = df.groupby(['industry','year'])['leverage'].transform('mean')` will add the industry-year average leverage to your dataset for each firm
delete column	drop	`df.drop(columns=['gender'])`
use non-pd function on df	pipe	`df.pipe((sns.lineplot,'data'),x='height',y='weight')`
combine dataframes	merge	`pd.merge(df1,df2)`
change time frequency of data	resample	`df.resample('Y').mean()`
window/rolling calculations	rolling	`df['ret_var_36'] = df.groupby('firm')['ret'].transform(lambda x: x.rolling(36).var())` will add the 36-period rolling variance of returns for each firm (use `.std()` instead of `.var()` for volatility)
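The group-based rows in the table above are easy to mix up, so here is a minimal sketch contrasting them. The firm panel is made up, and a 2-period window is used instead of 36 only so the toy data is long enough. `agg` collapses each group to one row, while `transform` returns one value per original row, which is what you need to add a column.

```python
import pandas as pd

# Hypothetical firm-year panel, already sorted by year within firm.
df = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "ret": [0.10, -0.05, 0.20, 0.02, 0.04, 0.06],
})

# agg: one row per firm.
firm_stats = df.groupby("firm")["ret"].agg(["mean", "max"])

# transform: same length as df, so it can be stored as a new column.
df["firm_avg_ret"] = df.groupby("firm")["ret"].transform("mean")

# Rolling calculation within each firm; the first row of each firm is NaN
# because a 2-period window is not yet full there.
df["ret_var_2"] = df.groupby("firm")["ret"].transform(lambda x: x.rolling(2).var())

print(firm_stats)
print(df)
```

Rolling calculations depend on row order, so sort by firm and date first (e.g. with `sort_values`) if your data is not already ordered.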

Reshaping data and changing index

Function	Pandas method	Example (see official syntax for more)
convert wide to long/tall (“stack!”)	stack	`df.stack()`, see common tasks for usage examples and The Shape of Data for an explanation of wide/tall. `melt` is another option and is a special case of `stack`.
convert long/tall to wide (“unstack!”)	unstack	`df.unstack()`, see common tasks for usage examples and The Shape of Data for an explanation of wide/tall. `pivot_table` is another option and is a special case of `unstack`.
turn a variable column into the index	set_index	`df.set_index('SSN')`
turn the index into a variable	reset_index	`df.reset_index()` Note: The new index will just be the row numbers.
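A minimal sketch of these four methods working together, using a made-up wide table with one row per firm and one column per year:

```python
import pandas as pd

# Hypothetical wide data: firms in the index, years as columns.
wide = pd.DataFrame(
    {"2020": [1.0, 3.0], "2021": [2.0, 4.0]},
    index=pd.Index(["A", "B"], name="firm"),
)

# Wide to tall: the column labels become an inner index level.
tall = wide.stack()

# Tall to wide: the inner index level goes back to being columns.
back = tall.unstack()

# Turn the index levels into ordinary columns. The unnamed level gets
# the default name "level_1".
as_columns = tall.rename("value").reset_index()

# Turn a column back into the index.
reindexed = as_columns.set_index("firm")

print(tall)
print(as_columns)
```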

Statistical operations

These functions can be called for a single variable in this form: <dfname>['<columnname>'].<function>() or for all numerical columns at once using <dfname>.<function>().
These functions also work within groups (call them after a groupby). ⭐

Function	Description
count	Number of non-null observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation (removed in pandas 2.0; use `(x - x.mean()).abs().mean()` instead)
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute Value
prod	Product of values
std	Unbiased standard deviation
var	Unbiased variance
sem	Unbiased standard error of the mean
skew	Unbiased skewness (3rd moment)
kurt	Unbiased kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum
nunique	How many unique values?
value_counts	How many of each unique value are there?
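The three calling patterns for these functions (one column, all numerical columns, and within groups) can be sketched as follows. The data is made up; `numeric_only=True` is passed so that the text column is skipped.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "height": [64, 66, 70, 72],
    "weight": [130, 140, 170, 190],
})

# One column.
col_mean = df["height"].mean()

# All numerical columns at once.
all_means = df.mean(numeric_only=True)

# Within groups.
group_means = df.groupby("gender")["height"].mean()

# Counts of each unique value.
counts = df["gender"].value_counts()

print(col_mean)
print(all_means)
print(group_means)
print(counts)
```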

LeDataSciFi-2022
