3.3.2. Making a Plot
To start plotting, add these to your import statements at the top of your file:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt # sometimes we want to tweak plots
3.3.2.1. Plotting process
# | Step | Note |
---|---|---|
0 | Ask a question about the data | Ex: What is the distribution of unemployment in each state? |
1 | Q > What the plot should look like. Draw it! | Draw it on paper! |
2 | Plot appearance > which plot function/options to use | Find a |
3 | The function dictates how data should be formatted before you call the plot | Key: Wide or tall? |
3.3.2.2. General tips
Start with simple graphs, and then build in and layer on “complications” and features.
Easy formatting things you should always do: title, xlabel, ylabel
Perfecting graphs can involve a lot of tweaks and code, so only do this when you know you have the content of the graph correct and that it will be in your deliverable (assignment, report, etc)
Really compare your code with the syntax in the documentation.
Understanding what each parameter does and needs is essential.
Triple check for typos, unclosed parentheses, and the like
Three places to find similar charts (and the functions that make them)
What chart should I use (with sns examples) and more help on how can I make it
The seaborn gallery
3.3.2.3. Syntax tips
With seaborn, I usually use syntax that looks something like this for graphing. (Delete the “<” and “>” and replace the inside with what you need.) You’ll see many examples in this chapter that deviate from this, usually because you don’t need to explicitly declare “data”, or because “x” is just assumed to be all variables in the dataset.
sns.<function>(data = <dataframe> [optional data functions],
x = '<varname>', y = '<varname>',
[optional arguments for specific plots] )
Tips for the “Optional data functions”:
Sometimes I add .query() after the dataframe name to filter outliers.
Sometimes I add .sample() afterwards to plot a more manageable amount of data.
Example:
sns.boxplot(data=ccm.query('td_a < 1 & td_a > 0'),
x='td_a')
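Here is a minimal sketch of chaining both optional data functions before plotting. The DataFrame `df` and its `ratio` column are made up for illustration (they stand in for something like `ccm` and `td_a` above), and the sample size of 500 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: 10,000 values, some of them outside (0, 1)
rng = np.random.default_rng(0)
df = pd.DataFrame({"ratio": rng.normal(loc=0.5, scale=0.4, size=10_000)})

# Filter the outliers first, then draw a smaller random sample to plot.
# random_state makes the sample reproducible from run to run.
subset = df.query("ratio > 0 and ratio < 1").sample(n=500, random_state=0)

# Same pattern as the template: function, data, then the variable to plot
ax = sns.boxplot(data=subset, x="ratio")
```

Filtering before sampling means every one of the 500 plotted points is inside the range you care about. If you sample first, some of the sampled rows get thrown away by the filter.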
3.3.2.4. Tips on plotting workflow
Generally, to plot in Python:
Put your data into a DataFrame
Format the data long if you want to use a sns function
Use pd or sns plotting functions.
Q: Which? A: Whichever is easiest! pandas’s plotting functions are simple and good for early-stage exploration and some simple graphics (bar, “barh”, scatter, and density), but seaborn has many more built-in options, has simpler syntax, and is easier to use, IMO.
Start with basic plots, then layer in features
Get the “gist” of the figure right
If you need to customize the figure, you’ll end up using matplotlib commands after the main plot function. Matplotlib is a full-powered (but confusing as heck) graphing package. In fact, both pandas and seaborn are just using matplotlib, but they hide the gory details for us. Thanks, seaborn!
This page discusses customizing and improving figures
Only customize when necessary for hyper control. Focus on CONTENT over hyper-control of formatting.
Some “format” tweaks (add a title, change the axis titles) and choices about plotting can be quick/cheap and have high value, and you should do these right before you finish your project/assignment and are about to post it officially. Otherwise, focus on content.
3.3.2.5. “I swear the syntax is correct!”
Warning
After syntax errors, most graphing pain comes from insufficient data wrangling. Most plotting functions have assumptions about how the data is shaped. Data might be unwieldy but we can control it:
How do we wrangle our data to make plot functions happy?
Keep your data in “tidy form” (aka tall data, aka long data). Seaborn expects data shaped like this. Long data is generally better for data analysis and visualization (even aside from Seaborn’s assumptions).
The exception: Pandas. If you want to plot using a pandas plot function, you might have to reshape (temporarily) your data to the wider “output shape” that corresponds to the graph type you’re generating.