
Wrangling Outro

Context - our goals: Remember, the class's objectives are to:

  1. obtain, explore, groom, visualize, and analyze data
  2. make all of that reproducible, reusable, and shareable

Context - so far: At this point, we're in the ballpark for all of that! In fact, if you recall the lengthier objectives, the "data: cleaning, exploring, visualizing and organizing" one said:

Some data will be given to you and well organized, but much of the data that will be given to you will be poorly organized. So we need to know how to explore it (both to clean it and learn from it). Tables will help you understand data, but visuals are usually better. [Plus], we are going to learn good dataset habits.

At this point, you've added these skills:

  • GitHub for collaboration (issues on peer reviews and discussion board) and sharing (forking your peer's repos)
  • GitHub for project management/development and version control
  • Python: numpy, pandas, seaborn, matplotlib
  • Datasets: CRSP (stock prices), Compustat (firm financial statements), FRED, Data.World
  • Data scraping: Yes, you've done this already!
  • Finance: Factor-loading estimates ("A Simple Program" and its sequel)

We need to talk about a few more issues before we get properly ambitious.

Context - going forward: We need to introduce a few more skills before we start really running analytical models.

  1. Merging datasets
  2. What to do with missing values?
  3. How to get a world of data off the world wide web
  4. Strings

Today, we will talk about those first two topics: Merging, and Dealing With Missing Values.

Merging

You've already merged datasets. But so far, our examples have been "well-behaved" so it was easy to just proceed. But real world datasets are messy (bad variable names, poor documentation) and big, and so merging isn't always as easy as "just do it".

A nice overview

The Python Data Science Handbook has a wonderful breakdown of the mechanics of merging. You should read it!

Important parameters of pd.merge

Read through the parameters of the function here.

Key parameters:

  • left, right - your datasets
  • on - a single variable or a list of variables you're matching
    • if the variable names aren't the same in the datasets (e.g. "ID" in one and "identity" in the other), use left_on and right_on
  • how =
    • "inner" --> obs whose keys are in both datasets (the default)
    • "left" --> inner + all unmatched obs in left
    • "right" --> inner + all unmatched obs in right
    • "outer" --> inner + all unmatched obs in both left and right
  • suffixes = when a (non-key) variable is in both datasets, how should we rename each copy?
    • It's a good idea to always use this option and specify the source, because the default ("_x" and "_y") is uninformative!
  • indicator=True will create a variable (_merge) saying which dataset each observation came from (a short sketch using these parameters follows this list)
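
For concreteness, here is a minimal sketch of those parameters in action. The dataframes, key names, and suffixes below are made up purely for illustration:

import pandas as pd

# two tiny made-up frames, just to illustrate the parameters
left_df = pd.DataFrame({'ID': [1, 2, 3], 'city': ['NYC', 'LA', 'CHI']})
right_df = pd.DataFrame({'identity': [2, 3, 4], 'city': ['LA', 'CHI', 'HOU']})

merged = pd.merge(left_df, right_df,
                  left_on='ID', right_on='identity',  # key names differ across datasets
                  how='outer',                        # inner matches + unmatched obs from both
                  suffixes=('_left', '_right'),       # renames the overlapping "city" variable
                  indicator=True)                     # adds a "_merge" column
print(merged)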

How many observations will my merge produce?

It depends on how you merge the datasets. But also, any time you merge datasets, you should think about the variables you're merging on (the "keys") and ask:

  1. Is it 1 observation per key in one, both, or neither?
  2. Am I doing 1:1, 1:M, M:1, or M:M merge? The differences are huge!
  3. Related: What's the "observation level" before and after? Which variables will be at different levels?
  4. Is the key ever empty in either dataset?

Copy the datasets below. Let's try merging different combinations and see what happens.

Let's also see what happens when we repeat a merge or try to merge in a dataset with a variable of the same name.

import pandas as pd
# this "firm" variable is really "last firm" but the database manager called it what they called it 
# and now IT is scared that fixing it will break all their systems, so it's "stuck as-is"
some_employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'firm': ['Citi','GS','Accenture','Wells Fargo'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                    'hire_date': [2004, 2008, 2012, 2014]})

departments = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'employees': [100,200,1] })

current_employer = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'firm': ['Yahoo','Yahoo','Yahoo','Yahoo']})

some_employees_alt = pd.DataFrame({'employee': ['Jonathan', 'Mary', 'Lisa', 'Sue'],
                    'firm': ['?','Lehigh','Accenture','Wells Fargo'],
                    'group': ['Accounting', 'CEO', 'Engineering', 'HR'],
                    'hire_date': [2010, 2019, 2012, 2014]})
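
Here's a hedged sketch of a few merges to try on those frames. It's one possible set of experiments, not the official answers; the validate argument is a built-in pd.merge check of the 1:1 / M:1 guess you made before merging:

# M:1 merge: many employees per group, one department row per group
emp_dept = pd.merge(some_employees, departments, on='group',
                    how='left', indicator=True, validate='m:1')
print(emp_dept.shape)

# merging two frames that share a non-key variable ('firm'):
# name the suffixes, or pandas falls back to the uninformative _x / _y
emp_firms = pd.merge(some_employees, current_employer, on='employee',
                     how='inner', suffixes=('_last', '_current'))
print(emp_firms.columns)

# repeating a merge (or merging another frame that also has shared names)
# creates still more copies -- check your columns after every merge
emp_firms2 = pd.merge(emp_firms, some_employees_alt, on='employee',
                      how='outer', suffixes=('_emp', '_alt'))
print(emp_firms2.columns)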

Good merging habits

  1. Specify how and suffixes.
  2. Before the merge, think about (1) the observation levels in the left and right datasets, (2) how you're merging, and (3) whether you think the merge is 1:1, 1:M, M:M, or M:1. Guess how many observations you'll end up with (more or fewer than the left? more or fewer than the right?) and then check afterwards.
  3. Look at the data! (Example: In the golden rules lecture "Good and Bad Data" example, if we merged the firm-level data to the firm-year level data, check that "GM"'s first year is matched to all annual observations for GM.)
  4. describe() the data and look at "count" for each variable (and also isnull().sum() to count missing observations). Does the count make sense?
    • Repeat this for each section of the merged dataset (the observations from the left data, from the right data, and from both). For example, df.query('_merge != "right_only"').describe() will describe the merged dataset's observations that came from the left data. (A sketch of these checks follows this list.)
  5. Going forward, remember the new observation level. I often actively name the dataframe to remind myself exactly that. For example, I know exactly how state_industry_year_df and state_industry_df should differ.
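
A minimal sketch of those post-merge checks, assuming a merge like the emp_dept example above that was done with indicator=True (the names are placeholders):

# `emp_dept` carries a _merge column because it was built with indicator=True
print(emp_dept.describe())                  # do the counts make sense?
print(emp_dept.isnull().sum())              # missing values by variable

# describe each "section" of the merged data separately
print(emp_dept.query('_merge == "both"').describe())
print(emp_dept.query('_merge != "right_only"').describe())  # obs from the left data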

Dealing With Missing Values

  1. df.isnull().sum() will report missing values by variable.
  2. With new datasets, look out for "missing values" that aren't coded as missing. Some datasets use a specific number to indicate missing data (e.g., -99). Convert these to NaNs with replace.

These slides on missing data are quite good! This article has examples too.

You should focus on the whys and hows of dealing with missing data rather than mechanics. (You can look up mechanics later.)

For example, with the firm-level data that investment analysts deal with, the most common approach is to keep all valid data and, for each test you run, use all observations that have no missing values for the variables of interest. In the slides, this is called "Complete-Case Analysis".

It is less common in my field of research to impute missing values. However, "Deductive imputation" is common when the cost of doing so isn't high.

In general, when you can confidently deduce a value (my height this year is the same as last year because I'm a fully grown adult (mostly)), go ahead. That is valid data.

Options

  • fillna - fill with any value you want (strings included), backward fill, forward fill, and more
  • dropna - might not be explicitly needed, some functions ignore NaNs, but some don't. I tend to drop only as needed for a given estimation, and only temporarily.
  • replace - some datasets use a specific number to indicate missing data (e.g., -99). Convert these to NaN with this function.
  • interpolate - e.g. use values in surrounding time periods to fill in gaps
  • Deduce. Suppose PPE this year is missing. $PPE_t = PPE_{t-1} + CAPX_t - DP_t$
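
A hedged sketch of that deduction (the column names simply mirror the formula above, and the numbers are made up):

import numpy as np
import pandas as pd

# hypothetical firm-year data with PPE missing in the later year
acct = pd.DataFrame({'firm': ['GM', 'GM'],
                     'year': [2019, 2020],
                     'PPE':  [100.0, np.nan],
                     'CAPX': [np.nan, 20.0],
                     'DP':   [np.nan, 5.0]})

# deduce PPE_t = PPE_{t-1} + CAPX_t - DP_t, within each firm
deduced = acct.groupby('firm')['PPE'].shift(1) + acct['CAPX'] - acct['DP']
acct['PPE'] = acct['PPE'].fillna(deduced)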

Copy the following code into your live code and we can try a few things:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[None, 2, 54, 3, None], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6]})