3.4. Other (Important) Data Wrangling Skills

Context - starting point: Remember, the class’s first two objectives are to:

  1. obtain, explore, groom, visualize, and analyze data

  2. make all of that reproducible, reusable, and shareable

At this point, we’re in the ballpark for all of that! In fact, if you recall the lengthier objectives, the “data: cleaning, exploring, visualizing and organizing” one said:

Some data will be given to you well organized, but much of it will be poorly organized. So we need to know how to explore it (both to clean it and to learn from it). Tables will help you understand data, but visuals are usually better. [Plus], we are going to learn good dataset habits.

Context - right now: So far, we’ve covered/added these skills:

  • GitHub for collaboration (issues and discussion board) and sharing (cloning your peers’ repos)

  • GitHub for project management/development and version control

  • Python: numpy, pandas, seaborn, matplotlib

  • Datasets: CRSP (stock prices), Compustat (firm financial statements), FRED (macroeconomic and regional time series data)

  • Data scraping: Yes, you’ve done this already!

  • Finance: Downloading stock prices and compounding returns over arbitrary time spans (see the sketch just after this list)
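
A minimal sketch of that compounding idiom, assuming a hypothetical long-format DataFrame of daily returns (the tickers, column names, and numbers below are invented for illustration): to compound returns over any span, multiply the gross returns (1 + r) and subtract one at the end.

```python
import pandas as pd

# hypothetical daily returns in "long" format; tickers, column
# names, and values are all made up for this illustration
rets = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"],
        "ret": [0.01, -0.02, 0.03, 0.00, 0.01, -0.01],
    }
)

# compound each ticker's returns over the full span:
# multiply gross returns (1 + r), then subtract 1 to get a net return
total_ret = rets.groupby("ticker")["ret"].apply(lambda r: (1 + r).prod() - 1)
print(total_ret)
```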

We need to talk about a few more issues before we get properly ambitious.

Context - going forward: Specifically, before we start actually running analytical models, we should pick up these skills:

  1. Merging datasets

  2. Handling missing values

  3. Handling outliers

  4. Scraping a world of data off the world wide web

  5. Working with string data

In this section of the data wrangling chapter, we will deal with the first three of those.
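
As a quick preview of those three, here is a minimal sketch using tiny made-up tables; every name, column, and number below is invented for illustration, and each choice involved (the join type, how to treat missing values, the winsorization cutoffs) is a decision we will discuss properly in what follows.

```python
import numpy as np
import pandas as pd

# hypothetical tables; all names and values are invented
prices = pd.DataFrame({"firm": ["A", "B", "C"], "ret": [0.05, np.nan, 2.50]})
accounting = pd.DataFrame({"firm": ["A", "B"], "assets": [100, 200]})

# 1. merging datasets: a left join keeps every row of `prices`,
#    even firms with no match in `accounting`
merged = prices.merge(accounting, on="firm", how="left")

# 2. missing values: start by making them visible; whether to drop,
#    fill, or model them is a separate decision
print(merged.isna().sum())

# 3. outliers: one common fix is winsorizing, i.e., clipping
#    extreme values at chosen percentiles
lo, hi = merged["ret"].quantile([0.01, 0.99])
merged["ret_w"] = merged["ret"].clip(lower=lo, upper=hi)
print(merged)
```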

(Scraping and strings get a whole chapter of the textbook…)