3.4. Other (Important) Data Wrangling Skills
Context - starting point: Remember, the class’s first two objectives are to:
obtain, explore, groom, visualize, and analyze data
make all of that reproducible, reusable, and shareable
At this point, we’re in the ballpark for all of that! In fact, if you recall the lengthier objectives, the “data: cleaning, exploring, visualizing and organizing” one said:
Some data will be given to you and well organized, but much of the data that will be given to you will be poorly organized. So we need to know how to explore it (both to clean it and learn from it). Tables will help you understand data, but visuals are usually better. [Plus], we are going to learn good dataset habits.
Context - right now: At this point, we’ve covered or added the following skills:
GitHub for collaboration (issues and discussion board) and sharing (cloning your peers’ repos)
GitHub for project management/development and version control
Python: numpy, pandas, seaborn, matplotlib
Datasets: CRSP (stock prices), Compustat (firm financial statements), FRED (macroeconomic and regional time series data)
Data scraping: Yes, you’ve done this already!
Finance: Downloading stock prices and compounding returns over arbitrary time spans
We need to talk about a few more issues before we get properly ambitious.
Context - going forward: Before we really start running analytical models, we need to introduce a few more skills:
Merging datasets
What to do with missing values?
What to do with outliers?
How to scrape a world of data off the world wide web
Working with string data
In this section of the data wrangling chapter, we will deal with the first three of those.
(Scraping and strings get a whole chapter of the textbook…)
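To make the first three topics concrete, here is a minimal preview sketch in pandas. Everything in it is an illustrative assumption rather than the approach this chapter will settle on: the toy tables `prices` and `fundamentals`, their column names, and the 5th/95th percentile cutoffs are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two small tables sharing a "firm" key.
prices = pd.DataFrame({
    "firm": ["A", "B", "C", "D"],
    "ret": [0.02, np.nan, -0.01, 0.95],
})
fundamentals = pd.DataFrame({
    "firm": ["A", "B", "C", "E"],
    "assets": [100.0, 250.0, np.nan, 80.0],
})

# 1. Merging: a left merge keeps every row of `prices`; indicator=True
#    adds a `_merge` column recording whether each row found a match.
merged = prices.merge(fundamentals, on="firm", how="left", indicator=True)

# 2. Missing values: count them per column before deciding what to do.
missing_counts = merged[["ret", "assets"]].isna().sum()

# 3. Outliers: one option (of several) is to clip values at chosen
#    percentiles. The 5%/95% cutoffs here are arbitrary.
lo, hi = merged["ret"].quantile([0.05, 0.95])
merged["ret_clipped"] = merged["ret"].clip(lower=lo, upper=hi)

print(merged)
print(missing_counts)
```

Note that firm D has no match in `fundamentals`, so the merge itself creates a new missing value in `assets`. That is one reason these three topics are worth learning together.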