Resources, Tutorials, and Data Sources

THE MOST ESSENTIAL RESOURCES

Help: Google, Stack Overflow, GitHub help, JupyterLab documentation, Python help
Cheat sheets to bookmark/print! Better yet, download these to your Notes repo, and put them in the “Codebook” folder therein!
- Included in this folder: python basics, jupyter notebook, importing data, numpy, pandas, seaborn, and scikit-learn
Coding best practices and project management

Python

Note

If you use any of these and LIKE or DISLIKE them, please let me know so I can guide future students to the most useful resources.

Data Science

Visualization

Essential: Kaggle’s Data viz tutorial is excellent. It has reproducible code and data, using Python.
Essential: An Economist’s Guide to Visualizing Data is excellent as well.
Essential: Data Visualization: A Practical Introduction, by Kieran Healy, discusses the “whys” of visualization in an especially smart way. The walkthroughs are in R, not Python, however.
STAT545 on good visualization
Some good data viz blogs

GitHub, Git, and Version control

Data/ML

Scikit-learn (Python package) can load some built-in datasets, including data on Boston real estate, wine, and a larger California housing dataset
Essential: Pandas can read in a LOT of useful data! Data providers include: Federal Reserve (“FRED”), Ken French, NASDAQ, OECD, Quandl, TSP, World Bank, and more!
ML competitions with serious prizes at drivendata.org
- This competition was interesting. You could start trying to analyze it here. This has a good example of the process you might follow. After you’re done, you can see the winner’s code and discussion of the winning approach.
Essential: kaggle.com has ML competitions, some FAQs, tutorials, and data
- Real estate data, a tutorial exploring that data, and a pass at a model
- Philly-based data would be fun. Here is real estate data, one option that seems OK (N=805)
- Predict box office for movies. VaultML claims they can do this by reading the screenplays and using textual analysis tools
- Wine, but not necessarily the best data source
UC Irvine has a data repo; some of its datasets are available via the scikit-learn package
- Predicting where the wine is from (wine/location): an easy starter challenge
- Wine Quality
- German credit data by person
More good sources: data.gov, data.census.gov, and data.world. https://ourworldindata.org is incredible and also has many repos on GitHub, including one that imports data via Python, …
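
As a starting point for the "where is the wine from?" challenge mentioned above, here is a minimal sketch using the wine dataset bundled with scikit-learn (a copy of the UC Irvine wine data, so nothing needs to be downloaded). The scaler, the logistic regression model, the 25% test split, and the random seed are illustrative assumptions, not a recommended approach; treat this as a first pass to improve on.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# load_wine ships with scikit-learn; as_frame=True returns pandas objects
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target  # chemical measurements and the cultivar label

# Hold out a quarter of the rows to check the model on unseen data
# (the split size and seed are arbitrary choices for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a simple classifier (an assumed baseline)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Rows: {X.shape[0]}, features: {X.shape[1]}")
print(f"Held-out accuracy: {accuracy:.2f}")
```

From here, you could swap in a different model or explore the features with `X.describe()` before modeling.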

Books

The Signal and the Noise, by Nate Silver
Range, by David Epstein, is a very interesting book generally, and it touches on prediction skill too
Superforecasting. Here is a decent free summary

LeDataSciFi-2022