Resources, Tutorials, and Data Sources
Note
Anything that is bolded/underlined below is also considered essential.
If you have any favorite resources, or found something helpful, please let me know!
THE MOST ESSENTIAL RESOURCES
Help: Google, Stack Overflow, GitHub help, JupyterLab documentation, Python help
Cheat sheets to bookmark/print! Better yet, download these to your Notes repo, and put them in the “Codebook” folder therein!
Included in this folder: python basics, jupyter notebook, importing data, numpy, pandas, seaborn, and scikit-learn
Python
Note
If you use any of these and LIKE or DISLIKE them, please let me know so I can guide future students to resources.
Essential: A Whirlwind Tour of Python
Essential: datacamp.com has many self-guided lessons
Lessons 3 - 5 of the official Python tutorial
The best compilation of coding resources on the web, including:
Data Science
Visualization
Essential: Kaggle’s Data viz tutorial is excellent. It has reproducible code and data, using python.
Essential: An Economist’s Guide to Visualizing Data is excellent as well.
Essential: Data Visualization: A Practical Introduction, by Kieran Healy, discusses the “whys” of visualization in an especially smart way. The walkthroughs are in R, not Python, however.
GitHub, Git, and Version control
Getting started on GitHub and a Twitter-length description of how a project flows
The most thorough yet simple walkthrough of Git and GitHub use on the web. Applies to Python use for the most part.
Data/ML
scikit-learn (a Python package) can load some built-in datasets, including data on Boston real estate, wine, and a larger California housing dataset
Essential: Pandas can read in a LOT of useful data! Data providers include: Federal Reserve (“FRED”), Ken French, NASDAQ, OECD, Quandl, TSP, World Bank, and more!
ML competitions with serious prizes at drivendata.org
This competition was interesting. You could start trying to analyze it here. This has a good example of the process you might follow. After you’re done, you can see the winner’s code and a discussion of the winning approach
Essential: kaggle.com has ML competitions, some FAQs, tutorials, and data
Real estate data, a tutorial exploring that data, and a pass at a model
Philly-based data would be fun. Here is one option for real estate data; it seems OK, with N=805
Predict box office for movies. VaultML claims they can do this by reading the screenplays and using textual analysis tools
UC Irvine has a data repository; some of its datasets are available via the scikit-learn package
Predicting where the wine is from (wine/location): an easy starter challenge
More good sources: data.gov, data.census.gov, and data.world. https://ourworldindata.org is incredible and also has many repos on GitHub, including one that imports data via Python, …
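To make the scikit-learn and pandas items in the list above concrete, here is a minimal sketch of loading data from Python. Several things here are assumptions, not part of the list itself: the helper names `describe_wine` and `fetch_fred_series` are made up for illustration; the FRED download assumes the separate `pandas-datareader` package is installed and that you are online (it may not work with every Python/pandas version); and the series ID `"GDP"` and the date range are arbitrary examples. The Boston housing loader is left out on purpose because, to my knowledge, it was removed from recent scikit-learn releases.

```python
import datetime as dt

from sklearn.datasets import load_wine


def describe_wine():
    """Load the wine dataset bundled with scikit-learn (no download needed).

    Returns (number of rows, number of features, list of class names).
    """
    wine = load_wine()
    n_rows, n_features = wine.data.shape
    return n_rows, n_features, list(wine.target_names)


def fetch_fred_series(series_id="GDP",
                      start=dt.datetime(2015, 1, 1),
                      end=dt.datetime(2020, 1, 1)):
    """Download one FRED series as a DataFrame (requires internet).

    Assumes pandas-datareader is installed. It is imported lazily so that
    merely defining this function does not require the package.
    """
    from pandas_datareader import data as web
    return web.DataReader(series_id, "fred", start, end)


# The offline part runs anywhere scikit-learn is installed.
n_rows, n_features, class_names = describe_wine()
print(n_rows, n_features, class_names)

# The online part is left commented out; uncomment to try it.
# gdp = fetch_fred_series("GDP")
# print(gdp.head())
```

The larger California housing dataset is loaded similarly with `sklearn.datasets.fetch_california_housing()`, but that call downloads the data the first time it is used.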
Books
Range, by David Epstein, is a very interesting book generally, and it touches on prediction skill too
Superforecasting. Here is a decent free summary