Resources, Tutorials, and Data Sources
Anything that is bolded/underlined below is also considered essential.
If you have favorite resources, or found something especially helpful, please let me know!
THE MOST ESSENTIAL RESOURCES
Cheat sheets to bookmark/print! Better yet, download these to your Notes repo, and put them in the “Codebook” folder therein!
Included in this folder: python basics, jupyter notebook, importing data, numpy, pandas, seaborn, and scikit-learn
Essential: Kaggle’s Data viz tutorial is excellent. It has reproducible code and data, using python.
Essential: An Economist’s Guide to Visualizing Data is excellent as well.
Essential: Data Visualization: A Practical Introduction, by Kieran Healy, discusses the “whys” of visualization in an especially smart way. The walkthroughs are in R, not Python, however.
Github, Git, and Version control
The most thorough yet simple walkthrough of Git and GitHub use on the web. It applies to Python use for the most part.
Scikit-learn (a Python package) can load some datasets directly, including data on Boston real estate, wine, and a larger California housing dataset.
Essential: Pandas can read in a LOT of useful data (through the pandas-datareader package)! Data providers include: Federal Reserve (“FRED”), Ken French, NASDAQ, OECD, Quandl, TSP, World Bank, and more!
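As a rough illustration of pulling data from one of these providers, here is a minimal sketch that downloads a single FRED series. It assumes the separate `pandas-datareader` package is installed and that you have internet access. The helper name `load_fred_series` is hypothetical and just for this example; `GDP` is the FRED code for U.S. gross domestic product.

```python
def load_fred_series(series_id, start, end):
    """Download one FRED series as a pandas DataFrame indexed by date."""
    # Imported inside the function so that defining it does not fail
    # if pandas-datareader is not installed yet.
    import pandas_datareader.data as web

    # "fred" tells DataReader to query the Federal Reserve's FRED database.
    return web.DataReader(series_id, "fred", start, end)


# Example call (needs internet access and `pip install pandas-datareader`):
# gdp = load_fred_series("GDP", "2010-01-01", "2020-12-31")
# print(gdp.head())
```

Other providers in the list have their own source names or helper functions, so check the pandas-datareader documentation for the one you need before adapting this.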
This competition was interesting. You could start trying to analyze it here. This has a good example of the process you might follow. After you’re done, you can see the winner’s code and a discussion of the winning approach.
Essential: kaggle.com has ML competitions, some FAQs, tutorials, and data
Philly-based data would be fun. Here is one option, a real estate dataset that seems OK (N=805)
Predict box office for movies. VaultML claims they can do this by reading the screenplays and using textual analysis tools
UC Irvine has a data repository, and some of its datasets are available via the scikit-learn package
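To give a sense of how the scikit-learn datasets mentioned above can be loaded, here is a minimal sketch using the wine dataset, which ships with the package and needs no download. It assumes scikit-learn 0.23 or newer (for the `as_frame` option) and pandas. Note that the California housing data is fetched with `fetch_california_housing`, which downloads the file on first use, and that the Boston loader was removed in scikit-learn 1.2, so it is not used here.

```python
from sklearn.datasets import load_wine

# load_wine is bundled with scikit-learn, so this works offline.
wine = load_wine(as_frame=True)

# .frame is a pandas DataFrame with the 13 features plus a "target" column.
df = wine.frame

print(df.shape)                      # rows and columns
print(df["target"].value_counts())   # observations per wine class
```

From here, `df` behaves like any other DataFrame, so the pandas and seaborn cheat sheets apply directly.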