Objectives
Briefly
Use Python to obtain, explore, groom, visualize, and analyze data → this puts the world of big data at your fingertips
Make all of that reproducible, reusable, and shareable → collaboration is mandatory in real-world projects
Apply those skills to problems in the finance domain → interesting, profitable, and impactful action
In more detail
(1) project management and programming golden rules
organized workflows and good programming habits reduce errors, increase speed, and make your code easier for others to read (including yourself when you look at it 8 months later)
portable code is better, too
intro to programming
oh: working “asynchronously” on group projects is a way of life in 21st century jobs[1]; we’re going to learn how to do that more productively[2] and with fewer headaches[3]
(2) our data science “stack”
Python: Installed via Anaconda, with coding in JupyterLab (or Jupyter, or Spyder when it suits the task)
GitHub: Sharing and storing code, data, and reports + collaboration
Git: Version control system. Think of it like Google Docs or “track changes” in Microsoft Word, but built for software projects that teams work on.
GitHub Desktop: Git is, famously, a pain in the @$^; GitHub Desktop makes it easy
(3) data: cleaning, exploring, visualizing, and organizing
some data will be given to you and well organized
but much of the data that will be given to you will be poorly organized
so we need to know how to explore it (both to clean it and learn from it)
tables will help you understand data, but visuals are usually better
we are going to learn good dataset habits
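To make "explore it, then clean it" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the tiny dataset, the column names (`ticker`, `date`, `ret`), and the cleaning rules are all made up, and it sticks to Python's standard library so it runs anywhere, rather than using whatever data tools the course adopts later.

```python
import csv
import io
import statistics

# Hypothetical "poorly organized" data: inconsistent ticker casing,
# a stray space, a missing return, and a duplicated row.
raw = """ticker,date,ret
AAPL,2020-01-02,0.012
aapl ,2020-01-03,-0.004
MSFT,2020-01-02,
MSFT,2020-01-03,0.007
MSFT,2020-01-03,0.007
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Explore first: how many rows, and how many are missing a return?
n_missing = sum(1 for row in rows if row["ret"].strip() == "")
print(f"{len(rows)} raw rows, {n_missing} missing a return")

# Clean: standardize tickers, drop missing returns, drop duplicates.
clean = []
seen = set()
for row in rows:
    ticker = row["ticker"].strip().upper()
    ret = row["ret"].strip()
    if ret == "":
        continue  # no return recorded, so skip the row
    key = (ticker, row["date"])
    if key in seen:
        continue  # same ticker and date already kept
    seen.add(key)
    clean.append({"ticker": ticker, "date": row["date"], "ret": float(ret)})

# Learn from it: average return per ticker.
by_ticker = {}
for row in clean:
    by_ticker.setdefault(row["ticker"], []).append(row["ret"])
mean_ret = {ticker: statistics.mean(rets) for ticker, rets in by_ticker.items()}
print(mean_ret)
```

Note the order: look at the data before changing it, and keep the raw rows untouched so every cleaning decision can be rerun and checked.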
(4) web crawling and data scraping
some data will be given to you.
but most of the world’s data is uncollected and unorganized (despite Google’s best efforts)
finding and using that data - that’s where we make our money
(5) prediction models and data analysis
what we’re going to do with the data: produce and improve a model, using applied econometrics (i.e. more conceptual than mathematically rigorous) to understand and improve the output
understanding how “data analysis/ML/<<buzz word #51>>” fits into the bigger picture of producing and using our domain knowledge from finance[4]. To quote Prof Gunther: data < info < knowledge < wisdom
learning from the model: what does the output of my analysis mean? (A and B are related, but WHY?)
(6) applying all those skills within finance
it’s easy to misuse powerful tools: don’t go running blindfolded in the woods wielding a chainsaw
data science skills are no exception: be wary of using DS/ML/big data outside areas where you have domain skill and contextual understanding
this applies to other people, too: e.g. beware of amateur epidemiologists forecasting a pandemic with 3rd-degree polynomials in Excel
bad algos abound
→ applying our new skills in our expertise area (finance, duh!)
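Why are 3rd-degree polynomial forecasts a warning sign? A minimal sketch, with made-up numbers: take four observations of a series that is clearly flattening out, pass a cubic through them, and extrapolate. The data and the helper name `cubic_through` are hypothetical. With only four points, a least-squares cubic fit is the same thing as the exact interpolating cubic used here.

```python
def cubic_through(points):
    """Return the interpolating polynomial through the points (Lagrange form)."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

# Hypothetical series that is leveling off: increments of 6, then 3, then 1.
observed = [(0, 0.0), (1, 6.0), (2, 9.0), (3, 10.0)]
forecast = cubic_through(observed)

# In sample, the cubic matches every observation exactly.
in_sample = [forecast(x) for x, _ in observed]
print(in_sample)

# Out of sample, it predicts about 45 at x = 10, even though the
# observed series had nearly stopped growing at 10.
out_of_sample = forecast(10)
print(out_of_sample)
```

A perfect in-sample fit says nothing about whether the curve's shape makes sense outside the data. Knowing which shapes are plausible is the domain knowledge part.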
[1] I wrote this before covid and it’s obviously more true now.
[2] = $$$
[3] = :)
[4] “A few times a year, I get asked to be a judge of student statistical projects in politics or sports. While the students are very bright, they spend WAY too much time using fancy statistical methods and not enough time framing the right questions and contextualizing their answers. If you want to be a good data scientist, you should spend ~49% of your time developing your statistical intuition (i.e. how to ask good questions of the data), and ~49% of your time on domain knowledge (improving overall understanding of your field). Only ~2% on methods per se.” - Nate Silver