But first, let's acknowledge how crazy it is out there...
Still, experts recommend keeping your daily rituals even while working from home.
Seriously: If you have issues with internet, food, security, stability, or anything else, please let me (and possibly Lehigh staff) know. We will try to find and direct resources your way.
Zoom... I didn't expect this to be an online class. I'm sure nothing will go terribly wrong...
- When you join the classroom: Ensure your mic is muted, then click on the "Participants" and "Chat" buttons
- If you have a second screen at home, use one for Zoom and one for Jupyter
- If you don't have a second screen, I recommend putting the Zoom window on the left side of your screen and Jupyter on the right side.
- On Windows, click on the Zoom app and press Windows + Left to snap it left, then click on your Jupyter window and press Windows + Right to snap it right
- On Mac, follow these instructions
The promise of machine learning
- Robo-advising
- Manage risk (loans and insurance) to reduce write-offs and lower costs for consumers
- Prevent and detect fraud (external and internal)
- Investment choices - stocks, real estate (where to put factories, banks, etc.)
- Improve ad offers to credit customers
Accenture thinks AI will add $140B of value to financial services firms alone via cost and productivity savings by 2025.
Don't you want to capture a little of that?
Machine Learning gone wrong
- Google Flu Trends consistently overpredicted flu prevalence
- IBM's Watson tried to predict cancer. How'd it go? According to internal documents: "This product is a piece of sh–."
- Amazon's engineers used ML to evaluate applicants, but the model they trained learned that male applicants were automatically better
Chatbots have had many struggles. Here's Microsoft's attempt at speaking like the youths:
- ML/AI methods replicate patterns in the data by design: If you give them data with human biases, then the AI can easily become biased. This has led to debates about how to use ML for:
- Criminal sentencing - "risk predictions" can overweight race
- Online advertising - Google is more likely to serve up arrest records in searches for names assigned "primarily to black babies"
- Google will stitch together photos
I guess Google's AI thought the guy was built like a mountain...
How to define a project and structure the process
A few times a year, I get asked to be a judge of student statistical projects in politics or sports. While the students are very bright, they spend WAY too much time using fancy statistical methods and not enough time framing the right questions and contextualizing their answers. If you want to be a good data scientist, you should spend ~49% of your time developing your statistical intuition (i.e. how to ask good questions of the data), and ~49% of your time on domain knowledge (improving overall understanding of your field). Only ~2% on methods per se. - Nate Silver
Start with an interesting question or problem
Before you begin the analysis, know the questions you're trying to answer and what you're trying to accomplish - don't fall into an analytical rabbit hole. Additionally, you should know some basic things about your potential data - what data sources are available to answer the questions? How is that data structured? Is it in a database? CSVs? Third-party APIs? What tools will you be able to use for the analysis?
Your approach will likely change throughout, but it's helpful to start with a plan and adjust.
Two types of questions:
- Relationships: Do airline closures affect how VCs monitor portfolio companies? (Positively, negatively, or not? How much? Is the relationship because one causes the other, or something else?)
- Predictions: Which loans will default?
Pick your model(s)
A model is an idealized representation of a system
- "All models are wrong, but some are useful" - George Box
- Really!
- Relationship model: When people have one ice cream cone, they are 2% more likely to drown
- Model should summarize the data
- Simpler models are better because they are easier to interpret
- Example: Linear models (usually regression) are nice: $\text{final grade} = b + m \times \text{midterm grade}$
- Prediction model: Loan defaults over the next three months are 20% more likely for restaurant and service workers.
- More complex models are often favored
- May not summarize the data, and often are impossible to interpret
- Example: Nearest neighbor model: $\text{final grade} = \text{nearest neighbor}(\text{midterm grade})$
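To make the two example models concrete, here is a minimal sketch that fits both to the same data. The grades are made up for illustration, and the helper names (`fit_linear`, `predict_nearest_neighbor`) are hypothetical, not from any course code. It assumes only NumPy.

```python
import numpy as np

# Hypothetical (made-up) grades for six students, for illustration only.
midterm = np.array([55.0, 62.0, 70.0, 78.0, 85.0, 93.0])
final = np.array([58.0, 60.0, 74.0, 75.0, 88.0, 90.0])


def fit_linear(x, y):
    """Least-squares fit of y = b + m * x. Returns (b, m)."""
    m, b = np.polyfit(x, y, deg=1)  # polyfit returns the highest power first
    return b, m


def predict_nearest_neighbor(x, y, x_new):
    """Return the y value of the training point whose x is closest to x_new."""
    idx = np.argmin(np.abs(x - x_new))
    return y[idx]


# Relationship-style model: two numbers (b, m) summarize the whole data set.
b, m = fit_linear(midterm, final)
print(f"final grade = {b:.2f} + {m:.2f} * midterm grade")

# Prediction for a new student with a midterm grade of 80.
new_midterm = 80.0
linear_pred = b + m * new_midterm
nn_pred = predict_nearest_neighbor(midterm, final, new_midterm)
print(f"linear prediction:           {linear_pred:.1f}")
print(f"nearest neighbor prediction: {nn_pred:.1f}")
```

The linear model gives you a slope you can interpret (extra final-grade points per extra midterm point). The nearest neighbor model has no such summary: for a midterm of 80 it simply copies the final grade of the closest student (the one who scored 78).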
To estimate any model
We will talk in depth about a few models in class, but generally, these three steps always apply:
- Select a model. (For example: find the "center" of a univariate distribution, linear regression, logistic regression)
- Use knowledge about the area to help pick
- Select a loss function. (For example: Mean squared error, mean absolute deviation, $R^2$)
- There are many loss functions!
- The loss function choice affects the accuracy and speed of estimation
- Choice depends on the estimation task
- Qualitative or quantitative data?
- Are all errors equal? (A false negative on a cancer test is much worse than a false positive!)
- Do outliers matter more or less?
- Some models imply the loss function. For example, regression's loss function is almost always mean squared error.
- Fit the model by minimizing the loss.
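The three steps above can be sketched on the simplest model in the list: finding the "center" of a univariate distribution. This is a rough illustration under stated assumptions, not a recommended estimation routine. The data are made up, the function names are hypothetical, and the "fit" is a brute-force search over a grid of candidate values so that the minimization step is visible.

```python
import numpy as np

# Hypothetical data with one large outlier (values are made up).
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Step 1: select a model. Here the model is a single constant c:
# "every observation is predicted to equal c".


# Step 2: select a loss function.
def mse_loss(c, y):
    """Mean squared error of predicting c for every observation."""
    return np.mean((y - c) ** 2)


def mae_loss(c, y):
    """Mean absolute error of predicting c for every observation."""
    return np.mean(np.abs(y - c))


# Step 3: fit the model by minimizing the loss (brute-force grid search).
def fit_constant(y, loss, candidates):
    losses = np.array([loss(c, y) for c in candidates])
    return candidates[np.argmin(losses)]


candidates = np.linspace(0.0, 100.0, 10001)  # grid with step 0.01
center_mse = fit_constant(data, mse_loss, candidates)
center_mae = fit_constant(data, mae_loss, candidates)

print(f"center under squared error:  {center_mse:.2f} (mean   = {data.mean():.2f})")
print(f"center under absolute error: {center_mae:.2f} (median = {np.median(data):.2f})")
```

Same model, same data, different loss: squared error lands on the mean, which the outlier drags upward, while absolute error lands on the median, which barely notices the outlier. That is one way the "Do outliers matter more or less?" question shows up in practice.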
Required reading before Thursday
Starting our projects
- Project groups of 3 or 4. You'll collaborate within a GitHub repo (more soon on that).
Timeline:
See the project assignment page.
Collective brainstorming
Discussion time: I'll keep a running list. Let's keep this free-form...
- What interesting applications of "big data" have you seen?
- Think about interesting firms, developing stories (COVID), business problems you've seen.
- We need a finance angle, which includes but is not limited to:
- Fed policy
- Investment platforms
- Asset returns
- Retirement planning
- Crypto
- Firm investments
- Real estate
- Fraud
- Cybersecurity
Teams
- Let's try to form teams of 3 or 4 now (can use Zoom chat, text, email)
- Head into Breakout rooms and discuss project ideas
- Which project ideas so far interest you?
- Do you have a sense of what ML techniques might be interesting to try on that problem?
- Note: If regression, you can still use ML to build variables as inputs to regression analysis (a la Assignment 5)
After/during class
Formally form teams:
- Go to https://classroom.github.com/g/nv0-pqH7 .
- The first person on the team who goes there creates the team. The rest join.
- I might have to tweak the teams.