Modeling Intro

But first, let's acknowledge how crazy it is out there...

Still, experts recommend keeping your daily rituals even while working from home

Seriously: If you have issues with internet, food, security, stability, anything: Please let me and possibly Lehigh staff know. We will try to find and direct resources your way.

Zoom... I didn't expect this to be an online class. I'm sure nothing will go terribly wrong...

Link to that

  • When you join the classroom: Ensure your mic is muted, then click on the "Participants" and "Chat" buttons
  • If you have a second screen at home, use one for Zoom and one for Jupyter
  • If you don't have a second screen, I recommend putting the Zoom window on the left side of your screen and Jupyter on the right side.
    • On Windows, click on the Zoom app and hit Windows + Left to snap it left, then click on your Jupyter window and hit Windows + Right to snap it right
    • On Mac, follow these instructions

But second first, some thoughts about the assignment

  • Due Monday!
  • Up/down: Do the new instructions make sense?
  • Let me show you what my directory looks like...
  • Chat window: What question do you have about the new instructions, or generally about the assignment as you try to finish it?

The promise of machine learning

  1. Robo-advising
  2. Manage risk (loans and insurance) to reduce write-offs and lower costs for consumers
  3. Prevent and detect fraud (external and internal)
  4. Investment choices: stocks, real estate (where to put factories, banks, etc.)
  5. Improve ad offers to credit customers

Accenture estimates that AI will add $140B of value to financial services firms alone by 2025 via cost savings and productivity gains.

Don't you want to capture a little of that?

Machine Learning gone wrong

I guess Google's AI thought the guy was built like a mountain...

How to define a project and structure the process

A few times a year, I get asked to be a judge of student statistical projects in politics or sports. While the students are very bright, they spend WAY too much time using fancy statistical methods and not enough time framing the right questions and contextualizing their answers. If you want to be a good data scientist, you should spend ~49% of your time developing your statistical intuition (i.e. how to ask good questions of the data), and ~49% of your time on domain knowledge (improving overall understanding of your field). Only ~2% on methods per se. - Nate Silver

Start with an interesting question or problem

Before you begin the analysis, know the questions you're trying to answer and what you're trying to accomplish - don't fall into an analytical rabbit hole. Additionally, you should know some basic things about your potential data - what data sources are available to answer the questions? How is that data structured? Is it in a database? CSVs? Third-party APIs? What tools will you be able to use for the analysis?

Your approach will likely change throughout, but it's helpful to start with a plan and adjust.

Two types of questions:

  • Relationships: Do airline closures affect how VCs monitor portfolio companies? (Positively, negatively, or not? How much? Is the relationship because one causes the other, or something else?)
  • Predictions: Which loans will default?

Pick your model(s)

A model is an idealized representation of a system

  • "All models are wrong, but some are useful" - George Box
    • Really!
  • Relationship model: When people have one ice cream cone, they are 2% more likely to drown
    • Model should summarize the data
    • Simpler models are better because they are easier to interpret
    • Example: Linear models (usually regression) are nice: $\text{final grade} = b + m \times \text{midterm grade}$
  • Prediction model: Loan defaults over the next three months are 20% more likely for restaurant and service workers.
    • More complex models are often favored
    • May not summarize the data, and often are impossible to interpret
    • Example: Nearest neighbor model: $\text{final grade} = \text{nearest neighbor}(\text{midterm grade})$
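
To make the two example models above concrete, here is a minimal sketch in plain Python. The grade pairs are invented purely for illustration, and the function names (`fit_linear`, `predict_linear`, `predict_nearest_neighbor`) are hypothetical helpers, not part of any library or of the course materials.

```python
# Hypothetical (midterm, final) grade pairs, invented purely for illustration.
grades = [(60, 65), (70, 72), (80, 85), (90, 88), (100, 97)]


def fit_linear(data):
    """Least-squares fit of: final grade = b + m * midterm grade."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept
    return b, m


def predict_linear(params, midterm):
    """Prediction from the fitted line; two numbers summarize the data."""
    b, m = params
    return b + m * midterm


def predict_nearest_neighbor(data, midterm):
    """Return the final grade of the student whose midterm is closest."""
    _, final = min(data, key=lambda pair: abs(pair[0] - midterm))
    return final


params = fit_linear(grades)
print("intercept and slope:", params)
print("linear prediction for 84:", predict_linear(params, 84))
print("nearest-neighbor prediction for 84:", predict_nearest_neighbor(grades, 84))
```

On this toy data, the linear model boils everything down to an intercept and a slope that you can read and interpret. The nearest-neighbor model has no such summary: it simply looks up the most similar student and copies that student's final grade.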

To estimate any model

We will talk in depth about a few models in class, but generally, these three steps always apply:

  1. Select a model. (For example: the "center" of a univariate distribution, linear regression, logistic regression)
    • Use knowledge about the area to help pick
  2. Select a loss function. (For example: Mean squared error, mean absolute deviation, $R^2$)
    • There are many loss functions!
    • The loss function choice affects the accuracy and speed of estimation
    • Choice depends on the estimation task
    • Qualitative or quantitative data?
    • Are all errors equal? (A false negative on a cancer test is much worse than a false positive!)
    • Do outliers matter more or less?
    • Some models imply the loss function. For example, linear regression's loss function is almost always mean squared error.
  3. Fit the model by minimizing the loss.
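
The three steps above can be sketched on the simplest model in the list, the "center" of a univariate distribution. This is only an illustration under stated assumptions: the data are invented (with one deliberate outlier), the helper names are hypothetical, and a coarse grid search stands in for a real optimizer.

```python
import statistics

# Hypothetical data with one outlier, invented purely for illustration.
values = [2, 3, 3, 4, 50]

# Step 1: select a model. Here the model is a single constant c that is
# used as the prediction for every observation (the "center").


# Step 2: select a loss function.
def mean_squared_error(c, data):
    return sum((y - c) ** 2 for y in data) / len(data)


def mean_absolute_deviation(c, data):
    return sum(abs(y - c) for y in data) / len(data)


# Step 3: fit the model by minimizing the loss. We try every candidate c
# on an evenly spaced grid between the smallest and largest observation.
def fit_constant(loss, data, step=0.1):
    lo, hi = min(data), max(data)
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda c: loss(c, data))


best_mse = fit_constant(mean_squared_error, values)
best_mad = fit_constant(mean_absolute_deviation, values)

print("center under squared error: ", round(best_mse, 1))
print("center under absolute error:", round(best_mad, 1))
print("mean:", statistics.mean(values), "median:", statistics.median(values))
```

Same model, same data, different loss: squared error lands on the mean, which the outlier drags upward, while absolute deviation lands on the median, which ignores it. That is one way to see why "Do outliers matter more or less?" is a question about the loss function.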

Required reading before Thursday

  1. Principles of good data analysis, by Greg Reda
  2. Chapter 10 of Data 100
  3. How Big Investors Cash in on Alternative Data

Starting our projects

  • Project groups of 3 or 4. You'll collaborate within a GitHub repo (more soon on that).

Timeline:

See the project assignment page.

Collective brainstorming

Discussion time: I'll keep track of a list - Let's free form this...

  • What interesting applications of "big data" have you seen?
  • Think about interesting firms, developing stories (COVID), business problems you've seen.
  • We need a finance angle, which includes but is not limited to:
    • Fed policy
    • Investment platforms
    • Asset returns
    • Retirement planning
    • Crypto
    • Firm investments
    • Real estate
    • Fraud
    • Cybersecurity

Teams

  • Let's try to form teams of 3 or 4 now (can use Zoom chat, text, email)
  • Head into Breakout rooms and discuss project ideas
    • Which project ideas so far interest you?
    • Do you have a sense of what ML techniques might be interesting to try on that problem?
    • Note: If regression, you can still use ML to build variables as inputs to regression analysis (a la Assignment 5)

After/during class

Formally form teams: