5.2. Planning a project

Or: All the work you should do before you start working


  1. Start with an interesting question or problem

  2. What type of question are you asking?

  3. Think about data:

    • What is the ideal dataset that would most easily answer your question?

    • What data is available (sources, how easy/costly is it to get)

    • Explore the data.

    • If you have sub-optimal data, you’ll have to adjust your subsequent steps to use that data. Adjustments are a natural part of the modeling cycle.

  4. Pick your model(s)

  5. Estimate your model(s) and evaluate the output

It’s easy and tempting to get a ton of data, type import sklearn, and start plowing into the data like Leeroy Jenkins. When you do that, it’s easy to get caught up in the minutiae of some discrete part of your project. (How do I reshape the data to the right analysis level, or add a new variable to the dataset?)

This is the path toward the dark side and will result in a ton of wasted time and effort!

Throughout the semester, you’ve seen that finding a solution to a homework problem is usually easiest when you step back, think about the challenge you’re facing, and write down some pseudocode. Similarly, when you are working on a project, a big-picture orientation is especially valuable. The main difference is that projects are bigger than our homework, so when you get stuck on irrelevant minutiae in a project, you will waste 10-100x more time and effort.

The other difference between projects and homework is that, with projects, you’re in charge. Instead of needing to figure out how to write some code to solve a discrete problem I gave you, you need to figure out what to do.

A few times a year, I get asked to be a judge of student statistical projects in politics or sports. While the students are very bright, they spend WAY too much time using fancy statistical methods and not enough time framing the right questions and contextualizing their answers. If you want to be a good data scientist, you should spend ~49% of your time developing your statistical intuition (i.e. how to ask good questions of the data), and ~49% of your time on domain knowledge (improving overall understanding of your field). Only ~2% on methods per se.

—Nate Silver


So, we will build up the requisite skills to do the mechanical stuff, but honestly, most interesting applied problems don’t hinge on the ability to type knn.fit(X, y). The key to solving problems is seeing the context of the situation and framing the right questions.

5.2.1. What type of question are you asking?

Most analyses fall into one of two camps, based on the type of question:

Relationship questions

Example: Do airline closures affect how VCs monitor portfolio companies?

  • Direction matters: Is the relationship positive, negative, or neither?

  • Magnitude usually matters: Which factors matter most?

  • Is the relationship causal, or a correlation reflecting something else?

Model: You’ll probably start with regression or logit

Prediction questions

Example: Which loans will default?

  • Which factors matter, by how much, and in what ways are not as important; accurate predictions are.

Model: Probably something more complex (depends on the question)
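The contrast between the two camps can be sketched with made-up loan data (all names and numbers below are invented for illustration). A logit coefficient answers a relationship question (what is the sign and size of income’s effect on default?), while the prediction question only needs the per-loan default probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: 500 loans, default probability falls as income rises
rng = np.random.default_rng(0)
income = rng.normal(50, 15, size=500)
true_prob = 1 / (1 + np.exp((income - 40) / 5))
default = (rng.random(500) < true_prob).astype(int)

logit = LogisticRegression().fit(income.reshape(-1, 1), default)

# Relationship question: the coefficient's sign and size
coef = logit.coef_[0, 0]  # negative: higher income, fewer defaults

# Prediction question: a default probability for each loan
p_default = logit.predict_proba(income.reshape(-1, 1))[:, 1]
```

The same fitted model can serve either camp; which output you scrutinize (coef or p_default) depends on your question.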

5.2.2. Why should I think about the type of question?

The type of question is a good thing to fix in your head, because it dictates what you care about when you pick and estimate your models.

  1. Suppose you have two variables, \(y\) and \(X\).

  2. You run a regression of \(y\) on \(X\). That regression models the data as \(y = m+\beta*X\).

  3. The regression produces an estimate for \(m\) and \(\beta\), and I’ll call the estimated versions \(\hat{m}\) and \(\hat{\beta}\). (I add the hats to denote that these are estimates, and because choosing weird notation to confuse undergrads is in all faculty contracts.)

  4. You can also use the estimates to create predicted values for y, which I’ll call \(\hat{y}\) and the formula to get them is \(\hat{y}=\hat{m}+\hat{\beta}*X\). That is, you take all the real values of X, and multiply by \(\hat{\beta}\), and then add \(\hat{m}\).

So, do you care about \(\hat{\beta}\), or \(\hat{y}\)? It depends on the type of your question!


  • Relationship questions are \(\hat{\beta}\) questions

  • Prediction questions are \(\hat{y}\) questions
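The steps above can be sketched with scikit-learn on made-up data: the fitted intercept and slope are \(\hat{m}\) and \(\hat{\beta}\), and \(\hat{y}\) is built from them exactly as in step 4.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data with true m = 3 and beta = 2, plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(size=100)

reg = LinearRegression().fit(X, y)

# Relationship questions care about these estimates:
m_hat, beta_hat = reg.intercept_, reg.coef_[0]

# Prediction questions care about y-hat, built per step 4:
y_hat = m_hat + beta_hat * X[:, 0]  # same as reg.predict(X)
```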

5.2.3. Pick your model(s)

A model is an idealized representation of a system

For example:

  • \(E=mc^2\)

  • Financing policies: \(investment = MarginalQ\)

  • Asset prices: \(r = \beta * MKT\)

Famous take by George Box: “All models are wrong, but some are useful”

Relationship model

  • Word example: When people have one ice cream cone, they are 2% more likely to drown.

  • The model should summarize the data.

  • Simpler models are often preferred because they are easier to interpret.

  • Example: Linear regression, \(\text{final grade} = b + m * \text{midterm grade}\)

Prediction model

  • Word example: Loan defaults over the next three months will be 9% for restaurant and service workers.

  • The model may not summarize the data, and is often impossible to interpret beyond its predicted values.

  • More complex models are often favored (understanding how predictions are made is less important than accuracy).

  • Example: Nearest neighbor model, \(\text{final grade} = \text{nearest neighbor}(\text{midterm grade})\)
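To make the two example models concrete, here is a sketch on invented midterm/final grades: the regression yields an interpretable slope, while the 1-nearest-neighbor model just copies the final grade of the student with the closest midterm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Invented grades for six students
midterm = np.array([[55], [65], [70], [80], [85], [90]])
final = np.array([60, 68, 72, 78, 84, 88])

lin = LinearRegression().fit(midterm, final)
knn = KNeighborsRegressor(n_neighbors=1).fit(midterm, final)

new_student = [[82]]
lin_pred = lin.predict(new_student)[0]  # b + m*82; m is interpretable
knn_pred = knn.predict(new_student)[0]  # copies the nearest neighbor:
                                        # 82 is closest to 80, so it predicts 78
```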

5.2.4. Design your tests

TBD - TODO. Map model to the empirical test you’ll actually implement. Evaluate it for weaknesses.

5.2.5. Estimate your model

We will talk in-depth about a few models in class, but generally, these three steps always apply:

  1. Select a model. (For example: find the “center” of a univariate distribution, regression, logistic)

    • Use knowledge about the domain area of the question to help pick the model

  2. Select a loss function. (For example: mean squared error, mean absolute deviation, \(R^2\))

    • There are many loss functions!

    • The loss function choice affects the accuracy and speed of estimation

    • Choice depends on the estimation task:

      • Qualitative or quantitative data?

      • Are all errors equal? (A false negative on a cancer test is much worse than a false positive!)

      • Do outliers matter more or less?

    • The model often implies the loss function. For example, regression’s loss function is almost always mean squared error.

  3. Estimate (“fit”) the model by minimizing the loss.
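A toy sketch of all three steps, assuming made-up data with one outlier: the model is “the center of the distribution is a single number c”, and we fit it by brute force, trying candidate values of c and keeping the one with the lowest loss. Minimizing squared error recovers the mean (pulled toward the outlier); minimizing absolute deviation recovers the median, which is why outliers matter less under that loss.

```python
import numpy as np

# Invented data; 100.0 is an outlier
data = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

def mse(c):
    return np.mean((data - c) ** 2)  # mean squared error loss

def mad(c):
    return np.mean(np.abs(data - c))  # mean absolute deviation loss

# Step 3: "fit" by minimizing the loss over a grid of candidates
candidates = np.linspace(0, 110, 2201)
c_mse = min(candidates, key=mse)  # lands at the mean (21.7)
c_mad = min(candidates, key=mad)  # lands at the median (2.5)
```

Real estimators minimize the loss analytically or with smarter optimizers, but the logic is the same: pick a model, pick a loss, and choose the parameter values that make the loss smallest.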