2.5. Good Data is Normalized

Important

Normalized data means

  1. The dataset has distinct “Keys”

  2. Variables in the dataset are at the key’s unit-level

Storing normalized data makes your data easier to understand and makes costly mistakes harder to commit.


There are multi-billion-dollar firms whose entire business model depends on storing and accessing data, so it is worth looking at how they store it:

Note

Long ago, smart people figured out a fundamental principle of database design: that the physical structure of a database should communicate its logical structure.

Let’s put that into practice. First, we need to agree on three things:

  1. Every database contains “elements” or “observations” (the rows) and “variables” (the columns).

  2. Definition: The “key” is the variable(s) that uniquely identifies a row in the table.

    • If your dataset is about different firms and has one observation per firm, GM is the key for the observation about General Motors.

    • If your dataset contains info about firms over multiple years (a “firm-year” dataset), (GM, 2000) is the key for the observation for GM in 2000.

    • Notice that the “key” can be multiple variables, like the firm and the year in the prior sentence.

  3. Definition: The “unit” of a dataset refers to the level of aggregation of variables in the dataset. Examples:

    • Firm-level data: Industry of a firm

    • Firm-year data: Profits

    • Firm-day data: Firm returns

  • Daily data (or any other time interval): Market returns

    • Country-year: GDP

    • Country-level data: Continent
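If your data lives in a pandas DataFrame (a sketch, assuming pandas; the firms and numbers below are made up), the "key" concept maps naturally onto the index:

```python
import pandas as pd

# A small firm-year dataset: the key is (Firm, Year)
df = pd.DataFrame(
    {
        "Firm": ["GM", "GM", "Ford", "Ford"],
        "Year": [2000, 2001, 2000, 2001],
        "Profits": [5.0, 5.5, -4.0, 7.5],
    }
)

# Setting the key as the index makes the dataset's unit explicit
df = df.set_index(["Firm", "Year"])

# Look up one observation by its key
print(df.loc[("GM", 2000), "Profits"])  # -> 5.0
```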

2.5.1. Rules for storing data

  1. Rule: Keys should never be missing! (Rule 4.A) If one is, how can you say what that row is about?

  2. Rule: Keys should never be duplicated. If any keys are duplicated, there is a problem with your dataset or you misunderstand what the unit is and what variable(s) constitute the key.

  3. Rule: Variables in a normalized dataset are at the unit level of the dataset, and not other unit levels.


Rule 3 means that if the key is a county, and the unit is the county in 2010 (a cross-section at a single point in time), then the variables you can have in that dataset are attributes of the county as of 2010.

If you have variables about state populations, they should be in a different database (the state database).

If you have variables about the county at many points in time (county GDP for each year), those should be in a different database (the county-year database).
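Rules 1 and 2 are easy to check programmatically. Here is a minimal sketch in pandas (the helper name `is_valid_key` and the data are hypothetical):

```python
import pandas as pd

def is_valid_key(df, key):
    """Hypothetical helper: check rules 1 and 2 for a candidate key."""
    no_missing_keys = df[key].notna().all().all()             # Rule 1: no missing keys
    no_duplicate_keys = not df.duplicated(subset=key).any()   # Rule 2: no duplicated keys
    return bool(no_missing_keys and no_duplicate_keys)

firm_year = pd.DataFrame(
    {"Firm": ["GM", "GM", "Ford"], "Year": [2000, 2001, 2000], "Profits": [5, 5.5, -4]}
)
print(is_valid_key(firm_year, ["Firm", "Year"]))  # True: (Firm, Year) is a valid key

# The same kind of data with a duplicated (GM, 2000) row fails the check
bad = pd.DataFrame({"Firm": ["GM", "GM"], "Year": [2000, 2000], "Profits": [5, 6]})
print(is_valid_key(bad, ["Firm", "Year"]))  # False
```

Running checks like these every time you build a dataset catches unit-level misunderstandings before they become costly mistakes.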

2.5.2. Normalized data in practice

Try to create and store data that is normalized. But notice how Golden rule 4.B is phrased: “Keep data normalized as far into your code pipeline as you can.”

I doubt you will ever run a serious regression or analysis on completely normalized data. For example, asset pricing tests require the firm’s return that day (which is a variable whose unit is a firm-day) and the market return (which is a variable whose unit is a day).

So, at some point, you will combine datasets with different unit levels. Do your best to delay that (4.B) until the beginning of the analysis.
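When that moment comes, a merge combines the unit levels. A sketch of the asset-pricing example above, assuming pandas (the returns are made up):

```python
import pandas as pd

# Firm-day data: one return per firm per day
firm_day = pd.DataFrame(
    {
        "Firm": ["GM", "Ford", "GM", "Ford"],
        "Date": ["2001-01-02", "2001-01-02", "2001-01-03", "2001-01-03"],
        "Ret": [0.010, -0.020, 0.005, 0.030],
    }
)

# Daily data: one market return per day
daily = pd.DataFrame(
    {"Date": ["2001-01-02", "2001-01-03"], "MktRet": [0.002, -0.001]}
)

# Combine only when the analysis begins; validate="m:1" makes pandas raise
# an error if the daily table turns out not to be one row per day
analysis = firm_day.merge(daily, on="Date", how="left", validate="m:1")
```

The `validate` argument is worth the keystrokes: it enforces your understanding of each dataset's unit at the exact moment you combine them.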

2.5.3. Example: Fixing non-normalized data

So, as a little example, here is a bad firm-year dataset:

| Firm | Year | Profits | First_year | IndustryDef | SubIndustryDef | Profits_Ind |
|------|------|---------|------------|-------------|----------------|-------------|
| GM   | 2000 | 5       | 1908       | 1           | 1A             | 1           |
| GM   | 2001 | 5.5     |            | 1           | 1A             | 13          |
| Ford | 2000 | -4      | 1903       | 2           | 1A             | 1           |
| Ford | 2001 | 7.5     | 1903       | 2           | 1A             | 13          |
| Ford | 2002 | 8       | 1903       | 2           | 1A             | 23          |
|      | 2002 | 15      | 1903       | 2           | 1A             | 23          |

What a mess!

  • What is the last observation about?

  • How is GM missing a “first year” in one observation?

  • GM and Ford are in the same sub-industry, but in different industries?

  • You’ll have to keep track of what Profits and Profits_Ind mean. (And which industry level are profits at?)

Beyond those problems, this dataset has variables at THREE unit levels:

  1. Firm-year variables (profits in a given year for a given firm)

  2. Firm-level variables (first year the firm existed)

  3. Industry-year variables (industry profits)

To save this data in a normalized form, we need to save three datasets. Check out how I did that:

The firm-year dataset:

| Firm | Year | Profits |
|------|------|---------|
| GM   | 2000 | 5       |
| GM   | 2001 | 5.5     |
| Ford | 2000 | -4      |
| Ford | 2001 | 7.5     |
| Ford | 2002 | 8       |

The firm-level dataset:

| Firm | First_year | IndustryDef | SubIndustryDef |
|------|------------|-------------|----------------|
| GM   | 1908       | 1           | 1A             |
| Ford | 1903       | 1           | 1A             |

The industry-year dataset:

| IndustryDef | Year | Profits |
|-------------|------|---------|
| 1           | 2000 | 1       |
| 1           | 2001 | 13      |
| 1           | 2002 | 23      |

Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, and each firm has a founding year. In fact, these three databases are self-documenting! (Rule 6.C) Isn’t that nice!?
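If the messy table arrives as one pandas DataFrame, splitting it into the three normalized datasets might look like this (a sketch: the data errors in the bad table are assumed already corrected to match the normalized tables above):

```python
import pandas as pd

# The messy table, with variables at three different unit levels
messy = pd.DataFrame(
    {
        "Firm": ["GM", "GM", "Ford", "Ford", "Ford"],
        "Year": [2000, 2001, 2000, 2001, 2002],
        "Profits": [5, 5.5, -4, 7.5, 8],
        "First_year": [1908, 1908, 1903, 1903, 1903],
        "IndustryDef": [1, 1, 1, 1, 1],
        "SubIndustryDef": ["1A", "1A", "1A", "1A", "1A"],
        "Profits_Ind": [1, 13, 1, 13, 23],
    }
)

# Firm-year dataset: key = (Firm, Year)
firm_year = messy[["Firm", "Year", "Profits"]]

# Firm-level dataset: key = Firm, so drop duplicates to get one row per firm
firm = messy[["Firm", "First_year", "IndustryDef", "SubIndustryDef"]].drop_duplicates()

# Industry-year dataset: key = (IndustryDef, Year)
industry_year = (
    messy[["IndustryDef", "Year", "Profits_Ind"]]
    .drop_duplicates()
    .rename(columns={"Profits_Ind": "Profits"})
)
```

Each resulting dataset has exactly one unit level, and its key uniquely identifies every row.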