2.3. Good Data is Normalized

Normalized data means the dataset has distinct “keys”, and every variable in the dataset is measured at the key’s unit level.


There are multi-billion-dollar firms whose entire business model depends on storing and accessing data. Therefore, we should apply our rule and see how they store data:

Long ago, smart people figured out a fundamental principle of database design: that the physical structure of a database should communicate its logical structure.

How does that work?

  1. Every database contains “elements” or “observations” (rows) and “variables” (columns).

  2. The “key” is the variable(s) that identifies a row in the table.

    • GM is the key for the observation about General Motors in a dataset whose unit is a firm.

    • (GM, 2000) is the key for the observation for GM in 2000 in a firm-year dataset.

    • Keys should never be missing! (Rule 4.A) If one is, how can you say what that row is?

    • Keys should never be duplicated. If one is, your database won’t have a logical structure any more.

  3. Closely related to “keys” is the idea of “units”, which refers to the level of aggregation in the dataset.

    • Variables in the dataset should be properties of the key’s unit. In other words: variables are attributes of the key.

    • County population is a property of a county, so it lives in the county dataset, not the state dataset. State population is a property of a state, so it cannot live in the county dataset.

    • **Implication:** You need one dataset for each unit level. For example:

      • One firm-level dataset and one firm-year dataset.

      • One county dataset and one state dataset.

If you have distinct keys in your dataset, and all variables apply to the unit of the key, your dataset is said to be “normalized”. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes. (See the example below)
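If you want to enforce these two key rules in code, here is a minimal sketch, assuming pandas and a made-up firm-year DataFrame (the names below are illustrative, not from any particular data source):

```python
import pandas as pd

# Hypothetical firm-year dataset; "Firm" + "Year" together form the key
df = pd.DataFrame({
    "Firm":    ["GM", "GM", "Ford"],
    "Year":    [2000, 2001, 2000],
    "Profits": [5.0, 5.5, -4.0],
})

key = ["Firm", "Year"]

# Keys should never be missing (Rule 4.A)
assert df[key].notna().all().all(), "At least one key value is missing"

# Keys should never be duplicated
assert not df.duplicated(subset=key).any(), "At least one key is duplicated"
```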

I doubt you will ever run a serious regression or analysis on completely normalized data. So at some point you will combine the datasets. Do your best to delay that (4.B).
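When you do combine them, make the merge an explicit, last-minute, validated step. Here is a sketch of that idea, again assuming pandas and two hypothetical normalized datasets:

```python
import pandas as pd

# Hypothetical normalized inputs: one firm-year dataset, one firm-level dataset
firm_year = pd.DataFrame({
    "Firm":    ["GM", "GM", "Ford"],
    "Year":    [2000, 2001, 2000],
    "Profits": [5.0, 5.5, -4.0],
})
firm_level = pd.DataFrame({
    "Firm":       ["GM", "Ford"],
    "First_year": [1908, 1903],
})

# Combine only when the analysis needs both unit levels; validate="m:1"
# makes pandas raise an error if the firm-level keys are not unique
analysis_sample = firm_year.merge(firm_level, on="Firm", how="left", validate="m:1")
```

Keeping the merge explicit (and validated) at analysis time means the normalized source datasets stay clean.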

2.3.1. Example: Bad and Good Data

So, as a trivial example, here is a bad firm-year dataset:

| Firm | Year | Profits | First_year | IndustryDef | SubIndustryDef | Profits_Ind |
|------|------|---------|------------|-------------|----------------|-------------|
| GM   | 2000 | 5       | 1908       | 1           | 1A             | 1           |
| GM   | 2001 | 5.5     |            | 1           | 1A             | 13          |
| Ford | 2000 | -4      | 1903       | 2           | 1A             | 1           |
| Ford | 2001 | 7.5     | 1903       | 2           | 1A             | 13          |
| Ford | 2002 | 8       | 1903       | 2           | 1A             | 23          |
|      | 2002 | 15      | 1903       | 2           | 1A             | 23          |

What a mess! What is the last observation about? How is GM missing a “first year” in one observation? GM and Ford are in the same sub-industry, but different industries? You’ll have to keep track of what Profits and Profits_Ind mean. (And which industry level are profits at?)

The main issue is that this dataset conflates firm-year variables (profits in a given year), firm-level variables (the first year of the firm), and industry-year variables (industry profits in a given year). You can fix this by keeping three datasets, starting with a firm-year dataset:

| Firm | Year | Profits |
|------|------|---------|
| GM   | 2000 | 5       |
| GM   | 2001 | 5.5     |
| Ford | 2000 | -4      |
| Ford | 2001 | 7.5     |
| Ford | 2002 | 8       |

… and a firm-level dataset:

| Firm | First_year | IndustryDef | SubIndustryDef |
|------|------------|-------------|----------------|
| GM   | 1908       | 1           | 1A             |
| Ford | 1903       | 1           | 1A             |

… and an industry-year dataset:

| IndustryDef | Year | Profits |
|-------------|------|---------|
| 1           | 2000 | 1       |
| 1           | 2001 | 13      |
| 1           | 2002 | 23      |

Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, each firm has a founding year and industry, and each industry-year has a profit level. In fact, these databases are self-documenting! (Rule 6.C) Isn’t that nice!?
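To tie this back to Rule 4.B, here is a sketch (pandas again, using the numbers from the tables above) of storing the three normalized datasets separately and rebuilding an analysis-ready firm-year table only when you need it:

```python
import pandas as pd

# The three normalized datasets from the example above
firm_year = pd.DataFrame({
    "Firm":    ["GM", "GM", "Ford", "Ford", "Ford"],
    "Year":    [2000, 2001, 2000, 2001, 2002],
    "Profits": [5.0, 5.5, -4.0, 7.5, 8.0],
})
firm = pd.DataFrame({
    "Firm":           ["GM", "Ford"],
    "First_year":     [1908, 1903],
    "IndustryDef":    [1, 1],
    "SubIndustryDef": ["1A", "1A"],
})
industry_year = pd.DataFrame({
    "IndustryDef": [1, 1, 1],
    "Year":        [2000, 2001, 2002],
    "Profits":     [1, 13, 23],
})

# Rebuild an analysis-ready firm-year table on demand:
# firm-level attributes attach on Firm, industry-year profits on (IndustryDef, Year).
# suffixes=("", "_Ind") keeps firm-year Profits as-is and labels industry profits clearly.
analysis = (
    firm_year
    .merge(firm, on="Firm", how="left", validate="m:1")
    .merge(industry_year, on=["IndustryDef", "Year"], how="left",
           validate="m:1", suffixes=("", "_Ind"))
)
```

The result has the same columns as the “bad” table above, but the mistakes in it (the missing firm name, the missing founding year, the inconsistent industry codes) have nowhere to hide, because each fact is stored exactly once at its own unit level.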