2.3. Good Data is Normalized
Normalized data means the dataset has distinct “keys”, and every variable in the dataset is an attribute of the key’s unit level.
There are multi-billion-dollar firms whose entire business model depends on storing and accessing data. Therefore, we should apply our rule and see how they store data:
Long ago, smart people figured out a fundamental principle of database design: that the physical structure of a database should communicate its logical structure.
How does that work?
Every database contains “elements” or “observations” (rows) and “variables” (columns).
The “key” is the variable(s) that identifies a row in the table.
- `GM` is the key for the observation about General Motors in a dataset whose unit is a firm.
- `(GM, 2000)` is the key for the observation for GM in 2000 in a firm-year dataset.

Keys should never be missing! (Rule 4.A) If one is, how can you say what that row is?
Keys should never be duplicated. If one is, your database won’t have a logical structure any more.
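Both rules are easy to check mechanically. Here is a minimal sketch in pandas (`df`, `Firm`, and `Year` are hypothetical stand-ins, not anything defined on this page):

```python
import pandas as pd

# hypothetical firm-year data with one duplicated key and one missing key
df = pd.DataFrame({'Firm':    ['GM', 'GM', 'Ford', None],
                   'Year':    [2000, 2000, 2000, 2001],
                   'Profits': [5, 5, -4, 7.5]})

key = ['Firm', 'Year']

# Rule 4.A: keys should never be missing
print(df[key].isna().any(axis=1).sum(), 'rows with a missing key')

# Keys should never be duplicated
print(df.duplicated(subset=key).sum(), 'rows whose key already appeared')
```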
Closely related to “keys” is the idea of “units”, which refers to the level of aggregation in the dataset.
Variables in the dataset should apply as properties of the “unit” of the “key”. In other words: variables are attributes of the key.
County population is a property of a county, so it lives in the county database, not the state database. State population is a property of a state, so it cannot live in the county table.
**Implication: You need one dataset for each unit level.** For example:

- One firm-level dataset and one firm-year dataset.
- One county dataset and one state dataset.
If you have distinct keys in your dataset, and all variables apply to the unit of the key, your dataset is said to be “normalized”. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes. (See the example below)
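A rough way to audit this in pandas is sketched below (the column names anticipate the example in the next section, but the pattern is general): a column that never varies within the key is probably an attribute of a higher unit level and belongs in that table instead.

```python
import pandas as pd

firm_year = pd.DataFrame({'Firm':       ['GM', 'GM', 'Ford', 'Ford'],
                          'Year':       [2000, 2001, 2000, 2001],
                          'Profits':    [5, 5.5, -4, 7.5],
                          'First_year': [1908, 1908, 1903, 1903]})

# for each column, the most distinct values any one firm has
variation = firm_year.drop(columns='Year').groupby('Firm').nunique().max()

# columns that are constant within every firm are firm-level
# attributes and belong in the firm-level dataset instead
print(variation[variation <= 1].index.tolist())  # ['First_year']
```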
I doubt you will ever run a serious regression or analysis on completely normalized data, so at some point you will have to combine the datasets. Do your best to delay that (Rule 4.B).
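When that time comes, combining normalized tables is just a sequence of merges, done as late in your pipeline as possible. A hedged sketch in pandas, using small stand-ins for the three tables built in the example below:

```python
import pandas as pd

firm_year  = pd.DataFrame({'Firm': ['GM', 'Ford'], 'Year': [2000, 2000],
                           'Profits': [5, -4]})
firm_level = pd.DataFrame({'Firm': ['GM', 'Ford'],
                           'First_year': [1908, 1903], 'IndustryDef': [1, 1]})
ind_year   = pd.DataFrame({'IndustryDef': [1], 'Year': [2000], 'Profits': [1]})

# merge only at analysis time; the suffix keeps the two profit
# variables distinguishable automatically
analysis = (firm_year
            .merge(firm_level, on='Firm', how='left')
            .merge(ind_year, on=['IndustryDef', 'Year'],
                   how='left', suffixes=('', '_Ind')))
```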
2.3.1. Example: Bad and Good Data
So, as a trivial example, here is a bad firm-year dataset:
| Firm | Year | Profits | First_year | IndustryDef | SubIndustryDef | Profits_Ind |
|------|------|---------|------------|-------------|----------------|-------------|
| GM   | 2000 | 5       | 1908       | 1           | 1A             | 1           |
| GM   | 2001 | 5.5     |            | 1           | 1A             | 13          |
| Ford | 2000 | -4      | 1903       | 2           | 1A             | 1           |
| Ford | 2001 | 7.5     | 1903       | 2           | 1A             | 13          |
| Ford | 2002 | 8       | 1903       | 2           | 1A             | 23          |
|      | 2002 | 15      | 1903       | 2           | 1A             | 23          |
What a mess! What is the last observation about? How is GM missing a “first year” in one observation? GM and Ford are in the same sub-industry, but different industries? You’ll have to keep track of what `Profits` and `Profits_Ind` mean. (And which industry level is `Profits_Ind` measured at?)
The main issue is that this dataset conflates firm-year variables (profits in a given year) with firm-level variables (the first year of the firm) and industry-year variables (industry profits in a given year). You can fix this by keeping three datasets. One firm-year dataset:
| Firm | Year | Profits |
|------|------|---------|
| GM   | 2000 | 5       |
| GM   | 2001 | 5.5     |
| Ford | 2000 | -4      |
| Ford | 2001 | 7.5     |
| Ford | 2002 | 8       |
… and a firm-level dataset:
| Firm | First_year | IndustryDef | SubIndustryDef |
|------|------------|-------------|----------------|
| GM   | 1908       | 1           | 1A             |
| Ford | 1903       | 1           | 1A             |
… and an industry-year dataset:
| IndustryDef | Year | Profits |
|-------------|------|---------|
| 1           | 2000 | 1       |
| 1           | 2001 | 13      |
| 1           | 2002 | 23      |
Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, and each firm has a founding year. In fact, these databases are self-documenting! (Rule 6.C) Isn’t that nice!?
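And if you inherit the messy table instead of the clean ones, you can normalize it yourself. Below is a rough pandas sketch, not a canned recipe: it assumes the messy data is loaded as `bad` with the column names above, and code alone cannot resolve the data errors (the GM/Ford industry disagreement is fixed by hand here, matching the clean tables above).

```python
import pandas as pd

# the messy table from above (the row with the missing Firm key is excluded)
bad = pd.DataFrame({
    'Firm':           ['GM', 'GM', 'Ford', 'Ford', 'Ford'],
    'Year':           [2000, 2001, 2000, 2001, 2002],
    'Profits':        [5, 5.5, -4, 7.5, 8],
    'First_year':     [1908, None, 1903, 1903, 1903],
    'IndustryDef':    [1, 1, 2, 2, 2],
    'SubIndustryDef': ['1A', '1A', '1A', '1A', '1A'],
    'Profits_Ind':    [1, 13, 1, 13, 23]})

# data errors need human judgment: here we assume, as the clean tables
# above do, that both firms really belong to industry 1
bad['IndustryDef'] = 1

# one table per unit level, carved out of the messy table
firm_year = bad[['Firm', 'Year', 'Profits']]

# .first() keeps the first non-missing value per firm, recovering
# GM's founding year from its 2000 row
firm_level = (bad[['Firm', 'First_year', 'IndustryDef', 'SubIndustryDef']]
              .groupby('Firm', as_index=False).first())

industry_year = (bad[['IndustryDef', 'Year', 'Profits_Ind']]
                 .drop_duplicates()
                 .rename(columns={'Profits_Ind': 'Profits'}))
```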