2.3. Good Data is Normalized
Normalized data means the dataset has distinct “keys”, and every variable in the dataset is an attribute of the key’s unit level.
There are multi-billion-dollar firms whose entire business model depends on storing and accessing data. Therefore, we should apply our rule and see how they store data:
Long ago, smart people figured out a fundamental principle of database design: that the physical structure of a database should communicate its logical structure.
How does that work?
Every database contains “elements” or “observations” (rows) and “variables” (columns).
The “key” is the variable(s) that identifies a row in the table.
- `GM` is the key for the observation about General Motors in a dataset whose unit is a firm.
- `(GM, 2000)` is the key for the observation for GM in 2000 in a firm-year dataset.

Keys should never be missing! (Rule 4.A) If one is, how can you say what that row is?
Keys should never be duplicated. If one is, your database won’t have a logical structure any more.
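Both rules are easy to check mechanically. Here is a minimal sketch in pandas (`df`, `Firm`, and `Year` are hypothetical stand-ins, not anything defined on this page):

```python
import pandas as pd

# hypothetical firm-year data with one duplicated key and one missing key
df = pd.DataFrame({'Firm':    ['GM', 'GM', 'Ford', None],
                   'Year':    [2000, 2000, 2000, 2001],
                   'Profits': [5, 5, -4, 7.5]})

key = ['Firm', 'Year']

# Rule 4.A: keys should never be missing
print(df[key].isna().any(axis=1).sum(), 'rows with a missing key')

# Keys should never be duplicated
print(df.duplicated(subset=key).sum(), 'rows whose key already appeared')
```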
Closely related to “keys” is the idea of “units”, which refers to the level of aggregation in the dataset.
Variables in the dataset should apply as properties of the “unit” of the “key”. In other words: variables are attributes of the key.
County population is a property of a county, so it lives in the county database, not the state database. State population is a property of a state, so it cannot live in the county table.
**Implication: You need one dataset for each unit level.** For example:

- One firm-level dataset and one firm-year dataset.
- One county dataset and one state dataset.
If you have distinct keys in your dataset, and all variables apply to the unit of the key, your dataset is said to be “normalized”. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes. (See the example below)
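A rough way to audit this in pandas is sketched below (the column names anticipate the example in the next section, but the pattern is general): a column that never varies within the key is probably an attribute of a higher unit level and belongs in that table instead.

```python
import pandas as pd

firm_year = pd.DataFrame({'Firm':       ['GM', 'GM', 'Ford', 'Ford'],
                          'Year':       [2000, 2001, 2000, 2001],
                          'Profits':    [5, 5.5, -4, 7.5],
                          'First_year': [1908, 1908, 1903, 1903]})

# for each column, the most distinct values any one firm has
variation = firm_year.drop(columns='Year').groupby('Firm').nunique().max()

# columns that are constant within every firm are firm-level
# attributes and belong in the firm-level dataset instead
print(variation[variation <= 1].index.tolist())  # ['First_year']
```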
I doubt you will ever run a serious regression or analysis on completely normalized data, so at some point you will have to combine the datasets. Do your best to delay that (Rule 4.B).
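When that time comes, combining normalized tables is just a sequence of merges, done as late in your pipeline as possible. A hedged sketch in pandas, using small stand-ins for the three tables built in the example below:

```python
import pandas as pd

firm_year  = pd.DataFrame({'Firm': ['GM', 'Ford'], 'Year': [2000, 2000],
                           'Profits': [5, -4]})
firm_level = pd.DataFrame({'Firm': ['GM', 'Ford'],
                           'First_year': [1908, 1903], 'IndustryDef': [1, 1]})
ind_year   = pd.DataFrame({'IndustryDef': [1], 'Year': [2000], 'Profits': [1]})

# merge only at analysis time; the suffix keeps the two profit
# variables distinguishable automatically
analysis = (firm_year
            .merge(firm_level, on='Firm', how='left')
            .merge(ind_year, on=['IndustryDef', 'Year'],
                   how='left', suffixes=('', '_Ind')))
```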
2.3.1. Example: Bad and Good Data
So, as a trivial example, here is a bad firm-year dataset:
| Firm | Year | Profits | First_year | IndustryDef | SubIndustryDef | Profits_Ind |
|------|------|---------|------------|-------------|----------------|-------------|
| GM   | 2000 | 5       | 1908       | 1           | 1A             | 1           |
| GM   | 2001 | 5.5     |            | 1           | 1A             | 13          |
| Ford | 2000 | -4      | 1903       | 2           | 1A             | 1           |
| Ford | 2001 | 7.5     | 1903       | 2           | 1A             | 13          |
| Ford | 2002 | 8       | 1903       | 2           | 1A             | 23          |
|      | 2002 | 15      | 1903       | 2           | 1A             | 23          |
What a mess! What is the last observation about? How is GM missing a “first year” in one observation? GM and Ford are in the same sub-industry, but different industries? You’ll have to keep track of what `Profits` and `Profits_Ind` mean. (And which industry level is `Profits_Ind` measured at?)
The main issue is that this dataset conflates firm-year variables (profits in a given year) with firm-level variables (the first year of the firm) and industry-year variables (industry profits in a given year). You can fix this by keeping three datasets. One firm-year dataset:
| Firm | Year | Profits |
|------|------|---------|
| GM   | 2000 | 5       |
| GM   | 2001 | 5.5     |
| Ford | 2000 | -4      |
| Ford | 2001 | 7.5     |
| Ford | 2002 | 8       |
… and a firm-level dataset:
| Firm | First_year | IndustryDef | SubIndustryDef |
|------|------------|-------------|----------------|
| GM   | 1908       | 1           | 1A             |
| Ford | 1903       | 1           | 1A             |
… and an industry-year dataset:
| IndustryDef | Year | Profits |
|-------------|------|---------|
| 1           | 2000 | 1       |
| 1           | 2001 | 13      |
| 1           | 2002 | 23      |
Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, and each firm has a founding year. In fact, these databases are self-documenting! (Rule 6.C) Isn’t that nice!?
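And if you inherit the messy table instead of the clean ones, you can normalize it yourself. Below is a rough pandas sketch, not a canned recipe: it assumes the messy data is loaded as `bad` with the column names above, and code alone cannot resolve the data errors (the GM/Ford industry disagreement is fixed by hand here, matching the clean tables above).

```python
import pandas as pd

# the messy table from above (the row with the missing Firm key is excluded)
bad = pd.DataFrame({
    'Firm':           ['GM', 'GM', 'Ford', 'Ford', 'Ford'],
    'Year':           [2000, 2001, 2000, 2001, 2002],
    'Profits':        [5, 5.5, -4, 7.5, 8],
    'First_year':     [1908, None, 1903, 1903, 1903],
    'IndustryDef':    [1, 1, 2, 2, 2],
    'SubIndustryDef': ['1A', '1A', '1A', '1A', '1A'],
    'Profits_Ind':    [1, 13, 1, 13, 23]})

# data errors need human judgment: here we assume, as the clean tables
# above do, that both firms really belong to industry 1
bad['IndustryDef'] = 1

# one table per unit level, carved out of the messy table
firm_year = bad[['Firm', 'Year', 'Profits']]

# .first() keeps the first non-missing value per firm, recovering
# GM's founding year from its 2000 row
firm_level = (bad[['Firm', 'First_year', 'IndustryDef', 'SubIndustryDef']]
              .groupby('Firm', as_index=False).first())

industry_year = (bad[['IndustryDef', 'Year', 'Profits_Ind']]
                 .drop_duplicates()
                 .rename(columns={'Profits_Ind': 'Profits'}))
```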