2.5. Good Data is Normalized¶
Important
Normalized data means
The dataset has distinct “Keys”
Variables in the dataset are at the key’s unit-level
Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.
There are multi-billion dollar firms whose entire business model depends on storing and accessing data. Therefore we should apply our rule and see how they store data:
Note
Long ago, smart people figured out a fundamental principle of database design: that the physical structure of a database should communicate its logical structure.
Let’s put that into practice. First, we need to agree on three things:
Every database contains “elements” or “observations” (the rows) and “variables” (the columns).
Definition: The “key” is the variable(s) that uniquely identifies a row in the table.
If your dataset is about different firms and has one observation per firm,
GM
is the key for the observation about General Motors.If your dataset contains info about firms over multiple years (a “firm-year” dataset),
(GM, 2000)
is the key for the observation for GM in 2000.Notice that the “key” can be multiple variables, like the firm and the year in the prior sentence.
Definition: The “unit” of a dataset refers to the level of aggregation of variables in the dataset. Examples:
Firm-level data: Industry of a firm
Firm-year data: Profits
Firm-day data: Firm returns
Daily-data (or any time interval): Market returns
Country-year: GDP
Country-level data: Continent
2.5.1. Rules for storing data¶
Rule: Keys should never be missing! (Rule 4.A) If one is, how can you say what that row is about?
Rule: Keys should never be duplicated. If any keys are duplicated, there is a problem with your dataset or you misunderstand what the unit is and what variable(s) constitute the key.
Rule: Variables in a normalized dataset are at the unit level of the dataset, and not other unit levels.
Rule 3 means that if the key is a county, and the unit is the county in 2010 (a cross-section at a single point in time), then the variables you can have in that dataset are attributes of the county as of 2010.
If you have variables about state populations, they should be in a different database (the state database).
If you have variables about the county at many points in time (county GDP for each year), those should be in a different database (the county-year database).
2.5.2. Normalized data in practice¶
Try to create and store data that is normalized. But notice how Golden rule 4.B is phrased: “Keep data normalized as far into your code pipeline as you can.”
I doubt you will ever run a serious regression or analysis on completely normalized data. For example, asset pricing tests require the firm’s return that day (which is a variable whose unit is a firm-day) and the market return (which is a variable whose unit is a day).
So, at some point, you will combine datasets with different unit levels. Do your best to delay that (4.B) until the beginning of the analysis.
2.5.3. Example: Fixing non-normalized data¶
So, as a little example, here is a bad firm-year dataset:
Firm |
Year |
Profits |
First_year |
IndustryDef |
SubIndustryDef |
Profits_Ind |
---|---|---|---|---|---|---|
GM |
2000 |
5 |
1908 |
1 |
1A |
1 |
GM |
2001 |
5.5 |
1 |
1A |
13 |
|
Ford |
2000 |
-4 |
1903 |
2 |
1A |
1 |
Ford |
2001 |
7.5 |
1903 |
2 |
1A |
13 |
Ford |
2002 |
8 |
1903 |
2 |
1A |
23 |
2002 |
15 |
1903 |
2 |
1A |
23 |
What a mess!
What is the last observation about?
How is GM missing a “first year” in one observation? GM and Ford are in the same sub-industry, but different industries?
You’ll have to keep track of what
Profits
andProfits_Ind
mean. (And which industry level are profits at?)
Beyond those problems, this dataset has variables at THREE unit levels:
Firm-year variables (profits in a given year for a given firm)
Firm-level variables (first year the firm existed)
Industry-year variables (industry profits)
To save this data in a normalized form, we need to save three datasets. Check out how I did that:
Firm |
Year |
Profits |
---|---|---|
GM |
2000 |
5 |
GM |
2001 |
5.5 |
Ford |
2000 |
-4 |
Ford |
2001 |
7.5 |
Ford |
2002 |
8 |
Firm |
First_year |
IndustryDef |
SubIndustryDef |
---|---|---|---|
GM |
1908 |
1 |
1A |
Ford |
1903 |
1 |
1A |
IndustryDef |
Year |
Profits |
---|---|---|
1 |
2000 |
1 |
1 |
2001 |
13 |
1 |
2002 |
23 |
Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, and each firm has a founding year. In fact, these three databases are self-documenting! (Rule 6.C) Isn’t that nice!?