Fourth normal form and normalization
Overview
This is a discussion of normalization, higher normal forms, and a specific
example.
The following is a reference to the main thread-point that this short paper is on:
For context, the following URL contains the example discussed
Original Posting: Fourth normal form and normalization
Stephen Lindholm wrote:
> Now, it seems to me that 4th normal form is what any sensible person would
> implement. It seems obvious that aggregating unrelated data (denormalizing?
> is that the word?) is a bad thing. The Puppy example (as well as the
> example in Prince and Murach's book on Microsoft Access) seems contrived; no
> one would actually design a database like the denormalized, 1NF, and 2NF
> examples they give.
Actually, I think you will find many production tables similar to the
1NF, 2NF tables in that example, and even some equivalent to the
denormalized version (although that is not as common a practice
anymore). Consider a simple example of redundant data: an address with
both a Zip Code and a City+State.
> I realize that given time constraints and the complexity of large projects,
> this isn't always possible, and it's good to have a yardstick to see what
> exactly we can do, and so we have 1st, 2nd, 3rd, and 4th NF.
I don't think time is usually a legitimate issue. Lack of normalization
(where one made no attempt to produce a normalized model) is usually
from habits with previous technology or laziness. Denormalization
(where one knew the normalized model and choose a less normalized
version) can be done for many reasons including technology performance,
ease of supporting queries on base tables (instead of views), table
management complexity, and uncertainty in the correctness of business
rules [see below].
> But why would anyone implement 5th normal form? It seems that it puts
> business rules into the design of a database.
Ah... but actually every normalization beyond 1st "puts business rules
into the design of a database". The relationship of one attribute to
another attribute is from a business/derivation rule. The process of
normalization is to make sure everything is said once and only once, and
only by knowing these derivation rules can you know which data is
repetitive.
The problem with most descriptions of normalization is that they only
talk in terms of the tables/relations themselves, from which they make
standard assumptions of what business rules are likely to exist given
the sample data and then show just the new tables. Instead a
description of normalization should always show the set of rules that
are currently known and use those constant rules to explain why data is
redundant.
> In particular, in the Puppy
> example, my understanding of the 5NF example is that it loses the
> information for which breeder can supply which breed of puppy, and that this
> is acceptable only because a hypothetical law was passed that said that all
> breeders can supply all puppies....
The rule is that a kennel must buy the types of puppies it sells from
all its supplying breeders (no exclusive deals to particular breeders).
The example though is broken because it is missing the one piece of
information you mentioned: what breeds a particular breeder sells. If
you include this table:
Breeder Breed
Acme Spaniel
...
Whatapuppy Banana-Biter
You would actually get something like the first table
(Kennel-Breeder-Breed) but without the 'Kennel' column. This makes the
example difficult to follow (we now have three tables one of which is
very much like the starting table). If they instead included multiple
kennels in the starting table, it would be easier to follow:
Kennel# Breeder Breed
4 Acme Spaniel
4 Acme Dachshund
...
5 Acme Spaniel
5 Acme Dachshund
...
The business rule of "no exclusive deals" makes the above relation less
than 5NF and it should be split into its three atomic relations:
"Suppliers" (Kennel#-Breeder), "Sells" (Kennel#-Breed), and "Produces"
(Breeder-Breed).
Although 5NF examples can be "obscure" in any case, you might want to
look at the one in C.J. Date's "An Introduction to Database Systems"
instead of this particular example.
Summary
Returning to my main point: every normalization beyond 1st is only
possible because of a business rule. A decision to not normalize a
particular area may actually be the result of known or possible
violations of a business rule: ZIP-codes do not always uniquely
determine a city or exclusive deals for puppies may later be possible.
That would be the conceptually good reason. The practical reasons of
performance problems, habit, etc. are the more common.
> I've read that 3NF is what people strive for, and not
> 4NF. Is that simply to keep the number of tables from exploding to an
> unmanageable number?
Truthfully, I think it is out of comfort with the meaning of 3NF vs.
higher forms and familiarity with the name. The true 3NF is actually
BCNF (Boyce-Codd Normal Form), but the differences among 3NF, BCNF, 4NF,
and 5NF require more effort to learn.
|