ChiMu  
 
Menu Edge About   Products   Services   Projects   Publications  
  Publications                 

Fourth normal form and normalization

Overview

This is a discussion of normalization, higher normal forms, and a specific example.

The following is a reference to the main thread-point that this short paper is on:

For context, the following URL contains the example discussed

Original Posting: Fourth normal form and normalization

Stephen Lindholm wrote:
> Now, it seems to me that 4th normal form is what any sensible person would
> implement.  It seems obvious that aggregating unrelated data (denormalizing?
> is that the word?) is a bad thing.  The Puppy example (as well as the
> example in Prince and Murach's book on Microsoft Access) seems contrived; no
> one would actually design a database like the denormalized, 1NF, and 2NF
> examples they give.

Actually, I think you will find many production tables similar to the 1NF, 2NF tables in that example, and even some equivalent to the denormalized version (although that is not as common a practice anymore). Consider a simple example of redundant data: an address with both a Zip Code and a City+State.

> I realize that given time constraints and the complexity of large projects,
> this isn't always possible, and it's good to have a yardstick to see what
> exactly we can do, and so we have 1st, 2nd, 3rd, and 4th NF.

I don't think time is usually a legitimate issue. Lack of normalization (where one made no attempt to produce a normalized model) is usually from habits with previous technology or laziness. Denormalization (where one knew the normalized model and choose a less normalized version) can be done for many reasons including technology performance, ease of supporting queries on base tables (instead of views), table management complexity, and uncertainty in the correctness of business rules [see below].

> But why would anyone implement 5th normal form?  It seems that it puts
> business rules into the design of a database.

Ah... but actually every normalization beyond 1st "puts business rules into the design of a database". The relationship of one attribute to another attribute is from a business/derivation rule. The process of normalization is to make sure everything is said once and only once, and only by knowing these derivation rules can you know which data is repetitive.

The problem with most descriptions of normalization is that they only talk in terms of the tables/relations themselves, from which they make standard assumptions of what business rules are likely to exist given the sample data and then show just the new tables. Instead a description of normalization should always show the set of rules that are currently known and use those constant rules to explain why data is redundant.

> In particular, in the Puppy
> example, my understanding of the 5NF example is that it loses the
> information for which breeder can supply which breed of puppy, and that this
> is acceptable only because a hypothetical law was passed that said that all
> breeders can supply all puppies....

The rule is that a kennel must buy the types of puppies it sells from all its supplying breeders (no exclusive deals to particular breeders). The example though is broken because it is missing the one piece of information you mentioned: what breeds a particular breeder sells. If you include this table:

   Breeder     Breed
   Acme        Spaniel
   ...
   Whatapuppy  Banana-Biter

You would actually get something like the first table (Kennel-Breeder-Breed) but without the 'Kennel' column. This makes the example difficult to follow (we now have three tables one of which is very much like the starting table). If they instead included multiple kennels in the starting table, it would be easier to follow:

   Kennel#   Breeder   Breed
   4         Acme      Spaniel
   4         Acme      Dachshund
   ...
   5         Acme      Spaniel
   5         Acme      Dachshund
   ...

The business rule of "no exclusive deals" makes the above relation less than 5NF and it should be split into its three atomic relations: "Suppliers" (Kennel#-Breeder), "Sells" (Kennel#-Breed), and "Produces" (Breeder-Breed).

Although 5NF examples can be "obscure" in any case, you might want to look at the one in C.J. Date's "An Introduction to Database Systems" instead of this particular example.

Summary

Returning to my main point: every normalization beyond 1st is only possible because of a business rule. A decision to not normalize a particular area may actually be the result of known or possible violations of a business rule: ZIP-codes do not always uniquely determine a city or exclusive deals for puppies may later be possible. That would be the conceptually good reason. The practical reasons of performance problems, habit, etc. are the more common.

> I've read that 3NF is what people strive for, and not
> 4NF.  Is that simply to keep the number of tables from exploding to an
> unmanageable number?

Truthfully, I think it is out of comfort with the meaning of 3NF vs. higher forms and familiarity with the name. The true 3NF is actually BCNF (Boyce-Codd Normal Form), but the differences among 3NF, BCNF, 4NF, and 5NF require more effort to learn.

 
Publications