ChiMu  
 
  Projects > MONDO                 

Recipes for Information

I suggest that SGML/XML be perceived as a markup language to describe how to build information instead of describing (and modeling) the information itself. This may appear to be a subtle distinction but it has a lot of implications.

I will start with a recent concrete example from Rick Jelliffe:

   <!ELEMENT citation   ( title, text, url)>  

This says a citation is composed of (through its content) a title, text, and url. But do not view that as the information model of a citation; consider it a recipe for a citation. We can build a citation if we supply the three (named) ingredients: title, text, and url. The detail of the resulting information (which I will call an object) is unknown. It is likely that the citation object will have these three attributes, but it could have more or it could even discard some of them (in which case the recipe included information that the model did not need).
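As a minimal sketch of this reading in Python: the element supplies the named ingredients, and a builder decides what object results. The `Citation` class, the `build_citation` function, the derived `host` field, and the sample values are all hypothetical and only illustrate that the object may keep, drop, or add attributes relative to the recipe.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Citation:
    # The object chooses its own attributes: it keeps title and url,
    # discards text, and adds a derived field the recipe never named.
    title: str
    url: str
    host: str


def build_citation(element: ET.Element) -> Citation:
    """Follow the recipe: gather the three named ingredients, then build."""
    ingredients = {child.tag: (child.text or "") for child in element}
    missing = {"title", "text", "url"} - ingredients.keys()
    if missing:
        raise ValueError(f"missing ingredients: {sorted(missing)}")
    url = ingredients["url"]
    return Citation(title=ingredients["title"], url=url, host=urlparse(url).netloc)


doc = ET.fromstring(
    "<citation><title>A title</title><text>Some text</text>"
    "<url>http://example.org/page</url></citation>"
)
citation = build_citation(doc)
```

Here the recipe demanded `text`, but the resulting object did not need it, which is the case described above where the recipe includes information the model discards.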

If we have a different element that requires more information we could have a different recipe:

       <!ELEMENT DetailedCitation   ( title, text, name, text, url )>

The object that results from this recipe might be the same type as a citation object, a subtype of the citation object (i.e. treatable as a citation object but with more capabilities), or even an unrelated type of object. For the moment we will refrain from discussing the objects that result from the DetailedCitation and Citation recipes [why I started capitalizing will be explained later too].

What about combining the two recipes into a single element? We could combine them as:

     <!ELEMENT Citation   ( ( title, text, url) | (title, text, name, text, url) )>
     <!ELEMENT Citation   ( title, (text, name)?, text, url  )>
     <!ELEMENT Citation   ( title,  text, (name, text)?, url  )>

The first two are ambiguous (in SGML terms), but all three are bad recipes. They are bad because we (or the computer) must look at all the content to know which version we are using. This is analogous to reading a whole recipe before we can be sure what we are trying to make. It would be better to separate the choice of an option more clearly from the requirements that follow once that option is chosen. Our original version separated the recipes through the elements:

       <!ELEMENT Citation   ( title, text, url)>
       <!ELEMENT DetailedCitation   ( title, text, name, text, url )>

We could also do this with:

       <!ELEMENT Citation     ( basicInfo & detailedInfo? )>
       <!ELEMENT basicInfo    ( title, text, url)>
       <!ELEMENT detailedInfo ( text, name)>

or:

       <!ELEMENT Citation     ( basic | detailed )>
       <!ELEMENT basic        ( title, text, url)>
       <!ELEMENT detailed     ( title, text, url, text, name)>

In these forms it is explicit what we are trying to build (or at least the complexity is dramatically reduced). We do not have to look into the details of the information itself and decode it.
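To illustrate with the `basic | detailed` form, here is a rough Python sketch in which the builder is chosen from the child element's name alone, without inspecting the content. The builder functions, the dictionary results, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_basic(element):
    # Hypothetical builder: a basic citation as a plain dict.
    return {"kind": "basic", "title": element.findtext("title")}


def build_detailed(element):
    # Hypothetical builder: a detailed citation also records the name.
    return {
        "kind": "detailed",
        "title": element.findtext("title"),
        "name": element.findtext("name"),
    }


# The element name alone selects the recipe.
BUILDERS = {"basic": build_basic, "detailed": build_detailed}


def build_citation(citation):
    # Citation has exactly one child, and its tag names the recipe to use.
    (choice,) = list(citation)
    return BUILDERS[choice.tag](choice)


doc = ET.fromstring(
    "<Citation><detailed><title>T</title><text>a</text><url>u</url>"
    "<text>b</text><name>N</name></detailed></Citation>"
)
result = build_citation(doc)
```

With the combined content models shown earlier, the same decision would require scanning the children for an optional `name` before knowing which object to build.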

RECIPES

Now I will ask for a leap of faith.

Consider separating ELEMENTs between Recipes that build objects and Parameters that name the ingredients that are required for a particular recipe. As an architectural-form it would look like this:

   <!ELEMENT   Recipe      (parameter)*>
   <!ELEMENT   parameter   (Recipe)>

Although in the content model parameters are sequential, their order is insignificant semantically. Each parameter must have a unique name, so consider them to be and-ed together instead of seq-ed. Sort of like:

   <!ELEMENT   Recipe      (parameter)&*>

or like required element attributes.

As a convention I will capitalize the Recipes and keep parameters in lowercase. Returning to our example, building a Citation requires three parameters:

       <!ELEMENT Citation   ( title & text & url)>

The original ordering of the parameters is irrelevant to the informational content because each parameter is uniquely named; having them be sequential is only a presentation/encoding restriction. Also, the parameters do not describe the Types of the ingredients, just their Roles in the recipe. All of 'title', 'text', and 'url' could be simple strings:

       <!ELEMENT title    (String)>
       <!ELEMENT text     (String)>
       <!ELEMENT url      (String)>
       <!ELEMENT String   (#PCDATA)*>

Or any of them could have a more complex type. By separating the two types of elements we can:

  • Be very explicit about what we are constructing
  • Have a great deal of flexibility for reuse of elements
  • Use very simple content models that produce complex structures

Note that although the '&' is considered complex to implement, this particular use of it has the same form as attributes: Parameters are unordered and possibly required.
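The attribute-like behaviour of parameters can be sketched in Python by collecting them into a mapping keyed by name, so that document order drops out. The `parameters` function and the sample values are hypothetical; the element names follow the declarations above.

```python
import xml.etree.ElementTree as ET


def parameters(recipe):
    """Collect a Recipe's parameters by name, rejecting duplicates."""
    collected = {}
    for parameter in recipe:
        if parameter.tag in collected:
            raise ValueError(f"duplicate parameter: {parameter.tag}")
        # Each parameter here wraps a String, as in the declarations above.
        collected[parameter.tag] = parameter.findtext("String")
    return collected


first = ET.fromstring(
    "<Citation><title><String>T</String></title>"
    "<text><String>X</String></text>"
    "<url><String>U</String></url></Citation>"
)
second = ET.fromstring(
    "<Citation><url><String>U</String></url>"
    "<title><String>T</String></title>"
    "<text><String>X</String></text></Citation>"
)
# Both orderings carry the same information.
same = parameters(first) == parameters(second)
```

The duplicate check plays the role of the unique-name rule: like attributes, a parameter name may appear at most once in a Recipe.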

Shortcuts

You might have noticed that String cheats: a String does not follow the required Recipe pattern of having only parameters in its content. This is a convenience shortcut Recipe [OK, and an insanity prevention device], which makes putting strings of text into this format easier. Similarly we will probably need to have a shortcut for Lists (sequences) of objects:

       <!ELEMENT List     (Recipe)*>

With these additions we have to modify our original description of the architectural-form of Recipes to:

   <!ELEMENT   Recipe       (parameter)*>
   <!ELEMENT   StringRecipe (#PCDATA)*>
   <!ELEMENT   ListRecipe   (Recipe)*>
   <!ELEMENT   parameter    (Recipe | StringRecipe | ListRecipe )>
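One possible reading of this architectural form is a small recursive builder, sketched here in Python. It assumes the shortcut Recipes are literally named `String` and `List`, represents every other Recipe as a `(name, fields)` tuple, and uses a hypothetical `Bibliography` document; none of these choices come from the form itself.

```python
import xml.etree.ElementTree as ET


def build(recipe):
    """Build a value from a Recipe element (capitalized tag)."""
    if recipe.tag == "String":
        # StringRecipe shortcut: character data only.
        return recipe.text or ""
    if recipe.tag == "List":
        # ListRecipe shortcut: an ordered sequence of Recipes.
        return [build(item) for item in recipe]
    # Ordinary Recipe: each child is a parameter wrapping exactly one Recipe.
    fields = {}
    for parameter in recipe:
        (inner,) = list(parameter)
        fields[parameter.tag] = build(inner)
    return (recipe.tag, fields)


doc = ET.fromstring(
    "<Bibliography><entries><List>"
    "<Citation><title><String>T</String></title>"
    "<url><String>U</String></url></Citation>"
    "</List></entries></Bibliography>"
)
built = build(doc)
```

A real builder would presumably map each Recipe name to a constructor for a domain object rather than returning tuples; the tuples only keep the sketch independent of any particular model.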

Recipes, DTDs, and DomainModels

Each Recipe builds an object. What is the type of this object and how does it relate to the ELEMENT content model? I propose (and agree with others proposing) that there should be no required connection between the rules of a recipe (the DTD) and the rules of the DomainModel objects built from that recipe. Objects can have far more complex relationship rules than DTDs can describe and the DTD will either over-constrain or under-constrain the built objects.

Instead consider the DTD as similar to a UI Form. You may want to place things in a particular order and group them together:

    Person
      FirstName   LastName
      SSN
      Children
          FirstName  LastName

But this is a presentation of the (view independent) information model, which has a person with several attributes and associations in no particular order (even children do not need to be explicitly ordered, because orderings can be derived from [for example] the child's birthdate). The UI/DTD can place constraints (for example, that an SSN has a "123-45-6789" format), but it should be very careful about these constraints (what about "99-" SSNs?) or simply delegate the responsibility of validation to the DomainModel. But simplified views are still useful.
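Delegating validation to the DomainModel might look like the following Python sketch, where the form (or recipe) passes raw strings through and the object owns the rule. The `Person` class, the `person_from_form` helper, and the SSN pattern are assumptions for illustration only; the pattern is a placeholder, not a statement of which SSN formats are actually valid.

```python
import re
from dataclasses import dataclass

# Placeholder rule, kept in one place so the model can refine it later
# (for instance, to admit special cases a simple format check would reject).
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


@dataclass
class Person:
    first_name: str
    last_name: str
    ssn: str

    def __post_init__(self):
        # Validation lives with the domain object, not with the recipe.
        if not SSN_PATTERN.fullmatch(self.ssn):
            raise ValueError(f"invalid SSN: {self.ssn!r}")


def person_from_form(fields):
    # The form only names the ingredients; it imposes no format of its own.
    return Person(fields["FirstName"], fields["LastName"], fields["SSN"])


person = person_from_form(
    {"FirstName": "Ann", "LastName": "Lee", "SSN": "123-45-6789"}
)
```

If the rule changes, only the domain object changes; the form and any documents written against it stay as they are.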

DTDs can still be used to produce an information model, but it should be possible to unlink the information model and have it start a more robust life of its own (or to have the dependency reversed). The Recipes should still be useful because they encode the knowledge required to build the information independently of how precisely or extensively it is modeled (up to a point). The recipes can live on as the model grows.

And, in a strange circularity, information models are also (obviously) information so they can again be encoded as recipes in SGML/XML and used as metadata for the domain model. So although DTDs are not good information models, there is nothing stopping SGML/XML from being a good encoding for good information models.

--Mark
mark.fussell@chimu.com

 