Missing Information
Now we can deal with representing the lack of knowledge (or missing information) in a relational database. We have already seen what a database does not know: everything it has not been told it does know. So we dont have a problem of telling the database: "You dont know this", we have a problem of telling a database "You only know this".
Fully Determined Relations
This is the technique of using only fully determined relations [ See [McGoveran 94c] and [Date 94b].] : We can add any relations needed to express what we know, but each should only include exactly what we know about those tuples. If this approach can suitably handle all our needs for representing missing information then we should not add another mechanism to the relational model. That would be adding complexity without any new expressiveness.To try out this technique, what if we dont know where a particular supplier is located? So far our only base relation is S which has a predicate that statesS.2: There exists a supplier with identifier S#, who has the name SName, who is located only in city City, and
Since we dont know where the supplier is located we cant use this relation. No problem, we can add a new relation to our database.
S_NoCity : S_NoCity.1
S_NoCity.1: There exists a supplier with identifier S#, who has the name SName, and
And if we add two entries to S_NoCity we get:
S_NoCity : S_NoCity.1
|
S#
|
SName
|
|
|
S3
|
DuPont
|
|
|
S5
|
Grid
|
| We now have two relation variables that each express exactly what we know about their tuples. Everything in fully determined. One of the relations has more information than the other and that is the only sense in which we are "missing" information. The S_NoCity relation is "half-full" so it could also be considered "half-empty".
So how does this new relation and the new tuples affect our last two "tough" questions?
Q2.2: What suppliers can be proven to be in London?
Q3.2: What suppliers can be proven not to be located in London?
The new tuples dont affect the queries at all. Since the S_NoCity predicate doesnt mention cities in any way, it can not affect the outcome of any query that mentions a city. It has nothing to do with it.
On the other hand, the easy question:
Q1.2: What names can be proven to be the name of a supplier?
Should now return:
|
SName |
|
Jones |
|
Smith |
|
DuPont |
|
Eiffel |
|
Grid | The problem is that asking this question of the database is now a bit cumbersome for the user. We need to take the two base relations, project SName, and union the results together. Not terribly difficult, but this requires a user to always remember the two tables when querying. We could also define the following derived relation value (i.e. View) to make this particular query easier for the user.
S_All = f(S,S_NoCity)
|
S#
|
SName
|
|
|
S1
|
Jones
|
|
|
S2
|
Smith
|
|
|
S3
|
DuPont
|
|
|
S4
|
Eiffel
|
|
|
S5
|
Grid
|
|
S_All.1: There exists a supplier with identifier S#, who has the name SName, and
The user can now query from S_All when asking about names, use S when asking about cities, and use S_NoCity when asking about
hmm
that isnt completely clear from our predicates. Why would we use S_NoCity instead of S_AllNames? What distinguishing trait places a tuple in S_NoCity?
Predicates that specify known "unknowns"
So far we have three predicates (renaming them slightly):
S_City.2: {S#, SName, City,
} There exists a supplier with identifier S#, who has the name SName, who is located only in city City, and
S_NoCity.1: {S#, SName,
} There exists a supplier with identifier S#, who has the name SName, and
S_AllNames.1: {S#, SName,
} There exists a supplier with identifier S#, who has the name SName
We appear to have been a bit vague about the meaning of putting a tuple into S_NoCity: why is the set of cities in S_NoCity different from the set in S_AllNames? We can now decide what "unknown" information we want to represent in our database by specifying what the known information means. We could decide that putting a city into S_NoCity means that we dont know the city for that supplier, or we could mean that the supplier is a multi-national corporation that is located in multiple cities, or any number of other meanings. For a fuller discussion of the different possible meanings of missing information see [McGoveran 94b]. If we needed more than one meaning we would create multiple relations and variables. For example we could have:
S_UnknownCity
S_MultiNational
We would then need to have S_AllNames be a function of all three relations to union them together.
At the moment we choose to simply correct the predicate of S_NoCity:
S_NoCity.2: There exists a supplier with identifier S#, who has the name SName, and
and for which we do not know the city it is located in.
This new version of the predicate allows us to ask the question:
Q4.1: What suppliers can be proven to be among those that we dont know their location?
It has no impact on the results of our previous questions:
Q1.2: What names can be proven to be the name of a supplier?
Q2.2: What suppliers can be proven to be in London?
Q3.2: What suppliers can be proven not to be located in London?
Complexity
All these tables and views add complexity to the database and make life more difficult for the user and also for the manager of these base and derived relations. This is with just one "optional" attribute. We will get combinatorial explosions of relations if for some suppliers we knew the City but not the Name, and some we knew the previous months purchase-quota, and so on. We get an even greater explosion if we need to have multiple meanings for the missing information.
Summary
The approach in this section correctly handled lack of information by adding new predicates that contain only the information we know. We didnt need to add any new mechanisms to relational algebra. We did need to specify our predicates precisely enough to identify why tuples are placed into one predicate over another. This precision is also what enables a user to know what the results of a query mean.
Unfortunately, managing all the new predicates proved more difficult than we might desire and the resulting scheme may be difficult for a user to understand. It would be good if there was a mechanism that can ease this management and simplify the scheme, but it should have as little impact as possible on relational algebra.
|