Data & Knowledge Engineering 11 (1993) 299-316 North-Holland
299
DATAK 184
Refinin.g imprecise data by integrity constraints* Frank S.C. Tseng a, A r b e e L.P. Chen b'* and Wei-Pang Yang a aDepartment of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu 30050, Taiwan bDepartment of Computer Science, National Tsing Hua University, Hsinchu 30043, Taiwan Abstract Uncertain data in databases were originally denoted as null values, which represent the meaning of 'values unknown at present.' Null values were generalized into partial values, which correspond to a set of possible values, to provide a more powerful notion. In this paper, we derive some properties to refine partial values into more informative ones. In some cases, they can even be refined into definite values. Such a refinement is possible when there exist range constraint on attribute domains, or referential integrities, functional dependencies, or multivalued dependencies among attributes. Our work actually eliminates redundant elements in a partial value. By this process, we not only provide a more concise and informative answer to users, but also speedup the computation of queries issued afterward. Besides, it reduces the communication cost when imprecise data are requested to be transmitted from one site to another site in a distributed environment.
Keywords. Imprecise data; partial values; refinement; referential integrity; functional dependency; multivalued dependency; multidatabase systems.
1. Introduction
Uncertain data exist in the real world. Null values were originally adopted to represent the meaning of 'values unknown at present' in database systems. Codd [4] pioneers the work on extended relational algebra to manipulate null values. From then on, incomplete information in relational databases have been extensively studied [9, 14, 11, 3, 15, 16, 18]. The update semantics with null values in relational databases have been discussed in [1, 2]. Codd [5, 6] distinguishes null values into applicable and inapplicable null values. Besides, the relationship between null values and functional dependencies has been studied in [12, 13, 24, 25]. For a concise review of handling null values by algebraic approaches, refer to [17]. The null value concept has been generalized to the concept of partial values by Grant [10]. Instead of being treated as an atomic value, an attribute value in a table is considered as a nonempty subset of the corresponding domain. A partial value in Grant [10] is represented as an interval such that exactly one of the values in the interval is the 'true' value of the partial value. In our work, however, a partial value is considered to correspond to a finite set of possible values such that exactly one of the values in that set is 'true.' That is, our work considers partial values of discrete domains while Grant's work considers partial values of * Corresponding author. Fax: 886 35 723694, email:
[email protected] * This research was partially supported by the Republic of China National Science Council under Contract No. NSC 82-0408-E-007-162. 0169-023X/93/$06.00 © 1993- Elsevier Science Publishers B.V. All rights reserved
300
Refining imprecise data by integrity constraints
continuous domains. Therefore, an applicable null value is a partial value which corresponds to the whole domain. Besides, partial values are also referred to as exclusive disjunctive data in [20]. DeMichiel [7] employed partial values for resolving domain mismatch problems in multidatabase systems, in which an algebraic approach for operating on partial values is proposed. Ola [20] presents an approach to processing relations containing exclusive disjunctive data. Besides, we discuss the elimination of redundant partial values in [21] and generalize partial values into probabilistic partial values in [22] to provide a probabilistic approach to query processing in heterogeneous database systems. The aggregation functions over partial values are studied in [23]. Lipski [14] presents a general discussion on manipulating imprecise information including partial values. In this paper, we study the problem of incrementally refining applicable null values into partial values. In some cases, partial values can be further refined into definite values. We utilize the integrity constraints to do the refinement, including range constraints of attribute domains, and referential integrities, functional dependencies and multivalued dependencies among attributes. To see such a refinement is possible, we give an example in the following. Suppose we have the relation SF(Son, Father) containing two t u p l e s - namely, (s3, f5) and ([s3, s4], f 6 ) - w h e r e the first tuple means that the father of 's3' is definitely 'f5' and the second tuple denotes the son of 'f6' can be either 's3' or 's4' (a partial value). However, since it is impossible for a child to have two or more fathers, we obtain that Son functionally determines Father. Therefore, we know that the son of 'f6' cannot be 's3', otherwise 's3' will have two fathers. Since there is only one candidate remaining (i.e. 's4'), we can make sure that the son of 'f6' is definitely 's4'. Thus, the content of SF(Son, Father) can be refined into {(s3, fS), (s4, f6)}. By acquiring more data into a database, some partial values can be refined into definite data, which may in turn be used to refine other partial values, and so forth. Thus, the refinement process can be applied incrementally. This is similar to a detective handling a murder case. As more and more clues are being revealed, the detective can refine a set of suspects into a smaller set of suspects and, when possible, finally expose the murderer. Our work actually eliminates redundant elements in a partial value. By this process, we not only provide a more concise and informative answer to the user, but also speedup the computation of queries. Besides, it reduces the communication cost when imprecise data are requested to be transmitted from one site to another site in a distributed environment. That is, our work also contributes to the query optimization in a distributed database system. We state the basic concepts and assumptions in Section 2. In Section 3, we elaborate on the utilization of range constraints, referential integrities, functional dependencies, and multivalued dependencies to refine partial values into definite values or more informative partial values. (A partial value becomes more informative if its possible values can be restricted to a smaller subset.) Section 4 states how to implement the developed properties by rules to enforce integrity constraints. Section 5 demonstrates an example to illustrate the refinement process by rules. In Section 6, we discuss an application of partial v a l u e s resolving semantic discrepancies in multidatabase systems - which is also a practical implication of our work. Finally, we conclude our work in Section 7. 2. Basic concepts and definitions
2.1 Basic concepts of partial values Partial values model data imprecision in databases in the sense that, for an imprecise datum, its true value can be restricted in a specific set of possible values [7] or an interval of
F.S.C. Tseng et al.
301
values [10]. In o u r w o r k , a partial value is r e p r e s e n t e d by a set of possible values, in which exactly o n e of the value is true. In D e M i c h i e l [7], a partial value is d e n o t e d as '0 = [al, a 2 . . . . , a , ] , w h e r e exactly o n e of the values a~ is true. A function u on "0 m a p s the partial value to its c o r r e s p o n d i n g finite set of possible values; that is, u('0) = {al, a 2 , . . . , an}. Recall that an applicable null value [5], bl, can be c o n s i d e r e d as a partial value with u(b¢)= D, w h e r e D is the whole d o m a i n . Besides, a relation containing tuples with partial values is called a partial relation [7]. O t h e r w i s e , it will be called a definite relation. E x c e p t explicitly specified, a relation is c o n s i d e r e d to be partial hereafter. In the following, we will use "0 and u('0) i n t e r c h a n g e a b l y w h e n it does not cause confusion. F o r e x a m p l e , v E "0 if v E u('0). T h e cardinality of a partial value "0 is defined as lu('0)1 in [7]. W h e n the cardinality of a partial value equals to 1, i.e. there exists only one possible value, say d, in the partial value, t h e n the partial value [d] actually c o r r e s p o n d s to the definite value d. O n the o t h e r h a n d , a definite value d can be r e p r e s e n t e d as a partial value [d]. Besides, a partial value with cardinality g r e a t e r than 1 is r e f e r r e d to as a proper partial value in [7]. F o r any two p r o p e r partial values, say "01 and '02, "01 ~ "02 even if u('01 ) = / / ( ' 0 2 ) . This is because the true value of "01 m a y not be the s a m e as the true value of '02.
2.2 Alternate worlds of a set of partial values F o r a set of partial values @, we m a y e n u m e r a t e all the possible cases that qo r e p r e s e n t s by the following definitions. Definition 2.1. A n interpretation, 0 / = (al, a 2 . . . . . am), of a set of partial values, {'01, "02, • • •, "on}, is an assignment of values f r o m qb such that a i E '0i, 1 ~< i ~< m. B y Definition 2.1, for a set of partial values q~ = V('0m ) is the set of all interpretations of ~. Definition 2.2. For an i n t e r p r e t a t i o n 0 / = (a 1, a 2 {'01,7/2 . . . . . "0m}, the value set of 0/ is d e n o t e d S o
{ ' 0 1 , "02 . . . . .
. . . . .
=
"ore},
/J('01)
X
/'1('1'12)X
q)=
" ""
X
am) of a set of partial values • = Ul~i
Definition 2.3. F o r all interpretations, o/j, l<~]<~p, p = 1"011x 1,21 x . . . x 1"oral, of a set of partial values qO = { ' 0 1 , ' 0 2 , ' - ' , ' 0 m } , the family of value sets of qb is d e n o t e d ~ ( ~ ) = U t~j~p{S~). If q~ = 0 then define ~ ( q ~ ) = 0. T h e following e x a m p l e illustrates these definitions. E x a m p l e 2.1. S u p p o s e there is a set of partial values • = {[a, c], [a, b]}, then t h e r e are four interpretations, a , = (a, a ) ,
a 2 = (a, b ) ,
a 3 = (c, a ) ,
T h e c o r r e s p o n d i n g value sets are S,~, = {a} U {a} = { a } , S,~2 = {a} U {b} = {a, b } , S,~3 = {c} U {a} = {a, c } , S,,, = {c} U {b} = {b, c } .
and
and
0/4 =
(C, b ) .
Refining imprecise data by integrity constraints
302
T h e r e f o r e , the family of value sets of tb is
~ ( ~ ) = {G,} u {G=} u {so,} u {so,} = {{a}, {a, b}, {a, c}, {b, c}}. 2.3 Alternate worlds of a partial relation W h e n a set of attributes {A1, A ~ , . . . , Am} is considered as a composite attribute A ~A 2 " " A m , its values can be c o m p u t e d f r o m the values in this set of attributes as follows.
A1
A2
""
A,,~
7"]11
7121
" " "
?]ml
1112
?]22
" " "
?]m2
:
:
"..
?]In
?12n
" "
A:A2...Am 7111 X 7/21 X . . . X rim I 7112 X I]2 2 X ' ' .
<-----4,
X 7Ira2
:
"
Ill n X q2n X " ' " X 7Iran
?]ntn
T h e × d e n o t e s the cartesian product of partial values, which is defined as follows.
Definition 2.4. T h e cartesian product r/a × r/b of the partial values r/a = [a~, a 2 . . . . a,,] and T~b ~ [ b l , b E . . . . . bn] is the partial value ~LXb with UO?a×b) being a set of the o r d e r e d pairs (a i, bj) for every a i E r/, and b / E r/b. W e regard ('O~j, rtej . . . . , ~mj) and 771 j X "1'~j2 X • • • X T~rnjas semantically-equivalent because if the ' t r u e ' value of (~?~i, r/2J . . . . , ~?mi) is the m - t u p l e (a~, a 2 , . . . , am) , a i E r/o, w h e r e a~ is the ' t r u e ' value of ~?gj, then the ' t r u e ' value of ~/1/x 7/2: x • • • × rimj is also (a~, a 2, . . . , am), and vice versa. E x a m p l e 2.2. C o n s i d e r the following data in the set of attributes {A1,
At
A2}
A2
:
[2,3]
c
We can regard {A1, A2} as a c o m p o s i t e attribute A1A 2 and c o m p u t e its data as
AIA2 [(1, a ) , (1, b)]
[(2, c),(a c)] A partial relation r(A~, A 2 , . . . , A m ) therefore can be r e g a r d e d as a relation r ( A 1 A 2 . " Am) with the c o m p o s i t e attribute A 1 A 2 • • " A m. T h a t is, a partial relation r can be considered as a set of partial values. T h e r e f o r e , ~ ( r ) can be defined accordingly. This is illustrated by the following e x a m p l e . E x a m p l e 2.3. Consider the following relation r(X, Y):
303
F.S.C. Tseng et al. 7"
X a
Y y
.
[x,y]
We can regard {X, Y} as a composite attribute X Y and compute r to be r
XY
[(.,y)] [(.,x), (a,u)] T h e n there are two interpretations, a I and a2, of r: a I = ((a, y), (a, x))
and
a z = ((a, y), (a, y ) ) .
Their corresponding value sets are S,~ = {(a, y)} U {(a, x)} = {(a, y), (a, x)}
and
S~2 = {(a, y)} U {(a, y)} = {(a, y ) } .
T h e r e f o r e , the family of value sets for r is ~-(r) = {S,,} U {S,2 } = {{(a, y), ( a , x ) } , {(a, y ) } } , which corresponds to the following two definite relations: SOt2
So,1
X
Y
X
Y
a
y
a
y
a
x]
2.4 Basic definitions of our approach In our approach, we first define the legality of a partial relation and then define the condition for proper partial values to be refined into more informative ones. Some properties will be developed to do such refinement in the next section. For a definite relation r', we say r' is legal if r' satisfies all the integrity constraints. By regarding a partial relation as a set of partial values, we define the legality of a partial relation as follows. Definition 2.5. A partial relation r is said to be legal if at least one of the elements in ~:(r), which is a definite relation, does not violate any integrity constraint. The following definition gives the condition to refine a proper partial value.
304
Refining imprecise data by integrity constraints
Fig. 1. The pictorial representation of o ~ v*.
Definition 2.6. For a proper partial value v of a relation r, we say that v can be refined into its refinement v*, denoted v~-, v*, if substituting v' (u(v') = v(v) - u(v*)) for v in r will cause r to be illegal. That is, the refinement process preserves the legality of a partial relation. We can illustrate Definition 2.6 by using the Venn diagram to represent a partial value. In Fig. 1, if substituting the shadow area for v in its corresponding partial relation r will cause r to be illegal, then v can be refined into smaller subset v*. To refine partial relations, we can distinguish three kinds of restrictions on relations: (1) Restrictions that depend on the semantics or possible range of domain elements. For example, it is reasonable to restrict the values of an age attribute to fall into the range [0, 120]. Such restrictions are referred to as range constraints. (2) Referential integrities. If relation r 2 includes a foreign key F k matching the primary key Pk of another relation r 1, then 'referential integrity' states that every value of F k in r 2 must either • be equal to the value of Pk in some tuple of r 1, or • be wholly null. (3) Restrictions by functional dependencies or multivalued dependencies. These three kinds of restrictions can be obtained from the integrity constraints of a database schema. We will utilize these constraints to do the refinement of a partial relation. O u r work assumes original relations to be refined are all legal. For a partial relation r, a brute force way to do the refinement would be to compute the set ~ ( r ) , check the legality of each element in o~(r), delete those which are illegal, and finally reconstruct the refined partial relation. However, such a computation would be uneconomical since the computational complexity of o%(r) is exponential. In the following, we derive some properties to do such refinement.
3. Refining a partial relation by integrity constraints It is straight-forward to refine an applicable null value into a partial value corresponding to the restricted range by employing the range constraints. In the following, we will study how to utilize referential integrities, functional dependencies and multivalued dependencies to refine a partial value.
3.1 Refining a partial relation by referential integrities By allowing a foreign key to accept applicable null values, we may use referential integrity constraints to refine an applicable null value in a foreign key. We have the following theorem.
F.S.C. Tseng et al.
305
T h e o r e m 3.1. For a relation r E with a foreign key F k matching the primary key Pk of another
relation rl, if a ~ r2, then a . F k ~ , u * , where t,(u*)= v(a.Fk) f3 [-J ,erl v([3"Pk)" Proof. To show a.Fk,~, u*, where v ( u * ) = ~,(a.Fk)fl {,..J~e~, v(fl.Pk), we have to show that
substituting u', where v(u') = ~,(a.Fk)- v(u*), for a . F k will cause r 2 to be illegal. By the definition of u', we obtain .(u') =
.(u*)
= lY(ol. Fk) -- (1.t(ol,Fk) ~I
UoErl V([3.P~))
=
v(a.Fk) - (u(a.Fk) - (v(a.Fk) -
=
u(a.Fk)-
U~er,
v([3.Pk)))
v(fl.nk).
U~zr,
That is, for each/3 E r 1, u(u') U v(fl.Pk) = 0. Therefore, when substituting u' for a . F k in r 2, the 'true' value of u' never agrees with the 'true' value of [3"Pk, Vfl E r~, which violates the referential integrity constraints and implies that r is illegal. Hence, a. F k ~ u*, and our proof completes. [] Example 3.1. Suppose we have the suppliers-and-parts database as shown in Fig. 2. If the
attributes Shipments.sno and Shipments.pno are two foreign keys that match the primary keys Suppliers.sno and Parts.pno, respectively, then, by Theorem 3.1, we have a.sno ~ , u*, where l,(u*) = {s2, s4} N {sl, s2, s3} = {s2}. Similarly, [3.pno,~'~v*, where 1,(v*) = {p2, p5) f3 {pl, p2, p3} = {p2}.
3.2 Refining a partial relation by functional dependencies For a definite relation r(S), where S denotes the set of attributes of r with X, Y E S, we say r satisfies the functional dependency X--+ Y if
(Va E r)(Vfl E r ) ( a . X
=
[3.X ~ a . Y
=
(1)
[3.Y) .
Equivalently, that means for any two tuples a, /3 E r, (2)
a . Y # [3. Y ~ a . X # f l . X .
If a . X , a. Y, fl.X, and [3. Y are all definite values then no refinement is needed for a and/3. Otherwise, we have the following properties to refine a a n d / o r / 3 if X--> Y holds. T h e o r e m 3.2. For any two tuples, say a and fl, in a partial relation r(S) with X, Y ~ S, if X--> Y and a . X = f l . X then a . Y , fl.Y,,~,u*, where v(u*) = u(cr.Y) N v([3. Y).
Suppliers
sno sl s2 s3
name Tom Paul Jim
city Tokyo Paris London
Parts pno color pl p2 1)3
red blue green
Fig. 2. An example database for Example 3.1.
Shipments sno pno OL
Is2, s4] s3
p3 [p2, pS]
306
Refining imprecise data by integrity constraints
Proof. Recall that any two proper partial values cannot be equal even if they correspond to the same set of possible values. Therefore, by a . X = / 3 . X , we know a . X a n d / 3 . X are all definite and represent the same value. By X--> Y, we know that a. Y should agree with/3. Y in their 'true' values; otherwise, r is illegal. To show a . Y , , ~ u * , where v ( u * ) = u ( a . Y ) n v(/3. Y), we have to show that when substituting u', where v(u') = v(a. Y ) - v(u*), for a. Y then r becomes illegal. By the definition of u', we obtain v(u') = v ( a . Y ) = v(a.Y)-
v(u*) (v(a.Y)n
v(/3. Y))
= v(a. Y ) - (v(a. Y ) - (v(a. Y ) - v(/3. Y))) = v ( a . Y ) - v(/3. Y ) .
That is, v(u') n v(/3. Y) = 0. Therefore, when substituting u' for a. Y in r, the 'true' values of u' and /3. Y never agree, which implies r becomes illegal. Hence, a . Y , , ~ u * , where v(u*) = v(a. Y ) n v(/3. Y). Similarly, /3. Y~,--v*, where v(v*) = v ( a . Y ) n v(/3. Y), and our proof completes. [] Example 3.2. Suppose we have a relation SF(Son, Father) containing two tuples a and 13 as shown in Fig. 3. Since it is impossible for a child to have two or more fathers, we know that Son---~Father. Therefore, by Theorem 3.2, a.Father,/3.Father~,~,u *, where v ( u * ) = {Alex, Peter, Richard} n {Richard, Mike} = {Richard}. Hence, u* = [Richard] = Richard. That is, John's father is Richard, definitely. Notice that after the refinement there are duplicate tuples, which can be eliminated. Theorem 3.2 can easily be generalized as follows. Theorem 3.3. For any n tuples, say a t , a 2, . . . , a n , in a partial relation r(S) with X , Y E S, where X--> Y, and a l . X = a2.X . . . . . a n . X , then ai.Y~,'.v* , for i = 1 . . . . . n, where /--'(V*) = nl<
Y, we obtain that a . X cannot agree with /3.X in their 'true' values; otherwise, r is illegal. SF Son
Father
John John
[Alex, P e t e r , R i c h a r d ] [ R i c h a r d , Mike]
Fig. 3. Example relation SF for Example 3.2.
F. S.C. Tseng et al.
307
When/3.X is a definite value d, to show a.X,,~, u*, where u(u*) = v ( a . X ) - {d}, we have to show that when substituting u', where g(u') = u(a.X) - J,(u*), for a . X then r becomes illegal. By u ( u ' ) = u ( a . X ) - u ( u * ) = u ( a . X ) - ( u ( a . X ) - { d } ) = {d}, we obtain u' is a definite value d. However, if we replace a . X by u', we get f l . X = a . X = d, which means r is illegal. Therefore, a . X , , ~ u*, where u(u*) = u ( a . X ) - {d}. The proof for/3.X,,,* v% where u(v*) = u(/3.X) - {d}, can be obtained similarly and our proof completes. [] T h e o r e m 3.5. I f / 3 is a tuple in a partial relation r(S) with X , Y ~ S, where X--~ Y, and there are also n tuples in r, say tll, a 2 , . . . , Otn, where a i . X is a definite value d i and u ( a i . Y ) f3 u(/3. Y ) = O, for i = 1 . . . . , n, then we have/3.X^,-~v**, where u(v**) = u(/3.X) Ul<_i~n{di}
•
Proof. First of all, since u(%. Y ) fq u(/3. Y ) = O, for i = 1 , . . . , n, the 'true' value of /3. Y cannot equal to that of any %. Y. Therefore, by X---~ Y, u(v**) = v(/3.X) - U 1 ~ , {d~} ~ 0, otherwise the 'true' value o f / 3 . X must agree with some a i . X (i.e. d~), which means r violates X--~ Y and is illegal. Now, we prove this theorem by induction on n as follows. (Basis:) When n - - 1 , this theorem reduces to T h e o r e m 3.4. (Induction hypothesis:) Suppose when n = k - 1 , we have [3.X,,,',v*, where u ( v * ) = u( /3.X) - U l~i~k_l { di).
(Induction:) For n = k, by the induction hypothesis, we may replace/3.X by v*, where v(v*)= u ( / 3 . X ) - U l ~ k _ l ( d i } . Now, by considering only the tuples a k and /3, we can apply Theorem 3.4 to obtain v*,,'~ v**, where u(v**)
_-
u(v*)
-
{dk} =
That completes our proof.
( u(/3.X)-
U
l~i~k-I
{di}
) -- {dk}
---- P ( / 3 . X )
--
u
,<<-i<<-k
[]
T h e o r e m 3.5 is a generalization of T h e o r e m 3.4. We present an example as follows to illustrate this theorem. Example 3.3. Suppose we have a relation SF(Son, Father) shown in Fig. 4. As in Example 3.2, we have Son---~ Father. It is easy to verify that u(/3.Father) fq U(arFather ) = 0, for i = 1, 2 and 3. Therefore, by T h e o r e m 3.5, ~3.Son,,,* v% where
SF Son
Father
C~ 1
Don
C~ 2
J &n~les
[Alex, Peter] [Andy, Peter] [Chris, Clark] Richard
O~ 3
Paul [John, Don, J a m e s , Paul]
Fig. 4. Example relation SF for Example 3.3.
Refining imprecise data by integrity constraints
308
u(v*) = v(/3.Son)- U
1~
u(a,.Son)
= (John, Don, James, Paul} - {Don, James, Paul} = (John} . That is, the son of Richard is John, definitely.
3.3 Refining a partial relation by multivalued dependencies We say ' X multidetermines Y' [8] (or 'there is a multivalued dependency of Y on X ' ) , denoted X---~ ~ Y, in r(S), where X, Y ~ S, if a and/3 are two tuples of r and a . X = / 3 . X , then r also contains tuples 6 and ~0, where (1) ~0.x = ~ . x = a.x---/3.x. (2) qJ.Y= a . Y and q,.Z = / 3 . Z , where Z = S - (X, Y}. (3) q~.Y=/3.Y and ¢ . Z = a.Z, where Z = S - {X, Y}. Notice that when we regard a multivalued dependency X---~--* Y on a relation r(S), X, Y ~ S, we assume r has at least three attributes (i.e. IS[/> 3). Besides, it is already shown that, given the relation r(X, Y, Z), the multivalued dependency X---~ ~ Y holds if and 0nly if X---~ ~ Z holds [8]. Therefore, it is common to denote both multivalued dependencies in the single joint notation X---~ ~ YIZ. Theorem 3.6. For any two tuples, say a and/3, in a partial relation r(X, Y, Z) if the following
conditions hold:
(1) x - ~
vlz,
(2) a . X =/3.X, (3) u ( a . Y ) n v(/3.Y)=0 and v ( a . z ) n v(/3.Z)=0, (4) there exists only one tuple qJ E r, such that ~O.X= a . X =/3.X, v( ¢. Y) n v( a. Y) ~ O, and
,(q,.z) n ,(/3.z) ~0, (5) there exists only one tuple ~oE r, such that ~o.X = a . X =/3.X, v(~o.Y) n v( /3. Y) ~ O, and , ( ~ . z ) n ,,(,~.z) ~ 0 ,
then (1)
{ a.Y, O.Y,,~u*, where u(u*)= u ( a . Y ) n v(qJ.Y), and /3.Z, q~.Z,,~v*, where v(v*)= v(/3.Z)N u(qJ.Z).
(2)
{ ~3 Z, ~p.Z'~ u**, where v(u**) = v(a.Z) n v(cp.z), and IY, ¢ . Y ~ , v * * , where v(v**) v ( / 3 . Y ) n u(¢.Y).
Proof. Since u(a. Y) n v(/3. Y) = 0 and u(a. Z ) n u(/3. Z ) = 0, we know that the 'true' values of a. Y and a.Z never agree with those of /3. Y and /3.Z, respectively. Therefore, by X--~ --~ Y[Z and the fourth and fifth conditions, we know ~ and ~ are the only candidates such that O . X = ¢ . X = a . X = / 3 . X and the 'true' values of ~.Y, ~0.Y, ~.Z, and ~0.Z agree with those of a. Y,/3. Y,/3. Z, and a. Z, respectively. To show a.Y,,,,, u*, we have to show that when substituting u', where v(u')= v ( a . Y ) v(u*), for a. Y then r becomes illegal. Since
v(u') = v ( a . Y ) - v(u*) = v(a. Y) - (v(a. Y) n v($. Y)) = u(a. Y) - (v(a. Y) - (v(ot. Y) - v(~O.Y))) = v(a. Y) - v(qJ. Y ) ,
F.S.C. Tseng et al.
309
we obtain u(u') n u(~b. Y) = 0. Therefore, when substituting u' for a. Y in r, the 'true' values of u' and 41. Y never agree. Since ~ is the only candidate, we know there exist no more tuples to make X---~ ~ YIz hold, which implies r is illegal. The other case can be proved analogously. [] The following example illustrates this theorem. Example 3.4. Suppose we have the relation NCTU(Dept, Student, Required_Course) as depicted in Fig. 5. Besides, we assume Dept---~ ~ StudentIRequired_Course. By Theorem 3.6, we obtain (1) a.Student, ~b.Student ~ , u*, where
u(u*) = u(a.Student) n u(~O.Student) = {Frank} . (2) /3. Required_Course, ~b.Required_ Course ,~, o*, where
u(v *) = u(/3. Required_Course) n u( qJ.Required_ Course) = {Compiler, Graphics} n {Database, Compiler} = {Compiler} . (3) a. Required_Course, ~o.Required_Course ~ , u**, where
u( u** ) = u( a. Required_Course) n u( ~o.Required_Course) = {Database} n {Database, Algorithm} = {Database}. (4) [3.Student, ~.Student ~ , v**, where
u( v ** ) = u( ft. Student) n u( ~o.Student) = {Annie}. Therefore, the relation NCTU can be refined into the one depicted in Fig. 6.
a /3 ~b c2
Dept CS CS CS CS
NCTU Sludent Required_Course Frank Database Annie [Compiler, Graphics] Frank [Database, Compiler] [ Database, Algorithm] Annie
Fig. 5. Example relation NCTU for Example 3.4.
Refining imprecise data by integrity constraints
310
Dept
3
CS CS CS CS
NCTU Student Required_Course Frank Database Annie Compiler Frank Compiler Annie Database
Fig. 6. The refined relation NCTU.
Theorem 3.7. For any two tuples, say a and /3, in a partial relation r(X, Y, Z ) with X---L--~ Y I z , if (1) v(,~.Y)n v(/3.Y)=O, (2) v ( a . z ) n v ( / 3 . Z ) = 0 , and (3) r does not contain two tuples 0 and q~, such that ~b.Y N a. Y, ~. Z n/3. Z, q~.Y 71/3. Y, and ~ . Z n a . Z are all non-empty sets, then (1) w h e n / 3 . X is a definite value d, we obtain a.X~', u*, where v(u*) = v ( a . X ) - {d} (2) when a . X is a definite value d, we obtain/3.X,,-,v*, where v(v*)= v ( / 3 . X ) - {d}. Proof. By u(a. Y) (3 v(/3. Y) = 0 and v ( a . Z ) 0 v(/3.Z) = 0, we know if a . X = / 3 . X then r contains two tuples, 0 and q~, such that the 'true' values of ~b.Y, ¢.Z, q~.Y, and q~.Z must agree with those of a. Y,/3. Z,/3. Y, and a. Z, respectively. However, by the third condition, we know r does not contain such tuples, 4, and q~. Therefore, a . X ~ / 3 . X , otherwise r will violate the multivalued dependency X--+ --->YIz and become illegal. When/3.X is a definite value d, to show a.X,,~ u*, where v(u*) = v ( a . X ) - {d}, we have to show that when substituting u', where v(u') = v ( a . X ) - v(u*), for a . X then r becomes illegal. By u ( u ' ) = v ( a . X ) - v ( u * ) = v ( a . X ) - ( v ( a . X ) - { d } ) = {d}, we obtain u' is a definite value d. However, if we replace a. X by u', we get/3. X = a. X = d, which means r is illegal. Therefore, a.X,,,~ u*, where v(u*) = v ( a . X ) - {d}. The proof for/3.X ,,,~. v*, where v(v*)= v(/3.X) - {d}, can be proved similarly and our proof completes. [] Example 3.5. Suppose we have the relation NTHU(Dept, Student, Required_Course) as depicted in Fig. 7. We assume Dept--+---> StudentlRequired_Course. Since (1) U(al.Student ) fq v(fll.Student ) = O, (2) u(a 1. Required_Course) f) v(/31. Required_Course) = 0, and
o¢1 0/2
/31
Dept [EE, CS
cs] CS CS
NTHU Student Tom
Andy Frank Frank
Required_Course [Statistics, Calculus] [Statistics, Calculus] Database Compiler
Fig. 7. Example relation NTHU for Example 3.5.
F.S.C. Tseng et al.
311
NTHU Dept
Student
Required_Course
EE EE CS CS
Tom Andy Frank Frank
[Statistics, Calculus] [Statistics, Calculus] Database Compiler
Fig. 8. The refined relation NTHU. (3) NTHU does not contain two tuples ~ and ¢, such that • O.Student O ol1.Student ~ O, • ~b.Required_Course n/31. Required_Course ~ 0, • ~. Student n ill. Student # 0, and • ~.Required_Course n a 1.Required_Course ~ O, by Theorem 3.7, we obtain al.Dept,,~, u~, where v(u~) = v(%.Dept) - {CS} = {EE}. Similarly, we obtain a2.Dept~,u~, where u(u~)= v(aE.Dept ) - {CS}--(EE}. That is, the relation NTHU is refined into the one depicted in Fig. 8. Although we may define algorithms b a s e d on the above results for performing the refinement process, we propose to implement the refinement process by rules. Since a database usually contains a large amount of data, the rule implementation makes it easy to incrementally refine imprecise data into more informative ones. (To define algorithms for performing the 'refinements' is beyond the scope of our paper.) The refinement process deals with data which already exist in the database. That is, it handles imprecise data after the data are inserted. Besides, the integrity constraints can be naturally specified by rules. In the following, we will discuss how to use the properties developed in this section to enforce the integrity constraints. That is, imprecise data will be inspected before they are inserted into a database. Therefore, when imprecise data are inserted, updated, or enquired, the related rules will be triggered to enforce the integrity constraints and/or activate the refinement process.
4. Enforcement of integrity constraints by rules The properties developed in the last section can be specified by rules to enforce integrity constraints. We enumerate these rules as follows. Rule 4.1. By Theorem 3.1, for a relation r 2 includes a foreign key Fk matching the primary key Pk of another relation rl, a tuple a cannot be inserted into rE, if u ( a . F k ) n
U ~r2 ~(/3.Pk) = ~. Rule 4.2. By Theorem 3.2, a tuple cannot be inserted into a partial relation r(S) with X , Y ~ S and S ~ Y if (V/3 ~ r)(/3.X = a . X ^ v(a. Y) n u(/3. Y) = 0). Rule 4.3. By Theorem 3.6, for any two tuples, say a and t , in a partial relation r(X, Y, Z ) if
(1) x ~ ~ r l z , (2) ~ . x = /3.x, (3) v ( a . Y ) n v(/3. Y ) = 0 and v ( a . Z ) n v ( / 3 . Z ) = O, then
312
Refining imprecise data by integrity constraints
Supplier
Part
Supply
sno
name
city
pno
color
sno
pno
sl s2 s3 s4 s5
Tom Paul Jim Mike Don
Tokyo Paris London Taipei Berlin
pl 1)2 1)3 p4 p5
red blue green black yellow
[sl, s2, s3] s2 s4 [s4, s5, s6] sl
p3 [pl, p2, p51 [p4, p6, p7] pl [p2, p4]
7 6 £
Fig. 9. An example database for Example 5.1.
(1) when there exists no tuple qJE r, such that qJ.X n a . X #O, u( qJ.Z ) n v( fl. Z ) # O, a tuple ~0' satisfies qJ'.X = a . X = fl.X, u(qI.Z) n v([3.Z) ¢ 0 must be inserted into r, (2) when there exists no tuple ~ E r, such that q~.x n a . X # O, v(~o.Z) n v ( a . Z ) ~ ~, a tuple ~' satisfies ~ '.X = a . X = fl.X, v(q~'.Z) n u ( f l . Z ) ~ 0 must be inserted into r.
v(qJ.Y)n u ( a . Y ) # O, and v( qd. Y) n v( a. Y ) # O, and v(q~.Y)n v( fl. Y) # O, and v(~p'. Y) n v(a. Y) ~ ~, and
5. An example for illustrating the incremental refinement process
In this section, we present an example to show the refinement process. Example 5.1. Suppose we have the Supplier-Supply-Part database as shown in Fig. 9. The attributes Supply.sno and Supply.pno are two foreign keys that match the primary keys Supplier.sno and Part.pno, respectively. Besides, assume that each supplier only supplies
one part and each part is only supplied by one supplier. That is, in the relation Supply, we have sno ~ pno and pno ~ sno. Now, suppose a user wants to retrieve the relation Supply. We proceed the refinement process as follows. By Theorem 3.1, 6.sno,~,u*, where v ( u * ) = {s4, s5, s6} n {sl, s2, s3, s4, s5} = {s4, s5}. Similarly, y.pno,~, v*, where v(v*) = {p4, p6, pT} n {pl, p2, p3, p4, p5} = {p4}. The relation Supply can be refined into the following relation. Supply O~
7 6 £
sno
pno
[s l, s2, s3] s2 s4
p3 [pl, p2, p5] p4 pl [p2, p4]
Is4, sl
Since a.pno n fl.pno = 0 and a.pno n ~.pno = 0, by Theorem 3.5, a.sno ~'~ w*, where v(w*) = v(a.sno) - ({s2} U {sl}) = {sl, s2, s3} - {sl, s2} = {s3}. Similarly, since y.pno n &pno = ~1, by Theorem 3.4, 6.sno~'~x*, where v(x*) = v(&sno) - {s4} = {s4, s5} - {s4} = {s5}. Therefore, up to now, Supply is refined as follows.
F.S.C. Tseng et al.
313
Supply
"7 6 £
sno
pno
s3 s2 s4 s5 sl
p3 [191, 192, p5] p4 pl [p2, p4]
Now, since ~ . s n o O e . s n o = 0, by Theorem 3.4, e . p n o ~ , - , y * , where u ( y * ) = u ( e . p n o ) {p4} = {p2, p4} - {p4} = {p2}. That is, we obtain the refined supply as follows. Supply c~ /3 7 6 e
8710
pno
s3 s2 s4 s5 sl
p3 [p l, p2, pS] p4 pl p2
Finally, since [3.sno N 6.sno = 0 and f l . s n o O E.sno = O, by Theorem 3.5, [ 3 . p n o , ~ z * , where u ( z * ) = v ( f l . p n o ) - ({pl} U {p2}) = {pl, p2, p5} - {pl, p2} = {p5}. We finally obtain the refined Supply and present this relation as the query result as follows. Supply a /3 7 6
sno
pno
s3 s2 s4 s5 sl
p3 p5 p4 pl p2
Notice that all data in Supply are all refined to be definite. Besides, if the query is issued in a remote site, then only about half of the original data in Supply need to be transmitted to the remote user. 6. Practical implications of partial values in multidatabase systems
In addition to imprecise data manipulation, partial values are also important in resolving the semantic discrepancies in multidatabase systems. For example, DeMichiel [7] employs partial values for resolving domain mismatch problems in multidatabase systems. In DeMichiel's approach, data imprecision comes from data incompatibilities in a multidatabase system. We further generalize the concept of partial values into probabilistic partial values in [22] for resolving more interoperability problems in multidatabase systems. For example, suppose we want to integrate the following relations located in different sites in a multidatabase system.
Refining imprecise data by integrity constraints
314
CS-Researchers
Taiwan-Scientists
name
specialty
age
name
specialty
age
Frank Jesse Annie
DB AI SE Site 1
28 30 28
Frank Jesse Andy
CS CS CS Site 2
26 30 25
Assuming that Computer Science (CS) consists of three subareas, i.e. database (DB), artificial intelligence (AI), and software engineering (SE), we can use partial values to resolve the mismatched domain, specialty. That is, the relation Taiwan-Scientists can be transformed into
Taiwan-Scientists' name
specialty
age
Frank Jesse Andy
[DB, AI, SE [DB, AI, SE] [DB, AI, SE]
26 30 25
We can now 'integrate' CS-Researchers and Taiwan-Scientists' into the following relation Taiwan-CS-Scientists for global multidatabase queries.
Taiwan-CS-Scientists name
specialty
age
Frank Jesse Annie Andy
DB AI SE [DB, AI, SE]
[26, 28] 30 28 25
Notice that the partial value [26, 28] comes from the inconsistency of Frank's age in different sites. Since we do not know which one is the 'true' value, the integrated result is represented by a partial value. Queries posed on partial relations in a multidatabase system may produce uncertain answer tuples. For more details about this issue, the gentle readers are referred to [7, 22]. In addition to incomplete data manipulation, partial values are also important in resolving the semantic discrepancies in multidatabase systems. By employing our work, some imprecise data in a multidatabase system can also be refined into more informative answers.
7. Discussions
Partial values have been used to represent imprecise data in databases. Data imprecision may come from their unavailability or data/schema incompatibilities in a multidatabase system. By employing the integrity constraints of a relation, we provide an approach to refine imprecise data into more informative ones. We elaborate on the utilization of range constraints, referential integrities, functional dependencies, and multivalued dependencies to
F.S.C. Tseng et al.
315
develop a set of useful properties for refining applicable null values into definite values or partial values. We propose to implement these properties by rules. That makes a database containing imprecise data can be incrementally refined. Furthermore, we also provide rules to enforce integrity constraints in a database containing imprecise data. Our work actually eliminates redundant elements in a partial value. Therefore, our work not only provide a more concise and informative answer to the user, but also speedup the computation of queries. Besides, it reduces the communication cost when imprecise data are requested to be transmitted from one site to another site in a distributed environment. That is, our work also contributes to the query optimization in a distributed database system.
Acknowledgement The authors wish to thank the anonymous referees whose comments and suggestions helped improve this paper.
References [1] S. Abiteboul and G. Grahne, Update semantics for incomplete information, in: Proc. 11th Internat. Conf. Very Large Data Bases (1985) 1-12. [2] F. Bancilhon and N. Spyratos, Update semantics of relational views, A C M Trans. Database Syst. 6 (4) (1981) 557-575. [3] J. Biskup, A foundation of Codd's Relational Maybe-Operations, A C M Trans. Database Syst. 8 (4) (1983) 608-636. [4] E.F. Codd, Extending the Database Relational Model to capture more meaning, A C M Trans. Database Syst. 4 (4) (1979) 397-434. [5] E.F. Codd, Missing information (applicable and inapplicable) in Relational Databases, SIGMOD R E C O R D 15 (4) (1986) 53-78. [6] E.F. Codd, More commentary on missing information in Relational Databases (applicable and inapplicable information), SIGMOD RECORD 16 (1) (1987) 42-50. [7] L.G. DeMichiel, Resolving database incompatibility: an approach to performing relational operations over mismatched domains, IEEE Trans. Knowledge & Data Engrg. 1 (4) (1989) 485 -493. [8] R. Fagin, Multivalued dependencies and a new normal form for relational databases, A C M Trans. Database Syst. 2 (3) (1977) 262-278. [9] J. Grant, Null values in a relational data base, Inform. Processing Letters 6 (5) (1977) 156-157. [10] J. Grant, Partial values in a tabular database model, Inform. Processing Letters 9 (2) (1979) 97-99. [11] T. Imielinski and W. Lipski, On representing incomplete information in a relational database, in: Proc. 7th Internat. Conf. Very Large Data Bases (1981) 389-397.
[12] T. Imielinski and W. Lipski, Incomplete information and dependencies in relational databases, in: Proc. A C M SIGMOD Internat. Conf. Management of Data (1983) 178-184. [13] E. Lien, Muitivalued dependencies with null values in relational databases, in: Proc. 5th lnternat. Conf. Very Large Data Bases (1979) 61-66. [14] W. Lipski, On semantic issues connected with incomplete information systems, A C M Trans. Database Syst. 4 (3) (1979) 262-296. [15] K.-C. Liu and R. Sunderraman, Indefinite and maybe information in relational databases, A C M Trans. Database Syst. 15 (1) (1990) 1-39. [16] K.-C. Liu and R. Sunderraman, A generalized relational model for indefinite and maybe information, IEEE Trans. Knowledge and Data Engrg. 3 (1) (1991) 65-77. [17] D. Maier, The Theory of Relational Databases (Computer Science Press, Rockville, MD, 1983). [18] J.M. Morrissey, Imprecise information and uncertainty in information systems, A C M Trans. Informat. Syst. 8 (2) (1990) 159-180. [19] A. Motro, Accommodating imprecision in database systems: issues and solutions, A C M SIGMOD RECORD 19 (4) (1990) 69-74. [20] A. Ola, Relational databases with exclusive disjunctions, in: Proc. 8th IEEE Internat. Conf. Data Engineering (1992) 328-336. [21] F.S.C. Tseng, A.L.P. Chen and W.P. Yang, Searching a minimal subset of a set of partial values, Internat. J. Very Large Data BasesV L D B J., to appear. [22] F.S.C. Tseng, A.L.P. Chen and W.P. Yang, Answering heterogeneous database queries with degrees of uncertainty, Distributed and Parallel Database: An Internat. J., to appear.
316
Refining imprecise data by integrity constraints
[23] F.S.C. Tseng, A.L.P. Chert and W.P. Yang, Evaluating aggregate operations over imprecise data, submitted for publication (1993). [24] Y. Vassiliou, Null values in data base managem e n t - a denotational semantics approach, in:
Proc. ACM-S1GMOD Internat. Conf. Management of Data (1979) 162-169. [25] Y. Vassiliou, Functional dependencies and incomplete information, in: Proc. 6th Internat. Conf. Very Large Data Bases (1980) 260-269.
Arbee L.E Chen received the B.S. degree in computer science from National Chiao Tung University, Taiwan, Republic of China, in 1977, and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, in 1984. He joined National Tsing Hua University, Taiwan, in August 1990, and is currently a Professor in the Department of Computer Science. He was a Member of Technical Staff at Bell Communications Research, New Jersey, from 1987 to 1990, an Adjunct Associate Professor in the Department of Electrical Engineering and Computer Science, Polytechnic University, Brooklyn, New York, and a Research Scientist at Unisys, Santa Monica, California, from 1985 to 1986. He is currently also an Advisor to Industrial Technology Research Institute and Institute for Information Industry in Taiwan. His research interests include heterogeneous database systems, incomplete information ano uncertainty, knowledge discovery in databases, longduration transactions, and object-oriented database systems. Dr. Chen is a member of the Association for Computing Machinery and the IEEE Computer Society, and was a member of the ANSI/X3/SPARC/Database Systems Study Group. He serves as a Program Committee member of 1993 ACM SIGMOD Conference and 1992-1994 IEEE Data Engineering Conferences, and is listed in Marquis' Who's Who in the World.
Frank S.C. Tseng was born on June 30, 1964 in ChangHua, Taiwan, Republic of China. He received the B.S. and M.S. degrees both in computer engineering from National Chiao Tung University, Taiwan, ROC, in 1986 and 1988, respectively. Then, he received the Ph.D. degree in Computer Science and Information Engineering from National Chiao Tung University in 1992. His current research interests include integration and interoperability of federated databases, uncertain data manipulation, artificial intelligence, and graph theory. He is a member of ACM and IEEE Computer Society.
Wei-Pang Yang was born on May 17, 1950 in Hualien, Taiwan, Republic of China. He received the B.S. degree in mathematics from National Taiwan Normal University in 1974, and the M.S. and Ph.D. degrees from the National Chiao Tung University in 1979 and 1984, respectively, both in computer engineering. Since August 1979, he has been on the faculty of the Department of Computer Engineering at National Chiao Tung University, Hsinchu, Taiwan. In the academic year 1985-1986, he was awarded the National Postdoctoral Research Fellowship and was a visiting scholar at Harvard University. From 1986 to 1987, he was the Director of the Computer Center of National Chiao Tung University. In August 1988, he joined the Department of Computer and Information Science at National Chiao Tung University, and acted as the Head of the Department for one year. Then he went to IBM Almaden Research Center in San Jose, California for another one year as visiting scientist. From 1990 to 1992, he was the Head of the Department of Computer and Information Science again. His research interests include database theory, database security, object-oriented database, image database and Chinese database systems. Dr. Yang is a senior member of IEEE, ACM, and the Phi Tau Phi Society. He was the winner of the 1988 and 1992 AceR Long.Term Award for Outstanding M.S. Thesis Supervision and the winner of 1990 Outstanding Paper Award of the Computer Society of the Republic of China.