
INFORMATION SCIENCES 78, 229-256 (1994)

Recovering a Relation from a Decomposition Using Constraint Satisfaction

PETER JEAVONS

Royal Holloway and Bedford New College, University of London, Egham, Surrey, England

ABSTRACT

A purely distributive representation of a relation is characterized by invariance under changes in the sorted order of the tuples of the relation. Natural join, intersection, and union operations can be performed on distributive representations of the operands with very high parallelism, because there is no need for any search, nor for prior sorting. What is obtained is a distributive representation of the result relation, from which the relation itself can be recovered, in any desired sorted order, by a serial tree-search process. This paper describes a particular distributive representation scheme, and demonstrates that the associated recovery process corresponds to solving a constraint satisfaction problem. It is shown that by pruning the tree-search, it is possible to carry out the recovery process in a time which is linear in the cardinality of the recovered relation, in some cases. Although the recovery process may occasionally obtain additional spurious tuples, not present in the relation, it is shown that the occurrence of these spurious tuples may be made extremely rare by an appropriate choice of system parameters.

1. INTRODUCTION

Practical implementations of relational operations have traditionally used a merge-like operation on sorted operands [2, 30]. More recently, decreasing memory costs have led to the introduction of distributive techniques which do not require prior sorting of the operands [7, 22]. A distributive technique assigns data to storage locations on the basis of value, rather than the order of presentation [13]. One obvious problem with such techniques is that the number of possible tuple values is so large that it is not practical to allocate a separate storage location for each possible value. An analogous problem exists for distributive sorting techniques applied to sets of integers [13], and is often overcome by using a "digit sort" technique, which works with a distributive representation based first on one digit, then on another, and so on. Raschid et al. [18] describe a distributive digit sort technique for database relations.

The technique we shall describe here is a distributive technique first proposed by Ullmann in [27] as the basis for an efficient, highly-parallel implementation of a number of relational operations. Instead of working with nonoverlapping digits, the technique that Ullmann describes uses a fixed collection of overlapping projections to represent an arbitrary relation. The advantage of this approach is that the individual projections can be kept small enough to be representable distributively, and yet contain sufficient information to allow the relation to be recovered in a single pass. Ullmann has shown that, using a distributive representation of this type for the operands, the binary relational operations of join, intersection, and union may all be performed with an extremely high degree of parallelism. Details of the implementation of these binary relational operations are given in [27] and [29], and will not be discussed here.

In this paper, we shall focus on the problem of recovering the resulting relation from the distributive representation after any relational operations have been performed. We present a theoretical analysis of the recovery process, which has previously been investigated only in small-scale simulations [27]. The technique which Ullmann uses for carrying out the recovery process actually obtains the relation in sorted order [27]. Hence, the process of forming the representation of a given relation and then recovering the relation may be viewed as a distributive sorting technique [13]. The experimental results given in [27] suggest that the time complexity of this sorting technique is linear in the size of the data, at least in some cases. The analysis below explains this result and clarifies the extent to which it is generally true. Ullmann also points out that the recovery process may produce additional, spurious tuples which do not belong to the relation which is being represented. By analyzing the recovery process, we are able to show that the probability of these spurious tuples arising may be made extremely small by an appropriate choice of parameters.

2. DEFINITIONS

The relational database model was initially developed by Codd [5]. Good introductions to relational database theory can be found in [15, 25]. In this model, a database consists of a finite set of relations, which are defined as follows:

DEFINITION 2.1. A relation consists of a "relation scheme" and a "relation instance."

• A relation scheme is a finite set of attributes. Each attribute is associated with a (possibly infinite) set of values, called a domain.
• A tuple over a relation scheme is a mapping that associates with each attribute of the relation scheme a value from its corresponding domain.
• A relation instance over a relation scheme is a finite set of tuples over that relation scheme.

Intuitively, a relation scheme represents the structure of a relation and a relation instance its contents. In order to describe the distributive representation of a relation with which we are going to work, we need two operators from Codd's relational algebra [5].

DEFINITION 2.2. Let Y, Z be sets of attributes with Z ⊆ Y. Let t be a tuple over Y, and let r be a relation instance over Y. The projection onto Z of t, denoted t[Z], is the restriction of t to Z. The projection onto Z of r, denoted π_Z(r), is the set {t[Z] | t ∈ r}.

DEFINITION 2.3. Let Y_1, ..., Y_q be sets of attributes, and let r_i be a relation instance over Y_i, for i = 1, ..., q. Let Y = ∪_{i=1}^{q} Y_i. The join of r_1, ..., r_q, denoted r_1 ⋈ ... ⋈ r_q, is the set

{t | t is a tuple over Y & t[Y_i] ∈ r_i for i = 1, ..., q}.

Finally, we define a join-dependency.

DEFINITION 2.4 [19]. A join-dependency over Y is an expression of the form

Y_1 ⋈ Y_2 ⋈ ... ⋈ Y_q,

where Y_1, ..., Y_q are sets of attributes with Y = ∪_{i=1}^{q} Y_i. The sets Y_1, ..., Y_q are called the edges of the join-dependency. A relation instance, r, over Y satisfies the join-dependency if

r = π_{Y_1}(r) ⋈ ... ⋈ π_{Y_q}(r).

Satisfying a join-dependency Y_1 ⋈ ... ⋈ Y_q is a necessary and sufficient condition for the contents of a relation to be storable as its projections onto Y_1, ..., Y_q, and to be recoverable by performing the corresponding join.
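To make Definitions 2.2-2.4 concrete, here is a minimal Python sketch (an illustration added to this edition, not part of the original paper) in which a tuple is a frozenset of (attribute, value) pairs and a relation instance is a set of such tuples. The tiny instance at the end does not satisfy the join-dependency {A, B} ⋈ {B, C}, so joining its projections produces spurious tuples, foreshadowing the "false drops" discussed later.

def project(r, Z):
    # pi_Z(r): restrict each tuple (a frozenset of (attribute, value) pairs) to Z.
    return {frozenset((a, v) for (a, v) in t if a in Z) for t in r}

def natural_join(r1, r2):
    # Tuples over the union scheme whose restrictions lie in r1 and r2
    # (Definition 2.3, binary case), found by a nested-loop consistency check.
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

r = {frozenset({"A": a, "B": b, "C": c}.items()) for (a, b, c) in [(0, 1, 0), (1, 1, 1)]}
joined = natural_join(project(r, {"A", "B"}), project(r, {"B", "C"}))
print(len(joined))   # 4: the two original tuples plus two spurious ones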

3. FORMING THE REPRESENTATION

We now describe the technique proposed in [27] for creating a distributive representation of a relation, R. We shall initially assume that the relation scheme of R is the set of m attributes {A_1, ..., A_m}. It is convenient to restrict the domain of each of the attributes A_1, ..., A_m to fixed length strings of symbols from some alphabet. The individual symbols will be referred to as "digits." For example, numerical attribute values may be represented by fixed length binary or decimal strings. Similarly, nonnumeric attribute values may be encoded by bit-strings. With a specified fixed encoding of this type for all of the attributes in the relation scheme {A_1, ..., A_m}, the tuples of the relation R may also be seen as tuples over an associated relation scheme {D_1, ..., D_n}, where the attribute D_i corresponds to the ith digit in the encoding, and n is generally greater than m. From now on, we shall regard the set D = {D_1, ..., D_n} as the relation scheme of R and, for simplicity, we shall assume that the domains of all the attributes in D are equal to the same finite set A. Clearly, the values of the original m attributes, A_1, ..., A_m, can be recovered from the values of these new attributes by reversing the encoding process. The advantage of regarding the digits themselves as attributes rather than retaining the original relation scheme is an increase in generality: many different relation schemes may be encoded by the same fixed-length digit string; hence, many different relation schemes may be processed by the same dedicated hardware, described below.

EXAMPLE 3.1. As a very small-scale example of a possible relation R, consider the case where the original relation scheme consists of just three attributes: Age, Sex, and Marital Status. The Age attribute takes integer values in the range 0-99, and the Sex attribute takes the value M or F. The Marital Status attribute has four possible values: Single, Married, Divorced, or Widowed. An example of a typical relation instance, r, is

r = {(25, M, Single), (09, M, Single), (26, M, Married), (50, M, Widowed)}.

The tuples of this relation instance may be encoded by bit-strings in which the first 7 bits represent the Age, the next bit represents the Sex (according to some coding convention), and the final 2 bits represent the Marital Status (again according to some coding convention). The following set of bit-strings is obtained by choosing one possible encoding for the attributes of R:

[The four encoded 10-bit strings are displayed at this point in the original; for example, the tuple (09, M, Single) is encoded as 0001001100.]

With this encoding, there are ten digits in each tuple, and the relation scheme of R will be taken to be {D_1, ..., D_10}, where the domain of each attribute, D_i, is {0, 1}. Hence, in this example, n = 10 and |A| = 2. Having established a relation scheme for R consisting of digits, we now form the representation of R by constructing a set of projections. This is conveniently described in terms of a join-dependency (see Definition 2.4):

DEFINITION 3.2. Let D = {D_1, D_2, ..., D_n} be a set of attributes, and let

S_1 ⋈ S_2 ⋈ ... ⋈ S_q

be a join-dependency over D with set of edges Σ = {S_1, S_2, ..., S_q}. For any relation instance r over D, the ordered list of projections

(π_{S_1}(r), π_{S_2}(r), ..., π_{S_q}(r))

will be denoted π_Σ(r) and used as a distributive representation of r.

For reasons of symmetry, Ullmann [27] restricts his attention to representations π_Σ(r) for which the set of edges, Σ, has the following properties:

1. Each edge in Σ contains the same number, p, of attributes.
2. Each attribute belongs to the same number, t, of edges in Σ.

In the language of design theory [11], these properties require the set Σ to be a 1-design with parameter t. If Σ satisfies property 1, then the maximum size of each π_S(r) is |A|^p. Hence, the projections may each be stored in distributive form using a bit-vector of length |A|^p. Each bit in this bit-vector corresponds to one particular possible element of π_S(r), and this bit is set to one or zero to indicate the presence or absence of this element in π_S(r).

EXAMPLE 3.3. Consider the relation R, with scheme {D_1, ..., D_10} and instance r, described in Example 3.1. Let Σ be the set {S_1, S_2, S_3, S_4}, where

[the four edges S_1, ..., S_4, each consisting of five of the attributes D_1, ..., D_10, are listed at this point in the original].

With this choice of Σ, q = 4, p = 5, and t = 2. The values of the projections of the instance r onto each edge of the join-dependency S_1 ⋈ S_2 ⋈ S_3 ⋈ S_4 are as follows:

[the four projections π_{S_1}(r), ..., π_{S_4}(r), each a set of at most four 5-digit tuples, are listed at this point in the original].

The projections onto the S_j each have a maximum size of 2^5 = 32 tuples; hence, the representation π_Σ(r) may be stored in an array of four bit-vectors, each of length 32. The size of the storage required for all of the π_S(r) together is q|A|^p bits, which is generally much lower than the |A|^n bits required for a standard distributive representation of the whole of r. It is this fact which allows a feasible implementation in dedicated hardware, as discussed in [27]. This special-purpose hardware may also be designed to calculate each of the projections π_{S_j}(r) in parallel, using a dedicated array of logic gates.
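The mechanics of forming π_Σ(r) as an array of bit-vectors can be sketched in a few lines of Python (an illustration added to this edition, not the paper's hardware design). The edge set below is a hypothetical choice (the edges actually used in Example 3.3 are not reproduced above), but it satisfies the conditions q = 4, p = 5, t = 2 for n = 10, and ordinary Python integers stand in for the bit-vectors.

A = 2          # alphabet size |A|
n = 10         # number of digit attributes
edges = [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9),   # hypothetical edge set Sigma
         (0, 2, 4, 6, 8), (1, 3, 5, 7, 9)]
p = len(edges[0])

def bit_index(values):
    # Interpret p projected digits as a base-|A| number, giving one bit per possible value.
    idx = 0
    for v in values:
        idx = idx * A + v
    return idx

def representation(r):
    # One bit-vector of length |A|^p per edge (a Python int); a bit is set iff the
    # corresponding projected value occurs in r.
    vectors = [0] * len(edges)
    for t in r:
        for j, S in enumerate(edges):
            vectors[j] |= 1 << bit_index(tuple(t[i] for i in S))
    return vectors

def intersect(rep1, rep2):
    # Relational intersection on representations: bit-wise AND of corresponding vectors.
    return [v1 & v2 for v1, v2 in zip(rep1, rep2)]

Each vector here has |A|^p = 32 bits, so the whole representation occupies q|A|^p = 128 bits, compared with |A|^n = 1024 bits for a single distributive representation of the whole of r; the intersect function is the bit-wise AND used at the start of Section 4.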

4. RECOVERING THE RELATION

The crucial feature of the representation π_Σ(r) defined above is that it allows the operations of relational join, intersection, and union to be carried out in constant time using fully parallel bitwise AND and OR operations, as described in [27] and [29]. For example, if we form the intersection of two representations, π_Σ(r_1) ∩ π_Σ(r_2), by calculating the bit-wise intersection of corresponding bit vectors, then we obtain the representation of r_1 ∩ r_2. The implementation of these relational operations has already been fully described elsewhere [27, 29], and will not be discussed here. What this paper provides is the first quantitative analysis of the essential final step in the process: the recovery of a relation instance, r, from its representation, π_Σ(r). This analysis indicates the cost of the special-purpose hardware required for the representation, which must be set against the benefits of being able to perform any sequence of these relational operations in a time which is independent of the cardinality of the operand relations. The recovery procedure we shall describe calculates the join of all the projections in π_Σ(r), which will be denoted r_Σ.

However, we are not assuming that r satisfies the join-dependency whose edges are Σ, so it is possible that r_Σ may contain additional tuples that do not belong to r. These additional, spurious tuples, which may be generated by the recovery process, will be referred to as "false drops," since this is the established terminology in the context of superimposed coding techniques [20], and the distributive representation π_Σ(r) is, in fact, a form of superimposed coding. The complete recovery procedure must provide some mechanism for dealing with these false drops, such as that described in [29]. However, we shall show in Section 6 that the probability of false drops occurring can be made very small by an appropriate choice of q and p.

From the definition of the join operation, it is clear that the members of r_Σ are precisely the tuples whose projections onto each of the S_i belong to the projection of r onto the same S_i. In other words,

r_Σ = {t | t is a tuple over D & t[S_i] ∈ π_{S_i}(r) for i = 1, ..., q}.

When the definition of r_Σ is cast in this form, it may be noted that the calculation of r_Σ involves the solution of a standard "constraint satisfaction problem" [8, 14, 16], as studied by researchers in the field of artificial intelligence. Such problems are also sometimes referred to as "consistent labeling problems" [10]. The constraints to be satisfied in this case are provided by each of the projections, and specify the allowed combinations of values for certain subsets of the attributes. Formulating the recovery problem as a constraint satisfaction problem suggests that solution strategies developed for constraint satisfaction


problems (see [17]) may be adapted to tackle this problem. The solution technique we now describe is derived from these ideas, and is conveniently implemented using the special-purpose hardware designed to store the distributive representation.

In this technique, the search for the tuples, w, which belong to r_Σ is organized as a backtrack tree search. For each node in the tree, the search procedure maintains a set of possible values for each attribute. To proceed from a node to one of its children, an attribute is selected and assigned one of its possible values from this set. This is referred to as instantiating the attribute. The sets of possible attribute values are refined at each node, using the representation π_Σ(r), as described below. The refinement process removes any values which can be ruled out on the basis of the values already instantiated and the information in π_Σ(r). By removing these values from further consideration, the size of the tree which is explored is reduced. The extent of this reduction is discussed in the next section. The whole algorithm may be described by the following recursive procedure, which is similar to the TREESEARCH procedure in [26].

ALGORITHM 4.1.

procedure SEARCH(E_1, E_2, ..., E_n, LEVEL)
  local E'_1, E'_2, ..., E'_n
  call REFINE(E_1, E_2, ..., E_n, E'_1, E'_2, ..., E'_n)
  if UNIQUE(E'_1, E'_2, ..., E'_n) then
    call OUTPUT(E'_1, E'_2, ..., E'_n)
  else
    for each e ∈ E'_{LEVEL+1}
      call SEARCH(E'_1, E'_2, ..., E'_{LEVEL}, {e}, E'_{LEVEL+2}, ..., E'_n, LEVEL + 1)
    end for
  end if
end

In this algorithm, the variables E_i and E'_i are sets of maximum cardinality |A|. The set E_i contains the values under consideration for the ith attribute before refinement by procedure REFINE. The set E'_i contains the values still under consideration for the ith attribute after the refinement has taken place. The function UNIQUE returns the value TRUE if all its arguments are singleton sets, which indicates that a unique tuple has been determined. The procedure OUTPUT simply outputs the tuple formed by the values contained in its arguments, which must all be singleton sets. The variable LEVEL is an integer which represents the current level in the backtrack search tree. The (i + 1)th attribute is instantiated at level i by selecting one of the available values from E'_{i+1}, and then calling SEARCH recursively.

The procedure REFINE calculates the values of E'_1, ..., E'_n from E_1, ..., E_n using the following algorithm:

ALGORITHM 4.2. Refining the sets of possible values using π_Σ(r).

Input. The sets of possible values, E_1, E_2, ..., E_n.
Output. The refined sets of possible values, E'_1, E'_2, ..., E'_n.
Method.

1. For i = 1 to n, let E'_i = E_i.
2. For i = 1 to n, let

   F_i = {e ∈ E'_i | there exists a j such that π_{S_j}(E'_1 × ... × E'_{i-1} × {e} × E'_{i+1} × ... × E'_n) ∩ π_{S_j}(r) = ∅},

   and let E'_i = E'_i - F_i.
3. If all sets F_i are empty, then stop; else return to 2.
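As an illustration of how Algorithms 4.1 and 4.2 fit together, the following Python sketch (added to this edition, not from the paper) re-implements SEARCH and REFINE serially on top of the hypothetical bit-vector representation sketched in Section 3; it reuses the names A, n, edges, bit_index and representation assumed there. In the paper, REFINE is a single highly parallel hardware primitive, whereas here it is simulated with explicit loops.

from itertools import product

def supported(E, j, vectors):
    # Is pi_{S_j}(E_1 x ... x E_n) intersected with pi_{S_j}(r) nonempty?
    for combo in product(*(sorted(E[i]) for i in edges[j])):
        if (vectors[j] >> bit_index(combo)) & 1:
            return True
    return False

def refine(E, vectors):
    # Algorithm 4.2, serially: repeatedly delete values ruled out by some projection.
    E = [set(s) for s in E]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for e in list(E[i]):
                trial = E[:i] + [{e}] + E[i + 1:]
                if not all(supported(trial, j, vectors)
                           for j in range(len(edges)) if i in edges[j]):
                    E[i].discard(e)
                    changed = True
    return E

def search(E, level, vectors, out):
    # Algorithm 4.1: refine, then either output a unique tuple, stop at a dead-end,
    # or instantiate attribute `level` with each remaining value in ascending order.
    E = refine(E, vectors)
    if any(not s for s in E):
        return
    if all(len(s) == 1 for s in E):
        out.append(tuple(next(iter(s)) for s in E))
    else:
        for e in sorted(E[level]):
            search(E[:level] + [{e}] + E[level + 1:], level + 1, vectors, out)

def recover(vectors):
    # recover(representation(r)) returns r_Sigma, in lexicographic order.
    out = []
    search([set(range(A)) for _ in range(n)], 0, vectors, out)
    return out

Because the candidate values are tried in ascending order, the tuples are emitted in lexicographic order (cf. Proposition 4.6 below).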

It is clear that the refinement procedure defined by Algorithm 4.2 must terminate after at most Σ_i |E_i| iterations, since the total number of elements in the E_i starts at this value and is decremented by at least 1 on each iteration until there is no further change. The next lemma shows that Algorithm 4.2 does not result in the elimination of any elements of r_Σ.

LEMMA 4.3. Let E_1, E_2, ..., E_n and E'_1, E'_2, ..., E'_n be the inputs and outputs of Algorithm 4.2. If w ∈ r_Σ and w ∈ E_1 × ... × E_n, then w ∈ E'_1 × ... × E'_n.

Proof. Let e be the value of the ith attribute of w, for any i in the range 1 to n. Since w ∈ E_1 × ... × E_n, we know that e ∈ E_i, and so e ∈ E'_i after Step 1 of Algorithm 4.2. Hence, at the first iteration of Step 2, w ∈ E'_1 × ... × E'_n, so for any S_j,

w[S_j] ∈ π_{S_j}(E'_1 × ... × E'_{i-1} × {e} × ... × E'_n).

Since w ∈ r_Σ, we know that w[S_j] ∈ π_{S_j}(r), by the definition of r_Σ. Hence, at any point where the above condition holds, we know that

π_{S_j}(E'_1 × ... × E'_{i-1} × {e} × ... × E'_n) ∩ π_{S_j}(r) ≠ ∅,

because they have a common element w[S_j]. Hence, e will not be a member of F_i and will not be removed from E'_i in Step 2. Applying the same argument at each iteration gives the result. ∎

This result extends the result given in [21] for a filtering operation which is similar to the refinement operation described here, but is only defined for the case p = 2. Using Lemma 4.3, we are now in a position to prove that the SEARCH procedure in Algorithm 4.1 may be used to calculate r_Σ.

PROPOSITION 4.4. Executing SEARCH(A, A, ..., A, 0) will output precisely the elements of r_Σ.

Proof. Consider the tree of recursive calls of procedure SEARCH. This tree cannot have any infinitely long branches, since the number of singleton sets in the arguments increases by at least one at each successive recursive call. Once this number reaches n, the call to UNIQUE must return the value TRUE and the branch will terminate with a call to OUTPUT. Also, the number of children of any node is clearly finite, since the E_i are always finite. Hence, the tree is finite and the SEARCH procedure must terminate.

Using Lemma 4.3, we know that if w ∈ r_Σ and w ∈ E_1 × ... × E_n at some node, then w ∈ E'_1 × ... × E'_n at that node, so either w will be output at that node, or w ∈ E_1 × ... × E_n at one of the children of that node, and the same argument may be applied again. If E_1 × ... × E_n is equal to A × ... × A at the root node, then for all w ∈ r_Σ, w ∈ E_1 × ... × E_n at the root node. Hence, all w ∈ r_Σ will be output by executing SEARCH(A, A, ..., A, 0).

Conversely, if v is any tuple which is output by procedure SEARCH, then we know that the attribute values of v are returned from the REFINE procedure as singleton sets at some point. Hence, v[S_j] must belong to π_{S_j}(r) for each j, so v ∈ r_Σ by the definition of r_Σ. ∎

We may strengthen this result to show that when A is an ordered set, procedure SEARCH can be made to deliver the tuples of r_Σ in sorted order, according to a standard lexicographic ordering, defined as follows:

DEFINITION 4.5. Let R be a relation with scheme D = {D_1, ..., D_n}, where the domain of each D_i is the set A, which has a total ordering relation <. For any instance r of R, we define an ordering on the tuples of r as follows. For all tuples w_1, w_2 ∈ r, we define w_1 < w_2 if and only if there exists a j such that

w_1(D_i) = w_2(D_i)   (i = 1, ..., j-1)   and   w_1(D_j) < w_2(D_j).

Other lexicographic orderings may be obtained by permuting the ordering of the attributes D_1, ..., D_n.

PROPOSITION 4.6. The output of procedure SEARCH can be obtained in lexicographic order.

Proof. Again, consider the tree of recursive calls of procedure SEARCH. This tree is traversed in a depth-first order. If we implement the for loop in procedure SEARCH so that the values for e are chosen in ascending order, then the leaf nodes will be visited in lexicographic order. ∎

To obtain a different lexicographic ordering, based on a different permutation of the attributes, it is only necessary to change the order in which the attributes are selected for instantiation.

EXAMPLE 4.7. Reconsider the simple example relation, R, with instance r, described in Example 3.1, and the representation, π_Σ(r), described in Example 3.3. Recall that R has ten attributes, each with domain {0, 1}.

In order to illustrate the operation of procedure SEARCH, we shall denote the values of the sets E'_1, ..., E'_10 at each stage by a string of 10 symbols. In this string, a "1" in the ith position indicates that E'_i = {1}. Similarly, a "0" in the ith position indicates that E'_i = {0}. Finally, a "*" in the ith position indicates that E'_i = {0, 1}.

We now consider the tree of recursive calls of procedure SEARCH which results from the call SEARCH(A, A, ..., A, 0). At the root node of this tree, after the first execution of the REFINE operation, the E'_i have the following values:

0***0**1**

The values of the first, fifth, and eighth attributes are already fixed at this point because only one of the two possible values for these digits occurs in the instance r, so only one of the two values is allowed by the projections in π_Σ(r).

At this point, the first digit is instantiated to 0 and the first recursive call to SEARCH is made, which is now at level 1. The application of the REFINE operation at this node leaves the values unchanged, so the second digit is instantiated to 0 and the SEARCH procedure is called again at level 2. Now, the REFINE operation results in the following values for the E'_i:

00*10**1**

Note that the value of the fourth digit has now been fixed on the basis of the instantiations which have already taken place. The value of 0 for the fourth digit is eliminated using the values in the appropriate projection π_{S_j}(r). Next, the third digit is instantiated to zero and the SEARCH procedure is called at level 3. The application of REFINE results in

0001001100

This is identified as consisting entirely of singleton sets, and so the corresponding tuple is output as the first element of r_Σ. Note that this tuple is the first element of r in the standard lexicographic ordering. After outputting this value, the SEARCH procedure returns to the previous level and instantiates the third digit to 1, since that value was also available at that point. This results in another call at level 3 which has the following values after refinement:

00110**1**

The REFINE operation does not succeed in eliminating any values for digits 6, 7, 9, and 10, so the tree must be explored further. The complete search tree is shown in Figure 1, which indicates that a total of ten nodes are visited in order to recover the four members of r. In this particular example, there are no false drops and r_Σ = r.

Fig. 1. The complete search tree (Example 4.7). [Tree diagram not reproduced; the tree spans levels 0-6, with root node 0***0**1**.]

5. TIME COMPLEXITY

5.1. COMPUTING THE RELATION r_Σ

The calculations of the projections, π_{S_j}(E'_1 × ... × E'_n), used in the REFINE operation (Algorithm 4.2), may all be performed in parallel by the same special-purpose hardware used to calculate the π_{S_j}(r). The calculation of the intersection of projections may also be performed very efficiently by appropriate hardware, using the distributive representation of the projections as bit-vectors. Hence, the REFINE operation defined by Algorithm 4.2 may be carried out extremely rapidly using a highly parallel dedicated unit (see [26] and [29]). For the analysis below, we shall simply assume that REFINE is available as a primitive operation. Because of this assumption, the time complexity of the recovery technique will be expressed in terms of the number of executions of the REFINE operation. Since there is one call to REFINE during each call of procedure SEARCH, this number is equal to the total number of nodes in the tree of recursive calls of procedure SEARCH.

We now consider ways of estimating the number of nodes in the search tree, T, resulting from the execution of SEARCH(A, A, ..., A, 0). In this analysis, we shall assume that the relation instance r is nonempty, and hence r_Σ is also nonempty. We shall also assume that the edges in the set Σ are connected. This assumption allows us to prove the following technical lemma:

LEMMA 5.1. After a REFINE operation, if any of the E'_i are empty, then they are all empty.

Proof. If E'_{i_1} = ∅, then for any j such that D_{i_1} ∈ S_j,

π_{S_j}(E'_1 × ... × E'_n) = ∅.

So for any D_{i_2} ∈ S_j and any e ∈ E'_{i_2}, we know that

π_{S_j}(E'_1 × ... × E'_{i_2 - 1} × {e} × ... × E'_n) ∩ π_{S_j}(r) = ∅.

Hence, e will be removed from E'_{i_2} by the REFINE operation (Algorithm 4.2). The REFINE operation therefore proceeds to remove all elements from E'_{i_2}. The connectedness condition on the S_j implies that it will continue in this way to empty every set E'_i, and the result follows. ∎

The nodes in the search tree T may be divided into two categories, depending on whether or not the sets of values being considered for each attribute at that node are all singletons after refinement. If they are, then the search procedure will simply output the corresponding tuple, and there will be no descendant nodes. Nodes of this type are therefore leaf nodes and correspond to the elements of r_Σ which are output. We will refer to them as output nodes.

Otherwise, we know that at least one of the sets of possible attribute values (E'_1, ..., E'_n) is not a singleton set. By Lemma 5.1, we know that if any one of these sets is empty, then they will all be empty. If they are all empty, then the node will have no descendants, and will be referred to as a dead-end. Otherwise, the node will have one or more descendants, and will therefore be an internal node.

We will now prove that when p = 2 and |A| = 2, the search tree T is "backtrack-free" in the sense defined by Freuder [9]. In other words, every node visited is either an output node or an internal node, and there are no dead-ends.

PROPOSITION 5.2. There are no dead-ends in the search tree T when p = 2 and |A| = 2.

Proof. By induction on the depth of the tree. The root node is clearly not a dead-end, because it has at least |r| descendants, by Proposition 4.4, and we are assuming that r is nonempty. We shall now show that if N is any node in T at level λ which is not a dead-end, then the children of N cannot be dead-ends.

Since N is not a dead-end, Lemma 5.1 shows that none of the sets E'_i at node N is empty. If they are all singleton sets, then N is an output node and we are done (because it has no children at all, so it clearly has no children which are dead-ends). So the only case we need to consider is when some of the E'_i are singleton sets and some are of size 2 (the only possibilities when |A| = 2). Let Y_1 be the set of attributes for which the E'_i are singletons, and let Y_2 be the set of attributes for which the E'_i are of size 2.

The attribute which is instantiated at node N is D_{λ+1}. For each member e ∈ E'_{λ+1}, there must be at least one tuple, v, in r such that v(D_{λ+1}) = e; otherwise, e would have been removed from E'_{λ+1} by the REFINE operation. Choose any such e and a corresponding tuple v. Now, consider the tuple, w, which is defined as follows:

{w(D_i)} = E'_i      if i ∈ Y_1,
w(D_i) = v(D_i)      if i ∈ Y_2.

(Note that w(D_{λ+1}) = v(D_{λ+1}) = e in either case.) We claim that w ∈ r_Σ; in other words, for all j, w[S_j] ∈ π_{S_j}(r). This can be proved by considering two possible cases:

1. For all S_j which lie entirely within Y_1 or span Y_1 and Y_2, it follows from the fact that the attribute values of w were not eliminated by the previous REFINE operation.
2. For S_j which are contained within Y_2, it follows from the fact that w agrees with v on these attributes, and v ∈ r.

Hence, at each child node of N, there will be a tuple w ∈ r_Σ which belongs to E_1 × ... × E_n. By Lemma 4.3, this implies that the child nodes of N cannot be dead-ends, and the result follows by induction. ∎

This result means that in this simplest case, the maximum possible amount of pruning will take place: no node will be visited unless there is a path leading through that node to a solution. This property has an important consequence, which will allow a direct estimate of the number of nodes in the tree.

LEMMA 5.3. If there are no dead-ends in a search tree, then all internal nodes must have subtrees containing two or more output nodes.

Proof. The subtree at any internal node must have two or more branches, because there must be at least one attribute with at least two values still under consideration. These branches cannot terminate at dead-ends, by assumption, so they must all terminate at output nodes. ∎

Using Lemma 5.3, we may obtain an approximate value for the total number of nodes in the search tree T by assuming that the tuples of r_Σ are randomly distributed or, in other words, that the tuples of r_Σ may be considered as a sequence of tuples chosen independently and at random from the set of |A|^n possible tuples. (This assumption can only be approximately valid, since it allows the possibility of duplication, but it may provide a useful approximation when |r_Σ| is small compared with |A|^n.)

PROPOSITION 5.4. If there are no dead-ends in the search tree T, and the tuples of r_Σ are randomly distributed, then the expected number of nodes in T is approximately equal to

|r_Σ| + (|r_Σ| - 1)/ln|A|,

provided that |r_Σ| is small compared with |A|^n.

Proof. Each internal node of T at level λ (λ = 1, ..., n - 1) corresponds to a call of procedure SEARCH with the E_1 to E_λ arguments all singleton sets. There are |A|^λ distinct possible values for this list of arguments. By Lemma 5.3, each of these possible values will only occur in the search tree if the corresponding node has two or more descendant output nodes or, in other words, if there are two or more tuples of r_Σ which take these values on the first λ attributes.


The probability of this event, P(λ), may be calculated from the binomial distribution, using the assumption that the tuples of r_Σ are randomly distributed:

P(λ) = 1 - (1 - |A|^{-λ})^{|r_Σ|} - |r_Σ| |A|^{-λ} (1 - |A|^{-λ})^{|r_Σ| - 1}.

Hence, the expected number of internal nodes in the search tree, I, is obtained by multiplying this probability by |A|^λ and summing over all levels:

I = Σ_{λ=1}^{n-1} P(λ) |A|^λ.    (1)

This summation may be approximated with an integral by treating λ as a continuous variable. Evaluating the integral, and neglecting all except first-order terms (which is justified when |r_Σ| is small compared with |A|^n), we obtain the approximation

I ≈ (|r_Σ| - 1)/ln|A|.

The expected total number of nodes in the search tree will be given by I + |r_Σ|, since there are |r_Σ| output nodes in addition to the internal nodes. ∎
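As an illustrative numerical check (not from the paper), the following fragment evaluates the sum in Equation (1), with P(λ) written out as the binomial probability used in the proof, and compares it with the closed-form approximation of Proposition 5.4.

from math import log

def expected_internal_nodes(M, n, A=2):
    # Equation (1): sum over levels of P(lambda) * |A|^lambda, where P(lambda) is the
    # probability that two or more of the M random tuples share a given lambda-digit prefix.
    total = 0.0
    for lam in range(1, n):
        x = A ** -lam
        p_two_or_more = 1 - (1 - x) ** M - M * x * (1 - x) ** (M - 1)
        total += p_two_or_more * A ** lam
    return total

M, n = 50, 20
print(expected_internal_nodes(M, n))   # close to the approximation below
print((M - 1) / log(2))                # (|r_Sigma| - 1)/ln|A|, about 70.7 here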

There is a close similarity between this result and the result obtained by Knuth [13] for the expected number of nodes in a binary trie structure. The connection arises because the truncated tree which is explored by the recovery procedure is equivalent to the binary trie which would be created to hold r_Σ. A similar result has also been derived by Johnson [12] in connection with a hardware sorting technique which uses an associative memory to provide the pruning information.

The above analysis confirms that the expected time complexity of the recovery procedure is approximately linear in the size of the set of tuples recovered, in the simplest case. In more general cases, with larger values of |A| or p, it is always possible to construct examples which do have dead-ends. No estimate for the expected frequency of dead-ends in the search tree T has yet been derived in the general case, and this value may well depend heavily on the structure of Σ. However, it is possible to show that there are levels in the search tree at which dead-ends cannot occur, even in the general case.

PROPOSITION 5.5. If {D_1, D_2, ..., D_λ} ⊆ S_i, for some S_i ∈ Σ, then there are no dead-ends in the search tree T at level λ.

Proof. Each node of the search tree T at level λ is associated with a unique tuple, t, over the attributes D_1, D_2, ..., D_λ, defined by {t(D_i)} = E_i, i = 1, 2, ..., λ. If {D_1, D_2, ..., D_λ} ⊆ S_i, then the tuple t must be a restriction of some tuple t' ∈ π_{S_i}(r); otherwise, it would be eliminated by the REFINE procedure. A dead-end in T corresponds to a tuple, t, which cannot be extended to a member of r_Σ. However, by the definition of π_{S_i}(r), every tuple, t' ∈ π_{S_i}(r), may be extended to a member of r, and hence to a member of r_Σ. ∎

In all the simulations carried out by Ullmann to investigate the technique, it was in fact the case that the first p attributes to be instantiated belonged to a single S_i [28]. This had the unintended consequence of preventing dead-ends in the first p levels of the search tree T. When |r| is small compared with |A|^p, most tuples of r_Σ are uniquely identified by the first p levels of T, so that very few nodes at higher levels are visited. This explains why the simulation results published in [27] for the number of nodes in the search tree T agree very closely with the predictions of Proposition 5.4.

5.2. COMPUTING A SORTED PROJECTION OF r_Σ

When a projection of the data onto a subset, Z, of the attributes is required, then the recovery procedure may easily be modified so that only the values of these attributes are calculated. In fact, the time complexity of the entire recovery and sorting operation is actually reduced when only a projection of the result is required. Hence, we have a reversal of the well-known fact that sorting data makes subsequent projection easier: with this technique, the final projection makes the sorting easier.

There are two cases to consider. The first is when we know that there are no dead-ends in the search tree. In this case, we simply arrange for the search procedure to instantiate the attributes in Z first, and then stop. This clearly reduces the size of the search tree, because only the first |Z| levels are explored, so the summation in Equation (1) is only over these levels. On the other hand, if there is a possibility of dead-ends, then it is necessary to find at least one allowed instantiation of the remaining attributes as well, to ensure that the values selected for the attributes in Z will not be eliminated further down the tree. Hence, one branch of the tree must be explored until it terminates in an output node. However, it is not necessary to explore all possible values for attributes not in Z, since one complete tuple value is sufficient to guarantee that the attribute values in Z should be included in the result of the projection. This again reduces the size of the search tree.

Note that in either case, the desired projection is obtained directly, without the need for a "second pass" over the data to remove duplicates. The advantages of removing duplicates during a sorting operation, rather than in a subsequent separate operation, are discussed in [3], in the context of a more conventional merge-sort procedure.
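As a small follow-on to the recovery sketch given after Algorithm 4.2 (again an illustration added to this edition, not from the paper), the first of the two cases above can be obtained by truncating the search once the attributes of Z are fixed. The function below assumes that Z consists of the first k digit positions (in general one would instantiate the attributes of Z first) and that the tree is known to be dead-end free.

def search_projection(E, level, vectors, out, k):
    # Sorted projection onto the first k attributes, assuming no dead-ends can occur.
    E = refine(E, vectors)
    if any(not s for s in E):
        return                                   # cannot happen in the dead-end-free case
    if all(len(E[i]) == 1 for i in range(k)):
        out.append(tuple(next(iter(E[i])) for i in range(k)))   # projected tuple, no duplicates
        return
    for e in sorted(E[level]):
        search_projection(E[:level] + [{e}] + E[level + 1:], level + 1, vectors, out, k)

Because only attributes with index below k are ever instantiated, sibling branches always differ on a projected attribute, so the projected tuples come out in sorted order and without duplicates, as described above.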

6. SPACE REQUIREMENTS

6.1. ESTIMATING THE NUMBER OF FALSE DROPS

The space requirements of the technique we have described will generally be dominated by the space required to store the representation, π_Σ(r). As we have already remarked, each projection, π_S(r), may be stored using a bit-vector of length |A|^p. Since there are q projections, the total storage space required for them is q|A|^p bits. The values of q and p must be chosen sufficiently large to ensure that the expected number of false drops is sufficiently small for the technique to be useful in a particular context. In order to determine the appropriate values for these parameters, and hence the space requirements of the technique, we need to analyze the way in which the number of false drops depends on q and p.

One way to estimate the expected size of r_Σ, and hence the number of false drops, is to assume that the projections in π_Σ(r) act as independent constraints which filter the tuples, w, which may belong to r_Σ. For each constraint, there are |A|^p possible values, so if r may be approximated by a random selection of tuples, we may use elementary probability arguments to show that

P(w[S_i] ∈ π_{S_i}(r)) = 1 - (1 - 1/|A|^p)^{|r|}.

Assuming, for the moment, that the constraints all act independently, we obtain the following expression for the expected size of the recovered relation, E(|r_Σ|) (see [20]):

E(|r_Σ|) = |A|^n (1 - (1 - 1/|A|^p)^{|r|})^q.    (2)
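Equation (2) is straightforward to evaluate; the following small check (added here for illustration) recomputes the values that appear in the Equation (2) column of Table 1 below.

def eq2(q, r_size=10, n=12, p=4, A=2):
    # Equation (2): expected size of the recovered relation under the
    # (over-optimistic) independence assumption.
    return A ** n * (1 - (1 - 1 / A ** p) ** r_size) ** q

for q in (3, 6, 9, 12):
    print(q, f"{eq2(q):.2f}")   # approximately 440.5, 47.4, 5.1 and 0.55, as in Table 1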

TABLE 1
Expected Size of Recovered Relation (|r| = 10, n = 12, p = 4, |A| = 2)

q      E(|r_Σ|) [Equation (2)]      E(|r_Σ|) [Simulation]
3      440.5                        442.0
6      47.4                         113.9
9      5.1                          42.8
12     0.55                         24.0

This expression may be a useful approximation when the S_i do not overlap, but is much too optimistic for the cases we are interested in, which have a high degree of overlap. Table 1 compares the values obtained from Equation (2) with the results of simulations for various values of q. It can be seen that the predicted values are much too small for larger values of q. In fact, the predicted values rapidly fall below |r| as q increases, which is clearly incorrect, since |r_Σ| must always be greater than or equal to |r|. Because of the overlap of the projections, the assumption of independence cannot be used to obtain a valid approximation.

There are, as yet, no completely general results concerning the expected value of |r_Σ| in the overlapping case: the situation is apparently very complex and depends on the exact structure of Σ. However, the following proposition provides a useful formula for calculating this expected value:

PROPOSITION 6.1.

E(|r_Σ|) = |A|^n N(w_0) / C(|A|^n, |r|),

where C(m, k) denotes the binomial coefficient, w_0 is any tuple over D, and N(w_0) is the number of distinct relation instances, s, of cardinality |r| for which w_0 ∈ s_Σ.

Proof. The total number of tuples in all the recovered relations, s_Σ, for all the relation instances, s, of cardinality |r|, may be counted in two different ways, by summing over either the relations or the possible tuples, to obtain the equation

Σ_{relation instances s} |s_Σ| = Σ_{tuples w} N(w).

To remove the summations, we note that there will be precisely C(|A|^n, |r|) relations of cardinality |r| on the left-hand side and |A|^n tuples on the right-hand side. Also, by symmetry, the value of N(w) must be the same for each w, so we may select any fixed tuple w_0 and obtain

C(|A|^n, |r|) E(|r_Σ|) = |A|^n N(w_0).

With a little rearrangement, the result follows. ∎

Hence, the problem of finding the expected number of tuples in r_Σ has been reduced to the problem of calculating N(w_0) for any fixed tuple w_0.

6.2. EXACT VALUES FOR EXTREME CASES

Proposition 6.1 may be used to calculate an exact value for E(|r_Σ|) in two extreme cases:

COROLLARY 6.2. If |A| = 2 and Σ contains all subsets of D of cardinality n - 1, then

E(|r_Σ|) = |r| + (2^n - |r|) |r|^(n) / (2^n - 1)^(n),

where the notation x^(n) indicates a "falling factorial power," i.e., x^(n) = x(x - 1) ··· (x - n + 1).

Proof. The only possible relations which can contribute to N(w_0) are those which contain the tuple w_0, or else contain all of the n tuples which have the same values as w_0 on n - 1 attributes and differ on the other attribute. Simple counting arguments can be used to count both of these types of relations, giving

N(w_0) = C(2^n - 1, |r| - 1) + C(2^n - n - 1, |r| - n).

After some simplification, Proposition 6.1 gives the result. ∎

COROLLARY 6.3. If Σ contains all subsets of D of cardinality 1, then

E(|r_Σ|) = |A|^n (1 - (1 - 1/|A|)^{|r|})^n.

Proof. In this special case, the constraints do not overlap, so each constraint acts independently and we may use elementary probability arguments to obtain

N(w_0) = C(|A|^n, |r|) (1 - (1 - 1/|A|)^{|r|})^n.

Using Proposition 6.1 gives the result. ∎

Corollary 6.2 demonstrates that if Σ contains all subsets of D of cardinality n - 1, then the expected number of false drops is very small, provided that |r| is small compared to 2^n. In this extreme case, however, the amount of storage required is prohibitively large, since the maximum size of each projection is 2^{n-1}. At the opposite extreme, if Σ contains all subsets of D of cardinality 1, then the amount of storage required is only 2n bits, but Corollary 6.3 shows that the number of false drops is generally prohibitively large.

6.3. INTERMEDIATE CASES

Any set of values which results in an acceptable number of false drops and requires an acceptable amount of storage must lie in the region between these two extreme cases. What constitutes an acceptable value for either of these variables will obviously depend on the application. There is an enormous range of choices for Σ, and it is clearly not possible to present results for all of these options here. We will concentrate on one particularly natural family of choices, the "balanced incomplete block designs" [11]. In these designs, the sets S_j are chosen in such a way that every pair of attributes is contained in precisely one set S_j. There is a design of this type (derived from a finite projective geometry) for each n of the form s² + s + 1, where s is any prime number. In this design, p = s + 1 and q = n, so the total storage requirement for the projections is n|A|^p bits.

Because of the structure of the S_j in these examples, it is possible to calculate the expected number of false drops of a particular kind: tuples that differ in only a single digit from some tuple in the relation r. Tuples of this kind will be referred to as "neighbors" of r. (The possibility of restricting attention to false drops of this special type is suggested in [6].)

PROPOSITION 6.4. When Σ is a balanced incomplete block design, and the tuples of r are randomly distributed, then the expected number of false drops which are neighbors of r is approximately equal to

|r| n (|A| - 1)(1 - e^{-|r|/|A|^p})^p.

Proof. Consider a particular neighbor, w', of the relation r, which differs from the tuple w of r only on the attribute D_i. Only the projections π_{S_j}(r) for which S_j contains D_i can distinguish w' from w. Because Σ is a balanced incomplete block design, there will be precisely p projections of this kind, and they will all overlap only on D_i. Hence, they will behave independently on the other p - 1 attributes, and the same arguments used for the independent case above lead to the following formula, where a is used to count the number of tuples in r - {w} which agree with w' on D_i:

Σ_{a=0}^{|r|-1} C(|r|-1, a) (1/|A|)^a (1 - 1/|A|)^{|r|-1-a} (1 - (1 - 1/|A|^{p-1})^a)^p.

Using the binomial expansion, this expression may be recast in a more computationally convenient form which involves a summation over only p + 1 terms:

Σ_{b=0}^{p} C(p, b) (-1)^b ((1 - 1/|A|) + (1/|A|)(1 - 1/|A|^{p-1})^b)^{|r|-1}.

Each tuple in r has n(|A| - 1) neighbors, so there are approximately |r| n(|A| - 1) neighbors of r (neglecting the possibility of overlaps). Multiplying this number by the probability just derived, we obtain an estimate for the expected number of neighbors of r which result in false drops:

|r| n(|A| - 1) Σ_{b=0}^{p} C(p, b) (-1)^b ((1 - 1/|A|) + (1/|A|)(1 - 1/|A|^{p-1})^b)^{|r|-1}.    (3)

If we neglect all except first-order terms in 1/|A|^{p-1}, we obtain the approximation

|r| n(|A| - 1) Σ_{b=0}^{p} C(p, b) (-1)^b (1 - b/|A|^p)^{|r|-1}.

After some further approximation, the result follows. ∎

Proposition 6.4 may be used to obtain an estimate of the number of false drops which are neighbors for large values of n, since all of the approximations used to obtain the result become increasingly accurate for larger values of n. If the number of false drops of this restricted type is very small, then we would predict that the total number of false drops will also be very small, since false drops which are neighbors of r are much more likely to occur than other types. This prediction has been confirmed by small-scale simulations, as described below.

Note that the total storage requirement for the projections is n|A|^p bits, and the total number of bits required to hold the relation is |r| n log₂|A|. If we assume that the ratio between these two numbers is kept constant, as n increases, by increasing the value of |r|, then the space requirements of the representation will increase linearly with the size of the problem. Proposition 6.4 shows that in this case, the number of "neighbor" false drops will tend to zero, provided that |A|(1 - e^{-|r|/|A|^p}) < 1. To satisfy this condition, the number of bits used to store the projections must be at least

ln 2 / ((ln|A| - ln(|A| - 1)) ln|A|)

times the number of bits required to store the relation. This ratio is smallest when |A| = 2, when it is approximately 1.44. In other words, the optimal choice for |A| is 2, and in this case, the representation π_Σ(r) requires at least 44% more storage space than the relation itself in order to ensure that the number of false drops decreases with increasing n.
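The ratio quoted above is easy to check numerically; the following fragment (an illustration, not from the paper) evaluates ln 2 / ((ln|A| - ln(|A| - 1)) ln|A|) for a few alphabet sizes and confirms that it is smallest, at about 1.44, when |A| = 2.

from math import log

def min_storage_ratio(A):
    # ln 2 / ((ln|A| - ln(|A| - 1)) ln|A|): minimum ratio of projection storage
    # to relation storage needed for the neighbor false drops to tend to zero.
    return log(2) / ((log(A) - log(A - 1)) * log(A))

for A in (2, 3, 4, 8):
    print(A, round(min_storage_ratio(A), 2))   # 1.44 at |A| = 2, and larger thereafter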

6.4. SIMULATION RESULTS

The number N(w_0) defined in Proposition 6.1 is a convenient quantity to estimate using simulations, since we need only calculate π_Σ(r) for a random sample of relations r, and then determine the proportion of these which contain the particular value π_Σ({w_0}). For a direct estimation of E(|r_Σ|) which did not use Proposition 6.1, we would need to carry out the entire recovery process, which is very time-consuming when simulated on a serial machine.

The results of simulations giving an estimated value of N(w_0) for a very small-scale balanced incomplete block design of the type described above are given in Table 2. This table indicates that the total number of false drops obtained from the simulation results is close to the number of "neighbor" false drops predicted by Equation (3), used in the proof of Proposition 6.4, above. There is clearly scope for a great deal of further investigation to clarify the exact relationship between the choice of Σ and the number of false drops, and hence determine the optimal choice of parameters.
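For comparison with the simulation figures, the Equation (3) estimate itself can be recomputed directly; the following fragment (added here as an illustration) evaluates the p + 1 term form derived in the proof of Proposition 6.4 for the parameters of Table 2 below.

from math import comb

def eq3(r_size, n=13, p=4, A=2):
    # Equation (3): expected number of "neighbor" false drops for a balanced
    # incomplete block design, written as a sum of p + 1 terms.
    prob = sum((-1) ** b * comb(p, b)
               * ((1 - 1 / A) + (1 / A) * (1 - 1 / A ** (p - 1)) ** b) ** (r_size - 1)
               for b in range(p + 1))
    return r_size * n * (A - 1) * prob

for r_size in range(3, 8):
    print(r_size, f"{eq3(r_size):.3f}")   # 0.034, 0.141, 0.389, 0.860, 1.641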

TABLE 2
Expected Number of False Drops (n = 13, p = 4, |A| = 2)

|r|     Equation (3)     Simulation
3       0.034            0.018
4       0.141            0.137
5       0.389            0.463
6       0.860            0.996
7       1.641            2.123

7. CONCLUSION

In this paper, we have described a distributive representation for database relations, and have investigated a technique which may be used to recover the relation from this representation. In particular, we have demonstrated that the size of the search tree explored by this recovery technique grows linearly with the number of

tuples recovered, in some simple cases. This fact provides a concrete illustration of the theoretical results of Stone and Sipala [23] concerning backtrack search with cutoff. They have shown that search algorithms with a potentially exponential complexity can be restricted to a linear growth in complexity if there is some means to identify fruitless search paths and avoid them.

Although the recovery process may recover additional, spurious tuples, which are referred to as false drops, we have shown that by using sufficiently many projections, the expected number of false drops can be kept very low. The analysis above indicates that in some cases, this may be achieved using a set of projections for which the total storage space required is a linear function of the space required for the original relation.

The analysis presented here explains the experimental results reported in [27], for which no theoretical explanation has previously been offered. However, the precise conditions under which the space complexity of the representation and the time complexity of the recovery operation are guaranteed to be linear are still open questions.

Since the tuples are recovered in lexicographic order, the representation and recovery technique we have described could be used purely as a distributive sorting algorithm. However, if we compare the cost of this technique with other parallel sorting algorithms using special-purpose hardware which have previously been proposed, then it is clear that the large bit arrays and dedicated logic circuits required cannot be justified purely as a sorting engine. For example, a parallelized bubble-sort algorithm using an array of comparison-exchange modules has a linear time and space complexity [4]. Alternatively, a bitonic sort algorithm for N items using log² N processors, each capable of performing word-parallel comparison-exchange operations, has a linear time complexity and a space complexity which is O(log³ N) [24].

Although not competitive for sorting alone, the fact that the same hardware may also be used to perform a number of different relational operations with very high parallelism appears to justify the further investigation of this technique. Using this hardware, the total time complexity for answering a query involving any number of these relational operations and sorting the result is simply the time required to form the distributive representation (which is linear in the cardinalities of the operand relations), plus the time required to recover the result relation (which is linear in the cardinality of the result relation in some cases), plus a constant time for each operation. We have shown that the total space required for the representation, and hence the cost of the hardware, can be limited to a linear function of the storage space required for the operand relations in some cases.


Conceptually, this work links relational database implementation with abstract constraint satisfaction problems, and this cross-fertilization has resulted in a radically new approach to the implementation of relational operations.

REFERENCES 1. E. Babb, Implementing ACM

a relational

database

Transactions on Database Systems 4:1-29

by means of specialized

hardware,

(1979).

2. D. Bitton, H. Boral, D. J. Dewitt,

execution of relational

and W. K. Wilkinson, Parallel algorithms for the database operations, ACM Transactions on Database Systems

8324-353 (1983). 3. D. Bitton and D. J. Dewitt, Duplicate record elimination in large data files, ACM Transactions on Database Systems 8255-265 (1983). 4. K. M. Chung, F. Luccio, and C. K. Wong, On the complexity of sorting in magnetic bubble memory systems, IEEE Transactions on Computers C-29:553-562 (19801. 5. E. F. Codd, A relational model of data for large shared databanks, Communications of the ACM 13:377-387 (1970). 6. M. C. Cooper, Estimating optimal parameters for parallel International Journal of Systems Science, to appear.

database

hardware,

7. D. Dewitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. Wood, Implementation techniques for main memory database systems, in Proceedings of SIGMOD (Boston, 19841, ACM, New York. 8. E. C. Freuder, Synthesizing constraint expressions, Communications of the ACM 21:958-966 (1978). 9. E. C. Freuder, A sufficient condition for backtrack-free search, Journal of the ACM 29:24-32 (1982). 10. R. M. Haralick and L. G. Shapiro, The consistent labeling problem: Part I, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1:173-184 (1979). 11. D. R. Hughes and F. C. Piper, Design Theory, Cambridge University Press, 1985.

12. L. R. Johnson and M. H. McAndrew, On ordered retrieval from an associative memory, IBM Journal of Research and Development 189-193 (1964). 13. D. E. Knuth, Sorting and Searching, Addison-Wesley, Reading, MA, 1973. 14. A. K. Ma&worth, Consistency in networks of relations, Artificial Intelligence 8:99-118 (1977). 15. D. Maier, The Theory of Relational Databases,

Computer Science Press, Rockville, Maryland, 1983. 16. U. Montanari, Networks of constraints: Fundamental properties and applications to picture processing, Information Sciences 7:95-132 (1974). 17. B. Nudel, Consistent-labeling problems and their algorithms: Expected complexities and theory-based heuristics, Artificial Intelligence 21:135-178 (19831. 18. L. Raschid, T. Fei, H. Lam, and S. Y. W. Su, A special-function unit for sorting and sort-based database operations, IEEE Transactions on Computers 35:1071-1077 (1986).

19. J. Rissanen,

Theory of joins for relational

7th Symp. on Math. Found.

Springer-Verlag,

of Comp.

1978, pp. 537-551.

databases-A

Sci., Lecture

tutorial survey, in Proc.

Notes in Computer

Science

64,

256

P. JEAVONS

20. C. S. Roberts, Partial match retrieval via the method of superimposed codes, Proceedings of the IEEE 67: 1624- 1642 (1979). 21. A. Rosenfeld, R. A. Hummel, and S. W. Zucker, Scene labelling by relaxation operations, IEEE Transactions on Systems, Man, and Cybernetics SMC-6:420-434 (1986). 22. L. D. Shapiro, Join processing in database systems with large main memories, ACM Transactions on Database Systems 11:239-264 (1986). 23. H. S. Stone and P. Sipala, The average complexity of depth-first search with backtracking and cutoff, IBM Journal of Research and Development 30:242-258 (19861. 24. C. D. Thompson, The VLSI complexity of sorting, IEEE Transactions on Computers C-32:1171-1184 (19831. 25. J. D. Ullman, Principles of Database and knowledge Base Systems, Vol. 1 and 2, Computer Science Press, Rockville, Maryland, 1988-1989. 26. J. R. Ullmann, R. M. Haralick, and L. G. Shapiro, Computer architecture for solving consistent labelling problems, The Computer Journal 28:105-111 (198.5). 27. J. R. Ullmann, Fast implementation of relational operations via inverse projections, The Computer Journal 31:147-154 (1988). 28. J. R. Ullmann, personal communication. 29. J. R. Ullmann, Distributive implementation of relational operations, IEE Proceedings 137:283-294 (1990). 30. P. Valduriez and G. Gardarin, Join and semijoin algorithms for a multiprocessor database machine, ACM Transactions on Database Systems 9:133-161 (1984). Received 18 December 1991; reuised 22 March 1993