
Parallel Transitive Closure Computation in Relational Databases

XIAOFANG ZHOU

CSIRO Division of Information Technology, GPO Box 664, Canberra, ACT 2601, Australia

YANCHUN ZHANG

Department of Maths and Computing, The University of Southern Queensland, Toowoomba, QLD 4350, Australia

and

MARIA E. ORLOWSKA

Department of Computer Science, The University of Queensland, St. Lucia, QLD 4072, Australia

ABSTRACT

The transitive closure operation is an important extension to relational algebra. Because of its high computation cost, it is of great interest to design efficient parallel algorithms for computing the transitive closure in relational database systems. In this paper, we present a new algorithm to compute transitive closures on SIMD meshes based on relational algebra operations. Double-hash distribution is used so that new tuples need not be rehashed for the next join phase, and no extra step is needed to redistribute these tuples. Possible redundant computation between different join phases is prevented without using global operations. As only regular linear communication occurs on the mesh, and the workload is fully distributed, a speedup of O(n × n) is achieved, where n × n is the size of the mesh. Therefore, this algorithm is an optimal parallel version of the transitive closure algorithms based on relational algebra operations on SIMD meshes.

1. INTRODUCTION

With the increasing popularity of database systems, many new applications appear to challenge existing database technology. Some problems, such as bill of materials, shortest path, and critical path, cannot be expressed in relational algebra, which has neither recursion nor iteration constructs. Hence, there have been many proposals to extend relational algebra with the transitive closure operation [12]. A query with a transitive closure (TC) operator is intrinsically more expensive than a join query because it consists of a sequence of joins. Much effort has been devoted to efficient TC algorithm design in the last decades [2, 3, 5, 22, 23, 25, 26]. With the emergence of massively parallel computers like the MasPar MP-1, CM-2, and CM-5, it has become possible to compute the TCs of very large relations with a satisfactory response time. This has drawn substantial attention from the database research community [4, 9, 11, 14, 16-18, 24, 27, 28]. A survey can be found in [6].

Basically, there are two classes of TC algorithms. One is based on relational algebra operations [5, 9, 11, 24, 23], and the other is based on matrix manipulation [2, 3, 16, 22, 25, 26]. Because of the regularity of matrix manipulation, the latter class of TC algorithms has been well studied on various parallel architectures such as the cube, the mesh, and the systolic array [4, 17, 18, 13]. However, the matrix representation of a relation is impractical when the relation becomes very large [27]. Our TC algorithms are based on relational algebra operations; the algorithms in this class consist of a sequence of joins and set unions.

There are three ways to evaluate a TC query in parallel. A naive way is to decompose a TC query into resolvent subqueries, each of which uses a full copy of the data. This is not practically attractive because of its higher computation cost and the difficulty of avoiding redundant computation when compared with other parallel TC algorithms. The second way is to use the disconnection set approach [14]. By fragmenting the data according to rules stemming from the application domain, queries can be split into several independent subqueries. These subqueries are computed in parallel, each on only a part of the data, and intersubquery communication is minimized by the fragmentation scheme. This is an attractive approach in a distributed environment. However, how to fragment the data so that parallel evaluation of the transitive closure achieves load balancing while keeping communication cost minimal remains somewhat application-dependent. The third way is to use parallel-hash join algorithms to parallelize each join operation in the TC computation [24, 9]. As TC evaluation consists of a sequence of joins, the result tuples of one join operation need to be redistributed across the processors for the next join operation. This can result in a high computation cost to determine, by rehashing, where a new tuple should be sent, and a high communication cost for shipping the data. A double-hash data partitioning method is proposed in [9] to avoid the computation overhead. In comparison with the disconnection set approach, this approach does not depend on domain knowledge and is thus more general. But it suffers from excessive communication cost when the number of processors is large. This problem becomes more severe on SIMD (Single Instruction Stream, Multiple Data Stream) machines (e.g., the MasPar), where different processors that request to send data to different destination processors at the same time have to act sequentially. The approach is also criticized in [1] for not being able to eliminate duplicate tuples on the fly.

In this paper, we present a method to design parallel TC algorithms for mesh-connected massively parallel computers using parallel-hash join algorithms. The double-hash partition method is used to parallelize the computation so that rehashing of new tuples can be avoided. Our contributions are a regularized communication pattern for processors to exchange data between join phases on the mesh, and a method to eliminate duplicates on the fly without using global set union operations. As only regular linear communication occurs on the mesh, and the workload is fully distributed, a computational speedup of O(n × n) is achieved, where n × n is the size of the mesh. Therefore, this algorithm is an optimal parallel version of the TC algorithms based on relational algebra operations on SIMD meshes.

The rest of this paper is organized as follows. Section 2 gives a brief overview of both sequential and parallel TC algorithms, as well as the features of SIMD meshes. The design of parallel TC algorithms for SIMD meshes is presented in Section 3, and an empirical study of these algorithms on the MasPar MP-1 is reported in Section 4. We conclude this paper in Section 5.

2. PRELIMINARIES

2.1. TRANSITIVE CLOSURE

Let R be a relation schema with two (possibly composite) attributes X and Y which are defined over the same domain D. Then a table r with schema R, denoted r(R), can be represented as a directed graph G = (V, E), where the node set is V = {v ∈ D: ∃a ∈ D, ((a, v) ∈ r) ∨ ((v, a) ∈ r)} and the edge set is E = {(v, v'): (v, v') ∈ r}. A path from node u to node v in V is an ordered set of edges {(s_i, e_i): i = 1, ..., k} ⊆ E such that s_1 = u, e_i = s_(i+1) for 1 ≤ i ≤ k - 1, and e_k = v. The length of the path is k, k ≥ 1. The transitive closure of r(R) then corresponds to the transitive closure of G, in which there is an edge from node u to node v if and only if (iff) there exists a path in G from u to v. Note that it is possible to have more than one path between a pair of nodes.


The problem of generalized transitive closure requires one to find, in addition to the existence of paths, the lengths of paths [21, 12].

Let "∘" denote the composition operation. Then r(R) ∘ r(R) has the same scheme as R. Clearly, r ∘ r = {(a, c): ∃b ((a, b) ∈ r ∧ (b, c) ∈ r)}. That is, all the node pairs between which there exists at least one path of length 2 are contained in r ∘ r. Let r^1 = r and r^i = r^(i-1) ∘ r; then r^i contains all node pairs such that between the two nodes of each pair there is a path of length i, and ∪_{j=1}^{i} r^j contains all node pairs connected by a path of length less than or equal to i. Let d be the length of the longest path in G; the transitive closure of r is r+ = ∪_{i=1}^{d} r^i.

EXAMPLE 1. Let r be a relation with scheme (X, Y). There are seven tuples in r:

r = {(a,b), (a,c), (b,d), (c,d), (d,e), (d,f), (c,f)}.

Its corresponding graph is shown in Figure 1(a). To find r+, we get

r^1 = r = {(a,b), (a,c), (b,d), (c,d), (d,e), (d,f), (c,f)},

r^2 = {(a,d), (b,e), (c,f), (b,f), (c,e)},

r^3 = {(a,e), (a,f)},

Fig. 1. An example of transitive closure: (a) the graph of r; (b) the graph of r+.


and any r^i with i > 3 contains no new tuples. Therefore, r+ = r^1 ∪ r^2 ∪ r^3.

The graph of r+ is shown in Figure 1(b). Note that r^1 ∩ r^2 = {(c,f)} ≠ ∅ here.

2.2. DATALOG AND RELATIONAL DATABASES

Relational query languages are not sufficient to express recursive queries. With the development of artificial intelligence and deductive database technologies in recent years, substantial efforts have been made to introduce logic into databases. An important contribution is the integration of logic programming and databases. A rule-based language, called Datalog, has been designed to interact with large databases [20]. A Datalog query is a finite set of rules of the form A ← A1, A2, ..., An, where n ≥ 0. A is the head of the rule, and A1, A2, ..., An form the body. A rule is nonrecursive if the predicate symbol of the head does not appear in the body; otherwise, it is recursive. If the body is empty, the rule is called a fact. The correspondence between relations and Datalog rules is that the underlying data model of Datalog is essentially the relational model [19]. Predicate symbols correspond to relations; for example, predicate l(a, b, c) holds iff the tuple (a, b, c) is in the relation L (we use lowercase letters for predicates and capital letters for the corresponding relations). Note that attributes are referred to in Datalog by their position among the arguments of a given predicate symbol. Relations can be defined as extensional database (EDB) relations, which are stored in a relational database, or intensional database (IDB) relations, which are defined by logical rules. IDB relations correspond to views in the relational model, but they are more powerful, as they can also support recursion. Each clause of a Datalog query can be translated, by a syntax-directed translation algorithm, into an inclusion relationship of relational algebra [7]. The set of inclusion relationships that refer to the same predicate is then interpreted as an equation of relational algebra. Thus, we say a Datalog query gives rise to a system of algebraic equations. The translation from Datalog to relational algebra is described in [8]. A Datalog equation of a nonrecursive rule can simply be evaluated by normal query-processing techniques. For a nonrecursive Datalog rule, any predicate symbols with common arguments give rise to joins, while the head symbol determines the projection.


EXAMPLE 2. Given a Datalog rule

grandparent(X, Y) :- parent(X, Z), parent(Z, Y).

Predicate parent refers to relation Parent(parent, child). Using SQL, this rule is equivalent to the following view:

CREATE VIEW Grandparent(grandparent, grandchild) AS
SELECT p1.parent, p2.child
FROM Parent p1, Parent p2
WHERE p1.child = p2.parent

and the relational algebraic expression of this Datalog rule is π_{1,4}(Parent(parent, child) ⋈_{2=3} Parent(parent, child)).

It is more difficult to compute the meaning of a Datalog equation originating in a recursive rule. In logic programming terminology, a linear sirup consists of one nonrecursive rule (the exit rule) and one linear recursive rule. It is known that each linear sirup in Datalog is equivalent to a transitive closure [6]. In fact, any linear logic program can be converted to a transitive closure [21].

EXAMPLE 3. Consider the following Datalog query:

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

This query is a linear sirup. Parent is an EDB relation, and Ancestor is an IDB relation. The query in fact computes the transitive closure of relation Parent. This becomes more obvious by looking at a bottom-up evaluation of the query; step i shows which tuples qualify for Ancestor if the recursive rule is applied i times.

Step 0: ancestor(X, Y) :- parent(X, Y).

Step 1: ancestor(X, Y) :- parent(X, Z), parent(Z, Y).

Step 2: ancestor(X, Y) :- parent(X, Z), parent(Z, U), parent(U, Y).

Step 3: ancestor(X, Y) :- parent(X, Z), parent(Z, U), parent(U, V), parent(V, Y).

Step i can be translated into the relational algebra expression π_{1,4}(Parent^(i-1) ⋈_{2=3} Parent). This procedure continues until no more new tuples can be derived, and the result of the Datalog query is the union of the results from all steps. Therefore, we have

Ancestor(X, Y) = Parent+(X, Y).

In this way, a recursive Datalog query is translated into an expression of relational algebra supplemented by a transitive closure operator.

2.3. TRANSITIVE CLOSURE EVALUATION

The transitive closure of a relation r can be computed using relational algebra operations by implementing r1 ∘ r2 as π_{1,4}(r1 ⋈_{2=3} r2) [5, 23, 24]. The naive algorithm (Algorithm 1) is a representative: we initialize r' to the input relation r, and repeatedly apply the composition operation to obtain new tuples for r' until no more new tuples can be added. The relational algebra operations used in the naive algorithm are composition (i.e., join and projection), union, and set comparison.

ALGORITHM 1. The naive TC algorithm.

Begin
(1) r' ← r;
(2) repeat
(3) r'' ← r';
(4) r' ← r' ∪ (r' ∘ r);
(5) until r'' = r';
(6) return (r');
End.

The naive algorithm is simple but not efficient. For example, if there are two paths of different lengths from node v1 to node v2, and there are paths from v2 to v3, ..., vn, then the naive algorithm generates the tuples (v1, v3), ..., (v1, vn) at least twice. A "semi-naive" algorithm, shown as Algorithm 2, is suggested to avoid this kind of redundancy by removing, before the next composition, all the tuples that already exist in r' [5].


In this case, one more relational algebra operation, the set minus operation, is added. In Algorithm 2, Δr is used to contain the new tuples that have not appeared before.

ALGORITHM 2. The semi-naive TC algorithm.

Begin
(1) r' ← Δr ← r;
(2) repeat
(3) Δr ← Δr ∘ r;
(4) Δr ← Δr - r';
(5) r' ← r' ∪ Δr;
(6) until Δr = ∅;
(7) return (r');
End.

Both the naive and semi-naive algorithms are iterative; they need d iterations to finish, where d is the length of the longest path in the graph of r. By composing r' with itself in Algorithm 1, i.e., by replacing line 4 of Algorithm 1 with

(4) r' ← r' ∪ (r' ∘ r');

the number of iterations can be reduced to log d. This is known as the logarithmic algorithm [23].
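To make the control flow of Algorithm 2 concrete, the following C sketch runs the semi-naive iteration on the relation of Example 1. It is only an illustration: the boolean adjacency-matrix representation, the bound MAXN, and the integer node encoding are assumptions made here for brevity, not the relational, tuple-based implementation discussed in this paper.

#include <stdio.h>
#include <string.h>

#define MAXN 8                     /* assumed upper bound on node count */

/* closure and delta are boolean adjacency matrices; a 1 in row u, column v
 * stands for the tuple (u, v). */
static int closure[MAXN][MAXN], delta[MAXN][MAXN];

static void seminaive_tc(int r[MAXN][MAXN], int n) {
    int next[MAXN][MAXN];
    memcpy(closure, r, sizeof(closure));            /* line 1: r' <- Δr <- r       */
    memcpy(delta, r, sizeof(delta));
    for (;;) {                                      /* lines 2-6: repeat ... until  */
        int produced = 0;
        memset(next, 0, sizeof(next));
        for (int a = 0; a < n; a++)                 /* line 3: Δr <- Δr ∘ r         */
            for (int b = 0; b < n; b++)
                if (delta[a][b])
                    for (int c = 0; c < n; c++)
                        if (r[b][c] && !closure[a][c]) {   /* line 4: Δr <- Δr - r' */
                            next[a][c] = 1;
                            produced = 1;
                        }
        if (!produced) break;                       /* line 6: until Δr = ∅         */
        for (int a = 0; a < n; a++)                 /* line 5: r' <- r' ∪ Δr        */
            for (int b = 0; b < n; b++)
                if (next[a][b]) closure[a][b] = 1;
        memcpy(delta, next, sizeof(delta));
    }
}

int main(void) {
    /* Example 1 with the encoding a=0, b=1, c=2, d=3, e=4, f=5 */
    static int r[MAXN][MAXN];
    int edges[7][2] = {{0,1},{0,2},{1,3},{2,3},{3,4},{3,5},{2,5}};
    for (int k = 0; k < 7; k++) r[edges[k][0]][edges[k][1]] = 1;
    seminaive_tc(r, 6);
    int count = 0;
    for (int a = 0; a < 6; a++)
        for (int b = 0; b < 6; b++)
            count += closure[a][b];
    printf("|r+| = %d\n", count);   /* prints 13 for the graph of Example 1 */
    return 0;
}

For the graph of Example 1 the sketch reports 13 closure tuples, matching r^1 ∪ r^2 ∪ r^3 computed above.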

Discussions of the relative merits of these sequential TC algorithms can be found in [8]. From the efficiency viewpoint, the naive algorithm is worse than the other two; the semi-naive algorithm is the best in terms of keeping the composition operand sizes small, while the logarithmic algorithm is the best in terms of the number of iterations.

Before going into the detailed parallel TC algorithms, we outline here how TC computation is parallelized with parallel-join algorithms and show where the problems are. A parallel TC algorithm exploiting data-partitioning parallelism, no matter which parallel-join algorithm is used, is basically a repetition of three steps: data partitioning, parallel join, and redundancy removal, as shown in Figure 2. When the parallel-hash join algorithm is used on an SIMD mesh architecture to compute transitive closures, some serious problems arise. After an initial distribution of the original relation, each processor needs to determine from which processors to obtain the required part of the operand relation, and to which processors the new tuples should be sent for the next parallel join.

Fig. 2. A general outline of parallel TC algorithms.

Though the computation of rehashing intermediate tuples to find their destination processors can be avoided by using the algorithm proposed in [9], two major problems remain to be solved:

1. Eliminating redundant computation: some of the tuples generated by a join may already exist in the original relation, or may already have been generated in previous joins or by other processors. If not eliminated, they cause redundant computation. To remove these tuples, a global union is needed, which is an expensive operation in a parallel environment [6].

2. Reducing communication cost: although rehashing can be avoided, redistribution of new tuples among all the processors is unavoidable. Each processor may produce new tuples that have to be sent to any other processor to perform the next join. This data exchange procedure can compromise the benefits of parallel computing.

2.4. MESH-CONNECTED SIMD MACHINES

In a mesh, processors are arranged in a rectangular two-dimensional lattice. A processor is addressed as p(i, j) if it is the jth processor on the ith row, and the top-left one is p(0, 0). In an n × n mesh, p(i, j), 0 ≤ i, j ≤ n - 1, is directly connected with p(i, j ± 1) and p(i ± 1, j), provided they exist. It is a wrapped mesh if p(i, j) is connected with p(i, (j ± 1) mod n) and p((i ± 1) mod n, j). Figure 3 shows a 4 × 4 mesh and a 4 × 4 wrapped mesh. The processors have their own memories.

Fig. 3. Two 4 × 4 meshes: (a) a 4 × 4 mesh; (b) a 4 × 4 wrapped mesh.

In an SIMD parallel machine, all the processors execute the same instruction on their own data. To give a concrete picture of SIMD meshes, we describe the main relevant features of the MasPar MP-1 here. The MasPar MP-1 is a massively parallel mesh-connected SIMD computer [10]. It consists of 1024 to 16,384 processor elements (PEs), which execute the same tasks in parallel (SIMD parallelism). Each PE has a local memory of 16 kbytes. Interprocessor communication is handled by two separate mechanisms. Regular communication is handled by the X-Net mesh, which links each PE with its eight nearest neighbors; the aggregate X-Net bandwidth for a 1024-PE array is 1.1 Gbyte/s and increases linearly with system size. Random communication between arbitrary PEs is possible via a multistage crossbar router network, which provides a circuit-switched two-way connection between PEs. Its bandwidth is only 1/16 that of the X-Net. Moreover, synchronization cost may be incurred when more than one PE wants to communicate with the same target PE, because it is the user's responsibility to avoid such conflicts by explicit control.
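As a small illustration of the wrapped-mesh addressing that the algorithms in Section 3 rely on (for example, assignments of the form T_ij ⇐ T_i,(j-1) mod n), the following C sketch computes the four neighbors of a processor p(i, j) with modular arithmetic. The mesh side N is an arbitrary assumption, and the code is not part of the MPL implementation described later.

#include <stdio.h>

#define N 4   /* assumed mesh side length */

/* Neighbors of p(i,j) on an N x N wrapped mesh: indices wrap around modulo N. */
static void wrapped_neighbors(int i, int j) {
    int north = (i - 1 + N) % N, south = (i + 1) % N;
    int west  = (j - 1 + N) % N, east  = (j + 1) % N;
    printf("p(%d,%d): N=p(%d,%d) S=p(%d,%d) W=p(%d,%d) E=p(%d,%d)\n",
           i, j, north, j, south, j, i, west, i, east);
}

int main(void) {
    wrapped_neighbors(0, 0);   /* the top-left corner wraps to row N-1 and column N-1 */
    wrapped_neighbors(3, 3);
    return 0;
}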

3. NEW TRANSITIVE CLOSURE ALGORITHMS FOR SIMD MESHES

3.1. DATA DISTRIBUTION STRATEGY

Data partitioning and distribution are crucial to the efficient computation of the transitive closure. A good data distribution must not only make it possible to perform the relational algebra operations fully in parallel, but also minimize the data exchange cost during the operations.


DEFINITION 1. Let R(X, Y) be a relation scheme, where X, Y are defined on domain D, and let f: D → (1...n) be a hash function. A double-hash partition is a mapping from r(R) into n × n buckets r_ij such that

r_ij = {t ∈ r: f(t.X) = i ∧ f(t.Y) = j},  1 ≤ i, j ≤ n.

All the buckets r_ij together, 1 ≤ i, j ≤ n, form a partition of r. This concept was first proposed in [9] to avoid rehashing of result tuples. Figure 4 shows the idea of double-hashing. In Figure 4, Y_i = {t ∈ r: f(t.Y) = i}, X_i = {t ∈ r: f(t.X) = i}, and X_ij = {t ∈ r: f(t.X) = i ∧ f(t.Y) = j}. It is clear that Y_i ∘ X_ij only produces tuples whose hash value on Y is j, i.e., there is no need to rehash the tuples in Y_i ∘ X_ij to find their hash values on Y.

Fig. 4. Double-hash can avoid rehashing.

Now we give a theorem about the double-hash partitioning.

THEOREM 1. Let {r_ij: i, j = 1...n} be a double-hash partition of r; then

r ∘ r = ∪_{i=1}^{n} ∪_{k,j=1}^{n} (r_ki ∘ r_ij),    (1)

and

Δr_ij = ∪_{k=1}^{n} (r_ik ∘ r_kj),    (2)

where Δr_ij = {t ∈ r ∘ r: f(t.X) = i ∧ f(t.Y) = j}.

Proof. Let X_i = {t ∈ r: f(t.X) = i}, X_ij = {t ∈ r: f(t.X) = i ∧ f(t.Y) = j}, Y_i = {t ∈ r: f(t.Y) = i}, and Y_ij = {t ∈ r: f(t.Y) = i ∧ f(t.X) = j}. It is clear that X_i = ∪_{j=1}^{n} X_ij, Y_i = ∪_{k=1}^{n} Y_ik, and r_ij = X_ij = Y_ji. One can derive, from the bucket partitioning procedure, that r ∘ r = ∪_{i=1}^{n} (Y_i ∘ X_i). Therefore,

r ∘ r = ∪_{i=1}^{n} (Y_i ∘ X_i)
      = ∪_{i=1}^{n} ∪_{j=1}^{n} (Y_i ∘ X_ij)
      = ∪_{i=1}^{n} ∪_{k,j=1}^{n} (Y_ik ∘ X_ij)
      = ∪_{i=1}^{n} ∪_{k,j=1}^{n} (r_ki ∘ r_ij).

Therefore, (1) holds. Equation (2) follows directly from the definition of Δr_ij; its proof is omitted here. □

A double-hash partition of r(X, Y) can be obtained by hashing r into n buckets on the value of column X first; each bucket is then rehashed into n buckets on the value of column Y.

DEFINITION 2. Let {r_ij: i, j = 1...n} be a double-hash partition of r, and P = {p_ij: i, j = 1...n}. A double-hash distribution of r over P assigns bucket r_ij to processor p_ij, 1 ≤ i, j ≤ n.

Then we can obtain the following corollary, which defines where a processor gets the data it composes with.

COROLLARY 1. Let {r_ij: i, j = 1...n} be a double-hash distribution on {p_ij: i, j = 1...n}. Then processor p_ij needs, and only needs, to join with the data that resides on the ith column of the mesh to complete r ∘ r, 1 ≤ i ≤ n.

Proof. From (1) of Theorem 1, we know that r ∘ r = ∪_{i=1}^{n} ∪_{k,j=1}^{n} (r_ki ∘ r_ij); that is, r ∘ r = ∪_{i,j=1}^{n} ∪_{k=1}^{n} (r_ki ∘ r_ij). For given i, j, r_ij needs to join with all r_ki, 1 ≤ k ≤ n, and these buckets reside on the ith column. So this corollary holds. □

After r^2 has been computed, it needs to be redistributed over the mesh so that r^2 again follows the double-hash distribution and the next phase of composition can be performed in the same way. The destination processors of the intermediate-result redistribution originating from a processor are confined by Corollary 2.

COROLLARY 2. Under the double-hash distribution, all the results generated by p_ij need only be redistributed along the jth column, 1 ≤ i, j ≤ n.
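The following C sketch illustrates the double-hash partition of Definition 1 and the property used by Theorem 1 and Corollaries 1 and 2. It is only a sketch under assumed names and sizes (the hash f(v) = v mod n follows Section 4; the mesh side N and the bucket capacity MAXB are arbitrary), not the authors' implementation.

#include <stdio.h>

#define N 4        /* assumed mesh side length n           */
#define MAXB 64    /* assumed per-bucket capacity (tuples) */

typedef struct { int x, y; } Tuple;

typedef struct { Tuple t[MAXB]; int size; } Bucket;

static Bucket bucket[N][N];   /* bucket[i][j] plays the role of r_ij on processor p_ij */

static int f(int v) { return ((v % N) + N) % N; }   /* hash on the common domain */

/* Double-hash partition: tuple (x, y) goes to bucket r_{f(x), f(y)}. */
static void double_hash_partition(const Tuple *r, int count) {
    for (int k = 0; k < count; k++) {
        Bucket *b = &bucket[f(r[k].x)][f(r[k].y)];
        if (b->size < MAXB)
            b->t[b->size++] = r[k];
    }
}

int main(void) {
    Tuple r[] = { {1,2}, {2,3}, {3,4}, {4,5}, {5,6} };   /* a small example chain */
    double_hash_partition(r, (int)(sizeof(r) / sizeof(r[0])));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (bucket[i][j].size > 0)
                printf("r_%d%d holds %d tuple(s)\n", i, j, bucket[i][j].size);
    return 0;
}

Because a result tuple (a, c) of r_ki ∘ r_ij satisfies f(a) = k and f(c) = j, its destination bucket r_kj is known without any rehashing, which is exactly the property the double-hash distribution exploits.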

3.2. PARALLEL LOGARITHMIC TC ALGORITHMS

3.2.1. The Basic Algorithm

Using the double-hash distribution strategy, a new parallel TC algorithm for SIMD meshes is proposed (Algorithm 3). The original relation r is distributed over P by double-hashing. In Algorithm 3, "←" stands for intraprocessor relation assignment as before, and "⇐" for interprocessor relation assignment; note that "⇐" only occurs between directly connected processors. Buffer r_ij resides on processor p_ij and contains the piece of the operand relation assigned to p_ij by the double-hash distribution. T_ij contains the data to be joined with r_ij. Δr_ij mainly contains the join results, although it is also used as a temporary buffer for passing data, as we shall explain later. We explain how Algorithm 3 works before proving its correctness.

ALGORITHM 3. The basic parallel logarithmic TC algorithm.

Begin
(1) r_ij ← {t ∈ r: f(t.X) = i ∧ f(t.Y) = j}; /* double-hash distribution */
(2) do /* phase loop */
(3) a_ij = 0; /* all processors disabled */
(4) Δr_ij ← T_ij ← r_ij; s_ij = |r_ij|; /* phase initialization */
(5) for k = 0 to 2n - 1 do /* step loop */
(6) if (k = (j - i) mod n) then
(7) a_ij = 1; /* the processor becomes active */
(8) end_if
(9) if (1 ≤ a_ij ≤ n) then /* each processor joins n times */
(10) Δr_ij ← T_ij ∘ r_ij; /* local join */
(11) T_ij ⇐ T_i,(j-1) mod n; /* send data rightward */
(12) end_if
(13) Δr_ij ⇐ Δr_(i+1) mod n,j; /* send data upwards */
(14) if k < n then /* data routes turn at the diagonal processors */
(15) T_ii ← Δr_ii;
(16) end_if
(17) if a_ij > n then /* result assembling */
(18) r_ij ← r_ij ∪ {t ∈ Δr_ij: f(t.X) = i ∧ f(t.Y) = j};
(19) elsif (a_ij > 0) then
(20) a_ij = a_ij + 1;
(21) end_if
(22) end_for
(23) while (∃p(i, j) ∈ P, s_ij ≠ |r_ij|);
(24) return (∪_{i,j=1}^{n} r_ij);
End.

Algorithm 3 is executed on all the processors. A while-iteration (lines 2-22) in Algorithm 3 is called a phase, and a for-iteration (lines 5-21) is a step. From Corollary 1, we know that any bucket on the ith row only needs to join with the buckets from the ith column. We name these n × n joins J_i = {r_ki ∘ r_ij: k, j = 1...n}. In Algorithm 3, variable a_ij on p_ij is used to control the behavior of p(i, j) (lines 9-12): p_ij performs T_ij ∘ r_ij when 1 ≤ a_ij ≤ n and assembles the results when a_ij > n (lines 17-20). Initially, both T_ij and Δr_ij are assigned the same data as r_ij (line 4). All the processors use Δr_ij to send their buckets r_ij upwards in a pipelined way (line 13). As wrapped meshes are assumed, the processors on the top row send their buckets to the corresponding processors on the bottom row. Each processor on the leading diagonal, p_ii, is the entry point through which the buckets of the ith column enter the ith row: it switches the up-going buckets rightward by assigning the content of Δr_ii to T_ii (line 15). Every processor starts a local join whenever it has received a right-going bucket in T_ij (line 10), which is passed rightward after the join (line 11), and the join results are packed into Δr_ij to be sent upwards (lines 10 and 13).


As Δr_ij is used both to store the newly generated tuples of the current composition on p_ij and to pass the up-going new tuples from other processors, some simple data structures, such as pointers recording the location of the added tuples, should be used to avoid rehashing when assembling the results later. Also, because the meshes are wrapped, the processors on the rightmost column send the right-going buckets to the processors on the leftmost column. Therefore, the route of the buckets in J_i forms a ring of 2n processors. We call such a ring a J_i-ring. Figure 5 shows two join rings with their join points. Following the join rings, the data exchange in a phase has a regular pattern: along the J_i-ring, Δr_ij routes vertically (line 13) and T_ij routes horizontally (line 11), 1 ≤ i, j ≤ n.
Fig. 5. Two join rings with join points.

Not only is rehashing of the new tuples unnecessary because of the double-hash distribution, but any extra steps to send the new tuples to their destinations are also unnecessary. The final closure is the union of all r_ij when the algorithm finishes. It must be pointed out that no suppression of duplicate tuples is required at this stage. Now we prove the correctness of Algorithm 3.

LEMMA 1. Let {r_ij: i, j = 1...n} be a double-hash distribution on an n × n wrapped mesh. A single phase of Algorithm 3 performs r ∘ r, and

∪_{i,j=1}^{n} r_ij = r ∪ r^2

at the end of the phase.

Proof. From lines 9-12, we know that each processor performs n compositions. As all join operands must go through the join point, each processor p_ij always starts by joining with r_ii. Clearly, it takes (j - i) mod n steps for r_ii to arrive at p_ij; this is when p_ij starts the first of its n joins (lines 9-11). After r_ii arrives at p_ij, the next arrival is r_(i+1) mod n,i (lines 13 and 15), then r_(i+2) mod n,i, and so on, until all n buckets r_ki, 1 ≤ k ≤ n, of the ith column have arrived. Therefore, during its n joining steps, p_ij executes ∪_{k=1}^{n} (r_ki ∘ r_ij). From Corollary 1, we know that a single phase of Algorithm 3 performs exactly r ∘ r.

Now we prove the second part. From Corollary 2, we know that all the new tuples allocated to p_ij after r ∘ r can only come from the processors p_kj, 1 ≤ k ≤ n, and lines 13 and 17-19 guarantee that all of these new tuples pass through p_ij and are kept in r_ij after p_ij finishes its n joins. From line 18, we know that ∪_{i,j=1}^{n} r_ij = r ∪ r^2. □
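As a sanity check on the schedule argued in this proof, the following C sketch prints, for every processor p(i, j) of an n × n wrapped mesh, the step at which each bucket of column i arrives and is joined. The mesh side N is an arbitrary assumption; the program only enumerates the schedule and performs no joins.

#include <stdio.h>

#define N 4   /* assumed mesh side length */

int main(void) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int start = ((j - i) % N + N) % N;   /* step at which r_ii reaches p(i,j) */
            printf("p(%d,%d):", i, j);
            for (int m = 0; m < N; m++) {
                int k = (i + m) % N;             /* row index of the m-th arriving bucket */
                printf(" step %d joins r_%d%d", start + m, k, i);
            }
            printf("\n");
        }
    }
    return 0;
}

Running it for N = 4 shows that every p(i, j) sees each bucket of column i exactly once, starting (j - i) mod N steps into the phase.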

LEMMA 2. Let {r_ij: i, j = 1...n} be a double-hash distribution on an n × n wrapped mesh. After a single phase of Algorithm 3,

∀t ∈ r_ij,  f(t.X) = i ∧ f(t.Y) = j,  1 ≤ i, j ≤ n.

Proof. From Corollary 2. □


THEOREM 2. Let d be the length of the longest path in r. After log d phases, Algorithm 3 terminates and

∪_{i,j=1}^{n} r_ij = r+.

Proof. As one can see from line 18, Algorithm 3 is a logarithmic TC algorithm. Lemma 1 shows that each phase of Algorithm 3 performs a global composition, and Lemma 2 shows that the composition result is double-hash redistributed at the end of each phase (i.e., before the next global composition starts). Therefore, after log d phases, ∪_{i,j=1}^{n} r_ij = r+. □

Under the constraint of memory availability on a uniprocessor system, hash-based partitioning is one of the most efficient methods to avoid excessive I/O operations in join computation. Theorem 3 shows that Algorithm 3 fully parallelizes the sequential hash-based logarithmic TC algorithm.

THEOREM 3. Let SEQTC be a sequential logarithmic TC algorithm using hash-based partitioning and the nested-loops join algorithm for bucket joining. For a table r which has a uniform data value distribution and can be stored in the aggregate memory of a mesh-connected parallel computer, Algorithm 3 is an optimal parallel version of SEQTC in terms of computational complexity, provided SEQTC and Algorithm 3 use the same hash function.

Proof. Let |r| = N and let the mesh size be n × n. We assume the hash function SEQTC uses is f: D → (1...m), m = αn, where α ≥ 1 is bounded by a constant. Then the time complexity of SEQTC is O(log d × n × (N/n) × (N/n)) = O(log d × N²/n). At each step of Algorithm 3, besides data exchange, a processor either joins its local data r_ij with the just-arrived data T_ij or remains inactive. A processor sends data upwards 2n times and rightward 2n times. The local computation at a processor in each step, on average, is O(N/n² × N/n²) = O(N²/n⁴). There are 2n steps in a phase, so the computation complexity of each phase is O(N²/n³). As Algorithm 3 is a logarithmic algorithm (Theorem 2), its total complexity is O(log d × N²/n³). Therefore, a speedup of n × n has been achieved by Algorithm 3 over SEQTC. Because there is no extra communication or computation overhead between the phases, Algorithm 3 is an optimal parallel version of the logarithmic TC algorithm. □

Algorithm 3 has an additional advantage. On massively parallel computers, the memory on each processor is usually limited.


By its delicate arrangement, Algorithm 3 minimizes the memory requirement: it needs only two memory areas for the two operand buckets and one area for the join results, which are necessary for any TC algorithm anyway.

3.2.2. Eliminating Computational Redundancy

As mentioned before, computational redundancy may occur in iterative TC algorithms. One kind of computational redundancy is caused by new tuples that have already been generated in previous phases or by other processors in the same phase. Generally, eliminating these duplicates requires global union operations [1], which are very expensive in a parallel environment. This problem prevents many parallel TC algorithms from being practical, as pointed out in [6]. Fortunately, this kind of redundancy is already excluded in Algorithm 3 without any global operations: because all the possible duplicate tuples are sent to the same processor, thanks to the double-hash distribution scheme, duplicate elimination becomes a local operation and is fully parallelized.

Another kind of redundancy is unique to the logarithmic TC algorithms. After r' = (r ∘ r) ∪ r is computed, r' ∘ r' needs to be computed in the next phase. Clearly, r' ⊇ r. Let r' = r ∪ Δr. Then

r' ∘ r' = (r ∘ r) ∪ (r ∘ Δr) ∪ (Δr ∘ r) ∪ (Δr ∘ Δr).    (3)

Note that (Δr ∘ r) ≠ (r ∘ Δr) in general. The term r ∘ r is a redundant computation; this is one reason why the naive and semi-naive algorithms remain in use even though the logarithmic TC algorithm is known. Algorithm 3 also suffers from this redundancy. Let r'_ij = r_ij ∪ Δr_ij, where Δr_ij contains the new tuples that finally reach p_ij. When r'_ij arrives at processor p(j, k),

r'_ij ∘ r'_jk = (r_ij ∘ r_jk) ∪ (r_ij ∘ Δr_jk) ∪ (Δr_ij ∘ r_jk) ∪ (Δr_ij ∘ Δr_jk).    (4)

There still exists computational redundancy on each processor (although the redundant computation is parallelized).

ALGORITHM 4. The nonredundant SIMD mesh TC algorithm.

Begin
(1) r_ij ← {t ∈ r: f(t.X) = i ∧ f(t.Y) = j}; /* double-hash distribution */
(2) Δr'_ij ← T'_ij ← r'_ij ← r_ij; Δr_ij ← T_ij ← r_ij ← ∅; /* initialization */
(3) do /* phase loop */
(4) a_ij = 0; /* all processors are disabled */
(5) for k = 0 to 2n - 1 do /* step loop */
(6) if (k = (j - i) mod n) then
(7) a_ij = 1;
(8) end_if
(9) if (1 ≤ a_ij ≤ n) then /* each processor joins n times */
(10) Δr_ij ← (T_ij ∘ r'_ij) ∪ (T'_ij ∘ r_ij) ∪ (T'_ij ∘ r'_ij); /* nonredundant join */
(11) T_ij ⇐ T_i,(j-1) mod n; T'_ij ⇐ T'_i,(j-1) mod n; /* send data rightward */
(12) end_if
(13) Δr_ij ⇐ Δr_(i+1) mod n,j; Δr'_ij ⇐ Δr'_(i+1) mod n,j; /* send data upwards */
(14) if k < n then
(15) T_ii ← Δr_ii; T'_ii ← Δr'_ii; /* data routes turn at the diagonal processors */
(16) end_if
(17) if a_ij > n then
(18) T'_ij ← T'_ij ∪ {t ∈ Δr_ij: f(t.X) = i ∧ f(t.Y) = j}; /* result assembling */
(19) elsif (a_ij > 0) then
(20) a_ij = a_ij + 1;
(21) end_if
(22) end_for
(23) r'_ij ← T'_ij - r_ij; r_ij ← r_ij ∪ r'_ij;
(24) Δr_ij ← T_ij ← r_ij; Δr'_ij ← T'_ij ← r'_ij;
(25) while (∃p(i, j) ∈ P, r'_ij ≠ ∅);
(26) return (∪_{i,j=1}^{n} r_ij);
End.

Algorithm 4 is an extension of Algorithm 3 that avoids both kinds of computational redundancy. The two algorithms are essentially very similar, except that in Algorithm 4 three more buffers, r'_ij, T'_ij, and Δr'_ij, are used on each processor to separate the original bucket from the tuples generated in the previous phase: the buffer r_ij of Algorithm 3 is split into r'_ij, which contains the new tuples generated in the previous phase, and r_ij, which contains the original tuples as well as the tuples generated before the previous phase. T_ij and T'_ij are the right-going buffers for r_ij and r'_ij, respectively. The termination condition of Algorithm 4 is an equivalent form of the one in Algorithm 3. The compositions at line 10 are all the compositions in (4) except the redundant one. Note that when the new tuples generated in the current phase are put into T'_ij at line 18, T'_ij is no longer moved (at line 11) or reset (at line 15) in this algorithm.


The time complexity of Algorithm 4 is the same as that of Algorithm 3, and the communication cost in a step is still O(n).

3.3. NONLOGARITHMIC ALGORITHMS

Using the double-hash distribution, parallel versions of the naive and semi-naive TC algorithms can also be derived. The major difference between the logarithmic and nonlogarithmic TC algorithms is that in the nonlogarithmic algorithms one operand remains unchanged for all composition phases. Therefore, after an initial join ring, the subsequent data exchange occurs only within the same row and the same column. We give the parallel naive and semi-naive TC algorithms now.

3.3.1. Parallel Naive Algorithm

The parallel naive TC algorithm, Algorithm 5, consists of two parts. The first part (lines 3-7) is a for-loop that sends all the buckets of the ith column, i.e., {r_ki: 1 ≤ k ≤ n}, to the ith row. This procedure follows the same communication pattern as in Algorithm 3, except that no computation is done in this loop. At the end of this loop, the n buckets of the ith column are spread across the ith row, one per processor, in the T buffers, 1 ≤ i ≤ n.

ALGORITHM 5. The parallel naive TC algorithm.
Begin
(1) r_ij ← {t ∈ r: f(t.X) = i ∧ f(t.Y) = j}; /* double-hash distribution */
(2) Δr_ij ← T_ij ← r_ij; /* initialization */
(3) for k = 0 to n do /* move buckets from the ith column to the ith row */
(4) T_ij ⇐ T_i,(j-1) mod n; /* send T rightward */
(5) Δr_ij ⇐ Δr_(i+1) mod n,j; /* Δr_ij moves upwards */
(6) T_ii ← Δr_ii; /* data routes turn at the diagonal PEs */
(7) end_for
(8) do /* phase loop */
(9) for k = 0 to n do /* each PE performs n compositions */
(10) Δr_ij ← (T_ij ∘ r_ij);
(11) T_ij ⇐ T_i,(j-1) mod n; /* send T rightward */
(12) end_for
(13) s_ij = |r_ij|;
(14) for k = 0 to n do /* redistribute new tuples */
(15) Δr_ij ⇐ Δr_(i+1) mod n,j; /* Δr_ij moves upwards */
(16) r_ij ← r_ij ∪ {t ∈ Δr_ij: f(t.X) = i};
(17) end_for
(18) while (∃p(i, j) ∈ P, |r_ij| ≠ s_ij);
(19) return (∪_{i,j=1}^{n} r_ij);
End.

The second part (lines 8-18) is the while-loop. Two things differ in comparison with Algorithm 3. First, at least one of the operands in the naive TC algorithm remains the same for all the compositions. Therefore, this unchanged operand, stored in T_ij on all processors, can be routed along the row, and each route corresponds to one composition (lines 9-12). Consequently, an explicit result ring is needed to redistribute the new tuples (lines 14-17). Since Δr_ij is used to store the new tuples from all n compositions on p(i, j), some data structures, such as pointers separating the tuples generated by different compositions, should be used to avoid rehashing when assembling the results later. The final result is the union of all r_ij when the algorithm terminates. It must be pointed out that no suppression of duplicate tuples is required at this stage.

3.3.2. Parallel Semi-Naive Algorithm

ALGORITHM 6. The parallel semi-naive TC algorithm.

Begin
(1) r_ij ← {t ∈ r: f(t.X) = i ∧ f(t.Y) = j}; /* double-hash distribution */
(2) Δr_ij ← T_ij ← r_ij; /* initialization */
(3) for k = 0 to n do /* this loop is the same as in Algorithm 5 */
(4) T_ij ⇐ T_i,(j-1) mod n;
(5) Δr_ij ⇐ Δr_(i+1) mod n,j;
(6) T_ii ← Δr_ii;
(7) end_for
(8) C_ij ← r_ij;
(9) do /* phase loop */
(10) for k = 0 to n do /* each PE joins n times */
(11) Δr_ij ← (T_ij ∘ C_ij);
(12) T_ij ⇐ T_i,(j-1) mod n; /* send T rightward */
(13) end_for
(14) s_ij = |r_ij|; C_ij ← ∅;
(15) for k = 0 to n do /* redistribute new tuples */
(16) Δr_ij ⇐ Δr_(i+1) mod n,j; /* Δr_ij moves upwards */
(17) C_ij ← C_ij ∪ {t ∈ Δr_ij: f(t.X) = i};
(18) end_for
(19) C_ij ← C_ij - r_ij;
(20) r_ij ← r_ij ∪ C_ij;


(21) while (∃p(i, j) ∈ P, |r_ij| ≠ s_ij);
(22) return (∪_{i,j=1}^{n} r_ij);
End.

Algorithm 6 is the parallel semi-naive TC algorithm using the double-hash distribution. It is easy to understand because its behavior is almost the same as that of Algorithm 5. A new buffer C_ij is introduced to contain those tuples which are generated by the current composition and which neither exist in the original relation nor have been generated by previous compositions. Each composition occurs between C_ij and T_ij rather than between r_ij and T_ij.

4. IMPLEMENTATION AND EVALUATION ON THE MASPAR MP-1

We have implemented Algorithms 3, 5, and 6 on a MasPar MP-1 parallel computer with 64 × 64 processors and a total memory of 4096 × 16 KB = 64 MB (Algorithm 4 could not be tested because of the memory limitation of the machine we use). A tuple has two integer columns of four bytes each. Relations are generated at random within a given range (to control the lengths of the longest paths). The hash function is simply f(x) = x mod n, where n = 64. We assume that the original relation is already distributed evenly across the processor array (though not necessarily following the double-hash scheme) and that the final result is not required to be collected (this cost would be the same for all of our parallel TC algorithms).

4.1. LOCAL COMPUTATION COSTS

Each bucket obtained from the double-hashing scheme needs to join with another n buckets. We keep the tuples in each bucket ordered on their join column to improve the performance of joins and other set operations. In addition, one operand of the union and minus operations is kept ordered on both columns, while the other operand, which is used only once, is not ordered. Table 1 gives the response time required to process N tuples (N + N tuples for join, union, and minus) on each processor; no communication cost is counted at this stage. In Table 1, D-hashing stands for the double-hash operation, SeleSort for the selection sort on the join column, and Sele2Sort for the selection sort on both columns. From Table 1, one can see that on SIMD computers the quicksort algorithm performs poorly.
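The following C sketch shows the kind of merge-based bucket composition that keeping the tuples ordered on their join columns enables. The types and bounds are assumptions made for illustration, and this is not the MPL code used in the experiments.

#include <stdio.h>

typedef struct { int x, y; } Tuple;

/* Compose bucket a (sorted on its Y column) with bucket b (sorted on its X
 * column): emit (a.x, b.y) whenever a.y == b.x, by merging the two sorted runs. */
static int merge_compose(const Tuple *a, int na, const Tuple *b, int nb,
                         Tuple *out, int max_out) {
    int i = 0, j = 0, count = 0;
    while (i < na && j < nb) {
        if (a[i].y < b[j].x) i++;
        else if (a[i].y > b[j].x) j++;
        else {
            int key = a[i].y, jstart = j;
            for (; i < na && a[i].y == key; i++)               /* group of equal a.y */
                for (j = jstart; j < nb && b[j].x == key; j++) /* group of equal b.x */
                    if (count < max_out)
                        out[count++] = (Tuple){ a[i].x, b[j].y };
        }
    }
    return count;
}

int main(void) {
    Tuple a[] = { {1,2}, {4,2}, {3,5} };   /* sorted on y: 2, 2, 5 */
    Tuple b[] = { {2,7}, {2,8}, {5,9} };   /* sorted on x: 2, 2, 5 */
    Tuple out[16];
    int n = merge_compose(a, 3, b, 3, out, 16);
    for (int k = 0; k < n; k++)
        printf("(%d,%d)\n", out[k].x, out[k].y);   /* (1,7) (1,8) (4,7) (4,8) (3,9) */
    return 0;
}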


TABLE 1
Basic Local Computation Costs for the Parallel TC Algorithms (s)

                       Relation size on each processor
Operation    N=10   N=50   N=100   N=150   N=200   N=250   N=300   N=350
Hashing      0.01   0.02   0.02    0.02    0.04    0.03    0.03    0.04
D-hashing    0.01   0.02   0.03    0.04    0.06    0.06    0.05    0.10
SeleSort     0.02   0.20   0.70    1.55    2.71    4.20    6.14    8.38
Sele2Sort    0.04   0.23   0.71    1.57    2.75    4.25    6.21    8.46
QuickSort    0.02   1.35   5.32    11.85   20.20   30.00   41.27   --
Join         0.02   0.03   0.05    0.07    0.09    0.11    0.13    0.16
Union        0.05   0.29   0.89    1.82    3.08    4.69    6.62    --
Minus        0.04   0.25   0.81    1.70    3.20    4.51    6.39    8.61

This is because it is highly likely that some processor hits the worst case of the partitioning procedure in any given step; thus, on an SIMD processor array, the response time of every step is its worst case. We use selection sort twice to sort a relation on both columns. In our TC algorithms, the most expensive operations, namely sorting, union, and minus, only need to be performed once at the end of each phase, and their costs do not increase with the mesh size. Only the join cost, which is quite low, increases with the mesh size.

4.2. INTERPROCESSOR DATA EXCHANGE

We name all the processors on the same column a result-ring (along which the join results are redistributed for the next join phase in our parallel TC algorithms). A result-ring and a join-ring communication are implemented as follows (see [10] for the programming language MPL used here):

int k, bucket_size;
plural int R_size, C_size;
plural struct RELATION *plural R, *plural C;   /* RELATION contains integer X & Y */

/* result ring */
for (k = 1; k < MESH_SIZE; k++) {
    pp_xsend(1, 0, C, C, bucket_size * tuple_size);   /* send C upwards */
    xnetN[1].C_size = C_size;
}

/* join ring */
for (k = 1; k < MESH_SIZE; k++) {
    pp_xsend(0, 1, R, R, bucket_size * tuple_size);   /* send R rightward */
    xnetE[1].R_size = R_size;
    pp_xsend(1, 0, C, C, bucket_size * tuple_size);   /* send C upwards */
    xnetN[1].C_size = C_size;
    if (ixproc == iyproc) {       /* C -> T at PEs on the mesh diagonal */
        T = C;
        T_size = C_size;
    }
}

Table 2 shows the data communication costs of a join-ring and a result-ring on the MasPar. The communication cost along a join-ring is twice that along a result-ring, as the length of a join-ring is twice that of a result-ring. Both increase linearly with the size of the data packet. If we used xnet-family communication primitives instead of xsend, both the join-ring and the result-ring communication costs would nearly double. Note that the communication cost in Table 2 is the total cost in one phase. Therefore, following the regularized data exchange pattern under the double-hash scheme, the overall communication overhead of our TC algorithms is very small while the computation is fully distributed across the network. This shows that our algorithms scale well with the mesh size.

4.3. RESULT ANALYSIS

We can derive the response time of our algorithms from Table 1 and Table 2. Assume that d is the length of the longest path in r.

TABLE 2
Basic Communication Costs for the Parallel TC Algorithms (s)

                         Relation size on each processor
Operation     N=10   N=50   N=100   N=150   N=200   N=250   N=300   N=350
Result-ring   0.08   0.29   0.55    0.83    1.10    1.37    1.63    1.90
Join-ring     0.14   0.57   1.11    1.65    2.19    2.72    3.26    3.80


TABLE 3
Response Time for the Parallel TC Algorithms (s)

Tuple number   Length (d)   Algorithm 3 (logarithmic)   Algorithm 5 (naive)   Algorithm 6 (semi-naive)
737,280        6            177.09                      215.57                112.02
204,800        8            114.43                      172.98                85.63

The naive and semi-naive algorithms need d iterations. Before the iterations start, one operand r is double-hash redistributed, which takes a join-ring plus n local double-hashing operations, and another join-ring is used to distribute r as the other operand. In each iteration, there are two result-rings, one for passing the join operand and one for the join result, and there are n joins and one union. For the semi-naive algorithm, one operand is reduced at the end of each iteration at the price of a minus operation. The logarithmic algorithm takes log d iterations; however, its join operand sizes are much larger than those of the nonlogarithmic algorithms, and every iteration uses a join-ring but no result-ring. Comparing Table 2 with Table 1, we see that local computation cost dominates the overall response time under the double-hashing scheme. Although the relative merits of the three TC algorithms are application-dependent, in general the semi-naive algorithm is expected to outperform the other two, as its local computation cost is significantly smaller. We have tested the three algorithms on the MasPar with 4096 PEs on randomly built relations whose sizes range from 200 K to 760 K tuples and whose d ranges from 5 to 14. In our tests, the semi-naive algorithm is always the best (as also reported in [15]) and the naive one is always the worst. Table 3 gives two sets of typical results.

5. CONCLUSIONS

In this paper, we have presented a new family of parallel TC algorithms to compute transitive closures on SIMD meshes using relational algebra operations. The double-hash data fragmentation scheme of [9] has been further explored. A mapping of the buckets obtained from double-hash partitioning to a mesh of processors has been proposed to regularize the data exchange pattern among processors. No extra step is needed in Algorithm 3 to redistribute intermediate tuples, and possible redundant computation between different composition phases has been prevented without using global operations.


As only regular linear communication occurs on the mesh and the workload is fully distributed, a time-complexity speedup of O(n × n) has been achieved, where n × n is the size of the mesh. Therefore, these algorithms are optimal parallel versions of the transitive closure algorithms based on relational algebra operations for SIMD meshes. These algorithms have been implemented and empirically compared on the massively parallel computer MasPar MP-1. Using our new algorithms, the transitive closure of relations with nearly a million tuples can be computed in less than three minutes. This shows that parallel processing is very promising for supporting new database applications with very large volumes of data.

REFERENCES

1. R. Agrawal, S. Dar, and H. V. Jagadish, Composition of database relations, in Proc. 5th Internat. Conf. Data Engrg., 1989, pp. 102-108.
2. R. Agrawal and H. V. Jagadish, Direct algorithms for computing the transitive closure of database relations, in Proc. 13th Internat. Conf. Very Large Data Bases, Brighton, England, 1987, pp. 255-266.
3. R. Agrawal and H. V. Jagadish, Hybrid transitive closure algorithms, in Proc. 16th Internat. Conf. Very Large Data Bases, Brisbane, Australia, 1990, pp. 326-334.
4. S. G. Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, Englewood Cliffs, NJ, 1989.
5. F. Bancilhon and R. Ramakrishnan, An amateur's introduction to recursive query processing strategies, in ACM SIGMOD'86, Washington, DC, 1986.
6. F. Cacace, S. Ceri, and M. Houtsma, An overview of parallel strategies for transitive closure on algebraic machines, in LNCS 503: Parallel Database Systems: PRISMA Workshop, Noordwijk, The Netherlands, 1990, pp. 44-62.
7. S. Ceri, G. Gottlob, and L. Tanca, What you always wanted to know about Datalog (and never dared to ask), IEEE Trans. Knowl. Data Engrg. 1(1):146-166 (1989).
8. S. Ceri, G. Gottlob, and L. Tanca, Logic Programming and Databases, Springer-Verlag, Berlin, 1990.
9. J.-P. Cheiney and C. de Maindreville, A parallel strategy for transitive closure using double hash-based clustering, in Proc. 16th Internat. Conf. Very Large Data Bases, Brisbane, Australia, 1990, pp. 347-358.
10. MasPar Computer Corporation, MasPar MP-1 MPL Programming Manuals, MasPar Computer Corporation, 1991.
11. G. Cybenko, T. G. Allen, and J. E. Polito, Practical parallel union-find algorithms for transitive closure and clustering, Internat. J. Parallel Programming 17(5):402-423 (1988).
12. S. Dar and R. Agrawal, Extending SQL with generalized transitive closure, Data Knowl. Engrg. 5(5) (1993).
13. C. Ewald and X. Zhou, Efficient matrix-based transitive closure algorithms for fd-graphs, in Proc. 7th Australasian Database Conf., Melbourne, Australia, 1996, pp. 61-71.
14. M. A. W. Houtsma, P. M. G. Apers, and S. Ceri, Distributed transitive closure computations: The disconnection set approach, in Proc. 16th Internat. Conf. Very Large Data Bases, Brisbane, Australia, 1990, pp. 335-346.
15. M. A. W. Houtsma, A. N. Wilschut, and J. Flokstra, Implementation and performance evaluation of a parallel transitive closure algorithm on PRISMA/DB, in Proc. 19th Internat. Conf. Very Large Data Bases, Dublin, Ireland, 1993, pp. 206-216.


16. Y.-N. Huang and J.-P. Cheiney, Parallel computation of direct transitive closures, in Proc. 7th Internat. Conf. Data Engrg., Kobe, Japan, 1991, pp. 192-199.
17. F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers, Los Altos, CA, 1992.
18. C. J. Scheiman and P. R. Cappello, A processor-time-minimal systolic array for transitive closure, IEEE Trans. Parallel Distrib. Syst. 3(3):257-269 (1992).
19. T. Sellis and N. Roussopoulos, Expert database system: Efficient support for engineering environments, Data Knowl. Engrg. 3 (1988).
20. J. D. Ullman, Principles of Database and Knowledge-Base Systems, Vol. 1, Computer Science Press, Rockville, MD, 1988.
21. J. D. Ullman, Principles of Database and Knowledge-Base Systems, Vol. 2, Computer Science Press, Rockville, MD, 1989.
22. J. D. Ullman and M. Yannakakis, The input/output complexity of transitive closure, in ACM SIGMOD'90, 1990, pp. 44-53.
23. P. Valduriez and S. Khoshafian, Transitive closure of transitively closed relations, in Proc. 2nd Internat. Conf. Expert Database Systems, Tysons Corner, VA, 1988.
24. P. Valduriez and S. Khoshafian, Parallel evaluation of the transitive closure of a database relation, Internat. J. Parallel Programming 17(1):19-42 (1988).
25. H. S. Warren, A modification of Warshall's algorithm for the transitive closure of binary relations, Commun. ACM 18(4):218-220 (1975).
26. S. Warshall, A theorem on Boolean matrices, J. ACM 9(1):11-12 (1962).
27. X. Zhou, Parallel processing in relational database systems, Ph.D. thesis, The University of Queensland, Australia, 1994.
28. X. Zhou, Y. Zhang, and M. Orlowska, A new fragmentation scheme for recursive query processing, Data Knowl. Engrg. 13(2):177-192 (1994).

Received 1 April 1995; revised 23 November 1995