JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 14, 146-162 (1992)

A Parallel Pipelined Strategy for Evaluating Linear Recursive Predicates in a Multiprocessor Environment*†

LOUIQA RASCHID
Department of Information Systems and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742

AND

STANLEY Y. W. SU
Database Systems Research and Development Center, University of Florida, Gainesville, Florida 32611

A parallel pipelined strategy for evaluating single linear recursive predicates in a multiprocessor system is described. A top-down compiling technique generates the resolvents corresponding to the recursive predicate. While evaluating the resolvents against the database relations, the proposed strategy exploits three database query optimization techniques. We develop an analytical model for the proposed evaluation strategy; it models the execution of the pipelined butterfly hash-join operation. In an analytical performance evaluation, we compare the proposed strategy with a sequential and a parallel bottom-up semi-naive algorithm for computing the transitive closure of a database relation. We measure the response time and execution time for each resolvent. The speedup demonstrates the benefits of the parallel pipelined strategy; the performance evaluation indicates that it can increasingly benefit from each additional processor used, as larger relations are passed along the pipeline. A steady pipeline can be maintained even when hash table overflow occurs due to memory limitations. © 1992 Academic Press, Inc.

1. INTRODUCTION

Many database applications can benefit from the ability to efficiently evaluate recursive queries. The majority of these applications require only the expressive power of linear recursive queries. For example, computing the transitive closure of a relation can be used to find all the subparts of a part, or to determine if a path exists between two nodes in a graph. A query involving several relations, such as the same generation problem, is equivalent to the problem of determining if two nodes in a graph are equidistant from a common node.

Several techniques for the evaluation of recursive predicates have been proposed in the literature [1-3, 10, 13, 15, 21-23]. The most well known is the bottom-up semi-naive algorithm for computing the transitive closure of a relation. In contrast to bottom-up schemes, top-down schemes compile the recursive clause(s) to generate a proof tree or its equivalent relational algebra expressions that can then be evaluated. A performance comparison [3] demonstrated that, on a sequential processor, the top-down compiling method outperforms the bottom-up semi-naive scheme for various data sets.

If recursive query evaluation is to be of practical success, it is important that the queries be efficiently evaluated against large database relations. One solution is to utilize multiple processors to evaluate these recursive queries. Several parallelization schemes have been proposed to parallelize the bottom-up algorithms [1, 22, 23]. They achieve parallelism through horizontal partitioning of the database relation among multiple processors, thus distributing the workload among them. A parallelization of the semi-naive algorithm has the best performance results [1]. There has been little research on the parallelization of top-down evaluation schemes for database relations. In addition, it has not been shown if the parallelism achieved in the semi-naive scheme [1], by partitioning one database relation among the processors, is applicable to the case of a linear recursive predicate in which the evaluation involves several database relations. In other words, to achieve a linear speedup, an independent partitioning of several database relations among the processors must result in an even load balancing while evaluating the recursive query. We also wish to utilize database query processing/optimization strategies such as query decomposition, result sharing, and pipelined evaluation of relational operations, e.g., the join operation; such techniques have not been used previously for this problem.

In this paper, we describe a parallel pipelined strategy for evaluating single linear recursive predicates against database relations in a multiprocessor environment. The strategy is a parallelization of the top-down compilation

* This research has been partially sponsored by the National Science Foundation under Grant DMC 8814989 and by the State of Florida High Technology and Industry Council under Grant UPN 85100316.
† The authors thank Timos K. Sellis for his valuable comments.
method proposed by Henschen and Naqvi [13]. The compilation generates a sequence of resolvents, which are equivalent to relational algebra expressions, to be evaluated against the relations. The resolvents are decomposed into primitive algebraic operations, and this decomposition identifies common subexpressions which facilitate result sharing (the output of the primitive operations) among the resolvents. Operations that can be executed in parallel are also identified, and the primitive operations (based on the relational join operation) are evaluated using a pipelined execution method. The benefits of pipelining are that shared results (among resolvents) become available at an earlier instant, and there is a potential for overlapping the read/write of blocks of data and processing along the pipeline of operations (processors) that share data.

An analytical performance evaluation compares this strategy with the sequential bottom-up semi-naive algorithm and a parallelization of this algorithm. We model the behavior of the pipelined execution strategy for operations based on the relational join operation; this strategy for joins is called a pipelined butterfly hash-join. The speedup in response time and execution time demonstrates the benefits of the parallel pipelined strategy. The performance evaluation indicates that the pipelined strategy can increasingly benefit from each additional processor utilized by the system, when the sizes of the relations being passed down the pipeline increase. In contrast, the parallelized semi-naive algorithm benefits in a constant manner from each additional processor utilized. We also study the performance degradation when there is limited memory. Our study indicates that the overhead due to hash table overflow is not cumulative for the pipelined strategy, since the operations along the pipeline execute concurrently.

The paper is organized as follows: In Section 2, we review relevant literature on methods for evaluating recursive queries and database query optimization techniques. Section 3 uses the example of computing the transitive closure of a database relation to describe our parallel pipelined evaluation strategy. Section 4 provides an analytical model for the sequential and parallel semi-naive algorithms and for our parallel pipelined strategy. An analytical performance comparison is described in Section 5. In Section 6, the strategy for evaluating a linear recursive predicate is presented. Section 7 is a summary.

2. REVIEW OF RELATED RESEARCH
We review sequential and parallel methods for evaluating recursive predicates, and database query optimization techniques.

2.1. Top-Down Recursive Query Evaluation Techniques

A first-order database is a function-free first-order theory in which the extensional database (EDB) corresponds to the data stored in the relations. If we consider only a Horn database, then the intensional database (IDB) is a set of clauses with exactly one positive literal; each clause is a definition of some of the tuples in the predicate named in its positive literal. For example, P may be defined as

    P(x, z) ← Q(x, y), R(y, z).    (1)
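As a concrete illustration of how such a clause maps to relational operations, the following minimal Python sketch evaluates Eq. (1) over two relations stored as sets of pairs; the function name and the representation are our own assumptions, not notation from the paper.

    def eval_rule(Q, R):
        # P(x, z) <- Q(x, y), R(y, z): join Q and R on y, then project (x, z)
        return {(x, z) for (x, y1) in Q for (y2, z) in R if y1 == y2}

For example, eval_rule({(1, 2)}, {(2, 3)}) yields {(1, 3)}.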
When an IDB literal corresponding to a query is compiled, all possible proof trees are constructed for this query, until the NULL clause is derived or all leaf nodes of the proof trees are EDB predicates. The clauses constructed of leaf nodes alone can be evaluated by a relational database engine using relational operations such as join (⋈), select (σ), and project (π) over the EDB relations. This straightforward compiling process fails in the presence of recursively defined IDB predicates, because the resulting proof tree is infinite, and the method must be extended. Henschen and Naqvi presented a top-down method for compiling recursive queries. The method is presented in detail in [13]; we present it using the same generation example, as follows:

    sg(x, y) ← UP(x, z), sg(z, v), DOWN(v, y).    (2)
    sg(x, y) ← FLAT(x, y).                        (3)
The predicate sg is a single linear recursive predicate, and UP, DOWN, and FLAT are database relations. The pair x and y are in sg if they occur in a tuple of the relation FLAT (by Eq. (3)), or if they are equidistant from a pair z and v that occur in sg (by Eq. (2)). x and y are equidistant from z and v, respectively, if an equal number of joins of the relations UP and DOWN, respectively, will form a path from x to z (in UP) and from v to y (in DOWN). When these clauses are compiled, they form a potential recursive loop or PRL [13]. For this straightforward example, the PRL is exactly Eq. (2), and the nonrecursive exit clause for the PRL is Eq. (3). By continually traversing the PRL of Eq. (2), we generate longer resolvents. After each traversal, exiting the PRL via the exit clause Eq. (3) leads to a sequence of resolvents that correspond to compiling sg. The following resolvents are generated corresponding to a query sg(x, c), where c is a constant:

    sg(x, c) ← FLAT(x, c).                               (4)
    sg(x, c) ← UP(x, z1), FLAT(z1, v1), DOWN(v1, c).     (5)
    sg(x, c) ← UP(x, z1), UP(z1, z2), FLAT(z2, v2),
               DOWN(v2, v1), DOWN(v1, c), etc.           (6)
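To make the unfolding concrete, the following Python sketch generates the body of the resolvent obtained after i traversals of the PRL; the string encoding of the literals is our own, minimal convention.

    def resolvent(i):
        # i traversals of the PRL (Eq. (2)), then exit via Eq. (3)
        if i == 0:
            return ["FLAT(x, c)"]                                  # Eq. (4)
        ups = ["UP(x, z1)"] + [f"UP(z{k}, z{k+1})" for k in range(1, i)]
        flat = [f"FLAT(z{i}, v{i})"]
        downs = [f"DOWN(v{k+1}, v{k})" for k in range(i - 1, 0, -1)]
        return ups + flat + downs + ["DOWN(v1, c)"]

Here, resolvent(1) and resolvent(2) reproduce the bodies of Eqs. (5) and (6).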
A sequential algorithm for retrieving answers from the database, based on these resolvents, consists of an outer
loop and two inner loops. Initially, using selection on the database relation DOWN, values for v1 (in Eq. (5)) are pushed onto a queue. All answers for this resolvent are extracted in the first inner loop; i.e., using join, selection, and projection operations, the query UP(x, z1), FLAT(z1, v1) is evaluated and values are obtained for x. In the second inner loop, the query DOWN(v2, v1) is evaluated, and the values for v2, to be used in evaluating Eq. (6), are pushed onto a queue. This process continues until no new answers for x can be obtained. Our research is on a strategy for the parallel, pipelined evaluation of these resolvents, using multiple processors. Parallel evaluation of the resolvents will relax the strictly sequential evaluation of the outer loop. In addition, we simplify the operations in the inner loops by result sharing, and allow the two inner loops to be evaluated simultaneously.
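The loop structure described above can be sketched as follows in Python, assuming UP, DOWN, and FLAT are sets of pairs; the termination test and the helper names are our simplification of the prose description, not code from the paper.

    def evaluate_sg(UP, DOWN, FLAT, c):
        answers = {x for (x, v) in FLAT if v == c}      # resolvent of Eq. (4)
        frontier = {v for (v, y) in DOWN if y == c}     # queued values for v1
        chain = set(UP)                                 # UP joined i times
        while frontier:                                 # outer loop
            # first inner loop: answers from UP^i(x, z), FLAT(z, v)
            ends = {z for (z, v) in FLAT if v in frontier}
            new = {x for (x, z) in chain if z in ends} - answers
            if not new:                                 # no new answers for x
                break
            answers |= new
            # second inner loop: push values for the next DOWN variable
            frontier = {v2 for (v2, v1) in DOWN if v1 in frontier}
            chain = {(x, z2) for (x, z1) in chain for (w, z2) in UP if w == z1}
        return answers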
2.2. Bottom-Up Recursive Query Evaluation Techniques
The most well-known bottom-up evaluation technique [1-3, 10, 21-23] for computing the transitive closure of a relation is the semi-naive algorithm. This algorithm iteratively repeats a sequence of relational algebra operations until it reaches a fixpoint. Figure 1 presents a description of this algorithm, for both a sequential and a parallel evaluation. Let R0 be the initial relation, RΔ the new tuples computed in each iteration, and Rf the transitive closure of R0. The sequential version of the algorithm is on the left, and the parallel version is on the right. Assume that R0, Rf, and RΔ are all binary relations with two attributes, source and destination, respectively. In each iteration of the sequential algorithm, new tuples are computed as follows: the tuples for RΔ from the previous iteration are joined with the tuples of the initial relation R0, such that the destination (second) attribute of RΔ matches the source attribute of R0. The answer for RΔ (for the current iteration) is obtained by projecting the source and destination attributes from the matching tuples of RΔ and R0, respectively. The algorithm terminates upon reaching a fixpoint.
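A runnable sketch of both variants follows, assuming R0 is a set of (source, destination) pairs; the use of Python's built-in hash for partitioning is our stand-in for an unspecified hash function.

    def semi_naive(R0):
        Rf, Rdelta = set(R0), set(R0)
        while Rdelta:
            # join RΔ with R0 on destination = source to compute new tuples
            Rdelta = {(x, z) for (x, y) in Rdelta for (u, z) in R0 if u == y}
            Rdelta -= Rf                  # eliminate duplicates
            Rf |= Rdelta                  # collate results
        return Rf

    def semi_naive_partition(R0, p, num_procs):
        # processor p starts from the tuples whose source hashes to p,
        # but probes the entire R0 in each iteration
        part = {(x, y) for (x, y) in R0 if hash(x) % num_procs == p}
        Rf, Rdelta = set(part), set(part)
        while Rdelta:
            Rdelta = {(x, z) for (x, y) in Rdelta for (u, z) in R0 if u == y}
            Rdelta -= Rf
            Rf |= Rdelta
        return Rf

Under this source-attribute partitioning, the union of semi_naive_partition(R0, p, n) over p = 0, ..., n−1 equals semi_naive(R0).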
    RΔ ← R0;                              RΔ^p ← R0^p;
    Rf ← R0;                              Rf^p ← R0^p;
    while RΔ ≠ ∅ do                       while RΔ^p ≠ ∅ do
      RΔ ← RΔ ⋈ R0;  /* compute new tuples */     RΔ^p ← RΔ^p ⋈ R0;
      RΔ ← RΔ − Rf;  /* eliminate duplicates */   RΔ^p ← RΔ^p − Rf^p;
      Rf ← Rf ∪ RΔ;  /* collate results */        Rf^p ← Rf^p ∪ RΔ^p;

    FIG. 1. Sequential and parallel semi-naive algorithm.

The bottom-up semi-naive approach has been compared with the top-down connection graph based approach of [13]. The results of executing these algorithms on a single processor, reported in [3], show that for many different data sets, the top-down evaluation outperforms the efficient semi-naive evaluation.

Parallel evaluation techniques have been developed to compute the transitive closure of a relation [1, 22]. A very efficient scheme, presented by Agrawal and Jagadish, is the hash-join based parallelization of the semi-naive algorithm, as seen in Fig. 1. R0 is initially hash-partitioned on the first or source attribute and assigned to each processor as its relevant partition R0^p. Each processor computes the semi-naive evaluation in parallel for its partition R0^p; it has to access the entire R0 relation in each iteration. The superscript p indicates that the relation has been hash-partitioned among p processors, and the algorithm executes on each processor. Only those tuples in RΔ^p produced by that processor will be used in computing the answers for the next iteration. Thus, each processor will compute its transitive closure independent of the other processors and will produce a partition of the final result Rf, based on the source attributes in its partition R0^p; i.e., each processor will compute the reachability graph of its partition of the source attributes. This parallelization of the semi-naive algorithm has the advantage that there is no communication or synchronization between processors. With even load balancing, i.e., if each processor is equally busy in each iteration, it will perform well. The disadvantage is that there may be redundancy in the computation of all the processors [1]. In the performance evaluation of Section 5, we compare the performance of our parallel top-down strategy with this parallelization of the semi-naive scheme. We note that this parallelization technique is based on partitioning the relation among the processors. However, each processor computes the same queries (resolvents); i.e., the underlying architecture is SIMD.

A different parallelization of the semi-naive algorithm also makes use of horizontal partitioning [22]. However, in each iteration, tuples in RΔ^p have to be repartitioned; i.e., they have to be accessed by processors other than the processor p that generated the tuples.
This is a drawback of this scheme, since there are synchronization problems among the processors. Other schemes include a technique for the bottom-up evaluation of certain classes of logic programs using a constant number of processors [23]. Han and Lu propose a wavefront approach for evaluating recursive queries [10]. Our strategy is similar to their double wavefront (DW) algorithm, in that both share results among the resolvents. However, their DW algorithm does not evaluate the resolvents in parallel, nor does it use pipelined evaluation methods. These techniques are used in our algorithm to increase the degree of parallelism and improve the response time.

In summary, there are many schemes to parallelize the semi-naive algorithm; parallelism is mainly achieved by a horizontal partitioning of the relation among the processors. However, it has not been shown that the method of partitioning one relation among multiple processors to obtain parallelism through even load balancing can be applied to the more general case of a linear recursive predicate, where each resolvent is evaluated using several relations. For the parallel evaluation to show a linear speedup in this case, even load balancing must be achieved in each iteration of the bottom-up evaluation, after independently partitioning several relations among the processors. Finally, these parallelization schemes do not exploit database query optimization techniques.
2.3. Database Query Processing Techniques

The sequence of resolvents generated by compiling a linear recursive predicate is equivalent to a sequence of relational algebra queries that must be evaluated against the corresponding database relations. Thus, database query processing and optimization techniques can be used to optimize the evaluation of these queries. Query decomposition is a process of translating a query into a hierarchy of primitive operations; the result is a query tree in which the nodes represent the primitive operations [24]. The advantage of query decomposition is that it identifies primitive operations on different branches of a query tree that can be executed in parallel, i.e., horizontal concurrency, thus increasing the degree of parallelism. It also increases the probability of finding an overlap among several query trees, which facilitates intermediate result sharing.

The sharing of intermediate results among queries, and the resulting elimination of redundant execution of operations, has been proposed in [9, 19]. It has also been shown that as the degree of sharing among queries increases, the query throughput also increases. Most of this research studies the effect of eliminating low-level read operations by sharing buffer space [5, 6]. More recent work [12, 19] shows the advantage of sharing the output of high-level operations such as select and join.

The third optimization technique is pipelining and the data-flow based processing approach proposed for several database machines [4, 5, 11]. Using this technique, each processor assigned to a node in a query tree transmits a block of information as soon as it is produced. This is in contrast to traditional distributed systems that delay output until the operation assigned to the node is completely executed. The main advantage of this data-flow based approach is the possibility of vertical concurrency: an operation at one level that requires input from an operation at a previous level can get its input data at an earlier instant, i.e., before the operation at the previous level is completed. Pipelining can have a significant impact on the efficiency of evaluating resolvents that share intermediate results. If each resolvent is evaluated by a different processor(s), then, in the event of a common subexpression, pipelining will allow sharing of the data earlier than if there were no pipelining. Pipelining also allows an overlap in reading and writing blocks of data by operations along the pipeline.

3. THE PARALLEL PIPELINED EVALUATION STRATEGY: AN EXAMPLE

Consider the following definition for the transitive closure T of relation A:

    T(x, z) ← T(x, y), A(y, z).    (7)
    T(x, y) ← A(x, y).             (8)
An example query T(a, c), where a and c are constants, corresponds to finding if a path exists between two given points of the graph which is contained in A. When these clauses are compiled, Eq. (7) is the PRL and Eq. (8) is the nonrecursive exit clause. The following resolvents are generated by traversing the PRL and exiting via Eq. (8):

    T1:  A(a, c)                                             (9)
    T2:  A(a, y1) ⋈ A(y1, c)                                 (10)
    T3:  A(a, y2) ⋈ A(y2, y1) ⋈ A(y1, c)                     (11)
    T4:  A(a, y3) ⋈ A(y3, y2) ⋈ A(y2, y1) ⋈ A(y1, c), etc.   (12)

Here, i in Ti increases with the depth of the resolvent, and ⋈ is the relational join operation. The goal of the parallel pipelined strategy is to evaluate this set of resolvents in parallel using pipelined execution. We first decompose each resolvent into a hierarchy of primitive algebraic operations that can benefit from pipelining. These primitive operations are used to identify common subexpressions, so that intermediate results, output from the operations, may be shared among the resolvents.
We then identify possible parallelism in executing operations and assign operations to processors for evaluation. Although a decomposition that maximizes the number of parallel primitive operations may be advantageous, the degree of parallelism is limited by the availability of processors. Similarly, result sharing among resolvents is constrained by the bandwidth and the structure of the interconnection between the processors. We choose the heuristic of decomposing resolvents so that they share the greatest (longest) common subexpression from previously evaluated resolvents. Although this heuristic does not always minimize the amount of processing, it has the advantages of regularity in the decomposition of the resolvents and in sharing results among operations, as will be seen.

Figure 2 shows a hierarchical decomposition of this sequence of resolvents Ti into primitive operations. Each resolvent is decomposed to share the longest common subexpression from its predecessors. The algorithm for the decomposition of a linear recursive predicate, and a discussion of its correctness and termination conditions, is presented later. For notational convenience, f indicates an attribute that is free (unbound) in a relation and b indicates a variable bound to a constant value, i.e., a selection based on an attribute value. Resolvent T1 in Eq. (9) is A(a, c); the corresponding primitive operation, denoted (A−bb), is a selection of tuples from the relation A. The next resolvent, T2, in Eq. (10) is represented by the primitive operation (A−bf ⋈ A−fb). This operation comprises initial selections, A−bf and A−fb, from the relation A, a subsequent join (⋈) over the appropriate attribute, followed by a projection operation to produce answers. This operation is typical of the operations that are obtained with the decomposition. Resolvent T3 will be decomposed as ((A−bf ⋈ A−ff) ⋈ A−fb); the nested operation (A−bf ⋈ A−ff) will be evaluated first and its output used as input by the outer operation.
    [1]-1: A−bb
    [1]-2: (A−bf ⋈ A−fb)
    [1]-3: (A−bf ⋈ A−ff)    (common)
    [1]-4: (A−ff ⋈ A−fb)    (common)
    [2]-1: ((A−bf ⋈ A−ff) ⋈ A−fb)
    [2]-2: ((A−bf ⋈ A−ff) ⋈ (A−ff ⋈ A−fb))
    ...

FIG. 2. Decomposition of resolvents Ti into primitive operations.
FIG. 3. Architecture for evaluating the transitive closure. (Boxes at each level represent processors evaluating primitive operations on the selections A−bf, A−ff, and A−fb; arcs represent the sharing of common subexpressions.)
In order to identify primitive operations that may execute in parallel, and to identify result sharing of the output of these primitive operations, the operations in Fig. 2 are marked with the legend [m]-j; this represents operation j executed at level m. The level, m, of an operation is determined by the input requirements. For example, operations at level 1, represented by [1]-j, do not obtain input from other operations. There may be several operations at level m; the value of j distinguishes them. All operations at a level m are independent of each other, and may be evaluated in parallel. Those primitive operations that are common to several resolvents and provide input to other primitive operations are labeled common in the figure when they appear in another resolvent.

The primitive operations of Fig. 2 are each assigned to a processor for evaluation (unless they are marked common). Using the markings of the operations in Fig. 2, it is straightforward to obtain the architecture of Fig. 3, where the boxes at each level represent a processor evaluating a primitive operation and the arcs represent the sharing of common subexpressions. Primitive operations at each level are independent of each other and may be evaluated in parallel. Operations that provide input to other operations are evaluated using pipelined evaluation techniques. Thus, several operations along the pipeline at different levels, [1]-3, [2]-3, etc. (and several resolvents), will be simultaneously evaluated. Pipelined evaluation allows the early sharing of the join output data among processors; it also allows a possible overlap in reading and writing blocks of shared data among the processors. As can be seen from Fig. 3, there is a uniform interconnection structure between processors, and the number of processors (and primitive operations) at each level increases in a linear manner w.r.t. the depth of the resolvents that are evaluated. Each horizontal level is a replica of the previous level. This simplifies the task of assigning processors to the primitive operations.

In summary, the architecture of Fig. 3 represents the parallel pipelined strategy for evaluating the resolvents generated in the transitive closure example. The answers for the resolvents Ti are obtained as output from the operations marked [1]-1, [1]-2, [2]-1, [2]-2, etc. In this discussion we assumed that both arguments in the query were bound. When this is not the case, our decomposition will compute unrestricted joins. We discuss the effect of such joins in the section on performance evaluation.
4. ANALYTICAL MODELS FOR THE SEMI-NAIVE AND PIPELINED ALGORITHMS
An analytical evaluation for computing the transitive closure of a database relation is used to compare the performance of the three strategies, i.e., the sequential semi-naive algorithm executing on a single processor, and the parallel semi-naive algorithm and the parallel pipelined strategy, both executing on multiple processors. The two performance measures determined in the model are the response time, i.e., the time to produce the first block of data, and the execution time, i.e., the time to complete processing an operation, for each resolvent Ti. The analysis assumes a shared-nothing multiprocessor environment [20]. For all three strategies, the operation being evaluated by each processor is based on the relational join operation; thus, accurately modeling the join operation is very important. We use the hash-join for the semi-naive algorithms, and a pipelined butterfly hash-join for the parallel pipelined strategy. Prior analysis of join operations on systems with large memory [4, 18] suggests that hash-join based query processing strategies are advantageous.

4.1. Modeling the Sequential and Parallel Semi-Naive Algorithms

In each iteration of the semi-naive algorithm, two relations are joined and their output provides the input for the next iteration. In Fig. 1, RΔ is initially assumed to be A−bf in the first iteration, where its size is determined by the initial selectivity s, and R0 is the initial database relation A. This is actually a modification of the semi-naive evaluation which does not push this variable binding inside the transitive closure computation.
In subsequent iterations, the two relations being joined are A and the output of the previous iteration, RΔ; the size of the latter is determined by the join selectivity.

For each iteration, let the two input relations be R1 and R2, and let their cardinalities be b1 * B and b2 * B, respectively, where B is the block size. Assume b1 ≥ b2. Let Tbr be the time to input a block, Th the time for hashing the value of an attribute over which the join is to be performed, Tr (Tw) the time to read (write) a word in memory, and Tc the time to compare a hashed value with values in the stored hash table. Let j, the join selectivity, be defined as

    j = (number of join tuples output) / (b1 * b2 * B * B).

We also assume that the join attribute of each tuple can be accessed with 1 read, but that fr reads/writes are required for processing the result tuple. A 20% overhead accommodates the extra comparisons required to deal with collisions, when comparing values using a hash table [8].

The following is a description of the hash-join used in the model: In each iteration, the smaller relation, say R2, will be read first, hashed, and the hash table stored in memory. The larger relation, say R1, will then be read, hashed, and compared with the stored hash table. If there is a match, then the result tuple will be output. Any projections required will be included in the time to move the join output tuples to the buffer. Since the evaluation is on a single processor, we do not assume overlap in the I/O and processing times of an operation. The response time is the time to produce the first block of output in each iteration i, and the execution time is the time to complete the evaluation of that iteration. Each iteration is also referred to as the depth of the resolvent. Based on the above, the time for each iteration is as follows:

    time to read, hash, and store tuples of R2
        {= Tbr * b2 + Tr * b2 * B + Th * b2 * B + Tw * b2 * B}
    + time to read, hash, and compare tuples of R1
        {= Tbr * b1 + Tr * b1 * B + Th * b1 * B + Tc * b1 * B * 1.2}
    + time to output tuples of the join result RΔ
        {= Tw * j * B * B * b1 * b2 * fr}

For the parallel semi-naive algorithm, we assume that A−bf is hash-partitioned among p processors, based on its first attribute. Each processor will compute the transitive closure for its partition, A−bf^p, and it accesses a copy of the entire relation A to do so. Relation A will first be hashed and a hash table built. We assume that the selected subset A−bf of the relation will be simultaneously partitioned among the processors, while the hash table is built. Then, in the first iteration, each processor will evaluate the semi-naive algorithm (in parallel) for its partition A−bf^p (or R0^p), and the result will be the answers for the resolvent T2.
In subsequent iterations, the output of the previous iteration, RΔ^p, is joined with A. The time for each iteration on each processor will be similar to the sequential semi-naive algorithm previously discussed, and computation on all the processors proceeds in parallel.

The maximum possible speedup with p processors over the single processor is p; this is an ideal value. For example, the time to hash-partition A−bf among the processors cannot be reduced through parallel execution. The maximum speedup also assumes an ideal even load balancing. This would occur if the size of the result relation in each iteration, for each processor, i.e., RΔ^p, were the same, or if, on the average over all iterations, the processors were equally busy. In our analytical evaluation, we assume an ideal load balancing.

4.2. The Pipelined Butterfly Hash-Join

For the pipelined evaluation, we first develop a simple model for a single pipelined hash-join operation; it is also called a butterfly join. We then extend this concept to a sequence or pipeline of butterfly join operations that compute the transitive closure. Assume that a block is the granularity of input/output. In the butterfly hash-join, blocks of both relations, say R1 and R2, are alternately processed. Two hash tables are built to accumulate the input, so that subsequent blocks can also be joined with the accumulated input. For each primitive operation executing on a processor, the first block of the smaller relation, say R2, will be read, hashed, and stored in memory. The first block of R1 will then be read, hashed, and compared with the current contents of the hash table, and the join output, i.e., the pairs of matching tuples from both relations, will be written into an output buffer. R1 will also be stored in the hash table for further comparison with subsequent blocks of R2. The subsequent blocks of R1 and R2 will be treated in a similar fashion. As soon as the number of tuples in the output buffer exceeds B, a block of output will be transmitted. After the last (b2th) block of R2 is processed, the hash table for R1 is discarded.

For each block i, where i = 1, ..., b1, Tinp-R1(i) and Tinp-R2(i) are the times to read, hash, and optionally store blocks of R1 and R2, respectively. Tcomp(i) is the time spent in comparing hashed values with the hash table, and Tout(i) is the time spent to output the join result. For i = 1 the following hold:

    Tinp-R1(1) = Tinp-R2(1) = Tbr + Tr * B + Th * B + Tw * B;
    Tcomp(1) = Tc * B * 1.2;
    Tout(1) = Tw * j * B * B * fr.
For subsequent blocks i = 2, 3, ..., b2, the following hold:

    Tinp-R1(i) = Tinp-R2(i) = Tinp-R1(1);
    Tcomp(i) = Tc * 2 * B * 1.2;
    Tout(i) = Tw * j * (2 * i − 1) * B * B * fr.

For blocks i = b2 + 1, ..., b1, the following hold:

    Tinp-R1(i) = Tbr + Tr * B + Th * B;
    Tinp-R2(i) = 0;
    Tcomp(i) = Tc * B * 1.2;
    Tout(i) = Tw * j * b2 * B * B * fr.

These expressions assume that the input data for the butterfly join operation are always available and there are no delays in obtaining input. Since this is not the case when the input of an operation along the pipeline is actually the output of a previous operation, we must further develop the model for a sequence or pipeline of butterfly join operations.

For an accumulation type operation such as a pipelined butterfly join, as more input blocks are accumulated, a single block of input will be compared against an increasing number of blocks, and the number of join output tuples produced will increase. Note that this assumes a uniform distribution of the join attribute values for both relations. Thus, we model a varying output rate for a sequence of butterfly join operations. This is more accurate than a model that assumes an average rate of output [12]. The following discussion elaborates on this model.

In the pipelined model for an accumulation type operation, the rate of output blocks produced by the operation is determined by the availability of input, as long as the ith block of input can be completely processed before the (i + 1)th input block is available; i.e., the output rate is determined by the slower input rate (which is the output rate of the previous operation providing this input). At some point, the input blocks are available at a faster rate than they can be consumed. This is the critical point for this operation, and after this point, the output rate is determined by the processing rate of the operation itself. The relationship between the number of input blocks consumed (bi) and the number of output blocks produced (bo) in the case of the pipelined butterfly join is

    j * bi * bi * B * B = bo * B    if bi < b2     (13)
    j * bi * b2 * B * B = bo * B    otherwise.     (14)
Equation (13) holds while the blocks of the smaller relation, R2, are being input, and Eq. (14) holds afterward. In any join operation, the join selectivity determines the number of output tuples that will be produced, and hence the output rate. In the case of the transitive closure of a database relation A, in each iteration, the size of the result that is computed will depend on the fan-out factor of the underlying graph, which is contained in A.
As the (average) fan-out increases, the join selectivity also increases. The join selectivity for the transitive closure example is normalized with respect to the size of the input relation(s) and expressed as a growth factor, gf. We choose to normalize gf with respect to the size of RΔ in each iteration of the semi-naive algorithm; in the parallel pipelined strategy this is equivalent to the size of Ti−bf (or Ti−fb) in each level of the pipeline. For example, a gf of 1.0 implies that in each iteration, or each level of the pipeline, the total number of output blocks produced by an operation is equal to the total number of blocks of (one) input relation that is consumed. A gf of 2.0 means that in each iteration, or with each level of the pipeline, the size of the output relation doubles in comparison with the size of (one) input relation. As the gf increases, the output of the butterfly join increases, together with the number of output blocks produced. This is because, with increasing join selectivity, the same number of input blocks consumed by the butterfly join (in the same time) will produce a larger number of blocks of the join output relation.

We now determine the critical point for the pipelined butterfly join operation; this is the point at which input blocks start accumulating for the operation. We wish to calculate the value of the critical block bc, i.e., the input block consumed at the critical point. Recall that the input to an operation at level k, along the critical path, is the output of an operation at level (k − 1). Thus, the critical point for the operation at level k is determined by the operation that provides input; it corresponds to the first block of input consumed by the operation at level (k − 1) that accumulates more than one block at the input of the operation at level k. To calculate these values, we first determine the least value of b satisfying the following equation, for the operation at level (k − 1):

    j * (2 * b − 1) * B * B > B.    (15)
Next, we substitute this value of b (from solving Eq. (15)) for bi in Eq. (13) (or Eq. (14), where appropriate) for the operation at level (k − 1), and we obtain a value for bo. This value of the output block bo is the value of the critical input block bc, consumed at the critical point of the operation at level k. After this point, input will start accumulating at the input of the operation at level k.
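To make the alternating, accumulate-and-probe behavior concrete, here is a minimal operational sketch of the butterfly hash-join in Python. The block streams, attribute indices, and function name are our own assumptions, and the sketch ignores timing, output buffering, and hash table overflow; it assumes R1 has at least as many blocks as R2.

    from collections import defaultdict

    def butterfly_join(r1_blocks, r2_blocks, key1=0, key2=0):
        # h1 and h2 accumulate the tuples of R1 and R2 seen so far
        h1, h2 = defaultdict(list), defaultdict(list)
        out = []
        # alternately process one block of each relation
        for blk1, blk2 in zip(r1_blocks, r2_blocks):
            for t in blk2:                       # probe R1's table, then store
                out.extend((s, t) for s in h1[t[key2]])
                h2[t[key2]].append(t)
            for s in blk1:                       # probe R2's table, then store
                out.extend((s, u) for u in h2[s[key1]])
                h1[s[key1]].append(s)
        # R2 (the smaller input) is exhausted; h1 is no longer needed, and the
        # remaining blocks of R1 are probed against the complete table for R2
        for blk in r1_blocks[len(r2_blocks):]:
            for s in blk:
                out.extend((s, u) for u in h2[s[key1]])
        return out

Each matching pair is produced exactly once: a tuple pairs with all earlier-arriving tuples of the other relation when it is probed, and with later-arriving ones when they are probed in turn.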
4.3. Modeling the Parallel Pipelined Strategy
We use the model for the pipelined butterfly hash-join to model a sequence of critical path operations that compute the resolvents corresponding to the transitive closure of the relation A. The response time and execution time for operations along the pipeline, at level m, will be defined recursively, with respect to operations at level (m − 1), which provide input.
For level 1 operations that do not depend on any other operations for input, these values can be obtained directly, using the expressions for Tcomp(i), Tout(i), etc., as will be seen.
The response time for any primitive operation Pi, Tresp(Pi), will be a function of br, the number of input blocks needed to produce the first output block. By substituting bo = 1 for the first output block in either Eq. (13) or (14), the value of br (= bi) can be obtained. The operations providing input to Pi are Pi−1 and P'i−1, and Pi requires br blocks of input, from each operation, to produce its first output block. Let the time for operation P to consume (process) bi blocks of input be defined as Tproc(P, bi), and the time to produce bo blocks of output be defined as Tprod(P, bo). The response time for operation Pi is determined by the output rates of the operations Pi−1 and P'i−1 that provide input:

    Tresp(Pi) = max. time for Pi−1, P'i−1 to produce br blocks for Pi
              {= max[Tprod(Pi−1, br), Tprod(P'i−1, br)]}
We obtain this expression for Tresp(Pi) because, in all cases, the response time preceded the critical point for Pi; i.e., the first output block was produced before the input started accumulating.

To determine the execution time of operation Pi, Texec(Pi), we first determine the critical point of operation Pi and the corresponding input block, bc, that is consumed at the critical point. Before the critical point, the processing speed for Pi will be controlled by Pi−1 (or P'i−1), the operations that provide input blocks. After the critical point, the processing speed will be controlled by operation Pi itself. If bmax is the maximum number of blocks processed by an operation Pi, then the following holds:

    Texec(Pi) = max. time for Pi−1, P'i−1 to produce bc blocks
              {= max[Tprod(Pi−1, bc), Tprod(P'i−1, bc)]}
              + time for Pi to process a max. of (bmax − bc) blocks
              {= Tproc(Pi, bmax) − Tproc(Pi, bc)}
The above expression models the worst-case situation for evaluating Texec(Pi). Tproc(Pi, bin), the time for any operation Pi to process bin blocks, is

    Tproc(Pi, bin) = Σ (i = 1 to bin) [Tinp-R1(i) + Tinp-R2(i) + Tcomp(i) + Tout(i)].
Tprod(Pi, bout), the time for an operation at level 1 to produce bout blocks, is

    Tprod(Pi, bout) = Tproc(Pi, bin),
where bin is the number of input blocks consumed to produce bout blocks of output. We obtain an expression of this form since operations at level 1 do not depend on other operations for input. For subsequent levels, Tprod(Pi, bout) is defined recursively. Suppose that the bin input blocks (needed to produce bout output blocks), provided by operations Pi−1 and P'i−1, are not available before operation Pi reaches its critical point, corresponding to input block bc. Then Tprod(Pi, bout) is

    Tprod(Pi, bout) = max[Tprod(Pi−1, bin), Tprod(P'i−1, bin)].

If the bin input blocks are available before the operation Pi reaches its critical point, then

    Tprod(Pi, bout) = max[Tprod(Pi−1, bc), Tprod(P'i−1, bc)]
                    + Tproc(Pi, bin) − Tproc(Pi, bc).
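The per-iteration cost expression of Section 4.1 and the critical-block test of Eq. (15) are easy to evaluate numerically; the sketch below does so in Python, with timing constants that are illustrative assumptions of ours rather than the paper's parameter settings.

    def iteration_time(b1, b2, B, j, fr,
                       Tbr=1e-2, Tr=1e-6, Tw=1e-6, Th=2e-6, Tc=1e-6):
        # read, hash, and store the b2 blocks of R2
        build = Tbr * b2 + (Tr + Th + Tw) * b2 * B
        # read, hash, and compare the b1 blocks of R1 (20% collision overhead)
        probe = Tbr * b1 + (Tr + Th) * b1 * B + Tc * b1 * B * 1.2
        # write the join result (fr words per result tuple)
        output = Tw * j * B * B * b1 * b2 * fr
        return build + probe + output

    def critical_block(j, B):
        # least b with j * (2b - 1) * B * B > B, per Eq. (15); assumes j > 0
        b = 1
        while j * (2 * b - 1) * B * B <= B:
            b += 1
        return b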
These expressions for Tproc and Tprod are used to obtain the response time and the execution time for the resolvents Ti, which are the output of operations [1]-1, [1]-2, [2]-1, [2]-2, ..., [i]-1, [i]-2, etc., where i is the depth of the pipeline.

5. RESULTS OF THE ANALYTICAL PERFORMANCE EVALUATION
In this section, we report the results of an analytical performance evaluation for the transitive closure example. We expect that the parallel pipelined approach should have better response time and execution time, compared to the sequential or the parallel semi-naive algorithm. To explain it simply, with pipelining, the execution of an operation (or the evaluation of a resolvent) commences at an earlier instant. There is also greater opportunity for overlap in reading or writing blocks of data when operations along the pipeline share data. One parameter used in our study is the join selectivity growth factor, gf. Another parameter is the number of tuples, N, of the initial database relation, A. A third parameter is the selectivity, s, for the initial database relation A−bf. The next parameter is the depth i of the resolvents Ti being evaluated. A final parameter is the number of processors p utilized in evaluating the resolvents. We assume up to a maximum of 64 processors for the multiprocessor environment. We also study the performance of the algorithms when the memory is limited; this results in an overhead due to hash table overflow. Finally, we examine the tradeoffs involved in eliminating redundant computations.

Figure 4 shows the response time and the execution time for the parallel pipelined strategy as a function of the depth i of the resolvents Ti. The value of gf is 1.0 and ensures that the size of the output produced by the critical path operations remains a constant at all levels along the pipeline. The figure shows that for small i, with pipelining, the response time is much less than the execution time. As i increases, these two curves tend to move closer. The reason is that for small i there are fewer operations (and delays) along the critical path, and thus the response time is small. As i increases, and the resolvents Ti become longer, there are more operations (and delays) along the critical path, which tend to increase the delay in producing the first block of output for Ti.

FIG. 4. Response time and execution time for the pipelined strategy. (Number of tuples of A, N = 200,000; join selectivity growth factor = 1.0; maximum number of processors = 64; x-axis: depth of resolvents Ti.)

5.1. Comparison with the Sequential Semi-naive Strategy

Figure 5 compares the parallel pipelined evaluation (utilizing a varying number of processors, depending on the depth of the resolvent Ti being evaluated) with the sequential semi-naive algorithm executing on a single processor. The ratio of speedup in the response time is plotted versus the depth of the resolvents Ti. As seen in Fig. 5, for smaller i, the speedup ratio is much larger, and it decreases with increasing values of i. To explain, for the longer resolvents, there are more operations along the critical path, hence the initial delays in setting up the pipeline accumulate. This has an adverse effect on the response time, and hence the speedup ratio for the response time decreases.
FIG. 5. Ratio of speedup in response time for the pipelined strategy compared to sequential semi-naive. (Join selectivity growth factor = 1.0; maximum number of processors = 32; curves for several relation sizes N; x-axis: depth of resolvents Ti.)

Figure 6 compares the ratio of speedup in the execution time for these two strategies, as the depth i of the resolvent increases, for different values of gf. As i increases, the speedup ratio also increases. To explain, after the initial delays, when the pipeline is operating in steady state, the benefits of pipelining are experienced. The larger the value of i, the greater the depth of the pipeline of operations. This increases the benefits obtained from pipelining. The speedup ratio increases more rapidly with a larger value of gf. Again, as gf increases, so does the size of the relations being passed down the pipeline. The longer the pipeline operates in steady state, the greater the benefits it experiences from pipelining.

FIG. 6. Ratio of speedup in execution time for the pipelined strategy compared to sequential semi-naive. (Number of tuples N = 200,000; block size = 20; maximum number of processors = 64; gf: join selectivity growth factor; x-axis: depth of resolvents Ti.)

In order to determine the benefit of utilizing an increasing number of processors for the parallel pipelined strategy, we normalize the speedup ratio in the execution time with respect to the sequential semi-naive algorithm executing on one processor. The execution time for each resolvent Ti in the sequential semi-naive evaluation is divided by the number of processors p utilized by the parallel pipelined strategy to evaluate the same resolvent Ti. This results in a normalized speedup ratio in the execution time for the parallel pipelined strategy; the normalized speedup is then used to measure the benefits obtained from each additional processor being utilized.
T32’
T24
time for the pipelined strategy
Figure 7 plots this normalized speedup in execution time versus the depth of the resolvent Ti. The number of processors utilized by the parallel pipelined strategy increases with increasing values of i. Plots are obtained for varying values of gf. For a gf of 1.0 or 1.1, this normalized speedup ratio is seen to decrease with increasing i, while for larger gf values of 1.2 or 1.3, the normalized speedup ratio is seen to increase. To explain, as the depth of the pipeline increases, there is an accumulation of delays along the critical path of the pipeline. Although additional processors are utilized, the benefits of each additional processor decrease, and hence the speedup ratio decreases. However, the higher values of gf counteract the effect of the accumulating delays. Consequently, the benefits of each additional processor utilized increase.

FIG. 7. Normalized speedup in execution time for the pipelined strategy compared to sequential semi-naive. (Maximum number of processors = 32; number of tuples N = 1,600,000; gf: join selectivity growth factor; x-axis: depth of resolvents Ti.)

Plots obtained for a value of 1.3 for the growth factor show a normalized speedup value greater than 100%, compared to the normalized sequential semi-naive strategy. To explain this seeming anomaly, we note that in the case of the parallel pipelined strategy, the resolvents that are being evaluated in parallel on different processors share the output of common subexpressions. Thus, there is a potential overlap of read/write operations along the pipeline. This overlap benefits the pipelined evaluation, and so it is possible to obtain speedup ratios of greater than 100%.

Several of the experiments were conducted with a selectivity factor of 100%, i.e., when the arguments were unbound. Although this resulted in unbounded joins for the parallel pipelined strategy, it did not significantly degrade the performance in comparison with the semi-naive strategy. For the pipelined evaluation, all of the operations commence execution much earlier. Executing an unbounded join causes the pipeline to execute longer in steady state; as we have noted previously, this is not a significant drawback, since the operations along the pipeline execute concurrently. In a later section, we analyze the performance of our strategy with hash table overflow; this situation is more likely to occur with unbounded joins. The analysis indicates that the pipelined strategy does not degrade when compared to the parallel semi-naive algorithm. Unbounded joins may also be a problem when there is an insufficient number of processors to evaluate the decomposed operations in parallel. This is an issue that we propose to study experimentally [16].

5.2. Comparison with the Parallel Semi-naive Strategy
We now compare the parallel pipelined strategy with the parallel semi-naive algorithm, where each strategy utilizes the same number of processors. In Fig. 8, we consider the ratio of speedup in execution time for the parallel pipelined strategy over the parallel semi-naive algorithm. The parallel pipelined strategy performs better than the parallel semi-naive algorithm. However, as can be seen, the speedup decreases with increasing depth of resolvents. To explain, with increasing depth of the resolvents, the number of processors being utilized also increases. Due to its strategy of horizontally partitioning the database relation to obtain parallelism, the parallel semi-naive algorithm benefits in a linear manner from each additional processor utilized; i.e., the benefit of each additional processor is a constant under the assumption of ideal load balancing. However, as was seen previously in Fig. 7, for a value of gf = 1.0, i.e., when the size of relations along the pipeline remains a constant, the parallel pipelined strategy does not benefit as much from each additional processor being utilized. Hence a decrease in speedup results, as is seen in Fig. 8.

FIG. 8. Normalized speedup in execution time for the pipelined strategy compared to parallel semi-naive. (N: number of tuples; x-axis: depth of resolvents Ti.)

However, we previously noted that as the size of relations along the pipeline increases, the parallel pipelined strategy benefits increasingly from each additional processor utilized. This is demonstrated in Fig. 9, which plots the speedup in execution time versus the growth factor. As the growth factor increases, so does the speedup, since the parallel pipelined strategy increasingly benefits from each additional processor and from operating the pipeline in the steady state for longer periods.

FIG. 9. Ratio of speedup in execution time versus the join selectivity growth factor. (Number of tuples of A, N = 3,200,000; block size B = 20; maximum number of processors = 32; ratio for T8; gf values from 1.2 to 1.6.)

5.3. Performance with Limited Memory and Hash Table Overflow
In the case where memory is limited and cannot accommodate the hash table, hash table overflow requires that the join operation be executed in several phases by each processor. Assume that the size of memory is N_P pages and the two hash tables are of sizes p1 * N_P and p2 * N_P, respectively, where p1 > p2. For the semi-naive algorithm, in each iteration, each partition of the smaller hash table (of N_P pages) will reside in memory and be joined with the larger hash table.
Partitions of the smaller hash table will be phased in, and processing will continue until the entire join completes. The execution time for each iteration of the algorithm is (p2 + p1 * p2) * Tsn(N_P), where Tsn(N_P) is the time for joining the larger relation with one memory-resident partition. We note that the response time for the algorithm is adversely affected due to hash table overflow.

For the pipelined evaluation we have two options. The first option, O1, is to initially follow the pipelined execution until N_P/2 pages of each hash table are processed. Then, by alternately allowing a partition of one hash table to reside in memory and joining it with all the partitions of the other hash table, the two relations can be joined in phases. This alternation makes sure that the pipeline is not allowed to decay but maintains a steady state. Each operation in the pipeline has an execution time of Tpl(N_P/2) + (p1 + (p2 − 1) + 1) * (N_P/2) * Tpl(1), where Tpl(x) is the pipelined processing time for x pages of the hash table. The disadvantage is that with this option partitions of both hash tables must reside in memory; thus, the hash tables are phased in and processed more times than, for example, with the semi-naive scheme, where partitions of only one hash table reside in memory. However, the pipelined scheme has the advantage that the operations execute concurrently; thus, the overhead due to hash table overflow, for each operation in the pipeline, is not cumulative, as is the case with the sequence of iterations of the semi-naive algorithm.
The second option for the pipelined evaluation, O2, is to initially follow the pipelined execution until N_P/2 pages of each hash table are processed. Then, the partition of the larger hash table is discarded. The partition of the smaller hash table is accumulated in memory (up to N_P pages) and is joined with the larger hash table. Processing continues until the entire join is completed. The disadvantage is that each time a partition of the hash table is flushed from memory, the output of the pipeline decays. Each operation in the pipeline executes for (p2 + p1 * p2) * Tpl(N_P), where Tpl(N_P) is the time for the pipelined evaluation of one memory-resident partition of the hash table.

Figure 10 plots the ratio of slowdown in execution time for the pipelined strategy with overflow, compared to the execution time without overflow, for several values of P, the percentage of each hash table that can reside in memory. As expected, the first option, O1, with a steady pipeline, has a smaller slowdown ratio and performs better than option O2, with a decaying pipeline. The performance degradation increases with lower values of P, i.e., as smaller partitions of the hash table can reside in main memory, as expected. For both O1 and O2, as the depth of the pipeline (resolvents) increases, the ratio of the slowdown decreases. To explain, the operations along the pipeline execute concurrently, and the effect of the overhead due to overflow is not cumulative. The benefits from executing the pipeline for a longer period offset the overhead due to overflow.

FIG. 10. Ratio of slowdown in pipelined execution time with hash table overflow. (P: percentage of hash table resident in memory; maximum number of processors = 32; join selectivity growth factor = 1.0; x-axis: depth of resolvents Ti.)

Figure 11 plots the ratio of the speedup in execution time for the two pipelined options, in comparison with the parallel semi-naive algorithm.
The latter has the disadvantage that the effect of the overhead due to overflow is cumulative with each iteration. However, it has the advantage that, due to horizontal partitioning of one of the relations among the processors, a smaller segment of the hash table (of that relation) resides in each processor. As the number of processors utilized increases, the overhead due to overflow is minimized. Thus, when 64 processors are utilized, there is effectively no overhead due to overflow for the parallel semi-naive algorithm. As seen in Fig. 11, the pipelined option O1, with a steady pipeline, initially outperforms the parallel semi-naive strategy, but as an increasing number of processors are utilized its performance degrades.

FIG. 11. Ratio of speedup in execution time compared to parallel semi-naive with hash table overflow. (P: percentage of hash table resident in memory; maximum number of processors = 32; join selectivity growth factor = 1.0; x-axis: depth of resolvents Ti.)

Schneider and DeWitt report that hash table overflow, due to limited memory, can result in disk head contention when multiple processors are executing a query [18]. As the degree of parallel execution increases, one of the significant factors that affect performance is data placement; a high degree of declustering relations over all available disks will increase disk head contention and degrade performance, as reported in [18]. For both the parallel semi-naive and the pipelined strategy, the number of parallel operations executing concurrently is the same. In both cases, each of the operations (processors) accesses two relations; one is the initial database relation, and the second relation is the result relation of the previous iteration (or the output of the previous operation in the pipeline). Since all the operations access the initial relation, the effect of clustering/declustering this relation will be the same for both strategies. Also in both strategies, the second relation in each join is produced during query execution, and is only accessed by one (or two) other operations. Thus, the second relation could be clustered on local disks, so as not to degrade performance. These issues will be studied in future experimental evaluation [16].

5.4. The Performance Tradeoffs for Redundant Computation
There are two sources of redundant computation in the transitive closure example. Assume there are duplicate paths between two nodes i and j of a graph. When the same answer is produced more than once (by different resolvents), and the answers are passed on to the next iteration (or the next operation in the pipeline) without eliminating duplicates, there is redundant computation. In the parallel pipelined evaluation, the different resolvents are evaluated on separate processors. Thus, there is a communication overhead associated with duplicate elimination, since answers must be completely forwarded until the end of the pipeline. In the parallel semi-naive algorithm, these duplicate values are computed on the same processor and there is no communication overhead associated with duplicate elimination. However, storing the answers of previous iterations (for duplicate elimination) significantly increases the size of the hash table, especially in later iterations. This can result in hash table overflow. Our previous analysis indicates that, for the semi-naive algorithm, the overhead due to hash table overflow has a cumulative effect on the performance degradation. Figure 12 plots the ratio of the speedup in execution time for the parallel pipelined algorithm without duplicate elimination, in comparison with the parallel semi-naive algorithm with duplicate elimination. We use the join selectivity growth factor (gf) to account for the fact that the pipelined algorithm does not eliminate duplicates. Plots are obtained for gf values of 1.2 and 1.3; i.e., in each iteration 20% (30%) of the tuples produced in the result are duplicates. Thus, with increasing depth of the pipeline, the additional tuples due to duplicates are multiplied. As seen in Fig. 12, initially the pipelined algorithm performs better, but the ratio of the speedup decreases with longer resolvents. Initially the decrease is rapid, but after some depth of the pipeline the decrease is less steep. To explain, since the semi-naive algorithm must accumulate all the previous results, it stores a larger hash table and pays the penalty of possible hash table overflow; this overhead is cumulative. The pipelined algorithm, on the other hand, pays the penalty that, with increasing depth of the pipeline, the percentage of additional (duplicate) tuples increases. However, there are benefits from operating the pipeline for a longer time. Thus, the decrease of the speedup ratio is not as steep. We note that this analytical evaluation has made a number of simplifying assumptions.
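To make the growth of duplicate tuples concrete, the sketch below compounds the growth factor along the pipeline. It is a minimal illustration, assuming (hypothetically) that the duplicate fraction compounds multiplicatively at each stage because uneliminated duplicates join again downstream; the function name and the unit base size are ours.

    def pipeline_result_sizes(base_size, gf, depth):
        """Result size at each pipeline stage when duplicates are NOT
        eliminated, relative to a duplicate-free evaluation.

        gf = 1.2 means 20% of the tuples produced at each stage are
        duplicates; since duplicate tuples join again downstream, the
        excess compounds: after d stages the result is gf**d times the
        duplicate-free size (a simplifying assumption of this sketch).
        """
        return [base_size * gf ** d for d in range(1, depth + 1)]

    # Example: gf = 1.2, pipeline of depth 8 -> the last stage carries
    # about 4.3x the duplicate-free number of tuples.
    print(pipeline_result_sizes(1.0, 1.2, 8))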
FIG. 12. Ratio of speedup in execution time compared to parallel semi-naive with duplicates. (Curves for gf = 1.2 and gf = 1.3, against depth of resolvents T4 through T16; 25% of hash table resident in memory; maximum number of processors = 32; gf = growth factor of duplicates.)
The second source of redundant computation is harder to eliminate. If there is a path from node i to node j, then all the computation in finding the set of nodes connected to node j, i.e., its transitive closure, will be repeated in finding the transitive closure of i. Neither the parallel semi-naive strategy nor the parallel pipelined strategy can efficiently eliminate these redundant computations. In general, in a shared-nothing architecture, the communication overhead of avoiding redundant computations seems greater than the benefits. Efficient parallelization of the transitive closure algorithms, while avoiding redundant computation, seems to mandate sharing of hash tables and sharing the results of previously computed closures. Thus, we are investigating the implementation of these algorithms on a shared-memory architecture [16]. Here the overhead for sharing hash tables is not a communication overhead but rather the overhead of accessing shared memory. With the hash table of answers maintained in shared memory, redundant computations may be eliminated. An experimental study and performance comparison of several of these algorithms, on a shared-memory machine, is currently being performed [16].

5.5. Summary of the Results of the Performance Evaluation

To summarize, the speedup in the response time and the execution time demonstrated the benefits of the parallel pipelined strategy. The benefits of pipelining are greater with a greater depth of the pipeline, corresponding to longer resolvents, and when the join selectivity growth factor exceeds 1.0, i.e., as larger relations are passed down the pipeline. For the pipelined strategy, as larger relations are passed down the pipeline of operations and as the depth of the pipeline increases, there is an increasing benefit from each additional processor being utilized. Thus, the speedup in execution time can increase compared to the parallel semi-naive algorithm in these cases. With limited memory, the pipelined evaluation can maintain a steady pipeline and can outperform the parallel semi-naive algorithm. However, the horizontal partitioning of the relations, for the latter strategy, minimizes the overhead due to hash table overflow when a large number of processors are utilized.
6. THE EVALUATION STRATEGY FOR A LINEAR RECURSIVE PREDICATE
The following is a definition of a single linear recursive predicate S; Eq. (16) is the PRL and Eq. (17) is the exit clause:

S(-, -, ...) <- Left(-, -, ...), S(-, -, ...), Right(-, -, ...)   (16)

S(-, -, ...) <- Exit(-, -, ...).   (17)
Left, Exit, and Right are subexpressions comprising relational operations that are executed against only the database relations, and each results in a relation. The following resolvents, Si, are obtained from compiling the clauses:

S1: Exit(b)   (18)

S2: Left(b21), Exit(b22), Right(b23).   (19)

S3: Left(b31), Left(b32), Exit(b33), Right(b34), Right(b35), etc.   (20)
Let b represent the set of variables that are initially bound. Then, in each resolvent Si, bi1, bi2, etc., are the bound variables. The number of variables bound after each traversal of the PRL must not exceed the number that were previously bound. The following algorithm decomposes the resolvents Si into primitive operations; each operation is marked with the legend [m]-n, where m is the level of the operation and n is the operation number. We use a variable array NumOper[i] to represent the number of parallel operations; the value of each array element, for each level i, is the maximum number of parallel operations at that level. Each array element is initialized to 0. Let p be the number of processors.
ALGORITHM FOR DECOMPOSING RESOLVENTS AND MARKING OPERATIONS.
INIT: The operation for evaluating Exit(b) is marked [1]-1. The value of NumOper[1] is updated from 0 to 1.

LOOP (while Σi NumOper[i] < p): For each Si, where i > 1, starting from both ends and proceeding inward, group the subexpressions together for a pairwise join, nesting the joins by using the results of previous groupings. The grouping from the left gets precedence. Each pairwise join is a primitive operation. To mark each operation with a level and an operation number, for each Si, starting with the most deeply nested operation and proceeding outward, alternating from left to right, test the following:
(a) If the operation has been previously computed by a resolvent Sj, j < i, then it is marked common.
(b) If the operation has not been previously computed, then test the following cases:
(1) If the two input operands for the operation have not been previously computed by any Sj, j < i, then mark this operation [1]-(j+1), where the value of j is the current value of NumOper[1]. Increment this value: NumOper[1] = NumOper[1] + 1.
(2) If one of the operands has been previously computed (by another operation) at level k, then this operation must execute at level k+1. Thus, the operation is marked [k+1]-(j+1), where the value of j is the current value of NumOper[k+1]. Increment this value: NumOper[k+1] = NumOper[k+1] + 1.
(3) If both operands have been previously computed, at levels k1 and k2, respectively, then this operation is marked [kmax+1]-(j+1), where kmax is the maximum of (k1, k2) and the value of j is the current value of NumOper[kmax+1]. Increment it: NumOper[kmax+1] = NumOper[kmax+1] + 1.
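The sketch below is a simplified rendering of the marking rules (a) and (b) only; it assumes the INIT step and the pairwise grouping have already produced nested join trees (atoms are treated as level-0 inputs), and all names are ours.

    def mark_operations(resolvents):
        """Assign [level]-opnum marks to the primitive join operations of
        each resolvent, sharing operations common to earlier resolvents.

        Each resolvent is a nested tuple of pairwise joins over atom
        strings, e.g. S2 = (("Left", "Exit"), "Right").  A sketch of the
        marking rules only; the pairwise grouping step is assumed done.
        """
        marks = {}      # subexpression -> (level, opnum)
        num_oper = {}   # level -> number of operations at that level

        def level_of(operand):
            # Atoms are base relations: available at level 0.
            return marks[operand][0] if isinstance(operand, tuple) else 0

        def visit(expr):
            if not isinstance(expr, tuple) or expr in marks:
                return             # atom, or marked common: computed earlier
            a, b = expr
            visit(a)               # most deeply nested operations first
            visit(b)
            level = max(level_of(a), level_of(b)) + 1
            num_oper[level] = num_oper.get(level, 0) + 1
            marks[expr] = (level, num_oper[level])

        for s in resolvents:
            visit(s)
        return marks

    # S2 and S3 for transitive closure; the inner ("Left", "Exit") join of
    # S3 is marked common with S2's and is not counted again.
    s2 = (("Left", "Exit"), "Right")
    s3 = (("Left", ("Left", "Exit")), ("Right", "Right"))
    for expr, (lvl, n) in mark_operations([s2, s3]).items():
        print(f"[{lvl}]-{n}", expr)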
6.1. Architecture for the Parallel Pipelined Evaluation

Assume that there are p processors and that each operation is to execute on a separate processor. A straightforward mapping of the primitive operations to the processors provides the architecture of Fig. 13; the figure also depicts the interconnection structure required to facilitate sharing of output among the processors. Each level is a replica of the previous level, thus providing regularity of the interconnection structure and simplifying the task of assigning operations to the processors. The answers for each resolvent Si are obtained as output from the corresponding operation marked [i]-1. The bold lines in the figure represent the critical path of operations.
FIG. 13. Architecture for evaluating a linear recursive predicate.
The similarity of the architecture of Fig. 13 to that of Fig. 3 leads to the hypothesis that the performance of the parallel pipelined strategy, while evaluating a linear recursive predicate, must exhibit the same behavior as that exhibited while evaluating the transitive closure. The value of d for the terminating resolvent Sd cannot be predetermined. If the number of processors available is limited, then, in the assignment of operations to processors, priority is given to the operations that are marked at lower levels and are in the critical path of resolvents at lower levels. For example, within each level i, priority is given to the operation [i]-1 that evaluates the resolvent Si, and then to [i]-2, since it is in the critical path of the next resolvent Si+1. The operations [i]-3 and [i]-4 have lower priority, since they are in the critical path of resolvent Si+2. If p is the number of processors available, then we will be able to simultaneously evaluate resolvents S1, S2, ..., Si, where i is the largest positive integer (i >= 3) satisfying the inequality (i - 2) * 4 + 3 <= p. The outputs of processors [i-2]-3 and [i-2]-4 are saved on disk until Si is computed. Then these results are read back in and the processors are reused to compute Si+1, Si+2, ..., S2*i, etc. The analytical model can be modified for this situation.
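The inequality above gives the deepest resolvent that can be evaluated simultaneously for a given processor count. A minimal check of this bound (the function name is ours):

    def deepest_simultaneous_resolvent(p):
        """Largest i (i >= 3) with (i - 2) * 4 + 3 <= p: resolvents
        S1, ..., Si can be evaluated simultaneously on p processors."""
        i = (p - 3) // 4 + 2
        return i if i >= 3 else None   # fewer than 7 processors: bound n/a

    # Example: 32 processors allow S1 through S9, since (9-2)*4 + 3 = 31 <= 32.
    print(deepest_simultaneous_resolvent(32))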
6.2. Correctness of the Evaluation and Termination

We informally discuss the correctness and the termination of the evaluation. Each resolvent is decomposed into a nested sequence of joins (or Cartesian products) involving the (relational) expressions Left, Exit, and Right. The relational join operator and the Cartesian product operator are both commutative and associative. Thus, irrespective of the order of decomposing the resolvents into primitive operations, the answers computed for S(b) will be the same. Similarly, sharing common subexpressions between the different resolvents will provide the same answers as when the resolvents are evaluated independently, provided the output of the operation evaluating the subexpression is completely input to the other operations using this output. The subexpressions that are shared are the sequences (Left ⋈ Left), ((Left ⋈ Left) ⋈ Left), ..., (Right ⋈ Right), ..., etc. The attribute values in each of these subexpressions, e.g., (Left ⋈ Left), that are relevant for the next computation are the values for distinguished variables in S(b) and the join variables that are common to the subsequent operation using the answers, e.g., ((Left ⋈ Left) ⋈ Left). A sequence such as (Left ⋈ Left), ((Left ⋈ ...
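As a small sanity check of the result-sharing argument, the sketch below (reusing the join function from the earlier sketch; names ours) evaluates ((Left ⋈ Left) ⋈ Left) both independently and by reusing the shared (Left ⋈ Left) result.

    def join(r, s):
        """Natural join of binary relations on the shared middle attribute."""
        return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

    left = {(1, 2), (2, 3), (3, 4), (4, 5)}

    # Independent evaluation of ((Left |><| Left) |><| Left).
    independent = join(join(left, left), left)

    # Result sharing: (Left |><| Left) is computed once, and its complete
    # output is fed to the subsequent operation, as correctness requires.
    shared = join(left, left)
    reused = join(shared, left)

    assert independent == reused   # same answers either way
    print(reused)                  # {(1, 4), (2, 5)}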
7. CONCLUSION AND FUTURE RESEARCH
A parallel pipelined strategy for evaluating the resolvents obtained from compiling a single linear recursive predicate in a multiprocessor environment has been described. An analytical performance evaluation compared this strategy with a sequential bottom-up semi-naive algorithm and a parallelization of this algorithm, for computing the transitive closure of a database relation. Measurements of the response time and the execution time, for each resolvent, demonstrated the benefits of the parallel pipelined strategy. The pipelined strategy can increasingly benefit from each additional processor utilized when the size of the relations being passed down the pipeline increases.
The analysis indicates that even with limited memory, the pipelined strategy can maintain a steady pipeline; the overhead due to overflow is not cumulative along the pipeline, since all operations execute concurrently. Since our research indicates that pipelined evaluation techniques are efficient, in future research we wish to investigate the effect of combining pipelined evaluation with, for example, the parallelized semi-naive algorithm. Our studies also indicate that an efficient parallelization of the transitive closure algorithms requires sharing of hash tables and join results. Thus, in an experimental evaluation, we are studying both implementation techniques and the performance of several transitive closure techniques in a shared-memory multiprocessor system [16]. We are also investigating the use of the connection-graph-based technique for evaluating multiple linear recursive queries [14].

REFERENCES

1. Agrawal, R., and Jagadish, H. V. Multiprocessor transitive closure algorithms. Proc. 1988 International Symposium on Databases in Parallel and Distributed Systems, 1988.
2. Bancilhon, F., Maier, D., Sagiv, Y., and Ullman, J. D. Magic sets and other strange ways to implement logic programs. Proc. ACM Symposium on Principles of Database Systems, 1986.
3. Bancilhon, F., and Ramakrishnan, R. Performance evaluation of data intensive logic programs. In Minker, J. (Ed.). Foundations of Deductive Databases and Logic Programming, 1988.
4. Boral, H., et al. Prototyping Bubba, a highly parallel database system. IEEE Trans. Knowledge Data Engrg. 2, 1 (Mar. 1990), 4-24.
5. Boral, H., and DeWitt, D. J. A methodology for database system performance evaluation. Proc. ACM SIGMOD Conference. Boston, Massachusetts, 1984.
6. Chou, H., and DeWitt, D. J. An evaluation of buffer management strategies for relational database systems. Proc. Conference on Very Large Data Bases. Stockholm, Sweden, 1985.
7. DeWitt, D., et al. The Gamma database machine project. IEEE Trans. Knowledge Data Engrg. 2, 1 (Mar. 1990), 44-62.
8. DeWitt, D., et al. Implementation techniques for main memory database systems. Proc. ACM SIGMOD Conference. Boston, Massachusetts, 1984.
9. Finkelstein, S. Common expression analysis in database applications. Proc. ACM SIGMOD Conference, 1982.
10. Han, J., and Lu, H. Some performance results of recursive query processing in relational database systems. Proc. IEEE Conference on Data Engineering, 1984.
11. Kim, W., Gajski, D., and Kuck, D. A parallel pipelined relational query processor. ACM Trans. Database Systems 9, 2 (1984).
12. Mikkilineni, K. P., and Su, S. Y. W. An evaluation of relational join algorithms in a pipelined query processing environment. IEEE Trans. Software Engrg. 14, 6 (June 1988).
13. Henschen, L. J., and Naqvi, S. A. On compiling queries in recursive first order databases. J. Assoc. Comput. Mach. 31, 1 (1984).
14. Raschid, L. A connection graph based method for compiling and evaluating multiple linear recursive predicates. In preparation.
15. Raschid, L., and Su, S. Y. W. A parallel processing strategy for evaluating recursive queries. Proc. International Conference on Very Large Data Bases. Kyoto, Japan, 1986.
16. Raschid, L., and Young-Myers, H. An experimental performance evaluation of parallel transitive closure algorithms. In preparation.
17. Reiter, R. Deductive question answering on relational data bases. In Logic and Databases. Plenum, New York, 1978.
18. Schneider, D., and DeWitt, D. Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. Proc. Conference on Very Large Data Bases. Brisbane, Australia, 1990, pp. 469-480.
19. Sellis, T. K. Multiple-query optimization. ACM Trans. Database Systems 13, 1 (1988), 23-52.
20. Stonebraker, M. The case for shared nothing. IEEE Database Engrg. 9, 1 (Mar. 1986).
21. Ullman, J. D. Implementation of logical query languages for databases. Proc. ACM SIGMOD Conference. Austin, Texas, 1985.
22. Valduriez, P., and Khoshafian, S. Parallel evaluation of the transitive closure of a database relation. Internat. J. Parallel Programming 17, 1 (Feb. 1988).
23. Wolfson, O. Sharing the load of logic program evaluation. Proc. 1988 International Symposium on Databases in Parallel and Distributed Systems, 1988.
24. Wong, E., and Youssefi, K. Decomposition: A strategy for query processing. ACM Trans. Database Systems 1, 3 (1976).

Received July 24, 1990; accepted December 9, 1990
LOUIQA RASCHID received her Bachelor’s degree from the Indian Institute of Technology, Madras, in 1980, and her Ph.D. in electrical engineering from the University of Florida, Gainesville, in 1987. Since then she has been an assistant professor with the Department of Information Systems, and with the Institute for Advanced Computer Studies, at the University of Maryland, College Park. Her current research interests include defining semantics for rule-based programs; providing transaction support for rule execution; and the evaluation of recursive queries and rules in multiprocessor systems. She is a member of IEEE, ACM, and the Society of Women Engineers. STANLEY Y. W. SU received his Ph.D. in computer science from the University of Wisconsin, Madison, in 1968. He is professor of computer and information sciences and of electrical engineering, and director of the Database Systems Research and Development Center at the University of Florida. He was one of the founding members of the IEEE-CS Technical Committee on Database Engineering and has chaired several workshops and conferences on database, software system, and architecture areas. He has served or is serving as an editor of the following journals: IEEE Transactions on Software Engineering, the Journal of Parallel and Distributed Computing, the International Journal on Computer Languages, IEEE Transactions on Knowledge and Data Engineering, and the International Journal on Very Large Data Bases. He is the author of Database Computers: Principles, Architectures and Techniques, McGraw-Hill, 1988, and of over 100 technical papers.