Microprocessing and Microprogramming 38 (1993) 437-444, North-Holland
Optimal data dependence chaining in parallel loops*

Z. Szczerbiński

Polish Academy of Sciences, Institute for Theoretical and Applied Computer Science, Bałtycka 5, 44-100 Gliwice, Poland

We propose a method for optimizing shared-memory MIMD programs containing parallel loops with loop-carried data dependences of various distances. The optimization consists in reducing the number of synchronizations necessary to satisfy the dependences by identifying dependences which are redundant and need not be synchronized. The idea of dependence chaining is presented. Next, conditions for dependence arc elimination are formulated. A proposal for the application of nested forward dependences in dependence chaining, so as to achieve extra arc elimination, is put forward. The theoretical considerations are accompanied by a description of the algorithm which implements the proposed method and by an example of its application.

*This work was supported by Poland's Committee for Scientific Research grant No. 3 3512 91 02.

1. INTRODUCTION

In this paper we propose a method for the optimization of programs containing parallel loops, designed to be run in a shared-memory multiprocessing (MIMD) environment. The loops in question are not nested and contain no branches. They feature loop-carried data dependences of various distances which are constant throughout loop execution; some of these dependences are assumed to be forward. As a natural consequence of the flow of data in a program, a data dependence [1, 2], or simply dependence (the issue of control dependences is beyond the subject of this paper), exists between two statements if they both access the same variable and at least one access assigns a new value to the variable. Denoting the statement executed earlier by S1 and the one executed later by S2, we call S1 the source and S2 the sink of the dependence. A general classification of data dependences subdivides them into true, anti- and output dependences, depending on which statement (source, sink, or both, respectively) assigns a new value to the variable. As regards a dependence inside a loop, it is loop-carried if it exists across loop iterations, i.e. the access of S2 to the variable occurs at least one iteration later than that of S1; otherwise it is loop-independent (both accesses occur in the same iteration) [3].

For a dependence in a non-nested loop we define its distance as the number of iterations that it crosses. A loop-independent dependence (LID) has distance 0, whereas a loop-carried dependence (LCD) is of distance 1 or greater. We limit our attention to dependence distances which are constant throughout loop execution. Loop-carried dependences can be further classified into forward and backward ones, depending on the lexical order in which the dependence's source and sink appear in the body of the loop. A forward dependence occurs if the source lexically precedes the sink. In a backward dependence the sink lexically precedes the source, or they are the same statement (S1 = S2). A parallel loop is commonly represented as a graph whose nodes (vertices) correspond to the loop body's statements and whose arcs (edges) symbolize dependences. An example Fortran-syntax parallel loop and its associated graph are presented in Fig. 1. The solid arcs represent LCDs whereas the dashed arcs symbolize LIDs. In a multiprocessing environment, a loop is parallelized so that different iterations are assigned to separate processors. We base our approach to loop parallelization on the notion of doacross loops. In a doacross loop [4, 5], iterations are mapped to consecutive processors, and mechanisms must be provided to synchronize iterations containing sources and sinks of LCDs. LIDs require no synchronization other than the natural one provided by the sequential execution of an iteration.

a) the loop:

        doacross I = 2,100
    S1:   A(I) = B(I*2)+X(I)
    S2:   C(I) = A(I)/2.0
    S3:   F(I) = C(I+1)-A(I)
    S4:   D(I+1) = C(I)*C(I)
    S5:   X(I) = D(I)+C(I-1)
    S6:   X(I+2) = C(I)+F(I)
        enddoacross

b) [graph omitted]

Figure 1. An example parallel loop and its graph.
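As a side illustration (ours, not the paper's): for subscripts of the affine form I + c, the constant distance of a true dependence can be read directly off the two offsets; the D(I+1)/D(I) pair of S4 and S5 above gives distance 1.

    # Illustrative sketch (not from the paper): distance of a true
    # dependence from a write to X(I + write_offset) to a read of
    # X(I + read_offset); positive means loop-carried.
    def distance(write_offset: int, read_offset: int) -> int:
        return write_offset - read_offset

    print(distance(1, 0))   # S4: D(I+1) -> S5: D(I)   => 1 (LCD)
    print(distance(0, 0))   # S1: A(I)   -> S2: A(I)   => 0 (LID)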

We adopt random synchronization of loop-carried dependences [6], the most flexible of the synchronization strategies. More restricted strategies (software pipelining, barrier synchronization, critical sections) may be easier to implement yet provide less parallelism. In random synchronization, each LCD has associated with it a pair of synchronization primitives [6]:

• a non-blocking post(label, iteration number), placed immediately after the dependence's source,

• a blocking wait(label, iteration number), placed immediately before the dependence's sink.

A wait suspends execution of an iteration until the corresponding post (i.e. the one with the same label and iteration number) has been executed. Since each dependence must be synchronized separately and each synchronization adds the overhead of the extra code implementing post and wait, a relatively large number of dependences in a loop results in a prohibitive time cost, especially for loops with large numbers of iterations.
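To make the primitives concrete, here is a minimal sketch (the paper prescribes no implementation; the names and data layout are ours) modelling post and wait with one event per (label, iteration number) pair:

    import threading

    _lock = threading.Lock()
    _events = {}

    def _event(label, i):
        # One shared event per (label, iteration number) pair.
        with _lock:
            return _events.setdefault((label, i), threading.Event())

    def post(label, i):
        # Non-blocking: the dependence source in iteration i has completed.
        _event(label, i).set()

    def wait(label, i):
        # Blocking: suspend until post(label, i) has been executed.
        _event(label, i).wait()

In this scheme, for an LCD of distance d the worker executing iteration j would call wait(label, j - d) just before the sink and post(label, j) immediately after the source.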

Figure 2. An example of dependence arc elimination: D1 eliminates D2; a) The loop graph, b) The CPG.

Thus, while programming a parallel loop, one would like to have as few dependences to synchronize as possible. Our paper is concerned with the problem of identifying certain LCDs which need not be synchronized. The rest of the paper is organized as follows. In section 2, the idea of dependence arc elimination by dependence chaining is recalled. Section 3 discusses the rules for dependence arc elimination in a pair of forward dependences. In section 4, a proposal for the application of nested forward LCDs in dependence chaining so as to achieve extra arc elimination is presented. The algorithm which implements the proposed method is described in section 5. In section 6, an example is given which shows the benefits of applying the method. Concluding remarks constitute section 7.

2. DEPENDENCE ARC ELIMINATION BY DEPENDENCE CHAINING

Careful analysis of dependences in a parallel loop leads to the conclusion that, quite often, not all existing dependences need to be synchronized. Unless true dependences are affected, one can dispose of anti- and output dependences by way of variable renaming [2], as illustrated below. Within true dependences, further optimization is frequently possible.
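For instance (our illustration, not the paper's), a scalar temporary carries anti- and output dependences between iterations, and renaming it by scalar expansion removes them, leaving only the loop-independent true dependence:

    # Before renaming: every iteration writes and then reads the same y,
    # so iterations carry output and anti-dependences on y.
    n = 8
    a = list(range(n))
    b = [0] * n
    for i in range(n):
        y = a[i] + 1        # S1
        b[i] = y * 2        # S2

    # After renaming (scalar expansion): iterations no longer conflict
    # on y_; only the loop-independent true dependence S1' -> S2' remains.
    y_ = [0] * n
    for i in range(n):
        y_[i] = a[i] + 1    # S1'
        b[i] = y_[i] * 2    # S2'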


Consider the loop whose graph is given in Fig. 2a. There are two LCDs of the same distance d, D1 and D2. Suppose we ensure the correct flow of data between statement S4 in iteration 1 and statement S1 in iteration 1 + d by inserting the synchronization primitives defined in section 1. We ensure in this way that statement S4 in iteration 1 will be completed before statement S1 in iteration 1 + d commences. Since the order in which statements are executed within an iteration is sequential, it is certain that S3 in iteration 1 will be completed before S4 in the same iteration. Likewise, S2 in iteration 1 + d may commence only after S1 in the same iteration has been completed. Since S4 in iteration 1 precedes S1 in iteration 1 + d by virtue of synchronization, the conclusion is that the execution of S3 in iteration 1 must precede the execution of S2 in iteration 1 + d, even though no synchronization has been inserted to ensure this. We call this phenomenon dependence arc elimination, i.e. the LCD arc from S4 to S1 eliminates the LCD arc from S3 to S2. We also say that one dependence covers another, which is then called redundant. One method of analyzing loop-carried dependences is to sketch the controlled path graph (CPG) [7] of the loop. The CPG consists of columns representing iterations. Each column contains nodes representing occurrences of statements within the iteration. Besides the columns, the CPG contains two types of arcs: machine arcs, which represent the flow of control inside an iteration, and synchronization arcs, representing the LCDs to synchronize. For the loop of Fig. 2a, the corresponding CPG is presented in Fig. 2b, where the vertical lines represent machine arcs and the lines between columns correspond to synchronization arcs. It has been proved in [8] that for a loop with constant dependence distances of which the maximum is dmax, only dmax + 1 columns of the CPG (representing the first dmax + 1 iterations) are needed for the CPG to be representative of inter-iterational data flow throughout loop execution. Consider Fig. 2 again. It has been shown above that the D1 arc eliminates the D2 arc. Referring to Fig. 2b, we can explain the elimination as follows: the synchronization arc from S3 in iteration 1 to S2 in iteration d + 1 is actually not needed, since there exists a path consisting of machine and synchronization arcs and performing exactly the same function, i.e. linking S3 and S2: (S3 in iteration 1) → (S4 in iteration 1) → (S1 in iteration d + 1) → (S2 in iteration d + 1).


Such a path is called a controlled path [7]. A controlled path may contain more than one synchronization arc. We call such a case dependence chaining (a number of LCDs are chained to cover another LCD).

3. DEPENDENCE ARC ELIMINATION IN A PAIR OF FORWARD DEPENDENCES

For two dependences, general rules for the coverage of one by the other have been presented in [9-12]. We limit our attention to forward LCDs of the same distance. We number the loop body's statements according to their lexical order, with consecutive positive integers as subindices, and denote the source and sink of dependence Dk by Sx and Sy, respectively, and those of Dj by Sv and Sw. Assuming that Sv precedes the remaining nodes, we say that:

• Dj and Dk are disjoint if v < w < x < y,

• Dj and Dk are adjacent if v < w = x < y,

• Dj and Dk overlap if v < x < w < y,

• Dk is nested in Dj if v < x < y < w.

It has been proved in [12] that, given two forward dependences of the same distance, Dk and Dj, Dk covers Dj iff Dk is nested in Dj.
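These cases can be stated compactly (a sketch of ours, not code from the paper; statements are identified by their lexical indices):

    # v, w are the lexical indices of Dj's source and sink; x, y those of Dk.
    def classify(v, w, x, y):
        if v < w < x < y:
            return "disjoint"
        if v < w == x < y:
            return "adjacent"
        if v < x < w < y:
            return "overlap"
        if v < x < y < w:
            return "Dk nested in Dj"
        raise ValueError("indices violate the assumed ordering")

    def covers(Dj, Dk):
        # Per [12]: for forward LCDs of equal distance, Dk covers Dj
        # iff Dk is nested in Dj.  Dependences are (source, sink, distance).
        (v, w, dj), (x, y, dk) = Dj, Dk
        return dj == dk and v < x < y < w

    print(classify(1, 4, 2, 3))           # Dk nested in Dj
    print(covers((1, 4, 2), (2, 3, 2)))   # True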

4. THE CHAINABLE NESTED DEPENDENCES IDEA

Since a forward LCD is covered by any other forward LCD which is nested in it and has the same dependence distance, we may sometimes achieve extra dependence arc elimination by chaining nested LCDs. Consider Fig. 3. Taking the original approach to dependence chaining as described in sec. 2, no arc elimination is possible.


Figure 3. Arc elimination by an alternate controlled path involving a nested dependence's arc: a) The loop graph (numbers at arcs denote dependence distances), b) The CPG.

Yet after replacing, as a link in the dependence chain, the S1 → S3 dependence by the dependence S2 → S3, which is nested in S1 → S3 and has the same distance of 1, dependence S2 → S5 becomes redundant and its arc may be eliminated. As can be seen, the idea exemplified above consists in replacing, in the CPG, the arcs corresponding to a forward LCD by the arcs corresponding to another forward LCD which is nested in the former; we call the latter chainable. A chainable nested LCD covers, directly or indirectly, at least two other LCDs: the one that it is nested in (we shall call it the original) and the one covered by the chain. Therefore, one chainable nested LCD reduces the number of necessary synchronizations in the loop by (at least) one. The gain grows with the number of LCDs covered by the chain simultaneously. Note that LCDs of distance dmax are not chainable. Note also that, since a chainable LCD is forward, its synchronization does not introduce any delay, new or additional, between iterations of the loop. Unfortunately, it is rather difficult to find chainable nested LCDs. Therefore, we propose to find, in the set of all nested LCDs, the basic subset of potentially chainable LCDs, defined as the minimum subset of potentially chainable LCDs which cover all other potentially chainable LCDs, and then to search among them for a dependence which is actually chainable.

Since dependence nesting is transitive, it is easy to prove that the basic subset defined above consists of the forward dependences between neighbouring statements from the original LCD's source to its sink. In order to find, in this subset, an actually chainable LCD, we can choose between the two following options:

Option A: We search the whole subset to find the chainable LCD which, when chained, covers the maximum number of dependences.

Option B: We begin searching the subset and stop after finding a chainable LCD which covers (when chained) more dependences than the original LCD.

Option A provides the optimal solution yet is time-consuming: it requires checking each element of the subset. Option B provides a solution which is generally non-optimal but usually requires much less computation. Obviously, other options may be proposed, e.g. concluding the search after finding the "second best" element.

5. THE ALGORITHM FOR CONSTRUCTING AN OPTIMIZED CPG

We shall now present the algorithm for constructing an optimized CPG from the loop's graph. Its first part is based on the original algorithm presented in [8]. The second part implements the optimization idea described in the previous section. We assume that the nodes in the loop's graph and in each column of the CPG have been subscripted with consecutive positive integers: S1, S2, ..., Sn. Upon identifying the maximum dependence distance in the loop, dmax, and building the dmax + 1 columns of nodes corresponding to the loop's statements, arcs must be added to this "skeleton" of the CPG. First, machine arcs are inserted. This is straightforward: for each node Sj in each column (excluding the last node in a column) an arc is placed from Sj to Sj+1. Next, synchronization arcs are added.


We analyze all LCDs in the loop, one by one. Denoting the currently analyzed LCD's source, sink and distance by Sso, Ssi and d, we place an arc from Sso in columns 1, 2, ..., dmax + 1 − d to Ssi in columns 1 + d, 2 + d, ..., dmax + 1, respectively. Now the "basic" CPG, constructed according to [8], is ready. In the following part, it will be optimized. We consecutively analyze all LCDs whose synchronization arcs are in the CPG. For each LCD, we first check whether it is already covered, i.e. whether an alternate controlled path for its synchronization arc exists in the CPG. If so, all of the LCD's synchronization arcs are removed from the CPG and its analysis is concluded. If not, we analyze it further. If it is a forward LCD for which si − so > 2 and d < dmax, we try to optimize the CPG according to the method described in sec. 4. The following procedure is performed:

1. The number of all alternate controlled paths in the current CPG, A0, is calculated; an auxiliary copy of A0 is stored as Amax.

2. Assuming that the numbers (subindices) of the LCD's source and sink are a and b, for each consecutive node Sk, a ≤ k < b:

(a) arcs are added to the CPG from Sk in columns 1, 2, ..., dmax + 1 − d to Sk+1 in columns 1 + d, 2 + d, ..., dmax + 1, respectively;

(b) the number of alternate controlled paths in the CPG (excluding the paths for the currently analyzed arcs from Sa to Sb) is calculated and denoted by Ak;

(c) Ak is compared with Amax; if Ak is greater than Amax then

• Option A: Ak is stored as Amax and k is stored as indmax;

• Option B: the arcs for which alternate controlled paths exist are removed from the CPG and the procedure is terminated, i.e. analysis begins of the next LCD;

(d) the arcs from Sk to Sk+1 are removed from the CPG.


3. (This step is performed in option A only.) If Amax > A0 then step 2(a) is repeated for k := indmax and the arcs for which alternate controlled paths exist are removed from the CPG.
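The construction of the basic CPG and the covering test used above can be condensed into the following sketch (the data layout and helper names are this sketch's assumptions, not the paper's):

    from collections import deque

    def build_cpg(n, lcds, dmax):
        """Basic CPG: nodes are (statement, column) pairs, columns
        1..dmax+1; lcds is a list of (source, sink, distance) triples."""
        arcs = set()
        for c in range(1, dmax + 2):                  # machine arcs
            for j in range(1, n):
                arcs.add(((j, c), (j + 1, c)))
        for so, si, d in lcds:                        # synchronization arcs
            for c in range(1, dmax + 2 - d):
                arcs.add(((so, c), (si, c + d)))
        return arcs

    def covered(arcs, lcd, dmax):
        """True iff every synchronization arc of lcd has an alternate
        controlled path in the CPG avoiding lcd's own arcs."""
        so, si, d = lcd
        own = {((so, c), (si, c + d)) for c in range(1, dmax + 2 - d)}
        succ = {}
        for u, v in arcs - own:
            succ.setdefault(u, []).append(v)
        def reaches(src, dst):
            seen, queue = {src}, deque([src])
            while queue:
                u = queue.popleft()
                if u == dst:
                    return True
                for v in succ.get(u, []):
                    if v not in seen:
                        seen.add(v)
                        queue.append(v)
            return False
        return all(reaches(u, v) for u, v in own)

Note that covered only reports whether every synchronization arc of the given LCD has an alternate controlled path; the algorithm then deletes the arcs found redundant.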

6. AN EXAMPLE OF OPTIMIZATION

Below we give an example of how to optimize the controlled path graph of a parallel loop as suggested in the previous sections. The loop's graph is given in Fig. 4a, with dependence distances accompanying the respective arcs (for clarity, LIDs are not shown). There are 9 LCDs in the loop; hence, between each iteration and the 3 (according to the given dependence distances) iterations succeeding it, 9 synchronizations are collectively needed. We shall follow the steps of the algorithm for constructing the optimized CPG (sec. 5). We choose option A for the optimal solution. The meaning of the variables has been explained in sec. 5. The original CPG, obtained after the addition of machine and synchronization arcs to dmax + 1 = 4 columns of nodes (9 per column), is presented in Fig. 4b. In the process of its optimization, we analyze consecutive LCDs.

• S2 → S1, d = 2.

The LCD is backward and not covered; we pass to the next LCD.

• S2 → S4, d = 2.

This LCD is covered by the previously analyzed LCD S2 → S1; e.g. for the arc (S2 in it. 1) → (S4 in it. 3) there is an alternate controlled path: (S2 in it. 1) → (S1 in it. 3) → (S2 in it. 3) → (S3 in it. 3) → (S4 in it. 3). Therefore, the arcs (S2 in it. 1) → (S4 in it. 3) and (S2 in it. 2) → (S4 in it. 4) are removed from the CPG.

• S4 → S7, d = 1.

This LCD is not covered. It is forward, si − so = 3 > 2 and d < dmax. Therefore, we perform the optimization procedure.

1. There are two alternate controlled paths in the current CPG:


- (S7 in it. 1) → (S8 in it. 1) → (S9 in it. 1) → (S6 in it. 3) is an alternate controlled path for (S7 in it. 1) → (S6 in it. 3),

- (S7 in it. 2) → (S8 in it. 2) → (S9 in it. 2) → (S6 in it. 4) is an alternate controlled path for (S7 in it. 2) → (S6 in it. 4).

Figure 4. Optimization of an example parallel loop: a) The loop graph, b) The original CPG, c) The optimized CPG, d) Eventual dependences to synchronize.

Therefore, A0 = Amax = 2.

2.

- Node S4.

(a) We add arcs (S4 in it. 1) → (S5 in it. 2), (S4 in it. 2) → (S5 in it. 3), (S4 in it. 3) → (S5 in it. 4) to the CPG.

(b) Excluding the paths for the arcs S4 → S7, the alternate controlled paths in the CPG are now:

• (S6 in it. 1) → (S4 in it. 3) → (S5 in it. 4) for the arc (S6 in it. 1) → (S5 in it. 4),

• the two paths for the arcs S7 → S6 listed in step 1.

Therefore, A4 = 1 + 2 = 3.

(c) Since A4 > Amax, the new Amax is now 3 and indmax = 4.

(d) The arcs added in step (a) are removed from the CPG, i.e. the CPG is restored to its state from before (a).

- Node S5.

(a) We add arcs (S5 in it. 1) → (S6 in it. 2), (S5 in it. 2) → (S6 in it. 3), (S5 in it. 3) → (S6 in it. 4) to the CPG.

(b) There are no alternate controlled paths in the CPG except the paths for the arcs S4 → S7 and the two paths for the arcs S7 → S6 (listed in step 1). Therefore, A5 = 2.

(c) A5 < Amax. Amax remains 3.

(d) We remove the arcs added in step (a).

- Node S6.


(a) We add arcs (S6 in it. 1) → (S7 in it. 2), (S6 in it. 2) → (S7 in it. 3), (S6 in it. 3) → (S7 in it. 4) to the CPG.

(b) Excluding the paths for the arcs S4 → S7, the alternate controlled paths in the current CPG are:

• (S8 in it. 1) → (S9 in it. 1) → (S6 in it. 3) → (S7 in it. 4) for the arc (S8 in it. 1) → (S7 in it. 4),

• (S9 in it. 1) → (S6 in it. 3) → (S7 in it. 4) → (S8 in it. 4) for the arc (S9 in it. 1) → (S8 in it. 4),

• the two paths for the arcs S7 → S6 listed in step 1.

Thus, A6 = 1 + 1 + 2 = 4.

(c) A6 > Amax. The new Amax becomes 4 and indmax becomes 6.

(d) We remove the arcs added in step (a).

3. Amax = 4, A0 = 2. Since Amax > A0 and indmax = 6, the arcs listed in step 2.Node S6.(a) are once again added to the CPG, while the arcs listed in step 2.Node S6.(b), i.e.

- (S8 in it. 1) → (S7 in it. 4),
- (S9 in it. 1) → (S8 in it. 4),
- (S7 in it. 1) → (S6 in it. 3) and (S7 in it. 2) → (S6 in it. 4),

and the arcs corresponding to the currently analyzed original LCD S4 → S7, i.e. (S4 in it. 1) → (S7 in it. 2), (S4 in it. 2) → (S7 in it. 3), (S4 in it. 3) → (S7 in it. 4), are removed from the CPG.


• S6 → S4, d = 2.

The LCD is backward and not covered; we pass to the next LCD.

• S6 → S5, d = 3.

The LCD is backward and not covered; we pass to the next LCD.

• S9 → S6, d = 2.

The LCD is backward and not covered. Execution of the algorithm ends here. The LCDs S7 → S6, of distance 2, and S8 → S7 and S9 → S8, both of distance 3, are not analyzed: their synchronization arcs were removed from the CPG earlier, having been found redundant (see step 3 above). Note that, if option B had been selected, the two latter LCDs would have been analyzed (and found to be not covered). This is because, in step 2.Node S4, the chainable nested dependence S4 → S5 (rather than dependence S6 → S7 in step 3 of option A) would have replaced the dependence S4 → S7. This would have covered dependence S6 → S5 rather than S8 → S7 and S9 → S8. This solution would thus be non-optimal (one dependence covered rather than two); still, it would improve on the original CPG. The optimized CPG is presented in Fig. 4c. In Fig. 4d, the loop's graph is shown after removal of the arcs corresponding to redundant LCDs; in addition, an arc symbolizing the chainable nested dependence found in the optimization process has been inserted. Note that the overall number of LCDs to synchronize has been reduced by 4; of this number, 2 are due to the optimization method described in this paper.
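As a cross-check (ours), the build_cpg and covered helpers sketched in section 5 can be run on the nine LCDs of Fig. 4a, as reconstructed from the walkthrough above:

    # LCDs as (source, sink, distance) triples, reconstructed from the text.
    lcds = [(2, 1, 2), (2, 4, 2), (4, 7, 1), (6, 4, 2), (6, 5, 3),
            (7, 6, 2), (8, 7, 3), (9, 8, 3), (9, 6, 2)]
    cpg = build_cpg(9, lcds, 3)            # 9 statements, dmax = 3
    print(covered(cpg, (2, 4, 2), 3))      # True:  S2 -> S4 is redundant
    print(covered(cpg, (6, 4, 2), 3))      # False: S6 -> S4 must be synchronized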

7. CONCLUSION

In this paper, a new method of optimizing parallel program loops has been proposed. It takes advantage of the idea of loop-carried dependence chaining and complements it with dependence nesting. In effect, by chaining dependences which are nested in original dependences existing in the analyzed loop, a number of redundant synchronizations may be eliminated and the compiled machine code thus shortened. The theoretical considerations and the description of the method are accompanied by a presentation of the corresponding algorithm and by an example of its application. Although novel, the proposed idea would not have been born without the earlier research in dependence arc elimination presented in [7-12].


It is important to note that the material contained in this paper is by no means exhaustive as regards arc elimination by chaining. Suggested areas of further research include:

• Utilization of chainable LCDs which are backward rather than forward. We have left this idea untouched since, generally, backward LCDs introduce unwanted delays between the loop's iterations. Nevertheless, if, before chaining, there has already been a delay which can by no means be reduced, and a chainable backward LCD does not increase it, the LCD may be inserted.

• Dependence chaining in nested loops.

It is hoped that solving the above and related problems will significantly advance the development of reliable parallelizing software which, as yet, does not match the rapidly growing supply of parallel architectures.

REFERENCES

1. D. J. Kuck, The Structure of Computers and Computations, Wiley, New York, 1978.
2. D. A. Padua and M. Wolfe, Advanced compiler optimizations for supercomputers, Comm. ACM 29 (1986) 1184.
3. J. R. Allen and K. Kennedy, Automatic translation of Fortran programs to vector form, ACM TOPLAS 9 (1987) 491.
4. D. A. Padua, Multiprocessors: Discussion of some theoretical and practical problems, Ph.D. Thesis, University of Illinois at Urbana-Champaign (1979).
5. R. Cytron, Doacross: Beyond vectorization for multiprocessors, Proc. Int. Conf. Parallel Proc. (1986) 836.
6. M. Wolfe, Optimizing Supercompilers for Supercomputers, Pitman, London, 1989.
7. S. P. Midkiff and D. A. Padua, Compiler algorithms for synchronization, IEEE Trans. Comp. C-36 (1987) 1485.
8. S. P. Midkiff, Automatic generation of synchronization instructions for parallel processors, M.S. Thesis, University of Illinois at Urbana-Champaign (1986).
9. Z. Li, A technique for reducing data synchronization in multiprocessed loops, M.S. Thesis, University of Illinois at Urbana-Champaign (1985).
10. Z. Li and W. Abu-Sufah, A technique for reducing synchronization overhead in large scale multiprocessors, Proc. 12th Ann. Int. Symp. Comp. Architect. (1985) 284.
11. Z. Li and W. Abu-Sufah, On reducing data synchronization in multiprocessed loops, IEEE Trans. Comp. C-36 (1987) 105.
12. Z. Szczerbiński, Eliminating forward data dependences in parallel loops, submitted to Parallel Computing.