Achieving low cost synchronization in a multiprocessor system





Rajiv Gupta and Michael Epstein
Philips Laboratories, North American Philips Corporation, 345 Scarborough Road, Briarcliff Manor, New York 10510, USA

The barrier is a commonly used mechanism for synchronizing processors executing in parallel. Upon reaching a barrier a processor must idle until all processors reach the barrier. In this paper, the fuzzy barrier, a mechanism that reduces the idling of processors, is presented. The idling is reduced by using software techniques to find useful instructions that can be executed by a processor while it awaits synchronization. The fuzzy barrier mechanism has been implemented both in hardware and software. The hardware implementation eliminates busy waiting at barriers, provides a mask that allows disjoint subsets of processors to synchronize simultaneously, and provides multiple barriers by associating a tag with a barrier. The software implementation of the fuzzy barrier provides a significant reduction in the synchronization overhead over the software implementation of the fixed barrier. Compiler techniques are presented for constructing barrier regions, which consist of instructions that a processor can execute while it is waiting for other processors to reach the barrier. The larger the barrier region, the more likely it is that none of the processors will have to stall. Initial observations show that barrier regions can be large and that program transformations can be used to increase their size.

1. Introduction

To efficiently exploit fine grained parallelism, the development of low cost synchronization mechanisms is essential. The barrier [15] is a commonly used mechanism for synchronizing the flow of control of parallel instruction streams.


Upon reaching a barrier the processor must wait until all processors reach the barrier. Barriers may be automatically introduced by a parallelizing compiler [4] or may be introduced explicitly by a programmer [14]. Since a processor upon reaching a barrier must idle until all other processors reach the barrier [15], no useful work is done by the processor while it is waiting to synchronize at the barrier. In this paper, the fuzzy barrier, a mechanism that reduces the idling of processors, is presented. The fuzzy barrier has been implemented both in hardware and software. The software implementation of the fuzzy barrier provides an improvement in performance over the software implementation of a fixed barrier. The software implementations are based upon the use of shared variables. These implementations entail significant run-time overhead which increases with the number of processors synchronizing at the barrier. Furthermore, these techniques are known to cause hot-spot accesses [16].

For high-speed synchronization, the run-time overhead, due to the execution of instructions to achieve synchronization, can be reduced by implementing barriers in hardware. To achieve this goal, the barriers specified in instruction streams are detected by the hardware to ascertain when a processor is ready to synchronize. All other processors are simultaneously informed of this event, and when all processors have reached the barrier, they simultaneously recognize that synchronization has taken place. A single instruction is required to set up the barrier. Once this has been done, the processors can repeatedly synchronize without executing any overhead instructions. Each processor is provided with a mask. By setting the mask appropriately a processor can synchronize with any subset of processors in the system. The implementation supports multiple barriers by associating a tag with the barrier. Thus, two processors at a barrier are able to synchronize only if their tags match.

It is desirable to reduce the idle time of processors at barriers. The compiler can do so by scheduling approximately equal amounts of work on each processor between successive barrier synchronizations.




However, if the code being scheduled contains conditional statements, processors may follow different control paths and arrive at the barrier at different times. Furthermore, the times for memory accesses may vary for different processors. The barrier mechanism should be able to tolerate drift in the speed of execution of processors if idling at the barriers is to be reduced. The fuzzy barrier mechanism provides tolerance to this drift. In this mechanism, instead of specifying a specific point at which the processors must synchronize, a range of instructions, over which the synchronization is to take place, is specified. This range of instructions will be referred to as the barrier region. Upon reaching the first instruction in the barrier region, a processor is considered ready to synchronize. However, it can continue to execute the remaining instructions in the region even if synchronization has not yet occurred. The mechanism, though implemented in hardware, relies upon the compiler to construct the barrier regions. During different synchronizations at the barrier the processors may be executing different instructions, from the specified range of instructions, at the time of synchronization; hence the name fuzzy barrier.

A flexible barrier of the kind described has several advantages. If the processors in the system are pipelined, repeated synchronization is less likely to degrade the performance of the pipeline. This is because the synchronization point is not exactly specified; thus upon reaching a barrier the processor may be able to issue instructions even if the synchronization has not taken place. Since the synchronization overhead is low, concurrentizable loops requiring barrier synchronization can be efficiently executed on multiple processors even if the size of the loop body is relatively small. The application of transformations such as cycle shrinking [12] depends heavily upon the use of barriers. The availability of an efficient barrier mechanism makes their application practical. A parallelizing compiler can employ such a mechanism to exploit instruction level parallelism using techniques similar to those used in VLIW machines [3,5].

In subsequent sections of this paper the semantics of the fuzzy barrier and its implementations are described in detail. The hardware fuzzy barrier has been implemented in a prototype multiprocessor system based upon RISC processors.

This implementation of the fuzzy barrier supports multiple barriers and allows disjoint subsets of processors to synchronize simultaneously. Experimental results based upon a software implementation of the fuzzy barrier on the Encore Multimax are presented. An example showing the compilation process to exploit such a mechanism is described. Code reorganization techniques to increase the size of barrier regions are discussed.

2. Semantics of the fuzzy barrier

Parallel instruction streams are viewed as consisting of barrier regions and non-barrier regions. In Fig. 1 the shaded regions represent the barrier regions and the unshaded regions are the non-barrier regions. Streams with no barrier regions have no barrier synchronizations, while a shaded region extending across all streams or a subset of streams indicates a barrier and forces the processors to synchronize. The barrier regions for different streams may contain varying numbers of instructions. The functionality of the fuzzy barrier can be briefly stated as follows: no processor can execute an instruction from its respective non-barrier region following the barrier region until all processors have executed the instructions in their respective non-barrier regions preceding the barrier region, i.e., ∀i, U2_i can be executed iff ∀j, U1_j has been executed, where U1_k and U2_k denote the non-barrier (unshaded in Fig. 1) regions of stream k preceding and following its barrier (shaded) region. A processor Pi is ready to synchronize if it has completed the execution of instructions from the non-barrier region U1_i preceding the barrier region. It should be noted that at this point execution of instructions from the barrier region may already have begun.

Fig. 1. Fuzzy barrier.



This is because a pipelined machine overlaps the execution of multiple instructions. Processors have synchronized at the barrier if and only if they have all completed execution of instructions in their respective non-barrier regions preceding the barrier region. A processor can enter a non-barrier region following a barrier region if and only if synchronization has occurred. Thus, if the synchronization is yet to occur when a processor Pi reaches an instruction from the non-barrier region U2_i, its execution is stalled.
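To make this condition concrete, the following minimal sketch expresses it with C11 atomics and POSIX threads. It is an illustration only, not the hardware or library implementation described later in the paper; in particular the flags are never reset, so the sketch supports a single synchronization.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 4
    static atomic_int ready[N];                 /* ready[i] = 1 once stream i has finished U1_i */

    static void u1(int i)             { printf("P%d: U1 (must precede everyone's U2)\n", i); }
    static void barrier_region(int i) { printf("P%d: barrier region (overlaps the wait)\n", i); }
    static void u2(int i)             { printf("P%d: U2 (only after all U1 are done)\n", i); }

    static void *stream(void *arg)
    {
        int i = (int)(long)arg;
        u1(i);                                  /* non-barrier region preceding the barrier    */
        atomic_store(&ready[i], 1);             /* announce: ready to synchronize              */
        barrier_region(i);                      /* useful work while the others catch up       */
        for (int j = 0; j < N; j++)             /* stall only if the region is exhausted first */
            while (!atomic_load(&ready[j]))
                ;
        u2(i);                                  /* non-barrier region following the barrier    */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], NULL, stream, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        return 0;
    }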


From the above description it is clear that when a processor reaches the first instruction of the barrier region, it does not have to stop immediately but can continue to execute even if other processors haven't reached their corresponding barrier regions. Similarly, upon reaching the last instruction in a barrier region, the processor can continue even if other processors haven't reached the end of their corresponding barrier regions. If the barrier region for a stream consists of n instructions, then at the point of synchronization the processor could have executed 0 to n instructions from the barrier region. The tolerance of the mechanism to variation in the rate at which the execution of each instruction stream progresses is limited by the number of instructions in the barrier regions. Thus, the larger the barrier regions, the less likely it is that the processors will stall. Fig. 2 illustrates the reduction in stalling obtained by using a fuzzy barrier instead of a fixed barrier. In this example P3 has fallen behind in execution, which causes stalling of processors P2 and P1 if a fixed barrier is used. No stalling occurs if a fuzzy barrier is used.

The instructions that form the barrier region can contain unconditional as well as conditional branch instructions. Thus, any sequence of instructions that are consecutive along a control path in the program can form a barrier region. The branches in the barrier region allow a barrier region to have multiple exits. Branches into a barrier region from non-barrier regions allow the barrier region to have multiple entry points. The advantage of allowing branches in barrier code is that entire control structures, such as loops and if-statements, can be included in a barrier region. Furthermore, the sequence of instructions forming the barrier region need not be physically contiguous.


Fig. 2. Fixed barrier vs fuzzy barrier: processors approaching the barrier, synchronizing at the barrier, and proceeding past the barrier.

Thus, for a loop whose iterations are separated by a barrier, the barrier region can contain code not only from the end of one iteration but also from the start of the subsequent iteration. This will be demonstrated through an example later in the paper. The destination of a branch instruction in the barrier region should either be an instruction in the same barrier region or an instruction in a non-barrier region. Thus after the branch has been executed the processor is either in the same barrier region or in a non-barrier region. The compiler should not generate code where control can be transferred directly from one logical barrier to another. This is because such branches can result in improper synchronization and deadlocks if the hardware cannot distinguish among different barriers. Consider the example in Fig. 3, where there are two barriers at which the processors must synchronize and a branch instruction can transfer control of processor P1 directly from barrier 1 to barrier 2.



Fig. 3. Invalid branch.

If this branch is taken, P1 will cross both barriers by synchronizing with P2 only once, which is when P2 reaches barrier 1. On the other hand, P2 will be deadlocked at barrier 2, waiting for a synchronization that will never take place. It should be noted that the above problem will not arise in an implementation which explicitly specifies unique identifiers for the barriers in the code. During the exploitation of parallelism implicit in a sequential program, a parallelizing compiler can exploit the semantics of the fuzzy barrier by constructing barrier regions after analyzing the dependences in a program. At the source level a programmer writing a parallel program can construct barrier regions while coding an application.

3. Multiple barriers

All of the processors in the system are not forced to synchronize every time a barrier is used. Disjoint subsets of processors can independently synchronize among themselves. A mask is provided in each processor for specifying the particular processors participating in a barrier synchronization. If it is known at compile-time that the streams would definitely be created and interact in a precisely predictable fashion, the synchronizations can be achieved using a single barrier. The masks for each of the processors can be set to either synchronize with or ignore other streams. But if the streams are created dynamically or are conditionally created, their existence is not known until run-time. In this situation multiple barriers are used. Logically distinct barriers are assigned to different subsets of streams that do not know of each other's existence.

In addition to the mask, a tag is provided to indicate the identity of a barrier. Two processors can only synchronize at a barrier if their tags match. Both the mask and the tag are set by the processors under software control. Barriers are allocated when the streams are created. The creation of the first stream does not require allocation of a barrier as there is no other stream with which it can synchronize. Subsequently, creation of every stream requires allocation of at most one barrier, which may be used by the newly created stream to synchronize with its parent. Thus, in an N processor system which allows creation of at most N streams, a maximum of N − 1 barriers is needed. Different subsets of streams must synchronize using logically different barriers. In other words, the processors must know the identity of a barrier to achieve correct synchronization. Consider the example shown in Fig. 4, where the barriers are essentially being used to merge streams. Different subsets of processors synchronize at different barriers. Note that processor P3 engages in barriers B1 and B2, processor P2 engages in barriers B2 and B3, and finally P1 engages in barrier B3. Processor P1 upon reaching barrier B3 may incorrectly synchronize with processor P2, when P2 reaches barrier B2, if the barriers are not given different identities. From this example it is clear that in an N processor system which allows creation of at most N streams, a maximum of N − 1 barriers is needed.

Fig. 4. Multiple barriers.


The streams that need to synchronize repeatedly can reuse the barrier shared by them. Disjoint subsets of a group of streams that share the same barrier can synchronize by manipulating their masks. In the above example it was assumed that the streams were created dynamically or conditionally. For the same set of streams, if it were known at compile-time that the streams would definitely be created and interact precisely in the manner specified in Fig. 4, the synchronizations could be achieved using a single barrier. By forcing all processors to synchronize each time any two processors need to synchronize, a correct schedule that uses a single barrier can be generated. However, the disadvantage of such an approach is that redundant synchronizations are introduced in the streams. Having multiple barriers eliminates redundant synchronizations and enables decisions regarding creation and destruction of streams to be made dynamically. Although static schedules have the advantages of simplicity and low run-time overhead, they lack the capability to spawn a variable number of instruction streams based upon run-time information such as the amount of computation to be performed and the availability of processors. A dynamic schedule can do a better job in allocating resources based upon the run-time information.

4. Hardware implementation

The fuzzy barrier mechanism has been implemented in a RISC based multiprocessor system. To distinguish between instructions from non-barrier and barrier regions, a single bit in each instruction is used. This will be referred to as the I-bit. The I-bit is one if the instruction is from a barrier region and zero otherwise. If there are no instructions that can be included in the barrier region, a null operation is introduced to create a barrier region. An alternative approach is to use special instructions that can be executed to indicate an entry or exit from a barrier region. If special instructions are used to indicate the boundaries of the barrier region then the null operation is no longer needed to indicate a null barrier region.


In a non-pipelined machine, a processor is ready to synchronize when it enters the barrier region. Determining whether a processor is in a barrier region or a non-barrier region can be done simply by examining the I-bit of the current instruction. In a pipelined machine, a processor will typically enter the barrier region before exiting the non-barrier region because multiple instructions are being executed simultaneously. A processor is ready to synchronize when it has completed the execution of all instructions from the non-barrier region preceding the barrier region. Thus, determining whether a processor is ready to synchronize requires examining the I-bits of all the instructions in the pipeline. For simplicity, the implementation discussed in this section is for a non-pipelined machine, although similar ideas can be applied to implement it for a pipelined machine.

Each processor is provided with an identical piece of hardware to implement the fuzzy barrier. It is assumed that all processors use a common clock and are reset simultaneously. The hardware detects when a processor enters a barrier region, and a signal indicating that the processor is ready to synchronize is broadcast to all other processors. When a processor is ready to synchronize and has received similar signals from the processors it is synchronizing with, it knows that synchronization has taken place. Since the signals are being broadcast and monitored by each processor independently, all processors simultaneously discover the occurrence of synchronization. If a processor reaches the end of the barrier region and tries to execute a non-barrier instruction before synchronization has taken place, the processor is stalled.

A processor is provided with an internal register which contains the current mask and tag for that processor. In an n processor system the mask for each processor consists of n − 1 bits, one bit corresponding to each of the other processors. By setting the mask bits a processor specifies the processors with which it wishes to synchronize. The tag identifies the current barrier for the processor, and two processors can synchronize only if their tags match. A system with an m bit tag supports 2^m − 1 logical barriers, where a combination of all zeros is used to indicate that the processor is not participating in barrier synchronization. This internal register is set under software control. The mask and tag for a processor are determined by the compiler for static scheduling and by the runtime system for dynamic scheduling.
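As an illustration only, for the four processor prototype described below (a three bit mask and a four bit tag), the register contents might be built as follows. The packing order and the helper name are assumptions made for this sketch; the paper only specifies the field widths.

    #include <stdint.h>

    /* Hypothetical packing of the per-processor barrier register: a 3-bit mask
       (one bit per other processor, 1 = synchronize with it) and a 4-bit tag
       (the logical barrier; 0 = not participating). */
    static inline uint8_t barrier_reg(unsigned mask3, unsigned tag4)
    {
        return (uint8_t)(((mask3 & 0x7u) << 4) | (tag4 & 0xFu));
    }

    /* Example: synchronize with the first two of the other processors at
       logical barrier 5:  uint8_t r = barrier_reg(0x3u, 5u);  */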




Fig. 5. A four processor system with fuzzy barrier.

Fig. 6. Per processor hardware: EU, mask and tag register, barrier bit, match box (inputs Tag_in1-Tag_in3, output Tag_out) and state machine (Match in; Stall and Want_out out).


Although the fuzzy barrier can be implemented in a system with any number of processors, the number of interconnections between the processors increases with the number of processors. This is because each processor must broadcast its tag to the other processors in the system. Fig. 5 shows a four processor system (i.e. a three bit mask) with fifteen logical barriers (i.e. a four bit tag). Thus each processor will receive three tags consisting of four bits each. In the diagram each set of output tags is shown by a unique crosshatching pattern. Each processor contains an identical copy of the hardware shown in Fig. 6. This consists of a state machine that determines the status of the barrier for the processor, an internal register that contains the current tag and mask for the processor, and some combinational logic (match logic) which determines whether the processor's tag matches the tags of the processors with which it is to synchronize. As shown, every instruction contains an I-bit that is cleared if the processor is in a non-barrier region and set if the processor is in a barrier region.

Fig. 7. Generation of the match signal (3-bit MASK, 4-bit TAG, inputs TAG_IN1-TAG_IN3).

The state machine determines the status of the barrier using the match signal and the I-bits of the instructions being executed by the processor. It also has two outputs: the signal want_out enables the processor's tag to be output to all other processors, and the signal stall is used to stall the processor if it reaches the end of a barrier region before synchronization occurs.

Fig. 8. The state machine (inputs I and MATCH; outputs WANT_OUT and STALL).




The combinational logic that generates the match signal for the state machine is shown in Fig. 7. The signal match is asserted if the tag is zero, as this indicates that the processor is not participating in barrier synchronization. In all other cases the match will be asserted only if the tags received from the non-masked processors are the same as the processor's tag.

The state machine shown in Fig. 8 is a Mealy machine. This means that the outputs can change during transitions between states. In Fig. 8 the inputs that cause a state transition are shown in plain text and the corresponding outputs are shown in italics. A processor's state machine can be in one of the following states: (i) state-0, if the processor is executing instructions from a non-barrier region; (ii) state-1, if the processor is in the barrier region and has not synchronized; (iii) state-2, if the processor is in the barrier region and has synchronized; and (iv) state-3, if synchronization has not taken place and the processor is stalled as it has completed the execution of instructions from the barrier region.

The state machine is initially in state-0. It enters state-1 when the processor enters a barrier region but is not yet able to synchronize. Upon synchronization the state changes to state-2 and stays the same till all instructions from the barrier region have been executed. However, if synchronization occurs during the execution of the last instruction in the barrier region, the state directly changes from state-1 to state-0. If all but one of the processors synchronizing at the barrier have entered the barrier region, synchronization takes place immediately when this processor enters its barrier region. Thus, the processor that enters the barrier region last moves directly from state-0 to state-2. If all of the processors synchronizing at the barrier enter their barrier regions simultaneously, they all move directly from state-0 to state-2.

If the state machine is in state-1 and the processor attempts to leave the barrier region, the processor is stalled till synchronization occurs. In this situation the state machine enters state-3 and asserts the stall signal.

The stall signal causes the basic state of the processor to remain unchanged, and the execution of the first instruction from the following non-barrier region is stalled. This continues until all synchronizing processors enter their respective barrier regions and synchronization takes place. The state machine then follows the transition to state-0 and the instruction from the non-barrier region is correctly executed. It should be noted that the state machine of a processor that wishes to synchronize asserts the want_out signal. This enables the tag to be output to the other processors. Only when the processor has successfully synchronized (state-2), or does not want to synchronize (state-0), is the want_out signal de-asserted.
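The behaviour described above can be summarized by the following software model. It is a simplified sketch that ignores pipelining and the exact Mealy output timing; the C types and function names are illustrative, not part of the hardware design.

    #include <stdbool.h>

    /* Match condition: asserted when the tag is zero (not participating) or when
       every processor selected by the mask is broadcasting the same tag as ours.
       A processor that has not asserted want_out is assumed here to appear as tag 0. */
    static bool match_signal(unsigned my_tag, unsigned mask,
                             const unsigned tag_in[], int n_others)
    {
        if (my_tag == 0)
            return true;
        for (int k = 0; k < n_others; k++)
            if (((mask >> k) & 1u) && tag_in[k] != my_tag)
                return false;
        return true;
    }

    enum { ST_NON_BARRIER = 0, ST_UNSYNCED = 1, ST_SYNCED = 2, ST_STALLED = 3 };

    typedef struct {
        int  state;
        bool want_out;   /* broadcast my tag: I wish to synchronize             */
        bool stall;      /* hold the processor at the end of the barrier region */
    } BarrierFSM;

    /* i_bit: I-bit of the instruction the processor is about to execute;
       match: output of match_signal() for this cycle.                          */
    static void fsm_step(BarrierFSM *f, bool i_bit, bool match)
    {
        switch (f->state) {
        case ST_NON_BARRIER:                    /* state-0 */
            if (i_bit)
                f->state = match ? ST_SYNCED : ST_UNSYNCED;
            break;
        case ST_UNSYNCED:                       /* state-1 */
            if (match)
                f->state = i_bit ? ST_SYNCED : ST_NON_BARRIER;
            else if (!i_bit)                    /* leaving before synchronization */
                f->state = ST_STALLED;
            break;
        case ST_SYNCED:                         /* state-2 */
            if (!i_bit)                         /* barrier region finished        */
                f->state = ST_NON_BARRIER;
            break;
        case ST_STALLED:                        /* state-3 */
            if (match)                          /* everyone has arrived; resume   */
                f->state = ST_NON_BARRIER;
            break;
        }
        f->want_out = (f->state == ST_UNSYNCED || f->state == ST_STALLED);
        f->stall    = (f->state == ST_STALLED);
    }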

5. Software implementation

The fuzzy barrier semantics can also be implemented in software. Although the cost of synchronizing using such an implementation will be significant, an improvement over a software implementation of a fixed barrier results from the reduction in processor idling. Commercial multiprocessor systems, such as Encore [17] and Sequent [11], support the barrier mechanism as part of their parallel programming library, which is available to application programmers. By supporting the fuzzy barrier in software the performance of the system may be further enhanced. A software implementation of the fuzzy barrier on a four processor Encore Multimax has been carried out. This implementation uses the primitives available in Encore's parallel programming library. The primitives used are locks that provide mutual exclusion and task queues for suspended tasks.

Since barrier synchronization can be expensive, it is essential to use it only in situations where the synchronization overhead will not offset the speedup achieved due to parallel execution. By comparing the minimum cost of synchronization, found by examining the implementation, with the amount of code executed by each processor between successive synchronizations, the speedup due to loop parallelization can be computed. However, the above estimations of the speedup are not accurate if the code executed by the processors varies widely due to the presence of conditionals.


To overcome this problem, first the variance in the execution time of the code can be estimated using the techniques developed by Sarkar [13]. Next the barrier region can be constructed and the execution time of the code in the barrier region can be estimated. If the variance in the execution time of the loop body is less than the execution time of the code in the barrier region, then the variance is not likely to reduce the speedup. This is because the fuzzy barrier reduces the idling of processors at the barriers which may be caused by the varying amount of work performed between successive synchronizations.


A software implementation of the fuzzy barrier is shown in Fig. 9. This implementation is based upon the primitives that are available in the parallel programming libraries of most commercial multiprocessors. The primitives used are locks that provide mutual exclusion and task queues for suspended tasks. The processors synchronizing at the barrier must call the functions enter_barrier and exit_barrier to synchronize. The statements that form the barrier region are included between the calls to enter_barrier and exit_barrier.

enter_barrier()
{
    lock(barrier);
    if (phase == exiting) {
        num_enter_queue++;
        enqueue(self, enter_queue);
        unlock_and_suspend(barrier, self);
        lock(barrier);
    }
    tasks_in_fuzzy++;
    if (tasks_in_fuzzy == num_tasks) {
        for (i = 1; i <= num_exit_queue; i++)
            resume(dequeue(exit_queue));
        num_exit_queue = 0;
        phase = exiting;
    }
    unlock(barrier);
}

exit_barrier()
{
    lock(barrier);
    if (phase == entering) {
        num_exit_queue++;
        enqueue(self, exit_queue);
        unlock_and_suspend(barrier, self);
        lock(barrier);
    }
    tasks_in_fuzzy--;
    if (tasks_in_fuzzy == 0) {
        for (i = 1; i <= num_enter_queue; i++)
            resume(dequeue(enter_queue));
        num_enter_queue = 0;
        phase = entering;
    }
    unlock(barrier);
}

Fig. 9. Software implementation of the fuzzy barrier.



By calling enter_barrier a processor essentially indicates that it is ready to synchronize, and by calling exit_barrier it checks whether all processors have reached the barrier. If all processors have not entered the barrier region, a processor trying to exit the barrier suspends itself on the exit_queue. The last processor entering the barrier region resumes the tasks waiting in the exit_queue.

At any given point in time the barrier is either in the entering phase or in the exiting phase. The barrier is initialized to be in the entering phase and it continues to stay in this phase till there is at least one processor that has not executed enter_barrier. After all processors have executed enter_barrier the barrier is in the exiting phase and continues to stay in this phase till there is at least one processor that has not executed exit_barrier. In the execution of loops, processors repeatedly synchronize using the same barrier. Thus, it is possible that a processor that has exited the barrier may try to re-enter the barrier before all processors have exited the barrier. If this is the case the processor suspends itself on the enter_queue and is resumed by the last processor to exit the barrier.

The cost of synchronization can be computed in terms of the enqueue and dequeue operations performed. This is because these operations account for most of the synchronization cost, as they require context saves and restores. In the best case no enqueue/dequeue operations need to be performed. However, in the worst case n − 1 enqueue and n − 1 dequeue operations may be performed. If a fixed barrier were implemented using the same primitives as the fuzzy barrier, it would always require n − 1 enqueue and n − 1 dequeue operations to perform each synchronization.
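For comparison, a fixed barrier built from the same primitives might look as follows. This is a sketch using the same hypothetical lock/queue primitives and style as Fig. 9 (count and wait_queue are illustrative names); it makes the n − 1 enqueue and n − 1 dequeue operations explicit, since every caller except the last one always suspends.

    fixed_barrier()
    {
        lock(barrier);
        count++;
        if (count < num_tasks) {                 /* not the last arrival: always suspend */
            enqueue(self, wait_queue);
            unlock_and_suspend(barrier, self);   /* one enqueue per waiting task         */
        } else {                                 /* last arrival: wake the n-1 waiters   */
            count = 0;
            for (i = 1; i < num_tasks; i++)
                resume(dequeue(wait_queue));     /* one dequeue per waiting task         */
            unlock(barrier);
        }
    }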

Fig. 10. Experimental results: synchronization cost (µsec) versus the percentage of the loop body placed in the barrier region (% fuzzy), for n = 2 and n = 3.

Thus, the fuzzy barrier can significantly reduce the synchronization overhead and will never perform worse than a fixed barrier. This is confirmed by the experimental results obtained by implementing the fuzzy barrier and the fixed barrier on a four processor Encore Multimax. For nested loops, similar to the ones in Fig. 10, the cost of synchronizing four processors was reduced from 10,000 µsec to 300 µsec as the size of the barrier region was increased from zero instructions to half of the total instructions in the loop body. The cost of the fixed barrier is approximately the same as that of the fuzzy barrier with no code in the barrier region; thus, the use of the fuzzy barrier can only improve performance. In the above experiments the number of tasks that needed to synchronize was at most four, as the system has only four processors. If more than four tasks synchronize at a barrier the synchronization cost is much higher, as all the tasks cannot be executed in parallel. However, even if the number of tasks is greater than the number of processors some performance improvement can be expected.

6. Compiler support

In this section the compiler techniques needed to effectively exploit the fuzzy barrier are demonstrated through an example. We show that the barrier region can be constructed by first examining the data dependences at the statement level. By examining the intermediate code [1] the size of the barrier region can be enlarged. The example also demonstrates that reordering of intermediate code can further increase the size of the barrier region significantly.

Consider the fragment of code shown in Fig. 11(a). The iterations of the inner loop can be executed in parallel. Thus, four processors can be used to execute a single iteration of the outer loop. A processor ready to begin a new iteration of the outer loop has to wait for the other processors to complete their previous iterations so that the data dependences between successive executions of statement S1 are enforced. This can be achieved by introducing a barrier at which the processors must synchronize at the end of each iteration. The code executed by each of the four processors is shown in Fig. 11(b).


Storage related dependences among the parallel iterations due to loop variables are eliminated by creating private copies of i and j for each subtask.

The barrier region is constructed by examining the statements along the control flow path on which the barrier lies. The statements preceding and following the barrier are candidates for inclusion in the barrier region. If we examine the dependences at the statement level it can be seen that the barrier was inserted due to the dependence present in statement S1. However, the execution of statement S2 does not require any synchronization. Thus, the entire statement S2 can be included in the barrier region. This is only possible because conditional and unconditional branches can be included in barrier regions. Since S2 is an if-statement, the time spent in the barrier regions varies from one processor to another. If a traditional barrier is used, all processors that execute fewer instructions will have to wait upon reaching the barrier. However, the use of a fuzzy barrier can eliminate this waiting. As soon as all processors enter the barrier region they synchronize, and hence the processors that execute fewer instructions do not have to wait when they reach the end of the barrier region.

The intermediate code for the program fragment in Fig. 11(b) is shown in Fig. 12(a). As shown, the code corresponding to statement S2 is part of the barrier region.

for (i = 1; i <= 4; i++) do seq
    for (j = 1; j <= 4; j++) do par
    {
        S1: a[i][j] = 2 * a[i-1][j-1] + a[i-1][j+1];
        S2: if (j == i) b[i][j] = b[i][j] + c[i][j];
    }

(a) Original Code

Task p, where 1 <= p <= 4:
    private i, j;
    j = p;
    for (i = 1; i <= 4; i++)
    {
        S1: a[i][j] = 2 * a[i-1][j-1] + a[i-1][j+1];
        S2: if (j == i) b[i][j] = b[i][j] + c[i][j];
        barrier;
    }

(b) After Parallelization

Fig. 11. Barrier synchronization.
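To make the construction concrete, the following sketch shows task p of Fig. 11(b) with the barrier expressed through the enter_barrier/exit_barrier calls of Section 5; at this stage only statement S2 forms the barrier region. The combination is an illustration, not a figure from the paper.

    /* Task p, 1 <= p <= 4 (sketch combining Fig. 11(b) with the interface of Fig. 9). */
    j = p;
    for (i = 1; i <= 4; i++) {
        S1: a[i][j] = 2 * a[i-1][j-1] + a[i-1][j+1];   /* involved in the loop carried dependence */
        enter_barrier();                                /* ready to synchronize                    */
        S2: if (j == i) b[i][j] = b[i][j] + c[i][j];    /* barrier region                          */
        exit_barrier();                                 /* suspends only if some processor is late */
    }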


By examining the code at this level additional instructions can be moved into the barrier region. For the above example, since the barrier is at the end of a loop, instructions from two consecutive loop iterations can be included in the barrier region. Fig. 12(b) shows the code after additional instructions have been included in the barrier region.

To enlarge the barrier region, the instructions that must be in the non-barrier regions are identified. These instructions will be referred to as the marked instructions. The instructions starting at the first marked instruction to the last marked instruction are then included in the non-barrier region. The remaining instructions form the barrier region. Marked instructions are those instructions which either access a value computed by another processor or compute a value that will be accessed by another processor. Barrier synchronization ensures that a processor accesses a value after it has been computed by another processor. The instructions I1 and I2, corresponding to statement S1, read/write array a and are the ones involved in the loop carried dependence. They must be included in the non-barrier region, and thus the non-barrier region extends from I1 to I2.

As mentioned earlier, it is preferable if the non-barrier regions are small and the barrier regions are large. Code reordering [8,9] can be performed to move instructions, other than the marked instructions, from the non-barrier region to the barrier region. The process of code reordering requires examining the dependences among the instructions to determine if they can be reordered in a suitable fashion. In the example presented, the instructions that compute the addresses of array elements a[i][j] and a[i-1][j+1] can be executed before any of the array elements are actually accessed and can be moved out of the non-barrier region. This leaves only instructions I1 and I2 in the non-barrier region, as shown in Fig. 12(c).

Given a piece of code that forms the non-barrier region, code reordering to remove instructions from the non-barrier region can be carried out as follows. First a directed acyclic graph (DAG) [1] representing the data dependences for the code in the non-barrier region is built. Since a DAG represents the dependences among the intermediate code statements, it can be used to find another legal ordering of instructions which results in smaller non-barrier regions.



We first schedule the instructions from the non-barrier region that are not among the marked instructions (i.e. instructions other than I1 and I2 in the example). All instructions scheduled during this phase are essentially moved into the barrier region preceding the non-barrier region. Next, the scheduling of instructions is carried out in a manner that tries to schedule the marked instructions as early as possible. This is continued till all marked instructions have been scheduled.

In the example, instructions I1 and I2 are scheduled during this phase. The instructions scheduled during this phase form the non-barrier region. After the last non-barrier instruction has been scheduled, the final phase generates an ordering for the remaining instructions. These instructions are included in the barrier region following the non-barrier region and hence are moved out of the non-barrier region.
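A compact sketch of this three-phase reordering is given below. The dependence DAG representation, the instruction records and the tiny driver are illustrative choices for this sketch and are not prescribed by the paper; in phase 2 the sketch simply falls back to any ready instruction when no marked one is ready.

    #include <stdio.h>

    #define MAXI 32

    typedef struct {
        const char *text;     /* printable form of the intermediate instruction */
        int marked;           /* 1 = must remain in the non-barrier region      */
        int npred;            /* unscheduled predecessors in the dependence DAG */
        int nsucc, succ[MAXI];
        int done;
    } Instr;

    /* Return a ready instruction for the given phase, or -1 if none qualifies. */
    static int pick(Instr *v, int n, int phase)
    {
        int fallback = -1;
        for (int i = 0; i < n; i++) {
            if (v[i].done || v[i].npred > 0) continue;
            if (phase == 1 && v[i].marked)   continue;  /* phase 1: unmarked only */
            if (phase == 2 && v[i].marked)   return i;  /* phase 2: marked first  */
            if (fallback < 0) fallback = i;
        }
        return fallback;
    }

    static void reorder(Instr *v, int n)
    {
        int marked_left = 0;
        for (int i = 0; i < n; i++) marked_left += v[i].marked;

        /* phase 1 -> preceding barrier region, phase 2 -> non-barrier region,
           phase 3 -> following barrier region                                  */
        for (int phase = 1; phase <= 3; phase++) {
            const char *region = (phase == 2) ? "non-barrier" : "barrier";
            for (;;) {
                if (phase == 2 && marked_left == 0) break;
                int i = pick(v, n, phase);
                if (i < 0) break;
                v[i].done = 1;
                if (v[i].marked) marked_left--;
                printf("%-12s %s\n", region, v[i].text);
                for (int s = 0; s < v[i].nsucc; s++)
                    v[v[i].succ[s]].npred--;            /* release successors    */
            }
        }
    }

    int main(void)
    {
        /* Tiny stand-in for the example: two address computations plus the
           marked load I1 and store I2.                                          */
        Instr v[4] = {
            { "T6  = ...address of a[i-1][j-1]", 0, 0, 1, {1}, 0 },
            { "I1: T7 = 2*[T6]",                 1, 1, 1, {3}, 0 },
            { "T17 = ...address of a[i][j]",     0, 0, 1, {3}, 0 },
            { "I2: [T17] = T7+[T13]",            1, 2, 0, {0}, 0 },
        };
        reorder(v, 4);
        return 0;
    }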

/* Let the array declarations be as follows:
   int a[4][4], b[4][4], c[4][4]; where an integer is 4 bytes.
   Let _a, _b and _c be the base addresses of the arrays. */

Non-barrier:
        i = 1
        j = p
L1:     T1 = i - 1
        T2 = j - 1
        T3 = 20 * T2
        T4 = T3 + _a
        T5 = 4 * T1
        T6 = T4 + T5            /* T6 <-- address of a[i-1][j-1] */
        T7 = 2 * [T6]           /* T7 = 2 * a[i-1][j-1] */
        T8 = i - 1
        T9 = j + 1
        T10 = 20 * T9
        T11 = T10 + _a
        T12 = 4 * T8
        T13 = T11 + T12         /* T13 <-- address of a[i-1][j+1] */
        T14 = 20 * j
        T15 = T14 + _a
        T16 = 4 * i
        T17 = T15 + T16         /* T17 <-- address of a[i][j] */
        [T17] = T7 + [T13]      /* a[i][j] = T7 + a[i-1][j+1] */
Barrier:
        if i != j go to L2
        T18 = 20 * j
        T19 = T18 + _b
        T20 = 4 * i
        T21 = T19 + T20         /* T21 <-- address of b[i][j] */
        T22 = 20 * j
        T23 = T22 + _c
        T24 = 4 * i
        T25 = T23 + T24         /* T25 <-- address of c[i][j] */
        [T21] = [T21] + [T25]   /* b[i][j] = b[i][j] + c[i][j] */
L2:     i = i + 1
        if i <= 20 go to L1
Non-barrier:

Fig. 12(a). Barrier region.


In the example presented, there are no instructions left to be scheduled during this phase. Since the barrier region in Fig. 12(c) contains all but two instructions from the loop body, a processor may fall behind almost an entire iteration without causing another processor to stall at the barrier.

In the example presented, the reordering was performed at the intermediate code level as this is more effective than reordering machine code.

After machine code has been generated the opportunities for reordering are restricted due to dependences introduced by register or other resource usage. In addition to code reordering at the intermediate code level, statement level transformations such as loop distribution [10] may be useful in increasing the size of the barrier region. Details of other situations where fuzzy barriers may be used can be found in [6].

Non-barrier:
        i = 1
        j = p
Barrier:
L1:     T1 = i - 1
        T2 = j - 1
        T3 = 20 * T2
        T4 = T3 + _a
        T5 = 4 * T1
        T6 = T4 + T5            /* T6 <-- address of a[i-1][j-1] */
Non-barrier:
I1:     T7 = 2 * [T6]           /* T7 = 2 * a[i-1][j-1] */
        T8 = i - 1
        T9 = j + 1
        T10 = 20 * T9
        T11 = T10 + _a
        T12 = 4 * T8
        T13 = T11 + T12         /* T13 <-- address of a[i-1][j+1] */
        T14 = 20 * j
        T15 = T14 + _a
        T16 = 4 * i
        T17 = T15 + T16         /* T17 <-- address of a[i][j] */
I2:     [T17] = T7 + [T13]      /* a[i][j] = T7 + a[i-1][j+1] */
Barrier:
        if i != j go to L2
        T18 = 20 * j
        T19 = T18 + _b
        T20 = 4 * i
        T21 = T19 + T20         /* T21 <-- address of b[i][j] */
        T22 = 20 * j
        T23 = T22 + _c
        T24 = 4 * i
        T25 = T23 + T24         /* T25 <-- address of c[i][j] */
        [T21] = [T21] + [T25]   /* b[i][j] = b[i][j] + c[i][j] */
L2:     i = i + 1
        if i <= 20 go to L1
Non-barrier:

Fig. 12(b). Enlarging the barrier region.



Non-barrier:
Barrier:
        i = 1
        j = p
L1:     T1 = i - 1
        T2 = j - 1
        T3 = 20 * T2
        T4 = T3 + _a
        T5 = 4 * T1
        T6 = T4 + T5            /* T6 <-- address of a[i-1][j-1] */
        T8 = i - 1
        T9 = j + 1
        T10 = 20 * T9
        T11 = T10 + _a
        T12 = 4 * T8
        T13 = T11 + T12         /* T13 <-- address of a[i-1][j+1] */
        T14 = 20 * j
        T15 = T14 + _a
        T16 = 4 * i
        T17 = T15 + T16         /* T17 <-- address of a[i][j] */
Non-barrier:
I1:     T7 = 2 * [T6]           /* T7 = 2 * a[i-1][j-1] */
I2:     [T17] = T7 + [T13]      /* a[i][j] = T7 + a[i-1][j+1] */
Barrier:
        if i != j go to L2
        T18 = 20 * j
        T19 = T18 + _b
        T20 = 4 * i
        T21 = T19 + T20         /* T21 <-- address of b[i][j] */
        T22 = 20 * j
        T23 = T22 + _c
        T24 = 4 * i
        T25 = T23 + T24         /* T25 <-- address of c[i][j] */
        [T21] = [T21] + [T25]   /* b[i][j] = b[i][j] + c[i][j] */
L2:     i = i + 1
        if i <= 20 go to L1
Non-barrier:

Fig. 12(c). Barrier region after code reordering.

The fuzzy barrier can be used to effectively exploit parallelism in loops being scheduled at run-time by compiling multiple versions of the loop and choosing the appropriate one at run-time. The example discussed in this section demonstrated how the fuzzy barrier can be used to enforce loop carried dependences. In a similar fashion it can also be used to enforce lexically forward dependences [2].

During static scheduling of loops it is not possible to distribute the iterations of a loop equally among the processors if the number of iterations is not divisible by the number of processors available. In this situation, the idling of processors can potentially be reduced by distributing the iterations appropriately using loop displacement and constructing barrier regions [7].


7. Summary

In this paper, the fuzzy barrier, a mechanism for efficient synchronization of processors in a tightly coupled multiprocessor system, was presented. The fuzzy barrier has been implemented in a RISC based multiprocessor system. The hardware implementation used in this system was outlined. The fuzzy barrier will be used for executing code in VLIW mode as well as code generated by concurrentization of loops. The mechanism has been implemented in software on an Encore multiprocessor system. Experiments based upon this implementation demonstrate that idling of processors at the barrier can be greatly reduced by using the fuzzy barrier. To reduce the idling of processors at the barrier, it is essential to construct large barrier regions. Compile-time techniques for enlarging barrier regions were presented. Initial observations show that the barrier regions can be large.

References

[1] A.V. Aho, R. Sethi and J.D. Ullman, Compilers: Principles, Techniques and Tools (Addison-Wesley, Reading, MA, 1986).
[2] R. Cytron, Doacross: Beyond vectorization for multiprocessors, Proc. Internat. Conf. Parallel Processing (August 1986) 836-844.
[3] J.R. Ellis, Bulldog: A Compiler for VLIW Architectures (MIT Press, Cambridge, MA, 1986).
[4] R. Gupta, Synchronization and communication costs of loop partitioning on shared-memory multiprocessor systems, Proc. Internat. Conf. Parallel Processing, vol. II (August 1989) 23-30.


[5] R. Gupta and M.L. Soffa, Compilation techniques for a reconfigurable LIW architecture, J. Supercomput. 3 (1989) 271-304.
[6] R. Gupta, The fuzzy barrier: A mechanism for high speed synchronization of processors, Proc. Third Internat. Conf. Architectural Support for Programming Languages and Operating Systems (April 1989) 54-64.

[7] R. Gupta, Loop displacement: A technique for efficient parallel execution of loops, Technical note TN-89-121, Philips Laboratories, Briarcliff Manor, NY, 1989.
[8] J. Hennessy and T. Gross, Postpass code optimization of pipeline constraints, ACM Trans. Programming Languages Systems 5 (3) (1983) 422-448.
[9] W.C. Hsu, Register allocation and code scheduling for load/store architectures, Ph.D. dissertation, Dept. of Computer Science, University of Wisconsin, Madison, 1987.
[10] D.J. Kuck, R.H. Kuhn, D.A. Padua, B. Leasure and M. Wolfe, Dependence graphs and compiler optimizations, 8th Ann. ACM Symp. Principles of Programming Languages (1981) 207-218.

[11] A. Osterhaug, Guide to parallel programming on Sequent computer systems, Sequent Computer Systems, Inc., Beaverton, Oregon, 1987.
[12] C.D. Polychronopoulos, Compiler optimizations for enhancing parallelism and their impact on architecture design, IEEE Trans. Comput. 37 (8) (August 1988) 991-1004.
[13] V. Sarkar, Determining average program execution times and their variance, Proc. SIGPLAN'89 Conf. Programming Language Design and Implementation (June 1989).
[14] H.S. Stone, High-Performance Computer Architecture (Addison-Wesley, Reading, MA, 1987).
[15] P. Tang and P.C. Yew, Processor self-scheduling for multiple-nested parallel loops, Proc. Internat. Conf. Parallel Processing (August 1986) 528-535.
[16] P.C. Yew, N.F. Tzeng and D.H. Lawrie, Distributing hot-spot addressing in large scale multiprocessors, IEEE Trans. Comput. C-36 (4) (April 1987).
[17] Multimax technical summary, Encore Computer Corporation, Marlboro, MA, 1987.