Processor self-scheduling for parallel loops in preemptive environments

Yang Xuejun, Chen Haibo, Ci Yungui, Chen Fujie and Chen Lijie
Department of Computer Science, Changsha Institute of Technology, Changsha, Hunan, P.R. China
Processor self-scheduling schemes for parallel loops can be used in a non-preemptive environment to reduce the scheduling overhead significantly. In a preemptive environment such as the Cray X-MP/4, however, these schemes can cause serious problems. In this paper, we describe the problems of processor self-scheduling schemes in a preemptive environment. A solution for resolving these problems is also presented.
1. Introduction

Parallel loops in programs provide the greatest potential source of parallelism to be exploited by multiprocessor systems. The most important parallel loops are doall loops and doacross loops. A parallel loop is called a doall loop if all its iterations are independent. If there are dependences across iterations, it is still possible to execute the iterations concurrently on different processors to speed up the execution, provided the dependences are enforced by synchronization across processors during execution [3]. Such parallel loops are called doacross loops. A parallel loop can either be recognized by a compiler or explicitly specified by a programmer. Once a loop is identified as a parallel loop, it is crucial to schedule processors properly so that the execution time of the loop is minimized. Processor scheduling for a parallel loop can be done either statically or dynamically.
Static scheduling assigns iterations evenly among processors. It is optimal when the execution time of each iteration is the same. If the execution times of iterations differ, dynamic scheduling is preferred. Such schemes assign iterations to processors dynamically at run time. The total number of iterations executed by each processor may not be equal, but the workload of each processor tends to be balanced. To reduce the system-call overhead of dynamic scheduling, processors can schedule themselves by fetch-and-adding a shared register to get a chunk of iterations to execute [1]. Such self-scheduling schemes reduce the scheduling overhead significantly by using hardware-implemented synchronization primitives. However, so far, all processor self-scheduling schemes have assumed a non-preemptive environment, i.e. once a processor is assigned an iteration, it continues to execute that iteration until completion. In a preemptive environment, a processor executing a parallel loop may be interrupted by an external interrupt at any time and never return to the loop. This can cause serious problems when processors are self-scheduled to execute a parallel loop.

In this paper, we present a processor self-scheduling scheme for parallel loops in a preemptive multiprocessor system. In Section 2, we describe the hardware features of a preemptive multiprocessor system. In Section 3, processor self-scheduling for parallel loops in preemptive environments is discussed. In Section 4, we present hardware to support our processor scheduling scheme. In the final section, performance and further work are discussed.
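As a rough illustration of this mechanism (ours, not code from the paper), the following C sketch models chunked self-scheduling with a C11 atomic fetch-and-add standing in for the hardware fetch&add on a shared register; N, CHUNK and all identifiers are assumptions made for the example.

    #include <stdatomic.h>

    #define N     1000   /* total number of iterations (assumed)           */
    #define CHUNK 8      /* iterations claimed per fetch-and-add (assumed) */

    static atomic_int next_index = 1;           /* shared iteration counter */

    static void iteration(int i) { (void)i; }   /* stand-in for the loop body */

    /* Each processor repeatedly claims a chunk of iterations and executes it;
       one atomic operation is paid per chunk rather than per iteration. */
    static void self_schedule(void)
    {
        for (;;) {
            int lo = atomic_fetch_add(&next_index, CHUNK);
            if (lo > N)
                break;
            int hi = lo + CHUNK - 1;
            if (hi > N)
                hi = N;
            for (int i = lo; i <= hi; i++)
                iteration(i);
        }
    }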
2. Hardware features
In this section, we describe the hardware features of a typical preemptive multiprocessor system. The discussion in the following sections is based on this description.

The most important concept for multiprocessor systems is multitasking, which is defined as structuring a program into several tasks that can be executed concurrently on different processors. A multiprocessor system should provide mechanisms for the parallel execution of a multitasking program. The multiprocessor system we are considering is a Cray X-MP/4-like system [5]. The system contains four processors that share a common global memory. Each CPU has a local set of general-purpose address and scalar registers, a set of vector registers and a scratch-pad set of registers. Multitasking programs are also assigned a set, or cluster, of shared registers for very fast communication. This set consists of eight groups of four semaphore bits, eight shared address registers, and eight shared scalar registers. The semaphore bits have atomic test-and-set, unconditional set, and clear instructions defined on them. These instructions can either operate on semaphore bits directly, or do so indirectly with local address registers as index registers. The format of a direct semaphore instruction is as follows:

    operation SG SB,

where SG and SB are the group number and bit number of a semaphore bit, respectively. Obviously 0 ≤ SG ≤ 7 and 0 ≤ SB ≤ 3.
3. Processor self-scheduling for parallel loops in preemptive environments

For simplicity of discussion, we will only consider singly nested parallel loops. When a parallel loop is multitasked, it is assigned a cluster of shared registers for communication. Four logical CPUs are assigned to a program to execute its parallel loops; this can be done statically at load time and specified on the job control card. A compiler determines whether a parallel loop should be multitasked and how many tasks will be created by trading off the theoretical speedup against the incurred execution overhead, and issues a call Tfork(np) (2 ≤ np ≤ 4) to the operating system before the execution of a multitasked loop. This call activates the np logical CPUs assigned to the program, with the calling task's execution environment copied to each logical CPU.
3.1 Code generation

In the following, we show how to use semaphore operations to synchronize parallel loops for processor self-scheduling. There are two kinds of synchronization: one is for dependences across iterations; the other is for processor self-scheduling to get the indices of a chunk of iterations. For simplicity of discussion, we assume the distances of all dependences across iterations are one, and that processors schedule themselves to get only one iteration at a time. A doall loop has no dependences across iterations, so the code for processor self-scheduling can easily be generated and embedded in the loop. The following example shows how a doall loop can be synchronized for processor self-scheduling:

Example 1.
      Tfork(np)
/* I is a shared variable; J is a local variable */
/* initially, I = 1; SM(SG, SB) = "clear" */
10    CONTINUE
      test&set SG SB
      J = I
      I = J + 1
      clear SG SB
      IF (J .GT. N) GO TO 20
      loop body
      GO TO 10
20
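For readers more familiar with threads than with the Cray semaphore bits, the following C sketch (ours, not the paper's code) mirrors Example 1: a pthread mutex plays the role of the semaphore bit SM(SG, SB) guarding the shared index I; all names are illustrative.

    #include <pthread.h>

    #define N 1000                        /* loop bound (assumed) */

    static int I = 1;                     /* shared iteration index */
    static pthread_mutex_t sem = PTHREAD_MUTEX_INITIALIZER;  /* role of SM(SG, SB) */

    static void loop_body(int j) { (void)j; }

    static void *worker(void *arg)        /* one such thread per logical CPU */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&sem);     /* test&set SG SB */
            int j = I;                    /* J = I          */
            I = j + 1;                    /* I = J + 1      */
            pthread_mutex_unlock(&sem);   /* clear SG SB    */
            if (j > N)                    /* IF (J .GT. N) GO TO 20 */
                break;
            loop_body(j);                 /* loop body; GO TO 10    */
        }
        return NULL;
    }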
For a doacross loop forking np tasks (2 ≤ np ≤ 4), each dependence across iterations is assigned a group of semaphore bits. The dependence can be enforced by executing test&set on the semaphore bit of an iteration before the sink, and clearing the semaphore bit of the succeeding iteration after the source. Some dependences across iterations may be synchronized implicitly by a combination of the control structure of the architecture and the synchronization of other dependences across iterations [3]. Thus, not all dependences across iterations need to be synchronized explicitly.
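The following C sketch (an illustration under our own naming, not the paper's code) shows how such a distance-1 dependence can be enforced with one flag per semaphore-bit slot, assuming at most NP iterations are active at once; the flag polarity is inverted with respect to the paper's set/clear convention, but the effect is the same.

    #include <stdatomic.h>

    #define NP 4                               /* tasks forked for the loop (assumed) */

    static atomic_int ready[NP];               /* 1 = value needed by the successor is ready */

    /* Before the loop: iteration 1 has no predecessor, so its slot starts ready. */
    static void init_flags(void) { atomic_store(&ready[1 % NP], 1); }

    /* Placed before the sink of the dependence in iteration j
       (corresponds to test&set* on bit j mod NP). */
    static void await_predecessor(int j)
    {
        while (atomic_load(&ready[j % NP]) == 0)
            ;                                  /* spin until iteration j-1 releases us */
        atomic_store(&ready[j % NP], 0);       /* re-arm the slot for iteration j+NP */
    }

    /* Placed after the source of the dependence in iteration j
       (corresponds to clear* on bit (j+1) mod NP). */
    static void release_successor(int j)
    {
        atomic_store(&ready[(j + 1) % NP], 1);
    }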
3.2 Preemptive problems

With the synchronization discussed above, processors are self-scheduled to execute a parallel loop efficiently in a non-preemptive environment. In a preemptive environment, serious problems may arise.
(1) Expensive waiting

Consider a doacross loop with a lexically forward dependence across iterations:

Example 2.
      DOACROSS I = 1, N
         A(I) = B(I) + C(I)
         E(I) = D(I) + F(I)
         F(I) = A(I) + E(I)
         B(I) = F(I) + E(I - 1)
      ENDLOOP

Embedding the synchronization instructions for processor self-scheduling results in a loop like the following:

      Tfork(np)
/* I is a shared variable; J, K are local variables */
/* initially, I = 1; SM(0, 0) = "clear", SM(1, 0) = "set", SM(1, 1) = "clear",
   SM(1, 2) = "set", SM(1, 3) = "set" */
10    CONTINUE
      test&set 0 0
      J = I
      I = J + 1
      clear 0 0
      K = J + 1
      IF (J .GT. N) GO TO 20
      A(J) = B(J) + C(J)
      E(J) = D(J) + F(J)
      clear* 1 K mod 4
      F(J) = A(J) + E(J)
      test&set* 1 J mod 4
      B(J) = F(J) + E(J - 1)
      GO TO 10
20

If a processor is preempted by an external interrupt during the execution of an iteration and has not yet executed the clear* instruction, the processor executing the succeeding iteration will be blocked when it reaches the corresponding test&set* instruction. The waiting processor cannot proceed with its execution until the operating system schedules a processor to resume the interrupted iteration. This active waiting can be very expensive and significantly reduces the efficiency of the CPUs.
(2) Deadlock vibration

This problem arises when a processor is preempted before it clears the semaphore bit of a lexically backward dependence of a doacross loop. The other processors executing the loop will soon reach a point where they are waiting on semaphore bits, and a deadlock interrupt will be issued to the operating system. The operating system will then schedule processors to allow the preempted task to continue. However, the number of processors assigned to the loop may have changed by this time. If the number of processors assigned to the loop has decreased after the deadlock interrupt, another deadlock interrupt will soon take place, because the number of activated iterations is larger than the number of processors executing the loop, as shown below:

    Activated iterations:  i+1, i+2, ..., i+n-1, i+n
    Processors:            P(i+1), P(i+2), ..., P(i+n-1)
    Index register I:      i+n+1
After a processor executes an iteration, it schedules itself to get a new iteration by fetch&adding the shared register I. This means that iteration i+n will be stepped over by the processors and has no chance to clear the semaphore bit that the succeeding iteration must test. Therefore, another deadlock interrupt will result. The situation repeats until more processors are assigned to the loop or the loop is eventually executed to completion.
(3) Forsaking

If all semaphore bits in an iteration have been cleared when an external interrupt occurs, the other iterations can still proceed with their executions. The interrupted iteration will be forsaken by the processors executing the loop. This problem may arise for both doall and doacross loops.

3.3 Ordered synchronization

Expensive waiting can be avoided by ordering the loop synchronization for a lexically forward dependence (LFD) across iterations. An iteration clears the semaphore bit of its succeeding iteration for an LFD across iterations only after it has successfully executed test&set on its own semaphore bit for that dependence. Therefore, the clear* instruction should be placed after the corresponding test&set* instruction in Example 2, as follows:

      Tfork(np)
/* I is a shared variable; J, K are local variables */
/* initially, I = 1; SM(0, 0) = "clear", SM(1, 0) = "set", SM(1, 1) = "clear",
   SM(1, 2) = "set", SM(1, 3) = "set" */
10    CONTINUE
      test&set 0 0
      J = I
      I = J + 1
      clear 0 0
      K = J + 1
      IF (J .GT. N) GO TO 20
      A(J) = B(J) + C(J)
      E(J) = D(J) + F(J)
      F(J) = A(J) + E(J)
      test&set* 1 J mod 4
      clear* 1 K mod 4
      B(J) = F(J) + E(J - 1)
      GO TO 10
20
3.4 Task join

Tjoin is a system call that is paired with Tfork(np) and placed after a parallel loop. Tasks must wait for their companions to complete before returning. The forsaking problem is resolved when processors are scheduled to execute the forsaken tasks.
3.5 Processor rescheduling

To avoid deadlock vibration, some operating system support is needed. A "soft" deadlock is treated as a request for processor rescheduling within a multitasking group. Deadlock vibration results from a decrease in the number of processors assigned to a doacross loop. If a processor "knows" about the decrease and "helps" the preempted processor to complete its remaining work before activating a new iteration, deadlock vibration can be avoided. The operating system is responsible for notifying processors of the loss of companions by recording the decrease in a shared address register. If a processor is preempted, the register is increased by one; if a new processor is assigned, it is decreased by one. Let D be this shared register. The operating system performs the following operations:
1. Initially (before executing the loop): D = 0.
2. When a new processor is assigned to the loop: if D > 0 then D = D - 1.
3. When a processor executing the loop is preempted: if D > 0 then D = D + 1, else D = 1.
A processor "helps" its preempted companions by issuing a rescheduling call to the operating system. This call places a task at the end of the queue of runnable tasks, ensuring that all tasks in the multitasking group get a chance to execute before the task requesting the rescheduling.
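A minimal C sketch of this bookkeeping, with D modeled as a C11 atomic (the function names are ours, not an existing OS interface): the first two functions would be performed by the operating system, and the last is the check a processor makes before activating a new iteration.

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int D;                  /* number of preempted companions; initially 0 */

    /* Operation 3: a processor executing the loop has been preempted. */
    static void on_preempt(void)  { atomic_fetch_add(&D, 1); }

    /* Operation 2: a new processor has been assigned to the loop. */
    static void on_assign(void)
    {
        int d = atomic_load(&D);
        while (d > 0 && !atomic_compare_exchange_weak(&D, &d, d - 1))
            ;                             /* decrement only while D is still positive */
    }

    /* A worker "helps": if D > 0 it issues the rescheduling call instead of
       claiming a new iteration. */
    static bool should_reschedule(void) { return atomic_load(&D) > 0; }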
4. Homocluster switching
4.1 Homocluster switching

As described in the preceding section, processor self-scheduling on the Cray X-MP in a preemptive environment can cause many problems, of which deadlock vibration is the most serious. In order to reduce the overhead, we have investigated the possibility of resolving the deadlock problem in hardware; homocluster switching (HCS) is the result of this investigation. We consider the tasks that a loop forks as a task cluster. When the statement Tfork(np) is executed, np tasks are created and a task cluster is defined. We want a cpu executing one task of the cluster to be able to be rescheduled to another task in the same cluster. The hardware model supporting HCS is described in the following. Two data structures are used to implement HCS:
(1) Task switching table

The task switching table has n entries (n equals the number of cpus), each entry corresponding to one cpu. An entry consists of two fields: one is the task identifier (TID), the other is the first address of the exchange package of the task (FEP). The table is shown in Fig. 1.
(2) Mask register

The other data structure is the mask register of the task switching table, which consists of one word of n bits. Each bit corresponds to an entry of the task switching table: if the ith bit of the mask register is set, the ith entry of the table is available. The mask register is simply a row of bits numbered 0 to n-1.
The process of HCS is presented by the following procedure:
PROCEDURE SWF
/* M is the mask register. */
/* TS is the task switching table. */
/* TS.ID is the task identifier field of TS. */
/* I, J are work variables. */
Begin
  If M = 0 Then return
  I = MAX
  J = 0
  For L = 0 To n-1 Do
    If M[L] = 1 Then
      If TS.ID[L] < I Then
        I = TS.ID[L]
        J = L
      End if
    End if
  End for
  Switch the cpu to the task TS.ID[J] according to TS.FEP[J].
End.
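A hedged C rendering of PROCEDURE SWF (the struct layout and the switch_to() stub are illustrative, not the machine's actual interface): among the entries whose mask bit is set, pick the one with the smallest TID and switch to it.

    #include <limits.h>

    #define NCPU 4

    struct tlb_entry { int tid; unsigned long fep; };  /* TID + exchange-package address */

    static struct tlb_entry ts[NCPU];    /* task switching table */
    static unsigned mask;                /* bit L set => entry L is available */

    static void switch_to(const struct tlb_entry *e) { (void)e; /* hardware task switch */ }

    static void swf(void)
    {
        if (mask == 0)
            return;                      /* no preempted companion to resume */
        int best = -1, best_tid = INT_MAX;
        for (int l = 0; l < NCPU; l++) {
            if (((mask >> l) & 1u) != 0 && ts[l].tid < best_tid) {
                best_tid = ts[l].tid;
                best = l;
            }
        }
        if (best >= 0)
            switch_to(&ts[best]);        /* resume the entry with the smallest TID */
    }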
Fig. 1. Task switching table: entry i holds the TID and the first address of the exchange package (FEP) of task i.
4.2 Implementation

In the following, we build an architecture model on which HCS is implemented at the hardware level, so that the overhead of executing a loop decreases and the performance of the system increases greatly. The model is developed from the Cray X-MP/4. Each cpu has eight vector registers, eight scalar registers and eight address registers. Unlike the Cray X-MP, A7 can be used as a loop index register. A new instruction defined in the system fetches the value from a shared address register (SB), increments it, and sends the result to both the SB register and A7. The format of the instruction is as follows:

    FET&INC SB A7
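As a small illustration of the intended semantics (our reading of the description, not a definitive specification), FET&INC can be modeled in C as an atomic increment whose result is delivered both to the shared register and to A7:

    #include <stdatomic.h>

    static atomic_int SBi;               /* the shared address register of Example 3 */

    /* FET&INC SB A7: increment the shared register atomically and return the
       result, which the hardware would place in both SB and A7. Whether A7
       receives the old or the incremented value is not entirely clear from the
       text; this sketch follows the literal reading (the incremented result). */
    static int fet_and_inc(atomic_int *sb)
    {
        return atomic_fetch_add(sb, 1) + 1;
    }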
The cpu intercommunication section comprises five clusters for interprocessor communication and synchronization. Each cluster consists of eight 24-bit shared address registers (SB), eight 64-bit shared scalar registers (TS), eight groups of four semaphore bits, an HCS device and a four-bit mask register. Unlike the Cray X-MP/4, the instruction defined in the system either operates on semaphore bits directly, or does so indirectly with local address registers as index registers. A cpu that executes a test-and-set instruction on a semaphore which is already set will hold the instruction issue until the semaphore is cleared. When every cpu within a cluster is waiting on semaphores, the hardware tests the mask register. If all bits of M are zero, a deadlock interrupt occurs in all cpus.
Otherwise, the process of homocluster switching is started. The HCS mechanism consists of a special translation lookaside buffer (TLB), shown in Fig. 2. When an external interrupt occurs, the TLB is updated automatically: the value of A7 in the interrupted cpu is put into the corresponding TID field, and the value of the XA (exchange address) register is put into the FEP field. When the instruction SWF is executed, the machine searches the TLB in parallel for the entry with the smallest value of TID and switches the cpu to that task.

Fig. 2. The form of a TLB: one TID/FEP entry for each of CPU0-CPU3, together with the mask register M.

At the compiler level, a loop can be translated into the code shown in the following example.

Example 3.
      Tfork(4)
/* SBi is a shared address register */
/* initially, SBi = 1 */
/* A7 is the loop index register */
10    CONTINUE
      SWF
      FET&INC SBi A7
      IF (A7 > N) GO TO 20
      loop body
      GO TO 10
20    Tjoin

When the execution of the loop begins, the operating system is responsible for assigning cpus to it. If the number of idle cpus is n (n ≤ np), all of them will serve the loop. The number of cpus assigned can be increased or decreased dynamically. If an external event interrupts cpu2 while it is executing the ith iteration, the value of i is put into the TID field and the bit M[2] is set. When another cpu (for example cpu1) tries to get a new iteration, it first executes the HCS procedure, so it will be switched to execute the ith iteration left unfinished on cpu2. Because the number of active iterations of the loop equals the number of cpus assigned to the job, the problems caused by preemptive environments can be avoided. The process of executing a loop on the model is shown in Fig. 3.
5. Performance and discussion

The overhead of executing a parallel loop with the processor self-scheduling scheme on the model we have built consists of the processor self-scheduling overhead and the HCS overhead. The processor self-scheduling overhead is similar to that in a non-preemptive environment [1]. The overhead of homocluster switching within the same cluster is very small, since it is supported directly by hardware.
Fig. 3. A loop executing model: CPU0-CPU3 self-schedule loop indices 1..n; when a CPU is interrupted half-way through an iteration and assigned to other jobs, the unfinished iteration is picked up by another CPU.
References

[1] Peiyi Tang and Pen-Chung Yew, Processor self-scheduling for multiple-nested parallel loops, Proc. ICPP (1986) 528-535.
[2] C.D. Polychronopoulos, D.J. Kuck and D.A. Padua, Execution of parallel loops on parallel processor systems, Proc. ICPP (1986) 519-527.
[3] S.P. Midkiff and D.A. Padua, Compiler generated synchronization for do loops, Proc. ICPP (1986) 544-551.
[4] Zhixi Fang, Pen-Chung Yew, Peiyi Tang and Chuan-Qi Zhu, Dynamic processor self-scheduling for general nested loops, Proc. ICPP (1987) 1-10.
[5] Steve Reinhardt, A data-flow approach to multitasking on CRAY X-MP computers.
[6] J.L. Larson, Multitasking on the CRAY X-MP/2 multiprocessor, Computer (July 1984) 62-73.
[7] Steve S. Chen, Multiprocessing linear algebra algorithms on the CRAY X-MP-2: Experiences with small granularity, J. Parallel and Distributed Computing 1 (1) (August 1984) 22-31.
[8] D.A. Calahan, Influence of task granularity on vector multiprocessor performance, Proc. ICPP (1984) 278-280.