An Enhancement of SIMD Machine for Executing SPMD Programs

Parallel Computing: Fundamentals, Applications and New Directions E.H. D'Hollander, G.R. Joubert, F.J. Peters and U. Trottenberg (Editors) © 1998 Elsevier Science B.V. All rights reserved.


Yoshizo Takahashi, Masahiko Sano and Tomio Inoue

Department of Information Science and Intelligent Systems, Faculty of Engineering, The University of Tokushima, 2-1 Minami Jyosanjima-cho, Tokushima 770, Japan

While the SIMD machine is valued for the structural simplicity that suits massively parallel systems, its programming is restricted to the inflexible data-parallel paradigm. In order to adapt the SIMD machine to the more flexible SPMD paradigm used on MIMD machines, a new branching mechanism is introduced.

1. INTRODUCTION

The SIMD machine is valued for its simple structure, which allows a higher degree of parallelism than MIMD machines. However, since only the data-parallel paradigm is available when programming SIMD machines, the applicability of this machine is greatly limited [1,2]. If the SPMD (single program multiple data) paradigm, which is widely used on MIMD machines, could be applied efficiently to the SIMD machine, its programmability would be much improved. To make this possible, an enhancement of the SIMD machine that introduces a new branching mechanism is proposed. This approach is well suited to application-specific parallel computers, unlike machines built from standard microprocessors.

Consider the following program:

    {
        a;
        if (cond1) {
            b;
            if (cond2) c; else d;
        } else e;
        f;
        while (cond3) g;
        h;
    }

where a, b, c, ..., h are composite statements. When this program is processed by a SIMD machine, the instructions of the compiled object program are issued from the control processor (CP) to the processing elements (PEs) in the order (a)(b)(c)(d)(e)(f)(g)(g)...(g)(h), where (a) represents the sequence of object-program instructions for statement a, and so on. The instructions (g) are issued repeatedly until no PE satisfies the loop condition. However, if all the PEs satisfy the condition cond1, issuing (e) is not needed, and if no PE satisfies cond1, issuing (b)(c)(d) is unnecessary.

A common mechanism in SIMD machines for selecting the PEs that execute an instruction is the use of tags and a mask register. When a conditional instruction is received, the PE sets a bit in the mask register to memorize the state, and the subsequent instructions broadcast from the CP are tagged so that only those PEs whose mask registers match the tag execute the instruction [3]. This mechanism works well as long as the program structure is simple. In practical programs, however, there are so many states that an unrealistic number of bits would be required for the mask register. To overcome this difficulty and to allow more flexible programming, a new branching mechanism is proposed.
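To make the limitation concrete, the following C fragment sketches how such tag/mask selection might look. It is an illustration only; the structure, function and field names are assumptions of this sketch, not taken from any particular machine. Every remembered branch outcome consumes one mask bit, which is why nested control flow quickly exhausts a mask register of realistic width.

    #include <stdio.h>

    /* Illustrative model of the conventional tag/mask selection. */
    typedef struct {
        unsigned mask;      /* one bit per remembered branch outcome */
    } PE;

    /* The CP broadcasts every instruction together with a tag; a PE
     * executes it only when the masked bits match the tag.          */
    static void pe_receive(PE *pe, const char *instr,
                           unsigned tag, unsigned care)
    {
        if ((pe->mask & care) == (tag & care))
            printf("PE executes %s\n", instr);   /* stand-in for execution */
    }

    int main(void)
    {
        PE pe = { 0x1 };                    /* this PE took the "then" branch */
        pe_receive(&pe, "(b)", 0x1, 0x1);   /* tag matches: executed          */
        pe_receive(&pe, "(e)", 0x0, 0x1);   /* tag differs: silently skipped  */
        return 0;
    }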

2. NEW BRANCHING MECHANISM

The architectures of the CP and PE enhanced with the new branching mechanism are shown in Figure 1, where the following features are introduced (the added state is summarized in a C sketch after the list).

• Instruction address bus to broadcast the content of the program counter (PC) to the PEs.
• Target address register (TAR) to store the restarting address when a PE recognizes that the succeeding instructions are not to be executed and turns into the inactive state.
• Active flag (AF) to indicate that the PE is in the active state. AF is reset while the PE is in the inactive state.
• Alternative program counter (APC) to store the alternative target address.
• The OR of the AFs of all PEs is applied to the CP as the ACT signal, indicating that at least one PE is active.
• Different handling of jump instructions depending on the jump direction.
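As an informal summary of the added state, the following C declarations sketch the registers involved; the field names and widths are assumptions made for this illustration, not the paper's notation.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-PE state added by the proposed mechanism. */
    typedef struct {
        bool     af;    /* Active Flag: set while the PE executes broadcast
                           instructions, reset while it is inactive          */
        uint32_t tar;   /* Target Address Register: the restarting address,
                           compared against the instruction address bus      */
    } PEState;

    /* CP-side state added by the proposed mechanism. */
    typedef struct {
        uint32_t pc;    /* Program Counter, broadcast on the address bus  */
        uint32_t apc;   /* Alternative Program Counter: where the CP jumps
                           when the ACT signal drops to 0                   */
    } CPState;

    /* ACT is simply the OR of all active flags. */
    bool act_signal(const PEState pe[], int n_pe)
    {
        for (int i = 0; i < n_pe; i++)
            if (pe[i].af)
                return true;
        return false;
    }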

[Figure 1. Architectures of CP and PE with the new branching mechanism. Legend: IR: Instruction Register, PC: Program Counter, APC: Alternative Program Counter, COM: Communication Port, AF: Active Flag, TAR: Target Address Register.]

As in conventional SIMD machines, the CP issues instructions to the PEs in the order generated by the compiler, except when flow-control instructions are encountered. Subroutine call/return instructions are executed solely by the CP and do not affect the PEs, whereas conditional and unconditional jump instructions affect both the CP and the PEs. When a PE

receives a jump instruction and recognizes that the succeeding instructions are not to be executed, it stores the restarting address in TAR and turns into the inactive state until TAR matches the address appearing on the instruction address bus. For a forward jump, where the value of PC is less than the operand target address, the restarting address is the operand address of the jump instruction. For a backward jump, where the value of PC is greater than or equal to the target address, the restarting address is the next address, that is, the current instruction address plus one. When the CP fetches a forward jump instruction, it stores the operand target address in APC and does not take the jump. If it fetches a backward jump, the CP stores the next address in APC and takes the jump. Whenever the ACT signal is reset, the CP jumps to the address in APC. The actions taken by the CP and the PEs on jump instructions are summarized in Table 1.

Table 1
Actions taken by CP and PE on jump instructions

    CP/PE   Jump direction   Condition                    Actions
    CP      forward          -                            APC = operand; PC = PC + 1
    CP      backward         -                            APC = PC + 1; PC = operand
    PE      forward          jump condition satisfied     TAR = operand; AF = 0; turn inactive
    PE      forward          jump condition unsatisfied   AF = 1; keep active
    PE      backward         jump condition satisfied     AF = 1; keep active
    PE      backward         jump condition unsatisfied   TAR = PC + 1; AF = 0; turn inactive

Program (1):

    1.      lda  x
    2.      cmp  y
    3.      jm   els
    4.      lda  b
    5.      sta  a
    6.      jmp  fi
    7. els: lda  d
    8.      sta  c
    9. fi:  equ

Program (2):

    1.      lda  x
    2. do:  sub  y
    3.      cmp  y
    4.      jnm  do
    5.      sta  x
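The rules of Table 1 translate almost directly into code. The following C fragment is a minimal transcription of those rules; it is an illustration written for this text, not the authors' implementation, and the types and function names are assumptions.

    #include <stdbool.h>

    typedef struct {
        bool af;     /* Active Flag                               */
        int  tar;    /* Target Address Register (restart address) */
    } PE;

    /* PE side of Table 1: reaction of one PE to a jump instruction at
     * address pc with the given operand (target) address.             */
    void pe_on_jump(PE *pe, int pc, int target, bool condition_satisfied)
    {
        if (pc < target) {                     /* forward jump               */
            if (condition_satisfied) {         /* skip the block ahead       */
                pe->tar = target;
                pe->af  = false;
            } else {
                pe->af  = true;                /* fall through, stay active  */
            }
        } else {                               /* backward jump              */
            if (condition_satisfied) {
                pe->af  = true;                /* loop again, stay active    */
            } else {
                pe->tar = pc + 1;              /* wait for the loop exit     */
                pe->af  = false;
            }
        }
    }

    /* CP side of Table 1: returns the next value of PC and records APC,
     * the address the CP jumps to as soon as the ACT signal drops to 0. */
    int cp_on_jump(int pc, int target, int *apc)
    {
        if (pc < target) {        /* forward jump: not taken by the CP */
            *apc = target;
            return pc + 1;
        } else {                  /* backward jump: taken by the CP    */
            *apc = pc + 1;
            return target;
        }
    }

The explicit branch on pc < target makes the forward/backward asymmetry of Table 1 visible: the CP takes backward jumps and defers forward ones, while each PE does the opposite bookkeeping through TAR and AF.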

Now consider programs (1) and (2) above, where instructions 3 and 6 in (1) are forward jumps and instruction 4 in (2) is a backward jump. Assume that program (1) is processed with two PEs, PE1 and PE2. The changes in the AF of each PE and in the ACT signal as each instruction is issued are shown in Table 2 for three different cases. The instruction sequence when program (2) is processed with three PEs, which exit the loop at the 1st, 2nd and 3rd iterations respectively, is shown in Table 3; barrier synchronization is thus realized. It should be noted that this mechanism works well only for compiler-generated programs; arbitrary assembler programs with entangled branches may result in confusion.

Table 2
Instruction sequence of program (1) processed with two PEs.

(a) PE1: x < y, PE2: x >= y

    instr. adrs   ACT signal   AF of PE1   AF of PE2
    1             1            1           1
    2             1            1           1
    3             1            1           1
    4             1            0           1
    5             1            0           1
    6             1            0           1
    7             1            1           0
    8             1            1           0
    9             1            1           1

(b) PE1: x < y, PE2: x < y

    instr. adrs   ACT signal   AF of PE1   AF of PE2
    1             1            1           1
    2             1            1           1
    3             1            1           1
    4             0            0           0
    7             1            1           1
    8             1            1           1
    9             1            1           1

(c) PE1: x >= y, PE2: x >= y

    instr. adrs   ACT signal   AF of PE1   AF of PE2
    1             1            1           1
    2             1            1           1
    3             1            1           1
    4             1            1           1
    5             1            1           1
    6             1            1           1
    7             0            0           0
    9             1            1           1

Table 3
Instruction sequence of program (2) processed with three PEs. PE1, PE2 and PE3 exit the loop at the 1st, 2nd and 3rd iterations respectively.

    instr. adrs   ACT signal   AF of PE1   AF of PE2   AF of PE3
    1             1            1           1           1
    2             1            1           1           1
    3             1            1           1           1
    4             1            1           1           1
    2             1            0           1           1
    3             1            0           1           1
    4             1            0           1           1
    2             1            0           0           1
    3             1            0           0           1
    4             1            0           0           1
    2             0            0           0           0
    5             1            1           1           1
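As a cross-check of the trace in Table 3, the following self-contained C program applies the rules of Table 1 to program (2) running on three PEs that exit the loop at the first, second and third iterations. Representing each PE's private data by a simple pass counter is an assumption made only for this illustration. Compiled and run, it prints the issue sequence 1 2 3 4 2 3 4 2 3 4 2 5 together with the ACT and AF values, which matches Table 3.

    #include <stdbool.h>
    #include <stdio.h>

    /* Program (2) is a five-instruction loop whose only jump is the
     * backward jump "jnm do" at address 4 with target address 2.
     * passes[i] is how many more times PE i satisfies the loop
     * condition, standing in for its private data.                 */
    int main(void)
    {
        enum { N = 3, TARGET = 2, JUMP = 4, LAST = 5 };
        bool af[N]     = { true, true, true };   /* active flags            */
        int  tar[N]    = { 0, 0, 0 };            /* target address registers */
        int  passes[N] = { 0, 1, 2 };            /* PE1 exits first, PE3 last */
        int  pc = 1, apc = 0;

        while (pc <= LAST) {
            bool act = false;

            /* An inactive PE resumes when its TAR appears on the address bus. */
            for (int i = 0; i < N; i++) {
                if (!af[i] && tar[i] == pc) af[i] = true;
                if (af[i]) act = true;
            }
            printf("issue %d   ACT=%d  AF=%d%d%d\n", pc, act, af[0], af[1], af[2]);

            if (!act) { pc = apc; continue; }        /* no PE active: CP jumps to APC */

            if (pc == JUMP) {                        /* backward jump, per Table 1    */
                for (int i = 0; i < N; i++) {
                    if (!af[i]) continue;
                    if (passes[i] > 0) {
                        passes[i] = passes[i] - 1;   /* satisfied: keep active        */
                    } else {
                        tar[i] = JUMP + 1;           /* unsatisfied: wait at address 5 */
                        af[i]  = false;
                    }
                }
                apc = JUMP + 1;                      /* CP: APC = PC + 1 ...          */
                pc  = TARGET;                        /* ... and the jump is taken     */
            } else {
                pc = pc + 1;                         /* ordinary instruction          */
            }
        }
        return 0;
    }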

3. EVALUATION AND CONCLUSION

The aim of this research is to enhance the SIMD machine so that it can handle SPMD programs. Skipping the issuing of instructions that are not executed by any PE is expected to increase the computational speed. In order to evaluate this effect quantitatively, an SPMD program for the travelling salesman problem was analyzed. In this program, a different search subtree is allocated to each PE, which searches it independently in depth-first order; the results are then compared to obtain the solution. In a conventional SIMD machine, the CP has to issue the instructions of all branches in this order even if they are not executed by any PE. In the case of eight cities, the number of branches in a subtree amounts to 561,736, which corresponds to the total computation time in terms of the average execution time per branch. With our branching mechanism, the branches that are not executed by any PE are not issued from the CP. By tracing the same program, it was found that only 12,436 branches are issued. The improvement in computational speed over the conventional SIMD machine is therefore a factor of 45. For more cities, a much greater improvement is obtainable. This result is comparable to that of an MIMD machine, on which 4,550 branches were issued.

According to Hwang [4], the MIMD machine is good at independent branching but weak at synchronization and communication, while the SIMD machine is good at synchronization and communication but weak at independent branching. He claims that the CM-5 is an MIMD machine with improved synchronization and communication. Our proposal is, in contrast, to enhance the independent branching capability of the SIMD machine.

REFERENCES

1. Sabot, G.W., The Paralation Model: Architecture-Independent Parallel Programming, MIT Press, pp. 165-168, 1988.
2. Karp, A.H., Programming for Parallelism, IEEE Computer, Vol. 20, No. 5, pp. 43-57, 1987.
3. Hillis, W.D., The Connection Machine, MIT Press, 1985.
4. Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, p. 457, 1993.