Update

iWarp multicomputer with an embedded switching network

H T Kung describes the versatile architectural features of the iWarp cell, a building block for high-performance parallel systems
This brief describes the iWarp multicomputer architecture. The two components of each iWarp cell, a 20 MFLOPS and 20 MIPS computation agent and a 320 Mbyte s-1 communication agent for interfacing with other cells, can operate independently, but their location on a single chip allows for tight coupling. A variety of communication methods are supported by iWarp.

Keywords: multicomputers, switching networks, iWarp

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
© 1989 MIT Press. Reprinted, with permission, from Advanced Research in VLSI (Proceedings of the 1989 Caltech Conference)
0141-9331/90/01059-02 © 1990 Butterworth & Co. (Publishers) Ltd

iWarp is a multicomputer architecture being developed jointly by Carnegie Mellon University, USA and Intel Corp., USA. It evolved from Warp [1], and is expected to support a wide range of applications including high-speed signal, image and scientific computing.

An iWarp system is an array of identical processing nodes, called iWarp cells. Each iWarp cell is composed of the iWarp component and memory chips. As shown in Figure 1, the iWarp component contains both a powerful computation agent (20 MFLOPS and 20 MIPS) and a high-throughput (320 Mbyte s-1), low-latency (150-200 ns) communication agent for interfacing with other iWarp cells. Owing to its strong computation and communication capabilities, the iWarp component is a versatile building block for various high-performance parallel systems. iWarp systems may range from special-purpose systolic arrays to general-purpose distributed memory computers. They are able to support efficiently both fine-grain parallel and coarse-grain distributed computational models simultaneously in the same system. As in the hypercube and transputer, a general communication network is embedded in the iWarp array to support a variety of communication methods.

Figure 1. iWarp components: computation agent (20 MFLOPS and 20 MIPS), communication agent, local memory (160 Mbyte s-1) and physical buses (40 Mbyte s-1 per bus)

In an iWarp cell, the computation agent can carry out computations independently from the operations being performed by the communication agent. Therefore, the cell may perform its computation while communication through the cell to and from other cells is taking place, without the cell program being involved with the communication. While separating the control of the two agents makes programming easy, having the two agents on the same chip allows them to cooperate in a tightly coupled manner. The tight coupling is required to implement architectural features such as systolic communication, where the computation agent operates directly on data in the communication agent.

The communication agent has four input and four output physical buses for connection to other cells, as shown in Figure 1. In addition, it has two input and two output physical buses for connection to the computation agent. Each bus has a data bandwidth of 40 Mbyte s-1. An important feature of these iWarp physical buses is that each can support a number of logical buses in the same direction. The logical buses share the physical bus in a time-multiplexing manner according to a round-robin schedule on a word-level basis. The scheduler allocates cycles to active logical buses only; idle logical buses consume no physical bus bandwidth. Moreover, a flow control mechanism is implemented in hardware so that whenever a data word is transferred over a logical bus the receiver is guaranteed to have space to receive it. The logical bus architecture and its word-level flow control mechanism, made possible by VLSI, are essential for the efficient implementation of some sophisticated communication methods.

Logical buses are statically allocated to physical buses under software control. The hardware allows the total number of incoming logical buses in the communication agent of each cell to be as high as 20. For example, in a 2D array configuration, the logical buses can be evenly distributed between the four neighbours and the computation agent, as shown in Figure 2. In this case, the communication agent can be thought of as a 20 x 20 crossbar that links incoming logical buses to outgoing logical buses.

Using the logical buses, a cell can maintain many connections simultaneously, including some statically allocated connections, called 'system pathways', devoted to system uses only.
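As a rough illustration of the word-level round-robin schedule and hardware flow control described above, the following simulation allocates physical-bus cycles only to logical buses that both have a word to send and a receiver with space to accept it. This is an illustrative sketch: the class names and cycle model are invented, not iWarp's actual hardware interface.

```python
from collections import deque

class LogicalBus:
    """One logical bus multiplexed onto a physical bus (illustrative model)."""
    def __init__(self, name, receiver_space):
        self.name = name
        self.queue = deque()                  # words waiting to be sent
        self.receiver_space = receiver_space  # free words at the receiver

    def active(self):
        # A bus competes for a cycle only if it has a word to send AND the
        # hardware flow control says the receiver can accept it.
        return bool(self.queue) and self.receiver_space > 0

def schedule(buses, cycles):
    """Round-robin, word-level allocation of physical-bus cycles."""
    log = []
    i = 0
    for _ in range(cycles):
        # Skip idle buses: they consume no physical-bus bandwidth.
        for _ in range(len(buses)):
            bus = buses[i % len(buses)]
            i += 1
            if bus.active():
                word = bus.queue.popleft()
                bus.receiver_space -= 1       # word occupies receiver space
                log.append((bus.name, word))
                break
        else:
            break  # every bus idle: nothing more to transfer
    return log
```

In this model a logical bus whose receiver buffer fills up simply stops competing for cycles, which mirrors the guarantee that a transferred word always finds space at the receiver.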
Vol 14 No 1 January/February 1990
Figure 2. a, physical 2D network. b, logical buses of a cell

Figure 3. Multiple connections in a 2D iWarp array
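The crossbar view of the communication agent can be pictured as a routing table that links incoming logical buses to outgoing ones, with one entry programmed in each cell a connection crosses. The sketch below uses the cell names of Figure 3, but the direction names and slot numbers are invented for illustration, since the figure's exact layout is not reproduced here.

```python
from collections import Counter

# One crossbar entry per cell a connection passes through.  A logical bus is
# written ('direction', slot); several slots share the physical bus in one
# direction.  All direction/slot values below are illustrative.
settings = [
    # connection 1: from B's computation agent to A's computation agent
    ('B', ('compute', 0), ('west', 0)),
    ('A', ('east', 0), ('compute', 0)),
    # connection 2: passes through A, turns a corner at C, ends at D
    ('A', ('north', 0), ('south', 0)),
    ('C', ('north', 0), ('east', 0)),   # the corner turn at C
    ('D', ('west', 0), ('compute', 0)),
    # connection 3: passes straight through C and D
    ('C', ('west', 1), ('east', 1)),    # second logical bus on the C-to-D bus
    ('D', ('west', 1), ('east', 1)),
]

# Count logical buses per outgoing physical bus: the physical bus from C to D
# carries two logical buses (connections 2 and 3) at the same time.
link_load = Counter((cell, out_dir) for cell, _, (out_dir, _) in settings
                    if out_dir != 'compute')
print(link_load[('C', 'east')])   # -> 2
```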
Figure 3 shows an example of three connections through cells in a 2D array. Connection 1 is from the computation agent of cell B to that of cell A. Connection 2 passes through cell A, turns a corner at cell C, and then reaches its destination, cell D. Finally, connection 3 passes through both cells C and D. Note that two connections are maintained at the same time on the same physical bus from cell C to cell D, using two logical buses. Programs can read or write data from or to a message buffer via the side effects of special register
references. These special registers are called streaming gates, because they provide a 'gating' or 'windowing' function allowing a stream of data to pass, a word at a time, between the communication agent and the computation agent. There are two input gates and two output gates. These gates can be bound to different logical buses dynamically. A read from an input gate will consume the next word of the associated input message; correspondingly, a write to an output gate will generate the next word of the associated output message. The instruction spins when reading from an empty gate or writing to a full gate.

iWarp also provides a transparent, low-overhead mechanism for transferring data between the communication agent and the local memory. The transfer is done via spooling gates. Spooling has low overhead to avoid significant reduction of the speed of any ongoing computations. Spooling is transparent to software except for delays incurred due to either cycle stealing or local memory
access interference from other memory references.

The architecture and logic designs for iWarp were completed at the end of 1988. In the software area, the optimizing compiler developed for Warp [2, 3] has been retargeted to generate code for iWarp. Using this compiler together with an architecture simulator, the iWarp architecture and its performance on realistic programs have been evaluated [4]. A prototype iWarp system is expected to be operational by the end of 1989. Three 1.28 GFLOPS demonstration systems, each consisting of an 8 x 8 torus of iWarp cells, are scheduled to be operational at Carnegie Mellon in the middle of 1990. The same system design is extendible to a 20.48 GFLOPS, 32 x 32 torus.
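The streaming-gate semantics described earlier (a read consumes the next word of the input message, a write produces the next word of the output message, and access to an empty or full gate stalls) can be sketched with a bounded queue standing in for the bound logical bus. The Gate class and queue depth below are illustrative, and thread blocking stands in for the hardware's instruction spin.

```python
import queue
import threading

class Gate:
    """Sketch of a streaming gate bound to a logical bus, modelled as a
    bounded queue (not iWarp's actual programming interface)."""
    def __init__(self, depth=2):
        self._bus = queue.Queue(maxsize=depth)  # words in flight on the bus

    def write(self, word):
        # Output gate: emit the next word of the message.  A full gate
        # blocks here, where the real hardware would spin the instruction.
        self._bus.put(word)

    def read(self):
        # Input gate: consume the next word of the message.  An empty gate
        # blocks, again standing in for the hardware spin.
        return self._bus.get()

def demo():
    gate = Gate(depth=2)

    def sender():                   # plays the role of the sending agent
        for w in range(5):
            gate.write(w)           # streams words out, a word at a time

    t = threading.Thread(target=sender)
    t.start()
    received = [gate.read() for _ in range(5)]
    t.join()
    return received

print(demo())   # -> [0, 1, 2, 3, 4]
```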
REFERENCES

1 Annaratone, M, Arnould, E, Gross, T, Kung, H T, Lam, M, Menzilcioglu, O and Webb, J A 'The Warp computer: architecture, implementation and performance' IEEE Trans. Comput. Vol C-36 No 12 (December 1987) pp 1523-1538
2 Gross, T and Lam, M 'Compilation for a high-performance systolic array' Proc. SIGPLAN '86 Symp. on Compiler Construction ACM SIGPLAN (June 1986) pp 27-38
3 Lam, M A Systolic Array Optimizing Compiler PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA (May 1987)
4 Cohn, R, Gross, T, Lam, M and Tseng, P S 'Architecture and compiler tradeoffs for a long instruction word microprocessor' Proc. Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS III) ACM (April 1989)
Microprocessors and Microsystems