Interfaces in Computing, 1 (1983) 319 - 327
AN EXPERIMENTAL PARALLEL MICROPROCESSOR SYSTEM
C. DWAYNE ETHRIDGE, JAMES W. MOORE and VITO A. TRUJILLO
Los Alamos National Laboratory, Computing Division, Los Alamos, NM 87545 (U.S.A.)
(Received August 17, 1983)
Summary
The Computing Division at the Los Alamos National Laboratory has designed and is building a parallel microprocessor system (PμPS) to serve as a research tool for evaluating parallel processing of large-scale scientific codes. The PμPS is an experimental architecture consisting of an orthogonal array of 20 processing elements by 32 memory elements, establishing a tightly coupled shared memory (16-Mbyte) machine. The hardware incorporates very-large-scale integration components, such as 16-bit microprocessors, floating-point co-processors and dynamic random access memories. The design replaces the conventional medium-scale integration or small-scale integration circuitry with programmable array logic, logic sequencers and logic arrays. This experimental system, which is only one element of the parallel processing research being done by the Laboratory's Computing Division, will enable direct comparisons of speed-ups of algorithms for a variety of multiprocessor architectures.
1. Introduction
The objective of the parallel processing research at the Los Alamos National Laboratory is to assess the potential of parallel processing to provide increases in speed for large-scale computations in areas such as hydrodynamics, Monte Carlo techniques, reactor safety studies and nuclear waste disposal. Speed-up in parallel processing (the ratio of a computation's execution time using one processor to the computation time using p processors) is related to the code parallelism and to the number of processors. There are no processor-rich systems (16 or more processors) at this time, although computer manufacturers may be able to offer such systems by 1990; furthermore, there are no processor-rich systems in the U.S.A. with floating-point arithmetic, large memory and FORTRAN programmability. Therefore, as part of the parallel processing research at Los Alamos, we are constructing an experimental parallel microprocessor system (PμPS) to serve as a research tool for evaluating parallel processing of large codes on various multiprocessing architectures.
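As a point of reference (this relation is added here; the original text states only that speed-up depends on the code parallelism and the number of processors), the speed-up just defined can be written, and bounded by the serial fraction of a code, as

$$ S(p) \;=\; \frac{T_1}{T_p} \;\le\; \frac{1}{f + (1 - f)/p} , $$

where $T_1$ is the execution time on one processor, $T_p$ the execution time on $p$ processors and $f$ the fraction of the computation that must run serially. For example, with $f = 0.05$ and $p = 20$ processors, the attainable speed-up is bounded by about 10.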
The PμPS is an experimental computer design consisting of an orthogonal array of 20 processing elements by 32 memory elements establishing a tightly coupled shared memory (16-Mbyte) machine. The system will have reconfigurable processor-to-memory and processor-to-processor interconnections to enable direct comparisons of speed-ups of algorithms for a variety of parallel architectures. The reconfigurable design allows architectures with multiple processors sharing common memory and with networks of processors with only local memory, such as rings, trees and stars. The hardware design incorporates very-large-scale integration (VLSI) components, such as 16-bit microprocessors, floating-point co-processors and dynamic random access memories, and replaces the conventional medium-scale integration (MSI) or small-scale integration (SSI) circuitry with programmable array logic, programmable logic sequencers and programmable logic arrays. Use of these arrays greatly reduces the number of different types of logic devices in the system, which in turn reduces the engineering effort for parts procurement during construction and for stocking and maintaining parts after the computer is in operation. The design goals for the experimental PμPS include the following requirements: a processor-rich (more than eight processors) system, large memory, commercially available components, floating-point hardware, FORTRAN compilation and nominal software development. This experimental PμPS is one element of the parallel processing research within the Computing Division at the Los Alamos National Laboratory.
2. Parallel microprocessor system architecture
The PμPS is a tightly coupled shared-memory multiple-instruction multiple-data machine that supports reconfiguration between processor and memory nodes to permit experimentation with common memory architectures and with various processor network structures, such as rings, trees and stars. The proper combinations of processors and memories can be selected on the basis of the parallel processing objectives. The system consists of numerous processor and memory nodes that are directly interconnected using multiple processor-to-memory busses and multiported global memory nodes. The multiple-bus multiported-memory design functionally implements a full cross-bar switch between the processor and memory nodes. This multiple-bus architecture allows processor-to-processor communications to occur concurrently with processor execution from either local memory or global memory. Reconfiguration between processor and global memory nodes is accomplished through a memory mapping facility within each processor node. Reconfiguration will be limited to initialization time only; however, experience with the hardware and software could lead to dynamic reconfiguration if applicable. The system architecture is shown in Fig. 1.
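To make the reconfiguration mechanism concrete, the sketch below shows how a per-processor map of the 32 global memory nodes could express either a fully shared memory or a ring in which each processing element shares one memory node with its neighbour. The table layout, names and the choice of C are ours, for illustration only; this is not the PμPS initialization software.

/* Illustrative sketch only (hypothetical names): how a per-processor map
 * of the 32 global memory nodes might express either a fully shared
 * memory or a ring configuration.  It only shows the idea of
 * reconfiguration by memory mapping at initialization time.              */
#include <stdio.h>

#define NPROC 20                     /* processing elements */
#define NMEM  32                     /* global memory nodes */

/* map[p][m] = 1 if processor p's memory map grants access to node m */
static int map[NPROC][NMEM];

/* Shared-memory configuration: every processor sees every memory node. */
static void configure_shared(void)
{
    for (int p = 0; p < NPROC; p++)
        for (int m = 0; m < NMEM; m++)
            map[p][m] = 1;
}

/* Ring configuration: processor p sees node p (private) and node
 * (p + 1) % NPROC, which it shares with its right-hand neighbour.      */
static void configure_ring(void)
{
    for (int p = 0; p < NPROC; p++)
        for (int m = 0; m < NMEM; m++)
            map[p][m] = (m == p) || (m == (p + 1) % NPROC);
}

int main(void)
{
    configure_shared();              /* common-memory experiment          */
    configure_ring();                /* or a ring, chosen at init time    */
    printf("PE 3 may access memory node 4: %d\n", map[3][4]);
    return 0;
}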
Fig. 1. PμPS architecture: PE, general processing element or execution processor; GM, global memory element; LM, local memory; PC, control processor; PR, read processor; PW, write processor; FIFO, first-in first-out.
Each processing element has a local memory that consists of both programmable read-only memory (PROM) and random access memory (RAM). A control processor loads the execution code into the global memories and initiates execution through the system control bus. Systems with a large common memory can be configured with the processing elements and global memories. Systems with only local memory can be configured with the general processing elements, global memories and data transfer processors. A data transfer processor consists of a read processor, which uses one bus to read from memory and to store into a FIFO memory, and a write processor, which uses one bus to read the FIFO memory and to write to memory.
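The following sketch illustrates, in C and with names of our own choosing, what a data transfer processor does: one half fills a FIFO from a source global memory segment, the other half drains the FIFO into a destination segment. In the real unit these two halves are separate Intel 8089 I/O processors operating concurrently on two busses; the sequential loop below only models the data flow.

/* Minimal sketch of a data transfer processor: a read processor fills a
 * FIFO from a source segment over one bus, and a write processor drains
 * the FIFO into a destination segment over another bus.                  */
#include <stdio.h>

#define FIFO_DEPTH 64                       /* depth chosen for illustration */

struct fifo {
    unsigned short data[FIFO_DEPTH];        /* 16-bit words                  */
    int head, tail, count;
};

/* Read processor step: move one word from source memory into the FIFO. */
static int read_step(struct fifo *f, const unsigned short *src, int i)
{
    if (f->count == FIFO_DEPTH) return 0;   /* FIFO full: read side stalls   */
    f->data[f->tail] = src[i];
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Write processor step: move one word from the FIFO into destination memory. */
static int write_step(struct fifo *f, unsigned short *dst, int i)
{
    if (f->count == 0) return 0;            /* FIFO empty: write side stalls */
    dst[i] = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

/* Transfer n words between two global memory segments. */
static void transfer(unsigned short *dst, const unsigned short *src, int n)
{
    struct fifo f = { {0}, 0, 0, 0 };
    int r = 0, w = 0;
    while (w < n) {
        if (r < n && read_step(&f, src, r)) r++;
        if (write_step(&f, dst, w)) w++;
    }
}

int main(void)
{
    unsigned short src[256], dst[256];
    for (int i = 0; i < 256; i++) src[i] = (unsigned short)i;
    transfer(dst, src, 256);
    printf("dst[255] = %u\n", (unsigned)dst[255]);
    return 0;
}

In the network configurations (rings, trees and stars), transfers of this kind between global memory segments are the mechanism for the processor-to-processor communication described above.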
This architecture is enabled by the orthogonal packaging scheme shown in Fig. 2.
Fig. 2. Orthogonal packaging diagram.
The processing element cards plug horizontally into the processor section of the back plane. The global memory element cards plug vertically into the left and right global memory sections (low and high memory addresses) of the back plane. The left and right sections provide contiguous logical addressing. However, the left and right sections are electrically isolated, limiting each back-plane section to 16 global memory cards and one bus termination card. Therefore, the back-plane sections provide 21 microcomputer busses: one control processor bus and 20 processing element busses. The control processor bus is a Multibus-oriented bus. Each processing element bus (back-plane bus) consists of 32 lines: 24 lines carry address signals, 16 of which are multiplexed address-data signals; seven lines provide bus control and memory error identification; the remaining line is reserved, possibly for a diagnostic bus. Figure 3 details the devices used in the back-plane bus design and address-data flow.
3. Processing element implementation
Selection of the floating-point hardware for the processing element was required early in the design cycle. The design goal of using commercially available components led to the selection of the only floating-point co-processor configuration available at that time: the Intel iAPX 86/20, which combines the Intel 8086 microprocessor and the Intel 8087 co-processor. The Intel 8087 is the high performance numeric data co-processor for the processing element or execution processor. This co-processor provides the floating-point hardware required to create a processing element with significant scientific computational capabilities. The Intel 8086 16-bit high performance metal-oxide-semiconductor (HMOS) microprocessor allows the processing element to address a megabyte of global memory directly. The 74LS610 memory mapping elements used in the design extend addressing to 16 Mbytes. Three types of processor nodes are included within the system: the system control processor, the processing elements and the data transfer processors. The system control processor performs system initialization (downloading of global memory, configuration control etc.), initiation of parallel processing applications code, performance measurements and memory error processing. This processor also incorporates an interprocessor interruption facility. In addition, because the multiprocessor is strictly an execution environment, the system control processor provides communication with an external local area network that includes development work stations. Each processing element includes the Intel 8086 microprocessor and 8087 co-processor, 8 - 32 kbytes of local dedicated PROM and 4 - 16 kbytes of RAM, a real-time interrupt facility and memory mapping logic that allows 61 16-kbyte segments to be permanently and/or dynamically allocated within the system global memory. Each data transfer processor is a high speed controller specifically designed for implementing processor-to-processor communications by performing data movement between global memory segments. Two Intel 8089 HMOS input-output (I/O) processors are used to implement each data transfer processor.
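The sketch below illustrates the kind of segment mapping described above: the upper bits of the 8086's 20-bit address select an entry in a small map table whose contents supply the upper bits of a 24-bit (16-Mbyte) system address. The field widths are chosen only to be self-consistent with 16-kbyte segments and a 24-bit address space; they are an illustration of the principle, not the documented 74LS610 register layout.

/* Sketch of the processor-node memory mapping: a 16-kbyte segment of the
 * 1-Mbyte processor space is placed anywhere in the 16-Mbyte global
 * memory by a per-segment map entry.  Field widths are assumptions.      */
#include <stdint.h>
#include <stdio.h>

#define SEG_SHIFT 14                    /* 16-kbyte segments               */
#define NSEGS     64                    /* 64 segments cover 1 Mbyte       */

static uint16_t map_table[NSEGS];       /* 10-bit global frame number each */

/* Translate a 20-bit processor address into a 24-bit system address. */
static uint32_t map_address(uint32_t cpu_addr)
{
    uint32_t seg    = (cpu_addr >> SEG_SHIFT) & (NSEGS - 1);
    uint32_t offset = cpu_addr & ((1u << SEG_SHIFT) - 1);
    return ((uint32_t)map_table[seg] << SEG_SHIFT) | offset;
}

int main(void)
{
    map_table[5] = 0x2A0;               /* loaded by the control processor */
    printf("0x%06X\n", (unsigned)map_address((5u << SEG_SHIFT) + 0x123));
    return 0;
}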
Fig. 3. Back-plane bus design and address-data flow.
4. Global memory element implementation
The system global memory consists of multiple memory nodes, each having a 256- to 512-kbyte RAM array (Microbar DBR50-256) accessible from the system control processor, and a multiported-memory controller. The port for the system control processor supports downloading and memory error reporting functions. The multiported-memory controller includes interface logic for 20 ports, memory arbitration logic that implements a last-granted lowest priority algorithm, and a high speed memory access controller. Memory mapping logic within each processor node allows each memory node to be allocated as either private or public memory for each processor node. All 32 global memory cards are identical in construction. The upper five address bits (minus the most significant bit, which is used by the left and right bus controllers) are compared with the 4-bit back-plane geographical address to grant memory access on the specific card. The processor-to-memory interconnection is accomplished with memory mapping logic at each processor node, a multiported-memory controller at each global memory node and a multiple-bus interconnection back plane that allows an orthogonal arrangement of processor and global memory boards. The packaging scheme uses minimal bus lengths in providing complete physical interconnection between the processor and global memory nodes. The processor-to-memory interconnection provides fully reconfigurable processor-to-memory connections, resolves access arbitration when multiple processors are simultaneously accessing a common global memory node and supports mutual exclusion to shared memory. Control of shared memory is accomplished through an extension of the lock mechanism available with the Intel 8086, 8087 and 8089 processors.
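The card-select rule described above can be stated compactly. In the sketch below (the names are ours), bit 23 of the 24-bit system address steers the access to the left or right back-plane section and bits 22 - 19 are compared with the card's geographical slot address; this implies a 512-kbyte address window per global memory card, consistent with the stated array size.

/* Illustration of the global memory card-select rule: section bit plus
 * a 4-bit geographical slot comparison.  Names are ours.                */
#include <stdint.h>

struct card_id {
    unsigned high_section;   /* 0 = left (low addresses), 1 = right (high)   */
    unsigned slot;           /* 4-bit geographical address on the back plane */
};

/* Nonzero if this global memory card should respond to the address. */
int card_selected(uint32_t sys_addr, struct card_id id)
{
    unsigned section = (sys_addr >> 23) & 0x1;   /* most significant bit   */
    unsigned geo     = (sys_addr >> 19) & 0xF;   /* next four address bits */
    return section == id.high_section && geo == id.slot;
}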
5. Back-plane bus control
A memory cycle (non-local memory) using the system back plane is controlled by five hardware controllers. An Intel 8288 bus controller provides the microprocessor bus control. A left or a right back-plane bus controller located on the execution processor board provides the control signals for the processor interface to the back plane. A port interface controller (one for each bus; 20 per global memory card) provides the control signals for the memory port interface to the back plane and initiates a request to the last-granted lowest priority bus arbitration controller. This controller issues a grant to only one of the 20 port interface controllers. The final controller, the high speed bus controller, multiplexes the address and data bus connection to the Microbar Multibus memory board. The principal hardware components of these controllers are Monolithic Memories programmable array logic devices (PAL16L8) and programmable logic sequencers (PAL16R8).
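The last-granted lowest priority rule is a rotating-priority scheme: the port that has just been granted becomes the lowest-priority candidate for the next arbitration, so no port can be starved. A software rendering of the rule (the real arbiter is programmable logic, and the names here are ours) is:

/* Last-granted lowest priority arbitration over the 20 port interface
 * controllers of one global memory node: the scan starts just past the
 * previous winner, so that port is considered last of all.               */
#define NPORTS 20

/* requests: bit i set if port i is requesting the memory.
 * last:     port granted in the previous arbitration (0..NPORTS-1).
 * Returns the port to grant next, or -1 if no port is requesting.        */
int arbitrate(unsigned requests, int last)
{
    for (int step = 1; step <= NPORTS; step++) {
        int port = (last + step) % NPORTS;
        if (requests & (1u << port))
            return port;
    }
    return -1;
}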
6. Hardware and software parallel microprocessor system development
The equipment involved in the development of the PμPS is shown in Fig. 4. The microcomputer development system (MDS) is an Intel Series 3 MDS-800 with ICE-86A in-circuit emulation capability for the iAPX 86/20 processors. The MDS includes a Data I/O System 19 universal programmer for the local memory PROMs and for the programmable logic devices. Memory storage for the system is provided by a 40-Mbyte hard disk. System and source file entry and back-up are provided by a single double-density floppy disk drive.
Fig. 4. PμPS development support block diagram: CPU, central processing unit; DEC, Digital Equipment Corporation.
A local area network links the parallel microprocessor, the MDS, back-up storage (a VAX 780) and a software work station. The software work station will be used to convert existing FORTRAN production code to parallelized, compiled, linked and located code for the execution processor. The operating system for the software work station, and possibly for the control processor, will be Intel's iRMX 86 operating system, which provides a multiterminal and multiuser interface capability.
7. General system construction
The PμPS enclosure shown in Fig. 5 is approximately 1 m deep, 2 m high and 3 m wide. The MDS is also shown in the front view (Fig. 5) before installation of the hard disk unit. The processor card cage section for housing 27 processor cards is shown in the center with circuit breakers to the left. Seven card slots are reserved for the control processor, leaving 20 card slots for the processing elements. The rear view of the system enclosure is shown in Fig. 6. The global memory card cage section to the right has global memory cards installed with one of its circuit breaker panels above the cage.
Fig. 5. Front view of the system enclosure.
Fig. 6. Rear view of the system enclosure.
Global memory cards have been removed from the left global memory card cage section to show the system back-plane cards. Two 200 A power supplies provide 5 V to one global memory section. One 200 A power supply provides 5 V to the processor section. Voltage (and return) is applied to each card at the front edge so that no supply voltage is carried in the back-plane wiring. Each card's supply voltage is also switched by its own circuit breaker.
8. Summarizing remarks
The PμPS is an experimental computer architecture design consisting of an orthogonal array of 20 processing elements by 32 memory elements establishing a tightly coupled shared memory (16-Mbyte) machine. The principal objective is to develop an experimental PμPS to serve as a research tool for evaluating the parallel processing of production codes on various multiprocessor architectures. The hardware design incorporates VLSI components, such as 16-bit microprocessors, floating-point co-processors and dynamic random access memories, and replaces conventional MSI or SSI circuitry with programmable array logic, programmable logic sequencers and programmable logic arrays. This experimental PμPS is one element of the parallel processing research within the Computing Division at the Los Alamos National Laboratory.
Acknowledgments
The work is being performed under the auspices of the U.S. Department of Energy. Reference to a company or product name anywhere in this paper does not imply approval or recommendation of the product by the University of California or the U.S. Department of Energy to the exclusion of others that may be suitable.