Nuclear Physics B (Proc. Suppl.) 83-84 (2000) 828-830

Progress of the APEmille Project*

APE Collaboration

F. Aglietti^a, A. Bartoloni^a, N. Cabibbo^a, M. Cosimi^a, I. D'Auria^b, P. De Riso^a, W. Errico^b, W. Friebel^c, U. Gensch^c, A. Kretzschmann^c, H. Leich^c, A. Lonardo^a, G. Magazzù^b, A. Menchikov^c, A. Michelotti^a, E. Panizzi^d, N. Paschedag^c, F. Rapuano^a, D. Rossetti^a, G. Sacco^a, F. Schifano^b, U. Schwendicke^c, H. Simma^c, K.H. Sulanke^c, M. Torelli^a, R. Tripiccione^b, P. Vicini^a, P. Wegner^c

^a INFN, Sezione di Roma I, I-00100 Roma (Italy)
^b INFN, Sezione di Pisa, I-56010 San Piero a Grado (Italy)
^c DESY, Platanenallee 6, 15738 Zeuthen (Germany)
^d Università dell'Aquila, Dip. di Ingegneria Elettrica, 67040 Monteluco di Roio, L'Aquila (Italy)

We report on the progress and status of the integration and test of APEmille, a SIMD parallel computer optimised for Lattice Gauge Theory (LGT) with a peak performance in the TeraFlops range. After extensive development and testing of the design, a first 128-node system has quickly and successfully been integrated. Tests with physics production runs are to be started soon.

*Poster presented by F. Schifano and H. Simma.

1. Introduction

The APE collaboration is now in the final integration phase of APEmille, an evolution of the APE100 SIMD architecture [1] with roughly one order of magnitude higher peak performance. The APE project is a physicist-driven effort which, like other projects in Japan [2] and the US [3], aims at the development of dedicated machines optimised for number crunching in LGT applications.

The main architectural enhancements of APEmille [4] over the previous APE generations are the availability of local addressing on each node, the support of floating-point (FP) arithmetic in double precision and in complex data format, and a host network of Linux PC's, which allows easy partitioning of the machine. These enhancements make the machine potentially useful in a broader class of applications that require massive numerical computations and enjoy some degree of locality, e.g. electromagnetic field simulations, complex systems and fluid dynamics.

In this paper, we briefly recall the APEmille architecture and describe the current status of the project.

2. Architecture

APEmille is organised as a three-dimensional array of compute nodes operating in SIMD mode. Each node is based on an FP processor (500 MFlops) and its locally addressable data memory (32 MByte). Each group of eight nodes is controlled by a separate control processor (T1000) with its own data and program memory (2 M instructions). Larger SIMD partitions are obtained by running an identical instruction stream on the corresponding control processors. In this way, machines ranging in size from 8 to 2048 nodes (4 to 1000 GFlops peak performance) can be assembled.

During program execution, all processors of a partition synchronously execute the same instructions and access their local memories using (possibly different) locally calculated addresses. Data memories of remote nodes can be accessed via the synchronous communication network.
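
The following C++ sketch (not actual APE or TAO code) mimics this execution model: a single instruction stream drives all nodes of a partition, and each node combines the broadcast addresses with its own local offset before touching its private memory. All names and sizes are invented for the illustration.

  #include <array>
  #include <cstddef>
  #include <vector>

  // Toy model of a SIMD partition: one instruction stream, eight nodes.
  struct Node {
      std::vector<double> mem;   // stands in for the 32 MByte local DRAM
      std::size_t offset = 0;    // stands in for the local address-offset register
  };

  // One SIMD step: the same global addresses go to all nodes; each node adds
  // its own offset, so the nodes read and write different local locations.
  void simd_normal_op(std::array<Node, 8>& nodes, std::size_t a, std::size_t b,
                      std::size_t c, std::size_t dst) {
      for (Node& n : nodes) {    // executed "in parallel" on the real hardware
          double r = n.mem[a + n.offset] * n.mem[b + n.offset] + n.mem[c + n.offset];
          n.mem[dst + n.offset] = r;
      }
  }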


Each group of 32 nodes has a separate PCI-based host PC, which can access memories and control registers of the APE processors via the PCI bus for program loading and data I/O.

The FP processor (J1000) supports normal operations (a × b + c) and mathematical functions (division, sqrt, log, exp) for 32- and 64-bit IEEE FP formats, as well as for single-precision complex and vector (pairs of 32-bit) operands. Arithmetic and bit-wise operations are available for the 32-bit integer data format. Operands can be converted between the various formats. The combined double pipeline of J1000 allows one arithmetic operation to be started in every clock cycle, corresponding to two, four, or eight Flop per cycle for normal operations with floating-point (single or double precision), vector or complex operands, respectively. A large register file with 512 words × 32 bits helps to achieve high filling of the arithmetic pipeline and replaces any intermediate cache between memory and registers.

Each J1000 processor addresses its own local data memory, a synchronous DRAM with 4 Mwords × 64 bit plus error correction bits, by combining the global address from T1000 with a local address-offset register. Data can be transferred to/from the memory in bursts as large as the entire register file (even across memory page boundaries), with a bandwidth of one 64-bit word per clock cycle. Memory access to remote nodes is controlled by specific bits of the address and is routed by the communication controller.

The synchronous communication network supports homogeneous communications over arbitrary distances, i.e. all nodes simultaneously access data from a corresponding remote node at a given relative distance, as well as broadcasts along lines and planes of nodes and over the full machine.

All functional units, like (local) address generation, memory and register access, arithmetic operations and flow control, are controlled by independent fields of a very long instruction word (VLIW).
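
The exact address encoding of the J1000 is not spelled out here, so the fragment below is only a hypothetical picture of the mechanism just described: a global address issued by T1000 is combined with the per-node offset register, while dedicated address bits mark an access as remote and hand it to the communication controller. The field names and bit positions are invented for illustration.

  #include <cstdint>

  // Hypothetical address fields; the real J1000/Comm1000 layout differs.
  constexpr std::uint32_t kRemoteFlag = 1u << 31;    // "remote access" bit (illustrative)
  constexpr std::uint32_t kDistMask   = 0x7u << 28;  // relative-distance field (illustrative)

  struct NodeAddressing {
      std::uint32_t local_offset;   // per-node address-offset register

      // Local effective address: global address from T1000 plus the local offset.
      std::uint32_t effective(std::uint32_t global_addr) const {
          return (global_addr & ~(kRemoteFlag | kDistMask)) + local_offset;
      }

      // Specific bits of the address decide whether the access is routed
      // to a remote node by the communication controller.
      static bool is_remote(std::uint32_t global_addr) {
          return (global_addr & kRemoteFlag) != 0;
      }
  };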

The J1000 and T1000 processors and the communication controller (Comm1000) are built into three custom-designed ASIC circuits in 0.5 μm CMOS technology operating at 66 MHz.
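
As a consistency check of the figures quoted above (no new data, only arithmetic on the published numbers), the 66 MHz clock combined with two, four or eight Flop per cycle gives the per-node peak rates, and 2048 such nodes give the machine peak:

  #include <cstdio>

  int main() {
      const double clock_mhz = 66.0;              // ASIC clock quoted in the text
      const int    flop_per_cycle[] = {2, 4, 8};  // scalar FP, vector, complex operands
      const char*  kind[]           = {"scalar", "vector", "complex"};

      for (int i = 0; i < 3; ++i)                 // 132, 264 and 528 MFlops per node;
          std::printf("%-7s : %4.0f MFlops\n",    // 528 matches the ~500 MFlops figure
                      kind[i], clock_mhz * flop_per_cycle[i]);

      // 2048 nodes at ~500 MFlops each give the ~1 TFlops (1000 GFlops) machine peak.
      std::printf("full machine: %.0f GFlops\n", 2048 * 500.0 / 1000.0);
      return 0;
  }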


The basic hardware building block of APEmille is a processing board (PB), which houses eight nodes (logically arranged at the corners of a 2 × 2 × 2 cube), one control processor, and a communication controller. Each of the processors, with its memories and bus drivers, is mounted on a separate piggy-back module. In addition, each PB contains programmable logic for the PCI interface and global synchronisation.

Four PB's together with one host PC are plugged into the same compact-PCI bus of an APE backplane and form an APEunit with 2 × 2 × 8 nodes. The backplane of an APEcrate houses 4 APEunits (corresponding to 2 × 8 × 8 nodes). Data links on the backplane support synchronous communication along the z-direction between the nodes of different PB's of an APEunit, and along the y-direction between the nodes of different APEunits of an APEcrate. The PB's of different APEcrates are connected via cables.

There is a second PCI bus for each host PC, which holds disks and network interfaces. For the (asynchronous) interconnection network of the host PC's we are using a dedicated interface card (Flink), based on high-speed low-latency serial links (NS Flat-Link) with a bandwidth of up to 132 MByte/sec [5].

The main elements of the software environment of APEmille are the operating system (OS) and the compilation chain (TAO), together with various tools for debugging, testing and simulation. The OS is distributed on the Linux host PC's. It handles program loading, system services like local and global data I/O, and the partitioning of the machine. The OS has a layered and object-oriented structure written in C++ and includes kernel drivers for high-speed access to the APE PB's and the Flink network cards. The compilation chain includes the dynamic parser for the TAOmille programming language (an extension of the TAO language used in APE100), a high-level optimisation stage, an assembler and linker, and the optimisation of the VLIW microcode.
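
To make the node hierarchy described above concrete, the sketch below maps the coordinates (x, y, z) of a node within one APEcrate (2 × 8 × 8 nodes) onto its APEunit, its PB within that unit, and its position on the 2 × 2 × 2 PB cube. The numbering convention is ours and purely illustrative.

  #include <cassert>

  // Illustrative indexing for one APEcrate (2 x 8 x 8 nodes):
  //   a PB holds a 2 x 2 x 2 cube of nodes,
  //   an APEunit stacks 4 PB's along the z-direction  -> 2 x 2 x 8 nodes,
  //   an APEcrate stacks 4 APEunits along the y-direction -> 2 x 8 x 8 nodes.
  struct NodeLocation {
      int unit;        // which APEunit in the crate (0..3)
      int pb;          // which PB inside that APEunit (0..3)
      int px, py, pz;  // position on the 2 x 2 x 2 PB cube
  };

  NodeLocation locate(int x, int y, int z) {
      assert(0 <= x && x < 2 && 0 <= y && y < 8 && 0 <= z && z < 8);
      return NodeLocation{ y / 2,        // APEunits are adjacent along y
                           z / 2,        // PB's of one unit are adjacent along z
                           x, y % 2, z % 2 };
  }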

3. Testing and Integration

In the final design phase of the ASIC circuits we have performed extensive simulations based on the VHDL model of an entire PB.


This allowed in-depth functional design verification and debugging of the individual processor units and of the interplay between them. These simulations included both hardware-oriented tests and kernels of physics application codes. Since the low-level part of the compilation chain and a single-PB version of the OS were also an integral part of our simulation procedure, these software components had to be tested and debugged to a large extent simultaneously with the chip design.

After delivery of the FP-processor prototypes in May 99 (with considerable delay with respect to the original production schedule), first test programs were successfully run on a single-PB system in just a few days. The subsequent integration of bigger systems, requiring correct behaviour of several critical parts of the machine, like global synchronisation and remote communications, was also pleasantly fast (leaving hard times for the software debugging to keep pace). Presently, at the end of July, we are executing test programs on a crate with 128 nodes, running over many hours and giving correct and stable results.

All hardware components have turned out to be very reliable. A low rate of fabrication defects (less than 1% for the FP processors) has been found, reflecting the very conservative hardware design and/or good fault coverage of the fabrication tests. Since no severe design problems were found in any of the three ASIC chips, the production for larger machines can proceed without any modification or rework of the prototype processors.

Besides dedicated test suites for the functionality of hardware and software, which test memories, fabrication faults, mathematical functions or OS system services, we are using a number of programs derived from kernels of physics codes, which include random number generators, simple gauge models and the Dirac operator. These tests are tailored to extensively qualify hardware aspects, like remote communications, reproducibility, synchronisation and equality between the nodes, and to be bit-wise verifiable against reference results from runs on APE100 or on the APEmille simulator.
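
The bit-wise check against APE100 or simulator results can be as simple as a word-by-word comparison of two output dumps. The sketch below shows the idea only; the file names and the assumption of raw 64-bit words are ours, not a description of the actual test harness.

  #include <cstddef>
  #include <cstdint>
  #include <fstream>
  #include <iostream>

  // Compare two result dumps word by word; a single differing bit is a failure.
  bool bitwise_equal(const char* apemille_dump, const char* reference_dump) {
      std::ifstream a(apemille_dump, std::ios::binary);
      std::ifstream b(reference_dump, std::ios::binary);
      std::uint64_t wa = 0, wb = 0;
      for (std::size_t i = 0; ; ++i) {
          const bool ra = static_cast<bool>(a.read(reinterpret_cast<char*>(&wa), sizeof wa));
          const bool rb = static_cast<bool>(b.read(reinterpret_cast<char*>(&wb), sizeof wb));
          if (!ra || !rb)
              return ra == rb;             // equal only if both dumps end together
          if (wa != wb) {
              std::cerr << "mismatch at word " << i << "\n";
              return false;
          }
      }
  }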

Many of the test programs are cross-assembled from APE100 (and therefore still have a very low performance) or written directly in assembly. The optimisation and debugging of the high-level part of the compilation chain and of the distributed OS are currently a major activity, needed to reach the user-friendliness and performance required for large-scale physics production. First benchmarks indicate, however, that the design goal of a high sustained performance (e.g. above 50% for the Dirac operator on distributed lattices) can be reached and should soon be exploitable by plain TAO codes.

4. Status Summary

The successful integration and testing of a first 128-node APEmille system has been completed in remarkably short time and without encountering significant problems. We therefore expect to integrate further and bigger APEmille systems within the next months. In the meantime, the required optimisation and debugging of the final software elements is in progress and should allow physics production runs to start soon.

Acknowledgments

Many people have contributed at various stages to the APEmille project. Our warm thanks go to all of them. In particular, we are indebted to S. Cabasino and P. S. Paolucci for many key ideas and contributions.

REFERENCES

1. C. Battista et al. (APE collaboration), Int. Journal of High Speed Computing 5 (1993) 637.
2. Y. Iwasaki, Nucl. Phys. B (Proc. Suppl.) 53 (1997) 1007.
3. R. Mawhinney, Nucl. Phys. B (Proc. Suppl.) 53 (1997) 1010.
4. A. Bartoloni et al. (APE collaboration), Nucl. Phys. B (Proc. Suppl.) 63 (1998) 991.
5. H. Leich et al., Proceedings of CHEP'97 (1997) 581.