Parallel Computing 25 (1999) 1297–1309

www.elsevier.com/locate/parco

APEmille

R. Tripiccione*

INFN, Sezione di Pisa, I-56010 San Piero a Grado, Italy

Received 25 February 1999

Abstract

This paper describes the APEmille project, jointly carried out by INFN and DESY. It mainly focuses on the architectural features of this massively parallel compute engine, which matches the computational requirements of LGT. In the paper I briefly cover the story of the APE projects, discuss the requirements of any efficient LGT engine and finally describe the APEmille machine. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: APEmille; Massively parallel computer; Lattice gauge theories

1. Introduction

The APEmille project is one of a small number of physicist-driven efforts in Europe, Japan [1] and the US [2] aiming at the development of number crunchers optimised for LGT applications. APEmille is the third generation of a line of LGT machines whose original element dates back to the mid-eighties. The historical development of all APE projects is paralleled by rather similar stories for the machines developed at Columbia University, or in Japan, for which three (or at least two) generations can also be easily identified. Indeed, I would like to claim that the three projects, along their histories, are just different implementations (to use a term popular in computer jargon) of the same (and unchanged) basic set of ideas, as made possible at any point in time by the status of technology (and probably also by a few more mundane constraints). In this paper, I describe APEmille, trying to focus specifically on the basic set of ideas underlying all successful LGT machines.

* E-mail: [email protected]

0167-8191/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-8191(99)00052-6


This paper is organized as follows. In Section 2, I present a brief account of the development of the APE projects. Section 3 is a review of the opportunities and constraints given to a computer architecture by the requirement of being a fast LGT number cruncher. Section 4 describes how the requirements outlined in the previous section have shaped the present APE version. In Section 5, I outline my personal opinion on future perspectives for (Multi-TFlops) LGT engines. The paper ends with just a few concluding remarks.

2. Historical remarks

APE started at the end of 1984. Several LGT physicists at INFN had already been privately discussing the possibility of procuring at least one order of magnitude more computing power than was available on vector computers. Vector machines (like the CRAY-1, and later the CRAY XMP, with peak performance around 100 MFlops) were the only real option for massive computing at the time. A few groups had already started pilot projects [3] exploring the possibility of building dedicated processors. A technical breakthrough (discussed later) happened in 1984, whose significance for LGT physicists was that the goal of building an LGT number cruncher from scratch had become easy and perhaps even possible (as enthusiastically summarized by G. Parisi). The reason for such a phase transition was the availability of single-chip floating point processors, developed by a start-up US company (Weitek) to meet the requirements of the workstation and military markets. This breakthrough made it possible to pack a lot of floating-point performance into a simple home-designed electronics board. The Weitek processors had a peak performance of 8 (later 10) MFlops. At closer scrutiny, it was immediately found also that:
· Fast memory access, needed to keep the processors busy, was in principle also a bottleneck, for which no easy shortcut was available. However, the problem was less formidable than it appeared, since LGT simulations are based on compute-intensive kernels. For practically all such kernels,

R = (operations required) / (operands required)    (1)

is a rather large number (typically 4 < R < 8). A limited memory bandwidth could therefore sustain a huge floating-point performance.
· Physical lattices could be easily mapped onto several independent processors. A measurable performance loss due to parallelization overheads was expected only for a number of processors well beyond any reasonable upper limit.
· All processors were only required to execute exactly the same program, on independent copies of the same data structure. Just one program flow had to be controlled. Therefore (possibly more important) just one program had to be written.
Building on the points just discussed, the first APE architecture [4] was quickly defined. It was based on the following basic points:
· As many floating point units as possible had to be used at the same time. One elegant way to do this was found. The normal operation, that is


a · b + c,    (2)

was defined, with a, b, and c complex numbers. In this way, eight floating point operations on real numbers were needed and could be associated to eight independent units. The Weitek processors had a peak performance of 8 MFlops, so a Weitek-based complex number processor could run at 64 MFlops peak performance.
· A large set of registers was built onto the processor. The reason for this choice was that many data words had to be accessed from registers to keep the floating-point units busy. This required, with available technology, a large number of register banks. The real advantages of this choice became clear much later on.
· Since all processors had to perform the same instruction at any point in time, just one control processor, setting the pace for program execution, was needed. The 3081/E processor, developed at CERN and SLAC, was chosen [5], as it could be easily modified to adapt to the new environment.
· A not too large number of processors (16) was needed to cross the performance threshold of 1 GFlops. The processors were arranged on a one-dimensional ring. There were 16 processors and 16 memory banks. Each processor had access to its private memory bank. All processors could, however, exchange data with their right (or left) nearest-neighbour memory bank.
The development pace for APE was extremely quick: potential members of the APE group met at CERN in late 1984 for the first time. In March 1985, as prototypes of the various APE building blocks were already under test, the project was officially approved by INFN. The commissioning of a full machine took a few months (an unusually short period, although perceived as extremely long and painful by the members of the collaboration) and a 4-node machine (APetto) was available in early 1986 (followed by a large 1 GFlops machine in summer 1987). Real physics (an analysis of the glueball spectrum) was first presented at the lattice conference in 1987 [6] and the first published paper also went to press in 1987 [7]. The last paper based on APE data was published in 1991. In retrospect, the remarkable success of the project was probably due to the following factors:
· A clear focus on precisely and only the features needed for LGT simulations.
· A close cooperation of theorists and experimental physicists with long-term experience in the electronic systems used in large experiments for triggering and data acquisition.
· An early understanding that software (i.e., comfortable tools to allow every physicist to write his/her own program) was an important issue.
As early as 1988, with APE machines busy running physics programs, a new generation machine started to take shape, named APE100 [8]. The name was chosen since it was planned to increase performance by two orders of magnitude with respect to APE. The basic ideas behind the new machine were essentially the same as for the previous generation. One major technical choice made for the new machine was the use of custom-designed node processors [9]. This choice brought two major advantages:
· For the first time, it was possible to tailor the processor architecture to our needs. This process led to the definition of a computer rather different from


current orthodoxy. I will claim later that this led to a non-trivial (albeit not sufficiently advertised) step forward in processor architecture.
· One of the trade-offs in the processor architecture was a rather drastic self-limitation in the complexity of the processor, as measured by several metrics (gate count, pin count, power consumption). This made it possible to pack a very large number of processors (and therefore a large global performance) into a very small volume.
The choice to use custom processors made it possible to design systems with hundreds (and maybe thousands) of nodes. This in turn dictated a multi-dimensional grid geometry. Indeed, if we want to deploy N processors, we have to use dimensionality d, so that

L = N^(1/d)    (3)

is a fraction of the expected physical lattice size. Having chosen d = 3, a system with 8 × 8 × 8 nodes (the so-called Tower) became the work-horse APE100 machine, joined by a zoo (the Tube or the Half Tower) of somewhat smaller but closely sized objects. First lattice results from APE100 appeared in late 1992. APE100 (or APE100-derived) machines are still in heavy use in a number of installations in Europe. APE100 machines have also been used to model several physical systems other than LGT, the heaviest APE100 user beyond QCD being computational fluid dynamics [10]. As may be expected, yet another LGT machine (APEmille) is going to follow APE100. In Section 3, I present, with specific focus on APEmille [11], the requirements and constraints that shape an LGT machine.

3. Requirements for LGT engines: The APE perspective

The ultimate goal of a project to build and deploy a custom-made LGT engine is to make available a number cruncher whose price-performance ratio for the envisaged application is comfortably better than that of commercially available machines (the more ambitious goal of assembling the fastest computer on earth is probably not realistic in view of the exponential rate of growth of computer performance with time). Possibly the only way to reach the goal is a careful matching of specific requirements with specific features. In short, virtually everywhere in LGT simulation the assumption holds that you can map the physical lattice onto a regular processor mesh, without appreciable performance loss. Also, no realistic suggestion has been made so far for algorithms that need a more complicated control strategy than SIMD or SPMD. These features make parallel computing both straightforward and extremely efficient for LGT simulations, a situation not shared by many other applications. We will keep these two points as our basic assumptions. Starting from these points, we can adopt the following guidelines: we start from a physical requirement of a lattice large enough to allow accurate computations of (a bunch of) physical quantities.


Note that, in practice, this is the only quantitative requirement that comes from physics judgement. This requirement fixes the amount of on-line memory that must be available on the machine. It is then obviously desirable to crunch numbers from this database as fast as possible, by parallelizing on as many processing nodes as in principle useful and in practice possible. There will be an estimated (time-dependent) technology-limited peak performance for each node, P0, and (more interestingly) a sustained performance P = ε P0, with ε < 1. As we add more and more nodes (call k the total number of processing nodes), global performance PT increases linearly, until we hit the bandwidth limit. In fact, in order to sustain P, we need a bandwidth to local memory of

B = P / R.    (4)

In this equation, the relevant R-ratio is the one applicable to the most compute-intensive algorithm to be used. As we increase k, the memory needed in each node is reduced. At some point, the only way to reduce memory is to make the word length shorter, which in turn reduces bandwidth. When we reach this regime, B scales as B(k) ~ 1/k. (This simplifying assumption is made to keep the discussion simple. In real life B(k) will decrease in quantum steps.) Therefore the effective computing power of the machine levels off to

PT ≃ R · k · B(k),    (5)

an approximately constant value. There is therefore no point in increasing the number of processors beyond this limit. (The only way to circumvent the problem would be to have more memory than needed in the machine. This is however bad use of the available money, since the extra money, if available, is better used to build additional copies of the machine.) There is a further bottleneck that must be closely monitored, related to the need to move data between neighbour nodes (we make here the simplifying but realistic assumption that data transfers beyond nearest-neighbour nodes are not needed in LGT). Again, available technology will dictate an upper limit on the bandwidth between remote nodes (call it BRM). Since remote accesses are proportional to the area of the physical region mapped onto each processor, we have that (I consider here for definiteness a three-dimensional mesh of processors) the needed remote bandwidth is given by

BR ≃ (P / R) · k^(1/3).    (6)

As we hit the technology limit and BR becomes larger than BRM, effective performance starts to be cut off by the available remote bandwidth. In this regime, adding more processing nodes yields more effective performance only as k^(2/3). Therefore, the (very simple) secret of the trade is summarized by saying that you have to assemble the largest set of processors that can be effectively fed with data by memory, while at the same time making sure that you are not limited by a poor remote bandwidth.
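The scaling argument above can be condensed into a toy model. The Python sketch below evaluates PT(k) under the three limits just discussed (sustained node peak, local memory bandwidth of Eq. (4), remote bandwidth of Eq. (6)); all numerical values in it are illustrative placeholders, not APEmille figures.

```python
"""Toy model of the LGT performance scaling discussed in Section 3.

All numbers (epsilon, R, bandwidths, the node count at which the word
length must shrink) are illustrative placeholders, not APEmille data.
"""

def effective_performance(k,
                          p0=0.5e9,      # peak node performance [flop/s] (assumed)
                          epsilon=0.6,   # sustained fraction of peak (assumed)
                          r=4.0,         # R = operations per memory operand, Eq. (1)
                          b_mem=0.5e9,   # local memory bandwidth per node [word/s] (assumed)
                          k_mem=512,     # node count beyond which B(k) ~ 1/k (assumed)
                          b_rm=0.5e9):   # remote link bandwidth per node [word/s] (assumed)
    """Return the total effective performance P_T for k nodes, in flop/s."""
    # Local-memory limit, Eq. (4): P <= R * B(k); B(k) shrinks ~ 1/k once the
    # word length has to be reduced to fit the (fixed) lattice into k nodes.
    b_k = b_mem if k <= k_mem else b_mem * k_mem / k
    p_local = r * b_k
    # Remote-communication limit, Eq. (6): B_R ~ (P/R) * k**(1/3) <= B_RM,
    # hence P <= R * B_RM / k**(1/3) and P_T grows only as k**(2/3).
    p_remote = r * b_rm / k ** (1.0 / 3.0)
    # Per-node sustained performance is the smallest of the three limits.
    p_node = min(epsilon * p0, p_local, p_remote)
    return k * p_node


if __name__ == "__main__":
    for k in (8, 64, 512, 4096):
        print(f"k = {k:5d}  P_T = {effective_performance(k) / 1e9:6.1f} Gflop/s")
```

With different placeholder numbers the flat, memory-limited regime of Eq. (5) can dominate instead of the remote-bandwidth one; the point of the sketch is only to make the structure of the argument explicit.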


We now turn to an analysis of the requirements that play a role in shaping the structure of a processing node optimized for LGT simulations. As already stressed, the most severe bottleneck is memory bandwidth. The usual solution to this problem in traditional processors is the use of a cache. A cache system contains a copy of a portion of the global memory. If we assume that a fraction h of all memory accesses reference the cache (which has a fast access time Tc compared to the main memory access time Tm), the average access time to memory is

Ta = h · Tc + (1 − h) · Tm.    (7)

This trick works very well if h ≃ 1, but is desperately inefficient for LGT, where h is close to zero. In this case, in fact, almost the whole database is used in each program iteration. A simple way to summarize the cache idea is that you are betting that data are available immediately upon request, when they are needed. The opposite approach is based on the assumption that data are never available, so the program has to anticipate any memory request and fetch words ahead of time, so that they are available when needed. This step is obviously best performed under program control. All APE generations have followed the latter approach, which trades the hardware complexity of a cache system for the software overhead of anticipating the need for data and fetching them in time. This approach requires of course that pre-fetched data can be stored somewhere, waiting to be operated upon. The appropriate parking place is a large register file. A large register file is also needed to keep intermediate results of a long computation inside the processor. This is important, since we obviously want to keep global memory bandwidth as close as possible to the lower limit set by the algorithmic structure of the program. It is probably interesting to remark at this point that this general approach to the control of memory accesses has recently been advertised as a fundamental breakthrough of so-called next generation processor architectures (such as the IA-64 architecture by Intel). A further important issue concerns the techniques used to squeeze as much computing power as possible out of one processor (of course within the bandwidth limit already discussed). This must be done in such a way as to make it easy for the compiler to achieve high efficiency, while also pushing the technology handle as little as possible. The obvious solution to this problem, in general terms, is based on the attempt to perform more operations at the same point in time (or, in computer language, during the same clock cycle). There is a basic and deep motivation based on physics for this approach. In fact, as electronic circuitry becomes smaller and smaller (parametrized for instance by some scale length λ), switching speed increases as 1/λ while logic complexity per unit area grows as 1/λ². It is therefore much more efficient to try to perform more independent tasks in parallel within the same processor than to try to complete one operation at higher speed. We therefore need an approach that overlaps memory accesses (including address generation), control operations and arithmetic operations (these last, after all, are the only useful operations from the point of view of the physics simulation).
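Returning for a moment to the memory access policy, the following small Python fragment evaluates Eq. (7) and contrasts the cache bet with the prefetch-into-registers policy described above; the cycle counts are placeholder assumptions, not measured figures.

```python
def avg_access_time(hit_rate, t_cache=1.0, t_mem=20.0):
    """Average access time of Eq. (7), in clock cycles (cycle counts assumed)."""
    return hit_rate * t_cache + (1.0 - hit_rate) * t_mem

# Typical workstation code: most accesses hit the cache.
print(avg_access_time(0.95))   # ~ 1.95 cycles
# LGT sweep: essentially the whole data set is touched once per iteration,
# so the hit rate is close to zero and the cache buys almost nothing.
print(avg_access_time(0.05))   # ~ 19.05 cycles

# The APE alternative: the compiler issues each load early enough that the
# memory latency is hidden behind arithmetic already in flight, and the
# operand simply waits in the large register file; the cost per access then
# approaches the bandwidth limit (one word per cycle in APEmille) rather
# than the latency t_mem.
```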


There is a very effective way to reach these goals in LGT, associated with the fact that practically all the computations involved use complex numbers. As already briefly remarked, we can define our basic arithmetic to be complex-valued. If we do this, by specifying a multiply operation we actually start six ordinary floating point operations, while an add is equivalent to two floating point operations. A further step is the definition of a so-called normal operation, defined as

a · b + c.    (8)
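Written out in real arithmetic, the complex normal operation indeed starts eight floating point operations. A minimal Python sketch of the operation count (an illustration only, not APEmille microcode):

```python
def complex_normal(ar, ai, br, bi, cr, ci):
    """Complex normal operation a*b + c, spelled out in real arithmetic.

    Four real multiplies and four real adds/subtracts: eight floating point
    operations that a normal-based processor can start with one instruction.
    """
    dr = ar * br - ai * bi + cr   # real part: 2 multiplies, 2 adds/subtracts
    di = ar * bi + ai * br + ci   # imaginary part: 2 multiplies, 2 adds
    return dr, di
```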

Since LGT simulations involve a lot of matrix manipulations, we may expect that arithmetic operations can almost always be cast in normal form. (This is not true, however, in some other interesting cases. A notable counter-example comes from the simulation of the Navier–Stokes equations within the lattice Boltzmann equation approach. In this context there is a dramatic excess of adds with respect to multiplies, and the global performance of a normal-based processor drops quickly to less than 50%.) Coming back to the more conventional cases, all arithmetic operations are based on just one very efficient (it starts eight operations) and easily understood (by compilers) instruction. In summary, the basic ways in which all APE generations have tried to fulfill the computational requirements of LGT simulations are as follows:
· Use as many processors as made possible by the memory bandwidth bottleneck.
· Try to minimize bottleneck effects by handling memory accesses under direct and explicit program control.
· Start as many useful operations as possible in the same clock cycle, in order to obtain high performance at a conservative clock rate.
· Group the operations that can be started together into forms that can be handled efficiently by a compiler.
In Section 4, I describe in some detail the specific APEmille architecture and its current implementation, showing how the basic ideas summarized here have in fact shaped this third generation APE machine.

4. The APEmille architecture

The APEmille architecture is very simple and regular. It is based on the following basic elements.
· There is a three-dimensional Euclidean grid of Nx × Ny × Nz processing nodes. All nodes in principle execute exactly the same program (SIMD programming style).
· Each node has a private memory bank, which can be accessed only locally. While of course all nodes perform a memory access together, each node is able to compute its own memory address. (This feature was not available in APE and APE100; it was in fact one of the main limitations of these machines.)
· Each node stores its private copy of all data structures in its private memory bank. There are therefore as many copies of the same variable as there are nodes in the


machine. This is not true for so-called control variables (e.g., the index of a loop). Control variables are in principle unique to the whole machine (we will see later how they are handled in practice).
· As already stressed, all nodes execute the same program in lock-step mode. Program branches are steered either by conditions on control variables or by performing logical operations (e.g., the and operation) on logical conditions on local variables, independently computed on all nodes.
· Each node in the mesh, while following the same instruction sequence as all its partners, is able to mask the result of specific instructions if a locally evaluated condition is true. This feature is known as the logical where.
· Nodes on the mesh are physically connected by links stretching from each node to its six nearest-neighbour partners. The mesh is made translationally invariant by wrapping local connections around the mesh.
· There are two specific patterns of data transfer among nodes, sketched in the code fragment below. In rigid communication mode, all nodes send a data packet to one of their neighbours. The relative distance between each sender–receiver pair in the communication is the same for all pairs. In other words, node (x, y, z) sends a data word to node (x + Δx, y + Δy, z + Δz). In this way a rigid translation of the node grid is performed. In principle, the translation vector (Δx, Δy, Δz) can be arbitrary, although it is anticipated that only nearest (or almost nearest) neighbour communications will be used. Consequently, performance trade-offs favour short-distance communications. (Remote communications with far-away nodes pay a bandwidth penalty proportional to the longest component of the translation vector. Communication latency is also longer.)
· A broadcast communication mode is also foreseen. In this case, one of the nodes (software-selected by the application program) distributes its own data packet to all other elements of the mesh. There are several versions of broadcast (or partial broadcast). In fact, it is possible to limit the broadcast procedure to independent planes (two-dimensional broadcast) or to independent lines (one-dimensional broadcast). Broadcast communications have been carefully optimized for bandwidth. The asymptotic bandwidth for large data packets is as high as communication to the nearest neighbour.
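A minimal NumPy sketch of rigid communication, broadcast and the logical where, treating the machine as an Nx × Ny × Nz array with one value per node. This is an illustration of the programming model only; it does not reproduce TAO syntax or the Comm1000 protocol.

```python
import numpy as np

nx, ny, nz = 2, 8, 8                      # one value per node of the mesh
field = np.random.rand(nx, ny, nz)

# Rigid communication: node (x, y, z) sends its datum to node
# (x + dx, y + dy, z + dz); the wrap-around of np.roll reproduces the
# periodic (translationally invariant) mesh.
dx, dy, dz = 1, 0, 0
received = np.roll(field, shift=(dx, dy, dz), axis=(0, 1, 2))

# Broadcast: one software-selected node distributes its datum to all nodes
# (here the full three-dimensional broadcast; the one- and two-dimensional
# variants act only along lines or planes).
broadcast = np.full_like(field, field[0, 0, 0])

# Logical where: all nodes execute the same instruction, but nodes where the
# locally evaluated condition is false mask out the result.
condition = field > 0.5
result = np.where(condition, field * 2.0, field)
```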


The basic features listed above are in fact very simple extensions of the basic APE architecture, whose main traits have remained remarkably stable since the introduction of custom processors. Small architectural improvements are concentrated on a more flexible local memory addressing scheme and on some extensions of the interconnection structure among nodes. More substantial changes are present, on the other hand, in the actual machine structure. In the actual implementation, the machine is based on a basic building block (a printed circuit board, in fact) that contains one control processor, eight processing nodes and the support for node communication within the block and outside the block, when more blocks are connected together. This basic building block is referred to as the PB (shorthand for Processing Board). The logic functions of the PB are mostly handled by a chip set of three custom components, which make up the core of APEmille. The members of the chip set are:

T1000. This is the control processor for the whole PB. T1000 controls the program flow and computes the base addresses needed by the nodes. The control variables described before are handled here and stored within the T1000 private data memory. T1000 is a processor in its own right, with an extensive set of integer arithmetic and logic operations and a large register file.

J1000. J1000 is the work-horse processor of APEmille. As already stated, there are eight J1000 nodes within one PB. They are logically placed at the sites of a basic 2 × 2 × 2 mesh of processors. The mesh can be extended in all three dimensions by appropriately connecting more PBs. J1000 is strongly biased towards floating point performance. J1000 is able to compute the normal operation described before on complex single-precision floating point numbers, following the IEEE standard. The same operation can also be carried out on double-precision real floating point numbers as well as on single-precision real numbers (in the latter case, two real normals can be started at the same time on stride-two vector data). It is also possible to perform arithmetic and logic operations on integer numbers (including shifts) and all conversions between integer and floating-point data formats and between single- and double-precision formats. All operands involved in all mathematical manipulations come from a very large register file. This structure has 512 words of 32 bits each. Any pair of successive odd–even registers can also be handled as one double-precision register, or as one complex single-precision variable. The register file is the hub where all data handled by J1000 are managed. There are enough access ports to permit execution of a normal operation at each clock cycle and to support concurrently one read or write access to memory. J1000 interfaces directly to the node memory bank. The memory is organized to be as fast as possible, while using the same commodity components usually found in personal computers. The memory is organized in 64-bit words, based on synchronous DRAM (SDRAM) technology. After a few start-up cycles, data can be written to or read from memory at a rate of one memory word per clock cycle. The R ratio ranges from 4 (in the case of a normal operation between complex numbers) to 2 (double-precision real or single-precision vector real normal operations). The memory system is based on 64 Mbit memory chips, and the memory available to each node is 32 Mbytes.

Comm1000. The third element of the APEmille chip set handles all communication tasks needed by the machine. Comm1000 basically interfaces with the eight nodes housed on one PB and with six links connecting to the nearby PBs in the x, y, z directions. Data involved in a remote transfer flow from the memory banks to the local processor and then to the Comm1000 interface. From here, data packets are either sent to the appropriate link, if the destination node belongs to a different PB, or directly to the destination node on the same PB. Comm1000 handles all the steps of the long-distance communication independently and does not require any help from either T1000 or J1000. From the point of view of a simulation program, data packets are simply handed over to the


network and later delivered to the destination node. As usual in all APE generations, every memory access can be either local or remote, according to the value of the address field (as evaluated at run time). If a specific memory access turns out to have a remote destination, a longer time interval has to elapse before the data reach their destination. In this case processing can no longer continue: clock cycles are therefore stretched until the data reach their destination and processing is allowed to resume.

All members of the APEmille chip set are CMOS VLSI components, using rather conservative and well-established standard-cell technology. The set has been designed with massive use of silicon compilers. The chips are currently manufactured in a 0.5 μm CMOS technology by ES2-Atmel. They run synchronously at a clock frequency of 66 MHz. A second-source compatible version of the elements of the chip set will soon be manufactured by Alcatel-Mietec in a 0.35 μm CMOS process. As already remarked, the massive use of silicon compilation techniques makes migration to a different silicon foundry straightforward.

As already remarked, the basic APEmille building block is the PB, housing eight nodes. Larger machines can be built by assembling more PBs together. When more than one PB is involved in the same machine, the control processors running on all PBs are loaded with the same program and initialized with the same data. In this way they behave logically as one single control processor. There are three hierarchy layers:
· Four PBs can be connected together to build a machine with 2 × 2 × 8 nodes. This set is known as an APEunit.
· Four APEunits can also be connected together. The resulting topology is a thin layer of 2 × 8 × 8 nodes (128 nodes altogether). This is the largest APEmille machine physically assembled within one APE enclosure. All electrical connections needed by the node-to-node links are built on the backplane present in the enclosure.
· Finally, an arbitrary number N of units from the previous hierarchy level can be connected together. The resulting geometry is 8 × 8 × (2N). Large machines, stretching beyond the single-enclosure limit, use cable links with differential signalling techniques.
As in all APE generations, APEmille is an attached processor of some traditional (and familiar to users) computer that handles all host activities. In the case of APEmille, the host system is a networked array of personal computers (PCs) running the Linux operating system. There is one independent PC for each APEunit. Each PC sees the four PBs belonging to its APEunit as slave elements of its CompactPCI interface bus. Each PC can load and store data into any memory location and control register. A rather complex network structure is needed to control a globally synchronous array of processing nodes through an asynchronous network of hosts. The basic trick used here is that only one of the PCs is able to start any operation that has to be performed in a completely synchronous way (basically, starting and stopping APE programs). Such a master PC sends control signals to a so-called Root Board that distributes these controls to all APEunits.
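Combining the figures quoted above (eight floating point operations started per cycle per node, a 66 MHz clock, eight nodes per PB and 32 Mbytes per node), a short sanity-check sketch of the assembly hierarchy can be written; the table it prints is nothing more than arithmetic on those numbers, and the product figures quoted later in the text remain the authoritative ones.

```python
CLOCK_HZ = 66e6          # clock frequency quoted above
FLOPS_PER_CYCLE = 8      # one complex normal started per cycle
MBYTE_PER_NODE = 32      # SDRAM per node

configurations = {
    "PB (2 x 2 x 2)":                (2, 2, 2),
    "APEunit (2 x 2 x 8)":           (2, 2, 8),
    "single enclosure (2 x 8 x 8)":  (2, 8, 8),
    "large machine (8 x 8 x 8)":     (8, 8, 8),   # one case of the 8 x 8 x 2N series (N = 4)
}

for name, (nx, ny, nz) in configurations.items():
    nodes = nx * ny * nz
    peak_gflops = nodes * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
    memory_gbyte = nodes * MBYTE_PER_NODE / 1024
    print(f"{name:32s} {nodes:4d} nodes  {peak_gflops:7.1f} Gflop/s peak  {memory_gbyte:5.1f} Gbyte")
```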


The basic reason for introducing multiple hosts in APEmille is that this choice provides a natural way to increase input-output performance in the machine, scaling exactly in the same way as processing performance. The present (February 1999) status of the project is summarized in the following main points:
· All hardware (non-VLSI) components of the machine have been built and prototypes have been successfully tested.
· Prototypes of T1000 and Comm1000 have also been successfully tested. Tests have been performed using a large fraction of the final APEmille software (see later for some details).
· Prototypes of J1000 are expected in just a few weeks.
· Several machines with a peak performance of about 64 Gflops will be ready for physics late this year, while several 250 Gflops machines will also be ready, both at INFN and DESY, in the year 2000.
We now conclude this section with a very short description of the APEmille software environment. The old APE100 programming environment is largely kept for this new generation machine. Programs are written in TAO, an implicitly parallel programming language that, on the one hand, makes it very easy to write parallel code, while on the other hand allowing the programmer to specify only those constructs that can be efficiently mapped onto APEmille. Previous experience with APE100 has shown that TAO is quickly learnt and easily used by physicists with basic programming skills. TAOmille extends the old TAO versions, including those features that were not supported on the previous machine. The key feature of the TAO programming environment for LGT simulations is that the TAO compiler provides two different levels of program optimization. First, a standard optimization procedure is used to remove dead branches from the code, move loop-invariant expressions out of loops, and make other similar improvements. A more specific optimization procedure is then applied to schedule instructions. In fact, T1000 as well as J1000 use a very long instruction word, in which the operations of all independent devices are specified at each clock cycle. This allows a highly efficient scheduling and optimisation of the arithmetic and memory-access pipelines by the low-level optimiser in the last step of the compilation chain. Various new controller instructions and their improved timing allow, in general, shorter latencies for address calculations and should alleviate the related performance bottlenecks observed in APE100. The local addressing and the more flexible communications will be helpful for various algorithmic problems, like preconditioning or FFT. Rough performance estimates with simple and un-tuned assembler code for the Wilson-Dirac operator yield 88% pipeline filling (760 clock cycles) when using complex normal instructions, corresponding to 60% of peak performance (subtracting useless flops from, e.g., real × complex factors). This shows that no severe limitations from the memory bandwidth are encountered. For double precision a pipeline filling of 82% (1150 clock cycles) has been found, corresponding to 21% of single-precision peak performance. This indicates that less than the naive factor of 4 is lost with double-precision arithmetic.
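For orientation only, the quoted efficiencies can be turned into absolute sustained per-node figures; the numbers below simply combine the 66 MHz clock and the eight operations per cycle with the percentages given above.

```python
CLOCK_HZ = 66e6
PEAK_FLOPS_PER_NODE = 8 * CLOCK_HZ          # complex normal: 8 flop per cycle

# Wilson-Dirac estimates quoted above.
single_precision = 0.60 * PEAK_FLOPS_PER_NODE   # 60% of peak with complex normals
double_precision = 0.21 * PEAK_FLOPS_PER_NODE   # 21% of the single-precision peak

print(f"single precision: {single_precision / 1e6:6.1f} Mflop/s per node")
print(f"double precision: {double_precision / 1e6:6.1f} Mflop/s per node")
# On the 128-node single-enclosure machine the single-precision figure would
# correspond to roughly 40 Gflop/s sustained.
```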


5. Future perspectives

The APEmille project is now close to its commissioning phase. We can therefore predict that, as soon as physics programs are safely running on the new machines, ideas for a follow-up project will be discussed. If one compares the status of computing now with the situation at the points in time when APE or APE100 were started, one notices one key change. While ten or even five years ago different hardware and software architectures happily lived together, today just one standard model, based on PCs and networks of PCs, seems to survive. One might therefore ask whether a next generation LGT machine should also be based on PCs. The idea is of course based on the fact that processors for PCs have increased in performance by nearly two orders of magnitude in the last ten years. A completely personal answer to this point is that we have to focus on one problem, other than the choice of processor, that was marginal in the past and is now becoming a critical bottleneck. The problem is the remote communication bandwidth. Processors of the current generation (either designed for PC applications or custom) are now so fast that readily available technology is not able to provide the communication bandwidth required to keep them busy when they are used in the closely coupled computing style needed for LGT simulations. Therefore, the development of a new generation LGT engine will require a large effort in the definition of a suitable communication structure. Whether a commercial processor or a new dedicated number cruncher is used is going to be, to some extent, an immaterial detail.

6. Conclusions

LGT compute engines have been available for the last ten years, developed by several institutions. Their structures are reasonably similar, as expected, since they have been developed to solve essentially the same computational problem. In the recent past a sizeable fraction of all compute cycles used for LGT simulations has been performed on these dedicated engines. This has been so in spite of the fact that, at any point in time, it has always been possible in principle to buy a better (that is, more general) computer of matching computing power from a commercial vendor, at an appropriate (most of the time unattainable) price tag. I do not see any clue that the situation might be changing now, so lattice physicists will probably have to keep devising not only bright ideas but also powerful computers for quite a few more years.

Acknowledgements

I would like to thank the editors of this special issue for asking me to write on APEmille (and for patiently waiting while I failed to deliver my paper at all


deadlines). I also want to thank all the members of the APE collaborations for so many years of common work. Finally, discussions with members of the CP-PACS and Columbia groups have often been an invaluable source of new ideas.

References

[1] A. Ukawa, in: Proceedings of CHEP'97, Berlin, 1997, p. 595 (see also this issue).
[2] I. Arsenin et al., in: Proceedings of CHEP'97, Berlin, 1997, p. 586 (see also this issue).
[3] See for instance, A. Terrano, in: Review on the Impact of Specialized Processors in Elementary Particle Physics, Padova, 1983.
[4] M. Albanese et al., The APE computer: an array processor optimized for lattice gauge theory simulations, Comp. Phys. Comm. 45 (1987) 345.
[5] P.M. Farran et al., in: L.O. Hertzberg, W. Hoogland (Eds.), Proceedings of the Conference on Computing in High Energy Physics, 1986.
[6] E. Marinari, Nucl. Phys. B (Proc. Suppl.) 4 (1988) 3.
[7] M. Albanese et al., Glueball masses and string tension in lattice QCD, Phys. Lett. 192B (1987) 163.
[8] C. Battista et al., Int. J. High Speed Comput. 5 (1993) 637.
[9] A. Bartoloni et al., Particle World 2 (1991) 65.
[10] R. Benzi et al., in: Cercignani, Jona-Lasinio, Parisi, Radicati (Eds.), Boltzmann's Legacy 150 Years after His Death, Atti dei Convegni Lincei, vol. 131, 1997, p. 41.
[11] A. Bartoloni et al., Nucl. Phys. B (Proc. Suppl.) 42 (1995) 17.