Nuclear Physics B (Proc. Suppl.) 140 (2005) 859–861
Domain Wall Fermion Inverter on Pentium 4

Andrew Pochinsky∗
Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

∗Supported by DOE contracts DE-FC02-94ER40818 and DE-FC02-01ER41193.
A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for problem sizes that do not fit in the L2 cache, on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.
1. INTRODUCTION

Over the years, the capabilities of personal computers have continued to grow according to Moore's law. The quest for faster graphics has led to the addition of vector extensions to many popular instruction sets. While these extensions target gaming applications, they turn out to be useful for lattice QCD as well [1]. In this respect, the Pentium 4 and Xeon processors from Intel offer good price/performance and are attractive for large clusters; in fact, the SciDAC LQCD initiative has chosen the Pentium 4/Xeon as its cluster platform. However, the question of how best to use the Pentium 4 for LQCD remains nontrivial, mostly because the system design is not optimized for lattice calculations. Achieving the highest performance on a given platform therefore requires a careful consideration of the strengths and weaknesses of the cluster hardware. This talk highlights the issues one has to consider and provides an example: optimizing a domain wall fermion inverter for Xeon based GigE clusters.

2. MACHINE AND PROBLEM FEATURES

We start by identifying the features of both the target machine and the problem that are important for optimization.
2.1. Multimedia extensions

The Pentium 4 provides a set of instructions for operations on vectors of floating point numbers [2] (the vector length is four for single precision and two for double precision). In addition to pointwise addition, subtraction and multiplication of vectors, there is a limited number of shuffles for rearranging vector elements. One important restriction is that loading and storing vectors is much faster if the memory locations are aligned on a 16 byte boundary. Fortunately, recent versions of the gcc compiler are good at generating code with SSE instructions. In particular, they correctly handle automatic variable alignment and vector register allocation. Together with inline functions and enabled optimization, this allows the programmer to write very little assembly code.

2.2. Data cache

Moving one step up from the ALUs and registers, one should consider the effects of the data caches on performance. It appears that the data bus to the L1 and L2 caches is fast enough to have only minor effects on SSE performance. The trickiest part of cache management, data prefetching, is not very important on the Pentium 4, because the processor has hardware prefetch logic that adapts well to the LQCD data flow.

2.3. Memory bus bandwidth

Physically interesting problems do not fit into the cache hierarchy, so one has to consider the behavior of the memory system. There are two important parameters: memory latency and memory bus bandwidth. It turns out that for even moderately optimized QCD code memory latency is not very important. The likely reason is that the prefetch hardware predicts access patterns well enough to hide it from the application. Unlike more expensive processors, however, the Pentium 4 has a memory bus with relatively small bandwidth. Together with the cache structure this leads to certain tradeoffs in the overall organization of the code.
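As an illustration of the programming style implied by sections 2.1–2.3, the following sketch shows a pointwise multiply–add over 16-byte aligned arrays written with SSE intrinsics, with an optional software prefetch. It is a minimal example, not code from the inverter itself; the function and array names are placeholders.

    #include <xmmintrin.h>

    /* w[i] += a[i] * b[i] for n floats, n a multiple of 4.
       All pointers are assumed to be 16-byte aligned (section 2.1).
       The explicit _mm_prefetch is optional: on the Pentium 4 the hardware
       prefetcher usually handles this access pattern by itself (section 2.2),
       and for data sets larger than L2 the loop is limited by memory bus
       bandwidth rather than by the SSE arithmetic (section 2.3). */
    static void vmadd(float *w, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i + 64), _MM_HINT_T0);
            __m128 va = _mm_load_ps(a + i);      /* aligned vector loads */
            __m128 vb = _mm_load_ps(b + i);
            __m128 vw = _mm_load_ps(w + i);
            vw = _mm_add_ps(vw, _mm_mul_ps(va, vb));
            _mm_store_ps(w + i, vw);             /* aligned vector store */
        }
    }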
2.4. Domain wall fermions

It is easy to identify the features of the domain wall fermion action that are most important from the optimization standpoint:

• the gauge field is independent of the fifth dimension, thus offering a way to reduce memory traffic if the "inner loops" run over Ls;

• the action admits 4-d even/odd preconditioning [3] that improves the locality of data access;

• mapping the fifth dimension into SSE vectors allows one to avoid most of the shuffles (see the sketch below);

• memory locality is further improved by arranging the code to run the "outer loop" over 4-d "sites".
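The first and third points combine nicely: because the gauge link does not depend on s, a single complex link element can be broadcast against an SSE vector holding four consecutive s values, so the complex multiply needs no shuffles at all. The fragment below is a minimal sketch of this idea; the type and function names are invented for illustration, and it assumes that real and imaginary parts are kept in separate vectors.

    #include <xmmintrin.h>

    /* One spin-colour component of psi for four consecutive values of the
       fifth coordinate: lanes hold s, s+1, s+2, s+3 (hypothetical packing,
       see section 3 for the ordering actually used). */
    typedef struct {
        __m128 re;   /* Re psi for s .. s+3 */
        __m128 im;   /* Im psi for s .. s+3 */
    } svec;

    /* Multiply a single complex gauge-link element U (the same for all s,
       since the gauge field is independent of the fifth dimension) into a
       packed s-vector.  Only broadcasts and pointwise operations are
       needed, no shuffles. */
    static inline svec link_times_svec(float u_re, float u_im, svec v)
    {
        __m128 ur = _mm_set1_ps(u_re);   /* broadcast Re U into all lanes */
        __m128 ui = _mm_set1_ps(u_im);   /* broadcast Im U into all lanes */
        svec w;
        w.re = _mm_sub_ps(_mm_mul_ps(ur, v.re), _mm_mul_ps(ui, v.im));
        w.im = _mm_add_ps(_mm_mul_ps(ur, v.im), _mm_mul_ps(ui, v.re));
        return w;
    }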
[Figure 1. Performance of a matrix-vector multiply loop. Vertical axis: performance in GFlops (0–6); horizontal axis: dataset size in bytes (512 B to 128 MB).]
3. DATA LAYOUT

Only the layout of the fermions is important for performance. The following layout appears to be optimal for the DWF CG (the superscript is the spin–colour index, spin 0–3 and colour 0–2; the subscripts are the 4-d site and the fifth-dimension coordinate s; reading left to right, top to bottom):

    ψ^{00}_{0000,0} · · · ψ^{00}_{0000,3}
    ψ^{01}_{0000,0} · · · ψ^{01}_{0000,3}
        ...
    ψ^{32}_{0000,0} · · · ψ^{32}_{0000,3}
    ψ^{00}_{0000,4} · · · ψ^{00}_{0000,7}
        ...

That is, for a given 4-d site each spin–colour component is stored for four consecutive values of s, all twelve spin–colour components follow one another, and only then does the storage move on to the next block of four s values. In this case the only shuffles in the code are those needed to compute (1 ± γ5)ψ_{s±1}.
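A data structure matching this ordering might look as follows. This is only an illustrative sketch, reusing the re/im vector pair idea from the sketch in section 2.4: the type names are invented, Ls is assumed to be a multiple of four, and the complex-number packing (separate real and imaginary vectors) is an assumption not spelled out above.

    #include <xmmintrin.h>

    #define LS 16            /* lattice extent in the fifth dimension */

    /* Four consecutive values of the fifth coordinate s of one
       spin-colour component, real and imaginary parts separately. */
    typedef struct {
        __m128 re;
        __m128 im;
    } s_block;

    /* Fermion data for one 4-d site, ordered as in the table above:
       in memory, spin (0..3) and colour (0..2) vary fastest, and the
       leftmost index selects the block of four s values. */
    typedef struct {
        s_block c[LS / 4][4][3];
    } site_fermion;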
4. VECTOR PERFORMANCE

Before embarking on a full inverter, it is instructive to time a simple SU(3) matrix times projected fermion vector code. Figure 1 shows the performance of the following pseudo code as a function of the data set size on a 1.5 GHz Xeon processor:

    for all i = 0 ... N do {
        W[i] = M[i] * V[i];
    }

We see that in spite of impressive performance for small N, going outside the L2 cache significantly reduces performance. Different symbols show the performance with and without PREFETCH instructions, supporting the claim that the hardware prefetch engine works well.
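For concreteness, a plain scalar reference version of the benchmark kernel could look like the following. This is not the SSE code that was actually timed; the types and function names are invented here, with the projected ("half") fermion assumed to carry two spin components of three colours each.

    /* Hypothetical scalar reference for W[i] = M[i] * V[i]: an SU(3)
       matrix acting on both spin components of a projected fermion. */
    typedef struct { float re, im; } cplx;
    typedef struct { cplx u[3][3]; } su3_matrix;
    typedef struct { cplx c[2][3]; } half_fermion;   /* [spin][colour] */

    static void mul_su3_half(half_fermion *w, const su3_matrix *m,
                             const half_fermion *v)
    {
        for (int s = 0; s < 2; s++)
            for (int a = 0; a < 3; a++) {
                cplx acc = {0.0f, 0.0f};
                for (int b = 0; b < 3; b++) {
                    acc.re += m->u[a][b].re * v->c[s][b].re
                            - m->u[a][b].im * v->c[s][b].im;
                    acc.im += m->u[a][b].re * v->c[s][b].im
                            + m->u[a][b].im * v->c[s][b].re;
                }
                w->c[s][a] = acc;
            }
    }

    /* The benchmark loop of the pseudo code above. */
    void benchmark(half_fermion *w, const su3_matrix *m,
                   const half_fermion *v, int n)
    {
        for (int i = 0; i < n; i++)
            mul_su3_half(&w[i], &m[i], &v[i]);
    }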
5. CG PERFORMANCE

Applying the ideas of section 2.4 to the conjugate gradient code [4], we obtain the following performance for a 2.66 GHz Xeon processor with a 500 MHz FSB. The code supports clusters via the QMP interface [5] and can run either on a single processor or on one- to four-dimensional torus logical network configurations. Performance on a single processor is shown in Figure 2. As one would expect, the final performance of the inverter depends on the lattice extent in the fifth direction; the effects of the loop bookkeeping level off at about Ls = 16. We also see that going out of the L2 cache hurts performance, but not as much as for the vector multiply test code above.
[Figure 2. Single node full DWF inverter performance. Vertical axis: performance in GFlops (0–3); horizontal axis: lattice size per node (2⁴×4, 2⁴×16, 4⁴×16, 8⁴×16, 16⁴×16); separate curves for Ls = 4, 8 and 16.]
The overall structure of the code requires rethinking the communication patterns. In general, one would like to overlap as much computation and communication as possible. In the CG code we adopted the following strategy. Boundary and inner sites of the local sublattice are separated both in memory and in the code. The part of the code dealing with the boundary has communication embedded in such a way as to maximize the overlap; the split transaction design of QMP helps in achieving this goal. Such an approach limits the computation/communication overlap to sublattices of size 3⁴ or larger, but the code continues to work correctly for smaller sublattices as well. The pseudo code for the linear operator M of the CG is (a more detailed sketch follows below):

    for all inner sites { ... }
    for all boundary sites { ... }
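One plausible way to arrange the overlap, written with invented helper routines rather than the actual interface, is sketched below; in the real code the communication calls would map onto QMP's split transactions (message memory and handles declared once, then started and waited on for every application of M).

    /* Hypothetical placeholder types and routines for the overlap sketch. */
    typedef struct fermion_field fermion_field;

    void start_boundary_exchange(const fermion_field *in);
    void wait_boundary_exchange(void);
    void compute_inner_sites(fermion_field *out, const fermion_field *in);
    void compute_boundary_sites(fermion_field *out, const fermion_field *in);

    /* Overlapped linear operator M of the CG (sketch). */
    void apply_M(fermion_field *out, const fermion_field *in)
    {
        /* Kick off the boundary exchange: neighbours need the projected
           boundary spinors of `in` before our boundary sites can finish. */
        start_boundary_exchange(in);

        /* Inner sites depend only on local data, so they are processed
           while the GigE network moves the boundary data.               */
        compute_inner_sites(out, in);

        /* Wait for the boundary data to arrive ...                      */
        wait_boundary_exchange();

        /* ... and complete the sites that need neighbour contributions. */
        compute_boundary_sites(out, in);
    }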
Figure 3 shows the performance for 1-d, 2-d and 3-d torus networks at Ls = 16. One can see that communication induces memory bus traffic that competes with the memory accesses from the computation part of the code.

[Figure 3. DWF inverter on a network. Vertical axis: performance in GFlops per node (0–3); horizontal axis: lattice size per node (2⁴×16, 4⁴×16, 8⁴×16); separate curves for 1-d, 2-d and 3-d networks.]
6. CONCLUSIONS

The DWF conjugate gradient shows sustained performance of over 2 GFlops on a single processor and above 1.5 GFlops per CPU on a 3-d GigE cluster, giving, at summer 2004 prices, a price/performance ratio under $1/MFlops. The code scales well with the cluster size. An implementation of the inverter as a level 3 routine within the SciDAC LQCD API will be available in the fall of 2004.

REFERENCES

1. M. Lüscher, Nucl. Phys. Proc. Suppl. 106 (2002) 21-28.
2. See http://developer.intel.com/
3. K. Orginos, private communication.
4. http://www.mit.edu/~avp/sse/latest/
5. http://www.lqcd.org/scidac/