Nuclear Physics B (Proc. Suppl.) 140 (2005) 859–861
Domain Wall Fermion Inverter on Pentium 4

Andrew Pochinsky∗
Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

∗Supported by DOE contracts DE-FC02-94ER40818 and DE-FC02-01ER41193.
A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for problem sizes that do not fit in the L2 cache, on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.
1. INTRODUCTION

Over the years, the capabilities of personal computers have continued to grow according to Moore's law. The quest for faster graphics has led to the addition of vector extensions to many popular instruction sets. While these extensions target gaming applications, they turn out to be useful for lattice QCD as well [1]. In this respect, the Pentium 4 and Xeon processors from Intel offer good price/performance and are attractive for large clusters; in fact, the SciDAC LQCD initiative has chosen the Pentium 4/Xeon as its cluster platform. However, the question of how best to use the Pentium 4 for LQCD remains nontrivial, mostly because the system design is not optimized for lattice calculations. Achieving the highest performance on a given platform therefore requires a careful consideration of the strengths and weaknesses of the cluster hardware. This talk highlights the issues one has to consider and provides an example: optimizing a domain wall fermion inverter for Xeon based GigE clusters.

2. MACHINE AND PROBLEM FEATURES

We start by identifying the features of both the target machine and the problem that are important for optimization.
2.1. Multimedia extensions

The Pentium 4 provides a set of instructions for operations on vectors of floating point numbers [2] (the vector length is four for single precision and two for double precision). In addition to pointwise addition, subtraction and multiplication of vectors, there is a limited number of shuffles for rearranging vector elements. One important restriction is that loading and storing vectors is much faster if the memory locations are aligned on a 16 byte boundary. Fortunately, recent versions of the gcc compiler are good at generating code with SSE instructions. In particular, they correctly handle automatic variable alignment and vector register allocation. Together with inline functions and enabled optimization, this allows the programmer to write very little assembly code.

2.2. Data cache

Moving one step up from the ALUs and registers, one should consider the effects of the data caches on performance. It appears that the data bus to the L1 and L2 caches is fast enough to have only minor effects on SSE performance. The trickiest part of cache management, data prefetching, is not very important on the Pentium 4, because the processor has hardware prefetch logic that adapts well to the LQCD data flow.

2.3. Memory bus bandwidth

Physically interesting problems do not fit into the cache hierarchy, so one has to consider the behavior of the memory system. There are two important parameters: memory latency and memory bus bandwidth. It turns out that for even moderately optimized QCD code memory latency is not very important. The likely reason is that the prefetch hardware predicts access patterns well enough to hide it from the application. Unlike more expensive processors, however, the Pentium 4 has a memory bus with relatively small bandwidth. Together with the cache structure this leads to certain tradeoffs in the overall organization of the code.
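As an illustration of the programming style implied by sections 2.1–2.3, the following sketch shows a pointwise multiply–add over 16-byte aligned arrays written with SSE intrinsics, with an optional software prefetch. It is a minimal example, not code from the inverter itself; the function and array names are placeholders.

    #include <xmmintrin.h>

    /* w[i] += a[i] * b[i] for n floats, n a multiple of 4.
       All pointers are assumed to be 16-byte aligned (section 2.1).
       The explicit _mm_prefetch is optional: on the Pentium 4 the hardware
       prefetcher usually handles this access pattern by itself (section 2.2),
       and for data sets larger than L2 the loop is limited by memory bus
       bandwidth rather than by the SSE arithmetic (section 2.3). */
    static void vmadd(float *w, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i + 64), _MM_HINT_T0);
            __m128 va = _mm_load_ps(a + i);      /* aligned vector loads */
            __m128 vb = _mm_load_ps(b + i);
            __m128 vw = _mm_load_ps(w + i);
            vw = _mm_add_ps(vw, _mm_mul_ps(va, vb));
            _mm_store_ps(w + i, vw);             /* aligned vector store */
        }
    }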
2.4. Domain wall fermions

It is easy to identify the features of the domain wall fermion action that are most important from the optimization standpoint:

• the gauge field is independent of the fifth dimension, thus offering a way to reduce memory traffic if the "inner loops" run over Ls;

• the action admits 4-d even/odd preconditioning [3] that improves the locality of data access;

• mapping the fifth dimension into SSE vectors allows one to avoid most of the shuffles (see the sketch below);

• memory locality is further improved by arranging the code to run the "outer loop" over 4-d "sites".
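The first and third points combine nicely: because the gauge link does not depend on s, a single complex link element can be broadcast against an SSE vector holding four consecutive s values, so the complex multiply needs no shuffles at all. The fragment below is a minimal sketch of this idea; the type and function names are invented for illustration, and it assumes that real and imaginary parts are kept in separate vectors.

    #include <xmmintrin.h>

    /* One spin-colour component of psi for four consecutive values of the
       fifth coordinate: lanes hold s, s+1, s+2, s+3 (hypothetical packing,
       see section 3 for the ordering actually used). */
    typedef struct {
        __m128 re;   /* Re psi for s .. s+3 */
        __m128 im;   /* Im psi for s .. s+3 */
    } svec;

    /* Multiply a single complex gauge-link element U (the same for all s,
       since the gauge field is independent of the fifth dimension) into a
       packed s-vector.  Only broadcasts and pointwise operations are
       needed, no shuffles. */
    static inline svec link_times_svec(float u_re, float u_im, svec v)
    {
        __m128 ur = _mm_set1_ps(u_re);   /* broadcast Re U into all lanes */
        __m128 ui = _mm_set1_ps(u_im);   /* broadcast Im U into all lanes */
        svec w;
        w.re = _mm_sub_ps(_mm_mul_ps(ur, v.re), _mm_mul_ps(ui, v.im));
        w.im = _mm_add_ps(_mm_mul_ps(ur, v.im), _mm_mul_ps(ui, v.re));
        return w;
    }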
[Figure 1. Performance of a matrix-vector multiply loop. Vertical axis: performance in GFlops (0–6); horizontal axis: dataset size in bytes (512 B to 128 MB).]
3. DATA LAYOUT

Only the layout of the fermions is important for performance. The following layout appears to be optimal for the DWF CG (the superscript is the spin–colour index, spin 0–3 and colour 0–2; the subscripts are the 4-d site and the fifth-dimension coordinate s; reading left to right, top to bottom):

    ψ^{00}_{0000,0} · · · ψ^{00}_{0000,3}
    ψ^{01}_{0000,0} · · · ψ^{01}_{0000,3}
        ...
    ψ^{32}_{0000,0} · · · ψ^{32}_{0000,3}
    ψ^{00}_{0000,4} · · · ψ^{00}_{0000,7}
        ...

That is, for a given 4-d site each spin–colour component is stored for four consecutive values of s, all twelve spin–colour components follow one another, and only then does the storage move on to the next block of four s values. In this case the only shuffles in the code are those needed to compute (1 ± γ5)ψ_{s±1}.
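A data structure matching this ordering might look as follows. This is only an illustrative sketch, reusing the re/im vector pair idea from the sketch in section 2.4: the type names are invented, Ls is assumed to be a multiple of four, and the complex-number packing (separate real and imaginary vectors) is an assumption not spelled out above.

    #include <xmmintrin.h>

    #define LS 16            /* lattice extent in the fifth dimension */

    /* Four consecutive values of the fifth coordinate s of one
       spin-colour component, real and imaginary parts separately. */
    typedef struct {
        __m128 re;
        __m128 im;
    } s_block;

    /* Fermion data for one 4-d site, ordered as in the table above:
       in memory, spin (0..3) and colour (0..2) vary fastest, and the
       leftmost index selects the block of four s values. */
    typedef struct {
        s_block c[LS / 4][4][3];
    } site_fermion;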
4. VECTOR PERFORMANCE

Before embarking on a full inverter, it is instructive to time a simple SU(3) matrix times projected fermion vector code. Figure 1 shows the performance of the following pseudo code as a function of the data set size on a 1.5 GHz Xeon processor:

    for all i = 0 ... N do {
        W[i] = M[i] * V[i];
    }

We see that in spite of impressive performance for small N, going outside the L2 cache significantly reduces performance. Different symbols show the performance with and without PREFETCH instructions, supporting the claim that the hardware prefetch engine works well.
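For concreteness, a plain scalar reference version of the benchmark kernel could look like the following. This is not the SSE code that was actually timed; the types and function names are invented here, with the projected ("half") fermion assumed to carry two spin components of three colours each.

    /* Hypothetical scalar reference for W[i] = M[i] * V[i]: an SU(3)
       matrix acting on both spin components of a projected fermion. */
    typedef struct { float re, im; } cplx;
    typedef struct { cplx u[3][3]; } su3_matrix;
    typedef struct { cplx c[2][3]; } half_fermion;   /* [spin][colour] */

    static void mul_su3_half(half_fermion *w, const su3_matrix *m,
                             const half_fermion *v)
    {
        for (int s = 0; s < 2; s++)
            for (int a = 0; a < 3; a++) {
                cplx acc = {0.0f, 0.0f};
                for (int b = 0; b < 3; b++) {
                    acc.re += m->u[a][b].re * v->c[s][b].re
                            - m->u[a][b].im * v->c[s][b].im;
                    acc.im += m->u[a][b].re * v->c[s][b].im
                            + m->u[a][b].im * v->c[s][b].re;
                }
                w->c[s][a] = acc;
            }
    }

    /* The benchmark loop of the pseudo code above. */
    void benchmark(half_fermion *w, const su3_matrix *m,
                   const half_fermion *v, int n)
    {
        for (int i = 0; i < n; i++)
            mul_su3_half(&w[i], &m[i], &v[i]);
    }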
5. CG PERFORMANCE

Applying the ideas of section 2.4 to the conjugate gradient code [4], we obtain the following performance for a 2.66 GHz Xeon processor with a 500 MHz FSB. The code supports clusters via the QMP interface [5] and can run either on a single processor or on one- to four-dimensional torus logical network configurations. Performance on a single processor is shown in Figure 2. As one would expect, the final performance of the inverter depends on the lattice extent in the fifth direction; the effects of the loop bookkeeping level off at about Ls = 16. We also see that going out of the L2 cache hurts performance, but not as much as for the vector multiply test code above.
[Figure 2. Single node full DWF inverter performance. Vertical axis: performance in GFlops (0–3); horizontal axis: lattice size per node (2⁴×4, 2⁴×16, 4⁴×16, 8⁴×16, 16⁴×16); separate curves for Ls = 4, 8 and 16.]
The overall structure of the code requires rethinking the communication patterns. In general, one would like to overlap as much computation and communication as possible. In the CG code we adopted the following strategy. Boundary and inner sites of the local sublattice are separated both in memory and in the code. The part of the code dealing with the boundary has communication embedded in such a way as to maximize the overlap; the split transaction design of QMP helps in achieving this goal. Such an approach limits the computation/communication overlap to sublattices of size 3⁴ or larger, but the code continues to work correctly for smaller sublattices as well. The pseudo code for the linear operator M of the CG is (a more detailed sketch follows below):

    for all inner sites { ... }
    for all boundary sites { ... }
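One plausible way to arrange the overlap, written with invented helper routines rather than the actual interface, is sketched below; in the real code the communication calls would map onto QMP's split transactions (message memory and handles declared once, then started and waited on for every application of M).

    /* Hypothetical placeholder types and routines for the overlap sketch. */
    typedef struct fermion_field fermion_field;

    void start_boundary_exchange(const fermion_field *in);
    void wait_boundary_exchange(void);
    void compute_inner_sites(fermion_field *out, const fermion_field *in);
    void compute_boundary_sites(fermion_field *out, const fermion_field *in);

    /* Overlapped linear operator M of the CG (sketch). */
    void apply_M(fermion_field *out, const fermion_field *in)
    {
        /* Kick off the boundary exchange: neighbours need the projected
           boundary spinors of `in` before our boundary sites can finish. */
        start_boundary_exchange(in);

        /* Inner sites depend only on local data, so they are processed
           while the GigE network moves the boundary data.               */
        compute_inner_sites(out, in);

        /* Wait for the boundary data to arrive ...                      */
        wait_boundary_exchange();

        /* ... and complete the sites that need neighbour contributions. */
        compute_boundary_sites(out, in);
    }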
Figure 3 shows the performance for 1-d, 2-d and 3-d torus networks at Ls = 16. One can see that communication induces memory bus traffic that competes with the memory accesses from the computation part of the code.

[Figure 3. DWF inverter on a network. Vertical axis: performance in GFlops per node (0–3); horizontal axis: lattice size per node (2⁴×16, 4⁴×16, 8⁴×16); separate curves for 1-d, 2-d and 3-d networks.]
6. CONCLUSIONS

The DWF conjugate gradient shows sustained performance of over 2 GFlops on a single processor and above 1.5 GFlops per CPU on a 3-d GigE cluster, giving, at summer 2004 prices, a price/performance ratio under $1/MFlops. The code scales well with the cluster size. An implementation of the inverter as a level 3 routine within the SciDAC LQCD API will be available in the fall of 2004.

REFERENCES

1. M. Lüscher, Nucl. Phys. Proc. Suppl. 106 (2002) 21-28.
2. See http://developer.intel.com/
3. K. Orginos, private communication.
4. http://www.mit.edu/~avp/sse/latest/
5. http://www.lqcd.org/scidac/