Construction of staples in lattice gauge theory on a parallel computer


Parallel Computing 22 (1996) 1335-1344

S. Hioki
Department of Physics, Hiroshima University, Higashi-Hiroshima 739, Japan

Received 25 January; revised 11 October 1995

Abstract

We propose a simple method to construct staples in lattice gauge theory with the Wilson action on a parallel computer. The method is applicable to a system of any dimension and to any dimensional division of the lattice without difficulty. Furthermore, it requires only a rather small working area to realize gauge simulations on a parallel computer.

Keywords: Lattice gauge theory; Quantum Chromo Dynamics (QCD); Interprocessor communication; Multiprocessors; Portability; Construction of staples

1. Introduction

Recently, numerical simulations in lattice gauge theory have become very important for investigating non-perturbative physics as well as for evaluating interesting quantities from first principles. Indeed, numerical studies have grown with the growth of powerful computers such as supercomputers and parallel computers. Many commercial parallel computers are available now, in addition to many dedicated parallel machines, and in particular many lattice Quantum Chromo Dynamics (QCD) studies have been performed on parallel computers [1]. For programmers, however, the variety of parallel computers hampers the portability of their codes. Moreover, a parallel code usually requires more working area than a code that runs on a single processor, because of the overlap regions shared with neighbouring processors. "Portability of code" and "reduction of working area" therefore become crucial when simulating a larger system or moving to a different platform.


In this paper, a simple method to overcome these problems in lattice gauge Monte Carlo simulations is proposed. We focus on parallel computers that provide message passing routines for interprocessor communication. The organization of this paper is as follows. In the next section we briefly introduce lattice QCD as a typical example of lattice gauge theory. Some preparations and the construction of staples on a scalar and on a vector computer are discussed in Section 3. The actual construction of staples on a parallel computer is presented in Section 4. The last section is devoted to a summary and discussion.

2. Lattice QCD

It is now widely believed that the world of the colour degrees of freedom of quarks and gluons is described by QCD. Because of the non-perturbative nature of QCD, for which the usual techniques such as perturbation theory do not work well, the only available tool for a quantitative study of the phenomena is lattice QCD, the discretized version of the continuum theory. In the pure gauge sector of lattice QCD, in which only the gluonic degrees of freedom are taken into account, the action S is constructed from the link variables U_{n,\mu} as

S = \frac{2}{g^2} \sum_{n,\,\mu>\nu} \mathrm{Re\,Tr}\left( U_{n,\mu}\, U_{n+\hat\mu,\nu}\, U^\dagger_{n+\hat\nu,\mu}\, U^\dagger_{n,\nu} \right),   (1)

where U_{n,\mu} is an SU(3) matrix whose number of degrees of freedom is eight, corresponding to the number of gluons, U^\dagger denotes the hermitian conjugate of U, n is a lattice site, \mu is a direction, and g is the coupling constant of QCD. This action is called the "Wilson action" and was originally proposed by Wilson [2]. It is noted that, in terms of this Wilson action, we can construct a lattice action for other gauge groups such as Quantum Electro Dynamics (QED) or pure SU(2) gauge theory simply by regarding the link variable U as an element of the corresponding group. In this sense the following arguments apply to general lattice gauge theories as far as the Wilson action is concerned. If we look at a single link U_{n,\mu}, the action S can be rewritten as

S = \frac{2}{g^2}\, \mathrm{Re\,Tr}\left( U_{n,\mu}\, X^\dagger_{n,\mu} \right) + \text{terms independent of } U_{n,\mu}.   (2)

Here we call X_{n,\mu} a "staple". Using these variables we can evaluate the expectation value of a physical observable \Omega by

\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, \Omega\, \exp S(U),   (3)

where Z is the partition function. In the actual evaluation, we use the Monte Carlo technique to calculate the average \langle \Omega \rangle: we generate a large number of configurations \{U\}_k (k = 1, 2, \ldots, N) whose generation weights are proportional to \exp S(U). We call this procedure the "update".

Using these configurations, we obtain the average as

\langle \Omega \rangle = \frac{1}{N} \sum_{k=1}^{N} \Omega(\{U\}_k).   (4)

In the update stage, a local algorithm such as the Metropolis [3] or the heatbath [4] method is often used. In these local algorithms, a link U_{n,\mu} is replaced by a new trial link variable U^{\rm new}_{n,\mu} with probability \min(1, \exp \Delta S), where

\Delta S = \frac{2}{g^2}\, \mathrm{Re\,Tr}\left[ \left( U^{\rm new}_{n,\mu} - U_{n,\mu} \right) X^\dagger_{n,\mu} \right],   (5)

while all the other links remain unchanged. Once the staple X_{n,\mu} is obtained, the accept/reject of U^{\rm new}_{n,\mu} can be performed according to Eq. (5). Since the action S contains nearest-neighbour interactions of links, we must access other links to construct a staple X_{n,\mu}; in other words, there exist links which are not independent of U_{n,\mu} and cannot be updated simultaneously with U_{n,\mu}. This link dependence makes the construction of staples on vector and parallel computers complicated, as we shall see later.
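To make the role of the staple concrete, the following is a minimal sketch (not the author's code) of the accept/reject step for a single SU(3) link. The routine name and argument layout are our assumptions; only the form of \Delta S in Eq. (5) is taken from the text, and the trial matrix is assumed to be supplied by the caller.

    ! Hedged sketch of a Metropolis accept/reject step for one link, assuming the
    ! staple x (the sum over nu of Eqs. (8) and (9)) and a trial matrix utrial
    ! have already been prepared elsewhere.  All names here are illustrative.
    subroutine metropolis_accept(u, utrial, x, g, accepted)
      implicit none
      complex(8), intent(inout) :: u(3,3)       ! current link U_{n,mu}
      complex(8), intent(in)    :: utrial(3,3)  ! proposed trial link
      complex(8), intent(in)    :: x(3,3)       ! staple X_{n,mu}
      real(8),    intent(in)    :: g            ! gauge coupling
      logical,    intent(out)   :: accepted
      real(8) :: ds, r
      integer :: i, j

      ! Delta S = (2/g**2) Re Tr[ (U_trial - U) X^dagger ], cf. Eqs. (2) and (5)
      ds = 0.0d0
      do i = 1, 3
        do j = 1, 3
          ds = ds + dble( (utrial(i,j) - u(i,j)) * conjg(x(i,j)) )
        end do
      end do
      ds = (2.0d0 / g**2) * ds

      call random_number(r)
      accepted = (ds >= 0.0d0) .or. (r < exp(ds))
      if (accepted) u = utrial
    end subroutine metropolis_accept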

3. Construction of staples

Consider d-dimensional pure gauge theory defined on a lattice with the Wilson action. Let N_\mu (1 \le \mu \le d) be the linear extent of the lattice in the \mu direction. Each site n is then specified by its coordinates n = (n_1, n_2, \ldots, n_d) with 0 \le n_\mu \le N_\mu - 1, and the total volume is V = \prod_{\mu=1}^{d} N_\mu. Using these coordinates we can define a global address of the site. One simple implementation is to assign to site n the global address

j(n) = 1 + n_1 + \sum_{\mu=2}^{d} n_\mu \prod_{k=1}^{\mu-1} N_k.   (6)
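For concreteness, here is a small sketch (ours, not from the paper) of the addressing of Eq. (6); the 0-based coordinate convention follows the text, while the routine and argument names are assumptions.

    ! Map the site coordinates n(1..d), 0 <= n(mu) <= nsize(mu)-1, to the global
    ! address 1 <= j <= V of Eq. (6), i.e. the "normal order" used in the text.
    integer function global_address(d, n, nsize) result(j)
      implicit none
      integer, intent(in) :: d, n(d), nsize(d)
      integer :: mu, stride
      j = 1 + n(1)
      stride = 1
      do mu = 2, d
        stride = stride * nsize(mu-1)   ! product of N_1 ... N_{mu-1}
        j = j + n(mu) * stride
      end do
    end function global_address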

Then we introduce the "normal order" of sites: sites are ordered by increasing global address. The staple X_{n,\mu} associated with the link U_{n,\mu} is decomposed as

X_{n,\mu} = \sum_{\nu \ne \mu} \left( X^{(+)}_{n,\mu\nu} + X^{(-)}_{n,\mu\nu} \right),   (7)

X^{(+)}_{n,\mu\nu} = U_{n,\nu}\, U_{n+\hat\nu,\mu}\, U^\dagger_{n+\hat\mu,\nu},   (8)

X^{(-)}_{n,\mu\nu} = U^\dagger_{n-\hat\nu,\nu}\, U_{n-\hat\nu,\mu}\, U_{n-\hat\nu+\hat\mu,\nu}.   (9)

First we construct X^{(+)}_{n,\mu\nu} in two steps. We introduce the temporary matrix T^{(1)}_n as

T^{(1)}_n = U_{n,\nu}\, U_{n+\hat\nu,\mu},   (10)

then we construct X^{(+)}_{n,\mu\nu} as

X^{(+)}_{n,\mu\nu} = T^{(1)}_n\, U^\dagger_{n+\hat\mu,\nu}.   (11)

To calculate X^{(-)}_{n,\mu\nu} we use a simple trick: we first construct X^{(-)} with n - \hat\nu as the starting point, and then shift the result in the +\nu direction to obtain X^{(-)}_{n,\mu\nu}. Just as in the X^{(+)}_{n,\mu\nu} case, we proceed in two steps:

T^{(2)}_n = U^\dagger_{n,\nu}\, U_{n,\mu},   (12)

T^{(3)}_n = T^{(2)}_n\, U_{n+\hat\mu,\nu},   (13)

X^{(-)}_{n+\hat\nu,\mu\nu} = T^{(3)}_n.   (14)

3.1. Staple on a scalar computer

In order to put the link variables on a computer, we introduce a one-dimensional addressing l of the sites n. Of course there are many ways to map n to l; here we adopt j(n) of Eq. (6) as l(n). Using this address l(n), we refer to the link variable U_{n,\mu} and to the staple X^{(\pm)}_{n,\mu\nu} as

U_\mu(l) = U_{n,\mu} \quad (l = l(n)), \qquad X^{(\pm)}_{\mu\nu}(l) = X^{(\pm)}_{n,\mu\nu} \quad (l = l(n)).   (15)

In order to access the related links, it is useful to adopt a so-called list vector v(l,\mu) which gives the site address of n + \hat\mu as a function of l, such that

l(n + \hat\mu) = v(l(n), \mu).   (16)

Using this list vector v(l,\mu), X^{(+)}_{\mu\nu}(l) and X^{(-)}_{\mu\nu}(l) are constructed as

T^{(1)}(l) = U_\nu(l)\, U_\mu(v(l,\nu)),
X^{(+)}_{\mu\nu}(l) = T^{(1)}(l)\, U^\dagger_\nu(v(l,\mu)),   (17)

and

T^{(2)}(l) = U^\dagger_\nu(l)\, U_\mu(l),
T^{(3)}(l) = T^{(2)}(l)\, U_\nu(v(l,\mu)),
X^{(-)}_{\mu\nu}(v(l,\nu)) = T^{(3)}(l).   (18)

3.2. Staple on a vector computer

In order to achieve high performance on a vector computer, Eqs. (17) and (18) should be vectorized, that is, these operations should be carried out for a large number of sites independently. As mentioned above, however, we can do this only for links which are independent of each other. A commonly used way to overcome this difficulty is to decompose all sites into an even and an odd group, such that all sites within a group are independent of each other and can be processed in vector mode. We call a site n an even (odd) site if \sum_{\mu=1}^{d} n_\mu is even (odd). To realize this even-odd (checkerboard) decomposition, we restrict ourselves to the case that V is even. We then classify all sites into two groups: the even site group G_e, which consists of all even sites, and the odd site group G_o, which consists of all odd sites. Since both groups have V' (= V/2) elements, we can number all elements of G_e (G_o) from 1 to V' in the normal order introduced above. In terms of this numbering we can define a local address of the sites. Suppose a site n is in G_e (G_o) and has local address l. We refer to the link variable U_{n,\mu} as

E_\mu(l) = U_{n,\mu} \quad (n \in G_e), \qquad O_\mu(l) = U_{n,\mu} \quad (n \in G_o),   (19)

where E_\mu(l) (O_\mu(l)) represents a link variable attached to an even (odd) site whose local address is l and whose direction is \mu. Next we introduce a list vector v_e(l,\mu) (v_o(l,\mu)) which gives the odd (even) site address of n + \hat\mu as a function of the even (odd) site address l. Using these list vectors, we can construct the staples on a vector computer in the following way. For simplicity we restrict ourselves to the case that n belongs to the even site group (n \in G_e):

T^{(1)}(l) = E_\nu(l)\, O_\mu(v_e(l,\nu)),   (20)

X^{(+)}_{\mu\nu}(l) = T^{(1)}(l)\, O^\dagger_\nu(v_e(l,\mu)),   (21)

T^{(2)}(l) = O^\dagger_\nu(l)\, O_\mu(l),   (22)

T^{(3)}(l) = T^{(2)}(l)\, E_\nu(v_o(l,\mu)),   (23)

X^{(-)}_{\mu\nu}(v_o(l,\nu)) = T^{(3)}(l).   (24)

Since the above five equations can be vectorized over 1 \le l \le V', high performance is expected for these operations on a vector computer.
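The even-odd numbering described above can be set up once at initialization. The following sketch (our own, with assumed names) assigns to every site its parity and its local address within the even or odd group, visiting the sites in the normal order of Eq. (6).

    ! Classify each site by the parity of sum(n) and number each parity class
    ! 1..V/2 in normal order (global address order, n(1) fastest).
    subroutine build_eo_index(d, nsize, vol, parity, eo_addr)
      implicit none
      integer, intent(in)  :: d, nsize(d), vol
      integer, intent(out) :: parity(vol)     ! 0 = even site, 1 = odd site
      integer, intent(out) :: eo_addr(vol)    ! local address within its class
      integer :: n(d), j, mu, neven, nodd

      n = 0
      neven = 0
      nodd  = 0
      do j = 1, vol                           ! j is the global address
        parity(j) = mod(sum(n), 2)
        if (parity(j) == 0) then
          neven = neven + 1
          eo_addr(j) = neven
        else
          nodd = nodd + 1
          eo_addr(j) = nodd
        end if
        do mu = 1, d                          ! advance n in normal order
          n(mu) = n(mu) + 1
          if (n(mu) < nsize(mu)) exit
          n(mu) = 0
        end do
      end do
    end subroutine build_eo_index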

4. Staple on a parallel computer

In this section we put the lattice on a parallel computer which consists of \prod_{\mu=1}^{d} M_\mu (1 \le M_\mu \le N_\mu) processors, i.e. we consider the case that each processor is responsible for \prod_{\mu=1}^{d} m_\mu (m_\mu = N_\mu / M_\mu) sites. If M_\mu \ne 1, the \mu direction is divided into M_\mu pieces and distributed over several processors; if M_\mu = 1, the \mu direction is not divided but kept on a single processor. From now on we restrict ourselves to one processor, denoted by P, but the discussion below is universal and applies to every processor. For convenience we denote the neighbouring processor in the \pm\mu direction by P \pm \hat\mu. In order to realize an even-odd decomposition of sites within each processor, which is necessary to update independent links simultaneously on a parallel computer, \prod_{\mu=1}^{d} m_\mu is restricted to be even.

Since the action has nearest-neighbour interactions of links, as mentioned above, we must access other processors to construct the staples of links lying on the boundary of P. One naive way is to access each required link when it is needed, but in many cases this slows the program down because of the rather large communication latency of almost all parallel computers available now. A widely used way to hide this latency is to keep data buffers of neighbouring links as working area and to transfer large blocks of data at once. In the \mu direction, P normally holds the sites with n_\mu = 1, \ldots, m_\mu. In a widely used implementation, one prepares additional n_\mu = 0 and n_\mu = m_\mu + 1 layers in P to store the required remote data. Before accessing data at n_\mu = m_\mu + 1 in P, for example, one transfers the links at n_\mu = 1 in P + \hat\mu to P and stores them in the n_\mu = m_\mu + 1 layer. After this transfer has completed, the links at n_\mu = m_\mu + 1 in P can be accessed in the normal way using a list vector. Although this technique hides the latency of the data transfer, it obviously requires a large working area as buffers. For example, for d = 4, m_\mu = 4 and M_\mu \ne 1 for \mu = 1, \ldots, 4, the ratio of the total link storage, including the working area, to the original size exceeds 5:

\prod_{\mu=1}^{d} (m_\mu + 2) \Big/ \prod_{\mu=1}^{d} m_\mu > 5;   (25)

indeed, for this example the ratio is (6/4)^4 \approx 5.06.

When we work on a computer with a rather small memory, or when we want to simulate a large lattice, this feature becomes a bottleneck. In this paper we propose a new technique to map links of neighbouring processors onto buffers which may overcome this bottleneck.

To realize an even-odd decomposition of sites also on a parallel computer, we classify all sites in P into two groups: the even site group G_e, consisting of all even sites, and the odd site group G_o, consisting of all odd sites. Since both groups have V'' (= \prod_{\mu=1}^{d} m_\mu / 2) elements, we can number all elements of G_e (G_o) in P from 1 to V'' in normal order. In terms of this numbering we can define a local address of sites in P. Using this local address l we refer to the link variables just as in Eq. (19). Next we introduce list vectors v_e(l,\mu) (v_o(l,\mu)) in each processor, defined in the same way as in the vector computer case. If M_\mu = 1, the site n + \hat\mu is also in P, and v_e(l,\mu) (v_o(l,\mu)) can be defined without problem. If M_\mu \ne 1, however, there is a subgroup B^{(+)}_{e,\mu} (B^{(+)}_{o,\mu}) of G_e (G_o) whose neighbouring sites in the +\mu direction are not in P but in P + \hat\mu, i.e.

B^{(+)}_{e,\mu} = \{\, n \in G_e \mid n_\mu = m_\mu \,\}, \qquad B^{(+)}_{o,\mu} = \{\, n \in G_o \mid n_\mu = m_\mu \,\}.   (26)

In this case v_e(l,\mu) (v_o(l,\mu)) cannot be defined properly within P and we must extend the definition. We introduce a new group N_{e,\mu} (N_{o,\mu}) which consists of the sites n + \hat\mu with n \in B^{(+)}_{o,\mu} (n \in B^{(+)}_{e,\mu}). Note that the elements of N_{e,\mu} (N_{o,\mu}) are not in P but in P + \hat\mu. We extend the site numbering in P to N_{e,\mu} (N_{o,\mu}), namely we number the elements of N_{e,\mu} (N_{o,\mu}) from V'' + 1 to V'' + V''/m_\mu in normal order; then we can define v_e(l,\mu) (v_o(l,\mu)) as pointing to the odd (even) site address of n + \hat\mu in N_{o,\mu} (N_{e,\mu}). Just like B^{(+)}_{e,\mu} (B^{(+)}_{o,\mu}), we can introduce a subgroup B^{(-)}_{e,\mu} (B^{(-)}_{o,\mu}) of G_e (G_o) whose neighbouring sites in the -\mu direction are not in P but in P - \hat\mu. Since there are V''/m_\mu sites in B^{(-)}_{e,\mu} (B^{(-)}_{o,\mu}), we can number these sites from i = 1 to V''/m_\mu in normal order. If a site n in B^{(-)}_{e,\mu} (B^{(-)}_{o,\mu}) has local address l and is specified by the number i as above, we can introduce a new function b_e(i,\mu) (b_o(i,\mu)) defined by

l = b_e(i,\mu) \quad \text{for } n \in B^{(-)}_{e,\mu}, \quad 1 \le i \le V''/m_\mu,
l = b_o(i,\mu) \quad \text{for } n \in B^{(-)}_{o,\mu}, \quad 1 \le i \le V''/m_\mu.   (27)
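The boundary numbering of Eq. (27) can likewise be tabulated once at initialization. Below is a rough sketch under our own assumptions: 1-based local coordinates, parity taken from the local coordinates for brevity (a production code would derive parity from the global coordinates), and illustrative names.

    ! Build b_e(i,mu) of Eq. (27): record, in normal order, the even-site local
    ! address of every even site whose -mu neighbour lies in the processor P-mu,
    ! i.e. every even site with local coordinate n(mu) = 1.
    subroutine build_b_even(d, m, mu, b_e, nb)
      implicit none
      integer, intent(in)  :: d, m(d), mu
      integer, intent(out) :: b_e(*)   ! needs at least (product(m)/2)/m(mu) entries
      integer, intent(out) :: nb       ! number of boundary sites found
      integer :: n(d), k, vloc, j, leven

      vloc = product(m)
      n = 1                            ! local coordinates run from 1 to m(k)
      leven = 0
      nb = 0
      do j = 1, vloc                   ! normal order: n(1) fastest
        if (mod(sum(n), 2) == 0) then
          leven = leven + 1            ! local even-site address l
          if (n(mu) == 1) then
            nb = nb + 1
            b_e(nb) = leven
          end if
        end if
        do k = 1, d                    ! advance coordinates
          n(k) = n(k) + 1
          if (n(k) <= m(k)) exit
          n(k) = 1
        end do
      end do
    end subroutine build_b_even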

The actual construction of the staples proceeds as follows. For simplicity, just as in the vector computer case, we restrict ourselves to n \in G_e.

4.1. Positive \nu: X^{(+)}_{n,\mu\nu}

If M_\nu \ne 1 and n \in B^{(+)}_{e,\nu}, the link U_{n+\hat\nu,\mu} is not available in P. Since the way to obtain information from other processors is machine dependent, in this paper we use a send/receive syntax for the inter-processor communication. We have to get U_{n+\hat\nu,\mu} from P + \hat\nu; in other words, we have to send U_{n+\hat\nu,\mu} to P - \hat\nu. So we first prepare the links to be sent, ordered sequentially in an array; second, we send the prepared links to P - \hat\nu; third, we receive links from P + \hat\nu and store them in the appropriate part of the array:

PREPARE:   O_\mu(V'' + i) \leftarrow O_\mu(b_o(i,\nu))   (1 \le i \le V''/m_\nu),
SEND to P - \hat\nu:   O_\mu(V'' + i)   (1 \le i \le V''/m_\nu),
RECEIVE from P + \hat\nu:   O_\mu(V'' + i)   (1 \le i \le V''/m_\nu).

For later convenience, we introduce a function SETLINK that performs the above three operations. Its syntax is

SETLINK(O_\mu, b_o, V'', m_\nu, \nu).   (28)
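One possible realization of SETLINK is sketched below. MPI point-to-point communication is used here as a stand-in for the machine-dependent send/receive that the paper refers to (the actual AP1000 and Paragon implementations differ); the array shapes and argument list are our assumptions.

    ! Hedged sketch of a SETLINK-like routine: gather the boundary links
    ! O_mu(b_o(i,nu)), send them to the processor at -nu, and receive the
    ! corresponding links from +nu into the extended slots O_mu(V''+i).
    subroutine setlink(o_mu, b_o, vpp, m_nu, rank_minus, rank_plus, comm)
      use mpi
      implicit none
      integer, intent(in) :: vpp, m_nu            ! V'' and m_nu
      complex(8), intent(inout) :: o_mu(3, 3, vpp + vpp/m_nu)
      integer, intent(in) :: b_o(vpp/m_nu)        ! boundary addresses b_o(i,nu)
      integer, intent(in) :: rank_minus, rank_plus, comm
      complex(8), allocatable :: sendbuf(:,:,:)
      integer :: i, nb, ierr
      integer :: status(MPI_STATUS_SIZE)

      nb = vpp / m_nu
      allocate(sendbuf(3, 3, nb))

      do i = 1, nb                                ! PREPARE
        sendbuf(:, :, i) = o_mu(:, :, b_o(i))
      end do

      ! SEND to P - nu and RECEIVE from P + nu in one call
      call mpi_sendrecv(sendbuf, 9*nb, MPI_DOUBLE_COMPLEX, rank_minus, 0, &
                        o_mu(:, :, vpp+1:vpp+nb), 9*nb, MPI_DOUBLE_COMPLEX, &
                        rank_plus, 0, comm, status, ierr)

      deallocate(sendbuf)
    end subroutine setlink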

Now T can be made by

T(l) = E_\nu(l)\, O_\mu(v_e(l,\nu)), \quad 1 \le l \le V''.   (29)

Obviously the SETLINK call is not necessary if M_\nu = 1. If M_\mu \ne 1 we then perform

SETLINK(O_\nu, b_o, V'', m_\mu, \mu).   (30)

Next we can construct X^{(+)} as

X^{(+)}_{\mu\nu}(l) = T(l)\, O^\dagger_\nu(v_e(l,\mu)), \quad 1 \le l \le V''.   (31)

4.2. Negative \nu: X^{(-)}_{n,\mu\nu}

Since the sites n - \hat\nu are odd, the temporary matrix T, the product of the first two matrices on the right-hand side of Eq. (9), can be constructed as

T(l) = O^\dagger_\nu(l)\, O_\mu(l), \quad 1 \le l \le V''.   (32)

If M_\mu \ne 1 we perform

SETLINK(E_\nu, b_e, V'', m_\mu, \mu).   (33)

We then obtain W(l) as

W(l) = T(l)\, E_\nu(v_o(l,\mu)), \quad 1 \le l \le V'',   (34)

which corresponds to X^{(-)}, but with n - \hat\nu as the starting points instead of n. Next we move W(l) in the +\nu direction:

T(v_o(l,\nu)) = W(l), \quad 1 \le l \le V''.   (35)

If M_\nu \ne 1, the above +\nu shift requires inter-processor communication. This is done by sending T(n + \hat\nu) (n + \hat\nu \in N_{e,\nu}) to P + \hat\nu, receiving the corresponding data from P - \hat\nu, and storing them into T(n) (n \in B^{(-)}_{e,\nu}):

SEND to P + \hat\nu:   T(V'' + i)   (1 \le i \le V''/m_\nu),
RECEIVE from P - \hat\nu:   T(V'' + i)   (1 \le i \le V''/m_\nu),
STORE:   T(b_e(i,\nu)) \leftarrow T(V'' + i)   (1 \le i \le V''/m_\nu).

This gives X^{(-)}_{\mu\nu}(l):

X^{(-)}_{\mu\nu}(l) = T(l), \quad 1 \le l \le V''.   (36)

Just like SETLINK, it is useful to introduce a function SLIDEMATRIX that performs the above three operations. Its syntax is

SLIDEMATRIX(T, b_e, V'', m_\nu, \nu).   (37)
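For completeness, a matching sketch of SLIDEMATRIX under the same assumptions as the SETLINK sketch above (MPI as a stand-in, illustrative names); it differs from SETLINK in that the extended slots are sent forward in \nu and the received data are scattered into the boundary addresses b_e(i,\nu).

    ! Hedged MPI sketch of a SLIDEMATRIX-like routine: send T(V''+i) to P+nu,
    ! receive into a scratch buffer from P-nu, then STORE into T(b_e(i,nu)).
    subroutine slidematrix(t, b_e, vpp, m_nu, rank_minus, rank_plus, comm)
      use mpi
      implicit none
      integer, intent(in) :: vpp, m_nu
      complex(8), intent(inout) :: t(3, 3, vpp + vpp/m_nu)
      integer, intent(in) :: b_e(vpp/m_nu)        ! boundary addresses b_e(i,nu)
      integer, intent(in) :: rank_minus, rank_plus, comm
      complex(8), allocatable :: recvbuf(:,:,:)
      integer :: i, nb, ierr
      integer :: status(MPI_STATUS_SIZE)

      nb = vpp / m_nu
      allocate(recvbuf(3, 3, nb))

      call mpi_sendrecv(t(:, :, vpp+1:vpp+nb), 9*nb, MPI_DOUBLE_COMPLEX, &
                        rank_plus, 1, recvbuf, 9*nb, MPI_DOUBLE_COMPLEX, &
                        rank_minus, 1, comm, status, ierr)

      do i = 1, nb                                ! STORE
        t(:, :, b_e(i)) = recvbuf(:, :, i)
      end do

      deallocate(recvbuf)
    end subroutine slidematrix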

4.3. Program

The actual program to construct the staple X_{n,\mu} (n \in G_e) then reads as follows:

do l = 1, V''
  X_mu(l) = 0
enddo
do nu = 1, d
  if nu /= mu then
    if M_nu /= 1 call SETLINK(O_mu, b_o, V'', m_nu, nu)
    do l = 1, V''
      T(l) = E_nu(l) O_mu(v_e(l,nu))
    enddo
    if M_mu /= 1 call SETLINK(O_nu, b_o, V'', m_mu, mu)
    do l = 1, V''
      X_mu(l) = X_mu(l) + T(l) O_nu^dagger(v_e(l,mu))
    enddo
    do l = 1, V''
      T(l) = O_nu^dagger(l) O_mu(l)
    enddo
    if M_mu /= 1 call SETLINK(E_nu, b_e, V'', m_mu, mu)
    do l = 1, V''
      W(l) = T(l) E_nu(v_o(l,mu))
    enddo
    do l = 1, V''
      T(v_o(l,nu)) = W(l)
    enddo
    if M_nu /= 1 call SLIDEMATRIX(T, b_e, V'', m_nu, nu)
    do l = 1, V''
      X_mu(l) = X_mu(l) + T(l)
    enddo
  endif
enddo

5. Summary and discussions

In this technique, we have used the following working area as buffers (for each \mu):

O_\mu(V'' + i), \quad 1 \le i \le V''/m_\mu.   (38)

If we consider the same lattice size as in Eq. (25), the ratio becomes

(V'' + V''/m_\mu)/V'' = 1.25,   (39)

which is much smaller than the corresponding quantity in Eq. (25) and rather close to 1. It is also clear that this technique is independent of the dimension of the system, so the routine can be applied to lattice QCD in any dimension without difficulty. Furthermore, by choosing the M_\mu freely we can divide the system along any set of directions, which makes it easy to change the number of processors. These features are prominent advantages for the portability of the code. We have applied this method both on the Fujitsu AP1000 at the Fujitsu Parallel Computing Research Facilities and on the Intel Paragon at INSAM (Institute for Numerical Simulations and Applied Mathematics) of Hiroshima University. The reduction of the working area has been nicely realized on these machines compared with a program that uses overlapping buffers with the neighbouring processors.

On the other hand, as regards the portability of the code, we have checked that this method can be applied to a system of any dimension and to any dimensional division of the original system without difficulty on these machines. In order to change the space-time dimension of the system, we only change the parameter d in the program. And in order to realize, for example, a 2-dimensional division (say, the x and y directions divided over processors), we only set M_\mu \ne 1 for \mu = 1 and \mu = 2 and M_\mu = 1 otherwise. As for the real simulation code, there are five parts that differ between the code on the AP1000 and the code on the Paragon:
- the global synchronization part,
- the global summation part,
- the part that obtains the node address,
- SETLINK,
- SLIDEMATRIX.
In the actual simulation code for the pure gauge update, the difference is less than 10 lines of FORTRAN code, and we can easily convert one code into the other. In this sense, both "portability of code" and "reduction of working area" are achieved in the present program [5].

Acknowledgement

The author would like to thank the members of both the Fujitsu Parallel Computing Research Facilities and the Scalable Systems Division of Intel Japan K.K. for their technical support in parallel computing. He also thanks Dr. Nakamura for valuable discussions in the early stage of this work.

References

[1] Y. Iwasaki, Nucl. Phys. B (Proc. Suppl.) 34 (1994) 78.
[2] K.G. Wilson, Phys. Rev. D10 (1974) 2445.
[3] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, J. Chem. Phys. 21 (1953) 1087.
[4] C.-P. Yang, Proc. of Symposia in Applied Mathematics, Vol. XV (Amer. Math. Soc., Providence, RI, 1963) 351.
[5] The implementation described in this paper is available on the Web at http://insam.sci.hiroshima-u.ac.jp/QCDMPI/.