Parallelization of a cluster algorithm

Computer Physics Communications 54 (1989) 201—209 North-Holland, Amsterdam

Anthony N. BURKITT and Dieter W. HEERMANN ¹

Fachbereich 8: Physik, Bergische Universität, Gaußstrasse 20, D-5600 Wuppertal, Fed. Rep. Germany

¹ Permanent address: Institut für Physik, Universität Mainz, Staudinger Weg 7, D-6500 Mainz, Fed. Rep. Germany

Received 7 December 1988

We present two algorithms to parallelize the identification of clusters on a lattice. Such algorithms are necessary for a variety of problems such as percolation and non-local spin update algorithms. The algorithms were tested for the Swendsen-Wang method for the simulation of the Ising model. The tests were run on a multi-transputer system using up to 128 processors. A scaling law for the performance of geometric parallel algorithms is proposed and tested.

1. Introduction

The identification of clusters pertains to many problems in computational physics [1]. The most prominent is perhaps the Monte Carlo simulation [2-4] of the percolation problem [5]. Today, such an identification of clusters is needed for many other simulation methods. The one we want to focus on here is the identification of clusters in the Swendsen-Wang [6] method to simulate the Ising model, although the algorithm itself can be used in a wider context. The system for which we formulate the algorithms is the well known Ising model with the Hamiltonian

H_Ising = -J Σ_⟨i,j⟩ s_i s_j,    (1)

where J is the exchange coupling between nearest neighbours on the lattice (denoted by the symbol ⟨i, j⟩) and the spins take the values s_i = ±1.

The Swendsen-Wang method works with a configuration of spins on the lattice sites and an assignment of bonds between parallel spins. For a detailed exposition of the algorithm the reader is directed to ref. [6]. Here we are basically interested in the identification of clusters in such a configuration. Clusters consist of either up or down spins and our aim is to enumerate them for a given occupation. Consider a configuration of a lattice with spins up and down. Bonds are now distributed on the lattice in such a way that they are always broken between spins of opposite direction. Between spins of parallel direction a bond can be present with a probability that is temperature dependent. An example is depicted in fig. 1, where the up-spins are marked black.

Fig. 1. A sample configuration of occupied and empty sites. The arrows indicate the motion of the sequential algorithm over the configuration.

A cluster is defined by the criterion that two spins belong to the same cluster if they are nearest neighbours and if there is a bond between them. Here, as in other algorithms, the task of identifying the clusters is non-local: spins on sites very far apart may belong to the same cluster. This makes the parallelization of an identification algorithm non-trivial. Below we present two algorithms for the parallel identification of clusters in a spin system distributed over a number of processors. Though the algorithms presented here have been developed using the OCCAM language [7-12] on a transputer array, they are suitable for other parallel architectures.

2. Identification of clusters

Let us start out with the basic idea to identify clusters on a scalar machine. This idea is due to Hoshen and Kopelman [13] and was further developed by Kertész [14]. Basically the algorithm is a labelling technique. The algorithm presented here is that for the Ising model, in which we are given a configuration of spins and bonds, as previously described, and we wish to identify the clusters over the whole lattice. The algorithm starts with the site in the "upper left corner" of the lattice. For simplicity we also assume that the boundaries are free, i.e. that sites on the boundary cannot be connected to the outside. Since this is the first site we give it the cluster label number 1. We also define a "stack vector" (or "cluster permutation vector") whose first element corresponds to this first cluster and is assigned the value -1 or -3 (for spin values +1 and -1 respectively). Each spin on the lattice is sequentially visited and if it is not connected with any labelled cluster it is assigned a new label, and the corresponding spin-value entry is placed in the stack vector. Any spin that is connected to only one labelled cluster is assigned that particular cluster label. Suppose now that a spin is connected to two labelled clusters.

Fig. 2. A sample configuration of occupied and unoccupied sites where two large clusters merge. The numbers give the labels assigned by the algorithm. The corresponding permutation vector for the labels is given at the bottom.

In this case a joining of the two clusters occurs, as is shown in fig. 2. The spin is assigned the lower of the two cluster labels, and the other cluster, which this spin has now connected, is joined by overwriting its (negative) spin value in the stack vector with this (positive) lower cluster label. When the next cluster is joined in this manner, the algorithm iterates through the stack vector until it reaches the "seed cluster", which is the last cluster in this linked chain (and thus the one with the lowest cluster label). After one sweep through the lattice all clusters are stored in the stack vector. Clearly, the problem of the identification of the clusters is non-local. This cluster labelling procedure enables two clusters which were originally widely separated to become part of a single larger cluster. This feature needs very careful attention when the lattice is divided into strips that are distributed over a number of processors, as we discuss in the following section.
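To make the labelling procedure concrete, here is a minimal sketch in Python (ours; the original work used OCCAM, and the function and array names below are our own) of the sequential sweep with a stack vector whose entries are either a spin code (-1 or -3) or a pointer to a lower cluster label.

    # A minimal sketch (ours, not the authors' OCCAM code) of the sequential
    # labelling of section 2.  stack[c] holds either a spin code (-1 for spin +1,
    # -3 for spin -1) or a positive pointer to a lower cluster label.
    import numpy as np

    def find_primary(stack, c):
        """Follow pointers until the primary ("seed") cluster label is reached."""
        while stack[c] > 0:
            c = stack[c]
        return c

    def label_clusters(spin, lbond, ubond):
        """spin[i, j] = +-1; lbond/ubond[i, j] = 1 if a bond connects site (i, j)
        to its left/upper neighbour.  Free boundaries are assumed.  Returns the
        label array and the stack (cluster permutation) vector."""
        n, m = spin.shape
        label = np.zeros((n, m), dtype=int)
        stack = [0]                                  # dummy entry; labels start at 1
        for i in range(n):
            for j in range(m):
                up = find_primary(stack, label[i - 1, j]) if i > 0 and ubond[i, j] else 0
                left = find_primary(stack, label[i, j - 1]) if j > 0 and lbond[i, j] else 0
                if not up and not left:              # start a new cluster
                    stack.append(-1 if spin[i, j] == 1 else -3)
                    label[i, j] = len(stack) - 1
                elif up and left and up != left:     # two clusters merge: lowest wins
                    low, high = min(up, left), max(up, left)
                    stack[high] = low                # overwrite spin code with a pointer
                    label[i, j] = low
                else:                                # connected to exactly one cluster
                    label[i, j] = up or left
        return label, stack

At the end of the sweep every entry of stack is either a spin code (a primary cluster) or a pointer chain that find_primary collapses to the seed cluster, exactly as described above.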


3. Geometric parallelization

The algorithms which we propose in this paper are based on the idea of geometric parallelization. Imagine that there are p workers and that we split the task evenly between these p workers. "Evenly" here means that each one has exactly the same task to perform as the others. In the case of a system on a lattice this means that we divide the lattice into p equally sized portions. The division is not unique: we could divide the lattice into squares or sublattices. Here we choose to divide it into strips (cf. fig. 3), so that if our lattice is of size L^2 then each processor contains a strip of L * (L/p) spins and their associated bonds. The advantage of doing so is that the communication is minimized: while a division into squares results in four surfaces (communication to other processors), the division into strips results in only two surfaces. Each worker is assigned a lattice strip to work on, much like divide and conquer. After the workers have finished their part they need to communicate with their immediate neighbours to finish the global task.

With the division of the problem into homogeneous subtasks we can hope to cut the time necessary to execute the task by the number of processors (tasks), i.e. to obtain linear speed-up. However, this neglects the communication. If we divide the problem into subtasks that are too small, a large communication overhead results. On the other hand, if the number of processors is too small we do not gain in speed. A detailed discussion of the execution time is given in section 6.

Fig. 3. Geometric decomposition of a large lattice. The parts are spread over several processors.

The procedure we adopt is to first consider each strip independently of the others. We do this by considering all the bonds along the boundaries of each strip to be broken, and we assign a "local" cluster label to each cluster within the strip. This is done in exactly the way outlined in section 2, with a "local" stack vector on each processor, and it can clearly be implemented on each processor in parallel. The next step is then to use the actual bonds along the boundaries to establish the overall (or "global") cluster structure by connecting the local clusters together across the boundaries. In order to do this it is first necessary to introduce "global" cluster numbers (the local cluster numbers range from 1 to a possible maximum of L * (L/p)). The global cluster number of a particular cluster on the rth processor is defined to be (r - 1) * (L/p) * L plus the local cluster label; this ensures an overall unambiguous ordering of the clusters.

The difficulty we now face is that clusters can be spread over many processors and it is necessary that the information on each processor is communicable to every other processor. In this respect the problem is essentially different from the usual local Monte Carlo algorithm, in which it is sufficient for information to be exchanged between immediately adjacent strips. The most naive way to solve this problem is to identify all those clusters which lie on more than one processor (i.e. by looking to see if they are connected by a bond across any boundary) and then to iterate our original algorithm over these clusters, i.e. start on the first strip and move sequentially through the entire lattice. This sequential procedure would, of course, be extremely inefficient and would leave most of the processors idle during this stage. More sophisticated algorithms use the available parallelism to increase the efficiency of this global cluster identification process. In the following sections we present two algorithms for achieving this. Their key features are that the work load, in terms of communication and computation, is roughly the same for every processor and that (after an initial "start-up") the communication of information between processors and the "processing" of this information (i.e. computation) proceed simultaneously. There are, of course, a large number of possible variants of these algorithms, and their relative speed and efficiency depend upon the lattice size, the memory on each processor, the number of processors, the efficiency of communication, etc., as we discuss in section 6.
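As a concrete illustration (our sketch, not from the paper; the function names are hypothetical), the strip decomposition and the global cluster numbering described above amount to a simple offset per strip:

    # Sketch (ours) of the strip decomposition and global cluster numbering:
    # strip r (counted here from 0) holds rows r*Lp ... (r+1)*Lp - 1, and its
    # local labels 1 ... Lp*L are offset by r*Lp*L, reproducing the rule
    # "global = (r - 1)*(L/p)*L + local" of the text (there r is counted from 1).
    def strip_rows(r, L, p):
        Lp = L // p                     # rows per strip; L is assumed divisible by p
        return range(r * Lp, (r + 1) * Lp)

    def global_label(local_label, r, L, p):
        return r * (L // p) * L + local_label

For example, local cluster 5 on the third strip (r = 2) of a 128^2 lattice split over eight processors gets the global number 2 * 16 * 128 + 5 = 4101.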

4. Algorithm 1: public stack method

We now describe the first algorithm for the cluster identification. The algorithm will be presented in a form consistent with the block structure of OCCAM. Since OCCAM forces one to use block constructs, this exposition has the advantage that it is easy to translate the algorithm into another language. The aim is to identify the clusters of spins on a lattice of size L * L distributed over p processors (or blocks), each containing a block (or strip) of Lp * L spins, where Lp = L/p (cf. fig. 3). The scheme ensures that the clusters are correctly identified globally (i.e. over the whole lattice on all the blocks) and the spins updated accordingly.
• Given a spin configuration, a bond configuration is generated: generate the interior bonds, lbond[Lp][L] (bonds connecting to the left neighbour) and ubond[Lp][L] (bonds connecting to the neighbour in the upwards direction), but keep the boundary bonds explicitly zero; simultaneously read and write the boundary spins to and from neighbouring blocks using neighbour.spins[L]; subsequently generate the actual bonds on the boundary of each block, real.left[Lp] and real.up[L], using neighbour.spins[L]. (A sketch of this bond-generation step, under stated assumptions, follows this list.)
• Identify clusters within each block using stack[LpL] (where LpL = (L/p) * L is the maximum possible length). Each cluster is given a private cluster number which is valid only within the block and takes values from 0 to LpL. Each spin value is replaced by the cluster number to which it belongs. stack[ ] values are either:
1. -1 or -3, corresponding to spins of +1 and -1 respectively, or
2. a positive number indicating another (lower) cluster number to which this cluster is connected.
• Do the periodic boundary conditions in the left/right direction using real.left[ ] (this does not require any communication with neighbouring blocks). A tidy-up of stack[ ] is done so that each value is either a spin (-1 or -3) or a primary cluster label (a primary cluster does not point to another cluster; rather it contains a spin).
• The boundary conditions in the up/down direction are now implemented:
• First the array neighbour.clusters[L] is sent in the downward direction cyclically. This array contains the public cluster numbers of all clusters on the downward boundary. The public cluster number = private cluster number + (LpL * j) on the jth block/processor. This gives a global ordering of the clusters over the whole lattice.
• The array join[L][2] is created on each block. Only the first length.join elements of the array are occupied; these contain the public cluster numbers of the clusters that are paired across the boundary, while the second component contains the spin value. (We explicitly checked that there were no repeated pairs in join[ ][ ].)
• A copy of the scalar length.join and the array join[ ][ ] is sent from block j + 1 to j (i.e. upwards).
• On each block a new array called stack.public[p * L * 2][2] is formed (of variable length), of which the first length.stack elements are an ordered list of all clusters within this (jth) block that are connected to neighbouring blocks. The second component of stack.public[ ][ ] contains the spin values (-1 or -3) of the clusters within the block. It is useful to define the arrays length.jn[p] and length.stk[p], which contain the lengths of the contributions to the arrays join[ ][ ] and stack.public[ ][ ] from each block.
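The following sketch (ours) shows the interior bond generation on one block. The paper only states that the bond probability between parallel spins is temperature dependent; the explicit formula p_b = 1 - exp(-2J/T) used below (with k_B = 1) is the standard Swendsen-Wang choice and is therefore an assumption here, as are the Python function names.

    # Sketch (ours) of the interior bond generation on one block/strip.  The bond
    # probability p_b = 1 - exp(-2*J/T) (k_B = 1) is the standard Swendsen-Wang
    # form and is an assumption; the paper only calls it "temperature dependent".
    # Boundary bonds are kept explicitly zero, as required by the first bullet.
    import numpy as np

    def interior_bonds(spin, J, T, rng):
        """spin is the Lp x L strip of +-1 values held by this block."""
        Lp, L = spin.shape
        p_bond = 1.0 - np.exp(-2.0 * J / T)
        lbond = np.zeros((Lp, L), dtype=int)         # bond to the left neighbour
        ubond = np.zeros((Lp, L), dtype=int)         # bond to the upper neighbour
        for i in range(Lp):
            for j in range(L):
                if j > 0 and spin[i, j] == spin[i, j - 1]:
                    lbond[i, j] = rng.random() < p_bond
                if i > 0 and spin[i, j] == spin[i - 1, j]:
                    ubond[i, j] = rng.random() < p_bond
        return lbond, ubond            # row i = 0 and column j = 0 stay zero

The boundary bonds real.left[ ] and real.up[ ] would then be generated separately from the exchanged neighbour.spins[ ], as the first bullet describes.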

The next two sections, "cyclic read/write" and "processing of stack.public", can be done in parallel on each single processing element (after an initial read/write to start up the process):
• The scalars length.join, length.stack and the corresponding parts of the arrays join[ ][ ], stack.public[ ][ ] are sent on a cyclic (downward) trip through all the blocks, and the array stack.public[ ][ ] is successively accumulated into a long array in each block. Likewise, length.join and length.stack are read into the arrays length.jn[p] and length.stk[p].
• These arrays are now "processed", which means the pairs in join[ ][2] are used to connect up the clusters in stack.public[ ][ ] in the usual way (i.e. if two clusters are connected by join[ ][ ], then the "lowest wins" and the higher receives a (positive) pointer that overwrites the (negative) spin).
Note: one step of this procedure on a particular block looks like:
- Read the contributions to join[ ][ ], stack.public[ ][ ], length.jn[ ] and length.stk[ ] that came originally from block k - 1.
- "Process" the contribution to stack.public[ ][ ] from the (k - 2)th block using the part of join[ ][ ] from the (k - 2)th block.
- Write to the neighbouring block j + 1 the contributions to the arrays join[ ][ ], stack.public[ ][ ], length.jn[ ] and length.stk[ ] that came originally from block k.
Note that care is needed at this step to ensure that there is no conflict between the processing of join[ ][ ] and the parallel write. The problem arises because the minimum of the two pairs of clusters in join[ ][ ] could be in the part of join[ ][ ] that is being written to the downwards neighbour. (We explicitly avoided this by shifting the part of the array where there was a possible conflict into a dummy array.) All block indices are defined modulo p, the number of blocks in the system.
By doing these steps in parallel we are able to save time. Note that it is not necessary to accumulate the complete array join[ ][ ] since, once it has been used to "process" stack.public[ ][ ], the important information is then stored in stack.public[ ][ ]. The final two sets of arrays are "processed" after the final read/write.




• Having completely processed stack.public[ ][ ], it is now possible to "update" the clusters/spins within each block contained in stack[ ], where this is necessary.
• This updated private/local version of stack[ ] is then used to update the spins within each block.
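A serial sketch (ours; the paper's implementation is the parallel OCCAM pipeline described above) of the "processing" step: stack.public is modelled as a map from public cluster numbers to either a spin code or a pointer, and each join pair is resolved with the "lowest wins" rule.

    # Serial sketch (ours) of the "lowest wins" processing of algorithm 1:
    # stack_public maps a public cluster number to its spin code (-1 or -3) or to
    # a (positive) pointer to a lower public cluster number; each join pair
    # connects two public clusters across a strip boundary.
    def primary(stack_public, c):
        """Follow pointers to the seed (primary) public cluster."""
        while stack_public[c] > 0:
            c = stack_public[c]
        return c

    def process_joins(stack_public, join_pairs):
        """stack_public: dict {public label: spin code or pointer};
        join_pairs: (label_a, label_b) pairs collected from all blocks."""
        for a, b in join_pairs:
            pa, pb = primary(stack_public, a), primary(stack_public, b)
            if pa == pb:
                continue                     # already in the same global cluster
            low, high = min(pa, pb), max(pa, pb)
            stack_public[high] = low         # the higher primary now points down
        return stack_public

After all pairs have been processed, each block looks up the primary of every boundary cluster it owns and updates its local stack[ ] accordingly, which is the "update" step of the final two bullets.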

5. Algorithm 2: neighbour joining method

This is another possible scheme to ensure that the clusters are correctly identified globally. In contrast to the "public stack" scheme it uses a long array join[L * p][3] directly (without any need for stack.public[ ][ ]). In this description we use the notation of the previous section and describe in detail only the parts that are different from the method presented there.
• To implement the boundary conditions for the up/down direction, both the arrays neighbour.clusters[L] and neighbour.spins[L] are sent cyclically in the downward direction. neighbour.clusters[ ] contains the public cluster numbers of all clusters on the downward boundary of a block, and neighbour.spins[ ] contains the spin values (-1 or -3) of the primary clusters of neighbour.clusters[ ].
• The first part of the array join[L * p][3] is created on each block. This array contains the cluster numbers of the clusters that are paired across the boundaries. The third component contains the spin value of the cluster with the lowest public cluster number, which at this stage is taken from neighbour.spins[ ] for every processor except processor 0, in which case the spin values of the appropriate primary clusters are taken from within the block (i.e. from stack[ ]). There is also the array length.jn[ ], as described previously. At this stage only the first length.join elements of the array are created, which contain the public cluster numbers of the clusters that are paired across the upper boundary. (As before, we explicitly checked that there are no repeated pairs in join[ ][ ].)



The next two steps, "cyclic read/write" and "processing of the array join[ ][ ]", can likewise be done in parallel (after an initial read/write to start up the process):


• The scalar length.join and the corresponding part of the array join[ ][ ] are sent on a cyclic (downward) trip through all the blocks, and the array join[ ][ ] is successively accumulated into a long array in each block. Likewise, length.join is read into the array length.jn[p].
• The array join[ ][ ] is now "processed", which means we take a cluster pair cluster1, cluster2 and search for either of these clusters in the array join[ ][3]. It is clear that such a search can be limited to the block on which it occurs and to the two neighbours. When two pairs of clusters are joined by a common cluster label then the "lowest wins", i.e. the primary of the cluster pair that loses receives a pointer in the spin-index of join[ ][ ]. In the case of two lowest clusters appearing, the first appearing wins.
Note: one step of this procedure on a particular block j looks like:
- Read the contributions to join[ ][ ] and length.jn[ ] that came originally from block k - 1.
- "Process" the contribution to join[ ][ ] from the (k + 1)th block using the parts of join[ ][ ] from the kth and (k + 1)th blocks.
- Write to the neighbouring block j + 1 the contributions to the array join[ ][ ] and the scalar length.jn[ ] that came originally from block k.
Note that once again care is needed at this step to ensure that there is no conflict between the processing of join[ ][ ] and the parallel write (again avoided by using a dummy array).

As in the previous method, the last two arrays are "processed" after the final read/write.
• Having completely processed join[ ][ ], the clusters in stack[ ] can now be updated, where this is necessary. This is done by looking over all the spins on the boundary whose clusters are also in join[ ][ ]. All appearances of such a cluster in join[ ][ ] are found and the new spin value is that of the most primary of the primary cluster pairs thus found.
• This updated private/local version of stack[ ] is then used to update the spins within each block.
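For the final update of algorithm 2, here is a minimal sketch (ours) of how the new spin value of a boundary cluster is found; it assumes, for simplicity, that after the processing step the third component of every pair in join[ ][ ] already holds the spin code of the winning primary rather than a pointer.

    # Sketch (ours) of the boundary-spin update of algorithm 2: every pair in the
    # accumulated join array that contains the given cluster is inspected and the
    # spin of the most primary (lowest-numbered) partner found is returned.  The
    # third component is assumed here to hold that spin code directly.
    def update_boundary_spin(join, cluster):
        """join: list of (label_a, label_b, spin_code) triples after processing."""
        best_label, best_spin = None, None
        for a, b, s in join:
            if cluster in (a, b):
                low = min(a, b)
                if best_label is None or low < best_label:
                    best_label, best_spin = low, s
        return best_spin                     # None if the cluster is purely interior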




6. Timing results

The results on the timing of the two algorithms were obtained by running the programs on the 128-transputer machine of the Gesellschaft für Mathematik und Datenverarbeitung (cf. fig. 4). This machine is electronically configurable. Corresponding to our geometric parallelization of the lattice, the machine was configured as a ring of processors.

Fig. 4. The parallel computing structure of the SUPERCLUSTER installed at the Gesellschaft für Mathematik und Datenverarbeitung. (The diagram shows computing clusters connected through network configuration units (NCUs), together with system services and a workstation interface.)

The notion of speed-up [15] which we will be using below is

s(p) = (time complexity on one processor) / (time complexity on p processors).    (2)

It is possible to argue about the validity of such a definition, but for our purposes it seems fairly appropriate. In the ideal case we expect to gain a linear speed-up if we cut the two-dimensional lattice into p strips, with each strip assigned to one of the p processors. Each processor is then responsible for (L/p) * L data elements, as outlined in the previous sections.

Fig. 5. Speed-up of the geometric algorithm 1 as a function of the number of processors p for the cluster problem. Different curves denote different lattice sizes (L = 64 up to L = 1024). The right part shows the results for a regular local-update Ising model.

Of course, we have assumed that the input/output operations of the basic algorithms on one processor do not degrade the performance of the others.

For the updating with the cluster algorithms (cf. fig. 5) we obtain an initial linear speed-up, as for a regular local-update Ising simulation. In fig. 5 we plot the results for algorithm 1, which was approximately 20% faster than algorithm 2, although both displayed the same general features. What we have to consider here is the far larger communication overhead which results from the constraint of global connectivity. Initially the speed-up is linear, as was the case for the local Ising algorithm. The saturation sets in much earlier than for the local Ising algorithm. This is due to the longer communication paths and also to the larger amounts of data which have to be passed.

It is interesting to note that the data obey a scaling law. The speed-up S is, in general, a function of the problem size L and of the number of processors p. The observation is that the speed-up depends only upon the ratio of the problem size and the number of processors. This relation cannot be linear: if it were, there would be no saturation and the observation would be trivial. The essential point is that we need to generalize this argument and to propose a scaling of the form

S(p, L) / L^x = s̃(p / L^x).    (3)

We expect that for large lattices and for a small number of processors the speed-up will be linear. In this limit the dependence upon L should cancel out of the speed-up formula. On the other hand, for a large number of processors communication will dominate, so that the relation is non-linear. The exponent x can be interpreted as giving an effective lattice size for the given algorithm. If no communication overhead occurs for all p and L then x would be one. A small x says that the effective lattice is much smaller than the original. The scaling for the two problems is shown in fig. 6. The family of speed-up curves, both for the Ising as well as the cluster algorithm, can be collapsed onto a single scaling curve. The scaling functions themselves are not universal. Each algorithm gives a different scaling curve, but the scaling itself holds for both.
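A small sketch (ours, with a hypothetical input format) of how timing data are reduced to the scaling form of eq. (3): speed-ups are computed from eq. (2) and the rescaled points for all lattice sizes should collapse onto the single curve s̃.

    # Sketch (ours, hypothetical input format) of the scaling collapse of eq. (3):
    # the speed-up s(p) = T(1)/T(p) of eq. (2) is rescaled by L**x so that the
    # points (p/L**x, s(p)/L**x) for different lattice sizes fall on one curve.
    def scaling_points(timings, x):
        """timings: {L: {p: measured run time}}; returns (p/L^x, s/L^x) pairs."""
        points = []
        for L, times in timings.items():
            t1 = times[1]                    # reference time on one processor
            for p, tp in sorted(times.items()):
                speedup = t1 / tp            # eq. (2)
                points.append((p / L**x, speedup / L**x))
        return points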

Fig. 6. The scaling property of the geometric algorithm 1 which is presented in this text. The left figure shows the result for the local Ising problem and the right figure those of the cluster algorithm. (The horizontal axis is p/L^x.)

The exponents for the local Ising and the cluster algorithm are different. The exponent for the local Ising algorithm is x_Ising ≈ 0.75, while the exponent for the cluster algorithm is x_SW ≈ 0.4.

Since this scaling holds for both problems we conjecture that such a scaling holds for all geometric algorithms. Of course, the scaling function s̃, i.e. the non-universal part, will depend on the processor speed, the detailed communication and the specific problem.

7. Discussion

In this paper we have given two examples of algorithms which parallelize the recognition of clusters on a lattice. There are, of course, possible variants of the scheme. For example, one could employ a tree structure for the communication, which would reduce the communication overhead. The point here was to demonstrate the possibility and to examine some of the features of geometrically parallel algorithms. More work needs to be done to investigate and to exploit parallel computing structures and algorithms. Also the scaling feature of parallel algorithms needs further investigation. The performance of other algorithms needs to be analysed in terms of scaling, and further analytical work is required to understand this phenomenon.

Acknowledgements

We would like to thank H. Mühlenbein and the Gesellschaft für Mathematik und Datenverarbeitung for giving us the opportunity to run our algorithms on their multi-transputer system. The partial support from the Sonderforschungsbereich 262 and the Schwerpunkt SCHI 257/2-2 is gratefully acknowledged.

References

[1] K. Binder, ed., Monte Carlo Methods (Springer, Heidelberg, 1986).
[2] D.W. Heermann, Introduction to the Computer Simulation Methods of Theoretical Physics (Springer, Heidelberg, 1986).
[3] M.H. Kalos and P.A. Whitlock, Monte Carlo Methods, Vol. 1 (Wiley, New York, 1986).
[4] K. Binder and D.W. Heermann, Monte Carlo Simulation in Statistical Physics: An Introduction (Springer, Heidelberg, 1988).
[5] D. Stauffer, Introduction to Percolation Theory (Taylor and Francis, London, 1985).
[6] R.H. Swendsen and J.-S. Wang, Phys. Rev. Lett. 58 (1987) 86.
[7] R.D. Kenway, G.S. Pawley, K. Bowler, A.D. Bruce and D.J. Wallace, Physics Today 10 (1987) 40.
[8] OCCAM Programming Language (Prentice-Hall, Englewood Cliffs, 1984).
[9] R. Steinmetz, OCCAM 2 (Hüthig, Heidelberg, 1987).
[10] G. Jones, Programming in OCCAM (Prentice-Hall, Englewood Cliffs, 1987).
[11] G.S. Pawley, K.C. Bowler, R.D. Kenway and D. Roweth, An Introduction to OCCAM 2 Programming (Chartwell-Bratt, Bromley, 1987).
[12] D.W. Heermann and A.N. Burkitt, Parallel Computations in Physics: OCCAM and Parallel Algorithms (Springer, Heidelberg, 1989, to appear).
[13] J. Hoshen and R. Kopelman, Phys. Rev. B 14 (1976) 3428.
[14] J. Kertész, private communication.
[15] A. Samal and T. Henderson, Int. J. Parallel Programming 16 (1987) 341.