Optimization techniques for parallel molecular dynamics using domain decomposition

M. Pütz, A. Kolb

Max-Planck-Institute for Polymer Research, Ackermannweg 10, D-55128 Mainz, Germany

Computer Physics Communications 113 (1998) 145-167

Received 15 May 1998

Abstract

In this paper we describe the implementation of a new parallelized Molecular Dynamics code for many-particle problems with short-ranged interactions. While the basic algorithms have their foundation in the fairly standard methods of domain decomposition, linked-cell pair search and Verlet pair list, we have developed some refined techniques for optimizing them. The rewards of these optimizations are an up to 45% overall improvement in the scalar performance and very good scaling behavior in the number of processors, even down to a few hundred particles per processor on a CRAY T3E. The best speedup can be obtained for systems with pair forces only, since then the data structures can be organized in a very simple manner. To deal with more complex situations as well, we have developed a partial replicated data scheme which is suitable to simulate many molecules consisting of many simple particles (e.g. polymer chains) for many types of short-range interactions. © 1998 Elsevier Science B.V.

PACS: 02.70.Ns; 36.20.-r; 82.20.Wt

Keywords: Molecular dynamics; Short range forces; Massively parallel computers; Link cell algorithm; Many-body interactions; Domain decomposition; Partial data replication

1. Introduction

In the last decade supercomputing has gone through a dramatic increase in the importance of parallel architectures with distributed memory and strong interprocessor communication networks. The high availability of these massively parallel computers has led scientists to develop codes which run optimally on those machines. For molecular dynamics (MD) simulations of systems with short-ranged interactions, domain decomposition (DD) [1-6] has proved to be the most scalable approach for distributing the computational work among the processors. Compared to other methods, e.g. pure force decomposition and the replicated data approach [7,8], it has a low communication overhead whose relative contribution vanishes like (P/N)^{1/d}, where N is the number of particles, P the number of processors and d the spatial dimension (see Section 3 for a detailed analysis). Thus it should become negligible for large N/P. It is the purpose of this paper to demonstrate that, with proper optimization techniques and on a computer with a fast communication network like the CRAY T3E, this can occur already for rather small N/P ≈ 200, depending on the complexity of the interactions. While there are many contributions in the literature describing techniques of DD, most of them are concerned with multiple purpose MD algorithms, which can be used to model many problems [9,10], or address the issues


of load-balancing [10-13]. In general, load-imbalance in low density systems with attractive interactions, where aggregation of some type can occur, is a big problem when one is using the domain decomposition approach. However, many models, like most complex fluids, involve rather high densities, and load-imbalance poses no problem since density fluctuations are small; one can then concentrate on other algorithmic points which become more prominent. In the light of this, we tried to develop a "light-weight" code with emphasis on very fast communication and high computational speed, while still retaining good flexibility in the choice of the detailed physical model. The problem of long-range interactions (e.g., coulombic or gravitational) requires a totally different approach (see, e.g., [14-17]) and will not be discussed here. In the following section we will present the basic algorithm used for integration and force calculation. We have gone beyond the standard linked cell algorithm (see, e.g., [18]) to improve the efficiency of finding interacting particles. In the third section we describe our methods and data structures used for parallelization, and in the fourth section we show a method of extending the pure domain decomposition approach by a partial replicated data (PRD) scheme to cope with more complex models. In the fifth section we provide a detailed performance and scaling analysis of our code by a set of benchmarks and, from this analysis, draw our conclusions in the last section. In the appendix we reveal some selected parts of the commented source code, which offer a more detailed view on our communication schemes.

2. MD algorithm

Molecular dynamics involves the numerical solution of Newton's equations of motion of a system of N interacting particles,

m_i \ddot{x}_i = \sum_{j=1,\, j \neq i}^{N} F_{ij}(x_i, x_j) ,    (1)
where F_{ij} is the force exerted by particle j on particle i. Three- or more-body forces are also common in more complicated models, but they will be treated separately in Section 4. For equations of motion like (1), so-called symplectic discretizing schemes have proved to be very stable integrators since they conserve the volume of phase space exactly and are time-reversal symmetric (up to roundoff errors). For most cases the lowest order variant of this kind, the so-called velocity-Verlet scheme or its leap-frog variant, is sufficient (see, e.g., [18]). The most expensive part in the solution of the above equations is the calculation of the forces at each time step. The double sum in (1) is of order N^2 since each particle interacts with all other particles. However, most of the interactions between particles in condensed matter physics are short ranged unless unscreened charges are involved. Thus each particle effectively only interacts with a small and more or less constant number of neighboring particles. This feature of most models reduces the true effort of the force calculation down to order N with a prefactor depending on the range of the interaction.
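To make the integration step concrete, the following is a minimal sketch of one velocity-Verlet step as it could be written in C; the array names and the ComputeForces() routine are illustrative and not taken from the code described in this paper.

    /* Minimal velocity-Verlet step (sketch). Assumes arrays x, v, f of length n,
     * particle masses m[i], a time-step dt and a user-supplied force routine. */
    void velocity_verlet_step(int n, double (*x)[3], double (*v)[3], double (*f)[3],
                              const double *m, double dt,
                              void (*ComputeForces)(int, double (*)[3], double (*)[3]))
    {
        int i, k;
        for (i = 0; i < n; i++)                     /* first half kick + drift */
            for (k = 0; k < 3; k++) {
                v[i][k] += 0.5 * dt * f[i][k] / m[i];
                x[i][k] += dt * v[i][k];
            }
        ComputeForces(n, x, f);                     /* forces at the new positions */
        for (i = 0; i < n; i++)                     /* second half kick */
            for (k = 0; k < 3; k++)
                v[i][k] += 0.5 * dt * f[i][k] / m[i];
    }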

2.1. The linked cell algorithm revisited

For systems with more than about 500 particles¹, it is a standard technique in molecular dynamics to divide the volume of the simulation box into small cubic cells whose sizes are slightly larger than the interaction range r_ia and then sort all particles into these cells, storing them in a linked list for each cell [14,18,19]. It is then sufficient to calculate the distances between particles in neighboring cells only, since particles which are further than one cell apart are by construction beyond the interaction range. This algorithm is known as the linked cell algorithm and its effort scales like N. While a thorough description for implementation is given by the

¹ This number depends, of course, on the detailed nature of the system.


above references, we want to analyze the effort of this algorithm in more detail and present an improvement of the method. The linked cell algorithm reduces the search effort to the order

N_{pair} = \frac{3^3}{2} (\rho r_{ia}^3)\, N = \frac{3^3}{2} (\rho r_{ia}^3)^2\, \frac{N}{\rho r_{ia}^3} ,    (2)

where ρ is the particle density. The last factor N/(ρ r_{ia}^3) gives the total number of cells, each of these has 3^3 − 1 = 26 neighbors on a 3d cubic lattice and contains ρ r_{ia}^3 particles on average. The prefactor of ½ accounts for the fact that two neighboring cells have to be checked only once during the search, because each particle pair needs to be checked only once due to Newton's third law. As a result the algorithm scales linearly with N, however, with a proportionality factor which can actually be quite large depending on the density of particles and the range of interactions. If one compares the interaction sphere around one particular particle to the volume that is actually scanned around it by this algorithm, one finds that only a small fraction of the particles really interact (see Fig. 1 for a 2d picture). To avoid these wasted distance calculations one should approximate the interaction sphere more closely. One way to do it would be to make the cells smaller and simultaneously increase the number of neighbor cells (next nearest, etc.) which one has to scan [20]. This method has the severe drawback of increasing the loop iterations dramatically while most of the cells will be empty. Thus, in a naive implementation the cost of the loop overhead will eventually overwhelm the benefit of the saved distance calculations². However, Everaers and Kremer [21] have shown how to exploit the fact that fewer particles occupy the cells. They devised a method they called neighbor cell assignment to achieve that only a single particle or none sits in one cell, without decreasing the cell size very much. This allowed them to simplify the data structure, which made their algorithm fully vectorizable and reduced the loop overhead significantly (the interior cell loops become unnecessary, which balances the increased number of iterations of the loop over neighbor cells). A different approach was described by Boris [22], who sorted all particles onto a cubic grid with irregular lattice spacing to perform the search loop on this lattice. Both methods have in common that they are relatively complicated, and especially the first one is difficult to tune for optimum performance. Actually, we do not know of an efficient implementation of Boris' algorithm in an MD code. More importantly, their superiority is limited to vector architectures.
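For reference, the size of this fraction can be estimated from the geometry alone (this estimate is ours, not quoted from the original): the scanned region is the 3×3×3 block of cells of edge length r_ia, while only the interaction sphere matters,

\frac{V_{\mathrm{sphere}}}{V_{\mathrm{scanned}}} = \frac{\tfrac{4\pi}{3} r_{ia}^3}{(3 r_{ia})^3} = \frac{4\pi}{81} \approx 0.16 ,

so roughly five out of six distance evaluations in the plain linked cell search are wasted.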

2.2. Reducing wasted search effort

Instead we decided to try a different method to reduce the wasted volume which keeps the loop overhead almost constant. We divide the cells further into n_mini^3 mini-cells and set up a table for each pair of neighbor cells and each pair of mini-cells therein, which contains the Boolean information whether two mini-cells in neighboring cells can possibly interact (i.e., are within r_ia of each other). For highest efficiency this table is implemented with a one bit representation for each Boolean. Then we append the additional information of the mini-cell, in which each particle sits, to the cell list which contains all particles within a cell. This is a very simple calculation and can be done during the step when all particles are stored within the linked cell list at almost no additional cost. Before we do the actual distance calculation of two particles, we first check the mini-cell table to find out if their mini-cells are within interaction range. Thus we have exchanged the costly distance calculation of two particles, involving 6 memory loads (one for each of the 3 components of the two vectors) and 9 FLOPS (3 subtractions + 3 multiplications + 2 additions + 1 comparison), for one memory load and a simple integer comparison for most of the potential interaction pairs in the scanned volume element. We found that the actual gain of this method depends strongly on platform-specific factors such as memory speed (for table access) and the details of the lookup-table implementation.

² Our own tests showed that this is already the case if one divides the cell size by a factor of two.

Fig. 1. Neighbor cells (thick lines) with mini-cells (thin lines). The circle shows the interaction radius around one particle and how it is approximated by the mini-cells (shaded).

The bit coding technique for the mini-cell table appears to be a crucial ingredient to obtain a noticeable speedup. On a computer with 64 (32) bit integers (such as the T3E) a choice of n_mini = 4 (4, 4, 2) seems optimal, since then the mini-cell interaction table can be stored in a compressed bit array of length 4^3 = 64 (4·4·2 = 32), minimizing memory loads and cache misses. Moreover, the particle pointers in the cell list can store the mini-cell index in the unused upper bits of the pointers, which further reduces costly memory loads. Our experience shows us that this method reduces the force calculation time for Lennard-Jones type interactions by up to 40% (for detailed numbers see Section 5). We will illustrate our implementation of this technique in an example. The bit-mask corresponding to the upper right cell of Fig. 1, which belongs to mini-cell number 10 of the middle cell, is 0000000010001100 (bit sequence is from upper left to lower right corner). A 1 means that a particle in this particular mini-cell can interact with a particle in mini-cell 10 of the middle cell, while a 0 means they are too far apart. For each mini-cell in the middle cell we need to store such a bit-mask for every surrounding cell, hence in our 2d representation we need a table of n_mini^2 × (3^2 − 1) = 16 × 8 = 128 masks. The logical condition whether two particles (labeled 1 and 2) can interact looks like this in C,

    ( ( bitmask[n_cell_2][n_mini_1] & (1 << n_mini_2) ) != 0 )

which requires one memory load, one shift operation and one logical AND. Note that it can actually happen that a mini-cell of the middle cell has no interactions at all with any mini-cell of a neighbor cell; e.g. the bit-mask for mini-cell 13 of the middle cell and the upper right cell is zero. This can also be exploited by skipping the loop over particles in neighbor cells which have a bit-mask equal to zero. Quite generally, the method of storing related pieces of information which are frequently accessed together in different bits of a single integer variable is a very good optimization technique on scalar computers since it can reduce the number of memory loads/stores dramatically. This is particularly true if the memory caches are too small to hold the data arrays completely. Bit coding, of course, increases the number of instructions needed to extract a specific piece of information; however, these bit manipulation instructions are usually very fast (typically 0.5-1 effective clock cycles) on today's RISC processors compared to the latency times of the main memory (which can often be more than 10 clock cycles).
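The paper does not list the code that fills the mini-cell table, so the following is only a sketch of how such a table could be precomputed for the 2d example above; the names (NMINI, bitmask, r_ia) and the geometric test are our own illustrative choices.

    #include <math.h>

    #define NMINI 4                                /* mini-cells per cell edge (2d example) */

    static unsigned int bitmask[8][NMINI*NMINI];   /* 8 neighbor cells x 16 mini-cells      */

    /* Smallest gap between two axis-aligned mini-cells of edge length w; the pair can
     * contain interacting particles only if this gap is below r_ia. */
    static int minicells_can_interact(double ax, double ay, double bx, double by,
                                      double w, double r_ia)
    {
        double dx = fabs(ax - bx) - w;
        double dy = fabs(ay - by) - w;
        if (dx < 0.0) dx = 0.0;
        if (dy < 0.0) dy = 0.0;
        return dx*dx + dy*dy < r_ia*r_ia;
    }

    /* Fill the table once at startup: for every neighbor cell c and every pair of
     * mini-cell indices (m1 in the middle cell, m2 in the neighbor cell) set one bit. */
    void setup_minicell_table(double cell, double r_ia)
    {
        static const int ncx[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
        static const int ncy[8] = {-1,-1,-1,  0, 0,  1, 1, 1};
        double w = cell / NMINI;                   /* edge length of one mini-cell */
        double ax, ay, bx, by;
        int    c, m1, m2;

        for (c = 0; c < 8; c++)
            for (m1 = 0; m1 < NMINI*NMINI; m1++) {
                bitmask[c][m1] = 0;
                ax = (m1 % NMINI + 0.5) * w;       /* center of mini-cell m1 (middle cell) */
                ay = (m1 / NMINI + 0.5) * w;
                for (m2 = 0; m2 < NMINI*NMINI; m2++) {
                    bx = ncx[c]*cell + (m2 % NMINI + 0.5) * w;
                    by = ncy[c]*cell + (m2 / NMINI + 0.5) * w;
                    if (minicells_can_interact(ax, ay, bx, by, w, r_ia))
                        bitmask[c][m1] |= 1u << m2;
                }
            }
    }

The pair check during the force loop then reduces to the single bit test shown above.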


2.3. Verlet-tables

If one does not have to care about memory usage (which nowadays is hardly ever a problem anymore) one can combine the linked cell algorithm with a so-called Verlet-table to store the indices of all pairs of particles within a distance r_ia + s (see, e.g., [10,23]). The length s, which is also called the skin, represents a safety shell around the actual interaction sphere. When one has to calculate the forces for the next time step one does not need to scan all particle pairs for possible interactions but only those which are stored in the list. One can use this list as long as

(x_i(t_0) - x_i(t))^2 < (s/2)^2   for all i    (3)

is satisfied (see [19]). This criterion corresponds to the worst case scenario that two particles move directly towards each other with equal speed. There are other strategies for updating a Verlet-table in the literature [24], which basically trade some accuracy in the force calculation for computational speed. The parameter s can also be optimized: too large an s makes the table too big and it eventually becomes worthless, and too small an s means that the number of reuses decreases and one has to rebuild the table very frequently. Fortunately, the performance optimum as a function of s turns out to be rather flat [23].
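As an illustration (not taken from the paper's code), the conservative criterion (3) can be monitored with a few lines like the following; x0 would hold the positions at the time of the last table build.

    /* Returns 1 if the Verlet-table built at positions x0 is still valid for the
     * current positions x, according to the worst-case criterion (3). */
    int verlet_table_valid(int n, double (*x)[3], double (*x0)[3], double skin)
    {
        double limit = 0.25 * skin * skin;      /* (s/2)^2 */
        double d, d2;
        int i, k;
        for (i = 0; i < n; i++) {
            d2 = 0.0;
            for (k = 0; k < 3; k++) {
                d = x[i][k] - x0[i][k];
                d2 += d * d;
            }
            if (d2 >= limit) return 0;          /* some particle left its half-skin */
        }
        return 1;
    }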

2.4. Implementation of periodic boundary conditions

In order to simplify the distance calculations while using periodic boundary conditions (PBC), we employ the usual ghost particle scheme [19,25]. That is, the particles near the boundaries of the simulation box are replicated in memory with their coordinates shifted by the size of the simulation box. We store the indices (pointers) of these replicated particles within the cell lists instead of the original ones. This method saves us the complicated distance calculations across boundaries of the simulation box, which would involve some back-folding of distances by multiples of the box size (usually done by costly integer conversions). Furthermore, the concept of ghost particles fits neatly into the domain decomposition scheme described in the next section (see also [3]).
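A minimal single-processor sketch of this replication step, with illustrative names that are not the arrays of the actual code, could look as follows; every particle closer than r_ia + s to a box face is copied once with its coordinate shifted by ±L.

    /* Append shifted copies of boundary particles as ghosts behind the n local
     * particles and return the new total count.  For brevity only faces are handled
     * here; edges and corners need additional copies shifted in two or three
     * directions.  frame should be set to r_ia + s. */
    int make_pbc_ghosts(int n, int capacity, double (*x)[3], double L, double frame)
    {
        int i, d, total = n;
        for (i = 0; i < n; i++)
            for (d = 0; d < 3; d++) {
                double shift = 0.0;
                if      (x[i][d] <  frame)     shift = +L;
                else if (x[i][d] >= L - frame) shift = -L;
                if (shift != 0.0 && total < capacity) {
                    x[total][0] = x[i][0];
                    x[total][1] = x[i][1];
                    x[total][2] = x[i][2];
                    x[total][d] += shift;
                    total++;
                }
            }
        return total;
    }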

3. Spatial decomposition for pair interactions

The method of choice to parallelize MD with short-range forces seems to be spatial decomposition of the simulation box. One simply splits up the box into smaller sub-boxes and assigns one processor to each sub-box. Each processor stores and calculates only the coordinates, velocities and forces of those particles that currently are within its sub-box. Of course, since the particles are free to move in space they will eventually cross the boundaries of these sub-boxes and therefore one has to redistribute them among the processors. When calculating the forces acting on each particle, a processor also needs to know the coordinates of particles sitting near the boundary of its sub-box within other processors' domains. This information has to be updated at every step of the integration. When the density of the system is inhomogeneous, one has to use some mechanism of adapting the sub-boxes to density fluctuations. This is a nontrivial procedure and usually has to be adapted to the specific model one is studying [10,26]. For our code we implemented a symmetric partitioning scheme, where the sub-boxes of all processors have exactly the same size and form. Different strategies for this can be found in the literature [1,3,26], which can, in general, be classified as being either one-, two- or three-dimensional (see Fig. 2). One-dimensional means cutting the box into slices along one (preferably the longest) direction and 2d corresponds to tiling the box with columns. The advantage of using a lower dimensional partitioning is that each node has a smaller number of neighboring domains to communicate with.


Fig. 2. Different methods of domain partitioning and resulting communication. Higher dimensional splitting reduces the surface and therefore the amount of transferred data, lower dimensional schemes reduce the number of communication steps (latency sensitive). Note that the situation depicted here shows a limiting case, as the amount of communication for the 1d- and 2d-partitioning is the same, i.e. low-dimensional partitioning is good for small processor numbers.

The disadvantage is that the surface to volume ratio becomes worse with growing number of processors and therefore increases the amount of data being transferred. Let us elucidate this point in more detail by assuming that our system has N particles, homogeneously distributed in a d-dimensional cubic box of size L^d. For a given number of processors P the surface volume of a domain within a D-dimensional partitioning is

S_{domain} = 2D \left(\frac{L}{P^{1/D}}\right)^{D-1} L^{d-D} + 2(d-D) \left(\frac{L}{P^{1/D}}\right)^{D} L^{d-D-1} .    (4)

We assumed an equidistant, regular partitioning in the D directions, which gives a domain length in these directions of L/P^{1/D}, while in the d − D untouched directions the domain's length is still L. The first summand (S_r) gives the surface along interprocessor communication directions, while the second part (S_l) gives rise to local memory copies (for PBC using ghosts). The total amount of time needed for a communication step is given by

T_c = 2D\, t_{lat} + \rho r_{ia} S_r t_r + \rho r_{ia} S_l t_l ,    (5)

where t_lat is the latency time for remote communications, t_r the transfer time for the data (coordinates, ...) of one particle to a remote processor and t_l the same quantity for a local memory transfer (if ghost particles are implemented). This simple ansatz assumes that the communication network of the particular computer scales perfectly, in the sense that no additional overhead occurs when all processors exchange their data simultaneously³. Two simple limiting cases emerge out of Eq. (5). On computers whose communication latency times are comparatively large, the constant latency offset can be the dominant term and low-dimensional schemes provide better speedups. However, for very large numbers of processors the D = d partitioning always wins, since the second summand scales like P^{-(D-1)/D} and therefore falls off faster for larger D. For practical applications one has to minimize T_c with respect to D as a function of all other parameters L, ρ, r_ia, P, t_lat, t_r and t_l. For our own applications on the T3E, a 3-dimensional partitioning has always been the optimal choice as soon as more than 16 processors were used (for a cubic box). To achieve optimal performance on every hardware platform, our code allows one to either specify the partitioning manually for a given number of processors or use an automatic tiling based on minimizing the surface to volume ratio of the sub-boxes.

³ This might be the case if the dimensionality of the communication network is smaller than the dimensionality of the partitioning D and the number of processors becomes very large. Then there are simply not enough communication paths for all neighboring processors to communicate at the same time and the communication has to be partly serialized.
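The automatic tiling mentioned above is not listed in the paper; the following is only a minimal sketch of one way to pick a 3d processor grid by minimizing the total sub-box surface for a cubic box (all names are illustrative).

    /* Choose a processor grid p[0]*p[1]*p[2] == P that minimizes the surface of a
     * sub-box of the cubic simulation box with edge length L. */
    void choose_grid(int P, double L, int p[3])
    {
        int    px, py, pz;
        double a, b, c, surface, best = -1.0;

        for (px = 1; px <= P; px++) {
            if (P % px) continue;
            for (py = 1; px * py <= P; py++) {
                if ((P / px) % py) continue;
                pz = P / (px * py);
                a = L / px;  b = L / py;  c = L / pz;
                surface = 2.0 * (a*b + b*c + c*a);
                if (best < 0.0 || surface < best) {
                    best = surface;
                    p[0] = px;  p[1] = py;  p[2] = pz;
                }
            }
        }
    }

A fuller version would minimize T_c of Eq. (5) directly, which also weighs the latency term and distinguishes remote from local transfers.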


Fig. 3. Storage order of particles within the coordinate array: first the local domain coordinates (nLocal entries), followed by the X-frame, Y-frame and Z-frame ghost coordinates (nTotal entries in total).

3.1. Force calculations across domains

To solve the problem of interactions across boundaries we introduce local ghost particles and frames around each sub-box. These ghost frames are equivalent to a shell around a sub-box with thickness r_ia + s (see Fig. 2). The particles inside the frames are replications of particles native to neighboring processors; they are stored in the same array that holds the local (inner) particles of the sub-box. Therefore, the array holding the coordinates has to be dimensioned large enough for local and ghost particles. Further, we have to take into account that the particle numbers fluctuate with time which, as already mentioned, can be the cause of highly unbalanced workloads among the processors. Fig. 3 depicts the sequence of storage of local and ghost particles within an array. First there are the local ones, then follow the particles of the boundaries in +X and −X direction, thereafter the +Y and −Y boundaries and lastly the +Z and −Z frames. This particular storage order follows directly from the sequence of communication steps for the boundaries (see Fig. 4). First each processor identifies all local particles that belong to the X frames of their neighbors and copies them directly into the remote coordinate arrays. In the second step every processor searches for particles in the Y boundaries, including the particles in the X frames they have just received. In this way XY edges are sent automatically to their right destinations without the need of sending them explicitly in a separate communication step. In the last step we repeat this procedure for the Z boundaries and now the corners of the ghost frames reach their destination, too.

3.2. Force back-propagation

The ghost particle scheme of the last section is rather straightforward and has been used successfully in many codes with slight modifications [3,23,25]. However, pair forces between particles on two sides of a cell boundary have to be calculated twice, or one has to devise a method of calculating them only once and sending them back to the ghost particles' original domain processor or vice versa. While the first method is very commonly used [1,5,6,26], since it seems simpler to implement, Brown et al. have presented a nice, optimal way for implementing the latter [2,3]. We implemented this method, which is essentially summarized in Fig. 4, while our detailed implementation is revealed in the appendix. We present the source code to demonstrate that force back-propagation is as straightforward to implement as the purely coordinate based communication schemes.


Fig. 4. Unidirectional communication of boundary particles' coordinates with extended force calculation between boundary corners and a unidirectional force communication in the backward direction. Force calculations between cells are indicated by the black lines connecting two cells (inner cell calculations are omitted in the graph). The numbers at the arrows indicate the sequence of communication and the letters x and f show the type of data transmitted (coordinates or forces, respectively).

For a detailed discussion of the ideas and concepts used in the implementation the reader is referred to the work of Brown et al. [3], and we will limit our discussion to a brief comment. If one compares the "full-shell" coordinate based communication to the force back-propagation method, the number of ghost cells in the shell needed for a complete scan of all particle pairs is more than cut in half. In 3 dimensions the surfaces are reduced from 6 to 3, the edges from 12 to 3 and the corner cells from 8 to 1. While edges and corners are one- and zero-dimensional objects, respectively, and therefore are negligible when a domain contains many neighbor cells, in practical applications with 100 to 10000 particles per processor they do matter. One usually has around 2^3-10^3 cells per domain, so edges and corners make up 85% to 11% of the ghost shell, and thus substantially contribute to unnecessary communication. Because of this, the force back-propagation method is optimal and one gets better scaling behavior of the code for large numbers of processors and fixed system size. Only for very big systems does the performance difference of the two schemes become negligible (or when the computational workload is so large that communication overhead hardly matters).

3.3. Handling different particle species

Since in many situations one wants to simulate more than one type of particle simultaneously, we must have a way to distinguish different species. Similarly, one often wants to simulate molecules composed of several atoms, e.g. polymer chains. On a single CPU computer one would simply deal with the excluded volume and bonding interactions separately. For the latter one can usually exploit the fact that the coordinates of the atoms of a single molecule are stored in a well-defined fashion in memory (e.g., linearly for polymer chains). So one can handle the bonding interactions within a molecule by some way of hard-coding the connectivity. Using DD, such ordering of particles in memory is impossible since usually a molecule will reside on several processors simultaneously.


Brown et al. [4] have given each particle a unique label which is carried along, much like a "passport", during communication steps. From this label it is possible to deduce the particle's type and its bonds with other particles. Originally, it was used to indicate a particle's position within a polymer chain and contained only a chain index and a monomer index. The method, of course, is very flexible and one can store more information within a passport structure. Since the structure should be kept small, both for reduced communication effort and for improving memory cache usage, we implemented the passport as a bit-field. For our own applications a single 64-bit word had enough bits to simulate large polymer networks, poly-disperse polymer melts or micro-emulsions with several, differently interacting species without any global connectivity tables.
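As an illustration of such a packed passport (the field names and bit widths below are our own example, not the layout used by the authors), the information can be packed into one 64-bit word with explicit shifts and masks:

    #include <stdint.h>

    /* Example passport packed into one 64-bit word.  The field widths are
     * illustrative only and must be adapted to the model at hand. */
    typedef uint64_t TPassport;

    #define PP_SPECIES_BITS   4   /* particle type                 */
    #define PP_CHAIN_BITS    24   /* molecule index                */
    #define PP_MONOMER_BITS  16   /* position along the chain      */

    static inline TPassport passport_pack(uint64_t species, uint64_t chain,
                                          uint64_t monomer)
    {
        return  species
              | (chain   << PP_SPECIES_BITS)
              | (monomer << (PP_SPECIES_BITS + PP_CHAIN_BITS));
    }

    static inline uint64_t passport_chain(TPassport p)
    {
        return (p >> PP_SPECIES_BITS) & ((UINT64_C(1) << PP_CHAIN_BITS) - 1);
    }

Extracting a field costs only a shift and a mask, which is exactly the kind of cheap bit manipulation advocated in Section 2.2.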

3.4. Low-level communication

For the implementation of interprocessor communication we employed a thin hardware abstraction layer which consists of only 7 subroutines needed by the current version of the program:

void parallel_put (void *dest, void *src, int count, int size, int destcpu)
Copies count memory blocks with given size from local memory at position src to remote memory on CPU destcpu at position dest.

void parallel_get (void *dest, void *src, int count, int size, int srccpu)
Copies count memory blocks with given size from remote memory of CPU srccpu at position src to local memory at position dest.

void parallel_double_sum (double sum[], double add[], int count)
Builds the sum of count double values add[] individually over all processors and stores the results in the array sum[]. (Note, this is not a summation over the indices of the arrays, but only over the particular values of an array element on the various processors.)

void parallel_int_sum (int sum[], int add[], int count)
Builds the sum of count integer values add[] individually over all processors and stores the results in the array sum[].

void barrier(void)
Stops execution until all processors have reached this point.

int get_my_cpu_id(void)
Returns this processor's unique identification number.

int get_number_of_cpus(void)
Returns the total number of processors of the current process.

The parallel_put() operation assumes asynchronous, single-sided communication as supported by the CRAY SHMEM library or the MPI 2.0 specification. Currently, we implemented low level realizations for single processor use on any hardware (simple local memory copy), the CRAY T3E (using the SHMEM library) and machines supporting the MPI 2.0 standard.
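For illustration, a minimal SHMEM-based realization of three of these wrappers could look as follows; this is our sketch, not the authors' implementation, and it assumes symmetric (static or shmalloc'ed) buffers as required by SHMEM.

    #include <mpp/shmem.h>      /* Cray SHMEM; on OpenSHMEM systems use <shmem.h> */

    void parallel_put(void *dest, void *src, int count, int size, int destcpu)
    {
        /* contiguous one-sided copy of count*size bytes into destcpu's memory */
        shmem_putmem(dest, src, (size_t)count * size, destcpu);
    }

    void barrier(void)
    {
        shmem_barrier_all();
    }

    int get_my_cpu_id(void)
    {
        return shmem_my_pe();
    }

A single-processor fallback reduces parallel_put() to a plain memcpy of count*size bytes.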

Fig. 5. Illustration of the concept of two different data-sets: the intramolecular forces are calculated in the upper set, then the coordinates are mapped onto the lower (domain decomposition) set and the forces are added to the excluded volume interactions. After a time-step, the new coordinates are mapped back in the reverse direction.

4. Dealing with intramolecular many-particle interactions

4.1. The problem of many-body interactions

Many-body potentials involving the coordinates of three, four, or even more particles are significantly more difficult to deal with efficiently. We will not discuss the general case, but rather focus on the special case where these terms only involve atoms located on the same molecule. While this simplifies the problem dramatically on a single CPU computer, since one does not need to search for many-body terms in the same way as for nonbonded pair interactions, this basic problem reoccurs in DD since the ordering of particle coordinates in memory is lost. One huge class of models with many-body interactions are atomistic simulations where the chemical structure of the molecules is modeled in detail. Bending potentials (three-body interactions) and torsional potentials of bonds (four point potentials) are commonly used [18] to incorporate the natural stiffness of such molecules on small length scales. Some biologically important molecules like DNA or actin retain this rigidity up to very large length scales. For such systems it is necessary to include this feature into simplified coarse-grained models, too. While it has been shown that this problem can be dealt with by sophisticated methods, based on a recursive search of pair, triple and quadruple tables [4,27], the efficiency of such a method compared to single CPU runs is strongly reduced for large numbers of processors. Encouraged by the sheer communicational power of today's massively parallel computers, we have implemented a partial replicated data scheme to divide the force calculation into intramolecular (complicated bending and torsional potentials) and intermolecular (excluded volume) forces. The latter contain only pair interactions and are efficiently handled via the DD scheme described in the previous sections. The former are computed using a second set of coordinates (hereafter called "molecule" coordinates, see Fig. 5) which are stored linearly in memory, similarly as one would do it in a single processor code [8]. So the connectivity can be hard-coded in the data structures and in the loops acting on them. The following pseudo-code summarizes this scheme:

CalculateIntraChainForces();    /* using linear chain array */
CalcExcludedVolumeForces();     /* using local + ghost particles */
AddIntraToExcludedVolume();     /* combine the two forces */

After updating the coordinates with the velocity-Verlet algorithm, the second set of coordinates has to be updated as well with the new positions of the particles. To keep the two sets coherent, we supplement the passport of each particle by the domain index it belongs


to (i.e., the processor which handles the particle's excluded volume interactions), its local particle index on that processor, as well as its molecule processor (i.e., the processor calculating the intramolecular forces of the molecule the particle belongs to) and its position index in that second coordinate set. In the pseudo-code given below, we denote the processor number of the second coordinate set of the i-th monomer by remPe[i]. The communication is buffered to reduce latency times.

for (i = 0; i < nChainCoordinates; i++) {           /* build the buffers */
    sendBuffer[remPe[i]] = force[i];
    pkgSize[remPe[i]]++;
}
for (j = 0; j < number_of_cpus(); j++)              /* send the buffers */
    parallel_put(&receiveBuffer[j], &sendBuffer[j], pkgSize[j], sizeof(force[0]), j);
barrier();
for (j = 0; j < number_of_cpus(); j++)              /* unpack the packages */
    AddBufferToLocalForces(j, receiveBuffer[j], SizeOfPackage);

The communication procedure for updating the molecule coordinates works equivalently but in the opposite direction. The main advantage of this approach is that almost every kind of complicated intramolecular force can be implemented in a rather straightforward way, while we can keep the interaction range r_ia for the pair search small. The simplicity of implementation has to be paid for by a substantial increase in communication. It is clear that this additional communication, which is essentially global between all processors if one does not take other precautions⁴, prohibits scaling for very large numbers of processors (see Section 5), but it works well for numbers up to 128 on a T3E (and possibly other machines with strong communication networks).

⁴ If one can achieve a large correlation between the two coordinate sets on one processor, most of the coordinate transfer can be done locally, which should reduce the overhead significantly depending on the size of one molecule, i.e. over how many domains its atoms are distributed. We are currently working on an improved redistribution process which tries to minimize the communication between different processors.
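The reverse mapping, i.e. scattering the freshly integrated coordinates from the domain set back into the chain arrays, is not listed in the paper; the following sketch simply mirrors the buffered force loop above. The names molPe, molIndex, the buffer record and UnpackBufferToChainCoordinates are our own illustrative assumptions.

    /* Hypothetical buffer record: slot in the remote chain array plus the coordinate. */
    typedef struct { int molIndex; double x[3]; } TCoordPkg;

    for (i = 0; i < nLocal; i++) {                  /* build the buffers */
        int pe = molPe[i];                          /* molecule processor of particle i */
        coordSendBuffer[pe][coordPkgSize[pe]].molIndex = molIndex[i];
        CopyVector(coordSendBuffer[pe][coordPkgSize[pe]].x, xp[i], 3);
        coordPkgSize[pe]++;
    }
    for (j = 0; j < number_of_cpus(); j++)          /* send the buffers */
        parallel_put(&coordReceiveBuffer[j], &coordSendBuffer[j],
                     coordPkgSize[j], sizeof(TCoordPkg), j);
    barrier();
    for (j = 0; j < number_of_cpus(); j++)          /* write into the chain arrays */
        UnpackBufferToChainCoordinates(j, coordReceiveBuffer[j]);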

5. Benchmarks: a comparative analysis

In order to test the quality of the presented implementation in terms of speed and scalability, we have performed an extensive set of benchmarks. Where possible, we have tried to stick to the conventions used by other authors to be able to compare our results directly to theirs.

5.1. Simple Lennard-Jones fluid

The first benchmark we have run was a Lennard-Jones (LJ) fluid with purely repulsive interaction (also known as the Weeks-Chandler-Andersen potential), and the second included also the attractive part of this potential,

V_{LJ}(r) = 4\epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] ,  for r < r_c, with r_c = 2.5\sigma or 2^{1/6}\sigma .    (6)

The temperature and density were set to the triple point of the attractive Lennard-Jones fluid (T = 0.72, ρ = 0.8442, ε = 1 in LJ units). The time-step for the velocity-Verlet integration was Δt = 0.004662 in all cases. All benchmarks have been run keeping the volume L^3 and the number of particles N constant and the total energy conserved (NVE ensemble).
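For concreteness, a truncated Lennard-Jones pair force of the form (6) can be evaluated as in the following sketch (our own illustrative routine, with ε = σ = 1 as in the benchmarks); the purely repulsive WCA case simply uses r_c = 2^{1/6}σ.

    /* Accumulate the force of the truncated LJ pair (6) on particles i and j.
     * dx is the distance vector x_i - x_j, rc2 the squared cutoff. */
    static void lj_pair_force(const double dx[3], double rc2,
                              double fi[3], double fj[3])
    {
        double r2 = dx[0]*dx[0] + dx[1]*dx[1] + dx[2]*dx[2];
        int k;
        if (r2 < rc2) {
            double inv2 = 1.0 / r2;
            double inv6 = inv2 * inv2 * inv2;                    /* (sigma/r)^6        */
            double fr   = 24.0 * inv2 * inv6 * (2.0 * inv6 - 1.0);  /* |F| / r          */
            for (k = 0; k < 3; k++) {
                fi[k] += fr * dx[k];                             /* Newton's third law */
                fj[k] -= fr * dx[k];
            }
        }
    }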


Table 1
Benchmark results for LJ fluid

Reference   N         Machine   P     N/P       t_step [ms]    t_1,pair [μs]
this work   1000      T3E       1     1000      - (4.900)      - (0.98)
this work   1000      T3E       16    63        - (0.750)      - (2.40)
this work   250000    T3E       16    15625     377 (85)       0.437 (1.09)
this work   250000    T3E       128   1953      49.3 (11.3)    0.457 (1.12)
this work   250000    T3E       512   488       13.2 (3.4)     0.489 (1.39)
this work   250000    T3E-900   128   1953      36.2 (8.7)     0.335 (0.90)
[21]        16384     Y/MP      1     16384     22.7 (3.5)     0.251 (0.42)
[5]         10^8      PARAGON   3960  25253     3500           2.340
[6]         296352    T3E       8     37044     1956           0.588
[6]         296352    T3E       512   579       63             1.200
[6]         1.2·10^9  T3E       512   2.3·10^6  1.8·10^5       1.61

N represents the number of particles, P the number of processors, t_step the time needed for one integration step and t_1,pair is (half) the time needed to calculate one interaction (see text). The cutoff was 2.5σ; numbers in parentheses are for a cutoff of 2^{1/6}σ.
Fig. 6. Benchmark: Lennard-Jones fluid (T3E). Dependence of relative speedup on the number of processors P for fixed system size N. t_P is the total time for one time-step on P processors. Scaling is almost perfectly linear for the large cutoff LJ benchmark and almost linear up to P = 128 for the small cutoff benchmark. The code still scales for very low particle numbers per processor, as can be read off the graph for the small N = 1000 system.

In Fig. 6 we show the results of our scalability tests where we kept the overall system size N constant and increased the number of processors P. The first sequence of benchmarks has been done with an interaction cutoff r_cut = 2.5σ and the second for r_cut = 2^{1/6}σ (numbers in parentheses in Table 1). N = 250000 particles represents a typical system size for runs in our own projects. To test the limits of the code we also benchmarked a small (N = 1000) system with r_cut = 2^{1/6}σ and up to 16 processors. In this case there are only about 63 particles per processor with a small number (4π/3 ρ r_cut^3 − 1 ≈ 4) of interactions per particle. For the "production" systems the scaling is almost perfectly linear for the large LJ cutoff down to about N/P = 500 particles per processor and down to N/P = 1000 for the low cutoff systems. The code still scales, i.e. still becomes faster with increasing

Fig. 7. Benchmark: Lennard-Jones fluid (T3E). Efficiency of a single CPU depending on the number of processors P for fixed system size N. Efficiency is measured by comparing the execution times of a multi-processor run multiplied by P to the time of a single processor run. The difference from the ideal constant 1 represents the communication losses. For typical "production" systems the efficiency never drops below 80%.

number of processors, down to N/P = 63. Fig. 8 shows the results of a benchmark run at constant work per processor N/P = 1000, which nicely tests the scalability of the hardware platform. The particular value of N/P = 1000 was chosen because the performance of the code starts to degrade at that particular N/P for the short cutoff LJ fluid, so that communication times become noticeable. One can see a logarithmic increase of the execution time with the number of processors, which is mainly due to communication bandwidth limitations of the T3E. Load-imbalance effects start to play a significant role only at smaller numbers of N/P ≈ 200 and give a sub-logarithmic growth of CPU time with P for these dense systems. Since the speedup is close to linear, it is more illuminating to look at the efficiency of CPU usage, which has been plotted in Fig. 7. The benchmark results for the smallest and largest processor numbers are presented in Table 1, where we also give some numbers obtained from different implementations. To allow for a comparison of results, it has been proposed to calculate the quantity t_{1,pair} = t_step P (4π/3 r_cut^3 ρ N)^{-1}, which measures the time needed to calculate one pair interaction [28]⁵. For ideal scaling this number should be mostly independent of P, r_cut, N and ρ; however, as will be shown this is not quite the case. While the data from [5] and [21] are not directly comparable since those benchmarks were run on different hardware, the last one [6] shows that our performance is about 30% better than their best result. Moreover, if we look at the difference in times for our own results with different cutoffs, we see that t_{1,pair} is by no means independent of r_cut, but drops significantly with r_cut. There are essentially two reasons for this observation. The first is that the loop overhead in the linked cell search loop is independent of r_cut, but the work done in the search loop grows like r_cut^3; and second, the Verlet-table is much more efficient for larger r_cut since the relative number of particles in the interaction sphere compared to the number in the skin scales like r_cut. While r_cut, ρ and T in [5,21] have been explicitly the same as in our benchmarks, those of [6] apparently were not (unfortunately they did not state explicit parameters). If we take their numbers, assuming the same

⁵ Note that the correct time for one pair is a factor of 2 larger, since the formula given does not take into account Newton's third law. Still we use the above formula to be consistent with numbers given in the literature.
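As a check of the formula (and of the tabulated values), a worked example using our own arithmetic for the N = 250000, P = 16, r_cut = 2.5σ, ρ = 0.8442 run with t_step = 377 ms reads

t_{1,\mathrm{pair}} = t_{\mathrm{step}}\, P \left(\tfrac{4\pi}{3} r_{\mathrm{cut}}^{3}\, \rho\, N\right)^{-1} = \frac{0.377\,\mathrm{s} \times 16}{\tfrac{4\pi}{3} \times 15.625 \times 0.8442 \times 250000} \approx 0.44\,\mu\mathrm{s} ,

in agreement with the corresponding entry of Table 1.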


Fig. 8. Benchmark: Lennard-Jones fluid (T3E). Dependence of the execution time as a function of the number of processors P at constant work N/P = 1000. The execution time increases logarithmically with P, which shows the limitations of the communication bandwidth; the increase is, however, very moderate on the T3E.

Table 2
LJ benchmark results for different algorithms; all numbers represent t_1,pair measured in μs for a system with N = 10000 and r_cut = 2.5σ on a single CPU

Algorithm          Pentium/MMX-166   SGI Indy R5000/150   CRAY T3E
LC                 1.78              1.95                 0.757
LC + SC            1.24              1.51                 0.547
LC + Verlet        0.863             0.995                0.433
LC + SC + Verlet   0.785             0.948                0.405

density ρ, we find that their cutoff r_cut must be about 2.8 by plugging the numbers into the formula for t_1,pair. To reflect on this we have also done a small benchmark on 1 processor with r_cut = 2.8σ, which gave us t_1,pair = 0.383 μs, about 50% faster than their best value. Table 2 demonstrates the various algorithmic improvements and their effects on three different scalar hardware platforms. As basis algorithm a linked cell (LC) algorithm has been used. This was modified by us with the presented sub-cell (SC) scheme to eliminate unnecessary distance calculations. Both approaches can be combined with a Verlet-table to reduce the frequency of search loops. One clearly sees a 25-45% improvement of LC+SC over pure LC, if applied without a Verlet-table. When combined with a Verlet-table, a slight improvement of 5-10% still remains. Note that the Verlet-table radius has been optimized in all benchmarks and may vary from platform to platform and between LC+Verlet and LC+SC+Verlet. SC will give only a small improvement, of the order of 5-10%, if r_cut is only 2^{1/6}σ, with or without a Verlet-table, since the absolute number of distance calculations during the neighbor search is then rather small. Our update policy for the Verlet-table was to rebuild the table if

\max_{i=1,\ldots,N} \left| x_i(t_0) - x_i(t) \right| > f\, s ,    (7)


where f = 0.6 n/(n + 1) and n is the number of reuses of the last table (see [24] on this). This policy gave an average of about 13 reuses of the Verlet-table independent of r_cut. When one uses the hard update criterion Eq. (3), the benefits of SC are larger due to an increased amount of time spent rebuilding the Verlet-table. In summary, we can say that the code works very well for systems with 500 particles per processor and still scales well for systems with as few as N/P = 200. This is important for simulations which need to run only a small number (10^3-10^5) of particles but for very long times (as is often the case in soft-matter science). Scaling with millions of particles is not an issue since the communication to work ratio vanishes like (P/N)^{1/3}. The scaling with constant work per processor, however, clearly demonstrates the strong capabilities of the communication network on the CRAY T3E.

5.2. Melt of linear flexible polymer chains

While the Lennard-Jones fluid is a standard benchmark which gives an indication of the relative performance of different algorithms, we also wanted to benchmark the code in a more complex "production" situation. We simulated 1000 linear flexible chains consisting of 100 monomers using a standard bead-spring model: repulsive LJ for excluded volume interaction between all monomers, and a finitely extendible nonlinear elastic (FENE) potential for the bonding interactions,

V_{FENE}(r) = -\frac{k R_0^2}{2} \ln\left(1 - \frac{r^2}{R_0^2}\right) ,  for r < R_0 .    (8)

We also coupled the system slightly to a heat-bath with a stochastic force and a frictional term, which is commonly done to simulate an NVT ensemble and allows a larger time-step,

m_i \ddot{x}_i = -\Gamma \dot{x}_i - \nabla V(x_i) + W_i(t) ,    (9)

\langle W_{i,j}(t)\, W_{k,l}(t') \rangle = 6 k T \Gamma\, \delta_{i,k}\, \delta_{j,l}\, \delta(t - t') ,    (10)

where W is the stochastic force with vanishing first moment and the second moment is coupled to Γ via the fluctuation dissipation theorem. The simulation parameters we used were T = 1.0, ρ = 0.85, Γ = 0.5 and Δt = 0.013, which have been used in many publications (e.g., [29,30]). Since the model uses pair forces only, we do not actually need the hybrid extension of the code; however, to test this extension we have also run this benchmark calculating the bond interactions with the PRD method. Fig. 9 shows the speedups measured for both the PRD method and the pure DD algorithm. The simulation runs were done on the T3E with the memory stream-buffers⁶ switched off, in contrast to the results of the previous section, where they were on by default. Moreover, using Brownian dynamics the Verlet-tables are invalidated more frequently and also the interactions of the model are more complicated than pure LJ. These facts account for an overall decrease in performance of about 40% relative to the pure short-ranged Lennard-Jones benchmarks in the NVE ensemble. The NVT benchmarks also show slightly worse speedups since particles must be exchanged between domains and the ghost tables must be built up more frequently, thus the communication overhead is higher. When comparing the absolute execution times of the PRD and DD algorithms (see Table 3), one should take into account that this benchmark tests a case for which the hybrid algorithm was not made, since it involves pair forces only. The introduction of three- or four-body forces slows the code down only marginally, by about 5%, since intramolecular interactions only play a minor role compared to intermolecular ones.

⁶ The CRAY T3E does not have a large tertiary cache, but has very deep data buffers to speed up consecutive memory accesses. Unfortunately, the use of stream-buffers on older T3E systems was considered too unstable and therefore had been switched off permanently on the big machines at the time these benchmarks were run. So we could not run all benchmarks with the same setting. The overall performance gain with stream buffers was between 12% and 20% for our code.


Fig. 9. Benchmark: Bead-spring polymer model (T3E). Dependence of relative speedup on the number of processors P at constant system size N = 100000 for two different parallel algorithms: partial replicated data (PRD) and pure domain decomposition (DD). Speedup and absolute times are better for pure DD, but PRD is a more versatile method at relatively low performance cost.

Table 3
Benchmark results for both the DD and PRD algorithms applied to a polymer melt of 1000 flexible chains of length 100 on the T3E; we compare the absolute execution times per time-step

P     N/P     t_step,DD [ms]   t_step,PRD [ms]
8     12500   118              145
16    6250    60               76
32    3125    30               40.6
64    1562    16               23.4
128   781     10               16.4

Therefore, a performance loss of about 20% to 40% in comparison to the code fully optimized for pair forces seems a very affordable price for the gained versatility. The speedup of PRD is poorer than that of pure DD, but it is still a rather efficient and very convenient approach.

6. Conclusion and outlook

We have presented a method of improving the standard linked cell approach to finding short-range interacting pairs by an almost cost-neutral refinement of the search grid, which improves the performance of this method by up to 45%. While this gain is reduced by a combination with a Verlet-table, an overall improvement of about 10% still remains. The efficiency of a Verlet-table and, hence, the efficiency of our refined method strongly depends on model parameters and the time-step of the simulation. Due to the relatively simple nature of our improvement we generally recommend its use on scalar CPUs, especially when a Verlet-table requires too much memory to be feasible or is otherwise inefficient because it becomes invalidated too frequently. Our implementation of domain decomposition minimizes the amount of data exchange and completely avoids duplicate calculations of interactions. This results in extremely good scaling behavior of the code down to


very small system sizes, which is important to reach long simulation times by increasing the number of processors efficiently. The overall performance in standard benchmarks is very good in comparison to other implementations. Moreover, the partial replicated data extension of the domain decomposition scheme makes the code a very versatile tool to study models with more complicated interactions along the molecules' backbone at an affordable performance trade-off. The code is currently being reworked and properly documented, and will soon be made publicly available to other scientists. Its modular implementation should make it rather straightforward to implement various modifications, as has been done already for some models currently studied in our research group.

Acknowledgements

Most of the development of the code has been done on the T3D and T3E systems at the Rechenzentrum (RZG) of the Max-Planck-Gesellschaft in Garching, the T3E of the Max-Planck-Institute for Polymer Research in Mainz and at the Hochleistungsrechenzentrum (HLRZ) in Jülich. We thank both the RZG and the HLRZ for a generous allocation of CPU time. We also wish to thank K. Kremer, B. Dünweg, F. Müller-Plathe and T.B. Liverpool for discussions and conjectures about many of the ideas and methods presented in this paper, as well as B. Dünweg, F. Müller-Plathe again and M. Deserno for a careful reading of the manuscript.

Appendix A. Selected parts of the source code

A.1. Finding the interacting neighbor cells

The subroutine SetupAccessTable() finds all nearest neighbor cell pairs, counting each pair only once. For each cell we create a table of interacting neighbor cells. Interactions between ghost cell pairs in the corners of a domain are also taken into account to avoid double computation of pair forces between inner and ghost particles. This subroutine is usually called once at the startup of the program after the decomposition of the domains has been done, or whenever the decomposition is changed (adaptive grid).

/** neighbor cell descriptor.
 *  contains linear index offset relative to middle cell
 *  and a pointer to the mini-cell mask table.
 */
typedef struct {
    int           offset;
    TSegmentMask **mask;
} TNeighCell;

/** checks if vector (x,y,z) is within (0..b[0]-1, 0..b[1]-1, 0..b[2]-1) */
static int InsideGrid (int x, int y, int z, int b[3]);

/** inserts a neighbor list of a particular cell into the global
 *  neighbor cell table. Identical neighbor lists are stored only
 *  once, thus many cells share one neighbor list.
 */
static void InsertNCList (int cellindex, int listlen, TNeighCell *list);

/** Finds all nearest neighbor cell pairs counting each pair only once.
 *  For each cell we create a table of interacting neighbor cells.
 *  Ghost cells are also taken into account to avoid double computation
 *  of pair forces between inner and ghost particles.
 */
static void SetupAccessTable (void)
{
    int        ix, iy, iz, nx, ny, nz, k, cnt, interactions, ind[3], nc[3];
    TNeighCell localList[26];
    pubyte     visited = calloc(ngCellVol, sizeof(ubyte));

    interactions   = 0;
    scanCount      = 0;
    ofsLast        = 0;
    ofsBeginPtr[0] = compOfsList;

    /* first we walk through all inner cells */
    for (iz = 1; iz < ngCell[2]; iz++)
        for (iy = 1; iy < ngCell[1]; iy++)
            for (ix = 1; ix < ngCell[0]; ix++) {
                int i = ix + ngCell[0]*(iy + ngCell[1]*iz);
                cnt        = 0;
                visited[i] = 1;
                for (nz = iz-1; nz <= iz+1; nz++)
                    for (ny = iy-1; ny <= iy+1; ny++)
                        for (nx = ix-1; nx <= ix+1; nx++) {
                            int n = nx + ngCell[0]*(ny + ngCell[1]*nz);
                            if (n != i && !visited[n] && InsideGrid(nx,ny,nz,ngCell)) {
                                int mx = nx-ix+1, my = ny-iy+1, mz = nz-iz+1;
                                localList[cnt].offset = n - i;
                                SetMask(&localList[cnt++].mask, &neighMask[mz][my][mx]);
                            }
                        }
                if (cnt) {
                    scanCell[scanCount] = i;
                    InsertNCList(scanCount++, cnt, localList);
                    interactions += cnt;
                }
            }
    innerCells = scanCount;

    /* now build the scanlist for the boundary (only edges are important!) */
    for (k = 0; k < 3; k++) {                /* loop over all edge directions            */
        ind[(k+1)%3] = 1;                    /* cyclic permutation of indices:           */
        ind[(k+2)%3] = 0;                    /* first = 1..ngCell, second = 1, third = 0 */
        for (ind[k] = 1; ind[k] < ngCell[k]; ind[k]++) {
            int i = ind[0] + ngCell[0]*(ind[1] + ngCell[1]*ind[2]);
            cnt   = 0;
            nc[(k+1)%3] = ind[(k+2)%3];
            nc[(k+2)%3] = ind[(k+1)%3];
            for (nc[k] = ind[k]-1; nc[k] <= ind[k]+1; nc[k]++) {
                int n = nc[0] + ngCell[0]*(nc[1] + ngCell[1]*nc[2]);
                if (InsideGrid(nc[0], nc[1], nc[2], ngCell)) {
                    int mx = nc[0]-ind[0]+1, my = nc[1]-ind[1]+1, mz = nc[2]-ind[2]+1;
                    localList[cnt].offset = n - i;
                    SetMask(&localList[cnt++].mask, &neighMask[mz][my][mx]);
                }
            }
            if (cnt) {
                scanCell[scanCount] = i;
                InsertNCList(scanCount++, cnt, localList);
                interactions += cnt;
            }
        }
    }
    if (interactions != 13*nCellVol) Error();   /* consistency check */
    free(visited);
}

A.2. Updating the ghost particles and force-backpropagation Here we present all subroutines involved in the ghost particle communication and force-backpropagation. The subroutine NewChosts() is called every time a particle redistribution has been performed (usually done immediately before the Verlet-table has to be rebuild). UpdateGhosts() is called immediately before a recalculation of the forces (only if NowChosts () was not used) and B a c k P r o p a g a t e F o r c e s ( ) immediately afterwards. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

typedef double TDoubleVec[3];               /* 3 component vector                       */
typedef struct { ... } TLabel;              /* particle label, contents vary with model */

/* externally declared inlined functions, meaning should be clear */
void CopyVector   (double *dest, double *src, int components);
void CopyLabel    (TLabel *dest, TLabel *src);
void ClearLabel   (TLabel *label);          /* empty label contents */
bool LabelNotEmpty(TLabel *label);          /* tests label contents */

/* externally defined program global variables */
int        leftPe[3];     /* CPU number to the left of local CPU in 3 directions  */
int        rightPe[3];    /* CPU number to the right of local CPU in 3 directions */
TLabel     label[];       /* particle labels                                       */
TDoubleVec xp[];          /* particle periodically folded coordinates              */
TDoubleVec force[];       /* particle forces                                       */
int        nLocal;        /* number of local particles for this CPU                */
int        nTotal;        /* number of local+ghost particles on this CPU           */

/* module's local working buffers */
int        src[3][];      /* index table for ghost particles                           */
TLabel     lBuf[3][];     /* send buffer for particle labels                           */
TDoubleVec xBuf[3][];     /* send buffer for particle coordinates                      */
TDoubleVec fBuf[];        /* receive buffer for forces                                 */
int        sendPtr[3];    /* destination index in remote PE's particle arrays          */
int        sendLen[3];    /* number of ghost particles to send in particular direction */
int        recLen[3];     /* number of ghost particles to receive       "              */

/** Sorts out local particles that are at the boundaries of the local box.
 *  These particles are stored inside the list 'src'.
 */
static int DirectionSort (int firstDir, int start)
{
  int j, dir;
  while (LabelNotEmpty(&label[start])) {
    for (dir = firstDir; dir < 3; dir++)
      if (xp[start][dir] >= rightMargin[dir]) {
        CopyVector(xBuf[dir][sendLen[dir]], xp[start], 3);
        CopyLabel (&lBuf[dir][sendLen[dir]], &label[start]);
        if (myPos[dir]+1 == nPesPerDir[dir])
          xBuf[dir][sendLen[dir]][dir] -= box[dir];
        src[dir][sendLen[dir]] = start;
        sendLen[dir]++;
      }
    start++;
  }
  ClearLabel(&lBuf[firstDir][sendLen[firstDir]]);
  return start;
}

/** Copy boundary (ghost) particles for the first time along the direction
 *  'dir'. Each PE registers how many ghosts it received. Destinations for
 *  the ghosts are calculated and stored for use in UpdateGhosts.
 */
static void GhostCopy (int dir)
{
  barrier();
  /* inform my left neighbor how many ghosts it will receive */
  parallel_put(&sendPtr[dir], &nTotal, 1, INT_SIZE, leftPe[dir]);
  /* now send the coordinate data and the labels (including end mark) */
  barrier();
  parallel_put(xp[sendPtr[dir]],     xBuf[dir][0], sendLen[dir],   VECTOR_SIZE, rightPe[dir]);
  parallel_put(&label[sendPtr[dir]], lBuf[dir],    sendLen[dir]+1, LABEL_SIZE,  rightPe[dir]);
  barrier();
}

void NewGhosts (double frame)
{
  int i, start;

  for (i = 0; i < 3; i++)
    rightMargin[i] = localMax[i] - frame;

  start  = 0;
  nTotal = nLocal;
  ClearLabel(&label[nLocal]);        /* set end mark */

  memset(sendLen , 0, 3*INT_SIZE);
  memset(sendPtr , 0, 3*INT_SIZE);
  memset(recLen  , 0, 3*INT_SIZE);
  memset(fSendPtr, 0, 3*INT_SIZE);

  /* First sort out all boundary particles in the whole simulation box
   * going to all directions and copy the ghost particles in x-direction
   * into their proper places. Second find the yz-boundary of these
   * x-particles and sort them out, then put the y-particles to their
   * places. Last sort out the z-boundary of the y-particles and copy them.
   */
  for (i = 0; i < 3; i++) {
    start       = DirectionSort(i, start);
    nTotal      = start;
    fSendPtr[i] = nTotal;
    GhostCopy(i);
  }
  for (i = 0; i < 2; i++)
    recLen[i] = fSendPtr[i+1] - fSendPtr[i];

  /* count the particles in the last packet */
  while (LabelNotEmpty(&label[nTotal]))
    nTotal++;
  recLen[i] = nTotal - fSendPtr[2];
}

/** copy the ghosts to all directions using the src[] index to ensure that
 *  positions of ghosts are not changed
 */
void UpdateGhosts (void)
{
  int i, k, dir;
  for (dir = 0; dir < 3; dir++) {
    if (nPesPerDir[dir] > 1) {
      /* fill the send buffers (apply PBC) */
      for (i = 0; i < sendLen[dir]; i++) {
        int j = src[dir][i];
        CopyVector(xBuf[dir][i], xp[j], 3);
        CopyLabel (&lBuf[dir][i], &label[j]);
        if (myPos[dir]+1 == nPesPerDir[dir])
          xBuf[dir][i][dir] -= box[dir];
      }
      barrier();
      /* send coords and labels */
      parallel_put(xp[sendPtr[dir]],     xBuf[dir][0], sendLen[dir], VECTOR_SIZE, rightPe[dir]);
      parallel_put(&label[sendPtr[dir]], lBuf[dir],    sendLen[dir], LABEL_SIZE,  rightPe[dir]);
      barrier();
    } else {
      /* copy data locally unbuffered (apply PBC) */
      TPDoubleVec xDest = xp + sendPtr[dir];
      TLabel     *lDest = &label[sendPtr[dir]];
      for (i = 0; i < sendLen[dir]; i++) {
        int j = src[dir][i];
        CopyVector(xDest[i], xp[j], 3);
        CopyLabel (&lDest[i], &label[j]);
        if (myPos[dir]+1 == nPesPerDir[dir])
          xDest[i][dir] -= box[dir];
      }
    }
  }
}

/** Sends the forces for the ghost particles back to their original host
 *  processors and adds them to the locally calculated forces.
 */
void BackPropagateForces (void)
{
  int dir, i, k;
  for (dir = 2; dir >= 0; --dir) {
    if (nPesPerDir[dir] > 1) {
      parallel_put(fBuf[0], force[fSendPtr[dir]], recLen[dir], VECTOR_SIZE, leftPe[dir]);
      barrier();
      for (i = 0; i < sendLen[dir]; i++)
        AddToVector(force[src[dir][i]], fBuf[i], 3);
    } else {
      /* remark: sendLen = recLen for local coords */
      TPDoubleVec fBuf = force + fSendPtr[dir];
      for (i = 0; i < sendLen[dir]; i++)
        AddToVector(force[src[dir][i]], fBuf[i], 3);
    }
  }
}
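The put into the left neighbor's sendPtr[dir] in GhostCopy() is what allows the subsequent one-sided transfers to land at the correct index in the neighbor's particle arrays, so that UpdateGhosts() can later refresh the ghosts in place without any further handshake. On machines without shmem-style remote puts, the same exchange can be expressed with ordinary two-sided message passing. The sketch below is such a re-expression for a single direction; it uses simplified stand-in names (ghost_exchange_1d, maxParticles) and MPI instead of the parallel_put()/barrier() primitives, and is not the code actually used on the T3E.

    /* Two-sided MPI re-expression of the ghost exchange in one direction.
     * 'left' and 'right' are the neighbor ranks in that direction; xp, xBuf,
     * nTotal and sendLen are simplified stand-ins for the module variables
     * of the listing above. Compile as part of an MPI program. */
    #include <mpi.h>

    void ghost_exchange_1d(MPI_Comm comm, int left, int right,
                           double xp[][3], int *nTotal, int maxParticles,
                           double xBuf[][3], int sendLen)
    {
        int recLen = 0;

        /* step 1: tell the right neighbor how many ghosts it will get from us
         *         and learn how many we will get from the left neighbor      */
        MPI_Sendrecv(&sendLen, 1, MPI_INT, right, 0,
                     &recLen,  1, MPI_INT, left,  0, comm, MPI_STATUS_IGNORE);

        if (*nTotal + recLen > maxParticles)
            MPI_Abort(comm, 1);                 /* no room for the ghost layer */

        /* step 2: ship the buffered boundary particles to the right neighbor
         *         and append the ghosts arriving from the left behind our
         *         own particles                                              */
        MPI_Sendrecv(xBuf,        3 * sendLen, MPI_DOUBLE, right, 1,
                     xp[*nTotal], 3 * recLen,  MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);

        *nTotal += recLen;
    }

In the two-sided variant the receiver places the incoming data itself, so the destination-index handshake of the original reduces to a plain exchange of counts; the original one-sided scheme avoids even that receive by letting each PE write directly into its neighbor's arrays.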

