Optimization techniques for parallel force-decomposition algorithm in molecular dynamic simulations


Computer Physics Communications 154 (2003) 121–130 www.elsevier.com/locate/cpc

Ji Wu Shu a,*, Bing Wang a, Min Chen b, Jin Zhao Wang b, Wei Min Zheng a

a Department of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China
b Department of Engineering Mechanics, Tsinghua University, Beijing 100084, PR China

* Corresponding author. E-mail address: [email protected] (J.W. Shu).

Received 30 December 2002; received in revised form 15 May 2003

Abstract

The efficiency and scalability of traditional parallel force-decomposition (FD) algorithms are poor because of the high communication cost introduced when the skew-symmetric character of the force matrix is exploited. This paper proposes a new parallel algorithm called UTFBD (Under Triangle Force Block Decomposition), which is based on a new, efficient force-matrix decomposition strategy. The strategy decomposes only the lower triangle of the force matrix and greatly reduces the parallel communication cost; for example, the communication cost of the UTFBD algorithm is only about one third of that of Taylor's FD algorithm. The UTFBD algorithm is implemented on a cluster system and applied to a physical nucleation problem with 500,000 particles. Numerical results are analyzed and compared for three algorithms: FRI, Taylor's FD and UTFBD. The efficiency of UTFBD on 105 processors is 41.3%, whereas the efficiencies of FRI and Taylor's FD on 100 processors are 4.3% and 35.2%, respectively. In other words, the efficiency of UTFBD on about 100 processors is 37.0 and 6.1 percentage points higher than that of FRI and Taylor's FD, respectively. The results show that UTFBD raises the efficiency of parallel MD (Molecular Dynamics) simulation and has better scalability.

© 2003 Elsevier B.V. All rights reserved.

PACS: 02.70.Ns; 68.03.Hj; 89.20.Ff

1. Introduction

Molecular Dynamics (MD) simulation is a numerical method whose use has grown significantly over the last two decades. The principle of MD simulation is to obtain macroscale data from microscale information. More specifically, MD simulation follows the motion of each particle according to a practical physical model and then takes statistics of various macroscale parameters. MD simulation is best suited to non-equilibrium or non-linear conditions; it can replace many experiments that are difficult to control and gives a clear picture of microscale motion [1]. In MD simulation, all particles are treated as mass points and their motions are calculated by solving Newton's equations of motion. Both microscale and macroscale properties of the particle system can then be obtained with statistical methods. Practical MD simulation always involves a large amount of computation, for the following two reasons:


(1) A large number of particles. Even in a submicron simulated space there are millions of particles.
(2) A large number of time steps. Physical models often limit the time step to the $10^{-15}$ s scale, so millions of time steps are necessary to simulate even a process occurring on the $10^{-12}$–$10^{-9}$ s scale [2].

To cope with this computational complexity, many parallel algorithms for different high-performance computer systems have been developed in recent years. They can be divided into three types according to the task-partition strategy: the Spatial Decomposition algorithm (SD) [3–5], the Atom Decomposition algorithm (AD) [6,7] and the Force Decomposition algorithm (FD) [8–11].

In the SD algorithm, the simulation domain is divided into sub-domains and each sub-domain is assigned to a processor. When applied to short-range [12] MD simulation, SD shows low communication cost and high scalability. But SD has one big shortcoming: it cannot give a satisfactory solution to the load-imbalance problem, although researchers have developed several load-balance strategies for it [13,14]. SD is therefore usually used in MD simulations with a uniform density; moreover, in long-range [12] MD simulation SD gives poor efficiency. In the AD algorithm, particles are randomly distributed among processors irrespective of their spatial positions. Although a good load balance is easily achieved with this algorithm, it requires a large amount of memory and communication and scales very badly, so AD is rarely used in practical MD simulation. In the FD algorithm, particle pairs are evenly assigned to the processors. With its low memory requirement, good load balance and simple program structure, FD is the most widely used algorithm in many research fields, such as nucleation.

The main shortcoming of FD is its relatively high communication cost. For example, both Taylor's FD algorithm [10] and Plimpton's FD algorithm [8] apply a block-based decomposition to the whole force matrix, which is easy to understand but complicates the elimination of redundant calculation. Taylor's algorithm has to add communication between symmetric processors, which introduces an additional communication cost of $3N/(2\sqrt{P})$ (here N stands for the number of particles and P for the number of processors).

On the other hand, because Plimpton's algorithm uses a position-vector permutation technique that destroys the skew-symmetric character of the force matrix, it has to introduce a checkerboard matrix as an auxiliary data structure to eliminate redundant calculation, which adds an extra communication cost of $N/\sqrt{P}$ and an extra calculation cost of $N/(2P)$. Generally speaking, this relatively high communication cost limits the scalability of FD algorithms to a certain extent.

In this paper, a new parallel algorithm based on a newly developed force-matrix decomposition strategy is proposed. The new algorithm is called the UTFBD (Under Triangle Force Block Decomposition) algorithm; it takes full advantage of the skew-symmetric character of the force matrix and markedly reduces the communication cost. In detail, the communication cost of the UTFBD algorithm is only about one third of that of Taylor's FD algorithm. The UTFBD algorithm is implemented on an Intel Xeon cluster system and applied to a nucleation simulation of a physical system with 500,000 particles. The speedup of UTFBD on 105 processors reaches 43.4, higher than that of the other tested algorithms (see Section 4 for details).

The rest of this paper is organized as follows. Section 2 analyzes the shortcomings of traditional algorithms in force-matrix decomposition. Section 3 gives a detailed picture of the new UTFBD algorithm and its decomposition and processor-assignment strategies; the theoretical analysis of calculation and communication is also given in that section. Section 4 presents the benchmark system and numerical results.

2. Taylor's FD algorithm

Suppose the number of particles is N. The force matrix F is then a square matrix of size $N \times N$ whose element $F_{ij}$ stands for the force of particle j on particle i, so F gives a complete picture of the interactions of the whole particle system. The principle of an FD algorithm is to divide F into many sub-matrices according to some rule and assign the sub-matrices to processors. FD algorithms fall into two classes according to the matrix decomposition mode. The first class is the row-based decomposition method, e.g., Murty's FRI (Force-Row Interleaving) method [9]; the second class is the block-based decomposition method, e.g., Plimpton's method [8] and Taylor's method [10].


In the row-based decomposition method, each processor is responsible for the calculation of several matrix rows. Large memory requirements and communication costs are unavoidable, so this kind of method only fits small-scale MD simulation. On the other hand, because both the memory requirement and the communication cost of the block-based decomposition method are relatively low, it is the most commonly used method in large-scale MD simulation.

Because the force matrix is the most important structure in the design of parallel FD algorithms, it is worth examining its character. According to Newton's third law, $f_{ij}$ and $f_{ji}$ are equal in magnitude and opposite in direction; in other words, the force matrix F has a skew-symmetric character. The interactions in the upper (lower) part of the force matrix can be obtained easily as soon as the interactions in the lower (upper) part are available, so almost all parallel algorithms try to apply Newton's third law to eliminate the redundant calculation, which is nearly half of the total. But traditional FD algorithms introduce a lot of communication cost when applying this law. For example, Taylor's FD algorithm adds a cost of $3N/(2\sqrt{P})$ for the communication between symmetric processors.

Taylor's algorithm can be described in detail as follows. Suppose F is divided into P sub-blocks and P processors are available, so that each sub-block is assigned to a different processor. We use $P_{ij}$ to denote the processor corresponding to sub-block (i, j), which is responsible for calculating the forces of particle section j on particle section i, and $P_{ji}$, its transpose processor. In order to apply Newton's third law, the two transpose processors $P_{ij}$ and $P_{ji}$ can share the calculation of the interaction between particle sections i and j. Fig. 1 shows the case N = 12 and P = 9; each sub-block is a $4 \times 4$ square matrix. Processor $P_{12}$ calculates the forces between particles (1, 2, 3, 4) and particles (5, 6), while its transpose processor $P_{21}$ is responsible for the forces between particles (1, 2, 3, 4) and particles (7, 8). The two transpose processors then have to communicate to obtain the total interaction information between particles (1, 2, 3, 4) and particles (5, 6, 7, 8). Four communication modes exist in Taylor's FD algorithm, as enumerated below.


Fig. 1. Decomposition and communication of the force matrix in Taylor's FD algorithm.

(1) Communication between transpose processors, as described above, whose cost is $3N/(2\sqrt{P})$.
(2) Force fold communication [8] among processors in the same row, say row i, so that each processor obtains the total force on particle section i. This communication cost is $N/\sqrt{P}$.
(3) Update communication among processors in the same row, say row i, so that each processor obtains the newest position information of particle section i. This communication cost is $N/\sqrt{P}$.
(4) Update communication among processors in the same column, say column j, so that each processor obtains the newest position information of particle section j. This communication cost is also $N/\sqrt{P}$.

So the total communication cost of Taylor's FD algorithm is

\[ \frac{3N}{2\sqrt{P}} + \frac{N}{\sqrt{P}} + \frac{N}{\sqrt{P}} + \frac{N}{\sqrt{P}} = \frac{9N}{2\sqrt{P}}. \tag{1} \]

In order to reduce the communication cost of Taylor's FD algorithm, we propose a new block-based decomposition technique with which communications (1) and (4) are no longer needed. The communication cost of the new algorithm is reduced considerably.
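The volume in Eq. (1) is easy to tabulate for a given system. The C sketch below is our own illustration, not the authors' code: it assumes a square $\sqrt{P} \times \sqrt{P}$ processor grid with a row-major rank convention (our assumption) and simply evaluates each rank's block, its transpose partner and the per-timestep data volume of Eq. (1).

```c
/* Sketch: block ownership and per-step communication volume in a
 * Taylor-style block FD decomposition (Eq. (1)).  Illustrative only;
 * the row-major rank convention is our assumption, not the paper's. */
#include <math.h>
#include <stdio.h>

/* Row-major rank -> (block row, block column) on a q x q grid. */
static void rank_to_block(int rank, int q, int *bi, int *bj) {
    *bi = rank / q;
    *bj = rank % q;
}

/* Per-timestep data volume (in N-proportional units of particle entries),
 * following Eq. (1): 3N/(2 sqrt(P)) for the transpose exchange plus
 * 3 * N/sqrt(P) for the fold and the two updates = 9N/(2 sqrt(P)). */
static double taylor_comm_volume(double n_particles, int n_procs) {
    double sqp = sqrt((double)n_procs);
    return 1.5 * n_particles / sqp + 3.0 * n_particles / sqp;
}

int main(void) {
    int q = 3, bi, bj;                      /* 3 x 3 grid, P = 9, as in Fig. 1 */
    rank_to_block(5, q, &bi, &bj);
    printf("rank 5 owns block (%d,%d); transpose partner is rank %d\n",
           bi, bj, bj * q + bi);
    printf("per-step volume for N = 500000, P = 100: %.0f particle entries\n",
           taylor_comm_volume(500000.0, 100));
    return 0;
}
```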

3. UTFBD parallel algorithm

3.1. New matrix decomposition and processor mapping strategy

Because of the skew-symmetric character of the force matrix, all the information needed for the force calculation can be obtained once the lower-triangular part of the matrix is available, so the lower part is called the Effective Force Matrix (EFM). The principle of the newly proposed decomposition strategy is to partition the EFM into sub-blocks while ignoring the upper part of the force matrix. This decomposition method is called Under Triangle Force Block Decomposition (UTFBD).

Fig. 2 illustrates the processor mapping strategy. We introduce two new concepts to explain this strategy more easily: the virtual processor and the actual processor. The latter is the physical processor that actually carries out the calculation of some sub-matrix; the former is a concept that exists only in the processor mapping process and helps to describe and evaluate the new parallel algorithm. The processor mapping strategy consists of two main steps:

(1) mapping sub-matrices to virtual processors;
(2) mapping virtual processors to actual processors.

Step (1) is very simple: each sub-matrix $F_{ij}$ is assigned to a unique virtual processor $P_{ij}$. Owing to the skew-symmetric character of the force matrix, the tasks of the virtual processors $P_{ij}$ and $P_{ji}$ are in fact duplicates. Step (2) works as follows. Suppose each virtual processor $P_{ij}$ is assigned to an actual processor $P_k$. The relationship between the two kinds of processors in our mapping strategy is

\[ P_{ij} \;\Leftrightarrow\; P_{\frac{1}{2}\max(i,j)\,(\max(i,j)-1)+\min(i,j)}. \]

In detail, when $i \geq j$ the virtual processor $P_{ij}$ is assigned to the actual processor $P_{\frac{1}{2}i(i-1)+j}$; when $i < j$, $P_{ij}$ is assigned to $P_{\frac{1}{2}j(j-1)+i}$. For example, in step (1) the sub-matrices $F_{32}$ and $F_{24}$ are assigned to $P_{32}$ and $P_{24}$, respectively; in step (2), $P_{32}$ and $P_{24}$ are assigned to $P_5$ and $P_8$. So after the mapping process, sub-matrices $F_{32}$ and $F_{24}$ reside on $P_5$ and $P_8$, respectively.

In Fig. 1 the force matrix of Taylor's FD is decomposed into 9 sub-matrices, which are assigned to 9 processors. Considering the same system of 12 particles, two decomposition schemes can be used in the UTFBD algorithm. The first is to partition the force matrix into $4 \times 4 = 16$ sub-matrices, as shown in Fig. 2. Because there are 10 sub-matrices in the lower part of the force matrix, 10 processors in total are used, one per sub-matrix, approximately equal to the number of processors in Taylor's FD (9 processors). Note that with nearly equal numbers of processors, the UTFBD algorithm can be applied to a system with more particles than Taylor's FD. The other scheme is to partition the force matrix into $3 \times 3 = 9$ sub-matrices, exactly as in Taylor's FD. Only 6 processors are then needed for the 6 sub-matrices in the lower part of the force matrix, 3 fewer than in Taylor's FD. Note that when the force matrix is partitioned into sub-matrices of the same size, fewer processors are needed by the UTFBD algorithm.

Let $P_T$ and $P_U$ denote the numbers of processors in Taylor's FD and in UTFBD, respectively. The relationship between them is $P_U = (P_T + \sqrt{P_T})/2$. A form that is more convenient for analyzing the algorithm is

\[ \sqrt{P_T} = \frac{\sqrt{8 P_U + 1} - 1}{2}. \tag{2} \]

Fig. 2. Matrix decomposition and mapping strategy of matrix sub-blocks.
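To make the mapping concrete, the following C sketch (our own illustration, with hypothetical function names) implements the $P_{ij} \to P_k$ rule and Eq. (2); it reproduces the example above ($F_{32} \to P_5$, $F_{24} \to P_8$) and shows that $P_U = 105$, the processor count used in Section 4, corresponds to a $14 \times 14$ partition, i.e. $P_T = 196$.

```c
/* Sketch of the UTFBD virtual-to-actual processor mapping and of the
 * P_T <-> P_U relation (Eq. (2)).  Indices are 1-based, as in the paper;
 * the function names are ours, not the authors'. */
#include <math.h>
#include <stdio.h>

/* P_ij -> P_k with k = max(i,j)*(max(i,j)-1)/2 + min(i,j). */
static int actual_processor(int i, int j) {
    int hi = i > j ? i : j;
    int lo = i > j ? j : i;
    return hi * (hi - 1) / 2 + lo;
}

/* Actual processors for an m x m partition of the force matrix: the lower
 * triangle including the diagonal holds m(m-1)/2 + m blocks. */
static int utfbd_processors(int m) {
    return m * (m - 1) / 2 + m;
}

int main(void) {
    /* The paper's example: F_32 and F_24 land on P_5 and P_8. */
    printf("P_32 -> P_%d, P_24 -> P_%d\n",
           actual_processor(3, 2), actual_processor(2, 4));
    printf("a 4 x 4 partition needs %d actual processors\n",
           utfbd_processors(4));

    /* Eq. (2): sqrt(P_T) = (sqrt(8 P_U + 1) - 1) / 2. */
    int pu = 105;
    double sqrt_pt = (sqrt(8.0 * pu + 1.0) - 1.0) / 2.0;
    printf("P_U = %d corresponds to a %.0f x %.0f partition (P_T = %.0f)\n",
           pu, sqrt_pt, sqrt_pt, sqrt_pt * sqrt_pt);
    return 0;
}
```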


3.2. The UTFBD algorithm

The force matrix F of a system with N particles is an $N \times N$ matrix. Suppose, for convenience, that $N = k \times m$ and that F is partitioned into $m \times m$ sub-matrices, each a square matrix of size $k \times k$. Numbering the sub-matrices of the lower part in row order, we obtain $F_1, F_2, \ldots, F_{m(m-1)/2+m}$. Processors $P_1, P_2, \ldots, P_{m(m-1)/2+m}$ are used to calculate these sub-matrices, with $P_i$ responsible for $F_i$. One time step of the UTFBD algorithm proceeds as follows:

(1) $P_1, P_2, \ldots, P_{m(m-1)/2+m}$ calculate their own sub-matrices.
(2) $P_{i(i-1)/2+1}, \ldots, P_{i(i-1)/2+i}, P_{(i+1)i/2+i}, \ldots, P_{m(m-1)/2+i}$ ($i = 1, 2, \ldots, m$) communicate their interactions to $P_{i(i-1)/2+1}$ ($i = 1, 2, \ldots, m$).
(3) $P_{i(i-1)/2+1}$ ($i = 1, 2, \ldots, m$) updates the position and velocity information of the particles.
(4) $P_{i(i-1)/2+1}, \ldots, P_{i(i-1)/2+i}, P_{(i+1)i/2+i}, \ldots, P_{m(m-1)/2+i}$ ($i = 1, 2, \ldots, m$) communicate to update the position information of the particles.

The following example takes N = 12, k = 3 and m = 4. There are 10 sub-matrices in the lower part of the force matrix, namely $F_1, F_2, \ldots, F_{10}$, and they are assigned to 10 processors, $P_1, P_2, \ldots, P_{10}$, as illustrated in Fig. 3.

Fig. 3. Decomposition of the force matrix in the UTFBD algorithm.

(1) $P_1, P_2, \ldots, P_{10}$ calculate the sub-matrices $F_1, F_2, \ldots, F_{10}$, respectively. For example, $P_5$ calculates the interactions between particles (4, 5, 6) and particles (7, 8, 9).
(2) $P_2$, $P_4$ and $P_7$ communicate their interactions to $P_1$, so that $P_1$ obtains the total interaction on particles (1, 2, 3); $P_3$, $P_5$ and $P_8$ communicate to $P_2$, which obtains the total interaction on particles (4, 5, 6); $P_5$, $P_6$ and $P_9$ communicate to $P_4$, which obtains the total interaction on particles (7, 8, 9); $P_8$, $P_9$ and $P_{10}$ communicate to $P_7$, which obtains the total interaction on particles (10, 11, 12).
(3) $P_1$ updates the positions and velocities of particles (1, 2, 3); $P_2$ those of particles (4, 5, 6); $P_4$ those of particles (7, 8, 9); $P_7$ those of particles (10, 11, 12).
(4) $P_1$ communicates the updated positions of particles (1, 2, 3) to $P_2$, $P_4$ and $P_7$; $P_2$ communicates the updated positions of particles (4, 5, 6) to $P_3$, $P_5$ and $P_8$; $P_4$ communicates the updated positions of particles (7, 8, 9) to $P_5$, $P_6$ and $P_9$; $P_7$ communicates the updated positions of particles (10, 11, 12) to $P_8$, $P_9$ and $P_{10}$.
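The reduction pattern of step (2) can be generated mechanically from the block numbering. The C sketch below (our own illustration, not the authors' implementation) lists, for each particle section i, the destination processor $P_{i(i-1)/2+1}$ and the senders from row i and column i of the lower triangle; for m = 4 it reproduces exactly the communication pattern of the example above.

```c
/* Sketch of the step-(2) reduction pattern in UTFBD: for each particle
 * section i, the processors holding blocks in row i or column i of the
 * lower triangle send their partial forces to the first processor of
 * row i.  The indexing follows the paper; the code is illustrative. */
#include <stdio.h>

/* 1-based index of the lower-triangle block in row r, column c (r >= c). */
static int block_index(int r, int c) {
    return r * (r - 1) / 2 + c;
}

int main(void) {
    int m = 4;                               /* 4 x 4 partition, 10 blocks  */
    for (int i = 1; i <= m; i++) {
        int dest = block_index(i, 1);        /* P_{i(i-1)/2 + 1}            */
        printf("section %d: reduce to P_%d; senders:", i, dest);
        for (int c = 2; c <= i; c++)         /* rest of row i               */
            printf(" P_%d", block_index(i, c));
        for (int r = i + 1; r <= m; r++)     /* column i below the diagonal */
            printf(" P_%d", block_index(r, i));
        printf("\n");
    }
    return 0;
}
```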

3.3. Theoretical analysis

The main part of the computation in the UTFBD algorithm is the force calculation. With the speed-up techniques of linked cells and neighbor lists [15], the complexity of the force calculation is linear in the number of particles. That is, the complexity of UTFBD is $O(\bar{N})$, where $\bar{N}$ denotes the average number of particles assigned to one processor. Using Eq. (2), the complexity can be written as

\[ O(\bar{N}) = O\!\left(\frac{N}{\sqrt{P_T}}\right) = O\!\left(\frac{2N}{\sqrt{8P_U+1}-1}\right) = O\!\left(\frac{N}{\sqrt{P_U}}\right). \tag{3} \]

Compared with Taylor's FD algorithm, the communication cost of UTFBD is greatly reduced. Because the first and fourth types of communication (see Section 2) are not needed in UTFBD, the total communication cost is


\[ \frac{2N}{\sqrt{P_T}} = \frac{4N}{\sqrt{8P_U+1}-1} \approx \frac{\sqrt{2}\,N}{\sqrt{P_U}}. \tag{4} \]

This is only about one third of the cost of Taylor's FD: for the same number of processors P, the cost is $\sqrt{2}N/\sqrt{P}$ against $9N/(2\sqrt{P})$, a ratio of $2\sqrt{2}/9 \approx 0.31$.

4. Numerical results and comparison

4.1. Benchmark system and computation environments

A Lennard-Jones fluid model is used as the benchmark system [16], in which the potential energy between a pair of particles is described by

\[ \phi(r) = 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right], \]

where r is the distance between the particles and $\varepsilon$ and $\sigma$ are constants. For liquid argon, $\sigma = 3.405\,\text{Å}$ and $\varepsilon = 120\,\text{K} \times k_B$, where $k_B$ is the Boltzmann constant. A multi-particle system in a cubic space is simulated and periodic boundary conditions are adopted. When the simulation begins, all particles lie on an FCC (Face-Centered Cubic) lattice with random velocities. During the simulation, the pressure P, the space volume V and the total energy E are kept constant. Because short-range MD simulation is more challenging in parallel computation, we choose a cut-off distance $r_c = 4.0\sigma$.

The Verlet integration method [15] is adopted to solve the Newton equations of motion:

\[
\begin{aligned}
v^k_{n+1/2} &= v^k_n - \frac{\Delta t}{2m}\,\frac{\partial \Phi\!\left(r^1_n, r^2_n, \ldots, r^N_n\right)}{\partial r^k},\\
r^k_{n+1} &= r^k_n + \Delta t \, v^k_{n+1/2},\\
v^k_{n+1} &= v^k_{n+1/2} - \frac{\Delta t}{2m}\,\frac{\partial \Phi\!\left(r^1_{n+1}, r^2_{n+1}, \ldots, r^N_{n+1}\right)}{\partial r^k},\\
\Phi\!\left(r^1_n, r^2_n, \ldots, r^N_n\right) &= \sum_{1 \le i < j \le N} \phi\!\left(\left|r^i_n - r^j_n\right|\right),
\end{aligned}
\tag{5}
\]

where $v^k$ and $r^k$ are the velocity and position vectors of particle k, m is the particle mass and $\Delta t$ is one time step.
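For concreteness, the two ingredients above, the Lennard-Jones pair potential with a cut-off and the velocity-Verlet step of Eq. (5), are sketched below in serial C. This is a minimal illustration in reduced units ($\sigma = \varepsilon = m = 1$), our own simplification: periodic boundaries, neighbor lists and the parallel decomposition are deliberately omitted.

```c
/* Minimal serial sketch of the benchmark ingredients: the Lennard-Jones
 * pair force with cut-off and one velocity-Verlet step (Eq. (5)).
 * Reduced units (sigma = epsilon = m = 1); periodic boundaries, neighbor
 * lists and parallelism are omitted on purpose. */
#include <math.h>
#include <stdio.h>

static const double RC = 4.0;                /* cut-off r_c, in units of sigma */

typedef struct { double x[3], v[3], f[3]; } particle;

/* For phi(r) = 4[(1/r)^12 - (1/r)^6], returns |F|/r as a function of r^2. */
static double lj_force_over_r(double r2) {
    double inv2 = 1.0 / r2, inv6 = inv2 * inv2 * inv2;
    return 24.0 * inv2 * inv6 * (2.0 * inv6 - 1.0);
}

static void compute_forces(particle *p, int n) {
    for (int i = 0; i < n; i++)
        p[i].f[0] = p[i].f[1] = p[i].f[2] = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {    /* Newton's third law: j > i only */
            double d[3], r2 = 0.0;
            for (int k = 0; k < 3; k++) { d[k] = p[i].x[k] - p[j].x[k]; r2 += d[k] * d[k]; }
            if (r2 > RC * RC) continue;      /* outside the cut-off sphere */
            double fr = lj_force_over_r(r2);
            for (int k = 0; k < 3; k++) { p[i].f[k] += fr * d[k]; p[j].f[k] -= fr * d[k]; }
        }
}

/* One velocity-Verlet step of size dt (particle mass m = 1). */
static void verlet_step(particle *p, int n, double dt) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) {
            p[i].v[k] += 0.5 * dt * p[i].f[k];   /* half kick */
            p[i].x[k] += dt * p[i].v[k];         /* drift     */
        }
    compute_forces(p, n);                        /* forces at the new positions */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++)
            p[i].v[k] += 0.5 * dt * p[i].f[k];   /* second half kick */
}

int main(void) {
    particle p[2] = {
        { {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0} },
        { {1.5, 0.0, 0.0}, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0} },
    };
    compute_forces(p, 2);                        /* initial forces */
    for (int s = 0; s < 10; s++)
        verlet_step(p, 2, 0.005);
    printf("after 10 steps: x1 = %.4f, x2 = %.4f\n", p[0].x[0], p[1].x[0]);
    return 0;
}
```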
The numerical experiment platform is a computer cluster system. A cluster is composed of a network of workstations, commonly with computing nodes based on high-volume workstations or small-scale SMP nodes. Our cluster has 36 SMP nodes and its architecture is shown in Fig. 4. Each node has 4 Intel Xeon PIII 700 CPUs, 36 GB of hard disk and 1 GB of memory. The communication medium between the SMP nodes is a Myrinet switch whose bandwidth is 2.56 Gb/s. The software environment is RedHat Linux 7.2 (kernel version 2.4.7-10smp), MPICH 1.2.7 and gm-1.5pre4, the network protocol running on Myrinet. Three parallel algorithms were tested: UTFBD, FRI and Taylor's FD.

Fig. 4. Architectures of cluster systems.


Fig. 5. The efficiency of the parallel algorithm against the number of processors in a cluster.

Fig. 6. Speed-up relative to one processor versus the number of processors in a cluster.

4.2. Results

The two common measurements for evaluating the performance of a parallel algorithm are the speedup and the efficiency. The speedup compares the running time of the parallel algorithm on a number of processors with the running time of the corresponding serial algorithm on a single processor, while the efficiency measures the scalability of the parallel algorithm. They are defined respectively as

\[ S_p = \frac{T_s}{T_p} \qquad \text{and} \qquad E_p = \frac{S_p}{N_p}, \]

where $S_p$ is the speedup, $T_s$ is the running time of the serial program (the period from the beginning of the program to its end), $T_p$ is the running time of the parallel FD program (the period from the beginning of the first process to the end of the last process) and $N_p$ is the number of processors.

Figs. 5 and 6 show the efficiency and speedup curves, respectively. The number of particles in the simulated system is N = 500,000.
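As a worked instance of these definitions, take the UTFBD figures reported below (a speedup of 43.4 on 105 processors):

\[ S_p = \frac{T_s}{T_p} = 43.4, \qquad E_p = \frac{S_p}{N_p} = \frac{43.4}{105} \approx 0.413 = 41.3\%, \]

which matches the 41.3% efficiency quoted for UTFBD.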


Fig. 7. Communication cost against the number of processors used (108,000 particles).

Fig. 8. Communication cost against the number of processors used (500,000 particles).

Note that the numbers of processors used for Taylor's FD and for UTFBD are not equal in most cases, because UTFBD applies the new decomposition and mapping strategies. As described in Section 3.1, when nearly equal numbers of processors are used, the UTFBD algorithm can be applied to a system with more particles than Taylor's FD; when the force matrix is partitioned into sub-matrices of the same size, fewer processors are needed by UTFBD.

It is apparent that the UTFBD algorithm outperforms FRI and Taylor's FD in short-range MD simulation. From the speedup curve in Fig. 6 we can conclude that UTFBD is more efficient than Taylor's FD, and that the more processors are used, the larger the advantage becomes. For example, the speedup of Taylor's FD on 36 processors is 20.2 and that of UTFBD is 21.1, while the speedup of the former is 35.1 on 100 processors and that of the latter is 43.4 on 105 processors. The upward trend of the speedup curve also shows that the UTFBD algorithm has better scalability. Furthermore, the efficiencies of FRI and Taylor's FD on 100 processors are 4.3% and 35.2%, respectively, whereas that of UTFBD on 105 processors is 41.3%, about 37.0 and 6.1 percentage points higher than FRI and Taylor's FD, respectively.

Figs. 7 and 8 show the communication cost of UTFBD and Taylor's FD for a system of 108,000 particles and a system of 500,000 particles, respectively. The communication cost is obtained as follows.


(1) Each processor $P_i$ running the parallel program measures its own total communication time within one simulation time step, $T_c(i)$. (2) All processors exchange their $T_c(i)$ and the longest one is selected as $T_c = \max_{1 \le i \le P} T_c(i)$, where P is the number of processors. This $T_c$ is the communication cost plotted in Figs. 7 and 8.

When the number of processors is relatively small, the longest communication time of UTFBD is greater than that of Taylor's FD. But when more processors are used, the longest communication time of UTFBD decreases much faster than that of Taylor's FD. This seems to conflict with the earlier conclusion that the communication data volume of UTFBD is only about one third of that of Taylor's FD. The reason is that the total communication cost of a parallel algorithm is determined by three factors together:

(1) the total volume of data sent and received;
(2) the number of messages and the communication mode;
(3) the synchronization cost due to load imbalance.

Although the data volume of UTFBD is only one third of that of Taylor's FD, UTFBD does not eliminate the complex communication modes present in Taylor's FD. More importantly, UTFBD suffers from a relatively serious load-imbalance problem, as described in Section 5. In the first step of UTFBD, the work of a diagonal processor is only half of that of an off-diagonal processor, so the diagonal processors must wait for the off-diagonal processors; this synchronization time is included in the communication time. The more processors are used, the shorter this synchronization time becomes. Taylor's FD, on the other hand, does not suffer from such a load imbalance, so when more processors are used the longest communication time of Taylor's FD does not decrease as sharply as that of UTFBD. Therefore, when the number of processors used in an MD simulation is large enough, UTFBD has much better communication performance than Taylor's FD. Moreover, the more particles are handled, the greater the influence of the data volume on the total communication cost, and the better the communication performance of UTFBD. Comparing Figs. 7 and 8 leads to the conclusion that the larger the particle system, the more obvious the advantage of UTFBD over Taylor's FD in communication.


For example, in Fig. 7, which deals with 108,000 particles, the longest communication time of Taylor's FD in one time step on 100 processors is 49 ms, whereas that of UTFBD on 91 and 105 processors is 42 ms in both cases, 14% lower. In Fig. 8, which deals with 500,000 particles, the longest communication time of Taylor's FD in one time step on 100 processors is 247 ms, whereas that of UTFBD on 91 and 105 processors is 205 and 176 ms, 17% and 29% lower, respectively. The difference clearly increases with system size.
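The measurement procedure described above could be implemented roughly as follows. The C/MPI sketch below is our own illustration, not the authors' code: the two "communication" calls are mere placeholders (plain barriers) standing in for whatever per-step exchanges the decomposition actually performs, and the maximum over processes is taken with MPI_Allreduce.

```c
/* Sketch of the per-step communication-cost measurement: each process
 * accumulates the time spent in communication during one time step and
 * T_c is the maximum over all processes.  The two communication calls
 * are placeholders, not the paper's actual exchanges. */
#include <mpi.h>
#include <stdio.h>

static void exchange_positions(void) { MPI_Barrier(MPI_COMM_WORLD); }
static void fold_forces(void)        { MPI_Barrier(MPI_COMM_WORLD); }

/* Returns T_c = max over processes of the local communication time. */
static double timed_communication(void) {
    double tc = 0.0, t0, tc_max;

    t0 = MPI_Wtime();
    exchange_positions();
    tc += MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    fold_forces();
    tc += MPI_Wtime() - t0;

    MPI_Allreduce(&tc, &tc_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return tc_max;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double tc = timed_communication();
    if (rank == 0)
        printf("T_c for this step: %.6f s\n", tc);
    MPI_Finalize();
    return 0;
}
```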

5. Conclusions and discussion

Traditional parallel MD simulation algorithms can be divided into three classes, AD, FD and SD, of which FD is the most widely used. The main shortcoming of FD is its comparatively large communication cost, which limits its scalability. UTFBD, a new FD algorithm, is proposed in this paper; it introduces new decomposition and mapping strategies and greatly reduces the communication cost. Numerical results show that the UTFBD algorithm is more efficient and more scalable than other algorithms such as Taylor's FD and FRI.

The UTFBD algorithm proposed in this paper is not yet fully optimized. In its first step, the computing work of a diagonal processor is only half that of an off-diagonal one, which necessarily leads to load imbalance. A theoretical analysis is as follows. Let n denote the scale of the force-matrix partition, so that the number of diagonal processors is n and the number of off-diagonal processors is $n(n-1)/2$. Let c denote the task of one diagonal processor, so that the task of one off-diagonal processor is 2c and the total task of step (1) is $n^2 c$. Let v denote the speed of a processor; the computing time of step (1) in UTFBD is then $2c/v$. If the total task were distributed evenly among all processors, the computing time of step (1) would be

\[ \frac{2 c n^2}{n(n+1)v}, \]

which is close to $2c/v$, especially when n is large enough. In general, because the number of diagonal processors is much smaller than that of off-diagonal ones and the task of a diagonal processor is smaller than that of an off-diagonal processor, the additional cost due to the load imbalance is very small. When a large number of processors is used (as in most parallel MD simulations), the parallel efficiency would not increase much even if the total task were distributed evenly over all processors.
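The size of this imbalance can be made explicit: the ratio of the actual step-(1) time to the perfectly balanced one is

\[ \frac{2c/v}{\dfrac{2 c n^2}{n(n+1)v}} = \frac{n+1}{n} = 1 + \frac{1}{n}, \]

so the diagonal/off-diagonal imbalance costs at most a fraction $1/n$ of the force-computation time per step, which vanishes as the partition grows.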


Acknowledgement

This research was supported by the 985 Basic Research Foundation of Tsinghua University, PR China (Grant No. J2001024).

References

[1] D.C. Rapaport, The Art of Molecular Dynamics Simulation, Cambridge University Press, Cambridge, 1995.
[2] M. Marin, Comput. Phys. Comm. 102 (1997) 81.
[3] S. Gupta, Comput. Phys. Comm. 70 (1992) 243.
[4] D. Brown, J.H.R. Clarke, M. Okuda, et al., Comput. Phys. Comm. 74 (1993) 67.
[5] R. Koradi, M. Billeter, P. Guntert, Comput. Phys. Comm. 124 (2000) 139.
[6] W. Smith, Comput. Phys. Comm. 67 (1992) 392.
[7] G.S. Heffelfinger, Comput. Phys. Comm. 128 (2000) 219.
[8] S. Plimpton, J. Comput. Phys. 117 (1995) 1.
[9] R. Murty, D. Okunbor, Parallel Comput. 25 (1999) 217.
[10] V.E. Taylor, R.L. Stevens, K.E. Arnold, Parallel molecular dynamics: Communication requirements for massively parallel machines, in: Proc. 5th Symposium on the Frontiers of Massively Parallel Computation, IEEE Comput. Society Press, 1994, p. 156.
[11] Y. Komeiji, M. Haraguchi, U. Nagashima, Parallel Comput. 27 (2001) 977.
[12] J.M. Haile, Molecular Dynamics Simulation: Elementary Methods, Wiley-Interscience, New York, 1997.
[13] R. Hayashi, S. Horiguchi, Efficiency of dynamic load balancing based on permanent cells for parallel molecular dynamics simulation, in: Proc. 14th International Parallel & Distributed Processing Symposium, IEEE Comput. Society Press, 2000, p. 85.
[14] L. Nyland, J. Prins, R.H. Yun, et al., Modeling dynamic load balancing in molecular dynamics to achieve scalable parallel execution, in: Proc. 5th International Symposium on Solving Irregularly Structured Problems in Parallel, Springer-Verlag, 1998, p. 356.
[15] J. Shu, B. Wang, J.Z. Wang, et al., Cluster-based molecular dynamics parallel simulation in thermophysics, in: S.G. Akl, T. Gonzalez (Eds.), Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, ACTA Press, 2002, p. 18.
[16] K. Esselink, B. Smit, P.A.J. Hilbers, J. Comput. Phys. 106 (1993) 101.