Computers & Chemistry, Vol. 14, No. 3, pp. 219-224, 1990. Printed in Great Britain. All rights reserved.
0097-8485/90 $3.00 + 0.00
Copyright © 1990 Pergamon Press plc
PARALLELIZATION OF A MOLECULAR DYNAMICS NON-BONDED FORCE ALGORITHM FOR MIMD ARCHITECTURE

TERRY W. CLARK and J. ANDREW McCAMMON
Department of Chemistry, University of Houston, Houston, TX 77204-4641, U.S.A.

(Received 24 August 1989; in revised form 20 December 1989; received for publication 15 January 1990)
Abstract - A method for parallelizing the non-bonded pair list generation and non-bonded force calculation algorithm for molecular dynamics is presented. Using the parallelism inherent to existing algorithms, it is possible, with minor modifications, to adapt the non-bonded routines to multiple-instruction, multiple-data (MIMD) computer architectures. This methodology has been applied to the molecular dynamics program GROMOS for the Stellar GS1000 Graphics Supercomputer. Aspects of the Stellar GS1000 architecture and programming environment are presented with attention to the performance of the molecular dynamics program. A speed enhancement factor of about 3 has been obtained relative to the serial execution of the program, which is close to the theoretical maximum factor of 4 for this machine. The overall speed enhancement factor increases to about 6 with the additional use of vectorization for a version that has been extensively rewritten to be more highly vectorizable than the standard code, where gains from vectorization are slight. In the former case, the program executes at about 35% of the speed obtained on a single-processor Cray X-MP.
1. INTRODUCTION
The molecular dynamics method is widely used to study the dynamic and thermodynamic behavior of molecular assemblies via computer simulations (Allen & Tildesley, 1987; McCammon & Harvey, 1987; Brooks et al., 1988). At the heart of the molecular dynamics simulation is the potential energy function. The gradient of the potential supplies the force acting on each atom which, in turn, is used to solve Newton's equations of motion for the system. The potential function consists of several terms that can be categorized either as non-bonded interactions or bonded interactions. Of special interest in terms of computational speed are the non-bonded calculations, namely, the pair list generation routines and the force matrix calculation for the non-bonded interactions. In a typical molecular dynamics calculation, these routines can consume well over 75% of the computation time. In principle, non-bonded calculations can be speeded significantly by using either of two types of parallel architectures, namely, multiple-processor architectures, where each processor executes an independent stream of instructions, or vector architectures, which make use of pipelined or array processor hardware. The key program component behind vectorization, the array data structure, distinguishes this form of parallelism from multiple-instruction, multiple-data (MIMD) parallelism (Hockney & Jesshope, 1981), where larger-granularity parallelism contained in program sections comprising many instructions can be exploited.
The ongoing effort to exploit parallelism inherent to problems and their resulting algorithms and programs is a subject of great interest in computational chemistry (Wallace, 1988; Cochrane & Truhlar, 1988). For the parallelization of molecular dynamics simulations, a variety of methodologies have been used (Clementi et al., 1985; Berendsen et al., 1984a, b; Fincham, 1987; Boris, 1986; Hoshino & Takenouchi, 1984). Some approaches take advantage of the observation that, conceptually, for any time step in a molecular dynamics simulation, forces act simultaneously on each atom (or particle) of the system. This inherent parallelism has been utilized in producing algorithms adaptable to a variety of multiprocessor computers. Fincham (1987) discusses a scheme whereby pair interactions are calculated concurrently by parallel processes using circulating force accumulators and particle coordinates. A "standard" pair list algorithm (Section 2.1) has a time complexity O(N²) for N particles. Attributing spatial significance to the processors of a multi-processor computer can eliminate the explicit tabulation of the pair list (Hoshino & Takenouchi, 1984). A segmentable, logical grid suitable for asynchronous, multi-processor parallelism (Boris, 1986) avoids the need to check the N² possible interactions by spatially ordering the N particles. In the present paper, we describe a strategy for the effective application of large-granularity parallelism in non-bonded calculations on machines with a MIMD architecture, involving modifications to an existing sequential algorithm. Illustrative results are presented for the GROMOS molecular dynamics
program (Van Gunsteren & Berendsen, 1987; Berendsen et al., 1984a, b) on one such machine, the Stellar GS1000 (Stellar, 1989). The further effect of vectorization of the code for the vector processor in this machine is also considered.
2. SEQUENTIAL ALGORITHM
The calculation of the force array for non-bonded atom pairs in molecular dynamics is often divided into two parts: the pair list generation followed by the force calculation for the atom pairs. Within both phases, the computation can proceed independently for each atom i of the pairs ij. We illustrate this below using the specific nomenclature of GROMOS.
2.1. Pair list generation

For an N atom molecular assembly, a total of N(N - 1)/2 ij pairs exist, where i and j are sequential atom index numbers. This is naturally expressed with a loop containing a nested loop, where the outer loop ranges over the i of the ij pair and the inner loop ranges over the j. Figure 1a shows the general loop structure for the pair list generation in GROMOS. The pair list routine represents the non-bonded pairs for a given molecular configuration with the two arrays INB and JNB. Indexed by the atom sequence number i, INB(i) contains the number of j-partners in JNB for this i. The first memory cell (or element number) for atom i in JNB is just the cell following the last cell for atom i - 1; the last cell for atom i is the last cell for atom i - 1 plus INB(i). Since the induction variable of the inner loop (Fig. 1a) will always have a value greater than the induction variable of the outer loop, each ij pair will be considered once and only once. Thus, each cycle of the outer loop generates a unique set of ij pairs independently of other iterations of the loop. Using this method to generate the pair list, and assuming some uniform geometry, there will be more pairs with an i-value of m than pairs with an i-value of m + k, where m and k are positive integers. This observation will play an important role in adjusting the segmentation presented in the parallel version of this algorithm.

      DO 200 I = 1, N
      ...
         DO 100 J = I, N
         ...
            NJNB = NJNB + 1
            JNB(NJNB) = ATOM_INDEX_NUMBER
         ...
  100    CONTINUE
  200 CONTINUE

Fig. 1a. GROMOS loop structure for pair-list generation.
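To make the INB/JNB bookkeeping concrete, the following self-contained sketch builds a cutoff-based pair list for a four-atom system. It is an illustration only, not the GROMOS source: the program name PLDEMO, the coordinate array X and the cutoff RCUT are invented for the example, the inner loop starts at I + 1 so that self-pairs are skipped, and a simple distance test stands in for the cutoff and neighbor criteria elided by the dots in Fig. 1a.

C     Illustrative pair-list construction (not the GROMOS code):
C     for each atom I, count its J-partners in INB(I) and append
C     their indices to JNB, as described in Section 2.1.
      PROGRAM PLDEMO
      INTEGER MAXAT, MAXNB
      PARAMETER (MAXAT = 4, MAXNB = MAXAT*(MAXAT-1)/2)
      DOUBLE PRECISION X(3,MAXAT), RCUT, R2
      INTEGER INB(MAXAT), JNB(MAXNB), NJNB, I, J, K
      DATA X / 0.0D0, 0.0D0, 0.0D0,
     &         1.0D0, 0.0D0, 0.0D0,
     &         0.0D0, 1.5D0, 0.0D0,
     &         5.0D0, 5.0D0, 5.0D0 /
      RCUT = 2.0D0
      NJNB = 0
      DO 200 I = 1, MAXAT
         INB(I) = 0
         DO 100 J = I + 1, MAXAT
            R2 = (X(1,I)-X(1,J))**2 + (X(2,I)-X(2,J))**2
     &         + (X(3,I)-X(3,J))**2
            IF (R2 .LE. RCUT*RCUT) THEN
               NJNB = NJNB + 1
               JNB(NJNB) = J
               INB(I) = INB(I) + 1
            END IF
  100    CONTINUE
  200 CONTINUE
C     Print the pair list: atom I followed by each of its J-partners.
      K = 0
      DO 400 I = 1, MAXAT
         DO 300 J = 1, INB(I)
            K = K + 1
            WRITE (*,*) 'PAIR', I, JNB(K)
  300    CONTINUE
  400 CONTINUE
      END

Running this prints the pairs (1,2), (1,3) and (2,3); atom 4 lies beyond the cutoff of every other atom and so contributes nothing to JNB.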
2.2. Force calculation

The non-bonded force calculation proceeds by calculating the gradient of the potential energy of interaction for the atom pairs to yield the force on each atom. This part of the algorithm also uses a level-2 loop. The outer loop ranges over the atom sequence number i and the inner loop ranges over the set of js for i, thereby unravelling the atom pairs ij from the INB and JNB arrays (Fig. 1b). The time complexity for this calculation will be something less than order N²; while the pair-list generation considers approximately N² pairs, many are rejected for not satisfying distance cutoff or neighbor criteria.
= A’lUM_JNDEX_NUMB~
. . .
lF(KMIN.EQ.KMAx)GOTO200
. .. KMIN=KMlNtl
DCIlOOK=~,KMAX
. . . F(T)= F(I) t NEW-FORCE F(K) = F(K) - NEW-FORCE
. 100
cONTmul? KMIN=KMAX
200
cKBmNuB
Fig. lb. GROMOS loop structure for force calculation.
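A runnable counterpart to Fig. 1b, again an illustration rather than the GROMOS source: the subroutine name NBFTOY and the purely repulsive pair force are invented, and the partner atom index is fetched explicitly from JNB(K), the step hidden by the dots in the figure. It consumes a pair list laid out exactly as INB and JNB are described above, for example the one built in the sketch following Fig. 1a.

C     Toy non-bonded force accumulation over an INB/JNB pair list.
C     FORCE is zeroed here and receives the accumulated forces.
      SUBROUTINE NBFTOY(NATOM, X, INB, JNB, FORCE)
      INTEGER NATOM, INB(NATOM), JNB(*)
      DOUBLE PRECISION X(3,NATOM), FORCE(3,NATOM)
      INTEGER I, J, K, KMIN, KMAX, L
      DOUBLE PRECISION DX(3), R2, FSCAL
      DO 10 I = 1, NATOM
         FORCE(1,I) = 0.0D0
         FORCE(2,I) = 0.0D0
         FORCE(3,I) = 0.0D0
   10 CONTINUE
      KMIN = 0
      DO 200 I = 1, NATOM
         KMAX = KMIN + INB(I)
         IF (KMIN .EQ. KMAX) GO TO 200
         KMIN = KMIN + 1
         DO 100 K = KMIN, KMAX
            J = JNB(K)
            DX(1) = X(1,I) - X(1,J)
            DX(2) = X(2,I) - X(2,J)
            DX(3) = X(3,I) - X(3,J)
            R2 = DX(1)**2 + DX(2)**2 + DX(3)**2
C           Invented pair force: the contribution DX/R2 is added to
C           atom I and subtracted from atom J (Newton's third law).
            FSCAL = 1.0D0 / R2
            DO 50 L = 1, 3
               FORCE(L,I) = FORCE(L,I) + FSCAL*DX(L)
               FORCE(L,J) = FORCE(L,J) - FSCAL*DX(L)
   50       CONTINUE
  100    CONTINUE
         KMIN = KMAX
  200 CONTINUE
      END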
The force on each atom, accumulated in an array, may have numerous contributions depending on the number of pairs the atom forms. These separate contributions are, however, completely independent of one another.

3. PARALLEL ALGORITHM

Using the fact that the pair list construction and the pairwise force calculation for each atom and atom pair are independent of other atoms and atom pairs, the non-bonded calculation can be performed concurrently. The strategy used to take advantage of this parallelism involves segmenting the array INB and assigning to each processor one of the segments. In what follows, a four-processor architecture is assumed; in principle, however, INB can be segmented into an arbitrary number of segments, not to exceed the number of atoms. In implementing this scheme, the changes to GROMOS were kept to a minimum. In both phases of the algorithm a setup section preceding the parallel sections calculates pointers and initializes arrays. At synchronization points following the parallel sections, results are merged into the data structures used throughout other parts of GROMOS; one such merge is sketched below.
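A minimal sketch of one such merge, assuming (as in Section 3.2 below) that each thread accumulates forces into its own private copy of the force array; the names FMERGE and FPART and the explicit thread count are invented for the illustration, not GROMOS identifiers.

C     Merge per-thread partial force arrays into the global array F.
C     FPART(L,I,IT) holds the contribution accumulated by thread IT.
      SUBROUTINE FMERGE(NATOM, NTHRD, FPART, F)
      INTEGER NATOM, NTHRD, I, IT, L
      DOUBLE PRECISION FPART(3,NATOM,NTHRD), F(3,NATOM)
      DO 300 I = 1, NATOM
         DO 200 L = 1, 3
            F(L,I) = 0.0D0
            DO 100 IT = 1, NTHRD
               F(L,I) = F(L,I) + FPART(L,I,IT)
  100       CONTINUE
  200    CONTINUE
  300 CONTINUE
      END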
3.1. Pair-list generation

In pair-list generation, each processor calculates the pair list for one segment of atoms, that is, a segment of INB. Each processor executes an identical copy of the pair-list generation subroutine, differing from the original, sequential version primarily in the loop control variables. Figure 2a shows the modified loops in subroutine Nbpal of GROMOS. The output from each calculation, the jnb segments corresponding to the inb segments, is merged into JNB at the synchronization point. It is not feasible to share JNB among the processors during its construction: even if the cost of merging separate jnb segments were found to exceed that of sharing JNB, the elements in JNB must be contiguous by atom for each atom i, which can be accomplished only after the calculation for the segments of INB is complete.

      DO 200 I = A, B
      ...
         DO 100 J = I, N
         ...
            NJNB = NJNB + 1
            JNB(NJNB) = ATOM_INDEX_NUMBER
         ...
  100    CONTINUE
  200 CONTINUE

Fig. 2a. Modified GROMOS loop structure for pair-list generation.

3.2. Force calculation

In the force calculation, segments of INB are passed to each processor so that each processor calculates the force for a partition of the set of ij pairs. As with pair-list generation, the identical copy of the non-bonded pair force routine used by each processor differs from the sequential version mainly in the loop control variables. Figure 2b shows the modified loops in subroutine Nonbal of GROMOS. An atom number will appear as an i of an ij pair in one and only one segment. An atom number can, however, appear in more than one segment as a j. Since the force array is indexed by atom sequence number, more than one processor can access the same cell, corresponding to some atom, simultaneously. Here, treating the access of the force array as a critical section (Andrews & Schneider, 1983) or setting up a separate force array for each processor, to be merged at a synchronization point following the concurrent force calculation, are both viable options. For the implementation reported here, the latter approach was used.

      KMIN = KMIN_PARAMETER
      DO 200 I = A, B
         KMAX = KMIN + INB(I)
         IF (KMIN .EQ. KMAX) GO TO 200
         ...
         KMIN = KMIN + 1
         DO 100 K = KMIN, KMAX
         ...
            F(I) = F(I) + NEW_FORCE
            F(K) = F(K) - NEW_FORCE
         ...
  100    CONTINUE
         KMIN = KMAX
  200 CONTINUE

Fig. 2b. Modified GROMOS loop structure for force calculation.

3.3. Load balancing

It is most desirable to keep each processor equally busy, since an even distribution of work minimizes the real time that a calculation will take. When a processor finishes its work with its assigned INB segment, it must wait until all segments are processed before the molecular dynamics calculation can proceed. This wait can be viewed as a measure of wasted resources, assuming one is interested in maximizing the real-time performance of the calculation. The key to creating an even distribution lies in the segmenting of INB. Given the way in which the pair-list generating portion of the algorithm works, simply segmenting INB into four equal parts will not achieve balanced loads. In what follows, a set of closed expressions is derived to closely approximate an even load distribution.

Let a, b, c and d be the pointers into INB for processors 1, 2, 3 and 4, respectively. Define the segments of INB for the four processors to be [1, a], [a + 1, b], [b + 1, c] and [c + 1, d], where d = N. Assume an eight-atom system (N = 8) with INB evenly segmented for four processors, so that a = 2, b = 4, c = 6 and d = 8. Consider the following loop:

      DO 100 I = A, B
         DO 100 J = I, N
  100 Statement

We are interested in executing Statement an equal number of times on each processor. The loop parameters A and B are the values for segment [A, B] of INB. With the even segmentation of INB, Statement will be executed 15, 11, 7 and 3 times for processors 1, 2, 3 and 4, respectively. In the subcolumns of Table 1, each iteration of the inner loop has a single entry. The pairs of subcolumns contain the atom numbers, or loop index of the inner loop, for each of the two iterations of the outer loop. A close look at Table 1 reveals that like-atom interactions, where i equals j, are considered; this is done for clarity and does not affect the arguments used here or their applicability to the physical picture.
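To see where these counts come from, note that for outer index i the inner loop of the example runs from i to N and so executes N - i + 1 times. For N = 8 and even segmentation, processor 1 therefore executes 8 + 7 = 15 iterations, processor 2 executes 6 + 5 = 11, processor 3 executes 4 + 3 = 7, and processor 4 executes 2 + 1 = 3, so the first processor does five times the work of the last.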
The loads for the four processors can be expressed in general with the following four series:

    S_1 = \sum_{i=z+1}^{N} i,    S_2 = \sum_{i=y+1}^{z} i,
    S_3 = \sum_{i=x+1}^{y} i,    S_4 = \sum_{i=1}^{x} i,            (1)

where S_j corresponds to processor j, and i represents the number of iterations of the inner loop for each cycle of the outer loop. For even segmentation, x = 2, y = 4 and z = 6. In Table 2 the values of the induction variable J (inner loop) and the number of iterations of Statement in the loop shown above are given in terms of x, y and z. This table makes evident the connection between the summations of equations (1) and the execution of the Fortran loop. Each line of Table 2 represents a single cycle of the outer loop, that is, a single value of the induction variable I. From Table 2 we note that the pointers segmenting INB can be expressed as:

    a = N - z,    b = N - y,    c = N - x,    d = N.                (2)

By choosing x, y and z such that

    S_1 = S_2 = S_3 = S_4,                                          (3)

the a, b, c and d of equations (2) can be used to segment INB so as to achieve a nearly even distribution of work for the four processors.
Table 1. Load distribution for even segmenting of INB

      P1          P2          P3          P4
    1    2      3    4      5    6      7    8
    2    3      4    5      6    7      8
    3    4      5    6      7    8
    4    5      6    7      8
    5    6      7    8
    6    7      8
    7    8
    8
From the closed forms for equations (1), condition (3) becomes (each sum \sum_{i=m+1}^{n} i equals (n + m + 1)(n - m)/2, and the common factor of 1/2 cancels):

    (z + y + 1)(z - y) = (y + x + 1)(y - x),
    (y + x + 1)(y - x) = (x + 1)x,
    (N + z + 1)(N - z) = (z + y + 1)(z - y).                        (4)

An exact solution to equations (4) will not usually be possible with x, y and z on the natural numbers; assuming x, y and z are large compared to 1, equations (4) may be approximated by:

    (z + y)(z - y) = (y + x)(y - x),
    (y + x)(y - x) = (x + 1)(x - 1),
    (N + z)(N - z) = (z + y)(z - y).                                (5)
After some algebra (equations (5) give y² ≈ 2x², z² ≈ 3x² and N² ≈ 4x²), the values of x, y and z are:

    x = N/2,    y = (\sqrt{2}/2) N,    z = (\sqrt{3}/2) N.          (6)

Table 2. Iteration cycles of doubly nested loop

Processor/
segment    J-indices                            Iterations
   1       1, 2, ..., N-1, N                    N
           2, 3, ..., N-1, N                    N-1
           ...                                  ...
           N-z, N-z+1, ..., N-1, N              z+1
   2       N-z+1, N-z+2, ..., N-1, N            z
           N-z+2, N-z+3, ..., N-1, N            z-1
           ...                                  ...
           N-y, N-y+1, ..., N-1, N              y+1
   3       N-y+1, N-y+2, ..., N-1, N            y
           N-y+2, N-y+3, ..., N-1, N            y-1
           ...                                  ...
           N-x, N-x+1, ..., N-1, N              x+1
   4       N-x+1, N-x+2, ..., N-1, N            x
           N-x+2, N-x+3, ..., N-1, N            x-1
           ...                                  ...
           N                                    1
Using equations (2) and equations (6), the segmentation of INB will result in a nearly even number of iterations for the non-bonded calculations. (In terms of N, the pointers are a = 0.13397 N, b = 0.29289 N, c = 0.50000 N and d = N.) For the pair-list generation the scheme is close to exact, since all pairs must be considered. For the force calculation, on the other hand, the success of the scheme depends on the geometry of the molecular assembly. For actual results with some test simulations see Section 5.
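A minimal sketch of this one-time setup, computing the four segment pointers from equations (2) and (6) and checking the resulting loads with the inner-loop counts underlying equations (1); the program name SEGPTR, the choice N = 1000 and the variable names are invented for the illustration.

C     Compute load-balanced INB segment pointers for four processors
C     from equations (2) and (6), and report the resulting loads.
      PROGRAM SEGPTR
      INTEGER N, A, B, C, D, I, S(4), ISEG
      DOUBLE PRECISION DN
      N = 1000
      DN = DBLE(N)
      A = NINT((1.0D0 - SQRT(3.0D0)/2.0D0)*DN)
      B = NINT((1.0D0 - SQRT(2.0D0)/2.0D0)*DN)
      C = NINT(0.5D0*DN)
      D = N
C     Count inner-loop iterations (N - I + 1) falling in each segment.
      S(1) = 0
      S(2) = 0
      S(3) = 0
      S(4) = 0
      DO 100 I = 1, N
         IF (I .LE. A) THEN
            ISEG = 1
         ELSE IF (I .LE. B) THEN
            ISEG = 2
         ELSE IF (I .LE. C) THEN
            ISEG = 3
         ELSE
            ISEG = 4
         END IF
         S(ISEG) = S(ISEG) + (N - I + 1)
  100 CONTINUE
      WRITE (*,*) 'POINTERS', A, B, C, D
      WRITE (*,*) 'LOADS   ', S(1), S(2), S(3), S(4)
      END

For N = 1000 the four loads come out within about 0.2% of one another, whereas even segmentation of INB would give roughly a 7:5:3:1 split.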
4. IMPLEMENTATION

Until now the discussion has revolved around a four-processor architecture where segments of INB are assigned to individual processors. How this assignment is made is irrelevant to the overall procedure. With the Stellar GS1000, the machine used for this implementation, the Unix operating system schedules concurrent segments of a process (threads or subprocesses) for execution. In this multitasking environment [similar to Cray microtasking (Bieterman, 1988)], the processor-to-concurrent-segment association breaks down, since a concurrent segment can, at different times during its execution, be assigned to more than one processor.

4.1. Stellar GS1000

The Stellar GS1000 graphics supercomputer provides a single vector/floating-point unit with a 45 MFLOP peak performance. The MIMD part of the machine consists of four identical logical processors with an aggregate rate of 20-25 MIPS sharing a
high-speed, 1 MB memory cache with a 1.28 GB/s bandwidth; up to 128 MB of main memory can be accessed at a rate of 320 MB/s. The Stellar compiler can automatically parallelize and vectorize suitable code. A common situation is a two-deep loop with the appropriate data dependencies, where the inner loop is transformed into a vector instruction and the outer loop is executed on more than one logical processor (Allen & Kennedy, 1987). In many cases a program will contain inherent parallelism that cannot be detected by the compiler, or that requires assumptions to be made that could lead to an unfaithful translation of the source code. The Stellar Fortran compiler provides directives for these instances, whereby the programmer can require, for example, that a loop or program segment be parallelized. In Fig. 3, the calls to the Fortran subroutines Nonbal and Nbpal illustrate an application of the Parallel directive to achieve concurrent execution of these subroutines. The Parallel directive sets up a parallel region for the loop immediately following it, such that four threads concurrently execute the four calls to the non-bonded subroutines. Concurrent multiple entries to the same subroutine can be correctly achieved in Stellar Fortran by compiling the subroutine with the reentrant flag; as a result, the local variables for the subroutine are stored on the stack (automatic variables) as opposed to fixed memory locations (static variables). Using the same copy of a subroutine, instead of creating a separate copy for each thread, eliminates the overhead of maintaining multiple, identical copies, with the added benefit of improving the program's locality of reference during its execution.
C     Index work for concurrent calls
CSDIR PARALLEL
      DO 100 I = 1, 4
         CALL NBPAL(standard parameters, additional indices)
  100 CONTINUE
C     Join results from Nbpal
      ...
CSDIR PARALLEL
      DO 200 I = 1, 4
         CALL NONBAL(standard parameters, additional indices)
  200 CONTINUE
C     Join results from Nonbal

Fig. 3. Concurrent invocations of subroutines Nbpal and Nonbal.
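The "join results" steps indicated by the comments in Fig. 3 are not shown in the original listing. For the pair list, the join concatenates the per-segment jnb arrays into the single JNB array in segment order, which keeps the entries contiguous by atom as required in Section 3.1. The following sketch uses invented names (JNBSEG holds each thread's segment, LENSEG its length); it is an illustration, not the GROMOS routine.

C     Concatenate per-segment pair lists into the global JNB array,
C     in segment order, so that entries remain contiguous by atom.
      SUBROUTINE JOINNB(NSEG, MAXLEN, LENSEG, JNBSEG, JNB, NJNB)
      INTEGER NSEG, MAXLEN, LENSEG(NSEG), JNBSEG(MAXLEN,NSEG)
      INTEGER JNB(*), NJNB, IS, K
      NJNB = 0
      DO 200 IS = 1, NSEG
         DO 100 K = 1, LENSEG(IS)
            NJNB = NJNB + 1
            JNB(NJNB) = JNBSEG(K,IS)
  100    CONTINUE
  200 CONTINUE
      END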
Table 3. Test case timings (a)

                                    Argon (b)   Peptide (b)   DNA (c)
Standard sequential scalar             154          534         711
Convex sequential scalar               165          644         853
Standard parallel scalar                51          181         277
Convex parallel scalar                  55          216         302
Standard sequential vectorized         128          420         578
Standard parallel vectorized            49          161         249
Convex sequential vectorized            61          182         276
Convex parallel vectorized              28           93         144

(a) Wall clock seconds on a dedicated machine, 32-bit precision (approx. 14% more time for 64-bit precision).
(b) 50 steps, update pair list every 10 steps, output every 10 steps.
(c) 10 steps, 1 pair-list update, output every step.
5. RESULTS
Two pairs of GROMOS non-bonded routines were modified to include concurrent pair-list generation and concurrent force calculations: the standard non-bonded routines, not written for any hardware in particular, and the non-bonded routines structured for the Convex vectorizing compiler. (The general and Convex-structured routines are from the 1987 standard release tape of the GROMOS software.) Prior to gathering timings for the performance of the parallel version of GROMOS, its correctness was ascertained. Molecular dynamics simulations were carried out for 1 and 20 ps using a step size of 1 fs. The analyses of these simulations were compared with the analyses of identical simulations performed with GROMOS running on a VAX 8650. The 1 ps analysis compared phase space coordinates and energies (Karplus & McCammon, 1983). The 20 ps simulations were used to calculate solvent radial distribution functions (Watanabe & Klein, 1989) and diffusion constants (Allen & Tildesley, 1987). Among the test systems used were: a box of 864 Argon atoms (Argon system); a 1037-atom system consisting of a five-residue peptide with four ions in 322 waters (peptide system); and 4 DNA fragments with 2234 atoms and 22 sodium ions in a bath of 1220 water molecules (DNA system).

5.1. Timings
Using the test cases described above, 10-50 steps of molecular dynamics were performed using three different versions of GROMOS. These results, reported in Table 3, allow the comparison of the sequentially executed standard and Convex (highly vectorizable) non-bonded routines with their concurrent counterparts.

The gain from simply vectorizing the standard non-bonded routines is modest (compare standard sequential scalar with standard sequential vectorized). A more substantial speedup with minor code modification (2.6-3 times) is obtained by parallelization (compare standard sequential scalar with standard parallel scalar). Vectorization of the Convex code is much more effective than vectorization of the standard code, yielding speedups of 2.7-3.5 times.
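As a worked example of how these factors are read from Table 3: for the peptide case, 534 s/181 s ≈ 3.0 for parallelization of the standard scalar code, and 644 s/182 s ≈ 3.5 for vectorization of the Convex code.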
Table 4. Percentage time in threads for Nonbal and Nbpal

             Thread number
             1        2        3        4
Nbpal      36/23    31/22    26/23     7/23
Nonbal     40/29    30/24    21/24     9/24

Each entry indicates % without load balancing / % with load balancing.
Parallelization of the vectorized codes increases the overall speedup factors to 2.9-3.3 for the standard code, and 5.9-6.9 for the Convex code. The greater average speedup obtained by parallel execution of the standard, vectorized version (2.5 times) compared to that for the parallel execution of the Convex, vectorized version (2 times) may reflect contention for the vector processor, but this requires further analysis.

5.2. Profiling

With Stellar profiling software, the percentage of time spent by the threads in the concurrent regions was determined. These normalized data are presented in Table 4, where the Argon case was used for Nbpal (calculates the pair list) and the peptide case was used for Nonbal (calculates the forces).

6. CONCLUSIONS

Not all existing programs are amenable to significant gains via vector optimizations without undergoing extensive rewriting. The subroutines Nonbal and Nbpal in standard GROMOS are sufficiently convoluted (that is to say, vector resistant) that, without restructuring of the code, speedups from vector optimizations are modest, as can be seen from Table 3. With relatively minor modifications to this code, a much greater speedup has been achieved by load-balanced parallelization. The theoretical upper bound for speed enhancement by parallelizing the entire program is the number of processors, or four in this case, and the attained speedup factor is about three. While the parallelizing scheme in this paper is presented for a four-processor computer, in theory each atom i of the ij pairs could be assigned to a separate processor.

It comes as no surprise that the load-balancing scheme works better for the pair-list generation routine Nbpal than for the force calculation routine Nonbal; there is no guarantee that the loop of Fig. 2b will iterate like the idealized loop discussed above, but the pair-list loop of Fig. 2a always will. To achieve a more exact segmentation of INB for the force loop would require a new analysis after every non-bonded pair-list update: the scheme would have to consider the contents of JNB directly, because the iterations of the inner loop are geometry dependent.
The segmentation scheme presented here, on the other hand, requires a single execution for the entire dynamics calculation, and is seen to be quite effective. When combined with effective vectorization, the use of MIMD parallelism yields significant aggregate speedups (about six times) on the Stellar GS1000. Comparative benchmarks show that the resulting speed is about five times that of the DEC VAX 8650 and about 0.35 times that of a single-processor Cray X-MP.

Acknowledgements - We wish to thank Wilfred van Gunsteren for providing the standard and Convex versions of GROMOS, David J. Mullally of Stellar Corporation for pointing out the use of the recursive flag and the subsequent use of a loop for structuring the concurrent calls, and Chung F. Wong for many helpful discussions concerning the GROMOS programs. This work was supported in part by the National Science Foundation, the Robert A. Welch Foundation, ONR, and the Texas Advanced Research Program. J.A.M. is the recipient of the 1987 G. H. Hitchings Award from the Burroughs Wellcome Fund.
REFERENCES

Allen R. & Kennedy K. (1987) ACM Trans. Program. Lang. Systems 9, 491.
Allen M. P. & Tildesley D. J. (1987) Computer Simulation of Liquids. Clarendon Press, Oxford.
Andrews G. R. & Schneider F. B. (1983) Comput. Surv. 15, 3.
Berendsen H. J. C., van Gunsteren W. F. & Postma J. P. M. (1984a) High-Speed Computation (Edited by J. S. Kowalik), p. 425. Springer-Verlag, Berlin.
Berendsen H. J. C., Postma J. P. M., van Gunsteren W. F., DiNola A. & Haak J. R. (1984b) J. Chem. Phys. 81, 3684.
Bieterman M. (1988) J. Supercomput. 2, 381.
Boris J. (1986) J. Comput. Phys. 66, 1.
Brooks C. L., Karplus M. & Pettitt B. M. (1988) Adv. Chem. Phys. 71, 1.
Clementi E., Corongiu G., Detrich J., Khanmohammadbaigi H., Chin S., Domingo L., Laaksonen A. & Nguyen N. L. (1985) Physica 131B, 74.
Cochrane D. L. & Truhlar D. G. (1988) Parallel Comput. 6, 63.
Fincham D. (1987) Molec. Simul. 1, 1.
Hockney R. W. & Jesshope C. R. (1981) Parallel Computers. Adam Hilger Ltd, Bristol.
Hoshino T. & Takenouchi K. (1984) Comput. Phys. Commun. 31, 287.
Karplus M. & McCammon J. A. (1983) Annu. Rev. Biochem. 53, 263.
McCammon J. A. & Harvey S. C. (1987) Dynamics of Proteins and Nucleic Acids. Cambridge University Press.
Stellar (1989) Stellar Computer Inc., 85 Wells Ave, Newton, Mass.
Van Gunsteren W. F. & Berendsen H. J. C. (1987) GROMOS: GROningen MOlecular Simulation Computer Package. BIOMOS, Biomolecular Software, Laboratory of Physical Chemistry, University of Groningen, Nijenborgh, The Netherlands.
Wallace K. J. (1988) Phil. Trans. R. Soc. Lond., Ser. A 326, 481.
Watanabe K. & Klein M. L. (1989) Chem. Phys. 131, 157.