Computer Physics Communications 174 (2006) 521–529 www.elsevier.com/locate/cpc
Multibillion-atom molecular dynamics simulation: Design considerations for vector-parallel processing

D.C. Rapaport a,b

a Department of Physics, Bar-Ilan University, Ramat-Gan 52900, Israel (permanent address)
b Center for Engineering Science Advanced Research, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

E-mail address: [email protected] (D.C. Rapaport)
Received 1 December 2004; accepted 28 November 2005 Available online 18 January 2006
Abstract

Progress in adapting molecular dynamics algorithms for systems with short-range interactions to utilize the features of modern supercomputers is described. Efficient utilization of the latest generation of processor architectures requires algorithms that can be both vectorized and parallelized. The approach adopted for vectorization involves combining the layer and neighbor-list methods, while parallelization employs spatial subdivision with explicit communication. The techniques presented here have been used in performance tests on the Cray X1 vector-parallel supercomputer with systems containing over 12 billion atoms.
© 2005 Elsevier B.V. All rights reserved.

PACS: 02.70.Ns

Keywords: Molecular dynamics simulation; Supercomputer; Vector processing; Parallel processing; Algorithm; Performance evaluation
1. Introduction

Although it is difficult to compare progress in different fields of human endeavor, few achievements can match the sustained, power-law growth in computer performance over the past several decades, a situation that may well persist for the foreseeable future. Two factors are the principal contributors to this trend. The first is the growth in raw processing speed due to technological advances: smaller component sizes that allow denser packing and higher levels of integration, faster clock speeds and a richer collection of hardware features. The second, a consequence of falling costs and reduced power requirements, is the ability to use large-scale parallelism, with hundreds, thousands, and now even tens of thousands of processors linked together and bringing their combined power to bear on a single calculation.

Certain kinds of computations are able to utilize the advanced hardware features of modern supercomputers with very
little effort on the part of the software designers, whereas others require major algorithmic overhaul if they are to utilize the hardware efficiently. While computations based on standard matrix–vector operations fall into the former category, since the efficient handling of this type of work is an important hardware design consideration, molecular dynamics (MD) simulation [1] lies in the latter, the implication being that while calculations of this type can run reasonably efficiently on these machines, a certain degree of effort is required to make this possible.

The focus of this paper is on incorporating the two most prominent features of the latest high-performance computers, namely vector processing and parallelism, into a computational scheme capable of effective machine utilization. Earlier efforts in this field have been documented in a series of papers devoted to applying vector and parallel processing to MD simulation [2–4]; the titles of these papers included the term 'multimillion', which, at the time they were written, was considered to be at, or even slightly beyond, the state of the art. The present title implies systems that are larger by some three orders of magnitude, reflecting the technological advances that have taken place in the interim. The computer under consideration here is the Cray X1 [5].
2. Basic molecular dynamics algorithms

Molecular dynamics simulation deals with matter at the atomistic level, treating atoms and molecules as classical (i.e. nonquantum) particles and solving their Newtonian equations of motion to provide the atomic trajectories [1]. From these trajectories all manner of thermodynamic, structural and transport properties can be extracted. MD is capable, in principle, of reproducing any kind of laboratory measurement; in addition, it allows other kinds of observations that have no real-world counterparts but which, nevertheless, can help contribute to understanding the behavior and relating it to theory. The atomic properties and the nature of the interactions depend on the system being modeled; examples include spherical atoms interacting with short-range forces, rigid bodies whose interactions involve several suitably located interaction sites, flexible molecules held together by internal forces or rigid links with restricted degrees of freedom, and charged or dipolar atoms with long-range interactions. Specialized techniques have been developed to allow these and many other models to be dealt with efficiently.

One common aspect of essentially all MD computations is the short-range interactions between spherically symmetric atoms; this can often constitute the dominant portion of the computation, while other interactions required by a particular model are computationally less demanding. Other aspects of the simulations include integrating the equations of motion (typically, a leapfrog integrator is used, as is the case here), dealing with boundary conditions (chosen to be periodic in most instances), establishing the initial state, saving results, and various other issues specific to individual problems; in general, these are not heavy tasks in terms of the amounts of computation required.

The focus of the present paper is on the interaction calculations; in most MD applications, it is the evaluation of the interatomic forces that constitutes the major portion of the work. The interaction considered here is based on the truncated Lennard-Jones, or soft-sphere, pair potential. Given a pair of atoms i and j located at r_i and r_j, the potential function is
$$
u(r_{ij}) = \begin{cases} 4\varepsilon\left[(\sigma/r_{ij})^{12} - (\sigma/r_{ij})^{6}\right] + \varepsilon, & r_{ij} < r_c = 2^{1/6}\sigma, \\ 0, & r_{ij} \geq r_c, \end{cases}
$$
where r_ij ≡ |r_i − r_j|. The parameter ε governs the strength of the interaction and σ, roughly equal to the atomic diameter, defines a length scale; for convenience, and with no loss of generality, both can be set equal to unity.

In its most naive form, the interaction calculation employs a double loop, covering all pairs of atoms, that evaluates the forces acting on each atom due to all other atoms within the interaction range r_c. Despite the fact that r_c is very small compared with the overall system size, the amount of work is proportional to N_a(N_a − 1)/2, where N_a is the number of atoms in the system. This is clearly an unacceptable amount of work if systems beyond minimal size are to be considered. The introduction of a little bookkeeping [1] reduces the computational effort to one that grows linearly with N_a. To do this, the simulation region is subdivided into a set of identical cells; the criterion for choosing the cell size is that the cell edges
should exceed r_c, with empirical evidence suggesting that cells should be made as small as possible subject to this condition (and also subject to there being an integral number of cells in all directions). At each time step the atoms are assigned to their cells; the force calculations are then carried out by considering the atoms in a given cell and pairing them with those in each of the neighboring cells (only half the neighbors are actually needed). Since both tasks require O(N_a) computation, linear size dependence is assured.

The computational effort cannot grow more slowly than linearly, but the proportionality constant can be reduced by constructing a list of interacting atom pairs and then using this list over several time steps. For the list to remain valid over this interval, it must include neighboring atoms lying a little beyond r_c, namely up to a range r_n = r_c + δ, where the value of δ is determined empirically; this means the cell size must be enlarged to exceed r_n, rather than just r_c. Determining the duration of validity of the neighbor list requires monitoring the maximum velocity to determine the earliest moment any atom pair initially separated by more than r_n could approach to within a distance r_c, and rebuilding the list just before this can happen. Use of the neighbor list reduces the number of pairs examined for the force computations (in 3D) by a factor of (4π/3)r_n^3 / 27r_c^3 (typically ≈ 1/3), which is the ratio of the volumes surrounding any given atom examined by the two methods.

3. Vectorized algorithm

Vector processing is based on decomposing machine instructions (such as addition and multiplication) into sequences of steps and using pipelined (i.e. overlapped, or assembly-line) processing to achieve a greatly improved throughput per machine cycle. A certain amount of pipelining is present in all modern processors, but high-performance vector computers, a recent example being the Cray X1 [5], take this approach to the extreme, particularly for floating-point arithmetic. This leads to substantial performance gains for those computations that are suitably structured; if, on the other hand, a computation is unable to satisfy the organizational requirements for vectorization, the performance can be severely degraded. The most general criteria for successful vectorization are that identical operations are repeated over comparatively large sets of data items organized as linear arrays, and that the calculations are entirely independent of one another—meaning that the output of one iteration is not the input to a later one. In some instances, these requirements can be relaxed slightly; for example, conditional execution can be included, as can indexed, random memory accesses, and even certain data dependencies provided they involve loop iterations that are well separated in time. Performance issues related to individual hardware designs are less likely to arise if the stricter requirements can be satisfied.

The nature of the cell and neighbor-list methods renders them unsuitable for vector processing. Both methods fail to meet the requirements in regard to the iteration counts of the innermost loops, which tend to be extremely low, and the data
independence, due to there being multiple contributions to the forces on each atom. An entirely different means of data organization is essential if vectorization is to succeed. The strategy used for vectorizing the cell method appears in [1,2]; the reason for preferring cells to neighbor lists in earlier work was a shortage of memory, a serious issue with older generations of supercomputers. The present paper extends the technique in order to vectorize the neighbor-list method. In this approach, which involves organizing the cell occupancy data into 'layers', the all-important innermost loops become vectorizable because all data dependence between loop iterations [6] is avoided. The resulting algorithm is sufficiently complex, however, that this lack of dependence is not apparent to compilers attempting automatic vectorization, so that manually invoked compiler directives (or 'pragmas') are needed to help accomplish this, as discussed later on.

3.1. Neighbor-list construction

The computation begins with cell assignment but, unlike the usual cell approach [1], a linked list of atoms belonging to each cell is not required. Periodic boundaries are avoided in the single-processor case by replicating atoms close to boundaries [1,2] (the atom copying used for parallel processing is similar—see below); if the region size is L_x (etc.), then L_x' = L_x + 2w_x is the size after enlargement to hold the replicas, where w_x = L_x/G_x is the cell width and G_x = ⌊L_x/r_n⌋ is the number of cells in the x-direction (r_n was defined above). N_a' is the number of atoms to be processed; this includes actual as well as replicated atoms. The cell occupied by atom i is denoted by c_i; q_i is initially set to the atom index i, but will be updated subsequently during layer assignment.

for i = 1 to N_a' do
  g_x = ⌊(r_ix + L_x/2 + w_x)/w_x⌋
  . . .
  c_i = (g_z G_y + g_y) G_x + g_x + 1
  q_i = i
enddo

In the parallel version, w_x and G_x are defined in the same way but with the subregion extent B_x^(hi) − B_x^(lo) replacing L_x, where B_x^(lo) and B_x^(hi) are the limits of the x-coordinate range covered by a particular subregion (which must be extended by w_x in either direction as before), and in the expression for g_x above, L_x/2 is replaced by −B_x^(lo). The i-loop includes all atoms copied into the processor from neighbor subregions, rather than the replicated atoms of the single-processor case.

The usual approach to neighbor-list construction [1] employs a set of nested program loops. The outermost loop considers all cells, and within this is a loop over all valid offsets between neighboring cells; in addition to itself, a cell has 26 neighbors, but only half need be considered. The two innermost loops generate all pairs of atoms, one atom belonging to each cell; in the case of zero offset, where both are from the same cell, only unique atom pairs are generated. Each pair is tested, and for separations < r_n the atom indices are added to the neighbor list. On conventional (i.e. nonvector) processors
this approach is highly efficient, but on vector processors it is doomed to failure for reasons given previously. An alternative strategy that does allow vectorization can be summarized as follows; the details appear subsequently. The outermost loop in the revised algorithm is now over possible offsets between neighboring cells. If the cell occupants are assumed to be ordered in some (arbitrary) way, then the role of the next two loops is to produce all valid pairings of the ith and jth occupants of whatever pair of cells happens to be under consideration by the inner loop (the case where both cells are the same is included); this loop ordering is the opposite of that used in [1,2]. The fourth, innermost loop scans all the cells, with the second cell of the pair determined by the specified offset. If the number of cells is similar to the number of atoms, this inner loop will have the large repetition count desirable for vectorization. The reordered cell occupancy data required by this approach are stored in layers.

Thus the second stage in generating the neighbor list is the layer assignment, in which layers are populated one at a time. Each layer provides one storage element per cell, so maximum cell occupancy determines the number of layers needed. During each iteration of the loop over layers, atoms are assigned to the layer elements corresponding to the cells they occupy; since cells can be multiply occupied, the identities of these atoms can change, and it is the final atoms encountered that are the ones actually assigned. Successfully assigned atoms are then removed from further consideration, by deleting them from the array of atom indices q_i, before starting work on the next layer. The process is repeated until all atoms have been assigned; the initial layers will be practically full, but later layers will be more sparsely populated.

The layer contents are stored in a doubly-indexed array, h_nm, with n denoting the layer number and m the index of the corresponding cell. To avoid concern over exceeding the array limits when dealing with boundary cells in the algorithm below, h_nm is padded (for each n) to a total size of n_l^(max) × (2m_0 + G), where n_l^(max) is the maximum number of layers allowed (and the maximum cell occupancy), m_0 = G_x(G_y + 1) + 1, and G = G_x × G_y × G_z is the total number of grid cells. All h_nm are initially set to zero. The layer assignment algorithm is as follows (the more concise notation of [2] is not used here since it is less useful for programming purposes).

n_l = 0
n_a = N_a'
do while n_a > 0
  n_l = n_l + 1
  for m = 1 to G do h_{n_l, m_0+m} = 0
  for i = 1 to n_a do h_{n_l, m_0+c_{q_i}} = q_i
  for m = 1 to G do
    if h_{n_l, m_0+m} > 0 then c_{h_{n_l, m_0+m}} = 0
  enddo
  i = 0
  for n = 1 to n_a do
    if c_{q_n} > 0 then
      i = i + 1
      q_i = q_n
    endif
  enddo
  n_a = i
enddo

In the final stage, the populated layers are used to construct the neighbor list; this is where the present approach differs from [2], where layers are used directly for the interaction computations. Vectorization requires that the neighbor list be organized in a manner that not only ensures data independence but also provides vectors of adequate length for efficient processing. The present approach is designed to accomplish this.

The algorithm below contains a set of quadruply-nested loops used to construct the neighbor list. Of the three outer loops, the outermost chooses the relative cell offset between the layers, within this there is a loop over possible index differences between the pair of layers under consideration, and within this is a loop that picks the index of the first layer. The innermost fourth loop, the one that must be vectorized, scans all the paired cells in the chosen layers with the specified relative offset; there is a test (e_m + e_{m+s} > 0, where e is an array of size 2m_0 + G used to distinguish cells interior to the region, or subregion in the parallel version, from those of the boundary which contain only replicated, or copied, atoms) to ensure that at least one of the paired cells is an interior cell; if this is true, and if, in addition, both layers have occupants in the chosen cell positions, then, if this atom pair passes the distance test, it is added to the neighbor list. An earlier extension of the layer approach to produce vectorizable neighbor lists [7] used a similar, although slightly different, loop organization. (The possible benefit of using subcells, introduced [8] in order to limit the number of atom pairs examined on nonvector processors, is left as an exercise for the reader.) The amount of work involved depends quadratically on the number of layers, and linearly on the number of cells (and hence the system size).

The neighbor list is generated in segments, each containing a set of atom pairs whose interaction calculations can be vectorized safely. The results of several iterations of the inner loops (indexed by l_1 and m) are combined into a single segment, and a new segment is begun only when there is a risk that the maximum allowed segment size might be exceeded on the next l_1-loop iteration, and at the start of each d-loop iteration. Concatenating short segments into longer ones is consistent with correct vectorization, provided that, in the enlarged segment, atoms cannot be referenced more than once as the first member of a pair and once as the second. While performance improves with vector length, there is a point at which this ceases; since the number of cells (per processor in the parallel case) G is typically fairly large, this is used to limit the vector lengths. D (arbitrarily set to G/2) is the extra space in the temporary force and energy arrays (used below) that allows segments to be combined. The vectors v_w contain the 27 integer offsets (each a triple of values 0, ±1) between adjacent cells, including the cell itself. Periodic boundaries are of no concern here, since they are accounted for during replication (or copying).
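Before giving the neighbor-list construction algorithm itself, the following C fragment sketches one way the layer-assignment pass described above might be coded. It is purely illustrative: the function and array names are not those of the actual program, atom and cell indices are 1-based as in the pseudocode, and the boundary padding by m_0 is omitted for brevity.

#include <stdlib.h>

/* Illustrative sketch (not the program's actual code) of the layer-assignment
   pass.  cellOfAtom[] must have nAtoms + 1 elements and each layer[n] must
   have nCells + 1 elements, since indices are 1-based.  cellOfAtom[i] holds
   the cell of atom i and is set to 0 once the atom has been placed in a
   layer; layer[n][m] receives the index of the atom occupying cell m in
   layer n (0 if that cell position is empty).  Returns the number of layers
   actually used. */
int AssignLayers(int *cellOfAtom, int nAtoms, int nCells,
                 int **layer, int maxLayers)
{
    int *queue = malloc((size_t) nAtoms * sizeof(int));
    int nQueued = nAtoms, nLayers = 0;

    for (int k = 0; k < nAtoms; k++) queue[k] = k + 1;   /* all atoms pending */

    while (nQueued > 0 && nLayers < maxLayers) {
        int *h = layer[nLayers];
        for (int m = 0; m <= nCells; m++) h[m] = 0;      /* clear this layer */

        /* later atoms overwrite earlier ones in multiply occupied cells, so
           only the last atom encountered per cell is assigned to this layer */
        for (int k = 0; k < nQueued; k++) h[cellOfAtom[queue[k]]] = queue[k];

        /* mark the atoms that were actually assigned */
        for (int m = 1; m <= nCells; m++)
            if (h[m] > 0) cellOfAtom[h[m]] = 0;

        /* compact the queue, keeping only still-unassigned atoms */
        int kept = 0;
        for (int k = 0; k < nQueued; k++)
            if (cellOfAtom[queue[k]] > 0) queue[kept++] = queue[k];
        nQueued = kept;
        nLayers++;
    }
    free(queue);
    return nLayers;
}

The overwrite of multiply occupied cells and the subsequent queue compaction correspond to the two middle loops of the pseudocode, and each of the inner loops here is free of the data dependences that would otherwise prevent vectorization.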
k = 0
p = 1
b_p = 1
for w = 1 to 27 do
  s = (v_{w,z} G_y + v_{w,y}) G_x + v_{w,x}
  if w ≥ 15 then d_min = 0 else d_min = 1
  for d = d_min to n_l − 1 do
    for l_1 = 1 to n_l − d do
      l_2 = l_1 + d
      for m = m_0 + 1 to m_0 + G do
        if e_m + e_{m+s} > 0 then
          i = h_{l_1,m}, j = h_{l_2,m+s}
          if i > 0 and j > 0 then
            r = r_i − r_j
            if r^2 < r_n^2 then
              k = k + 1
              t_{1,k} = i, t_{2,k} = j
            endif
          endif
        endif
      enddo
      if k ≥ b_p + D then
        p = p + 1
        b_p = k + 1
      endif
    enddo
    if k ≥ b_p then
      p = p + 1
      b_p = k + 1
    endif
  enddo
enddo

From a practical viewpoint, the innermost m-loop in this algorithm could be decomposed into a sequence of two or more loops, each responsible for part of the work. In its present form, the loop cannot be multistreamed [5] (the terminology and the implications are described later) due to the need to increment the counter variable k, although this does not prevent normal vectorization. Decomposing the loop and introducing additional arrays to hold intermediate results reduces the amount of processing unable to utilize multistreaming, leading to a slight performance improvement. Since the obstacle to multistreaming is not eliminated entirely, however, better overall performance is obtained using a single-streamed environment (see later) where this issue does not arise.

3.2. Interaction calculations

The goal of these preparations is to allow the force and energy computations to be carried out as efficiently as possible. This will now be done by processing the segmented neighbor list.
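For reference, the intermediate pair quantities assigned in the first inner loop below follow directly from the soft-sphere potential of Section 2 with σ = ε = 1: for r = |r_i − r_j| < r_c,
$$
\mathbf{f}_{ij} = -\frac{du}{dr}\,\frac{\mathbf{r}_i - \mathbf{r}_j}{r}
= \left(48 r^{-13} - 24 r^{-7}\right)\frac{\mathbf{r}_i - \mathbf{r}_j}{r}
= 48\, r^{-8}\!\left(r^{-6} - \tfrac{1}{2}\right)(\mathbf{r}_i - \mathbf{r}_j),
\qquad
u(r) = 4 r^{-6}\left(r^{-6} - 1\right) + 1,
$$
which are precisely the expressions stored in f_m and u_m.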
The interactions are computed in several stages. The outer of the doubly-nested loops below is over the p segments of the neighbor list. Each of the three successive inner loops can be vectorized. The first of these evaluates the interactions for each atom pair in the neighbor-list segment and stores these intermediate values for later use; the loop is an obvious candidate for vectorization. The remaining two loops then take the intermediate values and use them to update the interactions for each of the atoms in the segment; they are vectorizable because the neighbor list is segmented in a manner that avoids any data dependence. Effective vectorization is achieved because most of the smaller neighbor-list segments have been combined into relatively long segments. The array elements f_m and u_m hold the intermediate values that are passed between loops; the total forces on the atoms are accumulated in f_i and the interaction energies in u_i. The force and energy values computed for replicated (or copied) atoms are discarded since they are properly computed when the original version of the atom is treated (as system size increases, the duplicated effort decreases relative to the total); for this reason u_i is accumulated separately for each atom, and the total interaction energy, U, evaluated at the end.

for i = 1 to N_a' do
  f_i = 0, u_i = 0
enddo
for q = 1 to p do
  for k = b_q to b_{q+1} − 1 do
    i = t_{1,k}, j = t_{2,k}, m = k − b_q + 1
    r = r_i − r_j
    if r^2 < r_c^2 then
      f_m = 48 r^{-8} (r^{-6} − 1/2) r
      u_m = 4 r^{-6} (r^{-6} − 1) + 1
    else
      f_m = 0, u_m = 0
    endif
  enddo
  for k = b_q to b_{q+1} − 1 do
    i = t_{1,k}, m = k − b_q + 1
    f_i = f_i + f_m, u_i = u_i + u_m
  enddo
  for k = b_q to b_{q+1} − 1 do
    i = t_{2,k}, m = k − b_q + 1
    f_i = f_i − f_m, u_i = u_i + u_m
  enddo
enddo
U = 0
for i = 1 to N_a do U = U + u_i/2

4. Message-passing parallelism

There are two principal hardware approaches to parallel computing, one employing shared memory accessible to all
processors (some parts of memory may be more readily accessible to a particular processor than others—the NUMA architecture), the other distributed memory, with processors having their own private storage. Hybrid architectures exist in which processor subsets share common memory, but each subset’s memory is private; the Cray X1 is of this design. Interprocessor communication is required in the case of distributed memory for processors to participate in a joint computation, implying the need for a communication infrastructure to support data transfer. Modern supercomputer designs ensure that communication bandwidth and delays (latency) associated with message passing are commensurate with processor performance, so as not to constitute major performance bottlenecks; while there will always be a certain amount of overhead, the aim is to keep it to a minimum. The programming approach depends on whether the memory is shared or distributed [1]; in the former, it is a simple matter of decomposing loops (carefully) into multiple threads that execute independently on separate processors, while in the latter, explicit data transfers are required between processors. Rather than attempting to combine the approaches (this is an option for the Cray X1), and especially in view of the fact that message passing can utilize shared memory, the algorithm described here relies on message passing alone. From a software perspective, there are schemes for hiding the multitude of hardware details involved; the MPI standard defines such a programming interface [9], implemented in terms of widely available function (or subroutine) libraries. Apportioning the MD workload across a set of processors can be carried out in various ways, but for large systems with relatively short-ranged interactions and a substantial number of processors the only viable approach is based on spatial (or geometric) subdivision. Here, each processor is responsible for a distinct subregion of the overall simulation volume and all the atoms it contains. As atoms travel between subregions, their ownership is transferred between processors. Interactions involving atoms resident in adjacent subregions can be handled independently by the processors concerned, after first copying the necessary information (a relatively small amount) between the processors. This technique is highly scalable, assuming, as should always be the case, that the communication capabilities match the processing power. For processors whose performance depends on vectorization, the additional computations required to support parallelism must also be vectorizable. Since spatial subdivision is described in detail in [1], a brief treatment will suffice here. Each processing node is aware of the subregion for which it is responsible (a single MPI function call helps provide this information) and the identities of the processors handling the adjacent subregions. If a full 3D subdivision is used there will be six neighbors; if only a 2D subdivision is used there will be just four, with the remaining direction typically treated using periodic boundaries; correspondingly for a 1D subdivision. An important aspect of the computation is the copying and moving of atoms between processors. Atom copying is required at each time step, but only involves the atom data needed by the interaction computations—typically just the coordinates. Atom
moving needs to be carried out immediately prior to rebuilding the neighbor list (the fact that atoms may move slightly outside a subregion after the neighbor list is built is of no consequence for the interaction computations); it entails a transfer of all the data associated with the atoms that have moved from one subregion to an adjacent one (far fewer than the number involved in the copying operation). If periodic boundaries separate a pair of communicating processors, the coordinate data (whether copied or moved) must be adjusted to shift the atom to the opposite end of the region; for this reason there is no need to address periodic boundaries explicitly in the neighbor-list and interaction computations.

Each copying operation consists of three parts. The first, carried out whenever the neighbor list is about to be rebuilt, is the preparation of a list of atoms to be copied, namely those within a distance r_n of the relevant subregion face. If N_a is now regarded as the number of atoms in the subregion, rather than the total number of atoms, set N_a' = N_a, then proceed as follows for the positive and negative x-directions.

m = 0
for i = 1 to N_a' do
  if r_ix ≥ B_x^(hi) − r_n then
    m = m + 1
    p_m^(+x) = i
  endif
enddo
m = 0
for i = 1 to N_a' do
  if r_ix < B_x^(lo) + r_n then
    m = m + 1
    p_m^(−x) = i
  endif
enddo

The second part, performed every time step, is the copying of the coordinates of the relevant atoms to the adjacent processors. Atoms indexed by p_m^(+x) are copied across the subregion face to the processor responsible for the next higher subregion in the x-direction, those indexed by p_m^(−x) in the opposite direction. The third part is the complementary task of receiving coordinates copied into the current processor from adjacent processors, after which the value of N_a' is incremented to include these copied atoms. The entire process is then repeated for the y- and z-directions; atoms may undergo multiple copying if close to a subregion edge or corner, but this is taken care of automatically.

The process of moving atoms between subregions is similar, and is carried out prior to rebuilding the neighbor list, and before the copy operation at that time step. The position test involves (e.g.) B_x^(hi) rather than B_x^(hi) − r_n; after atoms have departed from a processor their storage is made available for reuse by simply compacting the arrays of atom data to remove any gaps. If a 2D or 1D subdivision of the region is used, the corresponding copy and move operations do not entail interprocessor transfers. The loops involved in selecting atoms to
be copied and moved, and in compacting the data, all vectorize readily.

5. Performance

A computer with the complexity of the Cray X1 [5,6] offers considerable scope for experimentation to improve performance. In order to limit the effort invested in these measurements, once algorithm development yielded an apparently efficient program that allowed the compiler to vectorize all the innermost loops (with the aid of suitable pragmas), systematic exploration of further code enhancements was terminated. While small additional gains are achievable with further effort, major performance improvements—if at all possible—are likely to require alterations to the underlying algorithm. Computation time is just one aspect of performance, albeit an important one; memory usage can also be a factor and sometimes one can be traded for the other, the use of layers being just one such example. Timing information is specific to the interaction potential, temperature, density and integration step size; changes will not only affect performance, but will alter the relative contributions of the different parts of the computation to the overall workload.

These benchmark tests are based on the soft-sphere system at a density of 0.8 (a dense fluid state) and initial temperature 1.0, with a time-step size of 0.005 (all quantities are expressed in standard, dimensionless MD units); the width of the neighbor-list shell is δ = 0.4. The test runs start in an ordered state with atoms arranged on an FCC lattice (the input data specify the number of unit cells, and system sizes for larger runs are chosen to utilize most of the available memory). Initial velocities are randomly directed, with magnitude determined by the temperature; during the run, temperature is not adjusted, and the kinetic energy per atom drops from an initial 1.5 to roughly 0.96 over the first 200 steps (after 50 steps it has already fallen to 0.98). The programming language used for the tests is C, and all arrays are allocated dynamically at run time, with array sizes determined by the input data. Message-passing uses the Cray version of the MPI library [9]. The total wall clock time for the run, after constructing the initial state, is used for estimating performance; since the processors of the machine are dedicated to individual jobs, the timing figures should not be unduly sensitive to other machine activities (see below, however, on the question of timing variability). Each run extends over 1000 steps; while a run of this length is inadequate for research simulations, it is sufficient for the atoms to migrate away from their original positions and cross subregion boundaries; thus, all the features of the program are exercised adequately and representative timing estimates—that will not change substantially over longer runs—are obtained.

Performance also depends on machine features over which the user typically has no control. These include the particular hardware configuration—processor clock rate, memory bandwidth, caching capabilities, and communication bandwidth (both overall and point-to-point) and latency. System software dependence includes the compiler optimization capabilities and operating-system overheads associated with communication.
Storage requirements for the test runs can be estimated from the following considerations. In the case of short-range interactions, the neighbor list must be able to accommodate 6 entries per atom, while the number of cells is slightly less than the number of atoms. The basic scalar MD code therefore requires 9 floating-point (single or double precision, SP or DP for short) values (for the position, velocity and acceleration) and 14 integers (12 for the neighbor-list entries and 2 for the cells) per atom, for a total of 92 (SP) or 128 (DP) bytes. Additional quantities are required for the layered, parallel version of the code: on a per-atom basis, the array h_nm requires n_l^(max) (conservatively assumed to be 10) integer values, the arrays e_m and q_i need one integer each, each atom needs the value of its own potential energy as well as an 'identifier' because of interprocessor mobility (there is a one-integer saving because linking atoms within cells is not needed), and finally the two arrays f_m and u_m add an extra 6 floating-point values (originally 4 per atom, but this is raised to 6 due to the choice of D, above, to allow concatenation of shorter neighbor-list segments); the overall totals are 168 (SP) and 232 (DP) bytes per atom, a substantial increase. Thus, in the SP case, the maximum number of atoms per GByte of storage is close to 6 × 10^6; adequate space must be reserved for copied atoms and to allow for occupancy fluctuations. Because the tests focus on large systems, computations will be carried out using SP variables.

5.1. Choice of single-streaming or multistreaming processor modes

The measurements focus on the two principal features of the computation, namely vectorization and parallelization, and the degree to which the present approach manages to utilize these hardware features. There is, however, one further design characteristic of the machine that significantly impacts the overall performance. Each processing node of the Cray X1 [5] is a shared-memory parallel processor consisting of 4 multistreamed (MSP) vector processors. Applications spread over multiple nodes see a distributed-memory system in which explicit communication is required; within a node the processors can continue to rely on the same kind of communication (with whatever relative performance the hardware provides) or can operate together as a shared-memory system. In the present study, the former, homogeneous approach is adopted; this is likely to be the more efficient option, a conclusion based on experience with MD on small shared-memory machines—a communication-free multithreaded MD algorithm is described in [1].

Each MSP consists of 4 single-streamed (SSP) vector processors; in MSP mode these act as a single, tightly-coupled processor, but in SSP mode they are free to operate independently. Thus there is the capability for treating a set of n MSPs as 4n SSPs. In MSP mode the vectorizable loops in each MSP are spread evenly across the 4 SSPs; the result is improved performance (ideally approaching a 4× gain over a single SSP) and a reduced reliance on parallelism. Certain kinds of computation are unsuited for this decomposition, in which case only the SSP mode can offer acceptable processor utilization—assuming that communication does not present a problem.

Test runs on single MSP and SSP processors, involving small 108 000 and larger 5.3 million atom systems, are used to determine the preferred mode. In the MSP version, the compiler 'loopmark' listing (a useful means for discovering how successfully the compiler vectorizes the program) reports an inability to multistream the principal loop (the m-loop) of the neighbor-list construction algorithm, although it is able to vectorize this and all other inner loops. The resulting time per atom-step, ts, for the small system is 1.0 µs on the MSP processor, while on the SSP processor this increases to 1.4 µs; for the larger system ts increases to 1.4 and 1.6 µs, respectively. Since there are four times as many processors available in SSP mode as in MSP mode, SSP clearly provides better resource utilization, and it alone will be considered in the remaining analysis. (The mode is chosen when compiling and linking the program, and is not referenced within the program itself; the present default mode is actually MSP.)

Table 1
Timing and memory requirements using a single SSP processor of the Cray X1; the results show the dependence of the time per atom-step, ts, on system size, as well as its variability

Atoms        ts (µs)      MByte
4 × 30^3     1.41–1.42    21
4 × 64^3     1.57–1.58    195
4 × 100^3    1.62–1.87    742
4 × 108^3    1.92–2.12    932
4 × 110^3    1.65–2.08    978

While the emphasis is on large systems, it is important to be aware of how single-processor performance varies with system size, since cache-based architectures tend to favor computations with a smaller memory footprint. Table 1 shows that this is indeed the case here; ts gradually increases with system size and there are significant fluctuations (well over 10%) in repeated runs of the bigger systems. The systematic variations are unrelated to the MD algorithm itself, which should produce a constant ts, so the size dependence is likely a consequence of caching. For each of the large systems considered later, only a single run is conducted.

5.2. Vector performance

The next set of performance measurements addresses the vectorized algorithm, assessing the degree to which the design goals of the algorithm are met and the actual speedup relative to the original, much simpler 'scalar' version of the program. Full vectorization requires the use of certain compiler directives [10], namely the pragmas '_CRI concurrent' and '_CRI safe_address', to inform the compiler about the lack of any dependence within the data being processed by the computational loop; these characteristics are not apparent unless the underlying data organization is understood, something far beyond the capabilities of any current compiler. The single-processor performance of the layered program is found to improve by more than an order of magnitude as a result of vectorization.
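To illustrate how these directives are applied in practice, the fragment below shows the style of annotation on a vectorizable inner loop; the loop is a schematic stand-in for one of the per-segment accumulation loops of Section 3.2, with illustrative variable names rather than the program's actual ones. The two directives assert, respectively, that the iterations of the following loop are independent (guaranteed here by the way segments are constructed) and that speculative address computation for the indexed arrays is safe.

/* Schematic example of the Cray compiler directives named above; the loop
   accumulates the per-pair contributions of one neighbor-list segment into
   the first atom of each pair.  Variable names are illustrative only. */
void AccumulateFirstOfPair(int bq, int bqNext, const int *t1,
                           const double *fmx, const double *fmy,
                           const double *fmz, const double *um,
                           double *fx, double *fy, double *fz, double *u)
{
#pragma _CRI concurrent
#pragma _CRI safe_address
    for (int k = bq; k < bqNext; k++) {
        int i = t1[k];       /* first atom of pair k */
        int m = k - bq;      /* position within the segment's work arrays */
        fx[i] += fmx[m];
        fy[i] += fmy[m];
        fz[i] += fmz[m];
        u[i]  += um[m];
    }
}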
The actual timing values, after vectorization, appeared in Table 1. For comparison, timings for a single 2.4 GHz Intel Xeon processor (of similar vintage to the Cray model with an 800 MHz vector clock used here), for a 4 × 10^6 atom system under the same conditions, will be noted; for the scalar code ts = 1.0 µs, whereas for the layered code—clearly not intended for this processor architecture—ts = 3.8 µs. Contrast these times with the best ts for the vector code on the Cray X1 for a system of this size, namely 1.6 µs; while the Cray does a better job with the layered code, it has a 60% performance penalty when each processor is allowed to use its own preferred algorithm. The scalar code run on the Cray, with its limited scope for vectorization, yields ts = 4.1 µs, thereby confirming the efficacy of the layer approach.

The following observations for a system of 108 000 atoms reveal the degree of success in rearranging the data in vectorizable form. The number of layers is almost always 7 (on occasion 8), the number of neighbor-list segments fluctuates slightly but remains close to 170, some 75% of these segments are of length 256 or greater, with only about 12% of the segments having lengths below 16, and the average segment length is almost 4000, far longer than even the most demanding vector architectures require for efficient performance. The layer count will not change for larger systems, but the proportion of longer segments will be even higher.

Tracing the execution of an 'instrumented' version [6] of the vectorized MD program reveals that approximately 52% of the time is devoted to neighbor-list construction, almost all of this in the innermost (m-loop) of the quadruply-nested loop set (this is the loop that checks atom separation and adds the pair to the list if within range), 44% in the interaction computation, and 4% in the remaining work. This is a totally different distribution of effort from the typical scalar processor where over 90% of the work in the standard MD program is devoted to the interactions and only a minimal amount to neighbor-list construction. The conclusion is that while the layer approach produces a substantial speedup on the Cray X1, its effectiveness is diminished because of the multiple passes over increasingly sparsely populated layers. The fact that neighbor-list construction consumes so much of the effort is made even more conspicuous by the fact that rebuilding occurs, on average, only every 10th step.

The measured layer occupancies are also worth noting when considering how to improve vectorization efficiency; the first layer is essentially completely full, the next four layers have occupancy fractions 0.94, 0.69, 0.30 and 0.06, while further layers are below 0.01. Given the sparseness of later layers, an alternative means of representing their contents—typically as pairs of cell-index and atom-identity values—is likely to lead to a more efficient implementation; more specifically, both layer representations would be prepared for sparse layers, but when either or both layers are sparsely populated the alternative method of accessing the sparser layer's contents, which is also vectorizable, would be used. The capacity for improvement is limited, however, due to the fact that populating the layers and then using them for neighbor-list construction consumes just half of the overall computational effort.
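As an indication of what such a sparse representation might look like (this variant is a suggestion only and was not implemented for the measurements reported here), a sparsely populated layer could be stored as parallel arrays of occupied-cell indices and atom identities, with the pairing loop running over those entries rather than over all G cells; the names below are hypothetical.

/* Hypothetical sketch: pair the entries of a sparsely stored layer against a
   densely stored layer hDense[] at cell offset s, appending accepted pairs to
   the neighbor-list arrays t1[], t2[] (1-based pair counter k, as in the
   pseudocode of Section 3.1).  The loop length is the number of occupied
   cells in the sparse layer, and the iterations remain independent. */
int PairSparseWithDense(int nOccupied, const int *cellIdx, const int *atomIdx,
                        const int *hDense, const int *eCell, int s,
                        const double *rx, const double *ry, const double *rz,
                        double rrNebr, int *t1, int *t2, int k)
{
    for (int n = 0; n < nOccupied; n++) {
        int m = cellIdx[n];        /* occupied cell in the sparse layer */
        int i = atomIdx[n];        /* its occupant */
        int j = hDense[m + s];     /* occupant of the offset cell, 0 if empty */
        if (eCell[m] + eCell[m + s] > 0 && j > 0) {
            double dx = rx[i] - rx[j], dy = ry[i] - ry[j], dz = rz[i] - rz[j];
            if (dx * dx + dy * dy + dz * dz < rrNebr) {
                ++k;
                t1[k] = i;
                t2[k] = j;
            }
        }
    }
    return k;                      /* updated pair count */
}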
5.3. Parallel performance for large systems

Multiprocessor behavior forms the subject of the next series of measurements. Table 2 summarizes the results for a series of systems of increasing size. The number of processors used in each run, np, determines the system size, since the latter is chosen to fill most of the 1 GByte memory available (effectively) per SSP processor; thus an almost constant number of atoms per processor is maintained across all the runs. The total number of SSP processors in the machine at the time of the test was 2048, although several were unavailable for batch processing use; thus the biggest run represents usage of essentially the entire machine. The wall clock time for the runs ranged from 4 to 4.6 hours.

Table 2
Performance of large parallel test runs on the Cray X1; np is the number of SSP processors used (the number of atoms is four times the number of unit cells)

Atoms             Unit cells   np      GByte   ts (ns)   np ts (µs)
47 409 408        228^3        8       7       302.1     2.42
379 275 264       456^3        64      57      39.5      2.53
1 291 315 456     686^3        216     193     12.3      2.66
3 054 207 744     914^3        512     458     5.2       2.66
5 988 775 936     1144^3       1000    894     2.8       2.80
12 093 857 792    1446^3       2016    1816    1.4       2.82

The final column shows the product np ts; ideally this should be constant (since it represents a measure of single-processor performance), and in fact the variation is comparatively small, especially in view of the 250-fold range of system sizes. The parallel aspect of the performance is especially satisfying for the larger systems, the largest of which contains over 12 billion atoms and requires almost 2 TByte of memory. This establishes that there is adequate communication bandwidth available, even with a very large number of processors working together, and that the present approach could be applied to even larger systems, given sufficient hardware resources.

One bigger test run has been reported in the literature [11], involving tests on systems having up to 19 billion atoms on a large cluster of Alpha processors. The runs themselves are much shorter, are not subject to the heavy memory cost in support of vectorization, and reveal the presence of communication bottlenecks leading to sublinear performance gains as more processors are used (for a fixed number of atoms per processor); the measured np ts ≈ 8 µs cannot be compared with the present results owing to a slightly different potential function and other run parameters, not all of which are specified. As more massively parallel machines become available, even larger test runs can be expected; further reference to the scientific relevance of exercises of this kind appears at the close of the paper.

5.4. Alternative spatial subdivisions

The foregoing results are based on a 3D partitioning of the system, with np an exact cube, except in the largest case where there are no longer enough processors available. While this reduces the amount of data that must be transferred between processors (since the cube has the smallest surface area
of all cuboids with the same volume) it requires communication across all six subregion faces. Reducing the number of communication directions is certainly worth considering for smaller systems (there is of course a minimum size in each direction to avoid communications dominating), and the resulting performance is examined in Table 3. Ideally, the np ts column should again be constant since the number of atoms per processor is fixed, and the only change is that the treatment of periodic boundaries using replication is replaced by the use of explicit communication between processors, one dimension at a time. The results show this to be true, to within the limits of timing reproducibility. The result for four SSP processors, namely ts = 0.576 µs, is a major improvement over the value ts = 1.4 µs obtained when the same computational resources are combined into a single MSP processor, further validating the SSP approach; the fact that it is less than the limiting fourfold improvement over a single SSP is probably due in part to a cache shared between all four of the SSPs in each MSP.
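Expressed as a parallel efficiency (a derived reading of the tables, not a figure quoted in the measurements themselves), the ratio of the single-processor value of np ts to its value on np processors gives, for the 2 × 2 × 2 case of Table 3,

  η(8) = 2.08/2.32 ≈ 0.90,

i.e. roughly 90% of ideal scaling on 8 SSPs, while within Table 2 the largest run retains 2.42/2.82 ≈ 0.86 of the per-processor throughput of the smallest as the processor count grows from 8 to 2016.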
Table 3
Effect of the dimensionality of the spatial subdivision on performance

np           Unit cells         ts (µs)   np ts (µs)
1 × 1 × 1    110 × 110 × 110    2.082     2.08
2 × 1 × 1    220 × 110 × 110    1.056     2.11
2 × 2 × 1    220 × 220 × 110    0.576     2.30
2 × 2 × 2    220 × 220 × 220    0.290     2.32

6. Conclusion

The algorithm described in this paper for handling short-range interaction computations in MD simulation has been designed with the requirements of modern vector-parallel supercomputers in mind. The lesson that emerges from this study is that a substantial amount of organizational work is required to accommodate the special needs of the vector architecture; this places computers of this kind at a disadvantage when compared with simpler processors, much of whose performance derives from a substantially faster clock speed. A computer designed for efficient MD simulation would probably not rely entirely on vectorization to boost performance. This is not the issue addressed here, however, but rather the way to harness an already-existing architecture for MD simulation. Some allowance can be made for the fact that the particular MD example studied here is especially demanding, due to the extremely short-range interactions. In other MD applications there are likely to be additional computations that are more amenable to vectorization (for example, interactions between invariant atom pairs, as in polymers), so a smaller proportion of the overall effort would be consumed by neighbor-list construction. Insofar as the other major characteristic of the machine is concerned—parallelism—the Cray X1 performance scales almost linearly to over 2000 (SSP) processors, an entirely satisfactory result.

It should be emphasized that this study has addressed just one aspect of implementing large-scale MD simulations on a vector-parallel computer. There are many other activities associated with MD simulation that require careful planning for extremely large-scale computations, not the least of which is merging and saving the massive amounts of data produced by the computations that will have to be retrieved from the distributed memory of the machine; likewise, a suitable checkpoint and restart mechanism capable of handling the large quantity of data needed for a complete description of the system state; and finally, algorithms will be needed for analyzing the properties of the system, again requiring access to the large body of distributed data.

One further point should be mentioned in regard to tests of the kind reported here. A scientifically useful MD simulation requiring such an enormous system would typically also require an appropriately large number of time steps to allow the events occurring in one part of the system to be felt throughout. Such information is often propagated by a diffusive mechanism (the three most familiar from fluid dynamics being mass, momentum and energy transfer); billion-plus-atom studies are therefore likely to require multimillion-step simulations, far more than the modest 1000 steps covered here. Thus, while very short MD calculations for extremely large systems provide valuable tests of computer performance, the fact that the resources consumed by an MD simulation depend on the product of the number of atoms and the number of time steps implies that system sizes will typically remain well below the maximum that supercomputers can handle, simply in order to allow runs of adequate length to be carried out.

Acknowledgements

J. Barhen, Y. Braiman, P. Cummings, M. Fahey, and J. Nichols are thanked for helpful discussion. This research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL) and used the high-performance computing resources of the Center for Computational Sciences at ORNL, which is supported by the Office of Science of the U.S. Department of Energy (DOE). ORNL is managed by UT-Battelle, LLC for DOE under Contract No. DE-AC05-00OR22725.
References

[1] D.C. Rapaport, The Art of Molecular Dynamics Simulation, second ed., Cambridge University Press, 2004.
[2] D.C. Rapaport, Comput. Phys. Comm. 62 (1991) 198.
[3] D.C. Rapaport, Comput. Phys. Comm. 62 (1991) 217.
[4] D.C. Rapaport, Comput. Phys. Comm. 76 (1993) 301.
[5] Cray X1 System Overview (Cray Inc., Mendota Heights, MN, 2003), publication no. S-2346-23; Cray manuals available at http://www.cray.com.
[6] Optimizing Applications on the Cray X1 System (Cray Inc., Mendota Heights, MN, 2003), publication no. S-2315-50.
[7] G.S. Grest, B. Dünweg, K. Kremer, Comput. Phys. Comm. 55 (1989) 269.
[8] M. Pütz, A. Kolb, Comput. Phys. Comm. 113 (1998) 145.
[9] W. Gropp, E. Lusk, N. Doss, A. Skjellum, Parallel Comput. 22 (1996) 789; see also http://www-unix.mcs.anl.gov/mpi/mpich.
[10] Cray C and C++ Reference Manual (Cray Inc., Mendota Heights, MN, 2003), publication no. S-2179-50.
[11] K. Kadau, T.C. Germann, P.S. Lomdahl, Internat. J. Modern Phys. C 15 (2004) 193.