Computational Materials Science application programming interface (CMSapi): a tool for developing applications for atomistic simulations




Computer Physics Communications 169 (2005) 462–466 www.elsevier.com/locate/cpc

Simone Meloni a,∗, Mario Rosati a, Alessandro Federico a, Luca Ferraro a, Alessandro Mattoni b, Luciano Colombo b

a Consorzio per le Applicazioni del Supercalcolo Per Università e Ricerca (CASPUR), Via dei Tizii 6/b, 00185 Rome, Italy
b INFM-SLACS and Department of Physics, University of Cagliari, Cittadella Universitaria, 09042 Monserrato (CA), Italy

∗ Corresponding author. E-mail address: [email protected] (S. Meloni).

Available online 18 April 2005

Abstract

We describe our computational library for atomistic simulations, CMSapi. CMSapi provides building blocks for molecular dynamics, molecular mechanics and other atomistic simulation techniques. It features a set of routines that allow one to build an MD program for short-ranged interactions sampling the NVE, NVT and NPT ensembles. CMSapi implements an improved algorithm for applying the Minimum Image Convention. We also introduce on-the-fly reordering of the atomic labeling to improve the locality of memory accesses. The computational performance of CMSapi is discussed.

© 2005 Elsevier B.V. All rights reserved.

PACS: 31.15.Qg

Keywords: Molecular Dynamics; Application programming interface; Minimum image convention; Reordering

1. Introduction

Atomistic simulations are a well-established tool in computational materials science. Simulations of this kind enable the quantitative testing of theoretical models, as well as the investigation of complex phenomena occurring over a wide range of length and time scales.


Nowadays, simulations require the combination of several ingredients, such as advanced ensemble sampling, multi-scale modeling and numerical methods for high-performance computing. In particular, the need for efficient numerical tools is imposed by the ever increasing complexity of the investigated problems and the growing number of simulated particles. A possible path toward such advanced tools is to develop a computational library containing highly optimized routines performing specific tasks in each of the aforementioned fields.


In this paper we present CMSapi, our fully portable, highly optimized library for performing atomistic simulations in materials science. We give a complete description of CMSapi and an example of its usage, and we describe the special techniques adopted to improve the performance of Molecular Dynamics (MD) simulations.

2. Structure and usage of CMSapi

CMSapi is written in ANSI C and features bindings for both C and Fortran. Using extra tools, such as SWIG [1], it is possible to bind many more languages, among others Java, Python and Perl. Whenever possible, CMSapi uses the BLAS and LAPACK [2] libraries.

CMSapi is organized in modules. Modules are collections of routines related to the same topic, for example all the routines needed to perform constant-temperature simulations. The reason for organizing CMSapi in modules is twofold: to rationalize the large number of routines into a smaller set of units, and to simplify the calling procedure of the routines. We adopted a technique known as "information hiding": data have permanent storage (in memory) but only file scope, that is, they can be accessed only by routines defined in the same file. Those constants and variables which need not be modified by the user's code are declared and defined within the module, and only functions within the module can access this type of data. Because fewer arguments must be passed to a module's routines, their interface is simplified. When needed (for example, for writing check-point files) these data can be accessed by users through ad hoc routines. In the remainder of this section we illustrate the usage of modules and the simplification arising from "information hiding".

Version 1.0 of CMSapi features the following modules:

(i) Lists (O(N^2) complete and incomplete interaction lists, where N is the number of particles);
(ii) LinkedCell (also contains combined linked cells/Verlet lists);
(iii) VelocityVerlet;
(iv) NVT (constant-temperature MD using Nosé–Hoover chains [3] in conjunction with operator factorization [4]);
(v) NPT (constant-pressure MD with fully flexible cells [5] in conjunction with operator factorization [4]);
(vi) Stillinger-Weber (Stillinger and Weber force model [6]);
(vii) Tersoff (Tersoff force model [7]);
(viii) SemiEmpiricalTightBinding (semi-empirical tight-binding total-energy functionals [8,9]);
(ix) Properties (for example kinetic energy and simulation box geometry: edge lengths, angles and volume);
(x) SML (Specific Mathematic Library: mathematical operations specific to atomistic simulations, not already included in standard BLAS and LAPACK);
(xi) Misc (utility routines not included in the previous modules).

Using CMSapi is very simple, since it is only necessary: (i) to select the routines required to add an extra feature to a code, (ii) to call the routines from within the code, and (iii) to compile and link the CMSapi libraries (libCMSapi_f77.a, if the original code is written in Fortran, and libCMSapi.a). For instance, to transform a standard NVE code so that it performs molecular dynamics in the NVT ensemble using Nosé–Hoover Chains (NHC) [3], just 5 library calls need to be added, as illustrated in Fig. 1 and sketched in the code fragment below.

Using this example as a test case we can illustrate the advantage of "information hiding". In order to run an NVT job using NHC, both the dynamical variables and the masses of the "thermostat particles" need to be assigned. These degrees of freedom are of little physical interest; therefore we declare and initialize them within CMSapi and never access them directly, but only by means of NHC-specific routines. In this way the interface of the NHC routines is greatly simplified, because the positions, velocities and masses of the "thermostat particles" need not be passed.
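To make the calling pattern concrete, the fragment below shows where such thermostat calls would sit in a velocity-Verlet loop written in C. It is only a minimal sketch: the cmsapi_nhc_* routine names, their argument lists and the parameter values are hypothetical placeholders chosen for illustration, not the actual CMSapi interface (in the real flow chart of Fig. 1, for instance, the velocity rescaling is carried out through a DSCAL call).

```c
/* Minimal sketch of an NVE velocity-Verlet loop turned into an NVT one
 * with Nose-Hoover chains.  All cmsapi_nhc_* names and signatures are
 * hypothetical placeholders, not the actual CMSapi interface.           */

/* provided by the library (hypothetical prototypes) */
void cmsapi_nhc_init(int chain_length, double temperature, double tau);
void cmsapi_nhc_half_step(double *vel, const double *mass,
                          int natoms, double half_dt);

/* user-supplied force routine */
void forces(const double *pos, double *frc, int natoms);

void nvt_md(double *pos, double *vel, double *frc, const double *mass,
            int natoms, int nsteps, double dt)
{
    cmsapi_nhc_init(4, 300.0, 0.1);   /* chain length, T, thermostat time scale */

    for (int step = 0; step < nsteps; ++step) {
        /* thermostat half-step before the standard integrator */
        cmsapi_nhc_half_step(vel, mass, natoms, 0.5 * dt);

        /* unchanged velocity-Verlet core of the original NVE code */
        for (int i = 0; i < 3 * natoms; ++i) {
            vel[i] += 0.5 * dt * frc[i] / mass[i / 3];
            pos[i] += dt * vel[i];
        }
        forces(pos, frc, natoms);
        for (int i = 0; i < 3 * natoms; ++i)
            vel[i] += 0.5 * dt * frc[i] / mass[i / 3];

        /* thermostat half-step after the velocity update */
        cmsapi_nhc_half_step(vel, mass, natoms, 0.5 * dt);
    }
}
```

Note that the thermostat degrees of freedom (positions, velocities and masses of the "thermostat particles") never appear in the argument lists: thanks to information hiding they live inside the NVT module and are touched only through its routines.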



Fig. 1. Flow chart of an NVT molecular dynamics program produced using CMSapi. The light gray boxes indicate the routines to be added to the original NVE code. DSCAL is a BLAS routine that applies a scaling factor to a double-precision array.

3. Improved and new algorithms

In CMSapi 1.0 we improved the implementation of the Minimum Image Convention (MIC). Standard implementations make use of computationally expensive functions such as "nearest integer" ([···]_nint) or sign:

    r'_{i,j} = r_{i,j} − Ĥ [Ĥ^{−1} r_{i,j}]_nint,    (1)

where r_{i,j} and r'_{i,j} are, respectively, the distance vectors between particles i and j before and after applying MIC, while Ĥ is the metric tensor. However, Eq. (1) evaluates to r'_{i,j} = r_{i,j} + R, where

    R = −Ĥ (n_x, n_y, n_z),   n_α ∈ {−1, 0, +1},

depending on the relative positions of the linked cells to which particles i and j belong. R is the same for all particle pairs belonging to the same pair of linked cells as particles i and j. Therefore it is possible to reduce the number of [···]_nint (or sign) operations from the number of pairs of particles to the number of pairs of linked cells.

In practice, for each pair of linked cells we calculate the value of R and store it in a vector (MIC_Linked_Cells), and then add it to r_{i,j} for each pair of particles belonging to that pair of linked cells (see Fig. 2). This scheme requires extra storage of size N_cells × N_Linked_Cells × 3 × Floating-Point-Size bytes. For a system of 10^5 silicon atoms described by the Stillinger–Weber potential [6], this amounts to about 7.5 MB, a very small amount even compared with the memory typically installed in modern PCs.

Fig. 2. Scheme of the improved Minimum Image Convention implementation.

This method can be extended to the double-list scheme, in which linked cells are combined with Verlet lists [10]. In this case a new data structure, MIC_Verlet_Lists, which contains R for each pair of particles in the Verlet lists, is created. This data structure requires an extra N_particles × N_neighbor × 3 × Floating-Point-Size bytes of storage. For the same case cited above, the extra storage required is about 70 MB, which is still small. In practice, MIC_Verlet_Lists contains the same values as MIC_Linked_Cells, taken from the pairs of cells to which the pairs of particles belong. For this scheme to work, Periodic Boundary Conditions (PBC) must be applied only when particles are assigned to the linked cells. After particles have been assigned to cells, MIC_Linked_Cells is created and then "copied" into MIC_Verlet_Lists.
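The kernel of this scheme can be sketched in a few lines of C. The sketch assumes that the cell-pair loop supplies, for each pair of neighboring linked cells, a representative separation vector (for example the distance between the cell centers); the function names and data layout are illustrative placeholders, not the actual CMSapi internals.

```c
/* Minimal sketch of the precomputed-shift MIC scheme.  Names and data
 * layout (row-major 3x3 metric tensor h, its inverse hinv) are assumed
 * for illustration only.                                                */
#include <math.h>

/* Step 1: one nearest-integer evaluation per pair of linked cells.
 * dcell is a representative separation (e.g. between the cell centers);
 * the resulting shift is stored once, e.g. in MIC_Linked_Cells.         */
void mic_cell_pair_shift(const double h[9], const double hinv[9],
                         const double dcell[3], double shift[3])
{
    double n[3];
    for (int a = 0; a < 3; ++a)
        n[a] = rint(hinv[3*a]*dcell[0] + hinv[3*a+1]*dcell[1]
                    + hinv[3*a+2]*dcell[2]);
    for (int a = 0; a < 3; ++a)
        shift[a] = -(h[3*a]*n[0] + h[3*a+1]*n[1] + h[3*a+2]*n[2]);
}

/* Step 2: every particle pair in that cell pair reuses the stored shift,
 * so no nint (or sign) call is needed in the inner loop.                 */
void mic_apply(const double ri[3], const double rj[3],
               const double shift[3], double rij[3])
{
    for (int a = 0; a < 3; ++a)
        rij[a] = ri[a] - rj[a] + shift[a];   /* r'_{i,j} = r_{i,j} + R */
}
```

In the double-list variant, the shifts stored per cell pair would simply be copied into MIC_Verlet_Lists when the Verlet lists are built, as described above.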



Table 1
Gain in execution time obtained adopting the improved MIC algorithm on several computers for Stillinger–Weber [6] and Tersoff [7] force fields

             Stillinger–Weber                    Tersoff
Particles    PWR4    PWR3    It.2    Opt.        PWR4    PWR3    It.2    Opt.
 64000       19.7%   33.4%    5.0%   49.9%        5.7%   13.1%    1.8%   12.5%
250000       30.9%   34.9%    6.1%   36.6%       10.7%   13.0%    2.8%   24.5%
500000       19.0%   27.9%    6.0%   25.8%       10.2%   13.2%    3.0%   11.1%

We measured the performance of this implementation of MIC, using two different force fields (namely Stillinger–Weber [6] and Tersoff [7]), on several computers: IBM Power4 (P690) and Power3, Intel Itanium and AMD Opteron. Table 1 shows the gain in execution time ([Time_STD − Time_NEW]/Time_STD × 100) obtained using the new implementation of MIC with respect to the standard implementation based on Eq. (1), as a function of the number of particles. On average the gain is about 17.4%; the observed variations of the percentage gain depend on the different implementations of the [···]_nint instruction.

Another important issue in present-day scientific computing is accessing memory along efficient paths. Superscalar computers rely on two locality principles: spatial locality (if a datum is required by an operation, the following operations will require data adjacent to it in memory) and temporal locality (if a datum is required by an operation, another operation will soon require the same datum). These principles are translated into practice by providing computers with a hierarchy of memories: at the lower levels there are small, fast memories (several levels of cache); on top of these sit larger, slower storage units (main memory, i.e. RAM, or even disks). When an instruction requires a datum, a whole block of contiguous data is brought into cache. Because of the increasing gap between CPU clock time and memory access time, it is of paramount importance to make optimal use of the data in cache.

During MD simulations data are accessed in an essentially random order: data relative to particles that are close in space may be far apart in memory, and vice versa. Even if particles have an optimal initial labeling, diffusion introduces a certain degree of "disorder". This has a dramatic effect on performance: for example, on an IBM Power4 (P670) the typical performance of a standard MD code running the Stillinger–Weber potential [6] is 488 MFLOPS, that is, about 11% of the Power4 peak performance.

We were able to improve performance significantly by implementing a reordering scheme. Reordering consists in changing the particle labeling according to the particles' actual positions; data structures based on the particle labeling are then reordered to obey the new numbering. We implemented this technique in conjunction with linked cells: particles belonging to the same cell are labeled with progressive numbers, and progressive numbers are also assigned to particles belonging to cells that are contiguous along one direction (see Fig. 3). The rationale for this scheme is that particles within the same cell, or within contiguous cells, interact with each other, so their data are accessed either in the same or in successive clock cycles. Moreover, the scheme is very cheap: the difference in computing time between runs with and without reordering of a non-diffusing system with an optimal initial labeling (for which reordering brings no advantage) is negligible.

Fig. 3. Reordering algorithm: the top panel shows the particle labels and a generic data structure before reordering, the bottom panel shows the same after reordering.
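A minimal C sketch of such a cell-based relabeling is shown below. The array names (head, next, old2new and the generic per-particle array data) are assumptions made for illustration; they are not the actual CMSapi internals.

```c
/* Minimal sketch of cell-based relabeling: particles are renumbered by
 * sweeping the linked cells in memory order, so that particles in the
 * same or contiguous cells receive contiguous labels.  Names are
 * illustrative placeholders.                                            */
#include <stdlib.h>

/* head[c]   : first particle of cell c (-1 if the cell is empty)
 * next[i]   : next particle in the same cell as i (-1 at the end)
 * old2new[i]: new label assigned to old particle i                      */
void build_relabeling(const int *head, const int *next,
                      int ncells, int *old2new)
{
    int newlabel = 0;
    for (int c = 0; c < ncells; ++c)          /* cells swept in storage order */
        for (int i = head[c]; i != -1; i = next[i])
            old2new[i] = newlabel++;
}

/* Reorder a generic per-particle array of 'width' doubles per particle
 * (positions, velocities, forces, ...) according to the new labeling.   */
void reorder_array(const int *old2new, int natoms, int width, double *data)
{
    double *tmp = malloc((size_t)natoms * width * sizeof *tmp);
    if (!tmp) return;                          /* allocation failure: skip */
    for (int i = 0; i < natoms; ++i)
        for (int k = 0; k < width; ++k)
            tmp[(size_t)old2new[i] * width + k] = data[(size_t)i * width + k];
    for (size_t n = 0; n < (size_t)natoms * width; ++n)
        data[n] = tmp[n];
    free(tmp);
}
```

Every index-based data structure (coordinates, velocities, Verlet lists and so on) is remapped with the same old2new permutation; this is cheap compared with a force evaluation, which is why the overhead of reordering is negligible.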



Table 2
Gain in execution time obtained with reordering on several computers, using either the Stillinger–Weber [6] or the Tersoff [7] force field

             Stillinger–Weber                    Tersoff
Particles    PWR4    PWR3    It.2    Opt.        PWR4    PWR3    It.2    Opt.
 64000       21.1%   17.3%    8.9%   40.9%       25.5%   10.4%    9.7%   15.4%
250000       56.2%   39.1%   41.3%   57.8%       40.6%   20.0%   25.6%   32.6%
500000       59.7%   56.7%   57.1%   43.1%       52.1%   29.2%   25.7%   19.7%

Table 2 reports the gain in execution time ([Time − Time_REORD]/Time × 100) obtained using reordering on several architectures: IBM Power4 and Power3, Intel Itanium and AMD Opteron. On average the gain is 33.6%. The actual gain varies from case to case because of differences in cache size and associativity, and in the efficiency of the transfer mechanism from memory to cache.

Acknowledgement

One of us (L.C.) acknowledges financial support by MIUR under project FIRB (contract no. BRAUN01LLX2).

References

[1] http://www.swig.org.
[2] http://www.netlib.org/blas, http://www.netlib.org/lapack.
[3] G.J. Martyna, M.L. Klein, M. Tuckerman, J. Chem. Phys. 97 (1992) 2635.
[4] M. Tuckerman, B.J. Berne, G.J. Martyna, J. Chem. Phys. 97 (1992) 1990–2001.
[5] G.J. Martyna, D.J. Tobias, M.L. Klein, J. Chem. Phys. 101 (1994) 4177–4189.
[6] F.H. Stillinger, T.A. Weber, Phys. Rev. B 31 (8) (1985) 5262–5271.
[7] J. Tersoff, Phys. Rev. B 39 (8) (1989) 5566–5568.
[8] C.H. Xu, C.Z. Wang, C.T. Chan, K.M. Ho, J. Phys.: Condens. Matter 4 (1992) 6047–6054.
[9] I. Kwon, R. Biswas, C.Z. Wang, K.M. Ho, C.M. Soukoulis, Phys. Rev. B 49 (1994) 7242.
[10] D.J. Auerbach, W. Paul, C. Lutz, A.F. Bakker, W.E. Rudge, F.F. Abraham, J. Phys. Chem. 91 (1987) 4881–4890.