MaMiCo: Software design for parallel molecular-continuum flow simulations

Philipp Neumann, Hanno Flohr, Rahul Arora, Piet Jarmatz, Nikola Tchipev, Hans-Joachim Bungartz

Technische Universität München, Department of Informatics, Boltzmannstr. 3, 85748 Garching, Germany

Article history: Received 25 March 2015; Accepted 29 October 2015; Available online 19 November 2015

Keywords: Coupling; Software design; Molecular dynamics; Lattice Boltzmann; Molecular-continuum; Parallel; Fluid dynamics

Abstract

The macro–micro-coupling tool (MaMiCo) was developed to ease the development of and modularize molecular-continuum simulations, retaining sequential and parallel performance. We demonstrate the functionality and performance of MaMiCo by coupling the spatially adaptive Lattice Boltzmann framework waLBerla with four molecular dynamics (MD) codes: the light-weight Lennard-Jones-based implementation SimpleMD, the node-level optimized software ls1 mardyn, and the community codes ESPResSo and LAMMPS. We detail interface implementations to connect each solver with MaMiCo. The coupling for each waLBerla-MD setup is validated in three-dimensional channel flow simulations which are solved by means of a state-based coupling method. We provide sequential and strong scaling measurements for the four molecular-continuum simulations. The overhead of MaMiCo is found to come at 10%–20% of the total (MD) runtime. The measurements further show that scalability of the hybrid simulations is reached on up to 500 Intel SandyBridge and more than 1000 AMD Bulldozer compute cores.

Program summary

Program title: MaMiCo
Catalogue identifier: AEYW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEYW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: BSD License
No. of lines in distributed program, including test data, etc.: 67905
No. of bytes in distributed program, including test data, etc.: 1757334
Distribution format: tar.gz
Programming language: C, C++.
Computer: Standard PCs, compute clusters.
Operating system: Unix/Linux.
RAM: Test cases consume ca. 30–50 MB
Classification: 7.7.
External routines: SCons (http://www.scons.org), ESPResSo, LAMMPS, ls1 mardyn, waLBerla
Nature of problem: Coupled molecular-continuum simulation for multi-resolution fluid dynamics: parts of the domain are resolved by molecular dynamics whereas large parts are covered by a CFD solver, e.g. a lattice Boltzmann automaton.
Solution method: We couple existing MD and CFD solvers via MaMiCo (macro–micro coupling tool). Data exchange and coupling algorithmics are abstracted and incorporated in MaMiCo. Once an algorithm is


set up in MaMiCo, it can be used and extended, even if other solvers are used (as soon as the respective interfaces are implemented).
Restrictions: Currently, only single-centered Lennard-Jones systems are supported.
Running time: Runtime depends on the underlying coupled problem and may range from minutes to days. The provided test cases for all different solver couplings (incl. one complete coupling cycle of avg. domain size) take ca. 10 h on a regular desktop.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Molecular-continuum simulations in fluid dynamics allow for multiscale (or multiresolution) simulation beyond the continuum scale, i.e. beyond the limits of pure spatial or temporal adaptivity applied to a continuum method. The computational domain Ω is split into a continuum region Ω^C and a molecular region Ω^MD: in each domain a continuum or molecular dynamics (MD) solver predicts the behavior of the fluid, and at the interface – typically prescribed by an overlap region Ω^Ovlp := Ω^C ∩ Ω^MD [1] – flow quantities such as density, velocity or temperature are exchanged and matched. Applications comprise the investigation of polymers suspended in a solvent [2], fluid–wall interaction (e.g. slip effects [3]) and many more. Various methods have been presented in the literature for both steady-state (see e.g. [4]) and time-dependent (see e.g. [5]) coupling schemes. However, only few contributions have been made in the field of software development for such hybrid simulation methods, although the design of a respective piece of software may be similarly challenging. Four challenges are discussed and addressed in this contribution.

A huge number of MD and continuum solvers exist—each solver comes with features and limitations, and depending on the coupled molecular-continuum scheme at hand, one or the other solver may be preferable: incompressible or compressible flow solver, simple Lennard-Jones-based MD simulation or an MD toolbox with various molecular interaction potentials, highly parallelized and efficient simulation codes or simple-to-use development codes, and so forth. Hence, a decoupling of molecular and continuum solvers (1) from each other is desirable. Moreover, a modularization and encapsulation (2) of the coupling functionality should be achieved. This would hide coupling algorithmics from the actual MD and continuum solver; existing implementations of coupling algorithms could thus be reused by any pair of MD/continuum solvers that is potentially coupled.

In order to bridge spatial and temporal scales, it is typically the MD simulation which needs to be executed for hundreds of thousands of time steps or even more. Consequently, the MD solver and the respective coupling need to be highly efficient and also support the exploitation of massive parallelism (3). Besides, parallelism is a required feature for the continuum solver as well: assume only a small portion of the computational domain Ω shall be resolved by MD. Then, a large amount of computation is still required on the continuum side to compute the fluid flow in the continuum domain; this, again, shall be parallelized.

The development of hybrid molecular-continuum schemes itself is highly challenging: open boundary conditions for the MD simulations need to be formulated, taking into account boundary forcing (to account for missing molecular interactions across the open boundary) [6,7], insertion and exchange mechanisms for hydrodynamic quantities [8–11] and molecular sampling need to be steered, and molecular/continuum models need to be matched accordingly. Due to these issues, development of respective hybrid schemes starts in 1D, and prototypes are often formulated in two-dimensional scenarios before (or if ever) going to fully three-dimensional considerations, the latter also being due to the high computational demands.
The support of two- and three-dimensional setups (4) is thus an important feature for molecular-continuum software to reduce implementational overheads during development and when porting coupled schemes from 2D to 3D.

Only few contributions addressing considerations (1)–(4) have been published so far for computational fluid dynamics applications. Delgado-Buscalioni et al. [12] have published a coupling framework for molecular-continuum simulations which focuses on the deployment of hybrid simulations on grid-like computer architectures, still having a quite dense coupling of MD and coupling components (w.r.t. exchange mechanisms on MD side). Besides, efforts have been made to (tightly) couple existing MD frameworks to continuum solvers [13]. While yielding good performance on MD side, these approaches lack the possibility of exchanging the MD simulation (1), or of encapsulating coupling algorithmics from MD (2). Considering (3), highly efficient (triple-scale) simulations have been carried out in [14] which even led to a Gordon Bell finalist nomination. Similar to this work, several approaches have been presented for the general coupling of heterogeneous solvers (thus typically missing (2)), or for purely particle-based coupling such as DPD-SPH; for an overview, see amongst others [15]. Our focus lies on mesh-particle coupling for multi-resolution fluid dynamics.

We recently presented work towards a macro–micro-coupling tool (MaMiCo) [11,16,17], and particularly focused on the different components – boundary forcing, particle insertion and quantity exchange – that are applied to couple simple in-house test codes. Besides test case implementations, we employed a spatially adaptive Lattice Boltzmann implementation of the PDE framework Peano in this context [16]. With Peano relying on a mesh obtained via spacetree construction, we could show the flexibility of MaMiCo on the continuum side with respect to potentially unstructured (that is, non-lexicographically or indirectly indexed) meshes.

In the present contribution, we point out the features and applicability of the macro–micro-coupling tool for fully coupled simulation scenarios. We couple four different MD codes with the tool, pointing out its flexibility (1), (2), and support of massively parallel scenarios (3):

• a simple test code SimpleMD that has been and is used during the development of additional features of MaMiCo and therefore supports 2D and 3D scenarios (4), cf. [11].

• the software ls1 mardyn which is developed in cooperation of computer scientists and chemical engineers [18] and features high node-level performance and computational efficiency on current supercomputer architectures, with good scalability on up to 146 000 compute cores [19]. ls1 stands for ‘‘large systems 1’’, as ls1 mardyn is particularly designed for simulations at large molecule numbers.


• the Extensible Simulation Package for Research on Soft matter (ESPResSo), which "is a highly versatile software package for performing and analyzing scientific Molecular Dynamics many-particle simulations of coarse-grained atomistic or bead–spring models as they are used in soft-matter research in physics, chemistry and molecular biology. It can be used to simulate systems such as polymers, liquid crystals, colloids, ferrofluids and biological systems, for example DNA and lipid membranes" (espressomd.org, as of March 4, 2014).

• the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), which is developed at the Sandia National Labs [20]. LAMMPS is also open-source and is widely distributed in the MD community, featuring, amongst others, a multitude of molecular interaction potentials, atom-to-continuum coupling for structural mechanics, or steered MD.

We detail all interface implementations and point out features and potential code extensions that were required for coupling. The interface implementations are validated by considering a steady-state, three-dimensional channel flow scenario, coupling each of the MD simulations with the Lattice Boltzmann (LB) framework waLBerla [21,22]. The underlying methodology is based on the state-based coupling approach from [4]. The method is modified to further reduce the size of the overlap layer between molecular and continuum region. We discuss the computational efficiency of the coupled scenarios. For this purpose, we compare both coupled and non-coupled code performances and provide scaling results from coupled parallel simulations.

We briefly revise the molecular dynamics and Lattice Boltzmann methods and describe the modified coupling algorithm for hybrid MD-LB simulations in Section 2. The different pieces of software – the coupling tool MaMiCo, the LB simulation waLBerla, and the MD codes SimpleMD, ls1 mardyn, LAMMPS, and ESPResSo – are described in Section 3, including details on their interface implementations. The validation of the coupling algorithm for channel flow is presented in Section 4. Besides, we detail sequential and parallel performance of all coupled simulations. We summarize our contribution and give an outlook to future work in Section 5.

2. Theory

2.1. Molecular dynamics

We consider single-centered Lennard-Jones-based molecular dynamics, solving the equations of motion

$\frac{\mathrm{d}\mathbf{x}_p}{\mathrm{d}t} = \mathbf{v}_p, \qquad \frac{\mathrm{d}\mathbf{v}_p}{\mathrm{d}t} = \frac{1}{m_p}\mathbf{F}_p$    (1)

for all particle positions $\mathbf{x}_p$, velocities $\mathbf{v}_p$, masses $m_p$ and interaction forces $\mathbf{F}_p$. The latter arise from

$\mathbf{F}_p = 24\epsilon \sum_{q \neq p} \frac{1}{\|\mathbf{r}_{pq}\|^2} \cdot \left(\frac{\sigma}{\|\mathbf{r}_{pq}\|}\right)^{6} \cdot \left(1 - 2\left(\frac{\sigma}{\|\mathbf{r}_{pq}\|}\right)^{6}\right) \mathbf{r}_{pq}$    (2)

where $\mathbf{r}_{pq} := \mathbf{x}_q - \mathbf{x}_p$, with Lennard-Jones parameters $\epsilon$, $\sigma$. Force contributions of molecules at a distance $\|\mathbf{r}_{pq}\| > r_c$ are cut off. This cut-off radius $r_c$ is realized in the MD simulations either using Verlet lists [23] or linked cells [24]. Simulations are performed in the NVT ensemble.
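For illustration, the following is a minimal C++ sketch of the pairwise force evaluation of Eq. (2) with cut-off radius r_c. It uses a naive all-pairs loop for clarity; the coupled MD codes use Verlet lists or linked cells instead, and all names in this sketch are illustrative rather than taken from any of the codes.

#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Lennard-Jones force on particle p (Eq. (2)) from all particles within rc.
Vec3 ljForceOnParticle(std::size_t p, const std::vector<Vec3>& x,
                       double eps, double sigma, double rc) {
  Vec3 F{0.0, 0.0, 0.0};
  const double rc2 = rc * rc;
  for (std::size_t q = 0; q < x.size(); ++q) {
    if (q == p) continue;
    const Vec3 rpq{x[q][0] - x[p][0], x[q][1] - x[p][1], x[q][2] - x[p][2]};
    const double r2 = rpq[0] * rpq[0] + rpq[1] * rpq[1] + rpq[2] * rpq[2];
    if (r2 > rc2) continue;                 // apply the cut-off radius rc
    const double s2 = sigma * sigma / r2;
    const double s6 = s2 * s2 * s2;         // (sigma/||r_pq||)^6
    const double scale = 24.0 * eps / r2 * s6 * (1.0 - 2.0 * s6);
    for (int d = 0; d < 3; ++d) F[d] += scale * rpq[d];
  }
  return F;
}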

2.2. Lattice Boltzmann

Large-scale motion of the fluid is computed by means of the Lattice Boltzmann method. Particle distribution functions $f_i$ are computed via explicit time stepping using the BGK collision operator approximation [25]:

$f_i(\mathbf{x} + \mathbf{c}_i, t + 1) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left(f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t)\right) + g_i$    (3)

with relaxation time $\tau$, the usual expression for the equilibrium distribution $f_i^{eq}$ [26] and lattice velocities $\mathbf{c}_i$, $i = 1, \ldots, Q$. The relaxation time $\tau$ is related to the kinematic viscosity, $\nu = c_s^2(\tau - 0.5)$, where $c_s$ denotes the speed of sound. The term $g_i$ corresponds to external forcing. For an external force $\mathbf{F}$, $g_i$ arises as

$g_i := w_i \frac{\rho\, \mathbf{c}_i \cdot \mathbf{F}}{c_s^2}$    (4)

with fluid density $\rho$. In the above formulations, all quantities and equations are given in dimensionless form, assuming unit mesh and time step size.
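The BGK step of Eqs. (3)–(4) for a single cell can be written compactly as in the C++ sketch below. It is a generic, unoptimized stand-in (not waLBerla's kernel): the distribution functions, equilibrium values, lattice weights and velocities are assumed to be supplied by the surrounding LB implementation, and streaming is omitted.

#include <array>
#include <cstddef>

// One BGK collision with the simple forcing term g_i of Eq. (4), for a
// generic DdQq lattice; f is updated in place (pre-streaming values).
template <std::size_t Q>
void collideBGK(std::array<double, Q>& f, const std::array<double, Q>& feq,
                const std::array<double, Q>& w,
                const std::array<std::array<double, 3>, Q>& c,
                double tau, double rho, const std::array<double, 3>& F) {
  const double cs2 = 1.0 / 3.0;  // squared speed of sound in lattice units
  for (std::size_t i = 0; i < Q; ++i) {
    const double ciF = c[i][0] * F[0] + c[i][1] * F[1] + c[i][2] * F[2];
    const double gi = w[i] * rho * ciF / cs2;     // Eq. (4)
    f[i] += -(f[i] - feq[i]) / tau + gi;          // Eq. (3)
  }
}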

2.3. Steady-state coupling

The coupling scheme of LB and MD [16] is established similarly to the descriptions by Dupuis et al. [4]. The method corresponds to a Schwarz-like coupling, see List 1, and is valid for steady, weakly compressible flows:



Fig. 1. Schwarz-like coupling of LB and MD. The molecules in the outer (green-colored) overlap region are relaxed towards the cellwise average velocity given by LB. After equilibration, the flow velocities are sampled in MD, sent to LB and incorporated via an acceleration term. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

while (not steady state)
    send data from LB to MD
    equilibrate MD with LB constraints
    sample MD with LB constraints
    send data from MD to LB
    while (LB not steady state)
        solve LB with MD constraints

List 1: Schwarz-like coupling of LB and MD.

The basic idea of incorporating LB and MD constraints is sketched in Fig. 1. First, consider the coupling from LB to MD. Flow velocities are extracted from LB grid cells and transferred to the MD simulation. The molecule velocities are relaxed towards these new average velocity values,

$\mathbf{v}_p := \mathbf{v}_p + \lambda^{MD}\left(\mathbf{v}^{LB \rightarrow MD} - \mathbf{v}^{MD}_{avg}\right)$    (5)

with relaxation factor $\lambda^{MD} \in (0, 1]$. The velocity $\mathbf{v}^{MD}_{avg}$ is the average velocity value sampled from MD. It is determined by averaging the molecular velocities inside a LB cell at a given MD time step. The velocity $\mathbf{v}^{LB \rightarrow MD}$ defines the velocity value prescribed by LB. In the original work [4], $\mathbf{v}^{LB \rightarrow MD}$ was set to the cell-local LB velocity, corresponding to zero-order interpolation of the velocity for all molecules inside the respective grid cell. Choosing a rather thick overlap layer (green-colored cells in Fig. 1) $\delta^{ovlp} \in [7\sigma; 9\sigma]$ guaranteed a consistent and smooth transfer of quantities from LB to MD.

In order to reduce computational overhead, we reduce the size of the overlap region in our experiments to ca. 50%, cf. Fig. 2. The MD domain is covered by grid cells. For simplicity, we choose these cells to be of the same size as the LB cells in the following (coupling and LB cells are used interchangeably); this, however, is no restriction w.r.t. the coupling tool and its functionality, see Section 3.1. The MD domain is further surrounded by one layer of ghost cells. Let a molecule be located inside the strip which is spanned by the first two layers of cell midpoints located inside the MD region. We interpolate the average LB velocity value $\mathbf{v}^{LB \rightarrow MD}$ at the location of the molecule using second-order polynomials. This guarantees a smooth transition in the MD velocities. It is thus sufficient to have an overlap region of ca. one cut-off radius (typically of order $2.5\sigma$). Besides, we abstain from external boundary forcing in our experiments to model open MD boundaries. Instead, we rely on periodic boundary conditions in MD (as e.g. already used in [27]), together with a second buffering region (half a LB cell width) in which molecules are allowed to freely move and relax. For example, for a typical LB cell width $dx = r_c = 2.5\sigma$, this methodology results in a total overlap layer thickness of $\delta^{ovlp} = 1.5\,dx = 3.75\sigma$.

Besides velocity relaxation, a cell-local thermostat is applied in each grid cell of the MD domain to conserve temperature. For this purpose, we re-scale the deviations from the average velocity $\mathbf{v}^{MD}_{avg}$ [28] and impose a temperature $T_{target}$ via

$\mathbf{v}_p := \mathbf{v}^{MD}_{avg} + \sqrt{\frac{T_{target}}{T_{current}}}\left(\mathbf{v}_p - \mathbf{v}^{MD}_{avg}\right)$    (6)

where $T_{current}$ denotes the current temperature.

After equilibrating the MD system, average velocity values $\mathbf{v}^{MD \rightarrow LB}_{avg}$ are sampled (over time) in all inner coupling cells. These values are transferred from MD to LB and imposed via a forcing term:

$\mathbf{F} = \lambda^{LB}\left(\mathbf{v}^{MD \rightarrow LB}_{avg} - \mathbf{v}^{LB}\right)$    (7)

where $\mathbf{v}^{LB}$ denotes the LB velocity of the current LB cell, and $\lambda^{LB} \in (0, 1]$ is a relaxation factor. Choosing $\lambda^{LB} := 1$ corresponds to the method described in [4].
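To make the per-cell operations of Eqs. (5) and (6) concrete, the following C++ sketch relaxes the velocities of the molecules in one coupling cell towards the LB velocity and then rescales the velocity deviations towards a target temperature. It is a simplified stand-in, not MaMiCo's actual VelocityGradientRelaxation or thermostat implementation: names are illustrative, and the interpolated velocity v^(LB->MD) is taken as one value per cell instead of being evaluated at each molecule position.

#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// Velocity relaxation (Eq. (5)) and thermostat rescaling (Eq. (6)) for the
// molecule velocities v of one coupling cell.
void relaxAndThermalize(std::vector<Vec3>& v, const Vec3& vLBtoMD,
                        double lambdaMD, double Ttarget, double Tcurrent) {
  if (v.empty()) return;
  auto cellAverage = [&v]() {
    Vec3 avg{0.0, 0.0, 0.0};
    for (const auto& vp : v)
      for (int d = 0; d < 3; ++d) avg[d] += vp[d] / v.size();
    return avg;
  };

  // Eq. (5): relax each molecule velocity towards the LB velocity.
  const Vec3 vAvg = cellAverage();
  for (auto& vp : v)
    for (int d = 0; d < 3; ++d) vp[d] += lambdaMD * (vLBtoMD[d] - vAvg[d]);

  // Eq. (6): rescale deviations from the (updated) cell average to impose Ttarget.
  const Vec3 vAvgNew = cellAverage();
  const double scale = std::sqrt(Ttarget / Tcurrent);
  for (auto& vp : v)
    for (int d = 0; d < 3; ++d)
      vp[d] = vAvgNew[d] + scale * (vp[d] - vAvgNew[d]);
}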


Fig. 2. Molecular velocity relaxation using second-order interpolation. We consider the lower right corner of a two-dimensional MD simulation. The yellow-colored cells are ghost cells surrounding the actual MD domain. If a molecule is located in the green strip, the average LB velocity vLB→MD is interpolated at its location from the light greencolored LB flow velocity values; a similar choice of interpolation points is employed in the three-dimensional case. The white layer between the green- and yellow-colored regions serves as (mass) buffering region. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Software Design of the macro–micro-coupling tool, including associations to the considered MD and LB simulations.

3. Software

3.1. Macro–micro-coupling tool

All parts of the coupling algorithm from Section 2.3 are implemented and integrated into the macro–micro-coupling tool. The software design of the C++-based tool is depicted in Fig. 3 and is revised in the following, cf. also [11]. Several aspects are strictly separated in the tool: data management, conservation of hydrodynamic quantities, coupling algorithmics, and the interfaces to molecular dynamics and continuum solvers.

Data management and distribution. Cartesian grid cells – referred to as macroscopic cells inside the tool – are defined which cover the MD domain, including one ghost layer surrounding it. Exchange of data between MD and coupling tool or coupling tool and continuum simulation is established via this data structure. In the case of distributed parallel MPI-based simulations, the cells always reside on the same process as the underlying MD subdomain. The domain decomposition and respective rank-to-coordinate mapping is implemented via ParallelTopology. Currently, xyz- and zyx-domain layouts are supported. The class IndexConversion allows conversion between local and global macroscopic cell indices and yields access to the ParallelTopology.

Momentum and energy conservation. When modifying mass, momentum or energy separately in the MD system, one needs to make sure that the other hydrodynamic quantities of interest are conserved. For this purpose, the classes MomentumController and KineticEnergyController come with the coupling tool. They allow to conserve momentum and temperature on a cellwise basis. For example, inserting a new particle into a macroscopic cell results in the following algorithm:

temperature = kineticEnergyController.getTemperature(cell)
momentum = momentumController.getMomentum(cell)
insertParticle(cell)
momentumController.setMomentum(cell, momentum)
kineticEnergyController.setTemperature(cell, temperature)

The cell-local thermostat, cf. Eq. (6), is another example of such energy controllers.

Quantity transfer and coupling algorithmics. Mass and momentum exchange are implemented in terms of three separate modules. Algorithms for particle insertions or removals on MD side are incorporated via ParticleInsertion. Currently, a parallel USHER implementation is available [8,11]. The same concept applies for the MomentumInsertion. The interpolation-based velocity relaxation scheme from Section 2.3 is implemented in VelocityGradientRelaxation. The class TransferStrategy provides an interface to yield coupling algorithm-dependent operations on the macroscopic grid cells. Typical operations comprise sampling of molecular data,


modifying cell-local buffer variables (which determine, e.g., mass and momentum transfer to/from the MD system), or conversion and interpretation of buffer variables (state- vs. flux-based variable interpretation).

Solver interfaces. Connecting to the continuum or MD solvers is established by implementing the interfaces MacroscopicSolverInterface and MDSolverInterface, respectively. The MacroscopicSolverInterface only needs to provide information on which process needs to send/receive particular macroscopic cell information to/from the coupling tool. On MD side, three interfaces need to be implemented. First, an interface Molecule allows access to the properties of individual molecules such as position or velocity. Second, the interface MoleculeIterator needs to be implemented to allow the traversal of all molecules which are located in a particular macroscopic cell. If a linked cell data structure is used by the MD simulation, this data structure can typically be reused easily, see for example Sections 3.3, 3.4 and 3.6. Third, the interface MDSolverInterface allows for MD simulation-specific operations such as access to physical simulation parameters (e.g. the Lennard-Jones parameters ϵ and σ), synchronization of molecules across MPI processes, or access to the memory of the MD simulation (e.g. for molecule insertion and removal).
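To give an impression of how little glue code the molecule-level interfaces typically require, the C++ sketch below wraps a toy particle type behind molecule and per-cell iterator abstractions. The class and method names are illustrative only and do not reproduce MaMiCo's exact signatures; they merely mirror the roles of the Molecule and MoleculeIterator interfaces described above.

#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Toy stand-in for the native particle type of the coupled MD code
// (SimpleMD, ls1 mardyn, LAMMPS or ESPResSo).
struct NativeParticle { Vec3 x{}; Vec3 v{}; };

// Exposes one particle to the coupling tool (role of the Molecule interface).
class MoleculeWrapper {
public:
  explicit MoleculeWrapper(NativeParticle& p) : _p(p) {}
  const Vec3& getPosition() const { return _p.x; }
  const Vec3& getVelocity() const { return _p.v; }
  void setVelocity(const Vec3& v) { _p.v = v; }
private:
  NativeParticle& _p;
};

// Traverses all molecules of one macroscopic cell (role of MoleculeIterator),
// typically backed by a linked cell of the MD code.
class CellMoleculeIterator {
public:
  explicit CellMoleculeIterator(std::vector<NativeParticle>& cell) : _cell(cell) {}
  void begin() { _i = 0; }
  bool continueIteration() const { return _i < _cell.size(); }
  void next() { ++_i; }
  MoleculeWrapper get() { return MoleculeWrapper(_cell[_i]); }
private:
  std::vector<NativeParticle>& _cell;
  std::size_t _i{0};
};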
3.2. waLBerla

Features and software design. The widely applicable Lattice Boltzmann simulation code from Erlangen (waLBerla, http://walberla.net) [21,22] is a massively parallel C++ LB framework developed at FAU Erlangen-Nürnberg. It is deployed on some of the top HPC clusters of the world and is capable of fluid simulation in arbitrary geometries. Fields of application include free surface flows, medical applications such as blood flow in vessels, fluid–structure interaction, or electrokinetic simulation. It has a modular and flexible software design using generic data structures, created to allow for easy adaptation to the user's needs and implementation of new features. It contains CPU- and memory-efficient, hardware-specific compute kernels for optimal performance on common supercomputing architectures. There are different kernel implementations for e.g. BGK, TRT and MRT collision operators and for several external force models, as well as a variety of applicable boundary conditions. Though designed for regular lattices such as grids with quadratic (2D)/cubic (3D) Cartesian grid cells, extensions of the LBM to irregular meshes, in particular to adaptively refined grids, are well established [29–31]. The software waLBerla implements the scheme proposed in [31]. Parallelization is achieved by a domain decomposition approach. For parallel execution on irregular grids, load balancing algorithms are available.

Interface and coupling implementations. The coupling is embedded as an app that uses the MaMiCo tool in the waLBerla framework. The main Runner controls the LBSimulation, MaMiCo's MacroscopicCellService and the MD simulation that is wrapped in a separate class CoupledMolecularDynamicsSimulation. The respective MD simulation (see following sections) is created via a factory pattern; each MD simulation can be switched on or off at compile time. The coupled simulation is parametrized by an XML configuration file. The coupling data is exchanged to and from waLBerla through the MacroscopicSolverInterfaceService which encapsulates the implementation of MaMiCo's MacroscopicSolverInterface. The LBSimulation runs waLBerla on all processes; alternatively, it is possible to run waLBerla sequentially. For the incorporation of the coupling data received from MD, which is modeled as an external force in the LBM, we use our own BGK kernel with a force implementation. For performance reasons, it is only applied in coupling cells though. The kernel utilizes waLBerla's concept of a sweep, i.e. an algorithm that operates on a single domain block and modifies its data fields; this approach allows attaching arbitrary data (e.g. the LB particle distribution functions) to blocks.

3.3. SimpleMD

Features and software design. The SimpleMD software is part of the macro–micro-coupling tool and shall facilitate the incorporation of new coupling schemes and allow for rapid prototyping. It is written in C++ and supports two- and three-dimensional single-centered Lennard-Jones MD simulations. Its software design [16] is shown in Fig. 4. The cut-off condition for molecular interactions is established via the linked cell approach, and velocity Störmer–Verlet time stepping is applied. Molecules can either be traversed on molecule or linked cell basis. The memory management and traversal of the linked cells or molecules are implemented in the LinkedCellService and MoleculeService. The respective functionality applied to the molecular system is imposed via Mappings that are invoked via a callback concept. For example, a call to MoleculeService::getInstance().iterate(myMapping) invokes calls to myMapping's methods beginIteration(), endIteration() and handleMolecule(Molecule&), with the latter called for each molecule in the simulation; a sketch is given below. The same concept applies to the linked cell traversal. Distributed memory parallelization is accomplished via domain decomposition and is implemented in the ParallelTopologyService, using an xyz-ordering of the MPI processes. Additional features are incorporated via the classes BoundaryTreatment (implements periodic boundaries) and MolecularGeometryService (fixes certain molecules in space to represent obstacles).

Interface and coupling implementations. Since SimpleMD comes with the macro–micro-coupling tool, the implementation of the interfaces is straightforward and very moderate in code size, see Table 1. Calls to the MacroscopicCellService from the MD simulation (e.g. to distributeMacroscopicQuantities(...)) are incorporated by deriving a new CoupledMolecularDynamicsSimulation from MolecularDynamicsSimulation. The calls are integrated into the derived time step implementation. No modifications or further extensions of the original code were necessary.
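The following minimal C++ sketch illustrates SimpleMD's mapping/callback concept described above. The mapping accumulates a mean velocity over all molecules; the Molecule stand-in and its accessor getVelocity() are assumptions made for illustration and need not match the actual SimpleMD class layout, while the iterate() call follows the usage quoted in the text.

#include <array>
#include <cstddef>

// Stand-in for SimpleMD's Molecule class (cf. Fig. 4); getVelocity() is an
// assumed accessor used here for illustration only.
struct Molecule {
  std::array<double, 3> velocity{};
  const std::array<double, 3>& getVelocity() const { return velocity; }
};

// A mapping following the callback concept of SimpleMD: beginIteration(),
// handleMolecule(...) per molecule, endIteration().
class MeanVelocityMapping {
public:
  void beginIteration() { _sum.fill(0.0); _count = 0; }
  void handleMolecule(Molecule& m) {
    const auto& v = m.getVelocity();
    for (int d = 0; d < 3; ++d) _sum[d] += v[d];
    ++_count;
  }
  void endIteration() {
    if (_count > 0)
      for (int d = 0; d < 3; ++d) _sum[d] /= _count;
  }
  const std::array<double, 3>& meanVelocity() const { return _sum; }
private:
  std::array<double, 3> _sum{};
  std::size_t _count{0};
};

// Usage within SimpleMD, as described in the text:
//   MeanVelocityMapping myMapping;
//   MoleculeService::getInstance().iterate(myMapping);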



Fig. 4. Software design of SimpleMD [16]. The MD simulation class MolecularDynamicsSimulation combines all concepts: memory management and traversal (LinkedCellService, MoleculeService), MD functionality (mappings), boundary conditions (BoundaryTreatment), geometry handling (MolecularGeometryService) and distributed memory parallelization (ParallelTopologyService).

Table 1
Lines of code (LoC) of the interface implementations.

MD software    Interface                                LoC
SimpleMD       SimpleMDMolecule                         45
               SimpleMDMoleculeIterator                 40
               SimpleMDSolverInterface                  290
               Total                                    375
ls1 mardyn     MarDynCell.h                             24
               MarDynMDSolverInterface.h                265
               LinkedCellsForCoupling.h                 59
               MarDynMoleculeWrapper.h                  123
               MarDynMoleculeIterator.h                 33
               MarDynCoupledSimulation.h                357
               Total                                    861
LAMMPS         fix_mamico.cpp                           53
               fix_mamico.h                             39
               fix_mamico_template_functions.h          27
               ghost_atoms.h                            46
               mamico_cell.h                            38
               mamico_lammps_md_solver_interface.h      319
               mamico_lammps_molecule.h                 112
               mamico_lammps_molecule_iterator.h        31
               sorting.h                                174
               Total                                    839
ESPResSo       EspressoMDMolecule.h                     100
               EspressoMDMoleculeIterator.h             44
               EspressoMDSolverInterface.h              221
               Total                                    365

3.4. ls1 mardyn

Features and software design. ls1 mardyn is a relatively new code for MD simulations of large systems arising in chemical and process engineering [18]. It is written in C++ and aims at delivering high performance while maintaining a good software-engineering design. It supports rigid multi-centered molecules interacting via the Lennard-Jones, Coulomb, dipole or quadrupole potentials. Besides distributed and shared memory parallelization support, the code particularly features high node-level performance, due to an efficient SIMD vectorization scheme. This resulted in simulations running at 10% of the peak performance of SuperMUC [19], breaking the record for the largest MD simulation of this kind at the same time. Forces are computed via the linked cell algorithm, with support for adaptive subdivision of the cells. Parallelization is implemented via domain decomposition on either a regular Cartesian topology (employing MPI_Cart functionality) or a KD-tree based adaptive one. The (rotational) Leapfrog algorithm is used for the time integration of the (multi-centered) molecules.

Interface and coupling implementations. ls1 mardyn is built and used as a static library. Most of the interface implementations can be created using the functionality already provided by ls1 mardyn; some extensions are, however, necessary. A linked cell data structure is available, but a new constructor is needed that creates regular cells at a particular cell length: LinkedCellsForCoupling is derived from the base class LinkedCells of ls1 mardyn for this purpose. Additional implementations are created for, e.g., the iteration over all molecules of a linked cell or the retrieval of simulation-specific parameters. The steering of the MD simulation (e.g. time stepping, algorithmic sequence of steps per time step, etc.), including calls to MaMiCo, is established by deriving a new MarDynCoupledSimulation from the original class Simulation of ls1 mardyn; this further allows to keep the original simulation class as restrictive as possible. The ParallelTopology with zyx-domain layout is used in the ls1 mardyn–MaMiCo coupling. The number of lines of code for the interface implementations is 421 (MarDynMDSolverInterface.h, MarDynMoleculeIterator.h, MarDynMoleculeWrapper.h; cf. Table 1), which is only slightly higher than for the built-in SimpleMD. The code extensions for ls1 mardyn (LinkedCellsForCoupling.h, MarDynCell.h, MarDynCoupledSimulation.h) amount to additional


440 lines of code. These extensions are part of the coupling implementations, so only minor changes (member declarations, one new constructor) to the original ls1 mardyn code were required.

3.5. LAMMPS

Features and software design. The LAMMPS Molecular Dynamics Simulator is a multi-functional 2D/3D integrator for Newton's equations of motion. Applications comprise, amongst others, simulations of atoms, coarse-grained particles, all-atom polymers, metals or coarse-grained mesoscale models [32]. In terms of atom-to-continuum coupling, LAMMPS was previously used to couple MD to finite element [33] as well as fluid flow simulations [34]. In LAMMPS, the MD simulation can be steered via input scripting. Short-range interactions – such as the cut-off Lennard-Jones interactions from Section 2.1 – are realized via efficient Verlet list data structures; no linked cells are employed. Concerning time integration, we apply the velocity-Verlet integrator (further integrators, e.g. for Brownian dynamics, are available). The parallelization is accomplished via a domain decomposition approach. The user can choose between different rank-to-coordinate mappings: subsequent lexicographic enumeration schemes (xyz, xzy, yxz, yzx, zxy, zyx), or MPI-based mappings (cart, cart/reorder) which make use of MPI_Cart. New functionality is plugged into the integrator via fixes: a time step is separated into different phases, and after/before each phase, specific calls to member functions of fix classes are triggered.

Interface and coupling implementations. The interface and coupling implementation to connect LAMMPS and MaMiCo is established in a user-defined package USER_MAMICO. Different variants to couple codes with LAMMPS are detailed in the LAMMPS manual [32]; we follow the approach where LAMMPS is used and built as a static library, including the new package USER_MAMICO. With no linked cell data structures available, a separate Cartesian grid structure and respective particle sorting is implemented to interface MaMiCo's cellwise traversal of the molecules. The sorting comes at an overhead of linear complexity and was found to be acceptable, cf. Section 4. The functionality of MaMiCo is hooked into LAMMPS in each time step using the pre_force() method of the new fix fix_mamico. The parallel execution of the LAMMPS–MaMiCo coupling was verified and validated for xyz- and zyx-rank-to-coordinate mappings. Other domain decompositions that are supported by LAMMPS (see above) may easily be plugged in in the future by extending the ParallelTopology implementations of MaMiCo. The number of lines of code for pure interface implementations (mamico_lammps_md_solver_interface.h, mamico_lammps_molecule.h and mamico_lammps_molecule_iterator.h, cf. Table 1) is 680. This is slightly more than for SimpleMD. However, LAMMPS code extensions need to be provided (cells and sorting) and amount to 558 lines of code (all other implementation files, cf. Table 1). No further modifications to LAMMPS-internal routines were made.
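For orientation, the following is a schematic C++ sketch of the fix-based hook described above, following LAMMPS' standard fix pattern (a class derived from Fix that requests a callback in the PRE_FORCE phase). It is not the actual USER_MAMICO implementation: the class name is hypothetical and the body of pre_force() is a placeholder.

#include "fix.h"   // LAMMPS fix base class

namespace LAMMPS_NS {

// Schematic coupling fix; a fix is registered in its package header via the
// FixStyle macro so that it can be activated from the input script.
class FixMamicoSketch : public Fix {
 public:
  FixMamicoSketch(class LAMMPS* lmp, int narg, char** arg) : Fix(lmp, narg, arg) {}

  // request a callback before the force computation of each time step
  int setmask() override { return FixConst::PRE_FORCE; }

  void pre_force(int /*vflag*/) override {
    // Placeholder: sort the local atoms into MaMiCo's macroscopic cells and
    // trigger the coupling tool (sampling, velocity relaxation, thermostat).
  }
};

}  // namespace LAMMPS_NS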
3.6. ESPResSo

Features and software design. ESPResSo is a Molecular Dynamics code which has been designed mainly for coarse-grained applications [35]. The source code is written in C, and the simulations can be controlled using Tcl scripts, including setup, simulation and analysis: it is thus easily extensible and allows for rapid scenario preparation. ESPResSo uses the standard velocity Verlet integration scheme. It can handle three kinds of interactions: short-range, long-range, and bonded interactions. Short-range forces are handled using a linked cell structure, which is the default implementation; ESPResSo also contains an implementation of Verlet lists. The long-range potentials are based on Coulomb interactions and include fast algorithms such as P3M and MEMD. ESPResSo also has an implicit Lattice Boltzmann solver with limited functionality, which is mainly used to couple to MD to incorporate hydrodynamic interactions. The code has been parallelized using the domain decomposition approach. It uses the standard Cartesian topology and maps ranks to coordinates using MPI_Cart_coords, resembling a zyx-domain layout.

Interface and coupling implementations. ESPResSo can be built as a static library, which can easily be linked with other codes. This is also the approach we used. The molecules are already stored in linked cell structures, which can be directly used and accessed by MaMiCo. ESPResSo does not store the calculated energy per molecule, but computes the energy values on-the-fly. An additional potential energy field was added in the molecule interface for MaMiCo to cache the potential energy. ESPResSo uses a master–slave configuration, where the master processor sends commands to the other processors, which are waiting in a loop. This configuration is not compatible with coupled simulations, where all processors should pause for initialization, data sampling and exchange by the coupling tool. Therefore, a flag was introduced in communications.cpp to break the master–slave dependence and allow for a start–stop functionality. Besides, in order to reuse Tcl-based initialization functions of ESPResSo, two minor modifications were introduced in initialize_interpreter.cpp. This yields a total of 9 modified lines of code in ESPResSo, and 365 lines of code for interface implementations, cf. Table 1.

4. Results

4.1. Validation: channel flow

We validate the coupling methodology from Section 2.3 in a channel flow scenario. We consider three different setups (MD30, MD60, MD120), cf. Table 2. Our domain setup is shown in Fig. 5(a) and corresponds to the case MD60. We describe this scenario in the following; analogous validation studies have also been conducted for MD30 and MD120 and for various process counts. All quantities denoted with superscript MD are given in MD units; superscripts LB correspond to LB units. The MD units are scaled such that $m_p = \epsilon = \sigma = 1.0$ [24,4]. The LB domain is discretized by 96 × 48 × 48 cells of size $dx^{MD}_c = 10.0$ (one layer of cells corresponds to "outer" boundary cells). The MD region is embedded in the center of the last third of the channel. The LB grid is adaptively refined around this region with two more grid levels, consisting of 48 × 48 × 48 cells each. This yields a finest resolution – and consequently a macroscopic cell size – of $dx^{MD}_f = 2.5$. The MD region has a size of 60.0 × 60.0 × 60.0 (MD units) and is filled with

332

Table 2
Description of the different MD-LB scenarios. The second column denotes the number of macroscopic cells that cover the whole MD domain, the third column shows the number of atoms used in the MD simulation. The fourth and fifth columns show the number of LB cells used on the coarsest and the finest grid level. In each scenario, a total number of three adaptive grid levels is used, with refinement triggered in the surrounding of the MD region.

Name     No. macrosc. cells (MD)   No. atoms    No. LB cells (coarsest)   No. LB cells (finest)
MD30     12 × 12 × 12              16 250       48 × 24 × 24              24 × 24 × 24
MD60     24 × 24 × 24              130 050      96 × 48 × 48              48 × 48 × 48
MD120    48 × 48 × 48              1 040 502    192 × 96 × 96             96 × 96 × 96
Fig. 5. Results for one coupling cycle of the channel flow scenario MD60. (a) Cut through the velocity field of the channel, visualizing also the three-level adaptive LB grid. The inner white-framed box denotes the MD region. (b) Velocity profile (in LB units) before (continuous line) and after (dashed line) sampling in MD; the profiles are measured across the white line shown in (a).

130 050 atoms. The state of the MD system is chosen in accordance with [4] as $(\rho^{MD}, T^{MD}) = (0.6, 1.8)$, yielding a viscosity of $\nu^{MD} \approx 1.5$. The viscosity in LB units is scaled to $\nu^{LB} = 0.05$ on the finest grid level. This yields a time step $\Delta t^{LB} = 0.21$ in the LB simulation. We impose a pressure difference across the channel of $\Delta p^{LB} = 0.0001$.

First, the channel flow is computed by LB. We run the LB system for 60 000 time steps. Afterwards, the velocity values are sent to and imposed in the MD simulation. We use a relaxation factor $\lambda^{MD} = 0.05$ in our experiments. The MD system is equilibrated for 800 000 time steps, and velocity values are sampled over another 200 000 time steps. We set the time step $\Delta t^{MD} = 0.002$ and choose the cut-off radius $r_c = 2.5$ in our simulations. Additional studies were conducted varying $r_c$; for example, the choice $r_c = 1.12 \approx 2^{1/6}$ cuts off all attractive interactions of the Lennard-Jones potential and is therefore considered to be a good estimate to compare the molecular system to the ideal gas-based LB description. Similar results were achieved in this case. After the sampling, the velocity values are sent from MD to LB, and the LB simulation is equilibrated for another 60 000 time steps, imposing the sampled velocity values from the MD simulation via a forcing term, cf. Eqs. (4), (7), with relaxation factor $\lambda^{LB} = 0.1$. The relaxation factor $\lambda^{LB}$ was found to be a rather insensitive parameter in our studies, but may play a crucial role in simulations at a higher level of thermal noise.

The resulting channel profile for the velocity is shown in Fig. 5 for the waLBerla–ls1 mardyn coupling. Analogous results have been obtained for the other MD solvers. We see good agreement of both pure LB and LB-MD solution. The latter contains the natural fluctuations that arise from the molecular system; these fluctuations yield a maximum deviation from the smooth pure LB profile of ca. 7%. With the velocity fluctuating around the parabolic shape, this shows that the interpolation-based velocity relaxation on MD side captures the correct flow behavior, despite the significant reduction in the overlap layer thickness.

The computation – consisting of two full coupling cycles with 60 000 LB time steps and 1 000 000 MD time steps each – took ca. 7 h on 256 SandyBridge compute cores (see Section 4.2 for details). The LB simulation amounts to ca. 9% of the overall compute time. The number of MD/LB time steps is rather large since steady state/equilibrium needs to be reached. However, fewer time steps on LB side may be required: in the present study, we did not observe significant changes in the second coupling cycle after ca. 20 000 LB steps. Compared to the original method [4], and assuming a quadratic dependence of the time to reach equilibrium/steady state on the system size, the number of time steps chosen in our work is in accordance.

4.2. Scaling and efficiency

We conducted studies to evaluate the performance of the scenarios MD30, MD60 and MD120 for the four coupled solvers. We used the partitions snb and bdz of the MAC-cluster (www.mac.tum.de) in all our experiments. The partition snb consists of 28 nodes equipped with a dual-socket Intel SandyBridge-EP Xeon E5-2670, yielding 16 compute cores per node. Each of the 19 nodes of the partition bdz comprises a quad-socket AMD Bulldozer Opteron 6274 processor (yielding a total of 64 cores per node). All nodes are connected via QDR Infiniband.

Sequential performance. We measured the sequential runtime of the coupled simulations on one snb node.
In particular, with MD dominating the overall runtime in LB-MD simulations (cf. previous section), we compared the runtime per MD time step for coupled and non-coupled MD simulations. The times for the different solvers are given in Table 3. All MD simulations except ls1 mardyn show similar performance in the three test setups and an overhead of 10%–20% induced by the coupling. Due to the high node-level performance of ls1 mardyn with AVX-vectorized force-compute kernels, ls1 mardyn is 20%–25% faster than the community codes. Since the overhead of coupling to MaMiCo comes at a comparable cost as for the other solvers, the relative overhead is higher. The overall runtime of waLBerla–ls1



Fig. 6. Strong scaling for channel flow scenario on the partition snb, considering the scenarios MD30, MD60, MD120 for the different molecular dynamics–continuum couplings.

mardyn thus results in similar runtimes as for the other codes. With SimpleMD representing a very light-weight implementation specialized for single-centered Lennard-Jones molecules, and since SimpleMD was co-designed with MaMiCo, it exhibits good performance in the current test cases.

Parallel performance. We further investigated parallel coupled simulations in strong scaling experiments on the SandyBridge partition. The speedups are shown in Fig. 6. ESPResSo exhibits the best scaling, resulting in a speedup of 212 for the scenario MD120 on 512 snb cores; this setup corresponds to an average of ca. 2000 Lennard-Jones atoms per process. As expected, the waLBerla–ls1 mardyn simulation scales well for bigger particle numbers only. Improvements towards latency hiding and communication optimization are subject of current development, extending the applicability of ls1 mardyn also towards smaller particle numbers. When executed on higher core counts, the waLBerla–LAMMPS simulations are slightly slower compared to the other simulations in most of our measurements. This can be explained by the overhead arising from the additional sorting procedures of the coupled scheme. This overhead is particularly pronounced the fewer molecules are treated by each process. In the majority of cases, the overall runtimes are comparable for the four solvers. Still, maximum runtime differences occur at higher core counts, resulting in factors of 1.7x–3.3x between the fastest and slowest solver. For example, the runtimes per MD time step on 512 cores for MD120 lie in the range [0.016 s; 0.027 s] in our measurements (ESPResSo: 0.016 s; SimpleMD: 0.020 s; ls1 mardyn: 0.024 s; LAMMPS: 0.027 s).


Fig. 7. Strong scaling for channel flow scenario on the partition bdz.

Table 3
Runtime (s) per MD time step for LB-MD and pure MD simulations. The row MD shows the runtime for pure MD simulations, MD-LB shows the runtimes in the coupled channel flow scenarios. The relative overhead |t(MD) − t(MD-LB)|/t(MD) arising from the coupling is specified in the row OHD.

             waLBerla–SimpleMD          waLBerla–ls1 mardyn
             MD30    MD60    MD120      MD30    MD60    MD120
MD           0.041   0.289   2.296      0.032   0.279   2.353
MD-LB        0.046   0.336   2.777      0.037   0.457   3.440
OHD (%)      12      16      21         16      64      46

             waLBerla–LAMMPS            waLBerla–ESPResSo
             MD30    MD60    MD120      MD30    MD60    MD120
MD           0.043   0.351   2.842      0.043   0.363   3.141
MD-LB        0.049   0.397   3.261      0.048   0.420   3.469
OHD (%)      14      13      15         12      16      10

Similar scaling behavior was observed using up to 1000 compute cores of the Bulldozer partition bdz. Exemplary strong scaling results are shown in Fig. 7 for the waLBerla–LAMMPS coupling on up to 1024 cores.

5. Summary

We extensively discussed the coupling of various MD software frameworks and the LB framework waLBerla via the macro–micro-coupling tool MaMiCo. The four MD simulations considered in this contribution could be successfully coupled to MaMiCo and applied in sequential and parallel hybrid simulations on AMD- and Intel-based platforms, independent of

• which kind of short-range acceleration technique is used (Verlet lists in LAMMPS vs. linked cells in the other codes; the latter, however, simplifies and minimizes the interface implementation),

• which kind of parallel communication concept is employed (strict master–slave communication in ESPResSo vs. concurrent local instructions in the other codes),

• which kind of domain layout/parallel topology is used (xyz-ordering in SimpleMD, LAMMPS vs. zyx-ordering in ls1 mardyn, ESPResSo, LAMMPS; extensibility for other topologies is provided via the interface ParallelTopology).

MaMiCo is thus ready to be used and extended in the community, and comes with working interface implementations for the above MD codes. Download is available at www5.in.tum.de/mamico. The performance overhead introduced by the coupling tool MaMiCo was found to be ca. 10%–20% for three out of four MD simulations. Further optimization of the coupling components is part of future work to fully exploit highly optimized, e.g. vectorized, MD simulations.

Acknowledgments

This work is partially supported by Award No. UK-C0020 made by King Abdullah University of Science and Technology (KAUST). Financial support of the German Federal Ministry of Education and Research (BMBF) in terms of the project SkaSim is further acknowledged. We particularly thank Florian Schornbaum for his support on the framework waLBerla. The Munich Centre of Advanced Computing (MAC) is acknowledged for providing computational resources (MAC-cluster). We further thank the Leibniz Supercomputing Centre for providing further computational resources in the scope of the project Massively Parallel Multiscale Simulation of Nanoflows (project pr87cu).

References

[1] M. Kalweit, D. Drikakis, Multiscale methods for micro/nano flows and materials, J. Comput. Theor. Nanoscience 5 (9) (2008) 1923–1938.
[2] S. Barsky, R. Delgado-Buscalioni, P. Coveney, Comparison of molecular dynamics with hybrid continuum-molecular dynamics for a single tethered polymer in a solvent, J. Chem. Phys. 121 (5) (2004) 2403–2411.
[3] N. Asproulis, M. Kalweit, D. Drikakis, A hybrid molecular continuum method using point wise coupling, Adv. Eng. Softw. 46 (2012) 85–92.
[4] A. Dupuis, E. Kotsalis, P. Koumoutsakos, Coupling lattice Boltzmann and molecular dynamics models for dense fluids, Phys. Rev. E 75 (046704) (2007) 1–8.


[5] R. Delgado-Buscalioni, P. Coveney, Continuum-particle hybrid coupling for mass, momentum and energy transfers in unsteady flows, Phys. Rev. E 67 (046704) (2003) 1–13.
[6] E. Flekkøy, G. Wagner, J. Feder, Hybrid model for combined particle and continuum dynamics, Europhys. Lett. 52 (3) (2000) 271–276.
[7] S.T. O'Connell, P.A. Thompson, Molecular dynamics–continuum hybrid computations: A tool for studying complex fluid flows, Phys. Rev. E 52 (6) (1995) R5792–R5795.
[8] R. Delgado-Buscalioni, P. Coveney, USHER: An algorithm for particle insertion in dense fluids, J. Chem. Phys. 119 (2) (2003) 978–987.
[9] G. De Fabritiis, R. Delgado-Buscalioni, P. Coveney, Energy controlled insertion of polar molecules in dense fluids, J. Chem. Phys. 121 (24) (2004) 12139–12142.
[10] M. Borg, D. Lockerby, J. Reese, The FADE mass-stat: a technique for inserting or deleting particles in molecular dynamics simulations, J. Chem. Phys. 140 (074110) (2014) 1–13.
[11] P. Neumann, N. Tchipev, A coupling tool for parallel molecular dynamics–continuum simulations, in: Proceedings of the International Symposium on Parallel and Distributed Computing, IEEE, Munich, 2012, pp. 111–118.
[12] R. Delgado-Buscalioni, P. Coveney, G. Riley, R. Ford, Hybrid molecular-continuum fluid models: implementation within a general coupling framework, Phil. Trans. A Math. Phys. Eng. Sci. 363 (1833) (2005) 1975–1985.
[13] B. FrantzDale, S. Plimpton, M. Shephard, Software components for parallel multiscale simulation: an example with LAMMPS, Eng. Comput. 26 (2010) 205–211.
[14] L. Grinberg, J. Insley, D. Fedosov, V. Morozov, M. Papka, E. Karniadakis, Tightly coupled atomistic-continuum simulations of brain blood flow on petaflop supercomputers, Comput. Sci. Eng. 14 (6) (2012) 58–67.
[15] Y.-H. Tang, S. Kudo, X. Bian, Z. Li, G. Karniadakis, Multiscale Universal Interface: A Concurrent Framework for Coupling Heterogeneous Solvers, 2014, arXiv preprint arXiv:1411.1293.
[16] P. Neumann, Hybrid multiscale simulation approaches for micro- and nanoflows (Ph.D. thesis), Technische Universität München, 2013.
[17] P. Neumann, W. Eckhardt, H.-J. Bungartz, A radial distribution function-based open boundary force model for multi-centered molecules, Internat. J. Modern Phys. C 25 (6) (2014) 1450008.
[18] C. Niethammer, S. Becker, M. Bernreuther, M. Buchholz, W. Eckhardt, A. Heinecke, S. Werth, H.-J. Bungartz, C. Glass, H. Hasse, J. Vrabec, M. Horsch, ls1 mardyn: The massively parallel molecular dynamics code for large systems, J. Chem. Theory Comput. 10 (10) (2014) 4455–4464.
[19] W. Eckhardt, A. Heinecke, R. Bader, M. Brehm, N. Hammer, H. Huber, H.-G. Kleinhenz, J. Vrabec, H. Hasse, M. Horsch, M. Bernreuther, C. Glass, C. Niethammer, A. Bode, H.-J. Bungartz, 591 TFLOPS Multi-Trillion Particles Simulation on SuperMUC, in: International Supercomputing Conference (ISC) Proceedings 2013, Springer, Heidelberg, Germany, 2013, pp. 1–12.
[20] S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, J. Comput. Phys. 117 (1995) 1–19.
[21] C. Feichtinger, S. Donath, H. Köstler, J. Götz, U. Rüde, WaLBerla: HPC software design for computational engineering simulations, J. Comput. Sci. 2 (2) (2011) 105–112.
[22] C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, U. Rüde, A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries, in: International Conference for High Performance Computing, Networking, Storage and Analysis (Proceedings of SC13), 2013, pp. 35:1–12.
[23] L. Verlet, Computer 'Experiments' on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules, Phys. Rev. 159 (1) (1967) 98–103.
[24] M. Griebel, S. Knapek, G. Zumbusch, Numerical Simulation in Molecular Dynamics – Numerics, Algorithms, Parallelization, Applications, Springer, Berlin, Heidelberg, New York, 2007.
[25] P. Bhatnagar, E. Gross, M. Krook, A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems, Phys. Rev. 94 (3) (1954) 511–525.
[26] D. Wolf-Gladrow, Lattice-Gas Cellular Automata and Lattice Boltzmann Models – An Introduction, Springer, Berlin, 2000.
[27] D. Stephenson, D. Lockerby, M. Borg, J. Reese, Multiscale simulation of nanofluidic networks of arbitrary complexity, Microfluid Nanofluid (2014) 1–18. http://dx.doi.org/10.1007/s10404-014-1476-x.
[28] G. Karniadakis, A. Beskok, N. Aluru, Microflows and Nanoflows: Fundamentals and Simulation, Springer, New York, 2005.
[29] H. Chen, Volumetric formulation of the lattice Boltzmann method for fluid dynamics: Basic concept, Phys. Rev. E 58 (3) (1998) 3955–3963.
[30] O. Filippova, D. Hänel, Grid refinement for Lattice-BGK models, J. Comput. Phys. 147 (1998) 219–228.
[31] M. Rohde, D. Kandhai, J. Derksen, H. van den Akker, A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes, Internat. J. Numer. Methods Fluids 51 (2006) 439–468.
[32] S. Plimpton, A. Thompson, P. Crozier, LAMMPS Users Manual, Sandia National Labs, 2013.
[33] S. Yang, L. Xiong, Q. Deng, Y. Chen, Concurrent atomistic and continuum simulation of strontium titanate, Acta Materialia 61 (1) (2013) 89–102.
[34] I. Cosden, J. Lukes, A hybrid atomistic–continuum model for fluid flow using LAMMPS and OpenFOAM, Comput. Phys. Comm. 184 (8) (2013) 1958–1965.
[35] H.J. Limbach, A. Arnold, B.A. Mann, C. Holm, ESPResSo – An extensible simulation package for research on soft matter systems, Comput. Phys. Comm. 174 (9) (2006) 704–727.