Petaflop hydrokinetic simulations of complex flows on massive GPU clusters


Computer Physics Communications 184 (2013) 329–341


M. Bernaschi a, M. Bisson a,*, M. Fatica b, S. Melchionna c, S. Succi a

a CNR-IAC, Istituto Applicazioni Calcolo, Consiglio Nazionale delle Ricerche, Rome, Italy
b Nvidia Corp., Santa Clara, CA, USA
c CNR-IPCF, Istituto Processi Chimico-Fisici, Consiglio Nazionale delle Ricerche, Rome, Italy

Article history: Received 14 February 2012; Received in revised form 3 August 2012; Accepted 9 September 2012; Available online 14 September 2012.
Keywords: Multi-GPU; Biofluidics; Hemodynamics

Abstract

We present recent extensions of the MUPHY computational framework for multi-scale simulation of complex bio-fluidic phenomena in real-life geometries. The new framework, which builds on concurrent advances of the computational modeling and parallelization techniques, is able to simulate suspensions with several hundreds of millions of finite-size bodies, interacting with each other and with the surrounding fluid, in geometries of realistic anatomic complexity. Blood flow through the human coronary arteries, at physiological hematocrit values, is simulated with a spatial resolution of 10 micrometers, comparable with the size of red blood cells. The simulation exhibits excellent parallel scalability on a cluster of 4000 M2050 Nvidia GPUs, with an aggregate performance close to 1 Petaflop/s.

1. Introduction

Fluid flow phenomena are ubiquitous in science, engineering, biology and medicine. In particular, blood is the carrier of the biological components which fuel the most basic physiological functions, such as metabolism, immune response and tissue repair. Gaining a deeper insight into blood circulation is therefore paramount to physiology in general, with special emphasis on understanding the onset and development of cardiovascular diseases. However, building a detailed model of hemodynamic flows in realistic geometries represents a formidable scientific challenge. The model must combine the motion of the fluid within the complex geometry of the vasculature, subject to unsteady changes in flow and pressure driven by the heartbeat, coupled to the dynamics of suspended, biologically relevant bodies, such as red blood cells, platelets and lipid molecules. Despite major advances in the simulation of large-scale hemodynamic flows [1–5], the coupling between fluid dynamics and the motion of blood cells and other suspended bodies has been studied only in microscale vessels, such as capillaries and venules, focusing on local aspects of the fluid–particle interaction [6–9], whereas simulations of coupled fluid–particle systems in vessels of realistic shape and size have remained beyond reach up to now. The abstraction from the actual geometry still provides a wealth of useful information on the generic behavior of blood in different conditions, but it fails to capture the history-dependent effects due to the local and global geometry and the non-local correlations




carried by the flow pressure. It is well known that the global geometry plays a significant role in local circulation patterns, most notably in the shear stress distribution at arterial walls. Wall shear stress is a recognized trigger for the complex biomechanical events that eventually lead to atherosclerotic pathologies. Since it is not possible to measure shear stress in vivo, simulations provide a precious and non-invasive tool for the prediction of the onset and the progression of cardiovascular diseases. In this work, we present the first multiscale simulation of cardiovascular flows in human coronary arteries reconstructed from computed tomography angiography, capable of retaining the dynamics of a realistic number of suspended bodies within the human arterial geometry. Spatial resolution extends from 5 cm down to 10 µm, close to the red blood cell (RBC) diameter (∼8 µm). The simulations involve up to a billion fluid nodes, embedded in a bounding space of about three hundred billion voxels, with 100–450 million suspended bodies, as shown in Fig. 1. The simulations are performed with the MUPHY (MUlti PHYsics/multiscale) code, which couples the Lattice Boltzmann method for the fluid flow with a specialized version of Molecular Dynamics for the suspended bodies [10–12]. The simulation achieves a sustained performance close to 700 Teraflops, with a parallel efficiency of more than 90%, on Tsubame 2, a cluster that includes 4000 Tesla M2050 GPUs. Our work presents a number of unique features, both at the level of high-performance computing technology and in terms of physical/computational modeling. The extremely irregular geometries set a very demanding constraint on the workload distribution across a pool of as many as 4000 GPUs. On top of this, additional difficulties come from the Molecular Dynamics component of the multiscale methodology. To the best of our knowledge, the latter issue has never been tackled before in particle-based simulations. Even top-ranking multi-billion-particle Molecular Dynamics simulations are usually performed in idealized geometries, such as cubes or regular boxes [13]. Similarly, multi-billion-node simulations of, say, fluid turbulence are indeed available, but only in the simple geometries mentioned above. Complex and large-scale geometries, such as the one considered here, are rare in the literature [14]. By leveraging large-scale parallel architectures, this work demonstrates the feasibility of cardiovascular simulations of unprecedented size in anatomically realistic geometric conditions.


2. Multiscale bio-fluidics

The hallmark of living matter is the ubiquitous presence of two main components: a generic liquid, given by an aqueous solution (plasma, cytosol, etc.), and suspended bodies (bioparticles, cells, proteins, DNA, etc.) that move, whirl and tumble within the embedding solvent. The dual nature of the fluid–body system, together with the wide diversity of biosolutes, applies to both the mesoscopic and the macroscopic scales of living systems. As compared to the previous versions of MUPHY [10,11], the current simulation framework handles the numerical solution of the motion of a generic fluid and of suspended, finite-size bodies. Within this picture, MUPHY represents a major effort towards the design of a comprehensive computational infrastructure for biofluidics research. A non-exhaustive list of functionalities includes:

• Newtonian versus non-Newtonian rheological response;
• different micro-collisional kernels, from molecular hydrokinetics (Boltzmann kernel) to stochastic frictional dynamics (Fokker–Planck kernel);
• support for the simulation of stochastic fluctuations;
• solutes as disconnected particles or polymer-forming biomolecules;
• suspended bodies with polymorphic hydrodynamic shapes;
• a library of different body–body forces;
• scale-adaptive body–fluid coupling mechanisms;
• handling of irregular confining media and tissues.

Fig. 1. The geometry of the 10 µm resolution test case, derived from a CT scan of human coronary arteries. The first inset shows a detail of the geometry with red blood cells visible. The second inset depicts schematically the coexistence of red blood cells and the underlying Cartesian mesh, together with the exchange of populations (black arrows) due to the streaming phase of the Lattice Boltzmann scheme. For the simulations, the mesh spacing is 10 µm whereas the globule maximal linear size is 8 µm. In practice, the biconcave disk shape of RBCs is approximated by an oblate ellipsoid.

Within this range of options, biofluids are modeled in multiple ways and the multiscale/multiphysics behavior of biological or physiological systems can be studied accordingly. MUPHY leverages two main computational engines, Lattice Boltzmann (LB) and Molecular Dynamics (MD). The former handles the motion of the generic fluid within the hydrokinetic formulation embodied by the LB method [15]. In the hydrokinetic approach, the motion of fluids in the continuum is reproduced by using collision-driven mechanisms (in contrast to the direct macroscopic description of fluid motion). The MD engine handles the motion of Lagrangian bodies by using an approach that shares several technical aspects with Molecular Dynamics. However, the nature of the suspended bodies requires a substantial extension of the basic MD technique, in order to deal with finite-size and non-spherical objects. Finally, the coupling between the fluid and the moving particles takes place via specifically designed kernels, based on kinetic modeling, again significantly distinct from stresslets, boundary-integral and other methods based on macroscopic hydrodynamics. The end result is a fully time-explicit simulation technique, represented in Fig. 2, that offers strategic advantages for the study of biofluids under realistic conditions, with support for:

• geometries with irregular boundaries, such as the blood vasculature;
• non-trivial rheology, as emerging from the underlying particle dynamics;
• scale-specific hydrodynamic interactions at sub-mesh-spacing resolution;
• seamless representation of the fluid–body dividing interface.

The last two points are crucial to boost the performance of the simulations, as well as to guarantee the numerical stability of the dual method, especially with regard to the stiff forces exerted on the fluid by the immersed bodies. The framework employs either a single or a multiple timestep algorithm to handle stiff forces. It should be stressed that the LB method is fairly tolerant towards the presence of rapidly varying forces, and it allows dense suspensions to be simulated, from creeping flow conditions up to turbulent flows with Reynolds numbers above ∼1000. This strategic asset allows a wide range of physical phenomena to be covered in real-world physiological conditions.

The LB method is based on the evolution of the single-particle distribution f_p(x, t), representing the probability of finding at location x of a three-dimensional Cartesian mesh, and at time t, a "fluid particle" traveling with discrete speed c_p. In our implementation, the mesh has spacing Δx and is equipped with 19 speeds c_p (D3Q19) connecting mesh points to first and second neighbors, and with sound speed c_s = Δx/(√3 Δt), Δt being the timestep.
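For concreteness, the sketch below lists a standard D3Q19 velocity set and the corresponding weights (1/3 for the rest population, 1/18 for the six first neighbors, 1/36 for the twelve second neighbors). The ordering and the array names are our own illustration; the paper does not spell out the lattice vectors.

```cuda
/* A standard D3Q19 velocity set and weights, as assumed above: one rest
 * population, 6 speeds towards first (face) neighbors and 12 towards second
 * (edge) neighbors.  The ordering is illustrative, not prescribed by MUPHY. */
__constant__ int   d_cx[19] = { 0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0 };
__constant__ int   d_cy[19] = { 0, 0, 0, 1,-1, 0, 0, 1,-1,-1, 1, 0, 0, 0, 0, 1,-1, 1,-1 };
__constant__ int   d_cz[19] = { 0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1,-1,-1, 1, 1,-1,-1, 1 };
__constant__ float d_w[19]  = { 1.f/3.f,
                                1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f,
                                1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f,
                                1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f };
```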


Fig. 2. Diagram with the coupling between the various MUPHY components, taking care of the dynamical evolution of plasma (Lattice Boltzmann dynamics), the calculation of hydrodynamic variables, the evolution of red blood cells (Molecular Dynamics), the calculation of globule–globule forces and the exchange of hydrodynamic information fluid↔particles. "Pops", "Vars" and "Comm" stand for Lattice Boltzmann populations, variables and communications, respectively.

The fluid populations are advanced in time through the update

f_p(x + c_p Δt, t + Δt) − f_p(x, t) = −(Δt/T) (f_p − f_p^eq)(x, t) + Δf_p(x, t)    (1)

where T is a relaxation time related to the fluid viscosity. As this equation shows, the essence of the LB algorithm is to compute the right-hand side as a completely space–time local quantity and to propagate fluid information non-locally by exchanging populations with the neighboring mesh nodes. The first term on the right-hand side of Eq. (1) contains the collisional relaxation to local equilibria, as given by a second-order expansion of the local equilibrium in the local flow velocity,

f_p^eq = w_p ρ [ 1 + (u · c_p)/c_s² + (uu : (c_p c_p − c_s² I))/(2 c_s⁴) ].

Here, w_p is a set of weights, and ρ = Σ_p f_p and u = Σ_p c_p f_p / ρ are the fluid density and velocity, respectively, whereas the last term represents the effect of the suspended bodies on the fluid,

Δf_p = −w_p Δt [ (G · c_p)/c_s² + ((G · c_p)(u · c_p) − c_s² G · u)/c_s⁴ ],

via the local force/torque G(x, t).
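As an illustration, a minimal sketch of the local BGK collision of Eq. (1) on the GPU is given below, in lattice units (Δx = Δt = 1, c_s² = 1/3); kernel and array names are illustrative rather than the MUPHY API, and both the forcing term Δf_p and the streaming step are omitted. It uses the D3Q19 constant-memory tables d_cx, d_cy, d_cz, d_w of the previous sketch.

```cuda
/* BGK collision of Eq. (1): relax every population of a fluid node towards
 * the local equilibrium f_p^eq.  One thread handles one fluid node and a
 * structure-of-arrays storage is assumed.  Illustrative, not the MUPHY kernel. */
__global__ void collide_bgk(float *f, const int *fluidIdx, int nFluid,
                            int nSites, float dtOverT)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nFluid) return;
    int s = fluidIdx[i];                       /* mesh index of this fluid node */

    float fp[19], rho = 0.f, ux = 0.f, uy = 0.f, uz = 0.f;
    for (int p = 0; p < 19; ++p) {             /* rho = sum_p f_p, u = sum_p c_p f_p / rho */
        fp[p] = f[p * nSites + s];
        rho += fp[p];
        ux  += fp[p] * d_cx[p];
        uy  += fp[p] * d_cy[p];
        uz  += fp[p] * d_cz[p];
    }
    ux /= rho; uy /= rho; uz /= rho;
    float usq = ux*ux + uy*uy + uz*uz;

    for (int p = 0; p < 19; ++p) {
        float cu  = d_cx[p]*ux + d_cy[p]*uy + d_cz[p]*uz;
        /* second-order equilibrium: w rho (1 + 3 u.c + 4.5 (u.c)^2 - 1.5 u^2) */
        float feq = d_w[p] * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*usq);
        f[p * nSites + s] = fp[p] - dtOverT * (fp[p] - feq);
    }
}
```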

2.1. Fluid–particle coupling

The hydrodynamic fluid–body coupling is based on specific roto-translational kernels that allow the representation of either rigid entities, moving in the fluid as impenetrable bodies, or soft bodies that adapt to the flow conditions, such as vesicles, or a combination thereof. In modeling blood at finite hematocrit, an RBC is introduced as a body having mass M, position R_i, velocity V_i, angular velocity Ω_i, and instantaneous orientation given by the matrix

Q_i = ( n̂_x,i  t̂_x,i  ĝ_x,i
        n̂_y,i  t̂_y,i  ĝ_y,i
        n̂_z,i  t̂_z,i  ĝ_z,i )                                   (2)

where n̂_i, t̂_i, ĝ_i are orthogonal unit vectors, such that Q_i^T Q_i = 1.

We approximate the RBC shape as an oblate ellipsoid, via the function

δ̃(x, Q_i) ≡ Π_{α=x,y,z} δ̃_α[(Q_i x)_α]                          (3)

with

δ̃_α(ỹ_α) ≡
  (1/8) [3 − 4|ỹ_α| + √(1 + 8|ỹ_α| − 16 ỹ_α²)]     for |ỹ_α| ≤ 0.5,
  (1/8) [5 − 4|ỹ_α| − √(−7 + 24|ỹ_α| − 16 ỹ_α²)]   for 0.5 < |ỹ_α| ≤ 1,
  0                                                 for |ỹ_α| > 1,                 (4)

and ỹ_α ≡ (Q_i x)_α / ξ_α, with ξ_α a set of three integers, one for each Cartesian component α = x, y, z, representing the ellipsoidal radii in the three principal directions. It is easy to show that for a spherical particle the shape function covers 64 mesh points, and about 32 points for the oblate ellipsoid representing RBCs. However, the effective hydrodynamic radii of the particles are typically smaller than the mesh spacing. As explained in the following, the size of the RBC support is basically given by a minimal resolution of the RBC versus the underlying LB mesh. The small support allows for an efficient parallelization strategy by transferring a small amount of data across the boundaries; with particles larger than the grid spacing, the communicated information would be much more substantial.

The RBC–fluid translational coupling is given by the following kernel

φ = −γ_T δ̃_i (V_i − u)                                            (5)

where γ_T is a translational coupling coefficient and δ̃_i ≡ δ̃(x − R_i, Q_i) is a short-hand notation.

The RBC–fluid rotational coupling is derived by considering the decomposition of the deformation tensor into its elongational and rotational contributions, ∂u = e + ρ, where e = ½(∂u + ∂u^T) is the rate-of-strain tensor and ρ = ½(∂u − ∂u^T) is the vorticity tensor. The rotational component of the deformation tensor generates solid-like tumbling motion, whereas the rotational and elongational components together generate tank-treading motion. Consequently, at the rotational level the RBC senses the fluid vorticity via the coupling

τ_A = −γ_R δ̃_i (Ω_i − ω)                                          (6)

where γ_R is a coupling coefficient. The elongational component of the flow contributes to the orientational torque. Given the stress vector t_σ = σ · n̂, where n̂ is the outward normal to the surface of a macroscopic RBC, the surface normal is replaced by the vector spanning the volume of the diffused particle, i.e. n̂ = ∂δ̃/|∂δ̃|. The associated torque is

τ_S = α δ̃_i t_σ × (x − R_i)                                       (7)

where α is a control parameter. Finally, the elongational torque includes an independent contribution arising from tank treading, due to the quasi-vesicular nature of the RBC. We skip here the analytical details and simply report that the amount of tank-treading motion is controlled by the parameter α [16].

The hydrodynamic force and torque acting on the RBC are obtained via integration over the globule spatial extension. Owing to the nature of the mesh, these are computed as discrete sums,

D_i = Σ_x φ                                                        (8)

and

T_i = Σ_x (τ_A + τ_S).                                             (9)
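A minimal sketch of how the discrete sums (8)–(9) could be accumulated for one RBC is shown below, looping over the 4 × 4 × 4 mesh support of the particle (see Section 3.3.1); only the translational kernel φ of Eq. (5) and the vorticity coupling τ_A of Eq. (6) are included, the elongational torque τ_S being omitted for brevity. All names, and the shape_delta() helper standing for Eqs. (3)–(4), are illustrative assumptions.

```cuda
#include <math.h>

typedef struct { double x, y, z; } vec3;

/* shape function of Eqs. (3)-(4), delta~(x - R_i, Q_i); assumed implemented elsewhere */
extern double shape_delta(vec3 dr, const double Q[3][3]);

/* Accumulate D_i (Eq. 8) and the tau_A part of T_i (Eq. 9) for one RBC. */
void rbc_hydro_sums(vec3 Ri, vec3 Vi, vec3 Wi, const double Q[3][3],
                    vec3 (*fluid_u)(int, int, int),   /* fluid velocity at a mesh node  */
                    vec3 (*fluid_w)(int, int, int),   /* fluid vorticity at a mesh node */
                    double gammaT, double gammaR,
                    vec3 *Di, vec3 *Ti)
{
    Di->x = Di->y = Di->z = 0.0;
    Ti->x = Ti->y = Ti->z = 0.0;
    int ix0 = (int)floor(Ri.x) - 1;
    int iy0 = (int)floor(Ri.y) - 1;
    int iz0 = (int)floor(Ri.z) - 1;
    for (int k = 0; k < 4; ++k)                        /* 4 x 4 x 4 support around R_i */
      for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i) {
            int ix = ix0 + i, iy = iy0 + j, iz = iz0 + k;
            vec3 dr = { ix - Ri.x, iy - Ri.y, iz - Ri.z };
            double d = shape_delta(dr, Q);             /* weight of this mesh node */
            if (d == 0.0) continue;
            vec3 u = fluid_u(ix, iy, iz), w = fluid_w(ix, iy, iz);
            Di->x += -gammaT * d * (Vi.x - u.x);       /* Eq. (5) summed into Eq. (8) */
            Di->y += -gammaT * d * (Vi.y - u.y);
            Di->z += -gammaT * d * (Vi.z - u.z);
            Ti->x += -gammaR * d * (Wi.x - w.x);       /* Eq. (6) summed into Eq. (9) */
            Ti->y += -gammaR * d * (Wi.y - w.y);
            Ti->z += -gammaR * d * (Wi.z - w.z);
        }
}
```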


Forces and torques act on the fluid according to

G = −Σ_i D_i δ̃_i + (1/2) Σ_i T_i × ∂δ̃_i.    (10)

Some algebra shows that the action of the forces D_i and torques T_i is counterbalanced by opposite reactions on the fluid side, so that conservation of linear and angular momentum in the fluid–RBC system is guaranteed. RBCs are carriers of an internal fluid that contributes significantly to the dissipation of energy, a consequence of which is the steep rise of the apparent viscosity with the hematocrit. The presence of the RBC inner fluid is emulated by locally enhancing the viscosity of the fluid comprised within the RBC extension. This is accomplished by considering a local relaxation time, reading

T(x) = T_0 + Δ Σ_i θ̃_i(x)    (11)

where T_0 is the relaxation time corresponding to pure plasma, Δ is the viscosity enhancement factor, and θ̃_i is the ellipsoidal characteristic function. By choosing ν_0 = 1/6 and Δ = 2, the ratio between inner (θ̃_i ≃ 1) and outer (θ̃_i ≃ 0) viscosities corresponds to the physiological value of 5. Besides hydrodynamic interactions, mechanical forces regulate the direct interactions and the packing attitude of suspended bodies. The interactions are modeled via the pair-wise Gay–Berne potential [17], with the pairwise energy u^GB_ij(R_ij, û_i, û_j) being a function of the relative distance R_ij and of the mutual orientation, given by the principal directions û_i and û_j of the i-th and j-th ellipsoids. The Gay–Berne potential considered here is purely repulsive, that is, we exclude the interaction between pairs of RBCs if their mutual distance exceeds an orientation-dependent cutoff. For algorithmic purposes, we consider the maximum cutoff distance of the Gay–Berne potential, corresponding to the side-to-side orientation of a pair of RBCs. It is worth mentioning that mechanical forces depend on the eccentricity of each interacting particle, so that, as for the hydrodynamic coupling, mixtures of particles of different shapes can be handled within a unified framework. As for the RBC–wall interaction, we treat the wall mesh nodes as spherical pseudo-particles; a purely repulsive Gay–Berne interaction between RBCs and wall particles avoids leakage of the RBCs from the vessels. Once the forces and torques arising from both hydrodynamics and direct mechanical forces are computed, the motion of the suspended bodies is propagated in time via a second-order accurate timestepping algorithm [18], suitably modified to handle velocity- and vorticity-dependent forces and torques. In principle, the MD and the LB degrees of freedom do not need to advance in time with synchronous time marching, but one can employ a multiple-timestepping scheme. Let n be the ratio between the two timesteps; the faster of the two processes can be sub-cycled, i.e., it performs n timesteps before exchanging data with the slower one. Although this scheme is available in MUPHY, for the present application we use n = 1, since the mass of a suspended body is comparable to the fluid mass within a lattice cell, producing comparable timescales for RBCs and plasma. In principle, timestep sub-cycling could be needed in the case of turbulent flows. However, in practical simulations of coronary flows, the Reynolds number stays below 500 and sub-cycling is not needed. The model for red blood cells described in this paper was originally introduced and validated in Ref. [16]. In particular, the dynamical response of a single red blood cell in flow conditions has been studied, as well as the collective behavior of RBC suspensions in channel flow. Regarding single-RBC hydrodynamics, the model was shown to reproduce the Rotne–Prager, far-field level of hydrodynamics [16]. The result underscores the basic character of the approach that, in lieu of resolving the detailed flow around

each particle [19], focuses on the far-field hydrodynamics and its consequences on the suspension. In this respect, the effective hydrodynamic radii of the ellipsoid can be smaller than the mesh spacing, whereas the far field is resolved at much larger distances. The overlap between the RBC shape functions can, in principle, introduce additional coupling between RBCs, especially at high hematocrit, but, given the small hydrodynamic radii, such overlap is expected to play a minor role. Additional contributions arising from near-field hydrodynamics, such as lubrication forces, which would require a much finer mesh to be resolved by direct simulation, can be added through suitable analytical or semi-empirical forms at a later stage. Previous data further demonstrated that in proximity of arterial walls RBCs are repelled away from the wall by hydrodynamic lift forces. Such forces produce the so-called Fahraeus–Lindqvist effect, with far-reaching implications in physiology [16]. Here, we report additional validation of the rheology for channel flow (Fig. 3), and compare our data with the expression derived empirically by Pries et al. [20]. The simulations are performed at a shear rate of 50 s⁻¹, a value typical of coronary flows. As Fig. 3 illustrates, the rise of viscosity with both the vessel size and the hematocrit closely follows the experimental curve over a wide range of values, suggesting that at this shear rate the effect of RBC deformation is marginal and the approach captures the basic features of blood rheology. We do not try here to evaluate the validity of the model in conditions pertaining to capillary flows, where the high shear rates involved would substantially deform RBCs, with an impact on the rheology. In essence, the simulations confirm the quality of the model in reproducing blood rheology over the hematocrit range relevant for physiological applications (35%–50%). In clinical applications, a crucial aspect of the hemodynamics of coronary systems regards the role of the finite hematocrit in modulating the shear stress distribution in anatomical vessels. Shear stress at the arterial wall is key to the formation and vulnerability of atherosclerotic plaques in pathological or preventive situations. Given the fact that near the endothelium RBCs and plasma undergo partial skimming, with strong sensitivity to the morphological details of the vessel, it is unlikely that continuum models are able to correctly capture the modulation of the shear stress distribution. The wall shear stress is computed via the deviatoric stress tensor, expressed as

σ = νρ (∂_x u + ∂_x u^T) = −(3ν / (2 c_s² T)) Σ_p c_p c_p (f_p − f_p^eq).

The second invariant of this tensor gives the wall shear stress, S(x_w) = √(½ (σ : σ))(x_w), where x_w is the position of sampling points in

close proximity to the mesh wall nodes. As Fig. 4 illustrates for an anatomically reconstructed bifurcation, the effect of RBCs is to modulate the endothelial shear stress in a rather heterogeneous way. In particular, the modulation takes place in proximity of bifurcations and where vessels have a diameter smaller than ∼0.4 mm. The modulation is tentatively ascribed to the structuring of RBCs (such as rouleaux stacking) induced by the local flow and geometrical conditions. A thorough analysis deserves further investigation and will be the subject of a separate publication. The model for blood at finite hematocrit has been designed for large-scale simulations, by retaining only the essential features of red blood cells and favoring computational performance, without significantly degrading the physical accuracy. We notice that, although an explicit calculation of all hydrodynamic interactions grows as the cube of the system size, the concurrent evolution of fluid and bodies provides a strategic algorithmic advantage. The reason is that the LB method features a computational cost which scales linearly with the number of mesh points M, that is, O(M) = O(N/c), for a fixed solute concentration c = N/M. By employing the link-cell method to compute the direct forces between bodies, the particle dynamics also shows O(N) complexity.


Fig. 3. Relative viscosity ηrel = η/η0 in a tube channel for 15% (circles), 30% (squares), 45% (diamonds) and 60% (triangles) hematocrit versus the channel diameter (left panel) and in a tube channel of diameter 100 µm versus the hematocrit (right panel), as compared to experimental data (dashed lines) [20].

Fig. 4. Panel A: detail of the bifurcating vessel showing the organization of RBCs at hematocrit of 50% and average shear rate of 80 s−1 . Panels B1, B2: time-averages of shear stress for pure plasma (B1) and hematocrit level of 50% (B2).

Thanks to the fact that the solvent-mediated particle–particle interactions are localized and explicit, the LB–MD coupling scales linearly with the number of mesh cells and suspended bodies. However, Eqs. (8)–(9) represent the most time-consuming component of the methodology, mostly on account of the large number of mesh points (O(100)) enclosed in the particle support, which require scattered memory accesses. Since the particle shapes can, in principle, overlap, the calculation of Eqs. (8)–(9) must be performed directly, without deconvolving the summations and/or using suitable transforms. In addition, due to the irregular distribution of solutes, the memory access to both particle quantities and fluid properties, such as populations and hydrodynamic variables, is highly scattered. Nevertheless, we have been able to implement the algorithm efficiently on GPUs, as detailed shortly.

2.2. Initial and boundary conditions

Fluid boundary conditions are set up as follows. Pressure boundary conditions are imposed on the plasma at the inlet and at several outlets, via the Zou–He method [21], further revised according to a finite-difference scheme to sustain large pressure gradients [22]. Finally, at the outlets, self-regulated resistive conditions are imposed, determined as follows. Initially, the pressures at the different outlets of the healthy system are determined by reproducing the pressure drop vs. local flow velocity reported in [23]. Once the outlet pressures were obtained,

each outlet is further equipped with a resistive element in series, to account for the myocardial vascular bed. Such resistive elements allow us to obtain the large pressure drops induced by arteries with stenoses over a wide range of severities, while minimizing finite-size effects arising from the incomplete coronary representation. During the simulation, the outlet pressures are adjusted according to the equation p_out = p_v + R_2 Q, where Q is the instantaneous flow rate at the outlet at the given severity, R_2 is the resistance of the myocardium and p_v the venous pressure. The latter two parameters are chosen as R_2 = 10 × R_1, where R_1 is the resistance of the healthy coronary segment ending at the outlet under consideration, and p_v = p_a/10. At rigid walls, a standard mid-way bounce-back rule is applied to impose no-slip flow conditions. The fluid flow is initialized with zero speed and constant density across the entire domain. Particles are seeded at random positions and orientations, with null linear and angular velocity. In flow conditions, red blood cells that exit from the outlet ports are re-injected at the inlet port in order to maintain a constant total hematocrit. The injected RBCs have a velocity given by the imposed inlet velocity, random orientation and zero angular velocity.
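A minimal sketch of the self-regulated outlet condition described above is given below; the variable names are illustrative, and Q is the instantaneous flow rate measured at the outlet during the run.

```cuda
/* Self-regulated outlet pressure: p_out = p_v + R2 * Q, with R2 = 10 * R1
 * and p_v = p_a / 10, as described in the text.  Illustrative names. */
double outlet_pressure(double pa,   /* aortic (inlet) pressure                 */
                       double R1,   /* resistance of the healthy segment       */
                       double Q)    /* instantaneous flow rate at this outlet  */
{
    double pv = 0.1 * pa;           /* venous pressure                         */
    double R2 = 10.0 * R1;          /* myocardial vascular-bed resistance      */
    return pv + R2 * Q;
}
```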


Fig. 5. Communication/Computation overlap by using the CPU as a MPI coprocessor of the GPU. Orange blocks are executed on the GPU whereas yellow blocks are executed on the CPU. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3. MUPHY on massive GPU clusters

A multi-GPU code resorts to parallelism at two levels: intra- and inter-GPU. Unlike the software developed for clusters of traditional multi-core systems, which can be implemented with either a hybrid (e.g., OpenMP + MPI) or a pure distributed-memory paradigm (counting on the availability of highly efficient MPI libraries for shared-memory systems), for a multi-GPU platform a hybrid paradigm is the only choice. This is not the only difference that needs to be taken into account: on a traditional multi-core platform the parallelism is limited to a few threads, at most tens on high-end systems. On a GPU, hundreds of threads are required to keep the hardware busy, so that a much finer-grain parallelization is required. Finally, although the combination of the latest-generation Nvidia GPUs and CUDA drivers offers the chance to copy data from/to the global (i.e., main) memory of a GPU to/from the global memory of another GPU sitting on the same system, GPUs cannot, in general, exchange data with each other without using the CPU. In other words, when GPU s needs to send data to GPU t, the following steps are required: (i) GPU s copies data to the memory of CPU s; (ii) CPU s sends the data, through MPI primitives, to CPU t; (iii) CPU t copies the data from its memory to the memory of its GPU. CPUs s and t can be, of course, two cores of the same system. In principle, the passage through the CPU introduces an overhead in the communication between the GPUs. However, by using the CUDA concepts of stream and asynchronous memory copy, it is possible to overlap data transfers between GPU and CPU memory with the execution of kernels (functions, in the CUDA jargon) on the GPU. Moreover, the execution of functions running on the CPU (like MPI primitives) may also be concurrent with the execution of kernels on the GPU. As a result, with an organization of the communication/computation pattern like that shown in Fig. 5, the CPU can be treated as an MPI co-processor of the GPU. The details of the parallelization of MUPHY on moderate and massive clusters of GPUs have been described elsewhere; here we shall only summarize the main points, namely (1) the decomposition strategy, (2) the communication pattern and (3) load balancing.
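A minimal sketch of the overlap pattern of Fig. 5 is given below, with the CPU acting as an MPI co-processor of the GPU; kernel, buffer and function names are our own illustration rather than the MUPHY code, and the host buffers are assumed to be page-locked (cudaHostAlloc) so that the asynchronous copies can proceed.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

/* frontier update also packs the halo data into d_halo; bulk updates the interior */
extern __global__ void update_frontier(float *f, char *d_halo);
extern __global__ void update_bulk(float *f);

void lb_step(float *d_f, char *d_halo, char *h_send, char *h_recv,
             size_t halo_bytes, int nneigh, const int *neigh,
             MPI_Request *req, cudaStream_t s_comm, cudaStream_t s_comp)
{
    /* 1. frontier nodes first, on the communication stream */
    update_frontier<<<256, 128, 0, s_comm>>>(d_f, d_halo);
    cudaMemcpyAsync(h_send, d_halo, nneigh * halo_bytes,
                    cudaMemcpyDeviceToHost, s_comm);

    /* 2. the bulk update runs concurrently on a second stream */
    update_bulk<<<4096, 128, 0, s_comp>>>(d_f);

    /* 3. the CPU exchanges halos while the GPU works on the bulk:
          receives are posted in advance, then blocking sends are issued */
    for (int n = 0; n < nneigh; ++n)
        MPI_Irecv(h_recv + n * halo_bytes, (int)halo_bytes, MPI_BYTE,
                  neigh[n], 0, MPI_COMM_WORLD, &req[n]);
    cudaStreamSynchronize(s_comm);              /* send buffers are on the host */
    for (int n = 0; n < nneigh; ++n)
        MPI_Send(h_send + n * halo_bytes, (int)halo_bytes, MPI_BYTE,
                 neigh[n], 0, MPI_COMM_WORLD);
    MPI_Waitall(nneigh, req, MPI_STATUSES_IGNORE);

    /* 4. push the received halos back to the GPU and complete the step */
    cudaMemcpyAsync(d_halo, h_recv, nneigh * halo_bytes,
                    cudaMemcpyHostToDevice, s_comm);
    cudaStreamSynchronize(s_comp);
    cudaStreamSynchronize(s_comm);
}
```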

3.1. Domain decomposition

From its inception, MUPHY has been designed to deal with real-life geometries. The implication is that a simple domain decomposition based on regular partitions cannot be used. The geometry in our simulations is highly irregular, as shown in Fig. 1, and partitioning it into subdomains handled by the available computing resources represents a major challenge in itself. In a previous scheme to solve this issue [24], we used the third-party software PT-SCOTCH, the parallel version of the SCOTCH graph/mesh partitioning tool [25]. The graph-based procedure resorts to a graph bisection algorithm and is completely unaware of the geometry of the computational domain. The lack of geometrical information was later found to degrade the quality of the partitioning as the number of partitions increases, resulting in subdomains with highly irregular shapes and with large contact areas between subdomains that introduce significant communication overheads.

An optimal solution was found by combining the graph-based partitioning with a flooding-based approach (also known as the graph-growing method). The mesh is first partitioned with SCOTCH into a given number of subdomains, typically 256. If the mesh needs to be partitioned into a larger number of parts, say 256 · P, with P an integer ≥ 2, then each of the 256 SCOTCH-based domains is further subdivided by using a flooding scheme. Starting from a seed mesh point, a region is iteratively grown in an isotropic way, until the number of mesh points equals N_i/P (N_i being the number of mesh points in the i-th partition generated with SCOTCH). As the condition is met, the visited mesh points are assigned to a computational resource. Subsequently, a new growth procedure starts from a new seed, until all points in the subdomain are assigned to a computational resource. The procedure is such that each single subdomain is never split into disjoint subparts.

Fig. 6 reports the partitioning obtained in this way, showing the distribution of tasks versus the number of neighbor domains with which they exchange data. It is apparent that, for a large number of tasks, the distribution tends to stabilize around an average value of ∼7, instead of increasing up to much higher values (∼15), as would happen with a pure graph-based partitioning approach. By lowering the number of distinct tasks with which (on average) each task needs to exchange data, there is a significant reduction of the communication overhead. The distribution indicates that, despite the highly complex geometry, the final workload partitioning ends up relatively close to the initial topological input, which we consider a heuristic quality factor of the workload-balance algorithm. Visual inspection of the computational domains shows that this is realized through a fairly varied morphology of the domains, which often take counterintuitive shapes in the vicinity of geometrical complexities. This "morphological richness", which stands in marked contrast with elementary partitionings in idealized geometries (cubes, slabs and similar), conveys an intuitive flavor of the complexity of the partitioning task. As a matter of fact, the domain decomposition is, by itself, a challenge: using 128 cores of high-end CPUs, it takes up to two hours of elapsed time for a mesh having one billion nodes.
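A minimal sketch of the graph-growing refinement is given below: starting from a seed, a region is grown breadth-first over the mesh points of one SCOTCH subdomain until it reaches the target size N_i/P, and the procedure restarts from a new seed. The data structures, the seed-selection rule and the neighbors() helper are illustrative assumptions.

```cuda
#include <stdlib.h>

/* fills nb[] with the indices of the face neighbors of node v that belong to
 * the same SCOTCH subdomain, returning how many there are; assumed provided
 * by the mesh data structure */
extern int neighbors(int v, int nb[6]);

/* Assign the n_nodes mesh points of one SCOTCH subdomain to tasks of
 * (at most) `target` points each, growing regions isotropically from seeds. */
void flood_partition(int n_nodes, int target, int *part /* out: task id per node */)
{
    for (int v = 0; v < n_nodes; ++v) part[v] = -1;
    int *queue = (int *)malloc(n_nodes * sizeof(int));
    int task = 0, assigned = 0;

    while (assigned < n_nodes) {
        int seed = 0;
        while (part[seed] != -1) ++seed;           /* first unassigned point as seed */
        int head = 0, tail = 0, grown = 0;
        queue[tail++] = seed; part[seed] = task; ++grown; ++assigned;

        while (head < tail && grown < target) {    /* breadth-first (isotropic) growth */
            int nb[6], v = queue[head++];
            int k = neighbors(v, nb);
            for (int i = 0; i < k && grown < target; ++i)
                if (part[nb[i]] == -1) {
                    part[nb[i]] = task;
                    queue[tail++] = nb[i];
                    ++grown; ++assigned;
                }
        }
        ++task;                                    /* next region goes to a new task */
    }
    free(queue);
}
```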
3.2. Design of the communication pattern

In MUPHY, the communication pattern is set up at run-time. Each task determines the neighboring tasks owning mesh points that need to be accessed during the simulation, for the non-local streaming of populations in the LB algorithm, for the migration of particles and for the calculation of inter-domain forces in the particle dynamics. During this pre-processing step, we employ mainly MPI collective communication primitives, whereas, in the remainder of the execution, most communications are point-to-point and make use of the following scheme: the receive operations are always posted in advance by using the corresponding non-blocking MPI primitives, then the send operations are carried out, and finally each task waits for the completion of its receive operations by using the MPI wait primitives. For the Lattice Boltzmann component, only the evaluation of global quantities (e.g., the momentum along the x, y, z directions) requires the use of MPI collective (reduction) primitives. Also for the MD component most of the communications are point-to-point but, here, the irregularity of the geometry and, as a consequence, of the corresponding domain decomposition raises additional issues, as detailed in the next subsection.


Fig. 6. Distribution of the tasks with respect to the number of neighbors with which they are required to communicate, for different numbers of partitions.

3.3. Parallel molecular dynamics

We resort to a domain decomposition strategy in which the MD parallel domains and the LB parallel domains coincide. In this way, each computational task performs both the LB and MD calculations and the interactions of the particles with the fluid are quasi-local. The underlying LB mesh serves the purpose of identifying the particles that belong to a domain via a membership test: a particle with position R belongs to the domain if the vector of nearest integers [round(R_x), round(R_y), round(R_z)] coincides with a mesh point of the domain. Since the bodies are expected to populate the domains evenly, a reasonably good load balancing of the MD part is granted by the mesh partitioning.

We have developed a novel parallelization strategy for Molecular Dynamics suitable for domains with irregular geometry. A major issue regards the identification of particles that reside next to the subdomain frontiers and that interact with intra- and inter-domain particles. Our solution [26] relies on the notion of cells, cubes with side greater than or equal to the interaction cutoff (that is, the maximum cutoff distance of the Gay–Berne potential) that tile the whole irregular domain handled by a computational task. This representation allows the processors (i) to perform an efficient search of both inter-domain and intra-domain pairs of particles and (ii) to reduce data transfers by exchanging a limited superset of both the particles actually involved in the evaluation of forces among inter-domain pairs and the particles moving across domains. The cells are grouped into three sets (internal, frontier and external cells) that verify the following properties:

(1) Every point of the domain is within either an internal or a frontier cell;
(2) Internal cells contain only points of the domain at distance greater than the cutoff distance from the domain boundary;
(3) Frontier cells contain all the points of the domain at distance less than or equal to the cutoff distance from the domain boundary;

(4) External cells contain only points outside the domain and, together with the frontier cells, cover all of the external region at distance less than or equal to the cutoff distance from the domain boundary.

Moreover, by using the information about the subdomain interconnections obtained in the pre-processing phase, for each external and frontier cell a list of the neighboring processors that handle domains with points at interacting distance from the cell is defined. This information allows selective particle transfers among domains to be performed efficiently, irrespective of the degree of geometrical regularity. Any particle is sent only to the neighboring processors with which there can be interactions (note that the same particle can interact with particles belonging to more than one domain). Fig. 7 illustrates an example of such a decomposition applied to a simplified two-dimensional geometry.

The decomposition into cells handles the irregular domains in the following way. At the beginning of each iteration, each task identifies and exchanges with the neighboring tasks the particles located inside its domain that could interact with the outer region. Property 3 guarantees that all particles involved in inter-domain interactions are contained inside frontier cells, and those particles are sent, so that only a limited superset of the particles that could interact with the outer region is transferred. Moreover, data are communicated only to tasks in close proximity to the frontier particles. This mechanism dramatically reduces the number of duplications within the set of transmitted particles with respect to a multicast approach in which all particles are sent to all neighbors. Replication takes place only for frontier cells around the zones where more than two domains are in touch; particles located in these cells are sent to more than one processor. On the receiving side, only the particles that could interact with the inner region are considered. Given property 4, the received particles to be retained are filtered depending on the cells they are positioned into: all particles located inside either external or frontier cells are kept and the others are discarded.


Fig. 7. Tiling of a 2D domain in external cells (red), frontier cells (yellow) and internal cells (green). The dashed line represents the region within a cutoff distance from the domain (solid line). The domain frontier has a staircase shape, but in this figure it is shown as a smooth curve for simplicity. The colored contours around frontier and external cells represent the neighboring-processor lists: the red line encloses the cells at interacting distance only with processor 1, the purple line those at interacting distance with both processors 1 and 2, and the blue one encloses the cells at interacting distance only with processor 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

After the exchange of the particles involved in inter-domain pairs, forces are computed and particle positions are updated. The decomposition into cells allows these operations to be performed in conjunction with a link-cell algorithm. Given property 1, the update is done by scanning all internal and frontier cells and, for each particle in each cell, searching for all interacting particles inside the neighboring cells. For the management of particle migration among tasks, all particles are binned into the cells they moved into, and those that leave the domain are exchanged with the neighboring processors. Departing particles are found by selecting those that moved to external cells (property 4) and a subset of those that moved to frontier cells. Particles in frontier cells moving to neighboring domains are identified by the membership test described above. Once the list of departing particles is built, a selective transfer to neighboring domains is done by using the processor ownership associated with the cells. In this way, processors send exactly the particles that leave their domains, only to neighbors proximal to the updated positions. On the receiving side, incoming particles are selected by using the membership test. We designed the membership test and the cell tiling so that every task performs exactly the same computations to decide whether to accept frontier and migrated particles. By doing so, we avoid error conditions due to possible differences in floating-point rounding that could arise if two or more tasks involved in the same decision executed the operations in a different order (or using different constants) for any reason. This ensures, for instance, that when a departing particle is sent to more than one task, one and only one task will be in charge of it.

Since such a decomposition strategy does not allow a global balancing of the MD computational load, we enhanced it with a local dynamic load-balancing algorithm, as outlined in the following. At first, during the pre-processing step, a subset of particles is assigned to each computational task according to the initial positions of the particles. As the simulation proceeds, particles migrate from one domain to another and particle coordinates, momenta and identities are re-allocated among tasks. The dynamic load-balancing strategy requires a communication operation between neighboring tasks that tracks the number of particles assigned to the neighbors. Whenever the number of interacting pairs owned by a given task exceeds the number of pairs assigned to a neighbor by a threshold, a fraction of the excess pairs is sent to that neighbor. In this way, a local load balancing for the computation of forces is achieved. A set of precedence rules prevents situations in which a task sends pairs to a neighbor and receives pairs from another. The receiving task computes the corresponding forces and sends them back to the task that actually owns the particles.
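The membership test used above to route frontier and migrating particles admits a very small sketch; is_local_node() stands for the lookup into the local portion of the LB mesh and is an illustrative assumption.

```cuda
#include <math.h>

/* 1 if the mesh point (ix,iy,iz) belongs to the domain of this task */
extern int is_local_node(int ix, int iy, int iz);

/* A particle at R belongs to this task if its nearest mesh point does.
 * Every task applies exactly the same rounding, so ownership decisions are
 * bit-wise identical across tasks and each particle has one and only one owner. */
int particle_is_mine(double Rx, double Ry, double Rz)
{
    return is_local_node((int)llround(Rx), (int)llround(Ry), (int)llround(Rz));
}
```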

3.3.1. Fluid–particle coupling

The final component of our multiscale application deals with the fluid–particle coupling. Each suspended particle experiences hydrodynamic forces and torques arising from the fluid macroscopic velocity and vorticity, smeared over a domain made of 4 × 4 × 4 mesh points. Analogously, mesh points experience a momentum transfer arising from the surrounding particles. These non-local operations require multiple communication steps, such that each processor owning a given particle exchanges hydrodynamic quantities with the surrounding domains. Particle–fluid coupling is a non-local operation involving the interaction between particles and mesh nodes inside the frontier cells of neighboring domains. For this reason, the coupling is performed by exploiting the cell tiling, much the same way the computation of inter-domain forces is performed. The cell edge is set to the maximum of the cutoff distance and the particle size. In order to compute the forces/torques exerted by the fluid on the particles, each processor exchanges the particles inside the frontier cells with its neighbors. Finally, the forces and torques associated with the external particles are exchanged back with the neighboring processors and, on the receiving side, the external contributions are distributed to the frontier particles. This approach is particularly efficient as compared to other strategies. For example, by exchanging the mesh points located inside frontier cells in place of particles, one could lower the number of data exchanges to a single communication step. However, such an approach would impose a large communication overhead, since all mesh nodes near the interfaces among domains would have to be exchanged. Finally, the computation of the momentum transfer exerted from the particles to the fluid is carried out by exchanging the frontier particles as in the particle-on-fluid case, but without a second data exchange.

4. Results

In this work we present results obtained by using the TSUBAME2 supercomputer of the Tokyo Institute of Technology, operated by the Global Scientific Information and Computing Center (GSIC). TSUBAME2 consists of more than 1400 compute nodes, interconnected by an Infiniband network. Each of the 1408 thin compute nodes features two Intel Xeon 2.93 GHz processors (6 cores each), 54 GB of memory and two QDR Infiniband adapters, and is equipped with three NVIDIA Tesla M2050 GPUs, each with 448 cores and 3 GB of high-bandwidth GDDR5 memory (so the total number of GPU cores is ∼1.9 million). The low-latency, full-bisection-bandwidth Infiniband network provides each node with 10 GB/s of inter-node bandwidth. The peak CPU memory bandwidth is 32 GB/s per CPU and 64 GB/s per node. The GPU memory features 150 GB/s bandwidth. GPUs are connected with dedicated x16 Gen2 PCI Express slots, providing a theoretical bandwidth to/from the cards of 8 GB/s. In our benchmarks, we measured the total runtime of the simulation, together with the breakdown between computation and communication. All simulations include about 1 billion lattice sites for the fluid, within a bounding box totaling almost 300 billion nodes, and 450 million RBCs. We chose the fluid–body coupling parameters such that a single RBC carries a hydrodynamic shape with linear sizes of 4 µm and 8 µm for the smallest and largest principal directions of the globule, respectively, corresponding to a hematocrit level of 58%. Further details are described in [27].


Fig. 8. In the MD component, GPU threads are mapped onto particle data rather than onto cell data. While this approach hampers thread cooperation (for example during the computation of forces), it has the benefit of requiring an amount of computational resources proportional only to the number of particles.

Most computations are performed in single floating-point precision, with only a few reduction operations carried out in double precision. We used the CUDA C compiler version 3.2, the Intel Fortran compiler version 11.1 and OpenMPI version 1.4.2. In the LB phase, each GPU thread is responsible for the update of a number of mesh nodes that depends on the total number of available GPUs (the thread configuration is fixed on each GPU). For instance, with 512 GPUs, each thread is in charge of 8 mesh nodes. For the MD phase, interactions among the particles are processed on a per-particle basis. In this case, the grid of threads is directly mapped onto the arrays of particles, as shown in Fig. 8. Threads are assigned to particles according to a global id and the search for interacting pairs proceeds for each thread in an independent fashion. Each thread scans the neighboring cells and, for each interacting pair, computes the contribution to the total force. The obtained results provide an important check of the reliability of the code, up to physiological levels of hematocrit.

4.1. Parallel performance: strong scaling data

Strong-scaling analysis is performed by increasing the number of processors at a fixed problem size, in order to analyze the impact of the number of computing resources on the total simulation time. In Fig. 9, we report the elapsed time per simulation step, as well as the breakdown for the LB and MD components separately. The performance of the Lattice Boltzmann component of MUPHY on a single GPU is in line with other, highly tuned, CUDA LB kernels [28]. The elapsed time decreases significantly with the number of cores, with a speed-up of 12.5 between the 256 and 4000 GPU configurations, corresponding to a global parallel efficiency close to 80%. The calculation of the MD direct forces displays an efficiency as high as 95%. These results show that the combination of asynchronous communication and overlap between communication and computation significantly alleviates the lack of support for direct data exchange among GPUs (see Table 1).

The total parallel efficiency is shown in Fig. 9, and is referenced to either 256 GPUs, if the available memory allows the required hematocrit level to be simulated, or 512 GPUs for higher levels of hematocrit. Interestingly, due to the highly non-linear domain partitioning, as the size of the domains diminishes some inherent optimizations take place regarding the MD–LB coupling and the MD direct-force calculations. As a result, the efficiency features superlinear scaling for up to 1024 GPUs, whereas at the largest number of GPUs available the efficiency slightly lowers to ≃80%.
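Returning to the per-particle MD mapping of Fig. 8 described above, a minimal sketch is shown below: each thread owns one particle, scans the 27 neighboring link-cells and accumulates pair contributions. Names and the generic pair_force() (standing in for the repulsive Gay–Berne force, which in MUPHY also depends on the orientations) are illustrative, not the actual MUPHY kernels.

```cuda
/* one-thread-per-particle force kernel over a link-cell grid of
 * ncx x ncy x ncz cells with edge cellSide; illustrative sketch */
__device__ float3 pair_force(float3 ri, float3 rj);   /* assumed defined elsewhere */

__global__ void md_forces(const float3 *pos, float3 *force, int nPart,
                          const int *cellStart, const int *cellCount,
                          const int *cellPart,
                          int ncx, int ncy, int ncz, float cellSide)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    /* global particle id */
    if (i >= nPart) return;
    float3 ri = pos[i];
    float3 fi = make_float3(0.f, 0.f, 0.f);
    int cx = (int)(ri.x / cellSide), cy = (int)(ri.y / cellSide), cz = (int)(ri.z / cellSide);

    for (int dz = -1; dz <= 1; ++dz)                  /* scan the 27 neighboring cells */
      for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy, nz = cz + dz;
            if (nx < 0 || ny < 0 || nz < 0 || nx >= ncx || ny >= ncy || nz >= ncz) continue;
            int c = (nz * ncy + ny) * ncx + nx;
            for (int k = 0; k < cellCount[c]; ++k) {  /* particles binned in cell c */
                int j = cellPart[cellStart[c] + k];
                if (j == i) continue;
                float3 fij = pair_force(ri, pos[j]);
                fi.x += fij.x; fi.y += fij.y; fi.z += fij.z;
            }
        }
    force[i] = fi;
}
```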

Fig. 9. Upper panel: elapsed time per timestep for the full simulation, the major MD components (direct forces including link-cell and Gay–Berne interactions), the MD–LB coupling and the LB component, versus the number of cores, for the system composed of 1 billion mesh nodes and 450 million RBCs. Lower panel: parallel efficiency obtained for 1 billion mesh nodes and 110, 220 and 450 million RBCs.

Table 1. Total elapsed time and work share of the major components for the 1 billion mesh nodes/450 million RBCs test case.

GPUs    Total time (s)    MD–LB coupling (%)    MD (%)    LB (%)
256     3.3902            69.73                 25.48     1.68
1024    1.6268            72.09                 25.00     2.08
1200    1.3812            70.58                 25.20     2.24
2048    0.7446            63.04                 21.56     2.88
4000    0.3440            62.29                 15.10     4.04

The excellent scalability should be ascribed to the optimal load balancing obtained with our hybrid graph-partitioning/flooding scheme for the multi-branched arteries. The load balance is further confirmed by Fig. 11: although the distribution of the number of RBCs over a test case with 1200 domains is broad, the distribution of execution times for both the aggregate MD component and the LB component is compact. The result is a direct consequence of the minimal extension of the contact regions among domains, which optimizes the share between computations and communications.

4.1.1. MPI breakdown of the communication overhead

To further analyze the parallel performance, we inspected the breakdown of the communication time across the pool of GPUs.


Fig. 10. Distribution of the percentage of total time spent in MPI for 4000 tasks.

Table 2. Performance of MPI routines relative to 4000 tasks for the 1 billion mesh nodes/300 million RBCs test case.

Routine      Calls         Avg bytes    Avg time (µs)    Bandwidth
Send         38,012,346    43,600       76               570 MB/s
Irecv        38,467,062    45,000       67               670 MB/s
Waitall      7,635,000     N/A          750              N/A
Allreduce    4,000         16           675              N/A

Details of the communication performance are obtained by using the mpiP profiler [29]. Fig. 10 shows the distribution of the time spent in communication for a run with 4000 GPUs. It is interesting to note that most of the tasks (∼3000 out of 4000) spend about 50% of the total time in communication. This result confirms that the asynchronous communication scheme in use works at a sustained rate, and that the CPU actually plays the role of an MPI co-processor of the GPU, where most of the computations take place. Table 2 reports the first four entries of the MPI communication summary, where the Send/Irecv/Waitall calls represent by far the most time-consuming communication routines. The average bandwidth for the (blocking) MPI_Send corresponds to roughly 600 MB/s and is satisfactory in view of the highly non-trivial communication pattern.

4.2. Flops count and profiling

Nvidia GPUs are not equipped with counters to measure the number of floating-point operations, although the CUDA profiler allows the collection of useful information about kernel executions, such as the number of shared-memory bank conflicts or the actual register-file usage. Therefore, in order to measure the performance according to the floating-point operations per second metric, we rely on (i) counting the number of floating-point operations (Flops) by means of indirect methods, (ii) cross-checking the resulting numbers to assess their reliability, and (iii) estimating the number of Flops/s by using the high-resolution timers provided by the GPU profiler. The first method of counting the operations is a simple estimate based on the knowledge of the LB update and MD algorithms used in MUPHY. The second is a direct measure of the number of floating-point operations obtained with the CPU version of MUPHY by using both the PAPI [30] and the oprofile [31] tools. The two methods provided


Fig. 11. Load balance for the 1200 domains case: number of RBCs (upper panel) and distribution of elapsed times for the aggregate MD component (middle panel) and the LB component (lower panel). All frequencies are in arbitrary units. The system is composed of 1 billion mesh nodes and 300 million RBCs.

Table 3. Time per step (in seconds) for a test case with about 1 million mesh nodes and 300,000 RBCs.

          CPU       GPU
LB        0.434     0.016
MD        8.235     0.391
LB + MD   48.794    0.948

results whose difference is between 10% and 15%. For our estimates we used the smaller of the two numbers (that is, not surprisingly, the number provided by the direct measure, since the optimizer can eliminate some operations during the common sub-expression reduction phase). For 1334 nodes of TSUBAME2 (4000 MPI tasks), we estimate a (weighted) average performance of slightly less than 600 TeraFlops, distributed as shown in Fig. 12. In view of the intrinsic Flop limitations of the Lattice Boltzmann algorithm and taking into account the coupling between the LB and MD components, this appears to be a fairly satisfactory overall performance. The performance of the LB component is more than a factor of two better than that reported in Ref. [32], where previous-generation GPUs were used. Just to convey the flavor of the practical impact of this application, the above performance corresponds to simulating a complete heartbeat at microsecond resolution, fully inclusive of red blood cells, in 48 h of elapsed time on the full TSUBAME2 system.

4.3. GPU vs. CPU

We carried out a small-scale test to compare the performance of the GPU and CPU implementations of MUPHY. The test has a mesh with 1,264,239 nodes and 300,000 particles. The GPU is a single Fermi C2050, whereas the CPU is a Xeon E5645 running at 2.4 GHz. The results are reported in Table 3; they correspond to speed-ups of roughly 27× for LB, 21× for MD and 51× for the coupled LB + MD step. We highlight that, although we use a single core of the CPU, the code has been highly tuned for both the GPU and the CPU.


Fig. 12. Breakdown of the MUPHY components (top panel) and relative performances (bottom panel).

5. Clinical application: flow sensitivity to stenotic plaque formation

As an example of clinical application, we present here simulations of a cardiovascular pathology that appears in the form of stenotic plaque. We considered a healthy (stenosis-free) coronary system and inserted a number of stenotic plaques in regions of low shear stress (see Fig. 13A), where the growth of plaque is credited for being more likely to occur. As shown in Figs. 13B and C, a stenosis is a local constriction of the vessel wall, causing strong pressure drops over the circulatory system and therefore insufficient oxygenation of the myocardial tissues. We considered different pathologies, with and without red blood cells, in order to assess the role of blood granularity on the hemodynamics. Once the pressure profile is measured along the vessel extension, cardiovascular risk is typically quantified by the so-called Fractional Flow Reserve (FFR), defined as p_d/p_a, where p_d is the pressure distal to the stenosis and p_a is the aortic pressure, the pressure value at the inlet. According to medical tenets, surgical intervention is recommended whenever FFR < 0.75, since this entails a high probability of plaque rupture. We considered ten different coronary systems, one in healthy conditions and nine with diseased arteries, arising from either single or multiple stenoses, at 45% hematocrit. By employing 1200 GPUs, each run completed 10,000 steps in 1750 s, starting from pre-conditioned populations and ensuring good convergence to steady flow conditions both with and without RBCs. From this observation, it follows that the equilibration time of the coronary flow in the presence of RBCs is comparable to the time to convergence of the Newtonian flow pattern. This is probably due to the rather homogeneous distribution of the suspension under flow conditions, and to the fact that a global re-organization of the RBCs over long distances does not take place. In addition, we remark that steady flow conditions are a first approximation to pulsatile flow conditions, and non-linear effects could emerge under unsteady conditions. In the future it would be important to study the time-dependent FFR with the methodology presented here. The effect of stenosis on the statistically significant sample is shown in Fig. 13D, reporting FFR as a function of the stenosis severity, defined as (d_n − d_s)/d_n × 100%, where d_n and d_s are the vessel diameters in the healthy and diseased artery, respectively. As Fig. 13D

As Fig. 13D demonstrates, the idea of using a sharp threshold when deciding on surgical intervention can be risky, since it does not take into account the strong dispersion of the data across the range of local Reynolds numbers. A more detailed analysis of the results will be the subject of a forthcoming publication [33]. In particular, the role of finite hematocrit on the FFR profiles will be presented. We anticipate here that RBCs enhance the pressure drop in stenotic vessels due to enhanced viscous loss. Moreover, Fig. 13 strongly suggests that a more comprehensive criterion for surgical intervention should take into account the local values of the Reynolds number experienced by the given stenosis. To the best of our knowledge, this figure represents the first instance of a quantitative assessment relating FFR to plaque severity by simulating a rather complete portion of the coronary system.

6. Conclusions

We have presented the main features of the updated version of the parallel multiscale code MUPHY, and demonstrated its capabilities through a large-scale simulation of the entire heart-sized coronary system. The simulations were carried out on a world-class cluster of GPUs, enabled by the development of the MUPHY software, a general tool for computational research in multiscale biofluidics. The benchmark system consists of a realistic representation of the complex human arterial geometry at the spatial resolution of red blood cells: from centimeters all the way down to microns in a single multiscale simulation. The computational environment involves one billion fluid nodes, embedded in a bounding space of one trillion voxels and coupled with the concurrent motion of hundreds of millions of red blood cells.

We have achieved a performance close to 600 Teraflops on the 4000-GPU configuration of the TSUBAME2 supercomputer, with a parallel efficiency in excess of 80%, performing about 2 trillion lattice updates per second, concurrently with the simulation of the dynamics of up to 450 million red blood cells. The above accomplishment results from the development of several unique features, in terms of both high-performance computing technology and physical/computational modeling, namely:

Fig. 13. Several stenotic plaques are inserted in regions where plaque growth is most probable (insertion spots shown by red circles in the topology of panel A, reporting the anatomic vessel acronyms). We simulated values of plaque severity ranging from mild (33%) to severe stenosis (70%), see panel C, the latter values being typically found in the clinical arteriography of panel B. Fractional Flow Reserve as a function of severity, for single and multiple instances of stenosis, is reported in panel D. The horizontal dotted line is the threshold below which surgical intervention is typically undertaken. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

(i) the solution of the highly complex domain-partitioning problem, prompted by the need to evenly distribute the workload associated with the complex arterial geometry across as many as 4000 GPUs;
(ii) the innovative communication techniques required to secure a balanced workload between fluid dynamics and particle dynamics in geometries of real-life complexity;
(iii) the innovative modeling techniques required to manage the self-consistent fluid–particle interactions in complex geometries; and
(iv) the usage of CPUs as MPI co-processors of the GPUs, which turned an inherent limitation of the GPU (the lack of support for direct communication) into an advantage.

As to (ii), we are not aware of any previous implementation in non-idealized geometries. We believe that the present work represents major progress towards the predictive capability of computer simulations for real-life biofluidic and biomedical research, with special, yet not exclusive, focus on cardiovascular clinical practice.

Acknowledgments

This work was made possible by a collaboration between "Istituto per le Applicazioni del Calcolo" and the Tokyo Institute of Technology. We would like to express our gratitude to Prof. Toshio Endo for all his logistic and technical support, and to Prof. Matsuoka for continued support. We wish to thank E. Kaxiras, from the Harvard School of Engineering, and C.L. Feldman, A.U. Coskun, F.J. Rybicki, and A.G. Whitmore from Harvard Medical School for useful discussions and for providing the anatomic data.

We wish to thank G. Amati, F. Pozzati and F. Schifano for useful comments and suggestions about performance measurements and code tuning. We also thank U. Amato and the Naples branch of the IAC for making the lilligridbio cluster available for code testing.

References

[1] D.A. Vorp, D.A. Steinman, C.R. Ethier, Comput. Sci. Eng. (2001) 51.
[2] A. Quarteroni, A. Veneziani, P. Zunino, SIAM J. Num. Anal. 39 (2002) 1488.
[3] L. Grinberg, T. Anor, E. Cheever, et al., Phil. Trans. R. Soc. A 367 (2009) 1896–2371.
[4] R. Ouared, B. Chopard, J. Stat. Phys. 121 (2005) 209.
[5] D.J.W. Evans, P.V. Lawford, J. Gunn, D. Walker, D.R. Hose, R.H. Smallwood, B. Chopard, M. Krafczyk, J. Bernsdorf, A. Hoekstra, Phil. Trans. R. Soc. A 366 (2008) 3343.
[6] I.V. Pivkin, P. Richardson, G. Karniadakis, Proc. Natl. Acad. Sci. USA 103 (2006) 17164–17169.
[7] J.L. McWhirter, H. Noguchi, G. Gompper, Proc. Natl. Acad. Sci. USA 106 (2009) 6039–6043.
[8] S.K. Veerapaneni, et al., J. Comput. Phys. 228 (2009) 2334.
[9] Scalable simulations of up to 14,000 red blood cells were previously obtained with either boundary integral [34] or Lattice Boltzmann methods [35] on up to 64 thousand BlueGene/P cores. More recently, scalable simulations of blood flows were performed on up to 256 GPUs and up to 200,000 cores of the Cray Jaguar system [36]. These works used very accurate representations of blood, mostly targeting capillaries, but with approximations that render the methods inapplicable to realistic vessel geometries and to physiological flow conditions of high Reynolds number.
[10] M. Bernaschi, S. Melchionna, S. Succi, et al., Comput. Phys. Comm. 180 (2009) 1495.
[11] S. Melchionna, M. Bernaschi, S. Succi, et al., Comput. Phys. Comm. 181 (2010) 462.
[12] R. Benzi, S. Succi, M. Vergassola, Phys. Rep. 222 (1992) 145.
[13] J.C. Phillips, G. Zheng, S. Kumar, L.V. Kalé, in: Proc. IEEE/ACM SC2002 Conf., IEEE Press, 2002, Technical Paper 277.

[14] LB simulations with coupled bodies, even deformable ones, are available in the literature, see for instance M. Dupin, Phys. Rev. E 75 (2007) 6, but we are not aware of any massively parallel implementation of this method. Parallel LBM applications to complex geometries have been performed by L. Axner, J. Comput. Phys. 227 (2008) 4895, but on much smaller sizes (of the order of ten million lattice sites) and without suspended particles. Computational fluid dynamics coupled with multi-body dynamics is also available, see J. Götz, Parallel Computing 36 (2010) 142, but on a much smaller number of nodes (8192). Larger Lattice Boltzmann simulations, with up to 150 billion cells and 260 million suspended particles, have been recently presented (J. Götz, Supercomputing 2010). However, these simulations only deal with idealized Cartesian geometries.
[15] S. Succi, et al., The Lattice Boltzmann Equation for Fluid Dynamics and Beyond, Oxford University Press, USA, 2001.
[16] S. Melchionna, Macromol. Theory Simul. 20 (2011) 548.
[17] J.G. Gay, B.J. Berne, J. Chem. Phys. 74 (1981) 3316.
[18] A. Dullweber, B. Leimkuhler, R. McLachlan, J. Chem. Phys. 107 (1997) 5840.
[19] S. Melchionna, J. Comput. Phys. 230 (2011) 3966.
[20] A.R. Pries, D. Neuhaus, P. Gaehtgens, Am. J. Physiol. Heart Circ. Physiol. 263 (1992) 1770.
[21] Q. Zou, X. He, Phys. Fluids 9 (1997) 1591.
[22] J. Lätt, B. Chopard, O. Malaspinas, M. Deville, A. Michler, Phys. Rev. E 77 (2008) 056703.
[23] K.M.J. Marques, H.J. Spruijt, C. Boer, N. Westerhof, C.A. Visser, F.C. Visser, J. Am. Coll. Cardiol. 39 (2002) 1632.
[24] A. Peters, et al., in: Proceedings of Supercomputing 2010, New Orleans, 2010.
[25] http://www.labri.fr/perso/pelegrin/scotch/.


[26] M. Bisson, M. Bernaschi, S. Melchionna, Commun. Comput. Phys. 10 (2011) 1071.
[27] The global geometry of the problem used for the present simulations is obtained from CT scans of the coronary arterial system of a real patient. Data were acquired with a 320 × 0.5 mm CT scanner (Toshiba) and subsequently segmented into a stack of two-dimensional contours at a nominal resolution of 0.5 mm. The slice contours, each consisting of 256 points, are oversampled along the axial distance down to a slice-to-slice separation of 0.0125 mm and further smoothed out by appropriate interpolators. The resulting multi-branched geometrical structure is finally mapped onto the Cartesian LB lattice, ready for the simulation. Full details can be found in [11].
[28] J. Tölke, M. Krafczyk, in: Proceedings of the International Conference for Mesoscopic Methods in Engineering and Science, ICMMES07, Munich, 2007.
[29] C. Chambreau, J. Vetter, mpiP: lightweight, scalable MPI profiling. http://mpip.sourceforge.net/.
[30] K. London, J. Dongarra, S. Moore, P. Mucci, K. Seymour, T. Spencer, in: International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August 8–10, 2001.
[31] http://oprofile.sourceforge.net.
[32] M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, E. Kaxiras, Concurrency and Computation: Practice and Experience, 2009. http://dx.doi.org/10.1002/cpe.1466.
[33] S. Melchionna, G. Amati, M. Bernaschi, M. Bisson, S. Succi, 2012 (in preparation).
[34] A.Z. Zinchenko, R.H. Davis, J. Comput. Phys. 157 (2003) 539.
[35] J.R. Clausen, D.A. Reasor Jr., C.K. Aidun, Comput. Phys. Comm. 181 (2010) 1013.
[36] A. Rahimian, et al., in: Proceedings of Supercomputing 2010, New Orleans, 2010.