Computers & Geosciences 70 (2014) 181–189
Finite-difference staggered grids in GPUs for anisotropic elastic wave propagation simulation

Felix Rubio*, Mauricio Hanzich, Albert Farrés, Josep de la Puente, José María Cela

Computer Applications in Science and Engineering, Barcelona Supercomputing Center, Spain
Article info

Article history: Received 7 November 2013; Received in revised form 12 May 2014; Accepted 5 June 2014; Available online 18 June 2014

Keywords: Anisotropy; Staggered grids; High performance; Simulation; GPU; Parallel

Abstract

The 3D elastic wave equations can be used to simulate the physics of waves traveling through the Earth more precisely than acoustic approximations. However, this improvement in quality has a counterpart in the cost of the numerical scheme. A possible strategy to mitigate that expense is using specialized, high-performing architectures such as GPUs. Nevertheless, porting and optimizing a code for such a platform require a deep understanding of both the underlying hardware architecture and the algorithm at hand. Furthermore, for very large problems, multiple GPUs must work concurrently, which adds yet another layer of complexity to the codes. In this work, we have tackled the problem of porting and optimizing a 3D elastic wave propagation engine which supports both standard- and fully-staggered grids to multi-GPU clusters. At the single GPU level, we have proposed and evaluated many optimization strategies and adopted the best performing ones for our final code. At the distributed memory level, a domain decomposition approach has been used which allows for good scalability thanks to using asynchronous communications and I/O.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

Acoustic wave propagation has been the preferred engine for geophysical exploration applications for the last few years due to the large cost involved in using better approximations, especially for 3D full-wavefield modelling-based applications. Hence, simplified approaches have been used to generate images of the subsurface so that data processing can be finished in a reasonable time. The current trend in seismic imaging aims at using an improved physical model, considering that the Earth is not rigid but an elastic body. This new model takes simulations closer to the real physics of the problem, at the cost of increasing the required computational resources. In this work, we aim at improving the performance of 3D elastic wave-propagation simulations by using state-of-the-art graphic processing units (GPUs). Nowadays, most scientific high-performance computing (HPC) simulations are carried out relying on an impressive number of simple, general purpose cores (Bordas et al., 2009). By using this strategy, scientists and developers are relieved from completely re-writing their codes for each new HPC architecture, so that they can focus entirely on the optimization scheme being applied (load balancing, communication strategy, etc.) (Sahni et al., 2009).
* Corresponding author. E-mail address: [email protected] (F. Rubio).
http://dx.doi.org/10.1016/j.cageo.2014.06.003
However, the recently developed high-performing devices called accelerators or co-processors can outperform their general purpose counterparts by orders of magnitude in terms of efficiency. The penalty for using such devices is, nonetheless, the need for re-engineering some portions of the code, or even the whole application (Araya-Polo et al., 2011). Recent work has been done in the field of automatic code generation (Markall et al., 2010; Corrigan et al., 2012) which shows some success for certain specific problems and architectures. The resulting scenario drives the scientific community to re-write their codes, using strategies better suited for such accelerators. While some approaches still try to use tools already available (e.g. OpenGL) (Wang et al., 2011), most of the re-engineering effort has taken different paths, ranging from saving memory accesses by recomputing intermediate results (Corrigan et al., 2011) to re-writing sections of the code to map the algorithm more efficiently to the underlying architectures (Rubio et al., 2009, 2011). Specifically for seismic propagation, there are works showing speed-up factors of 12 or 60 (Micikevicius, 2009; Michéa and Komatitsch, 2010; Komatitsch et al., 2010), depending on the specific initial code state and the problem being solved. In this work we have taken a completely empirical approach, implementing a set of candidate optimizations in isolation to evaluate the performance contribution of each. Finally, only those which reported benefits in terms of computational time have been integrated into the final code.
In the following sections we outline the physical and mathematical description of the problem that we are simulating (Section 2). Similarly, we describe the data structures (Section 3) and implementations that we have developed for CUDA-enabled devices (Sections 4 and 5). Finally, we show how our approach performs in a multi-GPU environment (Section 6), together with a validation of the code's results and some conclusions (Section 7).

2. Governing equations and staggered grids

Seismic wave propagation is a phenomenon governed by Newton's dynamic laws and the theory of elasticity (see e.g. Aki and Richards, 2002). Neglecting the body rotation, a body can be subjected to six different stress components: three compressional stresses acting upon each Cartesian axis (σxx, σyy and σzz) and three shear stresses (σyz, σxz and σxy). Stresses, in turn, produce strains in the material which can be expressed in terms of displacement gradients. Differentiating the stress–strain relation in time leads us to the coupled equation system (Eqs. (1) and (2)) which describes wave propagation in elastic media, where u, v and w are the particle velocity components in the x-, y- and z-directions, respectively, and the stiffness matrix C_{66} determines the anisotropic behavior of the material:

\[
\begin{pmatrix}
\partial_t \sigma_{xx}\\ \partial_t \sigma_{yy}\\ \partial_t \sigma_{zz}\\
\partial_t \sigma_{yz}\\ \partial_t \sigma_{xz}\\ \partial_t \sigma_{xy}
\end{pmatrix}
= C_{66}
\begin{pmatrix}
\partial_x u\\ \partial_y v\\ \partial_z w\\
\partial_z v + \partial_y w\\ \partial_z u + \partial_x w\\ \partial_x v + \partial_y u
\end{pmatrix}
\tag{1}
\]

\[
\begin{aligned}
\rho\,\partial_t u &= \partial_x \sigma_{xx} + \partial_y \sigma_{xy} + \partial_z \sigma_{xz}\\
\rho\,\partial_t v &= \partial_x \sigma_{xy} + \partial_y \sigma_{yy} + \partial_z \sigma_{yz}\\
\rho\,\partial_t w &= \partial_x \sigma_{xz} + \partial_y \sigma_{yz} + \partial_z \sigma_{zz}
\end{aligned}
\tag{2}
\]
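For reference, and as a concrete instance of the reduced-parameter cases discussed below (this matrix is a standard result, not taken from the original text), in the fully isotropic case the stiffness matrix of Eq. (1) contains only the two Lamé parameters λ and μ:

\[
C_{66}^{\mathrm{iso}} =
\begin{pmatrix}
\lambda+2\mu & \lambda & \lambda & 0 & 0 & 0\\
\lambda & \lambda+2\mu & \lambda & 0 & 0 & 0\\
\lambda & \lambda & \lambda+2\mu & 0 & 0 & 0\\
0 & 0 & 0 & \mu & 0 & 0\\
0 & 0 & 0 & 0 & \mu & 0\\
0 & 0 & 0 & 0 & 0 & \mu
\end{pmatrix}
\]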
Although the C_{66} matrix can have up to 21 independent components in the most general anisotropic case, certain materials can be described with fewer parameters (the most frequent symmetry types are monoclinic, orthorhombic and transversely isotropic, with 13, 9 and 5 independent parameters, respectively). At the other end, the fully isotropic case requires only two material parameters to fully determine C_{66}. A classical way to solve Eqs. (1) and (2) is by using explicit finite-difference schemes based upon staggered grids (see e.g. Madariaga, 1976). On such grids, complementary variables are located in sub-grids which are intertwined. In this way, spatial derivatives of a variable are solved precisely at the location of the complementary variable. In the case of elastic waves many options exist regarding the grid layout. The advantages of staggered grid algorithms for elastic wave propagation include a higher accuracy in the solution of the spatial derivatives and a better behavior in the presence of large material contrasts (Moczo et al., 2007).

The most popular staggered grid was introduced by Virieux (1986) in seismology and uses stresses and velocities as complementary variables, similar to the layout proposed by Yee (1966) for the electromagnetic wave equations. In this setup, the equations can be solved explicitly in time, using an algorithm that is staggered in time as well. In the following, we will refer to this grid as the standard staggered grid (SSG), as it is the most widely used staggered grid in the field of seismic wave propagation. SSG solves the elastic wave equations at a minimum cost when the material anisotropy is, at most, axis-aligned orthorhombic. The reason behind this is that some coefficients of C_{66} are then set to zero, thus removing the dependence on some velocity/stress variables. Although there exist trade-off solutions which can solve complex anisotropy with SSG (Igel et al., 1995), the modeling of anisotropic elastic waves is better accomplished using a fully staggered grid, or Lebedev grid (FSG, e.g. Davydycheva et al., 2003; Lisitsa and Vishnevskiy, 2010; Bernth and Chapman, 2011). This grid type requires more variables to be stored at each grid node than SSG, but allows for the solution of the full anisotropic elastic wave equation without any loss of quality in the solution. Hence FSG is more desirable when solving strongly anisotropic scenarios, in particular those displaying anisotropy beyond axis-aligned orthorhombic.

In order to accommodate both grid types (SSG and FSG) with a unified approach, we employ the concept of cells (grouping together 4 stress nodes, 4 velocity nodes, and a set of up to 22 material parameters), as shown in Fig. 1a for the FSG grid. In this case, the total information stored per cell is 58 values, distributed among the four nodes tr (top-right), tl (top-left), br (bottom-right) and bl (bottom-left) for both the stress and velocity nodes. For the SSG case, our cell representation produces the layout shown in Fig. 1b, including only a subset of the data present in the FSG cell. Similarly, the amount of parameters that need to be stored is smaller, as the number of independent parameters in C_{66} is smaller whenever we use SSG. We remark that the node layout is identical for both grid types in our cell representation, although the two data structures differ greatly in size and in the number of variables stored. Last but not least, we choose to use classical O8 stencils for the solution of the spatial derivatives, thus requiring up to eight neighboring cells in each direction for the computation of a single-direction spatial partial derivative. The temporal integration is achieved with a simple O2 scheme.

Since we are simulating an unbounded physical domain, we need to fade out the waves that reach the limits of the simulation model. In our case, we have chosen to implement two different Absorbing Boundary Conditions (ABCs): an already proven C-PML (Komatitsch and Martin, 2007), and a Sponge condition (Cerjan et al., 1985).
Fig. 1. Elastic Full Staggered Grid and Standard Staggered Grid Cells. Notice that we have chosen a left-handed Cartesian coordinate system, as is common practice in geophysical exploration problems. (a) FSG Cell definition and (b) SSG Cell definition.
The reason for having two different kinds of boundary conditions is related to the stability of the PML in the presence of strong anisotropy or of a Free Surface (FS) condition honoring a specific topography. As for the FS condition, we used an in-house development based on a mimetic operator (Rojas et al., 2008) that also supports topography (de la Puente et al., 2014). Notice that in the current work the focus is on the optimization of the propagation itself, not on the optimization of either the ABCs or the FS.

3. Algorithm and data structures

As a first step towards implementing SSG and FSG velocity–stress time-domain algorithms, we pay attention to the structures in which we store the data in memory. As will be shown later, this choice is crucial for the development of an algorithm that makes the best possible usage of our computational bottleneck: the memory bandwidth. We introduce two possible alternatives for marshaling the data in memory. The first one, called Array of Structures (AoS), is the closest to the staggered grid cell concept: a computational structure represents a cell, with all its fields packed together (see Fig. 2a). This data structure is desirable when the whole working set (all the data that must be accessed to perform a given operation) can be kept in cache memory and we access the different fields of the structure sequentially. However, whenever we access any variable in memory, we load the whole cache line (the minimum memory transfer unit, usually 64 bytes) where this variable is stored, possibly bringing in extra data. The second approach uses a Structure of Arrays (SoA), in which we build a structure that contains as many arrays as field kinds in our geophysical cell. Each array in our SoA stores a single parameter for all the cells in our grid (see Fig. 2b). The advantage of this strategy is that values for the same field in consecutive cells are consecutive in memory. This is useful if consecutive cells are computed close in time, thus reducing the number of memory accesses.

Algorithm 1 shows how the velocity grid is updated as a function of the stress grid, the density, the spatial discretization and the time step (dt). In order to update a single velocity component (vcomp) in the innermost loop of Algorithm 1, it is necessary to calculate the buoyancy (from the density interpolated at the velocity node) and a 1D stencil of the stress grid by means of the ST_i functions. Algorithm 1 uses the BUOYANCY function (line 3) to interpolate the density value for each velocity node (vnode) using density values of the surrounding cells.

Algorithm 1. Velocity calculation.
Require: stress grid (sgrid), density grid (ρ), spatial discretization (dz, dx, dy), time step (dt)
Ensure: velocity grid (vgrid)
1. for all vcell in vgrid do
2.   for all vnode in vcell do
3.     ρnode = BUOYANCY(ρ, vnode)
4.     for all vcomp in vnode do
5.       vcomp += (STZ(snode, dz) + STX(snode, dx) + STY(snode, dy)) · dt / ρnode
6.     end for
7.   end for
8. end for
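To make the two layouts of Fig. 2 concrete, the following C sketch (reduced to three fields; the type and field names are ours, not the paper's) contrasts an AoS cell record with an SoA grid structure:

    /* Array of Structures: one record per cell, all fields packed together.   */
    typedef struct {
        float sxx, syy, szz;     /* ... plus the remaining stress, velocity and
                                    material fields of the FSG/SSG cell         */
    } CellAoS;
    CellAoS *grid_aos;           /* accessed as grid_aos[cell].sxx              */

    /* Structure of Arrays: one contiguous array per field.                     */
    typedef struct {
        float *sxx, *syy, *szz;  /* each array holds one value per cell         */
    } GridSoA;
    GridSoA grid_soa;            /* accessed as grid_soa.sxx[cell]              */

With the SoA form, threads that process consecutive cells read consecutive addresses of the same array, which is what allows the coalesced accesses depicted in Fig. 2.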
In case we are updating stresses rather than velocities, the algorithm is analogous to Algorithm 1, with an additional computation in the inner loop to multiply the vector of velocity derivatives by the material coefficient matrix.
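Putting the SoA layout and Algorithm 1 together, a minimal serial C sketch of the update of one velocity component could read as follows; STENCIL_X/Y/Z (the O8 1D stencils) and BUOYANCY are placeholders for routines not shown here:

    float STENCIL_X(const float *f, size_t cell, float dx);   /* O8 first derivative  */
    float STENCIL_Y(const float *f, size_t cell, float dy);
    float STENCIL_Z(const float *f, size_t cell, float dz);
    float BUOYANCY (const float *rho, size_t cell);           /* interpolated density */

    void velocity_update_u(float *u, const float *sxx, const float *sxy,
                           const float *sxz, const float *rho,
                           float dx, float dy, float dz, float dt, size_t ncells)
    {
        for (size_t cell = 0; cell < ncells; ++cell) {         /* for all cells        */
            float rho_node = BUOYANCY(rho, cell);              /* Algorithm 1, line 3  */
            u[cell] += (STENCIL_X(sxx, cell, dx)               /* d(sigma_xx)/dx       */
                      + STENCIL_Y(sxy, cell, dy)               /* d(sigma_xy)/dy       */
                      + STENCIL_Z(sxz, cell, dz))              /* d(sigma_xz)/dz       */
                      * dt / rho_node;                         /* Eq. (2), explicit O2 */
        }
    }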
4. GPUs and CUDA architecture

Graphic Processing Units (GPUs) are hardware devices attached to a CPU (host), specially designed to perform computations with dense matrices very efficiently. CUDA (Compute Unified Device Architecture) provides an interface to program such GPUs by supplying language extensions to already-existing languages (e.g. C and C++). The programmer must specify how the data must be transferred between the GPU and the host processor, and which subroutines (kernels) must be executed on the GPU. The smallest units able to do a computation in a CUDA-enabled device are the so-called threads. Groups of several threads (usually 32, called CUDA warps) are treated as a unit, executing the same instruction on different data, and are organized into thread blocks. Several thread blocks, in turn, form what is named a CUDA grid, a collection of data partitions to be processed on a given kernel call. The execution of CUDA kernels is performed by cores inside Streaming Multiprocessors (SMs), each capable of running several thread blocks concurrently (the actual number depends on the resources required by each thread block). This concurrency is achieved by de-scheduling warps while they wait for data from GPU memory, thus preventing the SM from stalling. In order to improve the responsiveness of the whole device, two levels of cache memory are present as well. Whilst the cache memory of each SM (Level 1, or L1, cache) is private to that SM and is not coherent with any other cache on the GPU, the second cache level (L2) is shared among all the SMs and is coherent with the main memory.

Specifically addressing our problem, the scenario we are facing is composed of several computations in which each cell to be updated depends only on neighboring values stored in different arrays, so all cells can be processed at once. At the same time, in a finite-difference problem all fields are stored in a Cartesian grid. This avoids indirect accesses to memory, and is the kind of problem GPUs are typically designed to address.
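As a minimal illustration of this execution model (the kernel body and names are ours, not the paper's), the following CUDA fragment maps one thread to each (z, x) pair and loops over the slowest dimension y:

    // Each thread handles one (z, x) column of the Cartesian grid.
    __global__ void touch_grid(const float *in, float *out, int nz, int nx, int ny)
    {
        int z = blockIdx.x * blockDim.x + threadIdx.x;   // fastest dimension
        int x = blockIdx.y * blockDim.y + threadIdx.y;
        if (z >= nz || x >= nx) return;
        for (int y = 0; y < ny; ++y) {                   // slowest dimension
            size_t idx = ((size_t)y * nx + x) * nz + z;  // linear index, z contiguous
            out[idx] = in[idx];                          // placeholder for a real stencil
        }
    }

Threads of the same warp receive consecutive z indices and therefore read consecutive addresses, which is exactly the coalescing-friendly pattern the SoA layout of Section 3 provides.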
5. GPU kernels
Fig. 2. AoS and SoA structures. Background colors depict cache lines, green arrows depict coalesced accesses, and red arrows depict bank collisions. (a) AoS and (b) SoA. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
A kernel is a subroutine that must be executed by a GPU. We have split the computation into a natural set of two kernels, one for the velocities and another for the stresses, to which we have applied several optimization techniques, explained in the next subsections. When programming such kernels, the configuration of the CUDA grid and thread block sizes must be passed to the driver, which will create the corresponding number of thread blocks to be executed concurrently. There is no unique solution for this configuration, because it heavily depends on the resources of the GPU being used and on the problem being solved. Therefore, it is necessary to find the optimal configuration empirically. All our single GPU tests are performed on the same machine, based on 2 Intel Xeon E-5504 processors, with 12 GB of DDR3 RAM, and equipped
with an NVIDIA C2050 (codenamed Fermi). The simulation scenario for the single GPU case is a grid of 192 × 192 × 192 cells, which translates into a CUDA block/grid configuration of 32 × 4 × 1 / 6 × 48 × 1 that we have found to be optimal for our kernels and hardware. Hence, the total amount of thread blocks (288) guarantees that the 14 SMs available on the C2050 have a sufficient load (20.5 thread blocks/SM) for the results to be significant when testing only the efficiency of the kernels (this is important because a GPU works optimally when it has enough computation to overlap with its data transfers from/to memory). It is important to note the performance benefit we have obtained just by porting the code to GPUs, compared to that provided by the general purpose version. For the FSG code, we have obtained a speed-up that ranges from 4×, when we have topography and a free surface, to 8× for the FSG code using a Sponge boundary condition. For the SSG, the speed-up obtained with a Sponge is 6×.
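In code, the launch configuration quoted above would be expressed roughly as follows (the kernel and argument names are placeholders, not the paper's API):

    dim3 block(32, 4, 1);                 // 128 threads per block
    dim3 grid(192 / 32, 192 / 4, 1);      // 6 x 48 x 1 = 288 thread blocks
    velocity_kernel<<<grid, block>>>(d_velocity, d_stress, d_rho, 192, 192, 192);
    cudaDeviceSynchronize();              // wait for the kernel before timing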
5.1. FSG velocity update

To establish a performance baseline, we have ported to GPU an AoS general-purpose version, which consisted of a processing z–x plane that streamed across the slowest memory direction (the direction in which the elements have the largest stride, y in our case). However, as the performance was below that expected, we developed a second code based on streaming over the slowest dimension while keeping an array of cells in the y direction in registers (hardware components inside each SM, in which data can be stored to be reused several times), as seen in Micikevicius (2009). As shown in Table 1, by using this strategy we have been able to raise the number of instructions executed per cycle (IPC), which is interesting because the more instructions are processed per unit of time, the faster the simulation is computed. Also, this strategy has increased the Level 1 cache hit ratio (so we are improving the reuse of data already brought into the core from memory), thus reducing the pressure on the main memory. This has led us to a 2× speed-up.

As explained in Section 3, by using an AoS approach we face two problems: only a portion of the memory brought in by each request is actually used, and we may incur memory bank conflicts at each request depending on the data stride (see Fig. 2b). Therefore, we have proceeded with the porting of the AoS code to an SoA approach, calling this our naive version, which will be used as our new baseline for efficiency measurements in the following discussion. The results are shown in column SoA of Table 1. It must be noted that, while we were using a 3D stencil on cells in the AoS approach, we now use three 1D stencils (see Eqs. (1) and (2)). As each variable in the SoA is stored in a different array, each stencil is just one-dimensional, in the x-, y- or z-direction. Therefore, there are no shared structures between the y and the z–x axes, and hence using Micikevicius' approach brings no benefit here.
Table 1
Comparison between AoS, SoA and Micikevicius for the velocity kernel. IPC is the number of instructions completed per core cycle.

Metrics        AoS       AoS + Micikevicius   SoA
Time           100%      40.62%               9.96%
L1 conflicts   n/a       19.10%               n/a
L1 hit ratio   43.95%    94.56%               50.55%
IPC            0.56      1.82                 8.32
Bandwidth      100%      30.73%               88.70%
From this point on, we have assembled a set of possible optimization strategies that we have applied to the code in order to empirically assess which ones give us positive results in terms of speed-up. These strategies are described in the following:

Constant memory: Most NVIDIA GPUs have a cached constant memory, capable of broadcasting values to all threads in a thread block. This memory is useful for storing constant parameters, such as the stencil coefficients or the spatial and time discretizations.

Texture memory: The texture memory is capable of producing linear interpolations in hardware. Such capability may be useful for solving the buoyancy interpolations, hence saving some extra registers.

Shared memory: In the Fermi architecture it is possible to set up the scratchpad memory of each SM as 48/16 kB or 16/48 kB of cache (hardware managed)/shared (software managed, thread-block-addressable) memory, respectively. By placing some common data in this memory it is possible to enhance reuse while maximizing the memory throughput per thread.

Common sub-expressions: In order to prevent the compiler from spending registers in recomputing recurrent sub-expressions (e.g. dt·ρ, or pointers to array bases), these are computed once and stored in a register for later use.

Instruction level parallelism (ILP): Warps stop their execution when data from memory is required. Hence, by issuing all the load instructions, followed by the operations, and then all the stores for a single 1D stencil (as proposed by Volkov, 2010), it is possible to overlap the computing phase with the memory data transfers.

Register streaming: Even if Micikevicius' approach is not completely applicable to SoA schemes, the concept of traversing the data volume along the slowest dimension may still be used. By doing this it is possible to keep the slowest-dimension values in a set of registers and load only one new value per iteration in the traversal direction. In our case, this is useful for the y-dimension. This approach is similar to that used by Michéa and Komatitsch (2010), with some differences to take into account: our algorithm is FSG whereas theirs is SSG (and thus they consider a smaller number of variables), and they use a spatial order of 4 whereas ours is 8, which raises the number of registers required by the kernel.

Loop splitting: To keep the register usage below an acceptable level (to maximize the utilization of the SMs, see Section 4), it is useful to split the y-loop. By that means, we reduce the amount of computation per loop and, hence, the number of registers being used. As a result, the same loop is repeated several times in order to calculate all the stencils for a given thread block.

Register condensation: By enclosing code blocks in always-true conditionals we simplify the analysis phase of the compiler, thus enabling it to better reuse registers that are not shared between different blocks of code.
Manual code reordering: While developing code for the NVIDIA GPU, we have found that manually reordering independent instructions in the kernel (e.g. stencil contributions) allows the compiler to produce a binary that shows better performance while using fewer resources. In order to test several scenarios of instruction ordering we tried all the permutations of a set of instructions, and kept the order that resulted in the smallest register usage.

Our approach, once all these optimizations had been analyzed, has been to apply only those that provided some benefit (i.e. a speed-up larger than 1.0). Table 2 shows the speed-up for each optimization and for the final version, where we merged the profitable ones. There remains, however, the question of why two approaches (register streaming and loop splitting) produced slower codes. Our opinion is that the performance drop is caused by the differences in register usage. The number of available registers is critical for the compiler, enabling it to perform its own optimizations. On the other hand, the two optimizations which produced an improvement of roughly 1.0 are closely tied to the cache system of our GPU. Our guess is that the memory access pattern is already well suited to the caching hardware, although understanding such behavior requires very low-level knowledge of the CUDA architecture (not easily available).

5.2. FSG stress update

In the stress update, even considering that the number of stencils being computed is the same as in the velocity update (see lines 6–10 in Algorithm 2), it is necessary to calculate the interpolation of 21 elastic coefficients instead of just the density.

Algorithm 2. Stress calculation.
Require: velocity grid (vgrid), elastic coefficients (C), spatial discretization (dz, dx, dy), time step (dt)
Ensure: stress grid (sgrid)
1. for all scell in sgrid do
2.   for all snode in scell do
3.     for all cij in C do
4.       cij = COEFF(C, scell, snode, i, j)
5.     end for
6.     for all vcomp in vnode do
7.       for all dir in [Z, X, Y] do
8.         vpd_dir^vcomp = STdir(vnode, vcomp, d_dir)
9.       end for
10.    end for
11.    for all scomp in snode do
12.      scomp += STR(snode, scomp, vpd, C, dt)
13.    end for
14.  end for
15. end for
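At the level of a single node, line 12 of Algorithm 2 amounts to multiplying the 6 × 6 stiffness matrix by the vector of velocity derivatives of Eq. (1) and integrating in time; a minimal CUDA device-side sketch (our names, simplified) is:

    // sigma[6]: the 6 stress components of one node; e[6]: the strain-rate vector
    // {dxu, dyv, dzw, dzv+dyw, dzu+dxw, dxv+dyu}; c[6][6]: interpolated coefficients.
    __device__ void stress_node_update(float sigma[6], const float c[6][6],
                                       const float e[6], float dt)
    {
        for (int i = 0; i < 6; ++i) {
            float ds = 0.0f;
            for (int j = 0; j < 6; ++j)
                ds += c[i][j] * e[j];   // (C * e)_i, right-hand side of Eq. (1)
            sigma[i] += dt * ds;        // explicit O2 time integration
        }
    }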
Table 2
Optimization results for the FSG kernel.

Optimization              Vel speed-up    Stress speed-up
Naive                     1.00            1.00
Constant memory           1.18            1.05
Texture memory            1.02            1.02
Shared memory             0.98            1.07
Common sub-expressions    1.18            1.04
Instr. level parall.      1.09            1.12
Register streaming        0.57            0.95
Loop splitting            0.77            1.11
Register condensation     1.20            1.14
Manual code reordering    1.24            1.06
Final                     1.81            1.76
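As an illustration only (names and coefficient handling are placeholders, not the production kernel), the following fragment shows how two of the profitable optimizations of Table 2, constant memory for the stencil coefficients and hoisting of common sub-expressions, typically appear in a derivative kernel; the coefficients would be uploaded once with cudaMemcpyToSymbol, and the buoyancy factor is omitted for brevity:

    __constant__ float d_coeff[4];        // O8 centered stencil weights
    __constant__ float d_invdx, d_dt;     // 1/dx and the time step

    __global__ void add_x_derivative(const float *sxx, float *u, int nz, int nx, int ny)
    {
        int z = blockIdx.x * blockDim.x + threadIdx.x;
        int x = blockIdx.y * blockDim.y + threadIdx.y;
        if (z >= nz || x < 4 || x >= nx - 4) return;
        const float scale = d_dt * d_invdx;                 // common sub-expression
        for (int y = 0; y < ny; ++y) {
            size_t i = ((size_t)y * nx + x) * nz + z;       // base index computed once
            float d = 0.0f;
            for (int k = 1; k <= 4; ++k)                    // 8th-order first derivative
                d += d_coeff[k - 1] * (sxx[i + (size_t)k * nz] - sxx[i - (size_t)k * nz]);
            u[i] += scale * d;
        }
    }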
Moreover, the stress calculation implies updating not three components, as in the velocity case, but six. Altogether, the cost of the stress update doubles for the AoS structures, while it is almost 5× for the SoA case, when compared to the velocity update, as can be seen in Table 3. Similar to what we described for the velocity update, it is possible to further improve the naive stress update kernel (simple SoA implementation) by applying the same techniques already described in the previous section. Table 2 shows the improvements that we were able to obtain by applying the different optimizations and what can be expected by applying them all (final row). Surprisingly, some optimizations which were discarded for the velocity update have been successful in the stress update. We can also see that both update schemes obtained a similar total speed-up.
5.3. SSG kernels

As we have explained in Sections 2 and 3, we can simplify the problem by reducing the cell size from FSG to SSG. By making this simplification we obtain a lighter-weight code at the cost of a restricted rheology representation. The first implication, on the computational side, of choosing this strategy is the reduction of all the node structures that store velocities, stresses and elastic materials. Due to the results obtained while developing the FSG code, only an SoA-based SSG code structure has been developed. Similar to the FSG case, the first version of the code we have developed for GPU is the one resulting from exchanging the z, x indices for threads. Table 4 shows the speed-up of the velocity update step for all proposed optimizations and the gain with respect to the first, naive version. In the stress case, the only strategies that report a significant improvement are the constant memory usage, the interpolations computed on the texture memory and, finally, the manual code reordering.
Table 3
Comparison between velocity and stress update costs for both AoS and SoA structures. Times are normalized by our reference.

Physical vars.   AoS (%)   SoA vars + AoS coeffs.   SoA vars + SoA coeffs. (%)
Velocity         100       —                        9.17
Stress           100       44.70%                   19.44
Table 4
Optimization results for the SSG kernel.

Optimization              Vel speed-up    Stress speed-up
Naive                     1.00            1.00
Constant memory           1.28            1.14
Texture memory            1.20            1.17
Shared memory             1.13            1.10
Common sub-expressions    1.04            1.01
Instr. level parall.      1.02            1.03
Register streaming        1.07            1.08
Loop splitting            1.18            1.10
Register condensation     1.00            1.01
Manual code reordering    1.02            1.25
Final                     1.66            1.36
5.4. Numerical results

In order to show that the results of our optimized code are numerically correct, we have simulated an elastic TTI propagation (with the FSG kernel, as it is the more complex one), and we have obtained the results shown in Fig. 3. It is interesting to note that, because of the anisotropy present in such a model, the horizontal propagation velocity is notably higher than the vertical one.
6. Multi-node parallelism

A real industry-sized problem does not easily fit into the memory of a single GPU. Therefore, in order to exploit the full potential of multi-GPU environments we must be able to work with several nodes in parallel, using a domain-decomposition strategy. In this scenario, the staggered grids must be split in such a way that each velocity or stress update can be computed simultaneously on many GPUs. Then, at the end of each time step, we exchange the updated information between GPU cards. Algorithm 3 depicts the pseudo-code for this operation, in which the communication and the computation are divided into two phases, thus allowing them to overlap each other.

Algorithm 3. Time step cycle and domain decomposition pseudocode.
for step = 0 → nt do
  phase(1)
  dataExchange(front)
  dataExchange(back)
  phase(2)
  wait(front)
  wait(back)
end for
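A possible realization of Algorithm 3 with one GPU per MPI rank is sketched below; the compute_* routines stand in for the phase(1)/phase(2) kernel launches and are not part of the paper's code, and the halo buffers are assumed to be already staged in host memory:

    #include <mpi.h>
    #include <cuda_runtime.h>

    void compute_boundary_cells(cudaStream_t s);   /* phase(1): cells to be exchanged */
    void compute_interior_cells(cudaStream_t s);   /* phase(2): rest of the domain    */

    void time_loop(int nt, int front, int back, int n_halo,
                   float *send_f, float *recv_f, float *send_b, float *recv_b,
                   cudaStream_t s_halo, cudaStream_t s_bulk, MPI_Comm comm)
    {
        MPI_Request req[4];
        for (int step = 0; step < nt; ++step) {
            compute_boundary_cells(s_halo);                  /* phase(1)                 */
            cudaStreamSynchronize(s_halo);                   /* halos ready on the host  */
            MPI_Isend(send_f, n_halo, MPI_FLOAT, front, 0, comm, &req[0]);
            MPI_Irecv(recv_f, n_halo, MPI_FLOAT, front, 0, comm, &req[1]);
            MPI_Isend(send_b, n_halo, MPI_FLOAT, back,  0, comm, &req[2]);
            MPI_Irecv(recv_b, n_halo, MPI_FLOAT, back,  0, comm, &req[3]);
            compute_interior_cells(s_bulk);                  /* phase(2), overlaps comms */
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);        /* wait(front) / wait(back) */
            cudaStreamSynchronize(s_bulk);
        }
    }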
Notice that phase 1 and phase 2 perform the same kind of computation, phase 1 being responsible for computing everything that needs to be exchanged with the neighbor domains, while phase 2 computes the rest of the domain. Between these two computation phases, communication starts, and it is synchronized at the end of the loop (wait front / wait back). In order to simplify the domain decomposition, we only decompose in the slowest dimension of the domain (see Algorithm 3, where only front and back neighbors are present). This strategy avoids the need to maintain extra buffers for marshaling communication data, although it may prove insufficient when aiming at solving extremely large problems. The codes running on multi-GPU scenarios have been run on a cluster built upon BULL B505 nodes. Each one of these nodes is equipped with 2 Intel Xeon E-5649 processors and 2 NVIDIA M2090 GPU cards. The network connection is provided by an InfiniBand
networking interface. In order to analyze the scalability of our codes when working on a multi-GPU scenario, we have defined some tests based on the same model, which is composed of a total of 3 × 3 × 3 similarly sized sub-cubes, each of them containing a material with different elastic properties (see Fig. 4). The parameters of these tests are shown in Table 5. The reason for choosing a 2 m discretization is that we want to ensure a ratio of 4 points per minimum wavelength, which is enough to satisfy our accuracy requirement. In this case, it is not possible to fit the FSG-Rubik test in fewer than 6 GPUs. Hence, we have defined another test with a smaller model of 192 × 192 × 192 cells that fits into one computational node and runs 10,000 time steps, which we call FSG-Rubik-Small. Fig. 5 shows the scalability curves for both the FSG-Rubik and FSG-Rubik-Small tests while running the FSG GPU kernel.

The key to the success of Algorithm 3 is the GPU-to-GPU memory transfer scheme used. In order to communicate between two GPUs several options are available, and some of them imply intermediate copies between the GPU and the underlying host. Hence, we have developed an isolated test-bed code that emulates Algorithm 3 regarding the host/device interoperability and the communication scheme. This is useful for testing the performance of the different host/device memory transfer schemes available in the CUDA language. We have tested five mutually exclusive memory transfer schemes:

Default: memory transfers and CUDA kernels are queued to the default stream (i.e. processing queue) on the device. Using this scheme, memory accesses are synchronous, as the stream sequentially processes the enqueued kernels and memory transfers. This approach does not overlap memory transfers and computation (see Fig. 6a).
Fig. 4. Cube used for Rubik test.
Fig. 3. TTI propagation across BP-2007 TTI model, obtained with our FSG code. The influence of the anisotropy in the wave propagation can be seen in (Rubio et al., 2013).

Table 5
Parameters of tested data sets.

Feature            Value
FSG test name      FSG-Rubik
SSG test name      SSG-Rubik
P-wave velocity    [1000, 5000] m/s
S-wave velocity    [408, 3536] m/s
ρ (density)        [1200, 2700] kg/m³
Poisson ratio      [0.0, 0.45]
Spatial disc.      2 m
Time disc.         160 μs
Test dimensions    512 × 504 × 512 cells
Source             Explosive Ricker
Source freq.       20 Hz
Source position    (zs, xs, ys) = (255, 251, 255)
Overlapping: different GPU streams (i.e. queues) are used to overlap memory transfers and kernel computation on the device, as elements from different streams can be executed in parallel. Memory accesses are not synchronous and, therefore, the stream synchronization functions provided by CUDA are used (see Fig. 6b).

Overlapping host: The same as before, but memory transfers are done directly from the neighbor-domain host memory to device memory. This approach saves a host-to-host memory copy, and is useful when the GPUs are located in the same computational node (see Fig. 6c).

Overlapping peer: The same as before, but memory transfers are done directly from device memory to device memory. This approach saves a host-to/from-device copy and a host-to-host memory copy. However, to be able to use this scheme, Peer Device Memory Access must be enabled on the device.
Mapped: memory transfers between host and device are done automatically by the GPU driver, so there is no need to declare any explicit memory transfer.

Mapped Opt.: The same as before, but declaring as mapped memory only the data calculated by phase(1) in Algorithm 3. This scheme decreases the amount of data moved between host and device, but increases the code complexity.

Table 6 shows the performance achieved for all the strategies implemented, with respect to the default scheme, using two GPUs. As a test case, we have used for each device a data volume of 1024 × 1024 × 700 floats (2.8 GiB) filled with random values and iterated for 100 steps. 10% of the volume is exchanged with each neighbor. Due to a hardware configuration limitation, Overlapping Peer has been run on another machine that hosts two Tesla C2050 cards (the input data size has been halved due to memory limitations, while the rest of the input parameters remain untouched). The results for this particular architecture are also shown in Table 6.

Notice that some strategies show a drop in performance. In this case, such strategies are the ones that let the GPU driver transparently manage the memory transfers. Each time the kernel accesses the mapped memory, the CUDA driver has to take care of the transfer to keep the memory on the device and on the host coherent.
Table 6
GPU memory transfer schemes performance.

Memory scheme       M2090 speed-up   C2050 speed-up
Default             1.00             1.00
Overlapping         1.18             1.25
Overlapping Host    1.38             –
Overlapping Peer    –                –
Mapped              0.05             –
Mapped Opt.         0.27             –
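A minimal sketch of the Overlapping scheme of Table 6 is given below (buffer, kernel and configuration names are ours, not the paper's): the halo staging buffer is allocated as pinned host memory so that cudaMemcpyAsync can truly overlap with a kernel running in a second stream:

    // halo_bytes, d_field, d_halo, grid_b/grid_i/block are assumed declared elsewhere.
    float *h_halo;                                               // host staging buffer
    cudaHostAlloc(&h_halo, halo_bytes, cudaHostAllocDefault);    // pinned memory
    cudaStream_t s_halo, s_bulk;
    cudaStreamCreate(&s_halo);
    cudaStreamCreate(&s_bulk);

    boundary_kernel<<<grid_b, block, 0, s_halo>>>(d_field);      // phase(1)
    cudaMemcpyAsync(h_halo, d_halo, halo_bytes,
                    cudaMemcpyDeviceToHost, s_halo);             // copy overlaps s_bulk
    interior_kernel<<<grid_i, block, 0, s_bulk>>>(d_field);      // phase(2)
    cudaStreamSynchronize(s_halo);                               // halo ready for MPI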
Fig. 5. FSG-Rubik and FSG-Rubik-Small tests.
Fig. 6. GPU-Host data transfer schemes: (a) Default, (b) Overlapping, (c) Overlapping host.
The reason behind these strategies' failure is that our kernels are memory bound and most computations are done inside the GPU. In such a situation, the driver is kept busy synchronizing memory spaces that do not actually need to be in sync.

At this point we fall back to our FSG and SSG codes and run scalability tests on many GPUs, using the Overlapping Host strategy for the memory transfers. The speed-ups shown in Fig. 5 (and, from now on, every speed-up measure) were obtained with the following metric:

\[
\mathrm{speedup}(n_{\mathrm{nodes}}) = \frac{\mathrm{time}(1_{\mathrm{node}})}{\mathrm{time}(n_{\mathrm{nodes}})}
\tag{3}
\]
Notice that, as the model used for the test does not change when the number of nodes increases, the results correspond to a "hard scalability" test. For our tests using the FSG code, the decomposed dimension is y (the same as for the shared memory decomposition), which is 512 and 192 cells in size for FSG-Rubik and FSG-Rubik-Small, respectively. Finally, as the FSG-Rubik test does not fit in fewer than 3 computational nodes (i.e. 6 M2090 GPUs), the results for 1 and 2 nodes were estimated based on the results for the other test. An important conclusion unveiled by Fig. 5 is the ratio between overlapping cells and domain size at which a given parallel efficiency is reached, where the efficiency is defined as

\[
\mathrm{efficiency}(n_{\mathrm{nodes}}) = \frac{\mathrm{speedup}(n_{\mathrm{nodes}})}{n_{\mathrm{nodes}}}
\tag{4}
\]
Let us assume a target efficiency of 80% for the FSG-Rubik test case, using 4 nodes (8 domains, two GPUs per node), where each domain has 64 cells in the decomposition dimension (Y). In the worst case, a given domain has two neighbors (as we only decompose in one direction). Thus, the communication of the domain boundaries implies exchanging 4 cells in the decomposition dimension (because we are in an O8 scenario, up to 4 neighbor cells are required in the z, x and y directions to update any cell). This leads to 8 cells per domain being exchanged out of 64, which delivers a ratio of 8/64 = 12.5% between the exchanged cells and the domain size. For the FSG-Rubik-Small test, the same results are shown. It must be taken into account that the scalability is reduced with the y dimension, but the ratio of 12.5% found for the FSG-Rubik test is maintained. In this case, up to 2 nodes (i.e. 4 domains) can be used before the efficiency degrades below 80%, as the domain size is reduced to 48 cells in the Y dimension. Regarding the SSG code scalability on GPUs, Fig. 7 shows the execution time and scalability for the SSG-Rubik case when running the SSG GPU kernel. As already seen for the FSG case, keeping an efficiency of 80% implies that no more than a 12.5% ratio between domain and exchanged cells may be used. Notice also that, for over 8 nodes, the scalability is inverted.
Fig. 7. SSG-Rubik test.
Fig. 8. Comparison of a trace segment obtained with our general purpose and GPU codes for the SSG-Rubik test. Note that the error is five orders of magnitude below the trace values. Although the initial dt is 160 μs, the traces have been down-sampled by 6, thus delivering a new dt of 960 μs.
This happens due to the MPI and GPU/Host communication overheads introduced by the communications. The same is not seen for FSG, as the amount of computation is much bigger while the communication overhead remains almost the same.
6.1. Numerical results

To check the approaches we have taken, both in the numerical kernels and in the decomposition scheme, we have extracted the seismic traces belonging to the Rubik test for the general purpose and the CUDA-enabled versions of our code. As can be seen in Fig. 8 (which shows only part of a trace, for the sake of clarity), the GPU-accelerated version produces results that are within the acceptable error range (the error is 5 orders of magnitude smaller than the values of the traces). The conditions under which these traces have been obtained can be found in Table 5, for the SSG-Rubik test.
7. Conclusions

In this work we have shown how the problem of elastic wave simulation using staggered grids must be tackled on state-of-the-art GPUs. After describing the equations and the building blocks, i.e. the different staggered grid cells, we have presented the algorithms employed for data processing. In the following sections, we have proposed several optimization strategies for the GPU code, together with the performance improvement provided by each of them. This has been evaluated both individually for each strategy and combined in the final version. Our final speed-up has been, roughly, 14.4× when compared to the performance of the general purpose code for the FSG approach (8× obtained by the porting, and an additional 1.8× with the optimizations), whereas for the SSG approach we have obtained a speed-up of 9.6× (6× for the porting, and an additional 1.6× with the optimizations).

Furthermore, as we are interested in running large-scale simulations, we have introduced a scenario whose size required using multiple GPUs with a domain decomposition strategy. These simulations have been run across several multi-GPU computing nodes. We have tackled the problem by structuring our code in different phases that enable asynchronous communication,
resulting in scalability factors of 10× in the case of FSG and 5× in the case of SSG, when using a 16-node cluster.
Acknowledgments

The authors want to thank Repsol for the permission to publish the present research, carried out at the Repsol-BSC Research Center.

References

Aki, K., Richards, P.G., 2002. Quantitative Seismology. University Science Books, Sausalito, California.
Araya-Polo, M., Cabezas, J., Hanzich, M., Pericas, M., Rubio, F., Gelado, I., Shafiq, M., Morancho, E., Navarro, N., Ayguade, E., Cela, J.M., Valero, M., 2011. Assessing accelerator-based HPC reverse time migration. IEEE Trans. Parallel Distrib. Syst. 22 (1), 147–162.
Bernth, H., Chapman, C., 2011. A comparison of the dispersion relations for anisotropic elastodynamic finite-difference grids. Geophysics 76 (3), WA43–WA50.
Bordas, R., Carpentieri, B., Fotia, G., Maggio, F., Nobes, R., Pitt-Francis, J., Southern, J., 2009. Simulation of cardiac electrophysiology on next-generation high-performance computers. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 367 (1895), 1951–1969.
Cerjan, C., Kosloff, D., Kosloff, R., Reshed, M., 1985. A nonreflecting boundary condition for discrete acoustic and elastic wave equations. Geophysics 50 (4), 705–708.
Corrigan, A., Camelli, F., Löhner, R., Mut, F., 2012. Semi-automatic porting of a large-scale Fortran CFD code to GPUs. Int. J. Numer. Meth. Fluids 69 (May), 314–331.
Corrigan, A., Camelli, F., Löhner, R., Wallin, J., 2011. Running unstructured grid-based CFD solvers on modern graphics hardware. Int. J. Numer. Methods Fluids 66 (May (2)), 221–229.
Davydycheva, S., Druskin, V., Habashy, T., 2003. An efficient finite-difference scheme for electromagnetic logging in 3D anisotropic inhomogeneous media. Geophysics 68 (5), 1525–1536.
de la Puente, J., Ferrer, M., Hanzich, M., Castillo, J.E., Cela, J.M., 2014. Mimetic seismic wave modeling including topography on deformed staggered grids. Geophysics 79 (3), T125–T141. 〈http://geophysics.geoscienceworld.org/content/79/3/T125.abstract〉.
Igel, H., Mora, P., Riollet, B., 1995. Anisotropic wave propagation through finite-difference grids. Geophysics 60, 1203–1216.
Komatitsch, D., Erlebacher, G., Göddeke, D., Michéa, D., 2010. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. J. Comput. Phys. 229 (20), 7692–7714.
Komatitsch, D., Martin, R., 2007. An unsplit convolutional perfectly matched layer improved at grazing incidence for the seismic wave equation. Geophysics 72 (5), SM155–SM167.
Lisitsa, V., Vishnevskiy, D., 2010. Lebedev scheme for the numerical simulation of wave propagation in 3D anisotropic elasticity. Geophys. Prospect. 58 (4), 619–635.
Madariaga, R., 1976. Dynamics of an expanding circular fault. Bull. Seismol. Soc. Am. 66 (3), 639–666.
Markall, G., Ham, D., Kelly, P., September 2010. Generating optimised finite element solvers for GPU architectures. In: Simos, T., Psihoyios, G., Tsitouras, C. (Eds.), American Institute of Physics Conference Series, vol. 1281, pp. 787–790.
Michéa, D., Komatitsch, D., 2010. Accelerating a three-dimensional finite-difference wave propagation code using GPU graphics cards. Geophys. J. Int. 182 (1), 389–402.
Micikevicius, P., 2009. 3D finite difference computation on GPUs using CUDA. In: Proceedings of the Second Workshop on General Purpose Processing on Graphics Processing Units. GPGPU-2. ACM, New York, NY, USA, pp. 79–84.
Moczo, P., Kristek, J., Galis, M., Pazak, P., Balazovjech, M., 2007. The finite-difference and finite-element modeling of seismic wave propagation and earthquake motion. Acta Phys. Slov. 57 (2), 177–406.
Rojas, O., Day, S., Castillo, J., Dalguer, L.A., 2008. Modelling of rupture propagation using high-order mimetic finite differences. Geophys. J. Int. 172 (2), 631–650.
Rubio, F., Farrés, A., Hanzich, M., de la Puente, J., Ferrer, M., 2013. Optimizing isotropic and fully-anisotropic elastic modelling on multi-GPU platforms. In: The 75th EAGE Conference & Exhibition. EAGE.
Rubio, F., Hanzich, M., Aris, R., Houzeaux, G., Vazquez, M., June 2009. Parallel computational electrophysiology in Cell/B.E. processors. In: van Loon Perumal Nithiarasu, R.L.R. (Ed.), The First International Conference on Mathematical and Computational Biomedical Engineering, vol. 1. Swansea University, pp. 273–275.
Rubio, F., Hanzich, M., Aris, R., Vazquez, M., Houzeaux, G., 2011. Parallel computational electrophysiology in NVIDIA GPUs. In: van Loon Perumal Nithiarasu, R.L.R. (Ed.), The Second International Conference on Mathematical and Computational Biomedical Engineering. Swansea University.
Sahni, O., Zhou, M., Shephard, M.S., Jansen, K.E., 2009. Scalable implicit finite element solver for massively parallel processing with demonstration to 160K cores. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. SC '09. ACM, New York, NY, USA, pp. 68:1–68:12.
Virieux, J., 1986. P-SV wave propagation in heterogeneous media: velocity–stress finite-difference method. Geophysics 51, 889–901.
Volkov, V., 2010. Better performance at lower occupancy. In: GPU Technology Conference.
Wang, Z., Peng, S., Liu, T., 2011. GPU accelerated 2-D staggered-grid finite difference seismic modelling. J. Softw. 6 (8).
Yee, K., 1966. Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. IEEE Trans. Antennas Propag. 14 (3), 302–307.