Global magnetohydrodynamic simulations on multiple GPUs


Computer Physics Communications 185 (2014) 144–152


Un-Hong Wong a,b, Hon-Cheng Wong c,a,∗, Yonghui Ma a

a Space Science Institute, Macau University of Science and Technology, Macao, China
b Global Scientific Information and Computing Center, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan
c Faculty of Information Technology, Macau University of Science and Technology, Macao, China

Article history: Received 23 October 2012; Received in revised form 25 August 2013; Accepted 28 August 2013; Available online 7 September 2013

Keywords: Global MHD simulations; Multiple GPUs; CUDA; Parallel computing

Abstract

Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind–magnetosphere interaction. However, the huge computational requirement of global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transfers and kernel launches are managed with the CUDA 4.0 API instead of MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Global magnetohydrodynamic (MHD) models have been used to model various phenomena in space physics since the first such model was developed by Leboeuf et al. [1]. That first global MHD model was two-dimensional (2D), and the same group then developed the first three-dimensional (3D) model [2]. The physical and numerical foundations of global MHD models can be found in [3]. Although 3D global MHD models are very powerful for studying the solar wind–magnetosphere interaction, performing 3D global MHD simulations is computationally expensive; thus these models are often implemented on massively parallel computers [4–10]. The rapid advances in graphics processing units (GPUs) provide an alternative to central processing unit (CPU)-based parallel architectures for implementing global MHD models. Recently, the present authors developed an efficient MHD code on a single GPU [11]. However, the capacity of video memory on a single GPU is very limited; multiple GPUs provide more memory and more computational power for complex physical models such as global MHD models. As 3D global MHD models are very useful in simulating the solar wind–magnetosphere interaction, we are interested in exploring the possibility of implementing a 3D global MHD model on multiple GPUs.

∗ Corresponding author at: Faculty of Information Technology, Macau University of Science and Technology, Macao, China. E-mail address: [email protected] (H.-C. Wong).

Such an implementation will be useful to most research groups in space physics simulations and is also of importance to the space physics community. A closely related work on performing MHD simulations on multiple GPUs has recently been presented by Feng et al. [12,13]. They describe an OpenCL-based implementation of their Solar-Interplanetary-CESE MHD model (SIP-CESE MHD model) [14] on CPU/GPU clusters in double precision for studying the solar corona and interplanetary solar wind, achieving a speedup of 5× over their CPU implementation without any optimization.

Before CUDA 4.0 [15] was released, handling data transfer between GPUs and implementing multi-GPU programs was not easy. One reason is that the CUDA Runtime API only allowed one GPU to be bound to one CPU thread. Although the CUDA Driver API could coordinate multiple GPUs from one CPU thread, it was too complex to use and debug. Both the GPU manufacturers and the CUDA developers knew that making the use of multiple GPUs easier and more efficient was important, leading to the development of the second-generation Tesla GPU architecture, Fermi, as well as CUDA 4.0, which greatly improve the usability of GPUs. The CUDA 4.0 Runtime API provides one-to-many GPU handling by setting the current GPU within a single CPU thread. Since CPUs and GPUs work in parallel, the CPU is free after it launches kernels on the GPUs. All large calculations can be offloaded to the GPUs while a single CPU thread manages the process, the data exchange, and the synchronization between the GPUs.
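To make this one-to-many handling concrete, the following minimal sketch (our own illustration, not code from the paper) shows how a single host thread can drive several GPUs with the CUDA 4.0 Runtime API by switching the current device before each allocation and kernel launch; the kernel step_kernel and the buffer size are hypothetical placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one stage of the MHD update.
__global__ void step_kernel(double *f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 1.0;   // placeholder computation
}

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 8) ngpu = 8;                 // the system in Section 5 has 8 GPUs

    const int n = 1 << 20;                  // grid points per partition (placeholder)
    double *d_f[8] = {nullptr};

    // A single host thread manages all GPUs by setting the current device.
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d_f[g], n * sizeof(double));
        cudaMemset(d_f[g], 0, n * sizeof(double));
    }

    // Kernel launches are asynchronous: the host returns immediately,
    // so all GPUs compute in parallel while the CPU acts as the manager.
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        step_kernel<<<(n + 255) / 256, 256>>>(d_f[g], n);
    }

    // Synchronize each device before using or exchanging the results.
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(d_f[g]);
    }
    printf("ran one step on %d GPU(s)\n", ngpu);
    return 0;
}
```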


In this paper, we present an efficient 3D global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our simulator is based on the 3D global MHD model proposed by Ogino et al. [16,17] for simulating the interaction between the solar wind and the Earth's magnetosphere. This paper is organized as follows: we review the global MHD model we used in Section 2. The numerical scheme we adopted is presented in Section 3. In Section 4, we present the GPU implementation in detail. Experimental results and performance measurements are given in Section 5. We conclude our work and indicate some directions for possible future work in Section 6.

2. Global MHD model for simulating the solar wind–magnetosphere interaction

In this section we review the global MHD model used in our implementation. This model was developed by Ogino et al. [16,17] to simulate and investigate the solar wind interaction with the Earth's magnetosphere. The model has also proved very useful for simulating the solar wind interaction with planetary magnetospheres, such as Jupiter's magnetosphere [18–21] and Saturn's magnetosphere [22–25]. The model is based on the conservation laws (density, momentum, and energy) and Maxwell's equations, which together form a coupled set of partial differential equations (PDEs), the MHD equations. Thus, the time evolution of the solar wind–magnetosphere interaction can be simulated by solving the MHD equations. Numerical solutions on a domain (mesh), together with the boundary conditions, can be calculated via a numerical scheme such as the two-step Lax–Wendroff scheme [26]. In our code, the modified leapfrog scheme [17] has been implemented to perform highly efficient 3D global MHD simulations on multiple GPUs. Our multi-GPU simulator is based on the following normalized resistive MHD equations [16,17]:

$$\frac{\partial \rho}{\partial t} = -\nabla \cdot (\mathbf{v}\rho) + D\nabla^{2}\rho \qquad (1)$$

$$\frac{\partial \mathbf{v}}{\partial t} = -(\mathbf{v}\cdot\nabla)\mathbf{v} - \frac{1}{\rho}\nabla p + \frac{1}{\rho}(\mathbf{J}\times\mathbf{B}) + \mathbf{g} + \frac{1}{\rho}\Phi \qquad (2)$$

$$\frac{\partial p}{\partial t} = -(\mathbf{v}\cdot\nabla)p - \gamma p \nabla\cdot\mathbf{v} + D_{p}\nabla^{2}p \qquad (3)$$

$$\frac{\partial \mathbf{B}}{\partial t} = \nabla\times(\mathbf{v}\times\mathbf{B}) + \eta\nabla^{2}\mathbf{B} \qquad (4)$$

$$\mathbf{J} = \nabla\times(\mathbf{B} - \mathbf{B}_{d}) \qquad (5)$$

where $\rho$ is the plasma density, $\mathbf{v}$ the velocity, $\mathbf{B}$ the magnetic field vector, and $p$ the plasma pressure. $\mathbf{B}_{d}$ is the magnetic field of the planet; numerical errors are reduced by using Eq. (5). In our test case, the solar wind interaction with the Earth's magnetosphere, $\mathbf{B}_{d}$ is a magnetic dipole approximating the Earth's magnetic field. $\Phi \equiv \mu\nabla^{2}\mathbf{v}$ is the viscosity, and $\eta$ is the resistivity, taken to be uniform throughout the simulation box in the range $0.0001 \le \eta \le 0.002$ for the magnetospheric configuration. $\mathbf{g} = -g_{0}/\xi^{3}$ (with $\xi = \sqrt{x^{2}+y^{2}+z^{2}}$ and $g_{0} = 1.35\times10^{-7}$, corresponding to $9.8\ \mathrm{m/s^{2}}$) is the force of gravity, and $\gamma = 5/3$ is the ratio of specific heats. $D$ is the diffusion coefficient of particles and $D_{p}$ is the diffusion coefficient of pressure. The coefficient $\mu$ was assigned artificially to control the numerical oscillations of short wavelength resulting from the initial values or from rapid magnetic field changes, and we let $D = D_{p} = \mu/\rho_{sw} = 0.001$, where $\rho_{sw}$ is the solar wind density. $R_{e} = 6.37\times10^{6}\ \mathrm{m}$ is the radius of the Earth, $\rho_{s} = m n_{s}$ ($n_{s} = 10^{10}\ \mathrm{m^{-3}}$) is the ionosphere density, the magnetic field is normalized by the dipole field at the equator, $B_{s} = 3.12\times10^{4}\ \mathrm{nT}$, and $v_{A} = 6.80\times10^{6}\ \mathrm{m/s}$ is the Alfvén speed at the equator.

Fig. 1. The 1/4 simulation domain of the solar wind–Earth's magnetosphere interaction: the Sun is located on the positive side of the x-axis, and the north magnetic pole is assumed to lie on the positive side of the z-axis.

The 1/4 simulation domain of the solar wind–Earth's magnetosphere interaction is shown in Fig. 1. Detailed definitions, assumptions, parameters, coefficients, and the coordinate system of the simulation domain can be found in [16]. A constant boundary condition is set at $x = x_{0}$ for the constant source of the solar wind, with a free/open boundary at $x = x_{1}$. At $y = y_{0}$ and $z = z_{0}$ the free/open boundaries are set in the direction at 45° to the x-axis. Mirror boundaries are set at $y = 0$:

$$\frac{\partial \rho}{\partial y} = \frac{\partial p}{\partial y} = \frac{\partial v_{x}}{\partial y} = \frac{\partial v_{z}}{\partial y} = \frac{\partial B_{x}}{\partial y} = \frac{\partial B_{z}}{\partial y} = 0, \qquad v_{y} = B_{y} = 0 \qquad (6)$$

and at $z = 0$:

$$\frac{\partial \rho}{\partial z} = \frac{\partial p}{\partial z} = \frac{\partial v_{x}}{\partial z} = \frac{\partial v_{y}}{\partial z} = \frac{\partial B_{z}}{\partial z} = 0, \qquad v_{z} = B_{x} = B_{y} = 0. \qquad (7)$$

The initial condition is set to a constant source of density, velocity, and temperature from the upper stream ($x = x_{0}$) of the simulation domain. The magnetic field of the Earth is defined as $\mathbf{B}_{d} = \left(\dfrac{-3xz}{\xi^{5}},\ \dfrac{-3yz}{\xi^{5}},\ \dfrac{x^{2}+y^{2}-2z^{2}}{\xi^{5}}\right)$. The parameters of the solar wind are

set as follows (the values are normalized for the normalized resistive MHD equations): density $\rho_{sw} = 5\times10^{-4}$ (corresponding to 5 cm$^{-3}$), $\mathbf{v}_{sw} = (v_{sw}, 0, 0)$ at $x = -x_{0}$ with $v_{sw} = 0.0441$ to $0.118$ (300–800 km/s), and $p_{sw} = 3.56\times10^{-8}$ ($T_{sw} = 2\times10^{5}$ K) at the upper stream ($x = x_{0}$). The z component of the uniform interplanetary magnetic field (IMF) traveling with the solar wind can be set to $B_{IMF} = 0$ or $\pm1.5\times10^{-4}$ ($\pm5$ nT) for different tests.

3. The modified leapfrog scheme

In this section we review the numerical scheme adopted in our implementation. The modified leapfrog scheme [17] is a finite difference scheme that combines the two-step Lax–Wendroff scheme [26] and the leapfrog scheme [26]. In each sequence of $l$ time steps, the first time step is calculated using the two-step Lax–Wendroff scheme and the subsequent $l - 1$ time steps are calculated using the leapfrog scheme. This sequence is repeated until the calculation reaches the target time step. The modified leapfrog scheme combines the numerical stability of the two-step Lax–Wendroff scheme with the low numerical attenuation of the leapfrog scheme, providing results that are much more stable than those of either scheme used individually. The analysis and experiments in [17] suggested that $l = 8$ is the


optimal value for the modified leapfrog scheme to provide stable results. We detail the modified leapfrog scheme in the following for the simple partial differential equation

$$\frac{\partial f}{\partial t} = -\frac{\partial f}{\partial x} - \frac{\partial f}{\partial y} - \frac{\partial f}{\partial z} - f. \qquad (8)$$

The two-step Lax–Wendroff scheme is listed as follows; the 1st step is obtained by

$$f^{t}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} = \frac{1}{8}\bigl(f^{t}_{i,j,k} + f^{t}_{i+1,j,k} + f^{t}_{i,j+1,k} + f^{t}_{i,j,k+1} + f^{t}_{i+1,j+1,k} + f^{t}_{i+1,j,k+1} + f^{t}_{i,j+1,k+1} + f^{t}_{i+1,j+1,k+1}\bigr) \qquad (9)$$

$$\begin{aligned}
f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} ={}& f^{t}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} - \frac{\Delta t}{2}\, f^{t}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} \\
&- \frac{\Delta t}{8\Delta x}\bigl(f^{t}_{i+1,j+1,k+1} + f^{t}_{i+1,j+1,k} + f^{t}_{i+1,j,k+1} + f^{t}_{i+1,j,k} - f^{t}_{i,j,k} - f^{t}_{i,j+1,k} - f^{t}_{i,j,k+1} - f^{t}_{i,j+1,k+1}\bigr) \\
&- \frac{\Delta t}{8\Delta y}\bigl(f^{t}_{i+1,j+1,k+1} + f^{t}_{i+1,j+1,k} + f^{t}_{i,j+1,k+1} + f^{t}_{i,j+1,k} - f^{t}_{i,j,k} - f^{t}_{i+1,j,k} - f^{t}_{i,j,k+1} - f^{t}_{i+1,j,k+1}\bigr) \\
&- \frac{\Delta t}{8\Delta z}\bigl(f^{t}_{i+1,j+1,k+1} + f^{t}_{i+1,j,k+1} + f^{t}_{i,j+1,k+1} + f^{t}_{i,j,k+1} - f^{t}_{i,j,k} - f^{t}_{i+1,j,k} - f^{t}_{i,j+1,k} - f^{t}_{i+1,j+1,k}\bigr). \qquad (10)
\end{aligned}$$

Fig. 2. Calculations of one time step of the two-step Lax–Wendroff scheme (left) and the leapfrog scheme (right).

The 2nd step is obtained by

$$f^{t+\frac{1}{2}}_{i,j,k} = \frac{1}{8}\bigl(f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}}\bigr) \qquad (11)$$

$$\begin{aligned}
f^{t+1}_{i,j,k} ={}& f^{t}_{i,j,k} - \Delta t\, f^{t+\frac{1}{2}}_{i,j,k} \\
&- \frac{\Delta t}{4\Delta x}\bigl(f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}}\bigr) \\
&- \frac{\Delta t}{4\Delta y}\bigl(f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} - f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}}\bigr) \\
&- \frac{\Delta t}{4\Delta z}\bigl(f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}} + f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k+\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i-\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}} - f^{t+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k-\frac{1}{2}}\bigr). \qquad (12)
\end{aligned}$$

Fig. 3. Illustration of the modified leapfrog scheme.

The only difference between the two-step Lax–Wendroff scheme and the leapfrog scheme is that the leapfrog scheme uses $f^{t-\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}}$ instead of $f^{t}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k+\frac{1}{2}}$ as the first term on the right-hand side of Eq. (10), and $\Delta t$ instead of $\Delta t/2$ in Eq. (10). That is, the leapfrog scheme uses the result of the previous half time step ($t - \frac{1}{2}\Delta t$), whereas the two-step Lax–Wendroff scheme uses the interpolation between the $n$ and $n+1$ grid points at the same time step $t$. The calculations of one time step of these two schemes are shown in Fig. 2. Fig. 3 shows the whole procedure of the modified leapfrog scheme: in each sequence of $l$ steps, the two-step Lax–Wendroff scheme is used for the first time step and the leapfrog scheme is used for the subsequent $l - 1$ time steps.
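As a concrete illustration of this sequencing, the sketch below (our own one-dimensional reduction, not code from the paper) applies the idea to the 1D analog of Eq. (8), ∂f/∂t = −∂f/∂x − f, on a staggered grid: the first step of every sequence of eight uses the two-step Lax–Wendroff update, while the remaining seven reuse the stored half-step values in leapfrog fashion. Grid size, time step, and the initial profile are arbitrary placeholders; the code compiles as plain C++ or with nvcc.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative 1D reduction of the modified leapfrog scheme for
//   df/dt = -df/dx - f
// f lives at integer points, the half-step values at half points.
int main() {
    const int    N  = 200;            // number of integer grid points (placeholder)
    const double dx = 1.0 / N;
    const double dt = 0.4 * dx;       // placeholder time step
    const int    nsteps = 800;

    std::vector<double> f(N), f_new(N);
    std::vector<double> fh_t(N - 1), fh_half(N - 1), fh_prev(N - 1);

    for (int i = 0; i < N; ++i)       // smooth initial profile (placeholder)
        f[i] = std::exp(-100.0 * (i * dx - 0.3) * (i * dx - 0.3));

    for (int step = 0; step < nsteps; ++step) {
        const bool lax_wendroff = (step % 8 == 0);   // l = 8 sequence

        // Eq. (9) analog: interpolate f to the half points at time t.
        for (int i = 0; i < N - 1; ++i)
            fh_t[i] = 0.5 * (f[i] + f[i + 1]);

        // Eq. (10) analog: advance the half points to t + 1/2.
        for (int i = 0; i < N - 1; ++i) {
            const double dfdx = (f[i + 1] - f[i]) / dx;
            if (lax_wendroff)
                fh_half[i] = fh_t[i]    - 0.5 * dt * (fh_t[i] + dfdx);
            else                          // leapfrog: start from t - 1/2, full dt
                fh_half[i] = fh_prev[i] -       dt * (fh_t[i] + dfdx);
        }

        // Eqs. (11)-(12) analog: advance the integer points to t + 1.
        f_new = f;                        // keep the two boundary values fixed
        for (int i = 1; i < N - 1; ++i) {
            const double fh_c = 0.5 * (fh_half[i - 1] + fh_half[i]);
            const double dfdx = (fh_half[i] - fh_half[i - 1]) / dx;
            f_new[i] = f[i] - dt * (fh_c + dfdx);
        }

        f.swap(f_new);
        fh_prev = fh_half;                // remembered for the next leapfrog step
    }

    printf("f at the domain centre after %d steps: %g\n", nsteps, f[N / 2]);
    return 0;
}
```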


Fig. 4. Comparison of 1D decomposition (left) and 3D decomposition (right). Only the corresponding memory addressings of the first partition are shown. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Multi-GPU scheme

Implementing a multi-GPU program is somewhat similar to implementing a program on a cluster. Even though the GPUs are installed on the same motherboard, each GPU has its own device memory (GRAM). Thus, we can exploit the domain decomposition strategy often used for programs running on a cluster, although the situation is not the same as implementing a multi-core CPU program. In addition, we have to consider some characteristics and features of GPUs.

4.1. Mesh decomposition strategy for multi-GPU

One of the key issues in decomposing a mesh grid is the handling of adjacent grids, since the values of adjacent grids have to be copied to the adjacent partitions after every calculation step. In 1D decomposition, a 3D calculation domain with resolution $(SIZE_x, SIZE_y, SIZE_z)$ is decomposed into $n$ partitions, one assigned to each GPU. To achieve the best performance, all partitions should contain the same number of total grids and adjacent grids; otherwise, the other partitions have to wait for the partition containing the largest number of grids in the calculation step, and likewise for the partition containing the largest number of adjacent grids in the data exchange step. Therefore, the resolution of each partition should be $(SIZE_x, SIZE_y, SIZE_z/n)$, and the number of adjacent grids on the two sides in the z-direction is $2 \times SIZE_x \times SIZE_y$. In 3D decomposition, the 3D domain, a big cube, is decomposed into many small cubes. To make all the cubes the same size, the resolution of each partition will be $(SIZE_x/\sqrt[3]{n},\ SIZE_y/\sqrt[3]{n},\ SIZE_z/\sqrt[3]{n})$, and the number of adjacent grids will be

$$2 \times \left(\frac{SIZE_x}{\sqrt[3]{n}}\times\frac{SIZE_y}{\sqrt[3]{n}} + \frac{SIZE_y}{\sqrt[3]{n}}\times\frac{SIZE_z}{\sqrt[3]{n}} + \frac{SIZE_x}{\sqrt[3]{n}}\times\frac{SIZE_z}{\sqrt[3]{n}}\right).$$

It becomes complicated if $\sqrt[3]{n}$, $SIZE_x/\sqrt[3]{n}$, $SIZE_y/\sqrt[3]{n}$ or $SIZE_z/\sqrt[3]{n}$ are not integers: much more effort is needed to adjust the resolutions of the partitions so that they contain a similar number of total grids as well as adjacent grids. In 1D decomposition, if $SIZE_z/n$ is not an integer, the resolution of $(SIZE_z \bmod n)$ partitions is set to $(SIZE_x, SIZE_y, \lceil SIZE_z/n \rceil)$ and the resolution of the other $n - (SIZE_z \bmod n)$ partitions is set to $(SIZE_x, SIZE_y, \lfloor SIZE_z/n \rfloor)$; the number of adjacent grids on the two sides in the z-direction is still $2 \times SIZE_x \times SIZE_y$ (a sketch of this sizing is given after this section).

Both the 1D and 3D decomposition methods are shown in Fig. 4 (note that not all possible decompositions into cubes are shown), where the actual memory layout is shown. Orange grids are the BC grids storing the boundary condition (BC) data of the whole computation domain, blue grids are the adjacent grids duplicating the data of the neighboring partitions, and green grids hold the data in the interior of the decomposed domain. In theory, the best way to decompose a mesh domain is to make every partition a square (for a 2D mesh) or a cube (for a 3D mesh), so that the number of adjacent grids is minimal. On GPUs, however, a 2D or 3D decomposition does not obtain the full benefit of reducing the number of exchanged grids compared to a 1D decomposition. Moreover, data that are not stored at contiguous addresses cause non-coalescing memory accesses on a GPU, and the efficiency declines significantly. For example, for a row-major array $(x, y, z)$, the memory addresses of the adjacent grids in the z-direction are contiguous, but the memory addresses of the adjacent grids in the y- and x-directions are not. In this case, 2D and 3D decompositions spend more time in the data exchange between partitions. Comparing the two decomposition methods in Fig. 4, the 3D decomposition contains fewer adjacent grids, but some adjacent grid points are stored non-contiguously, whereas the adjacent grids in the partitions of the 1D decomposition are all at contiguous addresses. In our test, there were only 5% non-coalescing memory accesses with the 1D decomposition method.
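The partition sizing just described can be written down directly; the following small helper (our own sketch with hypothetical names, not code from the paper) computes, for a 1D decomposition along z, the local z-extent of each partition and the number of adjacent grid points exchanged per z-face, following the formulas of Section 4.1.

```cuda
#include <cstdio>

// 1D decomposition of a (SIZE_x, SIZE_y, SIZE_z) mesh over n GPUs along z.
// The first (SIZE_z % n) partitions receive one extra z-layer, as described above.
struct Partition {
    int nz;             // local number of z-layers
    int halo_per_face;  // adjacent grid points exchanged per z-face
};

Partition partition_1d(int size_x, int size_y, int size_z, int n, int rank) {
    Partition p;
    p.nz = size_z / n + (rank < size_z % n ? 1 : 0);  // ceil for the first few ranks
    p.halo_per_face = size_x * size_y;                // one full x-y slice per face
    return p;
}

int main() {
    const int sx = 180, sy = 60, sz = 60, ngpu = 8;   // mesh used in Section 5
    for (int g = 0; g < ngpu; ++g) {
        Partition p = partition_1d(sx, sy, sz, ngpu, g);
        printf("GPU %d: %d z-layers, %d adjacent points per z-face\n",
               g, p.nz, p.halo_per_face);
    }
    return 0;
}
```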

4.2. GPU implementation of the modified leapfrog scheme

The modified leapfrog scheme is used to solve the MHD equations and Maxwell's equations. One important point that has to be considered is that if too many inputs (pointers to arrays in global memory) are passed to a kernel, the kernel is skipped silently: it simply does not run, and no error message is produced while the program is running. In the following description, in order to avoid confusion between the array index and the coordinate


direction, the data of a grid point is written as $v_x(i, j, k, t)$, where $(i, j, k)$ is the 3D index of the mesh (the indices have the same directions as $(x, y, z)$). We can then express the discretized equation for $v_x$. The 1st step of the numerical scheme for Eq. (2) is

$$\begin{aligned}
v_x\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t+\tfrac{1}{2}\right) ={}& v_x(I) - \frac{\Delta t}{4\Delta x}\, v_x(I)\, D_x[v_x(i,j,k,t)] - \frac{\Delta t}{4\Delta y}\, v_y(I)\, D_y[v_x(i,j,k,t)] - \frac{\Delta t}{4\Delta z}\, v_z(I)\, D_z[v_x(i,j,k,t)] \\
&- \frac{\Delta t}{4\Delta x}\frac{1}{\rho(I)}\, D_x[p(i,j,k,t)] + \frac{\Delta t}{\rho(I)}\left[ J_y\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t\right) B_z(I) - J_z\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t\right) B_y(I) \right] + \alpha \qquad (13)
\end{aligned}$$

where $\alpha$ represents the gravity term and the viscosity term, for brevity. The index $I = (i+\frac{1}{2}, j+\frac{1}{2}, k+\frac{1}{2}, t)$ for the two-step Lax–Wendroff scheme and $I = (i+\frac{1}{2}, j+\frac{1}{2}, k+\frac{1}{2}, t-\frac{1}{2})$ for the leapfrog scheme. $D_x(\phi)$, $D_y(\phi)$ and $D_z(\phi)$ are the functions calculating the differences in the x-, y- and z-directions for any physical quantity $\phi(i, j, k, t)$:

$$\begin{aligned}
D_x(\phi(i,j,k,t)) ={}& \phi(i+1,j+1,k+1,t) + \phi(i+1,j+1,k,t) + \phi(i+1,j,k+1,t) + \phi(i+1,j,k,t) \\
&- \phi(i,j+1,k+1,t) - \phi(i,j+1,k,t) - \phi(i,j,k+1,t) - \phi(i,j,k,t) \\
D_y(\phi(i,j,k,t)) ={}& \phi(i+1,j+1,k+1,t) + \phi(i+1,j+1,k,t) + \phi(i,j+1,k+1,t) + \phi(i,j+1,k,t) \\
&- \phi(i+1,j,k+1,t) - \phi(i+1,j,k,t) - \phi(i,j,k+1,t) - \phi(i,j,k,t) \qquad (14) \\
D_z(\phi(i,j,k,t)) ={}& \phi(i+1,j+1,k+1,t) + \phi(i+1,j,k+1,t) + \phi(i,j+1,k+1,t) + \phi(i,j,k+1,t) \\
&- \phi(i+1,j+1,k,t) - \phi(i+1,j,k,t) - \phi(i,j+1,k,t) - \phi(i,j,k,t).
\end{aligned}$$

From the above equations it is easy to see that the calculation kernels take many inputs. This is not a problem for a CPU program; however, as mentioned above, such a large number of inputs causes the CUDA kernels to fail, which is an important issue to address when implementing a GPU program. The issue can be solved straightforwardly by splitting Eq. (13) into several short equations (steps) as follows:

$$v_x\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t+\tfrac{1}{2}\right) = v_x(I) - \frac{\Delta t}{4\Delta x}\, v_x(I)\, D_x[v_x(i,j,k,t)] - \frac{\Delta t}{4\Delta y}\, v_y(I)\, D_y[v_x(i,j,k,t)] - \frac{\Delta t}{4\Delta z}\, v_z(I)\, D_z[v_x(i,j,k,t)] \qquad (15)$$

$$v_x\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t+\tfrac{1}{2}\right) = v_x\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t+\tfrac{1}{2}\right) - \frac{\Delta t}{4\Delta x}\frac{1}{\rho(I)}\, D_x[p(i,j,k,t)] \qquad (16)$$

$$v_x\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t+\tfrac{1}{2}\right) = v_x\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t+\tfrac{1}{2}\right) + \frac{\Delta t}{\rho(I)}\left[ J_y\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t\right) B_z(I) - J_z\!\left(i+\tfrac{1}{2},\, j+\tfrac{1}{2},\, k+\tfrac{1}{2},\, t\right) B_y(I) \right] + \alpha. \qquad (17)$$

Eq. (13) is now split into three steps, Eqs. (15)–(17). To improve efficiency, we arrange the splitting so that, as far as possible, an input used in one step is not needed again in the other steps. The same method is applied to the other equations in every step of the modified leapfrog scheme. Apart from the initialization, the overall program flow is shown in Algorithm 1; note that the physical quantities are $\phi = \{\rho, \mathbf{v}, p, \mathbf{B}\}$.
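A minimal CUDA sketch of this splitting is shown below (our own illustration, not the authors' code): the vx update is performed by three small kernels corresponding to Eqs. (15)–(17), each of which receives only the arrays it needs. The one-sided differences and the flat indexing with a hypothetical stride_x are placeholders for the full D_x, D_y, D_z stencils of Eq. (14), and only the x-advection term of Eq. (15) is shown.

```cuda
#include <cuda_runtime.h>

// Each kernel receives only the arrays its sub-step needs, keeping the
// argument list short.

__global__ void vx_step1(double *vx_half, const double *vx, double dt_4dx,
                         int n, int stride_x) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n - stride_x) return;
    double dvx = vx[c + stride_x] - vx[c];            // stands in for D_x[vx]
    vx_half[c] = vx[c] - dt_4dx * vx[c] * dvx;        // Eq. (15), schematic
}

__global__ void vx_step2(double *vx_half, const double *p, const double *rho,
                         double dt_4dx, int n, int stride_x) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n - stride_x) return;
    double dp = p[c + stride_x] - p[c];               // stands in for D_x[p]
    vx_half[c] -= dt_4dx * dp / rho[c];               // Eq. (16), schematic
}

__global__ void vx_step3(double *vx_half, const double *jy, const double *jz,
                         const double *by, const double *bz, const double *rho,
                         double dt, int n) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n) return;
    vx_half[c] += dt * (jy[c] * bz[c] - jz[c] * by[c]) / rho[c];   // Eq. (17)
}

// Host-side wrapper: three short launches replace a single kernel that would
// otherwise need pointers to every field array at once.
void update_vx(double *vx_half, const double *vx, const double *p,
               const double *rho, const double *jy, const double *jz,
               const double *by, const double *bz,
               double dt, double dx, int n, int stride_x) {
    const int threads = 256, blocks = (n + threads - 1) / threads;
    const double dt_4dx = dt / (4.0 * dx);
    vx_step1<<<blocks, threads>>>(vx_half, vx, dt_4dx, n, stride_x);
    vx_step2<<<blocks, threads>>>(vx_half, p, rho, dt_4dx, n, stride_x);
    vx_step3<<<blocks, threads>>>(vx_half, jy, jz, by, bz, rho, dt, n);
}
```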

Eq. (13) is now split into three steps (Eqs. (15)–(17)). To improve the efficiency, we have tried our best to make an input in each step not to be called again in other steps or be called as less as possible. The same method is applied to other equations in every step in the modified leapfrog scheme. Besides the initialization, the overall program flow is shown as follows; note that the physics quantity is φ = {ρ, V, p, B}: 4.3. Data exchanging using Peer-to-Peer Direct Transfer The numerical scheme of MHD equations is grid-based. Each grid point of the domain (mesh) will be updated via a series of calculations using the values of its adjacent grid points. The common method is to let boundary grid points as a copy of the corresponding adjacent grid points of the adjacent partition (see Fig. 4). Before Fermi GPUs and GPUDirect v2.0 (included in CUDA 4.0) were released, the data exchange between two GPUs was always the bottleneck in implementing an efficient multi-GPU program. Data copy from one GPU to the other GPU has to be copied to the RAM via a DeviceToHost copying, and then copied to the other GPU via HostToDevice copying. To overcome this problem, GPUDirect v2.0 provides the Peer-to-Peer Direct Transfer and Peer-to-Peer Direct Access features with the Fermi GPUs. Peer-toPeer Direct Transfer provides a high performance DeviceToDevice copying between GPUs. Peer-to-Peer Direct Transfer provides a more efficient copy other than that of the previous generation GPUs. According to our test, it preformed 2.3× throughput to the optimized GPU– CPU–GPU copy (6.19 GB/s v.s. 2.68 GB/s) between two M2050 GPUs on the same PCI-e switch. However, Peer-to-Peer Direct Transfer is not fully supported on all multi-GPU systems. PCI-e slots are connected using PCI-e switches or I/O hubs (IOHs) on the motherboard. Two PCI-e slots are connected with a PCI-e switch. Two PCIe switches are connected via an IOH. And two IOHs are connected via the QPI interface. Unfortunately, Peer-to-Peer Direct Transfer between IOHs (thought the QPI interface) is not yet supported in the current CUDA; thus a memory copy between two GPUs via two IOHs has to perform a GPU–CPU–GPU copy. Peer-to-Peer Direct Transfer is only supported when two PCI-e switches connect the same IOH; however, it does not perform the best bandwidth in this case. Our test shows that Peer-to-Peer Direct Transfer performed about 80% of its best performance. In conclusion, exchanging the data of the adjacent grid points between GPUs via Peer-toPeer Direct Transfer basically requires the same time period as a GPU–CPU–GPU copy; this is due to the fact that the data needs to be transferred between IOHs when using Peer-to-Peer Direct Transfer. 4.4. Transfer order scheduling The multi-GPU system we used is the FT72-B7015 multi-GPU solution produced by Tyan [27] (see Fig. 5). The system has eight


Algorithm 1 GPU implementation of the modified leapfrog scheme

1: Exchange the "bottom" and "top" (z = 0 and z = nz) slices with the adjacent partition domains;
2: Assign boundary conditions and calculate the boundary data;
3: Calculate the current, J ← B(t);
4: If l = 1, interpolate φ(i + 1/2, j + 1/2, k + 1/2, t) and proceed with the two-step Lax–Wendroff scheme; else use φ(i + 1/2, j + 1/2, k + 1/2, t − 1/2) to proceed with the leapfrog scheme;
5: The 1st step of the numerical scheme: if l = 1 (two-step Lax–Wendroff scheme), φ(i + 1/2, j + 1/2, k + 1/2, t + 1/2) ← [J, φ(i, j, k, t), φ(i + 1/2, j + 1/2, k + 1/2, t)]; else (leapfrog scheme), φ(i + 1/2, j + 1/2, k + 1/2, t + 1/2) ← [J, φ(i, j, k, t), φ(i + 1/2, j + 1/2, k + 1/2, t − 1/2)];
6: Calculate the current, J ← B(t + 1/2);
7: The 2nd step of the numerical scheme: φ(i, j, k, t + 1) ← [J, φ(i, j, k, t), φ(i + 1/2, j + 1/2, k + 1/2, t + 1/2)];
8: If t < t_finish, set t = t + ∆t, l = (l + 1) mod 8, and repeat from Step 1; else finish the simulation and output the results.
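Read as host code, Algorithm 1 amounts to the loop sketched below (our own paraphrase with hypothetical function names such as exchange_halos and launch_first_step, assumed to be defined elsewhere in the simulator; this is not the authors' code). Each iteration walks over the GPUs, switching the current device before launching the corresponding kernel groups.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel-group launchers, assumed to be defined elsewhere;
// only the control flow of Algorithm 1 is illustrated here.
void exchange_halos(int ngpu);
void launch_boundary(int gpu);
void launch_current(int gpu, bool half_step);
void launch_interpolation(int gpu);
void launch_first_step(int gpu, bool lax_wendroff);
void launch_second_step(int gpu);

void run_simulation(int ngpu, int nsteps) {
    int l = 1;                                        // position in the sequence of 8
    for (int step = 0; step < nsteps; ++step) {
        exchange_halos(ngpu);                         // Step 1: z = 0 / z = nz slices
        for (int g = 0; g < ngpu; ++g) {
            cudaSetDevice(g);
            launch_boundary(g);                       // Step 2: boundary data
            launch_current(g, false);                 // Step 3: J <- B(t)
            if (l == 1) launch_interpolation(g);      // Step 4: interpolate phi(..., t)
            launch_first_step(g, l == 1);             // Step 5: Lax-Wendroff / leapfrog
            launch_current(g, true);                  // Step 6: J <- B(t + 1/2)
            launch_second_step(g);                    // Step 7: phi(t + 1)
        }
        for (int g = 0; g < ngpu; ++g) {              // keep all partitions at the same t
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
        l = (l % 8) + 1;                              // Step 8: advance the sequence index
    }
}
```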

Fig. 5. The multi-GPU system we used.

4.4. Transfer order scheduling

The multi-GPU system we used is the FT72-B7015 multi-GPU solution produced by Tyan [27] (see Fig. 5). The system has eight M2050 Tesla GPUs installed, which we name GPU-0 to GPU-7. PCI-e slots 0 and 1, 2 and 3, 4 and 5, and 6 and 7 are each connected through a PCI-e switch, which gives the best performance with GPUDirect v2.0. One IOH connects the PCI-e switches of GPUs 0–1 and 2–3; another connects the switches of GPUs 4–5 and 6–7. The two IOHs, and therefore GPUs 0–1–2–3 and 4–5–6–7, communicate via the QPI interface. Since a copy through the QPI takes the longest time, it can be expected that the steps exchanging the data of the adjacent grid points will take as long as a GPU–CPU–GPU copy. On the other hand, when a GPU is transferring data to another GPU, the PCI-e interfaces of both GPUs become busy and block all other accesses. For example, when GPU-0 is transferring data to GPU-1, accesses to the device memory of GPU-0 or GPU-1 from the other GPUs have to wait until the data transfer between GPU-0 and GPU-1 has finished. In this situation, one way to obtain better efficiency is to schedule overlapping data transfers between GPUs. Let us define $T_{a,b}$ as the time expense of transferring data from GPU-a to GPU-b. In theory, the best scheduling makes the total time expense $2 \times (T_{3,4} + T_{4,5})$, where $T_{3,4}$ is the time expense of transferring data from GPU-3 to GPU-4 through the IOHs. The procedure in our implementation is as follows.

(1) Copy data from GPU-3 to GPU-4, from GPU-1 to GPU-2, and from GPU-5 to GPU-6, simultaneously.
(2) Copy data from GPU-0 to GPU-1, from GPU-2 to GPU-3, from GPU-4 to GPU-5, and from GPU-6 to GPU-7, simultaneously.
(3) Repeat the above steps in the reverse transfer direction (for example, from GPU-4 to GPU-3, from GPU-1 to GPU-0, and so on).

With this procedure, the data transfers via switches (longer copying time) and via the IOHs (longest copying time) are processed simultaneously, so the first step takes the longest time expense, $T_{3,4}$. Then all other data transfers between directly connected GPUs (fastest copying time) are processed, taking $T_{0,1} = T_{2,3} = T_{4,5} = T_{6,7}$. As a result, the total time expense is $T_{3,4} + T_{4,5}$ for the one-way transfer, and $2 \times (T_{3,4} + T_{4,5})$ including both the forward and backward transfers (the so-called right phase and left phase).

4.5. Concurrent processing and synchronization between GPUs

To further exploit the parallelism of multiple GPUs, asynchronous copies between GPUs are applied. Note that the adjacent grid points of the forward transfer and the backward transfer do not overlap: in one partition domain, the slice $z = 0$ is renewed by the slice $z = nz - 1$ of the neighboring partition, while the slice $z = nz$ of that partition is renewed by the slice $z = 1$, and so on (see Fig. 6), so the data exchanges between GPUs can be processed simultaneously without conflict. In the procedure of Section 4.4, each pair of GPUs works separately; for example, the data transfer from GPU-1 to GPU-2 is processed in Step 1 and the data transfer from GPU-0 to GPU-1 is processed in Step 2. GPU-1 is only involved once in each step, so the expected transfer time is $2 \times (T_{3,4} + T_{4,5})$. However, the GPU-0 to GPU-1 and GPU-1 to GPU-2 transfers can be processed at the same time via non-blocking communication, so Steps 1 and 2 of Section 4.4 can actually be done together; the transfer time then becomes $T_{3,4}$ one way, and $2 \times T_{3,4}$ for both forward and backward transfers, plus the extra time for synchronization. CUDA Streams are used to implement these processes. Fermi GPUs provide concurrent kernel processing and data transfer, which means that not only the data transfers but also the calculation kernels can run in parallel if the resources of the GPUs are sufficient; such concurrent methods have been used in [28]. CUDA Events are used to notify the other GPUs when a concurrent kernel has finished. It is very important to synchronize the GPUs after each step is processed, guaranteeing that all the partitions being calculated stay within the same time step. The corresponding techniques for asynchronous copies, concurrent processing, CUDA Streams, and CUDA Events can be found in [15,29–31], respectively. Since the data transfers and the calculation kernels can be processed concurrently, and the MHD calculations actually take much longer than the data transfers, the time expense mostly depends on the processing time of the calculation kernels.
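A sketch of how such an overlapped exchange can be expressed with CUDA streams and events is given below (our own illustration with hypothetical names; the authors describe the idea, but this is not their code). Each GPU owns a stream and an event created on that device; the forward-phase copies are issued asynchronously, and the event lets kernels that read the halo wait only for their own copy.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-GPU state: a stream and a "halo arrived" event, both
// created on that GPU, plus device pointers to the slices being exchanged.
struct GpuCtx {
    cudaStream_t stream;
    cudaEvent_t  halo_done;
    double      *send_slice;   // z = nz - 1 slice sent to the next partition
    double      *recv_slice;   // z = 0 halo received from the previous partition
};

// Forward ("right") phase of the halo exchange for a chain of ngpu GPUs.
// All copies are issued asynchronously and may overlap where the PCI-e
// topology allows it; each destination GPU records an event on completion.
void forward_phase(GpuCtx *ctx, int ngpu, size_t slice_bytes) {
    for (int g = 0; g + 1 < ngpu; ++g) {
        cudaSetDevice(g + 1);
        cudaMemcpyPeerAsync(ctx[g + 1].recv_slice, g + 1,
                            ctx[g].send_slice, g,
                            slice_bytes, ctx[g + 1].stream);
        cudaEventRecord(ctx[g + 1].halo_done, ctx[g + 1].stream);
    }
}

// Kernels that read the halo on GPU g are queued behind this wait, so they
// start only after the copy has finished, without blocking the host thread.
void wait_for_halo(GpuCtx *ctx, int g) {
    cudaSetDevice(g);
    cudaStreamWaitEvent(ctx[g].stream, ctx[g].halo_done, 0);
}
```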


Fig. 6. Data exchange of the adjacent partitions.

Fig. 7. Pressure after running 840 time steps. The values in the top half of the panel are along the noon–midnight meridian, while those in the bottom half of the figure are in the equatorial plane.

Fig. 8. Pressure after running 1800 time steps. The values in the top half of the panel are along the noon–midnight meridian, while those in the bottom half of the figure are in the equatorial plane.

5. Experiment results and analysis

We have implemented a 3D global magnetohydrodynamic (MHD) simulator using CUDA 4.0 with GPUDirect 2.0 on multiple GPUs. We have tested our code on a multi-GPU system with dual Intel X5680 Xeon 3.33 GHz CPUs, 96 GB of main memory, and eight M2050 Tesla GPUs (each with 3 GB of video memory), running the Ubuntu Linux 10.04 operating system.

Here we show an example of the results obtained with our multi-GPU code for the 3D MHD simulation of the solar wind and the Earth's magnetosphere for southward IMF ($B_{IMF} = -5$ nT) in Figs. 7 and 8. The grid was (180, 60, 60) with a mesh size of $0.5 R_E$. At time step 0, the southward IMF entered the simulation box at its upstream edge. Figs. 7 and 8 show the pressure after running 840 time steps and 1800 time steps, respectively. We have compared the simulation results of our implementation with the results in [32,33] to validate the correctness of our code. To measure the performance, we ran our multi-GPU code on the system mentioned above and the original FORTRAN CPU code [17] on a single core of an Intel X5680 Xeon 3.33 GHz CPU, so that the comparison is not affected by CPU-side parallelization. All results were calculated in double precision. We also compared our performance with the results evaluated on two typical scalar-type supercomputer systems by Fukazawa and Umeda [10].

Table 1
Average data transfer time and the number of exchanged elements between GPUs at different resolutions.

| Number of grid points | Number of adjacent grids | Data transfer time (ms/step) | Total throughput (GB/s) |
|---|---|---|---|
| 180 × 60 × 60 | 11 284 | 2.193 | 0.539 |
| 320 × 80 × 80 | 26 404 | 4.066 | 0.679 |
| 480 × 160 × 160 | 78 084 | 7.834 | 1.040 |
| 640 × 160 × 160 | 104 004 | 9.793 | 1.106 |
| 800 × 200 × 200 | 162 004 | 12.746 | 1.323 |

Table 2
Comparison of the peak speed and average speed of our multi-GPU code at different resolutions.

| Number of grid points | Peak speed (ms/step) | Average speed (ms/step) | Ratio |
|---|---|---|---|
| 180 × 60 × 60 | 16.1270 | 41.4667 | 2.57 |
| 320 × 80 × 80 | 27.2390 | 71.2172 | 2.62 |
| 480 × 160 × 160 | 124.9045 | 165.2347 | 1.32 |
| 640 × 160 × 160 | 164.7110 | 270.0812 | 1.64 |
| 800 × 200 × 200 | 328.0580 | 370.4031 | 1.13 |

Fig. 9. Average data transfer time between GPUs at different resolutions (unit: ms/step).

Table 1 and Fig. 9 give the average data transfer times measured in our runs. Note that the adjacent grids in Table 1 and Fig. 9 include the boundaries in the x and y dimensions (adjacent grids = $(SIZE_x + 2) \times (SIZE_y + 2)$). The total throughput was calculated from the $n - 1 = 7$ right-phase and left-phase data transfers. We consider that the exchanged data were too small to reach the highest bandwidth of the GPUs; this can be seen from the jump in transfer speed between the resolutions of 640 × 160 × 160 and 800 × 200 × 200. Given the measured bandwidth of the data transfer between GPUs (2.68 GB/s for a GPU–CPU–GPU copy), we believe our code would reach a higher bandwidth when the number of adjacent grids is larger.

Table 2 shows the peak and average speeds of our multi-GPU code at different resolutions. It reveals that our multi-GPU code is more stable when the resolution is larger. For example, the peak speed and average speed for the 180 × 60 × 60 mesh are 16.1270 ms and 41.4667 ms per step, respectively, a ratio of 2.57×, whereas the peak and average speeds for the 800 × 200 × 200 mesh are 328.0580 ms and 370.4031 ms per step, a ratio of only 1.13×. We consider that the difference between the peak speed and the average speed arises because, at larger resolutions, the time the GPUs spend waiting for each other at the synchronization step becomes much smaller than the time all the GPUs spend processing the data. The low performance when processing small amounts of data is mostly due to the synchronization overhead, which appears in two steps (before and after the data copy between GPUs) of a simulation run. Since every GPU processes concurrently, the threads processed within the same GPU do not finish their jobs at the same time. Therefore, the synchronization overhead varies and is hard to measure explicitly. Considering that the peak speed has the least synchronization overhead, the total synchronization overhead is around 40 ms for most cases in our measurements; the synchronization overheads are about 25 ms and 105 ms for the 180 × 60 × 60 and 640 × 160 × 160 meshes, respectively. These amounts of time are acceptable since we get 8× the computational power with eight GPUs. The peak performance of our multi-GPU code is 97.36 GFLOPS, while the original FORTRAN CPU code on a single core achieves 1.27 GFLOPS. Fukazawa and Umeda [10] reported that the peak performance of the CPU code with the 1D decomposition method on the HA8000 supercomputer, which has 952 nodes, each with four AMD Opteron 8386 processors (quad core, 2.3 GHz), reaches 100 GFLOPS using about 90 cores (see Fig. 3 in [10]).

6. Conclusion and future work

We have presented a 3D global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. The modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme, has been implemented. Performance measurements were made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards; these measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision. The data transfer between GPUs has always been a bottleneck in multi-GPU programs; our approach uses the latest GPUDirect v2.0 as well as the concurrent processing ability of the Fermi GPU architecture to speed up the data transfer. We have analyzed the performance of GPUDirect v2.0 between GPUs connected via PCI-e switches and IOHs, i.e., over the PCI-e and QPI interfaces of our system. Even though the performance of GPUDirect v2.0 is limited because it is not supported across the QPI interface, our code achieves a relatively high FLOPS rate. We look forward to extending our work to a GPU supercomputer such as TSUBAME 2.0 [34]. Since NVIDIA Quadro GPUs and Tesla GPUs can now work together in a workstation through NVIDIA Maximus Technology [35], we are also very interested in integrating real-time visualization techniques into our simulator to provide simultaneous visualization of the simulation results. We also look forward to new GPUs and new CUDA versions that enable peer-to-peer GPU communication across the QPI interface.

Acknowledgments

This work was supported by the Science and Technology Development Fund of Macao SAR (004/2011/A1 and 080/2012/A3) and the National High-Technology Research and Development Program of China (2010AA122205). The authors would like to thank Professor Tatsuki Ogino of the Solar-Terrestrial Environment Laboratory, Nagoya University, for providing the FORTRAN global MHD code. Special thanks to the anonymous reviewers for their constructive and valuable comments, which helped us to improve the paper. Thanks also to Dr. Keiichiro Fukazawa of the Research Institute for Information Technology, Kyushu University, for his suggestions on this work.


References

[1] J.N. Leboeuf, T. Tajima, C.F. Kennel, J.M. Dawson, Global simulation of the time-dependent magnetosphere, Geophysical Research Letters 5 (1978) 609–612.
[2] J.N. Leboeuf, T. Tajima, C.F. Kennel, J.M. Dawson, Global simulations of the three-dimensional magnetosphere, Geophysical Research Letters 8 (1981) 257–260.
[3] J. Raeder, Global magnetohydrodynamics—a tutorial, in: J. Büchner, C.T. Dum, M. Scholer (Eds.), Space Plasma Simulation, Springer, 2003, pp. 212–246.
[4] T. Ogino, Three-dimensional global MHD simulation code for the earth's magnetosphere using HPF/JA, Concurrency and Computation: Practice and Experience 14 (2002) 631–646.
[5] T. Ogino, Global magnetohydrodynamic simulation using high performance FORTRAN on parallel computers, in: J. Büchner, C.T. Dum, M. Scholer (Eds.), Space Plasma Simulation, Springer, 2003, pp. 296–314.
[6] Z. Huang, C. Wang, Y. Hu, X. Guo, Parallel implementation of 3D global MHD simulations for earth's magnetosphere, Computers and Mathematics with Applications 55 (2008) 419–425.
[7] K. Reuter, F. Jenko, C.B. Forest, R.A. Bayliss, A parallel implementation of an MHD code for the simulation of mechanically driven, turbulent dynamos in spherical geometry, Computer Physics Communications 179 (2008) 245–249.
[8] U. Ziegler, The NIRVANA code: parallel computational MHD with adaptive mesh refinement, Computer Physics Communications 179 (2008) 227–244.
[9] K. Fukazawa, T. Umeda, T. Miyoshi, N. Terada, Y. Matsumoto, T. Ogino, Performance measurement of magneto-hydro-dynamic code for space plasma on the various scalar type supercomputer systems, IEEE Transactions on Plasma Science 38 (2010) 22–54.
[10] K. Fukazawa, T. Umeda, Performance measurement of magnetohydrodynamic code for space plasma on the typical scalar type supercomputer systems with the large number of cores, International Journal of High Performance Computing Applications 26 (2012) 310–318.
[11] H.-C. Wong, U.-H. Wong, X. Feng, Z. Tang, Efficient magnetohydrodynamic simulations on graphics processing units with CUDA, Computer Physics Communications 182 (2011) 2132–2160.
[12] X. Feng, D. Zhong, C. Xiang, Y. Zhang, GPU computing in space weather modeling, in: ASTRONUM 2012—the 7th Annual International Conference on Numerical Modeling of Space Plasma Flows, in: ASP Conference Series, vol. 474, 2012, pp. 131–139.
[13] X. Feng, D. Zhong, C. Xiang, Y. Zhang, GPU-accelerated computing of three-dimensional solar wind background, Science China Earth Sciences (2013) http://dx.doi.org/10.1007/s11430-013-4661-y.
[14] X. Feng, L. Yang, C. Xiang, C. Jiang, X. Ma, S.T. Wu, D. Zhong, Y. Zhou, Validation of the 3D AMR SIP-CESE solar wind model for four Carrington rotations, Solar Physics 279 (2012) 207–229.
[15] NVIDIA Corporation, NVIDIA CUDA compute unified device architecture programming, 2011. http://developer.nvidia.com/cuda-toolkit.
[16] T. Ogino, A three-dimensional MHD simulation of the interaction of the solar wind with the earth's magnetosphere: the generation of field-aligned currents, Journal of Geophysical Research 91 (1986) 6791–6806.
[17] T. Ogino, R.J. Walker, M. Ashour-Abdalla, A global magnetohydrodynamic simulation of the magnetosheath and magnetopause when the interplanetary magnetic field is northward, IEEE Transactions on Plasma Science 20 (1992) 817–828.
[18] T. Ogino, R.J. Walker, M.G. Kivelson, A global magnetohydrodynamic simulation of the Jovian magnetosphere, Journal of Geophysical Research 103 (1998) 225–235.
[19] K. Fukazawa, T. Ogino, R.J. Walker, Dynamics of the Jovian magnetosphere for northward interplanetary magnetic field (IMF), Geophysical Research Letters 32 (2005) L03202. http://dx.doi.org/10.1029/2004GL021392.
[20] K. Fukazawa, T. Ogino, R.J. Walker, Configuration and dynamics of the Jovian magnetosphere, Journal of Geophysical Research 111 (2006) A10207. http://dx.doi.org/10.1029/2006JA011874.
[21] K. Fukazawa, T. Ogino, R.J. Walker, A simulation study of dynamics in the distant Jovian magnetotail, Journal of Geophysical Research 115 (2010) A09219. http://dx.doi.org/10.1029/2009JA015228.
[22] K. Fukazawa, S. Ogi, T. Ogino, R.J. Walker, Magnetospheric convection at Saturn as a function of IMF BZ, Geophysical Research Letters 34 (2007) L01105. http://dx.doi.org/10.1029/2006GL028373.
[23] K. Fukazawa, T. Ogino, R.J. Walker, Vortex-associated reconnection for northward IMF in the Kronian magnetosphere, Geophysical Research Letters 34 (2007) L23201. http://dx.doi.org/10.1029/2007GL031784.
[24] R.J. Walker, K. Fukazawa, T. Ogino, D. Morozoff, A simulation study of Kelvin–Helmholtz waves at Saturn's magnetopause, Journal of Geophysical Research 116 (2011) A03203. http://dx.doi.org/10.1029/2010JA015905.
[25] K. Fukazawa, T. Ogino, R.J. Walker, A magnetohydrodynamic simulation study of Kronian field-aligned currents and auroras, Journal of Geophysical Research 117 (2012) A02214. http://dx.doi.org/10.1029/2011JA016945.
[26] D. Potter, Computational Physics, John Wiley and Sons, 1973.
[27] TYAN high performance barebone system service engineer's manual for model FT72-B7015, 2009. http://www.tyan.com.
[28] M. Bernaschi, M. Fatica, G. Parisi, L. Parisi, Multi-GPU codes for spin systems simulations, Computer Physics Communications 183 (2012) 1416–1421.
[29] NVIDIA Corporation, CUDA C programming best practices guide, 2011. http://developer.nvidia.com/cuda-toolkit.
[30] CUDA Readiness Tech Brief, 2011. http://developer.nvidia.com/cuda-toolkit40rc2.
[31] P. Micikevicius, Multi-GPU and host multi-threading considerations, 2011. Available on GPU Computing Webinars: http://developer.nvidia.com/gpucomputing-webinars.
[32] R.J. Walker, T. Ogino, J. Raeder, M. Ashour-Abdalla, A global magnetohydrodynamic simulation of the magnetosphere when the interplanetary magnetic field is southward: the onset of magnetotail reconnection, Journal of Geophysical Research 98 (A10) (1993) 17235–17249.
[33] T. Ogino, R.J. Walker, M. Ashour-Abdalla, A global magnetohydrodynamic simulation of the response of the magnetosphere to a northward turning of the interplanetary magnetic field, Journal of Geophysical Research 99 (1994) 11027–11042.
[34] TSUBAME 2.0, 2011. http://tsubame.gsic.titech.ac.jp/en/tsubame2-systemarchitecture.
[35] NVIDIA Maximus Technology, 2011. http://www.nvidia.com/object/maximus.html.