Hybrid parallelization of the XTOR-2F code for the simulation of two-fluid MHD instabilities in tokamaks

Alain Marx*, Hinrich Lütjens
Centre de Physique Théorique, Ecole polytechnique, CNRS, Université Paris-Saclay, F-91128 Palaiseau, France
* Corresponding author. E-mail address: [email protected] (A. Marx).

Article history: Received 16 July 2016; Received in revised form 3 October 2016; Accepted 27 October 2016; Available online xxxx

Keywords: Implicit; Newton–Krylov; MPI/OpenMP; SPIKE

Abstract

A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130], solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton–Krylov matrix-free method, has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than that of the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it makes it possible to perform simulations with higher resolutions, previously out of reach because of memory limitations.

1. Introduction

Realistic 3D simulations of tokamak MHD instabilities such as tearing modes or sawtooth oscillations are a notoriously difficult numerical problem, mainly because of the stiffness of the system of equations to be solved. An explicit time advance scheme is limited by a Courant–Friedrichs–Lewy (CFL) condition on the fast compressional magneto-acoustic waves, which scale as τ_a/h, where τ_a is the toroidal Alfvén time and h is the discretization grid size (τ_a = R_0/v_a, where R_0 is the major axis of the torus and v_a = B/ρ^{1/2} is the Alfvén velocity). For simulations of the order of 5×10^4 − 5×10^5 τ_a, as typically performed in physical parameter studies [1–10], this is unacceptable in terms of CPU costs.

On computers of the 1980–1990s, the CPU cost of an implicit time advance scheme was prohibitive. In spite of this, a number of tools were developed during that period in 3D resistive MHD [11–13], several of them based on a semi-implicit method [14–17]. This was also the choice for the first XTOR and NIMROD codes [18–20]. However, semi-implicit methods turned out to be difficult to generalize to more sophisticated physical models, such as two-fluid MHD or two-fluid equations (in two-fluid MHD, only one density and one temperature or pressure equation is evolved). Between 2000 and 2010, increased computer performance allowed the solution of the 3D problem by fully implicit methods. New codes were developed, such as XTOR-2F [21], M3D-C1 [22] and JOREK [23].


For the first time, pre-conditioned iterative matrix-free Newton–Krylov (N–K) methods were used to solve the implicit time advance [21,24,25]. This method, however, has the drawback that its efficiency depends sensitively on the quality of the pre-conditioner, which must reduce the number of N–K iterations significantly. A detailed review of these developments over the last 30 years can be found in Ref. [26].

In the version of the XTOR-2F code used in the present paper, the pre-conditioner is a so-called ‘‘physical’’ one, i.e. it is based on the linearization of the two-fluid MHD equations about the equilibrium used as initial condition and provided by the CHEASE code [27,28]. In XTOR-2F, the number of iterations of the N–K solver is controlled by tuning the time step, see Ref. [21], Section 4.1. This method has the advantage of adapting the time resolution during simulation phases where the dynamics evolves rapidly or slowly. As a consequence, every change of the time step is accompanied by a reconstruction of the pre-conditioner. In long time simulations, such an operation occurs several dozen times.

The choice in XTOR-2F of an iterative pre-conditioned N–K method for the time advance and of a pseudo-spectral Fourier decomposition in the angular directions has a deep impact on the efficiency of the numerical method. The pre-conditioner is stored in Fourier space. It is dense, diagonal in the toroidal mode number n, and block penta-diagonal in the two other directions (this matrix profile arises from the linearization of the equations used for pre-conditioning). This matrix is small compared to the one of the same problem discretized with finite elements [20,22,23], which leads to large sparse matrices.

This choice in XTOR-2F was efficient on CRAY or NEC vector machines, because it mainly involved operations such as many FFTs on arrays of the same dimensions, or matrix–matrix and matrix–vector products of blocks due to poloidal couplings in the pre-conditioner during its LU decomposition and solution, which can all easily be multi-threaded or vectorized. However, the small size and the profile of the pre-conditioning matrix are a significant hurdle for an efficient MPI parallelization because of the numerical cost of the interprocess communications. An MPI parallelization of the complete time advance method requires significant changes in the method of solution.

A two-level MPI parallelization method has been implemented in XTOR-2F. The first level consists both of dividing the pre-conditioning along the toroidal modes n and of splitting the evaluation of the two-fluid equations solved by XTOR-2F into sub-operators. The second level consists of a domain decomposition of the entire problem along the minor radius of the torus. Every N–K iteration requires one inversion of the pre-conditioner and one evaluation of the two-fluid equations. The pre-conditioner is stored and applied to the equations in Fourier space, which renders the toroidal splitting straightforward. The two-fluid equations are evaluated in physical space, thus making a division into sub-operators straightforward. The price to pay for both the toroidal and the operator splitting is global MPI communications at the end of every N–K iteration, for every radial sub-domain, between the MPI processes of the first level of MPI parallelization (i.e. toroidal and operator splitting). It is emphasized, however, that these communications decrease with increasing radial (second-level) MPI parallelization.

The pre-conditioner inversion is performed by an LU decomposition and forward and backward substitutions. The exact LU method is motivated by the ill-conditioning of the pre-conditioner matrix. It is MPI parallelized along the toroidal mode number n, and with the SPIKE algorithm [29] for every n. As a side effect, this distribution of the matrix over many CPUs reduces the per-process memory requirements, which was a problem for large resolution cases with the original sequential version of the code. The disadvantage of the SPIKE method is that a matrix due to communications between the different radial sub-domains must be solved. Of course, the method is only efficient if this communication matrix remains small.

For the N–K iterative method, XTOR-2F now uses the PETSc library [30], which is MPI parallelized. At the moment, XTOR-2F only benefits from this property at the second level of parallelization, i.e. the radial domain decomposition. This is not a handicap because the amount of operations spent exclusively in the PETSc part of N–K is small. Finally, the code makes extensive use of shared memory parallelization within every MPI process, i.e. threading in the BLAS and LAPACK routines, which are used extensively throughout the code, and coarse grain OpenMP for FFTs and native XTOR-2F algebra routines. The code spends significant time swapping between physical and Fourier space, for instance when inverting the pre-conditioner, performing angular filtering or angular derivatives. Many comparisons were made between different FFT packages, the most efficient being MKL [31] and FFTW [32]. For portability reasons, the latter is used in general.

The paper is organized as follows. Section 2 introduces the extended MHD model. Section 3 presents the numerical scheme of the sequential version of the code. The parallelization strategy is described in Section 4. Performance results are given in Section 5. A physical application is presented in Section 6. Finally, conclusions are given in Section 7.

2. Equations

The XTOR-2F code uses a two-fluid MHD model based on the drift kinetic equations that includes resistivity, viscosity, anisotropic thermal transport and electron and ion diamagnetic effects [21].




These equations are derived following Braginskii [33], expanding the perpendicular ion velocity as a sum of drift velocities. Retaining the lowest order drift terms, i.e. v_i = v + v_i^* with v = E×B/B^2 + v_{i∥}, gives the set of equations

$$
\begin{cases}
\partial_t \rho = -\rho\,\nabla\cdot v - v\cdot\nabla\rho - \alpha\,\nabla p_i\cdot\nabla\times\dfrac{B}{B^2} + \nabla\cdot D_\perp\nabla\rho + S \\[4pt]
\rho\,\partial_t v = -\rho\,(v\cdot\nabla)\,v - \rho\,(v_i^*\cdot\nabla)\,v_\perp + J\times B - \nabla p + \nu\,\nabla^2 v \\[4pt]
\partial_t p = -\Gamma p\,\nabla\cdot v - v\cdot\nabla p - \alpha\Gamma\,\dfrac{p}{\rho}\,\nabla p_i\cdot\nabla\times\dfrac{B}{B^2} + \nabla\cdot\Big(\chi_\perp\,\rho\,\nabla\dfrac{p}{\rho}\Big) + \nabla\cdot\Big(\dfrac{B}{B}\,\chi_\parallel\,\rho\,\nabla_\parallel\dfrac{p}{\rho}\Big) + H \\[4pt]
\partial_t B = \nabla\times(v\times B) + \alpha\,\nabla\times\dfrac{\nabla_\parallel p_e}{\rho} - \nabla\times\eta J
\end{cases}
\tag{1}
$$

where

$$
\begin{aligned}
&\alpha \equiv \frac{1}{\omega_{ci}\,\tau_A}\ \text{(ion skin depth in diamagnetic effects)};\qquad \omega_{ci} = \frac{qB}{m}\\
&v_i^* \equiv \alpha\,\frac{B\times\nabla p_i}{\rho B^2}\ \text{(ion diamagnetic velocity)}\\
&p_e \equiv \frac{p}{2};\qquad p_i \equiv \frac{p}{2}\\
&\nu \equiv \text{viscosity};\qquad \eta \equiv \text{resistivity}\\
&D_\perp \equiv \text{diffusion coefficient};\qquad \chi_\perp,\ \chi_\parallel \equiv \text{anisotropic transport coefficients}\\
&S \equiv -\nabla\cdot D_\perp\nabla\rho(t=0)\ \text{(density source)};\qquad H \equiv -\nabla\cdot\Big(\chi_\perp\,\rho\,\nabla\frac{p(t=0)}{\rho}\Big)\ \text{(heat source)}
\end{aligned}
$$

∥ and ⊥ indicate the directions along and perpendicular to the magnetic field B(t). The perpendicular diffusive source terms S and H restore the initial density and pressure profiles, respectively, within their characteristic times. The geometry of the plasma is toroidal with non-circular poloidal cross-sections.
3. Sequential numerical scheme

For the sake of clarity and self-consistency of the present work, this section summarizes the method used to solve Eqs. (1) in XTOR-2F. For a detailed description, see Refs. [19,21].

3.1. Numerical scheme and mesh

The MHD model applied to tokamak plasmas is a stiff problem due to the great separation of scales involved. High frequency modes (compressional Alfvén waves) coexist with low frequency modes (shear Alfvén waves). Moreover, the dominant physical phenomena are slow in comparison to the basic time scale (the Alfvén time τ_a). Hence interesting simulations require evolving the dynamics for long times. For instance, sawtooth oscillation simulations as in Refs. [7,8] require 10^5 − 10^6 τ_a, and the fastest compressional Alfvén frequencies in the discretized spectrum of XTOR-2F scale as ω = kR_0/τ_a ∼ (m_max/h)R_0/τ_a, where R_0 is the major axis of the torus, m_max the maximum poloidal mode number and h the radial mesh size. With e.g. R_0 = 3, m_max = 64 and h ∼ 10^{-2}, the ratio of time scales involved in the complete dynamics of the discretized problem is ∼10^{10}.

Numerical time advance schemes were developed which selectively filter out fast modes, as for instance the semi-implicit scheme in the first version of XTOR [18,19]. This was well adapted to resistive MHD simulations. However, the inclusion of two-fluid effects and the ongoing effort to take kinetic effects into account required a shift to a fully implicit numerical scheme [21]. This fully implicit version uses the following time advance scheme:

$$
x^{n+1} - x^{n} - \Delta t\, F\!\left[\frac{x^{n+1} + x^{n}}{2} + \Theta\,\big(x^{n+1} - 2x^{n} + x^{n-1}\big)\right] = 0
\tag{2}
$$



where F is the RHS of Eqs. (1) and x = (v, B, p, ρ)^T represents all the evolved quantities. Θ is a numerical constant, usually set to 1 in XTOR-2F (see Ref. [21]). The system (2) can be rewritten as

$$
G(\Delta_n, \bar{x}) \equiv \Delta_n - \Delta t\, F\!\left[\Big(\tfrac{1}{2} + \Theta\Big)\Delta_n + \bar{x}\right] = 0
\tag{3}
$$

with x̄ ≡ (1 − Θ)x^n + Θx^{n−1} and ∆_n ≡ x^{n+1} − x^n. It must be inverted at every time step for the evaluation of x^{n+1}.
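For completeness, substituting the definitions of ∆_n and x̄ into the argument of F verifies that Eq. (3) is indeed the same scheme as Eq. (2):

$$
\Big(\tfrac{1}{2}+\Theta\Big)\Delta_n+\bar{x}
=\Big(\tfrac{1}{2}+\Theta\Big)\big(x^{n+1}-x^{n}\big)+(1-\Theta)\,x^{n}+\Theta\,x^{n-1}
=\frac{x^{n+1}+x^{n}}{2}+\Theta\big(x^{n+1}-2x^{n}+x^{n-1}\big).
$$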

The poloidal and toroidal directions use a pseudo-spectral Fourier representation. The first derivatives in the angular directions of the scalar fields and vector field components in Eqs. (1) are evaluated by multiple 1-D Fourier transforms (a minimal illustration is given at the end of this subsection). Products between these fields are performed in physical space to evaluate the RHS of Eqs. (1). After each such product, modes which are not in the mode set evolved by the code are filtered out in Fourier space, therefore requiring a multiple forward and backward 2-D Fourier transform (for details see [19], Section 3.6.3). Altogether, one evaluation of the RHS of Eqs. (1) requires about two hundred such forward or backward multiple 1-D or 2-D FFTs.

The radial coordinate labels the equilibrium poloidal magnetic flux surfaces. The radial discretization is done by means of linear finite differences. Staggered meshes are used for the radial derivative evaluations. The resolution of the linear stability problem involves only a moderate number of resonant surfaces; indeed, the linear problem is decoupled in the toroidal direction and the dynamics involves only poloidal mode couplings. The situation is different in XTOR-2F, where the nonlinear dynamics is solved and couplings thus arise between all the poloidal and toroidal modes. Therefore, high radial resolution is required throughout the entire plasma. In general, XTOR-2F uses a uniform mesh.
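The sketch below illustrates such a pseudo-spectral angular derivative: a forward FFT, multiplication of each Fourier mode by im, and an inverse FFT. It is a minimal stand-alone example using the legacy FFTW Fortran interface, not the XTOR-2F routine itself; the routine name, array layout and mode-ordering bookkeeping are illustrative assumptions.

```fortran
! Minimal sketch (not the XTOR-2F routine): spectral derivative of a
! periodic function along one angular direction using FFTW's legacy
! Fortran interface.  Assumes f is sampled on ntheta equidistant points.
subroutine dtheta_spectral(ntheta, f, dfdtheta)
  implicit none
  include 'fftw3.f'
  integer, intent(in)     :: ntheta
  complex(8), intent(in)  :: f(ntheta)
  complex(8), intent(out) :: dfdtheta(ntheta)
  complex(8) :: fhat(ntheta)
  integer(8) :: plan_fwd, plan_bwd
  integer    :: k, m

  ! Forward transform: physical space -> Fourier space.
  call dfftw_plan_dft_1d(plan_fwd, ntheta, f, fhat, FFTW_FORWARD, FFTW_ESTIMATE)
  call dfftw_execute(plan_fwd)
  call dfftw_destroy_plan(plan_fwd)

  ! Multiply mode m by i*m (FFTW ordering: m = k-1 for the first half,
  ! m = k-1-ntheta for the second half); the Nyquist mode is dropped.
  do k = 1, ntheta
     m = k - 1
     if (m > ntheta/2) m = m - ntheta
     if (2*abs(m) == ntheta) m = 0
     fhat(k) = cmplx(0.d0, dble(m), kind=8) * fhat(k)
  end do

  ! Backward transform and normalization (FFTW transforms are unscaled).
  call dfftw_plan_dft_1d(plan_bwd, ntheta, fhat, dfdtheta, FFTW_BACKWARD, FFTW_ESTIMATE)
  call dfftw_execute(plan_bwd)
  call dfftw_destroy_plan(plan_bwd)
  dfdtheta = dfdtheta / dble(ntheta)
end subroutine dtheta_spectral
```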

3.2. Newton–Krylov matrix-free solver

The solution of 3D two-fluid equations such as Eqs. (1) requires considerable computing resources if one chooses to build and invert the matrix representing the implicit problem linearized at the current time step. An alternative is the use of a Newton–Krylov iterative method to solve each implicit time step. For a system of nonlinear equations such as Eq. (3), the Newton iteration reads

$$
\Delta_n^{k+1} = \Delta_n^{k} - \left(G'\!\big(\Delta_n^{k}\big)\right)^{-1} G\!\big(\Delta_n^{k}\big)
\tag{4}
$$

where k indexes the Newton steps, n indexes the time step and the prime denotes the derivative with respect to ∆_n. Each nonlinear Newton iteration requires the inversion of the Jacobian matrix G′(∆_n^k). This task is performed by a Krylov method using a pre-conditioner. The Krylov method used by XTOR-2F is the Generalized Minimal Residual method (GMRES) [34]. We obtain from Eq. (4) the pre-conditioned Newton–Krylov method used to solve Eq. (3) for ∆_n:

$$
M^{-1} G'\!\big(\Delta_n^{k}\big)\big(\Delta_n^{k+1} - \Delta_n^{k}\big) + M^{-1} G\!\big(\Delta_n^{k}\big) = 0
\tag{5}
$$

where M is the pre-conditioning operator. Forming the Jacobian matrix G′ requires taking analytic or discrete derivatives of the system of equations. For a complicated system like the extended MHD equations, this can be very difficult, both error-prone and time consuming. A property of Krylov methods is that they require only matrix–vector products to carry out the iteration. These products are approximated by a finite difference:

$$
G'\!\big(\Delta_n^{k}\big)\big(\Delta_n^{k+1} - \Delta_n^{k}\big) \approx \frac{G\!\big(\Delta_n^{k} + \epsilon\,(\Delta_n^{k+1} - \Delta_n^{k})\big) - G\!\big(\Delta_n^{k}\big)}{\epsilon}
$$

where ϵ is a small parameter. The entire process of forming and storing the Jacobian matrix and multiplying it by a vector is replaced by just two evaluations of the nonlinear function G, hence the name ‘‘Jacobian-free’’ or ‘‘matrix-free’’. This method allows the inclusion in XTOR-2F of complex physical models without the need to derive the Jacobian matrix terms explicitly.
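A minimal sketch of such a matrix-free Jacobian–vector product is given below. It is an illustration only, not the XTOR-2F or PETSc implementation; the residual routine g_residual, its interface and the choice of ϵ are assumptions.

```fortran
! Illustrative matrix-free Jacobian-vector product: jv ~= G'(delta) * v.
! g_residual is a user-supplied routine evaluating the nonlinear residual
! G of Eq. (3); its name and interface are hypothetical.
subroutine jacobian_vector_product(n, delta, g0, v, jv, g_residual)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: delta(n)   ! current Newton iterate
  real(8), intent(in)  :: g0(n)      ! G(delta), already evaluated
  real(8), intent(in)  :: v(n)       ! direction vector
  real(8), intent(out) :: jv(n)      ! approximation of G'(delta)*v
  external             :: g_residual
  real(8) :: gp(n), eps, vnorm

  ! Simple choice of the differencing parameter, scaled by the norm of v.
  vnorm = sqrt(sum(v*v))
  if (vnorm == 0.d0) then
     jv = 0.d0
     return
  end if
  eps = 1.d-7 * (1.d0 + sqrt(sum(delta*delta))) / vnorm

  ! One extra residual evaluation at the perturbed state.
  call g_residual(n, delta + eps*v, gp)

  ! First-order finite difference of the directional derivative.
  jv = (gp - g0) / eps
end subroutine jacobian_vector_product
```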

Fig. 1. XTOR-2F main steps: pre-conditioner construction, Newton–Krylov solver, function evaluation and pre-conditioner.

3.3. Pre-conditioner

An efficient pre-conditioner is necessary to ensure the convergence of the Krylov linear solver. XTOR-2F uses a physical pre-conditioner operator based on the linearized Eqs. (1). This operator needs to be inverted at each time step. In XTOR-2F, it is stored in Fourier space. Because it is ill-conditioned, this inversion is performed by an exact LU method.

In order to ensure fast convergence of the Krylov solver, the number of iterations for each time step is controlled by an adaptive time step. If the number of Krylov iterations is greater or smaller than given thresholds, the time step ∆t is modified. Hence the pre-conditioner matrix construction and the LU decomposition, which are computationally heavy, are performed only at startup and at each time step change. The LU decomposed pre-conditioner is stored, and only a forward and backward substitution is necessary during each pre-conditioning required at every Krylov iteration. The sequential version of the code uses a block Thomas algorithm to perform this step. A summary of the main program steps and of the location of the LU decomposition and forward and backward substitution is given in Fig. 1. A minimal sketch of this adaptive time step control is given at the end of this subsection.

The pre-conditioner matrix structure is described here since it imposes important constraints on the parallelization strategy. The problem is diagonal in the toroidal modes: the pre-conditioner is indeed based on a linearization of Eqs. (1) in toroidal axi-symmetry. The poloidal spectral modes are coupled in the linearized equations by the dependence of the equilibrium magnetic field on the poloidal angle. The resulting pre-conditioner matrix is diagonal along the toroidal modes and block penta-diagonal along the radial coordinate. Each block is dense, of size (8∆m)^2, where ∆m is the poloidal mode range evolved by the code and the factor 8 comes from the 8 MHD field components (v, B, p, ρ)^T. The pre-conditioner matrix structure is given in Fig. 2. For its construction, the images of the vector space basis vectors under the linearized MHD operator are computed, each giving a column of the matrix. The matrix construction is MPI parallelized over these basis vectors.
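The sketch below illustrates the adaptive time step control logic mentioned above. It is a schematic, not the XTOR-2F implementation; the thresholds kmin, kmax and the rescaling factors are hypothetical values (the actual criteria are those of Ref. [21], Section 4.1).

```fortran
! Schematic adaptive time step control (illustrative values only).
! After each time step, the number of Krylov iterations n_krylov decides
! whether dt is rescaled; any rescaling forces a rebuild and a new LU
! decomposition of the pre-conditioner.
subroutine adapt_time_step(n_krylov, dt, rebuild_preconditioner)
  implicit none
  integer, intent(in)    :: n_krylov
  real(8), intent(inout) :: dt
  logical, intent(out)   :: rebuild_preconditioner
  integer, parameter :: kmin = 3, kmax = 12          ! hypothetical thresholds
  real(8), parameter :: shrink = 0.7d0, grow = 1.3d0 ! hypothetical factors

  rebuild_preconditioner = .false.
  if (n_krylov > kmax) then
     dt = shrink * dt            ! convergence too slow: reduce dt
     rebuild_preconditioner = .true.
  else if (n_krylov < kmin) then
     dt = grow * dt              ! convergence very fast: increase dt
     rebuild_preconditioner = .true.
  end if
end subroutine adapt_time_step
```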


Fig. 2. Pre-conditioner matrix structure for radial resolution lmax, poloidal mode range ∆m, four toroidal modes n = 0, . . . , 3 and eight MHD field components.

4. Parallelization strategy

As can be seen in Fig. 1, each Krylov iteration is composed of two parts: the evaluation of G in Eq. (3) with the RHS of Eqs. (1), denoted by ‘‘Function Evaluation’’, and the pre-conditioner matrix solve, denoted by ‘‘Pre-conditioner’’. The algebra of the function evaluation part is performed in physical space, while the pre-conditioner matrix solve is performed in Fourier space. We have implemented a two-level parallelization strategy:

• The first parallelization level splits the pre-conditioner matrix solve along the toroidal modes and splits the evaluation of the RHS of Eqs. (1) into groups of operators.
• The second parallelization level is a domain decomposition along the radial coordinate.

A summary of the parallelization implemented for the three XTOR-2F main steps is given in Fig. 3.

4.1. First level: Pre-conditioner toroidal modes and group of operators split

Since the pre-conditioner is diagonal along the toroidal modes, we can split the matrix along this axis. While the solve is done with perfect parallelism, the results for the different toroidal modes must however be reassembled at the end of the pre-conditioner evaluation with an MPI_Allgather. This communication is a consequence of the spectral representation chosen, since all the toroidal modes are needed by the FFTs in the function evaluation step.

Using the same MPI communicator as for the toroidal parallelization, the computation of the RHS of the MHD equations (1) has been split into groups of operators. Each group of operators is computed by one MPI process and broadcast to the other processes at the end of the computation. The advantage of this parallelization is its straightforward extension to new terms or to generalizations of the model by the addition of new variables evolving in time. For example, the evolution of several impurity species along with the MHD fluid [10] or the handling of RF heating [35] were added and benefited from this flexibility. The loops along the angular coordinates inside each algebraic operator in physical space have been parallelized using coarse grain OpenMP directives. A minimal sketch of this first-level pattern is given below.
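The sketch below illustrates the first-level pattern on a simplified data layout: each MPI process of the toroidal communicator performs its own share of the work with a coarse grain OpenMP loop, and the per-mode results are then reassembled on all processes with MPI_Allgather. The array shapes and routine name are illustrative assumptions, not the XTOR-2F data structures.

```fortran
! Illustrative first-level split: each rank owns n_loc toroidal modes of a
! field of size nblock and reassembles the full set on all ranks.
subroutine gather_toroidal_modes(comm, nblock, n_loc, work_loc, work_all)
  use mpi
  implicit none
  integer, intent(in)       :: comm, nblock, n_loc
  complex(8), intent(inout) :: work_loc(nblock, n_loc)   ! locally owned modes
  complex(8), intent(out)   :: work_all(nblock, *)       ! all modes, all ranks
  integer :: j, ierr

  ! Coarse grain OpenMP over the locally owned toroidal modes.
  !$omp parallel do private(j)
  do j = 1, n_loc
     ! ... local operator / pre-conditioner solve for mode j would go here ...
     work_loc(:, j) = work_loc(:, j)        ! placeholder for the local work
  end do
  !$omp end parallel do

  ! Reassemble the results of every rank so that all toroidal modes are
  ! available for the FFTs of the function evaluation step.
  call MPI_Allgather(work_loc, nblock*n_loc, MPI_DOUBLE_COMPLEX, &
                     work_all, nblock*n_loc, MPI_DOUBLE_COMPLEX, comm, ierr)
end subroutine gather_toroidal_modes
```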




4.2. Second level: Radial domain decomposition

As stated in Section 3.1, the angular coordinates θ, φ are subject to numerous FFT calls for mode projection during each evaluation of G in Eq. (3). As the FFTs are applied to the entire domain of solution, this rules out these two coordinates for domain decomposition in the evaluation of G. Hence a domain decomposition was implemented along the radial coordinate. Eventually, this radial domain decomposition splits the entire time advance iteration, as shown in Fig. 3.

The Newton–Krylov solver has been ported from the sequential NITSOL library [36,37] to the PETSc framework [30], thus benefiting from native MPI parallelization. PETSc includes the features needed by XTOR-2F: inexact Newton solves using the Eisenstat–Walker convergence criteria [38], the GMRES Krylov linear solver and matrix-free capability. Each radial domain is extended on its left and right sides by a set of ghost points necessary to compute the derivatives at its boundaries. The ghost points of a given domain must be updated from the nearest-neighbor domains by MPI communications at each new Krylov iteration. Global reduction communications are required for the scalar products and norms involved in the Newton–Krylov algorithm. The mesh along the radial coordinate is uniform, which facilitates the parametrization of PETSc.

The block Thomas algorithm used for the LU decomposition of the pre-conditioner in the sequential version of the code, albeit very fast, is not designed for parallelization: the algorithm creates a chain of dependencies along the entire radial domain. Therefore, the pre-conditioner was parallelized along the radial coordinate with the SPIKE algorithm [29]. SPIKE makes it possible to solve large banded linear systems in parallel by LU decomposition. It relies on the decomposition of a given banded matrix A into the product of a block diagonal matrix D and another matrix S which has the structure of an identity matrix with some extra ‘‘spikes’’. This decomposition is summarized in Fig. 4. The S matrix can be transformed into a reduced banded system S^r of smaller size, taking into account only the couplings near the interfaces of each partition. The size of S^r is 2(p − 1)q and its band size is 2q, p being the number of radial partitions and q the upper band size in blocks of the matrix A (q = 2 for a block penta-diagonal matrix). The construction of S^r from S is shown in Fig. 5; I stands for the identity matrix. The reduced matrix S^r, which takes into account the communications between the partitions, is replicated and solved on all processes. A solve of S^r on a single process followed by a broadcast of the solution is less efficient because of the additional communication step between the radial partitions. Then the diagonal blocks Ai can be solved in parallel. The parallelization scheme is summarized in Fig. 6.

The optimal partition number giving a balanced size for the local block Ai and for S^r scales as √lmax/2 (balancing the ∼lmax/p block rows of Ai against the ∼4p block rows of S^r gives p ≃ √lmax/2). The 1/2 constant may depend on the size of the blocks but works well in the cases we consider. This gives a lower limit to the size of the radial sub-domains, which cannot be reduced arbitrarily without increasing the size of S^r. Since S^r is banded, one could envisage applying the SPIKE algorithm recursively for its solve, but in today's XTOR-2F radial resolution range (≲10^4) this does not give additional speedup. The key feature of the SPIKE algorithm in the XTOR-2F context is that the local block Ai and S^r solves can be split into an LU decomposition and a forward and backward substitution. This preserves the efficiency of the sequential version: only a fast forward and backward substitution is required during each pre-conditioning, as sketched below. This step is still performed through the Thomas algorithm, but now restricted to a radial subdomain. Note that a SPIKE library distribution is available from Intel without source code. For code portability, we chose to rewrite the SPIKE algorithm in Fortran.
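The sketch below illustrates this split on a single dense local block, using the LAPACK routines ZGETRF (LU factorization, done once when the pre-conditioner is rebuilt) and ZGETRS (forward and backward substitution, done at every Krylov iteration). It is a simplification for illustration only: the actual XTOR-2F pre-conditioner is handled by a block penta-diagonal Thomas/SPIKE scheme, not by a single dense factorization.

```fortran
! Illustration of the "factor once, substitute many times" pattern on one
! dense local block of size n (complex, since the pre-conditioner is stored
! in Fourier space).  Not the actual XTOR-2F block penta-diagonal solver.
module block_lu_sketch
  implicit none
contains

  ! Called only at startup and after each time step change.
  subroutine factorize_block(n, a, ipiv)
    integer, intent(in)       :: n
    complex(8), intent(inout) :: a(n, n)   ! overwritten by its LU factors
    integer, intent(out)      :: ipiv(n)
    integer :: info
    call zgetrf(n, n, a, n, ipiv, info)
    if (info /= 0) stop 'LU factorization of the local block failed'
  end subroutine factorize_block

  ! Called at every Krylov iteration: forward and backward substitution only.
  subroutine apply_block_inverse(n, a, ipiv, rhs)
    integer, intent(in)       :: n
    complex(8), intent(in)    :: a(n, n)   ! LU factors from factorize_block
    integer, intent(in)       :: ipiv(n)
    complex(8), intent(inout) :: rhs(n)    ! right-hand side, overwritten by the solution
    integer :: info
    call zgetrs('N', n, 1, a, n, ipiv, rhs, n, info)
  end subroutine apply_block_inverse

end module block_lu_sketch
```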
4.3. Pre-conditioner matrix construction

The pre-conditioner solve is split along the radial subdomains and the toroidal modes. The pre-conditioner matrix construction and LU decomposition, performed at each time step rescaling, follow a similar decomposition. The pre-conditioning matrix is computed by evaluating the images of basis vectors under the linearized two-fluid MHD equations. These images are scattered over the processes corresponding to the subdomains of the discretization mesh where the problem is solved. These operations are MPI parallelized over the basis vectors. A minimal sketch of this column-by-column construction is given below.
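The sketch below illustrates the column-by-column construction: basis vectors are distributed round-robin over the MPI processes, and each process obtains its columns by applying the linearized operator to a unit vector. The routine apply_linearized_operator and the round-robin distribution are illustrative assumptions, not the XTOR-2F implementation.

```fortran
! Illustrative MPI-parallel construction of the pre-conditioning matrix:
! column j is the image of the j-th basis vector under the linearized
! operator.  apply_linearized_operator is a hypothetical user routine.
subroutine build_preconditioner_columns(comm, n, ncols_loc, my_columns, col_index, &
                                         apply_linearized_operator)
  use mpi
  implicit none
  integer, intent(in)     :: comm, n
  integer, intent(out)    :: ncols_loc
  complex(8), intent(out) :: my_columns(n, *)   ! columns owned by this rank
  integer, intent(out)    :: col_index(*)       ! global index of each local column
  external                :: apply_linearized_operator
  complex(8) :: e(n)
  integer :: rank, nprocs, ierr, j

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  ncols_loc = 0
  do j = rank + 1, n, nprocs          ! round-robin distribution of basis vectors
     e = (0.d0, 0.d0)
     e(j) = (1.d0, 0.d0)
     ncols_loc = ncols_loc + 1
     col_index(ncols_loc) = j
     ! Image of the basis vector = one column of the pre-conditioning matrix.
     call apply_linearized_operator(n, e, my_columns(:, ncols_loc))
  end do
end subroutine build_preconditioner_columns
```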


Fig. 3. Summary of the XTOR-2F parallelization: radial domain decomposition combined with the pre-conditioner toroidal mode and group of operator split for the code's three main steps: Newton–Krylov solver, function evaluation and pre-conditioner.

Fig. 4. SPIKE factorization A = DS for a block penta-diagonal matrix and three partitions p = 3.

Fig. 5. Construction of the reduced matrix S^r from the SPIKE matrix S for a block penta-diagonal matrix and three partitions p = 3.

Fig. 6. XTOR-2F pre-conditioner matrix A solve parallelized using the SPIKE algorithm on three MPI processes (P1, P2, P3). Each process solves a global matrix S^r and a local matrix Ai (i = 1, 2, 3). The block size of S^r is 4 times the block size of A.

4.4. Impact of parallelization on memory requirement

For a given resolution, the memory requirement in bytes for the unsplit pre-conditioner matrix is

memory = (2×∆n) × (5×lmax) × (8×∆m)^2 × 8

where ∆n is the toroidal mode number span, lmax the number of radial mesh points and ∆m the poloidal mode number span. For example, for a high resolution mesh with lmax = 1024, ∆n = 8, ∆m = 27, this gives a memory requirement of 30.5 GBytes for the pre-conditioner (a worked evaluation is given below).

The pre-conditioner matrix is diagonal in the toroidal mode number n. Therefore, the toroidal parallelization over p MPI processes leads to a reduction of the memory requirement per MPI process of ∼1/2^(p−1), see Fig. 7(a) with the above resolution. The radial parallelization also reduces the memory requirement, as long as the number of radial partitions p ≲ √lmax/2. For larger values of p, the size of the additional matrices linked to the SPIKE algorithm (the S^r matrix and the Vi/Wi blocks) increases significantly and reverses this trend, as can be observed in Fig. 7(b) with Ptoroidal = 8. The pre-conditioner matrix memory requirement is divided by ≈30.5/0.8 = 38 for a total parallelization level Ptotal = Ptoroidal × Pradial = 8×16 = 128.
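As a check of the quoted figure, evaluating the memory formula above for the high resolution example (lmax = 1024, ∆n = 8, ∆m = 27) gives

$$
\text{memory} = (2\times 8)\,(5\times 1024)\,(8\times 27)^{2}\times 8\ \text{B}
\;\approx\; 3.06\times 10^{10}\ \text{B}\;\approx\;30.5\ \text{GB}.
$$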

5. Results

The following performance tests were performed on computing nodes equipped with two Intel® Xeon® Haswell 2.6 GHz processors with 12 cores each and 64 GB of memory. For the test cases below, setting the number of OpenMP threads to 6 per MPI process is optimal (i.e. 2 MPI processes per processor). The speedup between 1 and 6 OpenMP threads is about a factor of 3. Increasing the number of OpenMP threads per MPI process from 7 to 12 gives only a marginal gain in performance (less than 10%). Multithreading parallelization is efficient on the compute-intensive sections: the function evaluation using coarse grain OpenMP and the pre-conditioner matrix exact LU decomposition using level 3 BLAS.

Fig. 7. Pre-conditioner matrix memory size per MPI process versus parallelization level: (a) toroidal parallelization; (b) radial parallelization with Ptoroidal = 8. Ai, Vi/Wi and S^r are defined in Figs. 4–6.

Table 1
Definition of the low and high resolution meshes used in the performance tests.

Meshes            Radial   Poloidal   Toroidal
Low resolution    256      48         12
High resolution   1024     64         24

Performance measurements with Scalasca [39] show that it is not efficient on memory-bandwidth-bound sections such as the pre-conditioner matrix forward and backward substitution through level 2 BLAS. As a consequence, a stronger MPI parallelization may lead to a degradation of the shared memory threading speedup.

The two meshes used in the performance tests are defined in Table 1. The set of performance tests performed for the low and high resolution meshes and for different combinations of radial and toroidal parallelization levels, noted respectively Pradial and Ptoroidal, is given in Table 2. For the high resolution case, we are able to run only certain combinations of Pradial and Ptoroidal. This is due to memory constraints, mainly from the storage of the pre-conditioner matrix in RAM.

Fig. 8 gives a summary of the gain in execution time versus the total number of MPI processes, noted Ptotal. It is the product of the number of MPI processes used for the toroidal parallelization and for the radial domain decomposition: Ptotal = Ptoroidal × Pradial. First, an overall speedup of about a factor of 10 is observed for the low resolution case between Ptotal = 1 (sequential run) and Ptotal = Ptoroidal × Pradial = 4×8 = 32, which corresponds to an efficiency of about 30%. The high resolution case shows a speedup of about a factor of 24 between Ptotal = Ptoroidal = 8 and Ptotal = Ptoroidal × Pradial = 8×32 = 256, which corresponds to a relative efficiency of about 75% for the radial domain decomposition.

Fig. 8. Execution times for 100 time steps versus the total number of MPI processes Ptotal = Ptoroidal × Pradial for (a) the low and (b) the high resolution meshes defined in Table 1. The radial and toroidal parallelization levels are those from Table 2. Each curve corresponds to a fixed number of toroidal MPI processes (Ptoroidal).

Table 2 Combination of Pradial and Ptoroidal for which performance tests were performed for the low and high resolution meshes defined in Table 1.

The computational gains obtained by doubling Ptoroidal or Pradial differ. This makes it difficult to determine the optimal Ptoroidal × Pradial combination for large Ptotal other than by testing. This can be observed in Fig. 8(b) for Ptotal = 32, the most efficient configuration being Ptoroidal × Pradial = 2×16. It can also be observed that when doubling Pradial (Ptoroidal × Pradial = 2×32), the gain saturates because of the SPIKE algorithm, see Section 5.2. On the contrary, when doubling Ptoroidal, i.e. Ptoroidal × Pradial = 4×16, the execution time decreases. In Sections 5.1 and 5.2, results are given for both the low and the high resolution cases when the simulation is parallelized by scanning Ptoroidal or Pradial, respectively.



Fig. 9. Execution times for 100 time steps versus the number of toroidal MPI processes Ptoroidal for (a) the low and (b) the high resolution meshes defined in Table 1. The radial and toroidal parallelization levels are those from Table 2.

5.1. Effects of toroidal and operator splitting

The toroidal splitting of the pre-conditioner solve and the operator splitting of the RHS of Eqs. (1), used to evaluate G in Eq. (3), are straightforward when they are limited to a moderate number of MPI processes. The pre-conditioner matrix profile is indeed diagonal in the toroidal mode number n, and can thus easily be split over up to nmax processes, where nmax is the maximum toroidal mode number in the mode set evolved by the code. The parallelization along the toroidal direction is therefore limited in XTOR-2F to a maximum of Ptoroidal = nmax processes. Furthermore, the splitting of Eqs. (1) into groups of operators is also straightforward. It was decided to keep the number of these groups ≤ nmax. This allows a straightforward management of the MPI processes at the first level of parallelization, in the pre-conditioner solve and in the operator splitting of G, when flipping between physical and Fourier space. It is emphasized that both of these splittings increase the global communications, because G in Eq. (3) and M^{-1}G in Eq. (5) are required at every radial position by each of the Ptoroidal processes.

Fig. 9 shows the execution times versus Ptoroidal for the low and high resolution meshes. Every curve is obtained with a fixed number of radial domains Pradial for the second level of parallelization. The radial and toroidal parallelization levels used in the figures are those presented in Table 2. For the low resolution mesh, varying Ptoroidal between 1 and nmax = 4 gives a speedup factor of about 2 for every Pradial. Similarly, for the high resolution mesh, this factor is about 3 with Pradial = 32 and Ptoroidal varying between 1 and nmax = 8. It is also observed that the gain is minimal for every Ptoroidal when Pradial varies between 8 and 16 for the low resolution case and between 16 and 32 for the high resolution one. This is a consequence of the SPIKE algorithm, as will be discussed in Section 5.2. The toroidal matrix splitting also significantly reduces the memory requirements per MPI process, as was observed in Section 4.4. This gives access to higher toroidal and poloidal resolutions.

5.2. Effects of radial domain splitting

Next, the effect of the radial domain MPI splitting on the execution time is analyzed. In the left part of Fig. 10, it is observed that, for the low and high resolution meshes, the execution times spent in the function evaluation, G in Eq. (3), and the time spent exclusively in the Newton–Krylov routines of PETSc decrease when the number of radial domains Pradial is increased. Conversely, the execution times of the pre-conditioning step show a minimum when Pradial = P^c_radial = √lmax/2, i.e. P^c_radial = 8 and P^c_radial = 16 for the low and high resolution meshes, respectively. This is expected from the pre-conditioner radial parallelization scheme based on the SPIKE algorithm (Fig. 6). In the right part of Fig. 10, the pre-conditioner contribution is further split into a local part, due to the local block Ai solves, and a global part, due to the SPIKE S^r matrix solve (the definitions of Ai and S^r are given in Fig. 5). For both cases, the execution times spent in the global part, i.e. the S^r matrix solves, increase significantly when Pradial > P^c_radial.

For the high resolution case in the right part of Fig. 10, it is also observed that the proportional contribution from the toroidal communication (the MPI_Allgather over the toroidal modes done after the split pre-conditioner) stays relatively constant for all Pradial. This means that the radial domain decomposition splits not only the computation times but also the communication times due to the toroidal parallelization. This is confirmed when the toroidal communication times for the high resolution mesh are plotted against the radial parallelization level Pradial: Fig. 11 shows that for increasing values of Pradial, the time spent in toroidal communications decreases. Note that the cases with (Pradial = 2/Ptoroidal = 2) and (Pradial = 4/Ptoroidal = 2) cannot be run because of memory constraints. The cost of the toroidal communications for the low resolution case is negligible.

Finally, Fig. 12 shows the execution time scalings, taking the low resolution mesh of Table 1 and varying the radial resolution. The poloidal resolution, the toroidal resolution and the toroidal parallelization level Ptoroidal = 8 are fixed, and the radial resolution lmax is varied within (256, 1024, 4096). When only the toroidal parallelization is activated, the scaling is linear with the radial resolution. Fig. 12 shows that when the radial domain decomposition is activated, with the radial parallelization level set to Pradial = √lmax/2, the scaling follows the square root of the resolution.

6. Application

The experimental observation of cyclic internal disruptions, leading to fluctuations of the temperature with a characteristic sawtooth behavior, dates back to 1974 on the ST tokamak (Princeton) [40]. Sawteeth are triggered by the internal kink instability with toroidal mode number n = 1 and poloidal mode number m = 1 [41]. Magnetic field lines are reconnected at the resonant surface of the instability, where the safety factor q = 1. The cyclic dynamics of the internal kink mode, which drives sawtooth oscillations in tokamak plasmas, has been studied using XTOR-2F in Refs. [7,8].



Fig. 10. Execution times for 100 time steps versus the number of radial MPI processes Pradial for (a) the low and (b) the high resolution meshes defined in Table 1. Ptoroidal = 1 for the low and Ptoroidal = 8 for the high resolution mesh. The Newton–Krylov solver, function evaluation, pre-conditioner and total execution times are given in the left figures. Proportional execution times for the Newton–Krylov solver, the function evaluation, the pre-conditioner local part (Ai) and global part (S^r) (from Fig. 5) and the toroidal communications are given in the right figures.

Fig. 11. Toroidal mode communication times (100 time steps) for the high resolution mesh defined in Table 1 scaling with Pradial = 2, 4, 8, 16, 32. For every curve Ptoroidal is fixed.

Fig. 12. Execution time scalings for 100 time steps against the radial resolution, with and without the radial domain decomposition. With the domain decomposition, the radial parallelization level is set to Pradial = √lmax/2.



Fig. 13. Time evolution of the pressure during sawtooth cycles with XTOR-2F.

Fig. 13 shows the evolution of the pressure during sawtooth cycles for different normalized radial chords in the Z = 0 plane of the small section of the torus, from an XTOR-2F simulation with a Lundquist number S = 10^7. The discretization mesh is (256 radial, 48 poloidal, 12 toroidal) intervals, and Ptotal = Ptoroidal × Pradial = 4×8 = 32. With the parallelized version, a cycle is computed within 24 h of simulation, whereas the same case took 10 days with the sequential version.

7. Conclusions

The XTOR-2F code has been parallelized using hybrid MPI/OpenMP operations. The MPI parallelization, applied to the iterative matrix-free Newton–Krylov (N–K) method of solution of XTOR-2F [21], employs both domain decomposition and operator splitting. To reduce the number of iterations of the N–K method, the code uses physical pre-conditioning, based on the linearization of the two-fluid MHD equations. The corresponding matrix is inverted using an exact LU method with the SPIKE algorithm [29]. The PETSc library [30], which is MPI parallelized, provides the N–K solver. The algebra and the Fourier decompositions in the angular directions, both used heavily for the evaluation of the two-fluid MHD operators and for the LU decomposition of the pre-conditioner, use intrinsic and OpenMP threading.

The hybrid MPI/OpenMP parallelization is limited by three issues. First, the MPI parallelization in the toroidal direction is limited by the maximum toroidal mode number nmax evolved by the code. For MPI process management reasons, the same limit holds for the MPI parallelization by splitting of the two-fluid operator, Eqs. (1). Second, the SPIKE algorithm execution time scales with the square root of the domain-decomposed radial mesh, and the number of radial subdomains is limited by the increasing size of the SPIKE communication matrix. Last, when the level of MPI parallelization is too large, the shared memory threading acceleration drops. For example, the linear algebra in the forward and backward substitution of the LU method is performed through level 2 BLAS calls which, due to their data versus computation ratio, are inherently bound by the memory bandwidth.

Nevertheless, the code shows a speedup of one order of magnitude for low resolution discretization meshes with respect to the sequential version of the code. This ratio increases when the discretization mesh is refined. This improvement reduces long time simulations such as the ones in Ref. [8] from hundreds to tens of hours, which opens the way to future physical applications, either by accelerating parametric studies or by exploring phenomena needing higher resolution. For example, the MPI/OpenMP parallelized XTOR-2F was used for the study of sawtooth [42] and tearing [43] dynamics.

The number of poloidal modes required is kept relatively low (∆m ≲ 50 in today's simulations). For higher poloidal resolution simulations, it may be interesting to introduce another parallelization level into the pre-conditioner dense block solves with ScaLAPACK [44]. This is left for future work.

Acknowledgments

This work was carried out within the framework of the French Research Federation for Magnetically Confined Fusion (FR-FCM) and the Eurofusion grant agreement no. 63053. It is also part of project AMICI (ANR14CE32000401) funded by the Agence Nationale pour la Recherche. All the simulations were accomplished with the HPC resources from the GENCI project no. 050198. We would like to thank T. Nicolas for proofreading this manuscript as an experienced user of the code.

References

[1] H. Lütjens, J.-F. Luciani, X. Garbet, Phys. Plasmas 8 (2001) 4267.
[2] H. Lütjens, J.-F. Luciani, Phys. Plasmas 12 (2005) 080703.
[3] H. Lütjens, J.-F. Luciani, Phys. Plasmas 13 (2006) 112501.
[4] P. Maget, H. Lütjens, G.T.A. Huysmans, P. Moreau, B. Schunke, J.L. Segui, X. Garbet, E. Joffrin, J.-F. Luciani, Nucl. Fusion 47 (2007) 233.
[5] P. Maget, G.T.A. Huysmans, H. Lütjens, M. Ottaviani, Ph. Moreau, J.L. Segui, Plasma Phys. Control. Fusion 51 (2009) 065005.
[6] P. Maget, H. Lütjens, R. Coelho, B. Alper, M. Brix, P. Buratti, R.J. Buttery, H. De la Luna, N. Hawkes, I. Jenkins, C. Challis, C. Giroud, X. Litaudon, J. Mailloux, M. Ottaviani, EFDA Contributors, Nucl. Fusion 50 (2010) 045004.
[7] F.D. Halpern, D. Leblond, H. Lütjens, J.-F. Luciani, Plasma Phys. Control. Fusion 53 (2011) 015011.
[8] F.D. Halpern, H. Lütjens, J.-F. Luciani, Phys. Plasmas 18 (2011) 102501.
[9] T. Nicolas, R. Sabot, X. Garbet, H. Lütjens, J.-F. Luciani, Z. Guimaraes, J. Decker, A. Merle, Phys. Plasmas 19 (2012) 112305.
[10] T. Nicolas, H. Lütjens, J.-F. Luciani, X. Garbet, R. Sabot, Phys. Plasmas 21 (2014) 012507.
[11] A.Y. Aydemir, D.C. Barnes, J. Comput. Phys. 59 (1985) 108.
[12] W. Park, D.A. Monticello, H.R. Strauss, J. Manickam, Phys. Fluids 29 (1986) 1171.
[13] A.M. Popov, V.S. Chan, M.S. Chu, Y.Q. Liu, B.W. Rice, A.D. Turnbull, Phys. Plasmas 8 (2001) 3605.
[14] D.S. Harned, W. Kerner, J. Comput. Phys. 60 (1985) 62.
[15] D.S. Harned, D.D. Schnack, J. Comput. Phys. 65 (1986) 57.
[16] D.D. Schnack, D.C. Barnes, Z. Mikic, D.S. Harned, E.J. Caramana, J. Comput. Phys. 70 (1987) 330.
[17] L.A. Charlton, J.A. Holmes, V.E. Lynch, B.A. Carreras, T.C. Hender, J. Comput. Phys. 86 (1990) 270.
[18] K. Lerbinger, J.-F. Luciani, J. Comput. Phys. 97 (1991) 444.
[19] H. Lütjens, J.-F. Luciani, J. Comput. Phys. 227 (2008) 6944.
[20] D.D. Schnack, D.C. Barnes, D.P. Brennan, C.C. Hegna, E. Held, C.C. Kim, S.E. Kruger, A.Y. Pankin, C.R. Sovinec, Phys. Plasmas 13 (2006) 058103.
[21] H. Lütjens, J.-F. Luciani, J. Comput. Phys. 229 (2010) 8130.
[22] J. Breslau, N. Ferraro, S.C. Jardin, Phys. Plasmas 16 (2009) 092503.


[23] G. Huysmans, S. Pamela, E. van der Plas, P. Ramet, Plasma Phys. Control. Fusion 51 (2009) 124012.
[24] L. Chacón, Comput. Phys. Comm. 163 (2004) 143.
[25] L. Chacón, Phys. Plasmas 15 (2008) 056103.
[26] S.C. Jardin, J. Comput. Phys. 231 (2012) 822.
[27] H. Lütjens, A. Bondeson, A. Roy, Comput. Phys. Comm. 69 (1992) 287.
[28] H. Lütjens, A. Bondeson, O. Sauter, Comput. Phys. Comm. 97 (1996) 219.
[29] E. Polizzi, A.H. Sameh, Parallel Comput. 32 (2006) 177.
[30] S. Balay, W.D. Gropp, L.C. McInnes, B.F. Smith, in: E. Arge, A.M. Bruaset, H.P. Langtangen (Eds.), Modern Software Tools in Scientific Computing, Birkhauser Press, 1997, p. 163.
[31] Intel Math Kernel Library XE 2016 Update 3 for Linux and Mac OSX.
[32] M. Frigo, S.G. Johnson, Proc. IEEE 93 (2) (2005) 216.
[33] S.I. Braginskii, in: M.A. Leontovich (Ed.), Reviews of Plasma Physics, vol. 1, Consultants Bureau, New York, 1965, p. 205.
[34] Y. Saad, M.H. Schultz, SIAM J. Sci. Stat. Comput. 7 (3) (1986) 856.
[35] O. Février, P. Maget, H. Lütjens, J.-F. Luciani, J. Decker, G. Giruzzi, M. Reich, P. Beyer, E. Lazzaro, S. Nowak, the ASDEX Upgrade team, Plasma Phys. Control. Fusion 58 (2016) 045015.




[36] H.F. Walker, L. Zhou, Numer. Linear Algebra Appl. 1 (1994) 571.
[37] M. Pernice, H.F. Walker, SIAM J. Sci. Comput. 19 (1998) 302 (special issue on iterative methods).
[38] S.C. Eisenstat, H.F. Walker, SIAM J. Sci. Comput. 17 (1996) 16.
[39] M. Geimer, F. Wolf, B.J.N. Wylie, E. Ábrahám, D. Becker, B. Mohr, Concurrency Comput.: Pract. Exp. 22 (2010) 702.
[40] S. Von Goeler, W. Stodiek, N. Sauthoff, Phys. Rev. Lett. 33 (1974) 1201.
[41] M.N. Bussac, R. Pellat, D. Edery, J.L. Soulé, Phys. Rev. Lett. 35 (1975) 1638.
[42] J.-H. Ahn, X. Garbet, H. Lütjens, A. Marx, T. Nicolas, R. Sabot, J.-F. Luciani, R. Guirlet, O. Février, P. Maget, Phys. Plasmas 23 (2016) 052509.
[43] P. Maget, O. Février, X. Garbet, H. Lütjens, J.-F. Luciani, A. Marx, Nucl. Fusion 56 (2016) 086004.
[44] L.S. Blackford, J. Choi, A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1997.
