An implicit block ILU smoother for preconditioning of Newton–Krylov solvers with application in high-order stabilized finite-element methods


Behzad R. Ahrabi*, Dimitri J. Mavriplis
Department of Mechanical Engineering, University of Wyoming, Laramie, WY 82071, United States of America

Comput. Methods Appl. Mech. Engrg. 358 (2020) 112637
Received 14 December 2018; received in revised form 15 June 2019; accepted 12 September 2019

Abstract

This paper presents an efficient and highly-parallelizable preconditioning technique for Newton–Krylov solvers. The proposed method can be viewed as a generalization of the implicit line smoothing technique by extending the groups of implicitly-solved unknowns from lines to blocks. The blocks are formed by partitioning the computational domain such that the strong connections between unknowns are not broken by the partition boundaries. The ILU algorithm is used to obtain an approximate (or exact) factorization for each block. Then, a block-Jacobi iteration is formulated in which the degrees of freedom within the blocks are solved implicitly. To stabilize the iterations for high-CFL systems, a dual-CFL strategy, with a lower CFL in the preconditioner matrix, is developed. The performance of the proposed method as a linear preconditioner is demonstrated for second- and third-order steady-state solutions of Reynolds-Averaged Navier–Stokes (RANS) equations on the NASA Common Research Model (CRM), including the high-lift configuration. For the studied test cases, it is shown that in comparison with the traditional ILU(k) method, the proposed preconditioner requires significantly less memory and it can result in notably faster solutions.

© 2019 Elsevier B.V. All rights reserved.

Keywords: Computational fluid dynamics; Newton–Krylov methods; Steady-state RANS simulations; Preconditioning techniques; High-order methods; Stabilized finite-element methods

1. Introduction

In the computational fluid dynamics (CFD) community, Newton–Krylov methods are being widely used to develop robust flow solvers that can reach deep convergence levels, especially for large-scale steady-state Reynolds-Averaged Navier–Stokes (RANS) simulations. Most of the developed solvers utilize a pseudo-transient continuation (PTC) method in which a pseudo-time term is added to the steady-state equations and the resulting nonlinear equations are solved in a time-marching manner. In this process, the pseudo-transient equations are linearized to employ an inexact Newton method, and the resulting linear system is solved using a Krylov solver. The overall success of the

* Corresponding author. E-mail addresses: [email protected] (B.R. Ahrabi, Research Scientist), [email protected] (D.J. Mavriplis, Professor).

https://doi.org/10.1016/j.cma.2019.112637
0045-7825/© 2019 Elsevier B.V. All rights reserved.


resulting Newton–Krylov method depends on the design of the nonlinear solution methodology and the performance of the utilized linear solver. Research on these methods has progressed on multiple fronts, including:

1. Development of compact-stencil discretizations, such as finite-element methods (FEMs) [1–7], which facilitate the evaluation of the exact linearization (i.e. the exact Jacobian matrix) and matrix-based preconditioning techniques.
2. Development of line search and CFL control mechanisms [2,3], which adjust the advancement pace of the nonlinear solver.
3. Development of linear preconditioning techniques, which enhance the performance of the Krylov solvers [8–12].
4. Development of residual smoothing methods, which provide well-conditioned paths for the nonlinear solver [1,13].

Despite significant advancements, research in the above-mentioned topics is still essential to improve the robustness, parallelizability, and efficiency of flow solvers. In this paper we mainly focus on the third item of the given list and partly investigate the fourth item.

Seeking robustness, incomplete LU factorization methods are widely used as preconditioners for Krylov solvers, particularly for second- and high-order finite-element discretizations, which can result in very stiff linear systems. In a traditional distributed-memory implementation, the factorization is performed locally within each partition, which results in decreased convergence rates with increasing levels of parallelization. Therefore, to strengthen the preconditioner, either a more accurate factorization or a larger number of Krylov vectors is required, and in practice, usually the former is more effective. Also, in the traditional form, the factorized matrix is applied only once on each Krylov vector, as an iterative application can be uncontrollably unstable. Thus, the computational expense of the factorization may not be well exploited. In short, incomplete factorization methods, in their traditional form, can be very memory-consuming and inefficient, and thus they may not be suitable for emerging high-performance computing (HPC) systems, which are being developed based on manycore and low-memory processing elements.

To provide an alternative, in our previous work [9–11], we developed a matrix-based implicit line preconditioner in which the strongly-connected unknowns are solved in a semi-implicit manner. In this method, the lines are extracted from a properly chosen scalar matrix which represents the strength of connections between the nodes [14,15]. By comparison of this method to a traditional ILU(k) preconditioner, it was shown that, in addition to being more robust, the line preconditioner can offer notable benefits in terms of computational and memory efficiency. Still, one of the drawbacks of the implicit line method is that for single nodes (nodes that do not belong to any line), it reduces to a point-implicit method with a low convergence rate. In the present work, we generalize the line smoothing technique by extending the groups of strongly-connected nodes from lines to blocks. After forming the blocks, the nodes within each block are reordered to reduce the local bandwidth of the block, and then the ILU algorithm is used to obtain an approximate or exact factorization for each block.
Compared to the traditional ILU method, the developed block solver benefits from a smaller bandwidth for the intra-connections of the block nodes, and by shifting some of the load of the preconditioning from factorization to iterations, it can use lower fill levels. Furthermore, compared to the line method, all the nodes within the computational domain are contained within the blocks and thus the isolated point-implicit nodes are avoided. Using the above-mentioned preconditioner, we try to enhance the robustness and efficiency of Newton–Krylov solvers for the solution of steady-state RANS problems. For the same purpose, we also investigate the role of a diffusion-based residual-smoothing technique in conditioning the nonlinear path for the Newton method. The resulting solution strategy is developed within a finite-element code known as HOMA (High-Order Multilevel Adaptive) [9–11]. This flow solver uses a Streamline Upwind Petrov–Galerkin (SUPG) discretization [16–18] and it can work on unstructured grids with linear (P1) and quadratic (P2) basis functions. The choice of this flow solver is motivated by the fact that during the last few decades, there has been a growing interest in the development of Petrov–Galerkin (PG) discretizations [16–19] for production RANS CFD codes (e.g. GGNS [1], FUNSAFE [20–25], FUN3D-SFE [4], and COFFE [6]). In the context of RANS solvers, PG methods are naturally compared with the developing discontinuous Galerkin (DG) finite-element methods as well as traditional second-order finite-volume (FV) methods. Compared to DG schemes, previous works have shown that for moderate discretization orders and for comparable accuracy, PG schemes require significantly fewer computational resources [20,26].


Also, for viscous flows, DG schemes require super-parametric (Q = P + 1) elements³ to deliver the nominal order of accuracy of the discretization [5,7], whereas PG schemes can work properly with iso-parametric (P = Q) elements. Therefore, PG schemes do not require curved meshes for second-order-accurate solutions. Compared to FV methods, a major advantage of PG schemes is the use of nearest-neighbor stencils to reach high-order discretizations. The use of nearest-neighbor stencils also facilitates the accurate linearization of the nonlinear residual and the utilization of matrix-based preconditioning techniques, in contrast to FV methods, for which usually a first-order Jacobian matrix is employed. Also, compared to FV methods, second-order PG discretizations have conclusively demonstrated higher accuracy on equivalent meshes [25,27,28]. Furthermore, PG discretizations have shown less sensitivity to grid irregularities, as demonstrated by the ability of these discretizations to produce competitive solutions for high-Reynolds-number RANS problems using fully tetrahedral elements in the highly stretched boundary layer regions of the mesh, versus the requirement of using prismatic elements in these regions for FV methods. Although the availability of mixed-element meshes with prisms in the boundary-layer region is now widespread, the ability to employ tetrahedral elements throughout the computational domain offers significant advantages, particularly for use with anisotropic mesh adaptation techniques, where most progress has been made using fully simplicial (i.e. tetrahedral in 3D) elements [29,30]. On the other hand, it can be said that, in many cases, the computational cost in solving steady-state RANS problems using second-order PG discretizations is still higher than what can be achieved using state-of-the-art unstructured mesh FV solvers. This might be due to the fact that most SUPG-RANS codes employ a strongly-preconditioned Newton–Krylov solver, as weaker options such as point-implicit or nonlinear multigrid solvers often fail to converge to steady-state solutions for these codes. However, it should be noted that, when coupled with FV codes, Newton–Krylov solvers also incur substantial additional expense over simpler solvers, in what is a classic trade-off between robustness and efficiency [31,32]. Whether the enhanced accuracy of PG discretizations is worth the added cost, and/or whether more efficient solvers can be devised for these discretizations, remain open research questions today. Nevertheless, research and development on solvers for high-order PG and DG discretizations, driven by the desire to increase efficiency and robustness for solving the stiffer discretized equations, has benefited other discretizations as well. The Newton–Krylov techniques originally developed for finite-element discretizations have been applied increasingly to FV discretizations, increasing their robustness in turn. The benefits of the present work are applicable to FV methods as well as to other discretizations such as residual distribution schemes. The goal here is to enhance the robustness and efficiency of a Newton–Krylov method such that the resulting solver is capable of solving stiff problems resulting from high-order SUPG-RANS discretizations.

The remainder of this paper is arranged as follows. In Section 2, the governing equations are reviewed, and in Section 3, the utilized SUPG discretization is presented.
In Section 4, the nonlinear solution strategy based on the PTC algorithm is explained, and in Section 5, the linear solver and the preconditioning techniques are described in detail. Then, in Section 6, example second- and third-order steady-state RANS solutions are presented for the NASA Common Research Model (CRM) wing-body configuration, showing that the developed block smoothing technique can provide a more robust and more efficient preconditioning method, compared to the traditional ILU(k) method. At the end, the conclusions are given in Section 7.

2. Governing equations

The governing equations consist of the compressible Reynolds-Averaged Navier–Stokes (RANS) equations coupled with the negative version of the one-equation Spalart–Allmaras (SA) turbulence model [33]. In the following, a dummy index (an index which appears twice within an additive term of an expression) implies a summation over the range of indices, unless otherwise stated. In the conservative form, the governing equations can be written as

$$\partial_t \mathbf{Q} + \partial_i \mathbf{F}_i = \mathbf{S}, \qquad i = 1, \dots, n_d \tag{1}$$

where the bold letters denote vector variables due to multiple equations, i indexes the spatial dimension, n_d is the number of space dimensions, ∂_i indicates a spatial derivative in the direction of the ith Cartesian coordinate, and ∂_t

³ Here, P and Q are the polynomial degrees of the shape functions for the field variables and geometrical mapping, respectively.


denotes a time derivative. The vector of the conservative flow variables Q, the source term S, and the flux vector F_i, which consists of inviscid and viscous parts F_i^E and F_i^v, are given by

$$\mathbf{F}_i = \mathbf{F}_i^E - \mathbf{F}_i^v, \qquad \mathbf{F}_i^E = \mathbf{F}_i^E(\mathbf{Q}), \qquad \mathbf{F}_i^v = \mathbf{F}_i^v\!\left(\mathbf{Q}, \partial_j \mathbf{Q}\right), \qquad i, j = 1, \dots, n_d \tag{2}$$

$$\mathbf{Q} = \begin{Bmatrix} \rho \\ \rho u_j \\ \rho e_t \\ \rho \tilde{\nu} \end{Bmatrix}, \qquad \mathbf{F}_i^E = \begin{Bmatrix} \rho u_i \\ \rho u_j u_i + \delta_{ji} p \\ \rho h_t u_i \\ \rho \tilde{\nu} u_i \end{Bmatrix}, \qquad \mathbf{F}_i^v = \begin{Bmatrix} 0 \\ \tau_{ji} \\ \tau_{ij} u_j - q_i \\ \frac{\mu}{\sigma}\left(1 + f_n \chi\right) \partial_i \tilde{\nu} \end{Bmatrix}, \qquad \mathbf{S} = \begin{Bmatrix} 0 \\ \vdots \\ 0 \\ S_T \end{Bmatrix} \tag{3}$$

where ρ is the density, p is the static pressure, u_i is the velocity component in the ith Cartesian direction, e_t is the specific total energy, h_t = e_t + p/ρ is the specific total enthalpy, and δ_ij is the Kronecker delta. With the assumption of a perfect gas, the pressure is related to the state variables by the constitutive relation

$$p = (\gamma - 1)\left(\rho e_t - \tfrac{1}{2}\rho u_i u_i\right) \tag{4}$$

where γ is the ratio of specific heats, which is set to 1.4. Also, τ is the shear stress tensor and q is the heat flux, which are given by

$$\tau_{ij} = (\mu + \mu_T)\left(\partial_j u_i + \partial_i u_j - \tfrac{2}{3}\partial_k u_k\, \delta_{ij}\right), \qquad q_i = -c_p\left(\frac{\mu}{Pr} + \frac{\mu_T}{Pr_T}\right)\partial_i T \tag{5}$$

where T is the temperature, and Pr = 0.72 and Pr_T = 0.9 are the Prandtl and turbulent Prandtl numbers, respectively. µ is the dynamic viscosity, which is obtained by Sutherland's Law, and µ_T is the turbulent eddy viscosity, which is given by

$$\mu_T = \begin{cases} \rho \tilde{\nu} f_{v1} & \tilde{\nu} \ge 0 \\ 0 & \tilde{\nu} < 0 \end{cases}, \qquad f_{v1} = \frac{\chi^3}{\chi^3 + c_{v1}^3}, \qquad \chi \equiv \frac{\tilde{\nu}}{\nu} \tag{6}$$

where ν is the kinematic viscosity and ν̃ is the working variable of the turbulence model. The function f_n in the diffusion term of the turbulence model equation is given by

$$f_n = \begin{cases} 1 & \tilde{\nu} \ge 0 \\ \dfrac{c_{n1} + \chi^3}{c_{n1} - \chi^3} & \tilde{\nu} < 0 \end{cases} \tag{7}$$

The source term of the turbulence model equation is given by

$$S_T = \rho\left(P_n - D_n\right) + \frac{c_{b2}\,\rho}{\sigma}\left(\partial_i \tilde{\nu}\, \partial_i \tilde{\nu}\right) - \frac{1}{\sigma}\left(\nu + \tilde{\nu}\right)\left(\partial_i \rho\, \partial_i \tilde{\nu}\right) \tag{8}$$

where

$$P_n = \begin{cases} c_{b1}\left(1 - f_{t2}\right)\tilde{S}\tilde{\nu} & \tilde{\nu} \ge 0 \\ c_{b1}\left(1 - c_{t3}\right)S\tilde{\nu} & \tilde{\nu} < 0 \end{cases}, \qquad D_n = \begin{cases} \left(c_{w1} f_w - \dfrac{c_{b1}}{\kappa^2} f_{t2}\right)\left(\dfrac{\tilde{\nu}}{d}\right)^2 & \tilde{\nu} \ge 0 \\ -c_{w1}\left(\dfrac{\tilde{\nu}}{d}\right)^2 & \tilde{\nu} < 0 \end{cases} \tag{9}$$

$$\tilde{S} = \begin{cases} S + \bar{S} & \bar{S} \ge -c_{v2} S \\ S + \dfrac{S\left(c_{v2}^2 S + c_{v3}\bar{S}\right)}{\left(c_{v3} - 2c_{v2}\right)S - \bar{S}} & \bar{S} < -c_{v2} S \end{cases} \tag{10}$$

Here, d is the distance to the wall and S is the magnitude of the vorticity,

$$S = \sqrt{\omega_i \omega_i}, \qquad \omega_i = \varepsilon_{ijk}\,\partial_j u_k \tag{11}$$


where ε_ijk is the alternating unit vector. The remaining functions are

$$\bar{S} = \frac{\tilde{\nu} f_{v2}}{\kappa^2 d^2}, \qquad f_{v2} = 1 - \frac{\chi}{1 + \chi f_{v1}}, \qquad f_{t2} = c_{t3}\, e^{-c_{t4}\chi^2} \tag{12}$$

$$f_w = g\left(\frac{1 + c_{w3}^6}{g^6 + c_{w3}^6}\right)^{1/6}, \qquad g = r + c_{w2}\left(r^6 - r\right), \qquad r = \min\!\left(\frac{\tilde{\nu}}{\tilde{S}\kappa^2 d^2},\ r_{lim}\right) \tag{13}$$
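For reference, the closure functions of Eqs. (6), (7), (12), and (13) collapse into a few lines of code. The sketch below is an illustrative Python transcription, not an excerpt of the HOMA solver; it uses the model constants listed immediately below, and all function names are ours.

```python
import numpy as np

# SA-negative model constants (as listed at the end of this section)
sigma, kappa = 2.0 / 3.0, 0.41
c_v1, c_v2, c_v3, c_n1 = 7.1, 0.7, 0.9, 16.0
c_b1, c_b2 = 0.1355, 0.622
c_w1 = c_b1 / kappa**2 + (1.0 + c_b2) / sigma
c_w2, c_w3, c_t3, c_t4, r_lim = 0.3, 2.0, 1.2, 0.5, 10.0

def eddy_viscosity(rho, nu_tilde, nu):
    """Turbulent eddy viscosity mu_T of Eq. (6); zero for negative nu_tilde."""
    chi = nu_tilde / nu
    f_v1 = chi**3 / (chi**3 + c_v1**3)
    return rho * nu_tilde * f_v1 if nu_tilde >= 0.0 else 0.0

def f_n(nu_tilde, nu):
    """Diffusion-term function of Eq. (7)."""
    chi = nu_tilde / nu
    return 1.0 if nu_tilde >= 0.0 else (c_n1 + chi**3) / (c_n1 - chi**3)

def near_wall_functions(nu_tilde, nu, d, S_tilde):
    """f_v2 and f_t2 of Eq. (12) and f_w of Eq. (13)."""
    chi = nu_tilde / nu
    f_v1 = chi**3 / (chi**3 + c_v1**3)
    f_v2 = 1.0 - chi / (1.0 + chi * f_v1)
    f_t2 = c_t3 * np.exp(-c_t4 * chi**2)
    r = min(nu_tilde / (S_tilde * kappa**2 * d**2), r_lim)
    g = r + c_w2 * (r**6 - r)
    f_w = g * ((1.0 + c_w3**6) / (g**6 + c_w3**6))**(1.0 / 6.0)
    return f_v2, f_t2, f_w
```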

And finally, the constants are σ = 2/3, κ = 0.41, c_v1 = 7.1, c_v2 = 0.7, c_v3 = 0.9, c_n1 = 16.0, c_b1 = 0.1355, c_b2 = 0.622, c_w1 = c_b1/κ² + (1 + c_b2)/σ, c_w2 = 0.3, c_w3 = 2.0, c_t3 = 1.2, c_t4 = 0.5, and r_lim = 10.0.

3. Spatial discretization

To start the discretization, the strong form of the problem is written as an initial boundary value problem:

$$\mathrm{L}(\mathbf{q}) = \mathbf{S} \qquad \vec{x} \in \Omega,\ t \in [0, \infty) \tag{14a}$$

$$\mathbf{F}_i = \mathbf{F}_i^b \qquad \vec{x} \in \Gamma_F,\ t \in [0, \infty),\ i = 1, 2, 3 \tag{14b}$$

$$\mathbf{q}(x, t) = \mathbf{q}_D \qquad \vec{x} \in \Gamma_D,\ t \in [0, \infty) \tag{14c}$$

$$\mathbf{q}(x, 0) = \mathbf{q}_0(x) \qquad \vec{x} \in \Omega \tag{14d}$$

where q is the vector of the dependent variables, which may be chosen over the conservative variables Q to facilitate the implementation, Ω ⊂ R³ is a bounded domain with Lipschitz-continuous boundary Γ, F_i^b are the prescribed boundary fluxes through the Γ_F portion of the boundary, q_D is the Dirichlet boundary condition on the Γ_D portion of the boundary, and the operator L is defined as

$$\mathrm{L}(\cdot) := \left[\mathbf{A}_q\right]\partial_t(\cdot) + \left[\mathbf{A}_i^E\right]\left[\mathbf{A}_q\right]\partial_i(\cdot) - \partial_i\!\left(\left[\mathbf{G}_{ij}\right]\left[\mathbf{A}_q\right]\partial_j(\cdot)\right) \tag{15}$$

where [A_q] = ∂_q Q = ∂Q/∂q is the variable transformation matrix, [A_i^E] = ∂_Q F_i^E = ∂F_i^E/∂Q is the Euler flux Jacobian matrix, and [G_ij] is the diffusivity matrix, which is defined such that F_i^v = [G_ij] ∂_j Q. In the present work, q is the vector of primitive variables, q = [ρ, u_i, T, ν̃]^T. This choice is suitable for modeling fluids with nonlinear equations of state which provide the pressure and other thermodynamic variables in terms of density and temperature. Also, this choice facilitates the implementation of the constant-temperature boundary condition.

For discretization, Ω is approximated by a computational domain Ω^h with piecewise-polynomial boundary Γ^h. Then, the finite-element mesh T^h = {Ω¹, Ω², …, Ω^{n_el}} is defined as the division of Ω^h into a finite number of non-overlapping elements such that Ω^h = ⋃_{e=1}^{n_el} Ω^e. Accordingly, the boundary is partitioned as Γ^h = ⋃_{e=1}^{n_el} Γ^e ∩ Γ. Next, each element e is equipped with a polynomial order 1 ≤ P(Ω^e) = P^e. At this point, spatial approximation spaces can be defined as

$$\mathcal{S}_t^h := \left\{ \mathbf{q}\ \middle|\ \mathbf{q}(\cdot, t) \in \left[\mathcal{H}^1\!\left(\Omega^h\right)\right]^{n_Q},\ \mathbf{q}(\cdot, t)|_{\Omega^e} \in \left[\mathcal{P}^{P^e}\!\left(\Omega^e\right)\right]^{n_Q}\ \forall e,\ t \in [0,\infty),\ \text{and}\ \mathbf{q}(\cdot, t) = \mathbf{q}_D\ \text{on}\ \Gamma_D^h \right\} \tag{16}$$

$$\mathcal{W}^h := \left\{ w\ \middle|\ w \in \mathcal{H}^1\!\left(\Omega^h\right);\ w|_{\Omega^e} \in \mathcal{P}^{P^e}\!\left(\Omega^e\right)\ \forall e,\ \text{and}\ w = 0\ \text{on}\ \Gamma_D^h \right\} \tag{17}$$

where H¹ is the usual Sobolev space of weakly differentiable functions, [H¹]^{n_Q} is the corresponding space for vector functions with n_Q components, and P^P is the polynomial space, complete to the order P. Now, the SUPG solution [16–18] to the weak form of the problem can be expressed as: for any t ∈ [0, ∞) find q^h ∈ S_t^h such that for all N^h ∈ W^h,

$$\underbrace{\int_{\Omega^h}\left(N^h \partial_t \mathbf{Q}^h - \partial_i N^h\, \mathbf{F}_i^h - N^h \mathbf{S}^h\right) d\Omega + \int_{\Gamma_F^h} N^h \mathbf{F}_i^{h,b} n_i\, d\Gamma}_{\text{Galerkin Term}} + \underbrace{\sum_{e=1}^{n_{el}} \int_{\Omega^e} \left[\mathbf{P}^e\right]\left(\partial_t \mathbf{Q}^h + \partial_i \mathbf{F}_i^h - \mathbf{S}^h\right) d\Omega^e}_{\text{Stabilization Term}} = 0 \tag{18}$$

In the above equation, the first and second terms are direct results of a Galerkin discretization, and the third term is added for stabilization of the problem in the presence of strong convection effects. The superscript h denotes discretized


variables and n_i are the components of the unit outward normal on Γ. The discrete solution q^h is expanded as

$$\mathbf{q}^h = \sum_{i=1}^{n_{DOF}} \mathbf{q}_i N_i^h \quad \text{on } \Omega^h \tag{19}$$

where the q_i's are the solution coefficients or Degrees of Freedom (DOFs), the N_i^h's are the basis functions for the finite-dimensional space S_t^h, and n_DOF is the dimension of that space as well as the number of DOFs. Note that the stabilization term is calculated over all the elements in the computational domain. The term [P] is called the perturbation to the weight function space, as it modifies the original Galerkin method to a Petrov–Galerkin method with N[I] + [P] as the weight function [18]. For the SUPG method,

$$\left[\mathbf{P}^e\right] = \left[\mathbf{A}_i^E\right]\partial_i N^h \left[\boldsymbol{\tau}^e\right] \tag{20}$$

where [τ] is called the stabilization matrix. It has the dimension of time and it can be obtained based on the eigensystem decomposition of the projection of the flux Jacobian matrices onto the spatial gradients of the basis functions [34]. For element e,

$$\left[\boldsymbol{\tau}^e\right] = \left[\sum_{j=1}^{n^e}\left[\mathbf{B}_j^e\right]\right]^{-1} \tag{21}$$

$$\left[\mathbf{B}_j^e\right] = \left|\partial_i N_j \left[\mathbf{A}_i^E\right]\right| \tag{22}$$

$$\left|\partial_i N_j \left[\mathbf{A}_i^E\right]\right| = \left[\mathbf{T}\right]\left|\boldsymbol{\Lambda}\right|\left[\mathbf{T}\right]^{-1} \tag{23}$$

where [T] and |Λ| denote, respectively, the matrix of right eigenvectors and the diagonal matrix of absolute values of the eigenvalues of the matrix appearing on the left-hand side of Eq. (23). N_j and n^e are the shape functions and the number of nodes within the element e, respectively. For low Reynolds number flows, the formulation of Eq. (23) would result in an overly dissipative scheme. In order to maintain the nominal order of accuracy for viscous flows, the matrix [B_j^e] is augmented by a viscous term, given as

$$\left[\mathbf{B}_j^e\right] = \left|\partial_i L_j^e \left[\mathbf{A}_i^E\right]\right| + \partial_i L_j^e \left[\mathbf{G}_{ik}\right] \partial_k L_j^e \qquad \text{(no summation on } j\text{)} \tag{24}$$

3.1. Boundary conditions

For boundary conditions, far-field, symmetry, and no-slip wall boundaries are considered. The walls are assumed to be adiabatic. For the far-field boundaries, the boundary flux vector F_i^b only includes the inviscid part and is calculated using the Roe scheme [35] based on the free-stream and interior state values. For inviscid walls, the boundary flux vector only takes the pressure from the interior and thus F_i^b = [0, δ_1i p, δ_2i p, δ_3i p, 0]^T. For viscous walls, the no-slip condition can be enforced in either strong or weak form. For strong enforcement, the velocities on the wall nodes are initialized to zero and kept constant during the iterative process. As will be described in Section 4, the system of discretized equations is solved using a Newton-type algorithm. To keep the wall velocities constant during the iterations, the linearized system of equations, Eq. (39), is modified as such: at the corresponding rows, the residual values (on the right-hand side) are set to zero, the diagonal entries of the Jacobian matrix (on the left-hand side) are set to one, and the off-diagonal entries are set to zero [4]. For weak enforcement, the following boundary term is added to the left-hand side of Eq. (18):

$$\begin{aligned} N_{\Gamma_D} = &\int_{\Gamma_D^h} N^h \left(\mathbf{F}_i^E\!\left(\mathbf{q}^b\right) - \mathbf{F}_i^v\!\left(\mathbf{q}^b, \partial_j \mathbf{q}\right)\right) n_i\, d\Gamma - \int_{\Gamma_D^h} \partial_i N^h \left[\mathbf{G}_{ij}\!\left(\mathbf{q}^h\right)\right]\left[\mathbf{A}_q\!\left(\mathbf{q}^h\right)\right]\left(\mathbf{q}^h - \mathbf{q}^b\right) n_j\, d\Gamma \\ &+ \int_{\Gamma_D^h} \eta \left[\left[\mathbf{G}_{ij}\!\left(\mathbf{q}^h\right)\right]\left[\mathbf{A}_q\!\left(\mathbf{q}^h\right)\right]\left(\mathbf{q}^h - \mathbf{q}^b\right) n_j\right] N^h n_i\, d\Gamma \end{aligned} \tag{25}$$

where q^b is a state vector which is constructed based on the Dirichlet boundary conditions and the interior solution. Adiabatic and no-slip conditions yield q^b = [ρ, 0, 0, 0, T, 0]^T, and thus the component of the boundary viscous fluxes F_i^v associated with the energy equation vanishes. To calculate q^b, ρ and T are calculated based on the


interior solution on the element adjacent to the boundary. Also, the turbulence working variable ν̃ is set to be zero at no-slip walls. In Eq. (25), η is called the penalty parameter and it is given by [36]

$$\eta = \frac{\left(P^e + 1\right)\left(P^e + n_d\right)}{n_d}\,\frac{S^e}{V^e} \tag{26}$$

where S^e and V^e are the surface area and volume of the element adjacent to the boundary, respectively.

Remarks.
– When no-slip conditions are enforced weakly, the definition of the weight functions in Eq. (17) must be updated such that they do not vanish on the Γ_D portion of the boundary.
– For the numerical results presented in this study, the no-slip condition is enforced in the weak form.
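For illustration, the penalty parameter of Eq. (26) reduces to a one-line helper. The sketch below is not from the HOMA code, and the element surface area and volume are assumed to be supplied by the caller.

```python
def penalty_parameter(P_e: int, n_d: int, S_e: float, V_e: float) -> float:
    """Weak-boundary-condition penalty of Eq. (26): eta = (P+1)(P+n_d)/n_d * S/V."""
    return (P_e + 1) * (P_e + n_d) / n_d * (S_e / V_e)

# Example: a quadratic (P2) element in 3D with a unit surface-to-volume ratio
eta = penalty_parameter(P_e=2, n_d=3, S_e=1.0, V_e=1.0)  # = 5.0
```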

3.2. Shock capturing

In the presence of shock waves, if the utilized mesh does not have the proper resolution, additional dissipation is required to stabilize the numerical solution. This can be done by augmenting the viscous flux as

$$\tilde{\mathbf{F}}_i^v = \left[\mathbf{G}_{ij}\right]\partial_j \mathbf{Q} + \left[\mathbf{G}_{ij}^s\right]\partial_j \tilde{\mathbf{Q}} \tag{27}$$

$$\left[\mathbf{G}_{ij}^s\right] = \nu_s \left[\mathbf{D}_{ij}^h\right] \tag{28}$$

where [G_ij^s] is the artificial viscosity matrix, ν_s is the artificial viscosity, and [D_ij^h] is a diagonal matrix. Q̃ is a state vector which includes ρh_t instead of ρe_t; this choice is expected to conserve the total enthalpy across the shock, which is required by the Rankine–Hugoniot shock jump relations [37]. Here, using an identity matrix [I], [D_ij^h] is defined as

$$\left[\mathbf{D}_{ij}^h\right] = \left[\mathbf{I}\right]\delta_{ij} \tag{29}$$

but, in general, it can be designed to be biased by the directional mesh size metrics [24,37]. The artificial viscosity is given by

$$\nu_s = \hat{\epsilon}_s\, \lambda_{max}\, \frac{h^e}{P^e} \tag{30}$$

$$\lambda_{max} = \sqrt{u_i u_i} + c \tag{31}$$

$$h^e = \left(\partial_i L_j^e\, \partial_i L_j^e\right)^{-\frac{1}{2}}, \qquad i = 1, \dots, n_d, \quad j = 1, \dots, n_L^e \tag{32}$$

where n_L^e is the number of linear shape functions within the element e and h^e is a local mesh metric, whose division by P^e aims to make the shock width O(h/P) [38]. ε̂_s is a controlled shock switch which is given by

$$\hat{\epsilon}_s = \epsilon_s\, \psi\!\left(\epsilon_s : x_s, x_e\right) \tag{33}$$

In the above equation, ε_s is a shock switch which is given by [39]

$$\epsilon_s = \begin{cases} \dfrac{\left(u_i \partial_i p\right) h^e}{\left(u_i \partial_i p\right) h^e + \kappa_s\, p\, c} & \left(u_i \partial_i p\right) > 0 \\ 0 & \left(u_i \partial_i p\right) \le 0 \end{cases} \tag{34}$$

where κ_s is a tuning parameter that varies between 0.1, for a strong shock, and 1.0, for a weak shock. An important feature of this shock switch is that it turns off for a negative value of (u_i ∂_i p), which corresponds to an expansion fan; as such, it does not act blindly on high pressure gradients. Further control on the shock switch is obtained


by the use of a ramping function ψ (which first appears in Eq. (33)), given by [4]

$$\psi\!\left(x : x_s, x_e\right) = \begin{cases} 0 & x < x_s \\ \dfrac{1}{2}\left(\sin\left(\theta\right) + 1\right) & x_s \le x \le x_e \\ 1 & x > x_e \end{cases} \tag{35}$$

$$\theta = \frac{\pi}{2}\,\frac{2x - \left(x_s + x_e\right)}{x_e - x_s} \tag{36}$$
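A minimal sketch of the switch of Eq. (34) and the controlled switch of Eqs. (33), (35), and (36) is given below; following the description in the text, the ramp rises smoothly from zero at the starting threshold x_s to one at the ending threshold x_e. The function names and the sample input values are illustrative only.

```python
import math

def shock_switch(u_dot_grad_p: float, h_e: float, p: float, c: float, kappa_s: float) -> float:
    """Pressure-gradient-based switch of Eq. (34); inactive in expansion regions."""
    if u_dot_grad_p <= 0.0:
        return 0.0
    return (u_dot_grad_p * h_e) / (u_dot_grad_p * h_e + kappa_s * p * c)

def ramp(x: float, x_s: float, x_e: float) -> float:
    """Sine ramp of Eqs. (35)-(36): 0 below x_s, 1 above x_e, smooth in between."""
    if x < x_s:
        return 0.0
    if x > x_e:
        return 1.0
    theta = 0.5 * math.pi * (2.0 * x - (x_s + x_e)) / (x_e - x_s)
    return 0.5 * (math.sin(theta) + 1.0)

# Controlled switch of Eq. (33), with illustrative thresholds
eps = shock_switch(u_dot_grad_p=50.0, h_e=0.01, p=1.0, c=1.0, kappa_s=0.5)
eps_hat = eps * ramp(eps, x_s=0.05, x_e=0.2)
```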

This ramping function is zero for values under the starting threshold, x_s, and then it smoothly increases to one when the ending threshold, x_e, is reached. The incentive for using the ramping function is to deactivate the artificial viscosity for small values of the shock switch. As explained by Anderson et al. [4], without the ramping function, the switch may be nonzero in a considerable portion of the flow field, and thus it may cause the entire scheme to become first-order accurate, independent of the polynomial degree of the basis functions.

4. Nonlinear solver

In this section, the time integration and the nonlinear solution methods are explained. The utilized method is known as pseudo-transient continuation (PTC) [40], which, in essence, is an inexact Newton method with a line search process to find an optimal relaxation factor for the Newton updates (obtained from the solution of the linearized system), and a controlling mechanism for the CFL number to adjust the pace of the nonlinear advancement. The specific algorithm developed here is similar to the algorithms presented in Refs. [2,3]. Minor modifications have been added that are described in the following.

After spatial discretization, the system of equations resulting from Eq. (18) can be written in a semi-discrete form as

$$\left[\mathbf{M}_{PG}^h\right]\partial_t \mathbf{q}^h + \mathbf{R}^h\!\left(\mathbf{q}^h\right) = 0 \tag{37}$$

where R^h denotes the spatial residual and [M_PG^h] represents a mass matrix. Since in Eq. (18) both the Galerkin and stabilization terms include temporal parts, this mass matrix gets contributions from both terms, and the subscript PG has been used to denote this. Also, the mass matrix includes the transformation from Q to q. Applying the backward difference formula (BDF) to Eq. (37) results in

$$\mathbf{Res}^{h,n+1}\!\left(\mathbf{q}^{h,n+1}\right) = \frac{1}{\Delta t}\left[\mathbf{M}_{PG}^h\right]\left(\mathbf{q}^{h,n+1} - \mathbf{q}^{h,n}\right) + \mathbf{R}^h\!\left(\mathbf{q}^{h,n+1}\right) = 0 \tag{38}$$

where Res^{h,n+1} represents the unsteady flow residual at time step n + 1. This implicit system is then linearized using an automatic differentiation technique (by operator overloading) and the vector of the dependent variables is updated in a Newton-type iteration. The linearized system and the update equation are given by

$$\left[\mathbf{J}^{h,n}\!\left(\mathbf{q}^{h,n}\right)\right]\Delta\mathbf{q}^{h,n} = -\mathbf{R}^{h,n}\!\left(\mathbf{q}^{h,n}\right) \tag{39}$$

$$\mathbf{q}^{h,n+1} = \mathbf{q}^{h,n} + \omega_{opt}\,\Delta\mathbf{q}^{h,n} \qquad \text{(update equation)} \tag{40}$$

$$\left[\mathbf{J}^{h,n}\right] = \partial_{\mathbf{q}^h}\,\mathbf{Res}^{h,n}\!\left(\mathbf{q}^{h,n}\right) \tag{41}$$

where [J^{h,n}] denotes the Jacobian matrix and ω_opt is an optimal relaxation factor which is determined in a line-search process. To accelerate the global convergence, the Newton algorithm is modified to include local time steps that are amplified by a CFL number. The local time steps are calculated for each node and they form a continuous space. Here, the linear system in Eq. (39) is solved using a preconditioned Flexible Generalized Minimal Residual (FGMRes) [41] algorithm. For computational efficiency, this system is not fully solved; usually, a few (two to four) orders of magnitude reduction in the L₂-norm of the linear residual is sufficient. Also, due to memory limitations, a limited number (100 to 300) of Krylov vectors is allocated for the FGMRes solver. If after using all the allocated Krylov vectors the linear tolerance is not reached, the FGMRes solver may be restarted. If after the restarts the linear tolerance still is not reached, the final linear residual is compared against a maximum allowable tolerance tol_a (= 0.5 in this paper). If this criterion is not met either, the result of the linear solution is rejected. In such cases we say that the linear solver has "failed". If the linear solver fails, the CFL number is reduced and the Jacobian


matrix is reevaluated. Otherwise, we proceed to calculate the optimal relaxation factor ω_opt through the line search process.

Line-Search Process: In this process, first, based on physical considerations, an upper bound for the relaxation factor is determined. Here, the upper bound ω_{ρ,T} is determined such that the relative changes in density and temperature are limited by a maximum value θ_{ρ,T} (= 0.1 to 0.4). That is,

$$\frac{\omega_\rho\, \Delta\rho_i^{h,n}}{\rho_i^{h,n}} \le \theta_{\rho,T} \quad \text{and} \quad \frac{\omega_T\, \Delta T_i^{h,n}}{T_i^{h,n}} \le \theta_{\rho,T}, \qquad 1 \le i \le n_{DOF} \tag{42}$$

$$\omega_{\rho,T} = \min\left(\omega_\rho, \omega_T\right) \tag{43}$$

Next, the unsteady residual Res^h is evaluated using four relaxed updates, as

$$\mathbf{Res}^{h,j} = \mathbf{Res}^h\!\left(\mathbf{q}^{h,n} + \omega_j\, \Delta\mathbf{q}^{h,n}\right), \quad j = 0, 1, 2, 3; \qquad \omega_0 = 0, \ \ \omega_1 = 0.25\,\omega_{\rho,T}, \ \ \omega_2 = 0.5\,\omega_{\rho,T}, \ \ \omega_3 = \omega_{\rho,T} \tag{44}$$

and a cubic polynomial is fitted to the L₂-norms of the above residuals. The optimal relaxation factor ω_opt is found by locating the minimum of the cubic polynomial.

CFL Control Mechanism: Using ω_opt, the CFL number is updated as

$$\mathrm{CFL}^{n+1} = \begin{cases} \beta_{CFL}\, \mathrm{CFL}^n & \omega_{opt} = 1 \\ \mathrm{CFL}^n & \omega_{min} \le \omega_{opt} < 1 \\ \mathrm{CFL}^n / \beta_{CFL}^3 & \omega_{opt} < \omega_{min} \end{cases} \tag{45}$$
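A minimal sketch of this line-search and CFL-control step is shown below, assuming the caller supplies a routine that returns the L₂-norm of the unsteady residual for a relaxed update. The cubic fit and its minimizer are evaluated with NumPy; the function names and the default values for ω_min and β_CFL are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def optimal_relaxation(residual_norm, omega_max):
    """Fit a cubic to ||Res|| at omega = 0, 0.25, 0.5, 1.0 times omega_max (Eq. (44))
    and return the relaxation factor minimizing the fitted polynomial."""
    omegas = np.array([0.0, 0.25, 0.5, 1.0]) * omega_max
    norms = np.array([residual_norm(w) for w in omegas])
    coeffs = np.polyfit(omegas, norms, 3)          # cubic through the four samples
    candidates = np.linspace(0.0, omega_max, 201)  # locate the minimum on [0, omega_max]
    return candidates[np.argmin(np.polyval(coeffs, candidates))]

def update_cfl(cfl, omega_opt, omega_min=0.1, beta=2.0):
    """CFL control of Eq. (45): grow at full steps, hold for partial steps, cut otherwise.
    omega_min and beta are illustrative defaults."""
    if omega_opt == 1.0:
        return beta * cfl
    if omega_opt >= omega_min:
        return cfl
    return cfl / beta**3
```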

4.1. Implicit residual smoothing

As described in Ref. [13], the possibility of localized nonlinearities that hinder the advancement of the global nonlinear problem may be attributed to a non-smooth residual distribution, since for a perfectly smooth residual distribution it should be possible to advance the global problem uniformly. Here, following the work by Kamenetskiy et al. [1], we have implemented an implicit residual smoothing (IRS) technique that is applied to the left-hand side (i.e. Jacobian matrix) and line search of the PTC algorithm. In this technique, the already-pseudo-transient equations are augmented by an artificial transient viscous term, with the aim that the additional viscous terms in the Jacobian matrix guide the solution through a perturbed nonlinear path with smoother residuals. For our finite-element discretization, the following term is added to the left side of Eq. (38) and the Jacobian matrix is updated accordingly:

$$I_{RS} = -c \sum_{e=1}^{n_{el}} \int_{\Omega^e} \frac{\left(h^e\right)^2}{\Delta t}\left[\mathbf{F}_{RS,i}^v\!\left(\mathbf{q}^{h,n+1}, \partial_j \mathbf{q}^{h,n+1}\right) - \mathbf{F}_{RS,i}^v\!\left(\mathbf{q}^{h,n}, \partial_j \mathbf{q}^{h,n}\right)\right] d\Omega^e \tag{46}$$

In the above equation, c (= 1 to 10) is a tuning parameter and h^e is given by Eq. (32). For simplicity, in this study F_{RS,i}^v = ∂_i Q^h. Note that Δt is amplified by the CFL number and thus, for very large CFL values, Newton convergence can be obtained. In the current implementation, the residual smoothing term only affects the time-dependent components of the PTC formulation, that is, the Jacobian matrix and the transient residual in the line-search process; thus the steady-state residual on the right-hand side of Eq. (39) remains untouched. This is in contrast with the residual smoothing technique developed in Ref. [13], in which the steady-state residual is modified and the Jacobian matrix is untouched.

5. Linear solver and preconditioners

The performance of the nonlinear solver described in the previous section depends on the performance of the linear solver used to obtain the Newton updates in Eq. (39). In this study, a preconditioned Flexible GMRes (FGMRes) algorithm is used to ensure monotonic convergence of the linear solver. This is particularly important when a smoothing technique is used for preconditioning, to ensure that any instability in the smoothing iterations does not result in divergence of the linear solver.
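To make the role of the preconditioner concrete, the sketch below wraps a smoother as a SciPy LinearOperator and hands it to a preconditioned GMRES solve. SciPy does not ship a flexible GMRES, so its right-preconditioned GMRES stands in here for the FGMRes solver used in the paper, and the smoother shown is a plain threshold-ILU application rather than the block smoother developed later in this section; the small test matrix is purely illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_ilu_preconditioner(A):
    """Wrap an incomplete factorization of A as a LinearOperator, so that each
    Krylov vector is preconditioned by one application of the factorization."""
    ilu = spla.spilu(A.tocsc(), fill_factor=10.0, drop_tol=1e-5)  # threshold ILU, cf. ILU(k)
    return spla.LinearOperator(A.shape, matvec=ilu.solve)

# Small illustrative system standing in for the Jacobian system of Eq. (39)
n = 200
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

M = make_ilu_preconditioner(A)
# The paper only asks for a few orders of magnitude of linear-residual reduction,
# which would be set through the solver's tolerance arguments.
x, info = spla.gmres(A, b, M=M, restart=100)
```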


5.1. Incomplete LU factorization

The first preconditioner considered is the incomplete lower–upper factorization with k levels of fill, which is also known as ILU(k) [41]. For an unpartitioned computational domain, the memory usage, efficiency, and effectiveness of the ILU(k) method are dependent on the ordering of the unknowns. In this study, a Reverse Cuthill–McKee (RCM) [42] algorithm is used for reordering. This method results in a reduced bandwidth in the Jacobian matrix, and thus fewer fill-ins in the factorized preconditioner matrix. Here, the ILU preconditioner is applied only once on each Krylov vector. This is a common approach that is used in most linear solver libraries, including PETSc [43–45]. Also, it should be mentioned that a block version of the ILU algorithm is utilized here, where the blocks refer to the n_Q × n_Q blocks associated with each DOF (e.g. 6 × 6 blocks for a 3D turbulent flow problem). However, in the next subsection, blocks refer to zones including a group of mesh nodes.

5.2. Block ILU smoothing

As mentioned in the introduction section, in recent years, incomplete factorization methods have been widely used for preconditioning of Krylov solvers. However, despite robustness benefits, these methods are not suitable for large-scale parallel processing, particularly in their traditional form. These methods are inherently sequential, and in the context of domain decomposition, they are most often applied locally within each partition, which leads to decreased convergence rates with increasing levels of parallelization. Also, usually the factorized matrix is applied only once on each Krylov vector, since an iterative application can be uncontrollably unstable. Consequently, improving the robustness is primarily achieved through increasing the number of fill-ins, which translates to prohibitive memory requirements for high-order discretizations. On the other hand, the simple point-Jacobi method is memory efficient and easily parallelizable, and for diagonally dominant matrices it is stable. However, due to its purely local nature, the Jacobi method suffers from low convergence rates. Note that a single iteration of the point-Jacobi method can be viewed as the extreme case of the ILU method when the domain decomposition is maximal. Therefore, as shown in Fig. 1, one can consider these two methods as two extremes of a double spectrum that characterizes the factorization and smoothing iterations of preconditioning techniques.

The poor convergence rate of the point-Jacobi method can be greatly improved by grouping strongly-connected degrees of freedom in separate groups and solving for them in a semi-implicit manner. This notion has been the key idea behind the development of traditional implicit line smoothers, in which the strongly coupled unknowns are


Fig. 1. Factorization and smoothing iteration spectra of preconditioning techniques for Krylov solvers.

grouped into lines, ordered sequentially, and solved using the Thomas algorithm [46]. Following previous work in this area [14,15], we developed a matrix-based implicit line smoothing technique in which the lines are constructed based on a directional stiffness measure extracted from a (properly chosen) stiffness matrix that represents the strength of the couplings between the unknowns [9–11]. This smoothing technique was used for preconditioning of the FGMRes algorithm, and to stabilize it for high-CFL systems, a dual-CFL technique was developed. In this technique, which is described in Section 5.2.1, the preconditioning system itself is preconditioned using a secondary matrix with a capped CFL number. Also, in that work the application of the line solver was extended to high-order discretizations. Comparison of the developed line preconditioner with a traditional ILU(k) preconditioner showed that, in addition to robustness improvements, considerable benefits in terms of computational and memory efficiency can be obtained. One of the drawbacks of the line method, however, is that for single nodes (nodes that do not belong to any line), it reduces to an implicit point-Jacobi method.

In the present work, we generalize the aforementioned smoothing technique by extending the groups of strongly-connected nodes from lines to blocks. Fig. 2a shows a two-dimensional mesh with sample lines and blocks. In this figure, the blocks are shown by same-colored vertices and the lines are colored according to the blocks they reside in. The construction algorithms for the lines and blocks are described in Section 5.2.2. After forming the blocks, the nodes of the mesh are reordered in sequence of the blocks. That is, first, the nodes of the first block are numbered, and then the first node of the second block continues the numeration; this process continues until all the nodes are numbered. By doing so, the utilized matrix can be decomposed based on the blocks, as shown in Fig. 2b. In this figure, [D] and [O] denote portions of the matrix that represent the intra- and inter-connections of the block nodes, respectively. Having this decomposition, the ILU algorithm is used to obtain an approximate (or exact) factorization for the submatrices of [D], and then Algorithm 2 is used to perform the following block-Jacobi iteration:

$$\mathbf{r}^k = \mathbf{b} - \left[\mathbf{A}\right]\mathbf{x}^k, \qquad \mathbf{x}^{k+1} = \mathbf{x}^k + \omega_L \left[\mathbf{D}\right]^{-1}\mathbf{r}^k \tag{47}$$

where [A] = [D] + [O], and ω_L is an under-relaxation factor. To reduce the bandwidth of the intra-connection submatrices, the nodes within each block are reordered using the RCM algorithm. Fig. 2c compares the bandwidths of [D] for two cases with 1 and 64 blocks. Note that in a distributed memory implementation, the blocks are constructed within each partition of the mesh, and the case with 1 block corresponds to the traditional ILU method. As seen in Fig. 2c, the case with 64 blocks benefits from a much smaller bandwidth. Therefore, for this case a more accurate factorization (i.e. a factorization with a higher fill level) for the intra-connections of the blocks can be obtained at a lower memory and computation cost. Also, one can show that in the special case that the blocks are reduced to the lines, [D] becomes a tridiagonal (for a P1 discretization) or a pentadiagonal (for a P2 discretization) matrix. Thus, the ILU(0) algorithm returns the exact factorization of [D], and the block smoothing algorithm presented here reverts to the line smoothing algorithm described in Ref. [11]. Returning to the aforementioned double spectrum in Fig. 1, the proposed block ILU smoothing method lies between the line smoothing and traditional ILU methods. In comparison with the line smoothing method, it utilizes more factorization but fewer smoothing iterations, as each iteration is more effective. And in comparison with the traditional ILU method, it reduces the memory requirements by shifting some of the preconditioning load from factorization to smoothing iterations.

It should be noted that an initial attempt to employ the ILU preconditioner in a Jacobi-type iteration can be found in the work by Liu [47] in the context of overset grids, where each block includes all the nodes of a grid component.


Fig. 2. An example of block partitioning for the proposed block ILU smoother. (a) Blocks are shown by same-colored vertices and the lines are colored according to the blocks they reside in; (b) sample decomposition of the Jacobian matrix into [D] and [O] parts based on 4 blocks; (c) comparison of the bandwidths of the [D] portion for cases with 1 and 64 blocks.

In that work, however, the instability issue for high-CFL systems has not been discussed. The next subsection addresses this issue.

5.2.1. Dual-CFL algorithm

An important consideration in the application of an iterative line/block smoother is the diagonal dominance of the utilized matrix. For low-CFL systems, Algorithm 1 can provide a very effective preconditioning technique for Krylov solvers. However, as mentioned in Section 4, during the PTC algorithm the CFL number is increased aggressively, and thus, for high-CFL systems, the iterative line/block preconditioning method may become very unstable and ineffective. To overcome this problem, in Refs. [9–11] a dual-CFL algorithm was developed in which the preconditioning system itself is preconditioned using a secondary Jacobian matrix [P] calculated based on a capped CFL number (referred to as CFL_cap). Details of this method are shown by Algorithms 2 and 3. Note that in Algorithm 3, as in Eq. (47), in an outer loop (with n_o sweeps) the residual of the linear system ([A]x = b) is calculated using the original matrix (r^k = b − [A]x^k), and then in an inner loop (with n_i sweeps), the solution updates dx^k are calculated using the preconditioner matrix [P]:

$$\mathbf{dx}^{k,l+1} = \mathbf{dx}^{k,l} + \omega_L \left[\mathbf{D}\right]_P^{-1}\left(\mathbf{r}^k - \left[\mathbf{P}\right]\mathbf{dx}^{k,l}\right), \qquad l = 0, \dots, n_i - 1 \tag{48}$$

where k and l count the outer and inner loops, respectively, and [D]_P denotes the intra-connections of the block nodes in the preconditioner matrix [P], which is calculated as

$$\left[\mathbf{P}\right] = \begin{cases} \left[\mathbf{A}\right] & \mathrm{CFL} \le \mathrm{CFL}_{cap} \\ \left[\mathbf{A}_{\mathrm{CFL}_{cap}}\right] \ \text{or} \ \left[\hat{\mathbf{A}}_{\mathrm{CFL}_{cap}}\right] & \mathrm{CFL} > \mathrm{CFL}_{cap} \end{cases} \tag{49}$$

$$\left[\hat{\mathbf{A}}_{\mathrm{CFL}_{cap}}\right]_{ij} = \left[\mathbf{A}\right]_{ij} + \frac{\left[\hat{\mathbf{M}}\right]_{ij}}{\mathrm{CFL}_{cap}\,\Delta t_i} \tag{50}$$

where [A_{CFL_cap}] is a Jacobian matrix evaluated using CFL_cap and [M̂] is a mass matrix which is calculated based on the Galerkin portion of the discretization (see Eq. (18)). As seen in Eq. (49), when the CFL number is higher than CFL_cap, the preconditioner matrix [P] can be calculated in two ways. The first way requires evaluation of the Jacobian matrix at the capped CFL number, and the second way makes use of a mass matrix to add diagonal dominance to the original matrix. In this paper, the numerical examples are solved using the second way. If, in Algorithm 3, the inner iteration starts with a trivial initial guess (i.e. dx⁰ = 0) and n_i = 1, then one can show that the dual-loop iteration in this algorithm is reduced to a preconditioned Richardson iteration [48] as

$$\mathbf{x}^{k+1} = \mathbf{x}^k + \omega_L \left[\mathbf{D}\right]_P^{-1}\left(\mathbf{b} - \left[\mathbf{A}\right]\mathbf{x}^k\right) \tag{51}$$

Algorithm 4 describes this version of the block ILU smoothing algorithm. Finally, it should be noted that although


the dual-CFL strategy can greatly help stabilize the smoothing iterations, it cannot guarantee convergence. Therefore, in Algorithms 2 to 4, the iterations are stopped if the linear residual goes above a divergence tolerance tol_D (= 10⁶ in this paper).
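The sketch below illustrates the single-loop variant of the block ILU smoother (Eq. (51), in the spirit of Algorithm 4) with SciPy: the intra-block submatrices [D]_P of a diagonally dominated preconditioner matrix built as in Eq. (50) are reordered with RCM and factorized with ILU, and a fixed number of under-relaxed Richardson sweeps is applied. The block assignment, the lumped mass diagonal, and all parameter values are illustrative assumptions, not details of the HOMA implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
from scipy.sparse.csgraph import reverse_cuthill_mckee

def build_block_ilu_smoother(A, M_hat_diag, blocks, cfl, cfl_cap,
                             omega=0.8, sweeps=4, tol_div=1e6):
    """Return a function applying the block ILU smoother of Eq. (51) to a right-hand side.

    A          : sparse system matrix [A]
    M_hat_diag : lumped mass-matrix diagonal already divided by the local time step
    blocks     : list of index arrays, one per block of strongly-connected unknowns
    """
    # Preconditioner matrix of Eqs. (49)-(50): add diagonal dominance above the capped CFL
    P = A.tocsr() if cfl <= cfl_cap else (A + sp.diags(M_hat_diag / cfl_cap)).tocsr()

    # Factorize only the intra-block couplings [D]_P, after an RCM reordering within each block
    factors = []
    for idx in blocks:
        D = P[idx][:, idx]
        perm = reverse_cuthill_mckee(D.tocsr(), symmetric_mode=False)
        D_rcm = D[perm][:, perm].tocsc()
        factors.append((idx, perm, spla.spilu(D_rcm, fill_factor=5.0)))

    def smooth(b, x0=None):
        x = np.zeros_like(b) if x0 is None else x0.copy()
        for _ in range(sweeps):                      # preconditioned Richardson sweeps, Eq. (51)
            r = b - A @ x
            if np.linalg.norm(r) > tol_div:          # divergence safeguard, as in Algorithms 2-4
                break
            for idx, perm, ilu in factors:
                dx = np.empty(len(idx))
                dx[perm] = ilu.solve(r[idx][perm])   # apply [D]_P^{-1} block by block
                x[idx] += omega * dx
        return x
    return smooth

# Usage as a preconditioner for a Krylov solver:
#   M = spla.LinearOperator(A.shape, matvec=build_block_ilu_smoother(A, m_diag, blocks, cfl, 50.0))
#   x, info = spla.gmres(A, b, M=M)
```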


Fig. 3. Lines constructed using Algorithm 5 [11] (see also [14,15]) and different input stiffness matrices. (a) Stiffness matrix of a Laplace operator; (b, c, d) stiffness matrices of a SUPG discretization of the linear convection–diffusion (LCD) operator at Pe = 10⁻⁶, 10⁶, and 10⁴, respectively.

5.2.2. Line construction, block construction, and domain decomposition

Similar to the lines for the line solver technique, here the blocks should include strongly-connected nodes. In this study, the blocks are generated using a graph partitioning technique. To ensure that the blocks include strongly-connected nodes, first the lines are generated, and then the computational domain is partitioned such that the lines are not broken by the partition boundaries; the resulting partitions are used as blocks. Such blocks do not break any strong connections previously detected by the line construction algorithm. In Refs. [9–11,49], this partitioning technique was employed for the domain decomposition used in distributed-memory parallelization. In this study, it is used for both purposes. That is, first it is used for distributed-memory domain decomposition, and then, within each partition, it is used for block construction. The details of this approach are explained in the following.

We start with the construction of the lines. In Ref. [11] we presented a general matrix-based algorithm for construction of the lines, which is similar (but not identical) to those previously developed in Refs. [14,15]. In this approach, the lines are extracted from a properly chosen scalar matrix which represents the strength of the connections between unknowns (nodes). Algorithm 5 shows the details of this approach. Fig. 3 shows four examples of the lines constructed using this algorithm for a three-element airfoil. In Fig. 3a, the utilized matrix is the stiffness matrix of the Laplace operator (with zero normal gradient boundary conditions). As expected, in this figure the lines are extracted in the boundary layer and they follow the node clustering. In Fig. 3b to d, the stiffness matrices are evaluated using a SUPG discretization of the following nondimensionalized linear convection–diffusion (LCD) operator:

$$\mathrm{L}(\phi) = \partial_i\!\left(\phi u_i\right) - \frac{1}{Pe}\,\partial_i\!\left(\partial_i \phi\right), \qquad i = 1, \dots, n_d \tag{52}$$


In the above equation, Pe = UL/µ is the Peclet number, which represents the relative importance of convection to diffusion. This nondimensional value can be used as a tuning parameter to bias the generated lines towards the diffusion or convection operators. In Fig. 3b to d, the velocity field has been taken from the converged P1 RANS solution at a Mach number of 0.2, a Reynolds number of 9 million, and an angle of attack of 16 degrees. For Fig. 3b, Pe is set to 10⁻⁶, and since for this value the diffusion term becomes the dominant operator, the resulting lines match those obtained using the pure Laplace operator (shown in Fig. 3a). In contrast, in Fig. 3c, Pe is 10⁶, and the resulting lines are mostly convection-based; there are almost no lines perpendicular to the solid walls. Finally, in Fig. 3d, Pe is set to 5 × 10³ and the resulting lines are properly influenced by both the diffusion and convection operators; this is why, in the remainder of the paper, this Peclet number is used whenever the linear convection–diffusion operator is used for line generation. As a final point regarding the choice of the stiffness matrix for line construction, note that using the Laplace operator, the lines are static during the solution process, whereas using the LCD operator, the lines should be updated as the solution evolves.

For the line smoothing method, since the factorization of the submatrices associated with the intra-connections of the line nodes is essentially a sequential process, in a distributed memory implementation one should ensure that lines are not broken by the partition boundaries. To do this, in Refs. [9–11] we utilized a weighted graph partitioning technique which was initially developed by Mavriplis [49] for a strongly-scalable implementation of


the implicit line smoothing technique in the context of unstructured finite-volume solvers. In this method, first, the node connections in a mesh are represented by an unweighted graph, and then this graph is contracted along the lines to generate a weighted graph. More specifically, in the original graph, all vertices and edges are assigned a unity weight. Then, during the contraction process, all the vertices that are associated with a line are merged into a new vertex. Connections between merged vertices also produce merged edges, and the weights associated with the merged vertices and edges are taken as the sum of the weights of the constituent vertices or edges. The contracted weighted graph is then partitioned using a weighted graph partitioner such as METIS [50]. To obtain the partitioning for the original unweighted graph, the partitioned weighted graph is de-contracted, meaning that all constituent vertices of a merged vertex are assigned the partition number of that vertex. Since the lines reduce to a single vertex in the contracted graph, they can never be broken by the partitioning process. The weighting assigned to the contracted graph ensures load balancing and communication optimization of the final uncontracted graph.

After generation of the initial domain decomposition, the same algorithm can be used to generate the blocks within each partition. For this purpose, within each partition the local lines are considered. In this study, the Laplace operator is used for the initial domain decomposition and construction of the initial set of blocks. During the solution, the domain partitions remain static, while the blocks within each partition are reconstructed based on the lines extracted from the SUPG discretization of the linear convection–diffusion equation using the current velocity field and a Peclet number of 5 × 10³. However, it should be noted that in RANS simulations, a considerable portion of the strong connections is in the boundary layers, and thus even using static blocks for these simulations results in good performance.

6. Numerical results

In this section, several second- and third-order steady-state RANS simulations over the NASA Common Research Model (CRM) are presented to compare the preconditioning performance of the developed block ILU smoothing technique with that of the traditional ILU(k) algorithm. For the sake of brevity, hereafter, the block ILU smoother is referred to as ILUB (see Algorithms 2 to 4). The numerical results are presented in two test suites.

6.1. Test suite 1 (HL-CRM): Subsonic turbulent flow over HL-CRM configuration

6.1.1. Problem description

In the first test suite, the effectiveness, memory footprint, and computational efficiency of the ILUB and ILU(k) preconditioners are compared for P1 and P2 solutions of a subsonic turbulent flow over the HL-CRM wing-body configuration. Also, the effectiveness of the residual smoothing technique discussed in Section 4.1 is demonstrated. The HL-CRM geometry has been the subject of extensive studies in the 3rd AIAA CFD High-Lift Prediction Workshop (HiLiftPW-3) as well as the 5th International Workshop on High-Order CFD Methods (HiOCFD5). Table 1 shows statistics of the grids used in this suite, all of which are available on the websites of HiLiftPW-3⁴ and HiOCFD5.⁵ Fig. 4 shows different parts of the coarsest grid (i.e. Extra-Tiny), which is curved using quadratic (P2) basis functions; the curvatures of the edges are apparent in Fig. 4e.
For all the solutions presented in this test suite, the Mach number is 0.2, the Reynolds number is 3.26 million, and the angle of attack is 8.0 degrees. Fig. 5 shows the contours of density and streamlines for the P2 solution on the Extra-Tiny grid, and Fig. 6 compares the contours of pressure coefficient for the P1 and P2 solutions on this grid. As expected, the P2 discretization provides a highly-resolved solution, especially in areas with high gradients.

6.1.2. Convergence histories of P1 solutions

P1 solutions are presented for all the grids in Table 1, although for brevity, the comparison of the preconditioners is limited to the Tiny and Coarse grids. Tables 2 and 3 report the details of the domain decomposition and the settings of the linear solvers for the P1 solutions, which include 6 cases. In Table 2, n_K is the number of allocated Krylov vectors, n_M is the maximum number of iterations in the FGMRes solver (including the restarts), n_p is the number of domain partitions as well as the number of processors, n_b/p is the number of blocks within each partition, k is the fill level, and Fill is the ratio of the number of nonzeros in the factorized preconditioner over the number of nonzeros of the

⁴ https://hiliftpw.larc.nasa.gov/Workshop3/grids.html
⁵ https://acdl.mit.edu/HOW5/MC1 HighLiftCommonResearchModel/


Fig. 4. Test Suite 1 (HL-CRM): Extra-Tiny grid from HiOCFD5; the grid is curved using quadratic (P2) basis functions (see the nose in subfigure e).

Fig. 5. Test Suite 1 (HL-CRM): contours of density and streamlines for the P2 solution; streamlines are colored by the turbulent working variable.


Fig. 6. Test Suite 1 (HL-CRM): contours of pressure coefficient for P1 and P2 solutions.

Jacobian matrix. Note that for ILU(k), n_b/p is one, and for ILUB, it is chosen such that for all cases the blocks include between 1500 and 2000 nodes; we do not imply that this is an ideal block size, but considering its memory usage, we have found that it gives reasonable performance for a wide range of problems. Also, note that for ILUB, only the intra-connections of the blocks [D]_P are factorized. In Table 3, which reports the additional settings of the ILUB preconditioner, n_o, n_i, and ω_L are the number of outer iterations, the number of inner iterations, and the relaxation factor in Algorithms 2 to 4, respectively.

Fig. 7 shows the convergence histories for cases 2, 3, 5, and 6. These cases compare the operation of the preconditioners on the Tiny and Coarse grids. The convergence plots show the total steady-state residual (in black), the number of linear iterations (in blue), and the CFL number (in red) against the nonlinear (Newton) steps. For all cases, the computational domain is initialized by the free-stream condition, except on solid walls where the no-slip condition is applied. Also, for all cases, the CFL number starts at one and rises gradually before the turbulence model starts to trigger noticeable changes in the flow field. At this phase, the CFL number undergoes several rises and falls, and if the linear systems are solved adequately and the nonlinear controller chooses proper updates, the solution algorithm can pass the transient nonlinearities. In that case, the CFL number ramps up quickly and Newton convergence is obtained at large CFL numbers. This scenario can be seen for the solutions on the Tiny grid for both preconditioners (i.e. cases 2 and 5). Fig. 8 compares the wall clock times spent on the linear solutions for the Tiny grid, and as seen, both preconditioners have similar performance. However, note that ILU(k) requires more memory to demonstrate this similar performance. This can be verified by comparison of the Fill and the number of allocated Krylov vectors for cases 2 and 5 in Table 2. For the Tiny grid, comparison of the total memory used for storing the factorized matrices and the Krylov vectors shows that ILU(k) has used twice the memory used by ILUB. By looking at the convergence results for the Coarse mesh (cases 3 and 6), we can see that ILUB shows good performance throughout all nonlinear steps, while ILU(3) frequently fails for linear systems with CFL numbers over 10⁴. To alleviate this problem, either a higher fill level should be used or the settings of the nonlinear solver or discretization should be retuned, neither of which is desirable, particularly because in similar conditions the ILUB preconditioner performs well. Another point seen in Table 2 is that for cases 5 and 6, which use ILU(k), a smaller linear tolerance is used. In our numerical tests, we found this to be necessary to solve the nonlinear problem in a comparable number of nonlinear steps to the ILUB cases. This subject is discussed in more detail for the P2 solutions, in which the nonlinear problem has a higher sensitivity to the linear tolerance.


Fig. 7. Test Suite 1 (HL-CRM): convergence histories for P1 solutions using ILU(k) and ILUB preconditioners on the Tiny and Coarse grids.

So far, for all the presented solutions, the implicit residual smoothing (IRS) technique described in Section 4.1 has been used. Fig. 9 demonstrates the effect of this technique for P1 solutions on all grids of Table 1. In these solutions, only the ILUB preconditioner is used. As seen, for the coarser grids (i.e. Extra-Tiny and Tiny), the effect of the residual smoothing is negligible. However, for the finer grids (i.e. Coarse and Medium), using IRS the problem is solved in a notably smaller number of nonlinear steps. In fact, the presented results suggest that using the IRS, the nonlinear solution shows less sensitivity to the grid size. Motivated by these results, the IRS technique is also used for the remainder of the solutions.

6.1.3. Convergence histories of P2 solutions

P2 solutions are presented only for the Extra-Tiny mesh. To start the P2 solutions, a converged P1 solution is used as the initial condition. Tables 4 and 5 report the settings for the P2 solutions, which include 5 cases, and Fig. 10 shows the corresponding convergence histories. In Fig. 10a (case 7), the ILUB preconditioner is used and the linear tolerance is set to 10⁻¹⁰. As seen, 79 nonlinear steps are taken to reduce the initial nonlinear residual by seven orders of magnitude, which is considered to be sufficient for the purpose of this study. Since the aforementioned linear tolerance is very small, this solution can be considered as a reference solution with exact Newton steps. Considering the prohibitive cost of such a setting, it is interesting to see how higher tolerances affect the nonlinear path. Fig. 10b (case 8) shows that if the linear tolerance is increased to 10⁻⁴, using ILUB, the nonlinear problem is still solved in 79 steps. However, as seen in Fig. 10c (case 9), if the preconditioner is changed to ILU(8), this linear tolerance is not adequate to advance the nonlinear solution efficiently. Fig. 10d (case 10) shows that by reducing the


Fig. 8. Test Suite 1 (HL-CRM): comparison of wall clock times spent on linear solutions using ILU(k) and ILUB preconditioners for P1 solutions on the Tiny grid.

Fig. 9. Test Suite 1 (HL-CRM): comparison of convergence histories of P1 solutions with and without residual smoothing.

linear tolerance to 10^-6, ILU(8) can solve the problem in 120 steps. Clearly, the fill level of 8 results in excessive memory usage. However, it is worth noting that our attempts with lower fill levels resulted in considerably more nonlinear steps. For example, Fig. 10e (case 11) shows that with ILU(6), more than 200 iterations are required to reach the desired nonlinear tolerance. This discussion suggests that, for a given linear tolerance, the ILUB preconditioner provides better linear solutions for advancing the nonlinear problem; this property may be attributed to the smoothing effect of the preconditioner. Fig. 10f compares the wall clock times spent on the linear solutions for cases 8 and 10, and as seen, ILUB spends 60% less time, i.e. a speedup of 2.5 for the linear solution phase. Comparing the total memory used for storing the factorizations and the Krylov vectors, ILUB uses only 20% of the memory used by ILU(8).
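Throughout this discussion, the linear tolerance tol_L is assumed to be the standard relative stopping criterion for the Krylov solve of each Newton update (the usual inexact-Newton convention; the solver settings in the paper define the exact variant):

\| R(u^n) + A^n \, \Delta u^n \| \;\le\; \mathrm{tol}_L \, \| R(u^n) \|,

where A^n is the pseudo-transient Jacobian at step n and \Delta u^n is the Newton update. Under this reading, tol_L = 10^-10 yields essentially exact Newton steps, whereas 10^-3 or 10^-4 only asks the preconditioned Krylov solver for an update that is good enough to keep the nonlinear continuation on track, which is where the quality of the preconditioned solution, rather than the sheer number of Krylov vectors, becomes decisive.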


Fig. 10. Test Suite 1 (HL-CRM): convergence histories for P2 solutions using ILU(k) and ILUB preconditioners.


Fig. 11. Test Suite 2 (DPW): Fine grid from HiOCFD5; the grid is curved using quadratic (P2) basis functions (see the wing tip in subfigure d).

Fig. 12. Test Suite 2 (DPW): contours of the pressure coefficient for the P2 solution.

6.2. Test suite 2 (DPW): Transonic turbulent flow over CRM configuration

6.2.1. Problem description
In the second test suite, the performance of the ILU(k) and ILUB preconditioners in the PTC algorithm is compared for a transonic turbulent flow over the CRM wing-body configuration. The utilized geometry includes


Fig. 13. Test Suite 2 (DPW): convergence histories for P1 solutions using the ILUB preconditioner and different values of CFLcap .

the effects of elastic deflection, and it has been previously studied in the 6th AIAA CFD Drag Prediction Workshop (DPW-6) as well as in HiOCFD5. For the considered flow, the Mach number is 0.85, the Reynolds number is 5.0 million, and the angle of attack is 2.75 degrees. In this suite, only one grid is used, which is the Fine grid of the CR1 test case in HiOCFD5 (https://acdl.mit.edu/HOW5/CR1 CommonResearchModel/). Fig. 11 shows different parts of this grid, which has been curved for quadratic (P2) basis functions. It includes 10,925,975 tetrahedral elements, 1,839,905 P1 nodes, and 14,493,243 P2 nodes. Fig. 12 shows the P2 solution of the density on this mesh. To obtain this solution, an additional artificial viscosity, as described in Section 3.2, has been added to the governing equations for shock-capturing purposes. As seen, the structure of the shock is clearly detectable on the wing. Tables 6 and 7 report the settings used for the 9 cases considered in this test suite. The first 7 cases use the ILUB preconditioner and the last two use the ILU(k) preconditioner. As in the previous test suite, the P1 solutions are initialized with the free-stream conditions and the P2 solutions are initialized with converged P1 solutions.

6.2.2. Convergence histories of P1 solutions
For brevity, we present only three P1 solutions, all of which use the ILUB preconditioner. For case 1, CFLcap is set to infinity. This means that for all CFL numbers, the preconditioner matrix [P] in Eq. (49) is the same as the original Jacobian matrix.
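As a schematic of the dual-CFL idea only (the exact definition is Eq. (49); the pseudo-transient term is written here in a generic mass-matrix form), the system matrix A and the preconditioner matrix [P] differ only in the CFL used to build the pseudo-time term:

A \;=\; \frac{[M]}{\Delta\tau(\mathrm{CFL})} + \frac{\partial R}{\partial u}, \qquad [P] \;=\; \frac{[M]}{\Delta\tau\big(\min(\mathrm{CFL},\,\mathrm{CFL_{cap}})\big)} + \frac{\partial R}{\partial u}.

Since \Delta\tau grows with the CFL, capping the CFL keeps a stronger diagonal contribution in [P], which is what stabilizes the smoothing iterations at high CFL; with CFLcap = ∞ the two matrices coincide, as in case 1.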


Fig. 14. Test Suite 2 (DPW): convergence histories for P2 solutions using the ILUB preconditioner.

As a result, the single-loop iteration of Algorithm 2 can be used for preconditioning. Fig. 13a shows the convergence history obtained with this setting. As seen, the problem is solved in 70 nonlinear steps, and in all steps, fewer than 20 Krylov vectors are used to drop the initial residual of the linear system by three orders of magnitude. Fig. 13b shows the results of case 2, in which CFLcap is set to 1000, and since n_i = 1, Algorithm 4 is used for preconditioning. Compared to case 1, the total number of linear iterations over the entire solution has increased by about 60%. Although the problem is still solved efficiently in this case, this observation shows that higher values of CFLcap are desirable for faster convergence. However, note that case 1 is a fortunate example in which the dual-CFL strategy is not required. Unfortunately, this is not always the case, as can be seen in the P2 solutions presented in the next subsection. In case 3, all the settings are the same as in case 2 except for the fill level, which was reduced from 1 to 0. This leads to a 44% reduction in Fill. As shown in Fig. 13c, the problem is solved in the same number of nonlinear steps and a very similar number of linear iterations as case 2. This shows that with ILUB, the lack of an accurate factorization can be effectively compensated for by iterations on the Krylov vectors.

6.2.3. Convergence histories of P2 solutions
Fig. 14a shows the convergence history of case 4, which is a P2 solution without the dual-CFL strategy. As seen in this figure, every time the CFL number passes 10^3, the linear solver fails and the solution proceeds very slowly. This issue can be addressed using the dual-CFL strategy. By setting CFLcap to 500, the P2 problem can


Fig. 15. Test Suite 2 (DPW): convergence histories for P2 solutions using the ILU(k) preconditioner.

be properly solved, as seen for cases 5 to 7 in Fig. 14b to d. Note that for case 5, the linear tolerance is set to 10^-10 and, as in the previous test suite, this case can be considered a reference solution with exact Newton steps. In case 6, the linear tolerance is increased to 10^-3 and the problem is still solved in the same number of nonlinear steps as case 5, that is, 33 steps. In cases 5 and 6, the ILUB preconditioner uses 32 blocks in each partition. In case 7, all the settings are the same as in case 6, except for the number of blocks, which is one. As seen in Fig. 14d, the nonlinear problem is again solved in 33 steps, while the linear solver uses fewer Krylov vectors. This test shows that the block ILU preconditioner can be effective on blocks of significantly different sizes, which indicates that the block size can be tailored to the utilized processing elements. This is particularly important for implementations on emerging manycore processors, for which shared-memory versions of the ILU algorithm [51] can be used to perform the factorization on large blocks. Investigation of this topic is reserved for future work. Fig. 15 shows the convergence histories for cases 8 and 9, in which the ILU(3) preconditioner is used with linear tolerances of 10^-3 and 10^-5, respectively. As in the previous test suite, the case with the larger linear tolerance does not proceed efficiently, whereas with the ILUB preconditioner the same tolerance was sufficient. Finally, Fig. 16 shows the wall clock times spent on the linear solutions for cases 7 (using ILUB) and 9 (using ILU(k=3)); as seen, case 7 obtains a speedup of 2.8. Also, case 7 uses only 25% of the memory used by case 9.

7. Conclusions
In this work we have proposed an efficient and highly parallelizable preconditioning technique for Newton–Krylov methods applied to finite-element discretizations of practical CFD problems. The proposed method forms groups or blocks of unknowns that are strongly coupled and solves these implicitly within each block. The method is then used as a linear smoother with lagged inter-block coupling, resulting in a globally convergent preconditioner. In this sense, the proposed preconditioning strategy can be viewed as a generalization of the implicit line-smoothing technique. Furthermore, by prescribing the block size, this approach enables a full spectrum of preconditioners that can trade factorization effort for iterative smoothing effort and can be tailored to specific hardware characteristics. The results have shown that this approach requires significantly less memory than traditional ILU(k) preconditioners while requiring equal or lower computational effort overall. Additionally, the inclusion of a residual smoothing technique has been shown to further improve the overall time to solution. Although the current results are encouraging, additional work is required to better understand the effect of block size and construction, as well as to optimize this strategy for emerging many-core computer hardware architectures with deep memory hierarchies.


Fig. 16. Test Suite 2 (DPW): comparison of wall clock times spent on linear solutions using ILU(k) and ILUB preconditioners (cases 7 and 9) for P2 solutions.

Table 1. Test suite 1 (HL-CRM): statistics of the grids.

                          Extra-Tiny (X)   Tiny (T)     Coarse (C)   Medium (M)
Cells (all tetrahedral)   5,540,545        12,009,584   33,361,746   98,659,138
P1 nodes                  933,440          2,016,118    5,591,371    15,654,484
P2 nodes                  7,344,288        not used     not used     not used

Table 2. Test suite 1 (HL-CRM): grids, partitioning, and settings of the linear solver for P1 solutions.

Case  Grid  Preconditioner  tol_L   n_K  n_M  n_p   n_b/p  k  Fill
1     X     ILUB            10^-3   100  100  128   4      0  0.88
2     T     ILUB            10^-3   200  200  256   4      0  0.85
3     C     ILUB            10^-4   200  200  512   6      1  1.43
4     M     ILUB            10^-4   200  200  1024  8      1  1.61
5     T     ILU(k)          10^-5   300  300  256   1      2  3.13
6     C     ILU(k)          10^-6   300  600  512   1      3  5.54

Table 3. Test suite 1 (HL-CRM): additional settings of the ILUB preconditioner for P1 solutions.

Case  n_o  n_i  ω_L  CFLcap
1     10   1    0.4  1000.0
2     10   1    0.4  1000.0
3     7    3    0.2  1000.0
4     20   3    0.1  1000.0

Acknowledgments

This research was sponsored by NASA's Transformational Tools and Technologies (TTT) (NNX15AU23A) Project of the Transformative Aeronautics Concepts Program under the Aeronautics Research Mission Directorate.


Table 4. Test suite 1 (HL-CRM): grids, partitioning, and settings of the linear solver for P2 solutions.

Case  Grid  Preconditioner  tol_L    n_K  n_M   n_p   n_b/p  k  Fill
7     X     ILUB            10^-10   300  600   1024  4      2  2.28
8     X     ILUB            10^-4    100  200   1024  4      2  2.28
9     X     ILU(k)          10^-4    300  1200  1024  1      8  12.4
10    X     ILU(k)          10^-6    300  1200  1024  1      8  12.4
11    X     ILU(k)          10^-6    300  1200  1024  1      6  11.2

Table 5. Test suite 1 (HL-CRM): additional settings of the ILUB preconditioner for P2 solutions.

Case  n_o  n_i  ω_L  CFLcap
7     20   5    0.1  500.0
8     20   5    0.1  500.0

Table 6. Test suite 2 (DPW): polynomial order, partitioning, and settings of the linear solver.

Case  P  Preconditioner  tol_L    n_K  n_M  n_p  n_b/p  k  Fill
1     1  ILUB            10^-3    20   20   256  4      1  1.56
2     1  ILUB            10^-3    50   50   256  4      1  1.56
3     1  ILUB            10^-3    50   50   256  4      0  0.87
4     2  ILUB            10^-10   100  300  256  32     1  1.42
5     2  ILUB            10^-10   100  300  256  32     1  1.42
6     2  ILUB            10^-3    100  100  256  32     1  1.42
7     2  ILUB            10^-3    100  100  256  1      1  2.08
8     2  ILU(k)          10^-3    300  300  512  1      3  7.03
9     2  ILU(k)          10^-5    300  300  512  1      3  7.03

Table 7. Test suite 2 (DPW): additional settings of the ILUB preconditioner.

Case  n_o  n_i  ω_L  CFLcap
1     20   1    0.4  ∞
2     20   1    0.4  1000.0
3     20   1    0.4  1000.0
4     20   1    0.4  ∞
5     20   1    0.4  500.0
6     20   1    0.4  500.0
7     20   1    0.4  500.0

Thanks are due to Dr. Steve Karman from Pointwise for providing the high-order grids used in this study.

References

[1] D.S. Kamenetskiy, J.E. Bussoletti, C.L. Hilmes, V. Venkatakrishnan, L.B. Wigton, F.T. Johnson, Numerical evidence of multiple solutions for the Reynolds-averaged Navier–Stokes equations, AIAA J. 52 (2014) 1686–1698.
[2] N.K. Burgess, R.S. Glasby, Advances in numerical methods for CREATE(TM)-AV analysis tools, presented at the 52nd AIAA Aerospace Sciences Meeting and Exhibit, AIAA Paper 2014–0417, January 2014.
[3] M. Ceze, K.J. Fidkowski, Constrained pseudo-transient continuation, Internat. J. Numer. Methods Engrg. 102 (2015) 1683–1703.
[4] W.K. Anderson, J.C. Newman, S.L. Karman, Stabilized finite elements in FUN3D, J. Aircr. 55 (2017) 696–714.
[5] B.R. Ahrabi, M.J. Brazell, D.J. Mavriplis, An investigation of continuous and discontinuous finite-element discretizations on benchmark 3D turbulent flows (invited), presented at the 2018 AIAA Aerospace Sciences Meeting, AIAA Paper 2018–1569, January 2018.
[6] R.S. Glasby, J.T. Erwin, Introduction to COFFE: The next-generation HPCMP CREATE(TM)-AV CFD solver, presented at the 54th AIAA Aerospace Sciences Meeting, AIAA Paper 2016–0567, January 2016.


[7] M.C. Galbraith, S.R. Allmaras, D.L. Darmofal, SANS RANS solutions for 3D benchmark configurations, presented at the 2018 AIAA Aerospace Sciences Meeting, AIAA Paper 2018–1570, January 2018.
[8] L.T. Diosady, D.L. Darmofal, Preconditioning methods for discontinuous Galerkin solutions of the Navier–Stokes equations, J. Comput. Phys. 228 (2009) 3917–3935.
[9] B.R. Ahrabi, D.J. Mavriplis, Scalable solution strategies for stabilized finite-element flow solvers on unstructured meshes, presented at the 55th AIAA Aerospace Sciences Meeting, AIAA Paper 2017–0517.
[10] B.R. Ahrabi, D.J. Mavriplis, Scalable solution strategies for stabilized finite-element flow solvers on unstructured meshes, Part II, presented at the 23rd AIAA Computational Fluid Dynamics Conference, AIAA Paper 2017–4275, June 2017.
[11] B.R. Ahrabi, D.J. Mavriplis, A scalable solution strategy for high-order stabilized finite-element solvers using an implicit line preconditioner, Comput. Methods Appl. Mech. Engrg. 341 (2018) 956–984.
[12] P.-O. Persson, J. Peraire, Newton-GMRES preconditioning for discontinuous Galerkin discretizations of the Navier–Stokes equations, SIAM J. Sci. Comput. 30 (2008) 2709–2733.
[13] D.J. Mavriplis, B.R. Ahrabi, M.J. Brazell, Strategies for accelerating Newton method continuation in CFD problems, presented at the AIAA Science and Technology Forum and Exposition, AIAA Paper 2019–0100, January 2019.
[14] T. Okusanya, D. Darmofal, J. Peraire, Algebraic multigrid for stabilized finite element discretizations of the Navier–Stokes equations, Comput. Methods Appl. Mech. Engrg. 193 (2004) 3667–3686.
[15] B. Philip, T.P. Chartier, Adaptive algebraic smoothers, J. Comput. Appl. Math. 236 (2012) 2277–2297.
[16] A.N. Brooks, T.J. Hughes, Streamline upwind/Petrov–Galerkin formulations for convection dominated flows with particular emphasis on the incompressible Navier–Stokes equations, Comput. Methods Appl. Mech. Engrg. 32 (1982) 199–259.
[17] T. Tezduyar, T. Hughes, Finite element formulations for convection dominated flows with particular emphasis on the compressible Euler equations, presented at the 21st Aerospace Sciences Meeting, AIAA Paper 1983–0125, 1983.
[18] T.J.R. Hughes, G. Scovazzi, T.E. Tezduyar, Stabilized methods for compressible flows, J. Sci. Comput. 43 (2010) 343–368.
[19] F. Shakib, T.J. Hughes, Z. Johan, A new finite element formulation for computational fluid dynamics: X. The compressible Euler and Navier–Stokes equations, Comput. Methods Appl. Mech. Eng. 89 (1991) 141–219.
[20] W.K. Anderson, L. Wang, S. Kapadia, C. Tanis, B. Hilbert, Petrov–Galerkin and discontinuous-Galerkin methods for time-domain and frequency-domain electromagnetic simulations, J. Comput. Phys. 230 (2011) 8360–8385.
[21] J.T. Erwin, W.K. Anderson, L. Wang, S. Kapadia, High-order finite-element method for three-dimensional turbulent Navier–Stokes, presented at the 21st AIAA Computational Fluid Dynamics Conference, Fluid Dynamics and Co-located Conferences, AIAA Paper 2013–2571, June 2013.
[22] L. Wang, W.K. Anderson, J.T. Erwin, S. Kapadia, Discontinuous Galerkin and Petrov Galerkin methods for compressible viscous flows, Comput. & Fluids 100 (2014) 13–29.
[23] J.C. Newman, W.K. Anderson, Investigation of unstructured higher-order methods for unsteady flow and moving domains, presented at the 22nd AIAA Computational Fluid Dynamics Conference, AIAA Paper 2015–2917, June 2015.
[24] B.R. Ahrabi, W.K. Anderson, J.C. Newman, An adjoint-based hp-adaptive stabilized finite-element method with shock capturing for turbulent flows, Comput. Methods Appl. Mech. Engrg. 318 (2017) 1030–1065.
[25] W.K. Anderson, B.R. Ahrabi, J.C. Newman, Finite element solutions for turbulent flow over the NACA 0012 airfoil, AIAA J. 54 (9) (2016) 2688–2704.
[26] R.S. Glasby, N. Burgess, W.K. Anderson, L. Wang, D.J. Mavriplis, S.R. Allmaras, Comparison of SU/PG and DG finite-element techniques for the compressible Navier–Stokes equations on anisotropic unstructured meshes, presented at the 51st AIAA Aerospace Sciences Meeting including the New Horizons Forum and Aerospace Exposition, AIAA Paper 2013–0691, 2013.
[27] C. Rumsey, Turbulence modeling resource website. Available: http://turbmodels.larc.nasa.gov.
[28] B. Diskin, W.K. Anderson, M.J. Pandya, C.L. Rumsey, J. Thomas, Y. Liu, et al., Grid convergence for three dimensional benchmark turbulent flows, presented at the 2018 AIAA Aerospace Sciences Meeting, AIAA Paper 2018–1102, January 2018.
[29] T.R. Michal, D.S. Kamenetskiy, J. Krakos, Generation of anisotropic adaptive meshes for the first AIAA geometry and mesh generation workshop, presented at the 2018 AIAA Aerospace Sciences Meeting, AIAA Paper 2018–0658, January 2018.
[30] F. Alauzet, A. Loseille, D. Marcum, T.R. Michal, Assessment of anisotropic mesh adaptation for high-lift prediction of the HL-CRM configuration, in: 23rd AIAA Computational Fluid Dynamics Conference, AIAA Paper 2017–3300, American Institute of Aeronautics and Astronautics, 2017.
[31] D.J. Mavriplis, K. Mani, Unstructured mesh solution techniques using the NSU3D solver, in: 52nd Aerospace Sciences Meeting, AIAA Paper 2014–0081, American Institute of Aeronautics and Astronautics, 2014.
[32] D.J. Mavriplis, An assessment of linear versus nonlinear multigrid methods for unstructured mesh solvers, J. Comput. Phys. 175 (2002) 302–325.
[33] S.R. Allmaras, F.T. Johnson, Modifications and clarifications for the implementation of the Spalart-Allmaras turbulence model, in: ICCFD7-1902, 7th International Conference on Computational Fluid Dynamics, Big Island, Hawaii, 2012.
[34] T.J. Barth, Numerical methods for gasdynamic systems on unstructured meshes, in: An Introduction to Recent Developments in Theory and Numerics for Conservation Laws, Springer, 1999, pp. 195–285.
[35] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (1981) 357–372.
[36] K. Shahbazi, D.J. Mavriplis, N.K. Burgess, Multigrid algorithms for high-order discontinuous Galerkin discretizations of the compressible Navier–Stokes equations, J. Comput. Phys. 228 (2009) 7917–7940.
[37] G.E. Barter, D.L. Darmofal, Shock capturing with PDE-based artificial viscosity for DGFEM: Part I. Formulation, J. Comput. Phys. 229 (2010) 1810–1827.
[38] P.-O. Persson, J. Peraire, Sub-cell shock capturing for discontinuous Galerkin methods, presented at the 44th AIAA Aerospace Sciences Meeting and Exhibit, AIAA Paper 2006–0112, January 2006.


[39] K.R. Holst, R.S. Glasby, J.T. Erwin, D.L. Stefanski, R.B. Bond, J.D. Schmisseur, High-order simulations of shock problems using HPCMP CREATE(TM)-AV Kestrel COFFE, presented at the 2018 AIAA Aerospace Sciences Meeting, AIAA Paper 2018–1301, January 2018.
[40] C.T. Kelley, D.E. Keyes, Convergence analysis of pseudo-transient continuation, SIAM J. Numer. Anal. 35 (1998) 508–523.
[41] Y. Saad, Iterative Methods for Sparse Linear Systems, second ed., SIAM, 2003.
[42] E. Cuthill, Several strategies for reducing the bandwidth of matrices, in: D.J. Rose, R.A. Willoughby (Eds.), Sparse Matrices and their Applications, Springer US, Boston, MA, 1972, pp. 157–166.
[43] S. Balay, K. Buschelman, W.D. Gropp, D. Kaushik, M.G. Knepley, L.C. McInnes, et al., PETSc web page, 2001, 2004.
[44] S. Balay, W.D. Gropp, L.C. McInnes, B.F. Smith, Efficient management of parallelism in object-oriented numerical software libraries, in: Modern Software Tools for Scientific Computing, Springer, 1997, pp. 163–202.
[45] B. Smith, PETSc, the portable, extensible toolkit for scientific computing, in: Encyclopedia of Parallel Computing, Springer, 2011.
[46] L. Thomas, Elliptic Problems in Linear Differential Equations over a Network, Watson Scientific Computing Laboratory, Columbia Univ., NY, 1949.
[47] C. Liu, A Stabilized Finite Element Dynamic Overset Method for the Navier–Stokes Equations (Ph.D.), Department of Computational Engineering, University of Tennessee at Chattanooga, 2016.
[48] L.F. Richardson, The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam, Philos. Trans. R. Soc. Lond. A Math. Phys. Eng. Sci. 210 (1911) 307–357.
[49] D.J. Mavriplis, S. Pirzadeh, Large-scale parallel unstructured mesh computations for three-dimensional high-lift analysis, J. Aircr. 36 (1999) 987–998.
[50] G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput. 20 (1998) 359–392.
[51] E. Chow, A. Patel, Fine-grained parallel incomplete LU factorization, SIAM J. Sci. Comput. 37 (2015) C169–C193.