Development of an overset grid computational fluid dynamics solver on graphical processing units


Kunal Soni, Dominic D.J. Chandar (corresponding author, [email protected]), Jayanarayanan Sitaraman
Department of Mechanical Engineering, University of Wyoming, Laramie, WY 82071-3295, United States

Article history: Received 19 January 2011; Accepted 24 November 2011; Available online 1 February 2012.

Keywords: Overset grids; Parallel computing; Graphical Processing Units (GPUs); Navier–Stokes equations; Euler equations

Abstract

General Purpose computation on Graphics Processing Units (GPGPU) has gained popularity recently. Graphics Processing Units (GPUs) are being used for computationally intensive and data intensive problems to obtain orders of magnitude speed-up in a wide range of domains such as molecular dynamics, biophysics, geophysics and CFD [1–4]. In this paper we discuss the development of a two-dimensional overset grid CFD solver on GPUs for moving body problems and demonstrate orders of magnitude speed-up on a single GPU as compared to a C/FORTRAN solver on a single CPU core. The two-dimensional overset grid CFD solver consists of three modules: a near-body solver module which solves the fluid conservation laws on structured and unstructured mesh systems, an off-body solver module which solves the fluid conservation laws on an isotropic Cartesian mesh, and a domain connectivity module which manages the interaction between these two mesh systems. The GPU acceleration is extended to all three modules. We expect this work to be the forerunner of future development efforts for full three-dimensional Navier–Stokes solutions capable of executing in a heterogeneous parallel environment. Here "heterogeneous parallelism" refers to a combination of Message Passing Interface (MPI) based communication for distributed memory systems and large scale multi-threading using GPUs for shared memory systems.

1. Introduction

Graphics Processing Units (GPUs) have been used to solve a wide range of problems. They are quickly becoming a key element of high performance computing because of their ability to perform a large number of numerical operations through massive multi-threading. According to Flynn's taxonomy [5], this kind of programming model is called the Single Instruction Multiple Data (SIMD) model. A stream of data (a collection of data records similar to conventional arrays) can be supplied to a GPU and processed simultaneously with a single instruction. Therefore, compared to CPU execution, where every element of an array consumes CPU cycles sequentially, adding up to a large wall-clock time, GPU execution requires only a few processing cycles. Hence it is possible to make numerical computations on the GPU extremely fast. In this paper we exploit this ability of GPUs to accelerate a two-dimensional overset grid CFD solver. Speed and accuracy are the two factors used to gauge the performance of any flow solver. Here we show that speed-ups of up to two orders of magnitude can be obtained on the GPU platform without loss of accuracy compared to conventional C/Fortran solvers. These preliminary results are extremely encouraging and show the potential of GPGPU based computations to achieve categorical superiority over traditional CPU computations.

The motivation for the overset grid approach stems from the need to solve unsteady flow problems with relative motion between various bodies [6]. Examples of such problems are commonly found in rotary wing aerodynamics, flapping wings and wind turbines. Overset grid methods are computationally much easier and more efficient than mesh regeneration methods for treating moving body problems. Even for problems with highly constrained, complex meshes and multiple moving bodies, this approach remains easier to implement. In the overset grid approach, only the grids surrounding the moving bodies are regenerated, thereby saving a lot of computational time. The two-dimensional overset grid solver developed in this article consists of three modules: the off body solver, the near body solver and the overset grid assembler (domain connectivity module). The off body solver solves the fluid conservation laws in their inviscid form, i.e. the Euler equations, on an isotropic Cartesian mesh. The near body solver solves the fluid conservation laws in the vicinity of the body using a finite volume method. We develop, test and validate finite volume solvers for both structured curvilinear and unstructured mesh topologies in this paper. Standard test cases such as inviscid Lamb vortex convection, inviscid low Mach number (incompressible) flow past a cylinder, and transonic flow past a NACA0012 airfoil are used for validation purposes.


The overset grid assembly module manages the hole cutting and interpolation between these two mesh systems. In its current form, the overset grid assembly module is capable of interpolating between any type of mesh (structured or unstructured) on the GPU; however, in the present demonstration, only structured grid interpolation results are presented. The solver is implemented on the GPU platform for massive parallelization without any secondary (shared memory) optimization. Furthermore, the current solver can be executed on a heterogeneous platform. Heterogeneous computing can be performed either through domain decomposition, where separate domains of the problem are solved on separate GPUs, or through modular decomposition, where different modules are solved on different GPUs and the domain connectivity is handled using the Message Passing Interface (MPI) standard.

2. GPU architecture

The GPU architecture was pioneered for high throughput applications such as graphics pipelining (graphics processing) and video games [1]. Its potential for general purpose computing has been realized only relatively recently. The GPU architecture falls in the category of shared memory architectures or, more specifically, Symmetric Multi Processing (SMP) architectures. A GPU is a set of multiprocessors, each having many cores and each core capable of executing many threads simultaneously. Thus tens of thousands of threads can be executed simultaneously, rendering it suitable for massive parallelization. There are two main kinds of memory in a GPU: the global or device memory and the shared or cached memory. The global memory is accessible from all multiprocessors within the GPU. It has a higher access latency and is of the order of a few GB in size. Each multiprocessor has multiple cores which share a common memory called the shared memory. The shared memory is similar to cached memory in CPUs, has a lower access latency, and can be accessed only by the cores within that multiprocessor. It is usually a few KB in size. In addition, data in device memory can be stored in texture memory (texture cache) and accessed in a read-only manner. There is also a limited amount of read-only memory called the constant cache for storing constants to be accessed by all processors. A schematic diagram of the GPU architecture is shown in Fig. 1. There are many GPU models available depending on vendor and purpose. We use an NVIDIA Tesla C1060 for our testing, which has 30 multiprocessors with 8 cores each, totaling 240 cores.

The global memory on the device is 4 GB and the shared memory is 16 KB in size.

3. GPU programming

For our implementation on the GPU platform we use NVIDIA's [7,8] C/C++ language extension known as the Compute Unified Device Architecture (CUDA). Using CUDA, one has the capability to access GPU specific functions to exchange data between the CPU and GPU and to execute blocks of code on the GPU. The programming strategy centers around the use of streams and kernels. A stream is a collection of data similar to an array, and a kernel is a function that processes the stream in a Single Instruction Multiple Data (SIMD) fashion. All the data has to be copied to the device memory on the GPU for processing. The data is then accessed directly by the processors or copied to shared memory if required. Shared memory acts as an explicitly managed cache, in contrast to its implicit usage on CPUs; the use of this cache is defined by the user. A CUDA code is basically a sequential code that performs fork-join parallelism similar to OpenMP or Pthreads. The main code is a sequential C/C++ code which forks the data into many threads at the invocation of a CUDA kernel. At run time, thousands of "lightweight" threads are generated by a CUDA kernel with little creation overhead; their fast context switching is mostly hidden by the comparatively large time taken to copy data from global memory to the cores. The data set is divided into data blocks [8,10]. Each block executes on one multiprocessor, and numerous threads make up one block. There is a limit on the number of threads in a single block (512 threads on the Tesla C1060) [8], which depends on the GPU model. A thread is the smallest unit in the hierarchy of the programming model. The data within a block is further split into warps. All the threads within a warp execute simultaneously on one multiprocessor. The warp size is also fixed; e.g., on the Tesla C1060 it is 32 threads [8]. Multiple warps execute one after another to process a block. The data can be synchronized within a block and stored in shared memory. As an analogy between the programming model and the hardware, a block corresponds to the shared memory and the entire data set or grid corresponds to the global memory. Figs. 2 and 3 illustrate the C and CUDA codes, respectively, for the implementation of a 5-point stencil for a Poisson solver using the point Jacobi method. The for loops in the C code, which execute the same statement over and over again in a sequential manner, are absent in the CUDA kernel, where all the data is evaluated in a single kernel call. Also to be noticed is the similarity between data streams in CUDA and the argument arrays in C: both are one-dimensional pointers and are used in a similar way. Our current solver is written in the global memory framework. Further optimization of the solver with shared memory is not yet implemented. Placing the entire overset grid CFD solver in shared memory makes it extremely complicated and is not worth the time and effort, especially since newer hardware architectures promise lower latency for global memory access [8].

4. Off body Cartesian solver

4.1. Formulation

Fig. 1. A schematic of the GPU architecture adapted from Ref. [9].

The effort in developing the overset grid solver starts with the off-body solver, where the Euler equations are solved on an isotropic Cartesian grid. The formulation encompasses variable orders of accuracy (up to 5th) and is given as follows [6]. Note that we express the governing equations and discretization in three dimensions for generality; in the present paper, only the two-dimensional formulation of these equations is implemented.


Fig. 2. C implementation of a 5-point stencil.

Fig. 3. CUDA implementation of a 5-point stencil.
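Only the captions of Figs. 2 and 3 are reproduced above. To make the comparison between the C and CUDA versions concrete, the following is a minimal sketch, under assumed array names and launch parameters, of what a point Jacobi kernel of the kind shown in Fig. 3 looks like; it is illustrative and not the authors' original listing.

// Sketch of a 5-point Jacobi sweep on an nx x ny grid (interior points only).
// One thread per grid point; phi, phi_new and rhs live in GPU global memory.
__global__ void jacobi5pt(const double *phi, double *phi_new,
                          const double *rhs, int nx, int ny, double h2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        int id = j * nx + i;
        phi_new[id] = 0.25 * (phi[id - 1] + phi[id + 1] +
                              phi[id - nx] + phi[id + nx] - h2 * rhs[id]);
    }
}

// Host-side launch: the for loops of the C version are replaced by a thread
// grid spanning all interior points.
// dim3 block(16, 16);
// dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
// jacobi5pt<<<grid, block>>>(d_phi, d_phi_new, d_rhs, nx, ny, h * h);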

The governing Euler equations are expressed as:

$$\frac{\partial Q}{\partial t} + \frac{\partial E}{\partial x} + \frac{\partial F}{\partial y} + \frac{\partial G}{\partial z} = 0 \qquad (1)$$

where the vectors Q, E, F and G have their usual definitions:

$$Q = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ e \end{pmatrix}, \quad
E = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u w \\ (e+p)u \end{pmatrix}, \quad
F = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ \rho v w \\ (e+p)v \end{pmatrix}, \quad
G = \begin{pmatrix} \rho w \\ \rho u w \\ \rho v w \\ \rho w^2 + p \\ (e+p)w \end{pmatrix} \qquad (2)$$

and

$$h = e + \frac{p}{\rho}, \qquad p = (\gamma - 1)\left[\rho e - \frac{1}{2}\rho\left(u^2 + v^2 + w^2\right)\right] \qquad (3)$$

Here the velocities are given by u, v and w and time by t. Pressure, density and total energy per unit volume are represented by p, ρ and e. Note that the use of Cartesian meshes allows the equations to be written in their original Cartesian form, which makes it straightforward to write finite difference discretizations of the flux derivatives. Although presented as central differences, the scheme is implemented as a pseudo-finite volume method. The finite difference spatial discretization is expressed as:

$$\frac{\partial E}{\partial x} = \frac{\hat{E}_{i+1/2} - \hat{E}_{i-1/2}}{\Delta x} \qquad (4)$$

$$\hat{E}_{i+1/2} = \tilde{E}_{i+1/2} - \tilde{D}_{i+1/2} \qquad (5)$$

where $\hat{E}$ represents the total inviscid flux evaluated at the cell face (notionally). Here $\tilde{E}$ is the physical flux term and $\tilde{D}$ is the artificial dissipation flux term. Using central differences of second, fourth and sixth order accuracy, we can write the physical flux as

$$\tilde{E}^{II}_{i+1/2} = \frac{1}{2}\left(E_{i+1} + E_i\right) \qquad (6)$$

$$\tilde{E}^{IV}_{i+1/2} = \tilde{E}^{II}_{i+1/2} + \frac{1}{12}\left(-E_{i+2} + E_{i+1} + E_i - E_{i-1}\right) \qquad (7)$$

$$\tilde{E}^{VI}_{i+1/2} = \tilde{E}^{IV}_{i+1/2} + \frac{1}{60}\left(E_{i+3} - 3E_{i+2} + 2E_{i+1} + 2E_i - 3E_{i-1} + E_{i-2}\right) \qquad (8)$$

where σ is the spectral radius of the inviscid flux Jacobian. Note the difference in notation between the physical flux and dissipation flux terms. In the case of the physical flux, the superscript refers to the order of accuracy of the spatial discretization; for the dissipation flux term, the superscript refers to the order of the artificial dissipation derivative. The artificial dissipation terms can be similarly expressed in their discrete forms as:

$$\tilde{D}^{II}_{i+1/2} = \frac{|\sigma|_{i+1/2}}{2}\left(Q_{i+1} - Q_i\right) \qquad (9)$$

$$\tilde{D}^{IV}_{i+1/2} = \tilde{D}^{II}_{i+1/2} - \frac{|\sigma|_{i+1/2}}{12}\left(Q_{i+2} + 3Q_{i+1} - 3Q_i - Q_{i-1}\right) \qquad (10)$$

$$\tilde{D}^{VI}_{i+1/2} = \tilde{D}^{IV}_{i+1/2} + \frac{|\sigma|_{i+1/2}}{60}\left(Q_{i+3} - 5Q_{i+1} + 5Q_i - Q_{i-2}\right) \qquad (11)$$
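For concreteness, a minimal sketch of how Eqs. (6)–(11) might be combined into a single interface-flux evaluation on the GPU is given below. The function name, the one-dimensional array layout, and the way the spectral radius is passed in are illustrative assumptions rather than the authors' code.

// Sketch: fifth-order interface flux E_hat(i+1/2) for one conserved component,
// following Eqs. (5)-(11). E[] and Q[] hold the physical flux and state along a
// grid line; sigma is the spectral radius |sigma|_{i+1/2}.
__device__ double ehat_5th(const double *E, const double *Q, int i, double sigma)
{
    // Central fluxes of increasing order, Eqs. (6)-(8)
    double e2 = 0.5 * (E[i + 1] + E[i]);
    double e4 = e2 + (1.0 / 12.0) * (-E[i + 2] + E[i + 1] + E[i] - E[i - 1]);
    double e6 = e4 + (1.0 / 60.0) * (E[i + 3] - 3.0 * E[i + 2] + 2.0 * E[i + 1]
                                     + 2.0 * E[i] - 3.0 * E[i - 1] + E[i - 2]);

    // Matching artificial dissipation, Eqs. (9)-(11)
    double d2 = 0.5 * sigma * (Q[i + 1] - Q[i]);
    double d4 = d2 - (sigma / 12.0) * (Q[i + 2] + 3.0 * Q[i + 1] - 3.0 * Q[i] - Q[i - 1]);
    double d6 = d4 + (sigma / 60.0) * (Q[i + 3] - 5.0 * Q[i + 1] + 5.0 * Q[i] - Q[i - 2]);

    return e6 - d6;   // Eq. (5): total flux = physical flux - dissipation
}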

4.2. Implementation

The time integration for the off body solver is performed using an explicit 3-stage Runge–Kutta scheme [13] described below. During each stage, the right hand side of the equation (the net flux across the cell) is evaluated using the same subroutine. Stage 1 involves the calculation of the variables $Q^{*}_i$ and $Q^{**}_i$ from $Q^n_i$ (the state at the current time step):

$$Q^{*}_{i} = Q^{n}_{i} + f_1 \frac{\Delta t}{\Delta V_i} R\left(Q^{n}_{i}\right) \qquad (12)$$

$$Q^{**}_{i} = Q^{n}_{i} + f_2 \frac{\Delta t}{\Delta V_i} R\left(Q^{n}_{i}\right) \qquad (13)$$

In stage 2, the variable $Q^{**}_i$ is used to calculate the variable $Q^{***}_i$:

$$Q^{***}_{i} = Q^{*}_{i} + f_3 \frac{\Delta t}{\Delta V_i} R\left(Q^{**}_{i}\right) \qquad (14)$$

Finally, in the last stage, the variable $Q^{***}_i$ is used to calculate the right hand side and update the solution to the next time step, i.e. the evaluation of $Q^{n+1}_i$:

$$Q^{n+1}_{i} = Q^{*}_{i} + f_4 \frac{\Delta t}{\Delta V_i} R\left(Q^{***}_{i}\right) \qquad (15)$$

where the constants are given as f1 = 0.25, f2 = 8.0/15.0, f3 = 5.0/12.0 and f4 = 0.75. The "for" loops of the C code are rewritten as kernels in the CUDA implementation. Every statement under the "for" loops of the C code has the same structure in the CUDA kernel. Instead of looping over every grid point, the CUDA implementation vectorizes each statement, as illustrated by the sketch below.

Fig. 4. Density contours for the Lamb vortex convection problem at the end of 100 time steps.
Fig. 5. Comparison of the computed density along y = 3 with the corresponding exact solution.
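The following is a minimal sketch (not the authors' code) of how one Runge–Kutta stage of Eqs. (12)–(15) maps to a CUDA kernel. The array names, the flattened storage of the conserved variables, and the separately computed residual are assumptions made for illustration.

// Sketch of a single Runge-Kutta stage of Eqs. (12)-(15):
// qout = qbase + f * (dt / dV) * R(qin), applied to every cell and every variable.
__global__ void rk_stage(double *qout, const double *qbase, const double *resid,
                         const double *dvol, double f, double dt, int ncells, int nvar)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < ncells * nvar) {
        int cell = idx % ncells;   // cell index for the cell volume (variable-major layout)
        qout[idx] = qbase[idx] + f * (dt / dvol[cell]) * resid[idx];
    }
}
// Stage 1 would call this twice with f1 and f2 (Eqs. (12) and (13)), stage 2 with f3
// using the residual of Q**, and the final stage with f4 using the residual of Q***.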

4.3. Validation

The inviscid Lamb vortex convection problem is chosen for testing and validation of the off-body solver. The exact solution for this problem is very simple and is given by the convection of the vortex with the free-stream conditions, without any dissipation or dispersion effects. We apply and test all the central difference discretizations described in Eqs. (6)–(11); however, results are shown only for the highest order (5th order) scheme. In the present case, the vortex is initially located at (3, 3) and the computations are carried out for 100 time steps at a free-stream Mach number of 0.5. Fig. 4 shows the contours of density at the end of 100 time steps. The computed density along y = 3 is compared with the corresponding exact solution in Fig. 5. The computed solution shows excellent agreement with the exact solution.

Fig. 6. Comparison of the error between computed and exact solutions for the Lamb vortex convection problem.

Fig. 7. Comparison of the CPU time between C and CUDA execution for the Lamb vortex convection problem.

The accuracy and execution time comparisons between the C and CUDA codes are shown in Figs. 6 and 7, respectively. It is encouraging to note that both the C and CUDA codes reproduce almost the same solution errors over the different grid sizes. The CUDA code shows much faster execution times than the corresponding C code. The GPU solver is tested with both single and double precision, and the errors in either case remain the same; the results shown here are obtained with single precision execution. The latency of loading data from global memory to registers is significant; therefore, for small grid sizes (16 × 16 to 32 × 32) the processing time is negligible and the overall execution times for these grid sizes are almost the same. Furthermore, the CUDA speed-up continuously increases with the grid size and attains a maximum (about 16 times) at the largest grid size tested, 512 × 512. As expected, the CUDA code execution time increases with the grid size. This trend is attributed to the fact that, as the number of threads increases with grid size, more time is spent in context switching.

5. Near body solver using a structured grid framework

5.1. Formulation

The next step in the development of the overset grid solver is the near body solver. The basic flow conservation laws are the same for both the off body and the near body solvers. The near body solver in the current context is an Euler solver on a structured curvilinear grid that conforms to the body. The structured curvilinear grid is generated by the standard technique of mapping a rectangular Cartesian grid to a curvilinear grid around the body. Again, we state the numerical formulation in 3D for generality, but only the 2D version is implemented in this work. Consider the curvilinear coordinates given by

$$\xi = \xi(x, y, z, t) \qquad (16)$$
$$\eta = \eta(x, y, z, t) \qquad (17)$$
$$\zeta = \zeta(x, y, z, t) \qquad (18)$$

The Euler equations in curvilinear coordinates are given by

$$\frac{\partial Q}{\partial t} + \frac{\partial E}{\partial \xi} + \frac{\partial F}{\partial \eta} + \frac{\partial G}{\partial \zeta} = 0 \qquad (19)$$

where Q represents the conserved quantities

$$Q = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ e \end{pmatrix} \qquad (20)$$

and E, F and G are the inviscid fluxes

$$E = \frac{1}{J}\begin{pmatrix} \rho U \\ \rho u U + \xi_x p \\ \rho v U + \xi_y p \\ \rho w U + \xi_z p \\ U(e+p) - \xi_t p \end{pmatrix}, \quad
F = \frac{1}{J}\begin{pmatrix} \rho V \\ \rho u V + \eta_x p \\ \rho v V + \eta_y p \\ \rho w V + \eta_z p \\ V(e+p) - \eta_t p \end{pmatrix}, \quad
G = \frac{1}{J}\begin{pmatrix} \rho W \\ \rho u W + \zeta_x p \\ \rho v W + \zeta_y p \\ \rho w W + \zeta_z p \\ W(e+p) - \zeta_t p \end{pmatrix} \qquad (21)$$

The Jacobian of the transformation, J, is defined as

$$J = \frac{\partial(\xi, \eta, \zeta)}{\partial(x, y, z)} \qquad (22)$$

or

$$J = \left[x_\xi\left(y_\eta z_\zeta - z_\eta y_\zeta\right) - y_\xi\left(x_\eta z_\zeta - z_\eta x_\zeta\right) + z_\xi\left(x_\eta y_\zeta - y_\eta x_\zeta\right)\right]^{-1} \qquad (23)$$

and U, V and W are the contravariant velocities:

$$U = \xi_x u + \xi_y v + \xi_z w + \xi_t \qquad (24)$$
$$V = \eta_x u + \eta_y v + \eta_z w + \eta_t \qquad (25)$$
$$W = \zeta_x u + \zeta_y v + \zeta_z w + \zeta_t \qquad (26)$$

Closure of the governing equations is obtained using the equation of state

$$p = (\gamma - 1)\left[\rho e - \frac{1}{2}\rho\left(u^2 + v^2 + w^2\right)\right] \qquad (27)$$

In compressible flows, the gradients of the flow properties can be very sharp, leading to shocks. The central difference schemes described in the context of the off body solver usually require modifications for efficient shock capturing. These schemes rely on classical shock capturing methods that involve the addition of artificial dissipation to the scheme. The artificial dissipation term is usually linear and is applied uniformly across all grid points. Classical shock capturing methods only produce accurate results in the case of smooth and weak shocks. When strong shocks are present in the flow, non-linear instabilities and oscillations may arise across discontinuities. Therefore classical shock capturing methods without suitable modifications become unstable for strong shocks. Modern shock capturing methods, however, have non-linear dissipation terms and automatic feedback mechanisms which adjust the amount of dissipation in any cell of the mesh in accordance with the property gradients. One such class of schemes is Roe's approximate Riemann solvers [11]. Roe solvers are very efficient at shock capturing and are therefore implemented in the present work.

5.2. Implementation

Roe's approximate Riemann solver is an upwind biased finite volume scheme in which the inviscid fluxes at cell faces are evaluated using Roe averaged variables. This scheme eliminates the need for explicitly added artificial dissipation; in fact, the scheme has an inbuilt artificial dissipation with an automatic feedback mechanism.


Hence it is an appropriate upwinding scheme for shock capturing. The space discretized form of the Euler equations is given as

$$\frac{\partial Q}{\partial t} = -\left(\frac{\hat{E}_{i+1/2} - \hat{E}_{i-1/2}}{\Delta\xi}\right) - \left(\frac{\hat{F}_{j+1/2} - \hat{F}_{j-1/2}}{\Delta\eta}\right) - \left(\frac{\hat{G}_{k+1/2} - \hat{G}_{k-1/2}}{\Delta\zeta}\right) \qquad (28)$$

The fluxes in each of the directions are evaluated independently. Here we present the discretization of the fluxes in the ξ-direction only, for the purpose of brevity, as described in [11,12]:

$$\left.\frac{\partial E}{\partial \xi}\right|_i = \frac{E^{*}_{i+1/2} - E^{*}_{i-1/2}}{\Delta\xi} \qquad (29)$$

$$E^{*}_{i+1/2} = \frac{E_i + E_{i+1}}{2} - \frac{1}{2}\left|A\right|_{i+1/2}\left(Q_R - Q_L\right) \qquad (30)$$

or

$$E^{*}_{i+1/2} = \frac{E_i + E_{i+1}}{2} - \frac{1}{2}\sum_{n=1}^{5}\left|\lambda_n\right|\,\partial W_n\, r_n \qquad (31)$$

where [A] is the Roe averaged flux Jacobian matrix and $Q_L$ and $Q_R$ are the left and right state variables. E and E* are the fluxes at the cell center and cell interface, respectively. The eigenvalues of [A] are given by

$$\lambda_1 = \bar{u} + \bar{a}, \qquad \lambda_2 = \lambda_3 = \lambda_4 = \bar{u}, \qquad \lambda_5 = \bar{u} - \bar{a} \qquad (32)$$

where $\bar{\rho}$, $\bar{u}$, $\bar{v}$, $\bar{w}$ and $\bar{h}$ are the Roe averaged variables, and the eigenvectors of [A] are given by

$$r_1 = \begin{pmatrix} 1 \\ \bar{u} \\ \bar{v} \\ \bar{w} \\ \tfrac{1}{2}\left(\bar{u}^2 + \bar{v}^2 + \bar{w}^2\right) \end{pmatrix}, \quad
r_2 = \begin{pmatrix} 1 \\ \bar{u} - \bar{a} \\ \bar{v} \\ \bar{w} \\ \bar{h} - \bar{u}\bar{a} \end{pmatrix}, \quad
r_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ \bar{v} \end{pmatrix}, \quad
r_4 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \bar{w} \end{pmatrix}, \quad
r_5 = \begin{pmatrix} 1 \\ \bar{u} + \bar{a} \\ \bar{v} \\ \bar{w} \\ \bar{h} + \bar{u}\bar{a} \end{pmatrix} \qquad (33)$$

and the Riemann variables (invariants) are given by

$$\partial W_1 = \partial\rho - \frac{\partial p}{\bar{c}^2}, \quad \partial W_2 = \bar{\rho}\,\partial u, \quad \partial W_3 = \bar{\rho}\,\partial v, \quad \partial W_4 = \frac{\partial p + \bar{\rho}\bar{c}\,\partial w}{2\bar{c}^2}, \quad \partial W_5 = \frac{\partial p - \bar{\rho}\bar{c}\,\partial w}{2\bar{c}^2} \qquad (34)$$

Discretization schemes with both 1st(O) and 3rd(O) accuracy are implemented. In the 1st(O) implementation of the scheme, the left and right state variables are the piecewise constant cell center values. For higher order accuracy, i.e. 3rd(O), the Monotone Upstream-centered Scheme for Conservation Laws (MUSCL) is used. The MUSCL scheme has the TVD (total variation diminishing) property. The higher order schemes are constructed from a one parameter family of interpolations for the primitive variables p, ρ, u, v and w. For example, the left and right states for any general primitive variable are given by

$$Q_L = Q_i + \frac{\psi_i}{4}\left[(1 - \kappa)\nabla + (1 + \kappa)\Delta\right]Q_i \qquad (35)$$

$$Q_R = Q_{i+1} - \frac{\psi_{i+1}}{4}\left[(1 + \kappa)\nabla + (1 - \kappa)\Delta\right]Q_{i+1} \qquad (36)$$

where ∇ and Δ are the backward and forward difference operators, κ is a parameter that controls the construction of higher order differencing schemes (i.e. κ = 1/3 constructs a third-order scheme), and ψ is a limiter. Koren's differentiable limiter is utilized and is given by

$$\psi_i = \frac{3\nabla p_i\,\Delta p_i + \epsilon}{2\left(\nabla p_i - \Delta p_i\right)^2 + 3\nabla p_i\,\Delta p_i + \epsilon} \qquad (37)$$

where ε is a small constant (10⁻⁶) used to improve the robustness of the scheme by preventing division by zero. In the traditional Fortran90/C implementation, the Roe flux calculation subroutine takes vector inputs, row-wise or column-wise depending on the direction of the flux calculation (ξ or η in 2D), and calculates the left and right states. In the CUDA implementation, the calculation of the left and right states over the entire domain is performed simultaneously, and all the interface fluxes are also evaluated simultaneously (a sketch of such a reconstruction kernel is given below). In summary, the CUDA implementation can be considered the matrix version of the vectorized implementation in F90/C. For the time discretization, the three stage Runge–Kutta integration method described in Section 4.2 is used.
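A minimal sketch of the MUSCL reconstruction of Eqs. (35)–(37) as a CUDA kernel for one primitive variable along the ξ-direction is given below. The kernel and array names, the one-dimensional indexing, and the boundary handling are illustrative assumptions, not the authors' implementation.

// Koren's differentiable limiter, Eq. (37)
__device__ double koren_limiter(double back, double fwd)
{
    const double eps = 1.0e-6;
    return (3.0 * back * fwd + eps) /
           (2.0 * (back - fwd) * (back - fwd) + 3.0 * back * fwd + eps);
}

// Left and right states at interface i+1/2, Eqs. (35)-(36); one thread per interface.
__global__ void muscl_states(const double *q, double *qL, double *qR,
                             int n, double kappa)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // interface between cells i and i+1
    if (i >= 2 && i < n - 2) {
        double bwd_i  = q[i]     - q[i - 1];         // backward difference at i
        double fwd_i  = q[i + 1] - q[i];             // forward  difference at i
        double bwd_ip = q[i + 1] - q[i];             // backward difference at i+1
        double fwd_ip = q[i + 2] - q[i + 1];         // forward  difference at i+1

        double psi_i  = koren_limiter(bwd_i,  fwd_i);
        double psi_ip = koren_limiter(bwd_ip, fwd_ip);

        // kappa = 1/3 gives the third-order scheme
        qL[i] = q[i]     + 0.25 * psi_i  * ((1.0 - kappa) * bwd_i  + (1.0 + kappa) * fwd_i);
        qR[i] = q[i + 1] - 0.25 * psi_ip * ((1.0 + kappa) * bwd_ip + (1.0 - kappa) * fwd_ip);
    }
}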

5.3. Validation

5.3.1. Inviscid incompressible flow past a cylinder

A cylinder is placed in a uniform stream of fluid moving at a Mach number M = 0.1. This case is included to test the solver for blunt body problems. Since the analytical solution for this problem is available, it serves as a good platform for validating the near body solver. In our implementation we have applied a 3rd(O) reconstruction to compute the Roe averaged field variables. Fig. 8 shows the pressure contours of the solution obtained. The pressure contours show the expected symmetry about both the x- and y-axes. Furthermore, the expected trends of high pressure at the forward and rear stagnation points and low pressure at the high velocity zones on the top and bottom of the cylinder are captured accurately by the analysis. Computed pressure coefficients on the surface of the cylinder are compared with the theoretical result in Fig. 9. Figs. 9 and 10 show the error and CPU time comparisons between the exact and CUDA solutions, respectively. This is a steady state (boundary value) problem, as compared to the initial value problem solved by the off body solver. Here the solution is converged to maximum precision; therefore the error between the exact and the CUDA solutions tends to zero with finer meshes, as shown in Fig. 9. For grid sizes equal to or larger than 256 × 256, the analysis virtually overlaps the exact solution. The CUDA based solver outperforms the F90 counterpart by at least an order of magnitude (a maximum speed-up of about 27 times is obtained for the 512 × 512 grid). At the smallest grid size of 16 × 16, the CUDA CPU time is larger than the Fortran CPU time. This anomaly arises because the time taken to load the data from global memory to registers on the GPU is significant.

Fig. 8. Density contours for the flow past a cylinder at M = 0.1.


Fig. 9. A comparison of the pressure coefficients around the cylinder for different grids.
Fig. 10. Comparison of the CPU time between C and CUDA execution for the flow past a cylinder at M = 0.1.
Fig. 11. A portion of the unstructured grid showing the dual cell.

6. Near body solver using an unstructured grid framework

We also implement, test and validate a node centered finite volume unstructured grid based flow solver. Consider a portion of an unstructured mesh made up of triangles. For a node centered algorithm, a dual cell is constructed surrounding every node as in Fig. 11, and the conservation equations are discretized on this dual cell. The integral form of the discretized conservation equations is given as follows:

$$\frac{\partial Q_k}{\partial t} = -\frac{1}{V}\sum_{i=1}^{nf} H_{ki} \cdot S_i \qquad (38)$$

where nf is the number of faces, Q is the cell average of the field variables, V is the dual cell volume, H is the Roe averaged upwinded flux at the dual cell interface, and $S_i$ is the surface vector for a given face i. For second order accuracy, the field variables Q are reconstructed at the dual cell interface using the Green–Gauss theorem. For any arbitrary node i, the discretized form is given by:

$$\nabla Q_i = \frac{1}{V_i}\sum_{j=1}^{nf} Q_j \, dS_j \qquad (39)$$

From a programming perspective, there are different ways of evaluating Eq. (39). A straightforward method would be to have an edge based data structure as shown in Fig. 12. Each edge is numbered in black (edges 1–14). Consider three different edges (1, 2, and 5). For each of these edges, there exists a corresponding pair of node indices (numbered 1, 2 and identified by edge1, edge2) and left and right cell indices (numbered 3, 4 and identified by edge3, edge4). One of the quantities required to evaluate $\nabla Q_i$ is $Q_{34}\,dS_{34}$, where the subscript 34 indicates the dual cell face joining edge3 and edge4. The surface vector dS is always taken positive with respect to node1 of any given edge. Taking this into consideration, the partly computed gradients at the centers of the dual cells corresponding to edge1 and edge2 are given by:

$$\nabla Q_1 = \nabla Q_1 - Q_{34}\, dS_{34} \qquad (40)$$
$$\nabla Q_2 = \nabla Q_2 + Q_{34}\, dS_{34} \qquad (41)$$

Thus, one can loop over the edges and simultaneously update the gradient at two nodes without having the need to store $Q_{34}\,dS_{34}$. A similar methodology is adopted for summing up the fluxes over the dual cell. However, vectorizing the above operation is not straightforward. It is easy to note that in Fig. 12 the node edge2 occurs twice during the summing operation, once for edge 1 and once for edge 2. Similarly, other nodes may be repeated a finite number of times depending on how many edges they are connected to. Hence there is a need for a coloring scheme to segregate the nodes which are repeated during the summing operation. For the purposes of illustration, consider a portion of an unstructured grid that contains two triangles, as in Fig. 13. The edges and nodes are colored in black and green respectively. For each edge, there exist variables e1 and e2, which are essentially edge1 and edge2 described previously. As we loop through all the edges, an integer (repeat1/repeat2) is assigned to each edge which keeps track of the number of times it has encountered the same node. For example, node 2 appears three times as e2 while looping over the edges; hence for the first occurrence, repeat2 is set to 0, and this value is then incremented each time the node is repeated. At the end of this step, the maximum numbers of occurrences for e1 and e2 are determined. The gradient calculation described in Eq. (40) can now be vectorized as in Listing 1; a sketch of this idea is given below.

Listing 1. Parallel algorithm for computing the gradient.
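Since only the caption of Listing 1 is reproduced here, the following sketch shows one way the described coloring could be realized in CUDA; it is an interpretation under stated assumptions, not the authors' listing. For a fixed (level, side) pair, each node is touched by at most one edge, so the accumulation is race free. The array names and layouts are assumptions.

__global__ void gradient_edge_pass(const int *edge1, const int *edge2,
                                   const int *repeat1, const int *repeat2,
                                   const double *q34, const double *ds34x, const double *ds34y,
                                   double *gradx, double *grady,
                                   int nedges, int level, int side)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nedges) return;

    // dS is taken positive with respect to node1 of the edge (Eqs. (40)-(41))
    if (side == 1 && repeat1[e] == level) {
        gradx[edge1[e]] -= q34[e] * ds34x[e];
        grady[edge1[e]] -= q34[e] * ds34y[e];
    } else if (side == 2 && repeat2[e] == level) {
        gradx[edge2[e]] += q34[e] * ds34x[e];
        grady[edge2[e]] += q34[e] * ds34y[e];
    }
}
// Host side: loop over side = 1, 2 and level = 0 .. max_repeat, launching one pass each.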

In Listing 1, all edges having the same repeat value are computed at the same time. Hence, although node 2 appears three times as e2 as mentioned previously, it is computed separately each time. The above algorithm guarantees that no two edges will update the same node simultaneously. This approach is termed the 'edge based' approach. Another approach, which turns out to be more efficient, is the 'node based' approach, where the list of edges connecting a given node is computed during the pre-processing step. Using this methodology, the above algorithm is reshaped as in Listing 2 (sketched below). The direction is +1 or −1 depending on whether the node is e2 or e1 of the edge, respectively. It is clearly observed that one has to loop only over the nodes rather than over the edges as described previously.

Listing 2. Parallel algorithm for computing the gradient without coloring.
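Again, only the caption of Listing 2 is available, so the following is a sketch of the node based approach under assumed data structures: node_edge_ptr/node_edge_list store, in compressed row fashion, the edges meeting each node, and direction[] holds +1 or −1 depending on whether the node is e2 or e1 of that edge. One thread owns each node, so no coloring is needed.

__global__ void gradient_node_pass(const int *node_edge_ptr, const int *node_edge_list,
                                   const double *direction,
                                   const double *q34, const double *ds34x, const double *ds34y,
                                   double *gradx, double *grady, int nnodes)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < nnodes) {
        double gx = 0.0, gy = 0.0;
        // Accumulate contributions of all edges connected to node n
        for (int k = node_edge_ptr[n]; k < node_edge_ptr[n + 1]; ++k) {
            int e = node_edge_list[k];
            gx += direction[k] * q34[e] * ds34x[e];
            gy += direction[k] * q34[e] * ds34y[e];
        }
        gradx[n] = gx;
        grady[n] = gy;
    }
}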

Fig. 12. Edge data structure (i: index of any edge; edge1(i): node1 of an edge; edge2(i): node2 of an edge; edge3(i): left cell of an edge; edge4(i): right cell of an edge; edges 1, 2 and 5 are highlighted in blue, red and green).

Fig. 13. Coloring scheme for repeated nodes (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.).

Table 1. Various grids used for testing the unstructured flow solver.

Grid    Number of nodes on airfoil    Total number of nodes
1       100                           8582
2       150                           18,498
3       230                           45,588
4       350                           87,145

Using these gradients, the primitive variables are reconstructed at the edges of the dual cell. The interface flux is then computed using Roe's approximate Riemann solver described in Section 5. The flux balance around the dual cell is then carried out in a similar fashion to the gradients.

6.1. Unstructured flow computation results

In this section, we describe the validation of the two-dimensional unstructured flow solver on the GPU using standard transonic flow test cases. A parametric study of the CPU time spent per time step for different grid sizes is also presented.

6.1.1. Transonic flow past a NACA0012 airfoil

The transonic flow past a NACA0012 airfoil at Mach number M = 0.8 and angle of attack α = 1.25° is considered using the four different grids described in Table 1. Computations are performed with both first and second order spatial accuracy and with the explicit three stage Runge–Kutta time stepping scheme (Eqs. (12)–(15)). Fig. 14 shows the Mach contours surrounding the airfoil after convergence has been obtained. The contours compare quite well with the structured grid solution obtained on the GPU on a grid of size 255 × 105, shown in Fig. 15. Fig. 16 shows the pressure coefficient on the airfoil surface in comparison with the structured flow computation result and that of Jameson's SLIP scheme [14]. As no limiters have been used for second order accuracy in the unstructured framework, minor oscillations are observed in the vicinity of the shock. Apart from these irregularities, the solutions compare very well throughout the domain. To assess the performance of the GPU, several tests were conducted on the different grids described in Table 1 using first and second order spatial accuracy and node and edge based data structures. A serial version of the unstructured flow solver in FORTRAN90 is used to gauge the speed-up of the GPU code.

Fig. 14. Unstructured grid and mach contours for the flow past NACA0012 airfoil at M = 0.8, a = 1.25°.

Fig. 15. Structured grid and mach contours for the flow past NACA0012 airfoil at M = 0.8, a = 1.25°.


Fig. 16. Surface pressure coefficient for the flow past NACA0012 airfoil at M = 0.8, α = 1.25° (comparison of Jameson [14], the GPU structured Euler solver and the GPU unstructured Euler solver).

Figs. 17 and 18 show the CPU time spent per time step for the first and second order accurate schemes, respectively. For both schemes, the node based approach is faster than the edge based approach, as the GPU kernels span only the number of nodes. However, it is to be pointed out that each node does not have the same number of edges connected to it. As a result, some threads take a longer time to execute, and the performance of the GPU is hampered by this effect. This also explains why the structured grid solver is capable of extracting a higher speed-up compared to the unstructured solver on similar grids. Comparing the first and second order timing results on the finest grid, a speed-up of 12 is obtained for the first order scheme, but for the second order scheme the gradient computation consumes almost half of the total time, and as a result only a speed-up of 5 could be achieved. It is, however, unknown at this point whether the node centered finite volume formulation is the limiting factor in achieving a larger speed-up. It is anticipated that a cell centered approach would be more amenable to vectorization, as each cell would have a fixed number of edges. It is also hoped that three-dimensional problems with of the order of a million nodes will benefit drastically from GPU parallelism.

Fig. 17. CPU time spent per time step on different grids for the first order scheme on an unstructured grid.
Fig. 18. CPU time spent per time step on different grids for the second order scheme on an unstructured grid.

7. GPU based overset grid assembler

Overset grid assembly (also known as "domain connectivity" in overset grid parlance) involves the generation of donor–recipient relationships and the determination of interpolation strategies between overlapping grids. The overall solution process on overset meshes is as follows. At each iteration step, the solutions of the fluid equations on each mesh are computed independently, with the solution in the fringe region (the region consisting of grid nodes that are recipients) of each mesh specified by interpolation from the overlapping "donor" mesh as a Dirichlet boundary condition. At the end of each iteration, the fringe data are exchanged between the solvers so that the evolution of the global solution is faithfully represented in the overset methodology. The overset grid based CFD approach has emerged as the standard for the solution of time-dependent problems with relative motion. A suite of overset grid assembly tools with varying degrees of capability has been reported in the literature [15–24]. Most of these methodologies are limited in parallelism and therefore cannot be applied to large scale problems on HPC systems, and none of them are directly portable to GPU based computing systems. To the knowledge of the authors, this paper forms the first attempt at constructing a GPU based algorithm for overset grid assembly. In our algorithm design, we seek to reap the maximum benefit of the multi-threaded architecture of the GPU. We utilize the so-called "implicit hole-cutting algorithm" of Lee [25], which proved quite successful and robust in PUNDIT [24] (a fully parallel overset grid assembly tool utilized in the HELIOS infrastructure developed by the DoD [26]). The core concept of the implicit hole-cutting methodology is as follows: given two meshes with a common overlap region, the cells with the smallest physical volume (or resolution capacity) take precedence. These cells are solved via their parent grid discretization strategy and become donors to the cells with a larger physical volume (the recipients). The primary advantage of this approach is automatic overlap optimization (i.e. mesh sizes are commensurate in the overlap region) and the associated reduction in interpolation errors. The core computational task involved in the overset grid assembly process is termed the "donor search", which involves locating the grid cell (donor) that contains a given target point (recipient). In the context of the implicit hole-cutting algorithm, we seek to find a suitable donor for every grid point in each mesh, thus considerably increasing the number of required donor searches.


Fig. 19. Schematic of the donor search algorithm used in overset grid assembly.

Fig. 20. Timing comparison between GPU and CPU codes for donor search.
Fig. 21. Domain connectivity solution for cylinder/Cartesian grid overlap problem.

Therefore the efficiency of the entire overset grid assembly process depends to a large extent on the efficiency of the donor search algorithm. Brute-force linear searching is extremely inefficient because of its O(N × M) complexity (where N is the number of grid points and M the number of cells). Popular methods on the CPU to speed up the donor searches use meta-data structures such as Alternating Digital Trees (ADT) [27] or inverse maps (IM) [24] to localize the search domain and hence reduce the search complexity to O(N log(M)). Construction of the alternating digital tree data structure is essentially a sequential recursive process, and our investigation showed that this is not a viable option on the GPU architecture. In contrast, the inverse map approach, owing to its inherently parallel nature, showed much more potential for success on the GPU architecture. Therefore, our approach in this paper follows the inverse map based search strategy.

We present the steps of the donor search algorithm as follows:

1. Estimation of resolution capacity on a per grid node/cell basis for each participant mesh. Cell volumes and nodal volumes (dual cell volumes in the case of nodes) are estimated in this step to aid the comparison of resolution capacities between potential donors and recipients. This process can be executed fully in parallel on the GPU by spawning as many threads as there are cells. The operations performed for each cell are completely independent of each other, providing full parallel scalability.

2. Construction of the inverse map for each mesh. The inverse map that we use is a structured auxiliary mesh (SAM) (with rectilinear topology) that encompasses the given curvilinear or unstructured mesh (Fig. 19).


Fig. 22. Domain connectivity solution for NACA0012 problem.

Fig. 23. Pressure solution for NACA0012 overset grid problem (M = 0.5, a = 0°).

The objective of this step is to associate each cell of the SAM with a grid cell of the actual mesh. The advantage of using such an approach is that the SAM cell containing a given target point can be readily identified, and the actual mesh cell associated with this SAM cell can be used as the starting guess for the search process. Generation of the inverse map (i.e. associating indices of mesh cells with SAM cells) is usually performed as a sequential process on the CPU, by performing a linear traverse of the cell list and associating cell indices with SAM cells based on cell center containment. At the end of this process, each SAM cell retains the index of the last cell that tagged it. We perform the same operation on the GPU, but in a parallel way (a sketch of this tagging kernel is given after this list of steps). We spawn as many threads as there are cells, and each cell tags the SAM cell that contains its centroid. We deliberately allow a "race" condition on the GPU in this process, since it is possible for multiple mesh cells to tag the same SAM cell. In the scenario where a SAM cell gets tagged by multiple mesh cells, this process still guarantees the association of at least one of these mesh cells with that SAM cell.

Fig. 24. Domain connectivity and pressure solution for the biconvex airfoil problem (M = 0.7, a = 0°).

One of the problems that stems from this approach is the lack of repeatability of the inverse map data, because different mesh cells may be associated with a SAM cell every time the code executes in parallel on the GPU. However, since any mesh cell in the proximity of a given target point is acceptable as an initial guess, the repeatability of the inverse map data is not critical for the success of the donor search algorithm.

3. Donor search process. To accomplish implicit hole-cutting, a donor search needs to be performed for each grid point in each mesh. The search process proceeds as follows. For each pair of meshes, we spawn as many threads as there are grid nodes on the first mesh and search for suitable donor cells in the second mesh. For each grid node, first the SAM cell that contains it is determined (which is a one-step process owing to the rectilinear topology of the SAM mesh), and subsequently the mesh cell associated with that SAM cell is picked as the starting guess. If the SAM cell located is not associated with any mesh cell, a spiral search is performed until another SAM cell that has a mesh cell association is located.


Fig. 25. Grid and pressure solution for the biconvex airfoil problem using unstructured grids (M = 0.7, a = 0°).

Fig. 26. Pressure solution for the biconvex airfoil problem from Morinishi [28] (M = 0.7, α = 0°).

Following that, a search line is constructed connecting the centroid of the starting mesh cell and the target grid node. A check is made to determine whether the search line intersects any of the faces of the starting mesh cell. If an intersection is located, the neighbor across that face is chosen as the next potential donor cell. This procedure is repeated until a cell with no intersection is located, and this is the donor cell for the given point. If the search line penetrates a cell face without a neighbor, the target grid node is deemed not to have a valid donor. However, if the cell of the final penetration contains wall boundary nodes, the target grid node is identified to be inside the solid walls and tagged to be blanked out from the flow solution process. Once a donor cell is found, its resolution capacity is compared with that of the target grid node. If the donor cell has better resolution, then the grid node is identified as a recipient. Bilinear interpolation weights are subsequently computed by solving a set of algebraic basis equations using the Newton–Raphson method. For clarity, the concept of the SAM and the various steps followed in the donor search algorithm are described schematically in Fig. 19.
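The parallel inverse map construction described in step 2 is sketched below under some assumptions: a uniform SAM of ncx × ncy cells spanning [xlo, xhi] × [ylo, yhi], and arrays cellx/celly holding the mesh cell centroids. One thread is spawned per mesh cell, and the unguarded write deliberately allows the race condition described above, since any one of the competing cells is acceptable. The names and layout are illustrative, not the authors' code.

__global__ void tag_inverse_map(const double *cellx, const double *celly, int ncells,
                                double xlo, double ylo, double dx, double dy,
                                int ncx, int ncy, int *sam_cell)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < ncells) {
        int i = (int)((cellx[c] - xlo) / dx);   // SAM cell containing the centroid
        int j = (int)((celly[c] - ylo) / dy);
        if (i >= 0 && i < ncx && j >= 0 && j < ncy)
            sam_cell[j * ncx + i] = c;          // last writer wins
    }
}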

7.1. Verification and validation of donor search

In Fig. 20, we demonstrate the wall-clock times required for donor search on the CPU and on the GPU. We utilize identical donor meshes in both cases and search for random points, whose number is increased from 10,000 to 10 million. The search time increases with the number of points for both the CPU and GPU codes, as expected. The GPU searches outperform the CPU by a factor of 4.5 in all cases. Fig. 21 shows the typical domain connectivity solution obtained for the simple problem of an overlapping cylindrical and Cartesian mesh. The entire solution was computed completely on the GPU by direct memory referencing of the near-body (cylindrical) and off-body (Cartesian) grid data already allocated by the respective solvers in device memory. Note that the domain connectivity solution shown follows the expected trend of overlap optimization by achieving commensurate resolution in the regions of overlap. Figs. 22 and 23 show the overset grid assembly and flow solution for the NACA0012 airfoil at zero angle of attack (M = 0.5). The overlapping grid system shows the expected trend of overlap optimization. Furthermore, the flow solution in Fig. 23 shows symmetric and continuous contours, which verifies the accuracy of the domain connectivity solution. Erroneous overset grid assembly often manifests itself as discontinuous solution contours and a lack of symmetry.

7.2. Application to the transonic biconvex airfoil problem

We further apply the overset grid framework developed in this paper to the staggered biconvex airfoil problem investigated by Morinishi [28]. This problem is particularly interesting because of the shock wave that develops between the two airfoils. It is a well known observation that errors in overset grid computations are amplified at shock locations due to the lack of conservation. Therefore, this test case also facilitates characterization of the quality of the domain connectivity solution. Fig. 24a and b show the domain connectivity and flow solutions for the biconvex airfoil problem. The domain connectivity solution again shows the expected trend of overlap optimization, even for the more complex problem presented here. Note that the cell sizes in each grid are of commensurate size in the regions of overlap. We compare the pressure solution with that obtained using the fully unstructured methodology developed in this article in Fig. 25, and also with the solutions reported by Morinishi [28] in Fig. 26. It is encouraging to note that both the overset and unstructured solutions are qualitatively and quantitatively consistent with each other. In addition, both solutions show good qualitative correlation with the results obtained by Morinishi, providing further confidence in the analysis framework developed in this paper.


8. Conclusions

A two-dimensional Euler solver on structured and unstructured grids with overset grid connectivity has been developed in the GPU framework. In its current form, the GPU platform provides a maximum speed-up of about 27 for the structured grid solver (relative to C code compiled with the Intel compiler), 12 for the unstructured grid solver, and 4 for the domain connectivity module. It is seen that structured grid solvers offer more flexibility in achieving speed-up, primarily due to their simple grid arrangement and therefore efficient memory access. For node centered unstructured grids, computing the gradients needed for higher order accuracy is one of the limiting factors in achieving speed-up. In the node centered implementation, each node does not have the same number of dual cell faces across the grid; hence the computation of the net flux balance across the dual cell volume in a vectorized form is not performed very efficiently. It is speculated that a cell centered approach might be more efficient, as a fixed number of faces would enclose each cell volume. Both the structured and unstructured solvers are currently limited to explicit time integration, which requires long times for convergence, especially on larger grids. Efforts are underway to implement an implicit time integration algorithm using Krylov subspace methods such as GMRES/BiCGStab. As the present solver suite has overset grid capabilities that execute on the GPU, it becomes a preferred choice for solving moving body problems such as wind turbines and flapping wings. By utilizing heterogeneous computing capabilities through a combination of MPI and GPU, higher order computations on finer grids can be carried out to investigate a variety of flow features.

Acknowledgments

We gratefully acknowledge support from the Office of Naval Research with Dr. Judah Milgram as the technical monitor.

References

[1] LeGresley P, Elsen E, Darve E. Large calculation of the flow over a hypersonic vehicle using a GPU. J Comput Phys 2008;227(24):10148–61.
[2] Phillips EH, Zhang Y, Davis RL, Owens JD. Rapid aerodynamic performance prediction on a cluster of graphics processing units. In: 47th AIAA aerospace sciences meeting and exhibit, AIAA-2009-065, Orlando, FL; 2009.
[3] Brandvik T, Pullan G. Acceleration of a 3D Euler solver using commodity graphics hardware. In: 46th AIAA aerospace sciences meeting and exhibit, AIAA-2008-0607, Reno, NV; 2008.

[4] Hagen TR, Lie K-A, Natvig JR. Solving the Euler equations on graphics processing units. Lect Notes Comput Sci 2006;3994:220–7.
[5] Flynn M. Some computer organizations and their effectiveness. IEEE Trans Comput 1972;C-21(9):948–60.
[6] Sitaraman J, Katz A, Jayaraman B, Wissink AM, Sankaran V. Evaluation of multi-solver paradigm for CFD using overset unstructured and structured adaptive Cartesian grids. In: 46th AIAA aerospace sciences meeting and exhibit, AIAA-2008-660, Reno, NV; 2008.
[7] http://gpgpu.org.
[8] NVIDIA. NVIDIA CUDA programming guide, version 3.2.
[9] Fermi: NVIDIA's next generation CUDA compute architecture.
[10] Micikevicius P. 3D finite difference computation on GPUs using CUDA. In: Proceedings of the 2nd workshop on general purpose processing on graphics processing units (GPGPU-2); 2009. p. 79–84.
[11] Roe PL. Approximate Riemann solvers, parameter vectors, and difference schemes. J Comput Phys 1997;135:250–8.
[12] Vatsa VN, Thomas JL, Wedan BW. Navier–Stokes computations of a prolate spheroid at an angle of attack. J Aircraft 1989;26(11):986–93.
[13] Kennedy CA, Carpenter MH, Lewis RM. Low-storage, explicit Runge–Kutta schemes for the compressible Navier–Stokes equations. NASA/CR-1999-209349; 1999.
[14] Jameson A. Analysis and design of numerical schemes for gas dynamics 2: artificial diffusion and discrete shock structure. Int J Comput Fluid Dynam 1995;5:1–38.
[15] Buning PG, et al. OVERFLOW user's manual. NASA Langley Research Center; July 2003.
[16] Petersson NA. An algorithm for assembling overlapping grid systems. SIAM J Sci Comput 1999;20:1995–2022.
[17] Brown DL, Henshaw WD, Quinlan DJ. Overture: object-oriented tools for overset grid applications. In: AIAA 17th conference on applied aerodynamics, AIAA-1999-3130, Norfolk, VA; 1999.
[18] Henshaw WD. OGEN: an overlapping grid generator for Overture. Research Report LA-UR-96-3466. Los Alamos National Laboratory; 1996.
[19] Rogers SE, Suhs NE, Dietz WE. PEGASUS 5: an automated preprocessor for overset-grid computational fluid dynamics. AIAA J 2003;41(6):1037–45.
[20] Noack RW. SUGGAR: a general capability for moving body overset grid assembly. In: 17th AIAA computational fluid dynamics conference, AIAA-2005-5117, Toronto, ON; 2005.
[21] Alonso JJ, Hahn S, Ham F, Herrmann M, Iaccarino G, Kalitzin G, et al. CHIMPS: a high-performance scalable module for multi-physics simulations. In: 42nd AIAA/ASME/SAE/ASEE joint propulsion conference, AIAA-2006-5274, Sacramento, CA; 2006.
[22] Belk DM, Maple RC. Automated assembly of structured grids for moving body problems. In: 12th AIAA computational fluid dynamics conference, part 1, AIAA-1995-1680, Washington, DC; 1995. p. 381–90.
[23] Wang ZJ, Parthasarathy V, Hariharan N. A fully automated chimera methodology for multiple moving body problems. Int J Numer Methods Fluids 2000;33(7):919–38.
[24] Sitaraman J, Floros M, Wissink A, Potsdam M. Parallel domain connectivity algorithm for unsteady flow computations using overlapping and adaptive grids. J Comput Phys 2010;229(12):4703–23.
[25] Lee Y, Baeder JD. Implicit hole cutting—a new approach to overset grid connectivity. In: 16th AIAA computational fluid dynamics conference, AIAA-2003-4128, Orlando, FL; 2003.
[26] Sankaran V, Sitaraman J, Wissink A, Datta A, Jayaraman B, Potsdam M, et al. Application of the HELIOS computational platform to rotorcraft flowfields. In: 48th AIAA aerospace sciences meeting and exhibit, AIAA-2010-1230, Orlando, FL; 2010.
[27] Bonet J, Peraire J. An alternating digital tree algorithm for 3D geometric searching and intersection problems. Int J Numer Methods Eng 1991;31(1):1–17.
[28] Morinishi K. A finite difference solution of the Euler equations on non-body-fitted Cartesian grids. Comput Fluids 1992;21(3):331–44.