Comput. Methods Appl. Mech. Engrg. 199 (2010) 712–722
CFD-based analysis and two-level aerodynamic optimization on graphics processing units
I.C. Kampolis, X.S. Trompoukis, V.G. Asouti, K.C. Giannakoglou*
National Technical University of Athens, Lab. of Thermal Turbomachines, Parallel CFD & Optimization Unit, P.O. Box 64069, Athens 15710, Greece
* Corresponding author. E-mail: [email protected] (K.C. Giannakoglou).
Article history: Received 21 May 2009; Received in revised form 24 September 2009; Accepted 3 November 2009; Available online 13 November 2009.
Keywords: Graphics processing units; Computational fluid dynamics; Aerodynamic shape optimization; Evolutionary algorithms.
Abstract
This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids, from the CPU to the graphics processing unit (GPU; NVIDIA's GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method which is suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs) though with different evaluation tools each. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction

The continuous increase in the performance of GPUs has extended their use from the gaming community to game physics calculations and, lately, to number-crunching applications in the field of computational physics [4,21,42]. Their increased flexibility and performance allow the implementation of data-parallel algorithms that efficiently exploit their peak computational capabilities, which exceed those of high-end CPUs by combining a large number of floating point operations per second (GFLOPS) with high memory bandwidth. Moreover, their mass production has made them a cheaper, viable alternative to conventional high-performance computing systems. GPUs from both major manufacturers (ATI and NVIDIA), together with a number of software programming models (Cg [37], Brook [7], Sequoia [15], CTM [1], CUDA [40], to mention a few), are available to scientists, making GPUs suitable for high-performance scientific applications. In this paper, NVIDIA's graphics cards and CUDA are used. CUDA is an extended
subset of the C language and is supported by all recent NVIDIA graphics cards. It treats the GPU as a device suitable for parallel computations, with its own random access memory (device/global memory), capable of concurrently executing a large number of threads. The latter are grouped into blocks; many blocks comprise a grid. Grids of blocks execute the code (referred to as the kernel) on the GPU, using the single instruction multiple thread (SIMT) model. Limited shared memory (due to hardware restrictions) as well as synchronization instructions are available to the threads forming the same block. A wide range of applications which benefit from the excellent parallel performance of GPUs can be found in the literature [42]. However, applications in computational fluid dynamics (CFD) are still limited. In [20], a GPU implementation of the multigrid method for the solution of boundary value heat and fluid flow problems, using the vorticity–streamfunction formulation, was presented. Kruger and Westermann [36] proposed a framework for the implementation of explicit and implicit numerical methods on graphics hardware (ATI 9800 graphics card) and showed applications related to the 2D wave equation and the incompressible Navier–Stokes equations. Bolz et al. [4] described a conjugate gradient solver for sparse matrices resulting from the use of unstructured grids
and a multigrid solver for structured grids, on NVIDIA's GeForce FX cards. Goddeke et al. [19] used many cluster nodes with NVIDIA's Quadro FX4500 PCIe cards and addressed the issue of the (at that time) limited precision of GPUs by applying a mixed precision iterative refinement technique. A detailed description of the GPU implementation of a Navier–Stokes flow solver for structured grids was presented in [22]. In [9], this was extended to 3D flows, with emphasis on realistic and accurate visualization effects for graphics applications. Implementations for compressible fluid flows were first presented in [21]. Brandvik and Pullan [5,6] focused on performance comparisons between GPU and CPU codes for the solution of the 2D and 3D Euler equations and reported considerable speed-ups, using exclusively structured grids. Applications to complex geometries, even for hypersonic flows, can be found in [11]. A first application to 3D unstructured grids for inviscid, compressible flows on an NVIDIA Tesla GPU is presented in [8]. Although [8] has certain similarities to the present paper, it is based on the cell-centered finite volume technique, whereas this paper relies on the vertex-centered one. As will be shown as the paper develops, there are important differences between the two schemes as far as porting to GPUs is concerned.

In view of the above, in the present paper a solver for the 2D steady-state Navier–Stokes equations for compressible fluids on unstructured grids with triangular elements, coupled with a one-equation turbulence model, is first presented; the programming details for porting the code to the GPU are discussed. The first part of this study focuses on both single (SPA) and double (DPA) precision arithmetic GPU implementations, in order to quantify their speed-ups compared to the CFD code running on the CPU. A mixed precision arithmetic (MPA) implementation, which uses SPA for the left-hand-side (l.h.s.) part of the discretized equations and DPA for the right-hand-side terms (r.h.s., i.e. the residuals), is also devised. Below, the three GPU implementations with SPA, MPA and DPA will be referred to as GPU_SP, GPU_MP and GPU_DP, respectively. The solver running on the CPU implements DPA; the speed-ups of all GPU-enabled variants are measured with respect to the same CPU code. The proposed GPU_MP leads to high speed-ups without damaging the prediction accuracy of GPU_DP. A 3D Euler equations GPU-enabled solver (using SPA, MPA and DPA, too) for unstructured grids with tetrahedral elements is also presented and assessed, in terms of speed-ups, against the corresponding CPU code.

The second part of this paper deals with aerodynamic shape optimization, exploiting the CFD software variants which have already been ported to GPUs. In particular, the GPU-enabled flow solvers are used as analysis tools in the context of a hierarchical EA, to efficiently solve aerodynamic shape optimization problems. In such a method, the search for the optimal solution(s) is considerably accelerated by the combined use of (a) a low cost, low fidelity flow solver (GPU_SP) which undertakes the exploration of the design space on the low level and (b) a high fidelity flow solver (GPU_MP), with higher cost, which is mainly used to refine the most promising solutions by injecting them into the high level search algorithm. The present applications include single- and multi-objective optimizations of a compressor cascade and an isolated airfoil.

A final statement should be made.
Nowadays, "personal supercomputers", i.e. clusters of CPUs and GPUs, deliver significant computing performance at a reasonable price. The present CFD software may certainly run on such clusters, after defining an appropriate number of grid subsets which exchange data for the physically coincident grid nodes residing on different subsets. However, this is beyond the scope of this paper. The reason is simple: since the GPU code is used to support an EA, in which candidate solutions can be evaluated simultaneously and independently, and due to the limited number of available GPUs, it was decided that each
CFD code runs on a single GPU. We thus minimize the communication overhead; applications are, however, limited by the memory available on a single GPU.

2. 2D/3D CFD Implementations on GPUs

This section presents the porting of the 2D and 3D Navier–Stokes equations solvers to the GPU. The basic features of the time-marching Navier–Stokes flow solver (CPU implementation) are first presented. Programming issues for the GPU implementations are then discussed, taking into consideration the specific characteristics/architecture of the available NVIDIA graphics cards. The section concludes with performance comparisons between CPU and GPU for 2D and 3D, inviscid and/or turbulent steady-state flows around an isolated airfoil, around an aircraft and in a compressor cascade.

2.1. The Navier–Stokes equations solver – implementation on CPUs

The Navier–Stokes equations for a compressible fluid are written in vector form as
\[
\frac{\partial \vec{U}}{\partial t} + \frac{\partial \vec{f}^{\,inv}_i}{\partial x_i} - \frac{\partial \vec{f}^{\,vis}_i}{\partial x_i} = 0, \tag{1}
\]

with i = 1, 2 in 2D (i = 1, 2, 3 in 3D) and x_i the Cartesian coordinates. U = [ρ, ρu, E]^T is the vector of conservative variables. The inviscid (f_i^inv) and viscous (f_i^vis) fluxes are given by
\[
\vec{f}^{\,inv}_i =
\begin{bmatrix}
\rho u_i \\[2pt]
\rho u_i \vec{u} + p\,\vec{\delta}_i \\[2pt]
u_i (E + p)
\end{bmatrix},
\qquad
\vec{f}^{\,vis}_i =
\begin{bmatrix}
0 \\[2pt]
\vec{\tau}_i \\[2pt]
u_j \tau_{ij} + q_i
\end{bmatrix}, \tag{2}
\]
where ρ is the density, u the velocity vector, τ_i = [τ_i1, τ_i2]^T (τ_i = [τ_i1, τ_i2, τ_i3]^T in 3D) the viscous stresses, δ_i = [δ_i1, δ_i2]^T (δ_i = [δ_i1, δ_i2, δ_i3]^T in 3D) the Kronecker symbols and q_i = k ∂T/∂x_i the thermal flux components. The solution of Eqs. (1), on unstructured grids with triangular (2D) or tetrahedral (3D) elements, is based on the vertex-centered finite volume technique and a second-order upwind scheme for the inviscid fluxes. On a 2D grid, Fig. 1 shows the finite volume Ω_P defined around node P; in 3D cases, finite volumes are defined in a similar way. The integration of the governing equations over Ω_P gives
\[
\frac{\Omega_P}{\Delta t_P}\,\Delta\vec{U}_P
+ \sum_{Q \in \mathrm{nei}(P)} \left(\vec{\Phi}^{\,inv}_{PQ} - \vec{\Phi}^{\,vis}_{PQ}\right)(ab) = 0, \tag{3}
\]

where Δt_P is the local pseudo-time step and Φ^inv_PQ, Φ^vis_PQ are the inviscid and viscous numerical fluxes across ab, i.e. the interface of Ω_P and Ω_Q.
Fig. 1. Vertex-centered finite volume XP (hatched area) defined around node P of an unstructured grid with triangular elements.
In the CPU implementation, the inviscid fluxes are computed using the 1D Roe approximate Riemann solver [48] applied between P and Q, i.e. in an edgewise manner for both 2D and 3D grids, as
\[
\vec{\Phi}^{\,inv}_{PQ} =
\frac{1}{2}\left(A_P\,\vec{U}^{\,L}_{PQ} + A_Q\,\vec{U}^{\,R}_{PQ}\right)
- \frac{1}{2}\,\bigl|\tilde{A}_{PQ}\bigr|\left(\vec{U}^{\,R}_{PQ} - \vec{U}^{\,L}_{PQ}\right), \tag{4}
\]

where A stands for the Jacobian matrix. For second-order accuracy, U^L_PQ and U^R_PQ are computed from U_P and U_Q through variables' extrapolation, as

\[
\vec{U}^{\,L}_{PQ} = \vec{U}_P + \frac{1}{2}\,\overrightarrow{PQ}\cdot\nabla\vec{U}_P
\quad\text{and}\quad
\vec{U}^{\,R}_{PQ} = \vec{U}_Q - \frac{1}{2}\,\overrightarrow{PQ}\cdot\nabla\vec{U}_Q. \tag{5}
\]
The viscous fluxes are expressed in terms of the gradients of the primitive variables V. The computation of ∇V_PQ along the edge PQ [3] is given by

\[
\nabla\vec{V}_{PQ} =
\frac{1}{2}\left(\nabla\vec{V}_P + \nabla\vec{V}_Q\right)
- \left[\frac{1}{2}\left(\nabla\vec{V}_P + \nabla\vec{V}_Q\right)\cdot\vec{n}
- \frac{\vec{V}_Q - \vec{V}_P}{(PQ)}\right]\vec{n}, \tag{6}
\]
where n is the normal to ab. The gradient at each grid node is, then, derived from the elements' gradients

\[
\nabla\vec{V}_T = \sum_{k=1}^{3\ \mathrm{or}\ 4} \vec{V}_k\,\nabla L_k, \tag{7}
\]
where L_k are the element shape functions. In the CPU implementation (either 2D or 3D), both the inviscid and viscous fluxes are computed by looping over the grid edges (such as PQ) and scatter-adding contributions to their end nodes. This avoids computing the same flux twice, once for each of the end nodes P and Q. The discretized Eqs. (3) are solved using the point-implicit Jacobi method. The mean flow equations are coupled with the one-equation Spalart–Allmaras turbulence model [49]. The model is solved for μ̃, which is then used to compute the turbulent viscosity μ_t. Convection and diffusion terms in the μ̃-equation are handled as in the mean flow equations.
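To make the edgewise sweep concrete, the following sketch shows the scatter-add pattern described above. It is written in C-like form for brevity (the actual CPU solver is in Fortran), and all names (edge_node, compute_roe_flux, residual) are hypothetical placeholders rather than the authors' code.

```cuda
/* Hedged sketch of the CPU edgewise residual assembly (not the authors' code). */
typedef struct { double rho, rho_u, rho_v, E; } State4;   /* 2D conservative variables */

/* One Roe flux per edge (Eq. (4)); assumed to be provided elsewhere. */
State4 compute_roe_flux(State4 UL, State4 UR, int edge);

void assemble_rhs_edgewise(int n_edges, const int (*edge_node)[2],
                           const State4 *U, State4 *residual)
{
    for (int e = 0; e < n_edges; ++e) {
        int P = edge_node[e][0];
        int Q = edge_node[e][1];
        State4 f = compute_roe_flux(U[P], U[Q], e);   /* flux computed once per edge */
        /* Scatter-add: the flux crossing the P-Q interface contributes to both
           end nodes, with opposite signs. */
        residual[P].rho   += f.rho;    residual[Q].rho   -= f.rho;
        residual[P].rho_u += f.rho_u;  residual[Q].rho_u -= f.rho_u;
        residual[P].rho_v += f.rho_v;  residual[Q].rho_v -= f.rho_v;
        residual[P].E     += f.E;      residual[Q].E     -= f.E;
    }
}
```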
2.2. The NVIDIA GPU architecture

In this paper, NVIDIA GPUs (GTX 280, GTX 285) based on the GT 200 architecture are used, in which all the computational elements implement the Unified Shader Architecture; they all have the same computational capabilities, i.e. they can read from textures and linear memory and perform arithmetic operations. The GTX 285 board consists of 10 texture processor clusters (TPCs). Each TPC includes 3 streaming multiprocessors (SMs), control circuits and a texture cache block. SMs are the key compute elements of the GT 200 architecture and consist of 8 streaming processors (SPs) along with two special function units (SFUs), an instruction unit and 16 KB of local shared memory (Fig. 2). SPs are fully pipelined and may deliver three FLOPs per clock. SFUs are mainly used for transcendental operations and interpolations and may contribute another FLOP per clock. On this basis, the theoretical peak performance of each GTX 285 board is estimated at 1063 GFLOPS. Each SM executes warps of threads (each warp consists of 32 threads) using the SIMT model, so all the SPs of an SM execute the same instruction on different parts of the data to be processed. An SM may handle up to 1024 threads, adding up to as many as 30,720 concurrent threads on each GTX 285 board. All 240 SPs are used exclusively for single precision operations. Double precision operations are carried out in a separate unit inside each SM; thirty such units reside on each GTX 285 board. These are far fewer than the 240 SPs available for SPA operations, since DPA is of limited use in graphics. All threads of the same block (with a number of blocks comprising a grid, Fig. 2b) are able to cooperate using the SM shared memory. Furthermore, a synchronization instruction is available. This brings up a straightforward hardware requirement, restricting each block to be processed by a single SM. The total number of blocks that can be assigned to a single SM depends upon the shared memory and register requirements as well as the maximum number of threads. Scheduling occurs within each SM, according to the number of active warps. Dynamic scheduling is used; for instance, while a warp is waiting for a memory transaction to be completed, the scheduler initiates the execution of another (also active) warp, so as to maximize the utilization of resources. The interchange between warps has zero (wall clock time) cost.
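As a rough check of the peak-performance figure quoted above, assume the nominal GTX 285 shader clock of roughly 1.476 GHz (a published board specification, not stated in the paper). The single precision peak then follows from the 3 FLOPs delivered by each SP per clock,

\[
240\ \text{SPs} \times 3\ \frac{\text{FLOPs}}{\text{SP}\cdot\text{clock}} \times 1.476\ \text{GHz} \approx 1063\ \text{GFLOPS},
\]

which reproduces the quoted value from the SP contribution alone; the SFUs would add further operations on top of this count.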
Fig. 2. (a) Overview of a streaming multiprocessor of the GT 200 chip and (b) threads’ grouping into blocks and grids of blocks.
Fig. 3. Flowchart of the GPU-enabled flow solver for both 2D and 3D flows. The turbulence model (T.M.) is numerically decoupled from the mean flow equations.
2.3. The GPU implementation – programming issues

Efficiently porting the Navier–Stokes flow solver to the GPU requires a different programming approach, due to shared memory conflicts and the lack of a cache for the global memory on the GPU, and in order to achieve the highest possible memory bandwidth. In the GPU implementation, the CPU controls a sequence of data processing tasks to be carried out on the GPU, as shown in Fig. 3. Input/output operations (for grid coordinates, connectivities and boundary conditions) as well as the preparation of the data structures that handle the unstructured grid topology are all carried out on the CPU. Then, all the so-processed data are copied to the device memory (GPU global memory). The first task carried out, in parallel, on the GPU is the initialization of the flow variables. The discretization and iterative solution (time-marching) of the flow equations, which correspond to the computationally demanding part of the code, follow. The discretization comprises the computation of the inviscid and viscous fluxes for the mean flow and turbulence model equations (r.h.s.) and that of the corresponding l.h.s. matrix coefficients. The resulting system of linearized equations is then iteratively solved (Jacobi iterations) using kernel calls driven by the CPU. During the iterative part of the algorithm, "synchronization" between CPU and GPU occurs only during convergence checks, when the residuals' norms are copied to the CPU. Upon convergence, all flow quantities are transferred back to the CPU. A number of implementation details follow.

In the CPU implementation, and for the reasons exposed in Section 2.1, the computation of both the inviscid and viscous fluxes (Eqs. (4) and (6)) is based on edgewise sweeps. On the GPU, however, this is not possible, due to the shared memory conflicts arising when independent, concurrently executed threads try to accumulate fluxes to coincident end nodes. Thus, a nodewise scheme is applied instead. Each thread undertakes the formation of the l.h.s. and r.h.s. equation terms for a single node, by looping over all edges emanating from it. Even if the fluxes associated with all edges are computed twice, the
proposed scheme proved to outperform an edge-coloring-based computation, which was also tested. Coloring (an alternative way to overcome shared memory access conflicts) creates groups of edges which, when simultaneously swept, scatter contributions to distinct nodes. For instance, a coloring scheme was used in [35] to port a finite element earthquake modeling software to NVIDIA's graphics cards. Here, however, the standard edge coloring method proved less efficient than the use of nodewise sweeps (including local renumbering) and was, thus, abandoned; this comparison is not included in the paper. As also explained in [8], the redundant computations are harmless, since they can be overlapped with global memory accesses.

It is evident that global memory access significantly affects the performance of the GPU implementation. According to [39], better efficiency can be achieved if the concurrent memory accesses of all threads composing a half-warp can be coalesced into a single memory transaction. However, coalesced memory access is not possible for unstructured grids (as it is for structured ones), due to the random numbering of grid nodes, edges and elements, as well as the variable number of edges emanating from each node.¹ For instance, the kernel undertaking the computation of fluxes loops over the edges emanating from each node. In unstructured grids, this is a highly non-coalesced access, due to the lack of a constant memory access pattern; the latter strongly depends upon the grid connectivity. As a remedy, an efficient renumbering scheme is proposed and used. The idea is to ensure that edges which are accessed simultaneously reside nearby in the global memory. The edge connectivity is sorted locally: the first edge of each node is placed on top of the list, followed by the second, third and so forth. This allows an approximately 70% performance increase, especially in 3D applications.

¹ In the CFD method described in this paper, there is an additional "difficulty" compared, for instance, to [8], which also handles 3D unstructured grids. In [8], a cell-centered finite volume scheme is used, so each kernel associated with a tetrahedral element (and the corresponding node) loops over a fixed number (four, if boundary elements are left aside) of faces and, thus, a fixed number of neighboring nodes. Unfortunately, this is not the case in this paper. In a vertex-centered finite volume scheme, the number of nodes linked to each grid node is not fixed, causing extra overhead during global memory access.

Moreover, scattered data that are not iteratively updated (i.e.
geometric quantities required to compute the local time step) can be accumulated in arrays prior to iterating, at the cost of extra memory.

The performance of the GPU implementation was significantly improved by using texture memory. This is a cached, read-only memory, accessible by all threads. It is known that multiple fetches of the same texture memory address cost as much as a single access to the global memory. However, the limited size of the texture memory requires a careful selection of the arrays to be textured. Best efficiency occurs if randomly accessed quantities (due to the unstructured topology), frequently used during the iterative process but updated only once per iteration, are textured. For instance, this is the case for the primitive flow variables and their gradients, which contribute to the computation of the inviscid and viscous fluxes (Eqs. (5) and (6)). The primitive flow variables (vector V) are derived from the conservative ones (vector U) and stored at the beginning of each pseudo-time iteration. So, in the kernels undertaking the computation of the inviscid and viscous fluxes, both the primitive flow variables and their gradients are accessed from the texture memory, thus decreasing the wall clock time per iteration.

The code performance was further improved by minimizing the global memory accesses within a thread. For instance, the computation of the l.h.s. and r.h.s. terms of the discretized equations at any node is based on a loop that collects contributions from all the edges emanating from it. In such a scheme, global memory access is minimized via increased shared memory usage and by updating quantities stored in the global memory only upon completion of all computations.
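A minimal CUDA sketch of the nodewise scheme and single-write strategy just described is given below: one thread per node loops over the edges emanating from it (stored contiguously after the local renumbering), accumulates the r.h.s. in registers and performs a single global memory write at the end. All array names and the flux routine are hypothetical placeholders, not the authors' kernel.

```cuda
// Hedged sketch (not the authors' kernel): nodewise flux accumulation, SPA version.
__device__ float4 inviscid_flux(float4 Up, float4 Uq);   // assumed, defined elsewhere

__global__ void accumulate_rhs(int n_nodes,
                               const int *edge_start,      // per-node offsets into the sorted edge list
                               const int *edge_neighbor,   // neighbouring node of each edge
                               const float4 *U,            // conservative variables (could be textured)
                               float4 *rhs)
{
    int P = blockIdx.x * blockDim.x + threadIdx.x;
    if (P >= n_nodes) return;

    float4 Up  = U[P];                                   // one 128-bit load
    float4 acc = make_float4(0.0f, 0.0f, 0.0f, 0.0f);    // accumulate in registers

    for (int e = edge_start[P]; e < edge_start[P + 1]; ++e) {
        int Q    = edge_neighbor[e];                     // grid-dependent (non-coalesced) access
        float4 f = inviscid_flux(Up, U[Q]);              // also recomputed by the thread of node Q
        acc.x += f.x;  acc.y += f.y;  acc.z += f.z;  acc.w += f.w;
    }
    rhs[P] = acc;                                        // single global write per node
}
```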
In addition, significant extra gain in speed-up was achieved thanks to the vector types defined in CUDA and, especially for single precision arithmetic, by taking advantage of the float4 variable type. GPUs are capable of copying 128-bit words from the device memory to the registers (and vice versa) in a single memory transaction. So, loading a float4 variable from the device memory to the registers is faster than loading four float variables using four memory transactions. For instance, with SPA, 2D compressible fluid flow problems, which involve four mean flow equations, are perfectly suited to the float4 variable type. To account for extra variables, such as an extra mean flow variable in 3D, turbulence quantities or even the use of DPA, properly aligned user-defined data types are used to ensure the minimum number of memory transactions. For instance, structures composed of five double precision variables (as required by the GPU_DP code used for the prediction of 3D flows, or of 2D turbulent flows with a one-equation model) are compiled into three 128-bit transactions instead of five.

Finally, extra gain in runtime, less significant however compared to the preceding actions, resulted from the use of the constant memory space and from loop unrolling. Similarly to the texture memory, constant memory is read-only, cached and accessible by all threads. Its use was restricted to the gas and turbulence model constants, since it performs best when every thread of a warp accesses the same memory address, the cost of reading from the constant memory scaling linearly with the number of different addresses requested. On the other hand, unrolling loops that access the global memory proved to be beneficial.
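The sketch below illustrates the data types discussed above: a float4 maps a 2D SPA node state onto a single 128-bit transaction, while a five-double structure aligned to 16-byte boundaries occupies 48 bytes and can therefore be moved with three 128-bit transactions. The particular member names and padding are illustrative assumptions, not the authors' actual layout.

```cuda
// Hedged sketch of aligned user-defined types (not the authors' definitions).
// A float4 (2D SPA node state) is read/written with one 128-bit transaction.
struct __align__(16) NodeStateDP5 {
    double rho, rho_u, rho_v, rho_w, E;   // e.g. 3D mean flow variables in DPA
    double pad;                           // pad to 48 bytes = three 16-byte words
};

__global__ void copy_states(const NodeStateDP5 *in, NodeStateDP5 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // the compiler may emit wide (vector) loads/stores
}
```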
2.4. Performance comparisons between CPU and GPU implementations

The performance of the GPU-enabled flow solver implementations, both 2D and 3D (GPU_SP, GPU_MP, GPU_DP), is assessed on inviscid and turbulent flows developed around isolated airfoils and aircraft, or in compressor cascades. A basic feature of our GPU-enabled implementations GPU_DP and GPU_MP is that they reproduce the results of the corresponding CPU code (i.e. identical convergence and results). GPU_MP runs faster than GPU_DP, as will be shown below. All computations were carried out on an NVIDIA GeForce GTX 285 graphics card and an Intel Core 2 Duo 2.8 GHz CPU (3 GB RAM). The 2D and 3D CPU codes were written in Fortran and compiled using the GNU Fortran compiler version 4, with full optimization options.

Fig. 4. 2D flow solver. Convergence of the continuity equation residual obtained by the GPU_DP, GPU_SP and GPU_MP implementations.
Fig. 5. 2D flow solver. Speed-up of the GPU_SP, GPU_MP and GPU_DP implementations (in all subsequent figures, speed-ups refer to how much faster the GPU code is than the CPU code) as a function of the number of grid nodes, for the (a) inviscid and (b) turbulent flow around an isolated airfoil.

2.4.1. 2D flow studies

For the isolated airfoil turbulent flow studies, the NACA4415 airfoil was used, with M_∞ = 0.5, α_∞ = 2° and Re_c = 9×10^5. For the
cascade flow studies, the optimal blade airfoil designed in Section 4.2 was used, with M_2,is = 0.3, α_1 = 44° and Re_c = 9×10^5. The convergence of the residual of the continuity equation using the three GPU implementations is shown in Fig. 4; the SPA operations of GPU_SP (for both the l.h.s. and the r.h.s.) lower the accuracy in computing the residuals, while the convergence plots of GPU_DP and GPU_MP are identical. Speed-ups are shown in Figs. 5 and 6. As the grid size increases, speed-ups of up to 28 for GPU_SP, 25.2 for GPU_MP and 19.6 for GPU_DP can be observed. The much higher speed-up of the GPU_SP implementation is due to (a) the use of four-component, single precision vector type (float4) variables instead of the two-component double precision (double2) ones of GPU_DP and (b) the lower cost of SPA operations. For instance, in the GPU_SP implementation, the conservative variables associated with a single node are represented by a single float4 variable, instead of two double2 variables; this results in half the memory instructions compared to the GPU_DP implementation. Recall that CUDA does not yet support double2 fetches from the texture memory and that fetching a double variable requires tedious programming; on the other hand, fetching float4 variables is straightforward. The GPU_MP implementation combines the fast memory instructions for the l.h.s. terms (using float4 variables) with the DPA of the r.h.s. ones, yielding speed-ups between those reported for GPU_SP and GPU_DP.

It should be stated that the speed-ups mentioned above are significantly improved compared to those presented in the first submission of this paper, where NVIDIA GeForce GTX 280 cards were used. A comparison of the speed-ups achieved on the previous (GTX 280) and the new (GTX 285) card is shown in Table 1. Porting the code to the GTX 285 did not require further modifications or programming, so it was straightforward to repeat the runs. The main difference between the two GPUs is a frequency increase (7.6%, 13.9% and 12.2% on the GPU, shader and memory clocks, respectively) which significantly affects the memory bandwidth (159.0 GB/s instead of 140.7 GB/s), the texel rate (51,840 MT/s instead of 48,160 MT/s) and the peak computational capabilities (1063 GFLOPS instead of 933 GFLOPS). The increase in hardware performance corresponds to an approximately 25% increase in speed-up between the two cards. For the sake of completeness, it should also be stated that the GTX 285 has lower power consumption and, thus, thermal emissions, which allowed the increase in frequency.

The difference in the prediction accuracy of the three GPU implementations is presented in Fig. 7, which compares the pressure distribution along the solid walls of an isolated airfoil for both inviscid and turbulent flows. GPU_DP and GPU_MP produce identical solutions whereas, as expected, GPU_SP leads to slightly different results. The corresponding pressure distributions obtained by the solver running on the CPU are not presented, since they are identical to those of the GPU-enabled solvers.
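The "tedious programming" mentioned above for routing double precision data through the texture path usually refers to the standard workaround of binding the array as int2 and reassembling each double inside the kernel; a hedged sketch of this workaround (not necessarily the authors' exact implementation) is shown below.

```cuda
// Hedged sketch: fetching double precision data through the texture cache by
// aliasing it as int2, since double/double2 texture fetches are not supported.
texture<int2, 1, cudaReadModeElementType> tex_grad;      // bound to an array of doubles

__device__ double fetch_double(int i)
{
    int2 v = tex1Dfetch(tex_grad, i);
    return __hiloint2double(v.y, v.x);                   // reassemble the hi/lo 32-bit halves
}

// For SPA data, a float4 texture needs no such trick: one fetch returns a full node state.
texture<float4, 1, cudaReadModeElementType> tex_U;

__device__ float4 fetch_state(int i) { return tex1Dfetch(tex_U, i); }
```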
Fig. 6. 2D flow solver. Speed-up of the GPU_SP, GPU_MP and GPU_DP implementations as a function of the number of grid nodes, for the turbulent flow in a compressor cascade.
Table 1
CFD prediction of 2D turbulent flows on unstructured grids. Comparison of speed-ups achieved using NVIDIA's GTX 280 and GTX 285 for the GPU_SP, GPU_MP and GPU_DP implementations.

           GTX 280    GTX 285
GPU_SP     21         28
GPU_MP     20.7       25.2
GPU_DP     14.8       19.6
Fig. 7. 2D flow solver. Comparison of the pressure coefficient distribution computed using GPU_SP, GPU_MP and GPU_DP for (a) inviscid and (b) turbulent flow around an isolated airfoil. The corresponding CPU code results are not presented since these are identical to those of the GPU_MP or GPU_DP implementations.

2.4.2. 3D flow studies

The performance of the 3D GPU implementation is assessed on the inviscid flows developed around a civil aircraft, with M_∞ = 0.7 and α_∞ = 0°, and within a supersonic compressor cascade [10], with M_1 = 1.6 and α_1 = 60°. The Mach number contours of the numerically predicted flow fields in the two examined cases are shown in Figs. 8 and 9, respectively. The performance of the GPU-enabled 3D flow solver compared to the CPU one, for a variety of grids (generated for this purpose), is presented in Fig. 10. As shown, speed-ups of up to 27.5 for GPU_SP, 23 for GPU_MP and 19.6 for GPU_DP were measured. Furthermore, in Fig. 11 the effect of the "coalesced" access of the fluxes after the appropriate re-arrangement and sorting (described in Section 2.3) is quantified for the GPU_DP implementation. It is obvious that such a treatment significantly benefits the 3D flow solver (where the number of edges emanating from a node is not known a priori and may vary significantly), increasing the speed-ups by up to 70%. Note that we restricted ourselves to grids of up to 250,000 nodes, since the GTX 285 card cannot support larger grid sizes due to memory limitations. We recall that a single GPU is assigned to each flow analysis, in view of our need to parallelize the evaluations of many candidate solutions on the available GPUs during the EA-based optimization.

Fig. 8. Flow around a civil aircraft: Mach number contours.

Fig. 9. Flow in a supersonic compressor cascade: Mach number contours.

Fig. 10. 3D flow solver. Speed-up of the GPU_SP, GPU_MP and GPU_DP implementations as a function of the number of grid nodes, for the inviscid flows (a) around an aircraft and (b) within a 3D supersonic compressor cascade.

Fig. 11. 3D flow solver. Speed-up of the GPU_DP implementation, with and without the re-arrangement of the flux access, as a function of the number of grid nodes.

3. Aerodynamic optimization on GPUs

Having developed efficient GPU implementations of the CFD tool, the second part of this paper is concerned with their use in the context of an EA-based optimization. The two-level optimization algorithm presented below fully exploits the availability of the fast, low fidelity CFD tool (GPU_SP) and of the more expensive, high fidelity one (GPU_MP). Though the developed GPU-enabled solvers may cope with both 2D and 3D flows, the cases examined below include only airfoil designs. In the authors' opinion, this suffices to demonstrate the capabilities of the method, since any extension to 3D shape optimization would be limited only by the global memory size of the GPUs and/or the number of available GPUs (in case more than one GPU is to be used for the same CFD analysis). It is well known that next generation GPUs are expected to achieve much more than one teraflop of performance, with much higher memory capacity.
3.1. Aerodynamic optimization – generalities

In aerodynamic shape optimization, the search tools are either gradient-based methods or global search metaheuristics [33,50,13]. EAs are gradient-free methods that may accommodate any ready-to-use analysis software, even without access to its source code. For EAs to become routine industrial tools, much focus has been placed on methods reducing the number of evaluations they need to reach the optimal solution(s). To this end, a number of papers employing surrogate evaluation models [29,26,14,16,31,12,47,41], hierarchical or multilevel schemes [32,30,28,27] and/or hybridization with gradient-based methods [27,38,34] have appeared in the literature. Gradient-based methods make use of adjoint formulations [24,25,2,18,43,44] to compute the gradient vector (and/or the Hessian matrix [46,45]) of the objective function with respect to the design variables, which is then used to iteratively compute the optimal solution. The main advantage of the adjoint methods is the low extra cost for the computation of the sensitivity derivatives, which is equal to that of a flow solution, irrespective of the number of design variables.

3.2. Two-level EA-based optimization on GPUs

CFD codes ported to GPUs are of great value not only as stand-alone flow analysis tools but, also, as a means to support any aerodynamic shape optimization software. Since the latter requires many CFD-based evaluations of candidate solutions, the use of a CFD code implemented on a GPU, rather than on a CPU, may considerably accelerate an optimization project. The difference in speed-up between the GPU_MP and GPU_SP implementations inspired the hierarchical optimization algorithm presented below. The idea is to associate GPU_SP and GPU_MP with its two levels. To further reduce the overall cost, surrogate evaluation models (metamodels) are also used. Thus, in total, two problem-specific tools (GPU_SP and GPU_MP) and one metamodel are employed. For more details on the optimization platform itself (irrespective of the use of GPUs), the reader should turn to [27,30].

The two-level algorithm consists of two semi-autonomously running optimization levels, associated with different evaluation software. The main idea is to use a cheap, low fidelity evaluation code (such as GPU_SP) on the low level to roughly locate the optimal solutions and, then, refine them using a higher fidelity code (GPU_MP) on the high level. On each level, a metamodel-assisted evolutionary algorithm (MAEA) is used to search the design space, supported by metamodels trained on-the-fly. According to the inexact pre-evaluation (IPE) technique [17], the MAEA operates as a conventional EA during the first generations, archiving all
evaluated individuals, among which data for training the metamodels will be selected during the subsequent generations. Once enough entries have been archived in the database (the exact number is user-defined), the IPE filter is activated. According to this, properly trained metamodels (Radial Basis Function (RBF) networks [23]) are used to approximate the objective value(s) of each and every offspring. Only a small percentage of the population, with the best (according to the metamodel) responses, undergoes exact evaluation on GPU_SP or GPU_MP during each generation. Note that a different database exists on each level, since the evaluation is carried out using a different GPU-enabled code. The CPU cost of the search algorithm (including the metamodels) is negligible and, therefore, it runs on the CPU. The two levels communicate regularly by exchanging their current best solutions. To control the interlevel migrations, the generation in which the first communication takes place and the communication frequency (expressed in terms of generations) are user-defined. Whenever a level reaches a migration generation, it suspends execution and waits for its adjacent one. During the interlevel migration, a (user-defined) percentage of the immigrants is re-evaluated on the evaluation software of the destination level. After each migration, execution is resumed on both levels.

4. Applications

In this section, single- and multi-objective aerodynamic shape optimization problems are solved using the two-level EA described in the previous sections and the GPU_SP and GPU_MP implementations of the flow solver. In both cases, a significant reduction in the wall clock time of the optimization was observed.

4.1. Optimization of a 2D isolated airfoil

The first case is concerned with the two-objective optimization of an isolated airfoil for minimum drag (C_D) and maximum lift (C_L) coefficients. The flow conditions are: M_∞ = 0.3, α_∞ = 4° and Re_c = 5×10^6. The airfoil was parameterized using two Bézier curves, one per airfoil side, with 9 control points each. The control points were allowed to vary in both the chordwise and normal-to-chord directions. The leading and trailing edge control points were fixed. The control points next to them were free to move only on condition that C¹ continuity at the leading and trailing edges is ensured. This gave rise to 24 design variables in total. Three constraints on the airfoil thickness t, at given chordwise positions, were imposed,
\[
t(0.25c) \ge 0.07c, \qquad t(0.60c) \ge 0.05c, \qquad t(0.85c) \ge 0.02c, \tag{8}
\]
where c is the chord length. The high level of the two-level algorithm employed a (10,20) EA (i.e. with μ = 10 parents and λ = 20 offspring) and the GPU_MP implementation for the evaluation of candidate solutions. The low level employed a (20,50) EA, based on GPU_SP. Both levels were assisted by RBF networks, acting as metamodels for the IPE of the population; they operated as conventional EAs until a pool of 200 exactly evaluated individuals was made available on each level. Then, the IPE filter was activated, according to which 2 individuals on the high and 5 on the low level were exactly evaluated (on GPU_MP and GPU_SP, respectively) in each generation. The high level EA started after the 10th generation of the low level EA. The initial population of the former incorporated the front of non-dominated solutions computed by the latter. The migration step was equal to 10 and 5 generations for the low and high level EAs, respectively. Up to 15 top individuals were broadcast from the low to the high level EA during each migration. After re-evaluation on GPU_MP, immigrants outperforming the high level offspring were allowed to displace them. At the same time, the high level broadcast up to 6 top individuals to the low level EA.
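For illustration only, the host-side sketch below (plain C++, compilable as CUDA host code) mirrors the two-level drive loop configured above: the population sizes, the start of the high level after the 10th low-level generation, and the 10/5-generation migration steps. The RBF metamodels and the IPE filter are omitted, and the GPU_SP/GPU_MP evaluations are replaced by hypothetical stand-in functions, so this is a structural sketch of the migration logic rather than the authors' optimization platform.

```cuda
// Hedged structural sketch of a two-level (mu,lambda) EA with migrations.
// cheap_eval()/accurate_eval() stand in for GPU_SP/GPU_MP CFD runs (assumptions).
#include <algorithm>
#include <random>
#include <vector>

using Individual = std::vector<double>;                  // design variables (e.g. 24)
static std::mt19937 rng(42);

double accurate_eval(const Individual& x) {              // stand-in for a GPU_MP evaluation
    double s = 0.0; for (double v : x) s += v * v; return s;
}
double cheap_eval(const Individual& x) {                 // stand-in for a GPU_SP evaluation
    std::normal_distribution<double> noise(0.0, 1e-3);
    return accurate_eval(x) + noise(rng);                // lower fidelity: noisy estimate
}

struct Level {
    int mu, lambda;
    double (*evaluate)(const Individual&);
    std::vector<Individual> parents;                     // kept sorted, best first
    std::vector<double> fitness;
};

Individual mutate(const Individual& p) {
    std::normal_distribution<double> d(0.0, 0.05);
    Individual c = p; for (double& v : c) v += d(rng); return c;
}

void one_generation(Level& L) {
    std::vector<Individual> off(L.lambda);
    std::vector<double> f(L.lambda);
    std::uniform_int_distribution<int> pick(0, L.mu - 1);
    for (int i = 0; i < L.lambda; ++i) {                 // (mu,lambda) reproduction
        off[i] = mutate(L.parents[pick(rng)]);
        f[i]   = L.evaluate(off[i]);                     // exact evaluation on this level's CFD code
    }
    std::vector<int> idx(L.lambda);
    for (int i = 0; i < L.lambda; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](int a, int b) { return f[a] < f[b]; });
    L.parents.clear(); L.fitness.clear();
    for (int i = 0; i < L.mu; ++i) { L.parents.push_back(off[idx[i]]); L.fitness.push_back(f[idx[i]]); }
}

int main() {
    const int n_vars = 24, generations = 60;
    std::uniform_real_distribution<double> init(-1.0, 1.0);
    auto random_ind = [&] { Individual x(n_vars); for (double& v : x) v = init(rng); return x; };

    Level low  { 20, 50, cheap_eval,    {}, {} };        // low level, GPU_SP-like evaluator
    Level high { 10, 20, accurate_eval, {}, {} };        // high level, GPU_MP-like evaluator
    for (int i = 0; i < low.mu;  ++i) low.parents.push_back(random_ind());
    for (int i = 0; i < high.mu; ++i) high.parents.push_back(random_ind());

    for (int g = 1; g <= generations; ++g) {
        one_generation(low);
        if (g == 10)                                     // high level starts from the current best low-level designs
            for (int i = 0; i < high.mu; ++i) high.parents[i] = low.parents[i];
        if (g > 10) one_generation(high);
        if (g > 10 && g % 10 == 0) {                     // migration low -> high, re-evaluated on the high fidelity tool
            Individual cand = low.parents.front();
            if (accurate_eval(cand) < high.fitness.back()) high.parents.back() = cand;
        }
        if (g > 10 && g % 5 == 0)                        // migration high -> low
            low.parents.back() = high.parents.front();
    }
    return 0;
}
```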
The evolution of the hypervolume indicator, which quantifies the part of the non-dimensional objective space (up to a user-defined "nadir" point) dominated by the front [51], is illustrated in Fig. 12a. For the purpose of comparison, the same quantity computed during a single-level MAEA run (using the same evolution parameters as on the high level and GPU_MP as the evaluation tool) is also shown. The two-level EA is able to capture a better front of non-dominated individuals at the same cost; this is why the hypervolume indicator of the two-level method is constantly higher. The final front of non-dominated solutions, computed by the two-level algorithm, is shown in Fig. 12b. Three airfoils selected from this front, along with the corresponding C_L and C_D values, are shown in Figs. 13a-c. On average, the cost ratio of an evaluation on GPU_MP to one on GPU_SP is 49:20 (on a computational grid of 35,000 nodes). So, a run of GPU_MP corresponds to one time unit (abscissa of Fig. 12a) and a run on GPU_SP to 20/49 time units. During the two-level optimization, 336 high and 1410 low level evaluations were carried out, represented by the continuous line plotted in Fig. 12a. At approximately the same cost, the single-level MAEA based on GPU_MP produced a front of non-dominated solutions with an about 10% worse hypervolume indicator value. From the same figure, it can be seen that the final solution of the single-level MAEA could be captured in approximately 500 time units by the two-level algorithm (i.e. with a 45% economy in time). Note that, if the single-level MAEA employed the CPU rather than the GPU, the speed-up would be 22.5 (for the given computational grid). Hence, the two-level GPU-based method can produce the same quality of "optimal" solutions as the single-level, CPU-based MAEA, with a speed-up of approximately 22.5/0.45 ≈ 50.

Fig. 12. Optimization of a 2D isolated airfoil: (a) Evolution of the hypervolume indicator of the two-level algorithm and a single-level MAEA. (b) The front of non-dominated solutions computed by the two-level algorithm.

Fig. 13. Optimization of a 2D isolated airfoil: Three airfoil profiles selected from the computed front of non-dominated solutions (Fig. 12).

4.2. Optimization of a 2D compressor cascade airfoil

This case is concerned with the design of a 2D compressor cascade for minimum total pressure loss coefficient, ω = (p_t1 − p_t2)/(p_t1 − p_1), where p_t and p are the total and static pressures, respectively, and indices 1 and 2 indicate the inlet to and the outlet from the cascade. The flow conditions are: M_2,is = 0.3, α_1 = 44° and Re_c = 9×10^5. The stagger angle was 25° and the pitch-to-chord ratio s/c = 0.6. The cascade airfoil contour was parameterized using Bézier curves with 18 control points (9 on each side), resulting in 24 design variables in total. To prevent the formation of unacceptably thin airfoils, three constraints on the thickness, namely

\[
t(0.30c) \ge 0.075c, \qquad t(0.60c) \ge 0.065c, \qquad t(0.90c) \ge 0.018c, \tag{9}
\]

were imposed. A minimum flow turning constraint was also imposed (α_1 − α_2 ≥ 25°). This study was based on the same two-level EA and the same GPU_MP and GPU_SP implementations. A (10,20) MAEA was employed on the high level; the low level used a (15,60) MAEA. The IPE technique was activated after 300 and 200 exact evaluations on the high and low level, respectively. Once the metamodels had been activated, 2 and 5 individuals per generation, identified as top ones by the metamodel, were exactly evaluated on the corresponding level's CFD code. As in the previous case, at the end of the 10th generation of the low level EA, the high level EA was initiated by injecting the current top individuals of the low level EA into its starting population. As part of the interlevel migration process, the two levels exchanged their best solutions every 10 (low level) and 5 (high level) generations. The stopping criterion was arbitrarily set to 800 time units, i.e. equivalent GPU_MP evaluations; these correspond to 472 and 695 evaluations on GPU_MP and GPU_SP, respectively.

Fig. 14a illustrates the convergence history of the hierarchical algorithm (total pressure loss coefficient with respect to time units). The convergence history of a single-level optimization (with evaluations on GPU_MP) is also shown. At the same computational cost, the hierarchical algorithm was able to reach a much better solution, with ω = 0.0135, instead of 0.0141 (the ω value of the optimal solution computed by the single-level method at the same computational cost). Note that all solutions provided by the two-level algorithm were better than those computed by the single-level one. The solutions provided by the two-level algorithm have necessarily been evaluated by the high level evaluation tool; thus, the two-level algorithm starts producing results after the first 130 time units, since the high level started after the 10th generation of the low level. During this period, only the low level was allowed to evolve, for 10 generations, yielding a cost of 110 time units; the extra 20 time units correspond to the cost of evaluating the first high level generation. On average, the cost ratio of an evaluation on GPU_MP to one on GPU_SP is 159:92 (on grids of 61,000 nodes). Finally, Fig. 15a compares the ω values computed by GPU_MP and GPU_SP for 6 designs communicated between the two levels. Note that, in this case, the computational grid was very stretched close to the solid walls, to maintain a y+ value less than 1, and this affects all SPA operations. Though GPU_SP is less accurate than GPU_MP, it was able to support the EA acting on the low level, as shown in Fig. 14a. The comparison of the pressure distribution around the optimal airfoil, computed using GPU_SP and GPU_MP, is shown in Fig. 15b.

Fig. 14. Optimization of a 2D compressor cascade: (a) Evolution of the total pressure loss coefficient in the course of the optimization; comparison of the two- and single-level MAEA. (b) Optimal cascade airfoil computed by the two-level algorithm.

Fig. 15. Optimization of a 2D compressor cascade: (a) Comparison of the ω values of 6 candidate solutions communicated between the two levels; the differences between the ω values predicted by the two CFD codes running on the GPU can be seen. (b) Comparison of the pressure coefficient distribution along the optimal airfoil, computed using GPU_MP and GPU_SP.
5. Conclusions

The purpose of this paper was twofold: (a) to present the GPU implementation of a 2D/3D Navier–Stokes/Euler equations solver for unstructured grids, based on a vertex-centered finite volume scheme and (b) given the two variants of this code with different prediction accuracy and computing cost, to demonstrate a
two-level EA-based optimization algorithm, for problems with one or more objectives as well as a number of constraints. NVIDIA's GeForce GTX 285 graphics cards and the CUDA programming language were used. With respect to the flow solver, complete restructuring was necessary to maximize the parallel speed-up. The GTX 285 allows single and double precision arithmetic, which are optimally combined in the GPU_MP implementation. For maximum speed-up of the GPU-enabled flow solver (compared to the CPU code), it was necessary to:

- Switch from edgewise to nodewise sweeps for both the inviscid and viscous terms.
- Use coalesced memory access and renumbering schemes to overcome the management of scattered data due to the unstructured grid topology and connectivity, particularly for vertex-centered schemes. The vertex-centered scheme (compared to a similar work that recently appeared in the literature [8], which was based on a cell-centered discretization) introduces extra difficulties and calls for careful treatment, mainly due to the non-fixed number of edges emanating from each node.
- Use texture memory for a limited number of selected flow variables.
- Minimize the global memory accesses within each thread.
- Use the vector types defined in CUDA (for instance, float4) and properly aligned user-defined data types.
- Use the constant memory space and loop unrolling.
Using all these techniques, speed-ups of about 28 for GPU_SP, 25.2 for GPU_MP (with SPA for the l.h.s. and DPA for the r.h.s.) and 19.6 for GPU_DP were reached, asymptotically as the grid size increases. These values are valid for Euler and Navier–Stokes computations on 2D grids. For the 3D Euler equations, speed-ups of about 27.5 for GPU_SP, 23 for GPU_MP and 19.6 for GPU_DP were measured.

In the second part of the paper, a two-level optimization method that exploits these GPU variants was demonstrated. The proposed two-level scheme is based on EAs, surrogate evaluation models (RBF networks) and two GPU-enabled code variants (GPU_SP as the low fidelity and GPU_MP as the high fidelity evaluation tool), used in a hierarchical manner. The examined aerodynamic optimization problems demonstrate that the new algorithm clearly outperforms conventional EAs (based on the most accurate GPU implementation of the flow solver) in terms of computational cost.

Acknowledgment

The second author was supported by a scholarship from the Public Benefit Foundation Alexandros S. Onassis.

References

[1] AMD, AMD ATI Close To the Metal (CTM) guide, 2008.
[2] W.K. Anderson, V. Venkatakrishnan, Aerodynamic design optimization on unstructured grids with a continuous adjoint formulation, Comput. Fluids 28 (4–5) (1999) 443–480.
[3] T.J. Barth, D. Jespersen, The design and application of upwind schemes on unstructured meshes, AIAA Paper 1989-0366, 27th Aerospace Sciences Meeting, January 1989.
[4] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Trans. Graphics 22 (3) (2003) 917–924.
[5] T. Brandvik, G. Pullan, Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware, Proc. Inst. Mech. Engineers, Pt C: J. Mech. Engrg. Sci. 221 (12) (2007) 1745–1748.
[6] T. Brandvik, G. Pullan, Acceleration of a 3D Euler solver using commodity graphics hardware, AIAA Paper 2008-607, 46th AIAA Aerospace Sciences Meeting and Exhibit, January 2008.
[7] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs: stream computing on graphics hardware, ACM Trans. Graphics 23 (2004) 777–786.
[8] A. Corrigan, F. Camelli, R. Löhner, J. Wallin, Running unstructured grid based CFD solvers on modern graphics hardware, AIAA Paper 2009-4001, 19th AIAA Computational Fluid Dynamics Conference, June 2009.
[9] K. Crane, I. Llamas, S. Tariq, Real-time simulation and rendering of 3D fluids, GPU Gems 3 (2007) (Chapter 30).
[10] J.D. Denton, Ch. Hirsch, G. Meauzé, Analytical test cases of cascades, AGARD-AR-275, 1990.
[11] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. 227 (2008) 10148–10161.
[12] M. Emmerich, K.C. Giannakoglou, B. Naujoks, Single- and multi-objective evolutionary optimization assisted by Gaussian random field metamodels, IEEE Trans. Evolut. Comput. 10 (4) (2006) 421–439.
[13] B. Epstein, S. Peigin, Optimization of 3D wings based on Navier–Stokes solutions and genetic algorithms, Int. J. Comput. Fluid Dyn. 20 (2) (2006) 75–92.
[14] M. Farina, A neural network based generalized response surface multiobjective evolutionary algorithm, in: 2002 Congress on Evolutionary Computation – CEC '02, vol. 1, Honolulu, HI, May 2002, pp. 956–961.
[15] K. Fatahalian, D.R. Horn, T.J. Knight, L. Leem, M. Houston, J.Y. Park, M. Erez, M. Ren, A. Aiken, W.J. Dally, P. Hanrahan, Sequoia: programming the memory hierarchy, in: SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM, New York, NY, USA, 2006, p. 83.
[16] K.C. Giannakoglou, Design of optimal aerodynamic shapes using stochastic optimization methods and computational intelligence, Prog. Aerospace Sci. 38 (1) (2002) 43–76.
[17] K.C. Giannakoglou, A.P. Giotis, M.K. Karakasis, Low-cost genetic optimization based on inexact pre-evaluations and the sensitivity analysis of design parameters, Inverse Prob. Engrg. 9 (2001) 389–412.
[18] M.B. Giles, N.A. Pierce, Improved lift and drag estimates using adjoint Euler equations, AIAA Paper 1999-3293, 14th Computational Fluid Dynamics Conference, June–July 1999.
[19] D. Goddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, H. Wobker, C. Becker, S. Turek, Using GPUs to improve multigrid solver performance on a cluster, Int. J. Comput. Sci. Engrg. (IJCSE) 4 (1) (2008) 36–55.
[20] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, A multigrid solver for boundary value problems using programmable graphics hardware, Graphics Hardware (2003) 1–11.
[21] T.R. Hagen, K.A. Lie, J.R. Natvig, Solving the Euler equations on graphics processing units, Comput. Sci. – ICCS 3994 (2006) 220–227.
[22] M.J. Harris, Fast fluid dynamics simulation on the GPU, GPU Gems (2004) 637–665 (Chapter 38).
[23] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, New Jersey, USA, 1999.
[24] A. Jameson, Aerodynamic design via control theory, J. Scientific Comput. 3 (1988) 233–260.
[25] A. Jameson, Optimum aerodynamic design using CFD and control theory, AIAA Paper 1995-1729, 12th AIAA Computational Fluid Dynamics Conference and Open Forum, June 1995.
[26] Y. Jin, M. Olhofer, B. Sendhoff, A framework for evolutionary optimization with approximate fitness functions, IEEE Trans. Evolut. Comput. 6 (5) (2002) 481–494.
[27] I.C. Kampolis, K.C. Giannakoglou, A multilevel approach to single- and multiobjective aerodynamic optimization, Comput. Methods Appl. Mech. Engrg. 197 (33–40) (2008) 2963–2975.
[28] I.C. Kampolis, K.C. Giannakoglou, Distributed evolutionary algorithms with hierarchical evaluation, Engrg. Optim. 41 (11) (2009) 1037–1049.
[29] I.C. Kampolis, E.I. Karangelos, K.C. Giannakoglou, Gradient-assisted radial basis function networks: theory and applications, Appl. Math. Modell. 28 (13) (2004) 197–209.
[30] I.C. Kampolis, A.S. Zymaris, V.G. Asouti, K.C. Giannakoglou, Multilevel optimization strategies based on metamodel-assisted evolutionary algorithms, for computationally expensive problems, in: 2007 Congress on Evolutionary Computation – CEC '07, Singapore, September 2007.
[31] M.K. Karakasis, K.C. Giannakoglou, On the use of metamodel-assisted, multi-objective evolutionary algorithms, Engrg. Optim. 38 (8) (2006) 941–957.
[32] M.K. Karakasis, D.G. Koubogiannis, K.C. Giannakoglou, Hierarchical distributed evolutionary algorithms in shape optimization, Int. J. Numer. Methods Fluids 53 (3) (2007) 455–469.
[33] A. Keane, P. Nair, Computational Approaches for Aerospace Design: The Pursuit of Excellence, John Wiley and Sons Ltd., 2005.
[34] J. Knowles, D. Corne, M-PAES: a memetic algorithm for multiobjective optimization, in: 2000 Congress on Evolutionary Computation – CEC '00, IEEE Press, 2000, pp. 325–332.
[35] D. Komatitsch, D. Michéa, G. Erlebacher, Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA, J. Parallel Distrib. Comput., 2009, doi: 10.1016/j.jpdc.2009.01.006.
[36] J. Kruger, R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms, ACM Trans. Graphics 22 (3) (2003) 908–916.
[37] W.R. Mark, R.S. Glanville, K. Akeley, M.J. Kilgard, Cg: a system for programming graphics hardware in a C-like language, ACM Trans. Graphics 22 (2003) 896–907.
[38] F. Muyl, L. Dumas, V. Herbert, Hybrid method for aerodynamic shape optimization in automotive industry, Comput. Fluids 33 (5–6) (2004) 849–858.
[39] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2008.
[40] NVIDIA, NVIDIA CUDA homepage, 2008.
[41] Y.S. Ong, K.Y. Lum, P.B. Nair, D.M. Shi, Z.K. Zhang, Global convergence of unconstrained and bound constrained surrogate-assisted evolutionary search in aerodynamic shape design, in: 2003 Congress on Evolutionary Computation – CEC '03, vol. 3, Canberra, Australia, 2003, pp. 1856–1863.
[42] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A.E. Lefohn, T.J. Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graphics Forum 26 (1) (2007) 80–113.
[43] D.I. Papadimitriou, K.C. Giannakoglou, A continuous adjoint method with objective function derivatives based on boundary integrals for inviscid and viscous flows, Comput. Fluids 36 (2) (2007) 325–341.
[44] D.I. Papadimitriou, K.C. Giannakoglou, Total pressure loss minimization in turbomachinery cascades using a new continuous adjoint formulation, Proc. I MECH E Part A: J. Power Energy 221 (2007) 865–872.
[45] D.I. Papadimitriou, K.C. Giannakoglou, Computation of the Hessian matrix in aerodynamic inverse design using continuous adjoint formulations, Comput. Fluids 37 (8) (2008) 1029–1039.
[46] D.I. Papadimitriou, K.C. Giannakoglou, Direct, adjoint and mixed approaches for the computation of Hessian in airfoil design problems, Int. J. Numer. Methods Fluids 56 (10) (2008) 1929–1943.
[47] A. Ratle, Optimal sampling strategies for learning a fitness model, in: 1999 Congress on Evolutionary Computation – CEC '99, vol. 3, Washington, DC, 1999, pp. 2078–2085.
[48] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (2) (1981) 357–372.
[49] P. Spalart, S. Allmaras, A one-equation turbulence model for aerodynamic flows, La Recherche Aérospatiale 1 (1994) 5–21.
[50] D. Thévenin, G. Janiga, Optimization and Computational Fluid Dynamics, Springer-Verlag, 2008, ISBN: 978-3-540-72152-9.
[51] E. Zitzler, D. Brockhoff, L. Thiele, The hypervolume indicator revisited: on the design of Pareto-compliant indicators via weighted integration, in: S. Obayashi et al. (Eds.), Conference on Evolutionary Multi-Criterion Optimization – EMO, LNCS, Springer, Berlin, 2007, pp. 862–876.
Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graphics Forum 26 (1) (2007) 80–113. [43] D.I. Papadimitriou, K.C. Giannakoglou, A continuous adjoint method with objective function derivatives based on boundary integrals for inviscid and viscous flows, Computers Fluids 36 (2) (2007) 325–341. [44] D.I. Papadimitriou, K.C. Giannakoglou, Total pressure loss minimization in turbomachinery cascades using a new continuous adjoint formulation, Proc. I MECH E Part A: J. Power Energy 221 (2007) 865–872. [45] D.I. Papadimitriou, K.C. Giannakoglou, Computation of the Hessian matrix in aerodynamic inverse design using continuous adjoint formulations, Comput. Fluids 37 (8) (2008) 1029–1039. [46] D.I. Papadimitriou, K.C. Giannakoglou, Direct, adjoint and mixed approaches for the computation of Hessian in airfoil design problems, Int. J. Numer. Methods Fluids 56 (10) (2008) 1929–1943. [47] A. Ratle, Optimal sampling strategies for learning a fitness model, in: 1999 Congress on Evolutionary Computation – CEC ’99, vol. 3, Washington, DC, 1999, pp. 2078–2085. [48] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (2) (1981) 357–372. [49] P. Spalart, S. Allmaras, A one-equation turbulence model for aerodynamic flows, La Recherche Aérospatiale 1 (1994) 5–21. [50] D. Thévenin, G. Janiga, Optimization and Computational Fluid Dynamics, Springer-Verlag, 2008, ISBN: 978-3-540-72152-9. [51] E. Zitzler, D. Brockhoff, L. Thiele, The hypervolume indicator revisited: on the design of Pareto-compliant indicators via weighted integration, in: S. Obayashi et al. (Eds.), Conference on Evolutionary Multi-Criterion Optimization – EMO, LNCS, Springer, Berlin, 2007, pp. 862–876.