Comput. Methods Appl. Mech. Engrg. 199 (2010) 712–722
CFD-based analysis and two-level aerodynamic optimization on graphics processing units
I.C. Kampolis, X.S. Trompoukis, V.G. Asouti, K.C. Giannakoglou*
National Technical University of Athens, Lab. of Thermal Turbomachines, Parallel CFD & Optimization Unit, P.O. Box 64069, Athens 15710, Greece
* Corresponding author. E-mail: [email protected] (K.C. Giannakoglou).
Article history: Received 21 May 2009; Received in revised form 24 September 2009; Accepted 3 November 2009; Available online 13 November 2009.
Keywords: Graphics processing units; Computational fluid dynamics; Aerodynamic shape optimization; Evolutionary algorithms.
Abstract
This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids, from the CPU to the graphics processing unit (GPU; NVIDIA's GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method which is suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs) though with different evaluation tools each. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction

The continuous increase in the performance of GPUs has extended their use from the gaming community to game physics calculations and, lately, to number-crunching applications in the field of computational physics [4,21,42]. Their increased flexibility and performance allow the implementation of data-parallel algorithms that efficiently exploit their peak computational capabilities, which exceed those of high-end CPUs by combining a large number of floating point operations per second (GFLOPS) with high memory bandwidth. Moreover, their mass production has made them a cheaper, viable alternative to conventional high-performance computing systems. GPUs from both major manufacturers (ATI and NVIDIA), together with a number of software programming models (Cg [37], Brook [7], Sequoia [15], CTM [1], CUDA [40], to mention a few), are available to scientists, making GPUs suitable for high-performance scientific applications. In this paper, NVIDIA's graphics cards and CUDA are used. CUDA is an extended
subset of the C language and is supported by all recent NVIDIA graphics cards. It treats the GPU as a device suitable for parallel computations, with its own random access memory (device/global memory), capable of concurrently executing a large number of threads. The latter are grouped into blocks; many blocks comprise a grid. Grids of blocks execute the code (referred to as the kernel) on the GPU, using the single instruction multiple thread (SIMT) model. Limited shared memory (due to hardware restrictions) as well as synchronization instructions are available to the threads forming the same block. A wide range of applications which benefit from the excellent parallel performance of GPUs can be found in the literature [42]. However, applications in computational fluid dynamics (CFD) are still limited. In [20], a GPU implementation of the multigrid method for the solution of boundary value heat and fluid flow problems, using the vorticity–streamfunction formulation, was presented. Kruger and Westermann [36] proposed a framework for the implementation of explicit and implicit numerical methods on graphics hardware (ATI 9800 graphics card) and showed applications related to the 2D wave equation and the incompressible Navier–Stokes equations. Bolz et al. [4] described a conjugate gradient solver for sparse matrices resulting from the use of unstructured grids
and a multigrid solver for structured grids, on NVIDIA's GeForce FX cards. Goddeke et al. [19] used many cluster nodes with NVIDIA's Quadro FX4500 PCIe cards and addressed the issue of the (at that time) limited precision of GPUs by applying a mixed precision iterative refinement technique. A detailed description of the GPU implementation of a Navier–Stokes flow solver for structured grids was presented in [22]. In [9], this was extended to 3D flows, with emphasis on realistic and accurate visualization effects for graphics applications. Implementations for compressible fluid flows were first presented in [21]. Brandvik and Pullan [5,6] focused on performance comparisons between GPU and CPU codes for the solution of the 2D and 3D Euler equations and reported considerable speed-ups, using exclusively structured grids. Applications to complex geometries, even for hypersonic flows, can be found in [11]. A first application to 3D unstructured grids for inviscid, compressible flows on an NVIDIA Tesla GPU is presented in [8]. Although [8] has certain similarities to the present paper, it is based on the cell-centered finite volume technique, whereas this paper relies on the vertex-centered one. As will be shown as the paper develops, there are important differences between the two schemes as far as porting to GPUs is concerned.

In view of the above, in the present paper a solver for the 2D steady-state Navier–Stokes equations for compressible fluids on unstructured grids with triangular elements, coupled with a one-equation turbulence model, is first presented; the programming details for porting the code to the GPU are discussed. The first part of this study focuses on both single (SPA) and double (DPA) precision arithmetic GPU implementations, in order to quantify their speed-ups compared to the CFD code running on the CPU. A mixed precision arithmetic (MPA) implementation, which uses SPA for the left-hand-side (l.h.s.) part of the discretized equations and DPA for the right-hand-side terms (r.h.s., i.e. the residuals), is also devised. Below, the three GPU implementations with SPA, MPA and DPA will be referred to as GPU_SP, GPU_MP and GPU_DP, respectively. The solver running on the CPU implements DPA; the speed-ups of all GPU-enabled variants are measured with respect to the same CPU code. The proposed GPU_MP leads to high speed-ups without damaging the prediction accuracy of GPU_DP. A 3D Euler equations GPU-enabled solver (using SPA, MPA and DPA, too) for unstructured grids with tetrahedral elements is also presented and assessed, in terms of speed-ups, against the corresponding CPU code.

The second part of this paper deals with aerodynamic shape optimization, exploiting the CFD software variants which have already been ported to GPUs. In particular, the GPU-enabled flow solvers are used as analysis tools in the context of a hierarchical EA, to efficiently solve aerodynamic shape optimization problems. In such a method, the search for the optimal solution(s) is considerably accelerated by the combined use of (a) a low cost, low fidelity flow solver (GPU_SP) which undertakes the exploration of the design space on the low level and (b) a high fidelity flow solver (GPU_MP), with higher cost, which is mainly used to refine the most promising solutions by injecting them into the high level search algorithm. The present applications include single- and multi-objective optimizations of a compressor cascade and an isolated airfoil.

A final statement should be made.
Nowadays, "personal supercomputers", i.e. clusters of CPUs and GPUs, deliver significant computing performance at a reasonable price. The present CFD software may certainly run on such clusters, after defining an appropriate number of grid subsets which exchange data for the physically coincident grid nodes residing on different subsets. However, this is beyond the scope of this paper. The reason is simple: since the GPU code is used to support an EA, in which candidate solutions can be evaluated simultaneously and independently, and due to the limited number of available GPUs, it was decided that each
CFD code runs on a single GPU. We thus minimize the communication overhead; applications are, however, limited by the memory available on a single GPU.

2. 2D/3D CFD Implementations on GPUs

This section presents the porting of the 2D and 3D Navier–Stokes equations solvers to the GPU. The basic features of the time-marching Navier–Stokes flow solver (CPU implementation) are first presented. Programming issues for the GPU implementations are then discussed, taking into consideration the specific characteristics/architecture of the available NVIDIA graphics cards. The section concludes with performance comparisons between CPU and GPU for 2D and 3D, inviscid and/or turbulent steady-state flows around an isolated airfoil, around an aircraft and in a compressor cascade.

2.1. The Navier–Stokes equations solver – implementation on CPUs

The Navier–Stokes equations for a compressible fluid are written in vector form as
\[
\frac{\partial \vec{U}}{\partial t} + \frac{\partial \vec{f}^{\,inv}_i}{\partial x_i} - \frac{\partial \vec{f}^{\,vis}_i}{\partial x_i} = 0, \tag{1}
\]

with i = 1, 2 in 2D (i = 1, 2, 3 in 3D) and x_i the Cartesian coordinates. U = [ρ, ρu, E]^T is the vector of conservative variables. The inviscid (f_i^inv) and viscous (f_i^vis) fluxes are given by
\[
\vec{f}^{\,inv}_i =
\begin{bmatrix}
\rho u_i \\[2pt]
\rho u_i \vec{u} + p\,\vec{\delta}_i \\[2pt]
u_i (E + p)
\end{bmatrix},
\qquad
\vec{f}^{\,vis}_i =
\begin{bmatrix}
0 \\[2pt]
\vec{\tau}_i \\[2pt]
u_j \tau_{ij} + q_i
\end{bmatrix}, \tag{2}
\]
where ρ is the density, u the velocity vector, τ_i = [τ_i1, τ_i2]^T (τ_i = [τ_i1, τ_i2, τ_i3]^T in 3D) the viscous stresses, δ_i = [δ_i1, δ_i2]^T (δ_i = [δ_i1, δ_i2, δ_i3]^T in 3D) the Kronecker symbols and q_i = k ∂T/∂x_i the thermal flux components. The solution of Eqs. (1), on unstructured grids with triangular (2D) or tetrahedral (3D) elements, is based on the vertex-centered finite volume technique and a second-order upwind scheme for the inviscid fluxes. On a 2D grid, Fig. 1 shows the finite volume Ω_P defined around node P; in 3D cases, finite volumes are defined in a similar way. The integration of the governing equations over Ω_P gives
\[
\frac{\Omega_P}{\Delta t_P}\,\Delta\vec{U}_P
+ \sum_{Q \in \mathrm{nei}(P)} \left(\vec{\Phi}^{\,inv}_{PQ} - \vec{\Phi}^{\,vis}_{PQ}\right)(ab) = 0, \tag{3}
\]

where Δt_P is the local pseudo-time step and Φ^inv_PQ, Φ^vis_PQ are the inviscid and viscous numerical fluxes across ab, i.e. the interface of Ω_P and Ω_Q.
Fig. 1. Vertex-centered finite volume XP (hatched area) defined around node P of an unstructured grid with triangular elements.
In the CPU implementation, the inviscid fluxes are computed using the 1D Roe approximate Riemann solver [48] applied between P and Q, i.e. in an edgewise manner for both 2D and 3D grids, as
\[
\vec{\Phi}^{\,inv}_{PQ} =
\frac{1}{2}\left(A_P\,\vec{U}^{\,L}_{PQ} + A_Q\,\vec{U}^{\,R}_{PQ}\right)
- \frac{1}{2}\,\bigl|\tilde{A}_{PQ}\bigr|\left(\vec{U}^{\,R}_{PQ} - \vec{U}^{\,L}_{PQ}\right), \tag{4}
\]

where A stands for the Jacobian matrix. For second-order accuracy, U^L_PQ and U^R_PQ are computed from U_P and U_Q through variables' extrapolation, as

\[
\vec{U}^{\,L}_{PQ} = \vec{U}_P + \frac{1}{2}\,\overrightarrow{PQ}\cdot\nabla\vec{U}_P
\quad\text{and}\quad
\vec{U}^{\,R}_{PQ} = \vec{U}_Q - \frac{1}{2}\,\overrightarrow{PQ}\cdot\nabla\vec{U}_Q. \tag{5}
\]
The viscous fluxes are expressed in terms of the gradients of the primitive variables V. The computation of ∇V_PQ along the edge PQ [3] is given by

\[
\nabla\vec{V}_{PQ} =
\frac{1}{2}\left(\nabla\vec{V}_P + \nabla\vec{V}_Q\right)
- \left[\frac{1}{2}\left(\nabla\vec{V}_P + \nabla\vec{V}_Q\right)\cdot\vec{n}
- \frac{\vec{V}_Q - \vec{V}_P}{(PQ)}\right]\vec{n}, \tag{6}
\]
where n is the normal to ab. The gradient at each grid node is, then, derived from the elements' gradients

\[
\nabla\vec{V}_T = \sum_{k=1}^{3\ \mathrm{or}\ 4} \vec{V}_k\,\nabla L_k, \tag{7}
\]
where L_k are the element shape functions. In the CPU implementation (either 2D or 3D), both the inviscid and viscous fluxes are computed by looping over the grid edges (such as PQ) and scatter-adding contributions to their end nodes. This avoids computing the same flux twice, once for each of the end nodes P and Q. The discretized Eqs. (3) are solved using the point-implicit Jacobi method. The mean flow equations are coupled with the one-equation Spalart–Allmaras turbulence model [49]. The model is solved for μ̃, which is then used to compute the turbulent viscosity μ_t. Convection and diffusion terms in the μ̃-equation are handled as in the mean flow equations.
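To make the edgewise sweep concrete, the following sketch shows the scatter-add pattern described above. It is written in C-like form for brevity (the actual CPU solver is in Fortran), and all names (edge_node, compute_roe_flux, residual) are hypothetical placeholders rather than the authors' code.

```cuda
/* Hedged sketch of the CPU edgewise residual assembly (not the authors' code). */
typedef struct { double rho, rho_u, rho_v, E; } State4;   /* 2D conservative variables */

/* One Roe flux per edge (Eq. (4)); assumed to be provided elsewhere. */
State4 compute_roe_flux(State4 UL, State4 UR, int edge);

void assemble_rhs_edgewise(int n_edges, const int (*edge_node)[2],
                           const State4 *U, State4 *residual)
{
    for (int e = 0; e < n_edges; ++e) {
        int P = edge_node[e][0];
        int Q = edge_node[e][1];
        State4 f = compute_roe_flux(U[P], U[Q], e);   /* flux computed once per edge */
        /* Scatter-add: the flux crossing the P-Q interface contributes to both
           end nodes, with opposite signs. */
        residual[P].rho   += f.rho;    residual[Q].rho   -= f.rho;
        residual[P].rho_u += f.rho_u;  residual[Q].rho_u -= f.rho_u;
        residual[P].rho_v += f.rho_v;  residual[Q].rho_v -= f.rho_v;
        residual[P].E     += f.E;      residual[Q].E     -= f.E;
    }
}
```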
2.2. The NVIDIA GPU architecture

In this paper, NVIDIA GPUs (GTX 280, GTX 285) based on the GT 200 architecture are used, in which all the computational elements implement the Unified Shader Architecture; they all have the same computational capabilities, i.e. they can read from textures and linear memory and perform arithmetic operations. The GTX 285 board consists of 10 texture processor clusters (TPCs). Each TPC includes 3 streaming multiprocessors (SMs), control circuits and a texture cache block. SMs are the key compute elements of the GT 200 architecture and consist of 8 streaming processors (SPs) along with two special function units (SFUs), an instruction unit and 16 KB of local shared memory (Fig. 2). SPs are fully pipelined and may deliver three FLOPs per clock. SFUs are mainly used for transcendental operations and interpolations and may contribute another FLOP per clock. On this basis, the theoretical peak performance of each GTX 285 board is estimated at 1063 GFLOPS. Each SM executes warps of threads (each warp consists of 32 threads) using the SIMT model, so all the SPs of an SM execute the same instruction on different parts of the data to be processed. An SM may handle up to 1024 threads, adding up to as many as 30,720 concurrent threads on each GTX 285 board. All 240 SPs are used exclusively for single precision operations. Double precision operations are carried out in a separate unit inside each SM; thirty such units reside on each GTX 285 board. These are far fewer than the 240 SPs available for SPA operations, since DPA is of limited use in graphics. All threads of the same block (with a number of blocks comprising a grid, Fig. 2b) are able to cooperate using the SM shared memory. Furthermore, a synchronization instruction is available. This brings up a straightforward hardware requirement, restricting each block to be processed by a single SM. The total number of blocks that can be assigned to a single SM depends upon the shared memory and register requirements as well as the maximum number of threads. Scheduling occurs within each SM, according to the number of active warps. Dynamic scheduling is used; for instance, while a warp is waiting for a memory transaction to be completed, the scheduler initiates the execution of another (also active) warp, so as to maximize the utilization of resources. The interchange between warps has zero (wall clock time) cost.
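As a rough check of the peak-performance figure quoted above, assume the nominal GTX 285 shader clock of roughly 1.476 GHz (a published board specification, not stated in the paper). The single precision peak then follows from the 3 FLOPs delivered by each SP per clock,

\[
240\ \text{SPs} \times 3\ \frac{\text{FLOPs}}{\text{SP}\cdot\text{clock}} \times 1.476\ \text{GHz} \approx 1063\ \text{GFLOPS},
\]

which reproduces the quoted value from the SP contribution alone; the SFUs would add further operations on top of this count.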
Fig. 2. (a) Overview of a streaming multiprocessor of the GT 200 chip and (b) threads’ grouping into blocks and grids of blocks.
Fig. 3. Flowchart of the GPU-enabled flow solver for both 2D and 3D flows. The turbulence model (T.M.) is numerically decoupled from the mean flow equations.
2.3. The GPU implementation – programming issues

Efficiently porting the Navier–Stokes flow solver to the GPU requires a different programming approach, due to shared memory conflicts and the lack of a cache for the global memory on the GPU, and in order to achieve the highest possible memory bandwidth. In the GPU implementation, the CPU controls a sequence of data processing tasks to be carried out on the GPU, as shown in Fig. 3. Input/output operations (for grid coordinates, connectivities and boundary conditions) as well as the preparation of the data structures that handle the unstructured grid topology are all carried out on the CPU. Then, all the so-processed data are copied to the device memory (GPU global memory). The first task carried out, in parallel, on the GPU is the initialization of the flow variables. The discretization and iterative solution (time-marching) of the flow equations, which correspond to the computationally demanding part of the code, follow. The discretization comprises the computation of the inviscid and viscous fluxes for the mean flow and turbulence model equations (r.h.s.) and that of the corresponding l.h.s. matrix coefficients. The resulting system of linearized equations is then iteratively solved (Jacobi iterations) using kernel calls driven by the CPU. During the iterative part of the algorithm, "synchronization" between CPU and GPU occurs only during convergence checks, when the residuals' norms are copied to the CPU. Upon convergence, all flow quantities are transferred back to the CPU. A number of implementation details follow.

In the CPU implementation, and for the reasons exposed in Section 2.1, the computation of both the inviscid and viscous fluxes (Eqs. (4) and (6)) is based on edgewise sweeps. On the GPU, however, this is not possible, due to the shared memory conflicts arising when independent, concurrently executed threads try to accumulate fluxes to coincident end nodes. Thus, a nodewise scheme is applied instead. Each thread undertakes the formation of the l.h.s. and r.h.s. equation terms for a single node, by looping over all edges emanating from it. Even if the fluxes associated with all edges are computed twice, the
proposed scheme proved to outperform an edge-coloring-based computation, which was also tested. Coloring (an alternative way to overcome shared memory access conflicts) creates groups of edges which, when simultaneously swept, scatter contributions to distinct nodes. For instance, a coloring scheme was used in [35] to port a finite element earthquake modeling software to NVIDIA's graphics cards. Here, however, the standard edge coloring method proved less efficient than the use of nodewise sweeps (including local renumbering) and was, thus, abandoned; this comparison is not included in the paper. As also explained in [8], the redundant computations are harmless, since they can be overlapped with global memory accesses.

It is evident that global memory access significantly affects the performance of the GPU implementation. According to [39], better efficiency can be achieved if the concurrent memory accesses of all threads composing a half-warp can be coalesced into a single memory transaction. However, coalesced memory access is not possible for unstructured grids (as it is for structured ones), due to the random numbering of grid nodes, edges and elements, as well as the variable number of edges emanating from each node.¹ For instance, the kernel undertaking the computation of fluxes loops over the edges emanating from each node. In unstructured grids, this is a highly non-coalesced access, due to the lack of a constant memory access pattern; the latter strongly depends upon the grid connectivity. As a remedy, an efficient renumbering scheme is proposed and used. The idea is to ensure that edges which are accessed simultaneously reside nearby in the global memory. The edge connectivity is sorted locally: the first edge of each node is placed on top of the list, followed by the second, third and so forth. This allows an approximately 70% performance increase, especially in 3D applications.

¹ In the CFD method described in this paper, there is an additional "difficulty" compared, for instance, to [8], which also handles 3D unstructured grids. In [8], a cell-centered finite volume scheme is used, so each kernel associated with a tetrahedral element (and the corresponding node) loops over a fixed number (four, if boundary elements are left aside) of faces and, thus, a fixed number of neighboring nodes. Unfortunately, this is not the case in this paper. In a vertex-centered finite volume scheme, the number of nodes linked to each grid node is not fixed, causing extra overhead during global memory access.

Moreover, scattered data that are not iteratively updated (i.e.
geometric quantities required to compute the local time step) can be accumulated in arrays prior to iterating, at the cost of extra memory.

The performance of the GPU implementation was significantly improved by using texture memory. This is a cached, read-only memory, accessible by all threads. It is known that multiple fetches of the same texture memory address cost as much as a single access to the global memory. However, the limited size of the texture memory requires a careful selection of the arrays to be textured. Best efficiency occurs if randomly accessed quantities (due to the unstructured topology), frequently used during the iterative process but updated only once per iteration, are textured. For instance, this is the case for the primitive flow variables and their gradients, which contribute to the computation of the inviscid and viscous fluxes (Eqs. (5) and (6)). The primitive flow variables (vector V) are derived from the conservative ones (vector U) and stored at the beginning of each pseudo-time iteration. So, in the kernels undertaking the computation of the inviscid and viscous fluxes, both the primitive flow variables and their gradients are accessed from the texture memory, thus decreasing the wall clock time per iteration.

The code performance was further improved by minimizing the global memory accesses within a thread. For instance, the computation of the l.h.s. and r.h.s. terms of the discretized equations at any node is based on a loop that collects contributions from all the edges emanating from it. In such a scheme, global memory access is minimized via increased shared memory usage and by updating quantities stored in the global memory only upon completion of all computations.
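A minimal CUDA sketch of the nodewise scheme and single-write strategy just described is given below: one thread per node loops over the edges emanating from it (stored contiguously after the local renumbering), accumulates the r.h.s. in registers and performs a single global memory write at the end. All array names and the flux routine are hypothetical placeholders, not the authors' kernel.

```cuda
// Hedged sketch (not the authors' kernel): nodewise flux accumulation, SPA version.
__device__ float4 inviscid_flux(float4 Up, float4 Uq);   // assumed, defined elsewhere

__global__ void accumulate_rhs(int n_nodes,
                               const int *edge_start,      // per-node offsets into the sorted edge list
                               const int *edge_neighbor,   // neighbouring node of each edge
                               const float4 *U,            // conservative variables (could be textured)
                               float4 *rhs)
{
    int P = blockIdx.x * blockDim.x + threadIdx.x;
    if (P >= n_nodes) return;

    float4 Up  = U[P];                                   // one 128-bit load
    float4 acc = make_float4(0.0f, 0.0f, 0.0f, 0.0f);    // accumulate in registers

    for (int e = edge_start[P]; e < edge_start[P + 1]; ++e) {
        int Q    = edge_neighbor[e];                     // grid-dependent (non-coalesced) access
        float4 f = inviscid_flux(Up, U[Q]);              // also recomputed by the thread of node Q
        acc.x += f.x;  acc.y += f.y;  acc.z += f.z;  acc.w += f.w;
    }
    rhs[P] = acc;                                        // single global write per node
}
```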
In addition, significant extra gain in speed-up was achieved thanks to the vector types defined in CUDA and, especially for single precision arithmetic, by taking advantage of the float4 variable type. GPUs are capable of copying 128-bit words from the device memory to the registers (and vice versa) in a single memory transaction. So, loading a float4 variable from the device memory to the registers is faster than loading four float variables using four memory transactions. For instance, with SPA, 2D compressible fluid flow problems, which involve four mean flow equations, are perfectly suited to the float4 variable type. To account for extra variables, such as an extra mean flow variable in 3D, turbulence quantities or even the use of DPA, properly aligned user-defined data types are used to ensure the minimum number of memory transactions. For instance, structures composed of five double precision variables (as required by the GPU_DP code used for the prediction of 3D flows, or of 2D turbulent flows with a one-equation model) are compiled into three 128-bit transactions instead of five.

Finally, extra gain in runtime, less significant however compared to the preceding actions, resulted from the use of the constant memory space and from loop unrolling. Similarly to the texture memory, constant memory is read-only, cached and accessible by all threads. Its use was restricted to the gas and turbulence model constants, since it performs best when every thread of a warp accesses the same memory address, the cost of reading from the constant memory scaling linearly with the number of different addresses requested. On the other hand, unrolling loops that access the global memory proved to be beneficial.
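The sketch below illustrates the data types discussed above: a float4 maps a 2D SPA node state onto a single 128-bit transaction, while a five-double structure aligned to 16-byte boundaries occupies 48 bytes and can therefore be moved with three 128-bit transactions. The particular member names and padding are illustrative assumptions, not the authors' actual layout.

```cuda
// Hedged sketch of aligned user-defined types (not the authors' definitions).
// A float4 (2D SPA node state) is read/written with one 128-bit transaction.
struct __align__(16) NodeStateDP5 {
    double rho, rho_u, rho_v, rho_w, E;   // e.g. 3D mean flow variables in DPA
    double pad;                           // pad to 48 bytes = three 16-byte words
};

__global__ void copy_states(const NodeStateDP5 *in, NodeStateDP5 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // the compiler may emit wide (vector) loads/stores
}
```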
2.4. Performance comparisons between CPU and GPU implementations

The performance of the GPU-enabled flow solver implementations, both 2D and 3D (GPU_SP, GPU_MP, GPU_DP), is assessed on inviscid and turbulent flows developed around isolated airfoils and aircraft, or in compressor cascades. A basic feature of our GPU-enabled implementations GPU_DP and GPU_MP is that they reproduce the results of the corresponding CPU code (i.e. identical convergence and results). GPU_MP runs faster than GPU_DP, as will be shown below. All computations were carried out on an NVIDIA GeForce GTX 285 graphics card and an Intel Core 2 Duo 2.8 GHz CPU (3 GB RAM). The 2D and 3D CPU codes were written in Fortran and compiled using the GNU Fortran compiler version 4, with full optimization options.

Fig. 4. 2D flow solver. Convergence of the continuity equation residual obtained by the GPU_DP, GPU_SP and GPU_MP implementations.
Fig. 5. 2D flow solver. Speed-up of the GPU_SP, GPU_MP and GPU_DP implementations (in all subsequent figures, speed-ups refer to how much faster the GPU code is than the CPU code) as a function of the number of grid nodes, for the (a) inviscid and (b) turbulent flow around an isolated airfoil.

2.4.1. 2D flow studies

For the isolated airfoil turbulent flow studies, the NACA4415 airfoil was used, with M_∞ = 0.5, α_∞ = 2° and Re_c = 9×10^5. For the
cascade flow studies, the optimal blade airfoil designed in Section 4.2 was used, with M_2,is = 0.3, α_1 = 44° and Re_c = 9×10^5. The convergence of the residual of the continuity equation using the three GPU implementations is shown in Fig. 4; the SPA operations of GPU_SP (for both the l.h.s. and the r.h.s.) lower the accuracy in computing the residuals, while the convergence plots of GPU_DP and GPU_MP are identical. Speed-ups are shown in Figs. 5 and 6. As the grid size increases, speed-ups of up to 28 for GPU_SP, 25.2 for GPU_MP and 19.6 for GPU_DP can be observed. The much higher speed-up of the GPU_SP implementation is due to (a) the use of four-component, single precision vector type (float4) variables instead of the two-component double precision (double2) ones of GPU_DP and (b) the lower cost of SPA operations. For instance, in the GPU_SP implementation, the conservative variables associated with a single node are represented by a single float4 variable, instead of two double2 variables; this results in half the memory instructions compared to the GPU_DP implementation. Recall that CUDA does not yet support double2 fetches from the texture memory and that fetching a double variable requires tedious programming; on the other hand, fetching float4 variables is straightforward. The GPU_MP implementation combines the fast memory instructions for the l.h.s. terms (using float4 variables) with the DPA of the r.h.s. ones, yielding speed-ups between those reported for GPU_SP and GPU_DP.

It should be stated that the speed-ups mentioned above are significantly improved compared to those presented in the first submission of this paper, where NVIDIA GeForce GTX 280 cards were used. A comparison of the speed-ups achieved on the previous (GTX 280) and the new (GTX 285) card is shown in Table 1. Porting the code to the GTX 285 did not require further modifications or programming, so it was straightforward to repeat the runs. The main difference between the two GPUs is a frequency increase (7.6%, 13.9% and 12.2% on the GPU, shader and memory clocks, respectively) which significantly affects the memory bandwidth (159.0 GB/s instead of 140.7 GB/s), the texel rate (51,840 MT/s instead of 48,160 MT/s) and the peak computational capabilities (1063 GFLOPS instead of 933 GFLOPS). The increase in hardware performance corresponds to an approximately 25% increase in speed-up between the two cards. For the sake of completeness, it should also be stated that the GTX 285 has lower power consumption and, thus, thermal emissions, which allowed the increase in frequency.

The difference in the prediction accuracy of the three GPU implementations is presented in Fig. 7, which compares the pressure distribution along the solid walls of an isolated airfoil for both inviscid and turbulent flows. GPU_DP and GPU_MP produce identical solutions whereas, as expected, GPU_SP leads to slightly different results. The corresponding pressure distributions obtained by the solver running on the CPU are not presented, since they are identical to those of the GPU-enabled solvers.
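The "tedious programming" mentioned above for routing double precision data through the texture path usually refers to the standard workaround of binding the array as int2 and reassembling each double inside the kernel; a hedged sketch of this workaround (not necessarily the authors' exact implementation) is shown below.

```cuda
// Hedged sketch: fetching double precision data through the texture cache by
// aliasing it as int2, since double/double2 texture fetches are not supported.
texture<int2, 1, cudaReadModeElementType> tex_grad;      // bound to an array of doubles

__device__ double fetch_double(int i)
{
    int2 v = tex1Dfetch(tex_grad, i);
    return __hiloint2double(v.y, v.x);                   // reassemble the hi/lo 32-bit halves
}

// For SPA data, a float4 texture needs no such trick: one fetch returns a full node state.
texture<float4, 1, cudaReadModeElementType> tex_U;

__device__ float4 fetch_state(int i) { return tex1Dfetch(tex_U, i); }
```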
Fig. 6. 2D flow solver. Speed-up of the GPU_SP, GPU_MP and GPU_DP implementations as a function of the number of grid nodes, for the turbulent flow in a compressor cascade.
Table 1
CFD prediction of 2D turbulent flows on unstructured grids. Comparison of speed-ups achieved using NVIDIA's GTX 280 and GTX 285 for the GPU_SP, GPU_MP and GPU_DP implementations.

           GTX 280    GTX 285
GPU_SP     21         28
GPU_MP     20.7       25.2
GPU_DP     14.8       19.6
Fig. 7. 2D flow solver. Comparison of the pressure coefficient distribution computed using GPU_SP, GPU_MP and GPU_DP for (a) inviscid and (b) turbulent flow around an isolated airfoil. The corresponding CPU code results are not presented since these are identical to those of the GPU_MP or GPU_DP implementations.

2.4.2. 3D flow studies

The performance of the 3D GPU implementation is assessed on the inviscid flows developed around a civil aircraft, with M_∞ = 0.7 and α_∞ = 0°, and within a supersonic compressor cascade [10], with M_1 = 1.6 and α_1 = 60°. The Mach number contours of the numerically predicted flow fields in the two examined cases are shown in Figs. 8 and 9, respectively. The performance of the GPU-enabled 3D flow solver compared to the CPU one, for a variety of grids (generated for this purpose), is presented in Fig. 10. As shown, speed-ups of up to 27.5 for GPU_SP, 23 for GPU_MP and 19.6 for GPU_DP were measured. Furthermore, in Fig. 11 the effect of the "coalesced" access of the fluxes after the appropriate re-arrangement and sorting (described in Section 2.3) is quantified for the GPU_DP implementation. It is obvious that such a treatment significantly benefits the 3D flow solver (where the number of edges emanating from a node is not known a priori and may vary significantly), increasing the speed-ups by up to 70%. Note that we restricted ourselves to grids of up to 250,000 nodes, since the GTX 285 card cannot support larger grid sizes due to memory limitations. We recall that a single GPU is assigned to each flow analysis, in view of our need to parallelize the evaluations of many candidate solutions on the available GPUs during the EA-based optimization.

Fig. 8. Flow around a civil aircraft: Mach number contours.

Fig. 9. Flow in a supersonic compressor cascade: Mach number contours.

Fig. 10. 3D flow solver. Speed-up of the GPU_SP, GPU_MP and GPU_DP implementations as a function of the number of grid nodes, for the inviscid flows (a) around an aircraft and (b) within a 3D supersonic compressor cascade.

Fig. 11. 3D flow solver. Speed-up of the GPU_DP implementation, with and without the re-arrangement of the flux access, as a function of the number of grid nodes.

3. Aerodynamic optimization on GPUs

Having developed efficient GPU implementations of the CFD tool, the second part of this paper is concerned with their use in the context of an EA-based optimization. The two-level optimization algorithm presented below fully exploits the availability of the fast, low fidelity CFD tool (GPU_SP) and of the more expensive, high fidelity one (GPU_MP). Though the developed GPU-enabled solvers may cope with both 2D and 3D flows, the cases examined below include only airfoil designs. In the authors' opinion, this suffices to demonstrate the capabilities of the method, since any extension to 3D shape optimization would be limited only by the global memory size of the GPUs and/or the number of available GPUs (in case more than one GPU is to be used for the same CFD analysis). It is well known that next generation GPUs are expected to achieve much more than one teraflop of performance, with much higher memory capacity.
3.1. Aerodynamic optimization – generalities

In aerodynamic shape optimization, the search tools are either gradient-based methods or global search metaheuristics [33,50,13]. EAs are gradient-free methods that may accommodate any ready-to-use analysis software, even without access to its source code. For EAs to become routine industrial tools, much focus has been placed on methods reducing the number of evaluations they need to reach the optimal solution(s). To this end, a number of papers employing surrogate evaluation models [29,26,14,16,31,12,47,41], hierarchical or multilevel schemes [32,30,28,27] and/or hybridization with gradient-based methods [27,38,34] have appeared in the literature. Gradient-based methods make use of adjoint formulations [24,25,2,18,43,44] to compute the gradient vector (and/or the Hessian matrix [46,45]) of the objective function with respect to the design variables, which is then used to iteratively compute the optimal solution. The main advantage of the adjoint methods is the low extra cost for the computation of the sensitivity derivatives, which is equal to that of a flow solution, irrespective of the number of design variables.

3.2. Two-level EA-based optimization on GPUs

CFD codes ported to GPUs are of great value not only as stand-alone flow analysis tools but, also, as a means to support any aerodynamic shape optimization software. Since the latter requires many CFD-based evaluations of candidate solutions, the use of a CFD code implemented on a GPU, rather than on a CPU, may considerably accelerate an optimization project. The difference in speed-up between the GPU_MP and GPU_SP implementations inspired the hierarchical optimization algorithm presented below. The idea is to associate GPU_SP and GPU_MP with its two levels. To further reduce the overall cost, surrogate evaluation models (metamodels) are also used. Thus, in total, two problem-specific tools (GPU_SP and GPU_MP) and one metamodel are employed. For more details on the optimization platform itself (irrespective of the use of GPUs), the reader should turn to [27,30].

The two-level algorithm consists of two semi-autonomously running optimization levels, associated with different evaluation software. The main idea is to use a cheap, low fidelity evaluation code (such as GPU_SP) on the low level to roughly locate the optimal solutions and, then, refine them using a higher fidelity code (GPU_MP) on the high level. On each level, a metamodel-assisted evolutionary algorithm (MAEA) is used to search the design space, supported by metamodels trained on-the-fly. According to the inexact pre-evaluation (IPE) technique [17], the MAEA operates as a conventional EA during the first generations, archiving all
evaluated individuals, among which data for training the metamodels will be selected during the subsequent generations. Once enough entries have been archived in the database (the exact number is user-defined), the IPE filter is activated. According to this, properly trained metamodels (Radial Basis Function (RBF) networks [23]) are used to approximate the objective value(s) of each and every offspring. Only a small percentage of the population, with the best (according to the metamodel) responses, undergoes exact evaluation on GPU_SP or GPU_MP during each generation. Note that a different database exists on each level, since the evaluation is carried out using a different GPU-enabled code. The CPU cost of the search algorithm (including the metamodels) is negligible and, therefore, it runs on the CPU. The two levels communicate regularly by exchanging their current best solutions. To control the interlevel migrations, the generation in which the first communication takes place and the communication frequency (expressed in terms of generations) are user-defined. Whenever a level reaches a migration generation, it suspends execution and waits for its adjacent one. During the interlevel migration, a (user-defined) percentage of the immigrants is re-evaluated on the evaluation software of the destination level. After each migration, execution is resumed on both levels.

4. Applications

In this section, single- and multi-objective aerodynamic shape optimization problems are solved using the two-level EA described in the previous sections and the GPU_SP and GPU_MP implementations of the flow solver. In both cases, a significant reduction in the wall clock time of the optimization was observed.

4.1. Optimization of a 2D isolated airfoil

The first case is concerned with the two-objective optimization of an isolated airfoil for minimum drag (C_D) and maximum lift (C_L) coefficients. The flow conditions are: M_∞ = 0.3, α_∞ = 4° and Re_c = 5×10^6. The airfoil was parameterized using two Bézier curves, one per airfoil side, with 9 control points each. The control points were allowed to vary in both the chordwise and normal-to-chord directions. The leading and trailing edge control points were fixed. The control points next to them were free to move only on condition that C¹ continuity at the leading and trailing edges is ensured. This gave rise to 24 design variables in total. Three constraints on the airfoil thickness t, at given chordwise positions, were imposed,
\[
t(0.25c) \ge 0.07c, \qquad t(0.60c) \ge 0.05c, \qquad t(0.85c) \ge 0.02c, \tag{8}
\]
where c is the chord length. The high level of the two-level algorithm employed a (10,20) EA (i.e. with μ = 10 parents and λ = 20 offspring) and the GPU_MP implementation for the evaluation of candidate solutions. The low level employed a (20,50) EA, based on GPU_SP. Both levels were assisted by RBF networks, acting as metamodels for the IPE of the population; they operated as conventional EAs until a pool of 200 exactly evaluated individuals was made available on each level. Then, the IPE filter was activated, according to which 2 individuals on the high and 5 on the low level were exactly evaluated (on GPU_MP and GPU_SP, respectively) in each generation. The high level EA started after the 10th generation of the low level EA. The initial population of the former incorporated the front of non-dominated solutions computed by the latter. The migration step was equal to 10 and 5 generations for the low and high level EAs, respectively. Up to 15 top individuals were broadcast from the low to the high level EA during each migration. After re-evaluation on GPU_MP, immigrants outperforming the high level offspring were allowed to displace them. At the same time, the high level broadcast up to 6 top individuals to the low level EA.
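For illustration only, the host-side sketch below (plain C++, compilable as CUDA host code) mirrors the two-level drive loop configured above: the population sizes, the start of the high level after the 10th low-level generation, and the 10/5-generation migration steps. The RBF metamodels and the IPE filter are omitted, and the GPU_SP/GPU_MP evaluations are replaced by hypothetical stand-in functions, so this is a structural sketch of the migration logic rather than the authors' optimization platform.

```cuda
// Hedged structural sketch of a two-level (mu,lambda) EA with migrations.
// cheap_eval()/accurate_eval() stand in for GPU_SP/GPU_MP CFD runs (assumptions).
#include <algorithm>
#include <random>
#include <vector>

using Individual = std::vector<double>;                  // design variables (e.g. 24)
static std::mt19937 rng(42);

double accurate_eval(const Individual& x) {              // stand-in for a GPU_MP evaluation
    double s = 0.0; for (double v : x) s += v * v; return s;
}
double cheap_eval(const Individual& x) {                 // stand-in for a GPU_SP evaluation
    std::normal_distribution<double> noise(0.0, 1e-3);
    return accurate_eval(x) + noise(rng);                // lower fidelity: noisy estimate
}

struct Level {
    int mu, lambda;
    double (*evaluate)(const Individual&);
    std::vector<Individual> parents;                     // kept sorted, best first
    std::vector<double> fitness;
};

Individual mutate(const Individual& p) {
    std::normal_distribution<double> d(0.0, 0.05);
    Individual c = p; for (double& v : c) v += d(rng); return c;
}

void one_generation(Level& L) {
    std::vector<Individual> off(L.lambda);
    std::vector<double> f(L.lambda);
    std::uniform_int_distribution<int> pick(0, L.mu - 1);
    for (int i = 0; i < L.lambda; ++i) {                 // (mu,lambda) reproduction
        off[i] = mutate(L.parents[pick(rng)]);
        f[i]   = L.evaluate(off[i]);                     // exact evaluation on this level's CFD code
    }
    std::vector<int> idx(L.lambda);
    for (int i = 0; i < L.lambda; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](int a, int b) { return f[a] < f[b]; });
    L.parents.clear(); L.fitness.clear();
    for (int i = 0; i < L.mu; ++i) { L.parents.push_back(off[idx[i]]); L.fitness.push_back(f[idx[i]]); }
}

int main() {
    const int n_vars = 24, generations = 60;
    std::uniform_real_distribution<double> init(-1.0, 1.0);
    auto random_ind = [&] { Individual x(n_vars); for (double& v : x) v = init(rng); return x; };

    Level low  { 20, 50, cheap_eval,    {}, {} };        // low level, GPU_SP-like evaluator
    Level high { 10, 20, accurate_eval, {}, {} };        // high level, GPU_MP-like evaluator
    for (int i = 0; i < low.mu;  ++i) low.parents.push_back(random_ind());
    for (int i = 0; i < high.mu; ++i) high.parents.push_back(random_ind());

    for (int g = 1; g <= generations; ++g) {
        one_generation(low);
        if (g == 10)                                     // high level starts from the current best low-level designs
            for (int i = 0; i < high.mu; ++i) high.parents[i] = low.parents[i];
        if (g > 10) one_generation(high);
        if (g > 10 && g % 10 == 0) {                     // migration low -> high, re-evaluated on the high fidelity tool
            Individual cand = low.parents.front();
            if (accurate_eval(cand) < high.fitness.back()) high.parents.back() = cand;
        }
        if (g > 10 && g % 5 == 0)                        // migration high -> low
            low.parents.back() = high.parents.front();
    }
    return 0;
}
```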
The evolution of the hypervolume indicator, which quantifies the part of the non-dimensional objective space (up to a user-defined "nadir" point) dominated by the front [51], is illustrated in Fig. 12a. For the purpose of comparison, the same quantity computed during a single-level MAEA run (using the same evolution parameters as on the high level and GPU_MP as the evaluation tool) is also shown. The two-level EA is able to capture a better front of non-dominated individuals at the same cost; this is why the hypervolume indicator of the two-level method is constantly higher. The final front of non-dominated solutions, computed by the two-level algorithm, is shown in Fig. 12b. Three airfoils selected from this front, along with the corresponding C_L and C_D values, are shown in Figs. 13a-c. On average, the cost ratio of an evaluation on GPU_MP to one on GPU_SP is 49:20 (on a computational grid of 35,000 nodes). So, a run of GPU_MP corresponds to one time unit (abscissa of Fig. 12a) and a run on GPU_SP to 20/49 time units. During the two-level optimization, 336 high and 1410 low level evaluations were carried out, represented by the continuous line plotted in Fig. 12a. At approximately the same cost, the single-level MAEA based on GPU_MP produced a front of non-dominated solutions with an about 10% worse hypervolume indicator value. From the same figure, it can be seen that the final solution of the single-level MAEA could be captured in approximately 500 time units by the two-level algorithm (i.e. with a 45% economy in time). Note that, if the single-level MAEA employed the CPU rather than the GPU, the speed-up would be 22.5 (for the given computational grid). Hence, the two-level GPU-based method can produce the same quality of "optimal" solutions as the single-level, CPU-based MAEA, with a speed-up of approximately 22.5/0.45 ≈ 50.

Fig. 12. Optimization of a 2D isolated airfoil: (a) Evolution of the hypervolume indicator of the two-level algorithm and a single-level MAEA. (b) The front of non-dominated solutions computed by the two-level algorithm.

Fig. 13. Optimization of a 2D isolated airfoil: Three airfoil profiles selected from the computed front of non-dominated solutions (Fig. 12).

4.2. Optimization of a 2D compressor cascade airfoil

This case is concerned with the design of a 2D compressor cascade for minimum total pressure loss coefficient, ω = (p_t1 − p_t2)/(p_t1 − p_1), where p_t and p are the total and static pressures, respectively, and indices 1 and 2 indicate the inlet to and the outlet from the cascade. The flow conditions are: M_2,is = 0.3, α_1 = 44° and Re_c = 9×10^5. The stagger angle was 25° and the pitch-to-chord ratio s/c = 0.6. The cascade airfoil contour was parameterized using Bézier curves with 18 control points (9 on each side), resulting in 24 design variables in total. To prevent the formation of unacceptably thin airfoils, three constraints on the thickness, namely

\[
t(0.30c) \ge 0.075c, \qquad t(0.60c) \ge 0.065c, \qquad t(0.90c) \ge 0.018c, \tag{9}
\]

were imposed. A minimum flow turning constraint was also imposed (α_1 − α_2 ≥ 25°). This study was based on the same two-level EA and the same GPU_MP and GPU_SP implementations. A (10,20) MAEA was employed on the high level; the low level used a (15,60) MAEA. The IPE technique was activated after 300 and 200 exact evaluations on the high and low level, respectively. Once the metamodels had been activated, 2 and 5 individuals per generation, identified as top ones by the metamodel, were exactly evaluated on the corresponding level's CFD code. As in the previous case, at the end of the 10th generation of the low level EA, the high level EA was initiated by injecting the current top individuals of the low level EA into its starting population. As part of the interlevel migration process, the two levels exchanged their best solutions every 10 (low level) and 5 (high level) generations. The stopping criterion was arbitrarily set to 800 time units, i.e. equivalent GPU_MP evaluations; these correspond to 472 and 695 evaluations on GPU_MP and GPU_SP, respectively.

Fig. 14a illustrates the convergence history of the hierarchical algorithm (total pressure loss coefficient with respect to time units). The convergence history of a single-level optimization (with evaluations on GPU_MP) is also shown. At the same computational cost, the hierarchical algorithm was able to reach a much better solution, with ω = 0.0135, instead of 0.0141 (the ω value of the optimal solution computed by the single-level method at the same computational cost). Note that all solutions provided by the two-level algorithm were better than those computed by the single-level one. The solutions provided by the two-level algorithm have necessarily been evaluated by the high level evaluation tool; thus, the two-level algorithm starts producing results after the first 130 time units, since the high level started after the 10th generation of the low level. During this period, only the low level was allowed to evolve, for 10 generations, yielding a cost of 110 time units; the extra 20 time units correspond to the cost of evaluating the first high level generation. On average, the cost ratio of an evaluation on GPU_MP to one on GPU_SP is 159:92 (on grids of 61,000 nodes). Finally, Fig. 15a compares the ω values computed by GPU_MP and GPU_SP for 6 designs communicated between the two levels. Note that, in this case, the computational grid was very stretched close to the solid walls, to maintain a y+ value less than 1, and this affects all SPA operations. Though GPU_SP is less accurate than GPU_MP, it was able to support the EA acting on the low level, as shown in Fig. 14a. The comparison of the pressure distribution around the optimal airfoil, computed using GPU_SP and GPU_MP, is shown in Fig. 15b.

Fig. 14. Optimization of a 2D compressor cascade: (a) Evolution of the total pressure loss coefficient in the course of the optimization; comparison of the two- and single-level MAEA. (b) Optimal cascade airfoil computed by the two-level algorithm.

Fig. 15. Optimization of a 2D compressor cascade: (a) Comparison of the ω values of 6 candidate solutions communicated between the two levels; the differences between the ω values predicted by the two CFD codes running on the GPU can be seen. (b) Comparison of the pressure coefficient distribution along the optimal airfoil, computed using GPU_MP and GPU_SP.
5. Conclusions

The purpose of this paper was twofold: (a) to present the GPU implementation of a 2D/3D Navier–Stokes/Euler equations solver for unstructured grids, based on a vertex-centered finite volume scheme and (b) given the two variants of this code with different prediction accuracy and computing cost, to demonstrate a
two-level EA-based optimization algorithm, for problems with one or more objectives as well as a number of constraints. NVIDIA's GeForce GTX 285 graphics cards and the CUDA programming language were used. With respect to the flow solver, complete restructuring was necessary to maximize the parallel speed-up. The GTX 285 allows single and double precision arithmetic, which are optimally combined in the GPU_MP implementation. For maximum speed-up of the GPU-enabled flow solver (compared to the CPU code), it was necessary to:

- Switch from edgewise to nodewise sweeps for both the inviscid and viscous terms.
- Use coalesced memory access and renumbering schemes to overcome the management of scattered data due to the unstructured grid topology and connectivity, particularly for vertex-centered schemes. The vertex-centered scheme (compared to a similar work that recently appeared in the literature [8], which was based on a cell-centered discretization) introduces extra difficulties and calls for careful treatment, mainly due to the non-fixed number of edges emanating from each node.
- Use texture memory for a limited number of selected flow variables.
- Minimize the global memory accesses within each thread.
- Use the vector types defined in CUDA (for instance, float4) and properly aligned user-defined data types.
- Use the constant memory space and loop unrolling.
Using all these techniques, speed-ups of about 28 for GPU_SP, 25.2 for GPU_MP (with SPA for the l.h.s. and DPA for the r.h.s.) and 19.6 for GPU_DP were reached, asymptotically as the grid size increases. These values are valid for Euler and Navier–Stokes computations on 2D grids. For the 3D Euler equations, speed-ups of about 27.5 for GPU_SP, 23 for GPU_MP and 19.6 for GPU_DP were measured.

In the second part of the paper, a two-level optimization method that exploits these GPU variants was demonstrated. The proposed two-level scheme is based on EAs, surrogate evaluation models (RBF networks) and two GPU-enabled code variants (GPU_SP as the low fidelity and GPU_MP as the high fidelity evaluation tool), used in a hierarchical manner. The examined aerodynamic optimization problems demonstrate that the new algorithm clearly outperforms conventional EAs (based on the most accurate GPU implementation of the flow solver) in terms of computational cost.

Acknowledgment

The second author was supported by a scholarship from the Public Benefit Foundation Alexandros S. Onassis.

References

[1] AMD, AMD ATI Close To the Metal (CTM) guide, 2008.
[2] W.K. Anderson, V. Venkatakrishnan, Aerodynamic design optimization on unstructured grids with a continuous adjoint formulation, Comput. Fluids 28 (4–5) (1999) 443–480.
[3] T.J. Barth, D. Jespersen, The design and application of upwind schemes on unstructured meshes, AIAA Paper 1989-0366, 27th Aerospace Sciences Meeting, January 1989.
[4] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Trans. Graphics 22 (3) (2003) 917–924.
[5] T. Brandvik, G. Pullan, Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware, Proc. Inst. Mech. Engineers, Pt C: J. Mech. Engrg. Sci. 221 (12) (2007) 1745–1748.
[6] T. Brandvik, G. Pullan, Acceleration of a 3D Euler solver using commodity graphics hardware, AIAA Paper 2008-607, 46th AIAA Aerospace Sciences Meeting and Exhibit, January 2008.
[7] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs: stream computing on graphics hardware, ACM Trans. Graphics 23 (2004) 777–786.
[8] A. Corrigan, F. Camelli, R. Löhner, J. Wallin, Running unstructured grid based CFD solvers on modern graphics hardware, AIAA Paper 2009-4001, 19th AIAA Computational Fluid Dynamics Conference, June 2009.
[9] K. Crane, I. Llamas, S. Tariq, Real-time simulation and rendering of 3D fluids, GPU Gems 3 (2007) (Chapter 30).
[10] J.D. Denton, Ch. Hirsch, G. Meauzé, Analytical test cases of cascades, AGARD-AR-275, 1990.
[11] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. 227 (2008) 10148–10161.
[12] M. Emmerich, K.C. Giannakoglou, B. Naujoks, Single- and multi-objective evolutionary optimization assisted by Gaussian random field metamodels, IEEE Trans. Evolut. Comput. 10 (4) (2006) 421–439.
[13] B. Epstein, S. Peigin, Optimization of 3D wings based on Navier–Stokes solutions and genetic algorithms, Int. J. Comput. Fluid Dyn. 20 (2) (2006) 75–92.
[14] M. Farina, A neural network based generalized response surface multiobjective evolutionary algorithm, in: 2002 Congress on Evolutionary Computation – CEC '02, vol. 1, Honolulu, HI, May 2002, pp. 956–961.
[15] K. Fatahalian, D.R. Horn, T.J. Knight, L. Leem, M. Houston, J.Y. Park, M. Erez, M. Ren, A. Aiken, W.J. Dally, P. Hanrahan, Sequoia: programming the memory hierarchy, in: SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM, New York, NY, USA, 2006, p. 83.
[16] K.C. Giannakoglou, Design of optimal aerodynamic shapes using stochastic optimization methods and computational intelligence, Prog. Aerospace Sci. 38 (1) (2002) 43–76.
[17] K.C. Giannakoglou, A.P. Giotis, M.K. Karakasis, Low-cost genetic optimization based on inexact pre-evaluations and the sensitivity analysis of design parameters, Inverse Prob. Engrg. 9 (2001) 389–412.
[18] M.B. Giles, N.A. Pierce, Improved lift and drag estimates using adjoint Euler equations, AIAA Paper 1999-3293, 14th Computational Fluid Dynamics Conference, June–July 1999.
[19] D. Goddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, H. Wobker, C. Becker, S. Turek, Using GPUs to improve multigrid solver performance on a cluster, Int. J. Comput. Sci. Engrg. (IJCSE) 4 (1) (2008) 36–55.
[20] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, A multigrid solver for boundary value problems using programmable graphics hardware, Graphics Hardware (2003) 1–11.
[21] T.R. Hagen, K.A. Lie, J.R. Natvig, Solving the Euler equations on graphics processing units, Comput. Sci. – ICCS 3994 (2006) 220–227.
[22] M.J. Harris, Fast fluid dynamics simulation on the GPU, GPU Gems (2004) 637–665 (Chapter 38).
[23] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, New Jersey, USA, 1999.
[24] A. Jameson, Aerodynamic design via control theory, J. Scientific Comput. 3 (1988) 233–260.
[25] A. Jameson, Optimum aerodynamic design using CFD and control theory, AIAA Paper 1995-1729, 12th AIAA Computational Fluid Dynamics Conference and Open Forum, June 1995.
[26] Y. Jin, M. Olhofer, B. Sendhoff, A framework for evolutionary optimization with approximate fitness functions, IEEE Trans. Evolut. Comput. 6 (5) (2002) 481–494.
[27] I.C. Kampolis, K.C. Giannakoglou, A multilevel approach to single- and multiobjective aerodynamic optimization, Comput. Methods Appl. Mech. Engrg. 197 (33–40) (2008) 2963–2975.
[28] I.C. Kampolis, K.C. Giannakoglou, Distributed evolutionary algorithms with hierarchical evaluation, Engrg. Optim. 41 (11) (2009) 1037–1049.
[29] I.C. Kampolis, E.I. Karangelos, K.C. Giannakoglou, Gradient-assisted radial basis function networks: theory and applications, Appl. Math. Modell. 28 (13) (2004) 197–209.
[30] I.C. Kampolis, A.S. Zymaris, V.G. Asouti, K.C. Giannakoglou, Multilevel optimization strategies based on metamodel-assisted evolutionary algorithms, for computationally expensive problems, in: 2007 Congress on Evolutionary Computation – CEC '07, Singapore, September 2007.
[31] M.K. Karakasis, K.C. Giannakoglou, On the use of metamodel-assisted, multi-objective evolutionary algorithms, Engrg. Optim. 38 (8) (2006) 941–957.
[32] M.K. Karakasis, D.G. Koubogiannis, K.C. Giannakoglou, Hierarchical distributed evolutionary algorithms in shape optimization, Int. J. Numer. Methods Fluids 53 (3) (2007) 455–469.
[33] A. Keane, P. Nair, Computational Approaches for Aerospace Design: The Pursuit of Excellence, John Wiley and Sons Ltd., 2005.
[34] J. Knowles, D. Corne, M-PAES: a memetic algorithm for multiobjective optimization, in: 2000 Congress on Evolutionary Computation – CEC '00, IEEE Press, 2000, pp. 325–332.
[35] D. Komatitsch, D. Michéa, G. Erlebacher, Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA, J. Parallel Distrib. Comput., 2009, doi: 10.1016/j.jpdc.2009.01.006.
[36] J. Kruger, R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms, ACM Trans. Graphics 22 (3) (2003) 908–916.
[37] W.R. Mark, R.S. Glanville, K. Akeley, M.J. Kilgard, Cg: a system for programming graphics hardware in a C-like language, ACM Trans. Graphics 22 (2003) 896–907.
[38] F. Muyl, L. Dumas, V. Herbert, Hybrid method for aerodynamic shape optimization in automotive industry, Comput. Fluids 33 (5–6) (2004) 849–858.
[39] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2008.
[40] NVIDIA, NVIDIA CUDA homepage, 2008.
[41] Y.S. Ong, K.Y. Lum, P.B. Nair, D.M. Shi, Z.K. Zhang, Global convergence of unconstrained and bound constrained surrogate-assisted evolutionary search in aerodynamic shape design, in: 2003 Congress on Evolutionary Computation – CEC '03, vol. 3, Canberra, Australia, 2003, pp. 1856–1863.
[42] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A.E. Lefohn, T.J. Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graphics Forum 26 (1) (2007) 80–113.
[43] D.I. Papadimitriou, K.C. Giannakoglou, A continuous adjoint method with objective function derivatives based on boundary integrals for inviscid and viscous flows, Comput. Fluids 36 (2) (2007) 325–341.
[44] D.I. Papadimitriou, K.C. Giannakoglou, Total pressure loss minimization in turbomachinery cascades using a new continuous adjoint formulation, Proc. I MECH E Part A: J. Power Energy 221 (2007) 865–872.
[45] D.I. Papadimitriou, K.C. Giannakoglou, Computation of the Hessian matrix in aerodynamic inverse design using continuous adjoint formulations, Comput. Fluids 37 (8) (2008) 1029–1039.
[46] D.I. Papadimitriou, K.C. Giannakoglou, Direct, adjoint and mixed approaches for the computation of Hessian in airfoil design problems, Int. J. Numer. Methods Fluids 56 (10) (2008) 1929–1943.
[47] A. Ratle, Optimal sampling strategies for learning a fitness model, in: 1999 Congress on Evolutionary Computation – CEC '99, vol. 3, Washington, DC, 1999, pp. 2078–2085.
[48] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (2) (1981) 357–372.
[49] P. Spalart, S. Allmaras, A one-equation turbulence model for aerodynamic flows, La Recherche Aérospatiale 1 (1994) 5–21.
[50] D. Thévenin, G. Janiga, Optimization and Computational Fluid Dynamics, Springer-Verlag, 2008, ISBN: 978-3-540-72152-9.
[51] E. Zitzler, D. Brockhoff, L. Thiele, The hypervolume indicator revisited: on the design of Pareto-compliant indicators via weighted integration, in: S. Obayashi et al. (Eds.), Conference on Evolutionary Multi-Criterion Optimization – EMO, LNCS, Springer, Berlin, 2007, pp. 862–876.
Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graphics Forum 26 (1) (2007) 80–113. [43] D.I. Papadimitriou, K.C. Giannakoglou, A continuous adjoint method with objective function derivatives based on boundary integrals for inviscid and viscous flows, Computers Fluids 36 (2) (2007) 325–341. [44] D.I. Papadimitriou, K.C. Giannakoglou, Total pressure loss minimization in turbomachinery cascades using a new continuous adjoint formulation, Proc. I MECH E Part A: J. Power Energy 221 (2007) 865–872. [45] D.I. Papadimitriou, K.C. Giannakoglou, Computation of the Hessian matrix in aerodynamic inverse design using continuous adjoint formulations, Comput. Fluids 37 (8) (2008) 1029–1039. [46] D.I. Papadimitriou, K.C. Giannakoglou, Direct, adjoint and mixed approaches for the computation of Hessian in airfoil design problems, Int. J. Numer. Methods Fluids 56 (10) (2008) 1929–1943. [47] A. Ratle, Optimal sampling strategies for learning a fitness model, in: 1999 Congress on Evolutionary Computation – CEC ’99, vol. 3, Washington, DC, 1999, pp. 2078–2085. [48] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (2) (1981) 357–372. [49] P. Spalart, S. Allmaras, A one-equation turbulence model for aerodynamic flows, La Recherche Aérospatiale 1 (1994) 5–21. [50] D. Thévenin, G. Janiga, Optimization and Computational Fluid Dynamics, Springer-Verlag, 2008, ISBN: 978-3-540-72152-9. [51] E. Zitzler, D. Brockhoff, L. Thiele, The hypervolume indicator revisited: on the design of Pareto-compliant indicators via weighted integration, in: S. Obayashi et al. (Eds.), Conference on Evolutionary Multi-Criterion Optimization – EMO, LNCS, Springer, Berlin, 2007, pp. 862–876.