Numerical simulations of elastic wave propagation using graphical processing units—Comparative study of high-performance computing capabilities

Comput. Methods Appl. Mech. Engrg. 290 (2015) 98–126

P. Packo a,∗, T. Bielak a, A.B. Spencer b, T. Uhl a, W.J. Staszewski a, K. Worden b, T. Barszcz a, P. Russek c, K. Wiatr c

a Department of Robotics and Mechatronics, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Krakow, Poland
b Department of Mechanical Engineering, University of Sheffield, Mappin Street, Sheffield S1 3JD, UK
c Academic Computer Centre Cyfronet AGH, AGH University of Science and Technology, ul. Nawojki 11, 30-950 Krakow, Poland

Received 28 January 2014; received in revised form 5 March 2015; accepted 10 March 2015; available online 19 March 2015.

Abstract

High-performance computing is important in many areas of engineering. The increasing capabilities of modern workstations open new possibilities in scientific computing. This paper demonstrates how graphical processing units can be used efficiently for large models of elastic wave propagation in complex media. The method is based on the local interaction simulation approach and a parallel algorithm architecture. The focus of the work presented is on numerical implementation and covers aspects related to software modular architecture, computer memory organisation and optimisation. A domain decomposition approach allowing for calculations using multiple-GPU configurations is proposed, implemented and examined. The performance of the proposed simulation framework is tested for numerical models of different sizes, various computing precisions and hardware platforms. The results obtained are discussed in terms of graphical processing unit limitations and indicate significant speed-up factors compared to calculations using central processing units or different modelling approaches.
© 2015 Elsevier B.V. All rights reserved.

Keywords: CUDA; Parallel processing; Wave propagation; LISA

1. Introduction

Various types of elastic waves can propagate in solids. Waves travelling in an unbounded body of material are known as bulk waves. These waves have a finite number of relatively simple longitudinal and shear components. In contrast, wave propagation in bounded media leads to guided waves that can exhibit an infinite number of complex dispersive modes. Analysis of elastic waves and their propagation has been investigated for many years in science and engineering.



These waves are governed by the fundamental elastodynamic equations based on linear (or nonlinear) stress–strain relations and Newton's second law. Analytical solutions of these equations are known only for a range of canonical problems; semi-analytical approaches include methods based on the theory of diffraction [1]. Therefore, various methods have been developed for modelling and numerical simulation of elastic wave propagation. These include methods based on finite differences (FD) [2], velocity–stress FD [3], finite elements (FE) [4,5], spectral elements [6,7], the elastodynamic finite integration technique (EFIT) [8], the local interaction simulation approach (LISA) [9–20] and cellular automata [21]. An excellent review of many different available approaches is given in [22].

The high frequencies and complex geometries involved in many applications often require large models of wave propagation. These models are challenging and computationally expensive; however, parallel computation algorithms can be used to solve the problem [23,24]. This approach can be applied using either matrix manipulations or direct domain decomposition techniques. The former relies heavily on various properties of the assembled equations and their coefficients (e.g. sparseness or diagonality) to speed up computations; however, the increase in calculation speed strongly depends on the mathematical properties of the matrices and the model configuration (e.g. connectivity of elements). The latter method decomposes models into a number of sub-models that are calculated independently. Each sub-model is augmented by a set of boundary conditions that represent interactions with the removed parts of the original model. As a result, the original models are modified. Nevertheless, the speed-up factor obtained with this approach is linear with respect to the number of domains, up to the hardware limitations; hard disk space and speed always remain important hardware limitations.

The aforementioned parallelisation techniques require specific hardware to carry out the computations. Typical computing units consist of a number of processors that can process data in parallel. Standard Central Processing Units (CPUs) typically include from eight to twelve processors. Although each core is relatively powerful, the total number of such units is relatively low, and increasing the number of cores usually requires costly hardware investments. Moreover, a significant part of the computing power of these cores remains unused when a model is subdivided into a number of relatively small sub-models. Although data are exchanged between processors to advance the computations, data transfers are limited by factors such as memory or bus speed. Therefore even large networks of computing units (e.g. computer clusters) are often not exploited properly.

Recent years have shown very fast evolution and development of Graphical Processing Units (GPUs). As a result, new architectures – available in low-cost graphics cards, such as the Compute Unified Device Architecture (CUDA) – can be used very efficiently for numerical simulations [25]. GPUs consist of a large number of processors with extremely fast on-board memories. Therefore a computational problem can be decomposed into a number of threads and processed efficiently. However, it is important to note that these computing cores are not as powerful as standard CPUs and thus appropriate computation algorithms are required.
CUDA-based technology has been used to develop extremely efficient and accurate modelling tools in medicine, finance, seismology and computer-aided design, as shown in [26–31]. The CUDA technology was introduced in 2006 by Nvidia in GPUs for graphics applications. Current Nvidia GPUs – widely available in graphics cards of high-end desktop PCs and laptops – have several computing multiprocessors, consisting of hundreds of cores, and are capable of running thousands of light-weight threads in parallel. This technology, when combined with on-board, high-bandwidth memory, exceeds the performance and computing power of typical desktop CPUs. According to recent data, the computational power of modern graphics cards is at least eight to ten times larger than the computational capability of CPUs. It is important to note that both approaches, i.e. CPU- and GPU-based, are used in practice for parallelisation of matrix operations. However, approaches employing large numbers of less powerful units have been found superior when compared with small numbers of more powerful units.

Efficient parallelisation algorithms are usually based on approaches that require fewer mathematical operations. Therefore explicit rather than implicit time integration techniques are preferred for wave propagation simulations. It is also important that the analysed problems do not lead to increased computational complexity in single threads, for example due to intrinsic integral-related properties of the governing equations. Consequently, single threads should perform simple calculations involving small numbers of variables and operations. Therefore, when parallel computing algorithms are considered for numerical simulations of wave propagation, the LISA is one of the possible options. The method is well suited for this task due to its relative simplicity and proven record of various applications. The LISA was proposed for wave propagation modelling in physics [14] and has been used for damage detection in structural health monitoring [9–13,18–20]. More recently, a new LISA-based platform for wave propagation modelling has been proposed in [20] and used for Lamb wave propagation modelling in structural damage detection applications.


The method was implemented using two-dimensional (2-D) parallel processing architectures based on CUDA and graphics cards (i.e. GPUs). The major objective of the work presented in this paper is to extend the LISA parallel model to a three-dimensional (3-D) platform and to assess the performance of the method. The focus of the work presented is on numerical implementation and covers aspects related to software modular architecture, computer memory organisation and optimisation for GPUs. The performance of the proposed simulation framework is tested for numerical models of different sizes, various computing precisions and hardware platforms.

The structure of the paper is as follows. Firstly, a brief introduction to the 3-D Local Interaction Simulation Approach is given in Section 2. Then the implementation of the method is briefly described in Section 3. The performance of the simulation framework is assessed in Sections 4 and 5. Section 4 compares the LISA and FE results, whereas Section 5 investigates CPU and GPU implementations in terms of calculation speed for different model sizes and precision modes. Finally, the paper is concluded in Section 6.

2. The local interaction simulation approach

The Local Interaction Simulation Approach is a technique that was proposed in physics for wave propagation in complex media with sharp impedance changes [14,20]. The LISA algorithm was originally designed for use with a connection machine (a supercomputer with thousands of parallel processors) and therefore it is well suited for parallel processing. For the sake of completeness, this section briefly describes the 3-D algorithm used for wave propagation.

The LISA can be used for wave propagation in any heterogeneous material of arbitrary shape and complexity. The method discretises the structure under investigation into grids of cells. Material properties are assumed to be constant within each cell but may differ between cells. Discretisation in time is also used in dynamic problems, e.g. for wave propagation modelling. The algorithm can be derived from the elastodynamic wave equation for homogeneous media, given as [32]

\nabla_\sigma \left( C \nabla_\varepsilon W \right) = \rho W_{,tt}    (1)

where C is the stiffness matrix, \nabla_\sigma and \nabla_\varepsilon are the differential operator matrices for stress and strain respectively, W is the vector of particle displacements, and \rho is the density. A comma before the subscript in Eq. (1) denotes differentiation. The stiffness matrix for a general orthotropic material in principal axes for a spatial point (p), written in Voigt notation, is given as

C^{(p)} = \begin{bmatrix}
C_{11}^{(p)} & C_{12}^{(p)} & C_{13}^{(p)} & 0 & 0 & 0 \\
C_{12}^{(p)} & C_{22}^{(p)} & C_{23}^{(p)} & 0 & 0 & 0 \\
C_{13}^{(p)} & C_{23}^{(p)} & C_{33}^{(p)} & 0 & 0 & 0 \\
0 & 0 & 0 & C_{44}^{(p)} & 0 & 0 \\
0 & 0 & 0 & 0 & C_{55}^{(p)} & 0 \\
0 & 0 & 0 & 0 & 0 & C_{66}^{(p)}
\end{bmatrix}    (2)

where the C_{ij}^{(p)} constants depend on the Young's moduli, Poisson ratios and Kirchhoff's moduli for particular directions.

For the 3-D LISA wave propagation simulation, the structure is discretised into parallelepiped cells, as illustrated in Fig. 1. The junction of eight cells defines the nodal point P. The second time derivatives across the eight cells are required to converge towards a common value \Omega at the point P. The central difference scheme is used to calculate spatial derivatives in the eight cells surrounding P, and stress continuity across adjacent cells is imposed to obtain the solution. For a general orthotropic case, iteration equations are obtained for each displacement component; these are given below as Eqs. (3)–(5).


Fig. 1. Eight elementary cells required to derive the governing LISA equation. Node and cell numbering scheme.

\chi\,(u_{t+1} - 2u_t + u_{t-1}) = \sum_{p=1}^{8} f_u^{(p)}\!\left(u, v, w; C^{(p)}, \Delta x_i\right)    (3)

\chi\,(v_{t+1} - 2v_t + v_{t-1}) = \sum_{p=1}^{8} f_v^{(p)}\!\left(u, v, w; C^{(p)}, \Delta x_i\right)    (4)

\chi\,(w_{t+1} - 2w_t + w_{t-1}) = \sum_{p=1}^{8} f_w^{(p)}\!\left(u, v, w; C^{(p)}, \Delta x_i\right)    (5)

where each contribution f^{(p)} is a weighted sum of the displacement components of the nodes adjacent to P in cell p, with weights formed from the cell stiffness constants (C_{11}, C_{22}, C_{33}, C_{44}, C_{55}, C_{66} and the combinations C_{12} \pm C_{66}, C_{13} \pm C_{55}, C_{23} \pm C_{44}) and the grid spacings \Delta x_i; the full expanded expressions are given in [14,20]. The signs and ordering of the eight cells are encoded by

\alpha = \begin{bmatrix}
+ & + & + \\
- & + & + \\
- & - & + \\
+ & - & + \\
+ & + & - \\
- & + & - \\
- & - & - \\
+ & - & -
\end{bmatrix}^{T}, \qquad
P = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \end{bmatrix}^{T}, \qquad
S = \begin{bmatrix} - & + & - & + & - & + & - & + \end{bmatrix}^{T}

and

\chi = \frac{\Delta x\, \Delta y\, \Delta z\, \rho_{avg}}{\Delta t^2}, \qquad
\rho_{avg} = \frac{1}{8} \sum_{p=1}^{8} \rho^{(p)}

where \Delta x_i is the grid spacing in the ith direction, \Delta t is the time step and \alpha^{(i)} denotes the ith column of the \alpha matrix. Displacement components are taken at time t and at the point (0, 0, 0) if not stated otherwise, and the p index in the summation carries the sign of the summed component.

Eqs. (3)–(5) are the principal displacement equations of LISA in three dimensions and are implemented in the solver. The relevant data have been organised in arrays and – introducing auxiliary variables – transformed into the following C++ implementation:

Unext[x][y][z] = -Uprev[x][y][z] + U[x][y][z]*smb
  + chi*( 2*sigma[x+a][y+b][z+c] * U[x+alpha][y][z]
        + 2*mu[x+a][y+b][z+c]    * U[x][y+beta][z]
        + 2*mu[x+a][y+b][z+c]    * U[x][y][z+gamma]
        + phi[x+a][y+b][z+c] * (V[x][y+beta][z]  - V[x+alpha][y][z])
        + phi[x+a][y+b][z+c] * (W[x][y][z+gamma] - W[x+alpha][y][z])
        + alpha*beta*nu[x+a][y+b][z+c]  * (V[x+alpha][y+beta][z]  - V[x][y][z])
        + alpha*gamma*nu[x+a][y+b][z+c] * (W[x+alpha][y][z+gamma] - W[x][y][z]) )/8.0;    (6)

Vnext[x][y][z] = -Vprev[x][y][z] + V[x][y][z]*smb
  + chi*( 2*sigma[x+a][y+b][z+c] * V[x][y+beta][z]
        + 2*mu[x+a][y+b][z+c]    * V[x][y][z+gamma]
        + 2*mu[x+a][y+b][z+c]    * V[x+alpha][y][z]
        + phi[x+a][y+b][z+c] * (U[x+alpha][y][z]  - U[x][y+beta][z])
        + phi[x+a][y+b][z+c] * (W[x][y][z+gamma]  - W[x][y+beta][z])
        + alpha*beta*nu[x+a][y+b][z+c] * (U[x+alpha][y+beta][z]   - U[x][y][z])
        + beta*gamma*nu[x+a][y+b][z+c] * (W[x][y+beta][z+gamma]   - W[x][y][z]) )/8.0;    (7)

Wnext[x][y][z] = -Wprev[x][y][z] + W[x][y][z]*smb
  + chi*( 2*sigma[x+a][y+b][z+c] * W[x][y][z+gamma]
        + 2*mu[x+a][y+b][z+c]    * W[x][y+beta][z]
        + 2*mu[x+a][y+b][z+c]    * W[x+alpha][y][z]
        + phi[x+a][y+b][z+c] * (V[x][y+beta][z]   - V[x][y][z+gamma])
        + phi[x+a][y+b][z+c] * (U[x+alpha][y][z]  - U[x][y][z+gamma])
        + beta*gamma*nu[x+a][y+b][z+c]  * (V[x][y+beta][z+gamma]  - V[x][y][z])
        + alpha*gamma*nu[x+a][y+b][z+c] * (U[x+alpha][y][z+gamma] - U[x][y][z]) )/8.0;    (8)
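The per-node update above is naturally data-parallel: every grid node can be advanced independently within a time step. The following sketch illustrates the thread-per-node mapping used on GPUs with a simplified, single-component stand-in for the full right-hand side of Eqs. (6)–(8); the kernel name, the flattened array layout and the single coefficient array are illustrative assumptions, not the solver's actual code.

__global__ void iterateU(const float* Uprev, const float* U, float* Unext,
                         const float* coeff, int nx, int ny, int nz)
{
    // One CUDA thread per grid node; global indices of the node handled here.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1) return;

    // Flattened 1-D index of the node and of its six axis neighbours.
    int i   = (z * ny + y) * nx + x;
    int ixm = i - 1,       ixp = i + 1;
    int iym = i - nx,      iyp = i + nx;
    int izm = i - nx * ny, izp = i + nx * ny;

    // Simplified stand-in for the weighted neighbour sum of Eq. (6).
    float rhs = coeff[i] * (U[ixp] + U[ixm] + U[iyp] + U[iym] + U[izp] + U[izm] - 6.0f * U[i]);

    // Central-difference (leapfrog) update in time, cf. the structure of Eqs. (6)-(8).
    Unext[i] = 2.0f * U[i] - Uprev[i] + rhs;
}
// Launched, e.g., as iterateU<<<grid, dim3(8, 8, 8)>>>(dUprev, dU, dUnext, dCoeff, nx, ny, nz);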

It is important to note that the local interaction nature of boundaries in the model is one of the major advantages of the LISA algorithm when used for wave propagation. The Sharp Interface Model (SIM) [14] is used to average physical properties at interface grid points, which represent intersections of eight elementary cells. In other words, cells are treated as discontinuous, and displacements and stresses are matched at interface grid points. The SIM allows for a more physical and unambiguous treatment of interface discontinuities between different material layers than typical FD schemes. Typical FD schemes require parameter smoothing across material interfaces, which depends on the applied scheme and is therefore not very accurate for sharp interfaces with a high impedance mismatch. The SIM leads to more accurate results when wave propagation problems in complex media with complex boundaries are studied. More details related to the 3-D wave propagation implementation can be found in [14,20]. The LISA algorithm has been used successfully for Lamb wave propagation investigations for damage detection; various application examples can be found in [9–13,18–20].

3. Numerical implementation of CUDA-based solver for wave propagation modelling

3.1. Structure of the solver

The CUDA-based LISA solver was developed in C++ and compiled with the Microsoft Visual C++ 2008 Pro compiler. The GPU-related code was initially developed in the CUDA 2.3 environment and ported to higher versions, up to CUDA 4.0. All CUDA kernel routines used in the solver were developed from scratch by the authors. No libraries were used throughout the development process, except for the MSVC++ runtime libraries.


Fig. 2. LISA 3-D modular solver overview.

The CUDA-based LISA solver developed for wave propagation simulation has a modular structure. This allows for flexible management – for example with respect to Matlab libraries or console tools – and for the use of different computation platforms. The three basic modules are the reader, the core and the Matlab wrapper, as illustrated in Fig. 2. The reader module is responsible for delivering input data to the computing core. The input data describe the geometry of the object as well as other simulation parameters and conditions (e.g. material properties, excitation signals) and the definitions of result files. The task of the core is to build data structures and represent them in the CUDA device memory; it is also responsible for performing calculations directly on the CUDA device and for saving results to output files and/or regular memory. The third module, i.e. the wrapper, is optional and allows for managing the core operations through a user-friendly Matlab environment. Alternatively, the core can be controlled directly, i.e. by a low-level Application Programming Interface (API). Similarly to commercial CAE solvers, input files are text files and are typically created by external pre-processing applications. This allows for additional flexibility of the framework; for example, the solver can be coupled to other simulation packages and updated when required.

3.2. Simulation sequence of events

Once the model is defined for simulation, the iteration equations of the LISA algorithm are called to process the data. However, each time step involves auxiliary operations that are required to apply model changes (e.g. due to excitations) and to save intermediate results (e.g. output results for different sensor locations). The sequence of sub-steps within a single time step cannot be arbitrary and depends on the numerical model applied in the entire analysis. A typical example of the simulation sequence is shown schematically in Fig. 3.

Fig. 3. Typical example of the simulation sequence of events.

The simulation starts with loading the 3-D object definition by the reader and creating all required data structures in the CPU and GPU memories. Then, the requested number of simulation steps is executed. The following sub-steps are executed in parallel as CUDA kernel routines on the CUDA device:

• 'Iterate' step—the LISA iteration equations are applied to the model;
• 'Save sensors' step—the results from pre-defined points of the model (i.e. sensor locations) are copied to separate buffers (sensor buffers);
• 'Impulse' step—single-point and matrix excitation impulses are applied to the model; the matrix approach allows for simulating real piezo-based excitation (a number of points, for which separate excitation signals can be defined in any arbitrary direction, with on-the-fly interpolation performed in parallel on the CUDA device).

After executing the requested number of steps, the displacement arrays and/or the gathered sensor data are copied to the CPU memory and stored in files for external processing.
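A minimal sketch of how this per-step sequence might be driven from the host is shown below. The kernel names, launch configurations and the pointer rotation are illustrative assumptions rather than the solver's actual API; rotating the three time-level pointers is one common way to avoid copying displacement arrays between steps.

// Hypothetical host-side driver for the sequence of sub-steps described above.
for (int step = 0; step < numSteps; ++step) {
    // 'Iterate' - apply the LISA iteration equations to the whole model.
    iterateKernel<<<gridDim, blockDim>>>(dUprev, dU, dUnext, dMaterial, nx, ny, nz);

    // 'Save sensors' - copy results at pre-defined sensor locations to separate buffers.
    saveSensorsKernel<<<sensorGrid, sensorBlock>>>(dUnext, dSensorIdx, dSensorBuf, numSensors, step);

    // 'Impulse' - apply single-point or matrix (piezo-like) excitation for this time step.
    applyImpulseKernel<<<exciteGrid, exciteBlock>>>(dUnext, dExcitation, step);

    // Rotate the time levels: previous <- current <- next (pointer swap, no data copied).
    double* tmp = dUprev; dUprev = dU; dU = dUnext; dUnext = tmp;
}
// After the requested number of steps, sensor buffers and/or displacement arrays are
// copied back to the host with cudaMemcpy and written to output files.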

3.3. Memory organisation

Modern graphics cards are typically equipped with 3 GB of high-speed memory. This is much less than the standard RAM available in desktop PCs, and although newly developed graphics cards are more and more powerful, customisation of the on-board memory is not possible. Therefore, memory management is a crucial point in the entire simulation strategy. Since the total available GPU memory can be increased by adding extra GPU devices, a multi-GPU memory organisation is of particular interest.

The geometry of the analysed structure is introduced to the computing core using a list of cells comprising the 3-D model. Each cell is defined by four – or optionally five – numbers. The first three numbers specify the cell coordinates in the 3-D domain, whereas the fourth number defines the material (an index into the material database, which is also accessible by the computing core). The three integer numbers refer to the coordinates of a single vertex and completely define the cube's position in space. For a general cell definition, a set of eight coordinate triples (floats defining the x, y, z coordinates) would be necessary. However, since the interior cubes of the model are surrounded by other cubes, each vertex's spatial coordinates would be stored in each of eight neighbouring cells, resulting in redundant data in memory. Hence, all floating-point numbers defining the cell vertices are converted into integer numbers; the new (integer) vertex coordinates are related to the original ones by Δx, Δy and Δz, the spatial grid size in the respective direction. Subsequently, the number of coordinate triples stored for each cell is reduced to one: for consistency, the triple of the vertex with the lowest (x, y, z) indices is stored, reducing the data from eight sets of coordinates to a single set. The remaining data are retrieved automatically by applying the appropriate procedure. The fifth, optional number is the initial cell temperature; this option is used when temperature-dependent simulations are required.

The algorithms used for the data structures in the CPU memory and in the CUDA device memory differ only slightly between single and multiple CUDA devices. These differences are described in the following subsections.
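Before turning to the single- and multi-GPU variants, the sketch below illustrates the cell record described above – integer grid coordinates of the reference vertex, a material index and an optional temperature – together with the float-to-integer coordinate conversion; the structure and function names are illustrative assumptions.

// Hypothetical per-cell record built by the reader from the input file.
struct Cell {
    int   ix, iy, iz;     // integer coordinates of the lowest-(x, y, z) vertex
    int   materialId;     // index into the material database
    float temperature;    // optional fifth number for temperature-dependent runs
};

// Convert the physical coordinates of the reference vertex to integer grid indices,
// i.e. divide by the grid spacings dx, dy, dz and round to the nearest grid line.
Cell makeCell(float x, float y, float z,
              float dx, float dy, float dz, int materialId)
{
    Cell c;
    c.ix = static_cast<int>(x / dx + 0.5f);
    c.iy = static_cast<int>(y / dy + 0.5f);
    c.iz = static_cast<int>(z / dz + 0.5f);
    c.materialId  = materialId;
    c.temperature = 0.0f;
    return c;
}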

3.3.1. Single GPU memory organisation

Firstly, a single-GPU calculation is considered. The starting point is the model defined in an input file. The memory organisation and the entire process of creating all required structures in the model are shown in Fig. 4.

Fig. 4. Model loading process and memory organisation for a single CUDA device.

The proposed process consists of the following steps:

Step A: The list of cells provided by the user is extended by so-called 'air' cells that surround the actual model. The air cells create a one- or two-cell thick air layer surrounding the analysed 3-D model. This allows for a natural simulation of the stress-free boundary conditions used in the standard LISA equations, as explained in [20]. Therefore no artificial boundary conditions, typical for other FD schemes, need to be introduced. This approach reproduces wave reflections from the boundary of any 3-D shape very well, thanks to the SIM model used. Subsequently, an optimally ordered list of cells is established, which enables efficient calculations on the CUDA devices.

Step B: Once the total number of cells in the model is known, the computing core is able to allocate regular CPU memory for arrays containing material properties, i.e. the density ρ and the Lamé constants µ and λ. Material data arrays for each cell are filled with the requested values based on the material indices defined in the input file; the indices are used to retrieve material data from the database. When temperature-dependent simulations are required, the relevant material parameters are calculated based on the temperature field used. Various auxiliary structures are additionally created in the regular CPU memory to facilitate further computations. These are: arrays containing excitation signal samples, piezo-actuator definitions (when piezo-based excitation is used) and some additional structures used for optimisation.

Step C: All previously prepared structures are copied into the CUDA device memory.

Step D: Since the time-domain second-order hyperbolic equation is solved using the central difference formalism, nine displacement values must be stored for each existing cell. These nine values represent three time steps (i.e. the previous, current and next steps) for the three geometrical axes of the 3-D domain. Thus, nine arrays keeping these values for each cell are created in the CUDA device memory. Additionally, sensor data buffers and miscellaneous internal data buffers are created.

The A–D steps described in this section comprise the memory organisation scheme for the single-GPU calculation approach. The model in the CUDA memory is now ready to perform calculations.
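Step D above implies nine displacement arrays (three components at three time levels) plus the material arrays on the device. A minimal allocation sketch in double precision is given below; the variable names are assumptions and error checking is omitted.

// Hypothetical allocation of the per-cell data structures on the CUDA device.
size_t bytes = static_cast<size_t>(numCells) * sizeof(double);

double *dU[3], *dV[3], *dW[3];   // previous, current and next time level for each component
for (int t = 0; t < 3; ++t) {
    cudaMalloc(reinterpret_cast<void**>(&dU[t]), bytes);
    cudaMalloc(reinterpret_cast<void**>(&dV[t]), bytes);
    cudaMalloc(reinterpret_cast<void**>(&dW[t]), bytes);
}

double *dRho, *dMu, *dLambda;    // material properties per cell (Step B, copied in Step C)
cudaMalloc(reinterpret_cast<void**>(&dRho),    bytes);
cudaMalloc(reinterpret_cast<void**>(&dMu),     bytes);
cudaMalloc(reinterpret_cast<void**>(&dLambda), bytes);
cudaMemcpy(dRho, hRho, bytes, cudaMemcpyHostToDevice);   // and similarly for dMu, dLambda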

3.3.2. Multiple GPU memory organisation

Models that do not fit into a single GPU memory are automatically split into sub-models and distributed over all GPUs available in a workstation or within a network. The memory organisation is therefore slightly different, to allow for the subdivisions and the auxiliary structures required for internal data transfer between parts of the model. The multiple-GPU memory organisation is shown in Fig. 5.

Fig. 5. Model loading process and memory organisation for multiple CUDA devices.

The process of creating all required structures is similar to the single-device approach. The only difference is that Step A is now divided into two sub-steps:

Step A1: The list of cells provided by the user is extended by the 'air' cells in the same way as for single-GPU calculations.

Step A2: The list of cells is split into n sub-lists for the n CUDA devices available in the system. However, the model subdivision process cannot be arbitrary; a dedicated algorithm has been developed for this purpose and is described in the following section. The algorithm allows for model decomposition into domains that are calculated on separate CUDA devices.

The B, C and D steps are analogous to the single-GPU calculations, but they are performed for every cell list on every CUDA device.

3.4. The 3-D shape decomposition method

A model that cannot be loaded into a single GPU memory is internally split into n domains (for calculations based on n available CUDA devices). This operation does not require user intervention. For optimal performance, the splitting algorithm has to decompose the model into independent domains – calculated by separate CUDA devices – so as to ensure an even load on all cores of all GPUs during the simulation and to maintain minimal redundancy of the model definition in terms of memory consumption. The former requirement is fulfilled by analysing the hardware properties of the available CUDA devices and by evaluating the equivalent sub-model size that can be calculated; the sub-model sizes are normalised to the biggest domain that can be calculated, and the equivalent model size depends on the memory size, the memory speed and the number of CUDA cores of each device. The minimal model redundancy requirement is used to decrease the memory occupancy by limiting the number of cells that are calculated by more than one device. This redundancy is an inevitable consequence of the LISA iteration equations which, due to the second-order central difference formulae, require the displacement values from adjacent cells to calculate the solution at a particular grid point. Hence, at least a two-cell thick 'overlay' layer is necessary; its extent can vary with the interface orientation with respect to the global coordinate system. The thickness of the shared cell layer is directly related to the amount of data transferred between computing units after each iteration.

The domain decomposition algorithm developed for the LISA method slices the model into sub-domains depending on the number and type of devices, and considers different subdivisions obtained for various slicing directions. The resulting sets of consecutive cells are gathered in even groups having the smallest possible number of cells in common with neighbouring domains. For instance, for a multi-GPU configuration using identical devices, the model is sliced into n sections by taking one-nth of the total number of cells counting in the positive and negative x, y and z directions. This procedure results in six possible model subdivisions, where the interfaces between the domains are aligned with the global coordinate axes; the configuration with the smallest number of cells at the interfaces is then selected as the best match. If a multi-GPU setup with different memory sizes is used, the number of cells assigned to a split is taken proportional to the memory size and the same procedure is used to determine the optimal configuration. Finally, when the devices differ in memory clocks and other parameters, a test run is performed to estimate the number of cells calculated per second by each device and to assign a set of cells to it; the total calculation time is estimated for the various subdivision scenarios and the optimal solution is selected. Other subdivision algorithms – used for example in finite element models for domain decomposition [33] – could easily be employed here. However, this task was not found to be crucial, and the applied solution has given satisfactory results and performance.

The decomposition method is illustrated in Fig. 6. The left-hand side of this figure shows a complex rod shape approximated by 3 million cube elements (the air layer surrounding the structure is not visualised). The rod is composed of two different materials, illustrated by different grey shades. The entire structure is split into four parts, i.e. A, B, C and D.

Fig. 6. Model sub-division example—a rod split into four parts.

The local nature of the LISA method allows for overlapping between parts, achieved by two-cell layers. The right side of Fig. 6 illustrates the division between parts A and B. The brighter (top) division layer is calculated by the CUDA device that performs calculations for the entire part A; the darker (bottom) division layer is required to make these calculations possible due to the local interaction nature of the boundaries. Analogously, the darker layer is calculated by the CUDA device that performs calculations for the entire part B, and the brighter layer is required to carry out these calculations. After every single simulation step, the computing devices exchange the required data, i.e. the brighter layer displacement values are copied from part A to part B and the other way around.

There are two ways of transferring memory blocks between two different CUDA devices: (1) using host (CPU) memory—copying data from the first device to the host memory, then copying from the host memory to the second device; (2) using peer-to-peer memory transfer between devices (this option was introduced in CUDA 4.0). The solver utilises the second approach by default; when a CUDA device is not able to perform peer-to-peer transfers, the first approach is used automatically. It is important to note that an even better solution could be used here: the Unified Addressing feature of the Tesla GPUs enables sharing a unified address space between the host and multiple CUDA devices, in which case the memory block transfers mentioned above are not needed. However, this solution has not been used in the current investigations due to the lack of this feature in many mainstream graphics cards. The time overhead of peer-to-peer copying of the layers between two devices has been investigated using different model sizes.
The time required to copy the data between devices is presented in Fig. 7. It has been found that the increase in calculation time, even for large models, does not exceed 2%–3% of a single-step calculation time. This time is included in all benchmark results for the two CUDA devices.
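A minimal sketch of the overlap-layer exchange using the peer-to-peer API introduced in CUDA 4.0, with the host-staged copy as the fallback described above, is given below; the device indices, buffer names and layer size are illustrative assumptions.

int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, devA, devB);

if (canAccess) {
    // Enable direct access once, then exchange the shared layer after every step.
    cudaSetDevice(devA);
    cudaDeviceEnablePeerAccess(devB, 0);
    cudaMemcpyPeer(dLayerOnB, devB, dLayerOnA, devA, layerBytes);
} else {
    // Fallback: stage the layer through host memory.
    cudaSetDevice(devA);
    cudaMemcpy(hLayer, dLayerOnA, layerBytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(devB);
    cudaMemcpy(dLayerOnB, hLayer, layerBytes, cudaMemcpyHostToDevice);
}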


Fig. 7. Time overhead of peer-to-peer copying layers between two devices.

Fig. 8. GPU memory needed for simulation running on single and double CUDA setup.

Additional cell layers required for establishing boundary conditions between adjacent model sub-domains increase memory consumption and decrease the effective total model size. The amount of memory required for various model sizes running on single- and double-GPU systems is presented in Fig. 8. It is clearly visible that the GPU memory usage for one and two devices is almost the same. It has been observed that the extra memory required for the double-GPU configuration is between 0.5% and 1.9% for models consisting of 2.5–20 million cells.

The requirement of an even load on all cores of all GPUs during simulation is also fulfilled. The results obtained for all tested models, shown in Fig. 8, indicate that the simulation runs 1.86–1.91 times faster on two Teslas than on a single-Tesla system in single precision, and 1.97–1.98 times faster in double precision mode. It was also observed that better results are obtained for bigger models. These results are very promising in terms of algorithm scalability to more than two CUDA devices.

3.5. Different implementations and computing precisions

The solver operates on 32-bit and 64-bit architectures under 32-bit and 64-bit operating systems. Calculations may be performed on double-precision or single-precision floating-point numbers.


Table 1
Comparison of results accuracy between CPU and GPU implementations, in double and single precision.

                        CPU DP vs. GPU DP    CPU DP vs. GPU SP
Minimum value           −1.25 · 10+2         −1.25 · 10+2
Maximum value            3.29 · 10+1          3.29 · 10+1
Maximum difference       1.30 · 10−12         3.87 · 10−3
Average difference      −1.00 · 10−18        −1.31 · 10−8

The 32-bit solver, using 32-bit memory pointers, addresses up to 4 GB of memory on the CUDA devices and 2 GB of CPU memory. Under a 64-bit operating system each 32-bit process is associated with a virtual 4 GB address space and can manage up to 4 GB of memory; under 32-bit operating systems all processes share the same address space, so a single process can allocate less memory, especially when requesting contiguous memory blocks. In this context the 32-bit environment is not preferable for this application. The 64-bit applications running on 64-bit operating systems are able to address larger memory (the limit is far beyond current hardware capabilities) using 64-bit memory pointers to the data. GPUs available on CUDA devices are typically equipped with 32-bit registers, thus a pair of registers must be used to store a single memory pointer. Effectively, more registers are required to run the same algorithm in a 64-bit application than in a 32-bit application, and a slight performance drop is observed.

The calculations are performed in two modes, i.e. double precision (64-bit) or single precision (32-bit). Double precision is commonly used in engineering applications because of its wide useful range of values (±2.2 · 10−308 to ±1.8 · 10+308) and high precision (52-bit mantissa, resulting in 15 significant decimal digits). The single precision mode offers a smaller range of values (±1.18 · 10−38 to ±3.4 · 10+38) and lower precision (23-bit mantissa, resulting in 6–7 significant decimal digits). Single precision calculations may lead to overflows during intermediate calculations if a result goes beyond the representable range. However, single precision floating-point numbers are of interest for the performance of the simulation framework because: (1) the smaller memory consumption enables larger models to be loaded into GPU memory; (2) computations executed on CUDA devices are at least twice as fast as in double precision; (3) the result accuracy may be improved by normalisation procedures. Since linear-elastic problems are scalable, the excitation amplitude may be adjusted to minimise the errors due to the single precision mode; considering linear elastic models, the values may be arbitrarily rescaled due to the superposition properties of the solution. Thus the LISA algorithm may be efficiently applied to single precision floating-point numbers.

The error introduced by the single precision calculations was evaluated using a double precision reference model. Displacement arrays for all DOFs in the model were used to calculate the differences between the double and single precision modes running on GPUs and double precision running on CPU platforms. For each dataset the maximum value, minimum value, maximum difference and average difference were calculated. The results for a test model are presented in Table 1. This comparison shows that the GPU running in double precision provides the same results as the CPU. As expected, single precision GPU results are less accurate; however, the maximum relative difference is approximately 0.01%. Thus the error introduced by the single precision calculation mode is insignificant.
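A sketch of how such a comparison between a double-precision reference and a single-precision result might be computed over all degrees of freedom is given below; the function is an illustration, not the tool used by the authors.

#include <algorithm>
#include <cmath>
#include <cstddef>

// Maximum (absolute) and average (signed) difference between a double-precision
// reference field and a single-precision test field.
void compareResults(const double* ref, const float* test, std::size_t n,
                    double& maxDiff, double& avgDiff)
{
    maxDiff = 0.0;
    avgDiff = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double d = static_cast<double>(test[i]) - ref[i];
        maxDiff = std::max(maxDiff, std::fabs(d));
        avgDiff += d;
    }
    avgDiff /= static_cast<double>(n);
}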
3.6. Numerical dispersion analysis

The dispersion characteristics of a numerical model differ from those of the underlying mathematical model: the spatio-temporal discretisation introduces additional errors that distort the response of the model. The error depends on the model parameters, thus a quantitative analysis is required to evaluate settings that keep the accuracy of the response at the desired level. The analysis may be carried out in a spectral framework, typical for FD schemes [2]. For the ideal case of a linear, isotropic, homogeneous, unbounded solid medium, the basic dispersion relationships are

k_{L,T} = \frac{\omega}{V_{L,T}}

where k_{L,T} denotes the wavenumber of the longitudinal or transversal wave respectively, V_{L,T} is the corresponding wave speed and \omega is the angular frequency.


Fig. 9. Dispersion relationship for the 3-D LISA model for different Courant numbers. (L)-longitudinal wave mode, (T)-transversal wave mode.

Fig. 10. Relative error in wavenumber as a function of frequency. (L)-longitudinal wave mode, (T)-transversal wave mode.

Depending on the discretisation procedure applied, the numerical dispersion relationship differs from this ideal form; it is typically calculated numerically from the governing iteration equations. Assuming propagation in one direction, the dispersion relationships shown in Fig. 9 are obtained for the 3-D LISA model for different discretisation parameters. The cell size was assumed to be \Delta x = 1 mm, while the time step was calculated according to the Courant number \gamma = \{0.2, 0.4, 0.6, 0.8, 1.0\}, using the formula

\Delta t = \gamma \frac{\Delta x}{V_m}

where V_m is the model velocity calculated as

V_m = \sqrt{V_L^2 + V_T^2}.
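The time-step selection implied by the two formulas above is illustrated by the short sketch below; the wave speeds are assumed values for demonstration only.

#include <cmath>
#include <cstdio>

int main()
{
    const double dx = 1.0e-3;                    // cell size: 1 mm
    const double VL = 6000.0, VT = 3000.0;       // assumed longitudinal/transversal speeds (m/s)
    const double Vm = std::sqrt(VL * VL + VT * VT);
    const double gammas[] = {0.2, 0.4, 0.6, 0.8, 1.0};

    for (double gamma : gammas) {
        double dt = gamma * dx / Vm;             // dt = gamma * dx / Vm
        std::printf("gamma = %.1f -> dt = %.3e s\n", gamma, dt);
    }
    return 0;
}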

As can be seen, the numerical model diverges from the analytical solution with increasing frequency and wavenumber, which is a direct consequence of the temporal and spatial discretisation. The numerical characteristics do not coincide with the ideal case even for \gamma = 1, the stability limit, and for decreasing \gamma the rate of divergence increases. Two distinct divergence rates, for the longitudinal and transversal wave modes, are clearly visible. Transversal waves diverge faster due to the smaller effective Courant number for this mode than for the longitudinal case. Fig. 10 illustrates the relative errors in wavenumber corresponding to the curves presented in Fig. 9.


Fig. 11. A close-up of the numerical dispersion curves in range 300–500 kHz (left column) and a zoom around 300 kHz (right column), for longitudinal (top row) and transversal (bottom row) modes.

Relative errors in wavenumber estimation can reach 50% for both modes; however, the error grows faster for the transversal mode. It should be noted that, due to the shorter wavelength of transversal modes, finer spatial discretisation is required to obtain the same level of accuracy as for the longitudinal case. A quantitative estimate of the error introduced by the numerical scheme is obtained through analysis of the dispersion curves in the upper frequency range that may be excited with the assumed cell size \Delta x. Fig. 11 shows a close-up of the dispersion curves in the range 300–500 kHz and a zoom around 300 kHz for the longitudinal (top row) and transversal (bottom row) modes. The differences in wavenumber accuracy for the two cases are clearly visible. However, as indicated by the zoom around 300 kHz, the sensitivity of the transversal mode to the \gamma parameter is significantly smaller, which is an immediate consequence of the relatively lower effective Courant number for this mode. The relative error calculated for the frequency range 0–500 kHz, assumed as the maximum frequency range that might be excited for \Delta x = 1 mm, is presented in Fig. 12. For the lowest value of the Courant parameter, the relative error does not exceed 1% for the longitudinal mode and 5% for the transversal component. The outlined analysis indicates that the numerical errors introduced by the numerical scheme influence the solution more than the assumed simplifications in the computing precision. Although the dispersion errors are significant, a detailed analysis of their influence on wave amplitudes and phases should still be carried out.


Fig. 12. Relative error in wavenumber for frequency range 0–500 kHz.

Table 2
Summary of CUDA devices used in the computational tests.

GPUs:            Memory (MB):    Number of CUDA cores:
GTX 260          896             216
GTX 460          1024            336
GTX 470          1280            448
GTX 680          2048            1536
Tesla M2050      3072            896

CPUs:            Number of cores:
Core2Duo E6400   2
i7-2600K         4
i7-960           4
Xeon X5670       12
Xeon E5-2650     8

4. Numerical results of wave propagation

The performance of the developed solver was verified using different hardware platforms, i.e. GPUs and CPUs, on a test object discretised with varying accuracy. This section provides the numerical results obtained for various computational platforms and model configurations. First, a detailed description of the hardware platforms used is given, and the GPU architecture is briefly discussed with the aim of connecting hardware-related aspects with the particular case of the wave equation discretised with the LISA method. Then, the test object, the model configurations used in the verification test and the staircase problem related to the LISA spatial grid are described. Finally, numerical results for the GPU and CPU platforms are reported and discussed.

4.1. Computational test platforms

The GPU-based LISA solver operates on single or multiple CUDA devices. It may also operate on regular CPUs, taking advantage of multiple cores. To compare the performance of the CPU and GPU implementations of the solver, the various devices listed in Table 2 were used. The CUDA devices available in the test platforms differ in the number of CUDA cores and the available on-board memory; other parameters not listed, such as core and memory clocks, may be found in the technical specifications.


Fig. 13. 3-D cylinder test object.

These hardware parameters result in different processing powers, measured by the number of fused multiply–add floating-point instructions executed per second for single and double precision computations. This feature of GPUs is particularly important, since the discretised equations of motion require a number of multiplications and additions. Therefore the iteration equations of the LISA algorithm have been re-designed and optimised in terms of these GPU capabilities.

4.2. GPU architecture

A single GPU consists of a number of multiprocessors that are responsible for executing many threads. Each multiprocessor consists of an array of cores responsible for arithmetic operations. Usually a CUDA device is equipped with one GPU, and the number of CUDA cores of the device (shown in Table 2) is the number of multiprocessors multiplied by the number of cores per multiprocessor. CUDA kernel routines are executed on a GPU grid composed of independent groups of threads called thread blocks. At runtime, a thread block is divided into warps of 32 threads for execution on the cores of a multiprocessor. All threads in a single warp execute exactly the same instructions on different data (the SIMD execution model). The maximum number of threads effectively running in parallel depends directly on the number of cores per multiprocessor. Moreover, it depends on the number of active warps, which is physically limited by several attributes of the GPU architecture, e.g. the maximum thread block size or the total number of 32-bit registers per multiprocessor. In the case of the LISA iteration equations, the complexity of the iteration equation for a single grid point requires a number of registers to represent the variables and the memory addresses of the data being accessed. When the number of registers required to compute the equations is greater than the number of available registers, some warps must stay inactive, resulting in a drop of the total calculation efficiency. This issue is addressed in detail in Section 5 of this paper.

The coefficients in the LISA iteration equations are functions of material properties and discretisation parameters, and typically do not change during the simulation. These coefficients may be calculated from the basic parameters, increasing computational time but reducing memory consumption, or pre-calculated, stored and accessed by the solver; in the latter case additional memory resources are needed, but the calculation time is lower. The results presented in this work were obtained for the solver based on Eqs. (6)–(8), optimised for the isotropic material model (two material constants) and the 64-bit architecture.

4.3. 3-D test object

The performance of the solver was examined using models of different sizes calculated on various hardware platforms. The test model was a thick-walled pipe (Fig. 13) with an outer diameter of 100 mm, an inner diameter of 60 mm and a height of 70 mm. The model was discretised using various cell sizes, resulting in various degrees of geometrical approximation accuracy. Cell sizes were chosen to fit the model into the available system memory and resulted in total cell counts from approximately 0.4 to 600 million cells. The largest model provided the best shape representation; however, it required more than 65 GB of regular memory and could not be calculated on the available GPU devices. It is worth noting that the model size increases drastically, following a cubic relationship with the inverse of the cell size.
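The cubic growth of model size with mesh refinement can be illustrated with the rough estimate below, which computes the cell count of the pipe and the corresponding memory for the displacement and material arrays; the per-cell byte count (nine displacement values plus three material constants in double precision) is an assumption, and the air layer and auxiliary buffers are ignored.

#include <cstdio>

int main()
{
    // Thick-walled pipe: outer diameter 100 mm, inner diameter 60 mm, height 70 mm.
    const double ro = 0.050, ri = 0.030, h = 0.070;
    const double volume = 3.14159265358979 * (ro * ro - ri * ri) * h;   // m^3

    const double cellSizes[]  = {1.0e-3, 0.5e-3, 0.25e-3};              // m
    const double bytesPerCell = (9 + 3) * 8.0;   // assumed: 9 displacement + 3 material doubles

    for (double dx : cellSizes) {
        double cells = volume / (dx * dx * dx);                         // grows as 1/dx^3
        double mib   = cells * bytesPerCell / (1024.0 * 1024.0);
        std::printf("dx = %.2f mm -> %.1f million cells, ~%.0f MB\n",
                    dx * 1e3, cells / 1e6, mib);
    }
    return 0;
}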


Fig. 14. Outer edges for the 3-D model for 10 million (left) and 100 million (right) elements.

Fig. 15. Out of plane displacement component gathered at test object’s face for different element sizes.

Two groups of models were built. The first group consisted of model sizes that allowed the investigation of the general solver performance for different GPU and CPU configurations; this group comprised model sizes of 2.5, 5, 7.5, 10, 20, 31, 49 and 88 million elements. The second group was chosen to test the multi-GPU capabilities of the solver and comprised 0.4, 3, 23 and 45 million cells. The model sizes were selected to test the domain decomposition algorithm in terms of memory consumption and calculation time.

4.4. Discretisation accuracy and the 'staircase' problem

The test object was chosen to investigate the discretisation accuracy and the so-called 'staircase' problem, common to all finite difference and similar grid-based methods. The geometry under investigation was discretised using various cell sizes, resulting in models consisting of 2.5, 5, 7.5, 10, 20, 30, 49, 88 and 100 million cells that were calculated using GPUs and CPUs. The outer edges of the considered shape for two different model sizes are presented in Fig. 14. For the 10 million-element model the representation of the shape is relatively good, and it improves for bigger models; for the 100 million-element model the 'staircase' effect is hardly visible.

For an excitation frequency equal to 300 kHz, the corresponding wavelengths are approximately 20 mm and 10 mm for the longitudinal and transverse modes respectively. According to the dispersion analysis, the relative error in wavenumbers is less than 1.5% for Δx = 1 mm. Thus, for Δx = 1 mm and smaller values, the main source of error can be attributed to the staircase representation of the test object. The influence of the spatial discretisation on the response at a selected point on the pipe's face is presented in Fig. 15. Significant differences in the displacement responses are visible for different cell sizes; for smaller elements the response converges to a common shape. It should be noted that the error introduced by the staircase problem is more severe than the other error sources. Displacement patterns obtained for different cell sizes are presented in Fig. 16.
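The staircase representation arises directly from the cell-based discretisation of the curved boundary. A sketch of how a cell list for the pipe might be generated – keeping cells whose centres fall inside the annular wall – is shown below; this is an assumed pre-processing step, not the authors' tool.

#include <cmath>
#include <cstdio>
#include <vector>

struct CellIdx { int ix, iy, iz; };

int main()
{
    const double ro = 0.050, ri = 0.030, h = 0.070;   // pipe: OD 100 mm, ID 60 mm, height 70 mm
    const double dx = 1.0e-3;                         // cell size (1 mm)
    const int n  = static_cast<int>(2.0 * ro / dx);
    const int nz = static_cast<int>(h / dx);

    std::vector<CellIdx> cells;
    for (int iz = 0; iz < nz; ++iz)
        for (int iy = 0; iy < n; ++iy)
            for (int ix = 0; ix < n; ++ix) {
                // Cell centre relative to the pipe axis.
                double x = (ix + 0.5) * dx - ro;
                double y = (iy + 0.5) * dx - ro;
                double r = std::sqrt(x * x + y * y);
                if (r >= ri && r <= ro)               // keep cells inside the annular wall
                    cells.push_back({ix, iy, iz});
            }

    // The jagged outline of this cell set is the 'staircase' approximation of the cylinder.
    std::printf("%zu cells at dx = %.2f mm\n", cells.size(), dx * 1e3);
    return 0;
}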


Fig. 16. Wave propagation displacement patterns for cell sizes equal to (from left): 1x = {1; 0.5; 0.25} mm.

Fig. 17. Number of iterations per second for the 32-bit, double precision implementation.

4.5. Numerical results for GPUs—general solver performance

The first set of models was calculated on GPUs using the 32- and 64-bit architectures in single and double precision modes. The results, together with the amount of GPU memory used, are given in Fig. 17. These results present the number of iterations per second, calculated for different CUDA devices and model sizes from 2.5 million to 88 million cells.

The results in Fig. 17 show that, in the 32-bit, double precision implementation, the GTX 260 device can be used only for the smallest models, i.e. 2.5 and 5 million cells, requiring 281 and 555 MB of memory respectively. The model of 7.5 million elements takes 823 MB of memory and theoretically should fit into the 850 MB of space available on the GTX 260 device; in practice, however, the memory allocation failed, as this device is not able to allocate the requested contiguous memory blocks even if non-contiguous memory is available. In contrast, the GTX 460 device handles the 7.5 million-element model, while the GTX 470 device is also capable of handling the 10 million-element model. A single Tesla module is able to run the 20 million-element model, whereas two Tesla modules are needed to run the 31.1 million-element model. Both the 49 and 88 million-element models require more than 4 GB of memory and cannot be run using the 32-bit implementation.

The GTX 260 device with 216 CUDA cores runs 15.7 iterations per second for the 2.5 million-element model, and almost two times slower (8.3 iterations per second) for the two times bigger model (i.e. 5 million elements). The GTX 460 device with 336 cores achieves a much better result, running 9.4 iterations per second for the 7.5 million-element model. As expected, the GTX 470 device and the single Tesla M2050, with the same GPU and 448 CUDA cores, achieve similar performance.


Fig. 18. Number of iterations per second for the 64-bit, double precision implementation.

The performance differences observed for these two devices are negligible. For the set of test models, a slight advantage of the Tesla M2050 device was observed, running 2%–4% faster (depending on model size) than the GTX 470 device. These results confirm that mainstream graphics cards can be utilised very efficiently for scientific calculations; very expensive professional computing modules do not need to be used. However, the Tesla computing modules: (1) provide ECC technology assuring high reliability of results; (2) are designed to be stable during many hours of continuous 100% load; (3) deliver mechanisms simplifying the development of multi-device algorithms, i.e. the Unified Memory Architecture; (4) can be easily scaled up in server solutions. Features (1) and (2) are invaluable in systems where data corruption cannot be tolerated under any circumstances, but the authors want to emphasise that no data corruption has been observed on mainstream graphics cards as long as correct thermal conditions were met (in the case of strong overclocking, the risk of receiving incorrect results rises drastically). Feature (4) is important when searching for large processing power, but it simply leads to higher costs. Moreover, a single server node is usually equipped with two Tesla computing modules, and inter-node connections are much slower than peer-to-peer connections between modules; this affects the overall performance and complicates task distribution. These facts and the obtained results lead to the conclusion that a high-end PC with 2–4 mainstream graphics cards (e.g. four GTX 680) is a low-cost, high-performance solution. The fastest hardware configuration, two Teslas M2050 (896 CUDA cores), runs approximately 4.5 times faster than the slowest GTX 260 and approximately 1.95 times faster than a single Tesla.

The 64-bit implementation allows for running bigger models, i.e. exceeding 4 GB per CUDA device, as demonstrated in Fig. 18. With the 64-bit implementation, two Teslas are able to run the 49 million-element model, requiring 5225 MB of memory, which is close to the total memory available on both devices (i.e. 5376 MB); the 88 million-element model does not fit into the memory. The performance of the 64-bit implementation is reduced for all CUDA devices; only the GTX 260 device with Compute Capability 1.3 runs as slowly as in the 32-bit implementation. This is an immediate result of the complexity of the CUDA iteration kernel and the number of 32-bit registers available on the device per block. The performance drops to 74%–75% in the case of the GTX 460, 69%–70% in the case of the GTX 470 and 64%–67% in the case of the single and dual Tesla M2050 configurations, depending on model size. The slight performance drop of the Tesla modules compared with the GTX 470 was attributed to the memory ECC being enabled.

In single precision, the solver needs about 56% of the memory required for double precision computations, as illustrated in Figs. 19 and 20. The performance of the 32- and 64-bit implementations in single precision mode is almost the same and reaches 152 iterations per second for the 2.5 million-element model. The 49 million-element model requires 2916 MB of CUDA device memory, which can be allocated on two Teslas; however, the geometry of such a model requires more than 2 GB of CPU memory and cannot be allocated in the 32-bit implementation. The 88 million-element model requires 5213 MB and takes almost all of the memory of two Teslas.
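The iterations-per-second figures reported in this section can be measured with CUDA events; a minimal timing sketch, with assumed kernel and configuration names, is shown below.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int step = 0; step < numSteps; ++step) {
    iterateKernel<<<gridDim, blockDim>>>(dUprev, dU, dUnext, dMaterial, nx, ny, nz);
    // ... sensor saving, excitation and pointer rotation as in Section 3.2 ...
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                  // elapsed time in milliseconds
double iterationsPerSecond = numSteps / (ms / 1000.0);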


Fig. 19. Number of iterations per second for the 32-bit, single precision implementation.

Fig. 20. Number of iterations per second for the 64-bit, single precision implementation.

The average performance of all CUDA devices employed in this analysis is summarised in Fig. 21. The results indicate the advantage of the single precision mode and the disadvantage of the 64-bit implementation in the case of double precision.

4.6. Numerical results for CPUs—general solver performance

The solver may also be used on regular CPUs, taking advantage of multiple cores by running several threads in parallel. The memory organisation of the CPU implementation is analogous to that of a single GPU (see Section 3.3.1). It should be emphasised that the authors made no specific effort to fit the implementation to any particular memory architecture or to enforce any thread-to-core affinity. The development effort was focused on the CUDA-enabled solution, optimised for execution on modern GPUs. Hence, the CPU results are presented here to compare the GPU and CPU implementations and to show how the current implementation fits into existing CPU architectures.


Fig. 21. Number of cells calculated per second for various implementations tested.

Fig. 22. Number of iterations per second for various CPU configurations. Results for the 64-bit implementation and double precision.

Firstly, the performance of the available CPUs was investigated for different numbers of threads and model sizes, using physical CPU cores (with Hyper-Threading disabled where applicable). The results are presented in Fig. 22. As expected, the Core2Duo CPU is the slowest in the 1- and 2-thread executions and the algorithm does not scale linearly on this CPU. Even worse scalability was observed when executing 6 and 12 threads in parallel on 6 and 12 cores of two Xeon X5670 CPUs. For the Xeon running 6 threads in parallel, the calculation time was slightly longer than for the 4-thread execution on the i7-960 CPU; this can be attributed to differences in core clock speed. The best performance results were obtained for the i7-2600K CPU, especially for small models. The results obtained using two Xeon X5670 CPUs with 12 threads are only slightly better than the results for the i7-960 using 4 threads, because the solver implementation is not tuned for the NUMA (Non-Uniform Memory Access) architecture present on dual-CPU server nodes. On the hardware level, a single CPU directly accesses half of the installed memory; accessing the memory associated with the other processor is much slower. Thus, the memory access time depends on the memory location relative to a processor. The current CPU implementation of the LISA solver allocates memory independently of the executing threads and processor cores and divides the model equally among the requested number of threads; due to the lack of thread-to-core affinity, any thread may end up accessing the whole memory range. As a result, some of the 12 threads require more calculation time than others, and the simulation step is finished only when all threads complete their calculations.
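To make the memory-access issue concrete, the sketch below shows the kind of equal, affinity-agnostic split of the cell range among worker threads described above, together with an optional thread-to-core pinning step (the GNU/Linux-specific pthread_setaffinity_np call) that would keep each thread close to its slice of the data on one NUMA node. It is an illustration only, not the authors' implementation; the function and variable names are hypothetical.

```cuda
// Minimal sketch (not the authors' code): split the cell range equally among
// worker threads and, optionally, pin each thread to one physical core so that
// its slice of the model stays local to one NUMA node (GNU/Linux-specific call).
#include <pthread.h>
#include <sched.h>
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-cell update standing in for one LISA iteration over a slice.
static void iterate_cells(float* u_next, const float* u, const float* u_prev,
                          std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        u_next[i] = 2.0f * u[i] - u_prev[i];     // placeholder update, not the LISA stencil
}

static void cpu_step(float* u_next, const float* u, const float* u_prev,
                     std::size_t n_cells, unsigned n_threads, bool pin_threads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = (n_cells + n_threads - 1) / n_threads;

    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min<std::size_t>(n_cells, t * chunk);
        const std::size_t end   = std::min<std::size_t>(n_cells, begin + chunk);
        workers.emplace_back([=] { iterate_cells(u_next, u, u_prev, begin, end); });

        if (pin_threads) {                       // optional NUMA-friendly pinning
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(t, &set);                    // thread t -> core t
            pthread_setaffinity_np(workers.back().native_handle(), sizeof(set), &set);
        }
    }
    // The step is finished only when the slowest thread completes its slice.
    for (auto& w : workers) w.join();
}
```

Without the pinning step this mirrors the behaviour described above: each thread may touch memory attached to either socket, so the slowest thread dictates the step time.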


Fig. 23. Performance of the 600 million-cell model executed on twelve CPU cores: time needed to calculate a single step by 12 threads on 12 cores of two Xeon X5670 CPUs (double precision). Memory required: 65 260 MB.

This load-imbalance effect is clearly visible in Fig. 23, which shows an example of the execution of 10 simulation steps of a very large model (approx. 600 million cells) occupying approximately 65 GB of regular memory. Such memory access problems were not observed for GPU executions, since GPUs are based on the Single Instruction Multiple Data (SIMD) architecture and perform the same operation on multiple data simultaneously, exploiting data-level parallelism.

4.7. Multiple GPU calculations

The performance of the solver operating on a multi-GPU platform was examined using test models consisting of 0.39, 3, 23 and 45 million cells. The model sizes were selected to verify the efficiency of the domain decomposition method implemented to subdivide the total model into a number of parts that are calculated by separate CUDA devices. The models were calculated using 1, 2, 3 and 4 GTX 680 devices with the 64-bit implementation in single and double precision modes.

Fig. 24 shows the memory requirements for the models in single precision mode. The results show a slight increase in memory consumption caused by the domain decomposition algorithm: the redundant cells required by the algorithm are copied and stored on different devices. This is clearly visible for the small models (i.e. the 0.39 and 3 million-element models); for larger models the effect is negligible. The largest, 45 million-element model exceeds the on-board memory of a single device and can be computed only when at least two GPUs are used. Calculation times of a single time step for the different models on the multi-GPU platform are presented in Fig. 25. A significant increase in calculation time was observed for the smallest model due to the large number of redundant cells and long peer-to-peer copying times. The domain decomposition method is superior for large models, with the speed-up becoming linear as the model size increases. A similar, linear relationship between calculation time and model size was found for each platform, as summarised in Fig. 26.

Following the above analysis, all models were also calculated in double precision mode. The memory requirements for the various models are presented in Fig. 27. Again, the increase in memory consumption due to the redundant overlapping layers is most visible for the smallest models. The models require approximately twice as much memory in double precision as in single precision. The 23 million-element model could not be calculated by a single device, and the 45 million-element model did not fit into the 2-GPU platform; this is due to the increased memory consumption. Average calculation times of a single time step for double precision calculations are presented in Fig. 28. The increased amount of data transferred between devices for a small model results in increased average calculation times per time step on the 4-GPU platform. The most significant drop in calculation time is obtained for a setup that exhibits the smallest ratio of redundant cells to the initial number of cells.
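The domain decomposition described above requires the redundant (overlapping) cell layers to be refreshed on neighbouring devices after every time step. A minimal sketch of such an exchange is given below, assuming a simple 1-D split and peer-to-peer access between the devices; the structure, buffer layout and names are illustrative only and do not reproduce the authors' implementation.

```cuda
// Minimal sketch (illustrative only): refresh the overlapping "halo" layers
// between neighbouring sub-domains after each time step using peer-to-peer
// copies between CUDA devices. A simple 1-D split is assumed, and the halo of
// each sub-domain is assumed to be stored directly after its owned cells.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct SubDomain {
    int     device;       // CUDA device owning this sub-domain
    float*  u;            // displacement array allocated on that device
    size_t  owned_cells;  // cells updated by this device
    size_t  halo_cells;   // redundant cells mirrored from the neighbour
};

// Enable direct peer access between neighbouring devices where the hardware
// allows it; cudaMemcpyPeerAsync otherwise falls back to staging through host memory.
void enable_peer_access(const std::vector<SubDomain>& parts)
{
    for (std::size_t i = 0; i + 1 < parts.size(); ++i) {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, parts[i].device, parts[i + 1].device);
        if (can) {
            cudaSetDevice(parts[i].device);
            cudaDeviceEnablePeerAccess(parts[i + 1].device, 0);
            cudaSetDevice(parts[i + 1].device);
            cudaDeviceEnablePeerAccess(parts[i].device, 0);
        }
    }
}

// Copy the last owned layer of 'src' into the halo region of 'dst'.
void exchange_halo(const SubDomain& src, const SubDomain& dst, cudaStream_t stream)
{
    const size_t bytes   = src.halo_cells * sizeof(float);
    const float* src_ptr = src.u + (src.owned_cells - src.halo_cells);
    float*       dst_ptr = dst.u + dst.owned_cells;   // halo placed after owned cells (assumption)
    cudaMemcpyPeerAsync(dst_ptr, dst.device, src_ptr, src.device, bytes, stream);
}
```

Only the thin halo layers travel between devices at each step, which is why the overhead becomes negligible for large models, as observed above.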


Fig. 24. Memory requirements for various models calculated using the multi-GPU platform and single precision mode.

Fig. 25. Average step calculation times for various models calculated on the multi-GPU platform; single precision mode used.


Fig. 26. Average time-step calculation times for different numbers of GPUs on the multi-GPU platform; single precision mode used.

Fig. 27. Memory requirements for various models calculated using the multi-GPU platform and double precision mode.

Fig. 29 summarises the average calculation times of a single time step for various model sizes calculated using different multi-GPU platforms. A linear relationship is observed with increasing model size for each platform, as well as for a constant model size with an increasing number of CUDA devices.

4.8. Comparison with the finite element model

To illustrate the efficiency of the simulation platform used, an FE model of the cylinder was built. The model consisted of 367 080 eight-noded hexahedral elements and 400 320 nodes; the target element size was set to 1 mm.


Fig. 28. Average step calculation times for various models calculated on the multi-GPU platform; double precision mode used.

Fig. 29. Average time-step calculation times for different numbers of GPUs on the multi-GPU platform; double precision mode used.

Analysis was performed using the explicit time integration scheme with a central difference operator and a lumped mass matrix. The analysis time step was set to 0.1 µs to satisfy the stability condition. The FE model was calculated using a single core of the Xeon E5 2650 CPU and using four GTX 680 GPUs; the calculation times of a single time step were 42.7 s and 47 s, respectively. Displacement patterns of the FE model compared to the GPU-based LISA solution – for the cell size Δx = 0.25 mm – are presented in Fig. 30. The maps show the magnitude of the displacement vector normalised to the excitation amplitude.
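As a rough consistency check only (the longitudinal wave speed is assumed here to be of the order of $c_L \approx 6000$ m/s, a typical value for structural metals rather than the exact material data used above), the usual explicit-integration stability estimate and the ratio of the two spatial discretisations give
\[
\Delta t \lesssim \frac{L_e}{c_L} = \frac{1\ \text{mm}}{6000\ \text{m/s}} \approx 0.17\ \mu\text{s},\qquad
\left(\frac{1\ \text{mm}}{0.25\ \text{mm}}\right)^{3} = 64,
\]
so the chosen 0.1 µs step lies safely below the estimate, while the LISA grid places roughly 64 cells in the volume of a single 1 mm finite element, illustrating the difference in spatial resolution between the two models.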


Fig. 30. Comparison of the GPU-based LISA solution (left), and the FE model (right).

A direct quantitative comparison of the displacement fields (Fig. 30) is rather cumbersome, thus a qualitative inspection was performed. Almost identical displacement patterns were obtained in both models. The displacement fields on the flat top surface and the cylindrical outer surface are very similar, although slightly higher amplitudes were observed in the LISA model. Nevertheless, the same wave components and packets can be distinguished in both cases. The similarity of the displacement patterns on the cylindrical surface provides a good indicator of the accuracy of the solution, because that field results from the superposition of waves travelling directly from the excitation point through the bulk medium and surface waves distorted by the interaction with the cylinder's edge. As can be seen from Fig. 30, the staircase effect associated with the discretisation of the circular edge does not introduce significant wavefield distortion.

5. Further comments

It is worth noting that CUDA computing devices have grown out of graphics cards, whose main goal is to process huge amounts of data in parallel using simple algorithms (e.g. transforming the vertices of millions of triangles) for visualisation purposes, where single precision is accurate enough. These devices are designed to run as many light-weight threads in parallel as possible; when the thread code becomes complex, the overall computing efficiency drops. The LISA iteration equations are quite complex in terms of the amount of different data involved, i.e. displacement components for different time steps, material properties and other parameters. A single iteration requires many simple computations, which fits the CUDA device architecture well, but it needs data from various arrays of adjacent cells. The need to compute large models forces the amount of input data kept in memory to be minimised and prevents intermediate results shared by many cells from being stored; they have to be re-computed instead. Re-computing intermediate results is not a problem for a CUDA device, but it increases the already large number of memory-address computations needed to fetch data from memory. Thus, reused data array indices, explicit and implicit pointers to the data and intermediate results should be placed in the hardware registers of the CUDA device by the compiler when generating the GPU code, to speed up the execution of the kernel routine. The compiler does this automatically during code optimisation. If the hardware registers are exhausted, the compiler spills them to much slower local memory, resulting in a significant performance drop.

Each multiprocessor on a CUDA device has a set of 32-bit hardware registers; a device with Compute Capability 2.x has 32,768 such registers per multiprocessor. Each CUDA kernel routine can be compiled to allocate a certain maximum number of hardware registers and can be executed with a certain number of threads per block. Reducing the number of registers used per thread maximises the occupancy of the multiprocessor. 100% occupancy means that all warps, each consisting of 32 threads, are active and, running in parallel, fully exploit the hardware resources of the GPU. Increasing the number of registers used by the kernel routine decreases warp occupancy (fewer warps are active per multiprocessor) and may decrease the maximum number of threads per block. The thread blocks for the various kernel routines are organised as follows.
For the Iterate CUDA kernel routine, responsible for the displacement calculations, the optimal number of threads per block for a given GPU architecture is first estimated. The number of cells in the model is then divided by the number of threads per block, and the resulting number of blocks is arranged as a two-dimensional grid.
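A minimal sketch of how such a launch configuration might be derived is given below; the kernel body and all names are illustrative only and do not reproduce the solver's actual interface. The block count is spread over two grid dimensions so that neither exceeds the 65 535 per-dimension limit of Compute Capability 2.x devices.

```cuda
// Minimal sketch (illustrative): derive a 2-D grid of 1-D thread blocks for a
// cell-update kernel so that very large cell counts respect the 65 535
// per-dimension grid limit of Compute Capability 2.x hardware.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Hypothetical iteration kernel: one thread updates one cell.
__global__ void Iterate(float* u_next, const float* u, const float* u_prev,
                        size_t n_cells)
{
    // Recover a linear cell index from the 2-D grid / 1-D block configuration.
    const size_t block_id = static_cast<size_t>(blockIdx.y) * gridDim.x + blockIdx.x;
    const size_t cell     = block_id * blockDim.x + threadIdx.x;
    if (cell < n_cells)
        u_next[cell] = 2.0f * u[cell] - u_prev[cell];   // placeholder update, not the LISA stencil
}

void launch_iterate(float* u_next, const float* u, const float* u_prev,
                    size_t n_cells, unsigned threads_per_block)
{
    // Total number of blocks needed to cover all cells with one thread per cell.
    const size_t blocks  = (n_cells + threads_per_block - 1) / threads_per_block;
    const unsigned max_x = 65535u;                      // per-dimension limit on CC 2.x
    dim3 grid(static_cast<unsigned>(std::min<size_t>(blocks, max_x)),
              static_cast<unsigned>((blocks + max_x - 1) / max_x));

    Iterate<<<grid, threads_per_block>>>(u_next, u, u_prev, n_cells);
}
```

A call such as launch_iterate(d_next, d_curr, d_prev, n_cells, 512) then covers every cell with exactly one thread, with out-of-range threads returning immediately.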


Fig. 31. The occupancy of the Compute Capability 2.x CUDA Device for various kernel configurations.

Finally, the Iterate kernel routine is called with this two-dimensional grid of blocks and a one-dimensional thread configuration within each block. For the other, simpler kernel routines a one-dimensional grid of blocks is used, or – for the simplest – a single block of t threads. The effective warp occupancy may be calculated with the CUDA Occupancy Calculator developed by Nvidia. The results obtained – assuming various code complexities – are presented in Fig. 31. They show that 100% warp occupancy can be achieved with 20 registers per thread in several configurations, e.g. for 768 threads per block. In practice, however, only simple kernels fit into 20 registers per thread. By default the compiler produces kernel code using up to 32 registers per thread, resulting in a maximum of 2/3 multiprocessor warp occupancy for several thread-per-block configurations, including the maximum number of threads per block, which is 1024 for devices with Compute Capability 2.x. When the kernel code becomes more complicated and the number of spill loads and stores to local memory increases, the compiler can be forced to use up to 63 registers per thread, which results in 1/3 warp occupancy at a maximum of 512 threads per block. The iteration CUDA kernel of the LISA algorithm is very complex, and forcing the compiler to use 63 registers per thread produces the fastest code (even with 1/3 warp occupancy at 512 threads per block), while still leaving some spill loads and stores in local memory. In single precision mode every intermediate result is 32 bits wide and can be stored in a single hardware register; in double precision mode two registers are required for every 64-bit value. In the 32-bit implementation every memory pointer can be stored in a single hardware register, while in the 64-bit implementation two registers are required. The latter fact currently has the biggest influence on performance, as shown in the previous sections.

During the solver implementation several attempts to improve the situation described above were undertaken and several optimisation techniques were tried. The most obvious approach, i.e. decomposing the complex kernel routine into many simple kernel routines to achieve better warp occupancy, results in the need to store intermediate results that differ for each simulation cell. The large amount of intermediate data to store, and the fact that the results are not shared between threads, leads to storing them in global memory, which decreases the performance and significantly decreases the maximum size of the model. Similarly, the constant memory cannot be utilised efficiently: it is too small to hold all the constants, or even a subset of them (i.e. the cubes' coordinates and the material constants for each cell), needed during the computation. Finally, only the global CUDA memory has been utilised.
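The occupancy figures quoted above can be reproduced directly from the register budget of a Compute Capability 2.x multiprocessor (32,768 registers and at most 1536 resident threads, i.e. 48 warps):
\[
\frac{32\,768}{63}\approx 520\ \Rightarrow\ 512\ \text{threads}=16\ \text{warps}=\tfrac{1}{3}\ \text{of}\ 48,\qquad
\frac{32\,768}{32}=1024\ \text{threads}=32\ \text{warps}=\tfrac{2}{3}\ \text{of}\ 48,
\]
while 20 registers per thread would admit more than 1536 resident threads and hence 100% occupancy. The register budget of a kernel can be steered either globally with the nvcc option --maxrregcount or per kernel with the __launch_bounds__ qualifier; the fragment below merely illustrates the mechanism and is not the configuration used by the authors.

```cuda
// Illustrative only: steering the register budget of a complex kernel.
// __launch_bounds__(maxThreadsPerBlock) tells the compiler how the kernel will
// be launched so that it can budget registers accordingly; the nvcc options
// --maxrregcount=N and -Xptxas -v (the latter reports register usage and
// spill loads/stores) act on the whole compilation unit instead.
#include <cstddef>

__global__ void __launch_bounds__(512)   // never launched with more than 512 threads per block
IterateComplex(float* u_next, const float* u, const float* u_prev, size_t n_cells)
{
    const size_t block_id = static_cast<size_t>(blockIdx.y) * gridDim.x + blockIdx.x;
    const size_t cell     = block_id * blockDim.x + threadIdx.x;
    if (cell < n_cells)
        u_next[cell] = 2.0f * u[cell] - u_prev[cell];   // placeholder, not the LISA stencil
}
```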


6. Conclusions

The LISA was implemented for wave propagation modelling in complex geometries. Graphical processing units and a parallel algorithm architecture were used in this implementation. The focus of the work presented was on numerical implementation and covered aspects related to the software's modular architecture, computer memory organisation and optimisation. The performance of the proposed simulation framework was tested for numerical models of different sizes, various computing precisions and hardware platforms. The conclusions can be summarised as follows.

The classical LISA iteration equations were reformulated. This allowed for a significant increase in calculation speed due to a balance between parameters calculated on the fly and references to memory locations storing precalculated values. The proposed memory organisation enabled the solver to adapt to the particular hardware used, maintaining an optimal ratio of calculated to stored parameters. Due to the flexible modular structure of the solver, further modifications, adaptation to new hardware platforms and coupling to different numerical methods are possible. Modifications to the original FD-based calculation scheme of the LISA method also covered changes in model gridding and element numbering, which allowed for a significant reduction of memory requirements and an increase in the maximum model size that can be calculated.

To further extend the solver capabilities and increase the maximum model size, a domain decomposition approach was proposed. The method searches for the optimal model subdivision in terms of minimal data transfer between the domains and maximum GPU occupancy, which allowed multi-GPU workstations to be utilised. It was shown that models of up to 270 million degrees of freedom can be calculated using a standard PC equipped with four GPUs. The achieved speed-up was linear with an increasing number of computing devices. Various tests performed for the domain decomposition approach confirmed the minimal increase in memory requirements and the negligible efficiency drop due to the redundant cells stored by the GPUs. Two different data transfer methods were implemented for multi-GPU configurations, which further extended the flexibility of the solver and allowed operation on different hardware configurations. The latter is one of the particular strengths of the simulation framework presented in this work, i.e. the ability of the solver to adapt to the hardware configuration. Maintaining high efficiency is possible due to the proposed memory organisation, the multiple computing devices employed for the calculations, the domain decomposition and the data transfer strategy.

The accuracy and performance of the solver were investigated using a 3-D test object discretised with various element sizes and computing precisions. Two versions, i.e. the 32-bit and 64-bit implementations, were considered. It was shown that for linear analysis the single precision mode was superior in terms of calculation time, while keeping the numerical errors due to the simplified number representation small. The results were compared to finite element solutions. The differences in the displacement patterns were found to be satisfactory, while the calculation time was reduced by more than two orders of magnitude. What is more, significantly higher spatial and temporal resolutions could be obtained when compared to commercial FE solvers for a given hardware configuration. Finally, it was shown that mainstream graphical processing units can be effectively employed for wave propagation simulation, which provides a low-cost, efficient solution for scientific and engineering computations.

Acknowledgement

The fifth author would like to acknowledge the support of the Foundation for Polish Science (project WELCOME No. 2010-3/2).