Computers and Geotechnics 66 (2015) 31–38

Research Paper

A GPU parallel computing strategy for the material point method

Youkou Dong, Dong Wang, Mark F. Randolph

Centre for Offshore Foundation Systems, The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia

Article history: Received 3 September 2014; Received in revised form 16 January 2015; Accepted 16 January 2015; Available online xxxx

Keywords: Material point method; Parallel computing; Large deformations; Structure–soil interaction; Landslide

Abstract

The material point method (MPM), which is a combination of the finite element and meshfree methods, suffers from significant computational workload due to the fine mesh that is required, in spite of its advantages in simulating large deformations. This paper presents a parallel computing strategy for the MPM on the graphics processing unit (GPU) to boost the method’s computational efficiency. The interaction between a structural element and soil is investigated to validate the applicability of the parallelisation strategy. Two techniques are developed to parallelise the interpolation from soil particles to nodes while avoiding a data race; the technique that is based on workload parallelisation across threads over the nodes has a higher computational efficiency. Benchmark problems of surface footing penetration and a submarine landslide are analysed to quantify the speedup of GPU parallel computing over sequential simulations on the central processing unit. The maximum speedup with the GPU used is 30 for single-precision calculations and decreases to 20 for double-precision calculations.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Numerical simulations of large deformation problems, such as the impact of submarine landslides on pipelines and the free-fall of torpedo anchors through water into soil, are technically challenging. Although widely used in geotechnical engineering, traditional finite element (FE) methods are not suitable for large deformation problems due to severe mesh distortion [5,19]. Meshfree methods avoid mesh distortion by spatially discretising the materials with particles [4]. The recently developed material point method (MPM) can be regarded as a combination of the FE and meshfree methods and provides an acceptable balance between computational cost and accuracy for large deformation analyses. The MPM originates from the particle-in-cell method in computational fluid dynamics [10] and was extended to solid mechanics by Sulsky et al. [18]. A limited number of MPM studies have been conducted in geotechnical engineering to investigate the flow of granular materials [2], subaerial landslides [1], and the response of multiphase porous media [22].

Each incremental step in the MPM includes a Lagrangian stage and an Eulerian stage. The materials are discretised as Lagrangian particles, and the history-dependent variables, such as stresses, material properties and velocities, are inherited by the particles.

A background mesh, on which the governing equations are solved, remains unchanged throughout the analysis. Before each Lagrangian phase, the particle attributes are mapped to the nearby nodes of the background mesh to allow the governing equations to be solved. After the Lagrangian calculation, the velocities and accelerations are mapped from the element nodes back to the particles, and the particle positions and the field variables at the particles are then updated for the next step. Because the background mesh is fixed in space, mesh distortion is avoided. The mesh used in the MPM usually needs to be finer than the mesh in conventional FE analysis to achieve similar accuracy, since the material points are not always located at the optimum positions for integration within the element [6,17].

To improve the efficiency of the MPM, parallel computing on the central processing unit (CPU) or graphics processing unit (GPU) has been exploited by a limited number of researchers [11,7]. GPUs feature more cores in the processor, lower thread-scheduling cost and higher memory bandwidth than most commercially available CPUs; as such, they are ideal for arithmetically intensive and highly parallel computations [16]. Currently, most GPU parallelisation is implemented using the ‘compute unified device architecture’ (CUDA) [14], which provides a friendly platform for coding in languages such as C++ and Fortran.

This paper presents a GPU parallel computing strategy for the MPM based on CUDA. Benchmark problems of surface footing penetration and a submarine landslide are analysed to quantify the potential of GPU parallelisation with single- and double-precision floating point numbers.

2. GPU parallelisation of MPM

The GPU parallel computing program developed in this study was coded in C++ on the Windows operating system for quasi-static and dynamic analyses using an explicit integration scheme, and the parallelisation was coupled with the generalised interpolation material point (GIMP) method presented by Bardenhagen and Kober [3]. The GIMP method was integrated into the publicly available Uintah package [9] but can only be used for conventional sequential or parallel computing on CPUs [15]. The GPU parallel computing strategy developed in this study can be combined with any MPM using explicit integration schemes, although its operational procedure will be described in terms of the GIMP method.

To validate the applicability of the GPU computing strategy, the interaction between a structural element and soil was considered. Two Eulerian meshes were configured for the structural element and soil. The structural element was simplified as a rigid body, and the frictional contact was implemented using a technique called ‘Geo-contact’, in which a penalty function was incorporated into the GIMP method to minimise the computational noise in the contact force [12,8]. In the Geo-contact, a limiting shear stress was also specified along the contact interface, irrespective of the normal stress, to quantify structure–clay interactions under undrained conditions. A deformable structural element, such as a slender pile subjected to a horizontal force, can be parallelised using a strategy similar to that for the soil, assuming that the structural element needs to be discretised with a large number of material particles.

The definitions of the stresses and strains followed finite strain theory, taking into account the incremental rotation of the configurations between time steps for objectivity; the stresses were measured with the Cauchy stress and updated with the Jaumann rate, and the strains were calculated with the logarithmic strain and updated with the deformation rate.

2.1. Functions of MPM

Prior to implementing the GPU parallel computing strategy, the main functions within each incremental step and the manipulations involved in the parallel computing are addressed briefly.

(i) The time step starts with the function ‘Initialisation of nodal variables’, which initialises the nodal variables of the rigid body and soil, such as the masses, velocities, momenta, normal directions and internal and external forces.

(ii) The functions ‘Interpolation from rigid body particles to nodes’ and ‘Interpolation from soil particles to nodes’ are used to interpolate the masses and momenta of the rigid body and soil particles to the corresponding element nodes, respectively, which may be summarised as

m_i = \sum_p S_{ip} m_p    (1)

M_i = \sum_p S_{ip} m_p V_p    (2)

n_i = \sum_p \nabla S_{ip} m_p    (3)

where m_i, M_i and n_i represent the mass, momentum and normal direction, respectively, at node i; m_p and V_p are the mass and velocity, respectively, of particle p; S_{ip} and ∇S_{ip} are the shape function and its gradient, respectively, at node i evaluated at particle p; and Σ_p represents a summation over all related particles. For soil, two additional field variables, the internal and external forces, are obtained at node i:

f_i^{int} = -\sum_p \nabla S_{ip} \sigma_p v_p    (4)

f_i^{ext} = \sum_p S_{ip} f_p^{ext}    (5)

where σ_p, v_p and f_p^{ext} are the stress, volume and external force at soil particle p, respectively.

(iii) The function ‘Calculate nodal velocities and accelerations’ is used to perform the explicit calculations to obtain the velocities and accelerations on the background mesh. At the beginning of the current incremental step, the velocity at the rigid body or soil nodes is

V_i = \frac{M_i}{m_i}    (6)

For the rigid body nodes, the nodal accelerations for the time step are the same everywhere:

a_i = \frac{f^{ext} + f^{cont}}{\sum_p m_p}    (7)

where f^{ext} is the total external force on the rigid body, and f^{cont} is the total contact force on the rigid body from the previous time step. The acceleration for the current time step at soil node i is

a_i = \frac{f_i^{ext} + f_i^{int}}{m_i}    (8)

Then, at the end of the current step, the velocity at the rigid body or soil nodes is

V_i^{new} = V_i + a_i \Delta t    (9)

where Δt is the time increment. V_i^{new} at the soil nodes within the contact region is further adjusted depending on the contact technique employed, and the contact force on the rigid body, f^{cont}, is updated by

f^{cont} = -\frac{\sum_i m_i \Delta V_i^{cont}}{\Delta t}    (10)

where ΔV_i^{cont} is the adjusted velocity at soil node i, and Σ_i represents a summation over all related nodes.

(iv) In the function ‘Update stresses and material properties of soil particles’, the first variable that must be calculated at the soil particles is the deformation gradient

F_p^{new} = \left( I + \sum_i \nabla S_{ip} V_i^{new} \Delta t \right) F_p    (11)

where I indicates the identity matrix. The stresses and material properties then evolve by inputting F_p^{new} into the constitutive model.

(v) The functions ‘Movement of soil particles’ and ‘Movement of rigid body particles’ are used to update the velocities and positions of the particles by mapping the nodal accelerations and velocities:

V_p^{new} = V_p + \sum_i S_{ip} a_i \Delta t    (12)

X_p^{new} = X_p + \sum_i S_{ip} V_i^{new} \Delta t    (13)

where X_p represents the particle coordinates at the beginning of the current step. Essentially, the movements of the particles represent the flow of each material within the background mesh.
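To make function (iii) concrete, the following is a minimal CUDA sketch of how the nodal update of Eqs. (6), (8) and (9) can be parallelised with one thread per soil node, as adopted in the strategy described later in this paper. The kernel name, array layout and the small-mass cut-off are illustrative assumptions rather than the authors' implementation.

```cuda
// Illustrative sketch (not the authors' code): one thread updates one soil node
// via Eqs. (6), (8) and (9). Vector quantities are stored node-by-node in
// flattened global-memory arrays of length numNodes*dim; massTol is an assumed
// cut-off that skips nodes with no contributing particles.
__global__ void updateSoilNodes(int numNodes, int dim, float dt, float massTol,
                                const float* nodeMass,      // m_i,  Eq. (1)
                                const float* nodeMomentum,  // M_i,  Eq. (2)
                                const float* fInt,          // f_i^int, Eq. (4)
                                const float* fExt,          // f_i^ext, Eq. (5)
                                float* nodeVel,             // output: V_i^new
                                float* nodeAcc)             // output: a_i
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numNodes) return;                 // one thread per node, no data race

    float m = nodeMass[i];
    for (int d = 0; d < dim; ++d) {
        int k = i * dim + d;
        if (m > massTol) {
            float v = nodeMomentum[k] / m;     // Eq. (6)
            float a = (fExt[k] + fInt[k]) / m; // Eq. (8)
            nodeAcc[k] = a;
            nodeVel[k] = v + a * dt;           // Eq. (9)
        } else {
            nodeAcc[k] = 0.0f;                 // empty node: no particles nearby
            nodeVel[k] = 0.0f;
        }
    }
}

// Typical launch: updateSoilNodes<<<(numNodes + 255) / 256, 256>>>(...);
```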


2.2. Parallelisation procedure

To evaluate the distribution of the workload across the functions, a sequential CPU simulation of the penetration of a surface footing into an elastic-perfectly plastic soil was performed with a total of 76,800 soil particles and 320 footing particles (the first scenario in Table 1, which will be detailed later). Most of the computational effort is allocated to the functions ‘Interpolation from soil particles to nodes’ and ‘Update stresses and material properties of soil particles’, which account for 30% and 43% of the total workload, respectively. The functions ‘Calculate nodal velocities and accelerations’ and ‘Movement of soil particles’ perform 14% and 13% of the workload, respectively, while the effort for ‘Initialisation of nodal variables’ is minimal. In structure–soil interaction problems, the structure is usually represented by far fewer particles than the soil, so the workloads of the ‘Interpolation from rigid body particles to nodes’ and ‘Movement of rigid body particles’ functions are also minimal.

Table 1. Average runtimes of an incremental step for the footing case.

Number of soil particles | L/B | CPU runtime per increment (ms), double-precision | GPU runtime per increment (ms), single-precision | GPU runtime per increment (ms), double-precision | Runtime ratio, double/single precision
76,800 | 3 | 62 | 3.11 | 3.77 | 1.21
115,200 | 4.5 | 88 | 4.14 | 4.98 | 1.20
245,760 | 9.6 | 187 | 8.11 | 9.39 | 1.16
737,280 | 28.8 | 563 | 20.07 | 24.91 | 1.24
1,474,560 | 57.6 | 1203 | 41.32 | 50.07 | 1.21
2,949,120 | 115.2 | 2625 | 87.63 | 110.16 | 1.26
5,898,240 | 230.4 | 5578 | 180.85 | 232.00 | 1.28

Ideally, all of the functions would be parallelised on the GPU to maximise the speedup of the GPU parallelisation. However, due to the risk of a data race (which will be discussed in the following section) and the trivial benefit from the potential GPU parallelisation, the function ‘Interpolation from rigid body particles to nodes’ was implemented sequentially on the CPU, while the other functions were programmed to be parallelised on the GPU (Fig. 1).

Fig. 1. Parallelisation schemes for the MPM functions.

To reduce the overhead of frequent data transfer between the CPU and GPU, all variables in Eqs. (1)–(13) were retained in GPU memory during the entire simulation. The GPU memory spaces include the global, register and local memories. The variables transferred between most functions, such as the masses, velocities and momenta at the nodes and particles, were allocated to the global memory; therefore, they can be accessed directly by the executing GPU threads. The temporary variables, such as the values of the shape functions, were stored in the register memory of each thread. The local memory serves as a backup of the register memory; once the register memory is fully occupied, the temporary variables are automatically allocated to the local memory.

As the interpolation for the rigid body is completed on the CPU, the corresponding data need to be transferred between the CPU and GPU in each incremental step (see Fig. 1), although the overhead of the data transfer is usually small due to the limited number of rigid body particles. For a deformable structural element, if the number of structural particles is limited, the interpolation from the structure particles to the nodes should be implemented sequentially on the CPU; otherwise, the interpolation can be parallelised on the GPU using a technique similar to that for the soil.
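The memory arrangement described in this section can be illustrated with the hedged CUDA/C++ sketch below: the soil arrays are allocated once and stay resident in GPU global memory for the whole simulation, while only the small rigid-body nodal results interpolated on the CPU cross the PCI Express bus each incremental step. The structure, names and sizes are assumptions for illustration, not the code used in the paper.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative sketch of the Section 2.2 memory strategy (assumed names/sizes,
// not the authors' code). All soil arrays live in GPU global memory for the
// entire simulation; only the rigid-body nodal data produced by the CPU-side
// interpolation are copied to the device every incremental step.
struct DeviceArrays {
    float *partMass  = nullptr;   // soil particle masses, resident on the GPU
    float *partVel   = nullptr;   // soil particle velocities (numParticles*dim)
    float *nodeMass  = nullptr;   // soil nodal masses
    float *nodeMom   = nullptr;   // soil nodal momenta (numNodes*dim)
    float *rigidMass = nullptr;   // rigid-body nodal masses (refreshed each step)
    float *rigidMom  = nullptr;   // rigid-body nodal momenta (refreshed each step)
};

void allocateOnce(DeviceArrays& d, size_t nPart, size_t nNode, int dim) {
    // Called a single time before the incremental loop.
    cudaMalloc((void**)&d.partMass,  nPart * sizeof(float));
    cudaMalloc((void**)&d.partVel,   nPart * dim * sizeof(float));
    cudaMalloc((void**)&d.nodeMass,  nNode * sizeof(float));
    cudaMalloc((void**)&d.nodeMom,   nNode * dim * sizeof(float));
    cudaMalloc((void**)&d.rigidMass, nNode * sizeof(float));
    cudaMalloc((void**)&d.rigidMom,  nNode * dim * sizeof(float));
}

void pushRigidBodyNodes(DeviceArrays& d, const float* hMass, const float* hMom,
                        size_t nNode, int dim) {
    // Called once per incremental step; these arrays are comparatively small
    // because the rigid body is represented by far fewer particles than the soil.
    cudaMemcpy(d.rigidMass, hMass, nNode * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d.rigidMom,  hMom,  nNode * dim * sizeof(float), cudaMemcpyHostToDevice);
}
```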

2.3. Parallelisation of ‘Interpolation from soil particles to nodes’

In the function ‘Interpolation from soil particles to nodes’, the variables are interpolated from each soil particle to the related nodes via Eqs. (1)–(5).

The simplest choice for parallelisation is to decompose the workload across threads over the particles, i.e. the interpolation from each individual particle to its related nodes is configured on a thread. However, a data race may be induced at the common nodes that are shared by neighbouring particles: the variables at the common node are stored in the global memory, and the interpolation operation is scattered over a number of threads; if the outcomes from different threads are delivered back to the global memory concurrently, the expected superimposition of the outcomes, as detailed in Eqs. (1)–(5), is not ensured. Two GPU parallel techniques without a data race were developed and assessed. For clarity, a two-dimensional analysis using square elements is used as an example.

(i) A domain decomposition technique proposed by Huang et al. [11] for the CPU parallelisation of the MPM was expanded to the GPU. The technique comprises three steps:

Step 1: All of the soil elements and particles are decomposed into a series of regular patches in space. A patch contains three elements in each direction and forms a nine-element square. The patches are categorised as four domains with odd and even indexes in the horizontal and vertical directions. The ‘mosaic’ partition is shown in Fig. 2. The particle list in each patch is established on the CPU and saved to the global memory of the GPU.

Step 2: Parallel computing is performed over the patches that belong to a domain. For example, when the parallelisation is across threads over the horizontal-odd vertical-odd patches, the interpolation from the particles to the nodes is conducted in each thread over the particles within a patch. The element patches in the same domain are separated by at least three elements, so that no common nodes are shared even by the nearest patches, as shown in Fig. 2; the data race is thus avoided.

Step 3: The particle list of each patch is updated. The soil particles flow through the fixed mesh, which suggests that the particle lists in the patches may vary between the incremental steps. Conservatively, the lists would be updated at the beginning of each incremental step. However, it usually takes several time steps for a soil particle to flow from its original position to a location where it is related to a neighbouring patch, because the time step in the explicit scheme is small. Thus, the lists can be updated every several dozen steps.

Overall, the performance of the domain decomposition is not entirely satisfactory. Taking the last scenario in Table 1 (which will be described later) as an example, for 6 million soil particles and single-precision calculations, the GPU computing was only 5.4 times faster than the sequential CPU simulation. The reason for this modest gain is that the GPU threads access the particle and nodal variables in the global memory in a random mode, which is 1–2 orders of magnitude slower than the bandwidth of the global memory.

Fig. 2. Domain decomposition of soil elements and particles.

(ii) The workload is decomposed across the threads over the nodes. Chiang et al. [7] suggested this scheme; however, its efficiency was significantly limited by the update frequency of the particle lists, and the performance of their scheme was only assessed for small scale problems (e.g., fewer than 13,000 particles). Our development comprises two steps:

Step 1: Establish and update the particle list for each node. Prior to the parallel computing, a list of particles related to each node is established on the CPU and then saved to the global memory of the GPU. Because the soil particles will flow through the fixed mesh between time steps, the particle lists of the nodes are ideally expected to be updated at the beginning of each time step to guarantee a unique relationship between the node and the related particles, as in [7].

However, the cost of updating the lists at each time step is unacceptably high for large scale problems. Our solution is to expand the list of each node to cover not only the particles that are located within the influence zone of the common node but also the nearest neighbouring particles (Fig. 3). Several time steps are required for the soil particles to flow in or out of the ‘expanded’ zone, while the particles outside the influence zone are not involved in the interpolation operation, which is a function of the distance between the node and the particle. Therefore, the particle lists of the ‘expanded’ zone can be updated periodically at intervals of several time steps. The update frequency will depend on the element size and the characteristics of the problem, and can be determined by trial calculations.

Fig. 3. Particle list of a common node.

Step 2: Parallel computing of the interpolation across the GPU threads over the nodes. In each thread, the interpolation of the variables in Eqs. (1)–(5) from the particles in the ‘expanded’ list to the node is performed sequentially.

This ‘gathering’ operation does not cause a data race. At the end of the interpolation, the updated nodal variables are written to the global memory in a consecutive sequence, which avoids the random memory access of the domain decomposition technique and increases the access speed to the global memory. The efficiency of this technique is much higher than that of the former, and it was used in the following comparisons with the CPU calculations.

The expanded zone extends the immediate influence zone by half of an element in each direction (Fig. 3). The particle lists were updated every 1000 time steps for the surface footing case and every 50 time steps for the landslide case. As a result, the computational effort for updating the particle lists was less than 0.1% of the total runtime of the single-precision GPU parallelisation in the former case, but up to 7% in the latter. The frequency of updating the particle lists could be reduced further if a larger expanded zone were employed, but this would require more global memory space.
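A minimal CUDA sketch of this adopted node-wise gathering scheme is given below. It assumes the expanded particle lists are stored in a compressed layout (listStart/listPart) built on the CPU, with the shape function value S_ip stored per node–particle pair; in a GIMP implementation S_ip would instead be evaluated in the kernel from the node–particle distance. Because each node is handled by exactly one thread, the accumulation of Eqs. (1) and (2) needs no atomic operations. Names and layout are illustrative, not the authors' implementation.

```cuda
// Illustrative gathering kernel (assumed data layout, not the authors' code):
// the expanded particle list of node i occupies listPart[listStart[i]] ..
// listPart[listStart[i+1]-1]; shapeFn stores S_ip for the same pairs.
// Particles outside the influence zone simply contribute S_ip = 0.
__global__ void interpolateSoilToNodes(int numNodes, int dim,
                                       const int*   listStart,
                                       const int*   listPart,
                                       const float* shapeFn,
                                       const float* partMass,  // m_p
                                       const float* partVel,   // V_p (numParticles*dim)
                                       float* nodeMass,        // m_i,  Eq. (1)
                                       float* nodeMomentum)    // M_i,  Eq. (2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numNodes) return;

    float m = 0.0f;
    float mom[3] = {0.0f, 0.0f, 0.0f};        // assumes dim <= 3

    for (int k = listStart[i]; k < listStart[i + 1]; ++k) {
        int   p   = listPart[k];
        float Smp = shapeFn[k] * partMass[p]; // S_ip * m_p
        m += Smp;                             // Eq. (1)
        for (int d = 0; d < dim; ++d)
            mom[d] += Smp * partVel[p * dim + d];  // Eq. (2)
    }

    // One thread owns node i, so the writes below are race-free and land in
    // consecutive global-memory locations across neighbouring threads.
    nodeMass[i] = m;
    for (int d = 0; d < dim; ++d)
        nodeMomentum[i * dim + d] = mom[d];
}
```

The remaining nodal sums of Eqs. (3)–(5) could be accumulated within the same loop.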

2.4. Parallelisation of the other functions

Compared with the ‘Interpolation from soil particles to nodes’ function, the other functions were more straightforward to parallelise across the threads over the nodes or particles. For the parallelisation of these functions, the GPU threads write variables to the global memory in a consecutive sequence, which maximises the speed of memory access.

(i) The function ‘Calculate nodal velocities and accelerations’ was parallelised across threads over the nodes, i.e. the update of the velocities and accelerations of each node was scheduled into a thread. The required information, such as the masses and momenta, pertains only to the particular node of each thread and is not communicated with any other threads. Therefore, there is no risk of a data race.

(ii) The functions ‘Update stresses and material properties of soil particles’ and ‘Movement of soil particles’ were combined in the programming to save calculations of the shape functions. The workload was decomposed over the soil particles, and the stresses and movements of each particle were calculated in a thread. In each thread, the interpolated components, such as the deformation rates, velocities and accelerations, were superimposed from the related nodes one by one. Similarly, the function ‘Movement of rigid body particles’ was also parallelised across threads over the rigid body particles.
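As a hedged companion to item (ii), the sketch below shows how the particle movement of Eqs. (12) and (13) might look with one thread per soil particle gathering from the background nodes that influence it. The fixed per-particle node list and all names are illustrative assumptions; the stress update through Eq. (11) and the constitutive model, which the authors combine into the same kernel, is omitted for brevity.

```cuda
// Illustrative particle-update kernel (assumed layout, not the authors' code):
// each soil particle stores the indices of the nodesPerParticle background nodes
// that influence it, together with the matching shape function values S_ip.
__global__ void moveSoilParticles(int numParticles, int dim, int nodesPerParticle,
                                  float dt,
                                  const int*   partNodes,   // numParticles*nodesPerParticle
                                  const float* partShapeFn, // S_ip, same layout
                                  const float* nodeVel,     // V_i^new
                                  const float* nodeAcc,     // a_i
                                  float* partVel,           // V_p, updated in place
                                  float* partPos)           // X_p, updated in place
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numParticles) return;            // one thread per particle

    for (int d = 0; d < dim; ++d) {
        float dv = 0.0f, dx = 0.0f;
        for (int k = 0; k < nodesPerParticle; ++k) {
            int   i = partNodes[p * nodesPerParticle + k];
            float S = partShapeFn[p * nodesPerParticle + k];
            dv += S * nodeAcc[i * dim + d] * dt;   // Eq. (12)
            dx += S * nodeVel[i * dim + d] * dt;   // Eq. (13)
        }
        partVel[p * dim + d] += dv;
        partPos[p * dim + d] += dx;
    }
}
```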

3. Performance evaluation

Two benchmark cases, the penetration of a surface footing and a submarine landslide, were used to evaluate the performance of the GPU parallel computing strategy. The parallel computing was conducted on an ordinary laptop with an NVIDIA GeForce GTX 780M GPU and an Intel i7-4700HQ CPU with a frequency of 2.4 GHz. On the GPU, a maximum of 1536 cores were available, and the global memory was 4 GB. For comparison purposes, the sequential CPU simulations were conducted on a desktop computer with an Intel i7-2600 CPU with a frequency of 3.4 GHz. All of the CPU simulations were double-precision; trial CPU simulations validated that the costs of using either single- or double-precision were similar. In contrast to the CPU, single-precision computations on the GPU are executed faster than double-precision computations. Although single-precision is accurate enough for the benchmark cases, double-precision may be necessary for highly numerically divergent problems, such as the strain-softening of soil [13]; therefore, both single- and double-precision computations were investigated. All of the CPU and GPU calculations used the ‘release’ solution configuration rather than the ‘debug’ configuration. A load balance is not necessary because the threads are automatically fed to the available GPU cores. The runtimes quoted in the discussion below are elapsed times.

In both benchmarks, the soil was assumed to be an elastic-perfectly plastic material with the von Mises yield criterion. The Poisson’s ratio of the soil was taken as 0.49 to approximate constant volume under undrained conditions. The Young’s modulus was taken as 500 times the undrained shear strength. Both the footing and the seabed base were modelled as rigid bodies with a density similar to the soil. The time step, Δt, was determined through the Courant–Friedrichs–Lewy stability condition

\Delta t \le \frac{\alpha d}{\sqrt{(\lambda + 2G)/\rho}}    (14)

where G is the elastic shear modulus of the soil, λ is the Lamé constant, ρ is the soil density, d represents the square element size, and α is the Courant number. The two benchmarks used maximum values of 5.9 and 4.4 million soil particles, respectively. For cases of this size, the main calculations and all of the variables related to the soil were processed on the GPU; only the variables of the structural element were transferred between the CPU and the GPU, for which the bandwidth of PCI Express was large enough because the number of particles for the structural element was small. If the computational scale is increased beyond approximately 7 million soil particles, the soil variables can no longer be allocated fully to the global memory on the GPU due to size limitations (4 GB in our case). An alternative option is to transfer the necessary data between the CPU and GPU in several batches during each incremental step, although this would significantly reduce the computational efficiency, and the bandwidth of PCI Express may become the bottleneck of the parallelisation. This option was not explored here, and the proposed methodology may need to be refined for complex three-dimensional analyses that require a greater number of particles.

3.1. Surface footing penetration

For the surface footing penetration case, the width of the strip footing was B = 1 m, and the thickness was 0.025 m (Fig. 4). The corner of the footing was filleted to avoid a singularity in the normal direction, which benefits the footing–soil contact simulation. The interface between the footing and the soil was completely smooth (zero interface shear stress). The horizontal and vertical dimensions of the soil domain were 3B and 4B, respectively, and the element size was 0.025B. The original soil and footing elements were represented with 4 and 16 particles, respectively. The model included a total of 19,560 soil elements, 76,800 soil particles and 320 footing particles. The undrained shear strength of the weightless soil, s_u, was 3 kPa, and the submerged density of the soil was ρ = 500 kg/m³. The geostatic stresses induced by the self-weight of the soil were not considered. The footing penetrated to a depth of 0.2B. The penetration velocity was taken as 0.1B/s, which was found to be sufficiently slow to generate a quasi-static response. The critical time step was taken as 6 × 10⁻⁵ s, with α = 0.55 in Eq. (14).

The normalised load–displacement curves from the single- and double-precision GPU computations are compared in Fig. 5. The discrepancy between the two curves is slight, which suggests that the single-precision computation is sufficiently accurate. A sequential CPU analysis using the Uintah package [9] was also investigated, and the results were nearly identical to the GPU curves; for clarity, the Uintah results are not plotted in the figure.
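As an illustrative back-calculation of Eq. (14) for the footing case just described (using the stated values ν = 0.49, E = 500 s_u = 1500 kPa, ρ = 500 kg/m³, d = 0.025 m and α = 0.55):

G = \frac{E}{2(1+\nu)} \approx 503~\text{kPa}, \qquad \lambda = \frac{E\nu}{(1+\nu)(1-2\nu)} \approx 24{,}660~\text{kPa}

\sqrt{(\lambda + 2G)/\rho} \approx \sqrt{2.57 \times 10^{7}~\text{Pa} / (500~\text{kg/m}^3)} \approx 227~\text{m/s}, \qquad \Delta t \le \frac{0.55 \times 0.025~\text{m}}{227~\text{m/s}} \approx 6 \times 10^{-5}~\text{s}

which is consistent with the critical time step adopted above.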

Fig. 4. Configuration of the surface footing penetration case.

Fig. 5. Normalised load–displacement curves from the MPM with GPU computing (capacity factor N_c = F/(s_u B) versus normalised settlement w/B; single-precision, double-precision and the exact solution).

Although moderate computational fluctuations were observed, the bearing capacity factor predicted by the GIMP method agrees reasonably well with the exact solution of 5.14. As expected, in the GPU parallel computation, the runtime of the data transfer between the CPU and GPU is as small as 0.02 ms for each incremental step.

The horizontal dimension ratio of the soil domain, L/B, was increased from 3 to 4.5, 9.6, 28.8, 57.6, 115.2 and 230.4 to increase the computational scale; the corresponding numbers of soil particles increased from 76,800 to 5,898,240. The vertical dimension of the soil and the footing geometry remained unchanged. The average runtime of each incremental step was calculated as the footing penetrated to the target depth of 0.2B, covering all the possible costs during the entire simulation. The average runtimes of each incremental step with single- and double-precision computations are compared in Table 1.

For the sequential CPU simulation, the runtime is nearly linearly proportional to the number of soil particles. The significant reduction of runtimes by the GPU parallel strategy is clear, while the ratio of the runtimes of the double- and single-precision computations varied between 1.16 and 1.28 as the number of soil particles increased from less than 0.1 million to 6 million.

A ‘speedup’ factor for the entire incremental step, or for each function within an incremental step, is used to quantify the effect of parallel computing:

\text{speedup} = T_{CPU} / T_{GPU}    (15)

where T_CPU and T_GPU represent the average runtimes of the sequential CPU and parallel GPU calculations per increment, respectively. The speedups for the entire incremental step are shown in Table 2 and Fig. 6. The speedup increases for fewer than 1.5 million soil particles; if the computational scale is enlarged further, the GPU runs with a full load, and the improvement gradually stabilises, especially for the double-precision analysis. The maximum speedups achieved are 30.8 for single-precision computations and 24.0 for double-precision computations.

Fig. 6. Speedups for the entire incremental step for the footing case (speedup versus number of soil particles (million); single- and double-precision).

The performance of the GPU parallel computing varies amongst the different functions. The acceleration effects for the most time-consuming functions are listed in Table 2, based on single-precision GPU computations. The speedup for the function ‘Interpolation from soil particles to nodes’ improves with increasing computational scale for fewer than 0.7 million soil particles, but the acceleration effect becomes relatively constant with a further increase in the number of particles. The maximum speedup reaches 13.7 for 6 million soil particles. The acceleration effect of the ‘Interpolation from soil particles to nodes’ function is much less than that of the other functions parallelised over nodes or particles, mainly due to the low arithmetic intensity of the function: the proportion of memory access in the runtime of the function is very high, and the bandwidth of the global memory becomes the bottleneck for the parallelisation of this function, especially when the interpolation operation accesses the global memory randomly. The efficiency of the combined ‘Update stresses and material properties of soil particles’ and ‘Movement of soil particles’ functions is enhanced tremendously, with speedups of around 90, due to the high arithmetic intensity involved in updating the stresses.

Table 2. Speedups for time-consuming functions (single-precision GPU computation).

Number of soil particles | Interpolation from soil particles to nodes | Update stresses and material properties of soil particles & movement of soil particles | Calculate nodal velocities and accelerations | Overall speedup
76,800 | 7.9 | 76.0 | 23.5 | 19.9
115,200 | 9.1 | 83.6 | 17.1 | 21.3
245,760 | 11.9 | 80.7 | 10.3 | 23.1
737,280 | 13.4 | 80.5 | 32.8 | 28.1
1,474,560 | 13.0 | 82.3 | 50.4 | 29.1
2,949,120 | 12.6 | 92.9 | 53.6 | 30.0
5,898,240 | 13.7 | 85.7 | 86.9 | 30.8
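As an illustrative cross-check of Eq. (15), reading the first rows of Tables 1 and 2 together for the smallest footing model gives speedup = T_CPU / T_GPU = 62 ms / 3.11 ms ≈ 19.9 for single precision, which is the overall speedup listed in Table 2.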

3.2. Submarine landslide

The second example, a submarine landslide, was chosen to assess the performance of the GPU computing strategy for dynamic analyses. As shown in Fig. 7, an initially trapezoidal mass with length L = 48.66 m and height H = 5 m was placed on a rigid seabed base and allowed to slide under its self-weight from a state of rest. The element size was 0.1 m. Four particles were allocated in each element for the sliding mass and the rigid base. The model included a total of 53,000 elements, 79,464 sliding mass particles and 2000 base particles. The base was inclined at 5°. The submerged density of the sliding mass was 600 kg/m³, and the undrained shear strength was 2.5 kPa. The maximum shear stress along the frictional base was specified as 2.5 kPa, i.e. the base is fully rough. The time step was 2 × 10⁻⁴ s, with α = 0.38 in Eq. (14).

Fig. 7. Initial configuration of the submarine landslide case (not to scale).

The run-out profiles of the front toe predicted by the MPM with parallel computing and by a large deformation FE approach [20,21] are compared in Fig. 8; both analyses are double-precision. The sliding duration was estimated as 11 s by the MPM analysis, which is similar to the 10.2 s estimated by the FE. The final run-out distances predicted by the two approaches agree well. The velocity distribution in the deformed sliding mass predicted by the MPM is shown in Fig. 9 and is also similar to that obtained by the FE approach [21].

Fig. 8. Predicted run-out distances of the front toe (double-precision; run-out of front toe (m) versus time (s), for the MPM with GPU and the FE).

Fig. 9. Velocity distribution predicted by the MPM at a time of 4 s.

The large deformation FE analysis used implicit integration and was performed on a single-thread CPU. Up to the point where the sliding mass becomes stationary, the runtime of the FE analysis on the CPU was 2.7 h, compared with 5.7 min and 1.3 h for the MPM analyses with the GPU and the sequential CPU, respectively. Additional single-precision GPU computations were conducted, and the results are nearly identical to those of the double-precision computation. Therefore, it appears that single-precision MPM computing is sufficiently accurate in spite of the large number of particles and time steps. The runtime of the single-precision GPU parallel computation was 3.9 min, which is 68% of that with double-precision.

The performance of the GPU computation was assessed for different computational scales. To facilitate the construction of the MPM model, the sliding mass in Fig. 7 was changed to be rectangular.

The initial height was kept constant at 5 m, while the aspect ratio, L/H, was varied to 7.5, 15, 30, 60, 120, 240, 360 and 480. The average runtimes of the incremental steps are compared in Table 3.

Table 3. Average runtimes of an incremental step for the sliding case.

Number of sliding mass particles | L/H | CPU runtime per increment (ms), double-precision | GPU runtime per increment (ms), single-precision | GPU runtime per increment (ms), double-precision | Runtime ratio, double/single precision
69,120 | 7.5 | 47 | 2.31 | 3.46 | 1.50
138,240 | 15 | 110 | 4.85 | 6.59 | 1.36
276,480 | 30 | 219 | 9.56 | 12.32 | 1.29
552,960 | 60 | 432 | 18.86 | 24.15 | 1.28
1,105,920 | 120 | 922 | 39.97 | 49.88 | 1.25
2,211,840 | 240 | 1891 | 77.72 | 99.77 | 1.28
3,317,760 | 360 | 2984 | 119.53 | 150.24 | 1.26
4,423,680 | 480 | 4032 | 162.04 | 201.38 | 1.24

Again, a significant reduction of runtimes was achieved by parallel computing. The runtime of the double-precision GPU computation was 1.24–1.50 times that of the single-precision computation. Similar to the quasi-static case of surface footing penetration, the speedup for the function ‘Interpolation from soil particles to nodes’ improved from 8.5 with fewer than 0.1 million particles to 13.0 with 4.4 million sliding mass particles. The speedup for the combined ‘Update stresses and material properties of soil particles’ and ‘Movement of soil particles’ functions was as high as 61–88, while the function ‘Calculate nodal velocities and accelerations’ was accelerated by a factor of 13–29. The speedups for the entire incremental step are shown in Fig. 10. Despite the acceleration effect of GPU parallelisation, the speedups for this dynamic analysis are slightly lower than those for the quasi-static analysis of surface footing penetration, reaching 25 for single-precision and 20 for double-precision with 4.4 million particles.

Fig. 10. Speedups for the entire incremental step for the dynamic sliding case (speedup versus number of sliding mass particles (million); single- and double-precision).

4. Conclusions

The material point method (MPM) is a computationally intensive form of finite element analysis that suffers from limitations associated with CPU performance. GPU parallelisation offers a path to boost the computational efficiency of the MPM, thus allowing larger scale simulations using the MPM. This study presents a GPU parallelisation strategy for the MPM based on the ‘compute unified device architecture’ (CUDA). To validate the applicability of the parallelisation strategy, the interaction between a structural element and soil was investigated. We presented different schemes to parallelise all of the functions except the ‘Interpolation from rigid body particles to nodes’ function, which is executed sequentially on the CPU. A particle list is established for each node to parallelise the function ‘Interpolation from soil particles to nodes’ on the GPU over the nodes and thereby avoid a data race, while the other functions are also parallelised on the GPU over the particles or nodes accordingly.

The example applications indicate that single-precision computations are sufficiently accurate in spite of the large numbers of particles and time steps involved. Compared with the sequential CPU simulation, the GPU parallelisation with the adopted schemes results in a significant speedup. The maximum speedups with single-precision computations in the two benchmark cases are 30.8 and 25, and they decrease to 24 and 20 using double-precision computations. In particular, the speedup for the combination of the functions ‘Update stresses and material properties of soil particles’ and ‘Movement of soil particles’ is the most significant and reaches a factor of 90; in contrast, the speedup for the function ‘Interpolation from soil particles to nodes’ reaches a maximum of only 13 for single-precision computations.

Acknowledgements

The research presented in this paper was supported by the Australian Research Council through an ARC Discovery Grant (DP120102987). The work forms part of the activities of the Centre for Offshore Foundation Systems (COFS), which is currently supported as a node of the Australian Research Council Centre of Excellence for Geotechnical Science and Engineering and in partnership with The Lloyd’s Register Foundation.

References

[1] Andersen S, Andersen L. Modelling of landslides with the material-point method. Comput Geosci 2010;14(1):137–47.
[2] Bardenhagen SG, Brackbill JU, Sulsky D. The material point method for granular materials. Comput Methods Appl Mech Eng 2000;187:529–41.
[3] Bardenhagen SG, Kober EM. The generalized interpolation material point method. Computer Model Eng Sci 2004;5(6):477–96.
[4] Belytschko T, Krongauz Y, Organ D, Fleming M, Krysl P. Meshless methods: an overview and recent developments. Comput Methods Appl Mech Eng 1996;139:3–47.
[5] Benson DJ. An efficient, accurate and simple ALE method for nonlinear finite element programs. Comput Methods Appl Mech Eng 1989;72:305–50.
[6] Buzzi O, Pedroso DM, Giacomini A. Caveats on the implementation of the generalized material point method. Computer Model Eng Sci 2008;31(2):85–106.
[7] Chiang WF, DeLisi M, Hummel T, Prete T, Tew K, Hall M, Wallstedt P, Guilkey J. GPU acceleration of the generalized interpolation material point method. In: Proceedings of the 1st symposium on application accelerators in high performance computing, Urbana, USA; 2009.
[8] Dong Y, Ma J, Wang D, Randolph MF. Assessment of applicability of the material point method in offshore geotechnical engineering. In: Proceedings of the 14th international conference of the international association for computer methods and advances in geomechanics, Kyoto, Japan; 2014. p. 117–22.
[9] Guilkey J, Harman T, Luitjens J, et al. Uintah code (Version 1.5.0). Computer program; 2012.
[10] Harlow FH. The particle-in-cell computing method for fluid dynamics. Methods Comput Phys 1964;3:319–43.
[11] Huang P, Zhang X, Ma S, Wang HK. Shared memory OpenMP parallelization of explicit MPM and its application to hypervelocity impact. Computer Model Eng Sci 2008;38(2):119–47.
[12] Ma J, Wang D, Randolph MF. A new contact algorithm in the material point method for geotechnical simulations. Int J Numer Anal Meth Geomech 2014;38:1197–210.
[13] Murakami S, Liu Y. Mesh-dependence in local approach to creep fracture. Int J Damage Mech 1995;4:230–50.
[14] NVIDIA. CUDA programming guide, version 5.5. Santa Clara, USA; 2013.
[15] Schmidt J, Berzins M, Thornock J, Saad T, Sutherland J. Large scale parallel solution of incompressible flow problems using Uintah and Hypre. In: Proceedings of the 13th IEEE/ACM international symposium on cluster, cloud, and grid computing; 2013. p. 458–65.
[16] Stantchev G, Dorland W, Gumerov N. Fast parallel particle-to-grid interpolation for plasma PIC simulations on the GPU. J Parallel Distrib Comput 2008;68(10):1339–49.
[17] Steffen M, Kirby RM, Berzins M. Analysis and reduction of quadrature errors in the material point method (MPM). Int J Numer Meth Eng 2008;76(6):922–48.
[18] Sulsky D, Zhou SJ, Schreyer HL. Application of a particle-in-cell method to solid mechanics. Comput Phys Commun 1995;87:236–52.
[19] Wang D, Bienen B, Nazem M, Tian Y, Zheng J, Pucker T, et al. Large deformation finite element analyses in geotechnical engineering. Comput Geotech 2015;65:104–14.
[20] Wang D, Hu Y, Randolph MF. Three-dimensional large deformation finite element analysis of plate anchors in uniform clay. J Geotech Geoenviron Eng 2010;136(2):355–65.
[21] Wang D, Randolph MF, White DJ. A dynamic large deformation finite element method based on mesh regeneration. Comput Geotech 2013;54:192–201.
[22] Zhang HW, Wang KP, Chen Z. Material point method for dynamic analysis of saturated porous media under external contact/impact of solid bodies. Comput Methods Appl Mech Eng 2009;198:1456–72.