An adaptive dual-information FMBEM for 3D elasticity and its GPU implementation


Yingjun Wang, Qifu Wang*, Gang Wang, Yunbao Huang, Shuting Wang
National CAD Support Software Engineering Research Center, Huazhong University of Science and Technology, Wuhan 430074, China


Abstract

Article history: Received 7 January 2012; Accepted 26 September 2012; Available online 30 November 2012.

Combining the boundary element method (BEM) with the fast multipole method (FMM), the fast multipole BEM (FMBEM) has been proposed to solve large-scale problems. A key issue the FMBEM has to address is the element integrals, which usually consume much time when the FMM for N-body problems is used directly. In order to accelerate the element integrals, we present an adaptive FMBEM with a particular dual-information tree structure which contains both node and element information, and use it for 3D elasticity in this paper. In our adaptive FMBEM, the Multipole Expansions (ME), Moment-to-Local (M2L) translation, Local Expansions (LE), and the Near Field Direct Computation (NFDC) are level independent, so they are suitable for parallel computing. The examples show that the time of ME and NFDC in our FMBEM is almost 1/3 and 1/2, respectively, of that in a node-based FMBEM which treats the FMBEM in a particle-interaction mode. We develop two GPU parallel strategies to accelerate the ME, M2L and NFDC and implement them on an NVIDIA GTX 285 GPU; the speedups relative to an Intel Core2 Q9550 CPU using 4 cores reach 10.7 for ME, 16.2 for M2L, and 3.6 for NFDC.

Keywords: Fast multipole method; Boundary element method; Dual-information; Graphics processing unit; 3D elasticity

1. Introduction

The fast multipole method (FMM), one of the top 10 algorithms of the 20th century [1], was invented in the late 1980s by Rokhlin and Greengard [2]. The FMM, which adopts a hierarchical tree data structure, can reduce the complexity of N-body problems [3] (e.g., gravitational and electrostatic problems) from O(N²) to O(N log N) or O(N) by approximating and aggregating the pairwise interactions. There are two versions of the tree structure in the FMM: the full tree structure [2,4] and the adaptive tree structure [5,6]. The former works well if the particles in the domain are uniformly distributed, while the latter is more efficient and suitable for both uniform and non-uniform distributions. Although the FMM with the full tree structure is easily parallelized [7], the FMM with the adaptive tree structure is widely used nowadays because of its adaptivity and high efficiency; usually, the FMM based on the adaptive tree is called the adaptive FMM. In this paper, only the adaptive FMM is discussed. The boundary element method (BEM) combined with the FMM can efficiently deal with large-scale engineering and scientific problems [8-10]; this combination is called the fast multipole BEM (FMBEM). There are two kinds of elements that can be used to discretize the boundary integral equations (BIEs) in the BEM. One is called

* Corresponding author. Tel.: +86 27 87541974. E-mail address: [email protected] (Q. Wang).


the "constant element", which has only one node located at the centroid of the element. With this element, the number of nodes equals the number of elements, and the tree structure can be set up based on nodes in the same way as the standard tree structure for N-body problems based on particles. The other kind, more accurate and efficient, called the "high-order element", has several nodes in each element, and the parameters in the element can be interpolated from the parametric values of the nodes; therefore the number of nodes is not equal to the number of elements when high-order elements are used. The final BIEs are assembled by the unknowns of nodes but the boundary integrals are element-based [11], thus the tree structure for the FMM cannot be set up directly as for N-body problems when high-order elements are used. Some researchers have adopted the FMBEM with high-order elements to solve different problems. Hamada and Takuma [12] and Buchau et al. [13] used it to solve 3D electrostatic problems, and Zhao and Yao [14] used it to deal with 3D elastostatic problems, but details of the assembly of unknowns and the element integrals, as far as we know, have not been unveiled. Some researchers have proposed detailed approaches to apply high-order elements in the FMBEM. For example, Yoshida [15] presented a "global-node approach" which uses the shape functions of elements to interpolate the element integrals to the nodes and converts the element influence into node influence, and Lu et al. [16] presented an approach called the "node-patch approach" which constructs a "working" patch around each node instead of directly using the element so that the FMBEM finally works on nodes. However, both approaches will


increase the time of the element integrals and influence the efficiency of the FMBEM, which will be discussed in detail in Section 2. In order to reduce the element integral computation, we present a dual-information FMBEM with a particular adaptive dual-information tree for 3D problems (the dual-information FMBEM for 2D problems is easier and is not discussed here). The adaptive dual-information tree contains information of both elements and nodes in its cells (a cell is a domain described in Section 2). With the dual-information tree, the element integrals in the FMBEM are efficient, the FMBEM can be implemented as easily as the FMM for N-body problems, and the Multipole Expansions (ME), Moment-to-Local (M2L) translation, Local Expansions (LE) and the Near Field Direct Computation (NFDC) can be easily parallelized. Since Graphics Processing Unit (GPU) parallel computing has been successfully applied to accelerate the BEM and the FMM [17-19], with speedups of dozens of times or even more, we implement the three most time-consuming parts (ME, M2L and NFDC) of our adaptive FMBEM in a parallel mode for GPU computing, and run them on an NVIDIA GeForce GTX 285 GPU under the Compute Unified Device Architecture (CUDA). The parallel strategies for ME and M2L are based on the characteristics of the FMBEM expansions; they are efficient and can be easily used for different physical problems. This paper is organized as follows. In Section 2, we describe the adaptive dual-information tree construction and the element integration in our dual-information FMBEM, and compare our element integration with that of the "global-node approach" and the "node-patch approach". In Section 3, we present a detailed algorithm of the dual-information FMBEM for 3D elasticity, and propose our own data structure in order to make the algorithm suitable for parallel computing. In Section 4, the parallel dual-information FMBEM using the GPU is implemented. In Section 5, the examples and results are discussed in detail. Finally, we summarize the paper and discuss the presented method and obtained results in Section 6.


Fig. 1. Illustration of 3D cell relations: an observation cell with its adjacent cells, interacting cells and far cells at levels l, l+1 and l+2.

2. Methods of tree construction and element integration

2.1. Adaptive FMM tree construction

One of the most important properties of d-dimensional Euclidean space (R^d) is that it can be subdivided into cells [20]. The problem domain with which we are concerned can be enclosed in a bounding cell. This bounding cell is assigned to level 0, and the level 0 cell can be subdivided into 2^d smaller cells of level 1 by dividing each side in half. In the same way, a sequence of cells is generated at level 2, level 3, and so on, until the greatest refinement level, where cells have fewer particles than a given number (a childless cell is called a leaf). Finally, a hierarchical tree of cells is constructed. In the adaptive tree of cells, the empty child cells are trimmed off. For 3D N-body problems, the cells are cube cells and the trees are octrees.

Based on the geometric subdivision of the domain, all cells relative to an observation cell (purple cell in Fig. 1) can be divided into adjacent cells, interacting cells and far cells. Two cells are identified as adjacent cells if they are at the same level and have at least one common vertex (blue cells are adjacent cells of the observation cell in Fig. 1). Two cells are identified as interacting cells if they are not adjacent but their parent cells are adjacent (yellow cells are interacting cells of the observation cell in Fig. 1). Other cells, whose parent cells are not adjacent, are called far cells (white cells are far cells of the observation cell in Fig. 1).

2.2. Approaches of organizing element integrals in adaptive FMBEM

The major difference between BEM problems and N-body problems is that BEM problems have to deal with element integrals, whereas N-body problems just need to compute particle-to-particle interactions. It is difficult to apply the N-body FMM directly in the FMBEM with high-order elements because the cells only have node information (a node acting as a particle) but lack element information. Before discussing the approaches to apply the N-body FMM directly in the FMBEM, it is necessary to analyze the basic BEM equations. The FMBEM for 3D elasticity is taken as an example in this paper. The discrete boundary integral equations for 3D elasticity can be written as

$$c_{ij}(P)u_j(P) = \sum_{e=1}^{M}\int_{S_e} U_{ij}(P,Q)t_j(Q)\,dS(Q) - \sum_{e=1}^{M}\int_{S_e} T_{ij}(P,Q)u_j(Q)\,dS(Q)$$
$$\qquad = \sum_{e=1}^{M}\sum_{a=1}^{N} t_j^a(Q)\int_{S_e} U_{ij}(P,Q)N_a(Q)\,dS(Q) - \sum_{e=1}^{M}\sum_{a=1}^{N} u_j^a(Q)\int_{S_e} T_{ij}(P,Q)N_a(Q)\,dS(Q), \quad (1)$$

where the summation is to be carried out over the range of two identical subscripts, a denotes the a-th node in the N-node element, M is the number of elements, S_e is the area of the element e, N_a(Q) is the interpolation function of the a-th node at point Q, u_j^a(Q) and t_j^a(Q) are the components in the j direction of the displacements and tractions at the a-th node of the element to which Q belongs, c_{ij}(P) are free-term coefficients, and U_{ij}(P,Q) and T_{ij}(P,Q) are the fundamental solutions of Kelvin's problem:

$$U_{ij}(P,Q) = \frac{1}{16\pi\mu(1-v)r}\left[(3-4v)\delta_{ij} + r_{,i}r_{,j}\right],$$
$$T_{ij}(P,Q) = -\frac{1}{8\pi(1-v)r^2}\left\{\frac{\partial r}{\partial n}\left[(1-2v)\delta_{ij} + 3r_{,i}r_{,j}\right] - (1-2v)\left(r_{,i}n_j - r_{,j}n_i\right)\right\}, \quad (2)$$

where v is Poisson's ratio, μ is the shear modulus, δ_ij is the Kronecker delta, n is the outward normal of the boundary S_e, r is the distance between the source point P and the field point Q, and ( )_{,k} denotes the partial derivative in the direction k of the coordinates. To produce a closed set of equations, Eq. (1) is written for each node in turn. It can be observed that the integrals in Eq. (1) are carried out over each element.
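As a concrete illustration of Eq. (2), the following host-side C++ sketch evaluates the Kelvin kernels for one source/field point pair; the type and function names (Vec3, kelvinUT) are our own and are not taken from the paper's code.

    // Sketch: evaluate the Kelvin kernels U_ij(P,Q) and T_ij(P,Q) of Eq. (2).
    // nu = Poisson's ratio, mu = shear modulus, n = outward unit normal at Q.
    #include <cmath>

    struct Vec3 { double x, y, z; };

    void kelvinUT(const Vec3& P, const Vec3& Q, const Vec3& n,
                  double nu, double mu, double U[3][3], double T[3][3]) {
        const double d[3]  = {Q.x - P.x, Q.y - P.y, Q.z - P.z};
        const double r     = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        const double dr[3] = {d[0]/r, d[1]/r, d[2]/r};                 // r_,i
        const double nn[3] = {n.x, n.y, n.z};
        const double drdn  = dr[0]*nn[0] + dr[1]*nn[1] + dr[2]*nn[2];  // dr/dn
        const double pi    = 3.14159265358979323846;
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                const double dij = (i == j) ? 1.0 : 0.0;
                U[i][j] = ((3.0 - 4.0*nu)*dij + dr[i]*dr[j]) / (16.0*pi*mu*(1.0 - nu)*r);
                T[i][j] = -(drdn*((1.0 - 2.0*nu)*dij + 3.0*dr[i]*dr[j])
                            - (1.0 - 2.0*nu)*(dr[i]*nn[j] - dr[j]*nn[i]))
                          / (8.0*pi*(1.0 - nu)*r*r);
            }
        }
    }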


Fig. 2. Illustration of the global-node approach.

Let us assume that the integrals have been computed with the result that

$$\int_{S_e} U_{ij}(P,Q)N_a(Q)\,dS(Q) = G_{ij}^{e}(P,Q), \qquad \int_{S_e} T_{ij}(P,Q)N_a(Q)\,dS(Q) = H_{ij}^{e}(P,Q). \quad (3)$$

Eq. (1) can be described as

$$c_{ij}(P)u_j(P) = \sum_{e=1}^{M}\sum_{a=1}^{N} t_j^a(Q)\,G_{ij}^{e}(P,Q) - \sum_{e=1}^{M}\sum_{a=1}^{N} u_j^a\,H_{ij}^{e}(P,Q). \quad (4)$$

Eq. (4) has the matrix representation

$$[c]\{u\} = [G]\{t\} - [H']\{u\}, \quad (5)$$

where the vectors {u} and {t} constitute the complete set of nodal displacements and tractions that have yet to be determined, and the matrices [c], [G] and [H'] correspond to the coefficients defined before. From Eqs. (3) and (5), it can be found that the equations are assembled by the unknowns of nodes but the boundary integrals are element-based, and this is why the N-particle trees cannot be used directly in the FMBEM. Two representative approaches to solve this problem have been proposed [15,16].

The first one is the "global-node approach", which uses the shape functions to interpolate the element integrals at the nodes. The linear triangular element is chosen as an example here. Some local nodes of different boundary elements may share a global position (point Q), which is called a global node in [15] (shown in Fig. 2). In Eq. (5), if we want to get the coefficients relating a point P and a global node Q in the matrix [G] for a nodal traction t_i at point Q of the vector {t}, we can compute G_{ij}^e(P,Q) of Eq. (3) for all the elements {S_e, e = 1,...,6} around the global node Q, and add all the G_{ij}^e(P,Q) associated with t_i together, as shown in Fig. 2. The same approach can be used for the coefficients in [H']. By this approach, the FMBEM can be implemented based on nodes, as in the FMM for N-body problems. In the FMBEM, if the number of nodes which directly influence a node P is N, and the average number of elements around a node is 6 (an angle of an equilateral triangle is 60°, and 6 × 60° = 360°), the number of element integrals that need to be computed for node P is 6N.

The other approach is the node-patch approach, which constructs a "working" patch around each node instead of directly using the facet patch (element). A patch is illustrated in Fig. 3, in which a "node patch" is constructed around the node Q that has 6 adjacent elements. The patch is defined by the area encircled by the set of points {O_1, C_1, O_2, C_2, ..., O_6, C_6, O_1}, where {O_l, l = 1,...,6} are the centroids of the six adjacent triangles, and {C_l, l = 1,...,6} are the midpoints of the 6 joint edges [16].

Fig. 3. A ‘‘node patch’’ around the node Q enclosed by the dashed lines.

It is easy to find out that each triangle element contributes 1/3 of its area to the "node patch". The surface integrals in Eq. (3) on the patch S_Q are described as

$$\int_{S_Q} U_{ij}(P,Q)N_a(Q)\,dS(Q) = \sum_{e=1}^{l}\int_{S_{Qe}} U_{ij}(P,Q)N_a(Q)\,dS(Q),$$
$$\int_{S_Q} T_{ij}(P,Q)N_a(Q)\,dS(Q) = \sum_{e=1}^{l}\int_{S_{Qe}} T_{ij}(P,Q)N_a(Q)\,dS(Q), \quad (6)$$

where S_{Qe} is the subpart of element S_e around node Q, whose area is 1/3 of that of element S_e. In the "node-patch" method, each node has its own patch which has no overlap with the patches of other nodes, and, as in the constant element method, the nodes can be used directly in any existing FMM. If the number of nodes which influence the node P is N, and the average number of elements around a node is 6, the number of element integrals that need to be computed for node P is 6N, and each integral area is just 1/3 of an element area.

2.3. The dual-information adaptive tree for FMBEM

We first clarify some common parameters that will be used throughout this paper. The maximum number of nodes allowed in a leaf is denoted maxn, the greatest refinement level number is nlevel, the total number of cells is ncell, the total number of leaves is nleaf, and some lists and array elements are listed in Table 1. In the FMBEM, the element integrals are involved in two places: the ME in leaves, and the NFDC where at least one of the two adjacent cells is a leaf.


The adaptive dual-information tree construction is given in Algorithm 1 below; the notation it uses is summarized in Table 1.

Table 1. Common notations of lists and array elements.

Notation      Description
C(l)          List: cells of level l.
M(c)          List: multipole moments of cell c.
L(c)          List: local moments of cell c.
Nlist(c)      List: nodes in cell c.
Elist(c)      List: elements in cell c.
Nnum[c]       Array element: the number of nodes in cell c.
Enum[c]       Array element: the number of elements in cell c.
Clevel[c]     Array element: level number of cell c.
Adlist(c)     List: adjacent cells of cell c.
Inlist(c)     List: interacting cells of cell c.
Cnumad[c]     Array element: the number of adjacent cells of cell c.
Cnumin[c]     Array element: the number of interacting cells of cell c.
Aenlist(i)    List: adjacent elements of node i.
Cleaf[i]      Array element: cell number of leaf i.
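To make the notation of Table 1 concrete, one possible flat C++ layout of these lists and arrays is sketched below; the grouping into a single struct is our assumption, not the paper's actual data structure.

    #include <vector>

    // Assumed container mirroring the lists and array elements of Table 1.
    struct DualInfoTreeData {
        std::vector<std::vector<int>> C;        // C(l): cells of level l
        std::vector<std::vector<int>> Nlist;    // Nlist(c): nodes in cell c
        std::vector<std::vector<int>> Elist;    // Elist(c): elements in cell c
        std::vector<int> Nnum, Enum;            // node / element counts per cell
        std::vector<int> Clevel;                // level number of cell c
        std::vector<std::vector<int>> Adlist;   // adjacent cells of cell c
        std::vector<std::vector<int>> Inlist;   // interacting cells of cell c
        std::vector<int> Cnumad, Cnumin;        // numbers of adjacent / interacting cells
        std::vector<std::vector<int>> Aenlist;  // adjacent elements of node i
        std::vector<int> Cleaf;                 // Cleaf[i]: cell number of leaf i
        // multipole moments M(c) and local moments L(c) would be stored per cell as well
    };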

Fig. 4. An element-division example: the centroids of elements (1) and (2) lie inside the cell, so they belong to the cell; elements (3) and (4) do not.

If the tree only has node information (as in the conventional direct BEM), the element integrals have to be node-based. Assuming that there are N nodes in a cell c and the average number of elements adjacent to a node is 6, the numbers of element integrals in the ME and the NFDC of cell c are 6N and 6N², respectively. However, the number of elements is only about twice the number of nodes when linear triangle elements are used [16], which means that if the adaptive tree carries the element information, the element integrals can be reduced to approximately 2N and 2N² in the ME and the NFDC of cell c. A major disadvantage of a tree which only has element information, however, is that the BEM system equations are hard to set up because the equations are node-based. In order to solve this problem, we present an adaptive dual-information tree which contains both node and element information.

The construction of the adaptive dual-information tree is also node-based, which means that the cell division is based on nodes. In the process of tree construction, elements are divided into the cells which have just been created. Whether an element belongs to a cell or not is determined by whether the centroid of the element is inside or outside the cell. An element-division example is shown in Fig. 4. When the tree construction is finished, the levels of the tree (except level 0 and level 1) and the cells of each level are traversed to record the adjacent cells in a list (Adlist) and the interacting cells in another list (Inlist), and two arrays (Cnumad and Cnumin) are used to record the numbers of adjacent cells and interacting cells. Besides, an array (Clevel) is used to record the levels of cells. These lists and arrays will benefit the parallel computing of the dual-information FMBEM, which will be discussed in Section 3.

Algorithm 1. Adaptive dual-information tree construction

Start algorithm
(1) Choose maxn, define cell 0 to be the cube that covers all nodes and call this cube the cell of level 0. Get the coordinate of each element centroid from the shape function and the coordinates of the element vertices. If the element centroid is in cell c, the element belongs to cell c.
(2) For levels l = 0, 1, 2, ... do // level
      For each cell c in C(l) do // cell
        If Nnum[c] > maxn then
          Divide c into 8 child cells c0-c7. Ignore empty children (both elements and nodes are empty) and add the nonempty child cells to C(l+1).
          Record Nnum[c0]-Nnum[c7] and create Nlist[c0]-Nlist[c7].
          Record Enum[c0]-Enum[c7] and create Elist[c0]-Elist[c7].
          If ck in c0-c7 is a leaf then
            Record the cell number of cell ck in Cleaf[ck].
          End if
        End if
      End for // cell
    End for // level
(3) Record nlevel, ncell, and nleaf.
(4) For levels l = 2 to nlevel do // level
      For each cell ci in C(l) do // cell ci
        Record Clevel[ci] (Clevel[ci] = l). // record relation between cell and level
        Get ci's parent cell pci.
        For each cell cj in the child cells of C(pci) do // cell cj
          If cell cj is an adjacent cell of cell ci then // construct adjacent list
            Add cell cj to Adlist[ci], the adjacent cell list of cell ci.
            Cnumad[ci] += 1.
          Else // construct interacting list
            Add cell cj to Inlist(ci), the interacting cell list of cell ci.
            Cnumin[ci] += 1.
          End if
        End for // cell cj
      End for // cell ci
    End for // level
End algorithm

By Algorithm 1, the tree includes both element information and node information in each cell. The element integrals in the ME and NFDC can use the element information, so the numbers of element integrals are about 2N and 2N², respectively. The system equations can be easily set up based on nodes, and the adjacent cells and the interacting cells can be easily found from Adlist and Inlist.
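The element-division rule of step (1) of Algorithm 1 can be sketched in C++ as follows; CellBox, contains and assignElements are illustrative names of ours, and the bounds test assumes half-open cell boxes.

    #include <vector>

    struct Point3 { double x, y, z; };

    struct CellBox {
        Point3 lo, hi;                 // axis-aligned bounds of the cube cell
        std::vector<int> Elist;        // elements whose centroid falls inside
    };

    static bool contains(const CellBox& c, const Point3& p) {
        return p.x >= c.lo.x && p.x < c.hi.x &&
               p.y >= c.lo.y && p.y < c.hi.y &&
               p.z >= c.lo.z && p.z < c.hi.z;
    }

    // centroids[e] is the centroid of element e (computed from its vertex coordinates).
    void assignElements(std::vector<CellBox>& children, const std::vector<Point3>& centroids) {
        for (int e = 0; e < (int)centroids.size(); ++e) {
            for (CellBox& c : children) {
                if (contains(c, centroids[e])) { c.Elist.push_back(e); break; }
            }
        }
    }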

3. The FMBEM

3.1. Fundamental formulae

The formulae of the FMBEM for 3D elasticity are listed briefly here; the detailed derivation of the formulae can be found in [15]. The formulae are divided into five parts: (1) Multipole Expansion (ME), (2) Moment-to-Moment (M2M) translation, (3) Local Expansion (LE), (4) Moment-to-Local (M2L) translation, and (5) Local-to-Local (L2L) translation.
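The expansions below are built from the solid harmonic functions S_{n,m} and R_{n,m} [15], which are usually evaluated by recurrence. The following C++ sketch shows one common set of recurrences for R_{n,m} with m >= 0 (see, e.g., [15,20]); the exact normalization used in the paper's implementation is not stated there, so the normalization chosen here is an assumption.

    #include <complex>
    #include <vector>

    using cplx = std::complex<double>;

    // R[n][m] for 0 <= m <= n <= p at the point (x, y, z), assuming the normalization
    // R_{n,m} = r^n P_n^m(cos(theta)) e^{i m phi} / (n + m)!.
    std::vector<std::vector<cplx>> solidHarmonicsR(int p, double x, double y, double z) {
        std::vector<std::vector<cplx>> R(p + 1);
        for (int n = 0; n <= p; ++n) R[n].assign(n + 1, cplx(0.0, 0.0));
        const double r2 = x*x + y*y + z*z;
        R[0][0] = cplx(1.0, 0.0);
        for (int m = 1; m <= p; ++m)                       // diagonal terms R_{m,m}
            R[m][m] = R[m-1][m-1] * cplx(x, y) / (2.0 * m);
        for (int m = 0; m + 1 <= p; ++m)                   // first sub-diagonal R_{m+1,m}
            R[m+1][m] = z * R[m][m];
        for (int m = 0; m <= p; ++m)                       // remaining terms by recurrence
            for (int n = m + 2; n <= p; ++n)
                R[n][m] = ((2.0*n - 1.0) * z * R[n-1][m] - r2 * R[n-2][m])
                          / double((n + m) * (n - m));
        return R;                                          // m < 0 values follow by conjugation
    }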


ME is a series expansion used to separate a source point from a field point by a point that is close to the field point. The fundamental solutions in Eq. (2) can be expanded as

$$U_{ij}(P,Q) \simeq \frac{1}{8\pi\mu}\sum_{n=0}^{p}\sum_{m=-n}^{n}\Big[F_{ij,n,m}(P-Q_c)\,\overline{R_{n,m}(Q-Q_c)} + G_{i,n,m}(P-Q_c)\,(Q-Q_c)_j\,\overline{R_{n,m}(Q-Q_c)}\Big],$$
$$T_{ij}(P,Q) \simeq E_{jklp}\,n_k(Q)\,\frac{\partial}{\partial Q_p}U_{il}(P,Q), \quad (7)$$

in which

$$G_{i,n,m}(P-Q_c) = \frac{\lambda+\mu}{\lambda+2\mu}\,\frac{\partial}{\partial P_i}S_{n,m}(P-Q_c),$$
$$F_{ij,n,m}(P-Q_c) = \frac{\lambda+3\mu}{\lambda+2\mu}\,\delta_{ij}\,S_{n,m}(P-Q_c) - \frac{\lambda+\mu}{\lambda+2\mu}\,(P-Q_c)_j\,\frac{\partial}{\partial P_i}S_{n,m}(P-Q_c), \quad (8)$$

where $\mu = E/2(1+v)$, E is Young's modulus, $E_{iklp} = \lambda\delta_{ik}\delta_{lp} + \mu(\delta_{il}\delta_{kp} + \delta_{ip}\delta_{kl})$, $\lambda = Ev/[(1+v)(1-2v)]$, Q_c is the expansion center, p is the number of expansion terms, $\overline{(\;)}$ is the complex conjugate of ( ), $\partial/\partial(\;)_k$ is the partial derivative in the direction k of ( ), and the functions $S_{n,m}$ and $R_{n,m}$ are solid harmonic functions [15].

Applying Eq. (7) to the integrands of Eq. (1) on an element S_e, we obtain

$$\int_{S_e} U_{ij}(P,Q)t_j(Q)\,dS(Q) \simeq \frac{1}{8\pi\mu}\sum_{n=0}^{p}\sum_{m=-n}^{n}\Big[F_{ij,n,m}(P-Q_c)\,M_{j,n,m}(Q_c) + G_{i,n,m}(P-Q_c)\,M_{n,m}(Q_c)\Big],$$
$$\int_{S_e} T_{ij}(P,Q)u_j(Q)\,dS(Q) \simeq \frac{1}{8\pi\mu}\sum_{n=0}^{p}\sum_{m=-n}^{n}\Big[F_{ij,n,m}(P-Q_c)\,M'_{j,n,m}(Q_c) + G_{i,n,m}(P-Q_c)\,M'_{n,m}(Q_c)\Big], \quad (9)$$

where $M_{n,m}(Q_c)$, $M_{j,n,m}(Q_c)$, $M'_{n,m}(Q_c)$ and $M'_{j,n,m}(Q_c)$ are multipole moments whose center is Q_c, described as

$$M_{n,m}(Q_c) = \int_{S_e}(Q-Q_c)_j\,\overline{R_{n,m}(Q-Q_c)}\,t_j(Q)\,dS(Q),$$
$$M_{j,n,m}(Q_c) = \int_{S_e}\overline{R_{n,m}(Q-Q_c)}\,t_j(Q)\,dS(Q),$$
$$M'_{n,m}(Q_c) = E_{jpkl}\int_{S_e}\frac{\partial}{\partial Q_p}\Big[(Q-Q_c)_j\,\overline{R_{n,m}(Q-Q_c)}\Big]\,n_k(Q)\,u_l(Q)\,dS(Q),$$
$$M'_{j,n,m}(Q_c) = E_{jpkl}\int_{S_e}\frac{\partial}{\partial Q_p}\overline{R_{n,m}(Q-Q_c)}\,n_k(Q)\,u_l(Q)\,dS(Q). \quad (10)$$

The ME center can be moved from Q_c to Q_{c'} by the M2M translations

$$M_{n,m}(Q_{c'}) = \sum_{n'=0}^{n}\sum_{m'=-n'}^{n'} R_{n',m'}(Q_c-Q_{c'})\Big[M_{n-n',m-m'}(Q_c) + (Q_c-Q_{c'})_j\,M_{j,n-n',m-m'}(Q_c)\Big],$$
$$M_{j,n,m}(Q_{c'}) = \sum_{n'=0}^{n}\sum_{m'=-n'}^{n'} R_{n',m'}(Q_c-Q_{c'})\,M_{j,n-n',m-m'}(Q_c), \quad (11)$$

which are also valid for $M'_{n,m}(Q_{c'})$ and $M'_{j,n,m}(Q_{c'})$.

LE is another expansion which is similar to the ME, but the expansion is about a source point P. The left-hand sides of Eq. (9) can be expanded as follows:

$$\int_{S_e} U_{ij}(P,Q)t_j(Q)\,dS(Q) \simeq \frac{1}{8\pi\mu}\sum_{n=0}^{p}\sum_{m=-n}^{n}\Big[F^{R}_{ij,n,m}(P-P_c)\,L_{j,n,m}(P_c) + G^{R}_{i,n,m}(P-P_c)\,L_{n,m}(P_c)\Big],$$
$$\int_{S_e} T_{ij}(P,Q)u_j(Q)\,dS(Q) \simeq \frac{1}{8\pi\mu}\sum_{n=0}^{p}\sum_{m=-n}^{n}\Big[F^{R}_{ij,n,m}(P-P_c)\,L'_{j,n,m}(P_c) + G^{R}_{i,n,m}(P-P_c)\,L'_{n,m}(P_c)\Big], \quad (12)$$

where the local moments are given by the M2L translations

$$L_{n,m}(P_c) \simeq (-1)^n\sum_{n'=0}^{p}\sum_{m'=-n'}^{n'}\overline{S_{n+n',m+m'}(P_c-Q_c)}\Big[M_{n',m'}(Q_c) - (P_c-Q_c)_j\,M_{j,n',m'}(Q_c)\Big],$$
$$L_{j,n,m}(P_c) \simeq (-1)^n\sum_{n'=0}^{p}\sum_{m'=-n'}^{n'}\overline{S_{n+n',m+m'}(P_c-Q_c)}\,M_{j,n',m'}(Q_c). \quad (13)$$

Similarly for $L'_{n,m}(P_c)$ and $L'_{j,n,m}(P_c)$, only with $M_{n',m'}$ and $M_{j,n',m'}$ replaced by $M'_{n',m'}$ and $M'_{j,n',m'}$; $G^{R}_{i,n,m}$ and $F^{R}_{ij,n,m}$ are obtained from Eq. (8) with $S_{n,m}$ replaced by $R_{n,m}$. When the LE point is moved from P_c to P_{c'}, we can use the following L2L translations:

$$L_{n,m}(P_{c'}) \simeq \sum_{n'=0}^{p}\sum_{m'=-n'}^{n'} R_{n'-n,m'-m}(P_{c'}-P_c)\Big[L_{n',m'}(P_c) - (P_{c'}-P_c)_j\,L_{j,n',m'}(P_c)\Big],$$
$$L_{j,n,m}(P_{c'}) \simeq \sum_{n'=0}^{p}\sum_{m'=-n'}^{n'} R_{n'-n,m'-m}(P_{c'}-P_c)\,L_{j,n',m'}(P_c). \quad (14)$$

$L'_{n,m}(P_{c'})$ and $L'_{j,n,m}(P_{c'})$ can be obtained in the same way.

3.2. Process of the FMBEM

The FMBEM consists of two basic steps, the upward pass and the downward pass [2,15]. The upward pass consists of ME and M2M translations, and the downward pass consists of M2L, L2L, LE and the NFDC, where the conventional BEM is used. In the ME and the NFDC, direct integration is used. One process of a node-based FMBEM can be described briefly as follows.

Upward: Loop over levels l = nlevel to 2, and loop over each cell c of level l. If c is a leaf, M(c) is computed by Eq. (10) with a particular node-based approach ("global-node approach" or "node-patch approach"). If cell c has child cells, M(c) is computed by summing the moments of its child cells using the M2M translation of Eq. (11).

Downward: Loop over levels l = 2 to nlevel, and loop over each cell c of level l. The LE is computed for each node in cell c by Eq. (12) when c is a leaf. L(c) is the sum of the contributions from the interacting cells and far cells of cell c (cells of level 2 have only interacting cells). For far cells, the contributions are computed by L2L translations via Eq. (14). For interacting cells, the contributions are computed by M2L translations via Eq. (13). If at least one of cell c and its adjacent cell is a leaf, we also need to compute the NFDC directly from each node in the adjacent cell to each node in cell c by the conventional BEM with a particular node-based approach ("global-node approach" or "node-patch approach").

The flow chart of the above process is shown in Fig. 5 (we have only described the level loop explicitly). With this node-based FMBEM, the adaptive tree in Section 2.1 can be used directly, and the memory consumption is low. However, besides the low efficiency of the node-based direct integration, this FMBEM is difficult to parallelize because the upward and downward passes have to be implemented level by level. In order to improve the efficiency of the direct integration and make the adaptive FMBEM easy to parallelize without increasing the memory consumption by much, we present a dual-information FMBEM based on the adaptive dual-information tree structure described in Algorithm 1.
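To illustrate why the node-based process above is hard to parallelize, the following C++ sketch of its upward pass makes the level dependency explicit; the function names are illustrative only.

    #include <vector>

    void computeLeafME(int cell);              // node-based ME for a leaf, Eq. (10)
    void translateM2M(int child, int parent);  // M2M translation, Eq. (11)

    // C[l] lists the cells of level l, children[c] lists the child cells of cell c.
    void upwardPassNodeBased(const std::vector<std::vector<int>>& C,
                             const std::vector<std::vector<int>>& children,
                             int nlevel) {
        for (int l = nlevel; l >= 2; --l) {    // levels must be processed in order
            for (int c : C[l]) {               // cells within one level are independent
                if (children[c].empty())
                    computeLeafME(c);
                else
                    for (int ch : children[c]) translateM2M(ch, c);
            }
        }
    }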


Fig. 5. A flow chart of a node-based FMBEM.

Fig. 6. Flow chart of the dual-information FMBEM.

The flow chart of the dual-information FMBEM is shown in Fig. 6, and the detailed upward and downward processes are described in Algorithm 2.

Algorithm 2. Dual-information FMBEM

Start algorithm
// Upward:
// ME (element-based)
(1) For each leaf c in Cleaf[0]-Cleaf[nleaf] do
      Call function ME(c, M(c)) to compute M(c)
    End for
// M2M
(2) For levels l = nlevel to 3 do
      For each cell c in C(l) do
        Get c's center and its parent pc's center
        Translate M(c) to M(pc) via M2M from c's center to pc's center as in Eq. (11)
      End for
    End for
// Downward:
// M2L
(3) For each cell c in C(2)-C(nlevel) do
      Get the level of cell c from Clevel(c) and get the center of cell c.
      For each cell c~ in Inlist(c) do
        Get c~'s center.
        Accumulate the contribution of M(c~) to L(c) via M2L from c~'s center to c's center as in Eq. (13).
      End for
    End for
// NFDC (element-based)
(4) For each cell c in C(2)-C(nlevel) do
      For each cell c~ in Adlist(c) do
        If c is a leaf || c~ is a leaf then
          Call function Direct(c, c~) to compute the NFDC.
        End if
      End for
    End for
// L2L
(5) For levels l = 3 to nlevel do
      For each cell c in C(l) do
        Get c's center and its parent pc's center.
        Accumulate the contribution of L(pc) to L(c) via L2L from pc's center to c's center as in Eq. (14).
      End for
    End for
// LE
(6) For each leaf c in Cleaf[0]-Cleaf[nleaf] do
      Compute the LE for all nodes in c via the LE formula, Eq. (12).
    End for
End algorithm


Function ME(c, M(c))
  Get the level of cell c from Clevel(c) and get c's center.
  For element i = 1 to Enum[c] do
    Get element Se from Elist(c).
    Compute Eqs. (9) and (10) from Se to c's center, and add the element integral results to M(c).
  End for

Function Direct(c, c~)
  For i = 1 to Enum[c~] do
    Get element Se from Elist(c~).
    Compute the NFDC by integrating each element in c~ to each node in c via the conventional BEM.
  End for

From Algorithm 2, it is easy to see that the ME, M2L, NFDC and LE, which are independent of the levels, are suitable for parallel computing, and the additional memory used by the lists and arrays (e.g. Adlist, Cleaf) is small. Furthermore, the integrals in the ME and NFDC can be computed efficiently by direct integration without any particular node-based approach.
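Because the ME of Algorithm 2 loops over leaves rather than levels, it can be parallelized directly over the Cleaf array, for example with OpenMP on the CPU; in the sketch below, computeME stands for the paper's Function ME(c, M(c)) and is only a placeholder.

    // Level-independent ME pass: every leaf is processed independently.
    void computeME(int cell);                        // placeholder for Function ME(c, M(c))

    void upwardMEDualInfo(const int* Cleaf, int nleaf) {
        #pragma omp parallel for schedule(dynamic)   // leaves carry no mutual dependency
        for (int i = 0; i < nleaf; ++i) {
            computeME(Cleaf[i]);
        }
    }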


3.3. Computational complexity

The computational costs of the different parts of the dual-information FMBEM depend on the maximum number of nodes allowed in a leaf, maxn, and the number of expansion terms p. The computational cost of the dual-information FMBEM for 3D elastic problems is discussed here. If the number of (triangular) elements is Ne, the number of nodes N approximately equals Ne/2. The depth of the octree is nlevel = log8(N/maxn), and the time for constructing the tree is short (e.g. less than 1% of the total time in [7]). The approximate number of leaves is

$$nleaf \approx N/maxn. \quad (15)$$

The number of cells is

$$ncell \approx nleaf\,(1 + 1/8 + 1/8^2 + 1/8^3 + \cdots + 1/8^{nlevel}) \le 8\,nleaf/7 = 8N/(7\,maxn). \quad (16)$$

Set $p_2 = (p+1)(p+2)/2$. The cost of $M'_{n,m}(Q_c)$ and $M'_{j,n,m}(Q_c)$ is about 9 times the cost of $M_{n,m}(Q_c)$ and $M_{j,n,m}(Q_c)$ in Eq. (10) because of their more complicated summation. Assuming the number of numerical integration points in an element is npoint (we use a numerical method to compute Eq. (10)), the cost of the ME (computing only the multipole moments with $m \ge 0$) is

$$Cost_{ME} \approx p_2\,Ne\,npoint\,(1+9)/2 = 10\,p_2\,N\,(npoint). \quad (17)$$

The cost of M2M is

$$Cost_{M2M} \approx 2\,p_2^2\,ncell = 16\,p_2^2\,N/(7\,maxn). \quad (18)$$

The maximum number of cells in the interacting cell list is 189, and the average number is about half of the maximum, therefore the cost of M2L (computing only the local moments with $m \ge 0$) is

$$Cost_{M2L} \approx 2\,p_2^2\,(189\,ncell/2) = 216\,p_2^2\,N/maxn. \quad (19)$$

The cost of L2L is

$$Cost_{L2L} \approx Cost_{M2M} \approx 16\,p_2^2\,N/(7\,maxn). \quad (20)$$

Because each node result has 3 components (one for each coordinate direction), the cost of the LE is

$$Cost_{LE} \approx 2\,p_2^2\,(3N) = 6\,p_2^2\,N. \quad (21)$$

In the NFDC, most element integrals are computed numerically (the number of integration points is npoint), and we assume that the number of elements in a leaf is about twice the number of nodes. The maximum number of cells in the adjacent cell list is 27, and we assume the average number is about half of the maximum. If at least one of two adjacent cells is a leaf, the NFDC has to be used between the cells. The cost of the NFDC is

$$Cost_{NFDC} \approx npoint\,(2\,nleaf)\,(2\,maxn^2)\,(27/2) = 54\,(npoint)(maxn)\,N. \quad (22)$$

The difference between the node-based FMBEM and the dual-information FMBEM is that Cost_ME and Cost_NFDC in the node-based FMBEM are approximately 2-3 times those in the dual-information FMBEM. In general, maxn is less than 50 and p is less than 12, and the number of integration points of a triangular element is 4 or 7 with the Hammer numerical quadrature method [21]. From Eq. (17) to Eq. (22), we can see that Cost_ME, Cost_M2L and Cost_NFDC occupy most of the time consumption. Since GPU parallel computing has been successfully applied to scientific computations with significant results, the GPU parallel computing of the ME, M2L and NFDC of our dual-information FMBEM will be discussed in the next section.

4. Implementation of the dual-information FMBEM on GPUs

GPUs, which were originally used for graphics processing, are increasingly being applied to scientific computations due to their outstanding computational power, especially since NVIDIA released its parallel computing architecture CUDA in 2006 [22]. CUDA allows C programs to be compiled and executed on the stream processors of GPUs, which offers a low barrier to entry. Under CUDA, a program is completed by both a host part (CPU) and a device part (GPU), where the host part is in charge of the serial portions and the device part is in charge of the parallel portions. The functions defined in the device part are called kernel functions. Threads are the executable units of the GPU. A certain number of threads bundled together form a thread block. Within a block, threads are handled as warps (32 threads) which always run together on a processor (core). There are several memories that can be used by programmers to achieve high efficiency. Global memory, which has a noticeable latency, usually stores input and output data; CUDA offers a coalesced accessing scheme, where data are read and written at contiguous addresses, to reduce this latency. Registers are allocated to individual threads, and each thread can only access its own registers (the speed is very high). Shared memory can be accessed by all threads in a block as fast as registers. Constant memory is cached; the host can read and write it but the device can only read it. Local memory, which is allocated to each thread, usually stores data when the registers are used up. Texture memory has some special functions (e.g. texture rendering).

The NVIDIA GTX 285 GPU is used to develop our parallel FMBEM program; it consists of 30 multiprocessors. Global memory occupies most of the total memory (1 GB). Local memory uses the global memory (in the latest GPUs, e.g. the NVIDIA GTX 580, local memory has its own memory space). The 8 processors in a multiprocessor share 16 KB of shared memory and 16384 registers, and all multiprocessors share 64 KB of constant memory.

In Section 3, we estimated that the ME, M2L and NFDC spend most of the time in the FMBEM, and the dual-information FMBEM is more efficient and more suitable for parallelization; thus we make the effort to parallelize the ME, M2L and NFDC of the dual-information FMBEM in the GPU implementation using CUDA.
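The following minimal CUDA sketch shows the "one-block-one-cell" launch pattern that Algorithms 3-5 rely on (one block per cell, cooperation through shared memory, results written back to global memory); all names and the per-thread work are illustrative placeholders, not the paper's actual kernels.

    #include <cuda_runtime.h>

    __global__ void oneBlockOneCellKernel(const int* cellOfBlock,   // e.g. Cleaf[]
                                          float* globalResult,      // per-cell output values
                                          int valuesPerCell) {
        extern __shared__ float partial[];       // one slot per thread of this block
        const int bid  = blockIdx.x;             // one block <-> one cell
        const int tid  = threadIdx.x;
        const int cell = cellOfBlock[bid];

        partial[tid] = 0.0f;
        // ... per-thread work for this cell (ME terms, M2L terms, ...) accumulates here ...
        __syncthreads();

        if (tid < valuesPerCell)                 // copy the shared-memory results back
            globalResult[cell * valuesPerCell + tid] = partial[tid];
    }

    // Launch sketch: numblocks = number of cells, TB threads per block,
    // TB * sizeof(float) bytes of dynamic shared memory:
    //   oneBlockOneCellKernel<<<numblocks, TB, TB * sizeof(float)>>>(dCells, dOut, TB);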


4.1. ME parallel strategy

From Algorithm 2, it is easy to find that the ME of each leaf is independent, thus a "one-block-one-cell" mode of parallelization is adopted to compute the ME. According to the properties of the FMBEM, we propose a particular thread distribution strategy that assigns each thread in a block to compute the values of $M_{n,m}(Q_c)$, $M_{j,n,m}(Q_c)$, $M'_{n,m}(Q_c)$ and $M'_{j,n,m}(Q_c)$ for one n and m pair (named (n,m) for short) in Eq. (10). With p expansion terms, the value domain of (n,m) is

$$0 \le n \le p, \qquad -n \le m \le n \qquad (n, m \in \mathbb{Z}), \quad (23)$$

where it can be seen that m is symmetric about 0, and the multipole moments in Eq. (10) have the following property [15]:

$$M_{n,-m}(Q_c) = (-1)^m\,\overline{M_{n,m}(Q_c)} \quad (m \ge 0),$$
$$M_{j,n,-m}(Q_c) = (-1)^m\,\overline{M_{j,n,m}(Q_c)} \quad (m \ge 0),$$
$$M'_{n,-m}(Q_c) = (-1)^m\,\overline{M'_{n,m}(Q_c)} \quad (m \ge 0),$$
$$M'_{j,n,-m}(Q_c) = (-1)^m\,\overline{M'_{j,n,m}(Q_c)} \quad (m \ge 0). \quad (24)$$

In order to save memory, which is especially limited on the GPU, only the multipole moments with $m \ge 0$ are stored, and the values for $m < 0$ are obtained from Eq. (24). The detailed (n,m) distribution is shown in Fig. 7, like the indices of a lower triangular matrix. One-dimensional arrays are used to store the multipole moments: $M_{n,m}(Q_c)$ and $M'_{n,m}(Q_c)$ each need one array, and $M_{j,n,m}(Q_c)$ and $M'_{j,n,m}(Q_c)$ each need three arrays (one array for each j of $M_{j,n,m}(Q_c)$ or $M'_{j,n,m}(Q_c)$). The location of (n,m) in an array is

$$loct_{n,m} = n(n+1)/2 + m. \quad (25)$$

One thread is assigned to handle one (n,m) computation of the multipole moments, so the number of threads in one block is (p+1)(p+2)/2, and the thread-to-(n,m) mapping is shown in Fig. 8. In this parallel mode, all multipole moments are stored in shared memory during the parallel computation, which avoids the access latency of global memory, and the loads of the threads are the same. When the number of leaves is large enough, the computing capability of the GPU can be fully utilized. After the ME computation, the values of the multipole moments are copied from shared memory to global memory. The parallel CUDA ME algorithm is given in the following.

Fig. 7. Distributing mode of (n,m).

Fig. 8. Sketch of the thread distribution.

Algorithm 3. CUDA ME Algorithm

Number of blocks: numblocks = nleaf.
Number of threads in a block: TB = p(p+1)/2 + p + 1.
Let bid = block ID, tid = local thread ID.
Start algorithm
// Executed in parallel over a 1-D grid of threads
For bid = 0 to numblocks in parallel do
  If tid == 0 then
    Get the cell number of bid: c = Cleaf[bid].
    Get the level of cell c from Clevel[c] and get the center of cell c.
  End if
  // For the detailed computation of the ME refer to Algorithm 2
  Using tid, compute the tid-th location value of M(c).
End for
End algorithm

4.2. M2L parallel strategy

In the M2L, the local moments from the M2L translations need to be computed (see Eq. (13)), and the local moments have the same property as the multipole moments in Eq. (24):

$$L_{n,-m}(P_c) = (-1)^m\,\overline{L_{n,m}(P_c)} \quad (m \ge 0),$$
$$L_{j,n,-m}(P_c) = (-1)^m\,\overline{L_{j,n,m}(P_c)} \quad (m \ge 0),$$
$$L'_{n,-m}(P_c) = (-1)^m\,\overline{L'_{n,m}(P_c)} \quad (m \ge 0),$$
$$L'_{j,n,-m}(P_c) = (-1)^m\,\overline{L'_{j,n,m}(P_c)} \quad (m \ge 0). \quad (26)$$
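The storage location of Eq. (25) also fixes the thread-to-(n,m) assignment used by Algorithms 3 and 4: thread tid owns the moment stored at loct(n,m) = n(n+1)/2 + m. The inversion below, which recovers (n,m) from tid, is our own illustration rather than code from the paper.

    #include <math.h>
    #include <stdio.h>

    // Recover (n, m) with 0 <= m <= n <= p from the flat index tid = n(n+1)/2 + m (Eq. (25)).
    static inline void tidToNM(int tid, int* n, int* m) {
        int k = (int)((-1.0 + sqrt(1.0 + 8.0 * tid)) / 2.0);
        if ((k + 1) * (k + 2) / 2 <= tid) ++k;      // guard against floating-point rounding
        else if (k * (k + 1) / 2 > tid) --k;
        *n = k;
        *m = tid - k * (k + 1) / 2;
    }

    int main(void) {
        const int p  = 6;                           // number of expansion terms (illustrative)
        const int TB = (p + 1) * (p + 2) / 2;       // threads per block, as in Algorithm 3
        for (int tid = 0; tid < TB; ++tid) {
            int n, m;
            tidToNM(tid, &n, &m);
            printf("tid %2d -> (n,m) = (%d,%d)\n", tid, n, m);
        }
        return 0;
    }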


The ME parallel strategy is also applied to the M2L: one thread handles one (n,m) computation of the M2L translations (see Eq. (13)). All M2L translations are stored in shared memory during the parallel computation, and the results are copied from shared memory to global memory. However, when the ME is computed in a cell, the multipole moments only need to be computed for that cell, whereas when the M2L is computed in a cell, the M2L translations of all cells in the interacting cell list of that cell have to be computed and added together. The parallel CUDA M2L algorithm is as follows:

Algorithm 4. CUDA M2L Algorithm

The number of cells in C(2)-C(nlevel) is n2nlevcell.
Number of blocks: numblocks = n2nlevcell.
Number of threads in a block: TB = p(p+1)/2 + p + 1.
Let bid = block ID, tid = local thread ID.
Start algorithm
// Executed in parallel over a 1-D grid of threads
For bid = 0 to numblocks in parallel do
  If tid == 0 then
    Get the cell number c of bid.
    Get the level of cell c from Clevel[c] and get the center of cell c.
  End if
  For each cell c~ in Inlist(c) do
    If tid == 0 then
      Get the center of cell c~.
    End if
    // For the detailed computation of the M2L refer to Algorithm 2
    Using tid, accumulate the tid-th location value of L(c).
  End for
End for
End algorithm

4.3. NFDC parallel strategy

From Algorithm 2, we can see that the computation in the element-based NFDC is cell independent, thus we also adopt the "one-block-one-cell" mode of parallelization, but the details are different. If two adjacent cells (each cell is considered to be an adjacent cell of itself) need to be computed in the NFDC, the two cells can be divided into a source cell and a field cell, and the contributions from each element in the field cell to each node in the source cell need to be computed. The first parallel strategy that comes to mind is to simultaneously compute the integrals of all the elements in the field cell to one node in the source cell (see Fig. 9(a)), and then traverse the nodes in the source cell to finish the cell computation. In this strategy, the integrals of the elements in the field cell to a node in the source cell have to be accumulated, and the accumulation cannot be done in parallel. If the number of nodes in the source cell is N1 and the number of elements in the field cell is N2, the number of accumulating operations is N1 x N2, so we abandon this strategy. In order to improve the accumulating process, we compute in parallel the integrals of one element in the field cell to all nodes in the source cell (Fig. 9(b)), and then loop over the elements in the field cell to finish the cell computation. In this way, the number of accumulating operations is N2. According to this parallel strategy, we set the number of threads in a block equal to the maximum number of nodes allowed in a leaf, and use the shared memory to store the integral results. The parallel CUDA NFDC algorithm is described below.

Fig. 9. Modes of element-to-node integration in the NFDC. (a) Multiple elements to one node and (b) one element to multiple nodes.

Algorithm 5. CUDA NFDC Algorithm

The number of cells in C(2)-C(nlevel) is n2nlevcell.
Number of blocks: numblocks = n2nlevcell.
Number of threads in a block: TB = maxn.
Let bid = block ID, tid = local thread ID.
Use the array dresult[i] to store the integration result of the i-th node in a source cell.
Start algorithm
// All blocks and their threads are executed in parallel over a 1-D grid of threads
For bid = 0 to numblocks in parallel do
  Get the cell number c of bid.
  For each cell c~ in Adlist(c) do
    If c is a leaf || c~ is a leaf then
      Call function Pdirect(c, c~) to compute the NFDC.
    End if
  End for
End for
End algorithm

Function Pdirect(c, c~)
  // Get ratio to determine the thread loop times for cell c; int(n) means take the integer part of n.
  ratio = int((Nnum[c] - 1)/maxn)
  For nodeoffset = 0 to ratio do
    dif = ratio - nodeoffset // dif is an offset index
    If dif != 0 then
      For each element Se in Elist(c~) do
        dresult[maxn*(dif-1) + tid] += contribution from Se
      End for
    Else
      If tid < (Nnum[c] - maxn*ratio) then
        dresult[maxn*ratio + tid] += contribution from Se
      End if
    End if
  End for

5. Examples and analysis

The FMBEM programs are executed on a desktop computer: the CPU is an Intel Core2 Q9550 (2.83 GHz), the GPU is an NVIDIA GTX 285, the OS is Windows XP Professional, the RAM is DDR3 SDRAM (2 GB),


the compiler for CPU code is Microsoft Visual Studio 2008, and the compiler for GPU codes is NVIDIA CUDA 4.0 (C language).

5.1. Example 1

Example 1 is a solid rivet whose dimensions and boundary conditions are shown in Fig. 10. The elastic modulus is 200000 MPa, Poisson's ratio is 0.3, and the load value is 10 MPa (Y direction). The surface of the rivet is meshed with 712, 5704, 16100 and 67776 linear triangle elements separately (see Fig. 11), and the numbers of degrees of freedom (DOFs) are 1074, 8562, 24156 and 101670. The number of expansion terms p is 6, and the maximum number of nodes allowed in a leaf, maxn, is 20.

The computational time of the major parts of the node-based FMBEM (global-node approach) and the dual-information FMBEM is shown in Table 2 (CPU using 1 core). We can see that the ME, M2L and NFDC take most of the time (more than 95% of the total time in the FMBEM process), and L2L and LE take much less time in all cases (DOFs from 10³ to 10⁵), which accords with our estimation in Section 3.3. It is easy to see that the time of the ME and NFDC in the dual-information FMBEM is almost 1/3 and 1/2, respectively, of that in the node-based FMBEM, because the element-based integration used in the dual-information FMBEM is more efficient than the node-based integration used in the node-based FMBEM. The times of M2M, L2L and LE are almost equal (because there is no difference in these parts between the two FMBEMs).

Table 3 shows the computational time and the speedups between the CPU (using 4 cores via OpenMP) and the GPU for the ME, M2L and NFDC, which were computed in parallel based on the dual-information FMBEM. Note that when Ne (the number of elements) is small, the speedups are not significant because the GPU cannot make full use of its computational ability in this situation; we do not care much about the small-scale cases because the CPU is also very fast there.

Fig. 10. Size and boundary conditions of the solid rivet (dimensions 16 mm, 20 mm, 25 mm and 10 mm; load face and load direction along Y, constraint face).

When Ne increases to a certain scale, for example Ne = 16100 in Table 3 (24156 DOFs), the speedups of the ME, M2L and NFDC reach 10.1, 14.0 and 2.7, respectively (when the CPU uses 1 core, the speedups reach 27.5, 42.4 and 7.6). The speedups of the ME and M2L behave well, as expected; however, the speedup of the NFDC is not satisfactory because the logic structure of the CUDA NFDC algorithm (Algorithm 5) is so complicated that it is not well suited to GPU computing (logic branching on the GPU is much slower than on the CPU). Our future research will focus on optimizing the NFDC and accelerating its run speed on the GPU.

Excluding the memory usage of the solution phase, the memory usage of the CPU code (double precision) is 11 MB, 20 MB, 33 MB and 137 MB when the element number Ne equals 712, 5704, 16100 and 67776, respectively, while the graphics memory usage of the GPU code (single precision) is 60 MB, 66 MB, 73 MB and 102 MB. Although the memory usage of single-precision data is less than that of double-precision data, some coefficients stored in graphics memory for GPU parallel computing, especially in the NFDC (about 45 MB), cause the memory usage of the GPU code to be larger than that of the CPU code when Ne is small.

We used the generalized minimum residual (GMRES) method [23] to solve example 1 with 712 linear triangle elements, where a leaf-based preconditioner [24] is used and the convergence tolerance is set to 10⁻⁵. The number of iterations is 33 for all the CPU codes and the GPU code, and the total time is 24.6 s (node-based FMBEM with the CPU using 1 core), 14.0 s (dual-information FMBEM with the CPU using 1 core), 7.8 s (dual-information FMBEM with the CPU using 4 cores) and 5.4 s (dual-information FMBEM with the GPU), respectively. The displacement results of the FMBEMs in the load direction are shown in Fig. 12, and the result of ANSYS [25] (a well-known CAE package using the finite element method) is also shown in Fig. 12 (using 2826 quadratic tetrahedron elements whose edge-size control is the same as that of the 712 triangle boundary elements). The results among all FMBEMs and ANSYS are consistent.

In order to compare the errors of the FMBEMs in detail, the displacement results of 11 nodes on a line along the load direction (Line 1 in the 712-element model in Fig. 11) were picked out. First, we adopted the corresponding ANSYS results as the criterion, and the relative errors between the FMBEMs and ANSYS are shown in Fig. 13(a). Then we adopted the dual-information FMBEM (CPU) results as the criterion, against which the node-based FMBEM (CPU) results and the dual-information FMBEM (GPU) results were compared; the relative errors are shown in Fig. 13(b). From Fig. 13(a), it can be observed that the relative errors between all the FMBEMs and ANSYS are less than 4.5%.

Fig. 11. Four different mesh models for the solid rivet: 712, 5704, 16100 and 67776 elements (nodes 1-11 of Line 1 are marked on the 712-element model).


Table 2. Time comparison between the node-based FMBEM (CPU, 1 core) and the dual-information FMBEM (CPU, 1 core).

         Node-based FMBEM                                           Dual-information FMBEM
Ne       ME (s)   M2M (s)  M2L (s)   NFDC (s)  L2L (s)  LE (s)     ME (s)   M2M (s)  M2L (s)   NFDC (s)  L2L (s)  LE (s)
712      0.219    0.001    0.063     0.531     0.001    0.004      0.063    0.001    0.063     0.25      0.001    0.004
5 704    1.813    0.021    4.062     2.609     0.023    0.029      0.609    0.021    4.063     1.203     0.023    0.029
16 100   5.125    0.036    5.235     14.704    0.039    0.082      1.688    0.036    5.235     6.687     0.039    0.082
67 776   21.547   0.134    19.234    55.344    0.143    0.345      7.094    0.134    19.238    24.969    0.144    0.346

Table 3. ME, M2L and NFDC times of example 1 with different numbers of elements, GPU versus CPU (4 cores), dual-information FMBEM.

         CPU (4 cores)                   GPU                             Speedups (GPU/CPU)
Ne       ME (s)   M2L (s)  NFDC (s)     ME (s)   M2L (s)  NFDC (s)      ME     M2L    NFDC
712      0.031    0.031    0.11         0.015    0.016    0.125         2.1    1.9    0.9
5 704    0.188    1.016    0.532        0.031    0.11     0.281         6.1    9.2    1.9
16 100   0.625    1.751    2.422        0.062    0.125    0.891         10.1   14.0   2.7
67 776   2.626    6.486    9.078        0.258    0.453    3.25          10.2   14.3   2.8

Fig. 12. Displacement results at load direction of example 1.

Fig. 13. Relative errors of the node values on Line 1. (a) Relative errors with ANSYS and (b) relative errors with dual-information FMBEM (CPU).

The number of triangle elements in the two FMBEMs is much less than the number of tetrahedron elements in ANSYS (712 and 2826, respectively) because of the dimension-reduction property of the BEM. From Fig. 13(b), it can be seen that the relative error between the node-based FMBEM (CPU) and the dual-information FMBEM (CPU) is less than 0.12%. The reason for this error is that the node-based ME in the node-based FMBEM and the element-based ME in the dual-information FMBEM are different. For example, in Fig. 14, the triangular element belongs to cell 1 but one of its nodes (node 3) belongs to cell 2. The contribution of the element

is fully given to cell 1 in the element-based mode, but in the node-based mode only part of the contribution is given to cell 1 (the parts of node 1 and node 2) and the rest is given to cell 2 (the part of node 3).

The GPU used in this paper has a single-precision architecture while the CPU computes in double precision, so some roundoff errors may be induced by the different precisions. For our example, the relative error between the dual-information FMBEM (CPU) and the dual-information FMBEM (GPU) is less than 0.01%. This relative error is so small that it can be ignored. However, according to the analysis


in [17], the error increases with the element number because of the accumulation of roundoff errors when computing large sums, but the error grows slowly.

5.2. Example 2

Example 2 is a quarter of a curing press hub; the boundary conditions and mesh model are shown in Fig. 15. The traction is 10 MPa (Z direction) on the top face and all displacements are constrained on the bottom face. The elastic modulus is 87500 MPa and Poisson's ratio is 0.25. The surface of the model is meshed with 28638 linear triangle elements, and the number of DOFs is 42930. The number of multipole expansion terms p is set to 4, 6 and 8 separately, and the maximum number of nodes allowed in a leaf is 20. The dual-information FMBEM is used and the convergence tolerance of the GMRES solution is set to 10⁻³.

Table 4 shows the computational time and the speedups of the ME, M2L and NFDC between the CPU (using 4 cores) and the GPU with different multipole expansion terms p. The GPU-CPU speedups of the ME and M2L are higher when p is large, especially for the M2L, because the computational ability of the GPU is used more efficiently when the computational scale is large. From Eqs. (17) and (19), we can see that the computational cost of the ME is approximately proportional to p₂ and the cost of the M2L is proportional to p₂², where p₂ = (p+1)(p+2)/2. That is why the time spent in the M2L increases significantly as p grows; the increase would be much slower if the new version of the FMM [26] were used in the FMBEM. Comparing the M2L and NFDC times of example 2 with those of example 1, it can be seen that example 2 with 28638 elements takes more time than example 1 with 67776 elements (p = 6 for both examples). The reason for this result is that the geometry of example 2 is more complicated than that of example 1.


As a result, the tree structure of the FMBEM for example 2 is also more complicated, which complicates the relationships between cells.

Table 5 shows the total time and total memory for different p. It can be seen that the GPU code is much faster than the CPU code even though only the ME, M2L, NFDC and some linear algebra computations in GMRES have been parallelized. The total speedups are 3.9, 4.7 and 6.2 when p equals 4, 6 and 8, respectively. Although the convergence rate using double precision is faster than that using single precision under the same conditions, as described in [27], this is not obvious in example 2. The memory usage of both the CPU and the GPU increases when p grows, but the increment is small. The memory usage of the GPU is less than that of the CPU in example 2 because the CPU uses double precision and the GPU uses single precision.

The displacements in the load direction of the GPU computation (p = 6) and the ANSYS computation (89128 quadratic tetrahedron elements) are shown in Fig. 16. The results of the FMBEM and ANSYS are consistent, and the relative error is less than 9%, which is much lower than the relative error (about 30%, tested by us) of linear tetrahedron elements against quadratic tetrahedron elements in ANSYS. The reason for this is that the accuracy of the BEM is higher than that of the FEM.

In order to discuss the influence of the multipole expansion terms p on the computational accuracy, the displacements in the load direction of 50 different nodes (results of the GPU computation) are selected to compare the relative errors of p = 4 and p = 6 with p = 8 (p = 8 is the criterion), and the results are shown in Fig. 17. We can find that the relative error between p = 6 and p = 8 is very small (the maximum is less than 1%), and the relative error between p = 4 and p = 8 is obviously bigger. The computational accuracy is more sensitive when the expansion term p is small, which conforms to the error estimate of the multipole expansion in [28]; in that reference, the error is described in terms of (1/Rd)^p, where Rd is the ratio of the P-Q_c distance to the Q_c-Q distance in Eq. (7). The CPU results lead to a similar conclusion and are not discussed here.

Fig. 17. Relative errors (GPU) of p = 4 and p = 6 with respect to p = 8.

Fig. 14. Modes of element-based contribution and node-based contribution.

6. Conclusions

We presented an adaptive dual-information FMBEM whose adaptive tree structure contains both node and element information, and used this FMBEM for 3D elasticity. In the dual-information FMBEM, the cell division is based on nodes, the elements whose centroids are located inside a cell are assigned to that cell, and the element integrals in the ME and NFDC can be computed efficiently without influencing the efficiency of the other parts of the FMBEM. We optimized the data structure in our dual-information FMBEM to separate cells from their levels, which makes the ME, M2L, LE and NFDC suitable for parallel computing.

Fig. 15. Geometric model and mesh model of example 2. (a) Geometric model and the boundary conditions (load face and load direction on top, constrained bottom face; dimensions 900 mm and 2400 mm) and (b) mesh model.


Table 4. ME, M2L and NFDC times of example 2 with different multipole expansion terms p, GPU versus CPU (4 cores), dual-information FMBEM.

      CPU (4 cores)                   GPU                             Speedups (GPU/CPU)
p     ME (s)   M2L (s)  NFDC (s)     ME (s)   M2L (s)  NFDC (s)      ME     M2L    NFDC
4     0.797    1.891    10.156       0.092    0.281    2.791         8.7    6.7    3.6
6     1.450    6.875    10.156       0.149    0.579    2.791         9.7    11.9   3.6
8     2.344    20.812   10.156       0.232    1.282    2.791         10.7   16.2   3.6

Table 5. Total time and total memory of example 2 with different multipole expansion terms p, GPU versus CPU (4 cores), dual-information FMBEM.

      CPU (4 cores)                                         GPU
p     Iteration number  Total time (s)  Total memory (MB)   Iteration number  Total time (s)  Total memory (MB)
4     71                933.1           122                 72                237.6           90
6     72                1386.1          132                 73                294.4           93
8     72                2517.9          145                 73                406.8           97

Fig. 16. Displacement results at load direction of example 2.

Through estimating the computational costs of the different parts of the dual-information FMBEM, we found that the ME, M2L and NFDC occupy most of the time consumption, thus we accelerated these three parts on the GPU to reduce the run time. We assigned one thread in one block to compute each value of each (n,m) subscript pair of the multipole moments and of the local moments from the M2L in Eqs. (10) and (13), and used a one-element-to-multiple-nodes parallel strategy in the NFDC (each thread handles one node of a cell at a time). The example results (DOFs from 10³ to 10⁵) show that the time consumption of the ME and NFDC in the dual-information FMBEM is less than that in the node-based FMBEM (almost 1/3 and 1/2 of the node-based FMBEM, respectively), and the maximum GPU-CPU (4 cores) speedups of the examples reach 10.7 for ME, 16.2 for M2L, and 3.6 for NFDC. As a general conclusion, the computational precision of the GPU satisfies our demands, although the single-precision architecture of the GPU may cause some roundoff errors.

In the future, we will further optimize the FMBEM program and integrate it into a CAD system so that CAD model creation and elasticity analysis can be carried out in the same software environment. Such work will offer a scientific tool for CAD/CAE analysis.

Acknowledgments

This work was supported by the National Science Foundation of China under Grant nos. 50975107 and 51075162.

References

[1] Dongarra J, Sullivan F. Guest editors' introduction: the top 10 algorithms. Comput Sci Eng 2000:22-3.
[2] Greengard L, Rokhlin V. A fast algorithm for particle simulations. J Comput Phys 1987;73(2):325-48.
[3] Warren MS, Salmon JK. A parallel hashed oct-tree n-body algorithm. In: Proceedings of the ACM/IEEE conference on supercomputing; 1993. p. 12-21.
[4] Greengard L, Rokhlin V. A new version of the fast multipole method for the Laplace equation in three dimensions. Acta Numer 1997;6:229-69.
[5] Carrier J, Greengard L, Rokhlin V. A fast adaptive multipole algorithm for particle simulations. SIAM J Sci Statistical Comput 1988;9:669-86.
[6] Cheng H, Greengard L, Rokhlin V. A fast adaptive multipole algorithm in three dimensions. J Comput Phys 1999;155(2):468-98.
[7] Singh JP, Holt C, Hennessy JL, Gupta A. A parallel adaptive fast multipole method. In: Proceedings of the ACM/IEEE conference on supercomputing; 1993. p. 54-65.
[8] Liu YJ, Nishimura N. The fast multipole boundary element method for potential problems: a tutorial. Eng Anal Boundary Elem 2006;30(5):371-81.
[9] Nishimura N, Yoshida K, Kobayashi S. A fast multipole boundary integral equation method for crack problems in 3D. Eng Anal Boundary Elem 1999;23(1):97-105.
[10] Chaillat S, Bonnet M, Semblat JF. A multi-level fast multipole BEM for 3-D elastodynamics in the frequency domain. Comput Methods Appl Mech Eng 2008;197(49-50):4233-49.
[11] Gao XW, Davies TG. Boundary element programming in mechanics. Cambridge: Cambridge University Press; 2002.
[12] Hamada S, Takuma T. Effective precondition technique to solve a full linear system for the fast multipole method. IEEE Trans Magn 2003;39(3):1666-9.
[13] Buchau A, Rieger W, Rucker WM. BEM computations using the fast multipole method in combination with higher order elements and the Galerkin method. IEEE Trans Magn 2001;37(5):3181-5.
[14] Zhao LB, Yao ZH. Fast multipole BEM for 3-D elastostatic problems with applications for thin structures. Tsinghua Sci Technol 2005;10(1):67-75.
[15] Yoshida K. Applications of fast multipole method to boundary integral equation method. Japan: Department of Global Environment Engineering, Kyoto University; 2001.
[16] Lu BZ, Cheng XL, Huang JF, McCammon JA. An adaptive fast multipole boundary element method for Poisson-Boltzmann electrostatics. J Chem Theory Comput 2009;5(6):1692-9.
[17] Gumerov NA, Duraiswami R. Fast multipole methods on graphics processors. J Comput Phys 2008;227(18):8290-313.
[18] Takahashi T, Hamada T. GPU-accelerated boundary element method for Helmholtz' equation in three dimensions. Int J Numer Methods Eng 2009;80(10):1295-321.
[19] Cruz FA, Layton SK, Barba LA. How to obtain efficient GPU kernels: an illustration using FMM and FGT algorithms. Comput Phys Commun 2010;182(10):2084-98.
[20] Gumerov NA, Duraiswami R. Fast multipole methods for the Helmholtz equation in three dimensions. Oxford: Elsevier Science; 2004.
[21] Dunavant D. High degree efficient symmetrical Gaussian quadrature rules for the triangle. Int J Numer Methods Eng 1985;21(6):1129-48.
[22] NVIDIA Corp. NVIDIA CUDA programming guide, version 3.0; 2010.
[23] Saad Y, Schultz MH. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Stat Comput 1986;7(3):856-69.
[24] Wang HT, Yao ZH, Wang P. On the preconditioners for fast multipole boundary element methods for 2D multi-domain elastostatics. Eng Anal Boundary Elem 2005;29(7):673-88.
[25] ANSYS Inc. ANSYS 13.0 online help; 2010.
[26] Yoshida K, Nishimura N, Kobayashi S. Application of new fast multipole boundary integral equation method to crack problems in 3D. Eng Anal Boundary Elem 2001;25(4):239-47.
[27] Xiao H, Chen ZJ. Numerical experiments of preconditioned Krylov subspace methods solving the dense non-symmetric systems arising from BEM. Eng Anal Boundary Elem 2007;31(12):1013-23.
[28] Liu YJ. Fast multipole boundary element method: theory and applications in engineering. Cambridge: Cambridge University Press; 2009.