Advances in Engineering Software 38 (2007) 87–101 www.elsevier.com/locate/advengsoft
Development of parallel 3D RKPM meshless bulk forming simulation system Wang H. *, Li Guangyao, Han X., Zhong Zhi Hua The State Key Laboratory of Advanced Technology for Vehicle Design and Manufacture, College of Mechanical and Automotive Engineering, Hunan University, China Received 21 November 2005; received in revised form 25 July 2006; accepted 9 August 2006 Available online 22 September 2006
Abstract A parallel computational implementation of modern meshless system is presented for explicit for 3D bulk forming simulation problems. The system is implemented by reproducing kernel particle method. Aspects of a coarse grain parallel paradigm—domain decompose method—are detailed for a Lagrangian formulation using model partitioning. Integration cells are uniquely assigned on each process element and particles are overlap in boundary zones. Partitioning scheme multilevel recursive spectrum bisection approach is applied. The parallel contact search algorithm is also presented. Explicit message passing interface statements are used for all communication among partitions on different processors. The parallel 3D system is developed and implemented into 3D bulk metal forming problems, and the simulation results demonstrated the efficiency of the developed parallel reproducing kernel particle method system. 2006 Elsevier Ltd. All rights reserved. Keywords: Parallel; Reproducing kernel particle method; Bulk forming; Domain decomposition method; Partition; Message passing interface; Speedup
1. Introduction The non-linear finite element formulations for non-linearity of geometric and material have been well developed and a lot of significant works have been completed in bulk forming analysis. Nevertheless, FEMs are still ineffective in dealing with some extreme material distortions owing to severe mesh distortion. There are difficulties in solving problems involving large deflections and moving discontinuities. Typical problems include extremely large deformations in manufacturing processes, the propagation of interface between solids and liquids in casting, the propaAbbreviations: RKPM, reproducing kernel particle method; EFG, element free Galerkin method; PIM, point interpolation method; SPH, smoothed particle hydrodynamics; FEM, finite element method; RSB, recursive spectrum bisection; MRSB, multilevel recursive spectrum bisection; DDM, domain decomposition method; MPI, message passing interface; EBC, essential boundary condition; SYMMLQ, sparse symmetric equations; PE, process element. * Corresponding author. Tel.: +86 731 4329 266; fax: +86 731 8821 445. E-mail address:
[email protected] (H. Wang). 0965-9978/$ - see front matter 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.advengsoft.2006.08.002
gation of cracks with arbitrary and complex paths in failure, and the tracking of the growth of phase boundaries and microcracking in advanced materials development. Some of FEMs require remeshing models in large deformation conditions; it still requires considerable computational efforts. Meshless methods eliminate mesh distortion for both large shape design changes and large deformation non-linear analysis. A variety of meshless modeling methods have recently emerged [1]. RKPM was proposed by Liu et al. [1–5,39] to improve the accuracy of the SPH [27–29,31,32] method for finite domain problems. These methods have advantages over traditional FEMs for their ability to handle large deformation problems without mesh distortion, and for their solution accuracy due to the large domain of influence covered by particles/nodes. In this method, the kernel function is modified by introducing a correction function to meet the reproducing conditions. The resulting modified kernel function exactly reproduces polynomials to a specific order and thereby fulfills the completeness requirement. The shape functions developed from the method were later
88
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
proven to be equivalent to moving least-squares kernel interpolates if linear basis functions were used [1–3]. Liu et al. also introduced wavelets [5] as the kernel functions and successfully applied RKPM to multiple scale analysis. More recently, some effort has also been devoted to the solution of bulk metal forming processes by means of meshless approaches. Referenced papers encompass the following operations: ring compression [6,7], upsetting [6– 13] and extrusion [6,7,10,14,15]. Meshless methods encounter the major technical barrier-low computation efficiency. Because meshless methods are based on the high order interpolation, the computation time of meshless requires much longer than FEMs. By taking a general view to the content of these papers it appears that major results are generally limited to 2D problems and simple 3D problems. Therefore, meshless methods are not wildly applied in practice. Parallelization of meshless codes is the way to settle expense of meshless computations. The parallel meshless methods to decompose the domain into several subdomains to solve are called DDM. DDMs are useful in two contexts. First, the division of problems into smaller problems through usually artificial subdivisions of the domain is a means for introducing parallelism into a problem. In this manner, problems that are intractable on serial computers can be solved on parallel computers. Second, many problems involve more than one mathematical model, each posed on a different domain, so that domain decomposition occurs naturally. DDMs are wildly used for explicit FEM problems [42,43,45]. It is natural that DDMs [44] can advance the meshless computation efficiency easily. The present treatment will focus on the RKPM for explicit dynamic analysis of bulk metal forming problems, but the procedures can directly apply to some other meshless methods (e.g. EFG [1,30], PIM [16]) without difficulties. Several distinct advantages of RKPM are its ability to accurately model extremely large deformations without mesh distortion problems and its ease to adaptive modeling by simply changing particle definitions for desired refinement regions. In this paper, an overview of a Lagrangian RKPM for non-linear explicit dynamic analysis is first given. A general description of the parallel implementation is described. The parallel procedure primarily consists of a mesh partitioning pre-analysis phase, a parallel analysis phase that includes explicit message passing among partitions on separate processors. In the finial, the numerical examples have been presented to demonstrate the efficient parallel 3D RKPM system.
2. Weak form and discretization of RKPM for contact problems The reproducing kernel approximation [1–5] of a function u(X) in a domain Xx is expressed
ua ðX Þ ¼
Z
Ua ðX Y ÞuðY ÞdXx ;
ð1Þ
Xx
where ua(X) is the reproduced of function u(X), and Ua is the window or kernel function with compact support. In the following X represents the material coordinates, x is the spatial coordinate and u is the displacement of the particles and t denotes the time. Discretizing the domain Xx by a set of particles {X1, X2, . . . , XNP}, where NP is the total number of particles, the integral is approximated by the following summation: uðX ; tÞ ¼
NP X
N I ðX ÞuI ðtÞ;
ð2Þ
I¼1
where NI(X) is the Lagrangian shape function defined by N I ðX Þ ¼ CðX ; X X I ÞUa ðX X I ÞDV I :
ð3Þ
C(X; X XI) is the correction function and DVI is the volume of particle I. The more details on the construction of the shape function of RKPM can be found in Refs. [1–5]. For an approximation with the virtual work principle, the essential boundary conditions (EBCs) must be satisfied directly by the interpolation functions or accommodated by augmenting the variational statement with constraints. A major difference between RKPM and other methods (e.g. FEMs) is the manner in which EBCs can be enforced directly. The non-local interpolation condition of equation poses an additional computational challenge. Whereas EBCs for finite elements are imposed locally at particles (because they possess the Kronecker delta property), The EBCs enforcement with RKPM are non-local over a patch of particles/nodes. In some cases, the EBCs can be adequately approximated by local specification at the particles (assuming a Kronecker delta property). This approximation can be accurate, by St. Venant’s Principle, when the primary regions of interest are away from the EBCs. In general, however, a coupled set of equations is usually solved, even for explicit analyses. Previous efforts used Lagrange multipliers to constrain the variational statement [17] or a set of simultaneous equations is directly solved. In either case, significant computations were generally necessary to enforce the EBCs. These procedures also are not well suited for parallel processing, since they must generally be made over multiple processors. Kent [18] proposes an alternate approach that may also require significant computational effort, as it is algebraically equivalent to other existing equations solving methods. By treating the imposition as a transformation of the interpolation functions, however, this form is better for parallel processing. Describing the EBC equations by gB ¼
NP X J ¼1
N J ðX B ÞuJ ;
ð4Þ
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
89
where gB is the specified EBC at XB or can be an integral relation along XB. It is written in matrix form for all conditions
present normal contacting force and frictional force, respectively. The principle of virtual work can be written as
GT d ¼ g;
dW int dW ext dW kin dW c ¼ 0:
ð5Þ
ð16Þ
where by Gram–Schmidt or Householder orthogonalization
Substituting the particle approximations (2) and (11) into (16) the discretizing equation is obtained as
GT G ¼ I;
mI €uI ¼ fIext þ fIcont fIint ; Z Z ext fI ¼ uI b dX þ uIt dC; X0 Ct Z int rX uI P dX; fI ¼
JG ¼ 0;
J T J ¼ 1:
ð6Þ
The generalized variables are represented by d ¼ J^ u þ Gg:
ð7Þ
And we can obtain ð8Þ
ouðX Þ oN T ðX Þ oN T ðX Þ ¼ J^ uþ Gg: oX oX oX
ð9Þ
Note that the constraint conditions are recovered. That is ð10Þ
Since the new transformed interpolation functions of equation satisfy the EBCs, they can be directly applied in the virtual work statement. The alternate displacement vector ^u, is now determined with the EBCs directly specified in g. The orthogonalization in Eq. (6) may require a significant number of computer operations. These computations and the creation of the alternate interpolation functions of Eqs. (8) and (9), however, are done only once in the preprocessing phase. Transient displacement conditions are then accommodated by only changing the values in g. The gradient of displacement is rX u ¼
NP X
rX N I ðX ÞuI :
ð18Þ ð19Þ
X0
u þ N T ðX ÞGg; uðX Þ ¼ N T ðX ÞJ ^
GT d ¼ GT ðJ ^ u þ GgÞ ¼ g:
ð17Þ
ð11Þ
I¼1
The principle of the virtual work is used to formulate the general contact problems. The virtual work done by a stress field is Z rX du : P dX: ð12Þ dW int ¼ X0
The virtual work done by the external loads, the inertial force and the contact traction through the virtual displacement field are denoted by dWext, dWkin and dWc, and are calculated as Z Z ext dW ¼ du b dX þ du t dC; ð13Þ X0 Ct Z q0 du € u dX; ð14Þ dW kin ¼ X Z 0 ðqN dcN þ qT dcT ÞdC; ð15Þ dW c ¼ Cc
where P is the nominal stress tensor, q0 is the density in the reference configuration, t is the traction force, b is the body force, cN and cT denote the relative displacement in normal and tangent direction at the contact interface Cc, qN and qT
where mI denotes the mass of particle I, fcont is the contact force whose details are described in Refs. [29,30]. 3. Central difference algorithms For explicit time integration methods, central difference algorithm [30] is popularly being used and it can be obtained by writing down the following central difference expressions for velocity vn and acceleration an and vectors vn ¼ ðd nþ1 d n1 Þ=ð2DtÞ;
ð20Þ 2
an ¼ ðd nþ1 2d n þ d n1 Þ=ðDt Þ:
ð21Þ
Substituting these expressions in the dynamic equilibrium equations, and rearranging terms it can be written as 1
d nþ1 ¼ ðM þ C2DtÞ ½Dt2 ðpn1 fn1 Þ þ 2Md n ðM CDt2Þd nþ1 ;
ð22Þ
where p is the internal force vector. This is a recurrence relation in dn. If both M and C are diagonal matrices of mass and damp, the solution for dn+1 does not require factorization. Since the central difference algorithm does not require assembly of global matrices and can be handled at degrees of freedom level, the parallel implementation is pretty straightforward. The interprocessor communication requirement per time step is very small when compared to the implicit algorithms. So the DDM parallel method based on MPI is the best choice for explicit algorithms. 4. The explicit parallel algorithm for RKPM development As FEM explicit parallel algorithms, separate pre-analysis software is created to partition any general unstructured RKPM models. Similar to explicit FEMs [33–35], in this section, DDM is implemented for parallel 3D RKPM system. We mainly focus on the partition methods and data structures for complex communication among subdomains. 4.1. Partition 4.1.1. Partition scheme for meshless methods For effective parallel computing, it is critical to balance the computational load among processors while minimizing
90
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
interprocessor communication. If tasks are not distributed in a balanced way, system may end up waiting for one task to complete while other tasks are idle. Therefore, separate pre-analysis software is created to partition any general unstructured RKPM models. Similar to explicit finite element parallel algorithms, it requires partitioning RKPM model first. Unlike the FEM, The shape function of meshless is interpolated in dynamics domain according to dilation parameter a. The integration point of meshless is not limited inside of the support domain. Program need to keep all integration and interpolation information in each process elements (PE) entirely. There are two strategies for partition process, partition based on particles and partition based on integration cells. The integration point partition is used in this work due to two advantages: 1. When particles based partition strategy is used, particle overlap domains are generated. And then, we should assign integration cells to each subdomain according to particles. As we know, unlike FEMs, particles may not belong to only one integration cell, thus, overlap integration cell domains should be generated. We have to design data structures for both particles and integration cells, respectively, therefore, the complexity and storage size are both enlarged a lot. On the other hand, if the integration cells based partition strategy is implemented, integration cells should be first partitioned and no cell overlap subdomain should be generated. We only assign particles to relevant subdomains according to integration cells. 2. We can directly use FE partition methods for integration cells instead of reconstructing connected relationship with discrete particles of global domain. 4.1.2. Multilevel approach for integration cells Like FEMs, integration cells partition problems can be easily converted to graph partition problems [41,46]. So we discuss graph partition approaches to present the procedure for integration cells of RKPM method. The RSB [19] algorithm is one of a class of recursive bisection methods for partitioning unstructured problems. The most straightforward implementation of RSB, which uses the Lanczos algorithm [20] to find the Fiedler vectors, is unacceptably slow for many applications. The MRSB [21] is a refinement of the algorithm that is much faster, typically by an order of magnitude or more, and has been instrumental in the acceptance of RSB. The basic idea behind MRSB is to speed up the Fiedler-vector computation by constructing a series of successively smaller contracted graphs that maintain the global structure of the original graph. An algorithm based on the multilevel approach normally comprises three phases. 1. Coarsening phase: the original graph G is reduced into a series of successively coarser graphs. There are several ways to coarsen a graph. In the maximal independent set of a graph is chosen as the vertices of
the coarse graph. An independent set of the graph G is a set of its vertices, with no two of the vertices in the set connected by an edge of G. An independent set is a maximal independent set if the addition of an extra vertex makes it no longer independent. The most popular method for generating the multilevel of coarse graphs is based on edge collapsing. Selected edges are collapsed so that two vertices connected by one of these edges form a multi-node. Each vertex of the resulting coarse graph has a weight associated with it, which indicates the number of original vertices it contains. Each edge of the coarse graph also has a weight associated with it that indicates the number of original edges it represents. This edge collapsing approach has the distinctive advantage that the edge-cut (generalized here as the sum of the edge weights cut by the partitioning) on the coarsest graph equals the edge-cut of the finest (original) graph, if the partitioning on the coarsest graph are inherited by the finer graphs without further refinement. The edges are usually selected using the idea of matching. A matching of a graph is a set of edges, such that no two of them share the same vertex. The maximal match is a matching of the largest size. In a heavy edge matching [22], vertices are visited randomly. For each vertex, the heaviest unmatched edge from the vertex is selected. Heavy edge matching has the advantage that the resulting coarse graph has a relatively small total edge weight, and therefore the partitioning of it is more likely to give a small edge-cut. As an alternative to edge collapsing based on maximal matching, a more aggressive coarsening approach has been employed [23], where the greedy algorithm is used to form small clusters of, say, 10 vertices for collapsing. The coarsening process is applied until the graph has less than a preset minimum number of vertices. 2. Partitioning phase: partition the coarsest graph Gl into n parts. As the coarsest graph is of very small size, any partitioning algorithms will be able to partition it rapidly. The partition is executed in a manner such that the summation of the vertex weights of each subgraph should be roughly the same. When edge collapsing approach is used for graph coarsening, the edge-cut of the partitioning is the same as that for the finest graph, should the partitioning be inherited all the way to the finest graph without refinement. The choice of partitioning algorithms at this coarsest level is not very important [22,24] to the overall quality of the partition because refinement will be carried out at the subsequent finer levels of the graph. 3. Uncoarsening and refinement phase: the partitioning of Gl is interpolated to Gl1 and refined, the process repeated and terminated at G. During the uncoarsening and refinement phase, the partitioning of a graph Gl is inherited by the finer graph Gl1. For the multilevel recursive spectral bisection (MRSB) algorithm [21], the eigenvector for the coarse graph Gl is interpolated to the graph Gl1 as the first approximation
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
of the eigenvector for Gl1. From there, instead of using the Lanczos algorithm to recalculate the eigenvector, Rayleigh iterations are used to improve the approximation. The resulting positive semi-definite linear systems are solved using sparse symmetric equations (SYMMLQ) [22]. Because a good initial approximation exists, the number of matrix–vector products involved in finding the Fiedler vector using the Rayleigh quotient algorithm is normally no more than ten, as opposed to up to several hundred if the Lanczos algorithm is used. Thus the MRSB algorithm is about an order of magnitude faster than the single level Lanczos implementation of RSB. But there are two disadvantages to the multilevel spectral bisection approach. First, this is still a bisection algorithm and the partitioning of a graph into many subdomains would involve recursive use of the algorithm. Second, even though the Rayleigh quotient procedure is substantially faster than the Lanczos algorithm, each iteration is still of the order O(n), where n is the number of vertices in the graph. However, all that may be needed on the finer graph Gl1 is some refinement of the subdomain boundary to reduce edge-cut and to maintain the load balance. For a large graph with a good existing partition, the size of the boundary should be only a small fraction of that of the interior. Therefore, multilevel algorithms that adopt Kernighan–Lin (K–L) [26] type algorithms during the uncoarsening and refinement phase have proved themselves to be superior to MRSB [23–25]. The boundary refinement approach is not restricted to bisection and n-way refinement to the existing partition could be used. In [25] vertices of any subdomain are allowed to move to any of the other n 1 subdomains, while in [23] boundary vertices are allowed to move to the neighboring subdomains only. 4.1.3. Comparison between FE and meshless partition results Meshless methods use the particle values in the local support domain to interpolate the shape functions. The RKPM approach needs about 20–30 particles in a support domain to construct the shape functions in this work, and the area of overlap zone depend on the number of particles in the local support domain. Thus the partition results of RKPM meshless models will generate much wider range boundary zones than FEM models. Comparing the partition results of FEM and RKPM models, the boundary of each subdomain is dissimilitude. Fig. 1 shows that the boundary between two partitions of FE model is a line. The integration points of each element are contained the inside of each element. Fig. 2 shows that the boundary between two partitions of RKPM model is widely overlap zone. It is clearly show that the number of particles in the overlap zone of RKPM model is much larger than the number of particles in the boundary belt of FEM model. 4.2. The data communication scheme Because quantity of particles in boundary zones of the RKPM model is much larger than the particles of the
91
Fig. 1. A boundary line of a FEM model.
Fig. 2. A boundary zone of a RKPM model.
FEM model, the time consumed in communication during the computation of RKPM is much longer than FEM. Additional, the EBCs cannot be enforced directly as mentioned in Section 2. For problems with a large number of EBCs, however, the number of support particles can significantly increase for integration points affected by the modifications in Eqs. (8) and (9). The number of shared particles may drastically increase. Therefore, communication may greatly increase thereby reducing the number of effective processors in some cases. Further effort regarding the accommodation of EBCs in parallel is still needed. For these reasons, proposed parallel 3D RKPM system requires constructing structures for efficient data communication. For purpose of transferring the information in high efficiency, the system need to store following information before exchanging data. Firstly, in order to estimate time of communication, the number of particles and integration points in each overlap boundary zone among subdomains
92
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
CPU identify number
CPU identify number
Number of particles in boundary zone
Overlap zone identify number
Subdomain map table
Local information table … of subdomains
Overlap boundary identify number
Local identify number in each subdomain
Fig. 3. Data structures for transferring data in overlap zones.
Searching the overlap particles which locate in subdomain N
Searching the overlap zones which contain current overlap particles in subdomain N
Searching the subdomains which are adjacent with subdomain N
Positioning the overlap particles in adjacent subdomains
Transferring internal force of overlap particles to adjacent subdomains Fig. 4. Flow-chart of internal force transferring procedure.
must be summed. For positioning the particles needed exchanging data, the particle’s location in each subdomain requires fixing and storing. Finally, we build the location mapping table to store relations of particles in each boundary zone as shown in Fig. 3 and the procedure of exchanging data of overlap zone is shown in Fig. 4. 4.3. The parallel algorithms for contact searching process To simulate the contact between a deformable body and several rigid bodies in 3D parallel RKPM, a direct approach based on the segments, particle to segment contact algorithm, is used for 3D parallel system. The deformable body is discretized by a set of particles. The boundary particles of the deformable body are referred as slave particles in this work. The surfaces of the rigid bodies (master) are modeled by quadrilateral (or triangular) segments. The particles to segment contact algorithm is based on three major procedures: the test pairs’ creation, the contact
detection and the contact forces evaluation. More details about contact algorithm are discussed in Refs. [29,30]. In common sense, master segments and slave segments of meshless models should be partitioned in synchronism, but such a strategy cannot be implemented due to following reasons: 1. Complexity of meshless parallel contact algorithms. As we know, each subdomain of meshless model has boundary domains composed of quite a few shared particles. When the master segments group and the salve particles group are partitioned in the same domains, the segments belonging to this domain should be searched by particles in limited local zone. Thus, the segments of others should be ignored. The contact search under such statement probably losses contact particles in practice. In order to solve above problems, contact searching procedure need exchanging information among all subdomains. This procedure requires complex data structure to guarantee accurate results. 2. Significant computational effort. Due to the large deformation problems, the position of each particle should be changed a lot. The initial partition results cannot fit for the contact search requirements. Therefore, it requires partitioning in each time step. If the master segments and salve particles are both partitioned, the information of master segments and slave particles need updating and rearranging. So partition process will employ significant computational effort. For above reasons, we suggest the dynamic partition strategy for parallel contact problems. Only slave particles should be partitioned in each step during contract procedures. Although the search range should be expanded and more CPU time should be increased, dynamic partition strategy can find contact pairs completely and balance the communication time of each PE. 4.4. The parallel computation of RKPM explicit dynamic analysis implemented by MPI Explicit MPI statements make all communication of parallel 3D RKPM system. In order to transfer data in
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
safety status and avoid possible deadlock and overhead associated with buffering, the non-blocking method is applied. Due to the explicit algorithms, the computations in each particle are independent. The developed system only needs to transfer the internal force of particles limited in overlapped boundary zones. Due to partition scheme proposed in Section 4.1, integration points are partitioned firstly; subdomains of integration points are all in nonoverlap fashion. We only require summing all internal force of each overlap boundary zone in correct position to obtain the true value. The explicit parallel RKPM algorithm of whole process is described in Fig. 5.
Input data
Partitioning integration points
93
5. Numerical examples In this section, Taylor bar impact problem is benchmark test for demonstrating the validity and efficiency of the parallel 3D RKPM algorithm for large deformation. Back extrusion problem as practical example is used to test the efficiency of proposed system for bulk forming problems. Both of commercial software and the proposed parallel 3D RKPM system run on the IBM p690 with the AIX5.2 OS in this work. In all computations, the material model is elastic–plastic with Von Mises flow theory and linear isotropic hardening. Taylor bar impact problem is solved by LSDYNA3D; Back extrusion problem is solved by ANSYSY8.0, and the type 8-nodes hex solid element is used for analysis. In this paper, we apply speedup and efficiency to estimate the performance of parallel 3D RKPM system. For a given algorithm or computation, let Tp be the time to perform the computation using p processors or arithmetic units. Note that this makes T1 the depth of the algorithm. We define the speedup Sp [35–37] with p processors as Sp ¼
Partitioning correspond particles of integration points
T1 Tp
ð23Þ
and the efficiency Ep with p processors as Initialing MPI environment variables
Calculating particles’ displacement
Calculating the length of step time and broadcasting to every subdomains
Calculating the external force, internal force and contact force of every subdomains
Exchanging the internal force of overlap zones
Calculating the accelerations of every subdomains
Ep ¼
Sp : p
ð24Þ
Because p processors can do no more than p times as much work per unit time as one processor, the efficiency is theoretically bounded above by one. 5.1. Taylor bar benchmark test We need a benchmark test in order to verify the accuracy of 3D RKPM system and compare the efficiency of the various proposed parallelization methods presented further. Taylor bar impact problem which involves large deformation and moderate high strain rate is selected as benchmark. In the standard Taylor bar impact, a cylindrical bar impacts a rigid surface, as shown in Fig. 6a. Because of the symmetry of the problem only a quarter of the bar is discretized as shown in Fig. 6b. In this paper,
Calculating the displacement of every subdomains
Gathering data from all subdomains and scattering the updated data for every subdomains
Outputting data
End Fig. 5. The flow chart of RKPM parallel computation process.
Fig. 6. The quarter of Taylor bar: (a) continuum domain and (b) discretized model.
94
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
Table 1 Material parameter list for Taylor bar impact Young’s modulus Poisson’s ratio Initial yield stress Hardening modulus Mass density
E t r H q
117 GPa 0.35 400 MPa 100 MPa 8930 kg/m3
Table 2 Comparison of computation efficiency of different PE for Taylor bar problem
8.00 7.00 6.00 Radius (mm)
we will focus on the speedup obtained after the parallelization of the code using this benchmark test. The initial dimensions of the rod are r0Z = 3.2 mm and l0Z = 32.4 mm. The impact is assumed frictionless and the impact velocity is set to ViZ = 227 m/s. The final configuration is obtained after 80 ls. The constitutive law is elasto-plastic with a linear isotropic hardening, material properties, given in Ref.
5.00 4.00 3.00 Parallel 3D RKPM system FEM LSDYNA3D code
2.00 1.00
Number of processors
Run time (s)
Sp
Ep (%)
1 2 4 8 16 32
5323 2812 1656 830 512 220
1.00 1.81 3.21 6.41 10.39 24.14
100.00 90.5 80.35 80.16 64.64 75.5
120
0.00 0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
Time (µs)
100 35.00
Efficiency (%)
30.00
Height (mm)
25.00 20.00
80
60
Ideal 40
15.00 Parallel 3D RKPM system FEM LSDYNA3D code
10.00 5.00
Parallel 3D RKPM system
20
0
0.00 0.00
10.00
20.00
30.00
40.00 50.00 Time (µs)
60.00
70.00
1
80.00
2
4
8
16
32
Number of Processors
Fig. 7. Taylor bar impact: time histories of the variation of the dimensions of the bar. (a) Radius and (b) height.
35 Ideal Parallel 3D RKPM system
30
Speedup
25 20 15 10 5 0
Fig. 8. Deformed configurations for Taylor impact at time 80 ls. Distribution for 16 CPUs, different deep color means different processes. (a) RKPM, (b) FEM and (c) Gaussian integration cells.
1
2
4
8
16
32
Number of Processors Fig. 9. Parallel performances on IBM690 of RKPM for Taylor bar problem: (a) speedup comparison and (b) efficiency comparison.
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
[38], corresponding to an OHFC copper are reported in Table 1. For verifying the accuracy of parallel 3D RKPM system, the commercial code LS-DYNA3D is also implemented for solving same problems. Models of meshless consist of 1369 particles for RKPM with stress point integration and 1200 elements for LSDYNA3D.
95
Fig. 7a shows the time history of the radius at the impact plane. It can be seen that the results by RKPM meshless code and LSDYNA3D FEM program are nearly the same. The results by FEM are also given. At t = 80 ls, the radius at the plane of impact is 7.30 mm with RKPM. The result by FEM is 7.38 mm. Many results for this radius have been collected in Ponthot [40] and the mean value is 7.11 mm.
Fig. 10. Schematic model for back extrusion analysis: (a) continuum domain and (b) discretized model.
Fig. 11. Deformed configurations for back extrusion process.
96
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
Fig. 7b gives the time history of the height of the bar. The results by RKPM and FEM are also nearly identical. At time t = 80 ls the height of the bar is 21.48 mm by 3D RKPM system and 21.46 mm by FEM. Fig. 8a and b shows the deformed configuration at t = 80 ls by RKPM and FEM. Fig. 8c demonstrates the integration cells for each process during parallel computation, it demonstrates the proposed system has good load balancing for each process during computation. Similar deformation patterns are observed. The differences between the solutions are reasonable and this benchmark test is retained. Additional, the performance parameter Ep and Sp for different PE number also presented in Table 2 and comparison between practical speedup and efficiency of parallel RKPM meshless program and ideal value are shown in Fig. 9.
5.2. Back extrusion analysis The simulation of the back extrusion process is considered. Fig. 10 shows the statement of the problem. The die is fixed while the punch is moved downward with a constant velocity of 10 m/s. The coefficient of friction is 0.12. The simulation time is 0.01 s. Because of the symmetry of the problem only half of the workpiece has been discretized with 5621 (73 · 7 · 11) particles and 16,000 Gaussian integration points are applied for integration. The material is the same as in Table 1. Fig. 11 shows the deformed configuration at punch stroke of 0.05 m, 0.15 m. The deformed patterns are nearly the same by RKPM and FEM. Fig. 12 demonstrates a good effective strain and stress distribution agreement by RKPM and FEM. However, as simulation process contin-
Fig. 12. Computed results for the compression of back extrusion at punch stroke of 0.05 m deformed models.
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101 35 Ideal Developed RKPM system
30 25
Speedup
ues at punch stroke of 0.25 m, the mesh distortion is more severe and difficulties occur in FEM computation as shown in bottom of Fig. 11. It can be seen that some elements in FEM solution are overlap at the punch stroke and will come to a halt. It is proved that the proposed 3D RKPM system can easy to solve element distortion problems. In order to demonstrate the efficiency of parallel meshless program, The 1, 2, 4, 8, 16, 32 CPUs are implemented for back extrusion problem. The performance parameters speedup Sp, efficiency Ef and computation time are listed in the following Table 3. The distribution of Gaussian integration cells of RKPM in each process during computation is shown in Fig. 13. Comparison among practical speedup and efficiency of parallel 3D RKPM system, commercial software ANSYS and ideal value are shown in Fig. 14.
97
ANSYS
20 15 10 5 0 1
2
5.3. Wheel forging analysis
4
8
16
32
16
32
Number of Processors 120
100
80 Efficiency (%)
This example aims to simulate the cold forging process of a wheel. A cylindrical slug as shown in Fig. 15a is used to obtain a wheel as shown in Fig. 15b, meshless model is presented in Fig. 15c. The following materials parameters are used for the simulation. Because of the symmetry of the problem an axisymmetric formulation is used and only half of the domain is discretized with 9610 particles. The punch and die, assumed to be rigid and frictionless, are discretized by piecewise linear segments to fit the geometry of the wheel. The punch is moving downward with constant velocity of 10 m/s while the die is fixed. The maximum stroke of the punch at the end of the simulation is 0.02 m. The configuration at punch stroke of 0.01 m, 0.02 m and distribution of stress and strain at 0.01 m are shown in Figs. 16 and 17. It is pre-
60
40 Ideal Developed RKPM system ANSYS
20
0
Table 3 Performance parameters for various implementations of drawback extrusion problem by parallel 3D RKPM system Number of processors
Run time (s)
Sp
Ep (%)
1 2 4 8 16 32
69783.17 35479.84 20545.67 12674.82 7765.21 3116.71
1.00 1.96 3.39 5.53 8.98 22.39
100.00 98.00 84.75 69.13 56.13 69.96
1
2
4 8 Number of Processors
Fig. 14. Performance diagram for drawback extrusion: (a) speedup comparison and (b) efficiency comparison.
sented that proposed parallel 3D RKPM system and ANSYS8.0 have almost same results. The 1, 2, 4, 8, 16, 32 CPU are also implemented for forging problem to demonstrate efficiency of parallel 3D RKPM system. The performance parameters speedup Sp, efficiency Ef and computation time are listed in the following Table 4. The distribution of Gaussian integration cells in each process is presented in Fig. 18. Comparison between practical speedup and efficiency of parallel 3D RKPM system and ideal value are shown in Fig. 19. 5.4. Discussions
Fig. 13. Gaussian integration cells distributed in 16 PE at punch stroke of 0.25 m. Effective strain distribution (TOP) and effective stress distribution predicted by RKPM and FEM.
The capabilities of proposed parallel 3D RKPM method in simulating 3D bulk forming problems with bulk forming problems have been studied. Figs. 11, 12 and 15, 16 show that both proposed parallel 3D RKPM system and the
98
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
Fig. 15. Schematic model for back extrusion analysis: (a) raw slug, (b) geometry of the wheel and (c) discretized model.
Fig. 16. Deformed configurations for wheel forging process.
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
99
Fig. 17. Computed results for the compression of wheel forging problems at finial punch stroke deformed models: effective strain distribution (TOP) and effective stress distribution predicted by RKPM and FEM.
Table 4 Material parameter list for wheel forging process Young’s modulus Poisson’s ratio Mass density Stress–strain curve
E t q r = 576.79(0.01658 + ep)0.3593 MPa
71 GPa 0.3 2700 kg/m3
FEM software LSDYNA3D, ANSYS8.0 have a good convergence rate in configuration and distribution of strain and stress, but the RKPM can solve the distortion of meshes easily. From Tables 2, 3 and 5, Figs. 9, 14 and 19 of above numerical examples, we can see almost ideal speedup is achieved for both the fixed and scaled models even up to 32 processors, and this trend is anticipated to continue over 32 processors. Figs. 8c, 13 and 18 present good balancing rates during simulation process, each process has almost same amount of Gaussian integration cells. It is the key
Fig. 18. Gaussian integration cells distributed in 16 processes in the finial step of wheel forging process.
reason why proposed parallel 3D RKPM system has such good parallel computation efficiency based on MPI. Additional, although commercial FEM software (e.g. ANSYS, DEFORM) have better computation efficiency than developed 3D RKPM system, current system has better speedup
100
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101
processors by MRSB method and shared particles definitions are overlap, so that all support domains for all particles are defined locally on the corresponding processor. Explicit MPI message passing statements are used for all communication among subdomains on distributed processors. Taylor bar impact problem is presented for proving validity of parallel 3D RKPM. Both the simulation of Taylor bar and back extrusion analysis demonstrate the efficiency of the current parallel RKPM for bulk forming problems. However, it should be noted that the communication execution time of meshless is much than FEM due to large overlap boundary zones.
35.00 30.00
Ideal Developed RKPM system ANSYS
Speedup
25.00 20.00 15.00 10.00 5.00 0.00 1
2
4 8 Number of Processors
16
32
120. 00
Efficiency (%)
100. 00 80. 00 60. 00 Ideal 40. 00
Developed RKPM system FEM
20. 00 0.00
1
2
4 8 Number of Processors
16
32
Fig. 19. Performance diagram for wheel forging process: (a) speedup comparison and (b) efficiency comparison.
Table 5 Performance parameters for various implementations of wheel forging problem Number of processors
Spend time (s)
Sp
Ef
1 2 4 8 16 32
92752.12 48308.39 27041.43 17018.74 9293.80 4193.13
1.00 1.98 3.43 5.45 9.98 22.12
100 95 85.75 68.13 62.38 69.12
and parallel efficiency than ANSYS. The parallel 3D RKPM system is a useful system for bulk forming problems in consideration of above factors. 6. Conclusions A parallel computational implementation of 3D RKPM methods is presented for the explicit dynamic analysis. A DDM parallel implementation is used with model partitioning for a Lagrangian formulation. With proposed strategy, integration cells are uniquely partitioned on separate
References [1] Belytschko T, Krongauz Y, Organ D, Fleming M, Krysl P. Meshless methods: An overview and recent developments’ structures. Comput Methods Appl Mech Eng 1996;139:3–47, Special issue on Meshless Methods. [2] Liu Wing Kam, Jun Sukky, Zhang Yi Fei. Reproducing kernel particle methods. Int J Numer Methods Fluids 1995;20:1081–106. [3] Liu WK, Jun S, Li S, Adee J, Belytschko T. Reproducing kernel particle methods for structural dynamics. Int J Numer Methods Eng 1998;38:1655–79. [4] Jun S, Liu WK, Belytschko T. Explicit reproducing kernel particle methods for large deformation problems. Int J Numer Methods Eng 1998;41:137–66. [5] Liu WK, Chen Y. Wavelet and multiple-scale reproducing kernel methods. Int J Numer Methods Fluids 1995;21:901–31. [6] Chen JS, Roque C, Pan C, Button ST. Analysis of metal forming process based on meshless method. J Mater Proc Technol 1998;80: 642. [7] Chen JS, Pan C, Wu CT, Roque C. A Lagrangian reproducing kernel particle method for metal forming analysis. Comput Mech 1998;19:153. [8] Kulasegaram S. Development of particle based meshless method with applications in metal forming simulations, Ph.D. Thesis, University of Wales, Swansea, 1999. [9] Bonet J, Kulasegaram S. Correction and stabilization of smooth particle hydrodynamics methods with applications in metal forming simulations. Int J Numer Methods Eng 2000;47:1189. [10] Li GY, Belytschko T. Element-free Galerkin method for contact problems in metal forming analysis. Eng Comput 2001;18:62. [11] Lou LL, Zeng P, Fang G. Meshless method and its application to bulk metal forming. J Plast Eng 2001;3:1 [in Chinese]. [12] Guo YM, Nakanishi K. A rigid-plastic meshless method for forming simulations. Eng Plast Macroscale Nanoscale PTS 1 and 2, Key Eng Mater 2003;233(2):341. [13] Yvonnet J, Chinesta F. The natural element method in metal forming simulations involving large strains. In: LUXFEM03 international conference, Luxembourg, 2003. [14] Chen JS, Wang HP, Liu WK. Meshfree method with enhanced boundary condition treatment for metal forming simulation. In: The 1999 NSF design and manufacturing grantees conference, Queen Mary, USA, 1999. [15] Chen JS, Wang HP. New boundary condition treatments in meshless computation of contact problems. Comput Method Appl Mech Eng 2000;187:441. [16] Liu GR. Mesh-free methods, moving beyond the finite element method. CRC Press; 2002. [17] Belytschko T, Lu YY, Gu L. Element free Galerkin methods. Int J Numer Methods Eng 1994;37:229–56. [18] Kent T Danielson, Hao Su, Liu Wing Kam, Aziz Uras R, Li Shoafan. Parallel computation of meshless methods for explicit dynamic analysis. Int J Numer Methods Eng 2000;47:1323–41.
H. Wang et al. / Advances in Engineering Software 38 (2007) 87–101 [19] Simon H. Partitioning unstructured problems for parallel processing. Comput Syst Eng 1991;2(2–3):135–48. [20] Lanczos C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J Res Nat Bur Stand 1950(45):255–82. [21] Barnard ST, Simon HD. Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurr: Practice Experience 1994;6(2):101–17. [22] Paige CC, Saunders MA. Solution of sparse indefinite systems of linear equations. SINUM 1975;12:617–29. [23] Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs, Technical Report TR 95-035, Department of Computer Science, University of Minnesota, 1995. [24] Walshaw C, Cross M, Johnson S, Everett M. A parallelisable algorithm for partitioning unstructured meshes. In: Ferreira A, Rolim J, editors. Proceeding of irregular’94: parallel algorithms for irregular problems: state of the art. Dordrecht: Kluwer Academic Publishers; 1995. p. 23–44. [25] Henderson B, Leland R. A multilevel algorithm for partitioning graphs. In: Proceeding of supercomputing’95. ACM; 1995, December. [26] Kernighan B, Lin S. An efficient heuristic procedure for partitioning graphs. Bell Syst Tech 1970;49(2):291–307. [27] Lucy LB. Astron Astrophys 1977;82:1013. [28] Gingold RA, Monaghan JJ. Mon Not R Astron Soc 1977;181:375. [29] Zhong Zhihua. Finite element procedures for contact-impact. Oxford University Press; 1993. [30] Belytschko T, Moran B, Liu WK. Nonlinear finite element continua and structures. Wiley, John & Sons, Incorporated; 2000. [31] Randles PW, Libersky LD. Smoothed particle hydrodynamics: some recent improvements and applications. Comput Methods Appl Mech Eng 1996;139:375–408. [32] Cambell J, Vignjevic R, Libersky L. A contact algorithm for smoothed particle hydrodynamics. Comput Methods Appl Mech Eng 2000;184:49–65. [33] Hoover CG, DeGroot AJ, Maltby JD, Procassini RJ. ParaDynDYNA3D for massively parallel computers. Engineering, Research, Development and Technology FY94, CR, 53868-94, Lawrence Livermore National Laboratory, Livermore, CA, 1995.
101
[34] Plimpton S, Attaway S, Hendrickson B, Swegle J, Vaughan C, Gardner D. Transient dynamics simulations: parallel algorithms for contact detection and smoothed particle hydrodynamics. In: Proceedings of superComputing96, Pittsburgh (PA), 1996. [35] Danielson KT, Namburu RR. Nonlinear dynamic nite element analysis on parallel computers using FORTRAN 90 and MPI. Adv Eng Softw 1998;29(3-6):179–86. [36] Williams RD. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurr: Practice Experience 1994;3:457–81. [37] Hsieh SH. Parallel processing for nonlinear dynamics simulations of structures including rotating bladed-disk assemblies, Ph.D. Dissertation, Cornell University, Ithaca, New York, 1993. [38] Vignjevic R, Campbell J. A penalty approach for contact in smoothed particle hydrodynamics. Int J Impact Eng 1999;23:945–56. [39] Liu WK, Chang H, Chen JS, Belytschko T. Arbitrary Lagrangian– Eulerian Petrov–Galerkin finite elements for nonlinear continua. Comput Methods Appl Mech Eng 1988;68:259–310. [40] Ponthot JP. Mechanics of continuous solids under large strains, and unified treatment by the finite element method, Technical Report, TF38, University of Liege, Liege, 1995. [41] Wang Hu, Li Guang-Yao, Zhong Zhi-Hua. Modification of automatic mesh grid partition in parallel finite element computation. Gongcheng Lixue/Eng Mech 2005;22:46–51. [42] Anderheggen E, Renau-Munoz JF. A parallel explicit solver for simulating impact penetration of a three-dimensional body into a solid substrate. Adv Eng Softw 2000;31:901–11. [43] Brown K, Attaway S, Plimpton S, Hendrickson B. Parallel strategies for crash and impact simulations. Comput Methods Appl Mech Eng 2000;184:375–90. [44] Jia R, Sunde’n B. Parallelization of a multi-blocked cfd code via three strategies for fluid flow and heat transfer analysis. Comput Fluids 2004;33:57–80. [45] Rama Mohan Rao A, Appa Rao TVSR, Dattaguru B. A new parallel overlapped domain decomposition method for nonlinear dynamic finite element analysis. Comput Struct 2003;81:2441–54. [46] Wang Hu, Li Guangyao, Zhong Zhihua. Optimization on automatic mesh partition method in parallel finite element computation. Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/J Comput Aided Des Comput Graph 2005;17(8):1766–72.