Comput. Methods Appl. Mech. Engrg. 194 (2005) 3127–3146
A parallel two-sided contact algorithm in ALE3D

Tim Pierce a,*, Garry Rodrigue b

a Lawrence Livermore National Laboratory, Mail Stop L-98, Livermore, CA 94551, United States
b Department of Applied Science, University of California at Davis, Davis, CA 95616, United States

Received 11 November 2003; accepted 20 August 2004
Abstract

A scalable parallel algorithm for treating two-sided contact in a finite-element multi-physics code (ALE3D) is presented. This method assumes that proximity between the two sides changes continuously, and uses a local search to update proximity relations each cycle. The evolving communication pattern is treated entirely by local, nearest-neighbor communication; there is no global communication. Opening and closing voids, overlapping and intersecting contact surfaces, and a number of other special features are supported.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Contact surface; ALE; Finite element; Parallel computing
1. Introduction

ALE3D is an arbitrary-Lagrangian–Eulerian (ALE) finite-element code that treats fluid and elastic–plastic response of materials on an unstructured 3D grid. The major components of the code are explicit and implicit continuum mechanics, heat transfer, and chemistry [1]. ALE3D is used primarily to perform large-deformation transient dynamics simulations, including those involving strong shocks. It has also been applied to a number of metal-forming applications [2]. Problems of this type often involve contact surfaces, 2D surfaces defined by the meeting (potential or actual) of separate material regions. An impenetrability constraint is enforced along the contact surface.
This work was performed under the auspices of the US Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.
* Corresponding author.
E-mail addresses: [email protected] (T. Pierce), [email protected] (G. Rodrigue).
doi:10.1016/j.cma.2004.08.011
However, the materials are (usually) free to slide relative to each other tangential to the contact surface. A friction model may be used in the simulation of tangential sliding.

The finite element procedure begins with a discretization of the material regions into a mesh of zones, or elements. The mesh in the parallel version of ALE3D is further decomposed into submeshes (called domains), and typically one domain is assigned to each processor of a parallel distributed-memory machine. This is a fairly standard approach to parallelizing finite-element codes, [3], and a number of tools such as METIS, [4], have been developed to produce a well-balanced decomposition. Typically each zone in a domain is permanently assigned to that processor, i.e. the decomposition is static.

The force at a given node (mesh vertex), which ultimately dictates its displacement, is a function of the stress in neighboring zones. If the neighboring zones fall into multiple domains, then inter-processor communication will be required to calculate the force at the node. Since the decomposition is static, this communication will have a static, local (or nearest-neighbor) topology. Nodes on contact surfaces are exceptional, because the set of neighbor zones includes zones directly across from the node on the other side of the surface, and these change due to relative tangential motion at the surface. Thus the calculation of nodal forces on contact nodes requires a dynamic communication topology. A major part of the parallel contact problem consists of determining, at each time step, the current neighbor zones of each contact node, finding where on the parallel machine those zones "live", and bringing each contact node and all its current neighbors together on the same processor so that the nodal force can be determined.

In surveying various approaches that have been taken to deal with parallel contact surfaces, it is necessary to distinguish between one-sided and two-sided contact. In one-sided contact any node on a surface designated as a contact surface may come into contact with any face of that surface (excluding faces to which the node is directly connected). Typical uses of one-sided contact surfaces are car crash simulations, in which extensive buckling of shell elements occurs, and many-body collisions, where the union of the surfaces of all bodies forms the one-sided contact surface. One-sided contact is discontinuous in nature, see Fig. 1. One consequence is that the search for contact must consider, for each face, all nodes within a geometrical region within which contact is possible.

An example of this is the one-sided contact algorithm in DYNA3D, [5,6], which sorts all contact nodes into a number of spatial "buckets" in one or more dimensions. The minimum bucket size is chosen such that no contact face is spread over more than three adjacent buckets in each sort dimension. A node in a given bucket can potentially interact only with faces that intersect that bucket. This bucket sort, an expensive global operation, can be amortized over multiple cycles. The rebucketing frequency is determined by the maximum velocity of any contact node. To parallelize, a contiguous set of contact nodes in one or more buckets is assigned to a given processor. All nodes in these buckets and neighboring buckets must be sent to that processor in order to perform the contact enforcement. Note that this is in effect a second, independent, dynamic decomposition, in addition to the static bulk decomposition.
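For illustration only, the following is a minimal one-dimensional sketch of the bucketing idea just described; it is not the DYNA3D implementation, and the data layout and names are hypothetical.

```python
# 1-D sketch of the bucket-sort idea used in one-sided contact search.
# Not DYNA3D code; bucket size and data layout are illustrative only.
from collections import defaultdict

def build_buckets(node_x, bucket_size):
    """Sort contact nodes into spatial buckets along one dimension."""
    buckets = defaultdict(list)
    for node, x in node_x.items():
        buckets[int(x // bucket_size)].append(node)
    return buckets

def candidate_nodes(face_xmin, face_xmax, buckets, bucket_size):
    """Nodes that can possibly touch a face: those in the buckets the face overlaps."""
    lo = int(face_xmin // bucket_size)
    hi = int(face_xmax // bucket_size)
    cands = []
    for b in range(lo, hi + 1):          # a face spans only a few adjacent buckets
        cands.extend(buckets.get(b, []))
    return cands

# Example: nodes spread along x, one face covering [2.5, 3.5]
nodes = {"a": 0.4, "b": 2.7, "c": 3.1, "d": 9.8}
bks = build_buckets(nodes, bucket_size=1.0)
print(candidate_nodes(2.5, 3.5, bks, bucket_size=1.0))   # -> ['b', 'c']
```

The point of the sketch is that only nodes in buckets intersected by a face need be tested against that face, at the cost of a periodic global rebucketing.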
This use of a second, dynamic decomposition is shared by PRONTO, discussed next, and by the ALE3D contact algorithm. PRONTO [7,8] sorts the set of contact nodes and faces of a single-sided contact surface in three dimensions using recursive coordinate bisection (RCB), [9]. RCB provides dynamic load balance efficiently.
Fig. 1. Nodes d and b come into contact with face x as the buckled portion of the surface moves to the right. Node c located between b and d does not.
Fig. 2. Node d is nearest to face x at one point in time while node b is nearest to x at a later time. Thus, there is an intervening time when c is nearest to x.
Nodes and faces are sent at each time cycle from their bulk domain to the RCB contact domain of the previous time cycle. The new RCB is calculated from there and a small subset of data is transferred from the old RCB domain to the new RCB domain, taking advantage of the fact that the RCB decomposition evolves in an incremental fashion. A number of global operations with short message lengths are required to orchestrate the parallel RCB procedure.

Two-sided contact as formulated in ALE3D is simpler than one-sided contact because of the additional assumption that contact between the two sides evolves continuously, see Fig. 2. This assumption will be further developed in the next section. It is precisely what allows the ALE3D dynamic domain decomposition to evolve with a complete absence of global communication. In short, a global search is initially performed to find nearest nodes on the opposing side, and thereafter the search for nearest nodes is restricted to the mesh neighborhood of the nearest node during the preceding time cycle.
2. The contact algorithm

2.1. Terminology

The problem mesh is the discretization of one or more physical bodies into zones. The mesh is then partitioned into bulk domains, each assigned to its own processor with separate memory. The processors share data by message passing. The faces forming the exterior of the problem mesh are called the boundary faces. A two-sided contact surface consists of a pair of boundary face subsets, the master side and the slave side, see Fig. 3(b). These subsets are specified by the user, and presuppose some knowledge of which parts of the bodies will come in contact. Each side of a contact surface is a 2D mesh. The relationship (or connectivity) between the various faces and nodes of each mesh is fixed. Each face on a contact side always consists of the same nodes, and each node is a vertex for the same set of faces.
Fig. 3. (a) Bulk decomposition into eight domains where each color represents a different bulk domain and (b) the red and blue surfaces form the master and slave sides of one contact surface, while the green and yellow surfaces form the master and slave sides of a different contact surface. (For interpretation of color the reader is referred to the web version of this article.)
Fig. 4. The neighborhood of node ‘‘a’’ consists of all circled nodes (plus ‘‘a’’ itself).
The neighborhood of a contact node consists of all the nodes within two faces of the specified node, see Fig. 4. Although a uniform mesh is shown in Fig. 4 for simplicity, ALE3D supports arbitrarily connected meshes.

Contact enforcement in explicit time-differencing codes is essentially a local process where each "patch" of nodes on one side of a contact surface is only affected by a patch of nodes directly opposite it. However, the opposing patch of nodes may be continually changing as the two sides move relative to each other.

The order node of a given node on one side of a contact surface is the node nearest to it on the opposite side of the contact surface. Each contact node has an order node during a time cycle, but this order node constantly changes as the problem evolves. The amount of change allowed in a given time cycle is limited; the order node of a particular node at the next time cycle must share a face with its order node during the present time cycle. In other words, there must exist a face on the opposing side such that the previous and current order nodes are both vertices. This allows us to limit the search for the new order node to a local search at each time cycle.

The locale of a contact node consists of the node itself, its neighborhood, its order node, and the neighborhood of its order node, see Fig. 5. The locale of a node is the minimum amount of state required to perform the contact enforcement and subsequent order node update on that node. The locale is a more precise definition of the "patch" referred to earlier.

Contact enforcement is parallelized by partitioning the contact surface into a number of contact domains, or CDs, and then assigning each contact domain to a processor. The nodal data resides permanently in the bulk domains and is sent to the contact domains for contact enforcement during a time cycle. The updated nodal data is then returned to the bulk domains. Within a single contact domain, contact forces are balanced using a symmetric, momentum-conserving approach that is modelled after the 2D method of HEMP, [10]. This method has proven to be robust, stable, and accurate in the presence of strong shocks. Details may be found in [11].
Fig. 5. The locale of node ‘‘a’’ consists of all nodes marked by open circles.
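To make the local order-node update concrete, the following is a minimal sketch (in Python, with hypothetical mesh dictionaries) of a search restricted, as required above, to the previous order node and the opposite-side nodes that share a face with it.

```python
# Sketch of the restricted order-node search; mesh data structures are illustrative only.
import math

def local_order_node_update(slave_xyz, old_order, node_xyz, node_faces, face_nodes):
    """New order node: the nearest opposite-side node among the old order node
    and the vertices of faces containing the old order node."""
    candidates = {old_order}
    for f in node_faces[old_order]:          # faces having the old order node as a vertex
        candidates.update(face_nodes[f])     # all vertices of those faces
    return min(candidates, key=lambda n: math.dist(slave_xyz, node_xyz[n]))

# Toy example: three master segments; the slave has moved past its old order node m2.
node_xyz   = {"m1": (0, 0, 0), "m2": (1, 0, 0), "m3": (2, 0, 0), "m4": (3, 0, 0)}
face_nodes = {"f1": ("m1", "m2"), "f2": ("m2", "m3"), "f3": ("m3", "m4")}
node_faces = {"m1": ["f1"], "m2": ["f1", "f2"], "m3": ["f2", "f3"], "m4": ["f3"]}
print(local_order_node_update((2.2, 0.1, 0), "m2", node_xyz, node_faces, face_nodes))  # -> 'm3'
```

Because the candidate set is bounded by the face connectivity of the previous order node, the cost per node is independent of the total surface size, which is what removes the need for any global search after initialization.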
The communication topology is the overall organization of inter-processor communication and determines the specific communication pattern between bulk domains and contact domains during a given time cycle. The cross-section of the topology for a given processor is the set of processors with which it communicates. Classes of communication topology relevant to this paper are:

1. Static. Each processor communicates with the same set of other processors (its "neighbors") during each time cycle.
2. Dynamic. Each processor's neighbors may change from one time cycle to the next.
3. Local. There is an upper bound on the number of processors that a given processor communicates with, independent of the total number of processors.
4. Global. The number of processors that a given processor communicates with increases as the overall number of processors increases.

Communication between bulk domains (BDs) and contact domains (CDs) is accomplished by:

1. Each BD determining what information to send to which CD.
2. Each CD determining from which BDs it will receive information.

Both of the above steps are necessary, as each communication requires the sender and receiver to explicitly cooperate. BD messages cannot simply be sent from various places while the CD continuously polls for incoming messages; the CD would never know when it had waited long enough (unless it waited until it got messages from all BDs, which would imply global communication).

2.2. Contact domains

If a contact node is assigned to a particular contact domain, then its state is updated and returned by that contact domain during the process of contact enforcement. A given master node is always assigned to the same contact domain. Slave nodes, on the other hand, are assigned to the contact domains of their order nodes. The contact domain a slave node is assigned to may change from cycle to cycle because of relative movement of the order nodes.

Simplified 1D diagrams will henceforth be used to represent the contact surface. Also, node neighborhoods will only include nodes within one face (or segment, in 1D) of a given node. The description of various scenarios is thus less cluttered, while the extension to 2D surfaces is obvious. A contact surface is shown in Fig. 6 at two different times, t1 and t2. If a node is assigned to a CD, then the CD also requires the node's locale, see Fig. 7. The nodes m10 and s8 in Fig. 7 are referred to as ghost nodes of CD 5. The set of master nodes, both assigned and ghost, that are required by a given contact domain is the same each cycle, so the focus from now on will be on the more complicated and dynamic issue of slave nodes.
Fig. 6. t1: s4 orders on m3, so both m3 and s4 are assigned to CD 5. t2: The slave side has moved, so that s4 now orders on m4 and is assigned to CD 6.
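A minimal sketch of the assignment rule illustrated in Fig. 6: masters have a static contact-domain assignment, and a slave simply inherits the contact domain of its current order node. The dictionary below is hypothetical.

```python
# Static assignment: master node -> contact domain.
master_cd = {"m2": "CD5", "m3": "CD5", "m4": "CD6", "m5": "CD6"}

def assigned_cd(order_node):
    """A slave node is assigned to the contact domain of its order node."""
    return master_cd[order_node]

print(assigned_cd("m3"))   # t1: s4 orders on m3 -> CD5
print(assigned_cd("m4"))   # t2: s4 orders on m4 -> CD6
```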
Fig. 7. m9 is assigned to CD 5, so all circled nodes must be sent to CD 5. Nodes m10 and s8 are included even though they are assigned to a different CD.
2.3. Bulk domains

A bulk domain will send one of its local slave nodes to a particular CD when:

1. the slave node's current order node is statically assigned to the CD,
2. the slave node is the order node of a master node that is assigned to the CD, or
3. the slave node is in the neighborhood of another slave node of the first or second class.

The second and third classes are a consequence of the locale requirements for assigned nodes. A node in the first class is called a rule 1 generator for the CD because it generates a list of accompanying nodes (its neighbors) that must also be sent to the CD. A slave node is a rule 1 generator for exactly one CD, namely, its assigned CD (which changes dynamically). A slave node in the second class is called a rule 2 generator for the CD. A slave node can be a rule 2 generator for multiple CDs (or none), see Fig. 8.

The rule 1 CD for a given slave node is determined at the end of the previous cycle, after the position of the slave and all other nodes in its locale have been updated. At that point the new order node for the slave is searched for, starting from the old order node (this is done in the contact domain). Once the new order node is found, the order node's assigned CD becomes the slave's new rule 1 (i.e. assigned) CD.

The rule 2 CD list is built a little less directly. The slave was sent to a number of CDs as either an assigned node or a ghost node during the previous cycle. The contact domain determined the new order node of each of its assigned masters at the end of that cycle. If the slave node was chosen as an order node of any assigned master node(s) on that CD, the slave node was "tagged". The CD returned to the BD a list of all of the BD's slave nodes that were tagged. If a slave node was sent to multiple CDs, then any or all of those CDs may have returned a tagged status on that node. Any CD that did so is then added to the node's rule 2 CD list for the next (i.e. current) time cycle. For example, both CD 2 and CD 3 in Fig. 8 will return a tag on node s4 to the BD(s) where s4 resides, and s4's rule 2 CD list will be {CD 2, CD 3}.
Fig. 8. Both m5 and m6 order on s4. Hence, s4 is a rule 2 generator for both CD 2 and CD 3 (and a rule 1 generator for CD 3).
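The following is a minimal sketch of how a bulk domain might turn the rule 1 and rule 2 information above into per-CD send lists of slave nodes; the data layout and node names are hypothetical, and the example numbers follow Fig. 8.

```python
# Sketch of building per-CD send lists from rule 1 / rule 2 generators (Section 2.3).
from collections import defaultdict

def build_slave_send_lists(local_slaves, rule1_cd, rule2_cds, neighbors):
    """For each CD, the slave nodes that must be sent to it: the rule 1 and rule 2
    generators plus all of their neighbors (the locale requirement, class 3)."""
    send = defaultdict(set)
    for s in local_slaves:
        cds = {rule1_cd[s]} | set(rule2_cds.get(s, []))
        for cd in cds:
            send[cd].add(s)
            send[cd].update(neighbors[s])   # neighbors may include proxy nodes
    return send

# Fig. 8: s4 is a rule 1 generator for CD 3 and a rule 2 generator for CD 2 and CD 3.
local_slaves = ["s3", "s4"]
rule1_cd  = {"s3": "CD2", "s4": "CD3"}
rule2_cds = {"s4": ["CD2", "CD3"]}
neighbors = {"s3": ["s2", "s4"], "s4": ["s3"]}
lists = build_slave_send_lists(local_slaves, rule1_cd, rule2_cds, neighbors)
print({cd: sorted(v) for cd, v in lists.items()})
# -> {'CD2': ['s2', 's3', 's4'], 'CD3': ['s3', 's4']}
```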
Fig. 9. The circle represents the neighborhood of node s4.
2.4. Domain communication

2.4.1. BD ↔ BD

Consider the example in Fig. 9. If node s4 has rule 1 CD set to CD 3, then nodes s3, s4, and s5 must all be sent to CD 3. Since node s4 is on BD 2, BD 2 can determine to send nodes s4 and s5 to CD 3 by looking at s4's rule 1 CD. BD 1 determines whether to send s3 to CD 3 by first creating a ghost layer of nodes (called proxy nodes) around each bulk domain's local set. These nodes are generated by taking the union of the bulk domain's local nodes' neighbors (i.e. nodes within two faces of all the local nodes) and then removing the local nodes, so that only the proxy nodes for the bulk domain remain. A list of the proxy nodes is sent to the other bulk domains for matching against their own local nodes. A list (possibly null) of matches is then returned. A static communication topology is thus established between a bulk domain and the other bulk domains containing its proxy nodes. The topology and the list of nodes to be sent are static. Each bulk domain sends to the other BDs in its "neighborhood" (i.e. the BDs that have proxy copies of any of its local nodes) the latest rule 1 CD and rule 2 CD list for each of the shared nodes, and receives the same. In the example of Fig. 9, this allows BD 1 to know what s4's rule 1 and rule 2 CDs are. For each local node in a bulk domain, a list of neighboring domains, if any, on which the node is proxy is built and permanently associated with the node. This is referred to as the node's neighbor BD list.

The BD ↔ BD communication topology is a local, nearest-neighbor topology because it is bounded by the product of the number of native nodes and the number of neighbors per node. In practice, each bulk domain will have on the order of eight neighbors, i.e., a 3 × 3 grid where the central element represents the local BD itself.

2.4.2. BD ↔ CD

Each contact domain builds a list of bulk domains, called the next time BDs list (Section 2.5.2), from which it expects to receive data during the next time cycle. The contact domain preposts the "receives" from each bulk domain in its next time BDs list at the start of a time cycle's BD ↔ CD communication pass. A buffer is allocated on each bulk domain for each contact domain on its CD "send" list, containing state information such as the node's position, velocity, and order node, as well as its neighbor BD list, assigned (i.e. rule 1) CD, and list of ghost CDs. Upon completion, this buffer is sent to the destination CD, where it is assembled by the contact domain with the data from all other received buffers into the CD's piece of the contact surface.
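Referring back to the proxy-node construction of Section 2.4.1, the following is a minimal sketch of the ghost-layer build and the matching that establishes the static BD ↔ BD topology; the data structures are hypothetical and the example follows Fig. 9 with one-segment neighborhoods.

```python
# Sketch of proxy-node (ghost layer) construction and matching (Section 2.4.1).
def proxy_nodes(local_nodes, neighbors):
    """Ghost layer of a bulk domain: neighbors of local nodes that are not local."""
    ghost = set()
    for n in local_nodes:
        ghost.update(neighbors[n])
    return ghost - set(local_nodes)

def match_proxies(my_proxies, other_bd_locals):
    """For each other BD, which of my proxy nodes are local there; the non-empty
    matches define the static BD <-> BD communication topology."""
    return {bd: sorted(my_proxies & set(locs))
            for bd, locs in other_bd_locals.items() if my_proxies & set(locs)}

# Fig. 9 example: BD 1 owns s1..s3, BD 2 owns s4..s6 (1-D, one-segment neighborhoods).
nbrs = {"s1": {"s2"}, "s2": {"s1", "s3"}, "s3": {"s2", "s4"},
        "s4": {"s3", "s5"}, "s5": {"s4", "s6"}, "s6": {"s5"}}
bd1, bd2 = ["s1", "s2", "s3"], ["s4", "s5", "s6"]
p1 = proxy_nodes(bd1, nbrs)
print(p1)                                   # -> {'s4'}
print(match_proxies(p1, {"BD2": bd2}))      # -> {'BD2': ['s4']}
```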
Fig. 10. CD ↔ CD communication.
2.4.3. CD ↔ CD

Consider the situation in Fig. 10. s7's order node is m6 at the start of the cycle, so s7 is assigned to CD 3. However, surface movement results in s7's new order node being m5. This calculation is done by CD 3, since that is where s7 is assigned. However, CD 2 also needs to know s7's new order node, because it has to accumulate s7's neighbor BD list into its next time BDs list, since this is where s7 will be assigned next cycle. This situation is accommodated by defining another communication topology between a contact domain and the other contact domains where any of its ghosts are assigned or where any of its assigned nodes are ghosts. This is a local topology that is bounded by the total number of nodes in a contact domain. The typical number of neighbors is on the order of eight.

This communication topology is also used several times in a time cycle to communicate intermediate results during the course of the contact calculation. For example, in Fig. 10, node s7 needed the freshly updated coordinates of m5 (a ghost node to CD 3) in order to determine that its new order node is indeed m5.

To construct this topology, for each local node the bulk domain builds a list of which contact domains the local node will be sent to as a ghost. The bulk domain sends this list when it sends the node to its assigned CD. When it sends the node to any other CD as a ghost, it sends along with it the information of which CD the node is assigned to. Each CD constructs from this information a set of CDs where its assigned nodes are ghosts, and another set of CDs where its ghost nodes are assigned. Unlike the BD ↔ BD topology, the CD ↔ CD communication topology is dynamic.

2.5. Domain preparation

2.5.1. Bulk domain

After the BD ↔ BD communication pass, the bulk domain builds a package of data to send to each of its neighbor contact domains by looping through all its local and proxy nodes and looking at each node's rule 1 CD and rule 2 CD list. The first time a CD is encountered, the bulk domain creates a new list for nodes going to that CD and adds the CD to its list of CDs to send to. If the node is local, it is added to the list of nodes being sent to its rule 1 CD (if not already on the list) and is flagged as an assigned node. The BD adds all the local neighbors of the node to the lists of all CDs in the rule 1 and rule 2 CD lists, regardless of whether the node is local or proxy.

2.5.2. Contact domain

Each contact domain must know beforehand which bulk domains will send it messages, in order to prepost the receives of those messages. This determination is actually done at the end of the previous cycle. After the order nodes have been updated at the end of the previous time cycle, each contact domain determines which slave nodes have order nodes that are among its assigned masters (the CD's new rule 1 generators) and which slave nodes are the order nodes of its assigned masters (the CD's rule 2 generators). The bulk domains that own these generators will communicate with the contact domain next (i.e. current) cycle. In addition, the contact domain will also receive all the neighbors of its generators, and these neighbors may not all be from the same bulk domains as the generators themselves (see Fig. 9). To solve this problem, when a node is sent to a CD, extra data called the neighbor BD list is also sent with it. For a given node, its neighbor BD list is the set of all bulk domains on which the node is local or proxy. This is a static list, and can be computed during initialization.
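Referring back to the CD ↔ CD topology of Section 2.4.3, the following is a minimal sketch of how a contact domain might derive the set of contact domains it must exchange intermediate results with; the dictionaries are hypothetical and the example follows Fig. 10.

```python
# Sketch of constructing a contact domain's CD <-> CD neighbor set (Section 2.4.3).
def cd_cd_topology(assigned_nodes, ghost_nodes, ghost_cds_of, assigned_cd_of):
    """CDs this contact domain talks to: CDs where its assigned nodes are ghosts,
    plus CDs where its own ghost nodes are assigned."""
    ghosts_of_mine = set()
    for n in assigned_nodes:
        ghosts_of_mine.update(ghost_cds_of.get(n, []))
    owners_of_my_ghosts = {assigned_cd_of[n] for n in ghost_nodes}
    return ghosts_of_mine | owners_of_my_ghosts

# Fig. 10, from CD 3's point of view: s7 is assigned to CD 3 and is also a ghost on
# CD 2; m5 is a ghost on CD 3 but is assigned to CD 2.
print(cd_cd_topology(assigned_nodes=["s7", "m6"], ghost_nodes=["m5"],
                     ghost_cds_of={"s7": ["CD2"]},
                     assigned_cd_of={"m5": "CD2"}))   # -> {'CD2'}
```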
If a bulk domain appears in a given node's neighbor BD list, then that bulk domain contains at least one node in the neighborhood of the given node. Thus, the contact domain will also communicate with all bulk domains in the neighbor BD list of any of its generators. In summary, at the end of the previous cycle the contact domain finds all slaves that order on its assigned masters and all slaves that are ordered on by its assigned masters. Then, the contact domain takes the union of these nodes' neighbor BD lists in order to determine which BDs it will talk to during the next time cycle. This set is stored in the next time BDs list until the start of the next (i.e. current) cycle.
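The following is a minimal sketch of that union, under hypothetical data structures; the example numbers follow Fig. 9, where a generator can be local on one bulk domain and proxy on another.

```python
# Sketch of assembling a contact domain's "next time BDs" list (Section 2.5.2).
def next_time_bds(rule1_generators, rule2_generators, neighbor_bd_list):
    """Bulk domains this CD should expect messages from next cycle: the union of
    the neighbor BD lists of all of its rule 1 and rule 2 generators."""
    bds = set()
    for node in set(rule1_generators) | set(rule2_generators):
        bds.update(neighbor_bd_list[node])
    return bds

# Fig. 9: s4 is a generator for CD 3; it is local on BD 2 and proxy on BD 1, so CD 3
# must prepost receives from both bulk domains next cycle.
print(next_time_bds(["s4"], [], {"s4": {"BD1", "BD2"}}))   # -> {'BD1', 'BD2'}
```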
2.6. Communication summary

The ALE3D parallel contact algorithm uses three types of communication topologies in dynamically evolving the communication patterns of material contact.

BD ↔ BD communication. Ties the bulk domains together by sharing information near mutual boundaries, and thus allows a given bulk domain to "sense" when it is about to become involved in the work of a specific contact domain through the involvement of its proxy nodes.

BD ↔ CD communication. Supplies the contact domain with the data needed to perform the contact enforcement, to update the order nodes of the CD's assigned nodes, and to predict, on the basis of the updated order nodes, which bulk domains will communicate with it during the next time cycle. Returns the updated nodal state of the assigned nodes to their BD homes.

CD ↔ CD communication. Supports the contact domain in determining which slaves order on its assigned master nodes, and thus which bulk domains the contact domain will communicate with during the next time cycle. CD ↔ CD communication also reduces redundant calculation of the ghost state, trading communication for redundant calculation.

The control flow for a single time cycle is given in Fig. 11.

2.7. A simplified example

A simple example is provided to demonstrate how the communication pattern evolves. In a more realistic example the surfaces would be 2D, and the node neighborhoods would include all nodes within two faces. As a further simplification, all the nodes on one side of the surface exactly match up with a node on the other side, so that the rule 2 CD list will only contain the rule 1 CD. Nevertheless, this example shows much of the mechanism by which BDs and CDs start and stop communicating with each other in the course of contact evolution.
Fig. 11. Control flow for a single time cycle.
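Since Fig. 11 is not reproduced here, the following Python-style schematic, reconstructed from Sections 2.4–2.6, summarizes the per-cycle steps; every function is a placeholder stub, and the decomposition into functions is illustrative rather than ALE3D's actual structure.

```python
# Schematic of one time cycle of the parallel contact algorithm (not ALE3D source).
# All functions are stubs; comments indicate the step each one stands for.

def bd_bd_exchange(bd):        pass   # share rule 1 CD / rule 2 CD lists for proxy nodes
def post_receives(cd):         pass   # prepost receives from the next time BDs list
def pack_and_send_to_cds(bd):  pass   # Section 2.5.1: build and send per-CD packages
def assemble_surface(cd):      pass   # merge received buffers into the CD's surface piece
def cd_cd_exchange(cd):        pass   # ghost-node intermediate results (Section 2.4.3)
def enforce_contact(cd):       pass   # HEMP-style symmetric, momentum-conserving forces
def update_order_nodes(cd):    pass   # local search starting from the previous order node
def build_next_time_bds(cd):   pass   # union of generators' neighbor BD lists
def return_nodal_state(cd):    pass   # updated state and rule 2 tags back to the BDs

def one_contact_cycle(bulk_domains, contact_domains):
    for bd in bulk_domains:
        bd_bd_exchange(bd)
    for cd in contact_domains:
        post_receives(cd)
    for bd in bulk_domains:
        pack_and_send_to_cds(bd)
    for cd in contact_domains:
        assemble_surface(cd)
        cd_cd_exchange(cd)        # in practice repeated several times per cycle
        enforce_contact(cd)
        update_order_nodes(cd)
        build_next_time_bds(cd)
        return_nodal_state(cd)
```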
1. Initial configuration: each master node mi orders on slave node si (m1–m6, s1–s6).
2. Static layout of the contact surface across the six bulk domains BD 1–BD 6. The nodes represented by open circles are proxy nodes.
3. Static BD ↔ BD topology:
   BD 1 ↔ BD 2
   BD 2 ↔ {BD 1, BD 3}
   BD 3 ↔ BD 2
   BD 4 ↔ BD 5
   BD 5 ↔ {BD 4, BD 6}
   BD 6 ↔ BD 5
4. Static master-side contact decomposition into contact domains CD 1–CD 6. The nodes assigned to a domain are represented by filled circles and the ghost nodes by open circles.
5. Contact domain configuration.
6. Contact surface movement during time cycle tn−1.
7. CD allocation and order node updates at tn−1.
8. The contact domains during time cycle tn.
A particular event is illustrated in Fig. 12. Note that BD 2 does not send anything to CD 1 during tn−1, as BD 2 → CD 1 is not part of the communication topology. Also, BD 2 finds that it will send data during tn to CD 1, and BD 2 → CD 1 is added to the BD → CD topology. Similarly, communication paths can disappear when they are no longer needed. For example, CD 3 received s2 from BD 1 during time cycle tn−1, while in the next time cycle CD 3 no longer needed s2; consequently BD 1 → CD 3 is dropped from the topology.
3. Timing studies

Scalability of the contact algorithm is tested by performing several timing studies. The test geometries are rectangular, cylindrical, and spherical, and have been designed to isolate the contact scalability issues. Scaled efficiency and fixed efficiency are both measured.

Since contact surfaces are two-dimensional, the number of contact surface elements in conventional scaling grows as N^{2/3} of the overall element count N. If the overall amount of work per processor is held constant, contact work per processor actually shrinks for scaled speedup (with N proportional to the processor count P, contact elements per processor scale as N^{2/3}/P ∝ P^{−1/3}). This clouds the issue of scalability of the contact algorithm itself. Ideally one would hope to see super-linear scaling of the contact phase in this case, but quantifying this is difficult. The simplest solution is to hold the contact work per processor constant, by performing mesh refinement only in the dimensions of the contact surface.

Fixed efficiency tests are achieved by running a fixed problem size on various numbers of processors. Ideally, each doubling of processors should cut the run time in half. However, as the amount of actual work per processor becomes small compared to the time spent in various fixed-overhead work, e.g., communication latency times, this ideal will inevitably fail at some point. The problem sizes chosen exhibit this failure at the largest processor counts run.

All tests were performed on the Lawrence Livermore National Laboratory ASCI "Y" machine, which is an IBM SP system with 488 nodes and 1.5 GBytes of memory per node. Each node consists of 4 PowerPC 604e processors, running at 332 MHz with a peak of 664 MFlops, that share the node's memory, for a total of 1952 processors.
Fig. 12. One particular event of the example.
3.1. A case for local communication

It was observed in the tests described in the following sections that there was both a steady increase in typical cycle times and an increasing variance in the individual cycle times when running with larger numbers of processors. Indeed, some time cycles were four times as long as others. This occurred even in the fixed planes test (see below), where identical work and communication is performed during each time cycle. Moreover, which cycles took longer on a given run was entirely unrepeatable.

This behavior is now understood to be a result of system "daemon" processes that randomly interrupt various processors to perform housekeeping tasks. If each processor ran entirely independently and the system interruptions were uniformly distributed, then scalability would not be affected. However, if one processor is interrupted during inter-processor communication, then communication with that processor must wait. While this is most dramatic for global operations such as MPI_Barrier and MPI_Allreduce, it also affects local communication patterns. To see this, consider Fig. 13.
Fig. 13. Five-processor example where each processor communicates only with its adjacent neighbors.
If processor A is interrupted and temporarily cannot send the expected data to processors B and E, then at some point processors B and E can go no further and cannot send their data to processors C and D. Thus, processors C and D must wait for processor A to return from the interruption. This "daisy-chain" can be of any length. The probability of a processor being interrupted by the system in any given time interval increases as the processor count is increased, resulting in the increased run times observed, as well as the variance in cycle times.

A simple test to measure this phenomenon is constructed in which each processor communicates with 6 neighbor processors. For simplicity the neighbors are the six with processor numbers (MPI rank) nearest to the given processor. The test is run for 1000 time cycles, and during each time cycle each processor sends to and receives from its neighbors 2000 floating-point numbers. This test should ideally take the same amount of time regardless of the total number of processors involved. The results are shown (normalized) in Fig. 14 and may seem to indicate that producing an algorithm that uses only local communication is futile. However, a similar test was constructed which instead performs a global operation (MPI_Barrier) within the loop. The results of this test are again shown in Fig. 14. Since barriers are the simplest form of global communication, in that they involve no data transfer, it is clear that global communication produces a much bleaker picture for scalability than local communication. Thus, efforts expended in restricting communication to local interchanges only are worthwhile.

3.2. Fixed planes test

Two rectangular boxes of equal size are oriented so that one sits on top of the other with perfect overlap, see Fig. 15. Larger runs use a finer grid in the x- and y-directions while shrinking the box thickness in the z-direction. The contact surface is defined to be the surface where the blocks meet. The mesh for the two blocks is slightly different, as there is one more zone in the x- and y-directions of the bottom block than in the top block. Thus, the contact nodes do not coincide except at the edges (the contact surface is orthogonal to the z-direction).
Fig. 14. (a) Local communication (green line); (b) barrier communication (red line). Run time is the average of 4 runs. (For interpretation of color the reader is referred to the web version of this article.)
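For reference, the following is a minimal sketch of the nearest-rank exchange test of Section 3.1 (the local-communication curve in Fig. 14). It assumes mpi4py and NumPy, which are not mentioned in the paper; buffer sizes and neighbor selection follow the description above.

```python
# Sketch of the local-communication test: each rank exchanges 2000 doubles with its
# six nearest ranks for 1000 cycles.  Assumes mpi4py; run e.g. with mpirun -n 64.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
offsets = (-3, -2, -1, 1, 2, 3)                       # six nearest ranks, clipped at ends
neighbors = [rank + o for o in offsets if 0 <= rank + o < size]

send = [np.full(2000, float(rank)) for _ in neighbors]
recv = [np.empty(2000) for _ in neighbors]

t0 = MPI.Wtime()
for _ in range(1000):
    reqs = [comm.Irecv(buf, source=nb) for buf, nb in zip(recv, neighbors)]
    reqs += [comm.Isend(buf, dest=nb) for buf, nb in zip(send, neighbors)]
    MPI.Request.Waitall(reqs)
if rank == 0:
    print("elapsed:", MPI.Wtime() - t0)
```

The barrier variant of Fig. 14 would simply add a comm.Barrier() call inside the loop.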
Fig. 15. Fixed planes.
Table 1
Results for fixed planes test

Procs.   Total zones   Zones/proc.   Contact nodes   Nodes/proc.   Contact time
8        25,282        3160          12,961          1620          1.0
16       50,622        3164          25,763          1610          1.11
32       101,762       3180          51,521          1610          1.14
64       203,386       3178          102,597         1603          1.24
128      408,322       3190          205,441         1605          1.36
256      815,346       3185          409,481         1600          1.51
512      1,635,842     3195          820,481         1603          1.65
1024     3,264,994     3188          1,636,113       1598          2.57
The entire machinery of the contact algorithm, including all communication, is exercised even though there is no movement in this test. Computational load balance should be very good, and any non-scalability is attributable to communication.

Table 1 lists the parameters and results for a set of timed runs. Each test is approximately twice the size of the previous test and is executed on twice as many processors. All times are normalized. There is a gradual increase in time attributable to system interrupts leading to communication blocks, as expected from the previous discussion on IBM SP scalability issues. The overall increase in time is about 3×, for a 128× increase in problem size and processor count. Fig. 16 compares the actual scaled efficiency to the ideal. The normalized maximum run time is shown, and is actually the inverse of scaled efficiency.

A fixed-size test was also executed with various processor counts. The problem chosen is identical to the one used for the scaled test with 256 processors. Results for the fixed test are compared with ideal scaling in Fig. 17.
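As a small check on Fig. 16, the scaled efficiency can be recovered directly from the normalized contact times in Table 1, since the normalized maximum run time is the inverse of the scaled efficiency; a minimal sketch:

```python
# Scaled efficiency from Table 1: times are normalized to the 8-processor run,
# so efficiency is simply the reciprocal of the normalized contact time.
procs = [8, 16, 32, 64, 128, 256, 512, 1024]
contact_time = [1.0, 1.11, 1.14, 1.24, 1.36, 1.51, 1.65, 2.57]
for p, t in zip(procs, contact_time):
    print(f"{p:5d} processors: scaled efficiency = {1.0 / t:.2f}")
```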
Fig. 16. Comparison of actual scaled efficiency (red line) to the ideal (green line) for fixed planes test. (For interpretation of color the reader is referred to the web version of this article.)
Fig. 17. Comparison of actual efficiency (red line) with the ideal (green line) for fixed sized problem on the fixed planes test. (For interpretation of color the reader is referred to the web version of this article.)
3.3. Sliding planes test

Two rectangular boxes are again modelled in this test. In this case they sit at an initial offset, as shown in Fig. 18(a). One block is given a velocity, so that at a later time they are in the configuration shown in Fig. 18(b). This test exercises the ability of the algorithm to change its communication pattern over time.

The partitioning algorithm decomposes the problem into domains such that each domain is in one block or the other, but not both. This is a consequence of the rule that attempts to minimize connectivity between domains; the most obvious cut is between the blocks along the contact surface, as there is no connectivity between blocks. The domains directly across from a given domain will change as the block slides, thus altering the communication pattern. Individual slave nodes are assigned to the contact domain of their associated master node. As the block slides, slaves that were previously ordered on masters located on the edge of a contact domain will change ordering to new masters in a different contact domain. Thus, the individual data sent to contact domains changes each time cycle even when the pattern of communicating domains does not.

Table 2 and Fig. 19 present the normalized results for the scaled sliding planes test. It is clear from the above results that this test does not scale as well as the fixed planes test. The source of the problem is poor load balance, and is a consequence of the way in which the slave nodes are assigned to contact domains. As previously discussed, each slave node is dynamically assigned to the same contact domain as its statically assigned master order node. Since the blocks are offset, all the slave nodes in the offset region order on the limited set of master nodes along the edge of the master side closest to the offset slave region. Fig. 20 depicts a simplified 1D example.
Fig. 18. Sliding planes.
Table 2
Results for sliding planes test

Procs.   Total zones   Zones/proc.   Contact nodes   Nodes/proc.   Contact time
8        25,600        3200          13,122          1640          1.0
16       51,072        3192          25,990          1624          1.10
32       102,400       3200          51,842          1620          1.20
64       204,288       3192          103,050         1610          1.40
128      409,600       3200          206,082         1610          1.78
256      817,152       3192          410,368         1603          2.15
512      1,628,400     3200          821,762         1605          3.05
1024     3,268,608     3192          1,637,922       1600          5.21
Fig. 19. Comparison of actual scaled efficiency to the ideal for sliding planes test.
Fig. 20. The slave side is on the bottom. Each partition along the top represents a different contact domain.
Although the master side is well load-balanced, most of the slave side is nearest to the master nodes in contact domain A. Thus, most of the slave nodes are assigned to contact domain A, while a much smaller number are assigned to contact domain B and none are assigned to contact domain C. This results in severe load imbalance.

The imbalance gets worse as the problem is scaled up. In Fig. 20, 50% of the slave side lies to the left of the master side, and thus more than 50% of the slave side will be assigned to 33% of the contact domains. If the mesh is refined 10 times with a proportional increase in processor count, yielding 30 contact domains, then more than 50% of the slave nodes will be assigned to 1/30, or about 3%, of the contact domains. This is a definite weakness in the current version of the contact algorithm, but it can be "fixed" by using a different method of assigning slave nodes to contact domains. Fig. 21 shows the results of the fixed-size scaling test for the sliding planes problem.

3.4. Cylinder test

Two concentric cylinders are modelled in this test, see Fig. 22. The outer cylinder is held fixed while the inner cylinder is given a constant angular velocity.
Fig. 21. Comparison of actual efficiency with the ideal for a fixed sized problem on the sliding planes test.
Fig. 22. Cylinder test.
Each part of the contact surface in this problem eventually comes into contact with many other parts of the surface. Consequently, there is a large amount of change in the communication pattern over time. This test does not have the load-imbalance problem of the sliding planes test, so scalability is improved. One full revolution of the interior cylinder is modelled over the course of the test. The results for this test are shown in Table 3 and Fig. 23.

Fig. 24 shows the corresponding fixed-size results for the problem used in the 256-processor scaled test. It can be seen that the fixed-size problem becomes communication-bound past 512 processors. This is not surprising: there are fewer than 400 contact nodes per contact domain at this point, and each processor is left without much computation to balance the communication overhead.
Table 3
Results for cylinder test

Procs.   Total zones   Zones/proc.   Contact nodes   Nodes/proc.   Contact time
8        12,800        1600          6,642           830           1.0
16       25,536        1596          13,110          819           1.01
32       51,200        1600          26,082          815           1.13
64       102,144       1596          51,754          809           1.21
128      204,800       1600          103,362         808           1.40
256      408,576       1596          205,650         803           1.56
512      819,200       1600          411,522         804           1.85
1024     1,634,304     1596          819,847         801           2.37
Fig. 23. Comparison of actual scaled efficiency to the ideal for the cylinder test.
Fig. 24. Comparison of actual efficiency with the ideal for fixed sized problem on the cylinder test.
3.5. Concentric spheres test

This final test has two contact surfaces separating three concentric spherical shells. The outer and inner shells are a metallic material, while the middle shell is an explosive material. Only one octant of the material is modelled, using symmetric boundary conditions. At the initial time the explosive is ignited, leading to a rapid expansion of the material. Fig. 25 shows the initial problem configuration. While there is not a lot of movement along the contact surfaces, there is some.
Fig. 25. Initial configuration.
Table 4
Results for concentric spheres test

Procs.   Total zones   Zones/proc.   Contact nodes   Nodes/proc.   Contact time
8        21,240        2655          5,312           664           1.0
16       42,350        2646          7,669           479           0.90
32       84,860        2652          12,936          404           0.90
64       169,920       2655          20,740          324           1.09
128      338,800       2646          30,064          235           1.11
256      678,880       2652          50,948          199           1.37
512      1,359,360     2655          81,956          160           1.62
1024     2,710,400     2646          119,044         116           1.93
To capture the correct physics it is necessary to allow this free motion. The most important requirement on the contact surfaces in this application is the need to accurately model shock waves as they cross the surface. Note that, unlike the previous tests, the ratio of contact nodes to overall zones decreases as this test is scaled up, see Table 4. The shrinkage in the amount of actual computation initially leads to a reduced run time in the above results.
Fig. 26. Comparison of actual scaled efficiency to the ideal for the concentric spheres test.
Fig. 27. Comparison of actual efficiency with the ideal for fixed sized problem on the concentric spheres test.
However, as the processor count increases, the amount of computation becomes less significant relative to the communication time, and the test suffers the mild non-scalability discussed earlier. A plot of the timing results is presented in Fig. 26. Fig. 27 shows the results of a fixed-size scaling test. As in the other tests, the problem size previously used for 256 processors is again used for all processor counts of the fixed-size test.
References

[1] A. Anderson et al., User's Manual for ALE3D, Lawrence Livermore National Laboratory, 2003.
[2] R. Couch, R. McCallen, I. Otero, R. Sharp, 3D metal forming applications of ALE techniques, in: Simulation of Materials Processing: Theory, Methods, and Applications, Balkema, Rotterdam, 1995.
[3] A. Koniges (Ed.), Industrial Strength Parallel Computing: Programming Massively Parallel Systems, Morgan Kaufmann Publishers Inc., Los Altos, CA, 1999.
[4] G. Karypis, V. Kumar, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 1.0, University of Minnesota, Department of Computer Science, 1995.
[5] D.J. Benson, J.O. Hallquist, A single surface contact algorithm for the post-buckling analysis of shell structures, Comput. Methods Appl. Mech. Engrg. 78 (1990).
[6] C.G. Hoover, A.J. DeGroot, J.D. Maltby, R.D. Procassini, ParaDyn: DYNA3D for massively parallel computers, in: Presentation at Tri-Laboratory Engineering Conference on Computational Modelling, October 1995.
[7] S.W. Attaway, B.A. Hendrickson, S.J. Plimpton, D.R. Gardner, C.T. Vaughan, K.H. Brown, M.W. Heinstein, A parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D, Comput. Mech. 22 (1998).
[8] K. Brown, S. Attaway, S. Plimpton, B. Hendrickson, Parallel strategies for crash and impact simulations, Comput. Methods Appl. Mech. Engrg. 184 (2000).
[9] M.J. Berger, S.H. Bokhari, A partitioning strategy for nonuniform problems on multiprocessors, IEEE Trans. Comput. C-36 (1987) 570–580.
[10] M.L. Wilkins, Calculation of elastic–plastic flow, Lawrence Livermore National Laboratory, UCRL-7322, Rev. 1, 1969.
[11] T.G. Pierce, A parallel algorithm for contact in a finite element hydrocode, Lawrence Livermore National Laboratory, UCRL-LR-154063.