Advances in Engineering Software, Vol. 29, No. 3–6, pp. 179–186, 1998. © 1998 Published by Elsevier Science Ltd. All rights reserved. PII: S0965-9978(98)00019-2

Nonlinear dynamic finite element analysis on parallel computers using FORTRAN 90 and MPI*

Kent T. Danielson (a,b) & Raju R. Namburu (b)

(a) Aerospace Engineering and Mechanics/Army HPC Research Center, University of Minnesota, USA
(b) US Army Engineer Waterways Experiment Station, CEWES-SD-R, 3909 Halls Ferry Road, Vicksburg, MS 39180, USA

*This paper was presented at the 4th NASA National Symposium on Large-Scale Analysis and Design on High-Performance Computers and Workstations, Williamsburg, VA, October 16, 1997.

A nonlinear explicit dynamic finite element code for use on scalable computers is presented. The code was written entirely in FORTRAN 90, but uses MPI for all interprocessor communication. Although MPI is not formally a standard for FORTRAN 90, the code runs properly in parallel on CRAY T3E, IBM SP, and SGI ORIGIN 2000 computing systems. Issues regarding the installation, portability, and effectiveness of the FORTRAN 90-MPI combination on these machines are discussed. An algorithm that overlaps message passing and computations of the explicit finite element equations is also presented and evaluated. Several large-scale ground-shock analyses demonstrate the varying combined importance of load balance and interprocessor communication among the different computing platforms. The analyses were performed on only a few to hundreds of processors with excellent speedup and scalability. © 1998 Published by Elsevier Science Limited. All rights reserved.

1 INTRODUCTION

The effective use of multiple processors in parallel for a single application can greatly increase the feasibility of performing large-scale nonlinear finite element analysis (FEA). Using Single Instruction Multiple Data (SIMD) and data parallel approaches, Refs 1,2 presented issues regarding the placement of explicit contact–impact FEA codes on the Connection Machine computers CM-2 and CM-5. More recently, the appearance of message passing libraries, such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM), has increased the attractiveness of coarse grain Multiple Instruction Multiple Data (MIMD) or Single Program Multiple Data (SPMD) approaches. Using explicit message passing calls, general purpose explicit dynamic finite element codes that take advantage of the power of massively parallel processor computers are emerging (e.g., ParaDyn (Ref. 3) and PRONTO3D (Ref. 4)). The general nature and capabilities of these codes make them attractive to a large and diverse user group. Therefore, it is desirable for these codes to run large-scale analyses efficiently on a variety of different parallel computing platforms.

FORTRAN 90 is also becoming widely used on many computing systems. For instance, it is the only FORTRAN compiler available on the Cray T3E. Most parallel FEA software, however, is a version of a legacy code written primarily in FORTRAN 77. Although FORTRAN 90 permits the use of FORTRAN 77 syntax, such code is still compiled under FORTRAN 90, and several idiosyncrasies between the two FORTRAN standards arise when explicit message passing libraries are used. For the development of new FEA software, programmers are naturally migrating to FORTRAN 90 for its advanced features (free format, dynamic dimensioning, whole array operations, etc.). Data parallel approaches with FORTRAN 90 or High Performance FORTRAN (HPF) have frequently not yielded the high performance levels achieved by similar implementations using explicit message passing. Therefore, this paper addresses implementation and performance issues associated with combining FORTRAN 90 and explicit message passing libraries for coarse grain parallelism.

Ground-shock problems (Ref. 5) are typical FEA applications made in the Structures Laboratory at the U.S. Army Engineer Waterways Experiment Station (WES). The large three-dimensional domains inherent to these applications may require millions of elements to adequately perform such analyses. For example, Fig. 1 depicts a wave propagation analysis through soil towards a buried structure resulting from an explosive detonation. Two sides of the finite element
model assume mirror symmetry, three sides use nonreflecting infinite boundary conditions, and the top surface is free. The different materials are represented by several different inelastic models specifically developed for WES applications. The mesh consists of approximately 1.1 million 8-noded hexahedral elements, and a 10 ms transient analysis required approximately 175 CPU h on a single processor of a Cray C-90 computer. The analysis was performed using a serial special-purpose finite element code. The type of application depicted in Fig. 1 would be significantly more tractable with the efficient use of a massively parallel computer.

In this paper, the authors relate their experiences with such analyses on the Cray T3E, IBM SP, and SGI Origin 2000 computing systems described in Table 1. Results are presented from finite element software developed by the authors using a combination of FORTRAN 90 and MPI, and the effective portability of this code is discussed. In addition, several large-scale models, similar to that in Fig. 1, are used to benchmark the varying combined importance of load balance and interprocessor communication among the different platforms. These analyses are also used to evaluate the overall effectiveness of the FORTRAN 90/MPI combination for this class of problems.

2 PARALLEL CODE DEVELOPMENT

The software, ParaAble, is an explicit dynamic finite element code developed by the authors for three-dimensional large strain/deformation solid mechanics problems that may include nonlinear boundary conditions. ParaAble contains many of the options commonly available in other popular finite element codes (e.g., different material models, loadings, multipoint constraints, etc.). Although its feature set is not as extensive as that of well-established codes such as DYNA3D or ABAQUS, ParaAble is structured like a general purpose code, with hooks to easily implement new capabilities. The main purposes for the development of
ParaAble are (a) its use as a research tool to test and evaluate different parallel algorithms in a modern programming environment, and (b) the performance of special purpose investigations. The parallel structure of ParaAble is also similar to that of other parallel explicit dynamic codes (e.g., Refs 3,4). An SPMD paradigm is used, with the code written in FORTRAN 90 and all interprocessor communication made with explicit MPI calls. In this section, the key components of ParaAble are outlined. A brief overview of nonlinear explicit dynamic FEA of solids and structures is first given, followed by a general description of the parallel implementation. The parallel procedure primarily consists of a mesh partitioning preprocessing phase and a parallel analysis phase that includes explicit message passing among the partitions on separate processors.

2.1 Basic explicit dynamic finite element algorithm

The geometrically and materially nonlinear finite element formulation is similar to that of many popular codes (e.g., Refs 1–4,6) and is only highlighted herein. Using virtual work, the basic finite element equations representing dynamic equilibrium for the mesh at time t are:

$$\mathbf{M}\,\ddot{\mathbf{q}}_t = \mathbf{P}_t - \mathbf{F}_t \qquad (1)$$

where each dot on top of a variable denotes differentiation with respect to time, M is the mass matrix, q is the generalized displacement vector, P is the vector of applied loads, and F is the vector of internal resistance forces. The elemental internal resistance force contributions, F^e, are determined by inclusion of the finite element interpolation functions into the following:

$$\mathbf{F}_t^e = \int_{V^e} \boldsymbol{\sigma} : \delta\boldsymbol{\varepsilon}\,\mathrm{d}V \qquad (2)$$

where δ is the variational operator and, for geometric and material nonlinearities, σ and ε are any work-conjugate pair of stress and strain measures associated with the elemental reference configuration V^e. Eqn (1) is first used to evaluate the accelerations, $\ddot{\mathbf{q}}_t$. Next, a central difference scheme, which is algebraically equivalent to a Newmark-β method (Ref. 7) with γ = 1/2 and β = 0, is used for temporal integration of the velocities and displacements:

$$\dot{\mathbf{q}}_{t+\Delta t^{n+1}/2} = \dot{\mathbf{q}}_{t-\Delta t^{n}/2} + \frac{\Delta t^{n+1} + \Delta t^{n}}{2}\,\ddot{\mathbf{q}}_t \qquad (3)$$

$$\mathbf{q}_{t+\Delta t^{n+1}} = \mathbf{q}_t + \Delta t^{n+1}\,\dot{\mathbf{q}}_{t+\Delta t^{n+1}/2} \qquad (4)$$

where Δt^{n+1} and Δt^n refer to the current and previous time increments, respectively. The elemental mass is lumped at the nodes, so that the mass matrix, M, is diagonal. Therefore, the computations associated with eqns (1)–(4) are primarily vector operations, and the corresponding CPU usage is dominated by evaluation of the elemental integrals of eqn (2).
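
Because this elemental integral dominates the run time, a minimal sketch of how a contribution of the form of eqn (2) is commonly evaluated by Gauss quadrature is given below. The 8-point rule, the small-strain form, and the routine names (shape_gradients, constitutive_update) are hypothetical illustrations rather than ParaAble's actual element kernels.

    ! Hypothetical sketch: elemental internal force of eqn (2) accumulated by
    ! Gauss quadrature for one 8-noded hexahedron (24 degrees of freedom).
    subroutine element_internal_force(xe, qe, fe)
      implicit none
      real(kind=8), intent(in)  :: xe(3, 8)   ! nodal coordinates of the element
      real(kind=8), intent(in)  :: qe(24)     ! elemental displacements
      real(kind=8), intent(out) :: fe(24)     ! elemental internal force F_t^e
      real(kind=8) :: b(6, 24), sig(6), detj, w
      integer :: igp
      fe = 0.0d0
      do igp = 1, 8                                  ! 2 x 2 x 2 quadrature rule
         call shape_gradients(xe, igp, b, detj, w)   ! hypothetical: B matrix, |J|, weight
         call constitutive_update(b, qe, sig)        ! hypothetical: stress at this point
         fe = fe + w*detj*matmul(transpose(b), sig)  ! accumulate B^T sigma dV
      end do
    end subroutine element_internal_force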

Fig. 1. Three-dimensional finite element ground shock simulation of an explosive detonation.

Table 1. Hardware summaries of WES parallel computing systems used in present study

Machine                         Cray T3E            IBM SP                  SGI Origin 2000
Number of CPUs                  336                 256                     32
Memory                          256 Mbytes/CPU      256–2000 Mbytes/CPU     16 Gbytes (shared)
Processor speed                 450 MHz             135 MHz                 200 MHz
Floating instructions/cycle     2                   3                       2

Fig. 2. Unstructured mesh partitioning using Metis and duplicating nodal definitions along partition boundaries.
2.2 Mesh partitioning

Using separate preprocessing software, any general unstructured mesh is partitioned (domain decomposition) to distribute the computational load associated with the finite element equations. Metis (Ref. 8), freely available on the World Wide Web, is used to provide finite element mesh partitions of nearly equal sizes while also generally minimizing the number of partition interfaces. For parallel computing, these partition characteristics are important for improving load balance and reducing interprocessor communication. In contrast to the nodal partitioning (sometimes referred to as coloring) done for explicit FEA of fluids in Ref. 9, the partitioning is performed on the elements, since the majority of the computational effort is associated with the elemental calculations in eqn (2). An example of the mesh partitioning is shown in Fig. 2. The Metis output provides a list of processor numbers for the elements, so that each element is uniquely defined on a single processor. The diagonal nature of the mass matrix, M, permits the finite element equation of each nodal degree of freedom to be solved independently of the other degrees of freedom. To retain data locality, nodes on partition boundaries are therefore redundantly defined on all processors possessing elements attached to these nodes (see Fig. 2). All elemental and nodal loads, boundary conditions, material properties, constraints, etc., are defined only on the processors to which they apply. All nodal and elemental numbers are converted to local numbers on each processor but, for convenience with output and post-processing, retain their global numbers as lists of labels. The preprocessing software is similar to that used with ParaDyn (Ref. 10) and can be reasonably executed for large models on a workstation. Metis can use the finite element connectivities to efficiently create the graph (element-to-element connection list) and then perform the partitioning, or the graph can be supplied directly. Graph vertex weighting is applied according to element and material type. Elements associated by static constraints (e.g., tied surfaces, MPCs) can simply be added to the graph in the same manner as elements connected by nodes (Ref. 11). Graph edge weights are defined by the number of common nodes between elements or by common elemental faces.
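
To illustrate this preprocessing step, the following is a minimal sketch that builds the weighted element-to-element (dual) graph of a small chain of hexahedra in the compressed storage format used by Metis and partitions it. The tiny mesh, the weights, and the assumption of the Metis 4-era routine METIS_PartGraphKway are illustrative only, not ParaAble's actual preprocessing software.

    ! Hypothetical sketch of the partitioning step: a weighted dual graph in
    ! compressed storage, partitioned with the Metis 4-era routine
    ! METIS_PartGraphKway (mesh, weights, and interface are assumptions).
    program partition_sketch
      implicit none
      integer, parameter :: ne = 6, nadj = 10, nparts = 2
      integer :: xadj(ne+1), adjncy(nadj), vwgt(ne), adjwgt(nadj)
      integer :: options(5), wgtflag, numflag, edgecut, epart(ne)

      ! Dual graph of a chain of six hexahedra: element i touches i-1 and i+1.
      xadj   = (/ 1, 2, 4, 6, 8, 10, 11 /)
      adjncy = (/ 2, 1, 3, 2, 4, 3, 5, 4, 6, 5 /)
      vwgt   = 1          ! vertex weights, e.g. by element/material type
      adjwgt = 4          ! edge weights, e.g. number of shared nodes per face

      wgtflag = 3         ! both vertex and edge weights are supplied
      numflag = 1         ! Fortran-style (1-based) numbering
      options = 0         ! Metis default options

      call METIS_PartGraphKway(ne, xadj, adjncy, vwgt, adjwgt, wgtflag, &
                               numflag, nparts, options, edgecut, epart)

      print *, 'edges cut:', edgecut, '  element-to-processor map:', epart
    end program partition_sketch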

2.3 Parallel solution schemes

At each time increment, t, the basic parallel scheme first consists of creating a global force vector for the partitions on each processor from the elemental contributions to P_t and F_t. Next, the forces belonging to nodes on partition boundaries are gathered into vectors and sent to the processors possessing a duplicate definition of each partition boundary node. The partial force vectors are then received from the other processors and added to the global force vector on the current processor. Finally, the critical time step and energies on each processor are sent to all processors in order to determine global values. At this point, concentrated boundary conditions (point loads, specified velocities, etc.) are accommodated in the force vector, and the new accelerations, velocities, and displacements are determined by the relations in eqns (1), (3) and (4), respectively. Using the new configuration, the process is then started over again for a new time increment. For each time increment, two basic parallel schemes were implemented in ParaAble: one overlaps message passing with computations and one does not.

2.3.1 Overlapping scheme

Nonblocking MPI message passing can be used to overlap communication with computation. As opposed to the non-overlapping algorithm described in Section 2.3.2, the communication of the force vectors of partition boundary nodes among processors can be overlapped with the computations for the elements in the interior of the partitions. The present overlapping approach is similar to that of the explicit algorithm in Ref. 9 for fluid dynamic applications using PVM. The basic overlapping algorithm within each time increment is described in Table 2. Note that the communications in steps A and D are overlapped with the computations in step E.

Table 2. Overlapping algorithm

A. Post nonblocking receives (MPI_IRECV) for force vectors on partition boundaries;
B. Compute applied elemental forces and place in P_t;
C. Calculate F_t for elements connected to partition perimeter nodes and store as P_t = P_t - F_t;
D. Gather partition perimeter force vectors from P_t and post nonblocking sends (MPI_ISEND);
E. Calculate F_t for elements on the interior of the partitions and store as P_t = P_t - F_t;
F. Post nonblocking waits (MPI_WAIT) for completion of the communications in steps A and D;
G. Add received partition perimeter force vectors to P_t;
H. Gather minimum time increments and energies from all processors (MPI_ALLREDUCE) and determine global values;
I. Apply nodal loads/kinematics and determine the new accelerations, velocities, and displacements via eqns (1), (3) and (4);
J. For the total time, recursively repeat steps A–I with new configurations.
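
To make the preceding steps concrete, the following is a minimal FORTRAN 90 sketch of one time increment of the overlapping scheme. The neighbour bookkeeping arrays (nbr, nbn, bnode), the force and mass arrays, and the element-level routines (apply_loads, perimeter_internal_forces, interior_internal_forces) are hypothetical illustrations rather than ParaAble's actual data structures, and the concentrated nodal loads and kinematic boundary conditions of step I are omitted for brevity.

    ! Minimal sketch of one increment of the overlapping scheme of Table 2.
    ! The neighbour lists, force arrays, and element routines are hypothetical.
    subroutine overlapped_increment(ndof, nnbr, maxbn, nbr, nbn, bnode, &
                                    p, m, q, qdot, dt_curr, dt_prev)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: ndof, nnbr, maxbn
      integer, intent(in) :: nbr(nnbr)            ! ranks of neighbouring partitions
      integer, intent(in) :: nbn(nnbr)            ! shared boundary equations per neighbour
      integer, intent(in) :: bnode(maxbn, nnbr)   ! local equation numbers of shared nodes
      real(kind=8), intent(inout) :: p(ndof), q(ndof), qdot(ndof), dt_curr, dt_prev
      real(kind=8), intent(in)    :: m(ndof)      ! lumped (diagonal) mass
      real(kind=8) :: fsend(maxbn, nnbr), frecv(maxbn, nnbr), dt_local, dt_global
      integer :: req(2*nnbr), stat(MPI_STATUS_SIZE, 2*nnbr), i, k, ierr

      ! Step A: post nonblocking receives for the partition boundary forces.
      do k = 1, nnbr
         call MPI_IRECV(frecv(1, k), nbn(k), MPI_DOUBLE_PRECISION, nbr(k), 1, &
                        MPI_COMM_WORLD, req(k), ierr)
      end do

      ! Steps B and C: applied loads and internal forces of perimeter elements,
      ! accumulated as p = P_t - F_t (hypothetical element-level routines).
      call apply_loads(p)
      call perimeter_internal_forces(p)

      ! Step D: gather perimeter forces and post nonblocking sends.  Passing the
      ! first element (not an array section) sidesteps the copy issue of Section 4.
      do k = 1, nnbr
         do i = 1, nbn(k)
            fsend(i, k) = p(bnode(i, k))
         end do
         call MPI_ISEND(fsend(1, k), nbn(k), MPI_DOUBLE_PRECISION, nbr(k), 1, &
                        MPI_COMM_WORLD, req(nnbr + k), ierr)
      end do

      ! Step E: interior-element forces overlap the communication posted above;
      ! dt_local returns this partition's minimum stable time increment.
      call interior_internal_forces(p, dt_local)

      ! Steps F and G: complete the exchanges and add the received contributions.
      call MPI_WAITALL(2*nnbr, req, stat, ierr)
      do k = 1, nnbr
         do i = 1, nbn(k)
            p(bnode(i, k)) = p(bnode(i, k)) + frecv(i, k)
         end do
      end do

      ! Step H: global minimum time increment (energies are handled the same way).
      call MPI_ALLREDUCE(dt_local, dt_global, 1, MPI_DOUBLE_PRECISION, MPI_MIN, &
                         MPI_COMM_WORLD, ierr)
      dt_prev = dt_curr
      dt_curr = dt_global

      ! Step I: lumped-mass updates of eqns (1), (3) and (4) as whole-array operations.
      qdot = qdot + 0.5d0*(dt_curr + dt_prev)*(p/m)
      q    = q + dt_curr*qdot
    end subroutine overlapped_increment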

2.3.2 Non-overlapping scheme

The standard non-overlapping parallel scheme for each time increment scatters the partial force vectors among the necessary processors and then waits for these communications to complete before proceeding with other computations. This is equivalent to performing the computations of step E together with those of step C of Table 2. The message passing could be accomplished by standard blocking MPI SEND/RECV or SENDRECV calls, but the nonblocking communication avoids possible deadlocks and the overhead associated with buffering. With ParaAble organized in this manner, the algorithm can be changed from a non-overlapping to an overlapping one by changing only two lines of code. For the overlapping scheme, however, the preprocessing partitioning software must include the identification of the elements contributing to the nodes on the perimeter of the partitions (see Fig. 2).
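
For comparison, a hypothetical fragment of the non-overlapping variant, which amounts to reordering two lines of the sketch given after Table 2:

    ! Non-overlapping variant of the sketch above: all element forces are formed
    ! before the sends, so no computation remains to overlap the communication.
    call perimeter_internal_forces(p)
    call interior_internal_forces(p, dt_local)   ! step E merged with step C
    ! ... gather the perimeter forces and post the MPI_ISENDs as before ...
    call MPI_WAITALL(2*nnbr, req, stat, ierr)    ! the waits immediately follow the sends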

3 ParaAble SOFTWARE STRUCTURE AND CHARACTERISTICS

In ParaAble, the real and integer variables are primarily eight and four byte FORTRAN 90 kinds, respectively. Generally, data are passed by modules, and memory is managed by FORTRAN 90 dynamic allocation and deallocation statements. Although FORTRAN 90 has pointers and derived data structures for object-oriented programming styles, the authors' experience is that these features currently degrade performance. Unless an object-oriented style makes it easy to implement an efficient complex algorithm, it does not appear to provide a significant advantage. Therefore, computationally intensive portions of the code are written in a modular FORTRAN 77 programming style, but still use many of the convenient FORTRAN 90 features and intrinsic functions (e.g., SELECT CASE, CYCLE, whole array operations, etc.).
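
As a small illustration of this style, module data with explicit kinds and allocatable arrays might look like the following; this is a hypothetical sketch, not ParaAble's actual source.

    ! Hypothetical sketch of the module-based storage style described above:
    ! eight-byte reals, four-byte integers, and allocatable mesh arrays shared
    ! through a module rather than passed through long argument lists.
    module mesh_data
      implicit none
      integer, parameter :: rk = selected_real_kind(15)   ! eight-byte reals
      integer, parameter :: ik = selected_int_kind(9)     ! four-byte integers
      integer(ik) :: nnode = 0, nelem = 0
      integer(ik), allocatable :: connectivity(:, :)      ! 8 nodes per hexahedron
      real(rk),    allocatable :: coord(:, :), mass(:), force(:)
    contains
      subroutine allocate_mesh(nn, ne)
        integer(ik), intent(in) :: nn, ne
        nnode = nn
        nelem = ne
        allocate(coord(3, nnode), connectivity(8, nelem), mass(3*nnode), force(3*nnode))
      end subroutine allocate_mesh
      subroutine deallocate_mesh()
        deallocate(coord, connectivity, mass, force)
      end subroutine deallocate_mesh
    end module mesh_data
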
The code has been verified by comparison to solutions of several textbook problems and to analyses using other existing finite element codes.

For the partitioned strategy, the message passing primarily consists of scattering partial forces for partition boundary nodes to the duplicate nodes on other processors. The energy quantities and minimum time increment must also be communicated among all processors at each time increment, but these are only a small number of variables. In addition, a communication procedure, as used for the force vectors, must be performed for the global lumped mass vector, M, but this is usually done only once at the beginning of the analysis.

For the overlapping algorithm to be effective, the computations for the elements on the partition interiors must be large enough to mask a significant portion of the message passing. Clearly, this is a function of the specific computing platform and analysis. The overlapping algorithm can only be effective if the partition interiors contain many elements and/or use computationally intensive constitutive models (e.g., inelastic ones). In addition, platforms possessing fast communication links with high bandwidths may benefit only slightly from the overlapping algorithm. The authors nevertheless prefer the overlapping algorithm for general purpose codes, since such codes may be used with a variety of different applications and parallel computing systems, and the implementation differences between the two algorithms described above are trivial. Whereas many practical applications exist without contact (such as the ones in this study), most explicit dynamic analyses involve surface-to-surface contact. An overlapping algorithm might also be beneficial to several parallel contact implementations, which communicate considerably larger amounts of data among processors (e.g., Ref. 12).

Finally, the explicit message passing portions of ParaAble were not significant coding efforts. The greatest programming tasks involved the pre- and post-processing software to accommodate the mesh partitioning. In addition, the primary parallel aspects of ParaAble only require the inclusion of additional routines in a basic serial code. Therefore, most portions of an existing serial code would remain the same if it were modified to run in parallel by a similar mesh partitioning approach.

4 COMMENTS ON FORTRAN 90 AND MPI

The MPI standard is intended for FORTRAN 77, not FORTRAN 90. With some slight modifications, however, the authors were able to successfully run ParaAble on the different parallel computers. The code also runs properly on the previous versions of these machines (i.e., T3D, SP2, and PCA). Since the only available FORTRAN compiler on the CRAY T3E is FORTRAN 90, porting the code to the T3E required no modifications. Porting ParaAble to the IBM SP and SGI ORIGIN 2000 required minor modifications to the MPI header file, mpif.h. The use of an INCLUDE statement to specify this file is fine with most FORTRAN 90 compilers, but the inclusion of this information in a module would be a more modern and preferable programming style (Ref. 13). In addition, slight modifications were necessary to accommodate
the libraries and configuration associated with the Parallel Operating Environment (POE) on the IBM SP. No efforts were made to streamline the code for each particular machine (e.g., to improve cache performance).

A problem can arise with the combination of MPI nonblocking communication and FORTRAN 90. Unlike FORTRAN 77, FORTRAN 90 compilers may, for performance reasons, buffer data. For example, the following MPI statement might be used for a nonblocking receive of members N1 to N2 in array A:

    CALL MPI_IRECV(A(N1:N2), N2-N1+1, datatype, source, ...)

Since the nonblocking receive call (MPI_IRECV) may return before the information is completely received, a FORTRAN 90 buffer of A(N1:N2) for this call would be disastrous: the message would be delivered to a buffered memory address that no longer exists after the receiving call has returned. This is a dilemma with FORTRAN 90, as it could occur with nonblocking communication in any message passing library. Ways to help avoid the problem are to pass whole arrays or starting points within arrays (e.g., A(N1)), but nothing is guaranteed. Although these methods have alleviated the problem for the authors, and compilers may never buffer the nonblocking communication messages, the only way to ensure proper nonblocking communication is for the FORTRAN 90 compiler not to buffer such messages. Note that even a code written entirely in FORTRAN 77 may experience the same problem if compiled with fixed format FORTRAN 90, which may be a necessity, as only FORTRAN 90 is available on some machines (e.g., the Cray T3E).

For the mesh partitioning described in Section 2.2, a point of diminishing returns exists with regard to effectiveness and the number of partitions/processors. Obviously, there cannot be more partitions than elements, and the above scheme becomes ineffective long before each processor holds a single element. At that point, improved computation rates might only be attained by better performance per partition. This may be achieved by higher processor speeds or better cache performance. It could also be attained by a hierarchical strategy in which each partition invokes a data parallel model to solve its own computations and then uses explicit message passing to communicate the partial forces belonging to partition interfaces. FORTRAN 90 or HPF would be natural candidates for such a strategy.
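
Returning to the nonblocking-communication issue above, the fragment below contrasts the two calling styles; the array name, bounds, and message parameters are illustrative only, and the behaviour of the first form depends on the compiler.

    ! Illustration of the nonblocking-receive issue discussed above (assumes an
    ! INCLUDE 'mpif.h' as in ParaAble; names and bounds are illustrative).
    real(kind=8) :: a(10000)
    integer :: n1, n2, source, tag, request, ierr, status(MPI_STATUS_SIZE)

    ! (a) Risky with FORTRAN 90: a copy of the section a(n1:n2) may be passed,
    !     so the message can land in a temporary that no longer exists:
    !     call MPI_IRECV(a(n1:n2), n2 - n1 + 1, MPI_DOUBLE_PRECISION, source, tag, &
    !                    MPI_COMM_WORLD, request, ierr)

    ! (b) Safer in practice (though not guaranteed by the FORTRAN 90 standard):
    !     pass the first element, so the actual starting address is handed to MPI.
    call MPI_IRECV(a(n1), n2 - n1 + 1, MPI_DOUBLE_PRECISION, source, tag, &
                   MPI_COMM_WORLD, request, ierr)
    call MPI_WAIT(request, status, ierr)   ! the data are valid only after the wait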

Fig. 3. ParaAble finite element analysis of an elastic-bar/rigid-wall contact–impact problem.

5 NUMERICAL STUDIES

For benchmarking, a series of ParaAble analyses was made on three parallel computing platforms at WES. All analyses described herein were made with dedicated processors and the highest performance communication channels available on each system. The three platforms are significantly different architectures: the Cray T3E is a logically shared, physically distributed memory system; the IBM SP is a distributed memory system; and the SGI Origin 2000 is a shared memory system.

A verification analysis, as depicted in Fig. 3 and defined in Ref. 14, was performed for an elastic bar impacting and bouncing off a rigid wall. This contact–impact analysis used 2500 eight-noded hexahedral elements and compared well with the textbook solution. Whereas this analysis is small, Fig. 4 shows that the speedup is almost perfect (ideal) up to around 32 processors. Although noticeable improvement is seen for more processors, the parallel performance diminishes as the number of elements per processor becomes small. Note that the overlapping algorithm of Section 2.3.1 has little effect under these circumstances, since few elements (if any) exist in the partition interiors.

To assess the appropriateness of the aforementioned procedures for applications commonly made by analysts at WES, three large-scale ground-shock analyses were conducted. The three models, I, II, and III, are composed of homogeneous soil with sizes of 117 649, 970 299, and 3 307 949 8-noded hexahedral elements, respectively.
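
Throughout the following, speedup is understood in the usual sense,

$$S_p = \frac{T_1}{T_p},$$

where $T_1$ is the wall-clock time on one processor and $T_p$ the time on $p$ processors, so that ideal speedup equals $p$.
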
Fig. 4. ParaAble performance of elastic-bar/rigid-wall contact–impact analyses on the parallel computing systems (overlapping algorithm).

Fig. 5. ParaAble performance of Model I analyses on the parallel computing systems (overlapping algorithm).

Fig. 7. Comparison of the non-overlapping and overlapping algorithm performances for the Model I analyses on the IBM SP.

In all cases, Metis produced partitions of almost exactly equal sizes. To partition these models for different numbers of processors, the required Metis CPU time averaged about 1, 8, and 28 s, respectively. The runs were linear elastic, in order to benchmark the different computing platforms with nearly perfectly balanced analyses. This allows for a more meaningful evaluation of the effect of interprocessor communication on these computers for this class of problems, including the overlapping and non-overlapping algorithms described in Section 2.3. All analyses used approximately 12 500 time increments.

Fig. 5 shows that the speedup is excellent for Model I on each computing system. The large-scale analyses were quite efficient on hundreds of SP or T3E processors. Similar results were obtained for Models II and III. As depicted in Figs 6–8, the non-overlapping and overlapping algorithms of Section 2.3 performed similarly. The high speed interprocessor communications of these platforms are apparent.

For the applications considered in this study, the overlapping algorithm generally outperformed the non-overlapping algorithm, but not significantly. These analyses, however, should provide a lower bound on the effectiveness of the overlapping algorithm, since the linear elastic constitutive model is inexpensive. Whereas all of the code timings presented here were made on dedicated processors and high speed network lines, some of these analyses were additionally made on shared processors of the IBM SP and the SGI Origin 2000 and using slower (or shared) communication channels. Since the amount of processor and network use by others is variable, the performance under such circumstances cannot be reasonably quantified. Nevertheless, the overlapping algorithm occasionally exhibited noticeable benefits in such cases.

Figs 9 and 10 show the required CPU times for the different-sized Models I, II, and III on various numbers of processors.

Fig. 6. Comparison of the non-overlapping and overlapping algorithm performances for the Model I analyses on the Cray T3E.

Fig. 8. Comparison of the non-overlapping and overlapping algorithm performances for the Model I analyses on the SGI Origin 2000.

Fig. 9. ParaAble performance of Models I, II, III analyses on the Cray T3E and IBM SP (overlapping algorithm).

For the problem sizes considered, the IBM SP and the Cray T3E performed similarly, with both outperforming the SGI Origin 2000 by a noticeable amount. Note that for the above test problems, the IBM SP is the fastest system for the smallest problem size and the Cray T3E is the fastest system for the largest problem size. On each system, the analyses scaled well with the size of the problem as well as with the number of processors.

6 CONCLUDING REMARKS

The results presented in this paper demonstrate that significant advantages can be realized by using multiple processors of the IBM SP, Cray T3E, or SGI Origin 2000 computing platforms for nonlinear explicit dynamic finite element analysis. As depicted herein, this is particularly true for the class of large-scale ground-shock applications frequently made at the Waterways Experiment Station. Typical practical analyses using several million elements, which may require days or weeks of processing time on high-end serial/vector computers, may be performed in a few hours of processing time on current massively parallel processing machines.
Using mesh partitioning, large-scale analyses of this type can be applied to hundreds of processors with excellent speedup. With no attempt to optimize the code for a particular platform, the IBM SP and Cray T3E performed competitively and noticeably better than the SGI Origin 2000 for this class of problems. All platforms performed competitively, however, with respect to scalability.

An algorithm to overlap interprocessor communication with computations was presented and evaluated. Whereas the overlapping algorithm did not provide significant advantages for the high communication rate machines used in this study, the authors advocate its inclusion in general purpose codes. The implementation of the overlapping algorithm is simple and straightforward, and it causes no negative effects on performance. At the least, it provides an insurance policy for pathological cases and for cases in which the platform is significantly shared with other users.

Whereas MPI is not a standard for FORTRAN 90, the authors were able to easily port and run an explicit dynamic finite element code written entirely with FORTRAN 90 and MPI on the different parallel computing platforms. The primary installation differences were slight modifications to the mpif.h header files, intended for FORTRAN 77, on the IBM SP and the SGI Origin 2000 machines. Issues regarding the use of nonblocking explicit message passing with codes compiled by FORTRAN 90 were also discussed. The combination of FORTRAN 90 and MPI has a promising future for efficient and portable software on scalable computing systems.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the permission from the Office, Chief of Engineers, U.S. Army Corps of Engineers, to publish this paper. The work was supported in part by a grant of HPC time from the DoD HPC Center at the U.S. Army Engineer Waterways Experiment Station. The authors appreciate the assistance and suggestions of many individuals associated with the DoD HPC Center and with the Structures Laboratory at WES. The work of the first author is sponsored in part by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-003/contract number DAA-95-C-0008, the content of which does not necessarily reflect the position or policy of the government, and no official endorsement should be inferred.

Fig. 10. ParaAble performance of Models I, II, III analyses on the Cray T3E and SGI Origin 2000 (overlapping algorithm).

REFERENCES
1. Belytschko, T., Plaskacz, E. J. and Chiang, H. Y., Explicit finite element method with contact–impact on SIMD computers. Comput. Syst. Eng., 1991, 2, 269–276.
2. Namburu, R. R. and Turner, D. A., Contact–impact algorithm on a data parallel computer. In CED-Volume 6: Proceedings of the 1994 International Mechanical Engineering Congress and Exposition, Chicago, November 6–11, 1994.
3. Hoover, C. G., DeGroot, A. J., Maltby, J. D. and Procassini, R. J., ParaDyn—DYNA3D for massively parallel computers. Engineering, Research, Development and Technology FY94, Lawrence Livermore National Laboratory, UCRL 53868-94, 1995.
4. Plimpton, S., Attaway, S., Hendrickson, B., Swegle, J., Vaughan, C. and Gardner, D., Transient dynamics simulations: parallel algorithms for contact detection and smoothed particle hydrodynamics. In Proceedings of SuperComputing 96, Pittsburgh, PA, 1996.
5. Akers, S. A. and Windham, J. E., Structure medium interaction calculations of DIPOLE FACADE 8. In 14th U.S. Army Symposium on Solid Mechanics, Myrtle Beach, SC, October 16–18, 1996.
6. ABAQUS User's Manual, Version 5.6. Hibbitt, Karlsson and Sorensen, Inc., Pawtucket, RI, 1996.
7. Newmark, N. M., A method of computation for structural dynamics. Int. J. Earthquake Eng. Struct. Dynamics, 1959, 1, 241–252.
8. Karypis, G. and Kumar, V., A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report TR 95-035, Department of Computer Science, University of Minnesota, 1995.

9. Cabello, J., Parallel explicit unstructured grid solvers on distributed memory computers. Adv. Eng. Software, 1996, 26, 189–200.
10. Procassini, R. J., DeGroot, A. J. and Maltby, J. D., PartMesh user manual—partitioning unstructured finite element meshes for solution on a massively parallel processor. Methods Development Group, Mechanical Engineering, Lawrence Livermore National Laboratory, UCRL-MA-118774, 1994.
11. Danielson, K. T. and Namburu, R. R., A general partitioning approach for analyses involving constraints and contact. In Proceedings of the Fourth U.S. National Congress on Computational Mechanics, San Francisco, August 6–8, 1997.
12. DeGroot, A. J., Sherwood, R. J., Badders, D. C. and Hoover, C. G., Parallel contact algorithms for explicit finite element analysis (DYNA3D). In Proceedings of the Fourth U.S. National Congress on Computational Mechanics, San Francisco, August 6–8, 1997.
13. Brainerd, W. S., Goldberg, C. H. and Adams, J. C., Programmer's Guide to Fortran 90. Unicomp, Albuquerque, 1994.
14. Zhong, Z., Finite Element Procedures for Contact Impact Problems. Oxford University Press, Oxford, 1993, pp. 251–252.