Computers and Geotechnics 25 (1999) 281–285 www.elsevier.com/locate/compgeo
Technical Note
Parallel processing for a discrete element program

C.H. Dowding a,*, O. Dmytryshyn b, T.B. Belytschko b
a Department of Civil Engineering, Northwestern University, Evanston, IL 60208, USA
b Department of Mechanical Engineering, Northwestern University, Evanston, IL 60208, USA
* Corresponding author.
Received 10 August 1998; received in revised form 15 May 1999; accepted 14 June 1999
Abstract

Reconfiguration of a discrete element code for parallel operation provided the opportunity to compare processing speeds on various hardware platforms. The program employed in this comparison, NURBM3DP, is a three-dimensional distinct element code employed to calculate the dynamic response of a cavern in a jointed rock mass. On a 16-processor IBM SP2, it is capable of calculating the dynamic response of jointed rock masses with up to 2,000,000 blocks over thousands of explicit time steps. Comparison of single-instruction multiple-data-stream (SIMD) and multiple-instruction multiple-data-stream (MIMD) operation showed MIMD processing to provide the best overall parallelization. The full report of the comparisons of operation on different hardware with different data streaming configurations can be found in the research section of the Northwestern University Computational Mechanics site: http://www.tam.nwu.edu/compmech.html. In addition, a color movie of the dynamic response of a million-block model of a cavern responding to dynamic excitation can be seen at: http://geotech.civen.okstate.edu/ejge/ppr9801/index.htm. © 1999 Published by Elsevier Science Ltd. All rights reserved.
1. The need for and options in parallelization

Because multi-processor platforms provide an opportunity for faster performance than their single-processor counterparts, it is widely acknowledged that parallel machines will ultimately be employed for very large models, especially those that involve time-varying loading. Parallel computers can be classified as single-instruction multiple-data-stream (SIMD) and multiple-instruction multiple-data-stream (MIMD) machines. In a SIMD machine, each processor performs the same operation at the same time, while in a MIMD machine, each of the processors can perform
different functions at the same time, which gives MIMD machines greater flexibility and, as will be shown, greater parallelization potential. Parallelization was undertaken with both types of machines, and the MIMD approach was found to be more attractive. The SIMD version was operated on the CM5 manufactured by Thinking Machines; the MIMD version was operated on the CM5, on the SP2 manufactured by IBM, and on various Pentium Pro PCs running MS Windows NT 4.0.

Communication between processors is one of the most important considerations, if not the most important, in the reformulation of the computational algorithm. Communication becomes important because each processor can rapidly access only its own local memory; data local to other processors must be accessed through the slower communication network. While it is impossible to eliminate communication between processors completely, every effort that reduces communication is helpful. In some MIMD machines, communication can be scheduled concurrently with computation to achieve a high level of parallelism. This scheduling is more difficult to achieve with SIMD computers.

2. NURBM3DP algorithm

As in its serial forms [3,4] and in other discrete element models, the NURBM3DP model is assembled from rigid blocks that are bounded by planar surfaces. In rock mechanics applications, adjacent blocks share a common surface called a joint, where relative deformation results in joint forces. Interaction between blocks in the model occurs at the common contact, or joint. Lumped points (centers of triangular subdivisions of the joint face shapes) are employed to calculate the moments produced along each planar contact. A prismatic, brick-shaped block would have 24 lumped points (6 faces × 4 lumped points per face), where each lumped point has 3 degrees of freedom.

A new Curnier, velocity-based contact interface law was employed to save computational steps and increase efficiency. It is described in Dmytryshyn et al. [1], along with example calculations and a comparison with the response of model tunnels in jointed media to blast-induced loading. The Curnier interface implementation is based upon relative velocities rather than displacements and thus saves several computational steps.
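To make these entities concrete, the following C fragment sketches one plausible data layout for blocks, lumped points, and contacts. It is a minimal illustration assuming the brick-block geometry described above; the type and field names (LumpedPoint, Block, Contact) are ours and are not taken from the NURBM3DP source.

/* Illustrative data layout for rigid blocks and joint contacts.
 * Names and fields are hypothetical; they mirror the paper's
 * description (brick blocks: 6 faces x 4 lumped points = 24 lumped
 * points, 3 degrees of freedom each), not the actual NURBM3DP code. */

#define FACES_PER_BRICK  6
#define POINTS_PER_FACE  4
#define POINTS_PER_BRICK (FACES_PER_BRICK * POINTS_PER_FACE) /* 24 */

typedef struct {
    double pos[3];   /* location of the lumped point          */
    double force[3]; /* joint force accumulated at this point */
} LumpedPoint;

typedef struct {
    double centroid[3];
    double vel[3];     /* translational velocity of the rigid block */
    double omega[3];   /* angular velocity                          */
    double mass;
    double inertia[3]; /* principal moments of inertia              */
    LumpedPoint pts[POINTS_PER_BRICK];
} Block;

/* A joint: the planar surface shared by two adjacent blocks. The
 * relative velocity at each lumped point drives the velocity-based
 * (Curnier-type) interface law that yields the joint forces. */
typedef struct {
    int blockA, blockB; /* indices of the two interacting blocks */
    double normal[3];   /* joint plane normal                    */
    double kn, ks;      /* normal and shear joint stiffnesses    */
} Contact;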
3. Pre- and post-processing and definition of block model assembly

With hindsight it now seems logical that pre- and post-processing and graphical presentation pose significant challenges for million-block models, but at first it did not. Graphical depiction of the model is in itself both a pre- and a post-processing challenge, as it is necessary to check both the adequacy of the block definition (pre-processing) and the final state of the blocks (post-processing).

Consider the pre-processing necessary to create and define variable block geometry. The serial version of NURBM3D [4] is capable of defining blocks from families of parallel but arbitrarily oriented joint sets whose joints are variably spaced. A wide variety of block shapes and sizes results. When the joints and the resulting blocks are few, this geologically faithful approach is advantageous. Unfortunately, the process of block definition for such a non-uniform array requires an inordinately long time, because the properties and location of each block and contact must be calculated separately. Furthermore, output files for large models can be burdensomely large because they must contain all data about all blocks and contacts.

To avoid the challenge posed by irregular blocks and to focus on the performance of million-block models, we chose to assemble models from a small number of pre-defined block shapes with constant size and orientation. This restriction of block geometry allowed assembly of very large model volumes with 1,300,000 regular blocks (with 4,000,000 contacts, 16,000,000 interacting points, and 48,000,000 calculated degrees of freedom) in a matter of minutes on a single-processor PC. Similar assembly with variable block geometry required days for a smaller, 10,000-block model. The million-uniform-block data file is approximately 70 Mb, while that of the 10,000 irregular blocks required 86 Mb.

Illustrating model results with hundreds of thousands of three-dimensional elements produces a large computational burden. As with all modeling, graphical representation of the nodal output of NURBM3DP is necessary for human interpretation. Furthermore, these plots must be producible faster than the computation itself for what-if and comparative interpretation. To reduce load time and input data file size, a binary format was chosen for communication between the NURBM3DP output and TecPlot, which reduces file size by almost a factor of three and load time by a factor of five.
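The saving from binary output comes from writing raw doubles instead of formatted text. The fragment below sketches the idea for a block of nodal coordinates; the record layout and function name are a hypothetical example of ours, not the actual NURBM3DP/TecPlot interface format.

#include <stdio.h>

/* Write n nodal coordinate triples either as formatted text or as
 * raw binary doubles. The binary branch produces a much smaller
 * file that loads far faster, which is the motivation given in the
 * text; the layout here is illustrative only. */
int write_nodes(const char *path, const double (*xyz)[3], long n,
                int binary)
{
    long i;
    FILE *f = fopen(path, binary ? "wb" : "w");
    if (!f) return -1;

    if (binary) {
        fwrite(&n, sizeof n, 1, f);       /* record count        */
        fwrite(xyz, sizeof xyz[0], n, f); /* raw coordinate data */
    } else {
        fprintf(f, "%ld\n", n);
        for (i = 0; i < n; i++)
            fprintf(f, "%.9e %.9e %.9e\n",
                    xyz[i][0], xyz[i][1], xyz[i][2]);
    }
    return fclose(f);
}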
4. Parallelization for MIMD operation

The MIMD version of NURBM3DP provides nearly 100% efficient parallelization because the algorithm carries out communication during computation. The algorithm employs one thread of execution per processor. Nonblocking send and receive operations give higher performance than a multi-thread approach [5,6] because thread control and switching support are eliminated from the code.

The MIMD algorithm is the same optimized serial algorithm that contains the contact and block loops; however, the order in which contacts and blocks are processed differs in the parallel version. The entire model is divided into volumetric zones that are assigned to separate processors. During any time step, a processor calculates block forces and motions in its zone until data arrive from another processor (calculated at a previous time step). The processor then calculates forces for the blocks that are on the boundary and sends the information to the other processors. While the processor continues with the rest of its zone, the data are traveling to the other processor. A minimal sketch of this pattern is given below.
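This overlap of communication and computation maps naturally onto nonblocking message passing. The sketch below illustrates the pattern in C with MPI for a single neighboring zone. The paper does not name the communication library used, so the MPI calls are an assumption, and the helper routines are our own stand-ins for the actual block and contact loops.

#include <mpi.h>

/* Hypothetical stand-ins for the real force and motion loops. */
void compute_boundary_forces(double *out);
void compute_interior_forces(void);
void apply_neighbor_forces(const double *in);
void update_block_motions(void);

/* One explicit time step on one processor's zone, showing how
 * computation on interior blocks proceeds while boundary data
 * travels between processors. */
void time_step(double *boundary_out, double *boundary_in, int count,
               int neighbor, MPI_Comm comm)
{
    MPI_Request recv_req, send_req;

    /* Post the receive for the neighbor's boundary data early.  */
    MPI_Irecv(boundary_in, count, MPI_DOUBLE, neighbor, 0, comm,
              &recv_req);

    /* Compute forces for blocks on the zone boundary first ...  */
    compute_boundary_forces(boundary_out);

    /* ... and send them without blocking.                       */
    MPI_Isend(boundary_out, count, MPI_DOUBLE, neighbor, 0, comm,
              &send_req);

    /* While the messages travel, process the interior blocks.   */
    compute_interior_forces();

    /* Wait for the neighbor's data, then finish the contacts    */
    /* that straddle the zone boundary and integrate the motions.*/
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    apply_neighbor_forces(boundary_in);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    update_block_motions();
}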
Results of operating the MIMD model on different hardware are presented in Table 1. The table was prepared to compare the effects of hardware for both serial and parallel operation. Platforms are arranged in order of increasing efficiency of parallelization, with the SP2 the highest. The CM5 and SP2 are parallel processors, while the PCs are operated in parallel through networking. Even though these comparisons were obtained with 10,000-element models, the results were employed to choose the optimal system for operation of the 1,000,000-element model [2].

Table 1
Effects of the MIMD version of the algorithm (ANSI C code)

System                    Processor code (Kb)  Test program run time (s)  Speedup over serial version  Parallelization(b)
CM5, 1 node (serial)             4063                   2035                       1                        N/A
CM5, 64 nodes(a)                 4063                     34                      59.85                     93%
DELL PC (PPro 200)                135                    252                       1                        N/A
2 DELL PCs (network)              142                    133                       1.89                     95%
IBM SP2, 1 processor               36                    277                       1                        N/A
IBM SP2, 10 processors             36                     28                       9.89                     99%

(a) CM5, 64 nodes, with a 10410 block/processor model; 10-iteration test normalised to 500 iterations with 10140 blocks (MIMD model).
(b) Parallelization is the measured speedup divided by the number of processors (the parallel efficiency). N/A: not applicable.
The least time was required for operation on a non-shared, non-remote SP2, on which processing was scheduled by the load leveler, which provided exclusive access to each processing node. The networked-PC performance shows that a Windows NT network allows calculation of very large models in a time comparable to that achieved with the SP2. The experiment was conducted with only two computers as a simple means of illustrating this inexpensive approach. NT parallel operation will allow discrete element model research without access to supercomputer centers, at computational speeds that are slower than, but comparable with, those of multiprocessor supercomputers.

The serial code optimized for MIMD parallel processing allows operation of 1,200,000-block models for 2000 time steps in less than 5 h on a 16-processor SP2 (180 MHz processor clock speed). The calculation can be stopped at any time and saved to disk for a later restart; this procedure requires storage of more than 30 Mb of data per processor. An SP2 with only 128 Mb of RAM per node could operate larger models (up to 6,000,000 blocks for a 16-node SP2).

5. Conclusions

This paper describes the acceleration of a discrete element code with explicit time integration through MIMD parallel operation. After modification, the code is capable of analyzing 2,000,000 regular elements under time-varying loading. This study showed that:

1. Relatively speaking, MIMD operation requires less effort than SIMD and provides the best overall parallelization for a small number of processors.
2. Pre- and post-processing are of significant importance and require special attention.
3. Binary input/output is recommended for all inter-program communication because it offers higher speed and smaller file size.
4. Single-processor computers operating in parallel via local networks developed speeds slower than, but comparable to, those of SP2 machines.
Acknowledgements

We gratefully acknowledge the support of the National Science Foundation under Grant MSS 92-16274 and the encouragement of Dr. Priscilla Nelson of the Geomechanical, Geotechnical & Geo-environmental Systems (G3S) Program. We are grateful to Professor Prithviraj Banerjee for providing access to the facilities of the Northwestern University Center for Parallel and Distributed Computing, also supported by the National Science Foundation under Grant CDA-9703228.

References

[1] Dmytryshyn O, Dowding C, Belytschko T. Very large discrete element assessment of 3D cavern response to dynamic excitation (Vol. 1) and Optimization of a discrete element model for parallel processing (Vol. 2). Department of Civil Engineering Internal Reports, Northwestern University, Evanston, IL, 1998. Available at http://www.tam.nwu.edu/compmech.html.
[2] Dowding C. Movie of million block model of cavern response to high frequency excitation. Electronic Journal of Geotechnical Engineering 1998;3. http://geotech.civen.okstate.edu/ejge/index.htm.
[3] Belytschko T, Plesha M, Dowding C. A computer method for the stability analysis of caverns in jointed rock. Int J Num Anal Meth Geomech 1984:473–492.
[4] Gilbert C. Development of a three dimensional small displacement rigid block model for dynamic analysis. Ph.D. dissertation, Department of Civil Engineering, Northwestern University, Evanston, IL, 1988.
[5] Banerjee P, Chandy J, Gupta M, Holm JG, Lain A, Palermo DJ, Ramaswamy S, Su E. Overview of the PARADIGM compiler for distributed memory message-passing multicomputers. IEEE Computer 1995:37–47.
[6] Foster I. Designing and building parallel programs: concepts and tools for parallel software engineering. Reading, MA: Addison-Wesley, 1995.