Computing Systems in Engineering, Vol. 5, No. 4-6, pp. 389-405, 1994
Pergamon 0956-0521(94)00011-5
Elsevier Science Ltd. Printed in Great Britain.
POSTBUCKLING AND LARGE-DEFLECTION NONLINEAR ANALYSES ON DISTRIBUTED-MEMORY COMPUTERS

BRIAN C. WATSON and AHMED K. NOOR

Center for Computational Structures Technology, University of Virginia, NASA Langley Research Center, Hampton, Virginia, U.S.A.
Abstract--A computational strategy is presented for postbuckling and nonlinear static analyses of large complex structures on distributed-memory parallel computers. The strategy is designed for message-passing parallel computer systems. The key elements of the proposed strategy are: (a) a multiple-parameter reduced basis technique, (b) a nested dissection (or multilevel substructuring) ordering scheme, (c) parallel assembly of global matrices, and (d) a parallel sparse equation solver. The effectiveness of the strategy is assessed by performing thermomechanical postbuckling analyses of stiffened composite panels with cutouts, and nonlinear large-deflection analyses of High Speed Civil Transport models on three distributed-memory computers. The numerical studies presented demonstrate the advantages of nested dissection-based solvers over traditional skyline-based solvers on distributed-memory machines.
INTRODUCTION
Distributed-memory parallel computers, such as the Thinking Machines Corporation CM-5, Intel Paragon, CRAY T3D, and IBM SP2, have the potential to provide the increased speed and capacity required to perform large-scale postbuckling and nonlinear analyses of complex structures. This potential can only be realized if the computational strategies used in such analyses take advantage of the unique characteristics of these computers. In recent years, intense efforts have been devoted to the development of parallel computational strategies and numerical algorithms for large-scale finite element computations (see, for example, Refs 1-4). Much attention has been focused on implementing efficient linear equation solvers on distributed-memory computers. This has led to the development of a number of direct and iterative numerical algorithms for the solution of large sparse linear systems of equations (see Refs 5-11). Most parallel strategies are related to the "divide and conquer" paradigm based on breaking a large problem into a number of smaller subproblems, which may be solved separately on individual processors. The degree of independence of the subproblems is a measure of the effectiveness of the algorithm, since it determines the amount and frequency of communication and synchronization. The numerical algorithms developed for structural analysis can be classified into three major categories, namely: element-wise algorithms; node-wise algorithms; and domain-wise algorithms. The element-wise parallel algorithms include element-by-element equation solvers and parallel frontal equation solvers. The node-wise parallel equation solvers include node-by-node iterative solvers as well as column-oriented direct solvers. The domain-wise algorithms include nested dissection-based (substructuring) techniques and domain decomposition methods. The first two categories of numerical algorithms allow only small granularity of the parallel tasks, and require frequent communication among the processors. By contrast, the third category allows a larger granularity, which can result in improved performance for the algorithm. Nested dissection ordering schemes have been found to be effective in reducing both the storage requirements and the total computational effort required for direct factorization (see Ref. 12). The performance of nested dissection-based linear solvers on distributed-memory parallel computers depends on balancing the computational load across processors in a way that minimizes interprocessor communication. Several nested dissection ordering schemes have been developed which differ in the strategies used in partitioning the structure and selecting the separators. Among the proposed partitioning strategies are: recursive bisection strategies (Refs 13 and 14), e.g., spectral graph bisection, recursive coordinate bisection, and recursive graph bisection; combinatorial and design-optimization-based strategies (Ref. 15), e.g., the simulated annealing algorithm, the genetic algorithm, and neural-network-based techniques; and heuristic strategies (Ref. 16), e.g., methods based on geometric projections and mappings, and algorithms based on embedding the problem in Euclidean space. For highly irregular and/or three-dimensional structures the effectiveness of nested dissection-based schemes may be reduced. However, this is also true for most other parallel numerical algorithms.
The overall goal of the present study is to develop fully-parallel, scalable computational strategies for large-scale nonlinear structural problems on massively parallel computers. Specifically, the objectives of the present paper are to: (a) present an effective computational strategy for the postbuckling and large-deflection nonlinear analysis of large structures on distributed-memory computers; and (b) assess the performance of the proposed strategy, and demonstrate potential benefits resulting from its use on modern distributed-memory computers. The key elements of the proposed strategy are: (a) a multiple-parameter reduced basis technique; (b) a nested dissection (or multilevel substructuring) ordering scheme; (c) parallel assembly of global matrices; and (d) a parallel sparse equation solver. An assessment is made of the effectiveness of the strategy for thermomechanical postbuckling analysis of stiffened composite panels and nonlinear static analysis of a High Speed Civil Transport (HSCT) model on the Intel Paragon, the CRAY T3D, and the IBM SP2.

MATHEMATICAL FORMULATION
Governing finite element equations

The analytical formulation is based on a moderate-rotation, geometrically nonlinear, shallow shell theory, with the effects of transverse shear deformation and anisotropic material behavior included. A mixed formulation is used, in which the fundamental unknowns consist of the internal forces (or stress resultants) and the generalized displacements. A total Lagrangian description of the deformation of the structure is used, in which the deformed configurations of the structure are referred to the initial coordinate system of the undeformed structure. The stress resultants are allowed to be discontinuous at the interelement boundaries. The external loading consists of a combination of transverse loading, p, temperature change, T, and in-plane edge shear loading, N12. The governing finite element equations describing the large-deflection and postbuckling responses of the structure can be written in the following compact form:

{f(Z)} = [K]{Z} + {G(Z)} - q1{Q1} - q2{Q2} = {0}   (1)

where {Z} is the response vector, which includes both nodal displacements and stress-resultant parameters; [K] is the global linear structure matrix, which includes the flexibility and the linear strain-displacement matrices; {G(Z)} is the vector of nonlinear terms; q1 and q2 are thermal and mechanical load parameters; {Q1} is the vector of normalized thermal strains; and {Q2} is the vector of normalized mechanical loads. The forms of the arrays [K], {G(Z)}, {Q1} and {Q2} are given in Ref. 17.
Note that because the stress resultants are allowed to be discontinuous at interelement boundaries, the stress-resultant parameters can be eliminated from the system of equations at the element level. The resulting equations for the structure can be expressed as in Eq. (1), with {Z} representing the vector of nodal displacements, and [K] representing the global linear stiffness matrix (see Ref. 17). The standard approach for the solution of Eq. (1) is to fix the value of one of the two parameters, q1 and q2, and to vary the other, or to express q1 and q2 as functions of a single parameter, q. In either case, the solution corresponding to the chosen load path (which is effectively dependent on a single parameter) constitutes a curve on the equilibrium surface of the structure.
Prebuckling state

The prebuckling responses (generalized displacements and stress resultants) associated with the thermal strain and mechanical loading are given by the following set of linear equations with two right-hand sides (one corresponding to q1 = 1, q2 = 0, and the other to q1 = 0, q2 = 1):

[K][{Z1} {Z2}] = [{Q1} {Q2}]   (2)

where {Z1} and {Z2} refer to the response vectors associated with the thermal loading and mechanical loading, respectively.
Determination of stability boundary

For certain combinations of the two parameters, q1 and q2, an instability (or bifurcation) occurs. The set of critical (or bifurcation) points in the q1-q2 space constitutes the stability boundary, which separates regions of stability and instability. If prebuckling deformations are neglected, the equations that determine the stability boundary for the structure can be cast into the form of a linear algebraic eigenvalue problem as follows:

([K] + q1[K1] + q2[K2]){Z} = {0}   (3)

where (q1, q2) represents a critical combination of the load parameters; {Z} is the associated eigenvector; and [K1] and [K2] are the geometric stiffness matrices. The explicit forms of [K1] and [K2] are given in Ref. 18.
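Equation (3) is a two-parameter eigenvalue problem; fixing one parameter turns it into a standard generalized eigenproblem. The following is a minimal dense-matrix sketch of that recasting, assuming small symmetric NumPy arrays K, K1, K2 (the paper works with the reduced arrays of Appendix A and with large sparse matrices, so this is illustrative only):

```python
import numpy as np
from scipy.linalg import eig

def critical_q2(K, K1, K2, q1):
    """Critical value of q2 for a fixed q1, from Eq. (3) recast as the
    generalized eigenproblem (K + q1*K1) z = -q2 * K2 z."""
    vals, _ = eig(K + q1 * K1, -K2)
    real = vals[np.abs(vals.imag) < 1e-9].real
    return real[real > 0].min()   # lowest positive critical load
```

Sweeping q1 over a range of values and computing the corresponding critical q2 traces the stability boundary point by point.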
Postbifurcation and large-deflection equilibrium configurations

Large-deflection and postbifurcation equilibrium configurations corresponding to specified values of the parameters q1 and q2 are found by solving the nonlinear system of algebraic equations, Eq. (1), using an incremental-iterative technique such as the Newton-Raphson method. The recursion formula
for the rth iteration can be written in the following form:

([K] + [∂Gi/∂Zj]|{Z}={Z}^r) {ΔZ}^r = -{f({Z}^r)}   (4)

{Z}^(r+1) = {Z}^r + {ΔZ}^r   (5)

where {ΔZ}^r is the change in the response vector during the rth iteration cycle, and the range of i and j is from 1 to the total number of free nodal displacements and stress-resultant parameters in the model.
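A minimal sketch of this incremental-iterative loop, in dense NumPy form, with G and dG_dZ standing in for the nonlinear vector of Eq. (1) and its Jacobian (both are assumed callables here; the paper assembles them from elemental arrays):

```python
import numpy as np

def newton_raphson(K, G, dG_dZ, q1, q2, Q1, Q2, Z0, tol=1e-8, max_iter=20):
    """Solve Eq. (1) for fixed (q1, q2) by the recursion of Eqs (4)-(5)."""
    Z = Z0.copy()
    for _ in range(max_iter):
        f = K @ Z + G(Z) - q1 * Q1 - q2 * Q2     # residual {f(Z)} of Eq. (1)
        if np.linalg.norm(f) < tol:
            break
        dZ = np.linalg.solve(K + dG_dZ(Z), -f)   # Eq. (4)
        Z = Z + dZ                               # Eq. (5)
    return Z
```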
OVERVIEW OF THE MULTIPLE-PARAMETER REDUCED BASIS TECHNIQUE
Multiple-parameter reduction methods, which substantially reduce the number of degrees of freedom used from that of the initial discretization, have been developed to reduce the cost of generating the stability boundary, postbifurcation equilibrium configurations, and sensitivity coefficients (see Refs 18-20). The methods are based on successive applications of the finite element method and the classical Rayleigh-Ritz technique. The finite element method is used to generate a few global approximation vectors (or modes) for approximating the linear eigenvalue problem, Eq. (3), and the nonlinear equations, Eq. (1). The Rayleigh-Ritz technique is then used to generate the reduced sets of equations in the amplitudes of these modes. An effective set of modes for approximating both the linear eigenvalue problem and the nonlinear equations was found to be the path derivatives of the response quantities with respect to the parameters q1 and q2. The equations used in evaluating the path derivatives are obtained by successive differentiation of the original nonlinear equations, Eq. (1), with respect to q1 and q2. The left-hand-side matrix in these equations is the same as that of Eq. (4). The explicit form of the reduced equations is given in Appendix A. The details of applying reduction methods to the generation of the stability boundary and postbifurcation equilibrium paths are given in Ref. 18. The application of the reduction method to postbuckling analysis involves the following basic steps:

(a) Generate the initial basis vectors, at q1 = q2 = 0, and form the reduced linear stiffness matrix and the geometric stiffness matrices.

(b) Determine the stability boundary using the reduced eigenvalue problem.

(c) Generate a nonlinear solution in the vicinity of the stability boundary. This step is accomplished by selecting a suitable value for a typical displacement component, using the associated eigenmode as a predictor, and obtaining a corrected solution with the Newton-Raphson method.
(d) Evaluate the path derivatives and form the reduced system of equations which approximates Eq. (1).

(e) Generate approximate solutions associated with different combinations of the parameters q1 and q2 using an incremental-iterative approach, in conjunction with the reduced equations.

(f) Determine the error resulting from the use of the approximate reduced equations. Whenever the error exceeds a prescribed tolerance, an iterative procedure is used to generate an improved solution, a new (updated) set of basis vectors (path derivatives), and a new set of reduced equations.

COMPUTATIONAL PROCEDURE
Phases of the computational procedure

The computational procedure is outlined in Table 1, and a corresponding flowchart is given in Fig. 1. The proposed procedure can be divided into five distinct phases, namely: the preprocessing and input phase; the reduction phase; the solution phase; the error sensing and control phase; and the output and postprocessing phase. The first phase consists of generating the discretized structure data, ordering the finite element nodes on a workstation, and transferring the problem data to the parallel machine. In the second phase, the basis vectors and the reduced system matrices are computed. This involves generating and assembling the elemental arrays, factoring the global structure matrix, evaluating right-hand-side vectors, and performing forward/back substitutions. An initial set of basis vectors, associated with zero values of the load parameters, is generated. For postbuckling problems, another set of basis vectors and reduced equations is generated in the vicinity of the stability boundary. Updated sets of basis vectors and reduced systems are generated whenever the error, computed in the error sensing and control phase, exceeds a prescribed tolerance. The solution phase consists of determining the stability boundary and/or the nonlinear equilibrium configurations associated with different combinations of the load parameters. The reduced system of equations is used in this phase. In the error sensing and control phase, the norm of the residual vector of the full system of equations is computed and used as a measure of the error due to basis reduction. Whenever the error exceeds a prescribed tolerance, a corrected (improved) solution of the full system of equations is generated using the Newton-Raphson method, and the reduction phase is repeated. The frequency of error sensing is related to changes in a single parameter that characterizes the nonlinear response (see Ref. 18). The final phase, output and postprocessing, consists of collecting the calculated response quantities (e.g., solutions of reduced equations, displacements, stress resultants, and the stability boundary) from the different processors, and calculating other response quantities, such as transverse stresses and strain energy.
Table 1. Outline of computational procedure
Preprocessing phase
1. Generate discrete structure data for the problem.
2. Use the incomplete nested dissection scheme to partition the structure.
3. Transfer problem data to the parallel machine.

Reduction phase
4. Generate elemental arrays, eliminate stress-resultant parameters, and assemble the global stiffness matrix.
5. Factor the global stiffness matrix.
6. Compute right-hand-side vectors and solve for the basis vectors (through forward/back substitution), orthonormalize the basis vectors, and form the reduced system arrays.

Solution phase
For postbuckling problems:
7. Use the reduced system of equations to determine the stability boundary (by solving sets of eigenvalue problems for different combinations of the two parameters q1 and q2).
8. Select a point on the stability boundary (corresponding to a pair of values of q1 and q2) and find a buckled equilibrium configuration in the vicinity of that point, using the Newton-Raphson method with the full system of equations.
9. Evaluate a new set of basis vectors and reduced equations by repeating the reduction phase (steps 4-6).
10a. Use the reduced system of equations to generate nonlinear postbuckled equilibrium configurations associated with different combinations of q1 and q2.

For large-deflection problems:
10b. Use the reduced system of equations to generate nonlinear equilibrium configurations associated with different combinations of q1 and q2 (as in step 10a).

Error sensing and control phase
11. The reduced unknowns are used to generate an approximation for the response vector {Z}.
12. The norm of the residual vector of the full system of equations is evaluated and used as a measure of the error due to reduction. If it exceeds a prescribed tolerance, the approximate {Z} is used as a predictor and the Newton-Raphson method is applied to obtain a corrected (improved) solution. A new set of basis vectors and reduced equations is generated by repeating the reduction phase, and additional solutions are calculated by returning to step 10 of the solution phase.

Output and postprocessing phase
13. Collect and output the response quantities, which are distributed among processors.
14. Calculate transverse stresses and strain energy in the structure.
Numerical algorithms

The most time-consuming steps of the aforementioned computational procedure are those associated with operating on the original, full system of equations, namely, evaluation of the basis vectors and generation of an initial (or improved) nonlinear solution. The numerical calculations in these steps involve assembling and factoring the global structure matrix, and performing forward and back substitutions. Efficient parallelization of each of these steps was attempted, as described subsequently.
Element generation and assembly of global matrices. The element generation and assembly of the global structure matrix is parallelized using the node-by-node algorithm described in Ref. 21. The node-by-node algorithm is well suited to distributed-memory processing because it avoids synchronization and communication. Each processor stores a set of columns of the lower triangle (and main diagonal) of the global structure matrix. A processor loops over the elements and generates only the elements that contribute to the columns which it stores. It then assembles the blocks of the elemental arrays that it needs. Although there are some redundant computations performed by this algorithm, no communication is required, and the algorithm achieves high parallel efficiency.

Factorization of the global structure matrix. The global structure matrix is factored using the LDL^T algorithm. High parallel efficiency for this algorithm can be achieved by exploiting the sparsity of the matrix associated with a proper ordering of the finite element nodes (see Ref. 22). The nodes are ordered by using an incomplete nested dissection ordering scheme. The finite element model is first partitioned into two parts (substructures) plus interface nodes. The interface nodes provide the only connection between the two parts. The nodes are renumbered with the interface nodes numbered last. The partitioning process continues by recursively taking each part, partitioning it, and renumbering the nodes. A complete ordering continues until the final partitions contain the nodes of a single element. For the incomplete ordering used here, the process stops once the final partitions contain a small number of nodes (based on a user-defined criterion); a toy sketch of this recursive ordering is given below. The benefit of nested dissection ordering schemes becomes evident when one examines the elimination tree produced by such an ordering.
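The sketch below implements a coordinate-based variant of the incomplete ordering just described. It is illustrative only: the nodes, coords, and min_size names are assumptions, and production orderings work on the element adjacency graph rather than on raw coordinates.

```python
import numpy as np

def nested_dissection(nodes, coords, min_size=8):
    """Toy coordinate-based incomplete nested dissection: split the node
    set at the median of its longest coordinate axis, order the two halves
    recursively, and number the interface (separator) nodes last. Stops
    once partitions fall below min_size, as in the incomplete ordering."""
    nodes = np.asarray(nodes)
    if len(nodes) <= min_size:
        return list(nodes)
    axis = np.ptp(coords[nodes], axis=0).argmax()    # longest extent
    x = coords[nodes, axis]
    cut = np.median(x)
    h = 1e-9 + 0.01 * np.ptp(x)                      # thin separator band
    sep = nodes[np.abs(x - cut) <= h]
    left, right = nodes[x < cut - h], nodes[x > cut + h]
    return (nested_dissection(left, coords, min_size)
            + nested_dissection(right, coords, min_size)
            + list(sep))                             # interface numbered last
```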
[Fig. 1. Flowchart of the computational procedure, showing the preprocessing and input, reduction, solution, error sensing and control, and output and postprocessing phases.]
The elimination tree describes the dependencies within the factorization algorithm. That is, it shows which column blocks (of the lower triangular portion of the matrix) must be factored before factorization of any particular column block may begin (see Fig. 2). A column block may not be factored until all nodes below it in the elimination tree have been factored. This implies that all of the column blocks represented by the bottom-most nodes of the tree have no dependencies (e.g., blocks 1, 2, 4, and 5 of Fig. 2a), and can be factored immediately and concurrently. This type of parallelism may be classified as large grain. For comparison, a typical elimination tree produced by an ordering scheme based on bandwidth minimization is given in Fig. 2b. This elimination tree shows that there is no large grain parallelism available with this ordering. The nested dissection ordering usually has the added benefit of reducing the fill-in of the factored matrix. Reducing fill-in not only reduces the storage requirements, but also reduces the total number of floating-point operations required in the factorization (see Ref. 12). Nodes of the elimination tree correspond to column blocks of the lower triangular portion of the global structure matrix.
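The dependency structure can be turned into a concurrent schedule directly from the tree. A small sketch follows; the parent-array representation and the example wiring are assumptions consistent with the tree of Fig. 2a (blocks 1 and 2 feed block 3, blocks 4 and 5 feed block 6, and blocks 3 and 6 feed block 7):

```python
def factor_rounds(parent):
    """Group elimination-tree nodes (column blocks) into rounds that can be
    factored concurrently: a block becomes ready one round after its last
    child. parent[v] is the parent of block v, or -1 at the root. Assumes
    children carry smaller indices than their parents, as a nested
    dissection ordering (separators numbered last) naturally produces."""
    round_of = [0] * len(parent)
    for v, p in enumerate(parent):
        if p >= 0:
            round_of[p] = max(round_of[p], round_of[v] + 1)
    rounds = {}
    for v, r in enumerate(round_of):
        rounds.setdefault(r, []).append(v)
    return [rounds[r] for r in sorted(rounds)]

# Tree of Fig. 2a with 0-based blocks: 0,1 -> 2; 3,4 -> 5; 2,5 -> 6.
# factor_rounds([2, 2, 6, 5, 5, 6, -1]) -> [[0, 1, 3, 4], [2, 5], [6]]
```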
[Fig. 2. Elimination trees and matrix structures produced by (a) an incomplete nested dissection ordering and (b) a bandwidth minimizing ordering.]
These column blocks are distributed among the processors. The communication pattern for the factorization follows the branches of the elimination tree. Each processor factors its own column block, waiting only to receive the information it needs from processors storing the column blocks below it in the tree, and then sending information to processors storing blocks above it in the tree.

Forward/back substitution. The forward substitution follows the same pattern as the factorization. The computation begins at the bottom-most nodes in the elimination tree and proceeds up to the top. The back substitution follows the opposite pattern. The computation begins at the top and proceeds down the elimination tree to the bottom. Thus, with the nested dissection ordering, the later stages of the back substitution have the same large grain parallelism as the earlier stages of the factorization and forward substitution.
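In terms of the factor_rounds() sketch above, the two triangular solves simply sweep the same rounds in opposite directions. The three callables below are placeholders for the per-block numerical kernels and their message traffic, not routines from the paper:

```python
def triangular_solves(rounds, forward_block, scale_block, backward_block):
    """Solve pattern along the elimination tree: forward substitution
    sweeps leaves-to-root, diagonal scaling is fully parallel, and back
    substitution sweeps root-to-leaves."""
    for level in rounds:               # bottom of the tree first
        for block in level:
            forward_block(block)
    for level in rounds:               # independent diagonal scaling
        for block in level:
            scale_block(block)
    for level in reversed(rounds):     # top of the tree first
        for block in level:
            backward_block(block)
```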
IMPLEMENTATION ON DISTRIBUTED-MEMORY COMPUTERS

The aforementioned computational procedure, based on the nested dissection partitioning scheme, has been implemented on three distributed-memory computer systems, namely: the Intel Paragon XP/S computers at NASA Langley Research Center and Sandia National Laboratories; the CRAY T3D computer at the Jet Propulsion Laboratory; and the IBM SP2 computer system at the Maui High Performance Computing Center. The major characteristics of these computer systems are described in Appendix B. The program was coded in Fortran. The Paragon at NASA
Langley used Intel's OSF/1 operating system, while the Paragon at Sandia used the SUNMOS operating system to obtain improved communication rates. Each of the three computers offers message-passing facilities for communication between processors. On the Paragon, the NX message functions csend() and crecv() were used (note: the functionality of these routines is also provided by SUNMOS). On the T3D, a lightweight message-passing library was written using the SHMEM system library routines. These system routines provide for simulated shared memory, through which processors may directly access the local memory of other processors, and provide the applications programmer with the highest communication rates available on the T3D. On the SP2, IBM's MPL library routines mp_bsend() and mp_brecv() were used. A global operations library was written to provide reduction and synchronization routines for each of the machines. Each phase of the computational procedure offers opportunities for parallelism at the node level and/or the element level. Additionally, the solution phase offers opportunities at the task level (e.g., solving the reduced system for different combinations of the load parameters). The level of parallelism exploited dictates many details of the implementation, such as data distribution and communication strategy.
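A recursive-doubling global sum is representative of what such a portable global-operations library provides. In this sketch, send and recv are placeholders for the machine-specific primitives named above (csend/crecv, the SHMEM wrappers, or mp_bsend/mp_brecv), and the processor count is assumed to be a power of two:

```python
def global_sum(value, rank, size, send, recv):
    """Recursive-doubling reduction: after log2(size) exchange steps,
    every processor holds the sum of all processors' values. Works for
    scalars or NumPy arrays, since only '+' is required."""
    mask = 1
    while mask < size:
        partner = rank ^ mask          # pairwise exchange partner
        send(partner, value)
        value = value + recv(partner)  # accumulate partner's contribution
        mask <<= 1
    return value
```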
Preprocessing and input phase

In the preprocessing phase, the nodes of the finite element model are ordered with a coordinate-based nested dissection algorithm, the elimination tree is determined, and the nodes of the tree are mapped to the processors. This step is implemented on a Sun SPARCstation as a serial algorithm. For the problems considered in this study, the computational time expended in this step was not high. The preprocessing is performed only once, and this expense is amortized over the many factorizations and back solves necessary for the total solution process. However, for much larger problems, the preprocessing time can become unacceptably high. The use of a parallel algorithm for structure partitioning and node ordering is described in Ref. 23. The current implementation of the input phase uses a host-node model. One processor acts as a host and performs all of the data input. It then uses a broadcast send to communicate this data to the other processors. All processors maintain a copy of the input data to eliminate the need for communication during element generation. Distributed-memory computers typically have several processors devoted to I/O, while the remaining processors are dedicated to computation. A notable exception is the IBM SP2, on which each processor performs both I/O and computation. Therefore, parallelization of the input phase is conceptually possible. Several processors may be selected to read the input data (or a portion thereof) in parallel, and then distribute it to the remaining processors. For the problems considered in the present study, the total time for data input was a small percentage of the total problem time. Therefore, no effort was made to parallelize this phase.
Reduction phase

The second computational phase consists of forming the basis vectors and reduced system arrays. This phase, together with the full-system solution step of the error sensing and control phase, consumes the bulk of the total computation time for a given problem. Thus, efficient parallelization of this phase is very important. The current implementation follows a row-oriented column block sparse storage scheme similar to that described in Ref. 5. In this scheme, within each column block, matrix elements are stored by row. To minimize the communication overhead, the column blocks are not shared by several processors. Each column block belongs to a single processor, which is responsible for both storing and factoring it. With this approach, the communication pattern follows exactly the branches of the elimination tree. Each processor factors the principal triangle of its column block, first getting dot products from its descendant blocks with the receive communication function, if necessary. It then computes the dot products of the remaining rows in the block and stores them in a buffer. It uses the send communication function to transmit these dot products to its ancestor blocks. The forward substitution follows the same pattern as the factorization.
Processors storing the bottom-most nodes of the elimination tree solve their systems and communicate the results to their ancestor blocks using sends/receives. This continues up the tree to the top. Processors then apply the diagonal scaling to their own blocks concurrently. The back substitution follows the opposite pattern. The computation begins at the top node of the elimination tree, and information is passed to child blocks with sends/receives. This continues down the elimination tree to the bottom. Because each right-hand-side vector is a function of the element matrices and the previously determined basis vectors, formation of these vectors is parallelized at the element level. The elements are distributed among the processors in a wrap-map fashion. Each processor loops over its own elements and assembles the contributions in local storage for the right-hand-side vector. Then, a global sum operation is used to add all of the processors' contributions into the final right-hand-side vector. Note that this global sum operation gives all of the processors a copy of the updated vector. The procedure used to form the reduced system matrices is similar to that used for forming the right-hand-side vectors. Each processor assembles the contributions from its own elements into its own local storage for the reduced system matrices. Then, a global sum operation is used to add all of the processors' copies to produce the final reduced system matrices.
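A sketch of the wrap-mapped, element-level formation of a right-hand-side vector follows. Here element_rhs and allreduce_sum are assumed helpers (the latter being the global sum of the portable library sketched earlier):

```python
import numpy as np

def assemble_rhs(n_dof, elements, element_rhs, rank, size, allreduce_sum):
    """Each processor handles elements rank, rank+size, rank+2*size, ...
    (a wrap map), assembles their contributions locally, and a global sum
    merges the partial vectors so every processor holds the full result.
    element_rhs(e) -> (global dof indices, local vector)."""
    rhs = np.zeros(n_dof)
    for e in range(rank, len(elements), size):   # wrap-map distribution
        dofs, vec = element_rhs(elements[e])
        rhs[dofs] += vec
    return allreduce_sum(rhs)   # all processors get the updated vector
```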
Solution phase

A consequence of using the multiple-parameter reduced basis technique is to significantly reduce the amount of computational work involved in solving the governing equations and generating nonlinear equilibrium configurations. In addition, significantly more equilibrium configurations, associated with a large number of combinations of load parameters, can be generated in the same wall clock time by parallelizing at the task level. The procedure used in determining the stability boundary associated with two load parameters, q1 and q2, is as follows. The critical value of the first load parameter (q1cr) is determined for q2 = 0. Then, the critical value of q2 associated with a prescribed value of q1 = αi q1cr (0 < αi < 1) is determined. The computations involved for each different value of αi are independent, and therefore can be performed concurrently. Similarly, the calculations of different nonlinear equilibrium configurations along different load paths are independent. These tasks are performed in parallel. A very large number of postbuckled (or nonlinear) equilibrium configurations, corresponding to different combinations of q1 and q2, are obtained in a very short time.
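The independence of the αi tasks maps directly onto processors. A sketch, where critical_q2_reduced is an assumed routine that solves the reduced eigenvalue problem for a fixed q1 (in the spirit of the critical_q2 sketch given earlier):

```python
def boundary_points(alphas, q1cr, critical_q2_reduced, rank, size):
    """Task-level parallelism of the solution phase: each processor takes
    its share of the alpha values (q1 = alpha * q1cr, 0 < alpha < 1) and
    computes the matching critical q2 independently of the others."""
    points = []
    for i in range(rank, len(alphas), size):     # independent tasks
        q1 = alphas[i] * q1cr
        points.append((q1, critical_q2_reduced(q1)))
    return points    # later gathered across processors for output
```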
Error sensing and control phase

When the residual exceeds a prescribed tolerance, a corrected solution is generated using the
Newton-Raphson method on the full system of equations. This task involves repeated assembly and factorization of the global structure matrix, as well as forward/back substitutions. These calculations are parallelized in a manner identical to that used for the formation of the reduced basis vectors.
Output and postprocessing phase

The result of a reduced basis analysis is a set of amplitudes of basis vectors (reduced unknowns). The strain energy may be computed directly using these amplitudes and the reduced system matrices. The nodal displacement vector is generated as a linear combination of the reduced basis vectors using their computed amplitudes. Storage for the reduced basis vectors is distributed among the processors on an element-by-element basis. Therefore, each processor performs a portion of the product, and a global sum is performed to collect the entire displacement vector on a single processor, so that it may be written to the output file.
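A minimal sketch of this recovery step, assuming Gamma_local is this processor's share of the basis-vector matrix (with zeros for entries held elsewhere) and allreduce_sum is the global sum sketched earlier:

```python
import numpy as np

def recover_displacements(Gamma_local, psi, allreduce_sum):
    """The displacement vector is the product of the distributed
    basis-vector matrix [Gamma] and the reduced unknowns {psi}: each
    processor forms its partial product, and a global sum assembles the
    complete vector for output."""
    partial = Gamma_local @ psi
    return allreduce_sum(partial)    # full {Z}, ready to be written out
```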
NUMERICAL STUDIES AND PERFORMANCE EVALUATION OF THE PROPOSED STRATEGY

To assess the effectiveness of the proposed strategy and the foregoing computational procedure, a number of postbuckling and large-deflection nonlinear structural problems have been solved using this strategy on the Intel Paragon XP/S, the CRAY T3D, and the IBM SP2. For each problem, the computational time required by the procedure based on the nested dissection ordering scheme was compared to that of an analysis based on a minimum bandwidth ordering scheme and an efficient parallel skyline equation solver similar to that of Ref. 6. For the Paragons, both the SUNMOS and the OSF/1 operating systems were used; however, the performance using SUNMOS was always better than that obtained with OSF/1. Therefore, the Paragon results presented herein are only for the SUNMOS operating system. Results are presented for two postbuckling problem sets and one large-deflection nonlinear problem set. The two postbuckling problem sets are: a square composite panel, both with and without transverse stiffeners (Fig. 3a); and two discretizations of a rectangular composite panel, each with and without transverse stiffeners (Fig. 3b). The large-deflection problem set consisted of two discretizations of a Mach 3.0 High Speed Civil Transport (HSCT) model (Fig. 3c). Note that, using symmetry, only the left half of the HSCT was analyzed. The loading for the postbuckling problems consisted of combined in-plane edge shear and uniform temperature change. The loading for the large-deflection problems consisted of a concentrated transverse load combined with a uniform temperature change. Table 2 lists problem size data for the finite element models used.

[Fig. 3. Characteristics of the finite element models used in the present study: (a) square panel; (b) rectangular panel; (c) HSCT model.]
Table 2. Problem size data for finite element models used in this study

Description                           Number of   Number of   Number of displacement   Number of non-zero coefficients in upper
                                      elements    nodes       degrees of freedom       triangular portion of stiffness matrix

Square panel         Unstiffened      320         1372        8232                     371,148
                     Stiffened        354         1516        9096                     410,460

Rectangular panel
  Coarse mesh        Unstiffened      1252        5200        31,200                   1,440,528
                     Stiffened        1320        5480        32,880                   1,518,600
  Fine mesh          Unstiffened      2304        9476        56,856                   2,644,500
                     Stiffened        2394        9844        59,064                   2,747,652

HSCT (left half only)
                     Coarse mesh      521         1807        10,842                   556,629
                     Fine mesh        2051        7582        45,492                   2,262,750
[Fig. 4. Partitions generated by the nested dissection algorithm and corresponding elimination tree for the 32880 DOF composite panel model.]
Figure 4 illustrates the partitions created by the nested dissection algorithm used in the present study, for the composite panel model with 32880 DOF. The colors of the different parts of the model correspond to the different levels in the elimination tree. In Fig. 4, each node of the elimination tree represents a group of finite element nodes in the model. The size of each node in the elimination tree is indicative of the number of corresponding finite element nodes.
Postbuckling problems

Figures 5-8 show various wall clock times for the postbuckling analysis of the composite panels. Each figure has results for both the nested dissection-based strategy and the skyline-based strategy. Figure 5 shows the wall clock times expended in the determination of the stability boundary using the nested dissection-based equation solver and the skyline-based equation solver for each of the distributed-memory machines. Each finite element model, identified by its number of degrees of freedom, was analyzed using increasing numbers of processors. Figure 5 gives an indication of the scalability of the computational procedure (i.e., the effects of increasing the number of processors and the problem size on the wall clock time). An examination of Fig. 5 shows that the time taken by the nested dissection-based strategy grows at a much slower rate than that of the skyline-based strategy as the problem size and number of processors increase. Thus, the nested dissection-based
solver offers better scalability than the skyline-based solver. Figure 6 shows the wall clock times expended to calculate the postbuckled equilibrium configurations for the six composite panel models. The trends depicted in Fig. 6 are similar to those shown in Fig. 5, which indicates that the wall clock times expended in the reduced-system eigenvalue analysis are comparable to those expended in the solution of the nonlinear reduced equations. Figure 7 shows the wall clock times for each of the three major phases of the postbuckling analysis for one of the composite panels (the 31200 DOF model). The results are representative of all the composite models considered in the present study. The three major phases are: determination of the stability boundary; generation of a postbuckled equilibrium configuration near the stability boundary; and tracing postbuckled equilibrium paths. The effects of increasing the number of processors on the wall clock times are also shown in Fig. 7. For the skyline-based scheme, the back substitution step has very little parallelism. Thus, when additional processors are used, communication overhead causes the time for this operation to increase. When there are many forward/back substitutions per factorization, this effect can outweigh the other benefits of using additional processors (see Fig. 7b).
[Fig. 5. Measured wall clock times to determine the stability boundary, for increasing problem sizes and numbers of processors, on (a) the Intel Paragon (SUNMOS), (b) the Cray T3D, and (c) the IBM SP2.]

[Fig. 6. Measured wall clock times to compute the postbuckling response on (a) the Intel Paragon (SUNMOS), (b) the Cray T3D, and (c) the IBM SP2.]

[Fig. 7. Measured wall clock times for postbuckling analysis of the composite panel with 31200 DOF using (a) the nested dissection scheme, or (b) the skyline scheme.]
[Fig. 8. Measured wall clock times for the different steps involved in tracing postbuckled solution paths for the composite panel with 31200 DOF using (a) the nested dissection scheme, or (b) the skyline scheme.]

Figure 8 shows the wall clock times expended in the different steps required for generating postbuckled equilibrium configurations for one of the composite panels (the 31200 DOF model) using each of the distributed-memory machines. As can be seen from Fig. 8, the relative times expended in the different modules are vastly different for the nested dissection-based and the skyline-based solvers. This is primarily due to the fact that the forward/back substitution is much faster in the nested dissection-based solver.
[Fig. 9. The total strain energy, as a function of T and N12, in the postbuckled range for the stiffened composite panel model with 32880 DOF.]
As a consequence of this improved performance, the time expended in the output phase of the nested dissection-based strategy is of the same order as that expended in the factorization. Figure 8b shows how the forward/back substitution time can increase as additional processors are used. Figure 9 shows an example of the detailed postbuckling response results possible with the proposed strategy. The surface plot shown in Fig. 9 depicts the variation of the total strain energy of the 32880 DOF composite panel model with two load parameters (in-plane edge shear loading and uniform temperature change). A reduced system was generated in the vicinity of the stability boundary and used by each processor to evaluate the total strain energy for different combinations of load parameters. This allowed a high-resolution surface to be computed in a short time.
Large-deflection nonlinear problem

Figure 10 shows the wall clock times for matrix assembly and factorization, and for one forward/back substitution, for the two discrete HSCT models. Because of the complex geometry of these models, it is more difficult to find a favorable nested dissection ordering than it was for the composite panels. However, Fig. 10 shows that the factorization in the nested dissection-based solver is still always faster than the factorization in the skyline-based solver. The figure also shows that the forward/back substitution is much faster in the nested dissection-based solver than in the skyline-based solver. This is particularly important when using the foregoing multiple-parameter reduced basis technique, which requires many forward/back solves per factorization.

[Fig. 10. Wall clock times for analysis of the HSCT models: element generation, assembly, and factorization; and one forward/back substitution (nested dissection versus skyline).]

COMMENTS ON THE COMPUTATIONAL PROCEDURE

The following comments can be made concerning the potential of the foregoing computational procedure for solving large-scale nonlinear structural analysis problems on distributed-memory parallel computers:
1. For all of the distributed-memory computers considered, the use of a nested dissection scheme resulted in improved performance of the solution process over that obtained with efficient skyline-based solvers. The improvement is more pronounced in the forward/back substitution phase than in the factorization phase, and increases as the problem size increases. The improvement is particularly important in computational strategies requiring several forward/back substitutions per factorization, such as the evaluation of basis vectors in the multiple-parameter reduced basis technique, or the generation of derivatives of response quantities with respect to problem parameters in sensitivity analysis. For complex three-dimensional structures, such as the HSCT, the efficiency of the factorization phase of the nested dissection scheme is reduced, and becomes only comparable to that of the skyline solver. However, the performance of the forward/back substitution phase remains significantly better than that of the skyline solver.

2. With the improvements achieved in factorization and forward/back substitution by using the nested dissection-based solver, the time to collect and output the different response quantities may become a significant portion of the total analysis time (see Fig. 8). Therefore, when implementing a nonlinear structural analysis procedure on massively parallel computers, the entire solution process, including the input and output phases, should be optimized, rather than optimizing only the linear equation solver.

3. Efficient utilization of the parallel machine requires the selection of the proper number of processors for a given problem. The use of additional processors, over the minimum dictated by storage requirements, is only justified if it results in a pronounced reduction in the total wall clock time. Because communication costs typically grow as additional processors are used, it is usually best to use the smallest number of processors that the problem storage requirement allows. However, if the processors have a large local memory, the minimum number of processors required for storage may be on the order of 10 or less. In this case, the parallel speedup provided by the use of additional processors can significantly outweigh the additional communication costs.

4. As expected, the effectiveness of the sparse solver is strongly dependent on the strategies used in partitioning the structure. Initial numerical experiments have indicated that a fully automated partitioning strategy (e.g., Cartesian nested dissection) may not produce satisfactory results. However, a semi-automated partitioning strategy, in which the user provides guidelines for selecting the separators, can produce satisfactory results.

5. The sparse nested dissection-based linear equation solver used in the present study exploits only large grain parallelism. The performance of the computational procedure could be improved by exploiting medium grain parallelism within each node of the elimination tree, though this incurs additional communication.

6. For some analysis problems there is another level of very large grain parallelism available. For example, in a sensitivity analysis, the derivative calculations associated with each parameter are independent. If a reduced basis technique is used, the derivative calculations can consume a significant portion of the total analysis time. Therefore, it can be worthwhile to exploit this level of parallelism, even at the expense of duplicating some of the computations (e.g., redundantly factoring the left-hand-side matrix).
CONCLUDING REMARKS
A computational strategy is presented for performing postbuckling and nonlinear static analyses of large complex structures on massively parallel computers. This strategy is designed for distributed-memory, message-passing parallel systems. The key elements of this strategy are: (a) a multiple-parameter reduced basis technique; (b) a nested dissection (or multilevel substructuring) scheme; (c) parallel assembly of global matrices; and (d) a parallel sparse equation solver. The computational procedure is divided into five phases, namely: the preprocessing and input phase; the reduction phase; the solution phase; the error sensing and control phase; and the output and postprocessing phase. Efficient parallelization of the most time-consuming tasks of these phases is attempted. The implementation of the proposed strategy is described for three distributed-memory computers, namely: the Intel Paragon XP/S; the CRAY T3D; and the IBM SP2. The performance of the strategy is evaluated by solving several laminated composite panel postbuckling problems, as well as nonlinear large-deflection problems for an HSCT model. These numerical results demonstrate the advantages of the nested dissection-based solution strategy over the skyline-based strategy. The nested dissection-based strategy has better scalability than the skyline-based strategy, and its efficiency increases with the increase in problem size. The results of these studies also show the importance of exploiting parallelism in each phase of the solution process in order to minimize the total solution time.

Acknowledgements--The present work is supported by NASA Cooperative Agreement NCCW-0011. The authors appreciate useful discussions with David Womble of Sandia, Majdi Baddourah of Lockheed, Olaf Storaasli of NASA Langley, and Srini Chari of IBM. Work on the IBM SP2 at the Maui High Performance Computing Center is sponsored in part by the Phillips Laboratory, Air Force Material Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.
REFERENCES

1. O. O. Storaasli and E. A. Carmona (guest eds), "High Performance Computing for Flight Vehicles," Special Issue of Computing Systems in Engineering, 2 (2/3), 1992.
2. A. K. Noor and S. L. Venneri (guest eds), "High Performance Computing for Flight Vehicles," Special Issue of Computing Systems in Engineering, 3 (1-4), 1992.
3. O. O. Storaasli, J. M. Housner and D. T. Nguyen (guest eds), "Parallel Computational Methods for Large-Scale Structural Analysis and Design," Special Issue of Computing Systems in Engineering, 4 (4-6), 1993.
4. O. O. Storaasli, D. T. Nguyen, M. Baddourah and J. Qin, "Computational mechanics analysis tools for parallel-vector supercomputers," Proceedings of the 34th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, La Jolla, CA, pp. 772-778, 15-22 April 1993.
5. K. H. Law and D. R. Mackay, "A parallel row-oriented sparse solution method for finite element structural analysis," International Journal for Numerical Methods in Engineering, 36, 2895-2919, 1993.
6. J. Qin and D. T. Nguyen, "A new parallel-vector finite element analysis software on distributed memory computers," Proceedings of the 34th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, La Jolla, CA, pp. 98-102, 15-22 April 1993.
7. M. T. Heath and P. Raghavan, "A distributed solution of sparse linear systems," University of Illinois Report UIUCDCS-R-93-1793, Feb. 1993.
8. K. P. Wang and J. C. Bruch, "A highly efficient iterative parallel computational method for finite element systems," Engineering Computations, 10, 195-204, 1993.
9. A. Gupta and V. Kumar, "A scalable parallel algorithm for sparse matrix factorization," Technical Report 94-19, Dept. of Computer Science, University of Minnesota, April 1994.
10. G. Karypis and V. Kumar, "A high performance sparse Cholesky factorization algorithm for scalable parallel computers," Technical Report 94-41, Dept. of Computer Science, University of Minnesota, 1994.
11. C. T. Vaughan, "Structural analysis on massively parallel computers," Computing Systems in Engineering, 2 (2/3), 261-267, 1991.
12. A. K. Noor, H. A. Kamel and R. E. Fulton, "Substructuring techniques--status and projections," Computers & Structures, 8, 621-632, 1978.
13. A. Pothen, H. D. Simon and K. Liou, "Partitioning sparse matrices with eigenvectors of graphs," SIAM Journal of Matrix Analysis and Applications, 11 (3), 430-452, 1990.
14. B. Hendrickson and R. Leland, "An improved spectral graph partitioning algorithm for mapping parallel computations," Sandia National Laboratories TR SAND92-1460, 1992.
15. A. I. Khan and B. H. V. Topping, "Subdomain generation for parallel finite element analysis," Computing Systems in Engineering, 4 (4-6), 473-488, 1993.
16. T. N. Bui and C. Jones, "A heuristic for reducing fill-in in sparse matrix factorization," Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993.
17. A. K. Noor and C. M. Andersen, "Mixed models and reduced/selective integration displacement models for nonlinear shell analysis," International Journal for Numerical Methods in Engineering, 18, 1429-1454, 1982.
18. A. K. Noor and J. M. Peters, "Multiple-parameter reduced basis technique for bifurcation and postbuckling analyses of composite plates," International Journal for Numerical Methods in Engineering, 19, 1783-1803, 1983.
19. A. K. Noor and J. M. Peters, "Bifurcation and postbuckling analysis of laminated composite plates via reduced basis technique," Computer Methods in Applied Mechanics and Engineering, 29, 271-295, 1981.
20. A. K. Noor, "Recent advances and applications of reduction methods," Applied Mechanics Reviews, 47 (5), 125-145, May 1994.
21. M. A. Baddourah, O. O. Storaasli, E. A. Carmona and D. T. Nguyen, "A parallel algorithm for generation and assembly of finite element stiffness and mass matrices," Proceedings of the 32nd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, Baltimore, MD, pp. 1547-1553, 8-10 April 1991.
22. M. T. Heath, E. Ng and B. W. Peyton, "Parallel algorithms for sparse linear systems," SIAM Review, 33 (3), 420-460, 1991.
23. M. T. Heath and P. Raghavan, "A Cartesian nested dissection algorithm," University of Illinois Report UIUCDCS-R-92-1772, 1992.
24. Anon., Paragon User's Guide, Intel Supercomputer Systems Division, Beaverton, Oregon, 1994.
APPENDIX A: EXPLICIT FORM OF THE REDUCED EQUATIONS USED IN THE PRESENT STUDY

Basis reduction and reduced system of equations

The nonlinear response vector {Z} is approximated over a range of the two control parameters, q1 and q2, by a linear combination of a nonlinear solution for a particular combination (q1, q2) and a number of path derivatives evaluated at the same (q1, q2). The approximation can be expressed by the following transformation:

{Z} = [Γ]{ψ}   (A1)

where [Γ] is an (n x r) transformation matrix whose columns represent basis vectors, and {ψ} is an (r x 1) vector of reduced degrees of freedom, which represent amplitudes of displacement modes. The matrix [Γ] is given by:

[Γ] = [{Z}  ∂{Z}/∂q1  ∂{Z}/∂q2  ∂²{Z}/∂q1²  ∂²{Z}/∂q1∂q2  ∂²{Z}/∂q2²]   (A2)

The Rayleigh-Ritz technique is used to approximate Eq. (1) by the following reduced system of nonlinear algebraic equations in the reduced unknowns {ψ}:

{f̂(ψ)} = [K̂]{ψ} + {Ĝ(ψ)} - q1{Q̂1} - q2{Q̂2} = {0}   (A3)

where

[K̂] = [Γ]^T [K] [Γ]   (A4)

{Ĝ(ψ)} = [Γ]^T {G([Γ]{ψ})}   (A5)

{Q̂1} = [Γ]^T {Q1}   (A6)

{Q̂2} = [Γ]^T {Q2}   (A7)

Determination of stability boundary

To determine the stability boundary for the structure under the combined loading system, initial basis vectors are generated at q1 = q2 = 0, {Z} = {0}; the Rayleigh-Ritz technique is then used to approximate the linear eigenvalue problem, Eq. (3), by the following reduced eigenvalue problem:

([K̂] + q1[K̂1] + q2[K̂2]){ψ} = {0}   (A8)

where

[K̂1] = [Γ]^T [K1] [Γ]   (A9)

[K̂2] = [Γ]^T [K2] [Γ]   (A10)

and [Γ] is the matrix of the basis vectors (initial path derivatives) evaluated at q1 = q2 = 0.
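A dense sketch of these Rayleigh-Ritz projections follows. Since the nonlinear term of Eq. (A5) must be re-projected at every evaluation, it is returned as a function; all names are illustrative:

```python
import numpy as np

def reduce_system(Gamma, K, K1, K2, Q1, Q2, G):
    """Form the reduced arrays of Eqs (A4)-(A7) and (A9)-(A10) by
    projecting the full arrays onto the basis-vector matrix [Gamma]."""
    Khat, K1hat, K2hat = (Gamma.T @ M @ Gamma for M in (K, K1, K2))
    Q1hat, Q2hat = Gamma.T @ Q1, Gamma.T @ Q2
    Ghat = lambda psi: Gamma.T @ G(Gamma @ psi)   # Eq. (A5), per evaluation
    return Khat, K1hat, K2hat, Q1hat, Q2hat, Ghat
```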
APPENDIX B: SUMMARY OF THE MAJOR FEATURES OF THE DISTRIBUTED-MEMORY COMPUTERS CONSIDERED IN THE PRESENT STUDY

Three distributed-memory, message-passing computers were used in the present study, namely: the Intel Paragon XP/S; the CRAY T3D; and the IBM SP2. The major features of these machines are summarized in this appendix.

Intel Paragon
The Intel Paragon consists of a set of nodes connected with a high-speed node interconnect network. Each node consists of two or more Intel i860 CPUs, at least 16 MB of local memory, and possibly I/O hardware. One of the i860 processors performs computations, while one optionally works as a message coprocessor. The node interconnect network has a two-dimensional rectangular mesh architecture. The nodes are divided into two main groups: a service partition and a compute partition. Nodes in the service partition are directly connected to the I/O devices. Nodes in the compute partition have access to I/O through communication with the service nodes. Details of the Intel Paragon are described in Ref. 24. Intel provides a version of the OSF/1 operating system, which provides all of the functionality of UNIX, as well as special services required by parallel applications. Each node runs its own copy of the operating system. OSF/1 provides each node with a full multitasking environment with virtual paging memory and seamless access to I/O. Additionally, OSF/1 provides message passing and global operation services for parallel applications. Messages are packetized, and full error detection and recovery is provided.
Researchers at Sandia National Laboratories and the University of New Mexico have developed an alternative operating system, SUNMOS, for the compute nodes. SUNMOS trades functionality for performance. It provides for a single process per node with no virtual memory. The I/O is handled through the controlling process in the service partition. Message passing and global operation services are provided; however, messages are not packetized, and no error detection occurs. This allows node-to-node communication at rates as high as the hardware allows.

CRAY T3D
The CRAY T3D computer consists of a set of nodes connected with a high-speed node interconnect network. Each node consists of two DEC 21064 (Alpha) processors, each with at least 16 MB of local memory. The node interconnect network has a three-dimensional torus architecture. The system is hosted by a CRAY Y-MP or C90 computer. The host provides job control and access to I/O. Each node of the T3D runs the UNICOS-MAX operating system, which provides message passing functions, memory allocation, and communication with the host.

IBM SP2
The IBM SP2 computer system consists of a set of nodes connected with a high-performance switch. Each node consists of either an IBM POWER2 or POWER processor together with at least 64 MB of local memory, at least 1 GB of local disk space, external network connections, and a high-performance switch adapter. Nodes are designated as "thin" or "wide" depending on the processor, cache size, and number of data paths to memory. Each node runs the IBM AIX operating system, which provides all of the functionality of UNIX, including full multi-tasking job control, I/O access, and memory allocation. Two communication libraries, IP and US, are provided. The IP library allows many users to share the high-performance switch adapter, while the US library provides the highest communication rates. Figure 11 shows the time required for a message to travel from one processor to another, as a function of message length, for each of the three machines. Table 3 lists the measured communication rates and latencies (message startup times) for each of the machines. Figure 12 shows the measured single-processor computation rates for the two BLAS1 routines and the BLAS3 routine most heavily used in the current implementation of the foregoing nonlinear analysis strategy.
[Fig. 11. Measured communication times for the IBM SP2, Intel Paragon (OSF), Cray T3D, and Intel Paragon (SUNMOS), as a function of message length.]

Table 3. Measured communication rates and latencies

                         Intel Paragon            Cray T3D   IBM SP2
                         (OSF R1.2)   (SUNMOS)
Comm. rate (MB/sec)      52           160         108        36
Latency (microsec)       61           26          20         63
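A first-order model reproduces the shape of Fig. 11 from the Table 3 figures. This is a hedged illustration only; real transfer times also depend on packetization and contention:

```python
def transfer_time_us(n_bytes, latency_us, rate_mb_per_s):
    """First-order message cost: startup latency plus bytes over bandwidth."""
    return latency_us + n_bytes / (rate_mb_per_s * 1e6) * 1e6

# With the Table 3 figures, a 1 KB message on the SP2 costs about
# transfer_time_us(1024, 63, 36) ~= 91 microseconds, while on the
# Paragon under SUNMOS it costs about transfer_time_us(1024, 26, 160) ~= 32.
```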
[Fig. 12. Measured computation rates of a single processor (using 64-bit floating point numbers): (a) AX+Y (BLAS 1); (b) dot product (BLAS 1); (c) matrix-matrix product (BLAS 3), for the IBM SP2 "wide" and "thin" nodes, the Cray T3D, and the Intel Paragon.]

APPENDIX C: OPERATING SYSTEM AND COMPILER INFORMATION FOR THE SYSTEMS USED IN THE PRESENT STUDY

At the time this study was carried out, the Intel Paragon XP/S at Sandia National Laboratories had 1840 compute
nodes, each running SUNMOS version 1.5. Of these compute nodes, 512 had 32 MB of RAM, while the remainder had 16 MB of RAM. The service partition nodes were running OSF/1 version R1.2. The code was compiled with Intel's f77 compiler, version R4.5, with level 3 optimization.

The CRAY T3D at the Jet Propulsion Laboratory had 256 processors, each running UNICOS-MAX version 1.1.0.2. Each processor had 16 MB of RAM. The code was compiled with CRAY's cf77 compiler, version 6.1.

At the Maui High Performance Computing Center, the IBM SP2 had 80 "wide" nodes, each running AIX version 3.2. Each processor had at least 64 MB of RAM. The code was compiled with IBM's XL Fortran compiler (mpxlf) with level 3 optimization.