Parallel-vector out-of-core equation solver for computational mechanics

Pergamon. Computing Systems in Engineering, Vol. 4, Nos 4-6, pp. 381-385, 1993. Elsevier Science Ltd. Printed in Great Britain. 0956-0521/93 $6.00 + 0.00

PARALLEL-VECTOR OUT-OF-CORE EQUATION SOLVER FOR COMPUTATIONAL MECHANICS

J. QIN,† T. K. AGARWAL,† O. O. STORAASLI,‡ D. T. NGUYEN† and M. A. BADDOURAH§

†Center for Multi-disciplinary Parallel-Vector Computation, Civil Engineering Department, Old Dominion University, Norfolk, VA 23529, U.S.A.
‡NASA Langley Research Center, Computational Mechanics Branch, Hampton, VA 23665, U.S.A.
§Lockheed Engineering and Science Co., Hampton, VA 23665, U.S.A.

Abstract--A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/output (I/O) time is reduced by using the asynchronous BUFFER IN and BUFFER OUT statements, which can be executed simultaneously with the CPU instructions. The parallel and vector capability provided by supercomputers is also exploited to enhance performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.

1. INTRODUCTION


For large-scale finite element based structural analysis, an out-of-core equation solver is often required, since the in-core memory of a computer is very limited. For example, the Cray Y-MP has only 256 megawords of in-core memory compared to its 90 gigabytes of disk storage. Furthermore, in a multi-user environment, each user may have only 10 megawords of main memory, while 200 million words of disk storage are available. A typical aircraft structure (High Speed Research Aircraft) needs 90 million words of in-core memory to store the stiffness matrix, which is not usually available in a multi-user environment. This paper presents a parallel/vector out-of-core equation solver which exploits features of Cray-type computers. Details of the development of the proposed out-of-core solver are explained in Section 2. The newly developed code is tested in Section 3 by solving several medium to large-scale structural examples. Finally, conclusions are given in Section 4.

2. OUT-OF-CORE PARALLEL/VECTOR EQUATION SOLVER

To solve the following system of linear equations

A x = b,

where A is an n x n symmetric, positive definite matrix, one first factorizes the matrix A into the product of two triangular matrices

A = U^T U,

where U is the upper triangular matrix. Then, the solution vector x can be obtained through the standard forward/backward elimination

U^T y = b   (solve for y),
U x = y     (solve for x).

2.1. Memory usage and record length

The matrix A is stored in a one-dimensional array with a row-oriented storage scheme,1 and each 8 rows of the matrix A have the same last column number (loop-unrolling level 8, or loop = 8). Assuming A is written in a file on the disk, the file contains m records (each record contains one or more block-rows of A; here one block-row has 8 rows). The required in-core memory is assigned in the variable "mtot" and the maximum half-bandwidth of A is "maxbw". To reduce the input/output (I/O) time during the solution procedure, one hopes to reduce the number of records "m", while the available in-core memory "istorv" should be capable of holding at least three records at any moment. When "istorv" and "maxbw" are given, the procedure given in Table 1 is used to determine the record length and the number of records that stay simultaneously in the main memory. The procedure adjusts the number of blocks in a record automatically, and gives the optimal values for "nloop" and "icstore". It also determines the minimum in-core memory required (= mtot + 1) for a given "maxbw" (mtot is close to maxbw*maxbw).

2.2. Asynchronous input/output on Cray computers

Considerable input/output (I/O) work is required by an out-of-core equation solver during the solution, which undoubtedly increases the solution time. Fortunately, the Cray computers offer BUFFER IN and BUFFER OUT1 as an extension of the regular Fortran READ and WRITE statements. BUFFER IN and BUFFER OUT can perform several I/O operations concurrently, and I/O operations can be executed concurrently with CPU instructions. It is required that the files be declared as unblocked.
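For readers without access to the Cray BUFFER statements, the overlap they provide can be imitated in any language with a background thread. The following Python sketch is a hypothetical stand-in (not the solver's Fortran): it prefetches the next record in a background thread while the CPU processes the current one, which is the essential effect of BUFFER IN.

```python
import threading

def process(record):
    # stand-in for the factorization work done on one record
    return sum(record)

def overlapped(records):
    """Process a list of records, prefetching the next record in a
    background thread while the current record is being processed.
    This imitates what BUFFER IN achieves on the Cray: the read of
    record i+1 proceeds concurrently with the CPU work on record i."""
    results = []

    def fetch(i, out):
        out.append(records[i])        # stand-in for a disk read

    nxt = [records[0]]
    for i in range(len(records)):
        current = nxt.pop()
        buf = []
        t = None
        if i + 1 < len(records):
            t = threading.Thread(target=fetch, args=(i + 1, buf))
            t.start()                 # "BUFFER IN" the next record
        results.append(process(current))  # compute on the current record
        if t:
            t.join()                  # wait for the prefetch to finish
            nxt = buf
    return results

print(overlapped([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

In the real solver the fetch is a disk transfer measured in megawords, so the join typically costs nothing: the read finishes while the CPU is still busy updating rows.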

A typical use of the BUFFER IN and BUFFER OUT statements can be written as shown in Table 2.

2.3. Parallel-vector in-core equation solver on the Cray Y-MP

In order to facilitate the discussion of the out-of-core version of the equation solver in the next section, the parallel-vector in-core version (Ref. 2) is summarized here, since the major portion of the total solution time for solving systems of linear equations occurs during the factorization phase (A = U^T U). Using the Choleski method, the factorized upper triangular matrix U can be computed as

Uij = ( Aij - SUM(k=1..i-1) Uki*Ukj ) / Uii    for i /= j,

Uii = sqrt( Aii - SUM(k=1..i-1) Uki^2 )        for i = j.

As an example, for i = 5 and j = 7, one has:

U57 = ( A57 - U15*U17 - U25*U27 - U35*U37 - U45*U47 ) / U55.

Table 1. Procedure to find record length and number of records

C     Definitions:
C       istorv .... in-core memory available.
C       mtot ...... in-core memory required.
C       loop ...... loop-unrolling level, here: loop = 8.
C       maxbw ..... maximum half-bandwidth.
C       icstore ... number of records kept in the memory at any time.
C       nloop ..... number of "blocks" in a record, one block = 8 rows.
C
C***** Initial guess for the number of blocks in a record **********
      nloop = maxbw/(4 + loop)
      if (nloop.le.1) nloop = 1
C***** Number of records required to be kept in the memory *********
      icstore = 2 + max(1, maxbw/(nloop*loop) + 1)
C***** In-core memory required: mtot *******************************
      mtot = loop*nloop*maxbw*icstore
  100 continue
C***** Check if the available in-core memory is enough or not ******
      if (mtot.lt.istorv) then
C*****   More blocks can be included in one record *****************
         idelt = (istorv - mtot)/(loop*icstore*maxbw)
      else
C*****   Too many blocks in one record, take some out! *************
         idelt = (istorv - mtot - 1)/(loop*maxbw) - 1
         idelt = idelt/nloop - 1
      endif
      nloop = nloop + idelt
      if (nloop.le.0) nloop = 1
C***** Check again whether mtot < istorv ***************************
      icstore = 2 + max(1, maxbw/(nloop*loop) + 1)
      mtot = loop*nloop*maxbw*icstore
      mloop = nloop*loop
      if (mtot.lt.istorv) go to 200
      if (nloop.eq.1) then
         write (*,*) ' Please increase the in-core memory to:', mtot + 1
         stop
      endif
      go to 100
C***** End of search for nloop, icstore ****************************
  200 continue
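A rough Python rendering of the Table 1 heuristic may make its control flow easier to follow. The initial guess maxbw/(4 + loop) is taken verbatim from the listing; only the shrinking branch of the adjustment is shown here, and Fortran integer division (which truncates toward zero) is approximated by Python floor division, so this is a sketch of the logic rather than a line-for-line port.

```python
def record_sizing(istorv, maxbw, loop=8):
    """Sketch of the Table 1 procedure: choose nloop (blocks per
    record) and icstore (records resident in memory) so that the
    required memory mtot = loop*nloop*maxbw*icstore fits in the
    istorv words available. Only the shrinking branch is modeled;
    Python's // rounds toward -inf, unlike Fortran's truncation,
    which can make the shrink step slightly more aggressive."""
    nloop = max(maxbw // (4 + loop), 1)   # initial guess from the listing
    while True:
        # at least 2 spare records plus enough to cover maxbw rows
        icstore = 2 + max(1, maxbw // (nloop * loop) + 1)
        mtot = loop * nloop * maxbw * icstore
        if mtot < istorv:
            return nloop, icstore, mtot
        if nloop == 1:
            raise MemoryError(f"increase the in-core memory to {mtot + 1}")
        # too many blocks in one record: take some out
        idelt = (istorv - mtot - 1) // (loop * maxbw) - 1
        idelt = idelt // nloop - 1
        nloop = max(nloop + idelt, 1)

print(record_sizing(istorv=10**6, maxbw=100))  # (8, 4, 25600)
```

With a maximum half-bandwidth of 100 and a megaword of memory, the very first guess (8 blocks of 8 rows per record, 4 resident records, 25,600 words) already fits, so no adjustment iterations are needed.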

Table 2. Formats of BUFFER IN and BUFFER OUT statements

C***** To read a record from file (or unit number) id **********
      call setpos (id,ilocate)
      buffer in (id,m) (a(start), a(start + length - 1))
C***** To write a record on file (or unit number) id ***********
      call setpos (id,ilocate)
      buffer out (id,m) (a(start), a(start + length - 1))
C
C     where:
C       id ........ is a unit identifier or a file name.
C       ilocate ... is the beginning location of the record.
C       m ......... is a mode identifier (m = 1 in this paper).
C       a ......... is an array (in-core memory) to hold the record.
C       start ..... is the start location of the record in a.
C       length .... is the length of the record.
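As an illustration of how the setpos/BUFFER addresses of Table 2 might be computed, the helper below assumes fixed-length records stored back to back on the unblocked file, with every row padded to maxbw words. That layout is an assumption chosen for illustration, not the paper's exact scheme, and the function name is hypothetical.

```python
def record_address(irec, nloop, loop, maxbw):
    """Compute the disk position (the 'ilocate' passed to setpos) and
    the word length of record number irec (0-based), assuming every
    record holds nloop blocks of `loop` rows and each row is padded
    to maxbw words. Illustrative layout only."""
    length = nloop * loop * maxbw      # words per record
    ilocate = irec * length            # records are stored back to back
    return ilocate, length

print(record_address(2, nloop=4, loop=8, maxbw=100))  # (6400, 3200)
```

Under this assumption the addresses are pure arithmetic, which is why the solver can position and transfer records asynchronously without any directory lookups.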

From the above formula, one can see that to update the term U57 (of row 5), one only needs information from columns 5 and 7. Therefore, to update the entire row i, one needs the complete updated information right above row i, as shown in Fig. 1 (where the matrix A is assumed to be banded to simplify the discussion). Also from the above formula, one can see that if only rows 1 and 2 have been completed (even though rows 3 and 4 have not been completed), the term U57 (or the entire row 5) can be partially completed. For example,

U57(incomplete) = A57 - U15*U17 - U25*U27.

The above observation immediately suggests the parallel procedure for obtaining the factorized matrix U: each processor handles the updating of one row. Furthermore, to exploit the vector capability of Cray-type computers, the loop-unrolling technique2 is used. In the above formula for U57, assume loop-unrolling level 2 is employed; this means that every 2 rows are grouped and processed by one processor. In the actual computer code implementation, loop-unrolling level 8 is used to optimize the vector speed.
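The row-update formulas of Section 2.3 can be collected into a compact sketch. The Python below uses a dense two-dimensional array for clarity instead of the paper's banded one-dimensional storage; it factorizes A = U^T U one row at a time, the dependence pattern (row i consumes only completed rows 1..i-1) that allows rows to be distributed across processors.

```python
import math

def choleski_rows(A):
    """Row-oriented Choleski factorization A = U^T U for a symmetric
    positive definite matrix A (list of lists). Completing row i
    requires only the already-finished rows 0..i-1 above it, which
    is the dependence structure the parallel solver exploits."""
    n = len(A)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Aij minus the inner product of columns i and j of U
            s = A[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            if i == j:
                U[i][i] = math.sqrt(s)       # diagonal term
            else:
                U[i][j] = s / U[i][i]        # off-diagonal term
    return U

def solve(U, b):
    """Forward substitution U^T y = b, then back substitution U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(U[k][i] * y[k] for k in range(i))) / U[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 2.0], [2.0, 3.0]]
print(solve(choleski_rows(A), [8.0, 7.0]))  # ~ [1.25, 1.5] up to rounding
```

The partial-update idea of the paper corresponds to splitting the `sum(... for k in range(i))` into a prefix over the rows already finished plus a remainder added later; the production code also unrolls this inner loop eight ways and works within the band rather than over the full row.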

2.4. Parallel-vector out-of-core equation solver

Fig. 1. Information required to update row i (in-core version).

For an in-core solver, during the factorization of A, the rows of A are updated (becoming the rows of U) and stored in the same locations as A (from row 1, row 2, ..., to row n). For an out-of-core solver, however, since only a small part of A is stored in the memory at any time, it is necessary to write (BUFFER OUT) the rows of U to the disk file, and to read (BUFFER IN) the other rows of A into the memory. In the proposed out-of-core solver, a record is BUFFER OUT as soon as it is completely updated, and a new record is BUFFER IN as soon as the first row of a record begins to be updated (see Fig. 2). This is also shown in Table 3 (to simplify the discussion, assume NP = 1 processor). In the actual computer code implementation, the addresses such as ilocate, start, length, ilocatel, startl and lengthl must be properly defined to ensure a correct solution. In a parallel computer environment, only one processor will be assigned to deal with the

Fig. 2. Parallel-vector out-of-core Choleski on the Cray Y-MP.

I/O, while the other processors directly (and simultaneously) do the calculations. A similar I/O pattern is used in the forward/backward elimination phases. Loop-unrolling level 8 is adopted in this work to enhance the vector performance of the solver. The parallel Fortran language Force (Fortran Concurrent Execution3) is used here to develop a parallel version of the out-of-core solver. For more details on the parallel/vector aspects of the in-core solver, see Ref. 2.

Table 3. Typical use of BUFFER IN and BUFFER OUT statements

      do 1000 I = 1, n
         if ((I - ((I - 1)/(nloop*loop))*(nloop*loop)).eq.NP) then
            call setpos (id,ilocate)
            buffer in (id,m) (a(start), a(start + length - 1))
         endif
C ....   Update the I-th row of A
C ....   The I-th row of A has been updated
         if ((I - ((I - 1)/(nloop*loop))*(nloop*loop)).eq.NP) then
            call setpos (id,ilocatel)
            buffer out (id,m) (a(startl), a(startl + lengthl - 1))
         endif
 1000 continue

3. APPLICATIONS

To test the effectiveness of the proposed out-of-core parallel-vector equation solver (pvsolve-ooc) described in Section 2, three large-scale structural analyses have been performed on the Cray Y-MP supercomputer at NASA Ames Research Center. These analyses involved calculating the static displacements resulting from initial loadings for finite element models of a high-speed research aircraft (HSRA) and the space shuttle solid rocket booster (SRB). The aircraft and SRB models were selected because they were large, available finite-element models of interest to NASA. The characteristics of the stiffness matrix for each of these practical finite element models are shown in Table 4. In the following applications, code is inserted in pvsolve-ooc to calculate the CPU time for the equation solution; the Cray timing routine (TSECND) is used to measure the time.

Table 5. Performance of pvsolve-ooc on the Cray Y-MP (factorization time, in seconds)

No. of processors   HSRA   Refined HSRA    SRB
        1           6.98      43.87       31.26
        2           3.50      20.00       15.53
        4           1.85      10.00        7.80
        8           1.01       5.71        4.21

3.1. High Speed Research Aircraft (HSRA) application

To evaluate the performance of the parallel-vector out-of-core Choleski solver, a structural static analysis has been performed on a 16,146 degrees-of-freedom finite-element model of a high-speed aircraft concept4. Since the structure is symmetric, a wing-fuselage half model is used to investigate the overall deflection distribution of the aircraft. The half model contains 2851 nodes, 4329 4-node quadrilateral shell elements, 5189 2-node beam elements and 1143 3-node triangular elements. The stiffness matrix for this model has a maximum semi-bandwidth of 600 and an average bandwidth of 321. The half model is constrained along the plane of the fuselage centerline and subjected to upward loads at the wingtip, and the resulting wing and fuselage deflections are calculated. The time taken for a typical finite element code to generate the mesh, form the stiffness matrix and factor the matrix is 325 s on a Cray 2 (802 s on a CONVEX 220), of which the matrix factorization is the dominant part. Using pvsolve-ooc, the factorization for this aircraft application requires 6.98 and 1.01 s on one and eight Cray Y-MP processors, respectively, as shown in Table 5.

3.2. Refined model for the HSRA problem

A more detailed and more realistic model of the HSRA structure than the crude model in Section 3.1 is used. The characteristics of the resulting stiffness matrix are shown in Table 4. The numerical performance of the proposed parallel-vector out-of-core solver for this example is presented in Table 5.

3.3. Space Shuttle Solid Rocket Booster (SRB) application5

In addition to the high-speed aircraft, the static displacements of a two-dimensional shell model of the space shuttle SRB have been calculated.

Table 4. Characteristics of finite element models

                     HSRA       Refined HSRA       SRB
Max. bandwidth        600           1272            900
Ave. bandwidth        321            772            383
Matrix terms      5,207,547     12,492,284     21,090,396
Non-zero terms      499,505        373,752      1,310,973
No. operations  171,425,520         --          9.2 x 10^9
No. equations       16,146         16,152         54,870

This SRB model is used to investigate the overall deflection distribution for the SRB when subjected to mechanical loads corresponding to selected times during the launch sequence. The model contains 9205 nodes, 9165 4-node quadrilateral shell elements, 1273 2-node beam elements and 90 3-node triangular elements, with a total of 54,870 degrees-of-freedom. The stiffness matrix for this application has a maximum bandwidth of 900 and an average bandwidth of 383. A detailed description and analysis of this problem is given in Refs 5 and 6. The time required for a typical finite element code to generate the mesh, form the stiffness matrix and factor the matrix is about one-half hour on the Cray 2 (15 h on a VAX 11/785), of which the matrix factorization is the dominant part. Using pvsolve-ooc, the factorization for this SRB problem requires 31.26 and 4.21 s on one and eight Cray Y-MP processors, respectively (as shown in Table 5).

4. CONCLUSIONS

A parallel-vector out-of-core Choleski method (pvsolve-ooc) for the solution of large-scale structural analysis problems has been developed and tested on Cray supercomputers. The method exploits both the parallel and vector capabilities of modern high-performance computers. To minimize computation time and memory requirements, BUFFER IN and BUFFER OUT statements are used for effective I/O operations. The method performs parallel computation at the outermost DO-loop of the matrix factorization, the most time-consuming part of the equation solution. In addition, the most intensive computation of the factorization, the innermost DO-loop, has been vectorized using a SAXPY-based scheme. This scheme allows the use of the loop-unrolling technique, which minimizes computation time. The forward and backward solution phases have been found to be most effective when performed sequentially with loop-unrolling and vector-unrolling, respectively.

The proposed parallel-vector Choleski method has been used to calculate the static displacements for three large-scale structural analysis problems: two models of a high-speed aircraft and the space shuttle solid rocket booster. The total equation solution time is small for one processor and is further reduced in proportion to the number of processors. Factoring the stiffness matrix for the space shuttle solid rocket booster, which formerly required hours on most computers and minutes on supercomputers by other methods, has been reduced to seconds using the parallel-vector variable-band Choleski method. The speed and low in-core memory requirement of pvsolve-ooc should give engineers and designers the opportunity to include more design variables and constraints during structural optimization, and to use more refined finite-element meshes to obtain an improved understanding of the complex behavior of aerospace structures, leading to better and safer designs.

Acknowledgement--The Old Dominion University portion of this research was supported by a grant (NAG-1-858) from NASA Langley Research Center.

REFERENCES

1. NAS User Guide, Version 6.0, NASA Ames Research Center, Moffett Field, CA (1991).
2. Agarwal, T. K., Storaasli, O. O. and Nguyen, D. T., "A parallel-vector algorithm for rapid structural analysis on high-performance computers," Proceedings of the AIAA/ASME/ASCE/AHS 31st SDM Conference, Long Beach, CA (April 1990).
3. Jordan, H. F., Benten, M. S., Arenstorf, N. S. and Ramann, A. V., "Force user's manual: a portable parallel FORTRAN," NASA CR 4265 (January 1990).
4. Robins, W. A., Dollyhigh, S. M., Beissner, F. L. and Geiselhart, K., "Concept development of a Mach 3.0 high-speed civil transport," NASA TM 4058 (September 1988).
5. Knight, N. F., McCleary, S. L., Macy, S. C. and Aminpour, M. A., "Large-scale structural analysis: the structural analyst, the CSM testbed, and the NAS system," NASA TM-100643 (March 1989).
6. Knight, N. F., Gillian, R. E. and Nemeth, M. P., "Preliminary 2-D shell analysis of the space shuttle solid rocket boosters," NASA TM-100515 (1987).