Pergamon

Computing Systems in Engineering, Vol. 4, Nos 4-6, pp. 415-420, 1993. © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0956-0521/93 $6.00 + 0.00

A METHOD OF DERIVING PARALLEL ALGORITHMS FOR DIRECT INTEGRATION IN STRUCTURAL DYNAMICS

R. S. HARICHANDRAN and BINSHAN YE
Department of Civil and Environmental Engineering, Michigan State University, East Lansing, MI 48824-1226, U.S.A.

Abstract--A method of deriving parallel algorithms from well-established conventional direct integration algorithms such as the Wilson θ method, Newmark β method, etc. is presented. The new algorithms are able to effectively exploit the power of parallel computers for the dynamic analysis of large-scale structures, and may be more efficient than the conventional methods, even on uniprocessors. It appears that the parallel algorithms can be made to maintain the property of unconditional stability displayed by the conventional methods. The parallel algorithms are derived by splitting the stiffness, damping and mass matrices such that the dynamic equations of motion can be cast in block diagonal form. The splitting of the matrices corresponds to physical subdomains of the structure. Each set of block diagonal equations can be solved independently of the others, and therefore the computations can be performed in parallel. However, a predictor-corrector approach requiring iterations within a time step must be used for acceptable accuracy. The method described in this paper is derived from the Wilson θ method and is called the "parallel Wilson θ method." A simple numerical example is used to illustrate the accuracy of the parallel method, which is comparable to the conventional method when a sufficient number of iterations is used. Preliminary computations were performed on a BBN GP-1000 distributed-memory parallel computer to assess the performance of the parallel algorithm. The proposed method is especially suited for large-scale non-linear structural dynamics problems.

NOMENCLATURE

C              damping matrix of entire structure
f              nodal force excitations
h              time step
K              stiffness matrix of entire structure
K_ii           stiffness matrix of the ith subdomain
K_ij           stiffness matrices representing the coupling between subdomains i and j
LHS            left-hand side
M              mass matrix of entire structure (assumed to be diagonal in this paper)
RHS            right-hand side
u              vector of nodal displacements
α_i, β_i, γ_i  coefficients of the initial predictor
θ              parameter of the Wilson θ method (≥ 1.4 for unconditional stability)

Subscripts

n, n-1, n+θ    quantity computed at times t = nh, t = (n-1)h and t = (n+θ)h

Superscripts

·, ··, ···     first, second and third derivatives with respect to time
′, ″           block diagonal matrix and remainder matrix obtained by splitting
(i)            value of quantity at iteration i
(0)            initial predictor of the quantity

INTRODUCTION

Many methods have been derived for numerical step-by-step integration of structural dynamics problems. It is well known that methods which explicitly compute the response at a given time based on the responses at previous times are only conditionally stable and require the use of a sufficiently small time step. Implicit methods, on the other hand, require the solution of a system of equations at each time step but can be unconditionally stable. For large problems, the ability to use relatively large time steps favors implicit methods over explicit ones [1, 2].

Trujillo developed an unconditionally stable "explicit" algorithm for structural dynamics using matrix splitting [3]. The algorithm is restricted to equations of motion in which the mass matrix is diagonal and, in the strict sense, is implicit; however, the coefficient matrix for the system of equations to be solved at each time step is triangular, hence requiring only backsubstitution for the solution. A significant shortcoming of Trujillo's algorithm is that it does not exhibit significant parallelism because of its sequential backsubstitution procedure, and it is hence not efficient for use on parallel computers. It also has low accuracy [i.e. of O(h²)].

In this paper we introduce a new implicit method which is also derived by matrix splitting and is based on well-established conventional methods such as the Wilson θ method and the Newmark β method. The proposed method is believed to be unconditionally stable, has good accuracy, and is highly parallel, allowing efficient implementation on parallel computers with a large number of processors. In this method, the original coupled stiffness and damping matrices are each split into two matrices such that the resulting dynamic equations of motion have a block diagonal structure, and the computations corresponding to each block diagonal set of equations are performed on a different processor.


Since the future of parallel computing lies with scalable distributed-memory multi-processors rather than with shared-memory multi-processors (which are not easily scalable), efficient parallel methods must minimize:

1. the amount of data that is communicated from one processor to another;
2. the physical distance over which data is communicated (i.e. communication between nearby processors is more efficient than communication between distant processors); and
3. the amount of data that must be stored in the local memory of each processor.

ALGORITHM

The dynamic equation of motion of a multiple degree-of-freedom (DOF) structure is

    M ü + C u̇ + K u = f,    (1)

in which u, u̇, ü and f denote the nodal displacement, velocity, acceleration and force vectors, respectively. All direct integration algorithms transform Eq. (1) into a finite difference form. In this paper, the Wilson θ method is used to obtain [4-6]

    [6/(θh)² M + 3/(θh) C + K] u_{n+θ}
        = f_{n+θ} + M [6/(θh)² u_n + 6/(θh) u̇_n + 2 ü_n]
        + C [3/(θh) u_n + 2 u̇_n + (θh/2) ü_n],    (2)

in which h is the time step increment, u_n are the displacements at t = nh, u_{n+θ} are the displacements at t = (n+θ)h and θ is a parameter. For unconditional stability, θ ≥ 1.4.

To obtain a parallel direct integration method, a physical domain decomposition is used to split the damping and stiffness matrices into two matrices, i.e. C = C′ + C″ and K = K′ + K″, such that C′ and K′ are block diagonal matrices. The development in this paper is restricted to a diagonal lumped mass matrix, but the method can be readily extended to account for a consistent mass matrix formulation by similar splitting of the mass matrix. The physical domain decomposition is illustrated in Fig. 1, in which the nodes are partitioned into four subdomains Ω₁, Ω₂, Ω₃ and Ω₄ by "cutting" through the middle of elements.

Fig. 1. Physical domain decomposition.

The stiffness matrix of the entire structure, partitioned to correspond to the four subdomains, is split as follows:

        | K₁₁ K₁₂ K₁₃ K₁₄ |   | K₁₁  0   0   0  |   |  0  K₁₂ K₁₃ K₁₄ |
    K = | K₂₁ K₂₂ K₂₃ K₂₄ | = |  0  K₂₂  0   0  | + | K₂₁  0  K₂₃ K₂₄ |    (3)
        | K₃₁ K₃₂ K₃₃ K₃₄ |   |  0   0  K₃₃  0  |   | K₃₁ K₃₂  0  K₃₄ |
        | K₄₁ K₄₂ K₄₃ K₄₄ |   |  0   0   0  K₄₄ |   | K₄₁ K₄₂ K₄₃  0  |

in which K_ii is the stiffness matrix of the ith subdomain, K_ij represents the coupling between the ith and jth subdomains, the first block diagonal matrix on the right-hand side (RHS) is K′ and the second matrix on the RHS is K″. If subdomains i and j are not directly connected by elements, then K_ij will be a null matrix. As will be apparent later, it is advantageous to minimize coupling between subdomains by careful partitioning. Since K is usually a banded matrix, K″ will also be banded. The damping matrix is split in a similar way. Since K_ii is the stiffness matrix corresponding to the free DOFs of subdomain i when all the nodes outside it are fully restrained, K_ii is non-singular.

By using the split form of K and C in Eq. (2) and moving the terms related to K″ and C″ to the RHS, we obtain

    [6/(θh)² M + 3/(θh) C′ + K′] u_{n+θ}
        = f_{n+θ} + M [6/(θh)² u_n + 6/(θh) u̇_n + 2 ü_n]
        + C [3/(θh) u_n + 2 u̇_n + (θh/2) ü_n]
        - [3/(θh) C″ + K″] u_{n+θ}.    (4)

Now the coefficient matrix on the left-hand side (LHS) is a block diagonal matrix, and each block can be solved in parallel. The coefficient matrix of each block is assured to be non-singular since it represents the "dynamic stiffness matrix" of a stable structure consisting of a single subdomain, with all nodes outside the subdomain being fully restrained. Note that the unknown vector u_{n+θ} appears on both the LHS and RHS of Eq. (4). In order to overcome this,

we use the following predictor-corrector iterative scheme:

    [6/(θh)² M + 3/(θh) C′ + K′] u_{n+θ}^{(i+1)}
        = f_{n+θ} + M [6/(θh)² u_n + 6/(θh) u̇_n + 2 ü_n]
        + C [3/(θh) u_n + 2 u̇_n + (θh/2) ü_n]
        - [3/(θh) C″ + K″] u_{n+θ}^{(i)}.    (5)
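To make the scheme concrete, the following is a serial Python sketch of one time step of the iteration in Eq. (5). It is not from the paper: the matrix names, the block partition, and the use of dense matrices are illustrative assumptions, and in the actual method each block factor/solve would run on its own processor.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def parallel_wilson_theta_step(M, C, K, blocks, u, v, a, f_next,
                               h, theta=1.4, n_iter=3):
    """One time step of the parallel Wilson-theta scheme of Eq. (5),
    written serially; `blocks` holds the DOF index lists of the
    subdomains, and each diagonal block of the LHS is factored and
    solved independently."""
    th = theta * h

    # Split C and K into block-diagonal (') and remainder (") parts, Eq. (3).
    def split(A):
        Ap = np.zeros_like(A)
        for idx in blocks:
            Ap[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
        return Ap, A - Ap

    Cp, Cpp = split(C)
    Kp, Kpp = split(K)

    # Known part of the RHS of Eq. (5).
    rhs0 = (f_next + M @ (6/th**2 * u + 6/th * v + 2*a)
                   + C @ (3/th * u + 2*v + th/2 * a))

    # Factor each diagonal block of the LHS once per time step.
    lhs = 6/th**2 * M + 3/th * Cp + Kp
    factors = [lu_factor(lhs[np.ix_(idx, idx)]) for idx in blocks]

    u_iter = u + th * v               # simple initial predictor, Eq. (6)
    for _ in range(n_iter):           # predictor-corrector iterations
        rhs = rhs0 - (3/th * Cpp + Kpp) @ u_iter
        u_new = np.empty_like(u)
        for lu, idx in zip(factors, blocks):
            u_new[idx] = lu_solve(lu, rhs[idx])  # backsubstitution only
        u_iter = u_new
    return u_iter                     # u_{n+theta}
```

With enough iterations this converges to the solution of the un-split Eq. (2); the subsequent recovery of u_{n+1}, u̇_{n+1} and ü_{n+1} follows the conventional Wilson θ formulas and is omitted here.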

If the iteration in Eq. (5) is repeated until convergence is obtained at each time step, then the results of this "parallel Wilson θ method" are identical to those obtained by the conventional Wilson θ method. However, as with most predictor-corrector methods, carrying out the iteration until convergence is wasteful, and only a few iterations are usually necessary for acceptable accuracy. A drawback of terminating the iteration prematurely is that the property of unconditional stability exhibited by the associated conventional algorithm can be destroyed [5].

In linear problems, the coefficient matrix in Eq. (5) should be factored at the beginning and then re-used for the iterations and at other time steps. For non-linear problems the coefficient matrix changes from time step to time step and should be factored at the beginning of each time step. The iteration within each time step, however, is carried out with the previously factored coefficient matrix and hence involves only backsubstitution. This makes the iteration very fast. Further, factoring the coefficient matrix is equivalent to factoring each of its block diagonal submatrices, and it takes less effort to factor all of the submatrices than it does to factor the original larger matrix.

To begin the iteration in Eq. (5) at each time step, we need to estimate u_{n+θ}^{(0)}. A good initial estimate can drastically reduce the number of iterations that are required. The simplest assumption is u_{n+θ}^{(0)} = u_n, but the results obtained with such an assumption have extremely poor accuracy. Simple predictors involving higher-order derivatives such as

    u_{n+θ}^{(0)} = u_n + θh u̇_n    (6)

or

    u_{n+θ}^{(0)} = u_n + θh u̇_n + (θh)²/2 ü_n    (7)

produce much better results. Here we provide a general method of finding an accurate initial predictor [6]. Let

    u_{n+θ}^{(0)} = α₀ u_n + α₁ u_{n-1} + β₀ h u̇_n + β₁ h u̇_{n-1} + γ₀ h² ü_n + γ₁ h² ü_{n-1}.    (8)

The Taylor series expansion of the LHS of Eq. (8) about u_n yields

    u_{n+θ} = u_n + θh u̇_n + (θh)²/2 ü_n + (θh)³/6 u⃛_n + O(h⁴).    (9)

The quantities u_{n-1}, u̇_{n-1} and ü_{n-1} appearing on the RHS of Eq. (8) can also be expanded about u_n, u̇_n and ü_n, respectively, realizing that u_{n-1} = u(t_n - h), etc. Substituting these expansions into the RHS of Eq. (8) yields

    RHS = (α₀ + α₁) u_n + (-α₁ + β₀ + β₁) h u̇_n + (½α₁ - β₁ + γ₀ + γ₁) h² ü_n
        + (-⅙α₁ + ½β₁ - γ₁) h³ u⃛_n + O(h⁴).    (10)

Equating coefficients of the derivatives of u_n on the RHS of Eqs (9) and (10), we have the following conditions that must be satisfied for various orders of accuracy:

    h⁰:  α₀ + α₁ = 1
    h¹:  -α₁ + β₀ + β₁ = θ
    h²:  ½α₁ - β₁ + γ₀ + γ₁ = ½θ²
    h³:  -⅙α₁ + ½β₁ - γ₁ = ⅙θ³
    etc.    (11)

For accuracy of O(h⁰), only the first condition must be satisfied, and for each additional order of accuracy each subsequent condition must also be satisfied. It is wasteful to provide higher-order accuracy for the initial predictor than that of the integration scheme itself, and a predictor with too high an order can sometimes give poorer results than a lower-order predictor [6]. For the Wilson θ method, which has an accuracy of O(h³), it would be more than sufficient to satisfy the four conditions in Eq. (11). By appropriate choice of the coefficients α_i, β_i and γ_i, different initial predictors can be obtained from Eq. (8).

Once the initial predictor is obtained, the coefficient matrix in Eq. (5) is factored, and u_{n+θ}^{(1)} is computed. Additional iterations are performed without re-factoring the coefficient matrix. As with the normal Wilson θ method, u_{n+1} is then computed from u_{n+θ}, and the process advances to the next time step.

PARALLELISM AND PERFORMANCE CONSIDERATIONS

The efficiency of parallel algorithms is usually measured in terms of (a) the speed-up obtained when the number of processors used for the computations is increased, and (b) the memory that must be accessed by each processor. Most parallel algorithms have some serial segments, parallel computation segments and communication requirements. The


algorithm presented here can be broken down into the following steps:

1. Obtain the stiffness matrix for the current time step, split it into K′ and K″, and factor the coefficient matrix on the LHS of Eq. (5). This step is done in parallel, with each processor assembling, splitting and factoring the stiffness matrices related to its own subdomain. For linear problems this step needs to be performed only once, but for non-linear problems it needs to be performed at the beginning of each time step.
2. Compute the predictor u_{n+θ}^{(0)} at the beginning of each step. This is done in parallel, with each processor computing the predictor for the DOFs associated with its own subdomain.
3. Communicate the values u_{n+θ}^{(i)}, u_n, u̇_n and ü_n for DOFs in one subdomain that are coupled to another subdomain. The amount of time spent in communication can be minimized by minimizing the number of DOFs that are coupled between different subdomains, and by allocating neighboring subdomains to neighboring processors so that communication does not take place between distant processors.
4. Compute the RHS of Eq. (5) and solve for u_{n+θ}^{(i+1)}. This is done in parallel, with each processor computing results for its associated DOFs.
5. Iterate steps 3 and 4 to refine u_{n+θ}. The number of iterations required for acceptable accuracy depends on the time step, with larger time steps requiring more iterations.
6. Compute u_{n+1}, u̇_{n+1} and ü_{n+1} as in the conventional Wilson θ formulation.
7. Advance to the next time step and repeat steps 1-6.

In order to obtain good speed-up, the "granularity" of the problem must not be too fine. That is, the computational effort expended by each processor should be sizeable compared to the communication requirements. This will be the case if each subdomain is sufficiently large and the problem is non-linear, requiring re-factoring of the coefficient matrix at each time step. The overall speed-up will not be impressive for small problems with small subdomains, because the communication requirements may outweigh the computational speed-up.

The memory requirements for each processor depend on the size of the subdomains and on the amount of coupling between different subdomains. Each processor needs to access u_{n+θ}^{(i)}, u_n, u̇_n and ü_n for the DOFs that are associated with its own subdomain as well as the DOFs of other subdomains that are coupled to its own subdomain, and also needs to store the parts of M, C′, C″, K′ and K″ associated with these DOFs. Note that the total memory requirement is no greater than that of the conventional Wilson θ method.
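As a small illustration of step 3 (our sketch, with a hypothetical remainder matrix), the set of external DOFs a subdomain must receive can be read off the nonzero columns of its rows of K″:

```python
import numpy as np

def coupled_dofs(K_pp, idx):
    """DOFs outside subdomain `idx` whose current values must be
    communicated to it: the nonzero columns of the subdomain's rows
    of the remainder matrix K''."""
    cols = np.where(np.any(K_pp[idx, :] != 0.0, axis=0))[0]
    own = set(idx)
    return [int(c) for c in cols if c not in own]

# Hypothetical 4-DOF remainder matrix: subdomain {0, 1} is coupled
# only to DOFs 2 and 3.
K_pp = np.array([[0.,  0.,  0., -1.],
                 [0.,  0., -1.,  0.],
                 [0., -1.,  0.,  0.],
                 [-1., 0.,  0.,  0.]])
assert coupled_dofs(K_pp, [0, 1]) == [2, 3]
```

Minimizing the size of these lists, and keeping the listed DOFs on nearby processors, is precisely the partitioning goal described above.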

It should be noted that the proposed method could yield improved performance even on uniprocessors for large non-linear problems. The domain decomposition scheme replaces the factoring of the large coefficient matrix in the conventional Wilson θ method with the factoring of smaller block diagonal matrices associated with the subdomains. The operation counts (each operation comprising one multiplication and one addition) required to solve for u_{n+θ} during a single time step in the conventional and parallel Wilson θ methods [i.e. Eqs (2) and (5), respectively] are as follows:

1. LU decomposition of a positive definite, symmetric, banded matrix of size N and half-bandwidth B requires about ½NB² operations, and reduction of the load vector and backsubstitution requires about 2NB operations [7]. The total operation count for the conventional method is therefore ½NB(B + 4).
2. For S subdomains the parallel method requires LU decomposition of S submatrices. The submatrices are also positive definite and symmetric, but if they are small (i.e. many subdomains), then they will be more or less fully populated. LU decomposition of a general matrix of size N requires about ⅓N³ operations, and reduction of the load vector and backsubstitution requires N² operations [8]. For a symmetric matrix the operation count for the decomposition is reduced by about half (i.e. to ⅙N³). Therefore, for S subdomains, each containing N/S DOFs, computing u_{n+θ}^{(1)} from u_{n+θ}^{(0)} in Eq. (5) requires about ⅙(N/S)²(N + 6S) operations. The parallel method requires additional iterations within a time step, with each iteration requiring only backsubstitution. For M additional iterations per time step, MS(N/S)² operations are required. The total operation count for the parallel method is therefore approximately ⅙(N/S)²[N + 6S(M + 1)]. A lower operation count may be obtained if consideration is given to the bandwidth of the coefficient matrices corresponding to the subdomains.

Assuming S = 0.01N (e.g. 10 subdomains for a problem with 1000 DOFs, or 100 subdomains for a problem with 10,000 DOFs) and M = 10, the parallel method becomes faster than the conventional method if the half-bandwidth exceeds 72. Many large problems with complex finite element meshes would yield half-bandwidths exceeding 72.
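The two counts can be compared directly; a small sketch (the function names are ours) reproduces the quoted crossover at a half-bandwidth of about 72:

```python
def conventional_ops(N, B):
    """Conventional Wilson-theta step: banded LU (~N*B^2/2) plus
    reduction and backsubstitution (~2*N*B) = N*B*(B + 4)/2."""
    return 0.5 * N * B * (B + 4)

def parallel_ops(N, S, M_iter):
    """Approximate count for the parallel method:
    (1/6) * (N/S)^2 * [N + 6*S*(M_iter + 1)]."""
    return (N / S)**2 * (N + 6 * S * (M_iter + 1)) / 6

# With S = 0.01 N subdomains and M = 10 iterations, the parallel method
# becomes cheaper once the half-bandwidth exceeds ~72, as stated above.
N, S, M_iter = 10_000, 100, 10
par = parallel_ops(N, S, M_iter)
assert conventional_ops(N, 72) < par < conventional_ops(N, 73)
```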

NUMERICAL EXAMPLE

The undamped free vibration of a simple 2D cantilever beam is used to illustrate the properties of the parallel Wilson θ algorithm presented in this paper. The beam was subdivided into four linear beam elements having standard cubic interpolation functions as shown in Fig. 2, and the structure was divided into two subdomains. All three DOFs (axial translation, lateral translation and rotation) were retained at each node.

Fig. 2. Subdomains of example problem.

The free tip of the cantilever was given an initial velocity of 0.254 m/s (10 in/s) in the lateral direction and an initial velocity of 0.0254 m/s (1 in/s) in the axial direction. The physical properties of each beam element were: mass density = 10.78 kg/m³ (0.1 lb/in³), moment of inertia = 4.16 × 10⁻⁵ m⁴ (100 in⁴), cross-sectional area = 6.45 × 10⁻³ m² (10 in²), length = 0.914 m (36 in) and elastic modulus = 200 GPa (29 × 10⁶ lb/in²). A diagonal mass matrix consisting of the diagonal terms of the consistent mass matrix was used. While this is not an accurate method of diagonalizing the mass matrix, it is adequate for the purpose of this illustration.

The initial predictor used in this example was obtained by setting α₀ = β₁ = γ₀ = γ₁ = 0, α₁ = 1 and β₀ = 1 + θ in Eq. (8), and is

    u_{n+θ}^{(0)} = u_{n-1} + (1 + θ)h u̇_n.    (12)
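As a quick numerical check (ours, not the paper's), the coefficients chosen for Eq. (12) can be tested against the accuracy conditions of Eq. (11): the first two hold exactly, while the h² condition fails, confirming first-order accuracy.

```python
# Coefficients of the predictor in Eq. (12): alpha_0 = beta_1 = gamma_0
# = gamma_1 = 0, alpha_1 = 1, beta_0 = 1 + theta.
theta = 1.4
a0, a1 = 0.0, 1.0
b0, b1 = 1.0 + theta, 0.0
g0, g1 = 0.0, 0.0

assert abs((a0 + a1) - 1.0) < 1e-12               # h^0 condition holds
assert abs((-a1 + b0 + b1) - theta) < 1e-12       # h^1 condition holds
assert abs((0.5*a1 - b1 + g0 + g1) - 0.5*theta**2) > 0.1  # h^2 fails
```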

The predictor satisfies the first two conditions of Eq. (11) and therefore has first-order accuracy. The stability of the parallel Wilson θ method was examined as the time step was increased from 0.001 to 0.1 s. For all computations, the value of θ was fixed at 1.4. Even with a single iteration in Eq. (5) (i.e. computing only up to u_{n+θ}^{(1)}), the parallel Wilson θ method was stable for both flexural and axial responses. The lateral displacement of the cantilever tip computed by the conventional Wilson θ method and the parallel method are presented in Figs 3 and 4 for time steps of 0.01 and 0.05 s, respectively; the parallel method was used with one, 10 and 20 iterations of Eq. (5) during each time step. The displacements in each figure are normalized by dividing by the maximum absolute response obtained by the conventional Wilson θ method in that figure.

Fig. 3. Accuracy of the parallel Wilson θ method for h = 0.01 s.

Fig. 4. Accuracy of the parallel Wilson θ method for h = 0.05 s.

Figure 3 indicates that for a time step of 0.01 s, even a single iteration of Eq. (5) yields acceptable results, while performing 10 iterations yields results comparable to the conventional Wilson θ method. However, as indicated in Fig. 4, when the time step is large, a greater number of iterations of Eq. (5) is necessary to obtain results comparable to the conventional method. This is because when the time step is large, the initial predictor given by Eq. (12) is not sufficiently accurate. Using a higher-order initial predictor should improve the performance of the parallel Wilson θ method, and this is being investigated. It should be noted, however, that a time step of 0.05 s is quite large for the cantilever, which has a fundamental period of about 0.6 s, and a smaller time step would normally be used even for the conventional Wilson θ method. In general, the parallel Wilson θ method exhibits slightly larger amplitude decay and period elongation than the conventional method if only a few iterations are performed. Note, however, that for forced vibration problems this inaccuracy should be less noticeable.

The performance of the parallel Wilson θ method was investigated by solving the cantilever beam example on a BBN GP-1000 distributed-memory parallel computer using one, two, four, eight and 16 processors at a time. For the parallel computation, the beam was divided into 128 elements having a total of 384 free DOFs. The beam was decomposed into 128 subdomains, each containing only a single node with three DOFs. With n processors, each processor solves for the DOFs associated with 128/n subdomains. Although the problem being solved was linear, the coefficient matrix in Eq. (5) was factored at the beginning of each time step to simulate a non-linear solution, and the response was computed for 10 time steps. One iteration of Eq. (5) was performed for each time step (i.e. u_{n+θ}^{(1)} was computed). The total processing time, the speed-up (i.e. the processing time with one processor divided by the processing time with n processors), and the efficiency (i.e. speed-up divided


by the number of processors) are shown in Table 1. The speed-up increased as the number of processors was increased from one to eight, but decreased when the number of processors was increased further to 16. The efficiency decreased as the number of processors was increased, due to the increased communication requirements. The increase in the total processing time when the number of processors was increased from eight to 16 indicates that the reduced computation time was greatly offset by the increased communication time. The rapid decrease in efficiency as the number of processors was increased is believed to be due to the following reasons:

1. No special effort was made to access information only for those DOFs that belonged to or were coupled to a particular subdomain. The program that was used was rather crude, and the displacements, velocities and accelerations at all DOFs were accessed by each processor to construct the RHS of Eq. (5) (many of these were subsequently multiplied by zero due to the banded structure of the matrices M, C, C″ and K″). This unnecessarily increased the amount of communication that was required, especially as the number of processors increased.
2. The example problem was small and had fine granularity.
3. No special effort was made to take advantage of the particular architecture of the machine (e.g. assigning neighboring subdomains to neighboring processors).

Given these shortcomings, the results are quite encouraging. Significantly better speed-up is expected with improved programming for large problems with large subdomains, since the computation time will dominate over the communication time. Even for this example problem, the domain decomposition could have been done differently, by choosing several nodes in each subdomain rather than only a single node.
The proposed parallel Wilson θ method is currently being tested with improved programming, different domain decomposition schemes, higher-order initial predictors and larger problems, and the results will be reported at a later date.

Table 1. Performance measures on BBN GP-1000 computer

No. of processors    Processing time (s)    Speed-up    Efficiency
 1                   333.49                 1.00        1.00
 2                   214.11                 1.56        0.78
 4                   133.03                 2.51        0.63
 8                    94.25                 3.54        0.44
16                   105.67                 3.15        0.20
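The derived columns of Table 1 follow directly from the processing times, using the definitions given above (a trivial check, ours):

```python
# Processing times from Table 1; speed-up = t(1)/t(n), efficiency = speed-up/n.
times = {1: 333.49, 2: 214.11, 4: 133.03, 8: 94.25, 16: 105.67}
t1 = times[1]
speedup = {p: t1 / t for p, t in times.items()}
efficiency = {p: s / p for p, s in speedup.items()}

assert round(speedup[8], 2) == 3.54        # peak speed-up, at 8 processors
assert speedup[16] < speedup[8]            # adding processors hurts beyond 8
assert round(efficiency[16], 2) == 0.20    # efficiency collapses at 16
```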

CONCLUSIONS

A new method of obtaining parallel algorithms for structural dynamic analysis from well-established conventional direct integration algorithms is presented. The method is based on splitting the stiffness, damping and mass matrices to correspond to the physical partitioning of a large structure into several subdomains. Computations related to each subdomain can be performed on a different processor, and the time required for data communication can be minimized by minimizing the amount of coupling between the subdomains and by allocating neighboring subdomains to neighboring processors.

The parallel method presented in this paper is derived from the Wilson θ method, and results from the numerical example indicate that its accuracy is similar to that of the conventional Wilson θ method when a sufficient number of iterations is used. It is shown that for large non-linear problems the proposed method may be more efficient than the conventional method on which it is based, even on uniprocessors. Preliminary computations for a small example problem on a BBN GP-1000 distributed-memory parallel computer yielded promising performance levels. The parallel method presented is especially useful for the non-linear dynamic analysis of large-scale structures on scalable distributed-memory multiprocessors.

Acknowledgements--The research described in this paper was supported by the National Science Foundation under Grant Number ECS-9006910. Any opinions, findings and conclusions are those of the authors and do not reflect the views of the National Science Foundation.

REFERENCES

1. J. M. Ortega and W. G. Poole, Jr, An Introduction to Numerical Methods for Differential Equations. Pitman, Marshfield, MA, 1981.
2. O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method, 4th edn, Vol. 2. McGraw-Hill, New York, 1991.
3. D. M. Trujillo, "An unconditionally stable explicit algorithm for structural dynamics," International Journal for Numerical Methods in Engineering 11, 1579-1592 (1977).
4. J. L. Humar, Dynamics of Structures. Prentice-Hall, Englewood Cliffs, NJ, 1990.
5. C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations. Prentice-Hall, Englewood Cliffs, NJ, 1971.
6. Department of Mathematics, Wuhan University and Shandong University, Methods of Computation. People's Education Publisher, Beijing, 1979.
7. K.-J. Bathe, Finite Element Procedures in Engineering Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1982.
8. W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes: The Art of Scientific Computing (FORTRAN Version). Cambridge University Press, Cambridge, 1989.