Is a simple diagonal scaling the best preconditioner for conjugate gradients on supercomputers?

Giorgio Pini and Giuseppe Gambolati
Dipartimento di Metodi e Modelli Matematici per le Scienze Applicate, Università degli Studi, Padova, Italy

Paper accepted March 1990. Discussion closes March 1991.

The implementation of accelerated conjugate gradients for the solution of large sparse systems of linear equations on vector/parallel processors requires programming features which are significantly different from those needed on a scalar machine. Furthermore, a numerical algorithm which works well on the latter may be largely inefficient on the former. In the present analysis the numerical performance on a CRAY X-MP/48 of some widely known preconditioning techniques is compared, including diagonal scaling, the incomplete Cholesky decomposition and the least squares polynomial preconditioners. The latter, which are not suited to scalar machines, appear particularly attractive from a conceptual viewpoint on vector/parallel architectures. The results obtained with 12 arbitrarily sparse finite element problems of growing size up to almost 5000 show, surprisingly, that simple diagonal scaling is the most efficient preconditioning scheme in the majority of applications. In the few cases where it is not, its performance is nevertheless comparable with that of the incomplete Cholesky factorization. Contrary to general expectation, the polynomials exhibit a poor performance which does not improve with the degree, and they never appear to be competitive with the other two more traditional preconditioners.

Key Words:

Linear systems, conjugate gradient method, preconditioners, vector computers.

1. INTRODUCTION

In recent years, the method of preconditioned or modified conjugate gradients (MCG) has received much attention in linear algebra, particularly for the solution of large sparse sets of equations Ax = b where A is a symmetric positive definite matrix. The good convergence properties of MCG in real world applications have been demonstrated in a number of papers16,17,24. The preconditioners belonging to the family of the incomplete Cholesky (ICCG) decompositions have proven particularly attractive15,22,26,32,33. However, most, if not all, of the preconditioners have been conceived for scalar processors and their design does not fit the vector and parallel architecture. Alternative preconditioners have been devised which properly meet the special technological features of the supercomputers. Dubois et al.13 propose a matrix Neumann series truncated after very few terms. Johnson et al.25 and Saad46 develop polynomial preconditioners based on a best fit technique. Ashby2 and Meyer et al.40 suggest an improvement by the use of adaptive polynomials, where the optimum polynomial preconditioner (minimizing the number of MCG iterations) is determined dynamically by the algorithm itself. Poole and Ortega43


and Melhem35 present some ICCG multicolor techniques for boundary value problems solved on a regular grid by finite differences or finite elements. These multicolor schemes rely on a proper nodal reordering which is heavily problem dependent. Kershaw27, van der Vorst50-52, Lichnewsky31, Kightly and Jones28, Hughes et al.23 and Ashcraft and Grimes3 have implemented some variants intended to increase the ICCG vector efficiency (see also Meier and Sameh34). Meurant37 and Concus et al.9 have tried a preconditioning based on a block factorization. A useful review of these techniques for vector machines has been contributed by Axelsson6. Finally, MCG methods designed for parallel architectures can also be found in the literature7,30,38,47,48. All the techniques previously reviewed have been vectorized mostly in connection with structured matrices and for very specific problems, and can hardly be generalized to arbitrarily sparse matrices such as those arising from the finite element integration of PDEs of the elliptic and parabolic type on irregular meshes. Moreover, some vector MCG schemes prove less efficient for 3-D models than they are for 2-D applications, and ICCG algorithms on supercomputers are not necessarily inferior to non-ICCG methods. The present paper deals with a numerical analysis of some well known preconditioners to be implemented on vector computers for the efficient solution of nonstructured, arbitrarily sparse linear problems.


In particular, we will compare the performance of the diagonal, polynomial, ICCG (as proposed by Kershaw26) and Newton-Raphson preconditioners. In all the test problems the vectorized implementation exhibits a speed-up larger than 2 or 3 with respect to the corresponding best scalar implementation. However, the most interesting findings concern the diagonal preconditioner. It will be shown, in fact, that the simple diagonal scaling, which is so ineffective on sequential processors, proves to be the most efficient preconditioning technique in the majority of the examples dealt with. The reason for this surprising outcome is that, despite the larger number of iterations required to achieve convergence, each single iteration is much more efficiently accomplished on a vector computer. The numerical experiments have been performed on a CRAY X-MP/48.

2. IMPLEMENTATION OF MCG ON A VECTOR COMPUTER

2.1. MCG equations

The MCG recursive relationships to be implemented in the vector code may be written in the form17:

   x_{i+1} = x_i + α_i p_i
   p_{i+1} = K^-1 r_{i+1} + β_i p_i
   r_{i+1} = r_i - α_i A p_i

where:

   r_i = b - A x_i
   α_i = p_i^T r_i / p_i^T A p_i
   β_i = -r_{i+1}^T K^-1 A p_i / p_i^T A p_i
   x_0 = K^-1 b,   p_0 = K^-1 r_0

The iteration is completed when |r_i|_2 < ε√N, N being the size of A and ε a prescribed tolerance. The preconditioning matrix is K^-1.
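As a concrete illustration (not part of the original paper, whose code is FORTRAN on the CRAY), the following is a minimal Python/NumPy sketch of the recursion above; the preconditioner is supplied as a routine apply_K_inv returning K^-1 v, and all names are illustrative.

```python
import numpy as np

def mcg(A, b, apply_K_inv, eps=1.0e-9, max_iter=2500):
    """Modified (preconditioned) conjugate gradients, following Section 2.1.
    apply_K_inv(v) must return K^-1 v for the chosen preconditioner."""
    N = len(b)
    x = apply_K_inv(b)                 # x_0 = K^-1 b
    r = b - A @ x                      # r_0 = b - A x_0
    p = apply_K_inv(r)                 # p_0 = K^-1 r_0
    for it in range(max_iter):
        if np.linalg.norm(r) < eps * np.sqrt(N):   # exit test |r|_2 < eps*sqrt(N)
            break
        Ap = A @ p
        pAp = p @ Ap
        alpha = (p @ r) / pAp          # alpha_i = p_i^T r_i / p_i^T A p_i
        x = x + alpha * p
        r = r - alpha * Ap
        Kr = apply_K_inv(r)
        beta = -(Kr @ Ap) / pAp        # beta_i = -r_{i+1}^T K^-1 A p_i / p_i^T A p_i
        p = Kr + beta * p
    return x, it
```

With diagonal scaling (preconditioner 1 below), apply_K_inv reduces to a division by the diagonal of A, which is why each iteration vectorizes so well.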

2.2. MCG preconditioners

The performance of MCG is explored in relation to the following preconditioners:

1. K^-1 = D^-1, where D = diag(A).

2. K^-1 = (LL^T)^-1, where L is the lower incomplete Cholesky factor with the same sparsity pattern as the corresponding part of A26. This simple decomposition is denoted ICCG(0) by Meijerink and van der Vorst33.

3. K^-1 = P(A), where P(A) is a polynomial in A whose coefficients are evaluated by a best fit technique, as will be discussed later.

4. K_n^-1 = K_{n-1}^-1 (2I - A K_{n-1}^-1), n = 1, 2, ..., where K_0^-1 = D^-1 or K_0^-1 = (LL^T)^-1 (Newton-Raphson preconditioner). This preconditioner can be used if lim_{n→∞} A K_n^-1 = I, i.e. if and only if the spectral radius of the initial error matrix E_0 = I - A K_0^-1 is < 1 (otherwise the Newton-Raphson scheme does not converge). Since the convergence of Newton-Raphson is quadratic14, a significant improvement of the preconditioning matrix can be obtained already with n = 1. Clearly, the MCG method will converge in fewer iterations as n increases, but the computational cost per iteration will also increase (see the sketch below).
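A sketch of how preconditioner 4 can be applied to a vector without ever forming K_n^-1 explicitly (again an illustrative Python fragment, not the authors' code):

```python
import numpy as np

def apply_newton_raphson(A, apply_K0_inv, v, n):
    """Apply K_n^-1 to v using K_n^-1 = K_{n-1}^-1 (2I - A K_{n-1}^-1),
    where apply_K0_inv implements K_0^-1 (diagonal or incomplete Cholesky)."""
    if n == 0:
        return apply_K0_inv(v)
    w = apply_newton_raphson(A, apply_K0_inv, v, n - 1)               # K_{n-1}^-1 v
    return apply_newton_raphson(A, apply_K0_inv, 2.0 * v - A @ w, n - 1)
```

Each increase of n doubles the number of K_0^-1 applications and matrix-vector products per call, which is the growth of the cost per iteration referred to above.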


As far as the polynomial preconditioner is concerned, a polynomial of low degree k (k ≤ 7) can be used in such a way that the Chebyshev norm |R(λ)|_∞ of R(λ) = 1 - λP(λ), where λ is a scalar, is minimized over all polynomials of degree ≤ k within an interval [a, b] including the eigenspectrum of A, with 0 < a < b. The computation of the P(A) coefficients is done by the well known Chebyshev iteration. Following an idea of Stiefel49, recently revisited by Johnson et al.25, Saad46 suggests using instead a polynomial which minimizes the L2-norm |R(λ)|_{2,w} with respect to some weighting function w defined over [a, b]. The advantage of this procedure is that an accurate estimate of the extreme eigenvalues of A is not needed. Actually, while the Chebyshev iteration requires that 0 < a < b, the optimization of the L2-norm can be accomplished over the interval [0, b], where b is an upper bound of the largest eigenvalue of A provided, for instance, by Gershgorin's theorem. The Jacobi weights may be used46:

   w(λ) = λ^(α-1) (1 - λ)^β,   with α ≥ 0 and β ≥ -1/2.

The coefficients of P(A) for various degrees k have been tabulated by Saad46. In the sequel, the L2-norm will be used. The product K^-1 b = P(A) b is carried out in a very accurate and efficient way by the Ruffini-Horner rule.
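For illustration, a sketch (hypothetical Python, not the paper's code) of the polynomial preconditioner application: the least squares coefficients c_0, ..., c_k are taken as given, e.g. from Saad's tables46, and the Gershgorin bound for b is included for completeness.

```python
import numpy as np

def gershgorin_upper_bound(A):
    """Upper bound b of the largest eigenvalue of A (Gershgorin circles):
    maximum over the rows of the sum of absolute values."""
    return np.max(np.sum(np.abs(A), axis=1))

def apply_polynomial(A, coeffs, v):
    """Evaluate K^-1 v = P(A) v with the Ruffini-Horner rule, where
    P(X) = c_0 + c_1 X + ... + c_k X^k and coeffs = [c_0, ..., c_k]."""
    w = coeffs[-1] * v
    for c in reversed(coeffs[:-1]):
        w = c * v + A @ w               # Horner step: w <- c_j v + A w
    return w
```

Only k matrix-vector products are required per application, which is why this preconditioner is conceptually attractive on vector architectures.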

2.3. Matrix storage and the matrix-vector product

Following the procedure indicated in Kincaid et al.29, Radicati and Vitaletti44 and Peters et al.42, matrix A is stored in compact form by eliminating the coefficients which are equal to zero. Let us denote by M the largest number of non-zero coefficients per row of A. The storage of A is made in the arrays COEF and JCOEF as shown by the simple example that follows:

A =
  | 11   0   0  14  25 |
  |  0  22   0   0   0 |
  |  0   0  33   0  34 |        (N = 5, M = 3)
  | 14   0   0  44   0 |
  | 25   0  34   0  55 |

COEF =
  | 11  14  25 |
  | 22   0   0 |
  | 33  34   0 |
  | 44  14   0 |
  | 55  25  34 |

JCOEF =
  | 1  4  5 |
  | 2  2  2 |
  | 3  5  3 |
  | 4  1  4 |
  | 5  1  3 |

(the diagonal coefficient is stored first in each row; padded positions hold a zero coefficient in COEF and repeat the row index in JCOEF)

The M columns of COEF and JCOEF are stored consecutively, thus COEF and JCOEF can be thought of as vectors with N·M elements. As far as the product Ap is concerned, we proceed in the following way42:

a. the elements of p are rearranged in a compact matrix PC having the same sparsity as A:

   PC_j(k) = p(JCOEF_j(k)),   k = 1, 2, ..., N;   j = 1, ..., M

Similarly to COEF, PC is stored in an N·M long vector.

b. PC and COEF are multiplied by a vector instruction:

   PC_j(k) = PC_j(k) · COEF_j(k),   k = 1, ..., N;   j = 1, ..., M

c. The final Ap vector is then obtained by adding the M vectors, each one having a length equal to N.
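The storage scheme and the three-step product can be rendered as follows (an illustrative Python/NumPy sketch of the FORTRAN procedure, with 0-based indices and padded JCOEF entries set to the row's own index so that they contribute nothing):

```python
import numpy as np

def build_compact(A_dense):
    """Store A in COEF/JCOEF form: COEF[i, j] holds the j-th stored coefficient
    of row i and JCOEF[i, j] its column index; rows are padded with zeros."""
    N = A_dense.shape[0]
    rows = [np.nonzero(A_dense[i])[0] for i in range(N)]
    M = max(len(r) for r in rows)
    COEF = np.zeros((N, M))
    JCOEF = np.zeros((N, M), dtype=int)
    for i, cols in enumerate(rows):
        COEF[i, :len(cols)] = A_dense[i, cols]
        JCOEF[i, :len(cols)] = cols
        JCOEF[i, len(cols):] = i        # padding: row's own index, COEF is 0 there
    return COEF, JCOEF

def matvec_compact(COEF, JCOEF, p):
    """A p in three steps: (a) gather p into PC, (b) elementwise multiply,
    (c) add the M column vectors."""
    PC = p[JCOEF]                       # step (a): PC[i, j] = p[JCOEF[i, j]]
    PC = PC * COEF                      # step (b): elementwise product
    return PC.sum(axis=1)               # step (c): sum of M vectors of length N
```

On the CRAY the gather (step a), the elementwise product (step b) and the M vector additions (step c) all map onto vector operations of length N; here they are simply NumPy array operations.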

2.4. Execution of the product (LL^T)^-1 b

The product (LL^T)^-1 b may be executed by solving two triangular systems with forward and backward substitution (case 1). To improve the (LL^T)^-1 b product we followed the procedure suggested by Anderson and Saad1, which is especially designed for parallel processors. In the standard solution scheme the i-th unknown of the triangular system Lx = b is computed as:

   x_i = (1/l_ii) (b_i - l_i1 x_1 - ... - l_i,i-1 x_i-1)

If L is full, the determination of x_i requires the evaluation of the previous components x_1, ..., x_i-1. However, if L is sparse, only a few of the x_j (j = 1, ..., i-1) components are needed. As a major result, we need not compute all the i-1 unknowns before solving the i-th equation. In Anderson and Saad's algorithm, more than one equation is usually solved at each step, but the overall number of steps (levels) is (much) smaller than N. This method is referred to in the sequel as case 2. Finally, a further improvement in the computational efficiency of (LL^T)^-1 b may be achieved by writing the subroutine which performs the product (through the forward and backward substitution) in CAL11, i.e. in the Assembler language of the CRAY X-MP/48 (case 3).
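The idea behind case 2 can be sketched as follows (an illustrative Python fragment in the spirit of Anderson and Saad's level scheduling1, not their implementation): unknowns are grouped into levels such that every equation in a level depends only on unknowns from earlier levels, so all equations of one level can be solved together.

```python
def build_levels(L_cols):
    """L_cols[i] lists the column indices j < i with l_ij != 0.
    Returns the levels; unknowns in the same level are mutually independent."""
    n = len(L_cols)
    depth = [0] * n
    for i in range(n):
        depth[i] = 1 + max((depth[j] for j in L_cols[i]), default=-1)
    levels = [[] for _ in range(max(depth) + 1)]
    for i in range(n):
        levels[depth[i]].append(i)
    return levels

def forward_solve_by_levels(L_diag, L_cols, L_vals, b, levels):
    """Solve L x = b level by level; the loop over i within a level is the
    part that can be executed as vector/parallel work."""
    x = [0.0] * len(b)
    for lev in levels:
        for i in lev:
            s = sum(v * x[j] for j, v in zip(L_cols[i], L_vals[i]))
            x[i] = (b[i] - s) / L_diag[i]
    return x
```

For the test matrices of this paper the number of levels, although smaller than N, is still large (Section 3), which explains the modest gain from case 1 to case 2.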

3. NUMERICAL RESULTS

The size N of the test matrices is between 377 and 4560. The related numerical problems arise from the finite element integration of structural equations for land subsidence prediction18 and of subsurface flow equations in 2-D17, quasi 3-D20 or 3-D19 systems. The distribution of the few non-zero coefficients is arbitrarily sparse within a banded structure. As an example, the pattern for the matrix with N = 4560II is shown in Fig. 1. Other representative patterns may be found in Gambolati et al.22. The numerical experiments have been performed on the CRAY X-MP/48 of CINECA (Bologna). The system configuration includes 4 CPUs, each one with the possibility of scalar and vector processing, hardware assistance for indirect addressing and chaining, a clock period (CP) equal to 8.5 ns, and 8 Mwords of main storage subdivided into 32 memory banks entirely addressable by each CPU. The code, which makes use of one CPU, is written in FORTRAN and is compiled by CFT12. For each sample problem and for the various preconditioners, the comparisons are made in terms of the CPU time needed to complete the iterative scheme with a tolerance ε = 10^-9 using one single processor, of the MFLOPS (millions of floating point operations per second) indicating how well vectorized the algorithm is, and of the spectral condition number c of the iteration matrices. The CPU times are given in seconds and have been computed by the subroutine SECOND on a fully dedicated machine. To increase the efficiency of the vector code, the following routines from the library SCILIB have been used: SCOPY, SSCAL, SNRM2, SDOT, SPDOT and SAXPY which, being written in CAL, are more efficient than the equivalent FORTRAN versions10.

Fig. 1. Pattern of the non-zero coefficients of the largest test matrix (N = 4560II). The upper minor with a size equal to 912 is shown.

The performance of the vector MCG algorithms with the diagonal and the ICCG preconditioning implemented as described in Section 2.4 is compared in Table 1 with that of the ICCG scalar code, to which the speed-up is also referred. Actually, the performance of the scalar code with diagonal scaling was so poor that it could not be used as a reference basis. Table 1 shows that vectorizing the ICCG scheme leads to a significant improvement with respect to the equivalent non-vectorized sequential version in terms of both CPU time (speed-ups larger than 2) and MFLOPS (which increase by one order of magnitude). It may be noted that there is a slight benefit from case 1 to case 2, as suggested by Anderson and Saad1. The reduction of the CPU times is not significant, however, because the number of levels of the matrices analyzed, although smaller than N, is nevertheless still large. A further slight improvement is achieved in case 3. Therefore, case 3 (product (LL^T)^-1 b executed in CAL) represents the best ICCG vector procedure and provides the maximum speed-up with respect to the scalar ICCG implementation. Table 1 also gives the vector performance of the diagonal scaling, which in most examples appears to be the most efficient preconditioning technique. This is so because the product D^-1 b is executed very fast and the overall scheme is very well vectorized, as is also evidenced by the large MFLOPS values.

Table 1. Comparison between the computational performance of scalar and vector MCG schemes implemented on a CRAY X-MP/48. The preconditioner K^-1 is either the diagonal matrix D^-1 or the matrix (LL^T)^-1 obtained with the incomplete Cholesky factorization of A. CPU times are given in seconds. ITE indicates the number of MCG iterations required to meet the exit test. The speed-ups S_D and S_L refer to the use of the preconditioners D^-1 and (LL^T)^-1 (case 3), respectively.

            Scalar              Vector
            (LL^T)^-1           D^-1                    (LL^T)^-1 case 1        (LL^T)^-1 case 2   (LL^T)^-1 case 3    Speed-up
N           CPU      MFL*       CPU      MFL     ITE    CPU      MFL     ITE    CPU      MFL       CPU      MFL        S_D    S_L    S_D/S_L
377         0.174    3.67       0.0297   60.33   116    0.0766   32.52   34     0.0767   33.53     0.0699   36.73      5.9    2.5    2.36
456         0.526    3.69       0.0616   69.60   254    0.217    33.83   84     0.213    33.65     0.198    37.13      8.5    2.6    3.27
812         0.384    3.67       0.0753   84.76   228    0.157    34.08   34     0.144    37.10     0.143    37.40      5.1    2.7    1.89
1020        1.722    3.70       0.219    92.86   619    0.694    34.18   126    0.620    37.94     0.614    37.67      7.9    2.8    2.82
1338S(1)    1.133    3.69       0.693    75.99   980    0.425    34.53   56     0.420    34.98     0.401    37.87      1.6    2.8    0.57
1338T(2)    0.316    3.65       0.079    75.46   110    0.120    34.31   14     0.118    34.77     0.109    37.63      4      2.9    1.38
1701        2.586    3.70       0.148    97.66   263    1.036    34.30   113    0.957    38.42     0.939    37.82      17.5   2.8    6.25
1802        3.512    3.71       1.846    65.69   1339   1.082    34.94   96     1.029    37.11     0.995    38.00      1.9    3.5    0.54
2220        2.232    3.70       0.440    67.96   267    0.842    35.16   64     0.789    37.49     0.769    38.46      5.1    2.9    1.76
2304        2.162    3.69       1.929    75.01   1492   0.828    34.66   62     0.757    39.68     0.755    38.00      1.1    2.9    0.38
4560I(3)    3.555    3.69       4.741    80.92   2098   1.298    34.70   49     1.155    39.77     1.184    38.06      0.7    3.0    0.23
4560II(4)   20.416   3.71       4.530    70.93   1255   5.982    35.79   211    5.681    37.69     5.555    38.90      4.5    3.7    1.22

*MFL = MFLOPS. (1) steady flow; (2) transient flow; (3) quasi 3-D steady flow; (4) 3-D steady flow.

It is worth noting from Table 1 that, despite the larger number of iterations (ITE) required to achieve convergence by the diagonal preconditioner, the total CPU time is smaller in most cases, since each single iteration is very efficiently completed on a vector machine. Therefore, while on a scalar computer the diagonal scaling can never compete with ICCG (see Refs 16 and 17), on a vector computer this result is reversed and, as may be seen from Table 1, the relative superiority can be as large as a factor of 6. Unlike the scalar case, when its performance is inferior, it is still comparable with that of ICCG. The diagonal scaling turns out to be less efficient than ICCG for N = 1338S (2-D steady state flow), N = 1802 (structural model of land subsidence18), N = 2304 (steady 3-D flow model19) and N = 4560I (steady quasi 3-D flow model20). To gain a better insight into the behaviour of the diagonal vs ICCG preconditioning, the spectral condition numbers c(AD^-1) and c[A(LL^T)^-1] have been computed21,41. As is known4,5, the asymptotic convergence rate of MCG is proportional to the square root of c.
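For reference, the square-root dependence just mentioned comes from the standard conjugate gradient error bound (a textbook result behind Refs 4 and 5, restated here for convenience), where c is the spectral condition number of the preconditioned matrix AK^-1:

$$\| x - x_i \|_A \;\le\; 2 \left( \frac{\sqrt{c} - 1}{\sqrt{c} + 1} \right)^{i} \| x - x_0 \|_A$$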

The values of c(A), c(AD^-1) and c[A(LL^T)^-1] are given in Table 2. We observe that large values of c(AK^-1) denote a bad approximation of the inverse of A by K^-1. Note, in Tables 1 and 2, that when D^-1 is a much worse approximation to A^-1 than (LL^T)^-1, the diagonal scaling is more expensive than ICCG. This occurs when c(AD^-1)/c[A(LL^T)^-1] is larger than 10^2. In other words, the results shown in Tables 1 and 2 suggest that only when the spectral condition number of AD^-1 is much greater (two orders of magnitude) than that of A(LL^T)^-1 is the diagonal scaling less efficient than ICCG. Looking at the nature of the corresponding sample problems, we can add that this appears to be the case in steady quasi 3-D and 3-D flow and structural models (i.e. 3-D elliptic problems). Table 3 provides the CPU times and the number of iterations ITE required by the vector MCG scheme when the Newton-Raphson preconditioner is implemented. One can note that, while ITE decreases as n increases, the smallest CPU time is usually obtained for n = 0 (i.e. K^-1 = K_0^-1 = D^-1 or K^-1 = K_0^-1 = (LL^T)^-1). The only exceptions concern the matrices with N = 377, 812 and 2220 and K_0^-1 = D^-1, where K^-1 = K_1^-1 yields the minimal CPU time. It is also worth pointing out that for N = 1802 and N = 4560II with K_0^-1 = D^-1 the iterations grow with n, i.e. K_n^-1 is a worse approximation to A^-1 than K_{n-1}^-1. This is so because I - AD^-1 has a spectral radius ρ > 1, so that the Newton-Raphson scheme does not converge (cf. Section 2.2).

Table 2. Spectral condition number of the iteration matrix for the diagonal, ICCG and 1-degree polynomial preconditioners. The condition number of A is also given. The rightmost column shows that in four sample problems c(AD^-1)/c[A(LL^T)^-1] is greater than 100.

N         c(A)          c(AD^-1)      c[A(LL^T)^-1]   c[AP(A)]      c(AD^-1)/c[A(LL^T)^-1]
377       1.1 x 10^4    3.1 x 10^2    2.3 x 10^1      4.5 x 10^3        13.5
456       1.2 x 10^4    3.1 x 10^3    2.5 x 10^2      4.6 x 10^3        12.4
812       1.8 x 10^10   5.6 x 10^2    2.9 x 10^1      7.3 x 10^9        19.3
1020      -             2.7 x 10^6    1.3 x 10^5      -                 20.8
1338S     -             3.0 x 10^4    9.0 x 10^1      -                333.3
1338T     1.6 x 10^4    2.0 x 10^2    2.5 x 10^0      5.1 x 10^3        80.0
1701      2.2 x 10^4    2.1 x 10^4    1.8 x 10^3      6.7 x 10^3        11.7
1802      5.0 x 10^8    1.2 x 10^4    8.2 x 10^1      1.6 x 10^8       146.3
2220      1.9 x 10^4    1.8 x 10^3    9.3 x 10^1      6.7 x 10^3        19.4
2304      -             6.9 x 10^4    7.8 x 10^1      -                884.6
4560I     6.4 x 10^6    8.7 x 10^4    1.5 x 10^1      2.7 x 10^6      5800.0
4560II    1.1 x 10^7    2.3 x 10^4    1.6 x 10^4      3.4 x 10^6         1.4

Table 3. CPU times and number of iterations ITE required by the vector MCG scheme with the Newton-Raphson preconditioner.

Table 4. Computational performance of vector MCG schemes implemented on a CRAY X-MP/48 with polynomial preconditioners of degree between 1 and 7. CPU times are given in seconds. ITE indicates the number of MCG iterations needed to meet the exit criterion. The speed-up S_P of the 1-degree polynomial with respect to the scalar ICCG preconditioner is also given. For the meaning of S_D and S_L see Table 1.

            Degree 1               Degree 2               Degree 3               Degree 4
N           CPU     ITE    MFL     CPU     ITE    MFL     CPU     ITE    MFL     CPU     ITE    MFL
377         0.198   465    53.6    0.233   365    51.79   0.234   295    50.79   0.239   244    50.15
456         0.125   314    60.88   0.167   297    58.65   0.190   262    57.43   0.202   227    56.63
812         1.321   *      -       •       *      -       •       *      -       •       *      -
1020        1.767   *      -       2.235   2093   56.55   2.886   2143   56.23   2.176   1311   54.89
1338S       2.840   1725   48.93   3.015   1274   47.07   3.158   1023   46.09   3.268   853    45.21
1338T       0.719   586    65.77   0.771   435    63.06   0.807   348    61.62   0.834   291    60.71
1701        0.156   151    73.53   0.158   106    68.57   0.162   84     66.76   0.163   69     66.14
1802        7.07    *      -       •       *      -       •       *      -       •       *      -
2220        1.353   453    59.78   1.486   339    57.63   1.539   266    56.51   1.496   208    55.83
2304        6.33    *      -       •       *      -       •       *      -       •       *      -
4560I       12.81   *      -       •       *      -       •       *      -       •       *      -
4560II      16.53   *      -       •       *      -       •       *      -       •       *      -

            Degree 5               Degree 6               Degree 7               Speed-up
N           CPU     ITE    MFL     CPU     ITE    MFL     CPU     ITE    MFL     S_P     S_D/S_P   S_L/S_P
377         0.241   207    49.75   0.249   184    49.44   0.254   165    49.19   0.9     6.5       2.8
456         0.208   197    56.06   0.214   175    55.65   0.216   155    55.37   4.2     2.0       0.6
812         •       *      -       •       *      -       •       *      -       <0.3    -         -
1020        2.168   1232   55.57   2.29    984    52.32   2.139   867    55.60   <1      -         -
1338S       3.413   725    43.37   3.345   632    44.42   3.339   559    44.55   0.4     4.0       7.0
1338T       0.840   246    60.10   0.852   215    59.67   0.854   189    59.34   0.4     10.0      7.3
1701        0.160   59     67.49   0.162   51     66.04   0.168   45     64.29   16.6    1.1       0.2
1802        •       *      -       •       *      -       •       *      -       <0.5    -         -
2220        1.599   186    55.37   1.622   162    55.03   1.634   143    54.78   1.6     3.2       1.8
2304        •       *      -       •       *      -       •       *      -       <0.3    -         -
4560I       •       *      -       •       *      -       •       *      -       <0.3    -         -
4560II      •       *      -       •       *      -       •       *      -       <1.2    -         -

*: 2500 iterations (maximum allowed in the code). •: CPU time is larger than that corresponding to the 1-degree polynomial preconditioner.


With the polynomial preconditioners (Table 4), ITE generally decreases as the degree of P(A) increases; however, the CPU times are always larger than the corresponding times required by the diagonal and the ICCG preconditioners. Note that, in general, the CPU time increases with the degree of the polynomial and that for some problems practical convergence (maximum number of ITE equal to 2500) is not attained. Table 4 also gives the speed-up of the polynomial preconditioner of degree 1 and compares it with the speed-ups of the diagonal and ICCG preconditioners. It appears that only in very few cases is the polynomial preconditioning better than the ICCG scalar scheme, and it is never superior to the diagonal scaling. To complete the picture, Table 2 supplies the spectral condition number c(AP(A)) for the most efficient P(A), i.e. the one with degree 1.

4. CONCLUSIONS

From the numerical experiments performed in the present analysis the following conclusive remarks may be summarized:

1. The preconditioned or modified conjugate gradient method for the solution of large arbitrarily sparse sets of symmetric positive definite equations can be effectively implemented on vector computers with speed-ups, with respect to the best scalar ICCG code, of up to one order of magnitude.

2. The numerical performance of MCG is in general related to the quality of the preconditioner (i.e. one yielding a small spectral condition number of the iteration matrix AK^-1) as well as to its suitability to generate a highly vectorizable procedure (i.e. large MFLOPS values), possibly combined with a low number of arithmetic operations per iteration. A simple low quality preconditioner may be superior, in terms of overall computational cost, to a complex good quality preconditioner.

3. The preconditioner that is most promising for vector computers, in that it is conceptually suited to vectorial architectures, i.e. the polynomial preconditioner, turned out to be less efficient than other preconditioners more commonly used on scalar machines. Increasing the degree of the polynomial does not increase the overall performance of the algorithm despite the more favourable condition numbers of the iteration matrix.

4. For most of the problems analyzed in the present paper the simple inexpensive diagonal scaling turned out to be superior to the best vector ICCG preconditioning, with a reduction in the CPU time of up to a factor of 3. Even when ICCG is faster, the speed gain is not very large, so that it can be concluded that for general unstructured problems, where efficient ad hoc schemes have not been developed, the simple diagonal preconditioner is highly recommendable on vector machines.

5. Careful analysis of the nature of the sample problems and computation of the various spectral condition numbers show that the diagonal scaling is likely to be less efficient than ICCG in structural and 3-D or quasi 3-D flow models in steady state conditions (i.e. 3-D elliptic type problems), when D^-1 is a much worse approximation to A^-1 than (LL^T)^-1. In the tests performed this situation occurs whenever c(AD^-1) is two orders of magnitude greater than c[A(LL^T)^-1].

ACKNOWLEDGEMENT

This work has been in part supported by the Italian CNR, Gruppo Nazionale per la Difesa dalle Catastrofi Idrogeologiche, Linea di Ricerca No. 4, and Progetto Finalizzato "Sistemi Informatici e Calcolo Parallelo".

REFERENCES


1. Anderson, E. and Saad, Y. Solving sparse triangular linear systems on parallel computers, Symposium on Vector and Parallel Processors for Scientific Computation, 2, Rome, 1987
2. Ashby, S. F. Polynomial preconditioning for conjugate gradient methods, Report No. UIUCDCS-R-87-1355, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 1987
3. Ashcraft, C. and Grimes, R. On vectorizing incomplete factorization and SSOR preconditioners, SIAM J. Sci. Stat. Comput., 1988, 9, 122-151
4. Axelsson, O. A class of iterative methods for finite element equations, Comp. Methods App. Mech. Engng., 1976, 9, 123-137
5. Axelsson, O. Solution of linear systems of equations: iterative methods, in Sparse Matrix Techniques, Barker, V. A. (ed.), Springer, Berlin, 1977, 1-51
6. Axelsson, O. A survey of preconditioned iterative methods for linear systems of algebraic equations, BIT, 1985, 25, 166-187
7. Baxter, D., Saltz, J., Schultz, M., Eisenstat, S. and Crowley, K. An experimental study of methods for parallel preconditioned Krylov methods, Report RR-629, Computer Science Dept., Yale University, New Haven, 1988
8. Concus, P., Golub, G. H. and O'Leary, D. P. A generalized conjugate gradient method for the numerical solution of elliptic partial differential equations, in Sparse Matrix Computations, Bunch, J. R. and Rose, D. J. (eds), Academic Press, New York, 1976, 309-332
9. Concus, P., Golub, G. H. and Meurant, G. Block preconditioning for the conjugate gradient method, SIAM J. Sci. Stat. Comput., 1985, 6, 220-252
10. CRAY, Library Reference Manual SR-0014, revision 1, 1984
11. CRAY, CAL Assembler Version 2, Reference Manual SR-2003, 1986
12. CRAY, FORTRAN (CFT) Reference Manual SR-0009, revision L-01, 1986
13. Dubois, P., Greenbaum, A. and Rodrigue, G. Approximating the inverse of a matrix for use in iterative algorithms on vector computers, Computing, 1979, 22, 257-268
14. Froberg, C. E. Introduction to Numerical Analysis, Addison-Wesley, New York, 1972
15. Gambolati, G. Fast solution to finite element flow equations by Newton iteration and modified conjugate gradient method, Int. J. Numer. Methods Eng., 1980, 15, 661-675
16. Gambolati, G. and Volpi, G. Analysis of the performance of the modified conjugate gradient method for solution of sparse linear sets of finite element equations, Third Int. Conf. on Finite Elements in Flow Problems, Banff, Canada, 1980, 256-264
17. Gambolati, G. and Perdon, A. M. The conjugate gradients in subsurface flow and land subsidence modelling, in Fundamentals of Transport Phenomena in Porous Media, Bear, J. and Corapcioglu, M. J. (eds), NATO-ASI Series, Applied Sciences 82, Martinus Nijhoff, The Hague, 1984, 953-984
18. Gambolati, G., Gatto, P. and Ricceri, G. Land subsidence due to gas/oil removal in layered anisotropic soils by a finite element model, 3rd Int. Symposium on Land Subsidence, IAHS-AISH Publ. No. 151, Venice, 1984, 23-41
19. Gambolati, G., Pini, G. and Tucciarelli, T. A 3-D finite element conjugate gradient model of subsurface flow with automatic mesh generation, Adv. Water Resources, 1986, 9, 34-41
20. Gambolati, G., Sartoretto, F. and Uliana, F. A conjugate gradient finite element model for large multiaquifer systems, Water Resour. Res., 1986, 22, 1003-1015
21. Gambolati, G., Pini, G. and Sartoretto, F. An improved iterative optimization technique for the leftmost eigenpairs of large symmetric matrices, J. Comp. Phys., 1988, 74, 41-60
22. Gambolati, G., Pini, G. and Zilli, G. Numerical comparison of preconditionings for large sparse finite element problems, Numerical Methods for PDE, 1988, 4, 139-157
23. Hughes, T. J. R., Ferencz, R. M. and Hallquist, J. O. Large-scale vectorized implicit calculations in solid mechanics on a CRAY X-MP/48 utilizing EBE preconditioned conjugate gradients, Comp. Methods App. Mech. Eng., 1987, 61, 215-248
24. Jackson, C. P. and Robinson, P. C. A numerical study of various algorithms related to the preconditioned conjugate gradient method, Int. J. Numer. Methods Eng., 1985, 21, 1315-1338
25. Johnson, O. G., Micchelli, C. A. and Paul, G. Polynomial preconditioners for conjugate gradient calculations, SIAM J. Numer. Anal., 1983, 20, 362-376
26. Kershaw, D. S. The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations, J. Comp. Phys., 1978, 26, 43-65
27. Kershaw, D. S. Solution of single tridiagonal linear systems and vectorization of the ICCG algorithms on the CRAY-1, in Parallel Computations, Rodrigue, G. (ed.), Academic Press, New York, 1982, 85-99
28. Kightly, J. and Jones, I. A comparison of conjugate gradient preconditionings for three-dimensional problems on a CRAY-1, Comput. Phys. Comm., 1985, 37, 205-214
29. Kincaid, D. R., Oppe, T. C., Respess, J. R. and Young, D. M. ITPACKV 2C User's Guide, Report CNA-191, Center for Numerical Analysis, Univ. of Texas, Austin, 1984
30. O'Leary, D. P. Parallel implementation of the block conjugate gradient algorithm, Parallel Comput., 1987, 5, 127-139
31. Lichnewsky, A. Some vector and parallel implementations of preconditioned conjugate gradient algorithms, in High Speed Computations, Kowalik, J. S. (ed.), Springer, Berlin, 1984, 343-359
32. Manteuffel, T. A. An incomplete factorization technique for positive definite linear systems, Math. Comp., 1980, 34, 473-497
33. Meijerink, J. A. and van der Vorst, H. A. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix, Math. Comp., 1977, 31, 148-162
34. Meier, U. and Sameh, A. The behavior of conjugate gradient algorithms on a multivector processor with a hierarchical memory, J. Comput. Applied Math. (special issue on iterative methods for solving linear systems of algebraic equations), in press, 1988
35. Melhem, R. Toward efficient implementation of preconditioned conjugate gradient on vector supercomputers, Int. J. Supercomputer Applications, 1987, 1, 70-98
36. Melhem, R. Parallel solution of linear systems with striped sparse matrices, Parallel Comput., 1988, 6, 165-184
37. Meurant, G. The block preconditioned conjugate gradient method on vector computers, BIT, 1984, 24, 623-633
38. Meurant, G. Multitasking the conjugate gradient method on the CRAY X-MP/48, Parallel Comput., 1987, 5, 267-280
39. Meurant, G. Conjugate gradient preconditioners for parallel computers, in Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, Los Angeles, SIAM, 1988
40. Meyer, P. D., Valocchi, A. J., Ashby, S. F. and Saylor, P. E. A numerical investigation of the conjugate gradient method as applied to three-dimensional groundwater flow problems in randomly heterogeneous porous media, Water Resour. Res., 1989, 25, 1440-1443
41. Perdon, A. M. and Gambolati, G. Extreme eigenvalues of large sparse matrices by Rayleigh quotient and modified conjugate gradients, Comp. Methods App. Mech. Engng., 1986, 56, 251-264
42. Peters, A., Romunde, B. and Sartoretto, F. Vectorized implementation of some MCG codes for F.E. solution of large groundwater flow problems, Proceedings of the International Conference on Computational Methods in Flow Analysis, Okayama, Japan, 1988
43. Poole, E. and Ortega, J. Multicolor ICCG methods for vector computers, SIAM J. Numer. Anal., 1987, 24, 1394-1418
44. Radicati di Brozolo, G. and Vitaletti, M. Sparse matrix-vector product and storage representations on the IBM 3090 with Vector Facility, Technical Report G513-4098, IBM ECSEC, Rome, 1986
45. Romine, H. and Ortega, J. Parallel solution of triangular systems of equations, Parallel Comput., 1988, 6, 109-114
46. Saad, Y. Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Stat. Comput., 1985, 6, 865-881
47. Saad, Y. and Schultz, M. H. Parallel implementation of preconditioned conjugate gradient methods, Tech. Report YALEU/DCS/RR-425, Computer Science Dept., Yale University, New Haven, CT, 1985
48. Seager, M. K. Parallelizing conjugate gradient for the CRAY X-MP, Parallel Comput., 1986, 3, 35-47
49. Stiefel, E. L. Kernel polynomials in linear algebra and their applications, U.S. NBS Applied Math. Series, 1958, 49, 1-24
50. van der Vorst, H. A. A vectorizable variant of some ICCG methods, SIAM J. Sci. Stat. Comput., 1982, 3, 350-356
51. van der Vorst, H. A. The performance of FORTRAN implementations for preconditioned conjugate gradients on vector computers, Parallel Comput., 1986, 3, 49-58
52. van der Vorst, H. A. ICCG and related methods for 3D problems on vector computers, Report A-18, Data Processing Center, Kyoto University, Kyoto, Japan, 1988
