Design of an iterative solution module for a parallel sparse matrix library (P_SPARSLIB) *

Yousef Saad **, Kesheng Wu

University of Minnesota, Department of Computer Science, Minneapolis, MN 55455, USA

Applied Numerical Mathematics 19 (1995) 343-357

* Work supported in part by ARPA under grant number NIST 60NANB2D1272, in part by NSF under grant number NSF/CCR-9214116, and in part by AHPCRC (University of Minnesota) under Army Research Office grant number DAAL03-89-C-0038.
** Corresponding author.

Abstract

P_SPARSLIB is a library of portable FORTRAN routines for parallel sparse matrix computations. The current thrust of the library is in iterative solution techniques. In this paper we present the "accelerators" part of the library, which comprises the best-known Krylov subspace techniques. The iterative solution module is implemented in reverse communication mode in order to allow any preconditioner to be combined with the package. In addition, this mechanism allows us to ensure portability, since the communication calls required in the iterative solution process are hidden in the dot-product, the matrix-vector product and the preconditioning operations.

1. Introduction

In many scientific and engineering applications one must solve a linear system of the form Ax = b, where the coefficient matrix A can be very large and sparse. For such systems, iterative methods that exploit sparsity are often used because of their advantage in storage and computational requirements. Iterative methods are especially attractive for the new class of very large three-dimensional problems that are emerging because of the dramatic improvement in computational power provided by parallel processing. In a multiprocessing environment, iterative methods have the further advantage that they are far simpler to implement than sparse direct methods. Because of the gain in popularity of these methods and their importance for multiprocessing environments, the development of a parallel sparse matrix library may play a major role in facilitating the use of efficient iterative solvers in high-performance computing environments.
P_SPARSLIB is intended to be a portable, general-purpose parallel sparse matrix package. It emphasizes iterative techniques but it is not a parallel iterative package per se. Indeed, it will include a number of other tools directed towards facilitating the development and prototyping of sparse iterative solvers in parallel environments. Such tools include reordering routines, partitioners, direct solution modules, and analysis modules, to cite just a few examples. In this paper we present the module of the library which consists of the Krylov subspace accelerators. In the last three decades, a large number of iterative solvers have been developed, and as a result we had a rich selection of algorithms to choose from when implementing this module. However, we found that the typical performance of these algorithms can be rather well represented by a smaller subset of the methods available in the literature. We have selected the following subset for the library:

• the conjugate gradient (CG) method,
• conjugate gradient on the normal equations (CGNR, i.e., residual version),
• bi-conjugate gradient (BCG),
• direct implementation of Lanczos-BCG (DBCG),
• bi-conjugate gradient stabilized (BiCGSTAB),
• transpose-free quasi-minimum residual method (TFQMR),
• generalized minimum residual (GMRES),
• flexible GMRES (FGMRES),
• direct implementation of quasi-GMRES (DQGMRES).

The above list includes some of the most commonly used accelerators and, although not exhaustive, is very representative of the state of the art. A characteristic of these Krylov subspace methods is that they utilize the coefficient matrix only in the form of matrix-vector operations. As will be seen, it is then possible to program these solvers in a way that is independent of the specific sparse matrix storage format used. This decoupling from the data structure is particularly important in a parallel computing environment, since we would like to adapt the storage format to the architecture. When designing this module we attempted to achieve a few of the basic qualities of software systems, such as convenience of use, reliability, flexibility, and good performance. These requirements are often in conflict with each other. Although we tried not to sacrifice performance too much, we feel that the current implementation, which emphasizes modularity via the reverse communication mechanism, will be far more amenable to incorporating improvements that may be obtained in the other modules, such as the preconditioning techniques or the basic sparse matrix kernels. In the rest of this paper, we discuss each of the iterative solvers, the protocol used for reverse communication, and the stopping criteria used. We also show examples of the use of the iterative solvers on different types of machines, to demonstrate the portability and flexibility of the routines.


2. Krylov subspace iterative solvers

For reference, in what follows we will call ITERS the iterative solution module of P_SPARSLIB. We begin by giving a brief introduction of the iterative solvers implemented. In the second half of this section we will briefly discuss the number of operations required by the algorithms implemented.

2.1. The algorithms

Conjugate gradient (CG)
This well-known algorithm was independently developed by Hestenes and Stiefel (1952). For further references see, for example, [6] or [1].

CGNR
CGNR is equivalent to solving the linear system A^T A x = A^T b by a CG method and can therefore be used for solving nonsymmetric linear systems as well as least-squares problems. Since A^T A is always positive semi-definite, the method is guaranteed, in theory, to converge to the solution. CGNR may be a good approach for highly indefinite matrices. For example, if the matrix is unitary, then it can solve the linear system in just one step, whereas most of the other Krylov subspace projection methods will typically converge slowly. For typical problems arising from the discretization of elliptic partial differential equations, CGNR usually converges much more slowly than a Krylov subspace method applied to the original system Ax = b, so this approach is not as popular in this particular context.

Bi-conjugate gradient (BCG)
BCG is a variation of the Lanczos algorithm for solving linear systems [7] and was developed in its best known form by Fletcher [3]. It is a generalization of the conjugate gradient algorithm for solving non-Hermitian linear systems. For references see, for example, [9,5,8,18]. Implicitly, the Lanczos algorithm is a projection technique onto a Krylov subspace, and the representation of the projected matrix is tridiagonal. In BCG the projected tridiagonal system is solved progressively as the tridiagonalization is performed. Since no pivoting is performed during the solution of the tridiagonal system, the algorithm is vulnerable to potential break-downs, or, more generally, the approximation delivered may be very inaccurate in some cases.

Direct implementation of BCG (DBCG)
DBCG is a variant of BCG in which the tridiagonal system mentioned above is solved with partial pivoting. In the original BCG algorithm the tridiagonal matrix generated from the Lanczos procedure is solved by Gaussian elimination without pivoting. The pivoting improves the stability of the iterative solver and eliminates some of the potential break-downs in the BCG algorithm. The partial pivoting option seems to be helpful only for the "harder" problems. This approach is borrowed from [10].

Bi-conjugate gradient stabilized (BiCGSTAB)
The motivation for this algorithm is to eliminate the need to use A^T in the BCG algorithm [18], as well as to smooth the convergence of the conjugate gradient squared (CGS) method [17] upon which it is built. CGS is the first of this class of techniques referred to as transpose-free variants of the bi-conjugate gradient method.


CGS obtains iterates whose residual polynomials are the squares of those of standard BCG. This has the effect of accelerating convergence, for the same amount of work, in the cases when convergence is fast. However, when the original iterates diverge or have a very erratic convergence behavior, CGS may behave even more poorly. Because of these potential difficulties with CGS, we did not implement it and preferred to implement only BiCGSTAB instead.
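For reference, the relation between the residual polynomials of these methods (see [17,18]) can be summarized as follows, where phi_j denotes the BCG residual polynomial and psi_j(t) = (1 - omega_1 t) ... (1 - omega_j t) is the smoothing polynomial constructed by BiCGSTAB:

    r_j(BCG)      = phi_j(A) r_0,
    r_j(CGS)      = phi_j(A)^2 r_0,
    r_j(BiCGSTAB) = psi_j(A) phi_j(A) r_0.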

Transpose-free quasi-minimum residual method (TFQMR)
The underlying idea of this transpose-free variant of BCG, developed by Freund [4], is to "quasi-minimize" the residual norm of the current approximation based on the iterates obtained from CGS. The quasi-minimal residual (QMR) algorithm [5] is based on the Lanczos tridiagonalization and requires the use of the transpose of A. It attempts to produce residual norms that are "quasi-optimal", in that an expression for the residual norms is minimized by "pretending" that the Lanczos vectors were orthonormal. Despite the lack of orthogonality, the residual vectors tend to behave smoothly and show an almost monotonic decrease. TFQMR is in effect a QMR [5] implementation of CGS which avoids the use of the transpose of A.

Generalized minimum residual (GMRES)
This algorithm, described in [14], differs from the Lanczos-based methods in that it reduces the matrix to Hessenberg form instead of tridiagonal form, and it uses Givens rotations rather than Gaussian elimination to solve the resulting Hessenberg linear system. GMRES minimizes the residual norm at every step, so the residual norm is guaranteed not to increase. For very indefinite problems, the restarted version of GMRES, GMRES(m), may show a stagnation behavior, characterized by an almost constant (nonzero) residual norm after a certain step is reached.
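For reference, in the notation of [14], the approximation produced by GMRES at step m can be written as

    x_m = x_0 + V_m y_m,   with   y_m = argmin_y || beta e_1 - H_m y ||_2,   beta = ||r_0||_2,

where V_m = [v_1, ..., v_m] holds the orthonormal Arnoldi basis of the Krylov subspace and H_m is the (m+1) x m upper Hessenberg matrix produced by the Arnoldi process; this small least-squares problem is the one solved with the Givens rotations mentioned above.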

Flexible GMRES (FGMRES)
FGMRES is a variation of GMRES whose objective is to allow variations in the preconditioning [13]. This allows many combinations of preconditioners that cannot be used with the standard GMRES variant, and thus flexibility is greatly enhanced. For example, any iterative solver can now be used as a preconditioner to FGMRES, including GMRES itself, CGNR, etc.

Direct implementation of quasi-GMRES (DQGMRES)
This is another variation of GMRES which is based on incomplete orthogonalization [15]. In order to limit the size of the Krylov subspace, i.e., the work space required, GMRES and FGMRES restart the algorithm: after a certain number of steps a new outer iteration of GMRES/FGMRES is started with the initial guess set to the latest approximate solution obtained. DQGMRES, on the other hand, keeps a fixed number of the most recent Krylov vectors, and the newly generated Krylov vector is made orthogonal only to them. As a result the Krylov basis is only "incompletely orthogonal". This provides some continuity in the solution process and has the effect of reducing the stagnation phenomenon often observed with GMRES for indefinite problems. In our implementation, DQGMRES has a flexibility feature similar to that of FGMRES.
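For reference, the incomplete orthogonalization underlying DQGMRES(k) (in the spirit of [15]) replaces the full Arnoldi recurrence by

    w = A v_j,
    h_ij = (w, v_i),   w := w - h_ij v_i,   for i = max(1, j-k+1), ..., j,
    h_{j+1,j} = ||w||_2,   v_{j+1} = w / h_{j+1,j},

so that only the k most recent basis vectors need to be kept and orthogonalized against.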


Table 1
Complexity of the algorithms

Method        Memory       FLOP
CG            5n           8n
CGNR          5n           4n
BCG           7n           7n
DBCG          11n          7n
BiCGSTAB      9n           10n
TFQMR         11n          11n
GMRES(m)      (m+2)n       (2m+7)n
FGMRES(m)     (2m+2)n      (2m+7)n
DQGMRES(k)    (2k+1)n      (6k+2)n

2.2. Complexity of the algorithms

The main computational kernels of all the above methods are as follows: (1) SAXPY, (2) dot-product, (3) matrix-vector multiplication, (4) preconditioning operation. In addition, some of the methods require some lower-order computations which are not represented here. In our implementation, only the SAXPY and dot-product operations are part of the iterative solvers. The matrix-vector multiplications and preconditioning operations are performed by the caller. Details of this reverse communication mechanism will be discussed in Section 3. In the current implementation, the work space required by the solvers must be passed to them.

We will now briefly discuss the work space requirements and the time complexities of the solvers. Table 1 lists the memory and arithmetic complexities, where n is the dimension of the linear system. For both GMRES(m) and DQGMRES(k), m, k << n. The time complexity shown in the table is the number of floating-point operations required to update the solution, averaged over the number of matrix-vector multiplications performed (assuming the number of iterations is much larger than m and k). We should emphasize that the operation counts shown in the table exclude the more important operations related to the matrix, i.e., the matrix-vector products and preconditioning operations. The cost of these operations is typically much higher than that of the SAXPY and dot-product operations represented in Table 1. The operation counts listed in the table assume that the least costly stopping tests are used. These stopping tests are based on the residual norms, since most of the methods provide an inexpensive strategy for computing or estimating the Euclidean norm of the residual vectors. The data in Table 1 were obtained by counting the work space size and the number of floating-point operations used by the programs. In order to accommodate preconditioning in a uniform fashion, one or two more vectors are used than in the standard nonpreconditioned algorithms shown in the literature. The numbers in the table are only accurate to their highest-order term in n. The exact amount of work space required may be slightly higher than indicated for GMRES and its variants, since these require some additional space to store the Hessenberg matrix.
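As a concrete reading of Table 1 (a back-of-the-envelope estimate only, assuming 8-byte reals): for a system of dimension n = 10^6, GMRES(20) needs a work space of about (m + 2)n = 22 x 10^6 words, i.e., roughly 176 MB, and each solution update costs on the order of (2m + 7)n = 47 x 10^6 floating-point operations per matrix-vector multiplication, in addition to the matrix-vector product and preconditioning operations themselves.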


2.3. Parallel implementation

There are four types of operations required by all of the iterative linear system solvers: (1) vector updates, (2) inner products, (3) matrix-vector multiplications, and (4) preconditioning operations. An implementation in a distributed memory environment is described in [12]. The coefficient matrix is distributed rowwise according to a selected mapping which is very general. In the simplest, but by no means best, case we can map rows cyclically to processors 1, 2, ..., p. A mapping actually maps unknowns and their associated rows to processors. All vector variables are split conformally with this mapping. For example, if rows 1 to 10 are mapped to processor 15, then components 1 to 10 of every vector will also be mapped to processor 15. A preprocessing phase is then performed in which the processors prepare the ground for the exchange of unknowns that will be needed at each step of the Krylov subspace iteration. The matrix-vector product then consists of a local matrix-vector product, an exchange of "boundary" information, and a second matrix-vector product which involves only boundary elements. Here, by boundary element we mean an element which is coupled with data located on another processor. The only other operations that require communication are the dot-product and the preconditioning operations. A dot-product consists of adding up the local dot-products of all the local subvectors. This sum is often best computed using native libraries, which are often provided. In our implementation of the iterative solvers, the dot-product function distdot has the same calling sequence as the dot-product function ddot in BLAS-1:

      real*8 function distdot(n, x, ix, y, iy)
      integer n, ix, iy
      real*8 x(1:ix*(n-1)+1), y(1:iy*(n-1)+1)

On a uniprocessor machine or a shared memory machine, distdot can simply be a wrapper for ddot. On a distributed machine, distdot must be replaced by a parallel dot-product function which returns the result of the operation to every processor, as just discussed (an illustrative sketch is given at the end of this subsection). The accelerators themselves do not require any modification to run on a distributed machine. The communication required in the preconditioning operations depends on the preconditioner. One can use a preconditioner which requires no communication, e.g., block Jacobi, but this is not necessarily a good choice, since the number of iterations may be quite high. In the next section, we will discuss a well-known strategy, known as reverse communication, for hiding the data structures of the matrices from the iterative solvers.
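As an illustration only (the library itself does not prescribe a particular message-passing system), a parallel distdot built on MPI could look like the following sketch; the BLAS-1 ddot and MPI_ALLREDUCE are the only external calls assumed:

      real*8 function distdot(n, x, ix, y, iy)
      implicit none
      include 'mpif.h'
      integer n, ix, iy, ierr
      real*8 x(1:ix*(n-1)+1), y(1:iy*(n-1)+1)
      real*8 sloc, sglob, ddot
      external ddot
c     local contribution: ordinary dot-product of the local subvectors
      sloc = ddot(n, x, ix, y, iy)
c     sum the local contributions over all processors; every processor
c     receives the global result, as required by the accelerators
      call MPI_ALLREDUCE(sloc, sglob, 1, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
      distdot = sglob
      return
      end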

3. Reverse communication

The rather large number of different data structures that exist in sparse matrix techniques makes it difficult to develop a library of iterative solvers which accommodates all possible cases in a unified calling sequence. One obvious solution would be to limit the number of possible data storage schemes and have users make the necessary transformations from their data structures to the few choices supported by the library, using transformation routines from libraries such as SPARSKIT [11]. This has the disadvantage of providing a rigid architecture. An alternative which is becoming increasingly popular in software packages is to bypass data structures altogether by resorting to what is referred to as "reverse communication".


to as "reverse communication". The reverse communication mechanism consists of putting the iterative solution subroutine in the body of a FORTRAN loop which is run for as long as it is needed to achieve convergence. Every time the subroutine returns, it will indicate to the caller what operation must be done (e.g., a matrix-vector product) with the help of a flag (called i c o d e in the following illustrative segment of pseudo-code). Once the caller has performed the requested operation, the control is given back to the solver by calling it again. Thus, a communication loop in the simplest case would look as follows, icode = 0 I continue call solver (n, param, iparm, wkl, wk2, icode) if (icode .eq. i ) t h e n call precon(n, wkl, wk2) <-- preconditioning operation goto 1 else if (icode .eq. 2) then call matvec (n, wkl, wk2) <--matrix--vector product goto 1 endif The correct control path inside the s o l v e r routine requires some additional care and can be implemented with the help of ASSIGN statements or computed GOTO statements. Reverse communication allows us to move the matrix dependent operations outside of the iterative solvers, leaving only BLAS-1 operations and other simple operations within the body of the iterative solvers. For the purpose of portability, we only require the user to provide a dot-product routine, since distributed versions or sequential versions of it may be needed depending on the environment. The other BLAS-1 type operations are coded directly in the solvers. An advantage of using reverse communication is that it provides maximum flexibility for the iterative solvers. The user has the complete control over the matrix storage format to use, as well as the implementation of the key matrix dependent operations, such as the matrix-vector multiplications and the preconditioning operations. On the negative side, reverse communication may impose an overhead to the program because of the increased number of function calls required. However, as can be expected for large problems this overhead is likely to be small compared with the cost of the arithmetic operations. We verified this by performing some tests, and observed that on most machines considered, the overhead was, in the worst case, a very small percentage of the total execution time. Another concern is that reverse communication shifts the responsibility of performing the basic matrix dependent operations to the user. The ITERS code does not take the matrix as part of the actual inputs, as a result there is no way for the iterative solvers to check whether a failure resides in the method itself or in the user's implementation of the matrix operations. However, we note that there are subroutines, e.g., from SPARSKIT, that can be used to translate specified storage schemes for which well-tested matrix-vector multiplications and preconditioners are available. Using the basic building blocks from this and other packages, the time required for developing or prototyping a given iterative solution routine can be greatly reduced. Some illustrations will be provided in later sections. Before giving some details of the reverse communication protocol, we would like to list all the


Before giving some details of the reverse communication protocol, we would like to list all the possible matrix-dependent operations that will be required in the course of the iteration process. When no preconditioner is applied, the iterative solvers will perform matrix-vector multiplications with A, namely operations of the form v = A u, and with A^T, namely v = A^T u. With both left and right preconditioning, the two matrix-vector multiplications are replaced by v = M_L^{-1} A M_R^{-1} u and v = M_R^{-T} A^T M_L^{-T} u. Therefore, there is a total of six possible operations, namely:
(1) v = A u, matrix-vector multiplication with A,
(2) v = A^T u, matrix-vector multiplication with A^T,
(3) v = M_L^{-1} u, left preconditioning operation,
(4) v = M_L^{-T} u, transposed left preconditioning operation,
(5) v = M_R^{-1} u, right preconditioning operation,
(6) v = M_R^{-T} u, transposed right preconditioning operation.
These operations are performed outside of the solver routine. Upon return from the solver, the user determines which of these six operations is requested, performs the desired operation on the specified vectors, and calls the solver again. In addition, the solver may also return with error or termination conditions. Finally, in our current implementation, the user can also opt to apply his or her own set of convergence or termination criteria. In such cases, the iterative solver must be able to communicate to the user the request to apply the stopping test. Considering these and the termination conditions that will be discussed later (Section 4), we have decided to use the following return values for the reverse communication status parameter:
  1: request a matrix-vector multiplication with A,
  2: request a matrix-vector multiplication with A^T,
  3: request a left preconditioning operation with M_L,
  4: request a transposed left preconditioning operation with M_L^T,
  5: request a right preconditioning operation with M_R,
  6: request a transposed right preconditioning operation with M_R^T,
 10: request that the caller perform a stopping test,
  0: normal termination of the solver; the convergence test is satisfied,
 -1: termination because the number of matrix-vector multiplications exceeds the preset limit,
 -2: return due to insufficient work space,
 -3: return due to an anticipated break-down (division by zero).

4. Stopping criteria

To determine when to stop an iteration, we need to define a stopping criterion based on a set of conditions. An iterative solver may stop for one of the following reasons:
• the convergence test is satisfied,
• the iteration count limit is exceeded,
• there are insufficient resources to perform or continue the algorithm,
• the algorithm breaks down,
• an external error occurs.
We will address each of the above items in turn. External errors are errors that are external to the body of the iterative solver itself. When a solver requests that the caller perform a given operation, the operation may be unavailable, fail, or produce a wrong answer. The caller must be able to detect this and stop the reverse communication loop.


If the user does not guarantee the correctness of the service routines, the iterative solver may break down, fail to converge, or converge to an incorrect solution without being able to identify the source of the error. Most of the iterative solvers may break down in some circumstances. Often this is due to a division by zero, and it is easily detected in the iterative solvers. In the case of a so-called "lucky break-down", characterized by the solution being exact, the solver should indicate that the convergence test is satisfied rather than report a failure. Insufficient resource errors occur when the iterative solver does not have enough work space to perform its operations. The size of the work space is checked before the iteration starts. When this error is detected, the minimum required work space size is reported. It is common in implementations of iterative solvers to limit the maximum number of iterations allowed, in order to prevent cases where the iterative solver makes no progress towards the solution, or progress that is unacceptably slow. In our implementation, the limit is imposed on the total number of matrix-vector multiplications. The most complex and important stopping criterion is the convergence test. One can decide to stop the iteration based on the size of the residual norm, the estimated error norm, or the correction to the approximate solution at each update. Norms that are commonly used are the 1-norm, the 2-norm, and the infinity-norm. The tests can be based on an absolute tolerance or a relative tolerance. More complicated convergence tests may be devised using some backward error analysis. For the sake of simplicity, we decided to provide only four different convergence tests in our iterative solvers. If we define dx_i = x_i - x_{i-1} and r_i = b - A x_i, these tests are

    ||r_i||  <= tau_r ||r_0||  + tau_a,      (1)
    ||r_i||  <= tau_r ||b||    + tau_a,      (2)
    ||dx_i|| <= tau_r ||dx_1|| + tau_a,      (3)
    ||dx_i|| <= tau_r ||b||    + tau_a,      (4)

in which tau_a is the absolute tolerance and tau_r is the relative tolerance, both supplied by the user. We use only the 2-norm in all the iterative solvers. The user can specify one of the above four convergence tests to be used in the iterative solvers. Alternatively, the caller may also direct the iterative solver to use a user-defined termination test via the reverse communication protocol discussed earlier (see Section 3). In this situation, after each update of the solution x_i, the iterative solver will ask the caller to perform a stopping test. If no stopping criterion is specified, the iterative solver will select a default stopping test, which is the first one in the above list.
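As an illustration only (the function name and argument list below are hypothetical, not part of ITERS), the four tests can be coded as:

      logical function convtest(itest, rnorm, r0norm, bnorm,
     &                          dxnorm, dx1norm, taur, taua)
c     illustrative sketch of convergence tests (1)-(4); itest selects
c     the test, rnorm/r0norm/bnorm/dxnorm/dx1norm are the 2-norms of
c     r_i, r_0, b, dx_i and dx_1, and taur/taua are the tolerances
      integer itest
      real*8 rnorm, r0norm, bnorm, dxnorm, dx1norm, taur, taua
      if (itest .eq. 1) then
         convtest = rnorm .le. taur*r0norm + taua
      else if (itest .eq. 2) then
         convtest = rnorm .le. taur*bnorm + taua
      else if (itest .eq. 3) then
         convtest = dxnorm .le. taur*dx1norm + taua
      else
         convtest = dxnorm .le. taur*bnorm + taua
      endif
      return
      end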

5. Using the iterative solvers

In this section we will show two segments of pseudo-code to illustrate the use of the iterative solvers. Then we will show some results of a set of experiments. In introducing the pseudo-code, we will also give additional details on the reverse communication protocol.


5.1. Two examples

First we will give the prototype of the iterative solvers. All the iterative solvers are programmed with the following standard interface:

      solver(n, b, x, ipar, fpar, wk)
      integer n, ipar(16)
      real*8 b(n), x(n), fpar(16), wk(*)

where b is the right-hand side, x is the solution, n is the dimension of the linear system, wk is the work space, and ipar, fpar are two arrays used for passing various parameters, including those used in the reverse communication protocol. The complete description of the parameters is listed in [16]. Here we will give only enough information to explain the following segment of pseudo-code.

      ipar(1) = 0
 10   call solver(n, b, x, ipar, fpar, wk)
      if (ipar(1).eq.1) then
         call amux(n, wk(ipar(8)), wk(ipar(9)), a, ja, ia)
         goto 10
      else if (ipar(1).eq.2) then
         call atmux(n, wk(ipar(8)), wk(ipar(9)), a, ja, ia)
         goto 10
      else if (ipar(1).eq.5) then
         call lusol(n, wk(ipar(8)), wk(ipar(9)), alu, jalu, iau)
         goto 10
      else if (ipar(1).eq.6) then
         call lutsol(n, wk(ipar(8)), wk(ipar(9)), alu, jalu, iau)
         goto 10
      else if (ipar(1).gt.0) then
c        preconditioner or convergence test not implemented
      else
c        solver terminated with code = ipar(1)
      endif

This segment of pseudo-code illustrates how reverse communication is used in an actual iterative solution procedure. It implements an iterative solver with right preconditioning. It is assumed that the sparse matrix is stored in the CSR format as defined in SPARSKIT [11]. The subroutines amux and atmux are SPARSKIT routines that perform the matrix-vector operations v = A u and v = A^T u, respectively. For the preconditioner, we assume that an ILU-type preconditioner is being used, such as ILUT or ILU0 from SPARSKIT. The result of the incomplete factorization is stored in the arrays alu, jalu, iau. The subroutine lusol solves the two triangular systems from the LU factorization, v = (LU)^{-1} u, and lutsol performs v = (LU)^{-T} u. In the four matrix-vector multiplication and triangular solution routines, the second argument in the calling sequence is the input vector u, and the third argument is the output vector v. Before calling the iterative solvers, the first six elements of ipar and the first two elements of fpar must be specified. In this example, we only show the initial assignment to ipar(1); see [16] for an example of assigning all eight input parameters.


parameters. Three elements of the array ipar are used here. The parameter ipar(1) is used as the status variable discussed before. The parameter i p a r ( 8 ) is a pointer to the input vector u, i.e., u( 1 ) = w k ( i p a r ( 8 ) ), u(2) = w k ( i p a r ( 8 ) + 1) . . . . . u ( n ) = w k ( i p a r ( 8 ) + n - 1 ). The i p a r (9) points the output vector v from the user-s,pplied matrix operation--whether a matrix-vector multiplication or a preconditioning operation. In this example, only right preconditioning is supported. Next we will show an example of using an iterative solver as a preconditioner to another iterative solver.

      ipar(1) = 0
 10   call dqgmres(n, b, x, ipar, fpar, wk)
      if (ipar(1).eq.1) then
         call amux(n, wk(ipar(8)), wk(ipar(9)), a, ja, ia)
         goto 10
      else if (ipar(1).eq.5) then
         jpar(1) = 0
 20      call cgnr(n, wk(ipar(8)), wk(ipar(9)), jpar, gpar, wk2)
         if (jpar(1).eq.1) then
            call amux(n, wk2(jpar(8)), wk2(jpar(9)), a, ja, ia)
            goto 20
         else if (jpar(1).eq.2) then
            call atmux(n, wk2(jpar(8)), wk2(jpar(9)), a, ja, ia)
            goto 20
         endif
         goto 10
      endif

This example shows how DQGMRES uses CGNR as a preconditioner in an "inner-outer iteration" loop. The inner iteration, CGNR, uses its own set of parameters jpar, gpar and its own work space. Notice that we could precondition the inner iteration as well. We may use any iterative solver in place of CGNR, except DQGMRES itself, since otherwise the local variables saved with the Fortran SAVE statement would be lost when passing from the outer DQGMRES call to the inner call. The same set of local variables cannot be used to solve two different systems at the same time. Note that in this inner-outer iteration scheme we can alternate among several preconditioners or dynamically adjust the parameters of the preconditioners. An example of this type will be shown later.

5.2. Experiments

We have performed a set of experiments on a few different architectures to demonstrate the capabilities of the iterative solvers. We will first report on the speed (in megaflops) of the solvers relative to their major components, then show some data on the convergence behavior of the iterative solvers. In one of the examples we use an ILUT preconditioner, and in another we mix two types of preconditioners. In our first experiment we do not use any preconditioning. All the timings are obtained for the parameters tau_a = 0 and tau_r = 2.2 x 10^-16, the latter being the unit roundoff of 64-bit arithmetic on the SPARCs.


The default stopping criterion, namely test (1), is used. With the given tolerance specifications and the "hard" problems we have chosen, all the iterative solvers are made to run for a maximum of 100 matrix-vector multiplications. The tests have been performed on a SPARC-2, a SPARC-10, a CRAY C90 and a CM-5. Since these types of machines have vastly different capabilities, we tested problems of different sizes. On the SPARCs, the test problem is constructed from a matrix in the Harwell-Boeing collection [2] named SHERMAN5. It is a matrix arising from a black oil simulator on a 16 x 23 x 3 grid. The matrix has 3312 rows and 20,793 nonzero elements; for details see [2]. The matrix used on the CRAY is BBMAT, an exact Jacobian from a 2-D airfoil simulation. It has 38,744 rows and 1,771,722 nonzero elements. The matrix used on the CM-5 is from a finite difference discretization of Poisson's equation on a rectangular grid of size 320 x 320. The partition we performed the test on has 32 PNs. The program is written in SPMD mode using CMMD library functions. At the time of this test, we were unable to utilize the vector units available.

Table 2 shows the speed of the major components of the iterative solvers, namely the SAXPY, dot-product and matrix-vector multiplication routines. The matrices are stored in the standard compressed sparse row (CSR) format [11]. On the SPARCs and the CRAY, the matrix-vector multiplication routine AMUX is from SPARSKIT. It is constructed from dot-product operations (see the sketch at the end of this subsection). Due to the indirect addressing and the short inner loops, the performance of this matrix-vector multiplication is far from optimal on the CRAY. This also makes the program unsuitable for multiprocessing. Since our goal here is to compare the overall speed of the iterative solvers with that of their major components, we have not attempted to optimize the performance of the AMUX routine on the CRAY and report results using only one PE. For a similar reason, we did not use a larger partition on the CM-5.

From Table 3 we can see that the overall speeds of most of the iterative solvers are very close to that of the matrix-vector multiplication routine. The GMRES algorithm and its variants have more BLAS-1 type operations in addition to the matrix-vector multiplications, which explains their higher overall speeds compared with the other solvers.

Fig. 1 shows the convergence history when solving SHERMAN5 using ILUT(3, 10^-3) as the preconditioner. This example uses only four of the nine solvers. Since the preconditioner is fixed, FGMRES behaves exactly like GMRES, as is confirmed by the experiments, so we do not show both of them. The rest of the iterative solvers generally converge more slowly than these four. The same convergence criterion as before is used, with tau_r = 10^-7 and tau_a = 2.2 x 10^-16. Fig. 2 shows an example of flexible preconditioning. The solver used is DQGMRES, and the preconditioner alternates between one of the regular solvers and SSOR. The number of matrix-vector multiplications shown on the horizontal axis is the total number of matrix-vector multiplications from both the outer and inner iterations. The number of iterations taken by DQGMRES is not very different from that shown in Fig. 1.
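For reference, a row-oriented CSR matrix-vector product in the spirit of SPARSKIT's amux looks like the following sketch (an illustration, not the SPARSKIT source); the short dot-product inner loop and the indirect addressing u(ja(k)) are what limit its performance on a vector processor such as the C90:

      subroutine csrmv(n, u, v, a, ja, ia)
c     sketch of a CSR matrix-vector product v = A*u in the spirit of
c     SPARSKIT's amux (not the SPARSKIT source); a, ja, ia hold the
c     matrix in compressed sparse row format
      integer n, ja(*), ia(n+1), i, k
      real*8 u(*), v(n), a(*), t
      do 10 i = 1, n
         t = 0.0d0
c        dot-product of (sparse) row i with u; note the indirect
c        addressing u(ja(k)) and the short inner loop
         do 20 k = ia(i), ia(i+1)-1
            t = t + a(k)*u(ja(k))
 20      continue
         v(i) = t
 10   continue
      return
      end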



Fig. 1. Problem SHERMAN5 solved with the ILUT(3, 10^-3) preconditioner (convergence history versus the number of matrix-vector multiplications).

Fig. 2. Problem SHERMAN5 solved by the inner-outer iteration scheme with a preconditioner alternating between SSOR and an iterative solver (convergence history versus the total number of matrix-vector multiplications).


Table 2
Speed (megaflops) of the major components of the iterative solvers

                SPARC 2    SPARC 10    C90 1 PE    CM-5 32 PNs
AMUX              2.0        7.7          96           54
SAXPY             4.5       10.9         797           75
dot-product       5.5       22.5         830           90

Table 3
Overall speed (megaflops) of the iterative solvers

                SPARC 2    SPARC 10    C90 1 PE    CM-5 32 PNs
CG                2.3        7.6         106           47
BCG               2.0        6.6         115           --
DBCG              2.0        6.7         115           --
CGNR              2.0        6.9         112           --
BiCGSTAB          2.2        7.6         106           48
TFQMR             2.2        7.3         112           47
GMRES             3.0        9.3         129           64
FGMRES            2.9        9.1         129           63
DQGMRES           2.9        9.1         171           65

6. Concluding remarks

We have shown experiments using the same set of iterative solvers on three very different architectures: a sequential machine, a shared memory machine, and a distributed memory machine. With the reverse communication implementation, the tasks requiring architecture-dependent operations have been isolated from the solvers themselves, and as a result there is no difference between a code that is run on the CM-5 and one that is run on the more traditional machines. On a massively parallel machine such as the CM-5, communication is achieved as part of the distributed dot-product function, the matrix-vector multiplication and the preconditioning operations. This module of the P_SPARSLIB library is therefore portable across a variety of architectures and can be used to solve general sparse linear systems. A module of iterative solvers without preconditioners would not be too effective in solving realistic problems. We are currently developing a module of preconditioners, to be used in conjunction with the accelerators. Needless to say, this is a far more challenging task than that of assembling a collection of conjugate gradient-like accelerators. The preconditioners will depend on the architecture and the data structure used, as well as on the type of problems at hand.

References

[1] O. Axelsson and V.A. Barker, Finite Element Solution of Boundary Value Problems (Academic Press, Orlando, FL, 1984).
[2] I.S. Duff, R.G. Grimes and J.G. Lewis, User's guide for the Harwell-Boeing sparse matrix collection, Tech. Rept. TR/PA/92/86, CERFACS, Toulouse (1992).
[3] R. Fletcher, Conjugate gradient methods for indefinite systems, in: G.A. Watson, ed., Proceedings of the Dundee Biennial Conference on Numerical Analysis 1974, University of Dundee (Springer, New York, 1975) 73-89.
[4] R.W. Freund, A transpose-free quasi-minimal residual algorithm for non-Hermitian linear systems, SIAM J. Sci. Comput. 14 (1993) 470-482.
[5] R.W. Freund and N.M. Nachtigal, QMR: a quasi-minimal residual method for non-Hermitian linear systems, Numer. Math. 60 (1991) 315-339.
[6] G.H. Golub and C. Van Loan, Matrix Computations (Academic Press, New York, 1981).
[7] C. Lanczos, Solution of systems of linear equations by minimized iterations, J. Res. Nat. Bur. Standards 49 (1952) 33-53.
[8] N.M. Nachtigal, A look-ahead variant of the Lanczos algorithm and its application to the quasi-minimal residual method for non-Hermitian linear systems, Ph.D. Thesis, MIT, Applied Mathematics, Cambridge, MA (1991).
[9] Y. Saad, The Lanczos biorthogonalization algorithm and other oblique projection methods for solving large unsymmetric systems, SIAM J. Numer. Anal. 19 (1982) 470-484.
[10] Y. Saad, Practical use of some Krylov subspace methods for solving indefinite and unsymmetric linear systems, SIAM J. Sci. Statist. Comput. 5 (1984) 203-228.
[11] Y. Saad, SPARSKIT: A basic tool kit for sparse matrix computations, Tech. Rept. 90-20, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA (1990).
[12] Y. Saad, Krylov subspace methods in distributed computing environments, Tech. Rept. 92-126, Army High Performance Computing Research Center, Minneapolis, MN (1992).
[13] Y. Saad, A flexible inner-outer preconditioned GMRES algorithm, SIAM J. Sci. Statist. Comput. 14 (1993) 461-469.
[14] Y. Saad and M.H. Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput. 7 (1986) 856-869.
[15] Y. Saad and K. Wu, DQGMRES: a quasi-minimal residual algorithm based on incomplete orthogonalization, Tech. Rept. UMSI-93/131, Minnesota Supercomputing Institute, Minneapolis, MN (1993).
[16] Y. Saad and K. Wu, Parallel sparse matrix library (P_SPARSLIB): The iterative solvers module, Tech. Rept. 94-008, Army High Performance Computing Research Center, Minneapolis, MN (1994).
[17] P. Sonneveld, CGS, a fast Lanczos-type solver for nonsymmetric linear systems, SIAM J. Sci. Statist. Comput. 10 (1989) 36-52.
[18] H.A. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. Statist. Comput. 13 (1992) 631-644.


[8] N.M. Nachtigal, A look-ahead variant of the Lanczos Algorithm and its application to the Quasi-Minimal Residual method for non-Hermitian linear systems, Ph.D. Thesis, MIT, Applied Mathematics, Cambridge, MA (1991). [9] Y. Saad, The Lanczos biorthogonalization algorithm and other oblique projection methods for solving large unsymmetric systems, SIAM J. Numer. Anal. 19 (1982) 470-484. [ 10] Y. Saad, Practical use of some Krylov subspace methods for solving indefinite and unsymmetric linear systems, SlAM J. Sci. Statist. Comput. 5 (1984) 203-228. [11] Y. Saad, SPARSKIT: A basic tool kit for sparse matrix computations, Tech. Rept. 90-20, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffet Field, CA (1990). [ 12 } Y. Saad, Krylov subspace methods in distributed computing environments, Tech. Rept. 92-126, Army High Performance Computing Research Center, Minneapolis, MN (1992). [ 13 ] Y. Saad, A flexible inner-outer preconditioned GMRES algorithm, SIAM J. Sci. Statist. Comput. 14 (1993) 461-469. [ 14] Y. Saad and M.H. Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SlAM J. Sci. Statist. Comput. 7 (1986) 856-869. [ 15] Y. Saad and K. Wu, DQGMRES: a quasi-minimal residual algorithm based on incomplete orthogonalization, Tech. Rept. UMSI-93 / 131, Minnesota Supercomputing Institute, Minneapolis, MN (1993). [ 16] Y. Saad and K. Wu, Parallel sparse matrix library (P_SPARSLIB): The iterative solvers module, Tech. Rept. 94-008, Army High Performance Computing Research Center, Minneapolis, MN (1994). [17] P. Sonneveld, CGS, a fast Lanczos-type solver for nonsymmetric linear systems, SlAM J. Sci. Statist. Comput. 10 (1989) 36-52. [ 18] H.A. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of non-symmetric linear systems, SIAM J. Sci. Statist. Comput. 12 (1992) 631-644.