A variable precision approach to speedup iterative schemes on fine grained parallel machines


Parallel Computing 18 (1992) 1223-1231 North-Holland


Practical aspects and experiences (short communication)

A variable precision approach to speedup iterative schemes on fine grained parallel machines

M. Zubair, S.N. Gupta and C.E. Grosch

Old Dominion University, College of Sciences, Department of Computer Science, Norfolk, VA 23529-0162, USA

Received 4 December 1991
Revised 24 February 1992

Correspondence to: M. Zubair, Old Dominion University, College of Sciences, Department of Computer Science, Norfolk, VA 23529-0162, USA, email: [email protected]

Abstract

Zubair, M., S.N. Gupta and C.E. Grosch, A variable precision approach to speedup iterative schemes on fine grained parallel machines, Parallel Computing 18 (1992) 1223-1231.

With the advent of massively parallel machines, such as the AMT DAP and MASPAR, having bit, 4-bit nibble, or byte level processing capabilities, an interesting question arises: Is it possible to exploit their variable precision capabilities to speedup iterative schemes? In this paper we address this question, and show that on a fine grained parallel machine it is possible to speedup iterative schemes such as the Newton, Jacobi, and Gauss-Seidel schemes by performing the computation in variable precision. Experimental results are obtained for the AMT DAP 510.

Keywords. Iterative schemes; variable precision technique; fine grained parallel machines; experimental results.

1. Introduction

Iterative schemes for solving a system of linear equations consist, in essence, of repeated application of some simple matrix operations. In any iteration one begins with an initial approximation and then successively modifies the approximation according to some rule. On present day computers iterative schemes are implemented using fixed precision arithmetic. That is, every iteration is carried out at the same precision. However, it is possible to implement iterative schemes in variable precision, as was shown theoretically by Bojanczyk [2]. He starts with an initial iteration in the lowest sufficient precision, and then, as the approximations are refined, the precision is increased.

With the advent of massively parallel machines, such as the AMT DAP and MASPAR, having bit, 4-bit nibble, or byte level processing capabilities, an interesting question arises: Is it possible to exploit their variable precision capabilities to speedup iterative schemes? The motivation for investigating this possibility stems from the observation that on such machines there is a strong performance dependency on the data word length [5]. While variable precision arithmetic is theoretically attractive for certain classes of problems, a number of questions arise in its implementation. For example, how difficult is it to implement such an algorithm, and is the overhead associated with changing precision a major cost? We know of no implementation of a variable precision algorithm for iterative schemes. In this paper we present such an implementation in order to determine its practical feasibility.

We consider two types of iterative schemes in variable precision. The first is based on the classical Newton scheme for iterative improvement of an approximate inverse. The Newton scheme is self-correcting, numerically stable, and converges quadratically. The scheme involves only matrix multiplication operations and is, therefore, suitable for parallel implementation. Bojanczyk [2] has investigated the Newton scheme in variable precision. He also discusses a possible implementation on a two-dimensional systolic array. The second class of iterative schemes, such as Jacobi and Gauss-Seidel, is part of a large family of iterative methods based on matrix splitting [4]. To the best of our knowledge these schemes have not been studied using variable precision.

In this paper we give implementation results for both types of iterative schemes on the AMT DAP-510 using variable precision computations. We compare these results to the ones obtained for fixed precision implementations. In Section 2 we briefly discuss various iterative schemes. In Section 3 we describe the AMT DAP and the implementation of the various iterative schemes on it. Section 4 gives experimental results. Finally, we give conclusions in Section 5.

2. Iterative schemes and variable precision model

2.1. Newton scheme

The Newton scheme solves a system of linear equations, Au = v, by computing A^{-1}. This scheme was suggested as a good candidate for parallel implementation by Bojanczyk [2] and Pan [6]. We summarize the scheme.
1. Choose an initial guess for A^{-1}, say B^{(0)}.
2. Iteratively improve B^{(k)} using

B^{(k)} = (2I - B^{(k-1)}A) B^{(k-1)},  k >= 1.  (1)

The above algorithm converges if the initial guess B^{(0)} satisfies ||I - B^{(0)}A||_2 < 1. Here, ||.||_2 denotes the 2-norm of a matrix. It can be shown that the algorithm converges quadratically [1].

Bojanczyk [2] has theoretically investigated the Newton scheme using variable precision. Amongst other interesting results, he shows that in order to achieve quadratic convergence (in the variable precision model) the precision should almost double at every step, starting with a precision essentially equal to the logarithm of the condition number of the matrix A.
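As an illustration, the following is a minimal sketch of the scheme in Python/NumPy, not the authors' Fortran Plus implementation. NumPy exposes only two precision levels, float32 and float64, so the sketch uses those two with fixed, illustrative iteration counts per level in place of the doubling schedule; the starting guess B^{(0)} = A^T / (||A||_1 ||A||_inf) is a standard choice that satisfies the convergence condition above for any nonsingular A.

    import numpy as np

    def newton_inverse(A, iters=(8, 4)):
        """Approximate A^{-1} via eq. (1), B(k) = (2I - B(k-1) A) B(k-1),
        starting in float32 and finishing in float64."""
        n = A.shape[0]
        # B0 = A^T / (||A||_1 ||A||_inf) gives ||I - B0 A||_2 < 1 for nonsingular A.
        B = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
        for dtype, k in zip((np.float32, np.float64), iters):
            Ad, Bd = A.astype(dtype), B.astype(dtype)
            I2 = 2.0 * np.eye(n, dtype=dtype)
            for _ in range(k):
                Bd = (I2 - Bd @ Ad) @ Bd   # one Newton step in the current precision
            B = Bd                          # carry the refined iterate to the next level
        return B

Because the convergence is quadratic, only a few steps are needed at the highest precision, which is what makes the low precision start profitable.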


2.2. Jacobi and Gauss-Seidel schemes

Classical iterative schemes like Jacobi, Gauss-Seidel, etc. are based on a splitting of A into the form [4]

A = R + D + S,  (2)

where R, D, and S are matrices having the same elements as A respectively below the main diagonal, on the main diagonal, and above the main diagonal, and zeros elsewhere. These schemes are very well documented in the literature (e.g. in [4,7]). Here, for completeness, we briefly review them. The underlying iteration for these schemes to solve Au = v is given by

u^{(k)} = Gu^{(k-1)} + Mv.  (3)

The iteration starts with an initial approximation u^{(0)}, which in general can be chosen arbitrarily. For the Jacobi scheme the matrices G and M are given by

G = -D^{-1}(R + S)  and  M = D^{-1}.

For the Gauss-Seidel scheme these matrices are given by

G = -(R + D)^{-1}S  and  M = (R + D)^{-1}.

The convergence rate for these schemes is determined by λ, the spectral radius of G. It has been shown that, on average, the error (e = u - A^{-1}v) decreases by the factor λ at each step of the iteration. In other words, -log_10 λ is the number of decimal digits of accuracy gained in each iterative step. This suggests that these schemes are also amenable to the variable precision model of computation.
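For concreteness, here is a minimal NumPy sketch of the Jacobi update built directly from the splitting (2); this is an illustrative sketch, not the paper's DAP code, and the tolerance and iteration cap are arbitrary choices.

    import numpy as np

    def jacobi_solve(A, v, u0, tol=1e-10, max_iter=1000):
        """Iterate u(k) = G u(k-1) + M v with G = -D^{-1}(R+S), M = D^{-1}."""
        d = np.diag(A)               # diagonal of A, i.e. D stored as a vector
        RS = A - np.diag(d)          # R + S: the off-diagonal part of A
        u = u0.astype(float)
        for _ in range(max_iter):
            u_new = (v - RS @ u) / d     # algebraically equal to G u + M v
            if np.linalg.norm(u_new - u, np.inf) < tol:
                return u_new
            u = u_new
        return u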

3. Parallel implementation

3.1. DAP-510 machine

The DAP-510 is one member of a family of fine grained massively parallel computers. It is an SIMD machine with 1024 one-bit processors arranged in a 32 × 32 matrix. Each processor is

connected to its four nearest neighbors. The processors on the edge of the matrix have wrap-around connections to the processors on the opposite edge. In addition to the nearest neighbor connections, a bus system connects all the processors in each row and all the processors in each column. Each processor has a local memory of 64 Kbits. The whole memory can be viewed as a three dimensional array of bits, consisting of 64K bit-planes. A bit-plane has 1024 bits, one from each processor's local memory. Similarly, a word-plane has 1024 words, one from each processor's local memory. All the bits of a single word are stored in the local memory of a single processor.

In addition to the DAP-510, the other members of the family are the 510C, 610 and 610C. The 510C has a general purpose 8-bit processor per node in addition to the one-bit processor. The 610 and 610C are identical to the 510 and 510C except that they have 4096 processors arranged in a 64 × 64 matrix. A number of tests on the 510C have shown that it is about 5 times faster than the 510 for fixed precision floating point computation. Similarly, the 610 and 610C have been shown to be 4 times faster than the 510 and 510C respectively. In all of these machines the internal floating point representation is hexadecimal. Thus, the basic unit of computation is a 4-bit nibble on the 510 and 610 and a byte on the 510C and 610C. All the floating point arithmetic routines are implemented in microcode. In the experiments reported here we have used the 510 because it is the most readily available machine (we have one in our department).

The higher level language available on the DAP is Fortran Plus, a parallel extension of Fortran. The most important feature of Fortran Plus is the ability to manipulate matrices and vectors as single objects. For example, two matrices can be added with a single statement, as for scalars. Masking and selection operations are available for performing computation on selected processors. In addition to the basic functions, which have been extended to take vector and matrix arguments, a large number of other functions are provided as standard. Because floating point processing is done at the bit, nibble or byte level, the timings of the functions are not what might be expected. For example, taking the square root of every element in a matrix requires only 10% more time than adding two matrices, and summing all the elements of a matrix takes only twice as long as adding two matrices.

For the purpose of this implementation the most important feature of Fortran Plus is that the programmer can specify the precision of floating point numbers in steps of 8 bits, from 24 bits to 64 bits. In all cases the exponent is 8 bits and the remaining bits are used for the mantissa. The precision is specified at compile time rather than at run time. Fortran Plus allows assignment of lower precision floating point numbers to higher precision numbers, and vice versa, at a negligible cost. For details on these functions one can refer to the AMT DAP 500 Fortran-Plus Language Manual [3]. Some standard functions which we have used in our implementation are described below.

matc: returns a matrix formed by replicating a column vector.
matr: returns a matrix formed by replicating a row vector.
sumr: returns a row vector by summing all columns of a given matrix.
sumc: returns a column vector by summing all rows of a given matrix.
maxv: returns the maximum element of a given vector or matrix.

3.2. Implementation

The two operations which are central to the implementation of the iterative schemes discussed in this paper are: (i) matrix-matrix multiplication, and (ii) matrix-vector multiplication. It is fairly straightforward to implement these operations on the AMT DAP with the help of the standard functions listed above. For example, the matrix-vector operation v = Au can be coded in Fortran Plus using the single statement

v = sumc(A * matr(u)),

where * denotes the Schur (elementwise) product of two matrices.
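For readers without access to Fortran Plus, the same replicate-multiply-reduce idiom can be mimicked with NumPy broadcasting; this is only an illustration of the data flow, not DAP code.

    import numpy as np

    A = np.arange(16.0).reshape(4, 4)
    u = np.ones(4)
    # matr(u): replicate the row vector u; Schur product with A; sumc: row sums.
    v = (A * u[np.newaxis, :]).sum(axis=1)
    assert np.allclose(v, A @ u)   # identical to the matrix-vector product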

Because of the restriction in Fortran Plus that the floating point representation must lie between 24 and 64 bits in steps of 8, we started with 24-bit precision and increased the precision in increments of 8 bits, up to 64 bits. Table 1 compares the times taken by the two matrix operations, matrix-vector multiplication and matrix-matrix multiplication, on the DAP-510 for different word lengths. The results indicate that (i) the cost of both matrix operations increases with the word length, and (ii) the ratio of the operation cost at 64 bits to the operation cost at 24 bits is as high as three.


Table 1
Time to multiply a (32 × 32) matrix with a 32-element vector and to multiply two (32 × 32) matrices at different precisions.

Operation        24 bits     32 bits     40 bits     48 bits     56 bits     64 bits
                 time (sec)  time (sec)  time (sec)  time (sec)  time (sec)  time (sec)
Matrix-Vector    3.91E-4     5.10E-4     7.45E-4     9.00E-4     10.70E-4    12.60E-4
Matrix-Matrix    1.43E-2     1.79E-2     2.56E-2     3.04E-2     3.58E-2     4.19E-2

Note that these results are for 32 × 32 matrices. We found that as the matrix size increases, the ratio between the operation cost at 64 bits and at 24 bits also increases. For example, this ratio is 4.55 for 128 × 128 matrices. The particular timing values are due to the implementation of floating point add and multiply in microcode and to the fact that data is transferred from local memory to the processor one bit at a time.

At any precision the error will be reduced as far as is possible consonant with that precision. When this occurs we say the convergence rate has saturated; by saturation we mean that, at a given precision, the rate of convergence shows a large decrease. To detect saturation we monitor the rate of convergence. Further reduction of the error then requires going to a higher precision. Note that the range of floating point numbers which can be represented at each precision is nearly the same, because the exponent has the same number of bits and only the number of bits in the mantissa changes. As the precision is increased the additional bits are used in the low order positions. At all precisions, floating point overflow and underflow are monitored in the hardware. We never encountered floating point overflow or underflow in any of our experiments.

We now give a high level description of the variable precision implementation. The following assumptions and notations are used in the description.
• The subscript of a variable denotes its precision.
• The system to be solved is A_64 x_64 = b_64.

Algorithm Newton/Jacobi/Gauss-Seidel
begin{main}
  {Copy A_64 and b_64 to the other precisions}
  A_24 = A_32 = A_40 = A_48 = A_56 = A_64
  b_24 = b_32 = b_40 = b_48 = b_56 = b_64
  Initialize the guess x_24
  for i = 24 to 64 step 8 do
  begin{for}
    repeat
      {Execute one iteration of the Newton/Jacobi/Gauss-Seidel scheme}
      iterate(A_i, b_i, x_i)
    until (convergence rate saturates)
    {Copy the low precision solution to the next precision}
    x_{i+8} = x_i
  end{for}
end{main}
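The following Python/NumPy sketch mirrors the structure of this algorithm under the assumption of just two precision levels (float32 and float64), since general-purpose hardware does not expose the six 8-bit steps of the DAP. The saturation test, with its illustrative threshold of 0.9, stands in for the rate monitoring described above, and `jacobi_sweep` is a hypothetical helper supplying one sweep of the scheme.

    import numpy as np

    def variable_precision_solve(A64, b64, iterate, x0):
        """iterate(A, b, x) performs one sweep of the chosen scheme
        (e.g. the Jacobi update of Section 2) and returns the new x."""
        x = x0
        for dtype in (np.float32, np.float64):
            A, b = A64.astype(dtype), b64.astype(dtype)
            x = x.astype(dtype)                    # copy solution to the next precision
            res = np.linalg.norm(b - A @ x)
            while True:
                x = iterate(A, b, x)
                new_res = np.linalg.norm(b - A @ x)
                if new_res > 0.9 * res:            # convergence rate has saturated
                    break
                res = new_res
        return x

    # Example usage with a single Jacobi sweep as the scheme:
    jacobi_sweep = lambda A, b, x: (b - (A - np.diag(np.diag(A))) @ x) / np.diag(A)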

4. Experimental results

We have implemented the Newton scheme for matrix inversion, and the Jacobi, weighted Jacobi and Gauss-Seidel schemes for solving a system of linear equations, on the DAP-510.

[Fig. 1. Distribution of iterations over different precisions. Four panels (Newton, Jacobi, damped Jacobi, and Gauss-Seidel schemes) plot the number of iterations against precision (24 to 64 bits) for the variable precision and fixed precision implementations.]

All these schemes were implemented in fixed precision as well as in variable precision. A set of experiments was conducted to evaluate the performance of the various iterative schemes using variable precision. These experiments were done on a large set of random and deterministic matrices. Here, we report results for only four matrices which are representative of this set. These four matrices are:
M1: Deterministic and symmetric.
M2: Deterministic and unsymmetric.
M3: Random and symmetric.
M4: Random and unsymmetric.
The results of these experiments are summarized in Figs. 1 and 2. It should be noted that in our experiments the final output of an iterative scheme in variable precision had the same accuracy as in fixed precision.

[Fig. 2. Speedup as a function of matrix size in the variable precision model. Four panels (Newton, Jacobi, damped Jacobi, and Gauss-Seidel schemes) plot speedup against matrix size (32 to 128).]

The distribution of iterations over the different precisions for the various iterative schemes is shown in Fig. 1. The distribution is shown for 128 × 128 matrices (a similar pattern exists for other matrix sizes). These results indicate that the number of iterations taken by an iterative scheme in fixed precision is equal to the total sum of the iterations over the different precisions in variable precision. This is a result of the criterion, described previously, which we use to change precision. For example, the Newton scheme in fixed precision (64 bits) takes 26 iterations, while in variable precision it takes 18 iterations in 24-bit precision, 4 iterations in 32-bit precision, and 1 iteration in each of the 40-bit, 48-bit, 56-bit and 64-bit precisions, for a total of 26 iterations. Also observe from Fig. 1 that the distribution of iterations for the Newton scheme is different from that of the other three schemes. This is due to the quadratic convergence of the Newton scheme as compared to the linear convergence of the other three schemes.

Figure 2 gives the performance of the different iterative schemes in variable precision as a function of matrix size. The performance of an iterative scheme in variable precision is measured by the speedup, defined as

Speedup = T_f / T_v,

where T_f is the time taken by the iterative scheme in 64-bit precision to get an ε-accurate solution, and T_v is the time taken by the iterative scheme in variable precision to get an ε-accurate solution. We have given plots for only the M1 matrix; the behavior for the other three matrices is identical. These results indicate that for the Newton scheme the speedup is 2.6 for a 32 × 32 matrix and 3.5 for a 128 × 128 matrix. For the other three schemes the speedup is almost independent of matrix size and is approximately 1.7.

5. Conclusion

In this paper we showed that, on fine grained parallel machines such as the DAP-510 and MASPAR, it is possible to speedup iterative schemes by performing the computation in variable precision. In particular, we showed that on the AMT DAP 510 the Newton scheme has a speedup of 3.5, and the Jacobi and Gauss-Seidel schemes have speedups of 1.7. We believe that the performance of these schemes would be further improved if they were implemented in variable precision with a greater number of levels. For ease of programming, the current variable precision implementation was confined to only six levels: 24, 32, 40, 48, 56 and 64 bits.

While the DAP-510 only gives 4.0 Mflops for the algorithm with 32-bit precision, the 510C yields about 20 Mflops and the 610C about 80 Mflops. These Mflops ratings are highly dependent on the algorithm and the problem. In other problems using different algorithms we have obtained about 45 Mflops on the 510C.

We suggest that variable precision arithmetic be used for iterative schemes on the DAP and MASPAR, and on any other machine with a similar architecture. It is an open problem whether the variable precision technique can be extended to other iterative schemes such as conjugate gradient methods. Finally, a version of this algorithm could be useful on computers which have only 32-bit floating point hardware when 64-bit accuracy is required for a particular relaxation application. The standard approach is to use 64-bit precision, implemented in software, for the entire computation. Our results suggest that beginning the computation with 32-bit precision and later switching to 64-bit precision might be effective.

Acknowledgements

We wish to thank the referees for their helpful comments.

References

[1] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, Englewood Cliffs, NJ, 1989).
[2] A. Bojanczyk, Complexity of solving linear systems in different models of computation, SIAM J. Numerical Analysis 21 (1984) 591-603.
[3] Fortran Plus Language (man002.02), Active Memory Technology, 1988.
[4] G.H. Golub and C.F. Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, MD, 1989).
[5] R.W. Hockney and C.R. Jesshope, Parallel Computers 2 (Adam Hilger, Bristol, 1988).
[6] V. Pan and J. Reif, Efficient solution of linear systems, Proc. 17th ACM STOC (1985) 143-152.
[7] R.S. Varga, Matrix Iterative Analysis (Prentice-Hall, Englewood Cliffs, NJ, 1962).