Performance of a subroutine library on vector-processing machines


Computer Physics Communications 37 (1985) 181—186 North-Holland, Amsterdam


C. DALY *
University of Manchester Regional Computer Centre, UK

and

J.J. DU CROZ
Numerical Algorithms Group Ltd., 256 Banbury Rd., Oxford OX2 7DE, UK

* Present address: Computer Services Dept., University of Lancaster, Lancaster, UK.

We describe the strategy adopted to enable selected linear algebra routines in the NAG Library to achieve high levels of performance on vector-processing machines. The routines have been restructured so as to perform the bulk of their computation via calls to simple kernel routines, which have been implemented in different versions to suit specific vector-processing machines as well as scalar machines. We present performance measurements for some of the kernel routines and some of the routines which call them, for the Cray-1, Cray-XMP, Cyber 205 and FPS-164.

1. Introduction

The NAG Library [6] is a library of over 500 FORTRAN subroutines for numerical computation. It is a transportable library, designed to be implementable with as little effort as possible in a very wide range of computing environments (machine/precision/compiler); around 50 different compiled implementations of the Library are currently distributed, including implementations for the Cray-1, CDC Cyber 205 and FPS-164. (The implementation for the Cray-1 is also used, after re-compilation, on the Cray-XMP, using only a single processor.) The four machines just named will all be regarded here (loosely speaking) as examples of vector-processing machines. Details of their architecture are less important than the efficiency with which they can perform arithmetic operations on vectors.

The paper is concerned not with how to achieve a high level of performance on any particular vector-processing machine, but with how to provide a subroutine library that performs sufficiently well over a wide range of both scalar and vector processors (we do not attempt to cover machines with a higher degree of parallelism, such as the ICL DAP). Throughout this paper it must be remembered that performance measurements can easily be affected by minor changes in hardware or, more frequently, in a Fortran compiler, and that, especially when comparing the performance of routines over a range of different machines, one should not attach too much importance to timing differences of up to, say, 30-40%.

There remains some latitude, and room for compromise, in what constitutes a "sufficiently good" level of performance on any machine; however, the first implementation of the NAG Library on the Cray-1 was certainly not sufficiently good in terms of performance, because some of the routines ran about 100 times slower than in the current implementation. This was an extreme example (due in part to an unnecessary use of double precision arithmetic), but it is frequently true on vector-processors that ill-adapted code can degrade performance by a factor of 5 or 10, and users of a subroutine library can reasonably expect that library routines will avoid such degradation. Indeed, because users may be unwilling to learn for themselves how to get the best performance out of their vector-processor, there is all the more onus on the library to embody the necessary expertise. Therefore the NAG project has devoted some effort during the last three years toward investigating and improving the performance of selected NAG Library routines on vector-processing machines.

2. Strategy

We have been guided by four aims:

a) to provide the user of the NAG Library on a vector-processing machine with access to computational power which comes reasonably close to the full power of the machine;

b) to preserve the existing specifications of the routines, so that no new documentation is required, and no changes need be made to users' programs;

c) to preserve as much generality in the code as possible, ideally to develop a single version of the code which performs satisfactorily on both scalar and vector-processors (this simplifies management and maintenance of the Library);

d) to develop a limited amount of machine-specific code where necessary, if the ideal of c) is unattainable.

We have initially concentrated on the Linear Algebra chapters of the Library. They include many of the most frequently used routines, and probably account for a major share of the central processor time spent within the Library. They are called by many routines in other chapters (e.g. Ordinary Differential Equations, Integral Equations, Optimization, Regression, Time Series Analysis). Finally, it is comparatively easy to cast most non-sparse linear algebra algorithms in a form well-suited to vector-processing.

Computations in linear algebra occur at different levels of complexity:

1. vector operations, involving O(n) scalar operations;
2. matrix-vector computations, involving O(n^2) scalar operations;
3. matrix computations, involving O(n^3) scalar operations.

Among vector operations at the O(n) level we shall concentrate on two: the inner or dot product, which we shall denote DOT; and the operation y := ax + y which, following the terminology of the BLAS [5], we shall denote AXPY.
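In their simplest unit-stride form these two operations are as follows (a minimal FORTRAN sketch, not the code used in the Library; the names DOTV and AXPYV are ours, and the BLAS routines of ref. [5] take additional increment arguments):

      REAL FUNCTION DOTV(N, X, Y)
C     DOT: returns the inner product of the vectors x and y.
      INTEGER N, I
      REAL X(N), Y(N)
      DOTV = 0.0
      DO 10 I = 1, N
         DOTV = DOTV + X(I)*Y(I)
   10 CONTINUE
      END

      SUBROUTINE AXPYV(N, A, X, Y)
C     AXPY: performs y := a*x + y.
      INTEGER N, I
      REAL A, X(N), Y(N)
      DO 20 I = 1, N
         Y(I) = A*X(I) + Y(I)
   20 CONTINUE
      END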

The user-callable linear algebra routines in the NAG Library are nearly all at the O(n^3) level, and were originally organized very largely in terms of DOT operations (often, but not always, via calls to an inner-product library routine X03AAF). Modifying the code of the O(n^3) routines to ensure that the DOT operations were performed efficiently on different vector-processors would in general have improved the speed of the routines compared with the original code. But either the number of machine-specific modifications would have been very large, or the overhead in subroutine calls (O(n^2) of them) would have been serious; more importantly, better performance can frequently be achieved on the vector-processors considered here if algorithms are organized in terms of AXPY instead of DOT. Fortunately many important linear algebra algorithms allow considerable flexibility in their organization (see ref. [2] for example). We have found that the requirements of our strategy can be met if we organize the O(n^3) algorithms to work in terms of O(n^2) (matrix-vector) operations. These O(n^2) computations are performed by a set of kernel routines, which are sufficiently simple and few in number for machine-specific variants to be provided where necessary in order to achieve the desired level of performance. A more extended discussion is given in ref. [3]. In section 3 we describe the current set of kernel routines, and in section 4 their implementation on different machines.
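As a minimal illustration of this organization (a sketch only; the routine names MVKER and MATML and their argument lists are assumed here for illustration and are not NAG routines), an O(n^3) matrix multiplication can be written so that all of its floating-point work is performed by an O(n^2) kernel for the matrix-vector product c := c + Ab. Only the kernel then needs to be tuned for a particular machine.

      SUBROUTINE MVKER(M, N, A, LDA, B, C)
C     Kernel: c := c + A*b (AXPY form), where A is M-by-N.
      INTEGER M, N, LDA, I, J
      REAL A(LDA,N), B(N), C(M)
      DO 20 J = 1, N
         DO 10 I = 1, M
            C(I) = C(I) + A(I,J)*B(J)
   10    CONTINUE
   20 CONTINUE
      END

      SUBROUTINE MATML(M, N, K, A, LDA, B, LDB, C, LDC)
C     C := C + A*B, where A is M-by-K, B is K-by-N and C is M-by-N.
C     All the arithmetic is done in N calls to the kernel MVKER, one
C     for each column of C.
      INTEGER M, N, K, LDA, LDB, LDC, J
      REAL A(LDA,K), B(LDB,N), C(LDC,N)
      DO 30 J = 1, N
         CALL MVKER(M, K, A, LDA, B(1,J), C(1,J))
   30 CONTINUE
      END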

3. Description of kernel routines

We list here the computations performed by the kernel routines that have been used in the NAG Library so far, together with the algorithms that call them. Except where otherwise stated, A denotes a full rectangular matrix, L a lower triangular matrix, and U an upper triangular matrix.

a) Matrix-vector products:

a1) c := c + Ab (matrix multiplication; LL^T-factorization; reduction of an unsymmetric matrix to upper Hessenberg form; back-transformation of all the eigenvectors of an unsymmetric matrix; reduction of a symmetric-definite generalized eigenvalue problem to a standard problem);

a2) c := c + A^T b (matrix multiplication; QU-factorization; accumulating a product of Householder transformations);

a3) c := c + Ab, where A is symmetric and only its lower triangle is stored (reduction of a symmetric matrix to tridiagonal form);

a4) b := Lb (back-transformation of selected eigenvectors of an unsymmetric matrix);

a5) b := L^T b (inversion of a positive-definite symmetric matrix after LL^T-factorization).

b) Solution of triangular systems of linear equations:

b1) b := L^-1 b, as described in note 4 below (LU-factorization; solution of linear equations after LU-factorization; solution of linear equations after LL^T-factorization; inversion of a positive-definite symmetric matrix after LL^T-factorization; reduction of an unsymmetric matrix to upper Hessenberg form; reduction of a symmetric-definite generalized eigenvalue problem to a standard problem);

b2) b := (L^T)^-1 b (solution of linear equations after LL^T-factorization; back-transformation of eigenvectors of a symmetric-definite generalized eigenvalue problem);

b3) b := U^-1 b (solution of linear equations after LU-factorization; solution of linear least-squares problems after QU-factorization).

Notes

1. Complex analogues of some of these computations are required, wherever complex analogues of the calling algorithms are included in the NAG Library.

2. The predominance of lower triangular, as opposed to upper triangular, matrices is accidental, resulting largely from the fact that Cholesky factorization is performed in the NAG Library as LL^T rather than U^T U.

3. The reduction of an unsymmetric matrix to upper Hessenberg form is performed in the NAG Library by stabilized elementary transformations, not by Householder transformations.

4. The computation b1) actually allows the matrix L to have the form

      L = ( L11  0 )
          ( L21  I )

where L11 is lower triangular of order n, and L21 is rectangular (m-n)-by-n; only L11 and L21 are stored. If b is partitioned conformably, then the computation b := L^-1 b is equivalent to

      b1 := L11^-1 b1,
      b2 := b2 - L21 b1.

If these two computations are combined, longer vectors can be used (see the sketch below). This feature of b1) is used in LU-factorization and in reduction to upper Hessenberg form; elsewhere m = n and L reduces to L11.
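The combined computation of note 4 may be sketched in FORTRAN as follows (an illustrative sketch assuming a general, non-unit diagonal for L11; the routine name and argument list are not those of the NAG kernel):

      SUBROUTINE LPSOLV(M, N, L, LDL, B)
C     Computation b1): b := inv(L)*b for the partitioned matrix of
C     note 4, with L11 stored in rows 1 to N and L21 in rows N+1 to M
C     of the array L.  The forward substitution on b1 and the update
C     b2 := b2 - L21*b1 are combined in one column sweep, so that the
C     inner (AXPY-type) loop has length M-J rather than N-J.
      INTEGER M, N, LDL, I, J
      REAL L(LDL,N), B(M)
      DO 20 J = 1, N
         B(J) = B(J)/L(J,J)
         DO 10 I = J+1, M
            B(I) = B(I) - L(I,J)*B(J)
   10    CONTINUE
   20 CONTINUE
      END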

4. Implementation of kernel routines

For illustration we take the computation c := c + Ab (where A is m-by-n). It may be organized in terms of either DOT or AXPY vector operations:

DOT form:
   for i := 1 to m
      for j := 1 to n
         c(i) := c(i) + a(i, j)*b(j)
i.e.:
   for i := 1 to m
      c(i) := c(i) + dot(a(i, 1:n), b(1:n)).

AXPY form:
   for j := 1 to n
      for i := 1 to m
         c(i) := c(i) + a(i, j)*b(j)
i.e.:
   for j := 1 to n
      c(1:m) := c(1:m) + a(1:m, j)*b(j).

Table 1 compares the speed of these forms on different vector-processing machines (when m = n = 320). It shows their speed both when they are coded in standard FORTRAN (relying on automatic vectorization) and when using explicit vectorization, either by subroutine calls or by non-standard syntax (as available). It also shows measurements when the technique of "unrolling vector loops" [1] is applied to the AXPY form. Table 2 shows how this technique can roughly double the speed of the AXPY form on the Cray-1, Cray-XMP and FPS-164, while it has negligible effect on the Cyber 205 or on a typical scalar machine. We have fixed on unrolling to a depth of 4 to give a reasonable compromise between speed of performance and complexity of the code.

Table 1
Speed in megaflops of computing c := c + Ab, where A is a square matrix of order 320, on different vector-processors. The figures in parentheses in the rows for the Cray-1 and the Cray-XMP are the speeds obtained when memory-bank conflicts cause significant degradation.

            Automatic vectorization             Explicit vectorization
            DOT      AXPY   Unrolled AXPY       DOT              AXPY
Cray-1      24 (17)  40     76                  57 (SDOT) (27)   41 (SAXPY)
Cray-XMP    45 (30)  73     157                 71 (SDOT) (35)   73 (SAXPY)
Cyber 205   6        110    110                 27 (Q8SDOT)      79
FPS-164     3.4      2.5    5.9                 4.8 (DOTPR)      3.1 (VSMA)

Table 2
Ratios of the speed of computing c := c + Ab using AXPY operations with various depths of unrolling, on different machines. The speed of the ordinary AXPY form with no unrolling is taken as 1.

                  Depth of unrolling
                  2      3      4      5      6
Cray-1            1.6    2.3    2.4    1.9    2.1
Cray-XMP          1.5    2.1    2.1    1.8    1.9
Cyber 205         1.1    1.2    1.2    1.2    1.2
FPS-164           1.5    2.5    2.4    1.9    2.1
DEC VAX-11/780    1.1    1.2    1.2    1.2    1.2
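To make the unrolling technique concrete, the following sketch (in standard FORTRAN; the names are illustrative and this is not the NAG kernel itself) shows the AXPY form of c := c + Ab with the column loop unrolled to a depth of 4, so that c is loaded and stored once for every four columns of A instead of once per column:

      SUBROUTINE MVUNR4(M, N, A, LDA, B, C)
C     c := c + A*b, AXPY form, column loop unrolled to depth 4.
      INTEGER M, N, LDA, I, J, N4
      REAL A(LDA,N), B(N), C(M)
      N4 = 4*(N/4)
      DO 20 J = 1, N4, 4
         DO 10 I = 1, M
            C(I) = C(I) + A(I,J)*B(J) + A(I,J+1)*B(J+1)
     *                  + A(I,J+2)*B(J+2) + A(I,J+3)*B(J+3)
   10    CONTINUE
   20 CONTINUE
C     Clean-up loop for the remaining columns when N is not a
C     multiple of 4.
      DO 40 J = N4+1, N
         DO 30 I = 1, M
            C(I) = C(I) + A(I,J)*B(J)
   30    CONTINUE
   40 CONTINUE
      END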

Similar variants can be programmed for the computations b := Lb, b := L^-1 b and b := U^-1 b; and a similar comparative picture is obtained for their performance, although the absolute speeds are slower because the vector operations involve vectors which vary in length from 1 to n. Similar variants can also be programmed for those kernel computations which involve transposed matrices, i.e. c := c + A^T b and b := (L^T)^-1 b. Again a similar comparative picture is obtained on the Cray-1 and Cray-XMP (as long as memory-bank conflicts are ignored) and on the FPS-164. But the Cyber 205 exhibits a different pattern of performance, because now the DOT form works with contiguous vectors, while the AXPY form requires a gather operation. From results such as these we can decide how to implement the kernel routines on different machines.

For the Cray-1 and Cray-XMP we use the unrolled AXPY forms for all the kernel routines (except that we use the ordinary AXPY form for the complex analogues, and that for computation a3) we only unroll to a depth of 2 in order to keep the code reasonably simple). For these machines the speeds quoted in table 1 have reached their asymptotic limit. For the computation b := Lb the limiting speeds using unrolled AXPYs are 60 Mflops on the Cray-1 and 120 Mflops on the Cray-XMP. These speeds still fall short (by about a factor of 2 on the Cray-1) of the so-called "supervector speed" [4], which can be achieved by avoiding unnecessary memory references, but only at the price of programming in machine language. However, it would be easy to incorporate machine-language versions of kernel routines into the Library if they are written.

For the Cyber 205 we use the ordinary AXPY forms wherever they work with contiguous vectors. The speed quoted in table 1 is only about half the asymptotic limit for this form, but is almost optimal for this machine for the given value of n. Where the AXPY form would require non-contiguous vectors, we prefer the DOT form working with contiguous vectors. The compiler does not automatically vectorize a DOT operation in this context (this is the reason for the speed of only 6 Mflops in the first column of table 1); therefore we must vectorize the routines explicitly using the function Q8SDOT. We then obtain for the computation c := c + A^T b, again with m = n = 320, a speed of 59 Mflops for the DOT form using Q8SDOT, but only 43 Mflops for the AXPY form.

For the FPS-164 we use the same unrolled AXPY forms as for the Cray-1 and Cray-XMP. The speeds quoted in table 1 have reached their asymptotic limits, and the speed for the unrolled AXPY form is a little over half the maximum speed of which the machine is theoretically capable (11 Mflops). Even faster speeds (8.8 Mflops) have been obtained for the computation c := c + A^T b, by using the DOT form with the vector b copied into special table-memory (coded by calls to the manufacturer's library routines); but this improvement has not yet been included in the implementation of the NAG Library.

For scalar machines we attach less weight to the difference in speed between the DOT and AXPY forms (which on most such machines is negligible) than to the desirability of working with contiguous vectors in order to avoid page-thrashing on virtual memory systems. Therefore the standard version of the kernel routines, which is used in default of any special adaptation, uses the same combination of AXPY and DOT forms as is used on the Cyber 205 (but of course with the DOT forms coded in standard FORTRAN). A further exception is made, not for efficiency but for numerical stability, on machines with a low working precision (e.g. IBM or VAX machines in single precision): for these implementations we use the DOT forms throughout and accumulate the inner products in double precision.
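A sketch of the latter arrangement (illustrative only; the name DPDOT is ours, not a NAG routine): the inner product of two single precision vectors is accumulated in a double precision variable.

      REAL FUNCTION DPDOT(N, X, Y)
C     Inner product of the single precision vectors x and y, with the
C     sum accumulated in double precision for numerical stability on
C     machines with a short working precision.
      INTEGER N, I
      REAL X(N), Y(N)
      DOUBLE PRECISION SUM
      SUM = 0.0D0
      DO 10 I = 1, N
         SUM = SUM + DBLE(X(I))*DBLE(Y(I))
   10 CONTINUE
      DPDOT = SNGL(SUM)
      END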

5. Conclusion and prospects

Table 3 shows the approximate speeds of a sample of NAG linear algebra routines from the implementations now available on the Cray-1, Cray-XMP, Cyber 205 and FPS-164.

Table 3
Speed in megaflops of selected NAG linear algebra routines, working with square matrices of order 200, on different vector-processors.

                                                               Cray-1   Cray-XMP   Cyber 205   FPS-164
Matrix multiplication, real                                    65       125        85          5.7
LU-factorization, real                                         45       70         32          1.7
LU-factorization, complex                                      55       70         25          2.5
LL^T-factorization, real                                       45       75         35          4.0
QU-factorization, real                                         35       65         40          1.4
Reduction of a real symmetric matrix to tridiagonal form       40       75         42          1.7
Reduction of a complex Hermitian matrix to tridiagonal form    55       90         50          2.6
Reduction of a real unsymmetric matrix to Hessenberg form      55       100        55          2.9
Reduction of a complex unsymmetric matrix to Hessenberg form   55       93         65          2.7

These results were obtained working with square matrices of order 200. The speeds are a little short of their asymptotic limits on the Cray-1 and Cray-XMP, and considerably short on the Cyber 205. Even so they compare quite satisfactorily with the theoretical maximum speed of these machines (about 150 Mflops on the Cray-1 and 200 Mflops on the 2-pipe Cyber 205). Certainly they are a dramatic improvement on the speeds obtained before the work described here was undertaken. The speeds on the FPS-164 are not all as fast as might be hoped, but can probably be improved, partly by using table-memory as was mentioned in section 4, and partly by using a higher level of compiler optimization (only the kernel routines have been compiled at the highest level in the current implementation). We have also observed that on a sample of scalar machines the re-organized routines tend to perform somewhat faster than before; certainly there has been no obvious detrimental effect.

This paper has described only the first phase in the task of adapting the NAG Library so that it can include vector-processors in the range of machines on which it offers satisfactory performance. We expect that the same strategy can be extended

to other linear algebra computations in many areas of the Library. We need also to test it on other types of computation such as FFT’s.

Acknowledgements

We are grateful to Mr. J. Greenaway of the European Centre for Medium Range Weather Forecasting for the timings on the Cray-XMP; and to Floating-Point Systems (UK) Ltd for providing access to an FPS-164.

References

[1] J.J. Dongarra and S.C. Eisenstat, Argonne National Laboratory report MCS-TM 9 (1983).
[2] J.J. Dongarra, F.G. Gustavson and A. Karp, SIAM Rev. 26 (1984) 91.
[3] J.J. Du Croz, NAG Newslett. 2/83 (1983) 23.
[4] K. Fong and T.L. Jordan, Los Alamos Scientific Laboratory report LA-6774 (1977).
[5] C.L. Lawson, R.J. Hanson, D.R. Kincaid and F.T. Krogh, ACM Trans. Math. Software 5 (1979) 308.
[6] NAG Fortran Library Manual, Mark 11 (NAG Ltd., Oxford, 1984).