Basic linear algebra subprograms (BLAS) on the CDC CYBER 205


Basic linear algebra subprograms(BLAS) on the CDC CYBER 205 Margreet LOUTERoNOOL Centre for Mathematics and Computer Science, 1009 AB Amsterdarn, Netherlands Received June 1986 Abs~et. This paper examines the efficiency of a set of linear algebra subprograms, commonly referred to as the BLAS, as standard available on a CDC CYBER 205. We have g~tten an alternative BLAS implementation. The perfo~:mance of both implementations on a 1-pipe CYBER 205 is compared and suggestions for improvements with ~espect to the CDC BLAS are given. Sp~ial attention is paid to operations on non-contiguously stored vector elements, leading to so-called stride problems. In this paper, we only consider the single precision .=.JE~± and COMPLEXBLAS for vectors with positive strides.

Keywords. Vectorization, basic linear algebra subprograms, stride problems, operations on contiguously and non-contiguously stored REAL and COMPLEX vectors.

1. Introduction

The BLAS (Basic Linear Algebra Subprograms) is a set of basic routines for problems in linear algebra. The complete set as described in Lawson et al. [7] includes the operations of dot products, vector plus a scalar times a vector, copy, swap, Euclidian norm, sum of magnitudes, multiplying a scalar times a vector, locating an element of largest magnitude, the Givens transformation and the modified Givens transformation. The collection of routines is divided into four separate parts: COMPLEX, REAL, DOUBLE PRECISION and COMPLEX*16. The BLAS has been very successful and is used in a wide range of software, including LINPACK [4] and EISPACK [9], and is inserted in the IMSL [1]. The BLAS has been implemented on a number of scalar machines, and since all subprograms, excluding the Givens transformations, perform operations on vectors, the BLAS can be efficiently implemented on vector computers.

On the CYBER 205, as located at SARA, Amsterdam, Netherlands, the single precision REAL and COMPLEX BLAS routines are installed by CDC. In this paper, we shall refer to this implementation as the CDC BLAS. The goal of this paper is to examine the efficiency of the CDC BLAS. For this purpose, we have written our own alternative version in FORTRAN 200, the CWI BLAS. CWI is an abbreviation of "Centrum voor Wiskunde en Informatica", the Dutch name for "Centre for Mathematics and Computer Science". We have timed both implementations for large values of n, i.e. the number of elements on which an operation is actually performed. Another approach of testing the efficiency of the CDC BLAS could be to determine for which values of n scalar optimization is more profitable than vectorization, but as yet no investigations have been done with respect to this approach.

The CYBER 205 is designed to operate on contiguous memory locations. However, the BLAS permits a constant spacing, or stride, between the vector elements, described by an increment parameter. We have paid much attention to the performance of the subprograms when a stride problem is involved.


Though the BLAS also permits a negative stride, the CDC BLAS only allows positive values. For this reason, we have restricted ourselves to positive increment values. If the reader is interested in our implementation, please contact the author of this paper. All source texts are available and provided with (as yet, only Dutch) comments on many pieces of the code.

In Dongarra et al. [5] an extended set of BLAS at the level of matrix-vector operations is proposed. They state that, on vector machines, one needs to optimize at this level in order to approach the potential efficiency. At the CWI, many of the proposed routines of the Extended BLAS have already been developed. The routines are inserted in the NUMVEC library [8].

In Section 2 we shall specify the BLAS subprograms, their headings and parameter lists. The timing conventions are described in Section 3. We have made a distinction between the REAL BLAS and the COMPLEX BLAS. For the REAL BLAS, the execution times and suggestions for more efficient implementations can be found in Section 4. Especially when a stride problem is involved, the CDC implementation of the COMPLEX BLAS can be substantially improved, as is shown in Section 5.

Finally, we remark that in this paper one will meet two fonts for variables: the SMALL CAPITALS, used in FORTRAN context, and the italics, used in the text environment. E.g., there is no difference in meaning between N and n, IX and ix, IY and iy. Furthermore, the vectors SX (REAL) and CX (COMPLEX) are often shortly denoted by x, and similarly SY and CY by y. Moreover, all FORTRAN 'reserved words' and all subprogram names are written in SMALL CAPITALS too.

2. Specification of the BLAS subprograms

In Section 2.1 some remarks on the increment parameters are made. They play an important role in our time measurements. Only positive increment values are considered in this paper. Next, we give information about the type and dimensions of the variables (Section 2.2) and the type of the functions (Section 2.3). Section 2.4 lists all of the 20 subprogram names and their parameter lists, and defines the operations performed by each subprogram.

2.1. Increment parameter or stride

All subprograms permit a constant spacing between the vector elements. This spacing, or stride, is specified by an increment parameter, e.g. IX. This means that only those elements with storage location

    1 + (i - 1) · ix    for i = 1, ..., n and ix > 0    (2.1)

will be used in the subprogram. All elements with another storage location are not used.

2.2. Type and dimension information for variables

The variables of the BLAS have the following types:
- INTEGER: NX, NY, N, IX, IY, IMAX,
- REAL: SX(NX), SY(NY), SW, SA, SC, SS,
- COMPLEX: CX(NX), CY(NY), CW, CA.

The variables ix and iy are the increment values of the arrays SX, CX and SY, CY, respectively. The variable n denotes the number of elements on which an operation is actually performed. Note that the dimensions nx and ny must satisfy

    nx ≥ (n - 1) · ix + 1,
    ny ≥ (n - 1) · iy + 1.    (2.2)


2.3. Type declarations for function names

The function names are of the following types:
- INTEGER: ISAMAX, ICAMAX,
- REAL: SASUM, SCASUM, SNRM2, SCNRM2, SDOT,
- COMPLEX: CDOTC, CDOTU.

2.4. The subprograms of the BLAS and their specification

This section contains the subprogram names and their parameter lists and specifies what the routines actually do. For simplicity we assume here that the increment parameters ix and iy are unity.

(1) Find the largest component of a vector:
(a) IMAX = ISAMAX(N, SX, IX),
(b) IMAX = ICAMAX(N, CX, IX).
The function ISAMAX determines the smallest integer i such that

    |x_i| = max{ |x_j| : j = 1, ..., n }

and ICAMAX determines the smallest integer i such that

    |x_i| = max{ |Re x_j| + |Im x_j| : j = 1, ..., n }.

For the argumentation of this definition of ICAMAX see [7].

(2) Euclidian length or l2 norm of a vector:
(a) SW = SNRM2(N, SX, IX),
(b) SW = SCNRM2(N, CX, IX).
Both functions calculate

    w := [ Σ_{i=1}^{n} |x_i|² ]^{1/2}.

(3) Sum of the magnitudes of vector components:
(a) SW = SASUM(N, SX, IX),
(b) SW = SCASUM(N, CX, IX).
The function SASUM computes

    w := Σ_{i=1}^{n} |x_i|

and the function SCASUM computes

    w := Σ_{i=1}^{n} { |Re x_i| + |Im x_i| }.

For the argumentation of this specific definition of SCASUM, we refer to [7].

(4) Dot products:
(a) SW = SDOT(N, SX, IX, SY, IY),
(b) CW = CDOTU(N, CX, IX, CY, IY),
(c) CW = CDOTC(N, CX, IX, CY, IY).
SDOT and CDOTU calculate the following dot product:

    w := Σ_{i=1}^{n} x_i · y_i.

The suffix C on CDOTC indicates that the vector components x_i are used conjugated, so CDOTC calculates

    w := Σ_{i=1}^{n} conj(x_i) · y_i.

(5) Elementary vector scaling (a = scalar):
(a) CALL SAXPY(N, SA, SX, IX, SY, IY),
(b) CALL CAXPY(N, CA, CX, IX, CY, IY).
Both subroutines compute

    y := a · x + y.

(6) Vector scaling (a = scalar):
(a) CALL SSCAL(N, SA, SX, IX),
(b) CALL CSCAL(N, CA, CX, IX),
(c) CALL CSSCAL(N, SA, CX, IX).
These subroutines compute

    x := a · x.

(7) Copy a vector x to y:
(a) CALL SCOPY(N, SX, IX, SY, IY),
(b) CALL CCOPY(N, CX, IX, CY, IY).
Both subroutines compute

    y := x.

(8) Interchange vectors x and y:
(a) CALL SSWAP(N, SX, IX, SY, IY),
(b) CALL CSWAP(N, CX, IX, CY, IY).
Both subroutines compute

    x :=: y.

(9) Apply a plane rotation:
(a) CALL SROT(N, SX, IX, SY, IY, SC, SS),
(b) CALL CSROT(N, CX, IX, CY, IY, SC, SS).
These subroutines compute

    x_i := sc · x_i + ss · y_i,
    y_i := -ss · x_i + sc · y_i    for i = 1, ..., n.
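To make the role of the increment parameters concrete, the following plain scalar FORTRAN sketch spells out operation (5a) for positive strides, i.e. it touches exactly the storage locations of (2.1). It is only a reference version of the specification above, not the vectorized CYBER 205 code discussed in the remainder of this paper.

      SUBROUTINE SAXPY(N, SA, SX, IX, SY, IY)
C     Reference (scalar) version of operation (5a): y := a*x + y,
C     using only the elements selected by the positive strides IX and IY.
      INTEGER N, IX, IY, I, KX, KY
      REAL SA, SX(*), SY(*)
      IF (N .LE. 0) RETURN
      KX = 1
      KY = 1
      DO 10 I = 1, N
         SY(KY) = SY(KY) + SA * SX(KX)
         KX = KX + IX
         KY = KY + IY
   10 CONTINUE
      RETURN
      END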

3. Timing conventions

We have timed both the CDC BLAS and our alternative version, to be referred to as the CWI BLAS. Our version is not a set of ready-made routines, but a piece of FORTRAN 200 code performing the operations as described in Section 2.4. The reduction of CPU time is valued above memory usage, which results in a frequent use of temporary vectors. All testing, during the first part of 1985, has been executed on the 1-pipe CYBER 205 located at SARA, Amsterdam, Netherlands. This vector computer has an 8-bank memory with a VSOS RSYS (cycles 607 and 631) operating system. All programs were compiled with FTN200 (FORTRAN 200 compiler, cycles 607 and 631) and OPT = V (i.e., automatic vectorization wherever possible), and the vector elements were stored on large pages (65 535 words) of the (virtual) memory.


In order to obtain accurate timing results, each CDC BLAS subprogram and the corresponding CWI FORTRAN 200 code were called 10000 times. The CPU times were measured by the standard FORTRAN function SECOND, as follows:

      TIME1 = SECOND( )
      DO 10 I = 1, 10000
         CALL CDC BLAS ROUTINE(...)
10    CONTINUE
      TIME2 = SECOND( )
      TIMECDC = TIME2 - TIME1
      DO 20 I = 1, 10000
         CALL CWI BLAS ROUTINE(...)
20    CONTINUE
      TIMECWI = SECOND( ) - TIME2

For the CWI BLAS, the observed CPU times could be reproduced fairly well. However, for the CDC BLAS the timing results appear to depend on the length of the object code of the program. This phenomenon has been signalled already by Van der Vorst and Van Kats [10]. Here, we always list the smallest observed CPU time.

By a single vector operation we mean a vector operation which can perform one or two scalar operations per clock cycle (two in the case of a LINKED TRIAD). The LINKED TRIAD, an operation involving 2 vectors, 1 scalar and 2 operators (one floating-point multiply and a floating-point add or subtract operator with a vector result), like

      DO 10 I = 1, N
         VR(I) = V1(I) + S * V2(I)
10    CONTINUE

is considered as a single vector operation with a very high performance. In case of optimal vectorization the CPU time Ψ_m needed for a single vector operation with m elements (stored in adjacent storage locations) would be, apart from startup times,

    Ψ_m = a · m · 20/p ns    (3.1)

on a CYBER 205 with p pipes.

Remark. There are several vector instructions, like the special hardware instruction Q8SDOT for the computation of an inner product and the GATHER/SCATTER instructions Q8VGATHP/Q8VSCATP (see Section 4.2.1), whose execution times for 1-, 2- and 4-pipe machines are identical. These instructions occur frequently in our implementation (and in the CDC implementation as well), and in such cases the CPU times will not reduce by a factor 2 or 4 when execution is carried out on a 2- or 4-pipe CYBER 205 instead of a 1-pipe CYBER 205.

Timings of the BLAS are given here for increment values 1 (no stride problem) and 2 (stride problem); in Appendix A, results for all combinations of strides 1 and 10 are given with an adjusted value of n. All arrays in our testing program contain 20 000 elements, which results in

    nx = ny = 20 000 for REAL vectors,
    nx = ny = 10 000 for COMPLEX vectors.    (3.2)

The timings of cases with non-contiguous elements in one or both vectors are compared with the corresponding contiguous case ix (= iy) = 1 and the same value of n, i.e. the number of elements on which the operation is performed.


The value of n is chosen as large as possible such that the vectors still fit into the maximum array space. This implies

    n = ½ nx    (3.3)

in all cases. Furthermore, for all time measurements we took

    a = 10000.    (3.4)

Obviously, m = n for the REAL case and m = 2n for the COMPLEX case. We always work with m = 10000 (see (3.2) and (3.3)), for which

    Ψ = Ψ_10000 = 2 CPU seconds    (3.5)

(indeed, by (3.1), Ψ_10000 = 10000 · 10000 · 20 ns = 2 s on a 1-pipe machine). By choosing m large, we may expect a small contribution of loop overhead and startup times. Therefore, in the analyses of the performance of the subprograms this Ψ-value of 2 seconds plays the role of a timing unit. And, indeed, the actually observed CPU times are always an integral multiple of 2 seconds, as can be seen from the timing results in Tables 1 and 9.

4. The REAL BLAS

The implementation and testing of these low-level operations needs some explanation, in particular when the elements are stored non-contiguously. Section 4.1 contains timings of the REAL CDC BLAS for contiguous vectors. Besides, we shall describe how the execution times of some subprograms can be improved. In Section 4.2 we treat some tools for operations on vectors with non-contiguously stored elements. The timing results of both the CDC BLAS and the CWI BLAS with and without a stride in one (or both) vector(s) can be found in Section 4.3. This section ends with some conclusions concerning the REAL BLAS.

4.1. Timing results of the REAL BLAS for contiguous vectors

Table 1
Execution times of the REAL CDC BLAS (n = 10000)

            ix = 1
            (iy = 1)
ISAMAX       2.0
SASUM        4.1
SNRM2        2.1
SDOT         2.1
SAXPY        2.0
SSCAL        2.0
SSWAP        6.1
SCOPY        2.0
SROT        12.1

To start with, we give the execution times of the CDC BLAS for n = 10000 (see Table 1). As mentioned before, one vector operation will take at least Ψ = 2 CPU seconds (cf. (3.5)). Clearly, all subprograms perform just one vector operation, except for SASUM, SSWAP and SROT. SSWAP needs three copy operations. The execution times of the CDC SASUM and SROT can be reduced as follows:

(a) SASUM: Unfortunately, the available function for calculating the sum of all elements of a vector (function Q8SSUM, see the FORTRAN 200 manual [3]) does not have the possibility to calculate the sum of the absolute values.


For that reason, we would first have to compute a vector with the absolute values, and the complete process would require 2 Ψ = 4 CPU seconds (cf. Table 1). However, a better result can be obtained as follows. Let us assume n even. The computation of the following vector:

    xx_i := |x_i| + |x_{i+n/2}|    for i = 1, ..., ½ n,    (4.1)

can be carried out in ½ Ψ CPU seconds (using the vector add instruction with sign control; see the Hardware Manual [2]). Similarly, the calculation of the sum of the elements of the vector xx requires ½ Ψ CPU seconds, since the vector is of length ½ n. In case of n being odd, this process is performed on the first n - 1 elements. The absolute value of the nth element must be added to the obtained sum, an operation with negligible execution time. In this way, the total amount of work is reduced by a factor 2.
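As an illustration, a scalar sketch of this folding scheme (assuming stride 1 and n even; the subroutine name SASUMF and the explicit work array XX are chosen here for illustration only). On the CYBER 205 each of the two loops corresponds to one vector operation of length ½ n.

      SUBROUTINE SASUMF(N, SX, XX, SW)
C     Folded sum of magnitudes, cf. (4.1): first XX(I) = |SX(I)| + |SX(I+N/2)|,
C     then the N/2 partial results are summed.
      INTEGER N, NH, I
      REAL SX(*), XX(*), SW
      NH = N / 2
      DO 10 I = 1, NH
         XX(I) = ABS(SX(I)) + ABS(SX(I + NH))
   10 CONTINUE
      SW = 0.0
      DO 20 I = 1, NH
         SW = SW + XX(I)
   20 CONTINUE
      RETURN
      END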

(b) SROT: We notice that CDC SROT apparently needs 6 vector operations. However, the same numerical result may be obtained in only 4 operations. This solution requires 2 temporary vectors, say v and w, containing

    v_i = sc · x_i,    w_i = ss · x_i    for i = 1, ..., n.    (4.2a)

The computations can then be accomplished by performing 2 LINKED TRIADS:

    x_i = v_i + ss · y_i,    y_i = sc · y_i - w_i    for i = 1, ..., n.    (4.2b)

There is even a possibility to perform the SROT operation in 3 vector operations, with the introduction of only 1 temporary vector, say z, where

    z_i = ss · (x_i + y_i)    for i = 1, ..., n.    (4.3a)

The following 2 LINKED TRIADS deliver the desired result:

    x_i = (sc - ss) · x_i + z_i,    y_i = (sc + ss) · y_i - z_i    for i = 1, ..., n.    (4.3b)
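A scalar sketch of this 3-operation scheme (stride 1; the subroutine name SROT3 and the explicit temporary array Z are chosen here for illustration). On the CYBER 205 the first loop is one vector operation and the second loop corresponds to the 2 LINKED TRIADS of (4.3b).

      SUBROUTINE SROT3(N, SX, SY, SC, SS, Z)
C     Plane rotation in 3 vector operations, cf. (4.3a)-(4.3b).
      INTEGER N, I
      REAL SX(*), SY(*), Z(*), SC, SS
      DO 10 I = 1, N
         Z(I) = SS * (SX(I) + SY(I))
   10 CONTINUE
      DO 20 I = 1, N
         SX(I) = (SC - SS) * SX(I) + Z(I)
         SY(I) = (SC + SS) * SY(I) - Z(I)
   20 CONTINUE
      RETURN
      END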

We shall refer to this solution as SROT(3) and to the 4-vector-operation solution as SROT(4). The numerical behaviour with respect to round-off of SROT(4) and SROT(3) may differ; we will not consider this point here. In Table 2 we give the timing results for these different approaches and compare them to the CDC implementation. The percentages in the third column give the approximate ratio of the CWI column and the CDC column.

Table 2
REAL BLAS, stride = 1 (n = 10000, ix (= iy) = 1)

            CDC     CWI     %
SASUM        4.1     2.1    50
SROT(4)     12.1     8.2    67
SROT(3)     12.1     6.1    50

4.2. Some tools for vector operations on non-contiguous vectors

As mentioned already in Section 2.1, the CYBER 205 is actually designed to operate on contiguous memory locations. There are two machine instructions, the so-called GATHER and SCATTER, available for the purpose of moving data elements from non-contiguous to contiguous locations (GATHER) and the reverse (SCATTER). Another possibility to operate on non-contiguously stored vector elements is to use control or BIT vectors, e.g. in the WHERE statement.


Table 3

Increment        2       4       5       8      10
n            10000    5000    4000    2500    2000
GATHER         2.9     1.5     1.2     0.9     0.7
SCATTER        2.9     1.7     1.4     0.9     0.6
Theoretical    2.5     1.25    1.0     0.625   0.5

4.2.1. The GATHER and SCATTER operations

Since the BLAS permits a constant stride, one can use the periodic GATHER and SCATTER. We have used the vector intrinsic functions Q8VGATHP (GATHER) and Q8VSCATP (SCATTER). The theoretical timings, from the CYBER 205 user's guide [6], are as follows:

    GATHER:  39 + 1¼ n clock cycles,
    SCATTER: 71 + 1¼ n clock cycles,    (4.4)

where n denotes the number of data elements to be moved. The first number in these timing formulas denotes the startup time. In Table 3 we give the execution times under the timing conventions of Section 3. In all cases the n elements are scattered over 20 000 adjacent storage locations. These machine instructions turn out to be very useful, because their timings are independent of the length of the vector from which the elements are gathered or to which they are scattered (as opposed to the control vector construction described below).

4.2.2. The control vector solution

The WHERE statement controls operations by means of a BIT vector or control vector. Actually, an operation is performed on the whole vector and a result is stored only if the corresponding bit is set. This implies that the execution time of an operation guided by a control vector depends on the time needed to perform the operation on all elements. As an example consider Table 3, where a BIT vector of length 20 000 is needed for all increment values, which will result in an execution time of at least 4 CPU seconds for one single vector operation on n elements, where n takes the values as in Table 3. As a consequence, the control vector solution can be useful only if the increment parameter is small. Moreover, the WHERE construction can only be used for subprograms with one increment parameter and in those cases where both parameters are equal.
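As an illustration of the control vector idea, the following scalar sketch mimics a masked SAXPY for ix = iy = k; on the CYBER 205 the WHERE statement performs the LINKED TRIAD over all k·n storage locations and the BIT vector only decides which results are stored. The subroutine name and the use of a LOGICAL mask array are illustrative choices of ours, not FORTRAN 200 syntax.

      SUBROUTINE SAXPYW(N, K, SA, SX, SY, MASK)
C     Control-vector SAXPY for ix = iy = k: the triad ranges over all
C     K*N locations; MASK(J) is .TRUE. only for J = 1, 1+K, 1+2*K, ...
C     (on the CYBER 205 such BIT vectors can be created by a DATA statement).
      INTEGER N, K, J
      REAL SA, SX(*), SY(*)
      LOGICAL MASK(*)
      DO 10 J = 1, K * N
         IF (MASK(J)) SY(J) = SY(J) + SA * SX(J)
   10 CONTINUE
      RETURN
      END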

4.3. Timing results of the REAL BLAS

We have timed the REAL CDC BLAS for n = 10000 with the four possible combinations of increment values 1 and 2 for ix and iy. Table 4 gives the relevant execution times. The open places follow trivially; as an example, for SDOT the execution time for ix = 2 and iy = 1 equals that for ix = 1 and iy = 2. The results in the first column are those of Table 1. The execution times in the other columns clearly display the substantial effort necessary to eliminate the stride problems. We note that n is left unchanged for the different increment values. The difference in execution times in column 1 as compared to the others is exclusively due to the non-contiguous storage of the vector elements. Note that for increment values greater than one the copy and swap operations consist only of GATHER/SCATTER operations.


Table 4
Execution times of the REAL CDC BLAS (n = 10000)

            ix = 1     ix = 2     ix = 1     ix = 2
            (iy = 1)   (iy = 2)   iy = 2     iy = 1
ISAMAX       2.0        5.1         -          -
SASUM        4.1        7.1         -          -
SNRM2        2.1        5.1         -          -
SDOT         2.1        8.1        5.1         -
SAXPY        2.0       11.0        7.9        5.1
SSCAL        2.0        8.0         -          -
SSWAP        6.1       11.9        7.9         -
SCOPY        2.0        5.9        2.9        3.0
SROT        12.1       20.1       16.0         -

From the timing results in Table 3 it appears that the execution time of one GATHER operation and also of one SCATTER operation for n = 10000 is about 3 CPU seconds. Obviously, subtracting the results in column 1 from the others for the first six subprograms, we obtain multiples of 3 CPU seconds. This suggests that for non-contiguously stored vectors the implementors of the CDC BLAS utilise the GATHER and SCATTER instructions and not, e.g., the WHERE construction. With respect to SROT we would have expected 12.1 + 4 · 3 = 24.1 CPU seconds in case of ix = iy = 2, and 12.1 + 2 · 3 = 18.1 CPU seconds in case of ix = 1, iy = 2. We will not discuss the rather technical reflections on this point. Although the machine instructions GATHER/SCATTER are very fast, they do not always yield the most efficient code for small increment values. We illustrate this by means of an example with the SAXPY operation (y := sa · x + y).

Example. Consider SAXPY with ix = iy = k > 1.
- GATHER/SCATTER: Both x and y are stored non-contiguously; this implies that both vectors are to be gathered into temporary vectors with adjacent memory locations. Next, the operation can be performed on the temporary vectors and finally, the result vector is to be scattered into y. This whole operation on n elements takes

      GATHER x        1¼ n clock cycles
      GATHER y        1¼ n clock cycles
      LINKED TRIAD       n clock cycles
      SCATTER y       1¼ n clock cycles

  together 4¾ n clock cycles, ignoring startup times.
- WHERE: Since ix equals iy, one can use the WHERE construction with a control vector to perform the LINKED TRIAD. The elements are scattered here over ix · n memory locations, so the LINKED TRIAD is performed in ix · n clock cycles, ignoring the time needed to create a control vector.

Obviously, creating such control vectors takes time too, except when this is done during compilation time, by means of a DATA statement. Therefore we have created some BIT vectors of size 65 535, the maximum length of a vector on a large page. Performing SAXPY with a control vector is more efficient than with GATHER/SCATTER up to an increment of 4 (in practice, this value would be 5, cf. the differences in the theoretical and measured times for the GATHER and SCATTER operations as listed in Table 3). The total amount of work is reduced to 37% compared to the CDC BLAS.

Analogously, one can determine for all REAL BLAS routines the increment values k for which it is still profitable to use control vectors instead of the GATHER/SCATTER operations. The routines and the corresponding values of k are listed in Table 5.

Table 5

SDOT        k ≤ 4
SAXPY       k ≤ 5
SSCAL       k ≤ 4
SSWAP       k ≤ 2
SCOPY       k ≤ 3
SROT(4)     k ≤ 2
SROT(3)     k ≤ 3

Remark. On a 2-pipe or 4-pipe CYBER 205, the gain of using control vectors for small increment values would have been much greater, since the execution times of operations guided by control vectors will decrease by a factor 2 or 4, respectively, while those of the GATHER/SCATTER operations will remain the same.

Table 6 shows the execution times for various REAL BLAS subprograms with n = 10000 and ix (= iy) = 2, with both the GATHER/SCATTER (CDC BLAS) solution and the control vector solution (CWI BLAS). The last column, giving the ratio of the CWI times and the CDC times (in %), illustrates the reduction in time which can be obtained by using control vectors instead of the GATHER/SCATTER construction. For completeness, we list all timing results of our SASUM, SROT(4) and SROT(3) in Table 7.

Table 6
REAL BLAS, stride 2 (n = 10000, ix (= iy) = 2)

            CDC     CWI     %
SDOT         8.1     4.1    50
SAXPY       11.0     4.1    37
SSCAL        8.0     4.0    50
SSWAP       11.9    12.1   101
SCOPY        5.9     4.1    68
SROT(4)     20.3    16.7    83
SROT(3)     20.3    12.4    61

Table 7
REAL CWI BLAS (n = 10000)

            ix = 1     ix = 2     ix = 1
            (iy = 1)   (iy = 2)   iy = 2
SASUM        2.1        5.1         -
SROT(4)      8.2       16.7       14.4
SROT(3)      6.1       12.4       12.1


To emphasize that the GATHER/SCATTER solution only depends on the value of n, and not on the size of the increment values, we have also timed for all combinations of increment values 1 and 10, and n = 2000. The results can be found in Appendix A, Tables A.1 and A.2.

4.4. Conclusions

The REAL BLAS subprograms, as implemented on the one-pipe CYBER 205, are all optimal for large contiguous vectors, except for the subprograms SASUM and SROT. For non-contiguous vectors with short strides, the execution times can be decreased by making use of control vectors in those cases where the gather and scatter time exceeds the time to perform the required operations.

5. The COMPLEX BLAS

The implementation of the COMPLEX BLAS is more complicated compared to the REAL case. Firstly, let us consider the storage of a COMPLEX vector. A COMPLEX vector is stored in the following way:

    R_1 I_1 R_2 I_2 ...    (5.1)

where R_i and I_i represent the real and imaginary part, respectively, of the ith COMPLEX element. This way of storage implies that an operation on the consecutive real parts results in a stride 2 problem. In Section 5.1 we describe some subroutines to perform such a separation of a COMPLEX vector into two REAL vectors, together with their execution times. For a definition of the subprograms of the COMPLEX BLAS we refer to Section 2.4. The execution times of the COMPLEX CDC BLAS for contiguous vectors are listed in Section 5.2. In Section 5.3 a description is given of our implementation of the COMPLEX BLAS. Much attention is paid to the COMPLEX BLAS for non-contiguous vectors. In Section 5.4 we propose some tools for handling stride problems in COMPLEX vectors. In Section 5.5 we demonstrate how we have implemented these subprograms, assuming that, in case of two increments, both are greater than one. In Section 5.6 we consider the subprograms with two increment parameters, one of which is equal to unity. Finally, in Section 5.7 some conclusions are drawn.

5.1. Splitting of a COMPLEX vector into a real and an imaginary part

Working with COMPLEX vectors often makes it necessary to separate the real from the imaginary components. For vectors, just like for scalars, there exist intrinsic functions for this purpose, the so-called V-functions. To obtain the real and imaginary components, the functions VREAL and VAIMAG are available. Table 8 gives their execution times for n = 5000 and a = 10000 (see Section 3 for our timing conventions).

Table 8
(n = 5000, a = 10000)

VREAL            4.0
VAIMAG           2.0
GATHER (per.)    1.5

Since in a COMPLEX vector the real and imaginary components alternate, an EQUIVALENCE of a COMPLEX vector with a REAL vector delivers a vector containing the real and imaginary components, each with a stride 2.


As a consequence, one can use the periodic GATHER (i.e. the vector function Q8VGATHP) to gather the real/imaginary components into temporary vectors as follows:

      PARAMETER (N = 5000, N2 = 2 * N)
      REAL RCX(N2)
      COMPLEX CX(N)
      EQUIVALENCE (CX, RCX)
      DESCRIPTOR D1, D2
      ASSIGN D1, .DYN. N
      ASSIGN D2, .DYN. N
      D1 = Q8VGATHP(RCX(1; N2 - 1), 2, N; D1)
      D2 = Q8VGATHP(RCX(2; N2 - 1), 2, N; D2)

Note that in the second call of Q8VGATHP we start with the second element of RCX, i.e. the first imaginary component. After compilation and execution of this piece of FORTRAN 200 code the DESCRIPTORS D1 and D2 will refer to vectors containing the real and imaginary components, respectively. In the FORTRAN 200 manual [3] one can find a description of the vector intrinsic function Q8VGATHP.

As can be seen in Table 8, the execution time of the periodic GATHER is about 1.5 CPU seconds, considerably faster than the VREAL and VAIMAG timing results. Apart from this, we were very surprised to find different timings for VREAL and VAIMAG on the CYBER 205. Moreover, both intrinsic functions appear to be too slow compared to the periodic GATHER. In case of contiguous COMPLEX vectors it is not necessary to separate the real and imaginary components. This will be illustrated in Section 5.3.
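In scalar terms the two periodic gathers amount to the following split of the equivalenced REAL array; the subroutine and array names are ours, the output arrays standing in for the dynamic vectors referred to by the descriptors D1 and D2.

      SUBROUTINE CSPLIT(N, RCX, D1R, D2I)
C     Split RCX = (R1, I1, R2, I2, ...) into the real parts D1R and the
C     imaginary parts D2I, i.e. two periodic gathers with period 2.
      INTEGER N, I
      REAL RCX(*), D1R(*), D2I(*)
      DO 10 I = 1, N
         D1R(I) = RCX(2*I - 1)
         D2I(I) = RCX(2*I)
   10 CONTINUE
      RETURN
      END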

5.2. Timing results of the COMPLEX BLAS without additional stride problems

In Table 9 we give the execution times of the COMPLEX CDC BLAS for n = 5000 and a = 10000, where a is the number of times a subprogram is called. Note that there are 5000 COMPLEX elements, so that each operation is performed on m = 10000 memory locations. We notice that each subprogram roughly takes a multiple of the theoretical time Ψ = 2 CPU seconds (cf. (3.5)), so the number of vector operations performed for each subprogram can easily be derived. As for our implementation, we show in the next section how we have achieved or even improved these times.

Table 9
Execution times of the COMPLEX CDC BLAS (n = 5000)

            ix = 1
            (iy = 1)
ICAMAX       4.1
SCASUM       4.1
SCNRM2       2.1
CDOTU        8.2
CDOTC        6.1
CAXPY        6.3
CSCAL        6.1
CSSCAL       2.0
CSWAP        6.1
CCOPY        2.0
CSROT       12.1


5.3. Our implementation of the COMPLEX BLAS without additional stride problems

In this section we demonstrate how we have vectorized the various COMPLEX BLAS operations. As mentioned in Section 5.1, it is often quite convenient to use a REAL vector EQUIVALENCEd with a vector of type COMPLEX. Such vectors enable us to operate on REAL vectors. We shall use the vectors

    REAL rx(1; 2n) = COMPLEX x(1; n)    (5.2a)

and

    REAL ry(1; 2n) = COMPLEX y(1; n).    (5.2b)

It is not our intention to give extended specifications of our implementation. The aim of this description is to show which operations on the REAL vectors rx and ry actually are to be performed. The theoretical times, according to the timing conventions listed in Section 3, are added.

(a) ICAMAX. First of all we have to compute a temporary vector z of size n as follows:

    z_i = |Re x_i| + |Im x_i|    for i = 1, ..., n.    (5.3)

There are several methods to compute such a vector, all having the same speed. To show the nice possibilities of the so-called sparse vector instructions, we discuss one of these instructions. According to the hardware manual [2], a sparse vector consists of a vector pair, one of which is a BIT string, identified as an order vector, and the other is a floating-point array identified as the data vector. Sparse order vectors determine the positional significance of the segments of the corresponding sparse data vector. To compute the vector z we used the sparse vector addition instruction (see [2]) with the following vectors:

[Schematic of the sparse vector addition: BIT order vectors (1) and (4), REAL data vectors (2) = rx(1; 2n - 1) and (3) = rx(2; 2n - 1), the resulting sparse data vector (5) with elements |R_i| + |I_i|, and the resulting sparse order vector (6).]

We add by means of the logical .AND. operation on the BIT vectors (1) and (4), which results in the sparse order vector (6). The BIT vectors (1) and (4) are the same vectors, but for a different start address, and can be created by a DATA statement. The REAL vector (2) contains rx(1; 2n - 1) and the REAL vector (3) contains rx(2; 2n - 1). As an important side effect, the sparse vector instructions have sign control (i.e. it is possible to use the absolute value of all input data elements). The resulting sparse data vector (5) contains just the desired values. GATHER operations or operations with control vectors are superfluous for further computations. The length of the BIT or sparse order vectors determines the timing, so this instruction takes 1½ times the time of a non-sparse addition, i.e. 1½ Ψ = 3 CPU seconds, but at the gain of the desired storage of the result vector. Finally, a call of Q8SMAXI (see [3]) will be sufficient to obtain the smallest index i such that

    z_i = max{ z_j : j = 1, ..., n }.

Notice that in this case all elements of z are positive. Since n is equal to ½ m, we observe that this operation is performed in ½ Ψ CPU seconds. Summarizing, the theoretical time of this implementation of ICAMAX is 2 Ψ = 4 CPU seconds.

(b) SCASUM. This subprogram illustrates why splitting of real and imaginary components can be superfluous. In analogy to the REAL BLAS routine SASUM we start with a vector addition (with sign control) of the first n elements of the REAL vector rx and its last n elements (rx is of length 2n, cf. (5.2a)).


Next we calculate the sum of the elements of the temporary vector (2 · ½ Ψ = 2 CPU seconds).

(c) SCNRM2. The implementation of this routine is quite simple. The dot product rx · rx (Ψ = 2 CPU seconds), followed by the square root, gives the required result.

(d) CDOTU/CDOTC. The routines CDOTU/CDOTC compute the unconjugated and conjugated dot products of two vectors. Obviously the timings of the routines are different. CDOTU is performed in 4 vector operations on 2n REAL elements and CDOTC is performed in 3 vector operations. This is explained as follows. For the real part of the result we have

    Re CDOTU = Σ_i Re x_i · Re y_i - Σ_i Im x_i · Im y_i,
    Re CDOTC = Σ_i Re x_i · Re y_i + Σ_i Im x_i · Im y_i,

and for the imaginary part we have

    Im CDOTU = Σ_i Re x_i · Im y_i + Σ_i Im x_i · Re y_i,
    Im CDOTC = Σ_i Re x_i · Im y_i - Σ_i Im x_i · Re y_i.

Considering the result part Re CDOTC, two dot products have to be calculated, one of the real components and one of the imaginaries. However, the real and imaginary elements of the COMPLEX vector x are stored one after the other in the REAL vector rx, as is the case for the elements of y in ry. So, only one dot product of the vectors rx and ry will suffice to obtain Re CDOTC (Ψ = 2 CPU seconds), and no gathering of the real (imaginary) components is needed. In case of Re CDOTU we are dealing with a minus sign, and consequently one dot product will be insufficient. Here, two dot products on rx and ry, each with control vector 1 0 1 0 ..., are needed, followed by a subtraction of the scalar result values (2 Ψ = 4 CPU seconds).

In Im CDOTU/Im CDOTC mixed products of real and imaginary components appear. The first sum in Im CDOTU/Im CDOTC is obtained by calculating the dot product (with control vector 1 0 1 0 ...) of rx and ry, where ry is shifted one element. In a similar way, the second sum is found. Summarizing, the theoretical time of CDOTC amounts to 3 Ψ = 6 CPU seconds and of CDOTU to 4 Ψ = 8 CPU seconds.

(e) CAXPY. We have to compute y := ca · x + y, i.e.

    Re y_i := Re ca · Re x_i - Im ca · Im x_i + Re y_i,
    Im y_i := Re ca · Im x_i + Im ca · Re x_i + Im y_i    for i = 1, ..., n.

This can be implemented as three LINKED TRIADS:

    ry(1; 2n) = ry(1; 2n) + Re ca · rx(1; 2n),
    Re y_i = Re y_i - Im ca · Im x_i,
    Im y_i = Im y_i + Im ca · Re x_i    for i = 1, ..., n.

The last two statements require the use of control vectors. For these LINKED TRIADS real and imaginary components of different vectors are required simultaneously per cycle. Similar to the situation for Im CDOTU/Im CDOTC, a shift on the vector rx or ry enables us to start with the first imaginary component. The expected execution time amounts to 3 Ψ = 6 CPU seconds.
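In scalar terms the three passes over the equivalenced REAL vectors look as follows; the subroutine name is ours and ca is passed as its real and imaginary parts CAR and CAI. On the CYBER 205 the first loop is one full-length LINKED TRIAD and the last two are LINKED TRIADS guided by control vectors (or a shifted start address).

      SUBROUTINE CAXPYR(N, CAR, CAI, RX, RY)
C     CAXPY (y := ca*x + y) on the equivalenced REAL arrays RX, RY of
C     length 2n: (1) full-length triad with Re(ca), (2) real parts
C     corrected with -Im(ca)*Im(x), (3) imaginary parts with +Im(ca)*Re(x).
      INTEGER N, I
      REAL CAR, CAI, RX(*), RY(*)
      DO 10 I = 1, 2*N
         RY(I) = RY(I) + CAR * RX(I)
   10 CONTINUE
      DO 20 I = 1, N
         RY(2*I - 1) = RY(2*I - 1) - CAI * RX(2*I)
   20 CONTINUE
      DO 30 I = 1, N
         RY(2*I) = RY(2*I) + CAI * RX(2*I - 1)
   30 CONTINUE
      RETURN
      END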


(f) CSCAL. Obviously, the implementation of this scaling operation is almost equivalent to that of CAXPY. We have to compute x := ca · x, i.e.

    Re x_i := Re ca · Re x_i - Im ca · Im x_i,
    Im x_i := Re ca · Im x_i + Im ca · Re x_i    for i = 1, ..., n.

For the real part as well as for the imaginary part this would take at least two vector operations with control vectors. Similar to CAXPY this process can be implemented by

    rz(1; 2n) = Im ca · rx(1; 2n),
    Re x_i = Re ca · Re x_i - Im z_i,
    Im x_i = Re ca · Im x_i + Re z_i    for i = 1, ..., n.

The REAL vector rz (COMPLEX vector z) is a temporary vector with 2n elements (c.q. n COMPLEX elements). Again we have 3 vector operations, the first without and the latter two with a control vector (3 Ψ = 6 CPU seconds).

(g) CSSCAL/CSWAP/CCOPY. The operations are performed on the REAL vectors rx and ry. CSSCAL and CCOPY each take a single vector operation on m = 2n elements (Ψ = 2 CPU seconds). The computation of CSWAP requires three copy operations (3 Ψ = 6 CPU seconds).

(h) CSROT. The subprogram CSROT performs the following operation on the COMPLEX vectors x and y:

    x_i := sc · x_i + ss · y_i,
    y_i := -ss · x_i + sc · y_i    for i = 1, ..., n.

Since the values ss and sc are of type REAL, we are dealing with simple scalar times COMPLEX vector multiplications, or, using rx and ry, with scalar times REAL vector multiplications. In Section 4.1 we have demonstrated how the SROT operation can be performed in 4 vector operations and even in 3 operations. Similarly, CSROT can be performed in 4 or 3 vector operations on the REAL vectors rx and ry (CSROT(4): 8 CPU seconds; CSROT(3): 6 CPU seconds).

The description of our implementation is now complete for contiguous vectors. The execution times of our code agree with the times measured for the CDC BLAS implementation in Table 9, except for SCASUM (measured 2.1 CPU seconds), CSROT(4) (8.2 CPU seconds) and CSROT(3) (6.1 CPU seconds).

5.4. Some tools for operating with strides in COMPLEX vectors

A storage spacing between COMPLEX elements of size ix will deliver the pattern

    R_1 I_1 ... R_{ix+1} I_{ix+1} ... R_{2ix+1} I_{2ix+1} ...

For the REAL BLAS the use of the periodic GATHER appeared to provide a fast solution to the stride problem. Here, however, two consecutive locations per element are required to maintain the specific storage of a COMPLEX vector. Hence, the periodic GATHER will fail, because only one element per period can be stored into a vector. A compression of the original vector is a possibility to maintain the COMPLEX vector storage. To scatter the elements back after computation, merging and masking operations can be used. Unfortunately, all these operations need control vectors. The length of such control vectors dominates the execution time of the operations, so that, for large increments, these operations will be very expensive. Moreover, the creation of the control vectors increases the execution time. A better solution is again the periodic GATHER, but now operating on the real and the imaginary components separately. Then, with a MERGE operation on the resulting two vectors, we can get the original COMPLEX storage. But why should we do this? We would like to emphasize this important question.


A possible answer could be that the subprograms could be performed in the same way as in the case of contiguous COMPLEX vectors. But precisely the troublesome storage of COMPLEX vectors involves considerable effort, like operating with control vectors to obtain only the real components or just the imaginary ones. Why should we not take full advantage of working with two separated vectors? In the next section we shall discuss this new situation and give the execution times of our implementation compared to the times of the CDC BLAS.

5.5. Our implementation of the COMPLEX BLAS for non-contiguous vectors

The periodic GATHER is particularly suitable to handle strides. The major motivation to use this instruction is its substantial speed, its execution time only depending on the number of elements to be gathered. So, for large increment values this method is always attractive. In contrast to the REAL BLAS, this method may be worthwhile even for small increment values, as we shall see below. For COMPLEX vectors, we have to gather twice, once to create a vector with the real components and once to create a vector with the imaginary components. Notice that the original storage of a contiguous COMPLEX vector is

    R_1 I_1 R_2 I_2 ...    (5.4)

We shall refer to this storage as Storage I. Two periodic gatherings will lead to

    R_1 R_2 ... I_1 I_2 ...    (5.5)

referred to as Storage II. To diminish the number of startup times, the real and imaginary vectors are stored in the same array after each other, wherever possible. Let us now discuss our implementation of the COMPLEX BLAS for non-contiguous vectors based on Storage II. Only those subprograms are treated below which can take full advantage of this storage. Here we assume that, if two vectors are involved, both vectors are stored non-contiguously. In Section 5.6 we discuss the case of dealing with two vectors, one of which has increment value 1.

(a) ICAMAX. As pointed out in Section 5.3, the main part of the computation is to construct the vector z (cf. (5.3)). Here we are dealing with two separated vectors, and only one simple addition of these vectors (with sign control) is all we need, followed by a call of Q8SMAXI, of course.

(b) CDOTU/CDOTC. As we may recall from our description of CDOTU/CDOTC for increments equal to unity, inner products with control vectors are invoked. In the case of separated vectors no time is wasted to compute unnecessary results. Both for CDOTU and CDOTC it is recommended to put the vector with the real components and the vector with the imaginaries after each other (or the reverse) into one array in order to reduce the startup times.

(c) CAXPY/CSCAL. It will be clear that all LINKED TRIADS in the implementation of both subprograms can now be performed without control vectors. This gives a substantial gain in speed.

(d) CCOPY. The execution time of this routine depends exclusively on the CPU time of the GATHER and SCATTER operations. Obviously, first the relevant real components of x are gathered, and next they are scattered into the desired locations of the vector y. An analogous process on the imaginary components is performed afterwards.

(e) CSWAP. The implementation of CSWAP with increments equal to one needs three copies. In this case, it is only a matter of gathering all relevant elements into temporary vectors, followed by scatter operations to restore them into the right locations.


(f) CSROT. As described in Section 5.3, the CSROT operation on contiguously stored vectors is carried out without rearranging the elements and without the use of control vectors. In the non-contiguous case, after gathering the relevant elements, we obtain the Storage II mode. Hence, it is more efficient to follow a method based on disjunct vectors than to rearrange the elements into Storage I. Moreover, a Storage I implementation still requires a splitting of the real and imaginary components into temporary vectors before the elements can be scattered into the right locations.
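As an example of the gain, once the components have been gathered into separate contiguous arrays (Storage II), CDOTU needs no control vectors at all; a scalar sketch with names of our own choosing, where the single loop stands for the four plain dot products of the vector code.

      SUBROUTINE CDOTU2(N, XR, XI, YR, YI, CWR, CWI)
C     CDOTU on Storage II: XR/XI and YR/YI hold the gathered real and
C     imaginary components; CWR and CWI return Re and Im of the result.
      INTEGER N, I
      REAL XR(*), XI(*), YR(*), YI(*), CWR, CWI
      CWR = 0.0
      CWI = 0.0
      DO 10 I = 1, N
         CWR = CWR + XR(I) * YR(I) - XI(I) * YI(I)
         CWI = CWI + XR(I) * YI(I) + XI(I) * YR(I)
   10 CONTINUE
      RETURN
      END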

At this point we are left to give the timings of our implementation and to compare them to the CDC BLAS timings. The timings were accomplished for the following three combinations of increment values, with associated n values:
(i) ix (= iy) = 2; n = 5000,
(ii) ix (= iy) = 10; n = 1000,
(iii) ix = 2 (iy = 10); n = 1000.
The combination (ix = 10 (iy = 2)) should yield the same timing results as combination (iii). Moreover, from the special choice of the size of n (see Appendix A), it turns out that combinations (ii) and (iii) result in the same execution times, despite the fact that the relevant elements of the vector x are differently located. The measured CPU times of these combinations can be found in Appendix A, Table A.3. Table 10 gives the timing results of combination (i); all subprograms are called 10000 times.

Comments.
(a) The difference in execution speed between the SCASUM (CDC) and SCASUM (CWI) versions is caused by the alternative approach of the BLAS operation after elimination of the stride problem (see Section 5.3).
(b) The timing results of the CDC BLAS suggest that only GATHER and SCATTER operations are involved to handle the stride problems.
(c) All execution times of the CDC BLAS can be improved for all increment values greater than one, except for the subprograms SCNRM2 and CSSCAL, whose performance does not suffer from the inherent troublesome storage of a COMPLEX vector.
(d) Comparing the percentages in Table 10 to those in Table A.3 (the results of combinations (ii) and (iii), see Appendix A), the conclusion must be that all improvements based on Storage II are independent of the size of the increment value (> 1).
(e) The execution times are totally dominated by the size of n. As an example, a multiplication of the CPU times of both the CDC and CWI implementations in Table A.3 by a factor 5 (i.e. the quotient of the n values used) roughly delivers the execution times of Table 10.

Table 10
COMPLEX BLAS, stride 2 (n = 5000, ix (= iy) = 2)

            CDC     CWI     %
ICAMAX      13.9     5.1    37
SCASUM       7.0     5.0    72
SCNRM2       5.1     5.1   101
CDOTU       22.2    10.3    46
CDOTC       20.2    10.3    51
CAXPY       23.7    13.9    58
CSCAL       21.5    10.6    50
CSSCAL       8.5     8.5   101
CSWAP       21.8    13.1    60
CCOPY       14.4     6.5    45
CSROT(4)    29.8    21.8    72
CSROT(3)    29.8    19.1    64


All control vector operations that are suitable for contiguously stored COMPLEX vectors have disappeared in the code based on Storage II. The number of GATHER and SCATTER instructions is unchanged, and consequently their contribution to the total execution time still remains considerable for small increment values. Hence, we may expect that the use of the WHERE statement, as pointed out in Section 4.2.2, will deliver improved timings. And, indeed, especially for those routines which do not suffer from the mixed data structure of real and imaginary components, like the copy, swap, real scaling and rotation operations, some gain could be obtained. The values in Table 5 are valid for this kind of operations. For the inner products CDOTU and CDOTC the Storage II solution will result in the most efficient code. For the CAXPY operation (ix = iy = 2), as well as the COMPLEX scaling routine CSCAL (ix = 2), both versions are comparable in efficiency on a 1-pipe CYBER 205. On a 2-pipe or 4-pipe CYBER 205, it is better to avoid the rearranging of elements as long as possible, because the timing for the GATHER/SCATTER instructions will not decrease on a machine with 2 or more pipes. This is in contrast to the timing for the WHERE construction, which does depend on the number of pipes. For that reason, we assume that the number of pipes has to be taken into account for the implementation.

5.6. The COMPLEX BLAS with only one increment greater than one

First of all, we state that the periodic SCATTER is not always the most efficient way to put elements into the right locations. We mentioned before that merging of two vectors is a very expensive operation. However, if the control vector contains as many 1's as 0's, this is not true. More explicitly, a MERGE operation on the vectors with real and imaginary components is more efficient than two SCATTER operations. Table 11 illustrates this. We notice that a MERGE operation instead of two SCATTER operations decreases the execution time by a factor of about 1½.

Table 11

Increment        2      4      5      8     10
n             5000   2500   2000   1250   1000
2 * SCATTER    3.5    1.8    1.2    0.8    0.6
MERGE          2.0    1.0    0.8    0.5    0.4

In practice, one will often deal with only one increment greater than one. In this case, a choice must be made between an implementation based on Storage I and one based on Storage II (cf. patterns (5.4) and (5.5)). To illustrate which of them is preferable we shall give two examples, both concerning CAXPY (y := y + ca · x). We shall distinguish between two cases, (ix > 1, iy = 1) and (ix = 1, iy > 1). For the theoretical timings of the GATHER/SCATTER we refer to formula (4.4). All startup times will be ignored.

Example 5.1: CAXPY with ix > 1 and iy = 1.
- Storage I: Two GATHER operations are needed to obtain the relevant real and imaginary elements of the vector x, followed by a MERGE operation to restore Storage I. At this point we can continue with the implementation as described in Section 5.3 (3 LINKED TRIADS on 2n elements).
- Storage II: First we have to gather all relevant real/imaginary elements of the vectors x and y to get storage mode II. The operation can now be performed in 4 LINKED TRIADS on n REAL elements. Finally, a MERGE operation is needed to store the result vector in normal COMPLEX storage mode.


The theoretical results for both methods are:

                            Storage I          Storage II
op.                        #    cycles        #    cycles
2 * GATHER                 1    2½ n          2    5 n
MERGE                      1    2 n           1    2 n
LINKED TRIADS on 2n        3    6 n           -    -
LINKED TRIADS on 1n        -    -             4    4 n
total                          10½ n              11 n

Example 5.2: CAXPY with ix = 1 and iy > 1.
- Storage I: The first part is analogous to ix > 1, iy = 1; however, in this case the relevant elements of the vector y are gathered and merged. Here, the COMPLEX elements of the temporary result vector must be copied to the desired locations with a stride of iy. The most efficient way for large increment values is the GATHER/SCATTER solution as used in the next table.
- Storage II: This method is almost identical to the previous example of Storage II. Here, the MERGE operation is replaced by two SCATTER operations, since the result vector y has an increment value of iy.

The theoretical times for both methods are:

                            Storage I          Storage II
op.                        #    cycles        #    cycles
2 * GATHER                 2    5 n           2    5 n
2 * SCATTER                1    2½ n          1    2½ n
MERGE                      1    2 n           -    -
LINKED TRIADS on 2n        3    6 n           -    -
LINKED TRIADS on 1n        -    -             4    4 n
total                          15½ n              11½ n

The conclusion from both examples must be that in case of (ix > 1, iy = 1) the computation based on Storage I is to be elected, and, if (ix = 1, iy > 1), the Storage II approach is to be preferred. We would like to emphasize that the theoretical times for the GATHER/SCATTER operations are very optimistic (see Table B.1, Appendix B). In practice, the difference in execution time between the Storage I and Storage II approach in Example 5.1 will be greater (Storage II needs more GATHER operations).

A similar analysis can be made for CDOTU/CDOTC. It appears to be more efficient to follow the Storage II approach for these functions, avoiding dot products with control vectors. The subprograms CSWAP and CCOPY are implemented according to Storage II, as described in the previous section. A MERGE operation instead of two SCATTER operations is involved in case the result vector has an increment value of 1. In Table 12 the execution times of both the CDC BLAS and our implementation are listed, where either ix = 1 or iy = 1; the value of the other increment parameter is 2. The suffixes 12 and 21 symbolize the nonsymmetrical cases (ix = 1, iy = 2) and (ix = 2, iy = 1). Similarly to the case where both increments are greater than 1, the CWI timings are smaller than the CDC BLAS timings, albeit less spectacular, as can be seen in Tables 10 and A.3 (in Appendix A).

Table 12
COMPLEX BLAS, mixed increment values (n = 5000, (ix = 1, iy = 2) or (ix = 2, iy = 1))

            CDC     CWI     %
CDOTU       15.2    10.1    66
CDOTC       13.2    10.1    77
CAXPY12     16.2    13.9    86
CAXPY21     13.3    11.0    83
CSWAP       12.7    11.5    91
CCOPY12      7.2     6.6    91
CCOPY21      7.4     4.9    66
CSROT(4)    21.1    19.6    93
CSROT(3)    21.1    17.8    84

5.7. Conclusions

All execution times of the CDC BLAS can be improved for all increment values greater than one, except for the subprograms SCNRM2 and CSSCAL. The performance of both SCNRM2 and CSSCAL does not suffer from the inherent troublesome storage of a COMPLEX vector. Like the REAL SASUM and SROT, the execution times of the COMPLEX routines SCASUM and CSROT can also be speeded up.

Splitting the real and imaginary components of COMPLEX vectors if ix (= iy) = 1 does not improve the efficiency. But, in case of an additional stride problem, disjunct vectors of real and of imaginary components are created, and then it is possible to take advantage of working with separated vectors. Even when only one vector has an increment greater than one, it is attractive to split the real and imaginary components of both vectors. In accordance with the REAL BLAS, some timing results for small increment values can be improved using the WHERE construction. The improvement depends on the number of pipes of the CYBER 205.

Acknowledgment

The author would like to thank Drs. J. Schlichting for his valuable suggestions on the implementation of S(C)ASUM and (C)SROT(3). She wishes to acknowledge Prof. dr. H.A. van der Vorst and Dr. ir. H.J.J. te Riele for their constructive remarks and careful reading of the manuscript.

Appendix A

In order to show that not only the CDC BLAS but also the CWI BLAS timing results are independent of the increment value, or stride, we have experimented with increment value 10. As a consequence, the relevant vector elements are more scattered through the vectors. To avoid needlessly expensive LP FAULTS (large page faults), we keep the size of the dimensions equal to the values given in (3.2) (i.e. 20 000 for REAL vectors and 10 000 for COMPLEX vectors). Therefore, the value of n must be adapted, such that the elements still fit in the vectors. With formula (2.2) the maximum allowed value of n is obtained; this yields m = 2000. Again, all subprograms are called a = 10000 times. The theoretical time needed for one single vector operation becomes

    Ψ' = Ψ_2000 = 0.4 CPU seconds.


Comparing this value of Ψ' with the value of Ψ given in (3.5), we notice that the theoretical time decreases by a factor 5, and consequently all measured CPU times roughly decrease by this factor. Though the startup times are still negligible with respect to the vector length, the overhead here is larger than in the case of a stride 2. The tables list the following results:
- Table A.1 corresponds to Table 4, except that here n = 2000 and ix, iy ∈ {1, 10}.
- Table A.2 corresponds to Table 7, and lists the timings of the improved subprograms SASUM and SROT. The other execution times of the CWI BLAS correspond to the values in Table A.1.
- Table A.3 contains the results of combinations (ii) and (iii) as mentioned in Section 5.5.

Table A.1
REAL CDC BLAS (n = 2000)

            ix = 1     ix = 10    ix = 10    ix = 1
            (iy = 1)   (iy = 10)  iy = 1     iy = 10
ISAMAX       0.5        1.2         -          -
SASUM        0.9        1.5         -          -
SNRM2        0.5        1.2         -          -
SDOT         0.5        1.8        1.2         -
SAXPY        0.5        2.4        1.1        1.7
SSCAL        0.5        1.7         -          -
SSWAP        1.3        2.6        1.7         -
SCOPY        0.5        1.3        0.7        0.7
SROT         2.5        4.2         -         3.4

Table A.2
REAL CWI BLAS (n = 2000)

            ix = 1     ix = 10    ix = 1
            (iy = 1)   (iy = 10)  iy = 10
SASUM        0.5        1.2         -
SROT(4)      1.7        4.3        3.0
SROT(3)      1.3        3.9        2.6

Table A.3
COMPLEX BLAS, combinations (ii) and (iii) (n = 1000, ix (= iy) = 10 or ix = 2 (iy = 10))

            CDC     CWI     %
ICAMAX       2.9     1.1    38
SCASUM       1.4     1.1    61
SCNRM2       1.0     1.1   105
CDOTU        4.6     2.2    47
CDOTC        4.2     2.2    52
CAXPY        5.0     2.8    56
CSCAL        4.5     2.1    47
CSSCAL       1.6     1.6   104
CSWAP        4.4     2.4    53
CCOPY        3.1     1.3    39
CSROT(4)     6.0     4.1    68
CSROT(3)     6.0     3.7    60


Table A.4
COMPLEX BLAS, mixed increment values (n = 1000, (ix = 1, iy = 10) or (ix = 10, iy = 1))

            CDC     CWI     %
CDOTU        3.2     2.2    69
CDOTC        2.8     2.2    79
CAXPY1k      3.5     2.8    81
CAXPYk1      2.8     2.4    85
CSWAP        2.7     2.3    85
CCOPY1k      1.6     1.2    78
CCOPYk1      1.6     1.1    69
CSROT(4)     4.3     4.1    95
CSROT(3)     4.3     3.6    84

- Table A.4 corresponds to Table 12, with mixed increment values. The suffixes k1 and 1k symbolize the nonsymmetrical cases (ix = k = 10, iy = 1) and (ix = 1, iy = k = 10).

Appendix B

In formula (4.4) the theoretical timings of the GATHER and SCATTER operations are given. In practice, the factor 1¼ appears to be too optimistic. In Table B.1 we list the actually obtained factor f for a GATHER/SCATTER operation on n = 1000 elements. Again a = 10000, and c denotes the cycle time of the CYBER 205. This yields

    f = measured CPU time / (a · n · c).

Both positive and negative increment values are involved.

Table B.1
GATHER/SCATTER factors

Increment   -64   -63   -62   -61   -60   -59   -58   -57   -56   -48   -40   -32   -24   -16    -8    -7    -6    -5    -4    -3    -2    -1     0
GATHER     4.20  2.58  1.50  1.54  1.58  1.92  1.69  1.88  1.74  1.52  1.68  2.27  1.63  1.63  1.54  1.73  1.63  1.56  1.45  1.65  1.45  2.83  4.20
SCATTER    4.21  2.74  1.49  1.65  1.74  1.82  1.81  1.85  1.87  1.51  1.63  2.27  1.46  1.39  1.23  1.42  1.55  1.46  1.35  1.36  1.27  2.72  4.21

Increment     0     1     2     3     4     5     6     7     8    16    24    32    40    48    56    57    58    59    60    61    62    63    64
GATHER     4.20  2.83  1.57  1.92  1.44  1.68  1.69  1.62  1.74  1.51  1.62  2.27  1.64  1.63  1.54  1.38  1.54  1.56  1.44  1.56  1.44  2.58  4.20
SCATTER    4.21  2.76  1.49  1.66  1.76  1.82  1.81  1.85  1.88  1.51  1.64  2.27  1.46  1.40  1.24  1.42  1.56  1.47  1.35  1.35  1.27  2.74  4.21

References

[1] R.L. Anderson et al., IMSL user's manual, Problem-solving software system for mathematical and statistical FORTRAN programming, 1984.
[2] CDC, CYBER 200 ASSEMBLER reference manual, version 2, publ. number 60485010.
[3] CDC, CYBER 200 FORTRAN reference manual, version 1, publ. number 60480200B.
[4] J.J. Dongarra, J.R. Bunch, C.B. Moler and G.W. Stewart, LINPACK Users' Guide (SIAM, Philadelphia, PA, 1979).
[5] J.J. Dongarra, J. Du Croz, S. Hammarling and R.J. Hanson, A proposal for an extended set of Fortran basic linear algebra subprograms, SIGNUM 20 (1985).
[6] A. Emmen, CYBER 205 user's guide, SARA, Amsterdam, part 3, Optimization of FORTRAN programs, 1984.
[7] C.L. Lawson, R.J. Hanson, D.R. Kincaid and F.T. Krogh, Basic linear algebra subprograms for FORTRAN usage, ACM Trans. Math. Software 5 (1979) 308-323.
[8] NUMVEC, A library of NUMerical software for VECtor and parallel computers in FORTRAN, Centre for Mathematics and Computer Science, Amsterdam.
[9] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema and C.B. Moler, Matrix Eigensystem Routines -- EISPACK Guide, Lecture Notes in Computer Science 6 (Springer, Berlin, 1976).
[10] H.A. van der Vorst and J.M. van Kats, The performance of some linear algebra algorithms in FORTRAN on CRAY-1 and CYBER 205 supercomputers, Report TR-17, Academisch Computer Centrum, Utrecht, 1984.