Microprocessing and Microprogramming 37 (1993) 29-32
North-Holland
An Investigation of the Performance Characteristics of an i860 Processor within a Meiko Computing Surface

S. A. Rea (a), P. Milligan (b), R. K. McConnell (a), L. A. Murphy (c) and J. Pelan (c)

(a) Parallel Computer Centre, The Queen's University of Belfast
(b) Department of Computer Science, The Queen's University of Belfast
(c) Department of Applied Mathematics and Theoretical Physics, The Queen's University of Belfast

This paper presents an implementation of the BLAS 1 subroutine library in i860 assembly code. The target machine is an MK086 high performance dual-node board in a Meiko Computing Surface; however, the library has been designed to run on any i860 processor. The inherent deficiencies of the generic Fortran compilers are addressed while program portability is retained. The use of techniques such as dual-instruction mode and pipelined operations has enabled the execution times of all routines to be reduced compared to their Fortran equivalents.

1. INTRODUCTION

"The i860 64-bit microprocessor brings new levels of performance and capability to the microprocessor world. Incorporating more than one million transistors and executing up to three operations per clock, the i860 processor can perform up to 80 Mflops (Millions of Floating-point Operations Per Second) and 40 MIPS (Millions of Instructions Per Second) at 40 MHz."[1]

Statements like this by Les Kohn, Intel's chief architect of the i860 microprocessor, led many research groups to believe that they could move from expensive, heavily congested supercomputers such as a Cray YMP to local, low-cost, high-performance i860-based machines. In 1991 the Parallel Computer Centre in The Queen's University of Belfast purchased an MK086 high performance compute board[2] for its Meiko Computing Surface, together with a C and a Fortran compiler. The MK086 board is configured with two i860s and four T800 transputers which provide the communications interface to the Computing Surface. The two i860s may be programmed independently or as part of a processor network.

The Fortran and C compilers were supplied by Green Hills via Meiko. Originally it was intended to port a number of large, compute-intensive Fortran applications from a Cray YMP. However, when the Linpack benchmarks were run using Green Hills Fortran on the i860, only 4 Mflops was achieved, which was clearly unsatisfactory. Meiko now market a compiler developed by the Portland Group, but the quoted performance is less than 10 Mflops per i860 node[3]. These figures seem to be consistent with the published performances of i860 processors in a number of hosts. An investigation of techniques to improve the peak and sustainable Mflop rate was instigated, and it was decided to construct core routines in assembly language. The Basic Linear Algebra Subprograms (BLAS levels 1, 2 and 3) subroutine libraries were chosen as a suitable starting point.

2. THE BLAS LIBRARIES

The case for using the BLAS libraries has been eloquently argued in many publications [4,5,6]. A brief summary of the main points is given below:

(i) it is a structured, modular approach;
(ii) the mnemonics of the subprograms are self-documenting;
(iii) in general, complex linear algebra programs have execution-time hot spots in a few low-level routines, where hand coding will significantly improve performance;
(iv) the core routines are implemented using the most efficient algorithms, fully exploiting the particular computer's architecture;
(v) BLAS is widely available, both as public domain software and in packages such as LINPACK.

3. i860 ASSEMBLER TECHNIQUES

The following subsections detail the three main aspects of the i860 assembly language which were exploited to improve execution times [7,8].

3.1. Dual-Instruction Mode

As well as executing a single stream of instructions, the i860 can execute certain instructions in parallel. There are two classes of instruction: those which execute in the floating-point unit (FPU) and RISC core instructions. Programmers should, where possible, pair core and FPU instructions and enable dual-instruction mode by prefixing each instruction with "d.". In dual-instruction mode the suitably paired core and FPU instructions are executed simultaneously in one clock cycle, thus halving the execution time. As this mode requires a delay of one clock cycle before it engages, and an instruction pair is still executed after the disabling instruction, care should be taken to reorder and group instructions to maximize the number of consecutive dual-instruction mode operations. Engaging this mode for only one or two cycles tends to have little effect on execution times.

The set of core instructions includes those for integer arithmetic, shift operations, loading and branching. The predominantly floating-point nature of the BLAS routines left little scope for the inclusion of dual-instruction operations, and where it was used a number of core NOPs had to be included to balance the number of FPU instructions.

3.2. Pipelined Instructions

The superscalar architecture of the i860 supports short-length pipelined instructions. However, the i860 does not support the single-instruction fetch, execute and store methodology typical of traditional vector processor supercomputers such as a Cray YMP. The following schematic code fragment would load two vectors into multiword vector registers VR1 and VR2, add the vectors, placing the result in register VR3, and then store the contents of VR3:

    VLOAD  VR1 ...
    VLOAD  VR2 ...
    VADD   VR1 VR2 VR3
    VSTORE VR3 ...

In contrast, the i860 pipelined instructions use 32 four-byte floating-point registers (f0, f1, ..., f31). Therefore a vector is loaded one element at a time (double precision values use even/odd pairs of registers) and each floating-point operation refers only to single vector elements. As pipelined instructions require 3 cycles (apart from double precision multiplication, which is a two-stage operation with each stage requiring two clock cycles), the pipe must be primed with three operations. These priming instructions specify the "dummy register", f0, as output before the fourth instruction in the sequence stores the result. Similarly, at the end of a pipelined series of instructions the pipe must be flushed by specifying f0 as dummy input together with the destinations for the previous instructions.

The pipelined load operates in a similar fashion, but experience during the implementation of these routines has shown that this instruction, pfld, should be used sparingly. Timings indicated that the standard load instruction, fld, was more reliable, as the freeze conditions [7, pp C1-C2] associated with pfld can cause significant delays, especially if a load hits data in the cache. Freeze conditions occur when two or more instructions conflict for a resource, or a resource is not available in the expected number of cycles; the processor automatically detects freeze conditions and delays accordingly. The pfld instruction expects to load data from memory, and cached data incurs at least a two-cycle penalty. Also, a string of pfld instructions causes internal delays because the bandwidth of the i860 bus permits only one transfer every two clock cycles. Therefore, although judicious use was made of pfld, fld was preferred.

3.3. Dual-Operation Mode

The add and multiply units can execute in parallel if the output of one is connected to the input of the other; this is referred to as dual-operation mode. There are a total of 64 possible dual operations but only 62 are available[5]. Although each unit requires two inputs and one output, giving a total of six operands for a dual-operation instruction, an i860 instruction can only specify three operands. This problem is solved by supplying inputs from one of two constant registers, KR and KI, or from a transfer register, T, and by chaining the result of one unit into the input of the other. The configuration of these connections depends on the particular instruction (see the example below). The chained units require 6 priming cycles (5 for double precision) but then deliver the results of two floating-point operations every subsequent cycle.

Dual-operation mode was particularly relevant in the implementation of the BLAS routines classed as _axpy, as well as those for calculating dot products. The _axpy routines include saxpy, daxpy, caxpy and zaxpy and calculate:

    y = ax + y

where x and y are vectors and a is a scalar constant. The prefixes s, d, c and z indicate single precision, double precision, complex and double precision complex arithmetic respectively. The saxpy routine includes the following instruction:

    r2p1.ss f18,f16,f20

This instruction initiates both units. The inputs to the multiply unit, a and x[i], are taken from the pre-loaded real constant register KR and f16, while the inputs to the add unit, y[i] and a*x[i], are supplied from register f18 and the result of the multiply operation initiated two instructions previously. The final operand, f20, specifies the destination of the addition operation started two instructions before. The complexity of dual-operation mode means that great care must be taken when using these instructions; however, they yield the maximum floating-point performance of the i860 and should be used where possible.
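
To make the computation concrete, the following C fragment sketches the reference semantics of saxpy (unit stride assumed; the names are illustrative, not the paper's code). Each element requires one multiply and one add, which is exactly the pair of operations the chained units deliver per cycle once the pipes are primed:

    /* Reference semantics of the BLAS 1 saxpy routine: y := a*x + y.
       A minimal sketch for illustration only; the paper's version is
       hand-coded i860 assembly built around the r2p1.ss dual operation. */
    void saxpy_ref(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one multiply and one add per element */
    }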

4. RESULTS

The results presented in this section are intended to give an overview of the execution performance of the hand-coded assembler routines. A detailed set of graphs for each routine may be found in [9,10], but as many had a characteristic shape only one is reproduced.

[Figure 1. Ratio of Fortran/Assembler execution times against vector length for SAXPY.]

The Fortran reference routines were the LINPACK public domain versions of BLAS as implemented by J.J. Dongarra. These were compiled using all possible compiler optimization flags, but factorization was not supported. Each routine was timed over 1000 executions for vectors of length 1 to 1000, and Table 1 summarizes the improvements obtained. The improvement ratio was calculated by dividing the execution times of the Fortran versions by those of the assembler routines. Figure 1 illustrates the relationship of the improvement ratio to vector length. The degradation in performance as vector length increased occurred when the cache size was exceeded. In the case of the saxpy routine shown, this happened at a vector length of around 350, whereas the double precision version, daxpy, caused the cache to overflow at half this value. The saturation of the cache also occurred earlier for the complex and double precision complex versions of routines. The Fortran routines did not show this strong dependency on cache utilization.
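
As an illustration of the measurement procedure just described, the following C fragment times 1000 executions of a routine and forms the improvement ratio. It is a hedged sketch with hypothetical names, not the original test harness:

    #include <time.h>

    /* Time 1000 executions of a saxpy-style routine for one vector
       length, mirroring the methodology above.  The function pointer
       signature and names are illustrative assumptions. */
    double time_1000(void (*f)(int, float, const float *, float *),
                     int n, float a, const float *x, float *y)
    {
        clock_t t0 = clock();
        for (int run = 0; run < 1000; run++)
            f(n, a, x, y);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    /* Improvement ratio for one vector length:
       ratio = time_1000(fortran_saxpy, ...) / time_1000(asm_saxpy, ...); */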

Table 1. BLAS 1 ratio of execution times (Fortran vs assembly code)

Routine   Vector length 500   Vector length 1000   Peak

sswap           16.3                17.2            17.2
dswap            8.8                 2.3             9.3
cswap           14.2                 3.3            14.9
zswap            1.9                 1.9             8.8

sscal           13.8                14.8            15.2
dscal            8.7                 9.1             9.1
cscal            8.7                 8.8             8.8
csscal          14.6                15.1            15.1

scopy            7.6                 8.1             8.1
dcopy            4.1                 4.4             4.2
ccopy            5.5                 5.3             5.5
zcopy            3.0                 1.7             3.1

sdot             2.5                 1.9             4.4
ddot             1.6                 1.5             3.6

cdotu            2.3                 2.2             3.4
cdotc            2.4                 2.3             4.0

saxpy            2.1                 1.6             2.9
daxpy            1.3                 1.2             2.7

isamax           1.3                 1.3             1.6
idamax           1.4                 1.3             1.6
icamax           2.5                 2.2             2.5
izamax           2.8                 2.8             3.2

sasum            1.4                 1.4             1.4
dasum            1.2                 1.2             2.7
scasum           1.5                 1.4             2.5
dzasum           2.0                 2.0             2.3

5. SUMMARY AND FUTURE WORK

The key level 1 BLAS routines have been completed, apart from zaxpy and caxpy, which will be available by October 1992. The improvement levels given in Table 1 are encouraging for vector lengths below 500, where execution times up to 20 times faster than the standard Fortran versions have been achieved. However, as vector lengths increased there was a marked drop in the ratio of execution times. This was due to the caching effects described in [8]. It is intended to produce a revised form of the library which will use techniques such as strip mining[8] to improve performance for longer vectors (a sketch of this transformation is given below). The level 2 and level 3 BLAS routines will also be implemented, together with a library of advanced algorithms for eigenvector solution, LU factorization, etc.
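
As an indication of what the strip-mined library might look like, the following C fragment sketches the transformation for saxpy. The strip size and all names are illustrative assumptions; the production routines are in i860 assembly:

    /* Strip mining: process the vectors in fixed-size strips chosen so
       that each strip of x and y stays resident in the on-chip data
       cache.  STRIP is an illustrative value, not a tuned constant. */
    #define STRIP 256

    void saxpy_strips(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i += STRIP) {
            int len = (n - i < STRIP) ? (n - i) : STRIP;
            for (int j = 0; j < len; j++)        /* inner loop touches */
                y[i+j] = a * x[i+j] + y[i+j];    /* one strip only     */
        }
    }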

REFERENCES

1. N. Margulis, i860 Microprocessor Architecture, Osborne McGraw-Hill.
2. MK086 (High Performance Compute Board) Reference Manual, Meiko.
3. Meiko Information and Product News, Meiko, 2nd Quarter 1992.
4. C.L. Lawson, R.J. Hanson, D.R. Kincaid and F.T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans. Math. Software, Vol. 5, No. 3, 1979.
5. M. Louter-Nool, Basic Linear Algebra Subprograms (BLAS) on the CDC Cyber 205, Parallel Computing, 4, 1987.
6. J.J. Dongarra, J. Du Croz, S. Hammarling and I. Duff, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Software, Vol. 16, No. 1, 1990.
7. i860 64-bit Microprocessor Programmer's Reference Manual, Intel.
8. M.T. Heath et al., Early Experience with the Intel iPSC/860 at Oak Ridge National Laboratory, ORNL/TM-11655.
9. L.A. Murphy, MSc dissertation, The Queen's University of Belfast, 1992.
10. J. Pelan, MSc dissertation, The Queen's University of Belfast, 1992.