Parallel Computing 15 (1990) 107-117 North-Holland
On the performance of transputer arrays for dense linear systems

J. BOREDDY and A. PAULRAJ
Centre for Development of Advanced Computing, 2/1, Brunton Road, Bangalore-560 025, India

Received January 1990
Abstract. In this paper, computation and communication performance is evaluated for single and multitransputer arrays. Performance models are proposed for Occam program execution under the Transputer Development System TDS2. The performance features of normalised arithmetic, concurrent floating point and integer arithmetic, logarithmic array indexing, and on-chip/off-chip RAM are studied. The startup time, byte transfer rate, asymptotic link bandwidth, and half-performance message length are estimated for simultaneous operation of one, two, three, and four links at 10/20 MHz clock in unidirectional/bidirectional modes. The impact of various performance maximisation techniques on execution time is also addressed. The matrix factorisation algorithms for dense linear systems are chosen as the focus for this study; the implementations include the LUD, Householder, Gauss-Jordan, Choleski, and Givens methods. The floating point operation count alone is inadequate to estimate computation time; many other factors, such as array indexing, load/store overhead, and loop overhead, play a significant role in transputer performance on dense linear systems. The reduction in array indexing overheads in multitransputer arrays may result in superlinear speedups.
Keywords. Dense linear systems, Matrix factorisation, Performance characterisation, Performance maximisation, Superlinear speedup.
1. Introduction
In this paper, computation and communication performance is evaluated for single and multitransputer arrays. Performance models are proposed for Occam program execution under the Transputer Development System TDS2. The performance features of normalised arithmetic, concurrent floating point and integer arithmetic, logarithmic array indexing, and on-chip/off-chip RAM are studied. The startup time, byte transfer rate, asymptotic link bandwidth, and half-performance message length are estimated for simultaneous operation of one, two, three, and four links at 10/20 MHz clock in unidirectional/bidirectional modes. The impact of various performance maximisation techniques on execution time is also addressed. The matrix factorisation algorithms for dense linear systems are chosen as the focus for this study; the implementations include the LUD, Householder, Gauss-Jordan, Choleski, and Givens methods. The floating point operation count alone is inadequate to estimate computation time; many other factors, such as array indexing, load/store overhead, and loop overhead, play a significant role in transputer performance on dense linear systems. The reduction in array indexing overheads in multitransputer arrays may result in superlinear speedups.
2. The T800 and Occam

The transputer was described as a concept [1] and was first announced as a product in 1983 as the T424 [8]. The device itself was not widely available until early 1986, after which several chips were announced, but none of these were useful for numerically intensive applications because of their low Flops ratings. The introduction of the T800 in early 1987, with on-chip floating point, boosted the performance to 1-2 Mflops per node, making it a superior building block.

The T800 is a 32-bit CMOS microcomputer with a 64-bit floating point unit and communications support. It has 4 Kbytes of on-chip RAM for high speed (50 ns) processing, a configurable memory interface, and four communication links. The instruction set achieves efficient implementation of high level languages and provides direct support for the Occam [11,14] model of concurrency, embedded in a CSP [6]-like environment, whether using a single transputer or a network of transputers. Procedure calls, process switching, and typical interrupt latency are all sub-microsecond. All features of the transputer are directly accessible through Occam.

Occam differs from other languages in that it directly provides for the parallel execution of communicating sequential processes. Channel commands make direct data transfers between concurrent processes. A single process can be constructed from a collection of processes by specifying sequential, parallel, or alternative execution of the constituents. This combination of program structure and integrated communication allows Occam to describe the control and data flow of virtually any algorithm.
3. Computing performance

3.1 Arithmetic
The T800 has on-chip floating point support. The 64-bit floating point unit provides single and double length arithmetic concurrently with the CPU and sustains a maximum rate of 1.5 Mflops for scalar operations at a processor speed of 20 MHz. The 64-bit operations are marginally slower than 32-bit operations in all cases, owing to the additional stack load time; 64-bit multiplication and division are further degraded because they use iterative, three-bits-at-a-time algorithms. The transputer does not take a fixed amount of time for floating point operations: the operands are shifted to line up their binary points at one bit per clock cycle before the actual operation, and the result is normalised at two bits per clock cycle if it is smaller than the operands. The processor also checks whether either operand is zero and aborts the calculation if so. Single and double precision floating point operations and their execution times in processor cycles are listed in Table 1.
Table 1
Execution times for floating point arithmetic (processor cycles)

Operation   Typical cycles        Maximum cycles
            single    double      single    double
add         6         9           6         9
sub         6         9           6         9
div         16        28          31        43
mult        11        18          18        27
Table 2
Execution times for Householder factorisation

Matrix size   Minimum time (µs)   Maximum time (µs)
4             331                 347
8             2 059               2 091
16            14 313              14 369
32            106 932             107 413
64            827 440             828 688
100           3 119 603           3 122 570
200           24 700 728          24 715 743
300           83 080 251          83 098 977
This timing variability, while typical of non-pipelined arithmetic, is quite unlike that of machines like the CRAY. The data-induced variation introduces a slight load imbalance in large jobs, even when the computation appears perfectly load balanced on the ensemble. It may not be possible to derive an exact model of computation time for a transputer-based machine, because the data itself influences the computation time; a program may yield different computation times for the same matrix size but different data. Considerable variations in execution times were noticed for Householder factorisation on a single transputer for matrices with different data; the maximum and minimum execution times are given in Table 2.
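Such execution times can be measured directly with the transputer's on-chip timer. The following is a minimal Occam sketch (the computation under test is elided; at low priority the timer ticks once every 64 µs):

  TIMER clock:
  INT t.start, t.stop, elapsed:
  SEQ
    clock ? t.start                    -- read the free-running timer
    SKIP                               -- computation under test (elided)
    clock ? t.stop
    elapsed := t.stop MINUS t.start    -- modulo difference, in timer ticks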
3.2 Array indexing

In the T800, the FPU (floating point unit) operates concurrently with the CPU; this means that an address calculation can proceed in the CPU while the FPU performs a floating point calculation. This can lead to significant performance improvements in real applications that access arrays heavily. The T800 has a fast multiplication instruction ("product") [10], which is used for the multiplication implicit in multi-dimensional array access. Its execution time is 4 + Tb cycles, where Tb is the position of the most significant bit set in the multiplier. For example, consider the following fragment of Occam:

  [N][N]REAL32 A:
  SEQ
    B := A[i][j]

Performing the assignment involves calculating the offset of element A[i][j] from the base of the array A. The compiler generates the following code for this computation:

  load i
  load N
  product
  load j
  add
Since the product instruction executes in a time dependent on the highest set bit of the multiplier, which for the constant N is log2N, the array indexing here takes log2N + 11 cycles (taking addition and load times as one and two clock cycles respectively). In general, the multiplication in an address calculation for a two-dimensional array is performed in time proportional to the logarithm of the row width, since storage allocation is row-wise in Occam. This unique feature of the transputer can be exploited when choosing the data mapping for matrix computations.
Given an N × N matrix and P processors, if column-wise wrapped mapping is adopted then each address calculation takes log2(N/P) + 11 cycles, i.e. there is a reduction of log2P cycles per index. For example, in LU factorisation the indexing count is O(N³), so O(N³ log2P) processor cycles are gained. Column-wise wrapped mapping is therefore adopted for factorising dense linear systems; in certain cases, the results show superlinear speedups due to this mapping strategy.
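For illustration, a minimal Occam sketch of the wrapped storage scheme follows; the values of N, P, and the indices are assumptions. Each processor stores its N/P local columns in an [N][N/P] array, so the row width of the local array, and hence the multiplier in every address calculation, is N/P rather than N:

  VAL INT N IS 512:            -- assumed global matrix dimension
  VAL INT P IS 4:              -- assumed number of processors
  VAL INT cols IS N / P:       -- columns held locally under wrapped mapping
  [N][cols]REAL32 local.A:     -- local row width is N/P, not N
  INT i, j:
  REAL32 b:
  SEQ
    i := 3
    j := 2
    -- offset = (i * cols) + j; the implicit "product" now takes
    -- about log2(N/P) + 4 cycles instead of log2(N) + 4
    b := local.A[i][j]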
3.3 Computation time

The computation time depends on the algorithmic complexity as well as on machine characteristics. The computation time for matrix problems can be modelled by considering floating point calculation time (Tflop), loop overhead (Tloop), store/load overhead (Tload/store), and indexing overhead (Tindex). The computation time also depends on extra operational overhead (Textraop), such as comparisons and library functions, and on operating system (OS) overhead, for example time slicing and process scheduling; OS overhead, however, is not considered in the model. The computation time is given by

  Tcomp = Tloop + Tindex + Tload/store + Tflop + Textraop

The various overheads for LU factorisation on a single transputer are given in Table 3. Here Textraop is negligible, but it is important in certain factorisation algorithms. For example, for a 100 × 100 matrix, Table 3 gives Tcomp ≈ 226 719 + 1 495 844 + 333 062 + 299 723 ≈ 2.36 × 10⁶ µs. Since array indexing and floating point arithmetic run in parallel, the overlapped computation is modelled in Tflop: during Tindex the CPU calculates array indices while the FPU remains idle, whereas during Tflop the FPU is fully engaged in arithmetic while the CPU may or may not be indexing. It is clear from the results that the floating point operation count alone is inadequate for modelling computations on the transputer.
Table 3
Various overheads in computation time (µs)

Matrix size   Tloop       Tindex       Tstore/load   Tflop
4             39          75           19            18
8             195         690          167           150
16            1 183       5 845        1 355         1 218
32            8 193       48 112       10 890        9 796
64            60 984      390 130      87 266        78 523
100           226 719     1 495 844    333 062       299 723
200           1 772 942   12 016 548   2 665 625     2 398 945
300           5 938 664   40 608 986   8 997 688     8 097 667

3.4 Memory access

The Tload/store overhead, described in the previous section, depends heavily on the relative sizes of on-chip and off-chip RAM. It also depends on the frequency and locality of accesses, and on the placement of data and code on-chip or off-chip. Tests were performed to measure the effect of placing data and code in off-chip RAM on execution time. The results, given in Table 4, indicate degradation due to the slower access times of external RAM; they also show that by placing code rather than data off-chip, better performance can be expected.
Table 4
Off-chip RAM performance for dense linear systems (execution time relative to fully on-chip)

Algorithm      Data off-chip   Code off-chip   Both off-chip
LUD            1.63            1.42            1.9
Householder    1.65            1.39            1.87
Gauss-Jordan   1.633           1.42            1.9
Choleski       1.62            1.42            1.9
Givens         1.62            1.45            2.0
So the recommended layout for memory allocation of Occam programs for matrix computations is as follows (fastest memory first):
• System workspace for link words
• Process workspace for processes
• Buffers for critical vectors
• Data space for arrays
• Code space for compiled program.
4. Communications performance

4.1 Link protocol

The transputer has four identical bidirectional bit-serial links that provide synchronised communication between processors and with the outside world. Communication between processes is achieved by means of channels. Process communication is point-to-point, synchronised, and unbuffered; hence a channel needs no message queue or message buffer. The links allow networks of transputers to be constructed by direct point-to-point connections with no external logic. The T800 links can operate at 5, 10, or 20 Mbits/sec.

The T800 implements a protocol that allows fully overlapped data and acknowledge packets. Data is transmitted in 11-bit packets, comprising one start bit, one flag bit to distinguish between data and acknowledge packets, eight data bits, and a stop bit. The fully overlapped protocol allows the acknowledge packet to be sent to the transmitting transputer as soon as the receiving transputer has detected that the incoming packet is a data packet (i.e. at the second packet period of the protocol). Link communication is not sensitive to clock phase, which permits independently clocked transputer networks (as long as the communication frequency is the same).

4.2 Benchmarks

The cost of communication between adjacent processors in a transputer-based system can be modelled with reasonable accuracy by

  tcomm = α + βM

where α is a start-up time for any message, independent of its length, β is the incremental cost per unit length, and M is the length of the message in bytes or words. Another communication parameter, the half-performance message length (m1/2) [7], is the message length at which the communication bandwidth falls to half the peak performance. Tests were performed to determine α, β, and m1/2 for unidirectional as well as bidirectional transfers over one, two, three, and four links, at 20 MHz processor frequency and 10/20 MHz link frequency. The communication parameters, estimated via least-squares fits, are given in Tables 5-8.
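A minimal Occam sketch of the sender side of such a test follows; the message length, trial count, and the hardware channel placement address are assumptions, and the receiving process at the other end of the link is omitted:

  VAL INT M IS 1024:                -- assumed message length in bytes
  VAL INT trials IS 100:            -- assumed number of repetitions
  TIMER clock:
  INT t0, t1:
  [M]BYTE message:
  CHAN OF [M]BYTE link.out:
  PLACE link.out AT 0:              -- link 0 output channel (assumed address)
  SEQ
    clock ? t0
    SEQ i = 0 FOR trials
      link.out ! message            -- one startup cost plus M byte costs
    clock ? t1
    -- average (t1 MINUS t0) over the trials, then fit against several
    -- values of M to estimate the startup time α and per-byte cost β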
Table 5
20 MHz link performance for unidirectional transfer

No. of links   α (µs)   β (µs/byte)   BW (Mbytes/sec)   m1/2 (bytes)
1              8.33     0.563         1.78              14.8
2              12.56    0.563         3.56              11.15
3              16.60    0.563         5.34              9.83
4              20.70    0.563         7.12              9.19
Table 6
20 MHz link performance for bidirectional transfer

No. of links   α (µs)   β (µs/byte)   BW (Mbytes/sec)   m1/2 (bytes)
1              12.12    0.7818        2.56              15.51
2              18.41    0.7818        5.12              11.77
3              27.72    0.7975        7.56              11.59
4              44.32    0.8134        9.84              13.62
Table 7
10 MHz link performance for unidirectional transfer

No. of links   α (µs)   β (µs/byte)   BW (Mbytes/sec)   m1/2 (bytes)
1              11.35    1.125         0.89              10.1
2              15.67    1.125         1.78              6.96
3              19.1     1.125         2.67              5.69
4              22.9     1.125         3.56              5.1
Table 8
10 MHz link performance for bidirectional transfer

No. of links   α (µs)   β (µs/byte)   BW (Mbytes/sec)   m1/2 (bytes)
1              12.89    1.505         1.32              8.56
2              15.67    1.505         2.64              5.21
3              25.66    1.505         3.96              5.69
4              41.73    1.505         5.28              6.93

From these results, the following conclusions can be drawn:
• The startup time depends on the number of links operating in parallel and on the mode of transfer, but is independent of the link frequency.
• The byte transfer rate depends on the link frequency and the mode of transfer, but is independent of the number of links operating in parallel.
• The 20 MHz link speed in bidirectional mode yielded large variations in the measurements; this needs further investigation. The communication parameters are, however, calculated over averaged values of the measurements.
Note that the tabulated values are mutually consistent: for n links, m1/2 ≈ α/(nβ).
5. Dense linear systems on a transputer array
The hardware configuration consists of four transputers connected as a unidirectional ring, effectively utilising only two channels per transputer. Each transputer has 1 Mbyte of external RAM in addition to 4 Kbytes of internal RAM. The links and processors were operated at 20 MHz, and the external RAM had an access time of 4 processor cycles. The Householder, LUD, and Gauss-Jordan methods were implemented for dense linear systems on this configuration by wrapping columns among the processors.
5.1 Pipelined ring algorithm

The columns with the same number modulo P are assigned to a given processor [3,5,13,16]. At step k, the processor holding the pivot column, say Pi, sends it to its right neighbour Pi+1, which in turn sends it to Pi+2, and so on until the pivot column reaches the left neighbour Pi-1. The main advantage of this asynchronous strategy is that communication and computation can be overlapped: a processor can start updating its internal columns as soon as it has received the pivot column and transmitted it to its neighbour. There is no need to wait for all processors to receive the information before starting the computation. The pipelined ring algorithm is as follows [15]:

  Pipelined Ring Algorithm: { program of processor Pi }
  for k = 1 to n-1 do
    if i = k mod p then
      calculate column k and send it to Pi+1
    else
      receive column k from Pi-1
      if i ≠ (k-1) mod p then
        send column k to Pi+1
      endif
    endif
    perform the updating of all internal columns j ≥ k+1
  endfor
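A minimal Occam sketch of one step of this ring, for the processor with index i, might look as follows; the matrix dimension, the channel names, and the elided pivot calculation and column update are assumptions:

  VAL INT N IS 8:                   -- assumed matrix dimension
  PROC ring.step (VAL INT i, k, p,
                  CHAN OF [N]REAL32 from.left, to.right)
    [N]REAL32 pivot.col:
    SEQ
      IF
        (k \ p) = i                 -- this processor owns the pivot column
          SEQ
            SKIP                    -- calculate pivot.col locally (elided)
            to.right ! pivot.col
        TRUE                        -- otherwise receive and forward it
          SEQ
            from.left ? pivot.col
            IF
              ((k - 1) \ p) <> i    -- stop at the owner's left neighbour
                to.right ! pivot.col
              TRUE
                SKIP
      SKIP                          -- update internal columns j >= k+1 (elided)
  :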
The LUD, Householder, and Gauss-Jordan methods were implemented using the above pipelined ring algorithm on a network of four transputers. The speedups and Flops ratings are given in Tables 9 and 10. Considerable Flops ratings are achieved for matrix dimensions greater than 64 × 64. Surprisingly, superlinear speedups are observed for large grain sizes. It is also reported in [12] that superlinear speedups can be obtained on a transputer-based supernode machine for FFT applications, by distributing the algorithm across the processors so that the butterfly coefficients become constants rather than indexed variables.
Table 9
Speedups for dense linear systems on four transputers

Matrix size   LUD    Householder   Gauss-Jordan
4             0.72   0.97          0.78
8             1.39   1.59          1.44
16            2.21   2.39          2.23
32            2.95   3.11          2.95
64            3.50   3.58          3.48
100           3.73   3.77          3.71
200           3.96   3.95          3.94
300           4.03   4.02          4.01
Table 10
Sustained Mflops rating for dense linear systems

              LUD            Householder    Gauss-Jordan
Matrix size   Seq    Par     Seq    Par     Seq    Par
4             0.23   0.16    0.28   0.27    0.24   0.19
8             0.26   0.37    0.35   0.56    0.28   0.4
16            0.29   0.64    0.4    0.95    0.29   0.65
32            0.30   0.89    0.42   1.3     0.30   0.88
64            0.31   1.08    0.43   1.52    0.30   1.05
100           0.31   1.16    0.43   1.62    0.30   1.12
200           0.31   1.23    0.43   1.71    0.30   1.2
300           0.31   1.26    0.43   1.74    0.30   1.22
400           -      1.28    -      1.76    -      1.24
500           -      1.28    -      1.77    -      1.24
600           -      1.29    -      1.77    -      1.25
5.2 Parallel execution time
The parallel matrix computations on a transputer array can be modelled as

  Tpar = Tcomp + Tcontrol + Tcomm + Tsync,  where  Tcomp = Tseq/P - Tindexgain

In the parallelised version, in addition to the computation time, there are communication and synchronisation overheads; a control cost is also added, which consists of determining whether a pivot column is inside the processor and finding the nearest neighbours. In [2], the control cost is modelled for iterative methods, and it is shown that a non-constant control cost, for example one linear in the number of processors, can lead to an optimum number of processors that maximises the speedup of parallel algorithms. Since the load is fairly balanced, the computation time becomes 1/P-th of the sequential execution time, further reduced by the gain in array indexing. This gain is peculiar to transputer networks; otherwise the above equation applies to any parallel computer. To achieve superlinear speedup, the following condition has to be satisfied:

  Tindexgain > Tloss,  where  Tloss = Tcontrol + Tcomm + Tsync
The Tindexgain and Tloss calculated for LU factorisation are given in Table 11. It is evident that this condition is satisfied for matrix size 300 × 300, where the superlinear speedup occurs.
Table 11
Index gain versus induced losses for LUD

Matrix size   Tindexgain (µs)   Tloss (µs)
4             18.4              158.2
8             32.55             546.8
16            192.3             2 017.6
32            1 508.25          7 695.4
64            9 972.8           30 004.7
100           33 994.8          72 548.4
200           238 536.75        288 112.8
300           763 210.0         647 854.3
5.3 Data partitioning
When partitioning a matrix for parallel execution on transputer networks, if row-wise and column-wise partitioning are otherwise equally efficient, column-wise partitioning is recommended, since it reduces the array indexing overhead. Column-wise partitioning of data is therefore recommended on transputer arrays for matrix computations such as direct methods for factorising matrices, back substitution for solving triangular systems, iterative methods for solving partial differential equations on regular domains, and matrix multiplication algorithms. (It is not recommended for the Givens method, for which efficient row-wise schemes are available.) It is shown in [4] that two-dimensional areal partitioning into patches results in a smaller total communication volume than one-dimensional partitioning into vertical or horizontal strips; however, areal partitioning requires a larger number of messages, yielding poor performance.
6. Performance maximisation
Performance maximisation can yield significant improvement in Occam program execution. The effect of various optimisation techniques on the execution times of the factorisation algorithms is estimated below; readers may refer to [9] for the techniques themselves.

6.1 Separate vector space
The compiler option "separate vector space" [16] makes the compiler allocate Occam arrays and vectors in a separate vector space rather than in the workspace. This decreases the amount of stack required, with two benefits: the offsets of variables are smaller (so access to them is faster), and the total amount of stack used is smaller, allowing better use of on-chip RAM. Occam also provides a "place" directive to put individual items selectively in workspace or in vector space.
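A minimal sketch of the "place" directive follows, assuming the allocation names of the TDS toolset; the array sizes are illustrative:

  VAL INT N IS 100:
  [N][N]REAL32 a:
  PLACE a IN VECSPACE:              -- large array: keep it out of workspace
  [N]REAL32 pivot.buf:
  PLACE pivot.buf IN WORKSPACE:     -- small, heavily used buffer stays near
  SEQ                               -- the stack, and hence in on-chip RAM
    pivot.buf[0] := 0.0(REAL32)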
6.2 Loop unrolling

The transputer loop overhead is about 0.75 µs; for a frequently executed loop containing just a few instructions, this overhead is significant. By opening up the loop, at the expense of larger code, execution time can be reduced. Abbreviations may be used to unroll loops.
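A minimal sketch of manually unrolling a short dot-product loop (the array contents are assumed to be initialised elsewhere):

  [4]REAL32 a, b:
  REAL32 s:
  SEQ
    -- rolled: pays the ~0.75 µs loop overhead on every iteration
    s := 0.0(REAL32)
    SEQ i = 0 FOR 4
      s := s + (a[i] * b[i])
    -- unrolled: no loop overhead, at the cost of larger code
    s := ((a[0] * b[0]) + (a[1] * b[1])) + ((a[2] * b[2]) + (a[3] * b[3]))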
6.3 Range checking avoidance

Range checking is a compiler option, enabled by default, which checks the validity of array subscripts and type conversions; extra code is generated to do the checking. This is a useful facility during software development, but once the program is thoroughly tested, execution can be speeded up considerably by switching the range checker off. Another way of avoiding range checking is through abbreviations: by abbreviating a sub-vector of a larger vector, or a row of an array, and using constants to index into the sub-vector or sub-array, the compiler generates range-checking code only for the abbreviation itself, and not for the accesses to the sub-vector or sub-array.
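A minimal sketch of the abbreviation technique: the row subscript is checked once, at the abbreviation, and accesses through the abbreviation then need no range-checking code (the dimension is assumed):

  VAL INT N IS 64:
  [N][N]REAL32 a:
  INT k:
  SEQ
    k := 0
    []REAL32 pivot.row IS a[k]:     -- single range check happens here
    SEQ j = 0 FOR N
      pivot.row[j] := 0.0(REAL32)   -- unchecked accesses inside the loop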
Table 12
Execution times of the Gauss-Jordan method for a 100 × 100 matrix

Option                    Time taken (µs)
Without optimisation      5 168 721
Gather                    4 821 548
Separate vector space     4 313 552
Without range checking    3 267 599
Loop unrolling            2 807 104
6.4 Gather operation

In a gather operation, scattered or non-contiguous data is buffered into contiguous locations. This reduces array indexing, message transfer, and long memory access times. In the LUD and Gauss-Jordan methods, at any stage of the iteration the pivot column and pivot row are heavily used in the elimination process; by transferring these vectors into buffers, the array indexing overhead can be reduced. Furthermore, when sending the pivot column to the nearest neighbour, it is very efficient to gather the pivot column (which has a stride length of N) into a buffer, thereby avoiding the most costly startup overheads in communication. In our implementations, only the pivot column is buffered to avoid startup overheads; the buffer is also declared after the array declarations, to place it in internal RAM. Some of the performance maximisation results for the Gauss-Jordan method are given in Table 12, where the reductions in execution time due to the aforementioned techniques are noted.
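A minimal Occam sketch of gathering the pivot column before transmission; the dimension, channel, and procedure names are assumptions:

  VAL INT N IS 8:                   -- assumed matrix dimension
  PROC send.pivot.column (VAL INT k, VAL [N][N]REAL32 a,
                          CHAN OF [N]REAL32 to.right)
    [N]REAL32 buffer:
    SEQ
      SEQ i = 0 FOR N
        buffer[i] := a[i][k]        -- gather the stride-N column once
      to.right ! buffer             -- one startup cost instead of N
  :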
6.5 Overlapped and prioritised communications

Communications should be overlapped with computations, and a communication process should have higher priority than a computation process, to ensure full processor utilisation. Householder factorisation was run on a ring of four transputers with non-overlapped, overlapped, and prioritised communications; here the pivot column calculation and transfer are pipelined, and the transfer is prioritised over the pivot calculation. The results are given in Table 13.
Table 13
Execution times for Householder factorisation (µs)

Matrix size   Non-overlapped   Overlapped    Prioritised
4             422              384           343
8             1 781            1 280         1 297
16            9 576            6 016         5 986
32            60 652           34 880        34 400
64            362 162          237 632       231 045
100           1 171 014        856 704       827 110
200           8 267 302        6 524 736     6 253 946
300           27 061 473       21 666 752    20 713 135
400           63 319 478       50 954 112    48 645 953
500           122 805 478      99 044 416    94 480 235
600           211 276 101      170 601 920   162 650 195
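A minimal Occam sketch of the prioritised structure: under PRI PAR the first process runs at high priority, so a pending link transfer preempts the update computation (the channel protocol and the elided computation are assumptions):

  VAL INT N IS 8:                   -- assumed matrix dimension
  PROC node (CHAN OF [N]REAL32 from.left, to.right)
    [N]REAL32 pivot.col:
    PRI PAR
      SEQ                           -- high priority: pass the pivot on
        from.left ? pivot.col
        to.right ! pivot.col
      SKIP                          -- low priority: column updates (elided)
  :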
7. Conclusion

In this paper, computation and communication performance was evaluated for single and multitransputer arrays, and performance models were proposed for Occam program execution under the Transputer Development System TDS2. Matrix factorisation algorithms for dense linear systems were chosen as the focus of the study. Future work will address the following issues:
• Analytical estimation of the sustained Flops rating for various n-adic operations and BLAS routines, incorporating memory access delays for scalar as well as vector arithmetic.
• Characterising the speedups of matrix problems with respect to granularity; obtaining expressions for the superlinear speedups, and for the optimal number of processors for a given problem size and vice versa.
• Developing a methodology for implementing matrix computations as general-purpose library subroutines.
• The present work addresses pipelined execution of dense linear systems using only two channels of the transputer; future work will consider efficient broadcasting algorithms for dense linear systems in which all the links are utilised.
Acknowledgement

J. Boreddy thanks Giridhar Gopalan for his patient explanation of the TDS2 environment. Thanks are also due to Sanjay Tambwekar, S. Vaidya Subramanian, and S. Balasubramanian for their help in typesetting the paper in Troff. J. Boreddy is also indebted to Professor Azriel Rosenfeld for his continuing encouragement.
References

[1] I.M. Barton, The transputer, in: Aspinall, ed., The Microprocessor and Its Application (Cambridge University Press, Cambridge, UK, 1978).
[2] L. Brochard, Communication and control costs on loosely coupled multiprocessors, Preprint, Ecole Nationale des Ponts et Chaussees, 1986.
[3] M. Cosnard, B. Tourancheau and G. Villard, Gaussian elimination on message passing architecture, in: E.N. Houstis et al., eds., Supercomputing, Lecture Notes in Computer Science 297 (Springer, Berlin, 1988) 611-628.
[4] G.C. Fox, Square matrix decompositions: symmetric, local, scattered, Tech. Rept. HM-97, Caltech, Pasadena, CA, 1984.
[5] G.A. Geist and M.T. Heath, Matrix factorisation on a hypercube multiprocessor, in: M.T. Heath, ed., Hypercube Multiprocessors 1986 (SIAM, Philadelphia, 1986) 161-180.
[6] C.A.R. Hoare, Communicating sequential processes, Comm. ACM 21 (8) (1978) 666-677.
[7] R.W. Hockney and C.R. Jesshope, Parallel Computers 2 (Adam Hilger, Bristol, UK, 1988).
[8] Inmos, T424 Transputer - Advance Information (Bristol, UK, 1983).
[9] Inmos, The Transputer Applications Notebook: Systems and Performance (Trowbridge, UK, June 1989).
[10] Inmos, The Transputer Databook (Bath, UK, November 1988).
[11] Inmos, Occam 2 Reference Manual (Prentice Hall, UK, 1988).
[12] J.B.G. Roberts, Recent developments in parallel processing, in: Proc. ICASSP, Vol. 4, Spectral Estimation, VLSI, and Underwater Signal Processing (Glasgow, Scotland, May 1989) 2461-2467.
[13] Y. Robert and B. Tourancheau, LU and QR factorisation on the FPS T Series hypercube, in: CONPAR 88 (Manchester, 1988).
[14] D. Roweth, Design and performance analysis of transputer arrays, J. Syst. Soft. 1 (2) (1986) 21-22.
[15] Y. Saad, Gaussian elimination on hypercubes, in: M. Cosnard et al., eds., Parallel Algorithms and Architectures (North-Holland, Amsterdam, 1986) 5-18.
[16] B. Tourancheau, LU factorisation on the FPS T Series hypercube, in: M. Cosnard et al., eds., Parallel and Distributed Algorithms (North-Holland, Amsterdam, 1989).