A systematic approach for designing concurrent error-detecting systolic arrays using redundancy


Parallel Computing 19 (1993) 745-764 North-Holland


PARCO 768

C.N. Zhang a, H.F. Li b and R. Jayakumar b

a Department of Computer Science, University of Regina, Regina, Sask., S4S 0A2 Canada
b Department of Computer Science, Concordia University, Montreal, Quebec, H3G 1M8 Canada

Received 5 November 1991
Revised 12 May 1992, 31 October 1992

Abstract

Zhang, C.N., H.F. Li and R. Jayakumar, A systematic approach for designing concurrent error-detecting systolic arrays using redundancy, Parallel Computing 19 (1993) 745-764.

A systematic approach for designing systolic arrays with concurrent error detection (CED) capability using time and/or space redundancy is proposed. The approach is based on a new theory that relates CED to the generalized space-time mapping. Under the restriction that there is only one generated (modified) variable in the systolic array, a simplified CED scheme is presented. This scheme not only significantly reduces the hardware and time overheads but is also capable of error correction. In addition, the resulting systolic array can be used to compute two problem instances simultaneously, doubling the throughput at no extra cost.

Keywords. Fault-tolerance; systolic array; concurrent error detection; space-time mapping.

1. Introduction

The key issue in run-time fault tolerance is how to detect, and possibly correct, errors concurrently with computation. Various fault tolerance techniques for systolic arrays have been proposed [1,2,4-8,11,12]. Algorithm-based fault tolerance has the advantage of lower hardware and time overheads, but has the drawbacks of being algorithm-specific and of being affected by arithmetic errors such as truncations and overflows. Gulati and Reddy [4] proposed an approach that detects errors concurrently by comparison with a concurrent redundant computation (eyeball-to-eyeball checking). Wu [12] presented a similar approach which is applicable to unidirectional linear systolic arrays; Patel and Fung [11] developed a method based on repeating the computation with shifted operands (RESO); Cosentino [2] proposed a concurrent error correction scheme which is also based on comparison with a redundant computation. The latter scheme is restricted to a class of systolic arrays in which the partial results must stay in the processing elements (PEs). However, most of the approaches reported in the literature are ad hoc: they are either specific designs for particular problems or applicable only to a certain kind of systolic architecture.

Correspondence to: C.N. Zhang, Department of Computer Science, University of Regina, Regina, Sask., Canada S4S 0A2.

0167-8191/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved


In this paper, we present a systematic approach for designing systolic arrays with CED capability using time and/or space redundancy. The proposed approach is based on a theory which reveals the relationship between generalized space-time mappings and CED using redundancy techniques. In addition, a design procedure is presented which leads to a CED systolic implementation with minimum extra hardware cost in terms of additional PEs, comparator logic circuits and latches (buffers). In cases where there is only one output data path in the systolic array, a simplified error detection and correction scheme is presented which significantly reduces the hardware and time overheads and requires no additional circuits to be built inside the systolic array. As a consequence, the resulting systolic array has the flexibility either to compute one problem instance with CED or to compute two problem instances simultaneously to achieve double throughput.

2. Generalized space-time transformation method

Consider the problem of computing the product of two matrices $C = A \times B$, where $A = (a_{ik})$, $B = (b_{kj})$ and $C = (c_{ij})$. The calculation can be described by the following normalized algorithm, in which all broadcast variables have been eliminated.

Algorithm 1. (matrix multiplication: C = A × B)
for i := 1 to N do
  for j := 1 to N do
    for k := 1 to N do
    begin
      a(i, j, k) := a(i, j - 1, k);
      b(i, j, k) := b(i - 1, j, k);
      c(i, j, k) := c(i, j, k - 1) + a(i, j, k) b(i, j, k)
    end.

Initially, $a(i, 0, k) = a_{ik}$, $b(0, j, k) = b_{kj}$, and $c(i, j, 0) = 0$ for all i, j and k.

The computational structure of an n-level nested loop algorithm such as Algorithm 1 (n = 3) can be represented by a tuple pair $(D, C_D)$, in which $C_D$ is the index space $\{(I_1, I_2, \ldots, I_n)\}$ where the data are computed or used, and $D = (d_1, d_2, \ldots, d_m)$ is a constant matrix which characterizes the data dependency relationship among the computations in the index space [8,13]. The key step towards obtaining a systolic array implementation for a given algorithm is to find a linear transformation (an $n \times n$ matrix)

$$T = \begin{pmatrix} H \\ S \end{pmatrix}$$

as described by Moldovan and others [3,9,10], where the $1 \times n$ vector H (called the time schedule function) maps the index space into a time sequence, and the $(n-1) \times n$ submatrix S (the space-transformation function) maps the n-dimensional index space into an (n-1)-dimensional systolic array. To ensure a causal time schedule in the systolic array, all elements of the first row of the matrix $\Delta = TD$ must be negative (or positive, depending on the convention) [3,9,10,14]. In the case n = 3, the resulting computation space $C_s \subset Z^3$ (Z is the set of integers) is obtained by

$$C_s = \left\{ \begin{pmatrix} t \\ x \\ y \end{pmatrix} = T \begin{pmatrix} i \\ j \\ k \end{pmatrix} : (i, j, k) \in C_D \right\}$$

where t represents the time, and x and y represent the x-coordinate and y-coordinate in the x-y plane. The matrix T is sometimes called the space-time transformation, and the mapping


from a given algorithm represented by $(D, C_D)$ into a systolic array represented by $(\Delta, C_s)$ obtained by a space-time transformation T is called the generalized space-time mapping, denoted by $T(D, C_D) = (\Delta, C_s)$ [8,13]. For example, in Algorithm 1 we have $C_D = \{(1, 1, 1), (1, 1, 2), \ldots, (N, N, N)\}$ and

$$D = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & -1 \end{pmatrix};$$

the columns $(0, -1, 0)^T$, $(-1, 0, 0)^T$ and $(0, 0, -1)^T$ represent the variables a, b and c, respectively. If we choose

$$T_1 = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ -1 & 0 & 0 \end{pmatrix},$$

then

$$\Delta_1 = T_1 D = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$

The resulting systolic array consisting of $N^2$ PEs is shown in Fig. 1 for N = 4.

Fig. 1. Systolic array for Algorithm 1 obtained by $T_1$.

If we choose

$$T_2 = \begin{pmatrix} 1 & 1 & 1 \\ -1 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix},$$

then

$$\Delta_2 = T_2 D = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Fig. 2. Systolic array for Algorithm 1 obtained by $T_2$.

The resulting systolic array is shown in Fig. 2. The space-time mapping approach provides a mathematical way to map an n-level nested loop algorithm into an (n-1)-dimensional systolic array. The matrix $\Delta$ represents the physical structure of the resulting systolic array, such as the connections among PEs, the number of time delay units required between two PEs (if the element of the first row of $\Delta$ is not -1), and the directions of the input and output data paths. The transformation matrix T determines the most important parameters of the systolic array implementation, such as the time schedule, the size of the array (number of PEs), the locations of the PEs and the total computation time [8,13]. The pipelining (time) latency, denoted by $l_t$, is the time interval in clock units between two successive input data; it is another important parameter, indicating the rate of the input/output data paths in the systolic array. Owing to the regular computational structure of systolic arrays, the time latency can be defined as follows.
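
The mapping described above can be tried out numerically. The following Python sketch assumes that $T_1$ and $T_2$ have the entries written in the code and that N = 4; the helper names (`matmul`, `computation_space`, `array_summary`) are illustrative and not part of the original paper. It computes $\Delta = TD$ and counts the PEs and time steps of the mapped array.

```python
from itertools import product

# Dependence matrix of Algorithm 1; the columns belong to a, b and c.
D = [[0, -1, 0],
     [-1, 0, 0],
     [0, 0, -1]]

# Assumed entries of the two transformations discussed in the text.
T1 = [[1, 1, 1], [0, 1, 0], [-1, 0, 0]]
T2 = [[1, 1, 1], [-1, 1, 0], [0, 0, -1]]


def matmul(A, B):
    """Plain integer matrix product A @ B."""
    return [[sum(A[r][m] * B[m][c] for m in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def computation_space(T, N):
    """C_s: image of the index space {1..N}^3 under T, as (t, x, y) tuples."""
    return {tuple(sum(T[r][c] * p[c] for c in range(3)) for r in range(3))
            for p in product(range(1, N + 1), repeat=3)}


def array_summary(T, N):
    """Return (number of PEs, number of time steps) of the mapped array."""
    Cs = computation_space(T, N)
    pes = {(x, y) for _, x, y in Cs}
    times = [t for t, _, _ in Cs]
    return len(pes), max(times) - min(times) + 1


if __name__ == "__main__":
    for name, T in (("T1", T1), ("T2", T2)):
        print(name, "Delta =", matmul(T, D),
              "(PEs, steps) =", array_summary(T, 4))
```

Under these assumptions, $T_1$ yields the $N^2 = 16$ PEs mentioned in the text, while $T_2$ yields a 7 × 4 array in which not every PE is busy in every cycle.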

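Definition 1. Suppose $T(D, C_D) = (\Delta, C_s)$, where $(i, j, k) \in C_D$, $(t, x, y) \in C_s$ and

$$\begin{pmatrix} t \\ x \\ y \end{pmatrix} = T \begin{pmatrix} i \\ j \\ k \end{pmatrix}.$$

The time latency, $l_t$, is the minimum time difference $\Delta t > 0$ such that

$$\begin{pmatrix} t + \Delta t \\ x \\ y \end{pmatrix} = T \begin{pmatrix} i_1 \\ j_1 \\ k_1 \end{pmatrix}$$

(mapped to the same PE) for all possible $(i_1, j_1, k_1) \in C_D$.

We show that $l_t$ depends on the transformation matrix T only and can be calculated by the following formula.

Lemma 1.

$$l_t = \left| \frac{|T|}{\gcd(|T_{11}|, |T_{12}|, |T_{13}|)} \right| \qquad (1)$$

where $|T|$ is the determinant of the matrix T, $|T_{ij}|$ is the (i, j)th co-factor of T taken without its sign (the sign is written explicitly in the proof below), and the outer bars denote the absolute value.

Proof. Let

$$\begin{pmatrix} t + \Delta t \\ x \\ y \end{pmatrix} = T \begin{pmatrix} i + \Delta i \\ j + \Delta j \\ k + \Delta k \end{pmatrix}.$$

Since

$$\begin{pmatrix} t \\ x \\ y \end{pmatrix} = T \begin{pmatrix} i \\ j \\ k \end{pmatrix} \quad \text{and} \quad T = \begin{pmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \\ t_{31} & t_{32} & t_{33} \end{pmatrix},$$

we have

$$t_{11}\Delta i + t_{12}\Delta j + t_{13}\Delta k = \Delta t$$
$$t_{21}\Delta i + t_{22}\Delta j + t_{23}\Delta k = 0$$
$$t_{31}\Delta i + t_{32}\Delta j + t_{33}\Delta k = 0.$$

Thus,

$$\Delta i = \frac{\Delta t\,|T_{11}|}{|T|}, \qquad \Delta j = \frac{-\Delta t\,|T_{12}|}{|T|} \qquad \text{and} \qquad \Delta k = \frac{\Delta t\,|T_{13}|}{|T|}.$$

Since $|T| \neq 0$, at least one of $|T_{11}|$, $|T_{12}|$ and $|T_{13}|$ is not zero. Suppose $|T_{11}| \neq 0$ (the proofs for the remaining cases are similar). As

$$\Delta j = -\frac{|T_{12}|}{|T_{11}|}\,\Delta i$$

is an integer,

$$\Delta i = \frac{c_1 |T_{11}|}{\gcd(|T_{11}|, |T_{12}|)}$$

for some integer $c_1$. Similarly, as

$$\Delta k = \frac{|T_{13}|}{|T_{11}|}\,\Delta i$$

is an integer,

$$\Delta i = \frac{c_2 |T_{11}|}{\gcd(|T_{11}|, |T_{13}|)}$$

for some integer $c_2$. So we have

$$\Delta i = \frac{c\,|T_{11}|}{\gcd(|T_{11}|, |T_{12}|, |T_{13}|)}$$

for some integer c. Thus

$$\Delta t = t_{11}\Delta i + t_{12}\Delta j + t_{13}\Delta k = \frac{c\,(t_{11}|T_{11}| - t_{12}|T_{12}| + t_{13}|T_{13}|)}{\gcd(|T_{11}|, |T_{12}|, |T_{13}|)} = \frac{c\,|T|}{\gcd(|T_{11}|, |T_{12}|, |T_{13}|)},$$

and hence

$$l_t = \min_{\Delta t > 0}\{\Delta t\} = \left| \frac{|T|}{\gcd(|T_{11}|, |T_{12}|, |T_{13}|)} \right|. \qquad \Box$$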

For example, the systolic array obtained by $T_1$ shown in Fig. 1 has $l_t = 1$, but the systolic array obtained by $T_2$ shown in Fig. 2 has $l_t = 2$. Similarly, we can define the space latency with respect to x, $l_x$ (min $\Delta x > 0$ when $\Delta t = 0$ and $\Delta y = 0$), and the space latency with respect to y, $l_y$ (min $\Delta y > 0$ when $\Delta t = 0$ and $\Delta x = 0$), and obtain the following results:

$$l_x = \left| \frac{|T|}{\gcd(|T_{21}|, |T_{22}|, |T_{23}|)} \right| \qquad (2)$$

$$l_y = \left| \frac{|T|}{\gcd(|T_{31}|, |T_{32}|, |T_{33}|)} \right| \qquad (3)$$
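
Formulas (1)-(3) are easy to evaluate mechanically. The following Python sketch is a minimal illustration, assuming 3 × 3 integer transformations with nonzero determinant; the function names are hypothetical. It also includes a brute-force check of $l_t$ that follows Definition 1 directly on a finite index space.

```python
from itertools import product
from math import gcd


def minor(T, r, c):
    """Determinant of the 2x2 submatrix of T without row r and column c."""
    (a, b), (d, e) = [row[:c] + row[c + 1:]
                      for k, row in enumerate(T) if k != r]
    return a * e - b * d


def det3(T):
    """Determinant of a 3x3 matrix by expansion along the first row."""
    return sum((-1) ** c * T[0][c] * minor(T, 0, c) for c in range(3))


def latency(T, row):
    """|T| divided by the gcd of the minors of the given row (0 = t, 1 = x, 2 = y)."""
    g = 0
    for c in range(3):
        g = gcd(g, abs(minor(T, row, c)))
    return abs(det3(T)) // g


def latencies(T):
    """Return (l_t, l_x, l_y) according to formulas (1)-(3)."""
    return tuple(latency(T, r) for r in range(3))


def brute_force_time_latency(T, N):
    """Smallest positive time gap between two index points mapped to one PE."""
    seen = {}
    for p in product(range(1, N + 1), repeat=3):
        t, x, y = (sum(T[r][c] * p[c] for c in range(3)) for r in range(3))
        seen.setdefault((x, y), set()).add(t)
    return min(b - a for ts in seen.values() for a in ts for b in ts if b > a)


# Assumed entries of T1 and T2 from Section 2.
T1 = [[1, 1, 1], [0, 1, 0], [-1, 0, 0]]
T2 = [[1, 1, 1], [-1, 1, 0], [0, 0, -1]]
```

With these entries, `latencies(T1)` and `latencies(T2)` can be compared against the values of $l_t$ quoted in the text for Fig. 1 and Fig. 2.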

3. A theory for CED using redundancy in systolic arrays

Permanent faults and transient faults are the two types of faults which may occur in real-time computation environments. For concurrent error detection, a systolic array should be able to detect any permanent or transient fault during the computation.

Fig. 3. The structure of the first type of CED.

Fig. 4. The structure of the second type of CED.

Throughout the paper, we assume that there is at most a single faulty (permanent or transient) PE in a systolic array and that there are no errors in the data transformations and inserted logic circuits. In some systolic designs, e.g. the systolic array shown in Fig. 2, there are some PEs which are not active at all times. These PEs, idle during a given cycle, may be used to do some useful computation. Based on this observation, several concurrent error detection (CED) techniques using redundant computations in the systolic array have been proposed [1,2,4,5,11,12,14]. There are two types of CED approaches in systolic arrays using space and/or time redundancy. The first one, shown in Fig. 3, is able to detect a permanent or transient fault: an error is flagged if two results from two different PEs using the same input data X are not identical, where d is a time delay unit. The second type of CED compares two results from the same PE produced at different times and is shown in Fig. 4. This technique, however, detects only transient faults. To detect a permanent fault, other techniques such as encoding and decoding could be used. A modification of the type 2 CED with encoding and decoding functions is shown in Fig. 5, where X is the input data of the function Y = f(X) performed by the PEs, and E(X) and D(Y) are the encoding and decoding functions, respectively. The functions E(X) and D(Y) should be selected such that D(f(E(X))) = f(X).

Fig. 5. A modified structure of the second type of CED with encoding and decoding.

An important motivation for examining time latency, space latency and the generalized space-time mapping is to develop a theory which reveals the inherent relationship between the generalized space-time mapping and CED in systolic arrays. Suppose that a systolic array $(\Delta, C_s)$ is obtained by a space-time transformation T for a given algorithm $(D, C_D)$ with $C_D \subset Z^3$.

Definition 2. A non-overlapping version $(D, C'_D)$ of a given algorithm $(D, C_D)$ is obtained from $(D, C_D)$ by substituting every index point $(i, j, k) \in C_D$ by a new index point $(i', j', k') \in C'_D$ with $i' = i + d_i$, $j' = j + d_j$, $k' = k + d_k$ such that the intersection of $C_D$ and $C'_D$ is empty ($C_D \cap C'_D = \emptyset$), where $d_i$, $d_j$ and $d_k$ are constants.

Since $C_D \subset Z^3$ and Definition 2 requires $C_D \cap C'_D = \emptyset$, we conclude that $(D, C'_D)$ is a non-overlapping version of $(D, C_D)$ if and only if at least one of $d_i$, $d_j$ and $d_k$ is not an integer. For example, if we choose $d_i = -0.5$, $d_j = 0.5$ and $d_k = 0$, then the following algorithm is a non-overlapping version of Algorithm 1.

Algorithm 2. (a non-overlapping version of Algorithm 1 computing C' = A' × B')
for i' := 0.5 to N - 0.5 do
  for j' := 1.5 to N + 0.5 do
    for k' := 1 to N do
    begin
      a'(i', j', k') := a'(i', j' - 1, k');
      b'(i', j', k') := b'(i' - 1, j', k');
      c'(i', j', k') := c'(i', j', k' - 1) + a'(i', j', k') b'(i', j', k')
    end;

where $a'(i', 0.5, k') = a'_{i'k'}$, $b'(-0.5, j', k') = b'_{k'j'}$, $c'(i', j', 0) = 0$ and $c'(i', j', N) = c'_{i'j'}$ for all i', j' and k'. It is clear that Algorithm 1 and Algorithm 2 produce the same result if A' = A and B' = B.

Definition 3. Suppose $T(D, C_D) = (\Delta, C_s)$ and $(D, C'_D)$ is a non-overlapping version of $(D, C_D)$. If T is also a valid space-time transformation of algorithm $(D, C'_D)$, that is, $T(D, C'_D) = (\Delta, C'_s)$ where $C'_s = \{(t + d_t, x + d_x, y + d_y)\} \subset Z^3$, then the systolic array $(\Delta, C'_s)$ is called a systolic redundant computation (SRC) of the systolic array $(\Delta, C_s)$.

Since $|T| \neq 0$ and $C_D \cap C'_D = \emptyset$, we have $C_s \cap C'_s = \emptyset$. Thus, if there are a systolic array $(\Delta, C_s)$ and its SRC $(\Delta, C'_s)$ for a given algorithm $(D, C_D)$, we can use the systolic array $(\Delta, C_s)$ to compute the original computation $(D, C_D)$ and use the systolic array $(\Delta, C'_s)$ to compute the redundant computation (the non-overlapping version $(D, C'_D)$). Let the given algorithm and its non-overlapping version compute the same problem instance. An error is detected whenever the two computations at $(t, x, y)$ and $(t + d_t, x + d_x, y + d_y)$ generate different values. The new systolic array obtained by merging the systolic array $(\Delta, C_s)$ and its SRC $(\Delta, C'_s)$ is called a concurrent error detection (CED) systolic array, represented by $(\Delta, C_s \cup C'_s)$. Given $(\Delta, C_s)$ and $(\Delta, C'_s)$, we can insert comparator logic circuits in the CED systolic array $(\Delta, C_s \cup C'_s)$ such that the results from $(t, x, y)$ and $(t + d_t, x + d_x, y + d_y)$ are compared directly to detect errors.
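
The disjointness requirement $C_s \cap C'_s = \emptyset$ can be checked by enumeration for a small problem size. The sketch below assumes the entries of $T_1$ and $T_2$ written in the code and N = 4; `merge_if_src` is an illustrative helper, not a construct from the paper. It returns the PE positions of the merged CED array when the shifted computation space does not collide with the original one.

```python
from itertools import product

# Assumed entries of the transformations from Section 2.
T1 = [[1, 1, 1], [0, 1, 0], [-1, 0, 0]]
T2 = [[1, 1, 1], [-1, 1, 0], [0, 0, -1]]


def computation_space(T, N):
    """C_s: image of the index space {1..N}^3 under T, as (t, x, y) tuples."""
    return {tuple(sum(T[r][c] * p[c] for c in range(3)) for r in range(3))
            for p in product(range(1, N + 1), repeat=3)}


def shifted(Cs, d):
    """C_s': every point of C_s moved by d = (d_t, d_x, d_y)."""
    dt, dx, dy = d
    return {(t + dt, x + dx, y + dy) for t, x, y in Cs}


def merge_if_src(T, N, d):
    """PE set of the merged array, or None if C_s and C_s' overlap."""
    Cs = computation_space(T, N)
    Cs_red = shifted(Cs, d)
    if Cs & Cs_red:
        return None
    return {(x, y) for _, x, y in Cs | Cs_red}


if __name__ == "__main__":
    merged = merge_if_src(T2, 4, (0, 1, 0))
    print("T2, d = (0, 1, 0):", len(merged), "PEs in the merged array")
    print("T1, d = (0, 1, 0):", merge_if_src(T1, 4, (0, 1, 0)))
```

For $T_2$ the shift by one column leaves the two computation spaces disjoint and adds one column of four PEs; for $T_1$ the same shift collides with the original schedule, so no SRC is obtained this way.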

Fig. 6. A systolic array for Algorithm 2 obtained by $T_2$.

For example, $T_2$ is also a valid transformation of Algorithm 2, $(D, C'_D)$. From the equation

$$\begin{pmatrix} d_t \\ d_x \\ d_y \end{pmatrix} = T_2 \begin{pmatrix} d_i \\ d_j \\ d_k \end{pmatrix}$$

where $d_i = -0.5$, $d_j = 0.5$ and $d_k = 0$, we have $d_t = 0$, $d_x = 1$ and $d_y = 0$. The SRC $(\Delta, C'_s)$ is shown in Fig. 6. A CED systolic array is constructed by merging the systolic array in Fig. 2 and its SRC in Fig. 6, as shown in Fig. 7, where the two computations C = A × B and C' = A' × B' are performed simultaneously. Let A' = A and B' = B, and insert a comparison logic between every pair PE(x, y) and PE(x + 1, y). Any error (either permanent or transient) can then be detected.

Fig. 7. A CED systolic array obtained by $T_2$.

In the following we derive the necessary and sufficient condition under which an SRC of the systolic array $(\Delta, C_s)$ exists. First we give some lemmas.

Lemma 2. If $l_x > 1$, then there is an SRC of systolic array $(\Delta, C_s)$ with $d_t = 0$, $d_x = 1$, $d_y = 0$.

Proof. Let $T(D, C_D) = (\Delta, C_s)$. Construct a systolic array $(\Delta, C'_s)$ from the systolic array $(\Delta, C_s)$ by substituting $C_s = \{(t, x, y)\}$ with $C'_s = \{(t', x', y')\}$, where $t' = t$ ($d_t = 0$), $x' = x + 1$ ($d_x = 1$), and $y' = y$ ($d_y = 0$). From

$$\begin{pmatrix} d_t \\ d_x \\ d_y \end{pmatrix} = T \begin{pmatrix} d_i \\ d_j \\ d_k \end{pmatrix},$$

we have

$$t_{11} d_i + t_{12} d_j + t_{13} d_k = 0$$
$$t_{21} d_i + t_{22} d_j + t_{23} d_k = 1$$
$$t_{31} d_i + t_{32} d_j + t_{33} d_k = 0.$$

Thus,

$$d_i = \frac{-|T_{21}|}{|T|}, \qquad d_j = \frac{|T_{22}|}{|T|} \qquad \text{and} \qquad d_k = \frac{-|T_{23}|}{|T|}.$$

Because $|T| \neq 0$ and $|T| = \pm l_x \gcd(|T_{21}|, |T_{22}|, |T_{23}|)$ with $l_x \geq 2$, $|T|$ cannot divide all of $|T_{21}|$, $|T_{22}|$ and $|T_{23}|$: otherwise $|T|$ would divide their gcd, which is nonzero and smaller than $|T|$ in absolute value. Thus, at least one of $d_i$, $d_j$ and $d_k$ is not an integer. Therefore, $(D, C'_D)$ is a non-overlapping version of $(D, C_D)$. $\Box$

Similarly we can prove:

Lemma 3. If $l_y > 1$, then there is an SRC $(\Delta, C'_s)$ of systolic array $(\Delta, C_s)$ with $d_t = 0$, $d_x = 0$, $d_y = 1$.

Lemma 4. If $l_t > 1$, then there is an SRC $(\Delta, C'_s)$ of systolic array $(\Delta, C_s)$ with $d_t = 1$, $d_x = 0$, $d_y = 0$.
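
The test used in the proofs of Lemmas 2-4 (at least one of $d_i$, $d_j$, $d_k$ is not an integer) can be carried out with exact rational arithmetic. The following Python sketch solves $(d_t, d_x, d_y)^T = T (d_i, d_j, d_k)^T$ by Cramer's rule; the function names and the entries assumed for $T_1$ and $T_2$ are illustrative.

```python
from fractions import Fraction


def det3(M):
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def index_shift(T, d):
    """Solve T * (d_i, d_j, d_k)^T = (d_t, d_x, d_y)^T exactly (Cramer's rule)."""
    det = det3(T)
    shift = []
    for col in range(3):
        # Replace one column of T by the right-hand side d.
        M = [[d[r] if c == col else T[r][c] for c in range(3)]
             for r in range(3)]
        shift.append(Fraction(det3(M), det))
    return tuple(shift)


def is_src_shift(T, d):
    """True if d = (d_t, d_x, d_y) corresponds to a non-integer index shift."""
    return any(s.denominator != 1 for s in index_shift(T, d))


# Assumed entries of the transformations from Section 2.
T1 = [[1, 1, 1], [0, 1, 0], [-1, 0, 0]]
T2 = [[1, 1, 1], [-1, 1, 0], [0, 0, -1]]

if __name__ == "__main__":
    print(index_shift(T2, (0, 1, 0)))   # index shift behind Algorithm 2
    print(is_src_shift(T1, (0, 1, 0)))
```

For $T_2$ and $(d_t, d_x, d_y) = (0, 1, 0)$ this reproduces the shift $d_i = -0.5$, $d_j = 0.5$, $d_k = 0$ used for Algorithm 2.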

Based on Lemmas 2-4, we have the following necessary and sufficient condition under which an SRC exists.

Theorem 1. There exists an SRC of a given $(\Delta, C_s)$ if and only if $\max\{l_t, l_x, l_y\} > 1$.

Proof.
(i) Sufficiency. The proof follows directly from Lemmas 2-4.
(ii) Necessity. If $l_t = 1$, $l_x = 1$ and $l_y = 1$, then $C_s$ forms a continuous region in $Z^3$ whose size depends on the problem size. Thus we cannot find any non-zero integer constants $d_t$, $d_x$ and $d_y$ such that $t' = t + d_t$, $x' = x + d_x$ and $y' = y + d_y$ for $C'_s$. $\Box$

Theorem 1 reveals the inherent relationship between CED and the space-time mapping, and it shows that the capability of designing a CED systolic array depends on the space-time transformation T only. Based on Theorem 1, various CED techniques using redundancy proposed by others can be explained and analyzed. The first type of CED approach, shown in Fig. 3, corresponds to constructing an SRC with $d_x \neq 0$ and/or $d_y \neq 0$, where the time delay unit d depends on the value of $d_t$. The second type of CED approach, shown in Fig. 4, corresponds to constructing an SRC with $d_x = 0$, $d_y = 0$ and $d_t \neq 0$. We briefly relate some previous CED methods to our results.

Fig. 8. An example of a CCRC systolic array.

1. CCRC [4]: A typical systolic array structure to which comparison with concurrent redundant computation (CCRC) is applicable is shown in Fig. 8. Note that no data are stored in the PEs during the computation and there is only one horizontal data path. The $x_i$'s are controlled and supplied by a host computer. The proposed CED systolic array is shown in Fig. 9 [4]. Here $\tilde{x}_i$ represents either $x_i$ or $x_{i-1}$ (i = 1, 2, ..., n), depending on whether the cycle is even numbered or odd numbered. Variable $\tilde{x}_0$ represents the sequence $x_1, x_2, \ldots, x_n$. The 2-to-1 MUX and an extra PE (PE 0) produce a sequence of input data $y_1, f(x_1, y_1), y_2, f(x_2, y_2), \ldots, y_n, f(x_n, y_n)$. This array can be redrawn as shown in Fig. 10. This CED array is based on the fact that if a (sub)computation is carried out in PE(i), then its redundant

computation is performed in PE(i + 1). The systolic array of Fig. 8, obtained by a transformation T, can be represented by $(\Delta, C_s)$ where

$$\Delta = \begin{pmatrix} -1 & -1 \\ 1 & 0 \end{pmatrix}$$

and $l_t = 1$, $l_x = 1$ (1-D systolic array). To construct an SRC, let

$$T^* = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} T.$$

We have $l_t = 2$ and $l_x = 2$. By applying $T^*$ to the same algorithm, we have a new systolic array $(\Delta^*, C_{s^*})$ where

$$\Delta^* = T^* D = 2\Delta,$$

as shown in Fig. 11(a), where $\tilde{x}_i$ represents either $x_i$ or zero. Choosing an SRC $(\Delta^*, C'_{s^*})$ of $(\Delta^*, C_{s^*})$ with $d_t = 0$ and $d_x = 1$, the corresponding systolic array is shown in Fig. 11(b). Figure 11(c) shows the CED systolic array obtained by merging $(\Delta^*, C_{s^*})$ and $(\Delta^*, C'_{s^*})$. Since there is only one horizontal data path and the $x_i$'s are not stored in the PEs, the latches between PEs can be removed to yield the same systolic array as shown in Fig. 10.

Fig. 9. A CED systolic array by CCRC.

Fig. 10. Simplified CED systolic array.

Fig. 11(a). Systolic array $(\Delta^*, C_{s^*})$.

Fig. 11(b). Systolic array $(\Delta^*, C'_{s^*})$.

Fig. 11(c). Systolic array $(\Delta^*, C_{s^*} \cup C'_{s^*})$.

2. RESO [11]: The basic theme of this method is to test a systolic array by repeating every computation with shifted operands (RESO). In terms of the proposed CED theory, this approach can be viewed as constructing an SRC with $d_t = 1$ and $d_x = 0$ in 1-D systolic arrays (or $d_t = 1$, $d_x = 0$ and $d_y = 0$ in 2-D systolic arrays). To detect an error using the same computational unit (PE), encoding and decoding techniques are applied. In RESO, the encoding and decoding functions are chosen as a shift and a shift back by one bit (or multiple bits), respectively.

Fig. 12. A systolic array with $n_y$ rows and $n_x$ columns.

4. Designing CED systolic arrays

According to the above analysis, the hardware cost of a CED systolic implementation depends on the values of $d_t$, $d_x$ and $d_y$ (or $d_t$ and $d_x$ in the case of 1-D systolic arrays). In particular, the value $d_t$ indicates the number of buffers (latches) required to synchronize the comparison between the original result produced at time t and the redundant result produced at time $t + d_t$. The value $d_x$ ($d_y$) represents the additional $d_x$ ($d_y$) columns (rows) of PEs required to calculate the redundant computations. Furthermore, $d_x$ and $d_y$ indicate the distances between the two PEs (one at PE $(x, y)$, the other at PE $(x + d_x, y + d_y)$) and the corresponding comparison logic circuit. Therefore, small values of $d_t$, $d_x$ and $d_y$ ensure locality of the spatial connection to the comparison logic circuit, temporal locality (buffering of results to be compared) and a small number of additional PEs. Suppose that there is a space-time transformation T which maps an algorithm onto a systolic array with $n_x$ columns and $n_y$ rows, as shown in Fig. 12. If $l_x = 2$, $l_y = 2$ and $n_y \leq n_x$, then according to Lemma 2 we could choose an SRC with $d_t = 0$, $d_x = 1$ and $d_y = 0$. The resulting CED systolic array requires $n_y$ additional PEs, as shown in Fig. 13. Similarly, if $n_y > n_x$, then according to Lemma 3 we could choose an SRC with $d_t = 0$, $d_x = 0$ and $d_y = 1$. The resulting CED systolic array requires $n_x$ additional PEs. In the case of $l_t = 2$, $l_x = 1$ and $l_y = 1$, according to Lemma 4, one can choose an SRC with $d_t = 1$, $d_x = 0$ and $d_y = 0$. The resulting CED systolic array does not require any additional PEs. However, since the two results (the original one and the redundant one) are produced at the same PE, as described above, a permanent fault may be masked. If the function performed by the PEs involves only simple arithmetic operations, e.g. addition and multiplication, the method of repeating every computation with shifted operands (RESO) can be used.
In general, however, there is no systematic way to construct an encoding function and a decoding function such that any permanent fault can be detected for an arbitrary function. To avoid this difficulty, we could construct an SRC $(\Delta, C'_s)$ such that $d_x = 0$ and $d_y = 0$ do not occur simultaneously. This is guaranteed by the following theorem.

Fig. 13. A CED systolic array obtained by choosing $d_x = 1$ and $d_y = 0$.

Theorem 2. If $T(D, C_D) = (\Delta, C_s)$ and $l_t = 2$, $l_x = 1$, $l_y = 1$, then $(\Delta, C'_s)$ with $d_t = 1$, $d_x = 1$, $d_y = 0$ or with $d_t = 1$, $d_x = 0$, $d_y = 1$ is an SRC of $(\Delta, C_s)$.

Proof. Construct a systolic array $(\Delta, C'_s)$ from the systolic array $(\Delta, C_s)$ by letting $d_t = 1$, $d_x = 1$ and $d_y = 0$. Solving $(d_t, d_x, d_y)^T = T (d_i, d_j, d_k)^T$ by Cramer's rule, we have

$$d_i = \frac{1}{|T|} \begin{vmatrix} 1 & t_{12} & t_{13} \\ 1 & t_{22} & t_{23} \\ 0 & t_{32} & t_{33} \end{vmatrix} = \frac{|T_{11}|}{|T|} - \frac{|T_{21}|}{|T|},$$

$$d_j = \frac{1}{|T|} \begin{vmatrix} t_{11} & 1 & t_{13} \\ t_{21} & 1 & t_{23} \\ t_{31} & 0 & t_{33} \end{vmatrix} = -\frac{|T_{12}|}{|T|} + \frac{|T_{22}|}{|T|}$$

and

$$d_k = \frac{1}{|T|} \begin{vmatrix} t_{11} & t_{12} & 1 \\ t_{21} & t_{22} & 1 \\ t_{31} & t_{32} & 0 \end{vmatrix} = \frac{|T_{13}|}{|T|} - \frac{|T_{23}|}{|T|}.$$

Since $l_t = 2$ and $l_x = 1$, we have $|T| = \pm 2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)$ and $|T| = \pm \gcd(|T_{21}|, |T_{22}|, |T_{23}|)$. Let $|T| = 2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|) = \gcd(|T_{21}|, |T_{22}|, |T_{23}|)$ (the proofs for the remaining cases are similar). Thus,

$$d_i = \frac{|T_{11}|}{2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)} - \frac{|T_{21}|}{\gcd(|T_{21}|, |T_{22}|, |T_{23}|)},$$

$$d_j = \frac{-|T_{12}|}{2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)} + \frac{|T_{22}|}{\gcd(|T_{21}|, |T_{22}|, |T_{23}|)}$$

and

$$d_k = \frac{|T_{13}|}{2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)} - \frac{|T_{23}|}{\gcd(|T_{21}|, |T_{22}|, |T_{23}|)}.$$

Suppose that $d_i$, $d_j$ and $d_k$ are all integers. Since the second term of each expression is an integer, the first terms must be integers as well, so that

$$\frac{|T_{11}|}{2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)} = c_1, \qquad \frac{|T_{12}|}{2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)} = c_2 \qquad \text{and} \qquad \frac{|T_{13}|}{2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)} = c_3$$

for some integers $c_1$, $c_2$ and $c_3$. Then $2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|)$ divides each of $|T_{11}|$, $|T_{12}|$ and $|T_{13}|$, and we have

$$\gcd(|T_{11}|, |T_{12}|, |T_{13}|) \geq 2 \gcd(|T_{11}|, |T_{12}|, |T_{13}|),$$

which is a contradiction. Therefore, at least one of $d_i$, $d_j$ and $d_k$ is not an integer. Similarly, we can prove the case of $d_t = 1$, $d_x = 0$, $d_y = 1$. $\Box$

Theorem 2 states that if a transformation T has $\max\{l_t, l_x, l_y\} > 1$, then one can always design a CED systolic array such that the original computation and its redundant computation are performed at different PEs. Based on the above results, to design a CED systolic array for a given algorithm, one should choose a space-time transformation T such that $\max\{l_t, l_x, l_y\} > 1$ and then choose an SRC properly to minimize the number of extra PEs. Formally, this can be described by the following procedure.

Procedure 1 (designing a CED systolic array for algorithm $(D, C_D)$).
Step 1. Choose a space-time transformation T for $(D, C_D)$, $T(D, C_D) = (\Delta, C_s)$, such that $\max\{l_t, l_x, l_y\} = 2$. Find the values of $n_x$ and $n_y$, which are the number of columns of PEs and the number of rows of PEs in the systolic array $(\Delta, C_s)$.
Step 2. Construct an SRC according to the following cases:
  Case 1. $l_t = 2$, $l_x = 1$ and $l_y = 1$.
    if $n_x \leq n_y$ then choose $d_t = 1$, $d_x = 0$ and $d_y = 1$ (it requires an additional $n_x$ PEs)
    else choose $d_t = 1$, $d_x = 1$ and $d_y = 0$ (it requires an additional $n_y$ PEs).
  Case 2. $l_x = 2$ and $l_y = 2$.
    if $n_x < n_y$ then choose $d_t = 0$, $d_x = 0$ and $d_y = 1$ (it requires an additional $n_x$ PEs)
    else choose $d_t = 0$, $d_x = 1$ and $d_y = 0$ (it requires an additional $n_y$ PEs).
  Case 3. $l_x = 2$ and $l_y = 1$.
    Choose $d_t = 0$, $d_x = 1$ and $d_y = 0$ (it requires an additional $n_y$ PEs).
  Case 4. $l_x = 1$ and $l_y = 2$.
    Choose $d_t = 0$, $d_x = 0$ and $d_y = 1$ (it requires an additional $n_x$ PEs).
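
Step 2 of Procedure 1 is a plain case analysis and can be written down as a small function. The sketch below is one possible transcription in Python; the function name `choose_src` and the return convention (the shift $(d_t, d_x, d_y)$ together with the number of extra PEs) are assumptions made for illustration.

```python
def choose_src(lt, lx, ly, nx, ny):
    """Step 2 of Procedure 1.

    lt, lx, ly: latencies of the chosen transformation (max must be 2).
    nx, ny:     number of PE columns and PE rows of the original array.
    Returns ((d_t, d_x, d_y), number_of_additional_PEs).
    """
    if (lt, lx, ly) == (2, 1, 1):            # Case 1
        if nx <= ny:
            return (1, 0, 1), nx
        return (1, 1, 0), ny
    if lx == 2 and ly == 2:                  # Case 2
        if nx < ny:
            return (0, 0, 1), nx
        return (0, 1, 0), ny
    if lx == 2 and ly == 1:                  # Case 3
        return (0, 1, 0), ny
    if lx == 1 and ly == 2:                  # Case 4
        return (0, 0, 1), nx
    raise ValueError("no case of Procedure 1 applies to these latencies")


if __name__ == "__main__":
    # Array of Fig. 2: l_t = l_x = l_y = 2, seven columns, four rows.
    print(choose_src(2, 2, 2, nx=7, ny=4))
```

The cases are tested in the order in which the procedure lists them, so a transformation with $l_x = l_y = 2$ falls into Case 2 regardless of $l_t$.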

Table 1
Outputs of the CED array in Fig. 7

          O0(t)   O1(t)   O2(t)   O3(t)   O4(t)   O5(t)   O6(t)   O7(t)
t = 1     0       0       0       0       0       0       0       0
t = 2     0       0       0       0       0       0       0       0
t = 3     0       0       0       0       0       0       0       0
t = 4     0       0       0       0       0       0       0       0
t = 5     0       0       0       c11     c'11    0       0       0
t = 6     0       0       c21     c'21    c12     c'12    0       0
t = 7     0       c31     c'31    c22     c'22    c13     c'13    0
t = 8     c41     c'41    c32     c'32    c23     c'23    c14     c'14
t = 9     0       c42     c'42    c33     c'33    c24     c'24    0
t = 10    0       0       c43     c'43    c34     c'34    0       0
t = 11    0       0       0       c44     c'44    0       0       0

As an example, consider designing a CED systolic array for Algorithm 1 (N = 4). Suppose we choose the space-time transformation

$$T_2 = \begin{pmatrix} 1 & 1 & 1 \\ -1 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix},$$

which has $l_t = 2$, $l_x = 2$ and $l_y = 2$. The systolic array is shown in Fig. 2, where each row consists of seven PEs ($n_x = 7$) and each column consists of four PEs ($n_y = 4$). According to Step 2 of Procedure 1, the SRC should have $d_t = 0$, $d_x = 1$ and $d_y = 0$. The corresponding CED systolic array is the one shown in Fig. 7, in which four extra PEs are required.

5. A simplified CED design

In the previous discussion it is assumed that a comparison logic circuit is required for each pair of PEs (PE $(x, y)$ and PE $(x + d_x, y + d_y)$) in a CED systolic array. In general, the number of comparison logic circuits required in the CED systolic array equals the total number of PEs. Owing to the time delay of the comparison logic circuits, the system clock may have to be slower than that of an array without CED capability. In the following, we show that if there is only one variable in the systolic array which may update its value during the computation (e.g. variable c in the systolic array of Fig. 2), then a simplified CED technique can be applied. For the sake of simplicity we use the CED systolic array shown in Fig. 7 to illustrate the ideas. Suppose that the CED systolic array starts its computation at time t = 1 and terminates at time t = 11 (N = 4). Because

$$T_2 = \begin{pmatrix} 1 & 1 & 1 \\ -1 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix},$$

we have t = i + j + k, x = -i + j and y = -k. It is easy to see that if PE (x, y) computes for (i, j, k) at time t, then the same PE (x, y) will calculate the corresponding redundant computation for (i + 1, j, k) at time t + 1. This leads to the following result: under fault-free conditions, PE (2x, y) and PE (2x + 1, y), x = 0, 1, 2, 3, produce the same result (same computation) at each even cycle, and PE (2x - 1, y) and PE (2x, y), x = 1, 2, 3, produce the same result at each odd cycle. In short, PE (2x, y) performs the regular computation at each even-numbered cycle and the redundant one at each odd-numbered cycle, and PE (2x + 1, y) performs the regular computation at each odd-numbered cycle and the redundant one at each even-numbered cycle.*

Table 1 shows the outputs of the systolic array from time t = 1 to time t = 11, where $c'_{ij}$ represents the redundant value of $c_{ij}$ and $O_j(t)$ represents the output data produced in the jth column of the array at time t (j = 0, ..., 7; t = 1, 2, ..., 11). According to Table 1, if the systolic array is fault free, we have
(i) $O_1(t) = O_2(t)$, $O_3(t) = O_4(t)$, $O_5(t) = O_6(t)$ at time t = 1, 3, 5, 7, 9, 11;
(ii) $O_0(t) = O_1(t)$, $O_2(t) = O_3(t)$, $O_4(t) = O_5(t)$, $O_6(t) = O_7(t)$ at time t = 2, 4, 6, 8, 10.

Let $t_e \in \{2, 4, 6, 8, 10\}$ and $t_o \in \{1, 3, 5, 7, 9, 11\}$. We define a set of binary functions $\{P_j(t)\}$, j = 1, 2, ..., 6, as follows:

$$P_j(t) = \begin{cases} 1 & \text{if } O_j(t_e) \neq O_{j-1}(t_e) \text{ or } O_j(t_o) \neq O_{j+1}(t_o) \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } j = 1, 3, 5$$

and

$$P_j(t) = \begin{cases} 1 & \text{if } O_j(t_o) \neq O_{j-1}(t_o) \text{ or } O_j(t_e) \neq O_{j+1}(t_e) \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } j = 2, 4, 6.$$

Table 2
Error diagnosis for strings generated by the P_j(t)'s

               A permanent faulty PE in jth column     A transient faulty PE in jth column
j = 0          P1(t) = 0...0 1 0 1 0 ... 0 1           P1(t) = 0...0 1 0...0
1 <= j <= 6    Pj(t) = 0...0 1 1 1 1 ... 1 1           Pj(t) = 0...0 1 0...0
j = 7          P6(t) = 0...0 1 0 1 0 ... 0 1           P6(t) = 0...0 1 0...0

Suppose that there is a permanent faulty PE, say PE (1, 3) in Fig. 7. According to the definition of $\{P_j(t)\}$, the string produced by $P_3(t)$ for t = 1, 2, ..., 11 will be 0 0 0 1 1 1 1 1 1 1 1. The number of zeros prior to the first one in the string represents the distance between the faulty PE and the PE located in the lowest row of the same column (PE (3, 3)). If a permanent faulty PE is located in the left-most column (0th column) or the right-most column (7th column) of the systolic array, then the strings produced by $P_1(t)$ or $P_6(t)$ are different from the others. For example, the string 0 0 0 1 0 1 0 1 0 1 0 produced by $P_1(t)$ indicates that there is a permanent fault in PE (1, 0). In the case of a transient fault during the computation, for example a transient fault occurring at time t = 3 in PE (1, 3), the string produced by $P_3(t)$ will be 0 0 0 0 1 0 0 0 0 0 0. The number of zeros prior to the one in the string equals the sum of the distance between the faulty PE and the PE located in the first row of the same column (PE (0, 3)) and the time t at which the transient error occurs. Table 2 summarizes the error detecting patterns (strings) for a faulty PE in the jth column of the systolic array. Based on this result, we can design an error detection circuit which uses the $O_j(t)$'s, j = 0, 1, ..., 7, t = 1, 2, ..., 11, as its inputs and produces all the error detection signals $P_j(t)$, j = 1, 2, ..., 6, t = 1, 2, ..., 11. This circuit is located between a host computer and the CED systolic array, as shown in Fig. 14. Figure 15 shows the logic diagram of the error detection circuit, where the output of a comparator is logic one if the enable input is one and the two input data are not identical.
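
The error signals $P_j(t)$ can be simulated on the output streams of Table 1. The following Python sketch makes several assumptions that are not taken from the paper: the output pattern is regenerated from a rule read off Table 1, a fault is modelled simply as an offset added to one column's output in chosen cycles (from t = 4 on for the permanent fault, at t = 5 for the transient one, chosen to match the strings quoted above), and all names are illustrative.

```python
N = 4          # problem size used in Table 1
T_MAX = 11     # last clock cycle
COLS = 8       # output columns O_0 .. O_7


def fault_free_outputs(C):
    """Rebuild Table 1: O[j][t] for j = 0..7 and t = 1..11 (O[j][0] unused).

    Pattern read off Table 1 (an assumption of this sketch): c_ij leaves
    column j - i + 3 at time i + j + 3, and its redundant copy c'_ij leaves
    column j - i + 4 at the same time.
    """
    O = [[0] * (T_MAX + 1) for _ in range(COLS)]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            t = i + j + 3
            O[j - i + 3][t] = C[i - 1][j - 1]   # regular result
            O[j - i + 4][t] = C[i - 1][j - 1]   # redundant result
    return O


def partner(j, t):
    """Column whose output is compared with O_j(t) in cycle t."""
    even_cycle = (t % 2 == 0)
    if j % 2 == 1:                              # j = 1, 3, 5
        return j - 1 if even_cycle else j + 1
    return j + 1 if even_cycle else j - 1       # j = 2, 4, 6


def error_signals(O):
    """P_j(t) for j = 1..6 as lists indexed by t - 1."""
    return {j: [int(O[j][t] != O[partner(j, t)][t])
                for t in range(1, T_MAX + 1)]
            for j in range(1, 7)}


def inject(O, column, cycles, offset=1):
    """Copy of O in which one column outputs wrong values in the given cycles."""
    faulty = [row[:] for row in O]
    for t in cycles:
        faulty[column][t] += offset
    return faulty


C = [[10 * i + j for j in range(1, N + 1)] for i in range(1, N + 1)]
clean = fault_free_outputs(C)
permanent = error_signals(inject(clean, 3, range(4, T_MAX + 1)))
transient = error_signals(inject(clean, 3, [5]))
```

Under this simplified fault model, `permanent[3]` and `transient[3]` can be compared with the strings given above for $P_3(t)$, and a faulty column 0 can be compared with the alternating string given for $P_1(t)$.
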
Fig. 14. Connections among the CED systolic array, the error detecting circuit and the host computer.

Fig. 15. The logic circuit diagram of the error detecting circuit.

* Extension of this to the case $d_x = 0$, $d_y = 1$, $d_t = 0$ is immediate; the other extension to $d_x = 1$, $d_y = 1$ and $d_t = 0$ can be dealt with similarly.

Table 2 can be used to check whether an error is permanent or transient. Because the CED systolic array performs redundant computations, if an error occurs one can ignore the faulty outputs that are redundant computations ($c'_{ij}$) and replace the faulty outputs that are original computations ($c_{ij}$) by their redundant copies. For example, suppose there is a permanently faulty PE in the 3rd column of the array shown in Fig. 7. Since all outputs produced by $O_3(t)$ are faulty, the host computer can replace $c_{11}$, $c_{22}$, $c_{33}$ and $c_{44}$ by their redundant copies $c'_{11}$, $c'_{22}$, $c'_{33}$ and $c'_{44}$, generated by $O_4(t)$, and ignore all redundant outputs produced by $O_3(t)$. Therefore, the proposed scheme enables the host computer to achieve single-error fault tolerance.

Another interesting result of the proposed design is that, since the error detection circuit is not required to be built inside the array, the array offers an alternative mode of use in some applications. For example, a CED systolic array can be used to compute two problem instances for higher throughput, and the same systolic array can be used to compute one problem instance with fault-tolerance capability.

Compared with previous systolic array CED designs, this scheme has the following advantages:
(i) The number of comparators can be reduced from $O(N^2)$ to $O(N)$, where the size of the systolic array is $N \times N$.
(ii) The resulting systolic array is able to detect and correct a single error.
(iii) The same systolic array can be used to compute two problem instances simultaneously without extra cost.
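The host-side recovery step described in this section, in which faulty original outputs are replaced by their redundant copies, might look as follows in Python. This is a minimal sketch: the function `select_results`, the `(column, value)` representation of an output, and the sample values are all assumptions made for illustration and do not come from the paper.

```python
from typing import Dict, Tuple

# (index of the output column that produced the copy, value) -- assumed encoding
Copy = Tuple[int, int]


def select_results(
    originals: Dict[str, Copy],
    redundants: Dict[str, Copy],
    faulty_column: int,
) -> Dict[str, int]:
    """Pick, for each result, a copy that did not come from the faulty column."""
    results: Dict[str, int] = {}
    for name, (col, value) in originals.items():
        if col != faulty_column:
            # The original computation is trusted; its redundant copy is ignored.
            results[name] = value
            continue
        r_col, r_value = redundants[name]
        if r_col == faulty_column:
            # Both copies pass through the faulty column: not recoverable here.
            raise ValueError(f"no fault-free copy available for {name}")
        results[name] = r_value
    return results


# Hypothetical data: originals come out of column 3, redundant copies out of column 4.
originals = {"c11": (3, -1), "c22": (3, -1)}
redundants = {"c11": (4, 10), "c22": (4, 20)}
recovered = select_results(originals, redundants, faulty_column=3)
```

With a single faulty column, every result still has one copy from a fault-free column, which is why the scheme tolerates a single error.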

6. Conclusion

In this paper we have proposed a novel approach for mapping an algorithm onto a systolic array with the capability of CED. Compared with previous CED techniques, the proposed approach provides a systematic way for users to construct a CED systolic array for a given algorithm. Since the proposed approach is independent of the algorithm, it can be applied to any algorithm for which a valid space-time transformation exists. We illustrated our results for the case of 2-D systolic arrays. However, all our results can be extended to the general case ($n$-D systolic arrays). The approach proposed in this paper can also be extended to achieve further results. For example, one can design a systolic array to compute three identical computations simultaneously, such that any error (either permanent or transient) can be corrected concurrently by a majority voting circuit.

In addition, we have presented a simplified CED scheme which can be applied to algorithms in which only one variable may change its value during the computation. The simplified CED approach not only significantly reduces hardware and time overheads but also provides the capability of error correction. As well, it has the flexibility to use the same CED systolic array to compute two problem instances to achieve double throughput.

One of the major differences from former studies is that the proposed approach is based on the generalized space-time mapping method. This method focuses not only on the mapping from a computational structure, characterized by the data dependency matrix D, onto a systolic computational structure, characterized by the resulting dependency matrix A (systolic matrix), but also on the mapping of the computational space $C_O$ of the given algorithm onto an integer computation space $C_s$ by the same transformation. Consequently, more general and interesting results are obtained.
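The majority-voting extension mentioned in the conclusion can be illustrated by a small Python sketch. It shows only the voting rule over three copies of one result. The function name `majority_vote` is hypothetical, and nothing here describes how the three computations would be mapped onto the array.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(copies: Sequence[Hashable]) -> Hashable:
    """Return the value held by a strict majority of the copies.

    Raises ValueError if no value has a strict majority, which for three
    copies means that more than one copy is in error.
    """
    value, count = Counter(copies).most_common(1)[0]
    if 2 * count <= len(copies):
        raise ValueError("no majority among redundant copies")
    return value


# One erroneous copy out of three is outvoted by the other two.
corrected = majority_vote([7, 7, 9])
```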

References

[1] S.W. Chan and C.L. Wey, The design of concurrent error diagnosable systolic arrays for band matrix multiplications, IEEE Trans. CAD Integr. Circuits Syst. 7 (1) (1988) 21-37.
[2] R.J. Cosentino, Concurrent error correction in systolic architectures, IEEE Trans. CAD Integr. Circuits Syst. 7 (1) (1988) 117-125.
[3] J.A.B. Fortes and D.I. Moldovan, Parallelism detection and algorithm transformation techniques useful for VLSI architectures design, J. Parallel Distributed Comput. (1985) 277-301.
[4] R.K. Gulati and S.M. Reddy, Concurrent error detection in VLSI array structures, Proc. IEEE Internat. Conf. on Computer Design (1986) 488-491.
[5] K.H. Huang and J.A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput. 33 (6) (1984) 218-255.
[6] A. Jacob, P. Banerjee, C-Y. Chen, W. Fuchs, S-Y. Kuo and A. Reddy, Fault tolerance techniques for systolic arrays, IEEE Comput. (July 1987) 65-74.
[7] R.H. Kuhn, Yield enhancement by fault-tolerant systolic arrays, in: S.Y. Kung, H.J. Whitehouse and T. Kailath, eds., VLSI and Modern Signal Processing (Prentice-Hall, Englewood Cliffs, NJ, 1985) 178-184.
[8] H.F. Li, C.N. Zhang and R. Jayakumar, Latency of data-flow and concurrent error detection in systolic arrays, CCVLSI-89 (1989) 251-258.
[9] W.L. Miranker, Space-time representations of computational structures, Computing 32 (1984) 93-114.
[10] D.I. Moldovan, On the design of algorithms for VLSI systolic arrays, Proc. IEEE 71 (1) (1983) 113-120.
[11] J.H. Patel and L.Y. Fung, Concurrent error detection in ALU's by recomputing with shifted operands, IEEE Trans. Comput. C-31 (1982) 589-595.
[12] C-C. Wu and T-S. Wu, Concurrent error correction in unidirectional linear arithmetic arrays, Proc. 17th Internat. Symp. on Fault-Tolerant Computing (1987) 136-141.
[13] C.N. Zhang, H.F. Li and R. Jayakumar, A general model for concurrent error detection in systolic arrays, ISMM/IASTED 4th Internat. Conf. on Parallel and Distributed Computing and Systems, Washington, D.C. (Oct. 1991) 267-271.
[14] S.Y. Kung, VLSI Array Processors (Prentice-Hall, Englewood Cliffs, NJ, 1988).