Distributed fault-tolerant embeddings of rings in hypercubes

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 11, 63-71 (1991)

MEE YEE CHAN AND SHIANG-JEN LEE

Computer Science Program, The University of Texas at Dallas, Richardson, Texas 75083

This paper contributes simple distributed algorithms for successfully embedding a ring of size at least $2^n - 2f$ in an $n$-cube with $f \le \lfloor (n+1)/2 \rfloor$ faults. Only local knowledge of faults is assumed, with each processor aware of the status of only its immediate neighbors. The algorithms work in many cases where $f$ exceeds $\lfloor (n+1)/2 \rfloor$. © 1991 Academic Press, Inc.

0743-7315/91 $3.00 Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.

1. INTRODUCTION

The hypercube has been the topic of much recent research. On the one hand, various researchers have gone to great lengths to demonstrate that the hypercube is a very versatile parallel computer architecture capable of simulating other networks such as rings, grids, and trees with little overhead (e.g., [2-4, 12]). Such research enters the realm of graph embedding problems in which these other networks are cleverly mapped to the hypercube. On the other hand, various other researchers have gone to equally great lengths to show the robustness and fault tolerance of the hypercube, focusing on the hypercube's ability to route and reconfigure itself despite faults (e.g., [1, 6, 7, 9, 10]). Combining both aspects, this paper concentrates on the power of the hypercube to simulate other networks in the presence of faults. In particular, we devise simple distributed fault-tolerant algorithms to embed a ring in a hypercube with faults. The emphasis on distributed algorithms implies the assumption of only local knowledge of faults rather than global knowledge, with each processor having knowledge of the status of only its immediate neighbors.

It is well known that rings can be easily embedded into hypercubes using cyclic Gray codes [12]. In fact, insofar as distributed fault-tolerant embeddings of rings into hypercubes are concerned, this too has been addressed. Provost and Melhem [11] have given a distributed algorithm for embedding a ring of size $3 \times 2^{n-2}$ into an $n$-cube despite a single processor fault and an algorithm for embedding a ring of size $2^{n-1}$ despite two faults. Note that in the former case, 25% of the processors are wasted, and in the latter case, 50% of the processors are unused. The waste in processors seems a very high price to be paid for fault tolerance. Provost and Melhem's contribution marks our point of departure. We show that it is not necessary to sacrifice so many processors for a fault-tolerant embedding. In fact, we give an algorithm in which only two processors (including the faulty one) are excluded from the ring when there is a single fault, and moreover, when there are no faults, a full ring (of size $2^n$) is constructed. Note that the single-fault algorithm in [11] will construct a ring of size $3 \times 2^{n-2}$ regardless of how many faults there are. Like the single-fault algorithm of [11], the algorithm presented here will, in many cases, tolerate more than a single fault; in the cases where ring construction is successful, the ring will be of size at least $2^n - 2f$, where $f$ is the number of faults in the hypercube. We characterize and enumerate all multiple-fault scenarios that can be handled by our simple single-fault distributed algorithm. Again, our single-fault algorithm is found to be competitive with that of [11] in that, apart from having larger rings, a higher percentage of multiple-fault scenarios can be successfully handled by our algorithm.

We use our single-fault algorithm as the basis for defining a double-fault algorithm guaranteed to construct a ring of size at least $2^n - 4$ when faced with two processor faults. The approach is simply to run several copies of the single-fault algorithm in parallel, each copy attempting to build a different ring; at least one copy is guaranteed to succeed. This approach is noteworthy for its natural extensibility to a triple-fault- and, in fact, $f$-fault-tolerant algorithm, where $f \le \lfloor (n+1)/2 \rfloor$. So, ultimately, we contribute simple distributed algorithms for successfully embedding a ring of size at least $2^n - 2f$ in an $n$-cube with $f \le \lfloor (n+1)/2 \rfloor$ faults, which work for many cases where $f$ exceeds $\lfloor (n+1)/2 \rfloor$.

Apart from processor faults, we comment briefly on link faults. With link faults there is still hope for building a full ring (of size $2^n$) excluding faulty links. The approach is to exploit the various link-disjoint Hamiltonian circuits present in an $n$-cube.

The organization of the rest of the paper is as follows. Section 2 describes our single-processor fault-tolerant distributed ring embedding algorithm and its properties. Section 3 further addresses the problem of multiple processor faults. The discussion on link faults can be found in Section 4 along with other concluding remarks.

First of all, let us recall the standard definition for a binary hypercube. A binary hypercube of dimension $n$, or an $n$-cube, can be viewed as an undirected graph of $2^n$ nodes, where each node is specified by a unique $n$-bit label. There is an edge or a link between two nodes if and only if their labels differ in exactly one bit position. If two nodes differ in only bit position $d$, they are said to be neighbors across dimension $d$, and the edge or link between them is said to be on dimension $d$. We use "processor" and "node" interchangeably throughout the paper.
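As a small illustration of the hypercube definition, the sketch below represents node labels as integers and flips one bit to cross a dimension. The function names and the convention that dimension $d$ corresponds to bit $d-1$ of the label are assumptions made for this example only; the paper does not prescribe a bit order.

```python
def neighbor(label, d):
    """Label of the neighbor of `label` across dimension d (d is 1-based).

    Assumption for this sketch: dimension d corresponds to bit d-1.
    """
    return label ^ (1 << (d - 1))


def neighbors(label, n):
    """All n neighbors of a node in an n-cube, ordered by dimension."""
    return [neighbor(label, d) for d in range(1, n + 1)]


def dimension_between(a, b):
    """Dimension of the link joining a and b, or None if they are not neighbors."""
    diff = a ^ b
    # Two labels are neighbors exactly when they differ in one bit position.
    if diff != 0 and diff & (diff - 1) == 0:
        return diff.bit_length()
    return None
```

For instance, `neighbors(0, 3)` lists the three nodes adjacent to node 000 in a 3-cube, one per dimension.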

[Fig. 2(a): the ring $R'$, with embedding sequence $S'$, built when $v_5$ is faulty.]

2. SINGLE-PROCESSOR FAULT-TOLERANT RING EMBEDDING

The embedding of a ring of size $2^n$ in an $n$-cube may be given as a sequence of nodes, $R = (v_1, v_2, \ldots, v_{2^n})$, where each adjacent pair of nodes $v_i$ and $v_{(i \bmod 2^n)+1}$, $1 \le i \le 2^n$, are neighbors across some dimension $d_i$ in the $n$-cube. The same embedding can be specified in terms of the sequence of dimensions that the neighboring nodes go across, i.e., $S = (d_1, d_2, \ldots, d_{2^n})$. We call $S$ the embedding sequence. As observed by Dekel et al. [8, 11], an embedding sequence which corresponds to a binary-reflected Gray code embedding for a $2^n$-node ring in an $n$-cube may be generated using the following procedure, in which the vertical bar denotes the concatenation operator:

[Fig. 2(b): the ring $R''$, with embedding sequence $S''$, built when $v_6$ is faulty.]

FIG. 2. Handling a single fault.

Procedure RING(n)
  S ← (1);
  FOR i = 2 to n DO
    S ← S | (i) | S;
  return S;
end RING

The embedding sequence is produced by calling RING(n) and concatenating the returned result with (n), which is the closing dimension for the ring. Thus, the embedding sequence for a 32-node ring in a 5-cube resulting from RING(5) | (5) is S = (1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5). We call the $2^n$-node ring built on the basis of the binary-reflected Gray code embedding sequence produced by RING(n) | (n) the basis ring. The fault-tolerant rings we construct for an $n$-cube are based on minor variations of the basis ring.

It is convenient for us to view the embedding sequence and the basis ring in the manner shown in Fig. 1 for RING(5) | (5). Processor nodes $v_1$ and $v_2$ are connected by a link on dimension 1, $v_2$ and $v_3$ are connected by a link on dimension 2, $v_3$ and $v_4$ are connected by a link on dimension 1, $v_4$ and $v_5$ are connected by a link on dimension 3, and so on. The ring is drawn in such a way as to emphasize that $v_1$ and $v_4$ are neighbors in the hypercube across dimension 2, $v_3$ and $v_6$ are neighbors in the hypercube across dimension 3, $v_5$ and $v_8$ are neighbors in the hypercube across dimension 2, $v_7$ and $v_{10}$ are neighbors in the hypercube across dimension 4, and so on. Note that the link between $v_1$ and $v_4$ on dimension 2 is not a part of the ring. We call such links spare links.

[Figure: the embedding sequence S = 1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 5 1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 5 drawn along the basis ring.]

FIG. 1. The embedding sequence and the basis ring produced for n = 5.

The idea is to use these spare links when there are faults. So, if $v_4$ or $v_5$ is faulty, a ring $R'$ can be made by skipping both $v_4$ and $v_5$ and using the spare link between $v_3$ and $v_6$, i.e., $R' = (v_1, v_2, v_3, v_6, v_7, \ldots, v_{32})$ and S' = (1, 2, 3, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5). Refer to Fig. 2a. Similarly, if $v_6$ or $v_7$ is faulty, a ring $R''$ can be made by skipping both $v_6$ and $v_7$ and using the spare link between $v_5$ and $v_8$, i.e., $R'' = (v_1, v_2, v_3, v_4, v_5, v_8, v_9, \ldots, v_{32})$ and S'' = (1, 2, 1, 3, 2, 4, 1, 2, 1, 3, 1, 2, 1, 5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5). Refer to Fig. 2b. In general, if $v_i$ or $v_{i+1}$ is faulty where $i$ is even, both $v_i$ and $v_{i+1}$ should be skipped. Note that S' and S'' are minor variations on the embedding sequence RING(5) | (5) in which two successive 1's are removed from S.

In the context of a distributed fault-tolerant algorithm, node $v_1$ would represent the source, the processor that initiates the ring construction, and is reckoned to be nonfaulty. Without loss of generality, we may assume that $v_2$ and $v_{2^n}$ are also nonfaulty since $v_2$ is a neighbor of $v_1$ across dimension 1 and $v_{2^n}$ is a neighbor of $v_1$ (the source) across dimension $n$, and we can always rename dimensions 1 through $n$ so that $v_2$ and $v_{2^n}$ are nonfaulty as long as $v_1$ has two nonfaulty neighbors. Source $v_1$ must have at least two nonfaulty neighbors in order for there to exist a ring of size at least 2 that includes $v_1$.

The basic idea is to trace through the embedding sequence S = RING(n) | (n) until a faulty processor prevents continuance, or until the ring is complete. Upon encountering a faulty processor, we backtrack by at most one processor if necessary to a $v_i$ where $i$ is odd, use the appropriate spare link to avoid the fault, and continue building the ring from there. The detailed algorithm is given below as Algorithm RING-1PFT and resides at each processor. Parameters used in the algorithm include n, the dimensionality of the hypercube; m, the next position in the embedding sequence that needs to be traversed; skip, a flag to indicate whether the next 1 in the embedding sequence ought to be skipped; and source, the address of the source processor. Other variables include ADDR, the address of the processor currently invoking RING-1PFT, and nm and nskip, the new parameter values for m and skip, respectively, for the next invocation of RING-1PFT. To begin ring construction, the source processor sends a message to invoke RING-1PFT(n, 2, FALSE, source) at its neighbor across dimension 1. At the conclusion of the distributed algorithm, for each processor, the variable D will contain the dimension to traverse to get to the next processor in the constructed ring. At the source, D = $d_1$ = 1 by definition. In other words, the constructed ring is implicitly specified by the D values of the processors, and the ring can be readily traced out from the source using the D values.

Procedure RING-1PFT(n, m, skip, source)
  if source = ADDR and m = 2^n + 1 then exit; (* Ring is formed! *)
  S = (d_1, d_2, ..., d_{2^n}) ← RING(n) | (n);
  if skip and m is odd then (* when m is odd, d_m = 1 *)
    skip ← FALSE; m ← m + 1;
  if processor across d_m is not faulty then
    nm ← m + 1; D ← d_m; nskip ← skip;
    goto EXITING;
  (*** Processor across d_m is faulty! ***)
  if m is odd then
    nm ← m + 1; D ← 0; nskip ← TRUE;
    goto EXITING;
  (*** m is even! Backtrack to a v_i where i is odd! ***)
  if not skip and processor across dimension 1 is not faulty then
    nm ← m; D ← 1; nskip ← TRUE;
    goto EXITING;
  (* Otherwise, algorithm cannot handle this more-than-one-fault situation *)
  Announce failure and exit;
EXITING:
  if D = 0 then invoke RING-1PFT(n, nm, nskip, source) at this processor
  else invoke RING-1PFT(n, nm, nskip, source) at neighbor processor across D;
end RING-1PFT

Figure 3 gives a table which traces the execution of the distributed algorithm for the scenario in which $v_5$ of the basis ring is faulty. The final D values for the processors are given as well. Note that although $v_4$ has a D value of 1, $v_4$ is actually not part of the constructed ring: when tracing out the ring starting at $v_1$, $v_4$ and $v_5$ will not be reached. A similar trace can be carried out for the scenario in which $v_6$ is faulty, and in fact, the general behavior of the algorithm when a single fault occurs at $v_i$ is depicted in Fig. 4. When there are no faults, the behavior of the algorithm is given by Fig. 5.

  RING-1PFT invoked at    n    m    skip     source
  v_2                     n    2    FALSE    v_1
  v_3                     n    3    FALSE    v_1
  v_4                     n    4    FALSE    v_1
  v_3                     n    4    TRUE     v_1
  v_6                     n    5    TRUE     v_1
  v_7                     n    7    FALSE    v_1
  v_8                     n    8    FALSE    v_1
  v_9                     n    9    FALSE    v_1
  ...

  Processor    v_1   v_2   v_3   v_4   v_5   v_6   v_7   v_8   ...
  D            1     2     3     1     -     2     1     4     ...

FIG. 3. Trace of the algorithm RING-1PFT for the example shown in Fig. 2a.

The single-fault algorithm may still successfully build a ring when faced with more than one fault. Figure 6 shows two examples. As it turns out, we can characterize exactly the scenarios which can be handled successfully by the algorithm. The characterization is unraveled through a series of lemmas culminating in Theorem 1. With such a characterization, it is possible to enumerate the exact number of successful versus unsuccessful scenarios, as we see in Theorem 2.

LEMMA 1. RING-1PFT will successfully construct a ring of size $2^n$ in an $n$-cube with no faults.

Proof. Apparent from the trace of the algorithm as depicted in Fig. 5. ∎

LEMMA 2. RING-1PFT will successfully construct a ring of size $2^n - 2$ in an $n$-cube with one fault.

Proof (of Lemma 2). Apparent from the trace of the algorithm as depicted in Fig. 4. ∎

(a) $v_i$ is faulty where $i$ is even; $v_i$ and $v_{i+1}$ skipped

  RING-1PFT invoked at    n    m          skip     source
  v_2                     n    2          FALSE    v_1
  v_3                     n    3          FALSE    v_1
  ...
  v_{i-1}                 n    i - 1      FALSE    v_1
  v_{i-1}                 n    i          TRUE     v_1
  v_{i+2}                 n    i + 1      TRUE     v_1
  v_{i+3}                 n    i + 3      FALSE    v_1
  v_{i+4}                 n    i + 4      FALSE    v_1
  ...
  v_{2^n}                 n    2^n        FALSE    v_1
  v_1                     n    2^n + 1    FALSE    v_1

  Processor    v_1        v_2    ...   v_{i-2}    v_{i-1}   v_i   v_{i+1}   v_{i+2}    v_{i+3}    ...   v_{2^n}
  D            d_1 = 1    d_2    ...   d_{i-2}    d_i       -     -         d_{i+2}    d_{i+3}    ...   d_{2^n}

(b) $v_i$ is faulty where $i$ is odd; $v_{i-1}$ and $v_i$ skipped

  RING-1PFT invoked at    n    m          skip     source
  v_2                     n    2          FALSE    v_1
  v_3                     n    3          FALSE    v_1
  ...
  v_{i-2}                 n    i - 2      FALSE    v_1
  v_{i-1}                 n    i - 1      FALSE    v_1
  v_{i-2}                 n    i - 1      TRUE     v_1
  v_{i+1}                 n    i          TRUE     v_1
  v_{i+2}                 n    i + 2      FALSE    v_1
  v_{i+3}                 n    i + 3      FALSE    v_1
  ...
  v_{2^n}                 n    2^n        FALSE    v_1
  v_1                     n    2^n + 1    FALSE    v_1

  Processor    v_1        v_2    ...   v_{i-3}    v_{i-2}    v_{i-1}   v_i   v_{i+1}    v_{i+2}    ...   v_{2^n}
  D            d_1 = 1    d_2    ...   d_{i-3}    d_{i-1}    1         -     d_{i+1}    d_{i+2}    ...   d_{2^n}

FIG. 4. The general behavior when there is a single fault.

  RING-1PFT invoked at    n    m          skip     source
  v_2                     n    2          FALSE    v_1
  v_3                     n    3          FALSE    v_1
  v_4                     n    4          FALSE    v_1
  ...
  v_i                     n    i          FALSE    v_1
  ...
  v_{2^n}                 n    2^n        FALSE    v_1
  v_1                     n    2^n + 1    FALSE    v_1

  Processor    v_1        v_2    v_3    ...   v_i    ...   v_{2^n}
  D            d_1 = 1    d_2    d_3    ...   d_i    ...   d_{2^n}

FIG. 5. The behavior when there are no faults.

[Figure: two rings built in a 5-cube with several faults. (a) Embedding sequence 1 2 3 2 1 4 2 3 1 2 1 5 2 3 1 2 4 2 1 3 2 5. (b) Embedding sequence 1 2 1 3 1 2 4 1 2 3 1 2 1 5 2 1 3 2 1 4 2 3 1 2 1 5.]

FIG. 6. Handling multiple faults.

LEMMA 3. Suppose processor $v_1$ invokes RING-1PFT(n, 2, FALSE, $v_1$) at $v_2$ in the basis ring of an $n$-cube with $f$ faults. Let $v_{i_1}, v_{i_2}, \ldots, v_{i_f}$ be the $f$ faults, where $2 < i_1 < i_2 < \cdots < i_f$ and, for consecutive faults, either $i_{k+1} - i_k \ge 3$ or $i_{k+1} = i_k + 1$ and $i_k$ is even. Then, with $i_0 = 0$ and $v_{2^n+1} = v_1$,

(i) for $k = 1, 2, \ldots, f$, if $i_k$ is even, then RING-1PFT(n, $i_k + 1$, TRUE, $v_1$) will be invoked at $v_{i_k+2}$ and nodes $v_{i_{k-1}+2}, \ldots, v_{i_k-1}, v_{i_k+2}$ are included in the constructed ring;

(ii) for $k = 1, 2, \ldots, f$, if $i_k$ is odd, then RING-1PFT(n, $i_k + 2$, FALSE, $v_1$) will be invoked at $v_{i_k+2}$ and nodes $v_{i_k+1}, v_{i_k+2}$ are included in the constructed ring; nodes $v_{i_{k-1}+2}, \ldots, v_{i_k-2}$ are also included in the constructed ring if $i_k - i_{k-1} \ge 4$;

(iii) the last invocation of the algorithm will be RING-1PFT(n, $2^n + 1$, FALSE, $v_1$) at $v_1$ and nodes $v_{i_f+2}, \ldots, v_{2^n+1}$ are included in the constructed ring.

Proof. (i) and (ii) are proven by induction on $k$.

[Induction basis] $k = 1$. Then $v_{i_1}$ is the first fault in the basis ring. The scenario is the same as that for single fault cases shown in Fig. 4. Clearly,

(i) when $i_1$ is even, as shown in Fig. 4a, RING-1PFT(n, $i_1 + 1$, TRUE, $v_1$) is invoked at $v_{i_1+2}$ and nodes $v_2, v_3, \ldots, v_{i_1-1}, v_{i_1+2}$ are all included in the constructed ring.

(ii) when $i_1$ is odd, as shown in Fig. 4b, RING-1PFT(n, $i_1 + 2$, FALSE, $v_1$) is invoked at $v_{i_1+2}$ and nodes $v_2, v_3, \ldots, v_{i_1-2}, v_{i_1+1}, v_{i_1+2}$ are all included in the constructed ring.

[Induction step] Assuming the lemma is true for all $k < z$, we show that it is also true when $k = z$.

(i) $i_z$ is even; i.e., $i_z - 1$ is odd. $i_z - 1 \ge i_{z-1} + 2$ since $i_z - i_{z-1} \ge 3$. Then, RING-1PFT(n, $i_z - 1$, FALSE, $v_1$) will be invoked at $v_{i_z-1}$ and nodes $v_{i_{z-1}+2}, \ldots, v_{i_z-1}$ are all included in the constructed ring. The trace of the algorithm is shown in Fig. 7a.

(ii) $i_z$ is odd; i.e., $i_z - 1$ is even.

Case I. $i_z - i_{z-1} \ge 4$, which implies $i_z - 1 > i_{z-1} + 2$. Then, RING-1PFT(n, $i_z - 1$, FALSE, $v_1$) will be invoked at $v_{i_z-1}$. The trace of the algorithm is shown in Fig. 7b. One can easily verify that nodes $v_{i_{z-1}+2}, \ldots, v_{i_z-2}, v_{i_z+1}, v_{i_z+2}$ are all included in the constructed ring.

Case II. $i_{z-1}$ is even and $i_{z-1} = i_z - 1$. From the induction hypothesis, RING-1PFT(n, $i_{z-1} + 1$, TRUE, $v_1$) is invoked at $v_{i_{z-1}+2} = v_{i_z+1}$ and $v_{i_{z-1}+2}$ is included in the ring. The trace of the algorithm is shown in Fig. 7c. Nodes $v_{i_z+1}, v_{i_z+2}$ are all included in the constructed ring.

Case III. $i_{z-1}$ is even and $i_z - i_{z-1} = 3$. From the induction hypothesis, RING-1PFT(n, $i_{z-1} + 1$, TRUE, $v_1$) is invoked at $v_{i_{z-1}+2} = v_{i_z-1}$ and is included in the ring. The trace of the algorithm is shown in Fig. 7d. Nodes $v_{i_z+1}, v_{i_z+2}$ are included in the constructed ring.

[Figure: four trace tables. (a) $i_z$ is even. (b) $i_z$ is odd and $i_z - i_{z-1} \ge 4$. (c) $i_z$ is odd, $i_{z-1}$ is even and $i_{z-1} = i_z - 1$. (d) $i_z$ is odd, $i_{z-1}$ is even and $i_z - i_{z-1} = 3$.]

FIG. 7. Trace of the algorithm RING-1PFT in the proof of Lemma 3.

(iii) This part of the lemma is about the last invocation of RING-1PFT in the constructed ring. Since $v_{2^n}$ is not faulty, let us consider the position of $v_{i_f}$ as follows:

Case I. $i_f = 2^n - 1$. Follows immediately from (ii). RING-1PFT(n, $2^n + 1$, FALSE, $v_1$) is invoked at $v_{2^n+1} = v_1$ and nodes $v_{2^n}, v_{2^n+1}$ are included in the ring. The trace of the algorithm is shown in Fig. 8a.

Case II. $i_f = 2^n - 2$. Since $2^n - 2$ is even, from (i) we know that RING-1PFT(n, $2^n - 1$, TRUE, $v_1$) is invoked at $v_{2^n}$ and $v_{2^n}$ is included in the ring. The trace of the algorithm is shown in Fig. 8b.

Case III. $i_f \le 2^n - 3$. Follows from (i) and (ii). Since $v_{2^n-1}$ is included in the ring, RING-1PFT(n, $2^n$, FALSE, $v_1$) is invoked at $v_{2^n}$. The trace of the algorithm is shown in Fig. 8c. ∎

[Figure: three trace tables. (a) $i_f = 2^n - 1$. (b) $i_f = 2^n - 2$. (c) $i_f \le 2^n - 3$.]

FIG. 8. Trace of the algorithm RING-1PFT for the last invocation in the constructed ring.
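The traces in Figs. 3 to 5 and the multiple-fault examples of Fig. 6 can be replayed with a small centralized simulation. The sketch below is not the distributed algorithm itself: it follows the control flow of RING-1PFT in a single loop, with the faulty set known globally purely for convenience. The names `ring`, `basis_ring`, and `ring_1pft`, the choice of label 0 for the source, and the mapping of dimension $d$ to bit $d-1$ are assumptions made for this example. It also assumes, as the paper does, that $v_2$ and $v_{2^n}$ are nonfaulty.

```python
def ring(n):
    """Procedure RING(n): dimension sequence of the binary-reflected Gray code."""
    s = [1]
    for i in range(2, n + 1):
        s = s + [i] + s
    return s


def basis_ring(n, source=0):
    """Labels v_1..v_{2^n} of the basis ring (index 0 is unused padding)."""
    nodes = [None, source]
    for d in ring(n):
        nodes.append(nodes[-1] ^ (1 << (d - 1)))
    return nodes


def ring_1pft(n, faulty, source=0):
    """Replay RING-1PFT centrally; return the ring as a node list, or None on failure."""
    size = 2 ** n
    d = [None] + ring(n) + [n]            # d[1..2^n]; the closing dimension is n
    D = {source: 1}                       # at the source, D = d_1 = 1
    addr, m, skip = source ^ 1, 2, False  # first invocation at the neighbor across dimension 1
    while not (addr == source and m == size + 1):
        if skip and m % 2 == 1:           # skip the next 1 in the embedding sequence
            skip, m = False, m + 1
        if m > size:                      # ran past the sequence: outside the handled cases
            return None
        across = addr ^ (1 << (d[m] - 1))
        if across not in faulty:
            D[addr] = d[m]
            addr, m = across, m + 1       # nskip = skip
        elif m % 2 == 1:                  # D = 0: re-invoke at this processor, use the spare link
            skip, m = True, m + 1
        elif not skip and (addr ^ 1) not in faulty:
            D[addr] = 1                   # backtrack across dimension 1, nm = m
            addr, skip = addr ^ 1, True
        else:
            return None                   # announce failure
    # Trace the ring out from the source using the final D values.
    out = [source]
    node = source ^ (1 << (D[source] - 1))
    while node != source:
        out.append(node)
        node ^= 1 << (D[node] - 1)
    return out
```

Calling `ring_1pft(5, {basis_ring(5)[5]})` replays the scenario of Fig. 3, in which $v_5$ is faulty and both $v_4$ and $v_5$ end up outside the ring.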

LEMMA 4. RING-1PFT will successfully construct a ring of size at least $2^n - 2f$ in an $n$-cube with $f$ faults provided that, for any pair of faults $v_i$ and $v_j$ in a basis ring where $i < j$, either $j - i \ge 3$, or $j = i + 1$ and $i$ is even.

Proof. Directly follows from Lemma 3. Lemma 3(iii) implies that a ring will be successfully constructed. To see that the ring constructed includes at least $2^n - 2f$ nodes, let us first define a window for a faulty node $v_i$ in the basis ring as $W(v_i) = (v_{i-1}, v_i, v_{i+1}, v_{i+2})$. Lemma 3 says that if $i$ is even, then $v_{i-1}$ and $v_{i+2}$ are included in the constructed ring provided that any other faulty node $v_j$ is such that $j = i + 1$ or $|i - j| \ge 3$. If $i$ is odd, then $v_{i+1}$ and $v_{i+2}$ are included in the constructed ring provided that any other faulty node $v_j$ is such that $j = i - 1$ or $|i - j| \ge 3$. So, at most two out of the four nodes in the window $W(v_i)$ can be excluded from the ring. Note that only those nodes inside windows have the possibility of being excluded from the constructed ring, and nodes that do not belong to any window are always included. Since there are $f$ faults, we have $f$ windows. So, at most $2f$ nodes can be excluded from the ring. ∎

LEMMA 5. RING-1PFT will announce failure if there exists a pair of faults $v_i$ and $v_j$, $i < j$, in the basis ring of an $n$-cube such that either $j = i + 2$ or $j = i + 1$ and $i$ is odd.

Proof. Let $v_i$ and $v_j$ denote the first pair of such faults. We divide the possible scenarios for $v_i$ and $v_j$ into three cases: (1) $j = i + 1$ and $i$ is odd, (2) $j = i + 2$ and $i$ is odd, and (3) $j = i + 2$ and $i$ is even. The trace of the algorithm for each of these cases is shown in Fig. 9. We can see that failure is announced in each case. ∎

[Figure: three trace tables, each ending with the algorithm announcing failure. (a) $j = i + 1$ and $i$ is odd. (b) $j = i + 2$ and $i$ is odd. (c) $j = i + 2$ and $i$ is even.]

FIG. 9. Failure cases.

THEOREM 1. RING-1PFT will successfully construct a ring of size at least $2^n - 2f$ in an $n$-cube with $f$ faults if and only if, for any pair of faults $v_i$ and $v_j$, $i < j$, in the basis ring, either $j - i \ge 3$, or $j = i + 1$ and $i$ is even.

Proof. Follows from Lemmas 1 through 5. ∎

Each fault scenario can be thought of as a $2^n$-bit string with a 0/1 in the $i$th bit representing the status, i.e., nonfaulty/faulty, of $v_i$. For example, the fault situation depicted in Fig. 6a can be expressed by the 32-bit string 00011000010000000010000110000110 while the situation depicted in Fig. 6b can be expressed by 00000001001000000100100001000000. An acceptable $2^n$-bit fault string is one which corresponds to a fault scenario for which our single-fault algorithm can construct a ring. Under the assumption that $v_1$, $v_2$, and $v_{2^n}$ are nonfaulty, acceptable fault strings should begin with 00 and end with 0. Theorem 1 suggests that any pair of 1's in an acceptable fault string either must have at least two 0's between them or must be next to each other with the first at an even position. The next theorem enumerates all multiple-fault scenarios that can be handled by our single-fault distributed algorithm.

THEOREM 2. Under the assumption that $v_1$, $v_2$, and $v_N$, where $N = 2^n$, are nonfaulty, RING-1PFT will succeed in constructing a ring of size at least $N - 2f$ in an $n$-cube with $f$ faults in

$$x(N, f) = \sum_{i=0}^{\lfloor f/2 \rfloor} \binom{N - 2f - 1}{f - 2i} \binom{N/2 - f + i - 1}{i}$$

instances out of the $\binom{N-3}{f}$ different possible scenarios.

Proof. Let $x(N, f)$ denote the number of $N$-bit strings containing exactly $f$ 1's with the following properties:

• begin with 00
• end with 0
• all pairs of 1's either have at least two 0's between them or are next to each other with the first 1 at an even position.

Let $p(N, f)$ denote the number of $N$-bit strings containing exactly $f$ 1's with the following properties:

• begin with 1
• end with 0
• all pairs of 1's either have at least two 0's between them or are next to each other with the first 1 at an even position.

Let $q(N, f)$ denote the number of $N$-bit strings containing exactly $f$ 1's with the following properties:

• begin with 1
• end with 0
• all pairs of 1's either have at least two 0's between them or are next to each other with the first 1 at an odd position.

The following equations express the relationships between $x$, $p$, and $q$:

$$x(N, f) = p(N-2, f) + q(N-3, f) + p(N-4, f) + q(N-5, f) + \cdots$$

$$p(N, f) = q(N-3, f-1) + p(N-4, f-1) + q(N-5, f-1) + p(N-6, f-1) + \cdots$$

$$q(N, f) = [\,q(N-4, f-2) + p(N-5, f-2) + \cdots\,] + [\,p(N-3, f-1) + q(N-4, f-1) + \cdots\,].$$

From these equations, we get the following recurrence for $x(N, f)$:

$$x(N, f) = x(N-2, f) + x(N-2, f-1) + x(N-4, f-1)$$

with

$$x(N, 0) = 1 \text{ for all } N \ge 2, \qquad x(0, f) = 0 \text{ for all } f \ge 0, \qquad x(2, f) = 0 \text{ for all } f \ge 1.$$

The solution to the recurrence is as stated in the theorem. ∎
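The count in Theorem 2 can be cross-checked numerically for small cubes. The sketch below enumerates fault positions directly against the condition of Theorem 1 and compares the result with the closed form. The function names are hypothetical, and the brute-force enumeration is only a sanity check for small $N$, not part of the paper's proof.

```python
from itertools import combinations
from math import comb


def acceptable(faults):
    """Theorem 1 condition on fault positions (1-based indices in the basis ring)."""
    for i, j in combinations(sorted(faults), 2):
        if not (j - i >= 3 or (j == i + 1 and i % 2 == 0)):
            return False
    return True


def x_brute(N, f):
    """Count acceptable scenarios by enumeration; v_1, v_2 and v_N are nonfaulty."""
    return sum(1 for fs in combinations(range(3, N), f) if acceptable(fs))


def binom(a, b):
    """Binomial coefficient that is 0 outside the usual range."""
    return comb(a, b) if a >= 0 and 0 <= b <= a else 0


def x_closed(N, f):
    """Closed form of Theorem 2."""
    return sum(
        binom(N - 2 * f - 1, f - 2 * i) * binom(N // 2 - f + i - 1, i)
        for i in range(f // 2 + 1)
    )


def success_rate(n, f):
    """Fraction of f-fault scenarios handled, out of C(N - 3, f)."""
    N = 2 ** n
    return x_closed(N, f) / comb(N - 3, f)
```

Under these definitions, `success_rate(9, 2)` and `success_rate(9, 4)` give the fractions behind the 9-cube figures quoted in the discussion of Theorem 2.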

From Theorem 2, we see that, for example, over 99% of the time we can build a ring of at least 508 nodes in a 9-cube with two faults and over 96% of the time we can build a ring of at least 504 nodes in a 9-cube with four faults. We have calculated the percentages for the cases of 7-cubes, 8-cubes, 9-cubes and 10-cubes. Figure 10 summarizes the results. As expected, these percentages are competitive with the simulation results given by Provost and Melhem [11] for their single-fault algorithm.

         7-cube               8-cube               9-cube               10-cube
   f     min ring  success    min ring  success    min ring  success    min ring
         size      rate       size      rate       size      rate       size
   1     126       100.0%     254       100.0%     510       100.0%     1022
   2     124       97.6%      252       98.8%      508       99.4%      1020
   3     122       93.0%      250       96.5%      506       98.2%      1018
   4     120       86.3%      248       93.1%      504       96.5%      1016
   5     118       78.1%      246       88.7%      502       94.2%      1014
   6     116       68.8%      244       83.4%      500       91.5%      1012
   7     114       59.0%      242       77.5%      498       88.2%      1010
   8     112       49.1%      240       71.1%      496       84.6%      1008
   9     110       39.8%      238       64.4%      494       80.6%      1006
   10    108       31.3%      236       57.5%      492       76.3%      1004
   11    106       23.9%      234       50.7%      490       71.9%      1002
   12    104       17.7%      232       44.2%      488       67.2%      1000
   13    102       12.7%      230       37.9%      486       62.5%      998
   14    100       8.8%       228       32.1%      484       57.7%      996
   15    98        5.9%       226       26.8%      482       52.9%      994
   16    96        3.9%       224       22.1%      480       48.3%      992
   17    94        2.4%       222       17.9%      478       43.7%      990
   18    92        1.5%       220       14.3%      476       39.4%      988
   19    90        0.9%       218       11.3%      474       35.2%      986
   20    88        0.5%       216       8.0%       472       31.3%      984

FIG. 10. Success rate of Procedure RING-1PFT. (The success rates for the 10-cube are not legible in this copy.)

We can also use the idea of using spare links to get around faults introduced after a ring has already been constructed. For example, consider the ring shown in Fig. 2. Suppose, after the ring has been built, that $v_{22}$ becomes faulty. Processor $v_{21}$ will detect this and use the spare link across dimension 2 between $v_{21}$ and $v_{24}$, modifying the ring to exclude $v_{22}$ and $v_{23}$. This modification is confined to affect only $v_{21}$ and $v_{24}$; the rest of the ring is unaware of the change. So, we can always start with a nonfaulty ring, and when a fault occurs, there is a way to change the ring to avoid the fault without major disruption. If a second fault occurs, it is highly probable that it too can be avoided in a similar fashion, without major disruption to the existing ring. However, there is no absolute guarantee that any fault beyond the first will be tolerated. For example, for a 9-cube with four faults, this strategy will fail 4% of the time. However, in this case, a ring of at least 504 nodes can be built (unfortunately, from scratch) using our multiple-fault algorithms. Multiple-processor fault-tolerant algorithms are described in the next section. We emphasize that reconfiguration strategies proposed so far (e.g., [5, 11]) do not guarantee tolerance of more than one or two faults.

3. MORE ON MULTIPLE-PROCESSOR FAULT-TOLERANCE

Although our single-fault algorithm is capable of handling many multiple-fault cases, it would be more satisfying to guarantee the tolerance of multiple faults. We begin here by devising a double-processor fault-tolerant distributed ring embedding algorithm which guarantees the construction of a ring of size at least $2^n - 4$ when faced with two processor faults, using our single-fault algorithm as a stepping stone. The approach is simply to run several copies of the single-fault algorithm at the same time, each copy attempting to build a different ring based on a distinct binary-reflected Gray code embedding sequence: at least one copy is guaranteed to succeed, as we see. As it turns out, three copies are sufficient in the case of two faults.

First, we introduce the notion of a dimension mapping $DM = (dm_1, dm_2, \ldots, dm_n)$, essentially a permutation of $(1, 2, 3, \ldots, n)$. The binary-reflected Gray code embedding sequence for the basis ring used in the distributed algorithm RING-1PFT will be generated using the following procedure, which is a slightly modified version of the procedure RING(n):

Procedure RING-D(n, DM)
  S ← (dm_1);
  FOR i = 2 to n DO
    S ← S | (dm_i) | S;
  return S;
end RING-D

Each embedding sequence is produced by calling RING-D(n, DM) and concatenating the returned result with $(dm_n)$, the closing dimension of the ring. For example, the embedding sequence for a 32-node ring in a 5-cube resulting from RING-D(5, (4, 3, 2, 5, 1)) | (1) is S = (4, 3, 4, 2, 4, 3, 4, 5, 4, 3, 4, 2, 4, 3, 4, 1, 4, 3, 4, 2, 4, 3, 4, 5, 4, 3, 4, 2, 4, 3, 4, 1). After replacing RING(n) | (n) by RING-D(n, DM) | $(dm_n)$ in the second line of the distributed algorithm RING-1PFT and introducing DM as an extra parameter, we are ready to build the double-processor fault-tolerant ring in the $n$-cube. To begin ring construction, the source processor sends three messages, with dimension mappings $DM_1 = (1, 2, 3, 4, \ldots, n)$, $DM_2 = (2, 1, 3, 4, \ldots, n)$, and $DM_3 = (3, 2, 1, 4, \ldots, n)$, to invoke RING-1PFT(n, 2, FALSE, source, $DM_1$), RING-1PFT(n, 2, FALSE, source, $DM_2$), and RING-1PFT(n, 2, FALSE, source, $DM_3$) at its neighbors across dimensions 1, 2 and 3, respectively. Note that $DM_2$ is just $DM_1$ with 1 and 2 swapped while $DM_3$ is just $DM_1$ with 1 and 3 swapped.

LEMMA 6. RING-1PFT will successfully construct at least one ring of size at least $2^n - 4$ in an $n$-cube with two faulty processors while attempting to build three distinct rings based on $DM_1$, $DM_2$, and $DM_3$.

Proof. Let $F_1$ and $F_2$ denote the $n$-bit labels for the two faulty processors in the $n$-cube. Note that, keeping in mind Theorem 1,

(i) if $F_1$ and $F_2$ agree on their dimension 1 bit, the algorithm will succeed in constructing a ring based on $DM_1$;

(ii) if $F_1$ and $F_2$ agree on their dimension 2 bit, the algorithm will succeed in constructing a ring based on $DM_2$;

(iii) if $F_1$ and $F_2$ agree on their dimension 3 bit, the algorithm will succeed in constructing a ring based on $DM_3$;

(iv) otherwise, $F_1$ and $F_2$ will be distanced by at least three links in the hypercube and all three rings will be successfully constructed. ∎

This approach can be generalized to guarantee a higher degree of fault tolerance. To tolerate $f \le \lfloor (n+1)/2 \rfloor$ processor faults, we define dimension mappings $DM_1, DM_2, \ldots, DM_{2f-1}$, where $DM_1 = (1, 2, 3, 4, \ldots, n)$ and, for $i = 2, \ldots, 2f - 1$, $DM_i$ is $DM_1$ with 1 and $i$ swapped. To begin ring construction, the source processor sends $2f - 1$ messages, invoking RING-1PFT(n, 2, FALSE, source, $DM_i$) at its neighbor across dimension $i$.

LEMMA 7. For $i = 1, 2, \ldots, 2f - 1$, let $U_i$ and $W_i$ be two $n$-bit strings which disagree on bit $i$ and agree on at least $n - 2$ bits. The list $(U_1, U_2, \ldots, U_{2f-1}, W_1, W_2, \ldots, W_{2f-1})$ must contain at least $f + 1$ distinct $n$-bit strings.

Proof. By contradiction. Left as an exercise for the reader. ∎
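The dimension-mapping construction used in this section is easy to prototype. The sketch below implements RING-D, the closing step, and the list of $2f - 1$ mappings obtained by swapping 1 with $i$. The helper names and the bit convention (dimension $d$ is bit $d-1$, source label 0) are assumptions made for this illustration, and no fault handling is included here.

```python
def ring_d(n, dm):
    """Procedure RING-D(n, DM): Gray code sequence under dimension mapping dm."""
    s = [dm[0]]
    for i in range(1, n):
        s = s + [dm[i]] + s
    return s


def embedding_sequence(n, dm):
    """RING-D(n, DM) followed by the closing dimension dm_n."""
    return ring_d(n, dm) + [dm[n - 1]]


def dimension_mappings(n, f):
    """DM_1, ..., DM_{2f-1}: DM_i is DM_1 with the entries 1 and i swapped."""
    if 2 * f - 1 > n:
        raise ValueError("requires 2f - 1 <= n")
    base = list(range(1, n + 1))
    maps = [tuple(base)]
    for i in range(2, 2 * f):
        dm = base[:]
        dm[0], dm[i - 1] = dm[i - 1], dm[0]
        maps.append(tuple(dm))
    return maps


def ring_nodes(seq, source=0):
    """Nodes visited when the sequence is followed from the source."""
    nodes = [source]
    for d in seq[:-1]:
        nodes.append(nodes[-1] ^ (1 << (d - 1)))
    return nodes
```

With these helpers, `embedding_sequence(5, (4, 3, 2, 5, 1))` reproduces the example sequence given for RING-D(5, (4, 3, 2, 5, 1)) | (1), and `dimension_mappings(n, 2)` gives the three mappings used for two faults. The first entry of each mapping is the dimension across which the source starts that copy.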

THEOREM 3. RING- 1PFT will successfully construct at least one ring of size at least 2” - 2f in an n-cube with f < L( n+ 1)/2 J faculty processors while attempting to build 2 f - 1 distinct rings based on dimension mappings DM, , DM,, . . . , DMzJ-, .

Proof: Suppose to the contrary that the algorithm fails to construct a ring. In particular, suppose that, for each $i = 1, 2, \ldots, 2f-1$, two faulty processors with n-bit labels $U_i$ and $W_i$ cause RING-1PFT to fail in constructing a ring based on $DM_i$. Then, from Lemma 5, $U_i$ and $W_i$ must disagree on bit $i$ and must agree on at least $n-2$ bits. Since the list $(U_1, U_2, \ldots, U_{2f-1}, W_1, W_2, \ldots, W_{2f-1})$ must contain at least $f+1$ distinct n-bit strings according to Lemma 7, there must be at least $f+1$ faulty processors, in contradiction to our assumption of $f$ faults. □

4. CONCLUDING REMARKS

In summary, we have described simple distributed algorithms for embedding a ring of size $2^n - 2f$ in an n-cube with $f \le \lfloor (n+1)/2 \rfloor$ processor faults. The algorithms also work for many cases where $f$ exceeds $\lfloor (n+1)/2 \rfloor$.

In closing, we briefly mention link faults. With link faults, there is still hope for building a $2^n$-node ring which excludes faulty links. One idea is to exploit the presence of link-disjoint Hamiltonian circuits in the n-cube. The reader can easily verify that the embedding sequences $X|(4)$ started at some source $P$ and $Y|(4)$ started at $P$'s neighbor across dimension 1, where $X = (1, 2, 1, 3, 1, 2, 1, 4, 1, 3, 2, 3, 1, 3, 2)$ and $Y = (3, 4, 2, 4, 3, 4, 1, 4, 2, 3, 2, 4, 2, 3, 1)$, form two link-disjoint 16-node rings in a 4-cube. These two rings are useful for constructing two link-disjoint $2^n$-node rings for any n-cube where $n > 4$: consider $X|(5)|X^R|(6)|X|(5)|X^R|(7)|X|(5)|X^R|(6)|X|(5)|X^R| \cdots |(n)$ started at some source $P$ and consider $Y|(5)|Y^R|(6)|Y|(5)|Y^R|(7)|Y|(5)|Y^R|(6)|Y|(5)|Y^R| \cdots |(n)$ started at $P$'s neighbor across dimension 1, where $X^R$ and $Y^R$ are the reverses of sequences $X$ and $Y$, respectively. The existence of two link-disjoint rings immediately implies algorithms that achieve one-link fault tolerance. We refrain from commenting more on this, since, in fact, there are already claims for the existence of $\lfloor n/2 \rfloor$ link-disjoint Hamiltonian circuits in an n-cube, implying up to $(\lfloor n/2 \rfloor - 1)$-link fault tolerance. This paper has demonstrated, in contrast, up to $\lfloor (n+1)/2 \rfloor$-processor fault tolerance.

The ring results also have impact on the fault-tolerant embeddings of meshes and toruses in hypercubes. For example, in the case of processor faults, Fig. 11 illustrates how a mesh may be constructed on the basis of ring ideas.

FIG. 11. Fault-tolerant mesh. [Diagram not recoverable from the extracted text; its only legible annotation reads: A "row" of the mesh can be viewed as:]
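The claim about the sequences X and Y in the concluding remarks can be checked mechanically. The Python sketch below walks both sequences in the n-cube and tests that each walk is a Hamiltonian ring and that the two rings share no link. The helper names are hypothetical, labels are encoded as integers with dimension d mapped to bit d-1, and the extension beyond n = 4 assumes that the elided pattern is the reflected construction S_k = S_{k-1} | (k) | reverse(S_{k-1}), closed by a final link across dimension n, which reproduces the prefix printed in the text.

```python
# Dimension sequences quoted from the text (dimensions are 1-based).
X = (1, 2, 1, 3, 1, 2, 1, 4, 1, 3, 2, 3, 1, 3, 2)
Y = (3, 4, 2, 4, 3, 4, 1, 4, 2, 3, 2, 4, 2, 3, 1)


def extend(seq, n):
    """Closed embedding sequence for an n-cube (n >= 4), assuming the
    reflected pattern S_k = S_{k-1} | (k) | reverse(S_{k-1})."""
    path = list(seq)
    for k in range(5, n + 1):
        path = path + [k] + path[::-1]
    return path + [n]          # final link closes the ring


def walk(start, seq):
    """Follow seq from start; return the visited nodes and the links used."""
    nodes, links, cur = [start], set(), start
    for d in seq:
        nxt = cur ^ (1 << (d - 1))      # cross dimension d
        links.add(frozenset((cur, nxt)))
        nodes.append(nxt)
        cur = nxt
    return nodes, links


def check_disjoint_rings(n, source=0):
    """True if the X- and Y-based walks are link-disjoint Hamiltonian rings."""
    xs, x_links = walk(source, extend(X, n))
    ys, y_links = walk(source ^ 1, extend(Y, n))   # neighbor across dimension 1
    size = 1 << n
    hamiltonian = all(
        nodes[0] == nodes[-1]
        and len(nodes) == size + 1
        and len(set(nodes)) == size
        for nodes in (xs, ys)
    )
    return hamiltonian and not (x_links & y_links)
```

Calling `check_disjoint_rings(4)` confirms the 16-node case stated in the text; under the assumed reflected pattern the same check also passes for n = 5 and n = 6.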

REFERENCES

1. Banerjee, P. Reconfiguring a hypercube multiprocessor in the presence of faults. Proc. Conference on Hypercubes, Concurrent Computers and Applications, 1989.
2. Bhatt, S., Chung, F., Leighton, T., and Rosenberg, A. Optimal simulations of tree machines. Proc. IEEE Foundations of Computer Science, 1986.
3. Chan, M. Y. Dilation-2 embeddings of grids into hypercubes. Proc. International Conference on Parallel Processing, 1988.
4. Chan, M. Y. Embedding of d-dimensional grids into optimal hypercubes. Proc. ACM Symposium on Parallel Algorithms and Architectures, 1989.
5. Chen, S-K., Liang, C-T., and Tsai, W-T. Loops and multidimensional grids on hypercubes: Mapping and reconfiguration algorithms. Proc. 1988 International Conference on Parallel Processing, 1988, pp. 315-322.
6. Chen, M-S., and Shin, K. G. Message routing in an injured hypercube. Proc. Conference on Hypercubes, Concurrent Computers and Applications, 1988.
7. Chen, M-S., and Shin, K. G. On hypercube fault-tolerant routing using global information. Proc. Conference on Hypercubes, Concurrent Computers and Applications, 1989.
8. Dekel, E., Nassimi, D., and Sahni, S. Parallel matrix and graph algorithms. SIAM J. Comput. 10, 4 (Nov. 1981), 657-675.
9. Hastad, J., Leighton, T., and Newman, M. Reconfiguring a hypercube in the presence of faults. Proc. ACM Symposium on Theory of Computing, 1987.
10. Hastad, J., Leighton, T., and Newman, M. Fast computation using faulty hypercubes. Proc. ACM Symposium on Theory of Computing, 1989.
11. Provost, F. J., and Melhem, R. Distributed fault tolerant embedding of binary trees and rings in hypercubes. Proc. International Workshop on Defect and Fault Tolerance in VLSI Systems, 1988.
12. Saad, Y., and Schultz, M. H. Topological properties of hypercubes. Res. Rep. YALEU/DCS/RR-389, Yale University, June 1985.

Received June 13, 1989; accepted February 20, 1990

MEE YEE CHAN received her Ph.D. degree in computer science from the University of Hong Kong in 1988 and her M.S. and B.A. degrees in computer science from the University of California, San Diego, in 1981 and 1980, respectively. She has been an assistant professor in the Computer Science Program at the University of Texas at Dallas since September 1987.

SHIANG-JEN LEE received her Ph.D. and M.S. degrees in computer science from the University of Texas at Dallas in 1990 and 1986 and her B.A. degree in economics from the National Taiwan University, Taiwan, in 1981.