A distributed algorithm for embedding trees in hypercubes with modifications for run-time fault tolerance

A distributed algorithm for embedding trees in hypercubes with modifications for run-time fault tolerance

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 14, 85-89 (1992) A Distributed Algorithm for Embedding Trees in Hypercubes with Modifications for ...

588KB Sizes 0 Downloads 46 Views

JOURNAL

OF PARALLEL

AND

DISTRIBUTED

COMPUTING

14, 85-89 (1992)

A Distributed Algorithm for Embedding Trees in Hypercubes with Modifications for Run-Time Fault Tolerance* FOSTER J. PROVOSTAND RAMI MELHEM Department

of Computer Science, University of Pittsburgh,

1. INTRODUCTION Several researchers have studied the embedding of balanced binary trees in hypercubes. In doing so it is important to discover embeddings which preserve the adjacency of nodes so that communication can take place efficiently. It has been proven that an n-tree cannot be embedded in an n-cube with adjacency preserved [6]. Different methods have been suggested to bypass this difficulty. Bhatt and Ipsen [I] and Deshpande and Jenevein [2] relax the adjacency condition, while Johnsson [4] and Wu [6] embed an (n - 1)-tree in an n-cube preserving adjacency. Wu gives a recursively defined, bottom-up algorithm for determining a proper embedding; Johnsson, in contrast, gives a top-down algorithm. Both algorithms rely on a “bird’s eye” picture of the entire embedding and hence the embedding has to be determined by a host and tree nodes have to be statically assigned to processors. The question of fault tolerance arises naturally at this point. If a tree node processor or communication link is faulty, the entire embedding must be recalculated and tree nodes reassigned to processors. Even worse, if the fault occurs during processing, the entire structure must ONR

Contract

15260

be remapped before the processing can continue. In a time critical application, this would not be feasible. Hastad et al. [3] consider the general case of embedding one hypercube within another in the presence of faults. However, their methods deal with finding a successful embedding and not with run-time reconfiguration. In addition, when they consider total faults two logically adjacent nodes in the embedded cube may be separated physically by paths of length 7 (which allows them a high probability of tolerating a large number of faults). In Section 2 we present the distributed embedding algorithm. In order to overcome the problem of reassigning the entire embedding in the presence of faults, we introduce, in Section 3, a tree restructuring technique which may tolerate node or edge faults. The technique is extremely local and entirely distributed; in all but a single case, only the faulty node is remapped and only its parent and children are aware of the change. Any single fault can be tolerated as well as many multiple fault configurations. The tree is initially embedded with adjacency preserved; an embedding which bypasses a faulty processor (or a set of faulty processors) may contain one-hop communication between the remapped tree node and its logical neighbors, where one-hop denotes the passage of a message through one node en route to another node (distance 2 communication). In Section 4, we enhance the restructuring algorithm so that it tolerates any configuration of two faults. All algorithms run in O(log n) time, where II is the size of the cube-the time it takes to walk down the paths from the root of the tree to the leaves in parallel. In order to demonstrate that the algorithms are more robust than their worst case bounds indicate, in Section 5, we give some simulation results which show the robustness of the fault tolerant algorithms when used for run-time reconfiguration. Because of space constraints the proofs of correctness of the algorithms have been omitted; for a more comprehensive coverage of the topic we refer the reader to [5].

In this paper we present a distributed algorithm for embedding binary trees in hypercubes. Starting with the root (invoked in some cube node by a host), each node is responsible for determining the addressesof its children, and for invoking the embedding algorithm for the subtree rooted at each child in the proper cube node. This distributed embedding, along with the wealth of communication links in the hypercube, leads to a high potential for fault tolerance. We demonstrate the fault tolerance capability by introducing restructuring techniques which may be used to tolerate faults during the initial embedding, but is especially useful for remapping nodes that fail at run-time. The distributed nature of the embeddings eliminates the need for global knowledge of faulty nodes; each node must only know the status of its neighbors. In addition, only the neighbors of a faulty node need be aware of any change. 0 195’2Academic Press, Inc.

* This research was, in part, supported under 80-C-04555.

Pittsburgh, Pennsylvania

2. A TOP-DOWN, DISTRIBUTED TREE EMBEDDING ALGORITHM

We have developed a top-down, distributed algorithm for embedding a complete (n - I)-tree in an n-cube, pre-

N00014-

85 0743-7315/92 $3.00 Copyright 0 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.

86

PROVOST

AND

serving adjacency. In this algorithm, each node receives a small packet of information from its parent, determines which nodes will be its children, and sends them the information they need to continue the configuration. The information needed to determine the addresses of a node’s children consists of: (i) the height of the subtree rooted at the node; and (ii) an “interchange word” which specifies some axes along which the subcube must be reflected in order to yield a proper embedding. The addresses of the node’s children are calculated based on the bit positions which differ between the node’s address and its child’s. Since adjacent nodes in a hypercube differ in only a single bit position, a single integer will suffice to differentiate between the neighbors of a node. In the subsequent discussion, neighbors of a given node will be referred to by the bit position differentiating them from this node. The interchange word, which in our algorithm is passed from a parent to its children, consists of a string of bits. Numbering the bits with zero corresponding to the rightmost (least significant) bit, a “1” in the jth position specifies an interchange of the form (j/j + l), where an interchange of the form (ilk) indicates that subsequent children that are to be assigned to neighboring nodes across the ith dimension (if any) must be reassigned to the neighboring nodes across the kth dimension and vice versa. More specifically, a node receives the height h of the subtree rooted there, and initially takes its left child to be its neighbor in the h dimension and its right child the neighbor in the (h - 2) dimension. These bit positions are then modified by applying the interchanges specified in the node’s interchange word. The word is read right to left, applying a given interchange to the bit position addresses of both children if the corresponding bit in the interchange word is set. To the right child, whose dimension has just been determined, the node sends height (h - 1) and its interchange word unmodified. To the left child the node sends height (h - 1) and its interchange word, modified by setting bit (h - 2). The embedding can be rooted at any node in the cube by invoking procedure TREE at that node with the height h = n - 1, where n is the size of the cube, and the interchange word rot = 0. When invoked at a particular node, the procedure TREE computes 1 and r, the dimensions across which its child nodes should be located, and then invokes TREE at these child nodes. The procedure TREE is described formally in Fig. 1. 3. MODIFICATIONS

FOR FAULT

TOLERANCE

Our fault model takes into account both faulty processors and faulty communications links. A node is considered faulty if: (i) the node itself does not function properly (i.e., the computation and/or the communication

MELHEM procedure TREE(h ,ro~) begin if h =l then stop /* this is a leaf node */ else 1” h>l “1 r :=h ; l :=h -2;

for i := 0 to (length of rot) loop if (bit i of rol=l) then if (r=i) then r:=i+l; else if (r=i+l) then r:=i; if (/=i) then I:=i+l; else if (I=i+l) then /:=i; rol,=rot; rog=ror+2h-2 invoke TREE(h-l,rof,) at the node across dimension r /* right child */ invoke TREE(h-l,rof,) at the node across dimension I /* left child *I end TREE FIG.

1.

Distributed

Tree

Embedding

Algorithm

part fails); (ii) a communications link used in the embedded tree structure for communications with the node’s parent does not function properly. We assume that a node can detect faults in its immediate neighbors. Our fault tolerance scheme is based on the fact that in embedding an (n - I)-tree into an n-cube (using our scheme), the n-cube can be divided into two (n - l)cubes, one of which contains three-fourths of the nodes; the other contains only one-fourth. This can be seen by studying the embedding algorithm. Specifically, bit zero of the exchange word is never used; it is set by the height 2 nodes, but the height 1 nodes (the leaves) do not use it. This implies that any embeddings across dimension zero come from the initial assignment of 1 or r. Moreover note that in TREE, only 1 can ever be zero, and it will be zero for every height 2 node (and only for these nodes). Thus, dimension zero separates the n-cube into two subcubes, one of which contains one-half of the leaves of the tree (about one-fourth of the total tree nodes), the other contains the rest of the tree (about three-fourths of the total tree nodes). This separation of the tree nodes yields a structure which is amenable to a very local and distributed fault tolerance scheme. If we look at a mirror image of the tree structure across dimension zero (those nodes whose addresses differ from those of the tree only in dimension zero), we can note that the only nodes which do not have a free image are the height 2 nodes and their left children. Hence the image subcube may be used to remap faulty processors for any node at a height higher than 2. For the leaf nodes and their parents, a different strategy must be employed. In other words, the scheme is divided into two parts. Qne applies to the leaves and their parents; the other applies to the rest of the nodes. For nodes of height greater than two, if a child is detected to be faulty, the node will route all information (for that child) to the child’s image across dimension zero. This will incur a one-hop delay. Once in the image subcube, the embedding algorithm of Section 2 will be used to determine each node’s children. After the initial node remapping, a node in the image subcube will always attempt to bounce buck (i.e., return processing to the origi-

A DISTRIBUTED

0 0 69

FAULT

TOLERANT

TREE EMBEDDING

87

ALGORITHM

non-tree nodes tree nodes faulty nodes unused image1 fault-free tree dimension zero 64

(b)

FIG. 2. (a) A fault-free tree and its image (b) local remapping for a single fault.

nal subcube) before taking over processing duties. Thus, for a single fault, the faulty processor’s image will take over the processing duties, determine its children (the images of the faulty node’s children), and route information to them. These image children will bounce back to the original children-meaning that they will act as onehop communications elements between the remapped node and its children. The net effect is a distributed shift of the node into the image subtree with the rest of the tree structure remaining unchanged (see Fig. 2). If one of the original children is also faulty, its image will not be able to bounce back and hence will automatically take over the processing; the logical tree structure in each branch will return to the original subcube as soon as a nonfaulty node is available. In this way, sections or branches of the original tree can be faulty, as long as their images are not (see Fig. 3). For nodes on the bottom two levels (leaves and their parents) a special remapping must be employed. Consider the two-dimensional subcube containing the parent and its two children; if one of these four nodes is faulty, we will use the other three-never incurring more than a one-hop delay in communications between logically adjacent nodes. The height 2 node will bypass a faulty node by routing the information for one leaf through the other. Note that for each height 2 node the left child is originally connected across dimension zero. At run-time, remapping a faulty height 2 node may entail the remapping of its left child, within this two-dimensional subcube. This is the only case where more than just the faulty node must be remapped.

0

non-tree nodes

. tree nodes m faulty nodes

FIG. 3.

Remapping of a faulty section of the tree.

Any single fault is guaranteed to be tolerated. This guarantee, by itself, downplays the strength of the algorithm. As will be seen by the simulation results of Section 5, the extreme “localness” of the restructuring along with the relatively empty image subcube allows the system to tolerate many faults simultaneously, including groups and strings of faults in the original tree. It can be seen that communications between any two logically adjacent nodes will incur at most a one-hop delay. This is the cost of the fault tolerant scheme. Note also that the remapping of an interior (h > 2) node will never interfere with the remapping of another node, and that no remapping of a node in one parent/leaf triad will interfere with the remapping of any node outside that triad. It should be clear that the run-time complexity of the embedding algorithm is only increased by a constant factor due to the modifications for fault tolerance. 4. AN ENHANCED

FAULT TOLERANT ALGORITHM

TREE

Although the modified algorithm may tolerate multiple faults, certain double fault configurations cannot be tolerated. When a node is faulty, its image across dimension zero should not be faulty-neither should the images of its parent or its children. In this section, we modify our algorithm so that any configuration of two faults will be tolerated. For clarity, we assume distance 2 fault detection (it is not necessary). The modifications stem from the fact that if a node can communicate with another node across dimension x and then y (one-hop), it can also communicate across dimension y and then x (also one-hop). Thus if we encounter a second fault when attempting to remap a node, we can initiate the embedding at a neighbor and one-hop to the children bypassing the faulty node. We also use a limited form of backtracking if the embedding gets stuck due to faults. If a node discovers that it cannot embed its children, it declares itself faulty and lets its parent worry about reconfiguration. In the worst case, the algorithm will backtrack all the way up a path to the root and the embedding will fail; thus, the run-time complexity is still O(log n). Let us consider a tree node N’, its parent P’, and its

PROVOST AND MELHEM

88

For PP, the node to which a child is mapped, P, and its image, p (or a section of a subtree rooted at P), are faulty. We have shown that such a case may be handled by the reconfiguration algorithm. The boundary cases of the root and the leaves take special treatment; see [5]. 5. PERFORMANCE

EVALUATION REMARKS

AND CONCLUDING

FIG. 4. A node and its image are both faulty.

In order to test the robustness of our fault tolerant algorithms, we performed simulation studies with various size hypercubes and varying numbers of faults. Attempts children L’ and R’ which are initially mapped (in a faultwere made to embed a tree of a given height for different free embedding) to cube nodes N, P, L, and R, respecdistributions of a specific number of faults, and results tively; in addition, let the images across dimension zero were collected as to the number of successful attempts. of these nodes be denoted by @, p, z, and R. In our Run-time fault tolerance is simulated by assuming that modified algorithm, if N is faulty N’ is remapped to F, L’ the node into which the root of the tree is embedded is and R’ are then remapped to z and R (see Fig. 4). Thus, fixed; if the embedding succeeds with the given distributhere are two one-hop paths from N’ to its children, and a tion of faults, then that fault distribution could be tolersecond fault, in N, can be bypassed. For example, if N is ated at run-time. For the first fault tolerant algorithms the adjacent to P, L, and R across dimensions d, 1, and r simulation results showed, as expected, that embeddings (respectively) and E is faulty, there is still a fault-free with zero or one fault were always successful (100% of path from p to, say, z across dimensions 1 and then d. We the attempts succeeded). For the enhanced algorithm emrefer to this method of avoiding faults in the image tree as beddings with up to two faults were always successful. bypass embedding. In our example if, say, z is faulty, L’ The performance of the embedding algorithms degraded can be mapped to node X, the node across dimension 1 nicely as the number of faults increased. Figure 6 shows from p, and the children of L’ can be remapped using a typical results. It is clear that the additional complexity in similar bypass embedding. The embedding will bounce the enhanced fault tolerant algorithms provides more roback as soon as it can do so (to a nonfaulty node) and stay bust fault tolerant embedding algorithms. within the one-hop communication limit. Thus, if a secThe above results show the average number of faults ond faulty node (N being the first) is L or R it will not that can be tolerated given an initial nonfaulty starting affect the embedding. Likewise, if a second faulty node is node. This corresponds to the case of run-time fault tolerone to which one of L’s (or R’s) children is mapped (in a ance. At run-time, the basic structure of the tree embedfault-free embedding) the embedding will not bounce ding has already been established and, as faults occur, back to that node but continue in the image tree. In gen- this basic structure should not change. The problem of eral, if a large subtree and its corresponding image are fault tolerance at the initial embedding of the tree in the faulty, an embedding can still be realized with only one- hypercube is slightly different. If the algorithm fails to hop communication; an example is given in Fig. 5. Fi- embed the structure starting at the given node, there is no nally, if both N and p (or N and the node to which the other child of P’ is mapped) are faulty, P declares itself faulty and the node to which the parent of P’ is mapped, % of successful embeddings call it PP, takes the responsibility for remapping P’; this 0 EMBED Ff is what we mean by “a limited form of backtracking.” A P A EMBEn-m2

0 .

non-tree nodes tree nodes

q faulty nodes

10

0

10 FIG.

5.

A section

of the tree and its image are faulty.

20

30

FIG. 6. Run-time reconfiguration a 512 node hypercube.

0

0

40

50’

# faulty nodes

performance for a tree embedded in

A DISTRIBUTED

FAULT

TOLERANT

reason why it might not succeed from another starting node. Also, if the algorithm fails to embed the structure for some set of faults, it is possible that permuting the assignment of logical to physical ‘dimensions may produce a successful embedding. We tried to make as few assumptions as possible regarding the hardware model. Certain hardware models would simplify the fault tolerant algorithms and enable modifications that would make them far more robust. For instance, if the communications part of a node remains operable when the computation part fails (e.g., there is a separate routing part), this node would not have to be bypassed for communications purposes and more fault configurations could be tolerated; the benefits from such a “partial failure” assumption can be seen in [3] where it is shown that, if the communications portion of the processor remains operable, there is a high probability that a successful dilation 3 embedding of an (n - 1)-cube in an n-cube can be found in an environment where faults are relatively common. If such a partial fault model is assumed and run-time reconfiguration is not necessary, this embedding approach can be composed with that of [l] (mentioned in the Introduction) to give a dilation 4 tree embedding with the same resilient character. Our method, however, is concerned with local run-time reconfiguration in an environment where total faults are possible. In addition, we insisted that processing efficiency be degraded by using other than nearest neighbor communications only if one of the active processors fails. An expansion of 2 is necessary to enable a nearest neighbor tree embedding, but the embedding algorithm scatters the spares throughout the cube enabling reconfiguration techniques where only local knowledge of faults is needed and only the logical neighbors of a faulty node are concerned with the reconfiguration. Embedding-time fault tolerance is beyond the scope of this paper; however, we have conducted some simulation experiments to determine how well the modified embedding algorithms would perform given a little more freedom. In particular, Received November 5, 1988; revised December 21, 1989; accepted December 11, 1990

TREE EMBEDDING

ALGORITHM

89

upon failure we enabled the algorithms to try different roots until they had either succeeded or had exhausted the cube’s nodes. In the case of a height 8 tree embedded in a 9-cube, even with 90 faults (18% of the cube) the average probability of finding a successful embedding (over 100 random fault distributions) was 0.96. REFERENCES 1. Bhatt, S., and Ipsen, I. How to embed trees in hypercubes. Res. Rep. YALEU/DCS/RR-443, Dec. 1985. 2. Deshpande, S., and Jenevein, R. Scalability of a binary tree on a hypercube. Proc. ICPP, 1986, pp. 661-668. 3. Hastad, J., Leighton, T., and Newman, M. Reconfiguring a hypercube in the presence of faults. Proc. 19th Annual ACM Symposium on Theory of Computing, 1987, pp. 274-279. 4. Johnsson, S. L. Communication efficient basic linear algebra computations on hypercube architectures. J. Parallel D&rib. Comput. 4 (1987), 133-172. 5. Provost, F. J., and Melhem, R. Fault tolerant embedding of binary trees in hypercubes. Tech. Rep. 88-3, Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, 1988. 6. Wu, A. Embedding of tree networks into hypercubes. J. Parallel Distrib. Comput. 2 (1985), 238-249. FOSTER JOHN PROVOST received his B.S. in physics and mathematics from Duquesne University in 1986. He received the university’s Certificate of General Excellence, along with the Certificates of Excellence in Physics and in Mathematics. In 1988 he received his M.S. in computer science from the University of Pittsburgh, where he is currently pursuing his Ph.D. in the area of machine learning. In 1989, Foster was the recipient of an Andrew Mellon Predoctoral Fellowship and is currently working under an IBM Graduate Fellowship. RAMI G. MELHEM is an associate professor of computer science at the University of Pittsburgh. He received a B.E. in Electrical Engineering from Cairo University, Egypt, in 1976, an M.S. in mathematics/ computer science from the University of Pittsburgh in 1981, and a Ph.D. in computer science from the University of Pittsburgh in December 1983. He was an assistant professor of computer science at Purdue University from 1984 to 1986 and at the University of Pittsburgh from 1986 to 1989. His research interests include fault tolerant computing, optical computing, and the application of large computational arrays to scientific problems.