Journal of Parallel and Distributed Computing 62, 696–714 (2002) doi:10.1006/jpdc.2001.1811, available online at http://www.idealibrary.com on
Self-Stabilizing Deterministic Network Decomposition 1 Fatima Belkouch Heudiasyc, Université de Technologie de Compiègne, France
Marc Bui Laboratoire de Recherche LRIA, Université Paris 8, France
Liming Chen ICTT, Département Maths/Info., Ecole Centrale de Lyon, France
and Ajoy K. Datta 2 , 3 Department of Computer Science, University of Nevada, Las Vegas, Nevada Received January 14, 2001; revised September 25, 2001; accepted October 3, 2001
We present a simple and efficient self-stabilizing protocol for the network partitioning problem. Given a graph with k² nodes, our decomposition scheme partitions the network into connected and disjoint partitions, with k nodes per partition. The proposed algorithm starts with a spanning tree of the graph, but uses some links which do not belong to the tree, if necessary. The protocol is self-stabilizing, meaning that starting from an arbitrary state, it is guaranteed to reach a state where the network is correctly partitioned. The protocol stabilizes in 3(h+1) rounds, where h is the height of the tree. We also propose solutions to the case where the network size is n ≠ k². Hence our protocol works for dynamic systems in the sense that the protocol can adapt to changes of the network size. We discuss an important application of the proposed protocol. © 2002 Elsevier Science (USA) Key Words: network decomposition; quorum systems; self-stabilization; spanning tree.
1 An earlier version of this paper was presented at the International Conference on High Performance Computing, Calcutta, India, December 17-20, 1999. 2 Contact author: Ajoy K. Datta. E-mail:
[email protected]. Fax: 702 895-4075. 3 Supported in part by a sabbatical leave grant from University of Nevada, Las Vegas.
0743-7315/02 $35.00 © 2002 Elsevier Science (USA) All rights reserved.
1. INTRODUCTION

As network size grows, control and communication protocols become more complex and inefficient. The motivation for decomposing (large) networks is to improve the performance of the protocols, i.e., to avoid the performance degradation caused by excessive growth of the network. Fault-tolerance is one of the most important requirements of modern distributed systems. Various types of faults are likely to occur at various parts of the system. Distributed systems experience transient faults because they are exposed to constant changes in their environment. The concept of self-stabilization [Dij74] is the most general technique to design a system that tolerates arbitrary transient faults. A self-stabilizing system, regardless of the initial states of the processors and initial messages in the links, is guaranteed to converge to the intended behavior in finite time. The self-stabilization paradigm aims at designing distributed algorithms with the ability to recover spontaneously from any arbitrary state of the system, without any outside intervention. So, a self-stabilizing system does not require any initialization, tolerates transient faults, and adapts to dynamic changes of the network. The self-stabilizing property is very useful in those situations where sites may be inserted or removed (due to crash failures) and then recover spontaneously from an arbitrary state. If the period between two successive network topology changes, or the time between a recovery and the next crash, is long enough, the system stabilizes. We propose in this paper the first self-stabilizing network partitioning algorithm. Our self-stabilizing algorithm is evaluated in terms of stabilization time, T, which is defined as the number of rounds the algorithm needs to stabilize (to compute and to propagate the partitioning information to all the nodes in the network). In the following, we discuss related work in the area of network decomposition and then present our contributions.

1.1. Related Work

The concept of network decomposition was introduced in [AGLP90]. Awerbuch et al. presented a fast algorithm to partition a network into clusters of diameter O(n^ε), where n is the network size and ε = O(√(log log n / log n)). This algorithm requires O(n^ε) time. The algorithm of Linial and Saks [LS90] is the best known sequential algorithm for this problem. This algorithm is based on diameter separators and uses a randomized approach. The algorithms in [AP90] and [LS91] are improved versions of the scheme proposed in [AGLP90] in terms of the quality of the decomposition. However, these algorithms are inherently sequential, and their distributed implementation requires O(n log n) time. Another method, called block decomposition, is introduced in [LS93]. In this algorithm, the nodes of the graph are partitioned into blocks which need not induce connected subgraphs. The algorithm requires both the number of blocks and the diameter of the connected blocks to be small. Some randomized distributed partitioning algorithms exist in the literature. The algorithm in [LS91] achieves a high quality network decomposition with a high
probability by introducing randomization and is very efficient in time. An asynchronous randomized algorithm with O(log² n) time and O(|E| log² n) communication complexity is presented in [LS93] (|E| is the number of edges in the graph). A deterministic sublinear time distributed algorithm for network decomposition was presented recently in [ABCP96]. The algorithm takes O(n^ε) time, where ε = O(1/√log n). This paper also includes a randomized algorithm for a high quality network decomposition in polylogarithmic expected time. The protocol is based on a restricted model, the static synchronous network. In such a model, the communication is assumed to be completely synchronous and reliable, there is no limit on the size of the messages, and all operations within a subgraph of diameter d are performed centrally by collecting all the information at the leader. Then the leader computes the partition locally. Therefore the time for running the algorithm increases by a factor of d. This method also needs a leader election algorithm. In summary, all previous approaches fell short of designing a distributed, deterministic, and fault-tolerant network partitioning scheme.

1.2. Contributions

We present a simple, efficient, distributed, deterministic, and self-stabilizing protocol for constructing network partitions. An important application of our method is in designing quorum systems. The quorums can be used to design protocols for various problems, e.g., mutual exclusion, data replication, name servers, etc. The key contribution of our work is a computationally inexpensive self-stabilizing partitioning protocol which ensures high quality quorums in terms of response time and site load in the system [MV88]. Any quorum-based protocol that uses the proposed partitioning algorithm guarantees a balanced load between the sites of the system and has a low communication complexity of O(√n) messages (see Section 6). We combine the basic network partitioning protocol with a Propagation of Information with Feedback scheme [BDPV99] (described in more detail in Section 4.2) to design a self-stabilizing partitioning protocol. The proposed algorithm is written for a spanning tree of the network. Our protocol stabilizes in 3(h+1) rounds, where h is the height of the spanning tree. Since our algorithm is self-stabilizing, the protocol recomputes the partitions in the event of topological changes. There exist many self-stabilizing spanning tree construction algorithms in the literature, e.g., [AG94, AKM+93, CYH91, DIM93]. Any of these algorithms can be combined with the proposed partitioning algorithm to design a partitioning algorithm for a general network.

1.3. Outline of the Paper

In Section 2, we describe the distributed systems and the model we consider in this paper. In Section 3, we specify the problem of network partitioning. The basic idea of the network partitioning strategy and the self-stabilizing partitioning algorithm are presented in Section 4. In Section 5, we give a solution to deal with dynamic changes of the network size. We discuss an application of the proposed scheme in Section 6. Finally, we make some concluding remarks in Section 7.
2. PRELIMINARIES

In this section, we define the distributed systems and programs considered in this paper and state what it means for a protocol to be self-stabilizing.

System. A distributed system S is an undirected connected graph, G = (V, E), where V is a set of nodes (|V| = n) and E is the set of edges. Nodes represent processors (denoted by p) and edges represent bidirectional communication links. (We use ''nodes'' and ''processors'' interchangeably.) A communication link (p, q) exists iff p and q are neighbors. We consider arbitrary rooted asynchronous networks. We assume the existence of an underlying self-stabilizing BFS (breadth first search) spanning tree protocol. So, the algorithm presented in this paper is written for a BFS spanning tree. But, as mentioned in the previous section, we can use any of the existing self-stabilizing (BFS) spanning tree algorithms to design a protocol for a general network.

Programs. Every processor, except the leaf processors, executes the same program. The program consists of a set of shared variables (henceforth referred to as variables) and a finite set of actions. A processor can only write to its own variables, and can only read its own variables and variables owned by the neighboring processors. So, the variables of p can be accessed by p and its neighbors. Each action is uniquely identified by a label and is of the following form: ⟨label⟩ :: ⟨guard⟩ → ⟨statement⟩. The guard of an action in the program of p is a boolean expression involving the variables of p and its neighbors. The statement of an action of p updates one or more variables of p. An action can be executed only if its guard evaluates to true. We assume that the actions are atomically executed: the evaluation of a guard and the execution of the corresponding statement of an action, if executed, are done in one atomic step. The atomic execution of an action of p is called a step of p. The state of a processor is defined by the values of its variables. The state of a system is the product of the states of all processors (∈ V). In the following, we refer to the state of a processor and of the system as a (local) state and a configuration, respectively. Let a distributed protocol P be a collection of binary transition relations, denoted by ↦, on C, the set of all possible configurations of the system. A computation of a protocol P is a maximal sequence of configurations e = (c_0, c_1, ..., c_i, c_{i+1}, ...), such that for i ≥ 0, c_i ↦ c_{i+1} (a single computation step) if c_{i+1} exists, or c_i is a terminal configuration. Maximality means that the sequence is either infinite, or it is finite and no action of P is enabled in the final configuration. All computations considered in this paper are assumed to be maximal. During a computation step, one or more processors execute a step, and a processor may take at most one step. This execution model is known as the distributed daemon [BGM89]. We assume a weakly fair daemon, meaning that if a processor p is continuously enabled, p will eventually be chosen by the daemon to execute an action. The set of computations of a protocol P in system S starting with a particular configuration a ∈ C is denoted by E_a. The set of all possible computations of P in system S is denoted by E. A configuration b is reachable from a,
denoted a ↝ b, if there exists a computation e = (c_0, c_1, ..., c_i, c_{i+1}, ...) ∈ E_a (a = c_0) such that b = c_i (i ≥ 0). In order to compute the time complexity measure, we use the definition of a round [DIM97]. This definition captures the execution rate of the slowest processor in any computation. Given a computation e (e ∈ E), the first round of e (let us call it e') is the minimal prefix of e containing one (local) atomic step of every processor continuously enabled from the first configuration. Let e'' be the suffix of e, i.e., e = e'e''. Then the second round of e is the first round of e'', and so on.

2.1. Self-Stabilization

A specification is a predicate on computations that are admissible for a distributed system. A system matches its specification if all its possible computations match the specification. If we consider only static problems (i.e., problems whose solutions consist of computing some global function), the specification can be given in terms of a set of configurations. Every computation matching the specification would be a sequence of such configurations. The set of configurations that matches the specification of a static problem is called the set of legitimate configurations (denoted L), while the remainder C \ L denotes the set of illegitimate configurations. We need to introduce the concept of an attractor to define self-stabilization. Intuitively, an attractor is a set of configurations of the system S that ''attracts'' another set of configurations of S for any computation in E, the set of all possible computations. In addition, if the attractor is closed, then any subsequent computation of the algorithm remains in the same set of configurations.

Definition 2.1 (Closed attractor). Let C_1 and C_2 be subsets of C. C_1 is an attractor for C_2 if and only if for any initial configuration c_1 ∈ C_2 and any execution e ∈ E_{c_1} (e = c_1, c_2, ...), there exists i ≥ 1 such that for any j ≥ i, c_j ∈ C_1.

In the usual (i.e., nonstabilizing) distributed systems, the possible computations can be restricted by allowing the system to start only from some well-defined initial configurations. On the other hand, in stabilizing systems, problems cannot be solved using this convenience, since all possible system configurations are admissible initial configurations.

Definition 2.2 (Self-stabilization). A system S is called self-stabilizing if and only if there exists a non-empty subset L ⊆ C of legitimate configurations such that L is a closed attractor for C.

3. NETWORK PARTITIONING

Assume that the number of nodes in the network, n, is equal to k². The case of n ≠ k² is discussed in Section 5. The network partitioning problem deals with the grouping of the set of nodes V into k connected and disjoint partitions, where each partition contains k nodes. A connected partition reduces the communication cost among the nodes. Maintaining equal size partitions allows load sharing and reduces the quorum size when such a scheme is used to build a quorum system.
In the quorum construction methods [Mae85, KC91, AA91, PW95, Baz96], the logical organization of the network (grid, tree, etc.) is used to improve the performance in terms of quorum size, load, and availability. However, the proper selection of the quorum size is not sufficient to reduce the communication cost. In this paper, we propose to use the physical (real) organization of the network to improve the communication time. Generating quorums from this real information allows a better evaluation of the communication cost. In the following section, we discuss our network partitioning algorithm. Then in Section 6, we discuss how quorums can be constructed using the partitions obtained from this partitioning algorithm. We use G[g] to denote the subgraph induced by the subset g ⊆ V in G. A partitioned network, denoted P = {B_1, B_2, ..., B_k}, is a collection of k partitions such that the following conditions are true:

• Condition 1: ∀i ∈ {1..k} :: G[B_i] is connected.
• Condition 2: ∀i ∈ {1..k} :: |B_i| = k.
• Condition 3: ∪_i B_i = V.
• Condition 4: ∀i, j ∈ {1..k} :: B_i ∩ B_j = ∅, i ≠ j.

A partition B_i is called a complete partition if |B_i| = √n. If |B_i| < √n, then the partition is considered an incomplete partition. Connected and disjoint partitions are called correct partitions. The partitions are considered balanced if they have the same (or almost the same) size. A partition B_i constructed at processor p is characterized by the following parameters:

• Node-list: The list of nodes in partition B_i.
• Density: The density of the graph G_i induced by B_i, i.e., the number of edges in G_i.
• External links: A set of edges {(x, y) ∈ E | x ∈ B_i, y ∉ B_i, and y is a descendant of p}. We denote this set by L.
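The four conditions translate directly into a mechanical validity check. The following Python sketch is our illustration, not part of the protocol; it assumes the graph is given as an adjacency-list dictionary adj and tests a proposed decomposition P:

    from collections import deque

    def is_connected(nodes, adj):
        # BFS inside the induced subgraph G[nodes]
        nodes = set(nodes)
        if not nodes:
            return False
        start = next(iter(nodes))
        seen, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen == nodes

    def is_correct_partitioning(P, V, adj, k):
        # Conditions 1-4 of Section 3
        if any(not is_connected(B, adj) for B in P):        # Condition 1
            return False
        if any(len(B) != k for B in P):                     # Condition 2
            return False
        union = set().union(*(set(B) for B in P))
        if union != set(V):                                 # Condition 3
            return False
        return sum(len(B) for B in P) == len(union)         # Condition 4

Disjointness (Condition 4) is checked by comparing the sum of the partition sizes with the size of their union; the two agree exactly when no node appears in two partitions.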
Specification of the partitioning problem. The problem solved in this paper is to design a deterministic self-stabilizing algorithm for decomposing a graph G(V, E) (|V| = n = k²) into k connected and disjoint partitions, P = {B_1, B_2, ..., B_k}, each partition consisting of k nodes. The resulting partitions must also be known to all processors in the network. We also solve the problem of partitioning a general and dynamic network, i.e., one whose size is not equal to k². The resulting partitions will be disjoint and connected, but not necessarily balanced. The algorithm proposed in this paper constructs partitions with the above properties provided the input graph can be partitioned. The characterization of the graphs that can be decomposed into such partitions is still an open question.

4. NETWORK PARTITIONING ALGORITHM NP

The algorithm uses a BFS spanning tree of the underlying network as the input. The basic scheme is as follows: The decomposition process starts from the leaf
processors. Every leaf processor forms an incomplete partition. The nonleaf nodes try to create larger partitions from the partitions received from their descendants until they form a set of complete partitions covering the tree rooted at them. Eventually, this process reaches the root. The root then knows the partitions and broadcasts the information about the partitions to all processors in the network.
Algorithm 4.1 (NP): Distributed algorithm for network partitioning at processor p.

Variables
  LD_p = {B_1^p, B_2^p, ..., B_k^p}   /* Local Set of Partitions */
  GD_p = {B_1^p, B_2^p, ..., B_k^p}   /* Global Set of Partitions */

Macros
  Parent_p = p's parent in the spanning tree
  Ch_p     = {q ∈ N_p | Parent_q = p}                 /* p's children */
  |B_i^q|  = number of nodes in partition B_i^q
  ICP_p    = {B ∈ ∪_{q ∈ Ch_p} LD_q | |B| < √n}       /* set of incomplete partitions received from Ch_p */
  CP_p     = (∪_{q ∈ Ch_p} LD_q) \ ICP_p              /* set of complete partitions received from Ch_p */
  ICP_p^+  = {x | ∃B ∈ ICP_p :: x ∈ B}                /* set of nodes in ICP_p */

Predicates
  Leaf_p        ≡ Ch_p = ∅
  Path(i, j, g) ≡ (∃k ∈ g :: (i, k) ∈ G[g] ∧ Path(k, j, g)) ∨ (i = j)
  Connected(g)  ≡ ∀i ∈ g : ∀j ∈ g :: Path(i, j, g)
  Combine_p     ≡ (|ICP_p^+| = √n) ∧ Connected(ICP_p^+)
  Reconstruct_p ≡ (|ICP_p^+| > √n) ∨ ((|ICP_p^+| = √n) ∧ ¬Connected(ICP_p^+))

Actions
  /* For the leaf processors */
  A1 :: Leaf_p           →  LD_p := {{p}}; GD_p := GD_{Parent_p}

  /* For other processors */
  A2 :: |ICP_p^+| < √n   →  LD_p := CP_p ∪ {ICP_p^+ ∪ {p}}; GD_p := GD_{Parent_p}
  A3 :: ICP_p^+ = ∅      →  LD_p := CP_p ∪ {{p}}; GD_p := GD_{Parent_p}
  A4 :: Combine_p        →  LD_p := CP_p ∪ {ICP_p^+} ∪ {{p}}; GD_p := GD_{Parent_p}
  A5 :: Reconstruct_p    →  Combine-Partitions(∪_{q ∈ Ch_p} LD_q, A_inc, A_c);
                            if Connected(A_inc ∪ {p}) and (|A_inc| < √n)
                            then LD_p := A_c ∪ {A_inc ∪ {p}}
                            else LD_p := (∪_{q ∈ Ch_p} LD_q) ∪ {{p}};
                            GD_p := GD_{Parent_p}
Each processor p uses two variables, LD_p and GD_p, to implement the local and global sets of partitions, respectively. LD_p contains the set of partitions that covers the subtree T_p rooted at p. GD_p consists of the set of partitions created at the root (denoted p_0) and is broadcast to the other processors. GD_p will eventually cover the whole graph and will contain the final and correct set of partitions.
The partitioning algorithm shown in Algorithm 4.1 contains some macros, predicates, and actions. The macros are not variables; they are dynamically evaluated. The predicates are used to describe the guards of the actions. The goal is to have at most one incomplete partition in the local set LD_p of each processor p. This condition makes the construction of partitions easier: it avoids the destruction (and reconstruction) of complete partitions at the higher levels of the tree. We assume that the leaves of a tree are at Level 0, and the level of a nonleaf node is one plus the maximum of the levels of its children. The informal outline of the task at processor p is given below:

Level 0: Every leaf processor forms an incomplete partition.

Level i:
1. Try to complete the incomplete partitions received from Level i − 1 by combining the nodes in these incomplete partitions.
2. If the set of partitions created at this level in Step 1 contains more than one incomplete partition, try to reconstruct all partitions using some external links.
3. If Step 2 also fails, the parent of p will read the original partitions as obtained from Level i − 1 and try to complete them at Level i + 1.

We refer to the examples shown in Figs. 1, 2, and 3 to explain the algorithm. Action A1 implements the construction of the local set LD_p for a leaf processor p. In Part a1 of Fig. 1, the graph contains 16 nodes, so a complete partition must have four nodes. For a leaf processor like Node 7, LD_7 consists of a single incomplete partition {7}. Actions A2, A3, and A4 maintain the complete partitions created by the children and try to complete the incomplete partitions. When the total number of nodes in all the incomplete partitions of all p's children is less than √n, A2 adds the node identity p to LD_p. In Part a1 of Fig. 1, Node 4 combines the partitions {7}, {8}, and {9} and adds its identity 4 to obtain a complete and correct partition {4, 7, 8, 9}. When all the partitions from the children are complete, A3 creates a new partition containing only itself (i.e., only p). In Part a1, the local set at processor 1 consists of two complete partitions ({4, 7, 8, 9} and {5, 10, 14, 15}) and an incomplete partition {1}.
FIG. 1. Partitioning without using any external link.
FIG. 2. Two incomplete partitions. Node 1 uses some external links.
When all the nodes of all the incomplete partitions received from the children can be combined into one connected, complete partition, A4 merges them into a single set. Then it creates a new partition with only one node, {p}. All incomplete partitions will be completed at the higher levels of the tree. Finally, we obtain the following partitions, as shown in Part a2 of Fig. 1: {0, 1, 2, 3}, {4, 7, 8, 9}, {5, 10, 14, 15}, and {6, 11, 12, 13}. When the first four actions (A1–A4) fail to complete the partitions, Action A5 reconstructs the partitions (Procedure Combine-Partitions) using some external links (links that do not belong to the tree). If the partitioning is possible at this level, the procedure returns a set of partitions with at most one incomplete partition. If the partitioning cannot be done, the parent of p will read these local sets of partitions received from the children of p, and the same process is repeated at p's parent node to complete the partitions. In Fig. 2, the partitions at Node 1 received from the children consist of two incomplete partitions ({4, 10} and {6, 14, 15}) and a complete partition {5, 11, 12, 13}. Node 1 reconstructs the partitions {5, 10, 11, 12}, {6, 13, 14, 15}, and {1, 4} by running Procedure Combine-Partitions. The final set of partitions, as shown in Part b2 of Fig. 2, is {5, 10, 11, 12}, {6, 13, 14, 15}, {1, 4, 0, 2}, and {3, 7, 8, 9}. In Fig. 3, Node 1 cannot reconstruct the input partitions {3, 9, 10}, {4, 11, 12, 13}, and {5, 14, 15}. The root uses the external link (6, 15) to create the partitions (as shown in Part c2) {4, 13, 11, 12}, {5, 15, 14, 6}, {0, 2, 7, 8}, and {1, 3, 9, 10}.
FIG. 3. Two incomplete partitions. The partitioning cannot be done at Node 1 since there are no external links at this level. The partitioning is completed at the root.
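Abstracting away the distributed daemon and the broadcast variable GD_p, the guard selection of Actions A1–A3 at a single node can be illustrated sequentially. The Python sketch below is our illustration only; it deliberately omits the connectivity test of A4 and the reconstruction of A5, and assumes each partition is represented as a frozenset of node identities:

    import math

    def local_partitions(p, children_LD, n):
        # children_LD: list of the LD_q sets computed at p's children
        k = math.isqrt(n)
        received = [B for LD in children_LD for B in LD]
        cp = [B for B in received if len(B) >= k]      # complete partitions CP_p
        icp = [B for B in received if len(B) < k]      # incomplete partitions ICP_p
        icp_plus = frozenset().union(*icp) if icp else frozenset()

        if not children_LD:              # A1: a leaf forms an incomplete partition
            return [frozenset([p])]
        if not icp_plus:                 # A3: children sent only complete partitions
            return cp + [frozenset([p])]
        if len(icp_plus) < k:            # A2: merge all incomplete nodes with p
            return cp + [icp_plus | {p}]
        # A4 (|ICP_p^+| = k and connected) and A5 (reconstruction through
        # external links) are omitted from this sketch.
        return cp + [icp_plus, frozenset([p])]

Calling local_partitions bottom-up, leaves first, reproduces the walk-through of Fig. 1: Node 4 receives {7}, {8}, and {9} from its children and returns [{4, 7, 8, 9}] by Action A2.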
4.1. Completing the Incomplete Partitions

We can easily see that Algorithm NP produces correct partitions (as per Conditions 1–4 in Section 3) as long as the partitions do not need to be reconstructed. So, we now need to show that Procedure Combine-Partitions (as used in Action A5) also combines the incomplete partitions and creates correct complete partitions. We use the following notations in the reconstruction process:

• D: The set of disjoint connected partitions before reconstruction; D = B_inc ∪ B_c.
• B_c: The set of complete partitions in D before reconstruction.
• B_inc: The set of incomplete partitions in D before reconstruction.
• T_p: The subtree rooted at p.
• T-Nodes_p: The set of nodes common to T_p and the partition p belongs to.
• A_c: The set of complete partitions in D after reconstruction.
• A_inc: The set of nodes that do not belong to A_c.
• Pt: The current incomplete partition being considered (in Procedures Complete and More-Links).

Input to Procedure Combine-Partitions. A set of correct partitions D = {B_1^0, B_2^0, ..., B_i^0, ..., B_k^0}. The subset {B_1^0, B_2^0, ..., B_i^0} consists of the incomplete partitions (ICP), and the subset {B_{i+1}^0, B_{i+2}^0, ..., B_k^0} is the set of complete partitions (CP).

Output from Procedure Combine-Partitions. The largest set A_c that can be constructed at this processor, and A_inc.

We assume that all partitions in D are ordered by some arbitrary ordering ≺ such that B_i^n ≺ B_j^n for all i < j, and B_i^n ≺ B_j^n for all B_i^n ∈ ICP and B_j^n ∈ CP.

The process of completing the incomplete partitions is shown in Fig. 4. The algorithm at processor p tries to complete the partitions using some external links in the subgraph of G induced by the nodes of the subtree T_p rooted at p. In each step m (m = 0 in Fig. 4), the algorithm first tries to find an external link that could complete the first incomplete partition (B_1^m) according to the ordering ≺. If there exists a link l_k = (e_k, f_k) such that e_k ∈ B_1^m, f_k ∈ B_j^m, j ≠ 1, then B_1^m is completed by the addition of the set of nodes of B_j^m that belong to the subtree T_{f_k} rooted at f_k. Otherwise, one of the following conditions is true: (i) B_1^m can be completed by using more than one link, or (ii) it cannot be combined with any link(s) to create a complete partition. If (i) is true, B_1^m is created by calling Procedure More-Links. Procedure More-Links looks for a subset of external links to complete the incomplete partitions. We maintain an arbitrary ordering among the external links. This ordering, combined with a recursive process, allows us to check all possible ways to complete the incomplete partitions.
FIG. 4. Completing the incomplete partitions.
If the procedure succeeds in step m, another set of incomplete partitions may be generated. The process is repeated in step m + 1 to complete the resulting incomplete partitions. If (ii) is true, i.e., the partition cannot be completed, the same process is executed for the next partition B_2^m following the ordering ≺. This recursive process terminates when there is at most one incomplete partition or no partition can be completed.

Procedure Combine-Partitions. This procedure attempts to complete the incomplete partitions by considering them one at a time. It selects the partitions, starting with the first one in B_inc according to the ordering ≺.
Proposition 4.1. Procedure Combine-Partitions computes a set of correct partitions {B_1^m, B_2^m, ..., B_k^m} in step m such that there exists j, 1 ≤ j ≤ k, with

  |B_i^m| = √n   for i < j,
  |B_i^m| ≤ √n   for i = j,
  |B_i^m| = 0    for i > j.

That is, at most one partition (the j-th) remains incomplete; all partitions before it are complete, and all partitions after it are empty.
Algorithm 4.2 (PCP): Procedure Combine-Partitions.

1. Procedure Combine-Partitions(D, A_inc, A_c)
2.   i = 1      /* Tries to complete the first incomplete partition. */
3.   j = 1      /* j is used to detect a cycle, which can occur when Complete is called twice with the same parameters. */
     /* D = B_inc ∪ B_c */
4.   Complete(i, B_inc, B_c, j)
5.   A_inc = B_inc; A_c = B_c
Procedure Complete. This is a recursive procedure. It tries to complete an incomplete partition. The incomplete partition Pt can be completed using either one link (Line 12) or a set of links (Line 14). In the first case, the partition is immediately completed, and another set of incomplete partitions may be generated (Line 15). In the second case, Procedure More-Links is called (Line 14). This process is repeated recursively (Line 16).

Algorithm 4.3 (PC): Procedure Complete.

1. Predicates
2.   FoundLink(l_k, Pt) ≡ (∃ l_k = (e_k, f_k) ∈ E) ∧ (e_k ∈ Pt) ∧ (f_k ∉ Pt) ∧ (|CurrentPartition(Pt, f_k)| = √n)
3.   New(i, B_inc, B_c, j) ≡ (∀k < j :: Complete(i, B_inc, B_c, k) ∉ SavedCalls)
4. Macros
5.   CurrentPartition(Pt, f_k) = Pt ∪ T-Nodes_{f_k}
6.   SavedCalls = the past calls of Procedure Complete;
7.                every call is stored as Complete(i, B_inc, B_c, j)

8. Procedure Complete(i, B_inc, B_c, j)
9.   if |B_inc| > 1 then
10.    select v_i from B_inc
11.    Pt = v_i
12.    if FoundLink(l_k, Pt) then Pt = CurrentPartition(Pt, f_k)
13.    else
14.      if (|Pt| ≠ √n) then More-Links(1, L, Pt)    /* L: external links of Pt */
15.    update B_inc and B_c
16.    if New(1, B_inc, B_c, j) then Complete(1, B_inc, B_c, j + 1)
17.    else exit      /* Cannot be completed at this level. */
18.                   /* A cycle exists. Try at a higher level. */
19.  else
20.    if (i < |B_inc|) and New(i + 1, B_inc, B_c, j) then Complete(i + 1, B_inc, B_c, j)
21.    else exit      /* Cannot be completed at this level. */
FIG. 5. A cycle case.
If Procedure Complete is called twice with the same arguments, the procedure may run into a cycle; i.e., the process may not terminate. To detect this kind of cycle, we define a predicate New to ensure that the current procedure call has not been invoked in the past. We use Fig. 5 as an example to explain the case of a cycle. In the example in Fig. 5, a system with 16 nodes needs to be partitioned into four partitions. Node 3 has three incomplete partitions (B_0^m = {8, 11, 12}, B_1^m = {9, 13}, and B_2^m = {10, 14, 15} (m = 0)) and two external links ((12, 13) and (13, 14)). The main goal of every process is to obtain a local decomposition with at most one incomplete partition. If we follow the execution of the algorithm at process 3, a cycle occurs as follows:

1. It completes B_0^m using external link (12, 13) (m = m + 1).
2. The resulting partitions are B_1^m = {9} and B_2^m = {8, 11, 12, 13}.
3. It completes {10, 14, 15} using external link (13, 14) (m = m + 1).
4. The resulting partitions are B_0^m = {8, 11, 12}, B_1^m = {9, 13}, and B_2^m = {10, 14, 15}.
5. Go to Step 1.

Procedure More-Links. This is also a recursive procedure, whose task is to find a set of links to complete the incomplete partition Pt. It tries to complete Pt by adding some nodes until the partition is complete (Lines 6–9). The procedure starts with the first link following some ordering ≺. If the nodes induced by this link can complete Pt, then the procedure completes the partition and exits (Lines 4–5). If adding this set of nodes is not enough to complete the partition, it adds these nodes to Pt and looks for another link. This process continues until the partition is complete (Lines 6–8). If the total number of nodes induced by some link plus the nodes in partition Pt exceeds √n, we ignore the link and try the next link in the ordering. If the procedure finds a set of links that can complete Pt, it completes the partition. Otherwise, the process is repeated with the next incomplete partition as per the ordering ≺.

4.2. Broadcasting the Partitioning Information

By Algorithm NP, the root knows the correct set of partitions. So, the only task left to solve the partitioning problem (as defined in Section 3) is for the root to share this partitioning information with all other processors in the network. We use a self-stabilizing propagation of information with feedback (PIF) scheme [BDPV99] to implement this task. We compose a PIF scheme with Algorithm NP.
Algorithm 4.4 (PML): Procedure More-Links.

1. Procedure More-Links(i, L, Pt)
2.   if i ≤ |L| then
3.     take l_i = (e_i, f_i) from L
4.     if |CurrentPartition(Pt, f_i)| = √n then
5.       Pt = CurrentPartition(Pt, f_i); exit
6.     else if |CurrentPartition(Pt, f_i)| < √n then
7.       Pt = CurrentPartition(Pt, f_i)
8.       update L of Pt
9.       More-Links(1, L, Pt)
10.    else More-Links(i + 1, L, Pt)
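The interplay of Complete, More-Links, and the cycle check can be condensed into a small sequential sketch. The Python code below is our illustration, not the authors' procedure: it only tries single links (the multi-link search of More-Links is omitted), approximates T-Nodes by a caller-supplied function subtree(y), and replaces SavedCalls by a set of previously seen configurations:

    import math

    def complete_partitions(parts, links, subtree, n):
        # parts: list of mutable node sets, incomplete ones first (ordering of Section 4.1)
        # links: list of external links (x, y); both orientations are tried
        # subtree(y): set of nodes that endpoint y contributes (T-Nodes of y)
        k = math.isqrt(n)
        directed = links + [(y, x) for (x, y) in links]
        seen = set()
        while sum(len(B) < k for B in parts) > 1:
            state = frozenset(frozenset(B) for B in parts)
            if state in seen:              # cycle detected (predicate New fails):
                break                      # give up and let the parent retry
            seen.add(state)
            moved = False
            for Pt in (B for B in parts if len(B) < k):
                for (x, y) in directed:
                    if x in Pt and y not in Pt:
                        gain = subtree(y) - Pt
                        if len(Pt) + len(gain) == k:   # FoundLink: one link completes Pt
                            for B in parts:
                                if B is not Pt:
                                    B -= gain          # the donor partition shrinks
                            Pt |= gain
                            moved = True
                            break
                if moved:
                    break
            if not moved:
                break                      # no single link helps at this level
        return parts

On the example of Fig. 5 (n = 16, external links (12, 13) and (13, 14), with subtree(13) = {13}), the loop completes {8, 11, 12}, then {10, 14, 15}, then revisits a configuration it has already seen and stops instead of looping forever.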
Let us quickly review the well-known PIF scheme [Cha82, Seg83] on tree-structured networks. The PIF scheme is the repetition of a PIF cycle, which can be informally defined as follows: Starting from an initial configuration where no message has yet been broadcast, the root (r) initiates the broadcast phase. The root's descendants (except the leaf processors) participate in this phase by forwarding the broadcast message to their descendants. When the broadcast phase reaches the leaf processors, since the leaf processors have no descendants, they notify their ancestors of the termination of the broadcast phase by initiating the feedback phase. When every processor, except the root, has sent the feedback message to its ancestor, the root executes a special internal action indicating the termination (or completion) of the current PIF cycle. In [BDPV99], a PIF scheme called propagation of information with feedback and cleaning (PFC) is introduced; we use this PFC scheme in this paper. The PFC algorithm is both space optimal (in terms of the number of states) and time optimal (in terms of the stabilization time). In the PFC scheme, starting from any configuration, the first normal (correct) broadcast phase from the root is guaranteed to start within one round (see [BDPV99] for details). Thus, the broadcast phase of the first PIF cycle reaches the leaves (hence, all processors) within h + 1 rounds (where h is the height of the spanning tree of the network). This phase carries the partition information from the root. Since the system may start in an arbitrary configuration, this information may not be correct. At the termination of this phase, the feedback phase takes another h rounds to reach the root. This feedback phase collects the correct information (i.e., the correct sets of partitions) all the way to the root. Since we want all the processors to get the correct information about the partitions of the network, the root needs to broadcast again; i.e., the root needs to initiate a second PIF cycle. For technical reasons (see [BDPV99] for details), there is a delay of two rounds before the second PIF cycle (i.e., the next broadcast phase) can start. The second broadcast phase is guaranteed to deliver the correct partition information to all processors of the network. Based on the above discussion, we can claim the following result:

Lemma 4.1. Starting from any configuration, all processors will receive the correct partition information in 3(h + 1) rounds.
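The phase structure behind Lemma 4.1 can be visualized with a toy synchronous simulation of one PIF cycle (our illustration; the cleaning phase and all self-stabilizing machinery of PFC are omitted):

    def pif_cycle_rounds(children, root):
        # children: dict mapping each node to the list of its children
        def height(v):
            return 0 if not children[v] else 1 + max(height(c) for c in children[v])
        h = height(root)
        broadcast = h + 1      # the root's message reaches the deepest leaf
        feedback = h           # the leaves' acknowledgments climb back to the root
        return broadcast + feedback

    # Example: a chain root -> a -> b of height 2
    tree = {"root": ["a"], "a": ["b"], "b": []}
    print(pif_cycle_rounds(tree, "root"))   # 5 rounds for one cycle

The 3(h + 1) bound of the lemma accounts for the first (possibly incorrect) broadcast, the feedback that gathers the correct partitions at the root, the two-round delay, and the second broadcast that disseminates them.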
The attractor L (as per Definition 2.2) for Algorithm NP is the set of configurations where all processors know the correct partitions of the network (as per the specification in Section 3).

Theorem 4.1. Algorithm NP combined with the PFC scheme is a self-stabilizing network partitioning scheme.

Proof. Follows from the PFC scheme, Algorithm NP, and Lemma 4.1. ∎

Note 4.1. Our goal in this paper is not to compute the local execution time (at each processor), but to compute the self-stabilization time of the algorithm. The local execution time is usually ignored in an asynchronous distributed algorithm, where the communication time is more important.
5. DYNAMIC MAINTENANCE OF THE PARTITIONS

In order to use the network decomposition in an environment where the topology of the network may change over time, it is necessary to be able to maintain correct partitions in spite of node/link failures. Such an event can make the network size n ≠ k². In particular, in the application to quorum systems, an important open question is how to deal with changes of the network size while achieving both a low load and a small quorum size. It is difficult to maintain these desirable properties in a dynamic network due to synchronization problems. Given a general network with n ≠ k² nodes, we first present a scheme to obtain the ''best'' decomposition, i.e., one where the partitions are balanced (have almost the same size). Then, we give a solution to deal with the dynamic configuration of the network. The latter method adapts to dynamic networks but does not guarantee the quality of the decomposition in terms of partition size, meaning the resulting partitions may not be balanced. To obtain the best decomposition of the network, we consider the following two cases (where k² + 1 ≤ n ≤ k² + 2k); a snippet checking both cases appears below:

1. Assume that n = k² + k_0 and 0 ≤ k_0 ≤ k. We construct k_0 partitions with k + 1 nodes each and k − k_0 partitions with k nodes each.
2. Assume that n = k² + k + k_0 and k_0 ≥ 0. We construct k_0 partitions with k + 1 nodes and (k − k_0 + 1) partitions with k nodes.

Decomposing the network in this manner offers a balanced system with all the properties discussed before. But the implementation of this decomposition is difficult in a distributed context: each process must know the current value of n, the number of partitions, and the number of nodes to be included in each partition. Now, assuming a general network size (n = k² + r), we propose a dynamic solution to deal with constant changes of n: One trivial strategy for handling both the addition and the removal of a node is to construct k partitions with k nodes each and one partition with r nodes. The incomplete partition is constructed at the root, and all other nodes maintain the same behavior as in NP. In the worst case, this solution leads to a partitioning with a partition containing only one node (the root).
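As a quick check of the two balanced cases above, the following Python snippet (ours, for illustration only) computes the partition sizes for an arbitrary n:

    import math

    def balanced_partition_sizes(n):
        # n = k^2 + r with k = floor(sqrt(n)) and 0 <= r <= 2k
        k = math.isqrt(n)
        r = n - k * k
        if r <= k:                  # case 1: n = k^2 + k_0 with k_0 = r
            return [k + 1] * r + [k] * (k - r)
        k0 = r - k                  # case 2: n = k^2 + k + k_0
        return [k + 1] * k0 + [k] * (k - k0 + 1)

    for n in (16, 18, 22):
        sizes = balanced_partition_sizes(n)
        assert sum(sizes) == n
        print(n, sizes)   # 16 -> [4,4,4,4]; 18 -> [5,5,4,4]; 22 -> [5,5,4,4,4]

In both cases the sizes sum to n: k_0(k + 1) + (k − k_0)k = k² + k_0, and k_0(k + 1) + (k − k_0 + 1)k = k² + k + k_0.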
We saw above that, in the worst case, for a general network, this solution may produce a partition with one node (when n = k² + 1). But this solution does not use the current value of n. We can improve it by combining our protocol with one that computes the value of n and broadcasts that value across the network. Then each node can decide how many nodes should be included in each partition. In this case, we obtain a more balanced partitioning: (k − 1) partitions with k nodes and one partition with k + 1 nodes, instead of k partitions with k nodes and one partition with a single node. With the above strategy, we can claim that Algorithm NP is dynamic: our solution can adapt to situations where, due to some topological changes, the partitions become unbalanced.

6. APPLICATION

The decomposition technique described in this paper is expected to provide an important and reliable tool for various applications. This scheme can be used to construct quorums for general large networks. With such a construction, we can also guarantee a fast, coherent, and reliable document access method in a fully distributed information system. The TransDoc project [CDF97] aims at providing intelligent research vehicles to facilitate multimedia document access in a fully distributed information system such as the World Wide Web (WWW). We study the issue of a reliable document retrieval scheme in the context of a large communication network [BBC98]. Before presenting this application, we first give an overview of quorum systems.

Quorum systems. A quorum system Q = {q_1, q_2, ..., q_k} over a set V of elements (which represent the identities of the sites in a distributed system) is a nonempty set where each element is a nonempty subset of V and q_i ∩ q_j ≠ ∅, ∀i, j ∈ {1, ..., k}. A coterie C over the set V is a quorum system over V which is minimal under set inclusion, meaning there are no q_1, q_2 ∈ C such that q_1 ⊂ q_2. Quorum systems have been used in the study of distributed control and management problems such as mutual exclusion, data replication protocols, name servers, selective dissemination of information, and distributed access control and signatures. Generally, a system based on the quorum concept works as follows: To execute an action, the user transparently selects a quorum and accesses all its elements. The intersection property guarantees that the user will have a consistent view of the current state of the system.

Name server. A global information system such as the WWW is a very large dynamic system. It is difficult for the user to keep up with the fast pace of information generation. In such an open information system (the Internet), the servers may duplicate the services they offer at many sites, and a service may be provided by more than one server. A client asks the system for a particular service by means of its name and not by its address, because servers may be mobile. Thus, before the client sends its request, it has to locate a server that provides the desired service. The mechanism that translates the name of a service into an address in the network,
FIG. 6. The name server application.
called the name server [MV88], can be implemented using quorums. Each server s posts, at a set of nodes P(s), the address where it resides (where the service is available). This information is locally stored at each node in P(s). To request a service, the client selects a set Q(c) and queries every one of its elements. If P(s) and Q(c) are quorums, any element of P(s) ∩ Q(c), which is not empty, can return the address at which the service is available. An improved response time implies a reduced quorum size and fewer hops to route between the user and the information server. Our network decomposition technique guarantees an optimal quorum size of √n (|P(s)| + |Q(c)| = 2√n). The proposed algorithm creates k connected and disjoint partitions P = {B_1, B_2, ..., B_k} with k nodes in each partition (k = √n). We define a quorum Q(c) as the set of all nodes in one partition. P(s) is formed by one node from every partition. As illustrated in Fig. 6, the intersection of two quorums is exactly one node.
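This construction is easy to state in code. The sketch below is our illustration (the choice of the per-partition representative, here the smallest node identity, is an arbitrary assumption); it uses the partitions of Fig. 1:

    def make_quorums(partitions):
        # client quorums Q(c): each whole partition
        client_quorums = [set(B) for B in partitions]
        # server quorum P(s): one representative from every partition
        server_quorum = {min(B) for B in partitions}
        return client_quorums, server_quorum

    P = [{0, 1, 2, 3}, {4, 7, 8, 9}, {5, 10, 14, 15}, {6, 11, 12, 13}]
    Qs, Ps = make_quorums(P)
    assert all(len(Ps & Q) == 1 for Q in Qs)   # intersection is exactly one node

Because the partitions are disjoint and P(s) takes exactly one node from each of them, P(s) ∩ Q(c) contains exactly one node for every client quorum, which is the property the name server relies on.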
7. CONCLUSIONS

We proposed a simple and efficient method for partitioning a network. Our scheme can be used in many applications, e.g., designing quorum systems. Our algorithm is also suitable for dynamic systems. We combined our partitioning algorithm with the PIF scheme of [BDPV99] to design a self-stabilizing network partitioning scheme. The partitioning protocol stabilizes in 3(h + 1) rounds.
REFERENCES

[AA91] D. Agrawal and A. El Abbadi, An efficient and fault-tolerant solution for distributed mutual exclusion, ACM Trans. Comput. Systems 9 (1991), 124–143.

[ABCP96] B. Awerbuch, B. Berger, L. Cowen, and D. Peleg, Fast distributed network decomposition and covers, J. Parallel Distrib. Comput. 32 (1996), 105–114.

[AG94] A. Arora and M. G. Gouda, Distributed reset, IEEE Trans. Comput. 43 (1994), 1026–1038.

[AGLP90] B. Awerbuch, A. V. Goldberg, M. Luby, and S. A. Plotkin, Network decomposition and locality in distributed computation, in ''Proc. of the 30th IEEE Symposium on Foundations of Computer Science,'' pp. 364–369, 1990.

[AKM+93] B. Awerbuch, S. Kutten, Y. Mansour, B. Patt-Shamir, and G. Varghese, Time optimal self-stabilizing synchronization, in ''Proc. of the 25th Annual ACM Symposium on Theory of Computing (STOC),'' pp. 652–661, 1993.

[AP90] B. Awerbuch and D. Peleg, Sparse partitions, in ''Proc. of the 31st IEEE Symposium on Foundations of Computer Science,'' pp. 503–513, 1990.

[Baz96] R. Bazzi, Planar quorums, in ''Proc. International Workshop on Distributed Algorithms (WDAG), Bologna, Italy,'' Lecture Notes in Computer Science, Vol. 1151, pp. 251–268, Springer-Verlag, Berlin/New York, October 1996.

[BBC98] F. Belkouch, M. Bui, and L. Chen, Self-stabilizing quorum systems for reliable document access in fully distributed information systems, Stud. Inform. Control 7 (1998), 311–328.

[BDPV99] A. Bui, A. K. Datta, F. Petit, and V. Villain, State-optimal snap-stabilizing PIF in tree networks, in ''Proceedings of the Fourth Workshop on Self-Stabilizing Systems,'' pp. 78–85, IEEE Comput. Soc. Press, Los Alamitos, CA, 1999.

[BGM89] J. E. Burns, M. G. Gouda, and R. E. Miller, On relaxing interleaving assumptions, in ''Proceedings of the MCC Workshop on Self-Stabilizing Systems,'' MCC Technical Report STP-379-89, 1989.

[CDF97] L. Chen, D. Donsez, and P. Faudemay, Design of TransDoc: A research vehicle for multimedia documents on the Internet, in ''BIWIT'97, Proc. of the 3rd Basque International Workshop on Information Technology, Biarritz, France,'' IEEE Comput. Soc. Press, Los Alamitos, CA, 1997.

[Cha82] E. J. H. Chang, Echo algorithms: Depth parallel operations on general graphs, IEEE Trans. Software Eng. 8 (1982), 391–401.

[CYH91] N. S. Chen, H. P. Yu, and S. T. Huang, A self-stabilizing algorithm for constructing spanning trees, Inform. Process. Lett. 39 (1991), 147–151.

[Dij74] E. W. Dijkstra, Self-stabilizing systems in spite of distributed control, Commun. Assoc. Comput. Mach. 17 (1974), 643–644.

[DIM93] S. Dolev, A. Israeli, and S. Moran, Self-stabilization of dynamic systems assuming only read/write atomicity, Distrib. Comput. 7 (1993), 3–16.

[DIM97] S. Dolev, A. Israeli, and S. Moran, Uniform dynamic self-stabilizing leader election, IEEE Trans. Parallel Distrib. Systems 8 (1997), 424–440.

[KC91] A. Kumar and S. Y. Cheung, A high availability √n hierarchical grid algorithm for replicated data, Inform. Process. Lett. 40 (1991), 311–316.

[LS90] N. Linial and M. Saks, Finding low-diameter graph decompositions distributively, unpublished manuscript, 1990.

[LS91] N. Linial and M. Saks, Decomposing graphs into regions of small diameter, in ''Proc. of the Second Annual ACM-SIAM Symposium on Discrete Algorithms,'' pp. 320–330, 1991.

[LS93] N. Linial and M. Saks, Low diameter graph decompositions, Combinatorica 13 (1993), 441–454.

[Mae85] M. Maekawa, A √N algorithm for mutual exclusion in decentralized systems, ACM Trans. Comput. Systems 3 (1985), 145–159.

[MV88] S. J. Mullender and P. M. B. Vitányi, Distributed match-making, Algorithmica 3 (1988), 367–391.

[PW95] D. Peleg and A. Wool, Crumbling walls: A class of practical and efficient quorum systems, in ''Proc. of the 14th ACM Symposium on Principles of Distributed Computing,'' pp. 120–129, 1995.

[Seg83] A. Segall, Distributed network protocols, IEEE Trans. Inform. Theory 29 (1983), 23–35.
FATIMA BELKOUCH received her Ph.D. from the University of Technology of Compiègne, France. She is an assistant professor at the University of Lille II, France. Her research interests are distributed computing, self-stabilization, and quorums and their application in distributed systems.
MARC BUI received his Ph.D. in computer science in 1989 from the University of Paris 11. From 1991 to 1995, he was an assistant professor at the University of Paris 10. During 1993–1994, he also served as Conseiller Extérieur in the PARADIS team at INRIA, Rocquencourt. From 1995 to 1999, he was a professor at the University of Technology of Compiègne. He is now a professor at the LRIA, University of Paris 8. His research interests include distributed algorithms, computer networks and communications, and mobile agents.
LIMING CHEN obtained his Ph.D. from the University of Paris 6. He is currently a professor at the Ecole Centrale de Lyon, where he heads a research team on multimedia distributed information systems. His main research interest is in the area of smart, efficient, and reliable access of multimedia documents. His research team has been working on image and video indexing. Liming is a founder and partner of various French national research projects such as Cyrano for personalized video distribution on the Internet and Muse for the Multimedia Research Engine on the Internet.
AJOY K. DATTA is a professor of computer science at the University of Nevada, Las Vegas. His primary research areas are distributed computing and self-stabilization.