ELSEVIER
Information Processing Letters 55 (1995) 265-271
Resolving all deadlocks in distributed systems
Soojung Lee a,1, Junguk L. Kim b,*
a NM.7 group, Communication Systems R&D Center, Samsung Electronics Co., Garak-dong, Songpa-ku, Seoul, South Korea
b Bell Communications Research, RRC IN2111, 444 Hoes Lane, Piscataway, NJ 08854, USA
Communicated by F. Dehne; received 1 March 1994; revised 1 February 1995
Abstract

A distributed algorithm is presented for deadlock detection and resolution in distributed systems. The unique feature of the proposed algorithm is that it detects and resolves all deadlocks reachable from the initiator of the algorithm using 2n time units and 2e messages, where n and e are the numbers of nodes and edges, respectively, in the wait-for graph reachable from the initiator. A simple analysis shows that our scheme requires fewer algorithm initiations throughout the system and detects deadlocks faster.

Keywords: Distributed systems; Deadlock detection; Deadlock resolution; Distributed algorithms; Wait-for graph

* Corresponding author. Email: [email protected].
1 Email: [email protected].
0020-0190/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved
SSDI 0020-0190(95)0011S-8

1. Introduction

In a distributed system, processes send messages through a network to request access to resources at remote sites. A deadlock occurs when a group of processes wait, in a circular fashion, for access to resources held exclusively by one another. Processes involved in a deadlock are blocked indefinitely unless one of them is aborted and restarted. The resource-requesting state between processes is often modeled by a directed graph known as a wait-for graph (WFG) [8], where each node represents a process and an arc is directed from a blocked process waiting for a resource to the process holding that resource. This paper uses the terms node and process interchangeably. It is assumed that a process cannot continue its computation until all of its requests are granted. Thus, finding a deadlock in this model corresponds to finding a cycle in the WFG.

Most algorithms developed recently for detecting deadlocks in distributed systems take a distributed approach and employ a deadlock detection message called a probe: the initiator of the algorithm propagates a probe through the WFG and declares a deadlock if the probe returns to it. These algorithms can be classified into history-based and history-independent algorithms. As the term indicates, history-based algorithms [5] store the received probes for use in later executions of the algorithm. This is intended to reduce the number of generated probes by not transmitting probes repeatedly over the same paths of the wait-for graph. However, the stored probes must be kept up to date, which may cause non-negligible overhead because additional messages are needed to report changes in the wait-for graph. Moreover, these algorithms rely on intuitive arguments to maintain the consistency of stored probes, enumerating all possible cases of resource request and release events that can occur in the system. This may result in incorrect algorithms [4,5]. On the other hand, in history-independent algorithms [2,3,7], each execution of the algorithm is independent of the others. Hence, these algorithms are simpler and require no extra storage.

A major drawback of most algorithms using probes [3-5] is that they detect only those deadlocks in which the initiator is involved, even if the probes are delivered to all the nodes reachable from the initiator. If the initiator is waiting outside a deadlock, its probes are of no use in detecting the deadlock and mainly contribute to heavy communication traffic.

In this paper, we propose a distributed, history-independent algorithm that overcomes the above disadvantages. Our algorithm operates hierarchically while building a tree through the propagation of deadlock detection messages. Deadlocks are resolved by collecting information on the dependency relationships among the tree nodes. As the tree is constructed, the WFG edges are classified so that only those types of edges necessary to detect deadlocks are reported up the tree. Accordingly, a single execution of the algorithm resolves all deadlocks reachable from the initiator that are present at the time of execution, using 2n time units and 2e messages, where n and e are the numbers of nodes and edges, respectively, in the WFG reachable from the initiator. This overcomes the drawback of the algorithm in [2], in which the initiator only finds out whether or not it is itself deadlocked. Moreover, our algorithm achieves faster deadlock detection than the other algorithms, since even a node waiting outside a deadlock resolves the deadlock when it initiates our algorithm.

The following assumptions are made: messages are received in FIFO order within a finite but unpredictable time; each process in the WFG is assigned a unique identifier; and a blocked process checks whether it is in deadlock as a result of making a resource request by executing the algorithm after waiting for some predefined time period on that request. If there is an edge in the WFG from process i to process j, we call j a successor of i. A process is called executing if it has no successor.
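The model above can be made concrete with a small sketch. The following Python fragment is only an illustration: the example graph, the dictionary representation and the function names are assumptions, not part of the paper. It represents a WFG as a successor map, tests whether a process is executing, and counts the n nodes and e edges reachable from an initiator (counting the initiator itself as reachable, which is also an assumption).

```python
from collections import deque

# Hypothetical wait-for graph: wfg[i] lists the successors of process i,
# i.e. the processes holding resources that i is waiting for.
wfg = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "f"],
    "d": ["b"],      # b -> d -> b is a cycle, hence a deadlock
    "e": ["a"],      # e is not reachable from a
    "f": [],         # f has no successor
}

def is_executing(wfg, i):
    """A process with no successor is not blocked (it is 'executing')."""
    return not wfg.get(i)

def reachable_counts(wfg, initiator):
    """Return (n, e): numbers of nodes and edges reachable from the initiator."""
    seen = {initiator}
    queue = deque([initiator])
    edges = 0
    while queue:
        i = queue.popleft()
        for j in wfg.get(i, []):
            edges += 1              # every outgoing edge of a reached node counts
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen), edges

n, e = reachable_counts(wfg, "a")
print(n, e)
```

For this graph the quantities in the stated bounds are 2n = 10 time units and 2e = 12 messages.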
2. The proposed algorithm

We first describe how the WFG edges are classified while a tree is built through the propagation of deadlock detection messages, and then how this classification is used in resolving deadlocks. Each node is associated with a special identifier named a path string, which is a binary string used to distinguish one branch of the tree from another. As will be seen later, path strings are used to identify not only back edges but also the other types of edges, such as cross and forward edges [1]. Some definitions for the manipulation of path strings are given below.

Definitions. (1) For any path strings $\alpha$ and $\beta$, the concatenation of $\alpha$ and $\beta$, denoted $\alpha \| \beta$, is a non-commutative operation that yields another path string formed by appending $\beta$ to $\alpha$. (2) $\lambda$ is defined as the path string of length zero, so that $\alpha \| \lambda = \lambda \| \alpha = \alpha$ for any path string $\alpha$. (3) A path string $\alpha$ is a prefix of a path string $\beta$ if and only if $\beta$ is the concatenation of $\alpha$ and some path string $\gamma \neq \lambda$. Moreover, $\alpha$ and $\beta$ are then said to be in prefix relationship.

A tree is built as the initiator of the algorithm, which is the root of the tree, sends deadlock detection messages to all of its successors at once. If a node receives the message for the first time, it takes the path string carried by the message as its own and becomes a child of the sender of the message. The message is then propagated further until it reaches an executing node or a tree node. Let brothers denote nodes that are neither ancestors nor descendants of one another in the tree. Our objective is to make the path string of a node a prefix of those of its children, and to give brothers path strings such that no two of them are in prefix relationship. All edge types can then be distinguished by comparing the path strings of the nodes incident on the corresponding edges. To achieve this objective, a node generates some bits unique to each successor, appends these bits to its own path string, and sends the resulting string to the successor.

It is easily seen that appending $\lceil \log_2 m \rceil$ bits is enough to preserve uniqueness, where m is the number of successors of the node, provided that m > 1. If m = 1, the node appends one bit to its own path string to form the path string sent to the successor, in order to maintain the prefix relationship between the path string of a node and that of its child.
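The bit-assignment rule can be illustrated with a short Python sketch. The function name and the use of Python strings of "0"/"1" characters (with the empty string standing for the empty path string) are assumptions made for illustration only; the width computation uses `(m - 1).bit_length()`, which equals $\lceil \log_2 m \rceil$ for m > 1.

```python
def successor_path_strings(pstr, m):
    """Path strings handed to the m successors of a node whose own string is pstr.

    Each successor receives pstr followed by a distinct fixed-width bit pattern,
    so pstr is a proper prefix of every child string and no child string is a
    prefix of another.
    """
    if m == 0:
        return []
    # One bit when there is a single successor, ceil(log2 m) bits otherwise.
    width = 1 if m == 1 else (m - 1).bit_length()
    return [pstr + f"{k:0{width}b}" for k in range(m)]

print(successor_path_strings("", 2))    # root with two successors
print(successor_path_strings("0", 3))   # node "0" with three successors
print(successor_path_strings("1", 1))   # node "1" with a single successor
```

Because all strings generated by one node have the same length and differ in at least one bit, brothers can never be in prefix relationship, while every child extends its father's string by at least one bit.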
Fig. 1. An example execution of our algorithm, panels (i) through (v). The message names are abbreviated to their first letters for brevity.
The idea is illustrated in Figs. 1(i) through 1(iii) on the WFG shown in Fig. 1(i), assuming that node a initiates the algorithm with path string $\lambda$. The deadlock detection message, named ASK, carries three arguments: the path string generated for the successor, the path string of the sender, and the sender identifier, in that order. As shown in Fig. 1(i), node a sends the path strings $\lambda \|$"0" and $\lambda \|$"1" to its successors. The other nodes generate path strings for their successors in the same way. Upon receiving an ASK for the first time, each node saves the carried path string as its own; in the figure it is marked next to the encircled node identifier. In Fig. 1(iii), tree edges are drawn as solid lines and non-tree edges as dashed lines. Note that the path string of a node is a prefix of those of its descendants, and that the path strings of brothers are not in prefix relationship. Let $pstr_p$ denote the path string of a node p. Using the above idea, the type of a non-tree edge (j, i) is easily defined:
a forward edge iff $pstr_j$ is a prefix of $pstr_i$;
a cross edge iff neither $pstr_i$ nor $pstr_j$ is a prefix of the other;
a back edge iff $pstr_i$ is a prefix of $pstr_j$.

To resolve all deadlocks reachable from the initiator in one execution, our algorithm operates in a hierarchical fashion. The key idea is as follows. Each tree node reports to its father information on the edges incident on the nodes of its subtree. Since the ancestor-descendant relationship between any two nodes can be inferred from their path strings, message length can be reduced by not reporting tree edges to ancestors. Only non-tree edges are reported, along with the path strings of the nodes incident on those edges. When this information has been collected from all of its successors, a node performs the deadlock detection and resolution activity. Deadlock resolution thus proceeds step by step, starting from the nodes at the bottom level of the tree. The last node to perform the activity is the root of the tree.
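The edge classification above reduces to two prefix tests. The following Python sketch is illustrative only; the function names are assumptions, and path strings are again represented as Python strings with the empty string as the empty path string.

```python
def is_prefix(alpha, beta):
    """True iff alpha is a prefix of beta in the sense of Definition (3):
    beta equals alpha followed by a non-empty string."""
    return len(alpha) < len(beta) and beta.startswith(alpha)

def classify_non_tree_edge(pstr_j, pstr_i):
    """Type of a non-tree edge (j, i), given the path strings of j and i."""
    if is_prefix(pstr_j, pstr_i):
        return "forward"   # j is an ancestor of i
    if is_prefix(pstr_i, pstr_j):
        return "back"      # i is an ancestor of j
    return "cross"         # neither is an ancestor of the other

print(classify_non_tree_edge("", "01"))    # root to a descendant
print(classify_non_tree_edge("01", "0"))   # descendant to an ancestor
print(classify_non_tree_edge("00", "1"))   # between different branches
```

Distinct tree nodes always carry distinct path strings, so the case of two equal strings does not arise for an edge between different nodes.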
The detailed description of the deadlock detection and resolution activity at a node is provided in [6]. Its key idea is as follows. A node i constructs its own WFG, denoted by $WFG_i = (V_i, E_i)$, upon receiving information on the non-tree edges incident on the nodes of its subtree. For each reported non-tree edge (p, q), $V_i = V_i \cup \{p, q\}$ and $E_i = E_i \cup \{(p, q)\}$. For each q in $V_i$, find the node p in $V_i$ with the longest path string such that $pstr_p$ is a prefix of $pstr_q$, and insert (p, q) into $E_i$, indicating that p is an ancestor of q. Then the following steps are repeated: (i) a depth-first search is executed on $WFG_i$ to record any cycles along with the nodes involved in them; if any cycle is recorded, proceed with step (ii); (ii) select as victim the node that is involved in the most recorded cycles and remove all edges incident on the victim from $WFG_i$; if the recorded cycles are not all resolved, repeat step (ii). When the depth-first search finds no more cycles, node i sends the remaining non-tree edge information to its father, which then proceeds similarly.

Next we give a detailed description of the algorithm as executed at an arbitrary node i. The variable $father_i$ holds the identifier of the father of node i in the tree, and $I_i$ is the set of collected edge information. Initially, $I_i$ and $father_i$ are set to $\emptyset$ and zero, respectively.

When node i initiates the algorithm:
    pstr_i := λ;
    proc_propagateASK();

When node i receives an ASK(succ_pstr, pstr_j, j):
    if i has released the resource requested by j then
        send REPLY(∅) to j;
    else if father_i = 0 and i is not the initiator then
    begin  /* a tree edge is found */
        father_i := j;
        pstr_i := succ_pstr;
        if there is a successor then proc_propagateASK();
        else send REPLY(∅) to father_i;
    end
    else send REPLY((pstr_j, pstr_i) : (j, i)) to j;  /* a non-tree edge is found */

When node i receives a REPLY(I):
    I_i := I_i ∪ I;
    if all the ASK messages sent by i have been replied then
    begin
        perform local deadlock detection and resolution activity based on I_i;
        if i is not the initiator then send REPLY(I_i) to father_i;
        else terminate the algorithm;
    end if

procedure proc_propagateASK()
    n := number of successors;
    if n = 1 then succ_pstr := pstr_i || "0";
    else succ_pstr := pstr_i || (⌈log2 n⌉ number of "0" bits);
    for each successor
    begin
        send ASK(succ_pstr, pstr_i, i) to the successor;
        succ_pstr := binary addition of succ_pstr and 1;
    end for
end procedure

Figs. 1(iv) and (v) show a possible sequence of algorithm execution following the actions in Fig. 1(iii). In Fig. 1(iv), when back and cross edges are identified, REPLYs are sent carrying the path strings and the identifiers of the nodes incident on those edges. Upon receiving all of its REPLYs, node d forms a WFG from the carried information, depicted inside the box next to the node identifier. In addition to the received edges, node d inserts (a, d) and (a, c) into its WFG, since node a's path string $\lambda$ is a prefix of those of d and c. Node e forms its own WFG in the same way. In Fig. 1(v), node d resolves the deadlock by selecting itself as the victim. Consequently, the information collected at node d becomes obsolete, and an empty set is reported to its father, node b. All the other nodes simply propagate the received information to their fathers upon finding no deadlock in it.

In the following theorem, $\alpha <_p \beta$ denotes that path string $\alpha$ is a prefix of path string $\beta$.

Theorem 1. If there exists a deadlock reachable from the initiator at the initiation of the algorithm, the deadlock will be resolved, unless it is resolved by another execution of the algorithm started by a different initiator.

Proof. Suppose there remains an unresolved cycle C composed of the edges (1, 2), (2, 3), ..., (n-1, n) and (n, 1) at the initiation time. There must be at least one non-tree edge in C. We prove the theorem in two cases as follows:
Case 1: Assume there is at least one back edge, say (k, k+1), in C. Then $(pstr_k, pstr_{k+1}) : (k, k+1)$ is carried in a REPLY sent back to node k. Thus, (k, k+1) is in $WFG_k$. Since C is not resolved, no WFG at any node contains a directed path from node k+1 to k. However, node k is informed that $pstr_{k+1}$ [...] from k+1 to k can be deduced. No edge in C will be removed while REPLY messages are sent up the tree, since otherwise the cycle would be resolved. Hence, there must be a node whose local wait-for graph contains the path from k+1 to k as well as the edge (k, k+1). Then, as proved in Case 1, the cycle containing k and k+1 will be resolved by the depth-first search. □
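The local detection and resolution activity summarized earlier in this section (and detailed in [6]) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of [6]: the function names, the way simple cycles are enumerated, and the tie-breaking rule (smallest identifier among the nodes lying on the most cycles) are all choices made here. Each reported item has the form ((pstr_p, pstr_q), (p, q)).

```python
def is_prefix(a, b):
    """a is a proper prefix of b."""
    return len(a) < len(b) and b.startswith(a)

def simple_cycles(nodes, edges):
    """Enumerate simple cycles by DFS; each cycle is rooted at its smallest node."""
    succ = {v: sorted(q for (p, q) in edges if p == v) for v in nodes}
    cycles = []
    def dfs(start, v, path):
        for w in succ[v]:
            if w == start:
                cycles.append(list(path))
            elif w > start and w not in path:
                dfs(start, w, path + [w])
    for s in sorted(nodes):
        dfs(s, s, [s])
    return cycles

def resolve_locally(reported):
    """Return (victims, remaining): chosen victims and the non-tree edges
    that would still be reported to the father."""
    pstr, edges = {}, set()
    for (sp, sq), (p, q) in reported:
        pstr[p], pstr[q] = sp, sq
        edges.add((p, q))
    # Add an edge from the nearest known ancestor of each node (longest prefix).
    for q in pstr:
        ancestors = [p for p in pstr if is_prefix(pstr[p], pstr[q])]
        if ancestors:
            edges.add((max(ancestors, key=lambda x: len(pstr[x])), q))
    victims = []
    cycles = simple_cycles(pstr, edges)               # step (i)
    while cycles:
        count = {}
        for c in cycles:
            for v in c:
                count[v] = count.get(v, 0) + 1
        victim = max(sorted(count), key=count.get)    # step (ii), assumed tie-break
        victims.append(victim)
        edges = {(p, q) for (p, q) in edges if victim not in (p, q)}
        cycles = [c for c in cycles if victim not in c]
        if not cycles:
            cycles = simple_cycles(pstr, edges)       # back to step (i)
    remaining = [r for r in reported if not set(r[1]) & set(victims)]
    return victims, remaining

# Hypothetical input: a back edge d -> a and a cross edge c -> d.
reported = [(("00", ""), ("d", "a")), (("1", "00"), ("c", "d"))]
print(resolve_locally(reported))
```

In this hypothetical input, nodes a and d each lie on both recorded cycles, so the choice between them depends entirely on the assumed tie-breaking rule.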
Theorem 2. When a deadlock is declared by the algorithm, the deadlock in fact exists, unless it is resolved by another execution of the algorithm.

Proof. Suppose the algorithm finds a back edge. Then no node on the declared cycle could have released its resource and thus be executing, since otherwise the ASK could not have been propagated along the entire cycle. Now assume that the tree constructed by the algorithm does not include a back edge. Consider a general cycle including no back edges, as shown in Fig. 2(i). For convenience, we define $ASK_i$ to be the ASK received and propagated by node i, for i = 1, ..., n. Let $P_i$ be the path from node i to node (i+1) mod n. Fig. 2(ii) shows a simplified picture of the cycle using this notation. Let $D_{P_i}$ be $\bigcap_j D_{e_j}$, where the edges $e_j$ are on $P_i$ and $D_{e_j}$ is the duration of $e_j$. It is known that $D_{P_i} \neq \emptyset$, since the ASK has been propagated through $P_i$. We need to prove that $\bigcap_{i=1}^{n} D_{P_i} \neq \emptyset$. The theorem is proved by induction on n.

First suppose n = 2. Assume $D_{P_1} \cap D_{P_2} = \emptyset$. Let $t_{ij}$ be the time when $ASK_i$ is received by node j, and further propagated if the edge that delivered the message is a tree edge. Also, let $t_{P_i}(S)$ and $t_{P_i}(T)$ be the beginning and termination times of $D_{P_i}$. Since $D_{P_1} \cap D_{P_2} = \emptyset$ by the assumption, suppose $t_{P_1}(S) < t_{P_1}(T) < t_{P_2}(S) < t_{P_2}(T)$ without loss of generality. Note that $t_{P_i}(T)$ is equal to the termination time of the last edge in $P_i$. Since $ASK_2$ has been transmitted through all the edges in $P_2$ by $t_{21}$, we have $t_{P_2}(S) < t_{21}$. Therefore, $t_{P_1}(T) < t_{21}$. That is, the last edge of $P_1$ was removed before $ASK_2$ reached node 1. This implies that node 2 released the resource it had been holding before it even acquired the resource it had been waiting for, since the message could not have been delivered further if any edge in $P_2$ had been removed. This contradicts our model.

Now suppose n = k. Since $\bigcap_{i=1}^{k-1} D_{P_i} \neq \emptyset$ by the inductive assumption, we only need to prove that $\left(\bigcap_{i=1}^{k-1} D_{P_i}\right) \cap D_{P_k} \neq \emptyset$. This can be proved as in the case n = 2 by considering the two paths $\bigcup_{i=1}^{k-1} P_i$ and $P_k$. □

Fig. 2. (i) A generalized cycle consisting of only cross and tree edges and (ii) its simplification.
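The message flow of the ASK/REPLY procedure given earlier in this section can be traced with a small simulation. The Python sketch below is a simplification under several assumptions: the WFG is static (so the "resource already released" branch never fires), all messages pass through one global FIFO queue, and the local detection and resolution step is omitted, so every non-tree edge is simply forwarded to the root. The function and variable names are hypothetical.

```python
from collections import deque

def run(wfg, initiator):
    """Simulate one execution on a static WFG.
    Returns (path strings, fathers, edge info collected at the root, messages sent)."""
    pstr = {initiator: ""}          # the initiator starts with the empty path string
    father = {}
    pending = {}                    # node -> number of ASKs not yet replied
    info = {v: [] for v in wfg}     # node -> collected non-tree edge information
    queue = deque()                 # FIFO of (kind, destination, payload)
    sent = 0

    def propagate(i):
        nonlocal sent
        succ = wfg[i]
        width = 1 if len(succ) == 1 else (len(succ) - 1).bit_length()
        pending[i] = len(succ)
        for k, j in enumerate(succ):
            queue.append(("ASK", j, (pstr[i] + f"{k:0{width}b}", pstr[i], i)))
            sent += 1

    def reply(dest, edges):
        nonlocal sent
        queue.append(("REPLY", dest, list(edges)))
        sent += 1

    propagate(initiator)
    while queue:
        kind, i, payload = queue.popleft()
        if kind == "ASK":
            succ_pstr, pstr_j, j = payload
            if i not in pstr:                     # first ASK: a tree edge is found
                father[i], pstr[i] = j, succ_pstr
                if wfg[i]:
                    propagate(i)
                else:
                    reply(j, [])
            else:                                 # a non-tree edge is found
                reply(j, [((pstr_j, pstr[i]), (j, i))])
        else:                                     # REPLY
            info[i].extend(payload)
            pending[i] -= 1
            if pending[i] == 0 and i != initiator:
                reply(father[i], info[i])
    return pstr, father, info[initiator], sent

wfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["b"]}
pstr, father, root_info, sent = run(wfg, "a")
print(pstr, father, root_info, sent)
```

The example graph has e = 5 edges reachable from a, and the simulation sends one ASK and one REPLY per edge.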
3. Discussion

In our algorithm, a node transmits at most one ASK to each of its successors, all at once. Each ASK is answered with at most one REPLY. Hence, 2e messages are generated in 2n time units per algorithm execution, where n and e are the numbers of nodes and edges reachable from the initiator, respectively.

To see how much efficiency is gained by resolving all deadlocks in one execution of our algorithm, we compare it, through a simple analysis, with the representative probe-based algorithm presented by Chandy et al. [3]. For both algorithms, it is assumed that a blocked node needs to check at every time-out ($T_o$) whether it is in deadlock by executing the algorithm. In our algorithm, if a node executes the algorithm, then all deadlocks reachable from the node are resolved. Hence, a node needs to invoke the algorithm only when it has been waiting for $T_o$ time since the last algorithm execution initiated by any node. According to the algorithm in [3], however, only the initiator declares a deadlock, upon receiving its own probe back. Hence, a blocked node must initiate the algorithm every $T_o$ time units to find out whether it is in deadlock. Two performance indices are selected: the number of algorithm initiations (N) and the deadlock duration (D), i.e., how long a deadlock lasts after it is formed.

Let us first consider the number of initiations made by a particular blocked node u. Let $T_b$ denote the time interval during which node u has been blocked. In [3], each node invokes the algorithm independently of other nodes, and thus node u will invoke the algorithm every $T_o$ time units while it is blocked, which results in $N = \lfloor T_b / T_o \rfloor$. To calculate N for our algorithm, let $T_n$, n = 1, 2, ..., represent the successive time intervals between executions of the algorithm at node u initiated by other nodes. We assume that the $T_n$ are independent and identically distributed with an exponential distribution with parameter $\lambda$. According to our algorithm, node u itself will invoke the algorithm if and only if $T_n$ exceeds $T_o$ for some n = 1, 2, .... Hence the probability p that node u itself initiates the algorithm is $P\{T_n > T_o\} = e^{-\lambda T_o}$. Then the probability that node u initiates the algorithm i times during $T_b$ is $\binom{k}{i} p^i (1-p)^{k-i}$, where $k = \lfloor T_b / T_o \rfloor$. Hence, the expected value of N for node u is

$N = \sum_{i=1}^{k} i \binom{k}{i} p^i (1-p)^{k-i}$.

For example, if $T_b = 500$, $T_o = 100$, and $\lambda = 0.01$, then $N = \lfloor 500/100 \rfloor = 5$ in Chandy et al.'s algorithm, whereas $N \approx 1.8$ in our algorithm. This comparison considers only one node, and thus the difference would be much greater when all nodes are considered.

Now consider the deadlock duration D. Let $T_d$ be the time taken to detect a cycle after an initiation, regardless of the initiator. Assuming that all the nodes in the cycle become blocked at about the same time $t_c$, the algorithm in [3] would be invoked by nodes in the cycle not earlier than $t_c + T_o$. Thus the algorithm in [3] would take approximately $T_o + T_d$ time to detect the cycle. In our algorithm, however, if any node waiting outside the cycle initiates the algorithm at time $t_{init}$, with $t_c < t_{init} \le t_c + T_o$, then the cycle can be detected in $t_{init} - t_c + T_d$. If there is no such node, the cycle would be detected in $T_o + T_d$ as in [3], when a node in the cycle initiates the algorithm. Let T denote the random variable representing $t_{init} - t_c$. We assume that T is exponentially distributed with parameter $\lambda$. Then

$D = \int_0^{T_o} t \, d(P\{T < t\}) + \int_{T_o}^{\infty} T_o \, d(P\{T < t\}) + T_d$,

where $P\{T < t\} = 1 - e^{-\lambda t}$. If $T_o = 100$, $T_d = 5$, and $\lambda = 0.01$, then D = 105 in Chandy et al.'s algorithm, whereas $D \approx 68$ in our algorithm. Here, we assumed for simplicity that $T_d$ is the same regardless of the initiator. However, if the initiator is outside the cycle, the delay for its message to reach the cycle needs to be considered in calculating $T_d$ for greater accuracy.

References

[1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Computer Algorithms (Addison-Wesley, Reading, MA, 1974).
[2] G. Bracha and S. Toueg, A distributed algorithm for generalized deadlock detection, in: ACM Symp. on Principles of Distributed Computing (1984) 285-301.
[3] K.M. Chandy, J. Misra and L.M. Haas, Distributed deadlock detection, ACM Trans. Comput. Systems 1 (1983) 144-156.
[4] A.N. Choudhary, W.H. Kohler, J.A. Stankovic and D. Towsley, A modified priority based probe algorithm for distributed deadlock detection and resolution, IEEE Trans. Software Engineering 15 (1989) 10-17.
[5] A.D. Kshemkalyani and M. Singhal, Invariant-based verification of a distributed deadlock detection algorithm, IEEE Trans. Software Engineering 17 (1991) 789-799.
[6] S. Lee, Distributed deadlock detection algorithms and their performance study, Ph.D. Dissertation, Dept. of Computer Science, Texas A&M University, 1994.
[7] S. Lee and J.L. Kim, Efficient deadlock resolution in distributed systems, in: Proc. IFIP WG10.3 Internat. Conf. on Applications in Parallel and Distributed Computing (1994) 73-81.
[8] M. Singhal, Deadlock detection in distributed systems, IEEE Comput. 22 (1989) 37-48.
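As a check on the numerical examples in Section 3, the two performance indices can be evaluated directly. The Python sketch below uses hypothetical function names. For D it relies on the closed form $(1 - e^{-\lambda T_o})/\lambda + T_d$, which follows from evaluating the two integrals in the expression for D with $P\{T < t\} = 1 - e^{-\lambda t}$; this simplification is a derivation added here, not a formula stated in the paper.

```python
from math import comb, exp, floor

def expected_initiations(Tb, To, lam):
    """Expected number of initiations by one blocked node in the proposed scheme:
    sum over i of i * C(k, i) * p^i * (1 - p)^(k - i), with k = floor(Tb / To)
    and p = exp(-lam * To)."""
    k = floor(Tb / To)
    p = exp(-lam * To)
    return sum(i * comb(k, i) * p**i * (1 - p)**(k - i) for i in range(1, k + 1))

def expected_deadlock_duration(To, Td, lam):
    """Expected deadlock duration E[min(T, To)] + Td for exponential T with rate lam."""
    return (1 - exp(-lam * To)) / lam + Td

N_ours = expected_initiations(Tb=500, To=100, lam=0.01)
D_ours = expected_deadlock_duration(To=100, Td=5, lam=0.01)
print(round(N_ours, 1), round(D_ours))   # compare with N = 5 and D = 105 for [3]
```

The sum for N is the mean of a binomial distribution, so it equals $k p = 5 e^{-1}$ for the parameters used in the example.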