JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 6, 498-514 (1989)
Detection of Mutual Inconsistency in Distributed Databases

K. V. S. RAMARAO*

Department of Computer Science, University of Pittsburgh, Pittsburgh, Pennsylvania 15260

Received November 21, 1986

A Distributed Database Management System should guarantee the consistency of databases at all sites in the system. In particular, when the databases are replicated, mutual consistency among all copies of an item should be maintained. Network partitioning is a difficult kind of failure to deal with. A strategy known as the optimistic approach allows transactions to be run independently in different parts of the distributed system formed due to a partition. The mutual consistency among the copies of an item cannot be taken for granted in this scenario. Thus a natural requirement of the system is that it determines whether or not the database is consistent and tries to resolve any inconsistencies. A simple scheme is presented in this paper to detect mutual inconsistency when partitioned databases merge. This scheme is similar in spirit to that of D. S. Parker et al. [IEEE Trans. Software Eng. (May 1983)], although it is more general. In contrast to that of S. Davidson [ACM Trans. Database Systems (Sept. 1984), 456-482], where a certain amount of information is maintained about each transaction run after the partition, our scheme maintains certain information with each data item accessed after the partition. © 1989 Academic Press, Inc.
1. INTRODUCTION

One of the major motivations for the construction of distributed systems is the continued availability of shared information despite failures. The information is usually replicated at a number of sites in the system to meet this requirement. Then, the failure of a reasonable number of sites and/or communication links can be tolerated without appreciable degradation of the system performance. But the replicated information in turn creates a new problem: that of ensuring mutual consistency among various copies of an item. Our interest in this paper is in the management of replicated databases in the presence of one of the most difficult kinds of failures, network partitioning.

* Present address: SBC Technology Resources, St. Louis, Missouri 63141.
0743-7315/89 $3.00 Copyright © 1989 by Academic Press, Inc. All rights of reproduction in any form reserved.

A database is a set $DB = \{(d_1, V_1), (d_2, V_2), \ldots, (d_n, V_n), F\}$ where $d_i$ is the name of a data item, $V_i$ is the value-set of $d_i$, and $F$ is a set of constraints on the vectors from $V_1 \times V_2 \times \cdots \times V_n$. For instance, the following constraint is natural in a banking environment: the balance in an account is equal to the total withdrawals subtracted from the total deposits. Thus, the values taken by the items balance, deposits, and withdrawals are bound together by this constraint. Let $D = \{d_1, d_2, \ldots, d_n\}$. Any element $I$ of $V_1 \times V_2 \times \cdots \times V_n$ is an instance of $D$. For simplicity in terminology, we refer to the instances of $D$ as the instances of the database $DB$ itself. An instance $I$ is consistent if, in $I$, the constraints in $F$ are all satisfied. Since the set of constraints $F$ is supposed to model the physical reality, it is expected that any instance of the database visible to the user be consistent. Unfortunately, determining if a given database instance is consistent is practically infeasible [5]. Thus, the notion of transactions has been widely accepted as a useful tool in ensuring consistency: a transaction $T$ is an ordered sequence of actions $\{a_1, a_2, \ldots, a_k\}$ where $a_j: V_1 \times V_2 \times \cdots \times V_n \to V_1 \times V_2 \times \cdots \times V_n$ and for any consistent instance $I$, $a_k a_{k-1} \cdots a_1(I)$ is also consistent. A consequence of this is that a transaction should be implemented as an atomic action: either all of the effects of a transaction are incorporated into the database or none are. (In other words, no nonempty subsequence of $T$ can guarantee the consistency.) The transaction is said to be committed in the former case and aborted in the latter.

A distributed system consists of a number of sites interconnected by a set of communication links. A distribution $DD$ of $DB$ is a collection of subsets $D_1, D_2, \ldots, D_m$, $m > 1$, of $D$ such that $\bigcup_{i=1}^{m} D_i = D$. $DDB = (DD, F)$ is a distributed database corresponding to the distribution $DD$ of the database $DB$. An instance $DI$ of $DD$ is a collection of instances $I(D_1), I(D_2), \ldots, I(D_m)$ of $D_1, D_2, \ldots, D_m$, respectively.
The instance $DI$ is consistent if (a) each $I(D_i)$ is consistent (internal consistency) and (b) for any $d_i$ in $D_j \cap D_l$ for $j \neq l$, the value of $d_i$ in $I(D_j)$ is the same as the value of $d_i$ in $I(D_l)$ (mutual consistency). An instance $DI$ which is not mutually consistent is mutually inconsistent. When the database is distributed, many sites may participate in the execution of a transaction due to the distribution of data items. In this case, the all-or-none implementation at each site participating in a transaction guarantees the internal consistency, but not the mutual consistency (when some sites commit but others abort the same transaction). Mutual consistency demands that either all sites participating in a transaction commit it or all of them abort it.

Failures in a distributed system can lead to database inconsistency if not properly handled. Misbehavior by processors has been widely studied in the literature. The interested reader is referred to [13] for an informal survey of this work. Communication links can also misbehave: for instance, they can generate messages spontaneously, or even lose some messages. Given such a situation, the following approach to reliable distributed computing is widely accepted:
Step 1. Design schemes to mask the misbehavior of processors and links so that a well-defined behavior is exhibited by them when seen from a higher level of abstraction, and

Step 2. Design agreement protocols to handle these expected classes of failures.

The most widely accepted well-defined behavior for processors and links is fail-stop: such components simply stop when an unhandled internal failure occurs and remain so until the operating components explicitly allow them to operate again [9]. Solutions which implement such components with acceptable degrees of reliability are available in the literature [10]. Protocols for agreement in the presence of fail-stop processor failures have been studied in [2, 11, 12].

A combination of fail-stop failures of links, processors, or both can, depending on the topology of the network, lead to a more complex kind of failure known as network partition. Here the network is partitioned into a number of groups of sites such that a site in a group cannot communicate with a site in a different group. Transaction processing in the presence of a network partition is very complex. In this paper, we study one of the fundamental problems arising from continued transaction processing despite a partition, that of detecting possible inconsistencies among the copies in different groups. Essential features of the related works in the literature are presented at the appropriate locations.
2. DETECTION OF MUTUAL INCONSISTENCY
Assume that a partition occurs in a distributed system at a certain (logical, global) time. A number of transactions may be in progress at that point of time. Since the communication graph is not connected, it may not be possible to complete all of these transactions atomically until the communication is regained. To see this, consider the following simple scenario: the sites participating in a transaction are located among two groups and all sites in one of the groups decide to commit it while all sites in the other group decide to abort it. Due to the absence of any communication between the two groups, no group can know the decision made in the other group. Thus, the sites in at least one of the groups should wait until the communication is reestablished before they can complete the transaction atomically. If the number of such transactions is large, then the availability of the databases in each of the groups is drastically reduced, even if all data items are physically accessible from each of the sites. This is a rather undesirable situation but unfortunately there is no protocol that can prevent this scenario from arising in the worst case [12]. In certain applications such as banking and airline reservations, such reduced database availability may not be acceptable for practical reasons. Thus, transactions are completed despite the insufficient information available in partitioned systems. This in turn may leave certain transactions committed in some groups and aborted in others. Furthermore, new transactions are also independently initiated in various groups in the presence of a partition. Some of these transactions may update the databases in the groups in which they are being processed, while the other groups are not aware of the existence of these transactions.

When two groups come into contact, it is necessary that the databases in both groups are first updated to reflect the effects of all transactions completed until that time. A major aspect of this update process is that all copies of replicated data items are made mutually consistent. This is not straightforward as can be seen from the following simple example: a distributed banking system is partitioned into two groups and a withdrawal of some amount x is made from an account in each of the groups. Assuming that the copies of the item balance had identical values at the time of partition, they will still have identical values when the two groups merge. But clearly the total amount withdrawn is 2x and not x. Hence, the value of the balance in each of the groups needs to be updated! This example shows that the copies of an item being identical is not a sufficient condition for mutual consistency when transactions are processed independently in noncommunicating clusters of sites in a distributed system. It also shows that the criterion for mutual consistency should involve the interaction among such independently executed transactions. Below, we define the notion of mutual consistency in a partitioned environment. The traditional serializability theory is informally introduced first.
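The arithmetic of the banking example can be made concrete with a small sketch. The starting balance and the amount withdrawn are hypothetical values chosen only for illustration; they do not come from the paper.

```python
# Hypothetical numbers: both groups start from the same replicated balance.
balance_at_partition = 100
x = 30

# Each group applies its own withdrawal to its local copy, unaware of the other.
copy_in_group_1 = balance_at_partition - x
copy_in_group_2 = balance_at_partition - x

# The two copies agree with each other ...
copies_identical = copy_in_group_1 == copy_in_group_2

# ... but neither of them reflects both withdrawals.
correct_balance = balance_at_partition - 2 * x

print(copies_identical, copy_in_group_1, correct_balance)
```

Comparing the copies alone would report agreement here, which is why the criterion has to look at the transactions that produced the values.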
When a number of transactions are concurrently run, it may be possible to interleave the actions of different transactions while guaranteeing that any transaction has only a consistent view of the database. This notion is formalized as follows [5]: Any sequence of possibly interleaved actions of a set of transactions in which the actions of each particular transaction appear in their original order is called a schedule for those transactions. A schedule is a serial schedule if there are no interleavings. A schedule is equivalent to another schedule if, given an initial database instance, both schedules when independently executed on it generate the same final instance and any transaction reading an item reads the same value of the item in both executions. A schedule is serializable if it is equivalent to a serial schedule. It is known [1] that in a failure-free situation, a schedule of distributed transactions guarantees database consistency if and only if it is serializable and each transaction is atomically implemented. Thus, one would naturally expect that when groups merge after a partition, the (global) database instance is consistent if and only if the totality of schedules in each group when appropriately unified is serializable. Davidson provides this unification in [3], which we shall describe. Four important assumptions are made: (a) each transaction reads all items it updates (that is, the read-set of a transaction is a superset of its write-set), (b) the updated values of all data items in a transaction's write-set are functions of the values of all data items in the read-set of that transaction, (c) the schedules in each group are serializable, and (d) there exists a mechanism to detect a network partition and inform all sites about such an occurrence within a fixed amount of time after the occurrence.

Assume for simplicity that two groups $S$, $S'$ are merging. Construct a serialization graph as follows: represent all transactions committed in $S$, $S'$ by nodes in the graph; for transactions within the same group, the arcs in the graph are precisely the arcs in the standard serialization graph: those that correspond to the reads-before relationship and those that correspond to the reads-from relationship; for transactions in different groups, place arcs corresponding to read-write conflicts, from readers to writers. More formally [3], let $H_1 = T_{11}, T_{12}, \ldots, T_{1k}$, $H_2 = T_{21}, T_{22}, \ldots, T_{2m}$ be the sequences of transactions appearing in the serial schedules in $S$, $S'$, respectively. For $i = 1, 2$, place the ripple edges $T_{ij} \to T_{ik}$ only if $j < k$ and there exists $d$ in write-set($T_{ij}$) $\cap$ read-set($T_{ik}$) and there is no $p$ such that $j < p < k$ and $d$ is in write-set($T_{ip}$); then place the precedence edges $T_{ij} \to T_{ik}$ only if $j < k$, there is no ripple edge $T_{ij} \to T_{ik}$, there exists $d$ in read-set($T_{ij}$) $\cap$ write-set($T_{ik}$), and there is no $p$ such that $j < p < k$ and $d$ is in write-set($T_{ip}$); finally, place the interference edges $T_{1i} \dashrightarrow T_{2j}$ if and only if there exists $d$ such that $d$ is in read-set($T_{1i}$) $\cap$ write-set($T_{2j}$). Interference edges from the transactions in $S'$ to those in $S$ are also defined similarly.
Observe that the ripple edge $T_{ij} \to T_{ik}$ in this graph corresponds to the relationship $T_{ik}$ reads-from $T_{ij}$, the precedence edge $T_{ij} \to T_{ik}$ corresponds to the relationship $T_{ij}$ reads-before $T_{ik}$, and these two types of relationships are among the transactions run in the same group. We need not consider the write-write conflicts since the read-set is assumed to contain the write-set, for all transactions. The interference edges try to order the intergroup transactions with common items accessed in conflicting modes. Also, a transaction that has read an item in one group updated by a transaction in the other group must precede the writer in any equivalent serial schedule. Thus the interference edges are placed from reading transactions to the writing transactions of the other group.

Consider the following example: $T_1$, $T_2$, $T_3$ are transactions in $S$ and $T_4$, $T_5$ are transactions in $S'$, appearing in that order in equivalent serial schedules. Let read-set($T_1$) = write-set($T_1$) = $\{d_1, d_2\}$, read-set($T_2$) = $\{d_2, d_3\}$, write-set($T_2$) = $\{d_2\}$, read-set($T_3$) = $\{d_1, d_4\}$, write-set($T_3$) = $\{d_1\}$, read-set($T_4$) = (41, write-set($T_4$) = {d,}, read-set($T_5$) = {$d_2$, d,}, write-set($T_5$) = $\emptyset$. See Fig. 1 for the corresponding serialization graph. Schedules in $S$, $S'$ guaranteeing the mutual consistency among the copies in $S$, $S'$ can now be characterized in terms of the serialization graph:
FIG. 1. A serialization graph. Solid line, ripple edges. Dashed line, interference edges.
THEOREM 1 [3]. The copies of data items in $S$, $S'$ are mutually consistent if and only if the corresponding serialization graph is acyclic.
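The graph construction and the acyclicity test of Theorem 1 can be sketched as follows. This is an illustrative sketch only, not Davidson's implementation: the `Txn` record, the function names, and the two-transaction data at the end are hypothetical, and the ripple and precedence edges are stored in one edge set since only acyclicity is tested.

```python
from collections import namedtuple

# Hypothetical record type: a committed transaction with its read- and write-sets.
Txn = namedtuple("Txn", ["name", "reads", "writes"])


def intra_group_edges(seq):
    """Ripple and precedence edges among the transactions of one group (serial order)."""
    edges = set()
    for k, tk in enumerate(seq):
        for j in range(k):
            tj = seq[j]
            between = seq[j + 1:k]
            # Ripple edge: Tk reads an item whose latest writer before Tk is Tj.
            ripple = any(
                d in tk.reads and not any(d in tp.writes for tp in between)
                for d in tj.writes
            )
            # Precedence edge: Tj reads an item that Tk is the next to write.
            precedence = any(
                d in tk.writes and not any(d in tp.writes for tp in between)
                for d in tj.reads
            )
            if ripple or precedence:
                edges.add((tj.name, tk.name))
    return edges


def interference_edges(readers, writers):
    """Edges from a reader in one group to each writer of that item in the other group."""
    return {
        (tr.name, tw.name)
        for tr in readers
        for tw in writers
        if tr.reads & tw.writes
    }


def serialization_graph(h1, h2):
    return (
        intra_group_edges(h1)
        | intra_group_edges(h2)
        | interference_edges(h1, h2)
        | interference_edges(h2, h1)
    )


def is_acyclic(nodes, edges):
    """Repeatedly remove nodes with no incoming edge; a cycle leaves nodes behind."""
    remaining = set(nodes)
    edges = set(edges)
    while remaining:
        sources = {n for n in remaining if not any(v == n for (_, v) in edges)}
        if not sources:
            return False
        remaining -= sources
        edges = {(u, v) for (u, v) in edges if u in remaining}
    return True


# Hypothetical data: one transaction per group, each reading what the other writes.
s = [Txn("T1", {"d1", "d2"}, {"d1"})]
s_prime = [Txn("T2", {"d1", "d2"}, {"d2"})]
edges = serialization_graph(s, s_prime)
print(is_acyclic({"T1", "T2"}, edges))
```

In this data each transaction reads an item that the other group's transaction writes, so the two interference edges form a cycle and the copies would be reported as mutually inconsistent.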
One interesting special case studied in the literature occurs when each transaction accesses exactly one data item [7]. Consider a simple example: $T_1$, $T_2$, $T_3$ are transactions in $S$ and $T_4$, $T_5$, $T_6$ are transactions in $S'$, appearing in that order in equivalent serial schedules. Let read-set($T_1$) = write-set($T_1$) = read-set($T_3$) = read-set($T_5$) = write-set($T_5$) = $\{d_1\}$, write-set($T_3$) = $\emptyset$, read-set($T_2$) = write-set($T_2$) = read-set($T_4$) = write-set($T_4$) = read-set($T_6$) = $\{d_2\}$, write-set($T_6$) = $\emptyset$. The corresponding serialization graph is shown in Fig. 2. Observe that the serialization graph is a collection of (weakly) connected components in this case. Hence, the graph is acyclic if and only if all components are acyclic. Note further that each component contains all nodes accessing the same data item. Now we ask, under what conditions does a component remain acyclic? Consider the following example before we answer this question: $T_1$, $T_2$ are run in $S$ and $T_3$, $T_4$ are run in $S'$ where read-set($T_1$) = read-set($T_2$) = read-set($T_3$) = read-set($T_4$) = $\{d_1\}$, write-set($T_1$) = write-set($T_3$) = $\{d_1\}$, write-set($T_2$) = write-set($T_4$) = $\emptyset$. The corresponding component is given in Fig. 3. There are cycles in this component. But the important fact to note is that a cycle remains even if the read-only transactions are discarded. In other words, it is sufficient to consider only the updating transactions in detecting mutual inconsistency. (It can be easily proved to be the case in general.) The following scheme is a corollary of this observation: with each data item, keep the list of transactions that updated that data item, in their chronological order; consistency is reported
FIG. 2. A serialization graph for the special case of transactions accessing single data items.
(i.e., the corresponding component in the graph is acyclic) only if one of the lists is a prefix of the other or both lists are identical. Parker et al. have further encoded these lists by noting that a transaction can be uniquely represented by the id of the site where it originated. Thus, a vector they call a version vector is maintained with each data item in a group: at any time, for each site in the system, the version vector gives the number of transactions originated at that site that have updated that data item. Mutual consistency is reported as long as one of the vectors of each item "dominates" all other vectors of that item. See [7] for more details. This simple scheme cannot be carried over to more general cases. Neither of the assumptions it rests on is valid in general: (i) the database need not be mutually consistent even if each of the data items is, and (ii) read-only
FIG. 3. A component of a serialization graph.
transactions play a crucial role in deciding the consistency. The following examples depict this:

EXAMPLE 1. $T_1$ is run in $S$ and $T_2$ is run in $S'$ where read-set($T_1$) = read-set($T_2$) = $\{d_1, d_2\}$, write-set($T_1$) = $\{d_1\}$, and write-set($T_2$) = $\{d_2\}$. Neither $d_1$ nor $d_2$ is updated by more than one transaction. Thus, each of them is mutually consistent. Together they are not, since there is a cycle in the serialization graph.

EXAMPLE 2. $T_1$, $T_2$ are run in $S$ and $T_3$, $T_4$ are run in $S'$ (in that order) where read-set($T_1$) = write-set($T_1$) = $\{d_1\}$, read-set($T_3$) = write-set($T_3$) = $\{d_2\}$, read-set($T_2$) = read-set($T_4$) = $\{d_1, d_2\}$, and write-set($T_2$) = write-set($T_4$) = $\emptyset$. The serialization graph is acyclic if the read-only transactions are discarded but has a cycle if they are included.
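For reference, the per-item dominance test of the version-vector scheme summarized earlier can be sketched as below. The dictionary representation (site id mapped to an update count) and the function names are assumptions made for this illustration; see [7] for the actual scheme.

```python
def dominates(u, v):
    """True if version vector u is at least v in every site's component."""
    sites = set(u) | set(v)
    return all(u.get(s, 0) >= v.get(s, 0) for s in sites)


def item_is_consistent(vectors):
    """Report consistency if some copy's vector dominates all the others."""
    return any(all(dominates(u, v) for v in vectors) for u in vectors)


# Hypothetical copies of one item held in two groups, with sites named A and B.
ahead = {"A": 2, "B": 0}
behind = {"A": 1, "B": 0}
diverged = {"A": 1, "B": 1}

print(item_is_consistent([ahead, behind]))
print(item_is_consistent([ahead, diverged]))
```

The first pair is reported consistent because one copy has simply seen more updates; the second pair is a conflict because each copy has seen an update the other has not. As the examples above indicate, a test of this kind looks at one item at a time and ignores read-only transactions, so it does not settle the general case.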
2.1. Detection of Mutual Inconsistency: The General Case
Theorem 1 can be directly used to detect mutual inconsistency: let each site maintain the read- and write-sets of all transactions committed there; assume that the sites in each group elect a unique site as the coordinator of that group (see [6] for some election algorithms); when two groups merge, the coordinator of each group constructs a serial schedule of the transactions committed in that group; finally, the coordinators together construct the serialization graph and determine whether it is acyclic. Independently, Parker and Ramos have extended the above version-vector-based solution to the general case [8]. But that procedure is highly complex and is known to be incorrect [4]. Here, we present a scheme which requires less space and computation time than Davidson's in most instances. Our scheme can be viewed as "orthogonal" to Davidson's in the sense that we introduce a precedence relation among the data items depending on how they were manipulated by the transactions, while her scheme induces a precedence relation among the transactions themselves. Thus, we associate certain information with each data item as in the version-vector solution. First we introduce some terminology.

DEFINITION 1. Let $H = (T_{11}, T_{12}, \ldots, T_{1k})$, $H' = (T_{21}, T_{22}, \ldots, T_{2m})$ be the sequences of transactions appearing in the serial schedules in $S$, $S'$, respectively. We call $H$, $H'$ the transaction sequences of $S$, $S'$, respectively.

DEFINITION 2. For data items $d$, $d'$ we say that $d$ immediately precedes $d'$ in $S$ if the following conditions hold: there exist $T_{1p}$, $T_{1q}$ for $p < q$ such that
(a) $T_{1p}$ updates $d$,
(b) $T_{1q}$ reads $d'$, and
(c) there is a $d''$ in read-set($T_{1q}$) $\cap$ write-set($T_{1p}$) and there is no $u$, $p < u < q$, such that $d''$ is in write-set($T_{1u}$).
DEFINITION 3. The relation precedes is defined on the data items as the transitive closure of the relation immediately precedes. Precedence in S’ is also defined similarly.
Intuitively, $d_i$ precedes $d_j$ in $S$ if $d_j$ is read by a transaction in $S$ which either updates $d_i$ or is strictly later in any equivalent serial schedule than some transaction which updates $d_i$. We now give a characterization for serializable schedules in terms of precedences among data items. This result forms the backbone of our detection scheme.

THEOREM 2. The serialization graph formed from the schedules of two groups $S$, $S'$ is acyclic if and only if there do not exist data items $d_i$, $d_j$ such that $d_i$ precedes $d_j$ in $S$ and $d_j$ precedes $d_i$ in $S'$.
Proof. Necessity: Assume that there exist $d_i$, $d_j$ such that $d_i$ precedes $d_j$ in $S$ and $d_j$ precedes $d_i$ in $S'$. We claim that there are transactions $T_{1p}$, $T_{1q}$ and $T_{2r}$, $T_{2s}$ such that $T_{1p}$ updates $d_i$, $T_{1q}$ reads $d_j$, $T_{2r}$ updates $d_j$, and $T_{2s}$ reads $d_i$, and there are paths in the serialization graphs of $S$, $S'$ from $T_{1p}$ to $T_{1q}$ and $T_{2r}$ to $T_{2s}$, respectively. We shall prove this for $S$. The case of $S'$ is similar. The claim is clearly true if $d_i$ immediately precedes $d_j$ in $S$. Otherwise assume that $d_i$ immediately precedes $d_l$, $d_l$ precedes $d_j$, and the claim is true for $d_l$ and $d_j$. Let $T_{1u}$ be the transaction that reads $d_l$ and some item in write-set($T_{1p}$) (so that $d_i$ immediately precedes $d_l$) and let $T_{1v}$ be the transaction updating $d_l$ from which there is a path to $T_{1q}$. Thus there are the following paths: from $T_{1v}$ to $T_{1q}$, from $T_{1p}$ to $T_{1u}$, and from $T_{1u}$ to $T_{1v}$. Hence the claim follows. Consequently, the serialization graph obtained after the merging of $S$, $S'$ has the following paths: $T_{1p}$ to $T_{1q}$, $T_{2r}$ to $T_{2s}$, $T_{1q}$ to $T_{2r}$, and $T_{2s}$ to $T_{1p}$. Thus, there is a cycle in the serialization graph.

Sufficiency: Assume that there is a cycle in the serialization graph. Then there exist transactions $T_{1p}$, $T_{1q}$, $T_{2r}$, $T_{2s}$ for $p \le q$ and $r \le s$ such that there are paths from $T_{1q}$ to $T_{2r}$ and from $T_{2s}$ to $T_{1p}$, and there are paths from $T_{1p}$ to $T_{1q}$ and from $T_{2r}$ to $T_{2s}$. But this implies that there is a $d_i$ read by $T_{1q}$ and updated by $T_{2r}$ and a $d_j$ read by $T_{2s}$ and updated by $T_{1p}$. This means that $d_j$ precedes $d_i$ in $S$ and $d_i$ precedes $d_j$ in $S'$. Q.E.D.

We use this result to devise a scheme to detect the inconsistencies. Specifically we want to associate a set precedence-set($d$) with each data item $d$ read in each group. Inconsistency can be detected at the time of merging by checking the above condition. But it is not obvious from the definition how to compute the precedence-set($d$). We first note that any updates to this set caused by a transaction $T$ must be done either during or after the commitment of $T$.
We choose the commitment time for the updates also, since otherwise we need to maintain elsewhere the effects of $T$ on the precedence-sets. For simplicity and convenience, we make the following assumption: the commitment order of transactions (after a partition is detected) is consistent
with the serialization order (in that group). This assumption is quite practical: all systems with which we are familiar have this property. The implication of this assumption is that if the effects of each transaction $T$ are recorded at its commitment time then the precedence-set($d$) for any $d$ in read-set($T$) entirely reflects the effects of all transactions preceding $T$ in the serialization graph of that group.

What are the effects of a transaction $T$ on the precedence-sets of the data items in read-set($T$)? In other words, what are the data items to be added to precedence-set($d$) for a $d$ in read-set($T$)? Clearly any item in write-set($T$) precedes $d$. If $d'$ is in read-set($T$) $\cap$ write-set($T'$) for some $T'$ "directly preceding" $T$ (that is, no transaction writes $d'$ after $T'$ does but before $T$ reads) then $d'$ precedes $d$. Finally, any data item $d''$ preceding such a $d'$ also precedes $d$. Since we do not maintain the serialization graph, we must derive the same information in a different way, using the current precedence-sets. This is done as follows: with each $d$, we associate a set we call the link-set($d$), which is the set of data items in the write-set of the transaction that has most recently updated $d$ (and is committed after the partition is detected). It is now easy to compute the new precedence-sets: first find the union of the write-set($T$) with all link-sets of data items in read-set($T$) (call it $A$); then find the union of the precedence-sets of all data items in $A$ (call this set $B$); finally, for each data item $d$ in read-set($T$), insert the items in this set $B$ into the current precedence-set($d$). This represents the latest precedence-set($d$). Following is a more formal description of the algorithm.

Procedure Update-Lists
begin
  * Each item d has a set precedence-set(d) which
  * contains items preceding it in that group and a link-set(d).
  for each transaction T do
  begin
    A ← ∅;
    for all d in read-set(T) do
      if d in link-set(d) then A ← A ∪ link-set(d);          -----(1)
    for all d in write-set(T) do
      if d not in link-set(d) then A ← A ∪ link-set(d);
    A ← A ∪ write-set(T);                                    -----(2)
    B ← ∅;
    for all d in A do B ← B ∪ precedence-set(d);             -----(3)
    for each d in read-set(T) do
    begin
      precedence-set(d) ← precedence-set(d) ∪ B;
      if write-set(T) non-empty then link-set(d) ← write-set(T)
    end
  end
end.
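The commit-time bookkeeping can be sketched in Python as follows. This is a sketch under stated assumptions rather than a transcription of the procedure: it follows the prose description of $A$ and $B$, it records the items of $A$ themselves along with $B$ (which the argument for Lemma 1 appears to rely on), and it refreshes link-sets only for the items a transaction actually updates, in line with the prose definition of link-set. The class and method names are hypothetical.

```python
from collections import defaultdict


class GroupState:
    """Per-group bookkeeping kept after a partition is detected (hypothetical name)."""

    def __init__(self):
        self.precedence_set = defaultdict(set)  # item -> items preceding it
        self.link_set = defaultdict(set)        # item -> write-set of its last updater

    def commit(self, read_set, write_set):
        """Record the effects of one committed transaction, in commitment order."""
        # A: the transaction's own write-set plus the link-sets of everything it read.
        a = set(write_set)
        for d in read_set:
            a |= self.link_set[d]
        # B: everything already known to precede an item of A.
        b = set()
        for d in a:
            b |= self.precedence_set[d]
        # Assumption: the items of A are recorded together with B.
        for d in read_set:
            self.precedence_set[d] |= a | b
        # Assumption: link-sets are refreshed for the items actually updated.
        for d in write_set:
            self.link_set[d] = set(write_set)


# Hypothetical group: T1 reads and writes d1, then a read-only T2 reads d1 and d2.
s = GroupState()
s.commit({"d1"}, {"d1"})
s.commit({"d1", "d2"}, set())
print(dict(s.precedence_set))
```

After the two commits, d1 appears in the precedence-sets of both d1 and d2: the read-only transaction picks up d1 through link-set(d1), which is how read-only transactions leave a trace without a serialization graph being kept.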
The following procedure is run at the time of group merging. One of the coordinators runs it to detect inconsistencies.

Procedure Detect-Inconsistency
begin
  for each d_i, d_j do
    if d_i is in precedence-set(d_j) in S AND d_j is in precedence-set(d_i) in S'
    then report inconsistency
end.

LEMMA 1. After the completion of a transaction $T$, precedence-set($d_i$) for any $d_i$ in read-set($T$) consists of all $d_j$ preceding $d_i$.
Proof. The proof is by induction on the length of the sequence of transactions in a serial schedule. The claim is clearly true if $T$ is the first transaction. Assume that $T$ is the $k$th transaction and that the claim is true before $T$ is started. First we note that our algorithm correctly determines link-set($d$) for all $d$. Now, let any new $d_j$ be added to precedence-set($d_i$) by the algorithm. Assume that $d_j$ belongs to $A$ in the algorithm at the step marked (2). The claim follows if $d_j$ is in write-set($T$). Assume not. If $d_j$ belongs to $A$ at the step marked (1), then there is a transaction $T'$ such that $d_j$ is in write-set($T'$), and write-set($T'$) $\cap$ read-set($T$) $\neq \emptyset$. $d_j$ thus precedes $d_i$ by the definition of the precedence relation. Assume that $d_j$ is not in $A$ at the step marked (1). Then there is a transaction $T'$ such that $d_j$ is in write-set($T'$) and read-set($T'$) $\cap$ write-set($T$) $\neq \emptyset$. Again $d_j$ precedes $d_i$. Finally assume that $d_j$ is not in $A$ but is in $B$. Then there is some $d_l$ in $A$ such that $d_j$ precedes $d_l$ (by the algorithm and the induction hypothesis). Since $d_l$ precedes $d_i$ by the above argument, $d_j$ also precedes $d_i$ by the definition of the precedence relation. Q.E.D.

As a consequence, we have the following theorem:

THEOREM 3. Procedure Detect-Inconsistency reports inconsistency if and only if it really exists.
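The merge-time check can be sketched as a search for a pair of items with opposite precedences in the two groups. The representation is an assumption of this sketch: each group's precedence-sets are given as a dictionary mapping an item to the set of items preceding it, and the function name and the sample data are hypothetical.

```python
def find_inconsistency(prec_s, prec_s_prime):
    """Return a pair (di, dj) with di preceding dj in S and dj preceding di in S', if any."""
    for dj, before_dj in prec_s.items():
        for di in before_dj:
            # di precedes dj in S; look for the reverse precedence in S'.
            if dj in prec_s_prime.get(di, set()):
                return (di, dj)
    return None


# Hypothetical precedence-sets: d1 precedes d2 in S while d2 precedes d1 in S'.
prec_s = {"d1": {"d1"}, "d2": {"d1"}}
prec_s_prime = {"d1": {"d2"}, "d2": {"d2"}}
print(find_inconsistency(prec_s, prec_s_prime))

# Groups that touched disjoint items yield no such pair.
print(find_inconsistency({"d1": {"d1"}}, {"d2": {"d2"}}))
```

The case $d_i = d_j$ is covered as well: an item that lies in its own precedence-set in both groups has been updated on both sides of the partition and is reported.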
2.2. Resource Requirements

The space requirements of our scheme are reasonable: the number of entries in any precedence-set is no more than the number of data items and
this upper bound holds no matter how large the number of transactions is. By contrast, the space required to implement Davidson's scheme is proportional to the number of transactions. On the other hand, certain computation time is required in our solution during the commitment of each transaction, which is not the case with Davidson's. But the time taken by our solution at the time of group merging is less than that taken by Davidson's if the number of transactions is large. To be more precise, assume that $m$ and $n$ respectively are the numbers of transactions committed in $S$, $S'$ and let $r_{ij}$, $w_{ij}$, for $i = 1, 2$, respectively be the number of data items in the read- and write-sets of the transaction $T_{ij}$. Note that precedence-sets can be implemented as bitstrings of length $k$ where $k$ is the number of data items. Then the union operations on precedence-sets translate into OR operations on bitstrings.

We observe that the transitive closure operation performed in our algorithm can be potentially expensive. Our objective is thus to postpone the computation of the transitive closure until the actual merging of two groups. We record with each item only those items that immediately precede it and compute the complete precedence relation only when it is necessary. Thus, at the time of commitment of a transaction $T$, the following steps take place: (a) compute the set $A$ as in the procedure Update-Lists and (b) for each $d$ in read-set($T$), append $A$ to precedence-set($d$) and mark the new items with the time-stamp of $T$ (the time-stamp here refers to any number assigned to each transaction that is consistent with the commitment order); if an item occurs more than once, then delete the previous occurrence (that is, the one with a lower time-stamp).
The computation of the precedence relation at the time of a merging is done as follows: construct a precedence graph from the existing precedence-sets by placing an arc from $d$ to $d'$ if $d$ immediately precedes $d'$ and label each arc by the corresponding time-stamp; $d$ precedes $d'$ if there is a path from $d$ to $d'$ which either has a single arc or has nonincreasing time-stamps on its arcs. With this optimization, the number of OR operations at the commitment of $T_{ij}$ does not exceed $2 r_{ij}$. Thus, the total number of OR operations is bounded above by $2 \sum_j r_{ij}$ for $i = 1, 2$ and $j = 1, \ldots, m$ and $1, \ldots, n$, respectively.

The detection procedure can be implemented by sending the precedence graph of one of the coordinators to the other, which checks for the required condition. Checking if $d_i$ is in precedence-set($d_j$) in $S'$ and $d_j$ is in precedence-set($d_i$) in $S$ can be done as follows: AND the precedence-set($d_i$) in $S$ with a mask bitstring containing 1 in the $j$th position only; AND the precedence-set($d_j$) in $S'$ with a mask bitstring containing 1 in the $i$th position only; and finally, AND the results of these two operations. This requires a total of $3k(k - 1)$ AND operations in the worst case. By contrast, $O(mn + n^2 + m^2)$ set intersection operations are required in Davidson's scheme to construct the serialization graph. Then, $O((m + n)^2)$
node scan operations are required in the worst case to determine the acyclicity of the graph. Assuming that the sets involved are again represented as bitstrings, each set intersection can be implemented as an AND operation. Thus, the total time spent in our scheme in the worst case when all data items are in the read-sets of all transactions is $O(k^2)$. Even in this case, our complexity compares favorably with that of the other scheme when $n = O(k)$ or $m = O(k)$. Furthermore, the computations at the time of transaction commitment are performed only after a partition occurs and thus are not expensive overheads.
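The bitstring form of the check can be sketched with Python integers standing in for bitstrings of length $k$. The zero-based numbering of items and the function names are assumptions of this sketch. Because the two masked results carry their bit in different positions, the sketch combines them with a logical `and` rather than a third bitwise AND.

```python
def bit(i):
    """Mask with a 1 in position i only (items are numbered 0..k-1 here)."""
    return 1 << i


def conflicting_pairs(prec_s, prec_s_prime):
    """prec_s[i] is an int whose bit j is set when item j precedes item i in S."""
    k = len(prec_s)
    pairs = []
    for i in range(k):
        for j in range(k):
            j_precedes_i_in_s = prec_s[i] & bit(j)
            i_precedes_j_in_s_prime = prec_s_prime[j] & bit(i)
            if j_precedes_i_in_s and i_precedes_j_in_s_prime:
                pairs.append((i, j))
    return pairs


# Hypothetical data with k = 2: item 0 precedes both items in S,
# and item 1 precedes both items in S'.
prec_s = [0b01, 0b01]
prec_s_prime = [0b10, 0b10]
print(conflicting_pairs(prec_s, prec_s_prime))
```

With this representation the unions performed at commit time reduce to the `|` operator on integers, and the merge-time scan visits each of the $k^2$ ordered pairs once.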
2.3. A Time-Storage Trade-off

We have shown above that the space requirements of our scheme are less than those in Davidson's scheme when the number of transactions is large and that our scheme requires more time at the time of committing a transaction than her scheme. Now we show that the time overhead at the time of commitment can be eliminated by increasing the space. This follows from the observation that the precedence-sets need not be updated immediately after a transaction is committed. The new solution works as follows: maintain a record of read- and write-sets of all transactions committed in a group and the transaction sequence (i.e., their relative order); when two groups can communicate, their coordinators first compute the precedence-sets for each of the data items; finally, one of the coordinators checks for the validity of the required condition as in the above procedure. With this modification, there is no time overhead while committing the transactions. All computation is done only when two groups merge. The space requirements on the other hand are identical to those in the existing scheme. The expressions derived above for the time are still unchanged and hence our scheme can outperform the existing one for a reasonably large number of transactions. As we have seen, even in the highly unlikely event of all transactions reading the whole database, our scheme is better if the number of transactions is of the order of the number of data items.
2.4. Updating the Precedence-Sets after a Merging
When two groups come into contact and no inconsistency is detected among their databases, certain data items may have to be updated to make their copies identical. For instance, if a withdrawal is made from a bank account in one group, that should be reflected in the copies of the account balance in the other group as well. Since there can be a large number of groups in a partitioned system, the partition may still be present after two groups are merged. Hence, the algorithm for the detection of inconsistencies may have to be run in the future involving the groups formed due to mergings. But the precedence-sets of items in the merged group need not represent
the status of the items accurately after the databases are updated unless they are also updated appropriately. Since the precedence-set of an item at a certain time t consists of all items preceding it in that group at time t, dj should be in precedence-set(di) in the merged group if and only if dj is in precedence-set(di) in one of the groups that have merged to form the new group. The time t here is the time at which the merging occurred. Furthermore, assume that no new transactions are run in a merged group before the databases are updated. Then, the precedence-set(di) after a merging is the union of its precedence-sets in each of the merging groups. Clearly this is implemented through k OR operations on the precedence-sets.
2.5. Detection of Inconsistency When Many Groups Merge
The procedure we have presented for the detection of mutual inconsistency has assumed that only two groups merge into a single group. In general, a number of groups may simultaneously merge together. Repeatedly merging two groups at a time is one simple way of dealing with this general case. But it can be done more efficiently: for each data item di, compute the union of all precedence-sets of di among the merging groups; if there are two precedence-sets of di both containing di, then report inconsistency; otherwise determine if there are di, dj such that di is in the (new) precedence-set(dj) and dj is in precedence-set(di); report inconsistency if such data items exist. The correctness proof of this scheme is similar to the special case and is omitted. The computation of precedence-sets can be delayed in this general case also until an actual merging occurs.
2.6. Special Case Revisited
Consider again the special case when each transaction accesses a single data item. Then clearly the precedence-set(di) can only be either ∅ or {di}.
Our condition for inconsistency then degenerates to the following: there is mutual inconsistency if and only if there exists a data item di such that two precedence-sets of di (in two groups) are both nonempty. Thus, if it is known a priori that no transaction accesses more than one data item, our scheme can be greatly simplified. A precedence-set can now be represented as a single bit. Thus, each group can represent all its precedence-sets together as a single bitstring of length k. Detection of mutual inconsistency is now equivalent to taking the AND of the bitstrings in different groups and checking whether the result is nonzero. Updating bitstrings can now be done at the time of transaction commitment itself and involves only a unit time. Thus, our solution in this case is superior in terms of space and time requirements to both of the existing schemes.
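The merge-time checks of Sections 2.5 and 2.6 can be sketched side by side, which also makes the degeneration visible. This is a minimal sketch under assumptions: function names are hypothetical, each precedence-set is a Python integer used as a bitstring (bit j set means dj is in the set), a group is a list of k such integers, and the second condition of Section 2.5 is read literally with i ≠ j assumed. The full definitions of precedence-sets and link-sets from earlier sections are not reproduced here.

```python
from typing import List


def union_sets(groups: List[List[int]]) -> List[int]:
    """Per-item union of precedence-sets across groups, one OR per item."""
    merged = [0] * len(groups[0])
    for sets in groups:
        for i, bits in enumerate(sets):
            merged[i] |= bits
    return merged


def inconsistent(groups: List[List[int]]) -> bool:
    """Literal reading of the many-group check; i != j is an assumption."""
    k = len(groups[0])
    # First condition: di lies in its own precedence-set in two groups.
    for i in range(k):
        if sum((sets[i] >> i) & 1 for sets in groups) >= 2:
            return True
    # Second condition: di in the merged set of dj and dj in that of di.
    merged = union_sets(groups)
    for i in range(k):
        for j in range(i + 1, k):
            if (merged[j] >> i) & 1 and (merged[i] >> j) & 1:
                return True
    return False


def inconsistent_single_item(bitstrings: List[int]) -> bool:
    """Special case: one bit per item and one bitstring per group.
    Reports True when some pair of groups has a nonzero AND."""
    seen = 0
    for bits in bitstrings:
        if seen & bits:
            return True
        seen |= bits
    return False


def expand(bits: int, k: int) -> List[int]:
    """Turn a single-bit-per-item string into k sets, each empty or {di}."""
    return [((bits >> i) & 1) << i for i in range(k)]
```

With k = 3, groups holding `0b001` and `0b010` touch different items and pass both checks, while two groups that both hold `0b001` are reported by both; `expand` shows that the single-bit form is the general form restricted to sets that are either empty or {di}.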
2.7. Handling Reconfigurations
We have implicitly assumed until now that once a partition occurs, the only change that takes place in the topology is a recovery to its original status. But this is not a realistic assumption since even while the system is recovering from the partition, not all sites may be able to communicate with each other starting at the same time (informally speaking). Thus, a recovery to the original topology may go through a sequence of partitions so that the number of groups reduces due to the merging of two or more groups together in passing from one partition to the next one in this sequence. Similarly, when a partition occurs, all groups may not be formed at once: there may be a sequence of partitions so that the number of groups increases due to the splitting of a single group into two or more groups. Also, it is possible that the number of groups remains the same while the constituents of the groups change. We treat all of these possibilities as reconfigurations. The algorithms developed previously do not work as they are if there are reconfigurations. But it is not hard to generalize the results to handle the reconfigurations.
A simple way to extend the algorithms previously constructed to cases where the system goes through several reconfigurations before achieving the initial topology is to make the following modifications: (a) the information is maintained at each site rather than just the coordinator of a group, (b) each group is assigned a unique identity (such as the id of the coordinator followed by the local time at that site when some event such as the detection of reconfiguration takes place) and the precedence-sets and link-sets are maintained separately for each reconfiguration, (c) the detection algorithm for the mutual inconsistency is run each time the constituents of a group change, and (d) a group does not process any new transaction until all copies of each data item in it have updated (identical) precedence-sets and link-sets.
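The bookkeeping implied by modifications (a) through (d) might look as follows. This is a hedged sketch only: `SiteState`, its field names, and its methods are hypothetical, the group identity is assumed to be a (coordinator id, local time) pair as suggested in (b), and the detection algorithm itself is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

GroupId = Tuple[str, int]              # (coordinator id, its local time), per (b)
ItemSets = Dict[str, Set[str]]         # item name -> set of item names


@dataclass
class SiteState:
    """State kept at every site, not only at a coordinator, per (a)."""
    # One table of precedence-sets and one of link-sets per reconfiguration.
    precedence: Dict[GroupId, ItemSets] = field(default_factory=dict)
    links: Dict[GroupId, ItemSets] = field(default_factory=dict)
    current: Optional[GroupId] = None
    accepting: bool = False

    def on_reconfiguration(self, coordinator: str, local_time: int) -> GroupId:
        """Constituents changed: open fresh tables and stop taking new
        transactions until detection has run (c) and sets agree (d)."""
        self.current = (coordinator, local_time)
        self.precedence.setdefault(self.current, {})
        self.links.setdefault(self.current, {})
        self.accepting = False
        return self.current

    def on_sets_synchronized(self) -> None:
        """All copies in the group now hold identical sets, so resume."""
        self.accepting = True

    def can_commit(self) -> bool:
        return self.accepting


site = SiteState()
gid = site.on_reconfiguration("site-3", 1042)
site.on_sets_synchronized()
```

Keying the tables by group identity keeps the sets of successive reconfigurations apart, so a later merge can still tell which group a recorded relationship was created in.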
2.8. Discussion
In our scheme for the detection of mutual inconsistency, two sets are associated with each data item accessed by transactions committed after a partition is detected. One of them, the precedence-set, can contain in the worst case all data items updated in a group. The other, the link-set, can contain the maximum number of data items updated by any single transaction. Thus if the system recovers within a short time to its original configuration, then these overheads can be reasonable. On the other hand, if the failure lasts for a long time, then the space used by these sets can grow large. A natural question arises: can some information in these sets be “forgotten” safely even if the system does not recover to its original status? Consider a relationship “di precedes dj.” Under what conditions can this be forgotten? In other words, when is it unnecessary for a group to retain this relationship? An obvious
sufficient condition is “when all sites with copies of di or dj know this relationship.” But it is not hard to realize that this is indeed a necessary condition also. Assume that certain sites containing copies of both di and dj (no single site may have copies of both, but together these sites have both items) do not know of the relationship. Then those sites can (in the worst case) together form a group and run transactions that create the relationship dj precedes di. If the first relationship is forgotten then no inconsistency will be reported when they merge, a contradiction.
Similarly, one might ask how long we can delay the recording of precedence- and link-sets after the partition is detected. By an argument similar to that above, we can show that one must record starting with the first transaction whose commitment is not known to all the sites participating in that transaction. Since the distributed system is not assumed to be tightly synchronous where a global clock is used, it may be possible that different sites detect the partition at different times (on the real clock not accessible to the sites). Thus we must “synchronize” the transaction commitment process with that of partition detection since otherwise different groups may consider different transactions to be the initial transactions to be recorded. The same is true with Davidson’s scheme also. If we assume that the communications system can notify exactly which of the messages sent at the time of the partition have actually been delivered then it is possible to achieve this easily. The coordinator of each group, using the information on the delivered messages, finds the set of sites knowing the commitment of each transaction. Since all coordinators use the same information they reach identical conclusions.
One important issue not considered here is that of resolving inconsistencies. This is a very hard question and no satisfactory approach has been proposed until now, to our knowledge.
Backing out some of the transactions is considered to be a possible solution [3] but it is argued that it may not be realistic in many environments [4]. We can know precisely which sets of data items are “incorrect” using our scheme but it is not clear how to correct them. Clearly information on the semantics of the data items is essential, as the example on bank balances has shown.
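The forgetting condition stated earlier in this section reduces to a subset test, sketched below. The function `can_forget` and its parameters are hypothetical names introduced for illustration; how a group learns which sites are informed is assumed to be given and is not addressed.

```python
from typing import Dict, Set, Tuple


def can_forget(relation: Tuple[str, str],
               copies: Dict[str, Set[str]],
               informed: Set[str]) -> bool:
    """A relationship 'di precedes dj' may be dropped only when every site
    holding a copy of di or of dj is known to be aware of it."""
    di, dj = relation
    holders = copies[di] | copies[dj]
    return holders <= informed


# Hypothetical placement of copies: x at s1 and s2, y at s2 and s3.
copies = {"x": {"s1", "s2"}, "y": {"s2", "s3"}}
safe = can_forget(("x", "y"), copies, informed={"s1", "s2", "s3"})
unsafe = can_forget(("x", "y"), copies, informed={"s1", "s2"})
```

In the second call site s3 holds y but does not know the relationship, which is the situation the necessity argument rules out.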
3. CONCLUSIONS
A simple scheme is presented in this paper to detect mutual inconsistency when partitioned databases merge. Davidson has previously given a solution for the same problem where the amount of work done is a function of the number of transactions executed after the partition occurs. Our solution on the other hand does work that depends on the number of items accessed after the partition occurs. Thus, if the granularity of data items is large (such as a
file/relation) then our scheme can be considerably less expensive. We believe that these two solutions are complementary to each other. We must mention that our scheme detects inconsistencies and determines which data items are incorrect. But it cannot inform the users regarding the transactions that have caused the inconsistencies. Also, we did not address the problem of resolving the inconsistencies, once they are detected.
REFERENCES
1. Bernstein, P. A., and Goodman, N. Concurrency control in distributed database systems. Comput. Surveys (June 1981).
2. Chin, F., and Ramarao, K. V. S. An information-based model for failure-handling in distributed database systems. IEEE Trans. Software Eng. SE-13, No. 4 (Apr. 1987), 420-431.
3. Davidson, S. Optimism and consistency in partitioned distributed database systems. ACM Trans. Database Systems (Sept. 1984), 456-482.
4. Davidson, S., et al. Consistency in a partitioned network: A survey. Tech. Rep., University of Pennsylvania, 1984.
5. Eswaran, K. P., et al. The notions of consistency and predicate locks in a database system. Comm. ACM (Nov. 1976), 624-633.
6. Garcia-Molina, H. Elections in a distributed computing system. IEEE Trans. Computers (Jan. 1982), 48-59.
7. Parker, D. S., et al. Detection of mutual inconsistency in distributed systems. IEEE Trans. Software Eng. (May 1983).
8. Parker, D. S., and Ramos, R. A. A distributed file system architecture supporting high availability. Proc. 6th Berkeley Workshop on Distributed Data Management and Computer Networks, Feb. 1982.
9. Schlichting, R. D., and Schneider, F. B. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS (Aug. 1983), 222-238.
10. Schneider, F. B. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS (May 1984), 145-154.
11. Skeen, D. Nonblocking commit protocols. ACM SIGMOD (1981), 133-142.
12. Skeen, D., and Stonebraker, M. A formal model of crash recovery in a distributed system. IEEE Trans. Software Eng. (May 1983), 219-228.
13. Strong, H. R., and Dolev, D. Byzantine agreement. Proc. IEEE COMPCON (Spring 1983), pp. 77-82.
K. V. S. RAMARAO received an M.Sc. in mathematics from Andhra University, India, in 1977, an M.Tech. in computer science from the Indian Institute of Technology, Kanpur, India, in 1979, and a Ph.D. in computing science from the University of Alberta, Edmonton, Canada, in 1984. He worked as a systems analyst for ITC Ltd., India, from 1979 to 1981 and as an assistant professor at the University of Oklahoma, Norman, from 1983 to 1984 and then at the University of Pittsburgh from 1984 to 1988. He is currently a senior technologist with Southwestern Bell Technology Resources, St. Louis, Missouri. His research interests include distributed databases and distributed algorithms. He is currently working on database support for intelligent networks.