Analysis of hybrid voting algorithm for replicated file systems
J B Dugan
The use of a hybrid between the available copies method and voting with witnesses to maintain consistency in a replicated file system is investigated. In such a system, the available copies method is augmented with witnesses, and a simple static voting algorithm is used. High levels of availability are possible with only two copies, with the added advantage that the consistency of the file system is maintained even if the network is partitioned. The transformation of copies and witnesses into each other, as needed, is discussed. The system is analysed via a stochastic Petri net model.
distributed systems, file systems, replicated file systems, Petri nets, algorithms
Center for Digital Systems Research, Research Triangle Institute, PO Box 12194, Research Triangle Park, NC 27709, USA
vol 33 no 4 may 1991

A replicated file system is often used in a distributed environment to improve file availability 1-3. In such a system, multiple copies of a file are maintained on different hosts to reduce vulnerability to host failures. The consistency of the files is often maintained by assigning votes to each copy of the file and by automatically assembling a quorum (majority) of votes for a file access 4. Associated with each representative is a timestamp, or version number, which is increased on each update. At any time, the most recently updated copies have the same (highest) version number and represent a majority of votes. Any assemblage of a majority of votes is guaranteed to contain at least one current representative (one with the highest version number). Each time a quorum of copies is gathered, any out-of-date copy participating in the quorum is brought up-to-date before the transaction is processed. Any copies not participating in the update become obsolete, as they will no longer have the highest version number.

Requiring a majority for each update ensures that at most one write set can exist at any time, and that a quorum automatically includes at least one of the most recently updated copies. Such a system tolerates host failures to the extent that a minority of votes may be unavailable at any time and a file update will still be permitted. Voting systems also maintain consistency of a replicated file system when the network becomes partitioned, as only one partition can form a quorum at any time.

The long-term benefits and costs associated with maintaining multiple copies of a file have been studied 5. It was concluded that, except for the problem of maintaining consistency in a partitioned network, the method of available copies is preferable to voting. When an update is to be performed under the available copies method, all copies that happen to be available are updated 6. (A copy is available if its host is up.) Updates can proceed as long as one copy exists. Loss of the last copy requires a manual re-load. The available copies method must have some auxiliary means to maintain consistency if a partition occurs in the network, as it allows updates to occur in any partition that contains at least one copy.

The author suggests maintaining consistency via a hybrid between voting and available copies, to take advantage of the relative merits of each. After the suggested hybrid consistency protocol is described, a stochastic Petri net model is presented to assess the availability of the proposed system. An availability analysis of a replicated file system using copies and witnesses to maintain consistency has been reported 7. This paper uses an extended model to investigate the availability of a replicated file system that uses the proposed available copies/voting with witnesses hybrid to maintain consistency. The availability of the proposed system is compared with that of a standard voting system with no witnesses, and with the availability of the available copies system. (A static voting scheme is assumed, as contrasted with the adaptive voting scheme 7 or the dynamic voting scheme 8.) Then several possible improvements are discussed and conclusions drawn.

This paper has two purposes: first, to investigate the use of a hybrid algorithm to maintain consistency in a replicated file system; second, to demonstrate the usefulness and flexibility of the stochastic Petri net model for analysing complex systems.
0950-5849/91/040273-08 © 1991 Butterworth-Heinemann Ltd

AVAILABLE COPIES AND WITNESSES
One major disadvantage of using standard voting techniques to maintain consistency of replicated files is the storage overhead required to keep redundant versions of the file. A minimum of three copies of a file is required to provide protection from a single host failure. In the case of three copies, the voting algorithm updates only two of the three copies on each write to the file, so there is nearly always one copy that is out-of-date. Paris has suggested replacing some of the redundant copies with witnesses to reduce storage costs 9. Witnesses contain no data, but simply record the version number of the most recent update witnessed. Paris also suggested that copies and witnesses can be dynamically transformed into each other.

The presence of both copies and witnesses as representatives of a replicated file provides an interesting performance/availability trade-off when deciding on the participants of a write quorum 7. Suppose that an update request is received, and that more than a majority of representatives are available. The desire for a fast update suggests that, after the mandatory selection of one current copy, the quorum be filled with witnesses, which are extremely fast to update. However, such an approach leaves the redundant copies out-of-date, and thus useless if the host on which the current copy resides fails. Conversely, copies can participate in a quorum, but the update process will then require more time. Further, it is not clear how to assign copy/witness status to a given number of representatives. The available copies method sheds some light on both of these questions.

The available copies method for maintaining file consistency is quite flexible, and ensures that the space used to store a file is not wasted on an out-of-date copy. All copies on operational hosts are updated at each file access; out-of-date files exist only temporarily (until the next update) on hosts that have recently failed and been repaired. The major drawback to the available copies method is that it allows simultaneous updates in different partitions of a disconnected network.

Consideration of the advantages and disadvantages of both methods suggests a hybrid method. Here a static majority voting algorithm is suggested in a configuration where the number of witnesses is at least as large as the number of copies. Then, if more than a majority of representatives is available when an update request is received, the quorum is filled with copies first, and with witnesses as needed.
The number of copies will always be a minority of the total number of representatives, and so an update quorum will contain all available copies. Further, the presence of inexpensive witnesses and the use of the voting algorithm ensure that simultaneous updates will not occur in a partitioned network. Updates can then continue until the last copy is lost (as long as more representatives are up than down); however, if a request is received and a quorum cannot be formed (if more representatives are down than up) the request is denied. The proposed method would also be applicable to a reliable device 10, in which blocks of a file are replicated and voted on, rather than the entire file.
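The quorum rule of the hybrid method can be illustrated with a small sketch. This is not the author's implementation: the `Rep` record, the function names, and the choice to treat a tie between up and down representatives as a denial are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rep:
    """Hypothetical record for one representative of the file."""
    name: str
    is_copy: bool   # True for a copy, False for a witness
    up: bool        # True if the host is operational
    version: int    # version number of the last update seen


def form_quorum(reps: List[Rep]) -> Optional[List[Rep]]:
    """Return the write quorum, or None if the request must be denied."""
    majority = len(reps) // 2 + 1
    up = [r for r in reps if r.up]
    if len(up) < majority:
        return None  # a strict majority of representatives is not up
    # Any majority contains a representative with the highest version.
    current = max(r.version for r in up)
    if not any(r.is_copy and r.version == current for r in up):
        return None  # no current copy is available
    copies = [r for r in up if r.is_copy]
    witnesses = [r for r in up if not r.is_copy]
    # All available copies first, then witnesses only as needed.
    needed = max(0, majority - len(copies))
    return copies + witnesses[:needed]


def write(reps: List[Rep]) -> bool:
    """Perform one update; every quorum member gets the new version."""
    quorum = form_quorum(reps)
    if quorum is None:
        return False
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:
        r.version = new_version
    return True
```

With two copies and three witnesses, every available copy is written on each update, and a repaired copy that missed an update cannot form a quorum on its own because the witnesses carry the newer version number.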
ANALYSIS OF PROPOSED METHOD
This section presents a stochastic Petri net model of a replicated file system that uses the proposed hybrid method to maintain file consistency. After defining stochastic Petri nets, the model is presented and solved and the results compared with systems that use the available copies method and voting with all copies. As it is known that a voting algorithm prevents simultaneous updates in a partitioned network, this possibility
is not modelled explicitly. It would be extremely difficult to model the underlying network topology without assuming an explicit configuration; such an assumption would severely limit the generality of the results. As the network topology underlying the file system is not being modelled, link failures are not considered. It is assumed that a host that is operational can communicate with any other operational host. The goal of modelling the proposed system is to compare its availability and cost with a pure voting system (with all copies) and with the available copies method.
Stochastic Petri net
A Petri net is a graphical model useful for modelling systems that exhibit concurrent, asynchronous, or nondeterministic behaviour 11,12. The nodes of a Petri net are places (drawn as circles), representing conditions, and transitions (drawn as bars), representing events. Tokens (drawn as small filled circles) are moved from place to place when the transitions fire, and denote the conditions holding at any given time. As an event is usually enabled by a combination of conditions, a transition is enabled by a combination of tokens in places. Arcs signify which combination of conditions must hold for the event to occur and which combination of conditions holds after the event occurs. A transition is enabled if each input place contains at least one token; an enabled transition fires by removing a token from each input place and depositing a token in each output place.

Stochastic Petri nets (SPNs) were defined by associating an exponentially distributed firing time with each transition 13,14. An SPN can be analysed by considering all possible markings (enumerations of the tokens in each place) and solving the resulting reachability graph as a Markov chain. Generalized SPNs (GSPNs) allow immediate (zero firing time) and timed (exponentially distributed firing time) transitions; immediate transitions are drawn as thin bars, timed transitions as thick bars. GSPNs are solved as Markov chains as well 15. The model used in this paper is basically the GSPN with variations from the original (or the current) definition, so the term stochastic Petri net (SPN) here has the generic meaning of 'Petri net with stochastic timing'.

Three additional features to control the enabling of transitions are included in the SPN model: inhibitor arcs, transition priorities, and enabling functions. An inhibitor arc 12 from a place to a transition disables the transition if the corresponding input place is not empty.
If several transitions with different priorities are simultaneously enabled in a marking, only the ones with the highest priority are chosen to fire, while the others are disabled. An enabling function is a logical function defined on the marking; if it evaluates to false, it disables the transition. More specifically, a transition t is enabled if and only if:
(1) there is at least one token in each of its input places
(2) there is no token in any of its inhibiting places
(3) its enabling function evaluates to true
information and software technology
(4) no other transition u with priority over t and satisfying (1), (2), and (3) exists
If several enabled transitions are scheduled to fire at the same instant, a probability distribution (possibly marking-dependent) is defined across them to determine which one(s) will fire. In the SPN presented, only immediate transitions require the specification of these probabilities, as the probability of simultaneous firing for continuous (exponential) distributions is zero.

Figure 1. Stochastic Petri net model of replicated file system (regions: copy status; request and update; witness status. Transitions include fail, repair, denied request, quorum OK, outdate, outdate finished, and update finished; the Form Quorum box is a subnet.)
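The four enabling conditions can be sketched in a few lines. The `Transition` record and the function names are hypothetical, and the sketch follows the convention used for the quorum formation net, where a smaller priority number means the transition fires first.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Marking = Dict[str, int]  # place name -> number of tokens


def always_true(marking: Marking) -> bool:
    """Default enabling function: never disables the transition."""
    return True


@dataclass
class Transition:
    """Hypothetical description of one SPN transition."""
    name: str
    inputs: List[str]
    inhibitors: List[str] = field(default_factory=list)
    priority: int = 1  # smaller number = higher priority (assumption)
    guard: Callable[[Marking], bool] = always_true


def locally_enabled(t: Transition, m: Marking) -> bool:
    """Conditions (1) to (3): inputs, inhibitor arcs, enabling function."""
    has_inputs = all(m.get(p, 0) >= 1 for p in t.inputs)
    not_inhibited = all(m.get(p, 0) == 0 for p in t.inhibitors)
    return has_inputs and not_inhibited and t.guard(m)


def enabled(transitions: List[Transition], m: Marking) -> List[Transition]:
    """Condition (4): keep only the highest-priority candidates."""
    candidates = [t for t in transitions if locally_enabled(t, m)]
    if not candidates:
        return []
    best = min(t.priority for t in candidates)
    return [t for t in candidates if t.priority == best]
```

If `enabled` returns more than one immediate transition, the firing probabilities mentioned above would be needed to choose among them; that step is left out of the sketch.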
Model
Figure 1 shows the overall model of the system. (The box labelled 'Form Quorum' represents another SPN that will be discussed later.) The Figure has three main portions: copy status (left), request and update processing (centre), and witness status (right). The places labelled DOC, UOC, DCC, and UCC represent the numbers of copies that are down and out-of-date, up and out-of-date, down and current, or up and current, respectively, while those labelled DOW, UOW, DCW, and UCW represent down and out-of-date, up and out-of-date, down and current, or up and current witnesses. These places are connected by transitions that represent failure
and repair of the hosts on which they reside. (The places that are double circled are shared with the Form Quorum net described later.) It is assumed that there is only one repair crew, and if more than one host is down, the next one to be repaired is selected randomly. This is reflected in the SPN model by dividing the base repair rate by the number of repair transitions that are currently enabled, and then assigning the resulting repair rate to each enabled repair transition. More complicated priority systems could also be modelled easily.

The centre portion of the net is the request/update portion. A token is in place WREQ while waiting for a request, at which time the token moves to place REQ. In this model, only write requests are considered, and an exponentially distributed time lapse between the completion of an update and the generation of a subsequent request is assumed. When a token is in place REQ (which denotes that a request has been received) only one of its two associated timed transitions (denied and quorum OK) is enabled, depending on the status of the copies and witnesses. The firing time distribution associated with these two transitions is identical, and it represents the amount of time needed to send status request messages and receive replies (to determine if a quorum is
possible). If more representatives are up than down, and at least one current copy is available, then the transition labelled quorum OK fires, and the update commences. If more representatives are down than up, the request is denied because a quorum cannot be formed.

The box labelled Form Quorum represents the sequence of actions (detailed in the next section) by which participants are selected for updating. The selected participants are taken from the UCC, UOC, UCW, and UOW places and deposited in the QC (copies in quorum) and QW (witnesses in quorum) places (see Figure 2). After the participants are selected, the remaining current (whether down or up) representatives are outdated (as they do not participate in the update). In the net, the outdate and outdate finished transitions have priority over the update transitions from the QC and QW places. After outdating is complete, the representatives in the QC and QW places will be updated. When the QC and QW places are empty, and the update request has been satisfied, the token is replaced in the WREQ place while the system waits for a new request to be generated.

Figure 2. Stochastic Petri net submodel of quorum formation
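The single repair crew rule described in the model (the base repair rate divided among the repair transitions that are currently enabled) can be written as a short helper. The transition names below are hypothetical; the place names DOC, DCC, DOW, and DCW are those of Figure 1, and the base rate of 1/8 per hour is an assumption that corresponds to a mean repair time of 8 hours.

```python
from typing import Dict

# Hypothetical names for the four repair transitions, keyed to the
# place (from Figure 1) that each one takes a failed representative from.
REPAIR_INPUTS = {
    "repair_out_of_date_copy": "DOC",
    "repair_current_copy": "DCC",
    "repair_out_of_date_witness": "DOW",
    "repair_current_witness": "DCW",
}


def repair_rates(marking: Dict[str, int],
                 base_rate: float = 1.0 / 8.0) -> Dict[str, float]:
    """Split the base repair rate evenly over enabled repair transitions."""
    enabled = [name for name, place in REPAIR_INPUTS.items()
               if marking.get(place, 0) > 0]
    if not enabled:
        return {}
    share = base_rate / len(enabled)
    return {name: share for name in enabled}
```

Whatever the marking, the rates returned sum to the base rate, which is what keeps the total repair capacity equal to that of one crew.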
Quorum formation
Now the model that details the quorum formation, shown in Figure 2, is considered. The START, DONE, QW, and QC places in Figure 2 are the same as those shown inside the Form Quorum box in Figure 1. The double-circled UCC, UCW, UOW, and UOC places are shared with the overall net. That is, a transition in Figure 2 that removes a token from the UCC place actually removes it from the UCC place in the overall net.

The transitions in Figure 2 are labelled with their priority and enabling function. The priority assignment determines the order in which a set of enabled transitions fire. A transition with priority 1 will fire before a transition with priority 2. The enabling function further defines the circumstances under which a transition is enabled. In this case, the logical function q is true when the sum of the number of tokens in places QC and QW is greater than the sum of the number of tokens in places UCC, UCW, UOW, and UOC. The function $\bar{q}$ is the logical complement of q and is true when the quorum is not complete.

The activity in the quorum formation net begins when a token is deposited in the START place. The first transition takes the obligatory current copy and includes it in the quorum. The other transitions fire in the order of their priority labels, and include current copies, outdated copies, current witnesses, and outdated witnesses in the quorum, in that order. (A simple change of the priorities changes the order of inclusion in the quorum.) The time associated with the transitions representing the inclusion of outdated representatives denotes the time needed to bring the selected representative up-to-date. All the transitions in this net have higher priority than those in the overall net. This priority assignment disallows the failure of hosts during the time that an update is being processed. A more complex net is needed to model this occurrence.
Comparative solution
This section models a simplified version of the system (so that it is amenable to simpler solution methods) and solves this model by various methods. This is done to help validate the SPN model. After the results of the simplified version of the model are shown to match results from alternative solutions (and there is thus confidence in the SPN model), the various complexities are added to the SPN model and it is solved again. When the more complex system is analysed using an SPN model, note that the simple alternative solutions used in this section are no longer applicable.

For the analysis, it is assumed that the mean time between failures of hosts is 100 hours and the mean time to repair failed hosts is 8 hours. A single representative (copy or witness) is available whenever its host is available; the (steady-state) availability of each host (assuming independence of hosts) is given by:

$$A_{\text{host}} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \frac{100}{108} = 0.9259$$

If there is only one repair crew, the actual availability of each host will be slightly lower (depending on the total number of hosts) because the hosts are not independent.
Analysis of available copies
The analysis of the available copies method is complicated by the repair dependency. If independence of the hosts is assumed (that is, assume that there is a large enough repair crew to assume independent parallel repairs), then the analysis of this system would simply be the analysis of a parallel redundant system. The availability of a system with n independent copies, using the available copies method to maintain file consistency, is given by 16:

$$A_{\text{available copies}} = 1 - (1 - A_{\text{host}})^{n}$$

If the hosts are dependent, then the availability can easily be analysed via a continuous time Markov chain (CTMC) 16. The Markov chain (see Figure 3) has n + 1 states if there are n copies. Transitions between states i and i - 1 (with rate $i\lambda$) represent the failure of a host; transitions between states i - 1 and i (with rate $\mu$) represent host repair. The availability of the system is given by the probability that the system is not in state 0, and is tabulated in the leftmost column of Table 1.

Figure 3. Markov chain model of system with n copies

Table 1. Availability of available copies and voting

| No of copies | Available copies | Voting hourly update | Voting daily update | Voting weekly update | Voting Markov chain |
|---|---|---|---|---|---|
| 1 | 0.9259 | 0.9258 | 0.9259 | 0.9259 | 0.9259 |
| 2 | 0.9891 | 0.8526 | 0.8527 | 0.8527 | 0.8527 |
| 3 | 0.9976 | 0.9675 | 0.9676 | 0.9677 | 0.9676 |
| 4 | 0.9993 | 0.9360 | 0.9361 | 0.9362 | 0.9361 |
| 5 | 0.9997 | 0.9768 | 0.9770 | 0.9770 | 0.9770 |
| 6 | 0.9999 | 0.9549 | 0.9551 | 0.9551 | 0.9551 |
| 7 | > 0.9999 | 0.9775 | 0.9777 | 0.9778 | 0.9777 |
| 8 | > 0.9999 | 0.9570 | 0.9572 | 0.9572 | 0.9572 |
| 9 | > 0.9999 | 0.9733 | 0.9735 | 0.9735 | 0.9735 |
| 10 | > 0.9999 | 0.9499 | 0.9501 | 0.9502 | 0.9502 |

Analysis of voting with all copies
For the sake of comparison, now look at a system using a static voting algorithm with various numbers of copies (no witnesses). The availability of such a voting system will invariably be lower than a system with the same number of copies operating under the available copies consistency control mechanism. The cost of maintaining consistency despite network partition is a lower steady-state availability.

If the status of the copies (current or out-of-date) is ignored, and it is assumed that the hosts are all independent (independent repair), then the availability of the file system can be simply estimated from the NMR redundancy equation 16:

$$A_{\text{voting with } n \text{ copies}} = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} A_{\text{host}}^{k} (1 - A_{\text{host}})^{n-k}$$

which would yield an optimistic result. If the repair dependency is taken into consideration, then the Markov chain in Figure 3 could be used, where the availability of the system is given by the sum of the state probabilities for states where more than half of the copies are up:

$$A_{\text{voting with } n \text{ copies}} = \sum_{i=\lfloor n/2 \rfloor + 1}^{n} \pi_i \qquad (2)$$
where $\pi_i$ is the steady-state probability of occupying state i.

The system with n copies is analysed using the voting algorithm to maintain file consistency via the SPN described in the previous section. For the SPN analysis, it is assumed that the time needed to send status request messages and receive replies (to determine if a quorum is possible) is exponentially distributed, with a mean of 1 second. Further, it is assumed that it takes an average of 0.5 seconds for each file update. Three update frequencies are analysed: on average, once per hour, once per day, and once per week. The availability of the file system is then interpreted from the solution of the SPN as the probability that there are more copies on operational hosts than there are on failed hosts and that at least one copy is current. These values appear in Table 1; the rightmost column shows the availability as calculated from equation (2).

The case where an even number of copies is used is interesting. A naive voting mechanism (as is assumed here), in requiring a strict majority of representatives, actually results in a lower availability with n copies (if n is even) than with n - 1 copies. This phenomenon is handled in fault-tolerant hardware systems by a method where, when a redundant component fails, an additional good component is thrown away as well, and voting continues with an odd number of representatives 17. This problem has been solved more simply 18, by selecting one site as a primary site; the primary site is used to break ties.

Adding extra copies (even in the case where the total number of copies remains odd) does not always increase the availability of the system because of the repair dependency between the hosts. The addition of two copies, when the system is already highly available, has a deleterious effect on availability because of the 'waiting line' for repair.
That is, if there is only one repairer, then the addition of components increases the average time to repair each component. This behaviour can be seen by comparing a system with seven copies with one that has nine copies in Table 1.
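The Markov chain of Figure 3 is a birth-death chain, so its steady-state probabilities follow from the balance condition between neighbouring states (repair at rate μ out of state i - 1 balances failure at rate iλ out of state i). The following sketch assumes the paper's values of 100 hours between failures and 8 hours to repair with a single repair crew; the function names are illustrative.

```python
from typing import List


def steady_state(n: int, mttf: float = 100.0, mttr: float = 8.0) -> List[float]:
    """Steady-state probabilities pi[i] of having i of n hosts up."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    weights = [1.0]  # unnormalised weight of state 0
    for i in range(1, n + 1):
        # Balance between states i-1 and i: pi[i-1] * mu = pi[i] * i * lam
        weights.append(weights[-1] * mu / (i * lam))
    total = sum(weights)
    return [w / total for w in weights]


def available_copies(n: int) -> float:
    """Availability under available copies: at least one host is up."""
    return 1.0 - steady_state(n)[0]


def voting(n: int) -> float:
    """Availability under voting, equation (2): a strict majority is up."""
    pi = steady_state(n)
    return sum(pi[n // 2 + 1:])


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, round(available_copies(n), 4), round(voting(n), 4))
```

Under these assumptions the sketch gives about 0.9891 for two available copies and about 0.9676 for voting with three copies, in line with Table 1, and it shows the drop from seven to nine copies that is attributed to the repair queue.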
ADDING WITNESSES
This section investigates the addition of witnesses to a fixed number of copies for a system that maintains file consistency via the proposed method. The performance of the method is investigated under three different frequencies for update requests: once per hour, once per day, and once per week (on average). First, the addition of witnesses to a single copy is considered, and then the more general (and useful) case.
Adding witnesses to single copy
Adding witnesses to a single copy does not increase the availability of the file system. If there is only a single copy, it must always be included in an update quorum. If there are any witnesses, then at least some of them must also be present, although they are actually providing no benefit. There are now two paths to failure. The first (which is present whether or not the witnesses exist) is activated when the copy fails. The additional path to failure is activated when enough witnesses fail (enough to constitute a majority of the representatives). Thus additional requirements are placed on the availability of the system with no added benefit. This fact is reflected in the analysis of a system with one copy and a variable number of witnesses, as shown in Table 2. The effect of the repair dependency as the number of witnesses increases can then be seen.

Table 2. Adding number of witnesses to one copy

| No of witnesses | Voting hourly update | Voting daily update | Voting weekly update |
|---|---|---|---|
| 0 | 0.9258 | 0.9259 | 0.9259 |
| 1 | 0.8526 | 0.8527 | 0.8527 |
| 2 | 0.9052 | 0.9053 | 0.9053 |
| 3 | 0.8797 | 0.8797 | 0.8797 |
| 4 | 0.8950 | 0.8950 | 0.8950 |
| 5 | 0.8761 | 0.8761 | 0.8760 |
| 6 | 0.8822 | 0.8821 | 0.8820 |
| 7 | 0.8640 | 0.8639 | 0.8638 |
| 8 | 0.8669 | 0.8668 | 0.8667 |
| 9 | 0.8469 | 0.8467 | 0.8465 |

Adding witnesses to several copies
In this section the number of copies in the replicated file system is varied, and for each fixed number of copies, witnesses are added. To implement the proposed hybrid between voting and available copies correctly, the number of witnesses must be at least as large as the number of copies. This ensures that if copies are given preference when forming a quorum, all available copies will be written on each update. Basically, enough useful copies need to be added to improve availability, and then enough witnesses added to avoid simultaneous updates in a partitioned network.

The results of the analysis appear in Table 3. For each fixed number of copies, the system is most available when the number of witnesses is one more than the number of copies (for the system considered here). Because of the repair dependency effect as the total number of representatives increases, the availability of the system is maximized when there are three copies and four witnesses. The availability of the system with three copies and four witnesses is comparable to the availability of a system with five copies and no witnesses (see Table 1), and is higher than the availability of a system with only three copies and no witnesses. This behaviour does not change when the frequency of updates changes.

Table 3. Availability associated with adding witnesses to several copies

| No of copies | No of witnesses | Voting hourly update | Voting daily update | Voting weekly update |
|---|---|---|---|---|
| 2 | 2 | 0.9360 | 0.9361 | 0.9362 |
| 2 | 3 | 0.9678 | 0.9680 | 0.9679 |
| 2 | 4 | 0.9467 | 0.9468 | 0.9468 |
| 2 | 5 | 0.9624 | 0.9625 | 0.9624 |
| 2 | 6 | 0.9427 | 0.9427 | 0.9426 |
| 2 | 7 | 0.9530 | 0.9529 | 0.9527 |
| 2 | 8 | 0.9307 | 0.9306 | 0.9304 |
| 3 | 3 | 0.9549 | 0.9551 | 0.9551 |
| 3 | 4 | 0.9754 | 0.9755 | 0.9755 |
| 3 | 5 | 0.9550 | 0.9552 | 0.9552 |
| 3 | 6 | 0.9688 | 0.9669 | 0.9688 |
| 3 | 7 | 0.9458 | 0.9459 | 0.9458 |
| 4 | 4 | 0.9570 | 0.9572 | 0.9572 |
| 4 | 5 | 0.9725 | 0.9727 | 0.9726 |
| 4 | 6 | 0.9492 | 0.9494 | 0.9494 |
| 5 | 5 | 0.9499 | 0.9502 | 0.9502 |
| 5 | 6 | 0.9629 | 0.9631 | 0.9631 |
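The two paths to failure discussed for the single-copy case can be approximated without the full SPN. The sketch below is a simplification and not the model solved in the paper: it ignores whether copies are current, assumes identical hosts so that any set of up hosts of a given size is equally likely, and reuses the single repair crew chain with the paper's 100 hour and 8 hour means. The function names are illustrative.

```python
from math import comb
from typing import List


def steady_state(n: int, mttf: float = 100.0, mttr: float = 8.0) -> List[float]:
    """Steady-state probabilities pi[i] of i of n hosts up (one repair crew)."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    weights = [1.0]
    for i in range(1, n + 1):
        weights.append(weights[-1] * mu / (i * lam))
    total = sum(weights)
    return [w / total for w in weights]


def hybrid_availability(copies: int, witnesses: int) -> float:
    """P(strict majority of representatives up AND at least one copy up).

    Assumption: copy currency is ignored, so the value is optimistic
    whenever a repaired copy could be the only copy that is up.
    """
    n = copies + witnesses
    pi = steady_state(n)
    total = 0.0
    for up in range(n // 2 + 1, n + 1):
        # Probability that all `up` operational hosts hold witnesses;
        # comb() returns 0 when up exceeds the number of witnesses.
        p_no_copy = comb(witnesses, up) / comb(n, up)
        total += pi[up] * (1.0 - p_no_copy)
    return total
```

For one copy and two witnesses the sketch gives roughly 0.905, close to the corresponding Table 2 entry, while for two copies and three witnesses it comes out slightly above the Table 3 value, as expected from ignoring out-of-date copies.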
TRANSFORMING COPIES AND WITNESSES
Transform on update
This section considers a method for transforming copies and witnesses to increase availability. The heuristic considered is as suggested by Paris 19. When a quorum is being formed because of an update request, suppose that there is at least one unavailable copy. Then, for each unavailable copy, transform an available witness to a current copy, and change the status of the unavailable copy to that of an unavailable witness. This model does not address the problem of available space on the host containing the newly converted copy, but rather analyses the problem from the availability perspective.

The inclusion of the transformation process in the SPN model merely requires the addition of two new places and three transitions. It is assumed that the amount of time needed to convert a witness to a copy is the same as the amount of time needed to write a copy on update. It is also assumed that the time needed to convert a copy to a witness is negligible (as it is merely a bookkeeping task).

The results of the model that includes the transformation process are shown in Table 4. For this model, again for a fixed number of copies, availability is maximized when the number of witnesses is one more than the number of copies. As in the no transformation case, the availability of the system is maximized when there are three copies and four witnesses. Comparing the results of Table 4 and Table 3, there is improvement in nearly all cases; the configurations with equal numbers of copies and witnesses are unchanged. Where the transformation process improves the availability of the system, the greater improvement is shown for frequent updates. The improvement decreases as the frequency of updates decreases, because the transformation process is closely tied to the update process. The next section considers the case where the transformation can occur even if there is no write request.

Table 4. Transforming copies and witnesses

| No of copies | No of witnesses | Voting hourly update | Voting daily update | Voting weekly update |
|---|---|---|---|---|
| 2 | 2 | 0.9360 | 0.9361 | 0.9362 |
| 2 | 3 | 0.9756 | 0.9702 | 0.9684 |
| 2 | 4 | 0.9537 | 0.9489 | 0.9472 |
| 2 | 5 | 0.9758 | 0.9669 | 0.9632 |
| 2 | 6 | 0.9554 | 0.9468 | 0.9434 |
| 2 | 7 | 0.9712 | 0.9492 | 0.9540 |
| 2 | 8 | 0.9479 | 0.9366 | 0.9316 |
| 3 | 3 | 0.9549 | 0.9551 | 0.9551 |
| 3 | 4 | 0.9775 | 0.9766 | 0.9757 |
| 3 | 5 | 0.9570 | 0.9561 | 0.9554 |
| 3 | 6 | 0.9732 | 0.9712 | 0.9693 |
| 3 | 7 | 0.9498 | 0.9480 | 0.9463 |
| 4 | 4 | 0.9570 | 0.9572 | 0.9572 |
| 4 | 5 | 0.9733 | 0.9732 | 0.9728 |
| 4 | 6 | 0.9499 | 0.9499 | 0.9495 |
| 5 | 5 | 0.9499 | 0.9502 | 0.9502 |
| 5 | 6 | 0.9633 | 0.9634 | 0.9632 |

Daemon process for transformation
If updates are performed infrequently, then it may be desirable to disassociate the transformation process from the update process. Suppose that there is a separate 'daemon' process that runs occasionally, transforming copies and witnesses as needed. It might be supposed that such a daemon process would run every two hours (exponentially distributed time between runs, with mean of two hours), but have lower priority than the actual update procedure. This way, if updates were infrequent, the probability that a current copy was available would be increased.

The result of adding such a daemon process to the SPN model is shown in Table 5. All entries in Table 5 are higher than the corresponding entries in Table 4. The maximum value of the availability of the system changes from the case with three copies and four witnesses to the case with four copies and five witnesses. This suggests that the addition of a 'daemon' process can help offset the effects of the limited repair crew.

Table 5. Adding daemon process for transformation

| No of copies | No of witnesses | Voting hourly update | Voting daily update | Voting weekly update |
|---|---|---|---|---|
| 2 | 2 | 0.9485 | 0.9493 | 0.9486 |
| 2 | 3 | 0.9839 | 0.9850 | 0.9843 |
| 2 | 4 | 0.9657 | 0.9680 | 0.9668 |
| 2 | 5 | 0.9854 | 0.9870 | 0.9861 |
| 2 | 6 | 0.9730 | 0.9732 | 0.9717 |
| 2 | 7 | 0.9844 | 0.9864 | 0.9851 |
| 2 | 8 | 0.9685 | 0.9723 | 0.9701 |
| 3 | 3 | 0.9701 | 0.9720 | 0.9707 |
| 3 | 4 | 0.9877 | 0.9891 | 0.9880 |
| 3 | 5 | 0.9730 | 0.9759 | 0.9739 |
| 3 | 6 | 0.9862 | 0.9882 | 0.9865 |
| 3 | 7 | 0.9711 | 0.9749 | 0.9721 |
| 4 | 4 | 0.9768 | 0.9792 | 0.9773 |
| 4 | 5 | 0.9882 | 0.9900 | 0.9884 |
| 4 | 6 | 0.9736 | 0.9773 | 0.9743 |
| 5 | 5 | 0.9778 | 0.9809 | 0.9782 |
| 5 | 6 | 0.9867 | 0.9890 | 0.9866 |
SUMMARY AND CONCLUSIONS
To maintain consistency in a replicated file system, a hybrid between available copies and voting with witnesses is proposed. Such a system could be implemented within a voting framework if the number of witnesses is at least as large as the number of copies, and if a quorum is formed by using all available copies, augmented by witnesses as needed. High levels of availability can be obtained with such a system, with the added protection from simultaneous updates in a partitioned network.

Analysis of an SPN model of such a system revealed two interesting behaviours. First, availability is highest when the number of witnesses is precisely one more than the number of copies. In such an arrangement, the maximum number of useful copies is kept current, and the minimum number of necessary witnesses is added to ensure consistency in the face of a network partition. Second, the SPN analysis allowed consideration of more complex (and realistic) assumptions about the independence of hosts. An independence assumption implies that there is a separate repair crew for each host, so that the availability of each host is independent of the other hosts. In reality, however, support of so many repair crews is rarely affordable. This model has taken the other extreme view, that is, that there is only one repair crew to service all machines and that there is a random selection policy for repairing machines in the case of multiple failures. Under such a system, the addition of hosts can actually decrease the availability of the system, because the average time for a repair then increases. Because of these trade-offs, it has been shown that there could be an optimum selection of the number of copies and witnesses.

The possibility has also been considered of transforming copies and witnesses into each other when the host on which a copy resides fails. For a small number of copies, such a system improves availability.
A special-purpose process has also been considered that occasionally checks the status of the file system (even in the absence of any write request) and converts a witness to a copy if necessary. Again, for a small number of copies, such a daemon process could further increase the availability of the file system. The systems were analysed using a stochastic Petri net model in which all the times were assumed to be exponentially distributed, so that each model can be solved analytically and exactly as a Markov chain. The corresponding Markov chains ranged in size from 150 states and 400 transitions to 16 000 states and 108 000 transitions.
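The effect of a single repair crew can be seen in a much smaller Markov chain than the ones solved here. The Python sketch below is a deliberately simplified birth-death chain, not the SPN model of this paper: it ignores the distinction between copies and witnesses, treats "a majority of hosts is up" as a stand-in for availability, and uses hypothetical names (`steady_state`, `majority_up_probability`) and rates (`lam` for failure, `mu` for repair).

```python
def steady_state(n, lam, mu, single_crew=True):
    """Steady-state distribution over k = number of failed hosts (0..n).

    Each working host fails at rate lam. With a single crew, repairs complete
    at rate mu regardless of how many hosts are down; with one crew per host,
    k failed hosts are repaired at a combined rate of k * mu.
    """
    weights = [1.0]
    for k in range(1, n + 1):
        failure_rate = (n - (k - 1)) * lam           # transition k-1 -> k
        repair_rate = mu if single_crew else k * mu  # transition k -> k-1
        # Detailed balance for a birth-death chain.
        weights.append(weights[-1] * failure_rate / repair_rate)
    total = sum(weights)
    return [w / total for w in weights]


def majority_up_probability(n, lam, mu, single_crew=True):
    """Probability that a strict majority of the n hosts is up."""
    pi = steady_state(n, lam, mu, single_crew)
    return sum(p for k, p in enumerate(pi) if 2 * (n - k) > n)
```

With the illustrative rates `lam = 1.0` and `mu = 10.0`, the single-crew value for five hosts is lower than the independent-repair value, and the single-crew value for nine hosts is lower than that for five, which is consistent in spirit with the observation that adding hosts can reduce availability when repairs are shared.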
ACKNOWLEDGEMENTS
The stochastic Petri net models were solved by using the software package DEEP, developed at Duke University by Gianfranco Ciardo. The models used in this paper benefited from several conversations with David Mutchler of the University of Tennessee at Knoxville.
REFERENCES
1 Ellis, C S and Floyd, R A 'The roe file system' in Proc. 3rd Symp. Reliability in Distributed Software for Database Systems (1983) pp 175-181
2 Popek, G J, Walker, B J, Chow, J et al. 'LOCUS: a network transparent, high reliability distributed system' in Proc. 8th Symp. Operating Systems Principles (1981) pp 169-177
3 Stonebraker, M 'Concurrency control and consistency of multiple copies of data in distributed INGRES' IEEE Trans. Soft. Eng. (May 1979) pp 188-194
4 Gifford, D K 'Weighted voting for replicated data' in Proc. 7th ACM Symp. Operating Systems Principles (1979) pp 150-159
5 Noe, J D and Andreassian, A 'Effectiveness of replication in distributed computer networks' in Proc. IEEE 7th Int. Conf. Distributed Computer Systems (1987)
6 Bernstein, P A and Goodman, N 'An algorithm for concurrency control and recovery in replicated distributed databases' ACM Trans. Database Syst. (December 1984) pp 596-615
7 Dugan, J B and Ciardo, G 'Stochastic Petri net analysis of a replicated file system' IEEE Trans. Soft. Eng. (April 1989) pp 394-401
8 Jajodia, S and Mutchler, D 'Dynamic voting' in Proc. SIGMOD (May 1987) pp 227-238
9 Paris, J-F 'Voting with witnesses: a consistency scheme for replicated files' in Proc. 6th Int. Conf. Distributed Computing Systems (May 1986) pp 606-612
10 Carroll, J L, Long, D and Paris, J-F 'Block-level consistency of replicated files' in Proc. 7th Int. Conf. Distributed Computing Systems IEEE (1987) pp 146-153
11 Murata, T 'Petri nets: properties, analysis and applications' Proc. IEEE Vol 77 No 4 (April 1989) pp 541-580
12 Peterson, J L Petri net theory and the modeling of systems Prentice Hall (1981)
13 Molloy, M K 'Performance analysis using stochastic Petri nets' IEEE Trans. Computers Vol 31 No 9 (September 1982) pp 913-917
14 Natkin, S 'Reseaux de Petri stochastiques' These de Docteur Ingenieur CNAM-Paris, France (June 1980)
15 Marsan, M A, Balbo, G and Conte, G 'A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems' ACM Trans. Computer Syst. (May 1984)
16 Trivedi, K S Probability and statistics with reliability, queueing and computer science applications Prentice Hall (1982)
17 Siewiorek, D P and Swarz, R S The theory and practice of reliable system design Digital Press, Bedford, MA, USA (1982)
18 Jajodia, S and Mutchler, D 'Enhancements to the voting algorithm' in Proc. 13th VLDB Conf. (September 1987) pp 399-406
19 Paris, J-F 'Voting with a variable number of copies' in Proc. 16th Int. Symp. Fault-Tolerant Computing (July 1986) pp 50-55
information and software technology