Replicated servers for fault-tolerant real-time systems using transputers


Information and Software Technology 38 (1996) 633-645


Anupam Sinha (a), Pradip K. Das (a,*), Dhruba Basu (b)

(a) Department of Computer Science and Engineering, Jadavpur University, Calcutta 700032, India
(b) Flight Computer Division, V.S.S.C., Trivandrum 695022, India

* Corresponding author.

Received 21 March 1995; accepted 4 December 1995

Abstract

The design of fault-tolerant real-time systems is an area of great research activity, particularly in the context of safety-critical systems like spacecraft. The present work describes the design, implementation, and proof of a fault-tolerant server in a transputer network. The software was developed in Occam with in-line GUY code (transputer assembly) at certain places to improve performance. Byzantine faults are tolerated and the basic strategy is failure-masking by replication.

Keywords: Fault-tolerance;

Replication; State-machine; Atomic Broadcast; Digital signatures; Transputers

1. Introduction

Fault-tolerance by replication is a well-established strategy for implementing real-time systems, e.g. digital flight-control systems (DFCS), launch vehicles and spacecraft.

Typically, a failure-transparent node is constructed from a set of processors which execute identical code in a loosely synchronous fashion. The output of a node is generated by distributed voting amongst the outputs from the replicas. Likewise, input to a node is accepted by Atomic Broadcast amongst the replicas, which ensures that every replica processes the same set of messages in the same order. In this paper, we describe an Occam implementation of a fault-tolerant transputer network in which a node consists of three transputers and three pairs of links visible to other nodes. Such a configuration naturally leads to a linear chain or pipeline. However, the design approach outlined below can also be adapted to replicated systems of arbitrary topology. The type of architecture described in this paper is suitable for use as On-board Computers (OBC) for carrying out mission-critical applications where there cannot be any compromise with the reliability requirements. For example, the OBCs used in Indian Space Research Organization's (ISRO's) launch vehicle programmes


carry out the mission critical functions of navigation, guidance, digital autopilot and vehicle sequencing, and a single fault in the hardware can cause a multi-million dollar mission to fail. To further enhance the reliability of such a system, Byzantine fault tolerance is generally attempted. Traditional Byzantine fault resilient systems require at least four replicas to cater to a single fault. We provide for Byzantine Resilience in the system by message authentication, Interactive Consistency on input ordering and majority voting on output. This results in the consequent reduction of hardware and therefore weight by 25% as only three replicas need to be used. This paper is organized as follows. The next section begins with an introduction to failure masking. The hardware configuration is described next. This is followed by the assumed fault-model. The outline of the Atomic Broadcast scheme is described in Section 3. Failure masking by output voting is discussed in Section 4. Up to this point, the discussion is essentially that of the system specifications. The discussion on the specific implementation begins in Section 5. The RSA cryptosystem and its implementation are described briefly in Section 6. Implementation details of Atomic Broadcast are discussed in Section 7. Section 8 contains the local clock times at various points in Atomic Broadcast, Voting, and Message transmission. Comparison with related work is reported in the Conclusion.


The last three sections, included in the Appendix, provide proofs of some of the algorithms used. The Stability test and its proof appear in Section A.1. This is followed by a discussion on Byzantine fault tolerance in Section A.2. The final section provides the algorithm and proof of the Interactive Consistency Algorithm [1] used in Atomic Broadcast for ordering input messages.

2. Description of the system

2.1. A brief review of replicated processing

Traditionally, there are two distinct methods of tackling failures: (i) fault-detection and recovery; and (ii) fault-masking. Method (i) is suitable for crash faults. However, for transient failures and arbitrary failure modes like Byzantine failures, method (ii) is appropriate. To mask a failure, a set of processors execute identical code. Their results are voted upon. As long as the number of healthy processors is greater than the number of faulty processors, majority voting yields correct output. An output message containing the voted result is generated by each non-faulty processor. The message is signed and bears the digital signatures [2] of the processors which agree on the result. A recipient of the message recognizes it to be valid from the presence of signatures. Each member of the set is called a replica. A replica can be viewed as a state machine [3] which accepts an input message, changes to a new state, and generates output message(s). It is obvious that all the replicas must have synchronous states so that any two healthy replicas produce the same output when given the same input. Replicas can maintain synchronous states by processing the same set of messages in the same order. This is achieved by an Atomic Broadcast Protocol [4].
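To illustrate the state-machine view, the following short Python sketch (illustrative only; the paper's implementation is in Occam on transputers, and the Replica class and message format here are hypothetical) shows that deterministic replicas fed the same messages in the same order necessarily produce identical outputs, which is what makes majority voting on their outputs meaningful.

# Minimal sketch of state-machine replication (names are hypothetical).
class Replica:
    """A deterministic state machine: same inputs in the same order -> same outputs."""
    def __init__(self):
        self.state = 0          # replica state (here just a running total)
        self.outputs = []       # messages this replica would emit

    def process(self, msg):
        # A deterministic transition: update state, emit an output message.
        self.state += msg["value"]
        self.outputs.append(("ack", msg["id"], self.state))

# Three replicas fed the SAME messages in the SAME order (what Atomic Broadcast guarantees).
ordered_q = [{"id": "mA", "value": 7}, {"id": "mB", "value": 3}]
replicas = [Replica() for _ in range(3)]
for r in replicas:
    for m in ordered_q:
        r.process(m)

# All healthy replicas agree, so a majority vote on any output is trivially correct.
assert replicas[0].outputs == replicas[1].outputs == replicas[2].outputs
print(replicas[0].outputs)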

2.2. Architecture

In Fig. 1, we show the configuration of a fault-tolerant node using replication. It consists of three transputers P0, P1, and P2, which are the replicas referred to above. The links L0W, L1W, L2W, L0E, L1E, and L2E communicate with the environment. The three links L0W, L1W and L2W are essentially redundant links carrying the same messages in a fault-free condition. A similar situation holds for L0E, L1E and L2E. It is convenient to visualize the configuration as a node that can communicate with the external world via two logical links (LL). The links L01, L12 and L20 ensure that a node consisting of P0, P1, and P2 is a complete graph. These links, as will be explained later, are required for intra-node voting and Atomic Broadcast. In Fig. 2, we show a higher-level abstraction of a node with LLs. If at most one processor can go faulty, then the node always functions properly, that is, it generates at least two correct outputs. Such fault-tolerant nodes can be conveniently connected as a linear chain as depicted in Fig. 3. Effectively, the result of a fault does not propagate beyond the boundaries of a node. This will be explained in the next section. Traditionally, this is referred to as the fire-walling of faults.

Fig. 1. Architecture of a node consisting of the three replicas P0, P1, and P2.
Fig. 2. A failure-transparent abstraction of a 'node' consisting of three replicas of Fig. 1.
Fig. 3. A fault-tolerant pipeline.

The software in each transputer consists of the following layers: (i) communication layer; (ii) replication layer; and (iii) application layer. The communication layer handles inter-transputer message communication, signing, and authentication [2]. The replication layer performs Atomic Broadcast and Voting required for fault-tolerance by replication. The application layer contains the processes implementing the real-time application, say process control, without any concern for fault-tolerance. This design strategy can be extended to construct replicated systems of arbitrary topology. The important parameters that will affect the software for a different topology are: (i) the maximum number of faults; (ii) the connectivity required for Interactive Consistency [1]; and (iii) the limitation of four links per transputer.


2.3. Fault models and assumptions

The processors can exhibit arbitrary failure modes (Byzantine faults). A processor can be totally silent, fail to send or receive some messages, delay communications, or send conflicting information to different processors at the same time. With only three processors per node, we have to employ message authentication to handle Byzantine faults. The signature is not a separate entity; it is a coding of the message. It is assumed that signed messages cannot be forged or altered by a faulty processor. In other words, the process of authentication will detect any change in the signed message. The practical justification of this assumption is explained in Section 6. A link fault is treated as a fault of one of the processors connected by the link. The constraint is that each node can have at most one faulty processor. Subject to this constraint alone, we can map a link fault to a processor fault.

3. Input processing

Consider transputer P1 (Fig. 4) receiving a message 'm' along link L1W from logical node Ns. When every replica in Ns is functioning properly, P0 and P2 should also receive 'm' along L0W and L2W respectively. However, if some replica in Ns goes faulty, P0 or P2 may either fail to receive a message or receive wrong (or conflicting) messages. The application level processes in P0, P1, and P2 must process the same set of messages in the same order. This requirement stems from the fact that P0, P1, and P2 must maintain loosely synchronous states so that a correct output can be obtained by majority voting. We will refer to these constraints as (i) Agreement; and (ii) Order. The requirements of Agreement and Order can be satisfied by an Atomic Broadcast Protocol. Birman and Joseph [4] describe the Atomic Broadcast (ABCAST) primitive in the following manner:

Consider a set of processes that maintain copies of a replicated data structure representing a queue. If items are inserted into and removed from each copy of the queue in the same order, no inconsistencies will arise among copies. The Atomic Broadcast (ABCAST) primitive is provided for applications such as this, where the order in which data are received at a destination must be the same as the order at other destinations, even though this order is not determined in advance.

In the present work, we employ Atomic Broadcast to maintain identical ordered queues at the non-faulty replicas in a node. Formally, if the queues are represented by strings of message identifiers, each identifier standing for a specific message, then, for any two queues Qi and Qj at non-faulty replicas Pi and Pj of a node, Qi is a prefix of Qj if |Qi| <= |Qj|, where |Q| represents the length of Q.

In Fig. 5, we show the logical scheme of the protocol. At each non-faulty replica, an arriving message is first put into the waiting-q. The waiting-q is logically a set. The temporal order of arrival of a message in the waiting-q is, in general, different at different replicas in the same node. After this, it is necessary to determine the order of this message relative to other received messages. The 'order' process in the non-faulty replicas executes an Atomic Broadcast protocol which transfers messages from the waiting-q to the ordered-q in the same order at all replicas. The ordered-q corresponds to Q defined in the previous paragraph. In contrast with the waiting-q, the ordered-q is a sequence.

Fig. 5. Ordering of messages in a replica by Atomic Broadcast (waiting-q to ordered-q).

Fig. 6 depicts a scenario of message arrival at a server. P0 receives mA and P1 receives mB. After intra-node broadcast, P0 also receives mB and P1 also receives mA. Thus the sequences of messages at the replicas become as follows: P0: {mA, mB}; P1: {mB, mA}.

Fig. 4. Inter-node message transfer showing a message m to P1.
Fig. 6. Two replicas receiving two messages concurrently.
Fig. 7. Intra-node broadcast of a double-signed message by P1.

It now becomes necessary to decide on a common order in which mA and mB should be executed at all three replicas P0, P1, and P2. On receiving an inter-node double-signed message, a replica relays it to the two other replicas in the node (Fig. 7). Thus, under no-fault conditions, a replica will receive three (identical) double-signed messages:

(1) one over an inter-node link;
(2) two over intra-node links.

However, if the processor in the adjacent node is faulty, (1) above may not be received. But at least one message in (2) is sure to arrive because there can be at most one fault per node. Thus, a replica in a receiving node is guaranteed to receive at least one fault-free double-signed message. This satisfies the Agreement requirement. An inter-node message is double-signed at the sending node. The details are discussed in the following section. Agreement ensures that an inter-node message received at a replica is also received by other replicas. For ordering, every replica proposes a sequence number for a received message 'm'. Henceforth, we will refer to this as the proposed sequence number. A proposed sequence number should be: (i) greater than all other sequence numbers proposed already by the replica; and (ii) greater than the assigned sequence number of all messages already in the ordered-q. The assigned sequence number for message 'm' is chosen to be the maximum of all the proposed sequence numbers for that message. The algorithm is given in Section 7.

In addition to assigning a sequence number to a message, there is one more factor that determines when the message can be transferred to the ordered-q for processing. This factor is called the 'message queue stability' [5]. The algorithm for determining stability is given in the Appendix. Consider a message queue Q at time t in which messages are stored after assigning sequence numbers. Let the queue be sorted in ascending order of sequence numbers. Thus, if messages mi and mi+1 have assigned sequence numbers Si and Si+1 respectively, then Si < Si+1. In Schneider's [5] terminology, mi is stable at time ts if any message mx that is assigned a sequence number Sx at time tx > ts satisfies Sx > Si. It is necessary to ensure that a process reading mi waits until mi is stable. Unless this is done, messages may not be processed in ascending order of their assigned sequence numbers. The stable portion of the waiting-q is transferred to the ordered-q. We have implemented this Order-Agreement-Stability requirement, with provisions for Byzantine failures, by an Interactive Consistency Algorithm.

4. Generating voted output

Fig. 8. Schematic of Voting in TMR.

In the basic triple-modular redundancy (TMR) scheme (Fig. 8), there are three processors followed by a voter. With at most one faulty processor, the voter successfully produces correct output by exercising majority voting. In practice, the voting action is distributed to prevent the voter from being a single point of failure. Thus, in our implementation, the transputers exchange their outputs over links L01, L12, and L20 (Fig. 1). Each transputer performs majority voting and generates a voted output. The voted output is double-signed, i.e. it bears the signatures of the two replicas which have agreed. With at most one fault, a node produces at least two correct outputs along the three redundant inter-node links.
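As a rough illustration of the distributed vote just described (the actual implementation exchanges signed messages over the transputer links in Occam; the names and data shapes below are hypothetical), the following Python sketch selects the 2-out-of-3 majority value and tags it with the identities of two agreeing replicas, standing in for the double signature.

# Distributed majority voting among three replicas (illustrative sketch).
from collections import Counter

def vote(own_output, peer_outputs, own_id, peer_ids):
    """Return a 'double-signed' voted output: the majority value plus the ids
    of two replicas that agree on it, or None if no majority exists."""
    values = [own_output] + peer_outputs
    ids = [own_id] + peer_ids
    majority, count = Counter(values).most_common(1)[0]
    if count < 2:                      # with at most one faulty replica this cannot happen
        return None
    signers = [i for i, v in zip(ids, values) if v == majority][:2]
    return {"value": majority, "signatures": signers}   # stands in for the double-signed output

# P2 is faulty and produces a wrong value; P0 still emits a correctly voted output.
print(vote(own_output=42, peer_outputs=[42, 99], own_id="P0", peer_ids=["P1", "P2"]))
# -> {'value': 42, 'signatures': ['P0', 'P1']}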

5. Process architecture in a replica

The set of processes running on a single transputer is depicted in Fig. 9. The "input_buffer_process" (p3) receives messages directed to the transputer over hard channels from.e, from.w, from.n, and from.s. It authenticates these messages by checking their signatures and stores them in a buffer. The buffer is logically a set.


Fig. 9. Process architecture in a single transputer (p0 application, p1 voter, p2 output_buffer_process, p3 input_buffer_process, p4 decide_seq_no, p5 time_server).

The action of sequencing is performed by the process "decide_seq_no" (p4). We have found it convenient to define a unique identifier (henceforth abbreviated 'uid') for inter-node messages. It is a 3-tuple consisting of: (i) the node.id of the sender; (ii) the node.id of the receiver; and (iii) the message sequence number set by the sender. Every node of the system is assigned a unique integer called the node.id. Every message transferred from one node to another bears a sequence number set by the sender. Successive messages bear monotonically increasing sequence numbers. This contrivance eliminates the necessity of transferring entire messages for deciding the sequence numbers to be assigned. The "decide_seq_no" process (p4) in the three replicas can refer to the uid of a message for reaching Byzantine Agreement on its assigned sequence number (the so-called "ordering action"). The actual input message is buffered in "input_buffer_process" (p3). After a sequence number is assigned to this message, p4 asks p3 to transfer the message to p0 and p3 complies. Associated with decide_seq_no is the "time_server" process (p5). It is required for setting deadlines for Byzantine Agreement. This is further explained in the Appendix.

The "application" process (p0) sends its output to the "voter" process (p1). The three voters in the three replicas communicate via their respective p2 and p3 processes for generating the voted output. The "output_buffer_process" (p2) collects all outgoing messages, signs them, and sends them to their destination. If we focus our attention on a specific message, the following is the sequence of stages through which it normally passes at a node: (i) one or more replicas at node 'i' receive(s) a double-signed inter-node message whose destination is node 'i' itself; (ii) after authentication, this message is relayed over intra-node links to the two other replicas; (iii) Atomic Broadcast orders this message relative to messages received earlier; (iv) ultimately, this message is processed by the application process which sends its output to the voter process; (v) the voters in the non-faulty replicas cooperate to produce a voted output. As already mentioned, this voted output is a double-signed message, bearing the signatures of the two replicas that have agreed; (vi) each non-faulty replica outputs the double-signed message on an appropriate inter-node link.


The following pseudo-Occam code is a hypothetical sequential implementation that illustrates the above steps. The channels from.e and from.w are inter-node and the others are intra-node. For example, in Fig. 1, the input channels of L0E, L0W, L20, and L01 for P0 map onto from.e, from.w, from.n, and from.s respectively.

CHAN OF ANY from.e, from.w, from.n, from.s:
BOOL passed:
SEQ
  ALT
    from.e ? m
      SKIP
    from.w ? m
      SKIP
    from.n ? m
      SKIP
    from.s ? m
      SKIP
  authenticate (m, passed)
  IF
    passed
      SEQ
        Interactive.Consistency (m, aseq)  -- aseq = assigned sequence number
        process (m, output)                -- occurs only when all preceding
                                           -- messages have been processed
        vote (output, voted.output)
        sign (voted.output)
        CASE m[DESTINATION]                -- destination is specified in the message
          EAST
            to.e ! voted.output
          WEST
            to.w ! voted.output
    NOT passed
      SKIP
-- end of processing

6. Producing and authenticating signed messages

We have adopted the RSA cryptosystem [2] for producing signed messages. The scheme is described briefly here. Interested readers are referred to [2] for further details. A signed message is essentially a pair (w, D(w)) where w is the actual message and D(w) is the corresponding signature. There are two significant properties of a signed message: (i) a faulty processor PF cannot impersonate a healthy processor PH by sending a message to a third processor PR, because PF cannot generate the "correct" signature of any message reportedly sent by PH. The "correctness" is checked by PR using message authentication; (ii) PF cannot alter a message while relaying it to PR because such an alteration will be caught during authentication. The receiver can verify the authenticity of the sender and the received message by checking E(D(w)) = w, where E = D^-1. It is important to emphasize that D is secret, that is, known only to the sender, whereas E is public, that is, known to everybody. In the RSA cryptosystem, the functions D and E are defined as:

D(w) = w^d mod n
E(x) = x^e mod n

where d = decryption key (secret key), e = encryption key (public key), and n = modulus. The scheme for designing d, e, n is given below:

(1) Choose two large random primes p and q.
(2) Let n = pq. The success of the cryptosystem hinges on the fact that it is virtually impossible to factorize n into p and q.
(3) Let phi(n) = (p - 1)(q - 1).
(4) Select d such that gcd(d, phi(n)) = 1.
(5) Compute e such that ed mod phi(n) = 1.

For computing w^d mod n, it is most convenient to apply successive squaring. Thus, in order to compute w^15 mod n, we note that

w^15 mod n = (w^8 . w^4 . w^2 . w) mod n
           = [(w^8 mod n)(w^4 mod n)(w^2 mod n)(w mod n)] mod n

Each of the factors can be obtained easily, since w^4 mod n = [(w^2 mod n)(w^2 mod n)] mod n, etc. To illustrate the method, consider the following example. Let p = 32687 and q = 32653. Then n = pq = 1067328611 and phi(n) = 1067263272. Let us choose d = 1067263267. The key e is found by the Extended Euclid Algorithm. It yields (-426905309)d + (426905307)phi(n) = gcd(d, phi(n)) = 1. Then e = phi(n) - 426905309 = 640357963 and ed = 683430531640845121, so that ed mod phi(n) = 1. If we encrypt the byte w = 29, the signature is 192105838.
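The worked example can be checked mechanically. The following Python sketch (illustrative only; the paper's implementation is in Occam with in-line GUY code) signs with D(w) = w^d mod n, using Python's built-in modular exponentiation in place of hand-coded successive squaring, and verifies E(D(w)) = w with the constants quoted above. The printed signature can be compared against the value 192105838 reported in the text.

# RSA signing/authentication check for the example in the text (illustrative sketch).
p, q = 32687, 32653
n = p * q                      # 1067328611
phi = (p - 1) * (q - 1)        # 1067263272
d = 1067263267                 # secret (signing) key, gcd(d, phi) = 1
e = 640357963                  # public (verification) key, e*d mod phi = 1

assert (e * d) % phi == 1      # the keys are consistent

def sign(w):
    # D(w) = w^d mod n, computed by successive squaring (Python's 3-argument pow)
    return pow(w, d, n)

def authenticate(w, sig):
    # E(D(w)) = w must hold for a genuine signature
    return pow(sig, e, n) == w

w = 29
sig = sign(w)
print(sig)                          # the text reports 192105838 for this example
assert authenticate(w, sig)         # round trip succeeds
assert not authenticate(w + 1, sig) # a tampered message is rejected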


7. Ordering input messages

As already explained, it is necessary to obtain identical message ordering at the replicas so that they maintain synchronous states. This is achieved by Atomic Broadcast. On receiving a double-signed message over an inter-node link, a replica sends copies of the message to the two other replicas of the node. If such a message is received over an intra-node link, on the other hand, it is relayed to the replica which was not the sender. This ensures agreement, i.e. a message received by a replica is also received by other replicas. For instance, in Fig. 4, let the healthy replica P1 receive an inter-node double-signed message 'm' from Ns over L1W. We assume that the sender is fault-free. Another fault-free replica P0 will receive 'm' over L0W only if the processor in Ns connected to L0W is fault-free. However, if that processor in Ns is faulty, we have to ensure that P0 still receives 'm'. This is achieved by intra-node broadcast of 'm' by P1 over L01 and L12, as depicted in Fig. 7. For ease of implementation, P0 relays 'm' received over L01 to P2 over L20. Similarly, P2 relays 'm' received over L12 to P0 over L20. If P0, P1, P2, and all replicas in Ns are fault-free, then each replica receives three copies of the double-signed message, one inter-node and two intra-node. The copies may not be identical if compared byte-by-byte since the signatures will depend upon the signatories. However, the copies will have the same uid and the process of authentication generates identical messages stripped of signatures. In the worst case, P2 and the processor in Ns connected to L0W are both faulty; then P0 may receive only one copy of the double-signed message, viz. the one relayed by P1 over L01. Following the broadcast, the three replicas execute an Interactive Consistency Protocol to assign a sequence number to the message. Each replica maintains a counter (or sequence number) which is incremented on receiving a new message. A replica 'proposes' a sequence number for an incoming message. The proposed sequence number is the ordered pair [counter, replica-id]. For example, if counter = 7, then P1 will generate a proposed sequence number [7, 1], which can also be viewed as a binary number with the counter in the high-order bits and the two-bit replica-id in the low-order bits (0000...0111 01). The second member of the ordered pair, the replica-id, is required to resolve ties in case distinct messages are assigned identical counter values. The counter mechanism ensures that a replica proposes a larger sequence number for a message arriving later. The reason is obvious: time always flows forward and messages are consumed in ascending order of their assigned sequence numbers. If this was not done, then we


could have a message with a larger assigned sequence number consumed earlier. However, since the assigned sequence number is decided in a distributed fashion, it is also necessary to ensure that a proposed sequence number is greater than all assigned sequence numbers, so that a new message is assigned a sequence number greater than already assigned sequence numbers. This requirement is formally defined in the Proposed Sequence Number Axiom (A3) in the Appendix. To satisfy this requirement, we have to restrict the counter value as follows. Let c = local counter value at replica r (0 <= r <= 2), and let [b, w] = assigned sequence number of the message at the tail of the ordered-q. At any instant, we should have c >= b. When a new message arrives at a replica, the counter is incremented to (c + 1) and the proposed sequence number is [(c + 1), r] = 4(c + 1) + r in our scheme. Then,

[(c + 1), r] - [b, w] = {4(c + 1) + r} - {4b + w}
                      = 4(c - b) + (4 + r - w)
                      >= 4 + r - w        (since c - b >= 0)

Since the replica ids r and w lie in the interval [0, 2], r - w >= -2. So, 4 + r - w >= 4 - 2 = 2 > 0. Thus, [(c + 1), r] - [b, w] > 0, i.e. [(c + 1), r] > [b, w]. This means that the proposed sequence number for the new message is greater than the assigned sequence number of the last message and hence of all messages in the ordered-q.
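The inequality can also be checked exhaustively over a small range. The Python loop below (an illustration, not part of the Occam implementation; the bound of 50 is arbitrary) encodes [counter, replica-id] as 4*counter + replica-id and confirms that a new proposal always exceeds the assigned number at the tail of the ordered-q whenever c >= b.

# Exhaustive check of the [counter, replica-id] encoding used for sequence numbers.
def encode(counter, replica_id):
    # [counter, replica-id] viewed as a binary number: counter in the high bits,
    # replica-id (0..2) in the two low bits, i.e. 4*counter + replica-id.
    return 4 * counter + replica_id

for b in range(50):                 # counter part of the assigned number at the tail
    for c in range(b, 50):          # local counter, maintained so that c >= b
        for r in range(3):          # proposing replica id
            for w in range(3):      # replica id inside the tail's assigned number
                assert encode(c + 1, r) > encode(b, w)
print("a proposed sequence number always exceeds the tail's assigned number")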

To implement the above, c should be increased to x whenever a message with assigned sequence number [x, w] and x > c is added to the ordered-q. When the Interactive Consistency Protocol terminates, each replica knows the proposed sequence number for 'm' from the two other replicas. Each replica computes:

assigned sequence no := max(proposed sequence no_i), i in {0, 1, 2}.

Thus 'm' is assigned the same sequence number at all the three replicas. To illustrate the algorithm, let us consider the following example. There are three messages m1, m2, and m3. The proposed sequence numbers are embodied in Table 1. The assigned sequence numbers are m1: 31, m2: 32, m3: 30. We observe that P1, P2, and P0 assign the same counter value (3) to the three messages. The replica-id resolves the tie and orders the messages.

Table 1
A typical scenario in assigning sequence numbers

Replica            m1    m2    m3
P0                 10    20    30
P1                 31    11    21
P2                 22    32    12
Assigned.seq.no    31    32    30

Due to failures, it is possible to have identical assigned sequence numbers for distinct messages. In the example above, if P2 is faulty and proposes a sequence number 32 for both m1 and m2, then both m1 and m2 will have the assigned sequence number 32. This tie can be resolved by ordering the messages according to their uid's. At this point, it seems appropriate to sum up the different terms related to sequence numbers. The uid of a message is a brief identification of a message; instead of handling the entire message packet, we refer to it by its uid. One component of the uid is the sequence number defined by the sender. The sender assigns monotonically increasing sequence numbers to messages. This sequence number has nothing to do with the proposed or assigned sequence numbers. A receiving node receives messages from many senders. It maintains a counter which is incremented on receiving a new double-signed message. This counter is used to generate the proposed sequence number. The assigned sequence number is the (unique) maximum of the proposed sequence numbers. In the rare event of distinct messages being assigned identical assigned sequence numbers, they can be ordered by their uid's which are, by definition, distinct.
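To make the assignment rule concrete, the short Python sketch below (illustrative only; not the Occam implementation) reproduces the scenario of Table 1: the assigned sequence number is the maximum of the three proposals, and any tie between distinct messages is broken by uid (here simply the message name).

# Assigning sequence numbers from per-replica proposals (scenario of Table 1).
proposals = {                        # proposed sequence numbers [counter, replica-id]
    "m1": {"P0": 10, "P1": 31, "P2": 22},
    "m2": {"P0": 20, "P1": 11, "P2": 32},
    "m3": {"P0": 30, "P1": 21, "P2": 12},
}

assigned = {m: max(p.values()) for m, p in proposals.items()}
print(assigned)                      # {'m1': 31, 'm2': 32, 'm3': 30}

# Messages are consumed in ascending order of (assigned number, uid);
# the uid breaks ties caused by a faulty proposer.
order = sorted(assigned, key=lambda m: (assigned[m], m))
print(order)                         # ['m3', 'm1', 'm2']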

8. Experimental results

8.1. Atomic Broadcast

Refer to Fig. 6. Replicas P0 and P1 receive mA and mB respectively. In order to study the performance of the Atomic Broadcast Protocol, the local clock times (low-priority clock) at the replicas were monitored at important points during execution. The monitored values, one set for each replica, are shown in Table 2. The figures in parentheses were obtained by disabling the generation and authentication of signatures.

Table 2
Significant instants in Atomic Broadcast

Message          Time of
                 Arrival      Proposal     Decision      Stability
2.1 Replica P0
mA               145(45)      224(52)      1144(200)     1158(215)
mB               332(83)      624(98)      1158(214)     1158(215)
2.2 Replica P1
mB               195(44)      304(61)      1193(213)     1193(214)
mA               384(75)      432(90)      1031(190)     1193(214)
2.3 Replica P2
mA               227(64)      448(85)      1194(222)     1195(223)
mB               482(94)      581(109)     1107(208)     1195(223)

Values are the number of ticks of the low-priority clock. Values in parentheses are those obtained with signature generation/authentication disabled.

8.2. Voting times

The messages mA and mB are processed by the application process (p0) to produce mA1 and mB1 respectively. The voter process (p1) signs these messages and sends them to the other replicas. The local clock time (low-priority clock) of arrival of these single-signed messages in p1 was monitored during execution in each replica. These are shown in Table 3.

Table 3
Arrival times of output messages from the three replicas

Sl. No.    Message    Sender replica    Time
3.1 Replica P0
1          mA1        P0                1205(239)
2          mB1        P0                1347(258)
3          mA1        P1                1365(283)
4          mA1        P2                1495(260)
5          mB1        P1                1502(289)
6          mB1        P2                2108(300)
3.2 Replica P1
1          mA1        P1                1237(235)
2          mB1        P1                1345(257)
3          mA1        P0                1361(278)
4          mA1        P2                1542(259)
5          mB1        P0                1549(296)
6          mB1        P2                2152(285)
3.3 Replica P2
1          mA1        P2                1235(244)
2          mB1        P2                1441(259)
3          mA1        P0                1466(263)
4          mA1        P1                1546(290)
5          mB1        P0                1552(297)
6          mB1        P1                1580(308)

Values are the number of ticks of the low-priority clock. Values in parentheses are those obtained with signature generation/authentication disabled.

8.3. Message transfer times

Table 4 shows the maximum time intervals recorded in the output_buffer_process (p2) in each replica for transmitting a message. It is to be noted that since communication is synchronized, the time indicated is the sum of the waiting time (for the other partner to become ready), the message transfer time and possible delay due to scheduling.

Table 4
Message 'send' times

Replica    Maximum time delay
P0         260(29)
P1         245(25)
P2         195(20)

Values are the number of ticks of the low-priority clock. Values in parentheses are those obtained with signature generation/authentication disabled. Length of a signed message = 91 bytes. Maximum time required to add a signature to a message = 51 ticks. Maximum time required to authenticate a single-signed message = 57 ticks.

9. Conclusion

The work described in [6] employs a different hardware architecture, viz. the so-called two-processor fail-silent nodes.

In spite of this difference, it is instructive to compare the performance figures of the two designs. It is noteworthy that the authors have NOT used our method of digital signatures. Consequently, we should compare these figures with those obtained in our design without signatures, i.e. the figures in parentheses. Consider mA in Replica P0 (Table 2). Arrival time (t1) = 45 ticks. Instant when Atomic Broadcast is completed (t2) = 215 ticks. Time required for Atomic Broadcast (t3) = t2 - t1 = 170 ticks. Instant when the sequence no. is assigned (t4) = 200 ticks. Time to achieve stability (t5) = t2 - t4 = 215 - 200 = 15 ticks. Instant when the voted output is obtained (t6) = 260 ticks (Table 3). Voting time (t7) = t6 - t2 = 260 - 215 = 45 ticks. Total time spent in a node (t8) = t6 - t1 = 260 - 45 = 215 ticks. Since all processes were run in the low-priority mode of the transputer, 1 tick = 64 microseconds.

So,

t3 = 10.88 ms  [cf. ID = 10.25 ms in Ref. [6]]
t5 = 0.96 ms   [cf. WD = 6.34 ms]
t7 = 2.88 ms   [cf. OD = 4.60 ms]
t8 = 13.76 ms  [cf. ND = 14.85 ms]
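The conversion from ticks to milliseconds can be reproduced with a few lines (illustrative only; 1 low-priority tick = 64 microseconds, as stated above).

# Converting the measured tick counts to milliseconds (1 tick = 64 microseconds).
TICK_US = 64
for name, ticks in [("t3", 170), ("t5", 15), ("t7", 45), ("t8", 215)]:
    print(name, ticks * TICK_US / 1000.0, "ms")
# -> t3 10.88 ms, t5 0.96 ms, t7 2.88 ms, t8 13.76 ms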

Our implementation has been on the T414 transputers. Obviously, with faster transputers, e.g. T9000, the performance is expected to improve very significantly. To sum up, the significant improvements in our work compared to similar work cited above are as follows: (i) Clock synchronization is not required. (ii) Digital signatures have been used instead of naive checksums. (iii) Proof of deadlock-freedom is easy. (iv) The software architecture is natural for the transputer since it uses processes with disjoint address spaces. (v) Low-level implementation issues of the communication layer have been considered.


The system described in this paper holds considerable promise for application during the launching stage of spacecraft, which is characterized by its relatively low mission-time and by its demand for a high-availability system with little or no time for fault detection and reconfiguration. The performance figures achieved even with transputers at the very low end of the spectrum are adequate to interface with systems having high inertial lags such as launch vehicles. Systems with similar characteristics but higher mission times, such as chemical plants, would also benefit from this type of server, but perhaps with more elaborate fault detection and reconfiguration and/or replacement schemes. Faster processors should be employed for more time-critical applications such as nuclear plants. Our future work shall be in three broad areas. First, we intend to design for predictable real-time performance by using synchronized clocks. Second, we shall develop a proof of correctness of the entire system. Third, we shall implement software fault-tolerance at the application level.

Acknowledgements This work was inspired by the RESPOND programme of I.S.R.O. The authors are grateful to Dr Srinivasan, Director, V.S.S.C., for permitting the publication of this work.

Appendix A

A.1. Message queue stability

As already explained, a message cannot be consumed immediately after it is assigned a sequence number. The application process must wait for the message to be 'stable' in the ordered message queue. In this section, we describe a Test for Stability and a proof of its correctness. We introduce the following definitions.

(i) Decided(mk, tk) = the predicate "Message mk has been assigned a sequence number at time t <= tk".
(ii) Sp = the sequence of input double-signed messages received by replica P over inter-node links or by intra-node broadcasts.
(iii) Atp(mk) = the local clock time of arrival of message mk at replica P.
(iv) Dtp(mk) = the local clock time at which replica P assigns (by consensus) a sequence number to message mk.
(v) Aseq(mk) = the sequence number assigned to message mk.
(vi) Pseq(mk, P) = the sequence number for message mk proposed by replica P.


Fig. A1. Different chronological orders of arrival and decision of messages mi and mk (panels (a)-(e) show Atp(mi), Atp(mk), Dtp(mi), and Dtp(mk) on a time axis).

The objective is to ensure that messages are consumed by the application process in (ascending) order of their sequence numbers, i.e. for any two messages mi and mj consumed at times ti and tj respectively, ti < tj implies Aseq(mi) < Aseq(mj). For every message mi, there is a time ts > Dtp(mi) at which the above condition can be satisfied. Formally, we define the predicate Stable(mi, P, ts) as follows:

Stable(mi, P, ts) = forall mk : (mk in Sp) and (Dtp(mk) > ts) : Aseq(mk) > Aseq(mi)    [A1]

i.e. message mi is stable at time ts at replica P if, for any message mk whose sequence number Aseq(mk) is assigned at time Dtp(mk) > ts, the relation Aseq(mk) > Aseq(mi) holds.

We assume the following property of the function Pseq.

The proposed sequence number axiom:

forall mi, mk : (mi in Sp) and (mk in Sp) : Atp(mk) > Dtp(mi) implies Pseq(mk, P) > Aseq(mi)    [A3]

A.1.1. The Stability Test

Fig. A1 depicts the possible interleavings of arrival and decision times for messages mi and mk. In Fig. A1(a), Atp(mk) > Dtp(mi), from which we will prove that Aseq(mk) > Aseq(mi). In Figs. A1(b) and A1(c), Aseq(mk) is assigned before Aseq(mi) and hence the order of mk relative to mi is known at time Dtp(mi). However, in Figs. A1(d) and A1(e), Aseq(mk) is assigned after Aseq(mi). Hence mi does not know its order relative to mk until time Dtp(mk). So, for ensuring Stable(mi, P, ts), it is sufficient to choose ts such that ts > Dtp(mk) for the scenarios depicted in Figs. A1(d) and A1(e). This suggests the following Stability Test:

ST(mi, ts, P) = {ts >= Dtp(mi)} and [forall mk : (mk in Sp) and (Atp(mk) < Dtp(mi)) : Decided(mk, ts)]    [A2]

It states that a message mi in a replica P is stable at time ts if every message in Sp that had arrived before Dtp(mi) is assigned a sequence number by time ts.

Proof

Refer to Fig. A1(a).

Atp(mk) > Dtp(mi)                                                       (S1)
implies Pseq(mk, P) > Aseq(mi)                              (by A3)    (S2)

Again,

Aseq(mk) >= Pseq(mk, P)                                     (by F0)    (S3)

(S2) and (S3) imply Aseq(mk) > Aseq(mi)                                 (S4)

We now consider a message mj in Sp satisfying

Dtp(mj) > ts                                                            (S5)

(S5) and ST(mi, ts, P) imply Dtp(mj) > Dtp(mi)
implies {Dtp(mj) > Dtp(mi)} and [{Atp(mj) < Dtp(mi)} or {Atp(mj) > Dtp(mi)}]    (S6)
                                        [2nd conjunct is a tautology]
= [{Dtp(mj) > Dtp(mi)} and {Atp(mj) < Dtp(mi)}]
  or [{Dtp(mj) > Dtp(mi)} and {Atp(mj) > Dtp(mi)}]
                                        (distribution of AND over OR)
implies Decided(mj, ts) or [{Dtp(mj) > Dtp(mi)} and {Atp(mj) > Dtp(mi)}]
                                        (by A2)
implies (Dtp(mj) <= ts) or (Aseq(mj) > Aseq(mi))
                                        (by S4)
implies not(Dtp(mj) > ts) or (Aseq(mj) > Aseq(mi))
implies [{Dtp(mj) > ts} implies {Aseq(mj) > Aseq(mi)}]                  (S7)
                                        [implication-introduction]

We have thus derived the truth of a predicate of the form

A and B implies (A implies X)


where

A = (Dtp(mj) > Dtp(mi))
B = ST(mi, ts, P)
X = (Aseq(mj) > Aseq(mi))

This is equivalent to B implies (A implies X). Substituting for A, B, and X,

ST(mi, ts, P) implies [{Dtp(mj) > ts} implies {Aseq(mj) > Aseq(mi)}]
implies forall mj : (mj in Sp) and (Dtp(mj) > ts) : Aseq(mj) > Aseq(mi)
                          (by (S5), (S7), and forall-introduction)
implies Stable(mi, P, ts)      (by A1)

End Proof

A.2. Implementing Byzantine Resilience

There are two aspects to designing Byzantine Resilience: (1) Atomic Broadcast of input messages; and (2) Voting on output messages. Messages exchanged amongst the processors are signed. For Atomic Broadcast [4], the classical Interactive Consistency Algorithm [1] has been employed. It reaches agreement on the proposed sequence numbers in two rounds. A deadline is set for each round. When the deadline expires, the time-server process (p5) notifies the decide_seq_no process (p4). We furnish a formal proof of the algorithm in the following section. Intra-node voting of output messages is straightforward: each replica Pi awaits the arrival of an identical message from another replica Pj. The arrival is guaranteed since at most one replica can be faulty and the replicas are fully connected. A voted message is double-signed; it bears the signatures of Pi and Pj. After voting, Pi sends the double-signed message to the destination node. Any replica in the destination node recognizes its input message from the presence of the two signatures.

A.3. Constructing the set of proposed sequence numbers by Interactive Consistency

Each replica constructs, for each input double-signed message M, an Interactive Consistency Vector Ii with

Ii[j] = the sequence number for M proposed by Pj.

To illustrate the algorithm and its proof of correctness, we introduce the following definitions:

xmi  = the sequence number proposed by Pm to Pi;
ymij = the value of xmi relayed by Pi to Pj;
F = the set of faulty replicas;
H = the set of healthy replicas.

As required by the Interactive Consistency Algorithm, xmi is accepted in round-1 and ymij in round-2.

Axioms:

(Pi in H) implies ymij = xmi                                          [A4]
(Pj in H) implies ymji = xmj                                          [A5]

Definition: NIL is a special value assigned to xmi or ymji when no value is received by message passing or when two distinct values are received for the same variable. This models the behaviour of faulty senders.

The algorithm for constructing the Interactive Consistency Vector is given below.

Ii[m]: if xmi = NIL then Ii[m] := ymji
       else if ymji = NIL then Ii[m] := xmi
       else if ymji <> xmi then Ii[m] := NIL
       else Ii[m] := xmi.                                             (F1)

Ij[m]: if xmj = NIL then Ij[m] := ymij
       else if ymij = NIL then Ij[m] := xmj
       else if ymij <> xmj then Ij[m] := NIL
       else Ij[m] := xmj.                                             (F2)

To prove the correctness of (F1) or (F2) we need to prove

(Pi in H) and (Pj in H) implies Ii[m] = Ij[m] for m in {0, 1, 2},     [P1]

i.e. all non-faulty replicas compute identical Interactive Consistency Vectors; and

(Pm in H) and (Pi in H) implies Ii[m] = xmi,                          [P2]

i.e. the element of the Interactive Consistency Vector for a non-faulty sender is the actual value sent.

Proof

The proof involves a case by case analysis of different combinations of values of the variables. We prove P1 followed by P2.

Proof of Proposition P1

Hypothesis 1: (Pi in H) and (Pj in H)

Subcase 1.1: xmi = NIL

(Hypothesis 1) and (Hypothesis 1.1)
implies Ii[m] = ymji         (by F1)
              = xmj          (by A5 and 1)                            (B1)


There are two possibilities:

1.1.1 xmj = NIL
1.1.2 xmj <> NIL

Hypothesis 1.1.1
implies Ij[m] = ymij         (by F2)
              = xmi          (by A4 and 1)
              = NIL          (by Hypothesis 1.1)
              = xmj          (by 1.1.1)
              = Ii[m]        (by B1)                                  (B2)

For 1.1.2, we observe

ymij = xmi                   (by A4 and 1)
     = NIL                   (by 1.1)                                 (B3)

Using 1.1.2 and (B3) in (F2),

Ij[m] = xmj = Ii[m]          (by B1)

i.e. Ij[m] = Ii[m]                                                    (B4)

Thus, from (1.1), (B2), and (B4),

(1) and (1.1), i.e. (Pi in H) and (Pj in H) and (xmi = NIL), implies Ii[m] = Ij[m]    (B5)

Subcase 1.2: xmi <> NIL

The proof depends on the value of ymji. There are two possibilities:

Subcase 1.2.1: ymji = NIL
Subcase 1.2.2: ymji <> NIL

(Hypothesis 1) and (Hypothesis 1.2) and (Hypothesis 1.2.1)
implies Ii[m] = xmi          (by F1)                                  (B6)

Now, xmj = ymji              (by A5 and 1)
         = NIL               (by 1.2.1)

i.e. xmj = NIL                                                        (B7)

(B7) implies Ij[m] = ymij    (by F2)
                   = xmi     (by A4 and 1)
                   = Ii[m]   (by B6)

Thus, (1) and (1.2) and (1.2.1) imply Ii[m] = Ij[m],
i.e. (Pi in H) and (Pj in H) and (xmi <> NIL) and (ymji = NIL) implies Ii[m] = Ij[m]    (B8)

For ymji <> NIL (Subcase 1.2.2) there are two possibilities:

1.2.2.1 ymji <> xmi
1.2.2.2 ymji = xmi

(1) and (1.2) and (1.2.2) and (1.2.2.1)
implies Ii[m] = NIL          (by F1)                                  (B9)

Also, xmj = ymji             (by A5 and 1)
          <> NIL             (by 1.2.2)

i.e. xmj <> NIL                                                       (B10)

Again, ymij = xmi            (by A4 and 1)
            <> NIL           (by 1.2)

i.e. ymij <> NIL                                                      (B11)

(Hypothesis 1.2.2.1)
implies xmj <> xmi           (by A5)
implies xmj <> ymij          (by A4)

i.e. ymij <> xmj                                                      (B12)

Finally, (B10) and (B11) and (B12)
imply Ij[m] = NIL            (by F2)
            = Ii[m]          (by B9)

Thus, (1) and (1.2) and (1.2.2) and (1.2.2.1) imply Ii[m] = Ij[m]     (B13)

On the other hand, if ymji = xmi (Subcase 1.2.2.2),

(1) and (1.2) and (1.2.2) and (1.2.2.2)
implies Ii[m] = xmi          (by F1)                                  (B14)

Also, xmj = ymji             (by A5 and 1)
          <> NIL             (by 1.2.2)

i.e. xmj <> NIL                                                       (B15)

Again, ymij = xmi            (by A4 and 1)
            <> NIL           (by 1.2)

i.e. ymij <> NIL                                                      (B16)

Finally, ymij = xmi          (by A4)
              = ymji         (by 1.2.2.2)
              = xmj          (by A5)

i.e. ymij = xmj                                                       (B17)

(B15) and (B16) and (B17)
imply Ij[m] = xmj            (by F2)
            = ymji           (by A5)
            = xmi            (by 1.2.2.2)
            = Ii[m]          (by B14)

So, (1) and (1.2) and (1.2.2) and (1.2.2.2) imply Ii[m] = Ij[m]       (B18)

(1) and (1.2) and (1.2.2)
imply (1) and (1.2) and (1.2.2) and (1.2.2.1 or 1.2.2.2)
imply [(1) and (1.2) and (1.2.2) and (1.2.2.1)] or [(1) and (1.2) and (1.2.2) and (1.2.2.2)]
imply (B13) or (B18)
imply Ii[m] = Ij[m]                                                   (B19)

(1) and (1.2)
imply (1.2) and (1.2.1 or 1.2.2)
imply [(1) and (1.2) and (1.2.1)] or [(1) and (1.2) and (1.2.2)]
imply (B8) or (B19)
imply Ii[m] = Ij[m]                                                   (B20)

FINALLY,

(1) implies (1.1) or (1.2)
implies (B5) or (B20)
implies Ii[m] = Ij[m]                                                 (B21)

i.e. (Pi in H) and (Pj in H) imply Ii[m] = Ij[m]. Proposition P1 proved.

Proof of Proposition P2

The hypothesis is

Hypothesis 2: (Pm in H) and (Pi in H)

To prove P2, we introduce the following axioms:

The Nonfaulty Processor Axiom:
(Pm in H) implies (xmi <> NIL) and (xmi = xmj)                        [A6]

The Signed-Message Axiom:
(ymji <> NIL) implies ymji = xmj                                      [A7]

We consider two subcases:

2.1 ymji = NIL
2.2 ymji <> NIL

(Hypothesis 2) implies xmi <> NIL              (by A6)                (B22)

(2) and (2.1) imply Ii[m] = xmi                (by B22 and F1)        (B23)

(Hypothesis 2.2) implies ymji = xmj            (by A7)                (B24)

(Hypothesis 2) implies xmi = xmj               (by A6)                (B25)

(B24) and (B25) imply ymji = xmi                                      (B26)

(2) and (2.2) imply Ii[m] = xmi                (by B22, B26, and F1)  (B27)

So, (2) implies (2) and [(2.1) or (2.2)]
implies [(2) and (2.1)] or [(2) and (2.2)]
implies (B23) or (B27)
implies Ii[m] = xmi                                                   (B28)

Proposition P2 proved.

To complete the proof of correctness of Ii = Ij, we need the following axiom:

The Private Value Axiom:
(Pi in H) implies [forall m : (Pm in R) : xim = Ii[i] = Pseq(M, Pi)]  [A8]

where R is the set of all replicas. Then

(Pi in H) and (Pm in H) imply Ii[m] = xmi      (by B28)
                                    = Im[m]    (by A8)                (B29)

Equations (B21) and (B29) imply that the Interactive Consistency Vectors Ii and Ij are identical and, in addition, the element corresponding to a nonfaulty replica Pm is the local value of Pm, i.e. Im[m].

END PROOF