Computer Communications 23 (2000) 13–21 www.elsevier.com/locate/comcom
$\Delta$-Causality and $\varepsilon$-delivery for wide-area group communications
T. Tachikawa*, H. Higaki, M. Takizawa
Department of Computers and Systems Engineering, Tokyo Denki University, Ishizaka Hatoyama, Hiki-gun Saitama, Tokyo, 350-0394 Japan
Received 31 October 1997; received in revised form 31 March 1999; accepted 31 March 1999
Abstract
In distributed applications, a group of multiple processes cooperate by exchanging messages. It is critical to support the group of application processes with sufficient quality of service (QoS), including the ordered delivery of messages. The delay time and the message loss ratio are significant QoS parameters. In Internet applications, the delay time and the loss ratio differ significantly among communication channels. We define a novel causality named $\Delta^*$-causality among the messages to hold in the world-wide environment. We discuss how to transmit messages to the destination processes and how to resolve message loss and delay while supporting the $\Delta^*$-causality, given the requirements of delay time and message loss ratio. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Group communication protocol; Causally ordering; $\Delta$-Causality; $\varepsilon$-Delivery; Wide-area group
1. Introduction

In distributed applications like teleconferences, a group of multiple processes cooperate by exchanging multimedia data. Group communication protocols support a group of processes with the reliable and ordered delivery of messages to multiple destinations in the group. Transis and others support causally ordered delivery. ISIS (ABCAST), Amoeba, Trans/Total, Rampart, and others support totally ordered delivery. The group communication protocols discussed so far assume that every communication channel has almost the same delay time, and most assume that the communication network is reliable and often synchronous, i.e. no message loss and bounded delay time.

The FACE project is now developing world-wide teleconferences among agents distributed in Japan, the USA and the UK. Here, let us consider a world-wide teleconference among processes K, U, S and H in Keele (UK), UCLA (USA), and Sendai and Hatoyama (Japan), respectively. Over the Internet, it takes about 60 ms to propagate a message within Japan, while it takes about 240 ms between Japan and Europe. In addition, the longer the distance, the more messages are lost. For example, more than 10% of the messages are lost between Japan and Europe while less than 1% is lost in Japan. Thus, each communication channel between the processes supports a different delay time and a different level of reliability in the wide-area group. If the traditional group communication protocols are adopted for the wide-area group, the time for delivering messages to the destinations is dominated by the channel with the longest delay and the lowest level of reliability. It is important to overcome these difficulties in the Internet.

In realtime multimedia applications, messages have to be delivered within some predetermined time units. The $\Delta$-causality among messages has been discussed, where $\Delta$ denotes the maximum delay time between the processes required by the application. That is, it is meaningless to receive a message m unless m is delivered within $\Delta$ after m is transmitted. The $\Delta$-causality assumes that every pair of processes has the same delay time $\Delta$.

Each communication channel between a pair of processes Pi and Pj supports a different quality of service (QoS), i.e. delay time $\delta_{ij}$ and message loss ratio $\varepsilon_{ij}$. $\delta_{ij}$ and $\varepsilon_{ij}$ are furthermore time-variant. For example, $\delta_{ij}$ and $\varepsilon_{ij}$ increase if the communication channel between Pi and Pj is congested. In contrast, the application requires the system to support some QoS. Here, let $\Delta_{ij}$ and $E_{ij}$ be the delay time and the message loss ratio required for a pair of processes Pi and Pj, respectively. The problem is how to support Pi and Pj with $\Delta_{ij}$ and $E_{ij}$ given $\delta_{ij}$ and $\varepsilon_{ij}$ in the group. If messages are lost in the network and $E_{ij} \le \varepsilon_{ij}$, some of the lost messages have to be retransmitted. This means that the less reliable the communication channel is, the longer it takes to deliver messages to the destinations. Thus, the delay $\delta_{ij}$ is related to the message loss ratio $\varepsilon_{ij}$. For example, if $E_{ij} > \varepsilon_{ij}$ and $\Delta_{ij} \approx \delta_{ij}$, a process Pi is required to send messages to another process Pj without retransmission of messages lost by Pj. In addition, the causality among messages has to be defined taking into account that some pair of $\Delta_{ij}$ and $\Delta_{kl}$ may be significantly different.

In this article, we newly define $\Delta^*$-causality. Then, we present a $\Delta^*$-causally ordered ($\Delta^*$CO) protocol which supports the $\Delta^*$-causality and can reduce the delay time to deliver messages and the number of messages retransmitted by means of the destination replication. In Section 2, we present a system model. In Section 3, we discuss the $\Delta^*$-causality. In Sections 4 and 5, we discuss the $\Delta^*$CO protocol and its data transmission procedure. In Section 6, we evaluate the $\Delta^*$CO protocol by comparing it with the traditional group communication protocol.

* Corresponding author. Tel.: +81-492-962911; fax: +81-492966185. E-mail addresses: [email protected] (T. Tachikawa), hig@takilab.k.dendai.ac.jp (H. Higaki), [email protected] (M. Takizawa)

0140-3664/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0140-3664(99)00091-2

Fig. 1. Distributed system.
2. System model

2.1. System layers

A distributed system is composed of three hierarchical layers, i.e. application, transport, and network layers, as shown in Fig. 1. A group of n ($\ge 2$) application processes AP1,…,APn cooperate by exchanging multimedia data to achieve some objectives. Each APi communicates with the others in the group by using the underlying group communication service supported by transport processes TP1,…,TPn. Each APi takes the group communication service supported by TPi. Here, let G denote a group of the transport processes (G = {TP1,…,TPn}) supporting AP1,…,APn. Data units transmitted at the transport layer and application layer are referred to as messages and streams, respectively. The network layer is considered to support each pair of processes TPi and TPj with a logical channel supporting the IP service. Data units transmitted at the network layer are referred to as packets. The channels are less reliable and asynchronous, i.e. some packets are lost, duplicated, or delivered out of order, and the delay time is unbounded.

The cooperation among the transport processes TP1,…,TPn ($n \ge 2$) is coordinated by group communication (GC) and group communication management (GCM) protocols. After establishing a group G among TP1,…,TPn, the GC protocol reliably and causally delivers messages to the destination processes in the group G. The GCM protocol monitors and manages the membership of G.

APi requests TPi to send a stream $s_i$ to APj. TPi decomposes the stream $s_i$ into messages, and sends the messages to TPj. Here, let $|s_i|$ denote the number of messages in the stream $s_i$. TPj assembles the messages into a stream $s_j$, and delivers $s_j$ to APj. In some applications, even if some messages are lost, i.e. $s_i \ne s_j$, APj is allowed to accept $s_j$ without retransmission of the lost messages. Here, the message loss ratio $\varepsilon_{ij}$ is defined to be $1 - |s_j| / |s_i|$. Let $E_{ij}$ be the maximum loss ratio between TPi and TPj required by the application. APj can accept $s_j$ if $E_{ij} > \varepsilon_{ij}$. In addition, $s_j$ is required to be delivered to APj within some time units $\Delta_{ij}$. If the delay time $\delta_{ij}$ from TPi to TPj is larger than $\Delta_{ij}$, it is meaningless for TPj to receive $s_j$. In addition, if some messages are lost, TPi has to retransmit the messages to TPj. Hence, the larger $\varepsilon_{ij}$ is, the larger $\Delta_{ij}$ is.
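The acceptance rule for a stream can be written out as a minimal sketch. The function and parameter names below are hypothetical and not part of the paper; the code only restates the definitions $\varepsilon_{ij} = 1 - |s_j|/|s_i|$, the loss test $E_{ij} > \varepsilon_{ij}$, and the delay requirement $\Delta_{ij}$.

```python
def loss_ratio(sent_count: int, received_count: int) -> float:
    """Message loss ratio: 1 - |s_j| / |s_i| (sent_count = |s_i|, received_count = |s_j|)."""
    if sent_count <= 0:
        raise ValueError("the stream must contain at least one message")
    return 1 - received_count / sent_count


def can_accept(sent_count: int, received_count: int,
               required_loss_ratio: float,
               delay_ms: float, required_delay_ms: float) -> bool:
    """Accept the stream if the observed loss stays below E_ij and the
    observed delay does not exceed Delta_ij (names are illustrative)."""
    within_loss = loss_ratio(sent_count, received_count) < required_loss_ratio
    within_delay = delay_ms <= required_delay_ms
    return within_loss and within_delay
```

For instance, a stream of 100 messages of which 95 arrive within the required delay is acceptable for $E_{ij} = 0.1$, while one of which only 80 arrive is not.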
2.2. Network parameters
Fig. 2. Packet receipt ratio vs. delay.
Every transport process TPi has to know the delay time $\delta_{ij}$ and the loss ratio $\varepsilon_{ij}$ with each TPj in the group G. In the GCM protocol, TPi requests the network to transmit two kinds of ICMP packets: “Timestamp” and “Timestamp Reply”. TPi calculates $\delta_{ij}$ from the obtained round trip time between TPi and TPj. TPi periodically sends ICMP packets to all the processes in G. Here, TPj is nearer to TPi than TPk if $\delta_{ij} < \delta_{ik}$. In addition, the GCM protocol obtains $\varepsilon_{ij}$ by monitoring packets lost between each pair of TPi and TPj. Here, we assume that $\delta_{ij} = \delta_{ji}$ and $\varepsilon_{ij} = \varepsilon_{ji}$. $\bar{\varepsilon}_{ij}$ and $\bar{\delta}_{ij}$ denote the averages of $\varepsilon_{ij}$ and $\delta_{ij}$, respectively. $\delta_{ij}$ is significantly larger than $\delta_{kl}$ ($\delta_{ij} \gg \delta_{kl}$) if $\delta_{ij} \ge 2\delta_{kl}$.

Quality of service (QoS) supported by the network is characterized by delay $\delta^* = \{\delta_{ij} \mid i, j = 1, \ldots, n\}$ and loss ratio $\varepsilon^* = \{\varepsilon_{ij} \mid i, j = 1, \ldots, n\}$. The network is regular if $\delta_{ij} + \delta_{jk} \ge \delta_{ik}$ and $(1 - \varepsilon_{ij})(1 - \varepsilon_{jk}) \le 1 - \varepsilon_{ik}$ for every three processes TPi, TPj and TPk. In an irregular network, TPi may deliver a packet m to TPk via TPj faster than by directly sending m to TPk, depending on the routing strategies.

Through a world-wide experiment using the Internet, we obtain the statistics of $\delta_{ij}$ and $\varepsilon_{ij}$. First, the delay times among the processes H in Hatoyama, S in Sendai, U in UCLA, and K in Keele are measured. H sends 5000 ‘ICMP echo’ packets to S, U, and K. Each destination process monitors the delay time of each packet, i.e. how long it takes each packet to arrive at the destination. Here, $R_{ij}(t)$ denotes the ratio of the packets which take t time units to arrive at the destination TPj from TPi to the total number of packets transmitted, i.e. 5000. Fig. 2 shows $R_{HS}(t)$, $R_{HU}(t)$, and $R_{HK}(t)$. The longer the distance between a pair of processes is, the more the receipt ratio at the processes fluctuates. For example, 80% of packets arrive at S from H within 50 ms after the fastest packet arrives, i.e. $\int_{30}^{80} R_{HS}(t)\,dt = 0.8$. Here, it takes about 30 ms for the fastest packet to arrive at S. In contrast, 80% of the packets arrive at U and K within 85 and 137 ms, respectively, after the fastest packet arrives there. If a process TPi does not receive the receipt confirmation from TPj within some time units after sending a packet m, TPi considers that TPj has lost m. If the timeout period is $2 \times 50 = 100$ ms, about 20% of packets are considered to be lost between H and S. In order to receive more than 80% of packets between H and U and between H and K, the timeout period has to be larger than 170 and 274 ms, respectively. In addition, Table 1 shows the packet loss ratios $\varepsilon_{HS}$, $\varepsilon_{HU}$ and $\varepsilon_{HK}$ for the processes S, U and K. Only 0.9% of the packets are lost for S in Japan; however, 8.3 and 11.7% are lost for U in USA and K in UK, respectively. Table 1 shows that the longer the distance is, the more packets are lost.

Table 1
Delay (ms) and lost (%)
Host   Minimum   Average   Maximum    Lost
S      30.437    60.427    756.263    0.9
U      119.506   157.171   532.433    8.3
K      164.497   241.370   2733.565   11.7

3. $\Delta^*$-Causality

3.1. $\Delta$-Causality

In a group G = {TP1,…,TPn}, messages have to be delivered in the causal order.

[Causally precedent relation] A message m1 causally precedes another message m2 ($m_1 \to m_2$) if
1. m1 is sent before m2 by a process,
2. m2 is sent after delivering m1 by a process, or
3. $m_1 \to m_3 \to m_2$ for some message m3.

In realtime multimedia applications, messages sent by a transport process TPi have to be delivered to the destinations by the deadline specified for the messages. Thus, a destination process TPj has to receive a message m within $\Delta$ time units after TPi sends m. $\Delta$ denotes the maximum delay time between the processes in G, which is required by the application. Here, let $ts_i(m)$ be the time when TPi sends a message m. Let $tr_j(m)$ be the time when TPj receives m. m is referred to as received in $\Delta$ by TPj iff $ts_i(m) + \Delta \ge tr_j(m)$. The causality based on $\Delta$ is defined as follows.

[$\Delta$-Causality] For every pair of messages m1 and m2, m1 $\Delta$-causally precedes m2 ($m_1 \xrightarrow{\Delta} m_2$) iff
• $m_1 \to m_2$ and
• $ts(m_1) + \Delta \ge ts(m_2)$.

In a group G of three processes TP1, TP2 and TP3 as shown in Fig. 3, TP1 sends a message m1 to TP2 and TP3. TP2 sends m2 after receiving m1 in $\Delta$ ($m_1 \xrightarrow{\Delta} m_2$). Then, TP2 sends m3 to TP3. TP3 receives m2 in $\Delta$ after m2 is sent and receives m1 not in $\Delta$. Hence, TP3 delivers m2 but not m1.

Fig. 3. $\Delta$-causality.

3.2. $\Delta^*$-Causality

A wide-area group G = {TP1,…,TPn} is a group of the transport processes TP1,…,TPn where $\delta_{ij} \gg \delta_{kl}$ and $\varepsilon_{ij} \gg \varepsilon_{kl}$ for some processes TPi, TPj, TPk and TPl. Here, each $\Delta_{ij}$ is specified for every pair of TPi and TPj by the application. $\Delta_{ij}$ can be obtained based on the statistics of the delay time $\delta_{ij}$ and the loss ratio $\varepsilon_{ij}$ between TPi and TPj. The shorter $\Delta_{ij}$ is, the more messages are lost between TPi and TPj. That is, if $\Delta_{ij}$ is smaller, more messages take longer than $\Delta_{ij}$ to be delivered, and there is no time to retransmit messages that are lost. One idea is to take $\Delta_{ij}$ to be longer than the average $\bar{\delta}_{ij}$, i.e. $\Delta_{ij} \ge \bar{\delta}_{ij}$. $\Delta_{ij} \ge \Delta_{kl}$ may hold if the distance between TPi and TPj is larger than that between TPk and TPl. In addition, $\Delta_{ij}$ is related to the loss ratios $\varepsilon_{ij}$ and $E_{ij}$. If $E_{ij} > \varepsilon_{ij}$, messages are allowed to be lost, that is, there is no need to retransmit the lost messages. Otherwise, some messages lost by TPj have to be retransmitted by TPi. Here, suppose TPi sends a message m to TPj.
On receipt of m, TPj sends the receipt confirmation m′ to TPi. If TPi does not receive m′ within $2\delta_{ij}$ after sending m, TPi retransmits m to TPj. TPi retransmits m to TPj again if TPi does not receive the confirmation. The expected time to surely deliver m is $\delta_{ij}(1 + \varepsilon_{ij})/(1 - \varepsilon_{ij})$. This means that it takes a longer time to deliver m if $\varepsilon_{ij}$ is larger.

Here, let $\Delta^*$ be a set $\{\Delta_{ij} \mid i, j = 1, \ldots, n\}$ of the delay requirements.

[$\Delta^*$-causality] Let m1 and m2 be messages sent by processes TPi and TPj, respectively. m1 $\Delta^*$-causally precedes m2 ($m_1 \xrightarrow{\Delta^*} m_2$) iff
• $m_1 \to m_2$ and
• $ts(m_1) + \Delta_{ij} \ge ts(m_2)$.

That is, m2 is sent within $\Delta_{ij}$ time units after m1 is sent. In Fig. 4, the process TP1 sends a message m1 to TP2 and TP3, and TP2 sends m2 to TP3 after receiving m1. Since TP3 receives m2 in $\Delta_{32}$, TP3 delivers m2. Then, TP3 receives m1. Since TP3 receives m1 in $\Delta_{31}$, TP3 could deliver m1. However, since m2 is already delivered and $m_1 \xrightarrow{\Delta^*} m_2$, TP3 cannot deliver m1. If m1 is delivered, m2 cannot be delivered because m2 would then have to be delivered after $ts(m_2) + \Delta_{32}$. There is inconsistency among $\Delta_{12}$ and $\Delta_{23}$. This example shows that TPi may not deliver m even if m is received in $\Delta_{ij}$. Thus, the $\Delta^*$-causality may be inconsistent if each $\Delta_{ij}$ is independently decided.

[Consistency] A $\Delta^*$-causality precedent relation $\xrightarrow{\Delta^*}$ is consistent iff $ts(m_1) + \Delta_{ki} \le ts(m_2) + \Delta_{kj}$ and $m_1 \to m_2$ for every pair of messages m1 and m2 sent by processes TPi and TPj, respectively.

It is straightforward that the following theorem holds.

Theorem. $\xrightarrow{\Delta^*}$ is consistent if $\Delta_{ij} + \Delta_{jk} \ge \Delta_{ik}$ for every TPi, TPj and TPk.

Corollary. $\xrightarrow{\Delta^*}$ is consistent if $\Delta_{ij} = \Delta_{kj}$ for every triplet of processes TPi, TPj, and TPk.

That is, the $\Delta$-causality $\xrightarrow{\Delta}$ is consistent because $\Delta_{ij} = \Delta$ for every pair of TPi and TPj. In the wide-area group, the network may not be regular. Here, $\Delta^*$ may be inconsistent if each $\Delta_{ij}$ is proportional to $\delta_{ij}$. That is, $\Delta_{ij} + \Delta_{jk} < \Delta_{ik}$ if $\delta_{ij} + \delta_{jk} < \delta_{ik}$ for some TPi, TPj, and TPk. There are the following ways to resolve the inconsistency of the $\Delta^*$-causality:
1. to neglect messages which do not satisfy the $\Delta^*$-causality; and
2. to change some $\Delta_{ij}$ so that $\Delta^*$ is consistent.

First, let us consider the example shown in Fig. 4. TP3 receives m2 in $\Delta_{23}$ and m1 in $\Delta_{13}$. One way is that TP3 delivers m2 exactly $\Delta_{23}$ after m2 is sent, i.e. m1 is rejected.
Fig. 4. $\Delta^*$-causality.
The other way is to wait for m1. As a result, m2 is rejected since m2 is received after $ts(m_2) + \Delta_{23}$. In the latter case, if m1 is lost, neither m1 nor m2 is received although m2 can be received. Thus, m1 or m2 cannot be accepted even if each of them can satisfy the sender–receiver delay constraint. Suppose that $\min(\Delta_{1i}, \ldots, \Delta_{ni}) \le \Delta_i \le \max(\Delta_{1i}, \ldots, \Delta_{ni})$ holds for each process TPi. Let $T_i$ be a variable showing the current time in TPi. TPi buffers the messages received. If there is a message m from TPj in the buffer such that $ts_j(m) + \Delta_i = T_i$ and $ts_j(m) + \Delta_{ji} < T_i$, m is delivered. The smaller $\Delta_i$ gets, the more messages from the more distant processes are rejected as presented in the preceding section. Even if all the messages received satisfy the $\Delta^*$-causality, the other requirement on $E^*$ may not be satisfied.

Next, we discuss how to obtain a consistent causality $\Delta^+$ from $\Delta^*$ if $\Delta^*$ is inconsistent. $\Delta^*_{ij}$ is defined to be the minimum delay time among the paths from TPi to TPj. That is, $\Delta^*_{ij}$ is $\min(\Delta_{ij}, \Delta_{i1} + \Delta^*_{1j}, \ldots, \Delta_{in} + \Delta^*_{nj})$. If the theorem holds, $\Delta^*_{ij} = \Delta_{ij}$. Otherwise, $\Delta^*_{ij} = \Delta^*_{ik} + \Delta_{kj}$ for some TPk. We define the following set $\Delta^+ = \{\Delta^+_{ij} \mid i, j = 1, \ldots, n\}$ from $\Delta^*$.

• $\Delta^+ = \{\Delta^+_{ij} \mid \Delta^+_{ij} = \Delta^*_{ij}$ if $\Delta_{ik} + \Delta_{kj} \ge \Delta_{ij}$ for every process TPk, otherwise $\Delta^+_{ij} = \max(\{\Delta^*_{ik} + \Delta_{kj} \mid \Delta^*_{ik} + \Delta_{kj} \ge \Delta_{ij}$ for every TPk$\})\}$.

It is clear that $\Delta^+$ is consistent because $\Delta^+_{ij} + \Delta^+_{jk} \ge \Delta^+_{ik}$ for every three processes TPi, TPj and TPk. However, $\Delta^+_{ij} > \Delta_{ij}$ for some pair of processes TPi and TPj. Even if TPi receives a message m from TPj in $\Delta^+_{ij}$, it might be too late to deliver m to the application.
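The sufficient condition of the theorem in Section 3.2 ($\Delta_{ij} + \Delta_{jk} \ge \Delta_{ik}$ for every triple) is easy to check mechanically. The following is a minimal sketch, assuming the delay requirements are given as a square matrix indexed by process number; the function names and the matrix layout are illustrative and do not come from the paper.

```python
from itertools import permutations


def inconsistent_triples(delta):
    """Return every (i, j, k) with delta[i][j] + delta[j][k] < delta[i][k],
    i.e. every triple that violates the condition of the theorem."""
    n = len(delta)
    return [(i, j, k)
            for i, j, k in permutations(range(n), 3)
            if delta[i][j] + delta[j][k] < delta[i][k]]


def satisfies_theorem(delta):
    """True if no triple violates the triangle condition."""
    return not inconsistent_triples(delta)
```

With processes numbered H = 0, S = 1, U = 2, K = 3 and the requirements 90, 168 and 320 ms used in the evaluation for H, a hypothetical value of 100 ms between U and K yields the violating triple (0, 2, 3), since 168 + 100 < 320.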
4. $\Delta^*$-Causally ordered protocol

We present a distributed protocol named the $\Delta^*$CO protocol for transmitting messages in the group G = {TP1,…,TPn} so as to satisfy the $\Delta^*$-causality. The requirements of the delay time $\Delta^*$ and message loss ratio $E^* = \{E_{ij} \mid i, j = 1, \ldots, n\}$ are specified by the application.

4.1. Transmission strategies

First, we discuss how to transmit messages to the destination processes so that the $\Delta^*$-causality is satisfied given the network environment $\delta^*$ and $\varepsilon^*$. Since some messages are lost in the communication channels, TPi detects that TPk loses m1 if TPi receives no receipt confirmation of m1 from TPk within some time units $\tau_{ik}$. As shown in Fig. 2, the delay time $\delta_{ik}$ is variant. $\varepsilon'_{ik} = 1 - \int_0^{\tau_{ik}} R_{ik}(t)\,dt$ gives the probability that each message sent by TPi does not arrive at TPk within $\tau_{ik}$ after m1 is sent by TPi. Here, $\varepsilon'_{ik} \ge \varepsilon_{ik}$, because $R_{ik}(t)$ is a ratio to the total number of packets transmitted. If m1 is detected to be lost, TPi retransmits m1 to TPk. The expected delay time $\lambda_{ik}$ to deliver m1 to TPk is $(1 + 2\varepsilon'_{ik} + 2\varepsilon'^{2}_{ik} + \cdots)\tau_{ik} = \tau_{ik}(1 + \varepsilon'_{ik})/(1 - \varepsilon'_{ik})$. $\varepsilon'_{ik}$ is $\varepsilon_{ik}$ if $\delta_{ij} = \delta_{ik}$. Here, $\lambda_{ik} = \delta_{ik}(1 + \varepsilon_{ik})/(1 - \varepsilon_{ik})$.

Here, let us consider three processes TPi, TPj and TPk in the group G. TPi sends a message m1 to TPj and TPk. TPj sends another message m2 to TPk after receiving m1 from TPi. One traditional way is to send m1 directly to TPj and TPk. Then, m1 may arrive at TPk after m2. Unless m1 arrives at TPk within $\Delta_{ik}$ after m1 is sent, m1 is not delivered as presented in the preceding section. If m1 to TPk is lost, the constraint $\Delta_{ik}$ or $\Delta_{jk}$ might not hold when TPk receives m1 which TPi retransmits. In order to deliver m1 so as to satisfy the $\Delta^*$-causality, multiple replicas of m1 can be transmitted by the following methods:
1. sender replication; and
2. destination replication.

In the sender replication, the sender TPi sends $r_{ik}$ ($\ge 1$) replicas of m1 to TPk. Since each replica is lost with probability $\varepsilon_{ik}$, the expected number of replicas which arrive at TPk is $(1 - \varepsilon_{ik})\, r_{ik}$. If at least one of the replicas arrives at TPk, TPk receives m1. This is similar to the saturation protocol. Hence, the expected minimum delay time $\lambda_{ik}$ for TPk to receive m1 is given as $\int_0^{\lambda_{ik}} R_{ik}(t)\,dt = 1/\{(1 - \varepsilon_{ik})\, r_{ik}\}$. That is, if TPi sends $r_{ik}$ replicas of m1 to TPk given $\varepsilon_{ik}$, TPk receives at least one replica of m1 within $\lambda_{ik}$. If $\lambda_{ik} \le \Delta_{ik}$, TPk can accept m1.

We measure how reliably the destination can receive a message m by sending multiple replicas of m in the Internet environment presented in Fig. 2. Here, the process H in Japan sends r replicas of m to the processes S in Japan, U in USA, and K in UK. Table 2 shows measured receipt ratios of messages for r = 1, 2, 3. For example, if the process H sends three replicas of m to the process K, i.e. r = 3, K receives at least one replica in 99.84% of the cases. If two replicas are sent, i.e. r = 2, about 2% of the messages are still lost between H and K.

Table 2
Receipt ratio by replication (%)
r    1       2       3
S    99.08   99.96   99.96
U    89.74   99.16   99.84
K    88.36   98.20   99.84

Next, suppose that a process TPj sends m2 with m1 to another process TPk as shown in Fig. 5. TPk can receive two replicas of m1 from TPi and TPj. Even if it takes a longer time than $\Delta_{ik}$ to deliver m1 from TPi to TPk, TPk receives another replica of m1 sent by TPj. If m2 satisfies $\Delta_{jk}$, TPk can receive m1 and m2. This is the destination replication. Here, we define which messages TPj can forward together with m2 in the destination replication.

[Definition] A message m2 subsumes another message m1 in TPj with respect to TPk ($m_1 \preceq_{jk} m_2$) iff
1. m1 and m2 are sent to a common destination process TPk,
2. $m_1 \xrightarrow{\Delta^*} m_2$; and
3. TPj does not send a message to TPk after receiving m1 before sending m2.

Let $\tau_{jk}(m_2)$ be a set of messages $\{m_1 \mid m_1 \preceq_{jk} m_2\}$. TPj can send a message m2 with $\tau_{jk}(m_2)$ to TPk. If each process TPj forwards a message m to the other destinations, $s(s - 1)$ replicas of m are transmitted for a number s of the destinations of m. Hence, each TPj has to decide which message m1 in $\tau_{jk}(m_2)$ to forward to TPk on sending m2 to TPk in order to reduce the number of replicas transmitted. In the $\Delta^*$CO protocol, TPj sends m2 with a replica of m1 to TPk if the following conditions hold:
1. $ts(m_1) + \Delta_{ik} \ge ts(m_2) + \delta_{jk}$;
2. $\delta_{ij} + \delta_{jk} \le \delta_{ik}$; and
3. $(1 - \varepsilon_{ij})(1 - \varepsilon_{jk}) > (1 - \varepsilon_{ik})$.

Unless the first condition holds, m1 cannot satisfy $\Delta_{ik}$. m1 can arrive at TPk from TPj earlier than from TPi if the second condition holds. Unless the third condition holds, m1 can be delivered to TPk more reliably by TPi directly than by TPj forwarding m1 to TPk.
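The effect of the sender replication discussed in this section can be illustrated with a small calculation. The sketch below is not the paper's formula: it assumes that replicas are lost independently with probability $\varepsilon_{ik}$, so that at least one of r replicas arrives with probability $1 - \varepsilon_{ik}^{r}$. The function names are hypothetical.

```python
def receipt_probability(loss_ratio: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies arrives,
    assuming independent losses (an assumption, not a measured fact)."""
    if not 0.0 <= loss_ratio < 1.0 or replicas < 1:
        raise ValueError("need 0 <= loss_ratio < 1 and replicas >= 1")
    return 1.0 - loss_ratio ** replicas


def replicas_needed(loss_ratio: float, target: float) -> int:
    """Smallest number of replicas whose receipt probability reaches `target`."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    r = 1
    while receipt_probability(loss_ratio, r) < target:
        r += 1
    return r
```

With the Table 1 loss ratio for K (11.7%), this simple model gives about 88.3%, 98.6% and 99.8% for r = 1, 2, 3, which is close to the measured values in Table 2.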
5. Data transmission procedure

5.1. Basic data transmission
Fig. 5. Destination replication.
The messages have to be delivered to the destination processes in the group G = {TP1,…,TPn} so as to satisfy the $\Delta^*$-causality and $E^*$ given the QoS $\delta^*$ and $\varepsilon^*$ supported by the network. The messages are causally ordered by using the vector clock. Each transport process TPi manipulates the variables VC1,…,VCn showing the vector clock to causally order the messages received. TPi increments VCi by one each time TPi sends a message m. m carries the vector clock m.VC = ⟨m.VC1,…,m.VCn⟩. Here, m.VCj := VCj (j = 1,…,n). On receipt of m from a process TPj, VCj := max(VCj, m.VCj) (j = 1,…,n). For every pair of messages m1 and m2, m1 causally precedes m2 iff m1.VC < m2.VC.

Each message m is sent to its destination processes, not necessarily all the processes in the group G. m.DP denotes a set of destination processes, i.e. m.DP ⊆ G. Since a gap between messages cannot be detected by the vector clock, message loss is detected by the sequence numbers of the messages. The message m is given a vector m.SEQ = ⟨m.SEQ1,…,m.SEQn⟩ of sequence numbers. TPi manipulates variables SEQ = ⟨SEQ1,…,SEQn⟩. Each SEQj denotes a sequence number of a message for TPj. If m is destined to TPj, SEQj is incremented by one. Otherwise, SEQj is not changed. m.SEQj := SEQj (j = 1,…,n). SEQj is used to detect the loss of messages sent to TPj. In addition, TPi sends a message with the header but without data to every TPj if TPi has not sent any message to TPj for $\delta_{ij}$. That is, TPj receives at least one message from TPi every $\delta_{ij}$ time units. TPi has a variable $T_i$ showing the current time. m carries m.ST which shows when m is sent, i.e. m.ST := $T_i$.

The messages received are stored in the buffer RBUF. The messages in RBUF are causally ordered by using the vector clock. The messages sent are also stored in the buffer SBUF in the sending order. TPi manipulates variables REQ1,…,REQn to receive messages. REQj shows a sequence number such that TPi has received every message m from TPj where m.SEQi ≤ REQj but has not received m where m.SEQi ≥ REQj + 1. On receipt of a message m from TPj, if m.SEQi = REQj, TPi has received every message from TPj whose SEQi ≤ REQj. Then, REQj := REQj + 1 and MREQj := REQj. MREQj shows the maximum sequence number of messages received from TPj, i.e. MREQj := max(MREQj, m.SEQi). If m.SEQi > REQj, TPi finds a gap between m and m′ where m′.SEQi = REQj − 1. While MREQj > REQj, TPi looks for a message from TPj in RBUF whose SEQi is REQj + 1. If found, REQj := REQj + 1. This step is iterated until no such message is found.
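The vector clock operations used above can be sketched as follows. The paper does not spell out the meaning of m1.VC < m2.VC; the code assumes the usual componentwise definition (no component larger, at least one strictly smaller). The function names are hypothetical.

```python
def vc_less(a, b):
    """a < b for vector clocks: a[k] <= b[k] for all k and a != b
    (standard definition, assumed here)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def vc_merge(local, received):
    """Receipt rule: VC_k := max(VC_k, m.VC_k) for every k."""
    return [max(x, y) for x, y in zip(local, received)]
```

Two messages whose clocks are incomparable under `vc_less` are concurrent, so neither causally precedes the other.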
5.2. Message loss

In order to notify what messages TPi has received, each message m sent by TPi carries the fields m.ACK1,…,m.ACKn. Each time TPi sends m, m.ACKj := REQj (j = 1,…,n). TPi manipulates variables ACK11,…,ACKnn. ACKkj := m.ACKj (j = 1,…,n) on receipt of a message m from TPk. This shows that TPi knows that TPj has received every message m sent by TPk where m.SEQj ≤ ACKkj. A message m in SBUF is accepted in the group G iff m.SEQk ≤ ACKjk for every TPj in m.DP. Here, TPi knows that m is received by every destination. The accepted messages in G can be removed from SBUF. Suppose that some message m sent by TPk is not accepted in G, i.e. m.SEQj > ACKkj for some TPj. TPi considers that TPj loses m if the current time denoted by $T_i$ is larger than m.ST + $2\delta_{ij}$. A message m from TPk in RBUF is accepted in G iff m.SEQj ≤ ACKkj for every TPj in m.DP. That is, TPi knows that every destination receives m. If m.SEQj > ACKkj for some TPj in m.DP and $T_i$ > m.ST + $2\delta_{ij}$, TPi considers that TPj loses m.

Here, suppose that TPi detects that TPj loses the message m. m has to be retransmitted to TPj. In the $\Delta^*$CO protocol, not only the sender but also the destinations can retransmit m to TPj. We discuss which process retransmits m to TPj. Here, let $RP_i(m)$ (⊆ m.DP) be a subset of the destination processes of m which TPi knows to have received m. $RP_i(m)$ also includes the sender of m. For each destination TPk in $RP_i(m)$, $C_k = \delta_{ik} + \delta_{kj}(1 + \varepsilon_{kj})/(1 - \varepsilon_{kj})$. For the source process TPh of m, $C_h = \delta_{hj} + \delta_{hj}(1 + \varepsilon_{hj})/(1 - \varepsilon_{hj})$. TPi selects a process TPk in $RP_i(m)$ whose $C_k$ is the minimum and for which m.ST + $C_k \le \Delta_{hj}$. If TPi is selected, TPi forwards m to TPj.

5.3. Transmission and delivery

Each message m is composed of the following fields:
• m.SP = process TPi sending m.
• m.DP = collection of destination processes of m.
• m.ST = time when TPi sends m.
• m.VC = ⟨m.VC1,…,m.VCn⟩ = vector clock.
• m.SEQ = ⟨m.SEQ1,…,m.SEQn⟩ = vector of sequence numbers.
• m.ACK = ⟨m.ACK1,…,m.ACKn⟩ = sequence numbers of messages which TPi receives.
• m.DT = data.

Suppose that a transport process TPi sends data D to the processes $TP_{i_1}, \ldots, TP_{i_{l_i}}$ in the group G. TPi constructs the following message m.
• m.SP := TPi;
• m.DP := {$TP_{i_1}, \ldots, TP_{i_{l_i}}$};
• m.ST := $T_i$;
• VCi := VCi + 1;
• m.VC := VC;
• SEQj := SEQj + 1 for every TPj ∈ m.DP;
• m.SEQ := SEQ;
• m.ACK := REQ;
• m.DT := D;
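The message layout and the construction steps listed above can be sketched as a small data structure. This is only an illustration: the class names, the use of integer process indices, and the initial value 0 for all counters are assumptions that the paper does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sp: int            # m.SP: index of the sending process
    dp: frozenset      # m.DP: indices of the destination processes
    st: float          # m.ST: sending time
    vc: tuple          # m.VC: vector clock
    seq: tuple         # m.SEQ: per-destination sequence numbers
    ack: tuple         # m.ACK: copy of REQ at sending time
    dt: bytes          # m.DT: data


class TransportProcess:
    def __init__(self, index: int, n: int):
        self.index = index
        self.vc = [0] * n    # VC_1..VC_n (initial value assumed)
        self.seq = [0] * n   # SEQ_1..SEQ_n
        self.req = [0] * n   # REQ_1..REQ_n

    def build(self, destinations, data: bytes, now: float) -> Message:
        """Follow the construction steps: advance VC_i, advance SEQ_j for
        each destination, then snapshot the counters into the message."""
        self.vc[self.index] += 1
        for j in destinations:
            self.seq[j] += 1
        return Message(self.index, frozenset(destinations), now,
                       tuple(self.vc), tuple(self.seq), tuple(self.req), data)
```

Snapshotting the counters as tuples keeps a message that was already built unchanged when the process later sends further messages.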
Before sending m, TPi has to find messages subsumed by m in the receipt buffer RBUF. For each message m′ in RBUF, TPi forwards m′ to TPj if the following condition holds:

[Subsumption condition]
1. TPj ∈ m.DP ∩ m′.DP;
2. $T_i + \Delta_{ij} <$ m′.ST $+ \Delta_{kj}$ where TPk = m′.SP;
3. $\delta_{ki} + \delta_{ij} \le \delta_{kj}$; and
4. $(1 - \varepsilon_{ki})(1 - \varepsilon_{ij}) \ge (1 - \varepsilon_{kj})$.
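The four parts of the subsumption condition can be checked together in one predicate. The sketch below is illustrative only: the `Channel` record, the function name, and the numeric values in the usage note are hypothetical, and the code simply restates the four inequalities listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    delay: float      # measured delay (delta), in ms
    loss: float       # measured loss ratio (epsilon)
    required: float   # required delay (Delta), in ms


def should_forward(j, m_dp, mp_dp, now, mp_sent_at,
                   ki: Channel, ij: Channel, kj: Channel) -> bool:
    """TP_i forwards m' (sent by TP_k at mp_sent_at) to TP_j together with m
    only if all four parts of the subsumption condition hold."""
    common_destination = j in m_dp and j in mp_dp
    in_time = now + ij.required < mp_sent_at + kj.required
    faster = ki.delay + ij.delay <= kj.delay
    more_reliable = (1 - ki.loss) * (1 - ij.loss) >= (1 - kj.loss)
    return common_destination and in_time and faster and more_reliable
```

For example, with hypothetical channels K–S (200 ms, 10% loss), S–H (60 ms, 0.9% loss, required 90 ms) and K–H (270 ms, 11.7% loss, required 320 ms), S would forward a message sent by K at time 0 if it acts at time 210, but not at time 240.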
TPi sends $r_{ij}$ ($\ge 1$) replicas of m to TPj in order to surely deliver m to TPj. The number $r_{ij}$ of the replicas is obtained by the following equation:

$$\int_0^{\Delta_{ij}} R_{ij}(t)\,dt = \frac{1}{(1 - \varepsilon_{ij})\, r_{ij}}.$$

Here, suppose that TPi receives a message m from TPj. m is stored in RBUF and MREQj := max(MREQj, m.SEQi). TPi manipulates the variables as follows:

    if (m.SEQi = REQj) {
        REQj := REQj + 1;
        for (k = 1,…,n) {
            ACKjk := max(ACKjk, m.ACKk);
            VCk := max(VCk, m.VCk);
        }
    }

Then, if MREQj ≥ REQj, TPi searches for a message m in RBUF whose SEQi = REQj. If found, the procedure presented above is executed.

Each message m in RBUF is delivered if the following condition holds: $T_i$ = m.ST + $\Delta_{ki}$ where TPk = m.SP. Before delivering m, some messages causally preceding m in RBUF have to be delivered even if the condition discussed above does not hold. DVC shows the vector clock of the message most recently delivered. That is, each message m′ in RBUF is delivered if the following condition is satisfied:
1. $m' \xrightarrow{\Delta^*} m$, i.e. m′.VC < m.VC; and
2. m.VC > DVC.

If m′ is delivered, DVC is changed as follows: DVCk := max(DVCk, m′.VCk) for k = 1,…,n. After delivering every message in RBUF satisfying both of the above two conditions, m is delivered if m.VC > DVC. Otherwise, m is discarded.

[Example] Suppose that a group Gex is composed of four processes, i.e. Gex = {K, U, S, H} (Fig. 6). The process K in UK sends a message m1 to the processes U in USA, S in Japan, and H in Japan. U sends another message m2 after receiving m1 from K. Suppose that H loses m1. U sends a message m2 with the receipt confirmation ACK of m1 to K, S, and H after receiving m1. S sends m3 with ACK of m1 and m2 to K, U, and H. On receipt of m2, H detects a gap, i.e. H has not received m1. Since m1 might be delayed due to network congestion, H still waits for m1 while receiving m2 and m3. The timers of m1 expire in K, U, and S because they do not receive any receipt confirmation from H. Here, m1 is required to be delivered to H in $\Delta_{HK}$. K, U, and S calculate at what time m1 can be delivered to H. Here, suppose that only S can deliver m1 in $\Delta_{HK}$. S sends m1 to H. K and U do not send m1 to H because they know that S has received m1 and m1 satisfies $\Delta_{HK}$. Here, S sends m3 after receiving m1 and m2, i.e. $m_1 \to m_2 \to m_3$, to H. In H, m3 has to be delivered after m1 and m2 if neither m1 nor m2 is discarded.

Fig. 6. Example.

6. Evaluation

Table 3
Message receipt ratio (%) for $\Delta$ (ms)
           Δ     S      U      K
Minimum    90    60.5   0.0    0.0
Average    168   94.5   70.0   7.6
Maximum    320   98.9   88.9   70.0
We evaluate the D pCO protocol in a wide-area group G of processes TP1 ; …; TPn in terms of the number of messages to be rejected and the delay time to deliver the messages. In the evaluation, D pCO protocol modules are implemented in processes in four Solaris workstations, a process H in Tokyo Denki Univ., Japan, S in Tohoku Univ., Japan, U in UCLA, USA, and K in Univ. of Keele, UK. First, each process TPi gets the statistics of the delay time d ij and the loss ratio 1 ij for every other process TPj by using the GCM protocol, and reports it to the application process APi. APi decides D ij by using the statistics of d ij and 1 ij. One way to obtain D ij is by adding the average d ij with some constant a i. Another way is for D ij to be given time t within when a message can be received in a possibility of b . For example, let b be 70%. From Fig. 2, 70% of messages sent by the process U in USA can be received by the process H in Japan in168 ms. Hence, D HU is given 168 ms. 70% of messages sent by the process K in UK can be received in 320 ms. Hence, D HK 320 ms. In contrast, the average delay time between the processes H and S in Japan is about 60 ms where only 0.1% of messages are lost. Here, let D HS be 90 ms which is 50% larger than 60 ms, i.e. a i 30 ms. First, suppose that each D ij is a constant D . Here, the minimum, average, and maximum of the delay are 90, 168 and 320 ms, respectively. Table 3 shows the receipt ratios of messages sent to the process H from S, U, and K. For example, if D HS D HK 168 ms, the process H in Japan receives 94.5% of messages from S in Japan while only 7.6% can be received from K in UK. Next, we consider how many messages each process TPi can receive given D p. As presented before, TPi does not receive a message m from TPj unless m arrives in D ij. Given D p presented here, 60.5% of messages sent by S are
20
T. Tachikawa et al. / Computer Communications 23 (2000) 13–21
Fig. 7. Inconsistent D p-causality.
received by H. Hence, 1 HS 0.395 for D HS. Similarly, 1 HU 0.3 and 1 HK 0.3 for D HU and D HK, respectively. We consider how many messages are rejected in order to how a pair of satisfy the D p-causality. Fig. 7 shows Dp messages m1 and m2 such that m1 ! m2 are received by the process H. We assume dHK . dHU 1 dUK since there is a routing path from the process H in Japan via U in USA to K in UK. In Fig. 7, we also assume that U and S send m2 on receipt of m1. In the cases 1 and 2, m1 is sent by K. In the case 3, U sends m1. In 1, U sends m2 on receiving m1 while S sends m2 in 2 and 3. The reject ratios of messages received are 6.9% in 1, 16.6% in 2, and 16.7% in 3. In the D pCO protocol, the receivers forward messages to the destinations by the destination replication. For example, S sends m1 with m2 to H. Here, H can receive m1 and m2 because D HS and D HK are satisfied. More messages from the more distant processes are rejected. That is, the shorter D ij is, the smaller 1 ij is. Thus, there is a trade-off between D ij and 1 ij. The application has to decide D ij so that the requirements on the delay time and the causality are satisfied. Table 4 Delay (ms)
Case  Delay             Δ*CO protocol   D protocol
A     Receipt (R)            376            376
A     Delivery (DL)          383            383
A     Rel. rec. (RR)         724           1128
B     Detect (DT)            386            762
B     Receipt (R)            393           1135
B     Delivery (DL)          394           1139
B     Rel. rec. (RR)         735           1891
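The two ways of choosing Δij described in the evaluation, and the loss ratio εij that results from a chosen Δij, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of `None` to mark a lost message, and the sample delays are all hypothetical, and the samples are made up so that the results match the 90 ms and 168 ms quoted in the text.

```python
from statistics import mean

# Hypothetical one-way delay samples in ms; None marks a lost message.
HS_SAMPLES_MS = [55, 58, 60, 62, 65]
HU_SAMPLES_MS = [150, 155, 160, 162, 165, 166, 168, 210, 400, None]


def delta_from_mean(delays_ms, alpha_ms):
    """First method: average delay of the arrived messages plus a constant."""
    arrived = [d for d in delays_ms if d is not None]
    return mean(arrived) + alpha_ms


def delta_from_quantile(delays_ms, beta_percent):
    """Second method: smallest observed delay within which at least
    beta_percent of all sent messages (lost ones included) arrived."""
    arrived = sorted(d for d in delays_ms if d is not None)
    # Integer ceiling of beta_percent/100 * number of sent messages.
    needed = (beta_percent * len(delays_ms) + 99) // 100
    if needed == 0 or needed > len(arrived):
        raise ValueError("beta_percent cannot be met by these samples")
    return arrived[needed - 1]


def loss_ratio(delays_ms, delta_ms):
    """Fraction of sent messages that are lost or arrive later than delta_ms."""
    accepted = sum(1 for d in delays_ms if d is not None and d <= delta_ms)
    return 1 - accepted / len(delays_ms)


if __name__ == "__main__":
    print(delta_from_mean(HS_SAMPLES_MS, alpha_ms=30))
    print(delta_from_quantile(HU_SAMPLES_MS, beta_percent=70))
    print(loss_ratio(HU_SAMPLES_MS, 168))
```

With these made-up samples, a 70% target yields a bound for which about 30% of the messages are rejected, which illustrates the trade-off between Δij and εij: lowering the bound can only increase the value returned by `loss_ratio`.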
Lastly, we compare the Δ*CO protocol with a traditional group protocol in terms of the delay time in the world-wide environment. In a decentralized protocol like ISIS, a process TPi sends a message m to the destination processes, and the destination processes send the receipt confirmation of m back to TPi. The Δ*CO protocol is a distributed one. That is, each destination process TPj sends the receipt confirmation not only to the sender TPi but also to all the other destination processes. The confirmation is carried by the data messages which TPj sends. If TPj has no data to send to a destination process TPk of m, TPj does not send a confirmation message to TPk as soon as it receives m. Instead, TPj sends the confirmation to TPk after waiting for δjk ms. With this delayed confirmation method, the number of messages can be reduced to almost the same as in the decentralized protocol in systems where every process in the group sends messages to the others. In the Δ*CO protocol, a destination process TPj can detect that another destination TPk has lost m, whereas only the sender TPi can detect the loss in the decentralized protocol.

We measure how long it takes to deliver messages from the process H in Japan to K in Keele (UK), while H also sends the messages to U in the USA, in the presence of message loss. If K loses messages sent by H, the process U, which is the nearest to K, forwards the messages to K. Table 4 shows the time for delivering messages to K from H by the Δ*CO protocol and by the decentralized protocol. In Table 4, R, DL, and RR show how long it takes to receive, deliver, and reliably receive a message, respectively. DT shows how long it takes to detect a message loss. A in Table 4 gives the R, DL, and RR delays for the Δ*CO protocol and the decentralized (D) protocol when no message is lost. The difference between R and DL is the time for the protocol processing. The difference between R and RR is the time for exchanging the confirmation messages of m.
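The delayed confirmation method described above can be sketched as follows. This is only an illustration of the rule stated in the text (piggyback a confirmation on the next data message to a peer, otherwise send it explicitly once the waiting time for that peer has passed). The class name, method names, explicit clock argument, and the waiting times in the usage note are hypothetical and are not taken from the protocol specification.

```python
class DelayedConfirmer:
    """Tracks receipt confirmations that one process still owes to its peers."""

    def __init__(self, wait_ms):
        # wait_ms maps a peer name to the waiting time (ms) before an
        # explicit confirmation is sent to that peer.
        self.wait_ms = dict(wait_ms)
        self.pending = {}  # peer -> {message id: deadline in ms}

    def on_receive(self, msg_id, peers, now_ms):
        """Record that a confirmation of msg_id is owed to every peer
        (the sender and the other destinations of the message)."""
        for peer in peers:
            deadline = now_ms + self.wait_ms[peer]
            self.pending.setdefault(peer, {})[msg_id] = deadline

    def on_send_data(self, peer):
        """Return the confirmations to piggyback on a data message to peer.
        They are no longer pending afterwards."""
        return sorted(self.pending.pop(peer, {}))

    def due(self, now_ms):
        """Return (peer, message id) pairs whose waiting time has expired
        and which therefore need an explicit confirmation message."""
        out = []
        for peer, waiting in self.pending.items():
            expired = sorted(m for m, t in waiting.items() if t <= now_ms)
            for msg_id in expired:
                del waiting[msg_id]
                out.append((peer, msg_id))
        return sorted(out)
```

For example, a process created with `DelayedConfirmer({"H": 160, "K": 90})` that receives `m1` at time 0 and sends data to K before 90 ms have passed carries the confirmation on that data message. Only H, to which it sent no data, gets an explicit confirmation, once `due(160)` reports it.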
B in Table 4 shows the R, DT, DL, and RR delays in the presence of message losses. The difference between DL and DT is the time for recovering from the message loss by retransmission. Table 4 shows that the Δ*CO protocol delivers messages about twice as fast as the decentralized protocol in the world-wide environment.

In the evaluation, the group is composed of four processes which are interconnected by the Internet and distributed world-wide. It is currently difficult, or maybe impossible, to realize distributed multimedia applications that include more world-widely distributed processes, due to the limited bandwidth and longer delay of the Internet and the limited computation power of the workstations. As future work, we plan to carry out experiments in an environment that includes more processes.
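The intervals named in the discussion of Table 4 (protocol processing, confirmation exchange, and loss recovery) follow from the table by subtraction. The short sketch below works them out. It assumes the Table 4 values are read in row order R, DL, RR for case A and DT, R, DL, RR for case B, and the dictionary keys and function name are hypothetical labels introduced only for this example.

```python
# Table 4 delays in ms. Case "A": no message lost; case "B": with message loss.
TABLE4_MS = {
    ("A", "DeltaStarCO"): {"R": 376, "DL": 383, "RR": 724},
    ("A", "D"): {"R": 376, "DL": 383, "RR": 1128},
    ("B", "DeltaStarCO"): {"DT": 386, "R": 393, "DL": 394, "RR": 735},
    ("B", "D"): {"DT": 762, "R": 1135, "DL": 1139, "RR": 1891},
}


def intervals(row):
    """Derive the intervals discussed in the text from one column of Table 4."""
    out = {
        "processing": row["DL"] - row["R"],    # protocol processing time
        "confirmation": row["RR"] - row["R"],  # confirmation exchange time
    }
    if "DT" in row:
        out["recovery"] = row["DL"] - row["DT"]  # recovery by retransmission
    return out


if __name__ == "__main__":
    for key, row in TABLE4_MS.items():
        print(key, intervals(row))
```

Under this reading of the table, both protocols spend the same 7 ms on protocol processing when no message is lost, and the differences between them lie in the confirmation exchange and in the recovery from loss.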
7. Concluding remarks

We have discussed wide-area group communication among multiple processes interconnected by the Internet. Here, each logical channel between the processes in the group has a different delay time. In this article, we have proposed the Δ*-causality to hold in the wide-area group and have presented the Δ*CO protocol for supporting the Δ*-causality given the delay δ* and the message loss ratio ε* of the network. We have evaluated the performance of our protocol in a wide-area group on the Internet.
Acknowledgements

We would like to thank Prof. M. Liu, Ohio State University, Prof. M. Gerla, UCLA, Prof. S.M. Deen, University of Keele and Prof. N. Shiratori, Tohoku University, for their cooperation and for the evaluation of the Δ* protocol in the world-wide experiment.
Takayuki Tachikawa was born in 1971. He received his BE, ME and DE degrees in computers and systems engineering from Tokyo Denki University, Japan in 1994, 1996, and 1998, respectively. His research interests include distributed systems, computer networks and communication protocols.
Hiroaki Higaki was born in 1967. He received his BE and DE degrees from the Department of Mathematical Engineering and Information Physics, the University of Tokyo in 1990 and the Department of Computers and Systems Engineering, Tokyo Denki University in 1997, respectively. From 1990 to 1996, he was with NTT (Nippon Telegraph and Telephone Corporation) Software Laboratories. Since 1996, he has been with the Department of Computers and Systems Engineering, Tokyo Denki University. He is now an assistant professor. His research interests include distributed systems, distributed algorithms, distributed operating systems, fault-tolerant systems and computer network protocols. He is a member of ACM, IEEE CS, IEICE and IPSJ.
Makoto Takizawa was born in 1950. He received his BE and ME degrees in Applied Physics from Tohoku University, Japan, in 1973 and 1975, respectively. He received his DE in Computer Science from Tohoku University in 1983. From 1975 to 1986, he worked for Japan Information Processing Developing Center (JIPDEC) supported by the MITI. He has been a Professor of the Department of Computers and Systems Engineering, Tokyo Denki University since 1986. From 1989 to 1990, he was a visiting professor of the GMD-IPSI, Germany. He has also been a regular visiting professor of Keele University, England since 1990. He was a program co-chair of IEEE ICDCS-18, 1998 and serves on the program committees of many international conferences. He has been a chair of IPSJ SIGDPS since 1997. His research interests include communication protocols, group communication, distributed database systems, transaction management and groupware. He is a member of IEEE, ACM, IPSJ and IEICE.