Fault-tolerant servers for the RHODOS system

Fault-tolerant servers for the RHODOS system

Fault-Tolerant Servers for the RHODOS System Wanlei Zhou and Andrzej Goscinski School of Computing and Mathematics Deakin University Geelong, VZC ...

1MB Sizes 7 Downloads 124 Views

Fault-Tolerant

Servers

for the RHODOS

System

Wanlei Zhou and Andrzej Goscinski School of Computing and Mathematics Deakin University Geelong, VZC 3217, Australia

Providing reliable services is one of the primary goals in designing a distributed operating system. Nowadays, we have seen a trend in distributed operating system design to shift from large kernel architectures or even monolithic architectures to microkernel architectures supported by the client/server model. This means that a lot of services of an operating system originally provided by the monolithic kernel are moved out of the kernel, forming individual servers. It is then crucial to guarantee that these servers will provide reliable services. This article describes a design, based on a twin-servers model, of fault-tolerant servers for the microkernel-based RHODOS distributed operating system. A model that supports fault-tolerant services is designed. The performance of the model is simulated and analyzed. A design that implements the model in the RHODOS environment is also outlined. 0 1997 by Elsevier Science Inc.

1. INTRODUCTION

One of the main goals in designing a distributed operating system is to provide reliable services for applications running on the underlying distributed system. A distributed operating system manages many hardware/software components that are likely to fail eventually. In many cases, such failures may have disastrous results. With the ever-increasing dependency being placed on distributed operating systems, the number of applications requiring fault tolerance is likely to increase. The design and understanding of fault-tolerant distributed operating systems is a very difficult task. We have to deal not only with all the complex

Address corrwpondence to Dr. Wanlei Zhou, School of Computing and Mathematics, Deakin UnL~ersi&Geelong, VIC 3217, Australia J. SYSTEMS SOFTWARE 1997; 37:201-214 0 1997 by Elsevier Science Inc. 655 Avenue of the Americas, New York, NY 10010

problems of distributed operating systems when all the components are well (Goscinski, 19911, but also the more complex problems when some of the components fail. Today we have seen a trend in distributed operating system design shift from large kernel architectures, or even monolithic architectures, to microkernel architectures supported by the client/server model. This means that a lot of services of an operating system originally provided by the monolithic kernel are moved out of the kernel, forming individual servers. It is then crucial to guarantee that these servers will provide reliable services. This article describes a design, based on a twinservers model, of fault-tolerant servers for the microkernel-based RHODOS distributed operating system (De Paoli et al., 1994). The fault-tolerant technique used is a variation of the recovev blocks method (Hecht and Hecht, 1986) and the distributed computing model used is the client/sewer model. In the twin-servers model, two servers, called twins, work together to provide a particular service. If one of the twins dies, the surviving twin will be responsible for providing continuous services. The model design is aimed toward building microkernel-based fault-tolerant operating systems. It can also be used, however, in application areas in which requirements are less severe than in, for example, the aerospace field but in which continuous availability is required in the case of some components failures (Boari et al., 1988). The application areas could be, for example, supervisory and telecontrol systems, switching systems, process control, and data processing. Software fault tolerance refers to the set of techniques for continuing service despite the presence, and even the manifestation, of faults in a program

0164-1212/97/$17.00 PII SO164-1212(96)ooo16-7

202

J. SYSTEMS SOF-IWARE 1997; 37:201-214

W. Zhou and A. Goscinski

(Mili, 1990). There are many techniques available for software fault-tolerance, such as the N-version programming (Avizienis, 1985) and the recovery blocks (Hecht and Hecht, 1986). In the N-version programming, N (N 2 2) independent and functionally equivalent versions of a program are used to process a critical computation. The results of these independent versions are compared (usually with a majority voting, if N is odd) at the end of each computation, and a decision is made accordingly. The recovery blocks method employs temporal redundancy and software standby sparing. Software is partitioned into several self-contained modules called recovery blocks. Each recovery block consists of a primary routine, which executes critical software function; one or more alternate routines, which performs the same function as the primary routine, and is invoked upon a failure is detected; and an acceptance test, which tests the output of the primary (and alternate, if the primary fails) routine after each execution. A variation of this method is used in this article. The reminder of this article is organized as follows. In Section 2, we give an overview of the RHODOS distributed operating system. Section 3 presents the model for supporting fault-tolerant servers. Section 4 analyzes the performance of the model. Section 5 describes the architecture of a proposed implementation of the model. Section 6 reviews some related work. Finally, Section 7 concludes the article.

data. They are responsible for the management of the basic logical resources of a distributed system: processes and processor time, remote communication, memory, devices and network delivery protocols. The kernel servers include the Interprocess Communication Manager, Network Manager, Process Manager, Memory Manager, Migration Manager, Data Collection Manager, and Device Manager. In order to support users in a friendly, efficient, and secure manner, the RHODOS distributed operating system provides services through system servers. The following RHODOS system servers have been developed (De Paoli et al., 1994).

2. OVERVIEW OF THE RHODOS SYSTEM

The primary interface between processes and the Nucleus is via the library/system called boundary. Because the Nucleus does not provide a great deal of services itself, there is only a limited number of system calls provided. Further services must be requested from the kernel and system servers via message passing or remote procedure calls. Figure 1 depicts the architecture of the RHODOS system. As we can see from the overview of the RHODOS distributed operating system, the servers in the system play a very important role. If a kernel server or a system server fails, it will have a severe impact on the whole system. For example, if the Global Scheduler fails, no load balancing and static allocation will be performed and the computational performance of the system will be greatly degraded. To tolerate server failures, it is necessary to build fault-tolerant servers. Our primary goal here is to provide continuous services in the case of a server or even a host failure, without or with little impact on the whole distributed system.

The RHODOS system is a modern, high performance distributed operating system based on the concept of a microkernel and the client/server model (De Paoli et al., 1994). Its architecture is very clear: on each host, there is a microkernel, known as the Nucleus, which provides a consistent and abstract view of the underlying hardware to the higher level servers and processes. It provides support for local interprocess communication, basic memory management, process dispatching, general data and statistic collection, and interrupt and exception handling. On top of the Nucleus there are two groups of servers: the kernel servers (also called managers) and the system servers, following the need for separation of policy and mechanism. The kernel servers have a close interaction with the Nucleus. They provide the functionality that is usually placed in a monolithic kernel and thus, are trusted entities that utilise a set of privileged system calls to manipulate microkernel

1. Name Server: supports attributed naming and provides evaluation of user names onto system names; 2. Trading Server: provides user autonomy and object sharing between homogeneous and heterogeneous systems through object export, import, and withdrawal; 3. File Server: provides disk, file, transaction and replication services and storage for workstations without their own external storage; 4. Global Scheduler: improves computational performance through sharing idle or lightly loaded workstations, by employing load balancing and static allocation; 5. Authentication Server: authenticates users and servers by supporting one-way, two-way, and conference authentication.

Fault-Tolerant

Servers for the RHODOS System

J. SYSTEMS SOFTWARE 1997; 37:201-214

User

Kernel

IPC:

hardware

Port

Intelface

I

Control

: [

Process Dispatching

I Pais i Mapping

int3

context

Serial

en-Or

Ethernet

Switching

DliV0X

Exceptions

DliVt%

; p Kernel : Data Collection Clock

p

Kernel

-IYapS

RHODOS architecture.

3. THE FAULT-TOLERANT 3.1.

Calls

system Calls Interface ~__--__-__,-___-__--_r----___-----___-_-~-~~~~

.-_-_--

Figure 1.

Processes

__-l_______S??~~~SEY=~~--_---_--__L-1

Servers

Library

Local

Mode

203

The Client/Service

SERVICE

MODEL

Paradigm

service model more general, we use the client/service paradigm to reflect the client/server computing in the RHODOS system. The client/service paradigm is a higher abstraction’ of the client/server model. It provides a more user-friendly interface in the sense that it hides the server details from the user. In the client/server model, a client deals directly with a server. For example, a client locates a server by the name of the server, and then it is given the address of the server, whereas, in a client/server model, a client deals with a service (which is provided by a group of servers) instead. For example, a client locates a service by the service’s attribute name, and then it is given one of the addresses of servers which offer the same service. In the most general sense, the client/service paradigm is a way of structuring distributed programs. [t consists of the following three entities (Svobodova, 1985).

To make our fault-tolerant

Sen:ice: ,4 service is a software entity that runs on one or more machines. It provides an abstraction of a set of well-defined operations. Semer:

A server is an instance of a particular service running on a single machine. Client: A client is a software entity that exploits

services provided by servers. A client can but does not have to interface directly with a human user. Ideally, as in the RHODOS system, when requesting a service, a client specifies the nature of the service through an attributed name. The exact loca-

tion of the service and the server that provides such a service will be determined by the system for transparency reasons. The drawback of this approach is the delay incurred by the name resolution and evaluation facility. Thus, in the RHODOS system, some servers have well-known addresses and clients that (usually are also system processes or agents of system processes) can communicate with these servers directly through the well-known addresses. Two approaches of providing fault-tolerant services in the client/service paradigm can be explored. The first approach is to look at the problem through the clients’ viewpoint, that is, to make the clients fault-tolerant. The clients can continue their work by seeking an alternative service should the previous service request fail. The second approach is to look at the problem through the services’ viewpoint, that is, to make the service fault-tolerant. The service consists of multiple servers that can provide alternative services to a client should some servers fail. Here we use the second approach and assume that the only failure points are the server failures.

3.2. Server Types We can divide services provided by a distributed system into two categories: the processing-oriented services and the data-oriented services. A processingoriented service, developed as a stateless service, provides some processing power to clients. It takes a client’s request and processes it. No internal states of the service are changed because of the client’s request. Such services include, for instance, the computing service, the printing service, and the compiler

204

W. Zhou and A. Goscinski

J. SYSTEMS SOFTWARE

1997; 37:201-214

servers usually (but not necessarily) locate on different hosts for increasing reliability. When a client requests a service, the request (either a processing-oriented or a data-oriented service call) is mapped to one of the twins. Without loss of generality, suppose that twin server T’ is selected. Thus, the request is sent to T’. At the same time, a copy of the request is also sent to T2 and is kept in its reserve. If T’ has completed the service of the request, it notifies T2, and T2 will drop the copy of the request from its reserve. At any time, if one of the twins, say T’, dies for some reason, the surviving twin T2 will continue the service agreed by T’ by using the requests kept in its reserve. Then T2 will invoke another server to be its new twin (also named as T’). After that, the normal operation of the twins is resumed. Clearly, this twin-servers model can survive single server failures. Figure 2 depicts the design model. In the diagram, the c 1 i ent makes a service i request to one of the twins, say, T’ (line a). This request is treated as a normal request by T’-it is serviced once T’ is free. At the same time, a copy of the request is also sent to T2, the twin of T’ (line b).This copy is stored in T2’s reserve. Line c denotes the interaction between the twins. The interaction includes notification of the completion of a request and, if the request is a data-oriented service call, the calls for maintaining consistency of data objects. Similarly, the client can make a service i request to T2 instead of T’, and a copy of this request will be sent to T l’s reserve. In that case, we have a solid line going into T2’s normal queue and a dashed line going into T”s reserve. The interaction between the twins will point from T2 to T1’s normal queue.

service.The most important characteristic of these services is that, if two or more servers provide the same service, these servers do not need to communicate with each other to maintain a consistent state. We name client requests to these services as pro-

cessing-oriented service talk.

A data-oriented service manages one or more objects and clients access these objects by issuing some requests. The states of an object can be changed because of a client’s request. Such services include, for instance, the database service, the file service, the authentication service, and the directory service. For these services, it is critical to maintain a consistent object state among all the servers providing the same service. We name client requests to these services as data-oriented service calls. A service can be both a processing-oriented service and a data-oriented service. In that case, it supports both processing-oriented service calls and data-oriented service calls. In terms of implementation, a processing-oriented service call can be implemented by either simple send ( ) / recv ( ) pairs if the message-passing method is used, or a simple remote procedure call (RPC) if the RPC method is used. A data-oriented service call, however, must guarantee the consistency of the data objects. It has to be implemented by either nested send ( ) / recv ( ) pairs if the message-passing method is used, or as a nested RPC call, or an atomic RPC transaction (Zhou, 1991) if the RPC method is used. 3.3. Fault-Tolerant

Service Design

For each service, two servers are maintained to provide the fault-tolerant service. We refer to them as twins and denote them as {T’,T2}. These two

I

-~---------

Service i

I I

I

Reserve

I I

I

I I 1

I

I ,

I I

Normal

=I-

I I I

Figure 2. The

I ’ I ’ l ’ ’ I

b;

! I I : :-_---

I

Normal

II Reserve

I I

I

I

1?

-

I I I

*

fault-tolerant service model.

Fault-Tolerant

Servers for the RHODOS System

This demonstrates an important property of the twin model: the two twins are identical and symmetric. Obviously, a process pair model, like our twins model, can only tolerate fail-stop or crash type of failures. Tolerating other types of failures (for example, the Byzantine failures) normally requires a majority voting, and that requires at least three processes.

4. PERFORMANCE

l

EVALUATION l

l

l

p: the mean service rate of the server.

Figure 3. The queueing model.

q: the probability

of the bulk arrival. This is the same as the probability of a twin failure.

1. The probability of a twin failure (9). 2. The probability of feedback (p). 3. The traffic intensity (U = A/p).

The fault-tolerant model shown in Figure 2 can be viewed as a queueing model. Each twin has a normal queue and a reserve queue. Members of the reserve queue will be added to the end of the normal queue once the other twin dies. This can be treated as a bulk arrival to the normal queue. The probability of the bulk arrival is identical to the probability of a twin failure, and the mean size of the bulk arrival is the same as the mean number of requests (requests waiting in the normal queue and the one in processing) in a twin. The normal queue also has a feedback input from its twin, indicating the interaction between the twins. We know that the two twins are identical and are symmetric. We can then simplify the model into a single service node with a feedback and bulk arrivals. This model is shown in Figure 3. Its parameters are as follows. h: the arrival rate of the normal requests.

X: the average size of the bulk arrival. This is the same as the mean number of normal requests in a twin.

We are interested in the changes in the performance, measured in the response time R, when the following parameters change.

Model and Its Simplification

l

p: the feedback probability. It consists of the probabilities of the following two actions.

-notification calls: when a twin completes a service to a request, it has to notify the other twin; therefore, the other twin can delete the corresponding request from its reserve; -consistency calls: when a request is a data-oriented service call, the data consistency must be maintained through certain mechanisms (for instance, the nested send0 / receive0 calls or the RPC transactions).

Obviously, our model can tolerate single-point failures: if one of the twins dies, the surviving twin will continuously provide services agreed by the failed twin. This is clearly an advantage compared with the normal single server model. However, every faulttolerant method employs redundancy, and redundancy may imply increasing costs and/or decreasing performance. This section tries to evaluate the performance of our fault-tolerant model in various circumstances.

4.1. The Queueing

205

J. SYSTEMS SOFTWARE 1997; 37:201-214

Because the random processes of the model are not all independent (for instance, the average size of the bulk arrival X depends on the mean length of the queue), the model cannot be solved by using existing analytic models. Actually, the model can only be solved analytically if we assume that all the corelated random processes are independent. However, these assumptions are not realistic. For example, the bulk arrival X cannot be independent from the mean length of the queue, simply because once a twin fails; the other twin will put all the requests of the reserve queue into the normal queue. A simulation model is then built and some simulation is carried out. Our simulation tool is based on MacDougall’s smpl (MacDougall, 1987). We have the following assumptions when building our simulation. 1. Similar to most of the computer system performance evaluation processes (Kant, 19921, we assume that the arrival rate of normal requests A

h

1-P (x, 3

>

206

W. Zhou and A. Goscinski

J. SYSTEMS SOFTWARE 1997; 37:201-214

and the mean service rate p follow independent exponential distribution; The twin failure probability (4) and the feedback probability p follow a uniform distribution (within

(0, 1)); The average size of a bulk arrival is the same as the current length of the normal queue; The time used by a twin for moving the requests from its reserve into its normal queue is zero. Also, the time used for invoking a new twin is zero. The final assumption is only for easier programming and can be adjusted accordingly if needed. Notice that the feedback probability p of a processoriented service can be very small (close to 0) if grouping and piggybacking are used to reduce the number of notification calls (see Section 5.2 for implementation consideration). However, for a data-oriented service p 2 S. In order to compare our twin-servers model with a single server model, we also build a queueing model for the single server model. The single server model can be viewed as an M/M/l queue, with arrival rate of requests A and the mean service rate p. It has no fault-tolerant features. That is, if the server dies, all the client requests will be lost. Because it has no redundancy, however, intuitively it will have less overhead, and thus a better response time than our twin-servers model. We only show the results of the M/M/l model in Figure 7 (it shows the response times when the utilisation rate changes). In other cases (Figures 4-6) the results for the M/M/l model are straight lines when p and/or q change. Next we are going to find out what is the overhead (the time used for the fault-tolerant purpose only) for our model.

4.2. Simulation

Results

The first result describes

the relationship between the system response time and the twin failure probability. Figure 4 depicts the response time (the vertical axis, in time units) when the twin failure probability q (the horizontal axis) changes from 0 to 0.05. The feedback probability p stays at 0.5 and the traffic intensity is u = 0.33 (the mean service rate p is assumed to be 1 time unit). For the single server model, it is easy to know that the average response time will be R = 1.49 time units, no matter how CJand p change. We have the following observations from Figure 4. If q increases at a lower range (for instance, from 0 through to 0.03 in our case), the response time will not increase too much. If q stays in a lower range, it means that the twins are not likely to fail. In that case, the requests in a twin’s reserve will not likely be added into its normal queue. Therefore, the response time of the system will not increase too much. If q increases at a higher range (for instance, from 0.03 through to 0.051, the response time will increase drastically. If q stays in a higher range, it means that the twins are likely to fail. In that case, the surviving twin will have to frequently add the requests of its reserve into its normal queue. The size of the normal queue will increase. As the size of the reserve is the same as the size of the normal queue, it implies that the number of requests added into the normal queue will drastically increase when q increases. Therefore, the response time will increase quickly.

Figure

4. The response time when q changes (p = 0.5, u = 0.33).

Fault-Tolerant

Servers for the RHODOS System

J. SYSTEMS SOFTWARE 1997; 37:201-214

The conclusion is that our fault-tolerant model does not impose too much overhead on the system when the twin failure probability is low. The second simulation result describes the relationship between the response time and the feedback probability. Figure 5 depicts the response time (the vertical axis, in time units) when the feedback probability p (the horizontal axis) changes from 0 to 0.6. The twin failure probability 9 stays at 0.02 and the traffic intensity is u = 0.33. We have the following observations from Figure 5.

207

requests are data-oriented service calls, and the traffic intensity is relatively high. The third simulation result relates the response time to the changes in both the twin failure probability and the feedback probability. Figure 6 depicts the response time (the z axis, in time units) when the feedback probability p (the x axis) changes from 0 to 0.5, and the twin failure probability q (the y axis) changes to 0 to 0.05. The traffic intensity stays at u = 0.33. Figure 6 confirms our above observations. That is, if both q and p stay low, our model will not impose too much overload on the system. We can also observe some additional properties. For example, if only q (or p, but not both) is high but p (not q) is very low, the system overhead will also be low. The system overhead will be drastically increased only when both q and p are high. The last simulation result shows the relationship between the response time and the traffic intensity. Both q and p are fixed at this time. Figure 7 depicts the response time (the vertical axis, in time units) when the traffic intensity u (the horizontal axis) changes from 0 to 0.85. The dotted line shows the response time of the single server model. The middle solid line shows the case when the twin failure probability q stays at 0.01 and the feedback probability p is 0.1. The upper solid line shows the case when the twin failure probability q stays at 0.025 and the feedback probability p is 0.36. We have the following observations from Figure 7.

The response time will not increase too much if the feedback probability stays below 0.5. If only less than half of the requests are data-oriented service calls and the others are processingoriented service calls, the twins will use less time for inter-twin communication. Therefore, the response time will not increase too much because most of the time of the twins will be used for processing normal requests. The response time will increase drastically if the feedback probability is over 0.5. If more than half of the requests are data-oriented service calls, the twins will have to spend most of their time on inter-twin communication. Therefore the time used for processing normal requests will be less. When the requests processing rate of the twins is less than the rate of incoming requests, the response time will increase drastically and eventually the whole system will be blocked. Of course, the traffic intensity u also plays an important role in here. We will analyze its influence later.

l

The conclusion is that our fault-tolerant computing model will not work properly if most of the

If the range case), much, high.

traffic intensity increases within a lower (for instance, from 0.1 through 0.4 in our the response time will not increase too even though both q and p are relatively

R 35

I

I

q=o.o2.

response time changes (q = 0.02, u = 0.33).

Figure

5. The

when

P

30

-

25

-

20

-

0

1 0

0.1

I 0.2

I 0.3

I 0.4

I

Lmo.33

1 0.5

-

0.6

P

208

W. Zhou and A. Goscinski

J. SYSTEMS SOFTWARE 1997; 37~201-214

u=o.33 R

Figure 6. The response time when both the a and p change (u = 0.33).

l

l

If the traffic intensity increases within a higher range (for instance, from 0.45 through to 0.7 here), the response time will increase drastically if both 4 and p are relatively high. But the response time will only increase slightly if both 4 and p are relatively low.

already congested without introducing the faulttolerant features), converting the service into our twin model will cause much overhead to the system performance. 5. THE SYSTEM DESIGN

If the traffic intensity increases within a very high range (for instance, higher than 0.7 here), the response time will increase drastically even if both q and p are relatively low.

This section describes a design that implements the fault-tolerant service outlined in Section 3.3. A prototype of the design (with some features still under development) is now running on a network of SUN workstations using the microkernel-based RHODOS distributed operating system.

The conclusion is that if both q and p are relatively low, our model will not impose too much overhead on the system performance. If a service has a less incoming request rate and a higher processing rate (i.e., the traffic intensity u is low), converting the service into our twins model will cause less overhead to the system performance, even if both q and p are relatively high. However, if a service has a very high incoming request rate and its processing rate is low (that means, the service is

25

-

20

-

15

-

10

-

5.1. The Architecture

Figure 8 depicts the implementation architecture. On each host, there is a local name server. It manages, among other things, the service locations and service-server mappings on its host. Generally speak-

0 0.1

0.2

0.3

0.4

of the Fault-Tolerant

Service Design

OS

Figure 7. The response time when u changes (q = 0.01, p = 0.4).

0.6

I

I

0.7

0.6

0.9

Fault-Tolerant

Servers for the RHODOS System

J. SYSTEMS

209

SOFTWARE

1997; 37:201-214

ing, any service running on a host must be registered to the local name server. If that service is to be shared by clients other than the clients of the local host, it has to be registered to the global name server as well. The global name server is responsible for managing service locations and service-server mappings of the whole network. There should be at least one global name server on the network concerned. In the RHODOS system, however, kernel and system servers are not required to register to the name servers. The communication ports of many such servers are will-known ports and clients can make requests to these servers directly. The communication between different system entities is provided by message passing (blocked or unblocked) or Remote Procedure Call primitives supported by the RHODOS Reliable Datagram Protocol, which is an efficient and fast transport protocol using the services of IP and Ethernet for communication in the RHODOS environment. In the client/server model, a distributed program consists of a client program part and a server program part. In our system, a client program consists of a Client Driver and a Client Stub. The client driver

controls the client’s action, and the communication details are hidden in the client stub. The server program consists of a Server Driver, a Server Stub, a Procedures part and a Reserve. Similar to the client program, the server driver controls the server’s action, and the communication details are hidden in the server stub. The procedures part realizes the real service operations. The reserve keeps track of the requests sent to its twin. The advantage of dividing programs into drivers and stubs is that the stubs can be automatically generated by stub generutars (Zhou, 1994). The communication between various system entities of the model can be described as follows. When the client issues a service request, the request goes to the local name server first (line a). The local name server checks to see if the service can be honored in the local host. If it cannot be honored in the local host, the local name server will ask the global name serve to map the service request (line b). If the global name server finds a suitable server, say T’, the local name server will establish a mapping between the service request and the remote server T’ on behalf of the global name server and r________________---~~-~-~---

-----I

: ,______________________-; 1 Server Host I I I I I I Sewer I 1 I 1 TJ I 1

I

------I

t

I I I 1 I 0

r_______________________--I I I I I : I I I 1 I : , I 1 I I I I ,_______-_--__------Figure 8. The architecture

of the fault-tolerant

service design.

__--___-

____-I

210

W. Zhou and A. Goscinski

J. SYSTEMS SOFTWARE 1997; 37:201-214

the request is sent to T1 (line c and d). At the same time, a copy of the request will be sent to the twin server T* for its reserve (line d’). If the service request can be honored in the local host (i.e., one of the twins, say T’ again, resides on the same host as the client does), a mapping between the service request and T1 is established, and the request is sent to the local server T’. At the same time, the local name server will locate the other twin T*, possibly through the global name server, and a copy of the request will be sent to T2 for its reserve (line d’). The client’s service request fails if both the local and the global name servers cannot complete the mapping. After the mapping is established, the client can then send requests to the twins directly (line c, line d, and line d’). If the twins’ addresses are well known, the client can send the request and its copy directly (with the help of the IPC manager) to the selected server and its twin (line c, line d, and line d’). Because both twins’ addresses are made available to the client, the client can randomly choose any one of the twins as the server to perform the request, and the other to keep the copy of the request. Statistically speaking, requests to the twins are evenly distributed and therefore load of requests are balanced. If the selected twin server T’ dies, its twin T* will typedef struct { unitl6_t sn_signature; unit8_t sn-type; unit8_t sn-copy; unit 32-t sn_origin; unit32_t sn_object; unit32_t sn_access_rights; unit32_t sn_checksum; > SNAME;

/* /* /* /* /* /* /*

5.2. Implementation

Issues

The first issue that we have to consider is the naming of twins, In RHODOS all objects are identified by System Names @Names), which combine facilities for location transparency, identification, protection, and accounting of objects into one data structure. The C data structure for SNames is shown below

The user's id */ The object's type * / Which object copy * / The object's origin * / The object's name *! The access rights */ 32 (64)-bit checksum * /

The main elements of interest to inter-process communication in this structure include the sn_origin, sn_object, and sn_type, which when combined can uniquely identify an object over an entire distributed system. In our design, each server (actually, the communication port of a server> is assigned an SName when typedef struct { SNAME *service-name; SNAME *twin-l; SNAME *twin_2; } SENTRY;

transfer all the requests in its reserve into its normal request queue and execute them as if they were requests made to itself. At the same time, T* will invoke a new twin server to be its new T’. The twins make regular checks to each other, making sure that the other twin is alive. Before the services exported by the twins can be accessed by clients, the twins must register to the global (and local) name servers if the twins do not have well-known addresses. Lines f and g show the global registration. Because the local name server and the global name server keep the critical information of services, they must be fault-tolerant themselves. So these name servers are also designed to be twins. For a particular name server, the location of its twin is a well-known address. And for all the local name servers, the location of the global name server is a well-known address.

it is created (for example, twin-1 and twin-2 for the twins of a service). Each service is also assigned an SName (for example, servi ce_name) when both of its twin servers are available. Thus, after the registration (lines f and g of Figure 8) a service has an entry in the relevant name servers as follows.

/ * service's SName e4/ /* the first twin's SName * / /* the second twin's SName * /

Fault-Tolerant

Servers for the RHODOS System

The service’s SName service-name is made available to clients. When a client calls a service through its SName, the name servers will be responsible for providing the twin servers SNames of the service. After that, the client can call both twins through the IPC Manager. If a client knows the SName of a server, it can call the server directly through the help of the IPC Manager. This is the case when a service has a well-known address (that is, the SENTRY entry is well known). If a twin dies, the surviving twin will invoke a new server to be its twin. The new server may have a different SName from the failed one. Thus, it has to be registered to the relevant name servers, and the twin’s SName will be updated accordingly. The second issue that we need to address is interprocess communication. The RHODOS system provides three interprocess communication primitives: recv(), and callo. The send0 and send0, recv ( j primitives are the basic message passing primitives, and ca.11( ) primitive, which is logically composed of a send ( ) and a recv ( ) , together with a recv ( ) and a send ( ) , provide an RPC communication service. The send ( ) and recv ( ) primitives are used in our prototype implementation. Following is a more detailed description of these two primitives. l

ret_addr, ser.d(dest_addre, send__ar-gs, send_results)

It sends a message to another process, given the destination address (dest_addr) and where a return message should be delivered (ret _addr). Details of where the message resides and how many bytes should be sent are given in the third parameter send_args. The fourth parameter, indicates the results of the send_les-Jlts, send ( ) primitive. l

recv(dest_addr, ori recv_results) recx_irgs,

-add:-,

It receives a message that has been delivered; given the destination address (dest _addr) from where the message is to be received and where the expected message has originated from the origin address (or i _ addr). The third parameter, recv_args, provides details of the message, such as where the message is supposed to be placed and the length of the received message. The fourth parameter, recv_results, indicates the results of the r+xv ( ) primitive, All

the

addresses

(dest_ acidr and

of

the

ret_addr

above primitives of send0 and

J. SYSTEMS SOFTWARE 1997: 37:201- 214

211

and ori_addr of recv()) are SNames. They can be obtained through the help of the name server(s). Figure 9 depicts the interprocess communications between the client and the twins when the SNames of the twins have been known. For a process-oriented service call, which is shown in Figure 9a, the client first sends a normal request to one of the twins, say, T’, and at the same time, a copy of the request is also sent to another twin T2. After the selected twin T’ completes the work, it sends back the results to the client and, at the same time, sends a message to T” asking it to drop the corresponding copy request from its reserve. For a data-oriented service call, which is shown in Figure 9b, the interprocess communication is almost the same as the case of a process-oriented service call, except that after the selected twin T1 completes the work, it also asks the twin T’ to perform the work on its managed data object for achieving data consistency. Because T2 already has the copy request, it can perform the work easily by moving the copy request from the reserve to its normal queue. Figure 9c shows the interprocess communication when the selected twin T’ dies. In that case, the surviving twin T2 will perform the work by moving the copy request from its reserve to its normal queue. The work of invoking a new twin is not shown. It is obvious that the data-oriented service calls are the most time-consuming calls. Various methods can be used to reduce the execution l.ime of a data-oriented service call. Our design here uses the general approach, that is, a notification call is sent from the first twin to the second, and both twins perform the task. This is fine if the execution time for such a call is short. However, if the execution time is long, one can save some time by using only the first twin to execute the request and then sending the result to the second twin for updating only. The third implementation issue is the detection of twin failures. As we mentioned in Section 5.1, twins make regular checks to each other, making sure that the other twin is alive. Currently, the checking interval (i.e., the maximum delay of detection) is designed at 180 seconds, and it can be easily adjusted to suit different applications. The overhead of failure detection is small because each checking RPC call (with a few bits of message) only uses about 3 ms on a lightly loaded network (Zhou 199.5). The checking message is also used to carry piggybacked notification messages from process-oriented services. Normally, all notification messages (i.e., messages asking the other twin to drop the duplidest_addr

212

W. Zhou and A. Goscinski

J. SYSTEMS SOFIWARE! 1997; 37:201-214

6. RELATED WORK

cated request) of process-oriented service calls are grouped together during the checking interval and are sent to the other twin together with the checking message. Because the process-oriented operations are idempotent, this will not affect the processing results even if a twin fails during the checking interval. It is clear that the original send () / recv ( ) primitives do not meet all our needs in fault-tolerant computing. They lack two properties that are important in our implementation: the concurrency and the atomicity. The following improvements are under investigation.

There have been many successful distributed computing systems using the client/server model. But few of them consider fault tolerance in their design, or they relay on users to build up the fault-tolerant features. Many leading computer companies have agreed on a vendor-neutral distributed computing environment (DCE) architecture proposed on the Open Software Foundation (OSF, 1990). This architecture is designed under the client/server model and requires that the interactions between its components follow the RPC paradigm. Although the DCE architecture helps to reduce the heterogenity of server access protocols, one important issue is still outstanding: the support for fault-tolerant services. Notable works on incorporating fault tolerance features into client/server distributed computing are the Argus (Liskov, 19881, the ISIS (Birman and Joseph, 1987) (Birman et al., 1991), the Cygnus (Chang and Ravishankar, 19911, and an atomic RPC system on ZMOB (Lin and Ganno, 1985). The Argus allows computations (including remote procedure calls) to run as atomic transactions to solve the problems of concurrency and failures in a distributed computing environment. Atomic transactions are serializable and indivisible. A user can also define some atomic objects, such as atomic arrays and atomic record, to provide the additional support needed for atomicity. All the user fault-tolerance requirements must be specified in the Argus language.

’1. To send two messages to different destinations at the same time (or concurrently). Figure 9 shows this necessity. When the client sends out a request, or when the selected twin has completed the work, a concurrent send ( ) primitive is needed. 2. To guarantee the atomicity of actions. In Figure 9b, it is necessary that the send ( ) primitive of T’ be an atomic action. That is, sending back the results to the client and asking another twin to perform the work should be regarded as atomic. Otherwise, the data objects managed by the twins may be in inconsistent state. 3. The use of RPC. RPC is a better abstraction than that of message passing, especially in terms of programming. Incorporating concurreny and atomicity into RPC mechanism is feasible (Zhou, 1991).

T

(a) Figure 9. IPC

T

clic01

I

T

clicnl

I

@I

when implementing a process-oriented service call.

(cl

P

Fault-Tolerant

Servers for the RHODOS System

The ISIS toolkit is a distributed programming environment, including a synchronous RPC system, based on virtually synchronous process groups and group communication. A special process group, called fault-tolerantprocess group, is established when a group of processes (servers and clients) are cooperating to perform a distributed computation. Processes in this group can monitor one another and can then take actions based on failures, recoveries, or changes in the status of group members. A collection of reliable multicast protocols are used in ISIS to provide failure atomicity and message ordering. The Cygnus distributed system is an instance of the client/service model and has some common features as our model. But it tries to solve the problem through the clients’ viewpoint, by making the client resilient to server failures. There are no facilities in the Cygnus system to allow the servers to provide reliable services. The atomic RPC system implemented on ZMOB uses sequence numbers and calling paths to control the concurrency and atomicity and uses checkpointing to maintain the ability of recovering from failures. Users do not have to provide synchronization and recovery themselves; they only need to specify if atomicity is desired. This frees them from managing much complexity. But when a server (or a guardian in the Argus) fails to function well, an atomic transaction or an atomic RPC has to be aborted in these systems. This is a violation of our continuous computation requirement. The client program of the Cygnus distributed system can survive from server failures, but it does not provide fault tolerance on the server’s side. The fault-tolerant process groups of the ISIS can cope with process (both clients and servers) failures and can maintain continuous computation, but the ISIS toolkit is big and relatively complex to use.

7. REMARKS We have presented the design, performance evaluation, and a prototype implementation of a faulttolerant model for servers of the RHODOS distributed operating system. The model uses two servers, called twins, to provide a particular service. Apart from accepting normal requests from clients, a twin also uses a reserve to keep track of the other twin’s uncompleted incoming requests. If a twin dies, the other twin will move all the requests from its reserve into its normal request queue, providing continuous service to clients. At the same time, the

213

J. SYSTEMS SOFTWARE 1997: 37:201-214

surviving twin will invoke a new twin. The clients will be transparent of any twin failures. The results from simulation show that our model works very well in various situations. When the twin failure probability or the feedback probability is low, or when the traffic intensity of the original system is not too high, our model does not impose too much overhead on the system performance. Our model has a wide range of applications in which continuous availability of services are required in the case of some components failures. Apart from applications in the kernel/service poolbased distributed operating systems, other application areas could be, for example, supervisory and telecontrol systems, switching systems, process control and data processing.

ACKNOWLEDGMENTS The authors their

valuable

would

like to thank

the anonymous

referees

for

comments.

REFERENCES Avizienis, A., N-Version Approach to Fault-Tolerant Software. IEEE Transactions on Software Engineering, 11(12):1491-1501 (December 1985). Birman, K. P., and Joseph, T. A., Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, 5(1):47-76 (February 1978). Birman, K. P., Schiper, A., and Stephenson, P. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 9(3):272-3 14 (August 1991). Boari, M., Ciccotti, M., Corradi, A., and Salati, C. An integrated environment to support construction of reliable distributed applications (CONCORDIA), in Parallel Processing and Applications, Elsevier Science Publishers, North Holland, 1988, pp. 467-473. Chang, R. N., and Ravishankar, C. V. A Service Acquisition Mechanism for the Client/Service Model in Cygnus, In Proceedings of the 11th International Conference on Distributed Computing Systems, Arlington, 1991, pp. 90-97. Goscinski, A. Distributed Operating Systems: The Logical Design, Addison-Wesley Publishing Company, Reading, Massachusetts, 1991. Hecht, H., and Hecht, M. Fault-tolerant software, in Fault-Tolerant Computing: Theory and Techniques, Vol. 2, Prentice-Hall, Englewoods Cliffs, New Jersey, 1986, pp. 658-696. Kant, K. Introduction to Computer System Pe$ormance Eualuation, McGraw-Hill, Singapore, 1992. Lin, K., and Gannon, J. D. Atomic Remote Procedure Call. IEEE Transactions on Sofhyare Engineering, 11(10):1126-1135 (October 1985). Liskov, B. Distributed Programing in ARGUS. Communications of the ACM, 31(3):300-312 (March 1988).

214

W. Zhou and A. Goscinski

J. SYSTEMS

SOFIWAFCE 1997;37:201-214

M. H. Simulating Computer Systems: Techniques and Tools, The MIT Press, Cambridge, Mas-

MacDougall,

sachusetts, 1987. Mili, A. An Introduction of Program Fault Tolerance, Prentice-Hall, Englewoods Cliffs, New Jersey, 1990. OSF. OSF Distributed Computing Environment Rationale, Open Software Foundation, 1990. De Paoli, D., Goscinski, A., Hobbs, M., and Wickham, G. RHODOS-A microkernel based distributed operating system: An overview of the 1993 version. Technical Report TR C94/04, School of Computing and Mathematics, Deakin University, Geelong, Victoria, Australia, April 1994. Svobodova, L. Client/server model of distributed comput-

ing, in Znformatik Fachberiche 95, Springer-Verlag, 1985, pp. 485-498. Zhou, W. A Rapid Prototyping System for Distributed Information System Applications. The Journal of Systems and Software, 23(1):3-29

Zhou, W. A Fault-Tolerant tem for Open Distributed

(1994).

Remote Procedure Call SysProcessing, in Proceedings of

the 1995 Zntemational Conference on Open Distributed Processing, Brisbane, Australia, February 20-24 1995,

pp. 229-240. Zhou, W. and Molinari, B. On the management of remote procedure call transactions, In Lecture Notes in Computer Science, Volume 497, Springer-Verlag, Berlin, Germany, May 1991, pp. 571-581.