Parallelism and Fault-Tolerance in CHORUS*

J. S. Banino
INRIA, Le Chesnay, France
CHORUS is a distributed system based on a small set of powerful concepts: actors, messages, and ports. A complete operating system has been implemented on a local network of multiprocessors and is now available. The CHORUS architecture is well adapted to the implementation of distributed fault-tolerance mechanisms. This paper presents our results in this field, the "Coupled Actors" mechanism and the "Activity Message" concept, and outlines two other projects, recently launched in the same domain, building upon the CHORUS system.
1. WARNING
This paper relates a collective work. Several people have been taking part in the CHORUS project for four years: J. S. Banino (INRIA), A. Caristan (INRIA), J. C. Fabre (INRIA), C. Gailliard (INRIA), M. Guillemont (INRIA), J. M. Guillot (INRIA), F. Hermann (UPS), P. Ibos (TRT), C. Kaiser (CNAM-INRIA), J. Legatheaux-Martins (Universite de Lisbonne), P. Maurice (INRIA), G. Medigue (CNET), G. Morisset (INRIA), M. Rime (TRT), M. Rozier (CIMSA), A. Rozz (INRIA), C. Senay (CIMSA), and H. Zimmerman (CNET).
2. INTRODUCTION
CHORUS [1, 2, 3, 4] is a distributed operating system intended to support a large class of applications running on a network of computers (industrial process control, telecommunication systems, office automation, etc.). The CHORUS architecture is modular, flexible, and largely independent of the hardware and of the communication network. Moreover, the basic CHORUS concepts facilitate the implementation of fault-tolerant applications.
*CHORUS is a registered trademark of INRIA.
Address correspondence to J. S. Banino, INRIA, B.P. 105, 78153 Le Chesnay Cedex, France.
The Journal of Systems and Software 1,2: 205-211 (1986) © Elsevier Science Publishing Co., Inc. 1986
After a short presentation of the CHORUS system, we describe some experiments carried out with CHORUS in the field of fault-tolerant computing.
3. THE CHORUS SYSTEM
3.1. Basic Concepts
In the CHORUS architecture, a distributed system is a set of autonomous active entities, called actors, distributed across a network of computers. An actor is like a sequential process and includes code, data, and a context. The entire system is modular. The operating system itself is made of actors ("system actors") in charge of managing the logical and physical resources. An application installed on CHORUS is also made of actors. The system is flexible. Actors are highly independent: they run concurrently, they do not share data, and they interact by asynchronously exchanging messages. The use of logical ports as the only communication interface ensures that processing (in actors) is independent from communication. Ports are dynamically attached to actors. A message is exchanged by two actors in the following way: the first actor sends it from one of its ports to a port of the second actor. Ports are bidirectional and may be used both for sending and for receiving. The sending actor addresses its messages to the destination port and does not need to know the identity of the destination actor. Port names are globally unique; they are related neither to the location of the port nor to the name of the actor owning the port. An actor interacts with the rest of the world using a communication protocol between its own ports and the set of outside ports that it knows. The implementation of the services available behind these outside ports can be modified without the actor's knowledge (as long as the protocol does not change). In the same way, these ports may move from one site to another without the knowledge of the actor; CHORUS finds their new location dynamically. We will see further how this feature is used for reconfiguring a system.
The CHORUS architecture was designed to make the implementation of fault-tolerant services easy. An actor operates as a sequence of execution granules, called processing steps. A processing step is triggered in an actor on reception of a message on one of its ports. Conversely, messages prepared by an actor are transmitted only upon completion of the processing step; no message is transmitted if the processing step is not completed. A processing step is sequential, and the actor performs processing steps one after the other.
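The processing-step discipline just described (a message triggers a step; messages prepared during the step leave the actor only when the step completes) can be sketched in miniature. The following is an illustrative modern sketch, not the CHORUS implementation; the `Actor`, `Echo`, and `network` names are invented for the example.

```python
import queue

class Actor:
    """Toy CHORUS-style actor: each received message triggers one
    sequential processing step, and any messages the step prepares are
    released only when the step completes."""

    def __init__(self, network):
        self.network = network          # maps port name -> actor (simplified)
        self.inbox = queue.Queue()      # messages delivered by the "kernel"
        self._outbox = []               # messages prepared during a step

    def send(self, port, msg):
        # Inside a step, SEND only *prepares* a message; nothing leaves
        # the actor until the step returns.
        self._outbox.append((port, msg))

    def step(self, msg):
        raise NotImplementedError       # application code goes here

    def run_one_step(self):
        msg = self.inbox.get()          # a received message triggers a step
        self._outbox = []
        self.step(msg)                  # sequential processing step
        for port, m in self._outbox:    # transmit only on completion
            self.network[port].inbox.put(m)

class Echo(Actor):
    def step(self, msg):
        self.send("client", ("reply", msg))
```

If the step raised before returning, the outbox would never be flushed, which mirrors the all-or-nothing property the paper relies on for recovery.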
3.2. The Operating System
The CHORUS operating system consists of a set of actors. On each site are at least those actors that manage the local resources, as well as a kernel, which supports execution of actors and handles local communications. An application is built as a collection of actors that utilize services provided by the kernels and the system actors.
The kernel implements the logical CHORUS machine. It schedules execution of processing steps by actors and handles their local communications. Moreover, it entirely hides the physical machine by transforming exceptions (interrupts, errors) into messages addressed to ports. The kernel interface is defined as four system calls: SEND (to send a message), SELECT (to set a selection parameter on a port), SWITCH (to associate a port of the actor with one of its processing steps), and TIME-OUT (to arm a time-out on a port). Each invocation of such a call by an actor prepares a message that is delivered to the kernel only at the end of the current processing step. The CHORUS kernel is small; all the resources in the system (memory, devices, communication links, files, actors, ports, names, etc.) are handled by actors.
Groups of system actors perform system services; such a service is invoked by sending a message to a port representing it. Global-resource management requires that the actors of a service cooperate; this is hidden from the actor requesting the service. The basic system services are: actor management, port management, and communication management. Actors may be created and killed dynamically from actor models that define their functionality. Actors can also be created remotely. Ports may be created and destroyed dynamically. When created, a port receives a name. This name is global, i.e., it can be interpreted with the same meaning anywhere in the system. It is also unique, i.e., it will represent the same port forever. At a given time, a port can be owned by at most one actor; open and close operations permit several actors to own the same port in succession. Messages are delivered to actors by the local kernel. In the case of remote communication, a message passes in succession through the kernel supporting the transmitter, through the network (managed by transport actors), and through the kernel supporting the receiver. Communication is provided by the CHORUS kernel in a "connectionless" mode; more sophisticated protocols are implemented in the code of the communicating actors.
3.3. Programming Interface
Actors are programmed using high-level languages (PASCAL in the current version of CHORUS). The programmer, having in mind the operation mode of CHORUS actors, must structure his programs in processing steps. In an actor model, a processing step is delimited by an entry point and a RETURN (the one and only CHORUS kernel entry point). In the text of processing steps, CHORUS services can be invoked. Among them, one can distinguish the services that define the CHORUS logical machine (SEND, SELECT, SWITCH, TIME-OUT, RETURN), which are necessarily implemented in the kernel, from the others. These may be invoked asynchronously or synchronously. An asynchronous service invocation calls an interface procedure, which prepares a message that will be sent at the end of the current processing step to the port representing the service. In the case of a synchronous invocation, the interface procedure prepares the message and sends it by means of the external-procedure-call protocol (EP-CALL). The EP-CALL is a standard sequence for executing two processing steps in succession and indivisibly within a single actor: the first one completes by sending a request message to a port, and the second is triggered on reception of the reply message from this port.

3.4. Current Status
We implemented CHORUS on a set of SM90s connected by an Ethernet local network. The SM90 [5] is a multimicroprocessor designed by the French CNET (Centre National d'Etudes des Telecommunications). The SM90 architecture is modular and can support various 16/32-bit microprocessors (Motorola 68000/68020, NS 16000/32000). In an SM90 multiprocessor, the CHORUS system is implemented according to its decentralized design. Though resource sharing is permitted by the machine
architecture, CHORUS runs in an SM90 as it does in a network: each processing unit (a processor plus local memory) supports a CHORUS site. That means that it runs a CHORUS kernel and that it supports actors attached to this site. Intersite communication inside one SM90 is implemented by exchange of messages, but message passing is optimized using memory sharing. Implementing an operating system on a multiprocessor computer in this decentralized way leads to a homogeneity of concepts in the system, at the levels of both network and machine. We expect important benefits from this arrangement, especially for implementing fault-tolerant mechanisms. Furthermore, the possibility of resource sharing allowed by the SM90 architecture improves performance. We avoided the drawbacks of resource sharing; for example, access to common resources (nonallocated memory segments, nonallocated I/O control blocks) is granted by parallel allocators, one for each logical site, and contention for these resources is resolved using nonblocking algorithms, so that the crash of one processing unit does not block the other ones.
Above the basic system services (actor, port, and communication management) we implemented a set of device-driver actors and several user-oriented services: an interactive interface permits the user to install, test, debug, and run an application. Its command language gives access to the basic system services and allows the user to define higher-level commands. When the user logs in, an actor is created to run the command-language interpreter and to represent the user in the system. CHORUS users use UNIX* to develop their software. CHORUS and UNIX can run simultaneously in a single SM90, UNIX operating on one processing unit and CHORUS operating on the others.
In this case, the two systems communicate, first through the files, which are shared by both systems, and also by means of direct communications between CHORUS actors and UNIX processes.
We are now focusing our work on improving the CHORUS system and increasing its fields of application. A distributed file-management service is under development. Its basis is a set of file-management actors, each one handling local files, to which is added a distributed naming service. The format adopted for the files is the UNIX one. We are preparing the implementation on CHORUS of a protocol description language designed by the RHIN pilot project (PDIL, close to the ISO FDT language). And one of our main goals is to take advantage of the CHORUS concepts for designing and implementing a tool kit of basic fault-tolerant mechanisms. We now present some results of this research.

*UNIX is a registered trademark of Bell Labs.
4. FAULT-TOLERANCE
4.1. Generalities
The design of fault-tolerant systems relies on three fundamental notions: modularity, redundancy, and granularity. Modularity is essential to confine the effects of a fault and to isolate it. Redundancy is necessary, for example, when a faulty element has to be dynamically replaced by a functionally equivalent one or when a vote is applied to the results of several equivalent elements. Granularity facilitates the introduction of recovery points, which define consistent states for a computation, from which this computation can be recovered after a failure.
These notions lie at the heart of the CHORUS system. The actor is the basis of modularity: a CHORUS application is analyzed, is realized, and runs as a set of independent actors interacting by explicit communication protocols. The use of logical interfaces for communications hides the identity, the implementation, and the location of an actor from its partners. This eases the implementation of redundancy at the actor level. Finally, as mentioned previously, the processing-step concept introduces granularity in the execution scheme of CHORUS.
Taking advantage of these features, we first implemented a simple mechanism, based on active redundancy of actors, in order to obtain atomicity and recovery at the processing-step level: the "coupled actors" mechanism. We then wanted to extend atomicity and recovery to the case of a complex distributed computation made of several processing steps. For this purpose we defined the concept of "activity message" for the expression and control of distributed computations and experimented with its use in the context of fault-tolerance.
4.2. The Coupled-Actors Mechanism
The principle of this mechanism [7] is to increase the availability of a service, implemented in a server actor, by coupling this actor with a backup ready to take its place in case of failure. Obviously, the two coupled actors must be installed on two different sites of the network. When a failure occurs, the substitution is not perceived by the clients of this service. To reach such a result, several problems have to be solved:
- the backup actor must be in a consistent state when the service is reconfigured;
- this state must correspond to the last consistent state reached by the first actor before the failure;
- after reconfiguration, the new server must offer the same interface to the clients as the old one did before the failure; and
- the dialogues between the clients and the server must not be interrupted.
The CHORUS architecture permitted us to solve these problems easily. The server actor and its clients respect a simple "request-reply" protocol: a request message sent to the server by a client triggers a processing step of the server, on completion of which a reply message is sent back to the client. The server actor (the "master") and its backup (the "slave"), installed on two different sites, are both active. The master performs the processing steps required by the clients and sends back reply messages to them. The slave performs identical processing steps in the same order as the master, but delayed by one processing step. This coupling between the master and the slave is realized as follows: when terminating a processing step, the master (using a reliable protocol) sends to the slave a copy of the request message just processed. The slave then executes the same processing step without sending back the reply message. Obviously, this coupling protocol is hidden from the programmer. When a failure of the site supporting the master occurs, the slave switches to the master mode, opens its ports with the same names as those of the previous server, and eventually creates a new slave.
This reconfiguration is not perceived by the clients, because the CHORUS routing mechanisms deliver messages to the ports, whatever their actual location, and because the possible loss of messages during the reconfiguration is covered by the client-server protocol just like an ordinary loss of messages. Details of this mechanism are given in [7]. The coupled-actors principle is implemented in the CHORUS system. It can be applied to any CHORUS actor, without modification of the code written by its author. This work was a first validation of the adequacy of the CHORUS choices for handling fault problems. The next step was to extend our approach to the case of complex distributed computations.
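The master/slave coupling described in this section can be condensed into a toy sketch. The class and method names below are invented for illustration (this is not the CHORUS API), and the sketch replays each request on the slave synchronously rather than one step behind, eliding the delay and the reliable inter-site forwarding protocol.

```python
class CoupledServer:
    """Toy sketch of the coupled-actors scheme.  The master executes
    client requests and replies; the slave replays the same requests,
    in the same order, without replying to clients."""

    def __init__(self, role, handler):
        self.role = role                # "master" or "slave"
        self.handler = handler          # the functional code of the service
        self.state = {}                 # service state rebuilt by replay
        self.peer = None                # the coupled backup (set externally)

    def handle(self, request):
        reply = self.handler(self.state, request)
        if self.role == "master":
            # Forward a copy of the processed request so the slave
            # executes the identical processing step.
            if self.peer is not None:
                self.peer.handle(request)
            return reply                # only the master answers clients
        return None                     # the slave stays silent

    def on_master_failure(self):
        # In CHORUS the slave would reopen the service ports under the
        # same global names, so clients see no change; here we only
        # flip the role.
        self.role = "master"
```

Because the slave's state is rebuilt from the same request sequence, it is in the last consistent state the master reached, which is exactly the first two requirements listed above.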
4.3. Activities and Activity Messages
4.3.1. Principle. In CHORUS, control of execution is decentralized in the code of all the actors: synchronization and communication rely on a single mechanism, the exchange of messages. An actor receiving no message is idle; it can be activated only on reception of a message. This approach presents a great advantage for distributed-system design: the programmer can think of each actor with a clear vision of its input (the message received) and its outputs (the messages transmitted). But several experiments show that it can be very important to keep, at design and run time, a global view of the behavior of a complex computation (i.e., of a computation made of several processing steps), for example, when setting global recovery points [6, 9, 10] is needed.
With this idea in mind we defined the concept of activity as one among other schemes for controlling the progress of a distributed computation [11]. An activity is the composition of sequential activities. A sequential activity is a complex computation made of several processing steps, performed one after the other, generally in distinct actors that may be located anywhere in the system. A sequential activity can be thought of as a kind of sequential process executed step by step in different actors, having its own context and data, but, moreover, accessing local resources in the actors it crosses.
At run time, a sequential activity is represented by a message: the activity message. This message holds the description of the computation (i.e., the list of processing steps to be successively performed) and its current context. The activity message is transmitted from actor to actor, triggering in each of them a processing step of the activity. An activity message presents three distinct parts: control, context, and data. The control part contains the list of processing steps to be executed sequentially.
The context part holds the information representing the identity of the activity message and the current state of its progress. The data part is the part of the message that can be accessed by the various processing steps that will be executed in the activity. Receiving such a message on a port, an actor executes a processing step, as it does for each message received. But, in the case of an activity message, the actor then updates the context part of the message and determines from the control part the next processing step of the activity; it then forwards the message to the corresponding port.
The activity message is implemented as an executable program (compiled PASCAL in the current version) enclosed in a message. The control part of the message is the code itself; the context part of the message is a subset of the static data of the program; the data part is the rest of the data of the program. All the features of the PASCAL language (data types, operations, control structures) are available to express the control of an activity. In addition, the language is extended with the following control structures (implemented as interface procedures):
- FORWARD-M (port-name, parameters), used to send the activity message towards the "port-name" port and then to trigger the next processing step of the activity;
- CALL (activity-model, parameters), used to nest activities. The semantics of this construction are close to a procedural call of a subactivity;
- SPLIT (parameters) and JOIN (parameters), used to cause, respectively, the transformation of the activity into several parallel ones (whose number is a parameter of the SPLIT call) and the merging of these parallel subactivities. These two procedures are used in a construction whose syntax is the following:

SPLIT (parameters) IN
  1 : begin (* program of the first subactivity *) end;
  2 : begin (* program of the second subactivity *) end;
  ...
  N : begin (* program of the Nth subactivity *) end;
ENDSPLIT;
Each of the N programming blocks constitutes the program of one subactivity. The N parallel subactivities are started as follows: the original activity message is copied into N versions. Each version resumes its progress from a different context (the first instruction to execute is the first one in the corresponding programming block). When the subactivities are started, the original activity message is suspended and stored. The merging of parallel branches is expressed by means of another control statement: JOIN (parameters).
When such a statement is executed in the program of an activity message implementing one of the subactivities, this activity submessage joins the original activity message, and a part of its data (defined by the parameters of the JOIN call) is copied from the submessage into the original one. The presence of a JOIN call in the subactivity programs is optional: some of the subactivities may have no need to join. In the parameters of the SPLIT call, the programmer can give the minimum number of subactivities that must join before the main activity resumes; he or she can also set a precise delay for the success of the JOINs. Whether the JOIN fails or succeeds, the original activity message continues its progress.
Such complex operations as initialization of an activity message and execution of the SPLIT, JOIN, or CALL statements are performed by a system actor, the activity-message manager (AMM), which is present on each site of the system. The CALL, SPLIT, and JOIN services are implemented behind ports of the AMM actors. Execution of a CALL, a SPLIT, or a JOIN in an activity message causes the sending of the activity message to the corresponding port of the local AMM. As a consequence, adding new control schemes will only require adding new services (and new ports) in the AMM and writing new interface procedures. This mechanism is easily extensible.
At run time, when an activity message triggers a processing step in an actor, this processing step is executed just as if the message were a usual one. When the RETURN primitive is invoked, the nature of the message is examined: in the case of an activity message, this primitive executes a jump in the code of the message at the point specified by its context part; this code is executed until the first FORWARD-M, CALL, SPLIT, or JOIN statement, the effect of which is to send the activity message elsewhere. An important feature of the implementation is that the resources of the host actor (code, data, messages, etc.) are protected during the execution of the activity-message code, so that a software fault in the activity message cannot damage the host actor.
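As a rough illustration of the three-part structure and the forwarding discipline, the sketch below models an activity message in modern Python. It is a hypothetical reduction: the control part is a list of (port, step) pairs rather than compiled code, and a central loop stands in for the message physically travelling between sites. All names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ActivityMessage:
    """Toy model of an activity message: a control part (the step
    list), a context part (progress), and a data part."""
    control: list                                   # [(port_name, step_name), ...]
    context: dict = field(default_factory=lambda: {"pc": 0})
    data: dict = field(default_factory=dict)

def run_activity(msg, actors):
    """Drive an activity message through a set of actors, triggering
    one processing step per hop until the step list is exhausted."""
    while msg.context["pc"] < len(msg.control):
        port, step = msg.control[msg.context["pc"]]
        actor = actors[port]            # CHORUS would route by global port name
        getattr(actor, step)(msg.data)  # the triggered processing step
        msg.context["pc"] += 1          # update context, then "FORWARD-M"
    return msg

class Adder:
    """A trivial actor offering one processing step."""
    def add_one(self, data):
        data["total"] = data.get("total", 0) + 1
```

Because the message carries its own context, a copy taken between two steps is a ready-made checkpoint, which is the property exploited for recovery later in the paper.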
4.3.2. Application to fault-tolerance. We consider this new concept a powerful help for the design of distributed applications. The main benefits gained from its use are the following:
- it introduces a global level of expression and control for distributed computations;
- it improves flexibility: processing steps in a distributed system can be viewed as a set of operators at the disposal of activities, and with the same set of processing steps it is possible to build different types of activities; and
- it favors the implementation of fault-tolerant mechanisms, as illustrated by some examples in the following paragraphs.
An activity is a computation whose code is partitioned between the activity message and the actors. As a consequence, the fault-tolerance policy can be described either in the activity message or in the actors. When designing a system in which some service must offer high availability or reliability, one can implement the functional code of the service in the actors themselves and describe the fault-tolerance policy in the code of activity messages. This partition of the code, with the functional code of the service on one hand and the code used for the control, the checking, and the recovery of processing on the other, is a very interesting feature; several different fault-tolerance policies can be applied to a given service without any change in its code. For example, the reliability of a service implemented in an actor can be improved by replicating the actor and by using a vote algorithm; the vote protocol is programmed in an activity message, the activation of the replicated processing steps being done by means of a SPLIT call and the vote following a JOIN call. If one needs to modify the number of redundant servers, one has only to change the value of a parameter in the activity message. As another example, an N-version processing step is realized with the acceptance test [6] in the activity message, executed on completion of the processing step.
Conversely, one can remark that, between two processing steps, the activity message is a natural checkpoint for the activity: since the activity message holds the context of the activity, it is possible to restart the activity from a copy of this message in such cases as the crash of a processing step, the loss of the activity message itself, etc. In this case, the recovery is programmed in the actors and not in the activity message. This recovery technique is that of communication networks, applied to activity messages instead of data messages.
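The replicated-server vote described above (activate the replicas with a SPLIT, merge the replies with a JOIN, then vote) might be condensed as follows. This is an invented sketch: the replicas are plain callables and the SPLIT/JOIN phases are simulated sequentially rather than run on distinct sites.

```python
from collections import Counter

def split_join_vote(replicas, request):
    """Toy sketch of the vote protocol an activity message could carry:
    SPLIT runs the same processing step on every replicated server,
    JOIN merges the replies, and the vote keeps the majority answer."""
    # SPLIT phase: one copy of the activity per replica.
    results = [replica(request) for replica in replicas]
    # JOIN phase + vote: merge the replies and take the majority.
    answer, votes = Counter(results).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority: faults cannot be masked")
    return answer
```

As the paper notes, changing the redundancy level amounts to changing one parameter, here the length of the `replicas` list, with no change to the servers' functional code.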
Most of the recovery algorithms used in communication networks can be adopted (in particular, adaptive routing and "end-to-end" control [8]). We are experimenting with such fault-tolerant mechanisms on the CHORUS system, using the coupled-actors principle and the activity-message concept. Since all these mechanisms rely on message passing, they operate on a single computer or on a network of computers equally well.

5. RELATED WORK
As noted earlier, several other research projects adopted the CHORUS system as a basis for their developments. Two of them are particularly concerned with fault-tolerance.
The first one is the SATURNE project, launched in June 1984 at CNRS-LAAS (in Toulouse, France) by Y. Deswarte and J. C. Laprie. Its main goal is to study and to experiment with techniques for computing reliability and data security in a distributed environment. Reliability will be obtained by the use of dynamic active redundancy: each critical transaction will be run by n identical actors (on n different sites) voting on their results; the redundancy level n will be dynamically computed as a function of the criticality of the transaction and of the load of the global network. The underlying idea is to exploit at any time the full computing power of the network. Data security will be achieved by the use of fragmentation and encryption techniques.
The second project is starting at IRISA (Rennes, France) and is headed by M. Banâtre and J. P. Banâtre. In their previous project (ENCHERES [12]) they experimented with the concept of atomic distributed transaction in the context of a real-time application. One of their results is a very efficient implementation of stable memory. The aim of their new project is to implement a distributed object-oriented system, relying on an atomic distributed-operation mechanism inspired by their previous result and implemented on the CHORUS system.
6. CONCLUSION
The decisions made for the CHORUS architecture are validated by the first experiments that we carried out: in particular, the implementation of the CHORUS distributed operating system, built in terms of actors, ports, and messages, constitutes the first large application of this architecture. On that occasion we appreciated the loose coupling of actors, the synchronization by messages, the modularity, and the flexibility as real advantages for building a distributed system. In particular, these features fit very well the needs of implementing fault-tolerant mechanisms in a distributed environment.
7. REFERENCES
1. J. S. Banino, A. Caristan, M. Guillemont, G. Morisset, and H. Zimmerman, CHORUS: An Architecture for Distributed Systems, INRIA Research Report, Nov. 1980, 68 pp.
2. H. Zimmerman, J. S. Banino, A. Caristan, M. Guillemont, and G. Morisset, Basic concepts for the support of distributed systems: The CHORUS approach, 2nd IEEE International Conference on Distributed Computing Systems (Versailles, France), 1981, pp. 60-66.
3. M. Guillemont, The CHORUS distributed operating system: Design and implementation, ACM International Symposium on Local Computer Networks (Florence, Italy), April 1982, pp. 207-223.
4. H. Zimmerman, M. Guillemont, G. Morisset, and J. S. Banino, CHORUS: A Communication and Processing Architecture for Distributed Systems, INRIA Research Report 328, Sept. 1984.
5. U. Finger and G. Medigue, Architectures multi-microprocesseurs et disponibilité: la SM90, L'Écho des Recherches 105, 15-21 (July 1981) (French and English issues).
6. B. Randell, P. A. Lee, and P. C. Treleaven, Reliability Issues in Computing System Design, ACM Computing Surveys 10, 123-165 (June 1978).
7. J. S. Banino and J. C. Fabre, Distributed Coupled Actors: A CHORUS Proposal for Reliability, 3rd International Conference on Distributed Computing Systems (Ft. Lauderdale-Miami, Florida), Oct. 1982, 7 pp.
8. J. H. Saltzer, D. P. Reed, and D. D. Clark, End-to-End Arguments in System Design, 2nd International Conference on Distributed Computing Systems (Versailles, France), April 1981, pp. 509-512.
9. K. H. Kim, Distributed Execution of Recovery Blocks: An Approach to Uniform Treatment of Hardware and Software Faults, 4th International Conference on Distributed Computing Systems (San Francisco), May 1984, pp. 526-532.
10. W. H. Kohler, A Survey of Techniques for Synchronization and Recovery in Decentralized Computer Systems, ACM Computing Surveys 13, 149-183 (June 1981).
11. J. S. Banino, G. Morisset, and M. Rozier, Controlling Distributed Processing with CHORUS Activity Messages, 18th Hawaii International Conference on System Science, January 2-4, 1985.
12. J. P. Banâtre, M. Banâtre, and F. Ployette, Construction of a Distributed System Supporting Atomic Transactions, Proc. of the 3rd Symposium on Reliability in Distributed Software and Database Systems (Clearwater Beach, Florida), Oct. 1983.