Practical Fault Tolerant Software for Asynchronous Systems




SESSION 3 - FAULT TOLERANCE, RECOVERY AND THE USE OF REDUNDANCY

Copyright (c) IFAC Safecomp '83 Cambridge, UK 1983

PRACTICAL FAULT TOLERANT SOFTWARE FOR ASYNCHRONOUS SYSTEMS

R. H. Campbell*, T. Anderson** and B. Randell**

*Department of Computer Science, University of Illinois, 1304 W. Springfield Av., Urbana, IL 61801, USA
**Computing Laboratory, University of Newcastle upon Tyne, Claremont Road, Newcastle upon Tyne NE1 7RU, UK

Abstract. Networks of computers, distributed resources, and multiple CPUs introduce new problems of constructing reliable systems and involve the organization and control of error recovery in complex asynchronous systems. Recent research has provided evidence that fault-tolerant asynchronous systems are both necessary and feasible. In this paper, we review and discuss several of the pragmatic issues that need to be resolved before the results of this research can be applied in practice.

Keywords. System failure and recovery; computer software; parallel processing; reliability; fault tolerance; atomic actions; backward error recovery.

INTRODUCTION

The demand for highly reliable computer systems has led to techniques for the construction of fault-tolerant software (Chen and Avizienis, 1978; Horning and colleagues, 1974). Networks of computers, distributed resources, and multiple CPUs introduce new problems of constructing reliable systems and involve the organization and control of error recovery in complex asynchronous systems (Davies, 1978; Kim, 1982; Liskov, 1982; Randell and colleagues, 1978). This paper reviews the general principles and frameworks proposed for the design of fault-tolerant asynchronous systems and discusses the pragmatic issues that must be resolved before such systems become practicable.

FAULT-TOLERANT SOFTWARE SYSTEMS

A fault-tolerant system is one that is designed to function reliably despite the effects caused by component or design faults during normal processing. Such a system detects the errors produced by faults and applies error recovery techniques in the form of exceptional mechanisms and abnormal algorithms to continue operation and resume normal computation. The task of error recovery is hampered by the propagation of errors - the continued valid operation of a system containing an error can result in the introduction and spread of further errors. Successful fault tolerance must enable the system to continue to function despite error propagation during the, perhaps lengthy, time interval between the first manifestation of a fault and the eventual detection of an error.

So called "forward error recovery" is accomplished by making selective corrections to a system state containing errors. It aims to remove or isolate specific errors so that normal computation can be resumed (Randell and colleagues, 1978). Because recovery is applied to a system state containing errors, forward error recovery techniques require accurate damage assessment (or estimation) of the likely extent of the errors introduced by the fault. In contrast, "backward error recovery" aims to restore the system to a state which occurred prior to the manifestation of the fault. Using this earlier state of the computation, the function of the system is then provided by an alternate algorithm until normal computation can be resumed (Horning and colleagues, 1974). (In practice, the most recent restorable system state which is free from the effects of the fault may be difficult to determine. It may be necessary to restore a sequence of successively earlier states until recovery is successful.) Because backward error recovery restores a valid prior system state, recovery is possible from errors of largely unknown origin and propagation characteristics. (All that is required is that the errors have not affected the state restoration mechanism.) Backward error recovery may involve a considerable time penalty in overhead and could require tests for acceptable system states.


Forward and backward error recovery techniques complement one another, forward recovery allowing efficient handling of expected conditions and backward recovery providing a general strategy which can cope with faults a designer did not - or chose not to - anticipate. As a special case, a forward error recovery mechanism can support the implementation of backward error recovery (Cristian, 1982) by transforming unexpected errors into default exception conditions.

Exception handling provides a very convenient framework for the implementation of fault tolerance in systems having only a single sequential process (Anderson and Lee, 1981). Both forward and backward error recovery can be supported within such a framework (Cristian, 1982). Fault tolerance provisions for systems of asynchronous processes are complicated by the possibility of communication of erroneous information and the need to co-ordinate processes engaged in recovery. Generalizing exception handling to support fault tolerance in asynchronous systems requires additional system structure concerning the cooperation and co-ordination of the individual processes.
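As an illustration only, backward error recovery for a single sequential process can be sketched in Python along the lines of a recovery block: a checkpoint is taken on entry, each alternate is tried in turn, and the checkpoint is restored whenever the acceptance test rejects the result. The names `recovery_block`, `RecoveryFailure` and the sorting routines are hypothetical and are not taken from the paper; the state is assumed to be a plain dictionary.

```python
import copy


class RecoveryFailure(Exception):
    """Signalled when every alternate has been tried and rejected."""


def recovery_block(state, acceptance_test, alternates):
    """Try each alternate on `state`, restoring the checkpoint after a failure."""
    checkpoint = copy.deepcopy(state)  # recovery point taken on entry
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state  # results accepted; the checkpoint is discarded
        except Exception:
            pass  # an exception inside an alternate counts as a detected error
        # Backward error recovery: discard every change made by the alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryFailure("no alternate passed the acceptance test")


def primary_sort(state):
    # Deliberately faulty primary algorithm: it loses the last element.
    state["items"] = sorted(state["items"])[:-1]


def alternate_sort(state):
    state["items"] = sorted(state["items"])


def ordered_with_length(expected_length):
    """Build an acceptance test: correct length and ascending order."""
    def test(state):
        items = state["items"]
        in_order = all(a <= b for a, b in zip(items, items[1:]))
        return len(items) == expected_length and in_order
    return test


data = {"items": [3, 1, 2]}
result = recovery_block(data, ordered_with_length(3), [primary_sort, alternate_sort])
print(result["items"])
```

The acceptance test knows nothing about why the primary failed, which is the point made above: restoring a prior state allows recovery from errors of largely unknown origin.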

STRUCTURING FAULT-TOLERANT ASYNCHRONOUS SYSTEMS

The construction of systems with activities that are formed from atomic actions provides a structure for fault tolerance in asynchronous systems. Although atomic actions have been defined many times in different ways (for example, (Davies, 1978; Liskov, 1982; Lomet, 1977)) we will use the following definition (Anderson and Lee, 1981): "The activity of a group of components constitutes an atomic action if there are no interactions between that group and the rest of the system for the duration of the activity." A more rigorous definition based on occurrence graphs formed from events and causality relations enables formal analysis of nested atomic actions (Best and Randell, 1980).

The design of many recovery schemes has been based on a mechanism which supports atomic actions. For example, conversations (Randell, 1975), recoverable monitors (Kim, 1978), chase protocol recovery (Merlin and Randell, 1978), transactions (Spector and Schwarz, 1983) and two phase commit protocols (Gray, 1976) all provide atomic actions for interacting processes. There are two reasons why atomic actions provide the basis for so many different approaches. If a fault, resulting error propagation, and subsequent successful error recovery all occur within a single atomic action they will not affect any other system activities. Furthermore, if the activity of a system can be decomposed into atomic actions, fault tolerance measures can be constructed for each of the atomic actions independently. Thus, atomic actions provide a framework for encapsulating fault tolerance techniques within modular components.

The notion of reliability requires that a system has a specification against which the actual results of invoking its operations can be assessed. When an atomic action is executed, a well-defined state exists at the beginning and termination of its activity (although these states may not necessarily be instantaneously observable). The intended relationship between these states constitutes a specification for the atomic action which is independent of any asynchronous activity inside or outside the atomic action.

The reliability of an atomic action depends upon the reliability of each of its components. An initial and final state can be associated with each component joining and leaving the atomic action. Pre- and post-conditions at the entry and exit points of the components can specify the results of the activity of each component. These pre- and post-conditions constitute a decomposition of the specification of the atomic action.

The specifications and the encapsulation associated with an atomic action provide a context for the application of error detection and damage assessment techniques. Because atomic actions delimit any error propagation caused by interprocess communication they also support error confinement.

The following two principles have been proposed for structuring fault tolerance within asynchronous systems (Campbell and Randell, 1983):

1) The services provided by a fault-tolerant asynchronous system should be implemented by atomic actions.

2) Each fault tolerance measure should be associated with a particular atomic action and should involve all of its components.

A fault-tolerant system is reliable as long as it provides services which meet its specification, even though it may suffer from internal faults and contain internal errors. Any fault tolerance measures that the system invokes as a result of detecting such errors should be invisible when that system is used as a component of another system. Hence, system services must be atomic actions. Although this principle appears to restrict the applications for which our techniques are appropriate, in fact this is not the case. Computer hardware and software are often merely a few components in much larger fault-tolerant systems involving people and process control equipment. Of course, error recovery in such systems must be coordinated between components having very different characteristics.

EXCEPTION HANDLING IN ATOMIC ACTIONS

If a component of an atomic action raises an exception, it indicates the detection of an abnormal condition or error. The error may have been produced as a result of the activity of this component and/or one (or more) of the other components of the atomic action. Alternatively, the original fault may have occurred prior to the atomic action. The raising of an exception within a fault-tolerant atomic action requires the application of abnormal computation and mechanisms to implement the fault tolerance measures. If the recovery measures succeed, the atomic action should produce the results that are normally expected from its activation. Atomic actions that explicitly return an abnormal result have components that co-operatively signal an exception.

An atomic action may contain internal atomic actions. If an exception is raised within an internal atomic action, then the fault tolerance measures of that internal atomic action should be applied. However, an internal atomic action may signal an exception. This exception is raised in the containing atomic action. A distinguished atomic action failure exception signifies the failure of one or more of the components of an atomic action. The failure exception may be signalled explicitly by the components of an atomic action. Alternatively, the failure exception may be signalled implicitly as a default recovery measure for an exception that is raised within a component of the atomic action which has no appropriate exception handler.

The following exception handling scheme for atomic actions has been proposed by Campbell and Randell (1983):

1) If one or more components of the atomic action raise an exception then the fault tolerance measures necessarily involve all of the components of that atomic action. (If some of the components do not require any fault tolerance measures, they do not interact with the other components and hence can form a separate atomic action.) Figures 1 and 2 show example atomic actions in which an exception Y has been raised.

2) Every component of the atomic action responds to the raised exception by changing to an exceptional control flow which executes the handler for that exception. (Thus, exception handling in a sequential system is a special case.) Figures 1 and 2 show the changed control flows of the components of two atomic actions following an exception. In Fig. 1, the recovery measures implemented by the exception handlers succeed and the normal control flow of the components is resumed.

[Figure 1: control-flow diagram of an atomic action. In each component the normal flow is suspended when the exception is raised, an exceptional flow is executed, and the normal flow is then resumed.]

Fig. 1. An example of successful recovery in an atomic action.

It is convenient to restrict signalled exceptions so that each component (or exception handler) of an atomic action returns the same exception. The signalling of multiple exceptions would only serve to confuse the selection of the appropriate recovery measure within any enclosing atomic action. Figure 2 shows the control flow of the components of an atomic action when the exception handlers for the components cannot recover. A signalled exception ensures that the exceptional control flow is continued by the components that invoked the atomic action.

4) In particular, if any of the components of the atomic action do not have a handler for the exception then those components raise an atomic action failure. (The fact that an exception has been detected elsewhere amongst the processes in an atomic action invalidates the assumptions that any of the processes can terminate normally and provide the appropriate results.) A failure exception could have been signalled explicitly by the exception handlers shown in Fig. 2. Alternatively, the exception handlers might be the default recovery measure which signals a failure exception in response to detecting an exception for which there are no explicit exception handlers.
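The scheme above can be sketched, under simplifying assumptions, as a small Python function. The components are run one after another rather than as real asynchronous processes, and the names `Signal`, `FAILURE` and `run_atomic_action` are hypothetical. The choice to signal a failure when handlers disagree on the exception to signal is an assumption of this sketch, made to honour the single-exception restriction.

```python
class Signal(Exception):
    """An exception raised within, or signalled from, an atomic action."""

    def __init__(self, name):
        super().__init__(name)
        self.name = name


FAILURE = "atomic_action_failure"  # the distinguished failure exception


def run_atomic_action(components):
    """Run components given as (body, handlers) pairs.

    body() returns a result or raises Signal(name); handlers maps an
    exception name to a function that returns a result or raises Signal.
    """
    raised = None
    results = []
    for body, _handlers in components:
        try:
            results.append(body())
        except Signal as exc:
            raised = exc.name  # an exception has been raised in the action
            break
    if raised is None:
        return results  # every component terminated normally

    # A component without a handler for the exception forces a failure.
    if any(raised not in handlers for _body, handlers in components):
        raise Signal(FAILURE)

    # Every component abandons its normal flow and executes its handler.
    results, signalled = [], set()
    for _body, handlers in components:
        try:
            results.append(handlers[raised]())
        except Signal as exc:
            signalled.add(exc.name)
    if not signalled:
        return results  # recovery succeeded; normal results are returned
    # Assumption: one agreed exception is signalled, otherwise a failure.
    raise Signal(signalled.pop() if len(signalled) == 1 else FAILURE)


def detect_error():
    raise Signal("X")


recovered = run_atomic_action([
    (detect_error, {"X": lambda: "repaired by component 1"}),
    (lambda: "normal result", {"X": lambda: "repaired by component 2"}),
])
print(recovered)
```

Note that the second component's normal result is discarded once the exception is raised: both components take part in recovery, as the first rule of the scheme requires.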


[Figure 2: control-flow diagram of an atomic action. In each component the normal flow is suspended when the exception is raised, an exceptional flow is executed, and the component leaves the atomic action with a signalled exception.]

Fig. 2. An example of returning an abnormal response or failure from an atomic action.

BACKWARD RECOVERY IN ATOMIC ACTIONS

Any notation for specifying backward error recovery should define an atomic action within which the recovery, if necessary, should occur. Atomic actions have been represented by programming notations in many ways (Kim, 1982; Liskov, 1982; Lomet, 1977; Shrivastava and Banatre, 1978). We suggest that the key property of an atomic action is the fact that it restricts the sharing of state information between concurrent processes. The activity and results of the atomic action are isolated from the rest of the system for the duration of the action. It is this isolation which is the essence of an atomic action. We can extend the framework of exception handling to support backward error recovery in asynchronous systems by encapsulating it within a structure derived from atomic actions.

The conversation (Randell, 1975) is an extension of the recovery block (Horning and colleagues, 1974) which co-ordinates backward error recovery for concurrent processes by only permitting interprocess communication within an atomic action. For the duration of a conversation, it must be possible to restore the state of any participating process to that which was current on entry to the conversation. Exiting from a conversation is synchronized; all processes must leave simultaneously. No state information is retained after the conversation has successfully terminated. Linguistic frameworks for conversations have been developed and these may impose further restrictions (Anderson and Knight, 1983; Kim, 1982; Russell and Tiedeman, 1979). In particular, it can be argued that in order to simplify the organization of recovery the structures defined by conversations should be completely predetermined, rather than established dynamically by the processes when they need to communicate. Conversations restricted in this way are known as "dialogues" (Anderson and Moulding, 1983) and have been used in the implementation of a naval command and control system.

An alternative basis for a notation for atomic actions is the concept of a shared instance of an abstract data type. Such a shared instance would retain state information between atomic actions. We will refer to the realization of abstract data types in a concurrent environment as "encapsulated data". Note that the operations on encapsulated data can be executed concurrently. The activity of processes executing these operations should be structured to form an atomic action. Thus, processes would be isolated from the rest of the system while they operate on the encapsulated data. This approach is consistent with several existing proposals (Kim, 1982; Liskov, 1982; Shrivastava and Banatre, 1978). Fault tolerance, in the form of backward error recovery, can then be associated with the operations on encapsulated data.

An abstract data type can be specified by a data type invariant (Jones, 1980) together with the pre- and post-conditions of the operations on the data type. The invariant is a predicate on the state of the data type between operations which is true for all valid internal states and false otherwise. This invariant could be evaluated at the completion of the operations on the data type in order to detect errors. Should an error be detected during the execution of the operations, any fault tolerance measures which are invoked will be invisible to the system of which the encapsulated data is a part. The fault tolerance measures should, in principle, involve all of the operations whose activities constitute the atomic action. Thus, the encapsulated data and its operations form a fault-tolerant asynchronous subsystem.
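To make the role of the invariant and of the pre- and post-conditions concrete, the following Python sketch specifies a small data type in this style. The `BoundedBuffer` class and the `SpecViolation` exception are hypothetical examples, not taken from the paper, and the sketch covers error detection only, for a single sequential caller.

```python
class SpecViolation(Exception):
    """A pre-condition, post-condition or invariant does not hold."""


class BoundedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def invariant(self):
        # Predicate on the state between operations; it must not modify state.
        return 0 <= len(self.items) <= self.capacity

    def put(self, item):
        # Pre-condition: the buffer is not full.
        if len(self.items) >= self.capacity:
            raise SpecViolation("pre-condition of put violated: buffer is full")
        old_length = len(self.items)
        self.items.append(item)
        # Post-condition: exactly one element was added, and it is `item`.
        if len(self.items) != old_length + 1 or self.items[-1] != item:
            raise SpecViolation("post-condition of put violated")
        # The invariant is evaluated at the completion of the operation.
        if not self.invariant():
            raise SpecViolation("invariant violated after put")


buffer = BoundedBuffer(capacity=1)
buffer.put("a")
print(buffer.invariant())
```

A detected violation is what would trigger the fault tolerance measures; here it is simply reported as an exception.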

RECOVERABLE OBJECTS

For the purposes of discussion, we shall describe a notation which defines a "recoverable object". The notation is based on the concept of shared encapsulated data and has been implemented experimentally in Distributed Path Pascal (Campbell, 1983) (a programming language which supports concurrent processes, shared encapsulated data, and distributed processing over a local area network). Although the recoverable object (Schmidt, 1983) is an extension to a Path Pascal object, it can also be thought of as a "recoverable" abstract data type.

The state of a recoverable object is represented by the internal variables of that object (which may themselves be recoverable objects). Entry procedures and functions constitute its operations. To detect errors in the object, each recoverable object contains a boolean function which evaluates the invariant for the object. This function serves as part of an acceptance test which is applied after the execution of any operation to determine whether its results are correct. Since the invariant should only test the state of the object to determine whether it is valid, the function is constrained so that it cannot modify that state. Any operation may also incorporate a specific acceptance test for the values of any arguments which it receives or returns. The structure of a recoverable object is, in a Simula class-like notation:

    object (*Recoverable Object*)
      ensure synchronization_of_ops; (*path expression*)
      defn_of_local_variables;
      list_of_ops_and_their_params;
      initialization_for_local_vars;
      invariant boolean_function_defn;
      by (*a routine for each op*)
      else by (*an alt. routine for each op*)
      else error; (*signal failure exception*)
    end (*Recoverable Object*);

An object is considered to be idle when none of that object's operations are being executed. (Execution of the operations on an object may be synchronized by an Open Path Expression (Campbell and Kolstad, 1979); both concurrent and sequential execution of the operations can be specified. Operations updating primitive objects must be sequential.) Prior to the execution of an entry routine of an idle object, a recovery cache (Horning and colleagues, 1974) is established. Any variables that are changed during execution of the operation have their original values stored in the cache. Once the recovery cache is established, other routines of the object may be executed concurrently (subject to the constraints of the path expression) and the prior values of any variables changed by these routines are also recorded in the cache. Routines in a recoverable object are not allowed to return values to the calling environment until all routines have finished executing.


When all the routines have completed their execution, the invariant is evaluated as well as any individual acceptance tests of the routines. If no errors are detected by the acceptance tests or the invariant, then the recovery cache is discarded and the routines return with their results. If, however, the invariant or an acceptance test fails, the cached values of the internal variables are restored and alternate routines for each executed routine are invoked. Any routine of the object can also invoke recovery by executing a standard procedure "error", by attempting to perform an invalid instruction such as divide by zero, or by invoking an operation on another recoverable object which signals a failure exception.

Recovery commences by suspending the activities of all of the routines performing operations on the recoverable object. If the alternate operations fail to satisfy the invariant or raise further errors, recovery is invoked again. This time, the second alternates will be attempted. This process continues until either all the operations finish normally and the invariant does not detect an error, or one or more routines run out of alternates. In the former case, all the operations return normally; in the latter case, all operations will signal a failure exception to their respective calling routines. If a new operation is to be performed while alternate routines are being executed, then the alternate routine for that operation must be executed.

To enforce atomicity, the passage of information in and out of the object is prevented during the execution of its operations. Information can only enter the recoverable object via the parameters of an operation before the routine which performs that operation starts executing. Information can only leave the recoverable object after all the operations have successfully completed. In this way, only validated results are passed out of the object.
Thus, the conversation and recoverable object are based on similar forms of atomic action.
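The recovery cache and the use of alternates can be sketched in Python as follows. This is a much simplified illustration, not the Distributed Path Pascal implementation: only one operation executes at a time, there is no path expression, and the object is assumed to have a fixed set of variables given at construction. The names `RecoverableObject`, `FailureException`, `invoke`, `Account` and the withdrawal routines are all hypothetical.

```python
class FailureException(Exception):
    """Signalled to the caller when an operation runs out of alternates."""


class RecoverableObject:
    def __init__(self, **variables):
        self._vars = dict(variables)  # internal state of the object
        self._cache = None            # recovery cache: name -> original value

    def get(self, name):
        return self._vars[name]

    def set(self, name, value):
        # Record the prior value the first time a variable is changed.
        if self._cache is not None and name not in self._cache:
            self._cache[name] = self._vars[name]
        self._vars[name] = value

    def invariant(self):
        return True  # overridden by each concrete object

    def invoke(self, alternates, *args):
        """Run the primary routine, then each alternate, until one is accepted."""
        self._cache = {}  # establish the recovery cache
        try:
            for routine in alternates:
                try:
                    result = routine(self, *args)
                    if self.invariant():
                        return result  # only validated results leave the object
                except Exception:
                    pass  # for example, a divide by zero also invokes recovery
                self._vars.update(self._cache)  # restore the cached values
                self._cache = {}
            raise FailureException("all alternates were rejected")
        finally:
            self._cache = None  # the cache is discarded in every case


class Account(RecoverableObject):
    def invariant(self):
        return self.get("balance") >= 0


def withdraw_primary(obj, amount):
    # Deliberately faulty primary routine: it subtracts the amount twice.
    obj.set("balance", obj.get("balance") - 2 * amount)
    return obj.get("balance")


def withdraw_alternate(obj, amount):
    obj.set("balance", obj.get("balance") - amount)
    return obj.get("balance")


account = Account(balance=100)
remaining = account.invoke([withdraw_primary, withdraw_alternate], 60)
print(remaining)
```

Because only changed variables are cached, the cost of establishing a recovery point depends on how much state an operation touches rather than on the size of the object.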

DISCUSSION

Recovery blocks, dialogues, recoverable monitors, and recoverable objects are all particular examples of associating backward error recovery with a programming mechanism for defining atomic actions. The recovery block provides error recovery for a sequential process while the recoverable monitor provides error recovery for a sequential operation on an encapsulated set of variables. A dialogue provides recovery for a fixed set of concurrent processes while a recoverable object provides recovery for a variable number of concurrent processes manipulating encapsulated data. All of the techniques isolate the effects of an activity for the duration of the activity. The backward error recovery mechanism in all the approaches is provided by the use of a cacheing scheme. The major differences between the techniques are the degree and form of the constraints they impose on information exchange between processes.

Most existing backward and forward error recovery notations restrict concurrency (for example, dialogues) or even enforce sequentiality (for example, in monitors). Mechanisms that do not constrain concurrency unduly but allow the construction of atomic actions are often difficult to integrate directly into existing programming languages. For example, programming language notations have yet to be devised to take advantage of the concurrency permitted by two-phase commit protocols or chase protocols.

The selection and design of acceptance tests and invariants is of great importance to the successful construction of fault-tolerant systems because of the crucial role they play in error detection. Ideally, adequate error detection facilities should ensure the detection of every detrimental consequence of any fault in the system (Best and Cristian, 1981). Further research may enhance verification techniques to allow a mechanical confirmation of adequate detection facilities. However, there are numerous difficulties that must be overcome before it becomes feasible to formally verify such properties of asynchronous systems. The simplest schemes for specifying atomic actions may well be the best.

Techniques for designing forward error recovery in asynchronous systems should exhibit the same fundamental dependence on atomicity as does backward error recovery. Indeed, the two principles described above were derived from an examination of the use of forward error recovery in concurrent systems (Campbell and Randell, 1983). Several attempts at providing exception handling in a concurrent programming language suffer from the inadequacy of their mechanisms to enforce isolation (for example, Ada and MESA).

CONCLUSIONS

The design of fault tolerant software for an asynchronous system can be a complex and difficult task. A reduction in complexity can be expected if atomic actions are used to structure the activity of the system. However, atomic actions are merely a concept for system structuring; they can only be used to build practical systems when a suitable notation is available.

Ideally, such a notation should:

* make apparent the structures which it defines;

* clearly delineate the constraints imposed upon communication;

* enable forward and backward error recovery measures to be easily incorporated;

* be convenient for use by system implementors and facilitate inspection of the system design;

* integrate well with existing concurrent system environments;

* be amenable to formal verification techniques.

We argue for an effective notation in which to express atomic actions. We are convinced that this would be a major contribution to the development of fault tolerant software.

Acknowledgements. This paper was written while R. H. Campbell was a Visiting Senior Research Fellow of the Science and Engineering Council of Great Britain at the Computing Laboratory, University of Newcastle upon Tyne. The authors are very grateful to George Schmidt for his contributions to the development of this paper.

REFERENCES

Anderson, T. and J. C. Knight (1983). A Software Fault Tolerance Framework for Real-Time Systems. To appear in IEEE Transactions on Software Engineering, Vol. SE-9.

Anderson, T. and P. A. Lee (1981). Fault Tolerance, Principles and Practice. Prentice Hall International, Englewood Cliffs NJ.

Anderson, T. and M. R. Moulding (1983). Dialogues for Recovery Coordination in Concurrent Systems. Technical Report: In preparation, Computing Laboratory, University of Newcastle upon Tyne.

Best, E. and F. Cristian (1981). Systematic Detection of Exception Occurrences. Science of Computer Programming, Vol. 1, No. 1. North Holland Pub. Co. pp. 115-144.

Best, E. and B. Randell (1980). A Formal Model of Atomicity in Asynchronous Systems. Technical Report 130, Computing Laboratory, University of Newcastle upon Tyne, December 1980.

Campbell, R. H. (1983). Distributed Path Pascal. In Y. Paker (Ed.) Distributed Computing, Proceedings of the International Seminar on Synchronization Control and Communication in Distributed Computing Systems, To be published, Academic Press.

Campbell, R. H. and R. B. Kolstad (1979). Path Expressions in Pascal. Proceedings of the Fourth International Conference on Software Engineering, Munich, September 1979, 212-219.

Campbell, R. H. and B. Randell (1983). Error Recovery in Asynchronous Systems. Technical Report, In preparation, Computing Laboratory, University of Newcastle upon Tyne.

Chen, L. and A. Avizienis (1978). N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation. Digest of Papers FTCS-8: Eighth Annual International Symposium on Fault-Tolerant Computing, Toulouse, June 1978, 3-9.

Cristian, F. (1982). Exception Handling and Software Fault Tolerance. IEEE Transactions on Computers, Vol. C-31, No. 6, June 1982, 531-540.

Davies, C. T. (1978). Data Processing Spheres of Control. IBM Systems Journal, Vol. 17, No. 2, 179-198.

Gray, J. N. (1976). Notes on Data Base Operating Systems. In R. Bayer, R. M. Graham and G. Seegmuller (Ed.), Lecture Notes in Computer Science, Vol. 60, Springer-Verlag, Berlin. pp. 393-481.

Horning, J. J., H. C. Lauer, P. M. Melliar-Smith and B. Randell (1974). A Program Structure for Error Detection and Recovery. In E. Gelenbe and C. Kaiser (Ed.), Lecture Notes in Computer Science, Vol. 16, Springer-Verlag, Berlin. pp. 171-187.

Jones, C. B. (1980). Software Development: A Rigorous Approach. Prentice-Hall International, Englewood Cliffs NJ.

Kim, K. H. (1978). An Approach to Programmer-Transparent Coordination of Recovering Parallel Processes and its Efficient Implementation Rules. Proceedings of International Conference on Parallel Processing, Detroit MI, August 1978, pp. 58-68.

Kim, K. H. (1982). Approaches to Mechanization of the Conversation Scheme Based on Monitors. IEEE Transactions on Software Engineering, Vol. SE-8, No. 3, 189-197.

Liskov, B. (1982). On Linguistic Support for Distributed Programs. IEEE Transactions on Software Engineering, Vol. SE-8, No. 2, May 1982, 203-210.

Lomet, D. B. (1977). Process Synchronization, Communication and Recovery Using Atomic Actions. SIGPLAN Notices, Vol. 12, No. 2, March 1977, 128-137.

Merlin, P. M. and B. Randell (1978). Consistent State Restoration in Distributed Systems. Digest of Papers FTCS-8: Eighth Annual International Symposium on Fault-Tolerant Computing, Toulouse, June 1978, 129-134.

Randell, B. (1975). System Structure for Fault Tolerance. IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, 220-232.

Randell, B., P. A. Lee and P. C. Treleaven (1978). Reliability Issues in Computing System Design. ACM Computing Surveys, Vol. 10, No. 2, June 1978, 123-165.

Russell, D. L. and M. J. Tiedeman (1979). Multiprocess Recovery Using Conversations. Digest of Papers FTCS-9: Ninth Annual International Symposium on Fault-Tolerant Computing, Madison WI, June 1979, 106-109.

Schmidt, G. (1983). The Recoverable Object. M.S. Thesis, Department of Computer Science, University of Illinois, Urbana IL, 1983.

Shrivastava, S. K. and J-P. Banatre (1978). Reliable Resource Allocation Between Unreliable Processes. IEEE Transactions on Software Engineering, Vol. SE-4, No. 2, May 1978, 230-241.

Spector, A. Z. and P. M. Schwarz (1983). Transactions: A Construct for Reliable Distributed Computing. Operating Systems Review, Vol. 17, No. 2, April 1983, 18-35.