Microprocessing and Microprogramming 31 (1991) 117-120 North-Holland
AN ASYNCHRONOUS CHECKPOINTING SERVICE
Rumen STAINOV
BULGARIAN ACADEMY OF SCIENCES, Center of Informatics and Computer Technology (CICT), Sofia - 1113, "Acad. G. Bonchev" Str., Bl. No 25A, Bulgaria
Phone: +395 2 708494, Fax: +359 2 707273, Telex: 22056 KZIIT BG
A general requirement on checkpoint-based fault recovery schemes in distributed systems (DS) is maintaining a consistent DS state, i.e. the effects of all interactions of the failed process with other processes after the checkpoint must be taken into consideration. One recovery method is to roll all interacting processes back to earlier checkpoints, ensuring a consistent system state. These checkpoints must be set according to some synchronization policy in order to avoid step-by-step uncontrolled rolling back of the processes (domino effect [1]). However, the sophisticated synchronization protocols and the system rollback can considerably affect the overall performance. Another method supports asynchronous checkpointing and logs all messages received by a process after the checkpoint. During recovery a simulation is started so that the process receives the same messages in the order in which they were received before the failure occurred. Setting the checkpoints is autonomous for each process and the recovery procedure usually concerns the failed process only, which leads to relatively higher overall DS performance. Nevertheless, this method usually needs specialized protocols (e.g. atomic broadcast protocols [2], [3]) or an intermediate communication supervisor [4] in order to log messages and simulate message recovery. In both cases, this leads to an increased communication overhead during non-faulty process execution.
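The second method (asynchronous checkpointing with message logging) can be pictured with a small sketch. The Python fragment below is an illustration only and not part of the proposed service: the class `LoggedProcess` and its methods are hypothetical names, the process state is reduced to a running sum, and message handling is assumed to be deterministic, so that replaying the logged messages in their original order after restoring the checkpoint reproduces the state before the failure.

```python
import copy


class LoggedProcess:
    """Toy process: one checkpoint plus a log of messages received since it.

    All names here are hypothetical; they do not come from the HPS port system.
    """

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []  # messages received after the last checkpoint, in order

    def set_checkpoint(self):
        # Autonomous checkpoint: no coordination with other processes.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def receive(self, msg):
        self._log.append(msg)  # log the message, then handle it
        self._handle(msg)

    def _handle(self, msg):
        # Deterministic handler (assumption required for replay to be valid).
        self.state["total"] += msg

    def recover(self):
        # Restore the checkpoint and replay the log in order of receipt.
        self.state = copy.deepcopy(self._checkpoint)
        for msg in self._log:
            self._handle(msg)


p = LoggedProcess()
p.receive(5)
p.set_checkpoint()
p.receive(2)
p.receive(3)
before_failure = copy.deepcopy(p.state)
p.state = {"total": -1}  # simulate a state lost through a failure
p.recover()
```

Only the failed process is touched during recovery; the cost is that every received message has to be logged during normal execution, which is the overhead the rest of the paper tries to reduce.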
This paper proposes an approach to autonomous logging of asynchronous messages and their recovery simulation, using the communication dependencies in a transputer-based High Performance Computing System (HPS) [5], and focuses on the following features:
(F1) The autonomous HPS port system provides a checkpointing service for distributed message logging, distributed control and supervision of message recovery, avoiding specialized protocols or run-time message routing through a supervisor.
(F2) The checkpointing service can be used by different checkpoint-based recovery methods via a unified interface. The service is controlled by port attributes and can be a basis for creating application-level fault-tolerance tools.
(F3) The message routing in a transputer network is used in order to avoid communication overhead and for dynamic updating of the message logs.
The proposed checkpointing service aims at providing system support to several fault-tolerance methods: Rollback recovery, Recovery block, Back-up recovery. The concept of asynchronous checkpointing allows us to separate the checkpointing functions from the fault-tolerance methods:
(E1) Setting of non-synchronized checkpoints (BP) in the process.
(E2) Logging of the interaction messages after the BP.
(E3) Recovery (in case of failure at point FP) of the logged messages in a sequence which guarantees the consistency of the DS.
(E4) Starting of a procedure realizing the specific recovery method.
In the proposed checkpointing service, the HPS port system provides system support for stages E2 and E3. The checkpoint setting (stage E1) and the recovery procedure (stage E4) can be realized by the application process or by a server. The service is based on Jalote's proof [4]:
(A1) The recovery of the lost messages in exactly the same original order is necessary only for the time period before the last SEND-command execution. In HPS these messages are stored at run time in a process log.
(A2) Afterwards the order of the messages recovered must be observed only with respect to the same sender, i.e. messages sent from different processes can arrive at the process in a different sequence. In our system these messages are stored in the sender's port log.
According to E2, message logging is performed during the non-faulty execution of the process Pr2 (Fig. 1):
(B1) Storing of the messages received and executed (according to A1) in the log of the process Pr2 (PROCESSLOGpr2) in parallel with the next SEND-command. The messages are logged in such a way that allows their later recovery in the same sequence as consumed by the original process. The messages received by the process are defined as consumed when a subsequent SEND-command is executed. Thus, the port system maintains two message logs: a transient port log for the received messages and a stable process log for the consumed ones. The port log is organized as a linked list, avoiding the time-consuming copying of messages. When a message is read from a reliable input port, the port system does not remove it from the port but only marks the message as logged and connects it to a linked list reflecting the order in which messages are read from different input ports. A message sent by the process under consideration makes the port system copy the messages from the linked list into the stable process log and remove them from the input ports. In Figure 1 the stable process log PROCESSLOGpr2 would contain, after the SEND-command execution, the messages C1, C4 and C2 in the order they have been consumed by the process Pr2.
[Figure 1: time lines of the processes Pr1, Pr2 and Pr3 with the checkpoint BP2n and the points of time T1, FP and FP'. PORT LOG P11: at T1 (C1, C2, C3), at FP (C3). PROCESS LOG Pr2: at FP (C1, C4, C2). PORT LOG P33: at T1 (C4, C5), at FP (C5).]
Fig. 1. Messages storing before failure
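The receiver-side logging of Fig. 1 (a transient log filled on every read, a stable process log filled on SEND) can be sketched as follows. This is a minimal model under stated assumptions, not the HPS implementation: the class `PortSystem` and the methods `deliver`, `read` and `send` are hypothetical names, the logs are plain Python lists held in memory, and the split of the messages between the input ports P12 and P23 is assumed from the senders Pr1 and Pr3.

```python
class PortSystem:
    """Toy model of the two logs kept for one process (hypothetical names)."""

    def __init__(self, input_ports):
        # Messages stay in their input port until the next SEND.
        self.ports = {name: [] for name in input_ports}
        # Number of messages per port that are already marked as logged.
        self.n_logged = {name: 0 for name in input_ports}
        self.read_order = []   # transient log: messages in the order they were read
        self.process_log = []  # stable log: messages consumed before a SEND

    def deliver(self, port, msg):
        self.ports[port].append(msg)

    def read(self, port):
        # Mark the next message as logged without removing it from the port.
        msg = self.ports[port][self.n_logged[port]]
        self.n_logged[port] += 1
        self.read_order.append(msg)  # stores a reference, not a copy
        return msg

    def send(self, msg):
        # A SEND turns all messages read so far into consumed messages:
        # move them to the stable process log and drop them from the ports.
        self.process_log.extend(self.read_order)
        self.read_order.clear()
        for port, n in self.n_logged.items():
            del self.ports[port][:n]
            self.n_logged[port] = 0
        return msg


# Scenario of Fig. 1: Pr2 reads C1, C4 and C2, then executes a SEND.
pr2 = PortSystem(["P12", "P23"])
for m in ("C1", "C2", "C3"):
    pr2.deliver("P12", m)
for m in ("C4", "C5"):
    pr2.deliver("P23", m)
pr2.read("P12")
pr2.read("P23")
pr2.read("P12")
pr2.send("reply")
```

After the SEND the stable log holds C1, C4, C2 in consumption order, while C3 and C5 remain unread in their ports. In this sketch the ordering across different input ports is recorded only for messages consumed before the last SEND, which is the period for which A1 requires it.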
(B2) Distributed storing of the messages sent to Pr2 (according to A2) in the logs of the senders' ports (PORTLOGpm). Message logging includes marking the messages sent as logged, where the sending sequence is kept by the port itself. In order to avoid loss of the port logs, an attempt to delete the output port will suspend it into an "inactive" port, as long as at least one logged message exists. At the time T1 (Fig. 1) the log of the output port P11 will contain the already sent messages C1, C2 and C3, and the log of the port P33 the messages C4 and C5.
(B3) Dynamic discarding from PORTLOGpm of the messages already executed by Pr2. In order to deliver the necessary state information to the senders' ports we use the SEND-message routing through the transputer network. Assuming the SEND-command has passed through the nodes running the processes Pr1 and Pr3, their output port logs will contain the messages C3 and C5 (Fig. 1).
Message recovery (according to E3) is provided by the checkpointing service in case of failure in the original process. This includes recovery of the process ports, recovery of the messages from the stable process log, and re-sending of the messages from the senders' port logs. Depending on the owner of the recovered ports, message recovery can be applied to the original process (in case of Rollback recovery), to its passive copy (in case of Back-up recovery) or to a specific recovery process discarding the effects of the processes' interactions (in case of Recovery block).
A process recovery example demonstrates the message recovery in its full complexity. Should the process Pr2 in Fig. 1 fail at the point of time FP, a recovery procedure will start. Assuming this procedure implements a back-up recovery algorithm, the following steps are taken (Fig. 2):
(S1) A backup process Pr2' is created at another network node.
(S2) A "dummy" input port (FI) is created with the messages from the stable log of the process Pr2 (PROCESSLOGpr2) recorded in it. This port represents all input ports of Pr2, and while it contains at least one message all other input ports are inactive.
(S3) The original input ports (P12 and P23) of Pr2 are recovered according to the information from the PROCESSLOGpr2 and are owned by Pr2'. A re-send request is sent to the corresponding output ports. The queue length of the port logs is used to indicate successful re-sending.
[Figure 2: processes Pr2', Pr1 and Pr3 with the checkpoint BP2n' and the point of time T1'; ports are labelled DUMMY, INACTIVE and ACTIVE.]
Fig. 2. Messages recovery to a back-up process Pr2'
(S4) The output port (P22) of Pr2 is recovered according to the information from the PROCESSLOGpr2 defining the number of "dummy" messages which have to be suppressed.
(S5) If S2, S3 and S4 are successful, the passive copy of Pr2 starts as Pr2' from the last checkpoint BP2n.
During the execution the port attributes control the recovery: first, re-executing the messages before the last SEND-message; second, suppressing the corresponding number of output messages; third, restoring the ports and the messages in them to a normal state, i.e. converting ports P12, P22 and P23 into active ones. This example demonstrates the following additional features of the checkpointing service:
(F4) The re-start of a process from a checkpoint is fully autonomous and performed in the same way as restarting non-interacting processes. This is the case of introducing checkpoints in centralized computing systems.
(F5) The recovery of the failed process is fully transparent to the other processes with which it interacts. These processes are not aware of the failure in their partner process and continue to send messages to it. Thus, message recovery also handles the messages sent by the other processes after the failure time.
(F6) The port system is distributed and weakly connected to the process owner. A failure in the processes only slightly affects its functionality. Thus, the proposed service can also be used for tolerating multiple process failures.
REFERENCES
[1] B. Randell, System Structure for Software Fault Tolerance, IEEE Trans. on Soft. Eng., Vol. SE-1 (1975) 2.
[2] A. Borg et al., A Message System for Supporting Fault Tolerance, Proc. of 9th Symposium on Operating Systems Principles, Bretton Woods, ACM, October 1983, 90-99.
[3] O. Babaoglu, Fault-Tolerant Computing Based on Mach, ACM Operating Sys. Rev., Vol. 24 (1990) 1, 27-39.
[4] P. Jalote, Fault Tolerant Processes, Distributed Computing (1989) 3, 185-195.
[5] K. Bojanov and K. Yanev, A Family of High Performance Parallel Computer Systems, Proc. of Workshop on Parallel Distributed Processing, Sofia, 1989.