Performance Evaluation 65 (2008) 345–365 www.elsevier.com/locate/peva
Model-based performance evaluation of distributed checkpointing protocols✩

Adnan Agbaria a,∗, Roy Friedman b

a IBM Haifa Research Lab, Mount Carmel, Haifa 31905, Israel
b Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel
Received 20 June 2006; received in revised form 13 March 2007; accepted 16 September 2007 Available online 26 September 2007
Abstract

A large number of distributed checkpointing protocols have appeared in the literature. However, to make informed decisions about which protocol performs best for a given environment, one must use an objective measure for comparing them. Obviously, a distributed checkpointing protocol could be the best in one environment but not in another. This paper presents an objective measure, called the overhead ratio, for evaluating distributed checkpointing protocols. This measure extends previous evaluation schemes by incorporating several additional parameters that are inherent in distributed environments. In particular, we take into account the rollback propagation of the protocol, which impacts the length of the recovery process, and therefore the expected program run-time in executions that involve failures and recoveries. Using this measure as an evaluation technique, the paper also analyses several known protocols and compares their overhead ratios.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Distributed checkpoint/restart; Rollback propagation; Performance analysis; Markov models
✩ This research is supported by the Bar-Nir Bergreen Software Technology Centre of Excellence.
∗ Corresponding author.

1. Introduction

Checkpoint/Restart (C/R) is one of the most prominent techniques for providing fault tolerance, and can also be used for debugging and migration in both uniprocessor and distributed systems [13,26]. Specifically, checkpointing is the act of saving a program’s state in stable storage, and restart is the act of restarting an application from its saved state. In particular, if an application takes periodic checkpoints, then in case of a failure, it may be possible to restart it from the latest checkpoint, thereby avoiding loss of all the computation that was carried out before that checkpoint. One of the main challenges in implementing C/R mechanisms is that of maintaining low overhead, since otherwise the cost of taking a checkpoint will outweigh its potential benefit. For programs executing on a single computer, the main focuses are on interleaving the task of saving the program state with the program execution and on reducing the total size of the file being saved. For parallel or distributed applications, the situation is more complicated. In such an environment, if each process saves its state in a completely independent manner, it is possible that no collection
of checkpoints (with one from each process) will correspond to a consistent application state. Such a collection of checkpoints is known as a recovery line. Thus, the main research focus in this area is on devising techniques that guarantee the existence of a recovery line while minimizing the coordination between processes. This coordination overhead includes the amount of control information exchanged between processes and the number of times some process p is forced to take a checkpoint to ensure that a recovery line exists. Such checkpoints are called induced or forced checkpoints. Over the years, a multitude of C/R techniques have been proposed for both uniprocessor and distributed settings. This has created a need for objective quantitative evaluation measures that allow one to compare various approaches on the same scale. One such scheme was developed by Vaidya in the context of applications running on a single computer [38], but it does not extend trivially to most distributed C/R mechanisms. There are two reasons for this. First, Vaidya’s work does not take into account the communication costs associated with distributed checkpointing, and second, it assumes that each process always restarts from its most recent checkpoint. This assumption cannot always be made in distributed C/R schemes, since these schemes often require a process to roll back after a failure to a more distant checkpoint [1]. In this paper, extending our preliminary work [2], we present an objective quantitative evaluation measure, called the overhead ratio, for distributed checkpointing protocols that is based on stochastic models. For the first time, we define an overhead ratio that considers all the relevant parameters, including communication overhead, rollback propagation, recovery overhead, and others. A special case of this measure has been used before for different purposes.
For example, Vaidya [38] used this measure in the special case in which recovery always occurs from the most recent checkpoints. In particular, we show that when a checkpoint mechanism guarantees that a process will always roll back to its most recent checkpoint, as Vaidya assumed, our evaluation measure agrees with Vaidya’s results. Although our generalized overhead ratio considers as many parameters as possible, we still make some assumptions that may restrict the accuracy of the proposed measure in a realistic system. To improve readability, we state those assumptions at the places in the paper where they are used and explain them as needed. For example, one of the assumptions we use is an independent failure rate. While this assumption is widely used and accepted in many theoretical and practical papers, some work casts doubt on its validity [9,29]. The remainder of this paper is organized as follows: Section 2 describes the model and basic definitions. Our method of evaluating distributed checkpointing protocols is presented in Section 3. In Section 4, we briefly describe different distributed checkpointing protocols and compare them according to our evaluation method. In Section 5, we describe previous related work, and we conclude in Section 6.

2. Preliminaries

2.1. System model

We consider a distributed system consisting of n processes, denoted by P1, P2, . . . , Pn, connected by a communication network and communicating by exchanging messages asynchronously over the network. Processes are modelled as automata, which start in some predefined state and execute multiple steps. In each step, a process receives zero or more messages, performs some computation and generates zero or more messages to be sent to other processes. Additionally, in each step a process may decide to take a checkpoint, or to perform a restart.
Finally, a process may fail by crashing during any step. A process that crashes does not take additional steps until it performs a restart. On the other hand, a process may perform a restart even if it did not crash, but another process did. We refer to sending/receiving a message, taking a checkpoint, and crashing as events that occur in the process. We define the local history of a process to be the sequence of events and state transitions it incurs. An execution is a collection of histories, one for each process, that obeys the typical well-formedness properties. In addition, we assume that the network delivers messages reliably in First-In-First-Out (FIFO) order [11]; this assumption is required by most of the checkpointing protocols we consider. For each receive event in an execution, there is one corresponding send event, and for each send event, there is at most one receive event. Moreover, if the histories in the execution are infinite, then for each send event there is exactly one corresponding receive event.

Events in an execution are related by the happened-before relation [15], denoted by $\xrightarrow{hb}$; this relation is defined as the transitive closure of the process order and the relation between the send and receive events of the same message.
Fig. 1. A process of a distributed system.
To simplify our further definitions, we assume that there is a global clock, which may not be accessible to the processes. Each event in a history is associated with a global time at which the event occurred; this association is expressed by a time function t(event). This function is monotonically increasing for events in a local history. The time at which a message is received is later than or equal to the time at which it was sent; in other words, the time function is consistent with the happened-before relation. Each checkpoint taken by a process is assigned a unique sequence number. The ith checkpoint of process p is denoted by C_{p,i}. The ith checkpoint interval of process p, denoted by I_{p,i}, is the sequence of all events performed in the interval [t(C_{p,i−1}), t(C_{p,i})). In addition, we assume that process failures occur randomly and independently, with exponentially distributed inter-failure times of rate λ.

In practice, a process typically consists of several modules, as depicted in Fig. 1. Specifically, the Application module executes the user program, the C/R module executes the checkpointing and recovery protocols, the Checkpointer component performs the checkpoint, and the System Layer represents the interface between the process and the operating system. This model allows us to make the following distinction: messages generated by the Application module are considered data messages, while messages generated by the C/R module are called control messages. Moreover, the C/R module can intercept data messages and piggyback control data on them. The combination of control messages and control data is the control information used by the C/R protocol. When the checkpoint decision is made by the C/R module, we call it a forced checkpoint, reflecting the fact that it was taken due to the protocol. Otherwise, it is an independent checkpoint, reflecting the fact that it was taken due to local considerations only.

2.2. Definitions and notations

Definition 2.1.
The checkpoint overhead, denoted by o, is the increase in the execution time of a process p due to a single checkpoint.

Definition 2.2. The checkpoint latency, denoted by L, is the duration required to take a single checkpoint.

Clearly, o ≤ L in every checkpointing mechanism [34,38]. For example, in sequential checkpointing [28], a process suspends normal execution in order to save its state in stable storage, and thus o = L. In contrast, in forked checkpointing [5,26], the process forks a child process to save its state and continues normal execution concurrently, and thus o < L. This situation is illustrated in Fig. 2(b). For the purpose of evaluation, we further assume that every process takes an independent checkpoint every T units of time. However, the time between any two consecutive checkpoints might be less, because forced checkpoints can occur at arbitrary times. Note that, due to forced checkpoints, the checkpoint intervals of the same process are not necessarily of equal length; instead, intervals have arbitrary lengths. We call a checkpoint interval that contains a forced checkpoint an independent checkpoint interval; we may omit the word “independent” when the meaning is clear. Fig. 2(a) shows an example in which there is one forced checkpoint (C_{p,i−1}) between two consecutive independent checkpoints (C_{p,i−2} and C_{p,i}).
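To illustrate, under the stated assumptions (an independent checkpoint every T units of time, and on average F forced checkpoints in between at arbitrary instants), a quick simulation shows that the mean checkpoint-interval length approaches T/(F + 1), the quantity used later in Section 3. This is our own sketch with made-up parameter values, not code from the paper:

```python
import random

random.seed(1)
T = 10.0         # time between consecutive independent checkpoints
F = 1            # forced checkpoints per period T (illustrative value)
periods = 10000  # number of independent-checkpoint periods simulated

times = []
for k in range(periods):
    times.append(k * T)                              # independent checkpoint
    for _ in range(F):                               # forced checkpoints occur
        times.append(k * T + random.uniform(0, T))   # at arbitrary instants
times.sort()

intervals = [b - a for a, b in zip(times, times[1:])]
mean_len = sum(intervals) / len(intervals)
print(round(mean_len, 2))  # -> 5.0, i.e. T / (F + 1)
```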
Fig. 2. Possible events in an execution of process p.
Definition 2.3. The recovery overhead, denoted by r, is the length of time it takes to restart a process from a given local checkpoint. It depends on the restart mechanism but not on the checkpointing mechanism. We stress that r does not include the time during which the failed process was down.

When a failure occurs in a distributed system, we need to recover from a cut of checkpoints (i.e. a set of checkpoints consisting of one checkpoint from each process). However, not all cuts of checkpoints are consistent, i.e., correspond to a state that could have been reached in the execution. A consistent cut of checkpoints is called a recovery line.

Definition 2.4. A cut of checkpoints S is a recovery line if for each message received in S, the corresponding send event is also included in S.

We are now ready to explain the notion of k-rollback (first defined in [1]), which lies at the heart of our tools for evaluating distributed checkpointing protocols. It intuitively captures the maximal number of checkpoints that any process may need to roll back after a failure. We showed in [1] that most checkpointing protocols guarantee k-rollback executions for k ≥ 1.

Definition 2.5. A checkpoint C in an execution E is exploited if there is a recovery line R ∈ E such that C ∈ R.

Definition 2.6. Given two cuts of checkpoints S1 and S2, S1|p denotes the checkpoint of process p in S1. We say that S1 ≤ S2 if for every process p, either S1|p $\xrightarrow{hb}$ S2|p or S1|p = S2|p.

Definition 2.7. Given two checkpoints C_{p,i} and C_{p,j} of the same process p, the distance between them, denoted by dist(C_{p,i}, C_{p,j}), is |i − j|. The distance between two cuts of checkpoints S1 and S2, denoted by dist(S1, S2), is max_p {dist(S1|p, S2|p)}.

Definition 2.8. The cut of checkpoints derived from C_{p,i}, denoted by Cut(C_{p,i}), is the set of the latest checkpoints from each process that occurred at or before time t(C_{p,i}). This definition was initially introduced by Wang [39], who characterized the maximum and minimum recovery lines associated with a directed rollback-dependency graph.

Definition 2.9. The checkpoint interleaving level of an execution E, denoted by IL(E), is the minimal number l such that for all processes p, for all pairs of consecutive checkpoints (C_{p,i}, C_{p,i+1}), no process q ≠ p takes more than l checkpoints in the time interval [t(C_{p,i}), t(C_{p,i+1})).

Definition 2.10. An execution E is k-rollback for a given integer k ≥ 0, if for every checkpoint C_{p,i} there is a recovery line R ∈ E such that R ≤ Cut(C_{p,i}) and dist(R, Cut(C_{p,i})) ≤ k · IL(E). If there is no k such that E is k-rollback, then E is unbounded-rollback. The k-rollback class is the set of all k-rollback executions. In the following, we slightly abuse the terminology by omitting the word “class”.
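The distance notions of Definition 2.7 reduce to simple arithmetic on checkpoint sequence numbers. In the sketch below (our own illustration; the dictionary representation of a cut is an assumption, not the paper's notation), a cut maps each process to the sequence number of its checkpoint:

```python
def dist_checkpoints(i, j):
    """Distance between two checkpoints C_{p,i}, C_{p,j} of one process: |i - j|."""
    return abs(i - j)

def dist_cuts(s1, s2):
    """Distance between two cuts: the maximum, over all processes, of the
    distance between the processes' checkpoints in the two cuts."""
    assert s1.keys() == s2.keys(), "a cut has exactly one checkpoint per process"
    return max(dist_checkpoints(s1[p], s2[p]) for p in s1)

# Example: three processes; S2 lags S1 by at most two checkpoints (at P2).
S1 = {"P1": 5, "P2": 7, "P3": 4}
S2 = {"P1": 4, "P2": 5, "P3": 4}
print(dist_cuts(S1, S2))  # -> 2
```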
Notice that if two executions E1 and E2 are both k-rollback, it does not mean that the rollback distance in these executions is identical. Instead, since they are k-rollback, upon any recovery the rollback propagation in E1 is no more than l = k · IL(E1) and in E2 is no more than m = k · IL(E2), where it is possible that l ≠ m.

Many distributed checkpointing protocols produce control overhead [13], that is, overhead due to control information. Given a checkpointing protocol P, we use expctMsgsNum(P) to denote the expected number of control messages in a single checkpoint interval, and expctMsgsSize(P) to denote the expected total size of the control information for a single checkpoint interval. The expected control overhead of P is therefore M(P) = expctMsgsNum(P) · w_m + expctMsgsSize(P) · w_b, where w_m is the “setup” time for sending a message, and w_b is the additional per-bit delay associated with sending a message. The total checkpoint overhead, denoted by O, is the increase in the execution time of a process p because of a checkpoint C_{p,i} and the control overhead corresponding to C_{p,i}. In other words, O = o + M. If the control information is piggybacked on application messages, then the communication pattern of the execution determines the control overhead. Therefore, for some executions, we need to determine the message rate, as defined below, to compute the control overhead.

Definition 2.11. Given an execution E, the message rate of E, denoted by MR(E), is the expected number of data messages sent in a checkpoint interval I_{p,i} of E.

Given an execution E, we assume that there is a recovery mechanism that finds the most advanced recovery line in E. A recovery line R is said to be the most advanced recovery line if there is no other recovery line R′ such that R ≤ R′. Intuitively, if a process p has recovered from a checkpoint C_{p,i}, then upon a further failure, p can only recover from a checkpoint C_{p,j} for j ≥ i.
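The expected control overhead M(P) and the total checkpoint overhead O defined above are straightforward to compute; the sketch below uses made-up parameter values purely for illustration:

```python
def control_overhead(expct_msgs_num, expct_msgs_size, w_m, w_b):
    """M(P) = expctMsgsNum(P) * w_m + expctMsgsSize(P) * w_b, where w_m is the
    per-message setup time and w_b the per-bit delay."""
    return expct_msgs_num * w_m + expct_msgs_size * w_b

def total_checkpoint_overhead(o, M):
    """O = o + M: local checkpoint overhead plus control overhead."""
    return o + M

# Made-up values: 10 control messages carrying 1000 bits in total,
# 1 ms setup per message, 1 microsecond per bit, local overhead o = 0.5 s.
M = control_overhead(10, 1000, 1e-3, 1e-6)
O = total_checkpoint_overhead(0.5, M)
print(round(M, 3), round(O, 3))  # -> 0.011 0.511
```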
In this paper we only consider executions with recovery mechanisms that find the most advanced recovery line.

2.3. Assumptions and limitations

In this section we present and discuss the assumptions and limitations of our system model and analysis. In our system model, we assumed that failures occur in the system at a Poisson rate. Much previous work makes the same assumption [30,33,36]. Although no one has observed Poisson failure rates in practice, this assumption makes our results comparable to many other results. In addition, we assume that the failure rate is identical for all processes (denoted by λ) and that failures are independent. Several authors have suggested different failure distributions, such as some form of exponential or hyperexponential distribution [16,22,24]. However, here we pursue the above assumptions regarding the failure rate so as to be able to compare our results with other related approaches.

An important assumption concerns the checkpoint interleaving level. By the definition of k-rollback, some processes in an execution E that belongs to such a class might need to roll back k · IL(E) checkpoints. Ideally, it would have been nicer to have a definition that does not depend on the interleaving level. However, when checkpoints are taken in an uncoordinated manner, one process p may take an unbounded number of checkpoints during the same period that another process q takes only a couple of checkpoints. In particular, these extra checkpoints may be independent of the communication pattern, and for each checkpoint rolled back by p we might therefore have to roll back all the checkpoints taken by q between p’s latest checkpoints. By normalizing the definition with respect to IL(E), we capture the rollback propagation that is inherent to the communication pattern of the execution.
A simple way of keeping IL(E) small is to take checkpoints at approximately the same frequency in all processes, e.g., by using loosely synchronized clocks. Therefore, for this reason and to simplify our mathematical equations, we assume that IL(E) = 1 for an execution E.

In this paper we assume that the local checkpoint mechanism is the same for every protocol. Since we are concerned with distributed checkpointing protocols, we assume that a local checkpoint of any distributed checkpointing protocol has the same overhead o and latency L. However, we do care about the control overhead induced by messages that are created due to the checkpointing protocol; we assume that the local processing of those messages is identical in each process and for every protocol. Simply put, we ignore the time spent processing a control message in the C/R module (as presented in Fig. 1). In addition, we assume that the overhead due to the interaction between
Fig. 3. A possible execution of I p,i .
the Disk and the Checkpointer, as presented in Fig. 1, is identical, and for simplicity we simply include it in o and L. As we mentioned above, the time between two consecutive independent checkpoints is fixed. However, since we consider forced checkpoints between any two independent checkpoints, the length of an independent checkpoint interval is a random variable. Furthermore, as we show later, for any given execution, our analysis considers the expected length of the independent checkpoint interval. As a result, we may obtain approximate rather than exact results.

Regarding recovery, we first assume that the local recovery overhead of each process is identical and is denoted by r. Since we concentrate on distributed checkpointing protocols, we care about the rollback propagation during recovery more than about the value of r. Other work considers r more carefully and tries to characterize its effect [27,30]. Generally speaking, in this paper we fix the values of the local overhead, latency, and recovery to focus on the parameters that affect the distributed checkpointing protocols. That is, a local checkpoint/recovery mechanism must be used by any distributed protocol, and the local mechanism chosen is independent of the distributed protocol being employed. So the local performance parameters have an impact on the overall performance, but this impact stems from local operations and does not depend on the distributed protocol. In this work we concentrate on the impact of distribution and communication patterns on the overall performance of the distributed checkpointing protocol.

3. The overhead ratio

Consider an execution E ∈ k-rollback, and a process p ∈ E. Suppose that p is running in the checkpoint interval I_{p,i+1}, that is, it has taken the checkpoint C_{p,i} but not yet C_{p,i+1}. If a failure occurs during I_{p,i+1}, then p needs to recover from the newest exploited checkpoint.
By [1], since E ∈ k-rollback, the newest exploited checkpoint is no more than k · IL(E) checkpoints behind. Recall that T is the duration between the establishment of two consecutive independent checkpoints. Consider an execution E with a checkpointing protocol P, and use F(P) to denote the expected number of forced checkpoints that occur during T when there are no failures. Then, the expected length of a checkpoint interval I_{p,i} ∈ E is

$$\tau(P) = \frac{T}{F(P) + 1}.$$

Let τ′ be the expected duration of a checkpoint interval without the checkpoint and control overheads, namely, τ′ = τ − O. Fig. 2(a) depicts an example in which the expected number of forced checkpoints in a time interval T is 1 (F = 1), and thus the expected length of checkpoint interval I_{p,i} is τ = T/2. On the other hand, if one or more failures occur during I_{p,i}, then the expected time of I_{p,i} is more than τ. For instance, in Fig. 3, there is a failure during I_{p,i+1}. After a failure, process p must roll back to the latest exploited checkpoint, incurring r units of time due to recovery. Moreover, after the rollback, the L − o units of computation that were performed during the checkpoint latency must be performed again. This is necessary because the computation that happened concurrently with the checkpoint is not part of the saved data [30,38]. Therefore, in the presence of one failure, τ + r + (L − o) units of time are required to complete the checkpoint interval.

The overhead ratio was first defined by Ziv and Bruck [42] and used by Vaidya [38] for performance analysis in special cases. We extend their definition to our generalized model as follows:
Fig. 4. A Markov chain representing the execution of I_{p,i+1} in the 0-rollback class.
Definition 3.1. Consider an execution E ∈ k-rollback with a checkpointing protocol P. Let Γk(P) be the expected execution time of a checkpoint interval I_{p,i} ∈ E. The overhead ratio of P, denoted by v(k, P), is

$$v(k, P) = \frac{\Gamma_k(P) - \tau'(P)}{\tau'(P)} = \frac{\Gamma_k(P)}{\tau'(P)} - 1 = \frac{\Gamma_k(P)\,(F(P) + 1)}{T - O(P)\,(F(P) + 1)} - 1.$$
When the checkpointing protocol is clear from context, we write the overhead ratio simply as

$$v(k) = \frac{\Gamma_k\,(F + 1)}{T - O\,(F + 1)} - 1.$$

Notice that v(k) ≥ 0 for every k ≥ 0. Moreover, a smaller overhead ratio corresponds to a better execution. Actually, the overhead ratio is the ratio between the total C/R overhead and the computation time in a checkpoint interval. Therefore, if v(k) ≥ 1, the total C/R overhead exceeds the computation time in a checkpoint interval. Recall that T is a constant and F (the expected number of forced checkpoints during T) can be computed by either theoretical analysis or experimental work [6]. Therefore, to compute the overhead ratio, we need to compute Γk, which depends on the k-rollback class to which the execution belongs. The rest of the calculations appear in Sections 3.1 and 3.2. Section 3.3 then utilizes these computations to draw some numerical results, and in particular to establish that there is an optimal value for T.
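A minimal sketch of the overhead-ratio computation (the numbers are our own illustrative choices; Γk itself is computed in Sections 3.1 and 3.2):

```python
def expected_interval(T, F):
    """tau(P) = T / (F(P) + 1): expected checkpoint-interval length."""
    return T / (F + 1)

def overhead_ratio(gamma_k, F, T, O):
    """v(k) = Gamma_k * (F + 1) / (T - O * (F + 1)) - 1."""
    return gamma_k * (F + 1) / (T - O * (F + 1)) - 1

# Made-up numbers: T = 100, one expected forced checkpoint (F = 1),
# total per-checkpoint overhead O = 2.
T, F, O = 100.0, 1, 2.0
tau = expected_interval(T, F)                    # 50.0
# Failure-free best case: every interval takes exactly tau, so only the
# overhead contributes and v = T / (T - O*(F+1)) - 1.
print(round(overhead_ratio(tau, F, T, O), 6))    # -> 0.041667
print(overhead_ratio(tau, F, T, 0.0))            # -> 0.0
```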
3.1. Computing Γk using Markov chains

We compute Γk by constructing a finite-state Markov chain [33] for the k-rollback class of executions. As we mentioned in Section 2.3, we assume that IL(E) = 1 for an execution E. This assumption agrees with realistic executions in which the checkpoint interleaving is normalized. In practice, it is justified by the fact that many distributed checkpointing protocols implicitly ensure that IL(E) = 1. Notice that this assumption does not affect our results, but simplifies their presentation [1]. We use algebraic manipulation to extract Γk from the Markov chain. The Markov chain has a unique start state and a unique sink state. The number of states in the Markov chain depends on the k-rollback class. To make the presentation more methodical, we start by computing Γ0, followed by Γ1, Γ2, Γ3, and eventually the general case Γk. We proceed this way since only Γ3 exhibits all the components of the general case Γk.

3.1.1. Computing Γ0

Consider process p running in I_{p,i+1} of a 0-rollback execution. Γ0 can be computed using the 3-state Markov chain presented in Fig. 4 [34,42]. Process p starts the interval in the start state i (related to the checkpoint C_{p,i}). A transition from state i to the sink state i + 1 occurs if I_{p,i+1} is completed without failures. If a failure occurs during I_{p,i+1}, then p recovers from C_{p,i}; in that case, we have a transition from state i to state R_i. After state R_i is entered, a transition is made to state i + 1 if no further failure occurs in I_{p,i+1} after the recovery. Otherwise, a transition is made from state R_i to itself.

Let M be a Markov chain that represents the interval I_{p,i+1} in a k-rollback execution. Let s, t be states in M such that there is a transition from s to t in M. We use P_{s,t} to denote the probability of the transition from s to t and W_{s,t} to denote the expected execution time spent in state s before the move to state t.
A state s ∈ M is called a recovery state if a transition to it is due to a failure. For instance, in Fig. 4, only state R_i is a recovery state. Furthermore, for each state s ∈ M, we define the variable X_s to be the expected cost of reaching the sink state i + 1 from state s. Actually,
Fig. 5. A Markov chain representing the execution of I_{p,i+1} in the 1-rollback class.
X_s equals the expected cost of all possible paths in M from state s to state i + 1, weighted by their probabilities. Therefore, Γk = X_i. In Fig. 4, since there are two possible paths from state i to state i + 1, X_i = P_{i,i+1} W_{i,i+1} + P_{i,R_i}(W_{i,R_i} + X_{R_i}). Therefore, we have the following two linear equations:

$$X_i = P_{i,i+1} W_{i,i+1} + P_{i,R_i}(W_{i,R_i} + X_{R_i})$$

and

$$X_{R_i} = P_{R_i,R_i}(W_{R_i,R_i} + X_{R_i}) + P_{R_i,i+1} W_{R_i,i+1}.$$

After solving these two linear equations and substituting $P_{R_i,i+1} = 1 - P_{R_i,R_i}$, we have

$$\Gamma_0 = X_i = P_{i,i+1} W_{i,i+1} + P_{i,R_i}\left(W_{i,R_i} + \frac{P_{R_i,R_i} W_{R_i,R_i}}{1 - P_{R_i,R_i}} + W_{R_i,i+1}\right).$$

3.1.2. Computing Γ1

Given an execution E ∈ 1-rollback, if a failure occurs in process p during I_{p,i+1}, then p can roll back either to C_{p,i} or to C_{p,i−1} [1]. Therefore, in the Markov chain for Γ1 presented in Fig. 5, we add a new state R_{i−1} to represent the possibility of recovering from C_{p,i−1}. A transition to state R_i represents a recovery made from C_{p,i}. Since we consider a recovery mechanism that finds the most recent recovery line, if there is an additional failure after entering state R_i, then the recovery will again be made from C_{p,i}. On the other hand, a transition to state R_{i−1} represents a recovery made from C_{p,i−1}; in that case, process p is rolled back to re-execute the checkpoint interval I_{p,i}. After state R_{i−1} is entered, a transition is made to state i after I_{p,i} is completed. When a further failure occurs, a transition is made from state R_{i−1} to itself. We compute the variable X_i by solving the following three linear equations:

$$X_i = P_{i,i+1} W_{i,i+1} + P_{i,R_i}(W_{i,R_i} + X_{R_i}) + P_{i,R_{i-1}}(W_{i,R_{i-1}} + X_{R_{i-1}}),$$

$$X_{R_i} = P_{R_i,R_i}(W_{R_i,R_i} + X_{R_i}) + P_{R_i,i+1} W_{R_i,i+1},$$
and
$$X_{R_{i-1}} = P_{R_{i-1},R_{i-1}}(W_{R_{i-1},R_{i-1}} + X_{R_{i-1}}) + P_{R_{i-1},i}(W_{R_{i-1},i} + X_i).$$

Obviously, these linear equations can be represented by a linear system Ax = b, where A is a 3 × 3 matrix and x, b have 3 entries. Substituting $P_{R_i,i+1} = 1 - P_{R_i,R_i}$ and $P_{R_{i-1},R_{i-1}} = 1 - P_{R_{i-1},i}$, we get

$$A = \begin{pmatrix} 1 & -P_{i,R_i} & -P_{i,R_{i-1}} \\ 0 & P_{R_i,i+1} & 0 \\ -P_{R_{i-1},i} & 0 & P_{R_{i-1},i} \end{pmatrix}, \qquad x = \begin{pmatrix} X_i \\ X_{R_i} \\ X_{R_{i-1}} \end{pmatrix},$$

$$b = \begin{pmatrix} P_{i,i+1} W_{i,i+1} + P_{i,R_i} W_{i,R_i} + P_{i,R_{i-1}} W_{i,R_{i-1}} \\ P_{R_i,R_i} W_{R_i,R_i} + P_{R_i,i+1} W_{R_i,i+1} \\ P_{R_{i-1},R_{i-1}} W_{R_{i-1},R_{i-1}} + P_{R_{i-1},i} W_{R_{i-1},i} \end{pmatrix}.$$

The solution is

$$\Gamma_1 = X_i = \frac{1}{1 - P_{i,R_{i-1}}}\left[ P_{i,i+1} W_{i,i+1} + P_{i,R_i}\left(W_{i,R_i} + \frac{P_{R_i,R_i} W_{R_i,R_i}}{1 - P_{R_i,R_i}} + W_{R_i,i+1}\right) + P_{i,R_{i-1}}\left(W_{i,R_{i-1}} + \frac{P_{R_{i-1},R_{i-1}} W_{R_{i-1},R_{i-1}}}{1 - P_{R_{i-1},R_{i-1}}} + W_{R_{i-1},i}\right) \right].$$
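As a numeric sanity check, the closed-form expression for Γ1 can be compared against a direct solve of the 3 × 3 system above. The transition probabilities and expected costs below are made up for illustration (the paper derives the actual values in Section 3.2):

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (the systems here are tiny)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Illustrative (made-up) transition probabilities and expected costs.
P_i_next, P_i_Ri, P_i_Rim1 = 0.8, 0.15, 0.05  # from state i (sum to 1)
P_Ri_Ri = 0.1                                 # self-loop of R_i
P_Rim1_Rim1 = 0.2                             # self-loop of R_{i-1}
W_i_next, W_i_Ri, W_i_Rim1 = 5.0, 2.0, 2.5
W_Ri_Ri, W_Ri_next = 3.0, 4.0
W_Rim1_Rim1, W_Rim1_i = 3.5, 4.5

# The 3x3 system of Section 3.1.2 (unknowns X_i, X_{R_i}, X_{R_{i-1}}).
A = [[1.0,                 -P_i_Ri,      -P_i_Rim1],
     [0.0,                 1 - P_Ri_Ri,  0.0],
     [-(1 - P_Rim1_Rim1),  0.0,          1 - P_Rim1_Rim1]]
b = [P_i_next * W_i_next + P_i_Ri * W_i_Ri + P_i_Rim1 * W_i_Rim1,
     P_Ri_Ri * W_Ri_Ri + (1 - P_Ri_Ri) * W_Ri_next,
     P_Rim1_Rim1 * W_Rim1_Rim1 + (1 - P_Rim1_Rim1) * W_Rim1_i]
gamma1_system = solve(A, b)[0]

# Closed form derived in the text.
X_Ri = P_Ri_Ri * W_Ri_Ri / (1 - P_Ri_Ri) + W_Ri_next
tail = P_Rim1_Rim1 * W_Rim1_Rim1 / (1 - P_Rim1_Rim1) + W_Rim1_i
gamma1_closed = (P_i_next * W_i_next + P_i_Ri * (W_i_Ri + X_Ri)
                 + P_i_Rim1 * (W_i_Rim1 + tail)) / (1 - P_i_Rim1)

assert abs(gamma1_system - gamma1_closed) < 1e-9
print(round(gamma1_closed, 4))  # -> 5.625
```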
Fig. 6. A Markov chain representing the execution of I_{p,i+1} in the 2-rollback class.
Fig. 7. The values of the linear system Ax = b for obtaining Γ2 .
3.1.3. Computing Γ2

Given an execution E ∈ 2-rollback, if a failure occurs in process p during I_{p,i+1}, then the rollback could be from C_{p,i−2}. Therefore, the Markov chain of 2-rollback for computing Γ2 should contain transitions from state i to the recovery states R_i, R_{i−1}, and R_{i−2}. Such a Markov chain is presented in Fig. 6. The states i and i_1 in Fig. 6 represent the same state of the execution, in which process p runs in I_{p,i+1}. However, we use two different states in the Markov chain to capture the situation in which process p has rolled back to C_{p,i−1}, so that it will not roll back to C_{p,i−2}. In that case, there is no transition path from state R_{i−1} to state R_{i−2} (see Fig. 6). The probabilities and costs from states i and i_1 to state i + 1 are identical; thus, $P_{i,i+1} = P_{i_1,i+1}$ and $W_{i,i+1} = W_{i_1,i+1}$. As in the previous cases, to compute Γ2, we first build the corresponding linear system Ax = b and then solve it to extract X_i. From Fig. 6, we have

$$X_i = P_{i,i+1} W_{i,i+1} + P_{i,R_i}(W_{i,R_i} + X_{R_i}) + P_{i,R_{i-1}}(W_{i,R_{i-1}} + X_{R_{i-1}}) + P_{i,R_{i-2}}(W_{i,R_{i-2}} + X_{R_{i-2}}),$$
$$X_{i-1} = P_{i-1,i}(W_{i-1,i} + X_i) + P_{i-1,R_{i-1}}(W_{i-1,R_{i-1}} + X_{R_{i-1}}) + P_{i-1,R_{i-2}}(W_{i-1,R_{i-2}} + X_{R_{i-2}}),$$
$$X_{i_1} = P_{i,i+1} W_{i,i+1} + P_{i_1,R_i}(W_{i_1,R_i} + X_{R_i}) + P_{i_1,R_{i-1}}(W_{i_1,R_{i-1}} + X_{R_{i-1}}),$$
$$X_{R_i} = P_{R_i,R_i}(W_{R_i,R_i} + X_{R_i}) + P_{R_i,i+1} W_{R_i,i+1},$$
$$X_{R_{i-1}} = P_{R_{i-1},R_{i-1}}(W_{R_{i-1},R_{i-1}} + X_{R_{i-1}}) + P_{R_{i-1},i}(W_{R_{i-1},i} + X_{i_1}),$$

and

$$X_{R_{i-2}} = P_{R_{i-2},R_{i-2}}(W_{R_{i-2},R_{i-2}} + X_{R_{i-2}}) + P_{R_{i-2},i-1}(W_{R_{i-2},i-1} + X_{i-1}).$$
After some algebra, arranging the variables on the left side of each equation, we can represent these equations by the linear system Ax = b presented in Fig. 7. Notice that in A the first rows and columns include the coefficients of the equations corresponding to the non-recovery states, followed by the coefficients of the recovery states.

3.1.4. Computing Γk

Fig. 8 presents the Markov chain of the k-rollback class. The chain has m = (k + 2)(k + 1)/2 states in addition to the sink state i + 1. Therefore, the matrix A of the linear system Ax = b is of size m × m, and the vectors x and b have m entries.
Fig. 8. A Markov chain representing the execution of I_{p,i+1} in the k-rollback class.
Let M be a Markov chain that represents the execution of I_{p,i+1} in the k-rollback class, and consider the matrix A of the linear system Ax = b corresponding to M. As mentioned before, the first rows and columns of A contain the coefficient values corresponding to the non-recovery states, and the following rows and columns contain the values corresponding to the recovery states. More precisely, the coefficient values in A are arranged with respect to the following ordering of the vector x = ⟨X_i, X_{i−1}, ..., X_{i−k+1}, X_{i_1}, X_{(i−1)_1}, ..., X_{(i−k+1)_1}, X_{i_2}, ..., X_{R_i}, X_{R_{i−1}}, ..., X_{R_{i−k}}⟩. Given two states s, t ∈ M such that s, t ≠ i+1, the entry A(s)(t) contains the coefficient of the expected cost variable X_t in the linear equation derived for the expected cost variable X_s. The values of A can be determined from M by the following rules:
• If there is no transition from state s to state t in M, then A(s)(t) = 0.
• If s ≠ t and there is a transition from state s to state t in M, then A(s)(t) = −P_{s,t}.
• For s = t, if state s is a recovery state R_{i−l}, then A(s)(s) = P_{R_{i−l},i+1−l}. Otherwise, A(s)(s) = 1.

3.2. Computing the overhead ratio

Notice that in this paper we solve the Markov models by developing the mathematical equations and solving them directly. Although there are many tools dedicated to solving such models, such as the Möbius tool [12], we prefer to solve our models manually. The main reason is that our Markov chains do not have a large number of states, so they can easily be solved from their mathematical equations.

Recall that to compute v(k), we first need to compute Γ_k. Given the discussion in Section 3.1, we must first determine P_{s,t} and W_{s,t} for any pair of connected states s and t in the corresponding Markov chain M. A transition from state i to state i+1 occurs if I_{p,i+1} is completed without a failure. If this transition is made, then τ units of time are spent, where τ = T/(F+1); that is, W_{i,i+1} = τ.
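The construction rules above can be sketched in code. The following is a minimal illustration (the function names, state labels, and transition-table encoding are ours, not from the paper): it assembles A and b from a table {(s, t): (P_{s,t}, W_{s,t})} and solves the system with textbook Gaussian elimination, here on a toy two-state chain with one execution state i and one self-looping recovery state R_i.

```python
def build_system(states, trans, sink):
    """Assemble A, b for the equations X_s = sum_t P[s,t] * (W[s,t] + X_t),
    with X_sink = 0.  Moving the X_t terms left gives the rules of the text:
    1 on the diagonal (1 - P[s,s] for a self-looping recovery state) and
    -P[s,t] off the diagonal."""
    idx = {s: r for r, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    b = [0.0] * n
    for (s, t), (P, W) in trans.items():
        b[idx[s]] += P * W                # expected one-step cost out of s
        if t != sink:
            A[idx[s]][idx[t]] -= P        # a self-loop leaves 1 - P[s,s]
    return A, b

def solve(A, b):
    """Gaussian elimination with partial pivoting; returns x with Ax = b."""
    n = len(A)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Toy chain: state 'i' completes (prob 0.9, cost 10) or fails into 'Ri'
# (prob 0.1, cost 4); 'Ri' loops on itself (prob 0.2, cost 6) or reaches
# the sink 'i+1' (prob 0.8, cost 14).
trans = {('i', 'i+1'): (0.9, 10.0), ('i', 'Ri'): (0.1, 4.0),
         ('Ri', 'Ri'): (0.2, 6.0), ('Ri', 'i+1'): (0.8, 14.0)}
x = solve(*build_system(['i', 'Ri'], trans, 'i+1'))
```

For this toy chain the closed form gives X_{R_i} = (0.2·6 + 0.8·14)/0.8 = 15.5 and X_i = 0.9·10 + 0.1·(4 + 15.5) = 10.95, which the solver reproduces.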
We assumed that failures are governed by a Poisson process with rate λ. The probability that there is no failure during τ units of time is P_{i,i+1} = e^{−λτ}. Since for every l, 0 ≤ l ≤ k−1, the probability and expected cost of transiting from state i−l to state i−l+1 are equal, we have

P_{i−l,i−l+1} = e^{−λτ} and W_{i−l,i−l+1} = τ, where 0 ≤ l ≤ k−1.    (1)
Notice here that τ is not a fixed value, but the expected length of a checkpoint interval. Therefore, given a checkpointing protocol for which we can determine the value of τ, our analysis yields an approximation, not exact results.

On the other hand, if a failure occurs during I_{p,i+1}, then the rollback is made to the newest exploited checkpoint C_{p,i−l} for some 0 ≤ l ≤ k. Therefore, a transition is made from state i to one of the states {R_{i−l} | 0 ≤ l ≤ k}. The probability of such a transition is equal to 1 − P_{i,i+1}. We denote P_{i,R_{i−l}} = µ_l(1 − P_{i,i+1}) for 0 ≤ l ≤ k, with Σ_{l=0}^{k} µ_l = 1, where the value of µ_l depends on the communication pattern and the checkpointing protocol. The cost of this transition, W_{i,R_{i−l}}, is the expected execution time of I_{p,i+1} until a failure occurs. Given that a failure occurs in the interval [0, τ) during the execution of I_{p,i+1}, the time to failure (TTF) is a random variable x in the interval [0, τ) [33]. Moreover, its probability density function (PDF) is λe^{−λx} for 0 ≤ x < τ, and its conditional density function, given the transition to R_{i−l}, is f_l(x) = λe^{−λx}/(1 − e^{−λτ}), where 0 ≤ x < τ and 0 ≤ l ≤ k. This implies that

W_{i,R_{i−l}} = ∫_0^τ x · f_l(x) dx = 1/λ − τe^{−λτ}/(1 − e^{−λτ}), where 0 ≤ l ≤ k.
Therefore, for every m, 0 ≤ m ≤ k−1, the probability and expected cost of the transition from state i−m to state R_{i−m−l} are

P_{i−m,R_{i−m−l}} = µ_l[1 − e^{−λτ}], with Σ_{l=0}^{k−m} µ_l = 1, and W_{i−m,R_{i−m−l}} = 1/λ − τe^{−λτ}/(1 − e^{−λτ}), where 0 ≤ l ≤ k−m.    (2)
For every l, 0 ≤ l ≤ k, after state R_{i−l} of Fig. 8 is entered, a transition to state i+1−l is made if no further failure occurs before I_{p,i+1−l} is completed. The execution time required to reach state i+1−l is W_{R_{i−l},i+1−l} = τ + r + L − o, and the probability that no additional failure occurs during that time is P_{R_{i−l},i+1−l} = e^{−λ(τ+r+L−o)}. It follows that

P_{R_{i−l},i+1−l} = e^{−λ(τ+r+L−o)} and W_{R_{i−l},i+1−l} = τ + r + L − o, where 0 ≤ l ≤ k.    (3)
If another failure occurs after state R_{i−l}, 0 ≤ l ≤ k, a transition is made from state R_{i−l} to itself. For this transition we have P_{R_{i−l},R_{i−l}} = 1 − P_{R_{i−l},i+1−l} = 1 − e^{−λ(τ+r+L−o)}. As discussed above, W_{R_{i−l},R_{i−l}} can be obtained in much the same way as W_{i,R_i}: the PDF of the TTF is λe^{−λx} for x ∈ [0, τ+r+L−o). Thus we have

P_{R_{i−l},R_{i−l}} = 1 − e^{−λ(τ+r+L−o)} and W_{R_{i−l},R_{i−l}} = 1/λ − (τ+r+L−o)e^{−λ(τ+r+L−o)}/(1 − e^{−λ(τ+r+L−o)}), where 0 ≤ l ≤ k.    (4)
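Eqs. (1)–(4) are straightforward to evaluate numerically. The sketch below (the function names are ours) computes the transition probabilities and expected costs from λ, τ, r, L, and o; the helper implements the truncated-exponential mean 1/λ − h·e^{−λh}/(1 − e^{−λh}) shared by Eqs. (2) and (4).

```python
import math

def conditional_mean_ttf(lam, h):
    """E[TTF | TTF < h] for an exponential(lam) failure time (Eqs. (2), (4))."""
    return 1.0 / lam - h * math.exp(-lam * h) / (1.0 - math.exp(-lam * h))

def eq1(lam, tau):
    """Eq. (1): the checkpoint interval completes without a failure."""
    return math.exp(-lam * tau), tau

def eq2(lam, tau, mu_l):
    """Eq. (2): failure during the interval; rollback target weighted by mu_l."""
    return mu_l * (1.0 - math.exp(-lam * tau)), conditional_mean_ttf(lam, tau)

def eq3(lam, tau, r, L, o):
    """Eq. (3): recovery state R_{i-l} reaches state i+1-l without a new failure."""
    h = tau + r + L - o
    return math.exp(-lam * h), h

def eq4(lam, tau, r, L, o):
    """Eq. (4): another failure occurs; self-loop on the recovery state."""
    h = tau + r + L - o
    return 1.0 - math.exp(-lam * h), conditional_mean_ttf(lam, h)
```

With the SMALL parameters of Section 3.3 (λ = 1.23·10⁻⁶, o = 1.78, L = 4.292, r = 3.32) and τ = 1024, eq1 gives P_{i,i+1} ≈ 0.9987, and the probabilities of Eqs. (3) and (4) sum to one, as they must.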
From Eqs. (1)–(4), we can extract the values of A and b in the linear system Ax = b. Then, by using any known technique for solving linear systems, we can obtain the expected cost variable X_i. Lastly, by substituting Γ_k in the equation for v(k), we obtain the value of the overhead ratio.

3.3. Numerical results

To compute the overhead ratio v(k) of an execution E ∈ k-rollback, we first need to set the following parameters: k, T, F, O, r, L, o, M, w_m, w_b, and λ. T is the only parameter that can be specified by the user, while the other parameters are obtained from the execution E and the environment where E is executed. As we will show later, the parameters o, L, and r depend on the local checkpointing mechanism, but not on the distributed checkpointing protocol; therefore, their values can be obtained from an actual implementation. F is the only parameter that we need to calculate based on the execution of the distributed checkpointing protocol. Although F may depend on the communication pattern, we used real results from [6] to estimate the value of F.

Since T is specified by the user, we would like to use the optimum value of T, i.e., the one that yields the minimum overhead ratio. In this section we present numerical results to determine the optimum value of T for a given set of parameter values. We report results from two computing environments: SMALL and BIG. In both environments we use seconds as the time unit. SMALL represents small, short-running applications executed in the Starfish system [3]. In previous experimental work, presented in [4], we ran a matrix multiplication application in Starfish with checkpointing. From that work we obtained the following parameters: o = 1.78, L = 4.292, and r = 3.32. BIG represents a long-running distributed application; in it, applications run for several days and produce large checkpoint files.
For BIG, we use the simulation and experimental work performed by Plank and Thomason [30], from which we obtained the following parameters: o = 316.7, L = 5375.5, and r = 5375.5. Both environments use a Fast Ethernet network (100 Mb/s); hence, w_m = 0.000276 and w_b = 10^{−8}. As mentioned before, in both environments we assume that the failure rate of a single process is λ = 1.23 · 10^{−6} [30,38]. In addition, for a k-rollback execution E and process p ∈ E, we first assume equal probabilities of recovering from the possible checkpoints; that is, the probabilities of rolling back to any of the C_{p,i−j},
Fig. 9. SMALL and BIG environments. For every v(k) there is an optimum T .
Fig. 10. The overhead ratio vs. T with different values of k.
1 ≤ j < k, are the same. We also consider decreasing probabilities of recovering from the possible checkpoints; that is, if process p fails during the checkpoint interval I_{p,i+1}, then the probability of recovering p from C_{p,j} is twice as great as the probability of recovering from C_{p,j−1} for i−k < j ≤ i (given that i ≥ k).

Fig. 9 depicts v(k) vs. T in SMALL and BIG, where the number of forced checkpoints in an interval T is F = 0, the number of nodes is n = 8, and the control overhead is M = w_m + w_b (we discuss the impact of this choice of values later). The figure shows that for every value of k there is an optimum T that achieves the minimum v(k). Notice that since BIG has long-running applications, the optimum T values, as presented in Fig. 9(b), are larger than in SMALL, as presented in Fig. 9(a). Moreover, in both environments we find that v(k) is monotonic in k, and for a larger k the minimum value of v(k) is obtained at a smaller T. The reason is that with a larger k, the execution loses fewer computation events when T is smaller. Although we try to ensure that T ≥ L, notice that in Fig. 9(b) T may be slightly smaller than L; such values of T do not affect the results. We can see that in this case the optimal values of T are close to L.

Fig. 10 depicts the value of v(k) vs. T as in Fig. 9(a), but with decreasing probabilities of recovering. Here, too, there is an optimum T for every value of k. However, v(k) increases only slowly with k, and as can be seen, v(16) is almost the same as v(25).
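The two weighting schemes can be made concrete. In the sketch below (the function name is ours), 'equal' gives each of the k+1 reachable checkpoints the same probability, while 'decreasing' halves the probability with each older checkpoint, as in Fig. 10.

```python
def rollback_weights(k, scheme="equal"):
    """mu_0..mu_k: probability of rolling back to C_{p,i-l} after a failure.
    'equal'      -> uniform over the k+1 candidate checkpoints;
    'decreasing' -> each older checkpoint is half as likely as the next-newer one."""
    raw = [1.0] * (k + 1) if scheme == "equal" else [2.0 ** -l for l in range(k + 1)]
    total = sum(raw)
    return [w / total for w in raw]
```

Either way the weights sum to one, matching the constraint Σ µ_l = 1 of Eq. (2).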
357
Fig. 11. The overhead ratio vs. T with different values of λ.
Fig. 12. The overhead ratio behaves differently with different values of F.
Consider Figs. 9(a) and 10. For every k, v(k) attains its minimum value in the interval [0, 500] of T. In that interval the value of v(k) drops exponentially as T increases. The reason is that o and L are small in SMALL, so we need a longer checkpoint interval to perform useful computation and thus achieve a better overhead ratio. However, since with large values of T we can lose computation events due to rollback, in the interval (500, ∞) the overhead ratio becomes much worse as T grows (particularly in Fig. 9(a), due to equal probabilities). Therefore, there is a value of T for which (1) the loss of overhead ratio due to the checkpoint overhead and latency and (2) the loss due to rollback propagation are similar. That value of T yields the minimal overhead ratio.

Fig. 11 shows the trade-off between the optimum values of T and λ. Clearly, the overhead ratio increases as the failure rate increases. Thus, to limit the loss of useful computation with large values of λ, we need to set smaller values of T.

Fig. 12 considers the effect of F (the number of forced checkpoints in a time interval T), with different values of T, on v(k). It shows that the optimum value of T scales with F. The overhead ratio with F = 2 and T = 3000 is equivalent to the overhead ratio with F = 0 and T = 1000. In general, the overhead ratio with F = f and T = t is equivalent to the overhead ratio with F = 0 and T = t/(f+1). Therefore, there is no need to present the overhead ratio with F > 2.
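This equivalence is immediate from the definition τ = T/(F+1): the F forced checkpoints split the user interval T into F+1 pieces of equal expected length, so only τ matters. A one-line sketch (the function name is ours):

```python
def tau(T, F):
    """Expected checkpoint-interval length: F forced checkpoints split T into F+1 pieces."""
    return T / (F + 1)

# F = 2 with T = 3000 yields the same interval length as F = 0 with T = 1000.
assert tau(3000, 2) == tau(1000, 0) == 1000.0
```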
Fig. 13. The overhead ratio behaves differently with different values of n.
On the other hand, Fig. 13 shows that the overhead ratio increases with the number of processes (n). Under the assumption that process failures occur randomly and independently with probability ρ, the probability of a failure in a system with n processes is 1 − (1 − ρ)^n. Therefore, the overhead ratio grows with n.

4. Performance analysis of checkpointing protocols

Most checkpointing protocols are designed to guarantee recovery lines. Another characterization of recovery lines is based on the notions of zigzag-path and zigzag-cycle introduced by Netzer and Xu [20].

Definition 4.1. A zigzag path (also called Z-path) from C_{p,i} to C_{q,j} (denoted by C_{p,i} ⇝ C_{q,j}) is a sequence of messages (m_1, m_2, ..., m_l), l ≥ 1, such that
1. m_1 is sent by process p after C_{p,i},
2. if m_k (1 ≤ k < l) is received by process r, then m_{k+1} is sent by r in the same or a later checkpoint interval (m_{k+1} may be sent before m_k is received), and
3. m_l is received by process q before C_{q,j} is completed.

In particular, a Z-path can be from a checkpoint to itself, in which case it is called a Z-cycle, and the checkpoint is said to be in a Z-cycle. Moreover, Netzer and Xu [20] proved that a cut of checkpoints S is a recovery line iff there is no Z-path between any two checkpoints in S.

Manivannan and Singhal [18] classified executions by the degree to which the creation of various Z-paths is prevented, and identified the following classes: strictly Z-path free (SZPF) includes executions in which there are no non-causal Z-paths between any two checkpoints; Z-path free (ZPF) includes executions in which for every Z-path there is a corresponding causal path; and Z-cycle free (ZCF) includes executions in which no checkpoint is inside a Z-cycle. They also showed that SZPF ⊆ ZPF ⊆ ZCF. Furthermore, we showed in [1] that ZPF ⊂ 1-rollback and ZCF = 1-BC ⊂ (n−1)-rollback.

4.1. Distributed checkpointing protocols

In this section we present a brief description of a set of distributed checkpointing protocols. We try to consider as many checkpointing protocols as possible. Since we obviously cannot address all checkpointing protocols, we chose protocols with different characteristics that represent a large number of checkpointing protocols. For each protocol, we give a brief description and calculate the parameters relevant to our overhead-ratio measure.
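Definition 4.1 translates directly into a small search. The sketch below (the names and the interval-index encoding are ours, not from the paper) records each message with the checkpoint interval in which it is sent and received (interval i+1 follows checkpoint C_{p,i}), and searches for a message chain satisfying the three conditions:

```python
from collections import namedtuple

# A message with its send/receive process and checkpoint-interval indices;
# interval i+1 is the interval that follows checkpoint C_{p,i}.
Msg = namedtuple("Msg", "sender send_iv receiver recv_iv")

def zpath_exists(msgs, p, i, q, j):
    """Is there a Z-path C_{p,i} ~> C_{q,j} (Definition 4.1)?  Depth-first
    search over chains: each next message must leave the process that
    received the previous one, in the same or a later checkpoint interval."""
    stack = [m for m in msgs if m.sender == p and m.send_iv > i]   # condition 1
    seen = set()
    while stack:
        m = stack.pop()
        if m in seen:
            continue
        seen.add(m)
        if m.receiver == q and m.recv_iv <= j:                     # condition 3
            return True
        stack.extend(m2 for m2 in msgs                             # condition 2
                     if m2.sender == m.receiver and m2.send_iv >= m.recv_iv)
    return False

# C_{0,1} lies in a Z-cycle: m1 leaves process 0 after C_{0,1}, and m2 is sent
# by process 1 in the interval in which m1 arrives (possibly before m1 arrives)
# and is received by process 0 before C_{0,1} is completed.
m1, m2 = Msg(0, 2, 1, 1), Msg(1, 1, 0, 1)
assert zpath_exists([m1, m2], 0, 1, 0, 1)
```

Note that condition 2 is checked purely on interval indices, which is exactly what makes non-causal Z-paths possible: m_{k+1} may be sent before m_k is received.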
4.1.1. Sync-and-Stop (SaS) [25]
SaS is a coordinated checkpointing protocol in which a coordinator invokes barrier synchronization among all processes, and then all the processes take a global checkpoint, which is also a recovery line. Specifically, the protocol works as follows:
1. The coordinator broadcasts a special message, chkpt_request, to all the other processes.
2. Upon receiving chkpt_request, each process stops running its target program and sends the message chkpt_ready to the coordinator.
3. When the coordinator collects chkpt_ready messages from all other processes, it broadcasts the message chkpt_do and takes a local checkpoint.
4. Upon receiving chkpt_do, each process takes a checkpoint and sends back chkpt_done.
5. When the coordinator collects chkpt_done messages from all other processes, it broadcasts chkpt_commit to resume the application.

Obviously, the collection of checkpoints is a recovery line; thus, SaS ∈ 1-rollback [1]. In this protocol there are no forced checkpoints; therefore, the number of forced checkpoints in an interval T is F(SaS) = 0. Regarding the control overhead, in each round of SaS the coordinator broadcasts three messages, and each of the other n−1 processes sends two reply messages; each control message needs 8 bits. Therefore, the control overhead is M(SaS) = 5(n−1)(w_m + 8·w_b).

4.1.2. Chandy–Lamport (C–L) [11]
C–L is a coordinated checkpointing protocol that does not need to block the application execution. In C–L, the coordinator initiates a checkpoint by broadcasting marker messages and then takes a checkpoint. When process p receives a marker from a channel c, it acts as follows:
• If p has not yet taken a checkpoint, it broadcasts the marker, takes a checkpoint, and records the state of c as being empty.
• Otherwise, p records the state of the channel c as the sequence of messages received along c after it took its checkpoint and before it received the marker.
C–L belongs to 1-rollback, and since there are no forced checkpoints, F(C–L) = 0. In a fully connected network with n nodes, C–L generates 2n(n−1) messages per checkpoint [25]. Also, as the marker is assumed to be 8 bits, the overall control overhead is M(C–L) = 2n(n−1)(w_m + 8·w_b).

4.1.3. Baldoni, Helary, Mostefaoui and Raynal (BHMR) [7]
The BHMR protocol ensures rollback-dependency trackability (as defined in [7]); each process p piggybacks its dependency vector, an additional Boolean vector, and a Boolean matrix on each application message. Using the piggybacked information, each process takes forced checkpoints to break every non-causal message chain, so that an SZPF execution is guaranteed. Since the BHMR protocol is SZPF, we know that BHMR ∈ 1-rollback. Since each message carries one Boolean matrix of n² bits, one vector of n bits, and one vector of 32n bits, the control overhead is M(BHMR) = MR(E)((n² + 33n)w_b + ε), where ε is the delay due to the interception of each data message. In addition, we assume that the number of forced checkpoints during an interval T is F(BHMR) = c · MR(E) for 0 ≤ c ≤ 1. In the measurements below, c = 0.1.

4.1.4. Fixed-Dependency-Interval (FDI) [40]
In FDI, when a process p sends an application message, it piggybacks its dependency vector D_p. On the other hand, when a process q receives a message M, it proceeds as follows:
• If M.D[p] ≤ D_q[p] for all p, 1 ≤ p ≤ n, then q delivers M.
• Otherwise, process q takes a forced checkpoint, updates D_q, and then delivers M.

Wang showed that FDI is Z-path free (ZPF) [40]. By [1], we have ZPF ⊆ 1-rollback; therefore, FDI ∈ 1-rollback. Also, the dependency vector is piggybacked on each message; thus, the control overhead is M(FDI) = n · MR(E) for an execution E. However, the number of forced checkpoints clearly depends on the communication pattern. To guarantee ZPF, the protocol takes many forced checkpoints depending on the communication pattern. We assume
that the number of forced checkpoints during an interval T is F(FDI) = c · MR(E) for 0 ≤ c ≤ 1, where in the measurements below c = 0.1. Notice that, since we do not consider any specific application, we determine F(FDI) as a function of the communication pattern; recall that a Z-path is formed by communication between processes.

4.1.5. Briatico, Ciuffoletti and Simoncini (BCS) [10]
BCS is a communication-induced checkpointing protocol in which each process p_i maintains a checkpoint timestamp lc_i and acts as follows:
1. Whenever p_i takes a checkpoint, it increases lc_i by 1.
2. For each message m, p_i piggybacks lc_i on m. We call the piggybacked value m.lc.
3. Whenever p_i receives a message m, it compares lc_i with m.lc. If m.lc > lc_i, then p_i sets lc_i to m.lc and takes a forced checkpoint before delivering m.

Manivannan and Singhal [18] proved that BCS belongs to the ZCF class, and from [1] we know that ZCF ⊂ n-rollback. It can be argued that the number of forced checkpoints depends on the communication pattern of an execution E. Running different applications under BCS, Alvisi et al. [6] showed that the expected number of forced checkpoints in BCS is 2 for T = 360 and T = 480. Lastly, since BCS piggybacks a 32-bit logical clock (assuming a 32-bit architecture) on every application message, the control overhead is M(BCS) = MR(E)(32·w_b + ε), where ε is the delay due to the interception of each data message. For instance, in our measurements (presented below), ε = 50 microseconds.

4.1.6. Baldoni, Quaglia and Ciciani (BQC) [8]
BQC is another communication-induced checkpointing protocol that ensures ZCF by preventing the creation of potential Z-cycles. By [1], we know that BQC ∈ n-rollback. Moreover, Alvisi et al. [6] showed that BQC is worse than BCS, but that the number of forced checkpoints during an interval T is F(BQC) = 2. Lastly, the protocol piggybacks n² 32-bit values on each application message to help processes detect suspected Z-cycles.
Therefore, we have M(BQC) = MR(E)(32·n²·w_b + ε), where ε is the delay due to the interception of each data message.

4.1.7. Helary, Mostefaoui and Raynal (HMR) [14]
The HMR protocol ensures distributed snapshots (1-rollback) by piggybacking the timestamp lc_i on each application message and taking forced checkpoints when necessary. We assume that M(HMR) = MR(E)(32·w_b + ε), where ε is the delay due to the interception of each data message. In addition, we assume that the number of forced checkpoints during an interval T is F(HMR) = c · MR(E) for 0 ≤ c ≤ 1. In the measurements below, c = 0.1.

4.1.8. Manivannan–Singhal (M–S) [17]
In M–S, when a process p sends an application message M, it piggybacks its checkpoint sequence number sn_p. The sequence number piggybacked with the message M is denoted by M.sn. On the other hand, when a process q receives a message M, it takes a forced checkpoint if M.sn > sn_q. Manivannan and Singhal proved in [17] that M–S is ZPF, from which we conclude that M–S ∈ 1-rollback. Also, the sequence number is piggybacked on each message. Thus, given an execution E, M(M–S) = MR(E)(32·w_b + ε), where ε is the delay due to the interception of each data message; in our measurements (presented below), ε = 50 microseconds. We assume that the number of forced checkpoints during an interval T is F(M–S) = c · MR(E) for 0 ≤ c ≤ 1. In the measurements below, c = 0.1.

4.1.9. d-Bounded Cycles (d-BC) [1]
d-BC is a communication-induced checkpointing protocol. A cycle is a Z-path from a checkpoint to the next one, to itself, or to an earlier one on the same process. In d-BC, each process takes checkpoints independently; cycles can be formed as long as no cycle spans more than d checkpoints of the same process. Upon taking a new checkpoint C_{p,i}, process p broadcasts a control message containing all the dependencies between C_{p,i} and all other checkpoints.
Once a cycle of size d has been generated, the protocol switches to C–L in order to guarantee the d-BC property. By [1], d-BC belongs to (n−1)d-rollback. Upon a new checkpoint C_{p,i}, process p broadcasts a cut of size no larger than d·n; therefore, M(d-BC) = n·w_m + d·n²·w_b. Moreover, d-BC forces checkpoints by calling C–L
Fig. 14. Comparing the SaS and C–L protocols with various numbers of processes.
only if a cycle of size d is generated. Since a Z-cycle is a special case of a cycle, the conditions for generating cycles and Z-cycles are almost equivalent. Also, since ZCF = 1-BC, then by [6] we know that the number of forced checkpoints in an interval T is F(1-BC) = 2.

4.2. Comparing distributed checkpointing protocols

In this section, we compare the overhead ratios of the checkpointing protocols presented in the previous section. First, we compare the coordinated checkpointing protocols, C–L and SaS. Fig. 14 depicts the overhead ratio of these protocols in both the SMALL and BIG environments. We used the SMALL environment with T = 1024, which is the optimum value of T. We can see in Fig. 14(a) that C–L is better if the number of processes is less than 100, and it becomes worse for a larger number of processes. The reason is that C–L incurs more control overhead than SaS does: M(C–L) = O(n²), whereas M(SaS) = O(n). Notice, however, that the checkpoint overhead is bigger in SaS than in C–L, because SaS pauses processes for checkpointing. In the BIG environment, we have the same relation between SaS and C–L. Since in this and all following measurements BIG and SMALL exhibit the same qualitative behaviour, we present only the results for SMALL from now on.

In Fig. 15, we compare additional checkpointing protocols. We computed the overhead ratio in the SMALL environment with T = 1024. Consider the n-rollback protocols: BQC, BCS, and 1-BC. The 1-BC protocol is an optimistic protocol that takes fewer forced checkpoints than BQC and BCS do. Moreover, 1-BC does not use piggybacking as the other protocols do; therefore, its overhead ratio is not directly affected by the communication pattern. As depicted in Fig. 15(a), in executions with small MRs, 1-BC has worse overhead than BCS and BQC do. On the other hand, when MR is large, 1-BC performs much better than BQC and almost as well as BCS does. Fig. 15(b) shows that with MR = 2048, BQC and BCS are worse than in Fig. 15(a). Notice that, since BQC piggybacks O(n²) control information while BCS piggybacks only a 32-bit logical clock, we have v(n, BCS) ≤ v(n, BQC).

Fig. 16 compares the same protocols as Fig. 15, with the same MR and T values, but with decreasing probabilities of recovering from the possible checkpoints. Here again, graph (a) shows the behaviour with small MRs while graph (b) shows the behaviour with large MRs (recall that "MR" stands for message rate, i.e., the expected number of messages in each checkpoint interval). Note that the overhead ratio in Fig. 16 is smaller than in Fig. 15. The reason is that when the probability of a long rollback is small, the probability of losing useful computation becomes small as well. We can clearly see in Fig. 16 that the overhead ratios of executions with protocols that use piggybacking (FDI, BCS, and BQC) are worse with large MRs. Notice that in Fig. 16(b), 1-BC has a better overhead ratio than the other n-rollback protocols. Lastly, we can conclude that the C–L protocol is the best up to 32 processes; recall that SaS becomes better than C–L for a larger number of processes, as depicted in Fig. 14.
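For reference, the control-overhead formulas gathered from Section 4.1 can be tabulated in one place. The sketch below reflects our reading of those formulas (the function name is ours; ε is the assumed per-message interception delay and d is the d-BC bound):

```python
def control_overheads(n, wm, wb, MR, d=1, eps=50e-6):
    """Control overhead M per checkpoint interval, per Section 4.1."""
    return {
        "SaS":  5 * (n - 1) * (wm + 8 * wb),
        "C-L":  2 * n * (n - 1) * (wm + 8 * wb),
        "BHMR": MR * ((n ** 2 + 33 * n) * wb + eps),
        "FDI":  n * MR,                      # as stated in Section 4.1.4
        "BCS":  MR * (32 * wb + eps),
        "BQC":  MR * (32 * n ** 2 * wb + eps),
        "HMR":  MR * (32 * wb + eps),
        "M-S":  MR * (32 * wb + eps),
        "d-BC": n * wm + d * n ** 2 * wb,
    }

# SMALL settings (w_m = 0.000276, w_b = 1e-8) with n = 8 and MR = 32.
M = control_overheads(8, 0.000276, 1e-8, 32)
```

At n = 8 the broadcast-based C–L already pays roughly 2n/5 times the control overhead of SaS; the crossover seen in Fig. 14 also involves SaS's larger checkpoint overhead, which this table does not capture.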
Fig. 15. Comparing checkpointing protocols of different classes with equal rollback probabilities.
Fig. 16. Comparing checkpointing protocols of different classes with decreasing rollback probabilities.
In Fig. 17, we compare protocols that belong to 1-rollback. Here we compute the overhead ratio in the SMALL environment with T = 1024 and MR = 32. In BHMR and HMR, we chose c = 0.1 and thus F = 3 (recall the discussion in Sections 4.1.3 and 4.1.7). As depicted in the figure, the C–L protocol has better results for a small number of processes, but its overhead ratio becomes worse as the number of processes increases. In addition, notice that the other protocols have different overhead ratios according to their values of M, as presented in their respective descriptions in Section 4.1.

5. Related work

There has been much work on checkpointing performance analysis for both uniprocessors and distributed systems [19,30,35,42]. Most of the previous work did not take rollback propagation into account; our work is the first to incorporate all the parameters that affect performance in distributed environments into an analytical measure. Mishra and Wang [19] evaluated several checkpointing protocols by implementing them and then running them with test applications to obtain their overhead. Ziv and Bruck [42] compared just four checkpointing protocols using the Markov Reward Model [33]. For each checkpointing protocol, they defined a particular Markov chain that depended on the number of failures that can occur
Fig. 17. Comparing 1-rollback protocols with different numbers of processes.
simultaneously during the execution. Our approach differs from [42] in that we provide a technique for comparing arbitrary checkpointing protocols based on rollback propagation. Moreover, Ziv and Bruck presented in [43] a checkpoint scheme for duplex systems. Such a system is formed by a pair of processors connected by a LAN; the states of the two processors must be compared to detect a failure, and the comparison is performed on the local checkpoints of the processors. Ziv and Bruck conducted a performance analysis of their scheme in the duplex system, but it is not a general scheme for distributed executions.

Vaidya defined the overhead ratio for uniprocessor systems as a function of the checkpoint overhead and latency [38]. He proved that the optimum checkpoint interval depends on the value of o. In addition, Vaidya claimed that it is possible to compute the overhead ratio for distributed systems, as for uniprocessor systems, by taking the value of each parameter, such as the latency L, to be either the maximum or the average over all processes. In another work [37], Vaidya computed the overhead ratio for the two-level recovery approach, which tolerates single failures with a low overhead and multiple failures with a higher overhead.

Plank and Thomason [30] presented a method for estimating the overhead ratio for distributed executions with coordinated checkpointing. By assuming coordinated checkpointing, they avoid the issue of rollback propagation; moreover, they do not address the overhead incurred by control information. Neves and Fuchs presented RENEW (REcoverable NEtwork of Workstations), which can dynamically insert code to test distributed checkpointing protocols [21]. Using the RENEW tool, we can not only improve the implementation of checkpointing protocols, but also compare different protocols. Egida provides a similar framework for implementing checkpointing protocols [32].
In addition, the OTEC tool enables testing and evaluating checkpointing protocols using implementation or simulation [31]. Similarly, dPSIM is a simulation tool for evaluating different checkpointing protocols [23]. Unlike our model-based approach, RENEW, Egida, OTEC, and dPSIM target a real implementation of the protocols and/or a simulation of the system.

6. Conclusions

We have presented the overhead ratio metric for the performance analysis of distributed checkpointing protocols. Our evaluation tool considers all the parameters that affect the execution of distributed checkpointing protocols. We have compared several checkpointing protocols using our measure in different environments. All the results we obtained show that the most significant parameters affecting the overhead ratio are rollback propagation and communication overhead. Therefore, in order to minimize the overhead ratio, a protocol should exhibit little rollback propagation and avoid piggybacking control information on every message. We found that in two realistic settings, the coordinated checkpointing protocols, namely C–L and SaS, performed better than induced checkpointing protocols, since they satisfy those conditions. Moreover, in general, the protocols' behaviour is highly sensitive to the parameters, indicating that among the protocols we have checked, there is no "perfect" checkpointing protocol
(i.e., one that performs best for every application and environment assumption). However, based on our work, given a specific application communication pattern and corresponding environment assumptions, it is possible to identify some protocols that are likely to perform much better than others. We showed that for all the distributed checkpointing protocols presented here, just as for uniprocessor systems [38], there is an optimum value of T that achieves the best overhead ratio in any environment. This is not a new result [41], but we were able to obtain it using new and different approaches. An open question is whether there is an optimum T for every distributed checkpointing protocol.

Acknowledgments

We would like to thank Ari Freund for discussions related to developing the mathematical equations and Jenny Applequist for her editorial assistance. We would also like to thank the anonymous reviewers of the PEVA journal for their useful comments.

References

[1] A. Agbaria, H. Attiya, R. Friedman, R. Vitenberg, Quantifying rollback propagation in distributed checkpointing, Journal of Parallel and Distributed Computing 64 (3) (2004) 370–384.
[2] A. Agbaria, A. Freund, R. Friedman, Evaluating distributed checkpointing protocols, in: Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS'03, Providence, Rhode Island, May 2003, pp. 266–273.
[3] A. Agbaria, R. Friedman, Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations, in: Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing, HPDC'99, August 1999, pp. 167–176.
[4] A. Agbaria, R. Friedman, Virtual machine based heterogeneous checkpointing, Software: Practice and Experience 32 (1) (2002) 1–19.
[5] A. Agbaria, J.S. Plank, Design, implementation, and performance of checkpointing in NetSolve, in: Proceedings of the 1st IEEE Conference on Dependable Systems and Networks, DSN'00, New York, USA, June 2000, pp. 49–54.
[6] L. Alvisi, E. Elnozahy, S. Rao, S.A. Husain, A.D. Mel, An analysis of communication induced checkpointing, in: Proceedings of the 29th Fault-Tolerant Computing Symposium, Madison, Wisconsin, June 1999, pp. 242–249.
[7] R. Baldoni, J.M. Hélary, A. Mostefaoui, M. Raynal, A communication-induced checkpointing protocol that ensures rollback-dependency trackability, in: Proceedings of the 27th International Symposium on Fault-Tolerant Computing, June 1997, pp. 68–77.
[8] R. Baldoni, F. Quaglia, B. Ciciani, A VP-accordant checkpointing protocol preventing useless checkpoints, in: Proceedings of the IEEE International Symposium on Reliable Distributed Systems, October 1998, pp. 61–67.
[9] J. Brevik, D. Nurmi, R. Wolski, Quantifying machine availability in networked and desktop grid systems, Technical Report 2003-37, Department of Computer Science, University of California, Santa Barbara, November 2003.
[10] D. Briatico, A. Ciuffoletti, L. Simoncini, A distributed domino-effect free recovery algorithm, in: Proceedings of the IEEE International Symposium on Reliability in Distributed Software and Database Systems, October 1984, pp. 207–215.
[11] K.M. Chandy, L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems 3 (1) (1985) 63–75.
[12] G. Clark, T. Courtney, D. Daly, D. Deavours, S. Derisavi, J.M. Doyle, W.H. Sanders, P. Webster, The Möbius tool, in: Proceedings of the 9th International Workshop on Petri Nets and Performance Models, Aachen, Germany, September 2001, pp. 241–250.
[13] E.N. Elnozahy, L. Alvisi, Y.M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys 34 (3) (2002) 375–408.
[14] J.M. Hélary, A. Mostefaoui, M. Raynal, Communication-induced determination of consistent snapshots, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 208–217.
[15] L. Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM 21 (7) (1978) 558–565.
[16] I. Lee, D. Tang, R.K. Iyer, M.C. Hsueh, Measurement based evaluation of operating system fault tolerance, IEEE Transactions on Reliability 42 (2) (1993) 238–249.
[17] D. Manivannan, M. Singhal, A low-overhead recovery technique using quasi-synchronous checkpointing, in: Proceedings of the 16th International Conference on Distributed Computing Systems, May 1996, pp. 100–107.
[18] D. Manivannan, M. Singhal, Quasi-synchronous checkpointing: Models, characterization, and classification, IEEE Transactions on Parallel and Distributed Systems 10 (7) (1999) 703–713.
[19] S. Mishra, D. Wang, Choosing an appropriate checkpointing and rollback recovery algorithm for long-running parallel and distributed applications, in: Proceedings of the 11th ISCA International Conference on Computers and their Applications, San Francisco, CA, March 1996.
[20] R.H.B. Netzer, J. Xu, Necessary and sufficient conditions for consistent global snapshots, IEEE Transactions on Parallel and Distributed Systems 6 (2) (1995) 165–169.
[21] N. Neves, W.K. Fuchs, RENEW: A tool for fast and efficient implementation of checkpoint protocols, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 58–67.
[22] D. Nurmi, J. Brevik, R. Wolski, Modeling machine availability in enterprise and wide-area distributed computing environments, Technical Report 2003-28, Department of Computer Science, University of California, Santa Barbara, 2003.
[23] H.S. Paul, A. Gupta, R. Badrinath, Performance comparison of checkpoint and recovery protocols, Concurrency and Computation: Practice and Experience 15 (15) (2003) 1363–1386.
[24] J.S. Plank, W. Elwasif, Experimental assessment of workstation failures and their impact on checkpointing systems, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, June 1998, pp. 48–57.
[25] J.S. Plank, Efficient checkpointing on MIMD architectures, Ph.D. Thesis, Department of Computer Science, Princeton University, January 1993.
[26] J.S. Plank, An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance, Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
[27] J.S. Plank, M. Beck, W. Elwasif, T. Moore, M. Swany, R. Wolski, The Internet Backplane Protocol: Storage in the network, in: NetStore'99: Network Storage Symposium, Internet2, Seattle, WA, October 1999, pp. 242–249.
[28] J.S. Plank, M. Beck, G. Kingsley, K. Li, Libckpt: Transparent checkpointing under UNIX, in: Usenix Winter 1995 Technical Conference, New Orleans, January 1995, pp. 220–232.
[29] J.S. Plank, W. Elwasif, Experimental assessment of workstation failures and their impact on checkpointing systems, in: Proceedings of the 28th International Symposium on Fault-Tolerant Computing, Munich, June 1998, pp. 48–57.
[30] J.S. Plank, M.G. Thomason, Processor allocation and checkpoint interval selection in cluster computing systems, Journal of Parallel and Distributed Computing 61 (11) (2001) 1570–1590.
[31] B. Ramamurthy, S.J. Upadhyaya, R.K. Iyer, An object-oriented testbed for the evaluation of checkpointing and recovery systems, in: Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing.
[32] S. Rao, L. Alvisi, H.M. Vin, Egida: An extensible toolkit for low-overhead fault-tolerance, in: Proceedings of the IEEE International Conference on Fault-Tolerant Computing, June 1999, pp. 48–55.
[33] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, USA, 1982.
[34] N.H. Vaidya, On checkpoint latency, in: Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, Newport Beach, December 1995.
[35] N.H. Vaidya, Another two-level failure recovery scheme: Performance impact of checkpoint placement and checkpoint latency, Technical Report TR94-068, Department of Computer Science, Texas A&M University, 1994.
[36] N.H. Vaidya, Consistent logical checkpointing, Technical Report TR94-051, Department of Computer Science, Texas A&M University, 1994.
[37] N.H. Vaidya, A case for two-level distributed recovery schemes, in: Measurement and Modelling of Computer Systems, 1995, pp. 64–73.
[38] N.H. Vaidya, Impact of checkpoint latency on overhead ratio of a checkpointing scheme, IEEE Transactions on Computers 46 (8) (1997) 942–947.
[39] Y.M. Wang, Maximum and minimum consistent global checkpoints and their applications, in: Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, SRDS'95, September 1995, pp. 86–95.
[40] Y.M. Wang, Consistent global checkpoints that contain a given set of checkpoints, IEEE Transactions on Computers 42 (4) (1997) 456–486.
[41] J.S. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM 17 (9) (1974) 530–531.
[42] A. Ziv, J. Bruck, Analysis of checkpointing schemes for multiprocessor systems, in: Proceedings of the 13th Symposium on Reliable Distributed Systems, 1994.
[43] A. Ziv, J. Bruck, Efficient checkpointing over local area networks, in: Proceedings of the IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994, pp. 30–35.
Adnan Agbaria received his Master's degree (on "Communication-Processor Tradeoffs in Limited Resources Parallel RAM") in computer science from the University of Haifa in 1997 and his Ph.D. (on "Reliability in High Performance Distributed Computing Systems") in computer science from the Technion in 2002. Dr Agbaria is currently a Research Staff Member at the IBM Haifa Research Laboratory. Before joining IBM Research, he spent a year as a computer scientist at the ISI of the University of Southern California and about three years as a Postdoctoral Fellow in the CSL of the University of Illinois at Urbana-Champaign. His research interests include, but are not limited to, fault and intrusion tolerance, parallel and distributed systems and model-based validation. Dr Agbaria has served as a programme committee member for the International Conference on Dependable Systems and Networks (DSN) and the International Conference on Distributed Computing Systems (ICDCS). He is a co-organizer of the International Workshop on Applied Software Reliability (WASR).

Roy Friedman is an associate professor in the Department of Computer Science at the Technion. His research interests include distributed systems, group communication, dependable computing and middleware for mobile ad-hoc networks. He has published more than 70 technical papers on distributed systems, group communication, fault-tolerance, high availability, cluster computing, client/server middleware and wireless mobile ad-hoc networks in major international journals and conferences. He also holds two patents. Prof. Friedman regularly participates in the technical programme committees of leading international conferences. Formerly, he was an academic specialist at IRISA/INRIA (France) and a researcher at Cornell University (USA). In the late 1980s he also worked as a programmer for Intel Israel and Shiluv Information Systems Ltd. He is one of the two technical founders of PolyServe Inc. and holds a Ph.D. and a B.Sc. from the Technion.