Performance Evaluation 66 (2009) 311–326
Contents lists available at ScienceDirect
Performance Evaluation journal homepage: www.elsevier.com/locate/peva
Numerical computation algorithms for sequential checkpoint placement☆
Tatsuya Ozaki a, Tadashi Dohi a,∗, Naoto Kaio b
a Department of Information Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, Japan
b Department of Economic Informatics, Hiroshima Shudo University, 1-1-1 Ozukahigashi, Asaminamiku, Hiroshima 731-3195, Japan
Article info
Article history: Received 17 August 2006; Accepted 27 November 2008; Available online 10 December 2008
Keywords: File systems; Checkpointing; Sequential checkpoints; Availability; Expected reward; Numerical algorithms

Abstract
This paper concerns sequential checkpoint placement problems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state. We develop numerical computation algorithms to determine the optimal checkpoint sequence, based on Brender's classical fixed-point algorithm, and further give three simple approximation methods. Numerical examples with the Weibull failure time distribution illustrate quantitatively how the sub-optimal checkpoint sequences based on the approximation methods overestimate and underestimate the optimal one. © 2008 Elsevier B.V. All rights reserved.
1. Introduction

System failures in large-scale computer systems can lead to huge economic or critical social losses. Checkpointing and rollback recovery is a commonly used technique for improving the dependability of file systems, and is regarded as a low-cost environment diversity technique from the standpoint of fault-tolerant computing. In particular, when the file system that writes and/or reads data is designed with preventive maintenance in mind, checkpoint generation backs up the significant data on the primary medium to a safe secondary medium, occasionally or periodically, and plays a significant role in limiting the amount of data processing needed for recovery after a system failure. If checkpoints are taken frequently, a larger overhead is incurred by checkpointing itself. Conversely, if checkpoints are seldom placed, a larger recovery overhead is required after a system failure. Hence, it is important to determine the optimal checkpoint sequence by taking account of the trade-off between these two kinds of overhead. Since system failures under uncertainty are described by a probability distribution, called the system failure time distribution, the optimal checkpoint sequence should be determined on the basis of a stochastic model [1–5]. Young [6] obtained the optimal checkpoint interval approximately for the computation restart after system failures. Baccelli [7], Chandy et al. [2], Dohi et al. [8], Gelenbe and Derochette [9], Gelenbe [10], Gelenbe and Hernandez [11], Goes and Sumita [12], Grassi et al. [13], Kulkarni et al. [14], Nicola and Van Spanje [15], and Sumita et al. [16] proposed performance evaluation models for database recovery, and calculated the optimal checkpoint intervals which maximize the system availability or minimize the mean overhead during normal operation. L'Ecuyer and Malenfant [17] formulated a dynamic checkpoint placement problem as a Markov decision process.
Ziv and Bruck [18] reconsidered a checkpoint placement problem under a random environment, by taking account of changes in the operating circumstances. Vaidya [19] examined the impact of checkpoint latency on the overhead ratio for a simple checkpoint model. Recently, Okamura et al. [20] reformulated Vaidya's model [19] as a semi-Markov decision process. On the other hand, some authors discussed sequential checkpoint placement problems in which the checkpoint intervals are not always constant. For instance, in almost all checkpoint models for transaction-based systems [7–12, 16], it could be proved theoretically that constant checkpoint intervals maximizing the system availability are better than independent and identically distributed random checkpoint intervals. In any case, however, the sequential policy with aperiodic checkpoint intervals provides the general framework for checkpoint placement, because the sequential checkpoint policy includes the periodic one as a special case. Duda [21] derived a recursive formula satisfied by the optimal aperiodic checkpoint sequence maximizing the mean program execution time. Toueg and Babaoğlu [22] developed a discrete dynamic programming algorithm which minimizes the expected execution time of tasks, placing checkpoints between two consecutive tasks under very general assumptions. Kaio and Osaki [23] and Ling et al. [24] proposed approximate methods to calculate the optimal checkpoint sequence minimizing the expected cumulative operation cost until the system failure. In the sequential checkpoint placement problem, it is assumed that the system failure time obeys a general probability distribution, i.e., not necessarily the negative exponential distribution. In fact, a non-exponential system failure time distribution with increasing failure rate can be assumed for some real workstation failure data [25]. It is also reported that some system failures are caused by software aging, such as resource exhaustion, and that the system failure time can then no longer be regarded as an exponentially distributed random variable (see e.g. [26]).

☆ This work is supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Sports, Science and Culture of Japan under Grant No. 18510138 (2006-2008) and the Research Program 2008 under the Center for Academic Development and Cooperation of the Hiroshima Shudo University, Japan. The authors very much appreciate two reviewers' comments to improve the first version of this paper.
∗ Corresponding author. Tel.: +81 82 424 7648; fax: +81 82 422 7025.
E-mail addresses: [email protected] (T. Dohi), [email protected] (N. Kaio).
0166-5316/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.peva.2008.11.003
The sequential checkpoint placement problem is formulated as a complex non-linear optimization problem with an unknown number of decision variables. This makes it computationally difficult to place the aperiodic checkpoint sequence optimally, even for a simple centralized system under a single criterion of optimality. Although the checkpointing models mentioned above mainly focus on centralized systems, the analytical techniques for reliability and performance evaluation can also be applied to distributed systems [27]. Wong and Franklin [28] considered simple Markov models to determine the frequency of checkpointing in parallel systems. Plank and Thomason [29] modeled the performance of coordinated checkpointing systems [30–32] where the number of processors dedicated to the application and the checkpoint intervals are selected by the user before running the program. They employed a birth and death Markov chain to determine the long-run availability of the parallel system. Agbaria et al. [33] took account of rollback propagation [34] and evaluated coordinated checkpointing protocols based on both the overhead ratio and simple Markov chain models. On the other hand, uncoordinated checkpointing techniques [35] are used to reduce the checkpointing overhead in normal processing. Soliman and Elmaghraby [36] developed the so-called hybrid state saving technique to reduce the mean time to execute a finite-length task in an uncoordinated checkpointing system. However, an aperiodic checkpoint scheme has not yet been developed in the literature on parallel and distributed systems. In this paper we consider sequential checkpoint placement problems for centralized systems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state [37,38].
When the checkpoint strategy is restricted to constant intervals, the past literature [7,2,8–12,16,6] provides satisfactory answers on the optimal constant checkpoint interval with the negative exponential or a general system failure time distribution under specific cost criteria. Surprisingly, the general sequential checkpoint placement problems have not been studied sufficiently during the last three decades, except for a few examples [21,23,24,22]. Recently, Ozaki et al. [39] dealt with the same problem under the expected cumulative operation cost over an infinite/finite time horizon, and developed an effective computation algorithm to calculate the optimal checkpoint sequence. However, the algorithm proposed in [39] is not general-purpose and cannot be applied to the general problems. In this paper, we develop numerical computation algorithms for the optimal sequential checkpoint placement under the steady-state system availability and the expected reward per unit time in the steady state. The basic idea is due to Brender's classical fixed-point theorem [40]; that is, the computation algorithms proposed here eventually converge to the real optimal solutions. The rest of this paper is organized as follows. In Section 2, we define the notation and describe two sequential checkpoint placement models with perfect and imperfect checkpointing, referred to as Model A and Model B, respectively. The system availability and the expected reward rate are formulated in Section 3 and Section 4, respectively, where the numerical computation algorithms to maximize them are derived. In Section 5 we introduce three approximation methods to calculate the sub-optimal checkpoint sequence. Numerical examples with the Weibull failure time distribution are presented in Section 6 to illustrate the overestimation and underestimation of the sub-optimal checkpoint sequences based on the approximation methods.
We compare the real optimal checkpoint sequence and its associated dependability measures with the three approximate solutions. Finally, the paper is concluded with some remarks in Section 7. The computation algorithms in this paper provide exact solutions to problems that have remained unsolved for the last three decades, and their impact on practical fault-tolerant file management is expected to be significant, because the underlying techniques may be applied to aperiodic and distributed checkpointing protocols.

2. Sequential checkpoint placement

2.1. Model A

Consider a simple file system with sequential checkpointing over an infinite time horizon. The system operation starts at time $t_0 = 0$, and checkpoints (CPs) are sequentially placed at times $\{t_1, t_2, \ldots, t_k, \ldots\}$. At each CP $t_k$ $(k = 1, 2, \ldots)$, all the file data in main memory is saved to a safe secondary medium such as a CD-ROM, where the cost (time overhead) $c_0$ $(>0)$ is incurred per CP placement. For the moment, it is assumed that system operation stops during checkpointing and that the file system does not deteriorate in the meantime. System failure occurs according to an absolutely continuous and non-decreasing probability distribution function $F(t)$ having density function $f(t)$ and finite mean $1/\mu$ $(>0)$, which depends on the cumulative system operation time excluding the checkpointing periods. Upon a system failure, a rollback recovery takes place immediately, using the file data saved at the last CP. Next, a checkpoint restart is performed and the file data is recovered to the state just before the system failure point. The time required for the checkpoint restart is given by the function $L(\cdot)$, which depends on the system failure time and is assumed to be differentiable and increasing. More specifically, suppose that the recovery function is an affine function of the time interval between the beginning of the last checkpointing before system failure and the system failure time, i.e., $L(t - t_{k-1}) = a_0 (t - t_{k-1}) + b_0$ $(t > t_{k-1};\ k = 1, 2, \ldots)$ [7,2,8–12,23,15,39,16,6], where $a_0$ $(>0)$ and $b_0$ $(>0)$ are constants. The coefficient $a_0$ in the first term denotes the restart time needed per unit of time elapsed since the last checkpointing, and the second term $b_0$ is a fixed time associated with the CP restart. Throughout this paper, it is assumed that, with probability one, no failure occurs during the recovery period. We define the time interval from $t = 0$ to the completion of the recovery operation after the system failure as one cycle. The same cycle repeats again and again over an infinite time span. Fig. 1 depicts a possible realization of the above CP model, referred to as Model A.

Fig. 1. Configuration of Model A.

Define the indicator function $I_{\{\cdot\}}$, where for the probabilistic event $A$,

\[
I_{\{A\}} =
\begin{cases}
1 & \text{if event } A \text{ occurs}, \\
0 & \text{otherwise}.
\end{cases}
\tag{1}
\]
Let $A_k$ be the event that the system failure time $X$ is strictly greater than the $(k-1)$-st CP time $t_{k-1}$, measured on the cumulative operation time, and is equal to or less than $t_k$ $(k = 1, 2, \ldots)$. Then, we have

\[
P(A_k) = \bar{F}(t_{k-1}) - \bar{F}(t_k), \quad k = 1, 2, \ldots,
\tag{2}
\]

where $P(A) = \Pr\{\text{event } A \text{ occurs}\}$ and $\bar{F}(\cdot) = 1 - F(\cdot)$. Let $N$, $C$ and $X$ be the number of CPs, the time to place a CP just before the system failure and the time to system failure, respectively. Then, for the CP sequence $\mathbf{t} = \{t_1, t_2, \ldots\}$, the mean number of CPs for one cycle is given by

\[
E[N \mid \mathbf{t}] = \sum_{k=1}^{\infty} k P(A_k) = \sum_{k=1}^{\infty} k \left\{ \bar{F}(t_{k-1}) - \bar{F}(t_k) \right\} = 1 + \sum_{k=1}^{\infty} \bar{F}(t_k).
\tag{3}
\]

Similar to Eq. (3), the expected time to the last checkpointing before system failure is given by

\[
E[C \mid \mathbf{t}] = \sum_{k=1}^{\infty} t_{k-1} P(A_k) = \sum_{k=1}^{\infty} t_{k-1} \left\{ \bar{F}(t_{k-1}) - \bar{F}(t_k) \right\} = \sum_{k=1}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k).
\tag{4}
\]

Since the mean time to system failure (MTTF) obviously becomes

\[
E[X \mid \mathbf{t}] = \sum_{k=1}^{\infty} \int_{t_{k-1}}^{t_k} x \, dF(x) = \mu^{-1}
\tag{5}
\]

with $t_0 = 0$, we obtain the expected time length between the last checkpointing before system failure and the failure occurrence during one cycle as

\[
E[D \mid \mathbf{t}] = E[X \mid \mathbf{t}] - E[C \mid \mathbf{t}] = \mu^{-1} - \sum_{k=1}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k).
\tag{6}
\]
Fig. 2. Configuration of Model B.

From the above results, the mean time length of one cycle is given by

\[
\begin{aligned}
T(\mathbf{t}) &= c_0 E[N \mid \mathbf{t}] + E[X \mid \mathbf{t}] + a_0 E[D \mid \mathbf{t}] + b_0 \\
&= c_0 \left\{ 1 + \sum_{k=1}^{\infty} \bar{F}(t_k) \right\} + \mu^{-1} + a_0 \left\{ \mu^{-1} - \sum_{k=1}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k) \right\} + b_0.
\end{aligned}
\tag{7}
\]
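The quantities in Eqs. (3), (4), (6) and (7) can be evaluated numerically once a failure time distribution is fixed. The following is a minimal sketch, assuming the Weibull distribution used later in the numerical examples and the parameter values listed in the caption of Fig. 3; the function names and the periodic test sequence $t_k = 5k$ are hypothetical choices made only for illustration.

```python
import math

def weibull_sf(t, m, eta):
    """Survival function of the Weibull distribution, exp(-(t/eta)^m)."""
    return math.exp(-((t / eta) ** m))

def model_a_cycle(times, c0, a0, b0, m, eta):
    """Return (E[N], E[C], E[D], T) of Eqs. (3), (4), (6), (7) for CP times t_1 < t_2 < ..."""
    mttf = eta * math.gamma(1.0 + 1.0 / m)       # E[X] = 1/mu for the Weibull case
    sf = [weibull_sf(t, m, eta) for t in times]
    previous = [0.0] + list(times[:-1])          # t_0 = 0
    e_n = 1.0 + sum(sf)                                                # Eq. (3)
    e_c = sum((t - p) * s for t, p, s in zip(times, previous, sf))     # Eq. (4)
    e_d = mttf - e_c                                                   # Eq. (6)
    cycle = c0 * e_n + mttf + a0 * e_d + b0                            # Eq. (7)
    return e_n, e_c, e_d, cycle

# Hypothetical periodic sequence t_k = 5k, truncated where the survival function is negligible.
times = [5.0 * k for k in range(1, 101)]
e_n, e_c, e_d, cycle = model_a_cycle(times, c0=0.2, a0=0.2, b0=0.0, m=2.0, eta=60.0)
mttf = 60.0 * math.gamma(1.5)
print(f"E[N]={e_n:.3f}  E[D]={e_d:.3f}  T={cycle:.3f}  E[X]/T={mttf / cycle:.5f}")
```

The infinite sums are truncated at a finite number of checkpoints, which is harmless here only because the Weibull survival function is numerically zero well before the last term.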
2.2. Model B

Next we consider a model different from Model A, in which the system may deteriorate even during the checkpointing period. In general, most checkpoint libraries do not suspend the application during checkpointing, because in many cases a checkpoint can be taken concurrently with the main application by a delicate process/thread. Vaidya [19] treated a similar but somewhat different model for a parallel program where the sub-process can run with lower performance while the main process is checkpointed. However, it is worth noting that a system failure during checkpointing may occur in practice even for centralized systems. The most plausible cause of such imperfect checkpointing is human error when the checkpointing is hand-tuned. Also, even if a checkpoint can be taken concurrently with applications, the system may enter an unstable state at that time because of aging-related bugs [26]. The simplest approach to treating the imperfection of checkpointing is to introduce an imperfect checkpointing probability (coverage) at each checkpoint placement. This type of stochastic model can easily be treated with techniques similar to those of Refs. [41,42]. However, since imperfect checkpointing and system failure are both rare events, it is quite difficult to quantify the imperfect checkpointing probability in practice. In other words, the assumption that a system failure may occur during checkpointing with the same failure mode as in normal operation can be validated approximately if the checkpoint overhead is relatively small with respect to the length of normal operation. Suppose that the system failure time distribution $F(t)$ is defined on the calendar time since $t_0 = 0$. In our model, referred to as Model B, the imperfect checkpointing scheme is introduced under the assumption that the system may be in an out-of-control state during the checkpointing period. As recognized intuitively, the possibility of imperfect checkpointing cannot be ignored in practice, though it seldom happens when the overhead at a CP is short. The configuration of Model B is illustrated in Fig. 2. In a fashion similar to Model A, define the following events. Let $B_k$ be the event that the system failure time $X$ is strictly greater than the $(k-1)$-st CP completion time, $t_{k-1} + c_0$, and is equal to or less than $t_k$ $(k = 2, 3, \ldots)$, where in particular $B_1$ is given by

\[
B_1 = \{ t_0 < X \le t_1 \}.
\tag{8}
\]

Also, let

\[
C_k = \{ t_k < X \le t_k + c_0 \}, \quad k = 1, 2, \ldots
\tag{9}
\]

denote the event that the system failure occurs during the $k$-th CP placement. Then we have

\[
P(B_1) = \bar{F}(t_0) - \bar{F}(t_1),
\tag{10}
\]

\[
P(B_k) = \bar{F}(t_{k-1} + c_0) - \bar{F}(t_k), \quad k = 2, 3, \ldots,
\tag{11}
\]

\[
P(C_k) = \bar{F}(t_k) - \bar{F}(t_k + c_0), \quad k = 1, 2, \ldots.
\tag{12}
\]
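From Eqs. (10)–(12), we obtain the expected number of CPs, the mean time length to place a CP just before the system failure and the mean time length to system failure as

\[
\begin{aligned}
E[N \mid \mathbf{t}] &= 1 \cdot P(B_1) + 2 \cdot \{P(C_1) + P(B_2)\} + \cdots + k \cdot \{P(C_{k-1}) + P(B_k)\} + \cdots \\
&= 1 \cdot \{\bar{F}(t_0) - \bar{F}(t_1)\} + 2 \cdot \{\bar{F}(t_1) - \bar{F}(t_2)\} + \cdots + k \cdot \{\bar{F}(t_{k-1}) - \bar{F}(t_k)\} + \cdots \\
&= \sum_{k=1}^{\infty} k \left\{ \bar{F}(t_{k-1}) - \bar{F}(t_k) \right\} = 1 + \sum_{k=1}^{\infty} \bar{F}(t_k),
\end{aligned}
\tag{13}
\]

\[
\begin{aligned}
E[C \mid \mathbf{t}] &= t_0 \{P(B_1) + P(C_1)\} + (t_1 + c_0)\{P(B_2) + P(C_2)\} + \cdots + (t_{k-1} + c_0)\{P(B_k) + P(C_k)\} + \cdots \\
&= t_0 \{\bar{F}(t_0) - \bar{F}(t_1 + c_0)\} + (t_1 + c_0)\{\bar{F}(t_1 + c_0) - \bar{F}(t_2 + c_0)\} + \cdots \\
&\quad + (t_{k-1} + c_0)\{\bar{F}(t_{k-1} + c_0) - \bar{F}(t_k + c_0)\} + \cdots \\
&= \sum_{k=2}^{\infty} (t_{k-1} + c_0) \left\{ \bar{F}(t_{k-1} + c_0) - \bar{F}(t_k + c_0) \right\} \\
&= (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0),
\end{aligned}
\tag{14}
\]

\[
E[X \mid \mathbf{t}] = \mu^{-1}.
\tag{15}
\]

From Eqs. (13)–(15), we get the mean system down time and the mean time length of one cycle:

\[
\begin{aligned}
E[D \mid \mathbf{t}] &= E[X \mid \mathbf{t}] - E[C \mid \mathbf{t}] \\
&= \mu^{-1} - \left\{ (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right\},
\end{aligned}
\tag{16}
\]

\[
\begin{aligned}
T(\mathbf{t}) &= E[X \mid \mathbf{t}] + a_0 E[D \mid \mathbf{t}] + b_0 \\
&= \mu^{-1} + a_0 \left\{ \mu^{-1} - \left[ (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right] \right\} + b_0.
\end{aligned}
\tag{17}
\]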
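3. Availability analysis

3.1. Model A

From the discussion in Section 2, the steady-state system availability for Model A is given by

\[
\begin{aligned}
A_A(\mathbf{t}) &= \frac{E[X \mid \mathbf{t}]}{T(\mathbf{t})} \\
&= \mu^{-1} \left\{ c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(t_k) \right] + \mu^{-1} + a_0 \left[ \sum_{k=1}^{\infty} \int_{t_{k-1}}^{t_k} x \, dF(x) - \sum_{k=1}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k) \right] + b_0 \right\}^{-1}.
\end{aligned}
\tag{18}
\]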
It should be noted that obtaining the optimal CP sequence maximizing $A_A(\mathbf{t})$ is equivalent to minimizing $T(\mathbf{t})$, because the numerator of Eq. (18) is constant with respect to $\mathbf{t}$. Recently, this type of optimal CP placement problem was considered by the same authors [39], where the sequence of intervals between two successive checkpointings $\{t_1, t_2 - t_1, t_3 - t_2, \ldots\}$ is a non-increasing sequence under the assumption that the system failure time distribution $F(t)$ is PF$_2$ (Pólya Frequency Function of Order 2) [43]:

\[
\begin{vmatrix}
f(u_1 - v_1) & f(u_1 - v_2) \\
f(u_2 - v_1) & f(u_2 - v_2)
\end{vmatrix}
\ge 0
\tag{19}
\]

for arbitrary $u_1 < u_2$ and $v_1 < v_2$. If $F(t)$ is PF$_2$ then it has to be IFR (increasing failure rate), i.e., the system failure rate $r(t) = f(t)/\bar{F}(t)$ is increasing in the operation time $t$. In what follows, it is assumed that the system failure time distribution belongs to the class of PF$_2$. Then the first order condition of optimality for $A_A(\mathbf{t})$ is given by

\[
t_k - t_{k-1} = \frac{F(t_{k+1}) - F(t_k)}{f(t_k)} + \frac{c_0}{a_0},
\tag{20}
\]

where $t_k - t_{k-1} > c_0$ $(k = 1, 2, 3, \ldots)$. This condition is obtained by setting the derivatives of $T(\mathbf{t})$ in Eq. (7) with respect to the $t_k$'s equal to 0. From the condition of optimality, an algorithm to derive the optimal CP sequence $\mathbf{t}^* = \{t_1^*, t_2^*, \ldots\}$ which maximizes $A_A(\mathbf{t})$ can be derived. More precisely, we set the initial value $t_1$ satisfying $c_0 = a_0 \int_0^{t_1} F(t)\,dt$, and compute the CP sequence $\{t_2, t_3, \ldots\}$ using Eq. (20). Next, for the $k$-th CP $(k = 1, 2, \ldots)$, if $t_{k+1} - t_k > t_k - t_{k-1}$ then decrease $t_1$ and compute the CP sequence $\{t_2, t_3, \ldots\}$ again. On the other hand, if $t_{k+1} - t_k < c_0$, then increase $t_1$ and compute the CP sequence again. The bisection method is used to adjust $t_1$. Let $n$ $(>0)$ be the minimum integer satisfying $\bar{F}(t_n) \le \varepsilon$ for a sufficiently small constant $\varepsilon$ $(>0,\ \approx 0)$. Finally, for the CP sequence $t_1 < t_2 < \cdots < t_n$, if $\int_{t_n}^{t_{n+1}} [c_0 (n+1) + L(t - t_n)]\,dF(t) \approx \varepsilon$ then the procedure is stopped. We call this algorithm Algorithm 0 in this paper.

Algorithm 0.
Step 1: Set the initial value $t_1$ satisfying $c_0 = a_0 \int_0^{t_1} t\,dF(t) + b_0 F(t_1)$.
Step 2: Compute the CP sequence $\{t_2, t_3, \ldots\}$ using Eq. (20).
Step 3: For the $k$-th CP $(k = 1, 2, \ldots)$, if $t_{k+1} - t_k > t_k - t_{k-1}$ then decrease $t_1$ and go to Step 2.
Step 4: For the $k$-th CP $(k = 1, 2, \ldots)$, if $t_{k+1} - t_k < c_0$ then increase $t_1$ and go to Step 2.
Step 5: For the CP sequence $t_1 < t_2 < \cdots < t_n$, if $\int_{t_n}^{t_{n+1}} [c_0 (n+1) + L(t - t_n)]\,dF(t) \approx \varepsilon$ then stop the procedure.
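Step 2 of Algorithm 0 amounts to solving Eq. (20) for $t_{k+1}$ given $t_{k-1}$ and $t_k$. The following is a minimal sketch of that inner recursion only, assuming the Weibull distribution of the numerical examples (so that $F^{-1}$ is available in closed form) and the parameter values from the caption of Fig. 3. The function names are hypothetical, and the bisection on $t_1$ in Steps 3 to 5 is deliberately left out.

```python
import math

M, ETA = 2.0, 60.0      # Weibull shape and scale (values taken from the caption of Fig. 3)
C0, A0 = 0.2, 0.2       # checkpoint overhead c0 and recovery slope a0

def cdf(t):
    return 1.0 - math.exp(-((t / ETA) ** M))

def pdf(t):
    return (M / ETA) * (t / ETA) ** (M - 1.0) * math.exp(-((t / ETA) ** M))

def inv_cdf(u):
    return ETA * (-math.log(1.0 - u)) ** (1.0 / M)

def next_checkpoint(t_prev, t_k):
    """Solve Eq. (20) for t_{k+1}; return None when no admissible t_{k+1} exists."""
    target = cdf(t_k) + pdf(t_k) * (t_k - t_prev - C0 / A0)
    if not (cdf(t_k) < target < 1.0):
        return None
    return inv_cdf(target)

def sequence_from_t1(t1, max_len=1000):
    """Generate t_1, t_2, ... from a trial t_1 until Eq. (20) can no longer be solved."""
    seq, t_prev = [t1], 0.0
    while len(seq) < max_len:
        t_next = next_checkpoint(t_prev, seq[-1])
        if t_next is None:
            break
        t_prev = seq[-1]
        seq.append(t_next)
    return seq

t2 = next_checkpoint(0.0, 10.0)     # trial value t_1 = 10 is an arbitrary illustration
seq = sequence_from_t1(10.0)
```

A full implementation would wrap `sequence_from_t1` in the bisection loop described in Steps 3 and 4, shrinking or enlarging the trial $t_1$ depending on how the generated intervals behave.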
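3.2. Model B

Next consider the maximization problem of the steady-state system availability for Model B. By an argument similar to that for Model A, we obtain the steady-state system availability:

\[
\begin{aligned}
A_B(\mathbf{t}) &= \frac{E[X \mid \mathbf{t}] - c_0 E[N \mid \mathbf{t}]}{T(\mathbf{t})} \\
&= \frac{\mu^{-1} - c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(t_k) \right]}
{\mu^{-1} + a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right) \right] + b_0}.
\end{aligned}
\tag{21}
\]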
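Eq. (21) can be evaluated directly for any given CP sequence. The sketch below does so under the assumption of a Weibull failure time distribution with the parameter values from the caption of Fig. 3; the function names and the periodic test sequence are hypothetical and serve only to show how the terms of Eqs. (13), (14), (16) and (17) enter the ratio.

```python
import math

def sf(t, m, eta):
    """Weibull survival function exp(-(t/eta)^m)."""
    return math.exp(-((t / eta) ** m))

def model_b_availability(times, c0, a0, b0, m, eta):
    """Steady-state availability A_B(t) of Eq. (21) for CP times t_1 < t_2 < ..."""
    mttf = eta * math.gamma(1.0 + 1.0 / m)                   # E[X] = 1/mu, Eq. (15)
    e_n = 1.0 + sum(sf(t, m, eta) for t in times)            # Eq. (13)
    # Eq. (14): (t1 + c0) * SF(t1 + c0) + sum_{k>=2} (t_k - t_{k-1}) * SF(t_k + c0)
    e_c = (times[0] + c0) * sf(times[0] + c0, m, eta)
    e_c += sum((times[k] - times[k - 1]) * sf(times[k] + c0, m, eta)
               for k in range(1, len(times)))
    e_d = mttf - e_c                                         # Eq. (16)
    cycle = mttf + a0 * e_d + b0                             # Eq. (17)
    return (mttf - c0 * e_n) / cycle

# Hypothetical periodic sequence t_k = 5k, long enough for the tail terms to vanish.
times = [5.0 * k for k in range(1, 101)]
a_b = model_b_availability(times, c0=0.2, a0=0.2, b0=0.0, m=2.0, eta=60.0)
print(f"A_B = {a_b:.5f}")
```

Unlike Model A, the checkpoint sequence appears in both the numerator and the denominator here, which is the structural reason a ratio-type fixed-point scheme is needed.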
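Unfortunately, it is impossible to apply Algorithm 0 to Model B, since the numerator of Eq. (21) also depends on $\mathbf{t}$. Define the constant sequence $\{l_1, l_2, \ldots, l_i, \ldots\}$ and the function:

\[
\begin{aligned}
D(l_i, \mathbf{t}) &= E[X \mid \mathbf{t}] - c_0 E[N \mid \mathbf{t}] - l_i T(\mathbf{t}) \\
&= \mu^{-1} - c_0 \left\{ 1 + \sum_{k=1}^{\infty} \bar{F}(t_k) \right\} \\
&\quad - l_i \left\{ \mu^{-1} + a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right) \right] + b_0 \right\}.
\end{aligned}
\tag{22}
\]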
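If there exists $l_i^* = \arg\max_i \{l_i\}$ satisfying $D(l_i^*, \mathbf{t}) = 0$, then it has to be the maximum system availability $A_B(\mathbf{t}^*) = l_i^*$. This simple argument can be validated by applying Brender's theorem [40] as follows.

Theorem. $D(l_i^*, \mathbf{t}) = 0 = \arg\max_i \{D(l_i, \mathbf{t})\}$ if $l_i^* = \arg\max_i \{l_i\}$. That is, the optimal CP sequence is given by $\mathbf{t}^* = \mathbf{t}(l_i^*)$ as a function of $l_i$.

Proof. Let $\mathbf{t}(l_i)$ denote the CP sequence maximizing $D(l_i, \mathbf{t})$ for fixed $l_i$. From $c_0/(1/\mu) \le a_0$, it is evident that $D(a_0, \mathbf{t}(a_0)) = c_0 - a_0/\mu \le 0$ and $D(0, \mathbf{t}(0)) > 0$. Since $D(l_i, \mathbf{t}(l_i))$ is absolutely continuous with respect to $l_i$ $(\le a_0)$, there exists $l_i = l_i^*$ satisfying $D(l_i^*, \mathbf{t}(l_i^*)) = 0$. For an arbitrary $\mathbf{t}$, written here as $\mathbf{t}(l_i)$, we find that

\[
D(l_i^*, \mathbf{t}(l_i^*)) \ge D(l_i^*, \mathbf{t}(l_i)),
\tag{23}
\]

\[
D(l_i^*, \mathbf{t}(l_i^*)) = 0,
\tag{24}
\]

\[
D(l_i^*, \mathbf{t}(l_i)) = E[X \mid \mathbf{t}] - c_0 E[N \mid \mathbf{t}] - l_i^* T(\mathbf{t}).
\tag{25}
\]

Since $T(\mathbf{t}) > 0$, it is immediate to see that $A_B(\mathbf{t}(l_i)) \le l_i^*$. Since $D(l_i^*, \mathbf{t}(l_i^*)) = 0$, we get $l_i^* = A_B(\mathbf{t}(l_i^*))$ and hence, for all $\mathbf{t}$, $A_B(\mathbf{t}) \le A_B(\mathbf{t}(l_i^*))$. That is, $\mathbf{t}(l_i^*)$ is the optimal CP sequence maximizing the steady-state system availability. The proof is completed.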
Hence, the optimal CP sequence $\mathbf{t}^* = \mathbf{t}(l_i^*)$ can be calculated once $A_B(\mathbf{t}^*)$ is given. Differentiating $D(l_i, \mathbf{t})$ with respect to $t_k$ and setting it equal to 0 yields

\[
t_k - t_{k-1} = \frac{F(t_{k+1} + c_0) - F(t_k + c_0)}{f(t_k + c_0)} + \frac{c_0 f(t_k)}{a_0 l_i f(t_k + c_0)},
\tag{26}
\]

where $t_k - t_{k-1} > c_0$ $(k = 1, 2, 3, \ldots)$. The following iterative scheme determines the optimal CP sequence $\mathbf{t}^*$ maximizing $A_B(\mathbf{t})$.
Algorithm 1.
Step 1: Set $l_1 := 0$.
Step 2: Compute the CP sequence $\{t_2, t_3, \ldots\}$ with a fixed $l_i$ using Eq. (26) and Algorithm 0.
Step 3: Compute $D(l_i, \mathbf{t})$. If $\arg\max_i \{D(l_i, \mathbf{t})\} < \varepsilon \approx 0$, then stop the procedure; otherwise update $l_i$ by

\[
l_{i+1} = l_i + \frac{D(l_i, \mathbf{t})}{T(\mathbf{t})}
\tag{27}
\]

and go to Step 2.

This is a fixed-point algorithm of a kind that has not been developed before in the long history of optimal CP placement problems.

4. Performability analysis

Our next concern is the expected reward per unit time in the steady state. Define:
• $a$: reward per unit time in the normal state
• $b$: reward per unit time while a CP is being placed
• $c$: reward per unit recovery time in the system down state
• $P_1(\mathbf{t})$: steady-state probability that the system is in the normal state
• $P_2(\mathbf{t})$: steady-state probability that the system is in the checkpointing state
• $P_3(\mathbf{t})$: steady-state probability that the system is recovering from a system failure.

Then, if the ergodic probabilities $P_j(\mathbf{t})$ $(j = 1, 2, 3)$ exist, the expected reward per unit time in the steady state is given by

\[
R_i(\mathbf{t}) = a P_1(\mathbf{t}) + b P_2(\mathbf{t}) + c P_3(\mathbf{t})
\tag{28}
\]

for Model $i$ $(= A, B)$. It is clear that $R_i(\mathbf{t})$ reduces to $A_i(\mathbf{t})$ when $a = 1$, $b = 0$ and $c = 0$. In the following, we suppose that $a = 0$ to simplify the discussion. That is, for negative reward parameters $b$ and $c$, the expected reward can be regarded as the expected cost per unit time in the steady state. It should be noted that even for Model A one cannot apply Algorithm 0 under the expected reward criterion, because the CP sequence $\mathbf{t}$ appears in both the denominator and the numerator of the reward function. Hence, as in the maximization of the system availability for Model B, we develop a fixed-point type algorithm for the expected reward measure.

4.1. Model A

For Model A, the expected reward per unit time in the steady state is formulated as

\[
\begin{aligned}
R_A(\mathbf{t}) &= \frac{b \cdot c_0 E[N \mid \mathbf{t}] + c \cdot \{a_0 E[D \mid \mathbf{t}] + b_0\}}{T(\mathbf{t})} \\
&= \frac{b \cdot c_0 \left[ 1 + \sum_{k=1}^{n} \bar{F}(t_k) \right] + c \cdot \left[ a_0 \left( \mu^{-1} - \sum_{k=1}^{n} (t_k - t_{k-1}) \bar{F}(t_k) \right) + b_0 \right]}
{c_0 \left[ 1 + \sum_{k=1}^{n} \bar{F}(t_k) \right] + \mu^{-1} + a_0 \left[ \mu^{-1} - \sum_{k=1}^{n} (t_k - t_{k-1}) \bar{F}(t_k) \right] + b_0}.
\end{aligned}
\tag{29}
\]
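For a constant sequence $\{h_1, h_2, \ldots, h_i, \ldots\}$, define the function:

\[
\begin{aligned}
Q_A(h_i, \mathbf{t}) &= b \cdot c_0 E[N \mid \mathbf{t}] + c \cdot \{a_0 E[D \mid \mathbf{t}] + b_0\} - h_i T(\mathbf{t}) \\
&= b \cdot c_0 \left[ 1 + \sum_{k=1}^{n} \bar{F}(t_k) \right] + c \cdot \left\{ a_0 \left[ \mu^{-1} - \sum_{k=1}^{n} (t_k - t_{k-1}) \bar{F}(t_k) \right] + b_0 \right\} \\
&\quad - h_i \left\{ c_0 \left[ 1 + \sum_{k=1}^{n} \bar{F}(t_k) \right] + \mu^{-1} + a_0 \left[ \mu^{-1} - \sum_{k=1}^{n} (t_k - t_{k-1}) \bar{F}(t_k) \right] + b_0 \right\}.
\end{aligned}
\tag{30}
\]

Partially differentiating Eq. (30) with respect to $t_k$, we have

\[
t_k - t_{k-1} = \frac{F(t_{k+1}) - F(t_k)}{f(t_k)} + \frac{b c_0 (1 - h_i)}{c a_0 (c - h_i)},
\tag{31}
\]

where $t_k - t_{k-1} > c_0$ $(k = 1, 2, 3, \ldots)$. Based on Algorithm 1, we develop a numerical computation method that converges to the real optimal solution maximizing $R_A(\mathbf{t})$.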
Algorithm 2.
Step 1: Set $h_1 := 0$.
Step 2: For an arbitrary $h_i$, compute the CP sequence $\{t_1, t_2, t_3, \ldots\}$ by using Eq. (31) and Algorithm 0.
Step 3: Calculate $R_A(\mathbf{t})$. If $\arg\max_i Q_A(h_i, \mathbf{t}) < \varepsilon \approx 0$ then stop the procedure; otherwise go to Step 2 after updating $h_i$ with

\[
h_{i+1} = h_i + \frac{Q_A(h_i, \mathbf{t})}{T(\mathbf{t})}.
\tag{32}
\]
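Algorithms 1 and 2 share the same update rule, Eqs. (27) and (32): the current level is replaced by the ratio attained by the sequence that maximizes the auxiliary function. The following is a minimal sketch of that iteration for the availability case of Algorithm 1 (Model B), under two simplifying assumptions that are not part of the algorithms above: the CP sequence is restricted to periodic sequences $t_k = k t_1$, and Step 2 is replaced by a plain grid search over $t_1$ instead of Eq. (26) with Algorithm 0. The Weibull parameters follow the caption of Fig. 3; all names are hypothetical.

```python
import math

M, ETA = 2.0, 60.0                    # Weibull shape and scale (caption of Fig. 3)
C0, A0, B0 = 0.2, 0.2, 0.0
MTTF = ETA * math.gamma(1.0 + 1.0 / M)

def sf(t):
    return math.exp(-((t / ETA) ** M))

def num_den(t1, horizon=600.0):
    """Numerator and denominator of Eq. (21) for the periodic sequence t_k = k * t1."""
    times = [k * t1 for k in range(1, int(horizon / t1) + 1)]
    e_n = 1.0 + sum(sf(t) for t in times)
    e_c = (t1 + C0) * sf(t1 + C0) + sum(t1 * sf(t + C0) for t in times[1:])
    return MTTF - C0 * e_n, MTTF + A0 * (MTTF - e_c) + B0

GRID = [0.5 * j for j in range(2, 121)]           # candidate intervals 1.0, 1.5, ..., 60.0
TABLE = {t1: num_den(t1) for t1 in GRID}

def fixed_point(eps=1e-10, max_iter=100):
    level = 0.0                                   # Step 1: l_1 := 0
    best = GRID[0]
    for _ in range(max_iter):
        # Step 2 (simplified): maximize D(l, t) = numerator - l * denominator over the grid.
        best = max(GRID, key=lambda t1: TABLE[t1][0] - level * TABLE[t1][1])
        num, den = TABLE[best]
        d_value = num - level * den               # D(l_i, t) of Eq. (22)
        if d_value < eps:                         # Step 3: stopping rule
            break
        level += d_value / den                    # update of Eq. (27)
    return level, best

level, best_t1 = fixed_point()
print(f"availability level = {level:.6f} at t1 = {best_t1}")
```

Because $l_i + D(l_i, \mathbf{t})/T(\mathbf{t})$ equals the ratio in Eq. (21) evaluated at the current sequence, each pass simply re-evaluates the availability at the newest maximizer; on a finite grid the iteration therefore stops at the best ratio on that grid.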
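4.2. Model B

In a fashion similar to Model A, the expected reward per unit time in the steady state for Model B is given by

\[
\begin{aligned}
R_B(\mathbf{t}) &= \frac{b \cdot c_0 E[N \mid \mathbf{t}] + c \cdot \{a_0 E[D \mid \mathbf{t}] + b_0\}}{T(\mathbf{t})} \\
&= \frac{b \cdot c_0 \left[ 1 + \sum_{k=1}^{n} \bar{F}(t_k) \right] + c \cdot \left[ a_0 \left( \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right) \right) + b_0 \right]}
{\mu^{-1} + a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right) \right] + b_0}.
\end{aligned}
\tag{33}
\]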
Define the following function with a constant $h_i$:

\[
\begin{aligned}
Q_B(h_i, \mathbf{t}) &= b \cdot c_0 E[N \mid \mathbf{t}] + c \cdot \{a_0 E[D \mid \mathbf{t}] + b_0\} - h_i T(\mathbf{t}) \\
&= b \cdot c_0 \left[ 1 + \sum_{k=1}^{n} \bar{F}(t_k) \right] + c \cdot \left\{ a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right) \right] + b_0 \right\} \\
&\quad - h_i \left\{ \mu^{-1} + a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} (t_k - t_{k-1}) \bar{F}(t_k + c_0) \right) \right] + b_0 \right\}.
\end{aligned}
\tag{34}
\]

By differentiating Eq. (34) with respect to $t_k$, we get

\[
t_k - t_{k-1} = \frac{F(t_{k+1}) - F(t_k)}{f(t_k)} + \frac{b \cdot c_0 f(t_k - c_0)}{c \cdot a_0 (c - h_i) f(t_k)},
\tag{35}
\]

where $t_k - t_{k-1} > c_0$ $(k = 1, 2, 3, \ldots)$. The computation algorithm for Model B is given by replacing $Q_A(h_i, \mathbf{t})$ by $Q_B(h_i, \mathbf{t})$ in Algorithm 2.

5. Approximate algorithms

5.1. Exponential approximation

For Model A under the availability criterion, it is well known that the optimal CP interval is constant, i.e., $t_1 = t_2 - t_1 = \cdots = t_{k+1} - t_k = \cdots$, if $F(t)$ is the exponential distribution with mean $1/\mu$. Under the assumptions that $a_0 = 1$ and $b_0 = 0$, Young [6] considered the checkpoint restart model with constant CP interval and exponential system failure time distribution, and derived the following non-linear equation satisfied by the optimal CP interval $t_1$:

\[
e^{c_0 \mu} - t_1 \mu - e^{-t_1 \mu} = 0.
\tag{36}
\]
Based on the second order approximation $\exp(-\mu t) \approx 1 - \mu t + \mu^2 t^2/2$, he obtained the approximate form of the optimal CP interval:

\[
t_1 \approx \frac{\sqrt{2 (e^{c_0 \mu} - 1)}}{\mu} \approx \sqrt{\frac{2 c_0}{\mu}},
\tag{37}
\]

where the last step is due to $\exp(c_0 \mu) \approx 1 + c_0 \mu$. For the general system failure time distribution $F(t)$, the simplest method is to approximate $F(t)$ with an exponential distribution $F_e(t) = 1 - \exp\{-\mu t\}$, where the parameter $\mu$ is determined by the MTTF of $F(t)$, say, $1/\mu$. Since the resulting optimal CP sequence is given by the constant sequence $k t_1$ $(k = 1, 2, \ldots)$ maximizing the system availability with $F_e(t)$, the optimal $t_1^*$ is the unique solution of Eq. (36). For Model B with the system availability and the other cases with expected reward, the corresponding dependability measures are defined by $A_B^e(t_1)$ and $R_i^e(t_1)$ $(i = A, B)$ as univariate functions of $t_1$. That is, after a few algebraic manipulations, we have

\[
A_B^e(t_1) = \frac{\dfrac{1}{\mu} - c_0 \left[ \dfrac{1}{1 - e^{-\mu t_1}} \right]}
{\dfrac{1}{\mu} + a_0 \left[ \dfrac{1}{\mu} - \left( (t_1 + c_0) e^{-\mu (t_1 + c_0)} + \dfrac{t_1 e^{-\mu (2 t_1 + c_0)}}{1 - e^{-\mu t_1}} \right) \right] + b_0},
\tag{38}
\]

\[
R_A^e(t_1) = \frac{b \cdot c_0 \left[ \dfrac{1}{1 - e^{-\mu t_1}} \right] + c \cdot \left[ a_0 \left( \dfrac{1}{\mu} - \dfrac{t_1 e^{-\mu t_1}}{1 - e^{-\mu t_1}} \right) + b_0 \right]}
{c_0 \left[ \dfrac{1}{1 - e^{-\mu t_1}} \right] + \dfrac{1}{\mu} + a_0 \left[ \dfrac{1}{\mu} - \dfrac{t_1 e^{-\mu t_1}}{1 - e^{-\mu t_1}} \right] + b_0},
\tag{39}
\]

\[
R_B^e(t_1) = \frac{b \cdot c_0 \left[ \dfrac{1}{1 - e^{-\mu t_1}} \right] + c \cdot \left[ a_0 \left( \dfrac{1}{\mu} - \left( (t_1 + c_0) e^{-\mu (t_1 + c_0)} + \dfrac{t_1 e^{-\mu (2 t_1 + c_0)}}{1 - e^{-\mu t_1}} \right) \right) + b_0 \right]}
{\dfrac{1}{\mu} + a_0 \left[ \dfrac{1}{\mu} - \left( (t_1 + c_0) e^{-\mu (t_1 + c_0)} + \dfrac{t_1 e^{-\mu (2 t_1 + c_0)}}{1 - e^{-\mu t_1}} \right) \right] + b_0}.
\tag{40}
\]
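The relation between the exact root of Eq. (36) and the closed form in Eq. (37) is easy to check numerically. The sketch below solves Eq. (36) by bisection and compares the result with $\sqrt{2 c_0/\mu}$; the value of $\mu$ is an assumption chosen to match the MTTF of the Weibull distribution with $m = 2.0$ and $\eta = 60.0$ from the caption of Fig. 3, and the function names are hypothetical.

```python
import math

def young_equation(t1, c0, mu):
    """Left-hand side of Eq. (36): exp(c0*mu) - t1*mu - exp(-t1*mu)."""
    return math.exp(c0 * mu) - t1 * mu - math.exp(-t1 * mu)

def young_interval(c0, mu, tol=1e-12):
    """Root of Eq. (36) by bisection; the left-hand side is strictly decreasing for t1 > 0."""
    lo, hi = 0.0, 10.0 / mu           # positive at lo, negative at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if young_equation(mid, c0, mu) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c0 = 0.2
mu = 1.0 / (60.0 * math.gamma(1.5))   # assumed: 1/mu equals the Weibull MTTF
exact = young_interval(c0, mu)
approx = math.sqrt(2.0 * c0 / mu)     # Eq. (37)
print(f"root of Eq. (36): {exact:.4f}, Eq. (37): {approx:.4f}")
```

For these values the two intervals differ by only a few percent, which is consistent with the second order expansion being accurate when $\mu t_1$ is small.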
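5.2. Constant-sequence approximation

Next, we treat the general system failure time distribution $F(t)$ but restrict our attention to the periodic CP interval $t_1$. In this case, the dependability measures under consideration can be approximated by

\[
A_A^c(t_1) = \frac{\mu^{-1}}
{c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(k t_1) \right] + \mu^{-1} + a_0 \left[ \mu^{-1} - \sum_{k=1}^{\infty} t_1 \bar{F}(k t_1) \right] + b_0},
\tag{41}
\]

\[
A_B^c(t_1) = \frac{\mu^{-1} - c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(k t_1) \right]}
{\mu^{-1} + a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} t_1 \bar{F}(k t_1 + c_0) \right) \right] + b_0},
\tag{42}
\]

\[
R_A^c(t_1) = \frac{b \cdot c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(k t_1) \right] + c \cdot \left[ a_0 \left( \mu^{-1} - \sum_{k=1}^{\infty} t_1 \bar{F}(k t_1) \right) + b_0 \right]}
{c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(k t_1) \right] + \mu^{-1} + a_0 \left[ \mu^{-1} - \sum_{k=1}^{\infty} t_1 \bar{F}(k t_1) \right] + b_0},
\tag{43}
\]

\[
R_B^c(t_1) = \frac{b \cdot c_0 \left[ 1 + \sum_{k=1}^{\infty} \bar{F}(k t_1) \right] + c \cdot \left[ a_0 \left( \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} t_1 \bar{F}(k t_1 + c_0) \right) \right) + b_0 \right]}
{\mu^{-1} + a_0 \left[ \mu^{-1} - \left( (t_1 + c_0) \bar{F}(t_1 + c_0) + \sum_{k=2}^{\infty} t_1 \bar{F}(k t_1 + c_0) \right) \right] + b_0}.
\tag{44}
\]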
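Since the measures above are univariate functions of $t_1$, the sub-optimal periodic interval can be located with any one-dimensional search. The sketch below evaluates Eq. (41) for the Weibull distribution with the parameter values from the caption of Fig. 3 and picks the best interval on a grid. The grid search, the truncation tolerance and the function names are assumptions made for illustration, not the procedure used for the numerical examples in this paper.

```python
import math

M, ETA = 2.0, 60.0                    # Weibull shape and scale (caption of Fig. 3)
C0, A0, B0 = 0.2, 0.2, 0.0
MTTF = ETA * math.gamma(1.0 + 1.0 / M)

def sf(t):
    return math.exp(-((t / ETA) ** M))

def availability_periodic(t1, tol=1e-15):
    """A_A^c(t1) of Eq. (41); the sums are truncated once the survival function drops below tol."""
    total, k = 0.0, 1
    while True:
        term = sf(k * t1)
        if term < tol:
            break
        total += term
        k += 1
    cycle = C0 * (1.0 + total) + MTTF + A0 * (MTTF - t1 * total) + B0
    return MTTF / cycle

grid = [0.01 * j for j in range(50, 3001)]        # candidate intervals 0.5, 0.51, ..., 30.0
best_t1 = max(grid, key=availability_periodic)
best_value = availability_periodic(best_t1)
print(f"best periodic interval ~ {best_t1:.2f}, availability ~ {best_value:.5f}")
```

The same pattern applies to Eqs. (42) to (44) by swapping in the corresponding numerator and denominator.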
When $F(t)$ is replaced by $F_e(t)$, Eqs. (42)–(44) reduce to Eqs. (38)–(40), respectively.

5.3. Constant-hazard approximation

The third approximation method was proposed by Kaio and Osaki [23] and can be applied only to Model A. In this approximate scheme, the conditional probability that the system failure occurs during $(t_{k-1}, t_k]$ $(k = 1, 2, \ldots)$ is approximated by a constant $p \in (0, 1)$ satisfying

\[
\frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})} = p.
\tag{45}
\]

This assumption can be validated when the time interval between successive CPs is relatively small. Since the system failure time distribution is given by

\[
F(t_k) = 1 - (1 - p)^k,
\tag{46}
\]

the CP sequence is determined by $t_k^* = F^{-1}(1 - \{1 - p^*\}^k)$. Using this approximation, the steady-state system availability for Model A is represented as a function of $p$ as follows:

\[
A_A(p) = \frac{\mu^{-1}}
{c_0/p + \mu^{-1} + a_0 \left[ \mu^{-1} - \sum_{i=1}^{\infty} F^{-1}\!\left(1 - (1 - p)^{i-1}\right) p (1 - p)^{i} \right] + b_0},
\tag{47}
\]

if the inverse function $F^{-1}(\cdot)$ exists.

Fig. 3. Optimal CP sequence under the availability in the steady state (Model A): m = 2.0, η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

Also, we obtain the expected reward rate by

\[
R_A(p) = \frac{b \cdot c_0/p + c \cdot \left[ a_0 \left( \mu^{-1} - \sum_{i=1}^{\infty} F^{-1}\!\left(1 - (1 - p)^{i-1}\right) p (1 - p)^{i} \right) + b_0 \right]}
{c_0/p + \mu^{-1} + a_0 \left[ \mu^{-1} - \sum_{i=1}^{\infty} F^{-1}\!\left(1 - (1 - p)^{i-1}\right) p (1 - p)^{i} \right] + b_0}.
\tag{48}
\]
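For a distribution with a closed-form inverse, the CP sequence implied by Eq. (46) can be written down directly. The sketch below does this for the Weibull distribution, for which $F^{-1}(u) = \eta (-\ln(1 - u))^{1/m}$ and hence $t_k = \eta (-k \ln(1 - p))^{1/m}$. The value $p = 0.05$ and the function names are hypothetical and only illustrate the construction; the optimal $p^*$ would still have to be found by maximizing $A_A(p)$ or $R_A(p)$.

```python
import math

def weibull_cdf(t, m=2.0, eta=60.0):
    return 1.0 - math.exp(-((t / eta) ** m))

def constant_hazard_sequence(p, n, m=2.0, eta=60.0):
    """CP times t_k = F^{-1}(1 - (1 - p)^k), k = 1..n, for the Weibull distribution."""
    return [eta * (-k * math.log(1.0 - p)) ** (1.0 / m) for k in range(1, n + 1)]

p = 0.05                               # hypothetical conditional failure probability
times = constant_hazard_sequence(p, n=10)
intervals = [b - a for a, b in zip([0.0] + times[:-1], times)]
print([round(t, 2) for t in times])
```

With shape parameter $m = 2$ the times grow like $\sqrt{k}$, so the intervals shrink as $k$ increases, in line with the non-increasing interval property stated for PF$_2$ distributions in Section 3.1.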
The resulting sub-optimal checkpoint sequence can be derived from Eq. (46) with the optimal $p^*$ maximizing $A_A(p)$ or $R_A(p)$.

6. Numerical examples

In this section we numerically calculate the exact and approximate optimal CP sequences, and compare them in terms of the dependability measures. We refer to the approximations based on the exponential distribution, the constant CP interval with a general distribution, and the constant hazard as Approximation 1, Approximation 2 and Approximation 3, respectively. Suppose that the system failure time obeys the Weibull distribution:

\[ F(t) = 1 - e^{-(t/\eta)^m} \tag{49} \]
with shape parameter $m$ (> 0) and scale parameter $\eta$ (> 0). We show the results on the system availability and the expected reward per unit time in the steady state. Throughout the numerical examples, we set the design parameter of the algorithms to $10^{-10}$.

6.1. System availability

First, we show the results on the system availability in the steady state. Figs. 3 and 4 compare the optimal and sub-optimal CP sequences. From Fig. 3, Approximation 3 overestimates the optimal CP interval in the initial phase and eventually underestimates it in the latter phase. On the other hand, from Figs. 3 and 4, the sub-optimal CP sequences based on Approximation 1 and Approximation 2 underestimate the real optimal solution in the early phase, but tend to overestimate it in the latter phase. Tables 1 and 2 present the dependence of the maximum system availability on the shape parameter of the Weibull distribution in the respective cases. The approximate availability is calculated by substituting the approximate CP sequence into Eq. (18) or (21), and the relative error is defined by

\[ \text{Relative error (\%)} = \frac{|\text{approximation} - \text{maximum availability}|}{|\text{maximum availability}|} \times 100. \tag{50} \]
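As a small arithmetic check of Eq. (50), the sketch below recomputes the relative errors for Model A with m = 4.5 from the six-digit availabilities quoted in the conclusion (0.961474, 0.961492 and 0.955153 against the exact maximum 0.968421). The helper name `relative_error_pct` is an assumption introduced for illustration only.

```python
def relative_error_pct(approx, exact):
    """Relative error in percent, following the definition in Eq. (50)."""
    return abs(approx - exact) / abs(exact) * 100.0

exact = 0.968421  # exact maximum availability, Model A, m = 4.5
for name, approx in [("Approximation 1", 0.961474),
                     ("Approximation 2", 0.961492),
                     ("Approximation 3", 0.955153)]:
    print(name, round(relative_error_pct(approx, exact), 3))
```

The tables list these errors with a negative sign, presumably because each approximate availability lies below the exact maximum, whereas Eq. (50) is stated with absolute values.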
In general, the availability requirement for real mission-critical systems is said to be over five nines, i.e., 99.999%, so the relative error of 0.06% for m = 1.5 in Table 1 is not negligible. As the shape parameter in the Weibull distribution increases, the failure rate $r(t) = (m/\eta)(t/\eta)^{m-1}$ monotonically increases in $m$ (> 1) and the MTTF $= 1/\mu = \eta\,\Gamma(1 + 1/m)$ decreases, where $\Gamma(\cdot)$ is the standard gamma function. For a larger shape parameter, the difference is significant if one applies the approximation methods. When $m = 1$, i.e., the system failure time is an exponentially distributed random variable, the resulting CP sequences and their associated system availability take the same values as those based on Approximations
Fig. 4. Optimal CP sequence under the availability in the steady state (Model B): m = 2.0, η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

Table 1
Dependence of shape parameter on the availability in the steady state (Model A): η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1            Approximation 2            Approximation 3                      Exact
m       Availability  Error (%)    Availability  Error (%)    p∗        Availability  Error (%)   availability
1.0     0.96371        0.000       0.96371        0.000       0.17163   0.96371        0.000      0.96371
1.5     0.96133       −0.061       0.96134       −0.060       0.17149   0.96051       −0.147      0.96192
2.0     0.96090       −0.177       0.96092       −0.175       0.16625   0.95877       −0.398      0.96261
2.5     0.96092       −0.302       0.96094       −0.300       0.16132   0.95761       −0.646      0.96383
3.0     0.96105       −0.421       0.96107       −0.419       0.15731   0.95676       −0.866      0.96512
3.5     0.96120       −0.530       0.96122       −0.528       0.15408   0.95610       −1.058      0.96632
4.0     0.96134       −0.629       0.96136       −0.627       0.15148   0.95558       −1.224      0.96743
4.5     0.96147       −0.717       0.96149       −0.715       0.14935   0.95515       −1.370      0.96842
Table 2
Dependence of shape parameter on the availability in the steady state (Model B): η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1            Approximation 2            Exact
m       Availability  Error (%)    Availability  Error (%)    availability
1.0     0.96297        0.000       0.96297        0.000       0.96297
1.5     0.96047       −0.063       0.96049       −0.061       0.96107
2.0     0.96002       −0.181       0.96004       −0.179       0.96177
2.5     0.96004       −0.309       0.96006       −0.307       0.96302
3.0     0.96018       −0.432       0.96020       −0.429       0.96434
3.5     0.96033       −0.543       0.96035       −0.541       0.96558
4.0     0.96048       −0.644       0.96050       −0.642       0.96671
4.5     0.96062       −0.735       0.96064       −0.733       0.96773
1–3. That is, all the approximation methods are consistent in the case of an exponential system failure time. However, as the MTTF decreases and the system tends to be more unreliable, the difference from the real optimal solution becomes remarkable for all approximation methods. Tables 3 and 4 present the dependence on the scale parameter. In Tables 3 and 4, we can see that the system availability monotonically increases as the scale parameter increases. This is because the MTTF increases as the scale parameter increases. On the other hand, the errors of Approximations 1–3 decrease as the scale parameter increases. From these results, it is seen that the system availability in the steady state depends monotonically on the shape and scale parameters for both the exact and approximate methods.

6.2. Expected reward

Next, we show the results on the expected reward per unit time in the steady state. Figs. 5 and 6 compare the optimal and sub-optimal CP sequences. From Figs. 5 and 6, the trend of the CP sequences based on Approximations 1–3 is the same as in the case of the system availability. Tables 5 and 6 present the dependence of the minimum expected operation costs (maximized negative rewards) on the shape parameter of the Weibull distribution in the respective cases, where each reward in the column is given by the absolute value of the negative reward (cost). The approximate expected operation cost is calculated by substituting the approximate CP
Table 3
Dependence of scale parameter on the availability in the steady state (Model A): m = 2.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1            Approximation 2            Approximation 3                      Exact
η       Availability  Error (%)    Availability  Error (%)    p∗        Availability  Error (%)   availability
55      0.95915       −0.181       0.95917       −0.179       0.17303   0.95692       −0.413      0.96089
60      0.96090       −0.177       0.96092       −0.175       0.16625   0.95877       −0.398      0.96261
65      0.96245       −0.173       0.96246       −0.171       0.16022   0.96040       −0.385      0.96412
70      0.96382       −0.170       0.96383       −0.168       0.15483   0.96186       −0.373      0.96546
75      0.96505       −0.167       0.96507       −0.165       0.14996   0.95515       −1.191      0.96666
80      0.96617       −0.164       0.96618       −0.162       0.14555   0.96435       −0.352      0.96775
85      0.96719       −0.161       0.96720       −0.159       0.14151   0.96542       −0.343      0.96874
90      0.96812       −0.158       0.96813       −0.157       0.13780   0.96641       −0.334      0.96965
95      0.96897       −0.155       0.96898       −0.154       0.13438   0.96732       −0.326      0.97048
100     0.96976       −0.153       0.96977       −0.152       0.13121   0.96815       −0.318      0.97125
Table 4
Dependence of scale parameter on the availability in the steady state (Model B): m = 2.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1            Approximation 2            Exact
η       Availability  Error (%)    Availability  Error (%)    availability
55      0.95819       −0.185       0.95821       −0.183       0.95997
60      0.96002       −0.181       0.96004       −0.179       0.96177
65      0.96164       −0.177       0.96166       −0.175       0.96335
70      0.96307       −0.174       0.96309       −0.172       0.96475
75      0.96436       −0.170       0.96438       −0.169       0.96601
80      0.96552       −0.167       0.96554       −0.166       0.96714
85      0.96658       −0.164       0.96659       −0.163       0.96817
90      0.96754       −0.161       0.96756       −0.160       0.96911
95      0.96843       −0.159       0.96844       −0.158       0.96997
100     0.96925       −0.156       0.96926       −0.155       0.97077
Fig. 5. Optimal CP sequence under the expected reward per unit time in the steady state (Model A): m = 2.0, η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.
sequence into Eq. (29) or (33). Here, the relative error on the expected cost is defined by

\[ \text{Relative error (\%)} = \frac{|\text{approximate} - \text{minimum cost}|}{|\text{minimum cost}|} \times 100. \tag{51} \]
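As a quick arithmetic check of Eq. (51), take the m = 4.5 row of Table 5 (Model A), where the exact minimum cost is 0.12090 and Approximations 1 and 3 give 0.15780 and 0.17917, respectively. Using these rounded table entries,

\[
\frac{|0.15780 - 0.12090|}{|0.12090|} \times 100 \approx 30.5,
\qquad
\frac{|0.17917 - 0.12090|}{|0.12090|} \times 100 \approx 48.2,
\]

which agrees with the tabulated errors of 30.524% and 48.200% up to the rounding of the five-digit entries.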
From Tables 5 and 6, the errors of Approximations 1–3 increase as the shape parameter increases, similar to Tables 1 and 2. However, the increasing trend is far more pronounced in this case: when m = 4.5, the approximation errors reach 30%–50%. Also, Tables 7 and 8 show the dependence on the scale parameter. In Tables 7 and 8, as the scale parameter increases, both exact and approximate cost functions monotonically decrease, but the relative errors between them increase. On the other hand, Tables 9 and 10 present the dependence on the reward parameter per unit recovery time. From these results, the errors of Approximations 1–3 increase as the reward parameter increases, apart from a slight initial decrease for Approximation 3 in Table 9. Finally, we conclude that the expected reward in the steady state depends monotonically on the shape, scale and reward parameters for both the exact and approximate methods. This trend is quite similar to the case of the system availability.
Fig. 6. Optimal CP sequence under the expected reward per unit time in the steady state (Model B): m = 2.0, η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

Table 5
Dependence of shape parameter on the expected reward (cost) per unit time in the steady state (Model A): η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1        Approximation 2        Approximation 3                  Exact
m       Reward    Error (%)    Reward    Error (%)    p∗        Reward    Error (%)    reward
1.0     0.15086    0.000       0.15086    0.000       0.03729   0.15086    0.000       0.15086
1.5     0.15855    4.863       0.15856    4.868       0.03806   0.16042    6.097       0.15120
2.0     0.15994    7.593       0.15993    7.591       0.03711   0.16535   11.235       0.14865
2.5     0.15985   11.846       0.15985   11.844       0.03593   0.16909   18.308       0.14292
3.0     0.15938   16.570       0.15938   16.568       0.03485   0.17221   25.951       0.13672
3.5     0.15883   21.341       0.15883   21.339       0.03394   0.17487   33.597       0.13089
4.0     0.15829   26.007       0.15829   26.004       0.03316   0.17717   41.036       0.12562
4.5     0.15780   30.524       0.15780   30.521       0.03252   0.17917   48.200       0.12090
Table 6
Dependence of shape parameter on the expected reward (cost) per unit time in the steady state (Model B): η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1         Approximation 2         Exact
m       Reward     Error (%)    Reward     Error (%)    reward
1.0     0.164011    0.000       0.164011    0.000       0.164011
1.5     0.173257    4.217       0.173253    4.214       0.166247
2.0     0.174914    7.673       0.174910    7.671       0.162449
2.5     0.174812   12.551       0.174807   12.548       0.155318
3.0     0.174245   17.752       0.174241   17.749       0.147976
3.5     0.173582   22.915       0.173578   22.912       0.141221
4.0     0.172937   27.929       0.172933   27.926       0.135182
4.5     0.172346   32.768       0.172341   32.764       0.129810
Table 7
Dependence of scale parameter on the expected reward (cost) in the steady state (Model A): m = 2.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1        Approximation 2        Approximation 3                  Exact
η       Reward    Error (%)    Reward    Error (%)    p∗        Reward    Error (%)    reward
55      0.16644    7.278       0.16644    7.276       0.03855   0.17213   10.946       0.15515
60      0.15994    7.593       0.15993    7.591       0.03711   0.16535   11.235       0.14865
65      0.15416    7.946       0.15415    7.944       0.03580   0.15933   11.566       0.14281
70      0.14898    8.339       0.14897    8.337       0.03466   0.15392   11.938       0.13751
75      0.14429    8.770       0.14429    8.768       0.03361   0.14905   12.352       0.13266
80      0.14004    9.237       0.14004    9.236       0.03266   0.14461   12.805       0.12820
85      0.13614    9.740       0.13614    9.738       0.03179   0.14055   13.295       0.12406
90      0.13256   10.276       0.13256   10.274       0.03098   0.13683   13.821       0.12021
95      0.12926   10.844       0.12926   10.842       0.03024   0.13338   14.381       0.116613
100     0.12619   11.441       0.12619   11.439       0.02955   0.13019   14.970       0.113238
Table 8
Dependence of scale parameter on the expected reward (cost) in the steady state (Model B): m = 2.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0, c = −20.0.

        Approximation 1        Approximation 2        Exact
η       Reward    Error (%)    Reward    Error (%)    reward
55      0.18278    7.461       0.18277    7.458       0.17009
60      0.17491    7.673       0.17491    7.671       0.16245
65      0.16799    7.923       0.16798    7.920       0.15565
70      0.16182    8.210       0.16182    8.208       0.14954
75      0.15629    8.537       0.15628    8.534       0.14399
80      0.15128    8.901       0.15128    8.899       0.13892
85      0.14673    9.305       0.14672    9.303       0.13424
90      0.14256    9.747       0.14256    9.744       0.12990
95      0.13873   10.224       0.13873   10.222       0.12586
100     0.13519   10.736       0.13519   10.734       0.12209
Table 9
Dependence of the reward per unit recovery time on the expected reward (cost) in the steady state (Model A): m = 2.0, η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0.

        Approximation 1        Approximation 2        Approximation 3                  Exact
c       Reward    Error (%)    Reward    Error (%)    p∗        Reward    Error (%)    reward
5       0.08404    6.091       0.08403    6.082       0.07555   0.08784   10.892       0.07921
10      0.1163     6.500       0.11630    6.495       0.05320   0.12091   10.724       0.10920
15      0.14026    6.860       0.14026    6.857       0.04315   0.14535   10.736       0.13126
20      0.15994    7.593       0.15993    7.591       0.03711   0.16535   11.235       0.14865
25      0.17688    8.835       0.17688    8.833       0.03297   0.18254   12.315       0.16252
30      0.19189   10.555       0.19189   10.554       0.02990   0.19774   13.922       0.17357
35      0.20544   12.656       0.20544   12.655       0.02751   0.21144   15.943       0.18236
40      0.21784   15.040       0.21784   15.039       0.02558   0.22396   18.271       0.18936
45      0.2293    17.622       0.22929   17.621       0.02398   0.23552   20.814       0.194943
50      0.23997   20.335       0.23997   20.334       0.02263   0.24628   23.501       0.199415
Table 10
Dependence of the reward per unit recovery time on the expected reward (cost) in the steady state (Model B): m = 2.0, η = 60.0, a0 = 0.2, b0 = 0.0, c0 = 0.2, b = −1.0.

        Approximation 1        Approximation 2        Exact
c       Reward    Error (%)    Reward    Error (%)    reward
5       0.08802    6.250       0.08801    6.23919     0.08284
10      0.12401    6.785       0.12401    6.77935     0.11613
15      0.15163    7.147       0.15163    7.1426      0.14152
20      0.17491    7.673       0.17491    7.67072     0.16245
25      0.19543    8.516       0.19542    8.51403     0.18009
30      0.21397    9.707       0.21397    9.70514     0.19504
35      0.23103   11.220       0.23102   11.2184      0.20772
40      0.24690   13.010       0.24690   13.0082      0.21848
45      0.26181   15.028       0.26181   15.0263      0.22761
50      0.27591   17.230       0.27591   17.228       0.23536
7. Conclusion

In this paper, we have developed numerical computation algorithms for sequential checkpoint placement, so as to maximize the steady-state system availability and the expected reward per unit time in the steady state, and have numerically compared the real optimal solutions with some approximate ones. The lesson learned from the numerical study is that the three approximation methods provide rather different checkpoint sequences, with larger errors in the earlier operational phase, as the shape parameter of the Weibull distribution increases and the corresponding MTTF becomes shorter. In other words, as the IFR property becomes more pronounced, the sub-optimal checkpointing policies developed in the past literature do not function well. In fact, for Model A with m = 4.5, the system availabilities for Approximations 1–3 are 0.961474, 0.961492 and 0.955153, respectively. Since the exact maximum availability is 0.968421, the relative errors for the respective approximation methods are 0.717%, 0.715% and 1.370%. Consequently, these approximation methods cannot be used for mission-critical systems with higher availability requirements. Hence, the numerical computation algorithms for the optimal checkpoint sequence will be useful for backing up the information in general file systems. Although constant checkpoint placement hand-tuned by a system expert is very often employed in industry, such heuristic checkpointing is not always optimal in terms of system availability and performability.

In the future, the same idea will be applied to distributed checkpointing systems. For instance, Soliman and Elmaghraby [36] considered an uncoordinated checkpointing protocol with periodic time interval to minimize the mean
overhead to execute a finite length task. This will be extended to an aperiodic checkpointing scheme by an algorithmic approach under the general failure time distribution assumption. In a fashion similar to this paper, when the checkpoint sequence is determined under the system availability and the expected reward rate per unit time in the steady state, Brender’s fixed-point type algorithm [40] will be useful for calculating the optimal solution iteratively.

References

[1] K.M. Chandy, A survey of analytic models of roll-back and recovery strategies, Computer 8 (5) (1975) 40–47.
[2] K.M. Chandy, J.C. Browne, C.W. Dissly, W.R. Uhrig, Analytic models for rollback and recovery strategies in database systems, IEEE Transactions on Software Engineering SE-1 (1) (1975) 100–110.
[3] G.M. Lohman, J.A. Muckstadt, Optimal policy for batch operations: Backup, checkpointing, reorganization and updating, ACM Transactions on Database Systems 2 (3) (1977) 209–222.
[4] V.F. Nicola, Checkpointing and modeling of program execution time, in: M.R. Lyu (Ed.), Software Fault Tolerance, John Wiley & Sons, New York, 1995, pp. 167–188.
[5] A.N. Tantawi, M. Ruschitzka, Performance analysis of checkpointing strategies, ACM Transactions on Computer Systems 2 (2) (1984) 123–144.
[6] J.W. Young, A first order approximation to the optimum checkpoint interval, Communications of ACM 17 (9) (1974) 530–531.
[7] F. Baccelli, Analysis of a service facility with periodic checkpointing, Acta Informatica 15 (1981) 67–81.
[8] T. Dohi, N. Kaio, K.S. Trivedi, Availability models with age dependent-checkpointing, in: Proceedings of 21st Symposium on Reliable Distributed Systems, IEEE CS Press, 2002, pp. 130–139.
[9] E. Gelenbe, D. Derochette, Performance of rollback recovery systems under intermittent failures, Communications of the ACM 21 (6) (1978) 493–499.
[10] E. Gelenbe, On the optimum checkpoint interval, Journal of the ACM 26 (2) (1979) 259–270.
[11] E. Gelenbe, M. Hernandez, Optimum checkpoints with age dependent failures, Acta Informatica 27 (1990) 519–531.
[12] P.B. Goes, U. Sumita, Stochastic models for performance analysis of database recovery control, IEEE Transactions on Computers C-44 (4) (1995) 561–576.
[13] V. Grassi, L. Donatiello, S. Tucci, On the optimal checkpointing of critical tasks and transaction-oriented systems, IEEE Transactions on Software Engineering SE-18 (1) (1992) 72–77.
[14] V.G. Kulkarni, V.F. Nicola, K.S. Trivedi, Effects of checkpointing and queueing on program performance, Stochastic Models 6 (4) (1990) 615–648.
[15] V.F. Nicola, J.M. Van Spanje, Comparative analysis of different models of checkpointing and recovery, IEEE Transactions on Software Engineering SE-16 (8) (1990) 807–821.
[16] U. Sumita, N. Kaio, P.B. Goes, Analysis of effective service time with age dependent interruptions and its application to optimal rollback policy for database management, Queueing Systems 4 (1989) 193–212.
[17] P. L’Ecuyer, J. Malenfant, Computing optimal checkpointing strategies for rollback and recovery systems, IEEE Transactions on Computers C-37 (4) (1988) 491–496.
[18] A. Ziv, J. Bruck, An on-line algorithm for checkpoint placement, IEEE Transactions on Computers C-46 (9) (1997) 976–985.
[19] N.H. Vaidya, Impact of checkpoint latency on overhead ratio of a checkpointing scheme, IEEE Transactions on Computers C-46 (8) (1997) 942–947.
[20] H. Okamura, Y. Nishimura, T. Dohi, A dynamic checkpointing scheme based on reinforcement learning, in: Proceedings of 2004 Pacific Rim International Symposium on Dependable Computing, IEEE CS Press, 2004, pp. 151–158.
[21] A. Duda, The effects of checkpointing on program execution time, Information Processing Letters 16 (5) (1983) 221–229.
[22] S. Toueg, Ö. Babaoğlu, On the optimum checkpoint selection problem, SIAM Journal of Computing 13 (3) (1984) 630–649.
[23] N. Kaio, N. Osaki, A note on optimum checkpointing policies, Microelectronics and Reliability 25 (1985) 451–453.
[24] Y. Ling, J. Mi, X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Transactions on Computers 50 (7) (2001) 699–707.
[25] D. Long, A. Muir, R. Golding, A longitudinal survey of internet host reliability, in: Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, IEEE CS Press, 1995, pp. 2–9.
[26] V. Castelli, R.E. Harper, P. Heidelberger, S.W. Hunter, K.S. Trivedi, K. Vaidyanathan, W.P. Zeggert, Proactive management of software aging, IBM Journal of Research & Development 45 (2001) 311–332.
[27] E.N. Elnozahy, L. Alvisi, Y.M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Survey 34 (3) (2002) 375–408.
[28] K.F. Wong, M. Franklin, Checkpointing in distributed systems, Journal of Parallel and Distributed Systems 35 (1996) 67–75.
[29] J.S. Plank, M.G. Thomaso, Processor allocation and checkpoint interval selection in cluster computing systems, Journal of Parallel and Distributed Computing 61 (11) (2001) 1570–1590.
[30] K. Li, J.F. Naughton, J.S. Plank, Low-latency concurrent checkpointing for parallel programs, IEEE Transactions on Parallel and Distributed Systems 5 (8) (1994) 874–879.
[31] J.S. Plank, K. Li, M.A. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems 9 (10) (1998) 972–986.
[32] N.H. Vaidya, Staggered consistent checkpointing, IEEE Transactions on Parallel and Distributed Systems 10 (7) (1999) 694–702.
[33] A. Agbaria, A. Freund, R. Friedman, Evaluating distributed checkpointing protocols, in: Proceedings of the 23rd International Conference on Distributed Computing Systems, IEEE CS Press, 2003, pp. 266–273.
[34] A. Agbaria, H. Attiya, R. Friedman, R. Vitenberg, Quantifying rollback propagation in distributed checkpointing, in: Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems, IEEE CS Press, 2001, pp. 36–45.
[35] Y. Wang, P. Chung, I. Lin, W. Fuchs, Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems, IEEE Transactions on Parallel and Distributed Systems 6 (5) (1995) 546–554.
[36] H.M. Soliman, A.S. Elmaghraby, An analytical model for hybrid checkpointing in time warp distributed simulation, IEEE Transactions on Parallel and Distributed Systems 9 (10) (1998) 947–951.
[37] K. Goševa-Popstojanova, K.S. Trivedi, Stochastic modeling formalisms for dependability, performance and performability, in: G. Haring, C. Lindemann, M. Reiser (Eds.), Performance Evaluation – Origins and Directions, in: LNCS, vol. 1769, Springer-Verlag, Berlin, 2000, pp. 385–404.
[38] J.F. Meyer, On evaluating the performability of degradable computer systems, IEEE Transactions on Computers C-29 (8) (1981) 720–731.
[39] T. Ozaki, T. Dohi, H. Okamura, N. Kaio, Distribution-free checkpoint placement algorithms based on min–max principle, IEEE Transactions on Dependable and Secure Computing 3 (2) (2006) 130–140.
[40] D.M. Brender, A surveillance model for recurrent events, IBM Watson Research Center Report, 1963.
[41] T. Dohi, H. Okamura, N. Kaio, Optimal age-dependent checkpoint strategy with retry of rollback recovery, in: Proceedings of the 2nd IEEE Computer Society International Workshop on Autonomous Decentralized Systems, IEEE CS Press, 2002, pp. 113–118.
[42] S. Fukumoto, S. Nakagawa, N. Kaio, S. Osaki, Optimum checkpoint policies attending with unsuccessful rollback recovery, International Journal of Reliability, Quality and Safety Engineering 4 (4) (1997) 427–439.
[43] R.E. Barlow, F. Proschan, Mathematical Theory of Reliability, SIAM, Philadelphia, 1996.
Tatsuya Ozaki received the B.S.E. and M.S. from Hiroshima University, Japan, in 2001 and 2005, respectively. In 2005, he joined NTT Facilities, Inc., Japan as a Technical Staff. His research interests are dependable computing and performance evaluation. His papers appeared in IEEE Transactions on Dependable and Secure Computing and several major conferences like DSN 2004, DASC 2006, etc.

Tadashi Dohi received the B.S.E., M.S. and Dr. of Engineering degrees from Hiroshima University, Japan, in 1989, 1991 and 1995, respectively. In 1992, he joined the Department of Industrial and Systems Engineering, Hiroshima University, Japan, as an Assistant Professor. Since 2002, he has been a Full Professor in the Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Japan. In 1992 and 2000, he was a Visiting Research Scholar at the University of British Columbia, Canada and Duke University, USA, respectively, on leave of absence from Hiroshima University. His research areas include software reliability engineering, dependable computing and performance evaluation. He is a Regular Member of ORSJ, JSIAM, IEICE, ISCIE and IEEE. He published over 200 journal papers and refereed conference papers. Dr. Dohi is serving as an Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (A) and Asia-Pacific Journal of Operational Research, and an Editorial Board Member of Journal of Risk and Reliability, Journal of Autonomic and Trusted Computing, International Journal of Reliability and Quality Performance, etc. He published over 300 refereed papers. Dr. Dohi served as a General Chair of several international conferences like AIWARM 2004–2008 and WoSAR 2008 and as a Program Committee Chair of RASOR 2005–2007 and ISAS 2009.

Naoto Kaio received the B.S.E., M.S. and Dr. of Engineering degrees from Hiroshima University, Japan, in 1976, 1978 and 1982, respectively. He is a Full Professor in the Department of Economic Informatics, Hiroshima Shudo University, Japan. From 1986 to 1987, he was a Visiting Research Scholar in the William E. Simon Graduate School of Business Administration, University of Rochester, USA. His research areas include systems science, operations research and reliability theory. He is a Regular Member of ORSJ, IEICE, JIMA, IPSJ, JSQC, REAJ and IEEE. Also, Dr. Kaio is serving as Regional Editor for Asia in Journal of Quality in Maintenance Engineering.