Accepted Manuscript
Joint optimal checkpointing and rejuvenation policy for real-time computing tasks
Gregory Levitin, Liudong Xing, Liang Luo
PII: S0951-8320(18)30834-2
DOI: https://doi.org/10.1016/j.ress.2018.10.006
Reference: RESS 6281
To appear in: Reliability Engineering and System Safety
Received date: 3 July 2018
Revised date: 30 August 2018
Accepted date: 18 October 2018
Please cite this article as: Gregory Levitin, Liudong Xing, Liang Luo, Joint optimal checkpointing and rejuvenation policy for real-time computing tasks, Reliability Engineering and System Safety (2018), doi: https://doi.org/10.1016/j.ress.2018.10.006
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
A real-time computing system with checkpointing and rejuvenation is considered;
An algorithm for evaluating the probability of timely task completion is presented;
A problem of joint optimal checkpointing and rejuvenation scheduling is formulated;
An illustrative example is presented.
Joint optimal checkpointing and rejuvenation policy for real-time computing tasks
Gregory Levitin (a, b), Liudong Xing (c), Liang Luo (a, *)
a Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China
b The Israel Electric Corporation, P. O. Box 10, Haifa 31000, Israel. E-mail: [email protected]
c University of Massachusetts, Dartmouth, MA 02747, USA. E-mail: [email protected]
* Corresponding author
Abstract - Performance of a software system can deteriorate from higher to lower levels due to software aging. To counteract the aging effect, software rejuvenation is widely implemented to restore the performance of a degraded system before a system crash actually takes place. To facilitate effective restoration of the system function after each rejuvenation action, it is desirable to apply checkpointing, which occasionally saves the system state on reliable storage so that the mission task can be resumed from the last saved checkpoint (instead of being restarted from the very beginning). As both rejuvenation and checkpointing procedures incur system overhead while bringing these benefits, it is important to determine the rejuvenation and checkpointing scheduling policy that optimizes the system performance measures of interest. This paper makes new contributions by modeling and optimizing the joint maintenance policy involving state-based rejuvenation and a periodic checkpointing schedule for software systems performing real-time computing tasks. The system can undergo multiple performance degradation levels or states, and the transition times between different states can follow arbitrary distributions. The proposed solution methodology encompasses an efficient numerical algorithm for evaluating the probability of task completion (PTC) by a pre-specified deadline. The joint optimal rejuvenation and checkpointing policy is further determined to maximize the PTC of the considered real-time task. Examples are provided to illustrate applications of the proposed methodology as well as effects of system parameters on the optimization solution.

Keywords - Real-time computing; software rejuvenation; periodic checkpointing; task completion probability.
Acronyms and Abbreviations
cdf    cumulative distribution function
pdf    probability density function
PS     processing speed
PTC    probability of the task completion by a pre-specified deadline

Nomenclature
$w$    total number of operations in computational task
$W(\alpha)$    total number of operations needed to complete the task (including checkpointing procedures)
$\pi(\alpha)$    number of operations performed between ends of two consecutive checkpointing procedures
$\tau$    maximum allowed task processing time or task deadline
$\alpha$    fraction of task saved by incremental checkpoints
$\eta(\alpha)$    total number of checkpoints during the task execution
$K$    maximum possible number of rejuvenations
$J$    maximum state index of the unit, i.e., the unit states range from 0 to $J$
$f_j(t)$, $F_j(t)$    pdf, cdf of random unit sojourn time in state $j$
$g_j$    unit PS in state $j$
$\theta$    rejuvenation time
$R$    PTC
$m$    number of time intervals and work portions considered in numerical algorithm
$u(t,x)$    rejuvenation decision function
$b(\alpha)$    number of operations required to perform each checkpoint
$c(i)$    number of operations required to retrieve the data stored by $i$ checkpoints
$\lfloor x \rfloor$    greatest integer not exceeding $x$
1. Introduction
It is widely known that failures of modern computing-based systems are more often caused by software faults than by hardware faults [1, 2]. As a major type of software fault, software aging results from the successive accumulation of error conditions while the software is running, such as round-off errors, memory leaks, unreleased memory space, and storage fragmentation [3, 4]. Due to the aging effect, the system performance gradually degrades, which can eventually lead to a system crash if no maintenance action is taken [5]. In mission-critical or safety-critical applications, software failures may cause unrecoverable economic loss or even loss of human life [6], and thus are not acceptable for those systems. In the 1990s, software rejuvenation was first introduced as a preventive and proactive maintenance technique to counteract software aging before the system crash takes place [3]. Specifically, the rejuvenation technique consists of occasionally and proactively stopping the running program or application, cleaning its internal state or environment to remove the accrued undesirable conditions, and rebooting the system. After the rejuvenation procedure, the system performance (e.g., processing speed or service rate) can be fully or at least partially recovered [7]. For transaction-based software systems, the rejuvenated system can resume accepting and processing newly arriving transactions. However, for computing-based systems, the running computing task is interrupted at the point of rejuvenation, and rolling back to the beginning of the mission after the rejuvenation is costly. Thus, it is desirable to apply checkpointing, which occasionally saves the system state on reliable storage so that the mission task can be resumed from the last saved checkpoint after the rejuvenation (avoiding a restart from the very beginning of the mission) [2, 8, 9]. In this paper we model and optimize software systems that implement the combined checkpointing and rejuvenation technique for performing real-time computing tasks.
As both checkpointing and rejuvenation incur some system overhead, it is crucial to address the optimal rejuvenation policy and the optimal checkpointing schedule simultaneously to maximize the benefits and effectiveness of the combined fault tolerance technique [10]. Specifically, system performance can be improved with frequent rejuvenations (e.g., greater average processing speed, larger average service rate), which however may extend the task completion time because each rejuvenation procedure itself takes time to complete. Similarly, the number of checkpoints performed during the mission has a two-fold effect: more frequent checkpointing implies less work being repeated during rollback or recovery, but possibly a longer time to complete the mission. There is thus a tradeoff between the recovery overhead and the checkpoint overhead. Therefore, it is important to determine the rejuvenation and checkpointing scheduling policy that optimizes the system performance measure of interest, e.g., maximizing the task completion probability, maximizing system availability, minimizing the expected mission completion time, minimizing the transaction loss probability, or minimizing the mean response time [11].
This paper makes new contributions by solving the joint optimal maintenance policy problem involving both state-based rejuvenation and a periodic checkpointing schedule for computing-based software systems performing real-time tasks. The system can undergo multiple performance degradation levels or states, and the transition times between different states can follow arbitrary distributions. The proposed solution methodology encompasses an efficient event-transition based numerical algorithm for time-dependent evaluation of the probability of task completion (PTC) by a pre-specified deadline. The joint optimal rejuvenation and checkpointing policy is then determined to maximize the PTC of the real-time task.
The remainder of this paper is arranged as follows. Section 2 reviews related work on software rejuvenation and checkpointing, followed by a clarification of the novel contributions of this work in the context of the literature review. Section 3 describes the system model and the joint optimal software maintenance policy problem considered in this work. Section 4 presents the PTC evaluation method for the considered real-time software system. Section 5 discusses a heuristic algorithm for obtaining the joint optimal software maintenance policy. Section 6 presents illustrative examples of the suggested evaluation method and optimization solutions. Section 7 presents conclusions and directions for future research.

2. Related Work
Since its introduction in the 1990s, software rejuvenation has been applied to mitigate software aging effects in a wide range of applications, such as spacecraft [12], embedded systems [13], billing software [3, 14], and transaction processing [1, 11]. Rejuvenation policies can be classified as time-based (initiated at periodic or aperiodic time intervals [1, 15]), condition-based (triggered by monitoring the system state related to, for example, workload, response time, or completed tasks [16, 17]), and prediction-based (triggered by statistical analysis of system data [18]). Different types of software systems have been modeled for optimizing the rejuvenation policy, including, for example, single-node systems [19], standby/redundant systems [20], clustered systems [14, 16], and cloud systems [21].
While a rich set of exact and heuristic algorithms is directly available for various optimization problems [22, 23], specific approaches have to be developed to estimate the system performance measures to be optimized. Measurement-based, simulation-based, analytical model-based, and event transition methods have been developed for evaluating software aging and rejuvenation systems [17, 20, 24-26]. In the measurement-based approach, empirical data about the system state are first collected; statistical analysis or machine learning techniques are then applied to the collected runtime data to predict the performance measure of interest [5]. As the measurement-based approach typically exploits the particular nature of the considered system, it is difficult to generalize across different systems. Moreover, measurements are not appropriate for estimating long-term or steady-state performance measures. The simulation-based method shares the limitations of the measurement-based approach and has the further problem of being computationally expensive, which becomes more severe when a high level of accuracy is required for critical systems [20]. In contrast, analytical model-based approaches use probabilistic models to represent the software aging process and are easily portable across different systems. Different stochastic processes have been adopted in analytical models for evaluating software rejuvenation systems, including, for example, Markov processes [3, 7, 11], semi-Markov processes [1, 25], Markov regenerative processes [11, 24], and stochastic Petri nets [16]. Compared to the measurement-based and simulation-based approaches, these analytical models can be less effective because simplifying assumptions are often made about the real system behavior. In addition, the analytical models are state-based and thus suffer from the state-space explosion issue, which makes time-dependent performance evaluation computationally expensive or even intractable. Due to this issue, the analytical model-based approaches usually focus only on steady-state solutions [11, 27]. These stationary solutions, however, are often not adequate for critical applications that require time-dependent estimation of the system performance. As a recently emerged approach, the event transition method can provide time-dependent evaluation in an efficient manner [17].
Apart from rejuvenation, checkpointing is another important technique that has been intensively applied for dependable software computing. By periodically [28] or non-periodically [29] saving state information associated with the completed portion of a mission task, checkpointing can facilitate an effective recovery of the system function in the case of a system failure. Upon the failure, the system does not need to restart from the very beginning; instead it can resume the task from the last saved checkpoint through rollback and retrieval. While reducing the recovery time, checkpointing may also impact system performance negatively due to the additional overhead incurred by each checkpointing procedure [9]. Various checkpoint placement policies (e.g., adaptive, age-dependent, online checkpoints) [30-32] have been investigated to trade off the reduction in recovery time against the checkpoint overhead, and to optimize performance measures of interest (e.g., minimizing the expected mission completion time [8], maximizing the task completion probability by a deadline [9], or maximizing system availability [33]). Different types of computing systems have been modeled for optimizing the checkpointing policy, including, for example, single-node systems [8, 34], distributed systems [35], and standby systems [9, 28, 29].
Though the two software fault tolerance techniques (rejuvenation and checkpointing) were introduced for different purposes, they can be combined in the same software system to complement each other for greater benefits [36]. Compared to the rich literature on each separate technique, relatively few studies have modeled and optimized software systems that implement both techniques simultaneously. Specifically, a maintenance policy with periodic checkpointing and periodic rejuvenation (triggered after every certain number of checkpoints) was modeled to compute and minimize the expected completion time in [2]. In [36], a modeling framework based on fluid stochastic Petri nets was proposed to quantify dependability measures of aging software systems with checkpointing and rejuvenation. In [33], an aperiodic time-based rejuvenation policy was evaluated and optimized for a software system subject to a fixed periodic checkpoint schedule, maximizing the steady-state system availability using dynamic programming. In [10], joint aperiodic time-based checkpointing and rejuvenation schemes were evaluated and optimized to maximize the steady-state system availability using dynamic programming. In [37], an implementation of checkpointing and rejuvenation in a Unix library called libckp was presented; no modeling or evaluation was performed. In [38], the benefits of combining rejuvenation and checkpoint mechanisms over using only the checkpoint mechanism were investigated through simulations for high performance computing systems.
To the best of our knowledge, no comprehensive evaluation has been performed for the joint checkpointing and rejuvenation scheme with respect to maximizing the PTC for real-time tasks. This work advances the state of the art by suggesting an iterative method for evaluating the PTC by a specified deadline in software systems subject to the joint maintenance scheme. The optimization problem of determining the optimal joint rejuvenation and checkpointing policy is then solved, maximizing the PTC of the real-time task.
Note that two different viewpoints on system success measures have been adopted in the literature: system-oriented (concerning the expected amount of work accomplished in a given time period) [39] and task/user-oriented (concerning the probability that a system task involving a certain amount of work is completed in due time) [40, 41]. The task-oriented measure, specifically the PTC, is adopted in this work.
3. System Model
A task containing $w$ operations should be performed by the considered system. The system's initial task processing speed (PS) is $g_0$. Due to software aging effects, the system degrades from states with a greater PS to states with a lower PS during the task processing. The system's PS in state $j$ is $g_j$. The total number of possible states is $J+1$. The system can transit from any state $j$ only to state $j+1$. The transition time from state $j$ to state $j+1$ follows a known distribution with cumulative distribution function (cdf) $F_j(t)$ and probability density function (pdf) $f_j(t)$. To improve the system performance, its software can be rejuvenated, causing the system to return to the initial state $j=0$ with the peak PS $g_0$.
To avoid starting the task execution from scratch after each rejuvenation, the system saves intermediate results to external storage (i.e., it implements checkpointing). After the rejuvenation, the system retrieves the data saved by previous checkpointing procedures and continues performing the task from the operation that immediately follows the last completed checkpoint.
The periodic checkpointing procedures are performed upon completion of each fraction $\alpha$ of the entire mission task. Thus, the system conducts a data checkpointing procedure after successfully performing each $\alpha w$ operations. The total number of checkpointing actions performed during a successful mission (i.e., checkpointing frequency) is fixed and can be determined as

\[ \eta(\alpha) = \begin{cases} \lfloor 1/\alpha \rfloor & \text{if } \lfloor 1/\alpha \rfloor < 1/\alpha, \\ \lfloor 1/\alpha \rfloor - 1 & \text{if } \lfloor 1/\alpha \rfloor = 1/\alpha. \end{cases} \tag{1} \]

The second case in (1) takes place because when $\lfloor 1/\alpha \rfloor = 1/\alpha$, the last checkpoint would be scheduled at the end of the mission, where it is not necessary. $\alpha$ is referred to as the checkpointing frequency parameter and is optimized in the optimization problems considered in this paper. According to the incremental checkpointing model [34], we assume that each checkpointing action requires a constant number of operations $b(\alpha)$, which depends on the amount of task operations performed since the previous checkpoint. Thus, the total number of operations needed to complete the mission is

\[ W(\alpha) = \eta(\alpha)\, b(\alpha) + w. \tag{2} \]
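As a quick illustration of (1) and (2), the following Python sketch computes the number of checkpoints and the total number of operations. The function names, the argument name `alpha` for the checkpointing frequency parameter, and the tolerance used to decide whether its reciprocal is an integer are assumptions made for this illustration; the per-checkpoint cost `b` is passed as a plain number for the chosen `alpha`.

```python
import math

def checkpoint_count(alpha, tol=1e-9):
    """Number of checkpoints per eq. (1) for a checkpointing fraction alpha in (0, 1]."""
    inv = 1.0 / alpha
    if abs(inv - round(inv)) < tol:   # 1/alpha is an integer (tolerance is an assumption)
        return round(inv) - 1         # the checkpoint at the very end of the task is skipped
    return math.floor(inv)

def total_operations(w, alpha, b):
    """Total operations per eq. (2): task operations plus checkpointing overhead."""
    return checkpoint_count(alpha) * b + w
```

For instance, `checkpoint_count(0.25)` and `checkpoint_count(0.3)` both give 3 checkpoints, because with a fraction of 0.25 the fourth checkpoint would coincide with the end of the task.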
The number of operations needed to retrieve the data saved by $i$ previous checkpoints is $c(i)$. The rejuvenation procedure consists of a system reset, which takes fixed time $\theta$, and data retrieval, which takes time $c(i)/g_0$ given that $i$ checkpoints have been completed before the rejuvenation (we assume that the system cannot deteriorate during the data retrieval because it does not perform the mission task, which is what causes the deterioration).
The task performed is real-time; specifically, the mission fails if the task is not completed within the specified deadline $\tau$. Thus, the maximum number of rejuvenation procedures that can be performed during the task processing is

\[ K = \left\lfloor \frac{\tau - W(\alpha)/g_0}{\theta} \right\rfloor. \tag{3} \]
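The bound (3) can be checked numerically with a short Python sketch. The function name, its argument names, and the convention of returning -1 for an infeasible deadline are assumptions made for illustration; the speed and rejuvenation time used in the usage note are hypothetical values.

```python
import math

def max_rejuvenations(tau, total_ops, g0, theta):
    """Largest number of rejuvenations that still leaves time to finish, eq. (3).

    Returns -1 (an illustrative convention) when even an undisturbed run
    at the peak speed g0 cannot meet the deadline tau.
    """
    slack = tau - total_ops / g0      # time left after the fastest possible run
    if slack < 0:
        return -1
    return math.floor(slack / theta)  # each rejuvenation costs at least theta
```

With a deadline of 100 time units, 5000 operations, a hypothetical peak speed of 100 operations per time unit and a hypothetical rejuvenation time of 5, the undisturbed run takes 50 time units and at most 10 rejuvenations fit into the remaining time.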
The optimal rejuvenation policy and checkpointing frequency should be determined to maximize the probability of the task completion (PTC) $R$ by the deadline $\tau$. Specifically, more frequent rejuvenation procedures can lead to a greater average PS as the system is prevented from deteriorating to low PS levels. On the other hand, with more rejuvenation procedures, less time remains for the system to perform the mission task operations during the allowed time $\tau$. The rejuvenation policy to be optimized can be defined by a function $u(t,x)$. The rejuvenation decision rule adopted is: if the system transits to state $j$ at time $t$ from the beginning of the mission when part $x$ of the entire task is accomplished and $j \ge u(t,x)$, then the system software is immediately rejuvenated. The optimal rejuvenation policy is also interrelated with the optimal number of checkpointing procedures. Performing more checkpointing procedures increases the minimal number of operations to be performed during the mission. On the other hand, when the number of checkpointing procedures increases, the number of operations to be re-executed after each rejuvenation decreases, which decreases the expected total number of operations in the case of rejuvenations. Thus, the rejuvenation policy and checkpointing schedule should be jointly optimized to maximize the PTC $R$.
Fig. 1 presents an example of a successful mission completion for the case of $\eta(\alpha)=3$ and a discrete function $u(t,x)$. The $(t,x)$ space is divided into six areas with a constant value of $u(t,x)$ in each area. The system starts performing the mission in state $j=0$, corresponding to PS $g_0$. During the mission, the system transits to state $j=1$ when $u(t,x)=2$ and performs the first checkpoint in this state. Then it transits to state $j=2$ when $u(t,x)=2$, which causes the system rejuvenation. After the rejuvenation, the system returns to state $j=0$ and continues performing the task from the operation that immediately follows the first checkpoint. While performing the second checkpoint, the system transits to state $j=1$ when $u(t,x)=2$, and while performing the third checkpoint, the system transits to state $j=2$ and then to state $j=3$ when $u(t,x)=3$. The last transition causes the system rejuvenation. As the rejuvenation happens before completion of the third checkpoint, the system, after returning to state $j=0$, continues performing the task from the operation that immediately follows the second checkpoint. Then the system performs the third checkpoint and completes the entire mission while remaining in state $j=0$.

Fig. 1. Example of task performance
4. Event Transition Based Method for Evaluating PTC
In this section, an event transition-based iterative method is presented for analyzing the PTC of real-time tasks in the considered software rejuvenation and checkpointing system.
4.1 Probabilistic modeling of events and event transitions
Denote by $\langle T_{k,j}, X_{k,j} \rangle$ the event that the system starts executing task or checkpointing operations with PS $g_j$ at time $T_{k,j}$ from the beginning of the mission when $X_{k,j}$ operations are accomplished and $k$ rejuvenations are performed. The joint pdf of the random values $T_{k,j}$ and $X_{k,j}$ is denoted as $q_{k,j}(t,x)$. When the mission first starts, $k=0$ and the system's PS is $g_0$. Thus we have

\[ q_{0,0}(t,x) = \begin{cases} 1 & \text{for } t = x = 0, \\ 0 & \text{otherwise.} \end{cases} \tag{4} \]
According to the rejuvenation decision function $u(t,x)$, if the unit deteriorates to state $j$ after spending time $\tilde{t}$ in state $j-1$, the event transition $\langle T_{k,j-1}, X_{k,j-1} \rangle \to \langle T_{k,j}, X_{k,j} \rangle$ with

\[ T_{k,j} = T_{k,j-1} + \tilde{t}, \quad X_{k,j} = X_{k,j-1} + g_{j-1}\tilde{t} \tag{5} \]

takes place if $j < u(T_{k,j}, X_{k,j})$. If $j \ge u(T_{k,j}, X_{k,j})$, the rejuvenation procedure starts immediately after the system state transition and the event transition $\langle T_{k,j-1}, X_{k,j-1} \rangle \to \langle T_{k+1,0}, X_{k+1,0} \rangle$ with

\[ T_{k+1,0} = T_{k,j-1} + \tilde{t} + \theta + c(i)/g_0, \quad X_{k+1,0} = i\,\pi(\alpha) \tag{6} \]

is performed, where

\[ \pi(\alpha) = \alpha w + b(\alpha) \tag{7} \]

is the number of operations that should be performed by the system between the ends of two consecutive checkpointing procedures, and

\[ i = \left\lfloor \frac{X_{k,j-1} + g_{j-1}\tilde{t}}{\pi(\alpha)} \right\rfloor \tag{8} \]

is the index of the last checkpointing procedure completed before the rejuvenation. $c(i)/g_0$ is the time needed to retrieve the data saved by the previous $i$ checkpointing procedures.
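The two event transitions described by (5)-(8) can be sketched as a single Python function. This is only an illustration: the function name, the argument names and the tuple layout of the returned event are assumptions, and `pi_alpha` stands for the number of operations between the ends of two consecutive checkpointing procedures as given by (7).

```python
import math

def next_event(T, X, k, j, t_soj, g, u, theta, c, pi_alpha):
    """Event that follows <T_{k,j-1}, X_{k,j-1}> when state j is entered.

    T, X  : time and accomplished operations at the start of state j-1
    t_soj : sojourn time spent in state j-1
    g     : list of processing speeds, u : decision function u(t, x)
    Returns a tuple (k, j, T, X) describing the next event.
    """
    t_new = T + t_soj
    x_new = X + g[j - 1] * t_soj                 # eq. (5)
    if j < u(t_new, x_new):
        return (k, j, t_new, x_new)              # keep working in state j
    i = math.floor(x_new / pi_alpha)             # eq. (8): last completed checkpoint
    # eq. (6): reset, retrieve i checkpoints, resume right after checkpoint i
    return (k + 1, 0, t_new + theta + c(i) / g[0], i * pi_alpha)
```

For example, with speeds `[10.0, 5.0]`, a sojourn of 7 time units in state 0 and 30 operations between checkpoint ends, 70 operations are accomplished, so two checkpoints are complete and a rejuvenation resumes the task from operation 60.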
Because the probability that the system transits from state $j-1$ to state $j$ within the time interval $[t, t+dt)$ (since it started operating with PS $g_{j-1}$) is $f_{j-1}(t)dt$, $q_{k,j}(t,x)$ for $k \ge 0$, $j > 0$ can be recursively obtained as

\[ q_{k,j}(t,x) = \int_0^{\min(t,\, x/g_{j-1})} q_{k,j-1}\big(t - \tilde{t},\, x - \tilde{t} g_{j-1}\big)\, f_{j-1}(\tilde{t})\, d\tilde{t} \quad \text{for } k\theta \le t < \tau,\ u(t,x) > j. \tag{9} \]
Indeed, the system can transit to state $j$ at time $t$ when $x$ operations are completed if it spends in state $j-1$ a time $\tilde{t}$ not exceeding $x/g_{j-1}$. The minimal time at which the system can transit to state $j$ after $k$ rejuvenations, $t = k\theta$, corresponds to the situation where all the rejuvenations are performed before completing the first checkpointing procedure. Following (6), we can obtain $q_{k+1,0}(t, i\pi(\alpha))$ by considering all the possible transitions from events $\langle T_{k,j-1}, X_{k,j-1} \rangle$ with $k \ge 0$, $j = 1, \dots, J$ for which

\[ T_{k,j-1} = t - \tilde{t} - \theta - c(i)/g_0, \quad X_{k,j-1} = \tilde{x} - \tilde{t} g_{j-1}, \tag{10} \]

where $\tilde{x}$ is the number of operations completed after working in state $j-1$ during time $\tilde{t}$ and

\[ i\,\pi(\alpha) \le \tilde{x} < (i+1)\,\pi(\alpha), \tag{11} \]

which corresponds to $i$ checkpointing procedures completed before the $(k+1)$-th rejuvenation. The rejuvenation is performed at time

\[ T_{k,j-1} + \tilde{t} = t - \theta - c(i)/g_0 \tag{12} \]

if $j \ge u(t - \theta - c(i)/g_0, \tilde{x})$. Thus

\[ q_{k+1,0}(t, i\pi(\alpha)) = \sum_{j=1}^{J} \int_{i\pi(\alpha)}^{(i+1)\pi(\alpha)} \int_0^{\min(t,\, \tilde{x}/g_{j-1})} \mathbf{1}\big(j \ge u(t - \theta - c(i)/g_0, \tilde{x})\big)\, q_{k,j-1}\big(t - \tilde{t} - \theta - c(i)/g_0,\, \tilde{x} - \tilde{t} g_{j-1}\big)\, f_{j-1}(\tilde{t})\, d\tilde{t}\, d\tilde{x}. \tag{13} \]

4.2 PTC evaluation
Denote by $Y_{k,j}$ the event that the system completes the mission task no later than the deadline $\tau$ while the system is in state $j$ after $k$ rejuvenations. The PTC $R$ can thus be evaluated by summing the probabilities of mutually exclusive events as

\[ R = \sum_{k=0}^{K} \sum_{j=0}^{J} \Pr\{Y_{k,j}\}. \tag{14} \]
When the event $\langle T_{k,j}, X_{k,j} \rangle$ occurs, the system can accomplish the task without further state transitions if it functions with PS $g_j$ during the time $(W(\alpha) - X_{k,j})/g_j$ required to complete the entire task and this time does not exceed the remaining allowed task processing time $\tau - T_{k,j}$. Hence, $\Pr\{Y_{k,j}\}$ in (14) can be computed as

\[ \Pr\{Y_{k,j}\} = \int_0^{W(\alpha)} \int_{k\theta}^{\tau - \frac{W(\alpha)-x}{g_j}} q_{k,j}(t,x) \left[ 1 - F_j\!\left( \frac{W(\alpha)-x}{g_j} \right) \right] dt\, dx. \tag{15} \]

4.3 Discrete numerical PTC evaluation algorithm
The maximum allowable mission time $\tau$ is discretized into $m$ intervals with an equal duration of $\Delta = \tau/m$ such that the interval $z$ starts at time $z\Delta$ and ends at time $(z+1)\Delta$ for $z = 0, \dots, m$. The rejuvenation thus takes $\Theta = \theta/\Delta$ time intervals. Correspondingly, the total number of operations is divided into $m$ equal portions such that each portion contains $\omega = W(\alpha)/m$ operations. During $z$ time intervals, the system functioning in state $j$ can perform $g_j z \Delta$ operations, corresponding to $g_j z \Delta/\omega = g_j z \tau / W(\alpha)$ work portions. Thus, the system's PS in state $j$ is $G_j = g_j \tau / W(\alpha)$ work portions per time interval. Each checkpointing procedure requires performing $B(\alpha) = b(\alpha)/\omega$ work portions. The data retrieval after $i$ checkpointing procedures requires performing $C(i) = c(i)/\omega$ work portions. The number of work portions between the ends of two consecutive checkpointing procedures is $\Pi = \pi(\alpha)/\omega$.
Based on the corresponding cdf, the probability that the system remains in state $j$ for less than $d$ time intervals can be evaluated as $\Phi_j(d) = F_j(\Delta d)$. The probability that the system departs from state $j$ after functioning in this state for $d$ intervals is thus $\varphi_j(d) = F_j(\Delta(d+1)) - F_j(\Delta d)$. The joint pdf $q_{k,j}(t,x)$ can be approximated by a matrix $Q_{k,j}(z,n)$ of probabilities that after the $k$-th rejuvenation the system starts operating in state $j$ in time interval $z$ when $n$ work portions are accomplished. The rejuvenation decision function $u(t,x)$ can be applied in the form $U(z,n) = u(\Delta z, \omega n)$.

Based on (9) and the matrix $Q_{k,j-1}(z,n)$ for $z = 0, \dots, m-1$ and $n = 0, \dots, m-1$, $Q_{k,j}(z,n)$ can be obtained. Based on (13) and all the matrices $Q_{k,j}(z,n)$ for $j = 0, \dots, J$, $z = 0, \dots, m-1$, and $n = 0, \dots, m-1$, $Q_{k+1,0}$ can be obtained. In these backward procedures based on (9) and (13), a summation over different elements of the matrix $Q_{k,j-1}$ (or of the matrices $Q_{k,j-1}$ for $j = 1, \dots, J$) should be performed for every element of the matrix $Q_{k,j}$ (or $Q_{k+1,0}$). Next, a more convenient forward procedure is explained and used. At the beginning of the forward procedure, the matrices $Q_{k,j}$ and $Q_{k+1,0}$ are zeroed. Then, for any fixed element of the matrix $Q_{k,j-1}$ and any realization of the time to the next system state transition, we iteratively update the corresponding elements in the matrices $Q_{k,j}$ and $Q_{k+1,0}$. The forward procedure avoids storing all the matrices $Q_{k,j}$ for $j = 0, \dots, J$ because any matrix $Q_{k,j-1}$ can be deleted from memory upon finishing the update of $Q_{k,j}$ and $Q_{k+1,0}$. As the iterative procedure cannot be expressed in equations, we describe this procedure of evaluating $R$ using the following pseudo code.
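Pseudo code of PTC evaluation algorithm
1. Initialize $Q_{0,0}(z,n)=0$ for $z=0,\dots,m$; $n=0,\dots,m$; and assign $Q_{0,0}(0,0)=1$;
2. For $k=0,\dots,K$:
   2.1. Initialize $Q_{k+1,0}(z,n)=0$ for $z=0,\dots,m$; $n=0,\dots,m$;
   2.2. For $j=1,\dots,J$:
      2.2.1. Initialize $Q_{k,j}(z,n)=0$ for $z=0,\dots,m$; $n=0,\dots,m$;
      2.2.2. For $n=0,\dots,m-1$:
         2.2.2.1. For $z=k\Theta,\dots,m-1$:
            2.2.2.1.1. $d_{\max}=(m-n)/G_{j-1}$;
            2.2.2.1.2. If $d_{\max} \le m-z$ then $R=R+Q_{k,j-1}(z,n)(1-\Phi_{j-1}(d_{\max}))$;
            2.2.2.1.3. If $j$ […]
               2.2.2.1.3.1. For $d=0,\dots,\min\{d_{\max}-1,\,m-z-1\}$:
                  2.2.2.1.3.1.1. If $j<U(z+d,\,n+G_{j-1}d)$ then $Q_{k,j}(z+d,\,n+G_{j-1}d)=Q_{k,j}(z+d,\,n+G_{j-1}d)+Q_{k,j-1}(z,n)\,\varphi_{j-1}(d)$;
                  2.2.2.1.3.1.2. $i=\lfloor (n+G_{j-1}d)/\Pi \rfloor$;
                  2.2.2.1.3.1.3. If $j \ge U(z+d,\,n+G_{j-1}d)$ then $Q_{k+1,0}(z+d+\Theta+C(i)/G_0,\,i\Pi)=Q_{k+1,0}(z+d+\Theta+C(i)/G_0,\,i\Pi)+Q_{k,j-1}(z,n)\,\varphi_{j-1}(d)$.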
Step 2.2.2.1.1 calculates the number of time intervals required for completing the task when the system operates in state $j-1$. Step 2.2.2.1.2 updates the PTC. Step 2.2.2.1.3.1.1 corresponds to the event transition without rejuvenation. Step 2.2.2.1.3.1.2 obtains the index $i$ of the last completed checkpoint. Step 2.2.2.1.3.1.3 corresponds to the event transition with rejuvenation. The complexity of the iterative algorithm is $O((K+1)(J+1)m^3)$.
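The forward procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: state J is treated as final (no further degradation), the number of intervals needed to finish is rounded up, accomplished work is rounded down to whole portions, the rejuvenation and retrieval delay is rounded up, and the matrices are stored sparsely as dictionaries. The function name `evaluate_ptc` and all parameter names are hypothetical; `b` is the per-checkpoint cost for the chosen checkpointing fraction `alpha`, `F` holds the cdfs of the sojourn times in states 0 to J-1, `c(i)` is the retrieval cost, and `U(z, n)` is the discrete decision function.

```python
import math
from collections import defaultdict

def evaluate_ptc(w, tau, theta, g, F, alpha, b, c, U, m):
    """Approximate the PTC R on an m-by-m grid (sketch; all speeds in g must be > 0)."""
    J = len(g) - 1
    inv = 1.0 / alpha
    whole = abs(inv - round(inv)) < 1e-9
    n_chk = round(inv) - 1 if whole else math.floor(inv)      # eq. (1)
    W = n_chk * b + w                                          # eq. (2)
    if W / g[0] > tau:
        return 0.0                                             # deadline cannot be met
    K = math.floor((tau - W / g[0]) / theta)                   # eq. (3)
    delta, omega = tau / m, W / m                              # interval length, portion size
    G = [gj * delta / omega for gj in g]                       # work portions per interval
    Pi = (alpha * w + b) / omega                               # portions between checkpoint ends
    R = 0.0
    start = {(0, 0): 1.0}                                      # Q_{0,0}
    for k in range(K + 1):
        after_rejuv = defaultdict(float)                       # Q_{k+1,0}
        current = start                                        # Q_{k,j-1}, beginning with state 0
        for j in range(1, J + 2):                              # j-1 runs over states 0..J
            nxt = defaultdict(float)                           # Q_{k,j}
            for (z, n), p in current.items():
                d_max = math.ceil((m - n) / G[j - 1])          # intervals needed to finish (assumed ceil)
                stay = 1.0 if j > J else 1.0 - F[j - 1](delta * d_max)
                if z + d_max <= m:
                    R += p * stay                              # finish without leaving state j-1
                if j > J:
                    continue                                   # assumed: no transition out of state J
                for d in range(min(d_max, m - z)):
                    pd = F[j - 1](delta * (d + 1)) - F[j - 1](delta * d)
                    z2, x2 = z + d, n + G[j - 1] * d
                    if j < U(z2, math.floor(x2)):              # eq. (5): continue in state j
                        nxt[(z2, math.floor(x2))] += p * pd
                    else:                                      # eqs. (6)-(8): rejuvenate
                        i = math.floor(x2 / Pi)
                        z3 = z2 + math.ceil(theta / delta + c(i) / omega / G[0])
                        if z3 < m:
                            after_rejuv[(z3, math.floor(i * Pi))] += p * pd
            current = nxt
        start = after_rejuv
    return R
```

With an exponentially distributed sojourn time in state 0, a nearly stalled state 1 and a policy that never rejuvenates, the sketch returns the probability that no degradation occurs before the task is finished. In the same setting, a policy that always rejuvenates can only add successful paths, so its result is larger.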
5. Optimal Joint Checkpointing and Rejuvenation Policy
Having the PTC evaluation algorithm described above, one can find the rejuvenation and checkpointing policy that maximizes the PTC using a general optimization procedure. To optimize the rejuvenation policy under a fixed checkpointing schedule $\alpha$, we define a discrete approximation of the decision function $U(z,n)$. Specifically, we divide the $m$ time intervals (work portions) into $H$ regions. For any combination of $z$ and $n$, we define $U(z,n) = V(\lfloor z/h \rfloor, \lfloor n/h \rfloor)$, with $h = \lfloor m/H \rfloor + 1$ and $V$ being an $H \times H$ matrix consisting of elements ranging from 0 to $J+1$. This approximation implies that when the system transits to state $j$ in time interval $z$ (from the beginning of the mission) when $n$ work portions are accomplished, the rejuvenation procedure is immediately conducted if $j \ge V(\lfloor z/h \rfloor, \lfloor n/h \rfloor)$.
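The region-based approximation of the decision function can be illustrated with a short Python sketch. The helper name and the representation of the matrix as a list of lists are assumptions made for illustration; the region size is taken as the integer part of m/H plus one, which keeps every region index between 0 and H-1.

```python
def make_decision_function(V, m):
    """Build U(z, n) from an H-by-H matrix V given as a list of lists."""
    H = len(V)
    h = m // H + 1            # region size: integer part of m/H, plus one

    def U(z, n):
        # Look up the threshold of the region that contains interval z and portion n.
        return V[z // h][n // h]

    return U

def rejuvenate(j, z, n, U):
    """Decision rule: rejuvenate when the entered state j reaches the threshold."""
    return j >= U(z, n)
```

With `m = 100` and a 2-by-2 matrix, the region size is 51, so intervals 0 to 50 fall into the first region and intervals 51 to 100 into the second.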
It is a complicated optimization problem to find the optimal rejuvenation policy in the form of matrix $V$. There are $(J+1)^{H \times H}$ possible solutions. It is not realistic to perform an exhaustive examination of all the solutions even for a moderate number of regions $H$ and system states $J+1$. To solve the optimal rejuvenation problem, we use a Genetic Algorithm (GA) heuristic, one of the most widely used methods in reliability optimization owing to its flexibility in solution representation, its suitability for parallel computation, and its quick convergence to near-optimal solutions [22, 23]. The basic structure of the GA used in this work is as follows.
AN US
First, an initial population of randomly constructed solutions (represented as integer strings) is generated. Within this population, new solutions are obtained during the genetic cycle by using crossover and mutation operators. The crossover produces a new solution (offspring) from a randomly selected pair of parent solutions by copying random fragments of strings from different parents. Mutation results in slight changes to the offspring’s structure and maintains a diversity of solutions. This procedure avoids premature convergence to a local optimum and facilitates jumps in the solution space. In our GA, the mutation procedure swaps elements initially located in two randomly chosen positions on the string. Each new solution is decoded, and its objective function (fitness) value is estimated. This value, as a measure of quality, is used to compare different solutions. The comparison is accomplished by a selection procedure that determines which solution is better: the newly obtained solution or the worst solution in the population. The better solution joins the population, while the other is discarded. If the population contains equivalent solutions following selection, redundancies are eliminated, and the population size decreases as a result.
After new solutions are produced a pre-specified number of times, new randomly constructed solutions are generated to replenish the shrunken population, and a new genetic cycle begins. The GA is terminated after a pre-determined number of genetic cycles. The final population contains the best solution achieved. It also contains different near-optimal solutions which may be of interest in the decision-making process.
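The GA structure described above can be sketched as follows. This is a simplified illustration, not the authors' code: the function and parameter names, the default population size and cycle counts, the single-fragment crossover, and the toy fitness (a stand-in for the PTC R) are all assumptions.

```python
import random
from typing import Callable, Dict, Tuple

Solution = Tuple[int, ...]


def run_ga(fitness: Callable[[Solution], float], length: int, max_gene: int,
           pop_size: int = 20, offspring_per_cycle: int = 200,
           cycles: int = 5, seed: int = 0) -> Tuple[Solution, float]:
    """Steady-state GA over integer strings with genes in 0..max_gene."""
    rng = random.Random(seed)
    population: Dict[Solution, float] = {}  # dict keys keep solutions unique

    for _ in range(cycles):
        # Replenish the (possibly shrunken) population with random solutions.
        for _ in range(10 * pop_size):
            if len(population) >= pop_size:
                break
            s = tuple(rng.randint(0, max_gene) for _ in range(length))
            population[s] = fitness(s)

        for _ in range(offspring_per_cycle):
            if len(population) < 2:
                break
            p1, p2 = rng.sample(list(population), 2)
            # Crossover: copy a random fragment of p2 into a copy of p1.
            a, b = sorted(rng.sample(range(length + 1), 2))
            child = list(p1)
            child[a:b] = p2[a:b]
            # Mutation: swap the elements at two randomly chosen positions.
            i, k = rng.sample(range(length), 2)
            child[i], child[k] = child[k], child[i]
            offspring = tuple(child)

            # Selection: the offspring competes with the worst member.
            value = fitness(offspring)
            worst = min(population, key=population.get)
            if value > population[worst]:
                del population[worst]
                # If the offspring duplicates a member, the population shrinks.
                population[offspring] = value

    best = max(population, key=population.get)
    return best, population[best]


def toy_fitness(s: Solution) -> float:
    """Placeholder objective; in the paper the fitness is the PTC R."""
    return float(sum(s))


best, value = run_ga(toy_fitness, length=8, max_gene=5)
print(best, value)
```

Storing the population as a dictionary keyed by the solution string is one simple way to obtain the duplicate elimination described in the text; other bookkeeping schemes would work equally well.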
The GA operates with integer strings. To apply the GA to a specific optimization problem, the corresponding solution representation must be defined. For the rejuvenation optimization problem considered in this work, any integer string $a=(a_1,\ldots,a_{H\times H})$ with $0\le a_j\le J+1$ corresponds to a feasible solution such that for any $0\le i,k$ […] assigns the obtained value of R to the solution fitness.
6. Illustrative Example
6.1. Evaluating PTC of a visual data streaming onboard software system
A visual data streaming onboard software system for a space probe is considered [42]. According to a predetermined schedule, the space probe takes video images, which are placed in an input buffer to be processed by a software ETL (Extract, Transform, Load) streaming system [43, 44]. Specifically, the ETL system is responsible for the image data transform and compression task that prepares the data for transmission to the Earth. The task requires W=5000 mega operations. Due to the limited capacity of the input buffer (new image batches replace the old ones), the ETL system has to finish the task within the time between image updates, which is 100 time units.
The ETL system utilizes four internal memory blocks. The data processing speed depends on the number of blocks that have been filled and are therefore unavailable. The system state is defined by the number of unavailable memory blocks. Thus, there are five different system states (from 0 to 4). As the number of unavailable blocks increases consecutively, the system state increases one by one during the data processing. In addition, the system states with different numbers of available memory blocks can have different sojourn times. This is because the data processing time, and thus the random time of filling the next block, depends on the number of available memory blocks. When all four memory blocks are unavailable, the software can still function using low-capacity registers, but at a very low speed. During the periodic checkpointing procedures, the system saves the data processed so far to an external memory. The rejuvenation cleans up all of the internal memory blocks. Therefore, the system returns to state 0 from any other state after the rejuvenation. The software mission fails if the data transform and compression task is not finished before the input buffer data renewal takes place.
The sojourn time of the system in state j∈{0,1,2,3} follows the Weibull distribution [1, 45] with scale parameter $\eta_j$ and shape parameter $\beta_j$. In other words, the system transits from state j to state j+1 after a random time with cdf $F_j(t)=1-\exp(-(t/\eta_j)^{\beta_j})$ for j∈{0,1,2,3}. The system can transit from the worst state 4 only to the initial state 0, when the rejuvenation is performed. Note that the proposed methodology is applicable to any type of distribution. While no consensus has been found on the type of failure or degradation time distributions for operational software systems [11], the Weibull distribution is selected for this example system (and in other literature, e.g., [11, 46]) because of its flexible representation of different failure rate behaviors. In [1], accelerated life testing based studies showed that the Weibull distribution is a good fit for modeling the time to failure distribution of the software systems studied in that work. The PS of the system in each state is shown in Table 1, together with the scale ($\eta_j$) and shape ($\beta_j$) parameters.
Table 1. Parameters of the example software system
State j   PS $g_j$   Scale $\eta_j$ of $F_j(t)$   Shape $\beta_j$ of $F_j(t)$
0         95         25                           1.2
1         72         33                           1.0
2         50         38                           1.1
3         29         29                           1.3
4         15         -                            -
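To give a feel for the numbers in Table 1, the following sketch evaluates the standard Weibull cdf and mean for each state and the time needed to run the whole task in a single state. It assumes that PS is expressed in mega operations per time unit and ignores checkpointing and rejuvenation overheads; the dictionary layout and function names are illustrative only.

```python
import math

W = 5000.0        # task size, mega operations
DEADLINE = 100.0  # time between image updates, time units

# Table 1: state -> (PS, Weibull scale, Weibull shape).
# State 4 has no exit distribution (it is left only by rejuvenation).
TABLE_1 = {
    0: (95.0, 25.0, 1.2),
    1: (72.0, 33.0, 1.0),
    2: (50.0, 38.0, 1.1),
    3: (29.0, 29.0, 1.3),
    4: (15.0, None, None),
}


def weibull_cdf(t, scale, shape):
    """Probability that the sojourn time in a state does not exceed t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def mean_sojourn(scale, shape):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


for j, (speed, scale, shape) in TABLE_1.items():
    # Time to run the whole task in state j (assumes PS is operations per time unit).
    t_task = W / speed
    line = f"state {j}: task time {t_task:6.1f}, within deadline: {t_task <= DEADLINE}"
    if scale is not None:
        line += (f", mean sojourn {mean_sojourn(scale, shape):5.1f}"
                 f", F_j(20) = {weibull_cdf(20.0, scale, shape):.3f}")
    print(line)
```

Under these assumptions only the faster states allow the task to finish within the deadline when executed in one state from start to end, which is why a timely return to state 0 matters.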
We assume that checkpointing and retrieval procedures after performing x>0 task operations require performing $b_1+b_2x$ and $c_1+c_2x$ operations, respectively. Thus, b()=b1+b2w and c(i)=c11(i>0)+c2iw.
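The linear overhead model in the first sentence above can be written as two small functions. The default coefficients are the example values b1=c1=100, b2=0.1 and c2=0.08 quoted in the text that follows; the function names and the sample amount of work x=1250 are assumptions for illustration.

```python
def checkpoint_ops(x: float, b1: float = 100.0, b2: float = 0.1) -> float:
    """Operations needed to checkpoint after x > 0 task operations."""
    if x <= 0:
        raise ValueError("x must be positive")
    return b1 + b2 * x


def retrieval_ops(x: float, c1: float = 100.0, c2: float = 0.08) -> float:
    """Operations needed to retrieve data after x > 0 task operations."""
    if x <= 0:
        raise ValueError("x must be positive")
    return c1 + c2 * x


x = 1250.0  # illustrative amount of completed work
print(checkpoint_ops(x), retrieval_ops(x))
```

For this x the two overheads are about 225 and 200 operations, respectively, which is small compared with the task size but accumulates with the number of checkpoints.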
The number of intervals m used during the discretization affects the accuracy of the results obtained. To investigate its effect, values of R are collected for different m ranging from 50 to 2600. For a rejuvenation time of 20, u(t,x)≡3, b1=c1=100, b2=0.1, c2=0.08, and a checkpointing frequency parameter of 0.2, Fig. 2 presents the values of R and the running time of the proposed numerical algorithm on a Pentium 3.2GHz PC as functions of m. The relative differences in R for m=100, m=500 and m=1000 compared to m=2600 are 0.98%, 0.15% and 0.04%, respectively.
Fig. 3 presents the PTC R as a function of the checkpointing frequency parameter for a rejuvenation time of 20, b1=c1=100, b2=0.1, c2=0.08, and different u(x,t)≡k, corresponding to the case where the rejuvenation is performed immediately when the system enters state j=k, independent of the amount of completed work and the time when this event occurs.
Fig. 2. PTC R and running time of the proposed algorithm on Pentium 3.2GHz PC as functions of m.
Fig. 3. PTC R as a function of the checkpointing frequency parameter for different u(x,t)≡k.
It can be seen that the function R always takes its maximum when the reciprocal of the checkpointing frequency parameter is an integer, i.e., when the amount of operations after the last checkpointing procedure equals that before any checkpointing procedure (the explanation is given in [28]). This means that instead of looking for a real-valued frequency parameter maximizing R, one can check different integer numbers of checkpointing procedures and set the frequency parameter to 1/(number of checkpoints+1). The best obtained checkpointing and rejuvenation policy for constant u(t,x) is three checkpoints (frequency parameter 0.25) and u(t,x)≡3.
In the following sub-sections, we present solutions to optimizing the rejuvenation policy under a fixed checkpointing frequency, as well as solutions to the joint optimization of rejuvenation and checkpointing policy.
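The reduction from a real-valued frequency parameter to an integer number of checkpoints can be sketched as below. It reads the relation in the text as frequency parameter = 1/(number of checkpoints + 1), which agrees with the reported pair of three checkpoints and 0.25; the function name and the printed range are illustrative assumptions.

```python
W = 5000.0  # task size, mega operations


def frequency_parameter(num_checkpoints: int) -> float:
    """Frequency parameter for equally spaced checkpoints."""
    return 1.0 / (num_checkpoints + 1)


for k in range(5):
    freq = frequency_parameter(k)
    # Work executed between consecutive checkpoints (and after the last one).
    print(f"checkpoints={k}, frequency parameter={freq:.3f}, work per segment={W * freq:.1f}")
```

Searching over the small set of integer checkpoint counts replaces a continuous one-dimensional search, which is what makes the joint optimization in the later sub-sections tractable.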
6.2. Optimizing rejuvenation policy under fixed checkpointing schedule
Table 2 presents the best obtained matrices V for different H for three checkpoints, a rejuvenation time of 20, b1=c1=100, b2=0.04, c2=0.0032.
Table 2. Optimal rejuvenation policies for a fixed number of checkpoints (3) and different H.
H    R        V (rows of the matrix)
1    0.8736   3
2    0.8961   22 25
4    0.9145   3222 2323 2333 3355
6    0.9178   232435 223352 223233 223333 333334 335455
8    0.9214   23333333 23233444 22233444 22222333 22232323 22233334 33334555 33445555
10   0.9214   2222222222 2332222222 2222332222 2222323222 2222323332 2222323323 2222323334 3333333344 3333444555 3333444555
Fig. 4 presents the obtained R as a function of H. It can be observed that an increase of H from 1 to 8 considerably improves R. However, a further increase in H does not change the PTC. The values of V increase with z and n, meaning that when the system approaches the task completion and not much time remains before the deadline, it has to take a chance to complete the task without conducting the rejuvenation procedure.
Fig. 4. PTC R as a function of H for three checkpoints, a rejuvenation time of 20, b1=c1=100, b2=0.04, c2=0.0032.
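The observation that the thresholds grow toward the end of the mission can be checked on the H=4 matrix reported in Table 2. The sketch below parses the digit rows and averages them by row and by column; which matrix index corresponds to z and which to n is not assumed here, and the helper name is illustrative.

```python
from typing import List


def parse_matrix(rows: str) -> List[List[int]]:
    """Turn whitespace-separated digit strings into a matrix of thresholds."""
    return [[int(ch) for ch in row] for row in rows.split()]


V4 = parse_matrix("3222 2323 2333 3355")  # H = 4 entry of Table 2

row_means = [sum(r) / len(r) for r in V4]
col_means = [sum(c) / len(c) for c in zip(*V4)]
print(row_means)
print(col_means)
```

For this matrix both the row averages and the column averages are non-decreasing, which is consistent with the trend described in the text.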
6.3. Optimizing joint rejuvenation and checkpointing policy
To optimize the joint rejuvenation and checkpointing policy, the GA is applied as follows: for each string, the suggested numerical algorithm evaluates the PTC R for different numbers of checkpoints ranging from 0 to a maximum value (10 in our example) and assigns the maximum obtained value of R to the solution fitness.
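The fitness evaluation for the joint problem can be sketched as a thin wrapper around a PTC evaluator. The wrapper and its names are assumptions for illustration, and `toy_ptc` is only a placeholder with an arbitrary peak; it does not implement the paper's numerical algorithm.

```python
from typing import Callable, Sequence, Tuple


def joint_fitness(string: Sequence[int],
                  evaluate_ptc: Callable[[Sequence[int], int], float],
                  max_checkpoints: int = 10) -> Tuple[float, int]:
    """Return the best PTC over 0..max_checkpoints and the maximizing count."""
    best_r, best_k = -1.0, 0
    for k in range(max_checkpoints + 1):
        r = evaluate_ptc(string, k)
        if r > best_r:  # keep the smallest k among equal values
            best_r, best_k = r, k
    return best_r, best_k


def toy_ptc(string: Sequence[int], k: int) -> float:
    """Placeholder evaluator that peaks at k = 3 (hypothetical)."""
    return 0.9 - 0.01 * abs(k - 3)


print(joint_fitness((2, 2, 2, 5), toy_ptc))
```

Because the inner loop calls the evaluator up to eleven times per string, the cost of one fitness evaluation grows accordingly compared with the fixed-schedule problem.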
Fig. 5 presents the values of PTC R and the numbers of checkpoints for the best obtained rejuvenation and backup policies with H=8 as functions of the rejuvenation time and the checkpointing/retrieval complexity factor (assuming that b2=0.1, c2=0.08). It can be seen that with an increase in the data checkpointing/retrieval complexity and the rejuvenation time, the optimal number of checkpoints decreases, as does the PTC. Eventually, for a large rejuvenation time and complexity factor, checkpointing is not beneficial at all (the optimal number of checkpoints is 0). Observe that when the rejuvenation time exceeds 15, the best solutions for complexity factors 2 and 4 provide the same value of R, because without checkpoints the checkpointing complexity does not affect the system performance.
Fig. 5. PTC R and numbers of backups for the best obtained solutions as functions of the rejuvenation time and checkpointing/retrieval complexity factor.
Tables 3 - 6 present the best obtained matrices V and numbers of checkpoints for different combinations of the rejuvenation time and the checkpointing/retrieval complexity factor. It can be seen that when not much time remains until the allowed mission time expires (large values of z for matrix V(z,n)), it is beneficial to have the system continue the task execution without rejuvenation even in low PS states, because the system has no time to perform the rejuvenation (which corresponds to high values of V). With an increase in the rejuvenation time, the values of V also increase, which means that the rejuvenations should be performed less frequently when they take longer.
When the increased checkpointing/retrieval complexity makes the checkpointing not beneficial (no checkpoints are used), the values of V(z,n) for low z decrease, which means that the rejuvenations should be more frequent when the system software deteriorates at the beginning of the mission (when restarting the mission from scratch does not take much time). The rejuvenations in the later stages of the mission become much less beneficial because the system may not have enough time to re-execute the task from scratch.
Table 3. The best V and numbers of checkpoints obtained by the GA and the corresponding PTC R for complexity factor 0.5 and different rejuvenation times
Rejuvenation time       5         10        15        20        25        30
Number of checkpoints   8         5         4         4         3         3
R                       0.993     0.973     0.942     0.915     0.890     0.858
V                       22222222  22222222  23333333  23333333  23333333  23333333
                        22222222  22222222  22233333  22233333  23222222  33223333
                        22222333  22222223  22323333  22323233  22232233  22232333
                        22222333  22222333  22223233  22223233  22232333  22232333
                        22222233  22222333  22223233  22223232  23233323  23233334
                        22222223  22222233  22223333  22223334  22233334  23333555
                        22222233  22333334  22333355  23333555  33444555  22444555
                        22233335  23444555  44444555  34455555  44444555  44444555
Table 4. The best V and numbers of checkpoints obtained by the GA and the corresponding PTC R for complexity factor 1 and different rejuvenation times
Rejuvenation time       5         10        15        20        25        30
Number of checkpoints   8         5         3         3         2         1
R                       0.986     0.956     0.923     0.894     0.861     0.829
V                       22222222  22222222  22222222  22222222  22222222  22222222
                        22222222  22222222  23222222  23222233  23323333  22333333
                        22222233  22222333  22232333  22232333  22223333  33343235
                        22222333  22222333  23232333  22232333  22223323  22242335
                        22222233  22222333  22232333  22232333  22323234  22343344
                        22222223  22222233  22232333  22233334  23333344  22333455
                        22222224  22222334  22233355  22333555  33455555  33335555
                        22233355  22345555  24455555  33444555  33445555  33355555
Table 5. The best V and numbers of checkpoints obtained by the GA and the corresponding PTC R for complexity factor 2 and different rejuvenation times
Rejuvenation time       5         10        15        20        25        30
Number of checkpoints   5         5         3         0         0         0
R                       0.959     0.910     0.868     0.849     0.835     0.822
V                       22222222  22222222  22333333  12222222  12233333  22222222
                        22232333  22233333  13222222  12222222  12233333  22233333
                        22222233  22222333  22233333  22233444  22333444  22333444
                        11222333  12222333  22232333  22333555  22234555  22224555
                        22222233  22222233  13232333  12233555  22333555  22333555
                        22222232  22222232  22232333  22334555  22334555  33333555
                        22222233  22222333  22333355  33334555  44455555  44444555
                        22233355  33555555  33344555  44455555  44444555  44444555
Table 6. The best V and numbers of checkpoints obtained by the GA and the corresponding PTC R for complexity factor 4 and different rejuvenation times
Rejuvenation time       5         10        15        20        25        30
Number of checkpoints   0         0         0         0         0         0
R                       0.903     0.883     0.865     0.849     0.835     0.822
V                       11222222  11222222  12222222  12222222  12233333  22333333
                        11222222  12222222  12222222  12222222  12233333  12233333
                        11223333  12223333  12233333  22233444  22333444  22333455
                        12223345  12233455  22233455  22333555  22234555  22224555
                        12233555  12233555  22233555  12233555  22333555  22333555
                        23334555  22334555  22334555  22334555  22334555  33333555
                        23334555  33334555  33334555  33334555  44455555  44444555
                        33344555  44445555  44444555  44455555  44444555  44455555
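The trends discussed for Tables 3 - 6 can be checked directly from the tabulated R values and numbers of checkpoints. The sketch below assumes that the first number of each column pair in the tables is the rejuvenation time and the second is the number of checkpoints, and that the table parameter 0.5, 1, 2, 4 is the checkpointing/retrieval complexity factor; the variable names are illustrative.

```python
TIMES = [5, 10, 15, 20, 25, 30]  # rejuvenation times (table columns)

# Complexity factor -> PTC R per rejuvenation time (Tables 3 - 6).
R = {
    0.5: [0.993, 0.973, 0.942, 0.915, 0.890, 0.858],
    1.0: [0.986, 0.956, 0.923, 0.894, 0.861, 0.829],
    2.0: [0.959, 0.910, 0.868, 0.849, 0.835, 0.822],
    4.0: [0.903, 0.883, 0.865, 0.849, 0.835, 0.822],
}
# Complexity factor -> best number of checkpoints per rejuvenation time.
CHECKPOINTS = {
    0.5: [8, 5, 4, 4, 3, 3],
    1.0: [8, 5, 3, 3, 2, 1],
    2.0: [5, 5, 3, 0, 0, 0],
    4.0: [0, 0, 0, 0, 0, 0],
}


def is_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))


factors = sorted(R)
r_falls_with_time = all(is_decreasing(r) for r in R.values())
r_falls_with_complexity = all(
    R[lo][i] >= R[hi][i]
    for lo, hi in zip(factors, factors[1:])
    for i in range(len(TIMES))
)
same_r_without_checkpoints = all(
    R[2.0][i] == R[4.0][i]
    for i in range(len(TIMES))
    if CHECKPOINTS[2.0][i] == 0 and CHECKPOINTS[4.0][i] == 0
)
print(r_falls_with_time, r_falls_with_complexity, same_r_without_checkpoints)
```

All three checks hold for the tabulated values, including the coincidence of R for complexity factors 2 and 4 once no checkpoints are used.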
7. Conclusion and Future Work
For many safety-critical or mission-critical applications, a software system failure can cause unrecoverable economic losses and even loss of human life. Therefore, preventive maintenance techniques have been applied to prevent system crashes from happening in the first place. This paper models one such technique, software rejuvenation coupled with checkpointing, which is used to mitigate the performance deterioration effects of software aging and to avoid system crashes. Specifically, with software rejuvenation, the degraded system performance can be restored to a higher or the peak performance level; with checkpointing, the restored system can effectively resume the mission task from the last saved checkpoint (instead of from the mission beginning). To balance the benefits and overheads of implementing this combined maintenance technique, we formulate and solve the joint optimal rejuvenation and checkpointing policy problem, maximizing the probability of task completion by a predetermined mission deadline. The solution methodology encompasses an iterative numerical algorithm proposed for evaluating the real-time task completion probability and the Genetic Algorithm applied for solving the formulated optimization problem. Example analyses demonstrate the effects of different parameters on the system solution, including the number of discretized time intervals, the rejuvenation decision function, the checkpointing frequency parameter, the number of regions for the rejuvenation decision function, the rejuvenation time, and the checkpointing/retrieval procedures complexity parameter.
The limitations of this work stem from the assumption of full, state-independent rejuvenation. The full rejuvenation always brings the degraded system back to the initial state with the peak performance level. As one direction of our future work, we plan to extend the proposed methodology by considering imperfect rejuvenations that only partially restore the system performance. Another direction is to relax the fixed rejuvenation time assumption by modeling a state-dependent rejuvenation time.
We are also interested in considering a cost function for optimization. Given the rejuvenation cost, the checkpointing cost and the penalty cost associated with the system's inability to complete the mission in time, one can choose the minimal-cost rejuvenation and checkpointing policy. This requires an extension of the current model that enables one to obtain the expected number of rejuvenations during the mission as well as the expected number of checkpoints (including those that fail before their completion because of the system state deterioration).
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61602094).
References
[1] J. Zhao, Y. Wang, G. Ning, K. S. Trivedi, R. Matias and K. Cai, "A comprehensive approach to optimal software rejuvenation", Performance Evaluation, vol. 70, no. 11, pp. 917-933, 2013.
[2] S. Garg, Y. Huang, C. Kintala, and K. S. Trivedi, "Minimizing Completion Time of a Program by Checkpointing and Rejuvenation", Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pp. 252-261, Philadelphia, Pennsylvania, USA, 1996.
[3] Y. Huang, C. Kintala, N. Kolettis and N. D. Fulton, "Software rejuvenation: analysis, module and applications", Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers, Pasadena, CA, USA, pp. 381-390, 1995.
[4] M. Grottke, R. Matias and K. S. Trivedi, "The fundamentals of software aging", Proc. of IEEE First International Workshop on Software Aging and Rejuvenation, Washington DC, USA, pp. 1-6, 2008.
[5] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. P. Zeggert, "Proactive management of software aging", IBM Journal of Research and Development, vol. 45, no. 2, pp. 311-332, March 2001.
[6] E. Marshall, "Fatal error: how patriot overlooked a Scud", Sci., p. 1347, Mar. 1992.
[7] V. Koutras and A. Platis, "Applying partial and full rejuvenation in different degradation levels", Proceedings of the IEEE 3rd International Workshop on Software Aging and Rejuvenation, pp. 20-25, 2011.
[8] K. G. Shin, T. H. Lin and Y. H. Lee, "Optimal Checkpointing of Real-Time Tasks", IEEE Transactions on Computers, vol. C-36, no. 11, pp. 1328-1341, Nov. 1987.
[9] G. Levitin, L. Xing, Y. Dai and V. M. Vokkarane, "Dynamic Checkpointing Policy in Heterogeneous Real-Time Standby Systems", IEEE Transactions on Computers, vol. 66, no. 8, pp. 1449-1456, August 2017.
[10] H. Okamura and T. Dohi, "Analysis of a Software System with Rejuvenation Restoration and Checkpointing", In: T. Nanya, F. Maruyama, A. Pataricza, M. Malek (eds.), Service Availability. ISAS 2008, Lecture Notes in Computer Science, vol. 5017, Springer, Berlin, Heidelberg, 2008.
[11] S. Garg, A. Puliafito, M. Telek and K. S. Trivedi, "Analysis of preventive maintenance in transactions based software systems", IEEE Transactions on Computers, vol. 47, no. 1, pp. 96-107, Jan 1998.
[12] J. Alonso, M. Grottke, A. Nikora and K. S. Trivedi, "The nature of the times to flight software failure during space missions", Proc. of IEEE Int. Conf. Softw. Rel. Eng. Workshops, pp. 331-340, 2012.
[13] C. Kintala, "Software rejuvenation in embedded systems", J. Autom. Lang. Combinatorics, vol. 14, pp. 63-73, 2009.
[14] T. Dohi, K. Goševa-Popstojanova, K. Vaidyanathan, K. S. Trivedi and S. Osaki, "Software Rejuvenation: Modeling and Applications", In: H. Pham (editor), Handbook of Reliability Engineering, Springer, London, 2003.
[15] X. Hua, C. Guo, H. Wu, D. Lautner and S. Ren, "Schedulability Analysis for Real-Time Task Set on Resource with Performance Degradation and Dual-Level Periodic Rejuvenations", IEEE Transactions on Computers, vol. 66, no. 3, pp. 553-559, March 1, 2017.
[16] D. Wang, W. Xie and K. S. Trivedi, "Performability analysis of clustered systems with rejuvenation under varying workload", Performance Evaluation, vol. 64, no. 3, pp. 247-265, 2007.
[17] G. Levitin, L. Xing, H. Ben-Haim, Y. Dai, "Optimizing software rejuvenation policy for real time tasks", Reliability Engineering & System Safety, vol. 176, pp. 202-208, 2018.
[18] K. Rinsaka and T. Dohi, "Toward high assurance software systems with adaptive fault management", Software Quality Journal, vol. 24, no. 1, pp. 65-85, March 2016.
[19] P. K. Saravakos, G. A. Gravvanis, V. P. Koutras and A. N. Platis, "A comprehensive approach to software aging and rejuvenation on a single node software system", Proceedings of The 9th Hellenic European Research on Computer Mathematics & its Applications Conference, 2009.
[20] S. Malefaki, V. P. Koutras and A. N. Platis, "Modeling Software Rejuvenation on a Redundant System Using Monte Carlo Simulation", Proc. of IEEE 23rd International Symposium on Software Reliability Engineering Workshops, Dallas, TX, pp. 277-282, 2012.
[21] A. Puliafito, "Software Rejuvenation in Cloud Systems", Proc. of IEEE International Symposium on Software Reliability Engineering Workshops, Naples, pp. 413-413, 2014.
[22] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, Reading, MA, 1989.
[23] G. Levitin, "Genetic algorithms in reliability engineering", Guest editorial, Reliability Engineering & System Safety, vol. 91, no. 9, pp. 975-976, 2006.
[24] H. Okamura, K. Yamamoto and T. Dohi, "Transient Analysis of Software Rejuvenation Policies in Virtualized System: Phase-Type Expansion Approach", Quality Technology & Quantitative Management, vol. 11, no. 3, pp. 335-351, 2014.
[25] Y. Bao, X. Sun and K. S. Trivedi, "Adaptive software rejuvenation: degradation model and rejuvenation scheme", Proc. of International Conference on Dependable Systems and Networks, pp. 241-248, 2003.
[26] D. Cotroneo, R. Natella, R. Pietrantuono and S. Russo, "A survey of software aging and rejuvenation studies", J. Emerg. Technol. Comput. Syst., vol. 10, no. 1, Article 8, 34 pages, 2014.
[27] W. Dang and J. Zeng, "Software System Rejuvenation Modeling Based on Sequential Inspection Periods and State Multi-control Limits", In: B. Zou, Q. Han, G. Sun, W. Jing, X. Peng, Z. Lu (eds.), Data Science. ICPCSEE 2017, Communications in Computer and Information Science, vol. 728, Springer, Singapore, 2017.
[28] G. Levitin, L. Xing, B. W. Johnson and Y. Dai, "Mission Reliability, Cost and Time for Cold Standby Computing Systems with Periodic Backup", IEEE Transactions on Computers, vol. 64, no. 4, pp. 1043-1057, April 2015.
[29] G. Levitin, L. Xing, and Y. Dai, "Optimal Backup Distribution in 1-out-of-N Cold Standby Systems", IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 4, pp. 636-646, April 2015.
[30] Y. Zhang and K. Chakrabarty, "Adaptive Checkpointing with Dynamic Voltage Scaling in Embedded Real-Time Systems", Embedded Software for SoC, pp. 449-463, 2003.
[31] N. Kaio, T. Dohi, K. S. Trivedi, "Availability Models with Age-Dependent Checkpointing", Proc. of IEEE Symp. on Reliable Distributed Systems, pp. 130, 2002.
[32] A. Ziv and J. Bruck, "An on-line algorithm for checkpoint placement", Proc. Seventh International Symposium on Software Reliability Engineering, White Plains, NY, pp. 274-283, 1996.
[33] H. Okamura and T. Dohi, "Availability optimization in operational software system with aperiodic time-based software rejuvenation scheme", IEEE International Conference on Software Reliability Engineering Workshops, Seattle, WA, pp. 1-6, 2008.
[34] G. Levitin, L. Xing, Q. Zhai, and Y. Dai, "Optimization of Full vs. Incremental Periodic Backup Policy", IEEE Transactions on Dependable and Secure Computing, vol. 13, no. 6, pp. 644-656, Nov 2016.
[35] A. Khunteta and P. Kumar, "An Analysis of Checkpointing Algorithms for Distributed Mobile Systems", International Journal on Computer Science and Engineering, vol. 02, no. 04, pp. 1314-1326, 2010.
[36] A. Bobbio, S. Garg, M. Gribaudo, A. Horváth, M. Sereno, and M. Telek, "Modeling software systems with rejuvenation restoration and checkpointing through fluid stochastic Petri nets", Proc. of International Workshop on Petri Nets and Performance Models, pp. 82-91, 1999.
[37] Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its applications", Digest of Papers for 25th International Symposium Fault-Tolerant Computing, 1995.
[38] N. Naksinehaboon, N. Taerat, C. Leangsuksun, C. F. Chandler, and S. L. Scott, "Benefits of Software Rejuvenation on HPC Systems", International Symposium on Parallel and Distributed Processing with Applications, Taipei, pp. 499-506, 2010.
[39] J. F. Meyer, "On Evaluating the Performability of Degradable Computing Systems", IEEE Transactions on Computers, vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[40] A. Bobbio and M. Telek, "Task completion time in degradable systems", In: B. R. Haverkort, R. Marie, G. Rubino and K. S. Trivedi, editors, Performability Modelling: Techniques and Tools, Wiley, Chapter 7:139-161, 2001.
[41] G. Levitin, L. Xing, and Y. Dai, "Heterogeneous 1-out-of-N warm standby systems with online checkpointing", Reliability Engineering & System Safety, vol. 169, pp. 127-136, 2018.
[42] R. Vitale, A. Zhyrova, J. F. Fortuna, O. E. de Noord, A. Ferrer, H. Martens, "On-The-Fly Processing of continuous high-dimensional data streams", Chemometrics and Intelligent Laboratory Systems, vol. 161, pp. 118-129, 2017.
[43] S. K. Bansal, "Towards a Semantic Extract-Transform-Load (ETL) Framework for Big Data Integration", Proc. of IEEE International Congress on Big Data, Anchorage, AK, pp. 522-529, 2014.
[44] V. C. Storey and I-Y. Song, "Big data technologies and Management: What conceptual modeling can do", Data & Knowledge Engineering, vol. 108, pp. 50-67, 2017.
[45] W. Weibull, "A statistical distribution function of wide applicability", J. Appl. Mech.-Trans. ASME, vol. 18, no. 3, pp. 293-297, 1951.
[46] A. T. Tai, S. N. Chau, L. Alkalaj, and H. Hecht, "On-Board Preventive Maintenance: Analysis of Effectiveness and Optimal Duty Period", Proc. of Third Int'l Workshop Object Oriented Real-Time Dependable Systems, 1997.