Joint optimal checkpointing and rejuvenation policy for real-time computing tasks


Accepted Manuscript

Gregory Levitin, Liudong Xing, Liang Luo

PII: S0951-8320(18)30834-2
DOI: https://doi.org/10.1016/j.ress.2018.10.006
Reference: RESS 6281

To appear in: Reliability Engineering and System Safety

Received date: 3 July 2018
Revised date: 30 August 2018
Accepted date: 18 October 2018

Please cite this article as: Gregory Levitin, Liudong Xing, Liang Luo, Joint optimal checkpointing and rejuvenation policy for real-time computing tasks, Reliability Engineering and System Safety (2018), doi: https://doi.org/10.1016/j.ress.2018.10.006

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• Real-time computing system with checkpointing and rejuvenation is considered;
• An algorithm for evaluating probability of timely task completion is presented;
• A problem of joint optimal checkpointing and rejuvenation scheduling is formulated;
• An illustrative example is presented.


Gregory Levitin a,b, Liudong Xing c, Liang Luo a,*

a Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China
b The Israel Electric Corporation, P. O. Box 10, Haifa 31000, Israel. E-mail: [email protected]
c University of Massachusetts, Dartmouth, MA 02747, USA. E-mail: [email protected]
* Corresponding author

Abstract - Performance of a software system can deteriorate from higher to lower levels due to software aging. To counteract the aging effect, software rejuvenation is widely implemented to restore the performance of a degraded system before a system crash actually takes place. To facilitate effective restoration of the system function after each rejuvenation action, it is desirable to apply checkpointing to occasionally save the system state on reliable storage so that the mission task can be resumed from the last saved checkpoint (instead of being restarted from the very beginning). As both rejuvenation and checkpointing procedures incur system overhead while bringing these benefits, it is important to determine the rejuvenation and checkpointing scheduling policy that optimizes the system performance measures of interest. This paper makes new contributions by modeling and optimizing the joint maintenance policy involving state-based rejuvenation and a periodic checkpointing schedule for software systems performing real-time computing tasks. The system can undergo multiple performance degradation levels or states, and the transition times between different states can follow arbitrary types of distributions. The proposed solution methodology encompasses an efficient numerical algorithm for evaluating the probability of task completion (PTC) by a pre-specified deadline. The joint optimal rejuvenation and checkpointing policy is further determined to maximize the PTC of the considered real-time task. Examples are provided to illustrate applications of the proposed methodology as well as the effects of system parameters on the optimization solution.

Keywords – Real-time computing; software rejuvenation; periodic checkpointing; task completion probability.

Acronyms and Abbreviations

cdf    cumulative distribution function
pdf    probability density function
PS     processing speed
PTC    probability of the task completion by a pre-specified deadline

Nomenclature

$w$                 total number of operations in computational task
$W(\pi)$            total number of operations needed to complete the task (including checkpointing procedures)
$\omega(\pi)$       number of operations performed between ends of two consecutive checkpointing procedures
$\tau$              maximum allowed task processing time or task deadline
$\pi$               fraction of task saved by incremental checkpoints
$\eta(\pi)$         total number of checkpoints during the task execution
$K$                 maximum possible number of rejuvenations
$J$                 maximum state index of the unit, i.e., the unit states range from 0 to $J$
$f_j(t)$, $F_j(t)$  pdf, cdf of random unit sojourn time in state $j$
$g_j$               unit PS in state $j$
$\theta$            rejuvenation time
$R$                 PTC
$m$                 number of time intervals and work portions considered in numerical algorithm
$u(t,x)$            rejuvenation decision function
$b(\pi)$            number of operations required to perform each checkpoint
$c(i)$              number of operations required to retrieve data stored by $i$ checkpoints
$\lfloor x \rfloor$ greatest integer not exceeding $x$

1. Introduction

It is widely known that failures of modern computing-based systems are more often caused by software faults than by hardware faults [1, 2]. As a major type of software fault, software aging results from the successive accumulation of error conditions while the software runs, such as round-off errors, memory leaks, unreleased memory space, and storage fragmentation [3, 4]. Due to the aging effect, the system performance gradually degrades, which can eventually lead to a system crash if no maintenance action is taken [5]. In mission-critical or safety-critical applications, a software failure may cause unrecoverable economic loss or even loss of human life [6], and is thus not acceptable for those systems. In the 1990s, software rejuvenation was first introduced as a preventive and proactive maintenance technique to counteract software aging before the system crash takes place [3]. Specifically, the rejuvenation technique consists of occasionally and proactively stopping the running program or application, cleaning its internal state or environment to remove the accrued undesirable conditions, and rebooting the system. After the rejuvenation procedure, the system performance (e.g., processing speed or service rate) can be fully or at least partially recovered [7]. For transaction-based software systems, the rejuvenated system can resume accepting and processing newly arriving transactions. However, for computing-based systems, the running computing task is interrupted at the point of rejuvenation, and rolling back to the beginning of the mission after the rejuvenation is costly. Thus, it is desirable to apply checkpointing to occasionally save the system state on reliable storage so that the mission task can be resumed from the last saved checkpoint after the rejuvenation (instead of being restarted from the very beginning of the mission) [2, 8, 9]. In this paper we model and optimize software systems that implement the combined checkpointing and rejuvenation technique for performing real-time computing tasks.

As both checkpointing and rejuvenation incur some system overhead, it is crucial to address the optimal rejuvenation policy and the optimal checkpointing schedule simultaneously to maximize the benefits and effectiveness of the combined fault-tolerance technique [10]. Specifically, frequent rejuvenations can improve system performance (e.g., greater average processing speed, larger average service rate), but they may also extend the task completion time because each rejuvenation procedure itself takes time to complete. Similarly, the number of checkpoints performed during the mission has a two-fold effect: more frequent checkpointing implies less work being repeated during rollback or recovery, but possibly a longer time to complete the mission. There is thus a tradeoff between the recovery overhead and the checkpoint overhead. Therefore, it is important to determine the rejuvenation and checkpointing scheduling policy that optimizes the system performance measure of interest, e.g., maximizing the task completion probability, maximizing system availability, minimizing expected mission completion time, minimizing transaction loss probability, or minimizing mean response time [11].

This paper makes new contributions by determining the joint optimal maintenance policy involving both state-based rejuvenation and a periodic checkpointing schedule for computing-based software systems performing real-time tasks. The system can undergo multiple performance degradation levels or states, and the transition times between different states can follow arbitrary types of distributions. The proposed solution methodology encompasses an efficient event-transition based numerical algorithm for time-dependent evaluation of the probability of task completion (PTC) by a pre-specified deadline. The joint optimal rejuvenation and checkpointing policy is then determined to maximize the PTC of the real-time task.

The remainder of this paper is arranged as follows. Section 2 reviews related work on software rejuvenation and checkpointing, followed by a clarification of the novel contributions of this work in the context of the literature review. Section 3 describes the system model and the joint optimal software maintenance policy problem considered in this work. Section 4 suggests the PTC evaluation method for the considered real-time software system. Section 5 discusses a heuristic algorithm for obtaining the joint optimal software maintenance policy. Section 6 presents illustrative examples of the suggested evaluation method and optimization solutions. Section 7 presents conclusions and directions of our future research.

2. Related Work

Since its introduction in the 1990s, software rejuvenation has been applied to mitigate software aging effects in a wide range of applications, such as spacecraft [12], embedded systems [13], billing software [3, 14], and transaction processing [1, 11]. Rejuvenation policies can be classified as time-based (initiated at periodic or aperiodic time intervals [1, 15]), condition-based (triggered by monitoring of the system state related to, for example, workload, response time, or completed tasks [16, 17]), and prediction-based (triggered by statistical analysis of system data [18]). Different types of software systems have been modeled for optimizing the rejuvenation policy, including, for example, single-node systems [19], standby/redundant systems [20], clustered systems [14, 16], and cloud systems [21].

While a rich set of exact or heuristic algorithms is directly available for various optimization problems [22, 23], specific approaches have to be developed to estimate the system performance measures to be optimized. Measurement-based, simulation-based, analytical model-based, and event transition methods have been developed for evaluating software aging and rejuvenation systems [17, 20, 24-26]. In the measurement-based approach, empirical data about the system state are first collected; statistical analysis or machine learning techniques are then applied to the collected runtime data to predict the performance measure of interest [5]. As the measurement-based approach typically exploits the particular nature of the considered system, it is difficult to generalize across different systems. Moreover, measurements are not appropriate for estimating long-term or steady-state performance measures. The simulation-based method shares these limitations and has the further problem of being computationally expensive, which is more severe when a high level of accuracy is required for critical systems [20]. In contrast, by using probabilistic models to represent the software aging process, the analytical model-based approaches are easily portable across different systems. Different stochastic processes have been adopted in the analytical models for evaluating software rejuvenation systems, including, for example, Markov processes [3, 7, 11], semi-Markov processes [1, 25], Markov regenerative processes [11, 24], and stochastic Petri nets [16]. Compared to the measurement-based and simulation-based approaches, these analytical models can be less effective because simplifying assumptions are often made about the real system behavior. In addition, the analytical models are state-based and thus suffer from the state-space explosion problem, which makes time-dependent performance evaluation computationally expensive or even intractable. Due to this issue, the analytical model-based approaches usually focus only on steady-state solutions [11, 27]. These stationary solutions, however, are often not adequate for critical applications that require time-dependent estimation of the system performance. As a recently emerged approach, the event transition method can provide time-dependent evaluation in an efficient manner [17].

Apart from rejuvenation, checkpointing is another important technique that has been intensively applied for dependable software computing. By periodically [28] or non-periodically [29] saving state information associated with the completed portion of a mission task, checkpointing facilitates an effective recovery of the system function in the case of a system failure. Upon the failure, the system does not need to restart from the very beginning; instead it can resume the task from the last saved checkpoint through rollback and retrieval. While reducing the recovery time, checkpointing may also impact system performance negatively due to the additional overhead incurred by each checkpointing procedure [9]. Various checkpoint placement policies (e.g., adaptive, age-dependent, online checkpoints) [30-32] have been investigated to trade off the reduction in recovery time against the checkpoint overhead, and to optimize performance measures of interest (e.g., minimizing the expected mission completion time [8], maximizing the task completion probability by a deadline [9], or maximizing system availability [33]). Different types of computing systems have been modeled for optimizing the checkpointing policy, including, for example, single-node systems [8, 34], distributed systems [35], and standby systems [9, 28, 29].

Though the two software fault tolerance techniques (rejuvenation and checkpointing) have been introduced for different purposes, they can be unified in the same software system to complement each other for greater benefits [36]. Compared to the rich literature on each separate technique, relatively few studies have been conducted to model and optimize software systems simultaneously implementing both techniques. Specifically, a maintenance policy with periodic checkpointing and periodic rejuvenation (triggered after every certain number of checkpoints) was modeled to compute and minimize the expected completion time in [2]. In [36], a modeling framework based on fluid stochastic Petri nets was proposed to quantify dependability measures of aging software systems with checkpointing and rejuvenation. In [33], the aperiodic time-based rejuvenation policy was evaluated and optimized for a software system subject to a fixed periodic checkpoint schedule, maximizing the steady-state system availability using dynamic programming. In [10], joint aperiodic time-based checkpointing and rejuvenation schemes were evaluated and optimized to maximize the steady-state system availability using dynamic programming. In [37], an implementation of checkpointing and rejuvenation in a Unix library called libckp was presented; no modeling or evaluation was actually performed. In [38], the benefits of combining rejuvenation and checkpoint mechanisms over using only the checkpoint mechanism were investigated through simulations for high performance computing systems.

To the best of our knowledge, no comprehensive evaluation has been performed for the joint checkpointing and rejuvenation scheme with respect to maximizing the PTC for real-time tasks. This work advances the state of the art by suggesting an iterative method for evaluating the PTC by a specific deadline in software systems subject to the joint maintenance scheme. The optimization problem of determining the optimal joint rejuvenation and checkpointing policy is further solved, maximizing the PTC of the real-time task.

Note that two different viewpoints have been assumed for system success measures in the literature: system-oriented (concerning the expected amount of work accomplished in a given time period) [39] and task/user-oriented (concerning the probability that a system task involving a certain amount of work is completed in due time) [40, 41]. The task-oriented measure, in particular the PTC, is adopted in this work.

3. System Model

A task containing $w$ operations should be performed by the considered system. The system's initial task processing speed (PS) is $g_0$. Due to software aging effects, the system degrades from states with a greater PS to states with a lower PS during the task processing. The system's PS in state $j$ is $g_j$. The total number of possible states is $J+1$. The system can transit from any state $j$ only to state $j+1$. The transition time from state $j$ to state $j+1$ follows a known distribution with cumulative distribution function (cdf) $F_j(t)$ and probability density function (pdf) $f_j(t)$. To improve the system performance, its software can be rejuvenated, causing the system to return to the initial state $j=0$ with the peak PS $g_0$.

To avoid starting the task execution from scratch after each rejuvenation, the system saves intermediate results to an external storage (i.e., it implements checkpointing). After the rejuvenation, the system retrieves the data saved by previous checkpointing procedures and continues performing the task from the operation that immediately follows the last completed checkpoint.

The periodic checkpointing procedures are performed upon completion of each fraction $\pi$ of the entire mission task. Thus, the system conducts a data checkpointing procedure after successfully performing each $\pi w$ operations. The total number of checkpointing actions performed during a successful mission (i.e., the checkpointing frequency) is fixed and can be determined as

$$\eta(\pi) = \begin{cases} \lfloor 1/\pi \rfloor & \text{if } \pi \lfloor 1/\pi \rfloor < 1, \\ \lfloor 1/\pi \rfloor - 1 & \text{if } \pi \lfloor 1/\pi \rfloor = 1. \end{cases} \qquad (1)$$

The second case in (1) takes place because when $\pi \lfloor 1/\pi \rfloor = 1$, the last checkpoint would be scheduled at the end of the mission, where it is not necessary. The checkpointed fraction $\pi$ is referred to as the checkpointing frequency parameter and is optimized in the optimization problems considered in this paper. According to the incremental checkpointing model [34], we assume that each checkpointing action requires a constant number of operations $b(\pi)$, which depends on the amount of task operations performed since the previous checkpoint. Thus, with $\eta(\pi)$ checkpoints as given by (1), the total number of operations needed to complete the mission is

$$W(\pi) = \eta(\pi)\, b(\pi) + w. \qquad (2)$$
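As a quick numerical check of (1) and (2), the checkpoint count and the total workload can be computed as in the following Python sketch. The function names, the use of exact fractions for the checkpointing fraction, and the sample values are illustrative assumptions and are not taken from the paper.

```python
from fractions import Fraction
from math import floor

def num_checkpoints(frac):
    """Number of checkpoints per Eq. (1) for checkpointing fraction `frac`.

    Pass a Fraction so that the test "frac divides the task evenly" is exact.
    """
    n = floor(1 / frac)
    # If the fraction divides the task evenly, the last checkpoint would
    # coincide with the end of the mission and is therefore skipped.
    return n - 1 if frac * n == 1 else n

def total_operations(w, frac, b):
    """Total operations per Eq. (2): checkpoint overhead plus task operations.

    `b` is the (assumed constant) number of operations per checkpoint.
    """
    return num_checkpoints(frac) * b + w

print(num_checkpoints(Fraction(1, 4)))             # 3
print(num_checkpoints(Fraction(3, 10)))            # 3
print(total_operations(1000, Fraction(1, 4), 20))  # 1060
```

Exact fractions are used here only because a floating-point fraction such as 0.1 may fail the equality test in the second case of (1).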

The number of operations needed to retrieve the data saved by $i$ previous checkpoints is $c(i)$. The rejuvenation procedure consists of a system reset, which takes a fixed time $\theta$, and data retrieval, which takes time $c(i)/g_0$ given that $i$ checkpoints have been completed before the rejuvenation (we assume that the system cannot deteriorate during the data retrieval because it does not perform the mission task, which is what causes the deterioration).

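The task performed is real-time; in particular, the mission fails if the task is not completed within the specified deadline $\tau$. Since the task takes at least time $W(\pi)/g_0$ even without any degradation, where $W(\pi)$ is the total number of operations given by (2), and each rejuvenation takes at least the reset time $\theta$, the maximum number of rejuvenation procedures that can be performed during the task processing is

$$K = \left\lfloor \frac{\tau - W(\pi)/g_0}{\theta} \right\rfloor. \qquad (3)$$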
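The bound in (3) can be illustrated with a few lines of Python. This is a minimal sketch: the function name, the argument names, and the numbers in the example are hypothetical, and raising an error for an infeasible deadline is a design choice of the sketch rather than part of the model.

```python
from math import floor

def max_rejuvenations(deadline, total_ops, g0, reset_time):
    """Upper bound on the number of rejuvenations per Eq. (3).

    deadline   : maximum allowed task processing time
    total_ops  : total operations including checkpointing, as in Eq. (2)
    g0         : processing speed in the initial state
    reset_time : fixed duration of one system reset
    """
    slack = deadline - total_ops / g0   # time left after the fastest possible run
    if slack < 0:
        raise ValueError("the task cannot meet the deadline even at peak speed")
    return floor(slack / reset_time)

# 1060 operations at speed 20 take 53 time units; 47 remain for resets of length 5.
print(max_rejuvenations(100, 1060, 20, 5))  # 9
```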

The optimal rejuvenation policy and checkpointing frequency should be determined to maximize the probability of task completion (PTC) $R$ by the deadline $\tau$. Specifically, more frequent rejuvenation procedures can lead to a greater average PS as the system is prevented from deteriorating to low PS levels. On the other hand, with more rejuvenation procedures, less time remains for the system to perform the mission task operations during the allowed time $\tau$. The rejuvenation policy to be optimized can be defined by a function $u(t,x)$. The rejuvenation decision rule adopted is: if the system transits to state $j$ at time $t$ from the beginning of the mission, when part $x$ of the entire task is accomplished, and $j \ge u(t,x)$, then the system software is immediately rejuvenated. The optimal rejuvenation policy is also interrelated with the optimal number of checkpointing procedures. Performing more checkpointing procedures increases the minimal number of operations to be performed during the mission. On the other hand, when the number of checkpointing procedures increases, the number of operations to be re-executed after each rejuvenation decreases, which decreases the expected total number of operations in the case of rejuvenations. Thus, the rejuvenation policy and checkpointing schedule should be jointly optimized to maximize the PTC $R$.

Fig. 1 presents an example of successful mission completion for the case of three checkpoints, $\eta(\pi)=3$, and a discrete function $u(t,x)$. The $(t,x)$ space is divided into six areas with a constant value of $u(t,x)$ in each area. The system starts performing the mission in state $j=0$, corresponding to PS $g_0$. During the mission, the system transits to state $j=1$ when $u(t,x)=2$ and performs the first checkpoint in this state. Then it transits to state $j=2$ when $u(t,x)=2$, which causes the system rejuvenation. After the rejuvenation, the system returns to state $j=0$ and continues performing the task from the operation that immediately follows the first checkpoint. While performing the second checkpoint, the system transits to state $j=1$ when $u(t,x)=2$, and while performing the third checkpoint the system transits to state $j=2$ and then to state $j=3$ when $u(t,x)=3$. The last transition causes the system rejuvenation. As the rejuvenation happens before completion of the third checkpoint, the system, after returning to state $j=0$, continues performing the task from the operation that immediately follows the second checkpoint. Then the system performs the third checkpoint and completes the entire mission while remaining in state $j=0$.

Fig. 1. Example of task performance

4. Event Transition Based Method for Evaluating PTC

In this section, an event transition-based iterative method is presented for analyzing the PTC of real-time tasks in the considered software rejuvenation and checkpointing system.

4.1 Probabilistic modeling of events and event transitions

Denote by $\langle T_{k,j}, X_{k,j} \rangle$ the event that the system starts executing task or checkpointing operations with PS $g_j$ at time $T_{k,j}$ from the beginning of the mission, when $X_{k,j}$ operations have been accomplished and $k$ rejuvenations have been performed. The joint pdf of the random values $T_{k,j}$ and $X_{k,j}$ is denoted as $q_{k,j}(t,x)$. When the mission first starts, $k=0$ and the system's PS is $g_0$. Thus we have

$$q_{0,0}(t,x) = \begin{cases} 1 & \text{for } t = x = 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$
According to the rejuvenation decision function $u(t,x)$, if the unit deteriorates to state $j$ after spending time $\tilde{t}$ in state $j-1$, the event transition $\langle T_{k,j-1}, X_{k,j-1} \rangle \to \langle T_{k,j}, X_{k,j} \rangle$ with

$$T_{k,j} = T_{k,j-1} + \tilde{t}, \quad X_{k,j} = X_{k,j-1} + g_{j-1}\tilde{t} \qquad (5)$$

takes place if $j < u(T_{k,j}, X_{k,j})$.

If $j \ge u(T_{k,j}, X_{k,j})$, the rejuvenation procedure starts immediately after the system state transition and the event transition $\langle T_{k,j-1}, X_{k,j-1} \rangle \to \langle T_{k+1,0}, X_{k+1,0} \rangle$ with

$$T_{k+1,0} = T_{k,j-1} + \tilde{t} + \theta + c(i)/g_0, \quad X_{k+1,0} = i\,\omega(\pi) \qquad (6)$$

is performed, where $\theta$ is the system reset time,

$$\omega(\pi) = \pi w + b(\pi) \qquad (7)$$

is the number of operations that should be performed by the system between the ends of two consecutive checkpointing procedures, and

$$i = \left\lfloor \frac{X_{k,j-1} + g_{j-1}\tilde{t}}{\omega(\pi)} \right\rfloor \qquad (8)$$

is the index of the last checkpointing procedure completed before the rejuvenation. $c(i)/g_0$ is the time needed to retrieve the data saved by the previous $i$ checkpointing procedures.
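The two event transitions in (5)-(8) can be sketched as a single Python function. All names below (`next_event`, `omega`, `theta`, the tuple layout of the result) are hypothetical and chosen for illustration; the sample speeds, threshold, and retrieval cost are made-up values.

```python
from math import floor

def next_event(T, X, j_prev, dt, g, u, omega, theta, c):
    """Apply one state transition j_prev -> j_prev + 1 after sojourn time dt.

    g     : list of processing speeds per state
    u     : rejuvenation decision function u(t, x)
    omega : operations between the ends of consecutive checkpoints, Eq. (7)
    theta : fixed reset time; c(i): operations to retrieve i checkpoints
    Returns (rejuvenations_added, new_state, new_T, new_X).
    """
    j = j_prev + 1
    t_new = T + dt
    x_new = X + g[j_prev] * dt
    if j < u(t_new, x_new):
        return 0, j, t_new, x_new                 # Eq. (5): keep working in state j
    i = floor(x_new / omega)                      # Eq. (8): last completed checkpoint
    # Eq. (6): reset, retrieve the data at peak speed, resume after checkpoint i.
    return 1, 0, t_new + theta + c(i) / g[0], i * omega

g = [10, 8, 5]
policy = lambda t, x: 2                           # rejuvenate on reaching state 2
print(next_event(0, 0, 0, 30, g, policy, 260, 2, lambda i: 5 * i))
# (0, 1, 30, 300)
print(next_event(30, 300, 1, 10, g, policy, 260, 2, lambda i: 5 * i))
# (1, 0, 42.5, 260)
```

In the second call the work done (380 operations) is rolled back to the end of the first checkpoint (260 operations), which is the rollback loss that more frequent checkpointing reduces.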

Because the probability that the system transits from state $j-1$ to state $j$ within the time interval $[t, t+dt)$ (since it started operating with PS $g_{j-1}$) is $f_{j-1}(t)\,dt$, $q_{k,j}(t,x)$ for $k \ge 0$, $j > 0$ can be recursively obtained as

$$q_{k,j}(t,x) = \int_{0}^{\min(t,\, x/g_{j-1})} q_{k,j-1}\big(t-\tilde{t},\ x-\tilde{t}g_{j-1}\big)\, f_{j-1}(\tilde{t})\, d\tilde{t} \quad \text{for } k\theta \le t < \tau,\ u(t,x) > j, \qquad (9)$$

where $\theta$ is the system reset time and $\tau$ is the task deadline.

Indeed, the system can transit to state $j$ at time $t$ when $x$ operations are completed if it spends in state $j-1$ a time $\tilde{t}$ not exceeding $x/g_{j-1}$. The minimal time at which the system can transit to state $j$ after $k$ rejuvenations, $t=k\theta$ (with $\theta$ the reset time), corresponds to the situation where all the rejuvenations are performed before completing the first checkpointing procedure.

Following (6), we can obtain $q_{k+1,0}(t, i\,\omega(\pi))$, where $\omega(\pi)$ is the number of operations between the ends of two consecutive checkpointing procedures, by considering all the possible transitions from events $\langle T_{k,j-1}, X_{k,j-1} \rangle$ with $k \ge 0$, $j=1,\dots,J$ for which

$$T_{k,j-1} = t - \tilde{t} - \theta - c(i)/g_0, \quad X_{k,j-1} = \tilde{x} - \tilde{t}g_{j-1}, \qquad (10)$$

where $\tilde{x}$ is the number of operations completed after working in state $j-1$ during time $\tilde{t}$, and

$$i\,\omega(\pi) \le \tilde{x} < (i+1)\,\omega(\pi), \qquad (11)$$

which corresponds to $i$ checkpointing procedures completed before the $(k+1)$-th rejuvenation. The rejuvenation is performed at time

$$T_{k,j-1} + \tilde{t} = t - \theta - c(i)/g_0, \qquad (12)$$

and only if $j \ge u(t-\theta-c(i)/g_0, \tilde{x})$. Thus

$$q_{k+1,0}(t, i\,\omega(\pi)) = \sum_{j=1}^{J} \int_{i\omega(\pi)}^{(i+1)\omega(\pi)} \int_{0}^{\min(t,\, \tilde{x}/g_{j-1})} \mathbf{1}\big(j \ge u(t-\theta-c(i)/g_0, \tilde{x})\big)\, q_{k,j-1}\big(t-\tilde{t}-\theta-c(i)/g_0,\ \tilde{x}-\tilde{t}g_{j-1}\big)\, f_{j-1}(\tilde{t})\, d\tilde{t}\, d\tilde{x}. \qquad (13)$$

4.2 PTC evaluation

Denote by $Y_{k,j}$ the event that the system completes the mission task no later than the deadline $\tau$ while the system is in state $j$ after $k$ rejuvenations. The PTC $R$ can thus be evaluated by summing the probabilities of mutually exclusive events as

$$R = \sum_{k=0}^{K} \sum_{j=0}^{J} \Pr\{Y_{k,j}\}. \qquad (14)$$

When the event $\langle T_{k,j}, X_{k,j} \rangle$ occurs, the system can accomplish the task without further state transitions if it functions with PS $g_j$ during the time $(W(\pi)-X_{k,j})/g_j$ required to complete the entire task and this time does not exceed the remaining allowed task processing time $\tau-T_{k,j}$. Hence, $\Pr\{Y_{k,j}\}$ in (14) can be computed as

$$\Pr\{Y_{k,j}\} = \int_{0}^{W(\pi)} \int_{k\theta}^{\tau - \frac{W(\pi)-x}{g_j}} q_{k,j}(t,x) \left[1 - F_j\!\left(\frac{W(\pi)-x}{g_j}\right)\right] dt\, dx, \qquad (15)$$

where $W(\pi)$ is the total number of operations needed to complete the mission and $\theta$ is the system reset time.

4.3 Discrete numerical PTC evaluation algorithm

The maximum allowable mission time $\tau$ is discretized into $m$ intervals with an equal duration of $\Delta=\tau/m$ such that interval $z$ starts at time $z\Delta$ and ends at time $(z+1)\Delta$ for $z=0,\dots,m-1$. The rejuvenation reset, which takes time $\theta$, thus takes $\Theta=\theta/\Delta$ time intervals. Correspondingly, the total number of operations $W(\pi)$ is divided into $m$ equal portions such that each portion contains $\delta=W(\pi)/m$ operations. During $z$ time intervals, the system functioning in state $j$ can perform $g_j z\Delta$ operations, corresponding to $g_j z\Delta/\delta = g_j z\tau/W(\pi)$ work portions. Thus, the system's PS in state $j$ is $G_j=g_j\tau/W(\pi)$ work portions per time interval. Each checkpointing procedure requires performing $B(\pi)=b(\pi)/\delta$ work portions. The data retrieval after $i$ checkpointing procedures requires performing $C(i)=c(i)/\delta$ work portions. The number of work portions between the ends of two consecutive checkpointing procedures is $\Omega=\omega(\pi)/\delta$, where $\omega(\pi)$ is the corresponding number of operations.
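The change of units described above (from time and operations to time intervals and work portions) can be sketched in Python as follows. The function and key names are hypothetical, and the numbers in the example are invented for illustration only.

```python
def discretize(deadline, m, total_ops, g, reset_time, b, ops_between_ckpt):
    """Convert model parameters to grid units (time intervals, work portions).

    deadline         : maximum allowed mission time
    m                : number of time intervals and of work portions
    total_ops        : total operations including checkpointing
    g                : list of processing speeds per state
    reset_time       : fixed reset time of a rejuvenation
    b                : operations per checkpoint
    ops_between_ckpt : operations between the ends of consecutive checkpoints
    """
    interval = deadline / m        # duration of one time interval
    portion = total_ops / m        # operations in one work portion
    return {
        "G": [gj * interval / portion for gj in g],  # portions per interval
        "Theta": reset_time / interval,              # reset time in intervals
        "B": b / portion,                            # checkpoint cost in portions
        "Omega": ops_between_ckpt / portion,         # portions between checkpoints
    }

grid = discretize(deadline=100, m=200, total_ops=1060, g=[20, 15, 10],
                  reset_time=5, b=20, ops_between_ckpt=270)
print(grid["Theta"])   # 10.0
print(grid["G"][0])    # equals 20 * 100 / 1060, about 1.887 portions per interval
```

A speed below one work portion per interval in the initial state means the task cannot be finished by the deadline even without degradation.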

Based on the corresponding cdf, the probability that the system remains in state $j$ for less than $d$ time intervals of duration $\Delta$ can be evaluated as $\Phi_j(d)=F_j(d\Delta)$. The probability that the system departs from state $j$ after functioning in this state for $d$ intervals is thus $\varphi_j(d)=F_j((d+1)\Delta)-F_j(d\Delta)$. The joint pdf $q_{k,j}(t,x)$ can be approximated by a matrix $Q_{k,j}(z,n)$ of probabilities that after the $k$-th rejuvenation the system starts operating in state $j$ in time interval $z$ when $n$ work portions are accomplished. The rejuvenation decision function $u(t,x)$ can be applied in the form $U(z,n)=u(z\Delta, n\delta)$, where $\delta$ is the number of operations in one work portion.

Based on (9) and matrix $Q_{k,j-1}(z,n)$ for $z=0,\dots,m-1$ and $n=0,\dots,m-1$, $Q_{k,j}(z,n)$ can be obtained. Based on (13) and all matrices $Q_{k,j}(z,n)$ for $j=0,\dots,J$, $z=0,\dots,m-1$, and $n=0,\dots,m-1$, $Q_{k+1,0}$ can be obtained. In the backward procedures based on (9) and (13), for any element of matrix $Q_{k,j}$ (or $Q_{k+1,0}$), a summation over different elements of matrix $Q_{k,j-1}$ (or of the matrices $Q_{k,j}$ for $j=0,\dots,J$) should be performed. Next, a more convenient forward procedure is explained and used. At the beginning of the forward procedure, matrices $Q_{k,j}$ and $Q_{k+1,0}$ are zeroed. Then, for any fixed element of matrix $Q_{k,j-1}$ and any realization of the time to the next system state transition, we iteratively update the corresponding elements in matrices $Q_{k,j}$ and $Q_{k+1,0}$. The forward procedure avoids storing all the matrices $Q_{k,j}$ for $j=0,\dots,J$ because any matrix $Q_{k,j-1}$ can be deleted from memory upon finishing the update of $Q_{k,j}$ and $Q_{k+1,0}$. As the iterative procedure cannot be expressed in equations, we describe this procedure of evaluating $R$ using the following pseudo code.

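Pseudo code of PTC evaluation algorithm

1. Initialize $R=0$ and $Q_{0,0}(z,n)=0$ for $z=0,\dots,m$; $n=0,\dots,m$; and assign $Q_{0,0}(0,0)=1$;
2. For $k=0,\dots,K$:
  2.1. Initialize $Q_{k+1,0}(z,n)=0$ for $z=0,\dots,m$; $n=0,\dots,m$;
  2.2. For $j=1,\dots,J$:
    2.2.1. Initialize $Q_{k,j}(z,n)=0$ for $z=0,\dots,m$; $n=0,\dots,m$;
    2.2.2. For $n=0,\dots,m-1$:
      2.2.2.1. For $z=k\Theta,\dots,m-1$:
        2.2.2.1.1. $d_{max}=(m-n)/G_{j-1}$;
        2.2.2.1.2. If $d_{max} \le m-z$ then $R=R+Q_{k,j-1}(z,n)(1-\Phi_{j-1}(d_{max}))$;
        2.2.2.1.3. If $j<$ …:
          2.2.2.1.3.1. For $d=0,\dots,\min\{d_{max}-1,\ m-z-1\}$:
            2.2.2.1.3.1.1. If $j<U(z+d,\ n+dG_{j-1})$ then $Q_{k,j}(z+d,\ n+dG_{j-1})=Q_{k,j}(z+d,\ n+dG_{j-1})+Q_{k,j-1}(z,n)\varphi_{j-1}(d)$;
            2.2.2.1.3.1.2. $i=\lfloor (n+dG_{j-1})/\Omega \rfloor$;
            2.2.2.1.3.1.3. If $j \ge U(z+d,\ n+dG_{j-1})$ then $Q_{k+1,0}(z+d+\Theta+C(i)/G_0,\ i\Omega)=Q_{k+1,0}(z+d+\Theta+C(i)/G_0,\ i\Omega)+Q_{k,j-1}(z,n)\varphi_{j-1}(d)$.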

Step 2.2.2.1.1 calculates the number of time intervals required for completing the task when the system operates in state $j-1$. Step 2.2.2.1.2 updates the PTC. Step 2.2.2.1.3.1.2 obtains the index $i$ of the last completed checkpoint. Step 2.2.2.1.3.1.1 corresponds to the event transition without rejuvenation. Step 2.2.2.1.3.1.3 corresponds to the event transition with rejuvenation. The complexity of the iterative algorithm is $O((K+1)(J+1)m^3)$.
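The forward procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: sparse dictionaries replace the matrices, fractional grid indices are rounded (down for work portions, up for durations), the reset time `Theta` is a whole number of intervals, and leaving the last state $J$ is treated as a mission failure. All argument names are hypothetical.

```python
import math

def evaluate_ptc(G, cdfs, m, dt, Theta, C, Omega, U, K):
    """Forward event-transition evaluation of the PTC on an m-by-m grid.

    G[j]    : speed in state j, in work portions per time interval
    cdfs[j] : cdf of the sojourn time in state j (callable of time)
    dt      : duration of one time interval
    Theta   : reset time in whole intervals; C(i): retrieval work in portions
    Omega   : work portions between the ends of consecutive checkpoints
    U(z, n) : discretized rejuvenation decision function
    K       : maximum number of rejuvenations
    """
    J = len(G) - 1
    R = 0.0
    start = {(0, 0): 1.0}              # sparse Q_{k,0}: (z, n) -> probability
    for k in range(K + 1):
        restart = {}                   # sparse Q_{k+1,0}
        Q_prev = start
        for j in range(1, J + 2):      # j - 1 is the current (source) state
            s = j - 1
            Q_next = {}
            for (z, n), p in Q_prev.items():
                d_max = math.ceil((m - n) / G[s])   # intervals still needed
                if z + d_max <= m:                  # can finish by the deadline
                    R += p * (1.0 - cdfs[s](d_max * dt))
                if j > J:
                    continue           # assumption: no operation beyond state J
                for d in range(min(d_max, m - z)):
                    pd = p * (cdfs[s]((d + 1) * dt) - cdfs[s](d * dt))
                    if pd == 0.0:
                        continue
                    z2, n2 = z + d, int(n + d * G[s])   # rounding down: assumption
                    if j < U(z2, n2):               # keep working in state j
                        Q_next[(z2, n2)] = Q_next.get((z2, n2), 0.0) + pd
                    elif k < K:                     # rejuvenate and roll back
                        i = int(n2 // Omega)        # last completed checkpoint
                        z3 = z2 + Theta + math.ceil(C(i) / G[0])
                        if z3 < m:
                            key = (z3, int(i * Omega))
                            restart[key] = restart.get(key, 0.0) + pd
            Q_prev = Q_next
        start = restart
    return R

# Two states, no rejuvenation: only a transition in the last interval before
# completion still allows the slow state to finish in time.
F0 = lambda t: 1.0 - math.exp(-t / 10.0)
never = lambda t: 0.0
print(evaluate_ptc([2.0, 0.5], [F0, never], m=10, dt=1.0, Theta=1,
                   C=lambda i: 0.0, Omega=3.0, U=lambda z, n: 2, K=0))
```

For this small example the result equals $e^{-0.4}$ up to floating-point error, which can be verified by hand from the two contributing events.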

5. Optimal Joint Checkpointing and Rejuvenation Policy

Given the PTC evaluation algorithm described above, one can find the rejuvenation and checkpointing policy that maximizes the PTC using a general-purpose optimization procedure. To optimize the rejuvenation policy under a fixed checkpointing schedule $\pi$, we define a discrete approximation of the decision function $U(z,n)$. Specifically, we divide the $m$ time intervals (work portions) into $H$ regions. For any combination of $z$ and $n$, we define $U(z,n)=V(\lfloor z/h \rfloor, \lfloor n/h \rfloor)$, with $h=\lfloor m/H \rfloor+1$ and $V$ being an $H \times H$ matrix consisting of elements ranging from 0 to $J+1$. Such an approximation implies that when the system transits to state $j$ in time interval $z$ (from the beginning of the mission) when $n$ work portions are accomplished, the rejuvenation procedure is immediately conducted if $j \ge V(\lfloor z/h \rfloor, \lfloor n/h \rfloor)$.
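The region-based lookup of the decision function can be sketched in Python as below. The function name and the sample matrix are hypothetical, and integer division is assumed for the region indices and for the region width.

```python
def make_decision_function(V, m):
    """Build U(z, n) from an H-by-H threshold matrix V (list of lists).

    The m time intervals and m work portions are each split into H regions
    of width h = m // H + 1, so every index 0..m maps to a valid region.
    """
    H = len(V)
    h = m // H + 1
    def U(z, n):
        return V[z // h][n // h]
    return U

V = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 4]]
U = make_decision_function(V, m=100)   # region width h = 34
print(U(0, 0), U(40, 10), U(100, 100))  # 1 2 4

def should_rejuvenate(j, z, n):
    """Decision rule: rejuvenate on entering state j if j >= U(z, n)."""
    return j >= U(z, n)

print(should_rejuvenate(2, 40, 10))     # True
```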

Finding the optimal rejuvenation policy in the form of matrix $V$ is a complicated optimization problem. There are $(J+1)^{H \times H}$ possible solutions, so an exhaustive examination of all the solutions is not realistic even for a moderate number of regions $H$ and system states $J+1$. To solve the optimal rejuvenation problem, we use a Genetic Algorithm (GA) heuristic, the most widely used method in reliability optimization due to its advantages of flexibility in solution representation, the possibility of parallel computation, quick convergence to near-optimal solutions, etc. [22, 23]. The basic structure of the GA used in this work is as follows.

First, an initial population of randomly constructed solutions (represented as integer strings) is generated. Within this population, new solutions are obtained during the genetic cycle by using crossover and mutation operators. The crossover produces a new solution (offspring) from a randomly selected pair of parent solutions by copying random fragments of strings from different parents. Mutation results in slight changes to the offspring's structure, and maintains a diversity of solutions. This procedure avoids premature convergence to a local optimum, and facilitates jumps in the solution space. In our GA, the mutation procedure swaps elements initially located in two randomly chosen positions on the string. Each new solution is decoded, and its objective function (fitness) value is estimated. This value, as a measure of quality, is used to compare different solutions. The comparison is accomplished by a selection procedure that determines which solution is better: the newly obtained solution, or the worst solution in the population. The better solution joins the population, while the other is discarded. If the population contains equivalent solutions following selection, redundancies are eliminated, and the population size decreases as a result.


After new solutions are produced a pre-specified number of times, new randomly constructed solutions are generated to replenish the shrunken population, and a new genetic cycle begins. The GA is terminated after a pre-determined number of genetic cycles. The final population contains the best solution achieved. It also contains different near-optimal solutions which may be of interest in the decision-making process.
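The GA structure described above can be sketched as follows. This is only an illustrative reading of the description, not the authors' implementation: the function names, the default parameter values, the fixed random seed and the toy fitness function are all assumptions, and the real fitness would be the PTC returned by the numerical algorithm for the decoded matrix V.

```python
import random


def crossover(p1, p2, rng):
    """Offspring copies a random fragment from p2 and the rest from p1."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:i] + p2[i:j] + p1[j:]


def mutate(s, rng):
    """Swap the elements at two randomly chosen positions of the string."""
    s = list(s)
    i, j = rng.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return tuple(s)


def run_ga(fitness, length, max_value, pop_size=20, cycles=5,
           offspring_per_cycle=200, seed=0):
    """Steady-state GA over integer strings with elements in 0..max_value.
    Assumes the search space holds far more than pop_size strings."""
    rng = random.Random(seed)
    population = {}  # solution -> fitness; dict keys cannot repeat
    for _ in range(cycles):
        # Replenish the (possibly shrunken) population with random strings.
        while len(population) < pop_size:
            s = tuple(rng.randint(0, max_value) for _ in range(length))
            population[s] = fitness(s)
        for _ in range(offspring_per_cycle):
            if len(population) < 2:
                break
            p1, p2 = rng.sample(list(population), 2)
            child = mutate(crossover(p1, p2, rng), rng)
            f = fitness(child)
            worst = min(population, key=population.get)
            if f > population[worst]:
                # The better solution joins and the worst is discarded; if the
                # child duplicates a member, the population shrinks by one.
                del population[worst]
                population[child] = f
    best = max(population, key=population.get)
    return best, population[best]


# Toy stand-in for the PTC: favours large entries (hypothetical fitness).
best, value = run_ga(lambda s: sum(s), length=16, max_value=5)
```

Storing the population in a dictionary keyed by the string is a design shortcut: it makes the elimination of redundant solutions automatic, at the cost of treating only identical strings as equivalent.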

The GA operates with integer strings. To apply the GA to a specific optimization problem, the corresponding solution representation must be defined. For the rejuvenation optimization problem considered in this work, any integer string a=(a1,…,aH×H) with 0≤aj≤J+1 corresponds to a feasible solution such that for any 0≤i,k
assigns the obtained value of R to the solution fitness.

6. Illustrative Example

6.1. Evaluating PTC of a visual data streaming onboard software system

A visual data streaming onboard software system for a space probe is considered [42]. According to a predetermined schedule, the space probe takes video images, which are placed in an input buffer to be processed by a software ETL (Extract, Transform, Load) streaming system [43, 44]. Specifically, the ETL system is responsible for the image data transformation and compression task that prepares the data for transmission to Earth. The task requires W=5000 mega operations. Due to the limited capacity of the input buffer (new image batches replace the old ones), the ETL has to finish the task in the time between image updates, which is 100 time units.


The ETL system utilizes four internal memory blocks. The data processing speed depends on the number of blocks that are overwhelmed. The system state is defined by the number of unavailable memory blocks. Thus, there are five different system states (from 0 to 4). As the number of unavailable blocks increases consecutively, the system state increases one by one during the data processing. In addition, the system states with different numbers of available memory blocks can have different sojourn times. This is because the data processing time, and thus the random time of filling the next block, depends on the number of available memory blocks. When all four memory blocks are unavailable, the software can still function using low capacity registers, but at a very low speed. During the periodic checkpointing procedures the system saves the data processed so far to an external memory. The rejuvenation cleans up all of the internal memory blocks. Therefore, the system returns to state 0 from any other state after the rejuvenation. The software mission fails if the data transformation and compression task is not finished before the input buffer data renewal takes place.

The sojourn time of the system in state j∈{0,1,2,3} follows the Weibull distribution [1, 45] with scale parameter ηj and shape parameter βj. In other words, the system transits from state j to state j+1 after a random time having the cdf Fj(t)=1-exp(-(t/ηj)^βj) for j∈{0,1,2,3}. The system can transit from the worst state 4 only to the initial state 0 when the rejuvenation is performed. Note that the proposed methodology is applicable to any type of distribution. While no consensus has been found on the type of failure or degradation time distributions for operational software systems [11], the Weibull distribution is selected for this example system (and in other literature, e.g., [11, 46]) because of its flexible representation of different failure rate behaviors. In [1], accelerated life testing based studies showed that the Weibull distribution is a good fit for modeling the time to failure distribution of the software systems studied in that work. The PS of the system in each state is shown in Table 1, together with the scale (ηj) and shape (βj) parameters.

Table 1. Parameters of the example software system
State j | PS gj | Scale parameter of Fj(t) | Shape parameter of Fj(t)
0 | 95 | 25 | 1.2
1 | 72 | 33 | 1.0
2 | 50 | 38 | 1.1
3 | 29 | 29 | 1.3
4 | 15 | - | -
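As a small illustration of the sojourn-time model, the sketch below evaluates the standard Weibull cdf, 1 - exp(-(t/scale)^shape), for the states of Table 1. The assignment of the two parameter columns to scale (25, 33, 38, 29) and shape (1.2, 1.0, 1.1, 1.3) is an assumption based on their magnitudes, and the function and variable names are hypothetical.

```python
import math

# (scale, shape) of the sojourn-time distribution for states j = 0..3,
# read from Table 1 (column assignment is an assumption).
WEIBULL_PARAMS = {0: (25.0, 1.2), 1: (33.0, 1.0), 2: (38.0, 1.1), 3: (29.0, 1.3)}


def sojourn_cdf(j: int, t: float) -> float:
    """Probability that the system has left state j by time t (Weibull cdf)."""
    if t <= 0:
        return 0.0
    scale, shape = WEIBULL_PARAMS[j]
    return 1.0 - math.exp(-((t / scale) ** shape))


# Probability of leaving state 0 within the first 25 time units.
p_leave_state0 = sojourn_cdf(0, 25.0)
```

At t equal to the scale parameter the cdf equals 1 - 1/e regardless of the shape, which gives a quick sanity check for any parameter set.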

We assume that checkpointing and retrieval procedures after performing x>0 task operations require performing b1+b2x and c1+c2x operations, respectively. Thus, b()=b1+b2w and c(i)=c1·1(i>0)+c2iw.
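The linear overhead model can be written out as a short sketch. It uses the coefficient values b1=c1=100, b2=0.1, c2=0.08 quoted later in this example; the function names are hypothetical, and treating the retrieval cost as zero when no work has been saved is an assumed reading of the indicator 1(i>0).

```python
B1, B2 = 100.0, 0.1   # checkpointing: fixed part and per-operation part
C1, C2 = 100.0, 0.08  # retrieval: fixed part and per-operation part


def checkpoint_ops(x: float) -> float:
    """Operations needed to checkpoint after x > 0 completed task operations."""
    return B1 + B2 * x


def retrieval_ops(x: float) -> float:
    """Operations needed to retrieve x saved operations (zero if nothing saved)."""
    return C1 + C2 * x if x > 0 else 0.0


# Overheads after 1000 completed task operations.
save_cost = checkpoint_ops(1000.0)
load_cost = retrieval_ops(1000.0)
```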

The number of intervals m used during the discretization affects the accuracy of the results obtained. To investigate its effect, values of R are collected for different m ranging from 50 to 2600. For a rejuvenation time of 20, u(t,x)≡3, b1=c1=100, b2=0.1, c2=0.08, and a checkpointing frequency parameter of 0.2, Fig. 2 presents values of R and the running time of the proposed numerical algorithm on a Pentium 3.2GHz PC as functions of m. The relative differences in R for m=100, m=500 and m=1000 compared to m=2600 are 0.98%, 0.15% and 0.04%, respectively.

Fig. 3 presents the PTC R as a function of the checkpointing frequency parameter for a rejuvenation time of 20, b1=c1=100, b2=0.1, c2=0.08, and different u(x,t)≡k, corresponding to the case where the rejuvenation is performed immediately when the system enters state j=k, independent of the amount of completed work and the time when this event occurs.

Fig. 2. PTC R and running time of the proposed algorithm on Pentium 3.2GHz PC as functions of m.

Fig. 3. PTC R as a function of the checkpointing frequency parameter for different u(x,t)≡k.

It can be seen that R, as a function of the checkpointing frequency parameter, always takes its maximum when the amount of operations after the last checkpointing procedure equals that before any checkpointing procedure (the explanation is given in [28]). This means that instead of looking for a real-valued frequency parameter maximizing R, one can check different integer numbers of checkpointing procedures and set the frequency parameter to 1/(number of checkpoints+1). The best obtained checkpointing and rejuvenation policy for constant u(t,x) is three checkpoints (frequency parameter 0.25) and u(t,x)≡3. In the following sub-sections, we present solutions to optimizing the rejuvenation policy under a fixed checkpointing frequency, as well as solutions to the joint optimization of rejuvenation and checkpointing policy.

6.2. Optimizing rejuvenation policy under fixed checkpointing schedule

Table 2 presents the best obtained matrices V for different H for three checkpoints, a rejuvenation time of 20, b1=c1=100, b2=0.04, c2=0.0032.

Table 2. Optimal rejuvenation policies for a fixed number of checkpoints (3) and different H.
H | R | V (rows listed in order)
1 | 0.8736 | 3
2 | 0.8961 | 22 25
4 | 0.9145 | 3222 2323 2333 3355
6 | 0.9178 | 232435 223352 223233 223333 333334 335455
8 | 0.9214 | 23333333 23233444 22233444 22222333 22232323 22233334 33334555 33445555
10 | 0.9214 | 2222222222 2332222222 2222332222 2222323222 2222323332 2222323323 2222323334 3333333344 3333444555 3333444555

Fig. 4 presents the obtained R as a function of H. It can be observed that an increase of H from 1 to 8 considerably improves R. However, a further increase in H does not change the PTC. The values of V increase with z and n, meaning that when the system approaches the task completion and not much time remains before the deadline, it has to take a chance to complete the task without conducting the rejuvenation procedure.

Fig. 4. PTC R as a function of H for three checkpoints, rejuvenation time 20, b1=c1=100, b2=0.04, c2=0.0032.

6.3. Optimizing joint rejuvenation and checkpointing policy

To optimize the joint rejuvenation and checkpointing policy, when applying the GA, for each string the suggested numerical algorithm evaluates the PTC R for different numbers of checkpoints ranging from 0 to a maximum (10 in our example) and assigns the maximum obtained value of R to the solution fitness.
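The joint fitness evaluation amounts to an inner search over the number of checkpoints for each candidate rejuvenation matrix. A minimal sketch follows; `evaluate_ptc` is a hypothetical stand-in for the paper's numerical PTC algorithm, and the toy function used in the usage lines is invented purely so the example runs.

```python
def joint_fitness(V, evaluate_ptc, k_max: int = 10):
    """Return (best PTC, best number of checkpoints) for rejuvenation matrix V,
    trying every number of checkpoints k = 0..k_max."""
    best_k, best_r = max(
        ((k, evaluate_ptc(V, k)) for k in range(k_max + 1)),
        key=lambda pair: pair[1],
    )
    return best_r, best_k


def toy_ptc(V, k):
    """Invented placeholder: peaks at three checkpoints, ignores V."""
    return 1.0 - 0.1 * abs(k - 3)


fitness, checkpoints = joint_fitness([[3]], toy_ptc)
```

Because the inner loop runs the PTC algorithm k_max+1 times per string, the joint problem costs roughly that many times more than the fixed-schedule problem of Section 6.2.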

Fig. 5 presents values of the PTC R and numbers of checkpoints for the best obtained rejuvenation and backup policies with H=8 as functions of the rejuvenation time and the checkpointing/retrieval complexity factor (assuming that b2 equals 0.1 and c2 equals 0.08, each multiplied by the complexity factor). It can be seen that with an increase in the data checkpointing/retrieval complexity and rejuvenation time, the optimal number of checkpoints decreases, as does the PTC. Eventually, for a large rejuvenation time and complexity factor, checkpointing is not beneficial at all (zero checkpoints). Observe that when the rejuvenation time exceeds 15, the best solutions for complexity factors 2 and 4 provide the same value of R, because with zero checkpoints the checkpointing complexity does not affect the system performance.

Fig. 5. PTC R and numbers of backups for the best obtained solutions as functions of the rejuvenation time and checkpointing/retrieval complexity factor.

Tables 3-6 present the best obtained matrices V and numbers of checkpoints for different combinations of the rejuvenation time and the complexity factor. It can be seen that when not much time remains until the allowed mission time expires (large values of z for matrix V(z,n)), it is beneficial to have the system continue the task execution without rejuvenation even in low PS states, because the system has no time to perform the rejuvenation (which corresponds to high values of V). With an increase in the rejuvenation time, the values of V also increase, which means that the rejuvenations should be performed less frequently when their duration increases.

When the increased checkpointing/retrieval complexity makes the checkpointing not beneficial (zero checkpoints), the values of V(z,n) for low z decrease, which means that the rejuvenations should be more frequent when the system software deteriorates at the beginning of the mission (when restarting the mission from scratch does not take much time). The rejuvenations in the later stages of the mission become much less beneficial because the system may not have enough time to re-execute the task from scratch.

Table 3. The best solutions (matrix V and number of checkpoints) obtained by the GA and corresponding PTC R for complexity factor 0.5 and different rejuvenation times
Rejuvenation time | Checkpoints | R | V (rows listed in order)
5 | 8 | 0.993 | 22222222 22222222 22222333 22222333 22222233 22222223 22222233 22233335
10 | 5 | 0.973 | 22222222 22222222 22222223 22222333 22222333 22222233 22333334 23444555
15 | 4 | 0.942 | 23333333 22233333 22323333 22223233 22223233 22223333 22333355 44444555
20 | 4 | 0.915 | 23333333 22233333 22323233 22223233 22223232 22223334 23333555 34455555
25 | 3 | 0.890 | 23333333 23222222 22232233 22232333 23233323 22233334 33444555 44444555
30 | 3 | 0.858 | 23333333 33223333 22232333 22232333 23233334 23333555 22444555 44444555

Table 4. The best solutions (matrix V and number of checkpoints) obtained by the GA and corresponding PTC R for complexity factor 1 and different rejuvenation times
Rejuvenation time | Checkpoints | R | V (rows listed in order)
5 | 8 | 0.986 | 22222222 22222222 22222233 22222333 22222233 22222223 22222224 22233355
10 | 5 | 0.956 | 22222222 22222222 22222333 22222333 22222333 22222233 22222334 22345555
15 | 3 | 0.923 | 22222222 23222222 22232333 23232333 22232333 22232333 22233355 24455555
20 | 3 | 0.894 | 22222222 23222233 22232333 22232333 22232333 22233334 22333555 33444555
25 | 2 | 0.861 | 22222222 23323333 22223333 22223323 22323234 23333344 33455555 33445555
30 | 1 | 0.829 | 22222222 22333333 33343235 22242335 22343344 22333455 33335555 33355555

Table 5. The best solutions (matrix V and number of checkpoints) obtained by the GA and corresponding PTC R for complexity factor 2 and different rejuvenation times
Rejuvenation time | Checkpoints | R | V (rows listed in order)
5 | 5 | 0.959 | 22222222 22232333 22222233 11222333 22222233 22222232 22222233 22233355
10 | 5 | 0.910 | 22222222 22233333 22222333 12222333 22222233 22222232 22222333 33555555
15 | 3 | 0.868 | 22333333 13222222 22233333 22232333 13232333 22232333 22333355 33344555
20 | 0 | 0.849 | 12222222 12222222 22233444 22333555 12233555 22334555 33334555 44455555
25 | 0 | 0.835 | 12233333 12233333 22333444 22234555 22333555 22334555 44455555 44444555
30 | 0 | 0.822 | 22222222 22233333 22333444 22224555 22333555 33333555 44444555 44444555

Table 6. The best solutions (matrix V and number of checkpoints) obtained by the GA and corresponding PTC R for complexity factor 4 and different rejuvenation times
Rejuvenation time | Checkpoints | R | V (rows listed in order)
5 | 0 | 0.903 | 11222222 11222222 11223333 12223345 12233555 23334555 23334555 33344555
10 | 0 | 0.883 | 11222222 12222222 12223333 12233455 12233555 22334555 33334555 44445555
15 | 0 | 0.865 | 12222222 12222222 12233333 22233455 22233555 22334555 33334555 44444555
20 | 0 | 0.849 | 12222222 12222222 22233444 22333555 12233555 22334555 33334555 44455555
25 | 0 | 0.835 | 12233333 12233333 22333444 22234555 22333555 22334555 44455555 44444555
30 | 0 | 0.822 | 22333333 12233333 22333455 22224555 22333555 33333555 44444555 44455555

7. Conclusion and Future Work


For many safety-critical or mission-critical applications, a software system failure can cause unrecoverable economic losses and even the loss of human lives. Therefore, preventive maintenance techniques have been applied to prevent system crashes from happening in the first place. This paper models one such technique, software rejuvenation coupled with checkpointing, which is used to mitigate the performance deterioration effects of software aging and to avoid a system crash. Specifically, with software rejuvenation, the degraded system performance can be restored to a higher or the peak performance level; with checkpointing, the restored system can effectively resume the mission task from the last saved checkpoint (instead of from the beginning of the mission). To balance the benefits and overheads of implementing this combined maintenance technique, we formulate and solve the joint optimal rejuvenation and checkpointing policy problem, maximizing the probability of task completion by a predetermined mission deadline. The solution methodology encompasses an iterative numerical algorithm proposed for evaluating the real-time task completion probability and the Genetic Algorithm applied for solving the formulated optimization problem. Example analyses demonstrate the effects of different parameters on the solution, including the number of discretized time intervals, the rejuvenation decision function, the checkpointing frequency parameter, the number of regions for the rejuvenation decision function, the rejuvenation time, and the checkpointing/retrieval procedures complexity parameter.

The limitations of this work stem from the assumption of full, state-independent rejuvenation. The full rejuvenation always brings the degraded system back to the initial state with the peak performance level. As one direction of our future work, we plan to extend the proposed methodology by considering imperfect rejuvenations that only partially restore the system performance. Another direction is to relax the fixed rejuvenation time assumption by modeling a state-dependent rejuvenation time.

We are also interested in considering a cost function for the optimization. Given the rejuvenation cost, the checkpointing cost and the penalty cost associated with the system's inability to complete the mission in time, one can choose the minimal-cost rejuvenation and checkpointing policy. This requires an extension of the current model that enables one to obtain the expected number of rejuvenations during the mission as well as the expected number of checkpoints (including those that fail before their completion because of the system state deterioration).


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61602094).


References

[1] J. Zhao, Y. Wang, G. Ning, K. S. Trivedi, R. Matias and K. Cai, "A comprehensive approach to optimal software rejuvenation", Performance Evaluation, vol. 70, no. 11, pp. 917-933, 2013.
[2] S. Garg, Y. Huang, C. Kintala and K. S. Trivedi, "Minimizing Completion Time of a Program by Checkpointing and Rejuvenation", Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pp. 252-261, Philadelphia, Pennsylvania, USA, 1996.
[3] Y. Huang, C. Kintala, N. Kolettis and N. D. Fulton, "Software rejuvenation: analysis module and applications", Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers, Pasadena, CA, USA, pp. 381-390, 1995.
[4] M. Grottke, R. Matias and K. S. Trivedi, "The fundamentals of software aging", Proc. of IEEE First International Workshop on Software Aging and Rejuvenation, Washington DC, USA, pp. 1-6, 2008.
[5] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. P. Zeggert, "Proactive management of software aging", IBM Journal of Research and Development, vol. 45, no. 2, pp. 311-332, March 2001.
[6] E. Marshall, "Fatal error: how patriot overlooked a Scud", Sci., p. 1347, Mar. 1992.
[7] V. Koutras and A. Platis, "Applying partial and full rejuvenation in different degradation levels", Proceedings of the IEEE 3rd International Workshop on Software Aging and Rejuvenation, pp. 20-25, 2011.
[8] K. G. Shin, T. H. Lin and Y. H. Lee, "Optimal Checkpointing of Real-Time Tasks", IEEE Transactions on Computers, vol. C-36, no. 11, pp. 1328-1341, Nov. 1987.
[9] G. Levitin, L. Xing, Y. Dai and V. M. Vokkarane, "Dynamic Checkpointing Policy in Heterogeneous Real-Time Standby Systems", IEEE Transactions on Computers, vol. 66, no. 8, pp. 1449-1456, August 2017.
[10] H. Okamura and T. Dohi, "Analysis of a Software System with Rejuvenation Restoration and Checkpointing", In: T. Nanya, F. Maruyama, A. Pataricza and M. Malek (eds), Service Availability. ISAS 2008. Lecture Notes in Computer Science, vol. 5017, Springer Berlin Heidelberg, 2008.
[11] S. Garg, A. Puliafito, M. Telek and K. S. Trivedi, "Analysis of preventive maintenance in transactions based software systems", IEEE Transactions on Computers, vol. 47, no. 1, pp. 96-107, Jan. 1998.

[12] J. Alonso, M. Grottke, A. Nikora and K. S. Trivedi, "The nature of the times to flight software failure during space missions", Proc. of IEEE Int. Conf. Softw. Rel. Eng. Workshops, pp. 331-340, 2012.
[13] C. Kintala, "Software rejuvenation in embedded systems", J. Autom. Lang. Combinatorics, vol. 14, pp. 63-73, 2009.
[14] T. Dohi, K. Goševa-Popstojanova, K. Vaidyanathan, K. S. Trivedi and S. Osaki, "Software Rejuvenation: Modeling and Applications", In: H. Pham (editor), Handbook of Reliability Engineering, Springer London, 2003.
[15] X. Hua, C. Guo, H. Wu, D. Lautner and S. Ren, "Schedulability Analysis for Real-Time Task Set on Resource with Performance Degradation and Dual-Level Periodic Rejuvenations", IEEE Transactions on Computers, vol. 66, no. 3, pp. 553-559, March 1, 2017.
[16] D. Wang, W. Xie and K. S. Trivedi, "Performability analysis of clustered systems with rejuvenation under varying workload", Performance Evaluation, vol. 64, no. 3, pp. 247-265, 2007.
[17] G. Levitin, L. Xing, H. Ben-Haim and Y. Dai, "Optimizing software rejuvenation policy for real time tasks", Reliability Engineering & System Safety, vol. 176, pp. 202-208, 2018.
[18] K. Rinsaka and T. Dohi, "Toward high assurance software systems with adaptive fault management", Software Quality Journal, vol. 24, no. 1, pp. 65-85, March 2016.
[19] P. K. Saravakos, G. A. Gravvanis, V. P. Koutras and A. N. Platis, "A comprehensive approach to software aging and rejuvenation on a single node software system", Proceedings of The 9th Hellenic European Research on Computer Mathematics & its Applications Conference, 2009.
[20] S. Malefaki, V. P. Koutras and A. N. Platis, "Modeling Software Rejuvenation on a Redundant System Using Monte Carlo Simulation", Proc. of IEEE 23rd International Symposium on Software Reliability Engineering Workshops, Dallas, TX, pp. 277-282, 2012.
[21] A. Puliafito, "Software Rejuvenation in Cloud Systems", Proc. of IEEE International Symposium on Software Reliability Engineering Workshops, Naples, pp. 413-413, 2014.
[22] D. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, Addison Wesley, Reading, MA, 1989.
[23] G. Levitin, "Genetic algorithms in reliability engineering", Guest editorial, Reliability Engineering & System Safety, vol. 91, no. 9, pp. 975-976, 2006.

[24] H. Okamura, K. Yamamoto and T. Dohi, "Transient Analysis of Software Rejuvenation Policies in Virtualized System: Phase-Type Expansion Approach", Quality Technology & Quantitative Management, vol. 11, no. 3, pp. 335-351, 2014.
[25] Y. Bao, X. Sun and K. S. Trivedi, "Adaptive software rejuvenation: degradation model and rejuvenation scheme", Proc. of International Conference on Dependable Systems and Networks, pp. 241-248, 2003.
[26] D. Cotroneo, R. Natella, R. Pietrantuono and S. Russo, "A survey of software aging and rejuvenation studies", J. Emerg. Technol. Comput. Syst., vol. 10, no. 1, Article 8, 34 pages, 2014.
[27] W. Dang and J. Zeng, "Software System Rejuvenation Modeling Based on Sequential Inspection Periods and State Multi-control Limits", In: B. Zou, Q. Han, G. Sun, W. Jing, X. Peng and Z. Lu (eds.), Data Science. ICPCSEE 2017, Communications in Computer and Information Science, vol. 728, Springer Singapore, 2017.
[28] G. Levitin, L. Xing, B. W. Johnson and Y. Dai, "Mission Reliability Cost and Time for Cold Standby Computing Systems with Periodic Backup", IEEE Transactions on Computers, vol. 64, no. 4, pp. 1043-1057, April 2015.
[29] G. Levitin, L. Xing and Y. Dai, "Optimal Backup Distribution in 1-out-of-N Cold Standby Systems", IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 4, pp. 636-646, April 2015.
[30] Y. Zhang and K. Chakrabarty, "Adaptive Checkpointing with Dynamic Voltage Scaling in Embedded Real-Time Systems", Embedded Software for SoC, pp. 449-463, 2003.
[31] N. Kaio, T. Dohi and K. S. Trivedi, "Availability Models with Age-Dependent Checkpointing", Proc. of IEEE Symp. on Reliable Distributed Systems, pp. 130, 2002.
[32] A. Ziv and J. Bruck, "An on-line algorithm for checkpoint placement", Proc. Seventh International Symposium on Software Reliability Engineering, White Plains, NY, pp. 274-283, 1996.
[33] H. Okamura and T. Dohi, "Availability optimization in operational software system with aperiodic time-based software rejuvenation scheme", IEEE International Conference on Software Reliability Engineering Workshops, Seattle, WA, pp. 1-6, 2008.
[34] G. Levitin, L. Xing, Q. Zhai and Y. Dai, "Optimization of Full vs. Incremental Periodic Backup Policy", IEEE Transactions on Dependable and Secure Computing, vol. 13, no. 6, pp. 644-656, Nov. 2016.

[35] A. Khunteta and P. Kumar, "An Analysis of Checkpointing Algorithms for Distributed Mobile Systems", International Journal on Computer Science and Engineering, vol. 02, no. 04, pp. 1314-1326, 2010.
[36] A. Bobbio, S. Garg, M. Gribaudo, A. Horváth, M. Sereno and M. Telek, "Modeling software systems with rejuvenation restoration and checkpointing through fluid stochastic Petri nets", Proc. of International Workshop on Petri Nets and Performance Models, pp. 82-91, 1999.
[37] Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung and C. Kintala, "Checkpointing and its applications", Digest of Papers for 25th International Symposium Fault-Tolerant Computing, 1995.
[38] N. Naksinehaboon, N. Taerat, C. Leangsuksun, C. F. Chandler and S. L. Scott, "Benefits of Software Rejuvenation on HPC Systems", International Symposium on Parallel and Distributed Processing with Applications, Taipei, pp. 499-506, 2010.
[39] J. F. Meyer, "On Evaluating the Performability of Degradable Computing Systems", IEEE Transactions on Computers, vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[40] A. Bobbio and M. Telek, "Task completion time in degradable systems", In: B. R. Haverkort, R. Marie, G. Rubino and K. S. Trivedi, editors, Performability Modelling: Techniques and Tools, Wiley, Chapter 7:139-161, 2001.
[41] G. Levitin, L. Xing and Y. Dai, "Heterogeneous 1-out-of-N warm standby systems with online checkpointing", Reliability Engineering & System Safety, vol. 169, pp. 127-136, 2018.
[42] R. Vitale, A. Zhyrova, J. F. Fortuna, O. E. de Noord, A. Ferrer and H. Martens, "On-The-Fly Processing of continuous high-dimensional data streams", Chemometrics and Intelligent Laboratory Systems, vol. 161, pp. 118-129, 2017.
[43] S. K. Bansal, "Towards a Semantic Extract-Transform-Load (ETL) Framework for Big Data Integration", Proc. of IEEE International Congress on Big Data, Anchorage, AK, pp. 522-529, 2014.
[44] V. C. Storey and I-Y. Song, "Big data technologies and Management: What conceptual modeling can do", Data & Knowledge Engineering, vol. 108, pp. 50-67, 2017.
[45] W. Weibull, "A statistical distribution function of wide applicability", J. Appl. Mech.-Trans. ASME, vol. 18, no. 3, pp. 293-297, 1951.
[46] A. T. Tai, S. N. Chau, L. Alkalaj and H. Hecht, "On-Board Preventive Maintenance: Analysis of Effectiveness and Optimal Duty Period", Proc. of Third Int'l Workshop Object Oriented Real-Time Dependable Systems, 1997.