Optimizing software rejuvenation policy for tasks with periodic inspections and time limitation

Journal Pre-proof

Gregory Levitin, Liudong Xing, Yanping Xiang

PII: S0951-8320(19)30945-7
DOI: https://doi.org/10.1016/j.ress.2019.106776
Reference: RESS 106776

To appear in: Reliability Engineering and System Safety

Received date: 21 July 2019
Revised date: 29 October 2019
Accepted date: 22 December 2019

Please cite this article as: Gregory Levitin , Liudong Xing , Yanping Xiang , Optimizing software rejuvenation policy for tasks with periodic inspections and time limitation, Reliability Engineering and System Safety (2019), doi: https://doi.org/10.1016/j.ress.2019.106776

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Highlights

• Software systems with periodic inspections are considered.
• State-based software rejuvenation procedure can be activated in each inspection.
• An algorithm for determining the optimal rejuvenation policy is suggested.
• An event transition-based method is used for quantifying the system task success probability.
• Genetic algorithm is applied for deriving the optimal rejuvenation policy.

Gregory Levitin (a, b), Liudong Xing (c), Yanping Xiang (a)

(a) Center for System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, P. R. China
(b) The Israel Electric Corporation, P. O. Box 10, Haifa 31000, Israel
(c) University of Massachusetts, Dartmouth, MA 02747, USA

Abstract – Software aging has been observed in diverse types of software systems, causing gradual performance degradation with time and/or load and, eventually, system failures. To mitigate the aging effects and prevent serious losses caused by system failure, software rejuvenations can be proactively performed to restore the system performance. This paper models and optimizes a state-based rejuvenation policy for software systems performing real-time computing tasks and undergoing periodic inspections. During each scheduled inspection, the system state is evaluated and the decision about the rejuvenation is made based on the evaluated system state and a rejuvenation decision function. The time of each rejuvenation procedure (corresponding to the system downtime) depends on the system state as well as on the amount of task operations accomplished before deciding to perform the rejuvenation. As the rejuvenation policy determines the time and number of rejuvenations performed during the task processing, it can significantly affect the probability that the system accomplishes the real-time task by a certain deadline. In this work, we optimize the state-based rejuvenation policy to maximize the probability of task completion (PTC) of periodically inspected software systems. The methodology encompasses an event transition-based iterative method proposed for quantifying the PTC and application of the Genetic Algorithm for deriving the optimal rejuvenation policy. Examples are presented to demonstrate the proposed methodology and the influence of several parameters (e.g., inspection interval, rejuvenation time) on the optimization results.

Keywords – Reliability; software full rejuvenation; periodic inspection; task completion probability; real-time.

Acronyms and Abbreviations

cdf    cumulative distribution function
pdf    probability density function
GA     Genetic algorithm
PS     processing speed
PTC    probability of task completion

Nomenclature

$W$    task complexity, i.e., total number of operations of the task
$\tau$    task deadline or maximum allowed mission time
$R$    PTC in time not exceeding $\tau$
$s$    time between two consecutive inspections
$H$    maximum number of inspections
$J$    the maximum index of the system state
$f_j(t)$, $F_j(t)$    pdf, cdf of the random sojourn time of the system in state $j$
$g_j$    PS of the system in state $j$
$\theta_j(x)$    time of rejuvenation from state $j$ when $x$ operations are completed
$u(h,x)$    rejuvenation decision function
$m$    number of time intervals (or work portions) used in the suggested numerical algorithm
$1(A)$    Boolean logic function: 1(TRUE)=1; 1(FALSE)=0
$\lfloor x\rfloor$    floor operation returning the greatest integer not exceeding $x$

1. Introduction

Due to an accumulation of error conditions such as memory leaks, round-off numerical errors, storage fragmentation, etc., a software system can age during continuous running, causing its performance to degrade gradually and the system to fail eventually if no maintenance action is taken in a timely manner [1, 2]. The aging phenomenon has been observed in diverse types of software systems, spanning from mission-critical to business-oriented systems [3]. Specific examples of such systems include spacecraft [4], web servers [5], billing software [6], embedded systems [7], telecommunication switching [8], and transaction processing systems [9, 10].

To counteract the software aging effect and prevent or at least reduce the occurrence of system failures, software rejuvenations have commonly been applied to proactively restore the system performance [11, 12]. Depending on the performance level a system can roll back to, rejuvenations are classified as full and partial rejuvenations [13]. A full rejuvenation (also referred to as a cold, perfect, or system-level rejuvenation) recovers the performance of the software system to the initial peak level, typically by stopping all running applications and restarting the system. A partial rejuvenation (also referred to as a warm, minimal, or application-level rejuvenation) recovers the system performance to some higher but non-peak level by stopping and restarting certain applications [14, 15].

Under both types of rejuvenations, the average performance (e.g., processing speed) of a software system can be enhanced, and such improvements become more significant when rejuvenations are performed more frequently. On the other hand, the system often becomes unavailable for executing the actual mission task during the rejuvenation process, incurring system downtime and cost due to loss of business. When the number of rejuvenations performed is large, the extra downtime can greatly prolong the mission completion time and even lead to a failed mission if the task is real-time with a strict deadline. To balance the positive and negative effects of performing rejuvenations, it is crucial to carefully design and optimize the software rejuvenation policy, which determines how many rejuvenations are performed and when each of them is triggered while the system is running.

The rejuvenation policy considered in this work presumes state-based full rejuvenations that can be performed immediately after periodic inspections. Specifically, the software system undergoes periodic inspections during the execution of a real-time task. At each scheduled inspection, the system status is detected and evaluated. Based on the evaluated system state, the decision about the rejuvenation is made through a rejuvenation decision rule or function. If a rejuvenation is triggered, the time it takes the system to be rejuvenated to the peak performance level depends on the system state as well as on the amount of task operations already completed right before the rejuvenation action. We model and evaluate the successful completion probability of a real-time task performed by the software system subject to the considered rejuvenation policy. We further optimize the state-based rejuvenation policy (in particular, the rejuvenation decision function) for any given inspection interval to maximize the task completion probability.

Note that there exist two different viewpoints on system success measures in the literature on computing systems with random performance: system-oriented and task/user-oriented [16]. The former concerns the expected amount of work accomplished in a certain time period [17], while the latter typically concerns the probability that the amount of work required to complete the system task can be accomplished in due time [18, 19]. The task-oriented viewpoint, in particular the probability of task completion (PTC), is adopted in this work.

The remainder of this paper is arranged as follows. Section 2 reviews relevant works on existing software rejuvenation policies and models, followed by a clarification of the novelty of this work as compared to the existing works. Section 3 depicts the software system modeled. Section 4 suggests a method based on event transitions for analyzing the task completion probability. Section 5 presents an iterative numerical algorithm implementing the analysis method. Section 6 illustrates the suggested evaluation algorithm using an example. Section 7 presents the optimization solution and example results of the considered optimal rejuvenation policy problem. Section 8 concludes the work and states a few directions of our future work.

2. Related Work

Software rejuvenations have been modeled and optimized for single node software systems [20], cluster computing systems [21], and standby redundant systems [22]. Recently, software aging and rejuvenations have also been investigated for cloud computing systems considering virtual machine restart [23].

Depending on the mechanism of triggering a rejuvenation action, the rejuvenation policy can be categorized as time-based, state-based, or prediction-based [24, 25]. Under the time-based policy, rejuvenations are planned ahead of time and triggered according to a pre-decided periodic or aperiodic schedule [10, 26, 27]. Under the state-based (also referred to as condition-based) policy, rejuvenations are triggered when a certain system condition monitored during runtime is true (e.g., the processing speed is below a specific threshold) [16, 21, 28-30]. Under the prediction-based policy, rejuvenations are triggered based on statistical analysis of system data collected during the runtime [25, 31].

The objective of optimizing the rejuvenation policy involves maximizing or minimizing a certain performance metric of the considered software system. Examples of performance metrics include expected task completion time, task completion probability, steady-state system availability, mean response time, transaction loss probability [9, 12, 32], expected downtime cost [33], mean steady-state operation cost per unit time [6, 34], and average rejuvenation cost per unit time [35]. To perform optimizations, these performance metrics have to be evaluated quantitatively. Evaluation methods available from the software rejuvenation literature include measurements, simulations, analytical models, and event transition-based numerical methods [11, 22, 32].

Both measurements and simulations can exploit unique features or the nature of the system being analyzed, but are not portable across different systems. In measurements, runtime data on the monitored system condition are collected and then analyzed statistically to evaluate the performance metrics of interest [36-38]. Measurements are often not applied to estimate steady-state performance metrics (also known as asymptotic or long-run metrics, defined as the limit of the metrics when time approaches infinity or the system being investigated settles) because this requires long-term data measured from the system during the runtime. Simulations can offer generality and flexibility in representing system behaviors but are typically computationally expensive, especially when high accuracy of results is required [22, 34].

In the analytical modeling approach, the software aging behavior or process is represented via a probabilistic model, which is then evaluated to determine the desired performance metric. Markov stochastic processes [9, 13, 39], Semi-Markov processes [10, 12, 28], Markov Regenerative processes [9, 21], and Petri nets [21] are among the examples of such probabilistic models. Analytical models can be easily generalized and are portable across different systems. However, they are less effective than measurements and simulations because analytical methods often need to make simplifying assumptions about the behavior of the system being modeled (e.g., continuous-time Markov chains assume that the state transition time follows the exponential distribution). Moreover, analytical models usually suffer from the state space explosion issue, which makes time-dependent evaluation intractable. Due to this problem, often only asymptotic or steady-state solutions are provided, which is insufficient for real-time systems [9, 12, 28, 40].

Recently, a new class of the analytical modeling approach, an event transition-based numerical algorithm, has been developed for analyzing an aging software system undergoing rejuvenations [32]. The algorithm is efficient and flexible, placing no restrictions on the types of distributions modeling the software failure time or state transition time. It also provides time-dependent performance metrics. In [41], this methodology was expanded for software systems subject to periodic checkpointing and state-based rejuvenations during the task execution; the optimal joint checkpointing and rejuvenation policy maximizing the task completion probability was studied. In [42], another extension of the event transition-based method was performed for optimizing the total expected mission cost of a real-time task. A full state-based rejuvenation policy has been assumed for the models of [32, 41, 42]. In [43], the event transition-based method was extended for modeling and optimizing a partial state-based rejuvenation policy. None of those works, however, considers inspections, which are typically implemented in the state-based rejuvenation policy. During each inspection, the system condition is evaluated and the decision on whether to trigger a rejuvenation is made [40, 44].

In this work, we make significant extensions of the event transition-based method by considering periodic inspections as well as practical dynamic rejuvenation time (dependent on system state and accomplished task operations) in the evaluation of the real-time task completion probability. To consider the periodic inspections, the methodology has to be extended by introducing additional variables denoting the time between two consecutive inspections and the maximum number of inspections that can be performed during the execution of a real-time task. The two variables are used in determining whether certain event transitions can occur, the occurrence probabilities of state transitions, and eventually the probability of task completion (detailed in Section 4). All equations constituting the major part of the derivation of the PTC are different from those in the previous works [32, 41-43], as the rejuvenations can be accomplished only at discrete times of inspections whereas the state transitions can happen at any time. Based on the extended evaluation method, we make further contributions by deriving the optimal periodically inspected state-based rejuvenation policy that maximizes the task completion probability.

3. System Model

A computation task consists of $W$ operations that have to be completed by a specific deadline $\tau$. The mission is considered failed if the entire task cannot be completed within $\tau$. Due to software aging, the system performing the task can exhibit $J+1$ distinguishable states or performance levels, each characterized by a different processing speed (PS). The initial state has the peak PS $g_0$; in general, the PS in state $j$ is $g_j$. During the execution of the real-time task, the system performance degrades from a higher level (with greater PS) to a lower level (with lower PS). In particular, the system can make only a one-step transition from any state $j$ to state $j+1$, with the transition time obeying a known distribution defined by the probability density function (pdf) $f_j(t)$ and cumulative distribution function (cdf) $F_j(t)$. Such a transition time is also the sojourn time of the system in state $j$.

To mitigate the performance deterioration caused by software aging, the system software can be rejuvenated. Full or perfect rejuvenations are adopted [13, 14], where the system performance is restored to the initial state $j=0$ with peak PS $g_0$. The time of each rejuvenation procedure depends on the system state as well as on the number of task operations accomplished before the rejuvenation. Specifically, the time of a rejuvenation performed from state $j$ when $x$ task operations are completed is $\theta_j(x)$. This function reflects situations where each rejuvenation includes backup and retrieval of the data produced by the system during the execution of the task, and the time of performing these procedures depends on the amount of produced data.

The system undergoes periodic inspections during which its state is evaluated and the decision about the rejuvenation is made based on the system state and a rejuvenation decision function $u(h,x)$. Specifically, if the system state detected at the $h$-th inspection is $j$ when part $x$ of the entire task is completed, the system software is rejuvenated if $j\ge u(h,x)$. The time between two consecutive inspections is $s$. The maximum number of inspections during the mission is thus $H=\lfloor\tau/s\rfloor$ (fewer than $H$ inspections are performed when the task is completed earlier than some scheduled inspections). As rejuvenations can only be triggered at the scheduled inspection times, the maximum possible number of rejuvenations is also $H$.

The rejuvenation policy determines the time and number of rejuvenations performed during the task processing, and can affect the task completion probability. Specifically, executing more rejuvenations can prevent the system from deteriorating to low performance levels, leading to a greater average PS for performing the task. On the other hand, more frequent rejuvenations (involving more rejuvenation time) reduce the time available for executing the actual mission task, thus decreasing the probability that the task can be accomplished by the deadline $\tau$. To balance these two conflicting effects and maximize the benefits of implementing rejuvenations, the state-based rejuvenation policy should be optimized. The objective of the optimal rejuvenation policy problem is to maximize the probability $R$ that the system accomplishes the real-time task within time $\tau$.

4. Analysis of Real-Time Task Completion Probability

To assess the probability of task completion (PTC) $R$, we consider event $\langle T_{k,j},X_{k,j}\rangle$ representing the situation in which the system starts functioning with PS $g_j$ at time $T_{k,j}$ (measured from the beginning of the task processing) when $X_{k,j}$ task operations and $k$ rejuvenations have been performed. Based on the event transitions that can take place during the task processing, we determine the joint pdf of the two random values $T_{k,j}$ and $X_{k,j}$, denoted by $q_{k,j}(t,x)$. Since the system begins the task processing with the initial PS $g_0$ and no rejuvenations have been performed at the beginning (i.e., $k=0$), we have

$$q_{0,0}(t,x)=\begin{cases}1 & \text{for } t=x=0,\\ 0 & \text{otherwise.}\end{cases}\tag{1}$$

The number of the closest inspection after the event $\langle T_{k,j},X_{k,j}\rangle$ is $\lfloor T_{k,j}/s\rfloor+1$ and the time of this inspection from the mission beginning is $(\lfloor T_{k,j}/s\rfloor+1)s$. We assume that the inspection time is negligible. If the system remains in state $j$ for time $\zeta$ and then transits to state $j+1$, the event transition $\langle T_{k,j},X_{k,j}\rangle\to\langle T_{k,j+1},X_{k,j+1}\rangle$ happens, in which

$$T_{k,j+1}=T_{k,j}+\zeta,\qquad X_{k,j+1}=X_{k,j}+\zeta g_j.\tag{2}$$

Such a transition can occur in two cases.

1. The transition occurs before the next inspection, i.e., when

$$T_{k,j}+\zeta<(\lfloor T_{k,j}/s\rfloor+1)s.\tag{3}$$

2. The transition occurs after a number of inspections, i.e., $T_{k,j}+\zeta\ge(\lfloor T_{k,j}/s\rfloor+1)s$. For every inspection $h$ accomplished while the system remains in state $j$, i.e., $\lfloor T_{k,j}/s\rfloor+1\le h<\lfloor (T_{k,j}+\zeta)/s\rfloor+1$, either the rejuvenation rule does not hold, i.e.,

$$j<u\big(h,\,X_{k,j}+(hs-T_{k,j})g_j\big),\tag{4}$$

or no time remains to complete the rejuvenation before the mission time expires, i.e.,

$$hs+\theta_j\big(X_{k,j}+(hs-T_{k,j})g_j\big)>\tau.\tag{5}$$

In (4) and (5), $X_{k,j}+(hs-T_{k,j})g_j$ is the total number of operations completed by the time of the $h$-th inspection when the system enters state $j$ at time $T_{k,j}$ after completing $X_{k,j}$ operations. If the system remains in state $j$ until an inspection $h$ for which $j\ge u(h,X_{k,j}+(hs-T_{k,j})g_j)$ and $hs+\theta_j(X_{k,j}+(hs-T_{k,j})g_j)\le\tau$, the rejuvenation procedure starts at time $hs$ and the system returns to state 0 with

$$T_{k+1,0}=hs+\theta_j\big(X_{k,j}+(hs-T_{k,j})g_j\big),\qquad X_{k+1,0}=X_{k,j}+(hs-T_{k,j})g_j.\tag{6}$$
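The transition rules (2)-(6) can be sketched as a small function that, for one realization of the sojourn time, decides whether the next event is a degradation or a rejuvenation. This is only an illustrative sketch: the function name `next_event`, its argument names, and the use of callables for the decision function and the rejuvenation time are assumptions made here, and task completion (reaching W operations) is deliberately ignored.

```python
import math

def next_event(T, X, j, zeta, *, s, tau, g, u, theta):
    """Apply rules (2)-(6) to a system that entered state j at time T with
    X operations done and would leave state j after sojourn time zeta.

    g      -- list of processing speeds, g[j] for state j
    u      -- decision function u(h, x); rejuvenate when j >= u(h, x)
    theta  -- rejuvenation time theta(j, x)
    Returns (kind, T_next, X_next). Task completion is not checked here.
    """
    h = math.floor(T / s) + 1            # closest inspection after time T
    while h * s <= T + zeta:             # inspections met while still in state j
        x_h = X + (h * s - T) * g[j]     # operations completed by inspection h
        if j >= u(h, x_h) and h * s + theta(j, x_h) <= tau:
            # rule (6): rejuvenation starts at h*s, system restarts in state 0
            return ("rejuvenation", h * s + theta(j, x_h), x_h)
        h += 1                           # rule (4) or (5): inspection passes
    # rules (2)-(3): degradation to state j+1 at time T + zeta
    return ("degradation", T + zeta, X + zeta * g[j])
```

For example, with s=60 and a system that enters state 2 at time 30 and would stay there for 50 time units, the inspection at time 60 falls inside the sojourn, so the outcome depends on the decision function and on whether the rejuvenation can finish before the deadline.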

As the probability that the system transits from state $j$ to state $j+1$ in the time interval $[t,t+dt)$ since starting operation with PS $g_j$ is $f_j(t)dt$, we can recursively obtain $q_{k,j+1}(t,x)$ for $k\ge 0$, $j\ge 0$ based on the transition rules (2)-(5) as

$$q_{k,j+1}(t,x)=\sum_{i=0}^{\lfloor t/s\rfloor-1}\;\int_{t-\lfloor t/s\rfloor s+is}^{t-\lfloor t/s\rfloor s+(i+1)s}\;\prod_{h=\lfloor t/s\rfloor-i}^{\lfloor t/s\rfloor}1\Big(j<u\big(h,x-(t-hs)g_j\big)\ \text{or}\ hs+\theta_j\big(x-(t-hs)g_j\big)>\tau\Big)\,q_{k,j}(t-\zeta,\,x-\zeta g_j)\,f_j(\zeta)\,d\zeta\;+\int_{0}^{t-\lfloor t/s\rfloor s}q_{k,j}(t-\zeta,\,x-\zeta g_j)\,f_j(\zeta)\,d\zeta.\tag{7}$$

The first term in (7) corresponds to situations where the system undergoes inspections during which the rejuvenation conditions do not hold; the second term corresponds to the situation where no inspections are performed during the system operation in state $j$. Fig. 1 illustrates an example of the $\langle T_{k,j},X_{k,j}\rangle\to\langle T_{k,j+1},X_{k,j+1}\rangle$ transition.

Fig. 1. Event transition $\langle T_{k,j},X_{k,j}\rangle\to\langle T_{k,j+1},X_{k,j+1}\rangle$

Based on the transition rule (6), we can recursively obtain $q_{k+1,0}(t,x)$ for $k\ge 0$ as

$$q_{k+1,0}\big(hs+\theta_j(x),\,x\big)=\sum_{j=1}^{J}1\big(j\ge u(h,x)\ \text{and}\ hs+\theta_j(x)\le\tau\big)\int_{0}^{hs}q_{k,j}(hs-\zeta,\,x-\zeta g_j)\,\big(1-F_j(\zeta)\big)\prod_{l=h-\lfloor\zeta/s\rfloor}^{h-1}1\Big(j<u\big(l,x-(h-l)sg_j\big)\ \text{or}\ ls+\theta_j\big(x-(h-l)sg_j\big)>\tau\Big)\,d\zeta\tag{8}$$

for $1\le h\le H$. Fig. 2 illustrates an example of the $\langle T_{k,j},X_{k,j}\rangle\to\langle T_{k+1,0},X_{k+1,0}\rangle$ transition.

Fig. 2. Event transition $\langle T_{k,j},X_{k,j}\rangle\to\langle T_{k+1,0},X_{k+1,0}\rangle$

To determine the PTC, we consider event $Y_{k,j}$ that the system accomplishes the entire task at a time no later than $\tau$ while being in state $j$ after performing $k$ rejuvenations. Its occurrence probability $\Pr\{Y_{k,j}\}$ can be derived as follows.

If the event $\langle T_{k,j},X_{k,j}\rangle$ takes place, the system can accomplish the entire task without any state transitions if it functions with PS $g_j$ during time $(W-X_{k,j})/g_j$ (required for accomplishing the remaining part of the task) and this time is not greater than the remaining allowed time $\tau-T_{k,j}$. In addition, the system should undergo all the remaining inspections without making the rejuvenation decision, i.e., $j<u(h,X^*(h))$ for $h=\lfloor T_{k,j}/s\rfloor+1,\ldots,H$, where $X^*(h)=X_{k,j}+(hs-T_{k,j})g_j$ is the number of operations completed by the moment of the $h$-th inspection given event $\langle T_{k,j},X_{k,j}\rangle$, after which the system operated with PS $g_j$ during time $hs-T_{k,j}$. Thus, we have

$$\Pr\{Y_{k,j}\}=\int_{0}^{W}\int_{0}^{\tau-\frac{W-x}{g_j}}q_{k,j}(t,x)\left(1-F_j\!\left(\frac{W-x}{g_j}\right)\right)\prod_{h=\lfloor t/s\rfloor+1}^{H}1\Big(j<u\big(h,x+(hs-t)g_j\big)\ \text{or}\ hs+\theta_j\big(x+(hs-t)g_j\big)>\tau\Big)\,dt\,dx.\tag{9}$$

An example of the task termination after event $\langle T_{k,j},X_{k,j}\rangle$ is presented in Fig. 3.

Fig. 3. Task completion after event $\langle T_{k,j},X_{k,j}\rangle$

The PTC can thus be evaluated as a sum of occurrence probabilities of the mutually exclusive events $Y_{k,j}$, that is,

$$R=\sum_{k=0}^{K}\sum_{j=0}^{J}\Pr\{Y_{k,j}\}.\tag{10}$$

5. Numerical Iterative Evaluation Algorithm for PTC

Based on the derivation of the PTC in Section 4, we present the numerical algorithm for evaluating the PTC $R$ in this section.

First, the maximum allowable task processing time $\tau$ is divided into $m$ equal intervals. The duration of each interval is $\Delta=\tau/m$. Each interval $z$ starts at time $\Delta z$ and ends at $\Delta(z+1)$ for $z=0,\ldots,m$. The number of intervals between two consecutive inspections is $S=s/\Delta$. Correspondingly, the number of operations $W$ for completing the task is divided into $m$ equal portions, each containing $W/m$ operations. The system functioning in state $j$ with PS $g_j$ can perform $g_j\Delta z$ operations during $z$ time intervals, corresponding to $g_j\Delta zm/W=g_j\tau z/W$ portions. Hence, the system's PS in state $j$ is $G_j=g_j\tau/W$ portions per interval. The time of rejuvenation from state $j$ performed when $w$ portions of the task are accomplished is $\Theta_j(w)=\theta_j(wW/m)/\Delta$ intervals. The probability that the sojourn time of the system in state $j$ is less than $d$ intervals is $\Phi_j(d)=F_j(\Delta d)$. The probability of the system leaving state $j$ after functioning for $d$ intervals in state $j$ is thus $\varphi_j(d)=F_j(\Delta(d+1))-F_j(\Delta d)$.

We can approximate $q_{k,j}(t,x)$ using matrix $Q_{k,j}(z,w)$, which contains the probabilities that the system begins functioning in state $j$ after completing $k$ rejuvenations in interval $z$ and after finishing $w$ portions of the task. The rejuvenation decision function $u(h,x)$ is also used in the discrete form $U(h,w)=u(h,wW/m)$.

Expressions (7) and (8) represent backward procedures to obtain $Q_{k,j}(z,w)$ (from $Q_{k,j-1}(z,w)$ for $z=0,\ldots,m-1$ and $w=0,\ldots,m-1$) and $Q_{k+1,0}$ (from $Q_{k,j}(z,w)$ for $j=0,\ldots,J$, $z=0,\ldots,m-1$, and $w=0,\ldots,m-1$), respectively. In these backward procedures, summations using different elements of matrix $Q_{k,j-1}$ (or $Q_{k,j}$ for $j=0,\ldots,J$) need to be done for every element of $Q_{k,j}$ (or $Q_{k+1,0}$), which is inconvenient. Instead, we adopt an iterative forward procedure in which matrices $Q_{k,j}$ and $Q_{k+1,0}$ are zeroed initially and then their corresponding elements are iteratively updated for any fixed element of $Q_{k,j-1}$ and any realization of the state transition time. This forward procedure does not need to keep all the matrices $Q_{k,j}$ for $j=0,\ldots,J$ in memory because any matrix $Q_{k,j-1}$ can be deleted immediately after updating $Q_{k,j}$ and $Q_{k+1,0}$.

The following pseudo code summarizes the iterative algorithm for obtaining the PTC $R$ for a real-time task performed by a software rejuvenation system subject to periodic inspections. This pseudo code is based on the update logic of matrices $Q_{k,j}$, which is presented in Fig. 4.

1. Initialization: R=0; Q_{k,0}(z,w)=0 for z=0,…,m; w=0,…,m; k=0,…,K; Q_{0,0}(0,0)=1;
2. For k=0,…,H: //k - number of completed rejuvenations
3.   For j=0,…,J: //j - system state
4.     Set Q_{k,j+1}(z,w)=0 for z=0,…,m; w=0,…,m;
5.     For w=0,…,m-1: //w - realization of X_{k,j}
6.       z=0; //z - realization of T_{k,j}
7.       h=⌊z/S⌋+1; //index of the next inspection
8.       dmax=(m-w)/G_j; If(dmax>m-z) dmax=m-z; //number of remaining time intervals
9.       For d=0,…,dmax: //number of intervals when system operates in state j
10.        z̃=z+d; w̃=w+dG_j; // z̃, w̃ - realizations of T_{k,j+1}, X_{k,j+1} respectively
11.        If( z̃ … Q_{k,j+1}(z̃,w̃)=Q_{k,j+1}(z̃,w̃)+Q_{k,j}(z,w)φ_j(d);
…
15.-18. … Go to Step 17; h=h+1; If (m-w)/G_j … //task completion

Fig. 4. The update logic of matrixes Qk,j

In Steps 11 and 13, event transitions without rejuvenation take place (before the closest inspection, and after inspections at which the rejuvenation condition is not met, respectively). In Step 14, event transitions with rejuvenation take place. In Step 16, the PTC is updated.

The complexity of the numerical algorithm is $O(H(J+1)m^3)$. As demonstrated in Section 6, the actual running time for a five-state system is on the order of a few seconds even when a large value of $m$ is used. To the best of our knowledge, this is the first algorithm that can evaluate a real-time software system undergoing periodic inspections as well as practical dynamic rejuvenation time (dependent on system state and accomplished task operations). The algorithm can accommodate systems exhibiting any finite number of states and arbitrary types of state transition time distributions. In contrast, as discussed in Section 2, the conventional Markovian methods become computationally hard, especially when a large number of states is involved. Moreover, continuous-time Markov chain-based methods can handle only the exponential distribution of the system sojourn time in different states. The flow of the data needed for obtaining the PTC is presented in Fig. 5.

Fig. 5. The information flow of the algorithm for evaluating the PTC
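The discretization step of this section can be illustrated with a short sketch that converts the continuous model parameters into interval and portion units. The function name `discretize` and the convention of passing the cdf and the rejuvenation time as callables taking the state index first are assumptions made for this illustration; the sketch covers only the unit conversion, not the matrix update itself.

```python
import math

def discretize(tau, W, s, m, g, F, theta):
    """Convert model parameters to the discrete units used by the algorithm.

    tau, W -- deadline and task size; s -- inspection period; m -- grid size
    g      -- processing speeds g[j]
    F      -- sojourn-time cdf F(j, t)
    theta  -- rejuvenation time theta(j, x) in time units
    """
    delta = tau / m                          # duration of one time interval
    S = s / delta                            # intervals between inspections
    G = [gj * tau / W for gj in g]           # portions per interval in state j

    def Theta(j, w):
        # rejuvenation time in intervals after w task portions
        return theta(j, w * W / m) / delta

    def leave_prob(j, d):
        # probability of leaving state j after d intervals in it
        return F(j, delta * (d + 1)) - F(j, delta * d)

    return delta, S, G, Theta, leave_prob
```

With tau=200, W=5000, s=60 and m=100, for instance, each interval lasts 2 time units, inspections are 30 intervals apart, and a speed of 50 operations per time unit becomes 2 portions per interval.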

6. Analysis of an Illustrative Example

Consider a visual data streaming onboard software system in a space probe, which takes video images based on a predetermined schedule [45]. An input buffer is used to accept the raw images; due to the limited capacity, new image batches replace the old ones. The software Extract, Transform, Load (ETL) streaming system [46] conducts image data transform and compression to prepare the data for transmission to Earth. The data transform and compression task, involving $W=5000$ mega operations, has to be completed in the time between image updates, which is $\tau=200$ time units.

In the ETL streaming system, four internal memory blocks are used and can be overwhelmed during the data processing. The number of overwhelmed blocks affects the data processing speed. Five system states are defined (from 0 to 4), each corresponding to a different number of unavailable memory blocks. The system state increases one by one during the data processing since the system transits consecutively to states with more unavailable memory blocks. Because the data processing time depends on the number of available blocks, the random time taken to fill the next block also depends on the number of available blocks. Hence, the system states characterized by different numbers of available memory blocks have different sojourn times. When all four memory blocks become unavailable, the software can still operate using low-capacity registers, but with very low processing speed.

The periodic self-inspections measure the number of operations performed per time unit and determine the state of the internal memory blocks based on these measurements. Each rejuvenation action cleans up all the internal memory blocks. Before performing the rejuvenation procedure, the system saves the data processed so far to external memory and retrieves it after the rejuvenation. Each rejuvenation transits the system to state 0 from any other state. The software mission fails if the data transform and compression task is not finished before the input buffer data renewal happens.

The sojourn time of the system in state $j\in\{0,1,2,3\}$ follows the Weibull distribution [10] with scale parameter $\eta_j$ and shape parameter $\beta_j$. In other words, the system transits from state $j$ to state $j+1$ after a time having the cdf $F_j(t)=1-\exp(-(t/\eta_j)^{\beta_j})$ for $j\in\{0,1,2,3\}$. The system can transit from the worst state 4 only to the initial state 0, when a rejuvenation is performed. Note that the proposed methodology is applicable to arbitrary types of distributions. However, following the practice in some software rejuvenation literature [4, 9], the Weibull distribution is adopted for the example system due to its flexibility in modeling diverse failure rate behaviors. While there is no consensus on the type of distribution modeling the degradation or failure time of operational software [9], studies using accelerated life testing in [10] show that the Weibull distribution appears to be a good fit for modeling the time-to-failure distribution of software systems.

We assume that the rejuvenation time $\theta_j(x)$ from state $j$ is an increasing linear function of the number of operations $x$ completed immediately before the rejuvenation, $\theta_j(x)=\alpha_j+\gamma_jx$, because the amount of compressed data that should be stored before the rejuvenation and retrieved before resuming the task increases with the progress of the task execution. Table 1 summarizes the values of PS $g_j$, scale parameter $\eta_j$, shape parameter $\beta_j$, and rejuvenation time parameters $\alpha_j$ and $\gamma_j$ for each state $j$ of the example system.

Table 1. Values of parameters for the example software system

Software state j                  0       1       2       3       4
PS gj                             50      45      40      29      15
Fj(t): scale parameter ηj         25      33      38      29      -
Fj(t): shape parameter βj         1.2     1.0     1.1     1.3     -
θj(x): constant term αj           -       18      21      25      27
θj(x): slope γj                   -       0.0025  0.0030  0.0040  0.0040
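The way the Table 1 parameters enter the model can be illustrated with a minimal Python sketch. It evaluates the Weibull cdf of the sojourn time and the linear rejuvenation time, and samples one degradation path without rejuvenation. All function and variable names are illustrative, and the assignment of the four rejuvenation-time entries to states 1 to 4 is an assumption (state 0 never needs rejuvenation, while state 4 can only be left through it). The sketch is not the evaluation algorithm of this paper.

```python
import math
import random

# Table 1 values; None marks entries that are not defined for a state.
SPEED = [50, 45, 40, 29, 15]          # processing speed g_j in state j
SCALE = [25, 33, 38, 29, None]        # Weibull scale of the sojourn time in state j
SHAPE = [1.2, 1.0, 1.1, 1.3, None]    # Weibull shape of the sojourn time in state j
# Assumption: the four rejuvenation-time entries of Table 1 belong to states 1..4.
REJ_CONST = [None, 18, 21, 25, 27]
REJ_SLOPE = [None, 0.0025, 0.0030, 0.0040, 0.0040]


def sojourn_cdf(j, t):
    """Probability that the system leaves state j (0..3) within time t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / SCALE[j]) ** SHAPE[j]))


def rejuvenation_time(j, x):
    """Rejuvenation time from state j (1..4) after x completed operations."""
    return REJ_CONST[j] + REJ_SLOPE[j] * x


def sample_degradation_times(rng):
    """Sample the times at which states 1, 2, 3 and 4 are entered (no rejuvenation)."""
    t, entry_times = 0.0, []
    for j in range(4):
        # random.weibullvariate takes the scale first and the shape second.
        t += rng.weibullvariate(SCALE[j], SHAPE[j])
        entry_times.append(t)
    return entry_times


if __name__ == "__main__":
    print(round(sojourn_cdf(0, 25), 4))   # equals 1 - exp(-1) at t equal to the scale
    print(rejuvenation_time(2, 1000))     # constant term plus slope times x
    print(sample_degradation_times(random.Random(1)))
```

At t equal to the scale parameter the cdf equals 1−exp(−1) regardless of the shape, which gives a quick sanity check of the parameter order.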

Fig. 6 presents values of PTC R for u(h,x)≡2, s=60 (which corresponds to up to 3 inspections during the cycle τ), and different m ranging from 50 to 2000. As the number of time intervals m used in the discretization increases, the accuracy of the evaluation result improves quickly. The relative differences for m=100, m=200 and m=500 compared to m=2000 are 0.56%, 0.24% and 0.077%, respectively. Fig. 6 also presents the running time (in seconds) of the proposed iterative algorithm as a function of m when evaluating the example system on a Pentium 3.2GHz PC.

Fig. 6. PTC R and running time of the proposed algorithm as functions of m.
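The effect of the discretization step on accuracy can be illustrated on a much simpler quantity than the PTC. The Python sketch below is only a stand-in, not the iterative algorithm of this paper: it approximates the mean sojourn time in state 0 (scale 25, shape 1.2 from Table 1) by splitting an assumed horizon of 180 time units into m equal intervals, and reports the relative difference of coarse grids from the m=2000 grid. The function names and the choice of the horizon are assumptions made for the illustration.

```python
import math


def weibull_cdf(t, scale, shape):
    """Weibull cdf with the given scale and shape parameters."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0


def discretized_mean(m, horizon, scale, shape):
    """Midpoint approximation of the mean sojourn time on [0, horizon] with m intervals."""
    step = horizon / m
    total = 0.0
    for i in range(m):
        # Probability mass of the i-th interval, placed at the interval midpoint.
        mass = weibull_cdf((i + 1) * step, scale, shape) - weibull_cdf(i * step, scale, shape)
        total += (i + 0.5) * step * mass
    return total


def relative_differences(ms, m_ref, horizon, scale, shape):
    """Relative difference of each coarse-grid result from the reference-grid result."""
    ref = discretized_mean(m_ref, horizon, scale, shape)
    return {m: abs(discretized_mean(m, horizon, scale, shape) - ref) / ref for m in ms}


if __name__ == "__main__":
    diffs = relative_differences([100, 200, 500], 2000, 180.0, 25.0, 1.2)
    for m, diff in diffs.items():
        print(m, f"{100 * diff:.5f}%")
```

The same comparison against the finest grid is what the relative differences quoted for Fig. 6 express for the PTC itself.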

To check the dependence of the algorithm running time on the number of system states (J+1), the PTC is evaluated for systems having up to 10 states (with state parameters identical to the parameters of state 1 from Table 1 and u(h,x)≡2). The running times for m=2000 and m=1500 as functions of the number of system states are presented in Fig. 7.

Fig. 7. Algorithm running time as function of number of system states

7. Optimal State-Based Rejuvenation Policy Considering Periodic Inspections

To determine the state-based rejuvenation policy, we divide the total number of task portions m into L regions to obtain a discrete approximation of the decision function U(h,w) as an element (ranging from 0 to J+1) of an H×L matrix Y (for any combination of h and w): U(h,w)=Y(h,⌊w/Δ⌋), where Δ=⌊m/L⌋+1. With matrix Y, if the system resides in state j when the h-th inspection is performed (and w portions of the task are accomplished by the inspection time), the system is rejuvenated if j≥Y(h,⌊w/Δ⌋). There are (J+1)^(HL) possible solutions for determining the optimal rejuvenation policy in the form of the approximation matrix Y. An exhaustive search becomes unrealistic even when J, H and L assume moderate values. Therefore, we use the Genetic Algorithm (GA), a heuristic search algorithm widely applied to reliability optimization problems [47]. In particular, the GA procedure adopted in this work can be found in [48]. While GA is flexible in solution representation using integer strings, a specific representation has to be defined for each optimization problem. For the problem considered in this work, any feasible solution is denoted by an integer string a=(a1,…,aHL) (1≤aj≤J+1); for any 1≤h

7.1 Impact of number of regions

Table 2 presents the results of optimization for different L for the PTC maximization problem with s=20 and τ=180 (H=8).

Table 2. Optimal rejuvenation policies for different L

L               1        2        4        6
R               0.9781   0.9785   0.9786   0.9789
Y, row h=1      2        25       2255     225555
Y, row h=2      2        35       2355     223555
Y, row h=3      4        44       2445     234455
Y, row h=4      2        23       5234     512445
Y, row h=5      1        51       5512     555125
Y, row h=6      5        55       5555     555555
Y, row h=7      5        55       5555     555555
Y, row h=8      5        55       5555     555555

When no rejuvenation is allowed (i.e. u(h,x)≡5), the PTC takes the value of R=0.6632. When the system is rejuvenated to the initial state 0 whenever an inspection reveals that it resides in any degraded state (i.e. u(h,x)≡1), the PTC takes the value of R=0.6874. It can be seen that in the last inspections the rejuvenation is not beneficial because not much time remains to perform it. It is preferable to give the system a chance to complete the task even if it resides in states with low PS. An increase in L does not affect the value of PTC considerably.
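The decision rule encoded by matrix Y can be sketched in Python using the L=4 policy of Table 2. The sketch assumes zero-based region indices obtained by integer division with region width floor(m/L)+1, and m=2000 task portions in the usage example; the function names are illustrative. It also evaluates the size of the search space that motivates the use of a heuristic search.

```python
# Policy for L=4 from Table 2: one string per inspection h=1..8, one digit per region.
POLICY_L4 = ["2255", "2355", "2445", "5234", "5512", "5555", "5555", "5555"]
J = 4  # index of the worst state


def region_index(w, m, num_regions):
    """Zero-based region of w completed task portions (assumed width floor(m/L)+1)."""
    width = m // num_regions + 1
    return w // width


def should_rejuvenate(policy, h, w, j, m):
    """True if state j observed at inspection h (1-based) triggers a rejuvenation."""
    threshold = int(policy[h - 1][region_index(w, m, len(policy[0]))])
    return j >= threshold


def search_space_size(worst_state, num_inspections, num_regions):
    """Number of candidate matrices Y whose entries take J+1 different values."""
    return (worst_state + 1) ** (num_inspections * num_regions)


if __name__ == "__main__":
    m = 2000
    print(should_rejuvenate(POLICY_L4, 1, 100, 2, m))    # first inspection, first region
    print(should_rejuvenate(POLICY_L4, 1, 1500, 4, m))   # first inspection, third region
    print(search_space_size(J, 8, 4))                    # 5**32 candidate policies
```

A threshold of 5 can never be reached by a state in 0..4, so rows consisting only of 5s, as in the last three inspections, disable the rejuvenation entirely.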

7.2 Impact of inspection interval and rejuvenation time

Table 3 presents the best rejuvenation policies obtained by the GA for different values of inter-inspection time s when τ=180 and L=4. It can be seen that when s decreases and the total number of inspections increases, the rejuvenation allows achieving a greater PTC. However, when the total number of inspections remains constant, the PTC can be a nonmonotonic function of the inter-inspection time s. Fig. 8 illustrates the dependence of PTC on s for the best obtained rejuvenation policies when 60≤s<90 (which corresponds to H=2). The PTC values corresponding to the best rejuvenation policies are also obtained for different rejuvenation time functions θj(x)=ω(αj+γjx), where ω is the rejuvenation time factor and αj and γj are taken from Table 1.

Table 3. Optimal rejuvenation policies for different s

s               20       40       60       80       100
H               8        4        2        2        1
R               0.9785   0.9679   0.9550   0.9526   0.8429
Y, row h=1      2255     3355     2225     3325     3335
Y, row h=2      2355     5125     5555     5555
Y, row h=3      2445     5555
Y, row h=4      5234     5555
Y, row h=5      5512
Y, row h=6      5555
Y, row h=7      5555
Y, row h=8      5555
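The number of inspections H in Table 3 follows from the inter-inspection time and the time limit of 180. A minimal Python sketch of this relation is given below. The counting rule (inspections at s, 2s, ... that fall strictly before the time limit) is inferred from the values in Table 3 and from the remark that 60≤s<90 corresponds to H=2; it is an assumption, as are the function name and the restriction to integer inputs.

```python
def number_of_inspections(s, tau):
    """Count the inspections at s, 2s, ... that fall strictly before the time limit tau.

    Assumes positive integer s and tau; an inspection that would coincide
    with the time limit itself is not counted.
    """
    if s <= 0 or tau <= 0:
        raise ValueError("s and tau must be positive")
    return (tau - 1) // s


if __name__ == "__main__":
    for s in (20, 40, 60, 80, 100):
        print(s, number_of_inspections(s, 180))
```

For the five values of s in Table 3 the function reproduces the H row of the table.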

Table 4 presents the best obtained rejuvenation policies for different combinations of s and ω.

Fig. 8. PTC R for best obtained rejuvenation policies as function of inter-inspection time s and rejuvenation time factor ω.

Table 4. Optimal rejuvenation policies for different ω and s when H=2

        s        60      63      66      69      72      75      78      81      84      87
ω=0.8   R        0.9776  0.9717  0.9708  0.9722  0.9730  0.9731  0.9726  0.9712  0.9688  0.9653
        Y, h=1   4215    4215    4325    4325    4325    4325    4325    5325    5325    5335
        Y, h=2   5532    5552    5555    5555    5555    5555    5555    5555    5555    5555
ω=0.9   R        0.9648  0.9639  0.9664  0.9677  0.9681  0.9675  0.9660  0.9630  0.9583  0.9522
        Y, h=1   4215    4225    4325    4325    4325    4325    5325    5325    5325    5335
        Y, h=2   5552    5555    5555    5555    5555    5555    5555    5555    5555    5555
ω=1.0   R        0.9550  0.9588  0.9610  0.9618  0.9613  0.9594  0.9560  0.9504  0.9425  0.9326
        Y, h=1   4225    4225    4325    4325    4325    5325    5325    5325    5335    5335
        Y, h=2   5555    5555    5555    5555    5555    5555    5555    5555    5555    5555
ω=1.1   R        0.9493  0.9526  0.9541  0.9537  0.9515  0.9472  0.9409  0.9315  0.9193  0.9046
        Y, h=1   4225    4225    4325    4325    5325    5325    5325    5325    5335    5335
        Y, h=2   5555    5555    5555    5555    5555    5555    5555    5555    5555    5555
ω=1.2   R        0.9422  0.9445  0.9444  0.9420  0.9370  0.9291  0.9188  0.9046  0.8870  0.8668
        Y, h=1   4225    4225    4325    5325    5325    5325    5325    5335    5335    5335
        Y, h=2   5555    5555    5555    5555    5555    5555    5555    5555    5555    5555

It can be seen that R(s) has distinguishable maxima that depend on the rejuvenation time factor ω. When ω<1 and s is low, the system has enough time to complete the task after a rejuvenation performed at the time of the second inspection, which causes an increase in PTC. When the inter-inspection time or the rejuvenation time increases, the rejuvenation becomes non-beneficial in the second inspection; the R(s) function then, in fact, depends on the time of the first inspection and becomes a concave function of s with a distinguishable maximum.
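The location of these maxima can be read directly from the PTC values of Table 4. The following Python sketch, with illustrative names, finds for each rejuvenation time factor the inter-inspection time with the largest R, as well as the interior peaks, i.e. the values of s at which R exceeds both of its neighbours.

```python
S_VALUES = [60, 63, 66, 69, 72, 75, 78, 81, 84, 87]
# PTC values R copied from Table 4, keyed by the rejuvenation time factor.
R_TABLE = {
    0.8: [0.9776, 0.9717, 0.9708, 0.9722, 0.9730, 0.9731, 0.9726, 0.9712, 0.9688, 0.9653],
    0.9: [0.9648, 0.9639, 0.9664, 0.9677, 0.9681, 0.9675, 0.9660, 0.9630, 0.9583, 0.9522],
    1.0: [0.9550, 0.9588, 0.9610, 0.9618, 0.9613, 0.9594, 0.9560, 0.9504, 0.9425, 0.9326],
    1.1: [0.9493, 0.9526, 0.9541, 0.9537, 0.9515, 0.9472, 0.9409, 0.9315, 0.9193, 0.9046],
    1.2: [0.9422, 0.9445, 0.9444, 0.9420, 0.9370, 0.9291, 0.9188, 0.9046, 0.8870, 0.8668],
}


def best_interval(r_values, s_values=S_VALUES):
    """Return the pair (s, R) with the largest R."""
    best = max(range(len(r_values)), key=lambda i: r_values[i])
    return s_values[best], r_values[best]


def interior_peaks(r_values, s_values=S_VALUES):
    """Return the values of s at which R is larger than at both neighbouring s."""
    return [
        s_values[i]
        for i in range(1, len(r_values) - 1)
        if r_values[i] > r_values[i - 1] and r_values[i] > r_values[i + 1]
    ]


if __name__ == "__main__":
    for factor, r_values in R_TABLE.items():
        print(factor, best_interval(r_values), interior_peaks(r_values))
```

For the factor 0.8 the overall maximum is at s=60, where the second-inspection rejuvenation is still useful, while the interior peak lies at s=75; for the larger factors the interior peak is also the overall maximum and moves toward smaller s.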

7.3 Impact of worst PS

Fig. 9 illustrates the dependence of PTC on the PS of the worst possible system state g4 for the best obtained rejuvenation policies when s=60, L=4. Table 5 presents the best obtained rejuvenation policies for different values of PS g4 and rejuvenation time factor ω.

Fig. 9. PTC R for best obtained rejuvenation policies as function of the lowest system PS g4 and rejuvenation time factor ω.

Table 5. Optimal rejuvenation policies for different ω and g4 when H=2.

        g4       5       7       9       11      13      15      17      19      21      23
ω=0.8   R        0.9661  0.9682  0.9704  0.9726  0.9749  0.9776  0.9803  0.9830  0.9883  0.9919
        Y, h=1   4215    4215    4215    4215    4215    4215    4215    4215    5355    5355
        Y, h=2   5521    5521    5521    5521    5532    5532    5532    5544    5555    5555
ω=1.0   R        0.9251  0.9283  0.9316  0.9358  0.9450  0.9550  0.9653  0.9727  0.9818  0.9854
        Y, h=1   4215    4215    4215    4215    4215    4225    4225    4235    5355    5355
        Y, h=2   5551    5551    5551    5554    5555    5555    5555    5555    5555    5555
ω=1.2   R        0.8917  0.9014  0.9113  0.9213  0.9310  0.9422  0.9522  0.9620  0.9727  0.9759
        Y, h=1   4215    4215    4215    4215    4215    4225    4225    4235    5355    5355
        Y, h=2   5555    5555    5555    5555    5555    5555    5555    5555    5555    5555

It can be seen that when g4 is low, it can be beneficial to rejuvenate the system in the second inspection if not much work remains until the task completion. With an increase in g4 it becomes preferable to let the system complete the task without performing any rejuvenation during the second inspection. During the first inspection the function u(1,x) also increases with g4, which means that with an increase in g4, the system gets more chances to complete the task without rejuvenation during the first inspection.
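This observation can be checked against the second-inspection rows of Table 5. The Python sketch below, with illustrative names, finds for each rejuvenation time factor the smallest tabulated g4 for which the best obtained policy contains no rejuvenation in the second inspection, i.e. for which the second row of Y consists only of the threshold 5.

```python
G4_VALUES = [5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
# Second-inspection rows of Y copied from Table 5, keyed by the rejuvenation time factor.
SECOND_ROW = {
    0.8: ["5521", "5521", "5521", "5521", "5532", "5532", "5532", "5544", "5555", "5555"],
    1.0: ["5551", "5551", "5551", "5554", "5555", "5555", "5555", "5555", "5555", "5555"],
    1.2: ["5555"] * 10,
}
NO_REJUVENATION = "5"  # threshold J+1=5 cannot be reached by any state in 0..4


def first_g4_without_second_rejuvenation(rows, g4_values=G4_VALUES):
    """Smallest tabulated g4 whose second-inspection row never triggers a rejuvenation."""
    for g4, row in zip(g4_values, rows):
        if set(row) == {NO_REJUVENATION}:
            return g4
    return None


if __name__ == "__main__":
    for factor, rows in SECOND_ROW.items():
        print(factor, first_g4_without_second_rejuvenation(rows))
```

The larger the rejuvenation time factor, the lower the g4 at which the second-inspection rejuvenation disappears from the policy; for the factor 1.2 it is absent over the whole tabulated range.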

8. Conclusion and Future Work

This paper models a real-time software system undergoing periodic inspections and full rejuvenations triggered based on the system state evaluated during inspections. The model accommodates systems exhibiting any finite number of states (performance degradation levels) and arbitrary types of distributions representing the state sojourn or transition time. The probability that the considered software system completes a real-time task within a given deadline is assessed using an iterative numerical algorithm. Further, the optimal periodically inspected state-based rejuvenation policy, which maximizes the successful completion probability of the real-time task, is derived using the genetic algorithm. As demonstrated through examples, the number of regions for which the rejuvenation function is defined has a minor impact on the real-time task completion probability. It is also demonstrated that the adopted inspection interval, the rejuvenation time function parameters, and the actual degraded performance (processing speed) of the software system can affect the optimal rejuvenation policy.

The proposed work can facilitate the optimal choice of the state-based rejuvenation decision function under any given inspection interval. Since the inspection interval is also relevant to the effectiveness of the state-based policy [44], in the future we are interested in solving the joint optimization of the inspection and rejuvenation policy, where both the optimal inspection interval and the rejuvenation decision function are co-determined. The rejuvenation policy affects not only the PTC, but also the expected total mission cost of a real-time task [42]. Hence, another direction of our future work is to extend the suggested methodology to model the mission cost for periodically inspected software rejuvenation systems. In addition, we will consider partial rejuvenation [13], besides the full rejuvenation considered in this work. We are also interested in extending the model by addressing failed rejuvenations that may take place with a certain probability during the rejuvenation process [14].

Conflict of interest: None

References

[1] Grottke, M., Matias, R., & Trivedi, K. S. The fundamentals of software aging. Proc. of IEEE First International Workshop on Software Aging and Rejuvenation, Washington, DC, USA, 1-6, 2008.
[2] Marshall, E. Fatal error: how patriot overlooked a Scud. Science, 255(5050), 1347, 1992.
[3] Yurcik, W. & Doss, D. Achieving fault-tolerant software with rejuvenation and reconfiguration. IEEE Software, 18(4), 48-52, 2001.
[4] Tai, A. T., Chau, S. N., Alkalaj, L., & Hecht, H. On-Board Preventive Maintenance: Analysis of Effectiveness and Optimal Duty Period. Proc. of Third Int'l Workshop Object Oriented Real-Time Dependable Systems, Newport Beach, CA, 40-47, 1997.
[5] Lei, L., Vaidyanathan, K., & Trivedi, K. S. An approach for estimation of software aging in a web server. Proc. of 2002 International Symp. on Empirical Software Engineering, 91-100, 2002.
[6] Iwamoto, K., Dohi, T., Okamura, H., & Kaio, N. Discrete-time cost analysis for a telecommunication billing application with rejuvenation. Computers & Mathematics with Applications, 51(2), 335-344, 2006.
[7] Kintala, C. Software rejuvenation in embedded systems. J. Autom., Lang. Combinatorics, 14, 63-73, 2009.
[8] Avritzer, A. & Weyuker, E. J. Monitoring Smoothly Degrading Systems for Increased Dependability. Empirical Software Eng. J., 2(1), 59-77, 1997.
[9] Garg, S., Puliafito, A., Telek, M., & Trivedi, K. S. Analysis of preventive maintenance in transactions based software systems. IEEE Transactions on Computers, 47(1), 96-107, 1998.
[10] Zhao, J., Wang, Y., Ning, G., Trivedi, K. S., Matias, R., & Cai, K. A comprehensive approach to optimal software rejuvenation. Performance Evaluation, 70(11), 917-933, 2013.
[11] Cotroneo, D., Natella, R., Pietrantuono, R., & Russo, S. A survey of software aging and rejuvenation studies. J. Emerg. Technol. Comput. Syst., 10(1), Article 8, 34, 2014.
[12] Eto, H. & Dohi, T. Determining the Optimal Software Rejuvenation Schedule via Semi-Markov Decision Process. Journal of Computer Science, 2(6), 528-535, 2006.
[13] Koutras, V. P. & Platis, A. N. Applying partial and full rejuvenation in different degradation levels. Proc. of the IEEE 3rd International Workshop on Software Aging and Rejuvenation (WoSAR), 20-25, 2011.
[14] Koutras, V. P., Platis, A. N., & Limnios, N. Availability and reliability estimation for a system undergoing minimal, perfect and failed rejuvenation. Proc. of IEEE International Conference on Software Reliability Engineering Workshops, Seattle, WA, 1-6, 2008.
[15] Machida, F., Kim, D. S., & Trivedi, K. S. Modeling and analysis of software rejuvenation in a server virtualized system with live VM migration. Performance Evaluation, 70(3), 212-230, 2013.
[16] Bobbio, A., Sereno, M., & Anglano, C. Fine grained software degradation models for optimal rejuvenation policies. Performance Evaluation, 46(1), 45-62, 2001.
[17] Meyer, J. F. On Evaluating the Performability of Degradable Computing Systems. IEEE Transactions on Computers, C-29(8), 720-731, 1980.
[18] Bobbio, A. & Telek, M. Task completion time in degradable systems. In: Haverkort, B. R., Marie, R., Rubino, G., & Trivedi, K. S. (eds) Performability Modelling: Techniques and Tools, Wiley, Chapter 7, 139-161, 2001.
[19] Levitin, G., Xing, L., & Dai, Y. Heterogeneous 1-out-of-N warm standby systems with online checkpointing. Reliability Engineering & System Safety, 169, 127-136, 2018.
[20] Saravakos, P. K., Gravvanis, G. A., Koutras, V. P., & Platis, A. N. A comprehensive approach to software aging and rejuvenation on a single node software system. Proc. of The 9th Hellenic European Research on Computer Mathematics & its Applications Conference, 2009.
[21] Wang, D., Xie, W., & Trivedi, K. S. Performability analysis of clustered systems with rejuvenation under varying workload. Performance Evaluation, 64(3), 247-265, 2007.
[22] Malefaki, S., Koutras, V. P., & Platis, A. N. Modeling Software Rejuvenation on a Redundant System Using Monte Carlo Simulation. Proc. of IEEE 23rd International Symposium on Software Reliability Engineering Workshops, Dallas, TX, 277-282, 2012.
[23] Pietrantuono, R. & Russo, S. Software Aging and Rejuvenation in the Cloud: A Literature Review. Proc. of IEEE International Symposium on Software Reliability Engineering Workshops, Memphis, TN, 257-263, 2018.
[24] Dohi, T. & Okamura, H. Dynamic Software Availability Model with Rejuvenation. Journal of the Operations Research Society of Japan, 59(4), 270-290, 2016.
[25] Vaidyanathan, K. & Trivedi, K. S. A comprehensive model for software rejuvenation. IEEE Transactions on Dependable and Secure Computing, 2(2), 124-137, 2005.
[26] Hua, X., Guo, C., Wu, H., Lautner, D., & Ren, S. Schedulability Analysis for Real-Time Task Set on Resource with Performance Degradation and Dual-Level Periodic Rejuvenations. IEEE Transactions on Computers, 66(3), 553-559, 2017.
[27] Dohi, T., Zheng, J., Okamura, H., & Trivedi, K. S. Optimal periodic software rejuvenation policies based on interval reliability criteria. Reliability Engineering & System Safety, 180, 463-475, 2018.
[28] Bao, Y., Sun, X., & Trivedi, K. S. A workload-based analysis of software aging, and rejuvenation. IEEE Transactions on Reliability, 54(3), 541-548, 2005.
[29] Machida, F. & Miyoshi, N. Analysis of an optimal stopping problem for software rejuvenation in a deteriorating job processing system. Reliability Engineering & System Safety, 168, 128-135, 2017.
[30] Xie, W., Hong, Y., & Trivedi, K. S. Analysis of a two-level software rejuvenation policy. Reliability Engineering & System Safety, 87(1), 13-22, 2005.
[31] Rinsaka, K. & Dohi, T. Toward high assurance software systems with adaptive fault management. Software Quality Journal, 24(1), 65-85, 2016.
[32] Levitin, G., Xing, L., & Ben-Haim, H. Optimizing software rejuvenation policy for real time tasks. Reliability Engineering & System Safety, 176, 202-208, 2018.
[33] Jiang, L., Peng, X., & Xu, G. Time and Prediction based Software Rejuvenation Policy. Proc. of the Second International Conference on Information Technology and Computer Science, Kiev, 114-117, 2010.
[34] Eto, H., Dohi, T., & Ma, J. Simulation-Based Optimization Approach for Software Cost Model with Rejuvenation. In: C. Rong, M. G. Jaatun, F. E. Sandnes, L. T. Yang, J. Ma (eds) Autonomic and Trusted Computing. Lecture Notes in Computer Science, 5060. Springer, Berlin, Heidelberg, 2008.
[35] Guo, J., Li, W., Song, X., Zhang, B., & Wang, Y. Software Rejuvenation Strategy Based on Components. Proc. of the Second World Congress on Software Engineering, Wuhan, 80-83, 2010.
[36] Castelli, V., Harper, R. E., Heidelberger, P., Hunter, S. W., Trivedi, K. S., Vaidyanathan, K., & Zeggert, W. P. Proactive management of software aging. IBM Journal of Research and Development, 45(2), 311-332, 2001.
[37] Trivedi, K. S. & Vaidyanathan, K. Software Rejuvenation - Modeling and Analysis. In: R. Reis (ed) Information Technology. IFIP International Federation for Information Processing, 157. Springer, Boston, MA, 2004.
[38] Torquato, M., Araujo, J., Umesh, I. M., & Maciel, P. SWARE: An approach to support software aging and rejuvenation experiments. Journal of Information Systems Engineering & Management, 3(2), 1-3, 2018.
[39] Okamura, H., Zheng, J., & Dohi, T. A Statistical Framework on Software Aging Modeling with Continuous-Time Hidden Markov Model. Proc. of IEEE 36th Symposium on Reliable Distributed Systems, Hong Kong, 114-123, 2017.
[40] Dang, W. & Zeng, J. Software System Rejuvenation Modeling Based on Sequential Inspection Periods and State Multi-control Limits. In: B. Zou, Q. Han, G. Sun, W. Jing, X. Peng, Z. Lu (eds) Data Science. ICPCSEE 2017. Communications in Computer and Information Science, 728. Springer, Singapore, 2017.
[41] Levitin, G., Xing, L., & Luo, L. Joint optimal checkpointing and rejuvenation policy for real-time computing tasks. Reliability Engineering & System Safety, 182, 63-72, 2019.
[42] Levitin, G., Xing, L., & Xiang, Y. Cost minimization of real-time mission for software systems with rejuvenation. Reliability Engineering & System Safety, 193, 106593, 2020.
[43] Levitin, G., Xing, L., & Huang, H. Optimization of partial software rejuvenation policy. Reliability Engineering & System Safety, 188, 289-296, 2019.
[44] Meng, H., Hei, X., & Liu, J. Analytical Modeling of Periodically Inspected Software Rejuvenation Policy. Information Technology Journal, 12(6), 1227-1232, 2013.
[45] Vitale, R., Zhyrova, A., Fortuna, J. F., de Noord, O. E., Ferrer, A., & Martens, H. On-The-Fly Processing of continuous high-dimensional data streams. Chemometrics and Intelligent Laboratory Systems, 161, 118-129, 2017.
[46] Bansal, S. K. Towards a Semantic Extract-Transform-Load (ETL) Framework for Big Data Integration. Proc. of IEEE International Congress on Big Data, Anchorage, AK, 522-529, 2014.
[47] Levitin, G. Genetic algorithms in reliability engineering. Guest editorial. Reliability Engineering & System Safety, 91(9), 975-976, 2006.
[48] Levitin, G., Xing, L., & Dai, Y. Cold vs. hot standby mission operation cost minimization for 1-out-of-N systems. European Journal of Operational Research, 234(1), 155-162, 2014.