Reliability Engineering and System Safety 133 (2015) 184–191
Modeling and optimizing periodically inspected software rejuvenation policy based on geometric sequences

Haining Meng a,b,*, Jianjun Liu c, Xinhong Hei a

a School of Computer Science and Engineering, Xi'an Technology University, Xi'an 710048, China
b Shaanxi Key Laboratory for Network Computing and Security Technology, Xi'an 710048, China
c Aeronautics Computing Technique Research Institute, Xi'an 710068, China
Article info

Article history: Received 18 May 2013; received in revised form 4 August 2014; accepted 1 September 2014; available online 16 September 2014.

Keywords: Software aging; Software rejuvenation; Inspection; Geometric sequence; Cost rate; System availability

Abstract

Software aging is characterized by an increasing failure rate, progressive performance degradation and even a sudden crash in a long-running software system. Software rejuvenation is an effective method to counteract software aging. A periodically inspected rejuvenation policy for software systems is studied. The consecutive inspection intervals are assumed to form a decreasing geometric sequence and, depending on the number of inspections performed and on whether a failure has been detected, either software rejuvenation or system recovery is performed. The system availability function and cost rate function are obtained, and the optimal number of inspections and rejuvenation interval are derived to maximize system availability and minimize cost rate. Then, boundary conditions of the optimal rejuvenation policy are deduced. Finally, numerical experiments show the effectiveness of the proposed policy. Compared with an existing software rejuvenation policy, the new policy achieves higher system availability.

© 2014 Elsevier Ltd. All rights reserved.
* Corresponding author at: 5 Gold Flower South Road, Xi'an, Shaanxi Province 710048, China.
http://dx.doi.org/10.1016/j.ress.2014.09.007
0951-8320/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

With the explosive growth of internet technology and the emergence of many new and advanced applications, ensuring the stable and reliable operation and availability of software systems has become a critical issue. The challenge is to provide the desired availability and performance at a low cost. Reliability studies have reported the software aging phenomenon in long-running software systems [1], which leads to an increasing failure rate or performance degradation during execution, and eventually to the system hanging or crashing. Software aging has been observed in many kinds of long-running systems, ranging from business-oriented to highly critical systems, such as telecommunication switching and billing software [2], networked UNIX workstations [3], OLTP DBMS servers [4], the Apache web server [5], the Sun Hotspot JVM [6], spacecraft flight systems [7], and cloud computing infrastructure [8]. Software aging can cause great losses in safety-critical systems, including the loss of human lives [9].

Software aging can be attributed to the activation and propagation of software faults called aging-related bugs [8], which are transient, non-deterministic and difficult to characterize. These faults do not immediately cause a software failure when triggered, but manifest themselves as memory consumption, unreleased file locks, data corruption or numerical error accumulation after a long period of execution, so that the system gradually degrades and eventually fails. Such faults are too subtle or too costly to remove during software development and testing. Thus, even software that has been thoroughly tested may still contain design faults that are yet to be revealed in practice [10].

Since software aging leads to performance degradation and sudden failures, the lack of a proper software maintenance technique will inevitably cause serious economic losses and system downtime. Apart from reactive methods such as system reboot, a proactive and preventive software maintenance technique to counteract software aging is software rejuvenation [1], which involves occasionally stopping the software system, removing error conditions and restarting the system in a clean environment. This process removes the accumulated errors and frees up operating system resources, thus improving system availability and reliability, and postponing or preventing unplanned and expensive future system failures. Nevertheless, software rejuvenation does not remove the root cause of software aging, so aging resumes after each system start-up, and software rejuvenation has to be executed cyclically at predetermined or scheduled times to
maintain the robustness of the software system. Moreover, the software system may be unavailable during rejuvenation, which increases system downtime and incurs some costs (e.g., costs due to the loss of business). However, unlike the downtime caused by sudden failures, the downtime related to software rejuvenation can be scheduled at the discretion of the user or administrator, typically during the middle of the night or over weekends. Unscheduled downtime is likely to be more costly, so the cost incurred by software rejuvenation is less than that incurred by system failure. Therefore, software rejuvenation can avoid or at least postpone the effects of software aging, and reduce the overall downtime and related costs. The most important problem is to plan a cost-effective rejuvenation policy that ensures system reliability, reduces maintenance and downtime costs, and improves system availability.

One of the significant issues in a software rejuvenation policy is when and how to trigger it, because of its overhead in processing tasks. The two main approaches to determining the timing of software rejuvenation are time-based and inspection-based methods [11]. Time-based approaches determine the optimal rejuvenation timing by analyzing the state transitions of the software system under an assumed failure distribution. Inspection-based policies usually monitor runtime states and failure behaviors continuously and use statistical data analysis to estimate the software rejuvenation interval. However, the relatively complex computation involved generally increases resource consumption in the software system. Thus, frequent inspections of a running software system incur some overhead in terms of downtime and cost. This overhead must be traded off against the downtime and cost due to failures to obtain the maximum benefit.
Periodic inspection mechanisms have been studied for monitoring runtime systems, in which failures are detected and subsequently repaired. For example, Grall et al. [12] studied an inspection-maintenance strategy for a deteriorating system based on the average long-run cost rate, with the degradation modeled by a Gamma process. An inspection-based maintenance policy for hardware systems was studied in [13], where the inspection time is assumed to be exponentially distributed. Vaidyanathan et al. [14] investigated inspection-based preventive maintenance in an operational software system that is inspected at regular intervals, and obtained the optimal inspection interval that minimizes downtime and cost. Ning et al. [15] studied a multi-granularity software rejuvenation policy in which the inspection interval follows an exponential distribution, and obtained the optimal inspection rate and rejuvenation period by maximizing availability and minimizing cost.

However, inspection at equal intervals has difficulty discovering system failures promptly. In a software system subject to software aging, the probability of failure is low when the system begins execution, so the time between two inspections can be longer. As the software system executes, its performance declines gradually and the probability of failure increases over time, so the successive inspection intervals need to become shorter and shorter, i.e. the inter-inspection intervals constitute a decreasing sequence. Nagel and Skrivan [16] stated that geometric sequences are possible in the well-known Jelinski-Moranda software reliability model. An increasing geometric sequence among the failure rates of faults was also observed in projects of the communication networks department of Siemens AG [17].
Inspired by Nagel and Skrivan's idea, we adopt the more complex but more realistic paradigm of a decreasing geometric sequence to describe the decreasing inter-inspection intervals, so that the probability that no failure is detected in each inter-inspection interval tends to be a constant p. Then, through simulation experiments, we carry out a sensitivity analysis of the influence of the parameter p on system availability and maintenance cost rate. To the best of our knowledge, this is the first work in which geometric sequence formulations are applied to a software rejuvenation policy.

The rest of this paper is organized as follows. In Section 2, related works are introduced. In Section 3, the definitions of geometric sequences and geometric series are given. In Section 4, a periodically inspected rejuvenation policy is presented, and the rejuvenation optimization solution and boundary conditions are obtained by maximizing system availability and minimizing cost rate. In Section 5, numerical results are shown and several examples are provided to illustrate optimal rejuvenation scenarios for different system parameters. Section 6 presents the concluding remarks.
2. Related works

2.1. Software aging

Outages in computer systems are caused by both hardware and software failures, and software failures have been found to cause more outages than hardware failures [18,19]. Software aging is a phenomenon occurring in long-running software systems, which exhibit an increasing failure rate, progressive performance degradation, and eventually system failure, typically because of increasing and unbounded resource consumption, data corruption, and numerical error accumulation.

Aging in a software system, as in human beings, is a cumulative process. It is important to highlight that a system fails because of the consequences of aging effects accumulated over time. For example, an aged application fails when the accumulation of memory leaks leaves insufficient physical memory available. Resource leaks and other aging effects can be due to aging-related bugs in application software [11]. These bugs are hard to reproduce, and even when activated, their manifestation takes a long time to become evident, which makes the testing time insufficient to reveal the problem in most cases. In addition, most of these aging problems are caused by bad software design or faulty code [20]. However, because software is extremely complex and never wholly free of errors, it is almost impossible to fully test and verify that a piece of software is bug-free. This situation is further exacerbated by the fact that software development tends to be extremely time-to-market-driven, which results in applications that meet short-term market needs yet do not account well for long-term concerns such as reliability. Hence, residual faults have to be tolerated in the operational phase.

So far, software aging has been observed and detected in many computing systems through different algorithms or methods. For example, Garg et al.
[3] first proposed a measurement-based method to estimate software aging in networked UNIX workstations by designing and implementing a SNMP-based distributed monitoring tool to collect the operating system resource usage and system activity data. Cassidy et al. [4] adopted the statistical pattern recognition method to detect software aging in large OLTP DBMS servers. Grottke et al. [5] applied the non-parametric statistical method to estimate aging trends in an Apache web server, through collecting the data of used swap space, response time, and free physical memory. Cotroneo et al. [6] revealed the presence of software aging in the Sun Hotspot JVM by adopting both parametric and non-parametric statistical techniques to analyze the trend of throughput loss and memory depletion. Alonso et al. [21] presented a monitoring framework based on aspect-oriented programming to monitor the resource usage of
every component in a web application, thus determining which component is related to software aging. To investigate aging effects in cloud computing infrastructure, virtual and resident memory usage were monitored [8]; RAM exhaustion, swap memory depletion, high CPU utilization, and the subsequent increase in the response time of applications were observed [22].

2.2. Software rejuvenation

Software rejuvenation is a proactive approach to software maintenance, as opposed to the traditional reactive approach, that prevents performance degradation and sudden failures due to software aging. Its aim is to restore the software system to its initial healthy state by releasing OS resources and removing accumulated errors. This proactive and preventive maintenance complements the traditional reactive recovery mechanisms in aging or deteriorating systems. In many professional-grade and commercial software systems, such as the Apache server [23] and the IBM xSeries server [20], simple rejuvenation policies have been implemented in various ways to alleviate software aging problems. The billing data collector system, originally built by AT&T and used in most of the U.S. regional telephone companies, was the first system that used software rejuvenation for the entire system [24].

A simple way to perform software rejuvenation is to restart the application. Restarting can involve queuing the incoming messages temporarily, cleaning up the data structures, and respawning the processes in their initial healthy state. Moreover, rejuvenation actions can restart the system partially or totally: application reset, Virtual Machine or Virtual Machine Monitor (VM/VMM) restart, node reboot (in a cluster-based system) or activation of a standby system. For example, Alonso et al.
[25] presented a comparative experimental study of different rejuvenation actions, covering several levels of rejuvenation that include application granularity, operating system granularity, virtual machine granularity, and physical machine granularity. These actions are generic and do not make use of application-specific features. By contrast, application-specific rejuvenation actions include garbage collection, kernel table flushing, and defragmentation.

Numerous valuable studies try to determine the optimal time for scheduling rejuvenation actions [26–34]. This is typically done by either time-based or inspection-based policies. Time-based rejuvenation policies are built on Markov models, semi-Markov models, stochastic Petri nets and so on. With these models it is possible to compute the system availability and cost. For example, Huang et al. [1] built a two-phase software rejuvenation model to optimize the rejuvenation rate by taking the downtime cost into consideration. Based on this model, different analytical modeling approaches have been used to address the software rejuvenation policy in several research studies. In Ref. [26], a software rejuvenation model for cluster systems is constructed and the optimal rejuvenation interval is determined based on system availability and maintenance cost. Xie et al. [27] used a semi-Markov process to build a model that estimates the software rejuvenation schedule by maximizing system availability. Iwamoto et al. [28] proposed a modified stochastic model and developed nonparametric statistical algorithms to determine the software rejuvenation schedule that minimizes the expected cost per unit time. Alberto et al. [29] introduced a software rejuvenation model for mission-critical systems to determine the optimal configuration algorithm by maximizing mission success. Then, Machida et al.
[30] studied software rejuvenation policies for a virtualized server system with virtual machine migration, and derived the optimal rejuvenation interval by maximizing the steady-state availability. Using a Markov decision process, Okamura and Dohi [31] determined the optimal rejuvenation policy in a transaction-based system by taking the long-run average reward and the power efficiency as performance criteria.

Inspection-based rejuvenation approaches determine the optimal rejuvenation schedule by statistical analysis of system data about resource usage (e.g., free physical memory and used swap space) and performance (e.g., response time and throughput) [32–34]. Compared with the time-based rejuvenation policy, the inspection-based policy offers high flexibility in real-time decision-making and is suitable for operating scenarios that change substantially. For example, Jia et al. [32] studied the relationship between the effect of software aging and some varying performance indicators such as response time, physical memory and swap space, and then predicted the effect of software aging on a single parameter using principal component analysis. Matias et al. [33] explored OS kernel instrumentation techniques to measure software aging effects that are related to memory problems, specifically memory leaks and memory fragmentation. That work is a very important step towards a better understanding of the software aging phenomenon. Avritzer et al. [34] proposed several software aging detection methods and software rejuvenation algorithms based on the measured variation of response time.
3. Geometric sequences

Before presenting the method and elaborating on our solution, we first introduce the related concepts as preliminaries.

Definition 1. Given a sequence of numbers $\{X_n, n = 1, 2, \ldots\}$, if each term after the first is found by multiplying the previous one by a fixed non-zero number, then $\{X_n, n = 1, 2, \ldots\}$ is called a geometric sequence. Geometric sequences are sometimes called geometric progressions.

Definition 2. A common formulation of a geometric sequence $\{X_n, n = 1, 2, \ldots\}$ is $\{\lambda a^{n-1}, n = 1, 2, \ldots\}$, where $a$ and $\lambda$ are both positive constants and $a$ is called the common ratio. If $a > 1$, then $\{X_n, n = 1, 2, \ldots\}$ is an increasing geometric sequence, i.e.

$$X_n < X_{n+1}, \quad n = 1, 2, \ldots \tag{1}$$

If $0 < a < 1$, then $\{X_n, n = 1, 2, \ldots\}$ is called a decreasing geometric sequence, i.e.

$$X_n > X_{n+1}, \quad n = 1, 2, \ldots \tag{2}$$

If $a = 1$, then $\{X_n, n = 1, 2, \ldots\}$ is a constant sequence.

Definition 3. A geometric series is the sum of the terms of a geometric sequence $\{X_n, n = 1, 2, \ldots\}$. For the common geometric sequence $\{\lambda a^{n-1}, n = 1, 2, \ldots\}$ with $a \neq 1$, the sum of the first $n$ terms is

$$\sum_{k=1}^{n} X_k = \sum_{k=1}^{n} \lambda a^{k-1} = \lambda \frac{1 - a^n}{1 - a} \tag{3}$$
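As a quick numerical check of Definitions 2 and 3, the following minimal Python sketch builds the first terms of a decreasing geometric sequence and compares their sum with the closed form of the geometric series. The function names and the values λ = 5, a = 0.5, n = 4 are assumptions chosen only for illustration.

```python
def geometric_terms(lam, a, n):
    """Return the first n terms X_k = lam * a**(k - 1), k = 1..n."""
    return [lam * a ** (k - 1) for k in range(1, n + 1)]


def geometric_series(lam, a, n):
    """Closed-form sum of the first n terms (valid for a != 1)."""
    return lam * (1 - a ** n) / (1 - a)


# Illustrative values only: lam = 5, a = 0.5 (decreasing since 0 < a < 1).
terms = geometric_terms(lam=5.0, a=0.5, n=4)
print(terms)                           # [5.0, 2.5, 1.25, 0.625]
print(sum(terms))                      # 9.375
print(geometric_series(5.0, 0.5, 4))   # 9.375
```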
4. Software rejuvenation policy

4.1. General rejuvenation policy

Huang et al. [1] presented a general rejuvenation policy in which software is rejuvenated at a predetermined time interval $\delta$. As shown in Fig. 1, the software system begins operation with the best performance value, $P_{max}$. After a period of execution, system performance falls to its worst value, $P_{min}$, at time $\delta$, at which point a system failure is imminent, and software rejuvenation is therefore performed to bring the system back to its initial healthy state. On the other hand, because system failures occur randomly, the system may crash at an earlier time $\delta'$ ($\delta' < \delta$), by which point its performance may have declined to the minimum value of zero. In this case a reactive maintenance method such as system recovery is carried out to restore the software system to its initial healthy state.

Fig. 1. Periodical rejuvenation model (performance value, between $P_{max}$ and $P_{min}$, versus time; the rejuvenation point is at $\delta$ and the failure and system recovery point is at $\delta'$).

4.2. Basic idea

It is convenient to monitor a software system periodically with equal inspection intervals to find system faults or failures [14]. However, the inspection program occupies CPU time when it runs alongside the software system, so it is not always worthwhile to inspect the software frequently. Moreover, due to software aging effects, system performance declines gradually and the probability of failure increases as the software executes. Therefore, the successive inspection intervals need to become shorter. In this study, the consecutive system inspection intervals $\{\delta_n, n = 1, 2, \ldots\}$ constitute a decreasing geometric sequence, given by $\delta_n = \lambda a^{n-1}$ for $n = 1, 2, \ldots$ and $0 < a < 1$. The error log is used to record whether system failures occur or not. Consequently, the system completes a renewal cycle in one of two cases. In the first case, shown in Fig. 2, a system failure is discovered at the $k$th ($k \le n$) inspection, and system recovery is carried out. In the second case, shown in Fig. 3, no failure is detected after the $n$th inspection, and software rejuvenation is performed. Both software rejuvenation and system recovery bring the system to its initial healthy state.

Fig. 2. The failure is detected after inspection for $k$ ($k \le n$) times (a renewal cycle ending with system recovery of duration $\tau_r$).

Fig. 3. No failure is detected after the inspection for $n$ times (a renewal cycle ending with software rejuvenation of duration $\tau_R$).

The basic assumptions on the software rejuvenation policy are as follows.

Assumption 1. The values of $\tau_R$ (the execution time of software rejuvenation) and $\tau_r$ (the time for system recovery after a system failure) have been selected on some practical basis, and $\tau_R < \tau_r$.

Assumption 2. At each system inspection interval, $p$ denotes the probability that no failure is detected; $1 - p$ stands for the probability that a system failure is detected.

Assumption 3. After system recovery or software rejuvenation, the software system returns to the new initial state.

Assumption 4. The inspections discover system failures and have no effect on the operation of the software system.

From the above analysis, given the inter-inspection intervals $\{\delta_n, n = 1, 2, \ldots\}$, the software rejuvenation period is determined by $n$. A smaller $n$ results in more frequent rejuvenation, which may reduce the chance of severe deterioration and failures but is also costly and reduces the availability of the software system. On the other hand, a greater $n$ keeps the software system operating in a high-risk state. Thus, it is very important to choose a proper value of $n$.

4.3. Formulation

Let $Y_k$ denote the outcome of the $k$th system inspection, $k = 1, 2, \ldots$. $Y_k = 1$ indicates that no failure is detected, while $Y_k = 0$ implies that a failure is detected. According to Assumption 2, $P(Y_k = 1) = p$ and $P(Y_k = 0) = 1 - p$. Then the probability that a failure is first detected at the $k$th ($k \le n$) system inspection is

$$P(Y_1 = 1, Y_2 = 1, \ldots, Y_{k-1} = 1, Y_k = 0) = p^{k-1}(1 - p) \tag{4}$$
The probability that no failure is detected after $n$ system inspections is

$$P(Y_1 = 1, Y_2 = 1, \ldots, Y_{n-1} = 1, Y_n = 1) = p^n \tag{5}$$
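The two probabilities above describe every way a renewal cycle can end: a failure first detected at inspection k = 1, ..., n, or no failure detected in n inspections. A minimal Python sketch, assuming the illustrative values p = 0.95 and n = 11 and a hypothetical helper name, confirms that these outcomes form a complete probability distribution.

```python
def cycle_outcome_probabilities(p, n):
    """Probabilities of the n + 1 ways a renewal cycle can end.

    Entries 0..n-1: failure first detected at inspection k = 1..n,
    i.e. p**(k - 1) * (1 - p).
    Entry n: no failure detected in n inspections, i.e. p**n,
    in which case rejuvenation is performed.
    """
    failure_at_k = [p ** (k - 1) * (1 - p) for k in range(1, n + 1)]
    rejuvenation = p ** n
    return failure_at_k + [rejuvenation]


# Illustrative values only (assumed for this sketch).
probs = cycle_outcome_probabilities(p=0.95, n=11)
print(sum(probs))  # equals 1 up to floating-point rounding
```

Because the outcomes are exhaustive and mutually exclusive, any expectation over a renewal cycle can be written as a weighted sum over this list.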
From the description in Section 4.2, a renewal cycle begins with a system start-up and ends either with software rejuvenation after $n$ system inspections or with system recovery at the $k$th ($k \le n$) inspection. The average length of a renewal cycle of the software system is then expressed as

$$\begin{aligned} L(n) &= \left( \sum_{k=1}^{n} \delta_k + \tau_R \right) p^n + \sum_{k=1}^{n} \left( \sum_{j=1}^{k} \delta_j + \tau_r \right) p^{k-1}(1 - p) \\ &= \lambda \frac{1 - (ap)^n}{1 - ap} + \tau_R p^n + \tau_r \frac{1 - p^{n+1} - p + p^2}{p} \end{aligned} \tag{6}$$
The expected uptime of the software system during a renewal cycle is

$$R(n) = \sum_{k=1}^{n} \delta_k \, p^n + \sum_{k=1}^{n} \sum_{j=1}^{k} \delta_j \, p^{k-1}(1 - p) = \lambda \frac{1 - (ap)^n}{1 - ap} \tag{7}$$

The software system is down while system recovery or software rejuvenation is in progress. Thus, the expected downtime during a renewal cycle is given by

$$\begin{aligned} D(n) &= L(n) - R(n) = \tau_R p^n + \sum_{k=1}^{n} \tau_r \, p^{k-1}(1 - p) \\ &= \tau_R p^n + \tau_r \frac{1 - p^{n+1} - p + p^2}{p} \end{aligned} \tag{8}$$
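The renewal cycle described by Assumptions 1 to 4 can also be simulated directly. The Monte Carlo sketch below is not the authors' experiment: the function names, the seed, the number of cycles, and the choice n = 11 with λ = 5, a = 0.909, p = 0.95, τ_R = 0.2, τ_r = 0.3 are all assumptions made for illustration. It compares the sample mean of the uptime with the closed form $\lambda (1 - (ap)^n)/(1 - ap)$ of the expected uptime.

```python
import random


def simulate_cycle(lam, a, p, n, tau_R, tau_r, rng):
    """Simulate one renewal cycle and return (uptime, downtime)."""
    uptime = 0.0
    for k in range(1, n + 1):
        uptime += lam * a ** (k - 1)   # k-th inspection interval delta_k
        if rng.random() >= p:          # failure detected with probability 1 - p
            return uptime, tau_r       # system recovery ends the cycle
    return uptime, tau_R               # n clean inspections: rejuvenation


def expected_uptime(lam, a, p, n):
    """Closed form lam * (1 - (a*p)**n) / (1 - a*p) of the expected uptime."""
    return lam * (1 - (a * p) ** n) / (1 - a * p)


rng = random.Random(0)  # fixed seed so that runs are repeatable
params = dict(lam=5.0, a=0.909, p=0.95, n=11, tau_R=0.2, tau_r=0.3)
cycles = [simulate_cycle(rng=rng, **params) for _ in range(100_000)]
mean_up = sum(u for u, _ in cycles) / len(cycles)
mean_down = sum(d for _, d in cycles) / len(cycles)
print(mean_up, expected_uptime(5.0, 0.909, 0.95, 11))
print(mean_down)
```

The simulation follows the model's convention that the whole interval up to the detecting inspection counts as uptime; the mean downtime necessarily lies between τ_R and τ_r.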
In addition, the expected number of inspections in a renewal cycle is

$$I(n) = n p^n + \sum_{k=1}^{n} k \, p^{k-1}(1 - p) = \frac{1 - p^{n+1}}{1 - p} \tag{9}$$

We assume that $c_I$ is the expected cost per system inspection, and $c_f$ is the expected cost per unit of system downtime, which consists of the time of system recovery, software rejuvenation and system breakdown. The expected total cost per cycle is

$$c(n) = c_I I(n) + c_f D(n) = c_I \frac{1 - p^{n+1}}{1 - p} + c_f \left( \tau_R p^n + \tau_r \frac{1 - p^{n+1} - p + p^2}{p} \right) \tag{10}$$

The system availability is defined as the ratio of the expected uptime to the average length of a renewal cycle. Then the system availability function is

$$A(n) = \frac{R(n)}{L(n)} = \frac{\lambda \dfrac{1 - (ap)^n}{1 - ap}}{\lambda \dfrac{1 - (ap)^n}{1 - ap} + \tau_R p^n + \tau_r \dfrac{1 - p^{n+1} - p + p^2}{p}} \tag{11}$$

The cost rate is defined as the ratio of the expected total cost per cycle to the average length of a renewal cycle. Then the cost rate function is

$$\begin{aligned} r(n) = \frac{c(n)}{L(n)} &= \frac{c_I I(n) + c_f D(n)}{\left( \sum_{k=1}^{n} \delta_k + \tau_R \right) p^n + \sum_{k=1}^{n} \left( \sum_{j=1}^{k} \delta_j + \tau_r \right) p^{k-1}(1 - p)} \\ &= \frac{c_I \dfrac{1 - p^{n+1}}{1 - p} + c_f \left( \tau_R p^n + \tau_r \dfrac{1 - p^{n+1} - p + p^2}{p} \right)}{\lambda \dfrac{1 - (ap)^n}{1 - ap} + \tau_R p^n + \tau_r \dfrac{1 - p^{n+1} - p + p^2}{p}} \end{aligned} \tag{12}$$

Since software is cyclically rejuvenated after the $n$th system inspection, the software rejuvenation interval is

$$\eta(n) = \sum_{k=1}^{n} \delta_k = \sum_{k=1}^{n} a^{k-1} \lambda = \lambda \frac{1 - a^n}{1 - a} \tag{13}$$
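The per-cycle measures of this section can be evaluated numerically. The Python sketch below is an illustration under stated assumptions rather than the authors' code: it computes the expectations by summing over the n + 1 possible cycle outcomes (failure first detected at inspection k, or rejuvenation after n clean inspections) instead of using the simplified closed forms, so its availability and cost-rate values are not claimed to reproduce the tabulated results. All function names are hypothetical; the parameter values are those of Table 1 with p = 0.95.

```python
def cycle_measures(lam, a, p, n, tau_R, tau_r):
    """Return (L, R, D, I): expected cycle length, uptime, downtime and
    number of inspections, summed over the n + 1 possible cycle outcomes."""
    R = D = I = 0.0
    elapsed = 0.0
    for k in range(1, n + 1):
        elapsed += lam * a ** (k - 1)        # delta_1 + ... + delta_k
        w = p ** (k - 1) * (1 - p)           # failure first detected at k
        R += elapsed * w
        D += tau_r * w
        I += k * w
    w = p ** n                               # no failure: rejuvenation
    R += elapsed * w
    D += tau_R * w
    I += n * w
    return R + D, R, D, I


def availability(lam, a, p, n, tau_R, tau_r):
    """Expected uptime divided by expected cycle length."""
    L, R, _, _ = cycle_measures(lam, a, p, n, tau_R, tau_r)
    return R / L


def cost_rate(lam, a, p, n, tau_R, tau_r, c_I, c_f):
    """Expected cost per cycle divided by expected cycle length."""
    L, _, D, I = cycle_measures(lam, a, p, n, tau_R, tau_r)
    return (c_I * I + c_f * D) / L


def rejuvenation_interval(lam, a, n):
    """Sum of the first n inspection intervals, lam * (1 - a**n) / (1 - a)."""
    return lam * (1 - a ** n) / (1 - a)


print(rejuvenation_interval(5.0, 0.909, 11))   # about 35.7
print(rejuvenation_interval(5.0, 0.909, 9))    # about 31.7
print(availability(5.0, 0.909, 0.95, 11, 0.2, 0.3))
print(cost_rate(5.0, 0.909, 0.95, 9, 0.2, 0.3, c_I=2.0, c_f=50.0))
```

Summing over outcomes keeps the sketch close to the definitions of uptime, downtime and inspection count, at the price of O(n) work per evaluation.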
Table 1. System parameters.

| Parameter | $c_f$ | $c_I$ | $\lambda$ | $a$ | $\tau_R$ | $\tau_r$ |
|---|---|---|---|---|---|---|
| Value | 50 | 2 | 5 | 0.909 | 0.2 h | 0.3 h |

Fig. 5. Cost rate versus system inspection times (p = 0.95).
Table 3. Comparison between inspection times n and cost rate r(n), where n = 9 is the optimal rejuvenation policy with minimal cost rate 0.8091.

| n | r(n) | n | r(n) | n | r(n) |
|---|---|---|---|---|---|
| 1 | 2.032 | 8 | 0.8126 | 15 | 0.8462 |
| 2 | 1.293 | 9 | 0.8091 | 16 | 0.8565 |
| 3 | 1.050 | 10 | 0.8100 | 17 | 0.8672 |
| 4 | 0.9367 | 11 | 0.8139 | 18 | 0.8781 |
| 5 | 0.8758 | 12 | 0.8200 | 19 | 0.8892 |
| 6 | 0.8415 | 13 | 0.8277 | 20 | 0.9003 |
| 7 | 0.8223 | 14 | 0.8365 | 21 | 0.9113 |
Fig. 4. System availability versus system inspection times (p = 0.95).
Table 2. Comparison between inspection times n and system availability A(n), where n = 11 is the optimal rejuvenation policy with maximal system availability 0.9853.

| n | A(n) | n | A(n) | n | A(n) |
|---|---|---|---|---|---|
| 1 | 0.9561 | 8 | 0.9849 | 15 | 0.9850 |
| 2 | 0.9733 | 9 | 0.9851 | 16 | 0.9849 |
| 3 | 0.9790 | 10 | 0.9852 | 17 | 0.9847 |
| 4 | 0.9817 | 11 | 0.9853 | 18 | 0.9846 |
| 5 | 0.9832 | 12 | 0.9852 | 19 | 0.9844 |
| 6 | 0.9841 | 13 | 0.9852 | 20 | 0.9843 |
| 7 | 0.9846 | 14 | 0.9851 | 21 | 0.9841 |

Fig. 6. System availability versus system inspection times for different values of p.
4.4. Boundary condition of optimal rejuvenation policy

The practical task for the optimal rejuvenation policy is to maximize the system availability or minimize the cost rate by selecting a proper value of $n$. A reasonable approach is to calculate the derivative of the system availability function $A(n)$ or the cost rate function $r(n)$; if it satisfies $A'(n) = 0$ or $r'(n) = 0$, then the optimal $n^*$ and the software rejuvenation interval $\eta(n^*)$ are obtained. Besides, by analyzing the variation of $A(n)$ and $r(n)$ with increasing $n$, two boundary conditions for the optimal rejuvenation policy are derived as follows.

Theorem 1. By the system availability function $A(n)$, the optimal number of inspections $n^*$ is given by

$$n^* = \min \left\{ n \;\middle|\; \phi(n) > \frac{(1 - p + p^2)(1 - ap)\tau_r}{p(\tau_R - \tau_r)} \right\} \tag{14}$$

where $\phi(n) = \left( p - a p^{n+1} - 1 + (ap)^{n+1} \right) / a^n$.

Proof. From Formula (11), we have

$$A(n) = \frac{R(n)}{L(n)} = \frac{R(n)}{R(n) + D(n)} = \frac{1}{1 + t(n)}$$

where $t(n) = D(n)/R(n)$.

It can be seen that $t(n)$ is increasing (decreasing) when the system availability $A(n)$ is decreasing (increasing). Maximizing $A(n)$ is equivalent to minimizing $t(n)$. Thus, our objective is to find the minimum value of $t(n)$ to obtain the optimal $n^*$ and the associated optimal rejuvenation interval $\eta(n^*)$. The optimal $n^*$ that maximizes the system availability $A(n)$ should be the minimum value of $n$ that satisfies the condition $t(n+1) - t(n) > 0$. The difference of $t(n+1)$ and $t(n)$ is computed as follows:

$$t(n+1) - t(n) = \frac{D(n+1)}{R(n+1)} - \frac{D(n)}{R(n)} = \frac{D(n+1)R(n) - D(n)R(n+1)}{R(n+1)R(n)}$$

Furthermore, note that $t(n+1) - t(n) > 0$ is equivalent to $D(n+1)R(n) - D(n)R(n+1) > 0$, where

$$\begin{aligned} D(n+1)R(n) - D(n)R(n+1) &= \left( \tau_R p^{n+1} + \tau_r \frac{1 - p^{n+2} - p + p^2}{p} \right) \lambda \frac{1 - (ap)^n}{1 - ap} \\ &\quad - \left( \tau_R p^n + \tau_r \frac{1 - p^{n+1} - p + p^2}{p} \right) \lambda \frac{1 - (ap)^{n+1}}{1 - ap} \end{aligned}$$

so the condition becomes

$$\frac{p - a p^{n+1} - 1 + (ap)^{n+1}}{a^n} - \frac{(1 - p + p^2)(1 - ap)\tau_r}{p(\tau_R - \tau_r)} > 0$$

Therefore, it can be concluded that the optimal $n^*$ can be derived from Formula (14). The assertion is proven.

Theorem 2. By the cost rate function $r(n)$, the optimal number of system inspections $n^*$ is given by

$$n^* = \min \{ n \mid \psi(n) > 0 \} \tag{15}$$

where

$$\psi(n) = \left( \omega_1 + p^{n+1}\omega_3 \right) \left( \omega_2 + p^n (\tau_R - \tau_r - a^n \omega_4) \right) - \left( \omega_1 + p^n \omega_3 \right) \left( \omega_2 + p^{n+1} (\tau_R - \tau_r - a^{n+1} \omega_4) \right)$$

and

$$\omega_1 = \frac{c_I}{1 - p} + c_f \tau_r \frac{1 - p + p^2}{p}, \quad \omega_2 = \frac{\lambda}{1 - ap} + \tau_r \frac{1 - p + p^2}{p}, \quad \omega_3 = c_f (\tau_R - \tau_r) - c_I \frac{p}{1 - p}, \quad \omega_4 = \frac{\lambda}{1 - ap}.$$

Fig. 7. Cost rate versus system inspection times for different values of p.
Table 4. Comparison between optimal inspection times n* and cost rate r(n*)/system availability A(n*) for different values of p.

| p | n* | r(n*) | p | n* | A(n*) |
|---|---|---|---|---|---|
| 0.99 | 10 | 0.7870 | 0.99 | 13 | 0.9905 |
| 0.95 | 9 | 0.8091 | 0.95 | 11 | 0.9853 |
| 0.91 | 8 | 0.8365 | 0.91 | 9 | 0.9785 |
Table 5. Comparison of system performance.

| Models | System availability |
|---|---|
| Periodically inspected software rejuvenation policy based on geometric sequences | 0.9853 |
| General software rejuvenation policy (in Ref. [1]) | 0.9740 |

Proof. Similar to Theorem 1, in order to obtain the optimal $n^*$ that minimizes the cost rate $r(n)$, the difference of $r(n+1)$ and $r(n)$ is first calculated according to Formula (12) as follows:

$$r(n+1) - r(n) = \frac{c(n+1)}{L(n+1)} - \frac{c(n)}{L(n)} = \frac{c(n+1) \cdot L(n) - c(n) \cdot L(n+1)}{L(n+1) \cdot L(n)}$$

Moreover, $r(n+1) - r(n) > 0$ is equivalent to $c(n+1)L(n) - c(n)L(n+1) > 0$. We have

$$\begin{aligned} & c(n+1)L(n) - c(n)L(n+1) \\ &= \left[ c_I \frac{1 - p^{n+2}}{1 - p} + c_f \left( \tau_R p^{n+1} + \tau_r \frac{1 - p^{n+2} - p + p^2}{p} \right) \right] \left[ \lambda \frac{1 - (ap)^n}{1 - ap} + \tau_R p^n + \tau_r \frac{1 - p^{n+1} - p + p^2}{p} \right] \\ &\quad - \left[ c_I \frac{1 - p^{n+1}}{1 - p} + c_f \left( \tau_R p^n + \tau_r \frac{1 - p^{n+1} - p + p^2}{p} \right) \right] \left[ \lambda \frac{1 - (ap)^{n+1}}{1 - ap} + \tau_R p^{n+1} + \tau_r \frac{1 - p^{n+2} - p + p^2}{p} \right] \\ &= \left( \omega_1 + p^{n+1}\omega_3 \right) \left( \omega_2 + p^n (\tau_R - \tau_r - a^n \omega_4) \right) - \left( \omega_1 + p^n \omega_3 \right) \left( \omega_2 + p^{n+1} (\tau_R - \tau_r - a^{n+1} \omega_4) \right) \end{aligned}$$

It can be concluded that the optimal $n^*$ can be derived from Formula (15). Thus, the assertion is proven.
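Both proofs rest on the same first-difference test: the optimum is the smallest n at which the objective stops decreasing, i.e. t(n+1) - t(n) > 0 for availability and r(n+1) - r(n) > 0 for cost rate. The Python sketch below applies that test numerically. It is a sketch under assumptions: the helper names and the search limit are hypothetical, the expectations are computed by summing over cycle outcomes rather than through φ(n) and ψ(n), and the parameter values are those of Table 1 with p = 0.95, so the printed values are not claimed to match the tabulated optima.

```python
def measures(n, lam, a, p, tau_R, tau_r):
    """Expected uptime R, downtime D and inspection count I of one cycle."""
    R = D = I = elapsed = 0.0
    for k in range(1, n + 1):
        elapsed += lam * a ** (k - 1)     # delta_1 + ... + delta_k
        w = p ** (k - 1) * (1 - p)        # failure first detected at k
        R, D, I = R + elapsed * w, D + tau_r * w, I + k * w
    w = p ** n                            # no failure detected: rejuvenation
    return R + elapsed * w, D + tau_R * w, I + n * w


def first_increase(f, n_max=1000):
    """Smallest n >= 1 with f(n + 1) - f(n) > 0, or None if none up to n_max."""
    for n in range(1, n_max + 1):
        if f(n + 1) - f(n) > 0:
            return n
    return None


P = dict(lam=5.0, a=0.909, p=0.95, tau_R=0.2, tau_r=0.3)
c_I, c_f = 2.0, 50.0


def t(n):
    """Downtime-to-uptime ratio D(n)/R(n); minimizing it maximizes A(n)."""
    R, D, _ = measures(n, **P)
    return D / R


def r(n):
    """Cost rate: expected cost per cycle over expected cycle length."""
    R, D, I = measures(n, **P)
    return (c_I * I + c_f * D) / (R + D)


n_avail = first_increase(t)
n_cost = first_increase(r)
print(n_avail, n_cost)
```

The search stops at the first local minimum, which coincides with the global one only if the objective decreases and then increases, as the curves described in Section 5 do.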
5. Numerical results and analysis

Based on the above rejuvenation policy, numerical experiments are performed by taking system availability and cost rate as evaluation indicators of system reliability. By maximizing system availability and minimizing cost rate, the optimal number of system inspections $n^*$ and the optimal software rejuvenation interval $\eta(n^*)$ are obtained. The values of the system parameters are given in Table 1; all of these values are selected from experimental experience for demonstration purposes.

The relationship between the number of system inspections and system availability is shown in Fig. 4. It can be seen that, as the number of inspections $n$ increases, system availability increases rapidly and reaches its maximum value at the optimal $n^*$. Then, as $n$ continues to increase, system availability steadily decreases because the probability of failure gradually increases. We studied the variation of system availability $A(n)$ as $n$ varies from 1 to 21. The optimal $n^*$ is 11, at which $A(n)$ reaches its maximum value of 0.9853, as shown in Table 2. In this case, substituting $n^*$ into Formula (13), we obtain the optimal software rejuvenation interval $\eta(n^*) = 35.7$.

Similarly, Fig. 5 illustrates how the cost rate changes as the number of system inspections $n$ increases. As $n$ increases, the cost rate decreases rapidly and reaches its minimum value at the optimal $n^*$. Then, as $n$ continues to grow, the probability of failure steadily goes up and the cost rate increases as well. The results of the cost rate $r(n)$ for $n$ varying from 1 to 21 are listed in Table 3, which shows that the optimal $n^*$ is 9, at which $r(n)$ reaches its minimum value of 0.8091. Substituting $n^*$ into Formula (13), we obtain the optimal software rejuvenation interval $\eta(n^*) = 31.7$.

In addition, we investigated the effects of the probability $p$ on the optimal rejuvenation policy.
The relationship between system inspection times and system availability/cost rate for various values of p is shown in Figs. 6 and 7. Both figures show that, on the whole, the greater the value of p, the larger the optimal inspection times $n^*$. The reason is that a greater p implies a lower probability of failure, so software rejuvenation should be triggered later and the optimal $n^*$ should be larger. The same conclusion can be drawn from Table 4, which gives the cost rate and system availability at the optimal $n^*$ for varying values of p.

Finally, the periodically inspected software rejuvenation policy based on geometric sequences is compared with the general rejuvenation policy in terms of system performance. The comparison results are given in Table 5, which indicates that the new policy achieves higher system availability than the general periodic rejuvenation policy.
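The search for the optimal inspection times described in this section can be sketched as an exhaustive scan over n = 1, ..., 21. The sketch below rests on assumptions: the cost rate is taken to have the ratio form r(n) = (ω1 + p^n ω3) / (ω2 + p^n (τR − τr − a^n ω4)) suggested by the expression in the proof of Section 4, and every parameter value is hypothetical (the values of Table 1 are not used). The names `Params`, `cost_rate`, `optimal_n`, `first_nondecreasing_n` and `DEMO` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Symbols as they appear in the proof; all values are hypothetical."""
    omega1: float
    omega2: float
    omega3: float
    omega4: float
    tau_R: float
    tau_r: float
    a: float
    p: float


def cost_rate(n: int, q: Params) -> float:
    """Assumed ratio form of the cost rate r(n) (an assumption, see lead-in)."""
    numerator = q.omega1 + q.p**n * q.omega3
    denominator = q.omega2 + q.p**n * (q.tau_R - q.tau_r - q.a**n * q.omega4)
    return numerator / denominator


def optimal_n(q: Params, n_max: int = 21) -> int:
    """Exhaustive scan: the n in 1..n_max with the smallest cost rate."""
    return min(range(1, n_max + 1), key=lambda n: cost_rate(n, q))


def first_nondecreasing_n(q: Params, n_max: int = 21) -> int:
    """Smallest n with r(n+1) - r(n) >= 0, or n_max if there is none."""
    for n in range(1, n_max):
        if cost_rate(n + 1, q) - cost_rate(n, q) >= 0:
            return n
    return n_max


# Hypothetical demonstration values, chosen so the denominator stays positive.
DEMO = Params(omega1=10.0, omega2=50.0, omega3=-5.0, omega4=40.0,
              tau_R=2.0, tau_r=5.0, a=0.9, p=0.9)

n_star = optimal_n(DEMO)
print("optimal n:", n_star, "cost rate:", cost_rate(n_star, DEMO))
```

When r(n) first decreases and then increases, as described for Fig. 5, the scan and the sign-change rule return the same n. The scan is the safer choice when that shape has not been verified.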
6. Conclusions and future work

Software aging is an important factor affecting software reliability, and software rejuvenation is an effective method to counteract it. Because system failures occur randomly and the system degrades gradually over time, this study presented a periodically inspected software rejuvenation policy in which the inspection intervals form a geometric sequence. The inspection schedule is treated as an important decision factor for maximizing system availability and minimizing cost rate. The optimal inspection times and rejuvenation interval are given, and the boundary condition of the optimal rejuvenation policy is deduced. Finally, quantitative analysis and numerical experiments show that the proposed rejuvenation policy can greatly reduce the average downtime cost and improve system availability; compared with previous work on the general software rejuvenation policy, the new policy achieves higher system availability. Future work includes research on dynamically selecting the software rejuvenation policy and on a fine-grained software rejuvenation policy that considers rejuvenation actions on part of the software system.
Acknowledgments

The authors would like to thank the sponsors of the National Natural Science Foundation of China under Grant nos. U1334211 and 61472318, the Scientific Research Plan Project of Shaanxi Province of China under Grant no. 2014JQ8302, the Scientific Research Plan Project of Shaanxi Education Department of China under Grant no. 14JK1520, and the Doctoral Fund of Xi'an Technology University under Grant no. 116-210912.
References

[1] Huang Y, Kintala C, Kolettis N, Fulton N. Software rejuvenation: analysis, module and application. In: Proceedings of IEEE international symposium on fault tolerant computing; 1995. p. 381–90.
[2] Avritzer A, Weyuker EJ. Monitoring smoothly degrading systems for increased dependability. Empir Softw Eng J 1991;2(1):59–77.
[3] Garg S, Puliafito A, Telek M, Trivedi KS. Analysis of preventive maintenance in transactions based software systems. IEEE Trans Comput 1998;47(1):96–107.
[4] Cassidy KJ, Gross KC, Malekpour A. Advanced pattern recognition for detection of complex software aging phenomena in online transaction processing servers. In: Proceedings of international conference on dependable systems and networks; 2002. p. 478–82.
[5] Grottke M, Li L, Vaidyanathan K, Trivedi KS. Analysis of software aging in a web server. IEEE Trans Reliab 2006;55(3):411–20.
[6] Cotroneo D, Orlando S, Pietrantuono R, Russo S. A measurement-based ageing analysis of the JVM. Softw Test, Verif Reliab 2013;23(3):199–239.
[7] Tai A, Chau S, Alkalaj L, Hect H. On-board preventive maintenance: a design-oriented analytic study for long-life applications. Perform Eval 1999;35(5):215–32.
[8] Araujo J, Matos R, Alves V, Maciel P, Souza F, Trivedi KS. Software aging in the eucalyptus cloud computing infrastructure: characterization and rejuvenation. ACM J Emerg Technol Comput Syst 2014;10(1):11–6.
[9] Marshall E. Fatal error: how patriot overlooked a scud. Science 1992;13(3):59–67.
[10] Gray J. Why do computers stop and what can be done about it? In: Proceedings of IEEE international symposium on reliability in distributed software and database systems; 1986. p. 3–12.
[11] Cotroneo D, Natella R, Pietrantuono R, Russo S. A survey of software aging and rejuvenation studies. ACM J Emerg Technol Comput Syst 2014;10(1):8–42.
[12] Grall A, Dieulle L, Bérenguer C, Roussignol M. Continuous-time predictive-maintenance scheduling for a deteriorating system. IEEE Trans Reliab 2002;51(2):141–50.
[13] Hosseini MM, Kerr RM, Randall RB. An inspection model with minimal and major maintenance for a system with deterioration and Poisson failures. IEEE Trans Reliab 2000;49(1):88–98.
[14] Vaidyanathan K, Selvamuthu D, Trivedi KS. Analysis of inspection-based preventive maintenance in operational software systems. In: Proceedings of IEEE symposium on reliable distributed systems; 2002. p. 286–95.
[15] Ning G, Trivedi KS, Hu H, Cai KY. Multi-granularity software rejuvenation policy based on continuous time Markov chain. In: Proceedings of IEEE international workshop on software aging and rejuvenation; 2011. p. 32–7.
[16] Nagel PM, Skrivan JA. Software reliability: repetitive run experimentation and modeling. Natl Tech Inform Serv 1982.
[17] Wagner S, Fischer H. A software reliability model based on a geometric sequence of failure rates. Reliable Software Technologies–Ada-Europe. Berlin Heidelberg: Springer; 2006. p. 143–54.
[18] Gray J, Siewiorek DP. High-availability computer systems. IEEE Comput 1991;24(9):39–48.
[19] Sullivan M, Chillarege R. Software defects and their impact on system availability: a study of field failures in operating systems. In: Proceedings of the 21st IEEE international symposium on fault-tolerant computing; 1991. p. 2–9.
[20] Castelli V, Harper RE, Heidelberger P, Hunter SW, Trivedi KS. Proactive management of software aging. IBM J Res Dev 2001;45(2):311–32.
[21] Alonso J, Torres J, Berral JL, Gavalda R. J2EE instrumentation for software aging root cause application component determination with AspectJ. In: Proceedings of international symposium on parallel & distributed processing, Workshops and Ph.D. Forum; 2010. p. 1–8.
[22] Matos R, Araujo J, Alves V, Maciel P. Experimental evaluation of software aging effects in the eucalyptus elastic block storage. In: Proceedings of IEEE international conference on system, man, and cybernetics; 2012. p. 1103–8.
[23] Apache Software Foundation. Apache http server project, 〈http://www.apache.org〉.
[24] Bernstein L, Kintala CMR. Software rejuvenation. CrossTalk 2004;17(8):23–6.
[25] Alonso J, Matias R, Vicente E, Maria A, Trivedi KS. A comparative experimental study of software rejuvenation overhead. Perform Eval 2011;70(3):231–50.
[26] Vaidyanathan K, Harper RE, Hunter SW, Trivedi KS. Analysis and implementation of software rejuvenation in cluster systems. ACM SIGMETRICS Perform 2001;29(1):62–71.
[27] Xie W, Hong YG, Trivedi KS. Analysis of a two-level software rejuvenation policy. Reliab Eng Syst Saf 2005;87(1):13–22.
[28] Iwamoto K, Dohi T, Okamura H, Kaio N. Discrete-time cost analysis for a telecommunication billing application with rejuvenation. Comput Math Appl 2006;51(2):335–44.
[29] Alberto A, Robert GC, Elaine JW. Methods and opportunities for rejuvenation in aging distributed software systems. J Syst Softw 2010;83(9):1568–78.
[30] Machida F, Kim DS, Trivedi KS. Modeling and analysis of software rejuvenation in a server virtualized system with live VM migration. Perform Eval 2012;70(3):212–30.
[31] Okamura H, Dohi T. Dynamic software rejuvenation policies in a transaction-based system under Markovian arrival processes. Perform Eval 2012;70(3):197–211.
[32] Jia YF, Chen XE, Zhao L, Cai KY. On the relationship between software aging and related parameters. In: Proceedings of 8th international conference on quality software; 2008. p. 241–6.
[33] Matias R, Barbeta PA, Trivedi KS, Filho PJF. Accelerated degradation tests applied to software aging experiments. IEEE Trans Reliab 2010;59(1):102–14.
[34] Avritzer A, Bondi A, Grottke M, Trivedi KS, Weyuker EJ. Performance assurance via software rejuvenation: monitoring, statistics and algorithms. In: International conference on dependable systems & networks, Philadelphia, PA, United States; June 2006. p. 435–44.