Reliability Engineering and System Safety 144 (2015) 12–22
Optimal backup frequency in system with random repair time
Gregory Levitin a,b,*, Liudong Xing c, Yuanshun Dai a
a Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China, China
b The Israel Electric Corporation, P.O. Box 10, Haifa 31000, Israel
c University of Massachusetts, Dartmouth, MA 02747, USA
Article info
Article history: Received 12 August 2014; Received in revised form 14 June 2015; Accepted 18 June 2015; Available online 26 June 2015
Keywords: Backup; Repair; Mission reliability; Mission cost; Mission time

Abstract
This paper considers single-component repairable systems performing backup procedures to avoid repeating the entire work from scratch and thus facilitate fast system recovery in the case of failures. The mission succeeds if a specified amount of work can be accomplished within the maximum allowed mission time or deadline. The repair time is randomly distributed within a specified interval. Both failure and repair times are represented by known distributions. We first suggest a numerical algorithm to evaluate mission reliability, conditional expected cost and completion time of a successful mission. The backup frequency optimization problem is then formulated and solved for finding the inter-backup interval that maximizes mission reliability or minimizes expected mission cost while satisfying a desired level of mission reliability. Impacts of parameters including the maximum allowed mission time, data backup and retrieval times, repair and failure time distributions, and repair efficiency on mission reliability, cost and time as well as on the optimal solution are investigated through examples. © 2015 Elsevier Ltd. All rights reserved.
* Corresponding author. E-mail addresses: [email protected] (G. Levitin), [email protected] (L. Xing).
http://dx.doi.org/10.1016/j.ress.2015.06.014
0951-8320/© 2015 Elsevier Ltd. All rights reserved.
G. Levitin et al. / Reliability Engineering and System Safety 144 (2015) 12–22

Nomenclature
$B$: backup time
$b$: data retrieval time
$1(\cdot)$: unity function, $1(\text{FALSE}) = 0$; $1(\text{TRUE}) = 1$
$g$: time needed to retrieve data after the $a$-th backup, $g = b \cdot 1(a > 0)$
$N$: maximal possible number of repairs during the mission
$\tau$: maximum allowed mission time
$R$: mission reliability
$E$: expected mission completion time
$C$: expected cost of a successful mission
$\theta_j$, $C_j$: conditional expected mission repair, operation time given the mission succeeds after $j$ repairs
$\langle T_j, X_j, A_j \rangle$: event that the $j$-th failure happens at time $T_j$ given the system spends time $X_j$ in the operation mode and completes $A_j$ backups before the failure
$Q_j(t,x,a)$: function representing a joint distribution of random values $T_j$, $X_j$ and $A_j$
$q_j(v,w,a)$: probability that the $j$-th failure happens in time interval $v$ and the system spends $w$ intervals in the operation mode and completes $a$ backups before the failure
$D$: random repair time
$d_{\min}$, $d_{\max}$: minimal, maximal possible realizations of $D$
$H$: maximum number of backup actions during the mission
$f(t)$, $F(t)$: pdf and cdf of system time-to-failure distribution
$\psi(t)$, $\Psi(t)$: pdf and cdf of repair time
$W$: pure mission time (time needed to complete the mission without backups and failures)
$\pi$: fraction of the entire mission task that should be backed up periodically (referred to as backup frequency parameter)
$\pi W$: time during which the system produces information that should be backed up
$\sigma$: time between the ends of two consecutive backup actions in the case of no failures, $\sigma = \pi W + B$
$m$: number of discrete intervals considered in the numerical algorithm
$\Delta$: duration of a discrete time interval, $\Delta = \tau/m$
$h$: number of time intervals needed to complete the mission if no failures occur
$\lfloor x \rfloor$: floor operation that returns the maximal integer not exceeding $x$
$Y_j$: event that the mission is completed after $j$ repairs
$r_j$: $\Pr(Y_j)$
$\varphi_j$, $\theta_j$: expected mission operation, repair time given that $j$ failures happen during the mission
$\eta$, $\beta$: scale, shape parameters of Weibull time-to-failure distribution
$c_r$, $c_o$: per time unit repair, operation cost
$z$: repair efficiency coefficient ($z = 0$ corresponds to as good as new, $z = 1$ corresponds to as bad as old)

1. Introduction

A system is repairable if, after its failure, the system can be recovered to some fully satisfactory performance level via maintenance activities such as replacement and part adjustment [1,2]. According to the degree to which the working condition of a system can be restored through maintenance, three types of repair models can be differentiated [3,4]: under the perfect repair model, the system is restored to an “as good as new” condition; under the minimal repair model, the system is restored to an “as bad as old” condition, that is, to the same state as it was in immediately before its failure; under the general or imperfect repair model, the system is restored to any condition between the former two cases. Numerous research efforts have been expended in modeling and optimizing single-component or multi-component repairable systems under the different repair models. Renewal processes (in particular, homogeneous and non-homogeneous Poisson processes) [5,6], Markov chains [7–10], geometric processes [11–14], and Bayesian methods [15–17] are among the commonly applied techniques for modeling the failure behavior of a repairable system. Diverse optimization problems have been formulated and addressed for repairable systems subject to different maintenance policies or behaviors. For example, replacement policy optimization problems have been solved for repairable systems with a repairman who can take multiple vacations [11,18], for systems under free-repair warranty [19], and for systems subject to waiting and repair times [20]. The optimal inspection scheme problems (periodic and non-periodic) have been addressed for repairable systems subject to hidden failures (failures that are not evident to operators and can only be rectified during inspections) or repairable systems with interactions between self-announcing hard failures and non-self-announcing soft failures [21–27]. A joint optimization problem has been solved in [28] for repairable systems with the objective of finding the optimal number and schedule of preventive maintenance actions as well as the corresponding maintenance degrees. The joint optimal maintenance and warranty policy problem has been solved in [29] for repairable systems considering all phases of the system life cycle, where the optimal burn-in period, optimal preventive maintenance intervals and optimal replacement times are determined.

In spite of the rich literature on the modeling and optimization of repairable systems, to the best of our knowledge, none of those existing works has considered backup behavior or the related optimal backup frequency problem. For systems, especially those used in computing-related applications, the backup mechanism is necessary to facilitate an effective system recovery or reconfiguration when failures occur. Specifically, backup procedures are performed periodically to save data associated with the completed fraction of the mission task so that the failed system, upon being repaired, can resume the mission task from the latest backed-up point instead of re-performing the entire mission task from the very beginning.

In recent work [30], effects of periodic backups were first considered for modeling and evaluating mission reliability, expected mission time and cost of 1-out-of-N: G, cold standby systems. However, the model of [30] assumes that the system components are not repairable during the mission.

In this paper we make novel contributions by modeling single-component repairable systems subject to periodic backup actions, random failure and repair times, as well as a real-time constraint on mission task completion. Three mission performance indices are evaluated, including mission reliability, conditional expected mission cost and completion time given a successful mission. The proposed evaluation algorithm is applicable to different repair models (perfect, imperfect, minimal) as well as arbitrary types of time-to-failure and time-to-repair distributions. The optimal backup frequency problem is formulated and solved for the considered repairable system, where the optimal number of backups is identified for maximizing mission reliability or minimizing expected mission cost. Effects of different parameters on mission performance indices as well as on the optimal backup frequency solution are also investigated through examples.

The remainder of the paper is organized as follows. Section 2 presents the system model and formulation of the optimization problem addressed in this work. Section 3 discusses the evaluation of mission reliability, conditional expected cost and completion time of a successful mission for real-time repairable systems subject to periodic backups and random repair times. Section 4 presents the discrete numerical evaluation algorithm based on the derivation in Section 3. Section 5 presents illustrative examples. Effects of various factors on mission indices and on the optimal backup frequency solution are investigated. Section 6 presents conclusions as well as directions of future work.
2. System model and problem description

2.1. System description

The system should accomplish a desired mission task within a time not exceeding $\tau$, i.e., the system is a real-time system. The time required to accomplish the entire mission without backups or failures is $W$. Each fraction $\pi$ of the entire mission task should be periodically backed up. Thus, the system conducts a data backup procedure after successfully performing the mission task during time $\pi W$. The total number of backup actions performed during a successful mission (i.e., backup frequency) is fixed and can be determined as:

$$H = \begin{cases} \lfloor 1/\pi \rfloor & \text{if } \pi \lfloor 1/\pi \rfloor < 1 \\ \lfloor 1/\pi \rfloor - 1 & \text{if } \pi \lfloor 1/\pi \rfloor = 1. \end{cases} \tag{1}$$

The second case in (1) arises because when $\pi \lfloor 1/\pi \rfloor = 1$, the last backup is scheduled to be performed at the end of the mission, which is not necessary. Notice that $\pi$ is referred to as a backup frequency parameter to be optimized in the optimization problems considered in this paper. Each backup action takes constant time $B$. Thus, the minimal time needed to complete the mission (given that no failures happen) is $HB + W$.

The system time-to-failure distribution is known and determined by the cumulative distribution function (cdf) $F(t)$. When the system fails, the repair procedure starts immediately. The repair time depends on external factors such as availability and efficiency of the repair manpower and equipment, and is randomly distributed within the interval $[d_{\min}, d_{\max}]$. Given the minimum possible repair time, the maximum possible number of failures that can occur in a successful mission can be obtained as:

$$N = \lfloor (\tau - HB - W)/d_{\min} \rfloor. \tag{2}$$
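As a quick numerical check of (1) and (2), the following sketch computes $H$ and $N$ for given mission parameters. The function names and the small tolerance `TOL`, used to detect the case $\pi \lfloor 1/\pi \rfloor = 1$ in floating-point arithmetic, are assumptions of this illustration and not part of the model.

```python
import math

TOL = 1e-9  # assumed tolerance for floating-point comparisons


def num_backups(pi):
    """Number of backups H from Eq. (1)."""
    k = math.floor(1.0 / pi + TOL)
    # When pi * floor(1/pi) = 1 the last backup would coincide with mission end.
    return k - 1 if abs(pi * k - 1.0) < TOL else k


def max_repairs(tau, W, B, H, d_min):
    """Maximum possible number of repairs N from Eq. (2)."""
    return math.floor((tau - H * B - W) / d_min + TOL)


H = num_backups(0.4)                        # floor(1/0.4) = 2 and 0.4 * 2 < 1
N = max_repairs(100.0, 50.0, 2.0, H, 15.0)  # floor((100 - 4 - 50) / 15)
print(H, N)  # prints: 2 3
```

The values $\tau = 100$, $W = 50$, $B = 2$, $\pi = 0.4$ and $d_{\min} = 15$ are those used in the illustrative example of Section 5.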
The cdf $\Psi(t)$ of the repair time distribution is known. We assume that if the system has operated for a time period of $t_0$ before the repair, its time-to-failure distribution after the repair has cdf $F(z t_0 + t)$, where $z$ is the repair efficiency coefficient. The value of $z$ can vary from 0 (the system after repair is as good as new, which corresponds to the replacement by a brand new system or the perfect repair model) to 1 (the system is as bad as old, which corresponds to the minimal repair model). When the repair is completed, the system starts its operation by retrieving backup data, which takes time $b$, and then performs the mission task from the work step that directly follows the last completed backup procedure.

The following assumptions are made:
1) Time-to-failure and time-to-repair distributions of the system are independent.
2) The fault detection and backup mechanisms of the system are perfectly reliable.
3) The mission task is performed evenly in time, i.e., equal portions of work are performed during equal time intervals.
4) The data backup and retrieval times are fixed and do not depend on the backup frequency.
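The role of the repair efficiency coefficient $z$ can be illustrated with a minimal sketch. It assumes a Weibull time-to-failure cdf with scale $\eta$ and shape $\beta$ (the symbols listed in the nomenclature); the function names are hypothetical and the post-repair cdf is evaluated exactly as stated above, i.e., as $F(z t_0 + t)$.

```python
import math


def weibull_cdf(t, eta, beta):
    """Weibull cdf F(t) with scale eta and shape beta (assumed failure law)."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0


def post_repair_cdf(t, t0, z, eta, beta):
    """cdf F(z*t0 + t) used after a repair that follows t0 time units of operation."""
    return weibull_cdf(z * t0 + t, eta, beta)


# z = 0 reproduces a new system, z = 1 a system that keeps its full age t0.
for z in (0.0, 0.7, 1.0):
    print(z, post_repair_cdf(10.0, 30.0, z, eta=200.0, beta=1.2))
```

For a fixed $t_0 > 0$ the printed values grow with $z$, reflecting that a less efficient repair leaves the system effectively older.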
2.2. Problem formulation

Given the per time unit repair cost $c_r$ and operation cost $c_o$ of the system, as well as its time-to-failure distribution $F(t)$ and repair time distribution $\Psi(t)$, the problem is to find mission reliability $R(\pi)$, conditional expected mission completion time $E(\pi)$ and conditional expected mission cost $C(\pi)$ given the mission succeeds as functions of the backup frequency parameter $\pi$. The backup optimization problem presumes finding the backup frequency parameter $\pi$ that maximizes mission reliability, or minimizes the expected mission cost subject to providing a desired level of mission reliability ($R^*$) and a desired level of expected completion time ($E^*$), which are respectively formulated as:

$$\pi = \arg\max R(\pi); \qquad \pi = \arg\min C(\pi) \ \ \text{s.t.} \ \ R(\pi) > R^*, \ E(\pi) < E^*. \tag{3}$$

3. Determining mission reliability, expected cost and completion time

3.1. Outline of the method

In this work we use a state space event transition method for evaluating the mission reliability, expected cost and completion time. The method presumes performing the following steps:
1) Defining the combination of parameters (state space) that determine the system behavior as a sequence of events such as failures and repairs and corresponding backups (Section 3.2).
2) Introducing a function that describes a probabilistic distribution of the state-space parameters. Determining this function for the first failure event (Section 3.3).
3) Determining probabilities of event transitions for any realization of the state space parameters, which provide the way of a recursive derivation of the distribution function for any event (Section 3.4).
4) Determining the mission reliability, cost and completion time after any event and calculating the corresponding parameters for the entire mission as a sum of parameters of mutually exclusive events (Sections 3.5 and 3.6).

3.2. Failure event

Let $\langle T_j, X_j, A_j \rangle$ be an event that the $j$-th failure happens at time $T_j$ from the mission beginning when the system spends time $X_j \le T_j$ in the operation mode and time $T_j - X_j$ in the repair mode and successfully completes $A_j$ backup procedures. For example, Fig. 1A illustrates a mission involving three failure events indicated by the three lightning symbols with $H = 2$. The first failure happens at time $T_1$, being the length of the arrowed line labeled by $j = 1$, with time $X_1 = T_1$ in the operation mode, time $T_1 - X_1 = 0$ in the repair mode, and $A_1 = 0$. The second failure happens at time $T_2$, which is the sum of lengths of two arrowed lines labeled by $j = 1$ and $j = 2$, with time $d_2 = T_2 - X_2$ spent in the repair mode and $A_2 = 1$. The third failure happens at time $T_3$, which is the sum of lengths of three arrowed lines labeled by $j = 1$, $j = 2$ and $j = 3$, with time $d_2 + d_3 = T_3 - X_3$ spent in the repair mode and $A_3 = 1$. The values of $X_1$, $X_2$ and $X_3$ correspond to the sum of lengths of the bold parts of one, two and three arrowed lines, respectively. The same event sequence is represented in Fig. 1B as a mission trajectory in the “elapsed mission time – completed fraction of the mission task” space (hatched bars indicate the system operation periods and the total length of these bars before the $j$-th failure event corresponds to $X_j$).

Fig. 1. Example of successful mission completion after three failures ($H = 2$, $A_1 = 0$, $A_2 = A_3 = 1$). A: Sequence of mission events; B: Mission trajectory in the “elapsed time – completed fraction of mission task” space.

3.3. Joint distribution of the failure event parameters

Let $Q_j(t,x,a)$ be a function representing the joint distribution of the random values $T_j$, $X_j$ and $A_j$. The event $\langle T_j, X_j, A_j \rangle$ corresponds to the initiation of a repair that takes random time $D$.
For the first failure,

$$Q_1(t,x,a) = \begin{cases} f(t) & \text{if } x = t,\ a = \lfloor t/\sigma \rfloor \text{ and } 0 \le t \le W + HB \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $\sigma = \pi W + B$, because no repair happens before the first failure, the number (index) of the last backup completed after operation for time $t$ is $\lfloor t/\sigma \rfloor$, and the system is switched off when the mission is completed after time $W + HB$ of continuous operation.

For any $j > 1$ and fixed $A_j = a$ ($0 \le a \le H$), $X_j$ can vary from

$$X_j^{\min}(a) = a\sigma, \tag{5}$$

when the system operates for a time period of $A_j \sigma$ before the first failure and then fails immediately after the repairs, to

$$X_j^{\max}(a) = (a+1)\sigma + (j-1)(\sigma + b) = (a+j)\sigma + (j-1)b \tag{6}$$

when the system fails $j$ times immediately before completion of the $(a+1)$-th backup for $a < H$, or to

$$X_j^{\max}(H) = W + HB + (j-1)(W + HB - H\sigma + b) \tag{7}$$

when the system fails $j$ times immediately before the mission completion (i.e., $a = H$). Refer to Fig. 2 for an illustration of these two maximum operation time scenarios. As the mission time cannot exceed the deadline $\tau$, the total operation time before the $j$-th failure cannot exceed $\tau - (j-1)d_{\min}$. Thus

$$X_j^{\max}(a) = \begin{cases} \min\bigl((a+j)\sigma + (j-1)b,\ \tau - (j-1)d_{\min}\bigr) & \text{if } a < H \\ \min\bigl(W + HB + (j-1)(W + HB - H\sigma + b),\ \tau - (j-1)d_{\min}\bigr) & \text{if } a = H. \end{cases} \tag{8}$$

For any realization $x$ of $X_j$, the time elapsed since the mission beginning $T_j$ can vary from

$$T_j^{\min}(x) = x + (j-1)d_{\min}, \tag{9}$$

when the system spends minimal time $d_{\min}$ in each of $j-1$ repairs, to

$$T_j^{\max}(x) = \min\bigl(\tau,\ x + (j-1)d_{\max}\bigr), \tag{10}$$

when the system spends maximal time $d_{\max}$ in each of $j-1$ repairs.
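The bounds (5), (8), (9) and (10) are straightforward to evaluate. The sketch below is an illustration only; the function names `x_bounds` and `t_bounds` are hypothetical, and the numerical values in the usage lines are taken from the example of Section 5 ($W = 50$, $B = 2$, $b = 1.5$, $\pi = 0.4$, hence $\sigma = 22$ and $H = 2$).

```python
def x_bounds(j, a, H, W, B, b, sigma, tau, d_min):
    """Range of cumulative operation time X_j for A_j = a, Eqs. (5) and (8)."""
    x_min = a * sigma
    if a < H:
        x_max = (a + j) * sigma + (j - 1) * b
    else:
        x_max = W + H * B + (j - 1) * (W + H * B - H * sigma + b)
    # The deadline leaves at most tau - (j-1)*d_min for operation.
    return x_min, min(x_max, tau - (j - 1) * d_min)


def t_bounds(j, x, tau, d_min, d_max):
    """Range of elapsed time T_j for X_j = x, Eqs. (9) and (10)."""
    return x + (j - 1) * d_min, min(tau, x + (j - 1) * d_max)


print(x_bounds(2, 1, 2, 50, 2, 1.5, 22, 100, 15))  # (22, 67.5)
print(x_bounds(2, 2, 2, 50, 2, 1.5, 22, 100, 15))  # (44, 65.5)
print(t_bounds(2, 30, 100, 15, 30))                # (45, 60)
```

For larger $j$ the deadline term dominates: with $j = 4$ and $a = 0$ the upper bound is capped at $\tau - 3 d_{\min} = 55$.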
3.4. Recursive determination of function $Q_j(t,x,a)$

Let $g = b \cdot 1(a > 0)$ be the time needed to retrieve the backup data after event $\langle T_{j-1}, X_{j-1}, A_{j-1} \rangle$, when $A_{j-1} = a$. Note that $1(a > 0)$ is a unity function which gives 1 when the condition $a > 0$ is TRUE and 0 when the condition is FALSE (i.e., $a = 0$). The system can transit from event $\langle T_{j-1}, X_{j-1}, A_{j-1} \rangle$ to event $\langle T_j, X_j, A_j \rangle$ with $T_j \ge T_{j-1} + d_{\min}$, $X_j \ge X_{j-1}$ and $A_j = A_{j-1} + \lfloor \max(0, X_j - X_{j-1} - g)/\sigma \rfloor$, where $X_j = X_{j-1}$ and $A_j = A_{j-1}$ when the system fails immediately after the $(j-1)$-th repair. The event transition $\langle T_{j-1}, X_{j-1}, A_{j-1} \rangle \to \langle T_j, X_j, A_j \rangle$ occurs when the system operates for at least a time period of $g + (A_j - A_{j-1})\sigma$, but less than time $g + (A_j - A_{j-1} + 1)\sigma$ after the $(j-1)$-th repair (see Fig. 3). Thus, for the given $A_j$, $A_{j-1}$ and $X_j$, we have

$$X_j - g - (A_j - A_{j-1} + 1)\sigma \le X_{j-1} < X_j - g - (A_j - A_{j-1})\sigma. \tag{11}$$

When the transition $\langle T_{j-1}, X_{j-1}, A_{j-1} \rangle \to \langle T_j, X_j, A_j \rangle$ occurs, the $(j-1)$-th repair takes time $D = (T_j - T_{j-1}) - (X_j - X_{j-1})$. As $d_{\min} \le D \le d_{\max}$, for any given $X_{j-1}$, $X_j$ and $T_j$ the condition (12) should hold to make the transition possible:

$$T_j + X_{j-1} - X_j - d_{\max} \le T_{j-1} \le T_j + X_{j-1} - X_j - d_{\min}. \tag{12}$$

Having the function $Q_{j-1}(t,x,a)$, $\psi(t)$ and $f(t)$, one can recursively obtain the function $Q_j(t,x,a)$ for $j = 2, \ldots, N$ as:

$$Q_j(t,x,a) = \sum_{\tilde{a}=0}^{a} \int_{L_1}^{U_1} \int_{L_2}^{U_2} Q_{j-1}(\tilde{t}, \tilde{x}, \tilde{a})\, \psi(t - \tilde{t} - x + \tilde{x})\, f(z\tilde{x} + x - \tilde{x})\, d\tilde{t}\, d\tilde{x}, \tag{13}$$

where

$$U_1 = \min\bigl(X_{j-1}^{\max}(\tilde{a}),\ x - g - (a - \tilde{a})\sigma\bigr), \quad L_1 = \max\bigl(X_{j-1}^{\min}(\tilde{a}),\ x - g - (a - \tilde{a} + 1)\sigma\bigr),$$
$$U_2 = \min\bigl(T_{j-1}^{\max}(\tilde{x}),\ t + \tilde{x} - x - d_{\min}\bigr), \quad L_2 = \max\bigl(T_{j-1}^{\min}(\tilde{x}),\ t + \tilde{x} - x - d_{\max}\bigr),$$

when $X_j^{\min}(a) \le x \le X_j^{\max}(a)$ and $T_j^{\min}(x) \le t \le T_j^{\max}(x)$; and $Q_j(t,x,a) = 0$ otherwise.

Indeed, any event transition from $\langle T_{j-1} = \tilde{t}, X_{j-1} = \tilde{x}, A_{j-1} = \tilde{a} \rangle$ to $\langle T_j = t, X_j = x, A_j = a \rangle$ occurs when the $(j-1)$-th repair takes time $t - \tilde{t} - x + \tilde{x}$, and operation before the $(j-1)$-th repair and between the $(j-1)$-th and $j$-th repairs takes times $\tilde{x}$ and $x - \tilde{x}$, respectively (corresponding to the cumulative operation time $z\tilde{x} + x - \tilde{x}$).

Fig. 2. Maximum operation time scenarios for failure event with $A_j = a$.

Fig. 3. Minimum and maximum operation time $X_j$ corresponding to event transition $\langle T_{j-1}, X_{j-1}, A_{j-1} \rangle \to \langle T_j, X_j, A_j \rangle$.

3.5. Mission reliability

Let $Y_j$ be the event that the system completes the mission task at a time not exceeding $\tau$ given exactly $j$ failures happen during the mission. The entire mission reliability can be obtained as the sum of the probabilities of mutually exclusive events $Y_0, Y_1, \ldots, Y_N$.
Without any failure (and thus without any repair), the system needs time $W + HB$ to complete the entire task. Thus

$$r_0 = \Pr\{Y_0\} = 1 - F(W + HB). \tag{14}$$

Given the event $\langle T_j, X_j, A_j \rangle$, the system can complete the remaining amount of work by operating for a time period of $W + HB - A_j\sigma + g$. The remaining mission time after performing the repair is $\tau - T_j - D$. Thus, the mission can be completed if

$$W + HB - A_j\sigma + g \le \tau - T_j - D \tag{15}$$

and the system does not fail during the time $W + HB - A_j\sigma + g$. It follows from inequality (15) that in the case of mission success, the repair time should not exceed $\tau - T_j - W - HB + A_j\sigma - g$. As $D \ge d_{\min}$, the greatest possible value of $T_j$ when the task can be completed after the event $\langle T_j, X_j, A_j \rangle$ is $\tau - W - HB + A_j\sigma - g - d_{\min}$ (see Fig. 4). Thus, having the functions $Q_j(t,x,a)$ and cdf $\Psi(t)$ one can obtain $r_j$ as:

$$r_j = \sum_{a=0}^{H} \int_{X_j^{\min}(a)}^{X_j^{\max}(a)} \int_{L}^{U} Q_j(t,x,a)\,\bigl(1 - F(zx + W + HB + g - a\sigma)\bigr)\,\Psi(\tau - t - W - HB - g + a\sigma)\, dt\, dx$$
$$\phantom{r_j} = \sum_{a=0}^{H} \int_{X_j^{\min}(a)}^{X_j^{\max}(a)} \bigl(1 - F(zx + W + HB + g - a\sigma)\bigr) \int_{L}^{U} Q_j(t,x,a)\,\Psi(\tau - t - W - HB - g + a\sigma)\, dt\, dx, \tag{16}$$

where $U = \tau - W - HB - g + a\sigma - d_{\min}$ and $L = x + (j-1)d_{\min}$. The total mission reliability is as follows:

$$R = \sum_{j=0}^{N} r_j. \tag{17}$$

3.6. Conditional expected mission time and expected cost

If the system completes the mission without failures, the repair time is $\theta_0 = 0$ and the operation time is equal to the total mission time: $\mu_0 = \varphi_0 = W + HB$. After the event $\langle T_j, X_j, A_j \rangle$ and the $j$-th repair, when the system completes the remaining part of the mission in time $W + HB - A_j\sigma + g$, the total mission time is $T_j + D + W + HB - A_j\sigma + g$. In this case the time $X_j + W + HB - A_j\sigma + g$ is spent in the operation mode and the remaining time $T_j - X_j + D$ is spent in the repair mode.

The expected mission operation time given that $j > 0$ failures happen during the mission is

$$\varphi_j = \sum_{a=0}^{H} \int_{X_j^{\min}(a)}^{X_j^{\max}(a)} (x + W + HB + g - a\sigma)\bigl(1 - F(zx + W + HB + g - a\sigma)\bigr) \int_{L}^{U} Q_j(t,x,a)\,\Psi(\tau - t - W - HB - g + a\sigma)\, dt\, dx. \tag{18}$$

The expected mission repair time given that $j > 0$ failures happen during the mission is

$$\theta_j = \sum_{a=0}^{H} \int_{X_j^{\min}(a)}^{X_j^{\max}(a)} \bigl(1 - F(zx + W + HB + g - a\sigma)\bigr) \int_{L}^{U} Q_j(t,x,a) \int_{d_{\min}}^{\min(d_{\max},\ \tau - t - W - HB - g + a\sigma)} (t - x + y)\,\psi(y)\, dy\, dt\, dx. \tag{19}$$

Thus, the conditional expected mission cost given the mission is completed can be obtained as:

$$C = \frac{1}{R} \sum_{j=0}^{N} \bigl(c_r\theta_j + c_o\varphi_j\bigr)\, r_j. \tag{20}$$

The expected mission completion time given that $j > 0$ failures happen during the mission is:

$$\mu_j = \sum_{a=0}^{H} \int_{X_j^{\min}(a)}^{X_j^{\max}(a)} \bigl(1 - F(zx + W + HB + g - a\sigma)\bigr) \int_{L}^{U} Q_j(t,x,a) \int_{d_{\min}}^{\min(d_{\max},\ \tau - t - W - HB - g + a\sigma)} (t + y + W + HB + g - a\sigma)\,\psi(y)\, dy\, dt\, dx. \tag{21}$$

4. Discrete numerical algorithm
To obtain the mission indices ($R$, $C$, and $E$) numerically, we divide the maximum allowable mission time $\tau$ into $m$ equal intervals with duration $\Delta = \tau/m$ such that for $i = 0, \ldots, m-1$ interval $i$ begins at time $i\Delta$ and ends at time $(i+1)\Delta$. The total number of time intervals needed to complete the mission if no failures occur is $h = (W + HB)/\Delta$. Having the cdf $F(t)$ for the cumulative time-to-failure of the system, one can obtain the probability that it fails before operating for $i$ intervals after repair given that it has operated for $k$ intervals before the repair as $F(\Delta(kz + i))$, and the probability that it fails in the $i$-th interval after the repair given it has operated for $k$ intervals before the repair as $f^*(k,i) = F(\Delta(kz + i + 1)) - F(\Delta(kz + i))$. The probability that the repair takes less than $d$ intervals is $\Psi(\Delta d)$ and the probability that the repair takes $d$ intervals can be obtained as $\psi^*(d) = \Psi(\Delta(d+1)) - \Psi(\Delta d)$. The minimal and maximal numbers of intervals that the repair can take are $d_{\min}/\Delta$ and $d_{\max}/\Delta$, respectively.

The function $Q_j(t,x,a)$ can be approximated by a discrete three-dimensional array $q_j(v,w,a)$ for $v = 0, \ldots, m-1$, $w = 0, \ldots, m-1$ and $a = 0, \ldots, H$ representing the probability that the $j$-th failure occurs in time interval $v$ and the system operated during $w$ intervals and completed $a$ backups before this failure. According to (4) the nonzero elements of the array $q_1$ can be obtained using the following procedure:

For $w = 0, \ldots, h-1$: $q_1(w, w, \lfloor w\Delta/\sigma \rfloor) = f^*(0, w)$.

Having $q_{j-1}(v,w,a)$ for $j = 2, \ldots, N$, one can obtain $q_j(v,w,a)$ using the following procedure based on (13).
Fig. 4. Maximum value of Tj when the mission can be completed after the event 〈Tj,Xj,Aj〉.
1. Set q_j(v,w,a) = 0 for v = 0,…,m−1; w = 0,…,m−1; a = 0,…,H
2. For a = 0,…,H:
  2.1. u = (W + HB + b·1(a > 0) − aσ)/Δ;
  2.2. For w = X^min_{j−1}(a)/Δ,…,X^max_{j−1}(a)/Δ:
    2.2.1. For v = T^min_{j−1}(Δw)/Δ,…,T^max_{j−1}(Δw)/Δ:
      2.2.1.1. For d = d_min/Δ,…,min(d_max/Δ, m − v − u):
        2.2.1.1.1. For i = 0,…,min(u − 1, m − v − d − 1):
          2.2.1.1.1.1. λ = ⌊(Δi − b·1(a > 0))/σ⌋; If (λ < 0) λ = 0;
          2.2.1.1.1.2. q_j(v+d+i, w+i, a+λ) = q_j(v+d+i, w+i, a+λ) + q_{j−1}(v,w,a) ψ*(d) f*(w,i);

In the above pseudo-code, u is the number of operation time intervals needed to complete the mission after the (j−1)-th failure, d and i are respectively the numbers of time intervals during which the system undergoes the (j−1)-th repair and operates after the repair, v is the index of the time interval when the (j−1)-th failure happens, w is the number of intervals during which the system operated before the (j−1)-th failure, a is the last backup performed before the (j−1)-th failure, and λ is the number of backups performed between the (j−1)-th and j-th failures. Notice that the number of operation time intervals before the j-th failure, i, should be less than the number of intervals needed to complete the mission, u, and less than the number of intervals remaining after the (j−1)-th repair before the mission time expiration, m − v − d. When the number of intervals spent in repair d exceeds m − v − u, the system has no chance to complete the mission in time. Therefore there is no need to compute the corresponding probabilities q_j(v+d+i, w+i, a+λ).

Having q_j(v,w,a) for j = 1,…,N, v = 0,…,m−1 and w = 0,…,h−1, one can obtain r_j, φ_j and θ_j using the following procedure based on (16), (18) and (19).

1. Set r_j = φ_j = μ_j = θ_j = 0;
2. For a = 0,…,H:
  2.1. u = (W + HB + b·1(a > 0) − aσ)/Δ;
  2.2. For w = X^min_j(a)/Δ,…,X^max_j(a)/Δ:
    2.2.1. e = 1 − F(Δ(zw + u));
    2.2.2. For v = T^min_j(Δw)/Δ,…,T^max_j(Δw)/Δ:
      2.2.2.1. y = q_j(v,w,a) e; s = y Ψ(Δ(m − u − v));
      2.2.2.2. r_j = r_j + s;
      2.2.2.3. φ_j = φ_j + Δ(w + u) s;
      2.2.2.4. For d = d_min/Δ,…,min(d_max/Δ, m − v − u):
        2.2.2.4.1. θ_j = θ_j + Δ(v − w + d) ψ*(d) y;
        2.2.2.4.2. μ_j = μ_j + Δ(v + u + d) ψ*(d) y;

In the above pseudo-code, e is the conditional probability that after the successful j-th repair the system does not fail until the mission completion.

Summarizing, the algorithm for evaluating the mission reliability and conditional expected cost and completion time of the single-component real-time system subject to periodic backups and random repair times takes the following form.

1. Determine H and N using (1) and (2);
2. Set R = 1 − F(W + HB); E = (W + HB)R; C = c_o E;
3. Set q_1(v,w,a) = 0 for v = 0,…,m−1; w = 0,…,m−1; a = 0,…,H
4. For w = 0,…,h−1: q_1(w, w, ⌊wΔ/σ⌋) = f*(0,w).
5. For j = 2,…,N+1:
  5.1. Set r_{j−1} = φ_{j−1} = θ_{j−1} = μ_{j−1} = 0; Set q_j(v,w,a) = 0 for v = 0,…,m−1; w = 0,…,m−1; a = 0,…,H;
  5.2. For a = 0,…,H:
    5.2.1. u = (W + HB + b·1(a > 0) − aσ)/Δ;
    5.2.2. For w = X^min_{j−1}(a)/Δ,…,X^max_{j−1}(a)/Δ:
      5.2.2.1. e = 1 − F(Δ(zw + u));
      5.2.2.2. For v = T^min_{j−1}(Δw)/Δ,…,T^max_{j−1}(Δw)/Δ:
        5.2.2.2.1. y = q_{j−1}(v,w,a) e; s = y Ψ(Δ(m − u − v));
        5.2.2.2.2. r_{j−1} = r_{j−1} + s;
        5.2.2.2.3. φ_{j−1} = φ_{j−1} + (w + u) s;
        5.2.2.2.4. For d = d_min/Δ,…,min(d_max/Δ, m − u − v):
          5.2.2.2.4.1. θ_{j−1} = θ_{j−1} + (v − w + d) y ψ*(d);
          5.2.2.2.4.2. μ_{j−1} = μ_{j−1} + (v + u + d) y ψ*(d);
          5.2.2.2.4.3. If (j > N) skip step 5.2.2.2.4.4 (no further failure events need to be generated);
          5.2.2.2.4.4. For i = 0,…,min(u − 1, m − v − d − 1):
            5.2.2.2.4.4.1. λ = ⌊(Δi − b·1(a > 0))/σ⌋; If (λ < 0) λ = 0;
            5.2.2.2.4.4.2. q_j(v+d+i, w+i, a+λ) = q_j(v+d+i, w+i, a+λ) + q_{j−1}(v,w,a) ψ*(d) f*(w,i);
  5.3. R = R + r_{j−1};
  5.4. E = E + Δμ_{j−1};
  5.5. C = C + Δ(c_r θ_{j−1} + c_o φ_{j−1});
6. C = C/R; E = E/R.
It can be seen from the above pseudo-code that the algorithm complexity is less than $O(NH/\Delta^4)$. Notice that for obtaining array $q_j(v,w,a)$, only array $q_{j-1}(v,w,a)$ is needed. Thus the algorithm only needs the memory required for keeping two arrays of size $m \times m \times H$.

Fig. 5. Reliability R, expected mission completion time E, expected mission cost C and algorithm running time t obtained for the example system as functions of 1/Δ.
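The summarized steps can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `mission_indices` is hypothetical, the arrays $q_j$ are stored as sparse dictionaries so that only nonzero states are visited (which replaces the explicit loops over the $X^{\min}$, $X^{\max}$, $T^{\min}$, $T^{\max}$ ranges), and quantities that are not integer multiples of $\Delta$ are rounded (up for $u$, $h$ and $d_{\min}/\Delta$, down for $d_{\max}/\Delta$), a choice the text does not specify.

```python
import math
from collections import defaultdict


def mission_indices(W, B, b, tau, pi, F, Psi, d_min, d_max, z, c_r, c_o, m):
    """Return (R, E, C) following steps 1-6; F and Psi are cdf callables."""
    eps = 1e-9                                   # assumed rounding tolerance
    delta = tau / m
    k = math.floor(1.0 / pi + eps)
    H = k - 1 if abs(pi * k - 1.0) < eps else k  # Eq. (1)
    sigma = pi * W + B
    N = math.floor((tau - H * B - W) / d_min + eps)  # Eq. (2)
    h = math.ceil((W + H * B) / delta - eps)
    d_lo = math.ceil(d_min / delta - eps)
    d_hi = math.floor(d_max / delta + eps)

    def f_star(k_ops, i):    # failure in interval i after k_ops operated intervals
        return F(delta * (k_ops * z + i + 1)) - F(delta * (k_ops * z + i))

    def psi_star(d):         # repair takes d intervals
        return Psi(delta * (d + 1)) - Psi(delta * d)

    # Step 2: contribution of the failure-free mission.
    R = 1.0 - F(W + H * B)
    E = (W + H * B) * R
    C = c_o * E
    # Steps 3-4: first-failure array q_1, keyed by (v, w, a).
    q = defaultdict(float)
    for w in range(h):
        q[(w, w, math.floor(w * delta / sigma + eps))] += f_star(0, w)

    # Step 5: process failure events j-1 = 1..N and build q_j from q_{j-1}.
    for j in range(2, N + 2):
        r = phi = theta = mu = 0.0
        q_next = defaultdict(float)
        for (v, w, a), p in q.items():
            g = b if a > 0 else 0.0
            u = math.ceil((W + H * B + g - a * sigma) / delta - eps)
            y = p * (1.0 - F(delta * (z * w + u)))
            s = y * Psi(delta * (m - u - v))
            r += s
            phi += (w + u) * s
            for d in range(d_lo, min(d_hi, m - v - u) + 1):
                pd = psi_star(d)
                theta += (v - w + d) * y * pd
                mu += (v + u + d) * y * pd
                if j > N:
                    continue  # no further failure events are generated
                for i in range(min(u - 1, m - v - d - 1) + 1):
                    lam = max(0, math.floor((delta * i - g) / sigma + eps))
                    q_next[(v + d + i, w + i, a + lam)] += p * pd * f_star(w, i)
        R += r
        E += delta * mu
        C += delta * (c_r * theta + c_o * phi)
        q = q_next
    # Step 6: condition on mission success.
    return R, E / R, C / R
```

The function can be called with any pair of cdf callables, for example a Weibull `F` and a truncated normal `Psi`. Keeping only the current and next dictionaries mirrors the observation above that two arrays suffice.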
5. Illustrative examples

In this section, examples are presented to illustrate the proposed evaluation algorithm. The algorithm is also applied to solve the optimal backup frequency problems formulated in (3).

Consider a mission with $W = 50$, $B = 2$, $b = 1.5$ and maximum allowed time $\tau = 100$. The operation and repair costs are $c_o = 2$ and $c_r = 8$. The mission is performed by a system characterized by a Weibull time-to-failure distribution [31] with scale parameter $\eta = 200$ and shape parameter $\beta = 1.2$. The random repair time obeys a truncated normal distribution [32] with $d_{\min} = 15$, $d_{\max} = 30$, $\mu = 26$ (mean) and std = 4 (standard deviation). The repair efficiency coefficient is $z = 0.7$. Given that the backups are performed after completing the fraction $\pi = 0.4$ of the mission task, the mission reliability, expected completion time and cost are $R = 0.95$, $E = 59.1$ and $C = 139.6$, respectively.

To investigate the impact of the duration of a discrete time interval $\Delta$ on the accuracy of the obtained results, values of $R$, $E$ and $C$ are obtained for different $\Delta$ ranging from 0.125 to 10. Fig. 5 presents values of $R$, $E$ and $C$ as functions of $1/\Delta$ as well as the running time of the suggested algorithm on a 2 GHz Pentium PC. It can be seen that as the value of $1/\Delta$ increases, estimates of $R$, $E$ and $C$ converge. Table 1 presents the relative differences (in percentage) between estimates of $R$, $E$ and $C$ obtained for $1/\Delta = 8$ and for lower values of $1/\Delta$.
Table 1. Difference (in percentage) between estimates of R, E and C obtained for 1/Δ = 8 and for lower values of 1/Δ.

1/Δ    R       E       C
1      0.38    0.54    0.55
2      0.03    0.24    0.06
3      0.05    0.15    0.06
4      0.05    0.07    0.03
5      0.02    0.04    0.03
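The distributions of this example and their discretized counterparts $f^*(k,i)$ and $\psi^*(d)$ from Section 4 can be sketched as follows. The function names are hypothetical, and treating $\mu = 26$ and std = 4 as the parameters of the underlying (untruncated) normal distribution is an assumption of this sketch.

```python
import math


def weibull_cdf(t, eta=200.0, beta=1.2):
    """Time-to-failure cdf F(t) of the example system."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0


def truncnorm_cdf(t, mean=26.0, std=4.0, lo=15.0, hi=30.0):
    """Repair-time cdf Psi(t): normal law truncated to [d_min, d_max]."""
    if t <= lo:
        return 0.0
    if t >= hi:
        return 1.0
    phi = lambda x: 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))
    return (phi(t) - phi(lo)) / (phi(hi) - phi(lo))


def f_star(k, i, delta, z=0.7):
    """Probability of failure in interval i after k operated intervals."""
    return weibull_cdf(delta * (k * z + i + 1)) - weibull_cdf(delta * (k * z + i))


def psi_star(d, delta):
    """Probability that the repair takes d intervals."""
    return truncnorm_cdf(delta * (d + 1)) - truncnorm_cdf(delta * d)


delta = 1.0
# The discretized repair-time probabilities over d_min/delta..d_max/delta sum to one.
repair_mass = sum(psi_star(d, delta) for d in range(15, 31))
# Probability of completing the mission without any failure, Eq. (14), with H = 2.
r0 = 1.0 - weibull_cdf(50.0 + 2 * 2.0)
print(repair_mass, r0)
```

The failure-free term $r_0 \approx 0.81$ shows that the remaining part of the reported reliability $R = 0.95$ is contributed by missions completed after one or more repairs.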
Fig. 6 presents the number of backups $H$, mission reliability $R$, expected cost $C$ and expected completion time $E$ as functions of the backup frequency parameter $\pi$. The rest of the mission parameters are the same as those presented above. The impact of $\pi$ on the system reliability and expected mission cost and time is two-fold. On the one hand, increasing the value of $\pi$ results in less frequent backup actions, leading to an increase in the work to be re-performed when a system failure occurs. On the other hand, the increase of $\pi$ reduces the number of backups during the mission, which reduces the total time required to accomplish the entire mission. Hence the system reliability and expected mission time and cost are non-monotonic functions of $\pi$. The abrupt jumps in functions $R(\pi)$, $C(\pi)$, $E(\pi)$ occur when the integer value of $H$ changes. Consider $\pi = 1/x$ with $x$ being an integer number, which corresponds to the case where $H = x - 1$ backup actions are performed throughout a successful mission. If the value of $\pi$ is decreased by a negligibly small amount, the value of $H$ immediately changes from $x - 1$ to $x$, causing a sharp increase in the minimal possible mission time. Because the variation in the value of $\pi$ is negligible, there is no considerable change in the work portion that should be re-conducted when a failure happens. Hence the system reliability decreases and the expected mission completion time and cost increase abruptly due to the increase in the value of $H$. Thus, in the case of even backups, the maximal system reliability and the minimal expected mission time and cost are always obtained when $\pi$ takes the value $1/(H + 1)$. The backup policy optimization problem formulated in (3) is therefore reduced to finding the number of backups $H$ that maximizes system reliability (or minimizes expected mission cost) and then determining $\pi = 1/(H + 1)$. Fig. 7 presents the number of backups that maximizes system reliability as a function of the allowed mission time $\tau$.
The corresponding maximal possible number of repairs during the mission N, and mission performance indices R, E and C are also presented. It can be seen that the optimal number of backups changes with τ non-monotonically. When the allowed mission time is low, its increase allows more backups to be performed. When the allowed mission time reaches a certain large value, its further increase allows system to complete the mission after failures even with a lower backup frequency. Thus, the number of backups decreases with τ. Values of R, E and C increase monotonically with τ when H does not change; however their
Fig. 6. The number of backups H, mission reliability R, expected cost C and completion time E as functions of the fraction of the mission task that should be performed between backups.
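The step behaviour of H in Fig. 6 follows from the relation between π and H. The minimal Python sketch below assumes, based only on the statement that π = 1/x corresponds to H = x − 1, that the task is split into ceil(1/π) work segments with no backup after the last one. The function names are hypothetical and exact fractions are used to avoid rounding at the jump points.

```python
from fractions import Fraction
from math import ceil


def backups_for(pi):
    """Number of backups H when a fraction pi of the task is done between backups.

    Assumption (inferred from pi = 1/x giving H = x - 1): the task is split
    into ceil(1/pi) segments and no backup is made after the final segment.
    """
    pi = Fraction(pi)
    if not 0 < pi <= 1:
        raise ValueError("pi must lie in (0, 1]")
    return ceil(1 / pi) - 1


def best_pi_for(h):
    """Largest pi that still requires only h backups, i.e. pi = 1/(h + 1)."""
    return Fraction(1, h + 1)


if __name__ == "__main__":
    x = 4
    tiny = Fraction(1, 10**6)
    print(backups_for(Fraction(1, x)))         # H = x - 1
    print(backups_for(Fraction(1, x) - tiny))  # H = x: one extra backup
```

Under this assumption, any π strictly between 1/(H + 2) and 1/(H + 1) needs the same H + 1 backups as π = 1/(H + 2) but leaves more work to redo after a failure, which is consistent with restricting the search to π = 1/(H + 1).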
Fig. 7. The optimal number of backups H, maximal number of repairs N, mission reliability R, expected cost C and completion time E as functions of the allowed mission time τ.
Fig. 8. Number of backups H that maximizes system reliability and corresponding mission performance indices R, E and C as functions of ε.
increase rate is affected by the variation of the allowed number of repairs N. Fig. 8 presents the number of backups that maximizes system reliability and the corresponding mission performance indices R, E and C as functions of the data backup and retrieval time variation parameter ε. It is assumed that B = 2ε and b = 1.5ε. The optimal number of backups decreases as ε increases because, as the data backup and retrieval times grow, the effect of backups on increasing the total mission time outweighs their effect on reducing the work redone after failures. For ε > 3.1 the backups become inefficient and should not be used. Fig. 9 presents the number of backups H that maximizes the system reliability and the corresponding mission performance indices R, E and C as functions of the mean repair time μ when dmin = μ − 2, dmax = μ + 2 and std = 2. The increase in the repair time leaves less time for backups to be performed within the allowed mission time. The expected mission time and cost increase with μ until the repair time reaches values that prohibit mission completion after repair for most failure times. From this point C and E decrease with μ. Fig. 10 presents the number of backups H that maximizes system reliability and the corresponding mission performance indices R, E and C as functions of the system time-to-failure distribution scale parameter η for different values of the repair efficiency coefficient z. The increase in the system reliability and repair efficiency makes the backups more beneficial because the probability that the mission is completed after backup data retrieval increases. Thus, the optimal number of backups increases with η and 1 − z. The expected mission time and cost demonstrate non-monotonic behavior. E and C first increase with η because of the increase in the number of backups, and then decrease because the chance that the system completes the task without failures
Fig. 9. Number of backups H that maximizes system reliability and corresponding mission performance indices R, E and C as functions of the mean repair time μ (given dmin = μ − 2, dmax = μ + 2, std = 2).
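To make the repair-time setting of Fig. 9 concrete, the sketch below draws repair times from a normal distribution with mean μ and std = 2 truncated to [μ − 2, μ + 2]. The truncated normal form is an assumption here (the text above gives only dmin, dmax and std), the function name is hypothetical, and simple rejection sampling stands in for whatever numerical treatment the evaluation algorithm actually uses.

```python
import random


def sample_repair_time(mu, std=2.0, half_width=2.0, rng=random):
    """Draw one repair time from a normal(mu, std) truncated to [mu - 2, mu + 2].

    Assumption: the repair time is a truncated normal; samples outside the
    interval [d_min, d_max] are rejected and redrawn.
    """
    d_min, d_max = mu - half_width, mu + half_width
    while True:
        d = rng.gauss(mu, std)
        if d_min <= d <= d_max:
            return d


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    draws = [sample_repair_time(10.0, rng=rng) for _ in range(5)]
    print(draws)
```

Because the truncation interval is symmetric about μ, the mean of the truncated distribution stays at μ, so shifting μ shifts the whole repair-time interval without changing its width.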
Fig. 10. Number of backups H that maximizes system reliability and corresponding mission performance indices R, E and C as functions of the system time-to-failure distribution scale parameter η and repair efficiency coefficient z.
Fig. 11. Min C s.t. R ≥ 0.9 and max R solutions as functions of τ for z = 0.9.
Table 2 Difference (in percentage) of R and C for max R and min C s.t. R ≥ 0.9 solutions.
τ      R      C
95     2.6    11.2
100    1.9    9.7
105    5.9    15.8
110    4.6    10.5
increases. With the increase in the repair efficiency (decrease of z), the expected mission time and cost increase because the chance that a successful mission includes repairs increases. Having the dependencies R(H) and C(H), one can also minimize the expected mission cost subject to a reliability constraint, i.e., min C s.t. R ≥ R*. Fig. 11 presents such optimal solutions for R* = 0.9 as functions of the maximum allowed mission time τ, compared with the max R solutions, when z = 0.9. Table 2 presents the related improvement in the mission reliability and the increase in the expected mission cost (in percentage) obtained for the max R solutions compared to the min C s.t. R ≥ 0.9 solutions for different values of the allowed mission time. This cost-reliability trade-off analysis can be important for decision making.
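Once R(H) and C(H) have been evaluated for each candidate number of backups, both optimization problems reduce to a search over a short list. The following sketch illustrates that selection step only. The (H, R, C) values are invented for illustration and are not results from this paper, and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical (H, R, C) values, invented purely for illustration.
CANDIDATES = [
    {"H": 0, "R": 0.80, "C": 100.0},
    {"H": 1, "R": 0.91, "C": 112.0},
    {"H": 2, "R": 0.94, "C": 125.0},
    {"H": 3, "R": 0.92, "C": 131.0},
]


def max_reliability(candidates):
    """Return the candidate with the highest mission reliability R."""
    return max(candidates, key=lambda c: c["R"])


def min_cost_subject_to(candidates, r_star):
    """Return the cheapest candidate with R >= r_star, or None if none is feasible."""
    feasible = [c for c in candidates if c["R"] >= r_star]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["C"])


def interval_fraction(h):
    """Backup frequency parameter pi = 1/(H + 1) for a chosen number of backups."""
    return Fraction(1, h + 1)


if __name__ == "__main__":
    best_r = max_reliability(CANDIDATES)
    best_c = min_cost_subject_to(CANDIDATES, r_star=0.9)
    print(best_r["H"], interval_fraction(best_r["H"]))
    print(best_c["H"], interval_fraction(best_c["H"]))
```

With these illustrative numbers the two criteria select different values of H, which is the kind of gap between the max R and min C s.t. R ≥ R* solutions that Table 2 quantifies.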
6. Conclusion and future work
This paper models mission indices of reliability, expected cost and completion time of single-component real-time systems subject to periodic backups as well as random failure and repair times. Arbitrary types of time-to-failure and time-to-repair distributions are allowed. The optimal backup frequency problems are formulated and solved with the objective to maximize the system reliability or to minimize the expected mission cost while satisfying a certain constraint on system reliability. Through examples, the effects of factors including the maximum allowed mission time, data backup and retrieval times, parameters of repair or failure time distribution, and repair efficiency coefficient on the optimal backup frequency solution as well as on the mission reliability, cost and time are investigated. The revealed complicated relationships demonstrate the significance of applying the proposed algorithm to determine the optimal backup policy for the considered type of repairable systems.
In the future we will extend the proposed methodology to study single-component repairable systems subject to phased-mission requirements, where component failure and repair behavior as well as operation and repair costs can change from phase to phase. Based on recent work on non-repairable standby systems [30], we will also investigate the optimal backup policy for multi-component repairable systems designed with different standby sparing techniques.
References
[1] Ascher H, Feingold H. Repairable Systems Reliability. Modeling, Inference, Misconceptions and Their Causes. New York: Dekker; 1984.
[2] Yang Q, Zhang N, Hong Y. Reliability analysis of repairable systems with dependent component failures under partially perfect repair. IEEE Trans Reliab 2013;62(2):490–8.
[3] Lindqvist BH. On the statistical modeling and analysis of repairable systems. Stat Sci 2006;21(4):532–51.
[4] Yañez M, Joglar F, Modarres M. Generalized renewal process for analysis of repairable systems with limited failure experience. Reliab Eng Syst Saf 2002;77(2):167–80.
[5] Saldanha PLC, de Simone EA, Frutuoso e Melo PF. An application of nonhomogeneous Poisson point processes to the reliability analysis of service water pumps. Nucl Eng Des 2001;210(1–3):125–33.
[6] Weckman GR, Shell RL, Marvel JH. Modelling the reliability of repairable systems in the aviation industry. Comput Ind Eng 2001;40(1):51–63.
[7] Bloch-Mercier S. Optimal restarting distribution after repair for a Markov deteriorating system. Reliab Eng Syst Saf 2001;74(2):181–91.
[8] Bloch-Mercier S. A preventive maintenance policy with sequential checking procedure for a Markov deteriorating system. Eur J Oper Res 2002;147(4):548–76.
[9] Marquez AC, Heguedas AS. Models for maintenance optimization: a study for repairable systems and finite time periods. Reliab Eng Syst Saf 2002;75(3):367–77.
[10] Soro IW, Nourelfath M, Aït-Kadi D. Performance evaluation of multi-state degraded systems with minimal repairs and imperfect preventive maintenance. Reliab Eng Syst Saf 2010;95(2):65–9.
[11] Jia J, Wu S. A replacement policy for a repairable system with its repairman having multiple vacations. Comput Ind Eng 2009;57(1):156–60.
[12] Lam Y. A note on the optimal replacement problem. Adv Appl Probab 1988;20:479–82.
[13] Castro IT, Pérez-Ocón R. Reward optimization of a repairable system. Reliab Eng Syst Saf 2006;91(3):311–9.
[14] Zhang YL, Wang GJ. A deteriorating cold standby repairable system with priority in use. Eur J Oper Res 2007;183(1):278–95.
[15] Percy DF. Bayesian enhanced strategic decision making for reliability. Eur J Oper Res 2002;139(1):133–45.
[16] Rosqvist T. Bayesian aggregation of experts' judgements on failure intensity. Reliab Eng Syst Saf 2000;70(2):283–9.
[17] Sheu S-H, Yeh RH, Lin Y-B, Juang M-G. A Bayesian approach to an adaptive preventive maintenance model. Reliab Eng Syst Saf 2001;71(1):33–44.
[18] Yuan L, Xu J. An optimal replacement policy for a repairable system based on its repairman having vacations. Reliab Eng Syst Saf 2011;96(7):868–75.
[19] Yeh RH, Chen M-Y, Lin C-Y. Optimal periodic replacement policy for repairable products under free-repair warranty. Eur J Oper Res 2007;176(3):1678–86.
[20] Jia J, Wu S. Optimizing replacement policy for a cold-standby system with waiting repair times. Appl Math Comput 2009;214(1):133–41.
[21] Taghipour S, Banjevic D. Periodic inspection optimization models for a repairable system subject to hidden failures. IEEE Trans Reliab 2011;60(1):275–85.
[22] Golmakani HR, Moakedi H. Periodic inspection optimization model for a two-component repairable system with failure interaction. Comput Ind Eng 2012;63(3):540–5.
[23] Taghipour S, Banjevic D. Optimal inspection of a complex system subject to periodic and opportunistic inspections and preventive replacements. Eur J Oper Res 2012;220(3):649–60.
[24] Taghipour S, Banjevic D, Jardine AKS. Periodic inspection optimization model for a complex repairable system. Reliab Eng Syst Saf 2010;95(9):944–52.
[25] Taghipour S, Banjevic D. Optimum inspection interval for a system under periodic and opportunistic inspections. IIE Trans 2012;44(11):932–48.
[26] Golmakani HR, Moakedi H. Optimal nonperiodic inspection scheme for a multicomponent repairable system with failure interaction using A* search algorithm. Int J Adv Manuf Technol 2013;67(5–8):1325–36.
[27] Golmakani HR, Moakedi H. Optimal non-periodic inspection scheme for a multi-component repairable system using A* search algorithm. Comput Ind Eng 2012;63(4):1038–47.
[28] Yeh RH, Lo H-C. Optimal preventive-maintenance warranty policy for repairable products. Eur J Oper Res 2001;134(1):59–69.
[29] Monga A, Zuo M. Optimal system design considering maintenance and warranty. Comput Oper Res 1998;25(9):691–705.
[30] Levitin G, Xing L, Johnson BW, Dai Y. Mission reliability, cost and time for cold standby computing systems with periodic backup. IEEE Trans Comput 2015;64(4):1043–57.
[31] Weibull W. A statistical distribution function of wide applicability. J Appl Mech – Trans ASME 1951;18:293–7.
[32] Barr DR, Sherrill ET. Mean and variance of truncated normal distributions. Am Stat 1999;53(4):357–61.