Accepted Manuscript
Optimal Mission Abort Policy for Partially Repairable Heterogeneous Systems
Ji Hwan Cha, Maxim Finkelstein, Gregory Levitin
PII: S0377-2217(18)30567-8
DOI: 10.1016/j.ejor.2018.06.032
Reference: EOR 15217
To appear in: European Journal of Operational Research
Received date: 27 October 2017
Revised date: 11 June 2018
Accepted date: 13 June 2018
Please cite this article as: Ji Hwan Cha, Maxim Finkelstein, Gregory Levitin, Optimal Mission Abort Policy for Partially Repairable Heterogeneous Systems, European Journal of Operational Research (2018), doi: 10.1016/j.ejor.2018.06.032
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
• Mission abort for heterogeneous systems is considered based on updated information.
• A random shock model is used to represent the random environment.
• Mission success probability and system survivability are suggested.
• Tradeoff between mission success probability and system survivability is demonstrated.
• Optimal mission abort policy is analyzed based on the model.
Optimal Mission Abort Policy for Partially Repairable Heterogeneous Systems
Ji Hwan Cha^{1,a}, Maxim Finkelstein^{b,c} and Gregory Levitin^{d}
a Department of Statistics, Ewha Womans University, Seoul, 120-750, Rep. of Korea. E-mail: [email protected]
b Department of Mathematical Statistics, University of the Free State, Bloemfontein, South Africa. E-mail: [email protected]
c ITMO University, 49 Kronverkskiy pr., St. Petersburg, 197101, Russia
d The Israel Electric Corporation, Haifa, Israel. E-mail: [email protected]
Abstract
To enhance survivability of many real-world critical systems (e.g., aircraft, human space flight systems and dangerous technological processes), mission abort procedures are often utilized in practice. In such cases, in order to improve survivability, a rescue or recovery procedure is initiated. In this paper, we first develop a methodology for obtaining the mission success probability (MSP) and the system survival probability (SSP) for heterogeneous systems with major and minor failures. A major failure is nonrepairable and results in a failure of a mission, whereas a minor failure is minimally repaired. We consider a policy where a mission is aborted and a rescue procedure is activated if the m-th minimal repair occurs before time $\xi$ since the start of a mission. We demonstrate the tradeoff between the SSP and the MSP that should be balanced by the proper choice of the decision variables $m$ and $\xi$. A detailed illustrative example is presented.
Keywords: Maintenance; mission success probability; system survival probability; mission abort; minimal repair
Acronyms
cdf    cumulative distribution function
pdf    probability density function
MSP    mission success probability
SSP    system survival probability
PH     proportional hazards
NHPP   nonhomogeneous Poisson process

Notation
$\xi$    time after which a mission is not aborted
$\varphi(t)$    duration of the rescue procedure activated at time $t$
$S(\tau,\xi,m)$    SSP as a function of $\tau,\xi,m$
$R(\tau,\xi,m)$    MSP as a function of $\tau,\xi,m$
$\delta$    cumulative life time deceleration/acceleration factor
$L$    system lifetime with respect to a major failure
$T_m$    random time of the m-th minimal repair occurrence
$t_m$    realization of $T_m$
$\tau$    mission time
$Z$    random parameter characterizing system frailty
$z$    realization of a frailty
$\pi(z)$    pdf of $Z$
$q(t)$    probability of a minor failure minimally repaired at time $t$
$p(t)$    probability of a major failure (termination) at time $t$
$\lambda(t)$    baseline failure rate of a system at time $t$
$\lambda(t,z)$    failure rate of a system at time $t$ indexed by frailty
$\nu(t,z)$    rate of the NHPP at time $t$ indexed by frailty
$\Lambda_q(t)$    cumulative rate of minimal repairs by time $t$

1 Corresponding Author

1. Introduction
Stochastic analysis of mission abort strategies has recently attracted considerable attention from reliability specialists, as it is aimed at the rescue of costly systems or equipment that can be lost due to failures if a mission continues. It is well known that the mission success probability (MSP), i.e., the probability of successfully completing a specific mission (Rausand and Høyland (2003), Levitin et al. (2016), Levitin and Finkelstein (2018)), is an important reliability characteristic for various engineering systems that perform specific tasks. However, in certain instances, a mission should be aborted if survival of a system, for safety or cost-related reasons, has a higher priority than its completion (e.g., for aircraft, submarines or complex costly technological processes). Myers (2009) was the first to consider a mission abort strategy for standby systems with a rescue procedure to be initiated upon failure of a fixed number of components. A corresponding approach was developed for the specific case of hot standby systems with component lifetimes described by identical exponential distributions. In Levitin et al. (2017), the model was extended to systems with non-identical components and an adaptive abort policy. However, these papers did not take into consideration the impact of a stochastic environment on the operational characteristics of systems and on the corresponding abort policy. The latter was first addressed in Levitin and Finkelstein (2018), where a random environment was modeled by a Poisson process of shocks. In that paper, in accordance with the generalized extreme shock model (Cha and Finkelstein (2011a)), each shock results in a failure of a system with a probability that increases with the shock number (see also Cazorla and Pérez-Ocón (2011, 2014), Frostig and Kenzin (2009), Huynh et al. (2012)). The consecutive number of an arriving shock was then considered as a decision variable for the corresponding mission abort policy. Thus, the number of external shocks experienced by a system in that paper already 'acts' as an observable degradation variable (similar to the observed internal failures of components in the standby "1-out-of-n" system in Myers (2009), Levitin et al. (2017)). It is intuitively clear that when this degradation is 'too large' during the initial operational phase, a mission should be aborted and a system should be rescued.

In the current paper, we suggest a different model and a new approach. We consider systems with major (critical) and minor (non-critical) failures, which is the case for many repairable engineering systems. A major failure of a system is non-repairable during a mission and, therefore, results in a failure of a mission and often in a loss of a system. On the other hand, in many instances, minor failures can be minimally repaired (instantaneously) and a system can carry on operating. It is well known that in this case the process of minimal repairs forms the non-homogeneous Poisson process (NHPP) (see e.g., Nakagawa (2007), Finkelstein (2008), Finkelstein and Cha (2013) to name a few). At first sight, it seems that the number of minimal repairs can also act as a decision variable for the abort policy. However, this is not the case for the described model, because the number of minimal repairs experienced up to the current instant of time does not contain information on the future (remaining) lifetime with respect to either major or minor failures.
The latter is, obviously, due to the memoryless property of the NHPP (i.e., future events do not depend on the timings and numbers of previous events). The above reasoning refers to the homogeneous situation, when a system with a lifetime $T$ is characterized by the absolutely continuous cdf $F(t)$ with the corresponding failure rate $\lambda(t)$. However, in practice, manufactured systems are often heterogeneous and the corresponding reliability characteristics in a population of systems are, therefore, variable. This heterogeneity occurs for many different reasons, such as the quality of resources and components used in the production process, operation and maintenance history, human errors, etc. (Cha and Finkelstein, 2011b). In the simplest and most effective way, this impact can be modeled by introducing an additional random variable (frailty). Denote this random variable by $Z$ and its pdf with support in $[0,\infty)$ by $\pi(z)$. Thus, given $Z=z$, let $F(t,z)$ be the cdf of a system with the corresponding failure rate $\lambda(t,z)$. It is well known from the theory of mixed distributions that the population cdf and the failure rate in this case are defined as (Finkelstein and Cha (2013))

$$F_m(t)=\int_0^\infty F(t,z)\,\pi(z)\,dz, \qquad \lambda_m(t)=\int_0^\infty \lambda(t,z)\,\pi(z\mid t)\,dz, \tag{1}$$

respectively, where "m" stands for the "mixture", $S \equiv 1-F$ and the conditional pdf is defined as

$$\pi(z\mid t)\equiv\pi(z\mid T>t)=\frac{S(t,z)\,\pi(z)}{\int_0^\infty S(t,z)\,\pi(z)\,dz}. \tag{2}$$
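Relationships (1) and (2) can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the general statement above: the failure rate is taken as $\lambda(t,z)=z\lambda$ with a constant baseline $\lambda$, and the frailty pdf is exponential, $\pi(z)=\theta e^{-\theta z}$. The names `lam`, `theta`, `z_max` and the function names are hypothetical. Under these assumptions the mixture failure rate has the closed form $\lambda/(\theta+\lambda t)$, which decreases in $t$ even though each individual system has a constant failure rate.

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def mixture_failure_rate(t, lam=0.1, theta=1.0, z_max=60.0):
    """lambda_m(t) from (1)-(2), assuming lambda(t, z) = z * lam and
    an exponential frailty pdf with rate theta (illustrative choices).
    The frailty integral is truncated at z_max."""
    prior = lambda z: theta * math.exp(-theta * z)   # pi(z)
    surv = lambda z: math.exp(-z * lam * t)          # S(t, z)
    norm = simpson(lambda z: surv(z) * prior(z), 0.0, z_max)
    num = simpson(lambda z: z * lam * surv(z) * prior(z), 0.0, z_max)
    return num / norm

if __name__ == "__main__":
    for t in (0.0, 5.0, 20.0):
        # numerical value next to the closed form lam / (theta + lam * t)
        print(t, mixture_failure_rate(t), 0.1 / (1.0 + 0.1 * t))
```

The decrease of the mixture failure rate reflects the fact that survival up to $t$ shifts the conditional frailty distribution (2) toward smaller values of $z$.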
We will show in this paper that, in contrast to the homogeneous case, in the heterogeneous setting the number of minimal repairs can indeed be considered as a decision variable for the mission abort policy. We believe that this is the main theoretical contribution of our paper, and we illustrate it with a detailed numerical example. The main rationale behind this statement is the fact that the mixed process of minimal repairs to be defined is no longer an NHPP and, therefore, contains information on the history of minimal repairs. Loosely speaking, for some realizations of $Z$, there will be large numbers of minimal repairs, which result in smaller values of the system survival probability (SSP). Obviously, this can justify a decision to abort a mission ('unreliable' system). On the other hand, for other realizations of $Z$, there will be small numbers of minimal repairs and, consequently, larger values of the SSP that can justify the decision to continue a mission ('reliable' system). As far as we know, this approach has not been considered in the literature so far. In order to make it practically significant, the corresponding theoretical results should be obtained, and this is done in our paper.

The paper is organized as follows. In Section 2, we describe the model for the mission abort/continuation policy, justify this policy for heterogeneous systems and formulate the corresponding optimization problem. Section 3 is devoted to the derivation of the probabilities of interest and Section 4 presents the detailed numerical example. Finally, concluding remarks are given in Section 5.

2. Problem formulation and supplementary results
Consider a system that should perform a mission of duration $\tau$. It can experience two types of failures. A major (critical) failure results in the failure of a mission and often in the loss of the system. On the other hand, minor (non-critical) failures are minimally (instantaneously) repaired during a mission. Following Brown and Proschan (1983) and Block et al. (1985), we assume that each failure that occurs at time $t$ is a minor failure with probability $q(t)$ and a major failure with probability $p(t)=1-q(t)$. This setting is typical for many repairable engineering systems.

The system survival has a higher priority than accomplishing the mission. Therefore, when successful mission completion becomes unlikely, the mission is aborted and a rescue procedure is implemented. The mission abort decision should be based on the information on the likelihood of completing the remaining part of the mission, which is implicitly expressed by the failure occurrence history (the numbers and the occurrence times of the minor failures). Specifically, we consider the abort policy in which the mission is aborted and the rescue operation is activated immediately upon the occurrence of the m-th minor failure (minimally repaired) if its time $t_m$ does not exceed the pre-specified value $\xi$. Indeed, when $t_m$ increases, the mission time remaining after the m-th minor failure decreases. Thus, it may become unreasonable to start the rescue procedure if the mission is close to its end ($t_m$ close to $\tau$) and the system has good chances to complete it. Therefore, we assume that the system continues executing a mission if $t_m \ge \xi$, where $\xi$ is a time after which a mission should never be aborted. Thus $\xi$, along with $m$, can be considered as another decision variable (in the bivariate model) that can be chosen to achieve a proper balance between the MSP and the SSP. Notice that the special case $\xi=\tau$ corresponds to the situation when the mission abort upon the m-th failure is allowed at any time during the mission.

Following Levitin and Finkelstein (2018), we assume that the duration of the rescue procedure $\varphi$ is a function of the time when it starts, i.e., the occurrence time of the m-th minimal failure/repair, and it can be described as $\varphi=\varphi(t_m)$. Therefore, the full time from the start of the operation of a system until the successful rescue is completed (abort at $t_m$) is $t_m+\varphi(t_m)$. Note that minimal repairs can be carried out during the rescue phase.

The decision variables $\xi$ and $m$, for any given combination of system and mission parameters, unambiguously determine both the MSP, $R(\tau,\xi,m)$, and the SSP, $S(\tau,\xi,m)$. The optimal mission abort policy $\xi, m$ should achieve a balance (tradeoff) between the MSP and the SSP. For instance, the practically meaningful problem of obtaining the optimal $m$ and $\xi$ that achieve the maximum MSP subject to providing the desired level $S^*$ of the SSP can be formulated as

$$\max R(\tau,\xi,m) \quad \text{s.t.} \quad S(\tau,\xi,m)>S^*. \tag{3}$$
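The abort rule described above can be made concrete with a small Monte Carlo sketch that estimates the MSP and the SSP for a given pair $(m,\xi)$. Everything specific in it is an illustrative assumption rather than part of the general formulation: a constant baseline failure rate `lam`, a constant major-failure probability `p`, a failure rate proportional to an exponentially distributed frailty with rate `theta`, and a rescue phase run at the same load as the mission. The function name `simulate_policy` and its parameters are hypothetical.

```python
import math
import random

def simulate_policy(tau, xi, m, p, lam, theta, phi, n_runs=200_000, seed=1):
    """Monte Carlo estimates of (MSP, SSP) under the (m, xi) abort rule.

    Assumptions (illustrative): failures of a system with frailty z arrive
    at constant rate z * lam; each failure is major with probability p;
    frailty is exponential with rate theta; phi(t) is the rescue duration.
    """
    rng = random.Random(seed)
    success = survive = 0
    for _ in range(n_runs):
        z = rng.expovariate(theta)            # frailty of this system
        t, minor = 0.0, 0
        while True:
            t += rng.expovariate(z * lam)     # time of the next failure
            if t >= tau:                      # mission completed first
                success += 1
                survive += 1
                break
            if rng.random() < p:              # major failure: system lost
                break
            minor += 1                        # minor failure, minimally repaired
            if minor == m and t <= xi:        # abort and start the rescue
                # rescue succeeds if no major failure occurs within phi(t)
                if rng.random() < math.exp(-z * p * lam * phi(t)):
                    survive += 1
                break
    return success / n_runs, survive / n_runs

if __name__ == "__main__":
    r_hat, s_hat = simulate_policy(tau=6.0, xi=4.0, m=2, p=0.1, lam=0.1,
                                   theta=1.0, phi=lambda t: 1.0)
    print("estimated MSP:", r_hat, "estimated SSP:", s_hat)
```

Varying `m` and `xi` in such a simulation shows the tradeoff directly: a stricter rule (smaller `m`, larger `xi`) lowers the estimated MSP and, for a short rescue, raises the estimated SSP.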
In what follows, for modeling system heterogeneity, we employ a specific case of $\lambda(t,z)$ in relationships (1), which is the proportional hazards (PH) model widely used in the reliability literature (Leemis (1995), Ebeling (1997), Kalbfleisch and Prentice (2002)):

$$\lambda(t,z)=z\lambda(t), \tag{4}$$

where $\lambda(t)$ has the meaning of a baseline failure rate. We can also consider more general models that comply with a natural ordering, i.e.,

$$\lambda(t,z_1)<\lambda(t,z_2); \quad z_1<z_2, \tag{5}$$

however, in this paper, we will assume (4) due to its practical importance and clear modelling motivation. Denote the lifetime to the major failure in realization $Z=z$ for the described model by $L_z$. Thus, in accordance with Finkelstein (2008), Brown and Proschan (1983), Block et al. (1985), the corresponding survival function is

$$P(L_z>t)\equiv S(t,z)=\exp\left\{-z\int_0^t p(u)\lambda(u)\,du\right\}, \tag{6}$$

with the failure rate

$$\lambda_p(t,z)=z\,p(t)\lambda(t). \tag{7}$$
On the other hand, let $L$ denote the 'overall' (marginal) lifetime to the major failure. In accordance with (1) and (6):

$$P(L>t)=\int_0^\infty P(L_z>t)\,\pi(z)\,dz=\int_0^\infty S(t,z)\,\pi(z)\,dz. \tag{8}$$
It is well known (see, e.g., Rausand and Høyland, 2003 and Finkelstein and Cha, 2013) that the process of minimal repairs (prior to a major failure) in this case follows the NHPP with rate

$$\nu(t,z)=z\,q(t)\lambda(t). \tag{9}$$

In fact, this result makes the further mathematical reasoning tractable. For convenience of further notation, denote the corresponding cumulative rate by $\Lambda_q(t)\equiv\int_0^t q(u)\lambda(u)\,du$. Denote also the NHPP of minimal repairs by $\{N_m(t), t\ge 0\}$, where $N_m(t)$ is the number of minimal repairs in $[0,t)$, and let $T_1<T_2<\dots$ be the arrival times of minimal repairs with realizations $t_1<t_2<\dots$

The rationale behind the proposed mission abort policy stems from the fact that the observable variable $m$ enables updating the pdf of the frailty parameter in our model (6)-(8) (see (13) later), similar to the updating of $\pi(z)$ in (2), where the corresponding information is just the survival of a system up to $t$. Due to this Bayesian updating, we can conclude that the event $\{T_m>\xi\}$ implies smaller values of $Z$ and thus possible continuation of the mission, whereas the event $\{T_m\le\xi\}$ implies larger values of $Z$ and thus possible aborting of a mission. Through Propositions 1-2, we will justify mathematically the mission abort policy proposed above (see also Remark 1). Proposition 1 means that, for larger realizations of $Z$, the probability of observing the event $\{T_m>\xi\}$ is smaller, whereas for smaller realizations of $Z$, the probability of observing the event $\{T_m>\xi\}$ is larger. Conversely, as mentioned above, it implies that if we observe $\{T_m>\xi\}$, the corresponding realization of $Z$ will tend to be smaller and the continuation of the mission will be reasonable. On the other hand, if we observe $\{T_m\le\xi\}$, the corresponding realization of $Z$ will tend to be larger and aborting the mission will be reasonable.
Proposition 1. Let the PH model (4) hold and the minor failures be minimally repaired, thus forming the NHPP with rate $z\,q(t)\lambda(t)$. Then the conditional probability $P(T_m>\xi\mid Z=z)$ is decreasing in $z$, or equivalently, $P(T_m\le\xi\mid Z=z)$ is increasing in $z$ for each $m=1,2,\dots$

Proof. Denote by $X_\mu$ the Poisson random variable with mean $\mu$. Note that, in accordance with the properties of the NHPP,

$$P(T_m>\xi\mid z)=P(N_m(\xi)\le m-1\mid z)=\sum_{n=0}^{m-1}\frac{(z\Lambda_q(\xi))^n}{n!}\exp\{-z\Lambda_q(\xi)\}, \tag{10}$$

which also defines the cdf of $X_\mu$ (evaluated at $m-1$, $m=1,2,\dots$) for $\mu=z\Lambda_q(\xi)$. Let $z_1<z_2$. Then, using the corresponding density (pmf) for each $n=0,1,2,\dots$ and considering the likelihood ratio

$$\frac{\dfrac{(z_1\Lambda_q(\xi))^n}{n!}\exp\{-z_1\Lambda_q(\xi)\}}{\dfrac{(z_2\Lambda_q(\xi))^n}{n!}\exp\{-z_2\Lambda_q(\xi)\}}=\left(\frac{z_1}{z_2}\right)^n\exp\{-(z_1-z_2)\Lambda_q(\xi)\},$$

we see that it is decreasing in $n$. Therefore, in accordance with the definition of the likelihood ratio ordering, which requires that the corresponding ratio should be decreasing (Ross, 1966; Shaked & Shanthikumar, 2007), we have now

$$X_{z_1\Lambda_q(\xi)}\le_{lr}X_{z_2\Lambda_q(\xi)},$$

where $lr$ stands for likelihood ratio. It is well known (Shaked and Shanthikumar, 2007) that the likelihood ratio ordering of two random variables implies the usual stochastic ordering (in distributions) of these random variables. Therefore,

$$X_{z_1\Lambda_q(\xi)}\le_{st}X_{z_2\Lambda_q(\xi)}, \tag{11}$$

and from (10) and (11) we finally get

$$P(T_m>\xi\mid z_1)=\sum_{n=0}^{m-1}\frac{(z_1\Lambda_q(\xi))^n}{n!}\exp\{-z_1\Lambda_q(\xi)\}\ \ge\ \sum_{n=0}^{m-1}\frac{(z_2\Lambda_q(\xi))^n}{n!}\exp\{-z_2\Lambda_q(\xi)\}=P(T_m>\xi\mid z_2).$$

■
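The monotonicity stated in Proposition 1 is easy to observe numerically, since (10) is a Poisson cdf evaluated at $m-1$. In the minimal sketch below, the argument `cum_rate` stands for $\Lambda_q(\xi)$; the function name and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def prob_no_abort_given_z(z, m, cum_rate):
    """P(T_m > xi | Z = z) from (10): P(Poisson(z * cum_rate) <= m - 1),
    where cum_rate plays the role of Lambda_q(xi)."""
    mean = z * cum_rate
    return math.exp(-mean) * sum(mean ** n / math.factorial(n) for n in range(m))

if __name__ == "__main__":
    # Illustrative value: Lambda_q(xi) = q * lam * xi with q = 0.9, lam = 0.1, xi = 3
    cum_rate = 0.9 * 0.1 * 3.0
    for z in (0.5, 1.0, 2.0, 4.0, 8.0):
        # the printed probabilities decrease as the frailty z grows
        print(z, prob_no_abort_given_z(z, m=2, cum_rate=cum_rate))
```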
The initial distribution of the frailty parameter $Z$ (Cha and Finkelstein (2014)) was characterized by its pdf $\pi(z)$. We will show now that this distribution changes in a specific, meaningful way. We can use the additional information on minimal repairs and compare the composition of $Z$ with the condition that $T_m>\xi$ and without this condition. As the conditional probability of $(T_m>\xi\mid z)$ is given by (10), the conditional pdf of $(Z\mid T_m>\xi)$, denoted by $\pi(z\mid T_m>\xi)$, is

$$\pi(z\mid T_m>\xi)=\frac{\sum_{n=0}^{m-1}\dfrac{(z\Lambda_q(\xi))^n}{n!}\exp\{-z\Lambda_q(\xi)\}\,\pi(z)}{\int_0^\infty\sum_{n=0}^{m-1}\dfrac{(z\Lambda_q(\xi))^n}{n!}\exp\{-z\Lambda_q(\xi)\}\,\pi(z)\,dz}. \tag{12}$$

Then,

$$\frac{\pi(z\mid T_m>\xi)}{\pi(z)}=\frac{\sum_{n=0}^{m-1}\dfrac{(z\Lambda_q(\xi))^n}{n!}\exp\{-z\Lambda_q(\xi)\}}{\int_0^\infty\sum_{n=0}^{m-1}\dfrac{(z\Lambda_q(\xi))^n}{n!}\exp\{-z\Lambda_q(\xi)\}\,\pi(z)\,dz}$$

is decreasing in $z$, as shown in the proof of Proposition 1. Thus, we have

$$(Z\mid T_m>\xi)\le_{lr}Z. \tag{13}$$
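The Bayesian update in (12) can be illustrated by comparing the prior and posterior means of the frailty. The sketch below assumes an exponential prior $\pi(z)=\theta e^{-\theta z}$ (an illustrative choice, under which the prior mean is $1/\theta$) and evaluates the integrals by Simpson's rule on a truncated range; `cum_rate` stands for $\Lambda_q(\xi)$, and all names and values are hypothetical.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def posterior_mean_frailty(m, cum_rate, theta, z_max=60.0, n=6000):
    """E[Z | T_m > xi] under (12), assuming pi(z) = theta * exp(-theta * z).
    The frailty range is truncated at z_max."""
    def weight(z):
        mean = z * cum_rate
        cdf = math.exp(-mean) * sum(mean ** k / math.factorial(k) for k in range(m))
        return cdf * theta * math.exp(-theta * z)   # numerator of (12)
    num = simpson(lambda z: z * weight(z), 0.0, z_max, n)
    den = simpson(weight, 0.0, z_max, n)
    return num / den

if __name__ == "__main__":
    theta, cum_rate = 1.0, 0.5
    print("prior mean:", 1.0 / theta)
    for m in (1, 2, 3, 5):
        print("m =", m, "posterior mean:", posterior_mean_frailty(m, cum_rate, theta))
```

For $m=1$ the posterior is again exponential with rate $\theta+\Lambda_q(\xi)$, which gives a simple analytic check; for every $m$ the posterior mean stays below the prior mean, in line with (13).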
As likelihood ratio ordering is stronger than the usual stochastic ordering, inequality (13) implies that, in the case of mission continuation after observing $T_m>\xi$, the updated frailty is stochastically smaller than the initial $Z$.

Lemma 1. (Joe, 1997; Cuadras, 2002) Let $X$ be a random variable with the cdf $F(x)$, and let $g(x)$, $h(x)$ be real-valued functions.

(i) If both $g(x)$ and $h(x)$ are increasing, or if both $g(x)$ and $h(x)$ are decreasing, then

$$E[g(X)h(X)]\ge E[g(X)]E[h(X)].$$

(ii) If $g(x)$ is increasing and $h(x)$ is decreasing, then

$$E[g(X)h(X)]\le E[g(X)]E[h(X)].$$

Definition 1. Positive Quadrant Dependency (PQD) (Lehmann (1966))

Two random variables $Y_1$ and $Y_2$ are positively quadrant dependent (PQD) if the inequality $P(Y_1>y_1,Y_2>y_2)\ge P(Y_1>y_1)P(Y_2>y_2)$ holds for all $y_1$ and $y_2$.

It can be shown that if $Y_1$ and $Y_2$ are PQD, then $\mathrm{Cov}(Y_1,Y_2)\ge 0$.
Now we can formulate our next result.

Proposition 2. Let the PH model (4) hold and the minor failures be minimally repaired, thus forming the NHPP with rate $z\,q(t)\lambda(t)$. Then a system lifetime (to a major failure), $L$, and the time to the m-th minimal repair, $T_m$, are PQD, i.e., $P(L>l,T_m>t_m)\ge P(L>l)P(T_m>t_m)$ for all $l\ge 0$ and $t_m\ge 0$.

Proof. Observe that

$$P(L>l,T_m>t_m)=E[P(L>l,T_m>t_m\mid Z)]=E\left[\exp\left\{-Z\int_0^l p(u)\lambda(u)\,du\right\}\sum_{n=0}^{m-1}\frac{(Z\Lambda_q(t_m))^n}{n!}\exp\{-Z\Lambda_q(t_m)\}\right].$$

Denote

$$g(z)\equiv\exp\left\{-z\int_0^l p(u)\lambda(u)\,du\right\} \quad\text{and}\quad h(z)\equiv\sum_{n=0}^{m-1}\frac{(z\Lambda_q(t_m))^n}{n!}\exp\{-z\Lambda_q(t_m)\}.$$

Then, from the proof of Proposition 1, $h(z)$ is decreasing in $z$, and it is obvious that $g(z)$ is also decreasing in $z$. Thus, by applying Lemma 1,

$$E\left[\exp\left\{-Z\int_0^l p(u)\lambda(u)\,du\right\}\sum_{n=0}^{m-1}\frac{(Z\Lambda_q(t_m))^n}{n!}\exp\{-Z\Lambda_q(t_m)\}\right]$$
$$\ge E\left[\exp\left\{-Z\int_0^l p(u)\lambda(u)\,du\right\}\right]E\left[\sum_{n=0}^{m-1}\frac{(Z\Lambda_q(t_m))^n}{n!}\exp\{-Z\Lambda_q(t_m)\}\right]$$
$$=E[P(L>l\mid Z)]\,E[P(T_m>t_m\mid Z)]=P(L>l)P(T_m>t_m),$$

which completes the proof.

■
The inequality given in Proposition 2 implies the following statement:

$$P(L>\tau\mid T_m>\xi)\ge P(L>\tau), \tag{14}$$

which means that the occurrence of fewer than $m$ minimal failures during the time $\xi$ from the start of a mission can be interpreted as evidence that a system performing a mission is characterized by a smaller value of the unobserved frailty parameter $Z$ and, therefore, has better chances to complete the mission successfully (without a major failure). Based on the above considerations, we can now present (in the following remark) the detailed justification of our mission abort policy.

Remark 1. It is obvious that by simply aborting a mission with some probability, we decrease the MSP of a system. This refers to the case when we have two outcomes, i.e., success or failure of a mission. However, our optimization problem is about maximizing the MSP with a given lower bound on the SSP of a system. As a population of systems is heterogeneous (and stochastically ordered), some of them will be 'suitable' for completing a mission (systems with smaller frailty), whereas the other ones are characterized by larger values of frailty and a mission should be aborted. Thus we have four possible events:

(I) {No Abort: Mission Success; System Survival},
(II) {No Abort: Mission Failure; System Loss},
(III) {Abort: Mission Failure; System Survival},
(IV) {Abort: Mission Failure; System Loss},

and we have to increase the probabilities of events (I) and (III), as there are two objectives now, which is reflected in the optimization criterion in (3). Inequality (14) means that, under the abort policy, the probability that the non-aborted systems (part of the whole population of systems) will perform a mission is larger than the unconditional probability of completing a mission $P(L>\tau)$, which is the MSP for the case without the abort policy (for the whole population of systems). On the other hand, from Proposition 2, using the properties of bivariate survival functions, we can easily obtain the inequality

$$P(L\le l,T_m\le t_m)=1-P(L>l)-P(T_m>t_m)+P(L>l,T_m>t_m)$$
$$\ge 1-P(L>l)-P(T_m>t_m)+P(L>l)P(T_m>t_m)=P(L\le l)P(T_m\le t_m),$$

which implies that

$$P(L\le l,T_m\le t_m)\ge P(L\le l)P(T_m\le t_m) \quad\text{for all } l\ge 0 \text{ and } t_m\ge 0.$$

Therefore,

$$P(L\le l\mid T_m\le t_m)\ge P(L\le l)$$

and, finally,

$$P(L\le\tau\mid T_m\le\xi)\ge P(L\le\tau).$$

This inequality (compare with (14)) implies that the probability that the systems aborted by our policy (part of the whole population of systems) would not complete a mission (had they continued to operate after $T_m$) is larger than the unconditional probability of failing a mission $P(L\le\tau)$, which is 1-MSP for the case without the abort policy (for the whole population of systems). Thus, our abort policy based on the number of minimal repairs can also be considered as a reasonable classification rule for the 'good' and the 'bad' systems.
3. Mission success probability and system survival probability

A mission succeeds if a system does not fail (major failure) in $[0,\tau)$ and fewer than $m$ minimal repairs occur in $[0,\xi)$ (no mission abort). In accordance with this description and the proof of Proposition 2, the MSP can be defined as

$$R(\tau,\xi,m)=P(L>\tau,T_m>\xi)=E[P(L>\tau,T_m>\xi\mid Z)]$$
$$=\int_0^\infty\exp\left\{-z\int_0^\tau p(u)\lambda(u)\,du\right\}\sum_{n=0}^{m-1}\frac{(z\Lambda_q(\xi))^n}{n!}\exp\{-z\Lambda_q(\xi)\}\,\pi(z)\,dz. \tag{15}$$
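Expression (15) is a one-dimensional integral over the frailty and can be evaluated by standard quadrature. The sketch below takes the two cumulative quantities $\int_0^\tau p(u)\lambda(u)\,du$ and $\Lambda_q(\xi)$ as plain numbers (`cum_major`, `cum_minor`) and a user-supplied frailty pdf; these names, the truncation point `z_max` and the example values are hypothetical and serve only as an illustration.

```python
import math

def poisson_cdf(k, mean):
    """P(Poisson(mean) <= k) for integer k >= 0."""
    return math.exp(-mean) * sum(mean ** n / math.factorial(n) for n in range(k + 1))

def msp_numeric(cum_major, cum_minor, m, frailty_pdf, z_max=60.0, n=20000):
    """MSP from (15) by Simpson's rule over the frailty z.

    cum_major   : integral of p(u) * lambda(u) over [0, tau]
    cum_minor   : Lambda_q(xi)
    frailty_pdf : callable pi(z); the integral is truncated at z_max
    """
    def f(z):
        return (math.exp(-z * cum_major)
                * poisson_cdf(m - 1, z * cum_minor)
                * frailty_pdf(z))
    h = z_max / n
    total = f(0.0) + f(z_max)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0

if __name__ == "__main__":
    # Illustrative constants: p = 0.1, lambda = 0.1, tau = 6, xi = 4, theta = 1
    p, lam, tau, xi, theta = 0.1, 0.1, 6.0, 4.0, 1.0
    r = msp_numeric(p * lam * tau, (1 - p) * lam * xi, m=2,
                    frailty_pdf=lambda z: theta * math.exp(-theta * z))
    print("MSP:", r)
```

Passing the cumulative rates as numbers keeps the routine independent of the particular forms of $p(t)$ and $\lambda(t)$; for time-dependent rates they would be computed separately.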
A system survives if it completes either a mission or a rescue procedure. The rescue procedure is activated only on the m-th minimal repair (and only if $t_m<\xi$), so that $m$, along with $\xi$, serves as a decision variable for further optimization. To complete the rescue procedure activated at a random time $T_m$, a system lifetime must be not less than $T_m+\varphi(T_m)$. Thus, the SSP is

$$S(\tau,\xi,m)=R(\tau,\xi,m)+P(L>T_m+\varphi(T_m),T_m\le\xi), \tag{16}$$

where, in accordance with the results of the previous section,

$$P(L>T_m+\varphi(T_m),T_m\le\xi)=E[P(L>T_m+\varphi(T_m),T_m\le\xi\mid Z)],$$

and

$$P(L>T_m+\varphi(T_m),T_m\le\xi\mid Z)=E_{(T_m\mid Z)}[P(L>T_m+\varphi(T_m),T_m\le\xi\mid Z,T_m)],$$

where $E_{(T_m\mid Z)}[\cdot]$ stands for the expectation with respect to the distribution of $(T_m\mid Z)$. Observe that

$$P(L>T_m+\varphi(T_m),T_m\le\xi\mid Z=z,T_m=t_m)=P(L>t_m+\varphi(t_m)\mid z)\,I(t_m\le\xi),$$

and the conditional pdf of $(T_m\mid Z=z)$ (which is the instantaneous probability that the m-th minimal repair from the NHPP with rate $\nu(t;z)=z\,q(t)\lambda(t)$ occurs in $[t,t+dt)$) is given by

$$f_m(t;z)=z\,q(t)\lambda(t)\,\frac{(z\Lambda_q(t))^{m-1}}{(m-1)!}\exp\{-z\Lambda_q(t)\}.$$

Hence,

$$P(L>T_m+\varphi(T_m),T_m\le\xi\mid z)=\int_0^\xi\exp\left\{-z\int_0^{t_m+\varphi(t_m)}p(u)\lambda(u)\,du\right\}z\,q(t_m)\lambda(t_m)\,\frac{(z\Lambda_q(t_m))^{m-1}}{(m-1)!}\exp\{-z\Lambda_q(t_m)\}\,dt_m,$$

and, thus,

$$P(L>T_m+\varphi(T_m),T_m\le\xi)=\int_0^\infty\int_0^\xi\exp\left\{-z\int_0^{t_m+\varphi(t_m)}p(u)\lambda(u)\,du\right\}z\,q(t_m)\lambda(t_m)\,\frac{(z\Lambda_q(t_m))^{m-1}}{(m-1)!}\exp\{-z\Lambda_q(t_m)\}\,dt_m\,\pi(z)\,dz. \tag{17}$$
Therefore, $R(\tau,\xi,m)$ is given by (15), whereas $S(\tau,\xi,m)$ can be obtained using (16) and (17), so that optimization problem (3) can be addressed. Note that (17) can be easily modified to the practically important case when the rescue operation is performed in a regime (loading) different from that of the mission. To account for this effect, we apply the cumulative exposure model (Nelson (1990), Sedjakin (1966)). In accordance with this model, the time till the successful completion of a rescue operation that was initiated at time $t$, i.e., $\varphi(t)$, should be adjusted as $\delta\varphi(t)$, where $\delta<1$ is an acceleration/deceleration factor used to reflect the change of the system 'loading' from a regular mission phase to a rescue one. Thus (17) is modified to:

$$P(L>T_m+\delta\varphi(T_m),T_m\le\xi)=\int_0^\infty\int_0^\xi\exp\left\{-z\int_0^{t_m+\delta\varphi(t_m)}p(u)\lambda(u)\,du\right\}z\,q(t_m)\lambda(t_m)\,\frac{(z\Lambda_q(t_m))^{m-1}}{(m-1)!}\exp\{-z\Lambda_q(t_m)\}\,dt_m\,\pi(z)\,dz. \tag{18}$$
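The double integral in (18) can be evaluated directly by nested quadrature. The sketch below does this with Simpson's rule, under the simplifying assumption of constant $p(t)=p$ and $\lambda(t)=\lambda$ (so that $\Lambda_q(t)=q\lambda t$); the frailty pdf and the rescue-duration function are supplied by the caller. The function names, the truncation point `z_max` and the grid sizes are hypothetical choices made for illustration.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def rescue_prob(xi, m, p, lam, delta, phi, frailty_pdf,
                z_max=40.0, nz=1000, nt=100):
    """P(L > T_m + delta * phi(T_m), T_m <= xi) from (18), assuming
    constant p and lam; the frailty integral is truncated at z_max."""
    q = 1.0 - p
    def inner(z):
        def g(t):
            # major-failure exposure accumulated up to the end of the rescue
            exposure = z * p * lam * (t + delta * phi(t))
            # pdf of the m-th minimal repair time, given frailty z
            density = (z * q * lam * (z * q * lam * t) ** (m - 1)
                       / math.factorial(m - 1) * math.exp(-z * q * lam * t))
            return math.exp(-exposure) * density
        return simpson(g, 0.0, xi, nt) * frailty_pdf(z)
    return simpson(inner, 0.0, z_max, nz)

if __name__ == "__main__":
    value = rescue_prob(xi=6.0, m=1, p=0.1, lam=0.1, delta=0.9,
                        phi=lambda t: 1.0,
                        frailty_pdf=lambda z: math.exp(-z))
    print("rescue term of the SSP:", value)
```

Adding this term to the MSP gives the SSP in (16). A useful sanity check is the case $p=0$: no major failure can occur, so the MSP and the rescue term must sum to one.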
Remark 2. In the next section, we will provide a detailed numerical example of the optimization model (3). Above all, it will also highlight the crucial distinction of the current abort mission model for heterogeneous systems from the abort mission model considered in Levitin & Finkelstein (2018). The abort mission policy in Levitin & Finkelstein (2018) was justified for deteriorating systems characterized by IFR distributions of the time to failure. Additionally, each shock decreased the remaining lifetime, which also contributed to the 'objective' deterioration in a system. In the example to follow, we consider a heterogeneous (via the PH model (4)) population of systems with an exponential baseline distribution (no aging!). It should be noted that we have no additional a priori information on the frailty $z$ of the specific system performing a mission and only the pdf of the frailty $\pi(z)$ is given. Following our discussion in Section 2, we can loosely state that, for instance, an increasing number of minimal repairs in a fixed interval of time means (after the Bayesian update of $\pi(z)$) that the estimate of a specific system lifetime (with respect to a major failure) is stochastically decreasing.
4. Numerical illustration

Consider a chemical reactor with a cooling system aimed at providing a desired temperature level. Each failure of the cooling system can be repaired (minimal repair), but it causes an increase of the reactor temperature, which eventually can result in a major failure and destruction of the reactor. Too frequent failures of a cooling system indicate that the corresponding frailty parameter is large for this system and that the level of danger for the reactor is unacceptable. The decision on termination of the technological process and unloading the reactor (the mission abort) to avoid its destruction and possible consequent losses can be made based on the suggested $(m,\xi)$ mission abort policy. During the reactor unloading, the cooling system must also perform, though with a lower load level. When the technological process is completed, the heat release is terminated and the major failure of the reactor caused by overheating cannot occur anymore.
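For the considered setting, assume that $p(t)=p$, $\lambda(t)=\lambda$ and $\pi(z)=\theta\exp\{-\theta z\}$. In this case (15) and (18) take the following forms:

$$R(\tau,\xi,m)=\int_0^\infty\exp\{-zp\lambda\tau\}\sum_{n=0}^{m-1}\frac{(zq\lambda\xi)^n}{n!}\exp\{-zq\lambda\xi\}\,\theta\exp\{-\theta z\}\,dz=\frac{\theta}{p\lambda\tau+\theta}\left[1-\left(\frac{q\lambda\xi}{\lambda(p\tau+q\xi)+\theta}\right)^m\right],$$

$$P(L>T_m+\delta\varphi(T_m),T_m\le\xi)=\int_0^\infty\int_0^\xi\exp\{-z\lambda(t_m+\delta\varphi(t_m))p\}\,z\lambda q\,\frac{(z\lambda q t_m)^{m-1}}{(m-1)!}\exp\{-z\lambda q t_m\}\,dt_m\,\theta\exp\{-\theta z\}\,dz.$$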
To solve the optimization problem (3), one has to determine the optimal values of the discrete variable $m$ and of the continuous variable $\xi$. As the value of $m$ in practical applications cannot be very large, the optimization algorithm can enumerate $m$, obtaining for each $m$ the optimal value of $\xi$ by applying any algorithm for one-dimensional optimization. In this paper, similar to Levitin & Finkelstein (2018), we enumerate $m$ up to the value of 20 and apply the golden section search algorithm for obtaining $\xi$ with precision $10^{-5}$. The running time for a single optimization problem on a 3.2 GHz PC does not exceed 1 min. For convenience of notation in this example, we will mostly drop the arguments in the MSP and the SSP, writing $R$ and $S$, respectively.

Figs. 1 and 2 present the mission abort policies $(m,\xi)$ that maximize the MSP subject to providing an SSP not less than 0.95 as functions of the parameters $p$ and $\theta$ ($\tau=6$, $\lambda=0.1$, $\delta=0.9$) for constant and increasing functions $\varphi(t)$, respectively. It can be seen that in the optimal mission abort policies, the number of minor failures/repairs $m$ allowed before the mission abort decreases with the increase of the major failure probability $p$. The time interval $\xi$ during which the mission abort is allowed increases in $p$ when $m$ is constant and drops when $m$ decreases. When $p$ exceeds certain values, no mission abort policy can provide an SSP larger than 0.95 (the corresponding curves 'stop' at these points). With the increase of $\theta$, smaller values of $z$ become more probable. This results in larger values of $R$ and $S$ and a milder mission abort policy (larger $m$ and smaller $\xi$ for the same $m$).
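The enumeration over $m$ combined with a golden section search over $\xi$ can be sketched as follows. This is not the authors' implementation; it is a minimal illustration under several assumptions. The setting is the one of this section (constant $p$ and $\lambda$, exponential frailty with rate $\theta$). The frailty integrals are carried out analytically, which gives the closed-form MSP and reduces the rescue term to a one-dimensional integral with integrand $\theta m(\lambda q)^m t^{m-1}/c(t)^{m+1}$, where $c(t)=\lambda p\,(t+\delta\varphi(t))+\lambda q t+\theta$; this reduction is derived here for illustration only. The constraint is handled by a simple penalty, which keeps the objective unimodal when $S$ increases and $R$ decreases in $\xi$ (the constant $\varphi(t)$ case discussed in this section); for other $\varphi(t)$ it is only a heuristic. All function names and parameter values are hypothetical.

```python
import math

def msp(tau, xi, m, p, lam, theta):
    """Closed-form MSP for constant p, lam and exponential frailty."""
    q = 1.0 - p
    ratio = q * lam * xi / (lam * (p * tau + q * xi) + theta)
    return theta / (p * lam * tau + theta) * (1.0 - ratio ** m)

def rescue_prob(xi, m, p, lam, theta, delta, phi, n=2000):
    """Rescue term of the SSP: 1-D Simpson integral over t_m after
    integrating the frailty analytically (illustrative reduction)."""
    if xi <= 0.0:
        return 0.0
    q = 1.0 - p
    def f(t):
        c = lam * p * (t + delta * phi(t)) + lam * q * t + theta
        return theta * m * (lam * q) ** m * t ** (m - 1) / c ** (m + 1)
    h = xi / n
    total = f(0.0) + f(xi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0

def ssp(tau, xi, m, p, lam, theta, delta, phi):
    return msp(tau, xi, m, p, lam, theta) + rescue_prob(xi, m, p, lam, theta, delta, phi)

def golden_section_max(f, a, b, tol=1e-5):
    """Golden section search for the maximizer of a unimodal f on [a, b];
    returns the final bracket (a, b)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                      # maximizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # maximizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return a, b

def optimize_policy(tau, s_min, p, lam, theta, delta, phi, m_max=20):
    """Enumerate m = 1..m_max; for each m search xi in [0, tau].
    Returns (m, xi, MSP) of the best feasible policy, or None."""
    best = None
    for m in range(1, m_max + 1):
        def objective(xi, m=m):
            s = ssp(tau, xi, m, p, lam, theta, delta, phi)
            # feasible points score their MSP; infeasible ones score below zero
            return msp(tau, xi, m, p, lam, theta) if s >= s_min else s - 1.0
        _, xi = golden_section_max(objective, 0.0, tau)
        if ssp(tau, xi, m, p, lam, theta, delta, phi) < s_min:
            continue                     # no feasible xi found for this m
        r = msp(tau, xi, m, p, lam, theta)
        if best is None or r > best[2]:
            best = (m, xi, r)
    return best

if __name__ == "__main__":
    print(optimize_policy(tau=6.0, s_min=0.95, p=0.1, lam=0.1, theta=1.0,
                          delta=0.9, phi=lambda t: 1.0))
```

The right end of the final bracket is returned as the candidate $\xi$ because, when $S$ increases in $\xi$, the feasible region lies to the right of the constraint boundary; feasibility is nevertheless re-checked before a candidate is accepted.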
Fig. 1. Optimal solutions of the max R s.t. S>0.95 problem for $\tau=6$, $\lambda=0.1$, $\delta=0.9$, and $\varphi(t)=1$ as functions of $p$ and $\theta$.

Fig. 2. Optimal solutions of the max R s.t. S>0.95 problem for $\tau=6$, $\lambda=0.1$, $\delta=0.9$, and $\varphi(t)=0.5+3t$ as functions of $p$ and $\theta$.
Additionally, it is also interesting to illustrate numerically the impact of the monotonicity properties of the function $\varphi(t)$ on $R$ and $S$. This is done in Fig. 3 for three different functions $\varphi(t)$. The constant $\varphi(t)=1$ corresponds to the situation when the reactor unloading time does not depend on the phase of the reaction upon termination. The increasing $\varphi(t)$ ($\varphi(t)=0.5+3t$) corresponds to the case when the amount of the product to be unloaded increases during the reaction, causing the increase of the unloading time. The decreasing $\varphi(t)$ ($\varphi(t)=18.5-3t$) corresponds to the case when the raw reagents are consumed during the reaction, producing a product that can be unloaded faster. It can be seen that for constant $\varphi(t)$, $R$ always decreases and $S$ always increases in $\xi$. However, in the case of increasing $\varphi(t)$, the increase in the allowed mission abort time can cause a decrease of the SSP, because the reactor unloading procedure performed close to the end of the mission becomes very long and, therefore, failure prone. From some value of $t$, we have $\varphi(t)\gg\tau-t$ and the probability of failure during the unloading exceeds the probability of failure during the remaining part of the primary mission and, thus, the SSP for large $\xi$ can decrease below $S(m,0)=R(m,0)$. This means that the mission abort at the advanced stages of the mission is not beneficial. When $\varphi(t)$ is decreasing and large enough in the beginning of the mission, the unloading procedure is so failure prone that the mission abort is never justified. Indeed, when $\varphi(0)>\tau$, the system has larger chances to fail during a very long rescue procedure than during the entire mission from its beginning.
Fig. 3. R and S as functions of decision variables m, for =1, =6, 0.1 , p=1, =0.9 and different functions (t).
5. Concluding remarks
This paper presents a new stochastic model for obtaining relevant operational characteristics of a system with two types of failures. A minor failure is minimally repaired, whereas a major failure is non-repairable during a mission and, therefore, terminates it. An important feature of our model is that we consider a heterogeneous population of systems with lifetimes indexed by random frailties. For a fixed realization of the frailty, the process of minimal repairs is, obviously, an NHPP. However, in the general case it is not an NHPP and, e.g., the remaining lifetime until the next minimal repair already depends on the process history. It turns out that this heterogeneity justifies the implementation of the optimal mission abort/continuation policy. If mission completion becomes problematic, the mission can be aborted and a rescue procedure can be activated. Assuming that the decision about the mission abort is made upon occurrence of the m-th minimal repair (the decision variable for the corresponding optimization problem), we present an original method of evaluating the corresponding MSP and SSP. An additional feature of the model is that the abort is never performed when the occurrence time of the m-th shock exceeds , which is the second decision variable for the bivariate optimization problem considered in this paper. Based on the obtained results, one can solve the corresponding optimization problem and find the optimal values of m and that balance the tradeoff between the MSP and the SSP. We illustrate our findings and approach with a detailed numerical example.
Further research in this direction can consider the case when the number of minimal repairs is finite (as, e.g., the number of spares in an autonomous system). Another possibility is to deal with the corresponding cost optimization problem with respect to the decision variables m and .
Acknowledgements
The authors would like to thank the reviewers for their helpful comments and advice. The work of the first author was supported by the Priority Research Centers Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0093827). The work of the second author was supported by the National Research Foundation (SA) (Grant no: 103613).
References
Block, H. W., Borges, W., & Savits, T. H. (1985). Age-dependent minimal repair. Journal of Applied Probability, 22, 370-386.
Brown, M., & Proschan, F. (1983). Imperfect repair. Journal of Applied Probability, 20, 851-862.
Cha, J. H., & Finkelstein, M. (2011a). On new classes of extreme shock models and some generalizations. Journal of Applied Probability, 48, 258-270.
Cha, J. H., & Finkelstein, M. (2011b). Burn-in and performance quality measures in heterogeneous populations. European Journal of Operational Research, 210, 273-280.
Cha, J. H., & Finkelstein, M. (2014). Some notes on the unobserved parameters (frailties) in reliability modelling. Reliability Engineering and System Safety, 123, 99-104.
Cuadras, C. M. (2002). On the covariance between functions. Journal of Multivariate Analysis, 81, 19-27.
Ebeling, C. (1997). An introduction to reliability and maintainability engineering. New York: McGraw-Hill.
Finkelstein, M. (2008). Failure rate modelling for reliability and risk. London: Springer.
Finkelstein, M., & Cha, J. H. (2013). Stochastic modelling for reliability: shocks, burn-in, and heterogeneous populations. London: Springer.
Frostig, E., & Kenzin, M. (2009). Availability of inspected systems subject to shocks - A matrix algorithmic approach. European Journal of Operational Research, 193, 168-183.
Huynh, K. T., Castro, I. T., Barros, A., & Berenguer, C. (2012). Modeling age-based maintenance strategies with minimal repairs for systems subject to competing failure modes due to degradation and shocks. European Journal of Operational Research, 218, 140-151.
Joe, H. (1997). Multivariate models and dependence concepts. Boca Raton: Chapman & Hall/CRC.
Kalbfleisch, J. D., & Prentice, R. L. (2002). The statistical analysis of failure time data. 2nd ed. Hoboken (NJ): Wiley.
Leemis, L. M. (1995). Reliability: probabilistic models and statistical methods. Englewood Cliffs (NJ): Prentice-Hall.
Lehmann, E. L. (1966). Some concepts of dependence. Annals of Mathematical Statistics, 37, 1137-1153.
Montoro-Cazorla, D., & Pérez-Ocón, R. (2011). Two shock and wear systems under repair standing a finite number of shocks. European Journal of Operational Research, 214, 298-307.
Montoro-Cazorla, D., & Pérez-Ocón, R. (2014). A reliability system under different types of shock governed by a Markovian arrival process and maintenance policy K. European Journal of Operational Research, 235, 636-642.
Levitin, G., Xing, L., Zhai, Q., & Dai, Y. (2016). Optimization of full vs. incremental periodic backup policy. IEEE Transactions on Dependable and Secure Computing, 13, 644-656.
Levitin, G., Xing, L., & Dai, Y. (2017). Mission abort policy in heterogeneous nonrepairable 1-out-of-N warm standby systems. To appear in IEEE Transactions on Reliability, DOI 10.1109/TR.2017.2740330.
Levitin, G., & Finkelstein, M. (2018). Optimal mission abort policy for systems in a random environment with variable shock rate. Reliability Engineering and System Safety, 169, 11-17.
Myers, A. (2009). Probability of loss assessment of critical k-Out-of-n: G systems having a mission abort policy. IEEE Transactions on Reliability, 58, 694-701.
Nakagawa, T. (2007). Shocks and damage models in reliability theory. London: Springer.
Nelson, W. (1990). Accelerated testing, statistical models, test plans and data analysis. New York: Wiley.
Ross, S. M. (1996). Stochastic processes, 2nd ed. New York: John Wiley & Sons, Inc.
Rausand, M., & Høyland, A. (2003). System reliability theory: models, statistical methods, and applications, 2nd ed. New Jersey: Wiley.
Sedjakin, N. (1966). On one physical principle of reliability theory (in Russian). Techn. Kibernetika, 3, 80-82.
Shaked, M., & Shanthikumar, J. (2007). Stochastic Orders. London: Springer.