Optimal mission abort policies for multistate systems




Journal Pre-proof

Optimal mission abort policies for multistate systems
Gregory Levitin, Maxim Finkelstein, Hong-Zong Huang

PII: S0951-8320(19)30635-0
DOI: https://doi.org/10.1016/j.ress.2019.106671
Reference: RESS 106671

To appear in: Reliability Engineering and System Safety

Received date: 16 May 2019
Revised date: 15 September 2019
Accepted date: 24 September 2019

Please cite this article as: Gregory Levitin, Maxim Finkelstein, Hong-Zong Huang, Optimal mission abort policies for multistate systems, Reliability Engineering and System Safety (2019), doi: https://doi.org/10.1016/j.ress.2019.106671

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Highlights 

A multi-state system with variable performance is considered;

The system performs a mission in a random environment modeled by a renewal process of shocks;

The mission is aborted and a rescue operation is activated upon occurrence of the m-th shock;

An algorithm for obtaining the mission success and system survival probabilities is presented;

An optimal mission abort strategy balancing these probabilities is obtained.


Optimal mission abort policies for multistate systems
Gregory Levitin (a,b), Maxim Finkelstein (c,d), Hong-Zong Huang (a)

a Center for System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, PR China
b The Israel Electric Corporation, P. O. Box 10, Haifa 31000, Israel. E-mail: [email protected]
c University of the Free State, Bloemfontein, South Africa
d ITMO University, St. Petersburg, Russia. E-mail: [email protected]

Abstract. All previous research on optimal mission abort policies was devoted to binary systems that can be in only two states, i.e., operable or failed. This paper considers mission abort and rescue strategies for multistate systems that, apart from the completely operable state and the state of total failure, can operate in intermediate states with different levels of performance. A system operates in a random environment modeled by a renewal process of shocks. With each shock, the state of the system can deteriorate with certain probabilities, which can eventually result in total failure. Therefore, in order to increase the system’s survival probability, a mission can be aborted and a rescue procedure can be activated. The trade-off between the mission success probability and the system’s survival probability is studied, and an optimal number of shocks for initiating the abort procedure is defined by solving the corresponding optimization problem. A detailed numerical example illustrates our findings.

Keywords: multi-state system; mission success probability; system survival probability; mission abort; renewal process of shocks.


Acronyms
MSP: mission success probability
SSP: system survival probability
PM: primary mission
RP: rescue procedure
PL: performance level
pdf: probability density function
pmf: probability mass function

Notation
$p_i(k)$: probability that the system is in state i after the k-th shock
$F_M(t)$, $f_M(t)$: cdf and pdf of the inter-arrival time of shocks during the PM
$F_R(t)$, $f_R(t)$: cdf and pdf of the inter-arrival time of shocks during the RP
$X_k$ ($Y_k$): random amount of PM (RP) work completed by the system before the occurrence of the k-th shock
$q_k(x)$, $\tilde q_h(y)$: pdfs of $X_k$ and $Y_h$
$\psi(X_m)$: amount of work in the RP when the amount of work $X_m$ has been performed during the PM before the mission abort
$g_i$: system’s performance level in state i
$v_{i,j}(k)$: probability of transition from state i to state j caused by the k-th shock
$W$: total amount of work to be performed in the PM
$m$: maximum number of shocks allowed during the PM
$G(k)$: random performance level of the system after the k-th shock
$Z$: MSP
$R$: SSP
$S$: RP success probability
$\delta$: attack resistance deterioration factor

1. Introduction

In recent years, a number of studies discussing optimal abort policies for various missions have been published [1-7]. A mission abort option is usually considered for systems whose failures result in substantial economic losses or for safety-critical systems such as submarines, aircraft, manned space flights, sensitive data storage facilities, etc. For such systems, survival can become more important than accomplishing the mission. The abort policy can be initiated when, e.g., the risk of failure during the rest of the mission becomes sufficiently large. The classic example of this setting is the one reported in the pioneering paper by Meyers [1], where a parallel system of n i.i.d. components is considered. When the number of failed components reaches a critical level, the mission is aborted and the system rescue procedure is activated. A more general practical situation was considered in [7], where a multi-engine aircraft had to abort its mission and initiate an emergency landing when a certain subset of engines failed. Another example of a condition that causes a mission abort is the deterioration of an air pressure control subsystem to a certain risky level during a manned space flight [2]. Note that the example of an unmanned aerial vehicle (UAV) deployed in critical applications was used in a number of publications (see, e.g., [2-4] and [7-8]). These devices often operate in a random environment that can be modelled by stochastic point processes of shocks (e.g., electromagnetic impulses, enemy attacks, etc.). Each shock decreases the remaining lifetime of the UAV. Thus, when the corresponding deterioration reaches a certain level, the primary mission (PM) should be aborted and the rescue procedure (RP) should be initiated. Note also that various general shock models were intensively studied in the literature for different reliability applications (see, e.g., the monographs devoted to this subject: Nakagawa [9], Finkelstein and Cha [10]).

When defining a mission abort policy, one should consider two metrics: the mission success probability (MSP) and the system’s survival probability (SSP). An abort policy that allows the mission to continue when the probabilities of mission failure and system loss are large results in smaller values of the SSP, although it gives the system more chances to complete its PM. More rigid abort policies result in larger values of the SSP at the price of decreased values of the MSP. Therefore, the optimal abort policy should achieve a desirable balance (trade-off) between these two conflicting metrics, which is the goal of our optimization procedure.

In reality, many engineering systems are multistate, i.e., they have more than one operating state, and the system’s performance level (PL) varies from state to state. Reliability theory and applications for multistate systems were intensively developed in the last two decades (see, e.g., monographs [11-13] entirely devoted to this topic). There are also numerous publications that consider the effects of external shocks on the performance of multistate systems (see, for example, [14-18], to name a few). To the best of our knowledge, all previous works have studied mission abort rules and the MSP-SSP trade-off only for binary systems and missions of fixed duration. In this paper, we are the first to consider mission abort rules for multistate systems and suggest a methodology for evaluating and balancing the MSP and the SSP for such systems. Dealing with multistate systems whose states are described by different levels of performance presents a certain challenge, as in this case the mission duration becomes random, whereas the amount of work that should be performed for the success of a mission stays fixed. Moreover, the corresponding probabilistic analysis becomes more complex. Therefore, in order to consider optimal trade-off problems, new methods for deriving the MSP and the SSP for the considered class of systems have to be developed first.

As a motivating example, consider a multi-engine UAV performing a surveillance mission (PM) in a hostile environment. The UAV should cover a distance W (which is ‘work’, in our terms) to complete its mission. Malicious attacks (shocks) can destroy some of the UAV’s engines, reducing its speed (PL). Depending on the number of operating engines, the UAV state (and the corresponding speed) can change discretely. If the chances to complete the mission are low, the UAV can abort it and start an emergency landing operation (RP). The distance to the closest emergency landing field depends on the UAV location when the mission was aborted. During the emergency landing, the UAV remains exposed to further attacks. The time needed to complete the PM or the RP depends on the UAV’s speed: a decrease in speed increases the time needed to complete the PM or the RP and therefore increases the probability that the UAV will experience additional shocks before completing them. The UAV completes the mission if it covers the distance W without a total failure. The UAV is lost if it fails (enters the state with zero PL and crashes) before completing either the PM or the RP. Another example, describing an online data processing system exposed to hackers’ attacks, is considered in detail in Section 5.


The paper is organized as follows. In Section 2, we describe the optimization problem and provide the necessary explanations. Section 3 presents a model describing the multistate system’s deterioration under shocks, whereas Section 4 presents the derivation of the MSP and the SSP. In Section 5, a detailed illustrative example is provided. Some concluding remarks are given in Section 6.

2. Problem formulation

Assume that during the PM a system should perform a fixed amount of work (e.g., operations), W. We consider missions of relatively short duration, in which all failures are induced by external shocks and the probabilities of ‘internal failures’ are negligible. Generalization to the case when these probabilities are not negligible can be a topic for further research, especially when there is stochastic dependence between the shock process and the mechanisms of internal failures and transitions between states.

A system operates in a random environment modeled by the point process of external shocks $\{N(t), t \ge 0\}$, where $N(t)$ is the number of shocks in $[0,t)$ and $T_1 < T_2 < \dots$ are the random arrival times of shocks. We assume that the inter-arrival times of this process are i.i.d. with cdfs $F_M(t)$ and $F_R(t)$ for the PM and the RP, respectively, thus forming the corresponding renewal processes. It should be noted that in most reliability applications the shock process is assumed to be an NHPP (nonhomogeneous Poisson process), as, due to the independent increments property, it often allows for relatively simple closed-form solutions for the probabilities of interest (see, e.g., Cha and Finkelstein [19] and references therein). The renewal assumption, on the other hand, does not lead to this type of result, although it is also often met in practice. For instance, renewal processes can model recurrent external shocks that are gradually ‘building up’. In this case, the time to a shock’s occurrence on each renewal cycle usually depends on the time it has already been building up (thus violating the independent increments assumption of the NHPP) [20]. An example of this type is considered in Section 5. As our approach is based on recursive procedures, the renewal assumption fits naturally into our framework, whereas the NHPP of shocks is left for future research. Indeed, the inter-arrival times of the NHPP are not i.i.d. and depend on the corresponding arrival times. Obviously, when the inter-arrival times are exponentially distributed, the two processes coincide.

The multistate system can have N+1 states characterized by different performance levels (PLs) $g_i$, $i \in \{0,\dots,N\}$, where state 0 corresponds to the total failure with $g_0 = 0$, and $g_i > g_j$ for $i > j$. It is assumed that the system starts the mission in state N. With each shock, the system can deteriorate (or remain in the same state), gradually moving towards the state of total failure, which should be avoided (more precisely, its probability should be decreased) by properly aborting the PM and performing the corresponding RP. Thus, the number of shocks experienced during the PM can serve as the decision parameter for aborting the PM and activating the RP. If the PM is aborted upon occurrence of the m-th shock when the system has completed an amount of work $x < W$, the RP requiring an amount of work $\psi(x)$ is activated (see an example of this function in Section 5). If the system enters the total failure state before completing either the PM or the RP, it is lost.

Fig. 1 presents an example of a mission performed by a system that can have eight states and starts the PM in state 7, characterized by the maximum PL $g_7$. During the mission, the system experiences shocks and deteriorates to lower states. The state transition diagram shows that the first shock results in a transition from state 7 to state 6. The second shock results in a transition from state 6 to state 4, shock #3 causes no state transition, shock #4 results in a transition from state 4 to state 3, shock #5 results in a transition from state 3 to state 1 and, finally, shock #6 results in a transition from state 1 to state 0, the state of total failure. Each transition corresponds to a reduction of the system PL. If the PM is aborted after the 4th shock (m=4), when the system enters state 3, the RP is successfully completed (although during the RP execution an additional shock ‘moves’ the system to state 1). If m>5, the PM is not aborted and the system continues operation until mission completion or total failure (entering state 0), whichever comes first. In the latter case (as depicted in Fig. 1), the system is lost.

Fig. 1. Example of an aborted mission.
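To make the mission dynamics described above concrete, here is a minimal Monte Carlo sketch in Python. It is not the authors’ method (their approach is analytical), and all function names are hypothetical. It assumes i.i.d. inter-arrival times within each phase, a callback `transition_row(k, i)` that returns the post-shock state probabilities from state i for the k-th shock, and a function `psi(x)` giving the RP work after an abort with PM work x completed. The usage example checks a binary special case against a closed form: with Poisson shocks, each fatal with probability v, and no aborts, the PM succeeds with probability exp(-vW/(g·eta)).

```python
import math
import random

def sample_next_state(row, rng):
    """Draw the post-shock state j with probability row[j] (row i of V_k)."""
    u, acc = rng.random(), 0.0
    for j, prob in enumerate(row):
        acc += prob
        if u < acc:
            return j
    # Guard against a row sum slightly below 1 because of rounding.
    return max(j for j, prob in enumerate(row) if prob > 0.0)

def simulate_mission(W, m, g, transition_row, psi, draw_pm, draw_rp, rng):
    """Simulate one mission (hypothetical helper); return (PM completed, survived).

    g[i]: PL of state i with g[0] = 0; the system starts in state len(g) - 1.
    transition_row(k, i): assumed callback giving row i of V_k for the k-th shock.
    psi(x): RP work required after an abort with PM work x completed.
    draw_pm / draw_rp: samplers of shock inter-arrival times in each phase.
    m: abort upon the m-th survived shock (math.inf means never abort).
    """
    state, k = len(g) - 1, 0
    phase, target, done = "PM", W, 0.0
    while True:
        t = draw_pm(rng) if phase == "PM" else draw_rp(rng)
        if done + g[state] * t >= target:     # phase finished before the next shock
            return phase == "PM", True
        done += g[state] * t
        k += 1                                # shocks are counted over the whole mission
        state = sample_next_state(transition_row(k, state), rng)
        if state == 0:                        # total failure: the system is lost
            return False, False
        if phase == "PM" and k == m:          # abort and start the RP
            phase, target, done = "RP", psi(done), 0.0

# Binary sanity check (illustrative numbers): one operable state, each shock fatal
# with probability v, exponential inter-arrival times with mean eta, no aborts.
v, eta, W = 0.3, 60.0, 100.0
rng = random.Random(1)
runs = 20000
wins = sum(simulate_mission(W, math.inf, [0.0, 1.0], lambda k, i: [v, 1.0 - v],
                            lambda x: 0.0, lambda r: r.expovariate(1.0 / eta),
                            lambda r: r.expovariate(1.0 / eta), rng)[0]
           for _ in range(runs))
print(wins / runs, math.exp(-v * W / eta))    # estimate vs. closed form
```

A simulator of this kind can serve as an independent cross-check of numerically computed MSP and SSP values, at the cost of sampling error.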

The MSP $Z(m,W)$ is the probability that the system completes the PM before it is aborted or lost. The SSP $R(m,W)$ is the probability that the system successfully completes either the PM or the RP. The optimal value of m can be obtained as a solution of the following constrained optimization problem:

$$\max_{m} Z(m,W) \quad \text{s.t.} \quad R(m,W) \ge R^{*}, \qquad (1)$$

where $R^*$ is a predetermined desired level of the SSP. When a mission failure and a system’s loss are associated with the costs $C_F$ and $C_L$, respectively, a cost minimization problem with respect to the decision parameter m can also be considered. If a system is successfully rescued, the cost is just $C_F$, whereas when a total failure occurs during the PM or the RP, the cost is $C_F + C_L$. Thus, the expected cost that should be minimized is

$$C(m,W) = (1-R(m,W))(C_F+C_L) + (R(m,W)-Z(m,W))\,C_F = C_F(1-Z(m,W)) + C_L(1-R(m,W)), \qquad (2)$$

where $1-R(m,W)$ is the probability of a failure of the system during the PM or the RP and $R(m,W)-Z(m,W)$ is the probability that the system was successfully rescued after aborting its PM. Our main focus in Section 5 will be on the optimization problem (1); however, a brief numerical analysis of (2) will also be presented.

3. Modelling system’s deterioration under shocks

We consider systems that operate in a random environment modeled by a process of external shocks. In the simplest extreme shock model [19-20] for a binary system, each shock results in a failure of the system with probability v and is survived with the complementary probability 1-v. Thus, after any number of shocks k, the system can be in only two states, with probabilities $p_1(k) = (1-v)^k$ and $p_0(k) = 1 - p_1(k)$. As this setting is described in the discrete ‘time’ scale k, these probabilities obviously do not depend on the type of the point process of shocks, which can be, e.g., a Poisson or a renewal process. However, when the description is in real operating time, the type of the process matters (see the next section).

As shocks usually have random severity, they can affect the system’s performance in each state in a different way, resulting in its gradual deterioration (see the examples in the Introduction). To describe this process of deterioration, we introduce state transition probabilities that depend on the number of shocks experienced by the system. Let the system start operation at time t=0 in the initial state N with the maximal PL $g_N$. Given this initial condition, we are interested in the probabilities of the system being in a particular state i after experiencing k shocks, i.e.,

$$p_i(k) = \Pr(G(k) = g_i), \quad i = 0,1,\dots,N; \qquad p_N(0) = 1, \quad p_i(0) = 0 \ \text{for } i < N,$$

where $G(k)$ is the random PL of the system after k shocks (the failure state i=0 is absorbing). Note that, for convenience, we denote the states via their PLs. Similar to the one-step transition probabilities in discrete Markov chains, define

$$v_{i,j}(k) = \Pr(G(k) = g_j \mid G(k-1) = g_i) \qquad (3)$$

as the conditional probability that after the k-th shock the system is in state j, given that after the (k-1)-th shock it was in state i (one-step probabilities). This obviously includes the case of non-effective shocks, when the system remains in the same state after a shock. Thus, by the definition of $v_{i,j}(k)$, our process possesses the Markov property (i.e., it is a discrete-time Markov chain, in general nonhomogeneous, as $v_{i,j}(k)$ depends on k). As all shocks result either in deterioration or in keeping the current state, we have

$$v_{i,j}(k) = 0 \ \text{for } j > i \quad \text{and} \quad \sum_{j=0}^{i} v_{i,j}(k) = 1. \qquad (4)$$

Therefore, as a special case of the Chapman-Kolmogorov equations, the following recursive formula holds:

$$p_j(k) = \sum_{i=j}^{N} p_i(k-1)\, v_{i,j}(k). \qquad (5)$$

Defining the vector of state probabilities after the k-th shock as $P_k = (p_0(k),\dots,p_N(k))^T$ and the matrix $V_k$ of state transition probabilities for the k-th shock with elements $v_{i,j}(k)$ (row i, column j) determined by (3) and (4), one can obtain $P_k$ recursively as $P_k = V_k^T P_{k-1}$.
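The recursion (5) in matrix form can be sketched in plain Python as follows. The matrix is stored row-wise with `V[i][j]` holding the transition probability from state i to state j, which is why the update amounts to multiplying by the transpose. The 3-state matrix is hypothetical and only illustrates the mechanics; the helper names are illustrative, not from the paper.

```python
def next_state_probs(p_prev, V):
    """Eq. (5): p_j(k) = sum over i >= j of p_i(k-1) * v_{i,j}(k).

    V is stored row-wise, V[i][j] = v_{i,j}(k), so this equals P_k = V_k^T P_{k-1}.
    """
    n = len(p_prev)
    return [sum(p_prev[i] * V[i][j] for i in range(j, n)) for j in range(n)]

def check_rows(V, tol=1e-12):
    """Check condition (4): no upward transitions and unit row sums."""
    for i, row in enumerate(V):
        if any(x != 0.0 for x in row[i + 1:]):
            raise ValueError(f"row {i} has an upward transition")
        if abs(sum(row) - 1.0) > tol:
            raise ValueError(f"row {i} does not sum to 1")

# Hypothetical 3-state matrix (states 0, 1, 2), used for every shock here.
V = [[1.0, 0.0, 0.0],
     [0.3, 0.7, 0.0],
     [0.1, 0.2, 0.7]]
check_rows(V)
p = [0.0, 0.0, 1.0]              # p_N(0) = 1
for k in range(2):
    p = next_state_probs(p, V)
print(p)                         # approximately [0.23, 0.28, 0.49]
```

The same function also reproduces the binary extreme shock model: with the 2-state matrix [[1, 0], [v, 1 - v]], the probability of the operable state after k shocks is (1 - v)^k.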

4. Derivation of MSP and SSP

In what follows, we deal with the system’s behavior in real (mission) time. Therefore, the specific point process of shocks must now be considered, which, in accordance with our assumption, is a renewal process. As described in the previous section, the system deteriorates gradually (stochastically) with the increase of the number of survived shocks k. Therefore, the level of this deterioration is defined only by k. In accordance with the predetermined abort policy, a mission is aborted when k reaches some fixed level m. Thus, let m be the number of shocks survived by the system after which its PM should be aborted. This m will be the decision parameter in the optimization problem considered further.

The work performed in each state depends on this state via the corresponding PL. Thus, the cumulative amount of work depends on the random times the system has spent in different states prior to the current moment. Let $X_k$ be a random variable representing the amount of work completed by the system during the PM before the occurrence of the k-th shock, and let $q_k(x)$ be the pdf of $X_k$. The system starts operation in state N (with the maximal performance level $g_N$) at time 0, when $X_0 = 0$ operations (work) are completed. Thus, by definition, the corresponding pmf of $X_0$ can be formally defined as

$$q_0(x) = \begin{cases} 1 & \text{for } x = 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

(Note that we need this formal definition as the initial condition for the recursive procedure that follows.) In general, the work performed in state i during time t can be represented by a specific function $\omega_i(t)$. In most practical cases, the performed work is proportional to the time the system spends in this state, i.e., if the system operates in state i between the (k-1)-th and the k-th shocks during time t, the amount of work performed between these shocks is $\omega_i(t) = g_i t$. Thus, in this case, if $X_{k-1} = x$, then $X_k = x + g_i t$ or, equivalently (the form used further), if $X_{k-1} = x - g_i t$, then $X_k = x$. Having the probabilities $p_i(k-1) = \Pr(G(k-1) = g_i)$ and the pdf $f_M(t)$ of the shock inter-arrival times during the primary mission, one can obtain the pdf $q_k(x)$ recursively as

$$q_k(x) = \sum_{i=1}^{N} p_i(k-1) \int_0^{x/g_i} q_{k-1}(x - t g_i)\, f_M(t)\, dt, \quad 0 \le x \le W. \qquad (7)$$
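Recursion (7) has no closed form in general, so a numerical sketch is useful. The version below is an assumption, not the authors’ code: it replaces the pdf by probability masses on a grid of step dx (so that the point mass (6) is representable) and assigns to each grid cell the probability that the inter-arrival time falls into the corresponding time bin, with the last bin capped at W. As in (7), the state after the (k-1)-th shock is weighted by its probability independently of the work already done. The exponential cdf with scale 60 mirrors the PM setting of Section 5; the helper names are hypothetical.

```python
import math

def exp_cdf(scale):
    """Exponential cdf F(t) = 1 - exp(-t/scale), i.e. a Weibull cdf with shape 1."""
    return lambda t: 1.0 - math.exp(-t / scale) if t > 0 else 0.0

def advance_work_pmf(q_prev, p_prev, g, F, dx):
    """One step of recursion (7) on the grid x_j = j*dx, j = 0..n (x_n = W).

    q_prev[j]: probability mass of X_{k-1} near x_j (a pmf, so the point mass
    q_0 of (6) is representable); p_prev[i] = p_i(k-1); g[i] = g_i.
    Work increments that would pass W are dropped: those paths completed the PM.
    """
    n = len(q_prev) - 1
    q_new = [0.0] * (n + 1)
    for i in range(1, len(g)):                 # operable states only (g_0 = 0)
        if p_prev[i] == 0.0:
            continue
        for j, mass in enumerate(q_prev):
            if mass == 0.0:
                continue
            for jj in range(j, n + 1):
                # The increment x_jj - x_j (a bin of width dx, capped at W - x_j)
                # is produced in state i during time increment / g_i.
                lo = max((jj - j - 0.5) * dx, 0.0)
                hi = min((jj - j + 0.5) * dx, (n - j) * dx)
                if hi > lo:
                    q_new[jj] += p_prev[i] * mass * (F(hi / g[i]) - F(lo / g[i]))
    return q_new

W, dx = 100.0, 1.0
n = int(round(W / dx))
g = [0.0, 1.0, 1.2, 1.7, 2.1]                # PLs of the Section 5 example
q0 = [1.0] + [0.0] * n                        # eq. (6): X_0 = 0 with probability 1
p0 = [0.0, 0.0, 0.0, 0.0, 1.0]                # p_N(0) = 1
F_M = exp_cdf(60.0)
q1 = advance_work_pmf(q0, p0, g, F_M, dx)
print(sum(q1), F_M(W / g[4]))                 # both are Pr(first shock before X reaches W)
```

Because the last bin is capped at W, the masses telescope, so the total mass of q_1 equals the probability that the first shock arrives before the PM is completed; this gives a cheap consistency check of the discretization.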

The system completes the PM (amount of work W) after k shocks when it operates in a state i>0 (i=1,2,…,N) after the k-th shock and no additional shocks occur during the time $(W - X_k)/g_i$ needed to complete the mission. Thus, in accordance with the law of total probability, the probability that the system completes the PM after k shocks is

$$z_k(W) = \sum_{i=1}^{N} p_i(k) \int_0^W q_k(x) \left[1 - F_M\!\left(\frac{W-x}{g_i}\right)\right] dx. \qquad (8)$$

As the PM is aborted upon occurrence of the m-th shock, the maximal k in (8) is m-1. The corresponding events (the system completes the PM after different numbers of shocks) are mutually exclusive. Therefore, the overall MSP takes the form

$$Z(m,W) = \sum_{k=0}^{m-1} z_k(W). \qquad (9)$$

When the m-th shock occurs and the system survives it (remains in a state i>0), the RP is activated. In this case, the amount of work $\psi(X_m)$ should be accomplished to save the system under the conditions of the RP, which are modeled by the renewal process of shocks with the inter-arrival time pdf $f_R(t)$. For the RP phase, we use a procedure similar to the one above for the PM phase. Let $Y_h$ be the random variable representing the amount of RP work completed by the system before the occurrence of the h-th shock ($h \ge m$), and let $\tilde q_h(y)$ be the pdf of $Y_h$. We obtain

$$\tilde q_m(y) = \begin{cases} 1 & \text{for } y = 0, \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

and

$$\tilde q_h(y) = \sum_{i=1}^{N} p_i(h-1) \int_0^{y/g_i} \tilde q_{h-1}(y - t g_i)\, f_R(t)\, dt, \quad 0 \le y \le \psi(X_m), \ h > m. \qquad (11)$$

Given $X_m = x$, the probability that the system completes the RP after the h-th shock is

$$\sum_{i=1}^{N} p_i(h) \int_0^{\psi(x)} \tilde q_h(y) \left[1 - F_R\!\left(\frac{\psi(x)-y}{g_i}\right)\right] dy. \qquad (12)$$

The overall probability that the system completes the RP after the h-th shock is

$$s_h(W) = \int_0^W q_m(x) \sum_{i=1}^{N} p_i(h) \int_0^{\psi(x)} \tilde q_h(y) \left[1 - F_R\!\left(\frac{\psi(x)-y}{g_i}\right)\right] dy\, dx. \qquad (13)$$

Thus, the total probability of RP success is

$$S(W) = \sum_{h=m}^{\infty} s_h(W). \qquad (14)$$

In fact, when the system’s shock resistance deteriorates, the values of $p_i(k)$ for $i>0$ become negligible and $p_0(k)$ approaches 1 for $k > k^*$, where $k^*$ is sufficiently large (see an example of obtaining $k^*$ in the next section). Therefore, for computational purposes, S can in practice be obtained as the finite sum

$$S(W) \approx \sum_{h=m}^{k^*} s_h(W). \qquad (15)$$

As completion of the PM and completion of the RP are mutually exclusive events (and in both of them the system survives), the SSP is obtained as

$$R(m,W) = Z(m,W) + S(W). \qquad (16)$$
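The RP part, (10)-(16), can reuse the same grid discretization as a sketch (an assumption, not the authors’ implementation): for every grid value x carrying mass q_m(x), the RP work (denoted psi in the code) is rounded to the grid, the recursion (11) is run with F_R, and the completion probabilities (12) are weighted by q_m(x) as in (13). The parameter `k_star` plays the role of k* in (15); all helper names are hypothetical. In the usage example shocks are harmless, so every mission ends with either PM or RP completion and R must be close to 1; the linear RP work 5 + 0.12x is taken from the Section 5 example.

```python
import math

def weibull_cdf(scale, shape):
    """Return F(t) = 1 - exp(-(t/scale)**shape); shape = 1 is the exponential case."""
    return lambda t: 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0

def advance_pmf(q_prev, p_prev, g, F, dx):
    """Discretized recursion (7) or (11): masses on the grid 0, dx, ..., (len(q_prev)-1)*dx."""
    n = len(q_prev) - 1
    q_new = [0.0] * (n + 1)
    for i in range(1, len(g)):
        for j, mass in enumerate(q_prev):
            if mass == 0.0 or p_prev[i] == 0.0:
                continue
            for jj in range(j, n + 1):
                lo = max((jj - j - 0.5) * dx, 0.0)
                hi = min((jj - j + 0.5) * dx, (n - j) * dx)   # last bin capped at the target
                if hi > lo:
                    q_new[jj] += p_prev[i] * mass * (F(hi / g[i]) - F(lo / g[i]))
    return q_new

def finish_prob(q, p_now, g, F, dx):
    """Pr(remaining work is done before the next shock), as in (8) and (12)."""
    n = len(q) - 1
    return sum(p_now[i] * sum(mass * (1.0 - F((n - j) * dx / g[i]))
                              for j, mass in enumerate(q))
               for i in range(1, len(g)))

def mission_probabilities(m, W, dx, g, p, F_M, F_R, psi, k_star):
    """Return (Z, S, R) following (8)-(16); p[k][i] = p_i(k) for k = 0..k_star."""
    n = int(round(W / dx))
    q = [1.0] + [0.0] * n                          # q_0, eq. (6)
    Z = 0.0
    for k in range(m):                             # eq. (9): k = 0..m-1
        Z += finish_prob(q, p[k], g, F_M, dx)
        q = advance_pmf(q, p[k], g, F_M, dx)       # after the loop, q holds q_m
    S = 0.0
    for j, mass in enumerate(q):                   # outer integral over x in (13)
        if mass == 0.0:
            continue
        n_y = max(1, int(round(psi(j * dx) / dx)))  # RP work psi(x) rounded to the grid
        q_t = [1.0] + [0.0] * n_y                  # eq. (10)
        for h in range(m, k_star + 1):             # truncated sum (15)
            if h > m:
                q_t = advance_pmf(q_t, p[h - 1], g, F_R, dx)   # eq. (11)
            S += mass * finish_prob(q_t, p[h], g, F_R, dx)     # eqs. (12)-(13)
    return Z, S, Z + S                             # eq. (16)

W, dx, m, k_star = 100.0, 1.0, 2, 22
g = [0.0, 1.25]                   # hypothetical single operable state
p = [[0.0, 1.0]] * (k_star + 1)   # harmless shocks: p_1(k) = 1 for every k
Z, S, R = mission_probabilities(m, W, dx, g, p, weibull_cdf(60.0, 1.0),
                                weibull_cdf(100.0, 1.0), lambda x: 5.0 + 0.12 * x, k_star)
print(Z, S, R)
```

With a multistate setting, `p` would come from the state recursion (5) and `g` from the example’s PLs; the cost grows with the number of grid points, so a coarser `dx` is a reasonable first try.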

Having the MSP and the SSP derived above, one can solve the optimization problems (1) and (2) for specific settings, as in the examples that follow.
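Once Z(m,W) and R(m,W) have been evaluated for a set of candidate thresholds, problems (1) and (2) reduce to a search over a short list. The sketch below uses made-up values purely to show the mechanics; they are not results from this paper. `math.inf` stands for "never abort", and the function names are hypothetical.

```python
import math

def best_abort_threshold(Z, R, r_star):
    """Problem (1): maximize Z(m,W) over candidate m subject to R(m,W) >= R*.

    Z, R: dicts mapping candidate m (math.inf = never abort) to precomputed values.
    Returns None when no candidate satisfies the SSP constraint.
    """
    feasible = [m for m in Z if R[m] >= r_star]
    return max(feasible, key=lambda m: Z[m]) if feasible else None

def expected_cost(z, r, c_f, c_l):
    """Eq. (2): C = (1-R)(C_F+C_L) + (R-Z)C_F = C_F(1-Z) + C_L(1-R)."""
    return c_f * (1.0 - z) + c_l * (1.0 - r)

# Made-up MSP/SSP values for illustration only (not results from the paper).
Z = {1: 0.30, 2: 0.55, 3: 0.65, math.inf: 0.70}
R = {1: 0.90, 2: 0.80, 3: 0.72, math.inf: 0.70}
print(best_abort_threshold(Z, R, 0.75))    # 2
print(best_abort_threshold(Z, R, 0.95))    # None
costs = {m: expected_cost(Z[m], R[m], c_f=1.0, c_l=10.0) for m in Z}
print(min(costs, key=costs.get))           # 1
```

With a large loss cost C_L, the cost criterion favours early aborts; with the constraint form (1), the answer moves from "never abort" towards m=1 as R* grows.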

5. Illustrative examples

Consider an online system that performs a data processing task (PM) involving W=100 mega-operations. The system is exposed to hackers’ attacks aimed at preventing mission success and destroying data. Each successful attack can corrupt memory blocks, which decreases the data processing speed (PL of the system). When all memory blocks are corrupted, the system totally fails and the data is lost. The data processing speed can vary discretely depending on the number of uncorrupted memory blocks. The total number of memory blocks is four; thus, the system can be in five states with 0, 1, 2, 3, 4 operable blocks, respectively, and with the corresponding PLs (data processing speeds measured in mega-operations per hour) $g_0=0$, $g_1=1$, $g_2=1.2$, $g_3=1.7$, $g_4=2.1$.

A hacker’s attack can corrupt memory blocks if it penetrates the system’s protection, which uses dynamically changing protection codes. The attacker’s software uses a time-consuming algorithm to penetrate the protection. When the attacker uses pure random enumeration (brute force), the penetration probability is the same at any time. This can be modelled by a renewal process of shocks (successful penetrations into the system) with exponentially distributed inter-arrival times. When the attacker’s algorithm uses some search truncation rules, its initial efficiency can be lower because it has to store and process information about the codes already checked. However, the search efficiency can increase with time as the truncation rules become stricter. In this case, the probability that an attack penetrates the protection in the time interval [t,t+dt) from the start of the attack increases in t. This is modelled by a renewal process of shocks with Weibull-distributed inter-arrival times. Both cases are formally modelled by the Weibull distribution with the scale parameter $\eta_M=60$ and the shape parameter $\beta_M=1$ (exponential distribution) or $\beta_M=1.7$ (Weibull distribution).

Assume that deterioration from any state j to the state $j-u>0$ (destruction of u memory blocks in an attack) as a result of the k-th shock has the same probability for any j, which is reflected in the following specific form of the transition probability matrix $V_k$:

$$v_{i-u,j-u}(k) = v_{i,j}(k) \quad \text{for } j > u. \qquad (17)$$

In fact, this means that the transition probabilities between states i and j (for j>0) do not depend on i and j themselves but only on their difference i-j, which makes sense as a simplifying assumption. After each shock that does not destroy the system totally, i.e., transfers it from a state i>0 to a state j ($i \ge j > 0$), an attacker can gain information about the system that increases the probability of total system destruction in the next attacks (i.e., decreases the system’s resistance to attacks). Thus, the state transition probability $v_{i,0}(k)$ for any i increases in k, whereas the sum of probabilities $\sum_{j=1}^{i} v_{i,j}(k)$ decreases in k. This is modelled by introducing the attack resistance deterioration factor $\delta$ ($0 < \delta \le 1$) and using the following recursive rule for obtaining the matrix $V_k$:

$$v_{i,j}(k) = \delta\, v_{i,j}(k-1) \ \text{for } j>0, \qquad v_{i,0}(k) = 1 - \sum_{j=1}^{i} v_{i,j}(k) = 1 - \delta\,\bigl(1 - v_{i,0}(k-1)\bigr), \qquad (18)$$

which is derived from the following obvious equations:

$$v_{i,0}(k-1) + \sum_{j=1}^{i} v_{i,j}(k-1) = 1; \qquad v_{i,0}(k) + \sum_{j=1}^{i} v_{i,j}(k) = v_{i,0}(k) + \delta \sum_{j=1}^{i} v_{i,j}(k-1) = 1. \qquad (19)$$
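Rule (18) can be applied mechanically to generate V_2, V_3, ... from V_1. The sketch below (hypothetical helper name, illustrative 3-state V_1 that is not from the paper) also checks numerically what (19) shows algebraically: every row still sums to 1. Iterating 1 - v_{i,0}(k) = delta·(1 - v_{i,0}(k-1)) gives v_{i,0}(k) = 1 - delta^(k-1)·(1 - v_{i,0}(1)), which grows with k when delta < 1. Here delta denotes the attack resistance deterioration factor.

```python
def next_transition_matrix(V_prev, delta):
    """Rule (18): v_{i,j}(k) = delta * v_{i,j}(k-1) for j > 0 and
    v_{i,0}(k) = 1 - delta * (1 - v_{i,0}(k-1)); row 0 (total failure) is unchanged."""
    V = [row[:] for row in V_prev]
    for i in range(1, len(V)):
        for j in range(1, i + 1):
            V[i][j] = delta * V_prev[i][j]
        V[i][0] = 1.0 - delta * (1.0 - V_prev[i][0])
    return V

# Hypothetical 3-state V_1 (illustrative numbers only).
V = [[1.0, 0.0, 0.0],
     [0.3, 0.7, 0.0],
     [0.1, 0.2, 0.7]]
history = [V]                                  # history[k - 1] holds V_k
for _ in range(5):
    history.append(next_transition_matrix(history[-1], delta=0.9))
for k, Vk in enumerate(history, start=1):
    # Row sums (should all be 1) and the lethal probability v_{2,0}(k).
    print(k, [round(sum(row), 12) for row in Vk], round(Vk[2][0], 4))
```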

The initial values of the non-zero elements $v_{i,j}(1)$, $j>0$, of the state transition probability matrix $V_1$ are $v_{4,1}(1)=0.02$; $v_{4,2}(1)=v_{3,1}(1)=0.15$; $v_{4,3}(1)=v_{3,2}(1)=v_{2,1}(1)=0.08$; $v_{4,4}(1)=v_{3,3}(1)=v_{2,2}(1)=v_{1,1}(1)=0.7$; the elements $v_{i,0}(1)$ follow from (4). Fig. 2 presents the system’s state probabilities $p_i(k)$ as functions of k for the given $V_1$ when the attack resistance deterioration factor $\delta$ takes the values 1 and 0.8. It can be seen that with the increase of the number of shocks, the probability of the system’s total failure $p_0(k)$ approaches 1. For $\delta=1$, $p_0(k)>0.9999$ when $k \ge 14$; for smaller values of $\delta$, $p_0(k)$ increases even faster. Therefore, one can neglect the probability that the system experiences more than $k^*=14$ shocks while executing its task.

Fig. 2. System’s state probabilities as functions of number of shocks
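The example matrix V_1 can be assembled from the listed values using the shift invariance (17) and the row condition (4), and the curves p_i(k) behind Fig. 2 can then be recomputed. The sketch below does this for delta = 1 and 0.8 (delta being the attack resistance deterioration factor) and prints p_0(k) for k ≤ 14 without asserting any particular threshold; helper names are hypothetical. Under (18), all non-fatal transitions of the k-th shock are scaled by delta^(k-1), so the probability of surviving k shocks is exactly delta^(k(k-1)/2) times its value for delta = 1. The test checks this identity.

```python
def build_v1(stay, drops, n_states):
    """V_1 under (17): for j > 0, v_{i,j}(1) depends only on d = i - j
    (stay for d = 0, drops[d - 1] otherwise); v_{i,0}(1) then follows from (4)."""
    V = [[0.0] * n_states for _ in range(n_states)]
    V[0][0] = 1.0                                  # total failure is absorbing
    for i in range(1, n_states):
        for j in range(1, i + 1):
            V[i][j] = stay if i == j else drops[i - j - 1]
        V[i][0] = 1.0 - sum(V[i][1:])
    return V

def next_transition_matrix(V_prev, delta):
    """Rule (18) for the next shock."""
    V = [row[:] for row in V_prev]
    for i in range(1, len(V)):
        for j in range(1, i + 1):
            V[i][j] = delta * V_prev[i][j]
        V[i][0] = 1.0 - delta * (1.0 - V_prev[i][0])
    return V

def state_prob_curve(V1, delta, k_max):
    """[p(0), ..., p(k_max)] from p_N(0) = 1 via (5), with V_k generated by (18)."""
    n = len(V1)
    p, V = [0.0] * (n - 1) + [1.0], V1
    curve = [p]
    for k in range(1, k_max + 1):
        if k > 1:
            V = next_transition_matrix(V, delta)
        p = [sum(p[i] * V[i][j] for i in range(j, n)) for j in range(n)]
        curve.append(p)
    return curve

V1 = build_v1(stay=0.7, drops=[0.08, 0.15, 0.02], n_states=5)   # values listed above
curves = {delta: state_prob_curve(V1, delta, k_max=14) for delta in (1.0, 0.8)}
for delta, curve in curves.items():
    print(delta, [round(pk[0], 4) for pk in curve])            # p_0(k), k = 0..14
```

The printed values can be compared with Fig. 2; they depend only on the V_1 entries as reconstructed here.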

The task succeeds if the system completes W operations before its total failure (corruption of all four memory blocks). If the number of experienced shocks reaches a certain value m, the chance to complete the task becomes too small, and the system aborts the PM (data processing task) and starts the data backup procedure (RP), transferring the data to an external storage. The number of operations required to transfer the data depends on the number of PM operations completed before the mission abort (the amount of data produced so far). This is reflected by the function $\psi(x)$, which is assumed to be linear in our example, i.e., $\psi(x)=\alpha+\gamma x$ with $\alpha=5$, $\gamma=0.12$. The data backup speed depends on the state of the memory blocks. During the data backup, the system remains exposed to attacks, which can further reduce the number of available uncorrupted memory blocks and/or cause total failure. However, during the data backup the system uses a protected communication channel that reduces the attack rate. Thus, the parameters of the Weibull distribution of attack inter-arrival times during the RP take the values $\eta_R=100$ and $\beta_R=1.7$ (and $\beta_R=1$ for the exponential distribution), which corresponds to larger inter-arrival times in the corresponding renewal process (e.g., in the sense of the usual stochastic order [21] of the corresponding survival functions). The cdfs of these inter-arrival times for both phases are presented in Fig. 3 for the exponential and Weibull distribution cases. Note that, in our specific example, the impact of shocks during the RP is less severe than during the PM, which complies with the usual stochastic order for two random variables [22], i.e., $F_M(t) \ge F_R(t)$, $t \ge 0$, as illustrated by Fig. 3. It can be seen that using the more sophisticated protection penetration algorithm (Weibull distribution) during the PM results in a lower protection penetration probability for t<60 than using pure enumeration (exponential distribution). However, for t>60 the sophisticated algorithm outperforms pure enumeration. During the RP, pure enumeration remains more effective than the sophisticated algorithm for t<100. Thus, for the specific case when the system in its lowest operable state (with $g_1=1$) can complete the task (with W=100) in time t=100, using the sophisticated algorithm is not beneficial for the attacker. However, if task completion requires more time, using the sophisticated algorithm can be justified. We also assume that the attacker has no information about the task complexity and can apply the sophisticated algorithm in the considered case.

If the system fails (is left without available uncorrupted memory blocks) before completing the data processing task or the backup procedure, the data is lost. The corruption of a part of the memory blocks makes the data processing and backup processes slower, which increases the time needed to complete the PM or the RP and increases the probability that further shocks will cause total system failure. The system survives if it completes either the PM or the RP before entering state 0.
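The inter-arrival cdfs compared in Fig. 3 are easy to evaluate directly. The snippet below uses the Weibull cdf F(t) = 1 - exp(-(t/scale)^shape) with the scale and shape values stated above (shape 1 gives the exponential case) and prints values at a few times. The crossing at t = scale follows because (t/scale)^1.7 < t/scale exactly when t < scale; a larger scale in the RP gives the pointwise ordering F_M(t) ≥ F_R(t).

```python
import math

def weibull_cdf(t, scale, shape):
    """F(t) = 1 - exp(-(t/scale)**shape) for t > 0; shape = 1 is the exponential case."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0

ETA_M, ETA_R = 60.0, 100.0      # scale parameters for the PM and the RP (Section 5)
for t in (30.0, 60.0, 90.0, 150.0):
    pm_exp, pm_wei = weibull_cdf(t, ETA_M, 1.0), weibull_cdf(t, ETA_M, 1.7)
    rp_exp, rp_wei = weibull_cdf(t, ETA_R, 1.0), weibull_cdf(t, ETA_R, 1.7)
    print(f"t={t:5.0f}  PM exp {pm_exp:.3f} Weibull {pm_wei:.3f}  "
          f"RP exp {rp_exp:.3f} Weibull {rp_wei:.3f}")
```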

Fig. 3. Cdf of inter-shock times during the PM and the RP (for exponential and Weibull distributions)

Fig. 4 presents the MSP Z and the SSP R as functions of the allowed number of shocks m for two cases of shock inter-arrival time distributions (exponential and Weibull) and the attack resistance deterioration factor $\delta=0.9$. The presented results show that, in the specific case considered, larger MSP and SSP are achieved when the attacker uses the sophisticated protection penetration algorithm (Weibull distribution), which is initially slower than pure enumeration (exponential distribution). In what follows, we consider the case of the Weibull distribution of shock inter-arrival times. Fig. 5 presents the MSP Z and the SSP R as functions of the allowed number of shocks m for different values of the attack resistance deterioration factor $\delta$. For brevity of notation, the arguments of Z and R are omitted in this section. It can be seen that the MSP increases and the SSP decreases in m, because the system has greater chances to complete the mission but starts the RP in a worse state after a larger number of shocks, which reduces the RP success probability. For m>5, the RP success probability S becomes negligible and the MSP practically coincides with the SSP. Indeed, when the mission is not aborted (as the probability that the system survives more than five attacks is negligible), the system survives only if it completes its PM.

Fig. 4. MSP Z and SSP R as functions of the allowed number of shocks m for exponential and Weibull shock inter-arrival time distributions and a resistance decrease factor of 0.9

A resistance decrease factor equal to 1 means that the system's state transition probabilities do not depend on the number of the attack, i.e., the resistance to an attack does not decrease with the number of executed attacks. This results in the largest possible MSP and SSP. As the factor decreases, the probability of total system failure at each shock increases, reducing both the MSP and the SSP. Fig. 5 gives practical information on the range of possible values of the MSP and the SSP and on the MSP-SSP tradeoff. With such a plot, one can see whether the desired level of the SSP can be obtained in principle and what the SSP increase costs in terms of MSP reduction.

Fig. 5. MSP Z and SSP R as functions of the allowed number of shocks m for different values of the resistance decrease factor

Fig. 6 presents the MSP, Z and the SSP, R for a resistance decrease factor of 0.9 in two cases. In the first case the data processing speed deteriorates upon experienced shocks (g0=0, g1=1, g2=1.2, g3=1.7, g4=2.1). In the second case it does not deteriorate (g0=g1=g2=g3=g4=2.1). It can be seen that the performance deterioration considerably decreases both the MSP and the SSP. For m=1, the MSP is the same in both cases. The system performs the PM only at its maximal PL g4=2.1 until the first shock occurs, and then the PM is aborted.
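For m=1 the MSP has a one-line expression. This relies on three assumptions: the system starts the PM in its best state 4, the PM workload is W=100 (the task size quoted above), and the time T_1 of the first shock has the PM inter-shock cdf F_M. The symbol T_1 is introduced here only for illustration. The PM succeeds only if it is completed at performance g_4 before the first shock:

```latex
Z(1) = \Pr\left\{ T_1 > \frac{W}{g_4} \right\}
     = 1 - F_M\!\left(\frac{W}{g_4}\right),
\qquad \frac{W}{g_4} = \frac{100}{2.1} \approx 47.6 .
```

Only g_4 and F_M enter this expression. Neither the lower performance levels g_1, g_2, g_3 nor the resistance decrease factor affects Z(1).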


Fig. 6. Comparison of MSP, Z and SSP, R in the cases when the data processing speed deteriorates and does not deteriorate, for a resistance decrease factor of 0.9

Fig. 7 presents the values of the abort decision parameter m obtained as solutions of the optimization problem (1). It also shows the corresponding values of the MSP, Z and the SSP, R. All are plotted as functions of the minimal allowed SSP, R* for different values of the resistance decrease factor. In practice, for any desired level of the SSP, the leftmost plot of Fig. 7 gives the optimal abort policy parameter m that maximizes the MSP. For small values of the desired SSP, R* the best policy is to forbid aborts (m=∞). As R* increases, the optimal value of m reduces to 2 and then to 1. As the resistance decrease factor increases, the SSP increases, and m needs to be reduced only at larger values of R*. When m=1, the values of the MSP, Z for different values of the factor coincide. Indeed, when m=1, the PM is performed only until the first shock occurs, so the system resistance to shocks does not affect the MSP.
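Problem (1) maximizes the MSP subject to a minimum SSP. Once Z(m) and R(m) have been tabulated for a finite set of candidate m values, solving it reduces to a search over that set. The Z and R values below are made up for illustration and are not read from Fig. 7. The names `optimal_m`, `Z` and `R` are hypothetical. The value m=∞ ("never abort") is encoded as `math.inf`, with R(∞)=Z(∞) because without aborts the system survives only if it completes the PM.

```python
import math

# Hypothetical MSP Z(m) and SSP R(m) values (made up, NOT read from Fig. 7).
# m = math.inf encodes the "never abort" policy; without aborts R(inf) = Z(inf).
Z = {1: 0.30, 2: 0.50, 3: 0.60, 4: 0.64, math.inf: 0.65}
R = {1: 0.75, 2: 0.67, 3: 0.66, 4: 0.652, math.inf: 0.65}


def optimal_m(Z, R, r_star):
    """Problem (1): maximize Z(m) over the tabulated m subject to R(m) >= r_star.

    Returns (m, Z(m), R(m)), or None when no tabulated m meets the constraint.
    """
    feasible = [m for m in Z if R[m] >= r_star]
    if not feasible:
        return None
    m_best = max(feasible, key=lambda m: Z[m])
    return m_best, Z[m_best], R[m_best]


for r_star in (0.60, 0.665, 0.70, 0.80):
    print(r_star, optimal_m(Z, R, r_star))
```

As R* grows, the chosen m moves from ∞ toward 1. Once R* exceeds every tabulated R(m), no policy is feasible.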


Fig. 7. The optimal values of the abort decision parameter m and the corresponding values of MSP Z and SSP R as functions of the minimum allowed SSP R* for different values of the resistance decrease factor.

Fig. 8 presents the function r of the resistance decrease factor. This function gives the maximal value of R* for which the mission abort should not be implemented: for any R* not exceeding this value, m=∞ remains the optimal solution of (1). As the system's shock resistance increases with the factor, larger values of the SSP, R=Z can be achieved without aborting the mission. Knowing the range of SSP values for which the mission abort is not beneficial is important for decision making. Using Fig. 8, one can decide whether to implement the mission abort based on the attack's resistance decrease factor and the desired level of the SSP.
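A short argument indicates what r equals for each value of the resistance decrease factor. It uses only two properties stated above: Z(m) increases in m, and without aborts the system survives only if it completes the PM.

```latex
R(\infty) = Z(\infty), \qquad Z(m) \le Z(\infty) \ \text{for all } m
\;\Longrightarrow\;
m = \infty \ \text{solves (1)} \iff R^* \le R(\infty),
\qquad\text{so}\qquad r = R(\infty) = Z(\infty).
```

If R* ≤ R(∞), the no-abort policy is feasible and already attains the largest MSP. If R* > R(∞), the no-abort policy is infeasible.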


Fig. 8. The maximum value of SSP for which the mission should not be aborted as a function of the resistance decrease factor.

Fig. 9 presents the MSP, the SSP and the expected loss, obtained according to (2), as functions of m. The parameters are a resistance decrease factor of 0.9, CF=100 and two different values of CL. Although both Z(m) and R(m) are monotonic functions, the expected loss C(m)=CF(1-Z(m))+CL(1-R(m)) is non-monotonic, with a maximum at m=2. Indeed, the SSP R(2) is much smaller than R(1), which causes a sharp increase of the expected loss. The change from m=2 to m=3 causes a much smaller variation of the SSP, whereas the MSP still increases, reducing the expected loss. Further increase of m causes no considerable variation of either the MSP or the SSP. When CL=300, C(m) takes its minimum when no aborts are allowed (m=∞). When CL=400, C(m) takes its minimum at m=1, i.e., the best policy is to abort the mission when the first shock occurs.
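The expected loss (2) is straightforward to evaluate once Z(m) and R(m) are known. The sketch below uses made-up Z and R values (not read from Fig. 9), chosen so that Z increases and R decreases in m. The function names are hypothetical. The loss formula and CF=100 are taken from the text.

```python
import math

# Hypothetical MSP Z(m) and SSP R(m) values (made up, NOT read from Fig. 9).
# m = math.inf encodes the "never abort" policy.
Z = {1: 0.30, 2: 0.50, 3: 0.60, 4: 0.64, math.inf: 0.65}
R = {1: 0.75, 2: 0.67, 3: 0.66, 4: 0.652, math.inf: 0.65}


def expected_loss(m, c_f, c_l):
    """Expected loss C(m) = CF*(1 - Z(m)) + CL*(1 - R(m))."""
    return c_f * (1.0 - Z[m]) + c_l * (1.0 - R[m])


def optimal_policy(c_f, c_l):
    """Return the tabulated m with the smallest expected loss."""
    return min(Z, key=lambda m: expected_loss(m, c_f, c_l))


CF = 100.0  # mission failure loss, as in Fig. 9
for CL in (300.0, 400.0):
    losses = {m: round(expected_loss(m, CF, CL), 2) for m in Z}
    print(CL, optimal_policy(CF, CL), losses)
```

With these illustrative values, C(m) peaks at m=2 for both values of CL. The minimum is at m=∞ for CL=300 and at m=1 for CL=400, which mirrors the qualitative pattern described for Fig. 9.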

Fig. 10 presents the minimum expected loss C and the corresponding optimal number of shocks m as functions of CL for CF=100 and different values of the resistance decrease factor.
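Suppose only the policies m=1 and m=∞ compete. Equating their expected losses from (2) then gives the value of CL at which aborting after the first shock starts to pay off. This is a derivation sketch, not a formula stated in the paper:

```latex
C_F\bigl(1 - Z(1)\bigr) + C_L\bigl(1 - R(1)\bigr)
  = C_F\bigl(1 - Z(\infty)\bigr) + C_L\bigl(1 - R(\infty)\bigr)
\;\Longrightarrow\;
C_L^{*} = C_F\,\frac{Z(\infty) - Z(1)}{R(1) - R(\infty)} .
```

Since R(1) > R(∞) and Z(1) < Z(∞), m=1 yields the smaller expected loss for CL > CL*, and m=∞ yields the smaller loss for CL < CL*.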


Fig. 9. MSP, SSP and the expected losses C as functions of m for a resistance decrease factor of 0.9, CF=100 and two different values of CL.

Fig. 10. Minimum expected loss C and the corresponding optimal number of shocks m as functions of CL for a resistance decrease factor of 0.9 and CF=100.

It can be seen that the optimal mission abort policy takes one of two forms. When the system loss CL is relatively small, the mission should not be aborted (m=∞). When CL is relatively large, the mission should be aborted after the first shock (m=1). The smaller the system resistance to shocks (i.e., the smaller the resistance decrease factor), the larger the value of CL at which mission aborting becomes beneficial. Fig. 10 demonstrates a practical methodology for determining the optimal abort policy parameter m. This parameter minimizes the expected mission cost for the given system parameters and the corresponding costs CF and CL.

6. Conclusions

Previous works [1-7] considered only binary systems performing missions. This paper instead deals with mission abort and rescue policies for multistate systems. Apart from the completely operable state and the state of total failure, such systems can operate in intermediate states with different levels of performance. The system deteriorates due to external shocks, which induce transitions to lower states and eventually to the state of total failure. When the risk of failure (and, therefore, of a system loss) is large, the mission can be aborted and the system can subsequently be rescued. The number of shocks experienced by the system is used as the decision parameter of the considered optimization problems. These problems define the trade-off between the mission success probability and the system survival probability. Specifically, constrained optimization is discussed, in which the MSP is maximized subject to the SSP being not smaller than a predetermined value. The unconstrained total expected loss minimization problem is also considered. To approach the formulated optimization problems, we first develop a new methodology for modeling and evaluating the MSP and the SSP of multistate systems that execute missions with a fixed amount of work. This is done via the suggested recursive procedures. They implement the Markov chain that describes the multistate system's deterioration in time under the process of shocks.


A detailed example illustrating the application of the suggested model is presented. It was shown that aborting the mission is not always beneficial in either of the considered optimization problems. The decision about mission abort effectiveness should be based on the system and mission parameters. The suggested algorithm and optimization methodology can be used for evaluating the MSP-SSP relationship and for determining the best mission abort policy. Further research in this direction can focus on shock processes that follow a nonhomogeneous Poisson process. In this case, however, the recurrent procedure will be more complex, as the times of shock occurrences must be taken into account. Moreover, Poisson-driven processes that do not possess the independent increments property (generalized Pólya processes) can also be considered [23]. When the system state is observable, the abort policy may be based not just on the number of experienced shocks but on the system's state. Future work should develop and study such state-based abort policies as well.

Conflicts of interest

There are no conflicts of interest associated with the submitted paper.

References

1. Myers A. Probability of loss assessment of critical k-out-of-n:G systems having a mission abort policy, IEEE Transactions on Reliability, vol. 58, no. 4, pp. 694-701, 2009.
2. Levitin G., Xing L., Dai Y. Mission abort policy in heterogeneous non-repairable 1-out-of-N warm standby systems, IEEE Transactions on Reliability, vol. 67, no. 1, pp. 342-354, 2018.
3. Levitin G., Finkelstein M. Optimal mission abort policy for systems operating in a random environment, Risk Analysis, vol. 38, pp. 795-803, 2018.
4. Levitin G., Finkelstein M. Optimal mission abort policy for systems in a random environment with variable shock rate, Reliability Engineering & System Safety, vol. 169, pp. 11-17, 2018.
5. Qiu Q., Cui L. Optimal mission abort policy for systems subject to random shocks based on virtual age process, Reliability Engineering & System Safety, vol. 189, pp. 11-20, 2019.
6. Qiu Q., Cui L. Gamma process based optimal mission abort policy, to appear in Reliability Engineering & System Safety.
7. Peng R. Joint routing and aborting optimization of cooperative unmanned aerial vehicles, Reliability Engineering & System Safety, vol. 177, pp. 131-137, 2018.
8. Yang L., Sun Q., Ye Z. Designing mission abort strategies based on early-warning information: application to UAV, to appear in IEEE Transactions on industrial automatics.
9. Nakagawa T. Shocks and damage models in reliability theory, Springer, London, 2007.
10. Finkelstein M., Cha J.H. Stochastic Modelling for Reliability: Shocks, Burn-in, and Heterogeneous Populations, Springer, London, 2013.
11. Lisnianski A., Levitin G. Multi-state system reliability: assessment, optimization and applications, World Scientific, 2003.
12. Lisnianski A., Frenkel I., Ding Y. Multi-state system reliability analysis and optimization for engineers and industrial managers, Springer, 2010.
13. Natvig B. Multistate Systems Reliability Theory with Applications, Wiley Series in Probability and Statistics, 2010.
14. Eryilmaz S. Assessment of a multi-state system under a shock model, Applied Mathematics and Computation, vol. 269, pp. 1-8, 2015.
15. Li W., Pham H. Reliability modeling of multi-state degraded systems with multi-competing failures and random shocks, IEEE Transactions on Reliability, vol. 54, no. 2, pp. 297-303, 2005.
16. Xian Z., Siqi W., Xiaoyue W., Kui C. A multi-state shock model with mutative failure patterns, Reliability Engineering & System Safety, vol. 178, pp. 1-11, 2018.
17. Segovia M.C., Labeau P.E. Reliability of a multi-state system subject to shocks using phase-type distributions, Applied Mathematical Modelling, vol. 37, no. 7, pp. 4883-4941, 2013.
18. Ruiz-Castro J. Preventive maintenance of a multi-state device subject to internal failure and damage due to external shocks, IEEE Transactions on Reliability, vol. 63, no. 2, pp. 646-660, 2014.
19. Cha J.H., Finkelstein M. On new classes of extreme shock models and some generalizations, Journal of Applied Probability, vol. 48, pp. 258-270, 2011.
20. Ross S. Stochastic Processes, John Wiley, New York, 1996.
21. Gut A., Hüsler J. Realistic variation of shock models, Statistics and Probability Letters, vol. 74, pp. 187-204, 2005.
22. Shaked M., Shanthikumar J. Stochastic Orders, Springer, London, 2007.
23. Cha J.H., Finkelstein M. Point Processes for Reliability Analysis: Shocks and Repairable Systems, Springer, London, 2018.