Optimal work distribution and backup frequency for two non-identical work sharing elements


Accepted Manuscript

Optimal work distribution and backup frequency for two non-identical work sharing elements
Gregory Levitin, Liudong Xing, Yuanshun Dai

PII: S0951-8320(17)30689-0
DOI: 10.1016/j.ress.2017.10.016
Reference: RESS 5984
To appear in: Reliability Engineering and System Safety
Received date: 11 June 2017
Revised date: 28 September 2017
Accepted date: 21 October 2017

Please cite this article as: Gregory Levitin, Liudong Xing, Yuanshun Dai, Optimal work distribution and backup frequency for two non-identical work sharing elements, Reliability Engineering and System Safety (2017), doi: 10.1016/j.ress.2017.10.016

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

Heterogeneous two-element work-sharing systems are considered.

Incremental backups in such systems are studied.

Mission success probability and expected mission completion time are derived.

Optimal work distribution and backup frequencies are obtained.

Gregory Levitin a,b, Liudong Xing c, Yuanshun Dai a

a Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China
b The Israel Electric Corporation, P. O. Box 10, Haifa 31000, Israel. E-mail: [email protected]
c University of Massachusetts, Dartmouth, MA 02747, USA. E-mail: [email protected]

Abstract – Motivated by abundant real-world applications where system elements process different work portions in parallel to accomplish a specified mission task, this paper models and optimizes two-element work-sharing systems. When one of the two elements fails, the remaining element takes over the work of the failed element upon completing its own part. Incremental backups are used to reduce the amount of work that should be redone in the case of failures. System elements can be characterized by different processing speeds, different time-to-failure distributions, and different numbers of backup actions. Mission success probability and expected mission completion time are first derived. Based on the proposed evaluation procedure, unconstrained and constrained optimization problems are formulated and solved, which find optimal work distribution and backup frequencies of the two elements maximizing mission success probability. Influence of element reliability, processing speed, deceleration factor as well as data backup and retrieval complexity on the optimal solutions is further investigated through examples. Results of this work can facilitate the optimal decision on work distribution and backup policies of heterogeneous work-sharing systems.

Keywords: mission success probability; completion time; work sharing; backups; optimization

Nomenclature

W    number of operations in the mission task
R    mission success probability
E    expected mission completion time
x    fraction of mission task that should be performed by element 1
wj    number of mission task operations assigned to element j
Fj(t), fj(t)    cumulative distribution function (cdf), probability density function (pdf) of element j time-to-failure
ωj    deceleration factor of element j during the idle mode
gj    computational speed (no. of operations per unit time) of element j
hj    number of backups that should be performed by element j during the mission
vj    number of mission task operations that element j performs between consecutive backups
dj    total number of operations (task and backup) that should be performed by element j between completions of two consecutive backups
Tj    time needed by element j to complete its part of the mission given the other element does not fail
θ3-j(t,x,hj,h3-j)    time needed by element j to complete its part of the mission given the other element fails at time t
α    data backup and retrieval complexity factor

1. Introduction

Many practical systems have work-sharing attributes, where system elements work in parallel on different portions of the same task to accomplish a specified mission. If one element fails, the remaining operating elements take over the work of the failed element upon completing their own parts. Examples abound in diverse application areas, such as parallel computing systems (multi-processor systems, computer grids), multi-channel data communication systems, and multi-path flow transmission systems [1].

In particular, this paper considers a work-sharing computing system of two elements with fixed performance (processing speeds) and fixed failure time distributions during the mission. The two elements share a computation task by performing different parts of this task. Depending on the task complexity and task distribution between the elements, the probability and time of task completion can vary.

The system considered is different from load sharing systems, in which the performance of an element usually must be adjusted to meet a demand and its failure rate can change dynamically based on the magnitude of the shared load in the event of an element failure [2, 3]. Load sharing systems are aimed at providing a desired cumulative instantaneous performance, whereas work sharing systems are aimed at accomplishing a certain amount of work.

Consider, for example, a multiprocessor system. If the system operates in real-life applications performing navigation or control tasks, it should provide a desired computation speed to process any task portion in a certain time interval (or operation cycle). The processors should provide the required cumulative computation speed. If some processors fail, the load of the remaining operating processors increases, which reduces their per-cycle idle time and can cause faster overheating. The multiprocessor system in this case corresponds to the load sharing type. If the processors solve a task of certain complexity (like a service request or optimization problem), the instantaneous cumulative processing speed does not matter, and the probability that the entire task is solved without a total system failure or within a certain time becomes the system success criterion. If some processors fail, the remaining processors continue operating at the same speed and the total system performance decreases. In this case the system is of the work sharing type. As the performance of elements in work sharing systems does not change, their time-to-failure distributions are considered to be fixed.

The work sharing system is also different from a hot standby or active redundant system, where system elements work on the same task simultaneously without work sharing to provide fast system recovery in the case of an element failure [4-7].

Extensive research efforts have been dedicated to modeling and optimizing reliability of load sharing systems (e.g., [8-12]) and different types of standby systems (e.g., [13-19]). However, little work has been done on reliability analysis and optimization of work-sharing systems [1, 20-25]. These existing works have focused on the work distribution problem. The possibility of uncompleted subtask reassignment as well as the effects of backups have not been addressed.

This paper advances the state of the art by proposing a solution procedure to evaluate reliability (or mission success probability) and expected mission completion time of a two-element work-sharing system subject to periodic backups and uncompleted subtask reassignment. Optimization problems with the objective of maximizing mission success probability are also solved by determining the optimal work distribution and backup frequencies of the two elements. Performing backups is a common practice to facilitate effective system recovery when a failure occurs in various computing-related systems [26-29]. It enables an operational element to take over the task of a failed element from a backup point (or checkpoint), instead of re-performing the entire task from the very beginning. In a full backup strategy, all data produced from the beginning of the mission until the current checkpoint are saved. In comparison, an incremental backup only saves data produced since the last checkpoint, which is faster to perform and consumes less capacity on backup storage [30]. Therefore, the incremental backup is considered for the work-sharing system modeled in this paper. To provide generality, the two elements may have different incremental backup frequencies, optimization of which can offer a balance between the overhead of performing backup actions and the benefit of reducing reworking time in the case of failures. The two elements can also differ in their performance and failure time distributions, considering that they may be from different vendors and/or have different exploitation histories in practice.

The rest of the paper is organized as follows. Section 2 presents the system model and the formulation of the optimization problems considered in this work. Section 3 presents the evaluation of mission success probability and expected mission completion time of the work-sharing system considered. Section 4 presents illustrative examples of the proposed evaluation procedure as well as optimization results. Effects of several element parameters and of data backup and retrieval complexity on the optimal solutions are also studied through examples in this section. Lastly, Section 5 concludes the work and gives directions for future research.

2. System Model and Problem Formulation

The work-sharing system considered consists of two non-repairable elements. Each element j has a fixed processing speed gj and its time-to-failure is characterized by a cumulative distribution function (cdf) Fj(t). The system performs a task that requires accomplishing W equal work portions, further referred to as operations. A single work portion can correspond to a fixed number of basic arithmetical operators performed by a processor, to a fixed number of data collection procedures, or to an iteration of an algorithm.

The work is divided such that elements 1 and 2 should perform w1(x)=xW and w2(x)=(1-x)W operations, respectively, i.e. wj(x)=W(3x-2xj+j-1) for j=1 or 2. Here x is the work distribution factor, which can take on any value between 0 and 1, implying that the mission task can be distributed between the two elements in any proportion. Once chosen, x cannot change during the mission. Notice that x=0 and x=1 correspond to a warm standby system in which one of the two elements initially performs no work and waits as a standby reserve ready to take over the uncompleted task in the case of the operating element failure.

The elements perform evenly spaced incremental backups. The number of backups for element j is hj. Thus, the number of mission task operations that should be performed between backups is wj(x)/(hj+1). The number of operations needed to perform each backup is defined as a function of the number of operations performed between the backups: vj(wj(x)/(hj+1)). Thus, the number of operations (task and backup) that should be performed between completions of two consecutive backups is

dj=wj(x)/(hj+1)+vj(wj(x)/(hj+1)).    (1)

The backup mechanism is assumed to be perfect. The mission succeeds either when both elements do not fail before completing their parts of the mission task and scheduled backups or when one of the elements fails and the other one completes the remaining part of the entire mission task.

Given Fj(t), vj(.) and gj for j=1,2, the problem addressed in this work is to find the optimal values of x, h1, h2 that maximize mission success probability R(x,h1,h2) subject to a constraint on expected completion time E(x,h1,h2). The optimization problem is formulated as:

x,h1,h2=argmax(R(x,h1,h2)) s.t. E(x,h1,h2)≤E*.    (2)

3. Evaluation of Mission Success Probability and Time

Solution to the optimization problem (2) requires the evaluation of mission success probability R(x,h1,h2) and expected completion time E(x,h1,h2) of the two-element work-sharing system considered. These two mission performance metrics can be derived as follows.

The computational speed of element j is gj. Thus the time between completions of two consecutive backups is τj=dj/gj. The time needed by element j to complete its part of the mission (including backups) is

Tj(x,hj)=[wj(x)+hjvj(wj(x)/(hj+1))]/gj.    (3)
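As a quick numerical check of Eqs. (1) and (3), the following minimal Python sketch computes the work share, the operations per backup cycle, the time between backup completions and the completion time of one element. It assumes W=5000, g1=12, g2=26 and the backup cost function v(w)=80+0.05w from the illustrative example in Section 4; the function name backup_schedule is hypothetical and not part of the paper.

```python
W = 5000.0                 # operations in the mission task (Section 4 example)
G = {1: 12.0, 2: 26.0}     # processing speeds g_j (Table 1)

def v(w):
    """Operations needed for one backup after w task operations, Eq. (14)."""
    return 80.0 + 0.05 * w

def backup_schedule(j, x, h):
    """Return (w_j, d_j, tau_j, T_j) for element j, work fraction x and h backups."""
    w = x * W if j == 1 else (1.0 - x) * W   # w_1 = xW, w_2 = (1-x)W
    chunk = w / (h + 1)                      # task operations between backups
    d = chunk + v(chunk)                     # Eq. (1): task plus backup operations
    tau = d / G[j]                           # time between backup completions
    T = (w + h * v(chunk)) / G[j]            # Eq. (3): own part including backups
    return w, d, tau, T

for j in (1, 2):
    w, d, tau, T = backup_schedule(j, x=0.5, h=2)
    print(f"element {j}: w={w:.0f} d={d:.1f} tau={tau:.2f} T={T:.2f}")
```

With an even split (x=0.5) and two backups per element, this gives a completion time of about 228.6 time units for element 1 and about 105.5 for element 2, so the faster element finishes its half well before the slower one.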

If element j fails at time t, the index of the last completed backup can be obtained as ⌊t/τj⌋, the number of task operations completed before the last successful backup is

⌊t/τj⌋wj(x)/(hj+1),    (4)

the time elapsed after the last completed backup is t-τj⌊t/τj⌋, and the number of task operations performed after the last backup completion is (t-τj⌊t/τj⌋)gj if at time t the element performs the mission task and wj(x)/(hj+1) if at time t the element performs the backup procedure (see Fig. 1). Thus, the total number of task operations performed by element j before time t is

⌊t/τj⌋wj(x)/(hj+1)+min{(t-τj⌊t/τj⌋)gj, wj(x)/(hj+1)}.    (5)

If after the failure of element j at time t, element 3-j remains in a working condition, it is responsible for completing the entire remaining task. If element 3-j does not complete its part of the task before time t (i.e. T3-j(x,h3-j)≥t), it completes its part of the work first, then immediately retrieves the data saved by element j and performs the part of the mission task not completed by element j (see Fig. 2A). There is no sense in performing backup procedures when only one operating element remains. Thus, the remaining part of the mission task assigned to element 3-j requires performing

U3-j(t,x,h3-j)=w3-j(x)-⌊t/τ3-j⌋w3-j(x)/(h3-j+1)-min{(t-τ3-j⌊t/τ3-j⌋)g3-j, w3-j(x)/(h3-j+1)}    (6)

operations (we assume that when element j fails, the remaining element interrupts its backup procedure immediately).

Fig. 1. Two cases of element failure: during task execution (A); during backup (B).

If element 3-j completes its part of the task before time t (i.e. T3-j(x,h3-j)<t), it remains in an idle mode during time t-T3-j(x,h3-j), starts retrieving the data saved by element j at time t and then performs the part of the mission task not completed by element j (see Fig. 2B). We assume that in the idle mode the element remains failure prone for two reasons: first, even when the element is switched off upon completion of its part of the work, it can still fail due to ambient factors not related to its operation (e.g., common cause failures originating from other devices, extreme environmental conditions, external stresses, etc.); second, in many cases the idle elements remain switched on (or in a warm standby mode [13, 15, 18]), though unloaded, because the activation and warming up time can be considerable compared to the entire allowable mission time.

Fig. 2. Two cases of element j failure during task execution: T3-j(x,h3-j)≥t (A); T3-j(x,h3-j)<t (B).

Based on the cumulative exposure model [31, 32], the cumulative time spent by element 3-j in the idle mode is ω3-j(t-T3-j(x,h3-j)), where 0≤ω3-j≤1 is the deceleration factor used to reflect lower stresses experienced in this mode. The special case with ω3-j=0 corresponds to the case when the idle element waits in the cold standby mode (i.e., is completely switched off). The accelerated life testing methods of evaluating the deceleration/acceleration factors are discussed in [31-33].

Summarizing both cases described above, element 3-j operates during time min(T3-j(x,h3-j),t) before the failure of element j, spends time max(0, t-T3-j(x,h3-j)) in the idle mode, and needs time max(U3-j(t,x,h3-j),0)/g3-j to complete its part of the mission after this failure.

The task uncompleted by element j that should be completed by element 3-j consists of

Y3-j(t,x,hj)=wj(x)-⌊t/τj⌋wj(x)/(hj+1)    (7)

operations.

Before starting to perform the operations of element j, element 3-j should retrieve the data backed up by element j. The number of operations that element 3-j performs to retrieve the data is a function of the number of operations completed by element j before its last successful backup:

Z3-j(t,x,hj)=Z3-j(⌊t/τj⌋wj(x)/(hj+1)).    (8)

The time needed for element 3-j to complete the mission after the failure of element j at time t is thus

θ3-j(t,x,hj,h3-j)=[max(U3-j(t,x,h3-j),0)+Y3-j(t,x,hj)+Z3-j(t,x,hj)]/g3-j.    (9)

The cumulative exposure time of element 3-j completing the mission given element j fails at time t is

ψ3-j(t,x,hj,h3-j)=min(T3-j(x,h3-j),t)+ω3-jmax(0, t-T3-j(x,h3-j))+θ3-j(t,x,hj,h3-j),    (10)

where the first term corresponds to the operation time before the failure, the second term corresponds to the idle time and the third term corresponds to the operation time after the failure.
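The three terms of Eq. (10) can be illustrated with a small Python sketch. The function name exposure_time and all numbers below are hypothetical; the takeover time from Eq. (9) is simply passed in as an argument rather than computed.

```python
def exposure_time(t, T_other, omega, takeover):
    """Cumulative exposure time of the surviving element, following Eq. (10).

    t        -- failure time of the other element
    T_other  -- time the survivor needs for its own part (with backups)
    omega    -- deceleration factor of the survivor in the idle mode (0..1)
    takeover -- time the survivor needs after the failure, Eq. (9)
    """
    working = min(T_other, t)               # operation time before the failure
    idle = omega * max(0.0, t - T_other)    # decelerated ageing while idle
    return working + idle + takeover        # plus operation time after the failure

# Case A (Fig. 2A): the survivor is still busy with its own part at t = 60.
print(exposure_time(t=60.0, T_other=100.0, omega=0.5, takeover=70.0))
# Case B (Fig. 2B): the survivor finished at 100 and idles until t = 150.
print(exposure_time(t=150.0, T_other=100.0, omega=0.5, takeover=40.0))
```

In case A there is no idle term and the exposure is 60+70=130; in case B the 50 idle time units count only half, giving 100+25+40=165.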

The mission success probability can thus be obtained as

$$R(x,h_1,h_2)=\left[1-F_1\big(T_1(x,h_1)\big)\right]\left[1-F_2\big(T_2(x,h_2)\big)\right]+\sum_{j=1}^{2}\int_{0}^{T_j(x,h_j)} f_j(t)\left[1-F_{3-j}\big(\psi_{3-j}(t,x,h_j,h_{3-j})\big)\right]dt, \qquad (11)$$

where the first term corresponds to the probability that the mission succeeds when both elements do not fail before completing their parts of the mission task (including scheduled backups), and the second term corresponds to the mission success probability when one of the elements j (j=1,2) fails in [t,t+dt) for any 0≤t≤Tj(x,hj) and the other one completes the remaining part of the entire mission task (performing the data retrieval, but not performing any additional backup), which takes time θ3-j(t,x,hj,h3-j). The probability that element j fails in [t,t+dt) is fj(t)dt and the probability that the other element does not fail during time ψ3-j(t,x,hj,h3-j) is 1-F3-j(ψ3-j(t,x,hj,h3-j)).

The expected mission completion time is

$$E(x,h_1,h_2)=\frac{1}{R}\Big\{\left[1-F_1\big(T_1(x,h_1)\big)\right]\left[1-F_2\big(T_2(x,h_2)\big)\right]\max\big(T_1(x,h_1),T_2(x,h_2)\big)+\sum_{j=1}^{2}\int_{0}^{T_j(x,h_j)} f_j(t)\left[1-F_{3-j}\big(\psi_{3-j}(t,x,h_j,h_{3-j})\big)\right]\big(t+\theta_{3-j}(t,x,h_j,h_{3-j})\big)dt\Big\}. \qquad (12)$$
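The evaluation procedure of Eqs. (1)-(12) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Weibull parameters, speeds, deceleration factors and the backup and retrieval cost functions of the Section 4 example, approximates the integrals with a plain midpoint rule (a choice made here for simplicity), and reads the retrieval cost of Eq. (8) as Z(h,w) of Eq. (15) with h equal to the number of backups completed before the failure. The names plan, takeover and evaluate are hypothetical.

```python
import math

# Section 4 example data; index 0 is element 1, index 1 is element 2.
W = 5000.0
ETA, BETA = (600.0, 400.0), (1.2, 2.0)    # Weibull scale and shape
G, OMEGA = (12.0, 26.0), (0.3, 0.5)       # speeds and idle deceleration factors

def v(w):                                 # backup cost, Eq. (14)
    return 80.0 + 0.05 * w

def z(h, w):                              # retrieval cost, Eq. (15)
    return (40.0 if h > 0 else 0.0) + 0.02 * h * w

def cdf(j, t):                            # Weibull cdf, Eq. (13)
    return 1.0 - math.exp(-((t / ETA[j]) ** BETA[j]))

def pdf(j, t):                            # derivative of the Weibull cdf
    b, e = BETA[j], ETA[j]
    return (b / e) * (t / e) ** (b - 1.0) * math.exp(-((t / e) ** b))

def plan(j, x, h):
    """Return (w_j, chunk, tau_j, T_j) following Eqs. (1) and (3)."""
    w = x * W if j == 0 else (1.0 - x) * W
    chunk = w / (h + 1)
    tau = (chunk + v(chunk)) / G[j]
    T = (w + h * v(chunk)) / G[j]
    return w, chunk, tau, T

def takeover(j, t, x, h):
    """Element j fails at t; return (takeover time, exposure) of survivor k."""
    k = 1 - j
    wj, cj, tauj, _ = plan(j, x, h[j])
    wk, ck, tauk, Tk = plan(k, x, h[k])
    nk = math.floor(t / tauk)
    U = wk - nk * ck - min((t - nk * tauk) * G[k], ck)       # Eq. (6)
    nj = math.floor(t / tauj)                                # completed backups of j
    Y = wj - nj * cj                                         # Eq. (7)
    Z = z(nj, cj)                                            # Eq. (8), assumed form
    theta = (max(U, 0.0) + Y + Z) / G[k]                     # Eq. (9)
    psi = min(Tk, t) + OMEGA[k] * max(0.0, t - Tk) + theta   # Eq. (10)
    return theta, psi

def evaluate(x, h1, h2, steps=2000):
    """Return (R, E) of Eqs. (11) and (12) using the midpoint rule."""
    h = (h1, h2)
    T = [plan(j, x, h[j])[3] for j in (0, 1)]
    both = (1.0 - cdf(0, T[0])) * (1.0 - cdf(1, T[1]))
    R, ER = both, both * max(T)
    for j in (0, 1):
        dt = T[j] / steps
        for i in range(steps):
            t = (i + 0.5) * dt
            theta, psi = takeover(j, t, x, h)
            p = pdf(j, t) * (1.0 - cdf(1 - j, psi)) * dt
            R += p
            ER += p * (t + theta)
    return R, ER / R

R, E = evaluate(0.5, 2, 2)
print(f"R = {R:.4f}, E = {E:.1f}")
```

Because the integrands jump at every backup completion, a fixed-step rule with a fairly fine grid is used here instead of an adaptive quadrature; the step count is a tunable assumption.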

Having (11) and (12), one can now solve the optimization problem (2). As h1 and h2 are integer values that cannot be very large, this problem can be solved by enumerating h1 and h2 in a pre-specified range and finding the x that maximizes the function R(x,h1,h2)-max(0,E(x,h1,h2)-E*) for each combination of h1 and h2. Any simple one-objective optimization algorithm can be used for obtaining optimal values of x(h1,h2). Because it guarantees convergence for unimodal functions on specified intervals and needs no derivatives, the Golden Section search algorithm [39] is applied in this work for solving the proposed optimization problem.

4. Illustrative Example

Consider a dual-processor computer system aimed at performing a task with 5000 operations. Each processor has a Weibull time-to-failure distribution with cdf

$$F_j(t)=1-\exp\left[-\left(t/\eta_j\right)^{\beta_j}\right]. \qquad (13)$$

Values of ηj (scale parameter) and βj (shape parameter) as well as processing speeds gj and deceleration factors ωj for j=1, 2 are presented in Table 1. Note that while the proposed evaluation method is applicable to arbitrary types of time-to-failure distributions, the Weibull distribution is chosen here due to its flexibility in modeling diverse failure rate behaviors as well as its extensive applications in reliability analysis [34, 35]. As discussed in [36-38], the Weibull distribution is indeed a realistic failure time model for processors.

For both elements the number of operations needed to perform a backup is

vj(w)=80+0.05w,    (14)

and the number of operations needed to perform a data retrieval is given by

Zj(h,w)=40∙1(h>0)+0.02hw,    (15)

where w is the number of mission task operations performed between completions of two consecutive backups and h is the number of successfully completed backups.

Table 1. Element parameters for the numerical example

j    ηj     βj     gj    ωj
1    600    1.2    12    0.3
2    400    2.0    26    0.5

4.1. Evaluation Results and Discussions

Fig. 3 presents the times Tj needed for each element j to complete its part of the mission in the case when no failures happen and h1=h2=2 as a function of the work distribution factor x.

Fig. 3. Times Tj as a function of work distribution factor.

Fig. 4 presents probabilities for three different mutually exclusive scenarios of successful mission, in which element 1 fails during the mission, element 2 fails during the mission, and both elements complete the mission, respectively. The overall probability of mission success R is also presented in Fig. 4.

With an increase in x the amount of work assigned to element 1 increases and the amount of work assigned to element 2 decreases. Thus, T1 increases and T2 decreases (Fig. 3). Also, the probability that elements 1 and 2 solely complete the mission decreases and increases respectively (Fig. 4). It can be seen that R(x) has a distinct maximum.

Fig. 4. Mission success probabilities (A: element 1 fails during the mission, B: element 2 fails during the mission, C: both elements complete the mission, R: overall probability of mission success).

Fig. 5 presents the mission success probability and expected mission completion time as functions of x for different combinations of h1 and h2. It can be seen that the values of x maximizing R(x) and minimizing E(x) for any fixed h1 and h2 do not coincide, which justifies the definition (2) of the constrained optimization problem. The v-shaped form of functions E(x) can be explained by Fig. 3 taking into account that the mission time in the case of no failures is max(T1,T2).
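The location of the kink in the no-failure mission time max(T1,T2) can be checked with a short Python sketch. It assumes h1=h2=2 as in Fig. 3 and the Section 4 data (W=5000, g1=12, g2=26, v(w)=80+0.05w), and finds the work distribution factor at which T1 and T2 are equal by bisection; the function names are hypothetical. Note that this locates only the minimum of the no-failure time, not the minimum of E(x) itself.

```python
W = 5000.0
G1, G2 = 12.0, 26.0

def v(w):
    """Backup cost, Eq. (14)."""
    return 80.0 + 0.05 * w

def T1(x, h=2):
    w = x * W
    return (w + h * v(w / (h + 1))) / G1     # Eq. (3) for element 1

def T2(x, h=2):
    w = (1.0 - x) * W
    return (w + h * v(w / (h + 1))) / G2     # Eq. (3) for element 2

def balance_point(lo=0.0, hi=1.0, tol=1e-6):
    """Bisection on T1(x) - T2(x), which increases monotonically in x."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if T1(mid) < T2(mid):
            lo = mid      # element 1 still finishes first: shift work to it
        else:
            hi = mid
    return 0.5 * (lo + hi)

x_star = balance_point()
print(f"x* = {x_star:.4f}, T1 = T2 = {T1(x_star):.1f}")
```

Under these assumptions the two curves cross at x of about 0.304, where both elements need about 144.4 time units; to the left of this point max(T1,T2) follows the decreasing T2 and to the right it follows the increasing T1, which produces the v-shape.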


Fig. 5. Mission success probability R and expected mission completion time E as functions of x for different combinations of h1 and h2.

4.2. Optimization Results and Discussions

The optimization problems for this example have been solved using the Golden Section search algorithm [39] for determining x and enumeration in the range (0, 12) to determine integer values of h1 and h2. To speed up the optimization procedure, the precision of the Golden Section algorithm was limited to 0.03, which explains the small fluctuations in the figures below.
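The search scheme described above (enumeration of h1 and h2 combined with a Golden Section search over x) can be sketched as follows. This is an illustrative outline, not the authors' code: the constraint is folded into the objective by subtracting max(0, E-E*) from R, which is an assumption about the penalty form, and toy_evaluate is a hypothetical stand-in for the real evaluation of R and E so that the block runs on its own.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0    # inverse golden ratio, about 0.618

def golden_section_max(f, lo, hi, tol=1e-3):
    """Maximize a unimodal function f on [lo, hi]; return (x_best, f(x_best))."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                 # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:                       # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    x = 0.5 * (a + b)
    return x, f(x)

def optimize(evaluate, e_max, h_range, tol=1e-3):
    """Enumerate (h1, h2) and search x in [0, 1] for each pair."""
    best = None
    for h1 in h_range:
        for h2 in h_range:
            def penalized(x):
                r, e = evaluate(x, h1, h2)
                return r - max(0.0, e - e_max)   # penalize violations of E <= E*
            x, val = golden_section_max(penalized, 0.0, 1.0, tol)
            if best is None or val > best[0]:
                best = (val, x, h1, h2)
    return best

def toy_evaluate(x, h1, h2):
    """Hypothetical smooth stand-in for (R, E); NOT the model of Section 3."""
    r = 0.9 - (x - 0.3) ** 2 - 0.01 * (h1 - 2) ** 2 - 0.01 * (h2 - 1) ** 2
    e = 100.0 + 50.0 * abs(x - 0.3) + 2.0 * (h1 + h2)
    return r, e

val, x, h1, h2 = optimize(toy_evaluate, e_max=110.0, h_range=range(0, 4))
print(f"x = {x:.3f}, h1 = {h1}, h2 = {h2}, objective = {val:.4f}")
```

Golden Section search only guarantees the optimum when the penalized objective is unimodal in x for the given h1 and h2; a coarse tolerance such as the 0.03 used in the example trades accuracy of x for fewer evaluations.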

Fig. 6 presents the optimal solutions for the unconstrained problem max R and the constrained problem max R s.t. E<160 as functions of the time-to-failure distribution scale parameter η1.

Fig. 6. Optimal solutions for unconstrained problem max R and constrained problem max R s.t. E<160 as functions of η1.

With an increase in η1, i.e., in the reliability of the first element, the amount of work assigned to it increases and the optimal number of backups it should perform decreases. On the contrary, the optimal number of backups that should be performed by the second element increases because the occurrence probability of the scenario in which the second element fails and the first one takes over the mission increases. When the expected mission time is limited, the optimal number of backups h1 does not change because its increase corresponds to unacceptable expected mission time. When η1<170 the expected mission time cannot be less than 160 for any x, h1 and h2. The overall mission success probability increases with increasing η1.

Fig. 7 presents the optimal solutions for the unconstrained problem max R and the constrained problem max R s.t. E<160 as functions of processing speed g1.

Fig. 7. Optimal solutions for unconstrained problem max R and constrained problem max R s.t. E<160 as functions of processing speed g1.

The influence of g1 is similar to the influence of η1 because with increased processing speed the first element can complete its task with greater probability. When g1<11.5 the expected mission time cannot be less than 160 for any x, h1 and h2.

Fig. 8 presents parameters of the optimal solutions for the unconstrained problem max R and the constrained problem max R s.t. E<160 as functions of the data backup and retrieval complexity factor α.

Fig. 8. Optimal solutions for unconstrained problem max R and constrained problem max R s.t. E<160 as functions of data backup and retrieval complexity factor α.

We assume that the numbers of operations needed for performing the data backup and retrieval procedures are determined as αvj(w) and αZj(h,w) respectively, where vj(w) and Zj(h,w) are obtained using (14) and (15), respectively. With an increase in α the data backup and retrieval procedures become more time consuming, and therefore the optimal number of backups decreases. When α>2.6 for the unconstrained problem and α>1.65 for the constrained problem, using backups becomes non-beneficial because the increase in the total mission time in the case of no failures cannot be compensated by the reduction of reworking time in the case of failure. When no backups are performed, the mission success probability and expected completion time do not depend on α any more. When α<0.5 the parameters maximizing R provide E<160, and thus the solutions for the constrained and unconstrained problems coincide.

Fig. 9 presents parameters of the optimal solutions for the unconstrained problem max R and the constrained problem max R s.t. E<160 as functions of the cumulative time deceleration factor of the faster element, ω2.

When ω2 is low, the second element almost does not age in the idle mode. Therefore it is beneficial from the mission success probability point of view to assign a greater amount of work to the first element and to let it perform frequent backups. The situation is close to a cold standby system in which the second element waits in idle mode until the failure of the first one and then takes over the mission. With an increase in ω2 it becomes less beneficial to keep the second element in the idle mode. Therefore x and h1 decrease and h2 increases with increasing ω2.

While being effective from the mission success probability point of view, the idle time policy results in unacceptable expected mission completion time. Therefore, for the constrained optimization problem, x, h1 and h2 remain insensitive to ω2.

Fig. 9. Optimal solutions for unconstrained problem max R and constrained problem max R s.t. E<160 as functions of deceleration factor ω2.

5. Conclusion and Future Work

This paper models a heterogeneous work-sharing system with two non-repairable elements subject to periodic incremental backups. In the case of failure of one element, the remaining element takes over the subtask of the failed element upon completing its own subtask. For a successful mission, a specified amount of mission task must be accomplished before failures of both elements. Following an evaluation of mission success probability and expected mission completion time of the system considered, both unconstrained and constrained optimization problems are solved with the objective of finding the optimal work distribution and periodic backup frequencies of the two system elements that maximize mission success probability. Examples are provided to demonstrate the effects of several parameters on the optimal solutions, including element reliability, processing speed, deceleration factor as well as data backup and retrieval complexity.

Based on this pioneering work on two elements, we are interested in extending the proposed methodology to work-sharing systems with an arbitrary number of elements in the future. In the extended model the problem of scheduling and distribution of the uncompleted subtasks among the remaining elements arises. We will also investigate work-sharing systems performing multi-phased missions where the performance rate and failure behavior of a system element may vary in different phases due to changing conditions and tasks. Based on works for standby systems in [40-42], effects of an imperfect backup mechanism, uneven backups and combinations of full and incremental backups will also be considered for reliability modeling and optimization of work-sharing systems in our future work.

Acknowledgement: This work was supported in part by the National Natural Science Foundation of China (No. 61170042) and Jiangsu Province development and reform commission (No. 2013-883).

References

AC

[1] Levitin G. The Universal Generating Function in Reliability Analysis and Optimization, Springer, 2005. [2] Kvam PH, Pena EA. Estimating load-sharing properties in a dynamic reliability system. J. Amer. Statist. Assoc. 2005;100(469):262–272. [3] Singh B, Gupta P. K. Load-sharing system model and its application to the real data set. Mathematics and Computers in Simulation 2012;82(9):1615-1629. [4] Johnson BW. Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.

21

ACCEPTED MANUSCRIPT [5] Levitin G, Xing L, Dai, Y. Optimal Sequencing of Warm Standby Elements. Computers & Industrial Engineering 2013;65(4):570-576. [6] Levitin G, Xing L, Dai Y. Optimization of Predetermined Standby Mode Transfers in 1out-of-N: G Systems. Computers & Industrial Engineering 2014;72:106-113. [7] Valdés JE, Zequeira RI. On the optimal allocation of two active redundancies in a twocomponent series system, Operations Research Letters 2006;34(1):49-52. [8] Huang L, Xu Q. Lifetime reliability for load-sharing redundant systems with arbitrary

CR IP T

failure distributions. IEEE Transactions on Reliability 2010;59(2):319–330. [9] Liu H. Reliability of a load-sharing k-out-of-n: G system: non-iid components with arbitrary distributions. IEEE Transactions on Reliability 1998;47(3):279-284.

[10] Park C. Parameter estimation for the reliability of load-sharing systems. IIE Trans. 2010; 42(10):753–765.

[11] Singh B, Sharma KK, Kumar AA. Classical and Bayesian estimation of a k-components

AN US

load-sharing parallel system. Comput. Statist. Data Anal. 2008; 52(12):5175–5185.

[12] Ye Z, Revie M, Walls L. A Load Sharing System Reliability Model With Managed Component Degradation. IEEE Transactions on in Reliability 2014;63(3):721-730. [13] Eryilmaz S. Reliability of a K-Out-of-n System Equipped With a Single Warm Standby Component. IEEE Transactions on Reliability 2013;62(2):499-503.

Hoboken, NJ, USA: Wiley, 2003.

M

[14] Kuo W. & Zuo, M. J. Optimal Reliability Modeling: Principles and Applications.

ED

[15] Levitin G, Xing L, Dai Y. Non-homogeneous 1-out-of-N Warm Standby Systems with Random Replacement Times. IEEE Transactions on Reliability 2015;64(2):819-828.

[16] Xing L, Tannous O, Dugan JB. Reliability analysis of non-repairable cold-standby systems using sequential binary decision diagrams. IEEE Trans. Syst., Man, Cybern., Part A: Syst. Humans 2012;42(3):715–726.

[17] Zhai Q, Xing L, Peng R, Yang J. Multi-Valued Decision Diagram-Based Reliability Analysis of k-out-of-n Cold Standby Systems Subject to Scheduled Backups. IEEE Transactions on Reliability 2015;64(4):1310-1324.

[18] Zhang T, Xie M, Horigome M. Availability and reliability of k-out-of-(M+N): G warm standby systems. Rel. Eng. Syst. Safety 2006;91(4):381–387.

[19] Zhao R, Liu B. Standby redundancy optimization problems with fuzzy lifetimes. Computers and Industrial Engineering 2005;49(2):318–338.

[20] Elmakias D, Editor. New Computational Methods in Power System Reliability, Springer, 2008.

[21] Levitin G, Dai Y. Optimal service task partition and distribution in grid system with star topology. Reliability Engineering and System Safety 2008; 93:152-159.

[22] Levitin G, Ng SH, Peng R, Xie M. Reliability of Systems Subjected to Imperfect Fault Coverage. In Stochastic Reliability and Maintenance Modeling: Essays in Honor of Professor Shunji Osaki on his 70th Birthday, T. Dohi and T. Nakagawa (Editors), Springer, 2013.

[23] Lisnianski A, Frenkel I, Ding Y. Multi-state System Reliability Analysis and Optimization for Engineers and Industrial Managers, Springer, 2010.

[24] Lisnianski A, Levitin G. Multi-state System Reliability: Assessment, Optimization and Applications, World Scientific, 2003.

[25] Yang B, Hu H, Guo S. Cost oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability. Computers & Industrial Engineering 2009;56:1687-1696.

[26] Fu Y, Jiang H, Xiao N, Tian L, Liu F, Xu L. Application-Aware Local-Global Source Deduplication for Cloud Backup Services of Personal Storage. IEEE Transactions on Parallel and Distributed Systems 2014;25(5):1155-1165.

[27] Gaonkar S, Keeton K, Merchant A, Sanders WH. Designing Dependable Storage Solutions for Shared Application Environments. IEEE Transactions on Dependable and Secure Computing 2010;7(4):366-380.

[28] Koppol P, Namjoshi KS, Stathopoulos T, Wilfong GT. The inherent difficulty of timely primary-backup replication. Bell Labs Technical Journal 2012;17(2):15-24.

[29] Levitin G, Xing L, Dai Y. Optimal Backup Distribution in 1-out-of-N Cold Standby Systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems 2015;45(4):636-646.

[30] Xia R, Yin X, Alonso Lopez J, Machida F, Trivedi KS. Performance and Availability Modeling of IT Systems with Data Backup and Restore. IEEE Transactions on Dependable and Secure Computing 2014;11(4):375-389.

[31] Amari SV, Misra KB, Pham H. Tampered Failure Rate Load-Sharing Systems: Status and Perspectives. Chapter 20 in Handbook of Performability Engineering (Editor: K. B. Misra), Springer 2008;291-308.

[32] Nelson W. Accelerated Testing: Statistical Models, Test Plans, and Data Analyses. New York: Wiley, 1990.

[33] Levitin G, Xing L, Amari SV, Dai Y. Reliability of Non-repairable Phased-Mission Systems with Common-Cause Failures. IEEE Transactions on Systems, Man, and Cybernetics: Systems 2013;43(4):967-978.

[34] Leemis LM. Reliability: Probabilistic Models and Statistical Methods. Lawrence Leemis, 2009.

[35] Weibull W. A statistical distribution function of wide applicability. Journal of Applied Mechanics - Transactions of the ASME 1951;18:293–297.

[36] El-Berry A, Al-Bossly A. Effect of Heat on Computer’s Processor Failures. International Journal of Innovative Technology and Exploring Engineering 2015;4(10):57-61.

[37] Herault T, Robert Y, Editors. Fault-Tolerance Techniques for High-Performance Computing, Springer International Publishing, 2015.

[38] JEDEC. Failure mechanisms and models for semiconductor devices (JEP122C). JEDEC Publication, 2003.

[39] Press W, Teukolsky S, Vetterling W, Flannery B. Numerical recipes in C. The art of scientific computing. Cambridge University Press, 1992.

[40] Levitin G, Xing L, Dai Y. Heterogeneous 1-out-of-N Warm Standby Systems with Dynamic Uneven Backups. IEEE Transactions on Reliability 2015;64(4):1325-1339.

[41] Levitin G, Xing L, Dai Y. Cold-Standby Systems with Imperfect Backup. IEEE Transactions on Reliability 2016;65(4):1798-1809.

[42] Levitin G, Xing L, Zhai Q, Dai Y. Optimization of full vs. incremental periodic backup policy. IEEE Transactions on Dependable and Secure Computing 2016;13(6):644–656.
