Reliability Engineering and System Safety 156 (2016) 97–108
Optimal task partition and state-dependent loading in heterogeneous two-element work sharing system
Gregory Levitin a,b,*, Liudong Xing c, Hanoch Ben-Haim b, Yuanshun Dai a
a Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China, China
b The Israel Electric Corporation, P.O. Box 10, Haifa 31000, Israel
c University of Massachusetts, Dartmouth, MA 02747, USA
Article info
Article history: Received 24 January 2016; received in revised form 18 May 2016; accepted 9 July 2016; available online 12 July 2016
Abstract
Many real-world systems such as multi-channel data communication, multi-path flow transmission and multi-processor computing systems have work sharing attributes where system elements perform different portions of the same task simultaneously. Motivated by these applications, this paper models a heterogeneous work-sharing system with two non-repairable elements. When one element fails, the other element takes over the uncompleted task of the failed element upon finishing its own part; the load level of the remaining operating element can change at the time of the failure, which further affects its performance, failure behavior and operation cost. Considering these dynamics, mission success probability (MSP), expected mission completion time (EMCT) and expected cost of successful mission (ECSM) are first derived. Further, optimization problems are formulated and solved, which find the optimal task partition and element load levels maximizing MSP, minimizing EMCT or minimizing ECSM. Effects of element reliability, performance and operation cost on the optimal solutions are also investigated through examples. Results of this work can facilitate a tradeoff analysis of different mission performance indices for heterogeneous work-sharing systems. © 2016 Elsevier Ltd. All rights reserved.
Keywords: Work sharing; Loading; Task partition; Mission success probability
Acronyms and abbreviations: AFTM, accelerated failure-time model; cdf, cumulative distribution function; pdf, probability density function; MSP, mission success probability; ECSM, expected cost of successful mission; EMCT, expected mission completion time
* Corresponding author at: Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China, China.
E-mail addresses: [email protected] (G. Levitin), [email protected] (L. Xing).
http://dx.doi.org/10.1016/j.ress.2016.07.009
0951-8320/© 2016 Elsevier Ltd. All rights reserved.

Nomenclature
$W$: amount of work in the mission task
$R$: mission success probability
$E$: expected mission completion time
$C$: expected cost of successful mission
$x$: fraction of mission task that should be performed by element 1
$w_j$: amount of work assigned to element j
$F_j(t)$, $f_j(t)$: cdf, pdf of time-to-failure of element j
$L_j^-$: load level of element j before failure of element 3−j
$L_j^+$: load level of element j after failure of element 3−j

1. Introduction
This paper considers work-sharing systems with two processing elements jointly working on the same mission task. The two elements can be different in their parameters including performance, operation cost and failure behavior as they may be supplied from different vendors and/or have different exploitation history. Parameters of the same element may also change due to changing load levels during the mission. Particularly, when one of the two elements fails, the remaining element takes over the uncompleted task of the failed element upon finishing its own part. The load level of this remaining operating element can change at the time of the failure, which can further affect its performance, operation cost and failure behavior [1–3]. Depending on the task complexity, task partition and load levels of the two elements, mission performance indices including the probability, time and cost of mission task completion can vary. Thus, problems of finding optimal task partition and element load levels optimizing these mission performance indices for a specified task complexity are relevant and should be solved for reliable and cost-effective design of work-sharing systems.

The work sharing system considered in this paper is motivated by real-world systems such as multi-path flow transmission systems, multi-channel data communication systems, and parallel computing systems (multi-processor systems, computer grids or clusters), where elements work simultaneously on different portions of the same task to accomplish a specified mission [4]. Consider, as a specific example, a flow transmission system aimed at transferring a predetermined amount of material through two parallel channels (pipelines with pump stations). Each channel can work autonomously and contains specific equipment with given productivity, cost and reliability characteristics. The total instant throughput of the system does not matter, but the time needed to complete the entire transfer task as well as the task cost and success probability are important. The transfer task can be arbitrarily distributed between the channels and the load of pumps can be chosen from a set of available levels. In many real applications the change of load during the task performance is impossible or undesirable (possibility of human errors, overhead and failures associated with transient processes, unavailability of personnel for constant process monitoring). If one of the channels fails, the system inevitably requires personnel intervention. After the failure, the single available channel is assigned to transfer the remaining amount of the material and the load of this channel can be changed.

Note that the system considered is different from the traditional load sharing systems, where the elements are loaded in a way that provides a desired level of cumulative system performance and the failure of an element results in a higher load, and thus a higher failure rate, for the remaining elements [5–7]. It is also different from the performance sharing systems in which the surplus instant performances of elements can be redistributed in a way that allows elements to meet individual demands [8–11]. The load levels of the elements in a work sharing system can be chosen without respect to cumulative system performance, but in a way that provides a desired balance among the mission success probability, expected cost and duration. The system considered also differs from an active redundant or hot standby system, in which multiple elements work on the same task in parallel but without any work sharing for the purpose of providing fast system recovery in the event of failures [12–14]. The considered work sharing system actually generalizes the 1-out-of-2 warm standby system, where one element is online and working with the other element serving as a standby unit ready to take over the task in the event of the online element failure [15,16]. In other words, the 1-out-of-2 warm standby system is a special case of the work sharing system considered when no task is initially assigned to one of the two elements.

Considerable research efforts have been expended in reliability modeling and optimization of traditional load sharing systems (e.g. [17–21]) and different types of standby systems (e.g. [22–27]).
Refer to [7] and [28] for a review of these efforts. However, little work is dedicated to modeling and optimizing reliability of work-sharing systems [4,29,30], and the existing work focused only on the task distribution problem. The possibility of uncompleted task reassignment as well as effects of state-dependent loading have not been addressed. This paper makes original contributions by proposing a solution methodology to assess mission success probability (MSP), expected mission completion time (EMCT) and expected cost of successful mission (ECSM) of two-element heterogeneous work-sharing systems subject to uncompleted task reassignment and state-dependent element loading. Another contribution is to formulate and solve a set of optimization problems, which determine optimal task partition and element load levels with the objective to maximize MSP, minimize EMCT, or minimize ECSM subject to providing desired levels of MSP and EMCT. Examples are provided to demonstrate applications of the proposed evaluation and optimization methodology.
Nomenclature (continued)
$g_j(l)$: performance (productivity) of element j working with load level l
$\phi_j(l)$: cumulative time acceleration factor of element j working with load level l
$T_j$: time needed by element j to complete its part of the mission given the other element does not fail
$c_j(l)$: per unit time operation cost of element j working with load level l
$c_j(0)$: per unit time idle mode cost of element j
$\eta_j$, $\beta_j$: scale, shape parameters of baseline Weibull time-to-failure distribution for element j
The remainder of the paper is organized as follows. Section 2 describes the two-element work sharing system model considered in this work. Section 3 presents evaluation of mission performance indices including MSP, EMCT and ECSM of the considered system. Section 4 illustrates the evaluation using examples. Section 5 presents formulation of related optimization problems. Effects of several element parameters on optimal solutions are also investigated through examples. Lastly, Section 6 gives conclusions and directions of future research.
2. Two-element system model
The system consists of two elements that have to perform jointly a specified amount of work W. The work is distributed between the elements based on a pre-specified partition when they are both functioning. The partition does not change if no element failures happen. In the case of failure of an element, the work uncompleted by this element is re-assigned to the remaining one. When both elements are working, element j is subject to load level $L_j^-$ (j = 1, 2). Performance (or productivity) of element j working with load level l is $g_j(l)$. When the work is distributed such that the first element should perform $w_1 = xW$ and the second element should perform $w_2 = (1 - x)W$ amount of work ($0 \le x \le 1$), the time needed by element j to complete its part of the mission is

$$T_j = \frac{w_j}{g_j(L_j^-)} \tag{1}$$

and the mission time in the case of no failures is

$$T^* = \max\{T_1, T_2\}. \tag{2}$$
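As a quick numerical illustration of Eqs. (1) and (2), the sketch below computes the failure-free completion times for a given partition. The function name `completion_times` and its argument names are hypothetical; the numbers (W = 5000, performances 10 and 14) are borrowed from load level 1 of the example in Section 4.

```python
def completion_times(x, W, g1, g2):
    """Return (T1, T2, T*) for partition fraction x, following Eqs. (1)-(2).

    g1 and g2 stand for the performances g_j(L_j^-) of the two elements at
    their pre-failure load levels (hypothetical argument names).
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    w1, w2 = x * W, (1.0 - x) * W   # work assigned to each element
    t1, t2 = w1 / g1, w2 / g2       # Eq. (1)
    return t1, t2, max(t1, t2)      # Eq. (2)


t1, t2, t_star = completion_times(x=0.42, W=5000.0, g1=10.0, g2=14.0)
print(round(t1, 1), round(t2, 1), round(t_star, 1))  # prints 210.0 207.1 210.0
```

With x = 0.42 the two elements finish almost simultaneously, which is the balance the optimization in Section 5 tends to favor when completion time matters.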
The mission succeeds either when no element j fails during its time $T_j$, or when element j fails before time $T_j$ and element 3−j completes the mission. It may be decided to change the load level of element 3−j after the failure of element j from $L_{3-j}^-$ to $L_{3-j}^+$. Notice that when the work distribution parameter x = 0 or x = 1, the system considered reduces to a 1-out-of-2 warm standby system, where one of the two elements initially performs no work and serves as a standby unit ready to take over the uncompleted task when the operating element fails.
3. Evaluating mission performance indices
In this section we derive mission success probability (MSP), expected mission completion time (EMCT), and expected cost of successful mission (ECSM) for the two-element system considered.

It has been shown through empirical studies that the load level can significantly impact failure behavior of a system element [1,2,31,32]. The accelerated failure-time model (AFTM) is widely used for describing the relationship between element load and failure behavior, where the effect of load is multiplicative in time [7,33]. Particularly, let $\phi_j(l)$ denote a cumulative time acceleration factor of element j working with load level l, which is used to reflect different stresses undergone by the element under that load level. The failure function for element j working with load level l is $F_j(t\phi_j(l))$, where $F_j(\cdot)$ is the baseline cumulative distribution function (cdf) of time-to-failure of element j. Based on the AFTM, the probability that element j, which initially worked with load level $L_j^-$ during time $t_1$, then works with load level $L_j^+$ during at least time $t_2$ is $1 - F_j\left(t_1\phi_j(L_j^-) + t_2\phi_j(L_j^+)\right)$.

If element j fails at time $t < T_{3-j}$, element 3−j operates at the moment of this failure and must complete the remaining amount of work $W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t$ for a successful mission (see Fig. 1A for example). The probability that element 3−j, which has worked for time t with load $L_{3-j}^-$ before the failure of element j, completes the mission task working with load $L_{3-j}^+$ after the failure of element j can be obtained as

$$1 - F_{3-j}\left( t\phi_{3-j}(L_{3-j}^-) + \frac{W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t}{g_{3-j}(L_{3-j}^+)}\,\phi_{3-j}(L_{3-j}^+) \right). \tag{3}$$

If element j fails at time $t \ge T_{3-j}$, element 3−j completes its part of work at time $T_{3-j}$ and waits in an idle mode during time interval $t - T_{3-j}$ until failure of element j (see Fig. 1B for example). Immediately after this failure, element 3−j is put in operation with load level $L_{3-j}^+$ and must complete the remaining amount of work $w_j - g_j(L_j^-)t$. The conditional probability that element 3−j, which has worked for time $T_{3-j}$ with load $L_{3-j}^-$ and then waited in an idle mode during time $t - T_{3-j}$, completes the mission task working with load $L_{3-j}^+$ after the failure of element j can be obtained as

$$1 - F_{3-j}\left( T_{3-j}\phi_{3-j}(L_{3-j}^-) + (t - T_{3-j})\phi_{3-j}(0) + \frac{w_j - g_j(L_j^-)t}{g_{3-j}(L_{3-j}^+)}\,\phi_{3-j}(L_{3-j}^+) \right). \tag{4}$$

Fig. 1. Examples of mission task completion by second element in the case of failure of the first one (for equal element performances $g_1 = g_2 = g$). A: $t < T_2$; B: $t \ge T_2$.

Thus, the mission success probability can be obtained as

$$\begin{aligned}
R ={}& \left(1 - F_1\left(\frac{xW}{g_1(L_1^-)}\phi_1(L_1^-)\right)\right)\left(1 - F_2\left(\frac{(1-x)W}{g_2(L_2^-)}\phi_2(L_2^-)\right)\right) \\
&+ \sum_{j=1}^{2}\int_0^{\min(T_1,T_2)} f_j\left(t\phi_j(L_j^-)\right)\left[1 - F_{3-j}\left(t\phi_{3-j}(L_{3-j}^-) + \frac{W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t}{g_{3-j}(L_{3-j}^+)}\,\phi_{3-j}(L_{3-j}^+)\right)\right]dt \\
&+ \int_{T_{3-k}}^{T_k} f_k\left(t\phi_k(L_k^-)\right)\left[1 - F_{3-k}\left(T_{3-k}\phi_{3-k}(L_{3-k}^-) + (t - T_{3-k})\phi_{3-k}(0) + \frac{w_k - g_k(L_k^-)t}{g_{3-k}(L_{3-k}^+)}\,\phi_{3-k}(L_{3-k}^+)\right)\right]dt,
\end{aligned} \tag{5}$$

where k = 1 if $T_1 \ge T_2$ and k = 2 otherwise. The first term corresponds to the case where both elements complete their tasks without failures, the second term corresponds to the case where one of the two elements fails when the other one is in operation, and the third term corresponds to the case where one of the elements fails after the other one has completed its task.

The expected time of the mission completion is

$$\begin{aligned}
E = \frac{1}{R}\Bigg\{ & \left(1 - F_1\left(\frac{xW}{g_1(L_1^-)}\phi_1(L_1^-)\right)\right)\left(1 - F_2\left(\frac{(1-x)W}{g_2(L_2^-)}\phi_2(L_2^-)\right)\right)\max(T_1, T_2) \\
&+ \sum_{j=1}^{2}\int_0^{\min(T_1,T_2)} f_j\left(t\phi_j(L_j^-)\right)\left[1 - F_{3-j}\left(t\phi_{3-j}(L_{3-j}^-) + \frac{W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t}{g_{3-j}(L_{3-j}^+)}\,\phi_{3-j}(L_{3-j}^+)\right)\right] \\
&\qquad \times \left(t + \frac{W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t}{g_{3-j}(L_{3-j}^+)}\right)dt \\
&+ \int_{T_{3-k}}^{T_k} f_k\left(t\phi_k(L_k^-)\right)\left[1 - F_{3-k}\left(T_{3-k}\phi_{3-k}(L_{3-k}^-) + (t - T_{3-k})\phi_{3-k}(0) + \frac{w_k - g_k(L_k^-)t}{g_{3-k}(L_{3-k}^+)}\,\phi_{3-k}(L_{3-k}^+)\right)\right] \\
&\qquad \times \left(t + \frac{w_k - g_k(L_k^-)t}{g_{3-k}(L_{3-k}^+)}\right)dt \Bigg\}.
\end{aligned} \tag{6}$$

Let $c_j(l)$ represent the per unit time operation cost of element j working with load level l, and $c_j(0)$ represent the per unit time idle mode cost of element j. The expected cost of a successful mission is

$$\begin{aligned}
C = \frac{1}{R}\Bigg\{ & \left(1 - F_1\left(\frac{xW}{g_1(L_1^-)}\phi_1(L_1^-)\right)\right)\left(1 - F_2\left(\frac{(1-x)W}{g_2(L_2^-)}\phi_2(L_2^-)\right)\right)\left(c_1(L_1^-)T_1 + c_2(L_2^-)T_2\right) \\
&+ \sum_{j=1}^{2}\int_0^{\min(T_1,T_2)} f_j\left(t\phi_j(L_j^-)\right)\left[1 - F_{3-j}\left(t\phi_{3-j}(L_{3-j}^-) + \frac{W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t}{g_{3-j}(L_{3-j}^+)}\,\phi_{3-j}(L_{3-j}^+)\right)\right] \\
&\qquad \times \left[c_j(L_j^-)t + c_{3-j}(L_{3-j}^-)t + c_{3-j}(L_{3-j}^+)\frac{W - \left(g_1(L_1^-) + g_2(L_2^-)\right)t}{g_{3-j}(L_{3-j}^+)}\right]dt \\
&+ \int_{T_{3-k}}^{T_k} f_k\left(t\phi_k(L_k^-)\right)\left[1 - F_{3-k}\left(T_{3-k}\phi_{3-k}(L_{3-k}^-) + (t - T_{3-k})\phi_{3-k}(0) + \frac{w_k - g_k(L_k^-)t}{g_{3-k}(L_{3-k}^+)}\,\phi_{3-k}(L_{3-k}^+)\right)\right] \\
&\qquad \times \left[c_k(L_k^-)t + c_{3-k}(L_{3-k}^-)T_{3-k} + c_{3-k}(0)(t - T_{3-k}) + c_{3-k}(L_{3-k}^+)\frac{w_k - g_k(L_k^-)t}{g_{3-k}(L_{3-k}^+)}\right]dt \Bigg\}.
\end{aligned} \tag{7}$$

The evaluation of MSP R, EMCT E, and ECSM C is illustrated by examples in Section 4. Based on the evaluation of these mission performance indices, optimization problems maximizing R, minimizing E, or minimizing C subject to R and E constraints are formulated and solved in Section 5.
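The AFTM rule that underlies Eqs. (3) and (4), namely that operating periods at different loads add up to an equivalent baseline age, can be sketched as follows. This is a minimal illustration assuming a Weibull baseline distribution; the function names and the numerical scenario are hypothetical and chosen only for demonstration.

```python
import math


def weibull_cdf(t, eta, beta):
    """Baseline Weibull cdf F(t) = 1 - exp(-(t/eta)^beta) for t >= 0."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0


def survival_after_load_change(t1, t2, phi_minus, phi_plus, eta, beta):
    """P(element survives t1 at the first load and then t2 at the second).

    Under the AFTM the two operating periods are converted into an
    equivalent baseline age t1*phi_minus + t2*phi_plus.
    """
    equivalent_age = t1 * phi_minus + t2 * phi_plus
    return 1.0 - weibull_cdf(equivalent_age, eta, beta)


# Hypothetical scenario: 100 time units with acceleration factor 1.0,
# then 50 time units with factor 2.3, for eta = 400 and beta = 2.0.
p = survival_after_load_change(100.0, 50.0, 1.0, 2.3, eta=400.0, beta=2.0)
print(round(p, 4))
```

The idle period in Eq. (4) fits the same pattern: it simply contributes a third term, the idle time multiplied by $\phi_{3-j}(0)$, to the equivalent age.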
4. Illustrative example
Consider a flow transmission system aimed at performing a transmission task with amount of work W = 5000 through two channels or paths (pipelines with pump stations). The channels have Weibull time-to-failure distributions with cdf

$$F_j(t) = 1 - \exp\left\{-\left[t/\eta_j\right]^{\beta_j}\right\}, \tag{8}$$

scale parameters $\eta_1 = 600$, $\eta_2 = 400$ and shape parameters $\beta_1 = 1.6$, $\beta_2 = 2.0$. Four operation load levels 1–4 are available. Table 1 presents the elements' performances, cumulative time acceleration factors and per unit time operation costs for each level of operation load and for the idle mode (l = 0). As the load level increases, performance (i.e., transmission rate) increases, and the acceleration factor that reflects stresses and the per unit time operation cost also increase.

Table 1
Element parameters.

               Element 1                Element 2
Load level l   g1(l)   ϕ1(l)   c1(l)    g2(l)   ϕ2(l)   c2(l)
0              0       0.3     5.0      0       0.5     7.0
1              10      1.0     9.0      14      1.0     16.0
2              19      1.2     11.5     29      1.4     30.0
3              42      2.0     27.0     53      1.9     54.5
4              47      2.8     36.5     67      2.3     59.0

Consider a situation where only operation load level 1 is available (i.e., l ∈ {0, 1} from Table 1). Fig. 2 presents the mission performance indices as functions of the work distribution parameter x. It can be seen that for small x, $T_1 < T_2$ though the performance of the second element $g_2(1)$ is higher than the performance of the first element $g_1(1)$. When no failures happen, the mission time equals $T_2$. If the second element fails, the mission completion time increases because the uncompleted task of element 2 is performed by the slower element 1. In this case $E > T_2$. For large x, $T_1 > T_2$ and when no failures happen, the mission completion time equals $T_1$. If the first element fails, the mission completion time decreases because the uncompleted task of the first element is performed by the faster second element. In this case $E < T_1$. Thus, in heterogeneous systems element failures can reduce the expected mission completion time. The minimum expected mission completion time is achieved when the work is distributed in proportion to the element performance. However, the work distribution parameter x providing the minimum mission completion time can differ from the one providing the minimum expected mission cost.

Fig. 2. Mission indices T1, T2, E, C and R as functions of x when the element operation load level is fixed, η1 = 600, η2 = 400, β1 = 1.6, β2 = 2.0.

The same functions are presented in Fig. 3 for the case when the faster element 2 is much less reliable ($\eta_2 = 100$). It can be seen that the unreliable element 2 cannot reduce the EMCT for large x because the probability that it can complete the mission after failure of element 1 is very low. The ECSM decreases with x, which means that the least expensive policy is to assign no work to the more expensive, but unreliable element 2. Comparison of Figs. 2 and 3 shows that the optimal work distribution depends on the elements' reliability.

Fig. 3. Mission indices T1, T2, E, C and R as functions of x when the element operation load level is fixed, η1 = 600, η2 = 100, β1 = 1.6, β2 = 2.0.

Fig. 4. Mission indices E, C and R for combinations of loads providing max R, min E, min C and min C s.t. R > 0.94 and E < 270 as functions of x.

Fig. 5. Combinations of load levels providing max R, min C and min C s.t. R > 0.94 and E < 270 as functions of x.

Table 2
Optimal solutions.

Problem                            x      L1−   L1+   L2−   L2+   R        E       C
Max R                              0.42   1     3     1     4     0.9617   195.7   5020
Min E                              0.41   4     4     4     4     0.9148   44.8    4189
Min C                              1.0    3     2     4     4     0.8924   116.1   3326
Min C s.t. R > 0.94 and E < 270    0.76   2     3     4     4     0.9400   190.5   3539
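The mission success probability of Eq. (5) can be evaluated numerically for the data of this section. The following is a minimal sketch, not the authors' implementation: it assumes composite Simpson integration, codes the integrands exactly as printed in Eq. (5) (density evaluated at the load-scaled time $t\phi_j(L_j^-)$), and uses hypothetical names (`msp`, `l_minus`, `l_plus`). The data come from Table 1 and Eq. (8).

```python
import math

W = 5000.0
# Table 1 data, indexed by load level l (index 0 is the idle mode).
G = {1: [0, 10, 19, 42, 47], 2: [0, 14, 29, 53, 67]}
PHI = {1: [0.3, 1.0, 1.2, 2.0, 2.8], 2: [0.5, 1.0, 1.4, 1.9, 2.3]}
ETA = {1: 600.0, 2: 400.0}
BETA = {1: 1.6, 2: 2.0}


def cdf(j, t):
    """Baseline Weibull cdf of element j, Eq. (8)."""
    return 1.0 - math.exp(-((t / ETA[j]) ** BETA[j])) if t > 0 else 0.0


def pdf(j, t):
    """Baseline Weibull pdf of element j (derivative of Eq. (8))."""
    if t <= 0:
        return 0.0
    u = t / ETA[j]
    return BETA[j] / ETA[j] * u ** (BETA[j] - 1) * math.exp(-(u ** BETA[j]))


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with n (even) subintervals; 0 if b <= a."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def msp(x, l_minus, l_plus):
    """Eq. (5); l_minus/l_plus map element index -> load before/after failure."""
    w = {1: x * W, 2: (1.0 - x) * W}
    T = {j: w[j] / G[j][l_minus[j]] for j in (1, 2)}
    rate = G[1][l_minus[1]] + G[2][l_minus[2]]

    # Term 1: neither element fails before finishing its own part.
    r = 1.0
    for j in (1, 2):
        r *= 1.0 - cdf(j, T[j] * PHI[j][l_minus[j]])

    # Term 2: element j fails while the other one (o) is still working.
    for j in (1, 2):
        o = 3 - j

        def both_working(t, j=j, o=o):
            rest = (W - rate * t) / G[o][l_plus[o]]
            age = t * PHI[o][l_minus[o]] + rest * PHI[o][l_plus[o]]
            return pdf(j, t * PHI[j][l_minus[j]]) * (1.0 - cdf(o, age))

        r += simpson(both_working, 0.0, min(T[1], T[2]))

    # Term 3: element k fails after the other one (o) has finished and idles.
    k = 1 if T[1] >= T[2] else 2
    o = 3 - k

    def other_idle(t):
        rest = (w[k] - G[k][l_minus[k]] * t) / G[o][l_plus[o]]
        age = (T[o] * PHI[o][l_minus[o]] + (t - T[o]) * PHI[o][0]
               + rest * PHI[o][l_plus[o]])
        return pdf(k, t * PHI[k][l_minus[k]]) * (1.0 - cdf(o, age))

    return r + simpson(other_idle, T[o], T[k])


# Settings of the "Max R" row of Table 2, for comparison with its R column.
print(msp(0.42, {1: 1, 2: 1}, {1: 3, 2: 4}))
```

The printed value can be compared with the R entry of the Max R row of Table 2; small differences are to be expected because x is rounded to two decimals in the table and the integration scheme here is only one possible choice. EMCT and ECSM of Eqs. (6) and (7) follow the same structure, with each integrand multiplied by the corresponding time or cost factor and the sum divided by R.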
5. Optimization problems and example solutions
The optimization problem is formulated as follows: for given task complexity W and functions $F_j(t)$, $\phi_j(l)$, $g_j(l)$, $c_j(l)$, find the task partition parameter value x and element load levels $L_j^-$ and $L_j^+$ that minimize the expected cost of successful mission (ECSM) while providing desired levels of mission success probability (MSP) and expected mission completion time (EMCT):

$$\min C\left(x, L_j^-, L_j^+\right) \quad \text{s.t.} \quad R\left(x, L_j^-, L_j^+\right) \ge R^*, \quad E\left(x, L_j^-, L_j^+\right) \le E^*. \tag{9}$$

Unconstrained optimization problems of maximizing MSP (i.e., max $R(x, L_j^-, L_j^+)$), minimizing EMCT (i.e., min $E(x, L_j^-, L_j^+)$), and minimizing ECSM (i.e., min $C(x, L_j^-, L_j^+)$) are also relevant and solved in this section.

The optimization problems for the example of Section 4 have been solved using the Golden Section search algorithm [34] for determining x and enumeration over the range from 1 to 4 for determining integer values of $L_1^-$, $L_1^+$, $L_2^-$ and $L_2^+$. Fig. 4 presents the mission indices obtained by finding, for any fixed value of x, combinations of $L_j^-$ and $L_j^+$ that maximize MSP, minimize EMCT, minimize ECSM, or minimize ECSM subject to constraints R > 0.94 and E < 270. There are no solutions satisfying the constraints when x < 0.15 or x > 0.78. Fig. 5 presents the values of $L_j^-$ and $L_j^+$ obtained for different optimization problems as functions of x. For the min E problem, the maximum loads corresponding to greater performances are chosen: $L_j^- = L_j^+ = 4$ for j = 1, 2.

Table 2 presents the best work distribution and element loading solutions for different optimization problems. The minimal ECSM is achieved when the expensive second element is not put in operation initially and serves as a standby reserve. In this case the MSP is lowest. The minimal EMCT is achieved when both elements are maximally loaded and perform with maximal performance whereas the work is distributed such that the elements complete the assigned tasks simultaneously if no failures occur: $x = g_1(4)/(g_1(4) + g_2(4)) = 47/(47 + 67) = 0.41$. The maximal MSP solution finds a compromise element loading, which strikes a balance between the decrease in operation time and the increase in the failure rate with the increasing load. Similar to the case of minimal EMCT, the work is distributed such that the elements complete the assigned tasks simultaneously if no failures occur: $x = g_1(1)/(g_1(1) + g_2(1)) = 10/(10 + 14) = 0.42$. The max R solution is the most expensive one. In the constrained minimal ECSM solution the less expensive first element performs a much greater part of the work than the second element, which serves mostly as a standby reserve. However, assigning 0.24 of the entire task to the second element reduces the required operation time of the first element and thus provides the desired level of MSP.

Figs. 6 and 7 demonstrate optimal solutions obtained for constrained optimization problems (9) with different R* and E* for different values of $\eta_1$ (scale parameter of the Weibull time-to-failure distribution of the first element). It can be seen that with an increase in the first element reliability, the part of work initially assigned to this element is increased. When time and reliability limitations allow (for example, when R* = 0.92, E* = 200), the entire work is assigned to the cheaper first element (x = 1) if it is reliable enough ($\eta_1 > 745$) and the system becomes a standby system. In this case no work is initially assigned to the expensive second element and it is put in operation only when the first element fails. ECSM for the optimal solutions monotonically decreases with increasing $\eta_1$ because when the cheaper element is more reliable, the probability of using the more expensive one decreases. The MSP and EMCT can behave non-monotonically because the optimal work distribution and element loading can change abruptly with $\eta_1$ to avoid constraint violation.

Figs. 8 and 9 demonstrate optimal solutions obtained for constrained optimization problems (9) with different R* and E* for different values of the first element performance. We assume that the element performance $g_j(l)$ for any load level l changes proportionally to factor z (i.e. $g_j(l)$ is multiplied by z). The increase in the first element performance has an effect similar to that due to an increase in its reliability. The part of work initially assigned to the first element increases and for z = 1.2 it becomes beneficial to keep the expensive second element in a standby mode. ECSM for the optimal solutions monotonically decreases with increasing z because when the cheaper element is faster, the probability that it completes the mission increases. The MSP and EMCT are non-monotonic functions of z. When z < 0.73, z < 0.66 and z < 0.53 the constraint R > R* cannot be met for R* = 0.94, R* = 0.93 and R* = 0.92, respectively.

Figs. 10 and 11 demonstrate optimal solutions obtained for constrained optimization problems (9) with different R* and E* for different values of the first element operation cost. We assume that the element operation cost $c_j(l)$ for any load level l changes proportionally to factor y (i.e. $c_j(l)$ is multiplied by y). With the increase in the first element operation cost, the fraction of work assigned to this element decreases. However, this decrease is limited by values providing a desired level of MSP. The EMCT decreases with increasing y because a greater portion of work is assigned to the faster second element. It can be seen that when the elements' loads do not change and x decreases, the MSP changes non-monotonically. Indeed, as shown in Fig. 4, R(x) is a non-monotonic function.

Fig. 6. Mission indices E, C and R and value of x providing optimal solutions as functions of η1 for different R* and E*.

Fig. 7. Optimal combinations of loads as functions of η1 for different R* and E*.

Fig. 8. Mission indices E, C and R and value of x providing optimal solutions as functions of z for different R* and E*.

Fig. 9. Optimal combinations of loads as functions of z for different R* and E*.
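The search scheme described in this section, a golden-section search over x nested inside an enumeration of integer load levels, can be sketched as below. To keep the example short and self-contained it uses a stand-in objective, the failure-free mission time T* of Eq. (2), rather than the indices E, C or R of the paper, and it enumerates only the pre-failure loads. The function names are hypothetical; the performance values come from Table 1.

```python
import math
from itertools import product

W = 5000.0
G1 = {1: 10, 2: 19, 3: 42, 4: 47}  # g1(l) from Table 1
G2 = {1: 14, 2: 29, 3: 53, 4: 67}  # g2(l) from Table 1


def golden_section_min(f, lo, hi, tol=1e-7):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def failure_free_time(x, l1, l2):
    """Stand-in objective: T* = max(T1, T2) from Eqs. (1)-(2)."""
    return max(x * W / G1[l1], (1.0 - x) * W / G2[l2])


best = None
for l1, l2 in product(range(1, 5), repeat=2):  # enumerate load levels 1..4
    x_opt = golden_section_min(lambda x: failure_free_time(x, l1, l2), 0.0, 1.0)
    value = failure_free_time(x_opt, l1, l2)
    if best is None or value < best[0]:
        best = (value, x_opt, l1, l2)

print(best)
```

For this stand-in objective the search selects the maximum loads and the balanced partition $x = g_1(4)/(g_1(4) + g_2(4))$ quoted above for the min E problem. Golden-section search presumes a unimodal objective in x; for objectives such as R(x), which the text notes is non-monotonic, that property would need to be checked before relying on the same scheme.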
Fig. 10. Mission indices E, C and R and value of x providing optimal solutions as functions of y for different R* and E*.

Fig. 11. Optimal combinations of loads as functions of y for different R* and E*.

6. Conclusion and future work
This paper evaluates and optimizes a heterogeneous work-sharing system with two non-repairable elements subject to state-dependent loading. Initially the work is distributed between the two elements according to a certain proportion. When one element fails before completing its own part of the work, the other element, if still operating, takes over the remaining uncompleted task of the failed element with a possibly different load level upon completing its own part. The mission succeeds if a specified amount of work can be completed before failures of both elements. Following an evaluation of mission performance indices (MSP, EMCT and ECSM) of the considered system, optimization problems are formulated and solved, which find optimal work distribution and element load levels maximizing MSP, minimizing EMCT or ECSM. Effects of several element parameters (reliability, performance, operation cost) on the optimal solutions are demonstrated through examples.

One direction of our future work is extending the proposed model for work-sharing systems with an arbitrary number of elements, where a new problem of scheduling and distributing uncompleted tasks of failed elements among the remaining operating elements is relevant. We are also interested in investigating work-sharing systems performing phased-missions, where element performance, operation cost and failure behavior may change from phase to phase due to varying tasks and conditions [35,36].
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (No. 61170042) and Jiangsu Province Development and Reform Commission (No. 2013-883).
References
[1] Iyer RK, Rossetti DJ. A measurement-based model for workload dependency of CPU errors. IEEE Trans Comput 1986;C-35:511–9.
[2] Nelson W. Accelerated testing: statistical models, test plans, and data analyses. New York: John Wiley & Sons; 1990.
[3] Kapur KC, Lamberson LR. Reliability in engineering design. New York: John Wiley & Sons; 1977.
[4] Levitin G. The universal generating function in reliability analysis and optimization. London: Springer; 2005.
[5] Singh B, Gupta PK. Load-sharing system model and its application to the real data set. Math Comput Simul 2012;82(9):1615–29.
[6] Kvam PH, Pena EA. Estimating load-sharing properties in a dynamic reliability system. J Am Stat Assoc 2005;100(469):262–72.
[7] Amari SV, Misra KB, Pham H. Tampered failure rate load-sharing systems: status and perspectives. In: Misra KB, editor. Chapter 20 in handbook of performability engineering. London: Springer; 2008. p. 291–308.
[8] Lisnianski A, Ding Y. Redundancy analysis for repairable multi-state system by using combined stochastic processes methods and universal generating function technique. Reliab Eng Syst Saf 2009;94(11):1788–95.
[9] Levitin G. Reliability of multi-state systems with common bus performance sharing. IIE Trans 2011;43(7):518–24.
[10] Yu H, Yang J, Mo H. Reliability analysis of repairable multi-state system with common bus performance sharing. Reliab Eng Syst Saf 2014;132(11):90–6.
[11] Xiao H, Peng R. Optimal allocation and maintenance of multi-state elements in series–parallel systems with common bus performance sharing. Comput Ind Eng 2014;72:143–51.
[12] Levitin G, Xing L, Dai Y. Effect of failure propagation on cold vs. hot standby tradeoff in heterogeneous 1-out-of-N: G systems. IEEE Trans Reliab 2015;64(1):410–9.
[13] Johnson BW. Design and analysis of fault tolerant digital systems. Reading, MA: Addison-Wesley; 1989.
[14] Levitin G, Xing L, Dai Y. Reliability and mission cost of 1-out-of-N: G systems with state-dependent standby mode transfers. IEEE Trans Reliab 2015;64(1):454–62.
[15] Amari SV, Pham H, Misra RB. Reliability characteristics of k-out-of-n warm standby systems. IEEE Trans Reliab 2012;61:1007–18.
[16] Eryilmaz S. Reliability of a K-out-of-n system equipped with a single warm standby component. IEEE Trans Reliab 2013;62(2):499–503.
[17] Liu Huamin. Reliability of a load-sharing k-out-of-n: G system: non-iid components with arbitrary distributions. IEEE Trans Reliab 1998;47(3):279–84.
[18] Ye Z, Revie M, Walls L. A load sharing system reliability model with managed component degradation. IEEE Trans Reliab 2014;63(3):721–30.
[19] Huang L, Xu Q. Lifetime reliability for load-sharing redundant systems with arbitrary failure distributions. IEEE Trans Reliab 2010;59(2):319–30.
[20] Park C. Parameter estimation for the reliability of load-sharing systems. IIE Trans 2010;42(10):753–65.
[21] Singh B, Sharma KK, Kumar A. A classical and Bayesian estimation of a k-components load-sharing parallel system. Comput Stat Data Anal 2008;52(12):5175–85.
[22] Kuo W, Zuo MJ. Optimal reliability modeling: principles and applications. Hoboken, NJ, USA: Wiley; 2003.
[23] Zhang T, Xie M, Horigome M. Availability and reliability of k-out-of-(M + N): G warm standby systems. Reliab Eng Syst Saf 2006;91(4):381–7.
[24] Zhai Q, Xing L, Peng R, Yang J. Multi-valued decision diagram-based reliability analysis of k-out-of-n cold standby systems subject to scheduled backups. IEEE Trans Reliab 2015;64(4):1310–24.
[25] Xing L, Tannous O, Dugan JB. Reliability analysis of non-repairable cold-standby systems using sequential binary decision diagrams. IEEE Trans Syst Man Cybern Part A: Syst Hum 2012;42(3):715–26.
[26] Levitin G, Xing L, Dai Y. Optimal backup distribution in 1-out-of-N cold standby systems. IEEE Trans Syst Man Cybern: Syst 2015;45(4):636–46.
[27] Levitin G, Xing L, Dai Y. Heterogeneous 1-out-of-N warm standby systems with dynamic uneven backups. IEEE Trans Reliab 2015;64(4):1325–39.
[28] Levitin G, Xing L, Dai Y. Non-homogeneous 1-out-of-N warm standby systems with random replacement times. IEEE Trans Reliab 2015;64(2):819–28.
[29] Levitin G, Dai Y. Optimal service task partition and distribution in grid system with star topology. Reliab Eng Syst Saf 2008;93:152–9.
[30] Yang B, Hu H, Guo S. Cost oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability. Comput Ind Eng 2009;56:1687–96.
[31] Levitin G, Amari SV. Optimal load distribution in series-parallel systems. Reliab Eng Syst Saf 2009;94:254–60.
[32] Iyer RK, Rossetti DJ. Effect of system workload on operating system reliability: a study on IBM 3081. IEEE Trans Softw Eng 1985;SE-11(12):1438–48.
[33] Levitin G, Xing L, Dai Y. Optimal component loading in 1-out-of-N cold standby systems. Reliab Eng Syst Saf 2014;127:58–64.
[34] Press W, Teukolsky S, Vetterling W, Flannery B. Numerical recipes in C. The art of scientific computing. New York: Cambridge University Press; 1992.
[35] Levitin G, Xing L, Amari SV, Dai Y. Reliability of non-repairable phased-mission systems with common-cause failures. IEEE Trans Syst Man Cybern: Syst 2013;43(4):967–78.
[36] Peng R, Zhai Q, Xing L, Yang J. Reliability of demand-based phased-mission systems subject to fault level coverage. Reliab Eng Syst Saf 2014;121:18–25.