Reliability Engineering and System Safety 109 (2013) 32–40
Quantifying the impact of correlated failures on system reliability by a simulation approach
Yi-Kuei Lin a,*, Lance Fiondella b, Ping-Chen Chang a
a Department of Industrial Management, National Taiwan University of Science & Technology, Taipei 106, Taiwan, ROC
b Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, USA
Article info
Article history: Received 9 April 2012; received in revised form 1 August 2012; accepted 4 August 2012; available online 28 August 2012
Keywords: Correlated failure; Simulation; Stochastic-flow network (SFN); System reliability

Abstract
Correlation poses a serious threat to many engineered systems because the simultaneous failure of multiple components can dangerously degrade performance. Given the high cost of system failures in business and mission-critical applications, methods to explicitly consider the impact of correlation on system reliability are essential. This paper constructs a stochastic-flow network model to analyze the performance of a computer network, where there exists correlation between the failures of all the physical lines and routers comprising the edges and nodes of the network. That is, we address global-scale events that can cause widespread damage to the performance of the network. We propose a simulation approach to estimate the probability that a given amount of data can be sent from a source to a sink through this network. This probability that the network satisfies a specified level of demand is referred to as the system reliability. Experimental results demonstrate that correlation can produce a substantial impact on system reliability. The proposed approach thus captures the influence of correlation on system reliability and offers a method to quantify the utility of reducing correlation. © 2012 Elsevier Ltd. All rights reserved.
1. Introduction
Network analysis is a conventional yet important procedure to evaluate the performance of network-structured systems such as internet, telephone, and transportation systems. In a network, edges and nodes may exhibit multiple levels of capacity due to the possibility of failure, partial failure, and maintenance. Hence, a network characterized by such edges and nodes also possesses stochastic (i.e., multi-state) capacities, which is typical of stochastic-flow networks (SFN) [1–13]. In the context of computer networks, edges model the physical transmission media such as fiber optics or coaxial cables, while nodes represent transmission facilities including routers and switches. Thus, a computer network characterized by such multi-state edges and nodes also demonstrates stochastic (multi-state) capacities and is suitably modeled by an SFN. In an SFN, both nodes and edges can fail unexpectedly when malfunctions occur. Aggarwal et al. [1] suggested that failure of a node induces failure of the edges incident to it, but only considered the case where the edges and nodes exhibit two capacity levels, reliable operation and total failure. Lin’s works [5–7] reformulated this original concept as a stochastic-flow network problem with multi-state edges, computing system reliability in terms of minimal paths (MPs), where an MP is a path whose proper subsets are
* Corresponding author. Tel.: +886 2 27303277; fax: +886 2 27376344. E-mail address: [email protected] (Y.-K. Lin).
0951-8320/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.ress.2012.08.008
no longer paths. Subsequent studies have devoted a considerable amount of research [8–10,12,13] to evaluate the performance and system reliability of SFN. These later studies define the system reliability as the probability that the SFN can send a requested demand d from the source to sink, which requires that the transmission capacity of the SFN be measured in terms of its ability to satisfy this demand. Despite the significant amount of research on the topic of SFN, virtually all of the previous research assumes that the capacity of each edge and node is stochastic with a given probability distribution [5–7,11–13]. A smaller number of studies [8–10] derive probability distributions for the capacities, assuming each physical line of an edge is binary state, reliable or failed. This latter approach characterizes the capacity probability distribution of each edge as an independent binomial distribution. In addition to numerical models, Monte Carlo simulation is another system modeling technique to explore a phenomenon and identify an approximate solution to a complicated problem. Monte Carlo simulation relies on repeated random sampling to compute the approximation. Several studies [14,15] have proposed Monte Carlo simulation techniques to estimate the reliability of network structured systems using the state-space decomposition methodology. A great deal of research [16–20] has also been devoted to evaluating the system reliability of binary-state networks. Refs. [21,22] developed Monte Carlo simulation algorithms to estimate the SFN reliability. However, all of the above works, both numerical and simulation, fail to consider correlated failures, where two or more physical lines or routers comprising an edge or node of the network fail
Y.-K. Lin et al. / Reliability Engineering and System Safety 109 (2013) 32–40
Nomenclature
$s$: source node
$t$: sink node
$n$: number of edges
$m$: number of nodes
$e_i$: $i$th edge/node with $i = 1, 2, \ldots, n+m$
$E$: $\{e_i \mid i = 1, 2, \ldots, n\}$, the set of edges
$N$: $\{e_i \mid i = n+1, n+2, \ldots, n+m\}$, the set of nodes
$w_i$: maximal capacity of $e_i$ (number of physical lines/routers in the $i$th edge/node)
$W$: $(w_1, w_2, \ldots, w_{n+m})$, the maximal capacity vector
$G$: $(E, N, W)$, an SFN
$N_c$: number of physical lines and routers, where $N_c = \sum_{i=1}^{n+m} w_i$
$r_i$: state of the $i$th physical line or router
$R$: $(r_1, r_2, \ldots, r_{N_c})$, the state vector
$\mu_i$: average reliability of the $i$th physical line or router
$U$: $(\mu_1, \mu_2, \ldots, \mu_{N_c})$, the average reliability vector
$\rho_{i_1,i_2}$: correlation among failures of elements $i_1$ and $i_2$
$\rho_0$: correlation among the physical lines/routers of an edge/node
$\rho_1$: correlation among the physical lines/routers of edges and nodes that are a distance of one or two from each other in the graph
$\rho_2$: correlation among lines/routers that are separated by two or more other edges/nodes
$\Sigma$: $N_c \times N_c$ correlated failures matrix
$\Phi(\cdot)$: cumulative bivariate normal distribution
$z(p)$: $p$th quantile of the standard normal distribution
$\delta_{i_1,i_2}$: multivariate normal covariance encoding of correlated failures between elements $i_1$ and $i_2$
$D$: $N_c \times N_c$ multivariate normal covariance matrix
$A$: Cholesky decomposition of $D$
$Z$: vector of $N_c$ standard normal deviates
$\gamma_i$: correlated normal deviate encoding $r_i$
$\Gamma$: $(\gamma_1, \gamma_2, \ldots, \gamma_{N_c})$, the vector of correlated normal deviates encoding $R$
$x_i$: current capacity of $e_i$
$X$: $(x_1, x_2, \ldots, x_{n+m})$, the current capacity vector
$k$: number of minimal paths
$P_j$: $j$th minimal path with $j = 1, 2, \ldots, k$
$f_j$: flow of the $j$th minimal path with $j = 1, 2, \ldots, k$
$d$: demand
$y_i$: minimal capacity of $e_i$ for satisfying $d$
$Y_v$: $v$th minimal capacity vector satisfying $d$ with $v = 1, 2, \ldots, h$
$R_d$: system reliability at demand level $d$

simultaneously. Thus, existing models lack the flexibility to consider the impact correlated failures could have on the system reliability and capacity of the edges and nodes of a network. A technique to explicitly integrate correlation into system reliability evaluation for stochastic-flow network models is needed because several real-world scenarios can lead to correlated failures. Potentially disastrous examples include earthquakes and hurricanes that simultaneously damage many edges and nodes of a network, suggesting that correlated failures among edges and nodes may occur. Given that these serious failures can lead to severely degraded network performance, enhanced stochastic-flow network models incorporating correlated failures will quantify the negative impact these types of events could be expected to have on system reliability. Such a modeling technique may then be used to measure the utility of reducing the possibility of correlated failures through strategies such as physically shielding vulnerable portions of the network. Following the system reliability model proposed by Lin [5], this paper develops a simulation method to evaluate the system reliability of a computer network as an SFN, where the physical lines or routers comprising the edges and nodes of the network may experience correlated failures. We verify the simulation technique by comparing with results from Lin’s numerical model [5]. Specifically, the failure of a physical line or router can affect the failure of physical lines and routers in all other edges and nodes of the network. Thus, the approach removes the restrictive assumption that the capacities of different edges/nodes are statistically independent [1–12].
We employ a simulation technique for multivariate Bernoulli distributions [23] to characterize the correlated failure of physical lines and routers in the edges and nodes, allowing distinct reliabilities and correlations between each pair of elements (physical lines and routers) in the network. This general case is therefore capable of considering global-scale events that can cause widespread damage to the performance of the network. The remainder of this paper is organized as follows. Section 2 presents a method to evaluate the system reliability of stochastic-flow networks in terms of their minimal capacity vectors and an approach to simulate the correlated failures of the physical lines and routers of the edges and nodes in the network. To facilitate the explanation of the illustrations, Section 3 constructs a fundamental analysis in terms of a k-out-of-n system. Section 4 provides two examples, including a large-scale case study of the Taiwan academic network (TANET). Section 5 offers conclusions and directions for future research.
2. Stochastic-flow network model with correlated failures
This section presents a method to assess the system reliability of a stochastic-flow network, considering correlated failures. The development is conducted in a top-down manner. We construct a network model for an SFN to compute the system reliability as follows. Let $G = (E, N, W)$ denote an SFN with source node $s$ and sink node $t$, where $E = \{e_i \mid i = 1, 2, \ldots, n\}$ represents the set of edges, $N = \{e_i \mid i = n+1, n+2, \ldots, n+m\}$ the set of nodes, and $W = \{w_i \mid i = 1, 2, \ldots, n+m\}$ the vector of the maximal capacities $w_i$ of $e_i$. The capacity vector $X = (x_1, x_2, \ldots, x_{n+m})$ is defined as the system state of $G$, where $x_i$ represents the current capacity of edge/node $e_i$. The network $G$ under study is assumed to satisfy the following assumptions.
1. Flow in $G$ satisfies the flow-conservation law [24].
2. The failures among edges and nodes are correlated according to a given correlation matrix.
We first formalize concepts including system reliability in terms of the minimal paths as a function of the flows in the edges and nodes of the network. Next, an efficient algorithm to generate all minimal capacity vectors of an SFN is given. Finally, a technique for simulating correlated multivariate Bernoulli distributions [23] to characterize the reliability of the physical lines and routers is discussed. This latter approach quantifies the influence correlation will have on the capacity probability distribution of each edge and node in the network. The system reliability estimate is obtained as
the fraction of the simulation experiments that satisfy at least one of the minimal capacity vectors. This approach provides an assessment that explicitly incorporates the impact of correlated failures on the overall system reliability.
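The success criterion described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `dominates`, `is_reliable`, and `estimate_reliability` are hypothetical, the four minimal capacity vectors are the ones the paper quotes for demand $d = 1$ on its benchmark network, and the four sampled capacity vectors are invented purely for the example.

```python
from typing import Iterable, Sequence


def dominates(x: Sequence[int], y: Sequence[int]) -> bool:
    """Return True when x >= y in every component."""
    return all(xi >= yi for xi, yi in zip(x, y))


def is_reliable(x: Sequence[int], mcvs: Iterable[Sequence[int]]) -> bool:
    """A capacity vector succeeds if it covers at least one minimal capacity vector."""
    return any(dominates(x, y) for y in mcvs)


def estimate_reliability(samples: Iterable[Sequence[int]], mcvs) -> float:
    """Fraction of simulated capacity vectors that satisfy the demand."""
    samples = list(samples)
    return sum(is_reliable(x, mcvs) for x in samples) / len(samples)


# Minimal capacity vectors for d = 1 as quoted in the benchmark example.
MCVS_D1 = [
    (0, 1, 0, 1, 1, 0, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1, 0, 1, 1),
    (1, 0, 1, 0, 0, 1, 1, 1, 1, 1),
    (1, 1, 0, 0, 0, 0, 1, 1, 0, 1),
]

# Hypothetical simulation outcomes (x1..x10), chosen only for illustration.
samples = [
    (3, 3, 3, 3, 3, 3, 3, 3, 3, 3),  # every line and router is up
    (3, 3, 3, 3, 3, 3, 0, 3, 3, 3),  # source node e7 completely down
    (1, 1, 0, 0, 0, 0, 1, 1, 0, 1),  # exactly the fourth minimal vector
    (0, 3, 3, 3, 0, 3, 3, 3, 3, 3),  # e1 and e5 down, so the source is cut off
]
print(estimate_reliability(samples, MCVS_D1))  # 0.5
```

In a full simulation the `samples` would come from the correlated sampling procedure of Section 2.3, and the estimate converges as the number of runs grows.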
2.1. System reliability definition
Let $P_1, P_2, \ldots, P_k$ be the $k$ minimal paths of the SFN. This SFN can be described with a capacity vector $X = (x_1, x_2, \ldots, x_{n+m})$ and a flow vector $F = (f_1, f_2, \ldots, f_k)$, where $x_i$ represents the capacity of $e_i$ and $f_j$ the flow on the $j$th MP. Therefore, the capacity of $e_i$ may be determined in terms of flow vectors [5] by

$$x_i = \sum_{j=1}^{k} \{ f_j \mid e_i \in P_j \} \tag{1}$$

subject to

$$\sum_{j=1}^{k} \{ f_j \mid e_i \in P_j \} \le w_i. \tag{2}$$

Given demand $d$, the system reliability $R_d$ is the probability that the network capacity is sufficient to send the data from the source to the sink. Thus, the system reliability is $\Pr\{X \mid V(X) \ge d\}$, where $V(X)$ is defined to be the maximum flow that can be sent given network state $X$. It would be computationally inefficient to find all $X$ such that $V(X) \ge d$ and then sum their probabilities to obtain $R_d$. The minimal capacity vectors $Y_1, Y_2, \ldots, Y_h$ in the set $\{X \mid V(X) \ge d\}$ constitute a more effective approach to compute the system reliability of the network under demand $d$. A capacity vector $Y$ is said to be minimal for $d$ if and only if (i) $V(Y) \ge d$ and (ii) $V(Y') < d$ for any capacity vector $Y'$ such that $Y' < Y$. Given $Y_1, Y_2, \ldots, Y_h$, the set of minimal capacity vectors capable of satisfying demand $d$, the system reliability $R_d$ is

$$R_d = \Pr\left\{ \bigcup_{v=1}^{h} D_v \right\}, \tag{3}$$

where $D_v = \{X \mid X \ge Y_v\}$, $v = 1, 2, \ldots, h$. Several methods such as the RSDP algorithm [8–10,12], inclusion–exclusion method [4–6], disjoint-event method [4,25], and state-space decomposition [2,3] may be applied to compute $\Pr\{\bigcup_{v=1}^{h} D_v\}$. The inclusion–exclusion method [9,10,12] quickly becomes computationally intractable, especially when the size of the network is large. In practice, the RSDP algorithm demonstrates better computational efficiency than the state-space decomposition approach for large networks [12]. In this study, we apply the RSDP algorithm to derive the exact system reliability. However, the RSDP algorithm assumes that the failures of edges and nodes of the network are statistically independent. Hence, an alternative approach is needed to characterize correlated failures between all physical lines and routers in the edges and nodes, justifying the simulation approach developed in this paper.

2.2. Algorithm to generate all minimal capacity vectors
Given $k$ minimal paths (MPs) $P_1, P_2, \ldots, P_k$, with flow $f_j$ on the $j$th MP, the minimal capacity vectors for $d$ can be generated with the following steps [5].
Step 1. Find all feasible solutions of the flow vector $F = (f_1, f_2, \ldots, f_k)$ satisfying $\sum_{j=1}^{k} f_j = d$ subject to

$$\sum_{j=1}^{k} \{ f_j \mid e_i \in P_j \} \le w_i. \tag{4}$$

Step 2. Transform each flow vector $F$ into a capacity vector $X = (x_1, x_2, \ldots, x_{n+m})$ via

$$x_i = \sum_{j=1}^{k} \{ f_j \mid e_i \in P_j \} \quad \text{for } i = 1, 2, \ldots, n+m. \tag{5}$$

Step 3. Find the minimal capacity vectors among the $X$ obtained in Step 2. Suppose $X_1, X_2, \ldots, X_q$ are all results from Step 2. The minimal capacity vectors can be obtained as follows:
3.1 Let $\Omega_{\min}$ be the set which stores the minimal capacity vectors among $X_1, X_2, \ldots, X_q$. Initially, $\Omega_{\min} = \emptyset$.
3.2 $u = 1$. Go to step 3.4.
3.3 For $v = 1$ to $u - 1$: if $X_u \ge X_v$, then go to step 3.5; if $X_u < X_v$, then delete $X_v$ from $\Omega_{\min}$.
3.4 $\Omega_{\min} = \Omega_{\min} \cup \{X_u\}$.
3.5 $u = u + 1$.
3.6 If $u > q$, stop; otherwise, go to step 3.3.
Step 4. Let $Y_1, Y_2, \ldots, Y_h$ be the minimal capacity vectors for $d$ obtained in Step 3.
Using these minimal capacity vectors obtained from the algorithm, the exact system reliability can be computed according to the capacity probability distribution for edges [12] only when failures are uncorrelated. In the simulation procedure, these minimal capacity vectors serve as thresholds to determine if an experiment is successful (reliable) or not.

2.3. Simulating correlated failures
Consider a set consisting of $N_c =$
$\sum_{i=1}^{n+m} w_i$ physical lines and routers, where the reliability vector $U$ denotes the average reliabilities and $\Sigma$ is an $N_c \times N_c$ matrix characterizing the correlation between the failures of the physical lines and routers comprising the network. These parameters form an $N_c$-variate Bernoulli distribution, and the reliability simulation technique [23] encodes these parameters as a multivariate normal distribution. This approach utilizes the cumulative bivariate normal distribution

$$\Phi(x_1, x_2, \rho) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{2(1-\rho^2)} \right) dz_1\, dz_2, \tag{6}$$

solving

$$\Phi\big(z(\mu_{i_1}), z(\mu_{i_2}), \delta_{i_1,i_2}\big) = \rho_{i_1,i_2} \sqrt{\mu_{i_1}(1-\mu_{i_1})\,\mu_{i_2}(1-\mu_{i_2})} + \mu_{i_1}\mu_{i_2} \tag{7}$$

numerically for each $\delta_{i_1,i_2}$ ($1 \le i_1 < i_2 \le N_c$). Here, $z(p)$ denotes the $p$th quantile of the standard normal distribution. The $\delta_{i_1,i_2}$ constitute the entries of the covariance matrix $D$ for the $\mathrm{MVN}(0, D)$ distribution used to simulate the correlated failures. The following steps simulate the system reliability of the SFN:
Step 1. Compute $A$, the Cholesky decomposition [26] of $D$.
Step 2. Generate $N_c$ standard normal deviates, $Z = (Z_1, Z_2, \ldots, Z_{N_c})^T$.
Step 3. Let $\Gamma = AZ$.
Step 4. If $\gamma_i \le z(\mu_i)$ then $r_i = 1$, otherwise $r_i = 0$, where $1 \le i \le N_c$.
Step 5. Set $x_i$, $1 \le i \le n+m$, according to the number of reliable physical lines or routers in that edge or node.
Step 6. If $X \ge Y_v$ for one or more minimal capacity vectors $Y_v$, then the SFN is reliable; otherwise it fails.
Averaging the outcomes of many runs of the simulation will provide an estimate for the system reliability of the SFN. Furthermore, by recording the vector of success and failure in each physical line and router and then computing the Pearson product-moment correlation coefficient for each pair of these components,
it can be verified that the observed correlation matrix converges to $\Sigma$. The simulation approach allows each physical line and router to assume a distinct reliability and distinct correlations. However, there are also several special cases of interest, which are noted here. For example, the case where the failures of the physical lines or routers in individual edges/nodes are correlated, but the capacities of the different edges and nodes are statistically independent, corresponds to the following block diagonal correlation matrix:

$$\Sigma = \begin{bmatrix} \rho_0 & 0 & 0 & 0 \\ 0 & \rho_0 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \rho_0 \end{bmatrix}. \tag{8}$$

Here, $\rho_0$ is a submatrix of dimension $w_i \times w_i$. The diagonal elements of $\rho_0$ are 1.0, while the off-diagonal elements characterize the correlations between the physical lines of an edge or the routers of a node. The next most general case is to allow correlations between routers of a node and the physical lines of neighboring edges. Such cases will arise in situations where failure of a node prohibits flow to adjacent edges of the graph. Introducing correlations between edges and nodes at progressively greater distances within the network structure approaches the completely general scenario discussed above. Given that many events inducing correlated failure are concentrated in a geographical region, it may be the case that the correlations at greater distances are smaller or even negative. In such cases, additional submatrices to denote correlations between routers of a node and the physical lines of neighboring edges would be necessary. In this paper, we consider correlation among the physical lines/routers of edges and nodes that are a distance of one or two from each other in the graph in terms of the submatrix $\rho_1$. Correlation among lines/routers that are separated by two or more other edges/nodes is denoted by $\rho_2$.
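Steps 1 to 5 of the simulation procedure can be sketched with the Python standard library alone. The sketch below is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the bivariate normal CDF of Eq. (6) is evaluated through an equivalent single-integral form with Simpson's rule (a numerical choice made here, not taken from the paper), Eq. (7) is solved by bisection, and the demonstration uses a single edge with three lines, $\mu_i = 0.9$ and pairwise correlation 0.5. Not every combination of reliabilities and correlations is feasible; the Cholesky step raises an error if the resulting matrix is not positive definite.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf() and inv_cdf()


def bvn_cdf(h, k, rho, n=200):
    """Bivariate standard normal CDF of Eq. (6), reduced to one integral:
    Phi(h)Phi(k) + (1/2pi) * int_0^asin(rho) exp(-(h^2 - 2hk sin t + k^2) / (2 cos^2 t)) dt,
    evaluated with Simpson's rule on n (even) sub-intervals. Requires |rho| < 1.
    """
    def integrand(t):
        num = h * h - 2.0 * h * k * math.sin(t) + k * k
        return math.exp(-num / (2.0 * math.cos(t) ** 2))

    upper = math.asin(rho)
    step = upper / n
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * step)
    return STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k) + total * step / (6.0 * math.pi)


def solve_delta(mu1, mu2, rho, iters=60):
    """Solve Eq. (7) for delta by bisection (the CDF increases with delta)."""
    target = rho * math.sqrt(mu1 * (1 - mu1) * mu2 * (1 - mu2)) + mu1 * mu2
    h, k = STD_NORMAL.inv_cdf(mu1), STD_NORMAL.inv_cdf(mu2)
    lo, hi = -0.999999, 0.999999
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cholesky(m):
    """Lower-triangular A with A A^T = m (Step 1); fails if m is not positive definite."""
    n = len(m)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(a[i][p] * a[j][p] for p in range(j))
            a[i][j] = math.sqrt(m[i][i] - s) if i == j else (m[i][j] - s) / a[j][j]
    return a


def simulate_states(mu, sigma, runs, seed=0):
    """Steps 1-4: draw correlated 0/1 states with means mu and correlations sigma."""
    n = len(mu)
    delta = [[1.0 if i == j else solve_delta(mu[i], mu[j], sigma[i][j])
              for j in range(n)] for i in range(n)]
    a = cholesky(delta)
    thresholds = [STD_NORMAL.inv_cdf(m) for m in mu]
    rng = random.Random(seed)
    states = []
    for _ in range(runs):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]                              # Step 2
        gamma = [sum(a[i][j] * z[j] for j in range(i + 1)) for i in range(n)]    # Step 3
        states.append([1 if g <= t else 0 for g, t in zip(gamma, thresholds)])   # Step 4
    return states


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# Illustrative case: one edge with three lines, mu = 0.9, pairwise correlation 0.5.
mu = [0.9, 0.9, 0.9]
sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
states = simulate_states(mu, sigma, runs=20000, seed=1)
capacities = [sum(r) for r in states]  # Step 5 for a single edge
mean_first = sum(r[0] for r in states) / len(states)
corr_01 = pearson([r[0] for r in states], [r[1] for r in states])
print(round(mean_first, 3), round(corr_01, 3))
```

The printed sample mean and sample correlation should land close to 0.9 and 0.5, which is the convergence check described in the text. Step 6 then compares each simulated capacity vector with the minimal capacity vectors.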
3. Relationship between performance and k-out-of-n systems
To facilitate the explanation of the illustrations, we discuss the relationship between performance and k-out-of-n systems as well as the role correlation plays in these systems. The following numerical example illustrates the impact of correlation in the context of an edge of a stochastic-flow network consisting of three physical lines ($w_i = 3$), where each physical line delivers one unit of capacity with average reliability $\mu_i = 0.9$. Table 1 shows how correlation alters the capacity probability distribution of the edge. Note that increasing the correlation between the failures of the physical lines increases the probability that the edge exhibits the lowest (0) and highest (3) capacity levels. This occurs because higher levels of correlation result in simultaneous failure and simultaneous reliability more often. Thus, correlation can increase the probability of the highest capacity level. This will improve system reliability when the edge performance must equal its full capacity. For example, when $\rho = 0$ the probability that the edge capacity is 3 is just 0.729, but it increases to 0.9 for $\rho = 1.0$. When the required performance is less than the full capacity, however, correlation lowers the probability that the minimum performance level is met. For example, when the edge capacity must be two or greater the probability is 0.972 for $\rho = 0$, but decreases to 0.9 when $\rho = 1.0$. This second situation is undesirable because it will lower the system reliability of a stochastic-flow network, as lower edge performances will render the network incapable of delivering the specified demand.
In the general case, correlation improves the reliability of a series system and lowers the reliability of k-out-of-n systems when k is close to n [27]. When all physical lines must be reliable to deliver the required performance, this is equivalent to a series system. Edges that do not need all of the physical lines to form a minimal capacity vector can be viewed as k-out-of-n systems, explaining why correlation can improve the performance of some edges, but lower the performance of others. Table 2 illustrates this by computing the reliability of the series, parallel, and 2-out-of-3 systems for each correlation.

Table 2
Impact of correlation on different systems (system reliability).
Network type    $\rho = 0$    $\rho = 0.1$    $\rho = 0.5$    $\rho = 0.9$    $\rho = 1.0$
Parallel        0.999         0.98829         0.94725         0.90909         0.9
2-out-of-3      0.972         0.96642         0.94050         0.90882         0.9
Series          0.729         0.74529         0.81225         0.88209         0.9

Note that in practice, correlations between the failures of the physical lines or switches and routers may be estimated from failure logs. For example, many hardware manufacturers bundle their products with performance monitoring and optimization software packages to measure quality of service and identify strategies to improve reliability and availability. The accuracy of these estimates will be specific to different hardware products and the amount of data and details recorded by such software. Consider the situation where a network maintenance software package records statistics such as the packet loss rates for a set of physical lines. These logs can then be processed to estimate the correlation in the rates of packet loss for the set of physical lines comprising the edges of the network.
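The link between the edge capacity distribution of Table 1 and the system reliabilities of Table 2 can be checked with a few lines of code. The sketch below uses the capacity probabilities reported in Table 1; the dictionary `TABLE_1` and the helper `k_out_of_n_reliability` are names introduced here for illustration. A k-out-of-3 system is reliable when at least k lines work, so its reliability is the probability that the edge capacity is k or more (parallel is 1-out-of-3, series is 3-out-of-3).

```python
# Capacity distribution of a three-line edge, P(capacity = 0, 1, 2, 3),
# keyed by correlation rho, as reported in Table 1.
TABLE_1 = {
    0.0: (0.001, 0.027, 0.243, 0.729),
    0.1: (0.01171, 0.02187, 0.22113, 0.74529),
    0.5: (0.05275, 0.00675, 0.12825, 0.81225),
    0.9: (0.09091, 0.00027, 0.02673, 0.88209),
    1.0: (0.1, 0.0, 0.0, 0.9),
}


def k_out_of_n_reliability(pmf, k):
    """Probability that at least k of the lines are working."""
    return sum(pmf[k:])


for rho, pmf in TABLE_1.items():
    parallel = k_out_of_n_reliability(pmf, 1)
    two_of_three = k_out_of_n_reliability(pmf, 2)
    series = k_out_of_n_reliability(pmf, 3)
    print(rho, round(parallel, 5), round(two_of_three, 5), round(series, 5))
```

Each printed row reproduces the corresponding column of Table 2, which makes the opposing trends visible: the series value rises with correlation while the parallel and 2-out-of-3 values fall.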
Table 1
Impact of correlation on edge capacity.
Edge capacity    $\rho = 0$    $\rho = 0.1$    $\rho = 0.5$    $\rho = 0.9$    $\rho = 1.0$
0                0.001         0.01171         0.05275         0.09091         0.1
1                0.027         0.02187         0.00675         0.00027         0.0
2                0.243         0.22113         0.12825         0.02673         0.0
3                0.729         0.74529         0.81225         0.88209         0.9

4. Numerical examples and discussions
In this section, we illustrate the utility of the approach through a pair of examples. The first case study shows the approach’s ability to quantify the influence of correlated failures on system reliability. The second example demonstrates the scalability of the approach, analyzing the impact of correlation on a large real-world example, the Taiwan academic network (TANET).

4.1. Benchmark network
Consider the following benchmark network consisting of six edges and four nodes with e7 as the source node and e10 the sink, shown in Fig. 1. This example was originally given by Lin [5] to demonstrate his solution procedure to obtain a set of minimal capacity vectors, which satisfies demand d. However, this previous study assumed arbitrary probability distributions for the capacity of the edges and nodes to illustrate the system reliability
Fig. 1. Benchmark network.
Table 3
Edge/node data for example 1.
Edge/node ($e_i$)    Physical line/router    Reliability ($r_i$)
$e_1$     1     0.988188
$e_1$     2     0.965605
$e_1$     3     0.950398
$e_2$     4     0.952177
$e_2$     5     0.971000
$e_2$     6     0.979833
$e_3$     7     0.953521
$e_3$     8     0.975950
$e_3$     9     0.983766
$e_4$     10    0.964635
$e_4$     11    0.987040
$e_4$     12    0.961864
$e_5$     13    0.960590
$e_5$     14    0.969169
$e_5$     15    0.960599
$e_6$     16    0.962051
$e_6$     17    0.955219
$e_6$     18    0.980918
$e_7$     19    0.981323
$e_7$     20    0.974892
$e_7$     21    0.974420
$e_8$     22    0.968109
$e_8$     23    0.971686
$e_8$     24    0.964841
$e_9$     25    0.976232
$e_9$     26    0.952504
$e_9$     27    0.971288
$e_{10}$  28    0.962664
$e_{10}$  29    0.955232
$e_{10}$  30    0.962671
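evaluation process. Similar to this earlier work, we utilize the network topology, but quantify the reliability of each physical line and router as well as the correlation between them with the method discussed in Section 2.3. Table 3 provides the reliability of each physical line and router in each edge/node $e_i$, where the number of physical lines/routers in each $e_i$ is $w_i = 3$. The correlations among the ten edges and nodes are given by the following block structure:

$$\Sigma = \begin{bmatrix}
\rho_0 & \rho_1 & \rho_1 & \rho_1 & \rho_1 & \rho_2 & \rho_1 & \rho_1 & \rho_2 & \rho_2 \\
\rho_1 & \rho_0 & \rho_1 & \rho_1 & \rho_2 & \rho_1 & \rho_2 & \rho_1 & \rho_2 & \rho_1 \\
\rho_1 & \rho_1 & \rho_0 & \rho_1 & \rho_1 & \rho_1 & \rho_2 & \rho_1 & \rho_1 & \rho_2 \\
\rho_1 & \rho_1 & \rho_1 & \rho_0 & \rho_1 & \rho_1 & \rho_2 & \rho_1 & \rho_1 & \rho_2 \\
\rho_1 & \rho_2 & \rho_1 & \rho_1 & \rho_0 & \rho_1 & \rho_1 & \rho_2 & \rho_1 & \rho_2 \\
\rho_2 & \rho_1 & \rho_1 & \rho_1 & \rho_1 & \rho_0 & \rho_2 & \rho_2 & \rho_1 & \rho_1 \\
\rho_1 & \rho_2 & \rho_2 & \rho_2 & \rho_1 & \rho_2 & \rho_0 & \rho_1 & \rho_1 & \rho_2 \\
\rho_1 & \rho_1 & \rho_1 & \rho_1 & \rho_2 & \rho_2 & \rho_1 & \rho_0 & \rho_1 & \rho_1 \\
\rho_2 & \rho_2 & \rho_1 & \rho_1 & \rho_1 & \rho_1 & \rho_1 & \rho_1 & \rho_0 & \rho_1 \\
\rho_2 & \rho_1 & \rho_2 & \rho_2 & \rho_2 & \rho_1 & \rho_2 & \rho_1 & \rho_1 & \rho_0
\end{bmatrix},$$

where

$$\rho_0 = \begin{bmatrix} 1.00 & 0.05 & 0.05 \\ 0.05 & 1.00 & 0.05 \\ 0.05 & 0.05 & 1.00 \end{bmatrix}, \quad
\rho_1 = \begin{bmatrix} 0.02 & 0.02 & 0.02 \\ 0.02 & 0.02 & 0.02 \\ 0.02 & 0.02 & 0.02 \end{bmatrix}, \quad \text{and} \quad
\rho_2 = \begin{bmatrix} 0.001 & 0.001 & 0.001 \\ 0.001 & 0.001 & 0.001 \\ 0.001 & 0.001 & 0.001 \end{bmatrix}.$$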
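The block structure above can be expanded mechanically into the full $30 \times 30$ correlation matrix. The following sketch assumes the ten-by-ten block pattern as read from the matrix given for the benchmark network; `PATTERN` and `build_sigma` are hypothetical names, and the digits 0, 1, 2 in the pattern stand for the blocks $\rho_0$, $\rho_1$, $\rho_2$.

```python
# Row i, column j gives the block type between elements e_(i+1) and e_(j+1):
# "0" -> rho_0 (same element), "1" -> rho_1, "2" -> rho_2.
PATTERN = [
    "0111121122",
    "1011212121",
    "1101112112",
    "1110112112",
    "1211011212",
    "2111102211",
    "1222120112",
    "1111221011",
    "2211111101",
    "2122212110",
]


def build_sigma(rho0=0.05, rho1=0.02, rho2=0.001, w=3):
    """Expand the block pattern into a (10*w) x (10*w) correlation matrix."""
    off_diagonal = {"0": rho0, "1": rho1, "2": rho2}
    size = len(PATTERN) * w
    sigma = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            if a == b:
                sigma[a][b] = 1.0  # each line/router is perfectly correlated with itself
            else:
                sigma[a][b] = off_diagonal[PATTERN[a // w][b // w]]
    return sigma


sigma = build_sigma()
print(len(sigma), sigma[0][1], sigma[0][3], sigma[0][15])  # 30 0.05 0.02 0.001
```

The three printed correlations are, in order, two lines of the same edge $e_1$, a line of $e_1$ with a line of $e_2$, and a line of $e_1$ with a line of $e_6$. Passing other values for `rho0`, `rho1`, and `rho2` gives the scaled matrices used when the correlation level is varied.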
The structure $\rho_0$ indicates that physical lines within the same edge or routers within the same node experience positively correlated failures. The values in $\rho_1$ characterize correlations between the physical lines and routers of edges and nodes that are a distance of one or two from each other in the graph. For example, $e_1$ is at a distance of one from $e_2$, $e_3$, and $e_4$ because they are all adjacent to node $e_8$, while $e_1$ and $e_5$ are both connected to node $e_7$. Edges $e_1$ and $e_6$, however, are at a distance greater than two because every path connecting them contains at least one edge and node. Thus, $\rho_2$ represents correlations between elements that are separated by two or more other elements. Note that this correlated block structure imposed by $\rho_0$, $\rho_1$, and $\rho_2$ is less general than the full method presented in Section 2.3 and it is possible to specify a $30 \times 30$ matrix with distinct correlations. The block structure given here has been provided to simplify the presentation, yet retains sufficient detail to illustrate the flexibility of the approach. When the demand is one unit ($d = 1$), and the correlation parameter $\rho_i$ for each set of physical lines is set to zero, the system reliability, calculated by the RSDP algorithm in terms of the four minimal capacity vectors [5] $Y_1 = (0, 1, 0, 1, 1, 0, 1, 1, 1, 1)$, $Y_2 = (0, 0, 0, 0, 1, 1, 1, 0, 1, 1)$, $Y_3 = (1, 0, 1, 0, 0, 1, 1, 1, 1, 1)$, and $Y_4 = (1, 1, 0, 0, 0, 0, 1, 1, 0, 1)$, is 0.999800. The reliability estimated from the simulation is 0.999920, where the small error is due to the large number of components being simulated. This close agreement verifies the simulation technique. When the physical lines and routers experience correlated failures according to the correlation matrix $\Sigma$, the simulation approach indicates that the system reliability falls to 0.999304. This 0.000496 decrease in system reliability suggests that for every million requests one might expect 496 more occasions where the demand cannot be satisfied.
Thus, the proposed approach quantifies the negative impact of correlated failures on system reliability. Demand levels two ($d = 2$) and three ($d = 3$) possess nine and sixteen minimal capacity vectors, respectively. Table 4 summarizes the system reliability of the SFN with and without correlation for demand levels one, two, and three as well as the change in system reliability that results when correlations are considered. Table 4 indicates that system reliability decreases even more significantly when the demand level is two, but suggests that system reliability actually increases when the demand level is three. This increase occurs because all sixteen capacity vectors require that nodes $e_7$ and $e_{10}$ operate at their highest capacity level. Thus, these two nodes are effectively a series system, whose system reliability benefits from positively correlated failures. Hence, increasing the frequency of correlated failures between these two bottleneck nodes increases the system reliability. Fig. 2 shows the system reliability obtained by decreasing correlation when the demand level is three. The x-axis indicates the correlation in the off-diagonal elements of the block $\rho_0$. The blocks $\rho_1$ and $\rho_2$ are scaled similarly. For example, when $\rho_0 = 0.04$, the entries in $\rho_1$ and $\rho_2$ are set to 0.016 and 0.0008, respectively. Thus, Fig. 2 shows the improvement in system reliability that would result from lowering the frequency of correlated failures. Table 5 lists the 16 minimal capacity vectors that will satisfy a demand of three ($d = 3$).
Table 4
Impact of correlation on system reliability of benchmark network.
Demand ($d$)    No. of minimal capacity vectors    No correlation    Correlation    Difference
1               4                                  0.999800          0.999304       -0.000496
2               9                                  0.993803          0.986034       -0.007769
3               16                                 0.825027          0.831718       +0.006691
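The counts of minimal capacity vectors in Table 4 can be reproduced with a short sketch of Steps 1 to 3 of the algorithm in Section 2.2. Two points are assumptions made here rather than statements from the paper: the four minimal paths in `PATHS` are inferred from the four minimal capacity vectors quoted for $d = 1$ (each sends one unit along a single path), and the function name `minimal_capacity_vectors` is hypothetical. The brute-force enumeration of flow vectors is only practical for small examples such as this one.

```python
from itertools import product

# Minimal paths of the benchmark network as element indices (edges 1-6, nodes 7-10),
# inferred from the d = 1 minimal capacity vectors quoted in the text.
PATHS = [
    (1, 2, 7, 8, 10),
    (5, 6, 7, 9, 10),
    (1, 3, 6, 7, 8, 9, 10),
    (2, 4, 5, 7, 8, 9, 10),
]
MAX_CAPACITY = [3] * 10  # w_i = 3 for every edge and node


def minimal_capacity_vectors(paths, w, d):
    """Enumerate flows (Step 1), map them to capacities (Step 2), keep minimal ones (Step 3)."""
    candidates = set()
    for flow in product(range(d + 1), repeat=len(paths)):
        if sum(flow) != d:
            continue  # Step 1: total flow must equal the demand
        x = [0] * len(w)
        for f, path in zip(flow, paths):
            for e in path:
                x[e - 1] += f  # Step 2: capacity used on each element
        if all(xi <= wi for xi, wi in zip(x, w)):  # capacity constraint of Step 1
            candidates.add(tuple(x))

    def leq(a, b):
        return all(ai <= bi for ai, bi in zip(a, b))

    # Step 3: drop every candidate that is dominated by a different candidate.
    return sorted(x for x in candidates
                  if not any(y != x and leq(y, x) for y in candidates))


for d in (1, 2, 3):
    print(d, len(minimal_capacity_vectors(PATHS, MAX_CAPACITY, d)))
# prints: 1 4, then 2 9, then 3 16
```

For $d = 3$ every resulting vector has $y_7 = y_{10} = 3$, which is the bottleneck property of the source and sink nodes used in the discussion of Table 4.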
Fig. 2. System reliability at $d = 3$.
Table 5
Minimal capacity vectors satisfying $d = 3$.
Minimal capacity vector    y1  y2  y3  y4  y5  y6  y7  y8  y9  y10
Y1                         0   3   0   3   3   0   3   3   3   3
Y2                         0   2   0   2   3   1   3   2   3   3
Y3                         0   1   0   1   3   2   3   1   3   3
Y4                         0   0   0   0   3   3   3   0   3   3
Y5                         1   0   1   0   2   3   3   1   3   3
Y6                         2   0   2   0   1   3   3   2   3   3
Y7                         3   0   3   0   0   3   3   3   3   3
Y8                         1   3   0   2   2   0   3   3   2   3
Y9                         1   2   0   1   2   1   3   2   2   3
Y10                        1   1   0   0   2   2   3   1   2   3
Y11                        2   1   1   0   1   2   3   2   2   3
Y12                        3   1   2   0   0   2   3   3   2   3
Y13                        2   3   0   1   1   0   3   3   1   3
Y14                        2   2   0   0   1   1   3   2   1   3
Y15                        3   2   1   0   0   1   3   3   1   3
Y16                        3   3   0   0   0   0   3   3   0   3
Table 6
Minimal capacity vectors satisfying $d = 2$.
Minimal capacity vector    y1  y2  y3  y4  y5  y6  y7  y8  y9  y10
Y1                         0   2   0   2   2   0   2   2   2   2
Y2                         0   1   0   1   2   1   2   1   2   2
Y3                         0   0   0   0   2   2   2   0   2   2
Y4                         1   0   1   0   1   2   2   1   2   2
Y5                         2   0   2   0   0   2   2   2   2   2
Y6                         1   2   0   1   1   0   2   2   1   2
Y7                         1   1   0   0   1   1   2   1   1   2
Y8                         2   1   1   0   0   1   2   2   1   2
Y9                         2   2   0   0   0   0   2   2   0   2
At this demand level, both the source and sink nodes $e_7$ and $e_{10}$ must operate at their highest capacity levels. Thus, correlated failures are preferred because they will also increase the probability of occasions where all three physical lines in these nodes are reliable. However, the other edges and nodes will not always need to operate at their highest capacity levels to satisfy a minimal capacity vector. In the cases where nodes and edges other than $e_7$ and $e_{10}$ require fewer than three physical lines or routers, their capacity will suffer from positive correlations because positive correlations lower the reliability of k-out-of-n systems when $k$ is less than $n$ [27]. Thus, the nodes $e_7$ and $e_{10}$ would achieve higher capacity for $\rho_0 = 0.05$, but all other edges and nodes will demonstrate better performance for $\rho_0 = 0$. As the source and sink nodes constitute structural bottlenecks for system performance, $\rho_0 = 0$
attains the highest reliability for the entire network. Table 6 lists the nine minimal capacity vectors that can satisfy a demand of two ($d = 2$). Fig. 3 shows the change in system reliability for demand level two when correlation is lowered. In this scenario, system reliability is maximized when $\rho_0 = 0$. This trend occurs because all edges and nodes, including $e_7$ and $e_{10}$, require no more than two reliable physical lines to satisfy the demand, as indicated by Table 6. As a result, lowering correlation will also improve the performance of nodes $e_7$ and $e_{10}$ because their capacity probability distributions are equivalent to 2-out-of-3 systems. Fig. 4 shows the influence of correlated failures for demand level one. Note that decreasing $\rho_0$ to 0.01 achieves a slightly higher system reliability of 0.999848 than the scenario where there is no
Fig. 3. System reliability at d = 2.
Fig. 4. System reliability at d = 1.
correlation. This phenomenon may be explained by the following reasoning. The four minimal capacity vectors indicate that none of the nodes or edges must operate at its highest capacity level to achieve a reliable system at demand level one. This suggests that decreasing correlation would improve the performance of edges and nodes because lowering correlation in a k-out-of-n system improves reliability when k < n = 3. However, decreasing ρ0 from 0.01 to 0.0 corresponds to an increase in ρ2 to 0.002 from 0.0. This removes the negative correlation between the source node e7 and edges e2, e3, e4, and e6, as well as the sink node e10. Furthermore, edges e4 and e6 form parallel paths from source to destination, and negative correlation improves the reliability of a parallel system [27]. Hence, setting ρ0 = 0.01 and ρ2 = 0.002 balances the positive and negative impacts of correlation on system reliability considering the network structure.

4.2. Large-scale case study

We employ the Taiwan academic network (TANET) with 30 edges, shown in Fig. 5 [9,10], to demonstrate the utility of the approach for assessing larger SFNs. The TANET is the backbone network connecting all educational and academic organizations in Taiwan. Consider the case where NTU and NSYSU are the source and the sink, respectively. The six paths are P1 = {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e13}, P2 = {e1, e2, e3, e4, e5, e6, e7, e8, e18, e19, e20}, P3 = {e14, e15, e16, e17, e18, e19, e20}, P4 = {e14, e15, e16, e17, e9, e10, e11, e12, e13}, P5 = {e23, e24, e25, e26, e27, e28, e29}, and P6 = {e23, e24, e25, e26, e27, e28, e30, e22, e21}. In this example, we let each edge consist of three Optical Carrier 18 (OC-18) physical lines (wi = 3), and each physical line provides two possible capacities, 1 gigabit per second (Gbps) and 0 bits per second (bps). Since the previous example demonstrated the approach’s ability to handle the case where each physical line/router possesses a different reliability, in this example the reliability of all 90 physical lines is set to μi = 0.95. Furthermore, the correlation between failures of the physical lines within an edge is 0.05, the correlation between physical lines in adjacent edges is 0.02, and the correlation between physical lines at a distance of two or more is 0.001. For example, e1 is at a distance of one from e2, e14, and e23, but at a distance of two or more from all other edges in the network.

Table 7 summarizes the results of the system reliability assessment for different levels of demand (in Gbps), compares them with the system reliability obtained under the assumption of statistically independent failures, as determined by the exact RSDP algorithm [12], and reports the difference between the two sets of estimates. The impact of correlation is negative for demand levels two through seven (d = {2, 3, 4, 5, 6, 7}), growing in magnitude from d = 2 to d = 6, but is positive at demand levels eight and nine (d = {8, 9}). This latter behavior occurs because the maximum capacity of the network is 9, and therefore each of the three main paths from the source must operate at or very near its highest capacity level. Thus, for many minimal capacity vectors, each edge forms a series system because three out of three physical lines must be reliable for the SFN to satisfy high demands. However, the system reliability at these high demand levels is rather low. The impact of correlation becomes negative when demand is seven (d = 7) or lower because many of the minimal capacity vectors require only two of three
physical lines to be reliable. Hence, correlation lowers the performance of the edges and the overall system reliability of the SFN. The only reason correlation appears to improve system reliability at demand level one (d = 1) is the extremely high system reliability. In the several million simulation experiments, no combination of failures in the physical lines led to a situation where none of the six minimal capacity vectors could be satisfied. Since the difference is decreasing for each demand level from six down to two, it is reasonable to assume that correlation negatively impacts the system reliability of the SFN at d = 1, but by no more than 0.00001230689. Hence, a lower bound on the system reliability at d = 1 is 0.999999999449656 − 0.00001230689 = 0.9999877, which is only slightly higher than the system reliability when d = 2. Thus, a future research challenge will be to quantify the impact of correlation on highly reliable SFNs when the demand level is low. Such an estimate will quantify the network’s ability to maintain connectivity in the face of correlated failures.

To examine the error produced by the exact approach [5,12], which assumes that failures in the physical lines are statistically independent, the simulation procedure was repeated for μi = {0.9, 0.99}. Fig. 6 shows the difference between the uncorrelated and correlated estimates of the system reliability of the SFN. The exact method overestimates the system reliability even more severely at demand levels three through six (d = {3, 4, 5, 6}) when μi = 0.9. This error occurs because correlated failures lower the reliability of 2-out-of-3 systems even more dramatically when the individual physical lines possess lower reliability. In the case where μi = 0.99, the overestimation occurs for demand levels seven and eight (d = {7, 8}). This shift takes place because the more reliable physical lines make the system reliability quite high for demands less than or equal to six. Therefore, at these demand levels the difference between the uncorrelated and correlated estimates is never greater than 0.00167. However, the uncorrelated system reliability estimates for demand levels seven and eight (d = {7, 8}) are R7 = 0.9917 and R8 = 0.8897, but the corresponding correlated reliabilities determined from simulation are only 0.9831 and 0.8808. Thus, the uncorrelated approach produces an overestimate in system reliability of nearly 1% in both cases. This analysis clearly shows that the approach that assumes physical lines fail in a statistically independent manner can produce significant error when correlated failures can occur, which is often the case in real-world systems. The proposed approach thus provides a more detailed method to assess SFNs so that their system reliability can be assured.
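The qualitative behavior discussed above (positive correlation lowering the reliability of a 2-out-of-3 edge while raising that of a 3-out-of-3 edge) can be illustrated with a minimal sketch. The common-cause mixture used here is an assumption made purely for illustration and is not the multivariate Bernoulli technique of the paper: with probability rho all lines share a single outcome, and otherwise they fail independently, which gives each line reliability p and each pair of lines correlation rho. The function name `k_out_of_n_reliability` is hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float, rho: float) -> float:
    """Reliability of a k-out-of-n system of exchangeable lines.

    Assumed model (not the paper's): with probability rho all n lines share
    one Bernoulli(p) outcome; otherwise they are independent Bernoulli(p).
    Each line then works with probability p and every pair has correlation rho.
    """
    if not (1 <= k <= n):
        raise ValueError("require 1 <= k <= n")
    # Independent branch: binomial tail P(at least k of n lines work).
    independent = sum(
        comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1)
    )
    # Common-cause branch: all n lines work together with probability p.
    return rho * p + (1 - rho) * independent


if __name__ == "__main__":
    for p in (0.9, 0.95, 0.99):
        for k in (2, 3):
            r_ind = k_out_of_n_reliability(k, 3, p, 0.0)
            r_cor = k_out_of_n_reliability(k, 3, p, 0.05)
            print(f"p={p:.2f} {k}-out-of-3: independent={r_ind:.6f} "
                  f"correlated={r_cor:.6f} change={r_cor - r_ind:+.6f}")
```

Under this assumed model with p = 0.95 and rho = 0.05, the 2-out-of-3 reliability falls from 0.99275 to 0.9906125, whereas the 3-out-of-3 reliability rises from 0.857375 to 0.86200625. The loss for the 2-out-of-3 case is also larger at p = 0.9 than at p = 0.99, in line with the trend described for Fig. 6.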
5. Conclusion
Fig. 5. The Taiwan academic network (TANET).
This paper develops a method to measure the impact of correlated failures on the system reliability of a stochastic-flow network. We propose a simulation technique to estimate the system reliability of a computer network in which the physical lines or routers comprising the edges and nodes may experience correlated failures. Moreover, we allow the failure of physical lines or routers in one edge/node to affect the failure of physical lines or routers in all of the other edges and nodes of the network, providing a method to model global-scale correlations. We first find the minimal capacity vectors, which form the combinations of edge and node performances that can satisfy a given level of demand. Correlation in the failures of the physical lines and routers was modeled with a technique to simulate multivariate Bernoulli distributions. Two experiments were performed, including a large-scale case study of the Taiwan academic network (TANET). Both results reveal that correlated
Table 7
Impact of correlation of TANET system reliability.

Demand (d)   No. of minimal capacity vectors   No correlation      Correlation   Difference
1            6                                 0.999999999449656   1.0000000     0.00000000055
2            20                                0.999999906885408   0.9999876     0.00001230689
3            50                                0.999993437105799   0.9998394     0.00015403711
4            84                                0.999752418028649   0.9979523     0.00180011803
5            106                               0.994703865202139   0.9822638     0.01244006520
6            100                               0.937852655573230   0.8958865     0.04196615557
7            50                                0.645528761457905   0.6233829     0.02214586146
8            19                                0.219925672898952   0.2358484     0.01592272710
9            4                                 0.018294993793876   0.0248410     0.00654600621
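As a consistency check on Table 7, the following sketch recomputes the signed difference (no correlation minus correlation) from the two reliability columns, lists the demand levels at which the independence assumption overestimates reliability, and reproduces the lower bound quoted for d = 1. The reliability values are transcribed from the table; the variable and function names are hypothetical. The Difference column of the table corresponds to the absolute value of these signed differences.

```python
# Reliability values transcribed from Table 7 (TANET, demand d = 1..9).
NO_CORRELATION = [
    0.999999999449656, 0.999999906885408, 0.999993437105799,
    0.999752418028649, 0.994703865202139, 0.937852655573230,
    0.645528761457905, 0.219925672898952, 0.018294993793876,
]
CORRELATION = [
    1.0000000, 0.9999876, 0.9998394, 0.9979523, 0.9822638,
    0.8958865, 0.6233829, 0.2358484, 0.0248410,
]


def signed_differences(independent, correlated):
    """Return (independent - correlated) for each demand level."""
    return [a - b for a, b in zip(independent, correlated)]


DIFFS = signed_differences(NO_CORRELATION, CORRELATION)

# Demand levels where assuming independence overestimates reliability.
OVERESTIMATED = [d for d, diff in enumerate(DIFFS, start=1) if diff > 0]

# Lower bound for d = 1 as argued in the text: subtract the d = 2 difference.
LOWER_BOUND_D1 = NO_CORRELATION[0] - DIFFS[1]

if __name__ == "__main__":
    for d, diff in enumerate(DIFFS, start=1):
        print(f"d={d}: {diff:+.11f}")
    print("overestimated at d =", OVERESTIMATED)
    print(f"lower bound at d=1: {LOWER_BOUND_D1:.7f}")
```

The signed differences are positive for d = 2 through 7, peak at d = 6, and are negative for d = 1, 8, and 9.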
Fig. 6. Difference between uncorrelated and correlated system reliability estimates.
failures can have a substantial impact on the system reliability of SFN. This greater detail will allow network designers to explicitly consider the problems posed by global correlations when designing networks to ensure that user demands can be satisfied with high probability. Several problems for future research exist. The first is to identify efficient numerical techniques to assess the system reliability of SFN with globally correlated failures. Further design-oriented studies should conduct detailed sensitivity analysis to identify the most critical edges and nodes so that the system reliability of large SFN can be improved cost-effectively. It will also be important to demonstrate the applicability of the approach to other types of network models such as power transmission and transportation.
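The overall estimation idea summarized in this section (generate correlated line failures, convert them to a capacity vector, and check it against the minimal capacity vectors) can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure used in the experiments: the three-component network, its minimal capacity vectors, and the network-wide common-cause correlation model are all hypothetical, and the distance-dependent correlation structure of the case studies is not reproduced. An exact enumeration for the independent case is included only to sanity-check the simulation.

```python
import random
from itertools import product
from math import comb

# Hypothetical toy SFN: two parallel edges (a, b) feeding one sink node (c),
# each built from three unit-capacity lines. These are illustrative minimal
# capacity vectors for demand d = 2, not vectors taken from the paper.
MIN_CAPACITY_VECTORS = [(2, 0, 2), (0, 2, 2), (1, 1, 2)]
LINES_PER_COMPONENT = 3


def satisfies_demand(capacity, mcvs=MIN_CAPACITY_VECTORS):
    """True if the capacity vector dominates at least one minimal vector."""
    return any(all(x >= y for x, y in zip(capacity, mcv)) for mcv in mcvs)


def sample_capacity(p, rho, rng, n_components=3):
    """Draw one capacity vector under an assumed common-cause model."""
    if rng.random() < rho:
        # Common-cause event: every line in the network shares one outcome.
        shared = LINES_PER_COMPONENT if rng.random() < p else 0
        return (shared,) * n_components
    # Otherwise each line works independently with probability p.
    return tuple(
        sum(rng.random() < p for _ in range(LINES_PER_COMPONENT))
        for _ in range(n_components)
    )


def estimate_reliability(p, rho, n_samples=50_000, seed=0):
    """Monte Carlo estimate of P(capacity vector satisfies the demand)."""
    rng = random.Random(seed)
    hits = sum(
        satisfies_demand(sample_capacity(p, rho, rng))
        for _ in range(n_samples)
    )
    return hits / n_samples


def exact_independent_reliability(p):
    """Exact reliability by enumeration when all lines are independent."""
    n = LINES_PER_COMPONENT

    def pmf(x):
        return comb(n, x) * p**x * (1 - p) ** (n - x)

    return sum(
        pmf(a) * pmf(b) * pmf(c)
        for a, b, c in product(range(n + 1), repeat=3)
        if satisfies_demand((a, b, c))
    )


if __name__ == "__main__":
    print("exact, independent :", exact_independent_reliability(0.95))
    print("simulated, rho=0.00:", estimate_reliability(0.95, 0.0))
    print("simulated, rho=0.05:", estimate_reliability(0.95, 0.05))
```

With rho = 0 the simulated value should agree with the enumeration up to sampling error, which gives a simple check before a correlation model is switched on.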
Acknowledgements

This work was supported in part by the National Science Council, Taiwan, Republic of China, under Grant No. NSC 98-2221-E-011-051-MY3. Lance Fiondella’s research was made possible by the National Science Foundation and the National Science Council of Taiwan through an East Asia and Pacific Summer Institutes (EAPSI) for U.S. Graduate Students (Award #1107724).

References

[1] Aggarwal KK, Gupta JS, Misra KB. A simple method for reliability evaluation of a communication system. IEEE Transactions on Communications 1975;23:563–6.
[2] Alexopoulos C. A note on state-space decomposition methods for analyzing stochastic flow networks. IEEE Transactions on Reliability 1995;44:354–7.
[3] Aven T. Reliability evaluation of multistate systems with multistate components. IEEE Transactions on Reliability 1985;34:473–9.
[4] Hudson JC, Kapur KC. Reliability bounds for multistate systems with multistate components. Operations Research 1985;33:153–60.
[5] Lin YK. A simple algorithm for reliability evaluation of a stochastic-flow network with node failure. Computers and Operations Research 2001;28:1277–85.
[6] Lin YK. Reliability of a stochastic-flow network with unreliable branches & nodes under budget constraints. IEEE Transactions on Reliability 2004;53:381–7.
[7] Lin YK. Reliability evaluation for an information network with node failure under cost constraint. IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans 2007;37:180–8.
[8] Lin YK, Chang PC. Evaluation of system reliability for a cloud computing system with imperfect nodes. Systems Engineering 2012;15:83–94.
[9] Lin YK, Yeh CT. Using minimal cuts to optimize network reliability for a stochastic computer network subject to assignment budget. Computers and Operations Research 2011;38:1175–87.
[10] Lin YK, Yeh CT. Maximal network reliability for a stochastic power transmission network. Reliability Engineering and System Safety 2011;96:1332–9.
[11] Xue J. On multistate system analysis. IEEE Transactions on Reliability 1985;34:329–37.
[12] Zuo MJ, Tian Z, Huang HZ. An efficient method for reliability evaluation of multistate networks given all minimal path vectors. IIE Transactions 2007;39:811–7.
[13] Yeh WC. A simple minimal path method for estimating the weighted multicommodity multistate unreliable networks reliability. Reliability Engineering and System Safety 2008;93:125–36.
[14] Bulteau S, Khadiri ME. A new importance sampling Monte Carlo method for a flow network reliability problem. Naval Research Logistics 2002;49:204–28.
[15] Fishman GS, Shaw TD. Evaluating reliability of stochastic flow networks. Probability in the Engineering and Informational Sciences 1989;3:493–509.
[16] Fishman GS. A comparison of four Monte Carlo methods for estimating the probability of s–t connectedness. IEEE Transactions on Reliability 1986;35:145–55.
[17] Yeh WC. A simple MC-based algorithm for evaluating reliability of stochastic-flow network with unreliable nodes. Reliability Engineering and System Safety 2004;83:47–55.
[18] Yeh WC, Lin YC, Chung YY. Performance analysis of cellular automata Monte Carlo simulation for estimating network reliability. Expert Systems with Applications 2010;37:3537–44.
[19] Yeh WC. An improved Monte-Carlo method for estimating the continuous state network one-to-one reliability. WSEAS Transactions on Systems 2007;6:959–66.
[20] Zenklusen R, Laumanns M. High-confidence estimation of small s–t reliabilities in directed acyclic networks. Networks 2011;57:376–88.
[21] Ramirez-Marquez JE, Coit DW. Monte-Carlo simulation approach for approximating multi-state two-terminal reliability. Reliability Engineering and System Safety 2005;87:253–64.
[22] Ramirez-Marquez JE, Coit DW. Multistate component criticality analysis for reliability improvement in multistate systems. Reliability Engineering and System Safety 2007;92:1608–19.
[23] Fiondella L. Reliability and sensitivity analysis of coherent systems with negatively correlated component failure. International Journal of Reliability, Quality and Safety Engineering 2010;17:505–29.
[24] Ford LR, Fulkerson DR. Flows in network. New Jersey: Princeton University Press; 1962.
[25] Yarlagadda R, Hershey J. Fast Algorithm for computing the reliability of communication network. International Journal of Electronics 1991;70:549–64.
[26] Burden R, Faires J. Numerical analysis. 8th ed. Belmont, CA: Brooks/Cole; 2004.
[27] Fiondella L, Zeephongsekul P. Reliability of systems with identically distributed correlated components. Proceeding of ISSAT international conference on reliability and quality in design, Aug 2011.