Closed-form solution for reliability of SCI-based multiprocessor systems using Weibull distribution and self-healing rings


Computers and Electrical Engineering 30 (2004) 309–329 www.elsevier.com/locate/compeleceng

Mohammad Al-Rousan a,*, Adnan Shaout b

a Computer Engineering, The American University of Sharjah, Sharjah, UAE
b Department of Electrical and Computer Engineering, The University of Michigan-Dearborn, Dearborn, MI 48128, USA

Received 26 July 2001; received in revised form 20 June 2003; accepted 6 April 2004

Abstract

This paper introduces a new closed-form solution for the reliability of large-scale multiprocessor systems. The systems are based on SCI rings interconnected in hierarchical structures. Reliability expressions are derived using an enumeration technique, assuming a Weibull failure process. The reliability function derived in this paper is general and valid for any hierarchical ring-based system with an arbitrary number of levels. The hierarchical interconnections are constructed from self-healing rings and basic rings. The analysis shows the improvement in reliability achieved when self-healing rings are used. Although we used hierarchical systems based on SCI rings, the technique followed in this work applies to any type of ring, such as slotted or token rings. © 2004 Elsevier Ltd. All rights reserved.

Keywords: Weibull; Reliability; SCI; Hierarchical multiprocessors; Self-healing rings

* Corresponding author. Tel.: +971 6 5152932; fax: +971 6 5152979. E-mail addresses: [email protected] (M. Al-Rousan), [email protected] (A. Shaout).

0045-7906/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.compeleceng.2004.04.001


M. Al-Rousan, A. Shaout / Computers and Electrical Engineering 30 (2004) 309–329

1. Introduction

Performance and reliability are two major issues in multiprocessor systems, and both are greatly influenced by the type of interconnection used in the system. Multistage, hypercube, mesh, and ring-based networks are well-known interconnections for large-scale multiprocessors. Ring-based interconnections for multiprocessor systems have not gained much attention, so only a few such systems have been introduced. KSR-1 from Kendall Square Research [1] and Hector [2] are examples of ring-based multiprocessors built in a hierarchical fashion. Recently, owing to the development of the Scalable Coherent Interface (SCI), hierarchical ring-based systems have attracted considerable attention. The SCI interconnection provides low-latency, high-bandwidth communication for multiprocessors and computer network systems. More details on the SCI standard and its functionality can be found in [3]. The impact of SCI on the performance of multiprocessor systems is discussed in [4–8].

Since the reliability of a multiprocessor system is dominated by the reliability of its interconnection, this paper focuses on the reliability of hierarchical ring-based systems. The systems we consider are of the type shown in Fig. 1. In the figure, the interconnection is constructed from multiple rings distributed at different levels of the hierarchy. The processor elements (PE) are connected to rings at level 0 via SCI nodes with one interface each. Rings at higher levels are interconnected via SCI nodes having two interfaces (called bridges). The structure of the SCI bridge is shown in Fig. 2. It is similar to a single SCI node except that its input and output components are duplicated so that it can connect two rings. In addition, it has switching logic to exchange traffic between the two rings. SCI bridges are discussed thoroughly in [9].

In general, the hierarchical systems considered here are referred to as $H(R) = (F_0, F_1, \ldots, F_R)$.
This means that the system has $R + 1$ levels, and each ring at level $i$ consists of $F_i$ SCI nodes. Therefore, each ring at level $i$ may connect $F_i$ rings at level $i - 1$. Note that there is one extra SCI node connecting each ring at level $i$ to a ring at level $i + 1$, except for the ring at the root (level $R$). The rings at level 0 are connected to the processors directly.

Fig. 1. Three-level hierarchy H(R) = (4, 3, 2).

Fig. 2. SCI bridge.

In this paper we provide closed-form expressions for the K-processor reliability of large-scale hierarchical ring-based systems. That is, we calculate the probability that a group of K processors within the system are able to communicate. Our interest lies in the reliability behavior of hierarchies constructed from basic and self-healing rings.

In recent work [10], reliability was computed for systems similar to the hierarchical systems we consider here. However, that analysis was made for hierarchies based on basic and double counter-rotating rings. Self-healing rings were not considered, and the analysis assumed an exponential distribution for the reliability function. The exponential reliability function is a special case of the Weibull-based analysis we introduce here. Other studies [11,7] derived the reliability of hierarchies based on basic, braided, and counter-rotating rings. The derived formulas are valid only for two-processor (K = 2) reliability using the exponential function. Sarwar et al. [12] presented a reliability study for SCI ring-based interconnections. They presented reliability results for two-dimensional k-ary n-cube ring-based interconnections, whereas we use hierarchical (multilevel) ring-based interconnections. Yin and Silio [13] considered ring networks and derived closed-form K-processor reliability expressions for a counter-rotating ring, a tree of wiring concentrators and terminals, and a counter-rotating ring of concentrator trees. Our analysis differs from theirs in that our reliability expressions can be applied directly to ring-based hierarchies with an arbitrary number of levels. Furthermore, unlike their analysis, we compute the overall system reliability by averaging over all cases that contribute to K-processor reliability.
In [14–19], closed-form two-processor (K = 2) reliability expressions are derived without considering hierarchical networks. In [20–23], two-processor reliability is computed for hierarchical structures with a fixed number of levels.

The importance of the analysis made in this paper lies in the evaluation of K-processor reliability for self-healing, ring-based interconnections in the context of hierarchical, large-scale multiprocessor systems. There are no special cases or restrictions on the reliability expressions we derive. The analysis is valid for hierarchies with an arbitrary number of levels. We compare the reliability of systems based on self-healing rings with the reliability of systems based on basic rings. The analysis is valid for any large-scale system with N processors. In addition, the analysis applies to any ring-based system, not only to SCI-based systems.

We use systems of 512 and 1024 processors to discuss the derived reliability expressions. The results obtained for the systems under consideration show that self-healing rings significantly improve the reliability behavior of large-scale systems compared with basic rings. Moreover, the reliability function derived for self-healing structures suggests that systems with a low number of levels provide higher reliability and are preferred over systems with a higher number of levels (more than four).

The remaining sections of the paper are organized as follows. In Section 2 we derive the reliability of a single ring, and in Section 3 we present the reliability of the hierarchical interconnections. Numerical results are discussed in Section 4, followed by conclusions in Section 5.

2. Single ring reliability

Before we derive the overall reliability of hierarchical systems, we first show the reliability of a single ring from which a given interconnection is constructed, treating the ring as if it were isolated from the hierarchy. We show the reliability for the basic ring and for the self-healing ring. In our analysis we assume that a node on a ring has at least one interface, depending on the type of the ring; for self-healing rings, SCI nodes must have two interfaces, one for each ring. We also assume that the failure of any node or component in the system is a random event, statistically independent of the failures of other components or nodes. The (point-to-point) links are assumed to be highly reliable, so that link failures are negligible compared with those of other components in the hierarchy. As stated before, the Weibull distribution is used for the reliability function. In general, the reliability expression is given by

$$R(t) = e^{-(\lambda t)^{\alpha}} \tag{1}$$

where $\lambda$ is the scale parameter and $\alpha$ is the shape parameter. The value of $\alpha$ indicates whether the failure rate is decreasing ($\alpha < 1$), increasing ($\alpha > 1$), or constant ($\alpha = 1$) with time. Most reliability studies [7,10,13–23] assume an exponential reliability function. This assumption usually simplifies reliability analysis, since it models systems with constant failure rates. Although the exponential function is accurate enough for some systems, in reality failure rates do not remain constant as systems age. Failure rates for most hardware systems increase with time, while failure rates of software-based systems decrease with time. The Weibull distribution is more general than the exponential distribution and can model systems with decreasing, increasing, and constant failure rates; the constant-rate case is obtained by setting $\alpha = 1$. However, the choice of the Weibull distribution complicates the derivation of the reliability function for complex large-scale systems.

2.1. Basic ring

The reliability of the basic ring has been derived in previous works such as [11]. However, because we assume a Weibull distribution for the reliability model, we rederive the reliability equation under this assumption. In the basic ring, the system connects F SCI nodes, each of which has one interface attached to the ring. Nodes in the system are connected in a point-to-point configuration to form a loop topology (ring). The failure of any node in the basic ring brings the whole system down. If K processors are to communicate, the communication will occur if and only if all the K processors are working and all the F SCI nodes are up. Let $R_K(\text{basic}, t)$ be the K-processor reliability of the basic ring; then it can be calculated as follows:

$$R_K(\text{basic}, t) = e^{-K(\lambda_{PE} t)^{\alpha}}\, e^{-F(\lambda_f t)^{\alpha}} \tag{2}$$
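Eqs. (1) and (2) are straightforward to evaluate numerically. The following is a minimal sketch, not the authors' code: the function names and the mission time used in the example are illustrative assumptions, while the failure rate of 1e-6 per hour is the value quoted later in Section 4.

```python
import math


def weibull_reliability(t, lam, alpha):
    """Eq. (1): R(t) = exp(-(lam * t) ** alpha) for a single component."""
    return math.exp(-((lam * t) ** alpha))


def basic_ring_reliability(t, K, F, lam_pe, lam_f, alpha):
    """Eq. (2): all K processors and all F SCI nodes must be operational."""
    processors_up = weibull_reliability(t, lam_pe, alpha) ** K
    nodes_up = weibull_reliability(t, lam_f, alpha) ** F
    return processors_up * nodes_up


if __name__ == "__main__":
    # Illustrative values only: 12-node ring, 3 processors of interest,
    # 10,000-hour mission time (assumed), constant failure rate (alpha = 1).
    print(basic_ring_reliability(1e4, K=3, F=12, lam_pe=1e-6, lam_f=1e-6, alpha=1.0))
```

With $\alpha = 1$ and equal failure rates the result collapses to the exponential form $e^{-(K+F)\lambda t}$, which gives a quick sanity check.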

Note that the failures of the individual components (processors and SCI nodes) in the rings are assumed independent. The above equation is the product of the probability that all K processors are operational, $e^{-K(\lambda_{PE} t)^{\alpha}}$, and the probability that all F SCI nodes are also operational, $e^{-F(\lambda_f t)^{\alpha}}$.

2.2. Self-healing ring

In this scheme, the system consists of two rings (primary and secondary) and F SCI nodes attached to both rings via two separate interfaces, as shown in Fig. 3. The nodes have the ability to self-heal: the network is wrapped from the primary to the secondary ring (and vice versa) to provide a failure recovery mechanism that eliminates down nodes and maintains some level of communication in the network. Each node contains a self-heal (ring-wrap) configuration unit in addition to its interfaces. When a node failure occurs, the nodes on either side of the failed node loop signals back from the primary ring to the secondary ring (upstream node) or from the secondary ring to the primary ring (downstream node), so as to isolate the failed node and maintain a complete ring.

Fig. 3. Node isolation in self-healing ring.


For a given ring, if our interest lies in K processors, then the nodes in the ring are partitioned into $K + 1$ sets. The first set contains the K processors of interest. Each of the remaining K sets is denoted by $S_i$ and contains $d_i$ nodes that are located between the node pair $(i, (i + 1) \bmod K)$. For example, the ring in Fig. 4 contains 12 nodes (F = 12), each attached to a local processor. Let the processors of interest be P0, P1 and P2, which are connected to nodes 0, 1 and 2, respectively. Since K = 3 here, the resulting K + 1 sets are:

- A 3-element set that contains nodes 0, 1 and 2.
- Set $S_0$: contains three nodes ($d_0 = 3$) located between the node pair (0, 1).
- Set $S_1$: contains four nodes ($d_1 = 4$) located between the node pair (1, 2).
- Set $S_2$: contains two nodes ($d_2 = 2$) located between the node pair (2, 0).

Fig. 4. Partitioning in self-healing ring for K = 3 processors and 12 SCI nodes.

Using unions of the $S_i$ ($0 \le i \le K - 1$), we can form K new sets of the form

$$A_j = \bigcup_{\substack{i=0 \\ i \ne j}}^{K-1} S_i \qquad (0 \le j \le K - 1)$$

$A_j$ ($0 \le j \le K - 1$) is simply the union of all $S_i$ ($0 \le i \le K - 1$) excluding $S_j$. In Fig. 4, for example, we can form three new sets as follows:

1. $A_0 = S_1 \cup S_2$, containing the nodes that belong to $S_1$ and $S_2$.
2. $A_1 = S_0 \cup S_2$, containing the nodes that belong to $S_0$ and $S_2$.
3. $A_2 = S_0 \cup S_1$, containing the nodes that belong to $S_0$ and $S_1$.

Observe that as long as none of the nodes that belong to $A_0 = S_1 \cup S_2$ has failed, we can lose any number of nodes that belong to $S_0$. Similarly, as long as none of the nodes that belong to $A_1 = S_0 \cup S_2$ (or $A_2 = S_0 \cup S_1$) has failed, we can lose any number of nodes that belong to $S_1$ (or $S_2$).

In order for the K processors of interest to communicate, all K processors and their interfaces must be operational and at least one $A_j$ ($0 \le j \le K - 1$) must be operational. $A_j$ is operational if and only if all $S_i$ sets ($0 \le i \le K - 1$, $i \ne j$) are operational. An $S_i$ set is operational if and only if all nodes in $S_i$ are up. Thus, the probability that $S_i$ ($0 \le i \le K - 1$) is operational is

$$\Pr\{CS_i\} = e^{-2 d_i (\lambda_f t)^{\alpha}} \tag{3}$$

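Since each SCI node uses two interfaces, one for each ring, $d_i$ in the above equation is multiplied by 2. The probability that $A_j$ ($0 \le j \le K - 1$) is operational is

$$\Pr\{CA_j\} = \Pr\Biggl(\bigcap_{\substack{i=0 \\ i \ne j}}^{K-1} CS_i\Biggr) = \prod_{\substack{i=0 \\ i \ne j}}^{K-1} \Pr\{CS_i\} \tag{4}$$

Thus, the probability that at least one $A_j$ ($0 \le j \le K - 1$) is operational can be calculated as

$$
\begin{aligned}
\Pr\Biggl(\bigcup_{j=0}^{K-1} A_j\Biggr) ={}& \sum_{j=0}^{K-1} \Pr\{CA_j\} - \sum_{j=0}^{K-1} \sum_{\substack{m=0 \\ m>j}}^{K-1} \Pr\{CA_j \cap CA_m\} \\
&+ \sum_{j=0}^{K-1} \sum_{\substack{m=0 \\ m>j}}^{K-1} \sum_{\substack{n=0 \\ n>m}}^{K-1} \Pr\{CA_j \cap CA_m \cap CA_n\} - \cdots \\
&+ (-1)^{K-1} \Pr\{CA_0 \cap CA_1 \cap \cdots \cap CA_{K-1}\}
\end{aligned} \tag{5}
$$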

Since each $S_i$ ($0 \le i \le K - 1$) is contained in more than one of the sets $A_j$ ($0 \le j \le K - 1$), the events $CA_j$ ($0 \le j \le K - 1$) are not independent, and hence

$$\Pr(CA_j \cap CA_k) \ne \Pr(CA_j) \cdot \Pr(CA_k)$$

The following rule is therefore used in simplifying Eq. (5): if

$$\Pr(CA_j) = \Pr(CS_x) \Pr(CS_y) \quad \text{and} \quad \Pr(CA_k) = \Pr(CS_y) \Pr(CS_z)$$

then

$$\Pr(CA_j \cap CA_k) = \Pr(CS_x) \Pr(CS_y) \Pr(CS_z) \tag{6}$$

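By substituting Eq. (4) into Eq. (5) in conjunction with Eq. (6), the resulting equation becomes

$$
\begin{aligned}
\Pr\Biggl(\bigcup_{j=0}^{K-1} A_j\Biggr) ={}& \sum_{j=0}^{K-1} \prod_{\substack{i=0 \\ i \ne j}}^{K-1} \Pr\{CS_i\} - \binom{K}{2} \prod_{i=0}^{K-1} \Pr\{CS_i\} \\
&+ \binom{K}{3} \prod_{i=0}^{K-1} \Pr\{CS_i\} - \cdots + (-1)^{K-1} \prod_{i=0}^{K-1} \Pr\{CS_i\}
\end{aligned} \tag{7}
$$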


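As shown in Eq. (7), there are many repeated terms, some of which have a positive binomial coefficient

$$(-1)^{z_o - 1} \binom{K}{z_o}$$

where $z_o$ is an odd number, and some of which have a negative binomial coefficient

$$(-1)^{z_e - 1} \binom{K}{z_e}$$

where $z_e$ is an even number. Combining the positive and negative coefficients on repeated terms causes some of the terms to cancel in the final expression. Thus Eq. (7) reduces to

$$\Pr\Biggl(\bigcup_{j=0}^{K-1} A_j\Biggr) = \sum_{j=0}^{K-1} \prod_{\substack{i=0 \\ i \ne j}}^{K-1} \Pr\{CS_i\} - (K - 1) \prod_{i=0}^{K-1} \Pr\{CS_i\} \tag{8}$$

Substituting Eq. (3) into Eq. (8) results in

$$\Pr\Biggl(\bigcup_{j=0}^{K-1} A_j\Biggr) = \sum_{j=0}^{K-1} \prod_{\substack{i=0 \\ i \ne j}}^{K-1} \bigl(e^{-2 d_i (\lambda_f t)^{\alpha}}\bigr) - (K - 1) \prod_{i=0}^{K-1} \bigl(e^{-2 d_i (\lambda_f t)^{\alpha}}\bigr) \tag{9}$$

Eq. (9) can be simplified as follows:

$$
\begin{aligned}
\Pr\Biggl(\bigcup_{j=0}^{K-1} A_j\Biggr) ={}& e^{-2(d_1 + d_2 + \cdots + d_{K-1})(\lambda_f t)^{\alpha}} + e^{-2(d_0 + d_2 + \cdots + d_{K-1})(\lambda_f t)^{\alpha}} + \cdots + e^{-2(d_0 + \cdots + d_{K-2})(\lambda_f t)^{\alpha}} \\
&- (K - 1)\, e^{-2(d_0 + \cdots + d_{K-1})(\lambda_f t)^{\alpha}}
\end{aligned} \tag{10}
$$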
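The reduction from the inclusion-exclusion sum of Eq. (5) to the compact form of Eqs. (8) to (10) can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original derivation: it compares the closed form with a brute-force enumeration of which sets $S_i$ are up, using the reading that "at least one $A_j$ is operational" is the same event as "at most one $S_i$ has failed". The function names and the value of $x = (\lambda_f t)^{\alpha}$ are arbitrary choices.

```python
import math
from itertools import product


def set_probs(d, x):
    """Eq. (3): Pr{CS_i} = exp(-2 * d_i * x), with x = (lam_f * t) ** alpha."""
    return [math.exp(-2 * di * x) for di in d]


def union_prob_closed_form(d, x):
    """Eq. (8): sum_j prod_{i != j} p_i - (K - 1) * prod_i p_i."""
    p = set_probs(d, x)
    K = len(p)
    leave_one_out = sum(math.prod(p[:j] + p[j + 1:]) for j in range(K))
    return leave_one_out - (K - 1) * math.prod(p)


def union_prob_enumerated(d, x):
    """Brute force: add up the probability of every up/down pattern of the
    sets S_i in which at most one set is down (so some A_j is fully up)."""
    p = set_probs(d, x)
    total = 0.0
    for state in product((True, False), repeat=len(p)):
        if state.count(False) <= 1:
            weight = 1.0
            for up, pi in zip(state, p):
                weight *= pi if up else (1.0 - pi)
            total += weight
    return total


if __name__ == "__main__":
    d = [3, 4, 2]  # gap sizes d_0, d_1, d_2 from the Fig. 4 example
    print(union_prob_closed_form(d, 0.05), union_prob_enumerated(d, 0.05))
```

Both functions should agree to floating-point precision for any non-negative gap sizes.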

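The $d_i$ ($0 \le i \le K - 1$) can take values as follows:

$$
\begin{aligned}
0 &\le d_0 \le F - K \\
0 &\le d_1 \le F - K - d_0 \\
&\;\;\vdots \\
d_{K-1} &= F - K - d_0 - d_1 - \cdots - d_{K-2}
\end{aligned}
$$

For $d_0$, since the choice of the K processors of interest is independent, all $(F - K + 1)$ unique values of $d_0$ occur with equal probability. For $d_i$ ($1 \le i \le K - 2$), all $(F - K - d_0 - \cdots - d_{i-1} + 1)$ unique values of $d_i$ occur with equal probability. Therefore, $d_i$ ($0 \le i \le K - 2$) takes any particular value in its range with probability $1/U(K)$, where $U(K)$ is defined piecewise for $K = 2$ and $K > 2$:

[Eq. (11): piecewise definition of U(K) for K = 2 and K > 2; expression illegible in the source.]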


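Substituting for the values of $d_i$, Eq. (10) becomes

$$
\begin{aligned}
\Pr\Biggl(\bigcup_{j=0}^{K-1} A_j\Biggr) &= e^{-2(F - K - d_0)(\lambda_f t)^{\alpha}} + e^{-2(F - K - d_1)(\lambda_f t)^{\alpha}} + \cdots + e^{-2(F - K - d_{K-1})(\lambda_f t)^{\alpha}} - (K - 1)\, e^{-2(F - K)(\lambda_f t)^{\alpha}} \\
&= \sum_{j=0}^{K-1} \bigl(e^{-2(F - K - d_j)(\lambda_f t)^{\alpha}}\bigr) - (K - 1)\, e^{-2(F - K)(\lambda_f t)^{\alpha}} \\
&= e^{-2(F - K)(\lambda_f t)^{\alpha}} \Biggl[\sum_{j=0}^{K-1} \bigl(e^{2 d_j (\lambda_f t)^{\alpha}}\bigr) - K + 1\Biggr]
\end{aligned} \tag{12}
$$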

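Using the above equations, the K-processor reliability of the self-healing counter-rotating SCI ring can be calculated as

$$
\begin{aligned}
R_{\text{self-heal}}(t) &= \frac{e^{-K(\lambda_{PE} t)^{\alpha}}\, e^{-2K(\lambda_f t)^{\alpha}}}{U(K)} \sum_{d_0=0}^{F-K} \cdots \sum_{d_{K-2}=0}^{F-K-d_0-\cdots-d_{K-3}} \Pr\Biggl(\bigcup_{j=0}^{K-1} CA_j\Biggr) \\
&= \frac{e^{-K(\lambda_{PE} t)^{\alpha}}\, e^{-2F(\lambda_f t)^{\alpha}}}{U(K)} \sum_{d_0=0}^{F-K} \cdots \sum_{d_{K-2}=0}^{F-K-d_0-\cdots-d_{K-3}} \Biggl[\sum_{j=0}^{K-1} \bigl(e^{2 d_j (\lambda_f t)^{\alpha}}\bigr) - K + 1\Biggr]
\end{aligned} \tag{13}
$$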
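Eq. (13) averages the bracketed term of Eq. (12) over all admissible gap sizes, which can be sketched by direct enumeration. This is a hedged illustration, not the authors' implementation. In particular, the normalizer $U(K)$ is taken here to be the number of enumerated gap tuples, which is an assumption consistent with the $1/U(K)$ averaging but not confirmed by Eq. (11). The names `gap_tuples` and `self_heal_ring_reliability` are hypothetical.

```python
import math


def gap_tuples(F, K):
    """Yield every (d_0, ..., d_{K-1}) with d_i >= 0 and sum equal to F - K,
    following the ranges listed before Eq. (11)."""
    def rec(prefix, remaining, slots):
        if slots == 1:
            yield prefix + (remaining,)  # d_{K-1} is fixed by the others
            return
        for d in range(remaining + 1):
            yield from rec(prefix + (d,), remaining - d, slots - 1)
    yield from rec((), F - K, K)


def self_heal_ring_reliability(t, K, F, lam_pe, lam_f, alpha):
    """Eq. (13), with U(K) ASSUMED to equal the number of gap tuples."""
    x_f = (lam_f * t) ** alpha
    x_pe = (lam_pe * t) ** alpha
    tuples = list(gap_tuples(F, K))
    bracket_sum = sum(
        sum(math.exp(2 * dj * x_f) for dj in d) - K + 1 for d in tuples
    )
    return math.exp(-K * x_pe) * math.exp(-2 * F * x_f) * bracket_sum / len(tuples)


if __name__ == "__main__":
    # 12-node ring, 3 processors of interest, assumed 10,000-hour mission.
    print(self_heal_ring_reliability(1e4, K=3, F=12, lam_pe=1e-6, lam_f=1e-6, alpha=1.0))
```

Two limiting cases give a quick check: at $t = 0$ the result is 1, and when $F = K$ (no gap nodes) the expression reduces to $e^{-K(\lambda_{PE} t)^{\alpha}} e^{-2K(\lambda_f t)^{\alpha}}$.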

3. Interconnection reliability

Having calculated the reliability of a single ring, we now turn to the overall hierarchical system. As pointed out before, the number of levels in the hierarchies considered in this work may vary. The main question we want to answer is: if K (out of N) processors within the hierarchy are to communicate, what is the probability that this communication will take place? There are no restrictions on the distribution of the K processors under consideration; they could lie (if possible) in one single ring, or in g ($g \le K$) rings in the hierarchy. Therefore, computing this probability depends on the relative positions of the K processors in the interconnection system. To derive the overall reliability of a hierarchy connecting N processors, one has to consider all the possible cases corresponding to the positions of the K processors of interest within the hierarchy. The number of cases considered depends on the value of K. The case in which the K processors reside in one ring at level 0 is not counted if K is greater than the size of that ring, i.e. if $K > F_0$. Similarly, if $K > F_0 F_1$, the case in which the K processors are in a subtree containing $F_0 F_1$ interfaces is not considered. Generally, for a hierarchy $H(R)$ with $R + 1$ levels, the number of cases (C) one should consider is given by


$$
C = \begin{cases}
R + 1, & 2 \le K \le F_0 \\
R, & F_0 < K \le F_0 F_1 \\
\;\vdots \\
R - m, & F_0 F_1 \cdots F_m < K \le F_0 F_1 \cdots F_{m+1}, \quad \text{where } 1 \le m \le R - 1
\end{cases} \tag{14}
$$
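As a small illustration of Eq. (14), the number of cases can be found by walking up the hierarchy until the subtree capacity reaches K. This sketch assumes that the upper bound in the general branch is $F_0 F_1 \cdots F_{m+1}$ (so that the branches tile the range of K without gaps); the function name `number_of_cases` is hypothetical.

```python
def number_of_cases(F, K):
    """Eq. (14): number of cases C for a hierarchy F = (F_0, ..., F_R).

    Level m contributes a case only if a subtree rooted at level m
    (capacity F_0 * ... * F_m processors) can hold all K processors.
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    R = len(F) - 1
    capacity = 1
    for level, ring_size in enumerate(F):
        capacity *= ring_size
        if K <= capacity:
            # level 0 -> R + 1 cases, level 1 -> R cases, level m + 1 -> R - m
            return R + 1 - level
    raise ValueError("K exceeds the number of processors in the hierarchy")


if __name__ == "__main__":
    # Three-level hierarchy of Fig. 1, H(R) = (4, 3, 2), so R = 2.
    for K in (2, 4, 5, 12, 13, 24):
        print(K, number_of_cases((4, 3, 2), K))
```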

In order to obtain a more general formula we should cover all cases; hence we assume that $2 \le K \le F_0$, so that the number of cases to be considered is $R + 1$. These cases can be classified as follows:

- Case 0. All of the K processors belong to the same ring at level 0 of the hierarchy. The probability of this case is

$$\Pr\{T^{0}_{K}\} = \frac{N}{F_0} \binom{F_0}{K} \bigg/ \binom{N}{K} \tag{15}$$

Each of the remaining R cases can be represented as follows:

- Case i ($1 \le i \le R$). The K processors belong to the same subtree whose root is at level $i$ ($T^{i}$), given that they are in distinct subtrees whose roots are at level $i - 1$ ($T^{i-1}$).

In order to calculate the probability of case i, we should consider all possible distributions of the K processors of interest within the subtree whose root is at level i. Let $j_0$ be the number of rings of interest at level 0 to which the K processors of interest are attached. Let $C_{j_0}$ be the number of all possible distributions of the K processors in $j_0$ rings at level 0 such that $r^0_1$ processors are in the first ring, $r^0_2$ processors are in the second ring, and $r^0_i$ processors are in the ith ring, where $r^0_1 + r^0_2 + \cdots + r^0_i + \cdots + r^0_{j_0} = K$. Similarly to $j_0$, let $j_m$ be the number of rings of interest at level m to which the $j_{m-1}$ rings of interest at level $m - 1$ are connected. Let $C_{j_m}(j_{m-1})$ be the number of all possible ways of connecting these $j_{m-1}$ rings to $j_m$ rings at level m such that $r^m_1$ of the $j_{m-1}$ rings are connected to the first of the $j_m$ rings, $r^m_2$ rings are connected to the second ring, and $r^m_i$ are connected to the ith ring, where $r^m_1 + r^m_2 + \cdots + r^m_i + \cdots + r^m_{j_m} = j_{m-1}$. Then, the probability of case i can be calculated as

$$\Pr\{T^{i}_{i-1,K}\} = \frac{N}{\prod_{n=0}^{i} F_n}\, G_i \bigg/ \binom{N}{K} \tag{16}$$

where

$$G_i = \binom{F_i}{j_{i-1}}\, C_{j_0} \prod_{m=1}^{i-1} C_{j_m}(j_{m-1})$$

and

$$C_{j_m}(j_{m-1}) = \prod_{n=1}^{j_m} \binom{F_m}{r^m_n}$$
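The case-0 probability of Eq. (15) is simple enough to evaluate directly, and for K = 2 it can be cross-checked against an elementary argument: once the first processor is fixed, the second lies in the same level-0 ring with probability $(F_0 - 1)/(N - 1)$. The sketch below is illustrative only; the function name `prob_case0` is an assumption, and it relies on $N$ being a multiple of $F_0$.

```python
from math import comb


def prob_case0(N, F0, K):
    """Eq. (15): probability that all K processors of interest share one
    level-0 ring, in a system of N processors with F0 processors per ring."""
    rings_at_level0 = N // F0  # N / F_0, assuming F0 divides N
    return rings_at_level0 * comb(F0, K) / comb(N, K)


if __name__ == "__main__":
    # 512 processors with level-0 rings of 8 nodes, e.g. H1 (8, 8, 8).
    print(prob_case0(512, 8, 2))
    print(prob_case0(512, 8, 4))
```

Because `math.comb(n, k)` returns 0 when `k > n`, the function also returns 0 when K exceeds the ring size, matching the statement that case 0 is not counted in that situation.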

Having calculated the probability of occurrence for each of the above cases, we now derive the overall reliability for any given hierarchy. The K-processor reliability for systems based on self-healing rings, $R_K(\text{self-heal}, t)$, can be calculated as

$$R_K(\text{self-heal}, t) = \sum_{i=0}^{R} (\text{Reliability of case } i) \times (\text{Probability of occurrence of case } i) \tag{17}$$

The probability for each case is calculated in Eqs. (15) and (16). When calculating the reliability for each case, one has to consider all rings and components within the subtree containing the K processors of interest. Since we assume that the failures of components are independent of each other, the reliability of each case can be computed as the product of the reliabilities of the K processors, rings, and SCI switches that are involved in that case. The reliability of any ring involved in the case under consideration can be found using Eq. (13) derived above. Let $R^{i}_{SH}(n_i, k_i, t)$ be the k-node reliability ($k \ge 2$) of a single ring having n SCI nodes at level i. Note that the ring reliability is a function of the number of nodes n and the number of nodes of interest, k, within the ring. The values of n and k are determined according to the distribution of the K processors of interest within the hierarchy. Using Eq. (13) with $F = n_i$ and $K = k_i$, the reliability of a self-healing ring at level i is given by

$$R^{i}_{SH}(n_i, k_i, t) = \frac{e^{-2 n_i (\lambda_f t)^{\alpha}}}{U(k_i)} \sum_{d_0=0}^{n_i - k_i} \cdots \sum_{d_{k_i-2}=0}^{n_i - k_i - d_0 - \cdots - d_{k_i-3}} \Biggl[\sum_{j=0}^{k_i-1} \bigl(e^{2 d_j (\lambda_f t)^{\alpha}}\bigr) - k_i + 1\Biggr] \tag{18}$$

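The overall reliability expressed in Eq. (17) can be written as:

$$
\begin{aligned}
R_K(\text{self-heal}, t) ={}& e^{-K(\lambda_{PE} t)^{\alpha}}\, q_1(K)\, R^{0}_{SH}\bigl(K, F_0 + q_2(R), t\bigr)\, \frac{N}{F_0} \binom{F_0}{K} \bigg/ \binom{N}{K} \\
&+ e^{-K(\lambda_{PE} t)^{\alpha}} \sum_{i=i^*}^{R} \sum_{j_{i-1}} \cdots \sum_{j_0} \Biggl\{ \Biggl[ R^{i}_{SH}\bigl(j_{i-1}, F_i + q_3(i), t\bigr)\, \frac{N}{\prod_{n=0}^{i} F_n} \binom{F_i}{j_{i-1}} \bigg/ \binom{N}{K} \Biggr] \\
&\times \Biggl[ \sum_{r^{i-1}_1} \cdots \sum_{r^{i-1}_{j_{i-1}}} \prod_{n=1}^{j_{i-1}} \binom{F_{i-1}}{r^{i-1}_n}\, e^{-(\lambda_{sw} t)^{\alpha}}\, R^{i-1}_{SH}\bigl(r^{i-1}_n + 1, F_{i-1} + 1, t\bigr) \Biggr] \cdots \\
&\times \Biggl[ \sum_{r^{0}_1} \cdots \sum_{r^{0}_{j_0}} \prod_{n=1}^{j_0} \binom{F_0}{r^{0}_n}\, e^{-(\lambda_{sw} t)^{\alpha}}\, R^{0}_{SH}\bigl(r^{0}_n + 1, F_0 + 1, t\bigr) \Biggr] \Biggr\}
\end{aligned} \tag{19}
$$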


where all the parameters are explained in Appendix A. The first term in the above equation represents the reliability of case 0, where the K processors are in the same ring at level 0. Recall that case 0 is considered only if $K \le F_0$; therefore, we use the function $q_1(K)$ in the equation. The rest of the equation represents the reliability of all possible cases for case i ($i^* \le i \le R$). Note that a ring at level i has $F_i + 1$ interfaces, while the root of the hierarchy at level R has only $F_R$ interfaces. Hence, we use $q_2(R)$ and $q_3(i)$ to differentiate between these cases. To connect level i with level $i + 1$ an extra interface is needed, so $F_i + 1$ interfaces are included in the reliability calculation, as shown in the equation. Each bridge has a 2 × 2 switch that is used to route communication packets between rings. The reliability of the switch, $e^{-(\lambda_{sw} t)^{\alpha}}$, is included because all switches in the subtree that contains the K processors of interest must be operational. It is also required that all K processors of interest be working in order for them to communicate. Therefore, the reliability of the K processors, $e^{-K(\lambda_{PE} t)^{\alpha}}$, multiplies all terms in the above equation.

3.1. Hierarchy of basic rings

The process of deriving the overall reliability of a hierarchical system based on basic rings is very similar to what we have done for the self-healing hierarchies. In order for K processors to communicate, the number of possible cases to be considered within the subtree(s) containing the K processors is controlled by Eq. (14). The probabilities of the associated cases can be calculated using Eqs. (15) and (16). Therefore, the overall reliability equation (Eq. (17)) holds for hierarchies based on basic rings. However, we first need to calculate the reliability of any ring at level i, $R^{i}_{\text{basic}}(k_i, n_i, t)$, which is needed in the overall reliability equation. Using Eq. (2), the reliability of a basic ring at level i is given by

$$R^{i}_{\text{basic}}(k_i, n_i, t) = e^{-n_i (\lambda_f t)^{\alpha}} \tag{20}$$

Note that a ring at level i ($i > 0$) has no processors attached to its SCI interfaces; it only has $n_i$ SCI nodes. If a ring is involved in any considered case, the ring as a whole must work in order for the K processors to be able to communicate; hence, the reliability of the basic ring in Eq. (20) does not depend on the number of nodes of interest k. Now, using Eqs. (15), (16) and (20) in Eq. (17), the K-processor reliability of the hierarchical interconnections based on the basic ring, $R_K(\text{basic}, t)$, is given by

$$
\begin{aligned}
R_K(\text{basic}, t) ={}& e^{-K(\lambda_{PE} t)^{\alpha}}\, \Pr\{T^{0}_{K}\}\, q_1(K)\, e^{-(F_0 + q_2(R))(\lambda_f t)^{\alpha}} \\
&+ e^{-K(\lambda_{PE} t)^{\alpha}} \sum_{i=i^*}^{R} \sum_{j_{i-1}} \sum_{j_{i-2}} \cdots \sum_{j_0} \Biggl\{ \Pr\{T^{i}_{i-1,K}\}\, e^{-(F_i + q_3(i))(\lambda_f t)^{\alpha}} \\
&\times \Biggl[ \sum_{r^{i-1}_1} \sum_{r^{i-1}_2} \cdots \sum_{r^{i-1}_{j_{i-1}}} \prod_{n=1}^{j_{i-1}} e^{-(\lambda_{sw} t)^{\alpha}}\, e^{-(F_{i-1} + 1)(\lambda_f t)^{\alpha}} \Biggr] \\
&\times \Biggl[ \sum_{r^{i-2}_1} \sum_{r^{i-2}_2} \cdots \sum_{r^{i-2}_{j_{i-2}}} \prod_{n=1}^{j_{i-2}} e^{-(\lambda_{sw} t)^{\alpha}}\, e^{-(F_{i-2} + 1)(\lambda_f t)^{\alpha}} \Biggr] \cdots \\
&\times \Biggl[ \sum_{r^{0}_1} \sum_{r^{0}_2} \cdots \sum_{r^{0}_{j_0}} \prod_{n=1}^{j_0} e^{-(\lambda_{sw} t)^{\alpha}}\, e^{-(F_0 + 1)(\lambda_f t)^{\alpha}} \Biggr] \Biggr\}
\end{aligned} \tag{21}
$$

Table 1
Hierarchies considered for 512-processor system

Levels   Topologies
3        H1 (8, 8, 8); H2 (16, 8, 4); H3 (8, 16, 4); H4 (4, 16, 8); H5 (16, 16, 2)
4        H6 (8, 4, 4, 4); H7 (4, 4, 4, 8); H8 (8, 8, 4, 2); H9 (16, 4, 4, 2); H10 (4, 4, 16, 2)
5        H11 (4, 4, 4, 4, 2); H12 (8, 4, 4, 2, 2); H13 (4, 4, 8, 2, 2)
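The entries of Table 1 can be sanity-checked with a few lines of code. The sketch below assumes the reading that a hierarchy $H(R) = (F_0, \ldots, F_R)$ holds $N = F_0 F_1 \cdots F_R$ processors and has $R + 1$ levels; the dictionary name `TABLE_1` and the helper names are illustrative.

```python
from math import prod

# Hierarchies H1..H13 as listed in Table 1, written as (F_0, ..., F_R).
TABLE_1 = {
    "H1": (8, 8, 8), "H2": (16, 8, 4), "H3": (8, 16, 4),
    "H4": (4, 16, 8), "H5": (16, 16, 2),
    "H6": (8, 4, 4, 4), "H7": (4, 4, 4, 8), "H8": (8, 8, 4, 2),
    "H9": (16, 4, 4, 2), "H10": (4, 4, 16, 2),
    "H11": (4, 4, 4, 4, 2), "H12": (8, 4, 4, 2, 2), "H13": (4, 4, 8, 2, 2),
}


def processors(hierarchy):
    """Number of processors, assuming N is the product of the ring sizes."""
    return prod(hierarchy)


def levels(hierarchy):
    """Number of levels, R + 1."""
    return len(hierarchy)


if __name__ == "__main__":
    for name, h in TABLE_1.items():
        print(name, h, "levels:", levels(h), "processors:", processors(h))
```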

As in the hierarchies of self-healing rings, Eq. (21) represents all the $R + 1$ cases. It also takes into account the reliabilities of the switches and of the K processors of interest. Eqs. (19) and (21) are used in the next section to discuss the reliabilities of different hierarchical large-scale systems.

4. Numerical results and discussion

We now discuss the benefit of the above reliability expressions through an analysis of large-scale hierarchical multiprocessors. For the purpose of illustration we focus our analysis on multiprocessor hierarchies consisting of 512 processors. However, the analysis is valid for any hierarchical system with an arbitrary number of processors. A system with 512 processors can be constructed in a wide range of hierarchies with different numbers of levels. For example, one can connect the 512 processors in one ring (i.e. a one-level hierarchy). At the other extreme, one can place every two processors in one ring, forming a hierarchical structure of nine levels (H(R) = (2, 2, 2, 2, 2, 2, 2, 2, 2)). In this work we use only hierarchies that are acceptable from the point of view of system performance. Recent studies [6,4,7] have demonstrated that systems using relatively small SCI rings provide better performance than systems using larger rings. The studies showed that an SCI ring of 4–16 nodes provides better performance than larger sizes. Therefore, we choose hierarchies that we believe are acceptable in that sense. Table 1 lists the hierarchical systems considered for 512 processors. Note that the hierarchies have 3, 4, and 5 levels. Unless otherwise stated, the failure rates ($\lambda$'s) of the SCI interface, switching logic, and the processor are assumed to be $10^{-6}$ per hour. This is consistent with the parameters assumed in previous works [14,21,10].

The reliability function derived for the hierarchies is very useful in choosing the most reliable system among many choices. We have calculated the reliabilities of all hierarchical systems shown in Table 1. The result of the comparison is depicted in Fig. 5. For clarity of the figure we plot only the most and least reliable hierarchies and some other curves in between. For self-healing rings, designers should choose the H5 (16, 16, 2) hierarchy, since it tends to be the most reliable structure, and try to avoid H11 (4, 4, 4, 4, 2) due to its low reliability. However, if basic rings are used, then H1 (8, 8, 8) is a better choice than the other structures, as Fig. 6 shows.

Fig. 5. Reliability comparison for hierarchies based on self-healing rings.

Fig. 6. Reliability comparison for hierarchies based on basic rings.

One of the drawbacks of the basic ring is its low reliability, since it has no mechanism to overcome a single failure. Using self-healing rings in hierarchical systems is expected to improve the reliability of the systems. In Fig. 7 we compare the reliability of H1 (8, 8, 8) when self-healing rings are used with that of basic rings for the same hierarchy. The curves show the significant improvement in reliability of the self-healing hierarchies over the basic-ring hierarchies. The reliability of the basic-ring system drops sharply within a short mission time, while the self-healing system continues to survive for a longer mission time. This argument holds for all the structures listed in Table 1.

Fig. 7. Reliability improvement of self-healing hierarchy H1 (8, 8, 8).

One of the key contributions of this work is the generality of the formula derived for system reliability. One can investigate the reliability of a system for any given K (processors of interest). In Fig. 8 we arbitrarily choose H7 (4, 4, 4, 8) and show its reliability behavior for different values of K (4, 8, 16, and 20 processors). The results demonstrate that as the number of communicating processors (K) increases, the reliability of the system decreases. This is because the more processors that communicate, the more rings in the hierarchical interconnection are involved. In multiprocessor systems, usually a small group of processors is involved in certain transactions, such as caching and message-passing transactions.

In some systems the failure rate may not always be constant; it may vary with time. In this case, representing the failure process by the exponential distribution is not a valid assumption. The reliability function reported here for hierarchical systems is valid for constant and time-varying failure rates. In Fig. 9, reliability curves for the H1 hierarchy are plotted for different values of $\alpha$ (0.2, 0.5, 0.8, 1, 1.2, 1.5, 1.8, 2). The scale parameter for the SCI interface, $\lambda_f$, is kept fixed at $10^{-6}$. The results emphasize the strong influence of $\alpha$ on the reliability behavior of the system.
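The effect of the shape parameter on the failure rate can be made concrete with the standard Weibull hazard function, obtained from Eq. (1) as $h(t) = -\frac{d}{dt}\ln R(t) = \alpha \lambda (\lambda t)^{\alpha - 1}$. This expression is a textbook consequence of Eq. (1) and is not stated in the paper itself. The sketch below classifies the trend of $h(t)$ for the shape values used in Fig. 9; the helper names and the two sample times are arbitrary assumptions.

```python
import math

LAMBDA_F = 1e-6  # scale parameter per hour, as assumed in Section 4
ALPHAS = (0.2, 0.5, 0.8, 1, 1.2, 1.5, 1.8, 2)  # shape values plotted in Fig. 9


def weibull_hazard(t, lam, alpha):
    """Failure rate h(t) = alpha * lam * (lam * t) ** (alpha - 1), t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)


def hazard_trend(alpha, lam=LAMBDA_F, t_early=1e3, t_late=1e5):
    """Compare the hazard at two (arbitrarily chosen) times."""
    early = weibull_hazard(t_early, lam, alpha)
    late = weibull_hazard(t_late, lam, alpha)
    if math.isclose(early, late):
        return "constant"
    return "increasing" if late > early else "decreasing"


if __name__ == "__main__":
    for alpha in ALPHAS:
        print(alpha, hazard_trend(alpha))
```

Shape values below 1 give a decreasing failure rate, values above 1 an increasing one, and $\alpha = 1$ recovers the constant rate $\lambda$ of the exponential model.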

Fig. 8. Varying number of processors of interest (K).

Fig. 9. Time-varying failure rates.

To validate our results and to appreciate the benefit of the derived reliability equation, let us briefly examine the reliability behavior of a larger system consisting of 1024 processors. Table 2 shows hierarchies constructed from three and four levels (set 1) and five levels (set 2). Recent studies on

Table 2
Hierarchical structures for 1024 processors

Set  Index  Topology           Index  Topology
1    H1     (16, 8, 8)         H2     (8, 16, 8)
     H3     (8, 8, 16)         H4     (4, 16, 16)
     H5     (16, 4, 16)        H6     (16, 16, 4)
     H7     (16, 4, 4, 8)      H8     (8, 8, 2, 8)
     H9     (4, 4, 8, 8)       H10    (8, 8, 8, 2)
     H11    (4, 4, 4, 16)      H12    (8, 4, 4, 8)
2    H13    (8, 4, 4, 4, 2)    H14    (4, 8, 4, 4, 2)
     H15    (4, 4, 8, 4, 2)    H16    (4, 4, 4, 8, 2)
     H17    (4, 4, 4, 2, 8)    H18    (4, 4, 4, 4, 4)
     H19    (8, 4, 8, 2, 2)    H20    (8, 8, 4, 2, 2)
     H21    (8, 2, 2, 8, 8)    H22    (8, 2, 4, 2, 8)
     H23    (16, 4, 4, 2, 2)   H24    (4, 4, 2, 2, 16)

Fig. 10. Reliability comparison for hierarchies based on self-healing rings, K = 2, α = 1.

similar systems [7,11] have developed reliability functions for special cases using the exponential distribution and the two-processor reliability (K = 2) problem. To validate our results against theirs, we set K = 2 and α = 1 in Eqs. (19) and (21). The results are depicted in Figs. 10 and 11. In Fig. 10 we compare the two-processor (terminal) reliability for all the hierarchies of 1024 processors that are based on self-healing rings. For clarity we do not show all the curves in the figure. One can see that the structure H3(8, 8, 16) is the most reliable topology among all,

Fig. 11. Reliability for H3(8, 8, 16) hierarchy, K = 2, α = 1.

while H21(8, 2, 2, 8, 8) is the least reliable structure. The results indicate that hierarchies with a low number of levels (three) and moderate ring sizes provide better reliability than hierarchies with four and five levels. Designers of such large-scale systems should therefore keep the number of levels relatively low to achieve highly reliable machines. Ordered from most to least reliable, the self-healing hierarchies in Table 2 rank as: H3, {H2, H1}, {H6, H5, H4}, H10, {H12, H9, H8}, H11, H20, H19, {H7, H15, H14, H13}, {H16, H23}, H18, H17, H22, {H24, H21}. The results obtained here when the shape parameter is set to α = 1 and the number of processors under consideration to K = 2 are consistent with the results reported in previous works [7,11]. In Fig. 11 we compare the two types of hierarchies to measure the improvement in reliability of the self-healing rings over the basic rings. Fig. 11 compares the reliabilities for the H3 hierarchy, which was found to be the most reliable structure in Table 2. The figure shows that the system constructed from self-healing rings provides much higher reliability than the one built from basic rings. The reliability curve for the basic hierarchy decreases drastically within a short mission time, while the self-healing hierarchy remains reliable for a longer mission time. All the comparisons we made (not shown) indicate that self-healing has a major impact on the reliability of hierarchical systems; the improvement is very significant.
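The qualitative gap between the two curves can be reproduced with a far simpler model than the closed-form expression derived in this paper. The sketch below is an illustration under stated assumptions only: a ring of n identical, independent nodes, each with reliability r; a basic ring that fails as soon as any node fails; and a self-healing ring that is assumed to survive any single node failure. It ignores the separate interface, switching-logic and processor terms and the K-processor conditioning of the derived formula, and all function names are hypothetical.

```python
def basic_ring(r: float, n: int) -> float:
    """All n nodes must work: no mechanism to overcome a single failure."""
    return r ** n


def self_healing_ring(r: float, n: int) -> float:
    """Assumed model: the ring survives zero or exactly one node failure."""
    return r ** n + n * r ** (n - 1) * (1.0 - r)


if __name__ == "__main__":
    n = 8  # ring size, illustrative
    for r in (0.99, 0.95, 0.90, 0.80):
        # The self-healing value is never below the basic-ring value.
        print(r, basic_ring(r, n), self_healing_ring(r, n))
```

Even under this crude model the basic ring degrades much faster as the per-node reliability falls, which matches the trend reported for Fig. 11.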

5. Conclusions

Performance and reliability are two key issues in designing multiprocessor systems. Predicting reliability functions for large systems is not an easy task. In this paper we introduced a closed-form reliability function for large-scale hierarchical systems constructed from self-healing rings. We have shown that the reliability function can be used for any hierarchy, with no restrictions on the number of levels or the number of processors in the system. Furthermore, the use of a Weibull process for the failure rate makes the derived formula valid for systems with constant failure rates as well as for systems with time-varying failure rates. The derived reliability expression considers all the major components in the system, including the processors, node interfaces, and switching elements within each interface. This flexible tool allows the designer to investigate the influence of each individual component on the overall system reliability.

Appendix A. Notations

F: number of SCI nodes in the ring network.
N: number of processors in the hierarchy.
K: number of processors of interest.
⊕: addition modulo K.
d_i: number of nodes located between the numbered nodes i ⊕ K and i ⊕ 1 along the clockwise direction in a single ring.
S_i: d_i-element set that contains the nodes located between the numbered nodes i ⊕ K and i ⊕ 1 along the clockwise direction in a single ring.
R: the root level of the hierarchy (R ≥ 1).
F_i: number of SCI nodes (branching factor) at each ring of level i.
H(R) = (F_0, F_1, ..., F_R): a hierarchical topology consisting of R + 1 levels.
T_i: a subtree whose root is at level i (1 ≤ i ≤ R).
T_0^K: event that K processors of interest reside in the same ring at level 0.
T_{i-1}^{K_i}: event that K processors of interest belong to T_i given that they reside in different T_{i-1}'s (1 ≤ i ≤ R) within the hierarchy.
λ_f, λ_sw, λ_PE: failure rates of an interface, switching logic, and a processor, respectively.
R_i(k_i, n_i, t): k-node reliability of a ring having n SCI nodes at level i.
C_{A_i}: event that A_i is operational in the self-healing counter-rotating ring.

\[
i = \min\left\{ \left\lceil \frac{K}{F_0 F_1} \right\rceil, \ldots, \left\lceil \frac{K}{F_0 \cdots F_i} \right\rceil + i - 1, \ldots, \left\lceil \frac{K}{F_0 \cdots F_R} \right\rceil + R - 1 \right\}
\]

\[
\underline{j}_{i-1} = \max\left( \left\lceil \frac{K}{F_{i-1} F_{i-2} \cdots F_0} \right\rceil,\; 2 \right), \qquad
\overline{j}_{i-1} = \min(F_i,\; K)
\]

\[
\underline{j}_{m} = \max\left( \left\lceil \frac{K}{F_m F_{m-1} \cdots F_0} \right\rceil,\; j_{m+1} \right), \qquad
\overline{j}_{m} = \min(j_{m+1} F_{m+1},\; K) \qquad (0 \le m \le i - 2)
\]

\[
\underline{r}^{m}_{1} = \max\bigl(1,\; j_{m-1} - (j_m - 1) F_m\bigr), \qquad
\overline{r}^{m}_{1} = \min\bigl(F_m,\; j_{m-1} - j_m + 1\bigr)
\]

\[
\underline{r}^{m}_{l} = \max\left(1,\; j_{m-1} - (j_m - 1) F_m - \sum_{q=1}^{l-1} r^{m}_{q}\right), \qquad
\overline{r}^{m}_{l} = \min\left(F_m,\; j_{m-1} - j_m + l - \sum_{q=1}^{l-1} r^{m}_{q}\right) \qquad (2 \le l \le j_1 - 1)
\]

\[
q_1(K) = \begin{cases} 0, & K > F_0 \\ 1, & K \le F_0 \end{cases}
\qquad
q_2(R) = \begin{cases} 0, & R = 0 \\ 1, & \text{otherwise} \end{cases}
\qquad
q_3(i) = \begin{cases} 0, & i = R \\ 1, & \text{otherwise} \end{cases}
\]
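As a quick check of the topology notation, a hierarchy H(R) = (F_0, F_1, ..., F_R) can be held as a plain tuple. The sketch below assumes that the processor count N is the product of the branching factors; this is inferred from the tabulated topologies (for example, (16, 8, 8) for a 1024-processor system) rather than quoted from a formula in the paper, and the function names are hypothetical.

```python
import math
from typing import Tuple


def processor_count(topology: Tuple[int, ...]) -> int:
    """N, assumed to be the product F_0 * F_1 * ... * F_R."""
    return math.prod(topology)


def level_count(topology: Tuple[int, ...]) -> int:
    """Number of levels, R + 1."""
    return len(topology)


def root_level(topology: Tuple[int, ...]) -> int:
    """Index R of the root level."""
    return len(topology) - 1


if __name__ == "__main__":
    for topology in ((16, 8, 8), (4, 4, 4, 4, 4)):
        print(topology, processor_count(topology), level_count(topology))
```

The same helper is a convenient way to confirm that a candidate topology actually describes a system of the intended size before its reliability is evaluated.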

References

[1] Burkhardt H, Frank S, Knobe B, Rothnie J. Overview of the KSR 1 Computer System, Technical Report KSRTR-9202001, Kendall Square Research; February 1992.
[2] Holliday M, Stumm M. Performance evaluation of hierarchical ring-based shared memory multiprocessors. IEEE Trans Comput 1994;43(1):53–67.
[3] IEEE. ANSI/IEEE Std 1596-1992: Standard for Scalable Coherent Interface. IEEE; 1992.
[4] Sarwar M, George A. Simulative performance analysis of distributed switching fabrics for SCI-based systems. Microprocess Microsyst 2000;24(1):1–11.
[5] Felty A, Stomp F. Cache coherency in SCI: specification and sketch of correctness. J Formal Aspects Comput 1999;11(5):475–97.
[6] Scott S, Goodman J. The impact of pipelined channels on k-ary n-cube networks. IEEE Trans Parallel Distribut Syst 1994;5(1):2–16.
[7] Al-Rousan M. Reliability and performance of hierarchical large-scale ring-based SCI shared-memory multiprocessors. PhD dissertation. Brigham Young University; August 1996.
[8] Ronneberg G, Horn G, Lysne O. Evaluation and suggested improvements for the SCI flow control. In: Proceedings of SCI Europe '99, Toulouse (France); 1999. p. 131–7.
[9] Wu B. SCI switches. In: International Data Acquisition Conference on Event Building and Event Data Readout in Medium and High Energy Physical Experiments; October 1994.
[10] Al-Rousan M, Mowafi M. Improving K-processor reliability of large-scale hierarchical systems using double counter-rotating rings. Int J Comput And Their Applic (IJCATA) 2000;7(3):111–2.
[11] Al-Rousan M, Bearnson L, Archibald J. The two-processor reliability of hierarchical large-scale ring-based networks. In: Proceedings of the 29th Annual Hawaii International Conference on System Sciences; 1996. p. 104–14.
[12] Sarwar M, George A, Collins D. Simulative reliability analysis of SCI ring-based topologies. In: Proceedings of IEEE Conference on Local Computer Networks (LCN), Tampa, FL; November 2000. p. 8–10.
[13] Yin J, Silio C. K-terminal reliability in ring networks. IEEE Trans Reliab 1994;43(3).
[14] Ibe O. Reliability comparison of token-ring network schemes. IEEE Trans Reliab 1992;41(2):288–93.
[15] Spragins J. Token ring reliability models. In: Proceedings of INFOCOM'92, Florence, Italy; 1992.
[16] Peha J, Tobagi F. Analyzing the fault tolerance of double-loop networks. IEEE/ACM Trans Network 1994;2(4):363–73.

[17] Yin J. Reliability of fail-soft and fault-tolerant ring networks. PhD dissertation. University of Maryland, College Park; 1993.
[18] Yin J, Silio C. A reliability analysis of fail-soft FDDI networks. In: Proceedings of IEEE 17th Conference on Local Computer Networks; September 1989. p. 158–67.
[19] Trivedi K, Yu P, Smith W. Reliability and performance analysis of a ringlet. In: Proceedings of IFIP International Symposium on Local Communication Systems; November 1986. p. 111–23.
[20] Raghavendra C, Silvester J. A survey of multi-connected loop topologies for local computer networks. Comput Networks ISDN Syst 1986;11(1):29–42.
[21] Logothetis D, Trivedi K. Reliability analysis of double counter-rotating ring with concentrator attachments. IEEE/ACM Trans Network 1994;2(5).
[22] Vujosevic M, Sucur M, Seriborn A. Reliability analysis for a tree-structured hierarchic control system. IEEE Trans Reliab 1992;41(2):190–2.
[23] Yang O. Terminal-pair reliability of tree-type computer communication networks. IEEE Trans Reliab 1992;41(1):49–56.

Mohammad Al-Rousan received his MS in Electrical Engineering from the University of Missouri-Columbia, MO, USA in 1992, and his Ph.D. in Electrical Engineering from Brigham Young University, UT, USA in 1996. He is an associate professor of Computer Engineering at Jordan University of Science and Technology, Jordan. He has been on sabbatical leave at the American University of Sharjah, Sharjah, UAE, since January 2001. His research interests include wireless networking, SCI systems, intelligent systems, and Internet computing.

Dr. Adnan Shaout is a full professor in the Electrical and Computer Engineering Department at the University of Michigan-Dearborn. His current research is in applications of fuzzy set theory, computer design, computer arithmetic, parallel processing, artificial intelligence, and expert systems. Dr. Shaout obtained his B.Sc., M.S. and Ph.D. in Computer Engineering from Syracuse University, Syracuse, NY, in 1982, 1983, and 1987, respectively.