Microelectron. Reliab., Vol. 32, No. 4, pp. 525-538, 1992. Printed in Great Britain.
0026-2714/92 $5.00 + .00 © 1992 Pergamon Press plc
RELIABILITY AND COST OF FAULT TOLERANT MULTIPROCESSING NETWORKS WITH HETEROGENEOUS NODES
YIU-WING LEUNG
Department of Information Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
(Received for publication 3 December 1990)
Abstract
We present a reliability and cost model for a class of multiprocessing networks that consist of a large number of processing elements (PEs) and fail when fewer than $K$ PEs are functioning. The processing elements are of two types: a type 2 PE is more reliable and more costly than a type 1 PE. The PEs in a network can be all of one type or a mixture of both types. A decision problem arises: how many PEs of each type should be used in order to (i) maximize the mean time to failure given a cost budget and (ii) minimize the cost given a mean-time-to-failure requirement. We formulate these optimization problems as zero-one integer programming problems and propose efficient algorithms to solve them.
I Introduction
The advance in VLSI technology and the ever increasing need for higher computing power have led to the development of multiprocessing networks consisting of a large number of processing elements [1-4]. If the failure of one processing element results in the failure of the whole system, then multiprocessing systems are less reliable than uniprocessing systems built from components of the same reliability. Hence multiprocessing systems should support graceful degradation by isolating the failed processing elements. To provide this capability, redundant processing elements are included in the network. The resulting network is known as a fault tolerant multiprocessing network.
Reliability modeling is an important tool to evaluate the reliability of fault tolerant computer systems. Several reliability models for multiprocessing networks have been proposed in the literature [5-9]. In these models, all the processing elements are assumed to have the same reliability statistics. We call such networks homogeneous networks. To trade off between reliability and cost, we need to determine how many redundant processing elements should be included.
Semiconductor components fabricated by different manufacturers have different reliability statistics and prices. Hence, processing elements using components fabricated by different manufacturers will have different reliability statistics and costs. To design fault tolerant multiprocessing networks, we have three options:
(1) Use a larger number of less reliable and less costly processing elements;
(2) Use a smaller number of more reliable and more costly processing elements;
(3) Use both types of processing elements.
For the first two options, the resulting multiprocessing network will have homogeneous processing elements. The decision problem is which type of processing elements should be chosen in order to optimize an objective. The objective can be either the maximization of system reliability when a cost budget is given, or the minimization of the cost of the system while satisfying a given reliability requirement. For the third option, the resulting multiprocessing network will have heterogeneous processing elements. We call such networks heterogeneous networks. The decision problem is how many processing elements of each type should be used.
In this paper, we present a reliability and cost model for a class of multiprocessing networks when there are two types of processing elements. Based on this model, we propose efficient algorithms to solve the following decision problems for both homogeneous and heterogeneous networks: how many PEs of each type should be used in order to (i) minimize the cost while satisfying a given mean-time-to-failure (MTTF) requirement and (ii) maximize the MTTF without exceeding a given cost budget.
II System Model
Figure 1 shows the general model of multiprocessing networks. The processing elements (which will be referred to as nodes) are connected by an interconnection network, through which each node can directly communicate with any other node. The structure of the interconnection network can be a single bus, multiple buses or crossbar switches (see Figure 2). The internal structure of a node is shown in Figure 3. The failure of the processor, RAM, ROM or the interface to the interconnection network will result in node failure. When the I/O interface fails, a node can continue processing and collaborating with other nodes. Hence the reliability of the I/O interface does not affect the node's reliability. The multiprocessing network works only when there are at least $K$ functioning nodes [8].
We assume that there are two types of components and two types of nodes. We identify them as type 1 and type 2 nodes. The functions and capabilities of these two types of nodes are identical. They differ only in reliability and cost: components in type 2 nodes are more reliable and more costly than those used in type 1 nodes. We also assume that the cost of the interconnection network is much smaller than the cost of the nodes and can be ignored. In addition, we assume that node failures are statistically independent and that the interconnection network will not fail [8].
Figure 1: General model of multiprocessing networks (PE: Processing Element).
We define the following notations:
$f_p^{(i)}(t)$ = probability density function (pdf) of the lifetime of type $i$ processor
$f_{RAM}^{(i)}(t)$ = pdf of the lifetime of type $i$ RAM
$f_{ROM}^{(i)}(t)$ = pdf of the lifetime of type $i$ ROM
$f_I^{(i)}(t)$ = pdf of the lifetime of type $i$ interface
$r_X^{(i)}(t)$ = reliability function of component $X$ ($X = p$, $RAM$, $ROM$ or $I$)
$R_i(t)$ = reliability function of type $i$ node
$R_{HO}^{(i)}(N_i,t)$ = reliability function of homogeneous network with $N_i$ type $i$ nodes
$R_{HE}(N_1,N_2,t)$ = reliability function of heterogeneous network with $N_1$ type 1 nodes and $N_2$ type 2 nodes
$N_i$ = number of type $i$ nodes in the network
$K$ = minimum number of functioning nodes for the network to work correctly
$C_{HO}$ = cost of homogeneous network
$C_{HE}$ = cost of heterogeneous network
$T_{HO}^{(i)}(N_i)$ = MTTF of homogeneous network with $N_i$ type $i$ nodes
$T_{HE}(N_1,N_2)$ = MTTF of heterogeneous network with $N_1$ type 1 nodes and $N_2$ type 2 nodes
$T^*$ = a given MTTF requirement
$C_0$ = a given cost budget
Figure 2: Examples of multiprocessing networks: (a) single-bus interconnection network; (b) multiple-bus interconnection network; (c) crossbar interconnection network.
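Figure 3: Internal structure of a processing element (labelled blocks include the I/O interface, RAM, ROM and the interface to the interconnection network).
III Reliability And Cost Modeling
The reliability of a component at any time $t$ is defined [10] as the probability that the component lives for more than time $t$. The reliability functions of the components in type $i$ nodes are given by:
$$r_p^{(i)}(t) = P[\text{the type } i \text{ processor lives for more than time } t] = \int_t^\infty f_p^{(i)}(x)\,dx \tag{1a}$$
$$r_{RAM}^{(i)}(t) = \int_t^\infty f_{RAM}^{(i)}(x)\,dx \tag{1b}$$
$$r_{ROM}^{(i)}(t) = \int_t^\infty f_{ROM}^{(i)}(x)\,dx \tag{1c}$$
$$r_I^{(i)}(t) = \int_t^\infty f_I^{(i)}(x)\,dx \tag{1d}$$
The reliability function of type $i$ nodes $R_i(t)$ is given by:
$$R_i(t) = P[\text{all components live for more than time } t] = r_p^{(i)}(t)\, r_{RAM}^{(i)}(t)\, r_{ROM}^{(i)}(t)\, r_I^{(i)}(t) \tag{2}$$
The product form in equation (2) requires the lifetimes of the components within a node to be statistically independent.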
Since the network works only when there are at least $K$ functioning nodes, the network can be modeled as a K-out-of-N system. The reliability function of a homogeneous network using $N_i$ type $i$ nodes is given by:
$$R_{HO}^{(i)}(N_i,t) = P[\text{at least } K \text{ nodes live for more than time } t] = \sum_{q=K}^{N_i} \binom{N_i}{q} [R_i(t)]^{q}\, [1-R_i(t)]^{N_i-q} \tag{3}$$
The pdf of the lifetime of this system is:
$$f_{HO}^{(i)}(N_i,t) = -\frac{d}{dt}\left[R_{HO}^{(i)}(N_i,t)\right] \tag{4}$$
The mean time to failure $T_{HO}^{(i)}(N_i)$ can be found as:
$$T_{HO}^{(i)}(N_i) = \int_0^\infty t\, f_{HO}^{(i)}(N_i,t)\,dt = \int_0^\infty R_{HO}^{(i)}(N_i,t)\,dt \tag{5}$$
The cost of a homogeneous network using $N_i$ type $i$ nodes is:
$$C_{HO} = N_i C_i \tag{6}$$
where $C_i$ denotes the cost of one type $i$ node.
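Equations (3), (5) and (6) can be sketched numerically as follows. This is only an illustration under stated assumptions: node lifetimes are taken to be exponential (as in the numerical examples later in the paper), the function names `r_ho`, `mttf_ho` and `cost_ho` are hypothetical, and the integral in (5) is truncated at an assumed cut-off and evaluated with the trapezoidal rule.

```python
from math import comb, exp

def r_ho(n, k, rate, t):
    """Equation (3): probability that at least k of n identical nodes survive to time t.

    Assumption: each node has an exponential lifetime, R_i(t) = exp(-rate * t).
    """
    r = exp(-rate * t)
    return sum(comb(n, q) * r**q * (1.0 - r)**(n - q) for q in range(k, n + 1))

def mttf_ho(n, k, rate, steps=4000):
    """Equation (5): integrate R_HO over [0, t_max] with the trapezoidal rule."""
    t_max = 40.0 / rate  # assumed cut-off; the neglected tail is tiny
    h = t_max / steps
    total = 0.5 * (r_ho(n, k, rate, 0.0) + r_ho(n, k, rate, t_max))
    total += sum(r_ho(n, k, rate, j * h) for j in range(1, steps))
    return total * h

def cost_ho(n, unit_cost):
    """Equation (6): cost of n identical nodes."""
    return n * unit_cost
```

For example, `mttf_ho(5, 3, 1.0)` estimates the MTTF of a network of five nodes that needs at least three of them, and `cost_ho(5, 2.0)` gives its cost when each node costs 2.0.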
A heterogeneous network works only when there are at least $K$ functioning nodes, irrespective of their types. The reliability function of a heterogeneous network with $N_1$ type 1 nodes and $N_2$ type 2 nodes is given by:
$$R_{HE}(N_1,N_2,t) = \sum_{\substack{0 \le p \le N_1,\ 0 \le q \le N_2 \\ p+q \ge K}} \binom{N_1}{p} [R_1(t)]^{p}\, [1-R_1(t)]^{N_1-p} \binom{N_2}{q} [R_2(t)]^{q}\, [1-R_2(t)]^{N_2-q} \tag{7}$$
The mean time to failure $T_{HE}(N_1,N_2)$ is:
$$T_{HE}(N_1,N_2) = \int_0^\infty R_{HE}(N_1,N_2,t)\,dt \tag{8}$$
The cost of the heterogeneous network using $N_1$ type 1 nodes and $N_2$ type 2 nodes is:
$$C_{HE} = N_1 C_1 + N_2 C_2 \tag{9}$$
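A minimal sketch of equations (7) to (9), again assuming exponential node lifetimes and a truncated trapezoidal integration. The names `r_he`, `mttf_he` and `cost_he` are hypothetical and chosen only for this illustration.

```python
from math import comb, exp

def _binom(n, alive, r):
    """Probability that exactly `alive` of n nodes with reliability r are functioning."""
    return comb(n, alive) * r**alive * (1.0 - r)**(n - alive)

def r_he(n1, n2, k, rate1, rate2, t):
    """Equation (7): at least k functioning nodes among n1 type 1 and n2 type 2 nodes.

    Assumption: type i nodes have exponential lifetimes with failure rate rate_i.
    """
    r1, r2 = exp(-rate1 * t), exp(-rate2 * t)
    return sum(_binom(n1, p, r1) * _binom(n2, q, r2)
               for p in range(n1 + 1) for q in range(n2 + 1) if p + q >= k)

def mttf_he(n1, n2, k, rate1, rate2, steps=4000):
    """Equation (8): trapezoidal integration of R_HE up to an assumed cut-off."""
    t_max = 40.0 / min(rate1, rate2)
    h = t_max / steps
    inner = sum(r_he(n1, n2, k, rate1, rate2, j * h) for j in range(1, steps))
    ends = r_he(n1, n2, k, rate1, rate2, 0.0) + r_he(n1, n2, k, rate1, rate2, t_max)
    return h * (0.5 * ends + inner)

def cost_he(n1, n2, c1, c2):
    """Equation (9): total cost of the heterogeneous network."""
    return n1 * c1 + n2 * c2
```

When both rates are equal, `r_he` reduces to the homogeneous K-out-of-N reliability of equation (3) with $N_1 + N_2$ nodes, which gives a convenient sanity check.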
IV Homogeneous Networks
In homogeneous networks, all the nodes are identical. We need to determine which type of nodes should be chosen. In the following subsections, two optimization problems are defined and solved.
A. Maximization of Mean Time To Failure
We are given a cost budget $C_0$ and we want to determine which type of nodes to use in order to maximize the MTTF. This problem can be solved as follows. Since the MTTF of a homogeneous network is an increasing function of the number of nodes in the network, we determine the maximum number of type $i$ ($i = 1, 2$) nodes that can be used without exceeding the cost budget and then evaluate $T_{HO}^{(1)}(N_1)$ and $T_{HO}^{(2)}(N_2)$. By comparing these two mean times to failure, we can determine the best type of nodes. These steps are summarized below:
Optimization Algorithm
(1) $N_i \leftarrow \lfloor C_0 / C_i \rfloor$ for $i = 1, 2$; /* $\lfloor x \rfloor$ is the largest integer not exceeding $x$ */
(2) Evaluate $T_{HO}^{(1)}(N_1)$ and $T_{HO}^{(2)}(N_2)$;
(3) IF $T_{HO}^{(1)}(N_1) > T_{HO}^{(2)}(N_2)$ THEN use $N_1$ type 1 nodes
    ELSE IF $T_{HO}^{(1)}(N_1) < T_{HO}^{(2)}(N_2)$ THEN use $N_2$ type 2 nodes
    ELSE use either $N_1$ type 1 nodes or $N_2$ type 2 nodes.
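The three steps above can be sketched in Python. To keep the example short it evaluates the MTTF with the standard closed form for K-out-of-N systems of independent exponential nodes, which is an assumption of this sketch rather than part of the algorithm. The names `mttf_exp` and `best_type_for_budget` are hypothetical.

```python
from math import floor

def mttf_exp(n, k, rate):
    """MTTF of a k-out-of-n network of i.i.d. exponential nodes (assumed lifetimes)."""
    return sum(1.0 / j for j in range(k, n + 1)) / rate

def best_type_for_budget(budget, costs, rates, k):
    """Size each homogeneous network to the budget, then compare the MTTFs.

    Returns (choice, counts, mttfs), where choice is 1, 2, or 0 for a tie.
    """
    # Step (1): N_i <- floor(C_0 / C_i); the 1e-9 guards against float round-off.
    counts = [floor(budget / c + 1e-9) for c in costs]
    # Step (2): evaluate the MTTF of each candidate network.
    mttfs = [mttf_exp(n, k, r) for n, r in zip(counts, rates)]
    # Step (3): pick the type with the larger MTTF.
    if mttfs[0] > mttfs[1]:
        choice = 1
    elif mttfs[0] < mttfs[1]:
        choice = 2
    else:
        choice = 0
    return choice, counts, mttfs
```

A call such as `best_type_for_budget(10.0, (1.0, 2.0), (1.0, 0.5), 3)` compares ten type 1 nodes against five type 2 nodes; the parameter values are illustrative and not taken from the paper's tables.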
B. Minimization of Cost
In this optimization problem, an MTTF requirement $T^*$ is given and the cost of the network is to be minimized. Note that both the cost and the MTTF increase with the number of nodes in the network. To minimize the cost, we need to determine the minimum number of type $i$ nodes $N_i$ ($i = 1, 2$) such that the MTTF requirement is just satisfied. By comparing the costs required by each type of node, we can determine the best type of nodes to be used. These steps are summarized below:
Optimization Algorithm
(1) Find $N_i$ ($\ge K$) for $i = 1, 2$ such that
    $T_{HO}^{(i)}(N_i) \ge T^*$ and $T_{HO}^{(i)}(N_i - 1) < T^*$
(2) IF $N_1 C_1 < N_2 C_2$ THEN use $N_1$ type 1 nodes
    ELSE IF $N_1 C_1 > N_2 C_2$ THEN use $N_2$ type 2 nodes
    ELSE use either $N_1$ type 1 nodes or $N_2$ type 2 nodes
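A possible Python rendering of these two steps is shown below. As before, the closed-form MTTF for independent exponential nodes is an assumption made for brevity, and `min_nodes` and `cheapest_type` are hypothetical names. The search in step (1) is a plain linear scan starting at $N_i = K$.

```python
def mttf_exp(n, k, rate):
    """MTTF of a k-out-of-n network of i.i.d. exponential nodes (assumed lifetimes)."""
    return sum(1.0 / j for j in range(k, n + 1)) / rate

def min_nodes(target, k, rate, n_max=100_000):
    """Step (1): smallest N >= K whose MTTF reaches the requirement T*."""
    n = k
    while mttf_exp(n, k, rate) < target:
        n += 1
        if n > n_max:  # safety stop for this sketch only
            raise ValueError("requirement not reachable within n_max nodes")
    return n

def cheapest_type(target, costs, rates, k):
    """Step (2): compare N_1*C_1 with N_2*C_2; returns (choice, counts, totals)."""
    counts = [min_nodes(target, k, r) for r in rates]
    totals = [n * c for n, c in zip(counts, costs)]
    if totals[0] < totals[1]:
        choice = 1
    elif totals[0] > totals[1]:
        choice = 2
    else:
        choice = 0  # tie: either type may be used
    return choice, counts, totals
```

For instance, `cheapest_type(0.5, (1.0, 2.0), (4.0, 2.0), 3)` sizes both homogeneous networks for an MTTF of at least 0.5 and reports which is cheaper; the inputs are illustrative values.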
V Heterogeneous Networks
Nodes in a heterogeneous network can be of different types. We need to determine how many nodes of each type should be used. In the following subsections, two optimization problems are defined and solved.
A. Maximization of Mean Time To Failure
We are given a cost budget $C_0$ and we want to determine how many nodes of each type should be used in order to maximize the MTTF. This problem can be formulated as the following integer programming problem:
Maximize
$$T_{HE}(N_1,N_2) = \int_0^\infty R_{HE}(N_1,N_2,t)\,dt \tag{10a}$$
Subject to
(1) $$N_1, N_2 \ge 0 \tag{10b}$$
(2) $$N_1 C_1 + N_2 C_2 \le C_0 \tag{10c}$$
The objective function is the MTTF of the heterogeneous network. $N_1$ and $N_2$ are integer variables to be determined. Constraint (1) ensures that neither $N_1$ nor $N_2$ is negative. Constraint (2) ensures that the cost of the heterogeneous network will not be larger than the cost budget. These two constraints define a feasible solution space for this problem.
It is well known that there is no general efficient algorithm of polynomial time complexity for all integer programming problems [11]. To solve this problem, we reduce the feasible solution space (as explained below). Fortunately, the number of feasible solution points in the reduced solution space is only $\lfloor C_0/C_2 \rfloor + 1$. By enumerating these feasible solution points, we can get the optimum solution $N_1^*$ and $N_2^*$.
Note that the mean time to failure $T_{HE}(N_1,N_2)$ is an increasing function of $N_1$ and $N_2$. If $(N_1,N_2)$ is a feasible solution point, $T_{HE}(N_1,N_2)$ will then be larger than $T_{HE}(N_1-i,N_2)$ and $T_{HE}(N_1,N_2-j)$ for $0 < i \le N_1$ and $0 < j \le N_2$. Hence, for each value of $N_2$, only the point with the largest affordable $N_1$ needs to be considered. The first such point uses the largest affordable number of type 2 nodes, $N_2 = \lfloor C_0/C_2 \rfloor$, together with $N_1 = \lfloor (C_0 - N_2 C_2)/C_1 \rfloor$. To find the next feasible point, decrease $N_2$ by 1 and find how many type 1 nodes can be accommodated without violating the cost constraint, i.e.
$$N_1 \leftarrow \left\lfloor \frac{C_0 - (N_2 - 1) C_2}{C_1} \right\rfloor$$
By repeating the above step, the $\lfloor C_0/C_2 \rfloor + 1$ feasible solution points can be found, and the best of them gives the optimum solution $N_1^*$ and $N_2^*$. These ideas are given in the following algorithm:
Optimization Algorithm
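(1) $N_2 \leftarrow \lfloor C_0/C_2 \rfloor$; $N_1 \leftarrow \lfloor (C_0 - N_2 C_2)/C_1 \rfloor$;
    max $\leftarrow T_{HE}(N_1,N_2)$; $N_1^* \leftarrow N_1$; $N_2^* \leftarrow N_2$;
(2) for $i = 1$ to $\lfloor C_0/C_2 \rfloor$ do
    begin
      $N_2 \leftarrow N_2 - 1$;
      $N_1 \leftarrow \lfloor (C_0 - N_2 C_2)/C_1 \rfloor$;
      $x \leftarrow T_{HE}(N_1,N_2)$;
      if $x >$ max then
      begin max $\leftarrow x$; $N_1^* \leftarrow N_1$; $N_2^* \leftarrow N_2$; end;
    end;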
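The enumeration can be sketched in Python as follows. The MTTF evaluation assumes exponential node lifetimes and uses a truncated trapezoidal integration of equation (7); `mttf_he` and `maximize_mttf` are hypothetical names, and the small `eps` added before `floor` is a guard against floating-point round-off that is not part of the algorithm itself.

```python
from math import comb, exp, floor

def mttf_he(n1, n2, k, rate1, rate2, steps=1000):
    """Numerical T_HE(N1, N2) for exponential nodes (assumed lifetimes)."""
    t_max = 40.0 / min(rate1, rate2)  # assumed integration cut-off
    h = t_max / steps

    def rel(t):
        r1, r2 = exp(-rate1 * t), exp(-rate2 * t)
        return sum(comb(n1, p) * r1**p * (1.0 - r1)**(n1 - p)
                   * comb(n2, q) * r2**q * (1.0 - r2)**(n2 - q)
                   for p in range(n1 + 1) for q in range(n2 + 1) if p + q >= k)

    return h * (0.5 * (rel(0.0) + rel(t_max)) + sum(rel(j * h) for j in range(1, steps)))

def maximize_mttf(budget, c1, c2, k, rate1, rate2):
    """Enumerate the floor(C0/C2) + 1 boundary points; returns (mttf, N1*, N2*)."""
    eps = 1e-9
    # Step (1): start with as many type 2 nodes as the budget allows.
    n2 = floor(budget / c2 + eps)
    n1 = floor((budget - n2 * c2) / c1 + eps)
    best = (mttf_he(n1, n2, k, rate1, rate2), n1, n2)
    # Step (2): trade one type 2 node at a time for type 1 nodes.
    for _ in range(n2):
        n2 -= 1
        n1 = floor((budget - n2 * c2) / c1 + eps)
        x = mttf_he(n1, n2, k, rate1, rate2)
        if x > best[0]:
            best = (x, n1, n2)
    return best
```

A call such as `maximize_mttf(6.0, 1.0, 1.5, 2, 1.0, 0.5)` evaluates only five candidate points, whereas a brute-force search over the same feasible region would evaluate every affordable pair.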
B. Minimization of Cost
In this optimization problem, an MTTF requirement is given and the cost of the network is to be minimized. This problem can be formulated as the following integer programming problem:
Minimize
$$C_{HE} = N_1 C_1 + N_2 C_2 \tag{11a}$$
Subject to
(1) $$N_1, N_2 \ge 0 \tag{11b}$$
(2) $$T_{HE}(N_1,N_2) \ge T^* \tag{11c}$$
The objective function is the cost of the heterogeneous network. $N_1$ and $N_2$ are integer variables to be determined. Constraint (1) ensures that neither $N_1$ nor $N_2$ is negative. Constraint (2) ensures that the MTTF of the resulting system is at least as large as the given requirement. These two constraints define a feasible solution space for this problem.
To solve this integer programming problem, we reduce the feasible solution space and enumerate the solution points in the reduced solution space. To minimize the cost, we need to find the smallest possible values of $N_1$ and $N_2$ that satisfy the two constraints. This can be done as follows. First, we let $N_1 = 0$ and find the value of $N_2$ such that the reliability constraint is just satisfied, i.e.
$T_{HE}(0,N_2) \ge T^*$ and $T_{HE}(0,N_2 - 1) < T^*$
and find the cost $C' = N_2 C_2$. To generate the next feasible solution point, we increment $N_1$ by one and find the value of $N_2$ such that the reliability constraint is just satisfied. By repeating this step, we can generate all the feasible solution points until $N_1 = \lfloor C'/C_1 \rfloor$; beyond this value the type 1 nodes alone would cost more than $C'$. There are a total of $\lfloor C'/C_1 \rfloor + 1$ feasible solution points.
Optimization Algorithm
(1) Find $N_2$ ($\ge K$) such that
    $T_{HE}(0,N_2) \ge T^*$ and $T_{HE}(0,N_2 - 1) < T^*$
(2) $N_1^* \leftarrow 0$; $N_2^* \leftarrow N_2$; $UB \leftarrow N_2$; min $\leftarrow N_2 C_2$; $C' \leftarrow$ min;
(3) for $i = 1$ to $\lfloor C'/C_1 \rfloor$ do
    begin
      (i) $N_1 \leftarrow i$;
      (ii) find $N_2$ ($\max\{0, K - i\} \le N_2 \le UB$) such that
           $T_{HE}(i,N_2) \ge T^*$ and $T_{HE}(i,N_2 - 1) < T^*$
      (iii) $UB \leftarrow N_2$;
      (iv) if $i C_1 + N_2 C_2 <$ min then
           begin min $\leftarrow i C_1 + N_2 C_2$; $N_1^* \leftarrow i$; $N_2^* \leftarrow N_2$; end;
    end;
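The cost-minimization steps above might be coded as below. This is a sketch under the same assumptions as before (exponential node lifetimes, truncated trapezoidal integration); `mttf_he` and `minimize_cost` are hypothetical names. Step (ii) is implemented by walking $N_2$ downward from the current upper bound, which is one of several ways to locate the point where the requirement is just satisfied.

```python
from math import comb, exp, floor

def mttf_he(n1, n2, k, rate1, rate2, steps=1000):
    """Numerical T_HE(N1, N2) for exponential nodes (assumed lifetimes)."""
    t_max = 40.0 / min(rate1, rate2)  # assumed integration cut-off
    h = t_max / steps

    def rel(t):
        r1, r2 = exp(-rate1 * t), exp(-rate2 * t)
        return sum(comb(n1, p) * r1**p * (1.0 - r1)**(n1 - p)
                   * comb(n2, q) * r2**q * (1.0 - r2)**(n2 - q)
                   for p in range(n1 + 1) for q in range(n2 + 1) if p + q >= k)

    return h * (0.5 * (rel(0.0) + rel(t_max)) + sum(rel(j * h) for j in range(1, steps)))

def minimize_cost(target, c1, c2, k, rate1, rate2, n_max=1000):
    """Returns (cost, N1*, N2*) for the cheapest network with T_HE >= target."""
    # Step (1): type 2 nodes only, smallest N2 >= K meeting the requirement.
    n2 = k
    while mttf_he(0, n2, k, rate1, rate2) < target:
        n2 += 1
        if n2 > n_max:  # safety stop for this sketch only
            raise ValueError("requirement not reachable within n_max nodes")
    # Step (2): initialise the incumbent solution and the bounds.
    best = (n2 * c2, 0, n2)
    ub = n2
    c_prime = n2 * c2
    # Step (3): add type 1 nodes one at a time and shrink N2 as far as possible.
    for i in range(1, floor(c_prime / c1 + 1e-9) + 1):
        lo = max(0, k - i)
        n2 = ub
        while n2 - 1 >= lo and mttf_he(i, n2 - 1, k, rate1, rate2) >= target:
            n2 -= 1
        ub = n2
        cost = i * c1 + n2 * c2
        if cost < best[0]:
            best = (cost, i, n2)
    return best
```

For example, `minimize_cost(0.6, 1.0, 1.6, 2, 1.0, 0.5)` searches four candidate points; the parameter values are illustrative and not taken from the paper's tables.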
VI Numerical Examples
In this section, we assume that the lifetimes of the processor, RAM, ROM and the interface to the interconnection network are exponentially distributed. Let $\mu_p^{(i)}$, $\mu_{RAM}^{(i)}$, $\mu_{ROM}^{(i)}$ and $\mu_I^{(i)}$ be the failure rates of the processor, RAM, ROM and the interface to the interconnection network in a type $i$ node. The reliability function of type $i$ nodes can be obtained from equation (2):
$$R_i(t) = e^{-\mu_p^{(i)} t}\, e^{-\mu_{RAM}^{(i)} t}\, e^{-\mu_{ROM}^{(i)} t}\, e^{-\mu_I^{(i)} t} = e^{-\mu_E^{(i)} t} \tag{12}$$
where
$$\mu_E^{(i)} = \mu_p^{(i)} + \mu_{RAM}^{(i)} + \mu_{ROM}^{(i)} + \mu_I^{(i)} \tag{13}$$
The reliability function of a homogeneous network with $N_i$ type $i$ nodes can be obtained from equations (3) and (12):
$$R_{HO}^{(i)}(N_i,t) = \sum_{q=K}^{N_i} \binom{N_i}{q} e^{-q \mu_E^{(i)} t} \left[1 - e^{-\mu_E^{(i)} t}\right]^{N_i-q} \tag{14}$$
The mean time to failure $T_{HO}^{(i)}(N_i)$ is:
$$T_{HO}^{(i)}(N_i) = \int_0^\infty R_{HO}^{(i)}(N_i,t)\,dt \tag{15}$$
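As a side note that is not stated in the source: for independent, identically distributed exponential node lifetimes, the integral in equation (15) has a well-known closed form. While $j$ nodes are functioning, the time until the next node failure is exponential with rate $j\mu_E^{(i)}$, and the network works as long as $j \ge K$, so the MTTF is the sum of the mean times spent with $j = N_i, N_i - 1, \ldots, K$ functioning nodes. A sketch of the result, valid for $N_i \ge K$:

```latex
T_{HO}^{(i)}(N_i)
  = \int_0^\infty R_{HO}^{(i)}(N_i,t)\,dt
  = \sum_{j=K}^{N_i} \frac{1}{j\,\mu_E^{(i)}}
  = \frac{1}{\mu_E^{(i)}} \sum_{j=K}^{N_i} \frac{1}{j}
```

One consequence is that, for a fixed number of nodes, the MTTF scales as $1/\mu_E^{(i)}$, and adding one node to a network of $N_i$ nodes increases the MTTF by $1/\big((N_i+1)\,\mu_E^{(i)}\big)$.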
Table 1: Homogeneous networks; $\mu_E^{(1)} = 1.0$, $C_1 = 1.0$. For each value of $C_2$, the left group gives the optimum decision for maximizing the MTTF with $C_0 = 10$ and the right group gives the optimum decision for minimizing the cost with $T^* = 0.5$.

(a) $\mu_E^{(2)} = 5/6$
        Maximize MTTF, C0 = 10       Minimize cost, T* = 0.5
C2      N1*    N2*    T_HO*          N1*    N2*    C_HO*
1.0     0      10     0.4287         0      13     13
1.2     0      8      0.3654         0      13     15.6
1.4     10     0      0.3572         0      13     18.2
1.6     10     0      0.3572         19     0      19
1.8     10     0      0.3572         19     0      19
2.0     10     0      0.3572         19     0      19
2.2     10     0      0.3572         19     0      19
2.4     10     0      0.3572         19     0      19
2.6     10     0      0.3572         19     0      19
2.8     10     0      0.3572         19     0      19
3.0     10     0      0.3572         19     0      19

(b) $\mu_E^{(2)} = 2/3$
        Maximize MTTF, C0 = 10       Minimize cost, T* = 0.5
C2      N1*    N2*    T_HO*          N1*    N2*    C_HO*
1.0     0      10     0.5359         0      10     10
1.2     0      8      0.4567         0      10     12
1.4     0      7      0.4098         0      10     14
1.6     10     0      0.3572         0      10     16
1.8     10     0      0.3572         0      10     18
2.0     10     0      0.3572         19     0      19
2.2     10     0      0.3572         19     0      19
2.4     10     0      0.3572         19     0      19
2.6     10     0      0.3572         19     0      19
2.8     10     0      0.3572         19     0      19
3.0     10     0      0.3572         19     0      19

(c) $\mu_E^{(2)} = 1/2$
        Maximize MTTF, C0 = 10       Minimize cost, T* = 0.5
C2      N1*    N2*    T_HO*          N1*    N2*    C_HO*
1.0     0      10     0.7145         0      7      7.0
1.2     0      8      0.6089         0      7      8.4
1.4     0      7      0.5464         0      7      9.8
1.6     0      6      0.4750         0      7      11.2
1.8     0      5      0.3917         0      7      12.6
2.0     0      5      0.3917         0      7      14.0
2.2     10     0      0.3572         0      7      15.4
2.4     10     0      0.3572         0      7      16.8
2.6     10     0      0.3572         0      7      18.2
2.8     10     0      0.3572         19     0      19.0
3.0     10     0      0.3572         19     0      19.0
Table 2: Heterogeneous networks; $\mu_E^{(1)} = 1.0$, $C_1 = 1.0$. For each value of $C_2$, the left group gives the optimum decision for maximizing the MTTF with $C_0 = 10$ and the right group gives the optimum decision for minimizing the cost with $T^* = 0.25$.

(a) $\mu_E^{(2)} = 5/6$
        Maximize MTTF, C0 = 10       Minimize cost, T* = 0.25
C2      N1*    N2*    T_HE*          N1*    N2*    C_HE*
1.0     0      10     0.4407         0      6      6.0
1.1     0      9      0.4093         2      4      6.4
1.2     0      8      0.3720         2      4      6.8
1.3     10     0      0.3696         7      0      7.0
1.4     10     0      0.3696         7      0      7.0
1.5     10     0      0.3696         7      0      7.0
1.6     10     0      0.3696         7      0      7.0
1.7     10     0      0.3696         7      0      7.0
1.8     10     0      0.3696         7      0      7.0
1.9     10     0      0.3696         7      0      7.0
2.0     10     0      0.3696         7      0      7.0

(b) $\mu_E^{(2)} = 2/3$
        Maximize MTTF, C0 = 10       Minimize cost, T* = 0.25
C2      N1*    N2*    T_HE*          N1*    N2*    C_HE*
1.0     0      10     0.5412         0      5      5.0
1.1     0      9      0.5054         1      4      5.4
1.2     0      8      0.4640         1      4      5.8
1.3     2      6      0.4233         1      4      6.2
1.4     0      7      0.4147         1      4      6.6
1.5     1      6      0.3924         1      4      7.0
1.6     2      5      0.37           7      0      7.0
1.7     10     0      0.3696         7      0      7.0
1.8     10     0      0.3696         7      0      7.0
1.9     10     0      0.3696         7      0      7.0
2.0     10     0      0.3696         7      0      7.0

(c) $\mu_E^{(2)} = 1/2$
        Maximize MTTF, C0 = 10       Minimize cost, T* = 0.25
C2      N1*    N2*    T_HE*          N1*    N2*    C_HE*
1.0     0      10     0.7143         0      4      4.0
1.1     0      9      0.6657         0      4      4.4
1.2     0      8      0.6118         0      4      4.8
1.3     0      7      0.5502         0      4      5.2
1.4     0      7      0.5502         0      4      5.6
1.5     1      6      0.5062         0      4      6.0
1.6     0      6      0.4775         0      4      6.4
1.7     1      5      0.4291         5      1      6.7
1.8     1      5      0.4291         5      1      6.8
1.9     0      5      0.3895         5      1      6.9
2.0     0      5      0.3895         5      1      7.0
The reliability function of the heterogeneous network $R_{HE}(N_1,N_2,t)$ can be found by substituting equation (12) into equation (7). The mean time to failure $T_{HE}(N_1,N_2)$ can be obtained by integrating $R_{HE}(N_1,N_2,t)$ (equation (8)). Unfortunately, this integral does not reduce to a simple closed-form expression. Hence, we rely on numerical methods to evaluate it.
For homogeneous networks, we let the cost budget $C_0$ be 10.0 and the MTTF requirement $T^*$ be 0.5. In addition, we let both $C_1$ and $\mu_E^{(1)}$ be 1.0. Tables 1(a)-(c) show the optimum decision $N_1^*$ and $N_2^*$ for homogeneous networks when $C_2$ and $\mu_E^{(2)}$ assume different values. It is interesting to note that, for fixed $\mu_E^{(2)}$, there is a threshold on $C_2$ below which it is cost effective to use type 2 nodes and above which it is cost effective to use type 1 nodes.
For heterogeneous networks, we let the cost budget $C_0$ be 10.0 and the MTTF requirement $T^*$ be 0.25. In addition, we let both $C_1$ and $\mu_E^{(1)}$ be 1.0. Tables 2(a)-(c) show the optimum decision $N_1^*$ and $N_2^*$ for heterogeneous networks when $C_2$ and $\mu_E^{(2)}$ assume different values. There is a threshold on $C_2$ below which the optimum decision is either to choose only type 2 nodes, or to choose a large number of type 2 nodes and a few type 1 nodes. This can be explained as follows. Let $C_1 = 1.0$, $C_2 = 1.5$ and $C_0 = 10$, and let us consider the maximization of MTTF. Suppose it is more cost effective to use type 2 nodes when $C_2 = 1.5$. We can use at most 6 type 2 nodes at a cost of 9.0. Since the MTTF is an increasing function of $N_1$ and $N_2$, we should also use one type 1 node in order to maximize the MTTF while the cost budget is not exceeded. Hence the optimum decision in this case is $N_1^* = 1$ and $N_2^* = 6$.
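The arithmetic behind this example, written out with the floor expressions used in the enumeration of Section V.A (the values $C_0 = 10$, $C_1 = 1.0$ and $C_2 = 1.5$ are those quoted in the text):

```latex
N_2 = \left\lfloor \frac{C_0}{C_2} \right\rfloor = \left\lfloor \frac{10}{1.5} \right\rfloor = 6,
\qquad N_2 C_2 = 6 \times 1.5 = 9.0,
\qquad N_1 = \left\lfloor \frac{C_0 - N_2 C_2}{C_1} \right\rfloor = \left\lfloor \frac{10 - 9.0}{1.0} \right\rfloor = 1,
\qquad N_1 C_1 + N_2 C_2 = 1.0 + 9.0 = 10.0 \le C_0
```

The leftover budget of 1.0 after buying six type 2 nodes is exactly enough for one type 1 node, which is why a mixed network appears at this cost ratio.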
VII Conclusions
We have presented a reliability and cost model for a class of multiprocessing networks when there are two types of processing elements. A network can consist of type 1 PEs, type 2 PEs, or both. To determine how many PEs of each type should be used, we formulate two optimization problems: maximization of MTTF given a cost budget and minimization of cost given an MTTF requirement. By making use of the property that both the MTTF and the cost increase with $N_i$ ($i = 1, 2$), we propose efficient algorithms to solve them. The optimum decision is highly dependent on the cost ratio and the reliability improvement ratio of the two types of PEs.
References
[1] D. K. Pradhan, IEEE Trans. Computers, 34, 33 (1985).
[2] I. Koren, D. K. Pradhan, Proc. IEEE, 74, 699 (1986).
[3] C. S. Raghavendra, M. Gerla, A. Avizienis, IEEE Trans. Computers, 34, 46 (1985).
[4] S. P. Dandamudi, D. L. Eager, IEEE Trans. Computers, 39, 786 (1990).
[5] Y. W. Ng, A. Avizienis, IEEE Trans. Computers, 29, 1002 (1980).
[6] A. Perder, V. V. Sarma, IEEE Trans. Computers, 32, 911 (1983).
[7] C. R. Das, L. N. Bhuyan, IEEE Trans. Computers, 34, 918 (1985).
[8] J. D. Bruguera, E. L. Zapata, O. G. Plata, Microprocessing and Microprogramming, 29, 15 (1990).
[9] S. Latifi, IEEE Trans. Reli., 39, 361 (1990).
[10] D. L. Grosh, A Primer of Reliability Theory, John Wiley (1989).
[11] M. M. Syslo, N. Deo, J. S. Kowalik, Discrete Optimization Algorithms, p. 82, Prentice Hall (1983).