Comput. & Ops Res. Vol. 6, pp. 87-97. Pergamon Press Ltd., 1979. Printed in Great Britain
AVAILABILITY EVALUATION OF REDUNDANT COMPUTER SYSTEMS

SHUNJI OSAKI* and TOSHIHIKO NISHIO†
Department of Industrial Engineering, Hiroshima University, Hiroshima 730, Japan

Scope and purpose - Computer systems play an important role in our society. A system break-down is costly, dangerous, and can even cause confusion in our society. It is, therefore, of great importance to build and operate such systems with a high degree of reliability. This paper investigates two typical redundant computer systems and obtains the limiting availability for each system analytically by using Markov renewal processes. The paper also presents numerical examples of the availability for comparison purposes.

Abstract - This paper investigates the availability of two redundant computer systems, a duplex system and a load share system, under the assumptions that the failure time distribution is bivariate exponential and the repair time distribution is arbitrary. We derive the limiting availability for each system by applying a unique modification of regeneration point techniques in Markov renewal processes. In particular, the procedure for obtaining the limiting probability is presented thoroughly. This paper also gives numerical examples for comparison of the two systems from the viewpoint of availability.
1. INTRODUCTION
The remarkable progress of modern computer technology enables us to build large-scale computer systems which play an important role in our society. Examples of such systems are a vehicle traffic control system, a seat reservation system, an on-line real-time banking system, a communication system, and so on. A break-down of such a system is costly, dangerous, and can cause confusion in our society. It is, therefore, of great importance to operate such a computer system with high reliability.

A computer system with high reliability can be achieved both by hardware and software [1-5]. In this paper, we are interested in the reliability of computer systems from the viewpoint of hardware, more precisely, the system configuration by redundancy. That is, we consider two models of computer systems with redundancy, i.e. a duplex system and a load share system.

The duplex system is used for attaining high availability and long "mean time to system failure" (MTTF). Let us consider a simple model of the duplex system composed of 4 units, say $A_1$, $A_2$, $B_1$ and $B_2$. The operation of the system can be performed by two sub-systems, units $A_1$ and $B_1$, and units $A_2$ and $B_2$. For instance, units $A_1$ and $B_1$ may form an on-line sub-system and units $A_2$ and $B_2$ form an off-line sub-system which is a hot standby for the on-line, and vice versa, where the on-line sub-system can execute on-line tasks, and the off-line sub-system can execute off-line tasks.

The load share system is also used for attaining high availability and long MTTF. Consider a simple model of the load share system composed of 4 units, say $A_1$, $A_2$, $B_1$ and $B_2$. The operation of the system can be performed by any combination of unit $A_i$ ($i = 1,2$) and unit $B_i$ ($i = 1,2$).

In this paper we are interested in the availability analysis of the above two models under some assumptions. It is generally true that two measures of reliability can be considered, namely, availability and MTTF. It is more useful to consider the availability as a measure of reliability in commercial computer systems. Thus, we obtain the availability of each model by applying the unique modifications of Markov renewal processes. Finally, numerical results of the availability and their comparisons are presented for illustration.

*Shunji Osaki is currently Associate Professor of systems engineering at Hiroshima University, Hiroshima, Japan. He received his Ph.D. in Operations Research from Kyoto University. His research interests include reliability and maintenance theory, and applied probability.
†Toshihiko Nishio is a graduate student of systems engineering at Hiroshima University. He received his Bachelor of Science in Engineering from Hiroshima University. His research interests include reliability theory and computer system performance analysis.
2. MODEL
1. Duplex system
Consider a simple model of the duplex system shown in Fig. 1. Units $A_1$ and $B_1$ form one sub-system and units $A_2$ and $B_2$ another sub-system. The system can function with either sub-system, where one is used as the on-line sub-system and the other as the off-line sub-system. If the on-line fails, the off-line, if it is functioning, takes over its operation. If both sub-systems fail simultaneously, a system break-down takes place.

Let us describe a typical example of this behavior. Units $A_1$ and $B_1$ are used for on-line and units $A_2$ and $B_2$ for off-line. If the on-line fails (i.e. unit $A_1$ or $B_1$ fails), the off-line sub-system (units $A_2$ and $B_2$), if functioning, takes over its operation, and the failed unit $A_1$ or $B_1$ is repaired. When the repair of unit $A_1$ or $B_1$ is completed, units $A_1$ and $B_1$ come back as the off-line sub-system. On the other hand, if the on-line fails (i.e. unit $A_2$ or $B_2$ fails) during the repair of unit $A_1$ or $B_1$, a system break-down takes place; the system returns to functioning status when the repair of $A_1$ or $B_1$ is completed, and then comes back with units $A_1$ and $B_1$ as the on-line.

We assume that each switchover is perfect and instantaneous. We further assume that the repair discipline is "first come first served" and that there is a single repair facility. Of course, since there is a single repair facility, the second-arrived failed unit must wait for repair when two units fail simultaneously.
Fig. 1. A simple model of duplex system.
We assume a bivariate exponential distribution for the failure time of two nonindependent units $A_i$ ($i = 1,2$). We follow the elegant definition by Barlow and Proschan [9]. Suppose three independent sources of shocks are present in the environment. A shock from source 1 destroys unit $A_1$; it occurs at a random time $U_1$, where $\Pr[U_1 > t] = e^{-\lambda_1 t}$. A shock from source 2 destroys unit $A_2$; it occurs at a random time $U_2$, where $\Pr[U_2 > t] = e^{-\lambda_1 t}$. Finally, a shock from source 3 destroys both units; it occurs at a random time $U_{12}$, where $\Pr[U_{12} > t] = e^{-\lambda_1' t}$. Thus the random life length $X_1$ of unit $A_1$ satisfies $X_1 = \min(U_1, U_{12})$, while the random life length $X_2$ of unit $A_2$ satisfies $X_2 = \min(U_2, U_{12})$. Hence the joint survival probability is

$$\bar F_1(t_1, t_2) = \Pr[X_1 > t_1, X_2 > t_2] = e^{-\lambda_1 (t_1 + t_2) - \lambda_1' \max(t_1, t_2)}, \tag{1}$$

for $t_1 \ge 0$, $t_2 \ge 0$. The joint distribution $F_1(t_1, t_2)$ with survival probability given by (1) is called the bivariate exponential distribution. Barlow and Proschan [9] introduced different parameters for the random variables $U_1$ and $U_2$. However, we consider the simple case since the two units are identical. Similarly, we assume a bivariate exponential distribution for the failure time of two
nonindependent units $B_i$ ($i = 1,2$). In this case, we introduce the corresponding three random times $V_1$, $V_2$ and $V_{12}$, where $\Pr[V_1 > t] = e^{-\lambda_2 t}$, $\Pr[V_2 > t] = e^{-\lambda_2 t}$, and $\Pr[V_{12} > t] = e^{-\lambda_2' t}$, respectively. Then the random life length $Y_i$ of unit $B_i$ satisfies $Y_i = \min(V_i, V_{12})$ ($i = 1,2$), respectively. Hence the joint survival probability is

$$\bar F_2(t_1, t_2) = \Pr[Y_1 > t_1, Y_2 > t_2] = e^{-\lambda_2 (t_1 + t_2) - \lambda_2' \max(t_1, t_2)}, \tag{2}$$
for $t_1 \ge 0$, $t_2 \ge 0$.

We further assume that the repair time of each unit $A_i$ ($i = 1,2$) is a random variable $Z_1$ with arbitrary distribution $G_1(t)$ having a finite mean $1/\gamma_1$. Similarly, we assume that the repair time of each unit $B_i$ ($i = 1,2$) is a random variable $Z_2$ with arbitrary distribution $G_2(t)$ having a finite mean $1/\gamma_2$.

Introduce the following states (time instants) $i$ ($i = 0, \ldots, 8$):
State 0: two sub-systems are operating.
State 1: unit $A_i$ ($i = 1$ or 2) fails.
State 2: units $A_i$ ($i = 1,2$) fail simultaneously (system break-down).
State 3: unit $B_i$ ($i = 1$ or 2) fails.
State 4: units $B_i$ ($i = 1,2$) fail simultaneously (system break-down).
State 5: unit $A_i$ ($i = 1$ or 2) fails through state 1 (system break-down).
State 6: unit $B_i$ ($i = 1$ or 2) fails through state 1 (system break-down).
State 7: unit $A_i$ ($i = 1$ or 2) fails through state 3 (system break-down).
State 8: unit $B_i$ ($i = 1$ or 2) fails through state 3 (system break-down).
Here "state" means the status of the system and "time instant" means the time instant at which the process just enters the "state". Note that states $i$ ($i = 0,1,2,3,4$) are regeneration points and states $i$ ($i = 5,6,7,8$) are non-regeneration points. The state transition diagram among the states above is shown in Fig. 2.
Fig. 2. The state transition diagram for duplex system, where 0 represents a state with regeneration point and 0 a state with non-regeneration point.
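The shock construction behind the joint survival probability (1) can be illustrated numerically. The following Python sketch is not part of the original analysis: the function names and the parameter names `lam` (standing for $\lambda_1$) and `lam_common` (standing for $\lambda_1'$) are assumptions introduced here for illustration only.

```python
import math
import random


def joint_survival(t1, t2, lam, lam_common):
    """Joint survival probability (1): Pr[X1 > t1, X2 > t2]."""
    return math.exp(-lam * (t1 + t2) - lam_common * max(t1, t2))


def sample_lifetimes(lam, lam_common, rng):
    """Draw (X1, X2) from the three-shock construction described in the text."""
    u1 = rng.expovariate(lam)          # shock destroying unit 1 only
    u2 = rng.expovariate(lam)          # shock destroying unit 2 only
    u12 = rng.expovariate(lam_common)  # shock destroying both units
    return min(u1, u12), min(u2, u12)


def empirical_survival(t1, t2, lam, lam_common, n=20000, seed=0):
    """Monte Carlo estimate of Pr[X1 > t1, X2 > t2]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x1, x2 = sample_lifetimes(lam, lam_common, rng)
        if x1 > t1 and x2 > t2:
            hits += 1
    return hits / n
```

Setting `t2 = 0` in `joint_survival` recovers the marginal survival probability of a single unit, an exponential law with rate `lam + lam_common`. The common shock is what makes the two life lengths dependent and gives simultaneous failures a positive probability.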
2. Load share system
Consider a load share system in Fig. 3. We assume a simple system composed of 4 units, say, $A_1$, $A_2$, $B_1$ and $B_2$. The system can function with any combination of units $A_i$ ($i = 1,2$) and units $B_i$ ($i = 1,2$) (i.e. $A_1$ and $B_1$, $A_1$ and $B_2$, $A_2$ and $B_1$, or $A_2$ and $B_2$). If units $A_1$ and $A_2$, or units $B_1$ and $B_2$, fail simultaneously, a system break-down takes place. The other assumptions are the same as described in the preceding model for the duplex system.

Introduce the following states (time instants) $i$ ($i = 0, \ldots, 14$):
State 0: all units are operating.
Fig. 3. A simple model of load share system.
State 1: unit $A_i$ ($i = 1$ or 2) fails.
State 2: unit $B_i$ ($i = 1$ or 2) fails.
State 3: unit $B_i$ ($i = 1$ or 2) fails through state 1.
State 4: unit $A_i$ ($i = 1$ or 2) fails through state 2.
State 5: unit $A_i$ ($i = 1$ or 2) fails through state 1 (system break-down).
State 6: unit $B_i$ ($i = 1$ or 2) fails through state 2 (system break-down).
State 7: unit $A_i$ ($i = 1$ or 2) fails through state 3 or state 12 (system break-down).
State 8: unit $B_i$ ($i = 1$ or 2) fails through state 4 or state 11 (system break-down).
State 9: unit $B_i$ ($i = 1$ or 2) fails through state 3 or state 12, or units $B_i$ ($i = 1,2$) fail simultaneously through state 1 (system break-down).
State 10: unit $A_i$ ($i = 1$ or 2) fails through state 4 or state 11, or units $A_i$ ($i = 1,2$) fail simultaneously through state 2 (system break-down).
State 11: the repair of unit $A_i$ ($i = 1$ or 2) is completed through state 7.
State 12: the repair of unit $B_i$ ($i = 1$ or 2) is completed through state 8.
State 13: the repair of unit $A_i$ ($i = 1$ or 2) is completed through state 9.
State 14: the repair of unit $B_i$ ($i = 1$ or 2) is completed through state 10.
Here states $i$ ($i = 0,1,2,11,12,13,14$) are regeneration points and states $i$ ($i = 3,4,5,6,7,8,9,10$) are non-regeneration points. The state transition diagram among the states above is shown in Fig. 4.
Fig. 4. The transition diagram for load share system.
Let us describe the procedure of analysis by using Markov renewal processes for each model mentioned above. If all the states defined above are regeneration points, it is quite easy to apply the conventional Markov renewal processes (see, e.g. Pyke[6]). However, we have to consider some states which are not regeneration points since we assume the arbitrary repair time distributions for each model. To overcome this difficulty, we can apply the unique modification of Markov renewal processes developed by Nakagawa and Osaki[7]. That is, if we
consider the transition from state $i$ to state $j$, where the starting state $i$ is a regeneration point, we can obtain the one-step transition probability $Q_{ij}(t)$ (see Pyke [6]). However, if the state $k$ reached from state $i$ is not a regeneration point, we introduce the two-step transition probability $Q_{ij}^{(k)}(t)$ or the three-step transition probability $Q_{ij}^{(k,l)}(t)$, where state $i$ must be a regeneration point. That is, we can summarize the definitions of the transition probabilities as follows:

$Q_{ij}(t)$ = Pr{after making a transition into state $i$, the process next makes a transition into state $j$, in an amount of time less than or equal to $t$}.
$Q_{ij}^{(k)}(t)$ = Pr{after making a transition into state $i$, the process next makes a transition into state $j$ via state $k$, in an amount of time less than or equal to $t$}.
$Q_{ij}^{(k,l)}(t)$ = Pr{after making a transition into state $i$, the process next makes a transition into state $j$ via states $k$ and $l$, in an amount of time less than or equal to $t$}.

We can apply the unique modification of Markov renewal processes using the transition probabilities $Q_{ij}(t)$, $Q_{ij}^{(k)}(t)$ and $Q_{ij}^{(k,l)}(t)$ based on regeneration point techniques. A detailed discussion of this technique can be found in a paper by Nakagawa and Osaki [7]. We review the procedure of analysis. We assume that the states are labeled $i = 0, \ldots, n$.
(i) Derive the transition probabilities $Q_{ij}(t)$, $Q_{ij}^{(k)}(t)$ and $Q_{ij}^{(k,l)}(t)$ for all states considered using the states defined above.
(ii) Taking the Laplace-Stieltjes (LS) transforms of $Q_{ij}(t)$, $Q_{ij}^{(k)}(t)$ and $Q_{ij}^{(k,l)}(t)$ yields the LS transforms $q_{ij}(s)$, $q_{ij}^{(k)}(s)$ and $q_{ij}^{(k,l)}(s)$, respectively.
(iii) Assume the limiting transition probabilities:

$$q_{ij} \equiv \lim_{s \to 0} q_{ij}(s), \qquad q_{ij}^{(k)} \equiv \lim_{s \to 0} q_{ij}^{(k)}(s), \qquad q_{ij}^{(k,l)} \equiv \lim_{s \to 0} q_{ij}^{(k,l)}(s),$$
and solve for the limiting probabilities $\pi_i$ ($i = 0, \ldots, m$) for the embedded Markov chain:

$$\pi = \pi Q \quad \text{and} \quad \sum_{k=0}^{m} \pi_k = 1, \tag{6}$$

where $\pi = (\pi_0, \ldots, \pi_m)$ is a row vector and $Q$ is a transition probability matrix composed of the possible regeneration points $i = 0, \ldots, m$, i.e. neglecting the non-regeneration points and relabeling $i = 0, \ldots, m$ only for the possible regeneration points:

$$Q = [\hat q_{ij}] \quad (i, j = 0, \ldots, m),$$

where $\hat q_{ij}$ can be obtained by neglecting the non-regeneration points.
(iv) Derive the unconditional means $\mu_i$ ($i = 0, \ldots, m$) for the regeneration points.
(v) Calculate the mean recurrence times:

$$\ell_{ii} = \sum_{k=0}^{m} \pi_k \mu_k \Big/ \pi_i \quad (i = 0, \ldots, m).$$

(vi) Derive the unconditional means $\xi_i$ ($i = 0, \ldots, n$) not neglecting the non-regeneration points.
(vii) Calculate the limiting probabilities $P_i$ ($i = 0, \ldots, n$) using $\ell_{ii}$ and $\xi_i$.
(viii) The limiting availability can be obtained by summing up the limiting probabilities for the operating states.
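Steps (iii)-(viii) reduce to a small linear-algebra computation once the limiting transition probabilities and the means are known. The following Python sketch shows one way to organize it; the function names, the argument layout, and the two-state toy model are assumptions made here for illustration and do not come from the paper.

```python
def stationary(Q):
    """Solve pi = pi Q with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    # Row i encodes sum_j pi_j Q[j][i] - pi_i = 0.
    A = [[Q[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    A[-1] = [1.0] * n   # replace one redundant balance equation by normalisation
    b[-1] = 1.0
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def limiting_availability(Q, mu, operating):
    """Steps (iii)-(viii) of the procedure.

    Q         : embedded transition matrix over the regeneration points
    mu        : unconditional means mu_k, one per regeneration point
    operating : list of (r, xi) pairs, one per operating state, where r is the
                regeneration point used for its mean recurrence time and xi is
                the unconditional mean of the operating state
    """
    pi = stationary(Q)
    cycle = sum(p * m for p, m in zip(pi, mu))   # sum_k pi_k mu_k
    # P_i = xi_i / l_rr, with mean recurrence time l_rr = cycle / pi_r.
    return sum(xi * pi[r] / cycle for r, xi in operating)


# Toy check (hypothetical): one unit alternating between "up" with mean 9
# and "down" with mean 1; both states are regeneration points.
toy = limiting_availability([[0.0, 1.0], [1.0, 0.0]], [9.0, 1.0], [(0, 9.0)])
```

For the toy model the result is the familiar ratio of mean up time to mean cycle time. A direct linear solve is used rather than repeated multiplication by $Q$, since embedded chains of this kind can be nearly periodic.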
3. ANALYSIS
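We introduce the following notation for analysis of each system:
$\theta_0 \equiv \lambda_1 + \lambda_2 + \lambda_1' + \lambda_2'$,
$\theta_1 \equiv \lambda_1 + 2\lambda_2 + \lambda_1' + \lambda_2'$,
$\theta_2 \equiv 2\lambda_1 + \lambda_2 + \lambda_1' + \lambda_2'$,
$\theta_3 \equiv 2\lambda_1 + 2\lambda_2 + \lambda_1' + \lambda_2'$,
$\nu_1 \equiv \lambda_1 + \lambda_1'$, $\nu_2 \equiv \lambda_2 + \lambda_2'$,
$\bar G_i(t) \equiv 1 - G_i(t)$ ($i = 1,2$),
$g_i(s)$: LS transform of the repair time distribution $G_i(t)$ ($i = 1,2$),
$[\,\cdot\,]'_{s=0} \equiv \lim_{s \to 0} d[\,\cdot\,]/ds$.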
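1. Duplex system
Consider the simple model of the duplex system in Sec. 2. The transition probabilities are the following:

$$Q_{0,2i-1}(t) = \int_0^t 2\lambda_i e^{-\theta_3 t}\,dt \quad (i = 1,2,\ \text{i.e. } Q_{01}(t) \text{ and } Q_{03}(t)), \tag{9}$$
$$Q_{0,2i}(t) = \int_0^t \lambda_i' e^{-\theta_3 t}\,dt \quad (i = 1,2,\ \text{i.e. } Q_{02}(t) \text{ and } Q_{04}(t)), \tag{10}$$
$$Q_{2i-1,0}(t) = \int_0^t e^{-\theta_0 t}\,dG_i(t) \quad (i = 1,2,\ \text{i.e. } Q_{10}(t) \text{ and } Q_{30}(t)), \tag{11}$$
$$Q_{2i-1,2i+3}(t) = \int_0^t \nu_1 \bar G_i(t) e^{-\theta_0 t}\,dt \quad (i = 1,2,\ \text{i.e. } Q_{15}(t) \text{ and } Q_{37}(t)), \tag{12}$$
$$Q_{2i-1,2i+4}(t) = \int_0^t \nu_2 \bar G_i(t) e^{-\theta_0 t}\,dt \quad (i = 1,2,\ \text{i.e. } Q_{16}(t) \text{ and } Q_{38}(t)), \tag{13}$$
$$Q_{2i,2i-1}(t) = G_i(t) \quad (i = 1,2,\ \text{i.e. } Q_{21}(t) \text{ and } Q_{43}(t)), \tag{14}$$
$$Q_{2i-1,1}^{(2i+3)}(t) = \int_0^t \left[\int_0^t \nu_1 e^{-\theta_0 x}\,dx\right] dG_i(t) \quad (i = 1,2,\ \text{i.e. } Q_{11}^{(5)}(t) \text{ and } Q_{31}^{(7)}(t)), \tag{15}$$
$$Q_{2i-1,3}^{(2i+4)}(t) = \int_0^t \left[\int_0^t \nu_2 e^{-\theta_0 x}\,dx\right] dG_i(t) \quad (i = 1,2,\ \text{i.e. } Q_{13}^{(6)}(t) \text{ and } Q_{33}^{(8)}(t)). \tag{16}$$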
The derivations of the transition probabilities above are the following. For instance, let us consider the transition probability $Q_{01}(t)$. We can consider six random variables $U_1$, $U_2$, $U_{12}$, $V_1$, $V_2$ and $V_{12}$. If unit $A_1$ or $A_2$ fails first while the remaining units are still operating, then a transition from state 0 to state 1 takes place:

$$Q_{01}(t) = \Pr[U_1 \le t, U_2 > t, U_{12} > t, V_1 > t, V_2 > t, V_{12} > t] + \Pr[U_2 \le t, U_1 > t, U_{12} > t, V_1 > t, V_2 > t, V_{12} > t] = \int_0^t 2\lambda_1 e^{-\theta_3 t}\,dt. \tag{17}$$
The other transition probabilities $Q_{02}(t)$, $Q_{03}(t)$ and $Q_{04}(t)$ can be obtained in a similar fashion. Let us next consider the transition probability $Q_{11}^{(5)}(t)$. We can consider three random variables $X_1$, $Y_1$ and $Z_1$. Note that the marginal survival probabilities of $X_1$ and $Y_1$ are

$$\Pr[X_1 > t] = e^{-(\lambda_1 + \lambda_1')t}, \tag{18}$$
$$\Pr[Y_1 > t] = e^{-(\lambda_2 + \lambda_2')t}, \tag{19}$$

respectively. A transition from state 1 to state 5 takes place with probability

$$\Pr[X_1 \le t, Y_1 > t] = \int_0^t \nu_1 e^{-\theta_0 x}\,dx. \tag{20}$$

Thus, if the repair of unit $A_i$ ($i = 1,2$) is completed after system break-down (i.e. after a transition from state 1 to state 5 has taken place), the two-step transition from state 1 to state 1 via state 5 takes place:

$$Q_{11}^{(5)}(t) = \int_0^t \Pr[X_1 \le t, Y_1 > t]\,dG_1(t) = \int_0^t \left[\int_0^t \nu_1 e^{-\theta_0 x}\,dx\right] dG_1(t). \tag{21}$$
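As a numerical sanity check on (21), its limit as $t \to \infty$ should equal $[\nu_1/\theta_0][1 - g_1(\theta_0)]$, since the inner integral of (20) evaluates to $[\nu_1/\theta_0][1 - e^{-\theta_0 t}]$. The Python sketch below compares a Simpson's-rule evaluation with this closed form. It assumes the gamma repair time with shape parameter 2 used in the numerical examples of Sec. 4; the function names and the truncation point `t_max` are choices made here, not the authors'.

```python
import math


def two_step_limit_numeric(nu1, theta0, gamma, t_max=40.0, steps=4000):
    """Simpson's rule for the t -> infinity limit of (21).

    Assumes a gamma repair time (shape 2, mean 1/gamma); t_max must be large
    enough for the repair-time density to be negligible beyond it.
    """
    def integrand(t):
        inner = nu1 / theta0 * (1.0 - math.exp(-theta0 * t))        # closed form of (20)
        density = 4.0 * gamma ** 2 * t * math.exp(-2.0 * gamma * t)  # dG_1(t)/dt
        return inner * density

    h = t_max / steps          # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(t_max)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0


def two_step_limit_closed(nu1, theta0, gamma):
    """[nu1/theta0][1 - g_1(theta0)] with g_1 the LS transform of the repair time."""
    g = (2.0 * gamma / (theta0 + 2.0 * gamma)) ** 2
    return nu1 / theta0 * (1.0 - g)
```

The two functions should agree closely for failure rates that are small relative to the repair rate, which is the regime of the numerical examples.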
The other transition probabilities $Q_{13}^{(6)}(t)$, $Q_{31}^{(7)}(t)$ and $Q_{33}^{(8)}(t)$ can be obtained in a similar fashion. The remaining transition probabilities can be similarly obtained. The LS transforms of the transition probabilities can be easily obtained; we do not give the results here. Noting that states 0, 1, 2, 3 and 4 are regeneration points, we have the following transition probability matrix for the embedded Markov chain:

$$Q = \begin{bmatrix}
0 & q_{01} & q_{02} & q_{03} & q_{04} \\
q_{10} & q_{11}^{(5)} & 0 & q_{13}^{(6)} & 0 \\
0 & q_{21} & 0 & 0 & 0 \\
q_{30} & q_{31}^{(7)} & 0 & q_{33}^{(8)} & 0 \\
0 & 0 & 0 & q_{43} & 0
\end{bmatrix}. \tag{22}$$
We can solve for the limiting probabilities $\pi_i$ ($i = 0, 1, 2, 3, 4$): (23)
The unconditional means $\mu_i$ ($i = 0, 1, 2, 3, 4$) are given by (24) (25) (26) (27) (28)
We can calculate the mean recurrence times:

$$\ell_{ii} = \sum_{k=0}^{4} \pi_k \mu_k \Big/ \pi_i \quad (i = 0, 1, 2, 3, 4). \tag{29}$$
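We derive the unconditional means $\xi_i$ ($i = 0, 1, 3$) not neglecting the non-regeneration points:

$$\xi_0 = \mu_0 = 1/\theta_3, \tag{30}$$
$$\xi_1 = [1 - q_{10}(s) - q_{15}(s) - q_{16}(s)]'_{s=0} = [1/\theta_0][1 - g_1(\theta_0)], \tag{31}$$
$$\xi_3 = [1 - q_{30}(s) - q_{37}(s) - q_{38}(s)]'_{s=0} = [1/\theta_0][1 - g_2(\theta_0)]. \tag{32}$$

Thus we calculate the limiting probabilities:

$$P_0 = \xi_0/\ell_{00}, \tag{33}$$
$$P_1 = \xi_1/\ell_{11}, \tag{34}$$
$$P_3 = \xi_3/\ell_{33}. \tag{35}$$

The limiting availability for the duplex system is the sum of the limiting probabilities for the operating states 0, 1 and 3, i.e. $P_0 + P_1 + P_3$. (36)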
2. Load share system
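Consider the simple model of the load share system in Sec. 2. The transition probabilities are the following ($i = 1,2$ throughout):

$$Q_{0i}(t) = \int_0^t 2\lambda_i e^{-\theta_3 t}\,dt, \tag{37}$$
$$Q_{0,12+i}(t) = \int_0^t \lambda_{3-i}' e^{-\theta_3 t}\,dt, \tag{38}$$
$$Q_{i0}(t) = \int_0^t e^{-\theta_i t}\,dG_i(t), \tag{39}$$
$$Q_{i,2+i}(t) = \int_0^t 2\lambda_{3-i} \bar G_i(t) e^{-\theta_i t}\,dt, \tag{40}$$
$$Q_{i,4+i}(t) = \int_0^t \nu_i \bar G_i(t) e^{-\theta_i t}\,dt, \tag{41}$$
$$Q_{i,8+i}(t) = \int_0^t \lambda_{3-i}' \bar G_i(t) e^{-\theta_i t}\,dt, \tag{42}$$
$$Q_{10+i,i}(t) = \int_0^t e^{-\theta_0 t}\,dG_{3-i}(t), \tag{43}$$
$$Q_{10+i,11-i}(t) = \int_0^t \nu_i \bar G_{3-i}(t) e^{-\theta_0 t}\,dt, \tag{44}$$
$$Q_{10+i,9-i}(t) = \int_0^t \nu_{3-i} \bar G_{3-i}(t) e^{-\theta_0 t}\,dt, \tag{45}$$
$$Q_{12+i,3-i}(t) = G_{3-i}(t), \tag{46}$$
$$Q_{ii}^{(4+i)}(t) = \int_0^t [\nu_i/\theta_i][1 - e^{-\theta_i t}]\,dG_i(t), \tag{47}$$
$$Q_{i,3-i}^{(2+i)}(t) = \int_0^t 2[e^{-\theta_0 t} - e^{-\theta_i t}]\,dG_i(t), \tag{48}$$
$$Q_{i,6+i}^{(2+i)}(t) = \int_0^t 2\nu_i \bar G_i(t)[e^{-\theta_0 t} - e^{-\theta_i t}]\,dt, \tag{49}$$
$$Q_{i,8+i}^{(2+i)}(t) = \int_0^t 2\nu_{3-i} \bar G_i(t)[e^{-\theta_0 t} - e^{-\theta_i t}]\,dt, \tag{50}$$
$$Q_{i,12+i}^{(8+i)}(t) = \int_0^t [\lambda_{3-i}'/\theta_i][1 - e^{-\theta_i t}]\,dG_i(t), \tag{51}$$
$$Q_{10+i,13-i}^{(9-i)}(t) = \int_0^t [\nu_{3-i}/\theta_0][1 - e^{-\theta_0 t}]\,dG_{3-i}(t), \tag{52}$$
$$Q_{10+i,15-i}^{(11-i)}(t) = \int_0^t [\nu_i/\theta_0][1 - e^{-\theta_0 t}]\,dG_{3-i}(t), \tag{53}$$
$$Q_{i,10+i}^{(2+i,6+i)}(t) = \int_0^t \left\{[2\nu_i/\theta_0][1 - e^{-\theta_0 t}] - [2\nu_i/\theta_i][1 - e^{-\theta_i t}]\right\} dG_i(t), \tag{54}$$
$$Q_{i,12+i}^{(2+i,8+i)}(t) = \int_0^t \left\{[2\nu_{3-i}/\theta_0][1 - e^{-\theta_0 t}] - [2\nu_{3-i}/\theta_i][1 - e^{-\theta_i t}]\right\} dG_i(t). \tag{55}$$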
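Noting that the states 0, 1, 2, 11, 12, 13 and 14 are regeneration points, we have the following transition matrix for the embedded Markov chain (rows and columns ordered 0, 1, 2, 11, 12, 13, 14):

$$Q = \begin{bmatrix}
0 & q_{01} & q_{02} & 0 & 0 & q_{0,13} & q_{0,14} \\
q_{10} & q_{11}^{(5)} & q_{12}^{(3)} & q_{1,11}^{(3,7)} & 0 & q_{1,13} & 0 \\
q_{20} & q_{21}^{(4)} & q_{22}^{(6)} & 0 & q_{2,12}^{(4,8)} & 0 & q_{2,14} \\
0 & q_{11,1} & 0 & 0 & q_{11,12}^{(8)} & 0 & q_{11,14}^{(10)} \\
0 & 0 & q_{12,2} & q_{12,11}^{(7)} & 0 & q_{12,13}^{(9)} & 0 \\
0 & 0 & q_{13,2} & 0 & 0 & 0 & 0 \\
0 & q_{14,1} & 0 & 0 & 0 & 0 & 0
\end{bmatrix}, \tag{56}$$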
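and have the limiting probability vector

$$\pi = (\pi_0, \pi_1, \pi_2, \pi_{11}, \pi_{12}, \pi_{13}, \pi_{14}), \tag{57}$$

by solving the equations $\pi = \pi Q$ and $\sum \pi_i = 1$. The unconditional means $\mu_i$ ($i = 0, 1, 2, 11, 12, 13, 14$) are given by

$$\mu_0 = 1/\theta_3, \tag{58}$$
$$\mu_1 = 1/\gamma_1, \tag{59}$$
$$\mu_2 = 1/\gamma_2, \tag{60}$$
$$\mu_{11} = 1/\gamma_2, \tag{61}$$
$$\mu_{12} = 1/\gamma_1, \tag{62}$$
$$\mu_{13} = 1/\gamma_2, \tag{63}$$
$$\mu_{14} = 1/\gamma_1. \tag{64}$$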
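We can calculate the mean recurrence times $\ell_{ii}$ ($i = 0, 1, 2, 11, 12, 13, 14$). We next derive the unconditional means $\xi_i$ ($i = 0, 1, 2, 3, 4, 11, 12$) not neglecting the non-regeneration points:

$$\xi_0 = \mu_0 = 1/\theta_3, \tag{65}$$
$$\xi_1 = [1/\theta_1][1 - g_1(\theta_1)], \tag{66}$$
$$\xi_2 = [1/\theta_2][1 - g_2(\theta_2)], \tag{67}$$
$$\xi_3 = [2/\theta_0][1 - g_1(\theta_0)] - [2/\theta_1][1 - g_1(\theta_1)], \tag{68}$$
$$\xi_4 = [2/\theta_0][1 - g_2(\theta_0)] - [2/\theta_2][1 - g_2(\theta_2)], \tag{69}$$
$$\xi_{11} = [1/\theta_0][1 - g_2(\theta_0)], \tag{70}$$
$$\xi_{12} = [1/\theta_0][1 - g_1(\theta_0)]. \tag{71}$$

Thus, we can calculate the limiting probabilities:

$$P_0 = \xi_0/\ell_{00}, \tag{72}$$
$$P_1 = \xi_1/\ell_{11}, \tag{73}$$
$$P_2 = \xi_2/\ell_{22}, \tag{74}$$
$$P_3 = \xi_3/\ell_{11}, \tag{75}$$
$$P_4 = \xi_4/\ell_{22}, \tag{76}$$
$$P_{11} = \xi_{11}/\ell_{11,11}, \tag{77}$$
$$P_{12} = \xi_{12}/\ell_{12,12}. \tag{78}$$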
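The load share quantities can be assembled in the same way. The following Python sketch builds the embedded matrix over the regeneration points 0, 1, 2, 11, 12, 13, 14 from the $t \to \infty$ limits of the transition probabilities, and then applies the procedure of Sec. 2. It is an illustration only: it assumes the gamma repair time with shape parameter 2 from Sec. 4, it takes the entries $q_{1,13}$ and $q_{2,14}$ to collect both routes into states 13 and 14 (directly via states 9 or 10, and via states 3 or 4 first), and every function and variable name is hypothetical.

```python
def lst_gamma2(s, gamma):
    """LS transform of a gamma (shape 2) repair time with mean 1/gamma."""
    return (2.0 * gamma / (s + 2.0 * gamma)) ** 2


def stationary(Q):
    """Solve pi = pi Q with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    A = [[Q[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    A[-1], b[-1] = [1.0] * n, 1.0
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def load_share_model(lam, lamc, gam):
    """Return (Q, mu, up) over the regeneration points 0, 1, 2, 11, 12, 13, 14.

    lam, lamc and gam are pairs (type A, type B) holding lambda_i, lambda_i'
    and gamma_i; up[k] is the mean operating time per visit to point k.
    """
    nu = (lam[0] + lamc[0], lam[1] + lamc[1])
    th0 = nu[0] + nu[1]
    th = (th0 + lam[1], th0 + lam[0])                  # theta_1, theta_2
    th3 = 2 * lam[0] + 2 * lam[1] + lamc[0] + lamc[1]
    Q = [[0.0] * 7 for _ in range(7)]
    mu = [1 / th3] + [0.0] * 6
    up = [1 / th3] + [0.0] * 6
    for i in (0, 1):                                   # 0: type A, 1: type B
        j = 1 - i
        Q[0][1 + i] = 2 * lam[i] / th3                 # limit of (37)
        Q[0][5 + i] = lamc[j] / th3                    # limit of (38)
        a0 = 1 - lst_gamma2(th0, gam[i])
        ai = 1 - lst_gamma2(th[i], gam[i])
        via3 = 2 * (a0 / th0 - ai / th[i])             # xi_3 or xi_4, (68)-(69)
        row = Q[1 + i]                                 # state 1 or 2
        row[0] = 1 - ai                                # (39)
        row[1 + i] = nu[i] / th[i] * ai                # (47)
        row[1 + j] = 2 * (ai - a0)                     # (48)
        row[3 + i] = nu[i] * via3                      # (54)
        row[5 + i] = lamc[j] / th[i] * ai + nu[j] * via3   # (51) plus (55)
        mu[1 + i], up[1 + i] = 1 / gam[i], ai / th[i] + via3
        b0 = 1 - lst_gamma2(th0, gam[j])               # other type under repair
        row = Q[3 + i]                                 # state 11 or 12
        row[1 + i] = 1 - b0                            # (43)
        row[3 + j] = nu[j] / th0 * b0                  # (52)
        row[5 + j] = nu[i] / th0 * b0                  # (53)
        mu[3 + i], up[3 + i] = 1 / gam[j], b0 / th0    # (61)-(62), (70)-(71)
        Q[5 + i][1 + j] = 1.0                          # (46)
        mu[5 + i] = 1 / gam[j]                         # (63)-(64)
    return Q, mu, up


def load_share_availability(lam, lamc, gam):
    """Sum of the limiting probabilities of the operating states."""
    Q, mu, up = load_share_model(lam, lamc, gam)
    pi = stationary(Q)
    cycle = sum(p * m for p, m in zip(pi, mu))
    return sum(p * u for p, u in zip(pi, up)) / cycle
```

A useful check on the reconstruction is that every row of the matrix returned by `load_share_model` sums to one, as it must for an embedded Markov chain.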
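We finally obtain the limiting availability for the load share system by summing the limiting probabilities for the operating states, i.e. $P_0 + P_1 + P_2 + P_3 + P_4 + P_{11} + P_{12}$. (79)

4. NUMERICAL EXAMPLES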
Consider numerical examples of the availabilities for the duplex system and the load share system. We introduce a simplex system composed of 2 units $A_1$ and $B_1$. The simplex system is an essential computer system with no redundancy and can be compared to the other redundant computer systems discussed in this paper. Of course, the limiting availability for the simplex system is given by

$$A_3 = 1/[(\lambda_1/\gamma_1) + (\lambda_2/\gamma_2) + 1]. \tag{80}$$

Let us consider the repair time distributions common to the three systems above (i.e. the duplex system, the load share system, and the simplex system). The repair time distributions are gamma with shape parameter 2:

$$G_i(t) = 1 - (1 + 2\gamma_i t)e^{-2\gamma_i t} \quad (i = 1,2). \tag{81}$$
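The repair time distribution (81) and the simplex availability (80) are simple enough to write down directly. The Python sketch below is a minimal illustration; the function names are assumptions made here. The closed form used for the LS transform, $g_i(s) = [2\gamma_i/(s + 2\gamma_i)]^2$, is the standard transform of a gamma distribution with shape parameter 2 and mean $1/\gamma_i$.

```python
import math


def repair_cdf(t, gamma):
    """Repair time distribution (81): gamma with shape 2 and mean 1/gamma."""
    return 1.0 - (1.0 + 2.0 * gamma * t) * math.exp(-2.0 * gamma * t)


def repair_lst(s, gamma):
    """LS transform g_i(s) of (81), i.e. the expectation of exp(-s Z_i)."""
    return (2.0 * gamma / (s + 2.0 * gamma)) ** 2


def simplex_availability(lam1, lam2, gamma1, gamma2):
    """Limiting availability (80) of the simplex system."""
    return 1.0 / (lam1 / gamma1 + lam2 / gamma2 + 1.0)
```

The mean repair time can be recovered from the transform as the limit of `(1 - repair_lst(s, gamma)) / s` for small `s`, which should approach `1 / gamma`.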
The limiting availabilities can be calculated from (36), (79), and (80), respectively, by assuming all the parameters. Figure 5 shows the dependence of the limiting availability for each system on the ratio $[(1/\gamma_1)/(1/\gamma_2)]$, where $\lambda_1 = 10^{-3}$, $\lambda_2 = 5 \times 10^{-4}$, $\lambda_1' = 10^{-5}$, $\lambda_2' = 10^{-5}$, and $1/\gamma_2 = 1$. Figure 6 shows the same dependence on the ratio $[(1/\gamma_2)/(1/\gamma_1)]$, where $\lambda_1 = 10^{-3}$, $\lambda_2 = 5 \times 10^{-4}$, $\lambda_1' = 10^{-5}$, $\lambda_2' = 10^{-5}$, and $1/\gamma_1 = 1$. From both figures, we can conclude the following:
(i) The load share system is better than the duplex system from the viewpoint of availability.
(ii) The difference between the availabilities for the load share system and the duplex system increases gradually as the ratio $[(1/\gamma_2)/(1/\gamma_1)]$ or $[(1/\gamma_1)/(1/\gamma_2)]$ increases.
(iii) The simplex system is considerably worse than the redundant systems from the viewpoint of availability.

Fig. 5. The dependence of the limiting availability for each system on the ratio $[(1/\gamma_1)/(1/\gamma_2)]$, where $\lambda_1 = 10^{-3}$, $\lambda_2 = 5 \times 10^{-4}$, $\lambda_1' = 10^{-5}$, $\lambda_2' = 10^{-5}$, and $1/\gamma_2 = 1$.

Fig. 6. The dependence of the limiting availability for each system on the ratio $[(1/\gamma_2)/(1/\gamma_1)]$, where $\lambda_1 = 10^{-3}$, $\lambda_2 = 5 \times 10^{-4}$, $\lambda_1' = 10^{-5}$, $\lambda_2' = 10^{-5}$, and $1/\gamma_1 = 1$.

5. CONCLUDING REMARKS
We have discussed the limiting availabilities for the duplex and the load share systems. We have obtained these availabilities by using a modification of regeneration point techniques in Markov renewal processes. The assumptions made in this paper are quite simple compared to real computer systems. The models we proposed are only initial mathematical models for redundant computer systems. We should construct more realistic models and obtain the necessary quantities. Of course, we should always check the validity of the models by using field data and/or simulation.
REFERENCES
1. A. Avizienis, Fault-tolerant systems. IEEE Trans. Comput. C-25, 1304-1312 (1976).
2. R. A. Short, The attainment of reliable digital systems through the use of redundancy: a survey. IEEE Comput. Group News 2, 2-17 (1968).
3. W. C. Carter and W. G. Bouricius, A survey of fault-tolerant computer architecture and its evaluation. Computer 4, 9-16 (1971).
4. A. L. Hopkins, Jr. and T. B. Smith, III, The architectural elements of a symmetric fault-tolerant multiprocessor. IEEE Trans. Comput. C-24, 498-505 (1975).
5. A. E. Cooper and W. T. Chow, Development of on-board space computer systems. IBM J. Res. Develop. 20, 5-19 (1976).
6. R. Pyke, Markov renewal processes with finitely many states. Ann. Math. Statist. 32, 1243 (1961).
7. T. Nakagawa and S. Osaki, Stochastic behavior of a two-unit standby redundant system. INFOR 12, 66 (1974).
8. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability. Wiley, New York (1965).
9. R. E. Barlow and F. Proschan, Statistical Theory of Reliability and Life Testing. Holt, Rinehart and Winston, New York (1975).