Reliability Engineering and System Safety 27 (1990) 213-218
Optimal Testing Policy for a Computer System with Intermittent Faults

T. Nakagawa, M. Motoori
Department of Industrial Engineering, Aichi Institute of Technology, Toyota 470-03, Japan
&
K. Yasui
Division of Information Systems, Chubu Electric Power Inc., Nagoya 461-91, Japan
(Received 2 October 1988; revised version received 12 March 1989; accepted 17 March 1989)
ABSTRACT

A computer system with intermittent faults fails with probability $p$ when it is used while faults are hidden. Periodic tests are scheduled at times $kT$ $(k = 1, 2, \ldots)$ to detect a hidden fault. The mean time, the expected number of tests and the expected cost until detection of a fault or system failure are derived, using the theory of Markov renewal processes. An optimal testing time $T^*$ to minimize the expected cost is discussed. A finite $T^*$ is given by a unique solution of an equation.
Reliability Engineering and System Safety 0951-8320/90/$03.50 © 1990 Elsevier Science Publishers Ltd, England. Printed in Great Britain

1 INTRODUCTION

Faults of a computer system occur intermittently, and are automatically detected by the error correcting code and corrected by the error control,$^{1,2}$ or the restart.$^{3,4}$ However, some faults are hidden$^{5}$ or latent,$^{6}$ and will lead to a failure in the future. Inspection tests such as instruction tests and repetitive tests,$^{7}$ which check the system and detect a fault at periodic intervals, are actually used in many digital systems.

This paper considers the periodic test of detecting a hidden fault. Faults of the system occur intermittently and are hidden. The system can be operative while faults are hidden, and fails with probability $p$ when it is used before it has recovered from the hidden faults. A testing policy for such intermittent faults is scheduled at periodic times $kT$ $(k = 1, 2, \ldots)$. The mean time, the expected number of tests and the expected cost until detection of a fault or system failure are derived, using the technique of Markov renewal processes.$^{8}$ The optimal time $T^*$ to minimize the expected cost is discussed, and exists uniquely under suitable conditions. A numerical example is finally given.

2 RELIABILITY ANALYSIS

Suppose that faults of a computer system occur in a Poisson process with rate $\lambda$ and are hidden. The duration of a hidden fault is exponential, i.e. the system recovers from the hidden state according to an exponential distribution $1 - \exp(-\mu y)$ for $y > 0$. The system is used when an instantaneous demand occurs, with demands arriving according to an exponential density $\theta \exp(-\theta t)$ for $t > 0$, i.e. the duration of use required by a demand is negligible. Then, the system fails with probability $p$ $(0 < p < 1)$ if hidden faults have occurred, where system failure is detected instantaneously upon its occurrence. Otherwise, the system always operates when there is no fault, and continues to operate with probability $1 - p$ when it is used while faults are hidden. We define states for the above intermittent fault model:
State 0: The system is operating in no fault.
State 1: The system is operating in hidden faults.
State 2: The system fails.

We have the following mass functions $Q_{ij}(t)$ $(i = 0, 1;\ j = 0, 1, 2)$ from state $i$ to state $j$ for $t > 0$:$^{8}$

$$Q_{01}(t) = 1 - \exp(-\lambda t) \tag{1}$$

$$Q_{10}(t) = \int_0^t \exp(-\theta y)\,\mu \exp(-\mu y)\,dy = \frac{\mu}{\theta + \mu}\,[1 - \exp(-(\theta + \mu)t)] \tag{2}$$

$$Q_{11}(t) = (1 - p)\int_0^t \exp(-\mu y)\,\theta \exp(-\theta y)\,dy = \frac{(1 - p)\theta}{\theta + \mu}\,[1 - \exp(-(\theta + \mu)t)] \tag{3}$$

$$Q_{12}(t) = p\int_0^t \exp(-\mu y)\,\theta \exp(-\theta y)\,dy = \frac{p\theta}{\theta + \mu}\,[1 - \exp(-(\theta + \mu)t)] \tag{4}$$
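As a quick numerical check of (2)-(4), the closed forms can be compared with a direct quadrature of the defining integral, and their sum with the sojourn-time distribution of State 1, $1 - \exp(-(\theta + \mu)t)$. The following Python sketch is illustrative only: the function names and the midpoint rule are our own choices, and the parameter values are borrowed from the numerical example of Section 4 ($1/\mu = 1$, $1/\theta = 1.6$, $p = 0.8$).

```python
import math


def q01(t, lam):
    """Mass function (1): first fault occurs by time t."""
    return 1.0 - math.exp(-lam * t)


def q1(t, mu, theta, p):
    """Closed forms (2)-(4): returns (Q10, Q11, Q12) at time t."""
    s = theta + mu
    g = 1.0 - math.exp(-s * t)
    return mu / s * g, (1.0 - p) * theta / s * g, p * theta / s * g


def q10_numeric(t, mu, theta, n=20000):
    """Midpoint-rule approximation of the integral in (2) (illustrative helper)."""
    h = t / n
    total = 0.0
    for k in range(n):
        y = (k + 0.5) * h
        total += math.exp(-theta * y) * mu * math.exp(-mu * y)
    return total * h


# Assumed parameter values, taken from the numerical example of Section 4.
mu, theta, p = 1.0, 0.625, 0.8
t = 2.0
q10, q11, q12 = q1(t, mu, theta, p)

print("Q10 closed form :", q10)
print("Q10 quadrature  :", q10_numeric(t, mu, theta))
print("Q10+Q11+Q12     :", q10 + q11 + q12)
print("1-exp(-(th+mu)t):", 1.0 - math.exp(-(theta + mu) * t))
```

Because recovery and a demand are the only events that end a sojourn in State 1, the last two printed values should coincide for every $t$.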
Next, let $P_{ij}(t)$ denote the probability that the system is in state $j$ at time $t$, given that it was in state $i$ at time 0. Then, using the theory of Markov renewal processes,$^{8}$ we have

$$\begin{aligned} P_{00}(t) &= 1 - Q_{01}(t) + Q_{01}(t) * P_{10}(t) \\ P_{10}(t) &= Q_{10}(t) * P_{00}(t) + Q_{11}(t) * P_{10}(t) \end{aligned} \tag{5}$$

$$\begin{aligned} P_{01}(t) &= Q_{01}(t) * P_{11}(t) \\ P_{11}(t) &= 1 - Q_{10}(t) - Q_{11}(t) - Q_{12}(t) + Q_{10}(t) * P_{01}(t) + Q_{11}(t) * P_{11}(t) \end{aligned} \tag{6}$$

$$\begin{aligned} P_{02}(t) &= Q_{01}(t) * P_{12}(t) \\ P_{12}(t) &= Q_{12}(t) + Q_{10}(t) * P_{02}(t) + Q_{11}(t) * P_{12}(t) \end{aligned} \tag{7}$$

where $*$ denotes the Stieltjes convolution, i.e. $a(t) * b(t) \equiv \int_0^t b(t - u)\,da(u)$. Forming the Laplace-Stieltjes (LS) transforms of (5), (6) and (7), and rearranging them,

$$P^*_{00}(s) = \frac{[1 - Q^*_{01}(s)][1 - Q^*_{11}(s)]}{1 - Q^*_{01}(s)Q^*_{10}(s) - Q^*_{11}(s)} \tag{8}$$

$$P^*_{01}(s) = \frac{Q^*_{01}(s)[1 - Q^*_{10}(s) - Q^*_{11}(s) - Q^*_{12}(s)]}{1 - Q^*_{01}(s)Q^*_{10}(s) - Q^*_{11}(s)} \tag{9}$$

$$P^*_{02}(s) = \frac{Q^*_{01}(s)Q^*_{12}(s)}{1 - Q^*_{01}(s)Q^*_{10}(s) - Q^*_{11}(s)} \tag{10}$$
where, in general, $\Phi^*(s)$ represents the LS transform of any function $\Phi(t)$, i.e. $\Phi^*(s) \equiv \int_0^\infty \exp(-st)\,d\Phi(t)$. Thus, substituting (1)-(4) into (8), (9), (10) and taking the inverse LS transforms of $P^*_{0j}(s)$, we have

$$P_{00}(t) = \frac{(\lambda - b)\exp(-at) + (a - \lambda)\exp(-bt)}{a - b} \tag{11}$$

$$P_{01}(t) = \frac{-\lambda\exp(-at) + \lambda\exp(-bt)}{a - b} \tag{12}$$

$$P_{02}(t) = 1 + \frac{b\exp(-at) - a\exp(-bt)}{a - b} \tag{13}$$
where

$$a \equiv \tfrac{1}{2}\left[\lambda + \mu + \theta p + \sqrt{(\lambda + \mu + \theta p)^2 - 4\lambda\theta p}\,\right]$$

$$b \equiv \tfrac{1}{2}\left[\lambda + \mu + \theta p - \sqrt{(\lambda + \mu + \theta p)^2 - 4\lambda\theta p}\,\right]$$

It is evident that $P_{00}(t) + P_{01}(t) + P_{02}(t) = 1$.

Next, the test is scheduled at periodic times $kT$ $(k = 1, 2, \ldots)$ to detect a hidden fault. Assume that the test is perfect, and hence, any hidden fault is always detected at each test. The test stops upon detection of a hidden fault or system failure. The mean time to detection of a fault or system failure is given by a renewal equation:$^{9}$

$$\gamma(T) = \int_0^T t\,dP_{02}(t) + T P_{01}(T) + [T + \gamma(T)]P_{00}(T)$$

Solving this equation, we have

$$\gamma(T) = \frac{\int_0^T [1 - P_{02}(t)]\,dt}{1 - P_{00}(T)} \tag{14}$$

Similarly, the expected number of tests is

$$M(T) = \frac{P_{00}(T)}{1 - P_{00}(T)} \tag{15}$$

and the expected cost to detection of a fault or system failure is

$$C(T) = \frac{c_1 P_{00}(T) + c_2 P_{01}(T) + c_3 P_{02}(T)}{1 - P_{00}(T)} \tag{16}$$

where $c_1$ = cost of one test, $c_2$ = cost of detection of a fault, and $c_3$ = cost of system failure.
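Equations (11)-(16) are straightforward to evaluate numerically. A minimal Python sketch follows, assuming the parameter values of Section 4; the function names are hypothetical, and the integral in (14) is evaluated in closed form from (13), $\int_0^T [1 - P_{02}(t)]\,dt = [(a/b)(1 - \exp(-bT)) - (b/a)(1 - \exp(-aT))]/(a - b)$, a step that is not spelled out in the text.

```python
import math


def roots(lam, mu, theta, p):
    """Return (a, b) as defined below (13)."""
    s = lam + mu + theta * p
    d = math.sqrt(s * s - 4.0 * lam * theta * p)
    return (s + d) / 2.0, (s - d) / 2.0


def state_probs(t, lam, a, b):
    """State probabilities (11)-(13) at time t."""
    ea, eb = math.exp(-a * t), math.exp(-b * t)
    p00 = ((lam - b) * ea + (a - lam) * eb) / (a - b)
    p01 = lam * (eb - ea) / (a - b)
    p02 = 1.0 + (b * ea - a * eb) / (a - b)
    return p00, p01, p02


def measures(T, lam, a, b, c1, c2, c3):
    """Return (gamma, M, C) from (14)-(16) for test interval T."""
    p00, p01, p02 = state_probs(T, lam, a, b)
    # Closed form of the integral of 1 - P02(t) over [0, T] (our derivation).
    integral = ((a / b) * (1.0 - math.exp(-b * T))
                - (b / a) * (1.0 - math.exp(-a * T))) / (a - b)
    gamma = integral / (1.0 - p00)
    m = p00 / (1.0 - p00)
    cost = (c1 * p00 + c2 * p01 + c3 * p02) / (1.0 - p00)
    return gamma, m, cost


# Assumed parameter values, taken from the numerical example of Section 4.
lam, mu, theta, p = 0.5, 1.0, 0.625, 0.8
a, b = roots(lam, mu, theta, p)
T = 0.964
print("P00 + P01 + P02 =", sum(state_probs(T, lam, a, b)))
print("gamma, M, C     =", measures(T, lam, a, b, c1=1.0, c2=1.0, c3=10.0))
```

As we read (15), $M(T)$ is the mean of a geometric count of tests that find the system in State 0, so the test that finally detects the fault is not included in it.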
3 OPTIMAL POLICY

The expected cost is, from (16),

$$C(T) = \frac{c_1[(a - \lambda)\exp(-bT) - (b - \lambda)\exp(-aT)] + c_2\lambda[\exp(-bT) - \exp(-aT)] + c_3[a(1 - \exp(-bT)) - b(1 - \exp(-aT))]}{(a - \lambda)(1 - \exp(-bT)) - (b - \lambda)(1 - \exp(-aT))} \tag{17}$$

We seek an optimal time $T^*$ which minimizes $C(T)$ in (17) for $c_3 > c_2$. It is evident that

$$C(0) \equiv \lim_{T \to 0} C(T) = \infty \tag{18}$$

$$C(\infty) \equiv \lim_{T \to \infty} C(T) = c_3 \tag{19}$$

Differentiating $C(T)$ with respect to $T$ and setting it equal to zero imply

$$L(T) = \frac{c_1}{\lambda(c_3 - c_2 + c_1)} \tag{20}$$

where

$$L(T) \equiv \frac{b(\exp(aT) - 1) - a(\exp(bT) - 1)}{ab(\exp(aT) - \exp(bT)) + \lambda(a - b)}$$
We easily have

$$L(0) \equiv \lim_{T \to 0} L(T) = 0, \qquad \lim_{T \to \infty} L(T) = 1/a$$

$$\begin{aligned} L'(T) &= D\{(a - b)\exp[(a + b)T] + (b - \lambda)\exp(bT) - (a - \lambda)\exp(aT)\} \\ &> D\{a[\exp[(a + b)T] - \exp(aT)] - b[\exp[(a + b)T] - \exp(bT)]\} \\ &= D\left\{ab\exp[(a + b)T]\int_0^T (\exp(-bt) - \exp(-at))\,dt\right\} > 0 \end{aligned}$$

since $a > b$, where

$$D \equiv \frac{ab(a - b)}{[ab(\exp(aT) - \exp(bT)) + \lambda(a - b)]^2}$$

Hence $L(T)$ increases strictly from 0 to $1/a$, and (20) has a solution exactly when its right-hand side is less than $1/a$. Thus, we have the following optimum policy:

(i) If $a/\lambda < (c_3 - c_2 + c_1)/c_1$ then there exists a finite and unique $T^*$ which satisfies (20).
(ii) If $a/\lambda \geq (c_3 - c_2 + c_1)/c_1$ then $T^* = \infty$, i.e. no test should be made, and the expected cost $C(\infty)$ is given in (19).
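Since $L(T)$ is strictly increasing, (20) can be solved by bisection. The sketch below is one possible implementation in Python, not the authors' procedure; the function names, the bracketing strategy and the tolerance are assumptions, and the parameter values are those of Section 4.

```python
import math


def roots(lam, mu, theta, p):
    """Return (a, b) as defined below (13)."""
    s = lam + mu + theta * p
    d = math.sqrt(s * s - 4.0 * lam * theta * p)
    return (s + d) / 2.0, (s - d) / 2.0


def L(T, lam, a, b):
    """Left-hand side of (20)."""
    num = b * math.expm1(a * T) - a * math.expm1(b * T)
    den = a * b * (math.exp(a * T) - math.exp(b * T)) + lam * (a - b)
    return num / den


def optimal_T(lam, a, b, c1, c2, c3, tol=1e-10):
    """Solve (20) for T*; returns math.inf in case (ii) of the policy.

    Assumes c3 > c2, as in the text. The doubling step may overflow
    if the target is extremely close to 1/a (not handled here).
    """
    target = c1 / (lam * (c3 - c2 + c1))
    if target >= 1.0 / a:          # case (ii): no finite solution
        return math.inf
    lo, hi = 0.0, 1.0
    while L(hi, lam, a, b) < target:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:               # plain bisection on the increasing L
        mid = 0.5 * (lo + hi)
        if L(mid, lam, a, b) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed parameter values, taken from the numerical example of Section 4.
lam, mu, theta, p = 0.5, 1.0, 0.625, 0.8
a, b = roots(lam, mu, theta, p)
print(optimal_T(lam, a, b, c1=1.0, c2=1.0, c3=10.0))  # approximately 0.964
print(optimal_T(lam, a, b, c1=1.0, c2=8.0, c3=10.0))  # inf: testing does not pay
```

Bisection is used here only because the monotonicity of $L(T)$ guarantees a single sign change; any bracketing root finder would serve equally well.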
4 NUMERICAL EXAMPLE
Suppose that $1/\mu = 1$ and $c_1 = 1$, i.e. all times are relative to the mean duration of a hidden fault and all costs are relative to the cost of one test. Table 1 gives the optimal times $T^*$ which minimize the expected cost $C(T)$ in (17), the resulting mean time $\gamma(T^*)$, and expected number of tests $M(T^*)$ until detection of a fault or system failure for $c_2 = 1, 2, 5$ and $c_3 = 10, 50, 100$ when $p = 0.8$, $1/\theta = 1.6$ and $1/\lambda = 2$. In this case, a finite $T^*$ exists uniquely if $(c_3 - c_2)/c_1 > 1 + \sqrt{3}$.

TABLE 1
Optimal Time T* to Minimize the Expected Cost C(T*), the Mean Time γ(T*) and the Expected Number of Tests M(T*)

            c3 = 10                   c3 = 50                   c3 = 100
c2     T*     γ(T*)   M(T*)      T*     γ(T*)   M(T*)      T*     γ(T*)   M(T*)
1     0.964  3.4584  2.6777     0.397  2.6081  5.6036     0.280  2.4275  7.6928
2     1.034  3.5576  2.5378     0.401  2.6143  5.5538     0.281  2.4291  7.6676
5     1.420  4.0718  2.0028     0.414  2.6344  5.3988     0.285  2.4352  7.5682

It is easily seen that $T^* M(T^*) < \gamma(T^*) < T^*[1 + M(T^*)]$; however, $\gamma(T^*)$ is almost the same as $T^*[1 + M(T^*)]$. For example, when $c_2 = 5$ and $c_3 = 10$, the optimal time $T^*$ is 1.42, $\gamma(T^*)$ is about 4, and $M(T^*)$ is about 2. If faults occur 6 times per day, i.e. $1/\mu = 2$ h and $1/\lambda = 4$ h, then the test should be done about every 3 h ($\approx 2 \times 1.42$) to minimize the expected cost. The mean time and the expected number of tests in such a case are about 8 h and 2, respectively.
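The worked example can also be cross-checked by simulating the model directly. The Monte Carlo sketch below is our own illustration in Python, not part of the original analysis: the function names, the number of runs and the seed are arbitrary assumptions, and only tests that find the system fault-free are counted, in line with (15).

```python
import random


def simulate_once(lam, mu, theta, p, T, rng):
    """Return (stopping time, number of tests that found no fault)."""
    t, state, next_test, passed = 0.0, 0, T, 0
    while True:
        rate = lam if state == 0 else mu + theta
        t_event = t + rng.expovariate(rate)
        # Handle every scheduled test that falls before the next event.
        while next_test <= t_event:
            if state == 1:
                return next_test, passed   # hidden fault detected by the test
            passed += 1
            next_test += T
        t = t_event
        if state == 0:
            state = 1                      # a hidden fault occurs
        elif rng.random() < mu / (mu + theta):
            state = 0                      # the system recovers from the fault
        elif rng.random() < p:
            return t, passed               # a demand arrives and the system fails
        # otherwise the demand is served and the fault stays hidden


def estimate(lam, mu, theta, p, T, runs=40000, seed=1):
    """Average stopping time and passed-test count over independent runs."""
    rng = random.Random(seed)
    total_time = 0.0
    total_tests = 0
    for _ in range(runs):
        stop, passed = simulate_once(lam, mu, theta, p, T, rng)
        total_time += stop
        total_tests += passed
    return total_time / runs, total_tests / runs


# Parameters of the example: c2 = 5, c3 = 10, T* = 1.42 (times in units of 1/mu).
mean_time, mean_tests = estimate(lam=0.5, mu=1.0, theta=0.625, p=0.8, T=1.42)
hours_per_unit = 2.0   # 1/mu = 2 h, as in the example
print("mean time (h):", mean_time * hours_per_unit)
print("mean number of tests:", mean_tests)
```

Up to Monte Carlo noise, the two printed values should land near the 8 h and 2 tests quoted above.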
REFERENCES

1. Cox, G. W. & Carroll, B. D., Reliability modeling and analysis of fault-tolerant memories. IEEE Trans. Reliab., R-27 (1978) 49-54.
2. Rao, T. R. N., Use of error correcting codes on memory words for improved reliability. IEEE Trans. Reliab., R-17 (1968) 91-6.
3. Castillo, X. & Siewiorek, D. P., A performance-reliability model for computing systems. In 10th Int. Symp. Fault-Tolerant Computing, Computer Society Press, Washington DC, 1980, pp. 187-92.
4. Nakagawa, T., Nishi, K. & Yasui, K., Optimum preventive maintenance policies for a computer system with restart. IEEE Trans. Reliab., R-33 (1984) 272-6.
5. Gertsbakh, I. B., Models of Preventive Maintenance. North-Holland, New York, 1977.
6. Shin, K. G. & Lee, Y. H., Measurement and application of fault latency. IEEE Trans. Computers, C-35 (1986) 370-5.
7. Malaiya, Y. K. & Su, S. Y. H., Reliability measures for hardware redundancy fault-tolerant digital systems with intermittent faults. IEEE Trans. Computers, C-30 (1981) 600-4.
8. Pyke, R., Markov renewal processes: Definitions and preliminary properties. Ann. Math. Statist., 32 (1961) 1231-42.
9. Cox, D. R., Renewal Theory. Methuen, London, 1962.