Microelectron. Reliab., Vol. 35, No. 2, pp. 309-318, 1995
Pergamon
Copyright © 1994 Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0026-2714/95 $9.50+.00

TECHNICAL NOTE

REPAIRABLE SINGLE SERVER SYSTEMS WITH MULTIPLE BREAKDOWN MODES

YI-CHIH HSIEH¹ AND MARK S. ANDERSLAND²
¹Department of Industrial Engineering and ²Department of Electrical and Computer Engineering, The University of Iowa, Iowa City, IA 52242
(Received for publication 30 June 1994)
Abstract. Typically, the availability, steady-state queue length distribution, and mean queue length of Markov queueing systems subject to random breakdowns are computed by generating function or matrix geometric numerical methods. In this paper we point out that, for single server systems, a simple partition balance approach is easier. We illustrate this observation by deriving expressions for the availability, steady-state queue length distribution, mean queue length, and server utilization of a single server system subject to multi-mode, bi-level, Poisson distributed breakdowns of exponentially distributed duration. A numerical example illustrating some of the relations between these measures is also given. Our setup provides a simple, computationally tractable Markov model for systems in which breakdowns of different types occur and are repaired at rates dependent on the type and severity of the breakdown.
1. INTRODUCTION

During the past three decades the behavior of Markov service systems subject to Poisson distributed breakdowns of exponentially distributed duration has been studied extensively (see Mitrany and Avi-Itzhak [3] and Neuts and Lucantoni [4] for references). The importance of these studies derives from the need for simple, intuition-rich, quantitative models of the effects of server breakdown on the service dynamics of practical systems. Typically, the system availability, steady-state queue length distribution, and mean queue length are computed by generating function (Mitrany and Avi-Itzhak [3]) or matrix geometric numerical methods (Neuts and Lucantoni [4]). In this paper we point out that for single server Markov models it is easier to use a partition balance approach that exploits the fact that the steady-state rates of transitions into and out of any set of states in a stationary Markov chain must balance (see Kelly [2] or Pagès and Gondran [5]).

To illustrate the ease of the partition balance approach we introduce and analyze a new model, a repairable single server system in which the server is subject to a variety of Poisson distributed breakdowns of both minor and major severity. This setup provides a simple Markov model for situations in which breakdowns of different types occur and are repaired at rates dependent on the type and severity of the breakdown. The setup could model, for instance, computer systems in which the correction of distinct network, disk, or CPU problems may require a minor system reboot or major hardware replacement (Ibe and Wein [1]); or industrial processes in which each breakdown mode has both minor and major causes, e.g., a fuse or a transformer failure may cause power loss, overheating or winding shorts may cause motor loss, dullness or breakage may cause tool loss, etc. The model is simple, but it is non-trivial to analyze by generating function or matrix geometric numerical methods.

Our system model and notation are described in detail in Section 2. In Section 3 we use the partition balance approach to develop expressions for key performance measures, including the system availability, steady-state queue length distribution, mean queue length, and server utilization. A numerical example illustrating the relations between these measures is provided in Section 4. Section 5 contains our conclusions.

2. MODEL AND NOTATION

The system of interest consists of a single server with an infinitely large buffer that services a Poisson stream of jobs of exponentially distributed duration on a first-come-first-served basis. The server is subject to a finite number of Poisson distributed breakdowns of different rates and severities. The severity of a given breakdown may be either minor or major and is the outcome of a Bernoulli trial. Breakdowns halt all service and the occurrence of additional breakdowns, but jobs may continue to arrive. Jobs in process are interrupted and placed at the head of the queue. The breakdown diagnosis is instantaneous and perfect. Server repair starts immediately thereafter, and is of exponentially distributed duration.
The repair rate depends on the severity of the breakdown. The outcome of minor repair is a Bernoulli trial that either returns the server to operation or identifies additional problems requiring major repair. Major repair always returns the server to operation. Assuming that all relevant random variables are independent, this system can be modeled by an irreducible Markov chain with the transition-rate diagram shown in Figure 1. The notation used in the diagram and throughout the paper is listed below.

$N$
Number of failure modes.

$X$
$:= \{x := (i,j) : i = 0,1,2,\ldots,\ j \in C := \{0, m(1), M(1), \ldots, m(N), M(N)\}\}$, the set of system states. Here $i$ indexes the number of jobs in the system, and $j$ the current server condition. The possible conditions are: operational "0", minor mode $k$ breakdown "$m(k)$", and major mode $k$ breakdown "$M(k)$".

$\pi(i,j)$
$:= \lim_{t \to \infty} P\{x(t) = (i,j)\}$, the steady-state probability that there are $i$ jobs in the system and that the current server condition is $j$.

$\pi(i,\cdot)$
$:= \sum_{j \in C} \pi(i,j)$, the steady-state probability that there are $i$ jobs in the system.

$\pi(\cdot,j)$
$:= \sum_{i=0}^{\infty} \pi(i,j)$, the steady-state probability that the server condition is $j$.

$A(\mathrm{sys})$
The steady-state system availability, i.e., $\pi(\cdot,0)$.

$E[i]$
$:= \sum_{i=0}^{\infty} \sum_{j \in C} i\,\pi(i,j)$, the mean queue length of the system.

$SU$
Server utilization, i.e., the steady-state probability that the server is busy given that the system is up.

$p_k$
The probability that a mode $k$ breakdown is minor.

$q_k$
$:= 1 - p_k$, the probability that a mode $k$ breakdown is major.

$r_k$
The probability that a minor mode $k$ breakdown does not become a major mode $k$ breakdown.

$s_k$
$:= 1 - r_k$, the probability that a minor mode $k$ breakdown becomes a major mode $k$ breakdown.

$\alpha_k$
The rate at which mode $k$ breakdowns occur.

$\beta_k$
The rate at which minor mode $k$ breakdowns are repaired.

$\gamma_k$
The rate at which major mode $k$ breakdowns are repaired.

$\lambda$
The rate at which jobs arrive.

$\mu$
The rate at which jobs are served.

[Figure 1: The transition rate diagram for the system, where $a = \lambda$, $b = \mu$, $c(k) = p_k\alpha_k$, $d(k) = r_k\beta_k$, $e(k) = s_k\beta_k$, $f(k) = \gamma_k$, $g(k) = q_k\alpha_k$.]

3. DERIVATION
Our analysis is based on the well-known partition balance lemma (see Kelly [2] or Pagès and Gondran [5]).

Lemma: For stationary Markov chains, the steady-state rates of transitions into and out of any set of states must balance, i.e., for any $S \subset X$ and $S^c := X - S$,

$$\sum_{u \in S} \sum_{v \in S^c} P(u)\,Q_{u,v} = \sum_{v \in S^c} \sum_{u \in S} P(v)\,Q_{v,u},$$

where $P(u)$ is the steady-state probability of state $u$, and $Q_{u,v}$ is the transition rate from state $u$ to state $v$.
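The lemma can be checked numerically on a small example. The following Python sketch is illustrative only: the four-state generator `Q`, the helper names `stationary` and `cut_flows`, and the use of power iteration on a uniformized chain are assumptions made for this example and are not part of the model analyzed in this note. It compares the probability flow out of every proper subset $S$ of states with the flow into it.

```python
from itertools import combinations


def stationary(Q, iters=5000):
    """Approximate the stationary distribution of a small CTMC.

    Uses power iteration on the uniformized chain P = I + Q / c,
    which has the same stationary distribution as the generator Q.
    """
    n = len(Q)
    c = 2.0 * max(-Q[u][u] for u in range(n))  # c > max exit rate keeps P aperiodic
    P = [[(1.0 if u == v else 0.0) + Q[u][v] / c for v in range(n)]
         for u in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]
    return pi


def cut_flows(Q, pi, S):
    """Return (flow out of S, flow into S) for a subset S of states."""
    S = set(S)
    Sc = [v for v in range(len(Q)) if v not in S]
    flow_out = sum(pi[u] * Q[u][v] for u in S for v in Sc)
    flow_in = sum(pi[v] * Q[v][u] for v in Sc for u in S)
    return flow_out, flow_in


# Hypothetical irreducible generator: off-diagonal entries are transition
# rates and each row sums to zero.
Q = [[-3.0, 2.0, 1.0, 0.0],
     [1.0, -4.0, 0.5, 2.5],
     [0.0, 3.0, -3.5, 0.5],
     [2.0, 0.0, 1.0, -3.0]]

pi = stationary(Q)
max_gap = 0.0
for size in range(1, len(Q)):
    for S in combinations(range(len(Q)), size):
        flow_out, flow_in = cut_flows(Q, pi, S)
        max_gap = max(max_gap, abs(flow_out - flow_in))
print("largest imbalance over all cuts:", max_gap)
```

Up to floating-point error the imbalance should vanish for every cut, which is the property exploited in the derivations of this section.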
Condition for steady-state

Our first task is to identify conditions ensuring the existence of a steady-state distribution. Suppose, for the moment, that this distribution exists, and consider the sequence of partitions generated by the sets

$$A_m := \{(i,j) : i \le m\}, \quad m = 0,1,2,\ldots,$$

i.e., the partitions generated by all possible horizontal cuts of the transition-rate diagram in Figure 1. Applying the partition balance lemma we obtain

$$\lambda\,\pi(i,\cdot) = \mu\,\pi(i+1,0), \quad i = 0,1,\ldots \tag{1}$$

Summing (1) over all $i$ we find that

$$\lambda \sum_{i=0}^{\infty} \pi(i,\cdot) = \mu \sum_{i=1}^{\infty} \pi(i,0),$$

or equivalently, that

$$\lambda = \mu\,[A(\mathrm{sys}) - \pi(0,0)]. \tag{2}$$

Noting that the irreducibility of our Markov chain ensures that the limiting distribution $\lim_{t \to \infty} P\{x(t) = (i,j)\}$ is either positive or 0 for all states, we can deduce from (2) that a steady-state exists if and only if

$$\frac{\lambda}{\mu A(\mathrm{sys})} < 1. \tag{3}$$

Condition (3) says that the average service capacity $\mu A(\mathrm{sys})$ must exceed the average arrival rate $\lambda$ if a steady-state is to exist. Henceforth we will assume that (3) holds.
System availability

Next we solve for the system availability $A(\mathrm{sys}) = \pi(\cdot,0)$. Consider the sequence of partitions generated by the sets

$$B_k := \{(i,j) : i \ge 0 \text{ and } j = m(k)\}, \quad k = 1,2,\ldots,N,$$

and

$$C_k := \{(i,j) : i \ge 0 \text{ and } j = M(k)\}, \quad k = 1,2,\ldots,N,$$

i.e., the partitions generated by vertical cylinders centered about, respectively, the $N$ minor mode and $N$ major mode breakdown states in Figure 1. Applying the partition balance lemma we obtain, from the $B_k$ partitions,

$$\pi(\cdot,m(k)) = (\alpha_k/\beta_k)\,p_k\,\pi(\cdot,0), \quad k = 1,2,\ldots,N, \tag{4}$$

and from the $C_k$ partitions,

$$\gamma_k\,\pi(\cdot,M(k)) = \beta_k s_k\,\pi(\cdot,m(k)) + \alpha_k q_k\,\pi(\cdot,0), \quad k = 1,2,\ldots,N. \tag{5}$$

Substitution of (4) in (5) yields

$$\pi(\cdot,M(k)) = (\alpha_k/\gamma_k)(p_k s_k + q_k)\,\pi(\cdot,0), \quad k = 1,2,\ldots,N. \tag{6}$$

From (4), (6), and the fact that

$$\pi(\cdot,0) + \sum_{k=1}^{N} \left[\pi(\cdot,m(k)) + \pi(\cdot,M(k))\right] = 1,$$

it follows that

$$A(\mathrm{sys}) = \pi(\cdot,0) = \left[1 + \sum_{k=1}^{N} \left(\frac{\alpha_k}{\gamma_k}(p_k s_k + q_k) + \frac{\alpha_k p_k}{\beta_k}\right)\right]^{-1}. \tag{7}$$
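Equation (7) and condition (3) are straightforward to evaluate. The Python sketch below is a minimal illustration: the function names `availability` and `is_stable` and the two-mode parameter values are hypothetical choices made for this example, not values taken from this note.

```python
def availability(alpha, beta, gamma, p, s):
    """Steady-state availability A(sys) from equation (7).

    alpha[k]: breakdown rate of mode k
    beta[k], gamma[k]: minor and major repair rates of mode k
    p[k]: probability that a mode k breakdown is minor
    s[k]: probability that a minor mode k breakdown becomes major
    """
    total = 1.0
    for a_k, b_k, g_k, p_k, s_k in zip(alpha, beta, gamma, p, s):
        q_k = 1.0 - p_k
        total += a_k * ((p_k * s_k + q_k) / g_k + p_k / b_k)
    return 1.0 / total


def is_stable(lam, mu, a_sys):
    """Condition (3): lambda / (mu * A(sys)) < 1."""
    return lam / (mu * a_sys) < 1.0


# Hypothetical two-mode example (values chosen only for illustration).
alpha = [0.1, 0.05]
beta = [2.0, 4.0]
gamma = [1.0, 1.0]
p = [0.6, 0.3]
s = [0.5, 0.75]

a_sys = availability(alpha, beta, gamma, p, s)
print("A(sys) =", a_sys)
print("steady state exists for lambda=1, mu=2:", is_stable(1.0, 2.0, a_sys))
```

Note that the availability depends only on the breakdown and repair parameters, not on the arrival rate $\lambda$ or the service rate $\mu$; these enter only through the stability condition (3).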
Steady-state distribution

To compute the system's steady-state probabilities, we write the balance equations for the states $(0,m(k))$ and $(0,M(k))$, $k = 1,2,\ldots,N$, and the states $(i,m(k))$ and $(i,M(k))$, $i \ge 1$, $k = 1,2,\ldots,N$. Simple rearrangement of these equations and (1) produces, for $k = 1,2,\ldots,N$,

$$\pi(0,m(k)) = \frac{\alpha_k p_k}{\beta_k + \lambda}\,\pi(0,0), \tag{8a}$$

$$\pi(0,M(k)) = \frac{\beta_k s_k}{\gamma_k + \lambda}\,\pi(0,m(k)) + \frac{\alpha_k q_k}{\gamma_k + \lambda}\,\pi(0,0), \tag{8b}$$

$$\pi(i,0) = \frac{\lambda}{\mu}\,\pi(i-1,\cdot), \quad i \ge 1, \tag{8c}$$

$$\pi(i,m(k)) = \frac{1}{\beta_k + \lambda}\left\{\lambda\,\pi(i-1,m(k)) + \alpha_k p_k\,\pi(i,0)\right\}, \quad i \ge 1, \tag{8d}$$

$$\pi(i,M(k)) = \frac{1}{\gamma_k + \lambda}\left\{\lambda\,\pi(i-1,M(k)) + \beta_k s_k\,\pi(i,m(k)) + \alpha_k q_k\,\pi(i,0)\right\}, \quad i \ge 1. \tag{8e}$$

Note that for all $i \ge 1$ and $k = 1,2,\ldots,N$, (a) $\pi(i,m(k))$ can be expressed in terms of $\pi(i,0)$ and $\pi(i-1,m(k))$, and (b) $\pi(i,M(k))$ can be expressed in terms of $\pi(i,0)$, $\pi(i,m(k))$, and $\pi(i-1,M(k))$. Moreover, for $i = 0$ and $k = 1,2,\ldots,N$, (c) $\pi(i,m(k))$ and $\pi(i,M(k))$ can be expressed in terms of $\pi(0,0)$.

Thus the entire distribution can be expressed in terms of $\pi(0,0)$. But from (2) and (7)

$$\pi(0,0) = \left[1 + \sum_{k=1}^{N} \left(\frac{\alpha_k}{\gamma_k}(p_k s_k + q_k) + \frac{\alpha_k p_k}{\beta_k}\right)\right]^{-1} - \frac{\lambda}{\mu}. \tag{9}$$
Hence we can recursively compute the steady-state distribution using the procedure listed below:

PROCEDURE: (steady-state distribution)
  Tol := tolerance; i := 0; $\pi(0,\cdot)$ := 0; term := 0
  Compute $\pi(0,0)$ by (9).
  While $\sum_{n=0}^{\mathrm{term}} \pi(n,\cdot) < 1 - \mathrm{Tol}$ do
  begin
    Compute $\pi(i,m(k))$, $k = 1,2,\ldots,N$, by (8a) when $i = 0$, by (8d) when $i \ge 1$.
    Compute $\pi(i,M(k))$, $k = 1,2,\ldots,N$, by (8b) when $i = 0$, by (8e) when $i \ge 1$.
    Compute $\pi(i,\cdot) = \sum_{j \in C} \pi(i,j)$.
    Compute $\pi(i+1,0)$ by (8c).
    term := i; i := i + 1
  end (while)
END

Mean queue length

To derive an expression for $E[i]$, the mean queue length of the system, recall that

$$E[i] = \sum_{c \in C} E[i \mid j = c]\,\pi(\cdot,c),$$

where

$$E[i \mid j = c] := \frac{1}{\pi(\cdot,c)} \sum_{i=0}^{\infty} i\,\pi(i,c)$$

denotes the conditional mean queue length. Multiplying (1) by $i$ and summing over $i$, we obtain

$$\lambda \sum_{i=0}^{\infty} i\,\pi(i,\cdot) = \mu \sum_{i=0}^{\infty} i\,\pi(i+1,0),$$

or, equivalently,

$$\lambda E[i] = \mu\left[\sum_{i=1}^{\infty} i\,\pi(i,0) - \sum_{i=1}^{\infty} \pi(i,0)\right] = \mu\,\pi(\cdot,0)\,E[i \mid j = 0] - \lambda,$$

which reduces to

$$\pi(\cdot,0)\,E[i \mid j = 0] = \frac{\lambda}{\mu}\,(E[i] + 1). \tag{10}$$
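Similarly, multiplying (8d) and (8e) by $i(\beta_k + \lambda)$ and $i(\gamma_k + \lambda)$, respectively, and summing over all $i \ge 1$, we obtain

$$\beta_k\,\pi(\cdot,m(k))\,E[i \mid j = m(k)] = \lambda\,\pi(\cdot,m(k)) + \alpha_k p_k\,\pi(\cdot,0)\,E[i \mid j = 0]$$

and

$$\gamma_k\,\pi(\cdot,M(k))\,E[i \mid j = M(k)] = \alpha_k (p_k s_k + q_k)\,\pi(\cdot,0)\,E[i \mid j = 0] + \lambda s_k\,\pi(\cdot,m(k)) + \lambda\,\pi(\cdot,M(k)),$$

or, equivalently,

$$\pi(\cdot,m(k))\,E[i \mid j = m(k)] = \frac{\lambda}{\beta_k}\,\pi(\cdot,m(k)) + \frac{\alpha_k p_k}{\beta_k}\,\pi(\cdot,0)\,E[i \mid j = 0] \tag{11}$$

and

$$\pi(\cdot,M(k))\,E[i \mid j = M(k)] = \frac{\alpha_k (p_k s_k + q_k)}{\gamma_k}\,\pi(\cdot,0)\,E[i \mid j = 0] + \frac{\lambda s_k}{\gamma_k}\,\pi(\cdot,m(k)) + \frac{\lambda}{\gamma_k}\,\pi(\cdot,M(k)). \tag{12}$$

Adding (10), (11), and (12), summing over all $k$, and using (10) to eliminate $\pi(\cdot,0)\,E[i \mid j = 0]$ yields

$$E[i] = \frac{\lambda}{\mu}(E[i] + 1) + \sum_{k=1}^{N} \frac{\lambda}{\mu}(E[i] + 1)\,\alpha_k \left[\frac{p_k}{\beta_k} + \frac{p_k s_k + q_k}{\gamma_k}\right] + \lambda \sum_{k=1}^{N} \left[\left(\frac{1}{\beta_k} + \frac{s_k}{\gamma_k}\right)\pi(\cdot,m(k)) + \frac{1}{\gamma_k}\,\pi(\cdot,M(k))\right]. \tag{13}$$

Solving for $E[i]$, we obtain

$$E[i] = \frac{U + V}{1 - U}, \tag{14}$$

where

$$U = \frac{\lambda}{\mu}\left[1 + \sum_{k=1}^{N} \alpha_k \left(\frac{p_k}{\beta_k} + \frac{p_k s_k + q_k}{\gamma_k}\right)\right], \qquad V = \lambda \sum_{k=1}^{N} \left[\left(\frac{1}{\beta_k} + \frac{s_k}{\gamma_k}\right)\pi(\cdot,m(k)) + \frac{1}{\gamma_k}\,\pi(\cdot,M(k))\right].$$

By (4), (6), and (7), $U$ and $V$ can be further reduced to

$$U = \frac{\lambda}{\mu A(\mathrm{sys})} \quad \text{and} \quad V = \lambda A(\mathrm{sys}) \sum_{k=1}^{N} \left[\left(\frac{1}{\beta_k} + \frac{s_k}{\gamma_k}\right)\frac{\alpha_k p_k}{\beta_k} + \frac{\alpha_k (p_k s_k + q_k)}{\gamma_k^2}\right]. \tag{15}$$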
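As a quick illustration of how (7), (14), and (15) combine, the following Python sketch evaluates the mean queue length directly from the model parameters. The function name `mean_queue_length` and the sample parameter values are assumptions made for this example only.

```python
def mean_queue_length(lam, mu, alpha, beta, gamma, p, s):
    """Mean queue length E[i] from equations (7), (14) and (15)."""
    ratio_sum = 0.0  # sum over k of [pi(.,m(k)) + pi(.,M(k))] / pi(.,0)
    v_sum = 0.0      # bracketed sum in (15)
    for a_k, b_k, g_k, p_k, s_k in zip(alpha, beta, gamma, p, s):
        q_k = 1.0 - p_k
        minor = a_k * p_k / b_k                # pi(.,m(k)) / pi(.,0), by (4)
        major = a_k * (p_k * s_k + q_k) / g_k  # pi(.,M(k)) / pi(.,0), by (6)
        ratio_sum += minor + major
        v_sum += (1.0 / b_k + s_k / g_k) * minor + major / g_k
    a_sys = 1.0 / (1.0 + ratio_sum)            # equation (7)
    u = lam / (mu * a_sys)                     # U in (15)
    if u >= 1.0:
        raise ValueError("condition (3) fails: no steady state exists")
    v = lam * a_sys * v_sum                    # V in (15)
    return (u + v) / (1.0 - u)                 # equation (14)


# With no breakdowns (alpha = 0) the result is the M/M/1 value lam / (mu - lam).
failure_free = mean_queue_length(1.0, 2.0, [0.0], [1.0], [1.0], [0.5], [0.5])
print("failure-free E[i]:", failure_free)

# Hypothetical two-mode system (values chosen only for illustration).
with_breakdowns = mean_queue_length(1.0, 2.0, [0.1, 0.05], [2.0, 4.0],
                                    [1.0, 1.0], [0.6, 0.3], [0.5, 0.75])
print("E[i] with breakdowns:", with_breakdowns)
```

The sketch raises an error when condition (3) is violated, since (14) is only meaningful when $U < 1$.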
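Note that

1. Failure-free and overload single server behaviors are observed when $A(\mathrm{sys})$ approaches its limits, i.e.,

$$E[i] \to \frac{\lambda}{\mu - \lambda} \ \text{ as } A(\mathrm{sys}) \to 1 \quad \text{and} \quad E[i] \to \infty \ \text{ as } A(\mathrm{sys}) \to \frac{\lambda}{\mu}.$$

2. By (7) and (15), when the effective repair rates for all minor and major breakdowns are equal, i.e., when $\beta_k r_k = \gamma_k = \xi$ for all $k = 1,2,\ldots,N$,

$$E[i] = \frac{\lambda + \lambda\mu\,(1/\xi)\,A(\mathrm{sys})\,(1 - A(\mathrm{sys}))}{\mu A(\mathrm{sys}) - \lambda}. \tag{16}$$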
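The recursive procedure for the steady-state distribution and the closed form (16) can be cross-checked against each other. The Python sketch below is one possible rendering of the procedure based on (8a)-(8e) and (9); the function name `steady_state_levels`, the iteration cap `max_levels` (a safeguard not present in the procedure), and the parameter values are assumptions made for this example. The parameters are chosen so that $\beta_k r_k = \gamma_k = \xi$, as required by (16).

```python
def steady_state_levels(lam, mu, alpha, beta, gamma, p, s,
                        tol=1e-10, max_levels=100000):
    """Return ([pi(0,.), pi(1,.), ...], A(sys)) via (8a)-(8e) and (9)."""
    n = len(alpha)
    q = [1.0 - p_k for p_k in p]
    a_sys = 1.0 / (1.0 + sum(
        alpha[k] * ((p[k] * s[k] + q[k]) / gamma[k] + p[k] / beta[k])
        for k in range(n)))                          # equation (7)
    if lam >= mu * a_sys:
        raise ValueError("condition (3) fails: no steady state exists")
    up = a_sys - lam / mu                            # pi(0,0), equation (9)
    minor_prev = [0.0] * n                           # pi(i-1, m(k)); zero before i = 0
    major_prev = [0.0] * n                           # pi(i-1, M(k)); zero before i = 0
    levels, total, i = [], 0.0, 0
    while total < 1.0 - tol and i < max_levels:
        # With zero "previous" values, (8d) and (8e) reduce to (8a) and (8b).
        minor = [(lam * minor_prev[k] + alpha[k] * p[k] * up)
                 / (beta[k] + lam) for k in range(n)]
        major = [(lam * major_prev[k] + beta[k] * s[k] * minor[k]
                  + alpha[k] * q[k] * up) / (gamma[k] + lam) for k in range(n)]
        level = up + sum(minor) + sum(major)         # pi(i,.)
        levels.append(level)
        total += level
        up = (lam / mu) * level                      # pi(i+1,0), equation (8c)
        minor_prev, major_prev = minor, major
        i += 1
    return levels, a_sys


# Hypothetical parameters with beta_k * r_k = gamma_k = xi for every mode.
lam, mu, xi = 1.0, 2.0, 1.0
alpha = [0.1, 0.05]
p = [0.6, 0.3]
s = [0.5, 0.75]
gamma = [xi, xi]
beta = [xi / (1.0 - s_k) for s_k in s]

levels, a_sys = steady_state_levels(lam, mu, alpha, beta, gamma, p, s)
mean_from_recursion = sum(i * prob for i, prob in enumerate(levels))
mean_from_16 = ((lam + lam * mu * (1.0 / xi) * a_sys * (1.0 - a_sys))
                / (mu * a_sys - lam))
print("E[i] from recursion:", mean_from_recursion)
print("E[i] from (16):     ", mean_from_16)
```

The two values should agree up to the truncation error controlled by `tol`.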
Server utilization

Server utilization, $SU$, measures the steady-state probability that the server is busy given that the system is up. From (2) we have

$$SU = 1 - P(0 \text{ jobs} \mid \text{system is up}) = 1 - \frac{\pi(0,0)}{A(\mathrm{sys})} = \frac{\lambda}{\mu A(\mathrm{sys})}. \tag{17}$$
4. NUMERICAL EXAMPLE

Mean queue length $E[i]$ and server utilization $SU$ are two important measures of a system's steady-state performance. To illustrate how these measures are affected by changes in system availability $A(\mathrm{sys})$, suppose, for simplicity, that the arrival rate $\lambda$ and the effective major and minor failure repair rates $\gamma_k$ and $\beta_k r_k$, $k = 1,2,\ldots,N$, are set to one. Plotting, using (16), $E[i]$ versus $A(\mathrm{sys})$ for service rates $\mu$ ranging from 1.2 to 2.0 we obtain Figure 2. Note that increasing the system availability decreases the mean queue length. This effect is more pronounced when the service rate is small. Plotting, using (17), $SU$ versus $A(\mathrm{sys})$ for the same range of service rates we obtain Figure 3. We see that increasing the system availability reduces the server utilization. The effect is similar for all service rates in the range.

[Figure 2. Mean queue length vs. availability.]

[Figure 3. Server utilization vs. availability.]

Curves such as those plotted in Figures 2 and 3 can be used to evaluate performance/availability tradeoffs. Suppose, for instance, that for fixed arrival and repair rates, we wish to compare the performance/availability tradeoffs for machines with the following characteristics:

..................................................................................................
                      Machine I        Machine II       Machine III
..................................................................................................
range for A(sys)      [0.65,0.80]      [0.70,0.76]      [0.84,0.85]
μ                     2.0              1.6              1.2
..................................................................................................
From Figures 2 and 3 we find that the availability intervals of those machines induce the following mean queue length and server utilization intervals:

..................................................................................................
                      Machine I        Machine II       Machine III
..................................................................................................
interval for E[i]     [2.20,4.85]      [5.98,11.13]     [57.65,145.16]
interval for SU       [0.59,0.63]      [0.82,0.89]      [0.98,0.99]
..................................................................................................
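The intervals in such a comparison can also be computed directly from (16) and (17) instead of being read from the figures. The Python sketch below assumes, as in this section, that $\lambda = 1$ and that all effective repair rates equal one ($\xi = 1$); the function and variable names are hypothetical. Values recomputed this way may differ somewhat from intervals read off a plot.

```python
def mean_queue_length_equal_rates(lam, mu, xi, a_sys):
    """E[i] from equation (16); requires mu * a_sys > lam (condition (3))."""
    return ((lam + lam * mu * (1.0 / xi) * a_sys * (1.0 - a_sys))
            / (mu * a_sys - lam))


def server_utilization(lam, mu, a_sys):
    """SU from equation (17)."""
    return lam / (mu * a_sys)


lam, xi = 1.0, 1.0
# name: (lower availability bound, upper availability bound, service rate mu)
machines = {
    "I": (0.65, 0.80, 2.0),
    "II": (0.70, 0.76, 1.6),
    "III": (0.84, 0.85, 1.2),
}

results = {}
for name, (a_lo, a_hi, mu) in machines.items():
    # Both measures decrease as availability increases, so the upper
    # availability bound gives the lower end of each interval.
    e_interval = (mean_queue_length_equal_rates(lam, mu, xi, a_hi),
                  mean_queue_length_equal_rates(lam, mu, xi, a_lo))
    su_interval = (server_utilization(lam, mu, a_hi),
                   server_utilization(lam, mu, a_lo))
    results[name] = (e_interval, su_interval)
    print(f"Machine {name}: "
          f"E[i] in [{e_interval[0]:.2f}, {e_interval[1]:.2f}], "
          f"SU in [{su_interval[0]:.2f}, {su_interval[1]:.2f}]")
```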
Hence if we wish to minimize $E[i]$ subject to $SU > 0.8$, Machine II is best. Whereas if we wish to maximize $SU$ subject to $E[i]$
5. CONCLUSIONS

In this paper we pointed out that single server Markov systems subject to Poisson distributed breakdowns of exponentially distributed duration can be more easily analyzed by a simple partition balance approach than by generating function or matrix geometric numerical techniques. We illustrated this observation by deriving expressions for the availability, steady-state queue length distribution, mean queue length, and server utilization of a single server system subject to multi-mode, bi-level, Poisson distributed breakdowns of exponentially distributed duration. The relations between these measures were explored in a numerical example. Our setup provides a simple Markov model for systems in which breakdowns of different types occur and are repaired at rates dependent on the type and severity of the breakdown. Since the partition balance lemma holds for arbitrary Markov systems, the results can be extended to systems with arbitrary phase distributions.
REFERENCES

1. O.C. Ibe and A.S. Wein, Availability of systems with partially observable failure, IEEE Transactions on Reliability, 41, 92-96 (1992).
2. F.P. Kelly, Reversibility and Stochastic Networks, John Wiley & Sons, NY (1979).
3. I.L. Mitrany and B. Avi-Itzhak, A many-server queue with service interruptions, Operations Research, 16, 628-638 (1968).
4. M.F. Neuts and D.M. Lucantoni, A Markovian queue with N servers subject to breakdowns, Management Science, 25, 849-861 (1979).
5. A. Pagès and M. Gondran, Systems Reliability: Evaluation and Prediction in Engineering, North Oxford Academic (1986).