Stiffness-tolerant methods for transient analysis of stiff Markov chains

Stiffness-tolerant methods for transient analysis of stiff Markov chains

Microelectron. Reliab., Vol. 34, No. 11, pp. 1825 1841. 1994 Copyright !~ 1994ElsevierScience Ltd Printed in Great Britain.All rights reserved 0026 27...

1MB Sizes 9 Downloads 69 Views

Microelectron. Reliab., Vol. 34, No. 11, pp. 1825 1841. 1994 Copyright !~ 1994ElsevierScience Ltd Printed in Great Britain.All rights reserved 0026 2714/94 $7.00+.00

Pergamon

STIFFNESS-TOLERANT METHODS FOR TRANSIENT ANALYSIS OF STIFF MARKOV CHAINS MANISH MALHOTRA Room 2K-312, AT&T Bell Labs, Holmdel, NJ 07733, U.S.A. JOGESH K. MUPPALA Department of Computer Science, The Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong and KISHOR S. TRIVED! Department of Electrical Engineering, Duke University, P.O. Box 90291, Durham, NC 27708-0291, U.S.A. (Received.for publication 18 March 1994) Abstract--Three methods for numerical transient analysis of Markov chains, the modified Jensen's method (Jensen's method with steady-state detection of the underlying DTMC and computation of Poisson probabilities using the method of Fox and Glynn [1]), a third-order L-stable implicit Runge-Kutta method, and a second-order L-stable method, TR-BDF2, are compared. These methods are evaluated on the basis of their performance (accuracy of the solution and computational cost) on stiff Markov chains. Steady-state detection in Jensen's method results in large savings of computation time for Markov chains when mission time extends beyond the steady-state point. For stiff models, computation of Poisson probabilities using traditional methods runs into underflow problems. Fox and Glynn's method for computing Poisson probabilities avoids underflow problems for all practical problems and yields highly accurate solutions. We conclude that for mildly stiff Markov chains, the modified Jensen's method is the method of choice. For stiff Markov chains, we recommend the use of the L-stable ODE methods. If low accuracy (upto eight decimal places) is acceptable, then TR-BDF2 method should be used. If higher accuracy is desired, then we recommend third-order implicit Runge-Kutta method.

1. INTRODUCTION Analytic models of computer and communication systems are evaluated on the basis of performance 1-2] or dependability 1-3] measures. In practice, models of complex systems give rise to large Markov chains. System modelers are often interested in transient measures which provide more useful information than steady-state measures. As models grow in size, closed-form solutions of transient measures become infeasible. Thus, practically the modeler has to always rely on numerical methods. The numerical transient analysis of Markov chains is plagued with several computational problems: largeness and stiffness of the Markov chain, and possible inaccuracy of the solution. This paper addresses these three key issues of numerical transient analysis of finite-state Markov chains: largeness, stiffness and accuracy. Numerical methods based on Jensen's method and ordinary differential equations (ODE) solution have been implemented and analyzed to obtain transient solution of continuous-time Markov chains (CTMCs). Reibman et al. 1-4] provide an overview of numerical approaches for Markov and Markov reward models. Jensen 1-5] originally proposed a method (also

known as randomization or uniformization) which involves computation of Poisson probabilities. Grassman [2], Keilson [6] and Gross and Miller [7] have analyzed Jensen's method for transient analysis of Markov chains. Marie [8] proposed a modification to Jensen's method which involves matrix-products. However, this method is not suitable for large Markov chains. Clarotti I-9] proposed an implicit A-stable method (similar to trapezoidal rule) for stiff Markov chains based on O D E solution. A comprehensive effort to comparatively evaluate different methods was carried out in [10]. They empirically compared Jensen's method, R u n g e - K u t t a Fehlberg method (an explicit method for O D E solution), and T R - B D F 2 method (an implicit method for O D E solution). They conclude that T R - B D F 2 is the method of choice for stiff Markov chains. For non-stiff models, they recommend the use of Jensen's method. One of the drawbacks of T R - B D F 2 is that it is only second-order accurate. Therefore it is computationally inefficient if high accuracy is desired. In [11], we have empirically compared a third-order A-stable generalized R u n g e - K u t t a method and a third-order L-stable implicit Runge Kutta method with the TR BDF2 and the Jensen's method. For stiff models, implicit Runge K u t t a method is

1825

1826

M. MALHOTRA,J. K. MUPPALAand K. S. TRIVEDI

recommended if high accuracy is desired. For non-stiff models, Jensen's method is recommended. From these research efforts, it is clear that the Jensen's method is clearly the best method to solve non-stiff Markov chains. It provides highly accurate solutions and it is computationally much more efficient than the ODE methods. For stiff Markov chains, the computational complexity of this method rises fast and the Poisson probability computation almost invariably runs into underflow problems resulting in lower accuracy than desired. For stiff Markov chains, L-stable ODE methods have been recommended. However, the results presented in [10, 11] reveal that in some cases, the computational cost of the original Jensen's method is more than the ODE methods only when the mission time extends beyond the steady-state point. This suggests that if somehow steady-state detection is incorporated in Jensen's method, then large savings in computational cost can result. Contrast this with the fact that explicit steady-state detection is not required by L-stable ODE methods since if the state probability vector does not change within the tolerance limit over a number of time steps, then these methods increase the step size geometrically until mission time is reached. This results in a negligible computational effort to advance the solution from steady-state point to the mission time. In this paper, we describe how steady-state detection can be incorporated in Jensen's method. Through numerical experiments on realistic models, we show that for stiff Markov chains, the method of Fox and Glynn [1] for Poisson probability computation of Jensen's method avoids underflow and yields highly accurate solutions. We also present error and complexity analysis of the modified Jensen's method. To the best of our knowledge, no other implementation of Jensen's method considers steadystate detection. This modification further implies that the definition of stiffness in [10] could grossly overestimate stiffness if mission time is larger than the time at which steady-state is reached. This provides the motivation to comparatively evaluate the modified Jensen's method with the ODE methods known to perform well for stiff Markov chains. We choose the two ODE methods that have been recommended for solving stiff Markov chains. These are the T R - B D F 2 method and the third-order implicit Runge-Kutta method. Sparse implementations of all the three methods are used. We find that for mildly stiff models, the modified Jensen's method performs the best of all three methods. For stiff Markov chains (according to the new definition), L-stable ODE methods are recommended. For high accuracy, implicit RungeKutta method is computationally the most efficient method. If low accuracy (upto eight decimal digits) is acceptable, then T R - B D F 2 method should be used. In Section 2, we formulate the problem in

mathematical terms and show how various system measures can be obtained through transient analysis of a Markov model of the system. In Section 3, we discuss the computational difficulties encountered in transient analysis of stiff Markov chains. In Section 4, we describe the Jensen's method and how steady-state detection can be incorporated in it. We also analyze the error and computational complexity of this method. In Section 5, we present the TR BDF2 method and the third-order implicit Runge-Kutta method. We also present error and computational complexity analysis of these methods. In Section 6, we describe the example models and the experiments carried out on these models. Different methods are compared on the basis of accuracy achieved and computational cost. Our results are summarized in Section 7.

2. MATHEMATICAL FORMULATION Consider a finite-state continuous-time Markov chain (CTMC) {Z(t), t/> 0} with state space f~. Without loss of generality, let f~ = {1, 2. . . . . n}. Let the matrix Q = [ q J , where qij, i :# j is the transition rate from state i to statej and qii = - ~ = 1. j~i qij. Q is the infinitesimal generator matrix of this CTMC. We let q = maxi ]qii] and let r/ be the number of non-zero entries in Q. Let P~(t) = Pr{Z(t) = i}. The state probability vector at time t is P ( t ) = [Pl(t), PE(t). . . . . P,(t)]. The row vector P(t) is computed by solving a system of first order linear differential equations, known as Kolmogorov equations dP(t) -

dt

P(t)Q.

(1)

The initial condition is spedified by the initial state probability vector P(0). Assume that the matrix Q has m ~< n distinct eigenvalues: ,~, 22 . . . . . )-,,-1, ).,, in non-decreasing order of absolute value. The matrix Q is singular, so )q = 0. Let d~ be the multiplicity of )~. All the eigenvalues have non-positive real part (some eigenvalues could be complex). The general solution for P/(t) (i = 1, 2. . . . . n) can then be written as dj

Pi(t) = ~ j=l

~, A~i'ktk-1 e a~t,

(2)

k=l

where ~a~i~ j , k are constants. If the transient terms in this equation decay fast, then steady-state is attained quickly. The rate of decay depends upon the magnitude of eigenvalues. The steady-state is reached at a particular t when each of the terms eaJ' becomes so small that they are considered equal to zero. In particular, it is the smallest (in magnitude) non-zero eigenvalue (22) that has the slowest rate of decay. This is the eigenvalue that determines the time at which steady-state is attained.

Markov chains

2.1. Measures of interest Various measures of the system behavior can be evaluated as weighted sums of state probabilities. To analyze a dependability model, the state space ~'l is divided into the set ta, of up states and the set tad of down states. Instantaneous availability, the probability of the system being in an operational state at time t, is then defined as A(t) = ~i,ou Pi(t). Reliability, R(t) is defined as the probability of the system not having failed until time c All the down states are made absorbing states by deleting all the arcs going from down states to up states and R(t)= ~i,n, P~(t). In performability models, a reward rate r~ is associated with state i as an indicator of the performance index. The computation availability Ac(t ) at time t can then be computed as the expected reward rate at time t:

Ac(t ) = ~iriPi(t). If n~ is the number of jobs in a queuing system when it is in state i, then the average number of jobs in the system at time t is ~,~n~P~(t). If the system throughput in state i is r~, then the average system throughput at time t is ~r~P~(t). Similarly, if q~ is the probability that the queuing system is full in state i, then ~iqiPi(t) is the probability that the system is full at time c Thus we see that transient analysis of a dependability, performance or performability model at time t basically involves computing state probabilities at time t of the underlying CTMC. 3. COMPUTATIONAL DIFFICULTIES Various sources of intractability present difficulties in transient analysis of Markov chains. Several difficulties that arise in transient analysis and methods to overcome them are discussed below. 3.1. Largeness Markov models of realistic systems are commonly plagued by largeness of state space. Usually a system model is specified using a high level description such as stochastic Petri nets [12]. These models are then automatically converted into equivalent Markov chains which are then solved to obtain useful measures of system behavior. Practical modeling problems, in general, give rise to hundreds of thousands of states [13]. Two basic approaches to overcome largeness are: (i) Largeness-avoidance: State-truncation based on avoiding generation of low probability states [ 14-17] and model-level decomposition [-18, 19] are two examples of this approach. (ii) Largeness-tolerance: In this approach, a concise method of description and automated generation of the CTMC is used. Sparsity of Markov chains is exploited to reduce the space requirements but no model reduction is employed. Appropriate data structures for sparse storage are used. Sparsity preserving solution methods are used which result in

1827

considerable reduction in computational complexity. CTMCs with several hundred thousand states have been solved using this approach. In this paper, we focus our attention on largeness-tolerance methods, although in practice both methods can be used in conjunction with each other. 3.2. Stiffness Besides largeness, stiffness is another undesirable characteristic of many practical Markov models that adversely affects the computational efficiency of numerical solution techniques. Stiffness arises if the solution has components of widely varying rates of change. The linear system of differential Equations (1) is said to be stiff if for i = 2 . . . . . m, Re(2i) < 0 and max IRe(2i)l >> min IRe(At)l, i

(3)

i

where 2~ are the eigenvalues of Q. One of the factors that must be taken into account is that rate of change of a solution component is defined relative to the solution interval. In [20], stiffness is defined as: A system of differential equations is said to be stiff on the interval [0, t) if there exists a solution component of the system that has variation on that interval that is large compared to 1/t. Thus, the length of solution interval also becomes a measure of stiffness. Stiffness could cause numerical instability in solution methods if the methods are not designed to handle stiffness. Like largeness, two basic approaches to overcome stiffness are: (i) Stiffness-avoidance: This approach is based on the idea of eliminating stiffness from a model and solving a set of non-stiff models. One such technique based on aggregation is described in [21]. Reibman [22] has proposed a splitting technique that can be used to avoid stiffness and largeness. (ii) Stiffness-tolerance: Special solution methods that remain stable for stiff models are used. This paper focuses on stiffness-tolerant solution methods. Consider the source of stiffness in Markov chains. In a dependability model, fault (error) handling rates are several orders of magnitude (sometimes 106 times) larger than fault occurrence rates. Fault handling rates could also be much larger than the reciprocal of mission time. Therefore, such Markov chains have events occurring at widely different time scales. Clarotti [9] observes that this causes the largest eigenvalue of Q to be much larger than the inverse of mission time and consequently the system of differential equations, Equation (1), is stiff. Using the Gerschegorin circle theorem [23], the magnitude of the largest eigenvalue is bounded above by twice the largest entry (absolute value) in the generator matrix. In a Markov chain, this entry corresponds to the largest total exit rate from any of the states. Hence, the stiffness index of a Markov chain was defined in [10] as the product of the

1828

M MALHOTRA,J. K. MUPPALAand K. S. TRIVI:DJ

largest total exit rate from a state and the length of the solution interval ( = qt). This implies that stiffness can be arbitrarily increased by increasing q or t. The largest rate q can t,e increased by increasing model parameters. However, this changes the eigen-structure of matrix Q. In some models, it results in an increase in the magnitude of the smallest non-zero eigenvalue of the matrix. This implies that those models reach steady-state faster. We will later define stiffness index in terms of q, t and 2,, where ";~2 is the smallest (in magnitude) non-zero eigenvalue of matrix Q. O D E solution methods which are not designed to handle stiffness become computationally expensive for stiff problems. Stiffness forces very small integration time steps which increases the total number of time steps required and total computation time by several orders of magnitude. Original Jensen's method does not handle stiffness well either [10]. 3.3. Accuracy The solution of the models of life-critical systems must be highly accurate. Very low probabilities are of interest many times and numerical techniques that provide highly accurate solutions are desirable. The total error in the final solution is a cumulative effect of errors arising from various sources. Some of these errors are unavoidable and some are hard to analyze and control. Typically, there is a trade-off between computation time and the magnitude of errors introduced. Different kinds of errors arising in the numerical methods are listed.

We now discuss the numerical techniques that we employ for the transient analysis of stiff Markov chains and how these methods try to overcome the computational difficulties mentioned above. 4.

JENSEN'S M E T H O D

This method has been used in [2, 5, 7, 10, l l], to compute transient state probabilities of Markov chains. It involves conversion o f a C T M C to a D T M C (Discrete-time Markov Chain) subordinated to a Poisson process. Assume that the C T M C IZ(t), t >1 0} is uniformizable [7], i.e. the diagonal elements of Q are uniformly bounded. Let q/> max~lqjjl. It can be shown that there exists a D T M C {Z~, i = 0, 1. . . . with transition probability matrix Q * = Q/q + I and a Poisson process ~N(t), t >1 0} with rate q, such that Z(t) and ZN, ~ are probabilistically identical. This conversion leads to a computationally useful technique for calculating the state probabilities of the C T M C Z(t). Conditioning on N(t) (the number of occurrences of the Poisson process in [0, t]) and using the law of total probability, we get:

Pr{Z(t) = j ) =

~ P r { Z ( t ) = j [ X ( t ) = i ] P r { N ( t ) = i I. i

0

(4) If II(i) is the state-probability vector of the D T M C Z i, then

P(t) = ~ H(i) e q' (qt)'. i=o

(5)

i!

Fl(i) is computed iteratively as:

Round-offerrors arise in every arithmetic operation carried out on a computer due to finite precision. These errors can be reduced by minimizing the number of operations and by avoiding cancellation. Truncation errors arise from truncating an infinite series after a finite number of terms. This error can be reduced by increasing the number of terms considered. However. the law of diminishing returns applies here (i.e. after a certain number of terms, the reduction in truncation error is not worth the extra computational effort). Oiscretization errors occur because a continuous interval is discretized in a set of intervals. Smaller the step size, smaller is the error. However, the increase in the number of steps increases the computation time and the round-off error. Propaqation errors arise due to iterative nature of numerical methods. Our concern is not only with the errors occurring in one step of a method, but also how these errors are propagated through subsequent steps. The solution of one step is used as input to the next step. As more and more steps are executed, the errors get amplified. The degree of ill-conditioning [24] of the problem determines the extent to which the error gets amplified and propagated through a sequence of steps.

I](i) = H(i - 1)Q*, H(0) = P(0).

(6)

In practice, the summation in Equation (5) is calculated by truncating the series to the right after a finite number of terms r. Thus r is termed as the right truncation point and is chosen to meet the pre-specified error tolerance E:

l-

i o

171

i!

As qt increases, the Poisson distribution thins from the left, i.e. the terms in the summation for small i become less significant. Thus we start the summation at a value I > 0, called the left truncation point (cf. [10, 26]) to avoid the less significant terms. In this case, Equation (5) is reduced to:

(qt)'.

e ( t ) ~ L r l 0 ) e ~' i=t i!

(8)

The values of l and r can be computed from the specified truncation error tolerance E using

t- 1 E e-a'

i=0

(qt) i i'~- ~<

E 3"

~-, 1 -- ~

i=0

(qt) i e-q'

i!

E ~< " 2

(9)

However, we use the method of Fox and Glynn [1]

Markov chains to compute I and r, the left and the right truncation points. This method also computes the Poisson probabilities e-qt((qt)i)/(i!) for all i = I, I + l . . . . . r - 1, r. For stiff Markov chains, qt is typically very large and the term e -qt almost always runs into underflow problems. The method of Fox and Glynn is especially useful in such cases since it avoids such underflow problems. Their method results in I = q t - O ( x / q t ) and r = qt + O(x/qt). The number of terms needed between the left and the right truncation point is O(x/qt ) which is also the computational complexity of Fox and Glynn's method. However, we must obtain the DTMC state probability vector at the left truncation point l, which is O(qt). Thus we need to compute O(qt) matrix-vector multiplications and each matrix-vector multiplication takes O(r/) time where r/ is the number of non-zero entries in the generator matrix Q. Therefore, the time complexity of Jensen's method becomes O(qqt). Space required is O(r/) for sparse implementation. We note that the computational complexity of Jensen's method rises linearly with q and t. Large value of qt also implies large number of matrix vector multiplications which results in large round-offerrors. A possible approach to control these errors is to normalize the DTMC state probability vector so that the sum of state probabilities is equal to one. For our applications, we noted in Section 3.2. that large q in some cases implies faster attainment of steady-state. If steady-state is reached at time t~s and tss is significantly smaller than t, then detection of steady-state may result in large savings of computation time (provided the steady-state detection can be performed at a much lesser cost), and reduce round-off errors. 4.1. densen's method with steady-state detection We begin by observing that Equation (6) which is used to compute the probability vectors for the underlying DTMC, also represents the iteration equations of the power method used for computing the steady-state vector for the CTMC [26]. If convergence of the probability vector to steady-state is guaranteed, then we can terminate the iteration in Equation (6) upon attaining the steady-state. In order to ensure convergence of the power iteration, we require that q > max I - q , I

(10)

i

since this assures that the DTMC is aperiodic. Note that we do not require that the CTMC (or the DTMC) be irreducible. A more general structure with one or more recurrent classes and a (possibly empty) transient class of states is allowed. Let II* denote the steady-state probability vector of the DTMC. Assume that the probability vector for the underlying DTMC attains steady-state at the Sth

1829

iteration, so that 11n(S) - II* II is bounded above by a given error tolerance. Three different cases arise in the computation of the transient state probability vector of the CTMC: (1) S > r, (2) I < S ~< r and (3) S ~< I. We examine each of these cases individually. In the following equations we will denote the transient state probability of the CTMC computed by the modified Jensen's method as P(t). Case 1 (S > r): In this case the steady-state detection has no effect and the probability vector is calculated using Equation (8). Case 2 (l < S ~< r): Consider Equation (8). By using II(i) = H(S), i > S, the equation can be rewritten setting the right truncation point r to ~ as: P(t) = ~ [l(i) e qt i=t

(qt)' i!

s (qt)i i = ~ n(i)e-q' . . . . + n ( s ) ~ e q'(qt) i i i! i=s+l i! "(i)e-qt (qt)i l -+ iH ( ,S ) (

i:o ~ e - " (qt)i)i' J

i=1

Case 3 (S ~< l): The DTMC reaches steady-state before the left truncation point. In this case, no additional computation is necessary and P(t) is set equal to H(S). 4.2. Implementation details To implement Jefisen's method with steady-state detection, several practical issues (e.g. how often should the steady-state detection procedure be invoked?) need to be resolved. The aim is to keep the overhead due to steady-state detection procedure minimal so that the method is almost as efficient as Jensen's method without steady-state detection for the cases where ts~ > t. We have implemented steady-state detection based on the suggestions given in [26]. Convergence of power iteration is checked by testing a norm of the difference between successive iterates. However, if the method is converging slowly, this may lead to inaccurate results. To avoid this problem we compare iterates that are spaced m iterations apart, i.e. we check the difference between H(i) and H ( i - m). Ideally, m should be varied according to the convergence rate. However, it is difficult to implement in practice since convergence rate is not known and changes with iterations. Instead we choose m based on the iteration number so that steady-state detection is performed neither too often nor too infrequently. If left trunctation is used (Equation (8)), first DTMC state probability vector needed is at the left truncation point I. Instead of using successive matrix-vector multiplies to compute this vector, matrix squaring could be used and the complexity of computing II(I) can be reduced from O(qt) to O(log(qt)) [10]. Thus the complexity of Jensen's method can be reduced to O(x/(qt ) + log (qt)) = O(x/(qt)). However, the problem in using this method is that successive

1830

M. MALHOTRA,J. K. MUPPALAand K. S, TRIVEDI

squaring results in fill-in (reducing sparsity) and hence this approach is not feasible for C T M C s with large state spaces. 4.3. Error and complexity analysis One of the main advantages of Jensen's method (compared to O D E solution methods) is that the number of terms, in the series in Equation (8), required to meet the user-specified error tolerance can be precomputed. However, this property is lost if steady-state detection is incorporated since steadystate point of D T M C can not be predicted a priori. In this section, we derive an error bound (subject to the assumption that the steady-state point can be detected exactly within a given error tolerance) for the most general form of a finite-state Markov chain containing one or more ergodic subsets and a (possibly empty) transient subset. Recall that H* is the steady-state probability vector and If(i) is the D T M C probability vector at the ith iteration. To derive the error bound, we will make use of the following lemma (cf. Cinlar [27, p. 378], Bhat [28, p. 90]). Lemma 1 For an irreducible aperiodic discrete-time Markov chain with transition probability matrix Q* and limiting probability distribution I-l*, there exists an ~ > O and a O <<,fl < l such that, IIIq(i) - H* H ~< ~fl',

i = 1, 2 . . . .

Typically fl = I).21 where 21 is the sub-dominant eigenvalue of Q* matrix. Let us substitute 6(i) = ~ff. Since fl < 1, we notice that 6(j) = ~/3j < ~flk = ~(k), j>k. Similarly if we consider a Markov chain with a transient subset T, the following lemma can be applied to bound the n-step transition probabilities for the transient states (cf. Bhat 1-28, p. 51]). Lemma 2 For a state j ~ T, the state probability at the ith iteration, Hj(i) approaches zero 9eometrically, i.e. there exists an ct and a 0 <~ fl < 1, such that Ilj(i) ~< ~tfli. We can use Lemma 2 in conjunction with Lemma 1 to prove the existence of the geometric convergence bound for aperiodic discrete-time Markov chains with a (possibly empty) transient subset and one or more ergodic subsets (cf. R. Karandikar 1-29]). Thus the above two lemmas can be restated for a general Markov chain as shown in Lemma 3. We must realize that the steady-state probabilities for the transient states is zero. Lemma 3 For an aperiodic discrete-time Markov chain with one or more ergodic subsets of states and a (possibly empty) transient subset, transition probability matrix Q* and limiting probability distribution If*, there exists an ~ > 0 and a 0 <~ fl < 1 such that, I I I I ( i ) - H * [ I ~
i = 1,2 . . . .

Now return to the error equations for each of the cases considered in the previous section. In the following equations, P(t) represents the exact solution for the probabilities of the Markov chain states and P(t) represents the solution obtained by the modified Jensen's method. We now bound the difference P(t) - P(t). We assume that E is the tolerance limit set by the user. Case 1 (S > r): In this case, since the equations used for the original Jensen's method and the modified Jensen's method are the same, the error bound is the same as computed for the original Jensen's method, i.e. t- 1 (qt)i (qt)i liP(t) - P(t) ll ~< ~ e q' + ~ e q' i:o i! i=k+l i! Case 2 (l ~< S ~< r): (qt)' P(t) - P(t) = ~ ll(i) e -q' i=o i! (qt) i -

~'H(i)

e q' i!

i=l

+n s) t-

i=s+l

i! /

(qt)i

1

= ~ l l ( i ) e -q' i=o i! +

qt) i ~ ( H ( / ) - H ( S ) ) e -q' i=s+l i!

Taking the norm on both sides of the above equation, we realize that the first summation is upper bounded by ~/2 since I is computed with this as the bound. By taking the norm of the second term, we realize that IIn(i) - !1(8)11 = IIIl(i) - 1-1" + !-1" - !1(8)11

~< IIII(i) - H*ql + II!1" <~ ot[3i + ~fl s

-

ll(S)ll

(fl <~ l, i >~ S)

<~ 2ctfl s where we assume that the D T M C has reached steady-state at the Sth iteration. Then, the error equation can be rewritten as,

i:s+l

i!

<~2~ + 2otfl s \{l . O -- ~ e -q' (qt)i) i=o i! / E

~< ~ + 26(S). By selecting 6(S) = ~/4 we can get the same bound as in the original Jensen's method. However, we must realize that it is computationally infeasiable to ensure that this bound is satisfied since 6(S) = IIIf(S) - II* II and we do not know the exact steady-state solution II* a priori. The best we can do is to ensure that the

Markov chains 100000

1831

I

!

!

!

^v

^~

O

^v

, i

!

!

1J

0

10000

MVMs

SS-Jen Jensen

1000 10-s

I

I

10-6

10-r

I

I

10-S 10-9 Error T o l e r a n c e

0

: t

I

I

10-10

10-11

lO-X2

Fig. 1. MVM count vs error tolerance--M/M/l/K model. norm of the difference between successive iterates is within the bound ~/4. Case 3 (S < / ) : P(t) - P(t) = Z 1-I(i) e -q' (qt)i - II(S) i=o i!

s (qt)i = ~ (If(i) - ll(S)) e -q' i:o i! +

(qt) i ~ (1-I(i)- H(S))e -q' i=s+l i!

The norm of the first summation is upper bounded by ¢/2 and the norm of the second summation can also be shown to be upper bounded by 26(S). Thus the overall bound on the error is,

¢ if the mission time extends beyond the steady-state point. We empirically confirm this result by solving the Markov model of an M/M/1/K queue with arrival rate2 = 9.0, service rate/t = 10.0, K = 50, and mission time t = 1000.0. This model reaches steady-state before the mission time t = 1000.0. The number of matrix vector multiplications (MVMs) required by original Jensen's method (labeled "Jensen") and Jensen's method with steady-state detection (labeled "SS-Jen') are shown in Fig. 1. We observe that whereas the MVMs required by original Jensen's method remain almost constant (there is a slight increase) as Ilog(¢)[ increases, the MVMs required by the modified Jensen's method increase with increase in Ilog(e)I.

liP(t) - P(t)H ~< ~ + 26(S) ~< E ,

5. NUMERICAL DIFFERENTIAL EQUATIONS SOLUTION

since c5(S) <~ E/4. Let us now derive the complexity of modified Jensen's algorithm. Each iteration of Equation (6) requires a matrix-vector multiplication which takes O(q) time. The number of iterations is given by rain{S, r}. If S is less than the right truncation point, the complexity of the algorithm is governed by the power iterations, i.e. O(qlogE/logl2snl) where E is the desired precision and 2~ is the sub-dominant eigenvalue of the matrix Q*. Note that the smallest (in magnitude) non-zero eigenvalue of matrix Q corresponds to the sub-dominant eigenvalue of matrix Q* (2~d = 22/q + I). Hence for the DTMC, this is the eigenvalue that determines the steady-state point. On the other hand, if steady-state is not reached before the right truncation point, the complexity is O(qqt). Thus the overall complexity of the modified algorithm is O(q rnin{Iog c/log 12sal, qt}). Therefore, we redefine stiffness index of a Markov chain with respect to Jensen's method as min{log E/log(122/q + l I), qt}. Note that the complexity of original Jensen's method is O(rlqt) and does not depend on E whereas the cost of modified Jensen's method depends upon

The most obvious way to obtain the transient solution of a Markov model is to use the techniques available for solving ordinary differential equations. A big advantage of these methods is that they can be used for transient analysis of non-homogeneous CTMCs (Jensen's method in its present form can not be used to solve non-homogeneous Markov chains). Secondly, the ODE solution is obtained on a grid of points (instants of time) which makes it possible to examine the solution as a function of time. Furthermore, the variety of available methods makes it possible to select special methods for Markov chains with special structures. There are methods with different orders of accuracies and methods that can handle stiffness. The basic strategy of ODE solution methods is to discretize the solution interval into a finite number of time intervals separated by mesh points {t~, t 2 , . . . , t~. . . . . t,}. Given a solution at t i (ith mesh point), the solution at ti + h( = t~+ l) is computed and then we advance in time with step size h until the time at which the solution is desired (known as mission time in our terminology) is reached. In practice,

1832

M. MALHOTRA,J. K. MUPPALAand K. S. TRIVEDI

the step-size is not constant, but varies from step to step. For stiff problems, although the numerical approximation to the solution using an explicit method converges to the actual solution as step size h - , 0, the step size may need to be unacceptably small to achieve the desired accuracy [30]. When the step size becomes unacceptably small, the round-off effects become significant and computational cost increases greatly. Thus, we require methods that are not forced to decrease the step size for maintaining stability, i.e. the methods that are inherently stable in some sense. Implicit O D E methods are inherently stable. The stability of implicit methods is characterized by following definitions. A method is said to be A-stable if all numerical approximations to the actual solution tend to zero as n .--, z when it is applied to the differential equation (dy/dt) = )4' with a fixed positive h and a (complex) constant ). with a negative real part [30]. Here n is the number of mesh points which divide the solution interval. For extremely stiff problems, even A-stability does not guarantee that rapidly decaying solution components would decay rapidly in the numerical approximation as well without large decrease in step size. This may lead to the phenomenon of ringing, i.e. the successively computed values tend to be of the same magnitude but opposite sign ()'i+~ = - Y i ) , unless the step size is too small [31]. Once again, we have the same problem if the step size needs to be unacceptably small. Axelsson [32] defined methods to be s t ! ~ y A-stable if for the equation (dy/dt) = 2y, (Yi, ~)/Yi ~ 0 as Re().h) -, - ~ . This property is also known as L-stability [33]. We use two L-stable O D E methods for our purpose. 5.1. TR BDF2 method This is an L-stable composite method that uses one step of TR (trapezoidal rule) and one step of BDF2 (second order backward difference formula) [31]. It is a re-starting cyclic multi-step method. The L-stability of the method comes from the L-stability of the backward difference formula and TR step provides the desirable property of re-starting. A single step of TR BDF2 is composed of a TR step from ti to t~ + 3,h and a BDF2 step from ti + ?,h to ti+l where 0 < ;' < 1. For the system of Equations (1), the above scheme yields

(

P(t + 7h)\I

5.2. Extensions Of implicit R u n g e - K u t t a methods The methods we discuss are extensions of implicit R u n g e - K u t t a methods. Butcher [34] has shown that it is possible to achieve a method of order 2r for r-stage implicit R u n g e - K u t t a methods. Applying these methods to the test problem (dy/dt) = ).y, we get for an r-stage method: Pr(;~oh) (13) Y i + 1 = Q,(;,.h) Yi where P,(x) and Q d x ) are polynomials of degree r in x. Hence, we have approximated:

P,(~JO

= Rr(ih ) = e "~'n+ O(h2'+ l).

Qd2h)

Pr l()~h) = e;,h + O(h2r)

l

P(t+7h)

7 (1 - 7) 2

(15)

Qr(;~h) yields an L-stable method of order 2r - 1. For the state probability computation, Pade approximants take the form of matrix polynomials (polynomials in hQ). When computing state probabilities using Equation ( l ), these methods yield a linear algebraic system at each time step: r

1

a(t + h) ~ ~i(hQ)i = P(t) ~ fli(hQ) i, 0

(16)

i=0

where ~i and fii are constants whose values are determined based upon the order of the method desired. Here, the 0th power of hQ is defined to be the Identity matrix I. It can be seen that this system involves higher powers of the generator matrix Q. In particular, we get a third order L-stable method by using r = 2 in Equation 06):

2

P(t+h)((2-/)l-(1-7)hQ)=

(14)

Here R~(x) is the nth diagonal Parle approximation. It has been shown in [35] that these methods are A-stable. Furthermore, if the degree of numerator polynomial Pp(x) is strictly less than the degree of denominator polynomial Qq(x) (i.e. p < q), then we get a class of methods that are L-stable. Axelsson [32] has shown that the approximation:

i

for the TR step and

P(t)

7 (12) for the BDF2 step.

This method provides reasonable accuracy for error tolerances up to 10 8 and excellent stability for stiff Markov chains [10]. However, for tighter tolerances, the computational complexity of this method rises sharply. We have seen in Section 3.3 that computation of state probabilities with precision as high as 1 0 ~ is often required in practice. In such cases, this method requires a step size so small that it is unacceptable in terms of total time of computation. Thus the need arises for higher order methods to perform highly accurate transient analysis of stiff Markov chains.

P(t+h)(l-2hQ+~h2Q

2)=P(t)(l+-~hQ).

(17)

Similarly, using r = 3, a fifth order method may

Markov chains be derived. Thus, in principle, we could derive methods of even higher order. In this study, we restrict ourselves to the third order method described by Equation (17) since taking high powers of a sparse matrix may lead to a lot of fill-in which is undesirable. Various possibilities exist for solving the system in Equation (17). (i) One way is to compute the matrix polynomial directly. This method involves only squaring the generator matrix and it is reasonable to expect that fill-in will not be very much. We tried several models and found that fill-in is generally not more than ten percent. (ii) Factorization of the matrix polynomial on the left hand side is another possibility. We then need to solve two successive linear algebraic systems. For example, the left hand side polynomial in Equation (17) can be factorized as: P(t + h)(l - rthQ)(l - r2hQ) = P(t)(l + ~1h Q),

(18)

where r 1 and r2 are the roots of the polynomial 1 - 2x + ~x 2. This system can be solved by solving two systems: X(I - r2hQ) = P(t)(l + ~hQ)

(19)

P(t + h)(I - rthQ) = X.

(20)

Unfortunately, the roots r 1 and r2 are complex conjugate and hence this approach will require the use of complex arithmetic. 5.3. Implementation aspects The basic strategy employed by all the differential equation solvers can be divided into a few broad steps. The first step is initialization of various parameters: initial step-size ho, m i n i m u m step-size hmln, maximum step-size hm~x, error tolerance for the linear system solvers if any, etc. Calculation of hmin and hmax may be based upon the given problem and the accuracy desired. However, we may not want to use a very small hm~, to avoid excessive computation time and concomitant round-offerrors. The next step is to form the LHS matrix for the given system of equations (e.g. matrix (I - 2hQ + ~h2Q 2) in Equation (17)). Then the calculations for each time step are performed. For implicit methods, a linear system (s) needs to be solved at each time step. The choice of linear system solver depends upon the specific model. Sparse direct methods [36] like Gaussian elimination yield accurate results. Rows and columns can be reordered to minimize the fill-in that results from using direct methods. For large Markov chains, direct methods are too expensive to use. In such cases, sparse iterative techniques like Gauss-Seidel or SOR are preferred. In our experiments, we found Gauss-Seidel method to be sufficiently fast and accurate. Typically the matrices ((I -- (Th/2)Q) and ((2 - 7)1 - (1 - 7)hQ) for T R - B D F 2 and ( I - ~hQ + lh2Q2) for implicit NR 34:11-I

1833

R u n g e - K u t t a method) are diagonally dominant because of the special structure of matrix Q which helps faster convergence of the iterative solvers. However, if Gauss-Seidel does not converge within a specified number of iterations (i.e. the matrix has poor eigen-structure), we switch to SOR. If convergence is still not achieved, then tolerance is relaxed by a constant factor. The other possibility would be to switch to a sparse direct method. In the next step, LTE vector at the end of time step is calculated. A scalar estimate of LTE is obtained as a suitable norm of LTE vector. We must note that this is an estimate of the LTE. Hence, we require that local error tolerance be specified instead of global error tolerance. If the scalar LTE estimate is within the local tolerance, then this step is accepted. If we have reached the end of solution interval, then stop, otherwise a new step-size is computed based on the step-size control technique such that it is less than h . . . . and the above steps are repeated starting from the step in which the LHS matrix is computed. If the error estimate is not within the tolerance, then step-size is reduced and the above time-step is repeated. If the step-size is reduced below hmin, two actions may be taken: Either increase the tolerance or switch to a higher order O D E solver. This summarizes the steps taken by a typical differential equation solver. 5.4. Error and complexity analysis For O D E methods, LTE is estimated by the principal truncation error term after appropriate Taylor series expansion. Estimation of LTE helps in properly regulating the step-size. Thus the performance of the O D E solver crucially depends on good estimation of LTE at each time step. For a system of differential equations, we get an LTE vector, each element of which corresponds to the LTE in each state probability value. For T R - B D F 2 method, the principal truncation error term per step is obtained by Taylor series expansion of terms in Equations (11) and (12). The LTE vector E(h) at time t + h is given by: E(h) -

- 3 7 2 + 47 - 2

h3P(t)Q 3,

12(2 - 7) where h is the step size at time t. Direct calculation of LTE vector is perhaps the most accurate but it requires three matrix vector multiplications. We use a divided difference estimator suggested in [31]: ~(h) -

- 3 7 2 + 47 - 2 6(2 - 7) x

P(t) ---+

P(t + 7h)Q

~(l - 7)

P(t + h)Q~ h /

iZ~

j'

This provides a good estimate of LTE and it is less expensive to compute.

1834

M. MALHOTRA,J. K. MUPPALAand K. S. TRIVEDI

For the third order implicit Runge-Kutta method, the LTE vector at t + h is given by: h4 e(h) = 72 P(t)Q4"

(21)

Direct calculation of LTE is not really as expensive as it seems in this case. A careful look at Equation (17) reveals that we have already computed P(t)Q to compute the right hand side. We also compute Q2 as part of left hand side. Hence, P(t)Q 4 can be computed by two matrix-vector multiplications (one with matrix Q and one with Q2). Another attractive alternative is based on Richardson extrapolation scheme. For the system of Equations (1), Richardson extrapolation yields: Eth) = 8[1~1(t + h) - P2(t + h)],

(22)

where Pl(t + h) is obtained from P(t) using a step-size h and P2(t + h) is obtained from P(t) using two steps of sizes h/2 each. If all the other LTE estimation methods become too expensive to compute, then we opt for the Richardson extrapolation scheme. Once the LTE vector at a step is derived, scalar error is computed using one of the standard error norms: L 1, L z or Lr,. Given the scalar error, there are different techniques to estimate the optimal step size. A commonly used technique is: h°P'

: h (local tolerance ~l/,ordor + 1, \ LTE / ,

(23)

where order is the order of accuracy of the single-step method. If the LTE is greater than the specified tolerance, then the step size is decreased and the step is repeated. The step size may be increased if the error is within the error tolerance. In our experience, we found this step control mechanism to work well. It is hard to estimate global error from the local errors occurring at each time step. However, it is reasonable to assume that controlling local errors would help bound the global error. Thus, we specify tolerance to bound the local error. The computational complexity of ODE solution methods is usually evaluated in terms of the number of function evaluations. Each function evaluation in our case is a matrix-vector multiplication. For implicit methods, computational complexity also depends heavily on the linear system solver. For Markov chains with small state space, a sparse direct method [36] like Gaussian elimination may be used. For Markov chains with large state spaces, iterative methods such as Gauss-Seidel or SOR are preferable. Each iteration of an iterative solver takes O(r/) time. However, the number of iterations until convergence can not be bounded a priori. Let s be the number of time-steps required by the ODE-solver to compute the state probability vector at the mission time. For TR-BDF2 method with an iterative linear system solver, the complexity is O(lsq) where I is the average number of iterations per linear system solution.

For the implicit R u n g e - K u t t a method, the complexity depends upon which implementation is used, i.e. whether the left hand side matrix polynomial is factorized, or whether the matrix polynomial is computed. We analyze the two cases separately. (i) Calculating the matrix polynomial involves squaring of the matrix and three matrix additions. Squaring the matrix takes O(nq) time, where q is the number of non-zero entries in the matrix Q and n is the dimension of matrix (i.e. the number of states in the Markov chain). The squaring of Q results in some fill-in. Suppose rt' denotes the number of non-zeroes in the squared matrix and f be the fill-in ratio (q'/q). We tried several different structures of Markov chains and found that f increases with n. For most of the Markov chains tried, f was not more than 10 percent. Once we have computed the left hand side matrix, the remaining complexity arises from solving the linear system of n equations. If an iterative method is used, then total time-complexity is O(nq + lsq') where I is the average number of iterations per linear system solution. We observed that usually not more than 2 -3 iterations are required for iterative methods to converge. (ii) The next approach involves factorizing the matrix polynomial and then solving the two linear systems. Hence the computational complexity of this method arises mainly from the complexity of the linear system involved. If an iterative method is used for linear system solution, then total time-complexity is 0((11 + 12)s~) where I~ and 12 is the average number of iterations for the two successive linear system solutions in Equations (19) and (20). Note that in this case our domain of operation is extended to complex numbers, hence the total time taken increases by a constant factor.

6. EXPERIMENTAL RESULTS In this section, we describe the experiments and the numerical results obtained from them. We empirically compare the three different methods: (1) Modified Jensen's method, (2) T R - B D F 2 method and (3) Third order implicit Runge-Kutta method, A variety of models that give rise to stiff Markov chains were used for experimental purposes. Among the different models used are: (1) Reliability model of duplex disk system with repair and (2) Reliability model for extra-stage shuffle-exchange network [37]. These models have closed form transient solutions. This allows us to compare the actual solution against the computed solutions, thereby giving us a measure of accuracy of the methods. The experiments yield useful general trends which are used to draw inferences about different methods. Before proceeding to illustrate the results obtained, we briefly describe the various models used.

Markov chains

1835 (2c+ 1)2+/~+~1

6.1. M o d e l s used

AI'= ~I -- ~2

6.1.1. D u p l e x - d i s k model. This is the reliability model of a duplex-disk I/O system. The system consists of two disks which store identical data. Data is available for an I/O operation if at least one of the disks is operational. If a disk fails, it is replaced by a spare disk and data is reconstructed onto the spare disk from the other disk. We assume unlimited supply of cold disk spares. However, if a disk fails while the other is being repaired, then the I/O system is considered failed. Each disk fails at rate 2 and data reconstruction takes place at rate #. We allow for imperfect coverage, where the coverage c is the probability that a disk failure is covered. If an uncovered disk failure occurs, then the I / O system is considered failed. Some causes of uncovered disk failures are undetected disk fault(s), failure of support hardware (cooling fans, power supply, etc.), failure of disk controller, or some catastrophic failure. The Markov reliability model of a duplex-disk I/O system is shown in Fig. 2. The reliability R(t) of the I/O subsystem at any time t is the sum of probabilities of being in state 2 and 1 at time t. Closed-form expression for reliability, given that the system begins operation in state 2 at time t = 0, is given by:

(2c+1)2+~+~2 A2= ~2 - - ~1

Typically, disk failure rate is very small compared to data reconstruction rate. The "base" model we use has 2 = 0.09, /a = 1800.0 and c = 0.9. For experimental purposes, we use mission time (time at which the solution of the model is desired) t = 100.0 and error tolerance a = 10-12 unless otherwise specified. With these parameters, this model becomes extremely stiff. 6.1.2. S E N + n e t w o r k model. The next model we use is an upper-bound reliability model for an N x N extra-stage shuffle-exchange network [37], known as SEN + . The block diagram for this model is shown in Fig. 3. A block of two SEs in parallel is modeled by the same C T M C shown in Fig. 2 while a set of SEs in series is modeled by a two-state Markov chain shown in Fig. 4. A cross-product Markov model of the overall block diagram is then constructed assuming independence across blocks. The upper-bound reliability of SEN + network is given by: RseN+ub(t ) = e - N ; a ( A 1 e alt + A 2 e ~2tlx

R ( t ) = A 1 e ~'' + A 2 e ~2',

(24)

where A I, A2, cq, ~2 have the same form as in Equation (25) and N(log2 N - 1) K4

where - ( / a + 32) + \ / [ 2 2 + #2 + 22/1(4c - 1)] ~1 ~-

2

For the base model, we choose N = 8 (number of C T M C states = 324 and number of transitions = 2052), 2 = 1.0, /~ = 100.0 and c = 0.9. The mission

- ( / z + 32) - x/J22 +/12 + 22/~(4c - 1)] ~2 "~-

2

2a(1

-

,)

Fig. 2. Reliability model of a duplex-disk I/O subsystem.

N/2SEs~

.......

_

N/2SEs

~

/ N(log~N

(25)

- 1)/4

pairs

J

Fig. 3. Upper bound reliability model of SEN + network.

1836

M. MALHOTRA,J. K. MUPPALAand K. S, TRIVEDI

Fig. 4. CTMC for m SEs in series. time t = 1.0 and error tolerance E = 10 to in the experiments unless otherwise stated. This model is moderately stiff. 6.2. Results In this section, we discuss the results obtained after a series of experiments were conducted on the models described in the previous section. The experiments were conducted on Sun SPARC station ELC. The criteria used for comparison of different methods include the accuracy achieved and the computational cost as function of stiffness. The experiments fall into the following categories: (i) (ii) (iii) (iv) (v)

Accuracy versus error tolerance Computational cost vs error tolerance Computational cost vs model parameters Computational cost vs mission time Computational cost vs model size.

In all the plots, the following legend is used. The plots for implicit Runge-Kutta method are labeled "Imp-RK'. The plots for T R - B D F 2 method are labeled "TRBDF2". The plots for original Jensen's method without steady-state detection and without Fox and Glynn's method for Poisson probability computation are labeled "Jensen". The plots for Jensen's method with steady-state detection but without Fox and Glynn's method for Poisson probability computation are labeled "SS-Jen". The plots for Jensen's method with steady-state detection and Fox and Glynn's method for Poisson probability computation are labeled "FGSS-Jen". 6.2.1. Accuracy. The accuracy of the final results is

determined by comparing the computed solutions against the actual solutions. The actual solutions are obtained by substituting numerical values into the closed form solutions. In the experiments, error tolerance was varied from 10- ~ to 10 it in each of the two models. Accuracy achieved vs the error tolerance specified for different methods is plotted in Figs 5 and 6 for duplex-disk and SEN + models, respectively. The legend in these figures includes a straight line labeled "x" which allows visual checking of whether specified error tolerance is being met or not. Modified Jensen's method using the method of Fox and Glynn [1] for computation of Poisson probabilities results in much higher accuracy compared to the implementation of Jensen's method that uses sum of logarithms scheme as mentioned in [10]. For moderate tolerances, ODE methods also provide reasonable accuracy. However, as tolerance becomes tighter, the accuracy achieved by the ODE-solvers falls short of the tolerance specified. TR BDF2 provides an accuracy of l0 to at best. Implicit Runge Kutta method provides accuracy as high as 10-II. For the ODE-solvers, the minimum step size of 10 7 was used. We should rarely expect to get higher accuracy by using a step size smaller than that. Besides, smaller step sizes are computationally infeasible to use due to excessively large number of steps required. We now consider the effect of increasing error tolerance on the computational cost. The computational cost is measured here in terms of the CPU seconds required. The results obtained are plotted in Figs 7 and 8 for the "base" duplex-disk and SEN + models, respectively. The most common trend

10-4

TRBDF; 0

10-5 ~ - - ~ ' ~ . ~

1o- 1-

,o-,i-

I

1

1

'°-'° I,o-,, I-

\

1

10-12 ~ 10-13

10-14 10-5

10-6

10-7

10-S 10-9 Tolerance

10-10 10-11 10-12

Fig. 5. Accuracy vs error tolerance--duplex-disk model.

Markov chains

1837

10-s 10-6 10-r

10-s

10-9

Accuracy

10-1o

10-11 10-12 10-1a

10-5

10-6

10-r

10-s

1 0 - 9 10-1o 10-11 10-12

Tolerance Fig. 6. Accuracy vs error tolerance--SEN + model.

10 9 8

!

I

I

I

0

0

0

[]

X

X

X

X

i

'

'

/I

7 C P U Seconds

6 5 4 3 2

TRBDF20 Imp-RK -l-SS-Jen FGSS-Jen X

1

~

0 10-s

~

j

X

l

I

! m

I I 10-6

I 10-r

I

I 10-a

10-9 10-1o 10-11 Tolerance Fig. 7. CPU time vs error tolerance--duplex-disk model.

observed is that the computational cost increases with the decrease in tolerance. It is the rate of increase that forms the basis for distinction among the methods. For error tolerance up to 10 -8 , all the methods perform quite well. Beyond that, the computational cost of T R - B D F 2 method starts to rise fast. The computational cost of the implicit Runge-Kutta method has slower rate of increase compared to the T R - B D F 2 method. The computational cost of the modified Jensen's method remains virtually constant over the entire range of tolerance specificied. For SEN + model, the computation time of this method is so small that it is invisible in Fig. 8. Although the plots labeled "FGSS-Jen" and "SS-Jen" in Fig. 7 show that computation time is less if Fox and Glynn's method is used for this model, we do not consider it a general phenomenon. For many other models we looked at, the change in the computational cost due to the use of Fox and Glynn's method for Poisson probability computation is negligible. 6.2.2. Stiffness. We now consider how stiffness affects the performance of different solvers. The computational cost is the measure of the performance

10-12 10-1a

chosen. Two ways of varying stiffness are considered. One is to vary the mission time t and the other is to vary the model parameters. We consider both the base separately. (1) We first consider the effect of increasing mission time on the computational complexity of solvers. Time to compute the results (CPU seconds) is plotted as a function of the mission time t in Figs 9 and 10 for duplex-disk and SEN + "base" models, respectively. For the sake of comparison, we also plot the results obtained from using the original Jensen's method (without steady-state detection). For small values of t, the original and modified Jensen's methods perform equally well and significantly better than the ODE-solvers. However, as t increases, computational complexity of original Jensen's method rises very fast. The computational complexity of the modified Jensen's method also rises fast but only until steady-state is reached. Once the steady-state is reached, the computational cost of modified Jensen's method levels off. For the extremely stiff duplex-disk model, the modified Jensen's method continues to perform better than the original Jensen's method but

1838

M. MALHOTRA, J. K. MUPPALAand K. S. TRIVEDI

1200

I

i000

I

I

I

I

TRBDF2

0 Imp-RK ~,, SS-Jen O

800

F

G

S

I

/

N

-

J

e

n

1

A

/

Y

-~

/

~

CPU Seconds 600 400 200

t 10-e

0

10-s

r 10-r

I 10-8

i 10-9

i

i

I

10-10

10-11

10-12

Tolerance Fig. 8. CPU time vs error tolerance--SEN + model. 400



*

"

" ' ' ' ' I

'



"

' ' ' ' ' I

I

I

"

'

'

'

' ' ' ' ' I





i

350 TRBDF20 ImI>-RK I FGSS-Jen -B--

300 250

Jensen x

CPU Seconds 200 150 100 50 l

I

I

l

l "''1

I

I

""

l

==.,.

,

,

I0

I00 I000 Missiontime Fig.9. CPU timevs missiontime--duplex-diskmodel.

2000

........

1600

ImI>-RK -+-FGSS-Jen 3ensen --x--

1400 1200 CPU Secondd000 800 600 400 200 0 1

i

10

........

i

100 Mission time Fig.10. CPU timcvs missiontimciSEN + modcL

it performs worse than both the ODE methods. All the methods tend to level-off after the steady-state is reached and require no extra computation for mission times beyond that. (2) We now consider how the model parameters affect the solver performance. The parameters varied for different models are:

,

,|

....

I0000

........

1000

1. Data reconstruction rate l~ for the duplex-disk model 2. Repair rate Ft for the SEN + model Increase in the values of these parameters is expected to make the models increasingly stiffer. For the "base" duplex-disk and SEN + models, CPU seconds

Markov chains 160

........

140

I

........

1839 ,

........

,

.........

TRBDF20 Imp-RK ~ , FGSS-Jeo O

120 100 C P U Seconds 80 60

40 20 .

. v

. . . .

10

v

.

.

.

.

100

.

_

.

1000 Repair rate

.

. . i , t .

'



10000

*

" I * ' *

100000

Fig. II. C P U time vs repairrate--duplex-diskmodel.

1600



1400

,!

TRBDF2 Imp-RK t FGSS-Jen

1200 IO00

CPU Seconds 800 600 400 200

i

0

i

1



I -|

i ,

*

10

i •

I

-.

I

_

100

1000

Repair rate

,

10000



100000

Fig. 12. CPU time vs repair rate--SEN + model.

required are plotted against the corresponding model parameters in Figs 11 and 12, respectively. For moderate levels of stiffness, the modified Jensen's method outperforms both the ODE methods. For both the duplex-disk and SEN + models, the computational cost of Jensen's method increases rapidly as ~ increases. For the duplex-disk model, it is easy to compute the sub-dominant eigenvalue )-~a of matrix Q*. The value of 2~a for different values of are shown in Table 1. As ~ is increased from 10.0 to 10000.0, ,~.~aincreases and becomes closer to 1.0. The closer 2,a is to 1.0, the longer it takes the model to reach steady-state. Thus, in this case, increasing/~ truly increases stiffness of the model. The computational cost of the T R - B D F 2 method slightly decreases with increase in ~ for the SEN + model. This happens because it falsely detects steady-state before it is reached for this model. This is confirmed further by the fact that the computational cost of the implicit Runge-Kutta and the modified Jensen's method increase. The computational cost of the implicit Runge-Kutta method only slightly increases with increase in stiffness. This reflects the

Table 1. Sub-dominant eigenvalue vs ,u in duplex-disk model

10.0 100.0 1000.0 10000.0

0.97133 0.997881 0.999799 0.99998

inherent stability of the L-stable ODE methods for stiff Markov chains. 6.2.3. S i z e . In the end, we consider the effect of the size of the model on the solver performance. We use the Markov model of an M/M/I/K queue with arrival rate 2 = 1.0h -1 and service rate 100.0 h - 1. This model is chosen because its structure remains the same for different values of K. Error tolerance is set to 10- 12 and mission time is 1.0. Given a value of K, the number of states in the Markov chain is K + 1. In Fig. 13, the computation time of different solvers is plotted as a function of buffer size K. The rate of increase in the computational cost is

1840

M. MALHOTRA,J. K. MUPPALAand K. S. TRIVEDI

250 200

I

I

I

I

I

l

I

I

TRBDF2 ~ I m p - R K -4---

150

C P U Seconds 100 50 0

"""--'- I 5O 100

I

150

200 250 Number of Buffers

I

1

300

350

400

Fig. 13. CPU time vs size--M/M./1/K model.

nearly constant for all the methods. The modified Jensen's method has the smallest rate of increase while T R - B D F 2 has the largest. Among the ODE-solvers, implicit R u n g e - K u t t a method has the smallest rate of increase.

Acknowledoements--The authors would like to thank Dr. Rajeeva Karandikar, Dr. Vidhyadhar Kulkarni, and Dr. Andrew Reibman for informative discussions and helpful comments. The authors would also like to thank Dr. Ricardo Pantazis and Dr. A. V, Ramesh for many useful suggestions and insights into the numerical methods. The help of Dr. Christoph Lindemann is gratefully acknowledged.

7. CONCLUSION We have empirically compared three different methods for numerical transient analysis of stiff Markov models. Although we have shown the results of our experiments for only two models, we have performed experiments on a variety of models and our conclusions are based on the general trends we have observed. Jensen's method with steady-state detection provides solutions for non-stiff and moderately stiff problems very efficiently. For stiff models, qt is typically large which almost always causes underflow problems in Poisson probability computation. The use of Fox and Glynn's method [1] avoids underflow in Poisson probabilities c o m p u t a t i o n resulting in solutions that are more accurate than those obtained from the implementations of Jensen's method that use traditional methods of computing Poisson probabilities. Incorporating steady-state detection with the original Jensen's method saves a lot of computation time if mission time extends beyond the steady-state point. Thus, the most efficient implementation of Jensen's method is the one that uses Fox and Glynn's method for computing Poisson probabilities and also incorporates steady-state detection. However, stiffness arising from a spread of eigenvalues still causes performance degradation of this method. Implicit ODE-solvers have better stability properties and perform well for extremely stiff models. T R - B D F 2 , a second-order method, performs efficiently if accuracy required is low (up to eight decimal places). If higher accuracy is desired, then the third-order implicit R u n g e - K u t t a method should be used.

REFERENCES

1. B. L. Fox and P. W. Glynn, Computing Poisson probabilities, Commun. ACM 31(4), 440-445 (1988). 2. W. K. Grassmann, Transient solution in Markovian queuing systems. Comput. Operations Res. 4, 47-56 (1977). 3. J. C. Laprie, Dependable computing and fault-tolerance: concepts and terminology. Proc, 15th Intl. Syrup. on Fault-Tolerant Computing, pp. 2-7, Ann Arbor, MI (1985). 4. A. Reibman, K. S. Trivedi and R. Smith, Markov and Markov reward model transient analysis: an overview of numerical approaches, Europ. J, Operations Res. 40(2), 257 267 (1989). 5. A. Jensen, Markoff chains as an aid in the study of Markoff processes, Skand. Aktuarietidskrifi. 36, 87-91 (1953). 6. J. Keilson, Markov Chain Models: Rarity and Exponentiality. Springer, Berlin (1979). 7. D, Gross and D. Miller, The randomization technique as a modeling tool and solution procedure for transient Markov processes, Operations Res. 32(2), 334-361 (1984). 8. R. Marie, Transient numerical solutions of stiff Markov chains. Proc. of the 20th 1SATA Symposium, pp. 255-270 (1989). 9. C. A. Clarotti, The Markov approach to calculating system reliability: Computational problems. In A. Serra and R. E. Barlow (Eds), Proc. of the International School qf Physics, Course XCIV, pp. 55-66. North-Holland (1986). 10. A. Reibman and K. S. Trivedi, Numerical transient analysis of Markov models, Comput. Operations Res. 15(1) 19-36 (1988), ll. M. Malhotra, Higher-order methods for transient analysis of stiff Markov chains. Master's thesis, Duke University, Department of Computer Science (1991). 12. M. Ajmone-Marsan, G. Balbo and G. Conte, Performance Models of MultiprocessorSystems. MIT Press, Cambridge, MA (1986).

Markov chains 13. O. C. lbe and K. S. Trivedi, Stochastic Petri net models of polling systems, IEEE J. Selected Areas Commun.8(10) (1990). 14. M. Boyd, M. Veeraraghavan, J. B. Dugan and K. S. Trivedi, An approach to solving large reliability models, Proe. of IEEE/AIAA DASC Symp., San Diego (1988). 15. H. Kantz and K. S. Trivedi, Reliability modeling of MARS system: A case study in the use of different tools and techniques, International Workshop on Petri Nets and Performance Models, Melbourne, Australia (1991). 16. V. O. K. Li and A. Silvester, Performance analysis of networks with unreliable components, IEEE Trans. Commun. Com-32(10), 1105-1110 (1984). 17. N. M. Van Dijk, Truncation of Markov chains with application to queuing, Operations Res. 39(6) 1018-1026 (1991). 18. G. Ciardo and K. S. Trivedi, A decomposition approach for stochastic petri net models, International Workshop on Petri Nets and Performance Models, pp. 74-83, Melbourne, Australia (1991). 19. L. A. Tomek and K. S. Trivedi, Fixed point iteration in availability modeling. In M. Dal Cin and W. Hohl (Eds),

Proc. of the 5th International GI/ITG/GMA Conference on Fault-Tolerant Computing Systems, pp. 229-240, Springer, Berlin (1991). 20. W. L. Miranker, Numerical Methods for Stiff Equations and Singular PerturbationProblems.D. Reidel, Dordrecht, Holland (1981). 21. A. Bobbio and K. S. Trivedi, An aggregation technique for the transient analysis of stiff Markov chains, IEEE Trans. Comput. C-35(9), 803-814 (1986). 22. A. Reibman, A splitting technique for Markov chain transient solution, In W. J. Stewart (Ed.) Numerical Solution of Markov Chains, pp. 373-384. MarceI-Dekker (1991). 23. G. H. Golub and C. F. Van Loan, Matrix Computations. Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 2nd edition (1989).

1841

24. C. Moler and C. F. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, SlAM Rev. 20(4) 801-835 (1978). 25. E. de Souza e Silva and H. R. Gail, Calculating availability and performability measures of repairable computer systems using randomization, J. ACM. 36(1), 171-193 (1989). 26. W. Stewart, A comparison of numerical techniques in markov modeling. Commun. ACM21(2), 144-152 (1978). 27. E. Cinlar, Introduction to Stochastic Processes. PrenticeHall, Englewood Cliffs, NJ (1975). 28. U. N. Bhat, Elements of Applied Stochastic Processes. John Wiley, New York (1984). 29. R. Karandikar, A proof of geometric convergence bound on the state probabilities of a DTMC with multiple recurrent classes. Private Communication (1990). 30. C.W. Gear, Numerical Initial ValueProblemsin Ordinary Differential Equations. Prentice-Hall, Englewood Cliffs, NJ (1971). 31. R. E. Bank, W. M. Coughran, W. Fichtner, E. H. Grosse, D. J. Rose and R. K. Smith, Transient simulation of silicon devices and circuits. 1EEE Trans. Comput. Aided Design 4, 436-451 (1985). 32. O. Axelsson, A class of A-stable methods, BIT9, 185 199 (1969). 33. J. D. Lambert, Numberical Methods for Ordinary Differential Systems. John Wiley (1991). 34. J. C. Butcher, Implicit Runge-Kutta processes, Math. Comp. 18, 50-64 (1964). 35. G. Birkhoff and R. S. Varga, Discretization errors for well set Cauchy problems, J. Math. Phys., 44, 1-23 (1965). 36. I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices. Oxford Science Publications, Oxford University Press (1986). 37. J. T. Blake and K. S. Trivedi, Reliability analysis of interconnection networks using hierarchical composition, IEEE Trans. Reliab. R-38(1), 111-120 (1989).