NUMERICAL TRANSIENT ANALYSIS OF MARKOV MODELS

Comput. Opns Res. Vol. 15, No. 1, pp. 19-36, 1988
0305-0548/88 $3.00 + 0.00 Copyright © 1988 Pergamon Journals Ltd
Printed in Great Britain. All rights reserved

ANDREW REIBMAN* and KISHOR TRIVEDI†
Department of Computer Science, Duke University, Durham, NC 27706, USA

(Received July 1986; revised May 1987)

Abstract: We consider the numerical evaluation of Markov model transient behavior. Our research is motivated primarily by computer system dependability modeling. Other application areas include finite-capacity queueing models, closed queueing networks, and inventory models. We focus our attention on the general problem of finding the state probability vector of a large, continuous-time, discrete-state Markov chain. Two computational approaches are examined in detail: uniformization and numerical linear multistep methods for ordinary differential equation solution. In general, uniformization provides greater accuracy but deals poorly with stiffness. A special stable ordinary differential equation solver deals well with stiffness, but it provides increased accuracy only at much greater cost. Examples are presented to illustrate the behavior of the techniques discussed as a function of model size, model stiffness, increased accuracy requirements, and mission time.

1. INTRODUCTION

This paper deals with the "transient" analysis of finite-state, continuous-time Markov chains (CTMC). Our research is motivated by the dependability analysis of computer systems. The analysis of nonrepairable or incompletely repairable systems requires a "transient" analysis of a finite-state Markov model [1, 2]. Other application areas include closed queueing networks [3], finite-capacity queueing models [4, 5], and inventory or logistical models [6].

We are particularly interested in evaluating the dependability of fault-tolerant computer systems. Spacecraft computers are required to provide interruption-free service for long missions. For these systems, we want to find the probability that a system operates without failure during a specific time-interval. Commercial fault-tolerant computer systems are often designed to provide high availability. For such highly available systems, we may be interested in the probability that the system is up at a given time t, or the expected fraction of time that the system is up in a given time-interval. For simplicity, holding times in system states are assumed to be exponentially distributed. Even in situations where this assumption is invalid, nonexponential holding times can often be satisfactorily approximated using a finite set of exponential phases.

We will examine three numerical techniques for finding the transient solution of large sparse Markov models: uniformization, an explicit differential equation solution method (Runge-Kutta), and a special stable implicit differential equation solution method (TR-BDF2). All three methods have sparse implementations. We will see that uniformization and Runge-Kutta achieve greater accuracy at less cost than TR-BDF2. Conversely, only TR-BDF2 deals with increasing stiffness efficiently; TR-BDF2 requires little extra computation as parameter size or mission time is increased.
The paper is organized as follows: in Section 2, we first mathematically formalize the idea of a Markov chain model. We then discuss the computational difficulties inherent in the transient analysis of Markov models. In Section 3, we present and analyze computational approaches to the transient analysis of Markov models. Finally, in Section 4, we use several examples to compare the competing methods empirically.

*Andrew L. Reibman received the B.S. degree in mathematics and computer science from Lenoir-Rhyne College, Hickory, NC, in 1982. He is currently a Ph.D. candidate in computer science at Duke University. His research interests include computer system reliability and performance modelling, fault-tolerant computing, and numerical analysis.

†Kishor S. Trivedi received the B.Tech. degree from the Indian Institute of Technology (Bombay), and M.S. and Ph.D. degrees in computer science from the University of Illinois, Urbana-Champaign. He is the author of a widely used text on Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall. Both the text and his related research activities have been focused on establishing a unified mathematical modeling foundation for computing system reliability and performance evaluation. Presently, he is a Professor of Computer Science and Electrical Engineering at Duke University, Durham, NC. He has served as a Principal Investigator on various AFOSR, ARO, Burroughs, IBM, NASA, NIH, and NSF funded projects and as a consultant to industry and research laboratories. He is an Editor of the IEEE Transactions on Computers.

2. PROBLEM FORMULATION

In this section, we define a continuous-time Markov chain (CTMC) model of system behavior. We then define various measures of system dependability that can be derived from the Markov model. Finally, we discuss the difficulties faced when numerically computing a CTMC transient solution.

A system's state at time t can be described by a CTMC {X(t), t ≥ 0} with discrete state space Ω. We will assume that the state space is finite and of size N. We let q_ij, i ≠ j, be the rate of transition from state i to state j. Define q_ii = -Σ_{j≠i} q_ij. The matrix Q = {q_ij} is the "infinitesimal generator" of the CTMC. We let η denote the number of nonzero entries in Q. If P_i(t) is the probability that the CTMC is in state i at time t, the row vector P(t) is called the "state probability vector" of the CTMC. The system of Kolmogorov differential equations describes the behavior of the state probability vector as a function of t:

P'(t) = P(t)Q,   P(0) = P_0.   (1)

The general solution for the entire system is given by

P(t) = P_0 e^{Qt},   (2)

where e^{Qt} is the "matrix exponential" [18] defined by the Taylor series

e^{Qt} = Σ_{i=0}^∞ (Qt)^i / i!.   (3)
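As a concrete illustration, the solution (2)-(3) can be evaluated for a small chain. The following sketch uses a hypothetical two-state repairable system (the failure rate lam and repair rate mu are invented example values, not from the paper) and a truncated Taylor series; direct series evaluation is only safe here because Qt is small.

```python
import numpy as np

# Hypothetical 2-state availability model: state 0 = up, state 1 = down.
# lam (failure rate) and mu (repair rate) are invented example values.
lam, mu = 0.5, 1.0
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])
P0 = np.array([1.0, 0.0])   # start in the up state
t = 2.0

def expm_series(A, terms=60):
    """Truncated Taylor series (3); adequate only when ||A|| is small."""
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for i in range(1, terms):
        term = term @ A / i
        E = E + term
    return E

Pt = P0 @ expm_series(Q * t)          # equation (2)

# Closed-form point availability of the 2-state model, for comparison:
pav = mu / (lam + mu) + lam / (lam + mu) * np.exp(-(lam + mu) * t)
```

For this small model Pt[0] matches the closed-form availability to near machine precision; as discussed in Section 3, the series loses accuracy for large Qt.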

For state i, the general solution can also be written in the form

P_i(t) = Σ_{j=1}^m Σ_{k=0}^{m_j - 1} a_{ijk} t^k e^{λ_j t},   (4)

where m is the number of distinct eigenvalues of Q, λ_j is the jth distinct eigenvalue of Q, m_j is the multiplicity of λ_j, and a_{ijk} is a constant.

Measures of system behavior can be derived using weighted sums of state probabilities. For example, if the CTMC models a single server queueing system, state i corresponds to having i jobs in the system. The average number of jobs in the system at time t is Σ_i i P_i(t). For an availability model, we specify two mutually-exclusive, collectively-exhaustive subsets of Ω: Ω_0, the set of operational system states, and Ω_f, the set of failure or down states. We define the indicator random variable

I(t) = { 1 if X(t) ∈ Ω_0,
       { 0 otherwise.   (5)

The probability that the system is operational at time t is the "point (instantaneous) availability", defined

PAV(t) = Pr{I(t) = 1} = E[I(t)] = Σ_{i∈Ω_0} P_i(t).   (6)

"System reliability" can be defined as

R(t) = Pr{I(u) = 1, ∀u ∈ [0, t]}.   (7)

If we treat Ω_f as a class of absorbing states, R(t) is 1 - Pr{absorption over [0, t)}. This quantity can be computed by deleting all arcs leading from states in Ω_f to states in Ω_0 and then computing R(t) = Σ_{i∈Ω_0} P_i(t).

Three factors hinder our evaluation of Markov models: size, ill-conditioning, and stiffness. The
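The absorbing-state construction can be made concrete with a small sketch (numpy, with invented rates) for a hypothetical three-state model: states 0 and 1 operational, state 2 down. Deleting the arcs out of Ω_f makes the down state absorbing, so R(t) is the probability mass remaining in Ω_0.

```python
import numpy as np

# Hypothetical 3-state model (rates invented): states 0, 1 operational, 2 down.
lam1, lam2, mu = 1.0, 2.0, 5.0
# Arcs out of the down state (row 2) are deleted, making it absorbing.
Q = np.array([[-lam1,          lam1,   0.0],
              [   mu, -(mu + lam2),  lam2],
              [  0.0,           0.0,   0.0]])
P0 = np.array([1.0, 0.0, 0.0])

def expm_series(A, terms=80):
    # Truncated Taylor series for the matrix exponential; fine for small ||A||t.
    E, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for i in range(1, terms):
        term = term @ A / i
        E = E + term
    return E

def reliability(t):
    P = P0 @ expm_series(Q * t)
    return P[0] + P[1]          # mass still in the operational states

R1, R2 = reliability(0.5), reliability(1.0)
```

Because the down state is absorbing, R(t) is nonincreasing in t, unlike the point availability PAV(t).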


models of interest tend to be large, often having more than 10,000 states. The solutions of ill-conditioned problems are extremely sensitive to small changes in the input. Stiff problems have solution components that vary rapidly relative to the length of the solution interval. In the remainder of this section, we examine these three problems in greater detail.

In Markov modeling, the most common computational difficulty is size. In a dependability model, the number of states in the model grows exponentially with the number of components in the system under study. For a closed queueing network with j jobs and s servers, there are (j+s-1 choose s-1) states in the corresponding CTMC. Modeling component interaction (nonindependence) and approximating non-Markovian behavior both enlarge the state space significantly. Thus, sparse matrix data structures are required [2, 7]. Even though a problem has a straightforward solution in theory, in practice, as the size of the matrix grows, computational difficulties increase. For example, a transient solution method that does not preserve sparsity is unacceptable for most large problems. Techniques to reduce the size of the state space, either at the model [8] or matrix level [9], simplify the analysis of larger and more complex models.

In addition to difficulties due to size, numerical problems may be ill-conditioned. A matrix is ill-conditioned with respect to an operator if it is possible for small changes in the matrix to produce large changes in the solution. Recall that we can write the general solution of the matrix differential equation (1) as P_0 e^{Qt}. The "matrix exponential condition number" provides an estimate of the numerical difficulty (ill-conditioning) encountered when computing a particular matrix exponential [10, 11]. Small perturbations in matrices with large exponential condition numbers can produce large changes in the solution. Several authors have examined ways to estimate the matrix exponential condition number.
In particular, the matrix exponential condition number is bounded below by ||Q||t [12], and above by ||Q||p(t), where p(t) is a polynomial in t with degree equal to the maximum multiplicity of any eigenvalue of Q [13]. The relationship between the condition of a linear initial-value problem like (1) and the condition of a corresponding matrix exponential is not exact. Some work on bounding the errors in the solution of (1) as a function of errors in the initial conditions has been done in the context of reliability [7]. If each parameter value has an associated upper and lower bound, optimistic and conservative estimates of system reliability can be computed by using the bound values in place of the parameters. Further work on conditioning and numerical error bounding is needed.

Although size and ill-conditioning present problems, stiffness is the difficulty that most often characterizes dependability models. A classical definition of stiffness follows: a linear system ẏ = yQ + φ(t) is said to be stiff if, for i = 1, 2, ..., m, Re(λ_i) < 0 and max_i |Re(λ_i)| ≫ min_i |Re(λ_i)|, where the λ_i are the eigenvalues of Q [14]. A close analog to this stiffness definition is the common assumption in Markov modeling that stiffness in the system of equations (1) somehow corresponds to having Markov transition rates of greatly different orders of magnitude. In particular, a Markov chain is often thought to be stiff if it has one or more states with competing exit transitions with greatly different rates. However, what really affects solver performance is the presence of rapidly changing solution components. If the solution of a linear system is expressible as a sum of exponential terms (4), then the presence of large rate constants for the exponential terms leads to rapid decay in the corresponding solution components and makes the system stiff. Because "rapid" is defined relative to the length of the solution interval, stiffness is a function of time-scale.
The classical definition omits this overt dependence on time-scale and overemphasizes the dependence on the eigenvalue ratio. Because failures and repairs, the interesting events in a fault-tolerant computer system model, often occur at widely differing rates, choosing a satisfactory time-scale for a dependability model can be difficult. In general, the time-scale is chosen as a function of the slow failure rates. Repairs and reconfigurations thus occur rapidly relative to the size of the solution interval. This choice of time-scale, based on the slow rates, is implicit in the classical definition of stiffness. In general we will use a computationally oriented definition of stiffness that depends explicitly on the solution interval. A system of differential equations is said to be stiff on the interval [0, t), if there


exists a solution component of the system that has variation on that interval that is large compared to 1/t [15]. A component with large variation changes rapidly relative to the length of the interval. Thus, an informal metric for stiffness is the product of the largest eigenvalue of the CTMC generator matrix with the length of the solution interval. The largest eigenvalue can be approximated by the largest total exit rate from any one state in the CTMC. Using the Gershgorin circle theorem [10], the magnitude of the largest eigenvalue is bounded above by twice the largest total exit rate from any one state in the CTMC. For uniformization in particular, we will see that computational complexity corresponds directly to the product of the length of the solution interval with the magnitude of the largest total exit rate.

3. NUMERICAL SOLUTION METHODS

To find the state probabilities as a function of time, we need to evaluate a system of first order linear differential equations with constant coefficients (1). A variety of solution methods are available. For hand computation on small or special case models, the general solution is often found by deriving and symbolically inverting Laplace transforms [3]. Analytic Laplace transform inversion yields state probabilities as functions of t. However, it has worst-case computational complexity O(N⁴) and requires that the eigenvalues of the transition matrix be accurately determined [1, 16]. Numerical Laplace transform inversion using Fourier series [16] holds more promise. Unfortunately, both analytic and numerical Laplace transform inversion are numerically unstable. One area where this type of approach is successful is for acyclic CTMC; [17] provides an efficient solution procedure based on the closed-form solution of integral equations. For acyclic sparse matrices with η nonzero elements, the closed-form transient solution algorithm requires only O(η) operations.

A second approach is to evaluate the matrix exponential series (3) directly. Unfortunately, if t ≫ 0, the exponential series has poor numerical properties even for small problems. In particular, the series is often subject to severe round-off error [4]. Thus direct evaluation of (3) is not a viable general solution technique for Markov models. Alternate methods for evaluating the matrix exponential exist, but they generally are not efficient for solving the linear initial-value problem [11]. For small t, the matrix exponential series behaves acceptably; it provides the basis for the Runge-Kutta differential equation solution technique that will be discussed later in this section.

The classical techniques for numerical solution of (1) first find the eigenvalues and associated eigenvectors of Q. Many techniques can be used for this analysis; the most widely used is the QR algorithm [10].
Note that for large matrices, it is difficult to accurately generate the entire eigensystem. Once the eigensystem is computed, the solution can be obtained using the well-known Lagrange-Sylvester formula [18]. This method has complexity O(N⁴) when all the eigenvalues are distinct (and higher otherwise), making this approach impractical for solving large dependability models [1]. Diagonalization, block diagonalization, and Jordan decomposition are common alternative methods [10]. However, all three methods are often numerically unstable. When dealing with large sparse models, computational difficulties arise as well. Diagonalization is not sparsity preserving; the amount of storage needed in the computation will be O(N²) rather than O(η).

In this section, we present and analyze two general approaches that are practical for the numerical transient analysis of large Markov models: uniformization and linear multi-step methods for numerical ordinary differential equation solution. We will concentrate our analysis on three main topics: computational complexity, accuracy, and stiff stability. We will discuss computational complexity and stiff stability as they arise in our analysis.

Before we begin, we define terminology needed to discuss accuracy and error control. The difference between the actual and computed solutions is called the "error". A knob on a code that attempts to control the error is called the "tolerance". Several issues need to be considered: How is error measured? How is error estimated? Given a tolerance, how do we control the error? How well does a particular error-control method perform? What does additional accuracy cost? In general, we will restrict our attention to absolute error, the unscaled difference between the actual and computed solution vectors. The "error vector" is the vector difference between the actual and computed solution vectors. If we apply a vector norm to the error vector, we get a scalar error measure.
For dependability models, the most widely used norms are the L₁ and L∞ norms. The L₁ vector norm is the sum of the magnitudes of each component of a vector. The L∞ vector norm is the


absolute value of the largest magnitude component of a vector. In some applications other error measures are appropriate, e.g. relative error bounds. These alternate measures add complexity to our analysis and tend to be problem or algorithm specific. More commonly considered will be error bounds on derived dependability measures. For example, in one of our empirical studies, we will consider the accuracy of availability estimates given by various transient solvers. We will obtain the actual error for this example by comparing the evaluation of a closed-form expression for the solution with the numerically computed result.

Error comes from two sources: round-off and truncation. Because computer arithmetic has finite precision, round-off error is introduced in every arithmetic operation. In our experience using double precision computation for the transient solution algorithms, total round-off error is less than 10⁻¹⁰. We can insure that significant round-off error is not occurring by repeating the computation with higher precision arithmetic and checking the resulting solutions for discrepancies. The major source of significant error in the algorithms we consider is "truncation error", the difference between the actual and computed solutions (assuming no round-off error is present). For instance, uniformization computes a finite truncation of an infinite series.

Controlling error is a major problem in many numerical applications. Solutions often must be accurate to a prescribed tolerance. Given such a tolerance, we want to avoid failing to meet this tolerance or spending extra computation to greatly exceed it. To control the error, it must be bounded or estimated. For our application, we will concentrate on controlling truncation error. If we used higher tolerances, round-off error would be more significant and would also require consideration. The function computed by a numerical procedure is only an approximation of the actual function.
We can expand the computed function with a Taylor series, and compare it to a similar Taylor series expansion of the actual function. The "principal error" is the difference between the largest term of the actual series not matched by the computed function series and the corresponding term of the computed function series [14]. Principal error is often easily computed, and thus a candidate error estimator. It is widely used in numerical differential equation solution and usually estimates at least the correct order of magnitude of the error. We call an error estimator "conservative" if the actual error is less than the estimated error. Unfortunately, principal error is not conservative. In contrast, uniformization will yield a bound on the total mass of the truncated series, a conservative estimate of truncation error.

3.1. Uniformization

Though the direct evaluation of the matrix exponential is not a practical method for solving dependability models, other approaches based on the evaluation of infinite series are more promising. We describe the uniformization (or randomization) technique based on the reduction of a CTMC to a discrete-time Markov chain (DTMC) subordinated to a Poisson process [4, 19, 20, 21]. From a CTMC we obtain a DTMC Z = {Z_n, n = 0, 1, ...} and a Poisson process {N(t), t ≥ 0}. N(t) is independent of Z and has rate q. We form X(t) = Z_{N(t)} in the following way: let q ≥ max_{1≤i≤N} |q_ii|. Consider an equivalent process for which the transition rate from state i is q and a fraction 1 - |q_ii|/q of these transitions return immediately to state i. The transition probability matrix of the DTMC Z is Q* = Q/q + I. This process is the "uniformization" (or randomization) of the original CTMC. Once a Markov process is uniformized, the transient solution is computed by conditioning on N(t), the number of transitions in [0, t]:

P_j(t) = Pr{X(t) = j} = Σ_{i=0}^∞ Pr{X(t) = j | N(t) = i} Pr{N(t) = i}.   (8)

Thus

P(t) = Σ_{i=0}^∞ Π(i) e^{-qt} (qt)^i / i!,   (9)

where Π(i) = {Π_1(i), Π_2(i), ..., Π_N(i)} is the state probability vector of the subordinated DTMC Z with Π(0) = P(0). Note that the DTMC transient state probability vector can be computed by successive matrix-vector multiplications

Π(i) = Π(i - 1)Q*.   (10)

In the rest of this subsection, we will first discuss computation and truncation of the Poisson distribution. We then consider DTMC state probability vector computation. Finally, we discuss uniformization implementation and computational complexity.

3.1.1. Poisson process computation and truncation. The Poisson probability mass function needs to be computed to scale the terms of the uniformization series. Poisson probability estimates are also needed to show where the uniformization series should be truncated to meet desired error tolerances. Tight truncation accelerates the computation [19]. The original error bound for uniformization was based on a simple right-side truncation of the Poisson distribution. If the "right truncation point" is k, the series is truncated after the kth term. Here, the truncation error in each element is bounded by ε, where ε is easily computed using

ε ≤ 1 - e^{-qt} Σ_{i=0}^k (qt)^i / i!.   (11)

This error equation can be used before the actual computation to bound the number of terms needed to obtain a desired absolute error tolerance. The computed bound might not satisfy a relative tolerance as it does not consider the computed solution vector. We emphasize that this approach gives an error bound; truncation error in each solution component is guaranteed not to exceed the amount of mass truncated from the Poisson distribution. This is a major advantage of uniformization.

Although computing the Poisson probability mass function is not difficult, the Poisson distribution still causes difficulties. As qt grows, the Poisson distribution thins. Using only right-side truncation, the number of significant terms in the Poisson distribution grows linearly with qt. The terms on the left side grow smaller and less significant. At some point it becomes appropriate to begin the sum (9) not at 0, but at some integer l > 0, called the "left truncation point". To bound the left tail of the Poisson distribution we can use a normal approximation [19]. This is inaccurate for smaller values of qt; but, if qt is small we will not use left truncation. For ε = 10⁻¹⁰, left truncation is not used until qt > 25 [22]. Once the truncation point is identified, a sum of logarithms is used in place of a round-off prone series of multiplications to compute the first significant Poisson probability. To achieve a fixed absolute accuracy as qt grows, the number of terms required for uniformization is O(√(qt)) [22]. The size of the left tail grows linearly in qt. Also, as qt grows, the computation becomes more sensitive to round-off error. In [22], this round-off error is minimized by first determining the endpoints and middle of the truncated Poisson distribution. The small terms near the endpoints of the truncated distribution are then added first, reducing round-off error.
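A minimal sketch of uniformization with right-side truncation only (numpy; the chain and tolerance are invented). The right truncation point is found on the fly from the bound (11), and the Poisson terms are generated by the recurrence p_i = p_{i-1}·qt/i; for large qt, the left truncation and log-sum start described above would be needed to avoid underflow of e^{-qt}.

```python
import numpy as np

def uniformize(Q, P0, t, eps=1e-10):
    # q strictly greater than max |q_ii| keeps the DTMC aperiodic.
    q = 1.02 * np.abs(np.diag(Q)).max()
    Qstar = Q / q + np.eye(Q.shape[0])     # DTMC transition matrix
    qt = q * t
    # NOTE: e^{-qt} underflows for qt beyond ~745; a log-based start and
    # left truncation, as in the text, are required in that regime.
    pois = np.exp(-qt)                     # Pr{N(t) = 0}
    mass = pois
    Pi = P0.copy()                         # DTMC vector, equation (10)
    P = pois * Pi
    i = 0
    while 1.0 - mass >= eps:               # right truncation via bound (11)
        i += 1
        Pi = Pi @ Qstar
        pois *= qt / i
        mass += pois
        P += pois * Pi
    return P

# Hypothetical 2-state chain (invented rates), checked against the
# closed-form availability of the model:
lam, mu = 0.5, 1.0
Q = np.array([[-lam, lam], [mu, -mu]])
P = uniformize(Q, np.array([1.0, 0.0]), t=2.0)
pav = mu / (lam + mu) + lam / (lam + mu) * np.exp(-(lam + mu) * 2.0)
```

Because the truncated Poisson mass bounds the error in every component, the computed P is guaranteed to be within eps of the true state probability vector, component by component.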
For uniformization, this type of approach might be useful, but would incur an overhead penalty since the DTMC state probability vectors must be computed out of order.

3.1.2. DTMC analysis. Uniformization can be accelerated by exploiting the behavior of the subordinated DTMC. If the Poisson probability mass function is initially truncated on the left (for large qt), the first DTMC state probability vector needed is at the truncation point. If successive matrix-vector multiplications are used to compute this first significant DTMC state probability vector, the computation cost increases linearly in qt. Computing (Q*)^l by squaring replaces O(qt) matrix-vector multiplications by O(log(qt)) matrix multiplications. For large enough qt, such an approach may save a significant amount of computation time. Unfortunately, matrix multiplications usually result in fill-in. Thus for most large sparse problems, this approach is infeasible because of space limitations.

When the DTMC reaches steady-state, the DTMC state probability vector is invariant. Because no further matrix-vector multiplications need be done, the remainder of the series can be computed cheaply. Since significant errors might be introduced if the DTMC analysis is halted prematurely, care needs to be taken to insure that the steady-state vector actually exists and that the DTMC solution is sufficiently close to its steady-state value. For example, choosing q strictly greater than the largest |q_ii| insures that the DTMC will be aperiodic.
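The steady-state shortcut can be sketched as follows (numpy; the chain and stopping tolerance are invented): once successive DTMC vectors agree to within the tolerance, the remaining matrix-vector products can be skipped and the tail of series (9) accumulated with the fixed vector.

```python
import numpy as np

# Hypothetical 2-state chain (invented rates).
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])
q = 1.1 * np.abs(np.diag(Q)).max()   # strictly > max|q_ii|  =>  aperiodic DTMC
Qstar = Q / q + np.eye(2)
Pi = np.array([1.0, 0.0])            # Pi(0) = P(0)

steps = 0
for i in range(1, 10000):
    Pi_next = Pi @ Qstar             # equation (10)
    steps = i
    if np.abs(Pi_next - Pi).sum() < 1e-12:   # L1 change below tolerance
        break
    Pi = Pi_next

# Pi_next is now (nearly) the DTMC steady-state vector; the remaining terms
# of series (9) can be accumulated without further multiplications.
```

For this chain the DTMC steady-state vector is (1/3, 2/3), reached in a few dozen multiplications; a production code would also verify that the steady-state vector exists before stopping early.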


If we exploit the exchangeability property of events in a Poisson process [19], we need examine only the number of visits to each state and can neglect the order of the visits. This yields a method of computing availability (but not the entire state probability vector) that avoids much of the computational effort expended in evaluating DTMC behavior. If the eigenvalues of Q* are distinct, diagonalizing Q* may accelerate the computation by replacing matrix-vector multiplies with vector cross-products. More efficient computation of DTMC state probability vectors would improve uniformization's performance.

3.1.3. Computational complexity. We discuss uniformization's computational complexity for three separate cases: the original uniformization algorithm that employs only right truncation of the Poisson distribution; uniformization with left and right truncation, but conventional computation of the first significant DTMC vector; and uniformization with left and right truncation and logarithmic computation of the first significant DTMC vector.

For a given error tolerance, uniformization without left-side truncation requires O(qt) terms. Chains with η nonzero elements in their sparse generator matrices should then require O(ηqt) operations for computing the state probability vector. If left-side truncation is used without improved initial DTMC vector computation, O(ηqt) operations would still be required. Given an error tolerance, uniformization requires O(√(qt)) terms if the distribution is truncated from both sides. The number of terms in the left tail, l, is O(qt). Successively squaring Q* to get the first significant DTMC state probability vector requires time O(N³ log l). This yields a total computational complexity for uniformization of O(N³ log qt + η√(qt)) operations. Therefore, if problems with storage space are avoidable, this approach will be useful when N is moderate and qt is large.
When qt is large, a possible approach is to split the solution interval and use several uniformizations, one for each sub-interval. However, error estimation for a multi-step approach is more difficult. Furthermore, if left truncation is employed with an improved DTMC evaluator, the computational complexity of each step is sub-linear in t. In this case, no asymptotic cost savings will be realized. In Appendix A, we give a simple sketch of the uniformization algorithm. Multiple solution points can be handled simultaneously without much extra computation [21].

3.2. Numerical differential equation solution

3.2.1. Theory. We now briefly outline the theory behind linear multi-step methods for numerical differential equation solution [14]. We touch briefly on the concepts of A-stability and L-stability. We are particularly interested in solving the linear initial-value problem ẏ(t) = y(t)Q = f(t, y(t)). We choose a solution approach that was proposed for use in a similar problem (circuit simulation) [23]. The essential element of a solution method is a discretization; we seek a solution not on the continuous interval t ∈ [a, b], but on a discrete point set {t_i | i = 0, 1, ..., n}, where h = (b - a)/n is the "step length". For purposes of analysis, assume h is fixed; in actual implementations, step length will vary. Let y_i be an approximation of the theoretical solution at t_i and f_i = f(t_i, y_i). A linear k-step method is a computational method for determining {y_i} that takes the form of a linear relationship between y_{i+j} and f_{i+j}, for j = 0, 1, ..., k. The general linear multi-step method with step length h can be written

Σ_{j=0}^k α_j y_{i+j} = h Σ_{j=0}^k β_j f_{i+j},   (12)

where the α_j and β_j are constants, α_0 and β_0 are not both zero, and α_k = 1. A method is explicit if β_k = 0 and implicit if β_k ≠ 0. An implicit method requires the solution of a set of algebraic equations at every solution step. In return for this computational overhead, implicit methods usually are more accurate and have better stability properties.

If we consider a single application of a solution operator R, the local truncation error (LTE) for a given step is |R(h, t_i, y_i) - ȳ(h, t_i, y_i)|, where ȳ denotes the actual value of the solution given that the approximation obtained from the previous step was correct. R gives the approximate solution obtained using the same initial value. A method's order is the largest integer p such that the LTE is O(h^{p+1}).
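For the linear system (1), the algebraic equations of an implicit step reduce to one linear solve per step. A minimal sketch (numpy, with invented rates) using backward Euler, the simplest implicit method, applied to P' = PQ: the step is P_{i+1}(I - hQ) = P_i. (Backward Euler is only first order; the text develops TR and TR-BDF2 for better accuracy.)

```python
import numpy as np

# Hypothetical, mildly stiff 2-state chain (invented rates).
Q = np.array([[-100.0, 100.0],
              [   1.0,  -1.0]])
P = np.array([1.0, 0.0])
h = 0.1
I = np.eye(2)

# Implicit (backward Euler) step: P_{i+1} (I - h Q) = P_i.
# For the row-vector system we solve the transposed equations.
P_next = np.linalg.solve((I - h * Q).T, P)

# Because the rows of Q sum to zero, (I - hQ) maps the all-ones vector to
# itself, so the step preserves total probability exactly.
```

Note that hλ ≈ -10 here, far outside the stability interval of explicit Runge-Kutta discussed below, yet the implicit step remains well behaved.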


Most conventional difference approximations to systems of linear ordinary differential equations converge to the correct solution as the step length h → 0. Conventional explicit Runge-Kutta methods perform acceptably in the transient solution of nonstiff Markov models [4]. However, in stiff systems, h may need to be intolerably small before acceptable accuracy is obtained, resulting in unacceptable round-off errors and computation time requirements. Special methods are needed either to deal with stiff equations without greatly restricting step size, or to eliminate stiffness altogether.

A method is said to be A-stable if the numerical approximations y_n of Y(t_n) approach zero as n → ∞ when it is applied to the differential equation Y' = λY with a fixed positive h and a (possibly complex) constant λ with Re(λ) < 0 [14]. In contrast, if λ is real, 4th-order explicit Runge-Kutta, a commonly used explicit ODE solution technique [24], requires that hλ be between 0 and -2.5. Thus as λ grows in magnitude, h must shrink proportionately. A well-known theorem of Dahlquist states that (a) all A-stable linear multistep methods are implicit, (b) no A-stable linear multistep method can have order greater than two, and (c) the trapezoid rule (TR) is the A-stable linear multistep method with the smallest LTE. To integrate from t = t_i to t_{i+1} = t_i + h_i with TR, we use the formula

2Y_{i+1} - h_i f_{i+1} = 2Y_i + h_i f_i.    (13)

When computing the state probabilities with equation (1), this is computationally the same as solving the first-order linear algebraic system

P_{i+1}(2I - h_i Q) = P_i(2I + h_i Q).    (14)

Unfortunately, for extremely stiff problems, even A-stability may be inadequate. If TR is applied to the test problem with Re(λ) << 0 and h not small, successively computed values will have the same magnitude but opposite signs. This phenomenon is called "ringing" and may introduce large numerical errors. Unless h is small, the trapezoid rule may fail to reach the correct steady-state value as t → ∞. Because transient solvers are often used to model a system from start-up until steady state, it is important that the transient solution technique yield the correct steady-state solution as t → ∞. Therefore, simple TR is unacceptable.

A stronger condition than A-stability is L-stability. A one-step numerical method is said to be L-stable if it is A-stable and, in addition, when applied to the scalar test equation y' = λy, where λ is a complex constant with Re(λ) < 0, it yields |y_{i+1}/y_i| → 0 as Re(hλ) → -∞ [14]. If TR, which is self-starting and has good local truncation error, is combined with the 2nd-order backward difference formula (BDF2), an L-stable method, the result is an L-stable, self-starting two-step method that is well suited for our problem [23]. To compute a step of the difference approximation, we first take a TR step from t_i to t_i + γh, where γ is a constant between zero and one. Then to step from t_i + γh to t_{i+1}, we use a BDF2 step given by

(2 - γ)Y_{i+1} - (1 - γ)h f_{i+1} = γ^{-1} Y_{i+γ} - γ^{-1}(1 - γ)^2 Y_i,    (15)

which for (1) becomes the linear algebraic system

P_{i+1}((2 - γ)I - (1 - γ)h_i Q) = γ^{-1} P_{i+γ} - γ^{-1}(1 - γ)^2 P_i.    (16)

Here, Y_{i+γ} and P_{i+γ} denote the intermediate solutions obtained by the TR step from t_i to t_i + γh. If nonlinear problems are considered, γ = 2 - √2 ≈ 0.59 has advantages [23]. For our linear problem, we have varied γ over a range of values and have seen little change in performance. So, in our empirical studies, we will continue to use the value γ = 0.59. We will see that the two-step TR-BDF2 method provides reasonable accuracy and excellent stability on stiff problems.

3.2.2. Implementation and computational complexity. An ODE solver code first applies the solution method at the initial time point t_0, with step length h_0. The method is then applied at the point t_0 + h_0 with step length h_1, using the result of the initial step as the new initial vector. This process is continued until the end of the solution interval is reached. To better control error and provide a more efficient solution, h_i will be varied from step to step. If too much error is introduced, the step is repeated with smaller h. After a successful step, h may be increased.
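The stability claims above can be checked on the scalar test equation. The following sketch (our own, not the paper's code) computes the amplification factor y_{i+1}/y_i of one TR-BDF2 step as a function of z = hλ, by composing the TR stage with the BDF2 stage (15):

```python
# Amplification factor of one TR-BDF2 step on y' = lam*y, with z = h*lam
# and gamma = 0.59. The TR stage gives y_g = y_i*(2 + g*z)/(2 - g*z);
# substituting into the BDF2 stage yields the ratio below.

def trbdf2_amp(z, g=0.59):
    r_tr = (2 + g * z) / (2 - g * z)        # TR stage amplification
    num = (r_tr - (1 - g) ** 2) / g         # right-hand side divided by y_i
    den = (2 - g) - (1 - g) * z             # coefficient of y_{i+1}
    return num / den

print(abs(trbdf2_amp(0.0)))               # 1.0: consistency
print(abs(trbdf2_amp(-1e6)))              # near 0: L-stability
print(abs((2 + (-1e6)) / (2 - (-1e6))))   # TR alone tends to 1: "ringing"
```

The last line shows why plain TR rings on very stiff components while TR-BDF2 damps them.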


To regulate step size and provide the basis for global error bounding, we need to estimate the LTE. In [23], a Taylor series is used to derive the principal truncation error term for a step of TR-BDF2: C h_i^3 y^(3)(s), where C = (-3γ^2 + 4γ - 2)/(12(2 - γ)). Given the equation for the form of the error, several techniques of LTE estimation are viable. They include Milne's device, predictor-corrector, and direct estimation of the third derivative. We reject the use of Milne's device as it nearly doubles the amount of computation required for solution. At present, we use a divided difference estimator suggested in [23]:

LTE_i ≈ 2C h_i [γ^{-1} f_i - (γ(1 - γ))^{-1} f_{i+γ} + (1 - γ)^{-1} f_{i+1}].    (17)

This estimator is inexpensive to compute and seems sufficiently accurate. If at some point it is found inadequate, a predictor-corrector scheme based on TR-BDF2 could be derived using the techniques of [14] for a more accurate (and more expensive) estimation of LTE. Once the LTE vector is derived, a scalar error estimate is computed using an appropriate vector norm. Relative error tests could be used, but a pure relative error test is difficult if components of the solution vector are near zero. With the scalar error estimate, we can estimate the optimal step size

h* = h_i (local tolerance / LTE_i)^{1/(order+1)}.    (18)

Here order is the order of accuracy of the single-step method. If the local error is greater than the desired limit, the step size is decreased and the step repeated. If the error is acceptable, the step size may be increased. To prevent unnecessary step failures, no increase of the step size should be allowed after a step failure (and possibly this restriction should be retained for several steps). Bounds on maximum increase and minimum decrease prevent "chattering", an excessive number of step failures.

Note that we are controlling and measuring the error in a given solution step, the local error. What we actually desire is an estimate of the global error, the total (truncation) error in the final solution. Open research areas in numerical analysis include the derivation of global errors from local errors and the accurate choice of local tolerance given some desired global error bound. The usual procedure in numerical packages is to specify a local tolerance and report the resulting global error estimate after the solution is finished. The sum of local errors is often used as a crude global error estimate. A pure application of this heuristic would neglect the fact that further approximation might magnify the errors made early in the computation. We assume that bounding local errors will in some rough sense bound the global solution errors. Furthermore, once the transient phase of the solution is completed, the error in the remaining solution will tend to be small. At time t_i, for step i of length h_i, a simple approach to error control is to bound the local error by

LTE_i ≤ [(global error tolerance - Σ_{j<i} LTE_j) / (mission time - t_i)] h_i.    (19)

This approach controls the local error per unit step (LEPUS) and produces excellent results for many problems. However, it is usually costly. If the solution interval is long, i.e. close to the time needed for the CTMC to reach steady-state, the tolerances on the initial steps are excessively tight. An alternative method is to control the local error per step (LEPS). In LEPS control, the local error in each step is required to be less than the prescribed tolerance. LEPS control is the most widely used error-control method, and it has produced acceptable results for our problems. Estimated errors are usually within one order of magnitude of the actual error. Alternative techniques for error estimation and control exist but are more expensive. For example, using Richardson extrapolation to compute the global error increases computation by a factor of three. In Appendix B, we give a sketch of a solver algorithm. The algorithm is independent of the actual method used to compute the individual steps and error estimates.

The computational complexity of an explicit ODE solver is normally estimated by the number of function evaluations required. In the CTMC case, a function evaluation is a matrix-vector multiply. For an explicit ODE solver, the total number of function evaluations is the number of function evaluations per step times the number of steps. For Runge-Kutta, step length decreases linearly with


q, so the number of steps required is O(qt). The total cost will be O(N^2 qt) FLOPS for full matrices and O(ηqt) FLOPS for sparse matrices. The computational complexity of an implicit ODE solver depends heavily on the method used to solve the resulting linear systems. If a full direct linear system solver is used, then each grid point costs O(N^3) FLOPS. If a sparse iterative method is used, individual iterations cost O(η) floating-point operations. In practice, the number of iterations used per linear system solve (l) is usually < 10, and often is only 2 or 3, even when using a simple Gauss-Seidel linear system solver. The total cost of the implicit ODE solution will then be O(ηnl) FLOPS for a sparse CTMC, where n is the number of steps required.

Numerical ordinary differential equation solution has several advantages. The solution is obtained on a grid of points (rather than at a single time point) at no additional cost, making it easy to examine the solution as a function of time. Nonhomogeneous CTMC can also be analyzed with the same approach. Finally, approximation techniques for the solution of differential equations may speed the solution and reduce difficulties with stiffness [15]. Disadvantages stem mainly from the difficulty of accurate error estimation.

4. EXAMPLES

In this section, we empirically compare uniformization and ODE solution methods. In addition to the TR-BDF2 method, we also include fourth-fifth order Runge-Kutta-Fehlberg (RKF45), a commonly used explicit ODE solution technique [24]. Our first example is a model of the C.mmp multiprocessor system [25]. Our second example is an M/M/1/k queue [5,26]. Both examples have a closed-form transient solution, allowing us to compute their actual solutions for purposes of comparison. We conduct several trials of the various solution methods. In particular, we consider (a) CPU time needed as a function of problem size, (b) accuracy achieved as a function of error tolerance, (c) computation required as a function of accuracy achieved, (d) computation required as a function of model parameters (stiffness), and (e) computation required as a function of mission time. The examples reflect the general trends that we have observed over many examples.

For our experiments, we will use codes based on the algorithms outlined in the appendices. Our uniformization code has two simplifications: it has no DTMC starter and does not check for DTMC steady-state. No uniformization code we know of currently uses an improved scheme for determining the first significant DTMC state probability vector when the Poisson distribution is left-truncated. All codes (including ours) use successive matrix-vector multiplications to compute this initial vector. Thus the computational savings possible with left truncation are not realized, although there are still numerical benefits. In addition to basic execution time measures for our code, we also estimate the execution time that would be needed by a code using an improved DTMC starter. To compute this estimate, we discount the computation used to determine the DTMC state probability vectors that are associated with the truncated left tail. This adjusted estimate is optimistic; it underestimates the execution time of uniformization if an improved DTMC starter were actually used. Our second uniformization simplification is the omission of a DTMC steady-state check. No uniformization code that we know of currently checks for DTMC steady-state. We have been unhappy with the performance of our code in this area and thus have omitted any attempt to recognize DTMC steady-state. This omission may increase uniformization execution time.

The RKF45 and TR-BDF2 codes use the same error and step-size control methods. The actual stepping algorithms and LTE estimators are called as subroutines. The only major difference in the code mainline is the use of different values for the order in equation (18). In all ODE solver runs, we control the error using the L∞ norm and a LEPS error-control strategy. The uniformization code also uses an L∞ norm to measure and control error.

Our first example is a model of the C.mmp multiprocessor system [25]. The C.mmp system consists of 16 processors and 16 memories connected by a switch. A diagram of the system is found in Fig. 1. System status can be described by a 578-state CTMC with 2754 transition arcs (Fig. 2). Each processor and its associated circuitry fails at rate 68.9 failures/10^6 hr. Each memory and its associated circuitry fails at rate 224 failures/10^6 hr. The switch fails at rate 202 failures/10^6 hr. We assume that each component has an independent repairman. Time to recovery is exponentially distributed with


Fig. 1. C.mmp system.

Fig. 2. C.mmp system CTMC.

mean 1 hr. To function, the system requires that at least four processors, four memories, and the switch be operational. We will use a "baseline" model with the original failure and repair rates, t = 1.0 hr, and tolerance ε = 10^-9. This error tolerance yields a solution with at least eight correct decimal places for all the solvers. One reason we use this model is the existence of a closed-form transient solution. The instantaneous availability of an individual component with failure rate λ and repair rate μ is given by

A(t) = μ/(λ + μ) + [λ/(λ + μ)] e^{-(λ+μ)t}.    (20)


Fig. 3. M/M/1/k system CTMC.

Fig. 4. M/M/1/k model - CPU time as a function of size.
Given individual component availability, system availability can be determined combinatorially. We can compute exact solutions using an arbitrary-precision desk calculator program. Using the exact solution, we can measure the actual error that occurs when solving this problem with a numerical solution procedure.

Our second example is the M/M/1/k queue. With k buffers in the queue, the resulting CTMC has k + 1 states and 2k nonzero transitions. The "baseline" model we consider will have arrival rate λ = 90.0, service rate μ = 100.0, 49 buffers, and mission time t = 1.0. Unless otherwise stated, we will use this model and a tolerance of ε = 10^-6. This error tolerance yields a solution with at least five correct decimal places for all the solvers. Like the C.mmp example, the M/M/1/k queue has a closed-form transient solution. The resulting state probabilities can be determined using the expression from [26]. Here n denotes the number of jobs in the system at time t, m denotes the initial number of jobs in the system, ρ is λ/μ, and π is the steady-state solution. The eigenvalues, γ_i, are given by

γ_i = λ + μ - 2√(λμ) cos(iπ/(k + 1)),    i = 1, 2, ..., k.
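A short sketch (ours, not the authors' code) computes these eigenvalues for the baseline queue; the ratio of the fastest to the slowest decay rate is the classical stiffness ratio, which grows as the service rate μ is increased:

```python
import math

# Relaxation eigenvalues of the M/M/1/k CTMC from the expression above,
# for the baseline queue (lambda = 90, mu = 100, 49 buffers).

def eigenvalues(lam, mu, k):
    return [lam + mu - 2 * math.sqrt(lam * mu) * math.cos(i * math.pi / (k + 1))
            for i in range(1, k + 1)]

g = eigenvalues(90.0, 100.0, 49)
print(min(g), max(g))      # slowest and fastest decay rates
print(max(g) / min(g))     # classical stiffness ratio
```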

4.1. Size

In our first experiment, we consider the effect of varying model size on the execution time of the solution algorithms. We also consider the amount of computation spent on matrix-vector multiplies. We use the M/M/1/k model because its size is easily varied without significantly altering the underlying structure. In Fig. 4, we plot the amount of CPU time used by various solvers as a function of k, the number of buffers in the M/M/1/k queue. We see that the CPU time increase appears linear in the number of states in the chain. We note that the number of nonzero, off-diagonal entries in the M/M/1/k queue CTMC generator matrix is 2k, so algorithms with O(η) time complexity also run in time O(N). This linear increase in execution time shows the utility of a sparse matrix approach. If a sparse approach were not used, the curves would be cubic in the number of states, rather than linear.
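The sparse vector-matrix product underlying this linear scaling can be sketched as follows (our illustration with made-up helper names; the generator is built from the birth-death rates of the M/M/1/k queue):

```python
# P*Q in O(eta) operations, with Q stored as coordinate triples.

def mm1k_generator(lam, mu, k):
    """Return Q for the M/M/1/k queue as a list of (i, j, rate) triples."""
    triples = []
    for n in range(k + 1):
        out = 0.0
        if n < k:                       # arrival: n -> n+1
            triples.append((n, n + 1, lam)); out += lam
        if n > 0:                       # service: n -> n-1
            triples.append((n, n - 1, mu)); out += mu
        triples.append((n, n, -out))    # diagonal makes rows sum to zero
    return triples

def vec_mat(p, triples, size):
    """Compute p*Q in one pass over the nonzeros: O(eta) flops."""
    out = [0.0] * size
    for i, j, rate in triples:
        out[j] += p[i] * rate
    return out

q = mm1k_generator(90.0, 100.0, 49)
d = vec_mat([1.0] + [0.0] * 49, q, 50)   # derivative of P at t = 0, starting empty
print(d[0], d[1])
```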


Fig. 5. C.mmp model - accuracy as a function of tolerance.

Fig. 6. M/M/1/k model - accuracy as a function of tolerance.

Using monitoring routines, we have observed that most CPU time is spent performing sparse matrix-vector multiplies (MVM). The fraction of CPU time spent on MVM increases with problem size (both number of states and fill). In the remainder of our studies, instead of using actual CPU time, we report a count of the number of matrix-vector multiplies required by each solution algorithm. This alternative measure of required computation is machine independent. It also gives more insight into the general cost trends of the different algorithms, independent of many of their implementation details. In particular, it allows us to predict the performance of left-truncated uniformization.

4.2. Accuracy

Next, we examine transient-solver accuracy and error control. We first consider the accuracy achieved by setting the tolerance knobs of the various transient solvers. For each of the three transient solver codes, we ran both examples for a range of tolerances. We first vary the error tolerance in the baseline C.mmp model. We then repeat the experiment with the baseline M/M/1/k model. We plot the results for the C.mmp model in Fig. 5 and for the M/M/1/k model in Fig. 6. In both figures, -log10 of the accuracy achieved by the three solver codes is plotted as a function of -log10 of the initial tolerance specification. In the figures, specified tolerance increases to the right, while accuracy achieved increases as we move upwards. The 45-degree line represents actual error equal to prescribed tolerance. For modest tolerances, all three methods meet the tolerance specification. Uniformization closely matches the specified tolerance. Particularly for the M/M/1/k examples, TR-BDF2 error exceeds the prescribed tolerance even for reasonable tolerances, although not by a particularly wide margin.


Fig. 7. C.mmp model - computation cost as a function of accuracy.
Fig. 8. M/M/1/k model - computation cost as a function of accuracy.

As tolerances increase, the accuracy achieved by all three methods levels off. This leveling off represents the limiting effect of round-off error and minimum step-size requirements. For RKF45 and uniformization, the maximum accuracy achievable is between 10^-10 and 10^-11. Next, using the same data, we plot the amount of computation (number of matrix-vector multiplies) needed by each of the three algorithms as a function of accuracy achieved. The results are plotted for the C.mmp model in Fig. 7, and for the M/M/1/k model in Fig. 8. The abscissas show -log10 of the actual accuracy achieved. The ordinates show the number of matrix-vector multiplies required. Because TR-BDF2 is only second-order accurate, as accuracy increases TR-BDF2 performance degrades. Higher-order methods require fewer resources to obtain greater accuracy. At the far right, both the RKF45 and uniformization lines become vertical. The corresponding point on the abscissa is the limiting accuracy from round-off and minimum step-size constraints.

4.3. Stiffness

Next, we consider the effect of stiffness on solver performance. We first consider decreasing component failure rates in the C.mmp model. This decreases the smallest eigenvalues of the CTMC transition rate matrix and thus increases the classical stiffness ratio of the model. We then consider increasing the rate of switch repair in the C.mmp model and the service rate in the M/M/1/k queue. This increases model stiffness in terms of both the classical and computational stiffness measures. Finally, we consider increasing the length of the solution interval. This increases model stiffness in terms of the computational stiffness measure.

4.3.1. Decreasing failure rates. We consider a series of models with decreasing failure rates derived from the baseline C.mmp model Markov chain. The models have all three component failure rates reduced by 1 to 12 orders of magnitude, compared with the failure rates in the baseline model. We solve each chain with our RKF45, TR-BDF2, and uniformization codes. No performance degradation is observable. The RKF45 and TR-BDF2 solvers actually run slightly faster.


Fig. 9. C.mmp model - computation cost as a function of repair rate.

Fig. 10. M/M/1/k model - computation cost as a function of service rate.

As predicted, TR-BDF2 is not severely impacted by stiffness. RKF45 and uniformization behavior does not depend strongly on the classical stiffness ratio. Although it is also not shown in any figure, if we lengthen the time-scale or increase accuracy in correspondence with the change of parameter, solver performance will degrade.

4.3.2. Increasing repair rate. Using the baseline C.mmp model and M/M/1/k model presented above, we consider varying the value of μ_s, the switch repair rate, and μ, the queue service rate. Failure rates (and other repair rates in the C.mmp model) are fixed at their original values. We include both the actual uniformization code performance (without the DTMC starter) and our hypothetical estimate of uniformization performance with the DTMC starter.

In Fig. 9 we consider the C.mmp model. We plot the number of matrix-vector multiplies needed by the different methods as a function of log10 μ_s. For small μ_s, we see that uniformization and RKF45 both perform more efficiently than TR-BDF2. As μ_s increases, RKF45 and untruncated uniformization's performance begins to decay. Finally, for large μ_s, TR-BDF2 begins to perform better than uniformization. As μ_s increases, the differences will continue to increase.

Next, we analyze the effect of increasing the M/M/1/k service rate. In Fig. 10, we plot the number of matrix-vector multiplies needed by the different methods as a function of log10 μ. The behavior observed is similar to that seen with the C.mmp model. RKF45 and uniformization perform well when the service rate μ is not large. As μ increases, RKF45 and untruncated uniformization performance decays. Finally, truncated uniformization performance decays for large μ, leaving TR-BDF2 as the method of choice for such problems. We hypothesize that TR-BDF2 execution time decreases because the smallest nonzero eigenvalue of the CTMC decreases. (This is only a second-order effect with the other methods.)
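The truncated-uniformization behavior can be illustrated by computing the Poisson truncation window used in Appendix A. The sketch below (our code; the function name is ours) shows that the number of retained terms k - l grows roughly as √(qt), which is why left-truncated uniformization degrades much more slowly than the untruncated version as rates grow:

```python
import math

def truncation_points(qt, eps=1e-6):
    """Smallest window [l, k] whose Poisson(qt) mass exceeds 1 - eps."""
    term = math.exp(-qt)
    cdf, i = term, 0
    l = None
    while l is None or cdf < 1 - eps / 2:
        if l is None and cdf >= eps / 2:
            l = i                      # left tail of mass < eps/2 dropped
        i += 1
        term *= qt / i
        cdf += term
    return l, i

for qt in (10.0, 100.0):
    l, k = truncation_points(qt)
    print(qt, k - l)                   # retained terms grow ~ O(sqrt(qt))
```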


Fig. 11. C.mmp model - computation cost as a function of mission time.

Fig. 12. M/M/1/k model - computation cost as a function of mission time.

In summary, RKF45 and untruncated uniformization computation times increase linearly with the largest state exit rate. Truncated uniformization computation time increases with the square root of the largest state exit rate. TR-BDF2 computation time is not changed significantly by such an increase.

4.3.3. Increasing mission time. In our final experiment, we vary the mission time considered. We use the baseline C.mmp and M/M/1/k models and vary the length of the solution interval. The amount of computation needed by each solver for the C.mmp model is plotted in Fig. 11 as a function of log10 t. Similar data for the M/M/1/k model is plotted in Fig. 12. We see behavior similar to that observed when we increased the repair or service rates. For small t, uniformization and RKF45 perform more efficiently than TR-BDF2. As t grows, RKF45 and untruncated uniformization performance decays. For larger t, even truncated uniformization suffers from increased execution time. For large t, TR-BDF2 is superior to the other methods. Once again, RKF45 and uniformization are markedly better than TR-BDF2 for nonstiff problems. RKF45 and untruncated uniformization execution times increase linearly with t. For uniformization with left-truncation, execution time increases as the square root of t. The stable TR-BDF2 method is the method of choice if the solution interval is large.

5. CONCLUSIONS

We have considered three numerical approaches to the transient solution of finite-state, continuous-time Markov models. Conventional ODE solution techniques (like RKF45) are often available “off-the-shelf”. For nonstiff problems with normal accuracy requirements, they are satisfactory. Uniformization is usually more accurate and efficient than RKF45. It exploits the natural probabilistic structure of the problem, and, as a series method, allows more accurate error control. Uniformization is the method of choice for typical problems. For stiff problems, special


algorithms like TR-BDF2 are recommended. TR-BDF2's performance is nearly independent of problem stiffness. Its main drawback is that it is only a second-order accurate method. Because of its low order, it requires a small step-size to achieve high absolute accuracies. To solve even larger models more efficiently, we can turn to approximation techniques at the model [8], chain [9], or numerical level [15]. By reducing size, eliminating cycles, and eliminating stiffness, approximation techniques should allow us to use conventional techniques efficiently on a wider range of problems.


Acknowledgements - This work was supported in part by the Army Research Office under contract No. DAAG29-84-0045, by the Air Force Office of Scientific Research under grant No. AFOSR-84-0132, and by the National Science Foundation under grant No. MCS-830200.

REFERENCES

1. R. Geist and K. S. Trivedi, Ultra-high reliability prediction for fault-tolerant computer systems. IEEE Transact. Comput. 32, 1111-1127 (1983).
2. A. Goyal, S. Lavenberg and K. Trivedi, Probabilistic modeling of computer system availability. Ann. Opns Res. 8, 285-306 (1987).
3. K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Prentice-Hall, Englewood Cliffs, N.J.
4. W. K. Grassmann, Transient solution in Markovian queueing systems. Comput. Opns Res. 4, 47-56 (1977).
5. S. K. Tripathi and A. Duda, Time dependent analysis of queueing systems. INFOR (1986).
6. D. Gross, D. Miller and C. Plastiras, Simulation methodologies for transient Markov processes: a comparative study based on multi-echelon repairable item inventory systems. In Proceedings of the Summer Computer Simulation Conference (1984).
7. K. S. Trivedi, R. Geist, M. Smotherman and J. B. Dugan, Hybrid modeling of fault-tolerant systems. Comput. Elect. Engng 11, 87-109 (1984).
8. R. A. Sahner and K. S. Trivedi, A hierarchical, combinatorial-Markov method of solving complex reliability models. In Proceedings ACM/IEEE Fall Joint Computer Conference (1986).
9. A. Bobbio and K. S. Trivedi, An aggregation technique for the transient analysis of stiff Markov chains. IEEE Transact. Comput. C-35, 803-814 (1986).
10. G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, Baltimore, Md (1983).
11. C. Moler and C. F. Van Loan, Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev. 20, 801-835 (1978).
12. C. F. Van Loan, The sensitivity of the matrix exponential. SIAM J. Numer. Anal. 14, 971-981 (1977).
13. I. R. Roche, On the sensitivity of the matrix exponential problem. RAIRO Analyse Numerique 15, 249-255 (1981).
14. J. D. Lambert, Computational Methods in Ordinary Differential Equations. Wiley, London (1973).
15. W. L. Miranker, Numerical Methods for Stiff Equations and Singular Perturbation Problems. Reidel, Dordrecht (1981).
16. V. Kulkarni, V. F. Nicola, R. M. Smith and K. S. Trivedi, Numerical evaluation of performability measures and job completion time in repairable fault-tolerant systems. In Proceedings of the 16th International Symposium on Fault-Tolerant Computing, IEEE, Vienna, Austria, July (1986).
17. R. A. Marie, A. L. Reibman and K. S. Trivedi, Transient solution of acyclic Markov chains. Perform. Evaln 7, 175-194 (1987).
18. R. Bellman, Introduction to Matrix Analysis. McGraw-Hill, New York (1969).
19. E. de Souza e Silva and R. Gail, Calculating cumulative operational time distributions of repairable computer systems. IEEE Transact. Comput. 35, 322-332 (1986).
20. B. Melamed and M. Yadin, Randomization procedures in the computation of cumulative-time distributions over discrete state Markov processes. Opns Res. 32, 926-944 (1984).
21. D. Gross and D. Miller, The randomization technique as a modeling tool and solution procedure for transient Markov processes. Opns Res. 32, 334-361 (1984).
22. B. Fox and P. Glynn, Computing Poisson Probabilities. Technical Report, University of Montreal (1986).
23. R. Bank, W. Coughran, W. Fichtner, E. Grosse, D. Rose and R. Smith, Transient simulation of silicon devices and circuits. IEEE Transact. CAD 4, 436-451 (1985).
24. C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations. Prentice-Hall, Englewood Cliffs, N.J. (1971).
25. D. Siewiorek, Multiprocessors: reliability modelling and graceful degradation. In Infotech State of the Art Conference on System Reliability, pp. 48-73. Infotech International (1978).
26. P. M. Morse, Queues, Inventories, and Maintenance: The Analysis of Operational Systems with Variable Supply and Demand. Wiley, New York (1958).

APPENDIX A.

Uniformization Algorithm

Given:
Q - transition rate matrix
P_0 - initial probability vector
ε - error tolerance
t - desired solution time.

Variables:
Q* - modified transition rate matrix
I - identity matrix
q - scaling factor
S - series sum vector
R - term-of-series vector
P_l - first significant DTMC state probability vector
i - series term counter
l - number of leftmost significant term of Poisson distribution
k - number of rightmost significant term of Poisson distribution.

Procedure:
Choose q ≥ max_i |q_ii|
Q* = Q/q + I
Determine largest l with exp(-qt)(Σ_{i=0}^{l-1} (qt)^i/i!) < ε/2
If (l > 0) {
    Determine smallest k so [1 - exp(-qt)(Σ_{i=0}^{k} (qt)^i/i!)] ≤ ε/2
}
Else {
    Determine smallest k so [1 - exp(-qt)(Σ_{i=0}^{k} (qt)^i/i!)] ≤ ε
}
S = 0
P_l = P_0(Q*)^l
R = P_l(qt)^l/l!
For (i = l to k) {
    S = S + R
    R = R(Q*)(qt/(i + 1))
}
Scaling S by exp(-qt) yields the solution.
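A runnable rendering of this procedure (our sketch, without left truncation for clarity; dense lists stand in for the sparse matrix) can be checked against the closed form (20) on a two-state component:

```python
import math

def uniformize(Q, p0, t, eps=1e-10):
    n = len(Q)
    q = max(abs(Q[i][i]) for i in range(n)) or 1.0
    # Q* = Q/q + I is a DTMC transition probability matrix
    P = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)          # Poisson(qt) weight of term 0
    s = [weight * x for x in p0]
    v, cdf, i = p0, weight, 0
    while cdf < 1.0 - eps:             # stop when the right tail is below eps
        i += 1
        v = [sum(v[a] * P[a][b] for a in range(n)) for b in range(n)]  # v = v*Q*
        weight *= q * t / i
        cdf += weight
        s = [s[b] + weight * v[b] for b in range(n)]
    return s

# two-state component: state 0 = up, state 1 = down, rates lam and mu
lam, mu, t = 0.5, 1.0, 2.0
p = uniformize([[-lam, lam], [mu, -mu]], [1.0, 0.0], t)
exact = mu / (lam + mu) + (lam / (lam + mu)) * math.exp(-(lam + mu) * t)
print(p[0], exact)   # the two values agree to about eps
```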

APPENDIX B. ODE Solver Algorithm

Given:
Q - transition rate matrix
P_0 - initial probability vector
t_final - mission completion time
ε - local error tolerance.

Variables:
Successful_Step - flag: have we completed a step with acceptable error?
H_Failed - flag: have we had a step failure this step?
Error_Vector - local truncation error vector
Error_Estimate - estimate of magnitude of local error
t - current time
h - current step size.

Procedure:
Estimate initial value for h
t = 0
While (t < t_final) {
    Successful_Step = FALSE
    H_Failed = FALSE
    While (Successful_Step == FALSE) {
        Compute solution step from t to t + h
        Compute LTE estimate
        Error_Estimate = ||Error_Vector||
        If (Error_Estimate < ε) {
            Successful_Step = TRUE
            If (H_Failed == FALSE) consider increasing h
        }
        Else {
            Decrease h
            H_Failed = TRUE
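The skeleton above can be rendered as runnable code. In this sketch (ours, under stated assumptions), `step` stands in for a TR-BDF2 or RKF45 stepper returning the new state and an LTE estimate, and step-size changes follow equation (18) with method order p:

```python
import math

def ode_solve(step, y0, t_final, eps, h0, p=2):
    t, y, h = 0.0, y0, h0
    while t < t_final:
        h = min(h, t_final - t)                  # never step past t_final
        successful_step, h_failed = False, False
        while not successful_step:
            y_new, error_estimate = step(y, t, h)
            if error_estimate < eps:             # LEPS test
                successful_step = True
                t, y = t + h, y_new
                if not h_failed:                 # no growth right after a failure
                    h *= min(2.0, (eps / max(error_estimate, 1e-300)) ** (1 / (p + 1)))
            else:
                h *= max(0.1, (eps / error_estimate) ** (1 / (p + 1)))
                h_failed = True
    return y

# example: trapezoid step on y' = -y, with the principal-term LTE h^3*|y|/12
def tr_step(y, t, h):
    return y * (2 - h) / (2 + h), (h ** 3) * abs(y) / 12

y = ode_solve(tr_step, 1.0, 1.0, 1e-6, 0.1)
print(y, math.exp(-1.0))   # agree to a few decimal places
```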