Performance Evaluation 24 (1995) 311-331
ELSEVIER

A computationally efficient technique for transient analysis of repairable Markovian systems

Manish Malhotra *

AT&T Bell Laboratories, Holmdel, NJ 07733, USA

Received 26 October 1992; revised 24 September 1993
Abstract

A technique to design efficient methods using a combination of explicit (non-stiff) and implicit (stiff) ODE methods for numerical transient analysis of repairable Markovian systems is proposed. Repairable systems give rise to stiff Markov chains due to the extreme disparity between failure rates and repair rates. Our approach is based on the observation that stiff Markov chains are non-stiff for an initial phase of the solution interval. A non-stiff ODE method is used to solve the model for this phase, and a stiff ODE method is used to solve the model for the rest of the duration until the end of the solution interval. A formal criterion to determine the length of the non-stiff phase is described. A significant outcome of this approach is that the accuracy requirement automatically becomes a part of model stiffness. Two specific methods based on this approach have been implemented. Both methods use the Runge-Kutta-Fehlberg method as the non-stiff method. One uses the TR-BDF2 method as the stiff method, while the other uses an implicit Runge-Kutta method as the stiff method. Numerical results obtained from solving dependability models of a multiprocessor system and an interconnection network are presented. These results show that the methods obtained using this approach are much more efficient than the corresponding stiff methods which have been proposed to solve stiff Markov models.

Keywords: Availability; Dependability; Markov chains; Ordinary differential equations; Performability; Reliability; Repairable systems; Stability; Stiffness; Transient analysis
1. Introduction

Transient analysis of repairable computer and communication systems is of considerable interest and value to system modelers [12,14]. If the system is Markovian (i.e., the time to occurrence of any event in the system, such as failure and repair of a component, is exponentially distributed), then a Markov chain can be used to analyze the dependability and performability of the system. The importance of transient analysis is further illustrated by the number of software tools that have

* This work was done while the author was at Duke University.

0166-5316/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved
SSDI 0166-5316(94)00033-G
been developed for dependability analysis. Many of these tools convert high-level system descriptions into finite-state continuous-time Markov chains for transient analysis [21]. For dependability analysis, transient measures provide more useful information than steady-state measures. Transient performability measures are also of significance [7]. Markov chains of real systems are very large and closed-form solutions of transient measures are infeasible. Thus, the modeler has to practically always use numerical solution methods. An overview of numerical solution methods can be found in [33]. Several numerical techniques for transient analysis of Markov chains have been implemented and analyzed in the past. Jensen [20] originally proposed a method (also known as randomization or uniformization). Grassmann [15,16], Keilson [22], and Gross and Miller [17] have analyzed this method. This method has been shown to be computationally very efficient for solving non-stiff Markov chains; however, for stiff Markov chains, it is inefficient [32,25]. Marie [26] has proposed an alternative procedure to speed up Jensen's method. This approach requires matrix products, which are affordable for small Markov chains but become expensive for large Markov chains. Non-repairable systems give rise to acyclic Markov chains for which specialized solution methods exist [27]. Solving a system of ordinary differential equations (ODEs) is another approach. Grassmann [15] has shown that explicit Runge-Kutta methods (ODE solution methods) can be used for non-stiff Markov models. Clarotti [6] proposed an implicit A-stable method (similar to the trapezoidal rule) for stiff Markov chains. A comprehensive effort to comparatively evaluate different methods is the work by Reibman and Trivedi [32]. They numerically compared Jensen's method, the Runge-Kutta-Fehlberg method (an explicit ODE method), and the TR-BDF2 method (a second order L-stable implicit ODE method).
They concluded that TR-BDF2 is the method of choice for stiff Markov models. For non-stiff models, they recommend the use of Jensen's method. We [25] had earlier implemented higher order implicit ODE methods. A third order A-stable generalized Runge-Kutta method and a third order L-stable implicit Runge-Kutta method were numerically compared with the TR-BDF2 method and Jensen's method. For stiff models, we recommended the use of the implicit Runge-Kutta method to achieve high accuracy and the TR-BDF2 method for low accuracy requirements. For non-stiff models, Jensen's method was recommended. For a description of several other ODE solvers, refer to [18,19]. From these research efforts, it is clear that Jensen's method is the most efficient method to solve non-stiff Markov chains. It provides highly accurate solutions and it is computationally more efficient than the ODE methods. For stiff Markov chains, ODE solution methods emerge the winners. For high accuracy requirements, high order L-stable ODE methods (e.g., implicit Runge-Kutta) are recommended. For low accuracy requirements, low order L-stable ODE methods (e.g., TR-BDF2) suffice. Moreover, ODE solution methods have two significant advantages over Jensen's method. The solution is obtained on a grid of points (instants of time spread over the solution interval) at no additional cost. This makes it possible to examine the solution as a function of time. Secondly, Jensen's method in its present form cannot be used to solve time-inhomogeneous Markov chains. ODE codes can be slightly modified (basically the Jacobian becomes time-dependent) to solve time-inhomogeneous Markov chains. In practice, fault-tolerant repairable systems almost always give rise to stiff Markov chains due to the extreme disparity between fault-occurrence (failure) rates and fault-handling (repair) rates. Typically, repair rates could be 10^6 times larger in magnitude than failure rates.
In this paper, we concern ourselves with transient analysis of such systems, which requires the solution of a stiff Markov chain. We present
a general technique to design fast numerical methods for transient analysis of such systems. Our technique uses a combination of stiff and non-stiff ODE methods. By stiff ODE methods, we mean methods that are designed to handle stiffness. The results presented in [32] show that for smaller solution intervals, the Runge-Kutta-Fehlberg method (a non-stiff ODE method) requires much less computational effort than TR-BDF2 (a stiff ODE method). This implies that for longer solution intervals, there exists an initial phase (corresponding to those smaller solution intervals) for which the Markov chain is not stiff with respect to a non-stiff explicit ODE method. This observation forms the basis of our approach. For the initial non-stiff phase of the solution interval, we use a non-stiff ODE method to advance the solution in time, and for the rest of the duration until the end of the solution interval, we use a stiff ODE method. This results in large computational savings since during the non-stiff phase, a stiff ODE method is computationally very expensive (because of one or more linear system solutions at every time step) compared to a non-stiff ODE method. The usefulness of this approach hinges on successful determination of the duration of the non-stiff phase. To do this, we use an approach similar to the one described by Petzold [31]. However, unlike [31], we develop a more formal criterion to actually determine when to switch to a stiff method. Based upon this principle, we constructed two methods. Both methods use the explicit fourth order Runge-Kutta-Fehlberg method as the non-stiff ODE method. For stiff ODE methods, one uses the third order implicit Runge-Kutta method and the other uses the second order TR-BDF2 method. Numerical results obtained by solving dependability models of a multiprocessor system and an interconnection network show that large computational savings can be obtained using these methods.
The overhead cost to determine the time to switch from the non-stiff method to the stiff method is negligible compared to the total computational cost. In Section 2, we mathematically formulate the problem of transient analysis of Markov chains and show how various system measures can be obtained by transient analysis. In Section 3, we discuss non-stiff ODE methods and explain why the computation cost of these methods greatly increases for stiff Markov models. We also discuss what causes stiffness in Markov models, and stiff ODE methods. Our basic strategy is described in Section 4. We briefly describe two specific methods based on this principle. TR-BDF2 and the third order implicit Runge-Kutta method are also described in this section. In Section 5, the implementation aspects of these methods are described. Numerical results are presented in Section 6. Different methods are compared on the basis of accuracy achieved and computation cost. Our conclusions are summarized in Section 7.
2. Mathematical formulation

Let {Z(t), t ≥ 0} be a finite-state continuous-time Markov chain (CTMC) with state space Ω = {1, 2, ..., n}. Let Q be the infinitesimal generator matrix of Z(t). Then Q = [q_ij], where q_ij, i ≠ j, is the transition rate from state i to state j and q_ii = −Σ_{j≠i} q_ij. Let q = max_i |q_ii| and let η denote the number of non-zero entries in Q. The state probability vector P(t) at time t is [P_1(t), P_2(t), ..., P_n(t)], where P_i(t) = Pr{Z(t) = i}. The vector P(t) can be computed by solving a system of first order linear differential equations, known as the Kolmogorov equations

dP(t)/dt = P(t)Q .    (1)
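To make Eq. (1) concrete, the sketch below integrates the Kolmogorov equations for a hypothetical two-state repairable component (up/down); the failure rate lam and repair rate mu are illustrative values, not taken from the paper's models, and a fixed-step classical Runge-Kutta integrator stands in for the production solvers discussed later.

```python
import numpy as np

# Hypothetical two-state repairable component: state 0 = up, state 1 = down.
# Failure rate lam and repair rate mu are illustrative, not from the paper.
lam, mu = 1e-3, 1.0
Q = np.array([[-lam, lam],
              [mu, -mu]])

def rk4_step(P, Q, h):
    """One classical fourth order Runge-Kutta step for dP/dt = P Q."""
    k1 = P @ Q
    k2 = (P + 0.5 * h * k1) @ Q
    k3 = (P + 0.5 * h * k2) @ Q
    k4 = (P + h * k3) @ Q
    return P + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def transient(P0, Q, t_end, h):
    """Integrate the Kolmogorov equations from 0 to t_end with fixed step h."""
    P, t = np.asarray(P0, dtype=float), 0.0
    while t < t_end:
        step = min(h, t_end - t)
        P = rk4_step(P, Q, step)
        t += step
    return P

P = transient([1.0, 0.0], Q, 10.0, 0.01)   # start in the up state
A = P[0]                                   # P_up(10): instantaneous availability
```

For this two-state chain the closed form A(t) = μ/(λ+μ) + λ/(λ+μ) e^{−(λ+μ)t} provides a direct accuracy check.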
The initial state probability vector P(0) specifies the initial condition. Suppose the matrix Q has m ≤ n distinct eigenvalues λ_1, λ_2, ..., λ_{m−1}, λ_m, in non-decreasing order of magnitude. Let d_i be the multiplicity of λ_i. The eigenvalues could be complex and all have non-positive real parts. The general solution for each P_i(t) (i = 1, 2, ..., n) can be written as

P_i(t) = Σ_{j=1}^{m} Σ_{k=0}^{d_j − 1} a_{ijk} t^k e^{λ_j t} ,    (2)

where the a_{ijk} are constants varying with i. Since the matrix Q is singular, we have λ_1 = 0. Several time-dependent measures of the system behavior are then evaluated as weighted sums of state probabilities. For dependability models, the state space Ω is divided into the set of operational states (Ω_o) and the set of failure states (Ω_f). The probability of the system being in an operational state at time t is the instantaneous availability A(t) = Σ_{i∈Ω_o} P_i(t). The probability that a system does not fail during the time interval (0, t] is the reliability of the system. A dependability model is converted to a reliability model by deleting all the arcs going from states in Ω_f to states in Ω_o, i.e., making all the states in Ω_f absorbing. Reliability is then calculated as R(t) = Σ_{i∈Ω_o} P_i(t). To calculate repairability [9], the probability that a failed system can be made operational in the time interval (0, t], the states in Ω_o are made absorbing by deleting all arcs going from states in Ω_o to states in Ω_f. System repairability is then computed as Σ_{i∈Ω_o} P_i(t). Performability measures such as computation availability at time t are computed by calculating the expected reward rate at time t: A_c(t) = Σ_i r_i P_i(t), where r_i is the reward rate associated with state i, denoting some kind of performance index of the system in state i. Many more useful measures of system behavior can be likewise computed by transient analysis of the Markov model of the system. Thus, transient analysis of Markov chains essentially consists of computing the state probabilities P_i(t) (i = 1, 2, ..., n). Unfortunately, transient analysis of Markov models is plagued by computational problems because of largeness and stiffness of Markov models. Large Markov models of complex systems are mostly sparse, and sparse implementations of numerical methods reduce the computational cost. Truncation, aggregation, lumping, and decomposition are some of the techniques that overcome largeness but yield approximate results.
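As an illustration of these measure definitions, the sketch below computes A(t) and R(t) for a small hypothetical three-state model (states 0 and 1 operational, state 2 failed; all rates are invented for the example). The reliability conversion follows the recipe in the text: arcs leaving the failure state are deleted, making it absorbing.

```python
import numpy as np

# Hypothetical 3-state dependability model: states 0 and 1 operational,
# state 2 failed. All rates below are invented for this example.
Q = np.array([[-2e-3, 2e-3, 0.0],
              [1.0, -1.0 - 1e-3, 1e-3],
              [0.0, 0.5, -0.5]])
operational = [0, 1]

def make_absorbing(Q, failure_states):
    """Reliability conversion: delete all arcs leaving the given states."""
    Qr = Q.copy()
    for s in failure_states:
        Qr[s, :] = 0.0                # a zero row makes state s absorbing
    return Qr

def solve(P0, Q, t_end, h=0.01):
    """Fixed-step RK4 integration of dP/dt = P Q (adequate for this tiny model)."""
    P, t = np.asarray(P0, dtype=float), 0.0
    while t < t_end:
        step = min(h, t_end - t)
        k1 = P @ Q
        k2 = (P + 0.5 * step * k1) @ Q
        k3 = (P + 0.5 * step * k2) @ Q
        k4 = (P + step * k3) @ Q
        P = P + (step / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += step
    return P

P0 = [1.0, 0.0, 0.0]
A_t = solve(P0, Q, 100.0)[operational].sum()                       # availability A(t)
R_t = solve(P0, make_absorbing(Q, [2]), 100.0)[operational].sum()  # reliability R(t)
```

Since the reliability model forbids return from the failure state, R(t) ≤ A(t) for all t.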
Grassmann [16] has proposed using Jensen's method in a dynamic fashion so that generation of the entire generator matrix is avoided before starting the method. Stiffness of a Markov model can be considered as the measure of computational effort required by a numerical method to solve the model within the accuracy specified by the user. We shall later consider what causes stiffness in Markov models. Two basic approaches to overcome stiffness are:
• Stiffness-avoidance: This approach is based on the idea of eliminating stiffness from a model and solving a set of non-stiff models. Some of these approaches also solve the problem of largeness. Bobbio and Trivedi [4] describe one such technique based on aggregation. Marie [26] has proposed an approximation which works well for large values of the solution interval but requires matrix products. Melamed and Yadin [28] and Miller [29] propose selective randomization, which is applicable in special cases.
• Stiffness-tolerance: No approximations are used in this approach. Instead, special solution methods that remain stable for stiff models are used. In this paper, we focus on a technique to design more efficient stiffness-tolerant methods.
3. Stiff Markov chains and ODE solution methods

The basic strategy of ODE solution methods is to discretize the solution interval into a finite number of time intervals separated by mesh points {t_1, t_2, ..., t_i, ..., t_n}. Given a solution at t_i (the ith mesh point), the solution at t_i + h (= t_{i+1}) is computed, and then we advance in time with step size h until the time at which the solution is desired is reached. In practice, the step size varies from step to step. ODE methods can be broadly divided into two categories: stiff methods and non-stiff methods (methods that are not designed to handle stiffness).

3.1. Non-stiff methods

An explicit ODE method, when applied to the system of differential equations (Eq. (1)), computes the state probability vector at successive time steps until the end of the solution interval is reached:

P(t + h) = P(t) U(Qh) ,    (3)

where h is the step size used to advance in time from t to t + h and U(·) is a polynomial depending upon the method used. The error in the probability vector at the end of each time step is estimated by a local truncation error (LTE) vector τ(t + h):

τ(t + h) = P(t) V(Qh) ,    (4)

where V(·) is another polynomial. For the Runge-Kutta-Fehlberg method (henceforth referred to as RKF45) [10], two estimates of P(t + h) are obtained, one using a fourth order formula and the other using a fifth order formula. The difference of the two gives the LTE vector:

τ(t + h) = P(t) U_5(Qh) − P(t) U_4(Qh) ,    (5)

which has the same form as Eq. (4). Equation (4) can be written as

τ(t + h) = P(t) V(Qh) = P(t − h) U(Qh) V(Qh) = τ(t) U(Qh) ,    (6)

where τ(t) is the LTE vector at time t. This implies that the error in the state probability vector decays if ρ(U(Qh)) < 1 and grows if ρ(U(Qh)) > 1, where ρ(U(Qh)) is the spectral radius of U(Qh). This defines the stability region of the method. If the step size is so chosen that one is within the stability region of the method, then the error dies out; otherwise the error grows. The stability region R of a non-stiff method is typically a closed curve in the complex plane. For the method to be stable for a given matrix Q, the following must be true:

λ_i h ∈ R    ∀ i = 1, 2, ..., m,    (7)

where the λ_i are the m distinct eigenvalues of matrix Q. Typically, the above restriction gets translated to the form

|λ_m| h ≤ C ,    (8)

where C is usually a small constant depending on the method. For the RKF45 method, C = 2.78.
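The stability restriction (8) is easy to evaluate numerically. The fragment below, using an invented two-state generator with a large disparity between failure and repair rates, computes the largest stable RKF45 step both from the exact eigenvalues and from the bound |λ_m| ≤ 2q used later in the paper.

```python
import numpy as np

# Invented stiff generator: failure rate 1e-4, repair rate 1e2 (10^6 disparity)
lam, mu = 1e-4, 1e2
Q = np.array([[-lam, lam],
              [mu, -mu]])

eigs = np.linalg.eigvals(Q)
lam_max = max(abs(eigs))            # |lambda_m|: here equal to lam + mu

C = 2.78                            # stability constant quoted for RKF45
h_stab = C / lam_max                # largest stable step size by Eq. (8)

q = max(abs(np.diag(Q)))            # largest total exit rate from any state
h_gersh = C / (2.0 * q)             # conservative bound via |lambda_m| <= 2q
```

The eigenvalue-based limit h_stab is exact; the Gerschgorin-style bound h_gersh is cheaper to compute (no eigensolve) and never larger than h_stab.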
Besides stability, the desired accuracy also restricts the step size. If a local error tolerance of ε must be met at every step, then some user-specified norm of τ(t + h) must satisfy

||τ(t + h)|| ≤ ε .    (9)

From Eq. (4), we know that τ(t + h) depends upon the step size h. Therefore, the step size h at any step must be so chosen that the conditions in Eq. (8) (for stability) and Eq. (9) (for accuracy) are satisfied. Let us now consider what makes a Markov model stiff with respect to an explicit ODE method. From Eq. (2), we find that the term with the largest eigenvalue vanishes fastest and contributes nothing to the solution after an initial transient phase. It is the term with λ_2 that contributes most to the solution. However, the step size is restricted by λ_m. If λ_m is very large compared to λ_2, then according to Eq. (8), the step size h must be very small to maintain stability. This increases the total number of time steps required and the total computation time by several orders of magnitude. Moreover, roundoff error effects become significant if the step size is very small. Thus the linear system of differential equations (Eq. (1)) is said to be stiff if

|Re(λ_m)| >> |Re(λ_2)| .    (10)

One of the factors that must also be taken into account is that the rate of change of a solution component is defined relative to the solution interval. In [30], stiffness is defined as: a system of differential equations is said to be stiff on the interval [0, t) if there exists a solution component of the system that has variation on that interval that is large compared to 1/t. Thus, the length of the solution interval also becomes a measure of stiffness. At this point, it is worth considering what causes stiffness in Markov chains of repairable fault-tolerant systems. In a dependability model of a repairable system, fault occurrence rates are several orders of magnitude smaller than fault (error) handling rates. Fault handling rates could be as large as 10^6 times the fault occurrence rates and much larger than the reciprocal of the length of the solution interval. Therefore, such Markov chains have events occurring at widely different time scales. Clarotti [6] observes that, in general, this causes the largest absolute eigenvalue of Q to be much larger than the inverse of the length of the solution interval, and consequently the system of differential equations (Eq. (1)) is stiff. Reibman and Trivedi [32] formalized this notion and defined stiffness quantitatively. Using the Gerschgorin circle theorem [13], the magnitude of the largest eigenvalue is bounded above by twice the largest absolute diagonal entry in the generator matrix. In a Markov chain, this entry corresponds to the largest total exit rate from any of the states. They defined the stiffness index of a Markov chain as the product of the largest total exit rate from a state and the length of the solution interval (= qt). However, this index ignores the accuracy requirement, which we have seen plays an important role in determining the step-size and therefore the computational complexity of ODE solution methods.

3.2. Stiff methods

From the discussion of the previous section, we realize that to solve stiff Markov chains, we need methods that are inherently stable and do not restrict the step size to maintain stability. Implicit ODE methods fall into this category. The stability of implicit methods is characterized by the following
definitions. A method is said to be A-stable if all numerical approximations to the actual solution tend to zero as n → ∞ when it is applied to the scalar differential equation dy/dt = λy with a fixed positive h and a (complex) constant λ with a negative real part [11]. Here n is the number of mesh points which divide the solution interval. For these methods, the stability region is the entire left half of the complex plane, i.e.,

Re(λ_m) h < 0 .    (11)

Therefore, the only restriction on step size is due to accuracy (Eq. (9)). For extremely stiff problems, even A-stability does not guarantee that the error in the solution will damp out without a large decrease in step size. Slowly damped oscillatory errors [23] may occur, which can lead to the phenomenon of ringing, i.e., the successively computed values tend to be of the same magnitude but opposite sign (y_{i+1} = −y_i), unless the step size is made very small [2]. Once again, we have the same problem if the step size needs to be unacceptably small. To overcome this, the concept of L-stability was defined by Axelsson [1]. A method is L-stable if it is A-stable and, when applied to the scalar equation dy/dt = λy, it yields y_{i+1}/y_i → 0 as Re(λh) → −∞. This guarantees that errors are quickly damped. Since dependability models are usually very stiff, we use L-stable methods as stiff solvers. We now describe our basic strategy to use both stiff and non-stiff methods to solve a stiff Markov chain.
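The difference between A- and L-stability shows up directly in the per-step growth factors for the scalar test problem dy/dt = λy. In the sketch below, with an arbitrarily chosen λh = −10^6, the A-stable trapezoidal rule has a growth factor approaching −1 (the ringing just described), while the L-stable backward Euler method damps the component immediately.

```python
# Per-step growth factors for dy/dt = lam*y in the extremely stiff regime.
# The values of lam and h are illustrative only.
lam, h = -1.0e6, 1.0
z = lam * h                          # Re(lam*h) -> -infinity regime

g_trap = (1 + z / 2) / (1 - z / 2)   # trapezoidal rule (A-stable): -> -1
g_be = 1 / (1 - z)                   # backward Euler (L-stable):   -> 0

# Trapezoidal iterates alternate sign with nearly constant magnitude
# (ringing: y_{i+1} ~ -y_i); backward Euler damps the component at once.
```

Both factors have magnitude below 1, so both methods are stable here; the practical difference is how fast the stiff component dies out.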
4. Basic approach

Our approach is based on the observation that for long solution intervals, there exists an initial phase for which the Markov chain is not stiff with respect to a non-stiff explicit ODE method, i.e., a non-stiff ODE method requires much less computational effort than a stiff ODE method to solve the Markov chain for this phase. A simple explanation of this observation follows from Eq. (2). For the initial non-stiff phase (i.e., small t), the term e^{λ_m t} is not insignificant and contributes to the solution. Therefore the step size of a non-stiff method is determined not by the stability condition but by the accuracy requirements. Suppose h_ns is the largest step size that a non-stiff method can take which meets both the stability and accuracy requirements (Eqs. (8) and (9)). Suppose at a given time step t, h_s is the largest step size that a stiff method can take to satisfy the accuracy requirements. If h_ns ≥ h_s, then the non-stiff method must be used for this time step. Thus, our basic approach is to use a non-stiff ODE method for the initial non-stiff phase of the solution interval. At the end of this phase, we switch to a stiff ODE method and use the state probability vector computed by the non-stiff method as the initial state probability vector. The results presented by Reibman and Trivedi [32] lend experimental credence to our approach. They show that for smaller solution intervals, non-stiff ODE methods require much less computational effort than stiff ODE methods. A significant outcome of this approach, as we shall see below, is that the accuracy requirement automatically becomes a part of model stiffness. The most crucial aspect of our approach is the determination of the length of the non-stiff phase, i.e., when to switch from the non-stiff method to a stiff method.
A practical ODE code uses variable step size. After every time step of the non-stiff method, a new step size is calculated such that it meets the stability conditions and is expected to meet the accuracy requirements. A commonly used technique to obtain a step size for the next time step that is expected to meet the accuracy requirements is:
h_ns = h_current ( ε / ||τ_ns|| )^{1/(ord_ns + 1)} ,    (12)

where ord_ns is the order of accuracy of the non-stiff method and τ_ns is its local truncation error. Let h_ns be the largest step size that satisfies Eqs. (12) and (8). At the end of each time step, we also compute the step size of the stiff method that is expected to meet the accuracy requirements, using the LTE and the current step size of the non-stiff method. Thus,

h_s = h_current ( ε / ||τ_s|| )^{1/(ord_s + 1)} ,    (13)

where ord_s is the order of accuracy of the stiff method and τ_s is the LTE for the stiff method. However, τ_s is not known and must be estimated from τ_ns. This is accomplished as follows. For the linear system with constant coefficients (Eq. (1)), we have

τ_x = C_x P(t) (h_current Q)^{ord_x + 1} + O( h_current^{ord_x + 2} ) ,    (14)

where x = s for the stiff method and x = ns for the non-stiff method. The constants C_x (x = s, ns) are known a priori. Assuming that ord_ns > ord_s (which is true for the methods we use), we have

τ_ns = (C_ns / C_s) (h_current Q)^{ord_ns − ord_s} τ_s + O( h_current^{ord_ns + 2} ) .    (15)

Neglecting the terms of higher order, we get

τ_s ≈ K τ_ns ,    (16)

where K = C_s / ( C_ns (h_current Q)^{ord_ns − ord_s} ). Substituting this in Eq. (13), we get the following estimate for h_s:

h_s ≤ h_current ( ε / ||K τ_ns|| )^{1/(ord_s + 1)} .    (17)

One possibility is to switch to the stiff method as soon as h_ns < h_s. However, this is a very conservative criterion since it ignores the large difference in the computation time taken by the stiff and the non-stiff method to perform the calculations for one time step. More savings can be realized if this factor is also taken into account. Let T_s and T_ns be the computation times taken to perform the calculations for one time step by the stiff and the non-stiff method, respectively. Ideally, we would like to switch to the stiff method when the computation time taken by the non-stiff method to advance the solution per unit time exceeds the computation time taken per unit time by the stiff method, i.e.,

T_ns / h_ns > T_s / h_s .    (18)

This is the criterion that we use to switch from the non-stiff method to the stiff method, i.e., we compute h_s and h_ns after every time step of the non-stiff method and switch to the stiff method whenever the condition

T_ns / h_ns > T_s / h_s ,    (19)

with h_s estimated via Eq. (17), becomes true.
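A sketch of the switching test follows; all per-step costs, tolerances, and LTE norms are invented for illustration. Here h_ns is the smaller of the stability-limited and accuracy-limited steps, h_s is the stiff-method estimate in the spirit of Eq. (17), and the switch fires when the non-stiff method's cost per unit of simulated time exceeds the stiff method's.

```python
# Sketch of the switching criterion; all numbers are invented for illustration.

def h_accuracy(h_current, eps, lte_norm, order):
    """Eqs. (12)/(13): step size expected to meet local tolerance eps."""
    return h_current * (eps / lte_norm) ** (1.0 / (order + 1))

def should_switch(T_ns, T_s, h_ns, h_s):
    """Eqs. (18)/(19): switch when the non-stiff method's cost per unit of
    simulated time exceeds the stiff method's."""
    return T_ns / h_ns > T_s / h_s

# Per-step costs: the stiff step is assumed 10x dearer (one linear solve).
T_ns, T_s = 1.0, 10.0
q = 1e6                                            # largest exit rate (stiff chain)
h_ns = min(2.78 / (2.0 * q),                       # stability limit, Eq. (8)
           h_accuracy(1e-3, 1e-8, 1e-9, 4))        # accuracy limit, RKF45 (order 4)
h_s = h_accuracy(1e-3, 1e-8, 1e-10, 3)             # stiff-method estimate, Eq. (17)
switch = should_switch(T_ns, T_s, h_ns, h_s)       # True: stability now dominates
```

With these illustrative numbers the non-stiff step has collapsed to the stability limit, so the test fires even though each stiff step costs ten times more.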
In an actual implementation, there are two useful ways to estimate the values of T_s and T_ns. Dummy function calls can be made to the stiff solver and the non-stiff solver. The actual CPU times the two solvers take to perform the calculations for one time step can then be used as measures of T_s and T_ns respectively. Another way is to actually count the number of flops in the code for one time step of the stiff and non-stiff solver. A stiff solver requires one or more linear systems to be solved. If a direct method is used, then it is easy to count the number of flops. If an iterative method is used, then an estimate of the average number of iterations multiplied by the number of flops used per iteration could be used to estimate the cost of solving the linear system. Note that T_s and T_ns will depend on the number of states and the number of transitions in the Markov chain being solved. The overhead of estimating T_s and T_ns is insignificant compared to the computational savings that are realized using this approach. At this point, one may naturally ask if switching back and forth between stiff and non-stiff methods could yield more savings, i.e., are there any more intervals of time after the initial phase when using a non-stiff method would be more efficient than using a stiff method. The answer to this question is no and follows from Eq. (2). At the end of the initial non-stiff phase, the terms corresponding to high modulus eigenvalues are extremely small and contribute almost nothing to the actual solution. Therefore, the step-size of a non-stiff method after this phase is determined solely by the stability condition. This constrains the step-size so much that the computational complexity of the non-stiff method is much larger than that of a stiff method. Based on this principle, we have experimented with two combinations. Both use the RKF45 method as the non-stiff method but differ in the choice of stiff methods.
One uses the second order L-stable TR-BDF2 method and the other uses an L-stable implicit Runge-Kutta method of order 3 (henceforth referred to as IRK3). The former combination is henceforth referred to as the RKF-TBF method and the latter as RKF-IRK. The RKF45 method is well known [24] and we do not describe it here. We briefly describe the two stiff methods.

4.1. TR-BDF2 method
This is an implicit L-stable method that was proposed by Bank et al. [2]. It is a re-starting cyclic multi-step method that uses one step of TR (trapezoidal rule) and one step of BDF2 (second order backward difference formula). The L-stability of the backward difference formula provides the L-stability of this method, and the TR step provides the desirable property of re-starting. A single step of TR-BDF2 is composed of a TR step from t to t + γh (0 < γ < 1) and a BDF2 step from t + γh to t + h. For the system of equations (1), the above scheme yields

P(t + γh) ( I − (γh/2) Q ) = P(t) ( I + (γh/2) Q )    (20)

for the TR step and

P(t + h) ( (2 − γ) I − (1 − γ) h Q ) = (1/γ) P(t + γh) − ( (1 − γ)² / γ ) P(t)    (21)

for the BDF2 step. The LTE vector τ(h) at the end of each time step is estimated using a divided difference formula suggested in [2]:

τ(h) ≈ 2 C_γ h [ (1/γ) P(t) Q − ( 1 / (γ(1 − γ)) ) P(t + γh) Q + ( 1 / (1 − γ) ) P(t + h) Q ] ,    (22)

where C_γ = (−3γ² + 4γ − 2) / (12(2 − γ)).
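The two stages (20)-(21) translate directly into two linear solves per step. The sketch below uses dense factorization for clarity (the paper's implementation uses sparse iterative solvers) and the common choice γ = 2 − √2; the rates are illustrative. Row-vector systems x A = b are solved via the transposed system A^T x^T = b^T.

```python
import numpy as np

def tr_bdf2_step(P, Q, h, gamma=2.0 - np.sqrt(2.0)):
    """One TR-BDF2 step for dP/dt = P Q (cf. Eqs. (20)-(21))."""
    I = np.eye(Q.shape[0])
    # TR stage: P(t + gamma*h) (I - (gamma*h/2) Q) = P(t) (I + (gamma*h/2) Q)
    rhs_tr = P @ (I + (gamma * h / 2.0) * Q)
    P_mid = np.linalg.solve((I - (gamma * h / 2.0) * Q).T, rhs_tr)
    # BDF2 stage: P(t+h) ((2-gamma) I - (1-gamma) h Q)
    #             = (1/gamma) P(t+gamma*h) - ((1-gamma)^2/gamma) P(t)
    rhs_bdf = P_mid / gamma - ((1.0 - gamma) ** 2 / gamma) * P
    return np.linalg.solve(((2.0 - gamma) * I - (1.0 - gamma) * h * Q).T, rhs_bdf)

# Illustrative stiff two-state chain: failure rate 1e-4, repair rate 1e2
lam, mu = 1e-4, 1e2
Q = np.array([[-lam, lam],
              [mu, -mu]])
P = np.array([1.0, 0.0])
for _ in range(100):
    P = tr_bdf2_step(P, Q, 1.0)   # h = 1 is far beyond the explicit stability limit
```

Because the rows of Q sum to zero, both stages preserve the total probability exactly, and L-stability lets the method take steps orders of magnitude larger than an explicit method could.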
It has been shown in [25,32] that this method provides reasonable accuracy for error tolerances down to 10^{-8} and excellent stability for stiff Markov chains. However, for tighter tolerances, the computation cost of this method increases greatly. It also fails to provide sufficient accuracy under the time constraints. Computation of state probabilities with precision higher than this is common in practice. In such cases, this method requires a step size so small that the total time of computation increases tremendously. Thus the need arises for higher order methods.

4.2. Third order implicit Runge-Kutta method

This is a specific third order L-stable method which was used in [25] for transient analysis of Markov chains. L-stable implicit Runge-Kutta methods of higher order can also be derived. Axelsson [1] has shown that these methods are L-stable. To compute state probabilities using Eq. (1), these methods yield a linear algebraic system at each time step:
P(t + h) ( I − (2h/3) Q + (h²/6) Q² ) = P(t) ( I + (h/3) Q ) .    (23)
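Eq. (23) likewise reduces to one linear solve per step after forming Q². The dense sketch below is illustrative only (the rates are invented, and a sparse implementation would exploit the limited fill-in of Q²); row-vector systems x A = b are again solved through the transpose.

```python
import numpy as np

def irk3_step(P, Q, Q2, h):
    """One step of the third order L-stable implicit Runge-Kutta scheme,
    Eq. (23): P(t+h) (I - (2h/3) Q + (h^2/6) Q^2) = P(t) (I + (h/3) Q)."""
    I = np.eye(Q.shape[0])
    A = I - (2.0 * h / 3.0) * Q + (h ** 2 / 6.0) * Q2
    b = P @ (I + (h / 3.0) * Q)
    return np.linalg.solve(A.T, b)    # row-vector system x A = b

lam, mu = 1e-4, 1e2                   # illustrative stiff rates
Q = np.array([[-lam, lam],
              [mu, -mu]])
Q2 = Q @ Q                            # squared once, up front
P = np.array([1.0, 0.0])
for _ in range(10):
    P = irk3_step(P, Q, Q2, 10.0)
```

Since Q and Q² both have zero row sums, total probability is preserved at every step, while the stiff component is damped essentially to zero.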
This method involves the second power of the generator matrix Q. Various possibilities exist for solving the system in Eq. (23) [25]. One is to factorize the matrix polynomial on the left-hand side into linear factors and solve two linear systems that use complex arithmetic. The other is to compute the matrix polynomial directly. We use this approach. This involves squaring the generator matrix. For sparse Markov chains, the fill-in is usually not more than twenty percent. We find this approach to work well. The LTE vector at t + h is given by [25]

τ(t + h) = (1/72) P(t) (hQ)⁴ .    (24)

This approach requires two extra matrix-vector multiplications (one with matrix Q and one with Q²) per time step.

5. Implementation aspects
In this section, we broadly describe the basic strategy. The first step is to initialize various parameters such as the initial step-size h_0, the minimum and maximum step-sizes (h_min, h_max), and the error tolerance for the linear system solvers. Calculation of h_min and h_max may be based upon the given problem and the accuracy desired. However, we may not want to use a very small h_min due to time constraints. We use h_0 = 10^{-7}, h_min = 10^{-7}, and h_max = 10. If iterative linear system solvers are used, then an error tolerance for the linear system solver must be specified. This is usually taken to be one-tenth of the local error tolerance for the ODE method. The next step is to compute the Jacobian and the right-hand side vector (Eqs. (20), (21), and (23)). Once the above steps are completed, the calculations for one time step are performed. At each time step of an implicit method, one or more linear systems need to be solved. The choice of linear system solver depends upon the specific model. Sparse direct methods [8] like Gaussian elimination yield accurate results. Rows and columns can be reordered to minimize the fill-in that results from
using direct methods. However, for large Markov chains, which commonly occur in practice, direct methods are too expensive to use. In such cases, sparse iterative techniques like Gauss-Seidel or SOR are used. In our experiments, we found the Gauss-Seidel method to be sufficiently fast and accurate. However, if Gauss-Seidel does not converge within a specified number of iterations (i.e., the matrix has poor eigen-structure), we switch to SOR. The matrices ((I − (γh/2)Q) and ((2 − γ)I − (1 − γ)hQ) for TR-BDF2 and (I − (2h/3)Q + (h²/6)Q²) for IRK3) are diagonally dominant because of the special structure of the matrix Q, which helps faster convergence of the iterative solvers. If convergence is still not achieved, then the tolerance is relaxed by a constant factor. The other possibility would be to switch to a sparse direct method if the tolerance cannot be sacrificed. After the solution (state probability vector) at the end of a time step is computed, the LTE vector is also computed. Using a user-specified norm, a scalar estimate of the LTE is computed from the LTE vector. If this estimate is within the user-specified local tolerance, then this step is accepted. Note that the user must specify a local error tolerance instead of a global error tolerance. This is because there is no good way to estimate the global error from the local error, and the errors at each step must be bounded somehow. If the end of the solution interval is reached, then stop; otherwise compute a new step-size based on the step-size control technique such that it is less than h_max, and repeat the above steps starting from the step in which the Jacobian is computed. If the local error estimate is not within the tolerance, then reduce the step-size (usually halve it) and repeat the above time step. If the step-size is reduced below h_min, two actions may be taken: either increase the tolerance or switch to some other ODE solver. This summarizes the steps taken by a typical differential equation solver.
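A minimal Gauss-Seidel solver for the row-vector systems x A = b arising in Eqs. (20), (21) and (23) is sketched below (solved here through the transposed system A^T x^T = b^T). The generator and step size are invented for the example; a production code would add the SOR fallback and tolerance relaxation described above.

```python
import numpy as np

def gauss_seidel(M, b, tol=1e-10, max_iter=500):
    """Solve M x = b by Gauss-Seidel sweeps; M is assumed diagonally dominant,
    as the implicit-step matrices built from a generator Q are."""
    x = np.zeros(len(b))
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            s = M[i, :i] @ x[:i] + M[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / M[i, i]
        if np.max(np.abs(x - x_old)) < tol:
            return x
    # a production code would relax tol or fall back to SOR/direct here
    raise RuntimeError("Gauss-Seidel did not converge")

# TR stage of Eq. (20) for an invented generator, with gamma*h = 0.01
lam, mu, gh = 1e-3, 1.0, 0.01
Q = np.array([[-lam, lam],
              [mu, -mu]])
A = np.eye(2) - (gh / 2.0) * Q
b = np.array([1.0, 0.0]) @ (np.eye(2) + (gh / 2.0) * Q)
x = gauss_seidel(A.T, b)          # row-vector system x A = b  ->  A^T x = b
```

Diagonal dominance of A^T, inherited from the structure of Q, is what makes the plain sweeps converge quickly here.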
Our approach requires slight modifications to this general strategy. We begin with a non-stiff method. At the end of each time step, the step-size of this method that satisfies the stability condition and accuracy requirement (Eq. (12)) is computed. The largest eigenvalue magnitude of Q is bounded above by 2q, twice the largest diagonal entry of the matrix Q, using the Gershgorin circle theorem [13]. Hence, the stability condition (Eq. (8)) for the RKF45 method becomes

h ≤ 1.39/q.  (25)
The step-size for the stiff method (TR-BDF2 or IRK3) is computed using Eq. (17). If these step-sizes satisfy the condition specified by Eq. (19), then we switch to the stiff method; otherwise we continue with the RKF45 method. If a switch is made to the stiff method, then its starting step-size is chosen as the value of h_s computed by Eq. (17).
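The switching logic can be sketched as below. The bound is that of Eq. (25); since Eq. (19) is not reproduced here, the switching condition is simplified to a plain comparison for illustration.

```python
def rkf45_stability_bound(Q_diag):
    """Stability-limited step-size for RKF45 (Eq. (25)): h <= 1.39/q.

    q is the largest diagonal entry magnitude of the generator matrix Q.
    By the Gershgorin circle theorem the largest eigenvalue magnitude of Q
    is at most 2q, and the real stability interval of RKF45 is about 2.78,
    giving h <= 2.78/(2q) = 1.39/q.
    """
    q = max(abs(d) for d in Q_diag)
    return 1.39 / q

def should_switch(h_stiff, h_nonstiff, factor=1.0):
    """Switch to the stiff method once its accuracy-based step-size
    exceeds the non-stiff method's stability-limited step-size.
    The comparison factor stands in for the paper's Eq. (19)."""
    return h_stiff > factor * h_nonstiff
```

As repair rates grow, q grows, the bound 1.39/q shrinks, and the switch happens earlier in the solution interval, which matches the trends reported in Section 6.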
6. Numerical results
In this section, we describe the experiments and the numerical results obtained from them. We numerically compare four different methods: (1) TR-BDF2, (2) IRK3, (3) RKF-TBF, and (4) RKF-IRK. We used two models for experimental purposes: (1) the reliability model of a repairable extra-stage shuffle-exchange network [3] (used as a stiff model in [25]), and (2) the dependability model of the repairable C.mmp multiprocessor system [34] (used as a stiff model in [32]). A useful property of these models is that they have closed-form transient solutions. This allows us to compare the actual solution against the computed solutions, thereby giving us a measure of accuracy. The
Fig. 1. Upper bound reliability model of SEN+ network.

Fig. 2. Reliability model of a subsystem of two SEs in parallel.

Fig. 3. Reliability model of m SEs in series.
experiments yield useful general trends which are used to draw inferences about the different methods. Before proceeding to describe the results obtained, we briefly describe the various models used.

6.1. Models used

6.1.1. SEN+ network model
This is an upper-bound reliability model for an N x N extra-stage shuffle-exchange network [3], known as SEN+. The block diagram for this model is shown in Fig. 1. A block of two SEs in parallel is modeled by the CTMC shown in Fig. 2. If a SE fails, then it is replaced (repair action). However, the block of SEs continues to function since the other SE is working. If the other SE also fails before the repair is complete, then the block is considered failed. A set of m SEs in series is modeled by the two-state Markov chain shown in Fig. 3. A Markov model of the overall block diagram is then constructed assuming independence across blocks. We used the Stochastic Petri Net Package (SPNP) [5] to generate the Markov chain. The reliability R(t) of a subsystem of two SEs in parallel at any time t is the sum of the probabilities of being in states 2 and 1 at time t. The reliability of this subsystem, given that the subsystem begins operation in state 2 at time t = 0, is given by

R(t) = A₁e^{α₁t} + A₂e^{α₂t},  (26)

where
Fig. 4. C.mmp system.

α₁ = [−(μ + 3λ) + √(λ² + μ² − 2λμ(1 − 4c))] / 2,
α₂ = [−(μ + 3λ) − √(λ² + μ² − 2λμ(1 − 4c))] / 2,
A₁ = ((2c + 1)λ + μ + α₁) / (α₁ − α₂),
A₂ = ((2c + 1)λ + μ + α₂) / (α₂ − α₁).  (27)

The upper-bound reliability of the SEN+ network is given by

R_SEN+,ub(t) = e^{−Nλt} (A₁e^{α₁t} + A₂e^{α₂t})^K,  (28)

where A₁, A₂, α₁, α₂ have the same form as in Eq. (27) and K = N(log₂N − 1)/4.
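As a check on the closed form, Eqs. (26)-(28) can be transcribed directly. This is a sketch, not the paper's code; the parameter values are those of the base model (N = 8, λ = 1.0, μ = 100.0, c = 0.9), and the sign under the square root is reconstructed so that both exponents α₁, α₂ are negative.

```python
import math

def sen_plus_reliability(t, N=8, lam=1.0, mu=100.0, c=0.9):
    """Upper-bound reliability of the SEN+ network, Eqs. (26)-(28)."""
    disc = math.sqrt(lam**2 + mu**2 - 2 * lam * mu * (1 - 4 * c))
    a1 = (-(mu + 3 * lam) + disc) / 2.0
    a2 = (-(mu + 3 * lam) - disc) / 2.0
    A1 = ((2 * c + 1) * lam + mu + a1) / (a1 - a2)
    A2 = ((2 * c + 1) * lam + mu + a2) / (a2 - a1)
    K = N * (math.log2(N) - 1) / 4.0               # number of parallel SE pairs
    block = A1 * math.exp(a1 * t) + A2 * math.exp(a2 * t)  # Eq. (26): one pair
    return math.exp(-N * lam * t) * block**K               # Eq. (28)
```

Note that A₁ + A₂ = 1 by construction, so R(0) = 1 as required of a reliability function.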
For the base model, we choose N = 8, λ = 1.0, μ = 100.0, and c = 0.9. The error tolerance is ε = 10⁻¹¹ in the experiments unless otherwise stated. The number of states in the Markov chain is 324 and the number of transitions is 2052.

6.1.2. C.mmp model
This is the dependability model of the C.mmp system, which consists of sixteen processors and sixteen memory modules interconnected by a switch (Fig. 4). The time to failure and the time to repair of each component are independent and exponentially distributed. It is further assumed that each component has an independent repair person. The system is operational if at least four memory modules, four processors, and the switch are operational. The instantaneous availability of a component with failure rate λ and repair rate μ is given by
A(t) = μ/(λ + μ) + (λ/(λ + μ)) e^{−(λ+μ)t}.  (29)
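Eq. (29), combined with the at-least-k-of-n combination used for the processors and memory modules in the system availability of Eq. (30), can be transcribed as follows. This is a sketch, not the paper's implementation; the default rates are the base-model values given below.

```python
import math

def availability(t, lam, mu):
    """Instantaneous availability of one component, Eq. (29)."""
    return mu / (lam + mu) + (lam / (lam + mu)) * math.exp(-(lam + mu) * t)

def at_least_k_of_n(a, k, n):
    """Probability that at least k of n independent components,
    each available with probability a, are operational."""
    return sum(math.comb(n, i) * a**i * (1 - a)**(n - i)
               for i in range(k, n + 1))

def cmmp_availability(t, lam_p=68.9e-6, lam_m=224e-6, lam_s=202e-6, mu=1.0):
    """C.mmp system availability, Eq. (30): the switch is up, and at least
    four of the sixteen processors and memory modules are operational."""
    a_s = availability(t, lam_s, mu)
    a_p = availability(t, lam_p, mu)
    a_m = availability(t, lam_m, mu)
    return a_s * at_least_k_of_n(a_p, 4, 16) * at_least_k_of_n(a_m, 4, 16)
```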
The instantaneous availability of the C.mmp system is

A_C.mmp(t) = A_s(t) [Σ_{i=4}^{16} C(16, i) A_p(t)^i (1 − A_p(t))^{16−i}] [Σ_{j=4}^{16} C(16, j) A_m(t)^j (1 − A_m(t))^{16−j}],  (30)
where A_s(t), A_p(t), and A_m(t) are respectively the instantaneous availabilities of the switch, a processor, and a memory module. For the base model, the failure rate of any processor is λ_p = 68.9 × 10⁻⁶ per hour, the failure rate of any memory module is λ_m = 224 × 10⁻⁶ per hour, and the failure rate of the switch is λ_s = 202 × 10⁻⁶ per hour. The repair rate μ for each component is taken to be 1.0 per hour. The Markov chain for this system is generated using SPNP [5]. It contains 578 states and 2754 transitions. In experiments with this model, an error tolerance of ε = 10⁻¹² is chosen unless otherwise stated.

6.2. Experimental results
In this section, we discuss the results obtained after a series of experiments were conducted on the models described in the previous section. The experiments were conducted on a Sun SPARCstation ELC. The criteria used for comparing different methods are the accuracy achieved and the computation cost. The experiments fall into the following categories: (1) accuracy versus error tolerance, (2) computation cost versus error tolerance, (3) computation cost versus model parameters, (4) computation cost versus length of solution interval, and (5) computation cost versus model size.

6.2.1. Accuracy
In the first series of experiments, we varied the error tolerance and measured the computation cost and accuracy achieved. The accuracy of the final results is measured as the error in the computed solution by comparing the computed solutions against the actual solutions. The actual solutions are obtained by substituting numerical values into the closed-form solutions. Calculations are carried out using the virtually infinite-precision calculator program available on the computer. The error tolerance is varied from 10⁻⁵ to 10⁻¹² for each of the two models: SEN+ and C.mmp. The lengths of the solution interval for the SEN+ and C.mmp models are 10 and 40 respectively. The accuracy achieved versus the error tolerance specified for the different methods is plotted in Figs. 5 and 6. For the SEN+ model, the RKF-IRK method achieves higher accuracy than the other methods for ε < 10⁻⁶. For ε < 10⁻⁸, all the methods achieve slightly less accuracy than desired. This is ascribed to the poor error control mechanism from which all the ODE methods suffer. For the C.mmp model, all the methods achieve higher accuracy than desired. In general, the accuracy achieved increases as the error tolerance gets tighter. However, sometimes a counter-intuitive phenomenon occurs when higher accuracy is achieved for a higher error tolerance. This behavior is ascribed to the poor control of global
Fig. 5. Accuracy vs error tolerance - SEN+ model.
Fig. 6. Accuracy vs error tolerance - C.mmp model.
error from the local error. Typically, the accuracy achieved by RKF-IRK (RKF-TBF) is higher than that of IRK3 (TR-BDF2). This is expected because the order of accuracy of RKF45 is higher than the order of accuracy of either TR-BDF2 or IRK3. Therefore, the state probability vector at time t, the time at which the switch is made from RKF45 to TR-BDF2 or IRK3, should contain less error than if it had been computed using the TR-BDF2 or IRK3 method. The accuracy of the final solution also depends upon the smallest step-size allowed in the actual implementation. In our implementation, the minimum allowable step-size is 10⁻⁷. One should rarely expect to get higher accuracies by using a step-size smaller than that. Besides, smaller step-sizes are computationally infeasible to use due to the excessively large number of steps required. We now consider the computation cost as a function of error tolerance. The computation cost is measured in terms of the CPU seconds required. From the same experiments, the results obtained are plotted in Figs. 7 and 8 for the "base" SEN+ and C.mmp models respectively. The common trend
Fig. 7. CPU time vs error tolerance - SEN+ model.
Fig. 8. CPU time vs error tolerance - C.mmp model.
observed is that the computation cost increases as the tolerance decreases. It is the rate of increase that forms the basis for distinction among the methods. For error tolerances up to 10⁻⁷, IRK3 (TR-BDF2) performs nearly as well as RKF-IRK (RKF-TBF). However, as the error tolerance is tightened beyond that, we see that RKF-IRK (RKF-TBF) is much more efficient than IRK3 (TR-BDF2). Savings in computation cost of as much as 80% and more can be realized for some models using the methods (RKF-TBF and RKF-IRK) based on our new approach.

6.2.2. Stiffness
We now consider how stiffness affects the performance of the different solvers. Computation cost is the measure of performance chosen. Two ways of varying stiffness are considered. One is to vary the length of the solution interval and the other is to vary the model parameters. We first consider the effect of increasing the length of the solution interval on the computational complexity of the solvers. Computation cost (CPU seconds) is plotted as a function of the length of the solution interval (which in our case is
Fig. 9. CPU time vs length of solution interval - SEN+ model.
Fig. 10. CPU time vs length of solution interval - C.mmp model.
also the time at which the solution is desired, since the initial time is zero) in Figs. 9 and 10 for the SEN+ and C.mmp 'base' models respectively. We see that RKF-IRK (RKF-TBF) is much more efficient than IRK3 (TR-BDF2). For some models, savings of as much as 70% and more in computation time are realized with the new methods. The computation cost of the ODE methods levels off after steady-state is reached. We also observe that TR-BDF2 detects steady-state before it is actually reached. This happens because the state probabilities change very slowly when close to the steady-state. TR-BDF2, being a second-order method, does not detect those changes to within the high accuracy desired and begins to increase the step-size. We now vary stiffness by varying model parameters. The parameters varied for the different models are
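A steady-state test of the kind discussed above can be sketched as follows. The norm and threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def reached_steady_state(pi, Q, threshold=1e-8):
    """Detect steady state: the right-hand side of the Kolmogorov
    equation, d pi/dt = pi Q, becomes negligibly small as the chain
    approaches equilibrium. A low-order method may pass such a test
    early, since slow probability changes fall below its resolution."""
    return np.linalg.norm(pi @ Q, ord=np.inf) < threshold
```

For the exact stationary vector, pi Q = 0 holds identically, so any threshold accepts it; far from equilibrium the derivative is large and the test fails.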
Fig. 11. CPU time vs repair rate - SEN+ model.
Fig. 12. CPU time vs repair rate - C.mmp model.
(1) the repair rate μ for the SEN+ model, and (2) the repair rate μ for the C.mmp model. For both models, the largest entry q in the generator matrix Q can be increased by increasing μ. It is expected that increasing μ will also increase the largest eigenvalue of the generator matrix Q, and the model will become increasingly stiff. For the "base" SEN+ and C.mmp models, the CPU seconds required are plotted against the corresponding μ in Figs. 11 and 12 respectively. The lengths of the solution interval for the SEN+ and C.mmp models are 1.0 and 0.1 respectively. A smaller solution interval is chosen for the C.mmp model since the steady-state is attained faster for high values of the repair rate. For low values of μ, RKF-IRK (RKF-TBF) is much more efficient than IRK3 (TR-BDF2). As μ increases, RKF-IRK (RKF-TBF) begin switching to IRK3 (TR-BDF2) earlier, and the savings due to these methods over IRK3 and TR-BDF2 decrease. This is expected since the step-size h of the RKF45 method is restricted as h ≤ 1.39/q. Thus, the condition in Eq. (19) is satisfied at smaller times as μ (and therefore q) increases, and the savings due to RKF-IRK (RKF-TBF) over IRK3 (TR-BDF2) decrease. For the C.mmp model we also observe that for smaller solution intervals, TR-BDF2 is
Fig. 13. CPU time vs size - M/M/1/K model.
more efficient than the IRK3 method. This is because the overhead of computing the Jacobian for the IRK3 method forms a large part of the total computation time for smaller solution intervals.

6.2.3. Size
Finally, we consider the effect of the size of the model on solver performance. We use the Markov model of an M/M/1/K queue with arrival rate λ = 1.0 per hour and service rate μ = 100.0 per hour. This model is chosen because its structure remains the same for different buffer sizes. The error tolerance is set to 10⁻¹² and the length of the solution interval is 1.0. For a buffer of size K, the number of states in the Markov chain is K + 1. In Fig. 13, the computation cost of the different methods is plotted as a function of the buffer size K. The rate of increase in computation cost of the RKF-IRK and RKF-TBF methods is very small compared to the rate of increase of the TR-BDF2 and IRK3 methods.
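The M/M/1/K generator used in this experiment has a simple birth-death structure; a minimal sketch of its construction (function and parameter names are ours, with the base-model rates as defaults):

```python
import numpy as np

def mm1k_generator(K, lam=1.0, mu=100.0):
    """Generator matrix Q of an M/M/1/K queue: K + 1 states
    (0..K jobs in system), arrivals at rate lam, services at rate mu."""
    n = K + 1
    Q = np.zeros((n, n))
    for i in range(n):
        if i < K:
            Q[i, i + 1] = lam   # arrival (blocked when the buffer is full)
        if i > 0:
            Q[i, i - 1] = mu    # service completion
        Q[i, i] = -Q[i].sum()   # diagonal makes each row sum to zero
    return Q
```

Only the dimension changes with K; the tridiagonal sparsity pattern and the rates stay the same, which is why this model isolates the effect of size on solver cost.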
7. Conclusions
Dependability and performability models of repairable computer and communication systems commonly give rise to stiff Markov chains. Transient analysis of these Markov chains is often carried out to obtain useful measures of system reliability, availability, and performability. Stiff ODE methods are traditionally recommended for transient analysis of stiff Markov chains. We have described an approach that uses a combination of non-stiff and stiff ODE methods to design methods that are significantly more efficient than stiff ODE methods. Our approach is based on the observation that many stiff Markov chains are actually non-stiff during an initial phase of the solution interval. Using a non-stiff ODE method to solve the Markov chain during this phase yields large computational savings. The extent of the non-stiff phase is determined by a formal criterion. Furthermore, unlike earlier definitions of stiffness which ignored the accuracy requirement, our approach automatically takes the accuracy requirement into account as part of the model stiffness. We have implemented two methods based on our approach. The RKF-TBF method uses the fourth-order Runge-Kutta-Fehlberg method as the non-stiff method and the TR-BDF2 method as the stiff method. The RKF-IRK method uses the same non-stiff method but the third-order implicit Runge-Kutta
(IRK3) method as the stiff method. The numerical comparison among these methods is carried out by solving stiff dependability models of an interconnection network and a multiprocessor system. Our experiments show that the new methods RKF-TBF and RKF-IRK significantly reduce the computation time compared to the respective TR-BDF2 and IRK3 methods. For many stiff models, the computation time is reduced by as much as 80% and more. The accuracy achieved by RKF-IRK (RKF-TBF) is at least as much as that achieved by IRK3 (TR-BDF2) and sometimes more. We recommend RKF-TBF method for moderate accuracy requirements (up to 7 decimal places) and RKF-IRK method for high accuracy requirements.
Acknowledgements The author would like to thank Dr. Andrew Reibman, Dr. Ricardo Pantazis, Prof. Linda Petzold, Dr. A.V. Ramesh, and Prof. Kishor Trivedi for helpful comments. The referees made useful comments and provided helpful suggestions. One of the referees suggested the discussion on computation of K in Section 4.
References
[1] O. Axelsson. A class of A-stable methods. BIT, 9 (1969) 185-199.
[2] R.E. Bank, W.M. Coughran, W. Fichtner, E.H. Grosse, D.J. Rose, and R.K. Smith. Transient simulation of silicon devices and circuits. IEEE Transactions on Computer-Aided Design, 4 (1985) 436-451.
[3] J.T. Blake and K.S. Trivedi. Reliability analysis of interconnection networks using hierarchical composition. IEEE Transactions on Reliability, 38 (1) (1989) 111-120.
[4] A. Bobbio and K.S. Trivedi. An aggregation technique for the transient analysis of stiff Markov chains. IEEE Transactions on Computers, 35 (9) (1986) 803-814.
[5] G. Ciardo, J.K. Muppala, and K.S. Trivedi. SPNP: Stochastic Petri Net Package. Proc. Int. Workshop on Petri Nets and Performance Models, Kyoto, Japan, December 1989.
[6] C.A. Clarotti. The Markov approach to calculating system reliability: Computational problems. In: A. Serra and R.E. Barlow, eds., Proc. Int. School of Physics, Course XCIV, North-Holland, Amsterdam (1986) 55-66.
[7] E. de Souza e Silva and H.R. Gail. Performability analysis of computer systems: from model specification to solution. Performance Evaluation, 14 (1992) 157-196.
[8] I.S. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Oxford (1986).
[9] D. Dyer. Unification of reliability/availability/repairability models for Markov systems. IEEE Transactions on Reliability, 38 (2) (1989) 246-252.
[10] E. Fehlberg. Klassische Runge-Kutta-Formeln vierter und niedrigerer Ordnung mit Schrittweiten-Kontrolle und ihre Anwendung auf Wärmeleitungsprobleme. Computing, 6 (1970) 61-71.
[11] C.W. Gear. Numerical Initial Value Problems in Ordinary Differential Equations. Prentice-Hall, Englewood Cliffs, N.J. (1971).
[12] R. Geist and K.S. Trivedi. Ultra-high reliability prediction for fault-tolerant computer systems. IEEE Transactions on Computers, 32 (12) (1983) 1118-1127.
[13] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Md., 2nd ed. (1989).
[14] A. Goyal, S. Lavenberg, and K.S. Trivedi. Probabilistic modeling of computer system availability. Annals of Operations Research, 8 (1987) 285-306.
[15] W.K. Grassmann. Transient solution in Markovian queueing systems. Computers and Operations Research, 4 (1977) 47-56.
[16] W.K. Grassmann. Finding transient solutions in Markovian event systems through randomization. In: W.J. Stewart, ed., Numerical Solution of Markov Chains. Marcel Dekker, New York (1991) 357-371.
[17] D. Gross and D. Miller. The randomization technique as a modeling tool and solution procedure for transient Markov processes. Operations Research, 32 (2) (1984) 334-361.
[18] E. Hairer, S.P. Norsett, and G. Wanner. Solving Ordinary Differential Equations I - Nonstiff Problems. Springer, Berlin (1987).
[19] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II - Stiff and Differential-Algebraic Problems. Springer, Berlin (1991).
[20] A. Jensen. Markoff chains as an aid in the study of Markoff processes. Skand. Aktuarietidskrift, 36 (1953) 87-91.
[21] A.M. Johnson and M. Malek. Survey of software tools for evaluating reliability, availability and serviceability. ACM Computing Surveys, 20 (4) (1988) 227-269.
[22] J. Keilson. Markov Chain Models: Rarity and Exponentiality. Springer, Berlin (1979).
[23] J.D. Lambert. Stiffness. In: I. Gladwell and D.K. Sayers, eds., Computational Techniques for Ordinary Differential Equations. Academic Press, New York (1980) 19-46.
[24] J.D. Lambert. Numerical Methods for Ordinary Differential Systems. Wiley, New York (1991).
[25] M. Malhotra and K.S. Trivedi. Higher-order methods for transient analysis of stiff Markov chains. Proc. 3rd Int. Conf. on Performance of Distributed Systems and Integrated Communication Networks, 1991.
[26] R. Marie. Transient numerical solutions of stiff Markov chains. Proc. 20th ISATA Symp., 1989, pp. 255-270.
[27] R.A. Marie, A.L. Reibman, and K.S. Trivedi. Transient solution of acyclic Markov chains. Performance Evaluation, 7 (3) (1987) 175-194.
[28] B. Melamed and M. Yadin. Randomization procedures in the computation of cumulative-time distributions over discrete state Markov processes. Operations Research, 32 (4) (1984) 926-944.
[29] D.R. Miller. Reliability calculation using randomization for Markovian fault-tolerant computing systems. 13th Annual Int. Symp. on Fault-Tolerant Computing, Milano, Italy, 1983, pp. 284-289.
[30] W.L. Miranker. Numerical Methods for Stiff Equations and Singular Perturbation Problems. D. Reidel, Dordrecht (1981).
[31] L. Petzold. Automatic selection of methods for solving stiff and nonstiff systems of ordinary differential equations. SIAM J. Sci. Stat. Comput., 4 (1) (1983) 136-148.
[32] A. Reibman and K.S. Trivedi. Numerical transient analysis of Markov models. Computers and Operations Research, 15 (1) (1988) 19-36.
[33] A. Reibman, K.S. Trivedi, and R. Smith. Markov and Markov reward model transient analysis: An overview of numerical approaches. European Journal of Operations Research, 40 (2) (1989) 257-267.
[34] D.P. Siewiorek. Multiprocessors: Reliability modeling and graceful degradation. Infotech State of the Art Conf. on System Reliability, Infotech International, London, 1977, pp. 48-73.
M. Malhotra received the B.Tech. degree in Computer Science from the Indian Institute of Technology, Delhi, India, in 1989 and the M.S. and Ph.D. degrees in Computer Science from Duke University in 1990 and 1993 respectively. Since June 1993, he has been a member of technical staff at AT&T Bell Labs, Holmdel. His current interests include reliability and performability modeling of telecommunication networks and services.