Availability and Reliability Modeling for Computer Systems


DAVID I. HEIMANN AND NITIN MITTAL
Digital Equipment Corporation, Andover, Massachusetts

KISHOR S. TRIVEDI
Computer Science Dept., Duke University, Durham, North Carolina

1. Introduction
   1.1 What is Dependability?
   1.2 Why Use Dependability?
   1.3 Where is Dependability Used?
2. Measures of Dependability
   2.1 Classes of Dependability Measures
   2.2 Guidelines for a Choice of Measure
   2.3 The Exponential Distribution
   2.4 An Introductory Example
   2.5 System Availability Measures
   2.6 System Reliability Measures
   2.7 Task Completion Measures
   2.8 Summary of Measures
3. Types of Dependability Analyses
4. The Modeling of Dependability
   4.1 Model Solution Techniques
   4.2 Parameter Determination
   4.3 Model Validation and Verification
5. A Full-System Example
   5.1 System Description
   5.2 Dependability Analysis
   5.3 Evaluations Using Other Measures
6. Conclusions
Acknowledgments
References

ADVANCES IN COMPUTERS, VOL. 31


Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-012131-X


1. Introduction

This paper addresses computer system dependability analysis, which ties together concepts such as reliability, maintainability, and availability. Dependability serves, along with cost and performance, as a major system selection criterion. Three classes of dependability measures are described: system availability, system reliability, and task completion. Using an introductory example, measures within each class are defined, evaluated, and compared. Four types of dependability analyses are discussed: evaluation, sensitivity analysis, specification determination, and tradeoff analysis. Markov and Markov reward models, commonly used for dependability analysis, are reviewed, and their solution methods are discussed. The determination of the parameters, such as failure rates, coverage probabilities, repair rates, and reward rates, is discussed, as well as model verification and validation. To demonstrate the use of these methods, a detailed dependability analysis is carried out on a full-system example representative of existing computer systems.

1.1 What is Dependability?

All kinds of people involved with computers, whether as designers, manufacturers, software developers, or users, are very much interested in determining how well their computer system is doing its job (or would do the job, if they are considering acquiring such a system). As with most other products and services, the people involved want to know whether their money has been (or would be) well spent and whether what they need is in fact being provided.

At first, this assessment naturally takes the form of determining the fault-free performance, or level of service, of the system. People have become aware, often by bitter experience, that they must know not only how much service a computer system can deliver, but also how often it in fact delivers that intended level of service. Like other products, a computer system becomes far less attractive if it frequently deviates from its nominal performance. In fact, in many cases people would prefer a system that faithfully delivers its level of service to an alternative system that does not, even if the latter system delivers more service over the long run. There has therefore been a definite need to assess this “faithfulness to the intended level of service.”

Generally, this assessment first takes the form of determining how frequently the system fails to function, or, similarly, the length of time the system operates until such a failure. This assessment has developed into the field of reliability. However, a complete assessment also requires consideration of the time needed for the system to recover its level of service once a failure takes place, or, more broadly, what impact the failure has on service. The characterization of repair and recovery is embraced by the topic of maintainability. The concepts of reliability and maintainability have been combined to produce the concept of availability.

Since terms such as reliability and availability are used in a precise mathematical sense as well as in a generic sense, the term dependability is used to refer to the generic concept. The International Electrotechnical Vocabulary (IEV 191, 1987) and Laprie (1985) define dependability as the ability of a system or product to deliver its intended level of service to its users, especially in the light of failures or other incidents that impinge on its level of service.

Dependability manifests itself in various ways. In an office word-processing system, it can be the proportion of time that the system is able to deliver service. In a manufacturing process-monitoring system, it can be the number of times per year that a control-system failure causes the manufacturing line to be shut down. In a transaction-processing system, it can be the likelihood that a transaction is successfully completed within a specified tolerance time.

Dependability, measuring as it does the ability of a product to deliver its promised level of service, is a key descriptor of product quality. In turn, quality is a key aspect, along with product cost and performance, that customers use not only in making purchases of specific products and services, but also in forming reputations of hardware and software producers.

1.2 Why Use Dependability?

Dependability allows comparisons with cost and performance. Dependability covers one of three critical criteria on which decisions are made on what to purchase or use (see Fig. 1). When customers or users make such decisions, they ask three fundamental questions: What level of service can this system deliver to me? How much does the system cost? How likely is the system to actually deliver its nominal level of service? The first question addresses performance, while the second question addresses cost. The concept of dependability addresses the third question. By doing so, it plays an important role in providing a platform that solidly addresses all three issues, thus allowing the user to make an effective multi-criteria decision.

FIG. 1. Product selection criteria: dependability, performance, and cost.

Dependability provides a proper focus for product-improvement efforts. Taking a dependability point of view in the design process (and in manufacturing and operations planning as well) causes one to consider a broad range of possible influences on product quality. Without such a view, one may focus strongly on a specific area or method, which may result in actions that in fact hurt the overall situation. For example, a focus on improving component reliability or maintainability alone may miss the following possibilities:

• The processor reliability is improved too much. Further improvement in the reliability of the given component beyond a certain point will not help the overall dependability. It is very important to recognize this point so as not to waste time and resources trying to improve subsystem reliability past this limit.

• Measures other than subsystem reliability improvement may provide better results. For example, consider a system that requires a system reboot after every failure. One may obtain improvement by increasing the processor reliability so that failures do not happen as often. However, it may turn out to be much more effective to change the design so that failures can be isolated from the rest of the system and do not require a total system reboot.

• Failures may not be the main problem. For example, service interruption may occur most often when the system is heavily loaded (Iyer et al., 1986). In this event, rather than trying to improve the processor reliability, it may be far more effective to perform load balancing.


Dependability can take into account safety and risk issues. Safety is an extremely important issue in many situations. This includes not only the safety of people and equipment, but also the safety of data and processes. Unsafe situations generally arise only through a combination of underlying events, often in a complex fashion. To deal with them adequately requires the kind of system-level approach that dependability analysis provides. In addition to safety, dependability can also assess other risks a user faces because of failure-induced non-performance. Risk is a very important part of product selection before purchase and product operation after purchase. Risk can be reduced by identifying the sources of significant and unsafe outages and taking steps to decrease their occurrence and their impact. Dependability analyses can identify the likelihood of potentially large impacts as well as the overall average risk, and also pinpoint the sources of significant risk.

1.3 Where is Dependability Used?

Dependability is used in all stages of the computer life cycle. In the requirements and planning stage, it provides a customer/user orientation in developing the overall requirements on the hardware, software, and information systems. In the specification stage, these generalized assessments of user failure sensitivities are formulated into dependability specifications, from which reliability and maintainability specifications are developed. In this manner, the resulting specifications are focused properly on the users’ failure sensitivities. In the design stage, prospective system architectures and operating policies are evaluated with respect to the specifications, as well as with respect to an overall fault tolerance approach developed during the planning stage and refined during this stage. In the manufacturing stage, dependability provides an overall framework for quality control efforts, so that these efforts can be focused on those potential defects to which the users would be most sensitive. In the sales and deployment stage (which includes sales and sales support, product marketing/positioning, and systems planning and analysis), dependability makes for precise expectations on how well and in what form the product can deliver its intended usage. In the operations stage, dependability can be used to plan operator response to failures and other incidents, including the effective use of measures such as operator-requested shutdowns, load balancing, and scheduling of preventive and non-urgent corrective maintenance. In the maintenance stage, the frequency of preventive maintenance can be compared against the improvement in user dependability, and its scheduling can be adjusted to minimize the impact on users.


2. Measures of Dependability

2.1 Classes of Dependability Measures

Dependability measures fall into three basic classes: system availability, system reliability, and task completion. Each of these measure classes has its differences; which one is appropriate depends on the specific situation under investigation, the availability of relevant data, and the usage or customer profile.

System Availability. System availability measures show the likelihood that the system is delivering adequate service or, equivalently, the proportion of potential service actually delivered. Commercial computer systems are designed to provide high system availability. Brief interruptions in system operation can be tolerated, but not significant aggregate annual outage. For such highly available systems, a measure of interest might be the probability that the system is up at a given time t, or the expected proportion of time that the system is up in a given time interval.

System Reliability. System reliability measures show the length of time before a system failure occurs or, equivalently, the frequency of such failures. These measures apply to systems that are highly sensitive to interruptions. For example, flight control systems are required to provide interruption-free service during the length of the flight. For these systems, the measure of interest is the probability that the system operates without failure during a specific time interval.

Task Completion. Task completion measures show the likelihood that a particular user will receive adequate service or, equivalently, the proportion of users who receive adequate service. These measures fit best with situations in which specific tasks of importance can be identified, for example, an on-line transaction processing system, in which a definite unit of service exists: the transaction. The measure “percent of transactions successfully completed” accurately describes the dependability situation.

2.2 Guidelines for a Choice of Measure

The choice of a proper measure is very important for an effective dependability analysis. The key to a proper choice is the user tolerance pattern, i.e., how the system and its users react to failures and interruptions. The tolerance pattern can be depicted by a graph of the impact of an interruption or outage against its length as shown in Fig. 2. The graph illustrates the situations in which particular measures should be used.


Note the following considerations in the user tolerance patterns:

• There is a tolerance $\tau$, at or below which the users are unaffected by the interruption. The tolerance may be equal to zero, in which case all interruptions affect the users.

• The graph may be discontinuous at the tolerance $\tau$. A discontinuity implies that a system reliability measure (Curves 2 and 4) be used, while a lack of such discontinuity implies that a system availability measure (Curves 1 and 3) be used.

• The graph may have a positive slope after the tolerance $\tau$ (vs. a slope of zero). Such a slope implies that a system availability measure be used. A slope of 1 implies that a basic availability be used, whereas other slopes imply that a weighted (or capacity-oriented) availability be used.

The tolerance graph can address task-completion measures as well as system availability and system reliability measures. In these cases the tolerance graph addresses the impact of the interruption on an individual task rather than on the computer system as a whole.

2.3 The Exponential Distribution

The exponential distribution plays an important role in dependability analysis. The exponential distribution function is given by

\[
F(t) =
\begin{cases}
1 - e^{-\lambda t}, & \text{if } 0 \le t < \infty, \\
0, & \text{otherwise.}
\end{cases}
\]

Suppose we have an event type whose occurrence is governed by the following two rules:

1. The likelihood that the event occurs once during a given small interval of time is proportional to the length of the interval, i.e.,

\[
\Pr(\text{event occurs during } (t, t + h)) = \lambda h + o(h),
\]

where $\lambda$ is the constant of proportionality.

2. The likelihood that the event occurs twice or more during a given small interval of time is negligibly small, i.e.,

\[
\Pr(\text{two or more events}) = o(h),
\]

where $o(h)$ denotes any quantity having an order of magnitude smaller than $h$, that is,

\[
\lim_{h \to 0} \frac{o(h)}{h} = 0.
\]


Then the time between successive occurrences of the event will be exponentially distributed with rate of occurrence $\lambda$. The mean time to the occurrence of the event is $1/\lambda$. If successive inter-event times of a stochastic process are independent and identically distributed with an exponential distribution of rate $\lambda$, the number of occurrences of the event in an interval of length $t$ has a Poisson distribution with mean $\lambda t$, i.e.,

\[
\Pr(k \text{ occurrences in an interval of length } t) = \frac{e^{-\lambda t} (\lambda t)^k}{k!}, \qquad k = 0, 1, 2, \ldots
\]
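As a rough illustration of the Poisson count distribution, the following Python sketch evaluates the probability of k occurrences in an interval and checks that the probabilities sum to one with mean $\lambda t$. The function name `poisson_pmf` and the numerical values (a rate of one event per 5,000 hours observed over 8,760 hours) are assumptions chosen only for illustration.

```python
import math

def poisson_pmf(k, lam, t):
    """Probability of exactly k events in an interval of length t at rate lam."""
    mean = lam * t
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Illustrative values (assumed): one event per 5,000 hours, observed for 8,760 hours.
lam, t = 1 / 5000.0, 8760.0

# Truncate the infinite sum at 40 terms; the remaining tail is negligible here.
probs = [poisson_pmf(k, lam, t) for k in range(40)]
total = sum(probs)                                     # close to 1
mean_count = sum(k * p for k, p in enumerate(probs))   # close to lam * t

print(f"sum of probabilities = {total:.12f}")
print(f"mean number of events = {mean_count:.6f} (lam * t = {lam * t:.6f})")
```

The truncation at 40 terms is a convenience for this small mean; a larger mean would need more terms.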

Notice that the likelihood of the event occurring in the interval $(t, t + h)$ does not depend on the value of $t$. This is called the memoryless or Markov property (Trivedi, 1982). The memoryless property states that the time we must wait for a new occurrence of the event is statistically independent of how long we have already spent waiting for it. In a reliability context, it implies that a component does not “age,” but instead only fails due to some randomly appearing failure. In a performance context, it implies that the arrival of a new customer does not depend on how long it has been since the arrival of the previous customer.

The memoryless property leads to considerable simplification in the analysis of the stochastic processes underlying dependability and performance problems. In many such problems the memoryless property in fact holds, and in many others it represents an excellent approximation. Methods of dealing with non-exponential distributions are beyond the scope of this paper, but the interested reader is referred to Cinlar (1975), Cox (1955), and Trivedi (1982).

If a random variable with distribution function $F(t)$ is used to model the time to failure of a component (or system), then $R(t) = 1 - F(t)$ is called the reliability function. Note that $R(0) = 1$ and $R(\infty) = 0$. Note also that if the time to failure is an exponentially distributed random variable with rate parameter $\lambda$, then $R(t) = e^{-\lambda t}$.

Related to the reliability function is the hazard function $h(t)$. This function represents the instantaneous rate at which failures occur. In other words, the probability is $h(t)\,dt$ that a failure will take place in the time interval $(t, t + dt)$, given that no failure has taken place before time $t$. Furthermore,

\[
h(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt}
\]

and

\[
R(t) = \exp\left( -\int_0^t h(x)\,dx \right).
\]

For the exponential reliability function, $h(t) = \lambda$, i.e., the instantaneous failure rate is constant in time (no infant mortality or aging takes place).
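The memoryless property and the constant hazard rate can be checked numerically. The Python sketch below is illustrative only: the function names and the failure rate of one per 5,000 hours are assumptions, and the hazard is estimated with a simple central difference rather than derived analytically.

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lam * t) for an exponentially distributed time to failure."""
    return math.exp(-lam * t)

def conditional_survival(s, t, lam):
    """P(X > s + t | X > s): survive t more hours after already surviving s."""
    return reliability(s + t, lam) / reliability(s, lam)

def hazard(t, lam, dt=1.0):
    """Numerical estimate of -R'(t) / R(t) using a central difference."""
    slope = (reliability(t + dt, lam) - reliability(t - dt, lam)) / (2 * dt)
    return -slope / reliability(t, lam)

lam = 1 / 5000.0  # assumed failure rate per hour

fresh = reliability(500.0, lam)                  # new component, 500 hours
aged = conditional_survival(1000.0, 500.0, lam)  # 1,000-hour-old component, 500 more hours
rate = hazard(2000.0, lam)                       # should be close to lam at any t

print(f"fresh: {fresh:.9f}, aged: {aged:.9f}")
print(f"estimated hazard: {rate:.9e}, lam: {lam:.9e}")
```

Under the exponential assumption, the aged component survives the next 500 hours with the same probability as a new one, which is what "does not age" means in this context.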


2.4 An Introductory Example

Consider a multiprocessor system with two processors. Each processor is subject to failure so that its MTTF (mean time to failure) is $1/\lambda$. When a processor failure occurs, the system will automatically recover from the failure with probability $c$ (this is called a covered failure), or with probability $1 - c$ the system needs to be rebooted (this is called an uncovered failure). A covered failure is followed by a brief reconfiguration period, the average reconfiguration time being $1/\delta$. An uncovered failure requires a reboot, which takes longer, the average reboot time being $1/\beta$ ($1/\beta > 1/\delta$). In either case, the affected processor needs to be repaired, with the mean time to repair (MTTR) being $1/\mu$. During the repair, the other processor continues to run and provides service normally. Should the other processor fail before the first one is repaired, however, the system is out of service until the repair is completed.

If we assume that times to processor failure, processor repair, system reconfiguration, and system reboot are independent exponentially distributed random variables and that there is only a single repair person, then the multiprocessor system can be modeled by the continuous-time Markov chain shown in Fig. 3.

FIG. 3. Markov model for a two-processor system.

Let $P_i(t)$ be the probability that the system is in state $i$ at the time instant $t$. Then, the following differential equations completely define the Markov chain of Fig. 3 (Trivedi, 1982):

\[
\begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda P_2(t) + \mu P_1(t), \\
\frac{dP_{1c}(t)}{dt} &= -\delta P_{1c}(t) + 2\lambda c P_2(t), \\
\frac{dP_{1u}(t)}{dt} &= -\beta P_{1u}(t) + 2\lambda (1 - c) P_2(t), \\
\frac{dP_1(t)}{dt} &= -(\lambda + \mu) P_1(t) + \delta P_{1c}(t) + \beta P_{1u}(t) + \mu P_0(t), \\
\frac{dP_0(t)}{dt} &= -\mu P_0(t) + \lambda P_1(t),
\end{aligned}
\]


with the given initial state probabilities, $P_i(0)$. We assume that at time $t = 0$, the system is in State 2, that is, $P_2(0) = 1$. Solving this system of coupled linear differential equations will provide the transient solution $(P_2(t), P_{1c}(t), P_{1u}(t), P_1(t), P_0(t))$ of the Markov chain. Often, however, we are merely interested in the long-run or steady-state probabilities $\pi_i = \lim_{t \to \infty} P_i(t)$. The steady-state balance equations are obtained by taking the limit of the system of differential equations above (Trivedi, 1982):

\[
\pi_2 = \frac{\mu}{2\lambda}\,\pi_1, \qquad
\pi_{1c} = \frac{2\lambda c}{\delta}\,\pi_2, \qquad
\pi_{1u} = \frac{2\lambda (1 - c)}{\beta}\,\pi_2, \qquad
\pi_0 = \frac{\lambda}{\mu}\,\pi_1,
\]

where $\sum_i \pi_i = 1$.

Data for Introductory Example. In the introductory example, we shall use the following numerical parameter values:

Parameter                         Symbol         Value
Processor mean time to failure    $1/\lambda$    5,000 hours
Mean time to repair               $1/\mu$        4 hours
Coverage                          $c$            0.9 (90%)
Mean reconfiguration time         $1/\delta$     30 seconds
Mean reboot time                  $1/\beta$      10 minutes
System size                                      2 processors

Note that the data describing the sample system and the results of the availability analysis, while based on engineering designs and observations of machine performance, are hypothetical. They should be used only for general observations about dependability and its modeling and analysis, not to draw specific conclusions about specific products. Note also that the models on which the results are based will continue to evolve as development work and validation proceeds.
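The steady-state balance equations can be solved directly for the parameter values above. The Python sketch below is one way to do this, not the solution method used for the paper's results: it expresses every state probability as a multiple of $\pi_1$ and then normalizes. The function name `steady_state` and the dictionary keys are illustrative assumptions, and the relation $\pi_0 = (\lambda/\mu)\pi_1$ is taken from the balance of flow between States 1 and 0.

```python
def steady_state(lam, mu, c, delta, beta):
    """Solve the balance equations of the two-processor model in closed form."""
    # Express every probability as a multiple of pi_1, then normalize.
    r2 = mu / (2 * lam)                     # pi_2  / pi_1
    r1c = (2 * lam * c / delta) * r2        # pi_1c / pi_1
    r1u = (2 * lam * (1 - c) / beta) * r2   # pi_1u / pi_1
    r0 = lam / mu                           # pi_0  / pi_1
    pi1 = 1.0 / (r2 + r1c + r1u + 1.0 + r0)
    return {"2": r2 * pi1, "1c": r1c * pi1, "1u": r1u * pi1,
            "1": pi1, "0": r0 * pi1}

# Parameter values from the table, converted to rates per hour.
pi = steady_state(
    lam=1 / 5000.0,     # 5,000-hour processor MTTF
    mu=1 / 4.0,         # 4-hour mean time to repair
    c=0.9,              # coverage
    delta=3600 / 30.0,  # 30-second mean reconfiguration time
    beta=60 / 10.0,     # 10-minute mean reboot time
)

for state, prob in pi.items():
    print(f"pi_{state} = {prob:.8f}")
```

All rates are expressed per hour so that the five parameters share one time unit; mixing seconds, minutes, and hours is the most likely source of error in reproducing such a calculation.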


Solving the steady-state equations for the Markov model of Fig. 3 with these parameters, we obtain $\pi_2 = 0.99839164$, $\pi_{1c} = 0.00000300$, $\pi_{1u} = 0.00000665$, $\pi_1 = 0.00159743$, and $\pi_0 = 0.00000128$.

2.5 System Availability Measures

System availability measures are what traditionally have been referred to as “availability.” To temporarily oversimplify, system availability is the proportion of total time in which the system is in an operational condition. The measure can be expressed either as a percentage or probability or as the amount of system uptime (or downtime) per year. As mentioned before, availability measures are used for systems, such as telephone switching systems and database systems, that are usually operated continuously and for which short down times can be tolerated.

2.5.1 Basic Availability

The most straightforward form of system availability is basic availability. Basic availability follows the dotted curve shown in Fig. 2. The tolerance is zero, so that all outages count. The system is assumed to be either “up” or “down,” with no partial or intermediate states. Using the state description shown in Fig. 3, we consider the states labeled 2 and 1 as system “up” states and all other states as system “down” states.

Three Forms of System Availability. System availability measures can be expressed in one of three forms, as follows. The probability that the system is up at a time $t$, called instantaneous basic availability, is

\[
A(t) = P_2(t) + P_1(t).
\]

If we assume that the system has reached steady state (i.e., time $t \to \infty$), we then have steady-state basic availability, which is

\[
A = \pi_2 + \pi_1.
\]

We also have the interval basic availability of the system, the proportion of a given interval of time that the system is up, obtained by averaging the instantaneous availability over the time interval, i.e.,

\[
\bar{A}(t) = \frac{1}{t} \int_0^t A(x)\,dx = \frac{1}{t} \int_0^t \left[ P_2(x) + P_1(x) \right] dx.
\]
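To make the three forms concrete, the following Python sketch integrates the differential equations of the two-processor model numerically and accumulates the interval availability along the way. It is a minimal illustration under stated assumptions, not the solution technique used in the paper: a fixed-step fourth-order Runge-Kutta method is used, the step size of 0.005 hours is chosen small enough for the fastest rate (120 per hour), and all names are hypothetical.

```python
LAM, MU, COVERAGE = 1 / 5000.0, 1 / 4.0, 0.9  # rates per hour, coverage
DELTA, BETA = 3600 / 30.0, 60 / 10.0          # reconfiguration and reboot rates per hour

def derivs(p):
    """Right-hand sides of the model's differential equations."""
    p2, p1c, p1u, p1, p0 = p
    return [
        -2 * LAM * p2 + MU * p1,
        -DELTA * p1c + 2 * LAM * COVERAGE * p2,
        -BETA * p1u + 2 * LAM * (1 - COVERAGE) * p2,
        -(LAM + MU) * p1 + DELTA * p1c + BETA * p1u + MU * p0,
        -MU * p0 + LAM * p1,
    ]

def rk4_step(p, h):
    """Advance the state probabilities by one Runge-Kutta step of size h."""
    k1 = derivs(p)
    k2 = derivs([x + 0.5 * h * k for x, k in zip(p, k1)])
    k3 = derivs([x + 0.5 * h * k for x, k in zip(p, k2)])
    k4 = derivs([x + h * k for x, k in zip(p, k3)])
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]

def availability(t_end, h=0.005):
    """Return (instantaneous, interval) basic availability at time t_end."""
    p = [1.0, 0.0, 0.0, 0.0, 0.0]       # start in State 2
    steps = int(round(t_end / h))
    a_prev, integral = p[0] + p[3], 0.0
    for _ in range(steps):
        p = rk4_step(p, h)
        a_now = p[0] + p[3]              # States 2 and 1 are "up"
        integral += 0.5 * h * (a_prev + a_now)   # trapezoidal rule
        a_prev = a_now
    return a_prev, integral / (steps * h)

inst, interval = availability(100.0)
print(f"A(100 h) = {inst:.8f}, interval availability over 100 h = {interval:.8f}")
```

After 100 hours the instantaneous value has essentially settled at the steady-state availability, while the interval value is still slightly higher because it averages in the early period when the system had just started in State 2.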

Figure 4 displays the three availabilities $A$, $A(t)$, and $\bar{A}(t)$ as functions of time $t$ for our example system. Note that $\bar{A}(t) > A(t)$ (since $\bar{A}(t)$ is $A(t)$ averaged over the time interval and the latter is a decreasing function), and that both of these converge to the steady-state availability $A$.

In the introductory example, the steady-state basic availability $A$ is 0.99998907. This means that the system is up 99.998907% of the time (basic availability) and thus down 0.001093% of the time (basic unavailability). In the course of a year, or 525,600 minutes, the system can be expected to be out of operation an average of 5.74 minutes (basic downtime). Further analysis shows the basic downtime is composed of 0.67 minutes due to lack of required processors (loss of both processors), 1.57 minutes due to reconfigurations, and 3.50 minutes due to uncovered failures.

2.5.2 Tolerance (Nonreconfiguration) Availability

Tolerance availability introduces a tolerance, along the lines of the solid curve of Fig. 2. In this case, all reconfigurations are assumed to result in brief outages that are below the tolerance value (and hence tolerable), while all reboots, and all repairs when the system as a whole is down, are assumed to result in outages above the tolerance. According to the state description shown in Fig. 3, State 1c is now considered an “up” state, in addition to States 2 and 1, while States 1u and 0 are “down” states. Then, at steady state,

\[
\text{Tolerance (Nonreconfiguration) Availability} = \pi_2 + \pi_{1c} + \pi_1.
\]

In the example, the system is either up or undergoing a tolerably brief outage 99.999207% of the time (tolerance availability), and thus during 0.000793% of the time, the system is undergoing an intolerably long outage (tolerance unavailability). In the course of a year, or 525,600 minutes, the system can be expected to be out of operation an average of 4.17 minutes due to intolerably long outages (tolerance downtime). Further analysis shows that this downtime is composed of 0.67 minutes due to lack of required processors and 3.50 minutes due to uncovered failures.

2.5.3 Capacity-Oriented Availability

Capacity-oriented availability takes into account that in many situations the users are interested not so much in whether the entire system is up or down as in how much service the system is delivering. Capacity-oriented availability measures have curves similar to the first curve in Fig. 2 except that the slope, instead of being equal to one, is equal to the relative amount of lost service capacity.

In the example, we assume that if both processors are up, the system is delivering full service, whereas if only one processor is up, the system is delivering only half service. If no processors are up, or if reconfigurations or reboots are taking place, the system is assumed to be down and thus delivering zero service. According to the state description shown in Fig. 3, State 2 delivers full service, State 1 delivers half service, and States 1c, 1u, and 0 deliver zero service. Accordingly, the steady-state capacity-oriented availability is given by

\[
\mathrm{COA} = \pi_2 + 0.5\,\pi_1 = 0.99919036.
\]

Thus, in the example, 99.919036% of the 2 × 525,600 processor-minutes potentially available over the course of a year are actually delivered (capacity-oriented availability). Equivalently, 0.080964% (capacity-oriented unavailability) of the 2 × 525,600 processor-minutes, or 851 processor-minutes (capacity-oriented downtime), are not delivered. This downtime consists of 2 × 5.74, or approximately 11, processor-minutes of downtime due to system downtime per year plus 840 processor-minutes of downtime due to degraded capacity.

2.5.4 Tolerance (Nonreconfiguration) Capacity-Oriented Availability

Tolerance capacity-oriented availability measures take both tolerance and capacity considerations into account. Except for the tolerance value below which outages are not counted, these measures are similar to capacity-oriented measures. They therefore have curves similar to the solid curve in Fig. 2 except that the slope, instead of being equal to one, is equal to the relative amount of lost service capacity. From the state description in Fig. 3, the tolerance (nonreconfiguration) capacity-oriented availability is given by

\[
\mathrm{TCOA} = \pi_2 + \pi_{1c} + 0.5\,\pi_1 = 0.99919335.
\]

In the example, the tolerance capacity-oriented availability of the system is 99.919335%. Equivalently, the tolerance capacity-oriented unavailability is 0.080665%, and the tolerance capacity-oriented downtime is 848 processor-minutes per year. Note that the difference between this downtime and the 851 processor-minutes for capacity-oriented downtime represents 1.57 minutes of reconfiguration downtime, or approximately 3 processor-minutes. The tolerance capacity-oriented downtime figure is only slightly lower than that for capacity-oriented downtime with reconfiguration losses taken into account. This is because most of the impact on capacity-oriented downtime comes from the degraded-capacity state (and the reboot state, to a lesser extent) rather than from reconfiguration losses.
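The four steady-state availability measures differ only in the weight each state receives, which suggests treating them uniformly as weighted sums of the state probabilities. The Python sketch below does this using the steady-state probabilities quoted for the example; the dictionary layout and the names `weighted_availability`, `rewards`, and `downtime` are illustrative assumptions rather than notation from the paper.

```python
MIN_PER_YEAR = 525_600

# Steady-state probabilities as reported for the example system.
pi = {"2": 0.99839164, "1c": 0.00000300, "1u": 0.00000665,
      "1": 0.00159743, "0": 0.00000128}

def weighted_availability(pi, reward):
    """Sum of state probabilities weighted by a per-state reward in [0, 1]."""
    return sum(prob * reward.get(state, 0.0) for state, prob in pi.items())

# Reward assigned to each state; states not listed earn zero.
rewards = {
    "basic": {"2": 1.0, "1": 1.0},
    "tolerance": {"2": 1.0, "1c": 1.0, "1": 1.0},
    "capacity": {"2": 1.0, "1": 0.5},
    "tolerance_capacity": {"2": 1.0, "1c": 1.0, "1": 0.5},
}
results = {name: weighted_availability(pi, r) for name, r in rewards.items()}

# Annual downtime: minutes for the system-level measures,
# processor-minutes (two processors) for the capacity-oriented ones.
scale = {"basic": 1, "tolerance": 1, "capacity": 2, "tolerance_capacity": 2}
downtime = {name: (1 - results[name]) * scale[name] * MIN_PER_YEAR
            for name in results}

for name in results:
    print(f"{name:20s} availability = {results[name]:.8f}, "
          f"downtime = {downtime[name]:.2f}")
```

Writing the measures this way makes it easy to try other weightings, for example a one-processor state that delivers more or less than half service.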


2.5.5 Degraded-Capacity Time

The degraded-capacity time is the annual amount of time that the system is functioning but operating at less than full capacity. In the example, out of 525,600 minutes in a year, the system spends approximately 6 minutes actually out of service, but 840 minutes (525,600 × $\pi_1$) in a degraded-capacity mode due to the loss of one processor. For the remaining time (524,754 minutes), the system is expected to operate at full capacity. The degraded-capacity time of 840 minutes per year compares with 6 minutes per year spent actually out of service, so that the contribution to loss of service from degraded capacity far outweighs the contribution from actual system outages.

2.6 System Reliability Measures

System reliability measures emphasize the occurrence of undesirable events in the system. These measures are useful for systems where no downtime can be tolerated, for example, flight control systems. System reliability can be expressed in a number of forms:

Reliability Function. This represents the probability that an incident (of sufficient severity, if a tolerance threshold is in effect) has not yet occurred since the beginning of the current uptime epoch. It is denoted by the function $R(t) = P(X > t)$, where $X$ is the (random) time to the next failure and $t$ is the length of the time period of interest. The system unreliability is simply $1 - R(t)$. In computing the system reliability $R(t)$ for our example system, we consider three different criteria:

Case 1. Any processor failure is considered a system failure. In this case we turn States 1c and 1u into absorbing states so that once the system enters those states, it is destined to stay there (see Fig. 5a). Then

\[
R_1(t) = P_2(t),
\]

where $P_j(t)$ denotes the transient probability that the system is in State $j$ at time $t$ given that it started in State 2 at time 0.

Case 2. Any uncovered processor failure or any failure that leads to exhaustion of all processors is considered to be a system failure. In this case, States 1u and 0 are absorbing states (see Fig. 5b). Then

\[
R_2(t) = P_2(t) + P_{1c}(t) + P_1(t).
\]

Case 3. Only the failure of all processors is considered a system failure. In this case only State 0 is an absorbing state (see Fig. 5c). Then

\[
R_3(t) = P_2(t) + P_{1c}(t) + P_{1u}(t) + P_1(t).
\]


FIG. 5. Failure criteria for system reliability analysis (* indicates absorbing states). (a) Any processor failure. (b) Any uncovered processor failure or loss of processors. (c) Loss of all processors.

For any of the three cases, the reliability $R(t)$ is the probability that, having started in State 2 at time 0, the system has not reached an absorbing state by time $t$. Likewise, the system unreliability $1 - R(t)$ is the probability that the system has reached an absorbing failure state on or before time $t$. Figures 5a, 5b, and 5c show the Markov models for the three criteria, respectively. Note the difference between the graphs of Fig. 5 and that of Fig. 3. All the graphs in Fig. 5 have absorbing states, while the one in Fig. 3 has no absorbing states. Naturally, the corresponding differential equations will also be different.


Mean Time To Failure (Incident). This is the average length of time that elapses until the occurrence of an incident, and is denoted by MTTF. It is given by

$$\mathrm{MTTF} = \int_0^\infty R(t)\,dt.$$
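For Case 1 the integral can be worked out by hand, which gives a quick check on the mean time to the first outage quoted in Section 2.6.1. The sketch below assumes a processor failure rate of $\lambda = 1/5{,}000$ per hour (the processor MTTF listed for the sample system in Table II) and uses the fact that State 2 is left at rate $2\lambda$.

```latex
\begin{align*}
\frac{dP_2(t)}{dt} &= -2\lambda P_2(t), \quad P_2(0) = 1
  \;\Longrightarrow\; R_1(t) = P_2(t) = e^{-2\lambda t}, \\
\mathrm{MTTF}_1 &= \int_0^\infty e^{-2\lambda t}\,dt = \frac{1}{2\lambda}
  = \frac{5{,}000}{2} = 2{,}500 \text{ hours}.
\end{align*}
```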

Frequency of Incidents. This is the average number of occurrences of incidents per unit of time. In order to compute the frequency of a certain incident, we return to the Markov model shown in Fig. 3 and count the average number of visits to the state of interest during the interval of observation. For our sample system under the three criteria given above, the frequencies of incidents per year are therefore

$$F_1 = 8{,}760 \times [\delta\,\pi_{1c} + \beta\,\pi_{1u} + \mu\,\pi_0],$$

$$F_2 = 8{,}760 \times [\beta\,\pi_{1u} + \mu\,\pi_0],$$

$$F_3 = 8{,}760 \times \mu\,\pi_0.$$
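As a rough numerical check of these three formulas, the sketch below computes the steady-state probabilities from the balance equations of the Fig. 3 chain and then evaluates F1 to F3. It assumes the sample-system values λ = 1/5,000 and μ = 1/4 per hour and c = 0.9 (as listed for the sample system in Table II). The reconfiguration rate δ = 120 per hour and the reboot rate β = 6 per hour are assumptions of this sketch, chosen because they are consistent with the figures quoted in this section. All function and variable names are illustrative.

```python
# Rates are per hour. LAM, MU, C follow the sample system of Table II;
# DELTA and BETA are ASSUMED values (mean reconfiguration 30 s, mean reboot 10 min).
LAM = 1 / 5000
MU = 1 / 4
C = 0.9
DELTA = 120.0
BETA = 6.0
HOURS_PER_YEAR = 8760


def steady_state(lam=LAM, mu=MU, c=C, delta=DELTA, beta=BETA):
    """Steady-state probabilities from the balance equations of the Fig. 3 chain."""
    w = {"2": 1.0}                               # unnormalized weight of State 2
    w["1"] = 2 * lam / mu * w["2"]               # 2*lam*pi_2 = mu*pi_1
    w["0"] = lam / mu * w["1"]                   # lam*pi_1 = mu*pi_0
    w["1c"] = 2 * lam * c / delta * w["2"]       # 2*lam*c*pi_2 = delta*pi_1c
    w["1u"] = 2 * lam * (1 - c) / beta * w["2"]  # 2*lam*(1-c)*pi_2 = beta*pi_1u
    total = sum(w.values())
    return {state: weight / total for state, weight in w.items()}


pi = steady_state()
f1 = HOURS_PER_YEAR * (DELTA * pi["1c"] + BETA * pi["1u"] + MU * pi["0"])
f2 = HOURS_PER_YEAR * (BETA * pi["1u"] + MU * pi["0"])
f3 = HOURS_PER_YEAR * MU * pi["0"]

print(f"F1 = {f1:.3f}/year, F2 = {f2:.3f}/year, F3 = {f3:.5f}/year")
```

Under these assumed inputs the three values come out close to 3.5, 0.35, and 0.0028 incidents per year.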

We shall present a collection of system reliability measures, again based on the example system described previously. The first three cases show differing criteria as to what constitutes an incident. For each of these cases the three kinds of values discussed above (reliability function, MTTF, and frequency of occurrence) are provided. The fourth case generalizes the first three in that a (frequency-of-incident) value is given for a whole range of tolerance values rather than just for a specific given value. The remaining frequency measures depict related aspects of system behavior.

2.6.1 Any Outage (Case 1)

The likelihood that an outage occurs between time 0 and time t is $1 - R_1(t)$, plotted in Fig. 6. The mean time to the first outage is 2,500 hours. The average frequency at which service is interrupted on the system is 3.5 times per year. Note that the tolerance pattern of this measure corresponds to the dashed curve of Fig. 2: a system reliability curve (i.e., a curve with a discontinuity at the tolerance level) with a zero tolerance value.

2.6.2 Over-Tolerance (Nonreconfiguration) Outages (Case 2)

The likelihood that an outage, other than a reconfiguration, occurs between time 0 and time t is $1 - R_2(t)$, plotted in Fig. 6. The mean time to the first outage of more than a reconfiguration is 24,857 hours. The average frequency at which service is interrupted on the system for more than a reconfiguration time interval is 0.35 times per year (once every 2.8 years). The tolerance pattern of this measure corresponds to the double solid curve of Fig. 2: a system reliability curve (a curve with a discontinuity at the tolerance level) with a nonzero tolerance value.

2.6.3 Outages Due To Lack Of Processors (Case 3)

The likelihood that all processors fail at some point between time 0 and time t is $1 - R_3(t)$, plotted in Fig. 6. The mean time to the first occurrence of the "both processors failed" condition is 3,132,531 hours. The average frequency at which service is interrupted on the system due to all processors having failed is 0.0028 times per year (once every 357 years).

2.6.4 Frequency and Duration Of System Outages

Next we consider the frequency of system outages exceeding a given outage-length tolerance $\tau$. It is given by

$$F_4(\tau) = 8{,}760 \times [\delta\,\pi_{1c}\,e^{-\delta\tau} + \beta\,\pi_{1u}\,e^{-\beta\tau} + \mu\,\pi_0\,e^{-\mu\tau}],$$

since the probability that a given reconfiguration interval is longer than $\tau$ is $e^{-\delta\tau}$, and likewise for the reboot interval and the repair interval. This frequency for the sample system is shown in Fig. 7 as a function of $\tau$. Note that the relationship is nonlinear; the outage frequency changes significantly as the outage tolerance moves through values associated with reconfigurations (less than 0.01 hour) or system reboots (more than 0.1 hour), whereas the outage frequency does not change much for tolerance values intermediate between reconfigurations and system reboots.

2.6.5 Frequency of Degraded-Capacity Incidents

The frequency of degraded-capacity incidents is the average annual number of times that the system loses capacity but continues to operate at a reduced level. In the example, this frequency is 3.15. Note that these incidents represent those incidents of Section 2.6.1 not included in Section 2.6.2. In other words, the formula used for this frequency is

$$F_5 = 8{,}760 \times \delta \times \pi_{1c}.$$

2.6.6 Frequency of Processor Repairs

Unlike the above measures, this measure does not reflect system reliability per se, since it does not necessarily show instances where the user is deprived of system service. Rather, it shows the workload on the maintenance facility generated by incidents. The average rate of processor repairs per year is given by

$$F_6 = 8{,}760 \times \mu \times [\pi_1 + \pi_0].$$

The value for the example system is 3.5 times per year.
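The frequency measures of Sections 2.6.4 to 2.6.6 can be evaluated in the same way. The following sketch tabulates F4 for a few tolerance values and computes F5 and F6. As before, λ, μ, and c are the sample-system values from Table II, while δ = 120 and β = 6 per hour are assumed rates that are consistent with the quoted results; the helper names are hypothetical.

```python
import math

# Rates per hour. DELTA and BETA are ASSUMED (30 s reconfiguration, 10 min reboot).
LAM, MU, C = 1 / 5000, 1 / 4, 0.9
DELTA, BETA = 120.0, 6.0
HOURS_PER_YEAR = 8760


def steady_state():
    """Steady-state probabilities from the balance equations of the Fig. 3 chain."""
    w = {"2": 1.0}
    w["1"] = 2 * LAM / MU * w["2"]
    w["0"] = LAM / MU * w["1"]
    w["1c"] = 2 * LAM * C / DELTA * w["2"]
    w["1u"] = 2 * LAM * (1 - C) / BETA * w["2"]
    total = sum(w.values())
    return {state: weight / total for state, weight in w.items()}


def outage_frequency(tau, pi):
    """F4(tau): yearly frequency of outages lasting longer than tau hours."""
    return HOURS_PER_YEAR * (
        DELTA * pi["1c"] * math.exp(-DELTA * tau)
        + BETA * pi["1u"] * math.exp(-BETA * tau)
        + MU * pi["0"] * math.exp(-MU * tau)
    )


pi = steady_state()
f5 = HOURS_PER_YEAR * DELTA * pi["1c"]           # degraded-capacity incidents
f6 = HOURS_PER_YEAR * MU * (pi["1"] + pi["0"])   # processor repairs

tolerances = [0.0, 0.001, 0.01, 0.05, 0.1, 1.0, 10.0]   # hours
f4 = [outage_frequency(tau, pi) for tau in tolerances]
for tau, value in zip(tolerances, f4):
    print(f"tolerance {tau:>6} h: {value:.4f} outages/year")
print(f"F5 = {f5:.2f}/year, F6 = {f6:.2f}/year")
```

With a zero tolerance, F4 coincides with the any-outage frequency of Section 2.6.1, and it flattens out for tolerances between the reconfiguration and reboot time scales.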

2.7 Task Completion Measures

Task completion measures indicate the likelihood that a task (or job or customer) will be completed satisfactorily. Since the task is the fundamental unit by which work is carried out on a system, the likelihood of successful completion of a task gives a precise assessment of customer satisfaction. Task completion measures are thus very effective in situations where system usage can indeed be broken down into individual tasks, such as a transaction processing system. Unlike system availability or system reliability measures, which only take into account the system itself, task completion measures also include the nature of the tasks themselves and their interaction with the system. The analysis therefore has two layers: the occurrence of and recovery from incidents, and the effects of these incidents on the tasks. The effects are functions of such aspects as the incident profiles, the length of time of the task, and the sensitivity of the task to interruptions.

We shall present a collection of task completion measures (specifically, probability-of-end-user-interruption measures), again based on the example system described previously. The numerical values shown are for a task that needs 60 minutes (one hour) of "uninterrupted" execution time. Curves showing the values of these task completion measures for other task execution times are provided in Fig. 8.

2.7.1 Task Interruption Probability Due To Any Interruption

The likelihood that a user requiring x units of uninterrupted system time finds the system initially available and suffers a service interruption during usage, whether due to the failure of the user's own processor, system reconfiguration, uncovered failure or loss of required processors, is given by

$$\text{Task Interruption Probability} = (1 - e^{-2\lambda x})\,\pi_2 + (1 - e^{-\lambda x})\,\pi_1,$$

since the probability that the interruption occurs due to the first failure in the system is $1 - e^{-2\lambda x}$, provided that the task was executing with both processors up. Similarly, the probability that the interruption is due to a loss of required processors is $1 - e^{-\lambda x}$, provided that the task was executing with the system in State 1 (of Fig. 3). For x = 60 minutes, the task interruption probability is calculated to be 0.03997%, and the odds against interruption 2,501:1.

FIG. 8. Odds against task interruption as a function of task time (hours). Curves: uncovered failure or loss of all processors; uncovered failure or loss of own processor; any interruption.

2.7.2 Task Interruption Probability Due To An Over-Tolerance (Nonreconfiguration) Interruption of the System or User's Processor

The likelihood that a user requiring x units of "uninterrupted" system time finds the system initially available and suffers an interruption due to a failure of the user's processor or due to a system uncovered failure or loss of required processors is given by

$$\text{Task Interruption Probability} = (1 - e^{-(\lambda(1-c)+\lambda)x})\,\pi_2 + (1 - e^{-\lambda x})\,\pi_1.$$

The probability that the interruption occurs due to an uncovered processor failure or the user's processor failure is $1 - e^{-(\lambda(1-c)+\lambda)x}$, provided that the task was executing with both processors up. Similarly, the probability that the interruption is due to a loss of required processors is $1 - e^{-\lambda x}$ in case the task was running with only one processor up. For x = 60 minutes, the task interruption probability is computed as 0.021997%, and the odds against interruption 4,545:1. Note that reconfiguration interruptions do not count as an interruption.

2.7.3 Task Interruption Probability Due To An Over-Tolerance (Nonreconfiguration) Interruption of the System

The likelihood that a user requiring x units of "uninterrupted" system time finds the system initially available and suffers an interruption due to uncovered system failures or loss of required processors is given by

$$\text{Task Interruption Probability} = (1 - e^{-2\lambda(1-c)x})\,\pi_2 + (1 - e^{-\lambda x})\,\pi_1.$$

The probability that the interruption occurs due to an uncovered processor failure is given by $1 - e^{-2\lambda(1-c)x}$, provided the system is in State 2 (of Fig. 3). For x = 60 minutes, the task interruption probability is computed to be 0.004026%, and the odds against interruption 24,841:1. Note that in this situation, the user can switch to another processor in case of a covered failure, so that a covered failure of the user's own processor does not count as an interruption. Note also that a reconfiguration does not count as an interruption.

2.8 Summary of Measures

The various dependability measures are summarized in Table I.


TABLE I
DEPENDABILITY MEASURES

Table Ia. System Availability Measures

Measures                                              Values in sample system
Basic availability                                    99.998907%
  Unavailability                                      0.001093%
  Downtime                                            5.74 minutes/year
Tolerance (nonreconfiguration) availability           99.999207%
  Unavailability                                      0.000793%
  Downtime                                            4.17 minutes/year
Capacity-oriented availability                        99.919036%
  Unavailability                                      0.080964%
  Downtime                                            851 processor-minutes/year
Tolerance (nonreconfiguration)
capacity-oriented availability                        99.919335%
  Unavailability                                      0.080665%
  Downtime                                            848 processor-minutes/year
Degraded-capacity time                                840 minutes/year

Table Ib. System Reliability Measures

Measures                                              Values in sample system
Any outage                                            (MTTF = 2,500 hrs.) 3.5/year
Over-tolerance (nonreconfiguration) system outages    (MTTF = 24,857 hrs.) 0.35/year
Lack of required processors                           (MTTF = 3,132,531 hrs.) 0.0028/year
Frequency and duration of system outages              See Fig. 7
Frequency of degraded-capacity incidents              3.15/year
Frequency of processor repairs                        3.5/year

Table Ic. Task Completion Measures

Measures (user requiring 60 minutes)                  Values in sample system
Against any interruption:
  Task interruption probability                       0.04%
  Odds against interruption                           2,500:1
Against an over-tolerance (nonreconfiguration)
interruption of system or of user's processor:
  Task interruption probability                       0.022%
  Odds against interruption                           4,500:1
Against an over-tolerance (nonreconfiguration)
interruption of system:
  Task interruption probability                       0.004%
  Odds against interruption                           25,000:1
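The task completion entries of Table Ic follow directly from the formulas of Sections 2.7.1 to 2.7.3. The sketch below evaluates them for a one-hour task. It uses the sample-system values λ = 1/5,000 and μ = 1/4 per hour and c = 0.9 (Table II), and assumes δ = 120 and β = 6 per hour for the reconfiguration and reboot rates, which are not stated in this section; small differences in the last digit relative to the quoted figures are therefore possible. Names are illustrative.

```python
import math

# Rates per hour. DELTA and BETA are ASSUMED (30 s reconfiguration, 10 min reboot).
LAM, MU, C = 1 / 5000, 1 / 4, 0.9
DELTA, BETA = 120.0, 6.0


def steady_state():
    """Steady-state probabilities from the balance equations of the Fig. 3 chain."""
    w = {"2": 1.0}
    w["1"] = 2 * LAM / MU * w["2"]
    w["0"] = LAM / MU * w["1"]
    w["1c"] = 2 * LAM * C / DELTA * w["2"]
    w["1u"] = 2 * LAM * (1 - C) / BETA * w["2"]
    total = sum(w.values())
    return {state: weight / total for state, weight in w.items()}


def interruption_probability(x, rate_in_state_2, pi):
    """Probability that a task of x hours is interrupted, starting in State 2 or 1."""
    p_from_2 = -math.expm1(-rate_in_state_2 * x)   # 1 - exp(-rate * x)
    p_from_1 = -math.expm1(-LAM * x)               # last processor fails
    return p_from_2 * pi["2"] + p_from_1 * pi["1"]


pi = steady_state()
x = 1.0   # task length in hours (60 minutes)
rates = {
    "any interruption": 2 * LAM,
    "over-tolerance, system or own processor": LAM * (1 - C) + LAM,
    "over-tolerance, system only": 2 * LAM * (1 - C),
}
results = {name: interruption_probability(x, r, pi) for name, r in rates.items()}
for name, p in results.items():
    print(f"{name}: {100 * p:.4f}%  (odds against: {(1 - p) / p:,.0f}:1)")
```

Varying x in this sketch traces out curves of the kind shown in Fig. 8.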


3. Types of Dependability Analyses

Four types of dependability analyses are used to fully analyze a candidate computer system: evaluation, sensitivity analysis, specification determination, and tradeoff analysis.

Evaluation (i.e., "What is?") is the basic dependability analysis. It investigates a specific computer system, either as designed or as it actually exists. Input data is collected as to nominal performance, component reliability, maintainability, failure recovery, etc. The analyst then evaluates the dependability of the system as described by the design specifications or the existing conditions.

Sensitivity analysis (i.e., "What if?") takes place after a system has been evaluated. One may naturally wish to determine how the analysis results would change if one or more of the input parameters change (for example, what if the component reliability improves?). One can then conduct several analysis runs with differing values for a given input parameter, and examine the changes in the dependability measure of interest. This applies particularly well to situations where some doubt exists as to the proper values of a certain input parameter, or where results are required for a range of values for a parameter. This type of procedure is called sensitivity analysis, because it measures the sensitivity of dependability to changes in the input parameters. It is also possible to compute the partial derivative of the measure of interest with respect to a chosen parameter (Blake et al., 1988).

Specification determination (i.e., "How to?") determines the values of given input parameters required to achieve a given level of dependability. These values then become specifications for the indicated parameters. Specification determination is therefore the reverse of sensitivity analysis: while sensitivity analysis takes given values of input parameters and determines the impact of these values on dependability, specification determination takes a given value of dependability and determines its impact on the specification for an input parameter.

Tradeoff analysis (i.e., "How best?") investigates trading off a change in one input parameter for a change in a second parameter, leaving overall dependability unaffected. For example, if in order to save costs the designer reduces the redundancy of a subsystem by one unit, by how much would the component reliability in that subsystem have to improve in order to preserve the overall dependability? The main distinction between tradeoff analysis and sensitivity analysis is that the former investigates the interaction between two input parameters (holding dependability constant) while the latter investigates the interaction between an input parameter and a dependability measure. Tradeoff analyses allow a designer to have a design depend less on weaker or more expensive areas and more on stronger and/or more cost-effective ones.

The relationship among these four types of analyses is shown in the conceptual graph given in Fig. 9. Component reliability is shown on the horizontal axis and maintainability (or redundancy) on the vertical axis. Within the graph are curves of equal dependability, i.e., all points on a given curve have the same dependability. The dependability represented by each curve increases as one moves upward in the direction of the dashed arrow. Point A represents an evaluation for a given level of component reliability and maintainability (or redundancy). Points B1 and B2 represent sensitivity analyses from point A, one showing the effect of increasing maintainability (or redundancy) and the other showing the effect of increasing component reliability. Point C represents a specification determination for component reliability, with the dependability requirement shown by the second curve from the top and the component reliability being increased from point A to point C until dependability meets the requirement. Points D1 and D2 represent tradeoff analyses (with both points remaining on the same dependability curve as A), one showing an exchange of lower component reliability for greater maintainability or redundancy and the other showing the reverse.

Sample Types Of Analyses. To illustrate the four types of analyses, we have carried them out on the sample system defined previously. The analysis consists of an evaluation on the original data, sensitivity analyses and specification determinations on two of the system parameters, namely processor reliability and repair time, and a tradeoff analysis of processor reliability vs. failure coverage. The measure used for dependability is the mean downtime per year, a basic availability measure. The results are summarized in Table II.

4. The Modeling of Dependability
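Before turning to the models themselves, the evaluation, sensitivity analysis, and specification determination of Section 3 (Table II) can be sketched numerically for the sample system. The sketch below is a minimal illustration, not the authors' tool: it computes mean downtime per year from the steady-state balance equations of the Fig. 3 chain, and finds the required processor MTTF by bisection. The reconfiguration rate (120 per hour) and reboot rate (6 per hour) are assumptions, as are all function names.

```python
MINUTES_PER_YEAR = 525_600
DELTA, BETA = 120.0, 6.0   # ASSUMED reconfiguration and reboot rates (per hour)


def downtime(mttf=5000.0, mttr=4.0, c=0.9):
    """Mean downtime (minutes/year) of the two-processor sample system."""
    lam, mu = 1 / mttf, 1 / mttr
    w2 = 1.0                          # unnormalized steady-state weights
    w1 = 2 * lam / mu
    w0 = lam / mu * w1
    w1c = 2 * lam * c / DELTA
    w1u = 2 * lam * (1 - c) / BETA
    down = w1c + w1u + w0             # States 1c, 1u and 0 are system-down states
    return MINUTES_PER_YEAR * down / (w2 + w1 + down)


def required_mttf(target, lo=1000.0, hi=100_000.0, steps=100):
    """Specification determination: MTTF giving the target downtime (bisection)."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if downtime(mttf=mid) > target:
            lo = mid                  # too much downtime: need a higher MTTF
        else:
            hi = mid
    return (lo + hi) / 2


base = downtime()                                        # evaluation
sensitivity = {
    "MTTF = 10,000 h": downtime(mttf=10_000),
    "MTTF = 2,500 h": downtime(mttf=2_500),
    "MTTR = 2 h": downtime(mttr=2),
    "MTTR = 8 h": downtime(mttr=8),
}
spec_mttf = required_mttf(5.0)                           # specification determination
floor_with_instant_repair = downtime(mttr=1e-9)          # repair time cannot reach 5 min/yr

print(f"evaluation: {base:.2f} min/yr")
for label, value in sensitivity.items():
    print(f"{label}: {value:.2f} min/yr")
print(f"MTTF needed for 5 min/yr: about {spec_mttf:,.0f} h")
print(f"downtime as MTTR approaches 0: {floor_with_instant_repair:.2f} min/yr")
```

A tradeoff analysis could be sketched the same way, by fixing the downtime and solving for the coverage c at a given MTTF.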

FIG. 9. Types of system dependability analyses. Horizontal axis: component reliability. Vertical axis: maintainability (or redundancy). Analyses represented in the graph: A, evaluation; B1 and B2, sensitivity analyses; C, specification determination; D1 and D2, tradeoff analyses.

TABLE II
TYPES OF DEPENDABILITY ANALYSES

Evaluation:
  System as originally specified
  (Processor MTTF = 5,000 hours, MTTR = 4 hours, c = 0.9)         Downtime = 5.7 min/yr

Sensitivity Analysis:
  Processor reliability     MTTF = 10,000 hours                   Downtime = 2.7 min/yr
                            MTTF = 2,500 hours                    Downtime = 12.8 min/yr
  Repair time               MTTR = 2 hours                        Downtime = 5.2 min/yr
                            MTTR = 8 hours                        Downtime = 7.7 min/yr

Specification Determination: Specification is 5 min/yr of downtime
  Processor reliability     MTTF = 5,670 hours
  Repair time               Cannot meet specification (5.08 min/yr when MTTR = 0)

Tradeoff Analysis: Processor reliability vs. coverage; downtime remains at 5.7 min/yr
  Processor reliability increases so that MTTF = 10,000 hours     Coverage may decrease to c = 0.76
  Processor reliability decreases so that MTTF = 3,500 hours      Coverage must increase to c = 0.975

Generally when one carries out a dependability analysis of a computer system, the system is represented by a mathematical model. It is certainly possible to evaluate the dependability of a system by observing and measuring actual system behavior under either normal or controlled conditions, then estimating various measures of dependability using statistical techniques (Bard and Schatzoff, 1978; Trivedi, 1982). However, a measurement-based evaluation is sometimes impossible or prohibitively expensive. For instance, the system under consideration may not yet be available for obtaining measurements, either at all or for the intended application. Additionally, the required measurement data, especially frequency-of-failure data in high reliability situations or data on the effects of infrequently-occurring failure modes, may require unfeasible levels of time and effort to obtain in sufficient amounts to yield statistically significant estimates (Geist and Trivedi, 1983). Therefore, a model-based evaluation, or in some cases a hybrid approach based on a judicious combination of models and measurements, is used for cost-effective dependability analysis.

Two broad categories of mathematical models exist: simulation and analytic. In Monte Carlo simulation models, an input stream of simulated events, such as failures, recoveries, and repairs, is produced using random variates from the appropriate distributions, and the impact of these events on the system is evaluated. In analytic models, equations describing the underlying structure of the system are derived and solved.

Simulation models are frequently more straightforward than analytic ones, and usually do not have as many of the simplifying assumptions that analytic models require for tractability. However, we must carry out repeated replications, each with a different randomly-generated input stream, until enough replications have been made to obtain statistically significant results. In the case of large models that result from reasonably complex systems, this can become prohibitively expensive or even computationally unfeasible. In addition, in dependability analysis the numerical values for failure rates and repair/recovery rates are usually vastly different, with failure rates being much lower than repair/recovery rates. This makes dependability models stiff and hence even more difficult to simulate. Methods of speeding up the simulation of stiff systems are being studied (Conway and Goyal, 1987). Nevertheless, whenever a reasonable analytical model exists or can be developed, it should be used over a simulative approach.

Analytic models include combinatorial and Markov models. Combinatorial models, in turn, include reliability block diagrams, fault trees, and reliability graphs. These models are parsimonious in describing system behavior. Hence, they are relatively easy to specify and solve. Combinatorial models, however, generally require that system components behave in a stochastically independent manner. Dependencies of many different kinds exist in real systems (Goyal et al., 1987; Veeraraghavan and Trivedi, 1987). For this reason combinatorial models turn out not to be entirely satisfactory in and of themselves.

A Markov model is represented by a graph (or, equivalently, by a matrix of transition rates) in which the nodes are the possible states the system can assume and the arcs depict the transitions the system can make from one state to another (Figs. 3, 5a, 5b, and 5c are examples of such graphs). The model is then solved to obtain the probabilities that the system will assume various states. Unlike combinatorial models, Markov models can include different kinds of dependencies.
However, for most practical systems, a satisfactory Markov model could easily have tens of thousands of states. The construction and solution of such large Markov models pose a challenge. Two principal approaches exist to deal with this potential largeness of the Markov state space.

In the approach we call largeness avoidance, we find a way to avoid generating and solving a large Markov model. Largeness avoidance commonly uses hierarchies of models and often (but not always) implies an approximate rather than an exact solution to the original modeling problem (Ibe, Howe, and Trivedi, 1989; Sahner and Trivedi, 1987; Veeraraghavan and Trivedi, 1987). State truncation (Boyd et al., 1988; Goyal et al., 1986; Ciardo et al., 1989), fixed-point iterative (Ciardo and Trivedi, 1990) and other approximation techniques (Blake and Trivedi, 1989) that avoid the generation and solution of large state spaces also belong here.

The alternative approach to largeness avoidance is to use largeness tolerance. In this approach, we accept the fact that a large Markov model needs to be generated and solved. However, we automate the generation and solution of the large Markov model. This can be done in several ways:

1. A special-purpose program can be written to generate the states and transition rates of the Markov model (Heimann, 1989b).
2. A more concise stochastic Petri net (SPN) model of the problem can be specified, and subsequently an SPN package can be used to automatically generate the Markov model (Ciardo et al., 1989; Ibe, Trivedi, et al., 1989).
3. Modeling languages specially tailored to availability modeling (e.g., SAVE (Goyal et al., 1986)) or reliability modeling (e.g., HARP (Bavuso et al., 1987; Dugan et al., 1986)) can be used to automatically generate the underlying Markov chain state space.
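To give a flavor of the first option in the list above, the following sketch generates the states and transition rates programmatically and stores only the nonzero rates. It is a hypothetical illustration, not the program cited above: the extension from two to n processors (one repair at a time, a reconfiguration or reboot after every failure that leaves at least one processor up) is an assumption made here, and for n = 2 it reduces to the five-state example of Fig. 3.

```python
def build_model(n, lam, mu, c, delta, beta):
    """Return (states, rates) for a hypothetical n-processor version of Fig. 3.

    States are tuples: ("up", k) with k working processors, ("reconf", k) and
    ("reboot", k) for the recovery that follows a failure leaving k processors.
    `rates` maps (source, target) pairs to transition rates (sparse storage).
    """
    rates = {}
    for k in range(n, 0, -1):
        if k >= 2:
            # A failure among k processors is covered with probability c.
            rates[(("up", k), ("reconf", k - 1))] = k * lam * c
            rates[(("up", k), ("reboot", k - 1))] = k * lam * (1 - c)
            rates[(("reconf", k - 1), ("up", k - 1))] = delta
            rates[(("reboot", k - 1), ("up", k - 1))] = beta
        else:
            # Failure of the last processor brings the system down.
            rates[(("up", 1), ("up", 0))] = lam
        # One repair at a time brings a processor back.
        rates[(("up", k - 1), ("up", k))] = mu
    states = sorted({state for pair in rates for state in pair})
    return states, rates


# Sample-system values; delta = 120 and beta = 6 per hour are ASSUMED.
states, rates = build_model(2, lam=1 / 5000, mu=1 / 4, c=0.9, delta=120.0, beta=6.0)
print(len(states), "states,", len(rates), "transitions")
for (src, dst), rate in sorted(rates.items()):
    print(f"{src} -> {dst}: {rate:g}")
```

Keeping the rates in a dictionary keyed by state pairs is one simple way to exploit sparsity; the state count grows as 3n - 1 under the assumptions above.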

Whether the Markov model is directly specified by the modeler or has been automatically generated by a program, the need to use sparse-matrix storage techniques and sparsity-preserving efficient numerical solution methods is evident. In the rest of this section, the discussion shall be based on a Markov model with a largeness tolerance approach.

For further information on measurement techniques see Bard and Schatzoff (1978), Iyer et al. (1986), and Siewiorek and Swarz (1982). For references on combinatorial methods see Sahner and Trivedi (1987) and Shooman (1968), for simulation see Conway and Goyal (1987), and for hierarchical combinatorial and Markov methods see Blake and Trivedi (1989), Ibe, Howe and Trivedi (1989), Sahner and Trivedi (1987), and Veeraraghavan and Trivedi (1987).

This section addresses three areas: model solution, parameter determination, and model validation.

4.1 Model Solution Techniques

As mentioned above, a Markov model is described by a graph, called a state transition rate diagram, such as the one shown in Fig. 3. The graph is represented by a state space S and a matrix of transition rates $Q = [q_{ij}]$, where $q_{ij}$ is the rate of transition from State i to State j ($j \neq i$), $i, j \in S$, and where the value of the diagonal element $q_{ii}$ is equal to $-\sum_{j \neq i} q_{ij}$ (so that the rows of Q sum to zero) (Cinlar, 1975; Trivedi, 1982). For our example problem (Fig. 3), for instance, we have

$$
Q = \begin{bmatrix}
-2\lambda & 2\lambda c & 2\lambda(1-c) & 0 & 0 \\
0 & -\delta & 0 & \delta & 0 \\
0 & 0 & -\beta & \beta & 0 \\
\mu & 0 & 0 & -(\lambda+\mu) & \lambda \\
0 & 0 & 0 & \mu & -\mu
\end{bmatrix}
\begin{matrix}
\text{State 2} \\ \text{State 1c} \\ \text{State 1u} \\ \text{State 1} \\ \text{State 0}
\end{matrix}
$$


The rows are identified to show that the first row covers transitions from State 2, the second row from State 1c, etc. Similarly, the first column covers transitions to State 2, the second column to State 1c, etc. The solution of the Markov model to obtain steady-state availability, instantaneous availability, interval availability, system reliability or task completion measures is discussed below.

Steady-State Availability. Let $\pi_i$ be the steady-state probability that the Markov chain is in State i. Let $\pi$ be the row vector of these probabilities. Then the linear system of equations

$$\pi Q = 0, \qquad \sum_i \pi_i = 1 \tag{1}$$

will provide the required probabilities. If we assume that every state in the Markov model can be reached from every other state (that is, the Markov chain is irreducible) and the number of states is finite, then the above system has a unique solution $\pi = (\pi_i)$ independent of the initial state (Trivedi, 1982). To obtain basic availability, we partition the state space S into the set of system UP states and the set of system DOWN states. Then the basic availability is given by $A = \sum_{i \in UP} \pi_i$. Thus, the steady-state analysis of a Markov model involves the solution of a linear system of equations with as many equations as the number of states in the Markov chain. The number of states can thus be rather large. However, the connection graph of the Markov chain, and therefore the transition rate matrix, is sparse, and this can be exploited in solving and storing large Markov models. In carrying out this solution, iterative methods such as Gauss-Seidel or Successive Overrelaxation (SOR) are preferable to direct methods such as Gaussian elimination (Goyal et al., 1987; Stewart and Goyal, 1985). The iteration for SOR is

$$\pi^{k+1} = \omega\,[\pi^{k+1}U + \pi^{k}L]\,D^{-1} + (1 - \omega)\,\pi^{k}, \tag{2}$$

where $\pi^{k}$ is the solution vector at the kth iteration, L is a lower triangular matrix, U is an upper triangular matrix, and D is a diagonal matrix such that $Q = D - L - U$. For $\omega = 1$, the solution given by Equation (2) reduces to the Gauss-Seidel method. The choice of $\omega$ is discussed in Stewart and Goyal (1985).

To obtain the more general dependability measures, we make use of Markov reward models (Blake et al., 1988; Howard, 1971; Smith et al., 1988). In such a model, we assign a reward rate $r_i$ to State i of the Markov chain. For basic availability (Section 2.5.1), the reward rate 1 is assigned to all operational states (i.e., states in UP) and a reward rate 0 is assigned to all system failure states (i.e., states in DOWN). Note that by reversing the reward assignments so that states in UP get reward rate 0 and states in DOWN get reward rate 1, we obtain basic unavailability. For nonreconfiguration availability (Section 2.5.2), we set the reward assignment to 1 not only for all states in UP, but also for all states in DOWN that represent the computer system undergoing a reconfiguration. For capacity-oriented measures, the reward assignment of a state in UP is the system capacity level in the state (possibly normalized so that nominal capacity is 1), while the reward rate of a state in DOWN is 0. The measures of interest above are thus a weighted sum $\sum_i r_i \pi_i$ of state probabilities, with the reward rates $r_i$ as weights. Algorithms for the steady-state solution of Markov and Markov reward (as well as semi-Markov reward) models have been built into the SHARPE (Sahner and Trivedi, 1987), SAVE (Goyal et al., 1986) and SPNP (Ciardo et al., 1989) packages.

Instantaneous Availability. The above discussion addresses a steady-state solution, i.e., the probabilities $\pi_i$ are independent of the time elapsed since the start of system operation. However, this is not always sufficient. For example, high-dependability systems with preventive maintenance will often not be in steady state. In this case, we need to carry out a transient analysis. Let $P(t)$ be the row vector consisting of $P_i(t)$, the probability that the Markov chain is in State i at time t given that the initial probability vector is $P(0)$. Then $P(t)$ can be obtained by solving the following coupled system of linear first-order, ordinary differential equations (Trivedi, 1982):

$$\frac{dP(t)}{dt} = P(t)\,Q. \tag{3}$$
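As a sketch of the steady-state part of this discussion (Equation (2) with ω = 1, i.e., Gauss-Seidel, followed by the reward-rate weighting), the code below solves the example chain and evaluates three reward assignments. It is a toy dense implementation for a five-state model, not a sparse production solver. The values δ = 120 and β = 6 per hour and the capacity normalization (two processors = 1, one processor = 0.5) are assumptions of this sketch.

```python
# Rates per hour. DELTA and BETA are ASSUMED (30 s reconfiguration, 10 min reboot).
LAM, MU, C = 1 / 5000, 1 / 4, 0.9
DELTA, BETA = 120.0, 6.0

# State order: 2, 1c, 1u, 1, 0
Q = [
    [-2 * LAM, 2 * LAM * C, 2 * LAM * (1 - C), 0.0, 0.0],
    [0.0, -DELTA, 0.0, DELTA, 0.0],
    [0.0, 0.0, -BETA, BETA, 0.0],
    [MU, 0.0, 0.0, -(LAM + MU), LAM],
    [0.0, 0.0, 0.0, MU, -MU],
]


def gauss_seidel(q, tol=1e-14, max_sweeps=1000):
    """Solve pi*Q = 0 with sum(pi) = 1 by Gauss-Seidel sweeps (SOR with omega = 1)."""
    n = len(q)
    pi = [1.0 / n] * n
    for _ in range(max_sweeps):
        previous = pi[:]
        for j in range(n):
            # Column j of pi*Q = 0: inflow into state j equals outflow from it.
            inflow = sum(pi[i] * q[i][j] for i in range(n) if i != j)
            pi[j] = inflow / -q[j][j]
        total = sum(pi)
        pi = [p / total for p in pi]          # renormalize after each sweep
        if max(abs(a - b) for a, b in zip(pi, previous)) < tol:
            return pi
    raise RuntimeError("Gauss-Seidel did not converge")


pi = gauss_seidel(Q)

REWARDS = {
    "basic availability": [1, 0, 0, 1, 0],
    "tolerance (nonreconfiguration) availability": [1, 1, 0, 1, 0],
    "capacity-oriented availability": [1, 0, 0, 0.5, 0],
}
measures = {
    name: sum(r * p for r, p in zip(reward, pi)) for name, reward in REWARDS.items()
}
for name, value in measures.items():
    print(f"{name}: {100 * value:.6f}%")
```

Changing only the reward vector switches between measures, which is the practical appeal of the reward formulation.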

The solution method commonly used for such a system of differential equations is uniformization (or randomization) (Reibman and Trivedi, 1988). Uniformization first applies the transformation $Q^* = Q/q + I$, where $q = \max_i |q_{ii}|$. The solution is then

$$P(t) = \sum_{k=0}^{\infty} \Theta(k)\, e^{-qt}\, \frac{(qt)^k}{k!},$$

where $\Theta(0) = P(0)$ and $\Theta(k) = \Theta(k-1)\,Q^*$. For computational purposes, the series needs to be truncated. The number of terms to be used in the series can be determined based on a given truncation error bound (Reibman and Trivedi, 1988). Other solution methods for transient analysis of Markov models are discussed in detail elsewhere (Reibman and Trivedi, 1988).

Many transient measures can be obtained as weighted sums of transient-state probabilities, $P_i(t)$, with the weights being the reward rates. In other words, the desired measure of interest will be the expected reward rate at time t, $\sum_i r_i P_i(t)$. This expression can be clearly specialized to the instantaneous availability $A(t)$ by assigning a reward rate 1 to all up states and a reward rate 0 to all down states. Algorithms for the transient solution of Markov and Markov reward (as well as semi-Markov reward) models have been built into the SHARPE (Sahner and Trivedi, 1987), SAVE (Goyal et al., 1986) and SPNP (Ciardo et al., 1989) packages.

Interval Availability. Many measures of interest are cumulative in nature (e.g., interval availability, downtime in a given interval of observation or the downtime between two preventive maintenance events). For computing the expected values of cumulative measures, integrals of state probabilities over the interval 0 to t are required. Thus if we let $L_i(t) = \int_0^t P_i(x)\,dx$ be the average time spent by the Markov chain in State i during the interval (0, t), then the expected accumulated reward in the interval is obtained as $\sum_i r_i L_i(t)$. Measures like the expected downtime or the expected total work done in a finite interval of operation can be computed using this approach. A special case of this measure that we have already discussed in Section 2.5 is the interval availability, $(1/t)\sum_{i \in UP} L_i(t)$, where the accumulated uptime is divided by the elapsed time t. The vector $L(t) = (L_i(t))$ satisfies the equation

$$\frac{dL(t)}{dt} = L(t)\,Q + P(0), \qquad L(0) = 0. \tag{4}$$
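The transient quantities above can be sketched with a few lines of uniformization. The code below returns both P(t) and L(t); for L(t) it integrates the Poisson weights term by term, using the identity that the integral of the kth Poisson weight over (0, t) equals (1/q) times the probability of more than k events by time t. The truncation point is a simple heuristic (mean plus ten standard deviations), not the error-bound rule of the cited reference, and δ = 120 and β = 6 per hour are assumed rates. This is a toy dense implementation for illustration only.

```python
import math

# Rates per hour. DELTA and BETA are ASSUMED (30 s reconfiguration, 10 min reboot).
LAM, MU, C = 1 / 5000, 1 / 4, 0.9
DELTA, BETA = 120.0, 6.0

# State order: 2, 1c, 1u, 1, 0
Q = [
    [-2 * LAM, 2 * LAM * C, 2 * LAM * (1 - C), 0.0, 0.0],
    [0.0, -DELTA, 0.0, DELTA, 0.0],
    [0.0, 0.0, -BETA, BETA, 0.0],
    [MU, 0.0, 0.0, -(LAM + MU), LAM],
    [0.0, 0.0, 0.0, MU, -MU],
]
UP = [0, 3]   # indices of States 2 and 1


def transient(q_matrix, p0, t):
    """Return (P(t), L(t)) by uniformization with a heuristic truncation point."""
    n = len(q_matrix)
    if t == 0:
        return list(p0), [0.0] * n
    q = max(-q_matrix[i][i] for i in range(n))
    q_star = [
        [q_matrix[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    qt = q * t
    k_max = int(qt + 10 * math.sqrt(qt) + 50)   # heuristic, not an error bound
    theta = list(p0)                            # Theta(0) = P(0)
    log_weight = -qt                            # log of the Poisson weight for k = 0
    cumulative_weight = 0.0
    p = [0.0] * n
    big_l = [0.0] * n
    for k in range(k_max + 1):
        if k > 0:
            theta = [sum(theta[i] * q_star[i][j] for i in range(n)) for j in range(n)]
            log_weight += math.log(qt) - math.log(k)
        weight = math.exp(log_weight)           # underflows harmlessly to 0.0
        cumulative_weight += weight
        tail = max(0.0, 1.0 - cumulative_weight)   # P(more than k events by t)
        for j in range(n):
            p[j] += weight * theta[j]
            big_l[j] += tail * theta[j] / q
    return p, big_l


start = [1.0, 0.0, 0.0, 0.0, 0.0]               # system starts in State 2
p_50, _ = transient(Q, start, 50.0)
_, l_1 = transient(Q, start, 1.0)
instantaneous_availability = sum(p_50[i] for i in UP)
interval_availability = sum(l_1[i] for i in UP) / 1.0
print(f"A(50 h) = {instantaneous_availability:.8f}")
print(f"interval availability over the first hour = {interval_availability:.8f}")
```

Working with the logarithm of the Poisson weights avoids underflow of the leading factor when qt is large, which is exactly the stiff regime discussed earlier.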

For a discussion of the methods of solving this equation and hence computing expected cumulative measures, see Reibman and Trivedi (1989). Such transient analysis of cumulative measures can be done using SHARPE, SAVE or SPNP. The next level of measure complexity is related to the distribution of availability and other cumulative measures. Algorithms for such computations are known (Smith et al., 1988) but will not be discussed here.

System Reliability. If all system down states are made absorbing states, then the sum of state probabilities of all the UP states will yield the system reliability, $R(t)$, at time t. For instance, in our example problem, the matrix corresponding to the reliability in Case 3 in Section 2.6 (Fig. 5c) is given by

$$
\hat{Q} = \begin{bmatrix}
-2\lambda & 2\lambda c & 2\lambda(1-c) & 0 \\
0 & -\delta & 0 & \delta \\
0 & 0 & -\beta & \beta \\
\mu & 0 & 0 & -(\lambda+\mu)
\end{bmatrix}
\begin{matrix}
\text{State 2} \\ \text{State 1c} \\ \text{State 1u} \\ \text{State 1}
\end{matrix}
$$

Note the difference between the two matrices Q and $\hat{Q}$, in that the latter omits the last row and column of the former. This represents the fact that, for the system reliability measure represented by $\hat{Q}$, State 0 is considered to be a system failure and thus an absorbing state. Solving the differential equation

$$\frac{d\hat{P}}{dt} = \hat{P}(t)\,\hat{Q}$$

and summing over the state probabilities $\hat{P}_i(t)$, $i \in UP$, will yield the system reliability at time t.

A basic form of system reliability measure is the mean time to system failure (MTTF). Such measures can be obtained by solving a linear system of equations, much like the case of steady-state probabilities:

$$\tau\,\hat{Q} = -\hat{P}(0),$$

where $\tau = (\tau_i)$ is a vector of times before absorption; $\tau_i$, $i \in UP$, is the average time spent in State i before absorption (note that $\tau_i$ can be assumed to be zero for $i \in DOWN$); and $\hat{P}(0)$ is the partition of P(0) corresponding to the UP states only. After solving for the row vector $\tau$, the system MTTF is obtained by (Goyal et al., 1987)

$$MTTF = \sum_i \tau_i. \qquad (7)$$
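The linear system for the MTTF can be sketched in a few lines of Python. The fragment below assumes a four-state transient structure (states 2, 1c, 1u, and 1, with State 0 absorbing) and hypothetical rate values in events per hour; the solver is plain Gaussian elimination with partial pivoting, not the method implemented in SAVE, SHARPE, or SPNP. A closed-form value obtained from a renewal-cycle argument is included only as a cross-check for this particular structure.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

# Hypothetical parameters (per hour); c is the coverage probability.
lam, mu, delta, beta, c = 1.0 / 5000.0, 0.25, 120.0, 6.0, 0.9

# Assumed transient generator over the states 2, 1c, 1u, 1 (State 0 absorbing).
Q_hat = [
    [-2 * lam, 2 * lam * c, 2 * lam * (1 - c), 0.0],
    [0.0, -delta, 0.0, delta],
    [0.0, 0.0, -beta, beta],
    [mu, 0.0, 0.0, -(lam + mu)],
]
P0_hat = [1.0, 0.0, 0.0, 0.0]  # start with both processors up

# tau Q_hat = -P0_hat is solved as the transposed system Q_hat^T tau^T = -P0_hat^T.
n = len(Q_hat)
Q_hat_T = [[Q_hat[i][j] for i in range(n)] for j in range(n)]
tau = solve_linear(Q_hat_T, [-p for p in P0_hat])
mttf = sum(tau)

def closed_form_mttf(lam, mu, delta, beta, c):
    """Cross-check: mean cycle time multiplied by the mean number of cycles."""
    cycle = 1 / (2 * lam) + c / delta + (1 - c) / beta + 1 / (lam + mu)
    return cycle * (lam + mu) / lam
```

Each entry of tau is the expected time spent in the corresponding state before absorption, so the entries should all be positive and their sum is the MTTF in hours.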

The methods of solving the above linear system of equations are similar to those used for solving for steady-state probabilities (recall Equation (1)) (Goyal et al., 1987; Stewart and Goyal, 1985). SAVE, SHARPE, and SPNP facilitate the computation of MTTF.

Task Completion Measures. So far, we have discussed system-oriented dependability models. Suppose we consider a task that requires x amount of uninterrupted CPU time. Further suppose that when the task arrives the system is in State i and the rate of interruption as seen by the task is $\gamma_i$. Then the task interruption probability is given by $\sum_i (1 - e^{-\gamma_i x})\,\pi_i$. More generally, assume that a task requires x amount of time to execute in the absence of failures and let T(x) be the task execution time with failure interruptions accounted for. The execution requirement, x, of the task can be either deterministic or random. It is of interest to compute the expected value, $E[T(x)]$, or the distribution, $P(T(x) < t)$, of the task completion time. Models for task completion time can be built as either Markov or semi-Markov models. Such Markov models can be generated by hand, by using Kronecker algebra techniques (Bobbio and Trivedi, 1990), or by using generalized stochastic Petri nets (Ciardo et al., 1989). If more accurate modeling of task performance, including the effect of work loss and checkpointing, is desired, then transform-based techniques need to be used (Chimento, 1988). For references on this topic, the reader may consult Chimento (1988), Kulkarni et al. (1987), Nicola et al. (1987), and Kulkarni et al. (to appear).

4.2 Parameter Determination

In order to solve and use dependability models, one must consider the underlying input parameters. These parameters fall into four categories: failure rates (λ), failure coverage probabilities (c), repair rates (μ), and system performance levels (or reward rates) (r).


4.2.1 Failure Rates (What is λ?)

A number of issues arise in describing and determining the occurrence of component failures in a computer system. Foremost among these are the fault/error/failure distinction, the source of failures, the type of failures, the age dependency of failures, the distribution of inter-failure times, and the process by which failure rates are estimated. We address each of these in turn.

Faults vs. errors vs. failures. To properly evaluate failure rates, one must distinguish among faults, errors, and failures. A fault is an improper condition in a hardware or software module which may lead to a failure of the module (Nelson and Carroll, 1987). An error is a manifestation of a fault leading to an incorrect response from a hardware or software module. A failure is a malfunction of a system or module such that it can no longer operate correctly. While failures result from errors, which in turn result from faults, it is not necessarily the case that a fault will lead to a failure. In fact, to have faults not lead to failures is the objective of fault-tolerant computing design. High system dependability may thereby be obtained not only by reducing the rate at which faults occur, but also by preventing faults that do occur from propagating into failures.

Source of failures. Failures can arise from a number of different sources in the computer system. They can be hardware, software, or operator induced, and can arise from the processors, the storage units or storage controllers, power supply, or system or application software. There is much interaction among these sources, so that it is often difficult to pinpoint the actual source. For example, a software failure may look like an operator-induced one if it causes the operator to have to shut down the system in order to reload the code, or a hardware failure may look like a software one if it changes a parameter value to be outside the range the software is designed to handle.
Note that permanent hardware failures generally form only a small minority of the total failures.

Types of failures. Failures can be one of three types: permanent, intermittent, or transient. Permanent (also called hard or solid) failures are those that occur due to a fault in the system and require a repair action to restore system operation. Intermittent (also called soft) failures are those that occur due to a fault in the system but do not require a repair action to restore system operation, but rather a reboot or other system restart. Intermittent failures frequently, though not always, can be precursors to an eventual permanent failure, in that an underlying fault may initially have only a mild impact on the system, but then have an increasing impact as it gets worse, eventually resulting in a permanent failure. Note that intermittent failures generally occur far more often than permanent failures, sometimes by an order of magnitude. Transient failures are those that occur not due to a fault in the system, but due to outside causes such as cosmic rays or alpha particles. Transient failures, like intermittent failures, generally do not require a repair action to restore system operation.

These types were developed with hardware failures in mind. Software failures are nominally permanent, in that some code fault is causing the failure. However, as software becomes very complex, the faults can become extremely subtle, and the resulting failures look more and more like intermittent ones. Intermittent software failures have been termed “Heisenbugs” (Gray, 1986).

Age dependency of failures. The rate at which failures occur in a component generally depends on how far along it is in its life span. Hardware components will show a decreasing failure rate in early life due to the realization of “infant mortality failures.” During midlife the failure rate will be approximately constant, and in later life (particularly for mechanical components), the failure rate will increase due to wearout characteristics. Software components will generally show a decreasing failure rate as bugs are found and removed (similar to hardware infant mortality).

Distribution of inter-failure times. For simplicity and often with justification (especially in the midlife section of the failure process), times to failure are often assumed to be exponentially distributed. This is a powerful assumption which allows many analytical techniques to be applied to the dependability evaluation. Often, however, the modeler is interested in more general distributions of times to failure. In some cases, extensions of the exponential assumption can be used. For example, the use of nonhomogeneous Markov models (Bavuso et al., 1987) (a special case of such a process is the nonhomogeneous Poisson process, or NHPP) allows failure times to have a Weibull distribution. Semi-Markov models (Cinlar, 1975; Ciardo et al., to appear) and phase-type expansions (Cox, 1955; Cinlar, 1975; Hsueh et al., 1988;
Sahner and Trivedi, 1987) can also be utilized to capture non-exponential distributions. For each fault type in the fault model of each component, the nature of these distributions must be specified.

Estimation of component failure rates. A crucial question while computing system dependability is how to obtain accurate estimates of component failure rates. A relatively straightforward way to obtain these may be to use vendor data and the parts count method, facilitated by reliability databases and analysis tools. This method may have drawbacks, though, because an exhaustive listing of all parts within, for example, a processor or a storage device may be unwieldy and, furthermore, the database may be incomplete and/or untrustworthy. However, in the early stages of the design cycle (design stage), this may nonetheless be the only applicable approach.

A second method of estimating component failure rates is from field measurement data. Such operational field failure data are likely to be a much more reliable source than a database with vendor-supplied part failure rates. The tradeoff is that collecting enough data takes considerable expense and time. An important way to use these data more efficiently is to think of the individual component failure rates $\lambda_i$ (that is, the failure rate for component i, where a component is a basic unit of the computer system such as a processor, storage unit, storage controller, power supply, or communications link) as functions of at least three kinds of parameters, a, e, and u, i.e.,

$$\lambda_i = f_i(a, e, u; \theta), \qquad (8)$$

where

• a is a vector of architectural (or system configuration) variables (e.g., the number and types of processor nodes, the number of disks and disk controllers, etc.),
• e is a vector of environment variables (e.g., temperature),
• u is a vector of usage variables (e.g., banking, education, transaction processing, military, etc.), and
• θ is a vector of coefficients for the above parameters.

After hypothesizing a functional form of $f_i$ based on the parameter set a, e, and u, we use statistical techniques such as regression analysis, Bayesian techniques, or maximum likelihood estimation to determine the coefficients θ and use the resulting equation to determine the component failure rates for the dependability model. In some sense this is analogous to the approach used in MIL-HDBK-217C (U.S. Department of Defense, 1980) but tailored to the problem at hand.

Failure rates can also vary with the load on various system resources. Since load is a function of time, so will be the failure rates. For the sake of simplicity, we have assumed for the analyses described in this paper that failure rates are not dependent on load. For information on the load dependence of failure rates, see Iyer et al. (1986).

4.2.2 Coverage Probabilities (What is c?)

In a system-level analysis of dependability, it becomes very important to know how well the system as a whole can operate when one of its subsystems fails. If the system can continue operations, either without ill effect or with an acceptable degradation of operations, the failure is said to be covered. If, however, the failure causes the whole system to become inoperable, the failure is said to be uncovered. Clearly, dependability-enhancing efforts such as redundancy or checkpointing will only function if subsystem failures are covered. The coverage probability c is the conditional probability that a system successfully recovers, given that a fault has occurred. It has been known for some time that a small change in the value of a coverage probability can make a rather large change in model results (Dugan and Trivedi, 1989). It is, therefore, extremely important to estimate various coverage parameters accurately. Three different ways of estimating coverage can be identified:

1. Structural Modeling. This approach involves decomposing the fault/error-handling behavior into its constituent phases (e.g., detection, retry, isolation, reconfiguration, etc.) and using various Markov, semi-Markov, and stochastic Petri net models for computing the overall coverage (Dugan and Trivedi, 1989). This approach is useful during the design phase.
2. Fault/Error-Injection Experiments. If the system is ready for experimentation, faults/errors can be injected and the response can be recorded. From these measurement data, coverage can be estimated (Arlat et al., 1989). This approach is appropriate at design verification time.
3. Field-Measurement Data. Based on the data collected from a system in operation, coverage can be estimated. Clearly, this is the most expensive approach among the three. Nevertheless, collection and analysis of measurement data is to be highly encouraged to enhance our understanding of dependability.

4.2.3 Repair Rates (What is μ?)

Two broad categories of repair need to be specified: corrective (or unscheduled) maintenance and preventive (or scheduled) maintenance. For each type of detected error, a different type of corrective repair action needs to be specified. Various parameters of interest here are the time to reboot a processor, time to reboot a system, reconfiguration time, and so on. These data could come from design documents or error logs. One also needs to determine whether the system is considered (fully or partially) up or down during each of these intervals. Other data, such as the field service travel time and actual repair time, may come from field service organizations.

4.2.4 Reward Rates (What is r?)

In a basic availability model, we classify states as either up or down. However, this binary classification of states often needs to be expanded for many applications, such as multiple and distributed processing systems with many different performance levels. A simple extension of Markov models allows a weight, worth, or reward rate assignment to each state. The reward rate may be based on a resource capacity in the given state (such as the number of up processors) or, in more sophisticated analyses, on the performance of the system in that state. After making reward assignments, sometimes it is desirable to scale the reward rates so that the value assigned to the fully operational states is 1 and the values assigned to degraded configurations are less than 1. Other times scaling may not be appropriate, such as when two systems with a different number of processor nodes are compared. Note that in some cases, such as for unavailability or for the probability of end-user interruption, the “reward” is actually a penalty (i.e., a value of 1 represents a failure), though for consistency it is nonetheless called a reward rate.

In Table III, we summarize the reward assignments that yield some of the measures in Sections 2.5-4.1. The states of the system as depicted in the Markov chain are partitioned into UP and DOWN states (i.e., UP is the set of all operational states and DOWN is the set of all system failure states). DOWN states are similarly further partitioned into RECON, UDOWN, and PDOWN states, where RECON indicates the system is undergoing a reconfiguration, UDOWN indicates the system is down due to an uncovered failure, and PDOWN indicates the system is down due to loss of processors. The number of processors in the system is denoted by N. Let $C_i$ denote the system capacity in State i, and let $C_N$ denote the capacity when all processors are up (note: one frequently used example of system capacity is the number of up processors). Clearly, $C_i$ for a DOWN state will be zero.

Table IIIa describes the reward structure for the system availability measures. Note that most of the measures are steady-state-based and thus use the steady-state probabilities $\pi_i$.
However, the instantaneous and interval availability measures instead use the instantaneous (time-dependent) and interval-based quantities $P_i(t)$ and $L_i(t)/t$, respectively. Note also that the downtime measures are expressed in minutes/year (or processor-minutes/year for the capacity-oriented downtime), and are based on a total of 60 × 24 × 365 = 525,600 minutes/year.

Table IIIb describes the reward structure for the system reliability measures. The set ABS represents the absorbing states in the underlying Markov chain, i.e., the set of system failure states from which no recovery is permitted. The value $P_i(t)$ for $i \notin ABS$ represents the probability that the system has not yet failed and is currently in State i, so that the summation $\sum_{i \notin ABS} P_i(t)$ represents the probability that the system has not yet failed, i.e., the reliability function R(t). Note also that for $i \notin ABS$ the value $\tau_i$ represents the mean time before failure that the system spends in State i, so that the summation $\sum_{i \notin ABS} \tau_i$ represents the system mean time to failure, i.e., MTTF.

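TABLE III. REWARD-BASED FORMULAS FOR DEPENDABILITY MEASURES

Table IIIa. System Availability Measures

| Measure | Reward rate ($r_i$) | Formula |
| --- | --- | --- |
| Basic availability | 1 if $i \in UP$, else 0 | $\sum_i r_i \pi_i$ |
| Basic unavailability | 0 if $i \in UP$, else 1 | $\sum_i r_i \pi_i$ |
| Basic downtime | 0 if $i \in UP$, else 525,600 | $\sum_i r_i \pi_i$ |
| Basic instantaneous availability | 1 if $i \in UP$, else 0 | $\sum_i r_i P_i(t)$ |
| Basic interval availability | 1 if $i \in UP$, else 0 | $(1/t) \sum_i r_i L_i(t)$ |
| Tolerance availability | 1 if $i \in UP \cup RECON$, else 0 | $\sum_i r_i \pi_i$ |
| Tolerance unavailability | 0 if $i \in UP \cup RECON$, else 1 | $\sum_i r_i \pi_i$ |
| Tolerance downtime | 0 if $i \in UP \cup RECON$, else 525,600 | $\sum_i r_i \pi_i$ |
| Capacity-oriented availability | $C_i/C_N$ if $i \in UP$, else 0 | $\sum_i r_i \pi_i$ |
| Capacity-oriented unavailability | $1 - C_i/C_N$ if $i \in UP$, else 1 | $\sum_i r_i \pi_i$ |
| Capacity-oriented downtime | $525{,}600 \cdot C_N \cdot (1 - C_i/C_N)$ if $i \in UP$, else $525{,}600 \cdot C_N$ | $\sum_i r_i \pi_i$ |
| Tolerance capacity-oriented availability | $C_i/C_N$ if $i \in UP \cup RECON$, else 0 | $\sum_i r_i \pi_i$ |
| Tolerance capacity-oriented unavailability | $1 - C_i/C_N$ if $i \in UP \cup RECON$, else 1 | $\sum_i r_i \pi_i$ |
| Tolerance capacity-oriented downtime | $525{,}600 \cdot C_N \cdot (1 - C_i/C_N)$ if $i \in UP \cup RECON$, else $525{,}600 \cdot C_N$ | $\sum_i r_i \pi_i$ |
| Degraded-capacity time | 525,600 if $i \in UP$ and $C_i \ne C_N$, else 0 | $\sum_i r_i \pi_i$ |

Table IIIb. System Reliability Measures

Due to lack of processors (ABS = PDOWN)

| Measure | Reward rate ($r_i$) | Formula |
| --- | --- | --- |
| Reliability (R(t)) | 1 if $i \notin ABS$, else 0 | $\sum_i r_i P_i(t) = \sum_{i \notin ABS} P_i(t)$ |
| System MTTF | 1 if $i \notin ABS$, else 0 | $\sum_i r_i \tau_i = \sum_{i \notin ABS} \tau_i$ |
| Frequency | $525{,}600 \cdot \mu$ if $i \in PDOWN$ | $\sum_i r_i \pi_i$ |

Due to over-tolerance outages (ABS = UDOWN ∪ PDOWN)

| Measure | Reward rate ($r_i$) | Formula |
| --- | --- | --- |
| Reliability (R(t)) | 1 if $i \notin ABS$, else 0 | $\sum_i r_i P_i(t) = \sum_{i \notin ABS} P_i(t)$ |
| System MTTF | 1 if $i \notin ABS$, else 0 | $\sum_i r_i \tau_i = \sum_{i \notin ABS} \tau_i$ |
| Frequency | $525{,}600 \cdot \beta$ if $i \in UDOWN$; $525{,}600 \cdot \mu$ if $i \in PDOWN$ | $\sum_i r_i \pi_i$ |

Due to any outage (ABS = DOWN = RECON ∪ UDOWN ∪ PDOWN)

| Measure | Reward rate ($r_i$) | Formula |
| --- | --- | --- |
| Reliability (R(t)) | 1 if $i \notin ABS$, else 0 | $\sum_i r_i P_i(t) = \sum_{i \notin ABS} P_i(t)$ |
| System MTTF | 1 if $i \notin ABS$, else 0 | $\sum_i r_i \tau_i = \sum_{i \notin ABS} \tau_i$ |
| Frequency | $525{,}600 \cdot \delta$ if $i \in RECON$; $525{,}600 \cdot \beta$ if $i \in UDOWN$; $525{,}600 \cdot \mu$ if $i \in PDOWN$ | $\sum_i r_i \pi_i$ |
| Frequency of degraded-capacity incidents | $\sum_{j \in ABS \text{ or } C_j = C_N} q_{ij}$ if $i \in UP$ and $C_i \ne C_N$, else 0 | $\sum_i r_i \pi_i$ |

Table IIIc. Task Completion Measures

| Measure | Reward rate ($r_i$) | Formula |
| --- | --- | --- |
| Probability of end-user interruption | $1 - e^{-\gamma_i x}$ if $i \in UP$, else 0 | $\sum_i r_i \pi_i$ |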


In computing the frequency of lack-of-processors events, for example, the rate of occurrence is the repair rate μ (in repairs/minute) provided the system has experienced this condition, and the mean time spent in this condition is $525{,}600 \sum_{i \in PDOWN} \pi_i$ minutes per year. The other frequencies in the table are similarly derived (note that the occurrence rate from an uncovered failure event is β and the occurrence rate from a reconfiguration event is δ). Note also that the occurrence rate from a degraded-capacity state i to either an absorbing state or a full-capacity state is $\sum_{j \in ABS \text{ or } C_j = C_N} q_{ij}$.

Table IIIc describes the reward structure for the task-completion measures. Note that x denotes the uninterrupted processing time required by the task under consideration. The reward rate $r_i$ is $1 - e^{-\gamma_i x}$, assuming that the system is operating in State i, where $\gamma_i$ is the cumulative rate at which all the interrupting events occur.

More generally, reward rates can be based on actual system performance. States of the Markov model represent the configuration of up resources of the system (for example, see Fig. 3). For that complement of resources and the given workload, we calculate system performance using an analytical model, a simulation model, or actual measurement results (Lavenberg, 1983). Transitions of the Markov model represent failure/repair of components and system reconfiguration/reboot. The Markov reward model is then solved for various combined measures of performance and availability using the techniques described in Meyer (1980, 1982) and Reibman et al. (1989) or using tools such as SHARPE (Sahner and Trivedi, 1986; Veeraraghavan and Trivedi, 1987) or SPNP (Ciardo et al., 1989).
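The reward-based formulation can be illustrated with a short Python sketch in which several measures of Table III are obtained as the same weighted sum over states, differing only in the reward assignment. The state set, the steady-state probabilities, the interruption rates, and the repair rate below are hypothetical values chosen purely for illustration; they are not results from the models discussed in this paper.

```python
import math

MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600

# Hypothetical states: class, capacity C_i, steady-state probability pi_i,
# and interruption rate gamma_i (per minute) as seen by a task.
STATES = [
    {"name": "2",  "cls": "UP",    "capacity": 2, "pi": 0.99900, "gamma": 0.0001},
    {"name": "1c", "cls": "RECON", "capacity": 0, "pi": 0.00002, "gamma": 0.0},
    {"name": "1u", "cls": "UDOWN", "capacity": 0, "pi": 0.00003, "gamma": 0.0},
    {"name": "1",  "cls": "UP",    "capacity": 1, "pi": 0.00094, "gamma": 0.0002},
    {"name": "0",  "cls": "PDOWN", "capacity": 0, "pi": 0.00001, "gamma": 0.0},
]
C_N = 2                      # capacity with all processors up
MU_PER_MINUTE = 1.0 / 240.0  # hypothetical repair rate (4-hour mean repair)
TASK_MINUTES = 60.0          # hypothetical uninterrupted time x needed by a task

def expected_reward(reward):
    """Return sum_i r_i * pi_i for the reward assignment given by `reward`."""
    return sum(reward(s) * s["pi"] for s in STATES)

basic_availability = expected_reward(lambda s: 1.0 if s["cls"] == "UP" else 0.0)
basic_downtime = expected_reward(
    lambda s: 0.0 if s["cls"] == "UP" else MINUTES_PER_YEAR)
tolerance_availability = expected_reward(
    lambda s: 1.0 if s["cls"] in ("UP", "RECON") else 0.0)
capacity_availability = expected_reward(
    lambda s: s["capacity"] / C_N if s["cls"] == "UP" else 0.0)
lack_of_processor_frequency = expected_reward(
    lambda s: MINUTES_PER_YEAR * MU_PER_MINUTE if s["cls"] == "PDOWN" else 0.0)
interruption_probability = expected_reward(
    lambda s: 1.0 - math.exp(-s["gamma"] * TASK_MINUTES) if s["cls"] == "UP" else 0.0)
```

Expressing every measure through one `expected_reward` helper mirrors the structure of the table: changing the measure changes only the reward function, never the underlying state probabilities.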

4.3 Model Validation and Verification

Model verification and validation are the processes by which one determines how well a model fits with the underlying situation it aims to represent. Model verification is concerned with the correctness of the implementation of the conceptual model, while model validation ascertains that a model is an acceptable representation of the real world system under study. A model can be verified, at least in principle, by using program-proving techniques. More commonly, however, structured programming techniques and extensive testing are used in order to minimize implementation errors. The testing is often aided by simple cases for which there might be closed-form answers or by the existence of an alternative model that applies in some cases and which has been previously verified and validated. Reasonableness checks on the results can also help in testing.
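As one hedged illustration of testing against a case with a closed-form answer, the sketch below checks a simple steady-state solver on a chain of N identical components that fail and are repaired independently, for which the number of failed components is known to follow a binomial distribution. The solver (power iteration on a uniformized chain), the parameter values, and the function names are assumptions made for this example only.

```python
from math import comb

def independent_repair_chain(n, lam, mu):
    """Generator matrix for n components; state k = number of failed components."""
    size = n + 1
    q = [[0.0] * size for _ in range(size)]
    for k in range(size):
        if k < n:
            q[k][k + 1] = (n - k) * lam   # one of the n - k up components fails
        if k > 0:
            q[k][k - 1] = k * mu          # one of the k failed components is repaired
        q[k][k] = -sum(q[k])              # diagonal entry makes the row sum to zero
    return q

def steady_state(q, iterations=5000):
    """Steady-state vector by power iteration on the uniformized chain I + Q/rate."""
    n = len(q)
    rate = 1.05 * max(-q[i][i] for i in range(n))
    p = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
        total = sum(pi)
        pi = [x / total for x in pi]
    return pi

def binomial_steady_state(n, lam, mu):
    """Closed form: each component is down with probability lam / (lam + mu)."""
    p_down = lam / (lam + mu)
    return [comb(n, k) * p_down ** k * (1 - p_down) ** (n - k) for k in range(n + 1)]

# Hypothetical test case: 4 components, rates per hour.
N, LAM, MU = 4, 0.001, 0.25
numeric = steady_state(independent_repair_chain(N, LAM, MU))
exact = binomial_steady_state(N, LAM, MU)
max_error = max(abs(a - b) for a, b in zip(numeric, exact))
```

Agreement on such a case does not verify the solver in general, but a disagreement would point to an implementation error before the solver is applied to models without closed-form answers.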


A three-step validation process has been formulated by Naylor and Finger (1967):

1. Face Validation. This involves a dialogue between the modeler and the people who are knowledgeable about the system in producing a model that is mutually agreeable. This is an iterative process of stepwise refinement and requires constant contact with people who are well versed in the innards of the system being modeled.
2. Input-Output Validation. The data obtained from the real system are used as input to the model, and the output data from the model are compared with the observed results of the real system. Clearly, many different data sets should be used in order to gain confidence in the model. The process is quite expensive and time consuming, yet extremely important.
3. Validation of Model Assumptions. The third step in the validation process is validating model assumptions. Here, all the assumptions going into the model are explicitly identified and then tested for accuracy. Validation of the assumptions can be carried out by face validity (checking the assumptions with experts), logical inference (proving the assumption correct), or statistical testing. In addition to checking the validity of the assumptions, one should also check their robustness (or sensitivity), i.e., for each assumption, how likely are the model results to change significantly if the assumption is not quite correct? Such an analysis, while often difficult, has the potential of identifying the assumptions that need a careful examination.
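The robustness check mentioned in the third step can be sketched as a one-at-a-time perturbation of the model inputs. In the Python fragment below, the `downtime` function is a hypothetical stand-in (a two-state availability model), and the parameter names, baseline values, and 10% step are assumptions for illustration; in practice the stand-in would be replaced by the dependability model under validation.

```python
MINUTES_PER_YEAR = 525_600

def downtime(params):
    """Hypothetical stand-in model: two-state downtime in minutes per year."""
    lam, mu = params["failure_rate"], params["repair_rate"]
    return MINUTES_PER_YEAR * lam / (lam + mu)

def sensitivities(model, params, rel_step=0.10):
    """Relative change in the model output when each input is raised by rel_step."""
    base = model(params)
    result = {}
    for name, value in params.items():
        perturbed = dict(params)          # copy so the baseline stays untouched
        perturbed[name] = value * (1.0 + rel_step)
        result[name] = (model(perturbed) - base) / base
    return result

# Hypothetical baseline (rates per hour).
baseline = {"failure_rate": 0.001, "repair_rate": 0.25}
changes = sensitivities(downtime, baseline)
```

Inputs whose perturbation moves the output the most are the ones whose underlying assumptions deserve the closest examination.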

Model verification and validation check the following types of assumptions on which models are often based:

Logical. Are the states and state transitions of the model close to the behavior of the system being modeled? If there are missing or extra states or missing or extra transitions, the error in the results of the model can be rather drastic. Although formal proof techniques have been proposed, the most effective way of ascertaining that the model behaves correctly from the logical point of view appears to be a very good understanding of the system on the part of the modeler and face validation.

Distributional. We need to verify whether all the distributional assumptions made in the model hold. Sometimes, we can show that the form of a certain distribution does not have an effect on the results of the model. Such insensitivity results, although desirable, do not generally hold. In the common case, we need to statistically test a hypothesis regarding each distributional assumption and be prepared to modify the model in case the hypothesis is rejected based on measurement data.


Independence. Most stochastic models (Markov models included) assume that some events are independent of some other events. We need to statistically test the hypotheses of such assumptions. In case the hypothesis is rejected, we should be prepared to modify the model.

Approximation. Several types of approximations, e.g., state truncation (Boyd et al., 1988) and decomposition (Bobbio and Trivedi, 1986), are commonly used. We need to provide good estimates of the approximation error (Muntz et al., 1989) or to provide tight bounds on the error (Li and Silvester, 1984).

Numerical. Since a model is eventually solved numerically, truncation and round-off errors are encountered. An attempt should be made to minimize and/or estimate these errors.

5. A Full-System Example

5.1 System Description

To demonstrate the preceding techniques on an example based on actual systems, we increase the complexity of the example system. The new example (which is representative of actual systems in use at Digital and elsewhere) contains four processors, three of which are required for system operation. Processors are subject to two types of failures: “permanent” failures, which require a physical repair (taking a matter of hours) to the processor in order for a recovery to take place, and “intermittent” failures, which require only a reboot (taking a matter of minutes) of the failed processor. Failures can be either “covered” or “uncovered.” In covered failures, the system reconfigures itself to exclude the failed processor (in a matter of seconds), then continues to function as long as at least three processors remain (when a failed processor recovers, another reconfiguration takes place to include it once again in the system). In uncovered failures, the system cannot reconfigure successfully and thus fails as a whole. In this case a complete system reboot (taking a matter of minutes) is necessary for the system to recover. The system failure and recovery data are as follows:

| Parameter | Value |
| --- | --- |
| Processor MTTF for permanent failures | 5,000 hours |
| Processor MTTF for intermittent failures | 1,000 hours |
| Processor MTTR for permanent failures | 4 hours |
| Mean processor reboot time | 6 minutes |
| Mean system reboot time | 10 minutes |
| Mean reconfiguration time | 30 seconds |
| Coverage (permanent failures) | 90% |
| Coverage (intermittent failures) | 90% |
| System size | 4 processors |
| Minimum required size | 3 processors |
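As a rough, first-order illustration of how these data translate into model inputs, the Python sketch below converts the mean times into rates and estimates the yearly downtime caused by uncovered failures. It assumes that all four processors are up essentially all the time and that each uncovered failure costs one system reboot; it is a back-of-the-envelope calculation under those assumptions, not the Markov model used for the analysis, and the variable names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

# Values taken from the system failure and recovery data.
processors = 4
mttf_permanent_hours = 5000.0
mttf_intermittent_hours = 1000.0
coverage = 0.90
system_reboot_minutes = 10.0

# Per-processor failure rate (per hour): permanent plus intermittent failures.
failure_rate = 1.0 / mttf_permanent_hours + 1.0 / mttf_intermittent_hours

# Assumption: all processors are up nearly all the time, so failures occur
# at roughly `processors * failure_rate` throughout the year.
failures_per_year = processors * failure_rate * HOURS_PER_YEAR

# Assumption: every uncovered failure is followed by exactly one system reboot.
uncovered_failures_per_year = (1.0 - coverage) * failures_per_year
uncovered_downtime_minutes = uncovered_failures_per_year * system_reboot_minutes
```

Under these assumptions the estimate comes to about 42 minutes per year, which appears consistent with the uncovered-failure component reported in Section 5.2.1; the reconfiguration and lack-of-processors components depend on state interactions that this simple calculation does not capture.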

5.2 Dependability Analysis

We shall carry out a dependability analysis of the system described above. The analysis consists of an evaluation on the original data, sensitivity analyses and specification determination based on each of five system parameters, and a tradeoff analysis of processor reliability vs. processor reboot time. The measure to be used for dependability is basic availability, displayed in terms of mean downtime per year. The analysis is carried out using the model described in Heimann (1989b) and using the SHARPE package.

5.2.1 Evaluation

The system as specified has a mean downtime of 87 minutes per year. The mean downtime consists of 5 minutes per year due to too few processors being up to carry out the customer’s function (i.e., lack of required processors), 40 minutes per year due to reconfiguration, and 42 minutes per year due to uncovered failures. We shall use the notation “87 min/yr (5 + 40 + 42)” to summarize this information. This implies that efforts to improve dependability should concentrate on reconfigurations and uncovered failures, rather than on meeting required processors (for example, this would suggest not adding redundancy by using extra processors, whose positive effect in meeting required processors would be more than offset by the negative effect of inducing more reconfigurations and uncovered failures).

5.2.2 Sensitivity Analysis

We investigate the sensitivity of dependability to five parameters: intermittent failure rate (while keeping the permanent-failure processor MTTF constant), permanent-and-intermittent failure rates (while keeping their ratio constant), mean repair time, mean reconfiguration time, and mean processor-and-system reboot times (while keeping their ratio constant).

Processor Intermittent MTTF. Changing the rate of intermittent failure does cause a significant change in dependability. Increasing the processor intermittent MTTF from 1,000 to 2,500 hours reduces the downtime from 87 min/yr to 45 min/yr (4 + 20 + 21), while decreasing it from 1,000 to 500 hours increases the downtime to 155 min/yr (5 + 73 + 77). The reconfiguration and uncovered-failure components are affected by the changes about equally, while the lack-of-required-processors component is virtually unaffected.

Processor MTTF. Changing the processor MTTFs for permanent failures and intermittent failures by the same factor also causes a significant change in dependability. Increasing the permanent-failure MTTF to 10,000 hours (with a corresponding change of the intermittent-failure MTTF to 2,000 hours) improves the downtime from 87 min/yr to 42 min/yr (1 + 20 + 21), while decreasing the former MTTF to 2,500 hours and the latter MTTF to 500 hours degrades the downtime to 182 min/yr (18 + 80 + 84). Because both the permanent and the intermittent failure rates change, rather than just the intermittent rate alone, the dependability impact is greater. All three components of dependability are affected by the changes, with the lack-of-required-processors component showing a very strong sensitivity.

Note that if only the permanent failure rates are changed, leaving the intermittent rates constant, the sensitivity is far less. Improving the permanent MTTF to 10,000 hours improves the downtime to 76 min/yr (1 + 37 + 38), while degrading the permanent MTTF to 2,500 hours degrades the downtime to 113 min/yr (17 + 47 + 49). In both cases, the lack-of-required-processors time changes significantly, but the other two components do not change very much. Permanent failures thus mainly influence the lack-of-required-processors downtime, while intermittent failures mainly influence the reconfiguration and uncovered-failure downtimes.

Mean Repair Time. Changing the mean repair time does not affect dependability very much. Decreasing the repair time from 4 hours to 2 hours improves the downtime from 87 min/yr to 83 min/yr (1 + 40 + 42), while increasing it degrades the downtime to 92 min/yr (10 + 40 + 42). The impact shows up in the lack-of-required-processors component, which actually is very sensitive to repair time.
The overall impact is low because the lack-of-required-processors component comprises only a small portion of the overall measure, and repair time does not affect the other two components at all. Note the value of disaggregating the output measure into its components; the overall lack of sensitivity of dependability to repair time masks a very high sensitivity on the specific component of downtime directly affected.

Mean Reconfiguration Time. Changing the reconfiguration time does affect dependability, but not to the same extent as changing the processor reliability values. Decreasing the reconfiguration time from 30 seconds to 15 seconds improves the dependability from 87 min/yr to 67 min/yr (5 + 20 + 42), while increasing it to 60 seconds degrades the dependability to 127 min/yr (5 + 80 + 42). The change affects only the reconfiguration component of downtime, which may explain the lower overall impact.

Mean Reboot Time. Changing the processor and system reboot times (keeping their ratio constant) also affects dependability, but to a lesser extent than changing processor reliability values. Decreasing the processor-reboot time from 6 minutes to 3 minutes improves the dependability from 87 min/yr to 65 min/yr (4 + 40 + 21), while increasing the time to 12 minutes degrades the dependability to 129 min/yr (5 + 40 + 84). The change affects only the uncovered-failure component of dependability, which explains the lower overall impact.

Coverage. Changing the coverage also affects dependability to a moderate extent. Increasing the coverage from 90% to 95% improves the dependability from 87 min/yr to 67 min/yr (5 + 41 + 21), while decreasing the coverage to 80% degrades the dependability to 127 min/yr (5 + 38 + 84). This impact is just about the same as that for a similar change in the reboot time.

5.2.3 Specification Determination

Suppose we need a dependability of 99.99%, or 53 min/yr downtime. Since the system as evaluated has a downtime of 87 min/yr, some parameter specifications need to be improved to meet this requirement. We determine, for each of the five system parameters in turn (and assuming the other four remain constant), the necessary specification on that parameter for the system to satisfy the overall dependability requirement.

Processor Intermittent MTTF. To meet requirements, the processor MTTF for intermittent failures must be 2,000 hours instead of the current 1,000.

Processor MTTF. To meet requirements, the processor MTTF for permanent failures needs to be improved to 8,000 hours instead of 5,000, while the processor MTTF for intermittent failures needs to be improved to 1,600 hours instead of 1,000. Note that because both permanent and intermittent failure rates change, the magnitude of change necessary for each is less than for the intermittent failure rate alone as shown above (1.6:1 instead of 2:1). However, if only the permanent failure rate changes, with the intermittent rate remaining fixed, then the requirements cannot be met by an improved (permanent-failure) MTTF. Even with a very high permanent-failure MTTF (such as 1,000,000 hours), the dependability is 68 min/yr (0 + 33 + 35).

Mean Repair Time. The requirements cannot be met by improving repair time. Even if the MTTR were reduced to zero, downtime would still be 82 min/yr, well above the requirement. This is so because, as seen above, repair time affects only the lack-of-required-processors dependability component, which represents only a small portion of overall downtime.

Mean Reconfiguration Time. To meet requirements, the mean reconfiguration time must be 5 seconds instead of the current 30 seconds. This means a significant change is necessary in order to meet the dependability requirements by means of reconfiguration time (largely because changing reconfiguration times affects only one component of downtime: reconfiguration downtime).

Mean Reboot Time. To meet requirements, the mean processor reboot time must be 1.2 minutes instead of the current 6 minutes (and the system reboot time must be 2 minutes instead of the current 10 minutes). As with reconfigurations, a significant change is necessary in order to meet the dependability requirements by means of reboot times (again, largely because changing reboot times affects only one component of downtime: uncovered-failure downtime).

Coverage. To meet requirements, the coverage must be 98.4% instead of the current 90%. As with reboot time, a significant change in the coverage (the lack-of-coverage must decrease by a factor of six) is necessary in order to meet the dependability requirements, because only one component of downtime is affected by the change.

5.2.4 Tradeoff Analysis

Even if the system as currently specified does not meet the dependability requirements, the parameter values given may not be the best way to satisfy the requirements. For instance, we may be able to easily improve the processor reboot times from 6 minutes to 3 (and similarly for system reboots), whereas the specified processor MTTF values may be difficult to achieve. Conversely, 6-minute processor reboot times may be difficult to achieve (and similarly for system reboots), while 12-minute reboot times may be achieved quite easily and compensatory improved processor reliabilities may be easy to come by. In either of these cases, it would be beneficial to know the extent to which reboot times can be “traded off” against individual processor reliability, while keeping overall system dependability constant. Note that in the following analyses the processor MTTFs are changed in such a way as to keep constant the ratio between the permanent and intermittent failure rates, so that a change in the permanent failure rate also means a proportional change in the intermittent failure rate.

Decreased Reboot Times vs. Decreased Processor Reliability. Suppose the mean processor reboot time improves from 6 minutes to 3 minutes, with the mean system reboot time similarly improving from 10 minutes to 5 minutes. The system dependability can then be maintained with a permanent-failure processor MTTF of 3,800 hours (instead of 5,000 hours) and an intermittent-failure processor MTTF of 760 hours (instead of 1,000 hours). Since the downtime disaggregation changes from (5 + 40 + 42) to (7 + 53 + 27), this has been achieved by improving uncovered-failure downtime at the expense of lack-of-required-processors and reconfiguration downtime. If low reboot times are easier to achieve than high processor reliabilities, or if the customer is more sensitive to uncovered-failure outages than reconfiguration (or lack-of-required-processors) outages, this tradeoff would be worthwhile.

Increased Reboot Times vs. Increased Processor Reliability. Suppose the mean processor reboot time degraded from 6 minutes to 12 minutes, with the mean system reboot time similarly degrading from 10 minutes to 20 minutes. The system dependability can nonetheless be maintained with a permanent-failure processor MTTF of 7,300 hours (instead of 5,000 hours) and an intermittent-failure processor MTTF of 1,460 hours (instead of 1,000 hours). Since the downtime disaggregation changes from (5 + 40 + 42) to (2 + 28 + 57), this has been achieved by improving lack-of-required-processors and reconfiguration downtime at the expense of uncovered-failure downtime. If high processor reliabilities are easier to achieve than low reboot times, or if the customer is more sensitive to reconfiguration (or lack-of-required-processors) outages than uncovered-failure outages, this tradeoff would be very much worthwhile.

The dependability analysis is summarized in Table IV.
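The two tradeoff points can be checked with simple arithmetic. The sketch below is only a bookkeeping aid, not the Markov model used for the analysis: it takes the figures quoted in this section and confirms that each case still totals 87 min/yr, that the permanent-to-intermittent MTTF ratio stays at 5:1, and how each downtime component moves relative to the baseline. The dictionary layout and function names are illustrative assumptions.

```python
# Figures quoted in the tradeoff analysis. Component order, in min/yr:
# (lack of required processors, reconfiguration, uncovered failure).
# The names below are hypothetical; only the numbers come from the text.
CASES = {
    "baseline": {"reboot_min": 6, "mttf_perm_hr": 5000, "mttf_int_hr": 1000,
                 "parts": (5, 40, 42)},
    "faster reboot": {"reboot_min": 3, "mttf_perm_hr": 3800, "mttf_int_hr": 760,
                      "parts": (7, 53, 27)},
    "slower reboot": {"reboot_min": 12, "mttf_perm_hr": 7300, "mttf_int_hr": 1460,
                      "parts": (2, 28, 57)},
}


def total_downtime(name):
    """Sum of the three downtime components for one case (min/yr)."""
    return sum(CASES[name]["parts"])


def mttf_ratio(name):
    """Ratio of permanent-failure MTTF to intermittent-failure MTTF."""
    case = CASES[name]
    return case["mttf_perm_hr"] / case["mttf_int_hr"]


def component_shift(name):
    """Change of each component relative to the baseline (min/yr)."""
    base = CASES["baseline"]["parts"]
    return tuple(new - old for new, old in zip(CASES[name]["parts"], base))


if __name__ == "__main__":
    for name in CASES:
        print(name, total_downtime(name), mttf_ratio(name), component_shift(name))
```

Reading the shifts side by side makes the exchange explicit: with faster reboots, uncovered-failure downtime falls by 15 min/yr while the other two components rise by 15 min/yr in total, and the reverse holds for slower reboots.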


5.2.5 Remark

We have seen in the sensitivity analyses that overall dependability is highly sensitive to some parameters, moderately sensitive to others, and not very sensitive to still others. This sensitivity is influenced by the specific measure used for dependability. Table V compares qualitatively the sensitivity of Basic Availability/Mean Downtime (a system availability measure given in Section 2.5.1) with Frequency and Duration of System Outage (F4(z)) (Heimann, 1989a). Compared to the Mean Downtime measure, the Frequency-of-Total-System-Outage measure is more sensitive to a change in the coverage or a degradation in the reconfiguration time, but less sensitive to an improvement in the reconfiguration time. This is because a changed likelihood of an uncovered failure, or an increased likelihood of a lengthy reconfiguration, strongly influences the likelihood that an outage will exceed a tolerance value (on the order of a few minutes), whereas above-tolerance reconfigurations are already unlikely in the base case, so that an improved reconfiguration time will not help matters significantly. This comparison thus highlights the importance of choosing the proper dependability measure for the particular application under consideration.

TABLE IV

SUMMARY OF DEPENDABILITY ANALYSIS RESULTS

Table IVa  Evaluation

                                  Dependability (min/yr of downtime)
System as originally specified    87 (5 + 40 + 42) (i.e., uptime 99.9835%)

Table IVb  Sensitivity Analysis

Parameter                     Value                                                       Dependability (min/yr of downtime)
Intermittent processor MTTF   Intermittent MTTF = 2,500 hr                                45 (4 + 20 + 21)
                              Intermittent MTTF = 500 hr                                  155 (5 + 73 + 77)
Processor MTTF                Permanent MTTF = 10,000 hr (intermittent MTTF = 2,000 hr)   42 (1 + 20 + 21)
                              Permanent MTTF = 10,000 hr (intermittent MTTF unchanged)    76 (1 + 37 + 38)
                              Permanent MTTF = 2,500 hr (intermittent MTTF = 500 hr)      182 (18 + 80 + 84)
                              Permanent MTTF = 2,500 hr (intermittent MTTF unchanged)     113 (17 + 47 + 49)
Mean repair time              2 hours                                                     83 (1 + 40 + 42)
                              6 hours                                                     92 (10 + 40 + 42)
Mean reconfiguration time     2 seconds                                                   49 (4 + 3 + 42)
                              15 seconds                                                  67 (5 + 20 + 42)
                              60 seconds                                                  127 (5 + 80 + 42)
Mean reboot time              Processor = 3 min, System = 5 min                           65 (4 + 40 + 21)
                              Processor = 12 min, System = 20 min                         129 (5 + 40 + 84)
Coverage                      95%                                                         67 (5 + 41 + 21)
                              80%                                                         127 (5 + 38 + 84)

Table IVc  Specification Determination

Dependability requirement is 53 min/yr (i.e., 99.99% uptime)

Parameter                     Specification                                                                             Dependability (min/yr of downtime)
Intermittent processor MTTF   Specification: Ratio = 2.5 (permanent MTTF = 5,000 hr, intermittent MTTF = 2,000 hr)      53 (4 + 24 + 25)
Processor MTTF                Specification: MTTF = 8,000 hr (permanent MTTF = 8,000 hr, intermittent MTTF = 1,600 hr)  53 (2 + 25 + 26)
                              Cannot meet specification when intermittent MTTF unchanged                                68 (0 + 33 + 35) when permanent MTTF = ∞
Mean repair time              Cannot meet specification                                                                 82 (0 + 40 + 42) when MTTR = 0
Mean reconfiguration time     Specification: Mean time = 5 sec                                                          53 (4 + 7 + 42)
Mean reboot time              Specification: Mean reboot time = 1.2 min (processor), 2 min (system)                     53 (4 + 40 + 9)
Coverage                      Specification: Coverage = 98.4%                                                           53 (5 + 41 + 7)

Table IVd  Tradeoff Analysis

Reboot time vs. processor MTTF. Dependability remains at 87 min/yr (i.e., 99.9835% uptime)

Change                            Compensating change                                                      Dependability (min/yr of downtime)
Reboot time decreases to 3 min    Permanent MTTF may decrease to 3,800 hours (intermittent to 760 hr)      87 (7 + 53 + 27)
Reboot time increases to 12 min   Permanent MTTF must increase to 7,300 hours (intermittent to 1,460 hr)   87 (2 + 28 + 57)
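One pattern in Table IV can be checked numerically. The reconfiguration component (the middle term of each disaggregation) appears to track the combined per-processor failure rate, 1/MTTF(permanent) + 1/MTTF(intermittent), when the other parameters are held at their baseline values. The sketch below tests that reading against the tabulated values. The proportionality is an assumption made here for illustration, not the authors' Markov model, and the function names are hypothetical.

```python
# Baseline values taken from the text: permanent MTTF 5,000 hr,
# intermittent MTTF 1,000 hr, reconfiguration downtime 40 min/yr.
BASE_PERM_MTTF_HR = 5000.0
BASE_INT_MTTF_HR = 1000.0
BASE_RECONFIG_DOWNTIME = 40.0


def failure_rate(perm_mttf_hr, int_mttf_hr):
    """Combined per-processor failure rate (failures per hour)."""
    return 1.0 / perm_mttf_hr + 1.0 / int_mttf_hr


def estimated_reconfig_downtime(perm_mttf_hr, int_mttf_hr):
    """ASSUMPTION: reconfiguration downtime scales linearly with the
    combined failure rate. This is a first-order reading of Table IV only."""
    base_rate = failure_rate(BASE_PERM_MTTF_HR, BASE_INT_MTTF_HR)
    return BASE_RECONFIG_DOWNTIME * failure_rate(perm_mttf_hr, int_mttf_hr) / base_rate


# (permanent MTTF, intermittent MTTF, reconfiguration downtime in Table IV)
REPORTED = [
    (5000, 2500, 20), (5000, 500, 73),
    (10000, 2000, 20), (10000, 1000, 37),
    (2500, 500, 80), (2500, 1000, 47),
    (5000, 2000, 24), (8000, 1600, 25),
    (3800, 760, 53), (7300, 1460, 28),
]


def worst_mismatch():
    """Largest gap (min/yr) between the estimate and the tabulated value."""
    return max(abs(estimated_reconfig_downtime(p, i) - reported)
               for p, i, reported in REPORTED)


if __name__ == "__main__":
    for p, i, reported in REPORTED:
        print(p, i, reported, round(estimated_reconfig_downtime(p, i), 1))
```

Under this assumption every estimate lands within one minute per year of the tabulated figure, which is about what rounding of the published values would allow. The check says nothing about the lack-of-required-processors component, which depends on repair time and on multiple failures and does not follow a simple linear rule.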

TABLE V

QUALITATIVE COMPARISON OF SENSITIVITIES

                                                                                Sensitivity of:
Change of parameter(s)                                                          Mean downtime   Frequency of system outages
Processor intermittent MTTF = 2,500 hr                                          High            High
Processor intermittent MTTF = 500 hr                                            High            High
Processor permanent MTTF = 10,000 hr, processor intermittent MTTF = 2,000 hr    High            High
Processor permanent MTTF = 10,000 hr, processor intermittent MTTF unchanged     Low             Low
Processor permanent MTTF = 2,500 hr, processor intermittent MTTF = 500 hr       High            High
Processor permanent MTTF = 2,500 hr, processor intermittent MTTF unchanged      Low             Low
Mean repair time = 2 hr                                                         Low             Low
Mean repair time = 6 hr                                                         Low             Low
Mean reconfiguration time = 15 sec                                              Moderate        Low
Mean reconfiguration time = 60 sec                                              Moderate        High
Mean reboot time = 3 min                                                        Moderate        Moderate
Mean reboot time = 12 min                                                       Moderate        Moderate
Coverage = 95%                                                                  Moderate        High
Coverage = 80%                                                                  Moderate        High

5.3 Evaluations Using Other Measures

The full-system example can also be evaluated using the entire collection of measures described in Section 2. Results using the various measures are shown in Table VI.


TABLE VI

DEPENDABILITY MEASURES (FULL-SYSTEM EXAMPLE)

Table VIa  System Availability Measures

Measures                                                        Values
Basic availability                                              0.999835 (99.9835%)
  Unavailability                                                0.000165 (0.0165%)
  Downtime                                                      87 minutes/year
Tolerance (nonreconfiguration) availability                     0.999912 (99.9912%)
  Unavailability                                                0.000088 (0.0088%)
  Downtime                                                      46 minutes/year
Capacity-oriented availability                                  0.998952 (99.8952%)
  Unavailability                                                0.001048 (0.1048%)
  Downtime                                                      2,204 processor-minutes/year
Tolerance (nonreconfiguration) capacity-oriented availability   0.999078 (99.9078%)
  Unavailability                                                0.000972 (0.0972%)
  Downtime                                                      2,044 processor-minutes/year
Degraded-capacity time                                          1,858 minutes/year

Table VIb  System Reliability Measures

Measures                                                          Values
Frequency of outages                                              42/year
Frequency of over-tolerance (nonreconfiguration) system outages   4.3/year
Frequency of lack of required processors                          0.11/year
Frequency and duration of system outages                          See Fig. 10
Frequency of degraded-capacity incidents                          38.0/year
Frequency of processor repairs                                    7/year

Table VIc  Task Completion Measures

Measures (user requiring 60 minutes)                              Values
Against an over-tolerance (nonreconfiguration) interruption of system:
  Task interruption probability                                   0.00049 (0.049%)
  Odds against interruption                                       2037:1
Against an over-tolerance (nonreconfiguration) interruption of system or of user’s processor:
  Task interruption probability                                   0.00157 (0.157%)
  Odds against interruption                                       638:1
Against any interruption:
  Task interruption probability                                   0.00584 (0.584%)
  Odds against interruption                                       170:1
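The unavailability and downtime rows of Table VIa are two views of the same quantity, and the conversion is short enough to sketch. The code below assumes a 365-day year (525,600 minutes) and the four-processor configuration of the example. Both constants and all function names are assumptions made for this illustration, and small gaps against the table are expected because the published values are rounded.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600; ASSUMES a 365-day year
PROCESSORS = 4                     # four-processor system of the example


def downtime_minutes_per_year(unavailability):
    """System downtime (min/yr) implied by a steady-state unavailability."""
    return unavailability * MINUTES_PER_YEAR


def capacity_loss_processor_minutes(unavailability, processors=PROCESSORS):
    """Lost capacity (processor-min/yr) for a capacity-oriented unavailability."""
    return unavailability * MINUTES_PER_YEAR * processors


def outage_plus_degraded(system_downtime_min, degraded_minutes,
                         processors=PROCESSORS):
    """Capacity lost to full outages (all processors idle) plus the
    degraded-capacity time, counted as one processor missing per minute."""
    return system_downtime_min * processors + degraded_minutes


if __name__ == "__main__":
    print(round(downtime_minutes_per_year(0.000165)))          # basic
    print(round(downtime_minutes_per_year(0.000088)))          # tolerance
    print(round(capacity_loss_processor_minutes(0.001048)))    # capacity-oriented
    print(outage_plus_degraded(87, 1858))
```

The last function reproduces the decomposition discussed in the notes that follow the table: 87 minutes of total outage on four processors plus 1,858 minutes of degraded capacity gives 2,206 processor-minutes, within rounding of the 2,204 reported.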


Note the following:

1. The difference between basic availability and capacity-oriented availability is quite large. The total lost processing capacity of 2,204 processor-minutes is more than six times the lost processing capacity due only to system outages (87 × 4, or 348 processor-minutes). The difference is accounted for (with minor discrepancies due to rounding) by the degraded-capacity time of 1,858 minutes per year. From a system-capacity point of view, the loss due to total system outage accounts for only a small part of the total failure-related loss of service. The loss due to partial system outage represents by far the greater contribution to the total.

2. The effect of not counting the brief reconfiguration outages varies greatly with the measure used, and is thus very application-dependent. For capacity-oriented availability (a measure fitting many office applications), the relative difference is small (2,204 vs. 2,044 processor-minutes/year), since most of the lost capacity is accounted for by the degraded operation state, which is not influenced at all by whether or not reconfigurations are counted. For basic availability, the relative difference is large (87 vs. 46 min/yr), since, as shown in Table IV, reconfiguration time accounts for a significant percentage of the total downtime. For system reliability (a measure fitting applications such as flight control), the difference is overwhelming (42 vs. 4.3 incidents per year), since most outage events are indeed due to reconfigurations.

3. Since three processors are required out of the four available, there is a redundancy of one processor. One might expect that greater redundancies would yield greater dependability. However, this is not the case, as the following results show (using basic downtime as the dependability measure):

TABLE VII

REDUNDANCY ANALYSIS, 3 REQUIRED PROCESSORS

                                      Downtime (min/yr)
System size    Required processors    Loss of processors    Reconfiguration    Uncovered failure    System total
3 processors   3 of 3                 1,397                 30                 32                   1,459
4 processors   3 of 4                 5                     40                 42                   87
5 processors   3 of 5                 0                     50                 53                   103
6 processors   3 of 6                 0                     60                 63                   123


Even though adding extra processors improves the loss-of-processors outage, it does exact a countervailing penalty. The more processors in the system, the more failures can occur, and the more failures can occur, the more downtime results from reconfigurations and uncovered failures. After the first redundant processor, this penalty outweighs the improvement in loss-of-processor outage. A detailed discussion of this effect is given in Trivedi et al. (1990).
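The redundancy result can be reproduced directly from the component columns of Table VII. The sketch below sums the components for each system size, picks the size with the least downtime, and reports how much reconfiguration and uncovered-failure downtime each added processor costs. The data structure and function names are illustrative assumptions; the numbers are the tabulated components, with totals computed as their sums.

```python
# Downtime components from Table VII (min/yr), keyed by number of processors:
# (loss of processors, reconfiguration, uncovered failure).
REDUNDANCY = {
    3: (1397, 30, 32),
    4: (5, 40, 42),
    5: (0, 50, 53),
    6: (0, 60, 63),
}


def system_total(size):
    """Total downtime for a system of the given size (min/yr)."""
    return sum(REDUNDANCY[size])


def best_size():
    """System size with the smallest total downtime."""
    return min(REDUNDANCY, key=system_total)


def failure_penalty(size):
    """Reconfiguration plus uncovered-failure downtime (min/yr), the part
    that grows as processors are added."""
    _, reconfiguration, uncovered = REDUNDANCY[size]
    return reconfiguration + uncovered


if __name__ == "__main__":
    for size in sorted(REDUNDANCY):
        print(size, system_total(size), failure_penalty(size))
    print("best size:", best_size())
```

Each added processor costs roughly 20 min/yr in reconfiguration and uncovered-failure downtime. The first redundant processor buys back about 1,392 min/yr of loss-of-processors downtime, while the second can recover at most the remaining 5 min/yr, so the total rises beyond four processors.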

6. Conclusions

The concept of system dependability is being considered with increasing interest as a component of computer system effectiveness, and as a criterion that customers use for product selection decisions. Dependability measures the ability of a product to deliver its intended level of service to the user, especially in light of failures or other incidents that impinge on its performance. It combines various underlying ideas, such as reliability, maintainability, availability, and user demand patterns, into a basic overall measure of quality which customers use along with cost and performance to evaluate products.

We have defined three classes of dependability measures: system availability measures show the proportion of potential service actually delivered; system reliability measures show the length of time before a service-interrupting failure occurs; task completion measures show the likelihood that a particular user will receive adequate service. Which of these measure classes is appropriate depends on the specific application under investigation, the availability of relevant data, and the usage or customer profile. For example, an office word-processing system is best evaluated by a system availability measure, while a flight-control system is best evaluated by a system reliability measure, and an on-line transaction processing system is best evaluated by a task completion measure.

We have identified four types of dependability analyses: evaluation, sensitivity analysis, specification determination, and tradeoff analysis. Markov models, commonly used to analyze dependability, are described and their solution methods are briefly discussed. Problems in model parameterization and model validation are described. We have carried out a detailed dependability analysis on an example system, in order to illustrate the techniques.

System dependability modeling has evolved significantly in recent years. Progress has been made along the lines of clear definitions (IEV191, 1987), model construction techniques (Dugan et al., 1986; Geist and Trivedi, 1983; Goyal et al., 1986) and solution techniques (Reibman et al., 1989).


The topics where further progress is needed include:

• Further develop the various alternative measures of dependability, as well as how to choose the proper measure for a given application. The aim is a “front-end” technique to routinely analyze a configuration and available data, and from this to select the proper measure and analysis technique.

• Integrate system dependability and system performance to yield an overall measure to assess the service delivered by the system. Quite often these two issues are closely intertwined, as when subsystem failures cause degradation in throughput or when increased subsystem response time causes a “timeout” failure condition at the system level.

• Include software reliability and availability within system dependability (Laprie, 1986). Software is becoming increasingly important, both in terms of its percentage of total system cost and development time and in terms of its percentage of potential system incidents. Software reliability models are not discussed here because of space limitations; interested readers may see Littlewood (1985) and Musa et al. (1987).

• Develop techniques to address model largeness. As the systems being analyzed become larger and more complex, and as performance and software considerations are included, the underlying state space can quickly become very large. Further work in largeness avoidance and largeness tolerance is needed.

• Develop techniques to address model stiffness. The equations to be solved in dependability modeling are stiff, particularly when dependability is combined with performance. This is because of the considerable difference in magnitude between failure rates, recovery/repair rates, and arrival/service rates. Techniques for solving stiff equations need to be further developed and applied to dependability analysis.

• Incorporate model calibration and validation into the modeling process. Techniques to identify and collect the necessary data, to develop the appropriate experimental designs and statistical analyses, and to evaluate the results need to be developed. In addition, the interaction between measurement techniques (including model calibration and validation) and the model formulation and solution process needs to be encouraged.
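To give a feel for the spread of rates behind the stiffness item above, the sketch below converts the baseline mean times of the full-system example into rates and compares the fastest with the slowest. The ratio of extreme rates is used here only as a rough indicator chosen for this illustration; it is not a formal definition of stiffness, it ignores arrival and service rates, and the names are hypothetical.

```python
# Baseline mean times of the full-system example, in seconds.
MEAN_SECONDS = {
    "permanent failure": 5000 * 3600,     # MTTF 5,000 hr
    "intermittent failure": 1000 * 3600,  # MTTF 1,000 hr
    "repair": 4 * 3600,                   # 4 hr
    "system reboot": 10 * 60,             # 10 min
    "processor reboot": 6 * 60,           # 6 min
    "reconfiguration": 30,                # 30 sec
}


def rates_per_hour():
    """Event rates (per hour) as reciprocals of the mean times."""
    return {name: 3600 / seconds for name, seconds in MEAN_SECONDS.items()}


def rate_spread():
    """Ratio of the fastest rate to the slowest rate, computed from the
    mean times (the ratio of rates is the inverse ratio of means)."""
    return max(MEAN_SECONDS.values()) / min(MEAN_SECONDS.values())


if __name__ == "__main__":
    for name, rate in rates_per_hour().items():
        print(f"{name}: {rate:g} per hour")
    print("fastest/slowest rate:", rate_spread())
```

Even without any performance-level events, the rates in this small example span nearly six orders of magnitude (reconfiguration at 120 per hour against permanent failure at 0.0002 per hour). Adding job arrival and service rates, as a combined performance-dependability model would, can only widen that range.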

ACKNOWLEDGMENTS

This paper is based on projects sponsored by Digital’s VAXcluster Technical Office under the direction of Ed Balkovich, who has also provided significant direct input into those projects. In addition, this paper has benefited from the comments and suggestions of Michael Elbert, Rick Howe, Oliver Ibe, John Kitchin, Archana Sathaye, and Anne Wein.


REFERENCES

Arlat, J., Crouzet, Y., and Laprie, J. C. (1989). Fault Injection for Dependability Validation of Fault-Tolerant Computing Systems. Nineteenth Int. Symp. Fault-Tolerant Computing, Chicago, pp. 348-355.
Bard, Y., and Schatzoff, M. (1978). Statistical Methods in Computer Performance Analysis. In “Current Trends in Programming Methodology, Vol. III: Software Modeling” (K. M. Chandy and R. T. Yeh, eds.), pp. 1-51. Prentice-Hall, Englewood Cliffs, New Jersey.
Bavuso, Salvatore J., Dugan, Joanne Bechta, Trivedi, Kishor S., Rothmann, Elizabeth M., and Smith, W. Earl (1987). Analysis of Typical Fault-Tolerant Architectures Using HARP. IEEE Trans. Reliability R-36(2), 176-185.
Blake, J., and Trivedi, K. (1989). Reliability of Interconnection Networks Using Hierarchical Composition. IEEE Trans. Reliability 38(1), 111-120.
Blake, J., Reibman, A., and Trivedi, K. (1988). Sensitivity Analysis of Reliability and Performability for Multiprocessor Systems. Proc. 1988 ACM SIGMETRICS Conf., Santa Fe, New Mexico, pp. 177-186.
Bobbio, A., and Trivedi, K. (1986). An Aggregation Technique for the Transient Analysis of Stiff Markov Chains. IEEE Trans. Computers C-35(9), 803-814.
Bobbio, A., and Trivedi, K. (1990). Computation of the Distribution of the Completion Time When the Work Requirement is a PH Random Variable. Stochastic Models 6(1).
Boyd, M. A., Veeraraghavan, M., Dugan, J. Bechta, and Trivedi, K. S. (1988). An Approach to Solving Large Reliability Models. 1988 IEEE/AIAA DASC Symp., San Diego.
Chimento, P. F. (1988). System Performance in a Failure Prone Environment. Ph.D. thesis, Department of Computer Science, Duke University, Durham, North Carolina.
Ciardo, G., and Trivedi, K. S. (1990). Solution of Large GSPN Models. Proc. First Int. Workshop on Numerical Solution of Markov Chains, Raleigh, NC.
Ciardo, G., Muppala, J., and Trivedi, K. (1989). SPNP Stochastic Petri Net Package. Proc. Third Int. Workshop Petri Nets and Performance Models PNPM89, pp. 142-151, Kyoto, Japan.
Ciardo, G., Marie, R., Sericola, B., and Trivedi, K. S. (1990). Performability Analysis Using Semi-Markov Reward Processes. IEEE Trans. Computers.
Cinlar, E. (1975). “Introduction to Stochastic Processes.” Prentice-Hall, Englewood Cliffs, New Jersey.
Conway, A. W., and Goyal, A. (1987). Monte Carlo Simulation of Computer System Availability/Reliability Models. Proc. Seventeenth Int. Symp. Fault-Tolerant Computing, pp. 230-235.
Cox, D. R. (1955). A Use of Complex Probabilities in the Theory of Stochastic Processes. Proc. Camb. Phil. Soc. 51, 313-319.
Dugan, J. B., and Trivedi, K. (1989). Coverage Modeling for Dependability Analysis of Fault-Tolerant Systems. IEEE Trans. Computers C-38(6), 775-787.
Dugan, J. B., Trivedi, K., Smotherman, M., and Geist, R. (1986). The Hybrid Automated Reliability Predictor. AIAA J. Guid., Control, and Dynamics 9(3), 319-331.
Geist, R., and Trivedi, K. S. (1983). Ultra-High Reliability Prediction for Fault-Tolerant Computer Systems. IEEE Trans. Computers 32(12), 1118-1127.
Goyal, A., Lavenberg, S. S., and Trivedi, K. S. (1987). Probabilistic Modeling of Computer System Availability. Annals of Operations Research 8, 285-306.
Goyal, A., Carter, W. C., de Souza e Silva, E., Lavenberg, S. S., and Trivedi, K. S. (1986). The System Availability Estimator. Proc. Sixteenth Int. Symp. Fault-Tolerant Computing, pp. 84-89.
Gray, J. (1986). Why Do Computers Stop and What Can Be Done About It? Proc. Fifth Symp. Reliability in Distributed Software and Database Systems, pp. 3-12.
Heimann, D. (1989a). VAXcluster-System Availability-Measurements and Analysis. Technical Report, DEC, March, 1989.


Heimann, D. (1989b). A Markov Model for VAXcluster System Availability. IEEE Trans. Reliability, submitted.
Howard, R. A. (1971). “Dynamic Probabilistic Systems, Vol. II: Semi-Markov and Decision Processes.” John Wiley and Sons, New York.
Hsueh, M. C., Iyer, R., and Trivedi, K. (1988). Performability Modeling Based on Real Data: A Case Study. IEEE Trans. Computers C-37(4), 478-484.
Ibe, O., Howe, R., and Trivedi, K. (1989). Approximate Availability Analysis of VAXcluster Systems. IEEE Trans. Reliability 38(1), 146-152.
Ibe, O., Trivedi, K., Sathaye, A., and Howe, R. (1989). Stochastic Petri Net Modeling of VAXcluster System Availability. Proc. Third Int. Workshop Petri Nets and Performance Models PNPM89, pp. 112-121, Kyoto, Japan.
IEV191 (1987). International Electrotechnical Vocabulary, Chapter 191: Reliability, Maintainability, and Quality of Service. CCIR/CCITT Joint Study Group on Vocabulary. International Electrotechnical Commission, Geneva, Switzerland.
Iyer, R. K., Rosetti, D. J., and Hsueh, M. C. (1986). Measurement and Modeling of Computer Reliability as Affected by Systems Activity. ACM Trans. Computer Systems 4, 214-237.
Kulkarni, V., Nicola, V., and Trivedi, K. (1991). Effects of Checkpointing and Queuing on Program Performance. Stochastic Models.
Kulkarni, V. G., Nicola, V. F., and Trivedi, K. S. (1987). The Completion Time of a Job on Multimode Systems. Adv. in Applied Prob. 19(4), 932-954.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. Fifteenth Int. Symp. Fault-Tolerant Computing, pp. 1-11.
Laprie, J. C. (1986). Towards an X-ware Reliability Theory. LAAS Technical Report, Toulouse, France, December, 1986.
Lavenberg, S. S. (1983). “Computer Performance Modeling Handbook.” Academic Press, New York.
Li, V. O., and Silvester, J. A. (1984). Performance Analysis of Networks with Unreliable Components. IEEE Trans. Communications COM-32(10), 1105-1110.
Littlewood, B. (1985). Software Reliability Prediction. In “Resilient Computing Systems” (T. Anderson, ed.). Collins, London.
Meyer, J. F. (1980). On Evaluating the Performability of Degradable Computing Systems. IEEE Trans. Computers C-29(8), 720-731.
Meyer, J. F. (1982). Closed-form Solutions of Performability. IEEE Trans. Computers C-31(7), 648-657.
Muntz, R. R., de Souza e Silva, E., and Goyal, A. (1989). Bounding Availability of Repairable Computer Systems. Proc. 1989 ACM SIGMETRICS and PERFORMANCE89 Int. Conf. Measurement and Modeling of Computer Systems, pp. 29-38, Berkeley, California.
Musa, J., Iannino, A., and Okumoto, K. (1987). “Software Reliability: Measurement, Prediction, Application.” McGraw-Hill, New York.
Naylor, T. H., and Finger, J. M. (1967). Verification of Computer Simulation Models. Management Science 14, 92-101.
Nelson, Victor P., and Carroll, Bill D. (1987). “Tutorial: Fault Tolerant Computing.” IEEE Computer Society Press, Silver Springs, Maryland.
Nicola, V. F., Kulkarni, V. G., and Trivedi, K. S. (1987). Queuing Analysis of Fault-Tolerant Computer Systems. IEEE Trans. Software Eng. SE-13(3), 363-375.
Reibman, A., and Trivedi, K. S. (1988). Numerical Transient Analysis of Markov Models. Computers and Operations Research 15(1), 19-36.
Reibman, A. L., and Trivedi, K. S. (1989). Transient Analysis of Cumulative Measures of Markov Model Behavior. Stochastic Models 5(4), 683-710.
Reibman, A., Smith, R., and Trivedi, K. (1989). Markov and Markov Reward Models: A Survey of Numerical Approaches. European J. Operations Research 40(2), 257-267.


Sahner, R., and Trivedi, K. (1986). SHARPE: An Introduction and Guide to Users. Duke University, Computer Science, Technical Report.
Sahner, R., and Trivedi, K. S. (1987). Reliability Modeling Using SHARPE. IEEE Trans. Reliability R-36(2), 186-193.
Shooman, M. L. (1968). “Probabilistic Reliability: An Engineering Approach.” McGraw-Hill, New York.
Siewiorek, D. P., and Swarz, R. S. (1982). “The Theory and Practice of Reliable System Design.” Digital Press, Bedford, Massachusetts.
Smith, R. M., Trivedi, K. S., and Ramesh, A. V. (1988). Performability Analysis: Measures, an Algorithm and a Case Study. IEEE Trans. Computers C-37(4), 406-417.
Stewart, W. J., and Goyal, A. (1985). Matrix Methods in Large Dependability Models. Research Report RC-11485, IBM, November, 1985.
Trivedi, K. S. (1982). “Probability and Statistics with Reliability, Queuing and Computer Science Applications.” Prentice-Hall, Englewood Cliffs, New Jersey.
Trivedi, K., Sathaye, A., Ibe, O., and Howe, R. (1990). Should I Add a Processor? Proc. Hawaii Conf. System Sciences, pp. 214-221.
U.S. Department of Defense (1980). “Military Standardization Handbook: Reliability Prediction of Electronic Equipment.” MIL-HDBK-217C, Washington, D.C.
Veeraraghavan, M., and Trivedi, K. (1987). Hierarchical Modeling for Reliability and Performance Measures. Proc. 1987 Princeton Workshop on Algorithms, Architecture and Technology Issues in Models of Parallel Computation. Published as “Concurrent Computation,” S. Tewksbury, B. Dickinson and S. Schwarz (eds.), Plenum Press, New York, 1988, pp. 449-474.