Microelectron. Reliab., Vol. 38, No. 4, pp. 685–688, 1998. © 1998 Elsevier Science Ltd. All rights reserved. Printed in Great Britain. 0026-2714/98 $19.00 + 0.00. PII: S0026-2714(97)00205-9
RESEARCH NOTE
A HYBRID MODEL FOR WEEKDAY REPLACEMENTS WITH INFANT MORTALITY FAILURES
MARK SMOTHERMAN,† WILLIAM TODD STINSON and MADHU CHETUPARAMBIL
Department of Computer Science, Clemson University, Clemson, SC 29634-1906, U.S.A.
(Received 22 October 1996; in revised form 19 September 1997)
Abstract: A hybrid availability model of a repairable system with infant mortality failures is proposed. The hybrid model can efficiently represent different types of state transitions by the use of a hierarchy of models: (1) a time-discretized component submodel, which provides a piecewise-linear failure rate during an operating interval; (2) a phased-mission model, which transforms the state probability vector according to repair activity; and (3) a combinatorial model, which is used to predict the number of working units among a collection of identical components. The modeling approach is illustrated by predicting the expected number of working components for a multiple-component system where replacement components are ordered at the end of business hours each weekday. Replacement occurs on the next weekday morning. © 1998 Elsevier Science Ltd. All rights reserved.
The solution technique is typically based on some form of numerical integration, or time-discretization where one-step transition probabilities can be varied according to a piecewise-linear representation of the transition rates [1]. However, the failure rates in a non-homogeneous Markov model can only vary according to the global time of the whole model, and component replacement (or repair to a good-as-new state) cannot be represented. This is because at each instant of time, there is some probability that a given component has just been repaired and should start failing at its initial rate, yet there is also some probability that the component is still operational and should continue failing at its present rate. Phased-mission models provide a solution to this dilemma when repairs occur at specific global times [2, 3]. In this type of model, a system changes its structure or behavior at specific global times and the model represents this by making corresponding changes to the state space, the transitions and transition rates, and/or the state probability vector. Each period of time between changes is called a phase, and each phase can use a separate submodel. Repairs and replacements can be represented by initializing the operational states of the submodel for the next phase to include the probabilities found in non-catastrophic failure states at the end of the previous phase. Indeed, one way to represent periodic repairs or replacements is to use a single submodel and reinitialize operational probabilities in it at each scheduled repair. The repairs, thus, act according to a global clock, while the failure rates
INTRODUCTION
Markov models have been widely used to predict the performance, reliability, and availability of computer systems. Some systems, however, exhibit behavior that does not fit the Markov assumptions, e.g. a repairable system in which the components fail according to rates that vary with time since last repair. In these cases, the modeler often turns to simulation. However, an alternative technique is to use a combination of different analytic methods, each of which can represent a subset of system behavior. In this paper, we propose a hybrid modeling approach for a repairable system of the type just described when the repair intervals are fixed, but not identical.
A HYBRID MODEL
An underlying assumption of most Markov models is the use of constant failure rates. For many types of components, a constant failure rate is an appropriate representation of the failure behavior during normal lifetime. However, components often experience a bathtub-shaped failure rate over their complete lifetimes. The increased number of initial failures is called infant mortality, and the increased number of later failures is termed wear out. Non-homogeneous Markov models can be used to model components with time-varying behavior.
†To whom all correspondence should be addressed.
of a single component being operational at global time t, and this value can be used in a combinatorial expression for the expected number of operational components.
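The combinatorial step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the n components fail independently, so that the number of working components is binomially distributed with per-component probability r(t). Under that assumption the sum of k P_k(t) collapses to n r(t). The function names are hypothetical.

```python
from math import comb


def expected_working(n, r):
    """Sum k * P_k for k = 0..n, where P_k is the probability that exactly
    k of n components work. ASSUMPTION: components are independent, so P_k
    is binomial with success probability r."""
    return sum(k * comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(n + 1))


def expected_working_closed_form(n, r):
    """Closed form of the same sum under the binomial assumption: n * r."""
    return n * r
```

Because the two functions agree, only the single-component probability r(t) needs to be computed by the phased-mission model; the collection of n units adds no states.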
AN EXAMPLE SYSTEM
Fig. 1. Hierarchical model structure.
within the phase model can act according to either of two clocks: the global clock or the clock of the current phase. When a system contains identical components that are used in a non-interacting manner, a reduction of model solution effort is possible. Because of statistically identical behavior, there is no need to explicitly model each different component; instead, the reliability of one component can be used in a combinatorial formula, e.g. the expected number of working components at time t in an identical collection of n components can be found by summing terms of the form k P_k(t) from k = 0 to k = n, where P_k(t) is the probability of k working components at time t. Figure 1 shows the overall model structure that has been discussed. At the basic level, a time-discretized phase submodel can be used to represent component behavior, and the failure rate λ_i can vary according to the time since (re-)initialization. Controlling the submodel is a phased-mission model that acts according to a global clock t. The phased-mission model can give the probability r(t)
Fig. 2. Weekday phased-mission model.
We illustrate the hybrid modelling approach described above using a system of 10 identical components governed by a weekday replacement policy. Replacement parts can be ordered by 5 p.m. on a weekday and installed at 10 a.m. the following weekday; the new parts are delivered overnight or over the weekend. A key feature of the model is to allow the new part to have a different failure rate when it is first installed, e.g. we can model infant mortality failures. Figure 2 depicts the logical structure of a phased-mission model that represents weekday replacements. In this figure, the transitions between submodels represent the replacements. The overall model will be evaluated for a single component, and the combinatorial expression for an expected number of working components among ten identical units will be calculated. Figure 3 shows component states for two instances of the submodel. The submodel is actually a combination of two sets of states: one for normal working hours, 10 a.m. to 5 p.m., and one for overnight, 5 p.m. to 10 a.m., or over the weekend, 5 p.m. Friday to 10 a.m. Monday. Table 1 describes the seven states in a day-submodel, and Table 2 describes the phase-change transitions among the states. Since the submodel for each weekday is the

Table 1. State interpretations in a day submodel

State  Interpretation
       Active period: 10 a.m.–5 p.m. Monday–Friday
0      Previously working component is operational
1      Previously working component has failed
2      Newly replaced component is operational
3      Newly replaced component has failed
       Active period: 5 p.m.–10 a.m. Monday–Thursday and over weekend
4      Component is operational
5      Component has failed after call
6      Component is awaiting replacement
Table 2. State probability transformations in a day submodel

Event                    Probability transformations
At 10 a.m. Monday–Friday
Shift change             P0 = P4; P4 = 0
                         P1 = P5; P5 = 0
Replacement              P2 = P6; P6 = 0
At 5 p.m. Monday–Friday
Shift change             P4 = P0; P0 = 0
                         P4 = P4 + P2; P2 = 0
Replacement part call    P6 = P1; P1 = 0
                         P6 = P6 + P3; P3 = 0
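The transformations of Table 2 can be written as two small functions on a seven-element state probability vector. This is a sketch for illustration only: the function names and the list representation are assumptions, and each assignment mirrors one entry of the table in the order given there.

```python
def at_10am(p):
    """Apply the 10 a.m. transformations of Table 2 to a copy of p,
    where p[i] is the probability of state i (i = 0..6)."""
    p = list(p)
    p[0], p[4] = p[4], 0.0          # P0 = P4; P4 = 0
    p[1], p[5] = p[5], 0.0          # P1 = P5; P5 = 0
    p[2], p[6] = p[6], 0.0          # P2 = P6; P6 = 0
    return p


def at_5pm(p):
    """Apply the 5 p.m. transformations of Table 2 to a copy of p."""
    p = list(p)
    p[4], p[0] = p[0], 0.0          # P4 = P0; P0 = 0
    p[4], p[2] = p[4] + p[2], 0.0   # P4 = P4 + P2; P2 = 0
    p[6], p[1] = p[1], 0.0          # P6 = P1; P1 = 0
    p[6], p[3] = p[6] + p[3], 0.0   # P6 = P6 + P3; P3 = 0
    return p
```

Both functions only move probability mass between states, so the vector still sums to 1 after each phase change.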
Fig. 3. Single-component submodel replicated for the first two days of a week. (Solid transitions are failure transitions; dashed transitions are phased-mission probability vector transformations.)
same with the only exception that Friday's states 4, 5 and 6 are active for a longer period, a single submodel can be used with the durations of working hours and overnight/over-the-weekend hours being submodel parameters. In Fig. 3 the failure states are annotated with the state number and "(F)". A day's submodel starts in states 0, 1, and 2. A failure of a previously working component is represented by a transition from state 0 to state 1 at constant failure rate λ. For this example, we use a rate of 0.005 failures/h. If the component has been newly replaced, a failure is represented by a transition from state 2 to state 3 at a time-varying failure rate λ(t). For this example, we use an initial failure rate that is five times higher than the normal rate for the first 15 min of operation and then settles down to the normal rate λ. Although this type of component behavior is undesirable in practice, it is used here for illustrative purposes. Furthermore, we note that this simple two-piece approximation can be refined into additional pieces to more accurately reflect an arbitrary initial failure rate curve. The phased-mission model is initialized to Monday and the day-submodel is initialized in state 2 with probability 1; all other states are set to probability 0. Figure 4 shows the predicted number of
operational components out of 10 during the fourth week. For the model parameters chosen, transient behavior that is apparent in the predictions for the first and second week is damped out by the third week; so, the fourth week and beyond are identical in predicted behavior. The predicted result is a distorted sawtooth curve. It is lower on Monday, since only those components that had failed by 5 p.m. on Friday will have replacements available. Tuesday at 10 a.m. starts out a little higher than the remaining weekdays since the backlog of failures from the weekend will be handled at 5 p.m. on Monday. The effect of the infant mortality failures is slight, but it can be noticed with the first displayed point for each weekday being slightly higher than the remaining points for the day. A lower failure rate than the one chosen to generate Fig. 4 would flatten out and raise the curve. Simulations were written to validate the analytic predictions, and 10,000 trials were required for the simulation results for the fourth week to get within 1% of the predictions of the hybrid model.
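The example can be reproduced in outline with a short script. This is a sketch under stated assumptions, not the authors' implementation: the 15-min time step, the step-function form of λ(t), the use of 1 − exp(−rate·dt) as the one-step failure probability, the binomial expectation n·r(t), and all function names are choices made here for illustration. The rate of 0.005 failures/h, the factor of five over the first 15 min, the 10 a.m. and 5 p.m. phase changes, and the initial state come from the text.

```python
import math

LAMBDA = 0.005        # normal failure rate, failures/h (from the example)
INFANT_FACTOR = 5.0   # a new part fails five times faster ...
INFANT_HOURS = 0.25   # ... during its first 15 min of operation
DT = 0.25             # ASSUMED time step in hours; aligns with the 15-min break
N_UNITS = 10          # identical components in the system


def fail_prob(rate):
    """One-step failure probability for a rate held constant over DT."""
    return 1.0 - math.exp(-rate * DT)


def work_step(p, hours_since_10am):
    """Advance one step during working hours (states 0..3 active)."""
    new_rate = INFANT_FACTOR * LAMBDA if hours_since_10am < INFANT_HOURS else LAMBDA
    f_old, f_new = fail_prob(LAMBDA), fail_prob(new_rate)
    q = list(p)
    q[0], q[1] = p[0] * (1 - f_old), p[1] + p[0] * f_old   # state 0 -> 1
    q[2], q[3] = p[2] * (1 - f_new), p[3] + p[2] * f_new   # state 2 -> 3
    return q


def off_step(p):
    """Advance one step overnight or over the weekend (states 4..6 active)."""
    f = fail_prob(LAMBDA)
    q = list(p)
    q[4], q[5] = p[4] * (1 - f), p[5] + p[4] * f           # state 4 -> 5
    return q


def at_5pm(p):
    """Table 2, 5 p.m.: working parts go to state 4, failed parts to state 6."""
    return [0.0, 0.0, 0.0, 0.0, p[0] + p[2], p[5], p[1] + p[3]]


def at_10am(p):
    """Table 2, 10 a.m.: states 4, 5, 6 become states 0, 1, 2."""
    return [p[4], p[5], p[6], p[3], 0.0, 0.0, 0.0]


def simulate(weeks=4):
    """Return (trace, p): trace holds (hours since the first Monday 10 a.m.,
    expected number of working units); p is the final probability vector."""
    p = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # start in state 2 with probability 1
    trace, t = [], 0.0
    for _ in range(weeks):
        for day in range(5):                   # Monday .. Friday
            for k in range(round(7 / DT)):     # 10 a.m. to 5 p.m.
                trace.append((t, N_UNITS * (p[0] + p[2])))
                p = work_step(p, k * DT)
                t += DT
            p = at_5pm(p)
            off_hours = 65 if day == 4 else 17  # weekend vs. overnight
            for _ in range(round(off_hours / DT)):
                trace.append((t, N_UNITS * p[4]))
                p = off_step(p)
                t += DT
            p = at_10am(p)
    return trace, p
```

Calling `simulate(4)` and plotting the last 168 hours of the trace gives a curve that can be compared against Fig. 4. Treating the early-life rate as a step rather than a piecewise-linear ramp keeps the one-step probabilities exact for the chosen step size; a finer curve would need more break points and a smaller `DT`.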
CONCLUSIONS
We have presented a hybrid model that includes multiple components, regular, but nonuniform
Fig. 4. Expected number of working components in a 10-component system during the fourth week. (0–1 represents 10 a.m. Monday to 10 a.m. Tuesday, etc.)
scheduled replacements for failed components, and a varying failure rate for newly replaced components. The expected number of working components can be obtained from a numerical solution of the model without resort to simulation. The model can be easily extended by adding coverage factors for repair success (e.g. some shipped replacements may be faulty), different submodels for more complex underlying components, and non-instantaneous failure-recognition and/or repairs (as in Ref. [3]).
REFERENCES
1. van Dijk, N. M., in Controlled Markov Processes: Time-Discretization, CWI Tract 11 edn. Centrum voor Wiskunde en Informatica, Amsterdam, 1984.
2. Clarotti, C. A., Contini, S. and Somma, R., in Synthesis and Analysis Methods for Safety and Reliability Studies, eds. G. Apostolakis, S. Garribba and G. Volta. Plenum Press, New York, 1980, pp. 45–58.
3. Smotherman, M. and Geist, R., Reliability Engineering and System Safety, 1990, 27, 241–255.