Periodic inspection optimization model for a complex repairable system


ARTICLE IN PRESS Reliability Engineering and System Safety 95 (2010) 944–952


Reliability Engineering and System Safety journal homepage: www.elsevier.com/locate/ress

Sharareh Taghipour, Dragan Banjevic, Andrew K.S. Jardine

Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ont., Canada M5S 3G8


Abstract

Article history: Received 10 October 2009; received in revised form 31 March 2010; accepted 4 April 2010; available online 10 April 2010.

This paper proposes a model to find the optimal periodic inspection interval over a finite time horizon for a complex repairable system. In general, it may be assumed that components of the system are subject to soft or hard failures, with minimal repairs. Hard failures are self-announcing, or the system stops when they take place, and they are fixed instantaneously. Soft failures are unrevealed, do not stop the system from functioning, and can be detected only at scheduled inspections. In this paper we consider a simple policy in which soft failures are detected and fixed only at planned inspections, not at moments of hard failures. One version of the model takes into account the elapsed times from soft failures to their detection. Another version considers a threshold for the total number of soft failures. A combined model is also proposed to incorporate both the threshold and the elapsed times. A recursive procedure is developed to calculate the probabilities of failures in every interval and the expected downtimes. Numerical examples of the calculation of optimal inspection frequencies are given. The data used in the examples are adapted from a hospital's maintenance data for a general infusion pump. © 2010 Elsevier Ltd. All rights reserved.

Keywords: Periodic inspection; Maintenance; Optimization; Repairable systems; Soft failures; Hard failures

1. Introduction

Complex repairable systems such as medical devices, telecommunication systems, and electronic instruments consist of a large number of interacting components that perform the system's required functions. A repairable system, on failure, can be restored to satisfactory performance by any method other than replacement of the entire system [1]. A repairable system is usually subject to periodically or non-periodically planned inspections during its life cycle. These scheduled inspections verify device safety and performance by detecting potential and hidden failures, after which appropriate actions are taken: fixing the potential failure if the device is found defective, or performing preventive maintenance if no failure is detected, so as to avoid or reduce future failures.

We consider a system with several failure modes that may be classified into two broad categories, type I and type II, according to their consequences or the possibility of their detection. Failure modes of type I have a more significant influence on system operation; failure modes of type II are less critical for the system. Meeker and Escobar [2, p. 327] also consider two types of failures, hard and soft. They suggest that a soft failure occurs when degradation (gradual loss of performance) exceeds a certain specified level, while a hard failure causes the system to stop working. In this paper, type I failures (which we

* Corresponding author. Tel.: +1 416 978 6937; fax: +1 416 946 5462.
E-mail address: [email protected] (S. Taghipour).

0951-8320/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ress.2010.04.003

can also call "hard" failures) are those failures that stop the operation of the system, or that are self-announcing, and they are fixed as soon as they occur. Type II failures (which we can also call "soft" failures), on the other hand, do not make the system stop, but can reduce the system's performance and eventually need to be fixed. Soft failures are usually not self-announcing and are detected and fixed only at the next scheduled inspection. Therefore, there is a time delay between the actual occurrence of a soft failure and its detection.

There are different categories of soft failures of components in complex devices. One category includes a wide range of integrated protective components used to protect a system from unwanted transient incidents such as current and voltage surges. For example, infusion pumps and medical ultrasounds are equipped with circuit breakers to protect them against overloads and short circuits. If these protective components fail, the system can continue its main function, although the risk of damage increases if an overload or short circuit takes place. The failures of protective components are not self-announcing and can be rectified only at inspection. Components in this group do not age further between the occurrence of a failure and the moment the component is fixed and put back in service.

Standby redundant components are another category of components with soft failures. They are used in many critical systems to enhance system safety and reliability. The system, depending on its design, can continue functioning as long as a required number of these redundant components work. Uninterruptible power supplies (UPSs), dual redundant processors, and redundant fan trays are examples of redundant components used to increase system safety and availability. A CAT scan (CT) machine may have several CPUs running different algorithms for image processing. If one CPU fails, processing is routed to the other CPUs without interrupting or degrading the system's performance. Again, components in this category do not age after failure, and their failures are usually detected only at inspection.

In addition to protective and redundant components, there are other components whose failure does not halt the system, even though the failure can have serious consequences, even catastrophic ones, if left unattended. With infusion pumps, for example, audible or visual signals are used to communicate with operators and inform them of the status of the patient to whom the device is attached. When the level of liquid delivered to a patient drops to a certain level, the component responsible for producing signals starts to send a warning alarm. If that component fails, the pump can still function, but the patient's health risk increases if the operator does not take action. Devices in overcrowded hospitals are always in use, and even if some of their informative features such as audible signals fail, they remain in operation until the next scheduled inspection.

The results of the study in [3] show that some components of a general infusion pump, such as indicators, switches, and occlusions, have hard failures. The system stops operating when its power switch fails, and the failure is self-announcing: as soon as there is an occlusion alarm, the user is notified that something is wrong. The pump can continue to operate if a protective component such as a circuit breaker or an informative component such as an audible signal generator fails, but such failures may cause damage if left unattended. It should be noted that some components may have several failure modes, both soft and hard, but in the analysis they may be considered separately, as discussed later.
For simplicity we identify components with their dominant failure modes and assume that each component has only one failure mode, either soft or hard. Once a component experiences a soft failure, it stays in the same condition until it is fixed, as discussed above; even if the component deteriorates in some way after the failure, the repair at inspection returns it to the state just before the failure. In general, both components with soft failures and components with hard failures should be incorporated when developing an inspection optimization model for a complex repairable system.

Numerous models have been developed for inspection and maintenance optimization of a system subject to failure. Nakagawa and Mizutani [4] in their recent paper give an overview of replacement policies with minimal repair, block replacement, and simple replacement for an operating unit over a finite time horizon. In all three policies they assume that the unit is replaced at periodic replacement times and is then as good as new. In the policy with minimal repair, the unit is minimally repaired if it fails between periodic replacements. In the block policy, the unit is always replaced at failures between replacements, and in the simple policy there is a time lapse between a failure and its detection at the next inspection after the failure. The paper also presents optimal periodic and sequential policies for imperfect preventive maintenance and an inspection model for a finite time span. In the inspection model it is assumed that the unit is replaced if a failure is detected at inspection; hence this model describes mainly a failure-finding optimization problem. Further inspection and maintenance models for single-unit and multi-unit systems are discussed in detail in Refs. [5–10], including age-dependent, periodic PM, failure limit, sequential PM, and repair limit policies over an infinite time span.
The proposed models are classified according to the degree to which the operating condition of an item is restored by maintenance [9,11–14]. Perfect repair restores the system to as good as new, and minimal repair restores it to the same failure rate (intensity of failures) as before the failure. Imperfect repair restores the system to somewhere between as good as new and as bad as old. Worse and worst repairs make the failure rate increase: under worse repair the system does not break down, while worst repair unintentionally makes the system fail. Total expected cost and expected downtime per unit time are usually considered for inspection/replacement decision problems.

The majority of maintenance/inspection models assume that the system is not operative after failure, i.e., they treat all failures as hard failures. The delay-time model [15,16], however, regards the failure process as a two-stage process: first a defect is initiated, and if unattended it will lead to a failure. The delay time between the initiation of the defect and the time of failure represents a window of opportunity for preventing the failure. Wang [17] in his recent paper presents an inspection optimization model for minor and major inspections carried out for a production process subject to two types of deterioration; minor and major defects are identified and repaired at routine and major inspections, respectively.

In many real-world situations where the safety and reliability of devices are vital, devices must be inspected periodically. For example, according to regulations, all major components and functions of clinical medical devices in a hospital must be checked at inspection to assure that the devices are safe for use on patients. If a device fails between two consecutive scheduled inspections and the failure is discovered by operators, it is checked and repaired immediately regardless of the schedule. Hospitals often take non-scheduled maintenance as an opportunity to check the device's failed component and to inspect all other major components/functions.
In fact, clinical engineers working in a hospital are supposed to go through a predesigned checklist when inspecting the major components/functions of a device at both scheduled and non-scheduled inspections.

In this paper, we consider a model for a multi-component repairable system on a finite horizon subject to both soft and hard failures. For practical purposes and simpler implementation, most organizations, especially those dealing with a large number of devices, use a periodic inspection policy. A non-periodic inspection policy can also be considered using the same mathematical model, but with more time-consuming calculations and more demanding practical implementation. We assume that components with soft failures, if found failed at inspection, are repaired to the same condition as just before the failure (minimal repair), even if they age in some way while in the failed state. Components with hard failures are minimally repaired immediately on failure, without delay. Therefore, each component restarts with the same failure rate as before the failure, and the number of failures of each component follows a non-homogeneous Poisson process. It should be noted that at inspection only the states of components with soft failures are observed, not their ages at failure (i.e., the times of soft failures are censored), which complicates the problem of estimating the failure rate [3]. At this stage we consider a simple policy in which, at a hard failure, only the failed component is inspected and fixed, while at periodic scheduled inspections all components with soft failures are checked and fixed if found failed. Because of the instantaneous minimal repair of components with hard failures, they have no impact on the inspection policy and are ignored in the calculations in this paper. The model therefore reduces to a model of a system consisting of several units/components with different failure rates.
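The resulting process for a single component with soft failures can be sketched by simulation (an illustration, not part of the paper; a power-law intensity λ(x) = (β/η)(x/η)^{β−1} is assumed): the component ages only while working, a failure leaves it down until the end of the current inspection interval, and minimal repair at the inspection restores the age the component had at failure.

```python
import math
import random

def simulate_downtime(beta, eta, tau, n, reps=20000, seed=1):
    """Average downtime per inspection interval under the soft-failure policy:
    only the first failure in each interval matters, because the component
    stays failed until the inspection at the end of the interval."""
    Lam = lambda x: (x / eta) ** beta              # cumulative intensity
    Lam_inv = lambda y: eta * y ** (1.0 / beta)    # its inverse
    rng = random.Random(seed)
    down = [0.0] * n
    for _ in range(reps):
        age = 0.0                                  # initial age t = 0
        for k in range(n):
            # first failure time in this interval, given the current age:
            # P(X > x) = exp(-(Lam(age + x) - Lam(age)))
            x = Lam_inv(Lam(age) + rng.expovariate(1.0)) - age
            if x < tau:                            # soft failure before inspection
                down[k] += tau - x                 # failed until end of interval
                age += x                           # minimal repair: age frozen at failure
            else:
                age += tau                         # survived the whole interval
    return [d / reps for d in down]
```

For a homogeneous process (β = 1) the expected downtime per interval is τ − η(1 − e^{−τ/η}) in every interval, which gives a quick sanity check of the simulation.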
Depending on the consequences of failures of individual components, dependence between components may need to be taken into account, or it can be ignored. For example, if only the downtime of each component contributes to the cost, then only marginal failure rates need be considered and any relationship between components can be ignored; this case is discussed in Section 3. Hence at this stage the main contribution of this paper, at the component level, is in introducing delayed minimal repairs of components on finite time intervals and in using recursive calculation. This inspection optimization model is developed particularly for finding the optimal periodic inspection interval for medical devices used in hospitals. The assumptions used to construct the optimization model are obtained from the results of the study performed in Ref. [3].

A detailed problem description is given in Section 2. Section 3 describes the inspection optimization model considering the downtime of components with type II failures and includes a numerical example using data adapted from the case study [3]. Section 4 discusses the model in which the number of type II failures exceeding a pre-defined threshold is penalized, together with a numerical example. The combined model and its numerical example are given in Section 5. The final Section 6 presents conclusions.

2. Problem definition

Consider a complex repairable system consisting of two groups of components. Failure of a component in the first group (type I failure, or "hard" failure) is revealed or self-announcing, and when it occurs the system stops working. The system can, however, continue operating when a component in the second group fails (type II failure, or "soft" failure). A type I failure is therefore reported as soon as it occurs, and the failed component is inspected (non-scheduled inspection) and fixed immediately. Type II failures are detected and rectified only at periodic inspections, which take place at times kτ (k = 1, 2, ..., n), where T = nτ is given (Fig. 1). We assume that at a non-scheduled inspection only the failed component is inspected and fixed. Moreover, components with type I failures do not fail simultaneously, and with probability one no type I failure is detected at a scheduled inspection. When a type I or type II failure takes place, the component is minimally repaired, i.e., returned to the state it was in just before the failure. Inspections are perfect, and inspection and repair times are negligible. A convenient and realistic model for this kind of repairable system with minimal repair is the non-homogeneous Poisson process (NHPP) [18], and we will use it in this paper.

Fig. 1. Scheme of scheduled (S) and non-scheduled (NS) inspections in a cycle: scheduled inspections at τ, 2τ, ..., (k−1)τ, ..., nτ = T, with non-scheduled inspections at hard failures between them.

Our objective is to find the scheduled inspection interval that minimizes the expected cost incurred over the cycle of length T. Following the discussion at the end of the Introduction, we ignore type I failures at this stage. We also assume that type II failures, alone or in combination, cannot convert into a type I failure. They nevertheless affect the system's performance: for example, any change in the shape of a measuring head in a thickness gauge used in the steel industry produces less accurate measurement of a steel strip, although the gauge as a whole is still working. Hence the components with type II failures affect performance and maintenance cost, but not the basic operation of the system. Several methods can be employed to incorporate type II failures into the model; which method is suitable depends on the actual application for which the optimization model is developed. The importance and consequences of type II failures for the system's performance can serve managers as criteria for deciding in which form this type of failure enters the model.

3. Model considering downtime and repair of components

When the downtime of a component with type II failures is important, a method that accounts for downtime should be employed. Often, the longer a component is in a failed state, the more significant is its influence on system performance or safety. For example, a defective head in a thickness gauge produces inaccurate measurements from the moment of its deformation until the defect is rectified at inspection; a penalty should therefore be imposed for the period of less accurate measurements. Another method of incorporating type II failures in the cost model, discussed in Section 4, considers the number of type II failures exceeding a threshold. The downtime method can also be combined with the total number of failed components (Section 5) to emphasize both the total downtime and the number of failures exceeding the threshold.

Notation

m: number of components with type II failure; the components need not be independent
λ_j(x): intensity function of the non-homogeneous Poisson process associated with the number of failures of component j over the time period [0, x]
T: system cycle length, which is known and fixed
n: number of inspection intervals
τ: length of the periodic scheduled inspection interval; nτ = T, or τ = T/n
c_j: cost of minimal repair of component j
c_I: cost of one inspection
E[C_ST]: expected cost incurred from scheduled inspections in a cycle
E[C_S((k−1)τ, kτ]]: expected cost incurred from the scheduled inspection at kτ over the time period ((k−1)τ, kτ]

For each inspection interval ((k−1)τ, kτ], k = 1, 2, ..., n, and for every component with type II failure, a penalty cost proportional to the elapsed time of the failure is incurred. The objective is to find the inspection interval that minimizes the total expected cost. We assume that inspection and possible repairs are also performed at the end of the cycle, T, that is, for k = n. This action may be considered a sort of minimal preparation for the next cycle, and it also makes the model slightly simpler; the assumption can easily be removed, if appropriate.

In general, the expected cost incurred from inspections over a cycle T = nτ is the sum of the expected costs incurred from scheduled and non-scheduled inspections over the periods ((k−1)τ, kτ], k = 1, ..., n. However, since type I failures are not included in the model at this point, only the costs incurred from scheduled inspections enter the model, and from here on we refer only to components with type II failures. The expected cost incurred from scheduled inspections in a cycle is

E[C_ST] = Σ_{k=1}^{n} E[C_S((k−1)τ, kτ]].   (1)

The costs incurred at a scheduled inspection are

• cost of inspection (c_I),
• cost of minimal repair of component j if found failed (c_j),
• penalty cost for the elapsed time of failure of component j (c′_j).

When a component fails, it remains in the failed state until the next inspection time, when it is minimally repaired. Therefore, the expected cost incurred from the kth scheduled inspection in a cycle is

E[C_S((k−1)τ, kτ]] = c_I + Σ_{j=1}^{m} c_j P(component j fails in ((k−1)τ, kτ])
+ Σ_{j=1}^{m} c′_j E(downtime of component j in ((k−1)τ, kτ]).   (2)

From the additivity of the cost function it is clear that any structure of the system can be ignored, because only the marginal distributions of the failure times are needed. The components are still economically dependent, however, through sharing the same inspection interval.

Let X_k be the survival time of component j in the kth interval, measured from (k−1)τ, k = 1, 2, ..., n (the index j is suppressed for simplicity; Fig. 2), and let X_0 be the initial age of the component at the beginning of the cycle. Using an initial age is convenient for the recursive calculation, and in real applications it is also possible that some components are not as good as new; in the numerical example t = 0 will be used. Let

P_k^j(t) = P(component j does not fail in ((k−1)τ, kτ] | X_0 = t) = P^j(X_k = τ | X_0 = t).

We derive a recursive equation for P_k^j(t) by conditioning on the first failure time X_1. The conditional CDF of X_1 is

F_1^j(x|t) = P^j(X_1 ≤ x | X_0 = t) = 1 − exp(−∫_t^{t+x} λ_j(s) ds) for 0 ≤ x < τ, and F_1^j(x|t) = 1 for x ≥ τ,

with pdf on its continuous part

f_1^j(x|t) = ∂F_1^j(x|t)/∂x = λ_j(t+x) exp(−∫_t^{t+x} λ_j(s) ds), 0 ≤ x < τ.

Notice that

P_1^j(t) = P^j(X_1 = τ | X_0 = t) = F_1^j(τ|t) − F_1^j(τ−|t) = exp(−∫_t^{t+τ} λ_j(s) ds).

From the assumptions of the NHPP and minimal repair, for k = 2, ..., n,

P_k^j(t) = P^j(X_k = τ | X_0 = t) = E[P^j(X_k = τ | X_0 = t, X_1)]
= ∫_0^{τ−} P_{k−1}^j(t+x) f_1^j(x|t) dx + P_{k−1}^j(t+τ) P_1^j(t)
= ∫_0^{τ} P_{k−1}^j(t+x) dF_1^j(x|t).

Hence the recursive equation for P_k^j(t), t ≥ 0, is

P_1^j(t) = exp(−∫_t^{t+τ} λ_j(s) ds),
P_k^j(t) = ∫_0^{τ} P_{k−1}^j(t+x) dF_1^j(x|t), k = 2, ..., n.   (3)

Note that if we want to calculate P_k^j(t) for 0 ≤ t ≤ A, we first have to calculate P_{k−1}^j(t) for 0 ≤ t ≤ A + τ. For example, to calculate P_k^j(0), k = 1, 2, ..., n, we have to calculate P_k^j(t) for 0 ≤ t ≤ (n−k)τ, k = 1, 2, ..., n−1. The actual numerical calculation of P_k^j(t) is described in the numerical example. Direct calculation of P_k^j(t) would be tedious, except for small n. For example, from Eq. (3),

P_2^j(t) = ∫_0^{τ−} P_1^j(t+x) f_1^j(x|t) dx + P_1^j(t+τ) P_1^j(t)
= ∫_0^{τ} λ_j(t+x) exp(−∫_t^{t+x+τ} λ_j(s) ds) dx + exp(−∫_t^{t+2τ} λ_j(s) ds),

P_3^j(t) = ∫_0^{τ−} P_2^j(t+x) f_1^j(x|t) dx + P_2^j(t+τ) P_1^j(t)
= ∫_0^{τ} ∫_0^{τ} λ_j(t+x) λ_j(t+x+y) exp(−∫_t^{t+x+y+τ} λ_j(s) ds) dx dy
+ ∫_0^{τ} λ_j(t+x) exp(−∫_t^{t+x+2τ} λ_j(s) ds) dx
+ ∫_0^{τ} λ_j(t+x+τ) exp(−∫_t^{t+x+2τ} λ_j(s) ds) dx
+ exp(−∫_t^{t+3τ} λ_j(s) ds).

Let

e_k^j(t) = E^j[X_k | X_0 = t], k = 1, 2, ..., n.

Then for k ≥ 2, similarly to the derivation of Eq. (3), since E^j[X_k | X_0 = t, X_1 = x] = E^j[X_{k−1} | X_0 = t + x],

e_k^j(t) = E^j[E^j[X_k | X_0 = t, X_1]] = ∫_0^{τ} E^j[X_{k−1} | X_0 = t+x] dF_1^j(x|t) = ∫_0^{τ} e_{k−1}^j(t+x) dF_1^j(x|t).

Hence the recursive equation for e_k^j(t) is

e_k^j(t) = ∫_0^{τ−} e_{k−1}^j(t+x) f_1^j(x|t) dx + e_{k−1}^j(t+τ) P_1^j(t), k = 2, ..., n.   (4)

For k = 1, using the survival-function representation of the expectation,

e_1^j(t) = E^j[X_1 | X_0 = t] = ∫_0^{τ} x dF_1^j(x|t) = ∫_0^{τ−} x f_1^j(x|t) dx + τ P_1^j(t) = ∫_0^{τ} exp(−∫_t^{t+x} λ_j(s) ds) dx.

Fig. 2. Initial age t and failure times of component j: X_k is the survival time in the kth interval ((k−1)τ, kτ], measured from (k−1)τ.
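As an illustration of these recursions (a sketch, not the authors' implementation), P_k(0) and e_k(0) can be computed on a grid over t, with the trapezoidal rule standing in for the composite Simpson's rule used later in the numerical example; a power-law intensity λ(x) = (β/η)(x/η)^{β−1} with β ≥ 1 is assumed.

```python
import numpy as np

def pk_ek_at_zero(beta, eta, tau, n, h=0.01):
    """Return (P_k(0), e_k(0)) for k = 1..n for one component with power-law
    NHPP intensity lambda(x) = (beta/eta)*(x/eta)**(beta-1), beta >= 1,
    via the recursions of Eqs. (3) and (4)."""
    def trap(y):                          # trapezoidal rule with step h
        return h * (y.sum() - 0.5 * (y[0] + y[-1]))

    nx = int(round(tau / h))              # grid points per inspection interval
    t = h * np.arange(n * nx + 1)         # grid over [0, n*tau]
    Lam = (t / eta) ** beta               # cumulative intensity
    lam = (beta / eta) * (t / eta) ** (beta - 1)

    # k = 1: P_1(t) = exp(-(Lam(t+tau)-Lam(t))),
    #        e_1(t) = int_0^tau exp(-(Lam(t+x)-Lam(t))) dx
    idx = np.arange((n - 1) * nx + 1)     # t-values where P_1, e_1 are needed
    P1 = np.exp(-(Lam[idx + nx] - Lam[idx]))
    Pk = P1.copy()
    ek = np.array([trap(np.exp(-(Lam[i:i + nx + 1] - Lam[i]))) for i in idx])
    Pk0, ek0 = [Pk[0]], [ek[0]]

    # k >= 2: condition on the first failure time in the interval (atom at tau)
    for k in range(2, n + 1):
        m = (n - k) * nx + 1              # P_k is needed on [0, (n-k)*tau]
        newP, newe = np.empty(m), np.empty(m)
        for i in range(m):
            dens = lam[i:i + nx + 1] * np.exp(-(Lam[i:i + nx + 1] - Lam[i]))
            newP[i] = trap(Pk[i:i + nx + 1] * dens) + Pk[i + nx] * P1[i]
            newe[i] = trap(ek[i:i + nx + 1] * dens) + ek[i + nx] * P1[i]
        Pk, ek = newP, newe
        Pk0.append(Pk[0]); ek0.append(ek[0])
    return np.array(Pk0), np.array(ek0)
```

For a homogeneous process (β = 1), minimal repair makes every interval probabilistically identical, so P_k(0) = e^{−τ/η} and e_k(0) = η(1 − e^{−τ/η}) for all k, a convenient correctness check.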

If we consider a penalty cost c′_j for the elapsed time of failure of component j, we have

E[C_ST] = n c_I + Σ_{j=1}^{m} c_j (n − Σ_{k=1}^{n} P_k^j(t)) + Σ_{k=1}^{n} Σ_{j=1}^{m} c′_j (τ − e_k^j(t))
= T Σ_{j=1}^{m} c′_j + n (c_I + Σ_{j=1}^{m} c_j) − Σ_{j=1}^{m} Σ_{k=1}^{n} (c_j P_k^j(t) + c′_j e_k^j(t)).   (5)

Let g_1^j(t) = c_j P_1^j(t) + c′_j e_1^j(t). Then for k ≥ 2,

g_k^j(t) = c_j P_k^j(t) + c′_j e_k^j(t) = ∫_0^{τ} (c_j P_{k−1}^j(t+x) + c′_j e_{k−1}^j(t+x)) dF_1^j(x|t) = ∫_0^{τ} g_{k−1}^j(t+x) dF_1^j(x|t).

Hence g_k^j(t) can be calculated recursively from the single recursive equation

g_1^j(t) = c_j P_1^j(t) + c′_j e_1^j(t),
g_k^j(t) = ∫_0^{τ} g_{k−1}^j(t+x) dF_1^j(x|t), k = 2, 3, ..., n,

instead of calculating P_k^j(t) and e_k^j(t) separately. Finally, with ḡ_n^j(t) = Σ_{k=1}^{n} g_k^j(t), the expected cost per cycle is given by

E[C_ST] = T Σ_{j=1}^{m} c′_j + n (c_I + Σ_{j=1}^{m} c_j) − Σ_{j=1}^{m} ḡ_n^j(t).

Our objective is to find the optimal inspection frequency n* that minimizes the expected cost per cycle E[C_ST] incurred from scheduled inspections. For that purpose, we calculate E[C_ST] for different values of n and select the minimizing value n*; this procedure is numerically intensive. Note that n* is limited by an upper bound n_U. This upper bound may result from a lower bound τ ≥ τ_L > 0 on τ, where τ_L is the minimum feasible inspection interval, e.g., 1 week or 1 month; then nτ_L ≤ T, or n ≤ T/τ_L = n_U. Note also that E[C_ST] ≥ n c_I for all n. Let E[C_S(0, T]] be the expected cost incurred from a plan with only one inspection (n = 1). If n c_I ≥ E[C_S(0, T]], then n cannot be the optimal frequency, so n_U ≤ E[C_S(0, T]]/c_I.
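The single recursion for g_k^j(t) lends itself to a compact implementation. The following sketch (an illustration under the same power-law assumption and with a trapezoidal rule, not the authors' code) evaluates E[C_ST] for a given n.

```python
import numpy as np

def g_at_zero(beta, eta, c_rep, c_down, tau, n, h=0.01):
    """g_k(0), k = 1..n, for one component via the single recursion
    g_1 = c_j P_1 + c'_j e_1,  g_k(t) = int_0^tau g_{k-1}(t+x) dF_1(x|t)."""
    trap = lambda y: h * (y.sum() - 0.5 * (y[0] + y[-1]))
    nx = int(round(tau / h))
    t = h * np.arange(n * nx + 1)
    Lam = (t / eta) ** beta
    lam = (beta / eta) * (t / eta) ** (beta - 1)
    idx = np.arange((n - 1) * nx + 1)
    P1 = np.exp(-(Lam[idx + nx] - Lam[idx]))
    e1 = np.array([trap(np.exp(-(Lam[i:i + nx + 1] - Lam[i]))) for i in idx])
    g = c_rep * P1 + c_down * e1
    out = [g[0]]
    for k in range(2, n + 1):
        m = (n - k) * nx + 1
        new = np.empty(m)
        for i in range(m):
            dens = lam[i:i + nx + 1] * np.exp(-(Lam[i:i + nx + 1] - Lam[i]))
            new[i] = trap(g[i:i + nx + 1] * dens) + g[i + nx] * P1[i]
        g = new
        out.append(g[0])
    return np.array(out)

def expected_cost(params, c_I, T, n, h=0.01):
    """E[C_ST] = T*sum(c') + n*(c_I + sum(c)) - sum_j gbar_n^j(0), Eq. (5).
    params: list of (beta, eta, c_rep, c_down) tuples, one per component."""
    tau = T / n
    cost = T * sum(p[3] for p in params) + n * (c_I + sum(p[2] for p in params))
    for beta, eta, c_rep, c_down in params:
        cost -= g_at_zero(beta, eta, c_rep, c_down, tau, n, h).sum()
    return cost
```

Scanning expected_cost over n = 1, ..., n_U and taking the minimizer gives n*.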

3.1. Numerical example

Consider a complex repairable system with 5 components. We assume that the number of failures of component j follows an NHPP with power-law intensity function λ_j(x) = (β_j/η_j)(x/η_j)^{β_j−1}. The parameters of the power-law processes and the minimal repair costs of the components are given in Table 1. The system is currently inspected periodically every 4 months. The number of components and the parameters are adapted from a case study on medical equipment [3]. We are interested in finding the inspection interval that minimizes the total cost incurred from scheduled inspections over a 12-month cycle length.

For simplicity, we assume that the system starts as good as new. This is not a limitation of the model, as we may assume any initial age t of the system (or even different initial ages t = t_j for different components j). We may assume that after 12 months a major overhaul of the system is performed, e.g., some components are replaced with new or used spare components, and the system starts again; a new optimal inspection frequency should then be found. We also assume that the inspection interval cannot be shorter than 1 month, that is, n ≤ 12. Hence we have m = 5 and T = 1 year = 12 months in our model. In the hospital's current policy τ = 4 months, or n = 12/4 = 3. The inspection cost is c_I = $200.

We use Eq. (5) for the optimization. We obtain P_k^j(t) and e_k^j(t), k = 1, ..., n, recursively using Eqs. (3) and (4) and numerical integration (composite Simpson's rule with step length h depending on n and the required accuracy; after some experimentation, h = 10^−4 was used). The value t = 0 is used in the calculation, i.e., we assume the equipment is as good as new at the beginning of the cycle. The total cost E[C_ST] for different values of n is shown in Fig. 3. As can be seen in the figure, the minimal cost is obtained for n = 4, with total cost E[C_ST] = $3,932.88. Hence, if the real costs were as given, the current inspection policy should be changed from inspecting the system every 4 months to every 3 months.

This model is sensitive to the penalty costs incurred for components' downtimes. If the downtime penalty costs c′_j are not small, the optimal frequency will likely not be n = 1, as may happen in the second model (the model with a threshold). In other words, the longer the interval between inspections, the longer the elapsed times of failures, and consequently the larger the downtime penalty cost.
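Such results can also be cross-checked by direct simulation. The sketch below estimates E[C_ST] for the Table 1 system by Monte Carlo rather than by the recursion; it is an illustration only and, at simulation accuracy, will not reproduce the quoted figures exactly.

```python
import math
import random

PARAMS = [  # (beta, eta in months, repair cost, downtime cost/month), Table 1
    (1.3, 3.5, 70.0, 100.0),
    (1.1, 4.6, 45.0, 25.0),
    (2.1, 6.0, 100.0, 200.0),
    (1.8, 10.0, 75.0, 50.0),
    (1.7, 3.6, 150.0, 150.0),
]
C_I, T = 200.0, 12.0

def mc_cost(n, reps=5000, seed=0):
    """Monte Carlo estimate of E[C_ST] for inspection frequency n: repair cost
    once per interval in which a soft failure occurs, plus a downtime penalty
    per month spent failed (the setting of Eq. (5))."""
    tau = T / n
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        cost = n * C_I
        for beta, eta, c_rep, c_dwn in PARAMS:
            age = 0.0
            for _k in range(n):
                # first failure time in the interval, given the current age
                x = eta * ((age / eta) ** beta
                           + rng.expovariate(1.0)) ** (1 / beta) - age
                if x < tau:
                    cost += c_rep + c_dwn * (tau - x)
                    age += x      # minimal repair at inspection: age frozen at failure
                else:
                    age += tau
            # only the first failure per interval is counted: the component
            # stays failed until the inspection, so no further failures occur
        total += cost
    return total / reps
```

Evaluating mc_cost(n) for n = 1, ..., 12 gives a curve comparable to Fig. 3; with enough replications its minimizer should agree with the recursive calculation.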

Table 1. Parameters of the power-law intensity functions and costs.

Component | β_j | η_j (months) | Repair cost c_j | Downtime cost/month c′_j
1 | 1.3 | 3.5 | $70 | $100
2 | 1.1 | 4.6 | $45 | $25
3 | 2.1 | 6.0 | $100 | $200
4 | 1.8 | 10 | $75 | $50
5 | 1.7 | 3.6 | $150 | $150

Fig. 3. Total costs for different inspection frequencies (first model).

4. Model considering the number of failures

In this variation of the model we introduce a threshold N_trh for the acceptable number of failures observed at a scheduled inspection. A penalty cost is incurred according to the total number of failures that exceeds the threshold. This method can be used when only the total number of failures is important, regardless of


which combination of failures has occurred. An example of an application of this method is a computer system with several external hard-drive backups: even if all backups fail, the system can function, but a penalty should be incurred, since the chance of losing data increases with the number of failed backups. Thresholds can be set by the managers responsible for the safety and reliability of the system. A threshold may also be related to the number of redundant components in the system when a minimum number of components must be operational, or to the additional costs of labour and downtime required to fix accumulated problems. We assume that a penalty cost is incurred for exceeding the threshold.

Ntrh c0P 00

cP

Additional notation

$N_{trh}$: acceptable threshold for the number of failures detected at a scheduled inspection
$c'_P$: penalty cost incurred for exceeding the acceptable number of failures at a scheduled inspection
$c''_P$: penalty cost incurred per number of failures exceeding the threshold $N_{trh}$

The cost incurred at a scheduled inspection includes:

- cost of inspection ($c_I$),
- cost of minimal repair of a component if found failed ($c_j$),
- penalty cost incurred for exceeding the acceptable number of failures ($c'_P$),
- penalty cost which takes into account the number of failures exceeding $N_{trh}$ ($c''_P$).

Hence

$$E[CS_{((k-1)\tau,\,k\tau]}] = c_I + \sum_{j=1}^{m} c_j \left(1 - P_k^j(\tau)\right) + E\left[c'_P I\{S_k > N_{trh}\}\right] + E\left[c''_P \max\{0, S_k - N_{trh}\}\right] \qquad (6)$$

where $S_k$ is the total number of failures in the $k$th interval and $I\{S_k > N_{trh}\}$ the indicator function. The expected penalty cost of exceeding the acceptable number of failures, $N_{trh}$, can be calculated as follows. In our proposed model, the more the number of failures exceeds $N_{trh}$, the larger the penalty cost incurred. For $S_k$, the total number of failures in the $k$th interval, let $p_k(l) = P(S_k = l \mid \text{initial age of all components} = t)$, $l = 0, 1, 2, \ldots, m$ ($n$ and $\tau$ are suppressed for simplicity). Then

$$\begin{aligned}
E[c'_P I\{S_k > N_{trh}\}] + E[c''_P \max\{0, S_k - N_{trh}\}]
&= c'_P E[I\{S_k > N_{trh}\}] + c''_P E[\max\{0, S_k - N_{trh}\}] \\
&= c'_P \sum_{l > N_{trh}} p_k(l) + c''_P \sum_{l - N_{trh} > 0} (l - N_{trh})\, p_k(l) \\
&= c'_P \sum_{l > N_{trh}} p_k(l) + c''_P \left[\sum_{l > N_{trh}} l\, p_k(l) - \sum_{l > N_{trh}} N_{trh}\, p_k(l)\right] \\
&= (c'_P - N_{trh} c''_P) \sum_{l > N_{trh}} p_k(l) + c''_P \sum_{l > N_{trh}} l\, p_k(l) \\
&= (c'_P - N_{trh} c''_P)\, P(S_k > N_{trh}) + c''_P \sum_{l > N_{trh}} l\, p_k(l).
\end{aligned}$$

We need to find the probability of having $N_{trh} + 1$ or more failures, which may be considered as an $(N_{trh}+1)$-out-of-$m$:F system problem. However, since the probabilities of failure of the components need not be equal, the number of failed components in the system does not follow the binomial distribution. Hence, in order to find $p_k(l)$, one can use the following equation [19]:

$$p_k(l) = \left[\prod_{j=1}^{m} (1 - q_j)\right] \sum_{j_1 = 1}^{m-l+1} \frac{q_{j_1}}{1 - q_{j_1}} \sum_{j_2 = j_1 + 1}^{m-l+2} \frac{q_{j_2}}{1 - q_{j_2}} \cdots \sum_{j_l = j_{l-1} + 1}^{m} \frac{q_{j_l}}{1 - q_{j_l}} \qquad (7)$$

where $q_j = 1 - P_k^j(\tau)$ is the probability of failure of component $j$, $j = 1, \ldots, m$, in the $k$th interval. In this calculation we assume that components are independent. Otherwise, the calculation would require knowledge of the system structure and a joint distribution of failure times, which is not considered here. Computation of $p_k(l)$ based on this equation is complicated, so a more efficient algorithm suggested by Barlow and Heidtmann [20], based on generating functions, can be used to reduce the computational complexity. The probability mass function of each component can be represented by the generating function

$$u_{jk}(z) = E_j\left(z^{I_{jk}} \mid X_0 = t\right) = P_k^j(\tau)\, z^0 + \left(1 - P_k^j(\tau)\right) z^1 = P_k^j(\tau) + \left(1 - P_k^j(\tau)\right) z,$$

where $I_{jk}$ is the indicator that component $j$ fails in the $k$th interval. Then $U_k(z) = \prod_{j=1}^{m} u_{jk}(z) = \sum_{l=0}^{m} p_k(l)\, z^l$ generates the distribution of $S_k$.

We are interested in finding $\sum_{l > N_{trh}} p_k(l)$ and $\sum_{l > N_{trh}} l\, p_k(l)$. Define

$$\bar{U}_k(z) = \sum_{l > N_{trh}} p_k(l)\, z^l = \sum_{l = N_{trh}+1}^{m} p_k(l)\, z^l. \qquad (8)$$

Then $\sum_{l > N_{trh}} p_k(l) = \bar{U}_k(1)$ and $\sum_{l > N_{trh}} l\, p_k(l) = \bar{U}'_k(1)$. Hence

$$E[c'_P I\{S_k > N_{trh}\}] + E[c''_P \max\{0, S_k - N_{trh}\}] = (c'_P - N_{trh} c''_P)\, \bar{U}_k(1) + c''_P\, \bar{U}'_k(1). \qquad (9)$$

Using Eq. (9) in Eq. (6) gives

$$E[CS_{((k-1)\tau,\,k\tau]}] = c_I + \sum_{j=1}^{m} c_j \left(1 - P_k^j(\tau)\right) + (c'_P - N_{trh} c''_P)\, \bar{U}_k(1) + c''_P\, \bar{U}'_k(1). \qquad (10)$$

Therefore, the expected cost incurred from all scheduled inspections over the cycle $T = n\tau$ is

$$\begin{aligned}
E[CS_T] &= \sum_{k=1}^{n} E[CS_{((k-1)\tau,\,k\tau]}] \\
&= n c_I + \sum_{k=1}^{n} \sum_{j=1}^{m} c_j \left(1 - P_k^j(\tau)\right) + \sum_{k=1}^{n} \left[(c'_P - N_{trh} c''_P)\, \bar{U}_k(1) + c''_P\, \bar{U}'_k(1)\right] \\
&= n c_I + \sum_{j=1}^{m} c_j \left(n - \sum_{k=1}^{n} P_k^j(\tau)\right) + (c'_P - N_{trh} c''_P) \sum_{k=1}^{n} \bar{U}_k(1) + c''_P \sum_{k=1}^{n} \bar{U}'_k(1) \\
&= n\left(c_I + \sum_{j=1}^{m} c_j\right) - \sum_{j=1}^{m} c_j\, \bar{P}_n^j(\tau) + (c'_P - N_{trh} c''_P) \sum_{k=1}^{n} \bar{U}_k(1) + c''_P \sum_{k=1}^{n} \bar{U}'_k(1) \qquad (11)
\end{aligned}$$

where $\bar{P}_n^j(\tau) = \sum_{k=1}^{n} P_k^j(\tau)$.
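The Barlow–Heidtmann generating-function computation above amounts to one polynomial multiplication per component. The following is a minimal sketch (not from the paper; the failure probabilities and helper names are illustrative) of computing the distribution of $S_k$ and the penalty of Eq. (9) in Python:

```python
def failure_count_pmf(q):
    """Coefficients p_k(l) of U_k(z) = prod_j [(1 - q_j) + q_j z],
    where q_j is the failure probability of component j in the interval."""
    pmf = [1.0]
    for qj in q:
        nxt = [0.0] * (len(pmf) + 1)
        for l, p in enumerate(pmf):
            nxt[l] += p * (1.0 - qj)   # component survives: power of z unchanged
            nxt[l + 1] += p * qj       # component fails: multiply by z
        pmf = nxt
    return pmf

def penalty(q, n_trh, cP1, cP2):
    """Expected penalty (c'_P - N_trh c''_P) U_bar(1) + c''_P U_bar'(1), Eq. (9)."""
    pmf = failure_count_pmf(q)
    u_bar_1 = sum(pmf[n_trh + 1:])                               # P(S_k > N_trh)
    u_bar_d1 = sum(l * p for l, p in enumerate(pmf) if l > n_trh)  # sum l p_k(l)
    return (cP1 - n_trh * cP2) * u_bar_1 + cP2 * u_bar_d1

# Hypothetical per-component failure probabilities q_j = 1 - P_k^j(tau):
q = [0.3, 0.5, 0.2, 0.4, 0.6]
expected_penalty = penalty(q, n_trh=3, cP1=1500.0, cP2=1000.0)
```

The convolution loop is exactly the product $\prod_j u_{jk}(z)$ expanded coefficient by coefficient, so it avoids the combinatorial sums of Eq. (7).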

This method is particularly useful when the components are similar or make almost the same contribution in reducing the whole system’s performance. In other cases, a weight can be assigned to each component to describe its contribution to the system performance, system production, etc. In this case, different combinations of failures can cause different penalty costs since both the occurrence of a failure and its individual influence on system performance or production are incorporated into the model through the weight. The combined influence of several failures can be considered as a weighted sum of the influence of all individual components, or it can be a more complicated function, or a value decided by managers. In this paper we consider additive downtime costs incurred for individual but simultaneous soft failures as their so-called joint failure consequence, yet the model can be revised to intensify the failure consequence or risk of simultaneous failures. In this case, if for example two soft failures take place simultaneously, their joint downtime consequence should be greater than the sum of their individual downtime consequences. For this variation of the model, possessing a good knowledge of the system, its structure,

S. Taghipour et al. / Reliability Engineering and System Safety 95 (2010) 944–952

component dependency, and their joint failure consequences is required. Moreover, we assume that soft failures cannot convert into hard failures, and hard and soft failures have independent consequences on the system. In other words, we do not consider the joint risk or failure consequence of soft and hard failures even though they may be related somehow, as suggested by one of the reviewers of the paper. For example, if both a protective component (component with soft failure) and a protected component (component with hard failure) fail simultaneously, our proposed model considers the consequence of only the soft failure, since it assumes that a hard failure is fixed instantaneously; however, the model can be extended to consider the joint failure consequence of soft and hard failures when they have some sort of dependency. Such failure dependency and joint risk and cost consequences have been extensively discussed in the risk assessment literature [21–24].

4.1. Numerical example

We use the same data given in the numerical example in Section 3 to demonstrate the model suggested in this section. The other parameters of the model are threshold $N_{trh} = 3$, $c_I = \$200$, $c'_P = \$1500$, and $c''_P = \$1000$. We will use Eq. (11) in the following form for convenience of comparing the values associated with different inspection frequencies $n$:

$$E[CS_T] = n c_I + \sum_{j=1}^{m} c_j \left(n - \bar{P}_n^j(\tau)\right) + c'_P \sum_{k=1}^{n} \bar{U}_{n,k}(1) + c''_P \left[\sum_{k=1}^{n} \bar{U}'_{n,k}(1) - N_{trh} \sum_{k=1}^{n} \bar{U}_{n,k}(1)\right].$$

To calculate $\bar{P}_n^j(\tau) = \sum_{k=1}^{n} P_{n,k}^j(\tau)$ for $n = 1, 2, 3, \ldots, 12$ we use $P_{n,k}^j(\tau)$, $k = 1, \ldots, n$. Dependence of $P_{n,k}^j(\tau)$ and $\bar{U}_{n,k}(z)$ on $n$ is emphasized in the notation. The $P_{n,k}^j(\tau)$ values are also used to find $\bar{U}_{n,k}(1)$ and $\bar{U}'_{n,k}(1)$. We first extract the terms with $z^4$ and $z^5$ from $U_{n,k}(z) = \prod_{j=1}^{5} \left[(1 - P_{n,k}^j(\tau))z + P_{n,k}^j(\tau)\right]$ and use them to create $\bar{U}_{n,k}(z)$. Then $\bar{U}_{n,k}(1)$ and $\bar{U}'_{n,k}(1)$ can be easily obtained. The value $t = 0$ is used in the calculation; that is, we assume the equipment is as good as new at the beginning of the cycle. The cost curve is shown in Fig. 4; $n^* = 1$ is the optimal inspection frequency, or to inspect once a year, with the minimum cost of \$3,759.82. Obviously, the optimal $n$ depends on the input parameters of the model. With the cost parameters given in this example, $n^* = 1$ is

the optimal policy. However, we would like to find conditions on the costs under which some other value of $n$, such as $n = 9$, is the optimal solution. Let

$$Q_1(n) = \sum_{j=1}^{m} c_j \left(n - \bar{P}_n^j(\tau)\right), \qquad Q_2(n) = \sum_{k=1}^{n} \bar{U}_{n,k}(1), \qquad Q_3(n) = \sum_{k=1}^{n} \bar{U}'_{n,k}(1) - N_{trh} \sum_{k=1}^{n} \bar{U}_{n,k}(1).$$

Then the expected cost for $n$ is $E[CS_T] = n c_I + Q_1(n) + c'_P Q_2(n) + c''_P Q_3(n)$, if we consider the costs $c_j$ fixed. Table 2 shows $n c_I$, $Q_1(n)$, $Q_2(n)$, and $Q_3(n)$ calculated for $n = 1, 2, 3, \ldots, 12$. For a particular $n$ to be better than $n'$, the following relationship between $c'_P$ and $c''_P$ should hold:

$$n c_I + Q_1(n) + c'_P Q_2(n) + c''_P Q_3(n) < n' c_I + Q_1(n') + c'_P Q_2(n') + c''_P Q_3(n').$$

For example, if $c''_P > 2171.314 - 0.456\, c'_P$, then $n = 9$ is a better solution than $n' = 1$. Similar inequalities can be obtained for other values of $n'$. For $c'_P = \$1500$, the previous inequality gives the lower limit $c''_P > \$1490$. It appears that when, e.g., $c''_P = \$1700$, $n = 9$ is indeed the optimal inspection frequency. The cost function for that case is shown in Fig. 5. Obviously, when the cost curve is flat near the optimal point, i.e., when there is no significant difference between the total cost of the optimal frequency and the other frequencies close to the optimal, we can select a frequency $n$ close to the optimal that is easier to implement practically. For instance, the cost for $n = 12$ may be considered close enough to the minimal cost, so $n = 12$ can be selected to be implemented in practice, or to inspect every month. If $n = 12$ is inconvenient for implementation, then $n = 1$, or only one inspection at the end of the cycle, may be the next choice. A compromise would be $n = 6$, or to inspect every 2 months.

Table 2
Calculated components of the cost function.

 n    n·cI   Q1(n)       Q2(n)    Q3(n)
 1     200    416.8154   0.9800   1.6730
 2     400    699.7925   1.3340   1.8328
 3     600    888.5510   1.2423   1.5699
 4     800   1029.8770   1.0732   1.2886
 5    1000   1141.4125   0.8986   1.0440
 6    1200   1232.2306   0.7453   0.8458
 7    1400   1307.6402   0.6180   0.6895
 8    1600   1371.9082   0.5154   0.5676
 9    1800   1426.6128   0.4320   0.4711
10    2000   1474.8199   0.3660   0.3940
11    2200   1516.8528   0.3110   0.3340
12    2400   1553.7449   0.2660   0.2850

Fig. 4. Total costs of scheduled inspections for different inspection frequencies (second model).

Comparing the values of $n c_I$, $Q_1(n)$, $Q_2(n)$, and $Q_3(n)$ given in Table 2 for different $n$ indicates that for some $n$ the values are always greater than for some other $n'$, and consequently no value
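The comparison inequality above can be rearranged into an explicit condition on $c''_P$. The following sketch (the row values are copied from Table 2; the helper name and layout are ours, not the paper's) recovers the condition under which $n = 9$ beats $n' = 1$:

```python
# (n c_I, Q1(n), Q2(n), Q3(n)) rows taken from Table 2
row1 = (200.0, 416.8154, 0.9800, 1.6730)    # n' = 1
row9 = (1800.0, 1426.6128, 0.4320, 0.4711)  # n  = 9

def dominance_condition(row_a, row_b):
    """Return (intercept, coef) such that row_a is cheaper than row_b
    whenever c''_P > intercept + coef * c'_P (valid when Q3_a < Q3_b)."""
    dfix = (row_a[0] + row_a[1]) - (row_b[0] + row_b[1])  # difference in n c_I + Q1
    dq2 = row_a[2] - row_b[2]
    dq3 = row_a[3] - row_b[3]
    assert dq3 < 0  # dividing by dq3 flips the inequality direction
    return dfix / (-dq3), dq2 / (-dq3)

intercept, coef = dominance_condition(row9, row1)
# Matches the paper's c''_P > 2171.314 - 0.456 c'_P up to rounding of the
# table entries; at c'_P = 1500 the lower limit is near the quoted $1490.
print(f"n=9 beats n=1 when c''_P > {intercept:.3f} + ({coef:.3f}) * c'_P")
```

The same function applied to other pairs of rows yields the "similar inequalities for other values of $n'$" mentioned in the text.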

for $c'_P$ and $c''_P$ can make them optimal. For instance, since all values for $n = 2$ are greater than the values for $n = 1$, $n = 2$ can never be optimal. Therefore, the somewhat "peculiar" behavior of the cost function for $n = 1$ in both Figs. 4 and 5 should not be a surprise in this example. In practical terms this means that, to decrease the expected penalty costs, it is better to leave type II failures unattended ($n = 1$) than to inspect them after 6 months ($n = 2$) and minimally fix them, because doing so only creates another not-small chance for them to fail again and incur a penalty due to exceeding the threshold. But using $n > 2$ may significantly reduce the chance of exceeding the threshold, and may then reduce the expected penalty cost.

In the second model (Eq. (11)) we need to calculate $\bar{U}_k(1)$ and $\bar{U}'_k(1)$ to obtain the expected penalty cost. However, this calculation can become very complicated as the number of components increases. Also, using a threshold may produce the situation discussed above, in which it is sometimes better not to fix failures. The model discussed in Section 3 suggests another version of the cost model whose optimization requires simpler calculation while taking into account the downtime of components.

5. The combined model

Eqs. (5) and (11) can also be combined to take into account both the number of failures exceeding the threshold and the elapsed times of failures, thus making the cost model more flexible. It should be noted that the "threshold" part of the model requires calculation of the distribution of the total number of failures, whereas the "downtime" part of the model requires only the marginal distributions of times to failure. We also assume that components are independent. In the combined model we have

$$E[CS_T] = n c_I + \sum_{j=1}^{m} c_j \left(n - \bar{P}_n^j(\tau)\right) + c'_P \sum_{k=1}^{n} \bar{U}_{n,k}(1) + c''_P \left[\sum_{k=1}^{n} \bar{U}'_{n,k}(1) - N_{trh} \sum_{k=1}^{n} \bar{U}_{n,k}(1)\right] + \sum_{k=1}^{n} \sum_{j=1}^{m} c'_j \left(\tau - e_k^j(\tau)\right).$$

Results of applying the combined model to the data given in the examples in Sections 3 and 4, together with the models considering the threshold and the elapsed times separately, are shown in Fig. 6. As can be seen, $n = 10$ with the total cost $E[CS_T] = \$5{,}571.84$ is the optimal solution for the combined model. Obviously, there is a significant difference between the costs for small $n$ and large $n$. The cost for $n = 1$ is still smaller than the cost for $n = 2$, but not significantly, due to the significant downtime cost for $n = 1$. For practical purposes, it may be simpler to use $n = 12$, i.e., to inspect every month, which makes only a minor difference from the optimal solution.

In general, the methodology for considering the influence of different failures can be set up by managers responsible for maintaining devices in a safe and reliable state. When information about the influence of individual failures and different combinations of failures is available, a threshold model with a weighted sum of the performance reduction or production loss can be employed. However, in many real situations, such as hospitals dealing with many different devices, the availability of this type of information is problematic. In this case, simpler and more practical methods, such

Fig. 5. Total costs of scheduled inspections for different inspection frequencies (second model). For penalty cost $c''_P = \$1700$, $n^* = 9$ is the optimal number of inspections when $c'_P = \$1500$.

Fig. 6. Total costs for different inspection frequencies (combined model).


as considering the total number of failures or the sum of the individual downtimes as the penalty, can be employed.
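The combined objective above is straightforward to evaluate once the per-interval probabilities and expected elapsed times are available. The sketch below (helper names and numerical inputs are ours, not the paper's) assumes `P[k][j]`, the probability that component $j$ does not fail in interval $k$ (the paper's $P_{n,k}^j(\tau)$), and `e[k][j]`, the expected elapsed-time quantity $e_k^j(\tau)$, are supplied by the paper's recursive procedure, which is not reproduced here:

```python
def interval_pmf(q):
    """PMF of the number of failures among independent components
    with failure probabilities q (generating-function convolution)."""
    pmf = [1.0]
    for qj in q:
        nxt = [0.0] * (len(pmf) + 1)
        for l, p in enumerate(pmf):
            nxt[l] += p * (1.0 - qj)
            nxt[l + 1] += p * qj
        pmf = nxt
    return pmf

def combined_cost(n, tau, cI, c, c_down, P, e, n_trh, cP1, cP2):
    """Expected cycle cost of the combined model (threshold + downtime)."""
    total = n * cI
    m = len(c)
    for k in range(n):
        q = [1.0 - P[k][j] for j in range(m)]
        total += sum(c[j] * q[j] for j in range(m))          # minimal repairs
        pmf = interval_pmf(q)
        total += cP1 * sum(pmf[n_trh + 1:])                  # threshold penalty
        total += cP2 * sum((l - n_trh) * pmf[l]
                           for l in range(n_trh + 1, m + 1)) # excess penalty
        total += sum(c_down[j] * (tau - e[k][j])
                     for j in range(m))                      # downtime penalty
    return total

# Illustrative call with made-up numbers (m = 2 components, n = 2 intervals):
cost = combined_cost(n=2, tau=0.5, cI=200.0, c=[50.0, 80.0],
                     c_down=[10.0, 20.0],
                     P=[[0.9, 0.8], [0.85, 0.7]],
                     e=[[0.45, 0.4], [0.4, 0.3]],
                     n_trh=1, cP1=1500.0, cP2=1000.0)
```

A minimum over $n = 1, \ldots, 12$ of such a function reproduces the kind of cost curves shown in Figs. 4–6.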

6. Concluding remarks

A complex repairable system consists of various components with different types of failure, hard and soft. The so-called hard failures are revealed, i.e., either they are self-announcing or the system stops when they take place. However, the system can continue functioning when soft failures occur. Both hard and soft failures should be incorporated in the general formulation of an inspection optimization model for a repairable system. Soft and hard failures are important in terms of the consequences that may result. Power plants, hospitals, and any other organizations for which the safety of instruments is a vital issue must consider carefully all outcomes of their systems' failures. Unrevealed failures may even end in a catastrophe if left unattended.

In this paper, we consider a policy in which components with hard failures are fixed instantaneously at failure, while components with soft failures are inspected at scheduled inspections and fixed if found failed. In our model, the cost incurred from hard failures does not contribute to deciding on the optimal inspection interval. However, the model can be extended to include opportunistic maintenance, in which hard failure instances provide an opportunity to also inspect and fix components with soft failures. Then both scheduled and unscheduled opportunistic inspection times should be incorporated in the model.

This paper proposes two basic variants of the model to find the optimal inspection frequency. The first variant takes into account the elapsed times from soft failures by considering downtime penalty costs for each component. The second variant sets up a threshold for the total number of soft failures and considers penalty costs for exceeding the threshold and for the number of failures that exceed the threshold. A combined model is also proposed to incorporate both the threshold on the number of soft failures and the elapsed times of failures.

In all variants of the model it is assumed that components are subject to minimal repairs if found failed. Due to the assumption of minimal repair, a nonhomogeneous Poisson process is used as the model for the failure processes. Since the times of soft failures are not known (they are interval censored by inspections), it was necessary to develop a recursive procedure to calculate probabilities of failures and expected downtimes, which makes calculation of the optimal inspection frequency computationally intensive. The procedure allows the initial age of each component to differ from zero at the beginning of the cycle, and therefore allows for incomplete repair at the beginning of the next cycle. In other words, some components can be renewed, while the ages of others can be reduced. Then the optimal inspection frequency can be calculated for the next cycle. The recursive procedure that deals with downtimes and minimal repairs in an NHPP is the main contribution of the paper.

Acknowledgements We acknowledge the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Ontario Centre of Excellence (OCE), and the C-MORE Consortium members for their financial support. We are thankful to the referees, whose constructive criticism and comments have helped us to improve the presentation of the paper and to augment our literature review with new references. Specifically, remarks concerning applications of the proposed model were very helpful.

References

[1] Ascher HE, Feingold H. Repairable systems reliability: modeling, inference, misconceptions and their causes. New York: Marcel Dekker; 1984.
[2] Meeker W, Escobar L. Statistical methods for reliability data. New York: Wiley; 1998.
[3] Taghipour S, Banjevic D, Jardine AKS. Reliability analysis of maintenance data for complex medical devices. Qual Reliab Eng Int 2010 [in press; published online at http://dx.doi.org/10.1002/qre.1084].
[4] Nakagawa T, Mizutani S. A summary of maintenance policies for a finite interval. Reliab Eng Syst Saf 2009;94:89–96.
[5] Nakagawa T. Advanced reliability models and maintenance policies. London: Springer; 2008.
[6] Nakagawa T. Maintenance theory of reliability. London: Springer; 2005.
[7] Dohi T, Kaio N, Osaki S. Preventive maintenance models: replacement, repair, ordering, and inspection. In: Pham H, editor. Handbook of reliability engineering. London: Springer; 2003. p. 367–95.
[8] Nakagawa T. Maintenance and optimum policy. In: Pham H, editor. Handbook of reliability engineering. London: Springer; 2003. p. 349–63.
[9] Wang H, Pham H. Reliability and optimal maintenance. London: Springer; 2006.
[10] Wang H. A survey of maintenance policies of deteriorating systems. Eur J Oper Res 2002;139:469–89.
[11] Pham H. Optimal imperfect maintenance models. In: Pham H, editor. Handbook of reliability engineering. London: Springer; 2003. p. 397–412.
[12] Pham H, Wang H. Imperfect maintenance. Eur J Oper Res 1996;94:425–38.
[13] Kijima M. Some results for repairable systems with general repair. J Appl Probab 1989;26(1):89–102.
[14] Pham H. Recent advances in reliability and quality in design. London: Springer; 2008.
[15] Christer AH. A review of delay time analysis for modelling plant maintenance. In: Osaki S, editor. Stochastic models in reliability and maintenance. London: Springer; 2002. p. 90–117.
[16] Wang W, Christer AH. Solution algorithms for a nonhomogeneous multi-component inspection model. Comput Oper Res 2003;30:19–34.
[17] Wang W. An inspection model for a process with two types of inspections and repairs. Reliab Eng Syst Saf 2009;94:526–33.
[18] Ross SM. Introduction to probability models. 9th ed. New York: Academic Press; 2007.
[19] Levitin G. Universal generating function in reliability analysis and optimization. New York: Springer-Verlag; 2005.
[20] Barlow RE, Heidtmann KD. Computing k-out-of-n system reliability. IEEE Trans Reliab 1984;R-33(4):322–3.
[21] Vaurio JK. Optimization of test and maintenance intervals based on risk and cost. Reliab Eng Syst Saf 1995;49:23–36.
[22] Vaurio JK. Common cause failure probabilities in standby safety system fault tree analysis with testing-scheme and timing dependencies. Reliab Eng Syst Saf 2003;79:43–57.
[23] Vaurio JK. Uncertainties and quantifications of common cause failure rates and probabilities for system analyses. Reliab Eng Syst Saf 2005;90:186–95.
[24] Yu H, Chu C, Châtelet E, Yalaoui F. Reliability optimization of a redundant system with failure dependencies. Reliab Eng Syst Saf 2007;92:1627–34.