Operations Research Letters 9 (1990) 203-209 North-Holland
May 1990

OPTIMAL m-FAILURE POLICIES WITH RANDOM REPAIR TIME
John G. WILSON
Department of Operations Research, Weatherhead School of Management, Case Western Reserve University, Cleveland, OH, USA

Ali BENMERZOUGA
Department of Operations Research, Weatherhead School of Management, Case Western Reserve University, Cleveland, OH, USA

Received July 1989
Revised October 1989
The failure times of $n$ machines are i.i.d. exponential random variables with parameter $\lambda$. This paper extends Assaf and Shanthikumar's (1987) repair and replacement model to the case where repair time is a nonnegative random variable. The behaviour of the optimal policy as a function of the cost parameters is investigated. The cost function is shown to be unimodal, and an easily implemented algorithm for finding optimal policies is developed.

Keywords: group maintenance, reliability
1. Introduction

Consider $n$ machines that are subject to random failures. The failure times are independent, identically distributed exponential random variables with parameter $\lambda$. Each failed machine may be repaired at any time and is considered as good as new once it has been repaired. The problem is to determine when to start repairing machines. Unlike Assaf and Shanthikumar (1987) and Okumoto and Elsayed (1983), repair is not assumed to be instantaneous. Instead, each failed machine takes $R$ units of time to be repaired, where $R$ is a nonnegative random variable with finite mean and density $f(\cdot)$.

In practice this generalization is of economic importance: an optimal policy for the instantaneous repair case can be decidedly nonoptimal when repair takes time. Of course, assuming that repair time is random introduces extra difficulties into the analysis. Specifically, one has to account for the possibility that other machines may fail while a machine is being repaired. The challenge is to model the problem in such a way that the assumption of nonzero repair time does not overly complicate the solution procedure.

Once repair has commenced, the repair crew keeps working, repairing the failed machines one at a time, until all $n$ machines are working. A fixed cost $c_0$ is incurred each time repair is initiated; this cost could reflect, for instance, the paperwork and travel costs incurred in bringing a repair crew to the factory. The cost of repairing an individual machine consists of two components, one fixed and the other time related. Specifically, $c_f$ is the fixed cost of repairing a machine and includes such costs as the price of replacement parts, while $c_r$ denotes the cost per unit time of repairing a machine; this latter cost could be interpreted as a labor cost. Each machine that fails accumulates downtime costs at a rate of $c_d$ per unit time.
0167-6377/90/\$3.50 © 1990 - Elsevier Science Publishers B.V. (North-Holland)
Volume 9, Number 3

2. Derivation of the objective function

The probability that an individual working machine will break down during a fixed period of length $r$ is given by $1 - e^{-r\lambda}$. Suppose that $i$ machines are broken. Then the number not working at the end of the period of length $R$ taken to fix the first machine equals $(i-1)$ + (number of remaining machines that fail during the period of length $R$). Hence $p_{ij}(r)$, the probability that $j$ machines are not working at the end of the period taken to fix the first machine, given that this machine takes $r$ time units to fix, that $i \ge 1$ machines are currently broken, and that repair is now starting, is given by
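$$p_{ij}(r) = \begin{cases} \dbinom{n-i}{j+1-i}\, e^{-(n-j-1)\lambda r}\left(1 - e^{-r\lambda}\right)^{j+1-i} & \text{for } i-1 \le j \le n-1,\\[4pt] 0 & \text{otherwise.} \end{cases}$$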
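As a quick check on the definition, here is a minimal Python sketch of $p_{ij}(r)$ (the function name `p_ij` and the numerical values are illustrative assumptions, not taken from the paper). By the binomial theorem each row sum $\sum_j p_{ij}(r)$ should equal one, which the loop prints:

```python
import math

def p_ij(i: int, j: int, r: float, n: int, lam: float) -> float:
    """P(j machines down when the first repair ends | i >= 1 down when it starts, repair takes r)."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i                                  # working machines that fail during the repair
    survive = math.exp(-(n - j - 1) * lam * r)     # the other n - j - 1 working machines all survive
    fail = (1.0 - math.exp(-lam * r)) ** m         # each of the m machines fails within r
    return math.comb(n - i, m) * survive * fail

n, lam, r = 5, 0.3, 2.0                            # illustrative values only
for i in range(1, n + 1):
    row_sum = sum(p_ij(i, j, r, n, lam) for j in range(n))
    print(f"i = {i}: sum over j of p_ij(r) = {row_sum:.12f}")
```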
Let $T_i$ denote the time required for all machines to be functioning given that $i \ge 1$ machines are currently broken and repair is about to commence. Then, conditioning on the time taken to fix the first broken machine, the expected value of $T_i$, $i \ge 1$, is given by
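$$E(T_i) = \int \sum_{j=i-1}^{n-1} p_{ij}(r)\left[r + E(T_j)\right] f(r)\,dr = E(R) + \sum_{j=i-1}^{n-1} b_{ij}\,E(T_j), \tag{1}$$

where

$$b_{ij} = \int p_{ij}(r)\, f(r)\,dr \qquad\text{and}\qquad E(T_0) = 0.$$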
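When the repair time is deterministic, $R \equiv r$, the integral defining $b_{ij}$ collapses to $b_{ij} = p_{ij}(r)$ and (1) becomes an $n \times n$ linear system $(I - B)\,t = E(R)\,\mathbf{1}$ in $t = (E(T_1), \ldots, E(T_n))$. The sketch below assembles and solves that system with a small Gaussian elimination routine. The helper names are hypothetical and deterministic repair is an illustrative assumption; for random $R$ one would substitute numerically integrated $b_{ij}$ (a library solver such as `numpy.linalg.solve` would serve equally well).

```python
import math

def b_det(i, j, r, n, lam):
    """b_ij when R = r with probability one: the integral reduces to p_ij(r)."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i
    return math.comb(n - i, m) * math.exp(-(n - j - 1) * lam * r) * (1.0 - math.exp(-lam * r)) ** m

def gauss_solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting (a is not modified)."""
    m = len(a)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(m):
        piv = max(range(c, m), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(c + 1, m):
            factor = aug[k][c] / aug[c][c]
            for col in range(c, m + 1):
                aug[k][col] -= factor * aug[c][col]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        x[c] = (aug[c][m] - sum(aug[c][col] * x[col] for col in range(c + 1, m))) / aug[c][c]
    return x

def expected_T(n, lam, r):
    """[E(T_1), ..., E(T_n)] from (1) with E(R) = r; the j = 0 term vanishes because E(T_0) = 0."""
    idx = range(1, n + 1)
    a = [[(1.0 if i == j else 0.0) - b_det(i, j, r, n, lam) for j in idx] for i in idx]
    return gauss_solve(a, [r] * n)

print(expected_T(4, 0.1, 1.5))
```

Row $i = n$ of (1) reads $E(T_n) = E(R) + E(T_{n-1})$, since $b_{n,n-1} = 1$, which gives an easy consistency check on the output.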
Let $D_i$ denote the total downtime cost incurred until all machines are working if repair is about to commence and there are currently $i \ge 1$ broken machines. Suppose that the first machine takes $r$ time units to repair. Then the downtime cost during the period of length $r$ consists of two components: the downtime cost incurred from machines that break down within the next $r$ time units, and $i c_d r$ for the $i$ machines broken at the beginning of the period. Given that $j - i + 1$ of the $n - i$ working machines break down, the expected downtime cost over the next $r$ time units from these machines is given by
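$$(j-i+1)\,c_d \int_0^r P\left[T \le t \mid T \le r\right] dt = (j-i+1)\,c_d\left[\frac{r}{1 - e^{-\lambda r}} - \frac{1}{\lambda}\right],$$

where $T$ is an exponential random variable with parameter $\lambda$. Hence the expected value of $D_i$, $i \ge 1$, is given by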
$$E(D_i) = \int \sum_{j=i-1}^{n-1} p_{ij}(r)\left\{ i c_d r + (j-i+1)\,c_d\left[\frac{r}{1 - e^{-\lambda r}} - \frac{1}{\lambda}\right] + E(D_j) \right\} f(r)\,dr.$$
(Note that the downtime costs incurred after the present period depend only on the number of failed machines and not on the length of the first repair period.) On noting that $\sum_j (j-i+1)\,p_{ij}(r) = (n-i)\left(1 - e^{-r\lambda}\right)$ and simplifying, the above can be written as

$$E(D_i) = c_d n E(R) - c_d (n-i)\lambda^{-1} + c_d (n-i)\lambda^{-1} E\left(e^{-\lambda R}\right) + \sum_{j=i-1}^{n-1} b_{ij}\,E(D_j). \tag{2}$$
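Equation (2) has the same coefficient matrix as (1) and differs only in its constant terms, so the same elimination applies. A sketch for deterministic $R \equiv r$ (an assumption made for illustration, so that $E(R) = r$ and $E(e^{-\lambda R}) = e^{-\lambda r}$), with hypothetical helper names:

```python
import math

def b_det(i, j, r, n, lam):
    """b_ij for deterministic repair time R = r, i.e. p_ij(r)."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i
    return math.comb(n - i, m) * math.exp(-(n - j - 1) * lam * r) * (1.0 - math.exp(-lam * r)) ** m

def gauss_solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting (a is not modified)."""
    m = len(a)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(m):
        piv = max(range(c, m), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(c + 1, m):
            factor = aug[k][c] / aug[c][c]
            for col in range(c, m + 1):
                aug[k][col] -= factor * aug[c][col]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        x[c] = (aug[c][m] - sum(aug[c][col] * x[col] for col in range(c + 1, m))) / aug[c][c]
    return x

def expected_D(n, lam, r, c_d):
    """[E(D_1), ..., E(D_n)] from (2) with E(R) = r and E(exp(-lam R)) = exp(-lam r)."""
    idx = range(1, n + 1)
    a = [[(1.0 if i == j else 0.0) - b_det(i, j, r, n, lam) for j in idx] for i in idx]
    # constant term of (2): c_d [n E(R) - (n - i)/lam + (n - i) E(exp(-lam R))/lam]
    rhs = [c_d * (n * r - (n - i) / lam + (n - i) * math.exp(-lam * r) / lam) for i in idx]
    return gauss_solve(a, rhs)

print(expected_D(4, 0.1, 1.5, c_d=0.87))
```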
Suppose a policy of starting repair when $i$ units are broken (the $i$-policy) is followed. Then the length of the renewal period is the time for $i$ machines to fail plus the time taken for all machines to be repaired once repair has started. While $k$ machines are down, the next failure occurs after an exponential time with rate $(n-k)\lambda$, so the expected time to $i$ failures is given by

$$\beta(i) = \sum_{k=0}^{i-1} \frac{1}{(n-k)\lambda}. \tag{3}$$

Hence the expected length of the renewal period is $\beta(i) + E(T_i)$. The expected downtime incurred while waiting for $i$ machines to fail is given by

$$\gamma(i) = \sum_{k=0}^{i-1} \frac{k}{(n-k)\lambda}. \tag{4}$$
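Equations (3) and (4) are simple finite sums. A sketch using exact rational arithmetic (the function names are our own); for $n = 3$ and $\lambda = 1$ it gives $\beta(3) = 11/6$ and $\gamma(3) = 5/2$:

```python
from fractions import Fraction

def beta(i, n, lam):
    """Eq. (3): with k machines down, the next failure takes 1/((n - k) lam) on average."""
    return sum(1 / ((n - k) * lam) for k in range(i))

def gamma_wait(i, n, lam):
    """Eq. (4): the k machines already down accrue downtime until the next failure."""
    return sum(k / ((n - k) * lam) for k in range(i))

print(beta(3, 3, Fraction(1)), gamma_wait(3, 3, Fraction(1)))   # 11/6 5/2
```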
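Using the renewal reward theorem, the long-run cost per unit time of following the $i$-policy is given by

$$K(i) = \frac{c_0 + E(D_i) + c_d\,\gamma(i) + \left[c_r + c_f/E(R)\right]E(T_i)}{\beta(i) + E(T_i)}. \tag{5}$$

(The repair crew is busy throughout the period $T_i$, so by Wald's identity the expected number of machines repaired per cycle is $E(T_i)/E(R)$.)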
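Putting (1)-(5) together, the following sketch evaluates $K(i)$ for every $i$ and picks the minimiser by brute force. It assumes deterministic repair time $R \equiv r$, so that $b_{ij} = p_{ij}(r)$; the helper names are hypothetical, and the cost values in the usage line are borrowed from the numerical example in Section 5 but paired with deterministic rather than gamma repair times, so the printed figure is not a result reported in the paper.

```python
import math

def b_det(i, j, r, n, lam):
    """b_ij for deterministic repair time R = r, i.e. p_ij(r)."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i
    return math.comb(n - i, m) * math.exp(-(n - j - 1) * lam * r) * (1.0 - math.exp(-lam * r)) ** m

def gauss_solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting (a is not modified)."""
    m = len(a)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(m):
        piv = max(range(c, m), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(c + 1, m):
            factor = aug[k][c] / aug[c][c]
            for col in range(c, m + 1):
                aug[k][col] -= factor * aug[c][col]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        x[c] = (aug[c][m] - sum(aug[c][col] * x[col] for col in range(c + 1, m))) / aug[c][c]
    return x

def long_run_costs(n, lam, r, c0, cd, cf, cr):
    """{i: K(i)} from (5), assuming deterministic repair time R = r (so E(R) = r)."""
    idx = range(1, n + 1)
    a = [[(1.0 if i == j else 0.0) - b_det(i, j, r, n, lam) for j in idx] for i in idx]
    T = gauss_solve(a, [r] * n)                                                  # eq. (1)
    D = gauss_solve(a, [cd * (n * r - (n - i) * (1.0 - math.exp(-lam * r)) / lam)
                        for i in idx])                                           # eq. (2)
    K = {}
    for i in idx:
        beta = sum(1.0 / ((n - k) * lam) for k in range(i))                     # eq. (3)
        gam = sum(k / ((n - k) * lam) for k in range(i))                        # eq. (4)
        cycle_cost = c0 + D[i - 1] + cd * gam + (cr + cf / r) * T[i - 1]
        K[i] = cycle_cost / (beta + T[i - 1])                                    # eq. (5)
    return K

K = long_run_costs(n=10, lam=0.1, r=2.0, c0=20.0, cd=0.87, cf=1.0, cr=2.0)
best = min(K, key=K.get)
print(f"optimal i = {best}, K = {K[best]:.4f}")
```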
Hence, the objective is to find $\min_i K(i)$ and the $i$ at which this occurs. The quantities in (5) can be found from expressions (1)-(4). An important special case occurs when the repair time is deterministic. In this case all of the previous integrals are degenerate, so that $b_{ij} = p_{ij}(r)$ and the coefficients of the linear equations are easily determined. As an illustration, the next example contains explicit formulas for the three-machine case where the repair time is deterministic.

Example. Suppose that $n = 3$ and each repair takes a fixed $r$ time units for completion. Let $p = e^{-r\lambda}$ and $q = 1 - e^{-r\lambda}$. Then on solving the equations given in (1):
$$E(T_1) = (q^2 + p)\,p^{-3} r, \qquad E(T_2) = (p^2 + q^2 + p)\,p^{-3} r, \qquad E(T_3) = (p^3 + p^2 + q^2 + p)\,p^{-3} r.$$
From expression (2), the values of $E(D_i)$ are as follows:

$$E(D_1) = (2p^2 + 3q^2 + 2pq + p)\,p^{-3} c_d r - (q^3 + 2p^2 q + 2q^2 p)\,p^{-3}\lambda^{-1} c_d,$$

$$E(D_2) = (5p^2 + 3q^2 + 2pq + p)\,p^{-3} c_d r - (q^3 + 3p^2 q + 2pq^2)\,p^{-3}\lambda^{-1} c_d,$$

$$E(D_3) = (3p^3 + 5p^2 + 3q^2 + 2pq + p)\,p^{-3} c_d r - (q^3 + 3p^2 q + 2pq^2)\,p^{-3}\lambda^{-1} c_d.$$
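These closed forms can be checked numerically by substituting them back into (1) and (2) with $n = 3$, where $b_{10} = p^2$, $b_{11} = 2pq$, $b_{12} = q^2$, $b_{21} = p$, $b_{22} = q$ and $b_{32} = 1$. A minimal sketch (function names and parameter values are arbitrary assumptions); the printed maximum residual should be at rounding-error level:

```python
import math

def n3_closed_forms(lam, r, cd):
    """E(T_i) and E(D_i), i = 1, 2, 3, from the closed forms above (deterministic R = r)."""
    p = math.exp(-r * lam)
    q = 1.0 - p
    T = [(q**2 + p) / p**3 * r,
         (p**2 + q**2 + p) / p**3 * r,
         (p**3 + p**2 + q**2 + p) / p**3 * r]
    D = [(2*p**2 + 3*q**2 + 2*p*q + p) / p**3 * cd * r - (q**3 + 2*p**2*q + 2*q**2*p) / p**3 * cd / lam,
         (5*p**2 + 3*q**2 + 2*p*q + p) / p**3 * cd * r - (q**3 + 3*p**2*q + 2*p*q**2) / p**3 * cd / lam,
         (3*p**3 + 5*p**2 + 3*q**2 + 2*p*q + p) / p**3 * cd * r
         - (q**3 + 3*p**2*q + 2*p*q**2) / p**3 * cd / lam]
    return p, q, T, D

def residuals(lam, r, cd):
    """Left side minus right side of (1) and (2) for n = 3, at the closed-form values."""
    p, q, T, D = n3_closed_forms(lam, r, cd)
    # nonzero b_ij with j >= 1 (the b_10 = p^2 term multiplies E(T_0) = E(D_0) = 0)
    B = {1: {1: 2*p*q, 2: q**2}, 2: {1: p, 2: q}, 3: {2: 1.0}}
    out = []
    for i in (1, 2, 3):
        const_T = r                                   # E(R)
        const_D = cd * (3 * r - (3 - i) * q / lam)    # c_d [n E(R) - (n - i)(1 - E(e^{-lam R}))/lam]
        out.append(T[i - 1] - const_T - sum(b * T[j - 1] for j, b in B[i].items()))
        out.append(D[i - 1] - const_D - sum(b * D[j - 1] for j, b in B[i].items()))
    return out

print(max(abs(e) for e in residuals(lam=0.1, r=2.0, cd=0.87)))
```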
Using the above and some algebraic manipulation, the values of $K(i)$ are given by

$$K(1) = \left\{\lambda c_0 + c_d + \phi\left(e^{3r\lambda} - e^{2r\lambda} + e^{r\lambda}\right)\right\}\left\{\tfrac{1}{3} + \lambda r\left(e^{3r\lambda} - e^{2r\lambda} + e^{r\lambda}\right)\right\}^{-1},$$

$$K(2) = \left\{\lambda c_0 + \tfrac{5}{2}c_d + \phi\left(e^{3r\lambda} - e^{2r\lambda} + 2e^{r\lambda}\right)\right\}\left\{\tfrac{5}{6} + \lambda r\left(e^{3r\lambda} - e^{2r\lambda} + 2e^{r\lambda}\right)\right\}^{-1},$$

$$K(3) = \left\{\lambda c_0 + \tfrac{11}{2}c_d + \phi\left(e^{3r\lambda} - e^{2r\lambda} + 2e^{r\lambda} + 1\right)\right\}\left\{\tfrac{11}{6} + \lambda r\left(e^{3r\lambda} - e^{2r\lambda} + 2e^{r\lambda} + 1\right)\right\}^{-1},$$

where $\phi = r\lambda(3c_d + c_r) + \lambda c_f - c_d$. Assume that $\min\{K(1), K(2), K(3)\} < 3c_d$, since otherwise it would not be profitable to operate the system (leaving all three machines broken costs $3c_d$ per unit time). By appropriately manipulating the $K(i)$, the following decision rule can be obtained: wait for all machines to fail if

$$c_d \le \lambda(c_f + r c_r) + 3r\lambda c_0\,\frac{2 + 2\lambda\left[E(T_3) - E(T_2)\right]}{11E(T_2) - 5E(T_3)};$$

start repairing as soon as any machine fails if

$$c_d > \lambda(c_f + r c_r) + 3r\lambda c_0\,\frac{1 + 2\lambda\left[E(T_2) - E(T_1)\right]}{5E(T_1) - 2E(T_2)};$$

start repairing as soon as exactly two machines fail if $c_d$ satisfies neither of the previous two inequalities. For this example the decision rule indicates that the optimal policy is a nondecreasing function of $c_0$, $c_f$, $c_r$ and a nonincreasing function of $c_d$. It will be shown in Section 4 that these statements remain true for the general problem.
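The decision rule can be sanity-checked against a direct comparison of $K(1)$, $K(2)$ and $K(3)$. The sketch below codes both from the closed forms above (the function names and the parameter grid are illustrative assumptions) and counts any disagreement among profitable cases, i.e. those with $\min_i K(i) < 3c_d$, skipping near-ties:

```python
import itertools
import math

def K_n3(lam, r, c0, cd, cf, cr):
    """[K(1), K(2), K(3)] from the closed forms above (n = 3, deterministic repair time r)."""
    x = math.exp(r * lam)
    phi = r * lam * (3 * cd + cr) + lam * cf - cd
    s = [x**3 - x**2 + x, x**3 - x**2 + 2 * x, x**3 - x**2 + 2 * x + 1]   # s_i = E(T_i) / r
    a = [1.0, 2.5, 5.5]            # coefficients of c_d in the numerators
    d = [1 / 3, 5 / 6, 11 / 6]     # lam * beta(i)
    return [(lam * c0 + a[k] * cd + phi * s[k]) / (d[k] + lam * r * s[k]) for k in range(3)]

def decision_rule(lam, r, c0, cd, cf, cr):
    """Policy (1, 2 or 3) chosen by the two threshold inequalities above."""
    x = math.exp(r * lam)
    T1, T2, T3 = r * (x**3 - x**2 + x), r * (x**3 - x**2 + 2 * x), r * (x**3 - x**2 + 2 * x + 1)
    base = lam * (cf + r * cr)
    if cd <= base + 3 * r * lam * c0 * (2 + 2 * lam * (T3 - T2)) / (11 * T2 - 5 * T3):
        return 3
    if cd > base + 3 * r * lam * c0 * (1 + 2 * lam * (T2 - T1)) / (5 * T1 - 2 * T2):
        return 1
    return 2

checked = mismatches = 0
for lam, r, c0, cd in itertools.product([0.1, 0.5], [0.5, 2.0], [1.0, 20.0], [0.2, 1.0, 3.0, 10.0]):
    K = K_n3(lam, r, c0, cd, cf=0.5, cr=1.0)
    ranked = sorted(K)
    if ranked[0] >= 3 * cd or ranked[1] - ranked[0] < 1e-9:
        continue                   # unprofitable case or (near-)tie: skip
    checked += 1
    if decision_rule(lam, r, c0, cd, cf=0.5, cr=1.0) != K.index(ranked[0]) + 1:
        mismatches += 1
print("checked:", checked, "mismatches:", mismatches)
```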
3. Unimodality of the objective function

First some new notation will be introduced. Suppose there are exactly $j+1$ machines in the system and one of them has failed. Define $E_j$ to be the expected time for all $j+1$ machines to be working if repair commences immediately. Then the following relationship between the variables $T_k$ for the $n$-machine problem and the $E_j$ for the $(j+1)$-machine problems must hold:

$$E(T_k) = \sum_{j=1}^{k} E_{n-j}. \tag{6}$$

(The time needed for the number of failed machines to fall from $k$ to $k-1$ has the same distribution as the time needed to restore an $(n-k+1)$-machine system in which one machine has failed, so its mean is $E_{n-k}$.)
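Relation (6) is easy to confirm numerically: solve (1) for the $n$-machine system and, separately, compute each $E_j$ as $E(T_1)$ of a $(j+1)$-machine system. The sketch assumes deterministic repair time $R \equiv r$ and solves (1) by Gaussian elimination; all names and parameter values are hypothetical.

```python
import math

def b_det(i, j, r, n, lam):
    """b_ij for deterministic repair time R = r, i.e. p_ij(r)."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i
    return math.comb(n - i, m) * math.exp(-(n - j - 1) * lam * r) * (1.0 - math.exp(-lam * r)) ** m

def gauss_solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting (a is not modified)."""
    m = len(a)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(m):
        piv = max(range(c, m), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(c + 1, m):
            factor = aug[k][c] / aug[c][c]
            for col in range(c, m + 1):
                aug[k][col] -= factor * aug[c][col]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        x[c] = (aug[c][m] - sum(aug[c][col] * x[col] for col in range(c + 1, m))) / aug[c][c]
    return x

def expected_T(n, lam, r):
    """[E(T_1), ..., E(T_n)] for an n-machine system, from (1)."""
    idx = range(1, n + 1)
    a = [[(1.0 if i == j else 0.0) - b_det(i, j, r, n, lam) for j in idx] for i in idx]
    return gauss_solve(a, [r] * n)

n, lam, r = 6, 0.15, 1.2                                          # illustrative values
T = expected_T(n, lam, r)                                         # n-machine system
E = {m - 1: expected_T(m, lam, r)[0] for m in range(1, n + 1)}    # E_j = E(T_1) with j + 1 machines
for k in range(1, n + 1):
    rhs = sum(E[n - j] for j in range(1, k + 1))                  # right side of (6)
    print(f"k = {k}: E(T_k) = {T[k - 1]:.6f}, sum of E_(n-j) = {rhs:.6f}")
```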
Define $C(i)$ to be the numerator in (5), i.e. $C(i)$ is the expected cost during a cycle if the $i$-policy is followed. Note that policy $i$ is better than policy $k$ if and only if

$$C(k)\left[\beta(k) + E(T_k)\right]^{-1} > C(i)\left[\beta(i) + E(T_i)\right]^{-1},$$

which can be rewritten as

$$C(k) - C(i) > K(i)\left[\beta(k) + E(T_k) - \beta(i) - E(T_i)\right]. \tag{7}$$
First it will be shown that if policy $i$ is better than policy $i+1$, then it must be better than all policies $k$ with $k \ge i+1$. The proof will be by induction. By assumption the statement is true for $k = i+1$. Assume that (7) is true for some $k \ge i+1$. The goal is to show that (7) is then true for $k+1$. Using the induction hypothesis:
$$C(k+1) - C(i) = \left[C(k+1) - C(k)\right] + \left[C(k) - C(i)\right] > C(k+1) - C(k) + K(i)\left[\beta(k) + E(T_k) - \beta(i) - E(T_i)\right].$$

Comparing the right hand side of the above expression with the right hand side of (7) with $k$ set equal to $k+1$, it is clear that the proof will be complete if it can be shown that

$$C(k+1) - C(k) > K(i)\left[\beta(k+1) + E(T_{k+1}) - \beta(k) - E(T_k)\right]. \tag{8}$$
If one waits for $k+1$ machines to fail rather than $k$ before initiating repair, the extra costs consist of: the downtime costs incurred from $k$ failed machines while waiting for another machine to fail; the downtime cost incurred from having one extra failed machine during the period $T_k$; and the total cost, excluding the fixed cost, of bringing a system with one failed machine to a system with all $n$ machines working. Therefore the left hand side of (8) can be written as
$$C(k+1) - C(k) = k c_d\left[\beta(k+1) - \beta(k)\right] + c_d E(T_k) + \left[C(1) - c_0\right]. \tag{9}$$
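Now use (3), (6) and (9), which give $\beta(k+1) - \beta(k) = [(n-k)\lambda]^{-1}$ and $E(T_{k+1}) - E(T_k) = E_{n-k-1}$, to see that (8) is equivalent to

$$(n-k)^{-1}\lambda^{-1}\left[k c_d - K(i)\right] + c_d E(T_k) + C(1) - c_0 > K(i)\,E_{n-k-1}. \tag{10}$$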
Since policy $i$ is assumed to be better than policy $i+1$, expression (7) is true when $k$ is set equal to $i+1$. Comparing this with (8), it is clear that (8), and hence (10), are true for $k = i$. The left hand side of (10) increases in $k$ while the right hand side decreases. Therefore (10) is true for all $k \ge i+1$ and the proof is complete. Now suppose that action $i$ is better than action $i-1$. The goal is to show that action $i$ is then better than action $k$ for all $k \le i-1$. Again the proof will be by induction. By assumption the statement is true for $k = i-1$. Assume it is true for some $k \le i-1$. The goal is to show that (7) holds for $k-1$. Proceeding as before, the proof will be complete if it can be shown that
$$C(k-1) - C(k) > K(i)\left[\beta(k-1) + E(T_{k-1}) - \beta(k) - E(T_k)\right],$$

which can be written as

$$(n-k+1)^{-1}\lambda^{-1}\left[(k-1)c_d - K(i)\right] + c_d E(T_{k-1}) + C(1) - c_0 < K(i)\,E_{n-k}. \tag{11}$$
By assumption, the above expression is satisfied at $k = i$. The proof is completed on noting that, as $k$ decreases, the left hand side of (11) decreases while the right hand side increases. From the above, if policy $i$ is better than policies $i-1$ and $i+1$, it must be an optimal policy. Thus the function $K(i)$ can turn no more than once, i.e. it is unimodal in $i$.
4. Behavior of optimal actions as a function of the cost parameters

In this section it will be shown that the optimal policy is a nondecreasing function of $c_0$, $c_f$, $c_r$ and a nonincreasing function of $c_d$. Let $K(i)$ denote the objective function when the cost parameters equal $c_0$, $c_d$, $c_r$ and $c_f$. Let $K_\delta(i)$ denote the cost function if the fixed cost $c_0$ increases to $c_0 + \delta$ while all other costs are held fixed. Then, using (5),

$$K_\delta(i) = K(i) + \delta\left[\beta(i) + E(T_i)\right]^{-1}. \tag{12}$$
Suppose $i^*$ is optimal for the problem with fixed cost $c_0$. Then (12), the fact that the last term on the right hand side of (12) is a decreasing function of $i$ (because $\beta(i) + E(T_i)$ increases with $i$), and the assumption that $K(i) \ge K(i^*)$ for all $i$ together show that $K_\delta(i) \ge K_\delta(i^*)$ for all $i \le i^*$. Thus the optimal policy when the fixed cost is $c_0 + \delta$ can be no smaller than the optimal policy for the problem with fixed cost equal to $c_0$, i.e. the optimal policy is a nondecreasing function of $c_0$. Now the relationship between optimal policies and the downtime cost $c_d$ will be investigated. The goal is to show that if policy $i-1$ is better than policy $i$ when the downtime cost rate is $c_d$, then it is also better when the downtime cost rate is $c_d + \delta$, where $\delta \ge 0$. This, combined with the unimodality of the cost function, will show that the optimal policy is a nonincreasing function of $c_d$. Policy $i-1$ is better than policy $i$ if and only if
$$\frac{c_0 + \left[c_r + c_f/E(R)\right]E(T_{i-1})}{\beta(i-1) + E(T_{i-1})} - \frac{c_0 + \left[c_r + c_f/E(R)\right]E(T_i)}{\beta(i) + E(T_i)} < c_d\left\{\frac{E(X_i)}{\beta(i) + E(T_i)} - \frac{E(X_{i-1})}{\beta(i-1) + E(T_{i-1})}\right\}, \tag{13}$$
where $X_j$ denotes the sum of the downtimes of all machines during a cycle in which the $j$-policy is followed. Assume that (13) is true. The goal is to show that this implies that (13) must be true when $c_d$ is replaced by $c_d + \delta$, where $\delta \ge 0$. If the right hand side of (13) were negative, then, since the left hand side does not depend on $c_d$, (13) would also hold for all downtime cost rates between 0 and $c_d$. This leads to a contradiction, since policy $i-1$ can never be better than policy $i$ if the downtime cost rate is 0. Therefore the right hand side of (13) must be positive, which implies that (13) also holds whenever $c_d$ is replaced by any larger value. Now the effect of varying $c_r$ and $c_f$ will be considered. Policy $i$ is at least as good as policy $i-1$ if and only if
$$\frac{c_0 + c_d E(X_{i-1})}{\beta(i-1) + E(T_{i-1})} - \frac{c_0 + c_d E(X_i)}{\beta(i) + E(T_i)} \ge \left(c_r + c_f/E(R)\right)\left\{\frac{E(T_i)}{\beta(i) + E(T_i)} - \frac{E(T_{i-1})}{\beta(i-1) + E(T_{i-1})}\right\}. \tag{14}$$
Assume that (14) is true. The goal is to show that (14) remains true whenever $c_r$ and $c_f$ are replaced with larger values. If the right hand side of (14) is negative then the result clearly follows. So suppose the right hand side of (14) is positive. In that case, if (14) is not true when $c_r$ and $c_f$ are replaced by $c_r + \delta_1$ and $c_f + \delta_2$, for some nonnegative $\delta_1$ and $\delta_2$, then
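$$\frac{c_0 + c_d E(X_{i-1})}{\beta(i-1) + E(T_{i-1})} - \frac{c_0 + c_d E(X_i)}{\beta(i) + E(T_i)} < \left(x + y/E(R)\right)\left\{\frac{E(T_i)}{\beta(i) + E(T_i)} - \frac{E(T_{i-1})}{\beta(i-1) + E(T_{i-1})}\right\} \tag{15}$$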
for all $x > c_r + \delta_1$ and for all $y > c_f + \delta_2$. This, however, is a contradiction, since for sufficiently large repair costs it must always be the case that policy $i$ is better than policy $i-1$. Thus the optimal action is a nondecreasing function of the parameters $c_r$ and $c_f$.
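For the three-machine example of Section 2 these monotonicity properties can be observed directly. The sketch below is an illustration under assumed parameter values: it reuses the closed-form $K(i)$ for $n = 3$ with deterministic repair time, sweeps $c_d$ and $c_0$, and records the optimal policy. By the results above, the first sequence should be nonincreasing and the second nondecreasing.

```python
import math

def K_n3(lam, r, c0, cd, cf, cr):
    """[K(1), K(2), K(3)] from the closed forms of Section 2 (n = 3, deterministic repair time r)."""
    x = math.exp(r * lam)
    phi = r * lam * (3 * cd + cr) + lam * cf - cd
    s = [x**3 - x**2 + x, x**3 - x**2 + 2 * x, x**3 - x**2 + 2 * x + 1]
    a = [1.0, 2.5, 5.5]
    d = [1 / 3, 5 / 6, 11 / 6]
    return [(lam * c0 + a[k] * cd + phi * s[k]) / (d[k] + lam * r * s[k]) for k in range(3)]

def best_policy(**params):
    """Index i (1, 2 or 3) minimising K(i)."""
    K = K_n3(**params)
    return K.index(min(K)) + 1

base = dict(lam=0.2, r=1.0, c0=10.0, cd=1.0, cf=0.5, cr=1.0)     # illustrative values
by_cd = [best_policy(**{**base, "cd": 0.25 * k}) for k in range(1, 81)]
by_c0 = [best_policy(**{**base, "c0": 0.5 * k}) for k in range(1, 81)]
print("i* as c_d increases:", by_cd)
print("i* as c_0 increases:", by_c0)
```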
5. An algorithm for computing optimal policies
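Step 1. Compute $E[R]$, $E[e^{-\lambda R}]$ and $b_{ij}$.

Step 2. Set $f_1 = 1$ and compute $f_2, \ldots, f_n$ from the following recursion:

$$f_{k+1} = \frac{1}{b_{k+1,k}}\left\{(b_{k,k} - 1) f_k - \sum_{i=1}^{k-1} (-1)^{k-i+1}\, b_{i,k} f_i\right\}.$$

Then

$$E(T_n) = f_n^{-1} E(R) \sum_{i=1}^{n} (-1)^{n-i} f_i \tag{16}$$

and

$$E(D_n) = f_n^{-1} c_d \sum_{i=1}^{n} (-1)^{n-i} f_i \left\{ n E(R) + \lambda^{-1}(n-i)\left[E\left(e^{-\lambda R}\right) - 1\right] \right\}. \tag{17}$$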
Step 3. Using (1) through (5) and the above expressions for $E(T_n)$ and $E(D_n)$, sequentially compute $K(n), K(n-1), \ldots$ until $K(i-1) > K(i)$ for some $i$. This value of $i$ is optimal. (If $K(i-1) \le K(i)$ for all $i$, then the 1-policy is optimal.)

Expressions (16) and (17) above are simply the results of solving (1) and (2) for $E(T_n)$ and $E(D_n)$ by successively eliminating the equations for $i = 1, 2, \ldots$. Once $E(T_i), \ldots, E(T_n)$ are known, equation (1) for index $i$ can be solved for $E(T_{i-1})$, and likewise (2) gives $E(D_{i-1})$. From the unimodality result of Section 3, the above algorithm will always work. Each step of the algorithm requires little extra computation, since the values of $E(D_i)$ and $E(T_i)$ will be retained from previous steps. An algorithm that does not search through policies according to their order will be inefficient, since calculation of the cost function at a given policy requires knowledge of the expected downtimes and times to completion of repair for all policies greater than the one being considered.

[Fig. 1. Optimal cost vs. cost of 7-policy. Horizontal axis: the expected repair time.]
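A sketch of Steps 1-3 in Python, assuming deterministic repair time $R \equiv r$ for Step 1 (any routine returning $b_{ij}$ could be passed in instead). After (16) and (17) give $E(T_n)$ and $E(D_n)$, equation $i$ of (1) and (2) is solved for $E(T_{i-1})$ and $E(D_{i-1})$, so each new $K(i-1)$ costs only one back-substitution. The names are hypothetical. For large $n$ or long repair times the recursions divide by small $b_{i,i-1}$ and may lose floating-point accuracy, in which case solving (1) and (2) directly is a safer fallback.

```python
import math

def make_b_det(n, lam, r):
    """Step 1 for deterministic repair time R = r (an illustrative choice): b_ij = p_ij(r)."""
    def b(i, j):
        if not (i - 1 <= j <= n - 1):
            return 0.0
        m = j + 1 - i
        return math.comb(n - i, m) * math.exp(-(n - j - 1) * lam * r) * (1.0 - math.exp(-lam * r)) ** m
    return b

def optimal_policy(n, lam, ER, ELap, b, c0, cd, cf, cr):
    """Steps 2 and 3. ER = E(R), ELap = E(exp(-lam R)), b(i, j) = b_ij.
    Returns the optimal i and the K(i) values evaluated on the way down from i = n."""
    # Step 2: f_1 = 1 and the recursion for f_2, ..., f_n (list index = subscript).
    f = [0.0, 1.0]
    for k in range(1, n):
        s = sum((-1) ** (k - i + 1) * b(i, k) * f[i] for i in range(1, k))
        f.append(((b(k, k) - 1.0) * f[k] - s) / b(k + 1, k))
    # constant terms of (2), indexed by i
    a_D = [0.0] + [cd * (n * ER + (n - i) / lam * (ELap - 1.0)) for i in range(1, n + 1)]
    T = {n: ER * sum((-1) ** (n - i) * f[i] for i in range(1, n + 1)) / f[n]}       # (16)
    D = {n: sum((-1) ** (n - i) * f[i] * a_D[i] for i in range(1, n + 1)) / f[n]}   # (17)

    def K(i):                                                                        # (5)
        beta = sum(1.0 / ((n - k) * lam) for k in range(i))                         # (3)
        gam = sum(k / ((n - k) * lam) for k in range(i))                            # (4)
        return (c0 + D[i] + cd * gam + (cr + cf / ER) * T[i]) / (beta + T[i])

    costs = {n: K(n)}
    # Step 3: equation i of (1) and (2), solved for the i - 1 terms.
    for i in range(n, 1, -1):
        tail_T = sum(b(i, j) * T[j] for j in range(i, n))
        tail_D = sum(b(i, j) * D[j] for j in range(i, n))
        T[i - 1] = (T[i] - ER - tail_T) / b(i, i - 1)
        D[i - 1] = (D[i] - a_D[i] - tail_D) / b(i, i - 1)
        costs[i - 1] = K(i - 1)
        if costs[i - 1] > costs[i]:
            return i, costs        # by unimodality, policy i is optimal
    return 1, costs

n, lam, r = 10, 0.1, 2.0
i_star, costs = optimal_policy(n, lam, r, math.exp(-lam * r), make_b_det(n, lam, r),
                               c0=20.0, cd=0.87, cf=1.0, cr=2.0)
print(i_star, {i: round(k, 4) for i, k in sorted(costs.items())})
```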
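Example. Suppose that $R$ has a Gamma distribution with parameters $\alpha$ and $\beta$ (here $\beta$ is a distribution parameter, not the function $\beta(i)$ of (3)), i.e. the density is given by

$$f(r) = \frac{\beta^{\alpha} r^{\alpha-1} e^{-\beta r}}{\Gamma(\alpha)}, \qquad r > 0.$$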
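Then, after appropriate simplification, the following expressions can be obtained: for $i-1 \le j \le n-1$,

$$b_{ij} = \frac{(n-i)!}{(n-j-1)!}\,\beta^{\alpha} \sum_{k=0}^{j+1-i} (-1)^k \left[(n+k-j-1)\lambda + \beta\right]^{-\alpha}\left[k!\,(j+1-i-k)!\right]^{-1},$$

with $E(R) = \alpha/\beta$ and $E\left(e^{-\lambda R}\right) = \left(\beta/(\beta + \lambda)\right)^{\alpha}$.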
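The closed form for $b_{ij}$ can be checked against direct numerical integration of $\int p_{ij}(r) f(r)\,dr$. The sketch below uses composite Simpson's rule on a truncated interval. The truncation point, step count and parameter values are assumptions chosen for illustration, and the integrator requires $\alpha \ge 1$ so that the density is finite at $r = 0$.

```python
import math

def b_gamma(i, j, n, lam, alpha, beta):
    """Closed-form b_ij above when R ~ Gamma(shape alpha, rate beta)."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i
    total = sum((-1) ** k * ((n + k - j - 1) * lam + beta) ** (-alpha)
                / (math.factorial(k) * math.factorial(m - k)) for k in range(m + 1))
    return math.factorial(n - i) / math.factorial(n - j - 1) * beta ** alpha * total

def b_simpson(i, j, n, lam, alpha, beta, upper=400.0, steps=20_000):
    """Composite Simpson approximation of the integral of p_ij(r) f(r) over [0, upper]; needs alpha >= 1."""
    if not (i - 1 <= j <= n - 1):
        return 0.0
    m = j + 1 - i

    def g(r):
        p = math.comb(n - i, m) * math.exp(-(n - j - 1) * lam * r) * (1.0 - math.exp(-lam * r)) ** m
        dens = beta ** alpha * r ** (alpha - 1) * math.exp(-beta * r) / math.gamma(alpha)
        return p * dens

    h = upper / steps                          # steps must be even
    acc = g(0.0) + g(upper)
    acc += 4 * sum(g((2 * k - 1) * h) for k in range(1, steps // 2 + 1))
    acc += 2 * sum(g(2 * k * h) for k in range(1, steps // 2))
    return acc * h / 3

n, lam, alpha, beta = 5, 0.1, 2.0, 0.1          # E(R) = alpha / beta = 20
for i in range(1, n + 1):
    print(i, [round(b_gamma(i, j, n, lam, alpha, beta), 6) for j in range(i - 1, n)])
```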
Suppose the parameters take the following values: $c_0 = \$20$, $c_d = \$0.87$, $c_f = \$1$, $c_r = \$2$, $\lambda = 0.1$, $n = 10$. Assume that the time taken to repair a single machine is a gamma random variable with parameters $\beta = 0.1$ and $\alpha$. In this case $E(R) = 10\alpha$. The lower curve of Figure 1 shows the cost associated with the optimal policy as a function of $E(R)$. If repair time were instantaneous, the optimal policy using Assaf and Shanthikumar's algorithm would be to repair the system when six machines break down. The top curve in Figure 1 shows the cost per unit time associated with this policy as a function of $E(R)$. It is clear that the optimal policy computed using the algorithm in this section can result in significantly lower costs compared to the 7-policy.
Acknowledgement

Thanks are due to an anonymous referee for many helpful suggestions. This research has been partially supported by National Science Foundation grant no. DMC-8910378.
References

D. Assaf and J.G. Shanthikumar, "Optimal group maintenance policies with continuous and periodic inspection", Management Sci. 33, 1440-1452 (1987).
K. Okumoto and E.A. Elsayed, "An optimum group maintenance policy", Naval Res. Logist. Quart. 30, 667-674 (1983).