Estimating the value of a discounted reward process

Estimating the value of a discounted reward process

Operations Research Letters l l (1992)267-272 North-Holland June 1992 Estimating the value of a discounted reward process Moshe Haviv Department gf ...

425KB Sizes 0 Downloads 20 Views

Operations Research Letters l l (1992)267-272 North-Holland

June 1992

Estimating the value of a discounted reward process Moshe Haviv Department gf Econometrics, The Unil,ersity of Sydney, Sydney, NSW 2006, Australia

Martin L. Puterman FaculO' of Commerce and Business Administration, Unit~ersity of British Columbia, Vancout,er, Canada Received August 1990 Revised September 199l

This paper provides a differential equation which relates the expected total discounted reward of a reward process to the expected total undiscounted reward of a process which terminates at a negative binomial stopping time. The solution of this equation provides the basis for unbiased estimators of the expected total discounted reward and its derivative with respect to the discount rate. We compare this estimator to other estimators and discuss when it might be more efficient. When rewards are positive we show that the estimator is monotone in the sampled variate. reward processes: unbiased estimators: simulation

I. Introduction

R a n d o m reward processes occur naturally in many areas of applied probability modelling, especially in performance analysis and stochastic dynamic programming. However, for all but the simplest models, evaluating the expected total discounted return of such a process requires simulation or experimentation. In this paper we provide an unbiased estimator for the expected total discounted return of a reward process derived from sampling cumulative sums of the rewards up to independent negative binomial stopping times. This approach is based on a differential equation which relates the expected total discounted return of a reward proCorrespondence to: Prof. Martin L. Puterman, Faculty of Commerce and Business Administration, University of British Columbia, Vancouver, BC, V6T 1Z2 Canada.

cess to the expected total undiscounted return of the process terminated at a negative binomial stopping time. Our derivation of the differential equation was motivated by a result of D e r m a n [2] which shows the equivalence of the expected total discounted return of a Markov reward process to the expected total undiscounted return up to an independent geometric stopping time. D e r m a n ' s result is generalized and applied by Fox and Glynn [3] in a computer simulation context. They derive and analyze an estimator of the expected total discounted reward by sampling rewards up to an independent geometric stopping time. They provide a central limit theorem for this estimator, suggest variance reduction techniques for semiMarkov processes and investigate the use of regenerative structures. Since a negative binomial random variable is the sum of independent geometric random variables, estimating the expected total discounted

0167-6377/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved

267

Volume l 1, N u m b e r 5

OPERATIONS RESEARCH LETTERS

return by our proposed method requires longer reward series (and considerably more computation) than an approach based on geometric stopping times. For example, if the negative binomial stopping time T* has parameters 2 and 1 - A and the geometric stopping time T has p a r a m e t e r 1 - A , then E ( T * ) = 2 E ( T ) . In this case the reward series corresponding to T* would be, on average, twice as long as that corresponding to T. This suggests that the proposed estimator uses more information per replicate than the one based on T does, so given the same number of replicates under each method, the proposed estimator would have smaller variance than that based on T. ]t is natural to ask "lsn't it more efficient to use twice as many replicates of the estimator based on geometric stopping times than that based on negative binomial stopping times?" In a computer simulation context, in which the cost of generating a stopping time and the cost of generating a realization of the reward are comparable, the method based on the geometric variate might be preferable since it requires less computation. However, when designing experiments in which the cost per experimental unit is high and the cost per time period of observation is low, using fewer experimental units and running them for a longer period of time would be more attractive. As an example of this, suppose we wish to compare the expected discounted cost of two heavy equipment maintenance regimes. The proposed method would use fewer pieces of equipment than that based on geometric times but run each about twice as long. In such a setting, the estimator based on the negative binomial might be preferable; the additional computational costs would be negligible compared to the greatly reduced experimental costs. This argument applies equally well to clinical trials in which patient recruitment may be difficult or costly. From a technical perspective, another advantage of the proposed procedure is that it provides an easily computed estimator of the derivative of the expected total discounted reward with respect to the discount rate. This might be of use when investigating the sensitivity of the discounted rew a r d to the discount rate. The p a p e r is organized as follows. Section 2 derives and discusses the differential equation relating the total expected returns of the two 268

June 1992

reward processes. In Section 3, we use the solution of the differential equation to develop an estimator for the expected total discounted reward. We explore properties of the sampling distribution of the estimator through computer simulation and heuristic arguments in Section 4. These results suggest that the estimator reduces variance by a factor of 2 compared to that based on geometric stopping times. Finally, in Section 5 we show that the estimator is monotone in the sampled variate.

2. An identity for the expected discounted reward By an infinite horizon discrete time reward process we mean a sequence of random variables {r t} defined for t ~ {1, 2 . . . . }. When t ~ {1, 2 . . . . . T} for some stopping time T, we refer to the reward process as a terminating reward process. Usually a reward process is generated by a discrete time stochastic process x t and a function r ( ' ) which associates a reward to each state. In this case

r, = r(x,). We assume for all t, t > 1, ] E(r,)l < M < oo for some M > 0. Let A, 0 < A < 1, denote the discount factor. That is, one is indifferent between receiving a reward of R units at time t and receiving a reward of At- ~R at time 1. Denote the expected total discounted reward or value of the discounted reward process by f(A) and define it by f(A) -E

E At-'rt • t=l

Let h(A) denote the expected total reward of a terminating reward process with independent geometrically distributed stopping time with parameter 1 - A . That is, T

h(A)=E Er,. t=l

Then it is well known (cf., Fox and Glynn [3]) that h(A) = f ( A ) and that the geometric distribution is the unique distribution with this property. Let T* denote a random variable with a negative binomial distribution with parameters 2 and 1 - A , that is,

P(T*=k)=(k-1)Ak-Z(1-A)

2, k > 2 ,

(1)

Volume 11. Number 5

O P E R A T I O N S R E S E A R C H LETTERS

and let g(A) denote the expected total reward of a terminating reward process with this stopping time, i.e.

Jn and 1 - A reward.

June 1992

to the expected total discounted

T*

g(A)=EEr

,•

3. Unbiased estimation counted reward

t=l

The following theorem relates

f(a)

and

0f(A)

1

1

+ --f(A)

dA

=

I-A

1 -a

g(A)

(2)

and a f
= (1 - A ) f ( 0 )

+

g(r) m d r2

-

(3)

where f(0) = E(r, ).

Proof. To establish (2) note that g(A)=

EP(T*=k)E

=k

t=l

t= k

= E (k-

1)Ak-2(1-- A) 2 EL'(Ft)

k=l

t=l ve

= EE(r,) t=l

E (k-

1)Ak-2(l

-A)

2

k=t

~c

= EE(r,)[(t-

1)A'-2(1-A)+A '-l]

t=l

=f(A) + (l-A)

df(A) d~

Because we assumed that IE(r,)l <_M,g(A)
expected

dis-

g(a).

Theorem 2.1. Suppose f(A) and g(A) are defined as above. Then, for 0 < A < 1, - -

of the

E(T*) <~

we may interchange the order of summation between the second and third expressions above. The dominated convergence theorem justifies passing the limit inside the summation when defining d f ( A ) / d A . Equation (3) follows by solving the differential equation (2). [] Using similar arguments to those in the proof of T h e o r e m 2.1, we can derive differential equations which relate the expected total reward of a terminating reward process with independent negative binomial stopping time with parameters

We may estimate the expected total discounted reward of the infinite horizon reward process by truncating the process at some finite K and computing the mean of E)~ 1A~- ~r, over several (finite) realizations of the process. Unfortunately, this results in a biased estimator of f(A), with bias at most MA~/(1 - A). Alternatively, one might estimate f(A) by sampling several realizations of a truncated reward process in which the stopping time T has a geometric distribution with parameter 1 - A and estimating f(A) by the sample mean of ~ ir,. Denote this estimator by /~(A). It is an unbiased estimator of f(A). In this section we propose another unbiased estimator for f(A) based on the stopping time T*. The following provides a standard method for sampling a realization of T* using its inverse cumulative distribution function. Recall that T* assumes values 2, 3 . . . . . (1) Specify A. (2) Partition the unit interval into consecutive sub-intervals the lengths of which coincide with the probabilities determined by the distribution of T * , i.e., the i-th interval is of length P ( T * = i + 1) for i >_ 1. Choose all the intervals except the first, to be open on the left and closed on the right. Take the first interval to be closed from both ends. Label the i-th interval i + 1. (3) Sample x from a [0,1J-uniform distribution, and if x falls in the i-th interval set T* = i +1. The above procedure can be implemented simultaneously for all values of the discount factor, that is, a single value of x can be used to sample T * for several values of A. In the next paragraph we show that for each random variate x there exists a set of break points {A0, AI, A2. . . . } with A 0 = 0 , such that if A is in (Ai, Ai+ ~) then T* equals i + 2, and also investigate the relationship between this set and x. To emphasize that T * is a function of A we denote it by T*. For k > 2, P ( T * <_k) = 1 - k A ~ ' + ( k - t)A k. 269

Volume 11, Number 5

OPERATIONS RESEARCHLETTERS

Let Ak_ ~ denote the unique solution of 1 - k A k-1 + ( k - 1)A* = x

June 1992

Corollary 3.2. For j > O, set Sy = E~+Zrt. Then, for A ~(Ai, A i + l ] , i > 0 , T** = i + 2 , and f(A) =r,(1 -A)

in [0, 1). That a unique Ak_ 1 exists and satisfies 0 __ 2,

P(Tt~
and

[i-~) +(i-A)

lim P(TA* < k) = 0.

j

( 1 Sj l - A j +

i

1 1-Aj

A--+l

The uniqueness is established from the fact that P(TA* < k) is monotone decreasing with respect to A. Finally, for each k, Ak is a monotone decreasing function of x. This can be seen by (implicitly) differentiating both sides of the identity P ( T ~ * < k ) = x with respect to x and noting then that d A k _ l / d X is negative. We now propose an unbiased estimator of f(A) based on (2).

Theorem 3.1. Let Ta* be a realization of the stopping time determined by a [O,1]-uniform variate in the manner described above and let tr t'tlt~v* = 1 be a realization of the terminating reward process. Then, for each A, O < A < l,

E r,

+Si

1-A

1]

1-A s

Equivalently,

?(A) = S i - ( l - a )

i rj+2 E 1 =~j j=O T*-2

E 1-a~" j=0

(4)

Clearly, expression (4) for f(A) is the easiest to implement and apply. In some applications we might seek an estimate of d f ( A ) / d A , for instance when we wish to investigate the sensitivity of the expected total discounted reward to changes in A. Combining Theorem 3.1 with (2) and (4) yields the following corollary which provides an unbiased estimator for this derivative.

t--1

is an unbiased estimator of g( A ) and

Corollary 3.3. Let

-f(A) f(A)-(1-a)f(O)

+ (1-a)f~o(l:-r)2dr~=

f'(A) =

1 -A

(5)

Then, E ( f ' ( A ) ) = d f ( A / d A ) and is an unbiased estimator o f f ( A ) .

T~-2 rj+2

f'(A)= E fT . j=0

Proof. The first part of the theorem is immediate. Note that ~ ( ' ) is a step function with jumps at A0, A~, A2. . . . . Consequently, for each A the number of summands defining the integral is finite so that we may interchange expectation and summation. Hence, to establish the second part of the theorem, set f(0) to equal r l and apply (3). []

It is easy to see that f(A) is continuous in A with only a countable number of points, {Ai}7_0, where it is not differentiable. The next corollary gives an explicit representation for f(A) as a function of the break points. The proof follows by straightforward but tedious manipulation•

270

Observe that f(A) = ~a(A) - (1 - A)f'(A).

4. Variance properties of the estimators The variance provides a measure of the precision of an unbiased estimator. Unfortunately, we were unable to obtain closed form expressions for var(f(A)) even in the relatively simple case of independent rewards. We hypothesize that the variance of f(A) is less than that of /~(A). Below we instead relate var(/~(A)) to var(~(A)).

Proposition 4.1. Suppose that {rt; t > 1} are independent and identically distributed (i.i.d.) random

Volume 11. N u m b e r 5

OPERATIONS RESEARCH LETTERS

variables with mean tx and variance cr2. Then

June 1992

Table 1

v a r ( i f ( A ) ) -- 2 var(/~(A)) = 2[

/~'2A -

°'2

-

-{-

L(1 - A )

-

2

(6)

-

1

-a

Mean Variance Range Kurtosis

0.06 13.43 ( - 16.44, 18.222) 1.79

0.04 21.06 ( - 21.62, 30.20) 3.78,

Proof. Under the assumptions on G,

= v a r ( T ~ ) + E(T~r 2)

To use this result in practice, suppose we have a lower bound o n / z , say #. Then ~(A, a) is more efficient than ~(A) for 0 < a _< ~. In a similar fashion, using

= ~2 v a r ( T ) + cr2E(T)

/~(A, a) - h ( A ) - a [ T -

var(/~(A) ) --- var( E(/~( A) l T ) ) + E(var(/~(A) ] T ) )

/z2A --

_

_

(1

o -2 -{-

-

-A) 2

-

1 -A '

Equation (6) shows that the variances ~(A) and h(A) are increasing in ~2. Next we show how one can take this into account by using a control variable to obtain a reduction in variance. Specifically, suppose we replace ~(A) by ~(A, a) where ~(A, a) - ~ ( A ) - a[T* - 2 / ( 1 - A)]. Since E(T*) = 2/(1 - A), E(ff(A, a)) = E(~(A)) so that the expectation of these two estimators is the same. When rewards are i.i.d., cov[~(A), T*] = E(~(A)T*) -E(~,(A))E(T*)

E V* E r t T * t

+~[E(T*)]

2

t=o I

= / . t E ( ( T * ) 2) - ~ [ E ( T * ) ]

2

=g var(T*). Hence var(~(A, a)) =var(~(a))

can lead to more efficient estimation than /~(a). Unfortunately, we cannot apply the same approach to and preserve its unbiased property. Perhaps through use of a control variable, we might be able to achieve smaller variance and an overall reduction in mean squared error.

f(a)

Since E(T*) = 2 E ( T ) and v a r ( T * ) = 2 var(T) repeating the above argument for var(o&(A)) shows that var(~(A)) = 2 var(/~(A)). []

:E

1/(1 -A)]

+a 2 var(T*)

- 2a/z v a r ( T * ) so that for positive values of /x, this is smaller than var(~(A)) for all a E (0, 2g). i In particular, the variance is minimized when a = ~.

t For negative values of /z, a ~(2tz, 0) leads to a reduced variance.

5. Simulation results

We report results of a simulation which investigates the variability and bias of these estimators. We chose the random variables r, to be independent with a uniform distribution with zero expectation and variance of one. For A = 0.95 we generated 1000 replicates of f(A) and of h(A) based on the same variates for the two estimators. Table 1 reports the results of the simulation. Observe that there was no evidence that the mean of either of these estimators differed from 0 (p-values of 0.62 and 0.81, respectively). As we speculated above, f(A) has smaller variance than h(A) with a ratio of sample variances equal to 1.57. Note that the sample variance of /~(A) was close to its theoretical value (see (6)) of 20 and that the correlation between ~(A) and f ' ( A ) was 0.87. Normal probability plots (available from the authors) of the simulation output and the kurtosis 2 values suggest that both estimators have heavier tails that would be expected from normally distributed data with the deviation from normality greater for h(A). 2 The kurtosis equals the fourth sample central moment minus 3. The normal distribution has kurtosis 0; positive values suggest distributions with heavier tails than the normal. 271

Volume 11, Number 5

OPERATIONS R E S E A R C H LETTERS

Although f(A) has smaller variance than j~(A), it requires greater effort per replicate since E(T*)=2E(T). When there is a large set-up cost per replicate and small cost for replication, the proposed estimator might be preferable.

6. Monotonicity of the estimator Suppose all rewards are positive. If not and they are bounded below, they can be scaled by a constant. Then an immediate consequence of the assumption that each of the break points is a decreasing function of the sampled variate x is that f(A) and ~(A) are monotone functions of x. Also, f(A) is continuous in A. This implies that given the sampled rewards, the proposed sampiing procedure for TA* results in an estimator f(A) for f(A) which is equivalent to applying the inverse of the cumulative distibution function (conditional on the rewards) of f(A) on a [0,1]uniform variate. 3 This is a useful property when applying variance reduction techniques. For example, suppose one wishes to estimate fl(.)_f2(.) where fl(A) and f2(A) represent the total expected discounted return corresponding to different reward processes taken as functions of the discount factor. Estimating both fl(Al) and f2(A2) independently as suggested above yields an unbiased estimate but not neces-

3 It is easy to see that if U is a [0,1]-uniform variate and h(U) is monotone increasing and continuous having cumulative distribution function F then h = F t.

272

June 1992

sarily one with minimum variance. Further suppose one generates the two estimates by sampling two [0,1]-uniform variates and then computing the corresponding f l ( a l ) and fz(a2), then it is well known (e.g., Bratley, Fox and Schrage [1], pp. 46-47) that out of all joint distributions of these uniform variates, the one which minimizes the variance of fl(A1)-f2(A2) is when the two variates coincide. Thus we choose the same variate to generate both fl(A1) and f2(A2).

Acknowledgement We wish to thank Yair Wand, Faculty of Commerce and Business Administration, University of British Columbia, for some helpful comments and especially the associate editor whose careful reading of and comments on the first draft of this manuscript led to substantial improvements, both in content and in presentation. This research was supported by grant A5527 of The Natural Science and Engineering Research Council (Canada).

References [1] P. Bratley, B.L. Fox and L.E. Schrage, A Guide to Simulation, 2nd Edition, Springer-Verlag, New York, 1987. [2] C. Derman, Finite State Markoc Decision Processes, Academic Press, New York, 1970. [3] B. Fox and P. Glynn, "Simulating discounted cost", Management Sci. 35, 1297-1315 (1989).