Comput. & Ops Res. Vol. 13, No. 4, pp. 411-420, 1986
Copyright (c) 1986 Pergamon Journals Ltd

COMPUTATIONAL COMPARISON OF POLICY ITERATION ALGORITHMS FOR DISCOUNTED MARKOV DECISION PROCESSES

R. HARTLEY*
Department of Decision Theory, University of Manchester, Manchester M13 9PL, England

A. C. LAVERCOMBE†
Department of Operations Research, Paisley College of Technology, Paisley, Scotland

and

L. C. THOMAS‡
Department of Business Studies, University of Edinburgh, 50 George Square, Edinburgh EH8 9JU, Scotland
Scope and Purpose-Markov decision processes are dynamic programming models of problems involving a series of decisions to be made in an uncertain environment. This paper looks at one technique for solving Markov decision processes, called policy iteration, and surveys its variants and embellishments, as well as comparing the times these variants take to solve sets of randomly generated problems.

Abstract-This paper describes a computational comparison of the policy iteration algorithms for solving discounted Markov decision processes. It examines the different forms of iteration, reordering, extrapolation and action elimination.
1. INTRODUCTION
In this paper we report on computational experiments designed to investigate the usefulness of various schemes which have been suggested for improving the efficiency of the basic policy iteration scheme introduced by Howard [1]. We will assume a finite set of states S = {1, 2, ..., M} and a finite action set K_i for each i ∈ S. Any k ∈ K_i leads to a reward r_i^k and a probability p_{ij}^k of making a transition to j ∈ S. A policy δ specifies an action δ(i) ∈ K_i for each i ∈ S. The expected discounted return using δ and starting in state i, for any discount factor β ∈ (0, 1), is written v^δ(i) and satisfies

    v^δ = H_δ v^δ,    (1)

where H_δ is defined by

    (H_δ w)(i) = r_i^{δ(i)} + β Σ_{j=1}^{M} p_{ij}^{δ(i)} w(j),    for i ∈ S.

Usually v_1 will be a vector which is an estimate of v^δ. Our objective is to find a policy δ* which satisfies

    v^{δ*}(i) = v(i), where v(i) = max_δ v^δ(i),    for i ∈ S.
*Roger Hartley is Senior Lecturer in the Department of Decision Theory, University of Manchester. He holds an M.A. and Ph.D. from the University of Cambridge. He has published papers in Markov programming and multiple objective optimization and is the author of two text books and (joint) editor of two sets of conference proceedings.
†Adrian Lavercombe is a Lecturer in the Department of Mathematics and Computing at Paisley College of Technology. He holds a B.A. and Ph.D. in Decision Theory from the University of Manchester and an M.Sc. in Operations Research from the University of Strathclyde. His interests are in the area of Markov decision processes and inventory problems.
‡Lyn C. Thomas is Professor of Management Science at the Department of Business Studies, University of Edinburgh. He holds a D.Phil. and M.A. in mathematics from the University of Oxford. From 1982 to 1983 he was a National Research Council Associate at the Naval Postgraduate School, Monterey. His interests include Markov decision processes, replacement and maintenance, inventory, queueing problems and game theory.
It is well known that v and δ* satisfy

    v = Av = H_{δ*} v = max_δ H_δ v,

where

    (Aw)(i) = max_{k ∈ K_i} { r_i^k + β Σ_{j=1}^{M} p_{ij}^k w(j) },    for i ∈ S.
At its simplest the policy iteration method works as follows.

Step 1. Choose a policy δ.
Step 2. Policy evaluation: solve (1) for v^δ.
Step 3. Policy improvement: try to find a policy δ' satisfying H_{δ'} v^δ ≥ v^δ and H_{δ'} v^δ ≠ v^δ (i.e. an improvement on δ).
(i) If no such policy exists, δ is optimal; STOP.
(ii) If a δ' can be found, put δ = δ' and go to step 2.

Although the algorithm is known to be finite it may take many iterations. Usually we can relax our objective a little and, instead of looking for an optimal policy, look for an ε-optimal policy; i.e. for a given ε > 0 we look for a policy δ and an estimate w of v such that

    |v^δ - v|_∞ < ε  and  |w - v|_∞ < ε.

Throughout this paper we will take ε = 0.001 and β = 0.9. This relaxation affects steps 2 and 3 of our algorithm and they are modified as follows.

Step 2'. Find w such that |w - v^δ|_∞ < ε' (ε' > 0).
Step 3'. Find δ' which improves on δ (in some sense).
(i) If δ' is ε-optimal, STOP.
(ii) Otherwise, put δ = δ' and go to step 2'.

Note that the method involves a second error ε', but we shall take ε' = ε in the sequel. We also need an appropriate technique for answering the question implicit in step 3'(i), and we shall discuss this further in Section 4, where we give a refinement of step 3'.

We will now outline the plan of the paper. In Section 2 we discuss various iterative schemes for solving step 2'. Computational comparisons are reported in Section 3. Using the best method identified from Section 2 for the solution of step 2', we consider in Section 4 how to test for ε-optimal policies and the consequences for the full policy iteration algorithm. The computational results are in Section 5. Methods for eliminating nonoptimal actions are described in Section 6 and compared computationally in Section 7.
2. POLICY EVALUATION - METHODS
Several papers, notably those involving Porteus [5-8], have been written on the problem of policy evaluation. In this section we detail only those which appear to have proved themselves worthy of consideration; a large number of obviously inferior variants are not mentioned. For comparison we have included Gaussian elimination, which if performed with complete accuracy would give v^δ exactly. However, it should be noted that, very roughly, Gaussian elimination is O(M^3) while all the other methods are O(M^2) per iteration. Further, we have not made direct use of sparsity in our computation. However, iterative methods are capable of exploiting sparsity to a much greater extent than Gaussian elimination. For details of the use of sparsity see Morin [3]. We will identify six basic iterative schemes, all of which are of the form v_n = H v_{n-1} (we drop the subscripts and superscripts δ and k throughout this section). Adopting the convention that Σ_{j=a}^{b} = 0 if a > b, we have, for i ∈ S, the following.
Pre-Jacobi

    (H_{PJ} w)(i) = r_i + β Σ_{j=1}^{M} p_{ij} w(j).

Pre-Gauss-Seidel

    (H_{PGS} w)(i) = r_i + β Σ_{j=1}^{i-1} p_{ij} (H_{PGS} w)(j) + β Σ_{j=i}^{M} p_{ij} w(j).

Gauss-Seidel

    (H_{GS} w)(i) = [ r_i + β Σ_{j=1}^{i-1} p_{ij} (H_{GS} w)(j) + β Σ_{j=i+1}^{M} p_{ij} w(j) ] / (1 - β p_{ii}).

Reversible Gauss-Seidel

    H_{RGS} w = H̃_{GS}(H_{GS} w),  where  (H̃_{GS} w)(i) = [ r_i + β Σ_{j=1}^{i-1} p_{ij} w(j) + β Σ_{j=i+1}^{M} p_{ij} (H̃_{GS} w)(j) ] / (1 - β p_{ii}).

Successive over-relaxation

    (H_{SOR} w)(i) = ω [ r_i + β Σ_{j=1}^{i-1} p_{ij} (H_{SOR} w)(j) + β Σ_{j=i+1}^{M} p_{ij} w(j) ] / (1 - β p_{ii}) + (1 - ω) w(i).

Variable successive over-relaxation

    (H_{VSOR} w)(i) = ω_i [ r_i + β Σ_{j=1}^{i-1} p_{ij} (H_{VSOR} w)(j) + β Σ_{j=i+1}^{M} p_{ij} w(j) ] / (1 - β p_{ii}) + (1 - ω_i) w(i).
A few notes on some of these schemes are in order here.

(i) The easiest way to implement reversible Gauss-Seidel is to define v_{n+1/2} = H_{GS} v_n, in which case v_{n+1} = H̃_{GS} v_{n+1/2}. It follows from the definitions that, for i ∈ S,

    v_{n+1}(i) = v_{n+1/2}(i) + β Σ_{j=i+1}^{M} p_{ij} [v_{n+1}(j) - v_n(j)] / (1 - β p_{ii}),    (2)

    v_{n+1/2}(i) = v_n(i) + β Σ_{j=1}^{i-1} p_{ij} [v_{n+1/2}(j) - v_{n-1/2}(j)] / (1 - β p_{ii}),    (3)
and so the calculations can be performed with about half the effort per iteration that might be expected initially. Note that (2) must be performed in reverse order, i.e. for i = M, M-1, ..., 1.

(ii) It is usual to choose ω ∈ (1, 2) in successive over-relaxation [8]. Extensive computational tests suggested the value ω = 1.28 as giving a good compromise between speed and convergence, and we used this value in our computational comparisons. Since convergence cannot be guaranteed, the program includes a mechanism for detecting divergence, but this was not needed in any of our test problems.

(iii) If the ω_i are defined recursively, for i = 1, ..., M, by

    ω_i = [ 1 - β^2 Σ_{j=1}^{i-1} p_{ij} ω_j p_{ji} ]^{-1},

it can be shown that variable successive over-relaxation converges at a rate of at least β.
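As an illustration of the sweep structure (not the authors' code), the following Python sketch carries out one pre-Jacobi and one Gauss-Seidel iteration for a fixed policy, with reward vector r and transition matrix P, following the operator definitions displayed above.

```python
import numpy as np

def pre_jacobi_sweep(v, r, P, beta=0.9):
    """One pre-Jacobi iteration: (H_PJ v)(i) = r_i + beta * sum_j p_ij v(j)."""
    return r + beta * P.dot(v)

def gauss_seidel_sweep(v, r, P, beta=0.9):
    """One Gauss-Seidel iteration, updating states in the order 1, ..., M and
    reusing the newly computed values for states that have already been updated."""
    M = len(v)
    w = v.copy()
    for i in range(M):
        # new values w[:i] for j < i, old values v[i+1:] for j > i,
        # diagonal term moved to the left-hand side
        s = r[i] + beta * (P[i, :i].dot(w[:i]) + P[i, i + 1:].dot(v[i + 1:]))
        w[i] = s / (1.0 - beta * P[i, i])
    return w
```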
All the methods except pre-Jacobi depend upon the ordering of the states, and therefore applying a permutation to S affords the possibility of improving the performance. Of the various reordering schemes suggested by Porteus, we experimented with his two most successful methods and found that the most efficient was the minimum remaining row sum (method 2 of [6]). This agrees with Porteus' conclusion. Whenever "reordering" is referred to in the sequel, we shall mean this particular scheme.

Another possible means of improving performance is to use extrapolation. We will consider two methods of extrapolation here. For an operator H defining any of the six schemes above and ω ∈ (1, 2), put

    H_ω = ωH + (1 - ω)I,

where I is the identity operator. We call this method over-relaxation (OR). H_{GS}^{OR} is very similar to H_{SOR}; in the latter the extrapolation is done state-by-state as each state value is being calculated, whereas in the former the extrapolation is performed at the end of each iteration. Although we might expect H_{GS}^{OR} to be less effective than H_{SOR}, there is an advantage in using H_{GS}^{OR}: it involves the use of H_{GS}, so we can use the bounds on H_{GS} to give a precise stopping criterion, whereas for H_{SOR} only a heuristic method of stopping can be used (see below). However, we still cannot guarantee convergence with H_{GS}^{OR}. We will take ω = 1.28 again.

The second extrapolation method, called row sum extrapolation (RSE) by Porteus, can be written

    (H_{RSE} w)(i) = (Hw)(i) + φ(w)(He)(i),

where e(j) = 1 for all j ∈ S, He denotes the linear part of the affine operator H applied to e (i.e. the discounted row sums), and

    φ(w) = [ū(w) + u̲(w)] / (2 - β̄ - β̲),
    ū(w) = max_i (Hw - w)(i),    u̲(w) = min_i (Hw - w)(i),
    β̄ = max_i (He)(i),    β̲ = min_i (He)(i).

In the case of pre-Jacobi, H_{PJ} e = βe, so

    (H_{RSE} w)(i) = (H_{PJ} w)(i) + β[ū(w) + u̲(w)] / [2(1 - β)],    (4)
and it is not hard to show from this expression that using (4) on every iteration is exactly the same as using H_{RSE}^{PJ} only on the last iteration and H_{PJ} on all others. Row sum extrapolation presents a difficulty when used with H_{RGS}, since the reduction (2) is no longer valid. Porteus proposes a test to determine whether extrapolation is worthwhile (essentially, that ū(w) and u̲(w) should have the same sign). In the Appendix we show that it is possible to perform extrapolations in such a manner that a simple modification of (2) and (3) can be used, provided that some initial calculation is performed. This is equivalent to an additional iteration (or two iterations if extrapolation is to be performed every half iteration) added on to the overall time. We have used this method of performing extrapolations in our computational work.
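To make (4) concrete, here is an illustrative Python sketch (assumptions: pre-Jacobi operator and He = βe, so the extrapolation adds the same constant to every component; this is not the authors' code).

```python
import numpy as np

def pre_jacobi_rse_sweep(v, r, P, beta=0.9):
    """One pre-Jacobi iteration followed by row sum extrapolation as in (4)."""
    Hv = r + beta * P.dot(v)                   # plain pre-Jacobi step
    u_upper = np.max(Hv - v)                   # max_i (Hv - v)(i)
    u_lower = np.min(Hv - v)                   # min_i (Hv - v)(i)
    shift = beta * (u_upper + u_lower) / (2.0 * (1.0 - beta))
    return Hv + shift                          # extrapolated iterate
```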
All the iterative methods require a stopping criterion and, since they all involve the use of one of the operators H described above, we can base our criterion on these. All the schemes except successive over-relaxation are (affine) monotonic contraction mappings, and the arguments of Porteus [5] can be applied to show that if we put

    β̄ = max_i (He)(i) if ū(w) ≥ 0, and β̄ = min_i (He)(i) otherwise,
    β̲ = max_i (He)(i) if u̲(w) < 0, and β̲ = min_i (He)(i) otherwise,    (5)

then

    |v^δ - H_{RSE} w|_∞ ≤ ½ [β̄ ū(w)/(1 - β̄) - β̲ u̲(w)/(1 - β̲)],    (6)

and so we can stop if the RHS is less than ε. Note that in the pre-Jacobi case β̄ = β̲ = β, which results in a simplified criterion.

Convergence of successive over-relaxation can only be guaranteed under conditions which are often satisfied in scientific and engineering applications but are rarely valid in Markov decision processes (see [10]). We adopt the heuristic procedure of assuming that convergence is geometric in the l_∞ norm and estimate the rate of convergence by considering successive iterations. This ultimately leads to the criterion [9]: stop if

    |H_{SOR} w - w|_∞^2 / ( |w - u|_∞ - |H_{SOR} w - w|_∞ ) < ε,    (7)

where w = H_{SOR} u. If our hypothesis on convergence is correct and the implicit estimate of the convergence rate is an upper bound, then

    |H_{SOR} w - v|_∞ < ε.    (8)

In practice the stopping criterion (7) does lead to (8) being satisfied.
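The heuristic test (7) can be sketched as follows (illustrative Python, not the authors' code); here u is the iterate from two steps back, w = H_SOR u, and new_w = H_SOR w.

```python
import numpy as np

def sor_stopping_test(u, w, new_w, eps=1e-3):
    """Heuristic stopping criterion (7) for successive over-relaxation.

    Assumes the iterates converge geometrically in the sup norm; the rate is
    estimated from two successive step sizes.
    """
    step = np.max(np.abs(new_w - w))           # |H_SOR w - w|_inf
    prev_step = np.max(np.abs(w - u))          # |w - u|_inf
    slack = prev_step - step
    if slack <= 0:                             # no contraction detected yet
        return False
    return step * step / slack < eps           # criterion (7)
```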
3. POLICY EVALUATION - COMPUTATIONAL RESULTS
We can now present the computational results. These are given in Table 1. In all the iterative methods v_1 = r was used. Fifteen test problems of 40 states each were generated using the procedure described in [9], but with one action per state. We have verified that for all schemes except reversible Gauss-Seidel, reordering improves performance. As an example we illustrate Gauss-Seidel with and without reordering. We have included several examples both with and without row sum extrapolation to demonstrate the remarkable improvement this technique affords. Indeed this improvement is perhaps the most important conclusion we can draw from the table. The performance of reversible Gauss-Seidel is disappointing, which is largely due to the fact that extrapolation does not produce such a substantial improvement in performance as for other methods. Indeed, if we remember that iterations of H_{RGS} consist of two "half iterations", each being, a priori, comparable in effect with a single iteration of other methods, we see that H_{RGS} is taking nearly twice as long as we might initially expect. There is clearly scope for the development of an extrapolation which will mesh with H_{RGS} to provide the same proportional improvement as RSE does for, say, H_{GS}. To compare our results with those of Porteus is not completely straightforward, as we have different values of β and ε, different Markov transition matrices and differences in coding, and we present the results in a rather different manner. Unlike [7] we do not find reversible Gauss-Seidel to be the fastest method, but this is probably accounted for by our obtaining an even greater decrease in computation time from the use of RSE. In the remaining sections we use PJ and GS4 for policy evaluation. The former is
Table 1

Method  Basic scheme          Reordering?  Extrapolation      Average   Standard deviation  Average no.
                                           method             time (s)  of time (s)         of iterations
GE      Gaussian elimination  NA           NA                 0.72      0                   NA
PJ      H_PJ                  NA           RSE (implicitly)   0.36      0.02                16
PGS1    H_PGS                 Yes          None               1.21      0.09                53
PGS2    H_PGS                 Yes          RSE                0.35      0.02                15
GS1     H_GS                  No           None               1.13      0.10                63
GS2     H_GS                  Yes          None               0.96      0.07                61
GS3     H_GS                  Yes          OR                 0.76      0.05                39
GS4     H_GS                  Yes          RSE                0.29      0.01                14
RGS1    H_RGS                 Yes          None               0.87      0.07                35
RGS2    H_RGS                 No           RSE                0.38      0.07                14
RGS3    H_RGS                 Yes          RSE                0.44      0.05                15
SOR     H_SOR                 Yes          NA                 0.45      0.04                24
VSOR    H_VSOR                Yes          None               0.98      0.07                51
included partly because it is very simple to implement and partly because it has certain advantages with regard to policy improvement, as we shall see in the next section. We conclude this section with two further observations on the results. Firstly, we note that Gaussian elimination cannot exploit the existence of a w which is "reasonably close" to v in some sense, whereas iterative methods, by putting v_1 = w, may require fewer iterations. This proves valuable in the full policy iteration method, further tipping the scales in favor of iterative methods. Secondly, we note that semi-Markov problems and absorbing state problems with total cost criterion have transition matrices with row sums which are not 1. Now pre-Jacobi with the stopping method indicated is essentially relative value iteration [4], for which Morton and Wecker have shown that the convergence rate is β times the absolute value of the subdominant eigenvalue of the transition matrix, provided the row sums are 1. Hence, pre-Jacobi may be expected to be less efficient for semi-Markov and absorbing problems, whereas there seems no reason to expect the performance of GS4 to deteriorate substantially.

4. POLICY IMPROVEMENT - METHODS
We now turn our attention to step 3 of the policy iteration algorithm outlined in Section 1. Once we admit the possibility of not performing the evaluation step 2 exactly, it becomes necessary to change our outlook on the algorithm. It can be proved that, for some policy δ: (i) if δ is optimal and H_{δ'} v^δ ≥ v^δ, then H_{δ'} v^δ = v^δ, so no policy δ' can actually improve on the value v^δ; (ii) if δ is not optimal, there is a policy δ' satisfying H_{δ'} v^δ ≥ v^δ and H_{δ'} v^δ ≠ v^δ. However, if for some ε > 0, w satisfies |w - v^δ|_∞ < ε, neither (i) nor (ii) need be true with v^δ replaced by w, and so the policy improvement step cannot be guaranteed to work. We are rescued from this problem by an easy modification of a result due to Porteus [5], which says that if H_δ corresponds to any of the operators of Section 2 (except H_{SOR} and H_{RGS}) for the policy δ and we define

    (Aw)(i) = max_δ (H_δ w)(i),    for i ∈ S,

then there always exists a δ such that H_δ w = Aw and, if we put

    v_R = Aw + ½ [β̄ ū(w)/(1 - β̄) + β̲ u̲(w)/(1 - β̲)] e,

where

    ū(w) = max_i (Aw - w)(i),    u̲(w) = min_i (Aw - w)(i),
    β̄ = max_i (Ae)(i) if ū(w) ≥ 0, and β̄ = min_i (Ae)(i) otherwise,
    β̲ = max_i (H_δ e)(i) if u̲(w) < 0, and β̲ = min_i (H_δ e)(i) otherwise,

then

    max { |v_R - v^δ|_∞, |v_R - v|_∞, |v - v^δ|_∞ } ≤ ½ [β̄ ū(w)/(1 - β̄) - β̲ u̲(w)/(1 - β̲)].    (9)
It is easy to verify, by specializing to a single policy, that this is a proper generalization of the stopping criterion of Section 2, thus justifying the use of the same notation. The natural procedure is now to replace step 3' of the policy iteration algorithm by

Step 3''. Let δ' satisfy H_{δ'} w = Aw.
(i) If the RHS of (9) is less than ε, then δ' is ε-optimal; STOP.
(ii) Otherwise, put δ = δ' and go to step 2'.

If this version terminates, it does so with an ε-optimal policy. However, we cannot guarantee termination. Porteus [6] shows that if the ε's used to stop the value determination iterations tend to 0, then the method must terminate. It is more convenient and natural to fix ε. In practice we have never encountered a problem which failed to terminate.

The use of iterative methods for policy evaluation allows us to use any good starting value in such methods. Consequently, in the second and subsequent passes through step 2', we can initiate value determination with

    v_1(i) = (Aw)(i) + φ(w)(H_δ e)(i),    (10)

where w and δ are as defined at the end of step 3'', and φ is defined in terms of the policy δ as in Section 2. Experimentation has shown that this technique uniformly improves all methods on all problems, so we will use it in all our computation.

In a degenerate problem with one action per state, if we complete step 2' with w, then |w - v|_∞ < ε and so we would hope step 3'' would stop the algorithm, since v^δ = v for the one policy. However, experiments have shown that if, say, we use Gauss-Seidel in step 2' and A as in Section 1, then w may fail the stopping test. This is because the stopping regions for H_{PJ} and H_{GS}, say, are different for a given ε > 0, in general, even though both are contained in the ε-optimal region. When the trick mentioned in the last paragraph is used, this corresponds to performing further iterations of H_{GS}; if the trick is not used, the algorithm can cycle indefinitely. If this situation can occur for degenerate problems, it could increase computational time in nondegenerate ones. For this reason we consider the possibility of using an A in step 3'' to match the H used in step 2'. We denote by A^{PJ} the conventional A described in Section 1, and A^{GS} will then refer to a Gauss-Seidel improvement step. Thus when GS4 is used in step 2' we try both A^{PJ} and A^{GS} in step 3''.

Two obvious methods exist for initiating policy iteration (step 1):
(i) choose δ(i) = 1 for all i ∈ S (arbitrary);
(ii) choose δ to satisfy r_i^{δ(i)} = max_k r_i^k for all i ∈ S (myopic).

With (i) we initiate step 2' with v_1 = 0. With (ii) we initiate step 2' by putting w = 0 in (10). A priori, we might expect (ii) to require fewer iterations but (i) to require less effort in step 1. When using GS4 with A^{GS}, if δ' is the policy chosen in step 3'', we have to calculate H_{δ'}^{GS} e in step 3'' for the stopping tests and extrapolation, but we then have to calculate these row sums again in step 2', since reordering of S takes place before H_{GS} is used again. This imposes an additional penalty on reordering.
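As a sketch of step 3'' with the pre-Jacobi improvement operator A^{PJ} (in which case every row sum (H_δ e)(i) equals β, so (9) and (10) simplify), the following illustrative Python fragment returns the greedy policy, the ε-optimality test and the warm start (10); it is not the authors' code.

```python
import numpy as np

def improvement_step(w, r, p, beta=0.9, eps=1e-3):
    """Step 3'' with A^PJ: greedy policy, bound (9) and warm start (10)."""
    M = len(r)
    Aw = np.empty(M)
    delta = [0] * M
    for i in range(M):
        q = [r[i][k] + beta * np.dot(p[i][k], w) for k in range(len(r[i]))]
        delta[i] = int(np.argmax(q))
        Aw[i] = q[delta[i]]

    u_upper = np.max(Aw - w)
    u_lower = np.min(Aw - w)
    bound = 0.5 * beta * (u_upper - u_lower) / (1.0 - beta)       # RHS of (9)
    v1 = Aw + beta * (u_upper + u_lower) / (2.0 * (1.0 - beta))   # warm start (10)
    return delta, bound < eps, v1
```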
Table 2

      Method used   Method used   Method used   Reordering?   Class 1        Class 2        Class 3
      in step 2'    in step 3''   in step 1
 1.   H_PJ          A^PJ          ARB           NA            1.71 (0.16)    0.39 (0.05)    0.16 (0.04)
 2.   H_PJ          A^PJ          MY            NA            1.38 (0.23)    0.33 (0.06)    0.20 (0.06)
 3.   GS4           A^PJ          ARB           No            1.45 (0.10)    0.44 (0.06)    0.16 (0.04)
 4.   GS4           A^PJ          MY            No            1.17 (0.19)    0.37 (0.07)    0.20
 5.   GS4           A^PJ          ARB           Yes           1.52 (0.10)    0.45           0.19
 6.   GS4           A^PJ          MY            Yes           1.22 (0.17)    0.38 (0.07)    0.16 (0.04)
 7.   GS4           A^GS          ARB           No            1.48 (0.12)    0.70 (0.08)    0.33 (0.06)
 8.   GS4           A^GS          MY            No            1.30 (0.12)    0.71 (0.10)    0.33 (0.07)
 9.   GS4           A^GS          ARB           Yes           1.62 (0.13)    0.72 (0.08)    0.33 (0.06)
10.   GS4           A^GS          MY            Yes           1.43 (0.14)    0.73 (0.10)    0.22 (0.07)

Entries are T (σ), where T = average time (s) and σ = standard deviation of time (s). ARB = arbitrary starting policy, MY = myopic starting policy.
The use of (10) to initiate step 2' often reduces the number of iterations needed for this step, thus increasing the proportion of time devoted to reordering. For these reasons, whenever GS4 is used in step 2' we compare its performance with GS4 without reordering.

5. POLICY IMPROVEMENT - COMPUTATIONAL RESULTS
Our computational results are set out in Table 2. We examined 15 problems with 100 states and an average of 4.5 actions per state (class 1), 15 problems with 40 states and an average of 35 actions per state (class 2) and 15 problems with 10 states and an average of 250 actions per state (class 3). Details of how these problems were generated are given in [9]. In the table we have labelled the method used in step 2' GS4 whether or not reordering is used. For the problems with fewer states, especially those in class 3, we should really have included Gaussian elimination, but our main object in classes 2 and 3 was to examine problems with many actions per state, and the reduction in state size was forced by the decision not to use backing store during computation. We have not given iteration counts, but for nearly 97% of problems this is 3 or 4, and for all methods except those using A^{GS} the number of iterations depends only on the starting policy.

An immediate conclusion from the results is that the myopic starting policy is generally valuable, although in a few individual problems it can increase the computation time. It also appears that using A^{PJ} for step 3'' is the best choice, even if H_{GS} is used in step 2', although the relative disadvantage of A^{GS} reduces as the average number of actions per state reduces. This is not surprising in view of the fact that the proportion of time spent in step 3'' is least in class 1, and this in turn suggests that using A^{GS} may be worthwhile in, say, problems with at most 2 actions per state. Recall also that, unless our trick for initializing step 2' is used, the use of A^{PJ} with H_{GS} may cycle indefinitely. We can also observe that when A^{GS} is used, reordering does not seem worthwhile, but it is a little more surprising that reordering is uncompetitive (albeit by a small margin) even if A^{PJ} is used. However, we would expect this to change if (1 - β) decreases. Finally, we can see that H_{PJ} in step 2' is useful (just) in classes 2 and 3 but not for class 1. We may expect the relative advantage of GS4 to increase as the state size increases and the number of actions decreases. We will consider methods 2 and 4 in the next two sections, in which we discuss action elimination.

6. ACTION ELIMINATION - METHODS
The bounds underlying the stopping rules of Section 4 also provide a means of identifying actions which cannot be used in any optimal policy. If these actions are removed from the appropriate action
sets, we may expect to reduce the computation time required to perform step 3'' in subsequent iterations. By regarding the calculation of Aw in step 3'' as a single value iteration, we may use the results of [5] to give elimination procedures which are computationally efficient. The most efficient scheme found in [9] is due to MacQueen [2] and eliminates k from K_i, where i ∈ S, if

    r_i^k + β Σ_{j=1}^{M} p_{ij}^k w(j) < (Aw)(i) + β[u̲(w) - ū(w)]/(1 - β),    (MQ)
where ū(w) and u̲(w) are as defined in Section 4, and we have assumed A^{PJ} is used in step 3''. (We shall maintain this assumption throughout this and the next section and therefore drop the superscript.) To implement (MQ) efficiently we must store the LHS of (MQ) for each i ∈ S and k ∈ K_i when calculating Aw. We then test for stopping. If the test is failed, since we have the information required to evaluate the RHS of (MQ), we can eliminate nonoptimal actions. To avoid storing the LHS of (MQ) we could recalculate these quantities, but this requires much more time [9]; the best method not involving storage is due to Porteus and eliminates k if
    r_i^k + β Σ_{j=1}^{M} p_{ij}^k (Aw)(j) < max_{ℓ=1,...,k-1} { r_i^ℓ + β Σ_{j=1}^{M} p_{ij}^ℓ (Aw)(j) } + β^2 [u̲(w) - ū(w)]/(1 - β).    (P)
For any k ∈ K_i the RHS of (P) is known without extra calculation in the normal course of evaluating A(Aw). However, we do have to modify step 3'' to calculate first Aw and then A(Aw). It is during the second stage that we can perform the elimination. If δ satisfies H_δ(Aw) = A^2 w, then δ is fed to step 2', which is initialized with

    v_1 = A^2 w + β[ū(Aw) + u̲(Aw)]/[2(1 - β)].
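To illustrate the MacQueen test (MQ) under the same A^{PJ} assumption, here is a minimal Python sketch (not the authors' code) that computes Aw, records the left-hand sides, and prunes actions that can be eliminated.

```python
import numpy as np

def macqueen_eliminate(w, r, p, active, beta=0.9):
    """Prune actions that fail the MacQueen test (MQ); illustrative sketch only.

    active[i] is the list of still-active actions in state i and is pruned
    in place.  Returns the improved values Aw.
    """
    M = len(r)
    Aw = np.empty(M)
    q = []                                         # stored LHS values of (MQ)
    for i in range(M):
        q_i = {k: r[i][k] + beta * np.dot(p[i][k], w) for k in active[i]}
        q.append(q_i)
        Aw[i] = max(q_i.values())

    u_upper = np.max(Aw - w)
    u_lower = np.min(Aw - w)
    slack = beta * (u_lower - u_upper) / (1.0 - beta)
    for i in range(M):
        # keep action k only if its stored LHS is not provably suboptimal
        active[i] = [k for k in active[i] if q[i][k] >= Aw[i] + slack]
    return Aw
```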
- p).
ELIMINATION-COMPUTATIONAL
RESULTS
We display the computational results in Table 3. It should be noted that when Porteus elimination is used the policy iteration method is modified to compute A2w in step 3”. The obvious conclusion is that action elimination is worthwhile provided that MacQueen elimination is used. However, the relative advantage of action elimination is much smaller for policy iteration than value iteration although, as might be expected, when the proportion of time required for step 3” is taken into account, elimination becomes more profitable as the number of actions per state increases. 8. CONCLUSION
We have evaluated by computational experiment several ideas which have been suggested for the implementation of policy iteration. Our objective has been to suggest which of these ideas is really Table 3
Policy iteration method
Action elimination
2
None
2
MQ
2
P
4
None
.I
MQ
1.11 (0.19)
4
P
I.17 (0.191
procedure
T = average time (s). P = standard deviation (s).
1.38 (0.23) 1.33 (0.22) 1.38 (0.22) 1.17 (0.19)
0.33 (0.06) 0.26 (0.05) 0 33 (0061 0.37 (0.07) 0.30 10.05) 0.36 (0.06)
0. I 6 (0.W 0. I 3 (0.02) 0. I 7 (0.03) 0. I 6 (OW 0.14 (0.02) 0.17 (0.03)
valuable rather than to provide a best method, and conclusions on these lines are included in Sections 3, 5 and 7. To reiterate the results quoted there, it does seem that row sum extrapolation has a very marked effect with whatever scheme is employed. There does not seem to be a great deal to be gained by a sophisticated policy improvement technique over the simple pre-Jacobi technique. However, using a myopic starting rule does seem beneficial. Action elimination also seems to be worthwhile, though the advantage gained is considerably less than when it is applied in value iteration.

Acknowledgements - The authors are grateful to the Social Science Research Council for their financial support. We are also indebted to Professor D. J. White and Dr S. French for stimulating discussions.

REFERENCES

1. R. Howard, Dynamic Programming and Markov Processes. MIT Press, Cambridge (1960).
2. J. MacQueen, A test for suboptimal actions in Markovian decision problems. Opns Res. 15, 559-561 (1969).
3. T. L. Morin, Computational advances and reduction of dimensionality in dynamic programming: a survey. In Dynamic Programming and its Applications (Edited by M. L. Puterman). Academic Press, New York (1979).
4. T. E. Morton and W. R. Wecker, Discounting, ergodicity and convergence for Markov decision processes. Mgmt Sci. 23, 559-561 (1977).
5. E. Porteus, Bounds and transformations for finite Markov decision chains. Opns Res. 23, 761-784 (1975).
6. E. Porteus, Improved iterative computation of the expected discounted return in Markov and semi-Markov chains. Z. Op. Res. 24, 155-170 (1980).
7. E. Porteus, Computing the discounted return in Markov and semi-Markov chains. Nav. Res. Logist. Q. 28, 567-578 (1981).
8. E. Porteus and J. Totten, Accelerated computation of the expected discounted return in a Markov chain. Opns Res. 26, 350-358 (1978).
9. L. C. Thomas, R. Hartley and A. C. Lavercombe, Computational comparison of value iteration algorithms for discounted Markov decision processes. Opns Res. Lett. 2, 72-76 (1983).
10. D. Young, Iterative Solution of Large Linear Systems. Academic Press, New York (1971).

APPENDIX

The reversible Gauss-Seidel scheme with extrapolation falls under the general class of methods described by

    v_{n+1/2} = H_{GS}(v_n + λ_n d),    v_{n+1} = H̃_{GS}(v_{n+1/2} + λ_{n+1/2} d),

where λ_m depends only on v_t for t ≤ m. We must initially calculate, for i ∈ S,

    f(i) = β Σ_{j=i+1}^{M} p_{ij} d(j),    g(i) = β Σ_{j=1}^{i-1} p_{ij} d(j).

It then follows from the definitions that

    v_{n+1/2}(i) = v_n(i) + [λ_n f(i) - λ_{n-1/2} g(i) + β Σ_{j=1}^{i-1} p_{ij}[v_{n+1/2}(j) - v_{n-1/2}(j)]] / (1 - βp_{ii}),    (A1)

    v_{n+1}(i) = v_{n+1/2}(i) + [λ_{n+1/2} g(i) - λ_n f(i) + β Σ_{j=i+1}^{M} p_{ij}[v_{n+1}(j) - v_n(j)]] / (1 - βp_{ii}),    (A2)

where (A1) holds for n ≥ 1 and (A2) for n ≥ 0. In our implementation of H_{RGS}^{RSE} we had d = H_{RGS} e, λ_n = φ(v_{n-1}) and λ_{n+1/2} = 0 for all n. For this case we have to calculate initially H_{GS} e and H_{RGS} e, by

    (H_{RGS} e)(i) = (H_{GS} e)(i) + β Σ_{j=i+1}^{M} p_{ij}[(H_{RGS} e)(j) - 1] / (1 - βp_{ii}),

and

    f(i) = β Σ_{j=i+1}^{M} p_{ij}(H_{RGS} e)(j),

which is approximately two extra full iterations (compared with one extra iteration to perform RSE for other methods). However, our result is sufficiently general to be applicable to other extrapolation methods (see [5]) or to extrapolation every half iteration (which requires approximately three extra iterations to initiate).