Automatica, Vol. 17, No. 2, pp. 281-296, 1981
Optimal Control of Markov Chains Admitting Strong and Weak Interactions*

FRANÇOIS DELEBECQUE† and JEAN-PIERRE QUADRAT†

A Markov chain with an almost block diagonal structure with N blocks can be controlled by a policy improvement algorithm involving only decentralized computations within the N blocks and computations relative to an aggregate N-state Markov chain.

Key Words--Discrete systems; dynamic programming; modelling; optimal control; perturbation techniques; large-scale systems; Markov processes.

Abstract--This paper is devoted to the study of an optimal control problem for a Markov chain with generator $B + \varepsilon A$, where $\varepsilon$ is a small parameter. It is shown that an approximate solution can be calculated by a policy improvement algorithm involving computations relative to an 'aggregated' problem (the dimension of which is given by $N$, the number of ergodic sets for the $B$ matrix) together with a family of 'decentralized' problems (the dimensions of which are given by the number of elements in each ergodic set for the $B$ matrix).

INTRODUCTION

THE PURPOSE of this paper is to show how some large scale stochastic control problems involving a small parameter $\varepsilon$ can be approximated in such a way that they become numerically tractable. We study the optimal control problem of Markov chains admitting two scales of transition probabilities that we call 'interactions', and the small parameter $\varepsilon$ will characterize the ratio of these two scales. As shown in the first part of the paper with a specific example, this kind of problem arises for instance when (continuous time) Markov processes admitting two scales of time are discretized: the 'slow' part of the system leads to the 'weak' interactions and the 'fast' part to the 'strong' ones. We study the behavior of the chain over a period of order $1/\varepsilon$, so that the weak interactions cannot be neglected.

The linear case (without control) has been studied with a different approach by several authors (see Simon and Ando, 1961; Gaitsgori and Pervozvanskii, 1975; Pervozvanskii and Smirnov, 1974). The book by Courtois (1977) gives many examples of applications of 'almost decomposable' Markov chains in the field of queuing systems. We extend some of these results to the case when transient states appear in the fast chain. In the non-linear case, we restrict the study here to the case where the recurrent-transient structure of the fast chain cannot be changed by the control. Veinott has considered the same kind of control problems in several papers (e.g. Miller and Veinott, 1969; Veinott, 1969), in a case where no reduction of dimensionality appears (no weak interactions). Our main contribution is that there is a reduction of dimensionality to the number of ergodic sets of the 'fast' chain. For instance, in the purely recurrent case we show that the stochastic control problem can be reduced to the control problem of an aggregate chain whose states are the ergodic sets of the fast chain, where the costs associated to each aggregate state are themselves solutions of ergodic stochastic control problems (with constraints). So, if the original problem admits $n$ states, we deal only with problems having either $N$ states (the number of ergodic sets) or $N_i$ states (the number of states in the $i$th ergodic set).

In the first part of the paper a simple but concrete motivating example is given. The second part is devoted to the study of perturbed Markov chains: first the linear case (without control) is presented; then we study the control problem and we give an algorithm to compute an approximate solution. In the last part, we apply the results to the introductory example and we give a simple numerical example to show the effectiveness of the method.

*Received 29 December 1978; revised 25 February 1980; revised 15 August 1980. The original version of this paper was presented at the IFAC/INRIA Workshop on Singular Perturbations which was held in Rocquencourt, France during June 1978. Information about this IFAC Meeting may be obtained from: INRIA, Institut National de Recherche en Informatique et en Automatique, Domaine de Voluceau-Rocquencourt, B.P. 105, 78150 Le Chesnay, France. This paper was recommended for publication in revised form by associate editor P. Kokotovic.
†INRIA-Laboria, Domaine de Voluceau, 78150 Le Chesnay, France.
1. AN INTRODUCTORY EXAMPLE
In this part we show by an example how a small parameter $\varepsilon$ naturally arises in stochastic control problems admitting two time scales, and we give the discrete version of this kind of problem.

We consider the problem of meeting a demand for electric power with the power $e_t$ produced by a dam. The notations will be the following:

$a_t = a$: the flow of water input (a known deterministic constant; this is not a restriction for our purpose);
$y_t$: the stock of water (state variable);
$u_t$: the water output (control variable), $0 \le u_t \le \bar u$;
$e(y,u)$: the power produced by the dam when the stock of water is $y$ and the control is $u$.

[FIG. 1. Management of a dam.]

The stock of water evolves according to:

$$dy_t = \phi_t(y_t, u_t)\,dt$$

where

$$\phi_t(y,u) = \begin{cases} (a-u)^+ & \text{if } y = 0 \\ a-u & \text{if } 0 < y < \bar y \\ -(a-u)^- & \text{if } y = \bar y \end{cases}$$

and $\bar y$ is the capacity of storage. If the demand $z_t$ for electrical power is modelled in the short run (unit of time = one week) by a diffusion-type process

$$dz_{t'} = b(z_{t'})\,dt' + \sigma(z_{t'})\,dW_{t'},$$

then, in the long run (unit of time = one year), we get:

$$dz_t = \frac{1}{\varepsilon}\,b(z_t)\,dt + \frac{1}{\sqrt\varepsilon}\,\sigma(z_t)\,dW_t.$$

(Here $\varepsilon = 1/52$.) We assume that the functions $b$ and $\sigma$ are such that the process $(z_t)$ admits a unique invariant probability measure $p(z)dz$ with support $[\underline z, \bar z]$.

Now if at time $t$ the power produced by the dam is $e_t$ and the demand for electrical power is $z_t$, the cost function is $f(t) = c(e_t - z_t)$ and the optimal management of the dam is given by the stochastic control problem:

$$\min_u E \int_0^\infty e^{-\mu t} f_t\,dt$$

where $\mu$ is a given actualisation rate. The best control policy $u(y_t, z_t)$ is obtained by solving the Hamilton-Jacobi-Bellman equation (1) satisfied by the function:

$$V_\varepsilon(y,z) = \min_u E_{y,z} \int_0^\infty e^{-\mu t} c[e(y_t, u(y_t,z_t)) - z_t]\,dt.$$

[Here $E_{y,z}$ means the expectation for the $(y_t,z_t)$ process starting from $(y,z)$.] This equation can (at least formally) be written:

$$-\mu V_\varepsilon(y,z) + \min_u \left\{ A(u)V_\varepsilon(y,z) + \frac{1}{\varepsilon}\,B V_\varepsilon(y,z) + f(y,z,u) \right\} = 0 \qquad (1)$$

where

$$A(u)V(y,z) = \phi_t(y,u)\,\frac{\partial V}{\partial y}(y,z),$$
$$BV(y,z) = b(z)\,\frac{\partial V}{\partial z}(y,z) + \frac{\sigma^2(z)}{2}\,\frac{\partial^2 V}{\partial z^2}(y,z),$$
$$f(y,z,u) = c[e(y,u) - z].$$

It turns out that it is always possible to discretize equations of type (1) in such a way that they become the dynamic programming equations for a controlled Markov chain (see Kushner, 1976 for the general case). Indeed, let us introduce a grid $G_h$ of step $h$ for the coordinates $(y,z)$ and a time step $\Delta t$.

[FIG. 2. Discretization grid.]

The discrete state variables $(Y_m, Z_m)$ are now taking their values $(ih, \underline z + jh)$, $i = 1,\dots,I$; $j = 1,\dots,J$, in the grid $G_h$ at the discrete instants $0, \Delta t, \dots, m\Delta t, \dots$
We can approximate the operators $A(u)$ and $B$ by their finite difference counterparts as follows:

$$A_h(u)V(i,j) = \frac{\Delta t}{h}\,\phi^+(ih,u)\,V(i+1,j) + \frac{\Delta t}{h}\,\phi^-(ih,u)\,V(i-1,j) - \frac{\Delta t}{h}\,|\phi(ih,u)|\,V(i,j) \qquad (2)$$

$$B_h V(i,j) = \Delta t\left(\frac{b^+}{h} + \frac{\sigma^2}{2h^2}\right)(i,j)\,V(i,j+1) + \Delta t\left(\frac{b^-}{h} + \frac{\sigma^2}{2h^2}\right)(i,j)\,V(i,j-1) - \Delta t\left(\frac{|b|}{h} + \frac{\sigma^2}{h^2}\right)(i,j)\,V(i,j) \qquad (3)$$

[$f^+$ (respectively $f^-$) denotes the positive (respectively negative) part of $f$, and $f_h(i,j) = f(ih, \underline z + jh)$.] With these notations the equation (1) becomes

$$-\mu\Delta t\,V_\varepsilon(i,j) + \min_u\left\{ A_h(u)V_\varepsilon(i,j) + \frac{1}{\varepsilon}\,B_h V_\varepsilon(i,j) + f_h(u,i,j)\,\Delta t \right\} = 0. \qquad (4)$$

We remark that the $(I \times J,\ I \times J)$ matrices $A_h(u)$ and $B_h$ have the following properties: the off-diagonal elements are non-negative, the diagonal elements are non-positive, and for all $(i,j)$

$$\sum_{i',j'} A_h(u)_{i,j;i',j'} = 0, \qquad \sum_{i',j'} B_h(i,j;i',j') = 0.$$

Moreover, for $\Delta t$ small enough the off-diagonal elements of $A_h(u)$ and $B_h$ are less than one, so that $A_h(u)$ and $B_h$ can be interpreted as 'generators' of two discrete time Markov chains with transition probability matrices $[A_h(u)+I]$ and $[B_h+I]$. (If $P$ is the transition probability matrix of a Markov chain we define its 'generator' as $P - I$, where $I$ is the identity matrix.) Now, equation (4) is the dynamic programming equation for the Markov chain $(Y_m, Z_m)$ with generator $A_h(u) + (1/\varepsilon)B_h$ and cost function:

$$V_\varepsilon(i,j) = \min_u E_{i,j} \sum_{m=0}^\infty \left(\frac{1}{1+\mu\Delta t}\right)^{m+1} f(Y_m, Z_m, u(Y_m,Z_m))\,\Delta t. \qquad (5)$$

Remark. For $\Delta t = 1$, (4) is the dynamic programming equation for the control problem of the continuous time $G_h$-valued Markov chain $(Y_t, Z_t)$ with generator $A_h(u) + (1/\varepsilon)B_h$ and cost function:

$$\min_u E_{i,j} \int_0^\infty e^{-\mu t} f(Y_t, Z_t, u(Y_t, Z_t))\,dt.$$

Of course both interpretations are valid, but in the sequel we shall restrict the study to discrete time Markov chains. □
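To make the construction (2)-(4) concrete, here is a minimal numerical sketch of the two finite-difference generators for the dam example. All concrete choices ($\phi$, $b$, $\sigma$, the grid sizes, the boundary treatment) are illustrative assumptions of ours, not data from the paper; the sketch only illustrates that the matrices built from (2)-(3) are generators (non-negative off-diagonal entries, zero row sums), with $A_h$ acting on the index $i$ and $B_h$ block-diagonal in $i$.

```python
import numpy as np

# Hypothetical problem data for the dam example (illustrative only).
I, J, h, dt = 8, 6, 0.5, 0.1          # grid sizes, space step, time step
a, ybar = 1.0, (I - 1) * h            # inflow and storage capacity

def phi(y, u):                         # slow drift of the stock of water
    if y <= 0.0:   return max(a - u, 0.0)
    if y >= ybar:  return min(a - u, 0.0)
    return a - u

def b(z):   return -0.5 * z            # demand drift (placeholder)
def sig(z): return 0.3                 # demand volatility (placeholder)

def idx(i, j):                         # flatten (i, j) into a row index
    return i * J + j

def build_Ah(u):
    """Finite-difference generator of the slow variable, eq. (2)."""
    A = np.zeros((I * J, I * J))
    for i in range(I):
        for j in range(J):
            p, r = phi(i * h, u), idx(i, j)
            # phi vanishes in the outgoing direction at the boundaries,
            # so no probability mass is lost there.
            if p > 0: A[r, idx(i + 1, j)] += dt * p / h
            if p < 0: A[r, idx(i - 1, j)] += dt * (-p) / h
            A[r, r] -= dt * abs(p) / h
    return A

def build_Bh():
    """Finite-difference generator of the fast variable, eq. (3)."""
    B = np.zeros((I * J, I * J))
    for i in range(I):
        for j in range(J):
            z, r = j * h, idx(i, j)
            up = dt * (max(b(z), 0) / h + sig(z) ** 2 / (2 * h * h))
            dn = dt * (max(-b(z), 0) / h + sig(z) ** 2 / (2 * h * h))
            if j + 1 < J: B[r, idx(i, j + 1)] += up
            else:         B[r, r] += up     # reflect at the boundary (one choice)
            if j - 1 >= 0: B[r, idx(i, j - 1)] += dn
            else:          B[r, r] += dn
            B[r, r] -= up + dn
    return B

Ah, Bh = build_Ah(u=0.8), build_Bh()
assert np.allclose(Ah.sum(axis=1), 0) and np.allclose(Bh.sum(axis=1), 0)
```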
We can check two important properties of the problem: $B_h$ is a block-diagonal matrix, and $B_h$ appears in (4) with the coefficient $1/\varepsilon$. In the last part of the paper we shall see, on this particular example, that the stochastic control problem (5) can be reduced to the solution of a 'long run' stochastic control problem for the 'slow' process $(Y_m)$ together with a family of static optimization problems (indexed by the 'slow' variable $y$, which becomes a parameter in these problems). In so doing we can find an 'almost optimal' control, in the sense that the Bellman function $V_\varepsilon(y,z)$ is approximated up to terms of order $\varepsilon$. Moreover, we shall give an algorithm to compute this approximate control which requires only inversions of $I \times I$ or $J \times J$ matrices, so we are able to solve approximately the stochastic control problem (1) without dealing with the 'large scale' system $(y_t, z_t)$.

2. STOCHASTIC CONTROL PROBLEMS FOR MARKOV CHAINS WITH STRONG AND WEAK INTERACTIONS

The purpose of this part is to study the asymptotic behavior, when $\varepsilon$ is small, of the following dynamic programming equation:

$$-\varepsilon\mu V_\varepsilon(x) + \min_u\left\{ (B(u) + \varepsilon A(u))V_\varepsilon(x) + \varepsilon\mu f(x,u) \right\} = 0.$$

Equation (4) of the preceding section is of this kind. In Section 2.1 we define the perturbed stochastic control problem which leads to this equation. In Section 2.2, in the case of a fixed given control, we give $V_0 = \lim_{\varepsilon\to 0} V_\varepsilon$ as a functional of an aggregate chain. In Section 2.3 we study the controlled case and give a decentralized-aggregated policy improvement algorithm to compute $V_0$. In Section 2.4 the probabilistic interpretation of $V_0$ is given.
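For reference, this dynamic programming equation can in principle be solved directly by successive approximation, since one step is a contraction of modulus $1/(1+\varepsilon\mu)$. A short sketch (our own generic setup, with a finite control set and the matrices supplied as arrays) is given below; it is precisely this brute-force route that becomes intractable for large $n$ and small $\varepsilon$, which motivates the aggregation developed next.

```python
import numpy as np

def solve_dp(B, A, f, eps, mu, tol=1e-10):
    """Value iteration for -eps*mu*V + min_u[(B(u)+eps*A(u))V + eps*mu*f(u)] = 0.
    B, A: (n_controls, n, n) generator arrays; f: (n_controls, n).
    Assumes I + B(u) + eps*A(u) is a stochastic matrix (true for eps small)."""
    n = B.shape[1]
    V = np.zeros(n)
    beta = 1.0 / (1.0 + eps * mu)                  # one-step discount factor
    while True:
        Q = np.stack([(np.eye(n) + Bu + eps * Au) @ V + eps * mu * fu
                      for Bu, Au, fu in zip(B, A, f)])
        V_new = beta * Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=0)         # value and a minimizing strategy
        V = V_new
```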
2.1 The perturbed stochastic control problem

Let us first introduce some notations. $(X_m)$ will denote a discrete time Markov chain with values in $E$, a finite state space with $n$ elements. The transition probability matrix $P$ of the chain is assumed to depend continuously on a control variable $u$ with values in $\mathcal U$, a fixed compact set of $R^k$. We denote its elements by $P(x,y,u)$. Given a strategy $S$, that is a mapping $S: E \to \mathcal U$, the chain evolves with the transition probability matrix $P^S$ defined by $P^S(x,y) = P(x,y,S(x))$. Given a cost function $f(x,u)$ depending continuously on $u$, we define the 'function' $f^S$ (column vector) by $f^S(x) = f(x,S(x))$ and the function $[P^S]^m f^S$ by

$$[P^S]^m f^S(x) = \sum_y [P^S]^m(x,y)\,f^S(y) = E_x^S f^S(X_m) \qquad (m = 1,2,\dots)$$

where $E_x^S$ denotes the expectation for the Markov chain $(X_m)$ with transition probability matrix $P^S$ and starting point $x$ ($X_0 = x$). Similarly, if $p$ is a 'measure' on $E$ (row vector) we define a new measure $p[P^S]^m$ by

$$p[P^S]^m(x) = \sum_y p(y)\,[P^S]^m(y,x) = E_p^S 1_x(X_m).$$

Given an actualisation rate $1/(1+\lambda)$ ($\lambda > 0$) we introduce $R_\lambda^S$, the resolvent matrix of $P^S$, defined by:

$$R_\lambda^S f^S(x) = E_x^S \sum_{m=0}^\infty \left(\frac{1}{1+\lambda}\right)^{m+1} f^S(X_m). \qquad (6)$$

Note that $R_\lambda^S$ satisfies the maximum principle: $f^S \le 0$ implies $R_\lambda^S f^S \le 0$, with strict inequality on $\{f^S < 0\}$. We define also the function $V_\lambda^S$ by $V_\lambda^S(x) = \lambda R_\lambda^S f^S(x)$. So, if $G^S = P^S - I$ is the generator of the chain $(X_m)$, $V_\lambda^S$ is the unique solution to

$$-\lambda V_\lambda^S(x) + G^S V_\lambda^S(x) + \lambda f^S(x) = 0. \qquad (7)$$

The stochastic control problem is then to find

$$V_\lambda^*(x) = \min_{S\in\mathcal S} V_\lambda^S(x) \qquad (8)$$

where $\mathcal S$ is the class of all the Markov strategies (that is, the compact set $\mathcal U^n$). We recall (see e.g. Derman, 1970; Ross, 1969) that $V_\lambda^*$ is the unique solution of

$$-\lambda V^*(x) + \min_u\left[ G(u)V^*(x) + \lambda f(x,u) \right] = 0. \qquad (9)$$

Now we assume that the generator $G_\varepsilon(u) = P(u) - I$ of the chain admits, for all fixed $u$, the representation

$$G_\varepsilon(u) = B(u) + \varepsilon A(u)$$

where $B(u)$ and $A(u)$ can both be interpreted as generators of discrete time Markov chains, called respectively 'fast' and 'slow' in the sequel, according to the introductory example. It is always possible to label the states in such a way that the $B$ matrix takes the block form of Fig. 3: $N$ irreducible blocks $B_{\bar x_1}, \dots, B_{\bar x_N}$ along the diagonal, and a last group of rows $(B_{T,\bar x_1}, \dots, B_{T,\bar x_N}, B_T)$ associated with the transient states.

[FIG. 3. $B(u)$ matrix.]

$B$ satisfies

$$B(x,y,u) \ge 0 \quad \text{for } x \ne y, \qquad B(x,x,u) < 0, \qquad \sum_y B(x,y,u) = 0 \quad \text{for all } x.$$

This corresponds to a partition of the states into $N$ ergodic sets $(\bar x_1, \dots, \bar x_N)$ and a set of transient states. We shall denote by $R$ the set of recurrent states and by $T$ the set of transient ones. The matrix $A(u)$ can be considered as the generator of a Markov chain if we set

$$A(x,x,u) = -\sum_{y\ne x} A(x,y,u).$$

$A(u)$ satisfies:

$$A(x,y,u) \ge 0 \quad \text{for } x \ne y, \qquad A(x,x,u) \le 0, \qquad \sum_y A(x,y,u) = 0 \quad \text{for all } x.$$

From now on we shall assume that $\lambda = \varepsilon\mu$.
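For a fixed strategy $S$, (7) is a plain linear system; a two-line sketch (with made-up numbers) is:

```python
import numpy as np

def value_of_strategy(PS, fS, lam):
    """V_lam^S solving -lam*V + (P^S - I)V + lam*f^S = 0, i.e.
    ((1+lam)I - P^S) V = lam * f^S, equivalently V = lam * R_lam^S f^S."""
    n = PS.shape[0]
    return np.linalg.solve((1.0 + lam) * np.eye(n) - PS, lam * fS)

# tiny example: a 2-state chain
PS = np.array([[0.9, 0.1],
               [0.2, 0.8]])
fS = np.array([1.0, 0.0])
V = value_of_strategy(PS, fS, lam=0.05)
```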
This corresponds to studying the Markov chain $(X_m)$ with a time scale of order $1/\varepsilon$: indeed, $V_\lambda^S(x)$ can be interpreted as $E_x^S f^S(X_\nu)$, where $\nu$ is a geometric random variable, independent of the chain $(X_m)$, with $P(\nu = m) = \lambda/(1+\lambda)^{m+1}$. Note that $P(\nu = n \mid \nu > n) = \lambda/(1+\lambda)$ = constant of order $\lambda$. We have $E\nu = 1/\lambda$, so $\nu$ is a (random) time of order $1/\varepsilon\mu$. When $\varepsilon$ converges to zero the weak interactions also converge to zero, but they cannot be neglected because the period of study is of order $1/\varepsilon$. Also note that the behavior of the chain can be quite different after a period of time of order $1/\varepsilon^2$.

We shall now study the asymptotic expansion of $V_\varepsilon = V_\varepsilon^S$ (linear case) and $V_\varepsilon^* = V_\lambda^*$ (non-linear case) for a chain with generator $B + \varepsilon A$. In the linear case ($S$ fixed) we drop the dependence on $S$ and denote $A(x,y) = A(x,y,S(x))$, $B(x,y) = B(x,y,S(x))$ and $f(x) = f(x,S(x))$.

2.2 The linear case

From (7), $V_\varepsilon$ is the unique solution to

$$-\varepsilon\mu V_\varepsilon + (B + \varepsilon A)V_\varepsilon + \varepsilon\mu f = 0 \qquad (10)$$

that is, $V_\varepsilon = \varepsilon\mu R^\varepsilon f$, where $R^\varepsilon = (\varepsilon\mu - B - \varepsilon A)^{-1}$ is the resolvent of $(B + \varepsilon A)$. We denote $A_\mu = A - \mu I$, so (10) can be rewritten

$$\left(A_\mu + \frac{1}{\varepsilon}\,B\right) V_\varepsilon + \mu f = 0. \qquad (11)$$

We need a few well known results about the $n \times n$ matrix $B$, generator of a Markov chain (see Pallu de la Barrière, 1966; Kato, 1966; Kemeny and Snell, 1960), that we briefly recall:

1. $0$ is a semi-simple (diagonalizable) eigenvalue of $B$, its multiplicity being equal to $N$, the number of ergodic sets $\bar x$ of $B$ (i.e. the number of matrices along the diagonal of $B$ in Fig. 3).

2. If $\mathcal N(B)$ denotes the kernel of $B$ (i.e. the eigenspace associated with the eigenvalue $0$) and $\mathcal R(B)$ denotes the range of $B$, then $R^n = \mathcal N(B) \oplus \mathcal R(B)$ (direct sum) and the eigenprojection on $\mathcal N(B)$, denoted by $P_B$, is given by

$$P_B f(x) = \sum_{\bar x} m_{\bar x}(f)\,q_{\bar x}(x) \qquad (12)$$

where $m_{\bar x}$ is the invariant probability measure (row vector) of the subgenerator $B_{\bar x}$:

$$m_{\bar x} B_{\bar x} = 0, \qquad (m_{\bar x}, 1) = \sum_x m_{\bar x}(x) = 1,$$

and $q_{\bar x}(x)$ (column vector) is the probability to end in the ergodic set $\bar x$ starting from $x$. (Note that $\sum_{\bar x} q_{\bar x}(x) = 1$ and $q_{\bar x} = 1$ on $\bar x$.)

[FIG. 4. $P_B$ matrix.]

3. The reduced resolvent $H_B$ of $B$ associated with the eigenvalue $0$ gives the unique solution $V = H_B f$ of the equation

$$BV + f = 0, \qquad P_B f = 0, \qquad P_B V = 0. \qquad (13)$$

It is given by $H_B = (P_B - B)^{-1} - P_B$, unique solution of the system:

$$P_B H_B = H_B P_B = 0, \qquad H_B B = B H_B = P_B - I.$$

So that the (Riesz) decomposition $f = \bar f + \tilde f$, with $\bar f \in \mathcal N(B)$ and $\tilde f \in \mathcal R(B)$, can be written $f = P_B f + H_B(-Bf)$.

4. $\lambda R_\lambda$ is a holomorphic function in the neighbourhood of $0$ with the expansion (Kato, 1966, p. 39)

$$\lambda R_\lambda = P_B + \sum_{m=1}^\infty \lambda^m (-1)^{m+1} H_B^m. \qquad (14)$$

With these notations we can now state:

Proposition 1. $V_\varepsilon$ given by (11) admits the expansion

$$V_\varepsilon = V_0 + \varepsilon V_1 + 0(\varepsilon^2) \qquad (15)$$

($0(\varepsilon^2)$ means a term of order $\varepsilon^2$), where $V_0$ and $V_1 = \bar V_1 + \tilde V_1$ are uniquely determined by:

$$V_0 \in \mathcal N(B) \ (\text{i.e. } P_B V_0 = V_0), \qquad P_B A_\mu V_0 + \mu P_B f = 0 \qquad (16)$$

$$B\tilde V_1 + A_\mu V_0 + \mu f = 0, \qquad \tilde V_1 \in \mathcal R(B) \ (\text{i.e. } P_B\tilde V_1 = 0) \qquad (17)$$

$$P_B A_\mu \bar V_1 + P_B A_\mu \tilde V_1 = 0, \qquad \bar V_1 \in \mathcal N(B).$$
Proof. 1. $V_0$ and $\tilde V_1$ are well defined: from (16) and the definition of $H_B$, $\tilde V_1$ is uniquely defined by $\tilde V_1 = H_B(A_\mu V_0 + \mu f)$. Now we give the explicit solution of the equation

$$P_B A_\mu \bar V + P_B \hat f = 0, \qquad \bar V \in \mathcal N(B).$$

Let $R_\mu(A) = (\mu I - A)^{-1}$ denote the resolvent of $A$, let $C = R_\mu(A)B$ and let $P_C$ denote the spectral projection associated with the eigenvalue $0$ of $C$: this solution is $\bar V = P_C R_\mu(A)\hat f$. Indeed, let us show that

$$P_B A_\mu P_C R_\mu(A) + P_B = 0. \qquad (18)$$

First, by the probabilistic interpretation of $W_\varepsilon$, solution of $(\varepsilon I + A_\mu^{-1}B)W_\varepsilon + \varepsilon g = 0$, we see that $\varepsilon R_\varepsilon(A_\mu^{-1}B)$ is bounded independently of $\varepsilon$. It follows, from the expansion of the resolvent (Kato, 1966, p. 39) in a Laurent series, that the eigennilpotent associated with the eigenvalue $0$ of $A_\mu^{-1}B$ vanishes. So $0$ is a semi-simple eigenvalue of $A_\mu^{-1}B$ and $R^n = \mathcal N(C) \oplus \mathcal R(C)$ (direct sum). Because $A_\mu$ is invertible, $\mathcal N(B) = \mathcal N(C)$ and we also have the direct sum $R^n = A_\mu\mathcal N(B) \oplus \mathcal R(B)$. Now if $f = Bg \in \mathcal R(B)$ we have:

$$A_\mu P_C R_\mu(A)f = A_\mu P_C R_\mu(A)Bg = A_\mu P_C Cg = 0.$$

If $f = A_\mu h$ with $h \in \mathcal N(B)$ we have

$$A_\mu P_C R_\mu(A)f = A_\mu P_C R_\mu(A)A_\mu h = -A_\mu P_C h = -A_\mu h = -f.$$

(18) is then easily verified by decomposing $f = f_1 + f_2$ with $f_1 \in A_\mu\mathcal N(B)$ and $f_2 \in \mathcal R(B)$.

2. Because $A_\mu V_1 \in \mathcal R(B)$ there exists $V_2$ such that $A_\mu V_1 + BV_2 = 0$. Let $W_\varepsilon = V_0 + \varepsilon V_1 + \varepsilon^2 V_2 - V_\varepsilon$. From (10) we obtain, with $h = A_\mu V_2$:

$$(B + \varepsilon A_\mu)W_\varepsilon = BV_0 + \varepsilon(A_\mu V_0 + BV_1 + \mu f) + \varepsilon^2(A_\mu V_1 + BV_2) + \varepsilon^3 h = \varepsilon^3 h.$$

By the probabilistic interpretation,

$$W_\varepsilon(x) = E_x \sum_{m=0}^\infty \left(\frac{1}{1+\varepsilon\mu}\right)^{m+1} \varepsilon^3 h(X_m) = 0(\varepsilon^2)$$

($E_x$ denotes the expectation for the chain with generator $B + \varepsilon A$). This completes the proof. □

Corollary. The pair $(V_0, V_1)$ is also the unique solution of

(i) $BV_0 = 0$
(ii) $A_\mu V_0 + BV_1 + \mu f = 0$ \qquad (19)
(iii) $P_B A_\mu V_1 = 0$.

Proof. We need only to check the uniqueness. But if there exist two distinct solutions $(V_0^1, V_1^1)$ and $(V_0^2, V_1^2)$, then $W_0 = V_0^1 - V_0^2 \in \mathcal N(B)$ and $A_\mu W_0 = B(V_1^2 - V_1^1) \in \mathcal R(B)$, which contradicts $A_\mu\mathcal N(B) \cap \mathcal R(B) = \{0\}$. This shows that (i) and (ii) uniquely determine $V_0$, together with the part $\tilde V_1 = H_B(A_\mu V_0 + \mu f)$ of $V_1$ which belongs to $\mathcal R(B)$. The part of $V_1$ in $\mathcal N(B)$, that is $\bar V_1$, is then given by (iii), which is equivalent to $P_B A_\mu \bar V_1 + P_B A_\mu H_B A_\mu V_0 + \mu P_B A_\mu H_B f = 0$. □

Remark. By a straightforward induction one can give the complete expansion of $V_\varepsilon$. To this end it suffices to decompose $V_n = \bar V_n + \tilde V_n$, with $\tilde V_n \in \mathcal R(B)$ and $\bar V_n \in \mathcal N(B)$, $\bar V_n$ being chosen in such a way that $\tilde V_{n+1}$ is well defined. Each equation (of type 19(ii)) uniquely determines $\tilde V_n$ and $\bar V_{n-1}$. In this way, in the case $A \equiv 0$, we obtain exactly the expansion (14) (with $\lambda = \varepsilon\mu$). So the expansion (15) can be seen as an extension of (14) to the case of a perturbation of the $B$ matrix by $\varepsilon A$, some of the zero elements off the blocks $B_{\bar x}$ in Fig. 3 becoming of order $\varepsilon$. In (14) the first term of the expansion, $P_B f$, can be interpreted as

$$P_B f(x) = m_{\bar x}(f) = \lim_{n\to\infty} \frac{1}{n} \sum_{m=0}^{n-1} f(X_m)$$

when $x$ is recurrent ($x \in \bar x$, $X_0 = x$), and as

$$P_B f(x) = \sum_{\bar x} m_{\bar x}(f)\,q_{\bar x}(x) = E_x P_B f(X_\tau), \qquad \tau = \inf\{m \ge 0;\ X_m \in R\},$$

when $x$ is transient, $X_0 = x$.
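Proposition 1 is easy to test numerically. The sketch below (our own randomized example, restricted for simplicity to two ergodic blocks and no transient states) builds $P_B$ and $H_B$, solves (16)-(17) together with the corollary's condition (iii) for $V_0$ and $V_1$, and compares with the exact $V_\varepsilon$ from (10); the two printed errors should be of order $\varepsilon$ and $\varepsilon^2$ respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_generator(n):
    """A random irreducible 'generator' P - I of a Markov chain on n states."""
    P = rng.random((n, n))
    P /= P.sum(axis=1, keepdims=True)
    return P - np.eye(n)

n1, n2 = 3, 4
n, N = n1 + n2, 2
mu, eps = 1.0, 1e-3
blocks = [list(range(n1)), list(range(n1, n))]
B = np.zeros((n, n))
for b in blocks:
    B[np.ix_(b, b)] = random_generator(len(b))      # two ergodic blocks
A = random_generator(n)                             # weak interactions
f = rng.random(n)

def inv_measure(Bx):                                # m Bx = 0, m 1 = 1
    w, vl = np.linalg.eig(Bx.T)
    m = np.real(vl[:, np.argmin(np.abs(w))])
    return m / m.sum()

m = [inv_measure(B[np.ix_(b, b)]) for b in blocks]
PB = np.zeros((n, n))                               # eigenprojection, eq. (12)
for b, mb in zip(blocks, m):
    PB[np.ix_(b, b)] = np.outer(np.ones(len(b)), mb)
HB = np.linalg.inv(PB - B) - PB                     # reduced resolvent, eq. (13)
Amu = A - mu * np.eye(n)

def lift(vh):                                       # constant on each block
    v = np.empty(n)
    for k, b in enumerate(blocks):
        v[b] = vh[k]
    return v

def proj(g):                                        # block averages m_x(g)
    return np.array([mb @ g[b] for b, mb in zip(blocks, m)])

Ahat = np.column_stack([proj(A @ lift(np.eye(N)[k])) for k in range(N)])
V0 = lift(np.linalg.solve(mu * np.eye(N) - Ahat, mu * proj(f)))   # from (16)
V1t = HB @ (Amu @ V0 + mu * f)                                    # from (17)
V1 = V1t + lift(np.linalg.solve(mu * np.eye(N) - Ahat, proj(Amu @ V1t)))

Veps = np.linalg.solve(eps * mu * np.eye(n) - B - eps * A, eps * mu * f)
print(np.abs(Veps - V0).max())                # should be of order eps
print(np.abs(Veps - V0 - eps * V1).max())     # should be of order eps**2
```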
We can also interpret the first term of the expansion (15). For this purpose let us introduce an aggregate chain $(\hat X_m)$ as follows. Its state variables are the ergodic sets of the fast chain $(\bar x_1, \bar x_2, \dots, \bar x_N)$. Its generator is defined by

$$\hat A_{\bar x,\bar x'} = m_{\bar x} A q_{\bar x'}. \qquad (20)$$

Note that for $x$ fixed in $\bar x \ne \bar x'$, the term $Aq_{\bar x'}(x)$ represents a probability to end in the set $\bar x'$ starting from $x$. The following proposition shows that $V_0$ is the same functional of this aggregate chain as $V_\varepsilon$ was for the $(B + \varepsilon A)$ chain.

Proposition 2. $V_0$ admits the following probabilistic interpretation:

(i) $V_0(x) = \hat V_0(\bar x) = \mu\,\hat E_{\bar x} \sum_{m=0}^\infty \left(\dfrac{1}{1+\mu}\right)^{m+1} \hat f(\hat X_m)$, $x \in \bar x$, where $\hat f(\bar x) = m_{\bar x}(f)$; \qquad (21)

(ii) $V_0(x) = \sum_{\bar x} q_{\bar x}(x)\,\hat V_0(\bar x) = E_x V_0(X_\tau)$ for all $x \in T$, where $\tau = \inf\{m;\ X_m \in R\}$.

Proof. From (16) we know that $V_0$ belongs to the $N$-dimensional space $\mathcal N(B)$, so $V_0(x) = P_B V_0(x)$ is equal to a same constant $\hat V_0(\bar x)$ on each ergodic set $\bar x$ of $B$, and $V_0(x)$ is the linear combination (ii) of these constants if $x$ is a transient state. So we need only to determine the constants $\hat V_0(\bar x)$. From (16), $V_0 = P_B V_0$ is the solution of

$$-\mu V + P_B A V + \mu P_B f = 0, \qquad V \in \mathcal N(B). \qquad (22)$$

This equation implies:

$$-\mu\hat V_0(\bar x) + \hat A\hat V_0(\bar x) + \mu\hat f(\bar x) = 0 \quad \forall\bar x. \qquad (23)$$

Indeed, for a fixed $x$ we have:

$$AV_0(x) = \sum_y A_{xy}V_0(y) = \sum_{y\in R} A_{xy}V_0(y) + \sum_{y\in T} A_{xy}V_0(y) = \sum_{\bar x'} \Big(\sum_y A_{xy}\,q_{\bar x'}(y)\Big)\hat V_0(\bar x') = \sum_{\bar x'} (Aq_{\bar x'})(x)\,\hat V_0(\bar x').$$

It follows from the definition of $P_B$ that $P_B A V_0(x)$ is constant for $x \in \bar x$, and this constant takes the value $\sum_{\bar x'} \hat A_{\bar x,\bar x'}\hat V_0(\bar x') = \hat A\hat V_0(\bar x)$. The probabilistic interpretation (21)(i) then follows from (23). □

Remark. Note that the terms of the expansion depend on the parameter $\mu > 0$. The case $\mu = 0$ will not be considered in this paper. In the particular case where no transient states appear in the fast chain and the perturbed chain (with generator $B + \varepsilon A$) admits a unique invariant probability measure $p_\varepsilon(\cdot)$, it can be shown that $p_\varepsilon(x)$ converges to $\hat p(\bar x)m_{\bar x}(x)$, where $\hat p(\cdot)$ is the invariant probability measure of the aggregate chain (see Courtois, 1977; Pervozvanskii and Smirnov, 1974).

Computational aspects

If one is interested in computing $V_\lambda$ for small $\lambda$, the expansion (14) (case $A \equiv 0$) is not useful from a computational point of view: for instance, if $B$ is a block-diagonal matrix then $R_\lambda$, as well as $P_B$ and $H_B$, are also block-diagonal. For $A \ne 0$ (expansion (15)) it is simpler to compute $V_\varepsilon$ by calculating the terms of the expansion. The expansion of $V_\varepsilon$ can be obtained by solving a sequence of problems in $\mathcal N(B)$ (that we call 'aggregate') or in $\mathcal R(B)$ (that we call 'fast'), from which we respectively get $\bar V$ and $\tilde V$. Let us explain the calculations in each level.

Aggregate level
The computations can be achieved as follows:

1. Given (from the 'fast' level below) the invariant probability measure $m_{\bar x}$ on each ergodic set:

$$m_{\bar x} B_{\bar x} = 0, \qquad (m_{\bar x}, 1) = 1. \qquad (24)$$

2. Given (from the 'fast' level below) the probability $q_{\bar x}(x)$ to end in an ergodic set $\bar x$ starting from $x$:

$$Bq_{\bar x} = 0 \quad \text{on } E - \bar x, \qquad q_{\bar x} = 1 \quad \text{on } \bar x, \qquad (25)$$

which reduces to $q_{\bar x}(x) = 1$ if $x \in \bar x$, $q_{\bar x}(x) = 0$ if $x \in \bar x' \ne \bar x$, and $B_{T,\bar x}1_{\bar x} + B_T q_{\bar x} = 0$ on $T$, which gives $q_{\bar x}(x) = -B_T^{-1}B_{T,\bar x}1_{\bar x}(x)$.

3. Compute the aggregate generator $\hat A$ by formula (20).

4. Solve:

$$-\mu\hat V + \hat A\hat V + \mu\hat f = 0, \qquad (26)$$

which is a linear system of $N$ equations.

5. Define $\bar V(x)$ by

$$\bar V(x) = \hat V(\bar x), \quad x \in \bar x; \qquad \bar V(x) = \sum_{\bar x} q_{\bar x}(x)\,\hat V(\bar x), \quad x \in T. \qquad (27)$$

Fast level (decentralized)

We have to solve

$$B\tilde V + h = 0 \qquad (h \in \mathcal R(B)). \qquad (28)$$

This leads first to a system of $N$ decoupled systems of $n_{\bar x}$ ($\bar x = \bar x_1, \dots, \bar x_N$) linear equations, where $n_{\bar x}$ is the number of elements in the set $\bar x$. Each system is defined on an ergodic set $\bar x$ and determines the value $\tilde V_{\bar x}$ of the restriction of $\tilde V$ to $\bar x$:

$$B_{\bar x}\tilde V_{\bar x} + h_{\bar x} = 0, \qquad m_{\bar x}(\tilde V_{\bar x}) = 0. \qquad (29)$$

It is easy to check that $\tilde V_{\bar x}$ is given by the restriction of $H_B$ to $\bar x$:

$$\tilde V_{\bar x} = [(P_{\bar x} - B_{\bar x})^{-1} - P_{\bar x}]\,h_{\bar x} = (P_{\bar x} - B_{\bar x})^{-1}h_{\bar x} \qquad (30)$$

where $P_{\bar x} = 1_{\bar x} \otimes m_{\bar x}$ (the last equality uses $P_{\bar x}h_{\bar x} = 0$). On the set of transient states the solution $\tilde V_T$ is obtained by solving:

$$\sum_{\bar x} B_{T,\bar x}\tilde V_{\bar x} + B_T\tilde V_T + h_T = 0.$$

Remark. On each class $\bar x$, the computation of $m_{\bar x}$ and $\tilde V_{\bar x}$ requires only one matrix inversion of size $n_{\bar x} \times n_{\bar x}$. One can check that if $Z$ denotes the matrix obtained from $B_{\bar x}$ by replacing its last column by $1_{\bar x}$, then $m_{\bar x}$ is equal to the last row of $Z^{-1}$, and $\tilde V_{\bar x} = -Z^{-1}h_{\bar x} + c1_{\bar x}$, where $c$ is a constant which can be determined from the condition $m_{\bar x}\tilde V_{\bar x} = 0$. The last row of $Z^{-1}$ is obtained as an intermediary step if the Gauss elimination method is used (Schweitzer, 1968).
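A compact sketch of the five aggregate-level steps (24)-(27) for a fixed strategy, with transient states handled through $q_T = -B_T^{-1}B_{T,\bar x}1_{\bar x}$ as in step 2; the function name and data layout are our own conventions, not the paper's.

```python
import numpy as np

def aggregate_solve(B, A, f, blocks, T, mu):
    """Aggregate level (24)-(27) for a fixed strategy.  blocks: index-lists
    of the ergodic sets; T: list of transient indices."""
    n = B.shape[0]
    N = len(blocks)

    # 1. invariant measures m_x on each ergodic set, eq. (24)
    def inv_measure(Bx):
        w, vl = np.linalg.eig(Bx.T)
        m = np.real(vl[:, np.argmin(np.abs(w))])
        return m / m.sum()
    m = [inv_measure(B[np.ix_(b, b)]) for b in blocks]

    # 2. absorption probabilities q_x, eq. (25):
    #    q_x = 1 on x, 0 on the other sets, and q_T = -B_T^{-1} B_{T,x} 1_x
    BT = B[np.ix_(T, T)]
    q = []
    for b in blocks:
        qx = np.zeros(n)
        qx[b] = 1.0
        if T:
            qx[T] = np.linalg.solve(-BT, B[np.ix_(T, b)] @ np.ones(len(b)))
        q.append(qx)

    # 3. aggregate generator, eq. (20): Ahat(x, x') = m_x A q_x'
    Ahat = np.array([[m[k] @ (A[blocks[k]] @ q[l]) for l in range(N)]
                     for k in range(N)])
    fhat = np.array([m[k] @ f[blocks[k]] for k in range(N)])

    # 4. the N-dimensional linear system (26)
    Vhat = np.linalg.solve(mu * np.eye(N) - Ahat, mu * fhat)

    # 5. lifting (27): V = Vhat(x) on each set x, sum_x q_x Vhat(x) on T
    V = sum(Vhat[k] * q[k] for k in range(N))
    return V, Vhat, m, q
```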
2.3. The non-linear case

Now we study the stochastic control problem (8) for a perturbed Markov chain with generator $(B + \varepsilon A)$ and 'killing rate' $\varepsilon\mu$. We shall give a policy improvement algorithm which reduces to the classical one in the case $A \equiv 0$ and which will give $V_0^* = \lim_{\varepsilon\to 0} V_\varepsilon^*$, where $V_\varepsilon^* = \min_S V_\varepsilon^S$. $V_\varepsilon^*$ is the (unique) solution to:

$$-\varepsilon\mu V^*(x) + \min_u\left\{ (B(u) + \varepsilon A(u))V^*(x) + \varepsilon\mu f(x,u) \right\} = 0. \qquad (31)$$

We assume that the recurrent-transient structure of the fast chain cannot be modified by the control: this means that the partition of the states into $N$ ergodic subsets and a set of transient states remains the same for all the strategies. This hypothesis (which will be referred to as '$\mathcal H$') can be removed when the control is assumed to take only a finite number of values, but it is convenient to obtain the following lemma. (As in the linear case, we denote by $R$ the set of recurrent states and by $T$ the set of transient ones.)

Lemma 1. Under hypothesis $\mathcal H$ the following estimate holds:

$$\sup_{S\in\mathcal S} \|H_{B^S}\| < +\infty. \qquad (32)$$
Proof. Let $(\lambda_1^S, \dots, \lambda_n^S)$ be the repeated eigenvalues of $B^S$. They depend continuously on the strategy $S$ and, among them, exactly $N$ are equal to zero. Because $S$ belongs to a compact set, the lowest eigenvalue in modulus different from zero, say $\lambda^S$, satisfies $|\lambda^S| > \delta > 0$. It follows that

$$P_{B^S} = \frac{1}{2i\pi} \oint_\Gamma R_\lambda^S\,d\lambda$$

(where $\Gamma$ is a positive contour enclosing $0$ but small enough to exclude the other eigenvalues) is bounded by a constant independent of $S$, and depends continuously on $S$. The operator $(B^S - P_{B^S})$ is invertible and its lowest eigenvalue $\mu^S$ satisfies $|\mu^S| > \delta' > 0$ by the same argument. This implies that $\sup_S \|H_{B^S}\| < +\infty$. □

Following Veinott (1969), let us introduce the lexicographical ordering, denoted $\preceq$, on pairs of vectors $(f_0, f_1)$. By definition, $(f_0,f_1) \preceq 0$ if, for each fixed $x$, either $f_0(x) < 0$, or $f_0(x) = 0$ and $f_1(x) \le 0$. The strict ordering is: $(f_0,f_1) \prec 0$ if either $f_0(x) < 0$, or $f_0(x) = 0$ and $f_1(x) < 0$. It is easy to check that this definition is equivalent to: $(f_0,f_1) \preceq 0$ (respectively $\prec 0$) if $f_0 + \varepsilon f_1 \le 0$ (respectively $< 0$) for all $\varepsilon$ small enough. The next lemma (a vector maximum principle) will provide an aggregated-decentralized policy improvement algorithm to calculate $V_0^*$.

Lemma 2. Let $(f_0, f_1)$ be two given functions with $P_B f_0 = 0$ and let $V_0$ be defined by

(i) $BV_0 + f_0 = 0$
(ii) $A_\mu V_0 + BV_1 + f_1 = 0$. \qquad (33)

Then the inequality $(f_0, f_1) \preceq 0$ (lexicographic inequality) implies $V_0 \le 0$. Moreover, $V_0(x) < 0$ if $f_0(x) < 0$ (which can occur only if $x \in T$) or if $f_1(x) < 0$ for an $x \in R$.

Proof. If $V_\varepsilon$ is defined by the equation $-\varepsilon\mu V_\varepsilon + (B + \varepsilon A)V_\varepsilon + f_0 + \varepsilon f_1 = 0$, then by Proposition 1 we have $V_\varepsilon - V_0 = 0(\varepsilon)$. By the maximum principle for the perturbed chain, the inequality $f_0 + \varepsilon f_1 \le 0$, $\varepsilon < \delta$, implies $V_\varepsilon \le 0$, $\varepsilon < \delta$, and passing to the limit we see that $V_0 \le 0$. Moreover, we note that because $P_B f_0 = 0$ the inequality $f_0 \le 0$ implies $f_0 = 0$ on $R$, and if $f_0(x) < 0$, $x \in T$, then $V_0(x) < 0$ ($H_B(x,x) > 0$ (all $x$) and $H_B(x,y) \ge 0$ if $x \in T$ and $y \in T$). If $f_1(x) < 0$, $x \in R$, then $\hat f_1 < 0$ on the set $\bar x$ and, by the maximum principle for the aggregate chain ($\mu > 0$), we obtain $V_0 < 0$. This concludes the proof. □
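In code, this ordering is a componentwise test with a tolerance playing the role of exact zero (a sketch):

```python
import numpy as np

def lex_leq_zero(f0, f1, tol=1e-12):
    """(f0, f1) <= 0 lexicographically: at every x, f0(x) < 0, or
    f0(x) = 0 and f1(x) <= 0 (equivalently f0 + eps*f1 <= 0 for all
    small eps > 0)."""
    f0, f1 = np.asarray(f0), np.asarray(f1)
    return bool(np.all((f0 < -tol) | ((np.abs(f0) <= tol) & (f1 <= tol))))
```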
The next proposition gives the first term of the expansion of $V_\varepsilon^*$ (defined by (31)) in the case of a controlled chain.

Proposition 3. Under the continuity and compactness assumptions of paragraph 2.1 and hypothesis $\mathcal H$, the pair of equations

(i) $\min_{u\in\mathcal U} B(u)V_0^*(x) = 0, \quad x \in T$
(ii) $\min_{u\in\mathcal U}\left[ A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right] = 0, \quad x \in R$ \qquad (34)

admits a solution; $(V_0^*(x), x \in E)$ is unique, and $(V_1^*(x), x \in R)$ is unique up to an additive constant on each ergodic set $\bar x \subset R$. If $V_\varepsilon^*$ is the solution of (31), then the following estimate holds:

$$V_\varepsilon^* = V_0^* + 0(\varepsilon). \qquad (35)$$

We shall denote by $T(u)$ the operator which associates to $W = (V_0, V_1)$ the pair of functions

$$T(u)W = [B(u)V_0,\ A_\mu(u)V_0 + B(u)V_1 + \mu f(u)].$$

Proof. It is based on a two-level policy iteration improvement algorithm. First we obtain a decreasing sequence which converges to $V_0^*$. To define completely $V_1^*$ we have to use the same algorithm with $V_0^*$ given. By a straightforward generalisation of the results of the preceding paragraph, one can set the system of equations analogous to (19) for the functions $(V_0^{S'} - V_0^S,\ V_1^{S'} - V_1^S)$, where $S$ and $S'$ are two given strategies:

(i) $B^{S'}(V_0^{S'} - V_0^S) + B^{S'}V_0^S = 0$
(ii) $A_\mu^{S'}(V_0^{S'} - V_0^S) + B^{S'}(V_1^{S'} - V_1^S) + A_\mu^{S'}V_0^S + B^{S'}V_1^S + \mu f^{S'} = 0$ \qquad (36)
(iii) $A_\mu^{S'}(V_1^{S'} - V_1^S) + B^{S'}(V_2^{S'} - V_2^S) + A_\mu^{S'}V_1^S + B^{S'}V_2^S = 0$.

We note that (36)(i) is well defined because $B^{S'}V_0^S \in \mathcal R(B^{S'})$, that (36)(i)(ii) uniquely determine $(V_0^{S'} - V_0^S)$, and that $V_0^{S'} - V_0^S \le 0$ if $(B^{S'}V_0^S,\ A_\mu^{S'}V_0^S + B^{S'}V_1^S + \mu f^{S'}) \preceq 0$ (Lemma 2). Using this result we can now define a policy iteration algorithm. Starting with an arbitrary strategy $S^{(0)}$, we define inductively $S^{(n+1)}$ from $S^{(n)}$ as follows: $S^{(n+1)}(x)$ is a solution of the minimization problem:

$$\min_{u\in\mathcal U} B(u)V_0^{(n)}(x), \quad x \in T; \qquad \min_{u\in\mathcal U}\left[A_\mu(u)V_0^{(n)}(x) + B(u)V_1^{(n)}(x) + \mu f(u,x)\right], \quad x \in R, \qquad (37)$$

where $V_0^{(n)} = V_0^{S^{(n)}}$, $V_1^{(n)} = V_1^{S^{(n)}}$.
We note that if (37) is replaced by the lexicographic minimization problem $\min_u T(u)(V_0^{(n)}, V_1^{(n)})$, then we obtain a strategy $\tilde S^{(n+1)}$ which is such that $V_0^{\tilde S^{(n+1)}} = V_0^{S^{(n+1)}}$. Indeed, let us consider the R.H.S. of (36). We see that $B^{S'}V_0^S(x) = 0$, $x \in R$, because $V_0^S(x)$ is a constant for $x \in \bar x$ (hypothesis $\mathcal H$), and, from the definition of $P_B$, $\hat V_0^{S'} = P_{B^{S'}}V_0^{S'}$ depends only on the values of $A_\mu^{S'}V_0^S + B^{S'}V_1^S + \mu f^{S'}$ on the set $R$. So the minimization of this last expression with respect to $S'$ on the set $T$ is not necessary to decrease $V_0^S$. We obtain a sequence $S^{(n)}$ with associated $(V_0^{(n)}, V_1^{(n)})$ such that:

$$B^{(n)}V_0^{(n)} = 0, \qquad A_\mu^{(n)}V_0^{(n)} + B^{(n)}V_1^{(n)} + \mu f^{(n)} = 0, \qquad (38)$$

$$B^{(n)}V_0^{(n-1)}(x) \le B(u)V_0^{(n-1)}(x), \quad x \in T;$$
$$A_\mu^{(n)}V_0^{(n-1)}(x) + B^{(n)}V_1^{(n-1)}(x) + \mu f^{(n)}(x) \le A_\mu(u)V_0^{(n-1)}(x) + B(u)V_1^{(n-1)}(x) + \mu f(u,x), \quad x \in R. \qquad (39)_n$$

The sequence $V_0^{(n)}$ is bounded and decreasing, thus $V_0^{(n)}$ is convergent. From Lemma 1, the sequence $V_1^{(n)}$ is bounded. We can extract from $(S^{(n)}, V_0^{(n)}, V_1^{(n)})$ a convergent subsequence (still indexed by $n$) with limit point $(S^*, V_0^*, V_1^*)$ such that, passing to the limit in (38) and (39)_n:

$$B^*V_0^* = 0, \qquad A_\mu^*V_0^* + B^*V_1^* + \mu f^* = 0, \qquad (40)$$

$$B^*V_0^*(x) \le B(u)V_0^*(x), \quad x \in T; \qquad A_\mu^*V_0^*(x) + B^*V_1^{**}(x) + \mu f^*(x) \le A_\mu(u)V_0^*(x) + B(u)V_1^{**}(x) + \mu f(u,x), \quad x \in R. \qquad (41)$$

In (41), $V_1^{**}$ denotes the limit point of $V_1^{(n-1)}$. In particular we have $B^*(V_1^{**} - V_1^*)(x) \le 0$, $x \in R$, and therefore $V_1^*$ and $V_1^{**}$ differ by a same constant on each ergodic set $\bar x$. (Recall that any non-positive function $f$ which satisfies $P_B f = 0$ is vanishing on $R$, because the entries of $P_B$ are strictly positive on the recurrent states.) So $B^*V_1^{**}(x) = B^*V_1^*(x)$, $x \in R$, and the pair $(V_0^*, V_1^*)$ so constructed is a solution of

$$\min_u B(u)V_0^*(x) = 0, \quad x \in T; \qquad \min_u\left[A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right] = 0, \quad x \in R. \qquad (42)$$

Moreover, $V_0^* \in \mathcal N(B^*)$ and $\tilde V_1^* \in \mathcal R(B^*)$ are uniquely defined. It is also clear that $V_0^*(x) \le V_0^S(x)$, $\forall S \in \mathcal S$. Let us now denote by

$$\mathcal U_0(x) = \{u \in \mathcal U \mid B(u)V_0^*(x) = \min_u B(u)V_0^*(x)\} \cap \{u \in \mathcal U \mid A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x) = \min_u\left[A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right]\}$$

and $\mathcal S_0 = \{S \in \mathcal S,\ S(x) \in \mathcal U_0(x)\}$. If $S$ and $S'$ are two strategies in $\mathcal S_0$, then $V_0^{S'} = V_0^S = V_0^*$, and from (36) we see that $V_1^{S'} - V_1^S$ is defined by

(i) $B^{S'}(V_1^{S'} - V_1^S) + (A_\mu^{S'}V_0^* + B^{S'}V_1^S + \mu f^{S'}) = 0$
(ii) $A_\mu^{S'}(V_1^{S'} - V_1^S) + B^{S'}(V_2^{S'} - V_2^S) + A_\mu^{S'}V_1^S + B^{S'}V_2^S = 0$. \qquad (43)

We note that $A_\mu^S V_0^* + B^S V_1^S + \mu f^S(x) = 0$, $x \in R$, so we can repeat the above policy iteration with $\mathcal S$ replaced by $\mathcal S_0$ and $V_0$ replaced by $V_1$, to get a strategy $S^* \in \mathcal S_0$ with associated $V_1^*$ such that $V_1^*$ satisfies:

$$\min_{u\in\mathcal U_0(x)} B(u)V_1^*(x) = 0, \quad x \in T; \qquad \min_{u\in\mathcal U_0(x)}\left[A_\mu(u)V_1^*(x) + B(u)V_2^*(x)\right] = 0, \quad x \in R. \qquad (44)$$

The pair $(V_0^*, V_1^*)$, solution of (42), (44), is also solution of

$$\min_u T(u)(V_0^*, V_1^*)(x) = 0, \quad x \in E,$$

where the minimization is relative to the lexicographic order, and

$$(V_0^*(x), V_1^*(x)) \preceq (V_0^S(x), V_1^S(x)) \quad \forall S \in \mathcal S.$$

The existence of this $V_1^*$ (uniquely determined) is necessary to prove the estimate (35), but $V_0^*$ is determined by the first stage of the algorithm. To show (35), we apply the non-linear operator in (31), which defines $V_\varepsilon^*$, to $V_0^* + \varepsilon V_1^*$.
We obtain:

$$-\varepsilon\mu(V_0^* + \varepsilon V_1^*)(x) + \min_u\left\{ (B(u) + \varepsilon A(u))(V_0^* + \varepsilon V_1^*)(x) + \varepsilon\mu f(u,x)\right\}$$
$$= \min_u\left\{ B(u)V_0^*(x) + \varepsilon\left[A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right] + \varepsilon^2 A_\mu(u)V_1^*(x)\right\} = 0(\varepsilon^2).$$

The last equality holds because the mapping

$$\varepsilon \mapsto \min_u\left\{ B(u)V_0^*(x) + \varepsilon\left[A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right]\right\}$$

admits a right derivative at zero given by

$$\min_{u\in\mathcal U_0(x)}\left(A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right) = 0$$

(see e.g. Pschenichniy, 1971, p. 73). By the probabilistic interpretation we get:

$$V_0^*(x) + \varepsilon V_1^*(x) = \min_S E_x^S \sum_{m=0}^\infty \left(\frac{1}{1+\mu\varepsilon}\right)^{m+1}\left[\varepsilon\mu f^S(X_m) + 0(\varepsilon^2)\right]$$

and so

$$V_\varepsilon^* - (V_0^* + \varepsilon V_1^*) = 0(\varepsilon). \qquad \square$$

Remark. The proof gives an algorithm to compute $V_0^*$:

(1) $S$ given;
(2) compute $V_0^S$, $V_1^S$, using the 'computational aspects' of Section 2.2, from

$$B^S V_0 = 0, \qquad A_\mu^S V_0 + B^S V_1 + \mu f^S = 0;$$

(3) compute $S'$:

$$S'(x) \in \operatorname{Argmin}_u B(u)V_0(x), \quad x \in T \text{ (transient states)};$$
$$S'(x) \in \operatorname{Argmin}_u\left(A_\mu(u)V_0 + B(u)V_1 + \mu f(u)\right)(x), \quad x \in R \text{ (recurrent states)};$$

(4) go to (2) with $S$ replaced by $S'$, until convergence of $V_0$ occurs.
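A sketch of this loop for finitely many controls, reusing the hypothetical aggregate_solve helper of Section 2.2; note that on recurrent rows $B(u)$ annihilates the constant part $\bar V_1$ of $V_1$ (hypothesis $\mathcal H$), so the fast part $\tilde V_1$ suffices in step (3).

```python
import numpy as np

def policy_improvement(B, A, f, blocks, T, mu, max_iter=100):
    """Two-level policy iteration (steps (1)-(4) of the remark) for finitely
    many controls.  B, A: (K, n, n); f: (K, n); reuses aggregate_solve."""
    n = B.shape[1]
    S = np.zeros(n, dtype=int)                       # S^(0): first control everywhere
    is_T = np.isin(np.arange(n), T)
    for _ in range(max_iter):
        # (2) evaluate the current strategy
        rows = np.arange(n)
        BS, AS, fS = B[S, rows], A[S, rows], f[S, rows]
        V0, Vhat, m, q = aggregate_solve(BS, AS, fS, blocks, T, mu)
        PB = np.zeros((n, n))                        # P_B = sum_x q_x (x) m_x
        for qx, (b, mb) in zip(q, zip(blocks, m)):
            PB[:, b] += np.outer(qx, mb)
        HB = np.linalg.inv(PB - BS) - PB
        V1 = HB @ ((AS - mu * np.eye(n)) @ V0 + mu * fS)   # tilde-V_1, eq. (17)

        # (3) improvement: B(u)V0 on T, A_mu(u)V0 + B(u)V1 + mu*f(u) on R
        crit_T = B @ V0
        crit_R = A @ V0 - mu * V0 + B @ V1 + mu * f
        S_new = np.where(is_T, crit_T.argmin(axis=0), crit_R.argmin(axis=0))
        if np.array_equal(S_new, S):                 # (4) stop at convergence
            return S, V0, V1
        S = S_new
    return S, V0, V1
```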
Remark. If the control is assumed to take only a finite number of values, the optimal control for the perturbed chain does not depend on $\varepsilon$ if $\varepsilon$ is chosen small enough. As a consequence, it is easy to see that a control $S^*$ which lexicographically minimizes the $(n+1)$ first terms of the expansion of $V_\varepsilon^S$ is such that its associated sequence $(V_0^{S^*}, \dots, V_n^{S^*})$ gives the expansion of $V_\varepsilon^*$. This fact is no longer verified in the case considered here, even when $A \equiv 0$ (see Chitashvili, 1975). However, under additional hypotheses, $V_0^* + \varepsilon V_1^* = V_\varepsilon^* + 0(\varepsilon^2)$; see the corollary below.

2.4. Probabilistic interpretation in the recurrent case

In this section we assume that the unperturbed chain admits no transient states. Because of hypothesis $\mathcal H$, this implies that $\mathcal N(B^S) = \mathcal N$ is independent of the strategy $S$. This additional assumption is often verified when the controlled Markov chains considered are discrete versions of Markov processes (see for example the introductory example). In this case we can give a probabilistic interpretation of $V_0^*$ as the Bellman function of a 'long run' stochastic control problem. This interpretation may be practically useful if the 'short run' stochastic control problems described below can be solved easily. First we give a corollary to Proposition 3 adapted to this particular case.

Corollary. There exists a unique pair $(V_0^*, V_1^*)$ such that $V_0^* \in \mathcal N$, $A_\mu^* V_1^* \in \mathcal R(B^*)$ and

$$\min_{u\in\mathcal U}\left(A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x)\right) = 0. \qquad (45)$$

One has:

$$V_\varepsilon^* = V_0^* + 0(\varepsilon), \qquad (46)$$

and moreover, if (45) defines a unique strategy $S^*$, then:

$$V_\varepsilon^* = V_0^* + \varepsilon V_1^* + 0(\varepsilon^2). \qquad (47)$$

Proof. The existence of $(V_0^*, V_1^*)$ may be proved along the lines of Proposition 3, and (46) is the estimate (35) because $\mathcal U_0(x) = \mathcal U$. The estimate (47) is also obtained as in Proposition 3: since $A_\mu^*V_1^* \in \mathcal R(B^*)$, there exists $V_2^*$ such that $A_\mu^*V_1^* + B^*V_2^* = 0$. We have:

$$-\mu\varepsilon(V_0^* + \varepsilon V_1^* + \varepsilon^2 V_2^*)(x) + \min_u\left[(B(u) + \varepsilon A(u))(V_0^* + \varepsilon V_1^* + \varepsilon^2 V_2^*)(x) + \varepsilon\mu f(u,x)\right]$$
$$= \varepsilon\min_u\left[A_\mu(u)V_0^*(x) + B(u)V_1^*(x) + \mu f(u,x) + \varepsilon(A_\mu(u)V_1^*(x) + B(u)V_2^*(x))\right] = 0(\varepsilon^3).$$

The last equality is a consequence of the fact that the term between brackets (seen as a function of $\varepsilon$) admits a right derivative at zero given by $A_\mu^*V_1^* + B^*V_2^* = 0$. We can conclude as in Proposition 3. □
Now we can give a probabilistic interpretation of $V_0^*$ as the Bellman function of a controlled aggregate chain $(\hat X_m)$:

$$\hat V_0^*(\bar x) = \min_Q \hat E_{\bar x}^Q \sum_{m=0}^\infty \mu\left(\frac{1}{1+\mu}\right)^{m+1} v(\hat X_m, q_{\hat X_m,\cdot}). \qquad (48)$$

The controlled aggregate chain is described in:

Proposition 4. $\hat V_0^*$ is the solution of the following 'long run' stochastic control problem:

1. The state variables are the $N$ ergodic sets $\bar x$ of the fast chain.
2. The control variables associated to $\bar x$ are $\{q_{\bar x,\bar x'},\ \bar x' = 1,\dots,N\}$.
3. The generator $Q$ is the matrix $\{q_{\bar x,\bar x'};\ \bar x = 1,\dots,N;\ \bar x' = 1,\dots,N\}$.
4. The cost function associated to $\bar x$ is $v(\bar x, q_{\bar x,\cdot})$, which is itself the ergodic cost of a 'short run' stochastic control problem (indexed by $(\bar x, q_{\bar x,\cdot})$) defined below.
5. The state variables of the short run problem are the states of the subset $\bar x$.
6. For the control $u$, the generator is $B(u)$ and the cost function is $f(u,x)$ (both restricted to $\bar x$). $v(\bar x, q_{\bar x,\cdot})$ is the optimal ergodic cost:

$$v(\bar x, q_{\bar x,\cdot}) = \min_S m_{\bar x}^S(f^S) \qquad (49)$$

where $m_{\bar x}^S$ is the invariant probability measure of the set $\bar x$ for the fast chain:

$$m_{\bar x}^S B^S = 0, \qquad m_{\bar x}^S(1_{\bar x}) = 1, \qquad S(x) \in \mathcal U,$$

and the strategies satisfy the constraints:

$$q_{\bar x,\bar x'} = m_{\bar x}^S A^S 1_{\bar x'}, \qquad \bar x' = 1,\dots,N. \qquad \square$$

Proof. $(V_0^*, V_1^*)$ is the solution of

$$\min_u\left[B(u)V_1^*(x) + A_\mu(u)V_0^*(x) + \mu f(u,x)\right] = 0. \qquad (50)$$

In (50), $A_\mu(u)V_0^*(x) + \mu f(u,x)$ can be seen as the cost function of the controlled Markov chain with generator $B(u)$. Let us consider the following constrained ergodic stochastic control problem:

$$v(\bar x, q_{\bar x,\cdot}) = \min_S m_{\bar x}^S(f^S) \quad \text{under the constraints} \quad m_{\bar x}^S A^S 1_{\bar x'} = q_{\bar x,\bar x'}, \quad \bar x' = 1,\dots,N \qquad (51)$$

(this is the $\bar x$th short run problem), and, for a generator $Q = \{q_{\bar x,\bar x'}\}$, the averaged quantity

$$\mathcal L(Q, \hat V_0)(\bar x) = \min_S\left[\mu m_{\bar x}^S(f^S) + m_{\bar x}^S(A^S V_0^*) - \mu\hat V_0(\bar x)\right], \quad \text{with } m_{\bar x}^S A^S 1_{\bar x'} = q_{\bar x,\bar x'},\ \bar x' = 1,\dots,N. \qquad (52)$$

Because of the structure of $B$, this quantity is constant on each set $\bar x$ and depends only on $q_{\bar x,\bar x'}$, $\bar x' = 1,\dots,N$, so that (50) can be written $\min_Q \mathcal L(Q, \hat V_0)(\bar x) = 0$. Using $m_{\bar x}^S(B^S V_1^*) = 0$ and $m_{\bar x}^S A^S V_0^* = \sum_{\bar x'} q_{\bar x,\bar x'}\hat V_0(\bar x') = Q\hat V_0(\bar x)$, we get:

$$\mathcal L(Q, \hat V_0)(\bar x) = \min_S\left[\mu m_{\bar x}^S(f^S) + Q\hat V_0(\bar x) - \mu\hat V_0(\bar x)\right] = -\mu\hat V_0(\bar x) + Q\hat V_0(\bar x) + \mu v(\bar x, q_{\bar x,\cdot}). \qquad (53)$$

Using (50) we obtain

$$\min_Q\left[-\mu\hat V_0(\bar x) + Q\hat V_0(\bar x) + \mu v(\bar x, q_{\bar x,\cdot})\right] = 0. \qquad (54)$$

The probabilistic interpretation (48) follows from (54). □
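For randomized stationary strategies, the constrained short run problem (51) is a linear program over state-action frequencies $\rho(x,u) \ge 0$ on $\bar x$; this reformulation (ours, not the paper's) gives a direct way to evaluate $v(\bar x, q_{\bar x,\cdot})$.

```python
import numpy as np
from scipy.optimize import linprog

def short_run_value(Bx, Ax, fx, q_row):
    """Value v(xbar, q) of the constrained ergodic problem (51), as an LP
    over state-action frequencies rho(x,u) >= 0 (randomized strategies):
        min  sum_{x,u} rho(x,u) f(x,u)
        s.t. sum_{x,u} rho(x,u) B(x,y,u) = 0  for all y in xbar (invariance)
             sum_{x,u} rho(x,u) = 1                             (probability)
             sum_{x,u} rho(x,u) (A 1_{xbar'})(x,u) = q(xbar')   (coupling)
    Bx: (K, nx, nx); fx: (K, nx); Ax: (K, nx, N) with
    Ax[k, x, l] = sum over y in block l of A(x, y, u_k).
    One invariance row is redundant (zero row sums); the solver tolerates it."""
    K, nx = fx.shape
    N = q_row.shape[0]
    c = fx.reshape(-1)                               # rho flattened in (u, x) order
    A_eq = np.vstack([Bx.reshape(K * nx, nx).T,      # invariance rows
                      Ax.reshape(K * nx, N).T,       # coupling rows
                      np.ones((1, K * nx))])         # normalization
    b_eq = np.concatenate([np.zeros(nx), q_row, [1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun if res.success else np.inf
```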
This last part is d e v o t e d to illustration of the m a t h e m a t i c a l study of Section 2, we treat the i n t r o d u c t i v e e x a m p l e and give a numerical test. We show the usefulness of the p r o b a b i l i s t i c i n t e r p r e t a t i o n ( i n t r o d u c t i o n of slow control variables: P r o p o s i t i o n 4) on the i n t r o d u c t i v e example. O n the n u m e r i c a l e x a m p l e we use the policy improvement algorithm.
3.1. Example Let us first a p p l y the results of the preceding section to the i n t r o d u c t i v e example.
We remark that the fast chain (with generator $B_h$) can be identified with $(Z_m)$, i.e. the second component of the state process, because $b$ and $\sigma$ depend only on $j$. So the kernel of $B_h$ consists of the functions $f(i,j)$ which depend only on $i$. The fast chain admits $I$ invariant probability measures, which are all identical because the matrices along the diagonal of $B_h$ are identical: this (row) vector is the solution of:

$$p \cdot B_{\bar x} = 0, \qquad \sum_{j=1}^J p(j) = 1.$$

The aggregate chain $(\hat X_m)$ is then defined on the set $\hat{\mathcal X} = \{0, \dots, i, \dots, I\}$ (discretization points of $(Y_m)$), and its generator is defined by:

$$\hat A(u)f(i) = \sum_{j,j',i'} p(j)\,A_{i,j;i',j'}(u)\,f(i'). \qquad (55)$$

It is a tridiagonal matrix, and (the sum of the elements of each row being zero) there are only two controls associated with each aggregate state $i$: $\hat A_1(i) = \hat A(i,i+1)$ and $\hat A_2(i) = \hat A(i,i-1)$, which respectively represent the probability for the aggregate chain to move from $i$ to $i+1$ (respectively from $i$ to $i-1$). The cost function $\hat f(i, \hat A_1, \hat A_2)$ is defined below by a family of fast control problems indexed by $i$, in which $\hat A_1$ and $\hat A_2$ are parameters. The long run stochastic control problem is then

$$\min_{\hat A_1(\cdot),\hat A_2(\cdot)} \hat E \sum_{m=0}^\infty \left(\frac{1}{1+\mu\Delta t}\right)^{m+1} \hat f(\hat X_m, \hat A_1(\hat X_m), \hat A_2(\hat X_m)). \qquad (56)$$

The $i$th fast control problem is ($i$ fixed)

$$\min_u \sum_j p(j)\,c(e(ih,u) - jh)\,\Delta t = \hat f(i, \hat A_1, \hat A_2) \qquad (57)$$

under the constraints

$$\hat A_1 = \sum_{j,j'} p(j)\,A_{i,j;i+1,j'}(u), \qquad \hat A_2 = \sum_{j,j'} p(j)\,A_{i,j;i-1,j'}(u). \qquad (58)$$

This $i$th short run problem can be interpreted as follows: the stock of water (slow variable) is fixed at the level $ih$ and can move between $t$ and $t + \Delta t$ to $(i+1)h$ or $(i-1)h$ with probabilities $\hat A_1$ and $\hat A_2$ (or stays in $ih$ with probability $1 - \hat A_1 - \hat A_2$). One is faced with a demand $Z_t$ (fast variable) which can take the values $\{0,\dots,jh,\dots\}$ with probabilities $p(j)$. The function $\hat f(i, \hat A_1, \hat A_2)$ then represents the minimal short run cost (to meet the demand) compatible with the transitions of the slow variable when this slow variable is at level $ih$. The long run stochastic control is then to manage in an optimal way the slow variable, knowing the optimal short run management.

It is worthwhile to interpret also the results for the continuous time modeling. Formally, $V_\varepsilon(y,z)$, solution to (1), 'converges' to $V_0(y)$, which is the optimal cost of the long run problem:

$$\min_{\bar u} \int_0^\infty e^{-\mu t} g(y_t, \bar u_t)\,dt \qquad (59)$$

where $y_t$ evolves according to $dy_t = \phi_t(y_t, \bar u_t)\,dt$ and the function $g(y,\bar u)$ is given by the 'fast' short run problem, in which the 'slow' variables $y$ and $\bar u$ are fixed:

$$g(y,\bar u) = \min_{u(\cdot)} \int_{\underline z}^{\bar z} c(e(y,u(z)) - z)\,p(z)\,dz \qquad (60)$$

under the constraint $\int u(z)p(z)\,dz = \bar u$. (Note that the aggregate generator is fixed as soon as $\bar u$ is fixed.) This last problem can be seen as the best allocation of $\bar u$ to meet the demand $z$, which has a given load curve: indeed, if $p(z)dz$ is the invariant measure of $z$, then

$$F(z) = \int_0^z p(\tau)\,d\tau$$

is the load curve of the demand.

Remark. If we do not take into account the constraint $0 \le u \le \bar u$, and if $e(y,u) = h(y)u$, we have, if $c$ is convex:

$$\int c(e(y,u(z)) - z)\,p(z)\,dz \ge c(e(y,\bar u) - \bar z) \quad \text{with} \quad \bar z = \int z\,p(z)\,dz,$$

and so the $u$ solution of (60) is given by

$$u(z) = \bar u + \frac{z - \bar z}{h(y)}.$$
So we have an explicit formula for $g$:

$$g(y,\bar u) = c(e(y,\bar u) - \bar z).$$

Finally, we have only to solve the Bellman equation for the slow part to obtain the optimal 'slow' feedback $\bar u(y)$:

$$-\mu\hat V_0(y) + \min_{\bar u}\left\{\phi_t(y,\bar u)\,\frac{d\hat V_0}{dy}(y) + c(e(y,\bar u) - \bar z)\right\} = 0.$$

In this particular case it is very efficient to solve the fast problems in advance (because it is possible analytically); in the general situation the introduction of the slow control introduces natural but additional variables to be optimized. This procedure may waste time relative to the policy improvement algorithm if we are not careful.
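The closed form can be checked by quadrature; in the sketch below the cost, density and coefficients are illustrative assumptions of ours (quadratic $c$, uniform $p$).

```python
import numpy as np

h_y = 2.0                                   # plant coefficient: e(y, u) = h(y) * u
c = lambda d: d * d                         # a convex cost (illustrative)
z, dz = np.linspace(0.0, 1.0, 2000, retstep=True)
p = np.ones_like(z)                         # uniform demand density on [0, 1]
p /= p.sum() * dz                           # normalize: integral of p dz = 1
zbar = (z * p).sum() * dz                   # mean demand
ubar = 0.4                                  # prescribed average release

u = ubar + (z - zbar) / h_y                 # claimed minimizer of (60)
assert abs((u * p).sum() * dz - ubar) < 1e-6           # constraint holds
cost = (c(h_y * u - z) * p).sum() * dz                 # achieved cost
assert abs(cost - c(h_y * ubar - zbar)) < 1e-6         # equals g(y, ubar)
# By Jensen's inequality, no other allocation with the same mean does better.
```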
3.2. A numerical test

We consider a problem where the control is assumed to take only two values: $u \in \{1,2\}$. The unperturbed chain admits a $(6\times 6)$ generator with 2 ergodic sets $\bar x_1 = \{1,2\}$, $\bar x_2 = \{3,4\}$, the states 5 and 6 being transient. For the control $u = 1$, the matrices $B(1)$ (unperturbed generator) and $A(1)$ (generator of weak interactions) are given, together with $B(2)$ and $A(2)$ for $u = 2$.

[$B(1)$, $A(1)$, $B(2)$, $A(2)$: $6\times 6$ numerical matrices.]

The linear case has been tested for the fixed control $u = 1$. The aggregate generator (20) is the $(2\times 2)$ matrix:

$$\hat A = \begin{pmatrix} -0.556 & 0.556 \\ 0.576 & -0.576 \end{pmatrix}.$$

For $\varepsilon = 0.1$ and $f(x) = 1$ if $x = 1$, $0$ elsewhere, we have computed $V_\varepsilon$ (exact value) from (10) (with $\mu = 1$) and the 4 first partial sums of the expansion: $V_0$, $V_0 + \varepsilon V_1$, $V_0 + \varepsilon V_1 + \varepsilon^2 V_2$ and $V_0 + \varepsilon V_1 + \varepsilon^2 V_2 + \varepsilon^3 V_3$.
The experimental errors obtained are respectively ($l^\infty$-norm): (0.036; 0.008; 0.0018; 0.0004), in accordance with Proposition 1.

For the cost functions $f(1) = (1, 1, 0, 0, 1, 4)$ and $f(2) = (1, 1, 0, 0, 2, 1)$, $\varepsilon = 0.02$, $\mu = 1$, the exact optimal control for the control problem (31), computed by the classical Howard's algorithm, is:

$$S^* = (2, 1, 2, 1, 1, 2).$$

Now, with the algorithm described in Proposition 3 (policy improvement by the lexicographic minimization of the first two terms of the expansion), we get the following sequence of policies:

$$S^0 = (1, 1, 1, 1, 1, 1)$$
$$S^1 = (2, 1, 2, 1, 2, 2)$$
$$S^2 = (1, 1, 2, 1, 2, 2) \quad \text{(convergence occurs)}.$$

This control leads to the following approximations:

$V_0^*$: (0.6895, 0.6895, 0.1880, 0.1880, 0.3977, 0.3795)
$V_0^* + \varepsilon V_1^*$: (0.7088, 0.7074, 0.2039, 0.2042, 0.4602, 0.4401)
$V_\varepsilon^{S^2}$: (0.7057, 0.7045, 0.2026, 0.2028, 0.4563, 0.4354)

In this example the approximate control $S^2$ does not coincide with the optimal control $S^*$. The same example with $\varepsilon = 0.4 \times 10^{-2}$ leads to the same control (see the remark of Proposition 3), and we have obtained in this case:

$V_0^*$: (0.6895, 0.6895, 0.1880, 0.1880, 0.3977, 0.3795)
$V_0^* + \varepsilon V_1^*$: (0.6933, 0.6931, 0.1912, 0.1913, 0.4102, 0.3916)
$V_\varepsilon^{S^2}$: (0.6933, 0.6930, 0.1912, 0.1912, 0.4101, 0.3915)

4. CONCLUSION

The 'curse of dimensionality' for stochastic control problems is well known. In this paper we have used perturbation techniques to simplify the computation of the control of large scale Markov chains when the transition probability matrix is almost block diagonal ($N$ blocks, $\varepsilon$: order of the off-block-diagonal terms) and the discount rate is of order $\varepsilon$.

We have shown that such systems can be approximated by $(N+1)$ subsystems ($N$ fast subsystems plus an $N$-state aggregated subsystem). We have given a policy improvement algorithm which takes advantage of such a simplified structure to compute a near optimal strategy. This kind of technique gives the complete expansion of the optimal cost when the control is finite-valued (Delebecque and Quadrat, 1980). In the infinite dimensional case, when $0$ is an isolated eigenvalue with finite multiplicity, the extension is straightforward. In the case of continuous time, finite state Markov chains the results are the same. In practical problems (Delebecque and Quadrat, 1978) even the fast subsystems may be too large to be solved numerically; in these cases the results given here are qualitative but may lead to good heuristics.
REFERENCES

Bensoussan, A., J. L. Lions and G. Papanicolaou (1978). Asymptotic Analysis for Periodic Structures. North-Holland.
Blankenship, G. and G. Papanicolaou (1978). Stability and control of stochastic systems with wide band noise disturbances. SIAM J. appl. Math., 34, 437-476.
Chitashvili, R. Y. (1975). A controlled finite Markov chain with an arbitrary set of decisions. SIAM Th. Proba., 20, 4.
Chow, J. H. and P. Kokotovic (1976). Decomposition of near-optimal state regulators for systems with slow and fast modes. IEEE Trans. Aut. Control, AC-21, 701-705.
Courtois, P. J. (1977). Decomposability. ACM Monograph Series, Academic Press.
Delebecque, F. and J. P. Quadrat (1978). Contribution of stochastic control, singular perturbation and team theories to an example of large scale system. IEEE Trans. Aut. Control, AC-23, 209-221.
Delebecque, F. and J. P. Quadrat (1980). The optimal cost expansion of finite controls, finite states Markov chains with weak and strong interactions. Analysis and Optimization of Systems, Lecture Notes in Control and Information Sciences 28, Springer-Verlag.
Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press.
Gaitsgori, V. G. and A. A. Pervozvanskii (1975). Aggregation of states in a Markov chain with weak interactions. Kybernetika, 3, 91-98.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. Wiley.
Kato, T. (1966). Perturbation Theory for Linear Operators. Springer-Verlag.
Kemeny, J. G. and J. L. Snell (1960). Finite Markov Chains. Van Nostrand.
Kushner, H. J. (1976). Probability Methods for Approximations in Stochastic Control and for Elliptic Equations. Academic Press.
Lions, J. L. (1973). Perturbations singulières dans les problèmes aux limites et en contrôle optimal. Lecture Notes in Mathematics 323, Springer-Verlag.
Miller, B. L. and A. F. Veinott (1969). Discrete dynamic programming with small interest rate. Ann. Math. Stat., 40, 366-370.
Pallu de la Barrière, R. (1966). Cours d'Automatique Théorique. Dunod.
Pervozvanskii, A. A. and I. N. Smirnov (1974). Stationary state evaluation of a complex system with slowly varying coupling. Kybernetika, 4, 45-51.
Pschenichniy, A. N. (1971). Necessary Conditions for an Extremum. Dekker.
Ross, S. M. (1969). Applied Probability Models with Optimization Applications. Holden-Day.
Schweitzer, P. J. (1968). Perturbation theory and finite Markov chains. J. appl. Prob., 5, 401-413.
Simon, H. and A. Ando (1961). Aggregation of variables in dynamic systems. Econometrica, 29, 111-139.
Veinott, A. F. (1969). Discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Stat., 40, 1635-1660.