Operations Research Letters 37 (2009) 317–321
Bias optimality for multichain continuous-time Markov decision processes

Xianping Guo a, XinYuan Song b, Junyu Zhang a,∗

a The School of Mathematics and Computational Science, Zhongshan University, Guangzhou, PR China
b Department of Statistics, The Chinese University of Hong Kong, Hong Kong
Article history: Received 24 December 2008; Accepted 28 April 2009; Available online 22 May 2009

Keywords: Bias optimality; Continuous-time Markov decision process; Difference formula; Multichain model; Policy iteration

Abstract: This paper deals with the bias optimality of multichain models for finite continuous-time Markov decision processes. Based on new performance difference formulas developed here, we prove the convergence of a so-called bias-optimal policy iteration algorithm, which can be used to obtain bias-optimal policies in a finite number of iterations.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

Continuous-time Markov decision processes (MDPs) have received considerable attention because many optimization models (such as those in communications, networks, queueing systems, computer science, and population processes) are based on processes involving continuous time. One of the most widely used optimality criteria in continuous-time MDPs is the expected average (EA) criterion; see, for instance, [1–8] and their references. However, since the EA criterion focuses only on the long-run asymptotic average reward of a system, it cannot distinguish between two policies that have the same long-run EA reward but different finite-horizon rewards; it ignores transient rewards and is therefore extremely underselective. When a system, such as many inventory or queueing models, has more than one EA-optimal policy, it is natural to ask how to choose a ‘‘best’’ one among the EA-optimal policies. To deal with this problem, the bias optimality criterion has been proposed and studied; see, for instance, [7,9–13] and their references for discrete-time MDPs. For continuous-time MDPs, as mentioned in [2,6], only Prieto-Rumeau and Hernández-Lerma [6] address this issue and show the existence of bias-optimal policies. However, the treatment in [6] is restricted to the case of geometrically ergodic Markov chains. Our goal here is to consider the general multichain case. Since we are concerned not only with the existence of bias-optimal policies but also with an algorithm for computing them, we consider the model of finite continuous-time MDPs. By developing difference formulas for the EA rewards and biases of two policies, and by using a simple and interesting observation from the canonical form of a transition rate matrix, we establish the existence of bias-optimal policies and the convergence of a so-called bias-optimal policy iteration algorithm. This algorithm is proved to obtain a bias-optimal policy in a finite number of iterations. Our approach keeps the presentation and proofs simple and self-contained, since our arguments require neither results for discrete-time MDPs nor results for discounted continuous-time MDPs; it is thus rather different from the approaches in [7,10–13] for discrete-time MDPs, which depend on discounted MDPs.

We describe the model of finite continuous-time MDPs and the EA and bias optimality criteria in Section 2, and derive difference formulas for EA rewards and biases in Section 3. After presenting a simple observation based on the canonical form of a transition rate matrix and some preliminaries about EA optimality in Section 4, we show the existence and computation of bias-optimal policies in Section 5.

∗ Corresponding address: The School of Mathematics and Computational Science, Zhongshan University, Guangzhou (Post code: 510275), PR China. E-mail address: [email protected] (J. Zhang).
doi:10.1016/j.orl.2009.04.005

2. The model and optimality criteria

We consider the model of finite continuous-time MDPs
{S, (A(i), i ∈ S), q(j|i, a), r(i, a)},   (1)

where S is the state space and A(i) is the set of admissible actions at state i ∈ S; S and all the A(i) are assumed to be finite. Let K := {(i, a) | i ∈ S, a ∈ A(i)} be the set of state–action pairs. The real-valued function q(j|i, a) in (1) gives the transition rates, which satisfy 0 ≤ q(j|i, a) < ∞ (when j ≠ i) and Σ_{j∈S} q(j|i, a) = 0 for all (i, a) ∈ K, and r(i, a) in (1) is a real-valued reward function on K.
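Before turning to policies, here is a minimal sketch (in Python; the two-state numbers and all names are hypothetical, not from the paper) of how the data of model (1) can be stored, with the rate conditions above checked and the matrix Q(f) and vector r(f) of Definition 1 below assembled for a given policy.

```python
# A minimal sketch (not from the paper): a two-state, two-action model of form (1),
# with hypothetical numbers, stored as plain Python dictionaries.
import numpy as np

states = [0, 1]
actions = {0: ["a", "b"], 1: ["a"]}          # A(i) for each state i

# Transition rates q(j|i, a): indexed by (i, a); each row sums to 0.
q = {
    (0, "a"): {0: -1.0, 1: 1.0},
    (0, "b"): {0: -3.0, 1: 3.0},
    (1, "a"): {0: 2.0, 1: -2.0},
}
# Rewards r(i, a).
r = {(0, "a"): 1.0, (0, "b"): 0.5, (1, "a"): -1.0}

# Check the rate conditions stated after (1).
for (i, a), row in q.items():
    assert all(rate >= 0 for j, rate in row.items() if j != i)
    assert abs(sum(row.values())) < 1e-12

def Q_and_r(policy):
    """Assemble Q(f) and r(f) for a policy f (a dict i -> f(i)), as in Definition 1 below."""
    Qf = np.array([[q[(i, policy[i])].get(j, 0.0) for j in states] for i in states])
    rf = np.array([r[(i, policy[i])] for i in states])
    return Qf, rf

Qf, rf = Q_and_r({0: "a", 1: "a"})
print(Qf, rf)
```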
Definition 1. A policy f is a function on S such that f(i) ∈ A(i) for all i ∈ S. The set of all policies is denoted by F.

For each f ∈ F, we define the (i, j)-th element q(j|i, f) of a matrix Q(f) and the i-th component r(i, f) of a vector r(f) by

q(j|i, f) := q(j|i, f(i)) and r(i, f) := r(i, f(i)), for i, j ∈ S,

respectively. Then it is well known (e.g., [4,5,14]) that for each f ∈ F there exists a unique homogeneous transition probability matrix P(t, f) (t ≥ 0) depending on f and having Q(f) as its transition rate matrix, which satisfies the Kolmogorov equations

(d/dt) P(t, f) = P(t, f)Q(f) = Q(f)P(t, f), P(0, f) = I,   (2)

where I denotes the identity matrix. Here, and from now on, we suppose that all operators, such as limits and integrals on matrices or vectors, are component-wise. Without any confusion, we denote by ‘‘0’’ the matrix and the vector with zeroes as all of their components. Thus, as in [5,7,11], for each f ∈ F we define the EA reward η(f) by

η(f) := lim sup_{T→∞} (1/T) ∫₀^T P(t, f)r(f) dt.

Definition 2. A policy f∗ in F is said to be EA-optimal if η(f∗) = η∗, where η∗ := sup_{f∈F} η(f) is called the EA-optimal value vector.

The existence of an EA-optimal policy and its computation are well established for the model of finite continuous-time MDPs; see [5,7,11] for instance. Thus, the set F∗ := {f ∈ F : η(f) = η∗} of all EA-optimal policies is not empty. As in [9,11,12,15], we now introduce the important concept of a (performance) bias, but for the continuous-time case.

Definition 3. For each f ∈ F, the bias g(f) is defined as

g(f) := ∫₀^∞ [P(t, f)r(f) − η(f)] dt,   (3)

which is well defined (by Lemma 1(d) below) and allows interpretation of the bias as the expected total difference between the immediate reward P(t, f)r(f) and the long-run average reward η(f).

Definition 4. A policy f∗ in F∗ is said to be bias-optimal if g(f∗)(i) ≥ g(f)(i) for all f ∈ F∗ and i ∈ S, where g(f)(i) denotes the i-th component of g(f).

Our goal is to show the existence of a bias-optimal policy and to develop a policy iteration algorithm for computing it.
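To make these definitions concrete, here is a small numerical illustration (my own, not from the paper): for a hypothetical two-state generator Q(f) and reward r(f), it approximates η(f) by P(T, f)r(f) for a large T, approximates the bias integral (3) by a trapezoidal rule on a truncated grid, and then checks the Poisson-type relation η(f) = r(f) + Q(f)g(f) that is established in Lemma 2 below.

```python
# A minimal numerical illustration of eta(f) and the bias (3); the two-state
# generator and reward below are hypothetical.
import numpy as np
from scipy.linalg import expm
from scipy.integrate import trapezoid

Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])     # transition rate matrix Q(f): rows sum to 0
r = np.array([1.0, -1.0])        # reward vector r(f)

T = 50.0                         # truncation horizon; expm(t*Q) ~ P*(f) for large t
eta = expm(T * Q) @ r            # EA reward: P(t, f) r(f) -> eta(f)

ts = np.linspace(0.0, T, 2001)
integrand = np.array([expm(t * Q) @ r - eta for t in ts])
g = trapezoid(integrand, ts, axis=0)      # bias (3), truncated at T

print("eta(f) ~", eta)
print("g(f)   ~", g)
print("residual of eta = r + Q g :", eta - (r + Q @ g))   # cf. (4) in Lemma 2 below
```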
3. Difference formulas

In this section, we derive difference formulas for EA rewards and biases, which play a critical role in the following arguments. To begin with, we present some facts that are required for solving η(f) and g(f).

Lemma 1. Let f be in F, and t ≥ 0. Then,
(a) The limit P∗(f) := lim_{t→∞} P(t, f) exists.
(b) P(t, f)P∗(f) = P∗(f)P(t, f) = P∗(f)P∗(f) = P∗(f), and η(f) = P∗(f)r(f).
(c) Q(f)P∗(f) = P∗(f)Q(f) = 0.
(d) ∫₀^∞ ‖P(t, f) − P∗(f)‖ dt < ∞, where ‖B‖ denotes the absolute value norm of a matrix B = [b_{ij}]_{i,j∈S}, defined by ‖B‖ := Σ_{i,j∈S} |b_{ij}|.

Proof. This is well known; see [14, p. 183] and Theorem 5 in [5], for instance. □

Lemma 2. η(f) and g(f) are the unique solution to the following equations:

Q(f)η(f) = 0, η(f) = r(f) + Q(f)g(f), P∗(f)g(f) = 0.   (4)

Proof. By Lemma 1(b) and (c), we have Q(f)η(f) = Q(f)P∗(f)r(f) = 0. Since P(0, f) = I, by Lemma 1(a), (2) and (3) we have

Q(f)g(f) = ∫₀^∞ Q(f)P(t, f)r(f) dt = P∗(f)r(f) − r(f) = η(f) − r(f).   (5)

Moreover, it follows from (3) and Lemma 1(b) that P∗(f)g(f) = 0, which together with (5) means that η(f) and g(f) are a solution to the equations in (4). To prove the uniqueness, we suppose that x and y satisfy

Q(f)x = 0, x = r(f) + Q(f)y, P∗(f)y = 0.

Thus, P(t, f)Q(f)x = 0 for all t ≥ 0, and so (by (2)) we have [P(T, f) − I]x = 0 for all T ≥ 0. Letting T → ∞ and using Lemma 1(a), we get P∗(f)x = x. Moreover, by Lemma 1(b) and the hypothesis we have x = P∗(f)x = P∗(f)r(f) + P∗(f)Q(f)y = η(f). Thus, it remains to show that y = g(f). In fact, by Lemma 1(b) and x = η(f) = P∗(f)r(f), from x = r(f) + Q(f)y we have P(t, f)Q(f)y = P(t, f)x − P(t, f)r(f) = −[P(t, f)r(f) − η(f)], which together with (2) and a straightforward calculation gives

P(T, f)y − y = −∫₀^T [P(t, f)r(f) − η(f)] dt   ∀T > 0.   (6)

Letting T → ∞ in (6), by P∗(f)y = 0 and (3) we have y = g(f). □

To state the difference formulas, for each f ∈ F we introduce the notation

z(f) := −∫₀^∞ [P(t, f) − P∗(f)]g(f) dt, w(f) := −∫₀^∞ [P(t, f) − P∗(f)]z(f) dt.   (7)

Lemma 1(d) guarantees the finiteness of z(f) and w(f). To solve z(f), we can use the following lemma.

Lemma 3. For each f ∈ F, z(f) is the unique solution to the following equations:

P∗(f)z(f) = 0, Q(f)z(f) = g(f).   (8)

Proof. By Lemma 1(b) and (7) we have

P∗(f)z(f) = −∫₀^∞ [P∗(f)P(t, f) − P∗(f)P∗(f)]g(f) dt = 0.   (9)

Also, by (2) and Lemma 1(b) we have

Q(f)z(f) = −∫₀^∞ [Q(f)P(t, f)]g(f) dt = −[P∗(f) − I]g(f) = g(f),

which together with (9) implies that z(f) is a solution to the equations in (8). To prove the uniqueness, suppose that P∗(f)x = 0 and Q(f)x = g(f). Then, by Lemmas 1 and 2 we have P(t, f)Q(f)x = P(t, f)g(f) = [P(t, f) − P∗(f)]g(f), which, together with (2) and a straightforward calculation, gives

[P(T, f) − I]x = ∫₀^T [P(t, f) − P∗(f)]g(f) dt   ∀T > 0.   (10)

Then, by P∗(f)x = 0, (7) and (10), we get x = z(f), so the uniqueness follows. □
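Lemmas 2 and 3 reduce the computation of η(f), g(f) and z(f) to finite linear systems once P∗(f) is known. The sketch below (my own illustration, not the authors' code) approximates P∗(f) by exp(TQ(f)) for a large horizon T and solves the stacked systems (4) and (8) by least squares; the multichain example data are hypothetical.

```python
# A hedged numerical sketch of policy evaluation via Lemma 2 and Lemma 3
# (assumption: P*(f) is approximated by expm(T*Q) for a large horizon T).
import numpy as np
from scipy.linalg import expm

def evaluate_policy(Q, r, T=200.0):
    """Return (eta, g, z) solving (4) and (8) for a finite generator Q and reward r."""
    n = Q.shape[0]
    Pstar = expm(T * Q)                      # numerical stand-in for P*(f) = lim P(t, f)

    eta = Pstar @ r                          # Lemma 1(b): eta(f) = P*(f) r(f)

    # Lemma 2: Q g = eta - r and P* g = 0; stack the two systems and solve for g.
    A = np.vstack([Q, Pstar])
    g, *_ = np.linalg.lstsq(A, np.concatenate([eta - r, np.zeros(n)]), rcond=None)

    # Lemma 3: Q z = g and P* z = 0.
    z, *_ = np.linalg.lstsq(A, np.concatenate([g, np.zeros(n)]), rcond=None)
    return eta, g, z

# Hypothetical multichain example: {0,1} and {2} are closed classes, state 3 is transient.
Q = np.array([[-1.0,  1.0,  0.0,  0.0],
              [ 2.0, -2.0,  0.0,  0.0],
              [ 0.0,  0.0,  0.0,  0.0],
              [ 1.0,  0.0,  1.0, -2.0]])
r = np.array([1.0, -1.0, 0.5, 2.0])
eta, g, z = evaluate_policy(Q, r)
print("eta(f):", eta)     # differs across recurrent classes in a multichain model
print("g(f):  ", g)
print("z(f):  ", z)
```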
We now give the performance difference formulas.

Theorem 1. Let f and f̃ be in F. Then

(a) η(f̃) − η(f) = P∗(f̃)[r(f̃) + Q(f̃)g(f) − η(f)] + [P∗(f̃) − I]η(f).

(b) If η(f) = η(f̃), then

g(f̃) − g(f) = ∫₀^∞ P(t, f̃)[r(f̃) + Q(f̃)g(f) − η(f)] dt + P∗(f̃)[Q(f̃) − Q(f)]z(f).

(c) If g(f) = g(f̃), then

z(f̃) − z(f) = ∫₀^∞ P(t, f̃)[Q(f̃) − Q(f)]z(f) dt + P∗(f̃)[Q(f̃) − Q(f)]w(f).

Proof. (a) Since P∗(f̃)Q(f̃) = 0 (by Lemma 1), we have

η(f̃) − η(f) = P∗(f̃)r(f̃) − η(f) = P∗(f̃)[r(f̃) + Q(f̃)g(f) − η(f)] + [P∗(f̃) − I]η(f).

This is Theorem 1(a).

(b) Since P∗(f)r(f) = η(f) = η(f̃) = P∗(f̃)r(f̃), by Lemma 1 and (3) we have

g(f̃) − g(f) = ∫₀^∞ [P(t, f̃)r(f̃) − P(t, f)r(f)] dt =: ∫₀^∞ P(t, f̃)[r(f̃) + Q(f̃)g(f) − η(f)] dt + ∆,   (11)

where

∆ := −∫₀^∞ P(t, f̃)Q(f̃)g(f) dt + ∫₀^∞ [P(t, f̃)η(f) − P(t, f)r(f)] dt
   = g(f) − P∗(f̃)g(f) + ∫₀^∞ [P∗(f̃)r(f̃) − P(t, f)r(f)] dt
   = g(f) − P∗(f̃)g(f) + ∫₀^∞ [P∗(f) − P(t, f)]r(f) dt
   = −P∗(f̃)g(f),

which together with (11), Lemma 1(c) and (8) gives

g(f̃) − g(f) = ∫₀^∞ P(t, f̃)[r(f̃) + Q(f̃)g(f) − η(f)] dt + P∗(f̃)[Q(f̃) − Q(f)]z(f).

This is Theorem 1(b).

(c) By (7), (2), g(f) = g(f̃), and Lemma 3 we have

z(f̃) − z(f) = ∫₀^∞ [P(t, f) − P(t, f̃)]g(f) dt = ∫₀^∞ [P(t, f) − P(t, f̃)]Q(f)z(f) dt
            = ∫₀^∞ P(t, f̃)[Q(f̃) − Q(f)]z(f) dt + ∫₀^∞ P(t, f)Q(f)z(f) dt − ∫₀^∞ P(t, f̃)Q(f̃)z(f) dt
            = ∫₀^∞ P(t, f̃)[Q(f̃) − Q(f)]z(f) dt − P∗(f̃)z(f).   (12)

Moreover, by (7), Lemma 3 and (2) we have

Q(f)w(f) = −∫₀^∞ Q(f)P(t, f)z(f) dt = −[P∗(f) − I]z(f) = z(f),

which together with (12) and Lemma 1(c) gives

z(f̃) − z(f) = ∫₀^∞ P(t, f̃)[Q(f̃) − Q(f)]z(f) dt + P∗(f̃)[Q(f̃) − Q(f)]w(f).

This is (c). □
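Since Theorem 1(a) is an identity between computable matrices and vectors, it is easy to test numerically. The sketch below (my own check, not from the paper) does so for two randomly generated policies; it also uses the standard deviation-matrix identity D(f) = (P∗(f) − Q(f))⁻¹ − P∗(f), which is not stated in the paper, to obtain g(f) = D(f)r(f). Formula (b) can be checked the same way for two policies with equal EA rewards.

```python
# A hedged numerical check of Theorem 1(a) (not from the paper); Q_f, Q_ft, r_f, r_ft
# are hypothetical generators/rewards for two policies f and f~.
import numpy as np
from scipy.linalg import expm

def Pstar(Q, T=200.0):
    return expm(T * Q)                    # stand-in for P*(f) = lim_t P(t, f)

def deviation(Q):
    P = Pstar(Q)
    return np.linalg.inv(P - Q) - P       # D(f) = integral of [P(t, f) - P*(f)] dt

rng = np.random.default_rng(0)
n = 4
def random_policy_data():
    Q = rng.uniform(0.5, 1.5, size=(n, n))
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows sum to zero
    return Q, rng.normal(size=n)

Q_f, r_f = random_policy_data()           # policy f
Q_ft, r_ft = random_policy_data()         # policy f~

eta_f, eta_ft = Pstar(Q_f) @ r_f, Pstar(Q_ft) @ r_ft
g_f = deviation(Q_f) @ r_f                # bias (3): g(f) = D(f) r(f)

lhs = eta_ft - eta_f
rhs = Pstar(Q_ft) @ (r_ft + Q_ft @ g_f - eta_f) + (Pstar(Q_ft) - np.eye(n)) @ eta_f
print(np.allclose(lhs, rhs, atol=1e-8))   # Theorem 1(a) holds for any pair f, f~
```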
Theorem 1 is new. We call Theorem 1(a) the difference formula of the EA rewards, Theorem 1(b) the difference formula of the biases, and Theorem 1(c) the difference formula of the performance z(f) (i.e., the bias of the bias g(f)). These performance difference formulas lead to the fundamental results in average and bias optimality for continuous-time MDPs, which will be discussed in the following sections.

4. Preliminaries

In this section, we state some results related to the canonical form of P∗(f) and then present some preliminaries about EA optimality.

For any f ∈ F, let S_k^f ⊂ S, k = 1, ..., m, be the disjoint irreducible closed sets of recurrent states of a Markov chain with transition probability matrix P(t, f), where m is the number of such sets, and let S_{m+1}^f be the set of transient states. Then it is well known (see, e.g., [14]) that, by reordering the states in S, Q(f) and P∗(f) take the ‘‘canonical form’’

Q(f) = \begin{pmatrix} Q_1(f) & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & Q_m(f) & 0 \\ T_1(f) & \cdots & T_m(f) & T_{m+1}(f) \end{pmatrix},
P^*(f) = \begin{pmatrix} P_1^*(f) & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & P_m^*(f) & 0 \\ T_1^*(f) & \cdots & T_m^*(f) & 0 \end{pmatrix},   (13)

in which P_k^∗(f) = e_k π_k(f), where π_k(f) is the steady-state probability (row) vector obtained from π_k(f)Q_k(f) = 0 subject to π_k(f)e_k = 1, e_k is a column vector with 1 as all its components, and T_k(f)P_k^∗(f) + T_{m+1}(f)T_k^∗(f) = 0 for k = 1, ..., m. Hence, the canonical form (13) can be used to solve η(f).
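To illustrate how the canonical form (13) can be used to solve η(f), the following sketch (my own, with a hypothetical generator; all names are mine) identifies the closed communicating classes of Q(f) with scipy's strongly-connected-components routine, solves π_k(f)Q_k(f) = 0 with π_k(f)e_k = 1 for each class, and fills in the rows of P∗(f) for transient states through the relation T_k(f)P_k^∗(f) + T_{m+1}(f)T_k^∗(f) = 0.

```python
# A hedged sketch of computing P*(f) via the canonical form (13); the example generator
# is hypothetical. Assumes standard facts: a class is recurrent iff it is closed, and the
# transient-to-transient block T_{m+1}(f) is nonsingular.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def limiting_matrix(Q, tol=1e-12):
    n = Q.shape[0]
    support = (Q > tol).astype(int)                      # edges i -> j for q(j|i, f) > 0
    _, label = connected_components(csr_matrix(support), directed=True, connection='strong')

    # A communicating class is recurrent iff no rate leads out of it.
    classes = [np.flatnonzero(label == c) for c in range(label.max() + 1)]
    recurrent = [C for C in classes
                 if np.all(np.abs(Q[np.ix_(C, np.setdiff1d(np.arange(n), C))]) <= tol)]
    transient = np.setdiff1d(np.arange(n), np.concatenate(recurrent))

    Pstar = np.zeros((n, n))
    for C in recurrent:                                  # P_k*(f) = e_k pi_k(f)
        Qk = Q[np.ix_(C, C)]
        A = np.vstack([Qk.T, np.ones(len(C))])           # pi_k Q_k = 0, pi_k e_k = 1
        b = np.concatenate([np.zeros(len(C)), [1.0]])
        pik, *_ = np.linalg.lstsq(A, b, rcond=None)
        Pstar[np.ix_(C, C)] = np.outer(np.ones(len(C)), pik)

    if transient.size:                                   # T_k* = -T_{m+1}^{-1} T_k P_k*
        R = np.concatenate(recurrent)
        QTT = Q[np.ix_(transient, transient)]
        QTR = Q[np.ix_(transient, R)]
        Pstar[np.ix_(transient, R)] = -np.linalg.solve(QTT, QTR @ Pstar[np.ix_(R, R)])
    return Pstar

# Hypothetical example: two recurrent classes {0,1} and {2}, one transient state 3.
Q = np.array([[-1.0,  1.0, 0.0,  0.0],
              [ 2.0, -2.0, 0.0,  0.0],
              [ 0.0,  0.0, 0.0,  0.0],
              [ 1.0,  0.0, 1.0, -2.0]])
r = np.array([1.0, -1.0, 0.5, 2.0])
print("eta(f) =", limiting_matrix(Q) @ r)                # Lemma 1(b): eta(f) = P*(f) r(f)
```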
By (13), we have the following simple and interesting observations.

Lemma 4. Let P∗(f) be as in Lemma 1(a) and let u be a vector on S. Then,
(a) if P∗(f)u = 0 and u ≤ 0 (or u ≥ 0), then u(i) = 0 for all recurrent states i under P(t, f);
(b) if u(i) ≥ 0 (or u(i) ≤ 0) for all recurrent states i under P(t, f), then P∗(f)u ≥ 0 (or P∗(f)u ≤ 0).

Proof. Since it follows from (13) that the columns in P∗(f) corresponding to the transient states in S_{m+1}^f are all zeros, all u(i) with i ∈ S_{m+1}^f contribute nothing to P∗(f)u. Moreover, since all the entries in P_k^∗(f), k = 1, ..., m, are positive, the lemma follows. □

This lemma plays a key role in our arguments below, such as in the proof of the anti-cycling rule for the following policy iteration procedures. To state an algorithm for computing an EA-optimal policy, for a given f ∈ F, i ∈ S and a ∈ A(i), let

H^f(i, a) := r(i, a) + Σ_{j∈S} q(j|i, a)g(f)(j),   (14)

A^f(i) := {a ∈ A(i) : Σ_{j∈S} q(j|i, a)η(f)(j) > 0; or H^f(i, a) > H^f(i, f(i)) when Σ_{j∈S} q(j|i, a)η(f)(j) = 0}.   (15)

We then define a so-called improvement policy h ∈ F (depending on f) as follows:

h(i) ∈ A^f(i) when A^f(i) ≠ ∅, and h(i) = f(i) if A^f(i) = ∅, for i ∈ S.   (16)

The EA-optimal policy iteration algorithm:
1. Let n = 0 and select an arbitrary policy f_n ∈ F.
2. Obtain (by Lemma 2 and the canonical form) η(f_n) and g(f_n).
3. Obtain an improvement policy f_{n+1} from (15) and (16) with f := f_n.
4. If f_{n+1} = f_n, then stop; f_{n+1} is EA-optimal (by Theorem 2 below). Otherwise, increase n by 1 and return to Step 2.
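The improvement step (15)–(16) is easy to implement once η(f_n) and g(f_n) are available. Below is a minimal sketch of the whole loop (my own code, not the authors'; the dictionary-based model data and the numerical approximation of P∗(f) follow the earlier sketches, and a small tolerance replaces exact comparisons).

```python
# A hedged sketch of the EA-optimal policy iteration algorithm (Steps 1-4 above).
# Model data: `states`, `actions[i]`, `q[(i, a)][j]`, `r[(i, a)]` as in the first sketch.
import numpy as np
from scipy.linalg import expm

def evaluate(Qf, rf, T=200.0):
    """eta(f) and g(f) via Lemma 2, with P*(f) approximated by expm(T * Qf)."""
    n = len(rf)
    Pstar = expm(T * Qf)
    eta = Pstar @ rf
    g, *_ = np.linalg.lstsq(np.vstack([Qf, Pstar]),
                            np.concatenate([eta - rf, np.zeros(n)]), rcond=None)
    return eta, g

def ea_policy_iteration(states, actions, q, r, f0, eps=1e-8):
    rates = lambda i, a: np.array([q[(i, a)].get(j, 0.0) for j in states])
    f = dict(f0)                                   # Step 1: an arbitrary initial policy
    while True:
        Qf = np.array([rates(i, f[i]) for i in states])
        rf = np.array([r[(i, f[i])] for i in states])
        eta, g = evaluate(Qf, rf)                  # Step 2: policy evaluation
        h = dict(f)
        for i in states:                           # Step 3: improvement via (15)-(16)
            Hf = r[(i, f[i])] + rates(i, f[i]) @ g
            for a in actions[i]:
                drift = rates(i, a) @ eta          # sum_j q(j|i, a) eta(f)(j)
                H = r[(i, a)] + rates(i, a) @ g    # H^f(i, a) as in (14)
                if drift > eps or (abs(drift) <= eps and H > Hf + eps):
                    h[i] = a                       # a belongs to A^f(i) in (15)
                    break
        if h == f:                                 # Step 4: no state can be improved
            return f, eta, g
        f = h
```

For the toy model in the first sketch, ea_policy_iteration(states, actions, q, r, {0: "a", 1: "a"}) would be expected to return an EA-optimal policy together with η and g.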
Theorem 2. (a) The EA-optimal policy iteration algorithm stops at a policy f∗ satisfying the average optimality equations (17) and (18) in a finite number of iterations:

max_{a∈A(i)} { Σ_{j∈S} q(j|i, a)η(f∗)(j) } = 0   ∀i ∈ S,   (17)

max_{a∈B^{f∗}(i)} { r(i, a) + Σ_{j∈S} q(j|i, a)g(f∗)(j) } = η(f∗)(i)   ∀i ∈ S,   (18)

where B^{f∗}(i) := { a ∈ A(i) : Σ_{j∈S} q(j|i, a)η(f∗)(j) = 0 } ≠ ∅.   (19)

(b) Each policy f∗ satisfying (17) and (18) is EA-optimal, and thus the policy f∗ in (a) is EA-optimal.

Theorem 2 is well known (see [4,5,7], for instance), and it can also be proved by using Theorem 1 and Lemma 4. However, the details are omitted here.

5. Bias-optimal policies

In this section, we show the existence and computation of a bias-optimal policy. To do so, we introduce the following notation: for two vectors u and v, we write ‘‘u ≻ v’’ if u ≥ v and u(i) > v(i) for at least one i ∈ S.

We begin with a characterization of a bias-optimal policy by the bias optimality equations (BOEs); see Theorem 3 below.

Theorem 3. If a policy f∗ ∈ F satisfies the following three BOEs:

max_{a∈A(i)} { Σ_{j∈S} q(j|i, a)η(f∗)(j) } = 0   ∀i ∈ S,   (20)

max_{a∈B^{f∗}(i)} { r(i, a) + Σ_{j∈S} q(j|i, a)g(f∗)(j) } = η(f∗)(i)   ∀i ∈ S,   (21)

max_{a∈C^{f∗}(i)} { Σ_{j∈S} q(j|i, a)z(f∗)(j) } = Σ_{j∈S} q(j|i, f∗(i))z(f∗)(j)   ∀i ∈ S,   (22)

with B^{f∗}(i) as in (19), and

C^{f∗}(i) := { a ∈ B^{f∗}(i) : r(i, a) + Σ_{j∈S} q(j|i, a)g(f∗)(j) = η(f∗)(i) },

then f∗ is bias-optimal.

Proof. Let f ∈ F∗. As f∗ satisfies (20) and (21), by Theorem 2 we see that f∗ is in F∗, and so η(f∗) = η(f) = η∗. Thus, Q(f)η(f∗) = Q(f)η(f) = 0, and so f(i) is in B^{f∗}(i) for all i ∈ S. Then, it follows from (21) that η(f∗) − r(f) − Q(f)g(f∗) ≥ 0, which together with Lemma 1 gives P∗(f)[η(f∗) − r(f) − Q(f)g(f∗)] = 0. Therefore, by Lemma 4(a) we have r(i, f(i)) + Σ_{j∈S} q(j|i, f(i))g(f∗)(j) = η(f∗)(i) for all recurrent states i under P(t, f), and so (by (22)) Σ_{j∈S} q(j|i, f(i))z(f∗)(j) ≤ Σ_{j∈S} q(j|i, f∗(i))z(f∗)(j) for all recurrent states i under P(t, f). Thus, by Lemma 4(b) we obtain P∗(f)[Q(f) − Q(f∗)]z(f∗) ≤ 0. Since we have shown that r(f) + Q(f)g(f∗) − η(f∗) ≤ 0, by P(t, f) ≥ 0 and P∗(f)[Q(f) − Q(f∗)]z(f∗) ≤ 0 we have

∫₀^∞ P(t, f)[r(f) + Q(f)g(f∗) − η(f∗)] dt + P∗(f)[Q(f) − Q(f∗)]z(f∗) ≤ 0,

which together with Theorem 1(b) gives g(f) ≤ g(f∗) for all f ∈ F∗. Thus, f∗ is bias-optimal. □

Theorem 3 provides a characterization of a bias-optimal policy f∗. The existence and calculation of an f∗ satisfying the BOEs (20)–(22) will be shown below. To do so, we need the following lemmas.

Lemma 5. Suppose that f∗ ∈ F∗ and f ∈ F.
(a) If Q(f)η(f∗) = 0 and r(f) + Q(f)g(f∗) ≥ η(f∗), then η(f) = η(f∗).
(b) Under the conditions in (a), if in addition Q(f)z(f∗)(i) ≥ Q(f∗)z(f∗)(i) for all states i such that [r(f) + Q(f)g(f∗)](i) = η(f∗)(i), then η(f) = η(f∗) and g(f) ≥ g(f∗).

Proof. (a) Let û := [Q(f) − Q(f∗)]z(f∗) and v̂ := r(f) + Q(f)g(f∗) − η(f∗). Then, by the conditions in (a) we have v̂ ≥ 0 and P(t, f)Q(f)η(f∗) = 0 for all t ≥ 0, and so it follows from (2) and Lemma 1(a), together with a straightforward calculation, that [P∗(f) − I]η(f∗) = 0. Thus, by Theorem 1(a) we have

η(f) − η(f∗) = P∗(f)v̂ + [P∗(f) − I]η(f∗) = P∗(f)v̂ ≥ 0,   (23)

which together with the EA optimality of f∗ gives η(f) = η(f∗), and so (a) follows.
(b) From (a) and (23) we see that P∗(f)v̂ = 0. Thus, by v̂ ≥ 0 and Lemma 4(a) we further have v̂(i) = 0 for all recurrent states i under P(t, f). Hence, by the condition in (b) we have û(i) ≥ 0 for all recurrent states i under P(t, f). Then, it follows from Lemma 4(b) that P∗(f)û ≥ 0, and so by Theorem 1(b) we have g(f) − g(f∗) = ∫₀^∞ P(t, f)v̂ dt + P∗(f)û ≥ 0. This together with (a) proves (b). □

In what follows, we provide a bias-optimal policy iteration algorithm to compute a policy f∗ solving the BOEs (20)–(22). To state such an algorithm and its proof, we use the following notation. Since η∗ = η(f) for all f ∈ F∗, by (19) we see that B^f(i) (for any fixed i ∈ S) is independent of f ∈ F∗, and we denote it by B∗(i) for simplicity. Thus, we can define

D^f(i) := { a ∈ B∗(i) : H^f(i, a) > H^f(i, f(i)); or Σ_{j∈S} q(j|i, a)z(f)(j) > Σ_{j∈S} q(j|i, f(i))z(f)(j) when H^f(i, a) = H^f(i, f(i)) }   (24)

for each f ∈ F∗, where H^f(i, a) is as in (14). We then define an improvement policy h ∈ F (depending on f) as follows:

h(i) ∈ D^f(i) when D^f(i) ≠ ∅, and h(i) = f(i) if D^f(i) = ∅, ∀i ∈ S.   (25)

By Lemma 1(c) we see that f(i) is in B∗(i) for all i ∈ S. However, from (24) we see that f(i) is not in D^f(i) for any i ∈ S.

Lemma 6. For any given f ∈ F∗, let h be defined as in (25). Then,
(a) η(h) = η(f) = η∗ and g(h) ≥ g(f).
(b) If g(h) = g(f) and h ≠ f, then z(h) ≻ z(f).

Proof. (a) Since, as mentioned above, f(i) is in B∗(i) for all i ∈ S, it follows from (24) and (25) that h(i) is in B∗(i) for all i ∈ S, and so Q(h)η∗ = Q(h)η(f) = 0. Also, since H^f(i, f(i)) = η(f)(i) for all i ∈ S, by (24) and (25) we have H^f(i, h(i)) ≥ H^f(i, f(i)) = η(f)(i) for all i ∈ S, that is, r(h) + Q(h)g(f) ≥ η(f). Thus, again using (24) and Lemma 5 (with f∗ and f there replaced by f and h here, respectively), we see that (a) is true.

(b) Since g(h) = g(f), by (a) (just proved) and Lemma 2 we have

r(h) + Q(h)g(f) = r(h) + Q(h)g(h) = η(h) = η(f),   (26)

which together with Theorem 1(b) and g(h) = g(f) gives

P∗(h)[Q(h) − Q(f)]z(f) = 0.   (27)

By (26) and (14) we see that H^f(i, h(i)) = H^f(i, f(i)) for all i ∈ S. Hence, by (24) and (25) we have

[Q(h) − Q(f)]z(f) ≥ 0,   (28)

which together with Lemma 4(a), (27) and (24) implies that D^f(i) = ∅ for all recurrent states i under P(t, h). Thus, we have h(i) = f(i) for all recurrent states i under P(t, h). Then, from the canonical form (13) for P∗(h) we see that P∗(h)[Q(h) − Q(f)] = 0, which together with (28) and Theorem 1(c) gives

z(h) − z(f) = ∫₀^∞ P(t, h)[Q(h) − Q(f)]z(f) dt ≥ 0.

It remains to show that z(h) ≠ z(f). Suppose that z(h) = z(f). Since η(h) = η(f) and g(h) = g(f), it follows from Lemma 3 that

Q(f)z(f) = g(f) = g(h) = Q(h)z(h) = Q(h)z(f).   (29)

On the other hand, since h ≠ f, by (24)–(26) we have Q(h)z(f) ≻ Q(f)z(f), which contradicts (29). □
With Lemma 6, we can easily obtain an algorithm as follows.

A bias-optimal policy iteration algorithm:
1. Let n = 0 and select an EA-optimal policy f_n ∈ F∗ (by Theorem 2).
2. (Policy evaluation) Obtain (by Lemmas 2 and 3) g(f_n) and z(f_n).
3. (Policy improvement) Obtain an improvement policy f_{n+1} from (25), with f and h replaced by f_n and f_{n+1}, respectively.
4. If f_{n+1} = f_n, then stop; f_{n+1} is bias-optimal (by Theorem 4 below). Otherwise, increase n by 1 and return to Step 2.
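A minimal sketch of this loop is given below (again my own code, not the authors'; it reuses the hypothetical dictionary-based model of the earlier sketches, evaluates g(f_n) and z(f_n) numerically via Lemmas 2 and 3, and uses a small tolerance for the comparisons defining B∗(i) and D^f(i) in (24)).

```python
# A hedged sketch of the bias-optimal policy iteration algorithm (steps 1-4 above).
import numpy as np
from scipy.linalg import expm

def evaluate(Qf, rf, T=200.0):
    """eta(f), g(f), z(f) via Lemmas 2 and 3, with P*(f) ~ expm(T * Qf)."""
    n = len(rf)
    Pstar = expm(T * Qf)
    eta = Pstar @ rf
    solve = lambda rhs: np.linalg.lstsq(np.vstack([Qf, Pstar]),
                                        np.concatenate([rhs, np.zeros(n)]), rcond=None)[0]
    g = solve(eta - rf)            # Q g = eta - r, P* g = 0  (Lemma 2)
    z = solve(g)                   # Q z = g,       P* z = 0  (Lemma 3)
    return eta, g, z

def bias_policy_iteration(states, actions, q, r, f0, eps=1e-6):
    """f0 should already be EA-optimal (step 1); each pass applies (24)-(25)."""
    rates = lambda i, a: np.array([q[(i, a)].get(j, 0.0) for j in states])
    f = dict(f0)
    while True:
        Qf = np.array([rates(i, f[i]) for i in states])
        rf = np.array([r[(i, f[i])] for i in states])
        eta, g, z = evaluate(Qf, rf)              # step 2: policy evaluation
        h = dict(f)
        for i in states:                          # step 3: improvement via (24)-(25)
            Hf = r[(i, f[i])] + rates(i, f[i]) @ g
            zf = rates(i, f[i]) @ z
            for a in actions[i]:
                if a == f[i] or abs(rates(i, a) @ eta) > eps:
                    continue                      # keep only actions in B*(i)
                H = r[(i, a)] + rates(i, a) @ g   # H^f(i, a) as in (14)
                # a is in D^f(i): strictly better H^f, or equal H^f and a better z-test.
                if H > Hf + eps or (abs(H - Hf) <= eps and rates(i, a) @ z > zf + eps):
                    h[i] = a
                    break
        if h == f:
            return f, eta, g, z                   # step 4: fixed point, bias-optimal
        f = h
```

Starting it from the EA-optimal policy returned by the earlier EA-optimal policy iteration sketch would carry out Steps 1–4 for the toy model.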
Lemma 6 can be used to compare the biases of two policies, as well as to prove the anti-cycling property of the bias-optimal policy iteration procedure. The existence of a policy f∗ satisfying the BOEs (20)–(22) is proved by construction, as shown in the following theorem.

Theorem 4. The bias-optimal policy iteration algorithm stops at a bias-optimal policy satisfying the BOEs (20)–(22) in a finite number of iterations.

Proof. Let {f_n, n ≥ 0} be the sequence of policies generated by the bias-optimal policy iteration algorithm above. Then, by Lemma 6, all policies in the iteration sequence are different. Since the number of policies is finite, the iteration must stop after a finite number of iterations; suppose it stops at a policy denoted by f∗. Then, by Lemma 6 and Theorem 2 we see that f∗ satisfies (20); moreover, it must satisfy the BOEs (21) and (22), because otherwise the set D^{f∗}(i) would be non-empty for some i and we could find a next improvement policy in the policy iteration. Thus, by Theorem 3, f∗ is bias-optimal. □
Acknowledgment

The research of the authors was supported by the NSFC, the grant (CUHK 450607) from the Research Grants Council of the Hong Kong Special Administrative Region, and the Sun Yat-Sen University Science Foundation, respectively.

References

[1] X.P. Guo, Constrained optimality for average cost continuous-time Markov decision processes, IEEE Trans. Automat. Control 52 (2007) 1139–1143.
[2] X.P. Guo, O. Hernández-Lerma, T. Prieto-Rumeau, A survey of recent results on continuous-time Markov decision processes, Top 14 (2006) 177–246.
[3] D. Honhon, S. Seshadri, Admission control with finite buffer and partial queue information, Probab. Engrg. Inform. Sci. 21 (2007) 19–46.
[4] P. Kakumanu, Nondiscounted continuous-time Markov decision processes with countable state and action spaces, SIAM J. Control 10 (1972) 210–220.
[5] B.L. Miller, Finite state continuous time Markov decision processes with an infinite planning horizon, J. Math. Anal. Appl. 22 (1968) 552–569.
[6] T. Prieto-Rumeau, O. Hernández-Lerma, Bias optimality for continuous-time controlled Markov chains, SIAM J. Control Optim. 45 (2006) 51–73.
[7] M.L. Puterman, Markov Decision Processes, Wiley, New York, 1994.
[8] L.I. Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley, New York, 1999.
[9] X.-R. Cao, J.Y. Zhang, The nth-order bias optimality for multi-chain Markov decision processes, IEEE Trans. Automat. Control 53 (2008) 496–508.
[10] E.A. Feinberg, A. Shwartz, Handbook of Markov Decision Processes, Kluwer Academic Publishers, Boston, Dordrecht, London, 2002.
[11] M. Haviv, M.L. Puterman, Bias optimality in controlled queuing systems, J. Appl. Probab. 35 (1998) 136–150.
[12] M.E. Lewis, M.L. Puterman, A probabilistic analysis of bias optimality in unichain Markov decision processes, IEEE Trans. Automat. Control 46 (2001) 96–100.
[13] A.F. Veinott, On finding optimal policies in discrete dynamic programming with no discounting, Ann. Math. Statist. 37 (1966) 1284–1294.
[14] K.L. Chung, Markov Chains with Stationary Transition Probabilities, 2nd ed., Springer, Berlin, 1967.
[15] X.-R. Cao, X.P. Guo, A unified approach to Markov decision problems and performance sensitivity analysis with discounted and average criteria: Multichain cases, Automatica 40 (2004) 1749–1759.