Convergence of controlled models and finite-state approximation for discounted continuous-time Markov decision processes with constraints


Contents lists available at ScienceDirect

European Journal of Operational Research journal homepage: www.elsevier.com/locate/ejor

Stochastics and Statistics

Xianping Guo a,*, Wenzhao Zhang a,b

a School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou 510275, PR China
b College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, PR China

Article info

Article history: Received 31 December 2012; Accepted 24 March 2014; Available online xxxx

Keywords: Constrained continuous-time Markov decision processes; Unbounded transition rate; Convergence; Finite approximation

Abstract. In this paper we consider the convergence of a sequence $\{M_n\}$ of models of discounted continuous-time constrained Markov decision processes (MDP) to a "limit" model, denoted by $M_\infty$. For models with denumerable states and unbounded transition rates, under reasonably mild conditions we prove that the (constrained) optimal policies and the optimal values of $\{M_n\}$ converge to those of $M_\infty$, respectively, using a technique of occupation measures. As an application of the convergence results developed here, we show that an optimal policy and the optimal value of a countable-state continuous-time MDP can be approximated by those of finite-state continuous-time MDPs. Finally, we illustrate this finite-state approximation by numerically solving a controlled birth-and-death system, and we give the corresponding error bound of the approximation. © 2014 Elsevier B.V. All rights reserved.

1. Introduction

Constrained Markov decision processes (MDP) form an important class of stochastic control problems with applications in many areas such as telecommunication networks and queueing systems; see, for instance, Guo and Hernández-Lerma (2009), Hordijk and Spieksma (1989), and Sennott (1991). As is well known, the main purpose of studies on constrained MDP is the existence and computation of optimal policies; see, for instance, the literature on discrete-time MDP by Feinberg and Shwartz (1999), Feinberg (2000), Hordijk and Spieksma (1989), Hernández-Lerma and González-Hernández (2000), Hernández-Lerma, González-Hernández, and López-Martínez (2003), and Sennott (1991), and the works on continuous-time MDP by Guo (2007), Guo and Hernández-Lerma (2003, 2009), and Guo and Piunovskiy (2011). On the other hand, from both a theoretical and a practical point of view, it is of interest to analyze the convergence of the optimal values and optimal policies of constrained MDP, and such convergence problems have been considered by, for instance, Altman (1999), Zadorojniy and Shwartz (2006), and Alvarez-Mena and Hernández-Lerma (2002). Alvarez-Mena and Hernández-Lerma (2006) also consider the convergence problem of Alvarez-Mena and Hernández-Lerma (2002) for the case of

* Corresponding author. Tel.: +86 020 84113190; fax: +86 020 84037978. E-mail addresses: [email protected] (X. Guo), [email protected] (W. Zhang).

more than one controller. To the best of our knowledge, however, these existing works on convergence problems concern constrained discrete-time MDP. Most recently, the convergence problem of controlled models for unconstrained continuous-time MDP has also been considered by Prieto-Rumeau and Lorenzo (2010) and Prieto-Rumeau and Hernández-Lerma (2012), using an approximation of the optimality equations. However, the analogous convergence problem for constrained continuous-time MDP has not been considered.

This paper studies the convergence problem for constrained continuous-time MDP. More precisely, we consider a sequence $\{M_n\}$ of models of constrained continuous-time MDP with the following features: (1) the state space is denumerable, but the action space is general; (2) the transition rates and all reward/cost functions are allowed to be unbounded; and (3) the optimality criterion is the expected discounted reward/cost, while constraints are imposed on similar discounted rewards/costs. We aim to give suitable conditions on the models $\{M_n\}$ under which the optimal policies and the optimal values of $\{M_n\}$ converge to those of the limit model $M_\infty$ of the sequence $\{M_n\}$, respectively.

In general, the approaches to studying continuous-time MDP can be roughly classified into two groups: the indirect method and the direct method. The idea of the indirect method is to convert the continuous-time MDP into an equivalent discrete-time MDP; this approach has been justified by Feinberg (2004, 2012) and Piunovskiy and Zhang (2012). On the other hand, the most

http://dx.doi.org/10.1016/j.ejor.2014.03.037
0377-2217/© 2014 Elsevier B.V. All rights reserved.

Please cite this article in press as: Guo, X., & Zhang, W. Convergence of controlled models and finite-state approximation for discounted continuous-time Markov decision processes with constraints. European Journal of Operational Research (2014), http://dx.doi.org/10.1016/j.ejor.2014.03.037


common direct method for investigating constrained continuous-time MDP is to establish an equivalent linear program formulation of the original constrained problem; see Guo and Piunovskiy (2011). In this paper, we follow this direct approach without involving discrete-time MDP. First, as in Guo and Piunovskiy (2011), we transform the optimality problem in constrained continuous-time MDP into an equivalent optimality problem over a class of probability measures by introducing the occupation measure of a policy. Then, we analyze the asymptotic behavior of the occupation measures and of the expected discounted rewards/costs, which we use to prove that the optimal values and optimal policies of the sequence $\{M_n\}$ converge to those of $M_\infty$. Finally, we apply our results to the approximation of the optimal policies and the optimal value of countable-state continuous-time MDP by those of finite-state continuous-time MDP. More precisely, for a model $M'_\infty$ of constrained countable-state continuous-time MDP satisfying the usual conditions as in Guo and Hernández-Lerma (2009) and Guo and Piunovskiy (2011), we construct a sequence of models $\{M'_n\}$ of constrained continuous-time MDP with finitely many states such that every accumulation point of a sequence of optimal policies of $\{M'_n\}$ is optimal for $M'_\infty$, and the sequence of the optimal values of $\{M'_n\}$ converges to the optimal value of $M'_\infty$. Furthermore, we illustrate this finite-state approximation by numerically solving a controlled birth-and-death system, and we give the corresponding error bound of the approximation.
The motivation for providing such an approximation comes from the following facts: (i) there exist many methods to compute the optimal value and optimal policies of unconstrained continuous-time MDP with finitely many states, for example, the value iteration and policy iteration algorithms in Guo and Hernández-Lerma (2009) and Puterman (1994), the approximate dynamic programming technique of Cervellera and Macciò (2011), and so on. However, these methods, which are all based on the optimality equation, cannot be applied to constrained continuous-time MDP, since the optimality equation no longer holds for constrained MDP; (ii) the optimal value and optimal policies of finite-state constrained continuous-time MDP with finite actions can be computed by the well-known linear programming approach in Guo and Piunovskiy (2011) and Puterman (1994), whereas in general they cannot be computed for countable-state continuous-time MDP, because the number of states in such MDP is infinite.

The rest of this paper is organized as follows. In Section 2, we introduce the models of constrained continuous-time MDP and the convergence problems. In Section 3, we state our main results, which are proved in Section 6 after technical preliminaries given in Section 5. An application of the main results to finite-state approximation and a numerical example are given in Section 4. Finally, we finish this article with a conclusion in Section 7.

2. The models

In this section we introduce the models and the convergence problems we are concerned with.

Notation. If $X$ is a Polish space, we denote by $\mathcal{B}(X)$ its Borel $\sigma$-algebra, by $D^c$ the complement of a set $D \subseteq X$ (with respect to $X$), and by $P(X)$ the set of all probability measures on $\mathcal{B}(X)$, endowed with the topology of weak convergence. For a finite set $D$, we denote by $|D|$ the number of its elements. Let $N := \{1, 2, \ldots\}$ and $\bar N := N \cup \{\infty\}$.

Consider the sequence of models $\{M_n\}$ for constrained continuous-time MDP:

$$M_n := \big\{S_n,\ (A_n(i), i \in S_n),\ q_n(\cdot\,|i,a),\ c_n^0(i,a),\ (c_n^l(i,a), d_n^l,\ 1 \le l \le p),\ \gamma_n\big\}, \quad n \in \bar N, \qquad (2.1)$$

where $S_n$ is the state space, assumed to be denumerable, and $A_n(i)$ is the set of available actions (or decisions) at state $i \in S_n$ for model $M_n$. Let

$$K_n := \{(i,a) \mid i \in S_n,\ a \in A_n(i)\}$$

denote the set of all feasible state-action pairs for $M_n$. In what follows, we assume that $S_n \uparrow S_\infty$ and, without loss of generality, that $S_\infty = \{0, 1, \ldots, n, \ldots\}$. As a consequence, for each $i \in S_\infty$ we can define $n(i) := \min\{n \ge 1 \mid i \in S_n\}$. Furthermore, we assume that $A_n(i) \subseteq A_\infty(i)$ for $n \ge n(i)$, $i \in S_\infty$, and moreover, that for each $n \in \bar N$, $A_n(i)$ is in $\mathcal{B}(A_n)$, where the Polish space $A_n$ is the action space for $M_n$. Thus $\mathcal{B}(A_n(i)) = \mathcal{B}(A_\infty(i)) \cap A_n(i)$ and $P(A_n(i)) \subseteq P(A_\infty(i))$ for each $i \in S_\infty$ and $n \ge n(i)$.

For fixed $n \in \bar N$, the function $q_n(\cdot\,|i,a)$ in (2.1) refers to the conservative transition rates; that is, $q_n(j|i,a) \ge 0$ for all $(i,a) \in K_n$ with $j \ne i$, and $\sum_{j \in S_n} q_n(j|i,a) = 0$. Moreover, $q_n(j|i,\cdot)$ is a measurable function on $A_n(i)$ for each fixed $i, j \in S_n$. Furthermore, $q_n(\cdot\,|i,a)$ is assumed to be stable, that is, $q_n^*(i) := \sup_{a \in A_n(i)} |q_n(i|i,a)| < \infty$ for each $i \in S_n$. Finally, $c_n^0$ is the objective cost function, and $c_n^l$ ($1 \le l \le p$) are the cost functions on which constraints are imposed. The real numbers $d_n^l$ ($1 \le l \le p$) denote the constraint constants, and $\gamma_n$ denotes the initial distribution on $S_n$ for $M_n$.

To complete the specification of $M_n$ ($n \in \bar N$), we introduce the classes of policies. A randomized Markov policy $\pi$ for $M_n$ is a family $(\pi_t, t \ge 0)$ of stochastic kernels satisfying: (i) for each $t \ge 0$ and $i \in S_n$, $\pi_t(\cdot\,|i)$ is a probability measure (p.m.) on $A_n(i)$; and (ii) for each $D \in \mathcal{B}(A_n(i))$ and $i \in S_n$, $\pi_t(D|i)$ is a Borel-measurable function in $t \ge 0$. Moreover, a policy $\pi = (\pi_t, t \ge 0)$ is called (randomized) stationary for $M_n$ if, for each $i \in S_n$, there is a p.m. $\pi(\cdot\,|i) \in P(A_n(i))$ such that $\pi_t(\cdot\,|i) \equiv \pi(\cdot\,|i)$ for all $t \ge 0$; we denote such a policy by $(\pi(\cdot\,|i), i \in S_n)$. We denote by $\Pi_n$ the family of all randomized Markov policies and by $\Pi_n^s$ the set of all stationary policies, for each $n \in \bar N$. For each $n \in \bar N$ and policy $\pi = (\pi_t, t \ge 0) \in \Pi_n$, let

$$q_n(j|i,\pi_t) := \int_{A_n(i)} q_n(j|i,a)\,\pi_t(da|i), \qquad c_n^l(i,\pi_t) := \int_{A_n(i)} c_n^l(i,a)\,\pi_t(da|i) \qquad (2.2)$$

for each $i, j \in S_n$, $t \ge 0$, and $0 \le l \le p$. When $\pi$ is stationary, we write $q_n(j|i,\pi_t)$ and $c_n^l(i,\pi_t)$ as $q_n(j|i,\pi)$ and $c_n^l(i,\pi)$, respectively.

Let $Q_n(t,\pi) := [q_n(j|i,\pi_t)]$ be the associated matrix of transition rates, with $(i,j)$th element $q_n(j|i,\pi_t)$. As the matrix $[q_n(j|i,a)]$ is conservative and stable, so is $Q_n(t,\pi)$. Thus, Proposition C.4 in Guo and Hernández-Lerma (2009) ensures the existence of a so-called minimal transition function (see Definition C.3 in Guo and Hernández-Lerma (2009)) $p_n(s,i,t,j,\pi)$ for $M_n$, with $i, j \in S_n$ and $t \ge s \ge 0$. To guarantee the regularity condition (i.e., $\sum_{j \in S_n} p_n(s,i,t,j,\pi) = 1$ for all $i \in S_n$ and $t \ge s \ge 0$), we impose the following so-called drift conditions.

Assumption 2.1. There exist a function $w \ge 1$ on $S_\infty$ with $w(i) \uparrow +\infty$ as $i \to \infty$, and constants $\rho, b, L > 0$, such that

(a) $\sum_{j \in S_n} q_n(j|i,a)\, w(j) \le \rho\, w(i) + b$ for all $(i,a) \in K_n$, $n \in \bar N$;
(b) $q_n^*(i) \le L\, w(i)$ for all $i \in S_n$, $n \in \bar N$.

For each $\pi \in \Pi_n$, $n \in \bar N$ and $\gamma_n \in P(S_n)$, under Assumption 2.1, by Proposition C.9 and Theorem 2.3 in Guo and Hernández-Lerma (2009), the corresponding $p_n(s,i,t,j,\pi)$ is unique and regular; moreover, there exist a unique probability space $(\Omega, \mathcal{F}, P_{\gamma_n}^\pi)$ and a state-action process $\{(x_t, a_t), t \ge 0\}$ defined on this space. The
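For intuition, the conservativeness, stability, and drift requirements of Assumption 2.1 can be checked numerically on a concrete generator. The sketch below does so for a hypothetical birth-and-death generator with birth rate $\lambda i + a_2$ and death rate $\mu i + a_1$ and weight function $w(i) = i+1$; all numerical values are illustrative assumptions, not data from the paper.

```python
# Sketch: numerically verify conservativeness, stability, and the drift
# condition of Assumption 2.1 for a hypothetical birth-and-death generator
# q(j|i,a) with birth rate lam*i + a2 and death rate mu*i + a1 (illustrative
# choices), checked over the states {0, ..., n-1}.

def generator_row(i, a1, a2, lam=1.0, mu=2.0):
    """Return {j: q(j|i,a)} for the birth-and-death rates above."""
    row = {}
    if i > 0:
        row[i - 1] = mu * i + a1                  # death/emigration rate
        row[i + 1] = lam * i + a2                 # birth/immigration rate
        row[i] = -(mu * i + a1) - (lam * i + a2)  # conservative diagonal
    else:
        row[1] = a2
        row[0] = -a2
    return row

def check_assumption_2_1(n=200, a1=0.5, a2=0.5, rho=3.0, b=5.0, L=5.0):
    w = lambda i: i + 1.0                         # weight function w(i) = i + 1
    for i in range(n):
        row = generator_row(i, a1, a2)
        # conservative: off-diagonal rates nonnegative, row sums to zero
        assert all(q >= 0 for j, q in row.items() if j != i)
        assert abs(sum(row.values())) < 1e-9
        # stable: q*(i) = |q(i|i,a)| dominated by L * w(i)
        assert abs(row[i]) <= L * w(i)
        # drift: sum_j q(j|i,a) w(j) <= rho * w(i) + b
        drift = sum(q * w(j) for j, q in row.items())
        assert drift <= rho * w(i) + b
    return True

print(check_assumption_2_1())  # True if all inequalities hold
```

With these rates, $\sum_j q(j|i,a)w(j) = (\lambda-\mu)i + a_2 - a_1$, so the drift condition holds with room to spare whenever $\lambda \le \mu$; the check is only a finite-range sanity test, not a proof.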


expectation operator with respect to the probability measure (p.m.) $P_{\gamma_n}^\pi$ is denoted by $E_{\gamma_n}^\pi$. If $\gamma_n$ is concentrated at some state $i$, we write $P_{\gamma_n}^\pi$ and $E_{\gamma_n}^\pi$ as $P_{i,n}^\pi$ and $E_{i,n}^\pi$, respectively.

Fix a discount factor $\alpha > 0$. The expected discounted criteria $V_n^l(\pi)$ are defined by

$$V_n^l(\pi) := \int_0^\infty e^{-\alpha t}\, E_{\gamma_n}^\pi\big[c_n^l(x_t, a_t)\big]\,dt, \quad \text{for } 0 \le l \le p,\ n \in \bar N,\ \text{and } \pi \in \Pi_n. \qquad (2.3)$$
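To make the criterion (2.3) concrete, the following sketch estimates it by Monte Carlo simulation of the jump chain for a hypothetical two-state generator under a fixed stationary policy, and compares the estimate with the exact value obtained from the linear system $\alpha V = c + QV$. All numerical data here are illustrative assumptions, not taken from the paper.

```python
import math
import random

# Sketch: Monte Carlo estimate of the discounted criterion (2.3) for a
# hypothetical two-state chain; generator q and running cost c are
# illustrative assumptions (the action argument is suppressed).
q = {0: {0: -1.0, 1: 1.0},     # leave state 0 at rate 1
     1: {0: 2.0, 1: -2.0}}     # leave state 1 at rate 2
c = {0: 1.0, 1: 3.0}           # running cost c(i)
alpha = 0.5                    # discount factor

def discounted_cost_once(i, horizon=40.0):
    """One trajectory of int_0^horizon e^(-alpha*t) c(x_t) dt."""
    t, total = 0.0, 0.0
    while t < horizon:
        stay = min(random.expovariate(-q[i][i]), horizon - t)
        # closed-form discounted cost accumulated over the sojourn [t, t+stay]
        total += c[i] * (math.exp(-alpha * t) - math.exp(-alpha * (t + stay))) / alpha
        t += stay
        i = 1 - i              # two states: jump to the other one
    return total

random.seed(0)
est = sum(discounted_cost_once(0) for _ in range(5000)) / 5000

# Exact value: solve alpha*V = c + Q V for this 2x2 generator, which gives
# V(0) = (c(0)*(alpha + 2) + c(1)) / (alpha*(alpha + 3)).
exact = (c[0] * (alpha + 2.0) + c[1]) / (alpha * (alpha + 3.0))
print(round(est, 2), round(exact, 2))
```

The finite horizon only introduces an error of order $e^{-\alpha T}$, negligible here; under Assumption 2.1 the criterion remains finite even for unbounded costs.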

The criteria (2.3) are well defined under Assumptions 3.1(c,d) below. $V_n^l(\pi)$ will be written as $V_n^l(i,\pi)$ if the initial distribution $\gamma_n$ is concentrated at state $i \in S_n$. For every $n \in \bar N$, let

$$U_n := \{\pi \in \Pi_n \mid V_n^l(\pi) \le d_n^l,\ 1 \le l \le p\}, \qquad V_n^* := \inf_{\pi \in U_n} V_n^0(\pi)$$

be the set of feasible policies and the optimal value of $M_n$, respectively. For each $n \in \bar N$, we then consider the following constrained optimality problem:

$$\text{Minimize } V_n^0(\pi) \text{ over } \pi \in U_n. \qquad (2.4)$$

(As $c_n^l(i,a)$ is allowed to take positive and negative values for each $0 \le l \le p$ and $n \in \bar N$, it can also be interpreted as a reward rather than a "cost" only, in which case the corresponding constrained optimality problem is to maximize $V_n^0(\pi)$ over $\pi \in U_n$.)

In order to make these problems reasonable, we introduce the following assumption:

Assumption 2.2. For each $n \in \bar N$, the set $U_n$ is not empty.

Definition 2.1.
(i) For any $n \in \bar N$, a policy $\pi^* \in U_n$ is called (constrained) optimal for $M_n$ if $V_n^0(\pi^*) = V_n^*$.
(ii) A sequence $\{\pi_n\}$ of policies $\pi_n \in \Pi_n^s$ is said to converge weakly to $\pi \in \Pi_\infty^s$ if, for each $i \in S_\infty$, $\pi_n(\cdot\,|i) \to \pi(\cdot\,|i)$ weakly in $P(A_\infty(i))$ as $n \to \infty$ with $n \ge n(i)$.

As is known, the existence of a constrained optimal policy for each fixed $n \in \bar N$ is ensured in Guo and Piunovskiy (2011) under suitable conditions. Here, we are interested in the following convergence problems:

(1) (Convergence of the optimal values): does $V_n^*$ converge to $V_\infty^*$ as $n \to \infty$?
(2) (Convergence of the optimal policies): if $\pi_n^*$ is a constrained optimal policy of $M_n$ ($n \in N$), is every accumulation point of $\{\pi_n^*\}$ an optimal policy of $M_\infty$?

3. The main results

In this section, we state our main results; their proofs are postponed to Section 6 below. First, for the existence of an optimal policy $\pi_n^*$ of $M_n$, we need the following conditions:

Assumption 3.1.
(a) For each $n \in \bar N$ and $i \in S_n$, $A_n(i)$ is compact.
(b) The discount factor $\alpha$ satisfies $\alpha > \rho$, with $\rho$ as in Assumption 2.1.
(c) $\sup_{n \in \bar N} \sum_{i \in S_n} w^2(i)\gamma_n(i) < \infty$, and $\sum_{j \in S_n} q_n(j|i,a)\, w^2(j) \le \kappa_1 w^2(i) + \kappa_2$ for all $n \in \bar N$ and $(i,a) \in K_n$, with some constants $0 < \kappa_1 < \alpha$ and $0 \le \kappa_2$.
(d) $|c_n^l(i,a)| \le M\, w(i)$ for all $(i,a) \in K_n$, $0 \le l \le p$ and $n \in \bar N$, with some constant $M > 0$.
(e) The functions $q_n(j|i,\cdot)$ and $c_n^l(i,\cdot)$ are all continuous in $a \in A_n(i)$, for each fixed $n \in \bar N$, $i, j \in S_n$, and $0 \le l \le p$.

Remark 3.1. Assumption 3.1(a) implies that the space $P(A_n(i))$ with the topology of weak convergence is also compact for all $n \in \bar N$ and $i \in S_n$. It then follows from Tychonoff's theorem that $\Pi_n^s = \prod_{i \in S_n} P(A_n(i))$ is compact too. Moreover, it follows from Theorem 3.1(a) in Guo and Piunovskiy (2011) and Assumptions 3.1(b,c,d) that

$$\big|V_n^l(i,\pi)\big| \le M\left[\frac{\alpha\, w(i) + b}{\alpha(\alpha - \rho)}\right], \qquad \big|V_n^l(\pi)\big| \le \frac{M\big[\alpha \sum_{i \in S_n} w(i)\gamma_n(i) + b\big]}{\alpha(\alpha - \rho)} \qquad (3.1)$$

for each $n \in \bar N$, $i \in S_n$, $\pi \in \Pi_n$ and $0 \le l \le p$.

To show our main results, we consider the following assumptions:

Assumption 3.2.
(a) $\lim_{n\to\infty} \sup_{a \in A_n(i)} |q_n(j|i,a) - q_\infty(j|i,a)| = 0$ for each $i, j \in S_\infty$, where $n \ge \max\{n(i), n(j)\}$;
(b) $\lim_{n\to\infty} \sup_{a \in A_n(i)} |c_n^l(i,a) - c_\infty^l(i,a)| = 0$ for all $i \in S_\infty$ and $0 \le l \le p$, where $n \ge n(i)$;
(c) $\lim_{n\to\infty} \gamma_n(i) = \gamma_\infty(i)$ for each $i \in S_\infty$, where $n \ge n(i)$;
(d) $\lim_{n\to\infty} d_n^l = d_\infty^l$ for each $1 \le l \le p$.

Assumption 3.3.
(a) (Slater condition) There exists a policy $\pi \in \Pi_\infty$ such that

$$V_\infty^l(\pi) < d_\infty^l \quad \text{for all } 1 \le l \le p. \qquad (3.2)$$

(b) For each $i \in S_\infty$, $A_n(i) \uparrow A_\infty(i)$ as $n \to \infty$, where $n \ge n(i)$.

Remark 3.2.
(a) Assumption 3.2(a) is used to obtain the convergence property of the occupation measures; see Theorem 5.1 below. Assumptions 3.2(b-d) have been used in Alvarez-Mena and Hernández-Lerma (2002) for discrete-time MDP.
(b) Assumption 3.3(a) is used to establish the duality between the convex analytic approach and dynamic programming by Guo and Piunovskiy (2011), and the existence of a constrained optimal policy with a structural property by Guo and Hernández-Lerma (2003, 2009) and Sennott (1991). Here, the role of Assumption 3.3 can be summarized in two parts. First, it replaces Assumption 2.2 by ensuring that the set $U_n$ for the "approximating" models $M_n$ is nonempty for all large enough $n$; see Proposition 5.1 below. Second, it yields the existence of feasible policies of $\{M_n\}$ corresponding to any given feasible policy of $M_\infty$. In fact, the role of Assumption 3.3 is the same as that of Assumption 3.8 in Alvarez-Mena and Hernández-Lerma (2002) and of Definition 3.1(b) in Prieto-Rumeau and Hernández-Lerma (2012).

We next state our main results on the convergence problems:

Theorem 3.1. Under Assumptions 2.1 and 3.1, the following assertions hold.


(a) If, in addition, $U_n \ne \emptyset$ for some $n \in \bar N$, then an optimal policy for $M_n$ exists.
(b) If, in addition, Assumptions 3.2 and 3.3 hold, then there exists an integer $N$ such that
(b1) for each $n \ge N$, $M_n$ has an optimal policy $\pi_n^* \in \Pi_n^s$;
(b2) every accumulation point of $\{\pi_n^*\}$ (with $\pi_n^*$ as in (b1)) is an optimal policy for $M_\infty$;
(b3) $\lim_{n\to\infty} V_n^* = V_\infty^*$.

Proof. See Section 6 below. □

4. Applications

In this section, we apply Theorem 3.1 to a finite-state approximation in Section 4.1; that is, we show that an optimal policy and the optimal value of a countable-state constrained continuous-time MDP can be approximated by those of solvable finite-state continuous-time MDPs. Furthermore, we illustrate this approximation with a controlled birth-and-death system in Section 4.2.

4.1. On the finite-state approximation

Consider the model $M'_\infty$ of countable-state constrained continuous-time MDP:

$$M'_\infty := \big\{S,\ (A(i), i \in S),\ q(\cdot\,|i,a),\ c^0(i,a),\ (c^l(i,a), d^l,\ 1 \le l \le p),\ \gamma\big\}, \qquad (4.1)$$

which is a copy of $M_n$ in (2.1). Since $S$ is denumerable, without loss of generality we write $S =: \{0, 1, \ldots, n, \ldots\}$. For each $n \ge 1$, let $S_n := \{0, 1, \ldots, n\}$, $A_n(i) := A(i)$ for $i \in S_n$, and $K_n := \{(i,a) \mid i \in S_n,\ a \in A_n(i)\}$. Moreover, for each $i, j \in S_n$ and $a \in A_n(i)$, let

$$q'_n(j|i,a) := \begin{cases} q(j|i,a) & \text{if } j \in S_n \setminus \{n\},\\ q(n|i,a) + q(S_n^c|i,a) & \text{if } j = n, \end{cases} \qquad (4.2)$$

$$\gamma'_n(i) := \begin{cases} \gamma(i) & \text{if } i \in S_n \setminus \{n\},\\ \gamma(n) + \gamma(S_n^c) & \text{if } i = n, \end{cases} \qquad (4.3)$$

and

$$\tilde c_n^l(i,a) := c^l(i,a), \qquad \tilde d_n^l := d^l, \quad \text{for each } 0 \le l \le p. \qquad (4.4)$$

Then we obtain a sequence $\{M'_n\}$ of finite-state models:

$$M'_n := \big\{S_n,\ (A_n(i), i \in S_n),\ q'_n(\cdot\,|i,a),\ \tilde c_n^0(i,a),\ (\tilde c_n^l(i,a), \tilde d_n^l,\ 1 \le l \le p),\ \gamma'_n\big\}. \qquad (4.5)$$

For each fixed $n \in N$, as in (2.3) and Definition 2.1, we define the corresponding discounted criteria $\tilde V_n^l(\pi)$, an optimal policy $\pi'_n$ of $M'_n$, and the optimal value $\tilde V_n^*$ of $M'_n$.

Suppose, in this section, that $A(i)$ is finite for each $i \in S$, and so is $A_n(i)$ for every $i \in S_n$ and $n \ge 1$. Then, to compute an optimal policy and the optimal value of $M'_n$ with $n \in N$, we can consider the following linear program $\mathrm{LP}'_n$:

$$\mathrm{LP}'_n:\quad \inf_\eta\ \frac{1}{\alpha} \sum_{i \in S_n} \sum_{a \in A_n(i)} c^0(i,a)\,\eta(i,a)$$

subject to

$$\begin{cases} \displaystyle\sum_{i \in S_n}\sum_{a \in A_n(i)} c^l(i,a)\,\eta(i,a) \le \alpha\, d^l, & 1 \le l \le p,\\[4pt] \displaystyle\alpha \sum_{a \in A_n(i)} \eta(i,a) = \alpha\, \gamma'_n(i) + \sum_{j \in S_n}\sum_{a \in A_n(j)} q'_n(i|j,a)\,\eta(j,a), & \forall i \in S_n,\\[4pt] \displaystyle\sum_{i \in S_n,\, a \in A_n(i)} \eta(i,a) = 1, \qquad \eta(i,a) \ge 0, & i \in S_n,\ a \in A_n(i). \end{cases}$$

Since $\mathrm{LP}'_n$ is a linear program with a finite number of variables $\eta(i,a)$ ($i \in S_n$, $a \in A_n(i)$), an optimal solution to $\mathrm{LP}'_n$ can be found by many methods, such as the well-known simplex method.

Definition 4.1. For each fixed $n \in N$, a stationary policy $\pi \in \Pi_n^s$ is called $m$-randomized if $\sum_{i \in S_n} (|A_n^\pi(i)| - 1) \le m$, where $A_n^\pi(i) := \{a \in A_n(i) \mid \pi(a|i) > 0\}$.

For the linear program $\mathrm{LP}'_n$ and $M'_n$, we have the following fact:

Lemma 4.1. For each $n \in N$, let $\eta_n$ be an optimal solution to $\mathrm{LP}'_n$ above. Then:

(1) $\pi'_n$ is an optimal policy of $M'_n$, where $\pi'_n$ is given by

$$\pi'_n(a|i) := \begin{cases} \dfrac{\eta_n(i,a)}{\hat\eta_n(i)} & \text{when } \hat\eta_n(i) := \displaystyle\sum_{a \in A_n(i)} \eta_n(i,a) > 0 \text{ and } a \in A_n(i),\\[6pt] I_{\{a_n(i)\}}(a) & \text{when } \hat\eta_n(i) = 0 \text{ and } a \in A_n(i), \end{cases} \qquad (4.6)$$

for all $i \in S_n$, where $a_n(i) \in A_n(i)$ is chosen arbitrarily;
(2) $\tilde V_n^* = \frac{1}{\alpha} \sum_{i \in S_n} \sum_{a \in A_n(i)} c^0(i,a)\,\eta_n(i,a)$;
(3) there exists an optimal $p$-randomized policy $\pi \in \Pi_n^s$ for each $M'_n$.

Proof. This conclusion follows from Theorem 3.8 in Altman (1999) and Lemma 5.1. □

Proposition 4.1. Suppose that there exist a function $\tilde w \ge 1$ on $S$ with $\tilde w(i) \uparrow +\infty$ as $i \to \infty$, and constants $\rho', b', L', \kappa'_1, \kappa'_2 > 0$, satisfying the following conditions:

(1) $\sum_{j \in S} q(j|i,a)\,\tilde w(j) \le \rho' \tilde w(i) + b'$, and $\sum_{j \in S} q(j|i,a)\,\tilde w^2(j) \le \kappa'_1 \tilde w^2(i) + \kappa'_2$ for all $(i,a) \in K$;
(2) $q^*(i) := \sup_{a \in A(i)} |q(i|i,a)| \le L' \tilde w(i)$ and $|c^l(i,a)| \le L' \tilde w(i)$ for all $(i,a) \in K$ and $0 \le l \le p$;
(3) the discount factor $\alpha$ satisfies $\alpha > \rho'$ and $\alpha > \kappa'_1$;
(4) $\sum_{i \in S} \tilde w^2(i)\gamma(i) < \infty$;
(5) there exists a policy $\pi \in \Pi_\infty$ such that $\tilde V_\infty^l(\pi) < d^l$ for all $1 \le l \le p$.

Then,
(a) there exists an optimal policy $\pi'_n \in \Pi_n^s$ of $M'_n$ for each $n \ge N$, with some $N \ge 1$;
(b) every accumulation point of $\{\pi'_n\}$ (with $\pi'_n$ as in (a)) is an optimal policy of $M'_\infty$;
(c) $\lim_{n\to\infty} \tilde V_n^* = \tilde V_\infty^*$.

Proof. Fix any $n \ge 1$. By (4.2), it is clear that $q'_n(j|i,a)$ indeed defines transition rates, which are conservative and stable. Also, Assumption 2.1(b) holds for $q'_n(j|i,a)$. Moreover, for each $(i,a) \in K_n$, since $\tilde w(i)$ is nondecreasing and $q(k|i,a) \ge 0$ for each $k \in S_n^c$, we have

$$\sum_{j \in S_n} q'_n(j|i,a)\,\tilde w(j) = \sum_{j \in S_n \setminus \{n\}} q'_n(j|i,a)\,\tilde w(j) + q'_n(n|i,a)\,\tilde w(n) \le \sum_{j \in S_n} q(j|i,a)\,\tilde w(j) + \sum_{k \in S_n^c} q(k|i,a)\,\tilde w(k) \le \rho' \tilde w(i) + b',$$

which implies Assumption 2.1(a). Similarly, we can deduce Assumption 3.1(c) from conditions (1,3,4) and (4.3). Obviously, Assumptions 3.1(a,b,d,e) and 3.3 follow from conditions (2,3,5), (4.2), (4.3) and (4.4) and the definition of $A(i)$. It remains to verify Assumption 3.2. Let $i, j \in S$ be fixed; there exists an integer $N$ such that, for each $n > N$, $i \in S_n$ and $j \ne n$. So $q'_n(j|i,a) = q(j|i,a)$ and $\tilde c_n^l(i,a) = c^l(i,a)$ for each $n > N$, $0 \le l \le p$ and $a \in A_n(i)$, which, together with (4.3) and (4.4), implies Assumption 3.2.
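To give a concrete sense of $\mathrm{LP}'_n$, the sketch below builds and solves the linear program for a hypothetical two-state, two-action constrained model and recovers a stationary policy as in (4.6). All rates, costs, and the bound $d$ are illustrative assumptions (not the paper's data), and SciPy's `linprog` stands in for the simplex method mentioned above.

```python
from scipy.optimize import linprog

# Sketch of LP'_n for a hypothetical 2-state, 2-action constrained
# continuous-time MDP; all numerical data are illustrative assumptions.
states, actions = [0, 1], [0, 1]
alpha, d = 1.0, 0.6
gamma = {0: 1.0, 1: 0.0}                  # initial distribution gamma'_n

def q(j, i, a):                           # conservative transition rates
    rate_out = (1.0 + a) if i == 0 else (2.0 - a)
    return -rate_out if j == i else rate_out

c0 = lambda i, a: i + 0.5 * a             # objective cost c^0
c1 = lambda i, a: 1.0 - a                 # constrained cost c^1

pairs = [(i, a) for i in states for a in actions]
obj = [c0(i, a) / alpha for (i, a) in pairs]

# constraint row: sum_{i,a} c1(i,a) eta(i,a) <= alpha * d
A_ub, b_ub = [[c1(i, a) for (i, a) in pairs]], [alpha * d]

# balance: alpha*sum_a eta(s,a) - sum_{j,a} q(s|j,a) eta(j,a) = alpha*gamma(s)
A_eq = [[alpha * (i == s) - q(s, i, a) for (i, a) in pairs] for s in states]
b_eq = [alpha * gamma[s] for s in states]
A_eq.append([1.0] * len(pairs))           # normalization: sum eta = 1
b_eq.append(1.0)

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * len(pairs), method="highs")
eta = dict(zip(pairs, res.x))

# recover a stationary policy as in (4.6): pi(a|i) = eta(i,a) / eta_hat(i)
policy = {i: {a: eta[(i, a)] / sum(eta[(i, b)] for b in actions)
              for a in actions}
          for i in states if sum(eta[(i, b)] for b in actions) > 0}
print("optimal value:", round(res.fun, 4))
print("policy:", policy)
```

The normalization row is redundant given the balance equations (summing them over states recovers it), but keeping it makes the occupation-measure interpretation of $\eta$ explicit; a constrained-optimal policy may randomize at up to $p$ states, consistent with Lemma 4.1(3).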


Therefore, Assumptions 2.1 and 3.1–3.3 are verified for the models $M'_n$ and $M'_\infty$, and thus Proposition 4.1 follows from Theorem 3.1. □

4.2. An example

In this subsection, we illustrate the finite-state approximation with an example.

Example 4.1. Consider a controlled birth-and-death system in which the state variable denotes the population size at any time $t \ge 0$. There are natural birth and death rates represented by nonnegative constants $\lambda$ and $\mu$, respectively. When the state of the system is $i \in S := \{0, 1, \ldots\}$, the controller takes an action $a := (a_1, a_2)$ from a given finite set $A(i) := \{a_1^k \mid 1 \le k \le s,\ a_1^k > 0\} \times \{a_2^l \mid 1 \le l \le m,\ a_2^l > 0\}$, in which $a_1 > 0$ denotes the emigration rate and $a_2 > 0$ denotes the immigration rate, where $s$ and $m$ are given integers. These actions incur two costs, $c^0(i,a)$ and $c^1(i,a)$. Moreover, the controller wishes to minimize the expected discounted cost corresponding to $c^0(i,a)$, while the expected discounted cost corresponding to $c^1(i,a)$ is kept bounded above by a given constant $d$. Let $\gamma$ be an initial distribution on $S$.

We now formulate this system as a model of constrained continuous-time MDP. With the costs $c^0(i,a)$, $c^1(i,a)$ and the constraint constant $d$ as given in the system, the corresponding transition rates $q(j|i,a)$ are given as follows. For each $i \ge 1$ and $a = (a_1, a_2) \in A(i)$,

$$q(j|i,a) := \begin{cases} \mu i + a_1 & \text{if } j = i-1,\\ -(\mu+\lambda)i - a_1 - a_2 & \text{if } j = i,\\ \lambda i + a_2 & \text{if } j = i+1,\\ 0 & \text{otherwise,} \end{cases} \qquad \text{and} \qquad q(j|0,a) := \begin{cases} -a_2 & \text{if } j = 0,\\ a_2 & \text{if } j = 1,\\ 0 & \text{otherwise.} \end{cases} \qquad (4.7)$$

First of all, to show that the above model satisfies the hypotheses of Proposition 4.1, we introduce the following conditions.

Condition 4.1.
(a) $\alpha > 2(\lambda - \mu)$, where $\alpha > 0$ is the given discount factor.
(b) There exists a constant $M_0 > 0$ such that $|c^l(i,a)| \le M_0(i+1)$ for all $(i,a) \in K$ and $l = 0, 1$.
(c) $\sum_{i \in S} i^2 \gamma(i) < \infty$.
(d) There exists some $\tilde a \in A$ such that $c^1(i, \tilde a) < \alpha d$ for each $i \in S$.

Proposition 4.2. Under Condition 4.1, an optimal policy and the optimal value for the above controlled birth-and-death system can be approximated by those of the finite-state birth-and-death systems constructed as in (4.5) (by Proposition 4.1).

Proof. Here and below, we consider the function $\tilde w(i) := i+1$ for each $i \in S$. Let $h_1 := \max\{a_1^k : 1 \le k \le s\}$ and $h_2 := \max\{a_2^l : 1 \le l \le m\}$. By (4.7), it follows that

$$\sum_{j \in S} q(j|i,a)\,\tilde w(j) = (\lambda-\mu)i + (a_2 - a_1) \le (\lambda-\mu)\tilde w(i) + \mu - \lambda + h_2, \quad \forall\, i > 0,\ a \in A(i), \qquad (4.8)$$

$$\sum_{j \in S} q(j|0,a)\,\tilde w(j) = a_2 \le h_2 \le (\lambda-\mu)\tilde w(0) + \mu - \lambda + h_2, \quad \forall\, a \in A(0). \qquad (4.9)$$

Hence, by Condition 4.1(a), the first inequality in condition (1) of Proposition 4.1 holds with

$$b' := \max\{\mu - \lambda + h_2,\ 0\}, \qquad \rho' := \max\Big\{\lambda - \mu,\ \frac{\alpha}{2}\Big\}. \qquad (4.10)$$

Moreover, by Condition 4.1(a), for each $i > 0$ and $a \in A(i)$, we have

$$\begin{aligned} \sum_{j \in S} q(j|i,a)\,\tilde w^2(j) &= 2(\lambda-\mu)(i+1)^2 + (3\mu - 2a_1 + 2a_2 - \lambda)(i+1) + a_1 + a_2 - \mu - \lambda\\ &\le \Big(\frac{\alpha}{2} + \lambda - \mu\Big)(i+1)^2 + \Big(\lambda - \mu - \frac{\alpha}{2}\Big)\Big(i + 1 + \frac{3\mu + 2h_2}{2\lambda - 2\mu - \alpha}\Big)^2 + h_1 + h_2 + \frac{(3\mu + 2h_2)^2}{2(\alpha - 2\lambda + 2\mu)}\\ &\le \Big(\frac{\alpha}{2} + \lambda - \mu\Big)\tilde w^2(i) + h_1 + h_2 + \frac{(3\mu + 2h_2)^2}{2(\alpha - 2\lambda + 2\mu)}, \end{aligned}$$

and

$$\sum_{j \in S} q(j|0,a)\,\tilde w^2(j) = 3a_2 \le 3h_2 \le \Big(\frac{\alpha}{2} + \lambda - \mu\Big)\tilde w^2(0) + 3h_2 \quad \text{for each } a \in A(0),$$

which imply, for each $(i,a) \in K$,

$$\sum_{j \in S} q(j|i,a)\,\tilde w^2(j) \le \Big(\frac{\alpha}{2} + \lambda - \mu\Big)\tilde w^2(i) + \kappa'_2 < \alpha\,\tilde w^2(i) + \kappa'_2, \qquad (4.11)$$

where $\kappa'_2 := h_1 + 3h_2 + \frac{(3\mu + 2h_2)^2}{2(\alpha - 2\lambda + 2\mu)}$. Hence, the second inequality in condition (1) of Proposition 4.1 holds with $\kappa'_1 := \frac{\alpha}{2} + \lambda - \mu$ and $\kappa'_2$ as above. Moreover, by straightforward calculations and (4.7), we have, for each $j \in S$ and $a \in A$,

$$q^*(j) \le (\mu+\lambda)j + a_1 + a_2 \le L'\tilde w(j), \quad \text{where } L' := \mu + \lambda + h_1 + h_2. \qquad (4.12)$$

Thus, from (4.12) and Condition 4.1(b), we see that condition (2) in Proposition 4.1 holds. Conditions (3,4) in Proposition 4.1 follow from Conditions 4.1(a,c). Finally, define a policy $f$ by

$$f(i) := \tilde a \quad \text{for each } i \in S, \text{ with } \tilde a \text{ as in Condition 4.1(d)}. \qquad (4.13)$$

Hence, by Condition 4.1(d), $\tilde V^1_\infty(f) < d$, and so condition (5) in Proposition 4.1 is satisfied. □

Theorem 4.1. Suppose that Condition 4.1 holds. Then, for Example 4.1, there exists an integer $N$ such that, for each $n \ge N$,

$$\big|\tilde V^*_n - \tilde V^*_\infty\big| \le D\left[\frac{\sum_{i=0}^{n-1}\big[(i+1)^2 + \alpha^{-1}\kappa'_2\big]\gamma(i)}{n+1} + \sum_{i=n}^{\infty}(i+1)\gamma(i)\right],$$

where $D := \frac{2M_0(\alpha + b')}{\alpha(\alpha - \rho')}\left(\frac{4M_0}{h_3}\cdot\frac{\alpha\sum_{i \in S}(i+1)\gamma(i) + b'}{\alpha(\alpha - \rho')} + 1\right)$, $h_3 := d - \tilde V^1_\infty(f)$, and the constants $M_0, b', \rho'$ are defined in Condition 4.1(b) and (4.10).

Proof. For each $\pi \in \Pi^s_n$, we can define an associated policy $\tilde\pi \in \Pi^s_\infty$ by

$$\tilde\pi(\cdot\,|i) := \begin{cases} \pi(\cdot\,|i) & \text{if } i \in S_n,\\ \nu & \text{if } i \in S_n^c, \text{ where } \nu \in P(A(i)) \text{ is chosen arbitrarily.} \end{cases} \qquad (4.14)$$

To avoid confusion, we use $\{x^\pi_t\}$ to denote the state process $\{x_t\}$ corresponding to a given policy $\pi$. Then, for each $n \in N$, $\pi \in \Pi^s_n$ and $k \in S_n$, let $\tau_k(\pi) := \inf\{t > 0 : x^\pi_t \ge k\} = \inf\{t > 0 : x^\pi_t = k\}$ (since the population augments by at most one individual at each transition). For each $n \in N$, $\pi \in \Pi^s_n$ and $j \in S_n$, it follows from the argument at page 263 in Anderson (1991) that $E^\pi_{j,n}[\tau_{j+1}(\pi)] < \infty$,


which implies that $\tau_{j+1}(\pi) < \infty$ a.s. $P^\pi_{j,n}$. Then, for each $n \in N$, $\pi \in \Pi^s_n$ and $j < k \in S_n$, by the strong Markov property, we have

$$E^\pi_{j,n}[\tau_k(\pi)] = E^\pi_{j,n}[\tau_{j+1}(\pi)] + E^\pi_{j,n}[\tau_k(\pi) - \tau_{j+1}(\pi)] = E^\pi_{j,n}[\tau_{j+1}(\pi)] + E^\pi_{j,n}\Big[E^\pi_{j,n}\big[\tau_k(\pi) - \tau_{j+1}(\pi) \,\big|\, \mathcal{F}_{\tau_{j+1}(\pi)}\big]\Big] = E^\pi_{j,n}[\tau_{j+1}(\pi)] + E^\pi_{j+1,n}[\tau_k(\pi)], \qquad (4.15)$$

where $\mathcal{F}_{\tau_{j+1}(\pi)}$ denotes the $\sigma$-algebra associated with $\tau_{j+1}(\pi)$. Let $i < k$ be fixed; summing both sides of (4.15) from $i$ to $k-1$, we have

$$E^\pi_{i,n}[\tau_k(\pi)] = \sum_{m=i}^{k-1} E^\pi_{m,n}[\tau_{m+1}(\pi)] < +\infty, \quad \text{and so } \tau_k(\pi) < +\infty \text{ a.s. } P^\pi_{i,n}. \qquad (4.16)$$

Recalling that $\tilde w(i) = i+1$, by calculations similar to those in (4.11), we have

$$\sum_{j \in S} q(j|i,a)\,\tilde w^3(j) = (3\lambda - 3\mu)\tilde w^3(i) + (6\mu - 3a_1 + 3a_2)\tilde w^2(i) + (3a_2 + 3a_1 - 4\mu - 2\lambda)\tilde w(i) + (a_2 + \mu - \lambda - a_1) \le (3\lambda + 4\mu + 7h_2)\tilde w^3(i), \quad \text{for each } i > 0 \text{ and } a \in A(i), \qquad (4.17)$$

$$\sum_{j \in S} q(j|0,a)\,\tilde w^3(j) = 7a_2 \le (3\lambda + 4\mu + 7h_2)\tilde w^3(0), \quad \text{for each } a \in A(0), \qquad (4.18)$$

$$q^*(i)\,\tilde w^2(i) \le (\lambda + \mu + h_2 + h_1)\tilde w^3(i), \quad \text{for each } i \in S. \qquad (4.19)$$

By (4.16)–(4.19), it is legitimate to use Dynkin's formula. Hence, by (4.11) and Dynkin's formula, for each $i < n \in S$,

$$E^{\tilde\pi}_{i,\infty}\Big[e^{-\alpha\tau_n(\tilde\pi)}\,\tilde w^2\big(x^{\tilde\pi}_{\tau_n(\tilde\pi)}\big)\Big] \le \tilde w^2(i) + \kappa'_2\, E^{\tilde\pi}_{i,\infty}\Big[\int_0^{\tau_n(\tilde\pi)} e^{-\alpha t}\,dt\Big] \le \tilde w^2(i) + \alpha^{-1}\kappa'_2,$$

which implies that

$$E^{\tilde\pi}_{i,\infty}\big[e^{-\alpha\tau_n(\tilde\pi)}\big] \le \frac{\tilde w^2(i) + \alpha^{-1}\kappa'_2}{\tilde w^2(n)}. \qquad (4.20)$$

Moreover, for each $i < n \in N$, by the strong Markov property, (4.4) and (4.16),

$$\tilde V^l_n(i,\pi) = E^\pi_{i,n}\Big[\int_0^{\tau_n(\pi)} e^{-\alpha t}\,\tilde c^l_n(x^\pi_t, a_t)\,dt\Big] + E^\pi_{i,n}\Big[\int_{\tau_n(\pi)}^{\infty} e^{-\alpha t}\,\tilde c^l_n(x^\pi_t, a_t)\,dt\Big] = E^\pi_{i,n}\Big[\int_0^{\tau_n(\pi)} e^{-\alpha t}\, c^l(x^\pi_t, a_t)\,dt\Big] + E^\pi_{i,n}\big[e^{-\alpha\tau_n(\pi)}\big]\,\tilde V^l_n(n,\pi) \quad \text{for each } l = 0, 1. \qquad (4.21)$$

Similarly, for each $i \in S$ and $l = 0, 1$,

$$\tilde V^l_\infty(i,\tilde\pi) = E^{\tilde\pi}_{i,\infty}\Big[\int_0^{\tau_n(\tilde\pi)} e^{-\alpha t}\, c^l(x^{\tilde\pi}_t, a_t)\,dt\Big] + E^{\tilde\pi}_{i,\infty}\big[e^{-\alpha\tau_n(\tilde\pi)}\big]\,\tilde V^l_\infty(n,\tilde\pi). \qquad (4.22)$$

Moreover, we have $q^*(i)\,\tilde w(i) \le (\lambda + \mu + h_1 + h_2)\tilde w^2(i)$ for each $i \in S$, which, together with (4.8)–(4.12) and (4.14), implies that the hypothesis of Lemma 3.2 in Guo, Song, and Zhang (2014) holds for the case of a constant discount factor. Hence, it follows from Lemma 3.2(a) in Guo et al. (2014) and Lemma 6.1.5 in Anderson (1991) that, for each $i < n$,

$$E^\pi_{i,n}\Big[\int_0^{\tau_n(\pi)} e^{-\alpha t}\, c^l(x^\pi_t, a_t)\,dt\Big] = E^{\tilde\pi}_{i,\infty}\Big[\int_0^{\tau_n(\tilde\pi)} e^{-\alpha t}\, c^l(x^{\tilde\pi}_t, a_t)\,dt\Big], \qquad E^\pi_{i,n}\big[e^{-\alpha\tau_n(\pi)}\big] = E^{\tilde\pi}_{i,\infty}\big[e^{-\alpha\tau_n(\tilde\pi)}\big]. \qquad (4.23)$$

Let $B := \frac{2M_0(\alpha + b')}{\alpha(\alpha - \rho')}$. Then it follows from (3.1) that $|\tilde V^l_n(i,\pi)| \le \frac{B}{2}\tilde w(i)$ for each $n \in \bar N$, $i \in S$ and $l = 0, 1$, which, together with (4.20)–(4.23), implies that

$$\big|\tilde V^l_n(i,\pi) - \tilde V^l_\infty(i,\tilde\pi)\big| = E^{\tilde\pi}_{i,\infty}\big[e^{-\alpha\tau_n(\tilde\pi)}\big]\,\big|\tilde V^l_n(n,\pi) - \tilde V^l_\infty(n,\tilde\pi)\big| \le B\,\frac{\tilde w^2(i) + \alpha^{-1}\kappa'_2}{\tilde w(n)} \quad \text{for each } i < n. \qquad (4.24)$$

Hence, using Condition 4.1(c), by (4.3) and (4.24), for each $l = 0, 1$ we have

$$\begin{aligned} \Big|\sum_{i \in S_n} \tilde V^l_n(i,\pi)\gamma'_n(i) - \sum_{i \in S} \tilde V^l_\infty(i,\tilde\pi)\gamma(i)\Big| &= \Big|\sum_{i \in S_{n-1}} \big[\tilde V^l_n(i,\pi) - \tilde V^l_\infty(i,\tilde\pi)\big]\gamma(i) + \tilde V^l_n(n,\pi)\gamma(S^c_{n-1}) - \sum_{i \in S \setminus S_{n-1}} \tilde V^l_\infty(i,\tilde\pi)\gamma(i)\Big|\\ &\le B\,\frac{\sum_{i=0}^{n-1}\big[\tilde w^2(i) + \alpha^{-1}\kappa'_2\big]\gamma(i)}{\tilde w(n)} + \frac{B}{2}\tilde w(n)\gamma(S^c_{n-1}) + \frac{B}{2}\sum_{i=n}^{\infty}\tilde w(i)\gamma(i)\\ &\le B\left(\frac{\sum_{i=0}^{n-1}\big[\tilde w^2(i) + \alpha^{-1}\kappa'_2\big]\gamma(i)}{\tilde w(n)} + \sum_{i=n}^{\infty}\tilde w(i)\gamma(i)\right) =: \epsilon(n) \to 0 \quad \text{as } n \to \infty. \qquad (4.25) \end{aligned}$$

By (4.25), there exists an integer $N$ such that $\epsilon(n) \le \frac{h_3}{2}$ for each $n \ge N$. So

$$d - \tilde V^1_n(f) = d - \tilde V^1_\infty(f) + \tilde V^1_\infty(f) - \tilde V^1_n(f) \ge d - \tilde V^1_\infty(f) - \epsilon(n) \ge \frac{h_3}{2} > 0, \qquad (4.26)$$

which implies that $M'_n$ satisfies the Slater condition for each $n \ge N$. Hence, under Condition 4.1, by Proposition 7.1 in Guo and Piunovskiy (2011) and Theorem 3.1(b1), for each $n \ge N$ there exist an optimal stationary policy $\pi'_n \in \Pi^s_n$ and a multiplier $\lambda_n \ge 0$ such that

$$L_n(\pi'_n, \lambda_n) = \min_{\pi \in \Pi^s_n} L_n(\pi, \lambda_n) = \max_{\lambda \ge 0} L_n(\pi'_n, \lambda) = \tilde V^*_n = \tilde V^0_n(\pi'_n), \qquad (4.27)$$

where $L_n(\pi, \lambda) := \tilde V^0_n(\pi) + \lambda\big(\tilde V^1_n(\pi) - d\big)$ for each $n \in \bar N$, $\pi \in \Pi^s_n$ and $\lambda \in (-\infty, \infty)$.

Let $E := M_0\,\frac{\alpha\sum_{i \in S}\tilde w(i)\gamma(i) + b'}{\alpha(\alpha - \rho')}$. It then follows from (3.1) and (4.26) that

$$-E \le \tilde V^*_n = \min_{\pi \in \Pi^s_n}\Big\{\tilde V^0_n(\pi) + \lambda_n\big(\tilde V^1_n(\pi) - d\big)\Big\} \le \tilde V^0_n(f) + \lambda_n\big(\tilde V^1_n(f) - d\big) \le E - \lambda_n\big(d - \tilde V^1_n(f)\big) \le E - \lambda_n\frac{h_3}{2},$$

which implies that

$$\lambda_n \le \frac{4E}{h_3} \quad \text{for each } n \ge N. \qquad (4.28)$$

Similarly, $-E \le \tilde V^*_\infty = \min_{\pi \in \Pi^s_\infty}\{\tilde V^0_\infty(\pi) + \lambda_\infty(\tilde V^1_\infty(\pi) - d)\} \le \tilde V^0_\infty(f) + \lambda_\infty(\tilde V^1_\infty(f) - d) \le E - \lambda_\infty\big(d - \tilde V^1_\infty(f)\big) = E - \lambda_\infty h_3$, which implies that

$$\lambda_\infty \le \frac{2E}{h_3} \le \frac{4E}{h_3}. \qquad (4.29)$$

Let $\tilde\pi'_n$ denote the extension of $\pi'_n$ to $\Pi^s_\infty$ obtained by replacing $\pi$ in (4.14) with $\pi'_n$. Thus, by (4.25), (4.29) and (4.27), we have

$$\begin{aligned} \tilde V^*_\infty - \tilde V^*_n &= \min_{\pi \in \Pi^s_\infty}\Big\{\tilde V^0_\infty(\pi) + \lambda_\infty\big(\tilde V^1_\infty(\pi) - d\big)\Big\} - \max_{\lambda \ge 0}\Big\{\tilde V^0_n(\pi'_n) + \lambda\big(\tilde V^1_n(\pi'_n) - d\big)\Big\}\\ &\le \tilde V^0_\infty(\tilde\pi'_n) + \lambda_\infty\big(\tilde V^1_\infty(\tilde\pi'_n) - d\big) - \tilde V^0_n(\pi'_n) - \lambda_\infty\big(\tilde V^1_n(\pi'_n) - d\big)\\ &= \tilde V^0_\infty(\tilde\pi'_n) - \tilde V^0_n(\pi'_n) + \lambda_\infty\big(\tilde V^1_\infty(\tilde\pi'_n) - \tilde V^1_n(\pi'_n)\big) \le \end{aligned}$$

  4E þ 1 ðnÞ: h3

ð4:30Þ

Please cite this article in press as: Guo, X., & Zhang, W. Convergence of controlled models and finite-state approximation for discounted continuous-time Markov decision processes with constraints. European Journal of Operational Research (2014), http://dx.doi.org/10.1016/j.ejor.2014.03.037


X. Guo, W. Zhang / European Journal of Operational Research xxx (2014) xxx–xxx

Also, define the policy $\hat\pi_n\in\Pi^{s}_{n}$ by $\hat\pi_n(\cdot\mid i) := p^{0}_{\infty}(\cdot\mid i)$ for each $n\in\mathbb N$ and $i\in S_n$. By (4.27) and (4.28),

$$\widetilde V^{0}_{n} - \widetilde V^{0}_{\infty} = \min_{p\in\Pi^{s}_{n}}\big\{\widetilde V^{0}_{n}(p)+\lambda_n\big(\widetilde V^{1}_{n}(p)-d\big)\big\} - \max_{\lambda\ge 0}\big\{\widetilde V^{0}_{\infty}(p^{0}_{\infty})+\lambda\big(\widetilde V^{1}_{\infty}(p^{0}_{\infty})-d\big)\big\} \le \big[\widetilde V^{0}_{n}(\hat\pi_n)+\lambda_n\big(\widetilde V^{1}_{n}(\hat\pi_n)-d\big)\big] - \big[\widetilde V^{0}_{\infty}(p^{0}_{\infty})+\lambda_n\big(\widetilde V^{1}_{\infty}(p^{0}_{\infty})-d\big)\big] = \big[\widetilde V^{0}_{n}(\hat\pi_n)-\widetilde V^{0}_{\infty}(p^{0}_{\infty})\big] + \lambda_n\big[\widetilde V^{1}_{n}(\hat\pi_n)-\widetilde V^{1}_{\infty}(p^{0}_{\infty})\big] \le \Big(\frac{4E}{\theta_3}+1\Big)\epsilon(n), \tag{4.31}$$

which, together with (4.30) and (4.25), implies the desired result. □

Example 4.1 (continued). For a numerical experiment with Example 4.1, we fix the values of the parameters as follows:

$$\lambda = 1.4,\quad \mu = 1,\quad A := \{1,2\}\times\{1,2\},\quad \alpha = 1.2,\quad d = 0.1.$$

Finally, the cost functions and initial distribution are

$$c^{0}(i,a) = \big[10+(a_2-1.6)^{2}\big]\,i,\qquad c^{1}(i,a) = (a_1-1.1)^{2},\qquad \gamma(i) := \frac{1}{2^{\,i+1}}, \quad\text{for each } (i,a)\in K.$$

For every $1\le n\le 1000$, we calculate the optimal value and an optimal policy of $M'_n$ by solving the LP$'_n$. The optimal values are shown in Fig. 1, and the optimal policy for $M'_{1000}$ is given in Table 1 below. By Theorem 4.1, we know that $|\widetilde V^{0}_{n}-\widetilde V^{0}_{\infty}| = O(n^{-1})$.

Fig. 1. Optimal values.

Table 1. The optimal policy for n = 1000.

  a \ i     0–3       4        5–63     64–119   120–1000
  (1,1)    1.0000   0.5576   0.0000   0.0000   0.0000
  (1,2)    0.0000   0.0000   0.0000   0.0000   1.0000
  (2,1)    0.0000   0.4424   1.0000   0.0000   0.0000
  (2,2)    0.0000   0.0000   0.0000   1.0000   0.0000

The notation "k–l" in the table denotes the states $k, k+1, \ldots, l$, and every number denotes the corresponding probability of choosing action $a$ at state $i$. From the table, we find an optimal 1-randomized policy for $M'_{1000}$.

5. Occupation measures and preliminary results

In order to prove Theorem 3.1 above, we introduce the concept of an occupation measure.

Definition 5.1. For each $n\in\overline{\mathbb N}$ and policy $\pi\in\Pi_n$, the occupation measure of $\pi$ associated to $M_n$, denoted by $\eta^{\pi}_{n}$, is the p.m. on $S_n\times A_n$ defined by

$$\eta^{\pi}_{n}(\{i\}\times C) := \alpha\int_0^{\infty} e^{-\alpha t}\,E^{\pi}_{\gamma_n}\big[I_{\{i\}\times C}(x_t,a_t)\big]\,dt, \qquad \{i\}\times C\in\mathcal B(S_n\times A_n), \tag{5.1}$$

which is concentrated on $K_n$.

Under Assumptions 2.1 and 3.1(b–d), by (2.3), (3.1), and (5.1), we have

$$V^{l}_{n}(\pi) = \frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi}_{n}(i,da), \quad\text{for } 0\le l\le p,\ \pi\in\Pi_n,\ n\in\overline{\mathbb N}. \tag{5.2}$$

Then, the following lemma collects some properties of occupation measures.

Lemma 5.1. Under Assumptions 2.1 and 3.1(b–c), the following assertions hold.

(i) For each $n\in\overline{\mathbb N}$ and policy $\pi\in\Pi_n$, $\eta^{\pi}_{n}$ satisfies the following equation and inequality:

$$\eta\big(i,A_n(i)\big) = \gamma_n(i) + \frac1\alpha\sum_{j\in S_n}\int_{A_n(j)} q_n(i\mid j,a)\,\eta(j,da) \quad\text{for } i\in S_n, \tag{5.3}$$

$$\sum_{i\in S_n} x(i)\,\eta\big(i,A_n(i)\big) \le \frac{\alpha\sum_{i\in S_n} x(i)\gamma_n(i)+b}{\alpha-\rho} < \infty. \tag{5.4}$$

(ii) Conversely, for each fixed $n\in\overline{\mathbb N}$, if a p.m. $\eta$ on $K_n$ satisfies (5.3) and (5.4), then there exists a policy $\pi\in\Pi^{s}_{n}$ (depending on $n$) such that $\eta = \eta^{\pi}_{n}$, and $\pi$ can be obtained from the following decomposition of $\eta$:

$$\eta(i,da) = \hat\eta(i)\,\pi(da\mid i), \quad\text{where } \hat\eta(i) := \eta\big(i,A_n(i)\big) \ \text{for each } i\in S_n. \tag{5.5}$$

(iii) For each fixed $n\in\overline{\mathbb N}$ and $\pi\in\Pi^{s}_{n}$, $(\hat\eta^{\pi}_{n}(i),\ i\in S_n)$ is the unique solution of

$$\mu(i) = \gamma_n(i) + \frac1\alpha\sum_{j\in S_n} q_n(i\mid j,\pi)\,\mu(j) \quad\text{for all } i\in S_n, \tag{5.6}$$

in the class of probability measures $\mu$ on $S_n$ such that $\sum_{i\in S_n} x(i)\mu(i)<\infty$.

Proof. See Theorem 5.1 in Guo and Piunovskiy (2011), or Theorem 2.2.7 in Anderson (1991) for the proof of (iii). □

Lemma 5.1 together with (5.2) shows that the problem (2.4) is equivalent to the following linear programming problem (LP$_n$):

$$\mathrm{LP}_n:\quad \inf_{\eta}\ \frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta(i,da) \tag{5.7}$$

subject to

$$\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta(i,da) \le \alpha d^{l}_{n},\ 1\le l\le p;\qquad \alpha\hat\eta(i) = \alpha\gamma_n(i) + \sum_{j\in S_n}\int_{A_n(j)} q_n(i\mid j,a)\,\eta(j,da)\ \text{for each } i\in S_n;\qquad \sum_{j\in S_n} x(j)\hat\eta(j)<\infty,\ \eta\in P(K_n). \tag{5.8}$$

A p.m. $\eta$ on $K_n$ is said to be a feasible solution to the LP$_n$ if it satisfies (5.8). We denote by $\inf\mathrm{LP}_n$ the optimal value of the LP$_n$. If there exists a feasible solution $\eta$ to the LP$_n$ such that $\frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta(i,da) = \inf\mathrm{LP}_n$, then $\eta$ is called an optimal solution to the LP$_n$. The families of all feasible solutions and of all optimal solutions to the LP$_n$ are denoted by $F_n$ and $O_n$, respectively. For each $n\in\overline{\mathbb N}$, Lemma 5.1 ensures that $F_n\neq\emptyset$. To prove Theorem 3.1, we also need the following results.
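The characterization (5.6) and the representation (5.2) lend themselves to a quick numerical sanity check. The sketch below is only an illustration under assumed data: it fixes one stationary policy on a truncated birth-death chain (constant birth rate `lam`, death rate `mu*i`, cost `10*i` — stand-ins, not the controlled rates of Example 4.1), solves the finite analogue of (5.6), namely $(\alpha I - Q_n^{\top})\mu = \alpha\gamma_n$, and confirms that $\mu$ is a probability measure and that $\frac1\alpha\sum_i c(i)\mu(i)$ agrees with the value $\sum_i \gamma(i)V(i)$ obtained from $(\alpha I - Q_n)V = c$. The helper name `solve` is ours, not the paper's.

```python
# Sanity check (illustrative data, not the paper's experiment): verify that
# the solution of the occupation-measure equation (5.6) is a probability
# measure and reproduces the discounted value as in formula (5.2).

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

n, alpha, lam, mu = 40, 1.2, 1.4, 1.0

# Conservative truncated generator: no birth transition at the top state,
# so every row of Q sums to zero.
Q = [[0.0] * n for _ in range(n)]
for i in range(n):
    if i < n - 1:
        Q[i][i + 1] = lam
    if i > 0:
        Q[i][i - 1] = mu * i
    Q[i][i] = -sum(Q[i])

cost = [10.0 * i for i in range(n)]
g = [2.0 ** -(i + 1) for i in range(n)]
s = sum(g)
gamma = [gi / s for gi in g]          # initial distribution, total mass 1

# Occupation measure, cf. (5.6): (alpha*I - Q^T) mu = alpha*gamma.
A1 = [[(alpha if i == j else 0.0) - Q[j][i] for j in range(n)] for i in range(n)]
occ = solve(A1, [alpha * gi for gi in gamma])

# Value function: (alpha*I - Q) V = cost, then V(gamma) = sum_i gamma(i)V(i).
A2 = [[(alpha if i == j else 0.0) - Q[i][j] for j in range(n)] for i in range(n)]
V = solve(A2, cost)

lhs = sum(c * m for c, m in zip(cost, occ)) / alpha   # formula (5.2)
rhs = sum(gi * v for gi, v in zip(gamma, V))
assert abs(sum(occ) - 1.0) < 1e-9    # occ is a probability measure
assert abs(lhs - rhs) < 1e-8         # (5.2) agrees with the value function
```

The agreement of `lhs` and `rhs` is exactly the identity $\mu^{\top}c/\alpha = \gamma^{\top}(\alpha I-Q)^{-1}c$, which underlies the equivalence of (2.4) with the LP$_n$.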


Lemma 5.2. For any sequence $\{\pi_n\}$ with $\pi_n\in\Pi^{s}_{n}$ converging weakly to $\pi\in\Pi^{s}_{\infty}$, the following assertions hold.

(i) For any $\nu_n\in P(S_n)$ and $\nu\in P(S_\infty)$, define the p.m.s $\mu_n$ on $K_n$ and $\mu$ on $K_\infty$ by $\mu_n(i,da) := \nu_n(i)\pi_n(da\mid i)$ and $\mu(i,da) := \nu(i)\pi(da\mid i)$. If $\nu_n\to\nu$ weakly in $P(S_\infty)$, then $\mu_n\to\mu$ weakly in $P(K_\infty)$.

(ii) Suppose that Assumptions 3.1(e) and 3.2(a–b) hold. Then

$$\lim_{n\to\infty} q_n(j\mid i,\pi_n) = q_\infty(j\mid i,\pi) \quad\text{and}\quad \lim_{n\to\infty} c^{l}_{n}(i,\pi_n) = c^{l}_{\infty}(i,\pi), \qquad n\ge n(i)\vee n(j),$$

for each $i,j\in S_\infty$ and $0\le l\le p$.

Proof. (i) By arguments similar to those for the proof of Lemma 4.6(a) in Alvarez-Mena and Hernández-Lerma (2002), we see that part (i) is true.

(ii) Fix any $i,j\in S_\infty$, and take an arbitrary $\varepsilon>0$. Assumption 3.2(a) implies the existence of an integer $L_1\ge n(i)\vee n(j)$ such that

$$-\frac{\varepsilon}{2} \le q_n(j\mid i,a) - q_\infty(j\mid i,a) \le \frac{\varepsilon}{2} \quad\text{for all } a\in A_n(i),\ n\ge L_1,$$

and so

$$-\frac{\varepsilon}{2} \le \int_{A_n(i)}\big(q_n(j\mid i,a)-q_\infty(j\mid i,a)\big)\,\pi_n(da\mid i) \le \frac{\varepsilon}{2} \quad\text{for all } n\ge L_1. \tag{5.9}$$

Since $\pi_n\to\pi$, Assumption 3.1(e) implies the existence of an integer $L_2\ge L_1$ such that

$$-\frac{\varepsilon}{2} \le \int_{A_\infty(i)} q_\infty(j\mid i,a)\,\pi(da\mid i) - \int_{A_n(i)} q_\infty(j\mid i,a)\,\pi_n(da\mid i) \le \frac{\varepsilon}{2} \quad\text{for all } n\ge L_2,$$

which, together with (5.9), implies

$$\Big|\int_{A_n(i)} q_n(j\mid i,a)\,\pi_n(da\mid i) - \int_{A_\infty(i)} q_\infty(j\mid i,a)\,\pi(da\mid i)\Big| \le \varepsilon \quad\text{for all } n\ge L_1\vee L_2.$$

This verifies the first part of (ii). Similarly, we can prove the second part of (ii). □

Theorem 5.1. Under Assumptions 2.1, 3.1 and 3.2(a–c), the following assertions hold.

(i) If $\eta_n$ is a p.m. on $K_n$ (for each $n\ge1$) satisfying (5.3) and (5.4), and $\eta_n\to\eta$ weakly in $P(K_\infty)$, then $\lim_{n\to\infty}\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta_n(i,da) = \sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta(i,da)$ for each $0\le l\le p$.

(ii) If $\pi_n\in\Pi^{s}_{n}$ for all $n\ge1$, and $\pi_n\to\pi\in\Pi^{s}_{\infty}$ weakly, then $\eta^{\pi_n}_{n}\to\eta^{\pi}_{\infty}$ weakly in $P(K_\infty)$.

Proof. (i) Under Assumptions 2.1 and 3.1(b–d), (5.2) and Lemma 5.1(i) imply that the sequence $\big\{\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta_n(i,da)\big\}$ is bounded, uniformly in $0\le l\le p$ and $n\in\mathbb N$. For any fixed $0\le l\le p$, choose any subsequence $\big\{\sum_{i\in S_{n_k}}\int_{A_{n_k}(i)} c^{l}_{n_k}(i,a)\,\eta_{n_k}(i,da)\big\}$ of $\big\{\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta_n(i,da)\big\}$ converging to some constant $g$ as $k\to\infty$. Under Assumptions 2.1 and 3.1(b–c), by Lemma 5.1(ii), there exists $\pi_{n_k}\in\Pi^{s}_{n_k}$ such that $\eta_{n_k}(i,da) = \hat\eta_{n_k}(i)\,\pi_{n_k}(da\mid i)$ for each $k$. Under Assumption 3.1(a), by Remark 3.1 and the diagonal process, there exists a subsequence $\{\pi_{n_{k_s}}\}$ of $\{\pi_{n_k}\}$ such that $\pi_{n_{k_s}}\to\pi\in\Pi^{s}_{\infty}$ weakly. Moreover, since $\eta_{n_k}\to\eta$ weakly, we have

$$\hat\eta_{n_{k_s}}(i) \to \hat\eta(i) \ \text{as } s\to\infty, \ \text{for each } i\in S_\infty, \ \text{where } n_{k_s}\ge n(i). \tag{5.10}$$

Thus, it follows from Lemma 5.2(i) and (5.10) that $\eta(i,da) = \hat\eta(i)\,\pi(da\mid i)$.

For each $\hat\eta_{n_{k_s}}$ ($s\in\mathbb N$), define an associated p.m. $\tilde\eta_{n_{k_s}}\in P(S_\infty)$ as follows:

$$\tilde\eta_{n_{k_s}}(i) := \hat\eta_{n_{k_s}}(i) \ \text{if } i\in S_{n_{k_s}}, \quad\text{and}\quad \tilde\eta_{n_{k_s}}(i) := 0 \ \text{if } i\in S^{c}_{n_{k_s}}.$$

Hence, by Assumption 3.1, it follows from (5.10) and the definition of $\tilde\eta_{n_{k_s}}$ that

$$\sum_{i\in S_\infty} x(i)\,\tilde\eta_{n_{k_s}}(i) \le \sup_{n\in\mathbb N}\frac{\alpha\sum_{i\in S_n} x(i)\gamma_n(i)+b}{\alpha-\rho} < \infty. \tag{5.11}$$

By Assumptions 2.1, 3.1(b–c) and (5.11), for each $\varepsilon>0$ there exists an integer $N$ such that

$$\sum_{i=N}^{\infty} x(i)\hat\eta(i) \le \frac{\varepsilon}{3}, \qquad \sum_{i=N}^{\infty} x(i)\,\tilde\eta_{n_{k_s}}(i) \le \sup_{n\in\mathbb N}\Big\{\frac{\alpha\sum_{i\in S_n} x^{2}(i)\gamma_n(i)+\kappa_2}{\alpha-\kappa_1}\Big\}\frac{1}{x(N)} \le \frac{\varepsilon}{3} \quad\text{for all } s\ge1. \tag{5.12}$$

Hence, by (5.10) and the definition of $\tilde\eta_{n_{k_s}}$, for each $i<N$ there exists an integer $M(i)$ (depending on $i$) such that for each $s\ge M(i)$,

$$\big|\tilde\eta_{n_{k_s}}(i) - \hat\eta(i)\big| \le \frac{\varepsilon}{3N x(i)}. \tag{5.13}$$

Therefore, it follows from (5.12) and (5.13) that for each $s\ge\max\{M(i): i\le N-1\}$,

$$\Big|\sum_{i\in S_{n_{k_s}}} x(i)\hat\eta_{n_{k_s}}(i) - \sum_{i\in S_\infty} x(i)\hat\eta(i)\Big| = \Big|\sum_{i\in S_\infty} x(i)\tilde\eta_{n_{k_s}}(i) - \sum_{i\in S_\infty} x(i)\hat\eta(i)\Big| = \Big|\sum_{i=0}^{N-1} x(i)\big(\tilde\eta_{n_{k_s}}(i)-\hat\eta(i)\big) + \sum_{i=N}^{\infty} x(i)\big(\tilde\eta_{n_{k_s}}(i)-\hat\eta(i)\big)\Big| \le \Big|\sum_{i=0}^{N-1} x(i)\tilde\eta_{n_{k_s}}(i) - \sum_{i=0}^{N-1} x(i)\hat\eta(i)\Big| + \frac{2}{3}\varepsilon \le \varepsilon, \tag{5.14}$$

which, together with the relationship between $\tilde\eta_{n_{k_s}}$ and $\hat\eta_{n_{k_s}}$, implies that

$$\sum_{i\in S_\infty} x(i)\hat\eta(i) = \lim_{s\to\infty}\sum_{i\in S_{n_{k_s}}} x(i)\hat\eta_{n_{k_s}}(i) \le \sup_{n\in\mathbb N}\Big\{\frac{\alpha\sum_{i\in S_n} x(i)\gamma_n(i)+b}{\alpha-\rho}\Big\} < \infty. \tag{5.15}$$

Therefore, under Assumptions 3.1(e) and 3.2(b), by Lemma 5.2(ii), (5.10), (5.15) and Theorem A.2.6 in Sennott (1999), we have

$$\sum_{i\in S_{n_{k_s}}} c^{l}_{n_{k_s}}(i,\pi_{n_{k_s}})\,\hat\eta_{n_{k_s}}(i) \to \sum_{i\in S_\infty} c^{l}_{\infty}(i,\pi)\,\hat\eta(i), \quad\text{as } s\to\infty, \tag{5.16}$$

that is,

$$g = \lim_{s\to\infty}\sum_{i\in S_{n_{k_s}}}\int_{A_{n_{k_s}}(i)} c^{l}_{n_{k_s}}(i,a)\,\eta_{n_{k_s}}(i,da) = \sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta(i,da). \tag{5.17}$$

As the subsequence $\big\{\sum_{i\in S_{n_k}}\int_{A_{n_k}(i)} c^{l}_{n_k}(i,a)\,\eta_{n_k}(i,da)\big\}$ was arbitrarily chosen and (by (5.17)) all such subsequences have the same limit $\sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta(i,da)$, we have

$$\lim_{n\to\infty}\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta_n(i,da) = \sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta(i,da).$$


(ii) By Lemmas 5.2 and 5.1, it suffices to show that $\hat\eta^{\pi_n}_{n}\to\hat\eta^{\pi}_{\infty}$ weakly. Choose any subsequence $\{\hat\eta^{\pi_{n_k}}_{n_k}\}$ of $\{\hat\eta^{\pi_n}_{n}\}$ converging weakly to some measure $\tau$ on $S_\infty$. Then

$$\lim_{k\to\infty}\hat\eta^{\pi_{n_k}}_{n_k}(i) = \tau(i) \quad\text{for all } i\in S_\infty. \tag{5.18}$$

By Assumptions 3.1(e) and 3.2(b) and Lemma 5.2(ii), we have

$$0 \le \sum_{i\in S_\infty}\tau(i) \le \lim_{k\to\infty}\sum_{i\in S_{n_k}}\hat\eta^{\pi_{n_k}}_{n_k}(i) = 1, \quad\text{and} \tag{5.19}$$

$$\sum_{j\in S_\infty} q_\infty(i\mid j,\pi)\,\tau(j) \le \lim_{k\to\infty}\sum_{j\in S_{n_k}} q_{n_k}(i\mid j,\pi_{n_k})\,\hat\eta^{\pi_{n_k}}_{n_k}(j), \quad\text{for each } i\in S_\infty. \tag{5.20}$$

Then, by Assumptions 2.1 and 3.1(b,c), using Proposition A.3 in Guo and Hernández-Lerma (2009), Lemma 5.1(i) and (5.18), we have

$$\sum_{i\in S_\infty} x(i)\tau(i) \le \lim_{k\to\infty}\frac{\alpha\sum_{i\in S_{n_k}} x(i)\gamma_{n_k}(i)+b}{\alpha-\rho} \le \sup_{n\in\mathbb N}\Big\{\frac{\alpha\sum_{i\in S_n} x(i)\gamma_n(i)+b}{\alpha-\rho}\Big\} < \infty. \tag{5.21}$$

On the other hand, for each $i\in S_\infty$ and $n_k\ge n(i)$, by Lemma 5.1(iii) we have

$$\hat\eta^{\pi_{n_k}}_{n_k}(i) = \gamma_{n_k}(i) + \frac1\alpha\sum_{j\in S_{n_k}} q_{n_k}(i\mid j,\pi_{n_k})\,\hat\eta^{\pi_{n_k}}_{n_k}(j). \tag{5.22}$$

Then, by (5.20), letting $k\to\infty$ in (5.22), we get

$$\tau(i) = \gamma_\infty(i) + \frac1\alpha\lim_{k\to\infty}\sum_{j\in S_{n_k}} q_{n_k}(i\mid j,\pi_{n_k})\,\hat\eta^{\pi_{n_k}}_{n_k}(j) \ge \gamma_\infty(i) + \frac1\alpha\sum_{j\in S_\infty} q_\infty(i\mid j,\pi)\,\tau(j). \tag{5.23}$$

Suppose that there exists some $i_0\in S_\infty$ such that

$$\tau(i_0) > \gamma_\infty(i_0) + \frac1\alpha\sum_{j\in S_\infty} q_\infty(i_0\mid j,\pi)\,\tau(j). \tag{5.24}$$

Under Assumption 2.1(b), it follows from Fubini's theorem and (5.21) that

$$\sum_{i\in S_\infty}\sum_{j\in S_\infty}\big|q_\infty(i\mid j,\pi)\big|\,\tau(j) = \sum_{j\in S_\infty}\tau(j)\sum_{i\in S_\infty}\big|q_\infty(i\mid j,\pi)\big| \le 2\sum_{j\in S_\infty}\tau(j)\,q^{*}_{\infty}(j) < \infty,$$

and so we have

$$\sum_{i\in S_\infty}\sum_{j\in S_\infty} q_\infty(i\mid j,\pi)\,\tau(j) = \sum_{j\in S_\infty}\tau(j)\sum_{i\in S_\infty} q_\infty(i\mid j,\pi) = 0. \tag{5.25}$$

Hence, summing both sides of (5.23) over $i\in S_\infty$, together with (5.24) and (5.25), yields $\sum_{i\in S_\infty}\tau(i) > \sum_{i\in S_\infty}\gamma_\infty(i) = 1$, which contradicts (5.19). Thus, we have

$$\tau(i) = \gamma_\infty(i) + \frac1\alpha\sum_{j\in S_\infty} q_\infty(i\mid j,\pi)\,\tau(j) \quad\text{for all } i\in S_\infty,$$

which, together with Lemma 5.1(iii), implies that $\tau(i) = \hat\eta^{\pi}_{\infty}(i)$ for all $i\in S_\infty$. As the subsequence $\{\hat\eta^{\pi_{n_k}}_{n_k}\}$ was arbitrarily chosen and all such subsequences have the same limit $\hat\eta^{\pi}_{\infty}$, we have $\hat\eta^{\pi_n}_{n}(i)\to\hat\eta^{\pi}_{\infty}(i)$ for each $i\in S_\infty$, and so $\hat\eta^{\pi_n}_{n}\to\hat\eta^{\pi}_{\infty}$ weakly. □

Theorem 5.2. Under Assumptions 2.1 and 3.1, the following assertions hold.

(i) If $U_n\neq\emptyset$ for some $n\in\overline{\mathbb N}$, then the corresponding $F_n$ is nonempty and compact.

(ii) If, in addition, Assumptions 2.2 and 3.2 hold, and $\eta_n\in F_n$ for each $n\in\mathbb N$, then $\{\eta_n\}$ is relatively compact, and any accumulation point of $\{\eta_n\}$ is in $F_\infty$.

Proof. (i) For each $n\in\overline{\mathbb N}$, Assumption 2.1 together with 3.1(b–d) implies that $F_n\neq\emptyset$ (by Lemma 5.1(i)). Let $\{\eta_m\}$ be an arbitrary sequence in $F_n$. Then, under Assumptions 2.1 and 3.1(b–c), Lemma 5.1(ii) gives the existence of $\pi_m\in\Pi^{s}_{n}$ such that $\eta_m = \eta^{\pi_m}_{n}$ for all $m\ge1$. Moreover, under Assumption 3.1(a), by Remark 3.1, there exist a subsequence $\{\pi_{m_k}\}$ of $\{\pi_m\}$ and a policy $\pi\in\Pi^{s}_{n}$ such that $\pi_{m_k}\to\pi$ weakly as $k\to\infty$. Hence, under Assumptions 2.1 and 3.1, it follows from Theorem 5.1(ii) (for the special case in which each of the models $\{M_1,M_2,\ldots\}$ is taken as $M_n$) that $\eta^{\pi_{m_k}}_{n}\to\eta^{\pi}_{n}$ as $k\to\infty$. It remains to verify that $\eta^{\pi}_{n}$ satisfies the constraints. Indeed, under Assumptions 2.1 and 3.1, by Theorem 5.1(i) (for the special case above) and (5.2), we have

$$V^{l}_{n}(\pi) = \frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi}_{n}(i,da) = \lim_{k\to\infty}\frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi_{m_k}}_{n}(i,da) \le d^{l}_{n}, \quad l=1,\ldots,p,$$

and so (i) follows.

(ii) Let $\{\eta_{n_m}\}$ be an arbitrary subsequence of $\{\eta_n\}$. Then by Lemma 5.1(ii) we have $\eta_{n_m} = \eta^{\pi_{n_m}}_{n_m}$ for all $m\ge1$, with some $\pi_{n_m}\in\Pi^{s}_{n_m}$. Under Assumption 3.1(a), by Remark 3.1 and the diagonal process, there exists a subsequence $\{\pi_{n_{m_k}}\}$ of $\{\pi_{n_m}\}$ such that $\pi_{n_{m_k}}\to\pi\in\Pi^{s}_{\infty}$ weakly, and thus it follows from Theorem 5.1(ii) that $\eta_{n_{m_k}} = \eta^{\pi_{n_{m_k}}}_{n_{m_k}}\to\eta^{\pi}_{\infty}$ as $k\to\infty$. Moreover, under Assumptions 2.1, 3.1 and 3.2, by Theorem 5.1(i) we have

$$\frac1\alpha\sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta^{\pi}_{\infty}(i,da) = \lim_{k\to\infty}\frac1\alpha\sum_{i\in S_{n_{m_k}}}\int_{A_{n_{m_k}}(i)} c^{l}_{n_{m_k}}(i,a)\,\eta^{\pi_{n_{m_k}}}_{n_{m_k}}(i,da) \le \lim_{k\to\infty} d^{l}_{n_{m_k}} = d^{l}_{\infty},$$

for all $1\le l\le p$. Hence, $\eta^{\pi}_{\infty}\in F_\infty$, and so (ii) is true. □

Note that Assumptions 2.2 and 3.3 both concern the existence of a policy satisfying the given constraints. The following proposition establishes a relationship between them.

Proposition 5.1. Suppose that Assumptions 2.1, 3.1, 3.2 and 3.3 hold. Then, there exist an integer $N$, a constant $d>0$, and policies $\tilde\pi_n\in\Pi^{s}_{n}$ for all $n\ge N$, such that

$$V^{l}_{n}(\tilde\pi_n) \le d^{l}_{n} - d \ (\text{hence } U_n\neq\emptyset), \quad\text{for each } 1\le l\le p \ \text{and } n\ge N.$$

Proof. Let $\pi\in\Pi_\infty$ be the policy as in Assumption 3.3(a). Then, under Assumptions 2.1, 3.1(b–d) and 3.3, it follows from Lemma 5.1(ii) and (5.2) that there exist a policy $\tilde\pi\in\Pi^{s}_{\infty}$ and a constant $C>0$ such that $V^{l}_{\infty}(\tilde\pi) = V^{l}_{\infty}(\pi) \le d^{l}_{\infty} - C$ for each $1\le l\le p$.

For each $n\ge1$, define a policy $\tilde\pi_n\in\Pi^{s}_{n}$ as follows: for each $i\in S_n$, take an arbitrary $a_n(i)\in A_n(i)$, and let

$$\tilde\pi_n(da\mid i) := \begin{cases} \tilde\pi\big(da\cap A_n(i)\mid i\big)/\tilde\pi\big(A_n(i)\mid i\big) & \text{if } \tilde\pi\big(A_n(i)\mid i\big)>0, \\ I_{\{a_n(i)\}}(da) & \text{if } \tilde\pi\big(A_n(i)\mid i\big)=0. \end{cases} \tag{5.26}$$

Then, Assumption 3.3(b) implies that $\tilde\pi(A_n(i)\mid i)\uparrow\tilde\pi(A_\infty(i)\mid i)=1$ as $n\to\infty$, and so $\tilde\pi_n\to\tilde\pi$ weakly. Thus, under Assumptions 2.1, 3.1 and 3.2, by Theorem 5.1 and (5.2), we have

$$\eta^{\tilde\pi_n}_{n}\to\eta^{\tilde\pi}_{\infty} \quad\text{and}\quad \lim_{n\to\infty} V^{l}_{n}(\tilde\pi_n) = V^{l}_{\infty}(\tilde\pi) \le d^{l}_{\infty} - C \ \text{for each } 1\le l\le p. \tag{5.27}$$

Then, for any fixed $0<\varepsilon<\frac{C}{2}$ (and so $d := C-2\varepsilon>0$), by Assumption 3.2(d) and (5.27), there exists an integer $N$ such that


$$\big|d^{l}_{n}-d^{l}_{\infty}\big| \le \varepsilon \quad\text{and}\quad \big|V^{l}_{n}(\tilde\pi_n)-V^{l}_{\infty}(\tilde\pi)\big| \le \varepsilon, \quad\text{for } n\ge N \ \text{and } 1\le l\le p.$$

Hence, $V^{l}_{n}(\tilde\pi_n) \le V^{l}_{\infty}(\tilde\pi)+\varepsilon \le d^{l}_{\infty}-C+\varepsilon \le d^{l}_{n}-C+2\varepsilon = d^{l}_{n}-d$ for each $n\ge N$ and $1\le l\le p$, which completes the proof. □
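The truncation (5.26) is straightforward to implement for a finite-support decision rule. The sketch below is illustrative (the function name `truncate_rule` and the example distribution are ours, not the paper's): it restricts a randomized rule at one state to a smaller action set and renormalizes, falling back to a point mass at a default action when the restricted set carries no mass; as the action set grows, the truncated rule recovers the original one.

```python
# Sketch of the policy truncation (5.26) at a single state i: restrict a
# distribution p (dict: action -> probability) to the action set actions_n
# and renormalize; use a point mass at `fallback` if p puts no mass there.

def truncate_rule(p, actions_n, fallback):
    mass = sum(p.get(a, 0.0) for a in actions_n)
    if mass > 0.0:
        return {a: p.get(a, 0.0) / mass for a in actions_n}
    return {fallback: 1.0}

# A distribution over actions 0..19 with geometrically decaying mass
# (the last weight is doubled so the masses sum exactly to one).
p = {a: 2.0 ** -(a + 1) for a in range(20)}
p[19] *= 2.0

small = truncate_rule(p, range(3), fallback=0)    # coarse action set
full = truncate_rule(p, range(20), fallback=0)    # full action set

assert abs(sum(small.values()) - 1.0) < 1e-12     # still a distribution
assert all(abs(full[a] - p[a]) < 1e-12 for a in p)  # full set: unchanged
```

The monotone convergence $\tilde\pi(A_n(i)\mid i)\uparrow 1$ assumed in the proof corresponds here to the renormalizing `mass` tending to one as `actions_n` grows.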

6. Proof of Theorem 3.1

The proof of Theorem 3.1 is based on the following proposition, which is similar to Lemma 5.1 in Alvarez-Mena and Hernández-Lerma (2002) for discrete-time MDPs. We prove it here for the sake of completeness.

Proposition 6.1. Under Assumptions 2.1, 3.1, 3.2 and 3.3, the following assertions hold.

(i) For each $\eta\in F_\infty$, there exist an integer $N$ and $\eta_n\in F_n$ for each $n\ge N$, such that $\eta_n\to\eta$ weakly.

(ii) There exists an integer $N$ such that $F_n\neq\emptyset$ for each $n\ge N$. If $\eta_n\in O_n$ for all $n\ge N$, and $\eta_\infty$ is an accumulation point of $\{\eta_n,\ n\ge N\}$, then $\eta_\infty\in O_\infty$.

Proof. (i) Let $\eta\in F_\infty$. Under Assumptions 2.1, 3.1(b–d) and 3.3(a), Lemma 5.1(ii) and (5.2) give the existence of $\pi^k\in\Pi^{s}_{\infty}$ ($k=1,2$) such that $\eta = \eta^{\pi^1}_{\infty}$ and $V^{l}_{\infty}(\pi^2)\le d^{l}_{\infty}-C$ with some constant $C>0$, for all $1\le l\le p$. For each policy $\pi^k$, replacing $\tilde\pi$ in (5.26) with $\pi^k$, we define a sequence $\{\pi^{k}_{n},\ n\ge1\}$, which converges weakly to $\pi^k$ (by Assumption 3.3(b)). Thus, by Theorem 5.1(ii), we have

$$\eta^{\pi^{k}_{n}}_{n} \to \eta^{\pi^{k}}_{\infty}, \quad\text{as } n\to\infty, \ \text{for } k=1,2, \tag{6.1}$$

and so (by Theorem 5.1(i)), for all $1\le l\le p$,

$$\lim_{n\to\infty}\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi^{1}_{n}}_{n}(i,da) = \sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta^{\pi^{1}}_{\infty}(i,da) \le \alpha d^{l}_{\infty}, \tag{6.2}$$

$$\lim_{n\to\infty}\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi^{2}_{n}}_{n}(i,da) = \sum_{i\in S_\infty}\int_{A_\infty(i)} c^{l}_{\infty}(i,a)\,\eta^{\pi^{2}}_{\infty}(i,da) \le \alpha\big(d^{l}_{\infty}-C\big). \tag{6.3}$$

Let $\varepsilon$ be such that $0<2\varepsilon<\frac{\alpha C}{1+\alpha}$. By (6.2), (6.3) and Assumption 3.2(d), there exists an integer $N_\varepsilon$ (depending on $\varepsilon$) such that, for each $n\ge N_\varepsilon$ and $1\le l\le p$, $|d^{l}_{n}-d^{l}_{\infty}|\le\varepsilon$, and

$$\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi^{1}_{n}}_{n}(i,da) \le \alpha d^{l}_{\infty}+\varepsilon, \qquad \sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi^{2}_{n}}_{n}(i,da) \le \alpha\big(d^{l}_{\infty}-C\big)+\varepsilon.$$

Let $m^{\varepsilon}_{n} := (1-\lambda_\varepsilon)\eta^{\pi^{1}_{n}}_{n} + \lambda_\varepsilon\,\eta^{\pi^{2}_{n}}_{n}$, where $\lambda_\varepsilon := \frac{(\alpha+1)\varepsilon}{\alpha C}<\frac12$. Then, for each $1\le l\le p$ and $n\ge N_\varepsilon$,

$$\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,m^{\varepsilon}_{n}(i,da) = (1-\lambda_\varepsilon)\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi^{1}_{n}}_{n}(i,da) + \lambda_\varepsilon\sum_{i\in S_n}\int_{A_n(i)} c^{l}_{n}(i,a)\,\eta^{\pi^{2}_{n}}_{n}(i,da) \le (1-\lambda_\varepsilon)\big(\alpha d^{l}_{\infty}+\varepsilon\big) + \lambda_\varepsilon\big[\alpha\big(d^{l}_{\infty}-C\big)+\varepsilon\big] \le \alpha d^{l}_{n},$$

which implies that $m^{\varepsilon}_{n}\in F_n$ for all $n\ge N_\varepsilon$.

Then, let $\{\varepsilon_k\}\subset\mathbb R$ be such that $\varepsilon_k\downarrow0$ and $0<2\varepsilon_k<\frac{\alpha C}{\alpha+1}$. For each fixed $k\ge1$ (corresponding to a given $\varepsilon_k$), as in the previous argument, there exists an integer $N_k$ (depending on $k$), which is assumed to be increasing in $k\ge1$, such that

$$m^{k}_{n} := (1-\lambda_k)\eta^{\pi^{1}_{n}}_{n} + \lambda_k\,\eta^{\pi^{2}_{n}}_{n} \in F_n \quad\text{for all } n\ge N_k,\ k\ge1, \quad\text{where } \lambda_k := \frac{(\alpha+1)\varepsilon_k}{\alpha C}. \tag{6.4}$$

Let $\eta_n := m^{k}_{n}$ (and $\lambda_n := \lambda_k$) for each $N_k\le n<N_{k+1}$. Then, since $\varepsilon_k\downarrow0$, by (6.1) and (6.4), we have $\eta_n\to\eta$ and $\eta_n\in F_n$ for each $n\ge N_1$, which completes the proof of (i).

(ii) The first statement follows from Proposition 5.1. Then, under Assumptions 2.1, 3.1 and 3.2, Theorem 5.2(ii) gives the existence of a subsequence $\{\eta_m\}$ of $\{\eta_n\}$ and of $\eta_\infty\in F_\infty$ such that $\eta_m\to\eta_\infty$; without loss of generality, we assume that $\eta_n\to\eta_\infty$. On the other hand, for any $\tilde\eta_\infty\in F_\infty$, under Assumptions 2.1, 3.1, 3.2 and 3.3, it follows from (i) that there exist an integer $N'>N$ and $\tilde\eta_n\in F_n$ for all $n\ge N'$, such that $\tilde\eta_n\to\tilde\eta_\infty$ weakly, which, together with $\eta_n\in O_n$, gives

$$\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta_n(i,da) \le \sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\tilde\eta_n(i,da) \quad\text{for all } n\ge N'. \tag{6.5}$$

Since $\eta_n\to\eta_\infty$ and $\tilde\eta_n\to\tilde\eta_\infty$, under Assumptions 2.1, 2.2 and 3.1, by Theorem 5.1(i) and (6.5), we have $\sum_{i\in S_\infty}\int_{A_\infty(i)} c^{0}_{\infty}(i,a)\,\eta_\infty(i,da) \le \sum_{i\in S_\infty}\int_{A_\infty(i)} c^{0}_{\infty}(i,a)\,\tilde\eta_\infty(i,da)$, and so (ii) follows. □

Proof of Theorem 3.1. (a) For any $n\in\overline{\mathbb N}$ with $U_n\neq\emptyset$, it follows from Theorem 5.2(i) that $F_n\neq\emptyset$. By the definition of the optimal value of the LP$_n$, under Assumptions 2.1 and 3.1, by Lemma 5.1 there exists a sequence $\{\eta_m\}$ in $F_n$ such that

$$\lim_{m\to\infty}\frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta_m(i,da) = \inf\mathrm{LP}_n, \tag{6.6}$$

and moreover, by Theorem 5.2(i), there exists a subsequence $\{\eta_{m_k}\}$ of $\{\eta_m\}$ such that $\eta_{m_k}\to\eta\in F_n$. Then, under Assumptions 2.1 and 3.1, by Theorem 5.1(i) (for the special case in which each of the models $\{M_1,M_2,\ldots\}$ is taken as $M_n$) and (6.6), we have $\frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta(i,da) = \lim_{k\to\infty}\frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta_{m_k}(i,da) = \inf\mathrm{LP}_n$; that is, $\eta\in O_n$. Define a policy $\pi^*$ by the decomposition of $\eta$: $\eta(i,da) =: \hat\eta(i)\,\pi^*(da\mid i)$. Then, for any $\pi\in U_n$, by (5.2) we have

$$V^{0}_{n}(\pi^*) = \frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta(i,da) \le \frac1\alpha\sum_{i\in S_n}\int_{A_n(i)} c^{0}_{n}(i,a)\,\eta^{\pi}_{n}(i,da) = V^{0}_{n}(\pi),$$

and so $\pi^*$ is optimal for $M_n$.

(b) We first prove (b$_1$) and (b$_2$) together. Under Assumptions 2.1, 3.1, 3.2 and 3.3, by Proposition 5.1 there exists an integer $N$ such that $U_n\neq\emptyset$ for each $n\ge N$; then by (a) we see that (b$_1$) is true. Moreover, for each $n\ge N$ (including $n=\infty$), it follows from Lemma 5.1 and (5.2) that $\pi$ is optimal for $M_n$ if and only if $\eta^{\pi}_{n}\in O_n$, which, together with Proposition 6.1(ii) and Theorem 5.1(ii), implies (b$_2$).

We next prove (b$_3$). First, under Assumptions 2.1, 3.1, 3.2 and 3.3, there exist $\eta_n\in O_n$ for all $n\ge N$, with some $N\ge1$ (by (b$_1$)). Let $\{V^{0*}_{n_k}\}$ be any subsequence of $\{V^{0*}_{n}\}$ which converges to some constant. Then, under Assumptions 2.1 and 3.1–3.3, by Theorem 5.2(ii) and Proposition 6.1(ii), there exists $\{\eta_{n_{k_s}}\}\subset\{\eta_{n_k}\}$ such that $\eta_{n_{k_s}}\to\eta_\infty\in O_\infty$ weakly as $s\to\infty$, and so it follows from $V^{0*}_{n}=\min\mathrm{LP}_n$ and Theorem 5.1(i) ($n\in\overline{\mathbb N}$) that

$$\lim_{k\to\infty}V^{0*}_{n_k} = \lim_{s\to\infty}V^{0*}_{n_{k_s}} = \lim_{s\to\infty}\frac1\alpha\sum_{i\in S_{n_{k_s}}}\int_{A_{n_{k_s}}(i)} c^{0}_{n_{k_s}}(i,a)\,\eta_{n_{k_s}}(i,da) = \frac1\alpha\sum_{i\in S_\infty}\int_{A_\infty(i)} c^{0}_{\infty}(i,a)\,\eta_\infty(i,da) = V^{0*}_{\infty},$$

which, together with the arbitrariness of the subsequence $\{V^{0*}_{n_k}\}$, implies (b$_3$). □

7. Conclusion

The convergence problems for the two sequences of optimal policies and of optimal values of the models $\{M_n\}$ of discounted continuous-time MDPs with unbounded transition rates and multiple-objective constraints are studied in this paper. Suitable conditions are given for the convergence of the optimal values and optimal policies of $\{M_n\}$ to those of the so-called "limit" model of $\{M_n\}$. The main technique in this paper is based on the asymptotic properties of occupation measures of policies, which differs from the optimality-equation method used in the previous literature on unconstrained models. Furthermore, applications to the finite-state approximation of constrained discounted continuous-time MDPs with denumerable states are illustrated with an example.

Let us finish this paper with possible further topics. For instance, the convergence rate of the approximation of the optimal value is given only for the example in Section 4. It is desirable, but unsolved, to establish convergence rates for more general models with constraints; see also Remark 3.3 in Prieto-Rumeau and Hernández-Lerma (2012) for unconstrained discounted continuous-time MDPs.

Acknowledgements

This research was partially supported by the NSFC and GDUPS. We also thank the anonymous referees for constructive comments.

References

Altman, E. (1999). Constrained Markov decision processes. Florida: Chapman & Hall/CRC.
Alvarez-Mena, J., & Hernández-Lerma, O. (2002). Convergence of the optimal values of constrained Markov control processes. Mathematical Methods of Operations Research, 55, 461–484.
Alvarez-Mena, J., & Hernández-Lerma, O. (2006). Existence of Nash equilibria for constrained stochastic games. Mathematical Methods of Operations Research, 63, 261–285.
Anderson, W. J. (1991). Continuous-time Markov chains. New York: Springer.
Cervellera, C., & Macciò, D. (2011). A comparison of global and semi-local approximation in T-stage stochastic optimization. European Journal of Operational Research, 208, 109–118.
Feinberg, E. A. (2000). Constrained discounted Markov decision processes and Hamiltonian cycles. Mathematics of Operations Research, 25, 130–140.
Feinberg, E. A. (2004). Continuous time discounted jump Markov decision processes: A discrete-event approach. Mathematics of Operations Research, 29, 492–524.
Feinberg, E. A. (2012). Reduction of discounted continuous-time MDPs with unbounded jump and reward rates to discrete-time total-reward MDPs. In Optimization, control, and applications of stochastic systems (pp. 77–97). Birkhäuser.
Feinberg, E. A., & Shwartz, A. (1999). Constrained dynamic programming with two discount factors: Applications and an algorithm. IEEE Transactions on Automatic Control, 44, 628–631.
Guo, X. P. (2007). Constrained optimization for average cost continuous-time Markov decision processes. IEEE Transactions on Automatic Control, 52, 1139–1143.
Guo, X. P., & Hernández-Lerma, O. (2003). Constrained continuous-time Markov control processes with discounted criteria. Stochastic Analysis and Applications, 21, 379–399.
Guo, X. P., & Hernández-Lerma, O. (2009). Continuous-time Markov decision processes. Springer-Verlag.
Guo, X. P., & Piunovskiy, A. B. (2011). Discounted continuous-time Markov decision processes with constraints: Unbounded transition and loss rates. Mathematics of Operations Research, 36, 105–132.
Guo, X. P., Song, X. Y., & Zhang, Y. (2014). First passage criteria for continuous-time Markov decision processes with varying discount factors and history-dependent policies. IEEE Transactions on Automatic Control, 59, 163–174.
Hernández-Lerma, O., & González-Hernández, J. (2000). Constrained Markov control processes in Borel spaces: The discounted case. Mathematical Methods of Operations Research, 52, 271–285.
Hernández-Lerma, O., González-Hernández, J., & López-Martínez, R. R. (2003). Constrained average cost Markov control processes in Borel spaces. SIAM Journal on Control and Optimization, 42, 442–468.
Hordijk, A., & Spieksma, F. (1989). Constrained admission control to a queueing system. Advances in Applied Probability, 21, 409–431.
Piunovskiy, A., & Zhang, Y. (2012). The transformation method for continuous-time Markov decision processes. Journal of Optimization Theory and Applications, 154, 691–712.
Prieto-Rumeau, T., & Hernández-Lerma, O. (2012). Discounted continuous-time controlled Markov chains: Convergence of control models. Journal of Applied Probability, 49, 1072–1090.
Prieto-Rumeau, T., & Lorenzo, J. M. (2010). Approximating ergodic average reward continuous-time controlled Markov chains. IEEE Transactions on Automatic Control, 55, 201–207.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
Sennott, L. I. (1991). Constrained discounted Markov decision chains. Probability in the Engineering and Informational Sciences, 5, 463–475.
Sennott, L. I. (1999). Stochastic dynamic programming and the control of queueing systems. New York: Wiley.
Zadorojniy, A., & Shwartz, A. (2006). Robustness of policies in constrained Markov decision processes. IEEE Transactions on Automatic Control, 51, 635–638.
