An exact iterative search algorithm for constrained Markov decision processes




Automatica 50 (2014) 1531–1534



Technical communique

Hyeong Soo Chang
Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea

Article history: Received 23 May 2013; Received in revised form 18 February 2014; Accepted 16 March 2014; Available online 13 April 2014.

Keywords: Markov decision processes; Policy iteration; Dynamic programming; Constrained optimization

Abstract: This communique provides an exact iterative search algorithm for the NP-hard problem of obtaining an optimal feasible stationary Markovian pure policy that achieves the maximum value averaged over an initial state distribution in finite constrained Markov decision processes. It is based on a novel characterization of the entire feasible policy space and takes the spirit of policy iteration (PI) in that a sequence of monotonically improving feasible policies is generated and converges to an optimal policy within a number of iterations bounded by the size of the policy space in the worst case. Unlike PI, an unconstrained MDP needs to be solved at iterations involved with feasible policies, and the current best policy improves upon all feasible policies included in the union of the policy spaces associated with the unconstrained MDPs.

1. Introduction

Consider a finite constrained Markov decision process (CMDP) (see, e.g., Altman (1998) for example problems) M = (X, A, P, C), where X is a finite state set, A(x) is a finite action set at x ∈ X, C is a cost function such that C(x, a) ∈ R, x ∈ X, a ∈ A(x), and P is a transition function that maps {(x, a) | x ∈ X, a ∈ A(x)} to the set of probability distributions over X. We denote by P_{xy}^a the probability of making a transition to state y ∈ X when taking an action a ∈ A(x) at state x ∈ X. We define a (stationary Markovian pure) policy π as a mapping from X to A with π(x) ∈ A(x), ∀x ∈ X, and let Π be the set of all such policies. Define the objective value of π ∈ Π with an initial state x ∈ X:

V^π(x) := E[ Σ_{t=0}^{∞} γ^t C(X_t, π(X_t)) | X_0 = x ],

where X_t is a random variable denoting the state at time t when following π and γ ∈ (0, 1) is a discounting factor. The CMDP M is associated with a function κ defined over X and a constraint-cost function D such that D(x, a) ∈ R, x ∈ X, a ∈ A(x).

✩ The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Nuno C. Martins under the direction of Editor André L. Tits. E-mail address: [email protected]. 1 Tel.: +82 2 705 8925; fax: +82 2 704 8273.

doi:10.1016/j.automatica.2014.03.020

Each feasible policy π ∈ Π needs to satisfy the constraint inequality Σ_{x∈X} δ(x)J^π(x) ≤ Σ_{x∈X} δ(x)κ(x), where δ is an initial state distribution over X and the constraint value of π with an initial state x ∈ X is defined such that

J^π(x) := E[ Σ_{t=0}^{∞} β^t D(X_t, π(X_t)) | X_0 = x ]

for a discounting factor β ∈ (0, 1). Throughout the paper, we assume that δ is fixed arbitrarily. We let J_δ^π = Σ_{x∈X} δ(x)J^π(x), V_δ^π = Σ_{x∈X} δ(x)V^π(x), and κ_δ = Σ_{x∈X} δ(x)κ(x). The goal of this problem is then to obtain an optimal feasible policy π^∗ in the feasible policy set

Π_f = {π : π ∈ Π, J_δ^π ≤ κ_δ}, which achieves min_{π∈Π_f} V_δ^π, if the problem is solvable, that is, Π_f ≠ ∅. Feinberg provides a non-linear and non-convex mathematical program (MP) formulation for this problem (named P1 in Feinberg (2000, Theorem 3.1)) such that the MP is feasible if and only if Π_f ≠ ∅, and showed that the problem is NP-hard when its size is characterized by the maximum of |X| and max_{x∈X} |A(x)| and the number of constraints (for simplicity of exposition, we consider the single-constraint case here). That is, this problem is more difficult than the problem of finding the best stationary Markovian randomized policy, in which case an associated linear program is solvable via linear programming (LP) (Kallenberg, 1980) in time polynomial in |X|, |A|, and the maximum number of bits required to represent any component of P and C (Blondel & Tsitsiklis, 2000).
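As a concrete illustration of these definitions, the following sketch (not from the paper; the data layout, with P[x][a] a probability vector over states and C, D cost tables, and the function names are assumptions) evaluates a pure policy by solving the linear systems V^π = C_π + γP_π V^π and J^π = D_π + βP_π J^π and tests the constraint J_δ^π ≤ κ_δ.

```python
# Minimal sketch of the finite CMDP model above (assumed data layout:
# P[x][a] is a length-|X| probability vector, C[x][a] and D[x][a] are scalars,
# a pure policy is a list mapping state index x to an action index).
import numpy as np

def evaluate(policy, P, R, discount):
    """Solve v = R_pi + discount * P_pi v for a pure policy."""
    n = len(P)
    P_pi = np.array([P[x][policy[x]] for x in range(n)], dtype=float)
    R_pi = np.array([R[x][policy[x]] for x in range(n)], dtype=float)
    return np.linalg.solve(np.eye(n) - discount * P_pi, R_pi)

def is_feasible(policy, P, D, beta, delta, kappa):
    """Check the constraint J_delta^pi <= kappa_delta."""
    J = evaluate(policy, P, D, beta)
    return float(np.dot(delta, J)) <= float(np.dot(delta, kappa))
```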


Unfortunately, LP cannot be applied to the non-linear MP. One may also consider a Lagrangian approach to this problem by establishing a ‘‘saddle point property’’ so that some minmax theorem can be applied, as in the other classes of policy space (Altman, 1998; Kadota, Kurano, & Yasuda, 2006), but this would be difficult with our restriction to the pure policy space. See also a Lagrangian approach by Beutler and Ross (1985) to an average-cost CMDP problem and their optimal-policy existence theorem in ‘‘mixed policy space’’, where the optimal feasible policy of the average-cost problem may be pure.

Because the CMDP problem under consideration is just an NP-hard discrete combinatorial optimization problem, an exhaustive search is possible, or any general-purpose algorithm (Floudas, 2009), e.g., a genetic algorithm (Hirayama & Kawai, 2000), simulated annealing, etc., can also be applied, but convergence to an optimal solution takes an infinite number of iterations in general. More importantly, this approach does not utilize any general structural property of CMDPs, e.g., the dynamic programming (DP) structures used in the well-known iterative search algorithms of policy iteration (PI) and value iteration (VI) (Puterman, 1994) for MDPs. Chen and Blankenship (2004) established certain DP equations for finite CMDPs, where the value function in the equations is a function of the system state and a constraint threshold. Later, Chen and Feinberg (2007) provided a value-iteration type algorithm based on the DP equations in Chen and Blankenship (2004) and showed that a sequence of finite-horizon value functions converges to the infinite-horizon value function. Piunovskiy and Mao (2000) also considered a DP approach with randomized policies, where the value function is given as a function of the state distribution and the expected total cost. Unfortunately, these value-iteration type algorithms involve functions that are difficult to compute in general, and these iterative methods converge in an infinite number of iterations in general.

In this communique, we provide an exact iterative search algorithm, called ‘‘Exact Policy Search (EPS)’’, for obtaining an optimal feasible policy based on a novel characterization of the entire feasible policy space. Because the problem under study is NP-hard, developing an exact iterative search algorithm is important, e.g., as in Held and Karp's exponential-time exact algorithm (Held & Karp, 1962) based on dynamic programming for the travelling salesman problem. To the author's best knowledge, this is the first convergent exact algorithm that searches over the policy space directly by exploiting the newly established structure of the policy space of CMDPs. It builds upon the ideas of Chang's policy-iteration type algorithms (Chang, 2006, 2012) that converge to locally-optimal feasible policies for a related model. In Chang (2006, 2012), the feasibility of a policy is concerned with all states rather than an average over an initial distribution. EPS generates a sequence of monotonically improving policies as in Chang's approaches, but it converges to an optimal feasible policy in |Π| iterations in the worst case, as in PI's worst-case time-complexity. Unlike PI, an unconstrained MDP needs to be solved at iterations involved with feasible policies, and the current best policy improves upon all feasible policies included in the union of the policy spaces associated with the unconstrained MDPs.

Note that this work can also be considered as a dual to the work by Chen and Feinberg (2007). While they showed that a sequence of finite-horizon value functions (obtained from solving associated MDPs) converges to the infinite-horizon value function in an infinite number of iterations, we show here that a sequence of policies generated by solving MDPs converges to an optimal policy in Π in a finite number of iterations. We stress again that we focus on solving the model of finite CMDPs with non-randomized stationary policy space (P1 in Feinberg (2000)). This is because pure policies are easier to implement than randomized ones and appropriate in many applications (Chen & Feinberg, 2007; Feinberg, 2000) and, in particular, for certain applications randomized policies cannot be used (Chen & Feinberg, 2007) (see also Chen (2004, 2005) for example problems).

Aside from the works that study the problem under consideration here, there exists a body of literature on numerical methods for solving CMDPs with different mixtures of finiteness of state and action sets, constraint and cost criteria, and policy space. Briefly discussing some related works (see also the references therein), Dufour and Prieto-Rumeau (2013) provide a finite LP-approximation method for CMDPs with general state and action sets and randomized policy space, Mendoza-Peréz and Hernández-Lerma (2012) consider a pathwise long-run average reward subject to a constraint on a similar pathwise cost with general state and action sets, and Kadota et al. (2006) consider a Lagrange method for general utility-constrained MDPs with general state and action sets and randomized policy space. Wakuta (1998) studies a vector-valued finite MDP with randomized ‘‘semistationary’’ policy space. Feinberg provides an algorithm (cf. Theorem 8.1 in Feinberg (2012)) for solving finite CMDPs with randomized policy space based on an LP formulation via splitting (randomized) stationary policies.

2. Feasible policy space characterization

Let B(X) be the set of all real-valued functions on X. Given a value function w ∈ B(X), we define the w-inducing feasible action set

A_w(x) = { a ∈ A(x) : D(x, a) + β Σ_{y∈X} P_{xy}^a w(y) ≤ w(x) + (1 − β)(κ_δ − w_δ) },  x ∈ X.
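The set A_w(x) can be computed directly from this definition; a minimal sketch, under the same assumed data layout as the evaluation sketch in Section 1:

```python
# Sketch of A_w(x): keep action a at state x if
#   D(x,a) + beta * sum_y P[x][a][y] * w[y] <= w[x] + (1 - beta) * (kappa_delta - w_delta).
import numpy as np

def feasible_action_set(P, D, beta, delta, kappa, w):
    w = np.asarray(w, dtype=float)
    slack = (1.0 - beta) * (float(np.dot(delta, kappa)) - float(np.dot(delta, w)))
    return [[a for a in range(len(P[x]))
             if D[x][a] + beta * float(np.dot(P[x][a], w)) <= w[x] + slack]
            for x in range(len(P))]
```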

Roughly speaking, in order for a ∈ A(x) to be feasible at x when the total constraint-cost at each state is measured by w, the excess of the total constraint-cost obtained by taking a at x at the first time step needs to be within the slackness induced by w.

Lemma 1. Given any w ∈ B(X), any policy π ∈ Π with π(x) ∈ A_w(x), ∀x ∈ X, is feasible.

Proof. Consider an operator T_φ : B(X) → B(X) for φ ∈ Π given as

T_φ(u)(x) = D(x, φ(x)) + β Σ_{y∈X} P_{xy}^{φ(x)} u(y)

with x ∈ X and u ∈ B(X). Applying T_φ to both sides of a given inequality T_φ(u)(x) ≤ u(x) + τ, ∀x ∈ X, where τ ∈ R, we have that

(T_φ)^2(u)(x) ≤ T_φ(u)(x) + βτ ≤ u(x) + τ(1 + β),  ∀x ∈ X,

from the monotonicity property of the operator. Successive applications and the monotonicity property of the operator yield that, for all x ∈ X, lim_{n→∞} (T_φ)^n(u)(x) ≤ u(x) + lim_{n→∞} τ(1 + β + β^2 + · · · + β^{n−1}). Because iterative application of T_φ on any initial value function converges monotonically to the fixed point J^φ (Puterman, 1994), lim_{n→∞} (T_φ)^n(u)(x) = J^φ(x), ∀x ∈ X, yielding J^φ(x) ≤ u(x) + τ(1 − β)^{−1}, ∀x ∈ X.

Now observe that T_π(w)(x) ≤ w(x) + (1 − β)(κ_δ − w_δ), ∀x ∈ X. Therefore, for all x ∈ X, J^π(x) ≤ w(x) + (κ_δ − w_δ). It follows that

J_δ^π = Σ_{x∈X} δ(x)J^π(x) ≤ Σ_{x∈X} δ(x)w(x) + κ_δ − w_δ = κ_δ. □

The following theorem characterizes the feasibility of a policy.

Theorem 2. A policy π ∈ Π is feasible if and only if π(x) ∈ A_{J^π}(x) for all x ∈ X.

Proof. If π(x) ∈ A_{J^π}(x) for all x ∈ X, then π is feasible by Lemma 1 with w = J^π.


For the other direction, because π is feasible, κ_δ − J_δ^π ≥ 0, so that for any x ∈ X,

D(x, π(x)) + β Σ_{y∈X} P_{xy}^{π(x)} J^π(y) = J^π(x) ≤ J^π(x) + (1 − β)(κ_δ − J_δ^π).

That is, π(x) ∈ A_{J^π}(x), ∀x ∈ X. □

From the result of the above theorem, we can immediately infer that π ∈ Π is infeasible if and only if π(x) ∉ A_{J^π}(x) for some x ∈ X, and that the feasible policy space is characterized as follows:

Corollary 3. Given w ∈ B(X), let Π(w) = {π : π(x) ∈ A_w(x), ∀x ∈ X, π ∈ Π}. Then

Π_f = ∪_{w∈B(X)} Π(w) = ∪_{π∈Π} Π(J^π).

The above corollary describes the feasible policy space in terms of the value-function space and in terms of the policy space, respectively. The first equality is useful in drawing sufficient conditions for the feasibility of the problem. For example, by letting w(x) = 0, ∀x, we have

A_w(x) = { a ∈ A(x) : D(x, a) ≤ (1 − β)κ_δ },  x ∈ X,

and in order to make the problem feasible, it is sufficient to have A_w(x) ≠ ∅ for all x ∈ X. Therefore, if max_{x∈X} min_{a∈A(x)} D(x, a) ≤ (1 − β)κ_δ, then the problem is feasible. The second equality is useful in designing an exact iterative algorithm that works over the policy space directly, and we describe one in the next section.
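Both uses are straightforward to mechanize. The short sketch below (reusing the hypothetical evaluate and feasible_action_set helpers from the earlier sketches, under the same assumptions) checks the sufficient condition above and the Theorem 2 characterization:

```python
# Sufficient condition (w = 0): the problem is feasible if
#   max_x min_a D(x,a) <= (1 - beta) * kappa_delta.
def sufficient_feasibility(D, beta, delta, kappa):
    kappa_delta = sum(d * k for d, k in zip(delta, kappa))
    return max(min(row) for row in D) <= (1.0 - beta) * kappa_delta

# Theorem 2 as a test: pi is feasible iff pi(x) lies in A_{J^pi}(x) for every x.
def feasible_by_theorem2(policy, P, D, beta, delta, kappa):
    J = evaluate(policy, P, D, beta)
    A_J = feasible_action_set(P, D, beta, delta, kappa, J)
    return all(policy[x] in A_J[x] for x in range(len(P)))
```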

3. An exact search algorithm

We begin with a lemma which establishes that, by solving an unconstrained MDP induced from a given w ∈ B(X), we can obtain a best feasible policy among the feasible policies in the ‘‘local’’ policy set induced from w.

Lemma 4. For a given w ∈ B(X), consider an unconstrained MDP M(w) = (X, A_w, P_w, C_w), where P_w and C_w are given such that P_{w,xy}^a := P_{xy}^a and C_w(x, a) := C(x, a) for all x, y ∈ X and all a ∈ A_w(x). Then an optimal policy π^∗_{M(w)} ∈ Π(w) for M(w) is feasible in M and

V_δ^{π^∗_{M(w)}} ≤ min_{π′∈Π(w)} V_δ^{π′}.

Proof. The proof is immediate from Lemma 1. Any π ∈ Π such that π(x) ∈ A_w(x) for all x ∈ X is feasible in M. Therefore, an optimal policy π^∗_{M(w)} in Π(w) for M(w) is feasible for M and V^{π^∗_{M(w)}}(x) ≤ min_{π′∈Π(w)} V^{π′}(x), ∀x ∈ X. □
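Any standard MDP solver applied to the restricted action sets realizes Lemma 4; the sketch below uses policy iteration over A_w(x) (same assumed data layout and helpers as before) and returns None when Π(w) is empty:

```python
# Sketch: policy iteration on the unconstrained MDP M(w) whose action set at x is A_w(x).
import numpy as np

def solve_restricted_mdp(P, C, gamma, A_w, tol=1e-9):
    if any(len(acts) == 0 for acts in A_w):
        return None, None                      # Pi(w) is empty
    n = len(P)
    pi = [acts[0] for acts in A_w]             # arbitrary initial policy in Pi(w)
    while True:
        V = evaluate(pi, P, C, gamma)          # policy evaluation (earlier sketch)
        stable = True
        for x in range(n):
            q = {a: C[x][a] + gamma * float(np.dot(P[x][a], V)) for a in A_w[x]}
            best = min(q, key=q.get)
            if q[best] < q[pi[x]] - tol:       # strict improvement avoids tie cycling
                pi[x], stable = best, False
        if stable:
            return pi, V
```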

We now formally describe EPS below. EPS maintains a counter k, whose value is increased only when a feasible policy is selected from the search space ∆. EPS consists mainly of two steps, applied alternately: MDP evaluation and policy improvement. The MDP evaluation step solves M(J^π) for a selected π ∈ ∆, if feasible, and the policy improvement step takes an optimal policy π^∗_{M(J^π)} for M(J^π) and updates (improves) the current best policy by comparing it with π^∗_{M(J^π)}. The sequence of monotonically improving current best policies converges to an optimal feasible policy for M. EPS starts with the entire policy space as ∆ and iteratively narrows down the search space ∆ by eliminating all of the feasible policies induced from a feasible π, or just π itself if it is infeasible. Even though EPS runs until ∆ becomes the empty set in order to ensure that it converges to an optimal feasible policy for M, it can be stopped at any k, in which case, if Π_f ≠ ∅, then we have that for the current best policy π_∗,

V_δ^{π_∗} ≤ min_{π′ ∈ ∪_{i=0}^{k} Π(J^{π_i})} V_δ^{π′},

obtaining a locally-optimal feasible policy π_∗ with respect to the set ∪_{i=0}^{k} Π(J^{π_i}).

Exact Policy Search (EPS)
1. Initialization: Set ∆ = Π and k = 0.
2. Loop while ∆ ≠ ∅:
   2.1 Select an arbitrary policy π_k ∈ ∆.
   2.2 If Π(J^{π_k}) ≠ ∅ then
       2.2.1 MDP evaluation: Obtain an optimal policy π^∗_{M(J^{π_k})} ∈ Π(J^{π_k}) for M(J^{π_k}).
       2.2.2 Policy improvement: π_∗ = π^∗_{M(J^{π_0})} if k = 0, and π_∗ = π^∗_{M(J^{π_k})} if k ≠ 0 and V_δ^{π^∗_{M(J^{π_k})}} ≤ V_δ^{π_∗}.
       2.2.4 k ← k + 1.
   2.3 ∆ = ∆ \ (Π(J^{π_k}) ∪ {π_k}).

Theorem 5. EPS terminates in |Π| iterations in the worst case and converges to an optimal feasible policy for M. Furthermore, if Π_f ≠ ∅, then for any k ≥ 0, we have that

V_δ^{π_∗} ≤ min_{π′ ∈ ∪_{i=0}^{k} Π(J^{π_i})} V_δ^{π′}.

Proof. From the step 2.2, EPS solves M(J^{π_k}) only when Π(J^{π_k}) ≠ ∅. From Theorem 2, this happens when π_k is a feasible policy. Because π^∗_{M(J^{π_k})} is an optimal policy for M(J^{π_k}), we have from Lemma 4 that V_δ^{π^∗_{M(J^{π_k})}} ≤ min_{π′∈Π(J^{π_k})} V_δ^{π′}. Now, since the current best policy π_∗ is updated at the step 2.2.2, π^∗_{M(J^{π_k})} ∈ Π(J^{π_k}), and k is incremented only when π_k is feasible, it follows from an inductive argument on k that V_δ^{π_∗} ≤ min_{π′∈∪_{i=0}^{k} Π(J^{π_i})} V_δ^{π′}.

At the step 2.3, we subtract Π(J^{π_k}) ∪ {π_k} from ∆. Therefore, all infeasible policies selected up to step k and all of the feasible policies in ∪_{i=0}^{k} Π(J^{π_i}) get eliminated from ∆ at k. It follows that ∆ becomes the empty set in |Π| iterations in the worst case, so that EPS terminates eventually. When EPS terminates at k′, ∪_{i=0}^{k′} Π(J^{π_i}) = Π_f from Corollary 3. It follows that π_∗ is an optimal feasible policy for M. □

The running time-complexity of EPS mainly consists of the time-complexity T_1 of checking the feasibility of the selected policy π_k (checking whether Π(J^{π_k}) ≠ ∅), the time-complexity T_2 of solving the MDP M(J^{π_k}), and the number of iterations N until EPS terminates (when ∆ = ∅). The first part is O(|X|^3) because it is the same as the policy evaluation step of PI, and the second part is also polynomial in |X|, |A_{J^{π_k}}|, and the maximum number of bits required to represent any component of P_{J^{π_k}} and C_{J^{π_k}} if LP is applied. The third part N consists of the number of infeasible policies and the number of iterations N_MDP involved with MDP evaluations. The overall time-complexity of EPS is then O(|Π \ Π_f| T_1 + N_MDP (T_1 + T_2)). If we consider the extreme case where all policies in Π are infeasible, i.e., N_MDP = 0, then EPS needs to enumerate all policies in Π and has an exponential complexity of O(|A|^{|X|} T_1). However, considering the cases where N_MDP plays a dominant role in the complexity, we observe that N_MDP depends on the sizes of the MDPs induced by {π_k}. By updating ∆ to ∆ \ (Π(J^{π_k}) ∪ {π_k}) at the step 2.3, all policies in the policy space of the unconstrained MDP get eliminated from the search space ∆. That is, the size of ∆ is decreased by |Π(J^{π_k})| = O(|A_{J^{π_k}}|^{|X|}), which can be exponentially large. If the size of the underlying MDP is ‘‘big’’, e.g., if |A| − |A_{J^{π_k}}| is small, the search space is greatly narrowed down. Therefore, this process of narrowing down makes EPS non-enumerative over Π; in other words, EPS is not an exhaustive brute-force search. If, for a given CMDP, N_MDP and |Π \ Π_f| are polynomial in |X| and |A|, the overall complexity of EPS will be polynomial. As in PI, it would be very challenging to characterize the running time-complexity of EPS, but this is a good future topic.

Because of updating ∆, EPS requires storing all policies in Π in memory. This is also related to the representation of the policy space. For example, in some cases a policy is parametrized or represented by some (binary) variables, so that with a certain compact representation, policies can be stored compactly. This issue needs to be investigated further for an actual implementation of EPS. One possible way of overcoming this problem is to remove the ∆-updating parts from EPS. Instead, we generate a random or pre-determined sequence {π_k} and solve the sequence of MDPs induced by the feasible policies in {π_k}. This yields a heuristic algorithm that loses the convergence property of EPS but still keeps convergence to a locally-optimal solution. Developing variants of EPS that approximate the exactness of EPS is also a good future topic.
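Before turning to a numerical example, the sketch below assembles the earlier hypothetical helpers into the EPS loop of this section; policies are represented as tuples of action indices, and ∆ is stored explicitly, which reflects the memory issue just discussed. This is an illustrative reading of the listing under the stated assumptions, not the author's implementation.

```python
# Sketch of EPS: steps 2.1-2.3 of the listing above, with Delta stored explicitly.
from itertools import product
import numpy as np

def eps(P, C, D, gamma, beta, delta, kappa):
    n = len(P)
    Delta = set(product(*[range(len(P[x])) for x in range(n)]))   # Delta = Pi
    best_pi, best_val = None, float("inf")
    while Delta:
        pi_k = next(iter(Delta))                                  # 2.1: arbitrary policy
        J = evaluate(list(pi_k), P, D, beta)                      # constraint cost J^{pi_k}
        A_J = feasible_action_set(P, D, beta, delta, kappa, J)
        if all(A_J[x] for x in range(n)):                         # 2.2: Pi(J^{pi_k}) nonempty
            cand, V = solve_restricted_mdp(P, C, gamma, A_J)      # 2.2.1: MDP evaluation
            val = float(np.dot(delta, V))
            if val <= best_val:                                   # 2.2.2: policy improvement
                best_pi, best_val = tuple(cand), val
            Delta -= set(product(*A_J))                           # 2.3: remove Pi(J^{pi_k})
        Delta.discard(pi_k)                                       # 2.3: remove pi_k
    return best_pi, best_val
```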

We end this section with a toy example to illustrate how EPS works. The state set is X = {1, 2}, the action sets are A(1) = {a11, a12} and A(2) = {a21, a22}, and P is given such that P_{11}^{a11} = P_{12}^{a11} = 0.5, P_{12}^{a12} = 1, P_{22}^{a21} = 1, and P_{21}^{a22} = P_{22}^{a22} = 0.5. The cost function C and the constraint cost D are given as follows: C(1, a11) = D(1, a11) = 5, C(1, a12) = D(1, a12) = 10, C(2, a21) = D(2, a21) = −1, and C(2, a22) = 4, D(2, a22) = 1. We set γ = 0.99, β = 0.95, and the initial state distribution δ = (1/3, 2/3). In what follows, we use the ordered 2-tuple notation (x, y), where x is associated with state 1 and y with state 2, whenever applicable. The function κ = (κ(x), κ(y)) = (0, 0), so that κ_δ = 0. We start with an arbitrarily chosen π_0 = (π_0(1), π_0(2)) = (a11, a22) ∈ ∆, where ∆ is initialized as

Π = {(a11, a21), (a11, a22), (a12, a21), (a12, a22)}.

To check whether Π(J^{π_0}) ≠ ∅, we evaluate π_0 by solving J^{π_0}(1) = 5 + 0.475 J^{π_0}(1) + 0.475 J^{π_0}(2) and J^{π_0}(2) = 1 + 0.475 J^{π_0}(1) + 0.475 J^{π_0}(2), which yields J^{π_0} = (62, 58). Because J_δ^{π_0} > 0, π_0 is infeasible, and by updating ∆, ∆ becomes {(a11, a21), (a12, a21), (a12, a22)}.

At the next iteration, selecting π_1 = (a12, a21) ∈ ∆, we obtain J^{π_1} = (−9, −20), so that Π(J^{π_1}) ≠ ∅ because J_δ^{π_1} = −49/3 ≤ 0. We now obtain the J^{π_1}-inducing feasible action set: A_{J^{π_1}}(1) = {a | D(1, a) + 0.95 Σ_y P_{1y}^a J^{π_1}(y) ≤ J^{π_1}(1) + (1 − 0.95)(−J_δ^{π_1})}. Checking a11, because 5 + 0.95((0.5 × −9) + (0.5 × −20)) ≤ −9 + 0.05 × (−(−49/3)), we have a11 ∈ A_{J^{π_1}}(1). For a12, we have 10 + 0.95(−20) ≤ −9 + 0.05 × 49/3, so that a12 ∈ A_{J^{π_1}}(1). Similarly, we find that A_{J^{π_1}}(2) = {a21}. Therefore, we solve the unconstrained MDP M(J^{π_1}) with Π(J^{π_1}) = {(a11, a21), (a12, a21)}. A simple application of policy iteration yields an optimal policy π^∗_{M(J^{π_1})} = (a12, a21). We set π_∗ = (a12, a21). Updating ∆, ∆ becomes {(a12, a22)} from ∆ \ ({(a11, a21), (a12, a21)} ∪ {(a12, a21)}).

At the next iteration, we then choose π_2 = (a12, a22) and obtain J^{π_2} = (84.07, 77.97), finding that π_2 is infeasible. At this point, ∆ becomes the empty set and EPS terminates, having the policy π_∗ = (a12, a21) as the solution.
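Under the encoding assumed by the sketches (state 1 ↦ index 0, state 2 ↦ index 1; a11, a12 ↦ 0, 1 and a21, a22 ↦ 0, 1), the toy example can be reproduced as follows; the printed values match the walkthrough above.

```python
# Toy example data in the sketches' assumed layout.
P = [[[0.5, 0.5], [0.0, 1.0]],     # state 1: actions a11, a12
     [[0.0, 1.0], [0.5, 0.5]]]     # state 2: actions a21, a22
C = [[5.0, 10.0], [-1.0, 4.0]]
D = [[5.0, 10.0], [-1.0, 1.0]]
delta, kappa = [1/3, 2/3], [0.0, 0.0]

print(evaluate([0, 1], P, D, 0.95))            # J^{pi_0}, pi_0 = (a11, a22): [62, 58]
print(evaluate([1, 0], P, D, 0.95))            # J^{pi_1}, pi_1 = (a12, a21): [-9, -20]
print(eps(P, C, D, 0.99, 0.95, delta, kappa))  # ((1, 0), -96.33...), i.e. pi_* = (a12, a21)
```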

4. Concluding remark

Even though convergence is not guaranteed, we can consider a value-iteration type algorithm based on the result of Lemma 4. Given an arbitrary sequence of value functions {v_k}, we alternately solve M(v_k), if defined, and update the current best policy with an optimal feasible policy for M(v_k). Note that this algorithm also generates a sequence of monotonically improving feasible policies and converges to a locally-optimal feasible policy. Unfortunately, setting {v_k} properly would be difficult in general.

The result of Lemma 1 is also useful in designing a heuristic policy for solving M. By selecting any v ∈ B(X), v induces a set of feasible policies, i.e., Π(v). By choosing some heuristic policies in Π(v), we can obtain a feasible policy that improves all of the selected policies for M (cf. Chang, Hu, Fu, and Marcus (2013, Chapter 5)).

References

Altman, E. (1998). Constrained Markov decision processes. Chapman & Hall/CRC.
Altman, E. (1998). Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical Methods of Operations Research, 48, 387–417.
Beutler, F. J., & Ross, K. W. (1985). Optimal policies for controlled Markov chains with a constraint. Journal of Mathematical Analysis and Applications, 112, 236–252.
Blondel, V. D., & Tsitsiklis, J. N. (2000). A survey of computational complexity results in systems and control. Automatica, 36, 1249–1274.
Chang, H. S. (2006). A policy improvement method in constrained stochastic dynamic programming. IEEE Transactions on Automatic Control, 51, 1523–1526.
Chang, H. S. (2012). A policy iteration heuristic for constrained discounted controlled Markov chains. Optimization Letters, 6, 1573–1577.
Chang, H. S., Hu, J., Fu, M. C., & Marcus, S. I. (2013). Simulation-based algorithms for Markov decision processes (2nd ed.). London: Springer.
Chen, R. C. (2004). Constrained stochastic control and optimal search. In Proc. of the 43rd IEEE conf. on decision and control: Vol. 3 (pp. 3013–3020).
Chen, R. C. (2005). Constrained Markov processes and optimal truncated sequential detection. In Defense applications of signal processing proceedings (pp. 28–31). Utah.
Chen, R. C., & Blankenship, G. L. (2004). Dynamic programming equations for discounted constrained stochastic control. IEEE Transactions on Automatic Control, 49, 699–709.
Chen, R. C., & Feinberg, E. A. (2007). Non-randomized policies for constrained Markov decision processes. Mathematical Methods of Operations Research, 66, 165–179.
Dufour, F., & Prieto-Rumeau, T. (2013). Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization, 51, 1298–1324.
Feinberg, E. A. (2000). Constrained discounted Markov decision processes and Hamiltonian cycles. Mathematics of Operations Research, 25, 130–140.
Feinberg, E. A. (2012). Splitting randomized stationary policies in total-reward Markov decision processes. Mathematics of Operations Research, 37, 129–153.
Floudas, C. A., & Pardalos, P. M. (Eds.) (2009). Encyclopedia of optimization. Springer.
Held, M., & Karp, R. M. (1962). A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics, 10, 196–210.
Hirayama, K., & Kawai, H. (2000). A solving method of an MDP with a constraint by genetic algorithms. Mathematical and Computer Modelling, 31, 165–173.
Kadota, Y., Kurano, M., & Yasuda, M. (2006). Discounted Markov decision processes with utility constraints. Computers and Mathematics with Applications, 51, 279–284.
Kallenberg, L. C. M. (1980). Linear programming and finite Markovian control problems. Math Centre Tracts: Vol. 148. Amsterdam: Mathematisch Centrum.
Mendoza-Peréz, A. F., & Hernández-Lerma, O. (2012). Deterministic optimal policies for Markov control processes with pathwise constraints. Applicationes Mathematicae, 39, 185–209.
Piunovskiy, A. B., & Mao, X. (2000). Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27, 119–126.
Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. New York: Wiley.
Wakuta, K. (1998). Discounted cost Markov decision processes with a constraint. Probability in the Engineering and Informational Sciences, 12, 177–187.