Stochastic approximations of constrained discounted Markov decision processes

Stochastic approximations of constrained discounted Markov decision processes

J. Math. Anal. Appl. 413 (2014) 856–879 Contents lists available at ScienceDirect Journal of Mathematical Analysis and Applications www.elsevier.com...

477KB Sizes 3 Downloads 119 Views

J. Math. Anal. Appl. 413 (2014) 856–879

Contents lists available at ScienceDirect

Journal of Mathematical Analysis and Applications www.elsevier.com/locate/jmaa

Stochastic approximations of constrained discounted Markov decision processes ✩ François Dufour a , Tomás Prieto-Rumeau b,∗ a b

Université Bordeaux I, INRIA Bordeaux Sud Ouest, France Statistics Department, UNED, Madrid, Spain

a r t i c l e

i n f o

Article history: Received 7 June 2013 Available online 16 December 2013 Submitted by A. Sulem Keywords: Constrained Markov decision processes Linear programming approach to control problems Approximation of Markov decision processes

a b s t r a c t We consider a discrete-time constrained Markov decision process under the discounted cost optimality criterion. The state and action spaces are assumed to be Borel spaces, while the cost and constraint functions might be unbounded. We are interested in approximating numerically the optimal discounted constrained cost. To this end, we suppose that the transition kernel of the Markov decision process is absolutely continuous with respect to some probability measure μ. Then, by solving the linear programming formulation of a constrained control problem related to the empirical probability measure μn of μ, we obtain the corresponding approximation of the optimal constrained cost. We derive a concentration inequality which gives bounds on the probability that the estimation error is larger than some given constant. This bound is shown to decrease exponentially in n. Our theoretical results are illustrated with a numerical application based on a stochastic version of the Beverton–Holt population model. © 2013 Elsevier Inc. All rights reserved.

1. Introduction The objective of this paper is to study approximation methods for solving numerically discrete-time Markov decision processes (MDPs for short) with constraints, with general state and action spaces, under the discounted cost optimality criterion. Our approach combines the linear programming (LP) formulation of an MDP with concentration inequalities characterizing the non-asymptotic deviation (in the 1-Wasserstein metric) between a probability measure underlying the transition kernel of the MDP and its associated empirical probability measure. MDPs are a general family of controlled stochastic processes suitable for the modeling of sequential decision-making problems under uncertainty. These models appear in many applications, such as engineering, computer science, telecommunications, and finance, among others. A significant list of references on discrete-time MDPs may be found in the survey [3] and the books [2,4,13,16,17,23,25]. The analysis of MDPs leads to various mathematical and computational problems. On the theoretical level, this field of research has reached a rather high degree of maturity. However, it is important to emphasize that the classical approaches, e.g., the dynamic programming and linear programming techniques, and their various extensions, such as the value iteration (VIA) and the policy iteration (PIA) algorithms, are generally hardly applicable in practice. Hence, solving numerically MDPs is still an awkward and important challenge. This has given rise to the development of several approaches intended to approximate the optimal value function and  -optimal policies of MDPs. Roughly, these approaches can be classified into two different families.



*

This research was supported by grant MTM2012-31393 from the Spanish Ministerio de Economía y Competitividad. Corresponding author. E-mail addresses: [email protected] (F. Dufour), [email protected] (T. Prieto-Rumeau).

0022-247X/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jmaa.2013.12.016

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

857

(i) Numerical approximation techniques. These approaches provide a deterministic estimate Vˆ of the optimal value function V ∗ with some given approximation error, that is, | Vˆ − V ∗ | < ε for some given precision ε > 0. The main idea in this approach relies on a discretization (or state-aggregation) procedure transforming an MDP with general state and action spaces, and possibly unbounded cost function, into a simpler model (with finite state and action spaces, and, consequently, bounded cost function). Such techniques can be found in, e.g., [10,15,18–21,28]. (ii) Probabilistic approximation techniques. These methods involve approximating the optimal value function V ∗ by a stochastic estimate V (i.e., a random variable). The accuracy is then measured by means of a concentration inequality, that is, for a given precision ε > 0 we have

   P V − V ∗ > ε < δ for some small δ > 0. This approach mainly deals with MDPs with discrete state and action spaces, and it is based on simulating sample paths of the controlled process. There exists a huge literature related to these techniques, e.g., reinforcement learning, neuro-dynamic programming, approximate dynamic programs, and simulation-based techniques, to name just a few. Without attempting to present an exhaustive panorama, the interested reader can consult the survey [27], the books [5,8,22,26], and references therein to get a rather complete view of this research field. Clearly, the techniques in (i) and (ii) are of a different nature and, so, hard to compare, mainly because the approximation is deterministic in (i) and stochastic in (ii). In spite of this, the probabilistic techniques in (ii) are usually more efficient – in computational terms – than the numerical methods in (i), and they even succeed to break the curse of dimensionality [24]. One important and interesting feature of the probabilistic approximation techniques is that the convergence results are insensitive to the dimension of the state and action spaces. However, such approaches are usually restricted to MDPs with discrete (finite or countable) state and/or action space, and/or bounded cost function, thus showing the limited range of their applicability. Moreover, a common feature of these methods is that they are based on the dynamic programming algorithms, such as the VIA and the PIA, and, in particular, they deal with unconstrained MDPs, whereas it is important to stress that constrained MDPs arise in a natural way in many important applications; see, e.g., [11,29]. The reason why the above approaches do not deal with constrained MDPs seems to be that constrained MDPs cannot be solved by using the dynamic programming algorithms. Instead, constrained MDPs are especially well suited to the LP approach, while the above referenced works deal exclusively with the VIA and the PIA, and their variations. To the best of our knowledge, Ref. [10] seems to be the first attempt to deal with the numerical approximation of a general constrained MDP. Our approach in this paper is concerned with the stochastic approximation methods described in (ii) above. Regarding our main hypotheses, it is assumed that the control model is Lipschitz continuous (i.e., its elements, such as the transition kernel, the action sets multifunction, the cost function, etc., are Lipschitz continuous; see [19]), and we suppose that the transition kernel P (dy |x, a) is absolutely continuous with respect to some probability measure μ(dy ), with density function p ( y |x, a). Our objective is to derive a stochastic estimate of the discount optimal constrained value function. To do this, our method replaces the various elements of the MDP with suitable approximations. For instance, the state–action pairs K of the MDP are replaced with a smaller set K , where the “distance” from K to K is measured in terms of the Hausdorff distance; the density p ( y |x, a) of the Markov kernel is replaced with a function p  ( y |x, a) satisfying suitable bounds; finally, the reference probability measure μ is approximated by the empirical probability measure μn , obtained from a sample of size n. 
Using this approach, the infinite-dimensional LP formulation of the constrained MDP is approximated by a finite-state, possibly infinite-action (hence, still infinite-dimensional) LP problem which is related to μn , the empirical probability measure associated to μ, and labeled LP(μn ) (for a precise definition of LP(μn ) see (3.2) in Section 3). Then, using convexity arguments, we show that LP(μn ) reduces to a finite-dimensional nonlinear optimization problem (that can be easily solved numerically by using the usual nonlinear optimization tools) or even, under suitable conditions, LP(μn ) is a standard finite-dimensional LP problem that can be explicitly solved numerically. By the way, let us also mention that our approach can be used as well for unconstrained MDPs. Among the main features of our approach, we note that the accuracy of the approximation is evaluated in terms of a concentration inequality measuring the non-asymptotic deviation between the optimal constrained value V ∗ of the MDP and its approximation In . More specifically, we show that for any γ > 0 there exist constants S , T > 0 such that for any n∈N

   P In − V ∗  > γ  S e −T n (a precise statement of this concentration inequality is given in Section 4). Concerning our main contributions, let us mention that our framework (a constrained MDP with general state and action spaces with possibly unbounded cost function) is clearly more general than most of the settings studied in the literature, with the exception of [10]. Indeed, Ref. [10] seems to be the only attempt to deal with constrained MDPs but the technique used therein is different from the present approach. More precisely, [10] provides a different type of convergence because the approximation is deterministic while, in the present work, our approximation results are characterized by a concentration inequality and are, obviously, of stochastic nature. Moreover, the results in [10] are restricted to Euclidean state spaces (e.g., Rd ), and they cannot be extended to more

858

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

general state spaces, such as Polish spaces, as in the present work. This is because the approach developed in [10] uses a quantized approximation to the reference measure μ. The rest of the paper is organized as follows. In Section 2 we introduce our constrained control model, state our main assumptions, and recall some results on the dynamic and linear programming approaches to constrained and unconstrained control problems. Section 3 establishes several useful preliminary facts, used in Section 4 to prove our main results. In Section 5 we discuss the explicit computation of the value In of our approximation (by solving a finite-dimensional either linear or nonlinear optimization problem). Finally, we illustrate our results in Section 6 by solving numerically a fisheries problem, based on a stochastic version of the Beverton–Holt population model. 2. Model, assumptions, and definitions We introduce the discrete-time Markov control model we are concerned with in Section 2.1, and state our main assumptions in Section 2.2. The following basic notation will be used in the paper. Given x and y in the Euclidean space Rn , let x, y  be the usual inner product of x and y. By |x| = x, x1/2 we will denote the norm of x ∈ Rn . Let 0 and 1 be the elements of Rn with all components equal to zero and one, respectively. If θ1 and θ2 are in Rn , we shall write θ1  θ2 (respectively, θ1 > θ2 ) when all the components of θ1 are greater than or equal to (respectively, strictly greater than) the corresponding components of θ2 . We recall that E is said to be a Borel space if it is a Borel subset of a complete and separable metric space. Its Borel σ -algebra will be denoted by B( E ). If γ is a measure on ( E , B( E )) and v : E → Rq is a measurable mapping, then the (componentwise) integral of v with respect to γ will be denoted by γ ( v ) := E v dγ ∈ Rq , provided that it is well defined and finite. The Dirac probability measure concentrated at x ∈ E will be denoted by δx ; that is, for B ∈ B ( E ) we have δx ( B ) = 1 B (x), where 1 B denotes the indicator function. If E is a Borel space and w : E → [1, +∞) is a measurable function, the class of measurable functions f : E → R such that f w := supx∈ E {| f (x)|/ w (x)} < ∞ is a Banach space denoted by B w ( E ), and · w is the associated norm, called the w-norm. The family of real-valued measurable functions on E with finite w-norm which, in addition, are continuous (respectively, Lipschitz continuous) is denoted by C w ( E ) (respectively, L w ( E )). We denote by M w ( E ) the family of measures ν on E such that ν ( w ) is finite. For continuous w, the w-weak topology on M w ( E ) is the coarsest topology for which all the real-valued mappings defined on M w ( E ) by η → η( f ) for f ∈ C w ( E ) are continuous.  Let M1 ( E ) be the family of probability measures λ on E with finite first moment, that is, E d(x, x0 )λ(dx) < ∞ for some x0 ∈ E, with d the metric in E. The 1-Wasserstein metric on M1 ( E ) [6, p. 234] is defined as



 



W 1 λ, λ = sup f ∈L



f dλ −

X

 

f dλ 

for λ, λ ∈ M1 ( E ),

(2.1)

X

where L is the family of 1-Lipschitz continuous functions f : E → R. Finally, we let R+ = [0, ∞) and N = {0, 1, 2, . . .}. 2.1. The control model We follow closely notation of Chapter 2 in [16]. Let us consider the five-tuple for a Markov control model M :=

( X , A , { A (x): x ∈ X }, P , (c , r )) consisting of:

(a) A Borel space X (with metric d X ), which is the state space. (b) A Borel space A (with metric d A ), representing the control or action set. For the family of closed subsets of A we will consider the Hausdorff metric









d H (C 1 , C 2 ) := sup inf d A (a, b) ∨ sup inf d A (a, b) a∈C 1 b ∈C 2

b ∈C 2 a∈C 1

for closed C 1 , C 2 ⊆ A (here, ∨ stands for “maximum”). (c) A family { A (x): x ∈ X } of nonempty measurable subsets of A, where A (x) is the set of feasible controls or actions when the system is in state x ∈ X . We suppose that K := {(x, a) ∈ X × A: a ∈ A (x)} is a measurable subset of X × A. (d) A stochastic kernel P on X given K, which stands for the transition probability function. Given a measurable function v : X → R, we define P v : K → R as P v (x, a) := X v ( y ) P (dy |x, a), provided that the corresponding integrals are well defined and finite. (e) Finally, c : K → R is a measurable function representing the cost-per-stage, and r : K → Rq is a measurable function defining the q constraints. At this point, let us mention that sometimes it will be useful to consider a generic measurable cost function u : K → R, instead of c. Next, we define different classes of control policies.

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

859

Definition 2.1. The set of stochastic kernels ϕ on A given X such that ϕ ( A (x)|x) = 1 for all x ∈ X is denoted by Φ . Also, F stands for the family of measurable functions f : X → A satisfying that f (x) ∈ A (x) for all x ∈ X . It is assumed that the set F is nonempty. Let H 0 := X and H n := K × H n−1 for n  1 be the history of the MDP up to time n. Definition 2.2. A control policy is a sequence π = {πn }n∈N of stochastic kernels πn on A given H n satisfying πn ( A (xn )|hn ) = 1 for all hn ∈ H n and n ∈ N, where hn := (x0 , a0 , . . . , xn−1 , an−1 , xn ). Let Π be the class of all policies. A policy π is said to be randomized stationary if there exists ϕ ∈ Φ such that πn (·|hn ) = ϕ (·|xn ) for all n ∈ N and h ∈ H n . The class of randomized stationary policies is identified with Φ . Finally, we identify each f ∈ F with ϕ ∈ Φ such ϕ (·|x) is the Dirac measure concentrated at f (x) for all x ∈ X . We say that f ∈ F is a deterministic stationary policy. Let (( X × A )∞ , B (( X × A )∞ )) be the canonical space consisting of the set of sample paths ( X × A )∞ = {(xt , at )}t ∈N and the associated product σ -algebra. Here, {xt }t ∈N is the state process and {at }t ∈N stands for the action process. For every policy π ∈ Π and any initial distribution ν on X , there exists a unique probability measure P νπ on (( X × A )∞ , B (( X × A )∞ )) such that, for any B ∈ B ( X ), C ∈ B ( A ), and ht ∈ H t with t ∈ N,

P νπ (x0 ∈ B ) = ν ( B ),

P νπ (at ∈ C |ht ) = πt (C |ht ),

and

P νπ (xt +1 ∈ B |ht , at ) = P ( B |xt , at );

see, e.g., [16, Chapter 2] for such a construction. The expectation with respect to P νπ is denoted by E νπ . If ν = δx for some π π x ∈ X , then we will respectively write P xπ and E π x in lieu of P ν and E ν . Let 0 < α < 1 be a given discount factor. Given an integer n  1, a measurable function u : K → Rn , an initial state x ∈ X , and a policy π ∈ Π , we will write



V (x, π ) := E π x u





t

α u (xt , at )

t =0

(here, the expectation is taken componentwise). If the cost function is given by u : K → R, then the optimal value function of the unconstrained control problem is

V ∗u (x) := inf V u (x, π ) π ∈Π

for x ∈ X .

A policy π ∗ ∈ Π is said to be discount optimal if V u (x, π ∗ ) = V ∗u (x) for every x ∈ X . Next, we define our constrained Markov control problem. Definition 2.3. Given an initial state z0 ∈ X and a constraint constant θ0 ∈ Rq , the constrained control problem CP is to minimize V c ( z0 , π ) within the class of policies π ∈ Π such that V r ( z0 , π )  θ0 (a policy π ∈ Π such that V r ( z0 , π )  θ0 is called a constrained policy). We let

  V ( z0 ) := inf V c ( z0 , π ): π ∈ Π and V r ( z0 , π )  θ0 , and we say that

π ∗ ∈ Π is a constrained optimal policy if V r (z0 , π ∗ )  θ0 and V c (z0 , π ∗ ) = V (z0 ).

Observe that, in the notation CP, we do not make explicit the dependence on the initial state z0 ∈ X and the constraint constant θ0 ∈ Rq , which, in what follows, will remain fixed. For notational convenience, we define the function rα on K with values in Rq as rα = r − (1 − α )θ0 . 2.2. Assumptions Next we state our assumptions on the parameters of the MDP. Moreover, in the context of these hypotheses, at the end of this section we recall two known results: one is concerned with bounds in the 1-Wasserstein distance for empirical measures (see Theorem 2.5), and the other one recalls some basic results on constrained and unconstrained MDPs (see Theorem 2.6). Assumption A. (A1) The state space X is a Polish space. The action set A is a locally compact Polish space. For each x ∈ X , the action set A (x) is compact. Moreover, the multifunction Ψ from X to A defined by Ψ (x) := A (x) is L Ψ -Lipschitz continuous with respect to the Hausdorff metric. That is, for some constant L Ψ > 0 and every x, y ∈ X ,





d H A (x), A ( y )  L Ψ d X (x, y ). By [9, Lemma 2.6], it follows that the multifunction Ψ : X → A is continuous.

860

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

(A2) There exists an L w -Lipschitz continuous function w : X → [1, ∞) such that the function P w is continuous on K and, in addition, there exists d > 0 with

P w (x, a)  dw (x),

(2.2)

for all (x, a) ∈ K. Let w := supx∈ X {1/ w (x)}. (A3) The stochastic kernel P is weakly continuous, meaning that P v is continuous on K for every bounded and continuous function v on X . There exists a constant L P > 0 such that, for every (x, a) and ( y , b) in K, and v ∈ L w ( X ) with Lipschitz constant L v ,

 

 P v (x, a) − P v ( y , b)  L P L v d X (x, y ) + d A (a, b) .

The kernel P is said to be L P -Lipschitz continuous. (A4) The functions c and r are L c - and L r -Lipschitz continuous on K, respectively. Moreover, there exists a positive constant c such that, for all x ∈ X ,









sup c (x, a) + sup r (x, a)  c w (x).

a∈ A (x)

a∈ A (x)

That is, we have c ∈ L w (K) (when w is seen as a function w (x, a) := w (x) on K). Similarly, r is in L w (K) componentwise. (A5) The discount factor 0 < α < 1 verifies αd < 1 and α L P (1 + L Ψ ) < 1. Given d > 0, we assume that there exists a family A d (x), for x ∈ X , of nonempty measurable subsets of A satisfying the following hypotheses. Assumption B. (B1) For every x ∈ X , A d (x) is a nonempty closed subset of A (x). The multifunction Ψd from X to A defined by Ψd (x) := A d (x) is upper semi-continuous. Moreover, Kd = {(x, a) ∈ X × A: a ∈ A d (x)} is a measurable subset of X × A, and it contains the graph of a measurable function from X to A. (B2) For every x ∈ X ,





d H A (x), A d (x)  d w (x).

(2.3)

The introduction of this family of sets can be seen as an additional degree of flexibility of our approach and it will be justified in Section 5 when dealing with the numerical aspects of our method. Note that one may choose A d (x) = A (x) for x ∈ X and d > 0, although for our computational purposes it will be more interesting to construct finite sets A d (x) satisfying the above hypotheses. Assumption C. (C1) There exist a probability measure



P ( B |x, a) =

μ on ( X , B( X )), and a measurable function p : X × K → R+ such that

p ( y |x, a)μ(dy )

for all B ∈ B ( X ) and (x, a) ∈ K.

B

Moreover, μ( w ) < +∞. (C2) There exists a continuous mapping M : X → [1, ∞) and a constant D > 0 such that

P M w (x, a)  D M (x) w (x) with

for all (x, a) ∈ K,

(2.4)

α D < 1, where the function w is as in Assumption (A2). Let q : X × K → R+ be such that p ( y |x, a) = q( y |x, a) M (x)

for (x, a) ∈ K and y ∈ X .

We assume that q( y |·, ·) is continuous on K for any y ∈ X and that there exist positive constants d, L q , q, and D satisfying, for any (x, a) ∈ K and y , z ∈ X ,

q( y |x, a)  qw (x),

  q( y |x, a) − q( z|x, a)  L q w (x)d( y , z),    w ( y )q( y |x, a) − w ( z)q( z|x, a)  dw (x)d( y , z),    M ( y ) w ( y )q( y |x, a) − M ( z) w ( z)q( z|x, a)  D M (x) w (x)d( y , z).

(2.5) (2.6) (2.7) (2.8)

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

861

(C3) There exists some x0 ∈ X and a > 0 such that







exp ad X (x, x0 )

μ(dx) < ∞

X

(in particular,

μ ∈ M1 ( X )).

Assume that there is some probability space (Ω, F , P) and a family {Y n }n1 of i.i.d. random variables taking values in X with distribution μ. For each n  1, the M1 ( X )-valued mapping μn defined on Ω by

1

n

μn (dy ) =

n

δY k (dy )

(2.9)

k =1

is called the empirical probability measure. It is a random variable since the M1 ( X )-valued mapping defined on X n by  (x1 , . . . , xn ) → n1 ni=1 δxi is continuous. As a consequence, the 1-Wasserstein distance W 1 (μn , μ) is a real-valued random variable. Assumption D. (D1) There exists a measurable function w  : X → R+ such that w  / w is a moment on X . This means that there exists an increasing sequence { H k } of compact subsets of X such that H k ↑ X and limn infx∈/ H k w  (x)/ w (x) = ∞. (D2) For some constant d > 0 such that αd < 1 we have P w  (x, a)  d w  (x) for all (x, a) ∈ K. (D3) There exists a constant d > 0 such that for all (x, a) ∈ K and y , z ∈ X

    w ( y )q( y |x, a) − w  ( z)q( z|x, a)  d w  (x)d( y , z). We introduce a family of kernels { P  ,n } for  > 0 and n  1, depending on a parameter associated to μ. The idea is to approximate the kernel P by P  ,n . Definition 2.4. Consider for any

 P  ,n ( B |x, a) =

 and the empirical measure μn

 > 0 and n  1 the kernel P  ,n on X given K

p  ( y |x, a)μn (dy ) for B ∈ B ( X ) B

with

p  ( y |x, a) = M  (x)q( y |x, a)

and

M  (x) =

M (x) 1 +  M (x)

,

for (x, a) ∈ K and y ∈ X . Observe that P  ,n is not necessarily a stochastic kernel and that P  ,n is a random operator, meaning that it depends on

ω ∈ Ω since μn is a random measure.

We will use the following notation throughout the paper. Given a measure ν on a measurable subset K of X × A, its projection on X will be denoted by νˆ . That is, νˆ is the measure on X defined as νˆ ( B ) = ν (K ∩ ( B × A )) for B ∈ B ( X ). We conclude this section by recalling two known results of the literature. Theorem 2.5. Suppose that the probability measure μ satisfies Assumption (C3). There exists some γ0 such that, given any 0 < γ < γ0 , there are constants C , H > 0 with

  P W 1 (μn , μ)  γ  C exp{−Hn} for all n  1. For this result, the reader is referred to Corollary 2.5 and Theorem A.5 in [7]. Now, we present some basic results on constrained and unconstrained MDPs. Consider an arbitrary cost function u ∈ L w (K). For v ∈ B w ( X ) let us define the operator T u as



T u v (x) := inf

a∈ A (x)



u (x, a) + α



v ( y ) P (dy |x, a)

for x ∈ X .

(2.10)

X

We have that T

u

is the usual dynamic programming operator for a discounted cost MDP with cost function given by u.

862

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

Theorem 2.6. Suppose that Assumption A is satisfied and consider an arbitrary cost function u ∈ L w (K). (i) For every v , v  ∈ B w ( X ) we have

  u    T v (x) − T u v  (x)  αd v − v   · w (x) for all x ∈ X ; w that is, T u is a contraction operator on B w ( X ) with modulus αd < 1. (ii) The optimal (unconstrained) discounted cost function V ∗u is in L w ( X ). Moreover, V ∗u is the unique solution in B w ( X ) of the fixed point equation T u v = v. We have

 u  V   u w ∗ w 1 − αd

and

L V ∗u 

L u (1 + L Ψ ) 1 − α L P (1 + L Ψ )

.

There exists a policy f ∈ F that is discount optimal for every initial state x ∈ X . (iii) If, in addition, Assumption (C2) holds, then T u is a contraction operator on B M w ( X ) with modulus α D < 1, and V ∗u is also the unique solution of the fixed point equation T u v = v in B M w ( X ). (iv) (Linear programming formulation.) Given an initial state z0 ∈ X and a policy π ∈ Π , the state–action occupation measure of the policy π is the measure ηπ ∈ M w (K) defined as

ηπ ( C ) =





αt P zπ0 (xt , at ) ∈ C



for C ∈ B (K).

t =0

We have that

P := {ηπ }π ∈Π = {ηϕ }ϕ ∈Φ    = η ∈ M w (K): ηˆ ( B ) = δz0 ( B ) + α P ( B |x, a)η(dx, da), ∀ B ∈ B( X ) . K

In addition, for the constrained control problem CP,

  V ( z0 ) = inf η(c ): η ∈ P and η(rα )  0 . (v) Suppose, moreover, that Assumptions (D1)–(D2) hold. Assume also the existence of a policy π ∈ Π such that V r ( z0 , π ) < θ0 . q Then there exists a stationary constrained optimal policy for CP and there exists λ∗ ∈ R+ such that c +λ∗ ,rα 

V ( z0 ) = V ∗

( z0 ).

If Assumption C is satisfied, then we can write



 

P = η ∈ M w (K): ηˆ ( B ) = δz0 ( B ) + α

 p ( y |x, a)μ(dy )η(dx, da), ∀ B ∈ B ( X ) .

(2.11)

K B

It follows that solving the constrained control problem CP is equivalent to solving the linear programming problem LP defined as

LP: minimize η(c ) subject to η ∈ P and η(rα )  0.

(2.12)

Theorem 2.6(v) above ensures that LP is solvable and V ( z0 ) = min LP. 3. Preliminary results In this section we derive important preliminary properties that will be used in Section 4 to obtain the main results of our paper. As seen at the end of the previous section, solving the constrained control problem CP reduces to solving the linear programming problem LP on the space of occupation measures. First of all, we will introduce a family of LP problems, labeled LP(ν ), that are approximations of LP. In the sequel, we will suppose that Assumptions A–D are satisfied. The constant  > 0 is fixed throughout. For every probability measure ν ∈ M1 ( X ) we define

      O (ν ) = η ∈ M w Kd : ηˆ ( B ) = δz0 ( B ) + α p  ( y |x, a)ν (dy )η(dx, da), ∀ B ∈ B ( X ) Kd B

(3.1)

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

863

and

    C (ν ) = O (ν ) ∩ η ∈ M w Kd : η(rα )  0 . The linear programming problem LP(ν ) is defined as

LP(ν ): minimize η(c ) subject to η ∈ C (ν ).

(3.2)

The definition of the set O (ν ) is inspired from P in (2.11), where we have introduced the following “approximations”: the set K is replaced with Kd , the density function p ( y |x, a) is replaced with p  ( y |x, a), while the probability measure μ is replaced with a generic ν ∈ M1 ( X ). Then, LP(ν ) is formulated as an analogue of LP in (2.12). At this point, we do not know whether LP(ν ) is feasible or not, solvable or not, etc. Precisely, the first part of this section is devoted to study the set O (ν ) for ν in M1 ( X ). To do this, we will introduce the sets M1 ( X ) ⊆ M1 ( X ) and M w (Kd ) ⊆ M w (Kd ), defined by some bounding conditions, and we will study their topological properties (see Lemma 3.1). When ν ∈ M1 ( X ), the set O (ν ) has “nice” properties; namely, O (ν ) ⊆ M w (Kd ) is nonempty and can be seen as the set of occupation measures associated to a transition kernel generated by ν (see Lemma 3.2) and, from the topological point of view, it is a compact set (see Lemma 3.4). Finally, the solvability of a linear program related to LP(ν ), for ν ∈ M1 ( X ), is analyzed in Lemma 3.5. In the second part of this section, it is shown how the difference of the integral of a Lipschitz continuous function with respect to the kernels P and P  ,n can be controlled in terms of the parameter  and W 1 (μ, μn ), the distance between μ and its empirical distribution μn (see Proposition 3.6). One important consequence of this result is to establish similar bounds for the difference between, on the one hand, the fixed point V ∗u of the Bellman operator T u associated to the kernel P and the multifunction Ψ , and, on the other hand, the fixed point V u,n of the Bellman operator T u,n associated to the kernel P  ,n and the multifunction Ψd , where u is a general cost function (Theorem 3.9). The bound on the difference between V ∗u and V u,n is, roughly speaking, expressed in terms of  , W 1 (μ, μn ), and the parameter d measuring the distance between the multifunctions Ψ and Ψd (see (2.3)). Finally, the feasibility and solvability of the linear problem LP(μn ) associated to the empirical distribution μn is studied in Proposition 3.11. 3.1. The space of occupation measures O (ν ) We introduce the following constants (their meaning will become clear later).

L=

1 − α (d ∨ d ∨ D ) 2α (d ∨ d ∨ D )

>0

(3.3)

and also

dˆ = d + dL,

 = d + d L, d

and

ˆ = D + D L, D

(3.4)

so that

αdˆ 

1 + αd 2

αd 

< 1,

1 + α d 2

< 1,

and

α Dˆ 

1 + αD 2

< 1.

(3.5)

The space M1 ( X ), when endowed with the 1-Wasserstein metric, is a Polish space [6, Theorem 8.10.43]. We define M1 ( X ) as the family of measures ν ∈ M1 ( X ) that satisfy



ˆ (x) w ( y ) p  ( y |x, a)ν (dy )  dw



and

X

 w  (x) w  ( y ) p  ( y |x, a)ν (dy )  d

X

Kd . We suppose that

for every (x, a) ∈ M w (Kd ) is endowed with the w-weak topology, under which it is a separable metric space [14, Corollary A.28]. We define M w (Kd ) as the closure of the set











η ∈ M w Kd : ηˆ w  

2w  ( z0 ) 1 − α d



.

Lemma 3.1. (i) M1 ( X ) is a closed subset of M1 ( X ). (ii) M w (Kd ) is compact for the w-weak topology. Proof. (i) By using Assumption (C2) (see (2.7)), we obtain that for any (x, a) ∈ Kd and

ν1 , ν2 ∈ M1 ( X )

     1  1    dW 1 (ν1 , ν2 )/ . w ( y ) p ( y | x , a ) ν ( dy ) − w ( y ) p ( y | x , a ) ν ( dy )  1  2  w (x)  w (x) X

X

864

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

Therefore, the real-valued mapping defined for each (x, a) ∈ Kd on M1 ( X ) by

ν →



1

w ( y ) p  ( y |x, a)ν (dy )

w (x) X

is continuous. By using the same arguments and Assumption (D3), we obtain that the real-valued mapping defined on M1 ( X ) by

ν →



1

w  ( y ) p  ( y |x, a)ν (dy )

w  (x) X

is continuous for every (x, a) ∈ Kd . Therefore, M1 ( X ) is a closed subset of M1 ( X ) because it is the intersection of the inverse image of closed sets by continuous functions. (ii) Let Hk = {(x, a) ∈ Kd : x ∈ H k and a ∈ A d (x)}, where { H k }k∈N is the sequence of compact sets given in Assumption (D1). From [9, Lemma 2.8] – in particular, we use that A is locally compact and that x → A d (x) is upper semicontinuous – it follows that Hk is a compact set. Therefore, from Assumption (D1), the mapping w  / w is a moment on Kd , that is,

lim

inf

k→∞ (x,a)∈Kd −Hk

w  (x)/ w (x) = ∞.

Consequently, from Corollary A.30(b) in [14], it follows that the set









2w  ( z0 )



η ∈ M w Kd : ηˆ w  



1 − α d

is relatively compact for the w-weak topology.

2

Let Φ  ⊆ Φ be the family of stochastic kernels ϕ on A given X such that ϕ ( A d (x)|x) = 1 for all x ∈ X . By hypothesis, Φ  is nonempty. Given ϕ ∈ Φ  ,  > 0, and ν ∈ M1 ( X ), for B ∈ B ( X ) and x ∈ X , define





P ϕ , ,ν ( B |x) =

p  ( y |x, a)ϕ (da|x)ν (dy ). B A d (x)

Define also recursively for k  1,

P ϕ0 , ,ν ( B |x) = I B (x),



k

P ϕ , ,ν ( B |x) =



k −1

P ϕ , ,ν ( B | y ) P ϕ , ,ν (dy |x) = X

k −1 P ϕ , ,ν ( B | y ) P ϕ , ,ν (dy |x)

X

(in particular, P ϕ , ,ν ( B |x) = P ϕ , ,ν ( B |x)). Let 1

ηϕ , ,ν (C 1 × C 2 ) =



k =0

ηϕ , ,ν be the measure on X × A given by



αk

ϕ (C 2 | y ) P ϕk , ,ν (dy |z0 )

(3.6)

C1

for C 1 ∈ B ( X ) and C 2 ∈ B ( A ). Its marginal ηˆ ϕ , ,ν verifies

ηˆ ϕ , ,ν (C 1 ) =





ϕ (C 2 | y )ηˆ ϕ , ,ν (dy |z0 ).

αk P ϕk , ,ν (C 1 |z0 ), and so ηϕ , ,ν (C 1 × C 2 ) =

k =0

(3.7)

C1

In fact, observe that we can interpret ηϕ , ,ν as a measure on Kd since ηϕ , ,ν (( X × A ) − Kd ) = 0. The measure ηϕ , ,ν is an analogue of the state–action occupation measure ηϕ of a policy defined in Theorem 2.6(iv). In our context, since P ϕ , ,ν is not necessarily a stochastic kernel, we propose the direct definition (3.6) for the state–action occupation measure. Lemma 3.2. Let ν ∈ M1 ( X ). We have O (ν ) = {ηϕ , ,ν :

η( w ) 

2w ( z0 ) 1 − αd

and





η w 

2w  ( z0 ) 1 − α d

for every η ∈ O (ν ). In particular, O (ν ) ⊆ M w (Kd ).

ϕ ∈ Φ  } and so, O (ν ) is nonempty. In addition, (3.8)

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

Proof. Suppose that ϕ ∈ Φ  is given. For the moment, let us prove that we will show by induction on k that, for every x ∈ X ,



865

ηϕ , ,ν satisfies the inequalities in (3.8). First of all,

k ˆk w ( y) P ϕ , ,ν (dy |x)  d w (x).

(3.9)

X

The statement is true for k = 0. Suppose that it holds for k − 1. We have



 

k w ( y) P ϕ , ,ν (dy |x) = X

X

k −1 w ( y) P ϕ , ,ν (dy | z) P ϕ , ,ν (dz|x).

X

By the induction hypothesis



k −1 ˆ k−1 w ( z). w ( y) P ϕ , ,ν (dy | z)  d

X

However,







ˆ (x) w ( z) p  ( z|x, a)ϕ (da|x)ν (dz)  dw

w ( z) P ϕ , ,ν (dz|x) = X A d (x)

X

ν ∈ M1 ( X ). The inequality (3.9) follows. Now, from the definition of ηϕ , ,ν , we obtain  ∞

ηϕ , ,n ( w ) = αk w ( y ) P ϕk , ,ν (dy |z0 ).

because

k =0

X

Recalling (3.5) and (3.9), we derive that ηϕ , ,n ( w )  2w ( z0 )/(1 − αd). In a similar way we can show that ηϕ , ,n ( w  )  2w  ( z0 )/(1 − αd ). In particular, ηϕ , ,n is in M w (Kd ) and M w (Kd ). We proceed with the proof. Let O (ν ) = {ηϕ , ,ν : ϕ ∈ Φ  }. Our goal is, obviously, to show that O (ν ) = O (ν ). Given ϕ ∈ Φ  , by definition ηϕ , ,ν satisfies

ηˆ ϕ , ,ν ( B ) = δz0 ( B ) +



αk P ϕk , ,ν ( B |z0 )

k =1

= δ z0 ( B ) + α



α k −1

k =1

= δ z0 ( B ) + α





k −1 P ϕ , ,ν ( B |x) P ϕ , ,ν (dx| z0 )

X

α k −1

k =1

 

= δ z0 ( B ) + α

  X



k −1 p ε ( y |x, a)ϕ (da|x)ν (dy ) P ϕ , ,ν (dx| z0 )

B A d (x)

p ε ( y |x, a)ν (dy )ηϕ , ,ν (dx, da)

Kd B

for all B ∈ B ( X ). Therefore, O (ν ) ⊆ O (ν ). is supported on Kd , can be disintegrated as η( B × C ) =  Conversely, let η ∈ O (ν ). Observe that the measure η , which  , where η ˆ is the projection of η on X ; see, e.g., Proposition D.8 ˆ ϕ ( C | x ) η ( dx ) , with B ∈ B ( X ) and C ∈ B ( A ) , for some ϕ ∈ Φ B in [16]. Consequently, ηˆ verifies



P ϕ , ,ν ( B |x)ηˆ (dx)

ηˆ ( B ) = δz0 ( B ) + α

for all B ∈ B ( X ).

X

Iterating this equality yields, for every k  1 and B ∈ B ( X ),

ηˆ ( B ) = δz0 ( B ) +

k

α j P ϕj , ,ν ( B |z0 ) + αk+1

j =1



k +1 ˆ (dx). Pϕ , ,ν ( B |x)η

X

Now we use the bound (recall (3.9)) k +1 Pϕ , ,ν ( B |x) 

 X

k +1 ˆ k+1 w (x). w ( y) P ϕ , ,ν (dy |x)  d

866

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

Together with ηˆ ( w ) < ∞ and

lim

k→∞

α k +1



αdˆ < 1 (recall (3.5)), this ensures that

k +1 ˆ (dx) = 0. Pϕ , ,ν ( B |x)η

X

Finally, we have established that for every B ∈ B ( X )

ηˆ ( B ) =



α j P ϕj , ,ν ( B |z0 ) = ηˆ ϕ , ,ν ( B ),

j =0

and η = ηϕ , ,ν readily follows because both measures have the same projection on X and the same “conditional distributions” (which correspond to ϕ ); recall (3.7). Hence, we have shown that O (ν ) ⊆ O (ν ). This completes the proof. 2 Our next lemma deals with a Lipschitz continuity property. Lemma 3.3. Given v ∈ L w ( X ) and (x, a) ∈ K, the function y → v ( y )q( y |x, a) is Lipschitz continuous on X with Lipschitz constant given by K v w (x), with

K v := v w (d + qL w ) + L v q. Proof. For every y , z ∈ X we have, from (2.5) in Assumption (C2),

        v ( y )q( y |x, a) − v ( z)q( z|x, a)   v ( y )q( y |x, a) − q( z|x, a) + q( z|x, a) v ( y ) − v ( z)    v w w ( y )q( y |x, a) − q( z|x, a) + L v qw (x)d X ( y , z). Now, from Assumptions (A2) and (C2) (see (2.5) and (2.7))













w ( y )q( y |x, a) − q( z|x, a)   w ( y )q( y |x, a) − w ( z)q( z|x, a) +  w ( y ) − w ( z)q( z|x, a)

 (d + qL w ) w (x)d X ( y , z). 2

The stated result follows.

Lemma 3.4. Suppose that {(νk , ηk )} is a sequence in M1 ( X ) × M w (Kd ) such that ηk ∈ O (νk ) and, in addition,

lim

k→∞

νk = ν ∈ M1 ( X ) and

lim

k→∞



ηk = η ∈ M w Kd



in the 1-Wasserstein metric and w-weak topology, respectively. Then (ν , η) ∈ M1 ( X ) × M w (Kd ) and η ∈ O (ν ). As a consequence, for every ν ∈ M1 ( X ), the set O (ν ) is compact. Proof. From Lemma 3.1, ν ∈ M1 ( X ) and η ∈ M w (Kd ). Let us now prove that continuous function f defined on X and every k we have

η ∈ O (ν ). For any bounded 1-Lipschitz

 

ηˆ k ( f ) = f (z0 ) + α

f ( y ) p  ( y |x, a)νk (dy )ηk (dx, da),

Kd X

and so

 

ηˆ k ( f ) = f (z0 ) + α   +α

Kd



 f ( y ) p  ( y |x, a)νk (dy ) −

X

f ( y ) p  ( y |x, a)ν (dy )

ηk (dx, da)

X

f ( y ) p  ( y |x, a)ν (dy )ηk (dx, da).

(3.10)

Kd X

From Lemma 3.3, for every (x, a) ∈ Kd , the function y → f ( y ) p  ( y |x, a) is Lipschitz continuous on X , with Lipschitz constant bounded by R w (x) for some R > 0. Therefore, for every k ∈ N and (x, a) ∈ Kd we have

      f ( y ) p  ( y |x, a)νk (dy ) −   R w (x) W 1 (νk , ν ). f ( y ) p ( y | x , a ) ν ( dy )    X

X

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

867

ηk ( w ) is bounded in k because the bound in (3.8) is uniform in ν , we deduce that   f ( y ) p  ( y |x, a)νk (dy ) − f ( y ) p  ( y |x, a)ν (dy ) ηk (dx, da)

Observing that

  Kd

X

X

converges to 0 as k → ∞. Now, from Assumption (C2), (x, a) → p  ( y |x, a) is continuous on Kd for every y ∈ X , with

p  ( y |x, a) 

q

· w (x) for any y ∈ X and (x, a) ∈ K .



Recalling that w is continuous, by the bounded convergence theorem, the mapping



(x, a) →

f ( y ) p  ( y |x, a)ν (dy ) X

is in C w (Kd ). Consequently, by taking the limit as k tends to infinity in (3.10)

 

ηˆ ( f ) = f (z0 ) + α

f ( y ) p  ( y |x, a)ν (dy )η(dx, da).

Kd X

Since the set of bounded 1-Lipschitz continuous functions is separating for finite measures on separable metric spaces, it follows that for all B ∈ B ( X )

 

ηˆ ( B ) = δz0 ( B ) + α

p  ( y |x, a)ν (dy )η(dx, da).

Kd B

Hence, η ∈ O (ν ) and the proof is complete. To see that O (ν ) is compact for ν ∈ M1 ( X ), choose the constant sequence subset of the compact set M w (Kd ). 2

νk ≡ ν . It follows that O (ν ) is a closed

The following lemma studies the solvability of a linear program corresponding to a slightly more general version of

LP(ν ). An easy consequence of this result is that LP(ν ) is solvable if it is feasible. Lemma 3.5. Consider ν ∈ M1 ( X ) and σ ∈ Rq . Then O (ν ) ∩ {η ∈ M w (Kd ):

inf







η(c ): η ∈ M w Kd , η ∈ O (ν ) and η(rα )  σ

is solvable provided that O (ν ) ∩ {η ∈ M w (Kd ):

η(rα )  σ } is compact and the linear program



η(rα )  σ } is nonempty.

Proof. Since rα ∈ L w (K), it follows that the mapping defined on M w (Kd ) by η → η(rα ) is continuous, and so {η ∈ M w (Kd ): η(rα )  σ } is a closed subset of M w (Kd ). But O (ν ) is compact from Lemma 3.4 and so, O (ν ) ∩ {η ∈ M w (Kd ): η(rα )  σ } is closed and hence, compact. Moreover, the real-valued mapping defined on M w (Kd ) by η → η(c ) is continuous since c ∈ L w (K), showing the result. 2 3.2. The random kernel P  ,n Our results so far are “deterministic” in the sense that they do not involve the probability space (Ω, F , P). In this section, we are interested in “random” results. Our next proposition compares the integral of a Lipschitz continuous function with respect to the kernels P and P  ,n . Proposition 3.6. For any v ∈ L w ( X ) and (x, a) ∈ K,

 

 P v (x, a) − P  ,n v (x, a)  d v w  + K v W 1 (μ, μn ) M  (x) w (x),

(3.11)

with the constant K v defined in Lemma 3.3. Moreover, for any (x, a) ∈ K





P  ,n w (x, a)  d + dW 1 (μ, μn )/ w (x), 

P  ,n w (x, a) 







d + d W 1 (μ, μn )/ w  (x),





P  ,n M w (x, a)  D + D W 1 (μ, μn )/ M (x) w (x).

(3.12) (3.13) (3.14)

868

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879



Proof. For notational convenience, let us introduce the kernel R  ( B |x, a) = B p  ( y |x, a)μ(dy ) for any (x, a) ∈ K and B ∈ B( X ). For any v ∈ B w ( X ) and (x, a) ∈ K, we have by definition of R  and M  (see Definition 2.4)

   P v (x, a) − R  v (x, a)  v w

 M (x) 1 +  M (x)



M (x) w ( y )q( y |x, a)μ(dy ) X

 M (x) P w (x, a) 1 +  M (x)  v w  M  (x) dw (x),

= v w

(3.15)

since P w (x, a)  dw (x) (see Assumption (A2)). Now, we have by definition of R  and P  ,n (see Definition 2.4)

        R  v (x, a) − P  ,n v (x, a) = M  (x) v ( y )q( y |x, a)μ(dy ) − v ( y )q( y |x, a)μn (dy ),   X

X

for any v ∈ L w ( X ) and (x, a) ∈ K and so, from Lemma 3.3,

   v ( y )q( y |x, a) − v ( z)q( z|x, a)  K v w (x)d X ( y , z).

(3.16)

Taking into account the definition of W 1 , it yields that

   R  v (x, a) − P  ,n v (x, a)  K v W 1 (μ, μn ) M  (x) w (x).

(3.17)

Combining Eqs. (3.15) and (3.17), we obtain Eq. (3.11). We obtain that | P  ,n w (x, a) − R  w (x, a)|  dW 1 (μ, μn ) M  (x) w (x) by using (2.7). Moreover, we have P  ,n w (x, a)  R  w (x, a) + | P  ,n w (x, a) − R  w (x, a)|. However, observe that R  w (x, a)  P w (x, a) since M  (x)  M (x), and so

P  ,n w (x, a)  P w (x, a) + dW 1 (μ, μn ) M  (x) w (x). Now by using Assumption (A2) and the fact that M  (x)  1/ , (3.12) follows. The proof of the inequalities (3.13) and (3.14) is similar and is, therefore, omitted. 2 At this point, recall the definition of the constants given in (3.3)–(3.5). We define the measurable subset of Ω





C  ,n = W 1 (μ, μn )  L . Lemma 3.7. On the set C  ,n , the following inequalities hold for every (x, a) ∈ K:

ˆ (x), P  ,n w (x, a)  dw

 w  (x), P  ,n w  (x, a)  d

ˆ M (x) w (x). P  ,n M w (x, a)  D

(3.18)

Therefore, μn ∈ M1 ( X ) on the set C  ,n . Proof. The proof of this result is a direct consequence of (3.12)–(3.14) in Proposition 3.6. The fact that from the definition of M1 ( X ). 2

μn ∈ M1 ( X ) follows

Consider now an arbitrary cost function u ∈ L w (K). Next, we define on the set C  ,n the operator T u,n associated to the kernel P  ,n and the multifunction Ψd by

 T u,n v (x) :=

inf

a∈ A d (x)



 u (x, a) + α

v ( y ) P  ,n (dy |x, a) , X

for any x ∈ X and measurable function v on X . Lemma 3.8. Suppose that we are given ω ∈ C  ,n . (i) The operator T u,n respectively maps B w ( X ) and B M w ( X ) into themselves. Moreover, T u,n is a contraction operator on B w ( X ) ˆ < 1). (respectively, B M w ( X )) with modulus αdˆ < 1 (respectively, α D 2 u (ii) Let V u,n ∈ B w ( X ) be the unique fixed point in B w ( X ) (or B M w ( X )) of the operator of T u,n . We have V u,n w  1−αwd .

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

869

Proof. To prove (i), note that the measure μn has finite support, and use [17, Lemma 8.3.8(a)] to establish measurability of T u,n v. The contraction property follows from Lemma 3.7. The rest of the proof is now straightforward. 2 The next theorem shows that the difference between V ∗u and V u,n can be controlled in terms of parameter d related to the distance between the multifunctions Ψ and Ψd .

 , W 1 (μ, μn ) and the

Theorem 3.9. Consider u ∈ L w (K). On the set C  ,n ,

 u  V − V u  ∗



 ,n M w

2[αd V ∗u w  + α K V ∗u W 1 (μ, μn ) + ( L u + α L P L V ∗u )d] 1 − αD

(3.19)

.

Proof. Fix an arbitrary v ∈ L w ( X ). For every x ∈ X there exists a∗ ∈ A (x) such that





T u v (x) = u x, a∗ + α







v ( y ) P dy |x, a∗  u (x, a) + α

X



v ( y ) P (dy |x, a) X

Also, there exists a ∈ A d (x) with





T εu,n v (x) = u x, a + α







v ( y ) P  ,n dy |x, a  u (x, a) + α

X



 v ( y ) P  ,n (dy |x, a)

for all a ∈ A d (x).

X

Therefore, on the one hand we have

T u v (x) − T u,n v (x)  α

for all a ∈ A (x).





v ( y ) P dy |x, a − α

X







v ( y ) P  ,n dy |x, a .

X

From (3.11) we obtain that



T u v (x) − T u,n v (x)  α d v w  + K v W 1 (μ, μn ) M (x) w (x). On the other hand, since A d (x) ⊆ A (x), we have





d H A (x), A d (x) = sup



inf

 b∈ A (x) a∈ A d (x)



d A a, a 



inf

a∈ A d (x)



d A a∗ , a ,

and let aˆ ∈ A d (x) attain the above infimum, so that d A (a∗ , aˆ )  d H ( A (x), A d (x)). We have



T u,n v (x) − T u v (x)  u (x, aˆ ) + α





v ( y ) P  ,n (dy |x, aˆ ) − u x, a∗ − α

X







v ( y ) P dy |x, a∗ .

X

The function u being Lipschitz continuous,













u (x, aˆ ) − u x, a∗  L u d A aˆ , a∗  L u d H A (x), A d (x)  L u d w (x). Now, write





v ( y ) P  ,n (dy |x, aˆ ) − X





v ( y ) P dy |x, a∗ =

X



 v ( y ) P  ,n (dy |x, aˆ ) −

X

X





X



 v ( y ) P  ,n (dy |x, aˆ ) −

X



v ( y ) P dy |x, a∗

v ( y ) P (dy |x, aˆ ) −

+ where

v ( y ) P (dy |x, aˆ )



X



v ( y ) P (dy |x, aˆ )  d v w  + K v W 1 (μ, μn ) M (x) w (x) X

and, by Lipschitz continuity of v and the kernel P ,





v ( y ) P (dy |x, aˆ ) − X









v ( y ) P dy |x, a∗  L P L v d A aˆ , a∗  L P L v d w (x).

X

Summarizing, we have obtained

 u 

 T v (x) − T u v (x)  αd v w  + α K v W 1 (μ, μn ) + ( L u + α L P L v )d M (x) w (x).  ,n

(3.20)

870

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

Moreover, for fixed

ω ∈ C  ,n we have

 u  V − V u 









u u u u   u u  u u  ,n M w  T V ∗ − T  ,n V ∗ M w + T  ,n V ∗ − T  ,n V  ,n M w .



Recall now that V ∗u ∈ B w ( X ) ⊆ B M w ( X ) and that, on C  ,n , we have V u,n ∈ B w ( X ) ⊆ B M w ( X ). Since, from Lemma 3.8(i), we ˆ < 1, it follows that have that T u,n is a contraction operator on B M w ( X ) with modulus α D

 u  V − V u 

 ,n M w 



2 T u V ∗u − T u,n V ∗u M w T u V ∗u − T u,n V ∗u M w  , ˆ 1 − αD 1 − αD

where the last inequality comes from (3.5). Using (3.20) we obtain the result.

2

The next technical lemma relates the fixed point V u,n of the operator T u,n to the solution of a linear program on the set O (μn ). Observe that this relation is established on C  ,n and that the set O (μn ) depends on ω through the random measure μn . Lemma 3.10. Suppose that ω ∈ C  ,n and consider a cost function u ∈ L w (K). We have

min

η∈O (μn )

Proof. Given

η(u ) = V u,n (z0 ).

ϕ ∈ Φ  , define u (x, ϕ ) :=



T  ,n , for each x ∈ X we have u

A d (x)

u (x, a)ϕ (da|x) for every x ∈ X . Since V u,n is the fixed point of the operator

 V u,n (x)  u (x, ϕ ) + α

V u,n ( y ) P ϕ , ,μn (dy |x).

(3.21)

X

Iterating this inequality, for every k  1,

V u,n ( z0 ) 

k



αj

j =0

u ( y , ϕ ) P ϕ , ,μn (dy | z0 ) + αk+1 j



X

k +1 V u,n ( y ) P ϕ , ,μn (dy | z0 ).

(3.22)

X

Observe now that

     u   k +1   u  k +1 u k +1 k +1  ˆ k+1 w ( z0 ),  α   V  α V ( y ) P ( dy | z ) w ( y) P ϕ 0  , n ϕ ,  , μ  , n , ,μn (dy | z0 )  V  ,n w (α d)   n w X

X

by (3.9), where

αdˆ < 1. Now let k → ∞ in (3.22) to obtain

V u,n ( z0 ) 



j =0



αj

j

u ( y , ϕ ) P ϕ , ,μn (dy | z0 ) =





j =0

X



j

αj

u ( y , a)ϕ (da| y ) P ϕ , ,μn (dy | z0 ),

(3.23)

X A d (x)

which equals ηϕ , ,μn (u ). Since ϕ ∈ Φ  is arbitrary, this yields that V u,n ( z0 )  infη∈O (μn ) η(u ). On the other hand, by using Lemma 8.3.8(a) in [17], there exists some deterministic ϕ ∈ Φ  such that (3.21)–(3.23) hold with equality. The proof of the lemma is now complete. 2 Introduce the constants

 L=

L r (1 + L Ψ ) 1 − α L P (1 + L Ψ )

,

c + (1 − α )|θ0 | w  V = , 1 − αd

and

=  K V (d + qL w ) +  Lq.

(3.24)

We impose another assumption which is related to a Slater condition on the constrained control problem presented in Definition 2.3. Assumption E. (E1) There exist δ > 0 and

π ∈ Π such that V r (z0 , π ) < θ0 − δ 1.

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

871

As shown by the following proposition, this condition will ensure that the linear program LP(μn ) is feasible by showing that C (μn ) is nonempty on some specific set. Proposition 3.11. Suppose that  > 0 and d > 0 are such that the constant

H ,d =



δ(1 − α D ) V  − ( Lr + α L P L )d − αd   4M ( z0 ) w ( z0 ) αK 1

 (3.25)

is positive. Define the set





Cˆ  ,n = W 1 (μ, μn )  L ∧ H ,d ⊆ C  ,n .

(3.26)

On the set Cˆ  ,n , there exists η ∈ O (μn ) such that η(rα )  − 2δ 1. In particular, on Cˆ  ,n , C (μn ) is nonempty and so LP(μn ) is solvable. Proof. For a fixed ω ∈ Cˆ  ,n , the set C = {η(rα ): η ∈ O (μn )} is a convex subset of Rq . Consider the convex set Z = { z ∈ Rq : z  − 2δ 1}. To prove the result we proceed by contradiction, and so we assume that C ∩ Z = ∅. Applying Theorem 5.61 in [1], there exists λ = 0 such that η(λ, rα )  λ, z for any η ∈ O (μn ) and z ∈ Z . First remark that λ  0. Indeed, from Lemma 3.2, we have η(λ, rα ) < ∞. If λ has some negative component then let the corresponding component of z tend a contradiction with the inequality λ, z  η(λ, rα ) < ∞. Consequently, there is no loss of generality to to +∞ to get consider that i λi = 1. Let ν ∈ P ⊆ M w (K) be the state–action occupation measure of the policy π . Therefore,









ν λ, rα  < −δ and so V ∗λ,rα  (z0 )  ν λ, rα  < −δ. Moreover, for any



η ∈ O (μn ), we have



η λ, rα   −

δ

and so

2

λ,rα 

V  ,n

δ ( z0 )  − , 2

by Lemma 3.10. However, a straightforward calculation shows that

L V λ,rα    L, ∗

 λ,rα     V ∗ V, w

. K V λ,rα   K

and



Now (3.19) yields that on Cˆ  ,n

 λ,rα   δ λ,r  V ∗ ( z0 ) − V  ,n α ( z0 )  ,

sup 

λ0,

2

λ i =1

leading to contradiction. Therefore, C ∩ Z = ∅. The fact that LP(μn ) is solvable easily follows from Lemma 3.5 for Remark 3.12. In the sequel we will also suppose that

σ = 0 and Lemma 3.7. 2

 > 0 and d > 0 are such that H ,d > 0 (recall (3.25)).

4. Main results The objective of this section is to introduce a stochastic estimator of the optimal value V ( z0 ) of the constrained control problem CP introduced in Definition 2.3. The estimator In of V ( z0 ) is defined in (4.5) as the optimal solution of the (finite state) linear programming problem LP(μn ) based on the empirical probability measure μn . In Theorem 4.3 we will see that, although In is not necessarily a random variable, the approximation error |In − V ( z0 )| can be controlled through a concentration inequality. We conclude this section by discussing the computability of the estimator In , depending on the nature of the sets A d (x). Consider an isolated point . Then, it can be easily seen from Lemma 3.1 that there exists a metric for which M w (Kd ) ∪ {} is compact and the relative topology of M w (Kd ) as a subset of M w (Kd ) ∪ {} is the w-weak topology. Introduce the multifunction A : M1 ( X ) → M w (Kd ) ∪ {} defined by



A(ν ) :=

C (ν ) ∪ {} if ν ∈ M1 ( X ), {} if ν ∈ / M1 ( X ).

On the graph of the multifunction A, define the mapping



ρ (ν , η) :=

η(c )

if η = ,

4c w ( z0 ) 1 −α d

if η = .

ρ by

872

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

Lemma 4.1. The real-valued mapping I defined on M1 ( X ) by I(ν ) = infη∈A(ν ) ρ (ν , η) is measurable. Moreover,



I(ν ) =

min LP(ν )

if ν ∈ M1 ( X ) and C (ν ) = ∅,

4c w ( z0 ) 1 −α d

otherwise.

(4.1)

Proof. Recall first that both M1 ( X ) and M w (Kd ) are Borel spaces (Lemma 3.1). Let us now show that the multifunction A is compact-valued and Borel measurable. Clearly, A(ν ) is compact when ν ∈ / M1 ( X ). In case that ν ∈ M1 ( X ), Lemma 3.5 with σ = 0 gives that C (ν ) is compact. Hence, the multifunction A is compact-valued. Besides, we deduce also from Lemma 3.4 that the graph of the multifunction A is closed (in particular, we use the fact that M1 ( X ) is closed in M1 ( X )). Summarizing, the multifunction A is compact-valued and closed. By Proposition D.4 in [16], the multifunction A is Borel-measurable. The function ρ (ν , ·) is continuous on A(ν ) when ν ∈ / M1 ( X ) or when ν ∈ M1 ( X ) and C (ν ) = ∅. If ν ∈ M1 ( X ) and C (ν ) = ∅, since c ∈ L w (K), we also have that ρ (ν , ·) is continuous on A(ν ). Now, Proposition D.5 in [16] gives that I is measurable. Clearly, the “inf” is actually a “min” (recall Lemma 3.5) and I satisfies (4.1). Indeed, if ν ∈ M1 ( X ) and C (ν ) = ∅, given

η ∈ C (ν )

  η(c )  c η( w )  2c w ( z0 ) 1 − αd as a consequence of (3.8). Consequently, minη∈A(ν ) ρ (ν , η) is not attained at

η = . 2

As a direct consequence of Lemma 4.1 we obtain that I(μn ) is a random variable because the mapping from Ω to M1 ( X ) given by ω → μn (ω) is measurable, when M1 ( X ) is endowed with the Borel σ -field generated by the 1-Wasserstein metric. Besides, on the set Cˆ  ,n we have

I(μn ) = min LP(μn )

(4.2)

from Proposition 3.11. Our next result analyzes the difference between the estimator I(μn ) and the value V ( z0 ) of the constrained problem CP. Theorem 4.2. Suppose that Assumptions A–E hold. There exist constants K1 , K2 , K3 such that, given  and n, we have

  I(μn ) − V ( z0 )  K1  + K2 W 1 (μ, μn ) + K3 d on Cˆ  ,n . The constants K1 , K2 , K3 depend only on the parameters of the control model, and they do not depend on ω , n,  , d, nor the sets A d (x). Proof. Consider

C=



ω ∈ Cˆ  ,n given. We define   σ ∈ Rq : η(rα )  σ ,

η∈O (μn )

which is a convex subset of Rq . By Proposition 3.11 we know that there exists

η ∈ O (μn ) with η(rα )  − 2δ 1. So, we

conclude that 0 is an interior point of C , and also that − 2 1 ∈ C . For each σ ∈ C , observe that O (μn ) ∩ {η ∈ M w (Kd ): η(rα )  σ } is nonempty. Consequently, the real-valued mapping F defined on C by δ



σ → F (σ ) := min η(c ): η ∈ O (μn ) and η(rα )  σ



is well defined from Lemma 3.5. In particular, by (4.2), we have F (0) = I(μn ). It can be shown that F is convex on C ; hence, there is some λ(ω) ∈ Rq with



F (σ ) − F (0)  − λ(ω), σ



∀σ ∈ C .

(4.3)

The function F being bounded on C (recall (3.8)), we deduce that λ(ω)  0 (to see this, if λ(ω) has some negative component, then let the corresponding component of σ tend to ∞ to reach a contradiction). Fix now an arbitrary η ∈ O (μn ). Clearly, η(rα ) ∈ C and F (η(rα ))  η(c ). We deduce that











F (0)  η(c ) + λ(ω), η(rα )  η c + λ(ω), rα . But

η ∈ O (μn ) being arbitrary, we obtain from Lemma 3.10 that    c +λ(ω),rα  F (0)  min η c + λ(ω), rα = V  ,n ( z0 ). η∈O (μn )

F. Dufour, T. Prieto-Rumeau / J. Math. Anal. Appl. 413 (2014) 856–879

On the other hand, since there exists a measure





F (0) = η∗ (c )  η∗ c + λ(ω), rα



c +λ(ω),rα 

The equality F (0) = I(μn ) = V  ,n

873

η∗ which attains the minimum in the definition of F (0),

c +λ(ω),rα 

 V  ,n

( z0 ).

(z0 ) follows. Since − 2δ 1 ∈ C , we obtain from (4.3) that

 δ  · λ(ω)  F (−δ 1/2) − F (0)

2

and, by the bound in Lemma 3.2, we conclude that

  λ(ω)  8c w ( z0 ) . δ(1 − αd)

(4.4) c +λ(ω),rα 

By Theorem 3.9, we have that | V ∗

c +λ(ω),rα 

(z0 ) − V  ,n

(z0 )| is less than

 c+λ(ω),rα         + α K c+λ(ω),rα  W 1 (μ, μn ) + L c + λ(ω) L r + α L P L c+λ(ω),rα  d . · α d V ∗ w V V

2M ( z0 ) w ( z0 ) 1 − αD





However, we have bounds on

 c+λ(ω),rα   V ∗  , w

K V c+λ(ω),rα  , ∗

and

L V c+λ(ω),rα  ∗

that depend on λ(ω) through its norm. Taking into account the bound in (4.4), it can be seen that there are constants Ki that neither depend on  , n, nor ω such that c +λ(ω),rα 

I(μn )  V ∗

( z0 ) + K1  + K2 W 1 (μ, μn ) + K3 d.

Now, recalling Theorem 2.6, let c +λ(ω),rα 

V∗

ν ∗ ∈ P be such that ν ∗ (c ) = V (z0 ) and ν ∗ (rα )  0. We have

   ( z0 )  ν ∗ c + λ(ω), rα  ν ∗ (c ) = V ( z0 ),

which yields

I(μn )  V ( z0 ) + K1  + K2 W 1 (μ, μn ) + K3 d. q To conclude, as a consequence Theorem 2.6, there exists λ∗ ∈ R+ such that

c +λ∗ ,rα 

V ( z0 ) = V ∗

( z0 ).

Consequently, by Theorem 3.9 again, |V*^{c+⟨λ*,rα⟩}(z0) − Vε,n^{c+⟨λ*,rα⟩}(z0)| is less than

(2M(z0) w(z0)/(1 − αD̄)) · (α d̂ ‖V*^{c+⟨λ*,rα⟩}‖_w ε + α K_{V*^{c+⟨λ*,rα⟩}} W1(μ, μn) + (L_c + ‖λ*‖ L_r + α L_P L_{V*^{c+⟨λ*,rα⟩}}) d).

Taking η* as above, we obtain

Vε,n^{c+⟨λ*,rα⟩}(z0) ≤ η*(c) + ⟨λ*, η*(rα)⟩ ≤ η*(c) = I(μn),

and the stated result follows by proceeding as previously. □

Note that we are not addressing the issue of the measurability of the Lagrange multiplier λ(ω). We do not need such a result because we have bounds on its norm, which suffice for our purposes. Now, let us introduce the stochastic estimate of the optimal value V(z0) of the constrained control problem CP presented in Definition 2.3. We note that, for computational reasons, we cannot derive the value of I(μn) in practice because doing so requires checking whether μn ∈ M1(X), which may require an “infinite” number of computations. Our stochastic estimator is therefore given the more natural definition below.


For every n ∈ N, we define In on Ω as follows:

In = min LP(μn)   if LP(μn) is solvable,
In = 4‖c‖ w(z0)/(1 − αd̄)   if LP(μn) is not solvable.    (4.5)
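For illustration only, the following minimal Python sketch (ours, not part of the paper) expresses the fallback rule (4.5); solve_lp is a hypothetical routine returning min LP(μn) when the linear program is solvable and None otherwise, and the remaining arguments stand for the constants appearing in (4.5).

def estimator_I_n(mu_n, solve_lp, c_norm, w_z0, alpha, d_bar):
    # Fallback rule (4.5): use min LP(mu_n) when solvable, a fixed default otherwise.
    val = solve_lp(mu_n)              # hypothetical solver: value of LP(mu_n) or None
    if val is not None:
        return val                    # LP(mu_n) is solvable
    return 4.0 * c_norm * w_z0 / (1.0 - alpha * d_bar)   # default value of (4.5)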

Observe that In might not be measurable. However, by (4.2) we have In = I(μn) on Ĉε,n.

Theorem 4.3. Suppose that Assumptions A–E hold. There exists γ̄0 > 0 such that, for any 0 < γ < γ̄0, there are ε > 0, d > 0, a set A_{n,ε,d} ∈ F, and constants S̄ and T̄ satisfying

{|In − V(z0)| > γ} ⊆ A_{n,ε,d}   with   P(A_{n,ε,d}) ≤ S̄ exp{−T̄ n}    (4.6)

for any n ≥ 1.

Proof. Since In = I(μn) on Ĉε,n,

{|In − V(z0)| > γ} ⊆ ({|I(μn) − V(z0)| > γ} ∩ Ĉε,n) ∪ Ĉε,n^c.

Let γ̄0 = 3K2 γ0, with γ0 as in Theorem 2.5 and K2 as in Theorem 4.2, and choose arbitrary 0 < γ < γ̄0. Also, choose 0 < ε < γ/(3K1) and 0 < d < γ/(3K3) such that Hε,d > 0. Observe that the choice of d implicitly selects adequate Ad(x) ⊆ A(x). From Theorem 4.2, we have that

{|In − V(z0)| > γ} ∩ Ĉε,n ⊆ {K1 ε + K2 W1(μ, μn) + K3 d > γ}.

Since Ĉε,n^c = {W1(μ, μn) > L̄ ∧ Hε,d} and

{K1 ε + K2 W1(μ, μn) + K3 d > γ} ⊆ {W1(μ, μn) > γ/(3K2)},

it follows that {|In − V(z0)| > γ} ⊆ A_{n,ε,d} with

A_{n,ε,d} = {W1(μ, μn) > (γ/(3K2)) ∧ L̄ ∧ Hε,d}.

Since (γ/(3K2)) ∧ L̄ ∧ Hε,d < γ0, the result follows by using Theorem 2.5. □
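To give the reader a feel for the concentration phenomenon behind Theorem 4.3, the following minimal Python sketch (our illustration only; the distribution, threshold and sample sizes are arbitrary choices) estimates P(W1(μ, μn) > t) by Monte Carlo for μ uniform on [0, 1], using the one-dimensional identity W1(μ, μn) = ∫ |Fn(s) − F(s)| ds.

import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 2001)     # integration grid on the support of mu
dx = grid[1] - grid[0]

def w1_to_uniform(sample):
    # Empirical cdf F_n on the grid; the cdf of U[0, 1] is F(s) = s.
    Fn = np.searchsorted(np.sort(sample), grid, side="right") / sample.size
    return np.sum(np.abs(Fn - grid)) * dx    # Riemann sum for the integral of |F_n - F|

t, reps = 0.05, 2000
for n in (100, 200, 400, 800):
    hits = sum(w1_to_uniform(rng.uniform(0.0, 1.0, n)) > t for _ in range(reps))
    print(n, hits / reps)   # empirical P(W1 > t) decays quickly as n grows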

5. Numerical and computable approximations

Starting from the empirical measure μn associated to μ, we introduced in the previous section the linear program LP(μn), which consists in minimizing η(c) subject to η ∈ C(μn). Its solution, when the program is solvable, is the stochastic estimate In that approximates the optimal value V(z0) of the constrained control problem CP. Let us now discuss in more detail the structure of this linear program.

Recall that μn(dy) = (1/n) Σ_{k=1}^{n} δ_{Yk}(dy) and, for a given ω ∈ Ω, define Γn = {z0, z1, ..., z_{pn}}, with pn ≤ n, as the union of {z0} – the initial state – and the finite support of μn. To simplify the notation, and without loss of generality, we will assume that the random variables Yk do not take the value z0. For any (x, a) ∈ Kd, we have Pε,n(Γn^c | x, a) = 0 and, for i ∈ {0, ..., pn},

Pε,n({zi} | x, a) = Mε(x) q(zi | x, a) · (ni/n),

where ni ≥ 1 is the cardinality of the set {k: Yk(ω) = zi}.


The set O(μn) (recall (3.1)) is given by the set of η ∈ Mw(Kd) such that η̂(Γn^c) = 0 and

η({z0} × Ad(z0)) = 1,    (5.1)

η({zi} × Ad(zi)) = (α ni/n) Σ_{j=0}^{pn} ∫_{Ad(zj)} Mε(zj) q(zi | zj, a) η({zj} × da)    (5.2)

for i ∈ {1, ..., pn}, while the set C(μn) incorporates the q constraints

Σ_{j=0}^{pn} ∫_{Ad(zj)} (r(zj, a) − (1 − α)θ0) η({zj} × da) ≤ 0.    (5.3)

Consequently, the linear program LP(μn) consists in finding measures η({zj} × da), supported on Ad(zj) for 0 ≤ j ≤ pn, satisfying the constraints (5.1)–(5.3) and such that

Σ_{j=0}^{pn} ∫_{Ad(zj)} c(zj, a) η({zj} × da)    (5.4)

is minimized. Thus LP(μn) is a linear program with a finite state space; regarding the action space, however, the “dimension” of LP(μn) depends heavily on the nature of the sets Ad(x).

The case where the Ad(x) are finite. If Ad(x) is finite for every x ∈ Γn, then LP(μn) is a standard finite-dimensional linear program. Therefore, it leads to a computable approximation In of the optimal value V(z0) of the constrained control problem CP. Concerning the choice of the finite sets Ad(x), observe that for any d > 0 there exists a finite Ad(x) ⊆ A(x) such that Assumption (B2) is satisfied, that is, dH(A(x), Ad(x)) ≤ d·w(x) (this is because A(x) is compact; see Assumption (A1)). However, in order to apply our approach one needs to impose that the multifunction Ψd from X to A defined by Ψd(x) := Ad(x) is upper semicontinuous. For the control models arising in practice, our experience is that this condition is usually satisfied (see, in particular, the example in Section 6).

The case where the Ad(x) are infinite. In this case, LP(μn) is an infinite-dimensional linear program which is not, in principle, numerically tractable. This issue can be overcome by showing that LP(μn) reduces, in fact, to a finite-dimensional nonlinear programming problem. Indeed, it can be shown that an optimal solution of LP(μn) (if one exists) randomizes between, at most, q + 1 actions in each state zi ∈ Γn. More precisely, we have:

Theorem 5.1. Suppose that μn ∈ M1(X) and C(μn) ≠ ∅. Then there exists an optimal solution ηn* of LP(μn) such that, for each z ∈ Γn, the measure ηn*({z} × da) is supported on, at most, q + 1 actions in Ad(z).

This result can be easily obtained by using arguments similar to those of Theorem 5.3 in [10]. Consequently, there exist

(q + 1) × (pn + 1) actions {a*_{i,0}, ..., a*_{i,q}} ⊆ Ad(zi), for 0 ≤ i ≤ pn, such that there exists an optimal solution ηn* ∈ C(μn) of LP(μn) supported on the finite set

{(zi, a*_{i,k}): 0 ≤ i ≤ pn, 0 ≤ k ≤ q}.
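To make the structure of LP(μn) concrete in the finite-action case, here is a minimal Python sketch with randomly generated toy data standing in for c, r, q, Mε and the counts ni (and with q = 1 constraint, as in the example of Section 6); it assembles the constraints (5.1)–(5.3) and the objective (5.4) and solves them with scipy.optimize.linprog. It illustrates the structure only; it is not the code used for Section 6.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
p_n, n_act, alpha, theta0, n = 5, 3, 0.8, 0.8, 100    # p_n + 1 states, finite A_d(z_i)

c = rng.uniform(1.0, 10.0, (p_n + 1, n_act))          # cost c(z_i, a_{i,k})
r = rng.uniform(0.0, 0.15, (p_n + 1, n_act))          # toy r kept below (1 - alpha)*theta0
q = rng.uniform(0.8, 1.2, (p_n + 1, p_n + 1, n_act))  # density q(z_i | z_j, a_{j,k}),
                                                      # kept below 1/alpha for solvability
M = np.ones(p_n + 1)                                  # weights M_eps(z_j)
cnt = rng.multinomial(n, np.ones(p_n) / p_n)          # counts n_i = #{k: Y_k = z_i}, i >= 1

nv = (p_n + 1) * n_act                                # variables x_{ik}, flattened row-wise
A_eq = np.zeros((p_n + 1, nv)); b_eq = np.zeros(p_n + 1)
A_eq[0, 0:n_act] = 1.0; b_eq[0] = 1.0                 # constraint (5.1)
for i in range(1, p_n + 1):                           # balance equations (5.2)
    A_eq[i, i * n_act:(i + 1) * n_act] += 1.0
    for j in range(p_n + 1):
        A_eq[i, j * n_act:(j + 1) * n_act] -= alpha * cnt[i - 1] / n * M[j] * q[i, j, :]
A_ub = (r - (1.0 - alpha) * theta0).reshape(1, nv)    # constraint (5.3), with q = 1 here
res = linprog(c.reshape(nv), A_ub=A_ub, b_ub=[0.0],
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.status, res.fun)                            # res.fun plays the role of I_n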

In view of these results, we propose the following procedure to obtain an optimal solution of LP(μn ). We define the function

H : Ad(z0)^{q+1} × Ad(z1)^{q+1} × ··· × Ad(z_{pn})^{q+1} → R,

where H(a_{0,0}, ..., a_{pn,q}) equals the minimum of the linear programming problem

minimize   Σ_{i=0}^{pn} Σ_{k=0}^{q} c(zi, a_{i,k}) x_{ik}

subject to   Σ_{k=0}^{q} x_{0k} = 1,

Σ_{k=0}^{q} x_{ik} − (α ni/n) Σ_{j=0}^{pn} Σ_{k=0}^{q} Mε(zj) q(zi | zj, a_{j,k}) x_{jk} = 0   for 1 ≤ i ≤ pn,

Σ_{i=0}^{pn} Σ_{k=0}^{q} (r(zi, a_{i,k}) − (1 − α)θ0) x_{ik} ≤ 0,

x_{ik} ≥ 0   for 0 ≤ i ≤ pn and 0 ≤ k ≤ q.

Clearly, this is a standard finite-dimensional linear programming problem with (q + 1) × (pn + 1) nonnegative variables, pn + 1 equality constraints, and q + 1 inequality constraints. It corresponds to the situation in which the action sets Ad(zi) are replaced with {a_{i,0}, ..., a_{i,q}}, and the x_{ik} correspond to η({(zi, a_{i,k})}) in (5.1)–(5.4). Since we know that there exist actions {a*_{i,k}} such that H(a*_{0,0}, ..., a*_{pn,q}) equals the minimum of the linear program LP(μn), solving the infinite-dimensional linear program LP(μn) reduces to solving the finite-dimensional nonlinear programming problem of minimizing H. In practice, the action space A is generally a subset of R^s for some s ≥ 1 and, in this case, the minimization of H, whose domain is then a subset of a Euclidean space, can be performed with any of the usual nonlinear programming tools (although this could be computationally demanding); a minimal sketch is given below. Therefore, even in the case of infinite Ad(x) we can numerically obtain the value of min LP(μn).
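As an illustration of this reduction (again ours, with hypothetical scaffolding), H can be treated as a black-box function of the flattened action tuple, evaluated by solving the inner linear program as in the sketch above, and minimized with a derivative-free routine; here a smooth surrogate replaces the hypothetical LP helper so that the sketch runs stand-alone.

import numpy as np
from scipy.optimize import differential_evolution

p_n, q = 2, 1                                   # 3 states, q + 1 = 2 actions per state

def inner_lp_value(actions):
    # Hypothetical stand-in: a real implementation would assemble (5.1)-(5.3)
    # for the given actions, as in the linprog sketch above, and return the
    # minimum of the LP, or None when the LP is not solvable.
    return float(np.sum(actions ** 2))

def H(flat_actions, p_n, q):
    # Value of the LP when A_d(z_i) is replaced by the q + 1 actions a_{i,k}.
    val = inner_lp_value(flat_actions.reshape(p_n + 1, q + 1))
    return val if val is not None else np.inf   # penalize unsolvable action tuples

bounds = [(0.0, 1.0)] * ((p_n + 1) * (q + 1))   # stand-in for the boxes [0, z_i / K]
result = differential_evolution(H, bounds, args=(p_n, q), maxiter=20, seed=0)
print(result.x, result.fun)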

6. Simulation results

In order to illustrate our results, we present an example related to a real application: the management problem of the Pacific halibut fishery described in [12, Section 5].

6.1. Description of the control model

The dynamical system for the halibut population is given by

x_{t+1} = F(xt, at, wt),

where xt denotes the stock level of fish at time t; at is the control variable representing the harvest level (occurring between two consecutive breeding seasons); F is the reproduction function that maps the stock level at the end of one season to the new stock level at the beginning of the next season; and the sequence {wt}_{t∈N} represents the seasonal shocks that influence the reproduction in a stochastic manner. More precisely, we will consider the Beverton–Holt model, as described in [12, Section 5], which uses the following reproduction function:

x_{t+1} = F(xt, at, wt) = (1 − m)(xt − at) + wt · r0(xt − at)/(1 + (xt − at)/R),

where the parameter m represents a natural mortality coefficient, r0 can be interpreted as a reproduction rate, and {wt}_{t∈N} is a sequence of independent and identically distributed random variables. We assume that the state space is X = [x̲, x̄] ⊂ R+. To avoid extinction, complete harvest of the fish population is not allowed, and we will assume that A(x) = [0, x/K] for x ∈ X and some constant K > 1. The random variables {wt}_{t∈N} are supported on the interval [w̲, w̄] ⊂ R+ and have a density function g with respect to the Lebesgue measure on R, assumed to be Lipschitz continuous. For the sake of consistency, the above parameters must satisfy

K > 1 + 1/(w̲ r0 − m),   0 < w̲ < w̄,   and   m < w̲ r0.

Then the state space can be taken to be [x̲, x̄] with

x̲ = (RK/(K − 1)) · (w̲ r0 (K − 1)/(K − (1 − m)(K − 1)) − 1)   and   x̄ = R(w̄ r0 − m)/m,

which are the minimal and maximal values, respectively, taken by the state variable. Let μ be the uniform probability measure on X . Consequently, the density function p of the Markov transition kernel with respect to μ is given by

p(y | x, a) = (x̄ − x̲) · ((1 + (x − a)/R)/(r0(x − a))) · g((1 + (x − a)/R)(y − (1 − m)(x − a))/(r0(x − a))).
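As a quick sanity check (ours, not from the paper), the density p can be evaluated numerically from the shock density g, whose explicit form is given at the end of this subsection, using the normalized parameter values listed later in this section; averaging p(·|x, a) over a fine uniform grid of X should return a value close to 1.

import numpy as np

m, r0, R = 0.15, 0.543365, 1.0         # normalized model: R = 1 after rescaling
w_lo, w_hi = 0.89, 5.0                 # support of the shock w_t
x_lo, x_hi = 6.1355e-4, 17.1122        # state space X = [x_lo, x_hi]

def g(z):
    # Shock density g: a Beta(2, 2) law rescaled to [w_lo, w_hi].
    z = np.asarray(z, dtype=float)
    inside = (w_lo <= z) & (z <= w_hi)
    return np.where(inside, 6.0 * (z - w_lo) * (w_hi - z) / (w_hi - w_lo) ** 3, 0.0)

def p(y, x, a):
    # Transition density p(y | x, a) with respect to the uniform measure mu on X.
    s = x - a                          # escapement; positive since a <= x / K
    jac = (1.0 + s / R) / (r0 * s)     # Jacobian of the map y -> w
    return (x_hi - x_lo) * jac * g(jac * (y - (1.0 - m) * s))

ys = np.linspace(x_lo, x_hi, 200001)
print(p(ys, 8.5558, 1.0).mean())       # integral of p(.|x, a) w.r.t. mu: close to 1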

Noting that x − a ≥ (K − 1)x/K for any (x, a) ∈ K, it follows that p is Lipschitz continuous in y ∈ X, uniformly in (x, a) ∈ K. The cost function is

c(x, a) = 10/(1 + a) + 10/(1 + x − a),

while the one-dimensional constraint function is given by

r(x, a) = (1 + a)/(1 + x − a)


for (x, a) ∈ K, with constraint constant θ0 = 4/5. Elementary calculations show that, loosely speaking, the cost function c to be minimized “encourages” harvest, while the constraint function r somehow “limits” too large a harvest. So there is indeed a trade-off between the objective function and the constraint function. According to [12, Section 5], the numerical values of the parameters involved are m = 0.15, R = 196.3923·10^6 pounds, r0 = 0.543365, K = 4, w̲ = 0.89, and w̄ = 5. In order to avoid intervals of a large magnitude for {xt}_{t∈N} and {at}_{t∈N}, we have normalized these variables by dividing them by R. It follows that x̲ = 6.1355·10^{−4} and x̄ = 17.1122. The discount factor is α = 0.8 and the initial state is z0 = (x̲ + x̄)/2 ≈ 8.5558. The density g has been chosen of the form

g(z) = 6(z − w̲)(w̄ − z)/(w̄ − w̲)^3   if w̲ ≤ z ≤ w̄,   and g(z) = 0 elsewhere.

6.2. Assumptions

Let us check that this population model satisfies the assumptions of this paper. Assumption (A1) is satisfied with LΨ = 1/K. We take w(x) = 1 for all x ∈ X, so that any d̄ ≥ 1 fulfills Assumption (A2). We note that q(y | x, a) = p(y | x, a) is in fact a function of y and x − a, and it is easily seen that p(y | x, a) is L-Lipschitz continuous in (x, a) ∈ K, uniformly in y ∈ X, for some L > 0. By dominated convergence, the Markov kernel P is weakly continuous while, given a bounded measurable function v on X, we have that

|P v(x, a) − P v(y, b)| ≤ L ‖v‖ · (|x − y| + |a − b|)   for (x, a) and (y, b) in K,    (6.1)

where ‖v‖ denotes the sup-norm of v (which here coincides with the w-norm). We note that, although the above expression does not take the same form as the statement in Assumption (A3), it suffices for our purposes. In fact, the Markov kernel in this example is “more regular” than required because it maps a measurable function v into a Lipschitz-continuous function P v. Concerning Assumption (A5), we choose 1 ≤ d̄ < 1/α. The second condition in Assumption (A5) is not needed. Indeed, it is used in Theorem 2.6(ii) to ensure that V*^u is Lipschitz continuous, and in our context a direct argument using (6.1) shows that

L_{V*^u} ≤ (1 + LΨ)(L_u + α‖u‖L/(1 − αd̄)).

Given d > 0, choose an integer q̄ such that q̄ ≥ x̄/(2Kd), and let

Ad(x) = {k x/(q̄K): k = 0, 1, ..., q̄}

for x ∈ X, which indeed satisfies Assumption B. By letting Mε(x) = 1 for x ∈ X and choosing 1 ≤ D̄ < 1/α, it is clear that Assumption C holds. Note also that q̂ is such that |p(y | x, a)| ≤ q̂ for all y ∈ X and (x, a) ∈ K, and that L_q = d̃ = D̃ is the Lipschitz constant of y ↦ p(y | x, a), uniformly in (x, a) ∈ K. Assumption D holds if we let w′ ≡ 1, 1 ≤ d̄′ < 1/α and d̃′ = L_q. In fact, Assumption D is not strictly needed because the state space X is compact. Finally, a direct computation shows that the expected discounted reward (for the constraint function r) of the deterministic stationary policy f0 that selects the action a = 0 for every x ∈ X verifies V^r(z0, f0) ≈ 0.7224 < θ0, so that Assumption E holds. In what follows, to simplify the notation, we will suppose that d̄ = d̄′ = D̄, with 1 ≤ d̄ < 1/α, and d̃ = d̃′ = D̃ = L_q. In particular,

L̄ = (1 − αd̄)/(2αL_q)   and   d̂ = D̂ = (1 + αd̄)/(2α).

In this control model, since the state space is compact, we can take ε = 0, so that pε(y | x, a) = p(y | x, a) = q(y | x, a). This simplifies some of our results. For instance, Proposition 3.6 reads

|P v(x, a) − Pε,n v(x, a)| ≤ (L_v q̂ + L_q ‖v‖) W1(μ, μn)   and   Pε,n 1(x, a) ≤ d̄ + L_q W1(μ, μn).

The set Cε,n can now be defined as Cε,n = {W1(μ, μn) ≤ L̄}, so that on the set Cε,n the inequalities in Lemma 3.7 indeed hold.
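To close this subsection, the following minimal Python sketch (ours, not the code behind Section 6.3) simulates the normalized dynamics under the discretized action sets Ad(x) just constructed, drawing the shocks from g through its rescaled Beta(2, 2) representation; the policy of always taking the maximal allowed harvest is an arbitrary placeholder, not an optimal policy.

import numpy as np

m, r0, R, K = 0.15, 0.543365, 1.0, 4.0
w_lo, w_hi = 0.89, 5.0
x_lo, x_hi = 6.1355e-4, 17.1122
q_bar = 24                                  # A_d(x) has q_bar + 1 = 25 points

rng = np.random.default_rng(1)

def A_d(x):
    # Discretized action set: q_bar + 1 equally spaced points in [0, x / K].
    return np.linspace(0.0, x / K, q_bar + 1)

def step(x, a):
    # One transition of the Beverton-Holt dynamics; g is a rescaled Beta(2, 2).
    w = w_lo + (w_hi - w_lo) * rng.beta(2.0, 2.0)
    s = x - a
    return (1.0 - m) * s + w * r0 * s / (1.0 + s / R)

x = (x_lo + x_hi) / 2.0                     # initial state z_0, about 8.5558
for t in range(50):
    x = step(x, A_d(x)[-1])                 # placeholder policy: maximal harvest
    assert x_lo <= x <= x_hi                # the trajectory should remain in X
print(x)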


Table 1
Unconstrained problem. Summary of results.

             n = 500    n = 1000    n = 1500    n = 2000
Mean         23.69      25.88       26.83       27.64
Mode         22.08      22.98       25.70       25.99
Std. dev.     5.98       5.18        4.40        4.13

Fig. 1. Unconstrained problem. Density estimators.

Table 2
Constrained problem. Summary of results.

             n = 1000   n = 1500   n = 2000   n = 2500   n = 3000   n = 3500
Mean         33.68      35.46      37.16      37.62      38.60      38.72
Mode         26.59      28.84      30.87      33.86      36.08      36.42
Std. dev.    11.41      10.79      10.31       9.39       9.63       8.98

6.3. Computational results

To illustrate our approach, we have estimated the optimal value of both the unconstrained and the constrained control problems. We want to approximate

V*^c(z0) = inf{V^c(z0, π): π ∈ Π}   and   V(z0) = inf{V^c(z0, π): π ∈ Π, V^r(z0, π) ≤ θ0}.

For the action sets Ad(x) we have chosen the parameter q̄ = 24, so that each action set Ad(x) consists of 25 equally spaced points in [0, x/K].

The unconstrained case. We have selected four different values of n (the sample size for the empirical probability measure): n = 500, 1000, 1500, 2000. For each fixed value of n, we have generated 1000 empirical probability measures, say μn(ωk) for k = 1, ..., 1000, of the uniform distribution on X, and we have solved the linear program LP(μn(ωk)) for k = 1, ..., 1000. In this way, we have generated a sample of size 1000 of the stochastic estimator: In(ωk), for k = 1, ..., 1000. To display this sample of In we have used a density estimator with fixed bandwidth and Gaussian kernels (a minimal sketch of this estimation step is given below). In Table 1, we give the mean and standard deviation of the sample {In(ωk)}, together with the mode, computed by using the density estimator. The density estimates are shown in Fig. 1. The results in Table 1 show convergence of the mean and the mode as n grows, while the standard deviation of In decreases. The accuracy of the method becomes evident in Fig. 1, in which we see that the density functions become sharper and more concentrated around the mode as n grows.

The constrained case. We have selected n = 1000, 1500, 2000, 2500, 3000, 3500. As before, for each n we have generated 1000 empirical probability measures of the uniform distribution on X, and we have solved the linear program LP(μn(ωk)) for k = 1, ..., 1000, which now incorporates the constraint function r with constraint constant θ0. In Table 2 we give the mean, standard deviation, and mode of the sample {In(ωk)}. As a side remark, we note that the number of rejected samples (i.e., those for which LP(μn(ωk)) is not solvable) decreases as n grows: it equals 66, 48, 25, 14, 13, and 4, respectively, for our six values of n. The density estimates are shown in Fig. 2.
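The following minimal sketch (with a synthetic Gaussian sample in place of the actual values In(ωk)) shows the kind of fixed-bandwidth Gaussian kernel density estimation used here, together with the mode read off from the estimated density; the bandwidth factor is an arbitrary choice.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
I_sample = rng.normal(27.0, 4.0, 1000)        # placeholder for the sample {I_n(omega_k)}

kde = gaussian_kde(I_sample, bw_method=0.15)  # Gaussian kernels, fixed bandwidth factor
grid = np.linspace(I_sample.min(), I_sample.max(), 400)
density = kde(grid)

print("mean:", I_sample.mean())
print("std. dev.:", I_sample.std(ddof=1))
print("mode:", grid[np.argmax(density)])      # mode estimated from the fitted density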


Fig. 2. Constrained problem. Density estimators.

As for the unconstrained case, we obtain convergence in the sense that the mean and mode become stable and the density functions become sharper as n grows, and also in the sense that the variance decreases.

References

[1] C.D. Aliprantis, K.C. Border, Infinite Dimensional Analysis, third ed., Springer, Berlin, 2006.
[2] E. Altman, Constrained Markov Decision Processes, Stoch. Model., Chapman & Hall/CRC, Boca Raton, FL, 1999.
[3] A. Arapostathis, V.S. Borkar, E. Fernández-Gaucherand, M.K. Ghosh, S.I. Marcus, Discrete-time controlled Markov processes with average cost criterion: a survey, SIAM J. Control Optim. 31 (2) (1993) 282–344.
[4] D.P. Bertsekas, S.E. Shreve, Stochastic Optimal Control: The Discrete Time Case, Math. Sci. Eng., vol. 139, Academic Press Inc., New York, 1978.
[5] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[6] V.I. Bogachev, Measure Theory, vol. II, Springer-Verlag, Berlin, 2007.
[7] E. Boissard, Simple bounds for convergence of empirical and occupation measures in 1-Wasserstein distance, Electron. J. Probab. 16 (2011) 2296–2333.
[8] H.S. Chang, M.C. Fu, J.Q. Hu, S.I. Marcus, Simulation-Based Algorithms for Markov Decision Processes, Comm. Control Engrg. Ser., Springer-Verlag London Ltd., London, 2007.
[9] F. Dufour, T. Prieto-Rumeau, Approximation of Markov decision processes with general state space, J. Math. Anal. Appl. 388 (2) (2012) 1254–1267.
[10] F. Dufour, T. Prieto-Rumeau, Finite linear programming approximations of constrained discounted Markov decision processes, SIAM J. Control Optim. 51 (2) (2013) 1298–1324.
[11] R. El Azouzi, E. Altman, Constrained traffic equilibrium in routing, IEEE Trans. Automat. Control 48 (9) (2003) 1656–1660.
[12] S. Ermon, J. Conrad, C. Gomes, B. Selman, Playing games against nature: optimal policies for renewable resource allocation, in: Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), AUAI Press, 2010, pp. 168–176.
[13] J. Filar, K. Vrieze, Competitive Markov Decision Processes, Springer-Verlag, New York, 1997.
[14] H. Föllmer, A. Schied, Stochastic Finance, de Gruyter Stud. Math., vol. 27, Walter de Gruyter & Co., Berlin, 2002.
[15] O. Hernández-Lerma, Adaptive Markov Control Processes, Appl. Math. Sci., vol. 79, Springer-Verlag, New York, 1989.
[16] O. Hernández-Lerma, J.B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, Appl. Math., vol. 30, Springer-Verlag, New York, 1996.
[17] O. Hernández-Lerma, J.B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, Appl. Math., vol. 42, Springer-Verlag, New York, 1999.
[18] K. Hinderer, On approximate solutions of finite-stage dynamic programs, in: Dynamic Programming and Its Applications, Proc. Conf., Univ. British Columbia, Vancouver, BC, 1977, Academic Press, New York, 1978, pp. 289–317.
[19] K. Hinderer, Lipschitz continuity of value functions in Markovian decision processes, Math. Methods Oper. Res. 62 (1) (2005) 3–22.
[20] H.J. Langen, Convergence of dynamic programming models, Math. Oper. Res. 6 (4) (1981) 493–512.
[21] T.L. Morin, Computational advances in dynamic programming, in: Dynamic Programming and Its Applications, Proc. Conf., Univ. British Columbia, Vancouver, BC, 1977, Academic Press, New York, 1978, pp. 53–90.
[22] W.B. Powell, Approximate Dynamic Programming, Wiley Ser. Probab. Stat., Wiley–Interscience [John Wiley & Sons], Hoboken, NJ, 2007.
[23] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Ser. Probab. Stat., John Wiley & Sons Inc., New York, 1994.
[24] J. Rust, Using randomization to break the curse of dimensionality, Econometrica 65 (3) (1997) 487–516.
[25] L.I. Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Ser. Probab. Stat., John Wiley & Sons Inc., New York, 1999.
[26] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[27] B. Van Roy, Neuro-dynamic programming: overview and recent trends, in: Handbook of Markov Decision Processes, in: Internat. Ser. Oper. Res. Management Sci., vol. 40, Kluwer Acad. Publ., Boston, MA, 2002, pp. 431–459.
[28] W. Whitt, Approximations of dynamic programs. I, Math. Oper. Res. 3 (3) (1978) 231–243.
[29] W. Wu, A. Arapostathis, S. Shakkottai, Optimal power allocation for a time-varying wireless channel under heavy-traffic approximation, IEEE Trans. Automat. Control 51 (4) (2006) 580–594.