Artificial Intelligence 221 (2015) 46–72
POMDPs under probabilistic semantics

Krishnendu Chatterjee, Martin Chmelík
Institute of Science and Technology Austria (IST Austria), Austria
Corresponding author: M. Chmelík. E-mail address: [email protected].
http://dx.doi.org/10.1016/j.artint.2014.12.009

The research was supported by Austrian Science Fund (FWF) Grant No. P 23499-N23, FWF NFN Grant No. S11407-N23 (RiSE), ERC Start grant (279307: Graph Games), and Microsoft faculty fellows award. A preliminary version of the paper was accepted (for oral presentation) in UAI, 2013 [6].
Article history: Received 20 February 2014; received in revised form 11 December 2014; accepted 22 December 2014; available online 31 December 2014.
Keywords: POMDPs; Limit-average objectives; Almost-sure winning
Abstract

We consider partially observable Markov decision processes (POMDPs) with limit-average payoff, where a reward value in the interval [0, 1] is associated with every transition, and the payoff of an infinite path is the long-run average of the rewards. We consider two types of path constraints: (i) a quantitative constraint defines the set of paths where the payoff is at least a given threshold λ1 ∈ (0, 1]; and (ii) a qualitative constraint which is a special case of the quantitative constraint with λ1 = 1. We consider the computation of the almost-sure winning set, where the controller needs to ensure that the path constraint is satisfied with probability 1. Our main results for qualitative path constraints are as follows: (i) the problem of deciding the existence of a finite-memory controller is EXPTIME-complete; and (ii) the problem of deciding the existence of an infinite-memory controller is undecidable. For quantitative path constraints we show that the problem of deciding the existence of a finite-memory controller is undecidable. We also present a prototype implementation of our EXPTIME algorithm and experimental results on several examples.
1. Introduction

In this work, we consider partially observable Markov decision processes (POMDPs), with limit-average payoff, and study the problem of whether there exists a strategy to ensure that the payoff is above a given threshold with probability 1. Our main results characterize the decidability frontier of the problem, and for the decidable cases we establish tight computational complexity results. We first describe the details of the problem, the previous results, and then our contribution.

Partially observable Markov decision processes (POMDPs). Markov decision processes (MDPs) are standard models for probabilistic systems that exhibit both probabilistic and nondeterministic behavior [20]. MDPs have been used to model and solve control problems for stochastic systems [16,38]: nondeterminism represents the freedom of the controller to choose a control action, while the probabilistic component of the behavior describes the system's response to control actions. In perfect-observation (or perfect-information) MDPs the controller can observe the current state of the system to choose the next control actions, whereas in partially observable MDPs (POMDPs) the state space is partitioned according to observations that the controller can observe, i.e., given the current state, the controller can only view the observation of the state (the partition the state belongs to), but not the precise state [35]. POMDPs provide the appropriate model to study a wide variety of applications such as speech processing [33], image processing [13], robot planning [24,21], computational biology [14], and reinforcement learning [22], to name a few. POMDPs also subsume many other powerful computational models such as
probabilistic finite automata (PFA) [39,36] (since probabilistic finite automata (a.k.a. blind POMDPs) are a special case of POMDPs with a single observation).

Limit-average payoff. A payoff function maps every infinite path (infinite sequence of state action pairs) of an MDP to a real value. A very well-studied payoff in the setting of MDPs is the limit-average payoff where every state action pair is assigned a real-valued reward in the interval [0, 1] and the payoff of an infinite path is the long-run average of the rewards on the path [16, Chapter 5], [38, Chapter 8]. POMDPs with limit-average payoff provide the theoretical framework to study many important problems of practical relevance, including probabilistic planning and several stochastic optimization problems [30,16,38].

Expectation vs probabilistic semantics. Traditionally, MDPs with limit-average payoff have been studied with the expectation semantics, where the goal of the controller is to maximize the expected limit-average payoff. The expected payoff value can be 1/2 when with probability 1/2 the payoff is 1, and with remaining probability the payoff is 0. In many applications of system analysis (such as robot planning and control) the relevant question is the probability measure of the paths that satisfy certain criteria, e.g., whether the probability measure of the paths such that the limit-average payoff is 1 (or the payoff is at least 1/2) is at least a given threshold (e.g., see [2,24]). We classify the path constraints for limit-average payoff as follows: (1) the quantitative constraint that defines the set of paths with limit-average payoff at least λ1, for a threshold λ1 ∈ (0, 1]; and (2) the qualitative constraint that is the special case of the quantitative constraint defining the set of paths with limit-average payoff 1 (i.e., the special case with λ1 = 1). We refer to the problem where the controller must satisfy a path constraint with a probability threshold λ2 ∈ (0, 1] as the probabilistic semantics. An important special case of probabilistic semantics is the almost-sure semantics, where the probability threshold is 1. The almost-sure semantics is of great importance because there are many applications where the requirement is to know whether the correct behavior arises with probability 1. For instance, when analyzing a randomized embedded scheduler, the relevant question is whether every thread progresses with probability 1. Even in settings where it suffices to satisfy certain specifications with probability λ2 < 1, the correct choice of λ2 is a challenging problem, due to the simplifications introduced during modeling. For example, in the analysis of randomized distributed algorithms it is quite common to require correctness with probability 1 (e.g., [37,42]). Besides its importance in practical applications, almost-sure convergence, like convergence in expectation, is a fundamental concept in probability theory, and provides a stronger convergence guarantee than convergence in expectation [15].

Previous results.
There are several deep undecidability results established for probabilistic finite automata (PFA), which are a special case of POMDPs (so they immediately imply undecidability for the case of POMDPs).1 The basic undecidability results are for PFA over finite words: the emptiness problem for PFA under probabilistic semantics is undecidable over finite words [39,36,12]; and it was shown in [30] that even the following approximation version is undecidable: for any fixed 0 < ε < 1/2, given a probabilistic automaton and the guarantee that either (i) there is a word accepted with probability at least 1 − ε; or (ii) all words are accepted with probability at most ε; decide whether it is case (i) or case (ii). The almost-sure problem for probabilistic automata over finite words reduces to the non-emptiness question of universal automata over finite words and is PSPACE-complete. However, another related decision question, whether for every ε > 0 there is a word that is accepted with probability at least 1 − ε (called the value 1 problem), is undecidable for probabilistic automata over finite words [17]. Also observe that all undecidability results for probabilistic automata over finite words carry over to POMDPs where the controller is restricted to finite-memory strategies. The importance of finite-memory strategies in applications has been established in [18,28,31].

Our contributions. Since under the general probabilistic semantics, the decision problems are undecidable even for PFA, we consider POMDPs with limit-average payoff under the almost-sure semantics. We present a complete picture of decidability as well as tight complexity bounds.

1. (Almost-sure winning for qualitative constraints). We first consider limit-average payoff with a qualitative constraint under almost-sure semantics. We show that belief-support-based strategies are not sufficient (where a belief-support-based strategy is based on the subset construction that remembers the possible set of current states): we show that there exist POMDPs with limit-average payoff with a qualitative constraint where a finite-memory almost-sure winning strategy exists but there exists no belief-support-based almost-sure winning strategy. Our counter-example shows that standard techniques based on subset construction (to construct an exponential size perfect-observation MDP) are not adequate to solve the problem. We then show one of our main results that given a POMDP with |S| states and |A| actions, if there is a finite-memory almost-sure winning strategy to satisfy the limit-average payoff with a qualitative constraint, then there is an almost-sure winning strategy that uses at most 2^(3·|S|+|A|) memory. Our exponential memory upper bound is asymptotically optimal, as even for PFA over finite words, exponential memory is required for almost-sure winning. This follows from the fact that the shortest witness word for non-emptiness of universal finite automata is at least exponential. We then show that the problem of deciding the existence of a finite-memory almost-sure winning strategy for limit-average payoff with a qualitative constraint is EXPTIME-complete for POMDPs. In contrast to our result for finite-memory strategies, we establish that the problem of deciding the existence of an infinite-memory almost-sure winning strategy for limit-average payoff with a qualitative constraint is undecidable for POMDPs.
1 Note that for PFA every path is either accepting (payoff assigned 1) or rejecting (payoff assigned 0). Thus for PFA the expected semantics and the probabilistic semantics coincide, and hence undecidability results for PFA also hold for the expected semantics.
Table 1
Computational complexity. New results are in bold fonts.

                        Almost-sure semantics              Probabilistic semantics
                        Fin. mem.       Inf. mem.          Fin./Inf. mem.
PFA                     PSPACE-c        PSPACE-c           Undec.
POMDP Qual. Constr.     EXPTIME-c       Undec.             Undec.
POMDP Quan. Constr.     Undec.          Undec.             Undec.
2. (Almost-sure winning with quantitative constraints). In contrast to our decidability result under finite-memory strategies for qualitative constraints, we show that the almost-sure winning problem for limit-average payoff with a quantitative constraint is undecidable even for finite-memory strategies for POMDPs.

In summary we establish the precise decidability frontier for POMDPs with limit-average payoff under probabilistic semantics (see Table 1). For practical purposes, the most prominent question is the problem of finite-memory strategies, and for finite-memory strategies we establish decidability with optimal EXPTIME-complete complexity for the important special case of qualitative constraints under almost-sure semantics. We have also implemented our algorithm (along with a heuristic to avoid the worst-case exponential complexity) for almost-sure winning under finite-memory strategies for qualitative constraints. We experimented with our prototype on variants of several POMDP examples from the literature, and the implementation worked reasonably efficiently.

Technical contributions. The key technical contribution for the decidability result is as follows. Since belief-support-based strategies are not sufficient, standard subset construction techniques do not work. For an arbitrary finite-memory strategy we construct a collapsed strategy that collapses memory states based on a graph construction given the strategy. The collapsed strategy at a collapsed memory state plays uniformly over actions that were played at all the corresponding memory states of the original strategy. The main challenge is to show that the exponential size collapsed strategy, even though it has fewer memory elements, does not destroy the structure of the recurrent classes of the original strategy. For the computational complexity result, we show how to construct an exponential size special class of POMDPs (which we call belief-observation POMDPs, where the belief support is always the current observation) and present polynomial time algorithms for the solution of the special belief-observation POMDPs of our construction. For undecidability of almost-sure winning for qualitative constraints under infinite-memory strategies we present a reduction from the value 1 problem for PFA; and for undecidability of almost-sure winning for quantitative constraints under finite-memory strategies we present a reduction from the strict emptiness problem for PFA under probabilistic semantics.

2. Definitions

In this section we present the definitions of POMDPs, strategies, objectives, and other basic definitions required throughout this work. We follow the standard definitions of MDPs and POMDPs [38,26].

Notations. Given a finite set X, we denote by P(X) the set of subsets of X, i.e., P(X) is the power set of X. A probability distribution f on X is a function f : X → [0, 1] such that Σ_{x∈X} f(x) = 1, and we denote by D(X) the set of all probability distributions on X. For f ∈ D(X) we denote by Supp(f) = {x ∈ X | f(x) > 0} the support of f.

Definition 1 (POMDP). A Partially Observable Markov Decision Process (POMDP) is a tuple G = (S, A, δ, O, γ, s0) where:
• S is a finite set of states;
• A is a finite alphabet of actions;
• δ : S × A → D(S) is a probabilistic transition function that given a state s and an action a ∈ A gives the probability distribution over the successor states, i.e., δ(s, a)(s′) denotes the transition probability from state s to state s′ given action a;
• O is a finite set of observations;
• γ : S → O is an observation function that maps every state to an observation; and
• s0 is the initial state.

Given s, s′ ∈ S and a ∈ A, we also write δ(s′ | s, a) for δ(s, a)(s′). A state s is absorbing if for all actions a we have δ(s, a)(s) = 1 (i.e., the state s is never left). For an observation o, we denote by γ−1(o) = {s ∈ S | γ(s) = o} the set of states with observation o. For a set U ⊆ S of states and a set O′ ⊆ O of observations we denote γ(U) = {o ∈ O | ∃s ∈ U. γ(s) = o} and γ−1(O′) = ⋃_{o∈O′} γ−1(o).

Remark 1 (Unique initial state). For technical convenience we assume that there is a unique initial state s0 and we will also assume that the initial state has a unique observation, i.e., |γ−1(γ(s0))| = 1. In general there is an initial distribution α over initial states that all have the same observation, i.e., Supp(α) ⊆ γ−1(o), for some o ∈ O. However, this can be modeled easily
by adding a new initial state snew with a unique observation such that the first step gives the desired initial probability distribution α, i.e., δ(snew, a) = α for all actions a ∈ A. Hence for simplicity we assume there is a unique initial state s0 with a unique observation.

Remark 2 (Observations). We remark about two other general cases of observations.

1. Multiple observations: We consider observation functions that assign an observation to every state. In general the observation function γ : S → 2^O \ {∅} may assign multiple observations to a single state. In that case we consider the set of observations as O′ = 2^O \ {∅}, consider the mapping that assigns to every state an observation from O′, and reduce to our model.

2. Probabilistic observations: Given a POMDP G = (S, A, δ, O, γ, s0), another type of observation function γ considered in the literature is of type S × A → D(O), i.e., the state and the action give a probability distribution over the set of observations O. We show how to transform the POMDP G into an equivalent POMDP G′ where the observation function is deterministic and defined on states, i.e., of type S → O as in our definitions. We construct the equivalent POMDP G′ = (S′, A, δ′, O, γ′, s0′) as follows: (i) the new state space is S′ = S × O; (ii) the transition function δ′ given a state (s, o) ∈ S′ and an action a is as follows: δ′((s, o), a)(s′, o′) = δ(s, a)(s′) · γ(s′, a)(o′); and (iii) the deterministic observation function for a state (s, o) ∈ S′ is defined as γ′((s, o)) = o. Informally, the probabilistic aspect of the observation function is captured in the transition function, and by enlarging the state space by constructing a product with the observations, we obtain a deterministic observation function defined only on states.

Thus both of the above general cases of observation functions can be reduced to an observation mapping that deterministically assigns an observation to a state, and we consider such an observation mapping, which greatly simplifies the notation.

Plays and cones. A play (or a path) in a POMDP is an infinite sequence (s0, a0, s1, a1, s2, a2, . . .) of states and actions such that for all i ≥ 0 we have δ(si, ai)(si+1) > 0. We write Ω for the set of all plays. For a finite prefix w ∈ (S · A)* · S of a play, we denote by Cone(w) the set of plays with w as the prefix (i.e., the cone or cylinder of the prefix w), and denote by Last(w) the last state of w.

Belief supports and updates. For a finite prefix w = (s0, a0, s1, a1, . . . , sn) we denote by γ(w) = (γ(s0), a0, γ(s1), a1, . . . , γ(sn)) the observation and action sequence associated with w. For a finite sequence ρ = (o0, a0, o1, a1, . . . , on) of observations and actions, the belief support B(ρ) after the prefix ρ is the set of states in which a finite prefix of a play can be after the sequence ρ of observations and actions, i.e., B(ρ) = {sn = Last(w) | w = (s0, a0, s1, a1, . . . , sn), w is a prefix of a play, and for all 0 ≤ i ≤ n we have γ(si) = oi}. The belief-support updates associated with finite prefixes are as follows: for prefixes w and w′ = w · a · s the belief-support update is defined inductively as B(γ(w′)) = (⋃_{s1∈B(γ(w))} Supp(δ(s1, a))) ∩ γ−1(γ(s)), i.e., the set ⋃_{s1∈B(γ(w))} Supp(δ(s1, a)) denotes the possible successors given the belief support B(γ(w)) and action a, and the intersection with the set of states with the current observation γ(s) gives the new belief support.
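The definitions above (Definition 1 and the belief-support update) translate directly into a small data structure. The following Python sketch is illustrative only; the class and function names are ours, not taken from the paper's prototype. It represents a POMDP with a dictionary-based transition function and implements the belief-support update B(γ(w′)) = (⋃_{s∈B} Supp(δ(s, a))) ∩ γ−1(o).

    from dataclasses import dataclass
    from typing import Dict, Set, Tuple

    State, Action, Obs = str, str, str

    @dataclass
    class POMDP:
        states: Set[State]
        actions: Set[Action]
        # delta[(s, a)] is a probability distribution over successor states
        delta: Dict[Tuple[State, Action], Dict[State, float]]
        obs: Dict[State, Obs]   # observation function gamma
        init: State             # unique initial state s0

    def belief_update(pomdp: POMDP, belief: Set[State], action: Action, o: Obs) -> Set[State]:
        """One belief-support update: successors of the current belief support
        under `action`, intersected with the states carrying observation `o`."""
        successors = set()
        for s in belief:
            successors |= {t for t, p in pomdp.delta[(s, action)].items() if p > 0}
        return {t for t in successors if pomdp.obs[t] == o}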
Informally, the belief (or a belief state [5]) defines the probability distribution over the possible current states, and the belief support is the support of the belief.

Strategies. A strategy (or a policy) is a recipe to extend prefixes of plays and is a function σ : (S · A)* · S → D(A) that given a finite history (i.e., a finite prefix of a play) selects a probability distribution over the actions. Since we consider POMDPs, strategies are observation-based, i.e., for all histories w = (s0, a0, s1, a1, . . . , an−1, sn) and w′ = (s0′, a0, s1′, a1, . . . , an−1, sn′) such that for all 0 ≤ i ≤ n we have γ(si) = γ(si′) (i.e., γ(w) = γ(w′)), we must have σ(w) = σ(w′). In other words, if the observation sequence is the same, then the strategy cannot distinguish between the prefixes and must play the same. We now present an equivalent definition of observation-based strategies such that the memory of the strategy is explicitly specified, which will be required to present finite-memory strategies.

Definition 2 (Strategies with memory and finite-memory strategies). A strategy with memory is a tuple σ = (σu, σn, M, m0) where:
• (Memory set). M is a denumerable set (finite or infinite) of memory elements (or memory states). • (Action selection function). The function σn : M → D(A) is the next action selection function that given the current memory state gives the probability distribution over actions.
• (Memory update function). The function σu : M × O × A → D( M ) is the memory update function that given the current memory state, the current observation and action, updates the memory state probabilistically.
• (Initial memory). The memory state m0 ∈ M is the initial memory state. A strategy is a finite-memory strategy if the set M of memory elements is finite. For a finite-memory strategy the size of the memory is | M |, i.e., the number of memory states. A strategy is pure (or deterministic) if the memory update function and the next action selection function are deterministic, i.e., σu : M × O × A → M and σn : M → A. A strategy is memoryless (or stationary) if it is independent of the history but depends only on the current observation, and can be represented as a function σ : O → D (A).
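As an illustration, a finite-memory strategy in the sense of Definition 2 can be represented and simulated as follows. This is a sketch under our own naming conventions (it assumes the POMDP representation from the earlier sketch and is not the paper's implementation); it samples a finite play prefix under the strategy and returns the average of the rewards collected along it, an empirical finite-horizon proxy that does not by itself certify the limit-average payoff defined below.

    import random
    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class FiniteMemoryStrategy:
        next_action: Dict[str, Dict[str, float]]               # sigma_n: memory -> action distribution
        update: Dict[Tuple[str, str, str], Dict[str, float]]   # sigma_u: (memory, obs, action) -> memory distribution
        init_memory: str                                       # m0

    def sample(dist: Dict[str, float]) -> str:
        values = list(dist)
        return random.choices(values, weights=[dist[v] for v in values])[0]

    def average_reward(pomdp, strategy: FiniteMemoryStrategy, reward, steps: int = 10_000) -> float:
        """Sample a finite play prefix and return the average of r(s, a) along it."""
        s, m, total = pomdp.init, strategy.init_memory, 0.0
        for _ in range(steps):
            a = sample(strategy.next_action[m])                 # sample action from sigma_n(m)
            total += reward[(s, a)]
            s = sample(pomdp.delta[(s, a)])                     # sample successor state
            m = sample(strategy.update[(m, pomdp.obs[s], a)])   # update memory via sigma_u
        return total / steps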
Probability measure. Given a strategy σ, the unique probability measure obtained given σ is denoted as Pσ(·). We first define the measure μσ(·) on cones. For w = s0 we have μσ(Cone(w)) = 1, and for w = s where s ≠ s0 we have μσ(Cone(w)) = 0; and for w′ = w · a · s we have μσ(Cone(w′)) = μσ(Cone(w)) · σ(w)(a) · δ(Last(w), a)(s). By Carathéodory's extension theorem, the function μσ(·) can be uniquely extended to a probability measure Pσ(·) over Borel sets of infinite plays [3].

Objectives. An objective in a POMDP G is a measurable set ϕ ⊆ Ω of plays. We first define the limit-average payoff (a.k.a. mean-payoff) function. Given a POMDP we consider a reward function r : S × A → [0, 1] that maps every state action pair to a real-valued reward in the interval [0, 1]. The LimAvg payoff function maps every play to a real-valued reward that is the long-run average of the rewards of the play. Formally, given a play ρ = (s0, a0, s1, a1, s2, a2, . . .) we have
LimAvg(r, ρ) = lim inf_{n→∞} (1/n) · Σ_{i=0}^{n−1} r(si, ai).
When the reward function r is clear from the context, we drop it for simplicity. For a reward function r, we consider two types of limit-average payoff constraints. 1. Qualitative constraints. The qualitative constraint limit-average objective LimAvg=1 defines the set of paths such that the limit-average payoff is 1; i.e., LimAvg=1 = {ρ | LimAvg(ρ ) = 1}. 2. Quantitative constraints. Given a threshold λ1 ∈ (0, 1), the quantitative constraint limit-average objective LimAvg>λ1 defines the set of paths such that the limit-average payoff is strictly greater than λ1 ; i.e., LimAvg>λ1 = {ρ | LimAvg(ρ ) > λ1 }. Probabilistic and almost-sure winning. Given a POMDP, an objective
ϕ , and a class C of strategies, we say that:
• a strategy σ ∈ C is almost-sure winning if Pσ(ϕ) = 1;
• a strategy σ ∈ C is probabilistic winning, for a threshold λ2 ∈ (0, 1), if Pσ(ϕ) ≥ λ2.

Theorem 1 (Results for PFA (probabilistic automata over finite words)). (See [36].) The following assertions hold for the class C of all infinite-memory as well as finite-memory strategies: (1) the probabilistic winning problem is undecidable for PFA; and (2) the almost-sure winning problem is PSPACE-complete for PFA.

Since PFA are a special case of POMDPs, the undecidability of the probabilistic winning problem for PFA implies the undecidability of the probabilistic winning problem for POMDPs with both qualitative and quantitative constraint limit-average objectives. The almost-sure winning problem is PSPACE-complete for PFA, and we study the complexity of the almost-sure winning problem for POMDPs with both qualitative and quantitative constraint limit-average objectives, under infinite-memory and finite-memory strategies.

Basic properties of Markov chains. Since our proofs will use results related to Markov chains, we start with some basic definitions and properties related to Markov chains [23].

Markov chains and recurrent classes. A Markov chain G = (S, δ) consists of a finite set S of states and a probabilistic transition function δ : S → D(S). Given the Markov chain, we consider the directed graph (S, E) where E = {(s, s′) | δ(s′ | s) > 0}. A recurrent class C ⊆ S of the Markov chain is a bottom strongly connected component (scc) in the graph (S, E) (a bottom scc is an scc with no edges out of the scc). We denote by Rec(G) the set of recurrent classes of the Markov chain, i.e., Rec(G) = {C | C is a recurrent class}. Given a state s and a set U of states, we say that U is reachable from s if there is a path from s to some state in U in the graph (S, E). Given a state s of the Markov chain we denote by Rec(G)(s) ⊆ Rec(G) the subset of the recurrent classes reachable from s in G. A state is recurrent if it belongs to a recurrent class. The following standard properties of reachability and the recurrent classes will be used in our proofs:
• Property 1. (a) For a set T ⊆ S, if for all states s ∈ S there is a path to T (i.e., for all states there is a positive probability to reach T), then from all states the set T is reached with probability 1. (b) For all states s, if the Markov chain starts at s, then the set ⋃_{C∈Rec(G)(s)} C is reached with probability 1, i.e., the set of recurrent classes is reached with probability 1.
• Property 2. For a recurrent class C, for all states s ∈ C, if the Markov chain starts at s, then for all states t ∈ C the state t is visited infinitely often with probability 1, and is visited with positive average frequency (i.e., positive limit-average frequency) with probability 1.

The following lemma is an easy consequence of the above properties.

Lemma 1. Let G = (S, δ) be a Markov chain with a reward function r : S → [0, 1]. Consider a state s ∈ S and the Markov chain with s as the initial (or starting) state. The objective LimAvg=1 is satisfied with probability 1 (almost-surely) iff for all recurrent classes C ∈ Rec(G)(s) and for all states s1 ∈ C we have r(s1) = 1.
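Lemma 1 suggests a direct algorithmic check on a finite Markov chain: compute the bottom SCCs reachable from s and verify that every state in them has reward 1. The sketch below is our own illustration (the use of the networkx library is our choice, not part of the paper's toolchain).

    import networkx as nx

    def almost_sure_limavg1(chain_edges, reward, start):
        """chain_edges: dict s -> set of successors with positive probability.
        reward: dict s -> reward in [0, 1].  Returns True iff, from `start`,
        every reachable bottom SCC contains only reward-1 states (Lemma 1)."""
        g = nx.DiGraph((s, t) for s, succs in chain_edges.items() for t in succs)
        reachable = {start} | nx.descendants(g, start)
        for scc in nx.strongly_connected_components(g):
            if not scc & reachable:
                continue
            is_bottom = all(t in scc for s in scc for t in g.successors(s))
            if is_bottom and any(reward[s] < 1 for s in scc):
                return False
        return True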
Fig. 1. Belief support is not sufficient.
Markov chains under finite-memory strategies. We now define Markov chains obtained by fixing a finite-memory strategy in a POMDP G. A finite-memory strategy σ = (σu, σn, M, m0) induces a Markov chain (S × M, δσ), denoted Gσ, with the probabilistic transition function δσ : S × M → D(S × M): given s, s′ ∈ S and m, m′ ∈ M, the transition δσ((s′, m′) | (s, m)) is the probability to go from state (s, m) to state (s′, m′) in one step under the strategy σ. The probability of transition can be decomposed as follows:
• first an action a ∈ A is sampled according to the distribution σn(m);
• then the next state s′ is sampled according to the distribution δ(s, a); and
• finally the new memory m′ is sampled according to the distribution σu(m, γ(s′), a) (i.e., the new memory is sampled according to σu given the old memory, the new observation, and the action).
More formally, we have:
δσ((s′, m′) | (s, m)) = Σ_{a∈A} σn(m)(a) · δ(s, a)(s′) · σu(m, γ(s′), a)(m′).
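As a concrete reading of this formula, the induced chain Gσ can be materialized explicitly by summing over actions. The sketch below is our own helper (assuming the POMDP and strategy representations from the earlier sketches); it returns δσ as a dictionary mapping (s, m) to a distribution over (s′, m′).

    from collections import defaultdict

    def induced_chain(pomdp, strategy):
        """Compute delta_sigma((s', m') | (s, m)) for all product states."""
        chain = defaultdict(lambda: defaultdict(float))
        for s in pomdp.states:
            for m in strategy.next_action:
                for a, pa in strategy.next_action[m].items():
                    for s2, ps in pomdp.delta[(s, a)].items():
                        for m2, pm in strategy.update[(m, pomdp.obs[s2], a)].items():
                            chain[(s, m)][(s2, m2)] += pa * ps * pm
        return chain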
Given s ∈ S and m ∈ M, we write (Gσ)(s,m) for the finite state Markov chain induced on S × M by the transition function δσ, given that the initial state is (s, m).

3. Finite-memory strategies with qualitative constraints

In this section we will establish the following three results for finite-memory strategies: (i) we show that in POMDPs with LimAvg=1 objectives belief-support-based strategies are not sufficient for almost-sure winning; (ii) we establish an exponential upper bound on the memory required by an almost-sure winning strategy for LimAvg=1 objectives; and (iii) we show that the decision problem is EXPTIME-complete.

3.1. Belief support is not sufficient

We now show with an example that there exist POMDPs with LimAvg=1 objectives where finite-memory randomized almost-sure winning strategies exist, but there exists no belief-support-based randomized almost-sure winning strategy (a belief-support-based strategy only uses memory that relies on the subset construction, where the subset denotes the possible current states, called the belief support). In our example we present the counter-example even for POMDPs with a restricted reward function r assigning only Boolean rewards 0 and 1 to the states (the reward does not depend on the action played but only on the states of the POMDP).

Example 1. We consider a POMDP with a state space {s0, X, Y, Z} and an action set {a, b}, and let U = {X, Y, Z}. From the initial state s0 all the other states are reached with uniform probability in one step, i.e., for all s′ ∈ U = {X, Y, Z} we have δ(s0, a)(s′) = δ(s0, b)(s′) = 1/3. The transitions from the other states are as follows (shown in Fig. 1): (i) δ(X, a)(Z) = 1 and δ(X, b)(Y) = 1; (ii) δ(Z, a)(Y) = 1 and δ(Z, b)(X) = 1; and (iii) δ(Y, a)(X) = δ(Y, a)(s0) = δ(Y, a)(Z) = δ(Y, b)(X) = δ(Y, b)(s0) = δ(Y, b)(Z) = 1/3. All states in U have the same observation. The initial state s0 has its own observation. The reward function r assigns reward 1 to states in U and reward 0 to the initial state s0. The belief support after one step is the set U = {X, Y, Z}, since from s0 all the states in U are reached with positive probability. The belief support can only be the set {s0} or U, since every state in U has an input edge for every action, i.e., if the current belief support is U (i.e., the set of states that the POMDP is currently in with positive probability
Fig. 2. The Markov chain G σ4 .
is U), then irrespective of whether a or b is chosen, either the observation of the initial state s0 is observed or all states of U are reached with positive probability, and hence the belief support is again U. As playing action a in the initial state has the same effect as playing action b, there are effectively three belief-support-based strategies: (i) σ1 that always plays a in belief support U; (ii) σ2 that always plays b in belief support U; or (iii) σ3 that plays both a and b in belief support U with positive probability. The Markov chains Gσ1 (resp. Gσ2 and Gσ3) are obtained by retaining the edges labeled by action a (resp. action b, and both actions a and b). For all three strategies, the Markov chains obtained contain the whole state space {s0} ∪ U as the reachable recurrent class. It follows that in all the Markov chains there exists a reachable recurrent class containing a state with reward 0, and by Lemma 1 none of the belief-support-based strategies σ1, σ2 or σ3 is almost-sure winning for the LimAvg=1 objective. The Markov chains Gσ1 and Gσ2 are also shown in Fig. 1, and the recurrent class is {s0, X, Y, Z}. The graph of Gσ3 is the same as the POMDP G (with edge labels removed), and the recurrent class is again {s0, X, Y, Z}. The strategy σ4 that plays actions a and b alternately gives rise to the Markov chain Gσ4 (shown in Fig. 2) where the recurrent class does not contain s0, and it is a finite-memory almost-sure winning strategy for the LimAvg=1 objective. □

Significance of Example 1. We now discuss the significance of Example 1. If we consider the problem of almost-sure winning for PFA, then it coincides with non-emptiness of universal automata, because if there is a finite word that is accepted with probability 1 then it must be accepted along all probabilistic branches (i.e., the probabilistic choices can be interpreted as universal choices). Since the non-emptiness problem of universal automata is similar to universality of non-deterministic automata, the almost-sure winning problem for PFA can be solved by the standard subset construction for automata. The implication of our counter-example (Example 1) is that since belief-support-based strategies are not sufficient even for qualitative limit-average objectives, the standard subset construction techniques from automata theory will not suffice to solve the problem.

3.2. Strategy complexity

In this section we will present memory bounds for finite-memory almost-sure winning strategies. Hence we assume that there exists a finite-memory almost-sure winning strategy, and for the rest of the subsection we fix a finite-memory almost-sure winning strategy σ = (σu, σn, M, m0) on the POMDP G = (S, A, δ, O, γ, s0) with a reward function r for the objective LimAvg=1. Our goal is to construct an almost-sure winning strategy for the LimAvg=1 objective with memory size (i.e., the number of memory states) at most Mem* = 2^(3·|S|) · 2^(|A|). We start with a few definitions associated with the strategy σ. For m ∈ M:
• The function RecFunσ(m) : S → {0, 1} is such that RecFunσ(m)(s) is 1 iff the state (s, m) is recurrent in the Markov chain Gσ, and 0 otherwise. (RecFun stands for recurrence function.)
• The function AWFunσ(m) : S → {0, 1} is such that AWFunσ(m)(s) is 1 iff the state (s, m) is almost-sure winning for the LimAvg=1 objective in the Markov chain Gσ, and 0 otherwise. (AWFun stands for almost-sure win function.)
• We also consider Actσ(m) = Supp(σn(m)) that for every memory element gives the support of the probability distribution over actions played at m.

Remark 3. Let (s′, m′) be a state reachable from (s, m) in the Markov chain Gσ. If the state (s, m) is almost-sure winning for the LimAvg=1 objective, then the state (s′, m′) is also almost-sure winning for the LimAvg=1 objective.

Collapsed graph of σ. Given the strategy σ we define the notion of a collapsed graph CoGr(σ) = (V, E), where the vertices of the graph are elements from the set V = {(Y, AWFunσ(m), RecFunσ(m), Actσ(m)) | Y ⊆ S and m ∈ M} and the initial vertex is ({s0}, AWFunσ(m0), RecFunσ(m0), Actσ(m0)). The edges in E are labeled by actions in A. Intuitively, the action labeled
edges of the graph depict the updates of the belief support and the functions upon a particular action. Formally, there is an edge
(Y, AWFunσ(m), RecFunσ(m), Actσ(m)) −a→ (Y′, AWFunσ(m′), RecFunσ(m′), Actσ(m′))

in the collapsed graph CoGr(σ) iff there exists an observation o ∈ O such that
1. the action a ∈ Actσ(m);
2. the set Y′ is non-empty and it is the belief-support update from Y, under action a and the observation o, i.e., Y′ = (⋃_{s∈Y} Supp(δ(s, a))) ∩ γ−1(o); and
3. m′ ∈ Supp(σu(m, o, a)).

Note that the number of vertices in the graph is bounded by |V| ≤ Mem*. In the following lemma we establish the connection of the functions RecFunσ(m) and AWFunσ(m) with the edges of the collapsed graph. Intuitively, the lemma shows that when the function RecFunσ(m) (resp. AWFunσ(m)) is set to 1 for a state s of a vertex of the collapsed graph, then for all successor vertices along the edges in the collapsed graph, the function RecFun (resp. AWFun) is also set to 1 for the successors of state s.
Lemma 2. Let (Y, W, R, A) −a→ (Y′, W′, R′, A′) be an edge in the collapsed graph CoGr(σ) = (V, E). Then for all s ∈ Y the following assertions hold:
1. If W(s) = 1, then for all s′ ∈ Supp(δ(s, a)) ∩ Y′ we have that W′(s′) = 1.
2. If R(s) = 1, then for all s′ ∈ Supp(δ(s, a)) ∩ Y′ we have that R′(s′) = 1.

Proof. We present the proof of both items below.
1. Let (Y, W, R, A) −a→ (Y′, W′, R′, A′) be an edge in the collapsed graph CoGr(σ) = (V, E), let s ∈ Y be a state such that W(s) = 1, and consider s′ ∈ Supp(δ(s, a)) ∩ Y′. It follows that there exist memories m, m′ ∈ M and an observation o ∈ O such that (i) W = AWFunσ(m); (ii) W′ = AWFunσ(m′); (iii) a ∈ Supp(σn(m)); (iv) m′ ∈ Supp(σu(m, o, a)); and finally (v) γ(s′) = o. From all the points above it follows that there exists an edge (s, m) → (s′, m′) in the Markov chain Gσ. As the state (s, m) is almost-sure winning (since W(s) = 1), it follows by Remark 3 that the state (s′, m′) must also be almost-sure winning and therefore W′(s′) = 1.
2. As in the proof of the first part we have that (s′, m′) is reachable from (s, m) in the Markov chain Gσ. As every state reachable from a recurrent state in a Markov chain is also recurrent, we have R′(s′) = 1.

The desired result follows. □
We now define the collapsed strategy for σ. Intuitively, we collapse memory elements of the original strategy σ whenever they agree on all of the RecFun, AWFun, and Act functions. The collapsed strategy plays uniformly all the actions from the set given by Act in the collapsed state.

Collapsed strategy. We now construct the collapsed strategy σ′ = (σu′, σn′, M′, m0′) of σ based on the collapsed graph CoGr(σ) = (V, E). We will refer to this construction by σ′ = CoSt(σ).
• The memory set M′ is the set of vertices of the collapsed graph CoGr(σ) = (V, E), i.e., M′ = V = {(Y, AWFunσ(m), RecFunσ(m), Actσ(m)) | Y ⊆ S and m ∈ M}.
• The initial memory is m0′ = ({s0}, AWFunσ(m0), RecFunσ(m0), Actσ(m0)).
• The next action function σn′ given a memory (Y, W, R, A) ∈ M′ is the uniform distribution over the set of actions {a | ∃(Y′, W′, R′, A′) ∈ M′ such that (Y, W, R, A) −a→ (Y′, W′, R′, A′) ∈ E}, where E are the edges of the collapsed graph.
• The memory update function σu′((Y, W, R, A), o, a), given a memory element (Y, W, R, A) ∈ M′, a ∈ A, and o ∈ O, is the uniform distribution over the set of vertices {(Y′, W′, R′, A′) | (Y, W, R, A) −a→ (Y′, W′, R′, A′) ∈ E and Y′ ⊆ γ−1(o)}.

The following lemma intuitively shows that the collapsed strategy can reach all the states the original strategy could reach.

Lemma 3. Let σ′ = CoSt(σ) be the collapsed strategy, s, s′ ∈ S, and m, m′ ∈ M. If (s′, m′) is reachable from (s, m) in Gσ, then for all belief supports Y ⊆ S with s ∈ Y there exists a belief support Y′ ⊆ S with s′ ∈ Y′ such that the state (s′, Y′, AWFunσ(m′), RecFunσ(m′), Actσ(m′)) is reachable from (s, Y, AWFunσ(m), RecFunσ(m), Actσ(m)) in Gσ′.

Proof. We start the proof with one-step reachability. Assume there is an edge (s, m) → (s′, m′) in the Markov chain Gσ and the action labeling the transition in the POMDP is a, i.e., (i) s′ ∈ Supp(δ(s, a)); (ii) a ∈ Supp(σn(m)); and
(iii) m′ ∈ Supp(σu(m, γ(s′), a)). Let (s, Y, AWFunσ(m), RecFunσ(m), Actσ(m)) be a state in Gσ′. It follows that a ∈ Actσ(m) and there exist Y′ ⊆ S and an edge (Y, AWFunσ(m), RecFunσ(m), Actσ(m)) −a→ (Y′, AWFunσ(m′), RecFunσ(m′), Actσ(m′)) in the collapsed graph CoGr(σ). Therefore, action a is played with positive probability by the strategy σ′, and with positive probability the memory is updated to (Y′, AWFunσ(m′), RecFunσ(m′), Actσ(m′)). Therefore there is an edge (s, Y, AWFunσ(m), RecFunσ(m), Actσ(m)) → (s′, Y′, AWFunσ(m′), RecFunσ(m′), Actσ(m′)) in the Markov chain Gσ′. We finish the proof by extending the one-step reachability to general reachability. If (s′, m′) is reachable from (s, m) in Gσ then there exists a finite path. Applying the one-step reachability to every transition in the path gives us the desired result. □

Random variable notation. For all n ≥ 0 we use the following notations for random variables: let Λn denote the random variable for the nth state of the Markov chain Gσ′. We write Xn for the projection of Λn on the S component, Yn for the belief support P(S) component of Λn, Wn for the AWFunσ component of Λn, Rn for the RecFunσ component of Λn, An for the Actσ component of Λn, and Ln for the nth action in the Markov chain, respectively.

Run of the Markov chain Gσ′. A run of the Markov chain Gσ′ is an infinite sequence
(X0, Y0, W0, R0, A0) −L0→ (X1, Y1, W1, R1, A1) −L1→ · · ·

such that each finite prefix of the run is generated with positive probability in the Markov chain, i.e., for all i ≥ 0, we have (i) Li ∈ Supp(σn′((Yi, Wi, Ri, Ai))); (ii) Xi+1 ∈ Supp(δ(Xi, Li)); and (iii) (Yi+1, Wi+1, Ri+1, Ai+1) ∈ Supp(σu′((Yi, Wi, Ri, Ai), γ(Xi+1), Li)). In the following lemma we establish important properties of the Markov chain Gσ′ that are essential for our proof.
Lemma 4. Let (X0, Y0, W0, R0, A0) −L0→ (X1, Y1, W1, R1, A1) −L1→ · · · be a run of the Markov chain Gσ′. Then the following assertions hold for all i ≥ 0:
1. Xi+1 ∈ Supp(δ(Xi, Li)) ∩ Yi+1;
2. (Yi, Wi, Ri, Ai) −Li→ (Yi+1, Wi+1, Ri+1, Ai+1) is an edge in the collapsed graph CoGr(σ);
3. if Wi(Xi) = 1, then Wi+1(Xi+1) = 1;
4. if Ri(Xi) = 1, then Ri+1(Xi+1) = 1; and
5. if Wi(Xi) = 1 and Ri(Xi) = 1, then r(Xi, Li) = 1.
Proof. We prove all the points below:
1. The first point follows directly from the definition of the Markov chain and the collapsed strategy σ′.
2. The second point follows from the definition of the collapsed strategy σ′.
3. The third point follows from the first two points of this lemma and the first point of Lemma 2.
4. The fourth point follows from the first two points of this lemma and the second point of Lemma 2.
5. For the fifth point consider that Wi(Xi) = 1 and Ri(Xi) = 1. Then there exists a memory m ∈ M such that (i) AWFunσ(m) = Wi, and (ii) RecFunσ(m) = Ri. Moreover, the state (Xi, m) is recurrent (since Ri(Xi) = 1) and an almost-sure winning state (since Wi(Xi) = 1) in the Markov chain Gσ. As Li ∈ Actσ(m) it follows that Li ∈ Supp(σn(m)), i.e., the action Li is played with positive probability in state Xi given memory m, and (Xi, m) is in an almost-sure winning recurrent class. By Lemma 1 it follows that the reward r(Xi, Li) must be 1.

The desired result follows. □
We now introduce the final notion of a collapsed-recurrent state that is required to complete the proof. A state (X, Y, W, R, A) of the Markov chain Gσ′ is collapsed-recurrent if, for all memory elements m ∈ M that were merged into the memory element (Y, W, R, A), the state (X, m) of the Markov chain Gσ is recurrent. It will turn out that every recurrent state of the Markov chain Gσ′ is also collapsed-recurrent.

Definition 3. A state (X, Y, W, R, A) of the Markov chain Gσ′ is called collapsed-recurrent iff R(X) = 1.

Note that due to point 4 of Lemma 4 all the states reachable from a collapsed-recurrent state are also collapsed-recurrent. In the following lemma we show that the set of collapsed-recurrent states is reached with probability 1.

Lemma 5. With probability 1 a run of the Markov chain Gσ′ reaches a collapsed-recurrent state.

Proof. We show that from every state (X, Y, W, R, A) in the Markov chain Gσ′ there exists a reachable collapsed-recurrent state. Consider a memory element m ∈ M such that (i) AWFunσ(m) = W, (ii) RecFunσ(m) = R, and (iii) Actσ(m) = A.
Consider the state (X, m) in the Markov chain Gσ. By Property 1(b) of Markov chains, from every state in a Markov chain the set of recurrent states is reached with probability 1; in particular there exists a reachable recurrent state (X′, m′) such that RecFunσ(m′)(X′) = 1. By Lemma 3, there exists Y′ ⊆ S such that the state (X′, Y′, AWFunσ(m′), RecFunσ(m′), Actσ(m′)) is reachable from (X, Y, W, R, A) in the Markov chain Gσ′, and moreover this state is collapsed-recurrent. As this is true for every state, there is a positive probability of reaching a collapsed-recurrent state from every state in Gσ′. This ensures by Property 1(a) of Markov chains that collapsed-recurrent states are reached with probability 1. □

Lemma 6. The collapsed strategy σ′ is a finite-memory almost-sure winning strategy for the LimAvg=1 objective on the POMDP G with the reward function r.

Proof. The initial state of the Markov chain Gσ′ is (s0, {s0}, AWFunσ(m0), RecFunσ(m0), Actσ(m0)), and as the strategy σ is an almost-sure winning strategy we have that AWFunσ(m0)(s0) = 1. It follows from the third point of Lemma 4 that every reachable state (X, Y, W, R, A) in the Markov chain Gσ′ satisfies W(X) = 1. From every state a collapsed-recurrent state is reached with probability 1. It follows that all the recurrent states in the Markov chain Gσ′ are also collapsed-recurrent states. As in all reachable states (X, Y, W, R, A) we have W(X) = 1, by the fifth point of Lemma 4 it follows that every action L played in a collapsed-recurrent state (X, Y, W, R, A) satisfies r(X, L) = 1. As this is true for every reachable recurrent class, the fact that the collapsed strategy is an almost-sure winning strategy for the LimAvg=1 objective follows from Lemma 1. □

Theorem 2 (Strategy complexity). The following assertions hold: (1) If there exists a finite-memory almost-sure winning strategy in the POMDP G = (S, A, δ, O, γ, s0) with reward function r for the LimAvg=1 objective, then there exists a finite-memory almost-sure winning strategy with memory size at most 2^(3·|S|+|A|). (2) Finite-memory almost-sure winning strategies for LimAvg=1 objectives in POMDPs in general require exponential memory, and belief-support-based strategies are not sufficient.

Proof. The first item follows from Lemma 6 and the fact that the size of the memory set of the collapsed strategy σ′ of any finite-memory strategy σ (which is the size of the vertex set of the collapsed graph of σ) is bounded by 2^(3·|S|+|A|). The second item is obtained as follows: (i) the exponential memory requirement follows from the almost-sure winning problem for PFA, as the shortest witness to the non-emptiness problem for universal automata is exponential [25]; and (ii) the fact that belief-support-based strategies are not sufficient follows from Example 1. □

3.3. Computational complexity

We will present an exponential time algorithm for the almost-sure winning problem in POMDPs with LimAvg=1 objectives under finite-memory strategies. A naive double-exponential algorithm would be to enumerate all finite-memory strategies with memory bounded by 2^(3·|S|+|A|) (by Theorem 2).
Our improved algorithm consists of two steps: (i) first we construct a special type of POMDP, a belief-observation POMDP, and we show that there exists a finite-memory almost-sure winning strategy for the objective LimAvg=1 in the original POMDP iff there exists a randomized memoryless almost-sure winning strategy in the belief-observation POMDP for the LimAvg=1 objective; and (ii) then we show how to determine whether there exists a randomized memoryless almost-sure winning strategy in the belief-observation POMDP for the LimAvg=1 objective in polynomial time with respect to the size of the belief-observation POMDP. Intuitively, a belief-observation POMDP satisfies that the current belief support is always the set of states with the current observation.

Definition 4. A POMDP G = (S, A, δ, O, γ, s0) is a belief-observation POMDP iff for every finite prefix w = (s0, a0, s1, a1, . . . , an−1, sn) the belief support associated with the observation sequence ρ = γ(w) is exactly the set of states which have the observation γ(sn), which is the last observation of the sequence ρ. Formally, a POMDP is a belief-observation POMDP iff for every finite prefix w with observation sequence ρ = γ(w) we have B(ρ) = γ−1(γ(sn)).

Overview of the approach. We will split our algorithmic analysis into three parts.

1. First, in Section 3.3.1 we present the construction that given a POMDP G constructs an exponential size belief-observation POMDP Ḡ such that there exists a finite-memory almost-sure winning strategy for the objective LimAvg=1 in G iff there exists a randomized memoryless almost-sure winning strategy in Ḡ for the LimAvg=1 objective.
2. Second, in Section 3.3.2 we show that deciding the existence of a randomized memoryless almost-sure winning strategy in Ḡ can be reduced to the computation of almost-sure winning strategies for safety and reachability objectives.
3. Finally, in Section 3.3.3 we show that for belief-observation POMDPs the almost-sure winning computation for safety and reachability objectives can be achieved in polynomial time.

Combining the above three steps, we obtain an exponential-time algorithm to determine the existence of finite-memory almost-sure winning strategies for POMDPs with LimAvg=1 objectives. Fig. 3 gives a pictorial overview of the three steps.
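Definition 4 can also be checked directly on small POMDPs by exploring the reachable belief supports: a POMDP is belief-observation iff every reachable belief support equals the full observation class of its states. The sketch below is our own illustration (it assumes the POMDP representation from the earlier sketch and is not part of the paper's prototype).

    def is_belief_observation(pomdp) -> bool:
        """Explore reachable belief supports; check each equals gamma^{-1}(o)."""
        initial = frozenset({pomdp.init})
        stack, seen = [initial], {initial}
        while stack:
            belief = stack.pop()
            o = pomdp.obs[next(iter(belief))]
            if belief != {s for s in pomdp.states if pomdp.obs[s] == o}:
                return False
            for a in pomdp.actions:
                successors = set()
                for s in belief:
                    successors |= {t for t, p in pomdp.delta[(s, a)].items() if p > 0}
                for o2 in {pomdp.obs[t] for t in successors}:
                    nxt = frozenset(t for t in successors if pomdp.obs[t] == o2)
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        return True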
Fig. 3. Overview of the steps of the algorithm.
3.3.1. Construction of the belief-observation POMDP

Given a POMDP G with the LimAvg=1 objective specified by a reward function r, we construct a belief-observation POMDP Ḡ with the LimAvg=1 objective specified by a reward function r̄, such that there exists a finite-memory almost-sure winning strategy ensuring the objective in the POMDP G iff there exists a randomized memoryless almost-sure winning strategy in the POMDP Ḡ. We will refer to this construction as Ḡ = Red(G).

Intuition of the construction. Intuitively, the construction proceeds as follows: if there exists an almost-sure winning finite-memory strategy, then there exists an almost-sure winning strategy with memory bounded by 2^(3·|S|+|A|) (by Theorem 2). This allows us to consider memory elements M = 2^S × {0, 1}^|S| × {0, 1}^|S| × 2^A, and intuitively to construct the product of the memory M with the POMDP G. The key intuition that makes the construction work is as follows: by considering memory elements M we essentially consider all possible collapsed strategies of finite-memory strategies. Hence on the product of the POMDP with the memory elements, we essentially search for a witness collapsed strategy of a finite-memory almost-sure winning strategy.

Formal construction of Ḡ = Red(G). We proceed with the formal construction: let G = (S, A, δ, O, γ, s0) be a POMDP with a reward function r. We construct a POMDP Ḡ = (S̄, Ā, δ̄, Ō, γ̄, s̄0) with a reward function r̄ as follows:
• The set of states S̄ = Sa ∪ Sm ∪ {s̄0, sl} consists of the action selection states Sa = S × M; the memory selection states Sm = S × 2^S × M × A; an initial state s̄0; and an absorbing losing state sl.
• The set of actions Ā is A ∪ M, i.e., the actions A from the POMDP G to simulate action playing and the memory elements M to simulate memory updates.
• The observation set is Ō = M ∪ (2^S × M × A) ∪ {s̄0} ∪ {sl}; intuitively the first component of a state cannot be observed, i.e., only the memory part of the state remains visible, and the two newly added states have their own observations.
• The observation mapping is defined as γ̄((s, m)) = m, γ̄((s, Y, m, a)) = (Y, m, a), γ̄(s̄0) = s̄0 and γ̄(sl) = sl.
• As the precise probabilities do not matter for computing almost-sure winning states under finite-memory strategies for the LimAvg=1 objective, we specify the transition function only as edges of the POMDP graph, and the probabilities are uniform over the support set.

First we define the set of winning memories MWin ⊆ M as follows:
MWin = {(Y, W, R, A) | for all s ∈ Y we have W(s) = 1}.

Note that every reachable memory m of the collapsed strategy of any finite-memory almost-sure winning strategy σ in the POMDP G must belong to MWin. We define the transition function δ̄ in the following steps:

1. For every memory element m ∈ MWin ∩ {({s0}, W, R, A) | ({s0}, W, R, A) ∈ M} we add an edge s̄0 −m→ (s0, m). Intuitively, the set from which m is chosen contains all the possible initial memories of an almost-sure winning collapsed strategy.
2. We say an action a ∈ A is enabled in the observation (Y, W, R, A) iff (i) a ∈ A, and (ii) for all ŝ ∈ Y, if W(ŝ) = 1 and R(ŝ) = 1, then r(ŝ, a) = 1. Intuitively, a collapsed strategy only plays enabled actions; therefore every action a ∈ A possibly played by a collapsed strategy remains available in the corresponding state of the POMDP. This fact follows from the fifth point of Lemma 4. For an action a ∈ A that is enabled in observation (Y, W, R, A) we have an
edge (s, (Y, W, R, A)) −a→ (s′, Y′, a, (Y, W, R, A)) iff both of the following conditions are satisfied: (i) s′ ∈ Supp(δ(s, a)); and (ii) Y′ = (⋃_{ŝ∈Y} Supp(δ(ŝ, a))) ∩ γ−1(γ(s′)), i.e., Y′ is the belief-support update from belief support Y, under action a and observation γ(s′). For an action a ∈ A that is not enabled, we have an edge to the losing absorbing state sl, i.e., (s, (Y, W, R, A)) −a→ sl.
3. We say an action (Y′, W′, R′, A′) ∈ M is enabled in the observation (Y′, a, (Y, W, R, A)) iff (i) for all states ŝ ∈ Y, if W(ŝ) = 1, then for all ŝ′ ∈ Supp(δ(ŝ, a)) ∩ Y′ we have W′(ŝ′) = 1, and (ii) for all states ŝ ∈ Y, if R(ŝ) = 1, then for all ŝ′ ∈ Supp(δ(ŝ, a)) ∩ Y′ we have R′(ŝ′) = 1. Observe that by Lemma 2, the memory updates in a collapsed strategy correspond to enabled actions. For an action m that is enabled in the observation (Y′, a, (Y, W, R, A)) we have
an edge (s′, Y′, a, (Y, W, R, A)) −m→ (s′, m); otherwise we have an edge to the losing absorbing state sl, i.e., an edge (s′, Y′, a, (Y, W, R, A)) −m→ sl if m is not enabled.
4. The state sl is an absorbing state, i.e., for all a ∈ Ā we have sl −a→ sl.

The reward function r̄ is defined using the reward function r, i.e., r̄((s, (Y, W, R, A)), a) = r(s, a), and r̄((s′, Y′, a, (Y, W, R, A)), a) = 1 for all a ∈ Ā. The reward for the initial state may be set to an arbitrary value, as the initial state is visited only once. The rewards in the absorbing state sl are 0 for all actions.

Intuition of the next lemma. In the POMDP Ḡ the belief support is already included in the state space of the POMDP itself, and the belief support represents exactly the set of states in which the POMDP is with positive probability. It follows that Ḡ is a belief-observation POMDP, which we formally establish in the following lemma.

Lemma 7. The POMDP Ḡ is a belief-observation POMDP.

Proof. Note that the observations are defined in such a way that the first component of a state cannot be observed. Given a sequence w of states and actions in Ḡ with the observation sequence ρ = γ̄(w), we will show that the possible first components of the states in the belief support B(ρ) are equal to the belief-support component Y of the observation. Intuitively the proof holds as the Y component is the belief support and the belief support already represents exactly the set of states in which the POMDP can be with positive probability. We now present the formal argument. Let us denote by Proj1(B(ρ)) ⊆ S the projection on the first component of the states in the belief support. One inclusion is trivial, since for every reachable state (s, (Y, W, R, A)) we have s ∈ Y (resp. for states (s′, Y′, (Y, W, R, A), a) we have that s′ ∈ Y′). Therefore we have Proj1(B(ρ)) ⊆ Y (resp. Y′). We prove the second inclusion by induction with respect to the length of the play prefix:
• Base case: We show the base case for prefixes of length 1 and 2. The first observation is always {s̄0}, which contains only a single state, so there is nothing to prove. Similarly, the second observation in the POMDP is of the form ({s0}, W, R, A), and the argument is the same.
• Induction step: Let us consider a prefix w′ = w · a · (s′, Y′, (Y, W, R, A), a) where a ∈ A and the last transition is (s, (Y, W, R, A)) −a→ (s′, Y′, (Y, W, R, A), a) in the POMDP Ḡ. By the induction hypothesis we have that B(γ̄(w)) = {(s, (Y, W, R, A)) | s ∈ Y}. The new belief support is computed (by definition) as

B(γ̄(w′)) = (⋃_{s∈Y} Supp(δ̄((s, (Y, W, R, A)), a))) ∩ γ̄−1((Y′, (Y, W, R, A), a)).

Let sY′ be a state in Y′; we want to show that (sY′, Y′, (Y, W, R, A), a) is in B(γ̄(w′)). Due to the definition of the belief-support update there exists a state sY in Y such that sY −a→ sY′ in G and (sY, (Y, W, R, A)) ∈ B(γ̄(w)). As γ(sY′) = γ(s′), it follows that (sY′, Y′, (Y, W, R, A), a) ∈ ⋃_{s∈B(γ̄(w))} Supp(δ̄(s, a)), and as (sY′, Y′, (Y, W, R, A), a) ∈ γ̄−1((Y′, (Y, W, R, A), a)), the result follows.
The case when the prefix is extended with a memory action m ∈ M is simpler as the first two components do not change during the transition.
The desired result follows. □
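A sanity check of the enabling conditions in the construction above is straightforward to script. The helper below is our own illustrative code (not the paper's prototype); it implements the condition of step 2, i.e., an action a is enabled in observation (Y, W, R, A) iff a ∈ A and every state of Y that is both winning and recurrent gets reward 1 under a.

    def enabled_actions(Y, W, R, A, reward):
        """Step-2 enabling condition: Y is a set of states, W and R map states
        to 0/1, A is a set of actions, reward maps (state, action) to [0, 1]."""
        return {a for a in A
                if all(reward[(s, a)] == 1 for s in Y if W[s] == 1 and R[s] == 1)}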
We now show that the existence of a finite-memory almost-sure winning strategy for the objective LimAvg=1 in the POMDP G implies the existence of a randomized memoryless almost-sure winning strategy in the POMDP Ḡ, and vice versa.

Intuition of the following two lemmas. We have established in Theorem 2 that if there exists a finite-memory almost-sure winning strategy for the objective LimAvg=1 in a POMDP G, then there also exists a collapsed almost-sure winning strategy with memory elements that are by construction already present in the state space of the belief-observation POMDP Ḡ. In every state of the POMDP Ḡ the part of the state space which corresponds to the memory elements of the collapsed strategy is perfectly observable. The transition function of the POMDP Ḡ respects the possible memory updates of the collapsed strategy. Intuitively, it follows that the collapsed strategy in the POMDP G can be adapted to a memoryless strategy in the belief-observation POMDP Ḡ by considering the observable parts of the state space instead of its memory, and vice versa.

Lemma 8. If there exists a finite-memory almost-sure winning strategy for the LimAvg=1 objective with the reward function r in the POMDP G, then there exists a memoryless almost-sure winning strategy for the LimAvg=1 objective with the reward function r̄ in the POMDP Ḡ.

Proof. Assume there is a finite-memory almost-sure winning strategy σ in the POMDP G; then the collapsed strategy σ′ = (σu′, σn′, M′, ({s0}, W, R, A)) = CoSt(σ) is also an almost-sure winning strategy in the POMDP G. We define the memoryless strategy σ̄ : Ō → D(Ā) as follows:
• In the initial observation {s0} play the action ({s0}, W, R, A).
• In an observation (Y, W, R, A) play σ'((Y, W, R, A)) = σn((Y, W, R, A)).
• In an observation (Y', a, (Y, W, R, A)) we denote by o the unique observation such that Y' ⊆ γ^{-1}(o); then σ'(Y', a, (Y, W, R, A)) = σu((Y, W, R, A), o, a).
• In the observation {s_l}, no matter what is played, the losing absorbing state is not left.

Let G1 and G2 be the Markov chains obtained by fixing the collapsed strategy CoSt(σ) in the POMDP G and the memoryless strategy σ' in the belief-observation POMDP, respectively. We will for simplicity collapse the edges in the Markov chain G2, i.e., we consider an edge from (s, (Y, W, R, A)) to (s', (Y', W', R', A')) under action a whenever G2 contains an edge from (s, (Y, W, R, A)) to (s', Y', a, (Y, W, R, A)) under action a, followed by an edge from (s', Y', a, (Y, W, R, A)) to (s', (Y', W', R', A')) under the memory action (Y', W', R', A'). Note that in the first step the Markov chain G2 reaches a state (s0, ({s0}, W, R, A)), and one can observe that the Markov chains reachable from the initial state in G1 and from the state (s0, ({s0}, W, R, A)) in G2 are isomorphic (when collapsed edges are considered). By Lemma 1, all the reachable recurrent classes in the Markov chain G1 have all rewards equal to 1. As all the intermediate states in the Markov chain G2 that were collapsed have reward 1 for all actions, it follows that all the reachable recurrent classes in G2 have all rewards equal to 1 as well. By Lemma 1 it follows that σ' is a memoryless almost-sure winning strategy in the belief-observation POMDP for the objective LimAvg=1. □

Lemma 9. If there exists a memoryless almost-sure winning strategy for the LimAvg=1 objective in the belief-observation POMDP, then there exists a finite-memory almost-sure winning strategy for the LimAvg=1 objective with the reward function r in the POMDP G.

Proof. Let σ be a memoryless almost-sure winning strategy. Intuitively, we use everything from the state except the first component as the memory in the constructed finite-memory strategy σ' = (σ'_u, σ'_n, M, m0), i.e.,
• σ'_n((Y, W, R, A)) = σ((Y, W, R, A)).
• σ'_u((Y, W, R, A), o, a) updates uniformly to the elements of the set Supp(σ((Y', a, (Y, W, R, A)))), where Y' is the belief-support update from Y under observation o and action a.
• m0 = σ({s0}); in general this can be a probability distribution, while in our setting the initial memory is deterministic. However, this can be modeled by adding an additional initial state from which the required (possibly randomized) memory update is performed.
For simplicity we collapse edges in the Markov chain G2 obtained by fixing the memoryless strategy σ in the belief-observation POMDP, and write an edge from (s, (Y, W, R, A)) to (s', (Y', W', R', A')) under action a whenever there is an edge from (s, (Y, W, R, A)) to (s', Y', a, (Y, W, R, A)) under action a, followed by an edge to (s', (Y', W', R', A')) under the memory action (Y', W', R', A'). One can observe that the graphs reachable from the state (s0, ({s0}, W, R, A)) in the Markov chain G1, obtained by fixing the constructed finite-memory strategy σ' in the POMDP G, and from the state (s0, ({s0}, W, R, A)) in the Markov chain G2 are isomorphic when the collapsed edges are considered. As the strategy σ is almost-sure winning, it follows that the losing absorbing state is not reachable in G2, and by Lemma 1, in every reachable recurrent class all the rewards are 1. It follows that in every reachable recurrent class in G1 all the rewards are 1, and the desired result follows. □

3.3.2. Analysis of belief-observation POMDPs

We present a polynomial-time algorithm to determine the set of states from which there exists a memoryless almost-sure winning strategy for the objective LimAvg=1 in a belief-observation POMDP G = (S, A, δ, O, γ, s0). In this section we show how the computation is reduced to the analysis of safety and reachability objectives. For simplicity of presentation we enhance the belief-observation POMDP G with an additional function Γ : O → P(A) \ {∅} denoting the set of actions that are available in an observation, i.e., in this part we consider POMDPs given as a tuple Red(G) = (S, A, δ, O, γ, Γ, s0). Note that this does not increase the expressive power of the model, as an action not available in an observation may be simulated by an edge leading to a losing absorbing state.

Almost-sure winning observations. Given a POMDP G = (S, A, δ, O, γ, Γ, s0) and an objective ψ, let AlmostM(ψ) denote the set of observations o ∈ O such that there exists a memoryless almost-sure winning strategy ensuring the objective from every state s ∈ γ^{-1}(o), i.e., AlmostM(ψ) = {o ∈ O | there exists a memoryless strategy σ such that for all s ∈ γ^{-1}(o) we have P_σ^s(ψ) = 1}. Our goal is to compute the set AlmostM(LimAvg=1) for the belief-observation POMDP G = Red(G). Our algorithm reduces the computation to safety and reachability objectives, which are defined as follows:
• Reachability and safety objectives. Given a set T ⊆ S of target states, the reachability objective Reach(T) = {(s0, a0, s1, a1, s2, ...) ∈ Ω | ∃k ≥ 0 : sk ∈ T} requires that a target state in T is visited at least once. Dually, given a set F ⊆ S of safe states, the safety objective Safe(F) = {(s0, a0, s1, a1, s2, ...) ∈ Ω | ∀k ≥ 0 : sk ∈ F} requires that only states in F are visited.

In the first step of the computation of the winning observations we restrict the set of available actions in the belief-observation POMDP G. Observe that any almost-sure winning strategy must avoid reaching the losing absorbing state s_l. Therefore, we restrict the actions in the POMDP to the so-called allowable (or safe) actions.
• (Allow). Given a set of observations O' ⊆ O and an observation o, we define Allow(o, O') as the set of actions that, when played in observation o (in any state of γ^{-1}(o)), ensure that the next observation is in O', i.e.,

Allow(o, O') = { a ∈ Γ(o) | γ( ⋃_{s ∈ γ^{-1}(o)} Supp(δ(s, a)) ) ⊆ O' }.
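As a concrete illustration, Allow is again a plain set computation. A minimal sketch under the same kind of assumed toy representation as before (all names are illustrative, not from the paper's implementation):

```java
import java.util.*;

// Sketch of Allow(o, O'): actions available in observation o whose possible
// successors, from every state with observation o, stay inside the
// observation set O'.
public final class AllowSketch {

    static Set<String> allow(int o, Set<Integer> allowedObs,
                             Map<Integer, Set<Integer>> statesOfObs,      // gamma^{-1}
                             Map<Integer, Set<String>> availableActions,  // Gamma
                             Map<String, Set<Integer>> supp,              // Supp(delta(s, a)), keyed "s#a"
                             Map<Integer, Integer> obsOf) {               // gamma
        Set<String> result = new HashSet<>();
        for (String a : availableActions.getOrDefault(o, Set.of())) {
            boolean safe = true;
            for (int s : statesOfObs.getOrDefault(o, Set.of()))
                for (int t : supp.getOrDefault(s + "#" + a, Set.<Integer>of()))
                    if (!allowedObs.contains(obsOf.get(t))) safe = false;
            if (safe) result.add(a);
        }
        return result;
    }

    public static void main(String[] args) {
        // Observation 0 contains state 0; action "a" leads to a state with observation 1.
        Set<String> safeActions = allow(0, Set.of(0, 1),
                Map.of(0, Set.of(0), 1, Set.of(1)),
                Map.of(0, Set.of("a")),
                Map.of("0#a", Set.of(1)),
                Map.of(0, 0, 1, 1));
        System.out.println(safeActions); // [a], since observation 1 is allowed
    }
}
```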
We will denote by S_good = S \ {s_l} the set of states of the POMDP without the losing absorbing state. We compute the set of observations AlmostM(Safe(S_good)) from which there exists a memoryless strategy ensuring almost-surely that the losing absorbing state s_l is not visited.

Formal construction of G̃. We construct a restricted POMDP G̃ = (S̃, Ã, δ̃, Õ, γ̃, Γ̃, s̃0), where Ã = A, s̃0 = s0, and the rest is defined as follows:

• the set of states is restricted to S̃ = γ^{-1}(AlmostM(Safe(S_good))),
• the set of safe actions available in an observation õ is Γ̃(õ) = Γ(õ) ∩ Allow(õ, AlmostM(Safe(S_good))),
• the transition function δ̃ is defined as δ but restricted to the states S̃, and
• the observation set is Õ = AlmostM(Safe(S_good)) and the observation mapping γ̃ is defined as γ restricted to the states S̃.

Note that any memoryless almost-sure winning strategy σ for the objective LimAvg=1 on the POMDP G can be interpreted on the restricted POMDP G̃, as playing an action from the set Γ(o) \ Γ̃(o) or reaching a state in S \ S̃ leads to a contradiction with the fact that σ is an almost-sure winning strategy (the losing absorbing state is reached with positive probability). Similarly, due to the fact that the POMDP G is a belief-observation POMDP, it follows that the restricted POMDP
G̃ is also a belief-observation POMDP.

We define a subset of states of the belief-observation POMDP that intuitively correspond to winning collapsed-recurrent states (wcs), i.e., S_wcs = {(s, (Y, W, R, A)) | W(s) = 1, R(s) = 1}. Finally we compute the set of observations AW = AlmostM(Reach(S_wcs)) in the restricted belief-observation POMDP G̃. We show that the set of observations AW is equal to the set of observations AlmostM(LimAvg=1) in the POMDP G. In the following two lemmas we establish the required inclusions:
Lemma 10. AW ⊆ AlmostM(LimAvg=1).

Proof. Let σ be a memoryless almost-sure winning strategy for the objective Reach(S_wcs) from every state that has observation õ in the POMDP G̃, for all õ in AW. We will show that the strategy σ also almost-surely ensures the LimAvg=1 objective in the POMDP G. Consider the Markov chain G_σ. As the strategy σ plays in the POMDP G̃, which is restricted only to actions that keep the game in observations AlmostM(Safe(S_good)), it follows that the losing absorbing state s_l is not reachable in the Markov chain G_σ. Also the strategy ensures that the set of states S_wcs is reached with probability 1 in the Markov chain G_σ.

Let (s, (Y, W, R, A)) → (s', Y', a, (Y, W, R, A)) under action a be an edge in the Markov chain G_σ, and assume that (s, (Y, W, R, A)) ∈ S_wcs, i.e., W(s) = 1 and R(s) = 1. The only actions (Y', W', R', A') available in the state (s', Y', a, (Y, W, R, A)) are enabled actions that satisfy:

• for all ŝ ∈ Y, if W(ŝ) = 1, then for all ŝ' ∈ Supp(δ(ŝ, a)) ∩ Y' we have W'(ŝ') = 1, and
• for all ŝ ∈ Y, if R(ŝ) = 1, then for all ŝ' ∈ Supp(δ(ŝ, a)) ∩ Y' we have R'(ŝ') = 1.

It follows that all the states reachable in one step from (s', Y', a, (Y, W, R, A)) are also in S_wcs. Similarly, for every enabled action a in a state (s, (Y, W, R, A)) ∈ S_wcs we have that for all s ∈ Y, if W(s) = 1 and R(s) = 1, then r(s, a) = 1. Therefore all the rewards from states in S_wcs are 1, and all the intermediate states (s', Y', a, (Y, W, R, A)) have reward 1 by definition. This all together ensures that after reaching the set S_wcs only rewards 1 are received, and as the set S_wcs is reached with probability 1, it follows by Lemma 1 that σ is an almost-sure winning strategy in the POMDP G for the LimAvg=1 objective. □
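The two enabledness conditions used in the proof above are simple set-based checks. A minimal sketch of such a check (toy representation, illustrative names only; the boolean maps play the role of the W/R components of the memory):

```java
import java.util.*;

// Sketch of the enabledness check: a memory update from (Y, W, R) to
// (Y', W', R') under action a is enabled only if states marked 1 by W
// (resp. R) can only lead, inside Y', to states marked 1 by W' (resp. R').
public final class EnabledCheck {

    static boolean enabled(Set<Integer> Y, Set<Integer> Yp,
                           Map<Integer, Boolean> W, Map<Integer, Boolean> Wp,
                           Map<Integer, Boolean> R, Map<Integer, Boolean> Rp,
                           String a, Map<String, Set<Integer>> supp) {
        return preserves(Y, Yp, W, Wp, a, supp) && preserves(Y, Yp, R, Rp, a, supp);
    }

    private static boolean preserves(Set<Integer> Y, Set<Integer> Yp,
                                     Map<Integer, Boolean> mark, Map<Integer, Boolean> markNext,
                                     String a, Map<String, Set<Integer>> supp) {
        for (int s : Y) {
            if (!mark.getOrDefault(s, false)) continue;
            for (int t : supp.getOrDefault(s + "#" + a, Collections.emptySet())) {
                if (Yp.contains(t) && !markNext.getOrDefault(t, false)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> supp = Map.of("0#a", Set.of(1), "1#a", Set.of(1));
        boolean ok = enabled(Set.of(0, 1), Set.of(1),
                Map.of(0, true, 1, true), Map.of(1, true),
                Map.of(0, false, 1, false), Map.of(1, false), "a", supp);
        System.out.println(ok); // true: marked states only reach marked states in Y'
    }
}
```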
Lemma 11. AlmostM(LimAvg=1) ⊆ AW.

Proof. Assume for contradiction that there exists an observation õ ∈ AlmostM(LimAvg=1) \ AW. As we argued before, the observation õ must belong to the set AlmostM(Safe(S_good)), otherwise with positive probability the losing absorbing state s_l is reached.
As õ ∈ AlmostM(LimAvg=1), there exists a memoryless almost-sure winning strategy σ in the POMDP G for the objective LimAvg=1. By Lemma 9 there exists a finite-memory almost-sure winning strategy σ in POMDP G for the LimAvg=1 objective, and by Theorem 2 there exists a finite-memory almost-sure winning collapsed strategy CoSt(σ) in POMDP G. Let (s, (Y, W, R, A)) be a reachable collapsed-recurrent state in the Markov chain G_σ. Then by the definition of collapsed-recurrent states we have that R(s) = 1, and as the strategy σ is almost-sure winning we also have that W(s) = 1, i.e., if we consider the state (s, (Y, W, R, A)) of the POMDP G we obtain that the state belongs to the set S_wcs. Note that by Lemma 5 the set of collapsed-recurrent states in the Markov chain G_σ is reached with probability 1. By the construction presented in Lemma 8 we obtain a memoryless almost-sure winning strategy σ for the objective LimAvg=1 in the POMDP G. Moreover, we have that the two Markov chains are isomorphic when simplified edges are considered. In particular it follows that the set of states S_wcs is reached with probability 1 in the Markov chain G_σ, and therefore σ is a witness strategy for the fact that the observation õ belongs to the set AW. The contradiction follows. □

The algorithm. From the results of this section we obtain our algorithm (described as Algorithm 1) to determine the existence of finite-memory almost-sure winning strategies for LimAvg=1 objectives in POMDPs. The algorithm requires two procedures, namely AlmostSafe (Algorithm 2) and AlmostReach (Algorithm 3), for the computation of almost-sure winning for safety and reachability objectives, which we present in Section 3.3.3.

Algorithm 1 Decide FinMem.
Input: POMDP G with a LimAvg=1 objective.
Output: true if there exists a finite-memory almost-sure winning strategy, false otherwise.
  G ← Red(G)
  Õ ← AlmostSafe(S \ {s_l}), where s_l is the losing absorbing state and S are the states of G
  S̃ ← γ^{-1}(Õ)
  G̃ ← G restricted to S̃ and with Γ̃(õ) = Γ(õ) ∩ Allow(õ, Õ)
  S_wcs ← {(s, (Y, W, R, A)) | W(s) = 1, R(s) = 1} in G̃
  AW ← AlmostReach(S_wcs) in G̃
  if γ(s0) ∈ AW then return true else return false
Algorithm 2 AlmostSafe.
Input: belief-observation POMDP G, set of safe states F ⊆ S.
Output: AlmostM(Safe(F)).
  Y ← ObsCover(F)
  do
    Y' ← Y
    Y ← Pre(Y')
  while Y ≠ Y'
  return Y

Algorithm 3 AlmostReach.
Input: belief-observation POMDP G, set of target states T ⊆ S.
Output: AlmostM(Reach(T)).
  X ← ∅, Z ← O
  do
    Z' ← Z
    do
      X' ← X
      X ← (T ∩ γ^{-1}(Z)) ∪ Apre(Z, X')
    while X ≠ X'
    Z ← ObsCover(X)
  while Z ≠ Z'
  return Z
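The safety procedure is a purely discrete fixpoint computation. The following compact sketch implements Algorithm 2 over an assumed toy explicit representation (observations and states as integers; Γ, γ and Supp(δ) as maps), using the operators Allow, Pre and ObsCover defined in Section 3.3.3 below. All identifiers are illustrative and do not refer to the prototype implementation of Section 6.

```java
import java.util.*;

// Sketch of Algorithm 2 (AlmostSafe): greatest fixpoint of Y -> ObsCover(F) ∩ Pre(Y),
// computed by iterating Pre starting from ObsCover(F).
public final class AlmostSafeSketch {
    final Map<Integer, Set<Integer>> statesOfObs;   // gamma^{-1}
    final Map<Integer, Integer> obsOf;              // gamma
    final Map<Integer, Set<String>> availableActs;  // Gamma
    final Map<String, Set<Integer>> supp;           // Supp(delta(s, a)), keyed "s#a"

    AlmostSafeSketch(Map<Integer, Set<Integer>> statesOfObs, Map<Integer, Integer> obsOf,
                     Map<Integer, Set<String>> availableActs, Map<String, Set<Integer>> supp) {
        this.statesOfObs = statesOfObs; this.obsOf = obsOf;
        this.availableActs = availableActs; this.supp = supp;
    }

    // Allow(o, O'): actions of o whose successors stay inside the observations O'.
    Set<String> allow(int o, Set<Integer> allowedObs) {
        Set<String> res = new HashSet<>();
        for (String a : availableActs.getOrDefault(o, Set.of())) {
            boolean ok = true;
            for (int s : statesOfObs.getOrDefault(o, Set.of()))
                for (int t : supp.getOrDefault(s + "#" + a, Set.<Integer>of()))
                    if (!allowedObs.contains(obsOf.get(t))) ok = false;
            if (ok) res.add(a);
        }
        return res;
    }

    // Pre(O'): observations of O' with a non-empty Allow set.
    Set<Integer> pre(Set<Integer> obs) {
        Set<Integer> res = new HashSet<>();
        for (int o : obs) if (!allow(o, obs).isEmpty()) res.add(o);
        return res;
    }

    // ObsCover(F): observations whose states are all contained in F.
    Set<Integer> obsCover(Set<Integer> safeStates) {
        Set<Integer> res = new HashSet<>();
        for (int o : statesOfObs.keySet())
            if (safeStates.containsAll(statesOfObs.get(o))) res.add(o);
        return res;
    }

    // AlmostM(Safe(F)): iterate Pre from ObsCover(F) until a fixpoint is reached.
    Set<Integer> almostSafe(Set<Integer> safeStates) {
        Set<Integer> y = obsCover(safeStates);
        while (true) {
            Set<Integer> next = pre(y);
            if (next.equals(y)) return y;
            y = next;
        }
    }
}
```

Calling almostSafe with the complement of the losing absorbing state, as in Algorithm 1, yields the observation set used to restrict the POMDP in the next step.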
3.3.3. Polynomial-time algorithm for safety and reachability objectives for belief-observation POMDPs

To complete the computation of almost-sure winning for LimAvg=1 objectives, we now present polynomial-time algorithms for almost-sure safety and almost-sure reachability objectives under randomized memoryless strategies in the belief-observation POMDP G.

Notations for algorithms. We start with a few notations below:

• (Pre). The predecessor function: given a set of observations O' ⊆ O, it selects the observations o ∈ O' such that Allow(o, O') is non-empty, i.e.,

Pre(O') = { o ∈ O' | Allow(o, O') ≠ ∅ }.
• (Apre). Given a set Y ⊆ O of observations and a set X ⊆ S of states such that X ⊆ γ^{-1}(Y), the set Apre(Y, X) denotes the states from γ^{-1}(Y) for which there exists an action that ensures that the next observation is in Y and that the set X is reached with positive probability, i.e.,

Apre(Y, X) = { s ∈ γ^{-1}(Y) | ∃a ∈ Allow(γ(s), Y) such that Supp(δ(s, a)) ∩ X ≠ ∅ }.
• (ObsCover). For a set U ⊆ S of states we define ObsCover(U) ⊆ O to be the set of observations o such that all states with observation o are in U, i.e., ObsCover(U) = {o ∈ O | γ^{-1}(o) ⊆ U}.

Using the above notations we present the solution of almost-sure winning for safety and reachability objectives.

Almost-sure winning for safety objectives. Given a safety objective Safe(F), for a set F ⊆ S of states, let O_F = ObsCover(F) denote the set of observations o such that γ^{-1}(o) ⊆ F, i.e., all states s ∈ γ^{-1}(o) belong to F. We denote by νX the greatest fixpoint and by μX the least fixpoint. Let
Y* = νY. (O_F ∩ Pre(Y)) = νY. (ObsCover(F) ∩ Pre(Y))
be the greatest fixpoint of the function f(Y) = O_F ∩ Pre(Y). Then the set Y* is obtained by the following computation:

1. Y_0 ← O_F; and
2. repeat Y_{i+1} ← Pre(Y_i) until a fixpoint is reached.

We show that Y* = AlmostM(Safe(F)).

Lemma 12. For every observation o ∈ Y* we have Allow(o, Y*) ≠ ∅ (i.e., Allow(o, Y*) is non-empty).

Proof. Assume for contradiction that there exists an observation o ∈ Y* such that Allow(o, Y*) is empty. Then o ∉ Pre(Y*), and hence the observation must be removed in the next iteration of the algorithm. This implies Pre(Y*) ≠ Y*, and we reach a contradiction with the fact that Y* is a fixpoint. □

Lemma 13. The set Y* is the set of almost-sure winning observations for the safety objective Safe(F), i.e., Y* = AlmostM(Safe(F)), and can be computed in linear time.

Proof. We prove the two desired inclusions: (1) Y* ⊆ AlmostM(Safe(F)); and (2) AlmostM(Safe(F)) ⊆ Y*.

1. (First inclusion). By the definition of Y_0 we have that γ^{-1}(Y_0) ⊆ F. As Y_{i+1} ⊆ Y_i, we have that γ^{-1}(Y*) ⊆ F. By Lemma 12, for all observations o ∈ Y* the set Allow(o, Y*) is non-empty. A pure memoryless strategy that plays some action from Allow(o, Y*) in o, for o ∈ Y*, ensures that the next observation is in Y*. Thus the strategy ensures that only states from γ^{-1}(Y*) ⊆ F are visited, and therefore it is an almost-sure winning strategy for the safety objective.
2. (Second inclusion). We prove that there is no almost-sure winning strategy from O \ Y* by induction:
   • (Base case). The base case follows directly from the definition of Y_0. Note that Y_0 = O_F. In every observation o ∈ O \ Y_0 there exists a state s ∈ γ^{-1}(o) such that s ∉ F. As G is a belief-observation POMDP there is a positive probability of being in state s, and therefore of not being in F. It follows that there is no almost-sure winning strategy from the observations O \ Y_0.
   • (Inductive step). We show that there is no almost-sure winning strategy from observations in O \ Y_{i+1}. Let Y_{i+1} ≠ Y_i and let o ∈ Y_i \ Y_{i+1} (or equivalently o ∈ (O \ Y_{i+1}) \ (O \ Y_i)). As the observation o is removed from Y_i, it follows that Allow(o, Y_i) = ∅. It follows that no matter what action is played, there is a positive probability of being in a state s ∈ γ^{-1}(o) such that playing the action leaves the set γ^{-1}(Y_i) with positive probability, thus reaching the observations O \ Y_i, from which there is no almost-sure winning strategy by the induction hypothesis.

This shows that Y* = AlmostM(Safe(F)), and the linear-time computation follows from the straightforward computation of greatest fixpoints. The desired result follows. □

Formal description of the almost-sure safety algorithm. The above lemma establishes the correctness of the algorithm that computes almost-sure winning for safety objectives; the formal description is given as Algorithm 2.

Almost-sure winning for reachability objectives. Consider a set T ⊆ S of target states, and the reachability objective Reach(T). We will show that:
AlmostM(Reach(T)) = νZ. ObsCover( μX. ((T ∩ γ^{-1}(Z)) ∪ Apre(Z, X)) ).
Let Z ∗ = ν Z .ObsCover(μ X .(( T ∩ γ −1 ( Z )) ∪ Apre( Z , X ))). In the following two lemmas we show the two desired inclusions, i.e., AlmostM (Reach( T )) ⊆ Z ∗ and then we show that Z ∗ ⊆ AlmostM (Reach( T )). Lemma 14. AlmostM (Reach( T )) ⊆ Z ∗ .
Proof. Let W* = AlmostM(Reach(T)). We first show that W* is a fixpoint of the function

f(Z) = ObsCover( μX. ((T ∩ γ^{-1}(Z)) ∪ Apre(Z, X)) ),

i.e., we will show that W* = ObsCover(μX.((T ∩ γ^{-1}(W*)) ∪ Apre(W*, X))). As Z* is the greatest fixpoint, it will follow that W* ⊆ Z*. Let

X* = μX. ((T ∩ γ^{-1}(W*)) ∪ Apre(W*, X)).

Note that by definition we have X* ⊆ γ^{-1}(W*), as the inner fixpoint computation only computes states that belong to γ^{-1}(W*). Assume for contradiction that W* is not a fixpoint, i.e., ObsCover(X*) is a strict subset of W*. For all states s ∈ γ^{-1}(W*) \ X* and all actions a ∈ Allow(γ(s), W*) we have Supp(δ(s, a)) ⊆ (γ^{-1}(W*) \ X*). Consider any randomized memoryless almost-sure winning strategy σ* from W*, and consider two cases:

1. Suppose there is a state s ∈ γ^{-1}(W*) \ X* such that an action that does not belong to Allow(γ(s), W*) is played with positive probability by σ*. Then with positive probability the observations from W* are left (because from some state with the same observation as s, an observation in the complement of W* is reached with positive probability). Since from the complement of W* there is no randomized memoryless almost-sure winning strategy (by definition), this contradicts the claim that σ* is an almost-sure winning strategy from W*.
2. Otherwise, for all states s ∈ γ^{-1}(W*) \ X* the strategy σ* plays only actions in Allow(γ(s), W*), and then the probability to reach X* is zero, i.e., Safe(γ^{-1}(W*) \ X*) is ensured. Since all target states in γ^{-1}(W*) belong to X* (they get included in iteration 0 of the fixpoint computation), it follows that (γ^{-1}(W*) \ X*) ∩ T = ∅, and hence Safe(γ^{-1}(W*) \ X*) ∩ Reach(T) = ∅; we again reach a contradiction with σ* being an almost-sure winning strategy.

It follows that W* is a fixpoint, and thus we get that W* ⊆ Z*. □
Lemma 15. Z* ⊆ AlmostM(Reach(T)).

Proof. Since the goal is to reach the set T, without loss of generality we assume the set T to be absorbing. We define a randomized memoryless strategy σ* for the objective Reach(T) as follows: for an observation o ∈ Z*, play all actions from the set Allow(o, Z*) uniformly at random. Since the strategy σ* plays only actions in Allow(o, Z*), for o ∈ Z*, it ensures that the set of states γ^{-1}(Z*) is not left (i.e., Safe(γ^{-1}(Z*)) is ensured). We now analyze the computation of the inner fixpoint, i.e., of μX.((T ∩ γ^{-1}(Z*)) ∪ Apre(Z*, X)), as follows:

• X_0 = (T ∩ γ^{-1}(Z*)) ∪ Apre(Z*, ∅) = T ∩ γ^{-1}(Z*) ⊆ T (since Apre(Z*, ∅) is ∅);
• X_{i+1} = (T ∩ γ^{-1}(Z*)) ∪ Apre(Z*, X_i).

Note that we have X_0 ⊆ T. For every state s_j ∈ X_j with j ≥ 1, the set of played actions Allow(γ(s_j), Z*) contains an action a such that Supp(δ(s_j, a)) ∩ X_{j−1} is non-empty. Let C be an arbitrary recurrent class in the Markov chain G_{σ*} reachable from a state in γ^{-1}(Z*). Since Safe(γ^{-1}(Z*)) is ensured, it follows that C ⊆ γ^{-1}(Z*). Consider a state in C that belongs to X_j \ X_{j−1} for j ≥ 1. Since the strategy ensures that for some action a played with positive probability we have Supp(δ(s_j, a)) ∩ X_{j−1} ≠ ∅, it follows that C ∩ X_{j−1} ≠ ∅. Hence by induction C ∩ X_0 ≠ ∅, and it follows that C ∩ T ≠ ∅. Hence all reachable recurrent classes C that intersect with γ^{-1}(Z*) are contained in γ^{-1}(Z*), but not contained in γ^{-1}(Z*) \ T, i.e., all reachable recurrent classes consist of the absorbing states in T. Thus the strategy σ* ensures that T is reached with probability 1, and we have Z* ⊆ AlmostM(Reach(T)). □

Lemma 16. The set AlmostM(Reach(T)) can be computed in quadratic time for belief-observation POMDPs, for a target set T ⊆ S.

Proof. Follows directly from Lemma 14 and Lemma 15. □
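A corresponding implementation sketch of the reachability computation of Algorithm 3, over the same kind of assumed toy explicit representation as before (all names are illustrative). It computes the nested fixpoint νZ.ObsCover(μX.((T ∩ γ^{-1}(Z)) ∪ Apre(Z, X))):

```java
import java.util.*;

// Sketch of Algorithm 3 (AlmostReach): outer greatest fixpoint over observation
// sets Z, inner least fixpoint over state sets X using the Apre operator.
public final class AlmostReachSketch {
    final Map<Integer, Set<Integer>> statesOfObs;   // gamma^{-1}
    final Map<Integer, Integer> obsOf;              // gamma
    final Map<Integer, Set<String>> availableActs;  // Gamma
    final Map<String, Set<Integer>> supp;           // Supp(delta(s, a)), keyed "s#a"

    AlmostReachSketch(Map<Integer, Set<Integer>> statesOfObs, Map<Integer, Integer> obsOf,
                      Map<Integer, Set<String>> availableActs, Map<String, Set<Integer>> supp) {
        this.statesOfObs = statesOfObs; this.obsOf = obsOf;
        this.availableActs = availableActs; this.supp = supp;
    }

    Set<String> allow(int o, Set<Integer> allowedObs) {
        Set<String> res = new HashSet<>();
        for (String a : availableActs.getOrDefault(o, Set.of())) {
            boolean ok = true;
            for (int s : statesOfObs.getOrDefault(o, Set.of()))
                for (int t : supp.getOrDefault(s + "#" + a, Set.<Integer>of()))
                    if (!allowedObs.contains(obsOf.get(t))) ok = false;
            if (ok) res.add(a);
        }
        return res;
    }

    // Apre(Z, X): states with observation in Z having an allowed action that
    // reaches X with positive probability.
    Set<Integer> apre(Set<Integer> zObs, Set<Integer> x) {
        Set<Integer> res = new HashSet<>();
        for (int o : zObs)
            for (int s : statesOfObs.getOrDefault(o, Set.of()))
                for (String a : allow(o, zObs)) {
                    Set<Integer> succ = supp.getOrDefault(s + "#" + a, Set.of());
                    if (!Collections.disjoint(succ, x)) { res.add(s); break; }
                }
        return res;
    }

    Set<Integer> obsCover(Set<Integer> states) {
        Set<Integer> res = new HashSet<>();
        for (int o : statesOfObs.keySet())
            if (states.containsAll(statesOfObs.get(o))) res.add(o);
        return res;
    }

    Set<Integer> almostReach(Set<Integer> target) {
        Set<Integer> z = new HashSet<>(statesOfObs.keySet());   // Z <- O
        while (true) {
            Set<Integer> x = new HashSet<>();                   // inner least fixpoint
            while (true) {
                Set<Integer> next = new HashSet<>();
                for (int t : target) if (z.contains(obsOf.get(t))) next.add(t); // T ∩ gamma^{-1}(Z)
                next.addAll(apre(z, x));
                if (next.equals(x)) break;
                x = next;
            }
            Set<Integer> zNext = obsCover(x);
            if (zNext.equals(z)) return z;
            z = zNext;
        }
    }
}
```

In the pipeline of Algorithm 1, almostReach would be invoked on the restricted POMDP with the set S_wcs as the target.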
Formal description of the almost-sure reachability algorithm. The previous three lemmas establish the correctness of the algorithm that computes almost-sure winning for reachability objectives; the formal description is given as Algorithm 3.

The EXPTIME-completeness. In this section we first showed that given a POMDP G with a LimAvg=1 objective we can construct an exponential-size belief-observation POMDP, and that the computation of the almost-sure winning set for LimAvg=1 objectives reduces to the computation of the almost-sure winning set for safety and reachability objectives, for which we established linear- and quadratic-time algorithms, respectively. This gives us a 2^{O(|S|+|A|)}-time algorithm to decide the existence of (and construct, if one exists) finite-memory almost-sure winning strategies in POMDPs with LimAvg=1 objectives. The EXPTIME-hardness for almost-sure winning easily follows from the result of Reif for two-player partial-observation games with safety objectives [40]: (i) First observe that in POMDPs, if almost-sure safety is violated, then it is violated with
Fig. 4. Transformation of the PFA P to a POMDP G.
a finite prefix which has positive probability, and hence for almost-sure safety the probabilistic player can be treated as an adversary. This shows that the almost-sure safety problem for POMDPs is EXPTIME-hard. (ii) The almost-sure safety problem reduces to almost-sure winning for limit-average objectives by assigning reward 1 to safe states, reward 0 to non-safe states, and making the non-safe states absorbing. It follows that almost-sure winning in POMDPs for LimAvg=1 objectives under finite-memory strategies is EXPTIME-hard.

Theorem 3. The following assertions hold:
1. Given a POMDP G with |S| states, |A| actions, and a LimAvg=1 objective, the existence (and the construction, if one exists) of a finite-memory almost-sure winning strategy can be achieved in 2^{O(|S|+|A|)} time.
2. The decision problem of, given a POMDP and a LimAvg=1 objective, whether there exists a finite-memory almost-sure winning strategy is EXPTIME-complete.

Remark 4. We highlight one important aspect of our results. Several algorithms for POMDPs are based on the belief states, which are probability distributions over the states of the POMDPs, and thus the algorithms require numerical data structures. In contrast, our algorithm is based only on the support of the belief states (i.e., on sets of states rather than probability distributions). Our algorithm, though exponential time in the worst case, is a discrete algorithm, and we present an implementation in Section 6.

4. Finite-memory strategies with quantitative constraints

We will show that the problem of deciding whether there exists a finite-memory (as well as an infinite-memory) almost-sure winning strategy for the objective LimAvg>1/2 is undecidable. We present a reduction from the standard undecidable
problem for probabilistic finite automata (PFA). A PFA P = (S, A, δ, F, s0) is a special case of a POMDP G = (S, A, δ, O, γ, s0) with a single observation O = {o} such that for all states s ∈ S we have γ(s) = o. Moreover, the PFA proceeds for only finitely many steps, and has a set F of desired final states. The strict emptiness problem asks for the existence of a strategy w (a finite word over the alphabet A) such that the measure of the runs ending in the desired final states F is strictly greater than 1/2; the strict emptiness problem for PFA is undecidable [36].

Reduction. Given a PFA P = (S, A, δ, F, s0) we construct a POMDP G = (S', A', δ', O', γ', s'_0) with a Boolean reward function r such that there exists a word w ∈ A* accepted with probability strictly greater than 1/2 in P iff there exists a finite-memory almost-sure winning strategy in G for the objective LimAvg>1/2. Intuitively, the construction of the POMDP G is as follows: for every state s ∈ S of P we construct a pair of states (s, 1) and (s, 0) in S', with the property that (s, 0) can only be reached by a new action $ (not in A) played in state (s, 1). The transition function δ' from the state (s, 0) mimics the transition function δ, i.e., δ'((s, 0), a)((s', 1)) = δ(s, a)(s'). The reward r of (s, 1) (resp. (s, 0)) is 1 (resp. 0), ensuring the average of the pair to be 1/2. We add a new action # that, when played in a final state, reaches a state good ∈ S' with reward 1, and when played in a non-final state reaches a state bad ∈ S' with reward 0; for the states good and bad, given the action #, the next state is the initial state. An illustration of the construction on an example is depicted in Fig. 4. Whenever an action is played in a state where it is not available, the POMDP reaches a losing absorbing state, i.e., an absorbing state with reward 0; for brevity we omit transitions to the losing absorbing state. The formal construction of the POMDP G is as follows:
• S' = (S × {0, 1}) ∪ {good, bad},
• s'_0 = (s0, 1),
• A' = A ∪ {#, $},
• The actions a ∈ A ∪ {#} in states (s, 1) (for s ∈ S) lead to the losing absorbing state; the action $ in states (s, 0) (for s ∈ S) leads to the losing absorbing state; and the actions a ∈ A ∪ {$} in states good and bad lead to the losing
absorbing state. The other transitions are as follows. For all s ∈ S: (i) δ'((s, 1), $)((s, 0)) = 1; (ii) for all a ∈ A we have δ'((s, 0), a)((s', 1)) = δ(s, a)(s'); and (iii) for the action # we have

δ'((s, 0), #)(good) = 1 if s ∈ F, and 0 otherwise;
δ'((s, 0), #)(bad) = 1 if s ∉ F, and 0 otherwise;
δ'(good, #)(s'_0) = 1 and δ'(bad, #)(s'_0) = 1,
• there is a single observation O' = {o}, and all the states s ∈ S' have γ'(s) = o.

We define the Boolean reward function r only as a function of the state, i.e., r : S' → {0, 1}, and show the undecidability even for this special case of reward functions. For all s ∈ S the reward is r((s, 0)) = 0 and r((s, 1)) = 1, and the remaining two states have rewards r(good) = 1 and r(bad) = 0. Note that though the rewards are assigned as a function of states, the rewards appear on the transitions. We now establish the correctness of the reduction.

Lemma 17. If there exists a word w ∈ A* accepted with probability strictly greater than 1/2 in P, then there exists a pure finite-memory almost-sure winning strategy in the POMDP G for the objective LimAvg>1/2.
denote by w [i ] the ith action in the word w. The finite-memory strategy we construct is specified as an ultimately periodic word ($ w [1] $ w [2] . . . $w [n − 1] $ # #)ω . Observe that by the construction of the POMDP G, the sequence of rewards (that appear on the transitions) is (10)n followed by (i) 1 with probability μ (when F is reached), and (ii) 0 otherwise; and the whole sequence is repeated ad infinitum. Also observe that once the pure finite-memory strategy is fixed we obtain a Markov chain with a single recurrent class since the starting state belongs to the recurrent class and all states reachable from the starting state form the recurrent class. We first establish the almost-sure convergence of the sequence of partial averages of the periodic blocks, and then of the sequences inside the periodic blocks as well.
j
Almost-sure convergence of periodic blocks. Let r1 , r2 , r3 , . . . be the infinite sequence of rewards and s j = 1j · i =1 r i . The infinite sequence of rewards can be partitioned into blocks of length 2 · n + 1, intuitively corresponding to the transitions of a single run on the word ($ w [1] $ w [2] . . . $w [n − 1] $ # #). We define a random variable X i denoting average of rewards of the ith block in the sequence, i.e., with probability μ for each i the value of X i is 2n·n++11 and with probability 1 − μ the value is μ+n
n . 2·n+1
The expected value of X i is therefore equal to E( X i ) = 2·n+1 , and as we have that μ > 12 it follows that E( X i ) > 12 . The fact that we have a single recurrent class and after the ## the initial state is reached implies that the random variable sequence ( X i )i ≥0 is an infinite sequence of i.i.d’s. By the Strong Law of Large Numbers (SLLN) [15, Theorem 7.1, page 56] we have that
1 μ+n =1 P lim ( X 1 + X 2 + · · · + X j ) = j →∞ j 2·n+1
Almost-sure convergence inside the periodic blocks. It follows that with probability 1 the LimAvg of the partial averages on blocks is strictly greater than 12 . As the blocks are of fixed length 2 · n + 1, if we look at the sequence of averages at every 2 · n + 1 step, i.e., the sequence s j ·(2n+1) for j > 0, we have that this sequence converges with probability 1 to a value strictly greater than 12 . It remains to show that all si converge to that value. As the elements of the subsequence converging with probability 1 are always separated by exactly 2 · n + 1 elements (i.e., constant number of elements) and due to the definition of LimAvg the deviation introduced by these 2 · n + 1 elements ultimately gets smaller than any > 0 as the length of the path increases (the average is computed from the whole sequence so far, and deviation caused by a fixed length is negligible as the length increases). Therefore the whole sequence (si )i >0 converges to a value strictly greater than 12 with probability 1. It follows that the strategy ensures LimAvg> 1 with probability 1. 2 2
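As a quick sanity check of the block expectation used above, the following small computation (purely illustrative; μ and n are arbitrary example values) compares a Monte Carlo estimate of the block average with the closed form (μ + n)/(2n + 1):

```java
import java.util.Random;

// Illustrative check: the average reward of one periodic block
// ($ w[1] $ w[2] ... $ w[n-1] $ # #) is (n+1)/(2n+1) with probability mu
// and n/(2n+1) otherwise, so its expectation is (mu + n)/(2n + 1) > 1/2 when mu > 1/2.
public final class BlockAverageCheck {
    public static void main(String[] args) {
        double mu = 0.7;   // example acceptance probability of the word w
        int n = 5;         // |w| = n - 1, so a block has 2n + 1 transitions
        Random rnd = new Random(42);
        long trials = 1_000_000;
        double sum = 0.0;
        for (long i = 0; i < trials; i++) {
            double ones = n + (rnd.nextDouble() < mu ? 1 : 0); // n ones from (10)^n plus the final # reward
            sum += ones / (2 * n + 1);
        }
        System.out.printf("simulated E(X_i) = %.4f, closed form = %.4f%n",
                sum / trials, (mu + n) / (2.0 * n + 1));
    }
}
```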
In the next two lemmas we first show the other direction for pure finite-memory strategies, and then extend the result to the class of randomized infinite-memory strategies.

Lemma 18. If there exists a pure finite-memory almost-sure winning strategy in the POMDP G for the objective LimAvg>1/2, then there exists a word w ∈ A* accepted with probability strictly greater than 1/2 in P.

Proof. Assume there exists a pure finite-memory almost-sure winning strategy σ for the objective LimAvg>1/2. Observe that as there is only a single observation in the POMDP G, the strategy σ can be viewed as an ultimately periodic infinite word of the form u · v^ω, where u, v are finite words over A'. Note that v must contain the subsequence ##, as otherwise the LimAvg would be only 1/2. Similarly, before every letter a ∈ A in the words u, v, the strategy must necessarily play the $ action, as otherwise the losing absorbing state is reached. In the first step we align the ## symbols in v. Let us partition the word v into two parts v = y · x such that y is the shortest prefix ending with ##. Then the ultimately periodic word u · y · (x · y)^ω = u · v^ω is also a strategy ensuring almost-surely LimAvg>1/2. Due to the previous step we consider u' = u · y and v' = x · y, and thus have that v' is of the form:

$ w_1[1] $ w_1[2] ... $ w_1[n_1] $ ## $ w_2[1] $ w_2[2] ... $ w_2[n_2] $ ## ... $ w_m[1] $ w_m[2] ... $ w_m[n_m] $ ##.

We extract the set of words W = {w_1, w_2, ..., w_m} from v'. Assume for contradiction that all the words in the set W are accepted in the PFA P with probability at most 1/2. As in Lemma 17 we define a random variable X_i denoting the average of rewards after reading v'. It follows that the expected value E(X_i) ≤ 1/2 for all i ≥ 0. By using the SLLN we obtain that almost-surely the LimAvg of u' · v'^ω is E(X_i), which is not possible as u' · v'^ω is an almost-sure winning strategy for the objective LimAvg>1/2. We reach a contradiction to the assumption that all the words in W are accepted with probability at most 1/2 in P. Therefore there exists a word w ∈ W that is accepted in P with probability strictly greater than 1/2, which concludes the proof. □
Proof. Let
2
σ be a randomized (possibly infinite-memory) almost-sure winning strategy for the objective LimAvg> 1 . As there 2
is a single observation in the POMDP G constructed in the reduction, the strategy does not receive any useful feedback from the play, i.e., the memory update function σu always receives as one of the parameters the unique observation. Note that the strategy needs to play the pair ## of actions infinitely often with probability 1, i.e., with probability 1 the resolving of the probabilities in the strategy σ leads to an infinite word ρ = w 1 ##w 2 ## . . ., as otherwise the limit-average payoff is at most 12 with positive probability. From each such run ρ we extract the finite words w 1 , w 2 , . . . that occurs in ρ , and then consider the union of all such words as W . We consider two options: 1. If there exists a word v in W such that the expected average of rewards after playing this word is strictly greater than 1 , then the pure strategy v ω is also a pure finite-memory almost-sure winning strategy for the objective LimAvg> 1 . 2 2
2. Assume for contradiction that all the words in W have the expected reward at most 21 . Then with probability 1 resolving the probabilities in the strategy σ leads to an infinite word w = w 1 ##w 2 ## . . ., where each word w i belongs to W , that is played on the POMDP G. Let us define a random variable X i denoting the average between i and (i + 1)th occurrence of ##. The expected average E( X i ) is at most 12 for all i. Therefore the expected LimAvg of the sequence w is at most:
1
n
E lim inf n→∞
n
Xi .
i =0
Since X i ’s are non-negative measurable functions, by Fatou’s lemma [15, Theorem 3.5, page 16] that shows the integral of limit inferior of a sequence of non-negative measurable functions is at most the limit inferior of the integrals of these functions, we have the following inequality:
1
E lim inf n→∞
n
n
Xi
≤ lim inf E n→∞
i =0
Note that since the strategy
n
n
i =0
Xi
1
≤ . 2
σ is almost-sure winning for the objective LimAvg> 1 , then the expected value of rewards
must be strictly greater than
1 . 2
2
Thus we arrive at a contradiction. Hence there must exist a word in W that has an
expected payoff strictly greater than This concludes the proof.
1
1 2
in G.
2
Theorem 4. The problem whether there exists a finite (or infinite-memory) almost-sure winning strategy in a POMDP for the objective LimAvg> 1 is undecidable. 2
Fig. 5. Transformation of the PFA P to a POMDP G.
5. Infinite-memory strategies with qualitative constraints

In this section we show that the problem of deciding the existence of infinite-memory almost-sure winning strategies in POMDPs with LimAvg=1 objectives is undecidable. We prove this fact by a reduction from the value 1 problem for PFA, which is undecidable [17]. The value 1 problem, given a PFA P, asks whether for every ε > 0 there exists a finite word w such that the word is accepted in P with probability at least 1 − ε (i.e., the limit of the acceptance probabilities is 1).

Reduction. Given a PFA P = (S, A, δ, F, s0), we construct a POMDP G = (S', A', δ', O', γ', s'_0) with a reward function r', such that P satisfies the value 1 problem iff there exists an infinite-memory almost-sure winning strategy in G for the objective LimAvg=1. Intuitively, the construction adds two additional states good and bad. We add an edge from every state of the PFA under a new action $; this edge leads to the state good when played in a final state, and to the state bad otherwise. In the states good and bad we add self-loops under a new action #. The action $ in the states good or bad leads back to the initial state. An example of the construction is illustrated in Fig. 5. All the states belong to a single observation, and we use a Boolean reward function on states. The reward for all states except the newly added state good is 0, and the reward for the state good is 1. The formal construction is as follows:
• S' = S ∪ {good, bad},
• s'_0 = s0,
• A' = A ∪ {#, $},
• For all s, s' ∈ S and a ∈ A we have δ'(s, a)(s') = δ(s, a)(s'), and

δ'(s, $)(good) = 1 if s ∈ F, and 0 otherwise;
δ'(s, $)(bad) = 1 if s ∉ F, and 0 otherwise;
δ'(good, $)(s0) = δ'(bad, $)(s0) = 1; δ'(good, #)(good) = δ'(bad, #)(bad) = 1;

• there is a single observation O' = {o}, and all the states s ∈ S' have γ'(s) = o.

When an action is played in a state without an outgoing edge for the action, the losing absorbing state is reached; i.e., for the action # in states in S, and for actions a ∈ A in the states good and bad, the next state is the losing absorbing state. The Boolean reward function r' : S' → {0, 1} assigns to all states s ∈ S ∪ {bad} the reward r'(s) = 0, and r'(good) = 1.

Lemma 20. If the PFA P satisfies the value 1 problem, then there exists an infinite-memory almost-sure winning strategy ensuring the LimAvg=1 objective.

Proof. We construct an almost-sure winning strategy σ that we describe as an infinite word. As P satisfies the value 1 problem, there exists a sequence of finite words (w_i)_{i≥1} such that each w_i is accepted in P with probability at least 1 − 1/2^{i+1}. We construct an infinite word w_1 · $ · #^{n_1} · w_2 · $ · #^{n_2} ···, where each n_i ∈ N is a natural number that satisfies
the following condition: let k_i = |w_{i+1} · $| + Σ_{j=1}^{i} (|w_j · $| + n_j) be the length of the word sequence before #^{n_{i+1}}; then we must have

n_i / k_i ≥ 1 − 1/i.

In other words, the length n_i is large enough that even if all the rewards of the sequence of length Σ_{j=1}^{i-1} (|w_j · $| + n_j) + |w_i · $| before #^{n_i} and of the sequence of length |w_{i+1} · $| after it are 0, the n_i rewards 1 of #^{n_i} ensure that the average is at least 1 − 1/i. Intuitively, the condition ensures that even if the rewards are always 0 for the prefix up to w_i · $ of a run, a single visit to the good state after w_i · $ ensures that the average is greater than 1 − 1/i from the end of #^{n_i} up to the point where w_{i+1} · $ ends. We first argue that if the state bad is visited finitely often with probability 1, then with probability 1 the
objective LimAvg=1 is satisfied.
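The n_i can be chosen explicitly. Writing k_i = P + n_i + Q with P the length of the sequence played before #^{n_i} and Q = |w_{i+1} · $|, the condition n_i/k_i ≥ 1 − 1/i rearranges to n_i ≥ (i − 1)(P + Q); this rearrangement is a small derivation of ours, not spelled out in the text. A tiny illustrative computation with hypothetical word lengths:

```java
// Illustrative choice of n_i satisfying n_i / k_i >= 1 - 1/i, where
// k_i = P + n_i + Q, P = length played before #^{n_i}, Q = |w_{i+1} . $|.
// Rearranging gives n_i >= (i - 1) * (P + Q).
public final class BlockLengths {
    static long chooseN(int i, long prefixLen, long nextWordLen) {
        return (long) (i - 1) * (prefixLen + nextWordLen);
    }

    public static void main(String[] args) {
        long p = 0;                       // length played so far (before #^{n_i})
        long[] wordLens = {3, 4, 2, 6};   // hypothetical values of |w_i . $|
        for (int i = 1; i < wordLens.length; i++) {
            p += wordLens[i - 1];
            long n = chooseN(i, p, wordLens[i]);
            long k = p + n + wordLens[i];
            System.out.printf("i=%d  n_i=%d  n_i/k_i=%.3f  1-1/i=%.3f%n",
                    i, n, (double) n / k, 1.0 - 1.0 / i);
            p += n;                       // the #^{n_i} block is now part of the prefix
        }
    }
}
```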
Almost-sure winning if bad occurs only finitely often. Observe that if the state bad appears only finitely often, then there is some j ≥ 0 such that for all ℓ ≥ j, the state visited after w_1 · $ · #^{n_1} · w_2 · $ · #^{n_2} ··· w_ℓ · $ is the state good, and then the sequence #^{n_ℓ} ensures that the payoff from the end of #^{n_ℓ} up to the end of w_{ℓ+1} · $ is at least 1 − 1/ℓ. If the state good is visited, then the sequence of #'s gives reward 1, and thus after a visit to the state good the average only increases in the sequence of #'s. Hence it follows that the lim-inf average of the rewards is at least 1 − 1/i, for all i ≥ 0 (i.e., for every i ≥ 0, there exists a point in the path such that the average of the rewards never falls below 1 − 1/i). Since this holds for all i ≥ 0, it follows that LimAvg=1 is ensured with probability 1 (provided that with probability 1 the state bad occurs only finitely often).
The state bad occurs only finitely often. We now need to show that the state bad is visited infinitely often with probability 0 (i.e., it occurs only finitely often with probability 1). We first upper bound the probability u_{k+1} of visiting the state bad at least k + 1 times, given k visits to the state bad. The probability u_{k+1} is at most (1/2^{k+1}) · (1 + 1/2 + 1/4 + ···). This bound is obtained as follows: following the visit to bad for k times, the words w_j, for j ≥ k, are played; hence the probability to reach bad decreases by a factor of 1/2 every time the next word is played, and after k visits this probability is always smaller than 1/2^{k+1}. Hence the probability to visit bad at least k + 1 times, given k visits, is at most the sum above, which is 1/2^k. Let E_k denote the event that bad is visited at least k + 1 times given k visits to bad. Then we have Σ_{k≥0} P(E_k) ≤ Σ_{k≥0} 1/2^k < ∞. By the Borel–Cantelli lemma [15, Theorem 6.1, page 47] we know that if the sum of probabilities is finite (i.e., Σ_{k≥0} P(E_k) < ∞), then the probability that infinitely many of them occur is 0 (i.e., P(lim sup_{k→∞} E_k) = 0). Hence the probability that bad is visited infinitely often is 0, i.e., with probability 1 bad is visited finitely often. It follows that the strategy σ that plays the infinite word w_1 · $ · #^{n_1} · w_2 · $ · #^{n_2} ··· is an almost-sure winning strategy for the objective LimAvg=1. □
Lemma 21. If there exists an infinite-memory almost-sure winning strategy for the objective LimAvg=1 , then the PFA P satisfies the value 1 problem. Proof. We prove the contrapositive. Consider that the PFA P does not satisfy the value 1 problem, i.e., there exists a constant c > 0 such that for all w ∈ A∗ we have that the probability that w is accepted in P is at most 1 − c < 1. We will show that there is no almost-sure winning strategy. Assume for contradiction that there exists an infinite-memory almost-sure winning strategy σ in the POMDP G . In POMDPs, for all measurable objectives, infinite-memory pure strategies are as powerful as infinite-memory randomized strategies [7, Lemma 1].2 Therefore we may assume that the almost-sure winning strategy is given as an infinite word w. Note that the infinite word w must necessarily contain infinitely many $ and can be written as w = w 1 · $ · #n1 · $ · w 2 · $ · #n2 · $ · · ·. Moreover, there must be infinitely many i > 0 such that the number ni is positive (since only such segments yield reward 1). Consider the Markov chain G σ and the rewards on the states of the chain. By the definition of the reward function r all the rewards that correspond to words w i for i > 0 are equal to 0. Rewards corresponding to the word segment #ni for i > 0 are with probability at most 1 − c equal to 1 and 0 otherwise. Let us for the moment remove the 0 rewards corresponding to words w i for all i > 0 (removing the reward 0 segments only increases the limit-average payoff), and consider only rewards corresponding to the segments #ni for all i > 0, we will refer to this infinite sequence of rewards as w . Let X i denote the random variable corresponding to the value of the ith reward in the sequence w . Then we have that X i = 1 with probability at most 1 − c and 0 otherwise. The expected LimAvg of the sequence w is then at most:
E( lim inf_{n→∞} (1/n) Σ_{i=0}^{n} X_i ).
Since the X_i's are non-negative measurable functions, by Fatou's lemma [15, Theorem 3.5, page 16], which shows that the integral of the limit inferior of a sequence of non-negative measurable functions is at most the limit inferior of the integrals of these functions, we have the following inequality:

E( lim inf_{n→∞} (1/n) Σ_{i=0}^{n} X_i ) ≤ lim inf_{n→∞} E( (1/n) Σ_{i=0}^{n} X_i ) ≤ 1 − c.
If we put back the rewards 0 from the words w_i for all i > 0, then the expected value can only decrease. It follows that E_σ(LimAvg) ≤ 1 − c. Note that if the strategy σ were almost-sure winning for the objective LimAvg=1 (i.e., P_σ(LimAvg=1) = 1), then the expectation of the LimAvg payoff would also be 1 (i.e., E_σ(LimAvg) = 1). Therefore we have reached a contradiction with the fact that the strategy σ is almost-sure winning, and the result follows. □

Theorem 5. The problem whether there exists an infinite-memory almost-sure winning strategy in a POMDP with the objective LimAvg=1 is undecidable.

² Since in POMDPs there is only the controller making choices, a randomized strategy can be viewed as a distribution over pure strategies, and if there is a randomized strategy to achieve an objective, then there must be a pure one. This is the key intuition that in POMDPs randomization is not more powerful; for a formal proof, see [7].
Fig. 6. AV – small.
Fig. 7. AV – medium.
Fig. 8. AV – large.
6. Implementation and experimental results We implemented in Java the presented algorithm that given a POMDP with a qualitative constraint decides whether there exists a finite-memory almost-sure winning strategy. Along with the algorithm we also employed a heuristic to avoid the worst-case exponential complexity. Heuristics. Our main heuristic to overcome the exponential complexity of the algorithm is based on an idea to reduce the size of the memory set of the collapsed strategy. Instead of storing the mappings RecFun and AWFun for every state of the POMDP, we store the mappings restricted to the current belief support. Intuitively, for all states that are not in the belief support, the probability of being in them is 0. Therefore, the information stored about these states is not relevant for the strategy. The current belief support is often significantly smaller than the subset construction of the state space, as the size of the belief support is bounded by the size of the largest observation in the POMDP. This heuristic dramatically reduces the size of G ← Red(G ) of Algorithm 1 in our example case studies. Experiments. We evaluated our algorithm on the examples that are discussed below: Avoid the observer (AV). Inspired by the maze POMDPs from [27] we consider a POMDP that models a maze where an intruder can decide to stay in the same position or move deterministically in all four directions, whenever the obstacles in the maze allow the movement. There is an observer moving probabilistically through the maze that checks that there is no intruder present. The intruder cannot see the observer whenever it is behind an obstacle. We consider two settings of the problem: (1) In the capture mode all actions of the intruder have reward 1 with the exception of the moves when the intruder is captured by the observer, i.e, the intruder and the observer are in the same position in the maze. The latter moves have reward 0. (2) In the detect mode we assume the observer is equipped with sensors that can also detect the intruder on a neighboring position (with no walls in between). The rewards are defined as follows: (i) for every move of the intruder that is undetected by the observer the reward is 1; (ii) the reward for a move where the intruder is detected but not captured is 0.5 (the intruder and the observer are on neighboring positions); and (iii) the reward for a move where the intruder is captured is 0 (the intruder and the observer are on the same position). We consider three variants small, medium, and large that differ in the size and the structure of the maze. The variants are depicted in Figs. 6, 7, and 8, respectively. The POMDP corresponding to variant AV – small has 50 states and the intruder starts at position 0 and the observer at position 7. The POMDP corresponding to variant AV – medium has 145 states and
Fig. 9. Space shuttle.
the intruder starts at position 0 and the observer at position 11. The last POMDP corresponding to variant AV – large has 257 states and the intruder starts at position 0 and the observer at position 15. Supply the space station (SP). We consider an extension of the space shuttle docking problem that originally comes from [11,27]. In the original setting the POMDP models a simple supply delivery problem between two space stations, where a space shuttle must deliver supplies to two stations in space. There are four actions available to the space shuttle: move forwards, turn around, back up, and drop supplies. A successful supply delivery to a station is modeled by first backing up into the space station and then by dropping the supplies in the space station. In the original setting the goal is to supply both stations infinitely often in an alternate fashion. We extend the model by introducing a time bound on the number of turns the space shuttle is allowed to spend traveling in space (without delivering any supplies). We assume there is a given amount of supplies in a space station that decreases in every turn by one, and whenever the space shuttle delivers supplies, the amount of supplies is restored to its maximum. We consider two different bounds: 3 – step SP introduces a 3 turn time bound on the traveling time (supplies for 6 turns in every station), and 6 – step SP introduces a 6 turn time bound on the traveling time (supplies for 12 turns in every station). The schematic representation of the space shuttle is depicted in Fig. 9 and originally comes from [27]. The leftmost and rightmost states in Fig. 9 are the docking stations, the most recently visited docking station is labeled with MRV, and the least recently visited docking station is labeled with LRV. The property of the model is that whenever a LRV station is visited it automatically changes its state to MRV. The actions move forward, turn around, and drop supplies are deterministic. The movement under back up action is probabilistic when the space shuttle is in space, e.g., the space shuttle can move backwards as intended, not move at all, or even turn around, with positive probability. States: The POMDP is a synchronized product of two components: the space shuttle, and the counter that keeps track of the remaining supplies. The state of the space shuttle are as follows: 0 – docked in LRV; 1 – just outside space station MRV, front of ship facing station; 2 – space, facing MRV; 3 – just outside space station LRV, back of ship facing station; 4 – just outside space station MRV, back of ship facing station; 5 – space, facing LRV; 6 – just outside space station LRV, front of ship facing station; 7 – docked in MRV, the initial state. The states of the counter are simply the remaining amount of supplies. The space shuttle cannot distinguish the states in space, with the following exception: the states where the shuttle is facing a station, i.e., states 2, 3, 4, and 5 have the same observation different from the observation of other states in the space. 
Rewards: The rewards of actions are defined as follows: (1) when the shuttle is facing a station and plays action move forward it bumps into a space station and receives reward 0.1; (2) dropping the supplies outside of the space station receives reward 0.5; (3) every other action as long as the stations are supplied receives reward 1; (4) if the shuttle is traveling too long and the amount of supplies in one of the stations reaches 0 every action receives reward 0 until the station is supplied. RockSample (RS). We consider a modification of the RockSample problem introduced in [41] and used later in [4]. It is a scalable problem that models rover science exploration. The rover is equipped with a limited amount of fuel and can increase the amount of fuel by sampling rocks in the immediate area. The positions of the rover and the rocks are known, but only some of the rocks can increase the amount of fuel; we will call these rocks good. The type of the rock is not known to the rover, until the rock is sampled. Once a good rock is used to increase the amount of fuel, it becomes temporarily a bad rock until all other good rocks are sampled. We consider variants with different maximum capacity of the rover’s fuel tank. An instance of the RockSample problem is parametrized with two parameters [n, k]: map size n × n and k rocks is described as RockSample[n,k]. The POMDP model of RockSample[n,k] is as follows: States: The state space is the cross product of 2k + 1 + c features: Position = {(1, 1), (1, 2), . . . , (n, n)}, 2 ∗ k binary features RockTypei = {Good, Bad} that indicate which of the rocks are good and which rocks are temporarily not able to increase the amount of fuel, and c is the amount of fuel remaining in the fuel tank. The rover can select from four actions: { N , S , E , W }. All the actions are deterministic single-step motion actions. A rock is sampled whenever the rock is at the rover’s current location.
Fig. 10. RS[4,2].
Fig. 11. RS[4,3].

Table 2
Experimental results.

Example       Mode         |S|    |A|   |O|   Time (s)   Win.
AV – small    Capture      65     5     45    0.48
              Detect                          0.28
AV – medium   Capture      145    5     87    3.85
              Detect                          3.18
AV – large    Capture      257    5     107   227.85
              Detect                          200.43
SP            3 – step     44     4     19    0.85
              6 – step     74           35    1.27
RS[4,2]       capacity-3   1025   4     4     0.93
              capacity-4   1281             2.56
RS[4,3]       capacity-3   3137   4     4     121.65
              capacity-4   3921             257.34
Rewards: If the rock is good, the fuel amount is increased to the maximum capacity and the rock becomes temporarily bad. For every move the rover receives reward 1 with the following two exceptions:
• When the rover samples a bad rock (including temporarily bad rocks) the reward is 0.
• If the fuel amount decreases to 0, every move has reward 0 until the fuel is increased by sampling a good rock.

The instance RS[4,2] (resp. RS[4,3]) is depicted in Fig. 10 (resp. Fig. 11); the arrow indicates the initial position of the rover and the filled rectangles denote the fixed positions of the rocks.

Results. We have run our implementation on all the examples described above. The results are summarized in Table 2. For every POMDP and its variant we show the number of states |S|, the number of actions |A|, and the number of observations |O| of the POMDP, the running time of our tool in seconds, and finally the last column denotes whether there exists a finite-memory almost-sure winning strategy for the LimAvg=1 objective. Our implementation could solve all the examples in less than 5 minutes.

7. Related work

We now discuss our work in comparison to several important works in the literature. We study almost-sure winning for limit-average objectives with qualitative and quantitative constraints, which are infinite-horizon objectives. Two special cases of limit-average objectives with qualitative constraints are (i) reachability objectives, where the target states are absorbing and are the only states with reward 1, and all other rewards are 0; and (ii) safety objectives, where the non-safe states are absorbing and have reward 0, and all other rewards are 1.
(Un)Decidability frontier. A very important and strong negative result of Madani et al. [30] shows that approximating the maximal acceptance probability for PFA is undecidable. The result of Madani et al. [30] implies that for POMDPs with reachability objectives approximating the maximal probability to reach the target set both for finite-memory and infinite-memory strategies is undecidable. This implies that for limit-average objectives with a qualitative constraint, under expected semantics or probabilistic winning (i.e., to approximate the maximal probability), the decision problems are undecidable both for finite-memory and infinite-memory strategies. The problem of almost-sure winning with limit-average objectives was not studied before. We show that the almost-sure winning problem with a qualitative constraint is EXPTIME-complete for finite-memory strategies, whereas undecidable for infinite-memory strategies (in contrast to Madani et al. [30] where the undecidability result holds uniformly both for finite-memory and infinite-memory strategies). We also show that the almost-sure winning problem with quantitative constraints is undecidable even for finite-memory strategies. Strategy complexity. It was shown by Lusena et al. [29] that for POMDPs with finite-horizon or infinite-horizon average reward objectives, state-based memoryless strategies are not sufficient. However, for almost-sure winning, for PFA or POMDPs with safety and reachability objectives, belief-support-based memoryless strategies are sufficient: the result for safety objectives follows from the work of Reif [40] and the result for reachability objectives follows from a reduction of POMDPs to partial-observation games by Chatterjee et al. [7]. Sufficiency of belief-support-based (randomized) memoryless strategies was established by Chatterjee et al. [9]. In contrast, we show that for almost-sure winning of limit-average objectives with qualitative constraints, belief-support-based strategies are not sufficient (Example 1). It was shown by Meuleau et al. [31] that for fixed finite-memory strategies, solving POMDPs is isomorphic to finding memoryless strategies in the “cross-product” POMDP. However, note that without an explicit bound on the size of the finite-memory strategies no decidability results can be obtained (for example, approximating the maximal acceptance probability for PFA is undecidable even for finite-memory strategies [30]). Thus our exponential upper bound on the size of the finite memory is the first crucial step in establishing the decidability result. Moreover, it follows from the results of Chatterjee et al. [10] that for POMDPs deciding almost-sure winning under memoryless strategies is NP-complete, even for reachability objectives. Thus using the exponential cross-product construction and an NP algorithm would yield a NEXPTIME algorithm, whereas we present an EXPTIME algorithm for almost-sure winning in POMDPs with limit-average objectives under qualitative constraints. Finite-horizon objectives. There are several works that consider the stochastic shortest path in POMDPs [32] and the finite-horizon objectives [34,19]. It was shown by Mundhenk et al. [34] that the complexity of computing the optimal finitehorizon expected reward is EXPTIME for POMDPs. Our work is very different, since we consider infinite-horizon objectives in contrast to finite-horizon objectives, and almost-sure winning instead of expected semantics. 
Finite-horizon objectives. There are several works that consider the stochastic shortest path in POMDPs [32] and finite-horizon objectives [34,19]. It was shown by Mundhenk et al. [34] that the complexity of computing the optimal finite-horizon expected reward for POMDPs is EXPTIME. Our work is very different, since we consider infinite-horizon objectives in contrast to finite-horizon objectives, and almost-sure winning instead of expected semantics. Moreover, limit-average objectives are fundamentally different from finite-horizon and shortest-path objectives: in finite-horizon and shortest-path objectives every reward contributes to the payoff, whereas in limit-average objectives a finite prefix of rewards does not change the payoff (for example, a finite sequence of zero rewards followed by all one rewards still has limit-average payoff 1). In other words, limit-average objectives model the long-run behavior of POMDPs, as compared to finite-horizon objectives.
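As a small worked instance of this prefix-independence property (the symbols k, n and r_i are introduced here only for the illustration): for the reward sequence consisting of k zeros followed by all ones, and any n > k,

\[ \frac{1}{n}\sum_{i=1}^{n} r_i \;=\; \frac{n-k}{n} \;\longrightarrow\; 1 \quad \text{as } n \to \infty, \]

so the limit-average payoff is 1 regardless of the (finite) value of k, whereas the finite-horizon payoff for any horizon n ≤ k is 0.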
Almost-sure winning for special cases. The problem of almost-sure winning for POMDPs with safety and reachability objectives has been considered before. For safety objectives, if almost-sure winning fails, then the violation is witnessed by a finite path that has positive probability. With this intuition it was shown by Alur et al. [1] that almost-sure winning for POMDPs with safety objectives can be reduced to partial-observation games with deterministic strategies, and EXPTIME-completeness was established using the results of Reif [40]. For almost-sure winning with reachability objectives, the problem can be reduced to partial-observation games with randomized strategies [7]; the EXPTIME upper bound for randomized strategies in games follows from Chatterjee et al. [9], and the EXPTIME lower bound for POMDPs was established by Chatterjee et al. [8]. In both cases, for reachability and safety objectives, the EXPTIME upper bound crucially relies on the existence of belief-support-based strategies, which does not hold for limit-average objectives with qualitative constraints (Example 1). Our main contribution for limit-average objectives with a qualitative constraint is the EXPTIME upper bound (the lower bound follows from existing results), and all our undecidability proofs for almost-sure winning are completely different from the EXPTIME lower bounds for reachability and safety objectives.

Comparison with Chatterjee et al. [6]. A preliminary version of our work was presented by Chatterjee et al. [6]. In comparison, the present version contains full and detailed proofs of all results, a better description of our algorithm, and an entirely new Section 6 on the implementation and experimental results.

8. Conclusion

We studied POMDPs with limit-average objectives under probabilistic semantics. Since for general probabilistic semantics the problems are undecidable even for PFA, we focus on the very important special case of almost-sure winning. For almost-sure winning with a qualitative constraint, we show that belief-support-based strategies are not sufficient for finite-memory strategies, and establish EXPTIME-complete complexity for the existence of finite-memory strategies. Given our decidability result, the next natural questions are whether the result can be extended to infinite-memory strategies, or to quantitative path constraints. We show that both these problems are undecidable, and thus establish the precise decidability frontier with tight complexity bounds. Also observe that, in contrast to other classical results for POMDPs where both the finite-memory and infinite-memory problems are undecidable, for almost-sure winning with a qualitative constraint we show that the finite-memory problem is decidable (EXPTIME-complete), whereas the infinite-memory problem is undecidable. We also present a prototype implementation of our EXPTIME-complete algorithm along with a heuristic, and the implementation works reasonably efficiently on example POMDPs from the literature. An interesting direction of future work would be to consider the discounted reward payoff instead of the limit-average payoff.

Acknowledgements

We thank the anonymous reviewers for several important suggestions that helped us improve the presentation of the paper.

References

[1] R. Alur, C. Courcoubetis, M. Yannakakis, Distinguishing tests for nondeterministic and probabilistic machines, in: Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, ACM, 1995, pp. 363–372.
[2] S.B. Andersson, D. Hristu, Symbolic feedback control for navigation, IEEE Trans. Autom. Control 51 (6) (2006) 926–937.
[3] P. Billingsley (Ed.), Probability and Measure, Wiley-Interscience, 1995.
[4] B. Bonet, H. Geffner, Solving POMDPs: RTDP-Bel vs. point-based algorithms, in: International Joint Conference on Artificial Intelligence, 2009, pp. 1641–1646.
[5] A.R. Cassandra, L.P. Kaelbling, M.L. Littman, Acting optimally in partially observable stochastic domains, in: Proceedings of the National Conference on Artificial Intelligence, John Wiley & Sons Ltd, 1995, p. 1023.
[6] K. Chatterjee, M. Chmelík, POMDPs under probabilistic semantics, in: Uncertainty in Artificial Intelligence, 2013, pp. 142–151.
[7] K. Chatterjee, L. Doyen, H. Gimbert, T.A. Henzinger, Randomness for free, in: Mathematical Foundations of Computer Science, 2010, pp. 246–257.
[8] K. Chatterjee, L. Doyen, T.A. Henzinger, Qualitative analysis of partially-observable Markov decision processes, in: Mathematical Foundations of Computer Science, 2010, pp. 258–269.
[9] K. Chatterjee, L. Doyen, T.A. Henzinger, J.-F. Raskin, Algorithms for omega-regular games of incomplete information, Log. Methods Comput. Sci. 3 (2007) 4.
[10] K. Chatterjee, A. Kößler, U. Schmid, Automated analysis of real-time scheduling using graph games, in: Hybrid Systems: Computation and Control, 2013, pp. 163–172.
[11] L. Chrisman, Reinforcement learning with perceptual aliasing: the perceptual distinctions approach, in: Association for the Advancement of Artificial Intelligence, Citeseer, 1992, pp. 183–188.
[12] A. Condon, R.J. Lipton, On the complexity of space bounded interactive proofs, in: Foundations of Computer Science, 1989, pp. 462–467.
[13] K. Culik, J. Kari, Digital images and formal languages, in: Handbook of Formal Languages, 1997, pp. 599–616.
[14] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge Univ. Press, 1998.
[15] R. Durrett, Probability: Theory and Examples, second edition, Duxbury Press, 1996.
[16] J. Filar, K. Vrieze, Competitive Markov Decision Processes, Springer-Verlag, 1997.
[17] H. Gimbert, Y. Oualhadj, Probabilistic automata on finite words: decidable and undecidable problems, in: International Colloquium on Automata, Languages and Programming, in: LNCS, vol. 6199, Springer, 2010, pp. 527–538.
[18] E.A. Hansen, R. Zhou, Synthesis of hierarchical finite-state controllers for POMDPs, in: International Conference on Automated Planning and Scheduling, 2003, pp. 113–122.
[19] E.A. Hansen, Indefinite-horizon POMDPs with action-based termination, in: Proceedings of the 22nd National Conference on Artificial Intelligence, AAAI Press, 2007.
[20] H. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.
[21] L.P. Kaelbling, M.L. Littman, A.R. Cassandra, Planning and acting in partially observable stochastic domains, Artif. Intell. 101 (1) (1998) 99–134.
[22] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey, J. Artif. Intell. Res. 4 (1996) 237–285.
[23] J.G. Kemeny, J.L. Snell, A.W. Knapp, Denumerable Markov Chains, D. Van Nostrand Company, 1966.
[24] H. Kress-Gazit, G.E. Fainekos, G.J. Pappas, Temporal-logic-based reactive mission and motion planning, IEEE Trans. Robot. 25 (6) (2009) 1370–1381.
[25] O. Kupferman, S. Sheinvald-Faragy, Finding shortest witnesses to the nonemptiness of automata on infinite words, in: Conference on Concurrency Theory, 2006, pp. 492–508.
[26] M.L. Littman, Algorithms for sequential decision making, PhD thesis, Brown University, 1996.
[27] M.L. Littman, A.R. Cassandra, L.P. Kaelbling, Learning policies for partially observable environments: scaling up, in: International Conference on Machine Learning, vol. 95, Citeseer, 1995, pp. 362–370.
[28] M.L. Littman, J. Goldsmith, M. Mundhenk, The computational complexity of probabilistic planning, J. Artif. Intell. Res. 9 (1998) 1–36.
[29] C. Lusena, T. Li, S. Sittinger, C. Wells, J. Goldsmith, My brain is full: when more memory helps, in: Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1999, pp. 374–381.
[30] O. Madani, S. Hanks, A. Condon, On the undecidability of probabilistic planning and related stochastic optimization problems, Artif. Intell. 147 (1–2) (2003) 5–34.
[31] N. Meuleau, K.-E. Kim, L.P. Kaelbling, A.R. Cassandra, Solving POMDPs by searching the space of finite policies, in: Uncertainty in Artificial Intelligence, 1999, pp. 417–426.
[32] R. Minami, V.F. da Silva, Shortest stochastic path with risk sensitive evaluation, in: Advances in Artificial Intelligence, Springer, 2013, pp. 371–382.
[33] M. Mohri, Finite-state transducers in language and speech processing, Comput. Linguist. 23 (2) (1997) 269–311.
[34] M. Mundhenk, J. Goldsmith, C. Lusena, E. Allender, Complexity of finite-horizon Markov decision process problems, J. ACM 47 (4) (2000) 681–720.
[35] C.H. Papadimitriou, J.N. Tsitsiklis, The complexity of Markov decision processes, Math. Oper. Res. 12 (1987) 441–450.
[36] A. Paz, Introduction to Probabilistic Automata (Computer Science and Applied Mathematics), Academic Press, 1971.
[37] A. Pogosyants, R. Segala, N. Lynch, Verification of the randomized consensus algorithm of Aspnes and Herlihy: a case study, Distrib. Comput. 13 (3) (2000) 155–186.
[38] M.L. Puterman, Markov Decision Processes, John Wiley and Sons, 1994.
[39] M.O. Rabin, Probabilistic automata, Inf. Control 6 (1963) 230–245.
[40] J.H. Reif, The complexity of two-player games of incomplete information, J. Comput. Syst. Sci. 29 (2) (1984) 274–301.
[41] T. Smith, R. Simmons, Heuristic search value iteration for POMDPs, in: Uncertainty in Artificial Intelligence, AUAI Press, 2004, pp. 520–527.
[42] M.I.A. Stoelinga, Fun with FireWire: experiments with verifying the IEEE1394 root contention protocol, in: Formal Aspects of Computing, 2002.