Operations Research Letters 39 (2011) 193–197
Time aggregated Markov decision processes via standard dynamic programming✩

Edilson F. Arruda a, Marcelo D. Fragoso b,∗

a School of Engineering, Pontifical Catholic University of Rio Grande do Sul, Brazil
b Center for Systems and Control-CSC, National Laboratory for Scientific Computation-LNCC, Av. Getúlio Vargas, 333, Petrópolis, RJ 25651-075, Brazil

✩ A preliminary version of this paper was presented at the 2009 IEEE Conference on Decision and Control, Shanghai, China.
∗ Corresponding author. Tel.: +55 24 2233 6013; fax: +55 24 2233 6141. E-mail addresses: [email protected] (E.F. Arruda), [email protected] (M.D. Fragoso).
doi:10.1016/j.orl.2011.03.006

Article history: Received 19 February 2010; Accepted 10 March 2011; Available online 23 March 2011.
Keywords: Markov decision processes; Time aggregation; Dynamic programming
Abstract: This note addresses the time aggregation approach to ergodic finite state Markov decision processes with uncontrollable states. We propose the use of the time aggregation approach as an intermediate step toward constructing a transformed MDP whose state space is comprised solely of the controllable states. The proposed approach simplifies the iterative search for the optimal solution by eliminating the need to define an equivalent parametric function, and results in a problem that can be solved by simpler, standard MDP algorithms.
1. Introduction

This paper proposes a state reduction algorithm for a finite-state ergodic Markov Decision Process (MDP) with average costs per unit time. Relevant state reduction methods have been studied in the literature under the names of embedded [4,6] and time aggregated [2,8,9] MDPs. The existence and construction of embedded MDP models for the total cost and the average cost criteria were investigated in [4] and [6], respectively. Whereas these works focus on theoretical requirements, Cao et al. [2] proposed a time aggregation approach to average cost MDPs while also providing a computational framework to solve the resulting aggregated (embedded) process. Such a framework aims at reducing the computational requirements of traditional dynamic programming (DP) techniques in problems with very large state spaces; see also [8,9].

It is well known that semi-Markov decision processes (SMDPs) can be solved by transforming the studied SMDP into an equivalent MDP and solving the latter problem; see, for example, Proposition 11.4.5 of [7, p. 557] and the discussion that follows, or [5, pp. 983–988]. It can be argued that the time aggregation approach exploits these SMDP-MDP equivalence transformations by applying the reverse logic. It provides the tools to transform an MDP into an equivalent SMDP with reduced state space and then exploit the reduced state space of the transformed SMDP to find approximate solutions in reduced computational time. Exact solutions can be retrieved in the particular case of large scale MDPs with a large number of uncontrollable states, i.e. states for which only a single control action is available; see [2]. One application with this structure is the asset replacement problem [3], to be revisited in Section 4.

After applying the time aggregation technique to find an equivalent SMDP, a solution for that SMDP should be sought. Typical approaches to solving the equivalent SMDP involve applying modified policy or value iteration algorithms, tailored to the time aggregation framework [8,9]. In particular, the incremental value iteration (IVI) algorithm in [9] reported better overall performance than its counterparts. This algorithm generates a convergent sequence of parametric MDPs that are solved to optimality for each parameter in the sequence.

This paper proposes an alternative transformation that finds an equivalent average cost MDP with reduced state space. The resulting MDP is then solved by a standard value iteration (VI) algorithm [7], which is much simpler and less demanding than its counterparts designed to solve the semi-Markov process resulting from time aggregation. To derive the reduced state space MDP, we submit the SMDP obtained via time aggregation to an SMDP-MDP transformation. Hence, the proposed approach makes use of time aggregation as an intermediate step of an MDP-to-MDP transformation that turns the original large scale problem into an equivalent problem with reduced state space. It is perhaps noteworthy that a preliminary version of the results was presented at the 2009 CDC (see [1]).

We note that solving the transformed MDP with reduced state space is equivalent to performing a single parametric MDP evaluation step in the incremental value iteration (IVI) algorithm in [9]. Hence, the solution procedure is greatly simplified. Moreover, by eliminating the need to choose and refine parameters
for the IVI algorithm, the proposed approach tends to be more efficient and robust.

This paper is organized as follows. Section 2 introduces the studied problem. The time aggregation formulation is investigated in Section 2.1, and its solution by means of the IVI algorithm is addressed in Section 2.2. In Section 3, a new MDP model with reduced state space is introduced and proved to be equivalent to the time aggregated formulation. Finally, Section 4 comprises numerical experiments.

2. Problem statement

Consider a time homogeneous discrete Markov decision process (MDP), with finite state space S and cost function f : S → R^+, where R^+ denotes the set of positive real numbers. Let A(i) denote the set of feasible control actions at state i and define A := {A(i), i ∈ S}. Let L = {L : S → A} be the set of feasible stationary control policies, with L denoting a member of this set. At the decision maker's discretion, a policy L is employed to control the evolution of the system. The selected control policy prescribes a single control action L(i) ∈ A(i) to be applied each time the process pays a visit to a given state i ∈ S. Following a visit to state i ∈ S, the process moves to state j ∈ S with probability p^{L(i)}_{ij}. The evolution of the controlled process is described by a Markov chain {X_t, t ≥ 0}, and the one-period transitions are determined by the transition matrix P^L = {p^{L(i)}_{ij}}, i, j ∈ S. Following [2], we assume that the controlled process is ergodic under all policies. Assume that f^L : S → R^+ is a measurable positive real-valued function and let
\[
\eta^L = \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} f^L(X_k) \tag{1}
\]

be the long term average cost of the controlled chain. Because the controlled chain is ergodic, this cost is independent of the initial state. The objective of the decision maker is to find the optimal policy L^* ∈ L, which satisfies

\[
\eta^{L^*} \le \eta^{L}, \quad \forall L \in \mathcal{L}. \tag{2}
\]

2.1. Solution by time aggregation

The time aggregation approach relies on selecting a small subset S_1 of the state space S and investigating the properties of the embedded MDP restricted to this subset. One may, for instance, select the states which are more important from some control standpoint, or the states that are expected to be visited more frequently under some particular class of control policies. When there is more than one control action available in the subset S \ S_1, one selects a priori, for each state in this subset, a control action to be performed. Observe that the drawback in doing so is a sacrifice in performance, for one can no longer guarantee the optimality of the output of the time aggregation algorithm, and some policies are never evaluated in the search process. However, a careful selection of S_1 may limit the loss in performance when only states which are very seldom visited (i.e. states that are rare events), and therefore contribute little to the overall cost, are left out of S_1. Of course, selecting such states a priori depends on the sensibility and prior knowledge of the decision maker.

Remark 1. This paper follows [2] and assumes that a single control is available for each state in S \ S_1. In general, this assumption can be dropped if the action space is suitably redefined to take into account the control actions outside of S_1. In such a case, one trades state space reduction for a possible increase in the number of control actions. For more details on this approach, we refer to [6].

Let τ_0 = 0 and define τ_i = min{t > τ_{i-1} : X_t ∈ S_1}, i = 1, 2, . . . . The embedded chain {Y_i := X_{τ_i}, i ≥ 0} has state space S_1 and its properties can be determined by investigating the trajectories between consecutive visits to the set S_1. It is assumed that, for each state in the set S \ S_1, there is a single control action available, i.e. all feasible policies share the same structure outside of the set S_1. Therefore, for each policy L, the average cost η^L is a function of the control actions in S_1. The time aggregation approach involves dividing the original Markov chain into segments, each segment corresponding to a trajectory between two successive visits to S_1. Hence, the segment corresponding to a given state Y_i = X_{τ_i} is {X_{τ_i}, X_{τ_i+1}, . . . , X_{τ_{i+1}-1}}. The expected cost of a segment starting at state i ∈ S_1 is

\[
h^L_f(i) = h_f(i, L(i)) := E\left[\sum_{t=\tau_i}^{\tau_{i+1}-1} f^L(X_t) \,\Big|\, X_{\tau_i} = i\right] = E\left[\sum_{t=0}^{\tau_1-1} f^L(X_t) \,\Big|\, X_0 = i\right]; \tag{3}
\]

\[
h^L_1(i) = h_1(i, L(i)) = h_f(i, L(i)), \quad \text{for } f(j) \equiv 1,\ \forall j \in S, \tag{4}
\]

where the second equality in the first expression follows from the strong Markov property. In the remainder of the paper, the notations h^L_f(i) and h_f(i, L(i)) will be used interchangeably. For the embedded process {Y_t, t ≥ 0}, the transition probability from state i ∈ S_1 to state j ∈ S_1, under control a ∈ A(i), is given by p̃^a_{ij}. It is assumed that the transition probability matrix relative to the embedded process {Y_t, t ≥ 0},

\[
\tilde P^L = \left[\,\tilde p^{\,L(i)}_{ij}\,\right], \quad \text{for all } i, j \in S_1, \tag{5}
\]

can either be calculated or accurately estimated. Let π̃ = {π̃(i), i ∈ S_1} be the set of steady state probabilities of the embedded process {Y_t, t ≥ 0} under a control policy L, and define

\[
\bar n^L = \sum_{i\in S_1} \tilde\pi(i)\, h_1(i, L(i)). \tag{6}
\]

It is proved in [2] that, for each control policy L, the long term average cost of the original MDP is numerically equal to

\[
\eta^L = \lim_{M\to\infty} \frac{\frac{1}{M}\sum_{m=0}^{M-1} h^L_f(Y_m)}{\frac{1}{M}\sum_{m=0}^{M-1} h^L_1(Y_m)} = \frac{\sum_{i\in S_1} \tilde\pi(i)\, h^L_f(i)}{\sum_{i\in S_1} \tilde\pi(i)\, h^L_1(i)} = \frac{\tilde\pi H^L_f}{\bar n^L}, \tag{7}
\]

where π̃ is the row vector of steady state probabilities of the embedded process {Y_t, t ≥ 0} and H^L_f is the column vector of the individual values h^L_f(i), i ∈ S_1.
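As a concrete illustration of Eqs. (3)–(7), the segment costs h^L_f, the segment lengths h^L_1 and the embedded matrix P̃^L can be obtained by solving small linear systems over the states outside S_1, since a segment accumulates cost until the chain first returns to S_1. The sketch below does this for a fixed policy on a small hypothetical five-state chain (the transition matrix, the cost vector and the choice of S_1 are invented for the example and are not taken from the paper) and checks that Eq. (7) reproduces the long-run average cost of the original chain.

```python
import numpy as np

# Hypothetical 5-state chain under a fixed policy L (not from the paper).
# States 0,1 form S1 (controllable); states 2,3,4 form S \ S1.
P = np.array([[0.10, 0.20, 0.30, 0.25, 0.15],
              [0.05, 0.15, 0.40, 0.20, 0.20],
              [0.20, 0.30, 0.10, 0.25, 0.15],
              [0.25, 0.25, 0.20, 0.10, 0.20],
              [0.30, 0.20, 0.20, 0.20, 0.10]])
f = np.array([4.0, 2.0, 1.0, 3.0, 5.0])          # cost function f
S1 = [0, 1]
S0 = [2, 3, 4]

P11, P10 = P[np.ix_(S1, S1)], P[np.ix_(S1, S0)]
P01, P00 = P[np.ix_(S0, S1)], P[np.ix_(S0, S0)]

# Expected cost (and time) accumulated in S \ S1 before hitting S1,
# starting from a state in S \ S1:  w = (I - P00)^{-1} f_{S0}.
I0 = np.eye(len(S0))
w_f = np.linalg.solve(I0 - P00, f[S0])
w_1 = np.linalg.solve(I0 - P00, np.ones(len(S0)))

# Segment costs, Eqs. (3)-(4): cost at i plus cost accumulated outside S1.
h_f = f[S1] + P10 @ w_f
h_1 = np.ones(len(S1)) + P10 @ w_1

# Embedded transition matrix, Eq. (5): direct return plus return via S \ S1.
P_tilde = P11 + P10 @ np.linalg.solve(I0 - P00, P01)

def stationary(P):
    """Stationary distribution of an ergodic stochastic matrix."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

pi_tilde = stationary(P_tilde)                        # steady state of the embedded chain
eta_aggregated = pi_tilde @ h_f / (pi_tilde @ h_1)    # Eq. (7)
eta_original = stationary(P) @ f                      # direct long-run average cost

print(eta_aggregated, eta_original)   # the two values should coincide
```

Only the |S_1|-dimensional objects h^L_f, h^L_1 and P̃^L enter the subsequent optimization, which is what makes the aggregation attractive when S \ S_1 is large.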
2.2. The incremental value iteration algorithm

Based on (7), a new performance function f̃ : S_1 × L → R^+, such that

\[
\tilde f^L(i) = \tilde f(i, L(i)) = \frac{1}{\bar n^L}\, h_f(i, L(i)), \tag{8}
\]

was proposed in [2]. Note that the evaluation of the denominator above (Eq. (6)) requires the prior knowledge of the invariant probabilities of the process {Y_t, t ≥ 0}. Given that this knowledge is not available in general, the alternative is to apply an equivalent
performance function. Indeed, an equivalent function f̂_δ : S_1 × L × R^+ → R^+, such that

\[
\hat f_\delta(i, a) = \hat f(i, L(i), \delta) = h_f(i, a) - \delta\, h_1(i, a), \quad i \in S_1,\ a \in A(i), \tag{9}
\]

is proposed in [2], where δ > 0 is a scalar parameter. The long term average cost of a given policy L with respect to a given parametric cost f̂_δ is

\[
\eta^L_\delta = \sum_{i\in S_1} \tilde\pi(i)\, \hat f_\delta(i, L(i)) = \sum_{i\in S_1} \tilde\pi(i)\left[h_f(i, L(i)) - \delta\, h_1(i, L(i))\right].
\]

And the best policy with respect to the cost function f̂_δ, which yields η^*_δ ≤ η^L_δ, ∀L ∈ L, can be found as the solution of the Poisson equation [7]

\[
V^*_\delta(i) \le h_f(i, a) - \delta\, h_1(i, a) + \sum_{j\in S_1} \tilde p^a_{ij}\, V^*_\delta(j) - \eta^*_\delta, \quad \forall i \in S_1,\ a \in A(i). \tag{10}
\]

Here, p̃^a_{ij} is the transition probability from i ∈ S_1 to j ∈ S_1 for the embedded process {Y_t, t ≥ 0}. The results in [2] yield

\[
\sum_{i\in S_1} \tilde\pi(i)\, \hat f_\delta(i, L(i)) = 0 \quad \text{iff}\quad \delta = \eta^L. \tag{11}
\]

Hence, the Poisson equation for the optimal control policy becomes

\[
V^*(i) \le h_f(i, a) - \eta^{L^*} h_1(i, a) + \sum_{j\in S_1} \tilde p^a_{ij}\, V^*(j), \quad \forall i \in S_1,\ a \in A(i). \tag{12}
\]

To solve the above expression for η^{L^*}, adaptations of the standard value and policy iteration algorithms were proposed in [2,8] and [9]. They proposed cleverly designed algorithms to search for the solution of the Poisson equation under the cost function f̂_δ, while iteratively updating the parameter δ. Algorithm 1 below is a transcription from [9] of the incremental value iteration algorithm, reported to perform better than its counterparts. It is worth mentioning that the functions V_m in Algorithm 1 correspond to the functions ĝ_m in Algorithm 1 of [9].

Algorithm 1 (Incremental Value Iteration)
1. Choose an |S_1|-dimensional vector V_0 and initialize δ_0 ∈ R. Set m = n = 0, specify ϵ > σ > 0, and choose B_u > 0.
2. For each i ∈ S_1, compute V_{m+1}(i) by
\[
V_{m+1}(i) = \min_{a\in A(i)} \left[ h_f(i, a) - \delta_n\, h_1(i, a) + \sum_{j\in S_1} \tilde p^a_{ij}\, V_m(j) \right].
\]
3. If sp(V_{m+1} − V_m) < σ, go to step 4. Otherwise, set m = m + 1 and return to step 2. Here, sp(V) := max_{i∈S_1} V(i) − min_{i∈S_1} V(i).
4. Compute
\[
\hat\eta_{\delta_n} = \frac{1}{2}\left[ \max_{i\in S_1}\big(V_{m+1}(i) - V_m(i)\big) + \min_{i\in S_1}\big(V_{m+1}(i) - V_m(i)\big) \right].
\]
If |η̂_{δ_n}| ≤ ϵ, go to step 5. Otherwise, set n = n + 1, m = m + 1, let δ_{n+1} = δ_n + α η̂_{δ_n} with α = 1/B_u, and return to step 2.
5. Obtain the optimal policy for all i ∈ S_1:
\[
L^*(i) = \arg\min_{a\in A(i)} \left[ h_f(i, a) - \delta_n\, h_1(i, a) + \sum_{j\in S_1} \tilde p^a_{ij}\, V_m(j) \right].
\]

Note that steps 2–3 of Algorithm 1 comprise the application of the classical value iteration algorithm [7, Chapter 8] to find the optimal average cost solution of an MDP with the parametric cost function f̂_{δ_k}, k = 0, 1, . . ., defined in (9). Such a solution is then employed to generate a new parameter δ_{k+1} in Step 4. The algorithm stops when the parameter sequence δ_k, k = 0, 1, . . ., converges to the optimal solution η^* = η^{L^*} of the original MDP. For details regarding the convergence of Algorithm 1, we refer to [9]. Observe also that one needs to select the parameters δ_0 and B_u. These parameters directly influence the convergence time of the IVI algorithm.

3. An equivalent average cost MDP with reduced state space

We proceed to construct a new MDP with state space S_1, whose solution is equivalent to that of the time aggregated MDP in Section 2.1. Given the equivalence between the time aggregated model and the original problem, it follows that, by solving the proposed MDP, we also solve the original problem in Section 2. The proposed construction follows an idea from Puterman [7, Section 11.4.3]. Consider an MDP with state space S_1 and the same set of feasible control policies L as the time aggregated MDP. Define an instantaneous performance function g : S_1 × L → R^+ such that

\[
g(i, a) = \frac{h_f(i, a)}{h_1(i, a)}, \quad i \in S_1,\ a \in A(i), \tag{13}
\]

where the right hand side terms are defined in Eqs. (3) and (4), respectively. Each control policy L ∈ L induces a transition probability matrix P̂^L = [p̂^{L(i)}_{ij}], with

\[
\hat p^a_{ij} =
\begin{cases}
\dfrac{\rho}{h_1(i, a)}\, \tilde p^a_{ij}, & j \neq i, \\[2mm]
1 + \dfrac{\rho}{h_1(i, a)}\left(\tilde p^a_{ii} - 1\right), & j = i,
\end{cases}
\qquad i, j \in S_1,\ a \in A(i), \tag{14}
\]

where ρ is a scalar parameter and p̃^a_{ij} is the transition probability of the time aggregated process, defined in Eq. (5). For the equations above to characterize a legitimate transition probability, the parameter ρ must satisfy

\[
0 < \rho < \frac{1}{1 - \tilde p^a_{ii}}\, h_1(i, a). \tag{15}
\]

The effect of the parameter ρ is to reshape the transition probabilities, increasing or decreasing the self-transition probabilities. They are decreased whenever ρ > h_1(i, a) and increased otherwise.
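For a fixed action in each state of S_1, the data transformation (13)–(14) amounts to a few lines of linear algebra. The sketch below is a minimal illustration under stated assumptions: `h_f`, `h_1` and `P_tilde` hold illustrative stand-in values for the quantities of Eqs. (3)–(5) (they are not taken from the paper), and ρ is chosen as a fixed fraction of the upper bound (15). The two checks at the end verify properties the construction is designed to have: P̂ is a stochastic matrix, and the long-run average of g under P̂ agrees with η^L computed from Eq. (7).

```python
import numpy as np

def transformed_mdp(h_f, h_1, P_tilde, safety=0.99):
    """Data transformation of Eqs. (13)-(14) for a fixed policy.

    h_f, h_1 : segment cost / segment length vectors over S1 (Eqs. (3)-(4))
    P_tilde  : embedded transition matrix over S1 (Eq. (5))
    safety   : fraction of the upper bound (15) used to pick rho
    """
    # rho strictly inside the bound 0 < rho < h_1(i) / (1 - p_tilde_ii) for every i.
    rho = safety * np.min(h_1 / (1.0 - np.diag(P_tilde)))

    g = h_f / h_1                       # Eq. (13): cost per unit time
    scale = (rho / h_1)[:, None]        # rho / h_1(i), applied row-wise
    P_hat = scale * P_tilde             # off-diagonal part of Eq. (14)
    np.fill_diagonal(P_hat, 1.0 + (rho / h_1) * (np.diag(P_tilde) - 1.0))  # diagonal of Eq. (14)
    return g, P_hat, rho

def stationary(P):
    """Stationary distribution of an ergodic stochastic matrix."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Illustrative stand-in values for a fixed policy on a two-state S1.
h_f = np.array([8.5, 7.0])
h_1 = np.array([2.6, 2.4])
P_tilde = np.array([[0.45, 0.55],
                    [0.40, 0.60]])

g, P_hat, rho = transformed_mdp(h_f, h_1, P_tilde)

assert np.allclose(P_hat.sum(axis=1), 1.0) and (P_hat >= 0).all()
pi_tilde, pi_hat = stationary(P_tilde), stationary(P_hat)
eta_smdp = pi_tilde @ h_f / (pi_tilde @ h_1)   # Eq. (7), time aggregated formulation
eta_mdp = pi_hat @ g                           # average cost of the transformed MDP
print(rho, eta_smdp, eta_mdp)                  # the two averages should agree
```

The agreement of the two averages for a fixed policy holds for any ρ satisfying (15); the choice of ρ only reshapes the self-transition probabilities, as discussed above.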
Let the process Z_k, k ≥ 0, denote the evolution of the proposed MDP operating under a control policy L ∈ L. For any choice of the parameter ρ satisfying (15), the solution of the proposed MDP,

\[
\hat{\bar\eta}^{*} = \hat{\bar\eta}^{L^*} = \min_{L\in\mathcal{L}} \hat{\bar\eta}^{L}, \tag{16}
\]

with

\[
\hat{\bar\eta}^{L} = \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} g\big(Z_k, L(Z_k)\big), \tag{17}
\]

can be found by solving the associated Poisson equation, which reads

\[
\tilde V^*(i) \le g(i, a) + \sum_{j\in S_1} \hat p^a_{ij}\, \tilde V^*(j) - \hat{\bar\eta}^{*}, \quad \forall i \in S_1,\ a \in A(i). \tag{18}
\]

We now prove that the solution to the Poisson equation associated with the time aggregated formulation (Eq. (12)) can be fully determined from the solution to Eq. (18) above.
Theorem 1. Let the pair $(\tilde V^*, \hat{\bar\eta}^{*})$ be the solution of Eq. (18). Then the pair $(V^* = \rho \tilde V^*, \eta^* = \hat{\bar\eta}^{*})$ solves Eq. (12).

Proof. Substituting Eq. (14) into Eq. (18) we obtain

\[
\tilde V^*(i) \le g(i, a) + \frac{\rho}{h_1(i, a)} \sum_{j\in S_1} \tilde p^a_{ij}\, \tilde V^*(j) + \left[1 - \frac{\rho}{h_1(i, a)}\right] \tilde V^*(i) - \hat{\bar\eta}^{*}, \quad \forall i \in S_1,\ a \in A(i).
\]

Multiplying the expression above by h_1(i, a) and rearranging, we obtain

\[
\rho \tilde V^*(i) \le h_f(i, a) - \hat{\bar\eta}^{*}\, h_1(i, a) + \sum_{j\in S_1} \tilde p^a_{ij}\, \rho \tilde V^*(j), \quad \forall i \in S_1,\ a \in A(i),
\]

and the expression above is equivalent to Eq. (12) with $V^* = \rho \tilde V^*$ and $\eta^* = \hat{\bar\eta}^{*}$.
Note that the construction of the proposed MDP is inspired by a transformation result derived in [7, Section 11.4.3] within the context of continuous time semi-Markov processes (SMDPs). In effect, Theorem 1 is adapted from Proposition 11.4.5 in [7, p. 557]. As Puterman [7] puts it, "the effect of the transformation is to convert cost into a unit time basis, then alter the transitions so that the average cost [of the proposed MDP] and that of the [time aggregated] SMDP agree". The immediate benefit is to obtain the optimal solution by solving a simpler MDP instead of the original SMDP.

In view of Theorem 1, one can see that the optimal average cost of the time aggregated MDP can be found by solving the MDP proposed in this section. This problem can be solved by the standard value iteration algorithm [7], described below.

Algorithm 2 (Standard Value Iteration Algorithm)
1. Choose an |S_1|-dimensional vector Ṽ_0. Set n = 0 and specify ϵ > 0.
2. For each i ∈ S_1, compute Ṽ_{n+1}(i) by
\[
\tilde V_{n+1}(i) = \min_{a\in A(i)} \left[ g(i, a) + \sum_{j\in S_1} \hat p^a_{ij}\, \tilde V_n(j) \right].
\]
3. If sp(Ṽ_{n+1} − Ṽ_n) < ϵ, go to step 4. Otherwise, set n = n + 1 and return to step 2. Here, sp(Ṽ) := max_{i∈S_1} Ṽ(i) − min_{i∈S_1} Ṽ(i).
4. Compute
\[
\eta^{L^*} = \frac{1}{2}\left[ \max_{i\in S_1}\big(\tilde V_{n+1}(i) - \tilde V_n(i)\big) + \min_{i\in S_1}\big(\tilde V_{n+1}(i) - \tilde V_n(i)\big) \right],
\]
and obtain the optimal policy for all i ∈ S_1:
\[
L^*(i) = \arg\min_{a\in A(i)} \left[ g(i, a) + \sum_{j\in S_1} \hat p^a_{ij}\, \tilde V_n(j) \right].
\]

Note that Algorithm 2 corresponds to a single application of Steps 2–3 of Algorithm 1, i.e. it corresponds to a single parametric evaluation step of the IVI algorithm. Hence, solving the proposed MDP with reduced state space greatly simplifies the search process and eliminates the need to select the parameters δ_0 and B_u that appear in Algorithm 1. As a result, solving the proposed MDP makes the search process more robust and parameter free.
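Algorithm 2 is an ordinary span-based value iteration and maps directly to code. The sketch below is a generic, hypothetical implementation (the array layout, the tolerance and the toy instance at the bottom are assumptions made for illustration, not data from the paper): `g[i, a]` stores the transformed cost of Eq. (13), `P_hat[a, i, j]` the transformed transition probabilities of Eq. (14), and infeasible actions are masked with an infinite cost.

```python
import numpy as np

def value_iteration(g, P_hat, eps=1e-8, max_iter=100_000):
    """Algorithm 2 (standard value iteration) on the transformed MDP.

    g     : |S1| x |A| array of transformed costs, Eq. (13)
            (np.inf marks actions that are infeasible at a state)
    P_hat : |A| x |S1| x |S1| array of transformed transitions, Eq. (14)
    Returns the estimated optimal average cost, a greedy policy and V_n.
    """
    n_states, n_actions = g.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[i, a] = g(i, a) + sum_j p_hat^a_ij V(j)       (step 2)
        Q = g + np.einsum('aij,j->ia', P_hat, V)
        V_new = Q.min(axis=1)
        diff = V_new - V
        if diff.max() - diff.min() < eps:                 # span stopping rule (step 3)
            eta = 0.5 * (diff.max() + diff.min())         # average cost estimate (step 4)
            policy = Q.argmin(axis=1)                     # greedy policy (step 4)
            return eta, policy, V
        V = V_new
    raise RuntimeError("value iteration did not converge")

# Illustrative two-state, two-action instance of the transformed MDP
# (numbers are made up; a real instance comes from Eqs. (13)-(14)).
g = np.array([[3.27, 2.80],
              [2.92, np.inf]])         # action 1 infeasible at state 1
P_hat = np.array([[[0.01, 0.99],       # action 0
                   [0.78, 0.22]],
                  [[0.50, 0.50],       # action 1
                   [0.50, 0.50]]])

eta, policy, V = value_iteration(g, P_hat)
print(eta, policy)
```

By Theorem 1, the returned average cost coincides with η^{L^*} of the time aggregated formulation, and, up to an additive constant, ρ times the returned value function plays the role of V^* in Eq. (12).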
4. Numerical examples

This section replicates the numerical example in [9] to compare the performance of a standard VI algorithm (Algorithm 2) applied to the proposed MDP, with instantaneous cost and transition probabilities given by (13)–(14), with the performance of the IVI algorithm (Algorithm 1), which iterates directly on the time aggregated (semi-Markov) domain and applies the performance function in (9).

The proposed example is an asset replacement problem, comprised of several components, which must be replaced whenever they fail or reach their expiration date. The problem is to define a replacement schedule that minimizes the long term average cost of the system. A detailed study of this problem and variations of it can be found in [3]. We study a problem with three components. The expiration time is 10 periods and both the setup cost and the cost of each new component are 10 units. At each period, each component can fail with probability 0.01. The state of the system is the combined status (failed, expired, or remaining life time before expiration) of all components. Replacement decisions need only be considered for states with expired or failed components; the set of these states is denoted by S_1. As reported in [9], the total number of states in the problem is 11^3 = 1331; see also [3]. However, the cardinality of S_1 is 602. Hence, time aggregation yields a significant state space reduction. The value h_f(i, a), i ∈ S_1, a ∈ A(i), is equal to the cost at i, for no cost is incurred between successive replacements. As for the values of h_1(i, a) and P̃^L, they can be calculated using Eqs. (4) and (6) in [9], respectively.

The experimental data in [9] demonstrate that, for the referred example, the IVI algorithm compares favorably with the preceding policy and value iteration algorithms introduced in [2] and [8]. Bearing that in mind, we use the IVI algorithm as a benchmark to evaluate the effect of iterating on the modified MDP introduced in Section 3. To solve the modified MDP, we employed the standard VI algorithm (Algorithm 2). Disregarding the subtraction in (9), the computational costs of each value function update of the VI and IVI algorithms (Step 2 in Algorithms 2 and 1, respectively) are equivalent. Hence, we count each value function update as an iteration for each method. Recall, from Section 2.2, that steps 2–3 of Algorithm 1 comprise the application of the classical value iteration algorithm to solve an MDP with the parametric cost function f̂_δ. We kept track of the number of parametric MDPs solved by Algorithm 1. Note also that Algorithm 2 addresses a single modified MDP, thus it always solves a single MDP.

The simulation results can be found in Table 1. The second column conveys the number of MDPs solved by each algorithm, while the last column reports the number of iterations.

Table 1
Comparison of VI and IVI for a replacement problem.

Algorithm | MDPs solved | IVI parameter δ_0 | IVI parameter B_u | Iterations
VI        | 1           | –                 | –                 | 38
IVI       | 5           | 10                | 9                 | 35
IVI       | 8           | 10                | 10                | 49
IVI       | 7           | 10                | 8                 | 52

For the IVI algorithm we tested various combinations of parameters. The best results were obtained for B_u = 9 and δ_0 = 10: the algorithm converged in 35 iterations, while solving 5 parametric MDPs. Note, however, that setting B_u = 10 or B_u = 8 affects the performance, and the algorithm converges in 49 and 52 iterations, while solving 8 and 7 MDPs, respectively. The standard VI algorithm, on the other hand, solves the modified MDP in 38 iterations, a performance comparable to the best performance of the IVI algorithm. Moreover, it has the advantage of not being prone to a bad choice of parameters. Our experiments suggest that the closer the parameter ρ is to the upper bound in Eq. (15), the better the performance. Hence, we selected

\[
\rho = 0.99 \cdot \min_{i \in S_1,\, a \in A(i)} \frac{1}{1 - \tilde p^a_{ii}}\, h_1(i, a).
\]
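As an aside, the state space bookkeeping reported above is easy to reproduce. The short sketch below enumerates one per-component status encoding that is consistent with the reported counts (failed, expired, or 1–9 remaining periods of life, giving 11 statuses per component); this particular encoding is an assumption made for the illustration and is not spelled out in the paper.

```python
from itertools import product

# One possible per-component status encoding (an assumption): 'failed',
# 'expired', or an integer 1..9 for the remaining life before expiration.
statuses = ['failed', 'expired'] + list(range(1, 10))   # 11 statuses per component
n_components = 3

states = list(product(statuses, repeat=n_components))
needs_replacement = lambda s: any(c in ('failed', 'expired') for c in s)
S1 = [s for s in states if needs_replacement(s)]

print(len(states), len(S1))   # prints 1331 and 602 with this encoding
```

With this encoding the script prints 1331 total states and 602 states requiring a replacement decision, matching the figures quoted from [9] and [3].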
Acknowledgements

The authors would like to express their gratitude to the referees for their many suggestions and helpful comments, which have certainly improved the quality of the paper. This research was supported by CNPq under Grants No. 470527/2007-2 and No. 302501/2010-0 and by FAPERJ under Grant No. E-26/170.008/2008.

References

[1] E.F. Arruda, M. Fragoso, Standard dynamic programming applied to time aggregated Markov decision processes, in: Proceedings of the 48th IEEE International Conference on Decision and Control, Shanghai, 2009, pp. 2576–2580.
[2] X. Cao, Z. Ren, S. Bhatnagar, M. Fu, S. Marcus, A time aggregation approach to Markov decision processes, Automatica 38 (6) (2002) 929–943.
[3] R. Dekker, R.E. Wildeman, R. van Egmond, Joint replacement in an operational planning phase, European Journal of Operational Research 91 (1) (1996) 74–88.
[4] E.A. Fainberg, Sufficient classes of strategies in discrete dynamic programming I: Decomposition of randomized strategies and embedded models, Theory of Probability and its Applications 31 (4) (1986) 658–668.
[5] R. Howard, Dynamic Probabilistic Systems, vol. II, John Wiley & Sons, New York, 1971.
[6] A. Leizarowitz, A. Shwartz, Exact finite approximations of average-cost countable Markov decision processes, Automatica 44 (6) (2008) 1480–1487.
[7] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, New York, 1994.
[8] Z. Ren, B.H. Krogh, Markov decision processes with fractional costs, IEEE Transactions on Automatic Control 50 (5) (2005) 646–650.
[9] T. Sun, Q. Zhao, P.B. Luh, Incremental value iteration for time aggregated Markov decision processes, IEEE Transactions on Automatic Control 52 (11) (2007) 2177–2182.