4th IFAC International Conference on Intelligent Control and Automation Sciences
June 1-3, 2016. Reims, France

Available online at www.sciencedirect.com

IFAC-PapersOnLine 49-5 (2016) 285–290
Policy Derivation Methods for Critic-Only Reinforcement Learning in Continuous Action Spaces
Eduard Alibekov ∗, Jiri Kubalik ∗, Robert Babuska ∗,∗∗

∗ Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic, {eduard.alibekov, [email protected]}@cvut.cz
∗∗ Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands, [email protected]

Abstract: State-of-the-art critic-only reinforcement learning methods can deal with a small discrete action space. The most common approach to real-world problems with continuous actions is to discretize the action space. In this paper a method is proposed to derive a continuous-action policy based on a value function that has been computed for discrete actions by using any known algorithm such as value iteration. Several variants of the policy-derivation algorithm are introduced and compared on two continuous state-action benchmarks: double pendulum swing-up and 3D mountain car.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: reinforcement learning, continuous actions, multi-variable systems, optimal control, policy derivation

1. INTRODUCTION

Reinforcement Learning (RL) algorithms provide a way to optimally solve decision and control problems of dynamic systems (Sutton and Barto, 1998). An RL agent interacts with the system by measuring the states and applying actions according to a certain policy. After applying an action, the RL agent receives a scalar reward signal related to the immediate performance of the agent. The goal is to find an optimal policy, which maximizes the long-term cumulative reward.

The available RL algorithms can be broadly classified into critic-only, actor-only, and actor-critic methods (Konda and Tsitsiklis, 2000). Critic-only methods first find the optimal value function (abbreviated as V-function) and then derive an optimal policy from this value function. In contrast, actor-only methods search directly in the policy space. The two approaches can be combined into actor-critic architectures, where the actor and critic are both represented explicitly and learned simultaneously. The critic learns the value function and based on that it determines how the policy should be changed. Each class can be further divided into model-based and model-free algorithms, depending on the use of a system model throughout the learning process.

In this paper we consider the critic-only, model-based variant of RL in continuous state and action spaces. The typical learning process, depicted in Figure 1, consists of three steps:

(1) Data collection – using a model of the system or the system itself, samples in the form (xk, u, xk+1, rk+1) are collected. Here, xk is the system state, u is the control input (action), xk+1 is the state that the system reaches from state xk after applying action u, and rk+1 is the immediate reward for that transition.
(2) Computation of the optimal V-function – based on the samples, an approximation of the V-function is learnt, which for each system state predicts the cumulative long-term reward obtained under the optimal policy.
(3) Policy inference – based on the computed V-function, the policy is derived at each sampling time (or simulation step), so that the system can be controlled.

Fig. 1. Model-based critic-only reinforcement learning.

In this paper we address only the third step. Assuming that a parameterized approximation of the true unknown V-function has been computed, we derive the policy for a general, continuous-action input space. We aim at the maximization of the long-term reward and at computational efficiency, so that the method can be applied to multidimensional action spaces.
The paper is organized as follows. Section 2 gives a brief introduction to reinforcement learning. The state-of-the-art and the proposed policy derivation methods are described in Section 3, experimentally demonstrated in Section 4 and discussed in detail in Section 5. Finally, Section 6 concludes the paper.

2. PRELIMINARIES

Define an n-dimensional state space X ⊂ Rn and an m-dimensional action space U ⊂ Rm. The model is described by the state transition function xk+1 = f(xk, u), with xk, xk+1 ∈ X and u ∈ U. The reward function assigns a scalar reward rk+1 ∈ R to the state transition from xk to xk+1:

xk+1 = f(xk, u)
rk+1 = r(xk, u)        (1)
Define a finite set of discrete control input values U = {u1, u2, . . . , uN} drawn from U. The V-function can be computed by solving the Bellman equation:

V(x) = max_{u∈U} [r(x, u) + γ V(f(x, u))]        (2)

where γ is the discount factor (a user-defined parameter). There are several methods to approximate the V-function for continuous state spaces. In this paper, we use the fuzzy V-iteration algorithm (Busoniu et al., 2010) as it is guaranteed to converge and the fuzzy approximator allows us to interpret the values at each fuzzy set core directly as the V-function value. The approximation of the V-function after convergence is denoted by V̂(x). The policy is defined by the following mapping:

h : X → U        (3)

The most straightforward way to derive a policy corresponding to the approximate value function V̂(x) is (Bertsekas and Tsitsiklis, 1996):

ĥ(x) ∈ arg max_{u∈U} [r(x, u) + γ V̂(f(x, u))]        (4)
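As an illustration only (this sketch and its names are ours, not the paper's), the greedy choice in (4) amounts to a one-line maximization over the discrete action set, given callables for the model f, the reward r and the approximation V̂:

def greedy_action(x, f, r, v_hat, actions, gamma):
    # Evaluate the right-hand side of (4) for every discrete action in U
    # and return the maximizing one.
    return max(actions, key=lambda u: r(x, u) + gamma * v_hat(f(x, u)))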
However, this policy will be discrete-valued and generally will not perform well on control problems with continuous actions. Therefore, we propose alternative methods, whose aim is to provide a better policy than (4).

3. POLICY DERIVATION METHODS

The goal is to find a continuous-action policy based on the computed approximate V-function. The measure to be maximized by the policy is the average of the rewards obtained during a long-horizon control experiment. Using long-horizon experiments allows us to effectively measure the steady-state error. The policy derivation problem can then be formulated as the following maximization:

ĥ(x) = arg max_{h(x)} lim_{N→∞} (1/N) Σ_{k=1}^{N−1} r(xk, h(xk))        (5)
with xk = f(xk−1, h(xk−1)),   ∀x0 ∈ X

where N is the number of steps in the control experiment and the limit is assumed to exist. In the sequel we present three different policy-derivation methods.
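In the experiments the objective (5) is approximated by a long finite rollout; a minimal sketch (ours, with illustrative names) is:

def average_return(h, f, r, x0, n_steps):
    # Finite-horizon approximation of (5): average reward collected
    # while following policy h from the initial state x0.
    x, total = x0, 0.0
    for _ in range(n_steps):
        u = h(x)
        total += r(x, u)
        x = f(x, u)
    return total / n_steps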
3.1 Related work

The problem of deriving policies for continuous-action spaces has not been sufficiently addressed in the literature. The most common approach is to discretize the action space, compute the value for all the discrete actions, and select the one that corresponds to the largest V-function value. One of the earliest references to this approach is (Santamaria et al., 1996). A drawback of this method is that the exhaustive search over all available action combinations quickly becomes intractable as the size of the action space grows.

Another similar approach is based on sampling (Sallans and Hinton, 2004; Kimura, 2007). Using Monte-Carlo estimation, this approach can find a near-optimal action without resorting to exhaustive search over a discretized action space. However, for a good performance this method requires a large number of samples and is therefore computationally inefficient.

A different approach relies on translating the continuous action selection step into a sequence of binary decisions. Each decision whether to decrease or increase the control action eliminates a part of the action space. This process stops once a predefined precision is reached. More details can be found in (Pazis and Lagoudakis, 2009). The main drawback of this algorithm is that its performance quickly degrades as the dimensionality of the state and input space grows (Pazis and Lagoudakis, 2011).

3.2 Evaluation over a fine grid of actions

This method relies on a fine resampling of the action space a posteriori (i.e., after learning the V-function). Define

A = Ū1 × Ū2 × · · · × Ūm,   A ⊆ U        (6)

where each set Ūi contains points equidistantly distributed along the ith dimension of the action space. Set A therefore contains all combinations of the control inputs for which the value function is evaluated, and the result for the current state x is stored in array G:

G[u] = r(x, u) + γ V̂(f(x, u))        (7)

An example of the result for a two-dimensional action space for one specific x is shown in Fig. 2. This additional sampling allows the system to be controlled by applying actions more precisely, while it does not require extensive computation during learning. The policy derivation step is formalized in Algorithm 1.

Algorithm 1: Maximization over a fine grid of actions (in the sequel denoted as Grid)
Input: f(x, u), r(x, u), V̂(x), γ, A, x0
k ← 0
while control experiment not terminated do
    foreach u′ ∈ A do
        G[u′] ← r(xk, u′) + γ V̂(f(xk, u′))
    end
    uk ← arg max_{u′∈A} G[u′]
    xk+1 ← f(xk, uk)
    k ← k + 1
end
Output: trajectory [x0, x1, ...], [u0, u1, ...]

Fig. 2. Example of grid evaluation over the action space for a fixed x. Each dimension of the action space contains 10 points, i.e. there are a total of 100 points in the grid. Each point is evaluated according to (7).
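A minimal Python sketch of the action selection inside Algorithm 1 (the sketch and all names are ours, not the authors' code; itertools.product enumerates the set A of (6) from the per-dimension grids Ūi):

import itertools

def grid_action(x, f, r, v_hat, u_grids, gamma):
    # u_grids: one sequence of equidistantly spaced values per action dimension.
    # Evaluates G[u] of (7) for every u in A and returns the maximizer.
    best_u, best_val = None, float("-inf")
    for u in itertools.product(*u_grids):
        val = r(x, u) + gamma * v_hat(f(x, u))
        if val > best_val:
            best_u, best_val = u, val
    return best_u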
3.3 Chebyshev polynomial approximation over the action space

The main idea of this method is that a smooth approximation over the action space provides more accurate control and avoids chattering in the case of unstable system equilibria. We use Chebyshev polynomials of the first kind, which are defined by the following recurrent relation:

T0(ū) = 1
T1(ū) = ū
Tn+1(ū) = 2ū Tn(ū) − Tn−1(ū)

They are orthogonal to each other on the interval [−1, 1] with respect to the weighting function 1/√(1 − ū²). In order to take advantage of the orthogonality property, the space along one control dimension must be mapped onto the interval [−1, 1]. This is accomplished using the affine transformation:

ū = −1 + 2 (u − uL) / (uH − uL)        (8)

where uL and uH are respectively the lower and upper bound of u in each dimension. The set of transformed actions is denoted by Ū. For the expansion of the scalar input to the multidimensional control input, the Cartesian product of the polynomials is computed. A detailed description of the construction and fitting procedure is beyond the scope of this paper, see e.g. (Hays, 1973) for further details.

The polynomial structure of the policy approximator allows us to efficiently find the maxima in (4) by numerically solving the set of algebraic equations obtained by equating the first partial derivatives to zero. In some cases, the polynomial attains its maximum inside the domain U, but in other cases, the maximum may lie outside of this domain. Therefore, the boundaries must also be tested. To this end, we define the discrete set D ⊂ A of the boundary points, and add them to the set of interior points at which the derivatives are equal to zero, resulting in a set E of all potential points of interest. The global maximum over this union is then the control action sought.

To formulate the policy derivation algorithm in a compact way, define function F(A, G) which constructs the Chebyshev polynomial approximation using the data in A and G as defined in Section 3.2. The resulting function P(·) receives an action u ∈ U and returns the approximated V-function value corresponding to this action. Note that this function includes the affine transformation (8).

Algorithm 2: Maximization using Chebyshev polynomials (in the sequel denoted as Cheby)
Input: f(x, u), r(x, u), V̂(x), γ, A, x0
k ← 0
while control experiment not terminated do
    foreach u′ ∈ A do
        G[u′] ← r(xk, u′) + γ V̂(f(xk, u′))
    end
    P(·) ← F(A, G)
    E ← {ū ∈ U | dP(ū)/dū = 0} ∪ D
    foreach u′ ∈ E do
        Ḡ[u′] ← r(xk, u′) + γ V̂(f(xk, u′))
    end
    uk ← arg max_{u′∈E} Ḡ[u′]
    xk+1 ← f(xk, uk)
    k ← k + 1
end
Output: trajectory [x0, x1, ...], [u0, u1, ...]
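The construction of F(A, G) used in the experiments relies on the Chebfun package (see Section 4.4). As a hedged, one-dimensional illustration of the same idea, the sketch below (all names are ours) uses numpy's Chebyshev class: fit the sampled values G over the grid transformed by (8), collect the interior stationary points together with the boundary points (the set D), and return the best candidate mapped back to the original action range.

import numpy as np
from numpy.polynomial import chebyshev as C

def cheby_action_1d(u_grid, g_values, degree=6):
    # u_grid: 1-D numpy array of grid actions, g_values: corresponding G[u] values.
    # Map the action grid to [-1, 1] as in (8).
    u_lo, u_hi = u_grid[0], u_grid[-1]
    u_bar = -1.0 + 2.0 * (u_grid - u_lo) / (u_hi - u_lo)
    p = C.Chebyshev.fit(u_bar, g_values, degree, domain=[-1.0, 1.0])
    # Candidate maximizers: stationary points of p inside [-1, 1] plus the boundaries (set D).
    roots = p.deriv().roots()
    interior = [z.real for z in np.atleast_1d(roots)
                if abs(z.imag) < 1e-9 and -1.0 <= z.real <= 1.0]
    candidates = np.array(interior + [-1.0, 1.0])
    best = candidates[np.argmax(p(candidates))]
    # Map the maximizer back to the original action range.
    return u_lo + (best + 1.0) * (u_hi - u_lo) / 2.0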
3.4 Chebyshev approximation to eliminate chattering

The key idea behind this method is based on the following observation: a typical optimal policy results in far more steps with the maximal possible control input than with intermediate values. For example, the typical optimal policy for a stabilization task steers the state so that it converges to the vicinity of the desired equilibrium as fast as possible. A control input other than the maximal one applied far away from the equilibrium is therefore undesired. However, in the vicinity of the equilibrium, sparse sampling of the discrete actions results in oscillations and control input chattering. Hence, intermediate values of the control input must be used in the vicinity of the equilibrium.

To determine the transient phase, we define function C(t, k) which receives the state trajectory t and the sampling instant k and returns either true or false to determine the inference type. For example, the function can be an oscillation detector. The types of function C(t, k) used for each of our benchmarks are presented in Section 4.4. The overall procedure is presented in Algorithm 3.
Algorithm 3: Partial maximization using Chebyshev polynomials (in the sequel denoted as Cheby+)
Input: f(x, u), r(x, u), V̂(x), γ, A, U, F(A, G), C(t, k), x0
k ← 0
t ← {x0}
while control experiment not terminated do
    foreach u′ ∈ A do
        G[u′] ← r(xk, u′) + γ V̂(f(xk, u′))
    end
    if C(t, k) then
        P(·) ← F(A, G)
        E ← {ū ∈ U | dP(ū)/dū = 0} ∪ D
        foreach u′ ∈ E do
            Ḡ[u′] ← r(xk, u′) + γ V̂(f(xk, u′))
        end
        uk ← arg max_{u′∈E} Ḡ[u′]
    else
        uk ← arg max_{u′∈A} G[u′]
    end
    xk+1 ← f(xk, uk)
    k ← k + 1
    t ← t ∪ {xk+1}
end
Output: trajectory [x0, x1, ...], [u0, u1, ...]

3.5 Baseline solution

The baseline solution is to evaluate equation (4) at every sampling instant, where the maximization is computed over the same discrete action set U on which V̂(x) has been learned. To this end we can use Algorithm 1 with A = U.
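For orientation, the switching structure of Algorithm 3 reduces to a simple test on C(t, k); the sketch below is ours, with grid_action and cheby_action standing for the discrete and Chebyshev-based maximizers of Sections 3.2 and 3.3, supplied as callables that close over the model, reward, V̂ and the action grid:

def cheby_plus_action(x, k, t, C, grid_action, cheby_action):
    # C(t, k) decides whether the transient phase is over (Section 3.4).
    # During the transient the coarse discrete maximizer is used; near the
    # equilibrium the smooth Chebyshev maximizer supplies intermediate actions.
    if C(t, k):
        return cheby_action(x)
    return grid_action(x)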
4. EXPERIMENTAL STUDY

4.1 V-function learning algorithm

To compute V̂(x), the fuzzy V-iteration algorithm (Bușoniu et al., 2010) is used. The learning process can be briefly described as follows. Define a set of samples S = {s1, s2, . . . , sN} on an equidistant grid in X. The number of samples per dimension is described by vector B = [b1, b2, . . . , bn]T with the total number of samples N = Π_{i=1}^{n} bi. Further define a vector of fixed triangular basis functions φ = [φ1(x), φ2(x), . . . , φN(x)]T where each φi(x) is centered in si, i.e., φi(si) = 1 and φj(si) = 0, ∀j ≠ i. The basis functions are normalized so that Σ_{j=1}^{N} φj(x) = 1, ∀x ∈ X. Finally, define a parameter vector θ = [θ1, θ2, . . . , θN]T, θ ∈ RN. The V-function approximation is defined as:

V̂(x) = θT φ(x)        (9)

The fuzzy value iteration is:

θi^(j+1) ← max_{u∈U} [ r(si, u) + γ (θ^j)T φ(f(si, u)) ],   i = 1, 2, ..., N        (10)

The iteration terminates when the following convergence criterion is satisfied:

||θ^j − θ^(j−1)||∞ ≤ ε        (11)

where ε is a convergence threshold. The learning parameters for the fuzzy V-iteration are given in Table 2.
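As a hedged illustration of (9)-(11) (not the authors' implementation; the basis φ, the model f, the reward r, the sample set and the action set are assumed to be supplied by the caller):

import numpy as np

def fuzzy_v_iteration(f, r, phi, samples, actions, gamma, eps=1e-5, max_iter=10000):
    # Backup (10): theta_i <- max_u [ r(s_i, u) + gamma * theta^T phi(f(s_i, u)) ],
    # repeated until the stopping criterion (11) is met.
    theta = np.zeros(len(samples))
    for _ in range(max_iter):
        theta_new = np.array([
            max(r(s, u) + gamma * theta @ phi(f(s, u)) for u in actions)
            for s in samples
        ])
        if np.linalg.norm(theta_new - theta, ord=np.inf) <= eps:  # criterion (11)
            return theta_new
        theta = theta_new
    return theta

def v_hat(theta, phi, x):
    # V-function approximation (9): V_hat(x) = theta^T phi(x)
    return theta @ phi(x)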
4.2 Mountain Car 3D

The mountain car problem is commonly used as a benchmark for reinforcement learning algorithms. The car is initially placed in a valley and the goal is to make the car drive out of the valley. The engine is not powerful enough and the car must build up momentum by alternately driving up opposite sides of the valley. The Mountain Car 3D (MC3D) variant (Taylor et al., 2008) extends the standard 2D problem. For the purposes of this paper the state domain is slightly modified. The state is described by four continuous variables, x = [px, py, vx, vy]T, where px and py are positions in the interval [−1.2, 2.2] and vx and vy are velocities in [−2, 2]. The variable u = [u1, u2] represents the action taken by the agent, where u1 and u2 are responsible for changing the position and velocity in the direction of axis x and axis y, respectively. The transition function f(x, u) is defined in (12) and the model parameters are listed in Table 1.

p̌x = px + ∆t vx
p̌y = py + ∆t vy
v̌x = vx + ∆t (−g m cos(3px) + u1 f/m − µ vx)
v̌y = vy + ∆t (−g m cos(3py) + u2 f/m − µ vy)
xk+1 = [p̌x, p̌y, v̌x, v̌y]T        (12)

The state variables are restricted to their domains using saturation. The reward function r(x, u) is defined by:

xk+1 = [p̌x, p̌y, v̌x, v̌y]T = f(x, u)
rk+1 = sin(3p̌x) + sin(3p̌y)        (13)
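A sketch of the transition (12) and reward (13), using the Table 1 parameters and the state-domain saturation described above; this is our reconstruction of the model, not the authors' code:

import numpy as np

MASS, MU, GRAV, FORCE, DT = 0.2, 0.5, -9.8, 0.2, 0.1   # Table 1 parameters

def mc3d_step(x, u):
    # x = [px, py, vx, vy], u = [u1, u2]; implements (12) with saturation.
    px, py, vx, vy = x
    u1, u2 = u
    px_n = px + DT * vx
    py_n = py + DT * vy
    vx_n = vx + DT * (-GRAV * MASS * np.cos(3 * px) + u1 * FORCE / MASS - MU * vx)
    vy_n = vy + DT * (-GRAV * MASS * np.cos(3 * py) + u2 * FORCE / MASS - MU * vy)
    px_n, py_n = np.clip([px_n, py_n], -1.2, 2.2)
    vx_n, vy_n = np.clip([vx_n, vy_n], -2.0, 2.0)
    return np.array([px_n, py_n, vx_n, vy_n])

def mc3d_reward(x_next):
    # Reward (13): high when both position coordinates reach the hill tops.
    return np.sin(3 * x_next[0]) + np.sin(3 * x_next[1])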
Table 1. MC3D parameters
Model parameter            | Symbol | Value
Mass                       | m      | 0.2
Friction coefficient       | µ      | 0.5
Gravitational acceleration | g      | -9.8
Engine force               | f      | 0.2
Time step                  | ∆t     | 0.1

Table 2. MC3D learning parameters
Parameter               | Symbol | Value
Set of discrete actions | U      | {[0, 0]T, [−1, 0]T, [1, 0]T, [0, 1]T, [0, −1]T}
Samples per dimension   | B      | [18, 9, 18, 9]T
Discount factor         | γ      | 0.999
Convergence threshold   | ε      | 10−5

4.3 Double pendulum swing-up

The double pendulum is described by the continuous-time fourth-order nonlinear model:

M(α)α̈ + C(α, α̇)α̇ + G(α) = u        (14)

where α = [α1, α2]T contains the angular positions of the two links, u = [u1, u2]T is the control input which contains the torques of the two motors, M(α) is the mass matrix, C(α, α̇) is the Coriolis and centrifugal forces
matrix and G(α) is the gravitational forces vector. The state x contains the angles and angular velocities and is defined as x = [α1, α̇1, α2, α̇2]T. The angles [α1, α2] vary in the interval [−π, π) rad and wrap around. The angular velocities [α̇1, α̇2] are restricted to the interval [−2π, 2π) rad/s using saturation. Matrices M(α), C(α, α̇) and G(α) are defined by:

M(α) = [ P1 + P2 + 2P3 cos(α2)    P2 + P3 cos(α2)
         P2 + P3 cos(α2)          P2              ]

C(α, α̇) = [ b1 − P3 α̇2 sin(α2)    −P3 (α̇1 + α̇2) sin(α2)
            P3 α̇1 sin(α2)          b2                     ]

G(α) = [ −F1 sin(α1) − F2 sin(α1 + α2)
         −F2 sin(α1 + α2)              ]        (15)

with P1 = m1 c1² + m2 l1² + I1, P2 = m2 c2² + I2, P3 = m2 l1 c2, F1 = (m1 c1 + m2 l2) g and F2 = m2 c2 g. The meaning and values of the system parameters are given in Table 3. The transition function f(x, u) is obtained by numerically integrating (14) between discrete time samples with the sampling period ∆t. The control goal is to stabilize the two links at the unstable equilibrium α = α̇ = 0, which is expressed by the following quadratic reward function:

rk+1 = −f T(xk, u) Q f(xk, u)        (16)

with the weighting matrix:

Q = diag[1, 0.05, 1.2, 0.05]
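For illustration, the model (14)-(15) with the Table 3 parameters can be simulated as sketched below; the paper only states that (14) is integrated numerically over one sampling period, so the RK4 integrator and all names here are our own choices:

import numpy as np

# Table 3 parameters
l1, l2 = 0.4, 0.4
m1, m2 = 1.25, 0.8
I1, I2 = 0.0667, 0.0427
c1, c2 = 0.2, 0.2
b1, b2 = 0.08, 0.02
g, dt = -9.8, 0.05

P1 = m1 * c1**2 + m2 * l1**2 + I1
P2 = m2 * c2**2 + I2
P3 = m2 * l1 * c2
F1 = (m1 * c1 + m2 * l2) * g
F2 = m2 * c2 * g

def accel(alpha, dalpha, u):
    # Solve M(alpha) ddalpha = u - C(alpha, dalpha) dalpha - G(alpha), cf. (14)-(15).
    a1, a2 = alpha
    d1, d2 = dalpha
    M = np.array([[P1 + P2 + 2 * P3 * np.cos(a2), P2 + P3 * np.cos(a2)],
                  [P2 + P3 * np.cos(a2),          P2]])
    Cm = np.array([[b1 - P3 * d2 * np.sin(a2), -P3 * (d1 + d2) * np.sin(a2)],
                   [P3 * d1 * np.sin(a2),       b2]])
    Gv = np.array([-F1 * np.sin(a1) - F2 * np.sin(a1 + a2),
                   -F2 * np.sin(a1 + a2)])
    return np.linalg.solve(M, np.asarray(u) - Cm @ dalpha - Gv)

def pendulum_step(x, u):
    # One sampling period of (14) with a single RK4 step; x = np.array([a1, da1, a2, da2]).
    def deriv(s):
        dd = accel(s[[0, 2]], s[[1, 3]], u)
        return np.array([s[1], dd[0], s[3], dd[1]])
    k1 = deriv(x)
    k2 = deriv(x + 0.5 * dt * k1)
    k3 = deriv(x + 0.5 * dt * k2)
    k4 = deriv(x + dt * k3)
    x_next = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    x_next[[0, 2]] = (x_next[[0, 2]] + np.pi) % (2 * np.pi) - np.pi   # wrap angles to [-pi, pi)
    x_next[[1, 3]] = np.clip(x_next[[1, 3]], -2 * np.pi, 2 * np.pi)   # saturate velocities
    return x_next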
The V-iteration parameters are defined in Table 4.

4.4 Simulation

To measure the performance of the algorithms the following criteria are defined:

• Cumulative return Rc = Σ_{i=1}^{K} r(xi, u), where K denotes the number of time steps in a control experiment.
• Average long-term return Rc/K.
• Computational cost – the ratio between the runtime of the baseline solution (see Section 3.5) and that of the tested algorithm.

Algorithms 1-3 require the size of the set A (see (6)) to be defined as a vector As = [a1, a2, ..., am]T where each ai corresponds to the number of points along the ith dimension. Algorithm 3 receives the function C(t, k) as a parameter, in order to determine the switch between the discrete and continuous control input (see Section 3.4). Implementations of this function can differ for each benchmark. To simplify the simulation, the following threshold function is defined:

C(t, k) = (k > λ)        (17)

where k is the sampling instant and the parameter λ is a threshold, which differs for each benchmark. This approach is based on the a posteriori knowledge about the trajectories obtained by Algorithm 1, where both tested systems converge to oscillations around the goal states and never leave them. However, oscillation detectors can be used instead (Thornhill and Hägglund, 1997). Algorithms 2 and 3 require function F(A, G) to construct the Chebyshev polynomials. To this end, we used the Chebfun open-source package (Driscoll et al., 2014). The parameters required by the algorithms are given in Table 5. The simulation results of the MC3D task and the double pendulum swing-up task are listed in Tables 6 and 7, respectively.
Table 3. Double pendulum parameters
Model parameter            | Symbol  | Value
Link lengths               | l1, l2  | 0.4, 0.4
Link masses                | m1, m2  | 1.25, 0.8
Link inertias              | I1, I2  | 0.0667, 0.0427
Center of mass coordinates | c1, c2  | 0.2, 0.2
Damping in the joints      | b1, b2  | 0.08, 0.02
Gravitational acceleration | g       | -9.8
Sampling period            | ∆t      | 0.05

Table 4. Double pendulum learning parameters
Parameter               | Symbol | Value
Set of discrete actions | U      | {[−3, 0]T, [3, 0]T, [0, 0]T, [0, −1]T, [0, 1]T}
Samples per dimension   | B      | [11, 11, 11, 11]T
Discount factor         | γ      | 0.99
Convergence threshold   | ε      | 10−5

Table 5. Simulation parameters
Meaning                   | MC3D                 | Double pendulum
No. of simulation steps K | 1500                 | 2000
Additional samples As     | [11, 11]T            | [11, 11]T
C(t, k) threshold λ       | 130                  | 80
Initial state x0          | [−0.5, 0, −0.5, 0]T  | [−π, 0, −π, 0]T

Table 6. MC3D simulation results
Criterion          | Baseline | Grid    | Cheby   | Cheby+
Average return     | 1.8768   | 1.9281  | 1.3925  | 1.9081
Cumulative return  | 2815.30  | 2892.25 | 2088.78 | 2862.27
Computational cost | 1:1      | 1:74    | 1:235   | 1:212

Table 7. Double pendulum simulation results
Criterion          | Baseline | Grid    | Cheby   | Cheby+
Average return     | -3.7005  | -0.1525 | -0.1508 | -0.1504
Cumulative return  | -7401    | -305.15 | -301.73 | -300.97
Computational cost | 1:1      | 1:83    | 1:209   | 1:201
5. DISCUSSION

According to Tables 6 and 7, Algorithms 1 (Grid) and 3 (Cheby+) outperform the baseline solution with respect to the average return on both test problems. The algorithms show the ability to generalize the control action to continuous, multidimensional action spaces. On the other hand, all the proposed algorithms are computationally more demanding than the baseline solution, as indicated by the computational cost values. The computational cost of the Grid algorithm can further be improved by using a non-equidistant distribution of points in set A. For instance, using logarithmically spaced samples can help to decrease the computation time without losing the generalization ability.

The small difference in performance between Cheby and Cheby+ is caused by the function C(t, k). For the double pendulum task both algorithms behave virtually identically around the goal state. However, the effect of the "undesired" intermediate control, mentioned in Section 3.4, is clearly seen in Table 6.

Figure 3 shows the difference between the Grid and Cheby+ algorithms applied to the double pendulum system. It can be seen that Cheby+ successfully alleviates chattering in the angular velocities. However, as the system is unstable, there is no guarantee that chattering in the input can be completely avoided. The non-zero control input, shown in the bottom plots, compensates for the influence of gravity, as the pendulum is not exactly vertical.
Fig. 3. The first 200 steps of the trajectories obtained by the double pendulum simulation (left: Grid, right: Cheby+). α̇1 and α̇2 correspond to the angular velocities, u1 and u2 are the control inputs.

All considered algorithms perform well under the assumption that the V-function approximation is smooth enough for the hill climbing by means of (4). The triangular basis functions used in this paper provide a reasonably smooth approximation for the considered benchmarks. However, there is no guarantee that this will hold in general.

6. CONCLUSIONS AND FUTURE RESEARCH

Several alternative policy derivation methods for continuous action spaces were proposed in this paper. The simulation results showed that the Grid and Cheby+ algorithms significantly outperform the standard policy derivation approach. The proposed methods can be used with any kind of value function approximators. Providing theoretical guarantees for smooth approximations may be a part of future research and can be done, e.g., by using splines.
ACKNOWLEDGEMENTS

This research was supported by the Grant Agency of the Czech Republic (GAČR) with the grant no. 15-22731S entitled "Symbolic Regression for Reinforcement Learning in Continuous Spaces" and by the Grant Agency of the Czech Technical University in Prague, grant no. SGS16/228/OHK3/3T/13 entitled "Knowledge extraction from reinforcement learners in continuous spaces".
REFERENCES

Bertsekas, D.P. and Tsitsiklis, J.N. (1996). Neuro-Dynamic Programming. Athena Scientific, 1st edition.
Bușoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2010). Approximate dynamic programming with a fuzzy parameterization. Automatica, 46(5), 804–814.
Busoniu, L., Babuska, R., Schutter, B.D., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Inc., Boca Raton, FL, USA, 1st edition.
Driscoll, T.A., Hale, N., and Trefethen, L.N. (2014). Chebfun Guide. Pafnuty Publications.
Hays, R.O. (1973). Multi-dimensional extensions of the Chebyshev polynomials. Mathematics of Computation, 27(123), 621–624.
Kimura, H. (2007). Reinforcement learning in multi-dimensional state-action space using random rectangular coarse coding and Gibbs sampling. In Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, 88–95.
Konda, V. and Tsitsiklis, J. (2000). Actor-critic algorithms. In SIAM Journal on Control and Optimization, 1008–1014. MIT Press.
Pazis, J. and Lagoudakis, M. (2011). Reinforcement learning in multidimensional continuous action spaces. In Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2011 IEEE Symposium on, 97–104.
Pazis, J. and Lagoudakis, M.G. (2009). Binary action search for learning continuous-action control policies. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 793–800. ACM, New York, NY, USA.
Sallans, B. and Hinton, G.E. (2004). Reinforcement learning with factored states and actions. J. Mach. Learn. Res., 5, 1063–1088.
Santamaria, J.C., Sutton, R.S., and Ram, A. (1996). Experiments with reinforcement learning in problems with continuous state and action spaces.
Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction, volume 116. Cambridge Univ Press.
Taylor, M.E., Kuhlmann, G., and Stone, P. (2008). Autonomous transfer for reinforcement learning. In The Seventh International Joint Conference on Autonomous Agents and Multiagent Systems.
Thornhill, N. and Hägglund, T. (1997). Detection and diagnosis of oscillation in control loops. Control Engineering Practice, 5(10), 1343–1354.