Obtaining accurate estimated action values in categorical distributional reinforcement learning
Yingnan Zhao, Peng Liu, Chenjia Bai, Wei Zhao, Xianglong Tang

PII: S0950-7051(20)30025-3
DOI: https://doi.org/10.1016/j.knosys.2020.105511
Reference: KNOSYS 105511
To appear in: Knowledge-Based Systems
Received date: 3 July 2019
Revised date: 8 January 2020
Accepted date: 10 January 2020

Please cite this article as: Y. Zhao, P. Liu, C. Bai et al., Obtaining accurate estimated action values in categorical distributional reinforcement learning, Knowledge-Based Systems (2020), doi: https://doi.org/10.1016/j.knosys.2020.105511.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier B.V.
Obtaining Accurate Estimated Action Values in Categorical Distributional Reinforcement Learning

Yingnan Zhao^a, Peng Liu^a, Chenjia Bai^a, Wei Zhao^a,*, Xianglong Tang^a

^a Harbin Institute of Technology, Harbin 150001, China

* Corresponding author.
Email addresses: [email protected] (Yingnan Zhao), [email protected] (Peng Liu), [email protected] (Chenjia Bai), [email protected] (Wei Zhao), [email protected] (Xianglong Tang)

Abstract

Categorical Distributional Reinforcement Learning (CDRL) uses a categorical distribution with evenly spaced outcomes to model the entire distribution of returns and produces state-of-the-art empirical performance. However, using inappropriate bounds with CDRL may generate inaccurate estimated action values, which affects the policy update step and the final performance. In CDRL, the bounds of the distribution indicate the range of action values that the agent can obtain in a task, without considering the policy's performance or the state-action pairs. The action values that the agent actually obtains are often far from these bounds, which reduces the accuracy of the estimated action values. This paper describes a method of obtaining more accurate estimated action values for CDRL using adaptive bounds. This approach enables the bounds of the distribution to be adjusted automatically based on the policy and the state-action pairs. To achieve this, we save the weights of the critic network over a fixed number of time steps and then apply a bootstrapping method. In this way, we can obtain confidence intervals for the upper and lower bound, and then use the upper and lower limits of these intervals as the new bounds of the distribution. The new bounds are more appropriate for the agent and provide a more accurate estimated action value. To further correct the estimated action values, a distributional target policy is proposed as a smoothing method. Experiments show that our method outperforms many state-of-the-art methods on the OpenAI Gym tasks.

Keywords: Distributional reinforcement learning, Estimated action value, Bootstrapping, Interval estimation
2010 MSC: 00-01, 99-00

1. Introduction
The goal of reinforcement learning (RL) is to learn an optimal policy for sequential decision-making problems by interacting with the environment [1]. Recent research has shown that the integration of RL and deep neural networks can solve problems in a wide range of fields [2], including games [3, 4, 5], robotics [6, 7, 8, 9], natural language [10, 11] and computer vision [12, 13].

One of the major classes of RL is value-based methods, which learn the optimal value function using the Bellman operator; many widely used RL algorithms build on this approach. The Deep Deterministic Policy Gradient (DDPG) [14] extends two other important algorithms, the deep Q-network and the deterministic policy gradient, which estimate the policy gradient more efficiently and stabilize the learning. The Asynchronous Advantage Actor Critic (A3C) [15] learns a value function and a policy; the value function reduces variance and accelerates learning, and A3C can also run several agents in parallel to reduce the training time. Both DDPG and A3C exhibit good performance on complex control tasks with high-dimensional action spaces.

Categorical Distributional RL (CDRL) [16] views RL problems from a new perspective. In contrast to value-based algorithms, which learn the expectation of the action value, CDRL models the full distribution of the action value. CDRL preserves the multimodality of the value distributions, leading to more stable learning and better policies than other methods. CDRL uses a categorical distribution with evenly spaced outcomes, and the target value distribution can be computed through a distributional Bellman operator. However, the target and the predicted distribution have disjoint supports. To account for this, CDRL projects the target distribution onto the support of the prediction, and then applies a Kullback-Leibler minimization step.

CDRL assigns probabilities to an a priori fixed, discrete set of possible returns, which is determined by the number of atoms l and the bounds (Qmin, Qmax). The number of possible returns that the agent can obtain is l, and the distance between two adjacent atoms is ∆ = (Qmax − Qmin)/(l − 1). According to distributional RL theory [16], the expectation of this return distribution (the estimated action value) should be approximately equal to the true action value; otherwise the performance will be affected. The estimated action value impacts the update of the distribution, as well as the estimated policy gradient. If it is not accurate, the performance of the policy will suffer. The bounds of this distribution impact the estimated action value and play an important role in CDRL. (Qmin, Qmax) and l
define the possible action values that the agent can get. In CDRL, (Qmin, Qmax) is set to the minimal and maximal expected cumulative reward the agent can receive in each task, and it remains fixed throughout the training. The bounds are set wide enough to cover all the possible action values. However, most of the time, the performance of the policy is not as good as that of the optimal policy. Therefore, the true maximum action value that the agent can achieve is far less than Qmax. As a result, the atoms between the true maximum action value and the upper bound Qmax carry some estimation error. These bounds may also be too wide for those state-action pairs that cannot achieve large action values, and so the estimated action value will be inaccurate. To make the estimated action value more accurate, the bounds should be adjusted automatically for different learning stages and state-action pairs.

In this study, to make the estimated action value more accurate and improve the CDRL performance, we developed adaptive bounds that can be adjusted depending on the performance of the policy and the different state-action pairs. To achieve this, we train a critic network that produces the same action value as common value-based RL methods. Then, we save the weights of this critic network over a fixed number of time steps, allowing the resulting action values to be used as samples for different state-action pairs in different learning stages. Finally, we estimate a confidence interval for Qmax and Qmin based on these samples. When calculating the confidence interval, the sample dependencies, changing policies, non-stationarity, and limited sample size create certain challenges. Thus, we use a moving block bootstrap [17, 18]. We construct several blocks to eliminate the dependencies between samples and then sample the blocks using bootstrapping [19]. These bootstrapped samples can be used to approximate a 1−α confidence interval around the bounds of the categorical distribution. Therefore, the bounds can be adjusted automatically for different state-action pairs and learning stages, resulting in a more accurate approximation of the action value and improved performance.

To reduce the variance in the update and make the estimated action value more accurate, we also developed a distributional target policy smoothing (DTPS) strategy. The key idea behind DTPS comes from Expected SARSA [20]. DTPS computes the target based on the expected distribution, by adding small random noise to the target action and averaging the resulting distributions. To evaluate our method in continuous-control tasks, we combine our improved CDRL with DDPG. Experiments demonstrate that this configuration achieves state-of-the-art performance.

The main contributions of this paper are as follows. (1) We show that the problem of inaccurate estimated action values persists in CDRL. (2) Adaptive bounds are proposed to make the estimated action value more accurate. (3) To further correct the estimated action value, we propose a distributional target policy smoothing strategy. (4) Evaluations on several continuous-control tasks demonstrate that our method can lead to better performance.

The remainder of this paper is structured as follows. Section 2 provides a brief introduction to RL and methods related to the topic of this paper. Our approach is presented in Section 3, before the results of experiments are presented and discussed in Section 4. Finally, Section 5 summarizes the conclusions of this study.

2. Background

2.1. Reinforcement learning

RL solves sequential decision-making problems with the aim of learning reward-maximizing behavior [1, 21]. A Markov decision process (MDP) can be used as a formalization of RL problems. In an MDP, the agent is given a state s ∈ S at each discrete time step and selects an action a ∈ A following the policy π : S → A. Once the action is executed, the agent receives a reward r(s, a) and the environment moves to a new state s'. The total return is defined as follows:

G_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)    (1)

where γ ∈ (0, 1] is the discount factor, which determines the priority of short-term rewards. At each time step, the agent can use experience of the form (s, a, s', r(s, a)) to improve its policy and maximize the expectation of G_t.

Value-based methods are effective for solving RL problems. Their key component is the action value function Q^π(s, a) = E[G_t | s_t = s, a_t = a, π], which describes the expected return from selecting action a in state s and then acting according to π. We say that the action value is the expectation, or mean, of the value distribution. Q-learning [22, 23, 24] is a central algorithm in the field of RL. The update rule for this approach is as follows:

Q(s, a) \leftarrow Q(s, a) + \alpha\big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]    (2)

By iterating rule (2), Q-learning converges to the optimal action value function [21], from which an optimal policy can be derived.
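As a concrete illustration of update rule (2), the following minimal tabular Q-learning sketch in Python (our own example, not from the paper; the state/action sizes and hyperparameter values are placeholders) shows how the estimate is nudged toward the bootstrapped target r + γ max_a' Q(s', a'):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One application of update rule (2) on a tabular Q array."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
    return Q

# Toy usage: 4 states, 2 actions, a single observed transition.
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)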
For a large state space, Mnih et al. introduced the Deep Q-Network (DQN) [4], which uses deep neural networks to approximate the action value function. To further stabilize the training process, DQN uses a target network and experience replay.

In contrast to value-based methods such as Q-learning, policy-based methods directly optimize the parameterized policy π(a|s; θ). REINFORCE [25, 26] is a policy-based method that updates θ in the direction of ∇_θ log π(a_t|s_t, θ)(G_t − b_t(s_t)), where b_t(s_t) is a baseline used to reduce the variance of the gradient estimate.

2.2. Categorical distributional reinforcement learning

CDRL models the entire distribution of returns, rather than just the expected return. Empirically, learning the distribution over returns improves data efficiency, final performance, and stability [16, 27, 28]. Obtaining the action value under a policy π is a basic problem in RL. One common way to solve it is to use dynamic programming through the Bellman operator [29]:

T^\pi Q(s, a) = E[r(s, a)] + \gamma\, E_{P,\pi}[Q^\pi(s', a')]    (3)
where P is the transition kernel P(·|s, a). Similarly, a distributional Bellman operator is defined to compute the value distribution through dynamic programming:

T^\pi Z(s, a) \overset{D}{:=} r(s, a) + \gamma P^\pi Z(s', a')    (4)

where P^\pi is the transition operator and Y \overset{D}{:=} U denotes that the random variable Y is distributed according to the same law as U. The distribution over returns is denoted by Z, and Q = E[Z]. A common parametric distribution is the categorical distribution, which is supported on fixed locations z_1 ≤ · · · ≤ z_N. The parameters of this distribution are the probabilities q_i corresponding to each location z_i. Similar to value-based methods, the goal is to minimize the loss between the target distribution T Z_θ and the current estimated distribution Z_θ. The problem is that T Z_θ and Z_θ always have disjoint supports, which makes it difficult to calculate the loss. Thus, CDRL applies a projection step to map the target distribution onto the same support as Z_θ. Given a sample transition (s, a, r, s'), the Bellman update for one atom z_i can be computed as T z_i := r + γ z_i. The associated probability p_i(s', π(s')) is then distributed to the immediate neighbors of T z_i. The i-th component of the projected update Φ T Z_θ(s, a) is:

(\Phi T Z_\theta(s, a))_i = \sum_{j=1}^{N-1} \Big[ 1 - \frac{\big| [T z_j]_{Q_{min}}^{Q_{max}} - z_i \big|}{\Delta z} \Big]_0^1 \, p_j(s', \pi(s'))    (5)

We then calculate the sample loss L_{s,a}(θ) as follows:

L_{s,a}(\theta) = D_{KL}\big(\Phi T Z_\theta(s, a) \,\|\, Z_\theta(s, a)\big)    (6)

By minimizing this loss function, CDRL converges to the optimal value distribution and obtains the optimal policy [30, 31].
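The projection in (5) can be written compactly in code. The following NumPy sketch is our own illustration, not the authors' implementation; the atom count and bound values are placeholders. It distributes each target atom's probability onto its two nearest neighbors on the fixed support, which is the allocation step discussed around Figure 1 below:

import numpy as np

def project_distribution(p, r, gamma, q_min, q_max, n_atoms):
    """Project the Bellman-updated categorical distribution back onto the fixed support (Eq. 5)."""
    z = np.linspace(q_min, q_max, n_atoms)         # fixed atom locations z_1..z_N
    dz = (q_max - q_min) / (n_atoms - 1)           # atom spacing (Delta z)
    tz = np.clip(r + gamma * z, q_min, q_max)      # Bellman update T z_j, clipped to the bounds
    b = (tz - q_min) / dz                          # fractional index of each target atom
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros(n_atoms)
    for j in range(n_atoms):                       # split p_j between the two neighbouring atoms
        out[lower[j]] += p[j] * (upper[j] - b[j])
        out[upper[j]] += p[j] * (b[j] - lower[j])
        if lower[j] == upper[j]:                   # target atom falls exactly on a support point
            out[lower[j]] += p[j]
    return out

# Example: a uniform predicted distribution shrinks toward low atoms when r is small and gamma < 1.
p_uniform = np.ones(51) / 51
projected = project_distribution(p_uniform, r=1.0, gamma=0.99, q_min=-200.0, q_max=300.0, n_atoms=51)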
The bounds (Qmin, Qmax) of the categorical distribution in CDRL define the range of all the possible returns the agent can attain throughout the training and affect the estimated action value (the estimated action value is the expectation of the categorical distribution). The bounds also affect the projection step, as shown by (5). If the estimated action value is as close as possible to the real action value, CDRL can achieve good performance. Qmax and Qmin reflect the maximum and minimal cumulative rewards that the agent can obtain on each task. In CDRL, Qmin can be set as the cumulative reward that the agent achieves by following the worst policy (a random policy), whereas Qmax is the cumulative reward that the agent receives following the optimal policy. Thus, Qmax is always far greater than Qmin. Most of the time, Qmax is far above the true maximum action value that the policy can obtain, and the true action value of some state-action pairs will never get close to Qmax. This leads to an inaccurate estimated action value and harms the RL performance.

In this paper, to make the estimated action value more accurate, we describe the construction of adaptive bounds for CDRL. These can be adjusted for different courses of training and different state-action pairs by using the bootstrapping method to approximate a confidence interval for the bounds. We use the upper and lower limits of these confidence intervals as the new bounds of the distribution. Finally, we introduce the DTPS method to further reduce the variance of the update and correct the estimated action value.

2.3. Deep deterministic policy gradients

DDPG is an actor-critic, model-free method for continuous action spaces that extends DQN [4] and the deterministic policy gradient (DPG) [32]. DDPG deploys an experience replay technique and a target network to stabilize the training. The core element of DDPG is the DPG theorem and the resulting gradient of the actor, which can be written as follows:

\nabla_\theta J(\theta) \approx E_\rho\big[\nabla_\theta \pi_\theta(x)\, \nabla_a Q^{\pi_\theta}(x, a)\big|_{a=\pi_\theta(x)}\big]    (7)

where ρ is the state-visitation distribution. DDPG maintains an estimate of the action value function Q_ω(s, a) by minimizing the temporal difference error between the action value function before and after applying the Bellman update. Therefore, the resulting loss can be written as:

L(\omega) = E_\rho\big[(Q_\omega(s, a) - T^\pi Q_{\omega'}(s, a))^2\big]    (8)

Barth-Maron et al. [33] proposed the distributional deep deterministic policy gradient (D3PG), which combines CDRL with DDPG and has achieved state-of-the-art performance in several complex tasks. D3PG considers the inclusion of a distributional critic [16] and adapts CDRL to the continuous control setting. To use the distributional critic within the DDPG architecture introduced above, we need to rewrite the DPG theorem by taking the expectation with respect to the action-value distribution, i.e.,

\nabla_\theta J(\theta) \approx E_\rho\big[\nabla_\theta \pi_\theta(x)\, \nabla_a Q_\omega(x, a)\big|_{a=\pi_\theta(x)}\big]
                        = E_\rho\big[\nabla_\theta \pi_\theta(x)\, E[\nabla_a Z_\omega(x, a)]\big|_{a=\pi_\theta(x)}\big]    (9)

D3PG updates the critic in the same way as CDRL, including a projection step and a KL minimization step. The inclusion of the distributional critic update means that D3PG is capable of state-of-the-art performance on a number of difficult continuous problems.

2.4. Related work

CDRL offers a new perspective on RL: when combined with DQN, it outperforms algorithms that only model the expected values in many Atari 2600 games [34, 35]. Besides the categorical distribution, [33] proposed a method that models the return distribution with a mixture of Gaussians (MoG), which reduces the requirement for domain knowledge and removes the projection step, but performs worse than the categorical distribution. MoG-DQN [36] also modeled the return distribution with a mixture of Gaussians and proposed a new distance metric for MoG distributions, the Jensen-Tsallis Distance (JTD). The advantage of JTD over other metrics such as the KL-divergence and the Wasserstein distance is that JTD computes the distance between two MoG distributions in closed form, which makes the JTD loss applicable with sample-based methods.
More recently, additional distributional algorithms have been proposed. Quantile regression DQN (QR-DQN) [37] does not require the projection step and can perform distributional RL over the Wasserstein metric [38]. The Implicit Quantile Networks (IQN) technique [39] uses quantile regression to approximate the full quantile function of the return and produces risk-sensitive policies. Bellemare et al. [40] provided theoretical and empirical results that explain why distributional RL produces the observed improvements. Dopamine [41] is a code repository containing implementations of many distributional RL algorithms. Tang et al. [42] proposed a framework based on distributional RL that unifies several exploration methods. Dabney et al. [31] presented a unifying framework for designing and analysing distributional RL and provided improved analyses of existing distributional RL methods. Tamar et al. [43] showed that the distributional Bellman equation is equivalent to a generative adversarial network (GAN) model, and used this insight to propose a GAN-based approach that can also be applied in the multivariate reward setting.

Our work is also related to solving continuous control tasks. Proximal Policy Optimization (PPO) takes the biggest possible improvement step on a policy without stepping so far that performance collapses; PPO methods are simpler to implement and perform better than many other policy gradient methods. Guide Actor-Critic (GAC) [44] utilizes Hessians of the critic for actor learning and learns a guide actor that locally maximizes the critic. In Dual Actor-Critic (Dual-AC) [45], the actor and dual critic are updated cooperatively to optimize the same objective function. Trust Path Consistency Learning (Trust-PCL) [46] is an off-policy algorithm that employs a relative-entropy penalty to impose a trust region on a maximum reward objective and performs well on a set of standard control tasks. In Soft Actor-Critic (SAC) [47], the actor aims to maximize the expected reward while also maximizing entropy, so that it succeeds at the task while acting as randomly as possible. Twin Delayed Deep Deterministic policy gradient (TD3) [48] increases stability and performance by accounting for function approximation error.

3. Proposed method

In this section, we first show that inappropriate CDRL bounds result in an estimation error, and that this estimation error affects the overall performance. In subsection 3.2, we describe an adaptive bounds method using bootstrapping that addresses this problem. A distributional target policy smoothing strategy is presented in subsection 3.3. Subsection 3.4 summarizes the whole algorithm.

3.1. Motivation

In CDRL, the categorical distribution has an important hyperparameter: the bounds of the support, (Qmin, Qmax). These bounds determine the range of the possible action values that the agent can attain throughout the training process and hence the estimated action value. The action value depends on the policy's performance and on the state-action pairs, whereas the bounds (Qmin, Qmax) in CDRL only consider the minimum and maximum action value the agent can obtain and remain fixed throughout the training stages. Thus, the bounds are not accurate enough for the estimation of the action value.

Given the bounds (Qmin, Qmax) and the number of atoms l, we assume the support set of this distribution is {z_i = Qmin + i∆z : 0 ≤ i ≤ l − 1}, with ∆z := (Qmax − Qmin)/(l − 1). The atom probabilities are given by a parametric model θ : S × A → R^l, and we denote the probability of atom z_i for state-action pair (s, a) by p_i(s, a; θ). The initial distribution is typically uniform, as shown in Figure 1(a), so we have p_i(s, a; θ) ≈ 1/l, i = 1, . . . , l. In this way, the action value that the distribution approximates during the early stage is as follows:

\hat{Q}(s, a) = \sum_{i=1}^{l} z_i \, p_i(s, a; \theta)
            \approx \sum_{i=1}^{l} z_i \cdot \frac{1}{l}
            = \frac{1}{l} \sum_{i=1}^{l} z_i
            = \frac{1}{l} \cdot \frac{l\,(Q_{min} + Q_{max})}{2}
            = \frac{Q_{min} + Q_{max}}{2}    (10)

This equation shows that, during the early stage of training, the estimated action value is approximately equal to (Qmin + Qmax)/2. This is greater than the real action value that the agent can obtain, because the performance of the policy is poor during the early stages of training, so the bounds cause an overestimation bias.
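A two-line numerical check of (10), using the Ant-v2 bounds from Table 1 as an example (our own illustration; the atom count l = 51 is an assumed value, not specified for this derivation in the paper):

import numpy as np

q_min, q_max, l = -200.0, 300.0, 51          # bounds from Table 1 (Ant-v2); l is assumed
atoms = np.linspace(q_min, q_max, l)         # evenly spaced support z_1..z_l
uniform_estimate = np.mean(atoms)            # expectation under a uniform categorical distribution
print(uniform_estimate, (q_min + q_max) / 2) # both print 50.0, matching Eq. (10)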
As the training progresses, the estimated action value \hat{Q}(s, a) will decrease and become closer to the real action value. However, because of the inaccurate bounds and the projection step, CDRL will soon underestimate the action value. The parametric distribution is updated by applying the distributional Bellman operator to each atom z_i: z_i' ← r + γ z_i, where z_i' is the target atom, γ ∈ (0, 1] is the discount factor, and r is the reward. The relationship between z_i' and z_i is important. Consider when the target atom is no greater than the original atom:

r + \gamma z_i \le z_i \;\Longrightarrow\; r \le (1-\gamma) z_i \;\Longrightarrow\; r \le (1-\gamma)(Q_{min} + i\Delta z)    (11)

In other words, if r ≤ (1 − γ)(Qmin + i∆z), then z_i' ≤ z_i; otherwise z_i' > z_i. Because the distance ∆z is large (when Qmax is far larger than Qmin), inequality (11) holds easily as i becomes larger. As shown in Figure 1(b), the target atoms z_3' and z_4' are less than the original atoms z_3 and z_4. After the Bellman update, CDRL applies a projection step: the probability of each target atom is allocated to the two adjacent atoms, as Figure 1(b) shows. For example, the probability of the target atom z_4' is p_4'. We define b = (z_4' − Qmin)/∆z; the fraction p_4'(b − ⌊b⌋) of this probability is allocated to the upper neighboring atom.
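A small numerical illustration of condition (11) and the neighbor allocation (our own example; the reward, discount, and bound values are arbitrary stand-ins):

import numpy as np

q_min, q_max, l, gamma, r = -200.0, 300.0, 51, 0.99, 1.0   # illustrative values only
dz = (q_max - q_min) / (l - 1)
z = q_min + np.arange(l) * dz                              # support atoms
shrinks = r <= (1 - gamma) * z                             # condition (11): target atom falls at or below z_i
print(shrinks.sum(), "of", l, "atoms satisfy (11)")        # the high-value atoms satisfy it

tz4 = np.clip(r + gamma * z[4], q_min, q_max)              # Bellman-updated atom z_4'
b = (tz4 - q_min) / dz                                     # fractional index of z_4'
print(np.floor(b), np.ceil(b), b - np.floor(b))            # its two neighbours and the upper-neighbour weight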
The projection step and the inaccurate bounds mean that the probability of obtaining the higher action values decreases and the whole distribution shrinks, as shown by Figure 1(c). As a result, the estimated action value stops changing and the learning stalls.

Figure 1: Procedure of CDRL. (a) At the beginning of the training stage, the categorical distribution is uniform. We use z_1, . . . , z_10 to denote the possible returns and q_1 is the corresponding probability. (b) Projection step of CDRL. z_1', . . . , z_4' are the possible returns after applying the distributional Bellman operator and p_4' is the probability of the target position z_4'. The colored dashed lines are the target probabilities. The rectangles with different colors are the results after projection. (c) After projection, the distribution (light color) shrinks to the target distribution (dark color). As a result, the estimated action value stops updating and the learning process slows down.

Does the problem of inaccurate estimated action values occur in practice for CDRL? Figure 2 plots the expected action value of CDRL and the true action value over the learning process on the OpenAI Gym environment Ant-v2. The true action value was obtained by averaging the discounted return over 10000 time steps following the current policy. We then saved the state-action pairs visited during evaluation to obtain the expected value (the expectation of the return distribution). From Fig. 2(a), for the categorical distribution with fixed bounds, there is an initial overestimation bias between the expected action value and the true action value; the expected action value then changes slowly and the policy gradually stops learning. In contrast, the expected action value obtained with adaptive bounds is closer to the true action value than that of fixed bounds. Moreover, using adaptive bounds achieves better performance, because the achieved true value is larger.

Finally, we argue that the inaccurate estimated action value caused by CDRL harms the overall performance. The error propagates because of the distributional Bellman operator, which means an action value distribution is learned from another distribution that already contains errors; the error therefore propagates throughout the whole course of learning. Additionally, in an actor-critic setting, the policy becomes poor through the interplay between the actor and the critic. For example, the distributional policy gradient in D3PG is ∇_θ J(θ) ≈ E_ρ[∇_θ π_θ(x) E[∇_a Z_ω(x, a)]|_{a=π_θ(x)}]. If the value estimate Z_ω itself is inaccurate, the actor network updates in the wrong direction and the policy becomes poor.
3.2. Adaptive bounds

This section introduces a novel method that automatically adjusts the bounds in CDRL and makes the estimated action value more accurate. The estimated action value is the expectation of the distribution, which is determined by the number of atoms l and the bounds of the distribution. However, l cannot change the expectation as much as the bounds can, so we focus on finding appropriate bounds. The new bounds are (Qmin(s_t, a_t), Qmax(s_t, a_t)), which can be adjusted based on the policy's performance (time step) and the state-action pairs instead of remaining fixed. We cannot obtain the new bounds directly; what we can do is collect the action values that the agent obtained during several previous time steps and compute a confidence interval for Qmin(s_t, a_t) and Qmax(s_t, a_t). However, the same state-action pair is rarely revisited in continuous-control tasks, so we cannot gather several action values for one state-action pair. We assume the action value function is parameterized by a critic network Q_θ. During the course of the training, we save a sequence of changing weights {θ_i, θ_{i+1}, . . . , θ_j} from time step i to j. Therefore, we can obtain samples for each state-action pair at different time steps. To estimate the confidence interval, we still face some challenges. The first is the limited sample size, because we cannot save too many weights for reasons of computation time and memory. The second issue is
the sample dependencies [18], because the weights are strongly interrelated.

The moving block bootstrap (MBB) [17, 49, 50] technique is known to be valid for this situation. Every bootstrapping method, including MBB, is derived from Efron's original bootstrapping procedure [19]. Ordinary bootstrapping produces a bootstrapped sample from the initial samples, whereas MBB first divides the original sequential data into blocks, from which blocks of samples are drawn with replacement. In this way, MBB breaks the dependencies between the sequential data and takes full advantage of the limited number of samples. Finally, these samples are used to estimate the confidence interval of the bounds.

The whole process of estimating the confidence interval is as follows. Given a sequence of changing weights {θ_i, . . . , θ_j} from time step i to j, we compute the sample action values Q_s = {Q_i(s, a), . . . , Q_j(s, a)} for one state-action pair (s, a) through the critic network. Then, we construct a set of overlapping blocks {B_1, . . . , B_N}, where N = j − i − l is the number of blocks and B_i = (Q_i(s, a), . . . , Q_{i+l−1}(s, a)). B_1*, . . . , B_k* represents a random sample with repetitions from the set {B_1, . . . , B_N}. Finally, we merge these blocks to obtain the observations Q_s* = {Q_1*(s, a), . . . , Q_{k·l}*(s, a)}. Given these observations, an estimate T_n* of the statistic can be computed. This process is repeated K times, resulting in K samples of the statistic, T*_{n,1}, . . . , T*_{n,K}. With these bootstrapped samples, we construct a percentile-t (studentized) interval as follows:

P\big(T \in (2T_n - T^*_{1-\alpha/2},\; 2T_n - T^*_{\alpha/2})\big) \ge 1 - \alpha    (12)

where T*_β is the β-quantile of the ordered population and T_n is the statistic of the original sample Q_s. In bootstrapping, the β-quantile is usually computed as follows:

T^*_\beta = (1 - r)\, T^*_j + r\, T^*_{j+1}    (13)

where j = ⌊nβ⌋ + m, r = nβ − j + m, and m depends on the quantile type; for non-normal distributions, m = (β + 1)/3. In this way, we can approximate a 1−α confidence interval (2T_n − T*_{1−α/2}, 2T_n − T*_{α/2}), which means the probability of seeing the statistic outside the interval is low. When T_n* = max(Q_s*) and T_n = max(Q_s), we can determine the 1−α confidence interval for the upper bound of the distribution. If T_n* = min(Q_s*) and T_n = min(Q_s), we can determine the confidence interval for the lower bound of the distribution. Here the confidence interval for Qmin is denoted by (Qmin^lower, Qmin^upper) and the confidence interval for Qmax is denoted by (Qmax^lower, Qmax^upper). We then use the upper limit of the confidence interval of Qmax as the upper bound of the new distribution, and the lower limit of the confidence interval of Qmin as the lower bound of the new distribution. This means the new bounds of the distribution are (Qmin^lower, Qmax^upper). In this way, the probability of the true action values lying inside the new interval is maximized, which benefits the approximation of the return distribution.

Figure 2: Expected action value and true action value of (a) fixed bounds and (b) our proposed adaptive bounds on Ant-v2.

Figure 3: Distribution over return using the two methods: (a) fixed bounds; (b) adaptive bounds.

Before updating the policy, we compute the bounds for every state-action pair, and every fixed number of time steps we replace the oldest weights in the buffer {θ_i, . . . , θ_j} with the latest weights. In this way, the bounds of the distribution adapt to the policy and the state-action pairs, and at the same time the bounds will not be too wide (Figure 3(b)). In Figure 3(a), the agent cannot obtain action values as large as (or even close to) Qmax using CDRL with fixed bounds. Therefore, the atoms between the true maximum action value and Qmax (marked with braces in Figure 3(a)) produce an estimation error. The new bounds reduce the distance between two atoms, preventing the distribution from shrinking and the learning from stopping too early. Finally, the adaptive bounds lead to more accurate estimated action values and better performance. Figure 2(b) plots the estimated action value and the true action value using the novel Adaptive Bounds (AB) method. From Figure 2(b), we see that the estimated action value is almost the same as the true action value, and the performance of the new method is also better than that of the original CDRL (the average value of AB is greater than that of CDRL).
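The procedure of this subsection (and of Algorithm 2 below) can be sketched in NumPy as follows. This is our own illustrative implementation, not the authors' code; the block length, confidence level, and the synthetic way the Q-value samples are generated are stand-ins:

import numpy as np

def mbb_upper_bound(q_samples, block_len=5, n_boot=100, alpha=0.05, rng=None):
    """Moving block bootstrap estimate of the upper confidence limit for Q_max(s, a).

    q_samples: action values Q(s, a; theta_1..theta_B) computed from saved critic weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    q_samples = np.asarray(q_samples, dtype=float)
    # Overlapping blocks of consecutive samples break the dependence between saved weights.
    blocks = np.array([q_samples[i:i + block_len]
                       for i in range(len(q_samples) - block_len + 1)])
    n_blocks = len(q_samples) // block_len
    stats = np.empty(n_boot)
    for k in range(n_boot):
        picked = blocks[rng.integers(0, len(blocks), size=n_blocks)]  # resample blocks with replacement
        stats[k] = picked.max()                                       # statistic T* = max of the merged blocks
    stats.sort()
    # Interpolated alpha/2 quantile of the bootstrapped statistics, as in Eq. (13) / Algorithm 2.
    pos = (n_boot * alpha / 2) + (alpha + 2) / 6
    j, r = int(np.floor(pos)), pos - int(np.floor(pos))
    t_alpha = (1 - r) * stats[j] + r * stats[min(j + 1, n_boot - 1)]
    # Upper limit 2*T_n - T*_{alpha/2}, as in Eq. (12) and Algorithm 2.
    return 2 * q_samples.max() - t_alpha

# Example with synthetic Q-value samples for one (s, a) pair.
q_hist = 100 + np.cumsum(np.random.default_rng(0).normal(0, 2, size=40))
print(mbb_upper_bound(q_hist))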
3.3. Distributional target policy smoothing strategy

When applying CDRL in an actor-critic setting, deterministic policies may overfit to narrow peaks in the value estimate [48]. This problem is aggravated when using adaptive bounds, because the new bounds adapt to different state-action pairs and policies, and so the bounds change with the training process. Every time the bounds change, the expectation of the distribution also changes. There will be a sudden change in the critic network, which will influence the actor network according to (9): ∇_θ J(θ) ≈ E_ρ[∇_θ π_θ(x) E[∇_a Z_ω(x, a)]|_{a=π_θ(x)}].
This sudden change increases the variance of the target and degrades the accuracy of the estimated action value. Thus, we propose a distributional target policy smoothing (DTPS) method to address this problem. The proposed method originates from Expected SARSA [20, 51]. The update rule of Expected SARSA is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\Big[r_{t+1} + \gamma \sum_a \pi(s_{t+1}, a)\, Q(s_{t+1}, a) - Q(s_t, a_t)\Big]    (14)

Using the expected value E[Q(s_{t+1}, a_{t+1})] as the target value, rather than Q(s_{t+1}, a_{t+1}), reduces the variance of the update. Unfortunately, we cannot apply the same idea directly in distributional RL with continuous action spaces, because we cannot enumerate all possible next actions in order to compute the expected distribution as the target distribution. Thus, we propose the following target distribution:

y = r + E_\varepsilon\big[Z_{\theta'}(s', \pi_{\phi'}(s') + \varepsilon)\big]    (15)

where ε is a small random noise with ε ∼ clip(N(0, σ), −c, c); σ determines the scale of the noise and the parameter c limits its magnitude. To approximate this expectation, we first sample the noise and add it to the target action, which gives one target distribution. This process is repeated K times to give K target distributions. Because these distributions all share the same support set, we can approximate the expectation by averaging over them. In practice, our proposed DTPS reduces the variance of the update and further corrects the estimated action values.
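A minimal NumPy sketch of the DTPS target in (15) follows; it is our own illustration, with the target networks represented by placeholder callables and the noise parameters set to example values:

import numpy as np

def dtps_target_distribution(s_next, actor_target, z_target, K=100, sigma=0.2, c=0.5, rng=None):
    """Average K noisy-target categorical distributions to approximate Eq. (15).

    actor_target(s) -> target action; z_target(s, a) -> atom probabilities on the shared support.
    Both are assumed callables standing in for the target networks.
    """
    rng = np.random.default_rng() if rng is None else rng
    a_target = actor_target(s_next)
    dists = []
    for _ in range(K):
        eps = np.clip(rng.normal(0.0, sigma, size=a_target.shape), -c, c)  # clipped Gaussian noise
        dists.append(z_target(s_next, a_target + eps))                     # one smoothed target distribution
    return np.mean(dists, axis=0)                                          # same support, so simply average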
3.4. Proposed algorithm

In this section, we combine all the components described above and adapt them to an actor-critic setting. The pseudocode of our Adaptive Smoothing CDRL (AS-CDRL) is presented in Algorithm 1. In step 11, we estimate the confidence intervals of the bounds Qmax(s, a) and Qmin(s, a) and set the upper and lower limits of the intervals as the bounds of our distribution. The procedure for computing Qmax is given in Algorithm 2; Qmin is computed in a similar manner. Steps 12-16 describe the DTPS procedure.

Algorithm 1 AS-CDRL
Input: batch size M, learning rates α and β, sample times K, frequency of saving the critic network weights N, sample size B
1: Initialize the critic network Q_θ
2: Initialize the actor network π_φ
3: Initialize the distributional critic network Z_ω
4: Initialize the replay buffer B
5: Initialize the target weights (φ', ω') ← (φ, ω)
6: for t = 1 to T do
7:   if t mod N == 0 then
8:     Save the critic weights θ_t
9:   end if
10:  Sample M transitions (s, a, s', r) from B
11:  GetUpperConfidence({θ_1, . . . , θ_B}, (s, a))
12:  for j = 1 to K do
13:    â = π_{φ'}(s') + ε
14:    z' = r + γ Z_{ω'}(s', â)
15:  end for
16:  Compute the target distribution z_target by averaging over z'
17:  Compute the current distribution z = Z_ω(s, a)
18:  Compute the distributional critic loss: δ_ω = (1/M) Σ ∇_ω D_KL(Φ z_target || z)
19:  Compute the actor loss: δ_φ = (1/M) Σ ∇_φ π_φ(s) E[∇_a Z_ω(s, a)]|_{a=π_φ(s)}
20:  Update the network parameters: ω ← ω + α δ_ω, φ ← φ + β δ_φ
21:  Compute the critic loss using (8) and update θ as in other common RL methods
22:  Update the target networks
23: end for

Algorithm 2 GetUpperConfidence
Input: last B weights {θ_1, . . . , θ_B}, block length l, state-action pair (s, a), confidence level α (= 0.05), number of bootstrap repetitions K
1: Get sample action values: Q_B = {Q(s, a; θ_1), . . . , Q(s, a; θ_B)}
2: Construct the blocks: B_i = (Q(s, a; θ_i), . . . , Q(s, a; θ_{i+l−1}))
3: Total number of blocks: M ← ⌊B/l⌋
4: for i = 1 to K do
5:   Sample M blocks and merge them: Q_B* = (Q*(s, a; θ_1), . . . , Q*(s, a; θ_{M·l}))
6:   Compute an estimate of the statistic: T_i* = max(Q_B*)
7: end for
8: Sort({T_1*, . . . , T_K*})
9: j = ⌊Kα/2 + (α + 2)/6⌋, r = Kα/2 + (α + 2)/6 − j
10: T*_{α/2} = (1 − r) T_j* + r T_{j+1}*
11: return 2 · max(Q_B) − T*_{α/2}
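For concreteness, the distributional critic and actor updates of Algorithm 1 (steps 16-20) might look as follows in PyTorch. This is our own sketch under assumed interfaces, not the authors' released code: z_net(s, a) and z_target_net(s, a) return atom probabilities, actor(s) returns actions, and project is a projection routine such as the one sketched in Section 2.2.

import torch

def critic_actor_step(z_net, z_target_net, actor, actor_target, batch, atoms,
                      critic_opt, actor_opt, gamma=0.99):
    """One distributional critic + actor update (cf. Algorithm 1, steps 16-20)."""
    s, a, r, s_next = batch                              # tensors sampled from the replay buffer
    with torch.no_grad():
        a_next = actor_target(s_next)                    # (smoothed/averaged) target action in AS-CDRL
        p_next = z_target_net(s_next, a_next)            # target atom probabilities
        p_proj = project(p_next, r, gamma, atoms)        # projected target distribution (Eq. 5), assumed helper
    log_p = torch.log(z_net(s, a) + 1e-8)
    critic_loss = -(p_proj * log_p).sum(dim=1).mean()    # cross-entropy, equivalent to minimizing the KL in Eq. (6)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    q = (z_net(s, actor(s)) * atoms).sum(dim=1)          # E[Z_w(s, pi(s))]: expectation of the distribution
    actor_loss = -q.mean()                               # ascend the estimated action value (Eq. 9)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()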
4. Experiments and discussions

In this section, we describe the experiments conducted to measure the performance of the proposed AS-CDRL on a set of continuous-control tasks. In subsection 4.1, we describe the experimental setup and set some important hyperparameters of our method. Subsection 4.2 evaluates the proposed AS-CDRL and other baselines on several locomotion tasks and presents the corresponding learning curves. Finally, we analyze the effect of AB and DTPS in subsection 4.3. Readers wishing to reproduce our method are welcome to use the open-source code¹.

¹ https://github.com/zhaoyingnan179346/AS-CDRL
Figure 4: Example Mujoco environments: (a) Ant-v2, (b) HalfCheetah-v2, (c) Hopper-v2, (d) Swimmer-v2, (e) Reacher-v2, (f) Walker2d-v2, (g) InvertedDoublePendulum-v2, (h) InvertedPendulum-v2.
4.1. Setup

Our experiments were conducted using the Mujoco [52] environments interfaced through OpenAI Gym² [53]. We chose eight continuous-control locomotion tasks from Mujoco (see Figure 4). In CDRL, we need to set the bounds Qmin and Qmax of the categorical distribution, and these values have an important impact on the performance. Finding the optimal bounds for each task is a process of trial and error. The bounds we used are listed in Table 1.

² https://github.com/openai/mujoco-py

Table 1: Optimal bounds for CDRL in different tasks.

Task                          Qmin    Qmax
Ant-v2                        -200    300
HalfCheetah-v2                -100    1100
Reacher-v2                    -50     0
Hopper-v2                     -60     600
Walker2d-v2                   -100    700
Swimmer-v2                    0       120
InvertedPendulum-v2           0       160
InvertedDoublePendulum-v2     0       1500
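Given the bounds in Table 1, the fixed categorical support for a task is simply an evenly spaced grid, for example (our own snippet; the atom count l = 51 is an assumption, as the excerpt does not state it):

import numpy as np

bounds = {"Ant-v2": (-200.0, 300.0), "HalfCheetah-v2": (-100.0, 1100.0)}  # values from Table 1
l = 51                                                                     # assumed number of atoms
for task, (q_min, q_max) in bounds.items():
    atoms = np.linspace(q_min, q_max, l)       # support z_1..z_l used by the fixed-bound critic
    print(task, "delta_z =", (q_max - q_min) / (l - 1), "atoms:", atoms[:3], "...")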
For our method, we used the DDPG architecture. Both the actor and the critic contained two-layer feedforward networks with 400 and 300 hidden nodes, respectively, with a ReLU nonlinearity between the layers. The critic received both the state and the action as input and produced the atom probabilities as output. Both networks were updated using Adam [54]. The learning rate for the actor was 1e-4 and that for the critic was 2.5e-4. After each time step, the networks were trained with a mini-batch of 100 transitions.
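A PyTorch-style sketch of this actor-critic pair, reconstructed by us from the description above (the choice of PyTorch, the Tanh output squashing, the softmax output layer, the atom count, and the example Ant-v2-like dimensions are assumptions):

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Two-layer feedforward policy with 400 and 300 hidden units, as described in Section 4.1."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class DistributionalCritic(nn.Module):
    """Maps (state, action) to atom probabilities over the categorical support."""
    def __init__(self, state_dim, action_dim, n_atoms=51):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, n_atoms))

    def forward(self, state, action):
        logits = self.net(torch.cat([state, action], dim=-1))
        return torch.softmax(logits, dim=-1)   # atom probabilities

# Optimizers with the learning rates reported above.
actor = Actor(state_dim=111, action_dim=8)                 # Ant-v2-like dimensions, for illustration
critic = DistributionalCritic(state_dim=111, action_dim=8)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=2.5e-4)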
Figure 5: Performance of AS-CDRL with different hyperparameters: (a) Ant-v2; (b) HalfCheetah-v2.

Our proposed AS-CDRL consists of two components: AB and DTPS. For the AB component, we saved the critic weights every N time steps. To provide some intuition, we examined the performance with different values of N on HalfCheetah-v2 (see Figure 5(a)). With N = 100 and N = 1000, the training becomes unstable, because the frequency of saving the weights is also the frequency of changing the bounds. If the bounds of the categorical distribution change too often, the estimated action value also changes frequently, which makes the estimated gradients of the actor and the critic inaccurate. Saving the critic weights at a lower frequency (N = 2000 and N = 3000) improves the performance and makes the training process smoother. The situation is similar in the other tasks.

DTPS reduces the variance of the target and stabilizes the training. The sample number K in DTPS represents the number of actions we sample in the small area around the target action. We examined the performance with different values of K on Ant-v2 (see Figure 5(b)). When K = 0 (no DTPS), the policy achieves a high score, but the learning process is unstable and exhibits divergent behavior. As K increases, the approximation of the expected target distribution becomes more accurate and the instability and divergence problems clearly improve. In practice, we set K = 100.

Figure 6: True averaged action value and the adaptive bounds.

The core idea behind AB is that the bounds can be adjusted automatically based on the policy's performance and the state-action pairs. To verify this, the bounds and the true action values were recorded during training. From Figure 6, the values of Qmin and Qmax increase as the performance (the true action value) improves. The adaptive bounds are not too wide for the true action value, making the estimated action value more accurate.

4.2. Evaluation

We compared our algorithm against DDPG [14] and the state-of-the-art policy gradient algorithms D3PG [33], Twin Delayed Deep Deterministic policy gradient (TD3) [48], and Proximal Policy Optimization (PPO). D3PG is an actor-critic version of CDRL that achieves state-of-the-art performance on many continuous tasks. TD3 reduces the overestimation bias in an actor-critic setting and greatly improves both the learning speed and performance in a number of challenging tasks. These methods, like AS-CDRL, are DDPG-family methods that use the deterministic policy gradient theorem. PPO is a family of policy optimization methods based on the stochastic policy gradient theorem; it is motivated by Trust Region Policy Optimization (TRPO) [55], takes the biggest possible improvement step on a policy, and performs well in continuous-control environments. AS-CDRL builds on CDRL by applying AB and DTPS to reduce the estimation error and improve the performance. Besides the categorical distribution, the mixture of Gaussians (MoG) distribution does not require bounds, so we also compared our method with distributional reinforcement learning methods that use the MoG distribution. We denote the method of [33] that uses the MoG distribution while minimizing the KL-divergence as MoG-KL. MoG-DQN [36] is another distributional reinforcement learning method, which uses a mixture of Gaussians to model the distribution of the sum of rewards and proposes a new distance metric for MoG distributions, the Jensen-Tsallis Distance (JTD).

Each task was executed over 1 million time steps. We evaluated the policy for 10 episodes every 5000 time steps and then averaged the rewards. The learning curves are presented in Figure 7. From the results, we observe that the proposed AS-CDRL achieves better performance than the other methods. AS-CDRL outperforms TD3 on Swimmer-v2, HalfCheetah-v2, Hopper-v2, and InvertedDoublePendulum-v2. AS-CDRL matches or outperforms D3PG in both final performance and learning speed across all tasks, because AS-CDRL has a lower estimation error than D3PG. From Figure 7 it is clear that MoG-KL does not perform well. Minimizing a distance metric between two MoG distributions is difficult: neither the KL-divergence nor the Wasserstein distance between two MoG distributions can be computed in closed form, and estimating it by a Monte Carlo method is computationally expensive and suffers from high variance, which harms the final performance. MoG-DQN outperforms our method on Swimmer-v2 and Hopper-v2. On other tasks such as HalfCheetah-v2, Ant-v2, and Walker2d-v2, MoG-DQN underperforms our method due to the lack of constraints on the return distribution, which makes the learned reward distribution inaccurate and brittle. MoG-DQN is completely random at the beginning of training; as training proceeds, the MoG distribution gradually approximates the real reward distribution using only the one-step reward. Therefore, the training process is time-consuming and unstable.

Figure 7: Learning curves for the OpenAI Gym control tasks. The shaded region represents the standard deviation of the average evaluation over three trials.

To investigate the robustness to random seeds, we recorded the results for our method and MoG-DQN with different random seeds on all the tasks. We plotted the results on a violin plot in Fig. 8, including markers for the best, worst, and median performances. In Fig. 8, the y-axis represents the normalized score over all the tasks. The white point represents the median score, while the two ends of the line represent the best and worst scores. The width of the shape represents the frequency of obtaining the corresponding score. Figure 8 shows that the violin plot of our method is wider and shorter, which means there is little difference between the best and worst scores; furthermore, our method achieves similar results regardless of the random seed. The violin plot of MoG-DQN is thinner and longer, which means MoG-DQN may obtain very different performance simply from different random seeds. This increases the difficulty of debugging and of practical applications. Therefore, our method is much better than MoG-DQN in terms of both performance and robustness.

Figure 8: Distribution of performance across random seeds.
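The evaluation protocol used above (10 evaluation episodes every 5000 training steps, returns averaged) can be sketched as follows; this is our own illustration with a classic Gym-style environment interface and a placeholder policy function, not the authors' evaluation script:

import numpy as np

def evaluate(env, policy, n_episodes=10):
    """Average undiscounted return of a deterministic policy over n_episodes."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# During training, call evaluate(...) every 5000 environment steps and log the result
# to produce learning curves like those in Figure 7.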
4.3. Ablation

This subsection describes ablations that remove components of our algorithm in order to compare their separate contributions. Our algorithm includes two parts: AB and DTPS. The results presented in Table 2 compare the performance of each component. The significance of each component varies from
task to task. Across all four tasks, the best performance is obtained by AS-CDRL. The biggest gain is due to the inclusion of AB, which is particularly helpful on tasks with wide bounds (where Qmax − Qmin is large); this is intuitively reasonable, because if Qmax − Qmin is large, training produces a greater estimation error and the final performance suffers. In tasks like Swimmer-v2, DTPS is more helpful than AB: the range of possible action values is not large and the estimation error is not a serious problem, so the contribution of AB is not as large as in other tasks.

DTPS enhances the performance in most of these tasks, especially those with large stochasticity, because the target action for the same state may differ and change frequently. As a result, the variance of the update is large and increases the bias of the estimated action value. DTPS samples several actions around the target action and smooths the target distribution. In this way, DTPS can eliminate the effect of stochasticity and improve the performance.

Table 2: Maximum average return over 10 trials of 1 million time steps. The results show the ablation results over DTPS and AB.

Method            HalfCheetah    Ant        Walker2d    Swimmer
CDRL              10851.26       5237.63    3295.39     122.44
AS-CDRL - AB      11089.11       4566.02    3476.00     140.22
AS-CDRL - DTPS    11542.88       5259.77    4845.86     96.69
AS-CDRL           12217.74       5615.16    5198.19     141.26
5. Conclusion

This paper has described two novel modifications to CDRL. The first, Adaptive Bounds, saves the critic weights over a fixed number of time steps and uses the associated action values as samples for different state-action pairs in different learning stages. Moving block bootstrapping then computes confidence intervals for Qmax and Qmin from these samples, and the upper and lower limits of these intervals are taken as the adapted bounds. In practice, AB effectively reduces the estimation error in CDRL. The second modification, Distributional Target Policy Smoothing, changes the target distribution to the expected distribution over a set of similar actions, which further corrects the estimated action values.

AB and DTPS together constitute our proposed method, Adaptive Smoothing CDRL (AS-CDRL). AS-CDRL significantly improves the performance and learning speed of CDRL on eight locomotion tasks taken from Mujoco, and outperforms several state-of-the-art algorithms, including TD3 and PPO. In this paper we combined our method with DDPG and showed a promising avenue for improved performance; incorporating other policy gradient methods (e.g., trust regions) is an exciting direction for future work.
Acknowledgement

This work was supported by the National Natural Science Foundation of China (61671175) and the Science and Technology on Space Intelligent Laboratory (ZDSYS-2018-02).

References
[1] R. S. Sutton, A. G. Barto, F. Bach, et al., Reinforcement learning: An introduction, MIT press, 1998. 650 [2] Y. Li, Deep reinforcement learning: An overview (2017). arXiv:1701. 07274. [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. A. Riedmiller, Playing atari with deep reinforcement learning (2013). arXiv:1312.5602. 655 [4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. 660 [5] G. Lample, D. S. Chaplot, Playing FPS games with deep reinforcement learning, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017, pp. 2140–2146. [6] S. Levine, V. Koltun, Guided policy search, in: Proceedings of the 30th665 International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, 2013, pp. 1–9. [7] J. Kober, J. A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey, I. J. Robotics Res. 32 (2013) 1238–1274. [8] M. Andrychowicz, D. Crow, A. Ray, J. Schneider, R. Fong, P. Welinder,670 B. McGrew, J. Tobin, P. Abbeel, W. Zaremba, Hindsight experience replay, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 5055–5065. [9] C. Bai, P. Liu, W. Zhao, X. Tang, Guided goal generation for hindsight675 multi-goal reinforcement learning, Neurocomputing. [10] K. Guu, P. Pasupat, E. Z. Liu, P. Liang, From language to programs: Bridging reinforcement learning and maximum marginal likelihood, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4,680 Volume 1: Long Papers, 2017, pp. 1051–1062. [11] Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, W.-Y. Ma, Dual learning for machine translation, in: NIPS, 2016. [12] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, Y. Wang, End-to-end active object tracking via reinforcement learning, in: Proceedings of the685 35th International Conference on Machine Learning, ICML 2018, Stockholmsm¨assan, Stockholm, Sweden, July 10-15, 2018, 2018, pp. 3292– 3301. [13] L. Ren, J. Lu, Z. Wang, Q. Tian, J. Zhou, Collaborative deep reinforcement learning for multi-object tracking, in: Computer Vision - ECCV690 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part III, 2018, pp. 605–621.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, in: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. [15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 2016, pp. 1928–1937. [16] M. G. Bellemare, W. Dabney, R. Munos, A distributional perspective on reinforcement learning, in: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 449–458. [17] B. Radovanov, A. Marciki´c, A comparison of four different block bootstrap methods, 2014. [18] M. White, A. M. White, Interval estimation for reinforcement-learning algorithms in continuous-state domains, in: NIPS, 2010. [19] B. Efron, Bootstrap methods: Another look at the jackknife, Vol. 7(1), 1979, pp. 1–26. [20] H. van Seijen, H. van Hasselt, S. Whiteson, M. A. Wiering, A theoretical and empirical analysis of expected sarsa, in: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2009, Nashville, TN, USA, March 31 - April 1, 2009, 2009, pp. 177–184. [21] C. Szepesv´ari, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2010. [22] C. J. C. H. Watkins, P. Dayan, Technical note q-learning, Machine Learning 8 (1992) 279–292. [23] H. P. van Hasselt, Double q-learning, in: NIPS, 2010. [24] G. Tesauro, Temporal difference learning and td-gammon, ICGA Journal 18 (2) (1995) 88. [25] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], 1999, pp. 1057–1063. [26] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992) 229–256. [27] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, T. Tanaka, Parametric return density estimation for reinforcement learning, in: UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, 2010, pp. 368– 375. [28] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, T. Tanaka, Nonparametric return distribution approximation for reinforcement learning, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010, pp. 799–806. [29] R. Bellman, Dynamic programming., Science 153 3731 (1957) 34–7. [30] M. Rowland, M. G. Bellemare, W. Dabney, R. Munos, Y. W. Teh, An analysis of categorical distributional reinforcement learning, in: AISTATS, 2018. [31] M. Rowland, R. Dadashi, S. Kumar, R. Munos, M. G. Bellemare, W. Dabney, Statistics and samples in distributional reinforcement learning, ArXiv abs/1902.08102. [32] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. A. Riedmiller, Deterministic policy gradient algorithms, in: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 2014, pp. 387–395. [33] G. Barth-Maron, M. W. Hoffman, D. 
Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, T. P. Lillicrap, Distributed distributional deterministic policy gradients, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. [34] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, D. Silver, Rainbow: Combining improvements in deep reinforcement learning, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 3215–3222.
A. Networks

In all experiments, we use feedforward networks to parameterize the policy. There are also methods that use recurrent networks to address partial observability, so we compare against three types of recurrent networks: Long Short-Term Memory (LSTM), vanilla Recurrent Neural Networks (RNN), and Gated Recurrent Units (GRU). Table 3 shows the results for three tasks: Ant-v2, Walker2d-v2 and HalfCheetah-v2. Table 3 reports the average maximum return and the standard deviation over five trials.
7-12, 2012, 2012, pp. 5026–5033. [53] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, Openai gym, CoRR abs/1606.01540. [54] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [55] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, P. Moritz, Trust region policy optimization, in: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 1889–1897. [56] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A. Y. Ng, On optimization methods for deep learning, in: ICML, 2011. [57] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980. [58] S. Ruder, An overview of gradient descent optimization algorithms, ArXiv abs/1609.04747.
Table 3: Average maximum return for different types of networks.
Method    HCheetah           Ant                Walker2d
FF        5198.1 ± 121.7     4995.6 ± 479.3     12217.7 ± 427.0
RNN       4300.2 ± 901.4     3888.1 ± 358.8     12197.0 ± 102.5
LSTM      3765.2 ± 907.5     4666.6 ± 757.9     11632.0 ± 682.8
GRU       4315.0 ± 475.9     4069.0 ± 276.5     11511.7 ± 241.3
It is clear that feedforward networks achieve the best performance on all three tasks. In our study, the tasks are fully observable, so recurrent networks are unnecessary and only increase the complexity of training. In addition, our method is trained with a replay buffer: transitions are sampled from the buffer at random, so the relationship between different batches is weak and the memory stored in recurrent networks is of little use during training. We therefore select feedforward networks. The full learning curves are shown in Figure 9.

B. Number of hidden nodes

The number of hidden nodes determines the number of parameters and the expressive capacity of the networks. In this paper, the policy networks have two hidden layers with 400 and 300 units. Here we compare against other hidden-layer sizes. Table 4 shows the results for three tasks (Ant-v2, Walker2d-v2 and HalfCheetah-v2), reporting the average maximum return and the standard deviation over five trials.
Table 4: Average maximum return for different numbers of hidden nodes.

Method     HCheetah            Ant                Walker2d
300-200    4587.2 ± 83.9       4183.1 ± 527.7     10834.3 ± 456.3
400-300    5198.1 ± 121.7      4995.6 ± 479.3     12217.7 ± 427.0
400-400    5204.7 ± 285.4      4830.1 ± 370.6     12026.2 ± 430.7
400-500    4802.7 ± 1063.1     4947.1 ± 782.7     12390.9 ± 207.9

From Table 4, the 300-200 configuration performs worst. Increasing the number of hidden nodes beyond 400-300 does not improve performance noticeably: 400-300, 400-400 and 400-500 achieve similar results. Considering both performance and the difficulty of training, we select 400 and 300 hidden nodes. The full learning curves are shown in Figure 9.
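As a rough indication of how these choices affect model size, the following sketch counts the actor parameters for the layer widths compared in Table 4. The observation and action dimensions used here (17 and 6, as in HalfCheetah-v2) are assumptions for illustration only.

    def actor_param_count(state_dim, action_dim, hidden):
        """Number of weights and biases in a fully connected actor."""
        dims = [state_dim, *hidden, action_dim]
        return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

    for hidden in [(300, 200), (400, 300), (400, 400), (400, 500)]:
        # HalfCheetah-v2: 17-dimensional observation, 6-dimensional action (assumed here)
        print(hidden, actor_param_count(17, 6, hidden))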
C. Optimizer

In this paper, we use Adam as the optimizer for the actor and critic networks. Here we compare it with three other popular optimizers: Adamax, RMSprop and SGD. Table 5 shows the results for three tasks (Ant-v2, Walker2d-v2 and HalfCheetah-v2), reporting the average maximum return and the standard deviation over five trials.
Table 5: Average maximum return for different optimizers.

Method     HCheetah           Ant                Walker2d
Adam       5198.1 ± 121.7     4995.6 ± 479.3     12217.7 ± 427.0
Adamax     3338.4 ± 952.3     4663.0 ± 680.0     9785.3 ± 285.3
RMSprop    2476.6 ± 227.5     3308.2 ± 371.7     10669.5 ± 606.8
SGD        1339.3 ± 523.3     2023.2 ± 245.4     7898.3 ± 717.8
Table 5 indicates that Adam achieves better performance than the three other optimizers. SGD is easy to implement but makes it difficult to find a good learning rate, and it is not well suited to the non-stationary data encountered in reinforcement learning [56]. RMSprop [57] is closely related to Adam, but there are a few differences: RMSprop generates its parameter updates using momentum on the rescaled gradient, whereas Adam updates are estimated directly from running averages of the first and second moments of the gradient. Furthermore, RMSprop lacks a bias-correction term, which can lead to very large step sizes and often divergence [57, 58]. Adamax is a variant of Adam based on the infinity norm; it converges to a more stable value but can easily get stuck in suboptimal solutions [57]. Adam combines the advantages of AdaGrad and RMSprop: the ability of AdaGrad to deal with sparse gradients and the ability of RMSprop to deal with non-stationary objectives [57]. Compared with Adamax, Adam is well suited to a wider range of optimization problems. We therefore select the Adam optimizer. The full learning curves are shown in Figure 9.
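To make this comparison concrete, the following sketch spells out the Adam update with bias correction next to a plain RMSprop update. It is a simplified illustration of the update rules discussed above, not the implementation used in our experiments; the momentum term sometimes added to RMSprop is omitted, and the default step size is the critic learning rate from Table 8.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=2.5e-4, b1=0.9, b2=0.999, eps=1e-8):
        """One Adam update: running first/second moments with bias correction."""
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)        # bias correction keeps early step sizes bounded
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    def rmsprop_step(theta, grad, v, lr=2.5e-4, rho=0.99, eps=1e-8):
        """RMSprop rescales the raw gradient by a running second moment, with no bias correction."""
        v = rho * v + (1 - rho) * grad ** 2
        theta = theta - lr * grad / (np.sqrt(v) + eps)
        return theta, v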
D. Mini-batch size

In this paper, we select a mini-batch size of 100. The mini-batch size plays an important role in reinforcement learning: with a large mini-batch, the estimated gradient is stable and reflects all the transitions in the batch, whereas with a small mini-batch the estimated gradient is noisy and only valid for some transitions. A larger mini-batch size therefore typically leads to better performance. To justify that our selection of 100 transitions is appropriate, we compare five mini-batch sizes: 50, 100, 200, 300 and 400. Table 6 shows the results for three tasks (Ant-v2, Walker2d-v2 and HalfCheetah-v2), reporting the average maximum return and the standard deviation over five trials.
Table 6: Average maximum return for different sizes of mini-batch.

Method    HCheetah           Ant                Walker2d
50        4316.5 ± 707.9     4129.2 ± 355.8     10957.2 ± 385.0
100       5198.1 ± 121.7     4995.6 ± 479.3     12217.7 ± 427.0
200       4713.3 ± 211.3     4700.0 ± 707.6     11775.8 ± 616.1
300       5076.7 ± 608.8     5084.2 ± 623.2     12203.7 ± 372.6
400       4861.4 ± 715.0     5061.5 ± 476.4     11740.8 ± 351.6
Table 6 indicates that a small mini-batch size (50) cannot achieve good performance, while 100 transitions performs well on all three tasks. Although a larger mini-batch is helpful for training, the improvement becomes marginal as the mini-batch size is increased further: averaging over more transitions smooths away the differences between them, so individual gradient directions tend to cancel out, and a larger mini-batch also requires more computing resources. Therefore, we select 100 as our mini-batch size. The full learning curves are shown in Figure 9.
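For reference, mini-batch sampling from the replay buffer can be sketched as follows. This is a minimal sketch with illustrative names, assuming transitions are stored as (state, action, reward, next state, done) tuples; the buffer capacity and batch size are the values in Table 8.

    import random

    class ReplayBuffer:
        """Fixed-capacity buffer of (s, a, r, s_next, done) transitions."""
        def __init__(self, capacity=int(1e6)):            # replay buffer size as in Table 8
            self.capacity = capacity
            self.storage = []
            self.position = 0

        def add(self, transition):
            if len(self.storage) < self.capacity:
                self.storage.append(transition)
            else:
                self.storage[self.position] = transition  # overwrite the oldest entry
            self.position = (self.position + 1) % self.capacity

        def sample(self, batch_size=100):                 # mini-batch size selected above
            # Uniform sampling keeps successive mini-batches nearly uncorrelated.
            return random.sample(self.storage, batch_size)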
E. Learning rate

In this paper, the learning rate we choose is 0.00025. The learning rate is an important hyper parameter for our method: a rate that is too low slows convergence, while one that is too high overshoots the optimal solution. Here we evaluate seven learning rates ranging from 0.01 to 0.00003. Table 7 shows the results for three tasks (Ant-v2, Walker2d-v2 and HalfCheetah-v2), reporting the average maximum return and the standard deviation over five trials.
Table 7: Average maximum return for different learning rates.

Method     HCheetah            Ant               Walker2d
0.01       -1892.3 ± 1250.7    869.3 ± 542.1     -289.7 ± 204.4
0.001      2465.4 ± 273.8      3234.9 ± 637.7    10618.6 ± 218.4
0.0003     3719.6 ± 379.5      4325.2 ± 627.4    12433.2 ± 283.9
0.00025    5198.1 ± 121.7      4995.6 ± 479.3    12217.7 ± 427.0
0.0002     5047.7 ± 741.5      4692.4 ± 334.6    11183.8 ± 1386.1
0.0001     5037.4 ± 201.1      3763.6 ± 458.1    10662.6 ± 648.8
0.00003    3029.1 ± 995.7      2423.5 ± 142.4    5506.9 ± 416.3

As is evident, neither a very high nor a very low learning rate achieves good performance. Considering both the average return and the standard deviation, we select a learning rate of 0.00025. The full learning curves are shown in Figure 9.
There are also some other critical hyper parameters, such as the exploration noise, the size of the replay buffer, the discount factor and the activation function. These hyper parameters are set based on experimental results and experience. The exploration noise is set to 0.1: higher noise causes the agent to explore the environment more but increases the training time, whereas lower noise causes the agent to get stuck in a suboptimal policy. We set the size of the replay buffer to 1e6. The replay buffer size determines the total number of saved transitions: a smaller buffer loses diversity between transitions, while a larger buffer consumes more memory. All hyper parameters for our algorithm are shown in Table 8.
Table 8: Hyper parameter values.

Hyper Parameter                Value
Discount factor                0.99
Number of hidden nodes         (400, 300)
Activation function            (ReLU, ReLU, Tanh)
Batch size                     100
Learning rate for critic       2.5e-4
Learning rate for actor        1e-4
Noise clip                     0.5
Optimizer                      Adam
Target network update rate     0.005
Exploration noise              0.1
Evaluation frequency           5000
Replay buffer size             1e6
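The values in Table 8 can be collected into a single configuration object; the sketch below does so and shows how exploration noise is commonly applied to the actor's output. The names are illustrative and the Gaussian noise model is an assumption, but the numeric values are those of Table 8.

    import numpy as np

    CONFIG = {
        "discount": 0.99,
        "hidden_nodes": (400, 300),
        "batch_size": 100,
        "critic_lr": 2.5e-4,
        "actor_lr": 1e-4,
        "noise_clip": 0.5,
        "target_update_rate": 0.005,
        "exploration_noise": 0.1,
        "eval_frequency": 5000,
        "replay_buffer_size": int(1e6),
    }

    def explore(action, noise_std=CONFIG["exploration_noise"]):
        """Add Gaussian exploration noise and keep the action inside [-1, 1]."""
        return np.clip(action + noise_std * np.random.randn(*action.shape), -1.0, 1.0)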
Figure 9: Learning curves for different hyper parameters.
AUTHOR DECLARATION

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].

Signed by all authors as follows:
Yingnan Zhao, Peng Liu, Chenjia Bai, Wei Zhao, and Xianglong Tang 2019/07/3
CRediT author statement
Yingnan Zhao performed Writing - Original Draft, Conceptualization, Methodology and Software. Peng Liu performed Writing - Review & Editing, Resources, Methodology and Supervision. Chenjia Bai performed Validation, Methodology and Writing - Review & Editing. Wei Zhao performed Supervision and Writing - Review & Editing. Xianglong Tang performed Funding acquisition.
Biographies
Yingnan Zhao
Peng Liu
Chenjia Bai
Wei Zhao
Xianglong Tang