Additional planning with multiple objectives for reinforcement learning

Anqi Pan a,b, Wenjun Xu c,d, Lei Wang e, Hongliang Ren c,*

a College of Information Sciences and Technology, Donghua University, 201620, China
b Engineering Research Center of Digitized Textile & Fashion Technology, Ministry of Education, Donghua University, 201620, China
c Department of Biomedical Engineering, National University of Singapore, 117583, Singapore
d Peng Cheng Laboratory, Shenzhen, 518055, China
e School of Electronics and Information Engineering, Tongji University, 201804, China

* Corresponding author. E-mail address: [email protected] (H. Ren).
Article history: Received 20 December 2018; Received in revised form 12 December 2019; Accepted 13 December 2019; Available online xxxx.

Keywords: Reinforcement learning; Multi-objective; Robotic control
Abstract

Most control tasks have multiple objectives that need to be achieved simultaneously, while the reward is usually defined as a weighted combination of all objectives to determine one optimal policy. This configuration limits exploration flexibility and makes it difficult to reach a satisfactory termination condition. Although several multi-objective reinforcement learning (MORL) methods have been presented recently, they concentrate on obtaining a set of compromise options rather than one best-performing strategy. On the other hand, existing policy-improvement methods have rarely emphasized circumstances with multiple objectives. Inspired by enhanced policy search methods, an additional planning technique with multiple objectives for reinforcement learning, denoted RLAP-MOP, is proposed in this paper. This method provides opportunities to evaluate parallel requirements at the same time and suggests several optimal feasible actions to further improve long-term performance. Meanwhile, the short-term planning adopted in this paper has advantages in maintaining safe trajectories and building more accurate approximate models, which helps accelerate training. For comparison, an RLAP with single-objective optimization is also introduced in the theoretical and experimental studies. The proposed techniques are investigated on a multi-objective cartpole environment and a soft robotic palpation task. The improved return values and learning stability show that additional planning based on multiple objectives is a promising assistant for improving reinforcement learning.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

Reinforcement learning is a powerful technique that can generate controllers for manipulators and agents whose mathematical models are unknown. However, it sometimes fails to learn continuous-action tasks and generally requires extensive training data and computational resources [1,2]. Policy search methods are commonly adopted in agent learning procedures. Policy search techniques can be classified into three categories: policy gradient methods, derivative-free optimization, and enhanced policy search methods. The policy gradient method, which is also regarded as policy iteration, alternately estimates the value function under the current policy and
improves the policy according to the expected cost gradient [3,4]; derivative-free optimization treats the return value as a black-box function and optimizes the policy parameters with stochastic algorithms [5,6]; enhanced policy search methods utilize gathered information, such as particular states [7] and optimized trajectories [8], to guarantee policy improvement and accelerate the policy-updating procedure. The three branches of policy search strategies have advantages and shortcomings in different aspects. Gradient-free optimization, which often adopts stochastic algorithms such as the natural evolution strategy (NES) [5], the cross-entropy method (CEM) [9] and covariance matrix adaptation (CMA) [10], has achieved competitive performance in low-dimensional continuous control problems. These methods can guarantee policy improvement by estimating and compromising among different directions. However, such stochastic approaches are of limited utility in continuous state and action spaces because of their great computational expense in policy updating and evaluation, and their accuracy decreases as the control dimension grows. In contrast, the derivative-based approach has better training efficiency in value
approximations containing a larger number of parameters, and the policy gradient technique with explicit policy functions provides convenience for policy evaluation and closed-loop updating. Among the gradient-based methods, the actor-critic approach has shown competitive learning performance and has been applied to continuous problems; a representative and popular method is the Deep Deterministic Policy Gradient (DDPG) [11]. However, gradient-based methods are prone to local optima and their results vary considerably with the initial parameters. Meanwhile, inaccurate function approximations may mislead the policy search process. Remarkably, enhanced policy search methods integrate the above two techniques: they first build several desired policies by minimizing structured costs determined by several desired performances [12], and then use the obtained policies to assist policy learning. These methods have proved to have competitive learning efficiency as well as the ability to avoid local optima. Nevertheless, model-based trajectory optimization approaches, such as iterative LQG [13], differential dynamic programming (DDP) [14] and Monte Carlo tree search (MCTS) [15], are still computationally expensive.

Most applications have multiple requirements that need to be pursued simultaneously during operation [16]. Such systems encounter difficulties in defining the proportion among different aspects and the termination condition. In most circumstances, the objectives are combined and simplified into one objective to evaluate the system reward [17], while the success condition must still satisfy multiple requirements synchronously. According to the concept of reinforcement learning, the system is trained to achieve a better overall score contributed by the weighted objectives, whose proportions are predefined at initialization. As a consequence, the learning may fall into bottleneck situations that are difficult to break through with a fixed reward formulation, or end with unqualified policies because a high distal reward can conceal some feasible improvements and deteriorate the exploration process. In summary, training is extremely difficult when both combined requirements and rigorous success conditions are considered, especially in high-dimensional action or state spaces.

Some studies have presented strategies denoted as multi-objective reinforcement learning (MORL), where multiple non-dominated policies are generated, each emphasizing different requirements [18,19]. However, the high computational cost and the policy selection among Pareto elite solutions raise new problems in practical applications. Other studies have adopted a hierarchical reinforcement approach [20] and assigned the objectives to subtasks with different priorities, yet the operation time increases because of the sequential processes and some feasible solutions may be lost. Considering that safe-region operation is another requirement while completing a task, several references have built extra constraint models to keep the agent exploring within safe regions [21-23] and to find the closest actions that satisfy the constraints. These studies have achieved good learning performance in constrained environments, but they require extra computational resources to form an accurate constraint model.

Considering the problems above, an additional planning strategy is introduced in this paper to assist reinforcement learning.
The novel technique enables the learner to focus not only on the long-term rewards, but also to pay attention to the current short-term conditions by planning an optimized move. Meanwhile, the approach, in which objectives can be considered simultaneously and separately in short-term planning, realizes a direct and comprehensive optimization among all requirements and provides alternative solutions to pursue better overall performance. The method introduced here is denoted as additional planning with multiple objectives for reinforcement learning (RLAP-MOP), which belongs to the enhanced policy search category. Different from the existing enhanced policy search techniques, which
calculate the desired policies or trajectories based only on the long-term discounted reward, our approach additionally adopts short-term planning for action exploration during training, which has advantages in keeping the controller in the safe region as well as in seeking competitive return values. Furthermore, short-term planning presents an opportunity to evaluate multiple objectives simultaneously and suggests several optimal alternatives to improve long-term optimization in action determination. Meanwhile, different from existing MORL methods, the proposed RLAP generates one policy by considering requirements from multiple aspects, which introduces a novel way to exert the effects of multiple objectives on the policy update. Moreover, different from existing safe-learning methods, the proposed technique does not rely on extra constraint models, but adopts multi-objective optimization to find the most appropriate compromise actions. Therefore, the proposed method has advantages in a simpler learning procedure and more flexible exploration. The contributions are detailed in the following aspects.

(1) An RLAP strategy is proposed which adopts short-term information for action determination.
(2) The planning mechanism is embedded in the DDPG method to build one deterministic policy, while the enhanced policy search technique replaces the policy gradient module.
(3) Two action generators are included, namely noise-based exploration and derivative-free optimization. A decision module is added to determine which generator is selected.
(4) Two planning approaches are introduced and compared in RLAP, namely weighted objective planning (SOP) and multiple objective planning (MOP).
(5) A mixture policy search technique containing both the long-term policy gradient and short-term planning is applied to build the deterministic policy.
(6) The proposed algorithms are tested on two tasks, cartpole balancing and robotic control, both of which include multiple requirements. The experimental results prove the superiority of planning with multiple objectives.

The remainder of this paper is organized as follows. The preliminaries of policy search and multi-objective optimization are introduced in Section 2. The proposed RLAP algorithm is detailed in Section 3. Then, the experiment descriptions and comparison results are presented in Section 4. Further discussion and the conclusion are given in Sections 5 and 6, respectively.

2. Related work

2.1. Deterministic policy gradient algorithm

Reinforcement learning (RL) aims to find optimal policies for a Markov decision process (MDP), where the dynamics given by the transition probabilities p(s_{t+1} | s_t, a_t) are usually unknown in physical systems such as robots. Under this circumstance, approximations of the value function and action-value function are preferred to evaluate the sampled policies and organize the search procedure. Learning based on function approximation is also referred to as the value iteration approach. A policy π(a|s) is defined as the mapping from state space S and action space A to the probability of taking action a in state s, where s ∈ S, a ∈ A. The reward function is r(s_t, a_t): S × A → R. The state visitation distribution for policy π is ρ^π. The expected return value when the agent takes action a_t
Please cite this article as: A. Pan, W. Xu, L. Wang et al., Additional planning with multiple objectives for reinforcement learning, Knowledge-Based Systems (2019) 105392, https://doi.org/10.1016/j.knosys.2019.105392.
A. Pan, W. Xu, L. Wang et al. / Knowledge-Based Systems xxx (xxxx) xxx
in state s_t under a policy π, denoted Q^π(s_t, a_t), is defined as follows:

Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{i>t} \sim \rho^{\pi},\, a_{i>t} \sim \pi}\Big[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i, a_i)\Big]    (1)

where γ ∈ [0, 1] is a discounting factor. According to the Bellman equation, the action-value function can be calculated in a recursive manner as:

Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim \rho^{\pi},\, a_{t+1} \sim \pi}\big[Q^{\pi}(s_{t+1}, a_{t+1})\big]    (2)
Q-learning is a commonly used off-policy algorithm which adopts the greedy policy µ_g(s) = arg max_a Q(s, a). However, the greedy strategy becomes time-consuming in continuous action spaces and cannot be employed directly [24]. The actor-critic architecture is another RL method which maintains a parameterized action function to map states to specific actions; it has been widely studied for continuous action problems. The actor-critic framework adopts temporal-difference (TD) learning, which combines Monte Carlo (MC) and dynamic programming (DP) ideas, and learns from raw experience without a dynamic model of the environment according to value iteration [25,26]. It consists of two separate structures to represent the policy and the value function: the critic updates the projection of the value function according to the observations in the batch experience, while the actor updates the policy in approximate gradient directions guided by the value approximation. A deterministic policy µ with respect to the parameter θ, denoted µ_θ: S → A, can be used in the actor-critic framework, for which the recursive equation can be simplified as [11]:
Q^{\mu}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim \rho^{\mu}}\big[Q^{\mu}(s_{t+1}, \mu(s_{t+1}))\big]    (3)
The critic is formulated by a differentiable action-value function with a parameterized functional form with respect to the parameter ω, which is substituted for the true action-value function, Q^ω(s, a) ≈ Q^µ(s, a). The general gradient-descent update for the critic under off-policy learning is
\omega_{t+1} = \omega_t + \alpha_{\omega}\, \delta_t\, \nabla_{\omega} Q^{\omega}(s_t, a_t)    (4)

where α_ω is the learning rate and δ_t is the TD error,

\delta_t = r(s_t, a_t) + \gamma\, Q^{\omega}(s_{t+1}, \mu(s_{t+1})) - Q^{\omega}(s_t, a_t)    (5)
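As a concrete illustration of the critic update in Eqs. (4)-(5), the following minimal Python sketch performs one TD step for a linear action-value approximator; the feature map, placeholder policy, and dimensions are illustrative assumptions rather than the networks used in this paper.

```python
import numpy as np

def phi(s, a):
    """Joint state-action feature vector (hypothetical quadratic features)."""
    x = np.concatenate([s, a])
    return np.concatenate([x, x ** 2, [1.0]])

def td_critic_update(w, s, a, r, s_next, mu, alpha_w=1e-3, gamma=0.99):
    """One gradient step on the TD error (Eq. 5) applied to the critic weights (Eq. 4)."""
    a_next = mu(s_next)                       # deterministic policy at the next state
    delta = r + gamma * (w @ phi(s_next, a_next)) - (w @ phi(s, a))   # TD error, Eq. (5)
    return w + alpha_w * delta * phi(s, a)    # grad_w Q_w = phi(s, a) for a linear critic

# Toy usage with a fixed placeholder policy.
s_dim, a_dim = 4, 1
w = np.zeros(2 * (s_dim + a_dim) + 1)
mu = lambda s: np.tanh(s[:a_dim])
s, a, r, s_next = np.ones(s_dim), np.zeros(a_dim), 1.0, 0.9 * np.ones(s_dim)
w = td_critic_update(w, s, a, r, s_next, mu)
```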
The agent's goal is to obtain a policy which maximizes the discounted reward from the start state, so the objective is defined as J(π) = Q^π(s, a). For a deterministic policy, the objective can be rewritten as J(µ) = Q^µ(s, a). When Q is learned off-policy with a different stochastic behavior policy β, the direction of the parameterized policy gradient ∇_θ J(µ_θ) can be defined as
\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\beta}}\big[\nabla_a Q^{\mu}(s, a)|_{a=\mu_{\theta}(s)}\, \nabla_{\theta} \mu_{\theta}(s)\big]    (6)

Since Q^ω(s, a) is substituted to approximate Q^µ(s, a), the actor is updated by adjusting the policy parameter θ according to

\theta_{t+1} = \theta_t - \alpha_{\theta}\, \nabla_a Q^{\omega}(s_t, a_t)|_{a=\mu_{\theta}(s)}\, \nabla_{\theta} \mu_{\theta}(s_t)    (7)

where α_θ is the learning rate. To improve the stability of the actor-critic improvement, soft targets are introduced in the Deep Deterministic Policy Gradient (DDPG) framework, and the structures are updated by sampling a minibatch randomly from the replay buffer.

2.2. Improvement in policy search

To improve the search efficiency and guarantee improvement for gradient-based policies, notable studies add constraints while optimizing the policy parameter, based on the following structure:

\max_{\theta} J(\mu_{\theta}) \quad \text{subject to} \quad C(\theta) \leq \epsilon    (8)

where ε is a small real number that builds the constraint inequality, which represents the minimization of C(θ) in a robust way [7]. For example, guided policy search (GPS), which combines policy learning with a supervised learning problem, can be viewed as transforming a collection of trajectories p into a parameterized policy µ_θ [8]. The parameter θ of the policy µ is optimized to minimize both the expected cost L = −J and the distance from the policy µ_θ to the marginal distribution of the trajectories p. Therefore, for GPS, the constraint is defined as C(θ) = D_KL(µ_θ ∥ p).

2.3. Strategies for tasks with multiple objectives

For applications characterized by multiple objectives, the field of multi-objective reinforcement learning (MORL), which includes multiple-policy approaches [27] and single-policy approaches, has been defined and studied. In multiple-policy studies, optimality is replaced by a set of Pareto solutions representing different compromises among conflicting objectives [28]. However, these approaches are time-consuming because several policies must be evaluated at the same time [19], and entirely separate rewards may result in unintended behavior in the control procedure. In contrast, single-policy algorithms make a trade-off among all objectives by defining different emphasis proportions or prioritization degrees [29], so that the computational cost is relieved, while difficulties in reward definition arise. Different from the two approaches above, the additional planning proposed in this paper balances the requirements in short-term planning, and subsequently picks the optimized behaviors according to the long-term estimated performance, so that multiple objectives exert effects on the policy update. It is worth mentioning that a primary objective is also selected to emphasize the training purpose, and its corresponding primary reward r_1 is adopted to help determine the executed action. Different from existing studies which use a single optimization objective, we first adopt multi-objective optimization and long-term expectation to define some candidate solutions, then determine the executed solution based on the primary objective. This difference guarantees the adequacy of exploration as well as maintaining the purpose of the task.

3. Reinforcement learning with additional planning

3.1. Short-term planning
Most reinforcement learning approaches are designed to pursue the expected discounted reward. For a trajectory s_0, s_1, ..., the discounted reward is calculated as

J^{*} = \sum_{i=0}^{\infty} \gamma^{i} r(s_i, \mu(s_i)) = \sum_{i=0}^{T_1} \gamma^{i} r(s_i, \mu(s_i))    (9)

where s_{T_1+1} ∉ S. It can be seen that, no matter whether the agent has finished the task, when s_{T_1+1} exceeds the state space boundary, T_1 becomes the termination time node of the trajectory. Note that the evaluation based on an approximate value function has low accuracy when the training data are scarce, so the state will cross the boundary frequently. Therefore, improvement is difficult when using only policy gradient methods and optimization based on long-term programming.
We introduce a short-term planning approach with lower complexity and better directionality to assist the policy update and the action exploration. Considering that the significance of long-term information to the environment decreases exponentially fast, some studies have also adopted short-term memory in partially observable environments [30]. The difference in our paper is that a deterministic policy is built by combining both long- and short-term information. The additional planning is added before and close to the terminal condition. A parameterized reward estimation function R_ν(s, a) obtained by logistic regression is utilized for the short-term decision, where R = (R_1, ..., R_i, ..., R_m) and m is the number of objectives. The optimal actions can be obtained by solving the optimization problem max f(x), where

f(x) = R_{\nu}(s, x)    (10)
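As an illustration of how such a multi-output reward model R_ν(s, a) = (R_1, ..., R_m) might be fitted from sampled transitions, the following Python sketch uses one sigmoid-output linear regressor per objective, trained with the squared loss that later appears in Algorithm 3; the feature map, sizes, and learning rate are hypothetical.

```python
import numpy as np

def features(s, a):
    """Hypothetical joint feature vector with a bias term."""
    return np.concatenate([s, a, [1.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_rewards(V, s, a):
    """One weight row per objective; outputs are squashed to (0, 1)."""
    return sigmoid(V @ features(s, a))

def fit_reward_model(V, batch, lr=1e-2):
    """One SGD pass minimizing sum_i (R_i(s, a) - r_i)^2 over a minibatch."""
    for s, a, r_vec in batch:                 # r_vec holds the m observed rewards
        x = features(s, a)
        pred = sigmoid(V @ x)
        err = pred - r_vec
        grad = (err * pred * (1.0 - pred))[:, None] * x[None, :]
        V -= lr * grad
    return V

# Toy usage: m = 2 objectives, 4-dimensional state, 1-dimensional action.
m, s_dim, a_dim = 2, 4, 1
V = np.zeros((m, s_dim + a_dim + 1))
batch = [(np.random.randn(s_dim), np.random.randn(a_dim), np.random.rand(m))
         for _ in range(32)]
V = fit_reward_model(V, batch)
```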
Considering that the multiple objectives result in multiple reward formulations, two optimization approaches are presented in this section and further compared in the experiment section. We select the evolutionary approach to solve the optimizations. The first one is the commonly used approach, which aggregates all objectives into one fitness function with a predefined weight vector w, and then solves the optimization according to Algorithm 1. Since, during the learning procedure, all the parameters s, a, and ν are variables of the approximate value R, the approximate model R is recorded as R(s, a, ν). Similar notations are also employed for the approximate models Q and µ.

Algorithm 1 Short-term SOP planning
Input: The population size N, the current state s_c, the current parameter ν_c, the fitness f(x) = −Σ_i w_i R_i(s_c, x, ν_c)
Output: The optimal solution x̂
1: Create the initial population P_0;
2: t = 0;
3: while computational budget is not exhausted do
4:   Q_t = Recombination + Mutation(P_t);
5:   P_t = P_t ∪ Q_t;
6:   P_{t+1} = Selection(P_t);
7:   t = t + 1;
8: end while
9: x̂ = arg min(f(x) | x ∈ P_{t+1});
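The sketch below illustrates Algorithm 1 in Python with a simple truncation-selection evolutionary loop over candidate actions; the reward model, action bounds, mutation scale, and budget are placeholders, and any off-the-shelf single-objective evolutionary optimizer could be substituted.

```python
import numpy as np

def sop_planning(reward_model, s_c, weights, a_dim, a_low, a_high,
                 pop_size=50, generations=20, sigma=0.1, rng=None):
    """Weighted-sum short-term planning: minimize -sum_i w_i R_i(s_c, a)."""
    rng = rng or np.random.default_rng(0)
    fitness = lambda a: -np.dot(weights, reward_model(s_c, a))

    pop = rng.uniform(a_low, a_high, size=(pop_size, a_dim))     # initial population P0
    for _ in range(generations):                                 # budget loop
        parents = pop[rng.integers(pop_size, size=pop_size)]
        offspring = np.clip(parents + sigma * rng.standard_normal(parents.shape),
                            a_low, a_high)                       # recombination + mutation
        union = np.vstack([pop, offspring])                      # P_t union Q_t
        scores = np.array([fitness(a) for a in union])
        pop = union[np.argsort(scores)[:pop_size]]               # truncation selection
    return pop[0]                                                # best action found

# Toy usage with a synthetic two-objective reward model.
toy_model = lambda s, a: np.array([np.exp(-np.sum((a - 0.3) ** 2)),
                                   np.exp(-np.sum((a + 0.1) ** 2))])
a_hat = sop_planning(toy_model, s_c=np.zeros(4), weights=np.array([0.7, 0.3]),
                     a_dim=1, a_low=-1.0, a_high=1.0)
```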
The second approach adopts the multi-objective optimization algorithm NSGA-II [31], which can provide several feasible solutions for action selection. During the multi-objective optimization, the fitness of each solution is formed as a vector that represents the multiple rewards. Instead of comparing the fitness values directly, the dominance relation is adopted to determine which solution is better. In a minimization problem, a vector g = (g_1, g_2, ..., g_D)^T is said to dominate k = (k_1, k_2, ..., k_D)^T if and only if g is no greater than k in every element and strictly less in at least one, i.e., ∀i ∈ {1, 2, ..., D}, g_i ≤ k_i and ∃i ∈ {1, 2, ..., D}, g_i < k_i. Since the optimum is not unique under multiple objectives, the multi-objective approach provides several noninferior solutions. These solutions go through further selection procedures based on the long-term expectation and the primary reward value to determine the executed action. The multi-objective action planning strategy is detailed in the following steps:

STEP 1. Solve the multi-objective optimization in (10) and find the Pareto set P = {a | a = arg min(−R(s_c, a, ν_c))}.
STEP 2. Estimate the solutions in P using the approximated Q function. A set L of size n is built to collect the solutions from P with the higher approximate Q-values. This procedure is designed to maintain the balance between the short-term reward and the long-term expectation.
STEP 3. Choose the solution with the highest primary fitness value r_1 from L as the output action.

The multi-objective approach designed for RLAP is detailed in Algorithm 2.

Algorithm 2 Short-term MOP planning
Input: The population size N, the current state s_c, the current parameters ν_c and ω_c, the fitness f(x) = −R(s_c, x, ν_c), the constraint c(x) = −Q(s_c, x, ω_c)
Output: The selected solution x̂
1: Create the initial population P_0;
2: t = 0;
3: while computational budget is not exhausted do
4:   Q_t = Recombination + Mutation(P_t);
5:   P_t = P_t ∪ Q_t;
6:   P_{t+1} = Selection(P_t);
7:   t = t + 1;
8: end while
9: Sort {x_1, x_2, ..., x_N} ∈ P_{t+1} w.r.t. {c(x_1), c(x_2), ..., c(x_N)};
10: L = {x_1, x_2, ..., x_n}, n ≤ N;
11: x̂ = arg max(r_1(x), x ∈ L);
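A minimal Python sketch of the selection stage of Algorithm 2 is given below; it assumes the candidate actions have already been produced (e.g., by NSGA-II) and then applies STEPs 1-3: Pareto filtering on the approximated rewards, retaining the n candidates with the highest approximate Q-values, and returning the one with the best primary reward. The reward and Q models here are toy stand-ins.

```python
import numpy as np

def pareto_set(reward_matrix):
    """Rows are reward vectors R(s, a) per candidate; keep non-dominated rows (maximization)."""
    keep = []
    for i, ri in enumerate(reward_matrix):
        dominated = any(np.all(rj >= ri) and np.any(rj > ri)
                        for j, rj in enumerate(reward_matrix) if j != i)
        if not dominated:
            keep.append(i)
    return keep

def mop_select(actions, reward_model, q_model, s_c, n=5):
    rewards = np.array([reward_model(s_c, a) for a in actions])
    front = pareto_set(rewards)                                 # STEP 1
    q_vals = np.array([q_model(s_c, actions[i]) for i in front])
    elite = [front[i] for i in np.argsort(-q_vals)[:n]]         # STEP 2
    r1 = [rewards[i][0] for i in elite]                         # primary objective
    return actions[elite[int(np.argmax(r1))]]                   # STEP 3

# Toy usage with synthetic candidates and models.
rng = np.random.default_rng(0)
cands = rng.uniform(-1, 1, size=(50, 1))
toy_R = lambda s, a: np.array([np.exp(-np.sum((a - 0.3) ** 2)),
                               np.exp(-np.sum((a + 0.1) ** 2))])
toy_Q = lambda s, a: -np.sum(a ** 2)
a_exec = mop_select(cands, toy_R, toy_Q, s_c=np.zeros(4))
```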
3.2. Update the parameterized policies

Assume the actions optimized by short-term planning and the corresponding operated states are stored in A′ and S′, respectively. The mean squared deviation (MSD) is adopted to quantify the distance between the H sampling points from short-term planning and the expected policy µ. To build a mixture policy with both the long-term policy gradient and the short-term planning, the objective is replaced by optimizing the collocated requirements, given by

\max_{\theta} J(\mu_{\theta}), \qquad \min_{\theta} \mathrm{MSD}(\mu_{\theta}(s), a), \qquad a \in A',\ s \in S'    (11)

A proportion coefficient λ is adopted to balance the proportion between the two parts. According to the derivation from (7), the parameter θ can be updated by

\theta_{t+1} = \theta_t + \alpha_{\theta}\Big[-\nabla_a Q^{\omega}(s_t, a_t)|_{a=\mu_{\theta}(s)} + \lambda \cdot \frac{1}{H}\sum_{a_i \in A',\, s_i \in S'}\big(\mu_{\theta}(s_i) - a_i\big)\Big]\nabla_{\theta}\mu_{\theta}(s_t)    (12)
λ is designed to be 1 at initialization and is adapted by increasing it by λ_c percent in every training episode, where λ_c is a pre-defined constant to fine-tune the proportion. As training proceeds, the quality of the optimized actions improves; therefore, only the latest transitions are employed in the policy update.

3.3. Practical algorithm

The additional planning is integrated with DDPG, which includes a critic network and an actor network. The critic network is updated by sampling a minibatch uniformly from the buffer according to TD learning, while the actor network is updated both by applying the chain rule to the expected return and by minimizing the displacement to the optimized actions. During the critic network updates, the reward value for calculating the discounted return is defined as the summation of all rewards. The framework of RLAP is illustrated in Fig. 1. The blue blocks and the corresponding connected flow in the framework belong to DDPG. Two action generators, the deterministic policy and the policy-free optimization, are included in the exploration. Under nonspecific situations, actions are calculated from the deterministic policy with Gaussian noise, so the exploration is realized by a random walk following the gradient direction.
Fig. 1. The diagram of RLAP. Two action generators, the deterministic policy and the policy-free optimization, are included in the framework. The decision module decides which generator is activated. The policy network is updated by the collocated requirement consisting of the approximate return J(π) from the critic network and the loss value (MSD) from the gradient-free planning.
When states are observed close to the state boundary, the policy-free optimization is activated to plan the optimal move. The optimized actions are stored in a buffer for the policy improvement operation. The pseudocode of reinforcement learning with additional planning is detailed in Algorithm 3. According to the two different approaches introduced in Section 3.1, two RLAP algorithms, RLAP-SOP and RLAP-MOP, can be generated from the algorithm. The dangerous region in the pseudocode is defined as the region that exceeds 80% of the state boundary.

Algorithm 3 Reinforcement learning with additional planning (RLAP)
1: Randomly initialize the critic network Q(s, a, ω) and the actor network µ(s, θ).
2: Initialize the target networks Q′ and µ′.
3: Randomly initialize the reward approximation network R(s, a, ν).
4: Initialize buffer B = ∅.
5: t = 1.
6: while computational budget is not exhausted do
7:   Initialize observation state s_1.
8:   while s_t ∈ S do
9:     if s_t is in the dangerous region then
10:      Perform the short-term planning according to Section 3.1.
11:    else
12:      Set action a_t = µ(s_t, θ) + N_t.
13:    end if
14:    Execute action a_t and observe reward r_t and state s_{t+1}.
15:    Add transition (a_t, s_t, r_t, s_{t+1}) to B.
16:    Sample N random transitions from B to build a minibatch: B1 ← (a_i, s_i, s_{i+1}, r_i), i ∈ [1, N].
17:    Calculate the TD error according to (5) using the minibatch B1.
18:    Update the critic parameter ω according to (4).
19:    Sample the H newest optimized transitions from B to build a minibatch: B2 ← (a_i, s_i), i ∈ [1, H].
20:    Update the actor parameter θ according to (12) using the minibatch B2.
21:    Update the reward approximator R_ν by minimizing the loss L_r = (1/N) Σ_{i=1}^{N} (R(s_i, a_i, ν) − r_i)^2, (a_i, s_i, r_i) ∈ B1.
22:    Update the target networks Q′ and µ′.
23:    t = t + 1.
24:  end while
25: end while
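The following Python skeleton sketches one RLAP interaction step built from the pieces above: the decision module switches to the planner in the dangerous region (taken here as 80% of the state bound), and the actor follows the mixed objective of Eq. (11), i.e., maximizing the approximate Q while staying close to the planned actions. The linear policy, toy critic, and finite-difference update are illustrative stand-ins for the networks and the analytic gradient of Eq. (12).

```python
import numpy as np

def select_action(theta, mu, planner, s, bound, noise_scale=0.1, rng=None):
    """Decision module: plan in the dangerous region, otherwise noisy policy action."""
    rng = rng or np.random.default_rng(0)
    if np.any(np.abs(s) > 0.8 * bound):
        return planner(s), True
    a = mu(theta, s)
    return a + noise_scale * rng.standard_normal(a.shape), False

def mixed_actor_loss(theta, mu, q_model, s_t, planned_S, planned_A, lam):
    """-Q(s, mu(s)) plus lam times the mean squared deviation from planned actions."""
    msd = np.mean([(mu(theta, s) - a) ** 2 for s, a in zip(planned_S, planned_A)])
    return -q_model(s_t, mu(theta, s_t)) + lam * msd

def actor_step(theta, mu, q_model, s_t, planned_S, planned_A, lam,
               lr=1e-3, eps=1e-5):
    """Finite-difference gradient descent on the mixed loss (sketch only)."""
    grad = np.zeros_like(theta)
    base = mixed_actor_loss(theta, mu, q_model, s_t, planned_S, planned_A, lam)
    for i in range(theta.size):
        t2 = theta.copy()
        t2.flat[i] += eps
        grad.flat[i] = (mixed_actor_loss(t2, mu, q_model, s_t,
                                         planned_S, planned_A, lam) - base) / eps
    return theta - lr * grad

# Toy usage: linear deterministic policy, quadratic critic, trivial planner.
mu = lambda theta, s: theta @ s
q_model = lambda s, a: -np.sum(a ** 2) - 0.1 * np.sum(s ** 2)
planner = lambda s: np.clip(-s[:1], -1.0, 1.0)
theta = np.zeros((1, 4))
s = np.array([0.9, 0.0, 0.1, 0.0])
a, planned = select_action(theta, mu, planner, s, bound=1.0)
theta = actor_step(theta, mu, q_model, s, [s], [planner(s)], lam=1.0)
```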
4. Experiments

4.1. Cartpole problem

We first run experiments on a cartpole task. The classic cartpole problem, commonly used to test reinforcement learning methods, provides a reward of +1 for every step in which the pole remains upright and the position of the cart is within the boundary; the reward is sparse and the action is discrete with two values. Four observations are selected as the states of the agent system, namely the pendulum's angular displacement (rad), the pendulum's angular velocity (rad/s), the cart's displacement (m) and the cart's velocity (m/s). These states are denoted as state_1, state_2, state_3 and state_4, respectively. When state_1 = 0, the pendulum stands upright on the cart; when state_3 = 0, the cart is at the central position. Meanwhile, a boundary bound_i is manually set to restrain each state so that the agent does not move beyond the desired regions, i.e., state_i ∈ [−bound_i, bound_i]. bound_i is also adopted to normalize the state in the reward definition, i.e., |state_i/bound_i| ∈ [0, 1]. The action is defined as the force (N) applied to the cart. We select this task as the environment because the requirement can be split into two aspects: swing up the pendulum and place the cart at the central position of the track. Therefore, different from the classic setting, in our experiment the reward function is redesigned considering both the pendulum's angular displacement and the cart's displacement. The corresponding rewards r_p and r_c are formulated as follows.
r_p = \exp(-10\,|state_1/bound_1|), \qquad r_c = \exp(-10\,|state_3/bound_3|)    (13)
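For reference, the two reward terms of Eq. (13) can be computed as follows; the bound values used here are only illustrative.

```python
import numpy as np

def cartpole_rewards(state, bound1=np.pi / 6, bound3=2.4):
    """state = (pole angle, pole angular velocity, cart position, cart velocity)."""
    r_p = np.exp(-10.0 * abs(state[0] / bound1))   # pendulum upright objective
    r_c = np.exp(-10.0 * abs(state[2] / bound3))   # cart centering objective
    return r_p, r_c

r_p, r_c = cartpole_rewards((0.05, 0.0, 0.3, 0.0))
r = r_p + r_c        # summed reward used for the discounted return (Section 4.1)
```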
The mathematical model and parameters of the cartpole in [32] are employed, where m = 2.2 kg is the mass of the pendulum, M = 1.3282 kg is the mass of the cart, l = 0.304 m is the length from the centroid of the pendulum to the shaft axis and J = ml^2/3 kg·m^2 is the moment of inertia of the pendulum. F_0 = 22.915 N/(m·s) and F_1 = 0.007056 N/(rad·s) are the friction factors of the cart and the pendulum, respectively. When state_1 or state_3 exceeds the limited boundary, the episode ends. The maximum number of steps in one episode is set to T_max = 200, and a successful instance is defined as an episode that lasts 200 steps.

Three methods, DDPG, RLAP-SOP and RLAP-MOP, are tested on the cartpole problem. The two networks in DDPG, namely the critic network and the actor network, are initialized with one hidden layer of 50 nodes. RMSprop with learning rates of 2.5e−5 and 1e−3 is adopted to update the parameters of the actor and the critic, respectively. The reward network in RLAP has the same structure and employs Adam with default parameter values. The reward value used for calculating the discounted return (Line 17 of Algorithm 3) is r = r_p + r_c. The primary reward (Line 11 of Algorithm 2) is defined as r_1 = r_p. The coefficient λ_c is set to 1e−3.

The episode return (ER) and the success rate (SR) are selected as the performance metrics. ER is the summation of discounted rewards from the starting state s_0 to the terminal state s_T, as described in (14):

\mathrm{ER} = \sum_{t=0}^{T} \gamma^{t-1} r(s_t, a_t), \quad s_{T_1+1} \notin S, \quad T = \min(T_{\max}, T_1)    (14)
where γ = 0.99. The SR metric adopts 20 randomly initialized states as the episode beginnings; following the policy obtained in one independent run, the success ratio over the 20 situations can be calculated. The metric assessments obtained by the three comparison RL methods during the search procedure are plotted in Fig. 2,
where the result at each time node includes the performance of 5 independent runs. The solid squares represent the median values, while the vertical lines represent the ranges between the first and third quartiles.

It can be observed from the figures that all algorithms managed to find control policies that hold up the pole for 200 steps, and RLAP-MOP obtained the highest episode return relative to the other two methods. Meanwhile, the success rates of RLAP-MOP stabilize at 1 and the variance at each sample point is close to 0; the success rates illustrate that RLAP-MOP has better stability as the training procedure continues. Compared with RLAP-SOP and DDPG, RLAP-MOP also shows a faster increase in both estimated metrics. There is no significant difference between RLAP-SOP and DDPG in the episode return curves, while DDPG shows better success stability relative to RLAP-SOP. The deterioration observed in RLAP-SOP is probably because the unique optimal solution calculated from the weighted rewards neglects long-term planning and leads the search in wrong directions. In contrast, the multi-objective optimization approach provides several feasible solutions based on the short-term model, and these solutions undergo a further selection evaluated by the estimated future rewards. This mechanism guarantees an improved generalization performance in reinforcement learning.

The statistical results (mean and standard deviation) in terms of the performance metrics at the end of the RL procedures are summarized in Table 1. The t-test is adopted to compare the results among the comparison methods at the significance level of 5%. The comparison results are displayed in the table: "−" indicates the proposed RLAP-MOP is significantly worse than the corresponding method; "+" indicates RLAP-MOP is significantly better than the corresponding method; "=" indicates that there is no significant difference between RLAP-MOP and the corresponding method. The t-test results show that RLAP-MOP performs significantly better than DDPG with respect to the ER metric, while RLAP-MOP and RLAP-SOP have competitive performance. According to the mean ER values, both RLAP approaches obtain larger return values than DDPG, and RLAP-MOP achieves the highest return value. Since all three methods obtain satisfactory SR values, the ER performance determines that RLAP-MOP is the best-performing method on the cartpole task. Moreover, to verify the influence of different learning rates on the comparison, a second learning rate for the critic network update, αQ = 2.5e−5, is also tested and the corresponding results are added to Table 1. It can be observed that the learning rate affects the learning efficiency, in that the larger learning rate performs better, but it does not change the comparison result: under the second learning rate, the RLAP methods still present higher episode returns than DDPG, and the MOP-based approach has a slight advantage over the SOP.

4.2. Force and position control in palpation

The proposed method is tested by training a hybrid system developed for a tissue interaction task, where both posture and force objectives need to be achieved. The task is inspired by human palpation, and a tendon-driven anthropomorphic robotic finger is involved.
The finger [33] is 3D-printed with an FDM printer (BCN3D Sigma) using a thermoplastic elastomer (TPE). It is actuated by a servo motor (Dynamixel MX-64R, resolution 0.088°) that tensions a tendon connected from the finger endpoint to a roller mounted on the servo motor. During operation, the
soft finger buckles when the actuation forces change beyond certain limits; as a result, it is difficult to obtain a precise mathematical model of the finger. The finger module is mechanically connected by a rod to a 6-DoF (degree of freedom) industrial manipulator UR5 (Universal Robots, Denmark) for positioning the finger. The point at the end of the robot arm is denoted as the Tool Center Point (TCP); the point at the soft fingertip is denoted as TIP. The soft finger and the whole operating system are illustrated in Fig. 3(a). Since the TCP pose is obtained through the forward kinematics in UR5's base frame (C_TCP) while the TIP pose is obtained from the EM sensor (C_TIP), the coordinate systems are registered to one environment coordinate frame (C_e); the transformation is prepared before the learning procedure.

The task we consider here is to move the soft finger towards the target point TiO in a confined area, to contact and press down the tissue along its normal direction TiO_n, and to apply a gentle desired force F_d for safe haptic interaction. The task is illustrated in Fig. 3(b), where the tissue is represented by a convex surface and the soft finger imitates the palpation motion. The performance of the task can be quantified by the applied force F, the angle dα between the force vector and the tissue normal vector, and the normal distance dn from the fingertip to the target normal. The confined area is restricted on F, dα, and dn by the boundary limits bound_f, bound_α, and bound_n, respectively, which are defined according to the maximum force requirement and the allowed moving region marked in the figure. Meanwhile, in order to fully utilize the operation space of the soft finger and avoid large displacements of the finger base, a quantification of the TCP movement is also considered in evaluating the policy. The RL methods are tested on this task in a Matlab environment, where the models of the tissue and the finger are fitted through sampling.

In the reinforcement learning setting, the state s_t is defined as a 7-dimensional vector which includes the position and orientation of the TCP w.r.t. the tissue target TiO, and the current cable length that drives the finger. The orientations are represented by Euler angles. The action a_t is defined as the velocities of the TCP position and orientation and the velocity of the cable length. The rewards consist of four aspects, including the pressing force F, the angle dα and the distance dn, and the action action_tcp applied to the TCP posture, which quantifies the TCP movement. When all features are within the boundary, the reward values are calculated as
r_n = \exp(-10\,|dn/bound_n|), \quad r_{\alpha} = \exp(-10\,|d\alpha/bound_{\alpha}|), \quad r_a = \exp(-10\,\|action_{tcp}\|)    (15)

and

r_F = \begin{cases} 1 + \exp(-10\,|F - F_d|/bound_f), & F \neq 0 \\ \exp(-10\,|d - d_{begin}|/d_{begin}), & F = 0 \end{cases}    (16)
where d is the distance from the fingertip position to the target point TiO projected onto the surface normal vector, which is obtained in advance. When d < 0, physical contact occurs between the finger and the tissue. The tissue contact model is built in advance, so the simulated force can be calculated from d. In this task, d_begin = 15 mm and F_d = 3 N are the initial normal distance and the desired pressing force, respectively. Noise is added at every initialization, for both training and evaluation, to improve the robustness of the policy. The value of the discounted return is calculated through the reward summation r = r_F + r_n + r_α + r_a. The primary reward is defined as r_1 = r_F. The short-term planning in RLAP considers the rewards r_F, r_n and r_α. The maximum number of steps in one episode is set to 50. A successful policy controls the manipulator to apply a force F (|F − F_d| < 0.15 F_d) within the confined moving space in less than 50 steps.
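The reward terms of Eqs. (15)-(16) can be sketched as follows; the bound values are illustrative, and the piecewise force reward follows the case pairing reconstructed above (force-tracking term on contact, distance term otherwise).

```python
import numpy as np

def palpation_rewards(dn, dalpha, action_tcp, F, d,
                      bound_n=10.0, bound_alpha=0.5, bound_f=6.0,
                      F_d=3.0, d_begin=15.0):
    r_n = np.exp(-10.0 * abs(dn / bound_n))                 # distance to target normal
    r_alpha = np.exp(-10.0 * abs(dalpha / bound_alpha))     # force/normal alignment
    r_a = np.exp(-10.0 * np.linalg.norm(action_tcp))        # penalize large TCP motion
    if F != 0:                                              # contact: track desired force
        r_f = 1.0 + np.exp(-10.0 * abs(F - F_d) / bound_f)
    else:                                                   # no contact: distance term
        r_f = np.exp(-10.0 * abs(d - d_begin) / d_begin)
    return r_f, r_n, r_alpha, r_a

r_f, r_n, r_alpha, r_a = palpation_rewards(dn=1.0, dalpha=0.05,
                                           action_tcp=np.zeros(6), F=2.8, d=-0.5)
r = r_f + r_n + r_alpha + r_a    # summed reward; r1 = r_f is the primary reward
```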
Fig. 2. Performance on the cartpole problem for the 3 comparison methods, with 5 independent runs per method over 6000 steps; the population size in RLAP-MOP is N = 50.
Fig. 3. The equipment and illustration of the palpation task. (a) The structures of the soft finger and the whole operating system. (b) The diagram illustrating the tissue interaction.

Table 1
The statistical results (mean and standard deviation) in terms of ER and SR values on the cartpole problem.

Metric | αQ      | DDPG                  | RLAP-SOP              | RLAP-MOP
ER     | 1.0e−03 | 9.16e+01 (1.11e+01) + | 1.05e+02 (1.38e+01) = | 1.15e+02 (1.27e+01)
SR     | 1.0e−03 | 9.80e−01 (4.47e−02) = | 9.00e−01 (1.73e−01) = | 1.00e+00 (0.00e+00)
ER     | 2.5e−05 | 4.48e−01 (1.36e+00) + | 5.19e+01 (3.26e+01) = | 6.66e+01 (3.31e+01)
SR     | 2.5e−05 | 0.00e+00 (0.00e+00) = | 4.20e−01 (4.31e−01) = | 3.30e−01 (3.96e−01)
Therefore, the success condition should synchronously satisfy constraints on the position, the orientation, and the applied force. The learning rates of the critic and actor networks are set to 2.5e−5, and the coefficient λ_c = 3e−4. The return values and success rates of the three RL methods during 10000 training steps are displayed in Fig. 4, which combines 5 independent runs. Each success rate indicates the termination condition of 20 different trials using one policy. The mean values and fluctuation ranges of the performance at several sampled steps are represented by squares and lines. It can be seen
from the figures that RLAP-MOP has obtained significantly better performance than the other two methods, and the ascending curves demonstrate good exploration ability. The controllers trained by RLAP-SOP and DDPG have weak generalization performance on the palpation task; they cannot reach the desired force and posture. Although RLAP-SOP occasionally completed the task, its performance is unstable and hardly preserved or improved. Therefore, the approach of integrating multiple objectives with short-term planning has better efficiency and stability for reinforcement learning. The statistical performance at the end of the learning iterations is also displayed in Table 2.
Fig. 4. Performance on the palpation task for the 3 comparison methods, with 5 independent runs per method over 10000 steps; the population size in RLAP-MOP is N = 50.
According to the t-test results, RLAP-MOP performs significantly better than the other two methods on both the ER and SR assessments, which illustrates the superiority of the MOP planning technique.

Generally, a learned policy cannot adapt to different task difficulties, whereas the mixed policy with additional planning can work in more situations. To further investigate the validity and extendibility of the additional short-term planning, we use the deterministic policy obtained by RLAP-MOP to control the system with and without additional planning under different degrees of task difficulty. The task difficulty is varied by the value of the initial distance d_begin; specifically, a larger distance demands a longer control trajectory and a more complex policy. Fig. 5(a) and (d) display the moving conditions under the policies without and with additional planning, respectively, when d_begin equals 15 mm. The lines include 10 independent runs with random initial noise. It can be seen that the two policies have similar performance. The features when d_begin equals 20 mm are displayed in (b) and (e), where the policy without additional planning has a lower success rate than the policy with additional planning. A more obvious difference between the two approaches can be observed in (c) and (f), where d_begin equals 25 mm. The policy without additional planning cannot finish the task, while the policy with planning reliably maintains the fingertip in the allowed moving region and finally reaches the desired forces.
5. Discussion

This paper presents an additional planning approach that improves the policy by simultaneously optimizing multiple requirements. Apart from the approximate value function Q, which works as a long-term indicator, a short-term multi-reward model is also built to provide more accurate fitness functions for the optimization.

Because the multi-objective optimization results in multiple good actions, these actions are then evaluated, and the output action is chosen considering both the value approximation and the primary objective. This design was determined after comparing several alternative approaches, whose performances are displayed in Fig. 6. In the figure, the SE (left y-axis) and ER (right y-axis) performances are displayed in different colors. The corresponding evaluation criteria are (a) the approximate Q value, (b) an aggregated reward, where the aggregation function is g = ω_r × R and ω_r adapts as the conditions change, (c) the primary reward value, and (d) the approach adopted in this paper (Algorithm 2). It can be seen that method (d) significantly outperforms the other three approaches. Other decision-making approaches for the Pareto surface, such as finding the knee solution [34], will be studied in future work.

Since the optimization procedure is relatively computationally expensive, it is only applied in some specific training periods, which are defined as the dangerous regions in our experiments, while the rest of the training process adopts the gradient method. The performance given by the cooperation between the stochastic policy and the optimized trajectory indicates a prospective and efficient training idea for reinforcement learning. Still, some subjects need to be studied further, such as the time nodes at which the additional planning is involved, and the comparison among different proportion coefficients λ.

6. Conclusion

A short-term additional planning technique that simultaneously considers multiple objectives is proposed to improve reinforcement learning. Two tasks are applied to test the training algorithms. On the cartpole instance, where all comparison methods manage to obtain a successful policy, RLAP with multiple objectives has improvements in the response speed and the training stability.
Table 2
The statistical results (mean and standard deviation) in terms of ER and SR values on the palpation task.

Metric | αQ      | DDPG                  | RLAP-SOP              | RLAP-MOP
ER     | 2.5e−05 | 3.37e+00 (3.39e+00) + | 3.22e+00 (3.41e+00) + | 9.96e+00 (3.97e+00)
SR     | 2.5e−05 | 1.70e−01 (2.49e−01) + | 1.30e−01 (1.40e−01) + | 7.30e−01 (4.27e−01)
Fig. 5. The extendibility of the learned policy. (a), (b) and (c) show the performance of the deterministic policy for d_begin = 15 mm, 20 mm and 25 mm, respectively, where d represents the initial distance to the tissue surface. (d), (e) and (f) show the corresponding performance of the policy with additional planning. The circles represent fingers that have been successfully placed, while the triangles represent trials that fail (exceed the confined region).
Fig. 6. Performance of different action-selection approaches from the Pareto set.
On the soft robotic palpation task, RLAP-MOP significantly outperforms the original DDPG method and RLAP with the summed objective, and guarantees an ascending control performance. Furthermore, the deterministic policy with multi-objective based
additional planning is shown to have better extendibility and can be applied to solve more difficult control tasks. In conclusion, additional planning with multiple objectives offers promising improvements for continuous reinforcement learning algorithms, especially in robotic control. Further work will include
the investigation of parameter selection in the neural network design and more efficient optimization approaches for RLAP.

CRediT authorship contribution statement

Anqi Pan: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing.
Wenjun Xu: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing.
Lei Wang: Supervision, Funding acquisition, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing.
Hongliang Ren: Supervision, Funding acquisition, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing.

Acknowledgments

This work was supported by the Singapore TAP Grant R397000350118, the National Key Research and Development Program of China, The Ministry of Science and Technology (MOST) of China (No. 2018YFB1307703) and the Singapore Academic Research Fund under Grant R-397-000-297-114, awarded to Dr. Hongliang Ren, and by the China Scholarship Council scholarship under 201706260064 and the Initial Research Funds for Young Teachers of Donghua University, China, awarded to Dr. Anqi Pan.

References

[1] S. Gu, E. Holly, T. Lillicrap, S. Levine, Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, in: IEEE International Conference on Robotics and Automation, 2017, pp. 3389–3396.
[2] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, Deep Q-learning from demonstrations, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 3224–3230.
[3] R.S. Sutton, D.A. McAllester, S.P. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[4] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: International Conference on Machine Learning, 2014, pp. 387–395.
[5] T. Salimans, J. Ho, X. Chen, S. Sidor, I. Sutskever, Evolution strategies as a scalable alternative to reinforcement learning, 2017, arXiv preprint arXiv:1703.03864.
[6] H. Mania, A. Guy, B. Recht, Simple random search provides a competitive approach to reinforcement learning, 2018, arXiv preprint arXiv:1803.07055.
[7] J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in: International Conference on Machine Learning, 2015, pp. 1889–1897.
[8] S. Levine, V. Koltun, Guided policy search, in: International Conference on Machine Learning, 2013, pp. 1–9.
[9] S. Mannor, R.Y. Rubinstein, Y. Gat, The cross entropy method for fast policy search, in: Proceedings of the 20th International Conference on Machine Learning, ICML-03, 2003, pp. 512–519.
[10] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, J.-B. Mouret, Black-box data-efficient policy search for robotics, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, IEEE, 2017, pp. 51–58.
[11] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, 2015, arXiv preprint arXiv:1509.02971.
[12] K.G. Vamvoudakis, F.L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (5) (2010) 878–888.
[13] Y. Tassa, T. Erez, E. Todorov, Synthesis and stabilization of complex behaviors through online trajectory optimization, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, IEEE, 2012, pp. 4906–4913.
[14] J. Morimoto, C.G. Atkeson, Minimax differential dynamic programming: An application to robust biped walking, in: Advances in Neural Information Processing Systems, 2003, pp. 1563–1570.
[15] C.B. Browne, E. Powley, D. Whitehouse, S.M. Lucas, P.I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, S. Colton, A survey of Monte Carlo tree search methods, IEEE Trans. Comput. Intell. AI Games 4 (1) (2012) 1–43.
[16] M.A. Khamis, W. Gomaa, Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework, Eng. Appl. Artif. Intell. 29 (2014) 134–151.
[17] J. Hwangbo, I. Sa, R. Siegwart, M. Hutter, Control of a quadrotor with reinforcement learning, IEEE Robot. Autom. Lett. 2 (4) (2017) 2096–2103.
[18] M. Ruiz-Montiel, L. Mandow, J.-L. Pérez-de-la Cruz, A temporal difference method for multi-objective reinforcement learning, Neurocomputing 263 (2017) 15–25.
[19] K. Van Moffaert, A. Nowé, Multi-objective reinforcement learning using sets of Pareto dominating policies, J. Mach. Learn. Res. 15 (1) (2014) 3483–3512.
[20] Z. Yang, K. Merrick, L. Jin, H.A. Abbass, Hierarchical deep reinforcement learning for continuous action control, IEEE Trans. Neural Netw. Learn. Syst. 29 (11) (2018) 5174–5184.
[21] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, U. Topcu, Safe reinforcement learning via shielding, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 2669–2678.
[22] J. Lee, M. Laskey, R. Fox, K. Goldberg, Derivative-free failure avoidance control for manipulation using learned support constraints, 2018, pp. 1–4, arXiv preprint arXiv:1801.10321.
[23] T.-H. Pham, G. De Magistris, R. Tachibana, OptLayer - practical constrained optimization for deep reinforcement learning in the real world, in: 2018 IEEE International Conference on Robotics and Automation, ICRA, IEEE, 2018, pp. 6236–6243.
[24] H. Van Hasselt, Reinforcement Learning in Continuous State and Action Spaces, Springer, 2012, pp. 207–251.
[25] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2014.
[26] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, M. Lee, Natural actor-critic algorithms, Automatica 45 (11) (2009) 2471–2482.
[27] P. Vamplew, R. Issabekov, R. Dazeley, C. Foale, A. Berry, T. Moore, D. Creighton, Steering approaches to Pareto-optimal multiobjective reinforcement learning, Neurocomputing 263 (2017) 26–38.
[28] S. Parisi, M. Pirotta, J. Peters, Manifold-based multi-objective policy search with sample reuse, Neurocomputing 263 (2017) 3–14.
[29] J. García, R. Iglesias, M.A. Rodríguez, C.V. Regueiro, Incremental reinforcement learning for multi-objective robotic tasks, Knowl. Inf. Syst. 51 (3) (2017) 911–940.
[30] N. Suematsu, A. Hayashi, A reinforcement learning algorithm in partially observable environments using short-term memory, in: Conference on Advances in Neural Information Processing Systems II, 1999, pp. 1059–1065.
[31] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[32] H.-K. Lam, M. Narimani, Stability analysis and performance design for fuzzy-model-based control system under imperfect premise matching, IEEE Trans. Fuzzy Syst. 17 (4) (2009) 949–961.
[33] F. Chen, W. Xu, H. Zhang, Y. Wang, J. Cao, M.Y. Wang, H. Ren, J. Zhu, Y.F. Zhang, Topology optimized design, fabrication, and characterization of a soft cable-driven gripper, IEEE Robot. Autom. Lett. 3 (3) (2018) 2463–2470.
[34] K. Bhattacharjee, H. Singh, M. Ryan, T. Ray, Bridging the gap: Many-objective optimization and informed decision-making, IEEE Trans. Evol. Comput. (2017) 813–820.