6th IFAC International Workshop on Periodic Control Systems, June 29 - July 1, 2016, Eindhoven, The Netherlands. Available online at www.sciencedirect.com
IFAC-PapersOnLine 49-14 (2016) 113–118
Reinforcement Learning of Potential Fields to achieve Limit-Cycle Walking

Denise S. Feirstein ∗, Ivan Koryakovskiy ∗, Jens Kober ∗∗, Heike Vallery ∗

∗ Department of BioMechanical Engineering, TU Delft, Netherlands
([email protected], {i.koryakovskiy, h.vallery}@tudelft.nl)
∗∗ DCSC, TU Delft, Netherlands ([email protected])
Abstract: Reinforcement learning is a powerful tool to derive controllers for systems where no models are available. Particularly policy search algorithms are suitable for complex systems, to keep learning time manageable and account for continuous state and action spaces. However, these algorithms demand more insight into the system to choose a suitable controller parameterization. This paper investigates a type of policy parameterization for impedance control that allows energy input to be implicitly bounded: Potential fields. In this work, a methodology for generating a potential field-constrained impedance controller via approximation of example trajectories, and subsequently improving the control policy using Reinforcement Learning, is presented. The potential field-constrained approximation is used as a policy parameterization for policy search reinforcement learning and is compared to its unconstrained counterpart. Simulations on a simple biped walking model show the learned controllers are able to surpass the potential field of gravity by generating a stable limit-cycle gait on flat ground for both parameterizations. The potential field-constrained controller provides safety with a known energy bound while performing equally well as the unconstrained policy.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Machine learning, Energy Control, Limit cycles, Walking, Robot control

1. INTRODUCTION

The demand for robot control that is both safe and energy-efficient is greater than ever with advances in mobile robots and robots that interact in human environments. One such example is the bipedal robot which has applications ranging from home care to disaster relief. Traditional position control, common to industrial robotics, is not suitable for robots that interact in unknown environments because slight position errors can result in high contact forces that can damage the robot and its environment. In the case of humanoid robots which interact in human environments this poses a human-safety issue.

One possible solution is to employ impedance control, which attempts to enforce a dynamic relation between system variables as opposed to controlling them directly (Hogan (1984)). Impedance control based on potential fields inherently bounds the energy exchanged between the robot and the environment. Potential fields can modulate natural dynamics of a system and achieve desired behavior without requiring high-stiffness trajectory tracking. Potential fields have been developed for path planning and motion control by reformulating the objective into a potential function (Koditschek (1987)). Control torques can be represented as a vector field generated by the gradient of the potential field, such that the dimensionality of any number of actuators is essentially reduced to one, the scalar value of the potential function.

Contrasting the high energy demand of conventional, fully actuated bipedal robots, passive dynamic walkers have been developed that walk down shallow slopes using only gravity and the robot's natural dynamics (McGeer (1990)). Thus, these mechanisms exploit the natural potential field of gravity. In consequence, they possess an extremely energy-efficient gait that is remarkably similar to that of humans. The stable periodic gait of a passive dynamic walker is referred to as a Limit Cycle (LC). Rendering this gait slope-invariant and improving its disturbance rejection has been the focus of many publications including Hobbelen and Wisse (2007). For example, walking of the so-called simplest walker on flat terrain can be achieved by emulating a slanted artificial gravity field via robot actuators (Asano and Yamakita (2001)). This is a very special case of a potential field.

The design and parameterization of more generic potential fields remains challenging, particularly for systems that exhibit modeling uncertainties or are subjected to unknown disturbances. Reinforcement learning (RL) is a powerful technology to derive controllers for systems where no models are available. Policy search RL methods, also known as actor-only methods, have been found effective for robotic applications due to their ability to handle higher dimensionality and continuous state and action spaces compared to Value-based RL methods (Kober et al. (2013)). Furthermore, policy search methods have been effectively implemented on bipedal robots (Tedrake et al. (2004)).

In this work, we propose to combine RL and PF-constrained impedance control to improve robot safety for robots that operate in uncertain conditions because:
• PF-constraint provides safety with a known energy bound.
• RL provides controllers for systems with modeling uncertainty.

The question arises, can policy search RL be combined with potential fields to achieve LC walking? While the theoretical advantages of a PF-constrained impedance control, specifically energy boundedness, are presented in literature, the sub-question arises, are there limitations when it comes to RL convergence?
As a first step towards answering these questions, this paper presents a methodology for defining a potential field-constrained (PF-constrained) impedance control and improving it via reinforcement learning. To achieve this, we define an impedance control as a parameterized mapping of configurations to control torques, which is analogous to a policy in Reinforcement Learning (RL) algorithms. A PF-constrained and an unconstrained parameterization of an impedance controller are compared before and after RL is applied to the bipedal walking problem. These control methods are compared for three cases: the reference case of the simplest walking model (SWM), the slope-modified case of the SWM on flat ground, and the mass-modified case of the SWM with modified foot mass on flat ground.

2. IMPEDANCE CONTROL INITIALIZATION

As opposed to conventional set-point control approaches that directly control system variables such as position and force, impedance control attempts to enforce a dynamic relation between these variables (Hogan (1984)). In this section, an impedance controller is derived for a fully actuated robot with n Degrees Of Freedom (DOF) using least squares optimization. We assume an accurate model of the robot, the ability to measure the position and torque at each joint, and full collocated actuation. Each configuration of the robot can be described by a unique vector q = [q_1, q_2, ..., q_n]^T, where q_i, with index i = 1...n, are the generalized coordinates.

If a desired trajectory x = [q^T, q̇^T, q̈^T]^T is known, the idealistic control torques τ_0 required to achieve this trajectory can be found using inverse dynamics. A function to approximate the torques applied to the system as a function of the robot's configuration, τ(q) ∈ R^n, can be found by solving the least squares problem

    min_w  Σ_{k=1}^{S} || τ_{0,k}(x_k) − τ(q_k; w) ||²,        (1)

where τ_{0,k}(x_k) is a set of training data with S samples.
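To make the fitting step concrete, the following Python sketch (not from the paper) solves (1) for a torque model that is linear in the weights, τ(q; w) = Φ(q) w; the feature map Φ is a placeholder for the RBF parameterizations introduced in the following paragraphs.

    import numpy as np

    def fit_torque_weights(Q, Tau0, phi):
        """Least-squares fit of w in tau(q; w) = Phi(q) w, cf. (1).

        Q    : (S, n) array of sampled configurations q_k
        Tau0 : (S, n) array of inverse-dynamics torques tau_0,k
        phi  : callable q -> (n, p) feature matrix Phi(q); placeholder for
               the RBF parameterizations described below
        """
        A = np.vstack([phi(q) for q in Q])           # (S*n, p) stacked regressor
        b = np.asarray(Tau0).reshape(-1)             # stacked torque targets
        w, *_ = np.linalg.lstsq(A, b, rcond=None)    # minimizes ||A w - b||^2
        return w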
For the unconstrained case with n degrees of freedom, the vector function τ(q; w) is defined in terms of its components τ_i(q; w_i) = g_i(q)^T w_i, each of which is approximated by a set of normalized radial basis functions (RBF) g_i(q) and corresponding weights w_i, i = 1...n.

For the constrained case, function τ(q) is restricted to describe a potential field by enforcing that its work is zero for any closed-path trajectory. This implies the control torques are a function of the joint variables q and are defined as the negative gradient of a potential function ψ(q; w) = g(q)^T w with respect to q:

    τ(q; w) = −∇_q ψ(q; w) = −(∂g(q)/∂q)^T w.

This is similar to the method of Generalized Elasticities presented in Vallery et al. (2009a) and Vallery et al. (2009b). For the RBF g(q) we choose to use compactly supported radial basis functions which allow for the use of a minimal number of center points in the neighborhood of the robot's position to sufficiently compute the function value. This reduces the computational resources needed during operation.
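For illustration only, a minimal sketch of evaluating both parameterizations is given below; it is not taken from the paper. Gaussian RBFs with hypothetical centers and width stand in for the compactly supported RBFs, and the Jacobian ∂g/∂q is obtained numerically rather than analytically.

    import numpy as np

    # Hypothetical RBF centers (one row per basis function) and width;
    # Gaussian RBFs stand in for the compactly supported RBFs of Sec. 2.
    C = np.array([[0.10, -0.20], [0.15, -0.30], [0.20, -0.40]])
    SIGMA = 0.1

    def g(q):
        """Normalized RBF feature vector g(q), shape (p,)."""
        r2 = np.sum((q - C) ** 2, axis=1)
        phi = np.exp(-r2 / (2.0 * SIGMA ** 2))
        return phi / np.sum(phi)

    def grad_g(q, eps=1e-6):
        """Numerical Jacobian dg/dq, shape (p, n)."""
        J = np.zeros((C.shape[0], q.size))
        for i in range(q.size):
            dq = np.zeros(q.size)
            dq[i] = eps
            J[:, i] = (g(q + dq) - g(q - dq)) / (2.0 * eps)
        return J

    def tau_unconstrained(q, W):
        """Unconstrained case: tau_i(q; w_i) = g_i(q)^T w_i. For brevity the
        same features are shared across joints; W has shape (n, p)."""
        return W @ g(q)

    def tau_pf_constrained(q, w):
        """PF-constrained case: tau(q; w) = -(dg/dq)^T w, the negative
        gradient of psi(q; w) = g(q)^T w."""
        return -grad_g(q).T @ w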
3. POLICY SEARCH REINFORCEMENT LEARNING

Reinforcement Learning (RL) is a machine learning method which attempts to find a control policy π(u|x, w) that maps states x to actions u. For policy search algorithms, the policy is parameterized by a weighting vector w. The policy is analogous to the impedance control laws derived in the previous section, where the generalized coordinates q are states and the control torques τ are actions. The policy space is explored by randomly perturbing the weighting vector w. Batch exploration is performed, where the policy is independently perturbed from the initial policy a set number of times. The perturbed policies are then evaluated by computing the expected return J = E{ Σ_{h=0}^{H} R_h }, which is a sum of the expected reward R over the finite horizon H. Episode-based policy evaluation uses the entire episode to directly assess the quality of the policy (Deisenroth et al. (2011)). The policy is updated with the objective of finding a policy that maximizes the expected return. We use the Expectation Maximization Policy learning by Weighted Exploration with the Returns (PoWER) method developed in Kober and Peters (2011).
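The sketch below is not the authors' implementation of PoWER; it only illustrates, under stated simplifications, the episode-based structure of such a policy search: Gaussian exploration in parameter space and a return-weighted update. The importance sampling of the full PoWER algorithm is omitted, the rollout function is a placeholder, and the batch size, number of updates and variance schedule are hypothetical defaults.

    import numpy as np

    def power_style_search(w0, episode_return, n_updates=15, batch=100,
                           sigma_start=1e-6, sigma_end=1e-11, seed=0):
        """Episode-based policy search with Gaussian parameter exploration
        and a simplified return-weighted (PoWER-style) update.

        w0             : initial weight vector, e.g. from the least-squares fit
        episode_return : callable w -> return J of one episode (H biped steps);
                         placeholder for the walking simulation
        """
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        for u in range(n_updates):
            # exploration magnitude shrinks over the learning process
            sigma = sigma_start + (sigma_end - sigma_start) * u / max(n_updates - 1, 1)
            eps = sigma * rng.standard_normal((batch, w.size))    # perturbations
            J = np.array([episode_return(w + e) for e in eps])    # batch returns
            if J.sum() > 0:                                       # returns assumed non-negative
                w = w + (J[:, None] * eps).sum(axis=0) / J.sum()  # return-weighted mean
        return w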
4. APPLICATION TO LC WALKING

4.1 Simplest Walking Model

Fig. 1. Diagram of the Simplest Walking Model (SWM) and its parameters for Reference ("R"), Slope-modified ("S") and Mass-modified ("M") cases:

                   "R"      "S"      "M"
    g (m/s²)     10.000   10.000   10.000
    L (m)         1.000    1.000    1.000
    mh (kg)       1.000    1.000    1.000
    mf (kg)       0.001    0.001    0.010
    γ (rad)       0.004    0.000    0.000

The simplest walking model (SWM) developed in Garcia et al. (1998) is often used as a tool to study the paradigm of Bipedal Limit-Cycle walking and is detailed in the following sections. A diagram of the SWM is shown in Fig. 1. The model consists of two massless rigid links of length L connected at the hip by a frictionless hinge. The mass is distributed over three point masses at the hip and feet such
that the hip mass mh is much larger than the foot mass mf. The model is situated on a slope of angle γ and acts only under the force of gravity with acceleration constant g. The configuration of the model is given by the ankle angle θ and hip angle φ. The generalized coordinates are q = (x_c, y_c, θ, φ)^T, where the subscript "c" denotes the contact point of the stance foot with the ground.
4.2 Reinforcement Learning

The resulting impedance control laws τ(q; w) parameterized by vector w are specific to the simplest walking model case and will likely not be effective if the model is modified or more degrees of freedom are added. If this is the case, τ(q; w_0) parameterized by vector w_0 can be used as the initial policy for policy search RL. The policy search with episode-based evaluation strategy described in Sec. 3 can be used, where one episode is H steps of the biped. For a biped robot, the state transitions from the previous state x to the next state x′ caused by actions u can be modeled by solving the equations of motion using iterative methods, where x = [q^T, q̇^T]^T are the states and the generalized forces Q_θ, Q_φ are the actions u.
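A sketch of this episode structure is given below; it is not part of the original paper. The step simulator, the fall test and the reward are placeholders for the actuated SWM model and the reward defined next. With those placeholders bound, such a function could serve as the return evaluation used by the policy search sketch of Sec. 3.

    H = 10  # steps of the biped per episode, cf. Sec. 5.1

    def episode_return(w, x0, simulate_step, reward):
        """Accumulate the return of one episode of H biped steps.

        simulate_step : callable (x, w) -> (x_next, step_data); placeholder for
                        integrating the equations of motion from one heel strike
                        to the next under the policy tau(q; w)
        reward        : callable (x, x_next, step_data) -> R_h
        """
        x, J = x0, 0.0
        for h in range(H):
            x_next, step_data = simulate_step(x, w)
            if step_data.get("fell", False):   # assumption: a fall ends the episode
                break
            J += reward(x, x_next, step_data)
            x = x_next
        return J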
The reward function used for each step is

    R_h(x, u) = R_step − R_Δ ||Δθ|| − R_Δ̇ ||Δθ̇|| − R_t ||t_h − t_0|| − R_τ,θ ||τ_θ|| − R_τ,φ ||τ_φ||,

where Δθ = θ_h − θ_{h−1} and Δθ̇ = θ̇_h − θ̇_{h−1}, and R_step = 1, R_Δ = 10 rad⁻¹, R_Δ̇ = 10 s rad⁻¹, R_t = 1 s⁻¹, R_τ,θ = 10 N⁻¹m⁻¹ and R_τ,φ = 100 N⁻¹m⁻¹ are constants. The first term of the reward function is given as a reward for successfully completing a step. The second term penalizes the change in angle and angular velocity of the stance leg at the beginning of each step; this is to encourage that a limit cycle is reached where each step is the same. The third term penalizes the deviation of the duration of step h from the duration of the reference LC step t_0 = 1.2180 s. The fourth term penalizes the magnitude of the control torques to minimize the energy added to the system.
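Transcribed directly into Python for illustration (the norm applied to the torque samples of a step is not specified in the text and is an assumption here):

    import numpy as np

    # Reward constants of Sec. 4.2
    R_STEP, R_D, R_DDOT, R_T = 1.0, 10.0, 10.0, 1.0
    R_TAU_THETA, R_TAU_PHI = 10.0, 100.0
    T0 = 1.2180  # duration of the reference LC step in s

    def step_reward(theta_h, theta_prev, thetadot_h, thetadot_prev,
                    t_h, tau_theta, tau_phi):
        """Reward R_h of one biped step, following the expression above.
        tau_theta, tau_phi: sampled ankle and hip torques over the step."""
        return (R_STEP
                - R_D * abs(theta_h - theta_prev)
                - R_DDOT * abs(thetadot_h - thetadot_prev)
                - R_T * abs(t_h - T0)
                - R_TAU_THETA * np.linalg.norm(tau_theta)
                - R_TAU_PHI * np.linalg.norm(tau_phi))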
5. EVALUATION PROTOCOL

5.1 Implementation

The impedance control laws were implemented on a fully-actuated simple walking model for the three cases: the reference case on a slope, the slope-modified case on flat ground, and the mass-modified case of the SWM with modified foot mass on flat ground, cf. Fig. 1.

The training data was found by first scanning the initial conditions (q, q̇) for cases in which the SWM converges to an LC; the associated accelerations q̈ were then found using inverse dynamics. The ankle angle was varied between 0.1 and 0.2 rad with a step size of 0.005 rad, and the initial hip angle was set to twice that of the ankle so the model initializes in double support phase. The initial ankle angular velocity was varied between −0.68 and −0.38 rad s⁻¹ with a step size of 0.005 rad s⁻¹, and the initial hip angular velocity was set to 0 rad s⁻¹. The torques τ_0 found from the training data can be used to solve the least-squares problem in (1) using the recursive least-squares method described in Sec. 2, resulting in impedance control laws of the form τ(q; w).
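For illustration, a sketch of this data-collection loop is shown below (not from the paper); the limit-cycle test, the passive rollout and the inverse-dynamics routine are placeholders for the SWM simulator, while the grid limits and step sizes are those stated above.

    import numpy as np

    # Initial-condition grid scanned in Sec. 5.1
    thetas    = np.arange(0.10, 0.20 + 1e-9, 0.005)      # ankle angle (rad)
    thetadots = np.arange(-0.68, -0.38 + 1e-9, 0.005)    # ankle velocity (rad/s)

    def build_training_set(rollout_states, inverse_dynamics):
        """Collect (q_k, tau_0,k) pairs for the least-squares fit of Sec. 2.

        rollout_states   : callable (q0, qd0) -> list of (q, qd, qdd) samples,
                           or None if the model does not converge to an LC
        inverse_dynamics : callable (q, qd, qdd) -> tau_0
        """
        Q, Tau0 = [], []
        for th in thetas:
            for thd in thetadots:
                q0  = np.array([th, 2.0 * th])    # double support: hip = 2 * ankle
                qd0 = np.array([thd, 0.0])        # initial hip velocity is zero
                samples = rollout_states(q0, qd0)
                if samples is None:               # discard non-converging starts
                    continue
                for q, qd, qdd in samples:
                    Q.append(q)
                    Tau0.append(inverse_dynamics(q, qd, qdd))
        return np.array(Q), np.array(Tau0)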
For the least squares optimization, 50 RBFs were used. The center locations were determined using a grid step size of 0.05 rad for the ankle angle and 0.1 rad for the hip angle in the area of the ideal trajectory of the SWM.

For the policy search RL, a horizon of H = 10 was used, corresponding to 10 steps of the robot. For the exploration strategy, a batch size of 100 iterations was used. A Gaussian exploration ε ∼ N(0, σ²) was used, with the variance decreasing linearly over the episodes.

5.2 Experiment Setup

Initial unconstrained and PF-constrained impedance controllers were found using inverse dynamics for each of the three cases described above. For the Slope- and Mass-modified cases, RL was used to attempt to improve the policy for both the unconstrained and the PF-constrained parameterizations. For the reference case, the performance of the controllers cannot be improved further using RL based on the evaluation strategy, since the control torques cannot decrease further.
5.3 Benchmarking Criteria

The unconstrained and PF-constrained impedance controllers were compared for each of the three cases based on the following benchmarking criteria.

Work and Energy: The energy of the LC of the ideal SWM (unactuated and on a slope) is bounded by the potential field of gravity. The energy bound can be measured as the maximum energy E of the LC, defined as E = V + T, where V is the potential energy and T is the kinetic energy. For the LC of the ideal SWM, the total energy is constant at 10.0108 J. At each step, kinetic energy is dissipated at impact and an equivalent amount of potential energy is added by the slope. The energy added/dissipated at each step is equivalent to 0.0166 J. Energy consumption can be measured for the actuated model as the work done by the actuators, W = ∫_{q_0}^{q_1} τ dq, where q_0 is the configuration at the beginning of the step and q_1 is the configuration at the end of the step.
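For illustration, these two quantities could be computed from sampled trajectories as follows (a sketch not taken from the paper; the trapezoidal discretization of the work integral is an assumption):

    import numpy as np

    def energy_bound(V, T):
        """Energy bound: maximum total energy E = V + T over one LC step,
        from sampled potential (V) and kinetic (T) energies."""
        return float(np.max(np.asarray(V) + np.asarray(T)))

    def actuator_work(tau, q):
        """Actuator work over one step, approximating the integral of tau
        over dq from samples tau (N, n) and configurations q (N, n)."""
        tau = np.asarray(tau)
        dq = np.diff(np.asarray(q), axis=0)       # configuration increments
        tau_mid = 0.5 * (tau[1:] + tau[:-1])      # trapezoidal rule
        return float(np.sum(tau_mid * dq))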
Robustness: The robustness of an LC gait can be measured by its velocity disturbance rejection. An angular velocity disturbance is introduced to the stance leg at the beginning of the first step and the maximum disturbance that can be applied without causing the walker to fall is used as a measure for robustness.

RL Performance: The performance of the RL is assessed by plotting the mean performance over the episodes, for several trials, and observing how many episodes it takes to level off.

6. RESULTS

6.1 Reference case
The trajectories for the Unconstrained and PF-constrained policies were derived using inverse dynamics.
The benchmarking criteria for the energy, work and robustness of the reference case are specified in Table 1.

6.2 Slope-modified Case

Initialization: The trajectory phase plots for the initial Unconstrained and PF-constrained policies for the Slope-modified case are shown in Fig. 2 (a) and (b) respectively. The control torques and total energy for the initial Unconstrained and PF-constrained policies for the Slope-modified case are shown in Fig. 3 (a) and (b) respectively.

Fig. 2. Trajectory phase plot of the initial (a) Unconstrained and (b) PF-constrained policies for the Slope-modified Case.

Fig. 3. Control torques and energy of one LC step of the initial (a) Unconstrained and (b) PF-constrained policies for the Slope-modified Case.

Reinforcement Learning: The mean performance of the RL for the Unconstrained and PF-constrained controllers is shown in Fig. 4. The resulting trajectory phase plots for the learned Unconstrained and PF-constrained policies for the Slope-modified case are shown in Fig. 5 (a) and (b) respectively. The resulting control torques and energy for the learned Unconstrained and PF-constrained policies for the Slope-modified case are shown in Fig. 6 (a) and (b) respectively.

Fig. 4. Mean performance of the RL for the Unconstrained and PF-constrained policies for the Slope-modified case, averaged over 10 runs, with the error bars indicating the standard deviation. For both policies the exploration variance decreased linearly from 1e-6 to 1e-11 throughout the episodes.

Fig. 5. Trajectory phase plot of the learned (a) Unconstrained and (b) PF-constrained policies for the Slope-modified Case.

Fig. 6. Control torques and energy of one LC step of the learned (a) Unconstrained and (b) PF-constrained policies for the Slope-modified Case.

The benchmarking criteria for the energy, work and robustness of the Slope-modified case are specified in Table 1.

6.3 Mass-modified Case

Initialization: For the Mass-modified case neither the initial Unconstrained nor the initial PF-constrained policy leads to a stable limit cycle, so the corresponding plots are not shown.
Reinforcement Learning: The mean performance of the RL for both the PF-constrained and unconstrained case is shown in Fig. 7. The resulting trajectory phase plots for the learned Unconstrained and PF-constrained policies for the Mass-modified case are shown in Fig. 8 (a) and (b) respectively. The resulting control torques and energy for the learned Unconstrained and PF-constrained policies for the Mass-modified case are shown in Fig. 9 (a) and (b) respectively.

Fig. 7. RL mean performance of the Unconstrained and PF-constrained policies for the Mass-modified case, averaged over 10 runs, with the error bars indicating the standard deviation. For the Unconstrained policy the exploration variance decreased from 1e-5 to 1e-10, and for the PF-constrained policy the variance decreased from 1e-6 to 1e-10.

Fig. 8. Trajectory phase plot of the learned (a) Unconstrained and (b) PF-constrained policies for the Mass-modified Case.

Fig. 9. Control torques and energy of one LC step of the learned (a) Unconstrained and (b) PF-constrained policies for the Mass-modified Case.
The benchmarking criteria for the work, energy and robustness of the Mass-modified case are specified in Table 1.

Table 1. Summary of Results. In the table "EB" stands for "Energy bound", and "MVD" stands for "Maximum velocity disturbance".

    Case            Parameterization   Benchmarking Criteria   Initial Policy   Learned Policy
    Reference       Unconstrained      EB (J)                  10.011
                                       Work (J)                 0.000
                                       MVD (rad/s)             -0.050
                    PF-constrained     EB (J)                  10.011
                                       Work (J)                 0.000
                                       MVD (rad/s)             -0.050
    Slope-modified  Unconstrained      EB (J)                  10.017           10.026
                                       Work (J)                 1.507            1.488
                                       MVD (rad/s)             -0.050           -0.030
                    PF-constrained     EB (J)                  10.019           10.019
                                       Work (J)                 1.495            1.332
                                       MVD (rad/s)             -0.060            0.000
    Mass-modified   Unconstrained      EB (J)                                   10.215
                                       Work (J)                                  1.311
                                       MVD (rad/s)                              -0.050
                    PF-constrained     EB (J)                                   10.062
                                       Work (J)                                  1.481
                                       MVD (rad/s)                              -0.020
7. DISCUSSION

For the reference case, for both the unconstrained and PF-constrained parameterization, the controlled trajectory perfectly follows the ideal trajectory. No actuator torques are generated and the total energy is equal to 10.011 J. It can be seen in Table 1 that both controllers have the same energy bound and maximum disturbance rejection as the unactuated ideal case. This serves as a validation for both the impedance controllers derived using inverse dynamics and least squares optimization.

For the slope-modified case, the initial impedance controllers, for both PF-constrained and unconstrained parameterizations, allow the biped to achieve an LC gait on a flat surface (γ = 0 rad), as can be seen in the trajectory phase plots in Fig. 2. It can be seen in Table 1 that the velocity disturbance rejections are comparable to the ideal SWM; however, the energy bound is higher than the ideal case for both controllers. The work done by the actuators is similar for both controllers; however, it is almost 100 times the work done by gravity in the ideal case.

As can be seen in Table 1, RL of the initial impedance controllers for the slope-modified case increases the energy bound for both controllers, while decreasing the work done by the actuators. RL also leads to decreased disturbance rejection. As can be seen in Fig. 4, the performance of the unconstrained parameterization levels off before the PF-constrained parameterization, indicating the unconstrained parameterization achieves a higher performance with fewer episodes compared to the PF-constrained parameterization.

For the mass-modified case, the initial impedance controllers, for both PF-constrained and unconstrained parameterizations, do not allow the biped to achieve an LC gait. The impedance controllers derived from inverse
dynamics appear not to be able to compensate for the modified dynamics of the model. However, RL of these initial policies allows the biped to achieve an LC gait, as shown in Fig. 8. This validates the use of RL for achieving an LC gait. As can be seen in Table 1, for both controllers the energy bound and the work done are greater than in the ideal case. While the robustness of the unconstrained controller is comparable to the ideal case, it is reduced for the PF-constrained controller. As can be seen in Fig. 7, the performance of the unconstrained parameterization levels off before the PF-constrained parameterization.

For all cases, the energy bound and work done by the actuators were similar for both the PF-constrained and unconstrained controllers. As the implementation of the RL did not converge to a single optimal solution, the variance in the resulting energy and work was too large to draw an accurate comparison. For all cases, there are no improvements to the robustness of the limit cycle against velocity disturbances. The reason for this is that the episode (consisting of H steps of the limit cycle) is a black box from the perspective of the episode-based RL. Learning is based only on the inputs and outputs of the episode; therefore any unknown disturbances throughout the episode are not accounted for, and consequently the robustness is not improved by the RL. Exploring and learning throughout the episode may be one way to improve the robustness. Additionally, learning could take place in an unknown environment with unknown disturbances.

The scope of these results is limited by the variables of the simple walking model used. The only modifications tested were the ratio of the hip mass to foot mass, and the slope γ. An interesting observation is the learned behavior of "swing-leg retraction" seen in the learned policy for both cases, as shown in Fig. 5 and 8. This is when the swing leg retracts at the end of a step until it hits the ground. It has been shown in Hobbelen and Wisse (2008) that swing-leg retraction can improve disturbance rejection.

8. CONCLUSION AND FUTURE WORK

In this work we successfully combined potential field control and reinforcement learning to achieve limit-cycle walking for a simple walking model. A limit cycle was achieved on flat ground, and for a modified hip to foot mass ratio. The results demonstrate that a potential field controller can not only "emulate" the effect of gravity on the simple walking model, but also improve its performance if reinforcement learning is applied. The potential field-constrained controller provides safety by bounding the energy while performing equally well compared to an unconstrained controller. The performance of the RL leveled off faster for the unconstrained case.

Achieving a limit cycle gait on a SWM is trivial compared to more complex models. In future work the method presented in this paper could be applied to higher degree of freedom models. A strength of this method is the ability to bound the energy of the controlled system. In future work it could be explored how to enforce a desired energy bound. Improved tuning of the RL exploration and evaluation strategy could lead to improved policies and more conclusive results for the comparison of the unconstrained and PF-constrained parameterizations. More advanced RL methods could lead to potential fields that further improve performance and even increase robustness.

ACKNOWLEDGEMENTS

I. Koryakovskiy and H. Vallery were supported by the European project KOROIBOT FP7-ICT-2013-10/611909.

REFERENCES

Asano, F. and Yamakita, M. (2001). Virtual gravity and coupling control for robotic gait synthesis. IEEE Trans. Systems, Man, and Cybernetics Part A: Systems and Humans, 31(6), 737–745.
Deisenroth, M.P., Neumann, G., and Peters, J. (2011). A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 2, 1–142.
Garcia, M., Chatterjee, A., Ruina, A., and Coleman, M. (1998). The Simplest Walking Model: Stability, Complexity, and Scaling. Journal of Biomechanical Engineering, 120(2), 281–288.
Hobbelen, D.G.E. and Wisse, M. (2007). Limit Cycle Walking. Humanoid Robots: Human-like Machines, 642–659.
Hobbelen, D.G.E. and Wisse, M. (2008). Swing-leg retraction for limit cycle walkers improves disturbance rejection. Trans. Robotics, 24(2), 377–389.
Hogan, N. (1984). Impedance control: An approach to manipulation. In American Control Conf.
Hyon, S.H. and Cheng, G. (2006). Passivity-based full-body force control for humanoids and application to dynamic balancing and locomotion. In Int. Conf. Intelligent Robots and Systems.
Kober, J., Bagnell, J.A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. Int. Journal of Robotics Research, 32, 1238–1274.
Kober, J. and Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84(1-2), 171–203.
Koditschek, D.E. (1987). Exact robot navigation by means of potential functions: Some topological considerations. In Int. Conf. Robotics and Automation.
McGeer, T. (1990). Passive Dynamic Walking. Int. Journal of Robotics Research, 9(2), 62–82.
Papageorgiou, M. (2012). Optimierung: statische, dynamische, stochastische Verfahren. Springer-Verlag.
Tedrake, R., Zhang, T., and Seung, H. (2004). Stochastic policy gradient reinforcement learning on a simple 3D biped. In Int. Conf. Intelligent Robots and Systems.
Vallery, H., Duschau-Wicke, A., and Riener, R. (2009a). Generalized elasticities improve patient-cooperative control of rehabilitation robots. In Int. Conf. on Rehabilitation Robotics.
Vallery, H., Duschau-Wicke, A., and Riener, R. (2009b). Optimized passive dynamics improve transparency of haptic devices. In Int. Conf. Robotics and Automation.