Passivity-based reinforcement learning control of a 2-DOF manipulator arm

S.P. Nageshrao a,*, G.A.D. Lopes a, D. Jeltsema b, R. Babuška a

a Delft Center for Systems and Control, Delft University of Technology, Mekelweg 2, 2628 CD Delft, The Netherlands
b Delft Institute of Applied Mathematics, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands

* Corresponding author: S.P. Nageshrao.
Article info

Article history: Received 24 September 2013; Revised 13 September 2014; Accepted 26 October 2014; Available online 15 November 2014.

Keywords: Passivity-based control; Port-Hamiltonian systems; Reinforcement learning; Robotics

http://dx.doi.org/10.1016/j.mechatronics.2014.10.005
Abstract

Passivity-based control (PBC) is commonly used for the stabilization of port-Hamiltonian (PH) systems. The PH framework is suitable for multi-domain systems, for example mechatronic devices or micro-electro-mechanical systems. Passivity-based control synthesis for PH systems involves solving partial differential equations, which can be cumbersome. Rather than explicitly solving these equations, in our approach the control law is parameterized and the unknown parameter vector is learned using an actor–critic reinforcement learning algorithm. The key advantages of combining learning with PBC are: (i) the complexity of the control design procedure is reduced, (ii) prior knowledge about the system, given in the form of a PH model, speeds up the learning process, and (iii) physical meaning can be attributed to the learned control law. In this paper we extend the learning-based PBC method to a regulation problem and present experimental results for a two-degree-of-freedom manipulator. We show that the learning algorithm is capable of achieving feedback regulation in the presence of model uncertainties.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

Port-Hamiltonian (PH) modeling of physical systems [1] has found wide recognition and acceptance in the control community. Because of its underlying principle of modularity and its emphasis on physics, the PH framework can be used to model complex multi-domain physical systems such as mechatronic applications [2]. The PH theory originates from port-based network modeling of physical systems with distinct energy-storage elements (e.g., electrical, mechanical, electro-mechanical, chemical, hydrodynamical and thermodynamical systems) [1,2]. A strong property of the port-Hamiltonian formalism is that it highlights the relationship between the energy storage, the dissipation, and the interconnection structure. This underscores the suitability of energy-based control methods such as passivity-based control (PBC) for regulating a port-Hamiltonian system [3].

Passivity-based control, introduced by Ortega and Spong [4], is one of the most widely used nonlinear control methods. Among the many variants of PBC, standard PBC – a combination of energy balancing (EB) and damping injection (DI) – is by far the simplest and most frequently used set-point regulation method. The control objective
is achieved by adding an external energy component to the system. Generally, this component is obtained by solving a set of partial differential equations (PDEs). Much of the literature on PBC for port-Hamiltonian systems concerns simplifying these complex PDEs [5].

Standard PBC strongly relies on the system model. However, for various practical applications, obtaining a precise model of the system is extremely difficult [2]. This makes passivity-based control synthesis hard to apply. There exist alternative control methods that are less dependent, or not dependent at all, on the system model. One such example is reinforcement learning (RL), a semi-supervised stochastic learning control method [6]. In RL, the controller (also called policy or actor) optimizes its behavior by interacting with the system. For each interaction, the actor receives a numerical reward, which is calculated as a function of the system's state transition and the control effort. The objective is to maximize the long-term cumulative reward, approximated by the value function. Hence, most RL algorithms learn a value function from which the control law can be derived [6,7].

Generally, RL algorithms are formulated for discrete state and action spaces. However, as most practical applications require the states and control actions to be continuous, function approximation is often used [7]. In this paper, we employ the actor–critic approach [8], a prominent RL method that can learn a continuous policy for systems with continuous states and actions. The use of RL algorithms to control physical systems has a rich literature spanning over two decades [9–11]. However, a major drawback
in RL is the slow and non-monotonic convergence of the learning algorithm. The absence of information about the system forces the algorithm to explore a large part of the state space before finding an optimal policy. By incorporating partial knowledge of the system, the learning speed can be significantly increased [12].

To reduce the complexity of standard PBC and, at the same time, to improve the learning speed, a novel RL-based method was devised in [13]. The authors parameterized the standard PBC control law for a single-input single-output nonlinear system, and the unknown parameters were learned using the actor–critic algorithm. In this paper, the learning paradigm of [13] is extended to regulate a multiple-input, multiple-output system and is evaluated on a two-degree-of-freedom (DOF) manipulator arm. Additionally, we explore the influence of model and parameter uncertainties on the learning algorithm, and we discuss the choice of the learning parameters.

This paper is organized as follows. In Section 2, we provide the theoretical background on PH systems, standard PBC, and reinforcement learning. In Section 3, the energy-balancing actor–critic (EBAC) algorithm of [13] is explained. Section 4 describes the manipulator setup. Simulation and experimental results are discussed in Sections 5 and 6, respectively. Section 7 concludes the paper.

2. Background
2.1. Port-Hamiltonian systems

Port-Hamiltonian systems [1] can be considered as a generalization of Euler–Lagrangian or Hamiltonian systems. A time-invariant PH system in the standard input-state-output form is given by (we use the notation $\nabla_x$ for the gradient operator $\partial/\partial x$; in this work all vectors, including the gradient of a scalar function, are column vectors)

$$\dot{x} = \big(J(x) - R(x)\big)\nabla_x H(x) + g(x)u, \qquad y = g^T(x)\nabla_x H(x), \qquad (1)$$

where $x \in \mathbb{R}^n$ is the state vector, $J(x) = -J^T(x) \in \mathbb{R}^{n\times n}$ is the skew-symmetric interconnection matrix, $R(x) = R^T(x) \in \mathbb{R}^{n\times n}$ is the symmetric dissipation matrix, and $g(x) \in \mathbb{R}^{n\times m}$ is the input matrix. The signals $y \in \mathbb{R}^m$ and $u \in \mathbb{R}^m$ are called the port variables; they are used to measure and control the system, respectively. The system's total stored energy is given by the Hamiltonian $H(x) \in \mathbb{R}$, obtained by summing the energy stored in all the individual energy-storing elements. For example, in a mechanical system the Hamiltonian is given by the sum of the kinetic and potential energy terms:

$$H(x) = T(x) + V(q) = \tfrac{1}{2}\, p^T M^{-1}(q)\, p + V(q), \qquad (2)$$

where, for $n$ even, the position $q \in \mathbb{R}^{n/2}$ and the momentum $p \in \mathbb{R}^{n/2}$ constitute the system's state vector $x = (q^T, p^T)^T$, and $M(q)$ is the mass-inertia matrix.

Passivity of the port-Hamiltonian system can be shown by using the positive semi-definite dissipation matrix ($R(x) \geq 0$) and a lower-bounded Hamiltonian ($H(x) \geq 0$). With $u^T y$ the power supplied to the system, consider the derivative of the Hamiltonian

$$\dot{H}(x) = \nabla_x H^T \dot{x} = \nabla_x H^T \big(J(x) - R(x)\big)\nabla_x H + \nabla_x H^T g(x) u = -\nabla_x H^T R(x) \nabla_x H + u^T y \leq u^T y. \qquad (3)$$

Eq. (3) is called the differential dissipation inequality; it implies that the change in the system's stored energy is less than the supplied power, the difference being the dissipated component. For the negative feedback

$$u = -K(x)y,$$

where $K(x) = K^T(x) \in \mathbb{R}^{m\times m}$ is a symmetric positive definite damping-injection matrix, inequality (3) becomes

$$\dot{H}(x) \leq -y^T K(x) y.$$

Asymptotic stability at the origin can be inferred by assuming zero-state detectability, i.e.,

$$u = 0,\; y = 0 \;\Rightarrow\; x = 0.$$

However, stabilizing the system at the origin, which corresponds to the open-loop minimum of the energy, is not an attractive control problem. Instead, stabilizing the system at a desired state $x_d$ is of wider practical interest. In the PH framework, set-point regulation is often achieved by standard PBC – a combination of energy balancing (EB) and damping injection (DI) – as elaborated below.

2.2. Standard passivity-based control

Set-point regulation by standard PBC involves finding a state-feedback control law, termed energy balancing, such that the closed-loop Hamiltonian $H_d(x)$ has a (local) minimum at the desired state $x_d$, i.e.,

$$x_d = \arg\min_x H_d(x). \qquad (4)$$

Additionally, asymptotic stability of the closed-loop system can be ensured by introducing a damping-injection term. The resulting control law, due to energy balancing and damping injection, is given by

$$u(x) = u_{es}(x) + u_{di}(x) = \big(g^T(x)g(x)\big)^{-1} g^T(x)\big(J(x) - R(x)\big)\nabla_x H_a(x) - K(x)\, g^T(x)\nabla_x H_d(x), \qquad (5)$$

with $K(x) = K^T(x)$ a user-defined symmetric positive semi-definite damping-injection matrix [3]. The energy-balancing input $u_{es}$ is a function of the pseudo-inverse of the input matrix $g(x)$, i.e., $g^{\dagger}(x) = (g^T g)^{-1} g^T$. The added energy term $H_a(x)$ satisfies the energy-balancing equation [3]

$$H_a(x) = H_d(x) - H(x). \qquad (6)$$

Eq. (6) shows that the desired closed-loop energy $H_d(x)$ is a combination of the stored and the supplied energy. For a given PH system (1), the supplied energy $H_a(x)$ of standard PBC (5) is obtained by solving a set of partial differential equations called the matching condition [5]

$$\begin{bmatrix} g^{\perp}(x)\big(J(x) - R(x)\big) \\ g^T(x) \end{bmatrix}\nabla_x H_a(x) = 0, \qquad (7)$$

where $g^{\perp}(x)$ is the left annihilator matrix of the input matrix $g(x)$, i.e., $g^{\perp}(x)g(x) = 0$. Among the solutions of (7), the one satisfying the equilibrium condition (4) is chosen as the added energy component. Using the matching condition (7) and the energy-balancing condition (6) in the control law (5), the closed-loop dynamics of (1) become

$$\dot{x} = \big(J(x) - R_d(x)\big)\nabla_x H_d(x), \qquad (8)$$

where $R_d(x) = R(x) + g(x)K(x)g^T(x)$ is the desired dissipation matrix.

For many applications the actuators are constrained by saturation. This is addressed by changing the control action (5) to

$$u_{sat}(x) = \mathbb{1}\big(u(x)\big), \qquad (9)$$
where $\mathbb{1} : \mathbb{R}^m \to \mathbb{R}^m$ is the saturation function. When the control input in (5) is within the saturation bounds, the closed-loop dynamics are given by (8). In case the actuator is saturated, the closed-loop dynamics become

$$\dot{x} = \big(J(x) - R(x)\big)\nabla_x H(x) + g(x)\, u_{sat}(x). \qquad (10)$$
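To make the structure of (5) and (9) concrete, the following Python sketch (our illustration, not code from the paper) evaluates the energy-balancing/damping-injection law for a generic PH system, under the assumption that the gradients of $H_a$ and $H_d$ are available, e.g., from an offline solution of the matching condition (7). All function names are placeholders.

```python
import numpy as np

def pbc_control(x, J, R, g, K, grad_Ha, grad_Hd, u_max=None):
    """Standard PBC law (5): energy balancing plus damping injection,
    with an optional hard saturation as in (9).

    J, R, g, K       -- callables returning the PH matrices at the state x
    grad_Ha, grad_Hd -- callables returning the gradients of H_a and H_d at x
    """
    Jx, Rx, gx, Kx = J(x), R(x), g(x), K(x)
    g_pinv = np.linalg.solve(gx.T @ gx, gx.T)   # (g^T g)^{-1} g^T
    u_es = g_pinv @ (Jx - Rx) @ grad_Ha(x)      # energy-balancing term
    u_di = -Kx @ gx.T @ grad_Hd(x)              # damping-injection term
    u = u_es + u_di
    if u_max is not None:
        u = np.clip(u, -u_max, u_max)           # hard-limiting saturation
    return u
```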
The requirement to solve (7) such that the equilibrium condition (4) is satisfied can be extremely challenging. However, by exploiting the dissipation-obstacle property of standard PBC together with parameter-learning techniques, the need to explicitly solve the PDE (7) can be eliminated.

2.3. Parameterized Hamiltonian for standard PBC

Due to the dissipation obstacle, standard PBC can be used if and only if the system can be controlled using a finite amount of added energy $H_a(x)$. This implies that only the dynamics with no explicit dissipation can be regulated. Generally, in mechanical systems the dissipation is associated with the momentum $p$, whereas the position dynamics have no explicit dissipation term. Hence, by using standard PBC one can stabilize the momentum $p$ only at the origin, while the position $q$ can be regulated to any desired state $q_d$. In terms of the system's Hamiltonian, this means that only the potential energy of a mechanical system can be shaped, whereas the kinetic energy remains unaltered. For example, the desired Hamiltonian for a mechanical system consists of the original kinetic energy term $T(x)$ of (2) and the shaped potential energy term $V_d(x)$, i.e.,
$$H_d(x) = T(x) + V_d(x). \qquad (11)$$
For a general physical system, the desired Hamiltonian consists of shapable (s) and non-shapable (ns) components:
$$H_d(x) = H_{ns}(x) + H_s(x). \qquad (12)$$
A procedure to separate the non-shapable and shapable components of the state vector is provided in [13]. Normally, in the standard PBC framework, the shapable component of the desired energy (12) is chosen to be quadratic [3]. Instead, in this work we formulate the desired Hamiltonian as a linearly parameterized function approximator (the "hat" symbol denotes an approximation; e.g., $\hat H_d$ is the approximation of the desired Hamiltonian $H_d$):
$$\hat H_d(x, \xi) = H_{ns}(x) + \xi^T \phi_{es}(x), \qquad (13)$$

where $\xi \in \mathbb{R}^{n_{es}}$ is an unknown parameter vector and $\phi_{es}(x) \in \mathbb{R}^{n_{es}}$ is a user-defined basis function vector.
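As a small illustration of the parameterization (13) for a mechanical system, the sketch below evaluates $\hat H_d(x,\xi)$ with the kinetic energy kept as the non-shapable part and a simple hand-picked basis for the shaped potential energy, chosen to vanish at the desired position (in line with Remark 1 below). The basis choice is an assumption of this example, not prescribed by the method.

```python
import numpy as np

def phi_es(q, q_d):
    """Example energy-shaping basis; each component is zero at q = q_d."""
    e = q - q_d
    return np.array([e[0]**2, e[1]**2, e[0] * e[1]])

def H_d_hat(q, p, xi, q_d, M_inv):
    """Parameterized desired Hamiltonian (13):
    non-shapable kinetic energy plus the learned potential xi^T phi_es(q)."""
    H_ns = 0.5 * p @ M_inv @ p
    return H_ns + xi @ phi_es(q, q_d)

# usage with arbitrary, purely illustrative numbers
q_d = np.array([0.3, -0.2])
xi = np.array([1.0, 1.0, 0.1])
print(H_d_hat(np.array([0.4, -0.1]), np.zeros(2), xi, q_d, np.eye(2)))
```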
Remark 1. By constraining $\phi_{es}(x)$ to be zero at the desired equilibrium $q_d$, one can locally ensure (4) at $x_d = (q_d, 0)$.

In order to completely characterize (13), the unknown parameter vector $\xi = (\xi_1, \xi_2, \ldots, \xi_{n_{es}})^T$ needs to be computed. To this end we use the actor–critic reinforcement learning algorithm, where the optimal parameters are found by gradient descent.

2.4. Reinforcement learning

Reinforcement learning (RL) is a stochastic, semi-supervised, model-free learning method. In RL, the controller achieves the desired optimal behavior through interaction with the plant. At every sampling instant, the controller senses the plant state $x_k$ and applies the control action $u(x_k)$, which is generally a nonlinear
feedback law. At the next time step, the plant responds to the input by a state transition to a new state $x_{k+1}$ and a numerical reward $r_{k+1}$. The control objective in RL is to maximize the long-term cumulative reward, called the return. The expected value of the return is captured by a value function [6]

$$V(x_k) = \sum_{i=0}^{\infty} \gamma^i \rho\big(x_{k+i+1}, u(x_{k+i})\big) = \sum_{i=0}^{\infty} \gamma^i r_{k+i+1}, \qquad (14)$$
where $k$ and $i$ are time indices, $0 < \gamma < 1$ is the discount factor, and $\rho(x, u)$ is a user-defined, problem-specific reward function, which provides the instantaneous reward $r$.

If the plant under consideration is known, has discrete state and finite action spaces, and additionally satisfies the Markov property, then the value function can be written in a recursive form called the Bellman equation; for its definition and properties see [6]. In the recursive formulation, the value function (14) can be readily solved, and from this value function an optimal control action can be obtained by a one-step search. Alternatively, there exist RL methods that directly search for the optimal control law. Depending on whether the RL algorithm searches for a value function, for a control law, or for both, RL methods can be broadly classified into three subcategories [8]:

- Actor-only algorithms directly search for an optimal control law.
- Critic-only methods first learn an optimal value function, from which the control law is obtained by one-step optimization.
- Actor–critic algorithms explicitly search for an optimal control law – the actor. Additionally, the critic learns the value function and provides an evaluation of the controller's performance.

2.5. Actor–critic scheme

Many RL methods are developed for systems with finite, discrete state and action spaces. However, most physical systems operate on a continuous state space and the control law also needs to be continuous. This issue is often addressed by using function approximation; for methods and examples see [6,7].

The actor–critic scheme consists of two independent components. The value function (critic) (14) is approximated using a parameter vector $\theta \in \mathbb{R}^{n_c}$ and a user-defined basis function vector $\phi_c(x) \in \mathbb{R}^{n_c}$ as

$$\hat V^{\hat\pi}(x, \theta) = \theta^T \phi_c(x). \qquad (15)$$

Similarly, using a parameter vector $\vartheta \in \mathbb{R}^{n_a}$ and a user-defined basis function vector $\phi_a(x) \in \mathbb{R}^{n_a}$, the policy $\hat\pi(x, \vartheta)$ (the actor) is approximated as

$$\hat\pi(x, \vartheta) = \vartheta^T \phi_a(x). \qquad (16)$$

With the approximated value function and policy, the actor–critic (AC) method can efficiently learn a continuous control action for a given continuous-state system. In terms of AC, the reinforcement learning objective can be expressed as: find an optimal policy $\hat\pi(x, \vartheta)$ such that, for each state $x$, the discounted cumulative reward $\hat V^{\hat\pi}(x, \theta)$ is maximized.

The unknown critic parameters are updated using the gradient-ascent rule
$$\theta_{k+1} = \theta_k + \alpha_c\, \delta_{k+1}\, \nabla_\theta \hat V(x_k, \theta_k), \qquad (17)$$

where $\alpha_c$ is the update rate and $\delta_{k+1}$ is the temporal difference [6]

$$\delta_{k+1} = r_{k+1} + \gamma \hat V(x_{k+1}, \theta_k) - \hat V(x_k, \theta_k). \qquad (18)$$
Remark 2. The rate of convergence can be increased by using an eligibility trace $e_k \in \mathbb{R}^{n_c}$ [6]. The critic parameter update rule with eligibility traces is

$$e_{k+1} = \gamma\lambda e_k + \nabla_\theta \hat V(x_k, \theta_k), \qquad \theta_{k+1} = \theta_k + \alpha_c\, \delta_{k+1}\, e_{k+1}, \qquad (19)$$

where $\lambda \in [0, 1]$ is the trace-decay rate.

Using a zero-mean Gaussian noise $\Delta u_k$ as an exploration term, the control input to the system is

$$u_k = \hat\pi(x_k, \vartheta_k) + \Delta u_k. \qquad (20)$$

The policy parameter vector $\vartheta$ is moved in the direction of the exploration term if the resulting temporal difference $\delta_{k+1}$ of (18), due to the control input (20), is positive; otherwise it is moved in the opposite direction. The parameter update rule, in terms of the update rate $\alpha_a$, is

$$\vartheta_{k+1} = \vartheta_k + \alpha_a\, \delta_{k+1}\, \Delta u_k\, \nabla_\vartheta \hat\pi(x_k, \vartheta_k). \qquad (21)$$

Remark 3. Similar to the critic, eligibility traces can also be used for the actor parameter update. However, for the sake of simplicity they are not used in the current work.
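The update equations (17)–(21) condense into a single learning step per sample. The sketch below is our illustration for a scalar control input with the linear approximators (15) and (16); phi_c and phi_a are user-supplied feature functions, and the step follows (18)–(21) with an eligibility trace for the critic only (cf. Remark 3).

```python
import numpy as np

def explore_action(x, vartheta, phi_a, sigma2=1.0 / 3.0):
    """Exploratory control input (20): linear policy (16) plus Gaussian noise."""
    du = np.random.normal(0.0, np.sqrt(sigma2))
    return vartheta @ phi_a(x) + du, du

def actor_critic_step(x, x_next, r, theta, vartheta, e, du, phi_c, phi_a,
                      gamma=0.97, lam=0.65, alpha_c=0.01, alpha_a=1e-2):
    """One actor-critic update: TD error (18), critic trace update (19),
    and actor update (21) driven by the exploration term du."""
    # temporal difference (18); for the linear critic, grad_theta V_hat = phi_c(x)
    delta = r + gamma * theta @ phi_c(x_next) - theta @ phi_c(x)
    # eligibility trace and critic update (19)
    e = gamma * lam * e + phi_c(x)
    theta = theta + alpha_c * delta * e
    # actor update (21); for the linear policy (16), grad_vartheta pi_hat = phi_a(x)
    vartheta = vartheta + alpha_a * delta * du * phi_a(x)
    return theta, vartheta, e, delta
```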
3. Energy-balancing actor–critic

Using the parameterized energy function (13) and (6) in (5), we obtain the control policy in terms of the unknown parameter vector $\xi$ and the basis functions $\phi_{es}(x)$ as

$$\begin{aligned}
u(x, \xi) &= g^{\dagger}(x)\big(J(x) - R(x)\big)\big(\nabla_x \hat H_d(x, \xi) - \nabla_x H(x)\big) - K(x)\, g^T(x)\nabla_x \hat H_d(x, \xi) \\
&= g^{\dagger}(x)\big(J(x) - R(x)\big)\Big(\Big(\tfrac{\partial \phi_{es}}{\partial x}(x)\Big)^T \xi - \nabla_x H_s(x)\Big) - K(x)\, g^T(x)\nabla_x \hat H_d(x, \xi) \\
&= \hat\pi(x, \xi),
\end{aligned} \qquad (22)$$

where the policy $\hat\pi(x, \xi)$ is also a function of the user-defined damping-injection matrix $K(x)$. We consider the damping-injection matrix $K(x)$ as an additional degree of freedom. It is parameterized using an unknown parameter vector $\psi$ and a user-defined basis function vector $\phi_{di}(x) \in \mathbb{R}^{n_{di}}$ as

$$\big[\hat K(x, \psi)\big]_{ij} = \sum_{l=1}^{n_{di}} [\psi_{ij}]_l\, [\phi_{di}(x)]_l, \qquad (23)$$

where $\psi_{ij} \in \mathbb{R}^{n_{di}}$ is constrained to verify

$$\psi_{ij} = \psi_{ji}, \qquad (24)$$

so that the required symmetry condition of $K(x)$ is satisfied. Using the approximated matrix $\hat K(x, \psi)$ in (22) results in two unknown parameter vectors, namely $\xi$ (with $n_{es}$ unknown entries) and $\psi$ (with $m(m+1)n_{di}/2$ unknown entries). These parameter values are learned by the energy-balancing actor–critic (EBAC) Algorithm 1 [13]. A block diagram representation of the control algorithm is given in Fig. 1.

[Fig. 1. Block diagram representation of the energy-balancing actor–critic (EBAC) algorithm: the actor–critic learning algorithm receives the reward from the cost function ρ(x, u) and the state x from the plant, and adapts the parameters ξ, ψ of the control law u(x, ξ, ψ).]

Algorithm 1. Energy-balancing actor–critic.

Input: system (1), λ, γ, an actor learning rate α_a for each actor (i.e., α_{a,ξ} and α_{a,ψ}), and α_c for the critic.
1: e_0(x) = 0 for all x
2: Initialize θ_0, ξ_0, ψ_0
3: for the number of trials do
4:   Initialize x_0
5:   k ← 1
6:   loop until the number of samples
7:     Execute: draw an action using (22), apply the control input u_k = 𝟙(π̂(x_k, ξ_k, ψ_k) + Δu_k) to (1), observe the next state x_{k+1} and the reward r_{k+1} = ρ(x_{k+1})
8:     Temporal difference:
9:       δ_{k+1} = r_{k+1} + γ V̂(x_{k+1}, θ_k) − V̂(x_k, θ_k)
10:    Critic update:
11:    for i = 1, …, n_c do
12:      e_{i,k+1} = γλ e_{i,k} + ∇_{θ_{i,k}} V̂(x_k, θ_k)
13:      θ_{i,k+1} = θ_{i,k} + α_c δ_{k+1} e_{i,k+1}
14:    end for
15:    Actor update:
16:    for i = 1, …, n_es do
17:      ξ_{i,k+1} = ξ_{i,k} + α_{a,ξ} δ_{k+1} Δu_k ∇_{ξ_{i,k}} 𝟙(π̂(x_k, ξ_k, ψ_k))
18:    end for
19:    for i = 1, …, m(m+1)n_di/2 do
20:      ψ_{i,k+1} = ψ_{i,k} + α_{a,ψ} δ_{k+1} Δu_k ∇_{ψ_{i,k}} 𝟙(π̂(x_k, ξ_k, ψ_k))
21:    end for
22:  end loop
23: end for

Remark 4. In Algorithm 1 the gradient of the policy in (21) is replaced by the gradient of the saturated policy (9) as

$$\nabla_\eta \mathbb{1}(\hat\pi) = \nabla_{\hat\pi}\mathbb{1}(\hat\pi)\, \nabla_\eta \hat\pi, \qquad (25)$$

where $\eta$ stands for the unknown parameters $\xi$ or $\psi$. Note that the policy (22) is linear in the parameters $\xi$ and $\psi$, hence the gradient $\nabla_\eta \hat\pi$ can be easily obtained.

Remark 5. The saturation function $\mathbb{1}$ is problem specific. In the current task we consider a hard-limiting saturation defined as

$$\mathbb{1}(u_k) = \begin{cases} u_k & \text{if } |u_k| \leq u_{max}, \\ \operatorname{sgn}(u_k)\, u_{max} & \text{otherwise}. \end{cases} \qquad (26)$$

Since the control input is limited to the range $[-1, +1]$, the gradient $\nabla_{\hat\pi}\mathbb{1}(\hat\pi)$ is zero when the controller is saturated.
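For concreteness, the following sketch outlines the main loop of Algorithm 1 for a single-input system (the 2-DOF arm of Section 4 has two inputs, in which case the exploration term and the actor updates become vector valued). The environment interface env.reset/env.step and the function pi_hat, which must return the unsaturated policy (22) together with its gradients with respect to ξ and ψ, are placeholders to be supplied by the user; the chain rule (25) is implemented by skipping the actor update when the input saturates.

```python
import numpy as np

def ebac_train(env, pi_hat, phi_c, n_c, n_xi, n_psi,
               trials=100, steps=300, gamma=0.97, lam=0.65,
               alpha_c=0.01, alpha_xi=1e-2, alpha_psi=1e-8,
               sigma2=1.0 / 3.0, u_max=1.0):
    """Sketch of the EBAC loop (Algorithm 1), single-input case."""
    theta, xi, psi = np.zeros(n_c), np.zeros(n_xi), np.zeros(n_psi)
    for _ in range(trials):
        x = env.reset()
        e = np.zeros(n_c)                                  # eligibility trace
        for _ in range(steps):
            u_pi, grad_xi, grad_psi = pi_hat(x, xi, psi)   # policy (22) and gradients
            du = np.random.normal(0.0, np.sqrt(sigma2))    # exploration (20)
            u = np.clip(u_pi + du, -u_max, u_max)          # saturation (26)
            x_next, r = env.step(u)
            delta = r + gamma * theta @ phi_c(x_next) - theta @ phi_c(x)  # TD (18)
            e = gamma * lam * e + phi_c(x)                 # trace update (19)
            theta += alpha_c * delta * e                   # critic update
            if abs(u_pi + du) <= u_max:                    # gradient of the saturation
                xi += alpha_xi * delta * du * grad_xi      # is zero when saturated (25)
                psi += alpha_psi * delta * du * grad_psi
            x = x_next
    return theta, xi, psi
```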
4. Manipulator arm

The EBAC Algorithm 1 is used for set-point regulation of a fully actuated 2-DOF manipulator arm. The schematic and the physical setup used in this work are shown in Fig. 2. The manipulator has two links, each characterized by a link length $l_i$, mass $m_i$, center of gravity $r_i$, and moment of inertia $I_i$, where $i \in \{1, 2\}$. The arm can operate either in the vertical or in the horizontal plane. Here we give only the equations of motion for the vertical plane. For the horizontal plane, the potential energy terms in (2)
are neglected. The system Hamiltonian is given by (2) with $q = (q_1, q_2)^T$, $p = (p_1, p_2)^T = M(q)\dot q$, and the system's state vector $x = (q^T, p^T)^T$. The mass-inertia matrix $M(q)$ is

$$M(q) = \begin{bmatrix} C_1 + C_2 + 2C_3\cos(q_2) & C_2 + C_3\cos(q_2) \\ C_2 + C_3\cos(q_2) & C_2 \end{bmatrix} \qquad (27)$$

and the potential energy $V(q)$ is
$$V(q) = C_4\sin(q_1) + C_5\sin(q_1 + q_2). \qquad (28)$$

The constants $C_1, \ldots, C_5$ are defined as

$$C_1 = m_1 r_1^2 + m_2 l_1^2 + I_1, \quad C_2 = m_2 r_2^2 + I_2, \quad C_3 = m_2 l_1 r_2, \quad C_4 = m_1 g (r_1 + l_1), \quad C_5 = m_2 g r_2.$$

The equations of motion for the manipulator arm in PH form (1) are
$$\begin{bmatrix}\dot q \\ \dot p\end{bmatrix} = \left(\begin{bmatrix}0 & I \\ -I & 0\end{bmatrix} - \begin{bmatrix}0 & 0 \\ 0 & R_{22}\end{bmatrix}\right)\begin{bmatrix}\nabla_q H(x) \\ \nabla_p H(x)\end{bmatrix} + \begin{bmatrix}0 \\ g_{21}\end{bmatrix}u, \qquad y = \begin{bmatrix}0 & g_{21}^T\end{bmatrix}\begin{bmatrix}\nabla_q H(x) \\ \nabla_p H(x)\end{bmatrix}, \qquad (29)$$

where $R_{22}$ is the dissipation matrix in terms of the friction coefficient $\mu$,

$$R_{22} = \begin{bmatrix}\mu & 0 \\ 0 & \mu\end{bmatrix},$$

and $g_{21}$ is the input matrix in terms of the gear ratio $g_r$ and the scaling factor $b$,

$$g_{21} = \begin{bmatrix}g_r b & 0 \\ 0 & g_r b\end{bmatrix}.$$
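The PH model (27)–(29) can be assembled numerically as in the following sketch, which uses the parameter values of Table 1 and a gravitational acceleration of 9.81 m/s² (an assumption of the example). It returns the open-loop state derivative and can be integrated with any ODE solver.

```python
import numpy as np

# parameters from Table 1
l1, r1, r2 = 18.5e-2, 11.05e-2, 12.3e-2
m1, m2 = 0.5, 0.5
I1, I2 = 5e-3, 5e-3
g_r, b, mu = 193.0, 3.74e-2, 0.8738
grav = 9.81  # assumed gravitational acceleration

C1 = m1 * r1**2 + m2 * l1**2 + I1
C2 = m2 * r2**2 + I2
C3 = m2 * l1 * r2
C4 = m1 * grav * (r1 + l1)
C5 = m2 * grav * r2

R22 = mu * np.eye(2)
g21 = g_r * b * np.eye(2)

def M(q):
    """Mass-inertia matrix (27)."""
    c2 = np.cos(q[1])
    return np.array([[C1 + C2 + 2 * C3 * c2, C2 + C3 * c2],
                     [C2 + C3 * c2,          C2]])

def grad_H(x):
    """Gradient of H(x) = 0.5 p^T M^{-1}(q) p + V(q), with V(q) from (28)."""
    q, p = x[:2], x[2:]
    Minv = np.linalg.inv(M(q))
    s2 = np.sin(q[1])
    dM_dq2 = np.array([[-2 * C3 * s2, -C3 * s2],
                       [-C3 * s2,      0.0]])
    dT_dq2 = -0.5 * p @ Minv @ dM_dq2 @ Minv @ p   # q2-dependence of the kinetic energy
    dH_dq = np.array([C4 * np.cos(q[0]) + C5 * np.cos(q[0] + q[1]),
                      C5 * np.cos(q[0] + q[1]) + dT_dq2])
    dH_dp = Minv @ p
    return np.concatenate([dH_dq, dH_dp])

def ph_dynamics(x, u):
    """Port-Hamiltonian dynamics (29): x_dot = (J - R) grad_H(x) + g u."""
    J_minus_R = np.block([[np.zeros((2, 2)), np.eye(2)],
                          [-np.eye(2),       -R22]])
    g = np.vstack([np.zeros((2, 2)), g21])
    return J_minus_R @ grad_H(x) + g @ u
```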
[Fig. 2. A two degree of freedom manipulator arm.]

The system parameters of (29) are given in Table 1.

Table 1. Manipulator arm parameters.
  Length of link 1:             l1 = 18.5 × 10⁻² m
  Center of mass of link 1:     r1 = 11.05 × 10⁻² m
  Center of mass of link 2:     r2 = 12.3 × 10⁻² m
  Mass of link 1:               m1 = 0.5 kg
  Mass of link 2:               m2 = 0.5 kg
  Moment of inertia of link 1:  I1 = 5 × 10⁻³ kg m²
  Moment of inertia of link 2:  I2 = 5 × 10⁻³ kg m²
  Gear ratio:                   g_r = 193
  Scaling factor:               b = 3.74 × 10⁻²
  Friction coefficient:         μ = 0.8738

5. Application of EBAC to 2-DOF arm

A state-feedback control law that stabilizes the manipulator arm at any desired state $x_d = (\bar q_1, \bar q_2, 0, 0)^T$ is learned by using the EBAC Algorithm 1 of Section 3. The reward function $\rho$ for the algorithm is formulated such that the reward is maximal at the desired state $x_d$ and a penalty is incurred everywhere else:

$$\rho(q, p) = \begin{bmatrix}\cos(q_1 - \bar q_1) - 1 \\ \cos(q_2 - \bar q_2) - 1\end{bmatrix}^T \begin{bmatrix}Q_r \\ Q_r\end{bmatrix} - \begin{bmatrix}p_1 \\ p_2\end{bmatrix}^T \begin{bmatrix}P_r & 0 \\ 0 & P_r\end{bmatrix}\begin{bmatrix}p_1 \\ p_2\end{bmatrix}, \qquad (30)$$

where $Q_r = 25$ and $P_r = 10^4$ are the weighting constants. The cosine function is used in (30) to take into account the fact that the angle measurements wrap around $2\pi$. When the system is at the desired equilibrium $x_d = (\bar q_1, \bar q_2, 0, 0)$ there is no penalty; elsewhere there is a negative reward proportional to the error between the system state and the desired state $x_d$.

The simulation parameters and the system bounds for the EBAC Algorithm 1 are given in Tables 2 and 3, respectively. Although there are many parameters to be set ($\gamma$, $\lambda$, $\alpha_a$, $\alpha_c$, etc.), it must be noted that only the learning rates of the actor and the critic need to be chosen carefully; the rest are nominal values that can easily be obtained from the literature. An improper learning rate may result in poor or no learning. To obtain feasible learning rates, a typical approach followed in the RL community is gridding: the $\alpha$'s are gridded over a suitable range and for each combination the learning process is repeated until a satisfactory result is obtained.

We use Fourier basis functions [14] for the critic and actor approximations (15) and (16), respectively. The $i$th component of the cosine basis function is

$$\phi_i(x) = \cos(\pi c_i x), \qquad (31)$$

with the frequency multiplier $c_i \in \mathbb{Z}^n$. The cosine basis function can exploit the symmetry in the state and action space, so fewer basis functions are needed than with radial basis functions (RBFs) or polynomial basis functions. This reduces the number of parameters to be learned, and hence results in a relatively faster convergence of the learning algorithm [13]. In order to guarantee equal weighting of all the system states, the states are scaled to a uniform range $x \in [-1, 1]$. In addition, $x$ is mapped to zero at $x = x_d$, which satisfies the equilibrium requirement of Eq. (4).

Remark 6. Due to the physical constraints of the plant, the system states are hard-limited to the ranges $q \in [q_{min}, q_{max}]$ and $p \in [p_{min}, p_{max}]$.
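The Fourier features (31) and the reward (30) translate directly into code. In the sketch below, the enumeration of the frequency vectors c_i and the assumption that the state has already been scaled to [−1, 1] (with x_d mapped to zero) are choices of the example; the weights Q_r and P_r are those reported above.

```python
import numpy as np
from itertools import product

def fourier_basis(order, dim):
    """Cosine Fourier basis (31): one feature per frequency vector
    c_i in {0, ..., order}^dim, applied to a state scaled to [-1, 1]."""
    C = np.array(list(product(range(order + 1), repeat=dim)), dtype=float)
    def phi(x_scaled):
        return np.cos(np.pi * C @ x_scaled)
    return phi

def reward(q, p, q_des, Q_r=25.0, P_r=1e4):
    """Reward (30): zero at the desired state, negative elsewhere;
    the cosine handles the 2*pi wrap-around of the angles."""
    cos_err = np.cos(q - q_des) - 1.0
    return Q_r * np.sum(cos_err) - P_r * (p @ p)

# a 2nd-order basis over the 2-D position gives (2 + 1)^2 = 9 features,
# matching the 9 learnable energy-shaping parameters used below
phi_es = fourier_basis(order=2, dim=2)
print(phi_es(np.zeros(2)).shape)   # (9,)
```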
Table 2. Learning parameters.
  Trials:                        100
  Trial duration:                T_t = 3 s
  Sample time:                   T_s = 0.01 s
  Decay rate:                    γ = 0.97
  Eligibility trace:             λ = 0.65
  Exploration variance:          σ² = 1/3
  Learning rate for the critic:  α_c = 0.01
  Learning rate for V̂_d(q, ξ):   α_{a,ξ} = 1 × 10⁻²
  Learning rate for K̂(x, ψ):     α_{a,ψ} = 1 × 10⁻⁸
Table 3. Bounds on the system states and input.
  Max control input:   u_max = 1
  Min control input:   u_min = −1
  Maximum angle:       q_max = 5π/12 rad
  Minimum angle:       q_min = −5π/12 rad
  Maximum momentum:    p_max = 2π × 10⁻² kg rad/s
  Minimum momentum:    p_min = −2π × 10⁻² kg rad/s
6. Simulation and experimental results

The shaped energy term (13) is parameterized using 2nd-order Fourier basis functions, resulting in 9 learnable parameters $\xi_1, \ldots, \xi_9$. Along similar lines, the damping-injection term (23) and the value function (15) are approximated using 2nd-order Fourier basis functions, resulting in 243 and 81 learnable parameters for the actor and the critic, respectively. For the 2-DOF manipulator system (29), using the energy balance (13) and damping injection (23), the control action (22) is
$$\begin{bmatrix}u_1(x, \xi, \psi) \\ u_2(x, \xi, \psi)\end{bmatrix} = \begin{bmatrix}-\frac{1}{g_r b}\big(\xi^T\nabla_{q_1}\phi_{es}(q) - \nabla_{q_1}H_s(q)\big) \\ -\frac{1}{g_r b}\big(\xi^T\nabla_{q_2}\phi_{es}(q) - \nabla_{q_2}H_s(q)\big)\end{bmatrix} - \begin{bmatrix}\psi_{11}^T\phi_{di}(x) & \psi_{12}^T\phi_{di}(x) \\ \psi_{21}^T\phi_{di}(x) & \psi_{22}^T\phi_{di}(x)\end{bmatrix}\begin{bmatrix}\nabla_{p_1}H_{ns}(x) \\ \nabla_{p_2}H_{ns}(x)\end{bmatrix}, \qquad (32)$$

with $\psi_{12} = \psi_{21}$.
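Given the basis functions and their gradients, a control law of the form (32) can be evaluated as in the sketch below. The routines grad_phi_es, phi_di, grad_V and grad_p_Hns are placeholders; φ_es is assumed to depend on the positions only and H_s is taken to be the potential energy (28), in line with Section 2.3.

```python
import numpy as np

def ebac_control(x, xi, psi, grad_phi_es, phi_di, grad_V, grad_p_Hns,
                 gr_b, u_max=1.0):
    """Learned EBAC control law of the form (32) for the 2-DOF arm.

    grad_phi_es(q) -- (n_es, 2) Jacobian of the energy-shaping basis
    phi_di(x)      -- (n_di,) damping-injection basis
    psi            -- (2, 2, n_di) parameter array with psi[0, 1] == psi[1, 0]
    """
    q = x[:2]
    # energy-balancing part: -(1/(g_r b)) (xi^T dphi_es/dq - dH_s/dq)
    u_es = -(xi @ grad_phi_es(q) - grad_V(q)) / gr_b
    # damping-injection part: -K_hat(x, psi) grad_p H_ns(x), with K_hat from (23)
    K_hat = np.tensordot(psi, phi_di(x), axes=([2], [0]))   # (2, 2) matrix
    u_di = -K_hat @ grad_p_Hns(x)
    return np.clip(u_es + u_di, -u_max, u_max)
```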
[Fig. 4. Comparison: simulation and experimental results, θ1 and u1.]
Using the parameterized control input (32) as the policy $\hat\pi(x, \xi, \psi)$, the EBAC Algorithm 1 of Section 3 is simulated in Matlab. The simulation was run for 100 trials, each of 3 s. This procedure is repeated 50 times, and the resulting mean, minimum, maximum, and confidence region of the learning curve are plotted in Fig. 3. As evident from the average learning curve in Fig. 3, the algorithm shows good convergence: it takes around 10 trials (i.e., 30 s) to obtain a near-optimal policy. During the initial stage a dip in the learning curve is visible; this is due to the zero-initialization of the unknown parameters θ, ξ, and ψ, which is too optimistic compared to the learned solution. Once the algorithm has converged with sufficient accuracy, the exploration is reduced by setting Δu_k = 0.05, resulting in a considerable jump in the learning curve at 3 min and 45 s into the simulation.

Following a similar procedure, the parameters of the EBAC control law (32) were learned experimentally on the physical setup. The evaluation of the learned controllers is shown in Figs. 4 and 5. The two learned control laws, one obtained in simulation and the other on the physical setup, provide comparable performance.

A major drawback of standard PBC is its model and parameter dependency, i.e., under model and parameter uncertainties the resulting ES-DI controller might not be able to achieve zero steady-state error [15]. This issue can be straightforwardly handled by the EBAC Algorithm 1 due to its learning capabilities. We evaluate this by intentionally considering an incorrect system Hamiltonian, obtained by neglecting the potential energy term V(q) in (2). The physical difference is that the arm operates in the vertical plane, whereas the horizontal plane of operation was assumed when designing the control law.
[Fig. 3. Results of the EBAC method for 50 learning simulations: sum of rewards per trial versus time in minutes (k denotes 10³); shown are the mean, the max and min bounds, and the 95% confidence region for the mean.]

[Fig. 5. Comparison: simulation and experimental results, θ2 and u2.]
[Fig. 6. Comparison: standard PBC (ES-DI) and learning control via EBAC (θ1 in rad versus time in s).]
Fig. 6 illustrates the comparison between standard PBC and EBAC for an imprecise system Hamiltonian: the standard PBC results in a small steady-state error, whereas the EBAC Algorithm 1 successfully compensates for the modeling error. We have also evaluated the EBAC algorithm under parameter uncertainties, by using incorrect mass and length values in the control law (32). The EBAC algorithm was able to compensate and learn an optimal control law with zero steady-state error. However, the learning algorithm was found to be sensitive to variations in the friction coefficient.

7. Discussion and conclusions
Based on the observations in the current work and the results of [13], we can summarize the general advantages of combining learning with PBC as follows:

- Learning (EBAC) can avoid solving complex nonlinear equations; in this paper we do not solve the PDE (7) explicitly.
- RL allows for the local specification of the stabilization or regulation objective, i.e., condition (4) can be easily satisfied for (13) (see Remark 1), whereas in standard PBC one must specify the global desired Hamiltonian.
- The use of prior information in the form of a PH model increases the learning speed. This has been experimentally observed for the stabilization of a simple pendulum in [13] and for the regulation of a manipulator arm in this work.
- Robustness against model and parameter uncertainty can be achieved thanks to learning.
- Nonlinearities such as control saturation can be easily handled by RL.

In this paper we have evaluated the EBAC method for the regulation of a 2-DOF manipulator arm. The standard PBC control law was systematically parameterized and the unknown parameters were learned using the actor–critic algorithm. By virtue of this, an optimal control law can be obtained using EBAC. Because of its learning capabilities, we have observed that EBAC is robust against model and parameter uncertainty.

However, learning for PH systems introduces a few notable challenges. For example, exploration, an integral part of actor–critic techniques (and hence also of EBAC), may not be feasible when it is too dangerous to explore the system's state space, particularly in safety-critical applications. Like many RL algorithms, EBAC is also affected by the curse of dimensionality. Moreover, a proof of convergence of the proposed learning algorithm is yet to be given. Nevertheless, we see great potential in integrating learning algorithms within the PH framework, as it considerably simplifies the control design process.

Future work includes the extension of the learning method to the interconnection and damping assignment PBC setting, and the explicit inclusion of complex control specifications and performance criteria in the reward function. Additionally, we plan to explore
compact parameterizations of the desired Hamiltonian and damping matrices for higher-order systems.

References

[1] Schaft A. L2-gain and passivity techniques in nonlinear control. New York: Springer-Verlag; 2000. ISBN 978-1-4471-0507-7.
[2] Duindam V, Macchelli A, Stramigioli S. Modeling and control of complex physical systems. Berlin Heidelberg, Germany: Springer; 2009. ISBN 978-3-642-03196-0.
[3] Ortega R, van der Schaft AJ, Mareels I, Maschke B. Putting energy back in control. IEEE Control Syst 2001;21(2):18–33.
[4] Ortega R, Spong MW. Adaptive motion control of rigid robots: a tutorial. Automatica 1989;25(6):877–88.
[5] Ortega R, van der Schaft A, Castaños F, Astolfi A. Control by interconnection and standard passivity-based control of port-Hamiltonian systems. IEEE Trans Autom Control 2008;53(11):2527–42.
[6] Sutton RS, Barto AG. Reinforcement learning: an introduction, vol. 1. Cambridge Univ. Press; 1998.
[7] Busoniu L, Babuska R, De Schutter B, Ernst D. Reinforcement learning and dynamic programming using function approximators. CRC Press; 2010.
[8] Grondman I, Busoniu L, Lopes GAD, Babuska R. A survey of actor–critic reinforcement learning: standard and natural policy gradients. IEEE Trans Syst, Man, Cybern, Part C: Appl Rev 2012;42(6):1291–307.
[9] Sutton RS, Barto AG, Williams RJ. Reinforcement learning is direct adaptive optimal control. IEEE Control Syst 1992;12(2):19–22.
[10] Peters J, Schaal S. Policy gradient methods for robotics. In: 2006 IEEE/RSJ international conference on intelligent robots and systems. IEEE; 2006. p. 2219–25.
[11] Deisenroth M, Rasmussen CE. PILCO: a model-based and data-efficient approach to policy search. In: Proceedings of the 28th international conference on machine learning (ICML-11); 2011. p. 465–72.
[12] Grondman I, Vaandrager M, Busoniu L, Babuska R, Schuitema E. Efficient model learning methods for actor–critic control. IEEE Trans Syst, Man, Cybern, Part B: Cybern 2012;42(3):591–602.
[13] Sprangers O, Babuska R, Nageshrao SP, Lopes GAD. Reinforcement learning for port-Hamiltonian systems. IEEE Trans Cybern 2014. http://dx.doi.org/10.1109/TCYB.2014.2343194.
[14] Konidaris G, Osentoski S, Thomas P. Value function approximation in reinforcement learning using the Fourier basis. In: Computer Science Department Faculty Publication Series; 2008. p. 101.
[15] Dirksz DA, Scherpen JM. Structure preserving adaptive control of port-Hamiltonian systems. IEEE Trans Autom Control 2012;57(11):2880–5.