Neural Network-Based Online H∞ Control for Discrete-Time Affine Nonlinear Systems Using Adaptive Dynamic Programming

Chunbin Qin a,b, Huaguang Zhang c,d,∗, Yingchun Wang c, Yanhong Luo c

a The College of Computer and Information Engineering, Henan University, Kaifeng, Henan, 475004, China
b The College of Environment and Planning, Henan University, Kaifeng, Henan, 475004, China
c The College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110004, China
d The State Key Laboratory of Synthetical Automation for Process Industries (Northeastern University), Shenyang, 110819, China
Abstract

In this paper, the problem of H∞ control design for affine nonlinear discrete-time systems is addressed by using adaptive dynamic programming (ADP). First, the nonlinear H∞ control problem is transformed into a two-player zero-sum differential game of the nonlinear system. Then, the critic, action and disturbance networks are designed by using neural networks to solve online the Hamilton-Jacobi-Isaacs (HJI) equation associated with the two-player zero-sum differential game. When the novel weight update laws for the critic, action and disturbance networks are tuned online by using data generated in real time along the system trajectories, it is shown by Lyapunov techniques that the system states and all neural network weight estimation errors are uniformly ultimately bounded. Further, it is shown that the output of the action network approaches the optimal control input with a small bounded error, and the output of the disturbance network approaches the worst-case disturbance with a small bounded error. Finally, simulation results are presented to demonstrate the effectiveness of the new ADP-based method.

Keywords: H∞ control, Adaptive dynamic programming, Neural networks, Nonlinear discrete-time systems, Two-player zero-sum game
∗ Corresponding author: Huaguang Zhang. Email: [email protected]
1. Introduction

It is well known that the control performance of practical systems is often affected by the presence of unknown disturbances such as measurement noise, input disturbances and other exogenous signals, which invariably occur in most applications because of plant interactions with the environment. H∞ control is one of the most powerful control methods for attenuating the effect of disturbances in dynamical systems [1]. The formulation of H∞ control for dynamical systems was studied in the framework of Hamilton-Jacobi equations by Van der Schaft [2] and Isidori and Astolfi [3]. It is worth noting that conditions for the existence of smooth solutions of the Hamilton-Jacobi equation were studied through invariant manifolds of Hamiltonian vector fields and the relation with the Hamiltonian matrices of the corresponding Riccati equation for the linearized problem [2]. Some of these conditions were relaxed into critical and noncritical cases by Isidori and Astolfi [3]. Later, Basar and Bernhard [4] showed that the H∞ control problem can be posed as a zero-sum two-player differential game, in which the control input is a minimizing player and the unknown disturbance is a maximizing player. Although the formulation of nonlinear H∞ control theory has been well developed, the main bottleneck for its practical application is the need to solve the Hamilton-Jacobi-Isaacs (HJI) equation, which is difficult or impossible to solve and may not have global analytic solutions [5]. Therefore, solving the HJI equation remains a challenge.

Over the past decades, some methods have been proposed to solve the HJI equation [6, 7, 8]. The smooth solution of the HJI equation has been determined directly by solving for the coefficients of the Taylor series expansion of the value function in a very efficient manner, as presented in [6]. Beard and McLain [7] proposed an iterative policy to successively solve the HJI equation by breaking the nonlinear differential equation into a sequence of linear differential equations. On the basis of [6] and [7], a similar iterative policy was proposed in [8] for the HJI equation of nonlinear systems with input constraints.

In recent years, adaptive dynamic programming (ADP) [9, 10, 11, 12, 13] has appeared to be a promising methodology for solving H∞ control problems [15, 16, 17, 18, 19, 20, 21, 22, 23].
Adaptive dynamic programming is a kind of machine learning method for learning feedback control laws online in real time based on system performance, without necessarily knowing the system dynamics, and it overcomes the curse of dimensionality [14] of dynamic programming. Al-Tamimi et al. [15] derived adaptive critic designs corresponding to heuristic dynamic programming and dual heuristic dynamic programming to solve online the H∞ control problem of linear discrete-time systems in a forward-in-time manner. Based on this work, the authors in [16] proposed an iterative adaptive critic design algorithm to find the optimal controller of a class of discrete-time two-player zero-sum games for Roesser-type 2-D systems. Further, a novel data-based adaptive critic design using output feedback was proposed for unknown discrete-time zero-sum games [17]. Besides, Q-learning-based optimal strategies were proposed for the H∞ optimal control problem without knowing the system dynamics matrices in [18] and [19]. For the nonlinear case, Mehraeen et al. [20, 21] developed an off-line iterative approach to solve the HJI equation by successive approximation. Liu et al. [22] proposed value iteration methods corresponding to heuristic dynamic programming and dual heuristic dynamic programming to solve the HJI equation for constrained-input systems. Later, Liu et al. [23] proposed an iterative adaptive dynamic programming algorithm to solve zero-sum game problems for affine nonlinear discrete-time systems. Nevertheless, a common feature of the above ADP-based results for solving the H∞ control problem is that sequential iterative approaches are utilized to solve the HJI equation, which contain more than one iteration loop, i.e., the value function and the control and disturbance policies are updated asynchronously. Such a procedure may lead to redundant iterations and result in low efficiency [24], which motivates the work of this paper.

In this paper, a new ADP-based method is proposed to solve online the H∞ control problem of the nonlinear system, in which three online parametric structures are designed by using three neural networks to solve online the Hamilton-Jacobi-Isaacs equation appearing in the H∞ control problem. The main contributions of this paper are twofold. First, we present a new ADP-based method in which the weights of the three online parametric structures are tuned simultaneously along the system trajectories to converge to the solution of the HJI equation, which is different from the sequential algorithms in [15, 16, 17, 18, 19, 20, 21, 22, 23]. Second, while explicitly considering the neural network approximation errors, in contrast to the works [20, 22], Lyapunov theory is utilized to demonstrate that the
system states and the weight estimation errors of the three online parametric structures are uniformly ultimately bounded. Besides, it is shown that the pair of the approximated control input and disturbance input converges to the approximate Nash equilibrium solution of the two-player zero-sum differential game.

The remainder of this paper is organized as follows. In Section 2, the problem statement is given. In Section 3, we present a new ADP-based method for solving the HJI equation of nonlinear discrete-time systems, and a rigorous proof of convergence is given. Section 4 presents an example to demonstrate the effectiveness of the proposed method. Finally, conclusions are drawn in Section 5.

2. Problem Formulation

In this paper, we consider the following affine nonlinear discrete-time system in the presence of the disturbance d(k):

x_{k+1} = f(x_k) + g(x_k)u(k) + d(k),    (1)

z(k) = [C x_k \;\; D u(k)]^T,    (2)
where x_k \in R^n is the system state, u(k) \in R^m is the control input, d(k) \in R^n is the disturbance signal with d(k) \in L_2[0, \infty), and z(k) is the fictitious output of the system. Assume that \|g(x_k)\|_F \le g_M [25], where \|\cdot\|_F denotes the Frobenius norm.

The H∞ control problem for the nonlinear discrete-time system (1) and (2) is to find a state feedback control

u(k) = u(x_k)    (3)

such that the closed-loop system (1) and (2) with (3) is asymptotically stable and has L_2 gain less than or equal to \gamma, i.e.,

\sum_{k=0}^{\infty} z^T(k)z(k) \le \gamma^2 \sum_{k=0}^{\infty} d^T(k)d(k)    (4)

for all d(k) \in L_2[0, \infty), where \gamma > 0 is some prescribed level of disturbance attenuation. Note that throughout this paper we assume that \gamma is fixed and \gamma \ge \gamma^*, where \gamma^* is the minimum \gamma for which the inequality (4) can hold.
According to [4], it is well known that the H∞ control problem can be posed as a zero-sum two-player differential game, in which the control input u(k) is regarded as a minimizing player and the disturbance d(k) is regarded as a maximizing one. Correspondingly, we can define the following infinite-horizon quadratic cost function for the zero-sum two-player differential game:

J(x(0), u, d) = \sum_{k=0}^{\infty} U(x_k, u_k, d_k),    (5)

where U(x_k, u_k, d_k) = x_k^T Q x_k + u_k^T R u_k - \gamma^2 d_k^T d_k, Q = C^T C, R = D^T D, x_k = x(k), u_k = u(k) and d_k = d(k). For the given control input u_k and the bounded disturbance d_k, we can define the corresponding value function as

V(x_k, u_k, d_k) = \sum_{i=k}^{\infty} U(x_i, u_i, d_i).    (6)
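For reference, the stage cost in (5) and a finite-horizon truncation of the value function (6) translate directly into code. The sketch below is a minimal illustration in Python; the weighting matrices Q, R, the attenuation level gamma, and the trajectory arrays are placeholders supplied by the user.

```python
import numpy as np

def stage_cost(x, u, d, Q, R, gamma):
    """Stage cost U(x_k, u_k, d_k) = x'Qx + u'Ru - gamma^2 d'd from (5)."""
    return float(x @ Q @ x + u @ R @ u - gamma**2 * (d @ d))

def value_truncated(xs, us, ds, Q, R, gamma):
    """Finite-horizon truncation of the value function (6) along a recorded trajectory."""
    return sum(stage_cost(x, u, d, Q, R, gamma) for x, u, d in zip(xs, us, ds))
```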
Correspondingly, the Hamilton function can be defined as

H(x_k, u_k, d_k) = V(x_{k+1}) - V(x_k) + U(x_k, u_k, d_k),    (7)

where x_{k+1} = f(x_k) + g(x_k)u_k + d_k. Therefore, for the zero-sum two-player differential game of the nonlinear discrete-time system (1) and (2), our aim is to find a state feedback saddle point (u^*, d^*) such that

V(u^*, d^*) = \min_u \max_d V(u, d),    (8)

which means

V(u^*, d) \le V(u^*, d^*) \le V(u, d^*),    (9)

where u^* = \mu(x_k) and d^* = \eta(x_k), and \mu(\cdot) and \eta(\cdot) are smooth functions. According to Bellman's optimality principle, the optimal value function V^*(x_k) satisfies the following discrete-time HJI equation

V^*(x_k) = \min_u \max_d \{ U(x_k, u_k, d_k) + V^*(x_{k+1}) \}.    (10)
At the same time, we can obtain the saddle point (u^*, d^*) of the zero-sum two-player differential game as follows:

u^*(x_k) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}},    (11)

and

d^*(x_k) = \frac{1}{2\gamma^2} \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}.    (12)

Inserting (11) and (12) into (10), the discrete-time HJI equation can be rewritten as

0 = V^*(x_{k+1}) - V^*(x_k) + \frac{1}{4} \frac{\partial V^{*T}(x_{k+1})}{\partial x_{k+1}} g(x_k) R^{-1} g^T(x_k) \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}} - \frac{1}{4\gamma^2} \frac{\partial V^{*T}(x_{k+1})}{\partial x_{k+1}} \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}} + x_k^T Q x_k,    (13)

where x_{k+1} = f(x_k) + g(x_k)u^*(x_k) + d^*(x_k) and V^*(0) = 0.
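To make (11) and (12) concrete, the sketch below evaluates the minimizing control and maximizing disturbance for a given gradient of the value function at x_{k+1}. Note that (11) and (12) are implicit, since the gradient itself depends on u and d; here the gradient is simply treated as supplied, which is how it is used in the remainder of the paper once the critic network provides it. The function name and argument layout are illustrative, not from the paper.

```python
import numpy as np

def saddle_point(grad_V_next, g_x, R, gamma):
    """Evaluate (11)-(12): u* = -1/2 R^{-1} g(x_k)^T dV*/dx_{k+1},
    d* = 1/(2 gamma^2) dV*/dx_{k+1}, given the value-function gradient at x_{k+1}."""
    u_star = -0.5 * np.linalg.solve(R, g_x.T @ grad_V_next)   # minimizing player, (11)
    d_star = grad_V_next / (2.0 * gamma**2)                   # maximizing player, (12)
    return u_star, d_star
```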
It is shown from (13) that, if one can obtain the optimal value functions V^*(x_k) and V^*(x_{k+1}) by solving equation (13), then, inserting V^*(x_{k+1}) into (11) and (12), one can easily obtain the saddle point (u^*, d^*) of the zero-sum two-player differential game. However, for the general nonlinear case it is very difficult to solve (13) directly for V^*(x_k) and V^*(x_{k+1}), since (13) is a nonlinear partial difference equation.

3. Main Results

In this section, we present a new ADP-based scheme based on neural networks for solving online the HJI equation (13). Since equation (13) is a nonlinear partial difference equation, it is difficult to obtain the exact solution V^*(x_k) from (13). However, we can find an approximate solution by using the universal approximation property of neural networks [25]. The value function V(x_k) can then be approximated by a neural network as

V(x_k) = W_c^T \psi_c(x_k) + \varepsilon_c(x_k),    (14)
where W_c is the bounded target weight of the neural network, \varepsilon_c(x_k) is the bounded approximation error of the neural network, and \psi_c(\cdot) is the vector activation function of the neural network. The upper bound of the ideal weight of the critic network is taken as \|W_c\| \le W_{cM}, while the approximation error \varepsilon_c(x_k) is upper bounded as \|\varepsilon_c(x_k)\| \le \varepsilon_{cM} [25]. Besides, we assume that the vector activation function of the critic network is upper bounded by \|\psi_c(\cdot)\| \le \psi_{cM}, while the gradients of the critic approximation error and of the activation function vector are upper bounded by \|\partial\varepsilon_c(x_k)/\partial x_k\| \le \varepsilon'_{cM} and \|\partial\psi_c(x_k)/\partial x_k\| \le \psi'_{cM} [26], respectively.

Correspondingly, by using the universal approximation property of neural networks, the control input u_k and the disturbance d_k have neural network representations written as

u(x_k) = W_a^T \varphi_a(x_k) + \varepsilon_a(x_k),    (15)

and

d(x_k) = W_d^T \varphi_d(x_k) + \varepsilon_d(x_k),    (16)

respectively, where W_a and W_d are the bounded target weights of the neural networks, \varepsilon_a(x_k) and \varepsilon_d(x_k) are the bounded approximation errors, and \varphi_a(\cdot) and \varphi_d(\cdot) are the vector activation functions. The upper bounds of the ideal weights of the action and disturbance networks are taken as \|W_a\| \le W_{aM} and \|W_d\| \le W_{dM}, respectively, while the approximation errors are upper bounded as \|\varepsilon_a(x_k)\| \le \varepsilon_{aM} and \|\varepsilon_d(x_k)\| \le \varepsilon_{dM} [25], respectively. Besides, we assume that the vector activation functions of the action and disturbance networks are upper bounded by \|\varphi_a(\cdot)\| \le \varphi_{aM} and \|\varphi_d(\cdot)\| \le \varphi_{dM} [26], respectively.

Next, for obtaining the optimal controllers, we show how to design the critic network, the action network and the disturbance network, respectively.

3.1. The Critic Network Design

Since the target weight of the critic network is an unknown ideal constant weight, we denote by \hat{W}_c the estimated value of W_c. Therefore, according to (14), the approximated value function can be written as

\hat{V}(x_k) = \hat{W}_c^T \psi_c(x_k).    (17)
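As an illustration of (17), the following sketch evaluates \hat{V}(x_k) = \hat{W}_c^T \psi_c(x_k) with a concrete activation vector. The paper does not prescribe \psi_c; the tanh basis acting on a fixed random input-layer projection used here mirrors the simulation section and is otherwise an assumption.

```python
import numpy as np

n_state, n_hidden = 2, 10                            # dimensions of the simulation example
rng = np.random.default_rng(0)
V_c = rng.uniform(-0.2, 0.2, (n_hidden, n_state))    # fixed input-to-hidden weights (not tuned)

def psi_c(x):
    """Critic activation vector psi_c(x); tanh basis is an illustrative choice."""
    return np.tanh(V_c @ x)

def V_hat(W_c_hat, x):
    """Approximated value function (17): V_hat(x) = W_c_hat^T psi_c(x)."""
    return float(W_c_hat @ psi_c(x))
```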
According to (7) and (17), we can define the critic error as

e(x_k) = \hat{W}_c^T \psi_c(x_{k+1}) - \hat{W}_c^T \psi_c(x_k) + U(x_k, u_k, d_k).    (18)

Further, we define an auxiliary critic error vector as

E(x_k) = [e(x_k) \;\cdots\; e(x_{k-i}) \;\cdots\; e(x_{k-l})],    (19)

where e(x_{k-i}) = \hat{W}_c^T \psi_c(x_{k-i+1}) - \hat{W}_c^T \psi_c(x_{k-i}) + x_{k-i}^T Q x_{k-i} + u_{k-i}^T R u_{k-i} - \gamma^2 d_{k-i}^T d_{k-i}, 0 \le i \le l, and l is a given constant. Further, denote \Delta\psi_c(x_{k-i}) = \psi_c(x_{k-i+1}) - \psi_c(x_{k-i}) and U(x_{k-i}) = x_{k-i}^T Q x_{k-i} + u_{k-i}^T R u_{k-i} - \gamma^2 d_{k-i}^T d_{k-i}. Then, (19) can be written as

E(x_k) = \Upsilon(x_k) + \hat{W}_c^T \Psi_c(x_k),    (20)

where \Upsilon(x_k) = [U(x_k) \;\cdots\; U(x_{k-i}) \;\cdots\; U(x_{k-l})] and \Psi_c(x_k) = [\Delta\psi_c(x_k) \;\cdots\; \Delta\psi_c(x_{k-i}) \;\cdots\; \Delta\psi_c(x_{k-l})]. For the given control input u_k and the bounded disturbance d_k, the weight update law for the critic network is defined as

\hat{W}_c(k+1) = \hat{W}_c(k) - \frac{\alpha_c \Psi_c(x_k) E^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k) + I\|_F},    (21)

where \alpha_c (0 < \alpha_c < 1) is the adaptive gain of the critic network and I is an identity matrix of appropriate dimension.

On the other hand, for the actual value function V(x_k), the Hamilton function is equal to zero, i.e., H(x_k, u_k, d_k) = 0. Thus, according to (7) and (14), we have

W_c^T \Delta\psi_c(x_k) + \Delta\varepsilon_c(x_k) + U(x_k) = 0,    (22)

where \Delta\varepsilon_c(x_k) = \varepsilon_c(x_{k+1}) - \varepsilon_c(x_k). Further, we have

W_c^T \Psi_c(x_k) + \Xi_c(x_k) + \Upsilon(x_k) = 0,    (23)

where \Xi_c(x_k) = [\Delta\varepsilon_c(x_k), \Delta\varepsilon_c(x_{k-1}), \cdots, \Delta\varepsilon_c(x_{k-l})] and \|\Xi_c(x_k)\| \le \Xi_{cM}. Substituting (23) into (20), we obtain

E(x_k) = -\tilde{W}_c^T \Psi_c(x_k) - \Xi_c(x_k),    (24)

where \tilde{W}_c(k) = W_c - \hat{W}_c(k). Rewriting (21) by using (24), we have

\tilde{W}_c(k+1) = \left( I - \frac{\alpha_c \Psi_c(x_k)\Psi_c^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k) + I\|_F} \right) \tilde{W}_c(k) - \frac{\alpha_c \Psi_c(x_k)\Xi_c^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k) + I\|_F}.    (25)
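For implementation, the auxiliary error (19)-(20) and the update law (21) can be realized with a short buffer of the most recent transitions. The sketch below assumes the basis psi_c from the earlier sketch and measured stage costs; the buffer length l+1 and the data layout are illustrative choices.

```python
import numpy as np

def critic_update(W_c_hat, history, alpha_c):
    """One step of the critic update law (21).

    history: list of (x_i, x_{i+1}, U_i) for i = k-l, ..., k, where
    U_i = x_i'Qx_i + u_i'Ru_i - gamma^2 d_i'd_i is the measured stage cost.
    """
    # Columns of Psi_c are Delta psi_c(x_i) = psi_c(x_{i+1}) - psi_c(x_i), cf. (20).
    Psi = np.column_stack([psi_c(x1) - psi_c(x0) for (x0, x1, _) in history])
    Ups = np.array([U for (_, _, U) in history])       # Upsilon(x_k)
    E = Ups + W_c_hat @ Psi                            # auxiliary critic error vector (20)
    den = np.linalg.norm(Psi @ Psi.T + np.eye(Psi.shape[0]), 'fro')
    return W_c_hat - alpha_c * (Psi @ E) / den         # update law (21)
```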
When x_k = 0, the system states have converged to zero and the critic network is no longer updated. Hence, for the given control input u_k and the bounded disturbance d_k, persistency of excitation is needed to guarantee the convergence of \hat{W}_c(k) to W_c; that is, the system states must be persistently exciting long enough for the critic network to be learned. Next, the boundedness of the weight estimation error of the critic network with dynamics given by (1) and (2) is demonstrated for a fixed admissible control policy u_k and a bounded disturbance d_k.

Theorem 1. Let u_k be any given admissible control policy for the nonlinear system (1) and (2) with the bounded disturbance d_k, and let the weight update law for the critic network be given by (21). If the states of the nonlinear system are persistently exciting long enough, then the weight estimation error of the critic network is uniformly ultimately bounded.

Proof. Choose the following positive definite Lyapunov candidate function:

L(\tilde{W}_c(k)) = \mathrm{tr}\{\tilde{W}_c^T(k)\tilde{W}_c(k)\}.    (26)

Calculating the first difference of L(\tilde{W}_c(k)) and substituting the weight update law (25) reveals

\Delta L(\tilde{W}_c(k)) = \mathrm{tr}\{\tilde{W}_c^T(k+1)\tilde{W}_c(k+1)\} - \mathrm{tr}\{\tilde{W}_c^T(k)\tilde{W}_c(k)\}
= \mathrm{tr}\Big\{ -\frac{2\alpha_c \tilde{W}_c^T(k)\Psi_c(x_k)\Psi_c^T(x_k)\tilde{W}_c(k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F} - \frac{\alpha_c \tilde{W}_c^T(k)\Psi_c(x_k)\Xi_c^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F} \Big\}
+ \mathrm{tr}\Big\{ -\frac{\alpha_c \Xi_c(x_k)\Psi_c^T(x_k)\tilde{W}_c(k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F} + \frac{\alpha_c^2 \Xi_c(x_k)\Psi_c^T(x_k)\Psi_c(x_k)\Xi_c^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F^2} \Big\}
+ \mathrm{tr}\Big\{ \frac{\alpha_c^2 \tilde{W}_c^T(k)\Psi_c(x_k)\Psi_c^T(x_k)\Psi_c(x_k)\Psi_c^T(x_k)\tilde{W}_c(k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F^2} \Big\}
+ \mathrm{tr}\Big\{ \frac{\alpha_c^2 \tilde{W}_c^T(k)\Psi_c(x_k)\Psi_c^T(x_k)\Psi_c(x_k)\Xi_c^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F^2} \Big\}
+ \mathrm{tr}\Big\{ \frac{\alpha_c^2 \Xi_c(x_k)\Psi_c^T(x_k)\Psi_c(x_k)\Psi_c^T(x_k)\tilde{W}_c(k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F^2} \Big\}.    (27)

Because the states of the nonlinear system are persistently exciting long enough, we can obtain

\Psi_m \le \frac{\Psi_c^T(x_k)\Psi_c(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F} < 1,    (28)

where \Psi_m (\Psi_m > 0) is a constant ensured to exist by the persistency of excitation condition. Further, combining (27) and (28), \Delta L(\tilde{W}_c(k)) can be upper bounded as

\Delta L(\tilde{W}_c(k)) \le \mathrm{tr}\Big\{ -\frac{\alpha_c \tilde{W}_c^T(k)\Psi_c(x_k)\Psi_c^T(x_k)\tilde{W}_c(k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F} \Big\} + \mathrm{tr}\Big\{ \frac{\alpha_c^2 \Xi_c(x_k)\Psi_c^T(x_k)\Psi_c(x_k)\Xi_c^T(x_k)}{\|\Psi_c(x_k)\Psi_c^T(x_k)+I\|_F^2} \Big\}
\le -\alpha_c\Psi_m\|\tilde{W}_c(k)\|^2 + \alpha_c^2\Xi_{cM}^2.    (29)

Thus, \Delta L(\tilde{W}_c(k)) is less than zero provided \|\tilde{W}_c(k)\| > \sqrt{\alpha_c\Psi_m^{-1}\Xi_{cM}^2}. Using standard Lyapunov extensions [25], it is concluded that \Delta L(\tilde{W}_c(k)) is less than zero outside of a compact set. Therefore, the weight estimation error of the critic network is uniformly ultimately bounded. This completes the proof.
3.2. The Action Network Design

To develop the feedback control policy u(x_k) that minimizes the approximated value function (17), u(x_k) is approximated by the action network as

\hat{u}(x_k) = \hat{W}_a^T(k)\varphi_a(x_k),    (30)

where \hat{W}_a is the estimated value of W_a. The feedback error signal used for tuning the action network is defined as the difference between the control input applied to (1) and the control input that minimizes (17):

e_a(k) = \hat{W}_a^T(k)\varphi_a(x_k) + \frac{1}{2} R^{-1} g^T(x_k) \frac{\partial \hat{V}(x_{k+1})}{\partial x_{k+1}}.    (31)

The objective function to be minimized by the action network is defined as

E_a(\hat{W}_a) = \frac{1}{2} e_a^T(k) e_a(k).    (32)

The weight update law for the action network is a gradient descent algorithm, given by

\hat{W}_a(k+1) = \hat{W}_a(k) - \alpha_a \frac{\varphi_a(x_k) e_a^T(k)}{1 + \varphi_a^T(x_k)\varphi_a(x_k)},    (33)

where \alpha_a > 0 is the learning rate of the action network. Since the control policy in (15) minimizes the value function (14), we have

\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial \psi_c^T(x_{k+1})}{\partial x_{k+1}} W_c + W_a^T \varphi_a(x_k) + \frac{1}{2} R^{-1} g^T(x_k) \nabla\varepsilon_c(x_{k+1}) + \varepsilon_a(x_k) = 0.    (34)

Let the action network weight estimation error be \tilde{W}_a(k) = W_a - \hat{W}_a(k). Substituting (34) into (31) results in

e_a(k) = -\tilde{W}_a^T(k)\varphi_a(x_k) - \frac{1}{2} R^{-1} g^T(x_k) \frac{\partial \psi_c^T(x_{k+1})}{\partial x_{k+1}} \tilde{W}_c - \varepsilon_{ac}(x_k),    (35)

where \varepsilon_{ac}(x_k) = \varepsilon_a(x_k) + R^{-1} g^T(x_k)\nabla\varepsilon_c(x_{k+1})/2 and \|\varepsilon_{ac}(x_k)\| \le \varepsilon_{acM}. Substituting (35) into (33), we obtain

\tilde{W}_a(k+1) = \tilde{W}_a(k) + \alpha_a \frac{\varphi_a(x_k) e_a^T(k)}{1 + \varphi_a^T(x_k)\varphi_a(x_k)}.    (36)
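A minimal sketch of the action error (31) and the gradient-descent update (33) follows. It reuses the critic basis psi_c from above (together with its analytic Jacobian) and introduces an action basis phi_a of the same form; both basis choices, and the fixed input weights V_a, are illustrative assumptions rather than specifications from the paper.

```python
import numpy as np

V_a = np.random.default_rng(1).uniform(-0.2, 0.2, (10, 2))   # fixed input weights (assumption)

def dpsi_c_dx(x):
    """Jacobian of the critic activation psi_c(x) = tanh(V_c x) with respect to x."""
    s = np.tanh(V_c @ x)
    return (1.0 - s**2)[:, None] * V_c                        # shape (n_hidden, n_state)

def phi_a(x):
    """Action-network activation vector (illustrative tanh basis)."""
    return np.tanh(V_a @ x)

def actor_update(W_a_hat, W_c_hat, x, x_next, g_x, R, alpha_a):
    """Action error (31) and one gradient step of (33)."""
    grad_V = dpsi_c_dx(x_next).T @ W_c_hat                    # dV_hat/dx_{k+1}
    e_a = W_a_hat.T @ phi_a(x) + 0.5 * np.linalg.solve(R, g_x.T @ grad_V)
    phi = phi_a(x)
    return W_a_hat - alpha_a * np.outer(phi, e_a) / (1.0 + phi @ phi)
```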
3.3. The Disturbance Network Design

To develop the feedback disturbance policy d(x_k) that maximizes the approximated value function (17), d(x_k) is approximated by the disturbance network as

\hat{d}(x_k) = \hat{W}_d^T(k)\varphi_d(x_k),    (37)

where \hat{W}_d is the estimated value of W_d. The feedback error signal used for tuning the disturbance network is defined as the difference between the disturbance input applied to the system (1) and the disturbance input that maximizes (17):

e_d(k) = \hat{W}_d^T(k)\varphi_d(x_k) - \frac{1}{2\gamma^2} \frac{\partial \hat{V}(x_{k+1})}{\partial x_{k+1}}.    (38)

The objective function to be minimized by the disturbance network is defined as

E_d(\hat{W}_d) = \frac{1}{2} e_d^T(k) e_d(k).    (39)

The weight update law for the disturbance network is a gradient descent algorithm, given by

\hat{W}_d(k+1) = \hat{W}_d(k) - \alpha_d \frac{\varphi_d(x_k) e_d^T(k)}{1 + \varphi_d^T(x_k)\varphi_d(x_k)},    (40)

where \alpha_d > 0 is the learning rate of the disturbance network. Since the disturbance policy in (16) maximizes the value function (14), we have

W_d^T \varphi_d(x_k) + \varepsilon_d(x_k) - \frac{1}{2\gamma^2} \frac{\partial \psi_c^T(x_{k+1})}{\partial x_{k+1}} W_c - \frac{1}{2\gamma^2} \nabla\varepsilon_c(x_{k+1}) = 0.    (41)

Let the disturbance network weight estimation error be \tilde{W}_d(k) = W_d - \hat{W}_d(k). Substituting (41) into (38) results in

e_d(k) = -\tilde{W}_d^T(k)\varphi_d(x_k) + \frac{1}{2\gamma^2} \frac{\partial \psi_c^T(x_{k+1})}{\partial x_{k+1}} \tilde{W}_c - \varepsilon_{dc}(x_k),    (42)

where \varepsilon_{dc}(x_k) = \varepsilon_d(x_k) - \nabla\varepsilon_c(x_{k+1})/(2\gamma^2) and \|\varepsilon_{dc}(x_k)\| \le \varepsilon_{dcM}. Substituting (42) into (40), we obtain

\tilde{W}_d(k+1) = \tilde{W}_d(k) + \alpha_d \frac{\varphi_d(x_k) e_d^T(k)}{1 + \varphi_d^T(x_k)\varphi_d(x_k)}.    (43)
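Analogously, the disturbance error (38) and the update (40) can be sketched as below; the disturbance basis phi_d and its fixed input weights V_d mirror the action network and are again assumptions for illustration.

```python
import numpy as np

V_d = np.random.default_rng(2).uniform(-0.2, 0.2, (10, 2))   # fixed input weights (assumption)

def phi_d(x):
    """Disturbance-network activation vector (illustrative tanh basis)."""
    return np.tanh(V_d @ x)

def disturbance_update(W_d_hat, W_c_hat, x, x_next, gamma, alpha_d):
    """Disturbance error (38) and one gradient step of (40)."""
    grad_V = dpsi_c_dx(x_next).T @ W_c_hat                    # dV_hat/dx_{k+1} from the critic
    e_d = W_d_hat.T @ phi_d(x) - grad_V / (2.0 * gamma**2)
    phi = phi_d(x)
    return W_d_hat - alpha_d * np.outer(phi, e_d) / (1.0 + phi @ phi)
```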
3.4. Stability Analysis

In this section, the convergence of the weights of the critic, action and disturbance networks is demonstrated while explicitly considering the neural network reconstruction errors.

Theorem 2. Let u_0(k) be any initial stabilizing control policy for the nonlinear system (1) and (2) with the given bounded disturbance d_0(k). Let the weight tuning laws for the critic, action and disturbance networks be provided by (21), (33) and (40), respectively, and let the weights of the critic, action and disturbance networks be updated simultaneously along trajectories of the system (1) and (2). Then the system state x(k) and the neural network weight estimation errors \tilde{W}_a(k), \tilde{W}_d(k) and \tilde{W}_c(k) are uniformly ultimately bounded, with the bounds specifically given by (52), (53), (54) and (55), respectively.

Proof. Choose the following Lyapunov function candidate

L(k) = \frac{\alpha_c \alpha_a^2 \alpha_d^2 \Psi_m}{4(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)} L_1(k) + \frac{\alpha_a \alpha_d}{(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)} L_2(k) + \frac{\alpha_c \alpha_d^2 g_M^2 \Psi_m}{1+\varphi_{dM}^2} L_3(k) + \frac{\alpha_c \alpha_a^2 \Psi_m}{1+\varphi_{aM}^2} L_4(k),    (44)

where L_1(k) = x_k^T x_k, L_2(k) = \mathrm{tr}\{\tilde{W}_c^T(k)\tilde{W}_c(k)\}, L_3(k) = \mathrm{tr}\{\tilde{W}_a^T(k)\tilde{W}_a(k)\} and L_4(k) = \mathrm{tr}\{\tilde{W}_d^T(k)\tilde{W}_d(k)\}. The first difference of (44) is given by

\Delta L(k) = \frac{\alpha_c \alpha_a^2 \alpha_d^2 \Psi_m}{4(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)} \Delta L_1(k) + \frac{\alpha_a \alpha_d}{(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)} \Delta L_2(k) + \frac{\alpha_c \alpha_d^2 g_M^2 \Psi_m}{1+\varphi_{dM}^2} \Delta L_3(k) + \frac{\alpha_c \alpha_a^2 \Psi_m}{1+\varphi_{aM}^2} \Delta L_4(k).    (45)

Considering first \Delta L_1(k), substituting (30) and (37) and performing some mathematical manipulation reveals

\Delta L_1(k) = \|f(x_k) + g(x_k)\hat{W}_a^T(k)\varphi_a(x_k) + \hat{W}_d^T(k)\varphi_d(x_k)\|^2 - \|x_k\|^2
= \|f(x_k) + g(x_k)u^*(x_k) + d^*(x_k) - g(x_k)\tilde{W}_a^T(k)\varphi_a(x_k) - \tilde{W}_d^T(k)\varphi_d(x_k) - g(x_k)\varepsilon_a(x_k) - \varepsilon_d(x_k)\|^2 - \|x_k\|^2
\le -(1-2K^*)\|x_k\|^2 + 8 g_M^2 \|\tilde{W}_a^T(k)\varphi_a(x_k)\|^2 + 8\|\tilde{W}_d^T(k)\varphi_d(x_k)\|^2 + 8(g_M^2\varepsilon_{aM}^2 + \varepsilon_{dM}^2),    (46)

where \|f(x_k) + g(x_k)u^*(x_k) + d^*(x_k)\|^2 \le K^*\|x_k\|^2. Next, considering \Delta L_2(k) and according to (29), we have

\Delta L_2(k) \le -\alpha_c \Psi_m \|\tilde{W}_c(k)\|^2 + \alpha_c^2 \Xi_{cM}^2.    (47)

Third, considering \Delta L_3(k) and substituting the weight update law (36) reveals after manipulation

\Delta L_3(k) \le -\frac{\alpha_a(2-\alpha_a)}{1+\varphi_a^T\varphi_a}\|\tilde{W}_a^T(k)\varphi_a(x_k)\|^2 + \frac{\alpha_a(1+\alpha_a)}{1+\varphi_a^T\varphi_a}\left(\lambda_{\max}(R^{-1})g_M\psi'_{cM}\|\tilde{W}_c\| + 2\varepsilon_{acM}\right)\|\tilde{W}_a^T(k)\varphi_a(x_k)\|
+ \frac{\alpha_a^2}{4(1+\varphi_a^T\varphi_a)}\left(\lambda_{\max}(R^{-1})g_M\psi'_{cM}\right)^2\|\tilde{W}_c\|^2 + 2\alpha_a^2\varepsilon_{acM}^2 + \frac{\alpha_a^2}{1+\varphi_a^T\varphi_a}\lambda_{\max}(R^{-1})g_M\psi'_{cM}\varepsilon_{acM}\|\tilde{W}_c\|
\le -\frac{\alpha_a(3-3\alpha_a)}{2(1+\varphi_{aM}^2)}\|\tilde{W}_a^T(k)\varphi_a(x_k)\|^2 + \frac{\alpha_a\Lambda_{ca}}{1+\varphi_{aM}^2}\|\tilde{W}_c\|^2 + \Pi_{\varepsilon a},    (48)

where \Lambda_{ca} = (\alpha_a+1)(\lambda_{\max}(R^{-1})g_M\psi'_{cM})^2 + 0.25\alpha_a((\lambda_{\max}(R^{-1})g_M\psi'_{cM})^2 + 2) and \Pi_{\varepsilon a} = 0.5\alpha_a(\alpha_a(\lambda_{\max}(R^{-1})g_M\psi'_{cM})^2 + 10\alpha_a + 8)\varepsilon_{acM}^2. Now, considering \Delta L_4(k) and substituting the weight update law (43) reveals after manipulation

\Delta L_4(k) \le -\frac{\alpha_d(2-\alpha_d)}{1+\varphi_d^T\varphi_d}\|\tilde{W}_d^T(k)\varphi_d(x_k)\|^2 + \frac{\alpha_d(1+\alpha_d)}{1+\varphi_d^T\varphi_d}\left(\gamma^{-2}\psi'_{cM}\|\tilde{W}_c\| + 2\varepsilon_{dcM}\right)\|\tilde{W}_d^T(k)\varphi_d(x_k)\|
+ \frac{\alpha_d^2}{4(1+\varphi_d^T\varphi_d)}\left(\gamma^{-2}\psi'_{cM}\right)^2\|\tilde{W}_c\|^2 + 2\alpha_d^2\varepsilon_{dcM}^2 + \frac{\alpha_d^2}{1+\varphi_d^T\varphi_d}\gamma^{-2}\psi'_{cM}\varepsilon_{dcM}\|\tilde{W}_c\|
\le -\frac{3\alpha_d(1-\alpha_d)}{2(1+\varphi_{dM}^2)}\|\tilde{W}_d^T(k)\varphi_d(x_k)\|^2 + \frac{\alpha_d\Lambda_{cd}}{1+\varphi_{dM}^2}\|\tilde{W}_c\|^2 + \Pi_{\varepsilon d},    (49)

where \Lambda_{cd} = (\alpha_d+1)(\gamma^{-2}\psi'_{cM})^2 + 0.25\alpha_d((\gamma^{-2}\psi'_{cM})^2 + 2) and \Pi_{\varepsilon d} = 0.5\alpha_d(\alpha_d(\gamma^{-2}\psi'_{cM})^2 + 10\alpha_d + 8)\varepsilon_{dcM}^2. Then, substituting (46), (47), (48) and (49) into (45), we have

\Delta L(k) \le -\frac{\alpha_c\alpha_a^2\alpha_d^2\Psi_m(1-2K^*)}{4(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}\|x_k\|^2 - \frac{\alpha_a\alpha_c\alpha_d^2 g_M^2\Psi_m(3-7\alpha_a)}{2(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}\|\tilde{W}_a^T(k)\varphi_a(x_k)\|^2
- \frac{\alpha_c\alpha_d\alpha_a^2\Psi_m(3-7\alpha_d)}{2(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}\|\tilde{W}_d^T(k)\varphi_d(x_k)\|^2 - \frac{\alpha_a\alpha_c\alpha_d\Psi_m(1-\alpha_d g_M^2\Lambda_{ca}-\alpha_a\Lambda_{cd})}{(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}\|\tilde{W}_c\|^2 + \Pi_{\varepsilon M},    (50)

where

\Pi_{\varepsilon M} = \frac{\alpha_c\alpha_a^2\alpha_d^2\Psi_m}{4(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}\, 8(g_M^2\varepsilon_{aM}^2+\varepsilon_{dM}^2) + \frac{\alpha_a\alpha_d\alpha_c^2\Xi_{cM}^2}{(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)} + \frac{\alpha_c\alpha_d^2 g_M^2\Psi_m\Pi_{\varepsilon a}}{1+\varphi_{dM}^2} + \frac{\alpha_c\alpha_a^2\Psi_m\Pi_{\varepsilon d}}{1+\varphi_{aM}^2}.    (51)

Therefore, \Delta L(k) is less than zero provided the following inequalities hold:

\|x_k\| > \sqrt{\frac{4\Pi_{\varepsilon M}(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}{\alpha_c\alpha_a^2\alpha_d^2\Psi_m(1-2K^*)}},    (52)

or

\|\tilde{W}_a(k)\| > \sqrt{\frac{2\Pi_{\varepsilon M}(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}{\alpha_a\alpha_c\alpha_d^2\varphi_{aM}^2 g_M^2\Psi_m(3-7\alpha_a)}},    (53)

or

\|\tilde{W}_d(k)\| > \sqrt{\frac{2\Pi_{\varepsilon M}(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}{\alpha_c\alpha_d\alpha_a^2\varphi_{dM}^2\Psi_m(3-7\alpha_d)}},    (54)

or

\|\tilde{W}_c\| > \sqrt{\frac{\Pi_{\varepsilon M}(1+\varphi_{aM}^2)(1+\varphi_{dM}^2)}{\alpha_a\alpha_c\alpha_d\Psi_m(1-\alpha_d g_M^2\Lambda_{ca}-\alpha_a\Lambda_{cd})}},    (55)

and the tuning gains are selected as 0 < \alpha_a < 3/7, 0 < \alpha_d < 3/7, 0 < \alpha_c < 1, 0 < K^* < 1/2 and \alpha_d g_M^2\Lambda_{ca} + \alpha_a\Lambda_{cd} < 1. According to standard Lyapunov extensions [25], the system states and the weight estimation errors of the action, disturbance and critic networks are uniformly ultimately bounded, with bounds provided by (52), (53), (54) and (55), respectively. This completes the proof.
(56)
Further, we have ˜ T (k)ϕa (xk ) + εaM (xk ) u∗ (xk ) − uˆ(xk ) ≤ W a 2ΠεM (1 + ϕ2aM )(1 + ϕ2dM ) + εaM (xk ) ≤ 2 αa αc αd2 gM Ψm (3 − 7αa ) 16
(57)
On the other side, according to (16) and (37), we can obtain ˆ k) = W ˜ T (k)ϕd (xk ) + εd (xk ) d∗ (xk ) − d(x d
(58)
Further, we have ˆ k ) ≤ W ˜ dT (k)ϕd (xk ) + εdM (xk ) d∗ (xk ) − d(x 2ΠεM (1 + ϕ2aM )(1 + ϕ2dM ) + εdM (xk ) ≤ αc αd αa2 Ψm (3 − 7αd )
(59)
ˆ k )] converges to the approximate Nash equilibrium So, the pair [ˆ u(xk ), d(x solution of the zero-sum game. This completes the proof. 4. Simulation Results In this section, we solve a numerical example using the algorithm developed in Section 3. Consider the following nonlinear discrete-time system studied in [21]: x1 (k + 1) −0.8x2 (k) = x2 (k + 1) sin(0.8x1 (k) − x2 (k)) + 1.8x2 (k) 0 0 d(k) (60) u(k) + + 1 −1 − x2 (k) with the initial states are taken as x(0) = [0.1 0.1]T . The corresponding cost function is defined as (6), where Q and R are chosen as identity matrices of appropriate dimensions, γ = 20. The initial stabilizing controller is defined as u0 (k) = x1 (k) + 1.5x2 (k). To implement the proposed scheme, we choose three-layer feedforward neural networks as the critic network, action network and disturbance network with structure 1-10-2. The initial weights of the critic and disturbance neural networks are chosen to be random in [−0.2, 0.2], while the initial weights of the action network are chosen to reflect the initial stabilizing controller u0 (k). In the learning process, we assume that the weight matrix between the input layer and hidden layer is constant, we only tune the weight matrix between the hidden layer and the output layer. At the same time, the learning rate of the critic network is αc = 0.25, the learning rate of the 17
action network is αa = 0.15, and the learning rate of disturbance network is αd = 0.15. Besides, a probing noise is added to the control input to maintain the excitation condition. The state trajectories of the nonlinear system are shown in Fig. 1. The convergent trajectories of the weights of the critic network are shown in Fig. 2. We can see that, after 500 time steps, the weights of the critic network does converge to constants with persistent excitation. 0.15
0.1
System States
0.05
0
−0.05
−0.1
−0.15
−0.2
0
100
200
300 400 Time steps
500
600
700
Figure 1: The evolution curves of the system states.
0.2 Wc1 Wc2 Parameters of the critic NN
0.15
Wc3 Wc4 Wc5
0.1
Wc6 Wc7 Wc8
0.05
Wc9 Wc10 0
−0.05
0
100
200
300 400 Time steps
500
600
700
Figure 2: The convergence process of the critic network weights.
Moreover, in order to make comparison with the optimal controller and the initial stabilizing controller, we present the state trajectories for the nonlinear system with the obtained optimal control controller and the initial 18
stabilizing controller. And we also present control signals for the nonlinear system with the obtained optimal control controller and the initial stabilizing controller. In the comparison, a disturbance w(k) = 0.05exp(−0.1k) is introduced into the nonlinear system at k = 0. Fig. 3 shows the optimal control signal, and Fig. 4 presents the state trajectories for the nonlinear system with the obtained optimal control controller. Fig. 5 shows the initial stabilizing control signal, and Fig. 6 presents the state trajectories for the nonlinear system with the initial stabilizing controller. We can see from Fig. 3 and Fig. 5 that the obtained optimal control controller yields better performance than the stabilizing controller. We can also see that the state trajectories in Fig. 4 have fewer oscillations when compared to those of Fig. 6. Those results demonstrate the effectiveness of the proposed method. 0.25 u(x(k))
Figure 3: The evolution curves of the optimal controller.
Next, we evaluate the overall performance of the nonlinear system with the optimal controller and with the initial stabilizing controller, respectively. Here, we employ a performance metric [6, 21] defined as

\mathrm{Attenuation}(k) = \frac{\sum_{i=0}^{k}\left( x^T(i) Q x(i) + u^T(i) R u(i) \right)}{\sum_{i=0}^{k} \gamma^2 d^T(i) d(i)}.    (61)

We can see from Fig. 7 that a significant decrease in the attenuation metric (61) is observed when the optimal controller is applied to the nonlinear system. It is worth noting that the performance metric obtained with the optimal controller in this paper is very close to that of the existing method in [21] (see Fig. 3 in [21]), which verifies that the weight estimation error of the critic network is uniformly ultimately bounded.
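The metric (61) can be computed directly from logged closed-loop data, for example as in the short sketch below; the trajectory arrays are assumed to come from a run such as the one sketched in the previous section, with a nonzero disturbance sequence.

```python
import numpy as np

def attenuation(xs, us, ds, Q, R, gamma):
    """Running performance metric (61): accumulated state/control cost divided by
    the gamma-weighted accumulated disturbance energy, at every time step k."""
    num = np.cumsum([x @ Q @ x + u @ R @ u for x, u in zip(xs, us)])
    den = np.cumsum([gamma**2 * (d @ d) for d in ds])
    return num / den
```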
Figure 4: The system states curves for the optimal controller.
Figure 5: The evolution curves of the initial stabilizing controller.
Figure 6: The system states curves for the initial stabilizing controller.
Figure 7: The attenuation with the optimal controller and initial controller.
5. Conclusion

In this paper, we have proposed a new ADP-based method to solve online the H∞ control problem for affine nonlinear discrete-time systems. The importance of the proposed method lies in tuning the weights of the critic, action and disturbance networks simultaneously, using data generated in real time along the system trajectories, so that the solution of the HJI equation is obtained online without solving this equation explicitly. The convergence of the new ADP-based method has been proved by using Lyapunov techniques while considering the neural network approximation errors. Finally, a simulation example has shown the effectiveness of the proposed method for the H∞ control problem of affine nonlinear discrete-time systems.

Acknowledgements

The work described in this paper was supported by the National Natural Science Foundation of China (61034005, 61273027, 61304132), the National High Technology Research and Development Program of China (2012AA040104), the Liaoning Industry Program (2013219005) and the China Postdoctoral Science Foundation funded project (2015M572104).

References

[1] K. Zhou and J. C. Doyle, Essentials of Robust Control, Prentice-Hall, 1997.

[2] A. J. van der Schaft, L2-gain analysis of nonlinear systems and nonlinear state feedback H∞ control, IEEE Transactions on Automatic Control, vol. 37, pp. 770–784, 1992.

[3] A. Isidori and A. Astolfi, Disturbance attenuation and H∞ control via measurement feedback in nonlinear systems, IEEE Transactions on Automatic Control, vol. 37, no. 9, pp. 1283–1293, 1992.

[4] T. Basar and P. Bernhard, H∞ Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, Second Edition, Boston, 1995.

[5] K. G. Vamvoudakis and F. L. Lewis, Online solution of nonlinear two-player zero-sum games using synchronous policy iteration, International Journal of Robust and Nonlinear Control, vol. 22, pp. 1460–1483, 2012.
[6] J. Huang and C. F. Lin, Numerical approach to computing nonlinear H∞ control laws, Journal of Guidance, Control, and Dynamics, vol. 18, pp. 989–994, 1995.

[7] R. W. Beard and T. W. McLain, Successive Galerkin approximation algorithms for nonlinear optimal and robust control, International Journal of Control, vol. 71, pp. 717–743, 1998.

[8] M. Abu-Khalaf, F. L. Lewis, and J. Huang, Policy iterations and the Hamilton-Jacobi-Isaacs equation for H∞ state feedback control with input saturation, IEEE Transactions on Automatic Control, vol. 51, pp. 1989–1995, 2006.

[9] H. Zhang, Y. Luo, and D. Liu, Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Transactions on Neural Networks, vol. 20, pp. 1490–1503, 2009.

[10] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems, Automatica, vol. 50, no. 1, pp. 193–202, 2014.

[11] Y. Jiang and Z. Jiang, Robust adaptive dynamic programming and feedback stabilization of nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 882–893, 2014.

[12] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, Springer-Verlag, London, 2013.

[13] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, The Institution of Engineering and Technology, London, United Kingdom, 2013.

[14] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, Hoboken, NJ, 2007.

[15] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, Adaptive critic designs for discrete-time zero-sum games with application to H-infinity control,
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 37, pp. 240–247, 2007.

[16] Q. Wei, H. Zhang, and L. Cui, Data-based optimal control for discrete-time zero-sum games of 2-D systems using adaptive critic designs, Acta Automatica Sinica, vol. 35, pp. 682–692, 2009.

[17] L. Cui, H. Zhang, X. Zhang, and Y. Luo, Data-based adaptive critic design for discrete-time zero-sum games using output feedback, in: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 11-15 April 2011, Paris, France, pp. 190–195.

[18] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control, Automatica, vol. 43, pp. 473–481, 2007.

[19] K. H. Kim and F. L. Lewis, Model-free H∞ control design for unknown linear discrete-time systems via Q-learning with LMI, Automatica, vol. 46, pp. 1320–1326, 2010.

[20] S. Mehraeen, T. Dierks, and S. Jagannathan, Zero-sum two-player game theoretic formulation of affine nonlinear discrete-time systems using neural networks, in: Proceedings of the International Joint Conference on Neural Networks, Barcelona, Spain, July 2010, pp. 1–8.

[21] S. Mehraeen, T. Dierks, S. Jagannathan, and M. L. Crow, Zero-sum two-player game theoretic formulation of affine nonlinear discrete-time systems using neural networks, IEEE Transactions on Cybernetics, vol. 43, pp. 1641–1655, 2013.

[22] D. Liu, H. Li, and D. Wang, H∞ control of unknown discrete-time nonlinear systems with control constraints using adaptive dynamic programming, in: Proceedings of the IEEE World Congress on Computational Intelligence, 10-15 June 2012, Brisbane, Australia, pp. 3056–3061.

[23] D. Liu, H. Li, and D. Wang, Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm, Neurocomputing, vol. 110, pp. 92–100, 2013.

[24] H. Wu and B. Luo, Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control,
IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 1884–1895, 2012.

[25] S. Jagannathan, Neural Network Control of Nonlinear Discrete-Time Systems, CRC Press, 2006.

[26] K. G. Vamvoudakis and F. L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica, vol. 46, pp. 878–888, 2010.

[27] H. Zhang, C. Qin, B. Jiang, and Y. Luo, Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems, IEEE Transactions on Cybernetics, vol. 44, pp. 2706–2718, 2014.

[28] C. Qin, Y. Wang, Y. Luo, and H. Zhang, Neural network-based near-optimal control for nonlinear discrete-time zero-sum differential games associated with the H∞ control problem, in: Proceedings of the Fifth International Conference on Intelligent Control and Information Processing, August 18-20, 2014, Dalian, Liaoning, China, pp. 341–347.

[29] T. Dierks and S. Jagannathan, Online optimal control of nonlinear discrete-time systems using approximate dynamic programming, Journal of Control Theory and Applications, vol. 9, pp. 361–369, 2011.

[30] T. Dierks and S. Jagannathan, Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update, IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 1118–1129, 2012.