Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence

Neurocomputing 91 (2012) 48–55
Xin Zhang, Huaguang Zhang, Qiuye Sun, Yanhong Luo
School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, PR China

Article history: Received 23 September 2011; Accepted 7 January 2012; Available online 23 February 2012. Communicated by H. Jiang.

Abstract

In this paper, a novel neuro-optimal control scheme is proposed for unknown nonaffine nonlinear discrete-time systems by using the adaptive dynamic programming (ADP) method. A neuro identifier is established by employing a recurrent neural network (RNN) model to reconstruct the unknown system dynamics, and the convergence of the identification error is proved by using Lyapunov theory. Based on the established RNN model, the ADP method is then utilized to design the approximate optimal controller, and two neural networks (NNs) are used to implement the iterative algorithm. The convergence of the action NN error and the weight estimation errors is demonstrated while considering the NN approximation errors. Finally, two numerical examples are used to demonstrate the effectiveness of the proposed control scheme. © 2012 Published by Elsevier B.V.

Keywords: Optimal control; Adaptive dynamic programming; Recurrent neural network; System identification

1. Introduction

Over the last few decades, the optimal control of nonlinear systems has been an active area of research in the control field, since it generally requires solving the nonlinear Hamilton–Jacobi–Bellman (HJB) equation [1–7]. The adaptive dynamic programming (ADP) algorithm plays an important role in seeking solutions to the optimal control problem. However, it is worth mentioning that most of the existing results based on the ADP technique require knowledge of the nonlinear dynamics [8–10]. In order to relax the requirement of an explicit model, some studies have attempted to solve for the controller without an a priori system model [11–14]. For linear discrete-time systems, Q-learning was introduced to relax some of the exact model-matching restrictions in [15]. For affine nonlinear discrete-time systems, Dierks et al. [16] relaxed the requirement by using online system identification and offline optimal control training. This method was then extended to solve the optimal control problem for unknown discrete-time affine nonlinear systems based on the globalized dual heuristic dynamic programming (GDHP) technique in [17]. However, the optimal control of unknown nonaffine nonlinear discrete-time systems remains a challenging task, which is the motivation of our work.

This work was supported by the National Natural Science Foundation of China (50977008, 61034005, 60904101, 61104010), the National Basic Research Program of China (2009CB320601), the Science and Technology Research Program of the Education Department of Liaoning Province (LT2010040), and the National High Technology Research and Development Program of China (2012AA040104). Corresponding author: H. Zhang. Tel.: +86 24 83687762; fax: +86 24 83679605. E-mail address: [email protected]

0925-2312/© 2012 Published by Elsevier B.V. doi:10.1016/j.neucom.2012.01.025

In this paper, a neuro identifier is first presented to reconstruct the unknown system dynamics. Since recurrent neural networks have powerful representation capability and can successfully overcome the disadvantages of feedforward networks, a discrete-time recurrent neural network (RNN) model is introduced as the neuro identifier, and the convergence of the identification error is proved by using Lyapunov theory. Once the identifier is established, it can be used to design the optimal controller; this paves the way for applying the ADP method to the optimal control problem for unknown nonaffine nonlinear discrete-time systems. The heuristic dynamic programming (HDP) technique is then used to solve for the approximate optimal control based on the established RNN model. Two NNs are utilized in the implementation of this algorithm: a critic NN is used to approximate the value function, whereas an action NN is used to approximate the optimal control policy. It is noticed that Al-Tamimi et al. proved the convergence of the iterative HDP algorithm in [8]; however, they did not consider the NN approximation errors, and indeed most of the existing results on ADP ignore the NN approximation errors in the proof of convergence. In this paper, the convergence of the NN implementation of the HDP technique is demonstrated while considering the NN approximation errors.

The main contributions of this paper include:
1. It is the first time that the optimal control problem of unknown nonaffine nonlinear discrete-time systems is investigated based on the ADP method.
2. A neuro identifier is established based on the RNN model, and the convergence of the identification error is proved based on Lyapunov theory.
3. The proposed control scheme does not require explicit knowledge of the system dynamics but only the established RNN model, and the convergence of the NN implementation of the HDP technique is demonstrated while considering the NN approximation errors.

The rest of the paper is organized as follows. In Section 2, the problem formulation is given. The neural network identification scheme is presented in Section 3. Then, the ADP-based approximate optimal control scheme is proposed with a stability proof in Section 4. Two simulation examples are presented to show the satisfying performance of the proposed scheme in Section 5. Finally, conclusions are drawn in Section 6.

2. Problem formulation

Consider the nonaffine nonlinear discrete-time system described by

$$x(k+1) = F(x(k), u(k)), \qquad (1)$$

where $x(k) \in R^n$ is the state vector, $u(k) \in R^m$ is the control vector, and $F(\cdot,\cdot)$ is an unknown general nonlinear smooth function of $x(k)$ and $u(k)$. Assume that the system (1) is controllable and bounded-input bounded-output stable, and that $x = 0$ is an equilibrium point on a compact set $\Omega$. In order to control the system (1) in an optimal manner, the control policy $u(k)$ must be selected to minimize the infinite-horizon cost functional

$$V(x(k), u(\cdot)) = \sum_{i=k}^{\infty} \left( x^T(i) Q x(i) + u^T(i) R u(i) \right), \qquad (2)$$

where $Q$ and $R$ are symmetric positive definite matrices with appropriate dimensions. For optimal control problems, the designed control policy must not only stabilize the system on $\Omega$ but also guarantee that the cost functional (2) is finite, i.e., the control must be admissible.
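As a concrete illustration, the cost (2) can be approximated numerically by truncating the infinite sum once the closed-loop trajectory has settled. The following minimal sketch (the trajectory data, horizon, and decay profile are placeholder assumptions, not quantities from the paper) evaluates such a truncation:

```python
import numpy as np

def truncated_cost(xs, us, Q, R):
    """Finite-horizon truncation of the cost functional (2):
    sum over the recorded trajectory of x^T Q x + u^T R u."""
    return sum(x @ Q @ x + u @ R @ u for x, u in zip(xs, us))

# Usage with placeholder data: 50 decaying states in R^2 and controls in R^1,
# mimicking an admissible (stabilizing) closed-loop trajectory.
rng = np.random.default_rng(0)
decay = 0.9 ** np.arange(50)[:, None]
xs = rng.standard_normal((50, 2)) * decay
us = rng.standard_normal((50, 1)) * decay
print(truncated_cost(xs, us, np.eye(2), np.eye(1)))
```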

3. Neural network identification scheme

The identified nonlinear system is represented by (1). According to the Stone–Weierstrass theorem [11], this nonlinear system can be written as

$$x(k+1) = A^T x(k) + W_1^{*T} f_1(V^{*T} x(k)) + W_2^{*T} f_2(V^{*T} x(k)) u(k) + \varepsilon(k), \qquad (3)$$

where $V^*$ is the ideal weight matrix between the input layer and the hidden layer, $W_1^*$ and $W_2^*$ are the ideal weight matrices between the hidden layer and the output layer, and $\varepsilon(k)$ is the bounded RNN functional approximation error. $f_1(\cdot)$ and $f_2(\cdot)$ are the NN activation functions, selected as monotonically increasing functions satisfying

$$0 \le f_1(y_1) - f_1(y_2) \le k_1 (y_1 - y_2), \quad 0 \le f_2(y_1) - f_2(y_2) \le k_2 (y_1 - y_2) \qquad (4)$$

for any $y_1, y_2 \in R$ with $y_1 \ge y_2$ and constants $k_1 > 0$, $k_2 > 0$; the hyperbolic tangent function is one such choice. Additionally, the ideal output-layer weights are bounded as $\|W_1^*\| \le W_{1M}$ and $\|W_2^*\| \le W_{2M}$, the activation functions are bounded such that $\|f_1\| \le f_{1M}$ and $\|f_2\| \le f_{2M}$, and the RNN functional approximation error satisfies $\|\varepsilon(k)\| \le \varepsilon_M$.

In the system identification process, we keep the weight matrix between the input layer and the hidden layer constant and tune only the weight matrix between the hidden layer and the output layer. Accordingly, we design the following RNN model as the neuro identifier:

$$\hat{x}(k+1) = A^T \hat{x}(k) + \hat{W}_1^T(k) f_1(\bar{x}(k)) + \hat{W}_2^T(k) f_2(\bar{x}(k)) u(k), \qquad (5)$$

where $\hat{x}(k)$ is the estimated system state vector, $\hat{W}_1(k)$ and $\hat{W}_2(k)$ are the estimates of the constant ideal weights, and $\bar{x}(k) = V^{*T} \hat{x}(k)$.

The system identification error is defined as

$$e(k) = \hat{x}(k) - x(k). \qquad (6)$$

From (3) and (5), we can obtain the identification error dynamics as

$$e(k+1) = A^T e(k) + \tilde{W}_1^T(k) f_1(\bar{x}(k)) + \tilde{W}_2^T(k) f_2(\bar{x}(k)) u(k) + W_1^{*T} \tilde{f}_1(e(k)) + W_2^{*T} \tilde{f}_2(e(k)) u(k) - \varepsilon(k), \qquad (7)$$

where $\tilde{W}_1(k) = \hat{W}_1(k) - W_1^*$, $\tilde{W}_2(k) = \hat{W}_2(k) - W_2^*$, $\tilde{f}_1(e(k)) = f_1(\bar{x}(k)) - f_1(V^{*T} x(k))$, and $\tilde{f}_2(e(k)) = f_2(\bar{x}(k)) - f_2(V^{*T} x(k))$.

It is important to note that the identification error must be persistently excited enough for tuning the RNN model. In order to satisfy the persistent excitation condition, probing noise is added to the control input [19]. Further, the persistent excitation condition ensures $\|f_1(\bar{x}(k))\| \ge f_{1m}$ and $\|f_2(\bar{x}(k))\| \ge f_{2m}$, with $f_{1m}$ and $f_{2m}$ being positive constants. Before proceeding, the following mild assumption is needed.

Assumption 1 (Hayakawa et al. [18]). The RNN approximation error term $\varepsilon(k)$ is assumed to be upper bounded by a function of the state identification error $e(k)$ such that

$$\varepsilon^T(k) \varepsilon(k) \le \sigma^* e^T(k) e(k), \qquad (8)$$

where $\sigma^*$ is a bounded constant such that $\|\sigma^*\| \le \sigma_M$.

Next, the stability analysis of the proposed identification scheme is presented by using Lyapunov theory.

Theorem 1. Let the proposed identification scheme in (5) be used to identify the system dynamics in (1). If the updating laws of the RNN weights are designed as

$$\hat{W}_1(k+1) = \hat{W}_1(k) - \alpha_1 f_1(\bar{x}(k)) e^T(k+1) - \alpha_1^2 f_1(\bar{x}(k)) f_1^T(\bar{x}(k)) \hat{W}_1(k),$$
$$\hat{W}_2(k+1) = \hat{W}_2(k) - \alpha_2 f_2(\bar{x}(k)) u(k) e^T(k+1), \qquad (9)$$

where $\alpha_1 > 0$ and $\alpha_2 > 0$ are the NN learning rates, then the state identification error $e(k)$ converges uniformly to a bounded region near the origin while the weight estimation errors $\tilde{W}_1(k)$ and $\tilde{W}_2(k)$ remain bounded.

Proof. Consider the following positive definite Lyapunov function candidate:

$$L(k) = L_1(k) + L_2(k) + L_3(k), \qquad (10)$$

where

$$L_1(k) = e^T(k) e(k), \quad L_2(k) = \eta \, tr\{\tilde{W}_1^T(k) \tilde{W}_1(k)\}, \quad L_3(k) = \frac{1}{\alpha_2} tr\{\tilde{W}_2^T(k) \tilde{W}_2(k)\},$$

and $\eta > 0$ is a design parameter. The first difference of the Lyapunov function is

$$\Delta L(k) = \Delta L_1(k) + \Delta L_2(k) + \Delta L_3(k). \qquad (11)$$

For simplicity, we denote

$$Y_1(k) = \tilde{W}_1^T(k) f_1(\bar{x}(k)), \quad Y_2(k) = \tilde{W}_2^T(k) f_2(\bar{x}(k)) u(k), \quad F_1(k) = W_1^{*T} \tilde{f}_1(e(k)), \quad F_2(k) = W_2^{*T} \tilde{f}_2(e(k)) u(k). \qquad (12)$$

Using the identification error dynamics (7), $\Delta L_1(k)$ is obtained as

$$\Delta L_1(k) = e^T(k+1) e(k+1) - e^T(k) e(k) = [A^T e(k) + Y_1(k) + Y_2(k) + F_1(k) + F_2(k) - \varepsilon(k)]^T e(k+1) - e^T(k) e(k). \qquad (13)$$

Considering $\Delta L_2(k)$ and applying the Cauchy–Schwarz inequality yields

$$\Delta L_2(k) = \eta \, tr\{\tilde{W}_1^T(k+1) \tilde{W}_1(k+1) - \tilde{W}_1^T(k) \tilde{W}_1(k)\}$$
$$\le -\eta \alpha_1^2 Y_1^T(k) Y_1(k) + 2\eta e^T(k+1) e(k+1) + 2\alpha_1^2 \eta f_1^T(\bar{x}(k)) W_1^* W_1^{*T} f_1(\bar{x}(k)), \qquad (14)$$

where the weight update law (9) has been substituted for $\tilde{W}_1(k+1)$ and the cross terms have been bounded; the inequality requires $I - \alpha_1^2 f_1(\bar{x}(k)) f_1^T(\bar{x}(k)) > 0$, which holds by selecting $\alpha_1 \le 1/f_{1M}$.

Taking the first difference of $L_3(k)$ reveals

$$\Delta L_3(k) = \frac{1}{\alpha_2} tr\{\tilde{W}_2^T(k+1) \tilde{W}_2(k+1) - \tilde{W}_2^T(k) \tilde{W}_2(k)\}$$
$$= \frac{1}{\alpha_2} tr\{(\tilde{W}_2(k) - \alpha_2 f_2(\bar{x}(k)) u(k) e^T(k+1))^T (\tilde{W}_2(k) - \alpha_2 f_2(\bar{x}(k)) u(k) e^T(k+1)) - \tilde{W}_2^T(k) \tilde{W}_2(k)\}$$
$$= -2 Y_2^T(k) e(k+1) + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k) e^T(k+1) e(k+1). \qquad (15)$$

Combining (13)–(15) gives the first difference of (10):

$$\Delta L(k) \le [A^T e(k) + Y_1(k) + Y_2(k) + F_1(k) + F_2(k) - \varepsilon(k)]^T e(k+1) - e^T(k) e(k) - \eta \alpha_1^2 Y_1^T(k) Y_1(k)$$
$$+ 2\eta e^T(k+1) e(k+1) + 2\alpha_1^2 \eta f_1^T(\bar{x}(k)) W_1^* W_1^{*T} f_1(\bar{x}(k)) - 2 Y_2^T(k) e(k+1) + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k) e^T(k+1) e(k+1)$$
$$\le [2 + 3(2\eta + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k))][A^T e(k) + F_1(k) + F_2(k) - \varepsilon(k)]^T [A^T e(k) + F_1(k) + F_2(k) - \varepsilon(k)]$$
$$- [\eta \alpha_1^2 - 2 - 3(2\eta + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k))] Y_1^T(k) Y_1(k) - [1 - 3(2\eta + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k))] Y_2^T(k) Y_2(k) + D(k) - e^T(k) e(k)$$
$$\le [8 + 12(2\eta + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k))][e^T(k) A A^T e(k) + F_1^T(k) F_1(k) + F_2^T(k) F_2(k) + \varepsilon^T(k) \varepsilon(k)]$$
$$- [\eta \alpha_1^2 - 2 - 3(2\eta + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k))] Y_1^T(k) Y_1(k) - [1 - 3(2\eta + \alpha_2 u^T(k) f_2^T(\bar{x}(k)) f_2(\bar{x}(k)) u(k))] Y_2^T(k) Y_2(k) + D(k) - e^T(k) e(k), \qquad (16)$$

where $D(k) = 2\alpha_1^2 \eta f_1^T(\bar{x}(k)) W_1^* W_1^{*T} f_1(\bar{x}(k))$. Since the nonlinear system (1) is bounded-input bounded-output stable, i.e., $x(k)$ and $u(k)$ are bounded, we assume $\|u(k)\| \le u_M$. From (4) and (12), we know that

$$\|F_1(k)\| = \|W_1^{*T} f_1(V^{*T} \hat{x}(k)) - W_1^{*T} f_1(V^{*T} x(k))\| \le k_1 \|W_1^*\| \|V^*\| \|e(k)\| \triangleq K_1 \|e(k)\|,$$
$$\|F_2(k)\| = \|W_2^{*T} f_2(V^{*T} \hat{x}(k)) - W_2^{*T} f_2(V^{*T} x(k))\| \|u(k)\| \le k_2 \|W_2^*\| \|V^*\| u_M \|e(k)\| \triangleq K_2 \|e(k)\|.$$

Now, observing the facts $\|\sigma^*\| \le \sigma_M$, $\|W_1^*\| \le W_{1M}$, $\|f_1\| \le f_{1M}$, and $\|f_2\| \le f_{2M}$, we have

$$\Delta L(k) \le -[1 - (8 + 12(2\eta + \alpha_2 u_M^2 f_{2M}^2))(\lambda_{max}^2(A) + K_1^2 + K_2^2 + \sigma_M^2)] \|e(k)\|^2$$
$$- [\eta \alpha_1^2 - 2 - 3(2\eta + \alpha_2 u_M^2 f_{2M}^2)] \|Y_1(k)\|^2 - [1 - 3(2\eta + \alpha_2 u_M^2 f_{2M}^2)] \|Y_2(k)\|^2 + D_M, \qquad (17)$$

where $D_M = 2\alpha_1^2 \eta f_{1M}^2 W_{1M}^2$. Therefore, if the design parameters $\alpha_1$, $\alpha_2$, and $\eta$ are selected such that

$$P \triangleq \lambda_{max}^2(A) + K_1^2 + K_2^2 + \sigma_M^2 < \frac{1}{8}, \quad 1 - (8 + 12(2\eta + \alpha_2 u_M^2 f_{2M}^2)) P > 0,$$
$$\eta \alpha_1^2 - 2 - 3(2\eta + \alpha_2 u_M^2 f_{2M}^2) > 0, \quad 1 - 3(2\eta + \alpha_2 u_M^2 f_{2M}^2) > 0, \qquad (18)$$

then $\Delta L(k) < 0$ whenever any of the following inequalities holds:

$$\|e(k)\| > \sqrt{\frac{D_M}{1 - (8 + 12(2\eta + \alpha_2 u_M^2 f_{2M}^2)) P}}$$

or

$$\|Y_1(k)\| > \sqrt{\frac{D_M}{\eta \alpha_1^2 - 2 - 3(2\eta + \alpha_2 u_M^2 f_{2M}^2)}}$$

or

$$\|Y_2(k)\| > \sqrt{\frac{D_M}{1 - 3(2\eta + \alpha_2 u_M^2 f_{2M}^2)}}. \qquad (19)$$

Hence the identification error $e(k)$ and the RNN model weight estimation errors $\tilde{W}_1(k)$ and $\tilde{W}_2(k)$ are uniformly ultimately bounded (UUB). This completes the proof. □

In the following sections, the optimal control scheme for the unknown nonaffine nonlinear discrete-time system will be developed based on the established RNN model after a sufficiently long learning session, by which point the weights $\hat{W}_1(k)$ and $\hat{W}_2(k)$ tend to constant matrices, denoted $\bar{W}_1$ and $\bar{W}_2$.
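To make the identifier concrete, the following minimal numpy sketch implements the neuro identifier (5) with the update laws (9) for a scalar control input. The hidden-layer size, the random initialization, the fixed choice $A = 0.8I$, and the use of tanh for both $f_1$ and $f_2$ are illustrative assumptions (the latter two happen to match the settings used in Example 1 of Section 5):

```python
import numpy as np

class RNNIdentifier:
    """Sketch of the neuro identifier (5) with update laws (9), scalar input u.
    The input-to-hidden weights V are fixed; only W1 and W2 are tuned."""

    def __init__(self, n, h, a1=0.09, a2=0.0625, seed=0):
        rng = np.random.default_rng(seed)
        self.A = 0.8 * np.eye(n)                  # design matrix (assumption)
        self.V = rng.uniform(-1.0, 1.0, (n, h))   # fixed hidden-layer weights
        self.W1 = rng.uniform(-0.1, 0.1, (h, n))  # hidden-to-output weights
        self.W2 = rng.uniform(-0.1, 0.1, (h, n))
        self.a1, self.a2 = a1, a2
        self.xhat = np.zeros(n)

    def step(self, x_next, u):
        """Predict x_hat(k+1) from x_hat(k), then update W1 and W2 using the
        new identification error e(k+1) = x_hat(k+1) - x(k+1), as in (6), (9)."""
        xbar = self.V.T @ self.xhat               # xbar(k) = V^T x_hat(k)
        f1, f2 = np.tanh(xbar), np.tanh(xbar)
        xhat_next = self.A.T @ self.xhat + self.W1.T @ f1 + (self.W2.T @ f2) * u
        e = xhat_next - x_next                    # e(k+1), x_next is measured x(k+1)
        # update laws (9)
        self.W1 -= self.a1 * np.outer(f1, e) + self.a1**2 * np.outer(f1, f1) @ self.W1
        self.W2 -= self.a2 * u * np.outer(f2, e)
        self.xhat = xhat_next
        return e
```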

4. ADP-based approximate optimal controller design

The HDP algorithm is used to obtain the optimal controller. In the HDP algorithm, one starts with an initial value function, commonly taken as $V_0(x(k)) = 0$, and solves for $u_0(k)$ as follows:

$$u_0(x(k)) = \arg\min_u (x^T(k) Q x(k) + u^T(k) R u(k) + V_0(x(k+1))). \qquad (20)$$

Once the policy $u_0(k)$ is determined, iteration on the value is performed by computing

$$V_1(x(k)) = x^T(k) Q x(k) + u_0^T(k) R u_0(k) + V_0(x(k+1)). \qquad (21)$$

The HDP value iteration scheme is then implemented by iterating between a sequence of control policies $u_i(x(k))$ calculated from

$$u_i(x(k)) = \arg\min_u (x^T(k) Q x(k) + u^T(k) R u(k) + V_i(x(k+1))) \qquad (22)$$

and a sequence of positive value functions $V_i(x(k))$ given by

$$V_{i+1}(x(k)) = x^T(k) Q x(k) + u_i^T(k) R u_i(k) + V_i(x(k+1)), \qquad (23)$$

with initial condition $V_0(x(k)) = 0$. Note that $i$ is the value iteration index, whereas $k$ is the time index. In order to implement the iterative algorithm, two NNs are utilized: a critic NN is used to approximate the value function, whereas an action NN is used to approximate the optimal control policy.
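Before introducing the NN implementation, the value iteration (22)–(23) can be summarized in a tabular sketch. Here `model(x, u)` stands in for the identified RNN model (5), the value function is stored on a sampled state set with nearest-neighbour lookup instead of a critic NN, and the minimization in (22) is done over a finite grid of scalar controls with scalar $R$; all of these discretization choices are illustrative assumptions, not the paper's NN implementation:

```python
import numpy as np

def hdp_value_iteration(model, states, u_grid, Q, R, n_iter=50):
    """Tabular sketch of the HDP iteration (22)-(23).
    states: (N, n) array of sampled states; u_grid: candidate scalar controls."""
    V = np.zeros(len(states))                  # V_0(x) = 0
    policy = np.zeros(len(states))

    def V_lookup(x):
        # nearest sampled state stands in for the critic NN output
        return V[np.argmin(np.linalg.norm(states - x, axis=1))]

    for _ in range(n_iter):
        V_new = np.empty_like(V)
        for s, x in enumerate(states):
            # (22): minimize the one-step target over the control grid
            targets = [x @ Q @ x + R * u * u + V_lookup(model(x, u)) for u in u_grid]
            j = int(np.argmin(targets))
            policy[s] = u_grid[j]
            V_new[s] = targets[j]              # (23): value update
        V = V_new                              # synchronous sweep, as in (23)
    return V, policy
```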

4.1. Critic NN design

Using the universal approximation property of NNs, the value function $V_i(x(k))$ is approximated by a NN as

$$V_i(x(k)) = W_{ci}^T f_c(x(k)) + \varepsilon_{ci}, \qquad (24)$$

where $W_{ci}$ is the unknown ideal constant weight vector, $f_c(x)$ is the critic NN activation function vector, and $\varepsilon_{ci}$ is the critic NN approximation error. The upper bound for the ideal NN weights is taken as $\|W_{ci}\| \le W_{cM}$, while the approximation error is upper bounded as $\|\varepsilon_{ci}\| \le \varepsilon_{cM}$, where $W_{cM}$ and $\varepsilon_{cM}$ are positive constants. Additionally, in this work it is assumed that $\|\partial \varepsilon_{ci} / \partial \hat{x}(k+1)\| \le \varepsilon'_{cM}$, where $\varepsilon'_{cM}$ is a positive constant.

Let $\hat{W}_{ci}$ be an approximation of $W_{ci}$; then the approximation of $V_i(x(k))$ is

$$\hat{V}_i(x(k)) = \hat{W}_{ci}^T f_c(x(k)). \qquad (25)$$

The output target of the critic network, $V_{i+1}(x(k))$, can be computed by (23). We then define the error function for the critic network as

$$e_{ci}(x(k)) = \hat{V}_{i+1}(x(k)) - V_{i+1}(x(k)), \qquad (26)$$

and the objective function to be minimized by the critic network as

$$E_{ci}(x(k)) = \frac{1}{2} e_{ci}^T(x(k)) e_{ci}(x(k)). \qquad (27)$$

The weight updating law for training the critic network is the gradient-based adaptation

$$\hat{W}_{c(i+1)} = \hat{W}_{ci} - \alpha_c \frac{\partial E_{ci}(x(k))}{\partial \hat{W}_{ci}}, \qquad (28)$$

where $\alpha_c > 0$ is the learning rate of the critic network.
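A one-line realization of (26)–(28) for a scalar value output might look as follows (a sketch; `target` is the right-hand side of (23) evaluated with the current iterates, and `phi_c` is the critic activation vector, both assumed supplied by the caller):

```python
import numpy as np

def critic_update(Wc, phi_c, x, target, ac=0.05):
    """One gradient step (28) on the critic weights.
    e_ci is the critic error (26); for E_ci = 0.5 * e_ci**2 in (27),
    the gradient with respect to Wc is e_ci * phi_c(x)."""
    e_ci = Wc @ phi_c(x) - target      # critic error (26)
    return Wc - ac * e_ci * phi_c(x)   # gradient step (28)
```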

4.2. Action NN design

The optimal controller for each iteration is approximated by the action NN as

$$u_i(x(k)) = W_{ai}^T f_a(x(k)) + \varepsilon_{ai}, \qquad (29)$$

where $W_{ai}$ is the matrix of unknown ideal constant weights, $f_a(x)$ is the action NN activation function vector, and $\varepsilon_{ai}$ is the action NN approximation error. The upper bound for the ideal NN weights is taken as $\|W_{ai}\| \le W_{aM}$, while the approximation error is upper bounded as $\|\varepsilon_{ai}\| \le \varepsilon_{aM}$, where $W_{aM}$ and $\varepsilon_{aM}$ are positive constants.

Let $\hat{W}_{ai}$ be an approximation of $W_{ai}$; then the approximation of $u_i(x(k))$ is

$$\hat{u}_i(x(k)) = \hat{W}_{ai}^T f_a(x(k)). \qquad (30)$$

A second training loop, indexed by $j$, is introduced to find the action network weights that minimize the cost function. To begin the development of this second loop, we define the action error as the difference between the approximate control input $\hat{u}_i^j(k)$ in (30) and the control input $u_{in}^j(k)$ that minimizes the approximate value function (25). Using (25), as well as the results of Theorem 1, the error function for the action network is written as

$$e_{ai}^j(k) = \hat{u}_i^j(k) - u_{in}^j(k) = \hat{W}_{ai}^{jT} f_a(x(k)) + \frac{1}{2} R^{-1} f_2^T(\bar{x}_j(k)) \bar{W}_2 \left( \frac{\partial f_c(\hat{x}_j(k+1))}{\partial \hat{x}_j(k+1)} \right)^T \hat{W}_{ci}, \qquad (31)$$

where $u_{in}^j(k) = -\frac{1}{2} R^{-1} f_2^T(\bar{x}_j(k)) \bar{W}_2 \, \partial \hat{V}_i(\hat{x}_j(k+1)) / \partial \hat{x}_j(k+1)$.

The weight update law for the action NN is selected to be

$$\hat{W}_{ai}^{j+1} = \hat{W}_{ai}^j - \alpha_{aj} \frac{f_a(x(k)) e_{ai}^{jT}(k)}{C(x(k)) + 1}, \qquad (32)$$

where $\alpha_{aj}$ is a small positive design parameter and $C(x(k)) = f_a^T(x(k)) f_a(x(k))$. Defining $\tilde{W}_{ai}^j = W_{ai} - \hat{W}_{ai}^j$, we can get the estimation error dynamics of the action NN weights as

$$\tilde{W}_{ai}^{j+1} = \tilde{W}_{ai}^j + \alpha_{aj} \frac{f_a(x(k)) e_{ai}^{jT}(k)}{C(x(k)) + 1}. \qquad (33)$$
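A minimal sketch of the normalized update (32) follows, with `phi_a` the action activation vector (an assumed callable):

```python
import numpy as np

def action_update(Wa, phi_a, x, e_ai, a_aj=0.01):
    """Normalized gradient step (32) on the action NN weights.
    e_ai is the action error (31) at the current inner iteration j."""
    fa = phi_a(x)
    C = fa @ fa                                    # C(x) = fa^T fa
    return Wa - a_aj * np.outer(fa, e_ai) / (C + 1.0)
```

The normalization by $C(x(k)) + 1$ keeps the effective step size bounded regardless of the regressor norm, much as in normalized least-mean-squares adaptation.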

4.3. Stability analysis

From (31), we know that

$$e_{ai}^{j+1}(k) = \hat{W}_{ai}^{(j+1)T} f_a(x(k)) + \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \left( \frac{\partial f_c(\hat{x}_{j+1}(k+1))}{\partial \hat{x}_{j+1}(k+1)} \right)^T \hat{W}_{ci}. \qquad (34)$$

The weight update law (32) can be viewed as a control input for the error function of the action network (34). Substituting the tuning law (32) into (34), and using $\hat{W}_{ai}^j = W_{ai} - \tilde{W}_{ai}^j$ and $\hat{W}_{ci} = W_{ci} - \tilde{W}_{ci}$, reveals the closed-loop error system

$$e_{ai}^{j+1}(k) = -\tilde{W}_{ai}^{jT} f_a(x(k)) - \alpha_{aj} \frac{C(x(k))}{C(x(k)) + 1} e_{ai}^j(k) - \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \left( \frac{\partial f_c(\hat{x}_{j+1}(k+1))}{\partial \hat{x}_{j+1}(k+1)} \right)^T \tilde{W}_{ci}$$
$$+ W_{ai}^T f_a(x(k)) + \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \left( \frac{\partial f_c(\hat{x}_{j+1}(k+1))}{\partial \hat{x}_{j+1}(k+1)} \right)^T W_{ci}, \qquad (35)$$

where $\tilde{W}_{ci} = W_{ci} - \hat{W}_{ci}$ and $\tilde{W}_{ai}^j = W_{ai} - \hat{W}_{ai}^j$. Since the control policy in (29) minimizes the infinite-horizon cost function, we have

$$\varepsilon_{ai}^j + W_{ai}^T f_a(x(k)) + \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \left( \frac{\partial f_c(\hat{x}_j(k+1))}{\partial \hat{x}_j(k+1)} \right)^T W_{ci} + \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \frac{\partial \varepsilon_{ci}}{\partial \hat{x}_j(k+1)} = 0. \qquad (36)$$

Combining (35) and (36), we have the error system for the action NN:

$$e_{ai}^{j+1}(k) = -\tilde{W}_{ai}^{jT} f_a(x(k)) - \alpha_{aj} \frac{C(x(k))}{C(x(k)) + 1} e_{ai}^j(k) - \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \left( \frac{\partial f_c(\hat{x}_{j+1}(k+1))}{\partial \hat{x}_{j+1}(k+1)} \right)^T \tilde{W}_{ci}$$
$$- \varepsilon_{ai}^j - \frac{1}{2} R^{-1} f_2^T(\bar{x}_{j+1}(k)) \bar{W}_2 \frac{\partial \varepsilon_{ci}}{\partial \hat{x}_j(k+1)}. \qquad (37)$$

In the following theorem, the convergence of the action NN weights is demonstrated while explicitly considering the NN reconstruction errors $\varepsilon_{ai}$ and $\varepsilon_{ci}$.

Theorem 2. Consider the error system for the action NN given by (37), and let the weight update law of the action NN be given by (32). Then the action NN error (31) and the action NN weight estimation errors $\tilde{W}_{ai}^j$ converge uniformly to a bounded region near the origin as $j \to \infty$.

Proof. Consider the following positive definite Lyapunov function candidate:

$$J = J_1 + J_2, \qquad (38)$$

where $J_1 = \zeta e_{ai}^{jT}(k) e_{ai}^j(k)$

and $J_2 = tr\{\tilde{W}_{ai}^{jT}(k) \tilde{W}_{ai}^j(k)\}$, with $\zeta > 0$ a design parameter. Taking the first difference of the Lyapunov function (38), substituting the error system for the action NN (37), and applying the Cauchy–Schwarz inequality reveals

$$\Delta J_1 = \zeta e_{ai}^{(j+1)T}(k) e_{ai}^{j+1}(k) - \zeta e_{ai}^{jT}(k) e_{ai}^j(k)$$
$$\le 5\zeta \left[ \|\tilde{W}_{ai}^{jT} f_a(x(k))\|^2 + \left( \frac{\alpha_{aj} C(x(k))}{C(x(k)) + 1} \right)^2 \|e_{ai}^j(k)\|^2 + \frac{1}{4} f_{2M}^2 W_{2M}^2 \|R^{-1}\|^2 \|\tilde{W}_{ci}\|^2 \left\| \frac{\partial f_c(\hat{x}_{j+1}(k+1))}{\partial \hat{x}_{j+1}(k+1)} \right\|^2 + \varepsilon_{aM}^2 + \frac{1}{4} f_{2M}^2 W_{2M}^2 \varepsilon'^2_{cM} \|R^{-1}\|^2 \right] - \zeta \|e_{ai}^j(k)\|^2, \qquad (39)$$

where the five squared terms correspond to the five terms of the error system (37). Similarly, substituting the action error (31) for $e_{ai}^j(k)$ and completing squares,

$$\Delta J_2 = tr\{\tilde{W}_{ai}^{(j+1)T}(k) \tilde{W}_{ai}^{j+1}(k) - \tilde{W}_{ai}^{jT}(k) \tilde{W}_{ai}^j(k)\}$$
$$= \frac{\alpha_{aj}^2 C(x(k))}{(C(x(k)) + 1)^2} tr\{e_{ai}^j(k) e_{ai}^{jT}(k)\} + tr\left\{ 2 \tilde{W}_{ai}^{jT} \frac{\alpha_{aj} f_a(x(k)) e_{ai}^{jT}(k)}{C(x(k)) + 1} \right\}$$
$$\le \frac{\alpha_{aj}^2 C(x(k))}{(C(x(k)) + 1)^2} \|e_{ai}^j(k)\|^2 - \frac{2\alpha_{aj}}{C(x(k)) + 1} \|\tilde{W}_{ai}^{jT} f_a(x(k))\|^2 + \frac{2\alpha_{aj}^2}{C(x(k)) + 1} \|\tilde{W}_{ai}^{jT} f_a(x(k))\|^2 + \frac{\varepsilon_{aM}^2}{C(x(k)) + 1} + \frac{f_{2M}^2 W_{2M}^2 \varepsilon'^2_{cM} \|R^{-1}\|^2}{4(C(x(k)) + 1)}. \qquad (40)$$

Finally, using (39) and (40), $\Delta J = \Delta J_1 + \Delta J_2$ is rewritten as

$$\Delta J \le -\left( \frac{2\alpha_{aj}}{C(x(k)) + 1} - 5\zeta - \frac{2\alpha_{aj}^2}{C(x(k)) + 1} \right) \|\tilde{W}_{ai}^{jT} f_a(x(k))\|^2 - \left( \zeta - 5\zeta \alpha_{aj}^2 \left( \frac{C(x(k))}{C(x(k)) + 1} \right)^2 - \frac{\alpha_{aj}^2 C(x(k))}{(C(x(k)) + 1)^2} \right) \|e_{ai}^j(k)\|^2 + D_M, \qquad (41)$$

where $D_M$ is defined as

$$D_M = \frac{5\zeta}{4} f_{2M}^2 W_{2M}^2 \|R^{-1}\|^2 \|\tilde{W}_{ci}\|^2 \left\| \frac{\partial f_c(\hat{x}_{j+1}(k+1))}{\partial \hat{x}_{j+1}(k+1)} \right\|^2 + 5\zeta \varepsilon_{aM}^2 + \frac{5\zeta}{4} f_{2M}^2 W_{2M}^2 \varepsilon'^2_{cM} \|R^{-1}\|^2 + \frac{\varepsilon_{aM}^2}{C(x(k)) + 1} + \frac{f_{2M}^2 W_{2M}^2 \varepsilon'^2_{cM} \|R^{-1}\|^2}{4(C(x(k)) + 1)}. \qquad (42)$$

If $\alpha_{aj}$ and $\zeta$ are selected to satisfy

$$\alpha_{aj} \in \left[ \frac{1 - \sqrt{1 - 10\zeta(C(x(k)) + 1)}}{2}, \, U \right], \quad U = \min\left\{ \frac{1 + \sqrt{1 - 10\zeta(C(x(k)) + 1)}}{2}, \, \frac{\sqrt{20\zeta^2 (C(x(k)) + 1)^2 + 1} - 1}{10\zeta C(x(k))} \right\}, \quad \zeta < \frac{1}{10(C(x(k)) + 1)}, \qquad (43)$$

then $\Delta J < 0$ whenever

$$\|\tilde{W}_{ai}^{jT} f_a(x(k))\| > \sqrt{\frac{D_M (C(x(k)) + 1)}{2\alpha_{aj} - 5\zeta(C(x(k)) + 1) - 2\alpha_{aj}^2}} \qquad (44)$$

or

$$\|e_{ai}^j(k)\| > \sqrt{\frac{D_M (C(x(k)) + 1)^2}{\zeta(C(x(k)) + 1)^2 - 5\zeta \alpha_{aj}^2 C(x(k))^2 - \alpha_{aj}^2 C(x(k))}}. \qquad (45)$$

Therefore, it can be concluded that the action NN error $e_{ai}^j(k)$ and the weight estimation errors of the action NN, $\tilde{W}_{ai}^j$, are UUB as $j \to \infty$. This completes the proof. □

5. Simulation

In this section, two examples are provided to demonstrate the effectiveness of the proposed approach. The first example considers an affine nonlinear system, while the second considers a nonaffine nonlinear system.

Example 1. Consider the following affine nonlinear discrete-time system:

$$x_1(k+1) = \sin(0.5 x_2(k)), \quad x_2(k+1) = \sin(0.9 x_1(k)) \cos(1.4 x_2(k)) + u(k). \qquad (46)$$

The cost functional is defined by (2), where $Q$ and $R$ are chosen as identity matrices of appropriate dimensions. It is assumed that the system dynamics are unknown. First, an RNN-based neuro identifier is used to identify the nonlinear system (46). The activation functions $f_1(\cdot)$ and $f_2(\cdot)$ are selected as hyperbolic tangent functions, and the design parameters are $A = [0.8, 0; 0, 0.8]$, $\alpha_1 = 0.09$, and $\alpha_2 = 0.0625$. We apply the RNN identification scheme for 10,000 steps; to maintain the excitation condition, probing noise is added to the control input for the first 5000 steps. Fig. 1 shows that the weights of the RNN converge to constants under persistent excitation, and the system identification errors are shown in Fig. 2. We then finish the training of the RNN model and keep its weights unchanged.

Now, using only the established RNN model, the HDP algorithm is implemented. The initial weights of the critic NN and the action NN are all set to random values in [−0.8, 0.8], and the activation functions of the critic NN and the action NN are $f_c(\cdot) = \mathrm{tansig}(\cdot)$ and $f_a(\cdot) = \mathrm{tansig}(\cdot)$. In the training process, the learning rates are chosen as $\alpha_c = 0.05$ and $\alpha_a = 0.01$. The convergence process of the value function is shown in Fig. 3, and the state trajectories of the nonlinear system are shown in Fig. 4.

[Fig. 1. The weights of RNN.]
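As a rough sketch of this identification stage (reusing the `RNNIdentifier` sketch from Section 3; the hidden-layer size, initial state, and probing-noise amplitude are assumptions, while $A = 0.8I$, $\alpha_1 = 0.09$, $\alpha_2 = 0.0625$, the 10,000-step horizon, and the 5000-step probing window follow the settings above):

```python
import numpy as np

def plant(x, u):
    """Example 1 dynamics (46)."""
    return np.array([np.sin(0.5 * x[1]),
                     np.sin(0.9 * x[0]) * np.cos(1.4 * x[1]) + u])

rng = np.random.default_rng(1)
ident = RNNIdentifier(n=2, h=8, a1=0.09, a2=0.0625)  # sketch from Section 3
x = np.array([0.5, -0.5])                            # assumed initial state
for k in range(10_000):
    u = 0.1 * rng.standard_normal() if k < 5000 else 0.0  # probing noise
    x_next = plant(x, u)
    ident.step(x_next, u)    # predict x_hat(k+1), then update W1, W2
    x = x_next
```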

[Fig. 2. The identification errors.]
[Fig. 3. The convergence process of the value function.]
[Fig. 4. The state trajectories.]

In order to compare with the HDP algorithm under full knowledge of the affine nonlinear system dynamics, we use the method proposed in [8] to solve for the optimal control policy $u^*$, and we denote the approximate optimal control policy obtained in this paper by $\hat{u}^*$. The trajectories of $u^*$ and $\hat{u}^*$ are shown in Fig. 5, and their difference is shown in Fig. 6. From Fig. 6, it can be seen that the difference between the two control policies is on the order of $10^{-3}$, so the proposed scheme, which utilizes the RNN model as a neuro identifier to learn the complete system dynamics, can be effectively implemented on unknown nonlinear discrete-time systems.

[Fig. 5. The trajectories of $u^*$ and $\hat{u}^*$.]
[Fig. 6. The difference between $u^*$ and $\hat{u}^*$.]

Example 2. Consider the following nonaffine nonlinear discrete-time system:

$$x_1(k+1) = 0.5 x_2(k) - 0.3 \sin(x_1(k)), \quad x_2(k+1) = \cos(x_2(k)) \sin(x_1(k)) + 2 u^2(k) \tanh(u(k)). \qquad (47)$$

The cost functional is defined as in Example 1, and it is again assumed that the system dynamics are unknown.
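Since (47) is nonaffine in $u(k)$, the minimization in (22) admits no closed-form solution. One hedged option, consistent with the grid-based sketch in Section 4, is to scan candidate inputs through the identified model; the control range, grid resolution, and scalar $R$ are assumptions for the single-input case:

```python
import numpy as np

def greedy_control(model, V_hat, x, Q, R, u_grid=np.linspace(-2.0, 2.0, 401)):
    """Sketch: approximate the minimizing control in (22) for a nonaffine
    system by scanning a finite grid of scalar inputs through the
    identified model and the current value-function estimate V_hat."""
    costs = [x @ Q @ x + R * u * u + V_hat(model(x, u)) for u in u_grid]
    return u_grid[int(np.argmin(costs))]
```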

Using a method similar to that in Example 1, we obtain the trajectories of the identified weights, shown in Fig. 7, and the system identification errors, shown in Fig. 8. Then, based on the established RNN model, we design the approximate optimal controller. The initial weights, activation functions, and learning rates are the same as in Example 1. The convergence process of the value function is shown in Fig. 9, the state trajectories of the nonlinear system are shown in Fig. 10, and the trajectory of the approximate optimal control is shown in Fig. 11. The simulation results reveal that the proposed controller can be applied to nonaffine nonlinear systems and obtains satisfying performance even when the discrete-time system dynamics are unknown.

[Fig. 7. The weights of RNN.]
[Fig. 8. The identification errors.]
[Fig. 9. The convergence process of the value function.]
[Fig. 10. The state trajectories.]
[Fig. 11. The trajectory of the approximate optimal control.]

6. Conclusion

In this paper, we have proposed an effective control scheme for unknown nonaffine nonlinear discrete-time systems by using the ADP method. An RNN model was introduced as a neuro identifier to reconstruct the unknown system dynamics. Based on the obtained model, the HDP method was then utilized to design the approximate optimal controller, and the convergence of the NN implementation was demonstrated while considering the NN approximation errors. The simulation studies have demonstrated the validity of the proposed approximate optimal control scheme.


References

[1] M. Abu-Khalaf, F.L. Lewis, Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach, Automatica 41 (5) (2005) 779–791.
[2] R. Beard, G. Saridis, J. Wen, Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation, Automatica 33 (12) (1997) 2159–2177.
[3] S. Jagannathan, Neural Network Control of Nonlinear Discrete-time Systems, CRC Press, Boca Raton, FL, 2006.
[4] J. Si, A.G. Barto, W.B. Powell, D. Wunsch II, Handbook of Learning and Approximate Dynamic Programming, Wiley–IEEE Press, New York, 2004.
[5] F.-Y. Wang, H. Zhang, D. Liu, Adaptive dynamic programming: an introduction, IEEE Comput. Intell. Mag. 4 (2) (2009) 39–47.
[6] F.-Y. Wang, N. Jin, D. Liu, Q. Wei, Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound, IEEE Trans. Neural Networks 22 (1) (2011) 24–36.
[7] Q. Wei, H. Zhang, J. Dai, Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions, Neurocomputing 72 (7–9) (2009) 1839–1848.
[8] A. Al-Tamimi, F.L. Lewis, M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Trans. Syst. Man Cybern. B Cybern. 38 (4) (2008) 943–949.
[9] H. Zhang, Y. Luo, D. Liu, RBF neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Trans. Neural Networks 20 (9) (2009) 1490–1503.
[10] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems based on greedy HDP iteration algorithm, IEEE Trans. Syst. Man Cybern. B Cybern. 38 (4) (2008) 937–942.
[11] W. Yu, Nonlinear system identification using discrete-time recurrent neural networks with stable learning algorithms, Inf. Sci. 158 (2004) 131–147.
[12] J.M. Lee, J.H. Lee, Approximate dynamic programming-based approaches for input–output data-driven control of nonlinear processes, Automatica 41 (7) (2005) 1281–1288.
[13] J.D.J. Rubio, W. Yu, Stability analysis of nonlinear system identification via delayed neural networks, IEEE Trans. Circuits Syst. II, Exp. Briefs 54 (2) (2007) 161–195.
[14] F.L. Lewis, K.G. Vamvoudakis, Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data, IEEE Trans. Syst. Man Cybern. B Cybern. 41 (1) (2011) 14–25.
[15] A. Al-Tamimi, F.L. Lewis, M. Abu-Khalaf, Model-free Q-learning designs for linear discrete-time zero-sum games with application to H∞ control, Automatica 43 (3) (2007) 473–481.
[16] T. Dierks, B.T. Thumati, S. Jagannathan, Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence, Neural Networks 22 (2009) 851–860.
[17] D. Liu, D. Wang, D. Zhao, Adaptive dynamic programming for optimal control of unknown nonlinear discrete-time systems, in: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011, pp. 242–249.
[18] T. Hayakawa, W.M. Haddad, N. Hovakimyan, Neural network adaptive control for a class of nonlinear uncertain dynamical systems with asymptotic stability guarantees, IEEE Trans. Neural Networks 19 (1) (2008) 80–89.
[19] K.G. Vamvoudakis, F.L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (5) (2010) 878–888.

Xin Zhang received her B.S. degree in Automation Control in 2005 from Northeastern University, Shenyang, China. Currently, she is working towards her Ph.D. degree in Control Theory and Control Engineering at Northeastern University, Shenyang, China. Her research interests include approximate dynamic programming, neural networks adaptive control, game theory, and their industrial application.


Huaguang Zhang received his B.S. degree and M.S. degree in Control Engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received his Ph.D. degree in Thermal Power Engineering and Automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural networks based control, nonlinear control, and their applications. He has authored and coauthored over 200 journal and conference papers, four monographs and co-invented 20 patents. Dr. Zhang is an Associate Editor of Automatica, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Systems, Man, and Cybernetics-Part B and Neurocomputing, respectively. He was awarded the Outstanding Youth Science Foundation Award from the National Natural Science Foundation Committee of China in 2003. He was named the Cheung Kong Scholar by the Education Ministry of China in 2005.

Qiuye Sun received his B.S. degree in Electrical Power System and Automation in 2005 from Northeastern Dianli University, Changchun, China, his M.S. degree in Power Electronics and Power Transmission in 2004 from Northeastern University, Shenyang, China, and his Ph.D. degree in Control Theory and Control Engineering in 2007 from Northeastern University, Shenyang, China. He is currently working at Northeastern University as an Associate Professor. His research interests include power transmission and distribution systems, smart grids, power systems, and their industrial applications.

Yanhong Luo received her B.S. degree in Automation Control, M.S. degree and Ph.D. degree in Control Theory and Control Engineering from Northeastern University, Shenyang, China, in 2003, 2006 and 2009, respectively. She is currently working in Northeastern University as an Associate Professor. Her research interests include fuzzy control, neural networks adaptive control, approximate dynamic programming and their industrial application.