Output-feedback adaptive optimal control of interconnected systems based on robust adaptive dynamic programming


Automatica 72 (2016) 37–45


Brief paper

Weinan Gao a, Yu Jiang a, Zhong-Ping Jiang a,b, Tianyou Chai b

a Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, Brooklyn, NY 11201, USA
b State Key Lab of Synthetical Automation for Process Industries, Northeastern University, Shenyang, Liaoning 110819, China

Article info

Article history: Received 6 August 2014; Received in revised form 29 December 2015; Accepted 5 May 2016

Keywords: Approximate/adaptive dynamic programming (ADP); Output-feedback control; Nonlinear dynamic uncertainty; Robust optimal control

Abstract

This paper studies the adaptive and optimal output-feedback problem for continuous-time uncertain systems with nonlinear dynamic uncertainties. Data-driven output-feedback control policies are developed by approximate/adaptive dynamic programming (ADP) based on both policy iteration and value iteration methods. The obtained adaptive and optimal output-feedback controllers differ from the existing literature on ADP in that they are derived from sampled-data systems theory and are guaranteed to be robust to dynamic uncertainties. A small-gain condition is given under which the overall system is globally asymptotically stable at the origin. An application to power systems is given to test the effectiveness of the proposed approaches.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

As an important subfield of modern control theory, optimal control aims to develop controllers that optimize a prescribed performance index (see Lewis, Vrabie, & Syrmos, 2012). Traditional optimal control methods solve the Hamilton–Jacobi–Bellman (HJB) equation (or the algebraic Riccati equation (ARE) for linear systems) using perfect knowledge of the system dynamics. Unfortunately, it is often difficult to obtain an accurate mathematical model for real-world engineering and natural systems. Approximate/adaptive dynamic programming (ADP) (see, e.g., Lewis & Liu, 2013; Ni, He, & Wen, 2013; Si, Barto, Powell, & Wunsch, 2004; Vamvoudakis, 2014; Werbos, 1974, 1990; Zhang, Liu, Luo, & Wang, 2013) is a non-model-based approach that approximates optimal solutions online via recursive numerical methods. Related research has been conducted for discrete-time Markov decision processes (Bertsekas & Tsitsiklis, 1996; Powell, 2007; Sutton & Barto, 1998) and discrete-time

✩ This work has been supported in part by the U.S. National Science Foundation grant ECCS-1501044, and in part by the National Natural Science Foundation of China grant 61374042, and the Project 111 (No. B08015). The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Raul Ordonez under the direction of Editor Miroslav Krstic. E-mail addresses: [email protected] (W. Gao), [email protected] (Y. Jiang), [email protected] (Z.-P. Jiang), [email protected] (T. Chai).

http://dx.doi.org/10.1016/j.automatica.2016.05.008

feedback control systems (Lewis & Vrabie, 2009; Lewis, Vrabie, & Vamvoudakis, 2012; Liu & Wei, 2014; Wang, Zhang, & Liu, 2009). For continuous-time systems, related work can be found in Bhasin, Sharma, Parte, and Dixon (2011), Jiang and Jiang (2012a, 2014), Luo, Wu, Huang, and Liu (2014), Vrabie, Pastravanu, Abu-Khalaf, and Lewis (2009), and Yang, Liu, Ma, and Xu (2016). ADP and output regulation theory were first integrated by Gao and Jiang (2016) to solve the problem of asymptotic tracking with disturbance rejection. Recently, extending these solutions to output-feedback problems has received attention from Lewis and Vamvoudakis (2011), Gao, Huang, Jiang, and Chai (2016), Gao, Jiang, Jiang, and Chai (2014), and Zhu, Modares, Peen, Lewis, and Yue (2015) for linear systems, and from He and Jagannathan (2005) and Liu, Huang, Wang, and Wei (2013) for nonlinear systems based on neural networks (Ge, Lee, & Harris, 1998). A common feature of these papers is that no dynamic uncertainty (Jiang & Mareels, 1997) is addressed. However, there are numerous practical examples of continuous-time systems arising from engineering and biology for which dynamic uncertainty is unavoidable.

The contribution of this paper is threefold. First, different from existing output-feedback ADP for discrete-time linear systems (Lewis & Vamvoudakis, 2011) or continuous-time static output-feedback ADP designs that require accurate knowledge of the input matrix (Zhu et al., 2015), a dynamic output-feedback ADP approach is proposed for continuous-time linear systems without exact knowledge of any system matrices. By employing


the sampled-data system theory (Chen & Francis, 1995), the unmeasurable state can be reconstructed in terms of input/output data, whereby we can iteratively solve the ARE. Second, we develop, for the first time, an output-feedback value iteration (VI) ADP algorithm for continuous-time linear systems by using the sampled-data approach. As the third contribution, we study the robust optimal redesign problem for a class of interconnected systems with dynamic uncertainties whose state and order are unknown. Because of the implementation of sampled-data output-feedback robust optimal controllers, the closed-loop system is a hybrid system involving both continuous-time and discrete-time dynamics. Thus, robustness analysis cannot be conducted directly by our previous work on state-feedback robust ADP (Jiang & Jiang, 2013, 2014). Instead, we derive the global asymptotic stability of the closed-loop interconnected system by a combined application of Lyapunov theory, sampled-data systems theory, and the nonlinear small-gain method. To the best of the authors' knowledge, this paper represents the first step towards the ADP design of output-feedback adaptive optimal controllers for continuous-time nonlinear systems with both static and dynamic uncertainties.

The remainder of this paper is organized as follows. In Section 2, we formulate the control problem and briefly review linear-quadratic regulator (LQR) theory. In Section 3, we develop adaptive optimal output-feedback strategies using both policy iteration (PI) and VI based ADP methods. Robustness and suboptimality of the closed-loop system are analyzed in Section 4. An application to a practical example from power systems is presented in Section 5. Finally, conclusions are contained in Section 6.

Notations. Throughout this paper, R+ denotes the set of nonnegative real numbers. |·| represents the Euclidean norm for vectors and the induced norm for matrices.
A continuous function α : R+ → R+ belongs to class K if it is increasing and α(0) = 0. It belongs to class K∞ if, in addition, it is proper. A continuous function β : R+ × R+ → R+ belongs to class KL if, for each fixed t, the function β(·, t) is of class K and, for each fixed s, the function β(s, ·) is non-increasing and tends to 0 at infinity. ⊗ indicates the Kronecker product operator, and vec(A) = [a₁ᵀ, a₂ᵀ, …, aₘᵀ]ᵀ, where aᵢ ∈ R^n are the columns of A ∈ R^{n×m}. For a symmetric matrix P ∈ R^{m×m}, vecs(P) = [p₁₁, 2p₁₂, …, 2p₁ₘ, p₂₂, 2p₂₃, …, 2p_{m−1,m}, pₘₘ]ᵀ ∈ R^{m(m+1)/2}. For an arbitrary column vector v ∈ R^n, vecv(v) = [v₁², v₁v₂, …, v₁vₙ, v₂², v₂v₃, …, v_{n−1}vₙ, vₙ²]ᵀ ∈ R^{n(n+1)/2}. λM(P) and λm(P) denote the maximum and the minimum eigenvalue of the real symmetric matrix P. For any piecewise continuous function u : R+ → R^m, ∥u∥ stands for sup_{t≥0} |u(t)|.

2. Problem formulation and preliminaries

Consider a linear subsystem interacting with a nonlinear subsystem, known as the dynamic uncertainty and characterized by the ζ-system:

ẋ = Ax + B(u + ∆(ζ, y)), (1)
ζ̇ = g(ζ, y), (2)
y = Cx (3)

where x ∈ R^n and ζ ∈ R^p are unmeasurable states with an unknown integer p > 0, u ∈ R^m is the input, and y ∈ R^r is the output. A ∈ R^{n×n}, B ∈ R^{n×m}, and C ∈ R^{r×n} are unknown system matrices with (A, B) controllable and (A, C) observable, satisfying |A| ≤ AM, |B| ≤ BM, and |C| ≤ CM. g : R^p × R^r → R^p and ∆ : R^p × R^r → R^m are two unknown, locally Lipschitz functions with g(0, 0) = 0 and ∆(0, 0) = 0.

Remark 2.1. The system (1)–(3) belongs to the class of interconnected systems studied by Saberi, Kokotovic, and Summers (1990). If ∆(ζ, y) = ∆₁(y), the system (1) and (3) is a linear system with nonlinear output injection (see Krener & Isidori, 1983).

Assumption 2.1. The ζ-system, with y regarded as the input and ∆ as the output, has the strong unboundedness observability (SUO) property with zero offset (Jiang, Teel, & Praly, 1994), i.e., there exist a function σ₁ of class KL and a function γ₁ of class K such that, for any measurable essentially bounded control y(t) on [0, T) with 0 < T ≤ +∞, the solution ζ(t) of (2) right maximally defined on [0, T′) (0 < T′ ≤ T) satisfies

|ζ(t)| ≤ σ₁(|ζ(0)|, t) + γ₁(∥[y_{[0,t]}ᵀ, ∆_{[0,t]}ᵀ]ᵀ∥), ∀t ∈ [0, T′)

where y_{[0,t]} and ∆_{[0,t]} are truncated functions of y and ∆ over [0, t], respectively.

Assumption 2.2. The ζ-system is input-to-output stable (IOS) (Sontag, 2007), i.e., there exist a function σ∆ of class KL and a function γ∆ of class K such that, for any initial state ζ(0), any measurable essentially bounded input y, and any t ≥ 0,

|∆(t)| ≤ σ∆(|ζ(0)|, t) + γ∆(∥y∥). (4)

Considering the reduced-order system (1) and (3) in the absence of the dynamic uncertainty,

ẋ = Ax + Bu, y = Cx, (5)

define the cost as

Jc(x(0)) = ∫₀^∞ (yᵀ(τ)Qy(τ) + uᵀ(τ)Ru(τ)) dτ (6)

where Q = Qᵀ ≥ 0 and R = Rᵀ > 0 with (A, √Q C) observable. Moreover, the minimal cost Jc∗ = xᵀ(0)P∗x(0) in (6) is attained by the control policy

u = −R⁻¹BᵀP∗x := −K∗x (7)

where P∗ = (P∗)ᵀ > 0 is the unique solution to the algebraic Riccati equation (ARE)

AᵀP∗ + P∗A + CᵀQC − P∗BR⁻¹BᵀP∗ = 0. (8)
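As a quick numerical illustration of (7)–(8), the ARE can be solved with SciPy's `solve_continuous_are`; the matrices below are illustrative placeholders, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative second-order system (an assumption for this sketch,
# not the paper's example).
A = np.array([[0.0, 1.0],
              [-1.0, -2.0]])
B = np.array([[0.0],
              [1.0]])
C = np.eye(2)
Q = np.eye(2)  # output weight in (6)
R = np.eye(1)  # input weight in (6)

# Solve the ARE (8): A'P + PA + C'QC - P B R^{-1} B' P = 0.
P = solve_continuous_are(A, B, C.T @ Q @ C, R)
# Optimal gain (7): K* = R^{-1} B' P*.
K = np.linalg.solve(R, B.T @ P)

# Residual of (8) should vanish up to round-off.
residual = A.T @ P + P @ A + C.T @ Q @ C - P @ B @ np.linalg.solve(R, B.T @ P)
```

By construction, A − BK∗ is Hurwitz, which can be confirmed from its eigenvalues.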

A discretized model of (5) is obtained by taking periodic sampling:

x_{k+1} = Ad x_k + Bd u_k, y_k = Cx_k (9)

where Ad = e^{Ah}, Bd = (∫₀ʰ e^{Aτ} dτ)B, and h > 0 is the sampling period. Suppose the sampling frequency ωh = 2π/h is non-pathological (Chen & Francis, 1995). Then, both (Ad, C) and (Ad, √Q C) are observable with (Ad, Bd) controllable. The cost for (9) is

Jd(x(0)) = Σ_{j=0}^∞ (y_jᵀ Qd y_j + u_jᵀ Rd u_j) (10)

where Qd = Qh and Rd = Rh. Notice that (10) can be viewed as a first-order approximation of equation (7) in Melzer and Kuo (1971), which itself is the discretized version of cost (6). The optimal control law minimizing (10) is

u_k = −(Rd + Bdᵀ Pd∗ Bd)⁻¹ Bdᵀ Pd∗ Ad x_k := −Kd∗ x_k (11)

where Pd∗ = (Pd∗)ᵀ > 0 is the unique solution to

Adᵀ Pd∗ Ad − Pd∗ + Cᵀ Qd C − Adᵀ Pd∗ Bd (Rd + Bdᵀ Pd∗ Bd)⁻¹ Bdᵀ Pd∗ Ad = 0. (12)

The sensitivities of Pd∗ and Kd∗ with respect to the sampling period h are discussed in the following lemma.
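The discretization (9) and the optimal pair (11)–(12) can be reproduced numerically; the sketch below uses the same illustrative matrices as the previous snippet (assumptions, not the paper's example), computing the exact zero-order-hold pair (Ad, Bd) via the standard augmented-matrix-exponential trick.

```python
import numpy as np
from scipy.linalg import expm, solve_discrete_are

# Illustrative matrices (assumptions, not the paper's example).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
C = np.eye(2); Q = np.eye(2); R = np.eye(1)
h = 0.01  # sampling period

# expm([[A, B], [0, 0]] h) = [[Ad, Bd], [0, I]] gives the exact ZOH pair (9).
n, m = A.shape[0], B.shape[1]
M = np.zeros((n + m, n + m))
M[:n, :n], M[:n, n:] = A, B
E = expm(M * h)
Ad, Bd = E[:n, :n], E[:n, n:]

Qd, Rd = (C.T @ Q @ C) * h, R * h          # discretized weights of (10)
Pd = solve_discrete_are(Ad, Bd, Qd, Rd)    # unique solution of (12)
Kd = np.linalg.solve(Rd + Bd.T @ Pd @ Bd, Bd.T @ Pd @ Ad)  # gain (11)

# Residual of (12) should vanish up to round-off.
resid = (Ad.T @ Pd @ Ad - Pd + Qd
         - Ad.T @ Pd @ Bd @ np.linalg.solve(Rd + Bd.T @ Pd @ Bd,
                                            Bd.T @ Pd @ Ad))
```

For small h, Kd is close to the continuous-time gain K∗, as quantified by Lemma 2.1.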


Lemma 2.1. Let Pd∗ and Kd∗ satisfy Eqs. (11)–(12) at sampling period h. Then

Pd∗ = P∗ + (1/2)CᵀQC h + O_P(h²),
Kd∗ = K∗ + (1/2)K∗(A − BK∗)h + O_K(h²) (13)

where, for i = P, K, lim sup_{h→0} |O_i(h²)/h²| < ∞.

Proof. Inspired by Melzer and Kuo (1971), for h > 0, let

X(h) = (Adᵀ Pd∗(h) Ad − Pd∗(h) + CᵀQdC)/h,
Y(h) = (Adᵀ Pd∗(h) Bd)/h,
Z(h) = (Rd + Bdᵀ Pd∗(h) Bd)/h. (14)

It is easy to obtain the limits of X(h), Y(h) and Z(h) as h goes to zero:

X(0) = Pd∗(0)A + AᵀPd∗(0) + CᵀQC, Y(0) = Pd∗(0)B, Z(0) = R.

Then, Eqs. (11)–(12) imply

Kd∗(h) = Z⁻¹(h)Yᵀ(h), X(h) = Y(h)Kd∗(h), (15)

so that Kd∗(0) = R⁻¹BᵀPd∗(0) and

Pd∗(0)A + AᵀPd∗(0) + CᵀQC = Pd∗(0)BR⁻¹BᵀPd∗(0),

which indicates that Pd∗(0) = P∗ and Kd∗(0) = K∗. For the first-order sensitivities, Kd∗(h) and X(h) are differentiated at h = 0:

∂Kd∗/∂h = R⁻¹ ∂Yᵀ/∂h − R⁻¹ (∂Z/∂h) R⁻¹BᵀPd∗(0),
∂X/∂h = Pd∗(0)B ∂Kd∗/∂h + (∂Y/∂h) R⁻¹BᵀPd∗(0). (16)

By (14), the first-order sensitivities of X(h), Y(h) and Z(h) are

∂X/∂h = Aᵀ ∂Pd∗/∂h + (∂Pd∗/∂h)A + (Aᵀ/2)(AᵀPd∗(0) + Pd∗(0)A) + (AᵀPd∗(0) + Pd∗(0)A)(A/2),
∂Y/∂h = (∂Pd∗/∂h)B + (1/2)(2AᵀPd∗(0) + Pd∗(0)A)B,
∂Z/∂h = BᵀPd∗(0)B. (17)

Substituting (17) into (16), and by (15), we have

(A − BR⁻¹BᵀPd∗(0))ᵀ(∂Pd∗/∂h − CᵀQC/2) + (∂Pd∗/∂h − CᵀQC/2)(A − BR⁻¹BᵀPd∗(0)) = 0,
∂Kd∗/∂h = R⁻¹Bᵀ(∂Pd∗/∂h − CᵀQC/2) + (1/2)Kd∗(0)(A − BKd∗(0)). (18)

Since A − BR⁻¹BᵀPd∗(0) is asymptotically stable, we obtain ∂Pd∗/∂h = CᵀQC/2 and ∂Kd∗/∂h = (1/2)K∗(A − BK∗). □

Remark 2.2. Lemma 2.1 implies that, for any ϵ₁ > 0, there exists δ₁ > 0 such that |K∗ − Kd∗| < ϵ₁ if the sampling period h < δ₁.

The following technical theorem characterizes the relationship between the optimal cost for the original continuous-time system and the cost attained by the discretized counterpart under the sampled-data controller (11).

Theorem 2.1. Let Jc⊕ be the cost in (6) for the system (5) in closed loop with the controller (11). For small h > 0 such that ωh is non-pathological, the error between Jc⊕ and Jc∗ is

Jc⊕ − Jc∗ = yᵀ(0)Qy(0)h + O_J(h²)

where lim sup_{h→0} |O_J(h²)/h²| < ∞.

Proof. Write J⊕ as a summation (see Melzer & Kuo, 1971):

J⊕ = ∫₀^∞ (yᵀ(τ)Qy(τ) + uᵀ(τ)Ru(τ)) dτ = Σ_{j=0}^∞ (x_jᵀ Q̂ x_j + 2x_jᵀ M u_j + u_jᵀ R̂ u_j)

where

Q̂ = ∫₀ʰ (e^{As})ᵀ CᵀQC e^{As} ds,
M = ∫₀ʰ (e^{As})ᵀ CᵀQC (∫₀ˢ e^{Aλ}B dλ) ds,
R̂ = Rd + ∫₀ʰ (∫₀ˢ e^{Aλ}B dλ)ᵀ CᵀQC (∫₀ˢ e^{Aλ}B dλ) ds.

Then, we have J⊕ = xᵀ(0)P⊕(h)x(0), where P⊕(h) satisfies

(Ad − BdKd∗)ᵀ P⊕ (Ad − BdKd∗) − P⊕ + Q̂ − MKd∗ − (Kd∗)ᵀMᵀ + (Kd∗)ᵀR̂Kd∗ = 0. (19)

Define P∆⊕(h) = P⊕(h) − Pd∗(h), and

X₂(h) = [(Ad − BdKd∗)ᵀ P∆⊕ (Ad − BdKd∗) − P∆⊕]/h,
Y₂(h) = [Q̂ − CᵀQdC − MKd∗ − (Kd∗)ᵀMᵀ + (Kd∗)ᵀ(R̂ − Rd)Kd∗]/h.

Eqs. (12) and (19) imply X₂(h) = Y₂(h). By X₂(0) = Y₂(0) and the first-order sensitivities of X₂ and Y₂, we have

0 = (A − BK∗)ᵀ P∆⊕(0) + P∆⊕(0)(A − BK∗),
0 = (A − BK∗)ᵀ(∂P∆⊕/∂h − CᵀQC/2) + (∂P∆⊕/∂h − CᵀQC/2)(A − BK∗). (20)

(A − BK∗) is a Hurwitz matrix, revealing that P∆⊕(0) = 0 and ∂P∆⊕/∂h = CᵀQC/2. By Lemma 2.1, we obtain the approximation of P⊕ − P∗:

P⊕ − P∗ = P∆⊕ + (Pd∗ − P∗) ≃ CᵀQC h.

The proof is thus completed. □
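Lemma 2.1 can be checked numerically. For the scalar system A = 0, B = C = Q = R = 1 (an illustrative example, not the paper's), P∗ = 1 and the DARE gives Pd∗ = (h + √(h² + 4))/2 = 1 + h/2 + h²/8, so the deviation from the first-order prediction P∗ + (1/2)CᵀQC h is O(h²), as (13) states.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_discrete_are

# Scalar sanity check of the expansion (13) on an illustrative example.
A = np.array([[0.0]]); B = np.array([[1.0]])
one = np.array([[1.0]])
Pstar = solve_continuous_are(A, B, one, one)[0, 0]   # P* = 1

errors = []
for h in [0.1, 0.01, 0.001]:
    Ad = np.array([[1.0]])   # e^{Ah} with A = 0
    Bd = np.array([[h]])     # (integral of e^{A tau} d tau) B = h
    Pd = solve_discrete_are(Ad, Bd, one * h, one * h)[0, 0]
    # Deviation from the first-order prediction P* + (1/2) C'QC h:
    errors.append((h, abs(Pd - (Pstar + 0.5 * h))))
```

Here the deviation equals h²/8 exactly, so each recorded error is below h², consistent with lim sup |O_P(h²)/h²| < ∞.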



Remark 2.3. Since (12) is nonlinear in Pd∗, solving Pd∗ directly from (12) is difficult. It can be approximated by the PI Algorithm 1 (Hewer, 1971) or the VI Algorithm 2 (Lancaster & Rodman, 1995, Chap. 17).

Remark 2.4. The sequences {P_j}_{j=0}^∞ and {K_j}_{j=1}^∞ computed from Algorithm 1 or 2 converge to Pd∗ and Kd∗, respectively (see Hewer, 1971; Lancaster & Rodman, 1995). Moreover, for j = 0, 1, 2, …, the matrix Ad − BdK_j is Schur, with K_j computed by Algorithm 1.
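As a numerical check of Remark 2.4, the sketch below runs both model-based recursions, Hewer's policy iteration and value iteration from zero, on illustrative matrices (assumptions, not the paper's example) and compares them against a direct DARE solution.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Illustrative discrete-time pair; Ad is Schur, so K0 = 0 is stabilizing.
Ad = np.array([[1.0, 0.01], [-0.01, 0.98]])
Bd = np.array([[0.0], [0.01]])
Qd = 0.01 * np.eye(2)
Rd = 0.01 * np.eye(1)
Pd = solve_discrete_are(Ad, Bd, Qd, Rd)   # ground truth Pd*

# Policy iteration (Hewer, 1971): policy evaluation + improvement.
K = np.zeros((1, 2))
for _ in range(20):
    Acl = Ad - Bd @ K
    # Evaluate: (Ad-BdK)' P (Ad-BdK) - P + Qd + K'RdK = 0.
    P = solve_discrete_lyapunov(Acl.T, Qd + K.T @ Rd @ K)
    # Improve the policy.
    K = np.linalg.solve(Rd + Bd.T @ P @ Bd, Bd.T @ P @ Ad)
pi_err = np.max(np.abs(P - Pd))

# Value iteration from P = 0 (Riccati recursion).
Pv = np.zeros((2, 2))
for _ in range(20000):
    Pv = Ad.T @ Pv @ Ad + Qd - Ad.T @ Pv @ Bd @ np.linalg.solve(
        Rd + Bd.T @ Pv @ Bd, Bd.T @ Pv @ Ad)
vi_err = np.max(np.abs(Pv - Pd))
```

PI converges quadratically (a handful of iterations suffice), while VI needs many more but no initial stabilizing gain, matching the trade-off between the two algorithms.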

Algorithm 1 PI Algorithm
1: Find a K₀ such that Ad − BdK₀ is a Schur matrix. j ← 0. Select a sufficiently small constant ϵ > 0.
2: repeat
3: Solve P_j and K_{j+1} from
   0 = (Ad − BdK_j)ᵀ P_j (Ad − BdK_j) − P_j + CᵀQdC + K_jᵀRdK_j (21)
   K_{j+1} ← (Rd + BdᵀP_jBd)⁻¹BdᵀP_jAd (22)
4: j ← j + 1.
5: until |P_j − P_{j−1}| < ϵ

Algorithm 2 VI Algorithm
1: Select a sufficiently small constant ϵ > 0. j ← 0. P_j ← 0.
2: repeat
3: Compute P_{j+1} and K_{j+1} by
   P_{j+1} ← AdᵀP_jAd + CᵀQdC − AdᵀP_jBd(Rd + BdᵀP_jBd)⁻¹BdᵀP_jAd (23)
   K_{j+1} ← (Rd + BdᵀP_{j+1}Bd)⁻¹BdᵀP_{j+1}Ad (24)
4: j ← j + 1
5: until |P_j − P_{j−1}| < ϵ

3. ADP design for output-feedback control

In this section, we present a state estimation method in terms of input/output data. Both PI-based and VI-based ADP learning strategies are proposed to find the optimal dynamic output-feedback controller with unknown system matrices.

3.1. State estimation

The discretized variant of system (1) and (3) can be rewritten as

x_{k+1} = Adx_k + Bdu_k + ξ_k, y_k = Cx_k (25)

where ξ_k = ∫_{kh}^{(k+1)h} e^{A((k+1)h−τ)} B∆(ζ(τ), y(τ)) dτ. By Aangenent, Kostic, de Jager, van de Molengraft, and Steinbuch (2005) and Lewis and Vamvoudakis (2011), the state x_k can be reconstructed as follows:

x_k = Myȳ_k + Muū_k + Mξξ̄_k (26)

where

ū_k = [u_{k−1}ᵀ, u_{k−2}ᵀ, …, u_{k−n}ᵀ]ᵀ ∈ R^{mn},
ȳ_k = [y_{k−1}ᵀ, y_{k−2}ᵀ, …, y_{k−n}ᵀ]ᵀ ∈ R^{rn},
ξ̄_k = [ξ_{k−1}ᵀ, ξ_{k−2}ᵀ, …, ξ_{k−n}ᵀ]ᵀ ∈ R^{n²},
My = Adⁿ(UᵀU)⁻¹Uᵀ ∈ R^{n×rn},
Mu = V − MyS(I_n ⊗ Bd) ∈ R^{n×mn},
Mξ = W − MyS ∈ R^{n×n²},
W = [I_n, Ad, …, Ad^{n−1}] ∈ R^{n×n²},
V = [Bd, AdBd, …, Ad^{n−1}Bd] ∈ R^{n×mn},
U = [(CAd^{n−1})ᵀ, …, (CAd)ᵀ, Cᵀ]ᵀ ∈ R^{rn×n},

      [0  C  CAd  ···  CAd^{n−2}]
      [0  0  C    ···  CAd^{n−3}]
S =   [⋮  ⋮  ⋱    ⋱    ⋮       ]  ∈ R^{rn×n²}.
      [0  0  0    ···  C       ]
      [0  0  0    ···  0       ]

Suppose ∆(ζ(t), y(t)) is available in the learning phase. Let u(t) = v(t) − ∆(ζ(t), y(t)), where v(t) is a piecewise constant signal. Defining the matrix A_j := Ad − BdK_j, by periodic sampling we have

x_{k+1} = Adx_k + Bdv_k = A_jx_k + Bd(K_jx_k + v_k), y_k = Cx_k. (27)

3.2. PI-based output ADP design

Letting v̄_k = [v_{k−1}ᵀ, v_{k−2}ᵀ, …, v_{k−n}ᵀ]ᵀ ∈ R^{mn}, z_k = [v̄_kᵀ, ȳ_kᵀ]ᵀ ∈ R^q, q = n(m + r), Θ = [Mu, My], K̄_j = K_jΘ and P̄_j = ΘᵀP_jΘ, the state of (27) satisfies

x_k = Θz_k. (28)

From (21), (27) and (28), we have

z_{k+1}ᵀP̄_jz_{k+1} − z_kᵀP̄_jz_k = x_{k+1}ᵀP_jx_{k+1} − x_kᵀP_jx_k
= x_kᵀA_jᵀP_jA_jx_k + (K_jx_k + v_k)ᵀBdᵀP_jBd(K_jx_k + v_k) + 2(K_jx_k + v_k)ᵀBdᵀP_j(Ad − BdK_j)x_k − x_kᵀP_jx_k
= (K_jx_k + v_k)ᵀ[BdᵀP_jBd(−K_jx_k + v_k) + 2BdᵀP_jAdx_k] − (y_kᵀQdy_k + x_kᵀK_jᵀRdK_jx_k)
= (K̄_jz_k + v_k)ᵀ[H̄_j¹¹(−K̄_jz_k + v_k) + 2H̄_j¹²z_k] − (y_kᵀQdy_k + z_kᵀK̄_jᵀRdK̄_jz_k) (29)

where H̄_j¹¹ = BdᵀP_jBd and H̄_j¹² = BdᵀP_jAdΘ.

For a sufficiently large positive integer s and j = 0, 1, 2, …, we define

δ_{j,zz} = [vecv(z_{k_{j,0}+1}) − vecv(z_{k_{j,0}}), vecv(z_{k_{j,1}+1}) − vecv(z_{k_{j,1}}), …, vecv(z_{k_{j,s}+1}) − vecv(z_{k_{j,s}})]ᵀ
Γ_{j,zz} = [z_{k_{j,0}} ⊗ z_{k_{j,0}}, z_{k_{j,1}} ⊗ z_{k_{j,1}}, …, z_{k_{j,s}} ⊗ z_{k_{j,s}}]ᵀ
Γ_{j,zv} = [z_{k_{j,0}} ⊗ v_{k_{j,0}}, z_{k_{j,1}} ⊗ v_{k_{j,1}}, …, z_{k_{j,s}} ⊗ v_{k_{j,s}}]ᵀ
Γ_{j,ṽ} = [vecv(v_{k_{j,0}}), vecv(v_{k_{j,1}}), …, vecv(v_{k_{j,s}})]ᵀ
Γ_{j,k̄z} = [vecv(K̄_jz_{k_{j,0}}), vecv(K̄_jz_{k_{j,1}}), …, vecv(K̄_jz_{k_{j,s}})]ᵀ
Φ_jᴾ = [y_{k_{j,0}}ᵀQdy_{k_{j,0}} + z_{k_{j,0}}ᵀK̄_jᵀRdK̄_jz_{k_{j,0}}, …, y_{k_{j,s}}ᵀQdy_{k_{j,s}} + z_{k_{j,s}}ᵀK̄_jᵀRdK̄_jz_{k_{j,s}}]ᵀ

where n < k_{0,0} < k_{0,1} < ··· < k_{0,s} < k_{0,s} + 1 = k_{1,0} < ···.

For any stabilizing gain matrix K̄_j, (29) indicates

Ψ_jᴾ [vecs(H̄_j¹¹)ᵀ, vec(H̄_j¹²)ᵀ, vecs(P̄_j)ᵀ]ᵀ = Φ_jᴾ (30)

where Ψ_jᴾ = [Γ_{j,ṽ} − Γ_{j,k̄z}, 2(Γ_{j,zv} + Γ_{j,zz}(I_q ⊗ K̄_jᵀ)), −δ_{j,zz}] ∈ R^{(s+1)×(m(m+1)/2+mq+q(q+1)/2)}.

Lemma 3.1. If there exists a positive integer s∗ such that, for all s > s∗,

rank([Γ_{j,zz}, Γ_{j,zv}, Γ_{j,ṽ}]) = q(q + 1)/2 + qm + m(m + 1)/2, (31)

then (30) has a unique solution.

Proof. See Gao et al. (2014). □
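In the disturbance-free case (ξ_k ≡ 0), (26) reduces to x_k = Myȳ_k + Muū_k, which can be verified by simulation. The sketch below builds U, V, S, My, and Mu exactly as defined above, for illustrative matrices (assumptions, not the paper's example), and rebuilds the state from the last n input/output samples.

```python
import numpy as np

# Illustrative observable/controllable pair with n = 2, m = 1, r = 1.
Ad = np.array([[1.0, 0.05], [-0.05, 0.9]])
Bd = np.array([[0.0], [0.05]])
C = np.array([[1.0, 0.0]])
n, m, r = 2, 1, 1

# U = [(C Ad^{n-1})', ..., (C Ad)', C']', V = [Bd, Ad Bd, ..., Ad^{n-1} Bd].
U = np.vstack([C @ np.linalg.matrix_power(Ad, n - 1 - j) for j in range(n)])
V = np.hstack([np.linalg.matrix_power(Ad, i) @ Bd for i in range(n)])
# Block-Toeplitz S: block (j, i) = C Ad^{i-j-1} for i > j, else 0.
S = np.zeros((r * n, n * n))
for j in range(n):
    for i in range(j + 1, n):
        S[j*r:(j+1)*r, i*n:(i+1)*n] = C @ np.linalg.matrix_power(Ad, i - j - 1)
My = np.linalg.matrix_power(Ad, n) @ np.linalg.pinv(U)   # (U'U)^{-1}U' via pinv
Mu = V - My @ S @ np.kron(np.eye(n), Bd)

# Simulate (9) with random inputs and rebuild x_k from past data.
rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
xs, ys, us = [x.copy()], [], []
for k in range(10):
    u = rng.standard_normal(m)
    ys.append(C @ x); us.append(u)
    x = Ad @ x + Bd @ u
    xs.append(x.copy())

k = 10  # reconstruct x_10 from (u_9, u_8) and (y_9, y_8)
ubar = np.concatenate([us[k-1], us[k-2]])
ybar = np.concatenate([ys[k-1], ys[k-2]])
x_hat = My @ ybar + Mu @ ubar
```

With no dynamic uncertainty the reconstruction is exact up to round-off, which is what (26) asserts with Mξξ̄_k = 0.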

Under the condition of Lemma 3.1, Eq. (30) is uniquely solved by the least-squares method. From (22), K̄_{j+1} is obtained from H̄_j¹¹ and H̄_j¹²:

K̄_{j+1} = (Rd + H̄_j¹¹)⁻¹H̄_j¹². (32)

Now, we are ready to present an online PI-based algorithm for the adaptive optimal output-feedback design.

Algorithm 3 PI-based Output ADP Algorithm
1: Select a stabilizing gain K̄₀ and a sufficiently small constant ϵ > 0
2: Apply an initial stabilizing control policy ν_k on [0, k_{0,0}). j ← 0
3: repeat
4: Apply v_k = −K̄_jz_k + e_k on [k_{j,0}, k_{j,s}], with e_k an exploration noise
5: Solve P̄_j, K̄_{j+1} from (30) and (32)
6: j ← j + 1
7: until |P̄_j − P̄_{j−1}| < ϵ

Theorem 3.1. If (31) is satisfied, the sequences {P̄_j}_{j=0}^∞ and {K̄_j}_{j=1}^∞ obtained by Algorithm 3 converge to P̄_d∗ and K̄_d∗, where P̄_d∗ = ΘᵀPd∗Θ and K̄_d∗ = Kd∗Θ.

Proof. Given a stabilizing K_j, if P_j = P_jᵀ is the solution to (21), then K_{j+1} is uniquely determined by (22). By (29), P̄_j and K̄_{j+1} satisfy (30) and (32). Letting P̄ and K̄ solve (30) and (32), (31) ensures that P̄_j = P̄ and K̄_{j+1} = K̄ are uniquely determined. By Hewer (1971), we have lim_{j→∞} K̄_j = K̄_d∗ and lim_{j→∞} P̄_j = P̄_d∗. □

3.3. VI-based output ADP design

Let two matrices, H_j and H̄_j, take the form of

H_j = [H_j¹¹, H_j¹²; (H_j¹²)ᵀ, H_j²²] ≡ [BdᵀP_jBd, BdᵀP_jAd; AdᵀP_jBd, AdᵀP_jAd],
H̄_j = [H̄_j¹¹, H̄_j¹²; (H̄_j¹²)ᵀ, H̄_j²²] ≡ [BdᵀP_jBd, BdᵀP_jAdΘ; ΘᵀAdᵀP_jBd, ΘᵀAdᵀP_jAdΘ].

From (23) and (27), we obtain

y_{k+1}ᵀQdy_{k+1} = −x_{k+1}ᵀF(P_j)x_{k+1} + x_{k+1}ᵀP_{j+1}x_{k+1}
= −x_{k+1}ᵀ[H_j²² − (H_j¹²)ᵀ(Rd + H_j¹¹)⁻¹H_j¹²]x_{k+1} + ([v_kᵀ, x_kᵀ] ⊗ [v_kᵀ, x_kᵀ]) vec(H_{j+1})
= −z_{k+1}ᵀ[H̄_j²² − (H̄_j¹²)ᵀ(Rd + H̄_j¹¹)⁻¹H̄_j¹²]z_{k+1} + vecv([v_kᵀ, z_kᵀ]ᵀ)ᵀ vecs(H̄_{j+1})
= −φ_{k+1}ʲ + (ψ_k)ᵀ vecs(H̄_{j+1}) (33)

where

F(P_j) = AdᵀP_jAd − AdᵀP_jBd(Rd + BdᵀP_jBd)⁻¹BdᵀP_jAd,
φ_{k+1}ʲ = z_{k+1}ᵀ[H̄_j²² − (H̄_j¹²)ᵀ(Rd + H̄_j¹¹)⁻¹H̄_j¹²]z_{k+1},
ψ_k = vecv([v_kᵀ, z_kᵀ]ᵀ).

Define

Ψ_jⱽ = [ψ_{k_{j,0}}, ψ_{k_{j,1}}, …, ψ_{k_{j,s}}]ᵀ,
Φ_jⱽ = [y_{k_{j,0}+1}ᵀQdy_{k_{j,0}+1} + φ_{k_{j,0}+1}ʲ, …, y_{k_{j,s}+1}ᵀQdy_{k_{j,s}+1} + φ_{k_{j,s}+1}ʲ]ᵀ.

Then (33) implies

Ψ_jⱽ vecs(H̄_{j+1}) = Φ_jⱽ (34)

where the uniqueness of the solution is also ensured by (31). The VI-based output ADP Algorithm 4 is developed as below. The convergence of Algorithm 4 is given in Theorem 3.2, whose proof is similar to that of Theorem 3.1.

Algorithm 4 VI-based Output ADP Algorithm
1: Select a sufficiently small constant ϵ > 0
2: Apply an arbitrary initial policy ν_k on the interval [0, k_{0,0})
3: j ← 0. H̄_j ← 0. K̄_j ← 0
4: repeat
5: Apply v_k = −K̄_jz_k + e_k on [k_{j,0}, k_{j,s}]
6: Solve H̄_{j+1} from (34)
7: K̄_{j+1} ← (Rd + H̄_{j+1}¹¹)⁻¹H̄_{j+1}¹²
8: j ← j + 1
until |H̄_j − H̄_{j−1}| < ϵ

Theorem 3.2. If (31) holds, then lim_{j→∞} H̄_j = H̄_d∗ and lim_{j→∞} K̄_j = K̄_d∗, where

H̄_d∗ = [BdᵀPd∗Bd, BdᵀPd∗AdΘ; ΘᵀAdᵀPd∗Bd, ΘᵀAdᵀPd∗AdΘ].

4. Robustness and suboptimality

In this section, we show that the control policy

u_k = −K̄_d∗[ū_kᵀ, ȳ_kᵀ]ᵀ (35)

obtained by Algorithms 3 and 4 robustly and globally asymptotically stabilizes the overall system (1)–(3) at the origin. The suboptimality of (35) is also analyzed.

4.1. Robustness analysis

Considering (1) in closed loop with (35), by (26), it follows that

ẋ = Ax + B(u_k + ∆) = (A − BK∗)x + B[(K∗ − Kd∗)x + Kd∗(x − x_k) + Kd∗Mξξ̄_k + ∆]. (36)

The following lemma quantifies the dynamic uncertainty and the error between the state of the continuous system and that at the sampling instant.
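The core of (30) and (34) is a least-squares problem in which quadratic forms of measured data identify vecs(·) of an unknown symmetric matrix. The following simplified sketch (state feedback, Θ = I, value matrix P known; not the paper's full Algorithm 3 or 4) recovers H = [[BdᵀPBd, BdᵀPAd], [AdᵀPBd, AdᵀPAd]] from input/state samples using the vecs/vecv parameterization of the Notations.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def vecv(v):
    """vecv(v): products v_i v_j for i <= j (see Notations)."""
    v = np.asarray(v).ravel()
    return np.array([v[i] * v[j] for i in range(len(v)) for j in range(i, len(v))])

def vecs(P):
    """vecs(P): diagonal entries and doubled off-diagonal entries of P."""
    n = P.shape[0]
    return np.array([(P[i, j] if i == j else 2 * P[i, j])
                     for i in range(n) for j in range(i, n)])

# Illustrative matrices (assumptions, not the paper's example).
Ad = np.array([[1.0, 0.01], [-0.01, 0.98]])
Bd = np.array([[0.0], [0.01]])
P = solve_discrete_are(Ad, Bd, 0.01 * np.eye(2), 0.01 * np.eye(1))
G = np.block([[Bd.T @ P @ Bd, Bd.T @ P @ Ad],
              [Ad.T @ P @ Bd, Ad.T @ P @ Ad]])   # true H, for comparison

# Regression on x_{k+1}'P x_{k+1} = [u_k; x_k]' H [u_k; x_k]
#                                 = vecv([u_k; x_k])' vecs(H).
rng = np.random.default_rng(1)
Phi, b = [], []
for _ in range(30):                     # 30 samples >> 6 unknowns in vecs(H)
    x = rng.standard_normal(2); u = rng.standard_normal(1)
    x_next = Ad @ x + Bd @ u
    Phi.append(vecv(np.concatenate([u, x])))
    b.append(x_next @ P @ x_next)
theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(b), rcond=None)
```

With noise-free data and a rank condition of the type (31), the least-squares solution theta recovers vecs(H) exactly.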

Lemma 4.1. Let t ∈ [kh, (k + 1)h]. For any ϵ₂ > 0 and ϵ₃ > 1, there exists a δ₂ > 0 such that, if h < δ₂,

|Kd∗(x(t) − x_k + Mξξ̄_k) + ∆| ≤ ϵ₂|x(t)| + ϵ₃∥∆_{[kh,(k+1)h]}∥. (37)

Proof. Denote x(t) by x_t for simplicity. By (25) and (36), we have

|x_t − x_k| ≤ |∫₀ʰ e^{Aτ}dτ| |A − BKd∗|(|x_t − x_k| + |x_t|) + |∫₀ʰ e^{Aτ}dτ| |B|(|Kd∗Mξξ̄_k| + ∥∆_{[kh,(k+1)h]}∥).

Define φ₁(h) = he^{A_M h}. Let the sampling period h satisfy

φ₁(h)(A_M + B_M|Kd∗|) < 1, (38)

which implies 1 − |∫₀ʰ e^{Aτ}dτ| |A − BKd∗| > 0. Then,

|Kd∗(x_t − x_k + Mξξ̄_k) + ∆|
≤ [|∫₀ʰe^{Aτ}dτ| |A − BKd∗| |Kd∗| / (1 − |∫₀ʰe^{Aτ}dτ| |A − BKd∗|)] |x_t|
+ [(|∫₀ʰe^{Aτ}dτ| |B| |Kd∗|² |Mξ| / (1 − |∫₀ʰe^{Aτ}dτ| |A − BKd∗|) + |Kd∗| |Mξ|) |∫₀ʰe^{Aτ}dτ| |B| √n
+ |∫₀ʰe^{Aτ}dτ| |B| |Kd∗| / (1 − |∫₀ʰe^{Aτ}dτ| |A − BKd∗|) + 1] ∥∆_{[kh,(k+1)h]}∥
:= ϵ₂′|x_t| + ϵ₃′∥∆_{[kh,(k+1)h]}∥. (39)

The estimate of the upper bound of |Mξ| is

φ₂(h) = √n e^{A_M h}(1 − (√n e^{A_M h})ⁿ)/(1 − √n e^{A_M h}) + √(rn³) q C_M (1 − (√n e^{A_M h})^{n−1})/(1 − √n e^{A_M h}).

By Lemma 2.1, for any ϵ₁′ > 0, there exists δ₁′ > 0 such that, if h < δ₁′, we have |Kd∗| < |K∗| + |K∗ − Kd∗| ≤ |R⁻¹|B_M|P∗| + ϵ₁′ := K_M. From (38) and (39), for any ϵ₂ > 0 and ϵ₃ > 1, if h < δ₂ < δ₁′, where δ₂ meets

φ₁(δ₂) < min{ϵ₂/[(A_M + B_M K_M)(K_M + ϵ₂)], 1/(A_M + B_M K_M), (ϵ₃ − 1)/[A_M(ϵ₃ − 1) + B_M K_M(ϵ₃ + 1)]},
φ₂(δ₂) < (ϵ₃ − 1)[1 − φ₁(δ₂)(A_M + B_M K_M)]/(2√n B_M K_M² φ₁(δ₂)) − K_M,

then (37) holds. □

Choose a Lyapunov function V = xᵀP₁x, where P₁ = P₁ᵀ > 0 solves the following equation with ω > 0:

(A − BK∗)ᵀP₁ + P₁(A − BK∗) + CᵀQC + K∗ᵀRK∗ + ωI = 0. (40)

By (8) and (40), we get P∆ = ω ∫₀^∞ e^{(A−BK∗)ᵀt} e^{(A−BK∗)t} dt, where P∆ = P₁ − P∗. Then, we have |P∆| = ωc∗²/(2µ∗) by |e^{(A−BK∗)t}| ≤ c∗e^{−µ∗t}. The derivative of the Lyapunov function V at kh ≤ t ≤ (k + 1)h along the trajectory of the system (36) is

V̇ = xᵀ[(A − BK∗)ᵀP₁ + P₁(A − BK∗)]x + 2xᵀP∗B[(K∗ − Kd∗)x + Kd∗(x − x_k + Mξξ̄_k) + ∆] + 2xᵀP∆B[(K∗ − Kd∗)x + Kd∗(x − x_k + Mξξ̄_k) + ∆]
≤ −xᵀCᵀQCx − xᵀK∗ᵀRK∗x − ω|x|² + 2xᵀ(K∗)ᵀR[(K∗ − Kd∗)x + Kd∗(x − x_k + Mξξ̄_k) + ∆] + [ωc∗²B_M(ϵ₁ + ϵ₂)/µ∗ + ω²c∗⁴B_M²α/(4µ∗²)]|x|² + (ϵ₃²/α)∥∆_{[kh,(k+1)h]}∥²
≤ −d₁|x|² + d₂∥∆_{[kh,(k+1)h]}∥² (41)

where α > 0 and

d₁ = ω − ωc∗²B_M(ϵ₁ + ϵ₂)/µ∗ − 3|R|(ϵ₁² + ϵ₂²) − ω²c∗⁴B_M²α/(4µ∗²), (42)
d₂ = 3|R|ϵ₃² + ϵ₃²/α. (43)

We select ω, α, ϵ₁, ϵ₂ such that d₁ > 0. Then, for all t ≥ 0, (41) implies directly that

V(t) ≤ exp(−(d₁/λ_m(P∗))t)V(0) + (d₂/d₁)∥∆∥². (44)

An immediate consequence of the previous inequality is

|x(t)| ≤ exp(−(d₁/(2λ_m(P∗)))t)√(λ_M(P∗)/λ_m(P∗))|x(0)| + √(d₂/(d₁λ_m(P∗)))∥∆∥, ∀t ≥ 0 (45)

which implies that the x-system, as written in (1) and (3), with ∆ as the input is input-to-state stable (ISS) (Sontag, 1989). By (3), we have

|y(t)| ≤ σ_y(|x(0)|, t) + γ_y(∥∆∥) (46)

where σ_y(|x(0)|, t) = C_M exp(−(d₁/(2λ_m(P∗)))t)√(λ_M(P∗)/λ_m(P∗))|x(0)| is a function of class KL and γ_y(s) = C_M√(d₂/(d₁λ_m(P∗)))s is a function of class K. By (42) and (43), we have

γ_y(s) ≤ C_M[c∗²B_Mϵ₃/(µ∗ − c∗²B_M(ϵ₁ + ϵ₂)) + ϵ₄]s (47)

where ϵ₄ > 0. Now, we are ready to give a sufficient condition under which the interconnected system (1)–(3) is globally asymptotically stable at the origin.

Theorem 4.1. Under the condition of Assumptions 2.1 and 2.2, if h < min{δ₁, δ₂} and γ∆(s) satisfies

γ∆(s) ≤ (Id + ρ₁)⁻¹ ∘ γ_y⁻¹ ∘ (Id + ρ₂)⁻¹(s), ∀s ≥ 0 (48)

for some ρ₁, ρ₂ of class K∞, then the control policy (35) globally asymptotically stabilizes (1)–(3).

Proof. Assumptions 2.1 and 2.2 indicate that the ζ-system has the SUO property with zero offset and is IOS with gain function γ∆(s). (45) and (46) guarantee that the x-system has the SUO property with zero offset and is IOS with gain function γ_y(s) (see Proposition 3.1, Jiang et al., 1994). By the nonlinear small-gain theorem (Theorem 2.1, Jiang et al., 1994), the system (1)–(3) in closed loop with (35) is globally asymptotically stable (GAS) at the origin under the following small-gain condition:

(Id + ρ₂) ∘ γ_y ∘ (Id + ρ₁) ∘ γ∆(s) ≤ s, ∀s ≥ 0.

The proof is thus completed. Fig. 1 is depicted to outline the structure of the proof. □

4.2. Suboptimality analysis
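For purely linear gain functions, the condition (48) reads like a classical loop-gain requirement. The sketch below evaluates the composed map of the small-gain condition for illustrative linear choices of γ_y, γ_∆, ρ₁, ρ₂ (assumptions, not values from the paper).

```python
# Small-gain check for linear gains gamma_y(s) = a*s, gamma_Delta(s) = b*s
# and rho_1(s) = rho_2(s) = eps*s (all illustrative assumptions).
def composite(s, a, b, eps):
    """(Id + rho2) o gamma_y o (Id + rho1) o gamma_Delta, evaluated at s."""
    s = b * s               # gamma_Delta
    s = (1 + eps) * s       # Id + rho1
    s = a * s               # gamma_y
    s = (1 + eps) * s       # Id + rho2
    return s

a, b, eps = 2.0, 0.4, 0.05
# For linear gains the condition reduces to (1 + eps)^2 * a * b <= 1,
# which is 0.882 here, so the composed map is a contraction.
ok = all(composite(s, a, b, eps) <= s for s in [0.1, 1.0, 10.0])
```

When the inequality fails (e.g. a·b ≥ 1), the small-gain theorem gives no guarantee, which is exactly the role of condition (48) in Theorem 4.1.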

If (48) holds, there exists a function σ of class KL such that |∆(t)| ≤ σ(|[x(0)ᵀ, ζ(0)ᵀ]ᵀ|, t). Letting Jc⊙ be the cost in (6) for the closed-loop system (36) in the presence of dynamic uncertainty, we have

Jc⊙ = ∫₀^∞ [yᵀQy + xᵀ(K∗)ᵀRK∗x + (u_k + K∗x)ᵀR(u_k + K∗x)] dτ
≤ ∫₀^∞ |x|² |CᵀQC + (K∗)ᵀRK∗| dτ + |R| ∫₀^∞ [(ϵ₁ + ϵ₂)|x| + (ϵ₃ − 1)∥∆_{[τ,τ+h]}∥]² dτ
≤ ∫₀^∞ |x|² [|CᵀQC + (K∗)ᵀRK∗| + 2(ϵ₁ + ϵ₂)²|R|] dτ + 2(ϵ₃ − 1)²|R|h Σ_{i=0}^∞ σ²(|[x(0)ᵀ, ζ(0)ᵀ]ᵀ|, ih). (49)

(41) implies that

x(kh + h)ᵀP₁x(kh + h) − x(kh)ᵀP₁x(kh) ≤ −d₁ ∫_{kh}^{(k+1)h} |x|² dτ + d₂h∥∆_{[kh,(k+1)h]}∥².

An immediate consequence is

∫₀^∞ |x|² dτ ≤ (1/d₁)[d₂h Σ_{i=0}^∞ σ²(|[x(0)ᵀ, ζ(0)ᵀ]ᵀ|, ih) + (λ_M(P₁)/λ_m(P∗))Jc∗]. (50)

Combining (49) and (50), we obtain the following theorem that characterizes the suboptimality of the closed-loop system composed of (1) and (35).

Theorem 4.2. Under the conditions of Assumptions 2.1 and 2.2, if (48) holds, then (35) is suboptimal for system (1)–(3), with the cost Jc⊙ in (6) satisfying

Jc⊙ ≤ ηJc∗ + D (51)

where

η = [|CᵀQC + (K∗)ᵀRK∗| + 2(ϵ₁ + ϵ₂)²|R|] λ_M(P₁)/(d₁λ_m(P∗)),
D = {[|CᵀQC + (K∗)ᵀRK∗| + 2(ϵ₁ + ϵ₂)²|R|] d₂/d₁ + 2(ϵ₃ − 1)²|R|} h Σ_{i=0}^∞ σ²(|[x(0)ᵀ, ζ(0)ᵀ]ᵀ|, ih).

Fig. 1. The structure of the proof of Theorem 4.1.

Fig. 2. Comparison of Kd∗ with its first-order approximation K∗ + (∂Kd∗/∂h)h at different h.

Fig. 3. Comparisons of P̄_j and H̄_j with their optimal values by Algorithms 3–4.

5. Application to power systems

The stability analysis and controller design of power systems have attracted considerable attention recently (see, e.g., Vrabie et al., 2009; Wang & Hill, 1996; Wang, Zhou, & Wen, 1993). In this section, we consider an interconnection of two synchronous generators, wherein generator 2 is regarded as the dynamic uncertainty of generator 1. The mathematical model of generators i = 1, 2 is

∆δ̇ᵢ(t) = ∆ωᵢ(t),
∆ω̇ᵢ(t) = −(Dᵢ/2Hᵢ)∆ωᵢ(t) + (ω₀/2Hᵢ)∆Pmᵢ(t),
∆Ṗmᵢ(t) = (1/Tᵢ)[−∆Pmᵢ(t) + uᵢ(t) − d(t)] (52)

where

d(t) = E′_{q1}E′_{q2}[B₁₂cos(δ₁₂) − G₁₂sin(δ₁₂)](∆ω₁ − ∆ω₂),
δ₁₂(t) = (∆δ₁(t) + δ₁₀) − (∆δ₂(t) + δ₂₀).

The meanings and values of the parameters are referred to Jiang and Jiang (2012b). The system (52) takes the form of (1)–(3) with x = [∆δ₁, ∆ω₁, ∆Pm₁]ᵀ, y = [∆δ₁, ∆ω₁]ᵀ, ζ = [∆δ₂, ∆ω₂, ∆Pm₂]ᵀ, and d = ∆(ζ, y). To validate Lemma 2.1, Fig. 2 shows that, for small h ∈ (0, 0.1), Kd∗(h) is close to its first-order approximation computed by (13). In this example, u₂(t) is already designed to satisfy Assumptions 2.1 and 2.2. Q and R are identity matrices, and h = 1 ms. Both PI-based and VI-based output ADP


Fig. 4. Power angles and frequencies of generators.

learning algorithms are validated. Fig. 3 shows the convergence of P̄j and H̄j using Algorithms 3–4. The plots of the angles and frequencies of generators 1 and 2 are depicted in Fig. 4.

6. Conclusions

This paper studies the adaptive optimal control problem for a class of continuous-time uncertain interconnected systems using output feedback and adaptive dynamic programming. Both PI-based and VI-based approaches are given for the iterative, online design of adaptive optimal output-feedback controllers. Robustness to dynamic uncertainties is established by means of sampled-data systems theory, input-to-state stability and small-gain techniques. A practical application to power systems validates the effectiveness of the proposed design. Our future work includes extensions to a broader class of nonlinear systems (Bian, Jiang, & Jiang, 2014; Jiang & Jiang, 2014, 2015) and applications to intelligent transportation systems (Gao, Jiang, & Ozbay, 2015; Liu, Lu, & Jiang, 2016).

Acknowledgments

The authors thank Prof. A. Astolfi and Tao Bian for fruitful discussions.

References

Aangenent, W., Kostic, D., de Jager, B., van de Molengraft, R., & Steinbuch, M. (2005). Data-based optimal control. In Proceedings of the American control conference, Vol. 2 (pp. 1460–1465). Portland, OR.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Bhasin, S., Sharma, N., Parte, P., & Dixon, W. E. (2011). Asymptotic tracking by a reinforcement learning-based adaptive critic controller. Journal of Control Theory and Applications, 9(3), 400–409.
Bian, T., Jiang, Y., & Jiang, Z. P. (2014). Adaptive dynamic programming and optimal control of nonlinear nonaffine systems. Automatica, 50(10), 2624–2632.
Chen, T., & Francis, B. A. (1995). Optimal sampled-data control systems. Springer.
Gao, W., Huang, M., Jiang, Z. P., & Chai, T. (2016). Sampled-data-based adaptive optimal output-feedback control of a 2-DOF helicopter. IET Control Theory and Applications.
Gao, W., & Jiang, Z. P. (2016). Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Transactions on Automatic Control.
Gao, W., Jiang, Y., Jiang, Z. P., & Chai, T. (2014). Adaptive and optimal output feedback control of linear systems: An adaptive dynamic programming approach. In Proceedings of the 11th World congress on intelligent control and automation (pp. 2085–2090). Shenyang, China.
Gao, W., Jiang, Z. P., & Ozbay, K. (2015). Adaptive optimal control of connected vehicles. In Proceedings of the 10th international workshop on robot motion and control (pp. 288–293). Poznan, Poland.

Ge, S. S., Lee, T. H., & Harris, C. J. (1998). Adaptive neural network control of robotic manipulators. World Scientific.
He, P., & Jagannathan, S. (2005). Reinforcement learning-based output feedback control of nonlinear systems with input constraints. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(1), 150–154.
Hewer, G. (1971). An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Transactions on Automatic Control, 16(4), 382–384.
Jiang, Y., & Jiang, Z. P. (2012a). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10), 2699–2704.
Jiang, Y., & Jiang, Z. P. (2012b). Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems. IEEE Transactions on Circuits and Systems II: Express Briefs, 59(10), 693–697.
Jiang, Z. P., & Jiang, Y. (2013). Robust adaptive dynamic programming for linear and nonlinear systems: An overview. European Journal of Control, 19(5), 417–425.
Jiang, Y., & Jiang, Z. P. (2014). Robust adaptive dynamic programming and feedback stabilization of nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 25(5), 882–893.
Jiang, Y., & Jiang, Z. P. (2015). Global adaptive dynamic programming for continuous-time nonlinear systems. IEEE Transactions on Automatic Control, 60(11), 2917–2929.
Jiang, Z. P., & Mareels, I. (1997). A small-gain control method for nonlinear cascaded systems with dynamic uncertainties. IEEE Transactions on Automatic Control, 42(3), 292–308.
Jiang, Z. P., Teel, A. R., & Praly, L. (1994). Small-gain theorem for ISS systems and applications. Mathematics of Control, Signals, and Systems, 7(2), 95–120.
Krener, A. J., & Isidori, A. (1983). Linearization by output injection and nonlinear observers. Systems & Control Letters, 3(1), 47–52.
Lancaster, P., & Rodman, L. (1995). Algebraic Riccati equations. New York, NY: Oxford University Press.
Lewis, F. L., & Liu, D. (Eds.) (2013). Reinforcement learning and approximate dynamic programming for feedback control. Hoboken, NJ: Wiley.
Lewis, F. L., & Vamvoudakis, K. G. (2011). Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(1), 14–25.
Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Lewis, F. L., Vrabie, D., & Syrmos, V. L. (2012). Optimal control. Hoboken, NJ: Wiley.
Lewis, F. L., Vrabie, D., & Vamvoudakis, K. G. (2012). Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Systems Magazine, 32(6), 76–105.
Liu, D., Huang, Y., Wang, D., & Wei, Q. (2013). Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming. International Journal of Control, 86(9), 1554–1566.
Liu, T., Lu, X., & Jiang, Z. P. (2016). A junction-by-junction feedback-based strategy with convergence analysis for dynamic traffic assignment. Science China Information Sciences, 59(1), 1–17.
Liu, D., & Wei, Q. (2014). Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 25(3), 621–634.
Luo, B., Wu, H. N., Huang, T., & Liu, D. (2014). Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica, 50(12), 3281–3290.
Melzer, S. M., & Kuo, B. C. (1971). Sampling period sensitivity of the optimal sampled data linear regulator. Automatica, 7(3), 367–370.
Ni, Z., He, H., & Wen, J. (2013). Adaptive learning in tracking control based on the dual critic network design. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 913–928.
Powell, W. B. (2007). Approximate dynamic programming: Solving the curse of dimensionality. New York, NY: John Wiley & Sons.
Saberi, A., Kokotovic, P., & Summers, S. (1990). Global stabilization of partially linear composite systems. SIAM Journal on Control and Optimization, 28(6), 1491–1503.
Si, J., Barto, A. G., Powell, W. B., & Wunsch, D. C. (2004). Handbook of learning and approximate dynamic programming. New York, NY: John Wiley & Sons.
Sontag, E. D. (1989). Smooth stabilization implies coprime factorization. IEEE Transactions on Automatic Control, 34(4), 435–443.
Sontag, E. D. (2007). Input to state stability: Basic concepts and results. In P. Nistri, & G. Stefani (Eds.), Nonlinear and optimal control theory (pp. 163–220). Berlin: Springer-Verlag.
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. Cambridge, MA: MIT Press.
Vamvoudakis, K. G. (2014). Event-triggered optimal adaptive control algorithm for continuous-time nonlinear systems. IEEE/CAA Journal of Automatica Sinica, (3), 282–293.
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., & Lewis, F. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2), 477–484.
Wang, Y., & Hill, D. J. (1996). Robust nonlinear coordinated control of power systems. Automatica, 32(4), 611–618.
Wang, F. Y., Zhang, H., & Liu, D. (2009). Adaptive dynamic programming: An introduction. IEEE Computational Intelligence Magazine, 4(2), 39–47.
Wang, Y., Zhou, R., & Wen, C. (1993). Robust load-frequency controller design for power systems. IEE Proceedings on Generation, Transmission and Distribution, 140(1), 11–16.

Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. (Ph.D. thesis), Harvard University.
Werbos, P. J. (1990). A menu of designs for reinforcement learning over time. In W. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for control (pp. 67–95). MIT Press.
Yang, X., Liu, D., Ma, H., & Xu, Y. (2016). Online approximate solution of HJI equation for unknown constrained-input nonlinear continuous-time systems. Science China Information Sciences, 328, 435–454.
Zhang, H., Liu, D., Luo, Y., & Wang, D. (2013). Adaptive dynamic programming for control: Algorithms and stability. Communications and Control Engineering. Springer.
Zhu, L. M., Modares, H., Peen, G. O., Lewis, F. L., & Yue, B. (2015). Adaptive suboptimal output-feedback control for linear systems using integral reinforcement learning. IEEE Transactions on Control Systems Technology, 23(1), 264–273.

Weinan Gao was born in Shenyang, China. He received the B.Sc. and M.Sc. degrees from Northeastern University, Shenyang, China, in 2011 and 2013, respectively. He is currently a Ph.D. candidate in the Control and Networks Lab, Tandon School of Engineering, New York University. His research interests include reinforcement learning, adaptive dynamic programming (ADP), optimal control, cooperative adaptive cruise control (CACC), connected vehicles, sampled-data control systems, and output regulation theory.

Yu Jiang received the B.Sc. degree in Applied Mathematics from Sun Yat-sen University, Guangzhou, China, in 2006, the M.Sc. degree in Automation Science and Engineering from the South China University of Technology, Guangzhou, China, in 2009, and the Ph.D. degree in Electrical Engineering from New York University in 2014. In the summer of 2013, he worked as a research intern at Mitsubishi Electric Research Laboratories, Cambridge, MA. He is currently a Software Engineer with the Control Systems Toolbox team at The MathWorks, Inc., Natick, MA. His research interests include adaptive dynamic programming and other numerical methods in control and optimization. He received the Shimemura Young Author Prize (with Prof. Z. P. Jiang) at the 9th Asian Control Conference in Istanbul, Turkey, 2013. He is also the recipient of the Alexander Hessel Award for the Best Ph.D. Dissertation in Electrical Engineering at NYU in the 2014–2015 academic year.

Zhong-Ping Jiang received the B.Sc. degree in mathematics from the University of Wuhan, Wuhan, China, in 1988, the M.Sc. degree in statistics from the University of Paris XI, Paris, France, in 1989, and the Ph.D. degree in automatic control and mathematics from the Ecole des Mines de Paris, Paris, France, in 1993. He is currently a Professor of electrical and computer engineering with the Tandon School of Engineering, New York University. He has written two books, Stability and Stabilization of Nonlinear Systems (with Dr. I. Karafyllis, Springer, 2011) and Nonlinear Control of Dynamic Networks (with Drs. T. Liu and D. Hill, Taylor & Francis, 2014), and is the author of over 180 journal papers. His main research interests include stability theory; robust, adaptive, and distributed nonlinear control; adaptive dynamic programming; and their applications to information, mechanical, and biological systems. Prof. Jiang is a Fellow of the IEEE and a Fellow of IFAC.

Tianyou Chai received the Ph.D. degree in control theory and engineering in 1985 from Northeastern University, Shenyang, China, where he became a Professor in 1988. He is the founder and Director of the Center of Automation, which became a National Engineering and Technology Research Center and a State Key Laboratory. He is a member of the Chinese Academy of Engineering, an IFAC Fellow and an IEEE Fellow, and Director of the Department of Information Science of the National Natural Science Foundation of China. His current research interests include the modeling, control, optimization, and integrated automation of complex industrial processes. He has published 170 peer-reviewed international journal papers.
His paper titled "Hybrid intelligent control for optimal operation of shaft furnace roasting process" was selected for the Control Engineering Practice Paper Prize for 2011–2013. He has developed control technologies with applications to various industrial processes. For his contributions, he has won four prestigious awards of the National Science and Technology Progress and National Technological Innovation, as well as the 2007 Industry Award for Excellence in Transitional Control Research from the IEEE Multi-conference on Systems and Control.