Near-optimal output tracking controller design for nonlinear systems using an event-driven ADP approach


Communicated by Tao Li

Accepted Manuscript

Near-Optimal Output Tracking Controller Design for Nonlinear Systems Using an Event-Driven ADP Approach
Kun Zhang, Huaguang Zhang, He Jiang, Yingchun Wang

PII: S0925-2312(18)30541-1
DOI: 10.1016/j.neucom.2018.05.010
Reference: NEUCOM 19560

To appear in: Neurocomputing

Received date: 23 April 2017
Revised date: 12 March 2018
Accepted date: 2 May 2018

Please cite this article as: Kun Zhang, Huaguang Zhang, He Jiang, Yingchun Wang, Near-Optimal Output Tracking Controller Design for Nonlinear Systems Using an Event-Driven ADP Approach, Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.010

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Near-Optimal Output Tracking Controller Design for Nonlinear Systems Using an Event-Driven ADP Approach

Kun Zhang, Huaguang Zhang∗, He Jiang, Yingchun Wang


College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, P.R. China

Abstract


In this paper, a novel output tracking control scheme for a class of continuous-time nonlinear systems is proposed by using an event-driven adaptive dynamic programming (ADP) approach. Based on the controlled system and the desired reference trajectory, an augmented system is constructed for the optimal tracking control problem (OTCP), where the control law is composed of a steady-state part and a feedback part. According to the standard solution of the OTCP, the steady-state part of the control can be obtained directly, and the feedback part associated with the cost function remains to be solved. Thus an event-driven near-optimal control method is proposed to solve the feedback control of the OTCP, where a single-network ADP approach is developed to approximate the optimal cost function. With the designed weight tuning law and event-triggering condition, the output tracking error dynamics converge to zero and the parameters of the critic neural network (NN) converge to the optimal values. The stability of the augmented system under the novel control scheme is guaranteed based on Lyapunov theory, and two simulation examples are presented to demonstrate the effectiveness of the proposed method.


Keywords: event-triggered control, neural network (NN), output tracking control, adaptive dynamic programming (ADP), optimal control

∗Corresponding author. Email addresses: [email protected] (Kun Zhang), [email protected] (Huaguang Zhang), [email protected] (He Jiang), [email protected] (Yingchun Wang)


1. Introduction


Designing a stabilizing controller that makes the system dynamics track a desired trajectory synchronously is a key issue in the control field. To achieve the tracking objective with a minimized performance index, optimal control theory has been applied and many optimal control approaches have been developed, in which a feedback control law is obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation. During the last few years, reinforcement learning (RL) methods have played an important role in learning optimal controllers and have been widely used in research [1, 2, 3]. Associated with the performance function, adaptive/approximate dynamic programming (ADP) approaches are introduced as a class of RL methods in [4, 5, 6, 7], where the optimal controllers are obtained by using neural network (NN) approximations. However, the value or policy iteration procedures are computationally demanding, and the traditional time-triggered ADP architecture accumulates NN approximation errors at every step. To reduce the unnecessary computational load and transmissions caused by traditional time-triggered ADP techniques in optimal controller design, event-triggered control methods have been proposed in [8, 9, 10, 11]. An event-triggered near-optimal control architecture was designed in [12] for continuous-time nonlinear systems with unknown internal states. Robust control problems were analyzed in many works, such as [13, 14], and [15] developed an adaptive robust control method for an uncertain system by using event-driven neural dynamic programming. For discrete-time systems, an adaptive event-triggered scheme based on the heuristic dynamic programming technique was designed in [16] to learn the optimal control and value function, where the dynamics were estimated by a model network. [17] proposed a novel event-triggered control architecture for load frequency control with a supplementary ADP method, and demonstrated significant benefits in modern power systems enabled by computer-based communication network technologies. A nonquadratic cost function was introduced for nonlinear dynamics with constrained control inputs in [18, 19, 20], where event-triggered conditions were considered.

On the other hand, many researchers have tried to simplify the typical actor-critic structure of traditional ADP methods by eliminating the actor


or the critic NN, and several results have been obtained [21, 22, 23]. For discrete-time dynamics, single-network adaptive critic architectures were proposed in [24, 25], where the optimal controllers were explicitly expressed in terms of the states and costate variables. A finite-horizon single-network adaptive critic was developed with a nonquadratic cost function for discrete-time systems with constrained input in [26]. For continuous-time systems, an online approach to solve the adaptive optimal controller was proposed in [27], where an actor/critic structure was presented for nonlinear systems with an infinite-horizon cost. A single-network ADP approach was utilized in [28] to obtain the optimal controllers that drive the value functions to the Nash equilibrium of nonzero-sum differential games. The optimal tracking control problem (OTCP) was discussed in [29, 30, 31], where a critic-only Q-learning method was developed in [30] for model-free dynamics. Under the event-triggered and single-network ADP architectures, the optimal regulation problem remains a key issue of concern to researchers. Owing to their adaptive and self-learning ability, several optimal ADP solutions of the OTCP were explored in [32, 33, 34, 35, 36, 37, 38], where the optimal controllers were classified by their forms. In the standard solution of the OTCP, the controller consists of a steady-state part and a feedback part. A greedy heuristic dynamic programming iteration algorithm was proposed in [39] to solve the optimal regulation problem transformed from the OTCP for discrete-time systems. A data-driven robust controller was designed in [40] to approximate the optimal tracking control scheme by using an ADP method, where the feedback part was obtained. For an alternative solution of the OTCP, an online learning algorithm was proposed in [41] to solve the partially-unknown linear quadratic tracking control problem, where a discounted performance function was used instead of the standard solution. Augmented methods were developed in [42, 43] to solve the controllers, and [43] used a novel augmented system for the OTCP instead of the optimal regulation problem through an integral RL technique. A discrete-time linear quadratic tracker was designed in [44] based on the Q-learning algorithm, and an actor-critic-based reinforcement learning method was used in [45] for partially unknown nonlinear discrete-time systems with constrained input. However, these methods achieve the optimal tracking goal by typical ADP approaches without considering the computational load of the approximation and iteration procedures.

Motivated by this potential for an improved optimal scheme, an event-driven near-optimal output tracking controller is designed for the continuous-time nonlinear


affine system to solve the OTCP. This paper is organized as follows. The main idea, inspired by how to make better use of ADP techniques in solving the OTCP, has been presented in the introduction of Section 1. The formulation of the output tracking control problem is given in Section 2. In Section 3, an augmented system based on the standard OTCP solution is constructed from the output tracking errors and reference trajectories, where the feedback part is converted into the control law of the augmented system. The event-triggered condition and the weight tuning law of the single critic network are designed for the optimal tracking controller, and the stability of the system dynamics under the proposed method is analyzed by Lyapunov theory in Section 4. Two simulation examples are given in Section 5 to demonstrate the effectiveness of the event-driven technique. The main results and contributions are concluded in Section 6.

2. Problem Formulation


The control objective of the OTCP in this paper is to drive the system output to track a designed reference trajectory. Consider a class of continuous-time input-affine nonlinear systems as the controlled dynamics:

ẋ = f(x(t)) + g(x(t))u(t)    (1a)
y = Cx(t)    (1b)


where x(t) ∈ Rn is the state, u(t) ∈ Rm is the control, and y ∈ Rs is the output.


Assumption 1. Suppose the drift dynamics f(x) ∈ R^n, with f(0) = 0, and the input dynamics g(x) ∈ R^{n×m} of this system are Lipschitz continuous, and that g(x) is bounded such that ‖g(x)‖ ≤ gm and ‖g(x1) − g(x2)‖ ≤ Lg‖x1 − x2‖, where gm and Lg are positive constants. The output matrix is C ∈ R^{s×n} and the system (1) is controllable on a compact set Ω ⊂ R^n.


The desired reference trajectory is bounded and generated by a Lipschitz continuous command generator:

ẋd = Ψ(xd(t))    (2a)
yd = Cxd(t)    (2b)

where xd ∈ R^n and yd ∈ R^s. Note that the reference dynamics only need to be stable in this paper, not necessarily asymptotically stable.


The target of the optimal tracking control problem is to find an optimal controller u*(t) such that y tracks the desired output yd in a manner that minimizes the cost function. In the standard solution of the OTCP, the control policy u(t) consists of the steady-state part us and the feedback part ue [39, 40]. Thus the desired reference trajectory can be expressed as

ẋd(t) = f(xd(t)) + g(xd(t))us(t)    (3)

where us(t) can be obtained directly from equations (2a) and (3). The feedback part ue(t) needs to be obtained by minimizing the following cost function. Define the output tracking error as ey(t) = y(t) − yd(t); the cost function for the OTCP is

V(t) = ∫_t^∞ r(ey(τ), ue(τ)) dτ    (4)

where r(ey, ue) = Qy(ey) + U(ue), Qy(ey) is a positive definite function, U(ue) = ue^T R ue, and R = R^T is a symmetric positive definite matrix. The control law u(t) is then obtained as u(t) = us(t) + ue(t).


3. The Novel Augmented System and Tracking HJB Equation


In this section, a new augmented system for the standard OTCP is constructed based on the output tracking error and the command generator dynamics. The tracking HJB equation for the augmented system is then developed.


3.1. The Novel Augmented System of Output Tracking Control

According to the continuous-time affine system (1) and the reference trajectory (2), the state tracking error is defined as ex = x − xd. Hence ey = Cex, and the output tracking error dynamics can be obtained as

ėy(t) = Cėx(t) = C{ẋ − ẋd} = C{f(x(t)) + g(x(t))u(t) − Ψ(xd(t))}.    (5)

Based on equations (2a) and (3), the steady-state part us of the control u satisfies

Ψ(xd(t)) = f(xd(t)) + g(xd(t))us(t),    (6)


and us can be further expressed as

us(t) = [g^T(xd)g(xd)]^{-1} g^T(xd)[Ψ(xd) − f(xd)].    (7)

Inserting us into the output tracking error dynamics (5) yields

ėy(t) = C{f(x) − Ψ(xd) + g(x)(ue + us)}
      = C{f(x) − Ψ(xd) + g(x)ue + g(x)[g^T(xd)g(xd)]^{-1} g^T(xd)[Ψ(xd) − f(xd)]}.    (8)
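As a concrete illustration of how the steady-state part (7) can be evaluated numerically, the following Python sketch computes us(t) as the least-squares solution of (6). The callables f, g and Psi are hypothetical stand-ins for the drift, input and command-generator dynamics; they are not part of the paper.

```python
import numpy as np

def steady_state_control(x_d, f, g, Psi):
    """Steady-state part u_s from Eq. (7): the least-squares solution of
    Psi(x_d) = f(x_d) + g(x_d) u_s, i.e. u_s = (g^T g)^{-1} g^T (Psi - f)."""
    G = np.atleast_2d(g(x_d))      # n x m input matrix evaluated at x_d
    rhs = Psi(x_d) - f(x_d)        # residual Psi(x_d) - f(x_d)
    # lstsq returns the pseudo-inverse solution, which coincides with
    # [g^T g]^{-1} g^T rhs whenever g(x_d) has full column rank
    u_s, *_ = np.linalg.lstsq(G, rhs, rcond=None)
    return u_s
```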

"

Let the augmented system state be X(t) = [ey^T(t), xd^T(t)]^T. Then the novel augmented system dynamics become

Ẋ(t) = F(X(t)) + G(X(t))ue(t)    (9)

where

F(X(t)) = [ C{f(ex + xd) − Ψ(xd) + g(ex + xd)us} ; Ψ(xd) ],    (10a)
G(X(t)) = [ C g(ex(t) + xd(t)) ; 0 ],    (10b)

with [ · ; · ] denoting vertical stacking, and ue(t) is the control of this constructed augmented system. For this novel augmented system, define the cost function as

V(X(t)) = ∫_t^∞ [Q̄(X(τ)) + U(ue(τ))] dτ    (11)

where Q̄(X(t)) = Qy(ey) is a positive definite function and U(ue(t)) = ue^T R ue. The term Q̄(X(t)) is chosen in quadratic form in this paper as Q̄(X) = X^T Q̄ X = [ey^T, xd^T] diag(Qy, 0) [ey; xd] = ey^T Qy ey, where Qy is a positive definite matrix of appropriate dimension.
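To make the construction of (9)-(10) concrete, here is a minimal sketch that assembles Ẋ for a given feedback control ue. It assumes C = I (as in the simulation examples) so that ex can be recovered from ey; the callables and the helper steady_state_control are illustrative, not the paper's implementation.

```python
import numpy as np

def augmented_dynamics(X, u_e, f, g, Psi, C, steady_state_control):
    """One evaluation of Eq. (9), X_dot = F(X) + G(X) u_e, for X = [e_y; x_d].
    Sketch only: assumes C = I so that e_x = e_y."""
    s, n = C.shape
    e_y, x_d = X[:s], X[s:]
    e_x = e_y                                  # valid because C = I here
    x = e_x + x_d
    u_s = steady_state_control(x_d, f, g, Psi)
    F = np.concatenate([C @ (f(x) - Psi(x_d) + g(x) @ u_s), Psi(x_d)])  # Eq. (10a)
    G = np.concatenate([C @ g(x), np.zeros((n, g(x).shape[1]))])        # Eq. (10b)
    return F + G @ u_e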


Remark 1. The new augmented system (9) is constructed based on the standard solution of the OTCP, where the steady-state part us(t) of the control is obtained directly and embedded in the dynamics F(X). The solution of the OTCP is thereby converted into solving the optimal feedback control ue(X) for the developed augmented system, where the cost function V(X) and ue(X) are actually not associated with the reference system dynamics. Different from the recent related literature [41, 43, 44, 45], the novel cost function is also a Lyapunov function and does not rely on an extra discount factor.


Definition 1. A feedback control policy µ(X) is said to be an admissible control for this novel augmented system (9) if µ ∈ Λ(Ω), µ(0) = 0, µ is continuous, and µ stabilizes the system dynamics with a finite cost function for all x0 ∈ Ω.

Thus we use the admissible feedback control µ(X) in the following section, where µ(X) = µ(ey ) for the defined augmented system with state X(t) = [eTy (t), xTd (t)]T .


3.2. The Tracking HJB Equation for OTCP

The solution of the optimal control problem for a nonlinear system can be converted into solving the nonlinear Hamilton-Jacobi-Bellman (HJB) equation. Based on this novel augmented system, the tracking HJB equation is given in this subsection.
Differentiating the cost function (11) along the augmented system (9), the tracking Bellman equation can be obtained as

V̇(X(t)) = −X^T Q̄ X − µ^T(X)Rµ(X),    (12)

which can be further expressed as

0 = X^T Q̄ X + µ^T Rµ + ∇V^T(X)(F(X) + G(X)µ)    (13)

where ∇ = ∂/∂X denotes the partial derivative with respect to the state X. Then the Hamiltonian function is

H(X, µ, V) = X^T Q̄ X + µ^T Rµ + ∇V^T(X)(F(X) + G(X)µ).    (14)

Note that the cost function (11) is a Lyapunov equation for the nonlinear system from the version on (12) and (13). Under the admissible control policy µ(X), V (X) is finite as kV (X)k ≤ Vm , where Vm is a positive number. For the cost function, the following assumption is given.

AC

Assumption 2. Suppose the derivative of cost function with respect to state ∇V (X) is Lipschitz continuous on state space X ∈ Rs+n as satisfying k∇V (X1 )− ∇V (X2 )k ≤ Lv kX1 − X2 k, where Lv is positive constant. Define the optimal cost function of the augmented system as Z ∞ ∗ ¯ V (X) = min [X T (τ )QX(τ ) + µ(τ )T Rµ(τ )]dτ. µ∈Λ(Ω)

t

7

(15)

ACCEPTED MANUSCRIPT

The tracking HJB equation with the optimal performance index is ¯ + µT Rµ + ∇V ∗T (X)(F (X) + G(X)µ). H(X, µ, V ∗ ) = X T QX

(16)

CR IP T

The optimal control policy µ∗ (X) can be achieved by employing the stationarity condition from the equation (16): 1 µ∗ (X) = argmin H(X, µ, V ∗ ) = − R−1 GT (X)∇V ∗ (X). 2 µ∈Λ(Ω)

(17)

Lamma 1. Let the feedback control policy µ(X) be designed as the following form:

AN US

1 µ(X) = − R−1 GT (X)∇V (X), (18) 2 then the control law satisfies kµ(X1 ) − µ(X2 )k ≤ Lb kX1 − X2 k, where Lb is a positive real constant. Proof. Based on the Lagrange mean-value theorem, there is (19)

M

¯ ¯ T (X2 − X1 ) µ(X1 ) − µ(X2 ) = (∂µ(X)/∂ X) ¯ T (X2 − X1 ) = ∇µ(X)

ED

¯ exists between X1 and X2 . where X Under the aforementioned assumptions, computing the norm obtains (20)

PT

¯ T (X2 − X1 )k kµ(X1 ) − µ(X2 )k = k∇µ(X) ¯ ≤ k∇µ(X)kkX 1 − X2 k ≤ Lb kX1 − X2 k

CE

¯ is a positive real constant. where Lb ≥ k∇µ(X)k The proof is thus complete.


Lemma 2. Let V*(X) be the optimal solution of the cost function in (15), and let µ*(X) be the optimal control policy at the current state X as in (17). For the optimal policy evaluated at another state X̂, there is

H(X, µ*(X̂), V*(X)) = (µ*(X) − µ*(X̂))^T R (µ*(X) − µ*(X̂)) ≤ ‖r‖² Lb² ‖X − X̂‖²    (21)

where R = r^T r and X̂ is the state at any other time instant different from X.

Proof. Substituting µ∗ (X) back into the tracking HJB equation (16), it has: ¯ + µ∗T (X)Rµ∗ (X) H(X, µ∗ (X), V ∗ (X)) = X T QX

=0

CR IP T

+ ∇V ∗T (X)(F (X) + G(X)µ∗ (X)) (22) ¯ + ∇V ∗T (X)F (X) − µ∗T (X)Rµ∗ (X) = X T QX

where GT (X)∇V ∗ (X) = −2Rµ∗ (X). Then the general HJB tracking equation (16) can be expressed by insertˆ as ing the optimal control µ∗ (X) ∗ ˆ ˆ V ∗ (X)) = X T QX ¯ + µ∗T (X)Rµ ˆ H(X, µ∗ (X), (X)

AN US

ˆ + ∇V ∗T (X)(F (X) + G(X)µ∗ (X)) ¯ + µ∗T (X)Rµ∗ (X) = X T QX

+ ∇V ∗T (X)(F (X) + G(X)µ∗ (X)) − µ∗T (X)Rµ∗ (X) ∗ ˆ ˆ ˆ − µ∗ (X)) + µ∗T (X)Rµ (X) − 2µ∗T (X)R(µ∗ (X)

M

= H(X, µ∗ , J ∗ ) + µ∗T (X)Rµ∗ (X) ∗ ˆ ˆ ˆ + µ∗T (X)Rµ (X) − 2µ∗T (X)Rµ∗ (X)

ED

ˆ T R(µ∗ (X) − µ∗ (X)). ˆ = (µ∗ (X) − µ∗ (X))

PT

Using Lamma 1, one has: ˆ V ∗ (X)) = (µ∗ (X) − µ∗ (X)) ˆ T R(µ∗ (X) − µ∗ (X)) ˆ H(X, µ∗ (X), ˆ 2 ≤ krk2 L2b kX − Xk

(23)

(24)

CE

which gets (21) and completes the proof.

AC

Based on the tracking HJB equation of the augmented system, the stability of the output tracking error dynamic under the optimal control by event-triggering mechanism will be achieved in the following. 4. Event-Driven Single Network ADP Tracking Controller Design In order to solve the optimal tracking controller, the event-driven single network ADP method is proposed for the constructed augmented system. As per the proposed method, the tracking stability of event-triggering mechanism and the approximated optimal solution are obtained. 9


CR IP T

4.1. Tracking Stability of Event-Triggering Mechanism According to the optimal cost function (15) and the optimal control (17), the design of event-triggered condition is provided and the tracking stability is discussed in this section. Define {ςj }∞ j=0 is a monotonically increasing time sequence of the triggering instants with ς0 = 0, j ∈ N, where N = {0, 1, · · · } is the non-negative ˆ j = [(ejy )T , (xj )T ]T integer set. Assume the sampled state vector X(ςj ) , X d for all the time t ∈ [ςj , ςj+1 ), j ∈ N when the system is sampled at the triggering instants ςj . Define the event-triggered output error between the sampled tracking error and current tracking error as (25)

AN US

ejs (t) = ejy − ey (t), ∀t ∈ [ςj , ςj+1 ).

M

ˆ j ) based on the sampled Then the event-triggered optimal controller is µ∗ (X ˆ state Xj rather than the current state X(t) for t ∈ [ςj , ςj+1 ). It means that ˆ j ) will execute in the triggering instants and the control the control µ∗ (X ˆ j )}∞ sequence {µ∗ (X j=0 will become a continuous input by a zero-order hold ∗ ˆ ∗ ˆ j ), ∀t ∈ [ςj , ςj+1 ), j ∈ N. The event-triggered as µ (Xj , t) = µ (X(ςj )) = µ∗ (X optimal control should be presented by: ˆ j )∇V ∗ (X ˆj ) ˆ j ) = − 1 R−1 GT (X µ∗ (X 2

(26)

ED

ˆ j ) = (∂V ∗ (X)/∂X)| ˆ and t ∈ [ςj , ςj+1 ). where ∇V ∗ (X X=Xj

CE

PT

Theorem 1. For the novel continuous-time augmented system, the optimal ˆ j ). cost function is V ∗ (X) and the event-triggered optimal control is µ∗ (X The tracking error dynamic is asymptotically stable under the event-triggered optimal controller (26), if we define the event-triggered condition for all t ∈ [ςj , ςj+1 ), j ∈ N as kejs (t)k2 ≤

(1 − η12 )σmin (Qy )key (t)k2 , keT k krk2 L2b

(27)

AC

where η1 ∈ (0, 1) is the sample frequency parameter, eT is the threshold and σmin (Qy ) is the minimal eigenvalue of positive definite matrix Qy . Proof. Selecting J1 (t) = V ∗ (X) as the Lyapunov equation, we derive J1 (t) ˆ j ) and obtain the formulation with respect to time by using the control µ∗ (X as ˆ j )). J˙1 (t) = ∇V ∗T (X)(F (X) + G(X)µ∗ (X (28) 10

ACCEPTED MANUSCRIPT

Based on Lamma 1 and Lamma 2, it yields ˆ j )) J˙1 = ∇V ∗T (X)(F (X) + G(X)µ∗ (X ¯ − µ∗ (X ˆ j )T Rµ∗ (X ˆj ) = −X T QX

CR IP T

ˆ j ) − µ∗ (X))T R(µ∗ (X ˆ j ) − µ∗ (X)) + (µ∗ (X

= −eTy Qy ey − krk2 kµ∗ (ejy )k2 + krk2 kµ∗ (ejy ) − µ∗ (ey )k2

(29)

≤ −η12 σmin (Qy )key k2 + (η12 − 1)σmin (Qy )key k2 − krk2 kµ∗ (ejy )k2 + krk2 L2b kejs k2

AN US

¯ = eT Qy ey and µ(X) = µ(ey ) as the definition in where R = rT r, X T QX y the augmented system. Then the derivative term J˙1 ≤ −η12 σmin (Qy )key k2 − krk2 kµ∗ (ejy )k2 < 0 for any ey 6= 0, t ∈ [ςj , ςj+1 ), that means the eventtriggered condition (27) can ensure the tracking dynamic asymptotic stability for the augmented system. The proof is thus complete.

M

According to the novel augmented system (9) and the tracking HJB equation (16) , the event-triggered tracking controller based on NN approximation and single network ADP method will be achieved in the following section.

V ∗ (X) = W T Φ(X) + (X)

AC

CE

PT

ED

4.2. Event-Driven Single Network Tracking ADP Structure In the general time-driven ADP method, both critic module and action module are estimated by neural networks which work on the approximate Hamiltonian function simultaneously. The event-triggered single network tracking ADP structure of the output tracking control problem is shown in the Fig. 1, where a zero-order hold device is used to transform the feedback control sequence into a continuous-time input signal and only one critic neural network is needed. Let the optimal cost function V (X) be expressed by (30)

where W = [w1 , · · ·, wN ]T is a weight set, Φ(X) = [φ1 (X), · · ·, φN (X)]T is a set of activation functions, (X) is the approximation error, and N is the selected number of activation functions. P Assumption 3. In the critic network, we have W T Φ(X) = N ι=1 wι φι (X), and the approximation error (X) → 0 as N → ∞. Both the weight matrix 11

ACCEPTED MANUSCRIPT

Event generator Xˆ j

Critic Neural Network Wˆ

X

e Feedback part

Bellman Error X

uˆe ( Xˆ j ) Zero-order uˆe (t )

Augmented System actor

hold

Controlled System

uˆ(t )

xd

y e y

AN US

Steady-state part

’Vˆ

CR IP T

X

Reference Dynamic

us (t )



yd

xd

( ) u (t )

uˆ(t )

M

Figure 1: Event-triggered single-network output tracking structure

ED

W and the reconstruction error  are bounded as kW k ≤ Wm and kk ≤ c , where Wm and c are positive constants. Suppose the activation function Φ(·) is Lipschitz continuous in the state space, Φ(·) and the gradients ∇Φ(·) are all bounded as kΦ(·)k ≤ Φm and k∇Φ(·)k ≤ ∇Φm , where Φm and ∇Φm are positive constants.

CE

PT

Hence the derivative of the optimal cost function V ∗ (X) with respect to X can be further obtained as ∇V ∗ (X) = ∂V ∗ (X)/∂X = ∇ΦT (X)W + ∇

(31)

AC

where ∇ΦT (X)W = (∂Φ(X)/∂X)T W and ∇ = ∂(X)/∂X. Substituting (31) back into the Hamiltonian function (16), it becomes ¯ + µT (X)Rµ(X) H(X, µ, V ∗ ) = X T QX + (W T ∇Φ + ∇T )(F (X) + G(X)µ) ¯ + µT (X)Rµ(X) = X T QX + W T ∇Φ(X)(F (X) + G(X)µ) + a 12

(32)

ACCEPTED MANUSCRIPT

where a = ∇T (F (X) + G(X)µ), and ka k ≤ 3 can be guaranteed as N is selected larger enough, 3 is a positive constant. The approximated optimal control policy µ1 from the critic NN yields (33)

CR IP T

1 µ1 (X) = − R−1 GT (X)∇ΦT (X)W. 2

Inserting µ1 into the tracking HJB equation (16) and using the equation (23) in Lamma 2, the HJB error becomes: ¯ + µT (X)Rµ1 (X) HJB = X T QX 1

AN US

+ W T ∇Φ(X)(F (X) + G(X)µ1 (X)) ¯ + µT1 (X)Rµ1 (X) = X T QX + ∇V ∗T (X)(F (X) + G(X)µ1 (X)) − ∇T (F (X) + G(X)µ1 (X))

(34)

= (µ∗ (X) − µ1 (X))T R(µ∗ (X) − µ1 (X))

M

− ∇T (F (X) + G(X)µ1 (X)) 1 = ∇T D∇ − ∇T (F (X) + G(X)µ1 (X)) 4

ED

where D = G(X)RGT (X), and kHJB k → 0 as ∇ → 0. For any positive constant 2 , the appropriate critic neural network can be constructed so that sup∀x kHJB k ≤ 2 .

CE

PT

4.3. Stability Analysis of Event-Driven Tracking Control The stability of this augmented system under the proposed event-triggered tracking controller will be obtained in this subsection. Let the optimal cost function V ∗ (X) in (30) be approximated by ˆ T Φ(X) Vˆ (X) = W

(35)

AC

ˆ is the approximated weight parameter of W . where W The approximated event-triggered optimal control policy becomes ˆ j ) = − 1 R−1 GT (X ˆ j )∇Vˆ (X ˆ j ). µ ˆ(X 2

13

(36)

ACCEPTED MANUSCRIPT

Then the Hamiltonian function (16) can be further estimated based on the NN as

(37)

CR IP T

¯ +µ ˆ j )Rˆ ˆj ) H(X, µ ˆ, Vˆ ) = X T QX ˆT (X µ(X ˆ T ∇Φ(X)(F (X) + G(X)ˆ ˆ j )) +W µ( X =e

AN US

where the e is the approximate residual error which should be reduced. Define 1 ˜ = W −W ˆ. the squared residual error as Er = eT e and the weight error as W 2 In order to minimize the squared residual error and solve the optimal controller, a weight updating law based on gradient descent method is proposed for updating the parameters of critic NN in t ∈ [ςj , ςj+1 ) as ˆ˙ = −α{ β [β T W ˆ + X T QX ¯ −W ˆ T ∇Φ(X)G(X)ˆ ˆj ) + β T K W ˆ ]} W µ(X m2

(38)

ˆ j )), m = (β T β + 1), where α is the learning rate, β = ∇Φ(F (X) + G(X)ˆ µ( X and K is a tuning matrix.

ED

M

Theorem 2. Considering the novel augmented system (9) and the approximated event-triggered optimal control policy (36), the parameter tuning law of critic NN is used as (38). Then the output tracking error dynamic is asymptotically stable and the weights of the critic NN converge to the optimal ones if the adaptive event-triggering condition is designed as

PT

kejs (t)k ≤

(1 − η22 )λσmin (Qy )key k2 , kˆ eT k Wm ∇Φm gm Lb

(39)

CE

where η2 ∈ (0, 1) and λ > 0 are the selected parameters.

AC

Proof. Consider the Lyapunov function J2 (t) = J21 + J22 + J23 for the augmented system, where J21 = V ∗ (X(t)), ˜ T (t)α−1 W ˜ (t), J22 = W

(40)

ˆ j ). J23 = V ∗ (X

To analyze the stability in the solving procedure, the event-triggered system with flow and jump dynamics are considered as two cases, including 14

ACCEPTED MANUSCRIPT

CR IP T

event is not triggered at t ∈ [ςj , ςj+1 ) and event is triggered at t = ςj+1 , j ∈ N. Case 1: Event is not triggered, i.e., ∀t ∈ [ςj , ςj+1 ). Based on the Lyapunov equation J2 (t), the time derivative becomes J˙2 (t) = J˙21 + J˙22 + J˙23 . The first term J˙21 is presented by ˆ j )] J˙21 = (∇ΦT (X)W + ∇)T [F (X) + G(X)ˆ µ( X ˆj ) = W T ∇Φ(X)F (X) + W T ∇Φ(X)G(X)ˆ µ(X ˆ j )]. + ∇ [F (X) + G(X)ˆ µ(X T

(41)

AN US

Inserting the HJB error from Lamma 2, the derivative yields:

¯ − µT (X)Rµ1 (X) − W T ∇Φ(X)G(X)µ1 (X) J˙21 = −X T QX 1 T ˆ j ) + b + HJB + W ∇Φ(X)G(X)ˆ µ(X ¯ − µT (X)Rµ1 (X) + HJB + b = −X T QX 1 T ˆ j ) − µ1 (X)) + W ∇Φ(X)G(X)(ˆ µ(X

(42)

M

¯ + HJB + b + 1 W T D ¯W ˜ ¯ − 1 W T DW = −X T QX 4 2 ˆj ) − µ + W T ∇Φ(X)G(X)(ˆ µ( X ˆ(X))

AC

CE

PT

ED

¯ = ∇Φ(X)G(X)RGT (X)∇ΦT (X) and b = ∇T [F (X)+G(X)ˆ ˆ j )] where D µ( X with bound as kb k ≤ 1 . ˜˙ . By using the weight ˜ T α−1 W For the second term, one has J˙22 = W

15

ACCEPTED MANUSCRIPT

tuning law (38), the derivative becomes

CR IP T

˜ T { β [β T W ˆ + X T QX ¯ −W ˆ T ∇Φ(X)G(X)ˆ ˆj ) + β T K W ˆ ]} J˙22 = W µ(X m2 ˜ T { β [β T W ˆ + X T QX ¯ − W T ∇ΦF (X) − X T QX ¯ =W m2 − µT1 (X)Rµ1 (X) − W T ∇Φ(X)G(X)µ1 (X) ˆ T ∇Φ(X)G(X)ˆ ˆj ) + β T K W ˆ ]} + HJB − W µ( X

AN US

˜ T { β [−β T W ˜ − µT (X)Rµ1 (X) =W 1 2 m ˆj ) − W T ∇Φ(X)G(X)µ1 (X) + W T ∇Φ(X)G(X)ˆ µ(X ˆ T ∇Φ(X)G(X)ˆ ˆj ) + β T K W ˆ ]} + HJB − W µ( X

(43)

M

˜ T { β [−β T W ˜ + µT1 (X)Rµ1 (X) =W m2 ˜ T ∇Φ(X)G(X)ˆ ˆ j ) + HJB + β T K W ˆ ]} +W µ(X ˜ +W ˜ T ∇Φ(X)G(X)ˆ ˆj ) ˜ T { β [−β T W µ( X =W m2 ˜ + µT1 (X)Rµ1 (X) + HJB + β T KW ] − βT KW

ED

T ˆ j )GT (X)∇ΦT (X)}W ˜ ˜ T { ββ (K + IN ) + β µ ˆT (X = −W m2 m2 ˜ T { β [HJB + µT1 (X)Rµ1 (X) + β T KW ]} +W m2

PT

ˆ j in this For the third term, the derivative J˙23 = 0 as state is always X case. Then adding (42) and (43), we can obtain

AC

CE

J˙2 (t) = J˙21 (t) + J˙22 (t) + J˙23 ¯ + W T ∇ΦG(ˆ ˆj ) − µ = −X T QX µ(X ˆ(X))

1 ¯ −W ˜ T AW ˜ +W ˜ TB + C − W T DW 4 ¯ + W T ∇ΦG(ˆ ˆj ) − µ = −λX T QX µ(X ˆ(X)) ¯ − 1 W T DW ¯ −W ˜ T AW ˜ +W ˜ TB + C − (1 − λ)X T QX 4

(44)

T ˆ j )GT (X)∇ΦT (X), B = β2 [HJB + µT (X)R where A = ββ (K + IN )+ mβ2 µ ˆ T (X 1 m2 m 1 ¯ T µ1 (X) + β KW ] + 2 DW and C = HJB + b . Through the matrix K and

16

ACCEPTED MANUSCRIPT

CR IP T

¯ − W T DW ¯ + parameter λ > 0 can be selected to guarantee −4(1 − λ)X T QX 2 4kCk + kBk /kAk ≤ 0 by choosing appropriate values, we can obtain that ¯ −W ˜ T AW ˜ +W ˜ T B + C ≤ 0. Then using the ¯ − 1 W T DW −(1 − λ)X T QX 4 Lamma 1, the derivative of J2 becomes: ¯ + W T ∇Φ(X)G(X)(ˆ ˆ(ey )) J˙2 (t) ≤ −λX T QX µ(ejy ) − µ

≤ −η22 λσmin (Qy )key k2 + (η22 − 1)λσmin (Qy )key k2 +

Wm ∇Φm gm Lb kejs k

(45)

AN US

Then it can be concluded that the derivative becomes J˙2 (t) ≤ −η22 λσmin (Qy ) key k2 < 0, ∀ey 6= 0 if the triggering condition (39) is satisfied. This implies the derivative of the Lyapunov function is negative in the case as t ∈ [ςj , ςj+1 ). Case 2: Event is triggered, i.e., ∀t = ςj+1 . The Lyapunov function for this time becomes: ˆ j+1 ) − J2 (X(ς − )) = 4J21 (t) + 4J22 (t) + 4J23 (t). 4J2 (t) = J2 (X j+1

(46)

M

Considering J˙2 < 0 for t ∈ [ςj , ςj+1 ), the state and cost function of augmented system (9) are continuous, the derivatives of J21 (t) and J22 (t) can be obtained by two inequalities: (47)

ED

ˆ j+1 ) − V ∗ (X(ς − )) ≤ 0, 4J21 (t) = V ∗ (X j+1

˜ T (X ˆ j+1 )α−1 W ˜ (X ˆ j+1 ) − W ˜ T (X(ς − ))α−1 W ˜ (X(ς − )) ≤ 0. 4J22 (t) = W j+1 j+1 (48)

PT

For the third term, it has

CE

ˆ j+1 ) − V ∗ (X ˆ j ) ≤ −κ(kej+1 (ςj )k), 4J23 (t) = V ∗ (X X

(49)

AC

ˆ j+1 − X ˆ j . Hence we where κ(·) is a class-κ function [23, 46] and ej+1 = X X obtain 4J2 (t) = 4J21 (t) + 4J22 (t) + 4J23 (t) ≤ 0 at the triggering instants ∀t = ςj+1 . From the two cases, it can be concluded that the augmented system is asymptotically stable under the event-triggered constrained optimal controller. That means the parameters of the critic converge to the optimal ones and the output tracking error dynamic converges to zero by the proposed method. The proof is complete.

17

ACCEPTED MANUSCRIPT

CR IP T

Remark 2. As the event-driven optimal feedback control solution of the augmented system is obtained by minimizing the cost function (11), both the steady-state control us (t) and feedback control ue (t) from the initial problem formulation are solved simultaneously as desired. Only one critic NN is used to solve the event-driven output tracking control, where critic parameters get convergence as tracking errors decreasing to zero. The architecture gets simpler and the computational load gets improved by the novel developed event-driven single network ADP tracking controller. 5. Simulation Results

M

AN US

For the augmented system from the OTCP, the typical mass, spring and damper system [43] is presented as the controlled dynamics, which is shown in Fig. 2. In the following two examples, this dynamic is considered as linear and nonlinear.

m

PT

ED

Actuator(u)

CE

Figure 2: Mass, spring and damper system

AC

5.1. Linear Example In the case, the spring and damper dynamics are considered to be linear, and the system becomes:     0 1 0    x˙ = (50a) k c x + 1 u − − m m m y = Cx(t) 18

(50b)

ACCEPTED MANUSCRIPT

yd = Cxd (t) T

CR IP T

where x = [x1 , x2 ]T , x1 is the position, x2 is the velocity, m is the mass of this object, k is the stiffness constant of this spring, and, c is the damping, C = I2 , and, Ii is the i dimensional unit matrix. They are selected as m = 1 Kg, k = 5 N/m and c = 0.5 N · s/m. The reference trajectory is expressed by # " 0 1 xd (51a) x˙ d = Ψ(xd ) = −5 0

(51b)

T



M

is constructed as

AN US

where xd = [xd1 , xd2 ] and the initial state xd (0) = [0.5, −0.5] for the command generator. Bade on the two dynamics, the linear augmented system         f (ex + xd ) − Ψ(xd ) g(ex + xd ) e˙ y   +g(ex + xd )us + µ (52) = 0 x˙ d Ψ(xd )

ED

0 1 0  −5 −0.5 0 X˙ =   0 0 0 0 0 −5

  0 0  1 0  X +   0 1  0 0



 µ 

(53)

AC

CE

PT

where X T = [X1 , X2 , X3 , X4 ] = [eTx , xTd ], µ = ue , and us (t) = [g T (xd )g(xd )]−1 g T (xd )[Ψ(xd ) − f (xd )] = 0.5xd2 . Let the initial state X(0) = [0.5, 0, 0.5, −0.5]T for this augmented system ¯ ¯ = eT Qy ey in with an initial admissible control, R = I1 and Q(X) = X T QX y   Q 0 y ¯= the cost function, where Q and Qy = 2I2 . The activation function 0 0 of critic NN is selected as ΦT (X) = [X12 , X1 X2 , X1 X3 , X1 X4 , X22 , X2 X3 , X2 X4 , ˆ T = [wˆ1 , · · · , wˆ10 ]. For the linX32 , X3 X4 , X42 ], and the weight vector is W ear continuous-time system, we solve the algebraic Riccati equation by using MATLAB ‘care’ program, then obtain the optimal parameters of the critic NN as W = [5.9464, 0.3923, 0, 0, 1.1255, 0, 0, 0, 0, 0]T . The output tracking trajectory is presented in Fig. 3, where the command generator dynamic gets well tracked by the spring and damper system. To show the tracking performance better, Fig. 4 a) depicts that the output 19

ACCEPTED MANUSCRIPT

Output tracking trajectory 3

y yd

2

0 -1 -2 -3 2 1 0 0

20

30

40

50

AN US

-1

y1

CR IP T

y2

1

10

Time (s)

Figure 3: Output tracking trajectory

AC

CE

PT

ED

M

tracking errors. Fig. 4 b) compares the curves of feedback control ue (t) obtained under the event-triggered and the time-triggered ADP methods, where they are approaching gradually. The entire control law u containing steady-state part us and feedback part ue is shown in Fig. 4 c). As the tracking goal is obtained, the weights of critic NN are converged ˆ = [5.9755, 0.3742, −0.0291, 0.0386, 0.9107, 0.0520, −0.0120, −0.0162, to W − 0.0012, −0.0049]T as the NN training procedure is finished, which is very close to the optimal weights. Then the cumulative numbers of triggered events with the proposed event-triggered ADP technique and traditional ADP method are compared in Fig. 4 d), where the the event-driven controller is only updated 116 times and the controller by traditional time-driven ADP method will be updated 500 times (the times come from the Runge Kutta integration process of MATLAB ode program with step size 0.1s). Fig. 4 e) depicts the sampling period of the control law during the event-triggered tuning process with the sample frequency parameter is selected as η2 = 0.74, λ = 0.9, and Fig. 4 f) displays the relationship between the event-triggered output error kejy k and the trigger threshold kˆ eT k, where the event-triggered error can increase to the trigger threshold and will be forced to zero. In addition, it also should be mentioned that we can adjust the sample frequency parameter η2 in the event-triggering condition, which can change the sampling frequency. If the states of the system are not sampled frequently,

20

ACCEPTED MANUSCRIPT

b) Feedback control u e

a) Output tracking error 2

e

1

µ(X) ˆj ) µ ˆ (X

y1

ey2

2

0

0

-1

0.2

0

-2

-0.2 10

-2 -4 10

20

30

40

50

Time (s)

0

10

20

AN US

0

CR IP T

4

12

30

14

16

40

50

Time (s)

c) Control law u 2

500

u=u +u s

1 0

e

d) Cumulative number of triggered events Traditional ADP Event-triggered ADP

400 300 200

M

-1 -2 -3 10

20

30

40

0

50

0

ED

0

100

10

Time (s)

2 1.5

CE

1

AC

||ejy ||

2 1.5 1 0.5

0

10

50

||eT ||

2.5

0.5

0

40

f) Evolution of the triggering condition

3

PT

2.5

30

Time (s)

e) Sampling period

3

20

0 20

30

40

50

0

Time (s)

10

20

30

40

Time (s)

Figure 4: Performances of the designed event-triggered tracking control scheme

21

50

ACCEPTED MANUSCRIPT

it means that much computational load is eliminated and fewer transmissions are required between the tracking error and the controller by using the event-triggered ADP method.

5.2. Nonlinear Example

In the nonlinear case, the spring is made nonlinear through the nonlinearity k(x) = −5x³, and the system dynamics become

ẋ1 = x2
ẋ2 = −5x1³ − 0.5x2 + u    (54)
y = Cx(t)

where C = I2, and the reference trajectory is again the command generator (51) of the linear example. Then the augmented system is constructed as

Ẋ = [ ed2 ; −5(ed1 + xd1)³ − 0.5ed2 + 5xd1³ ; xd2 ; −5xd1 ] + [ 0 ; 1 ; 0 ; 0 ] µ    (55)

AC

CE

PT

ED

where X^T = [X1, X2, X3, X4] = [ey^T, xd^T], µ = ue, and us(t) = [g^T(xd)g(xd)]^{-1} g^T(xd)[Ψ(xd) − f(xd)] = −5xd1 + 5xd1³ + 0.5xd2.
Select the initial state X(0) = [0.5, 0.5, 1, 1]^T for this augmented system with an initial admissible control, and take R = I1 and Q̄(X) = X^T Q̄ X = ey^T Qy ey in the cost function, where Q̄ = diag(Qy, 0) and Qy = 4I2. The activation function of the critic NN is Φ^T(X) = [X1², X1X2, X1X3, X1X4, X2², X2X3, X2X4, X3², X3X4, X4²] and the weight vector is Ŵ^T = [ŵ1, · · · , ŵ10].
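A bare-bones event-triggered simulation loop for this example, under the assumption that the critic weights have already been trained and that the triggering test and critic-based policy are provided as callables (trigger, mu_from_critic — hypothetical names, not from the paper), could be sketched as follows; the initial conditions follow X(0) = [0.5, 0.5, 1, 1]^T with C = I2.

```python
import numpy as np

def f(x):   return np.array([x[1], -5.0 * x[0]**3 - 0.5 * x[1]])   # Eq. (54)
def g(x):   return np.array([[0.0], [1.0]])
def Psi(xd): return np.array([xd[1], -5.0 * xd[0]])                 # Eq. (51)

def simulate(W_hat, grad_phi, trigger, mu_from_critic, T=90.0, dt=0.05):
    x, xd = np.array([1.5, 1.5]), np.array([1.0, 1.0])   # e_y(0)=[0.5,0.5], x_d(0)=[1,1]
    e_sampled = x - xd                                    # last sampled tracking error
    for _ in range(int(T / dt)):
        e = x - xd                                        # output error (C = I)
        if trigger(e, e_sampled):                         # event: re-sample the error
            e_sampled = e.copy()
        X_hat = np.concatenate([e_sampled, xd])
        u_e = float(mu_from_critic(X_hat, W_hat, grad_phi))        # held between events
        u_s = -5.0 * xd[0] + 5.0 * xd[0]**3 + 0.5 * xd[1]          # steady-state part
        x  = x + dt * (f(x) + g(x) @ np.array([u_s + u_e]))        # explicit Euler step
        xd = xd + dt * Psi(xd)
    return x, xd
```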

ACCEPTED MANUSCRIPT

a) Output tracking trajectory y yd

2

b) Output tracking error

2

3

ey1 ey2

1

0

0 -1

-1 -2 -3 2

-2 0 -2

20

0

y1

60

40

-3

100

80

0

10

20

Time (s)

CR IP T

y2

1

30

40

50

60

70

80

90

Time (s)

c) Feedback control u e

d) Control law u

4 µ(X) ˆj ) µ ˆ (X

AN US

0.5

u=u s +ue

3 2

0

1

0.2 -0.5

0

0

-1

-0.2

-2

16

18

20

22

-3

0

10

20

30

40

50

60

70

80

90

ED

Time (s)

M

-1

0

10

20

30

40

50

60

70

80

90

Time (s)

Figure 5: Trajectories of the output tracking performance and control laws

AC

CE

PT

The sampling period of the event-triggered control method during the learning process is presented in Fig. 6 a), where the sample frequency parameters are η2 = 0.78 and λ = 0.9 for the sampled-data system. Under the novel event-triggered controller, the parameters of the neural network converge to Ŵ = [0.6002, 0.7524, 1.5113, 1.3928, 0.6044, −0.0007, 0.0041, 1.2202, 0.0011, 0.4889]^T. As the mass-spring-damper system accomplishes the tracking goal of the OTCP, the parameters of the critic NN evolve as presented in Fig. 6 b), where the convergence of the critic NN weights can be observed. The evolution of the triggering condition is shown in Fig. 6 c), where the threshold êT and the event-triggered error ey^j converge to zero as the tracking error approaches zero. Note that the event-triggered error is forced to zero when the triggering condition is not satisfied, which means the tracking error is sampled at the triggering

23

(Panels: a) sampling period; b) weights of the critic NN; c) evolution of the triggering condition; d) cumulative number of triggered events.)

Figure 6: Evolutions of the sampling and critic parameters during the solution

24

90

ACCEPTED MANUSCRIPT

CR IP T

instants. In addition, the proposed event-triggered near-optimal controller needs only 807 samples of the state while the traditional time-driven ADP controller uses 1800 samples (these counts come from the Runge-Kutta integration process of the MATLAB ode program with step size 0.05 s), as shown in Fig. 6 d), which implies that fewer transmissions are required by the event-triggering technique.

PT

ED

M

AN US

In this paper, an effective event-triggered single network ADP method has been developed to solve the optimal output tracking controller of the OTCP. A novel augmented system has been constructed based on the output tracking errors and the command generater dynamics, and the cost function associated to the novel augmented system has been presented. Then an event-driven single network ADP-based output tracking controller for this constructed continuous augmented system has been proposed to solve the tracking HJB equation, where only one NN is used instead to the typical actor-critic structure with two NNs. According to the adaptive eventtriggering condition, the stability of the optimal output tracking control problem for the continuous-time nonlinear system and the convergence of the parameters of critic NN have been achieved by Lyapunov theory. A simpler architecture with less computational load have been obtained based on the developed event-triggered single network ADP-based method. Therefore, the effectiveness of the proposed method have been well demonstrated in the simulation results. Acknowledgement

AC

CE

This work was supported by the National Natural Science Foundation of China (61433004, 61627809, 61621004), and IAPI Fundamental Research Funds 2013ZCX14. References [1] D. Liu, X. Yang, D. Wang, Q. Wei, Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints, IEEE Trans. Cybern. 45 (7) (2015) 1372-1385.

25

ACCEPTED MANUSCRIPT

[2] S.G. Khan, G. Herrmann, F.L. Lewis, T. Pipe, C. Melhuish, Reinforcement learning and optimal adaptive control: an overview and implementation examples, Annual Reviews in Control. 36 (1) (2012) 42-59.

CR IP T

[3] B. Luo, H. Wu, T. Huang, D. Liu, Reinforcement learning solution for HJB equation arising in constrained optimal control problem. Neural Netw. 71 (2015) 150-158. [4] F. Wang, H. Zhang, D. Liu, Adaptive dynamic programming: an introduction, IEEE Comput. Intell. Mag. 4 (2) (2009) 39-47.

AN US

[5] K.G. Vamvoudakis, F.L. Lewis. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica. 46 (5) (2010) 878-888. [6] H. Zhang, H. Liang, Z. Wang, T. Feng, Optimal output regulation for heterogeneous multiagent systems via adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 28 (1) (2017) 18-29.

ED

M

[7] Q. Wei, F.L. Lewis, D. Liu, R. Song, H. Lin, Discretetime local value iteration adaptive dynamic programming: convergence analysis, IEEE Trans. Syst. Man Cybern. Syst. (2016) doi:10.1109/TSMC.2016.2623766.

PT

[8] X. Zhong, Z. Ni, H. He, X. Xu, D. Zhao, Event-triggered reinforcement learning approach for unknown nonlinear continuous-time system, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2014.

CE

[9] A. Sahoo, H. Xu, S. Jagannathan, Event-based optimal regulator design for nonlinear networked control systems, in: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.

AC

[10] J. Feng, N. Li, Design of observer-based event-driven controllers for a class of state-dependent nonlinear systems, J. Franklin Inst. 353 (7) (2016) 1573-1593. [11] A. Sahoo, H. Xu, S. Jagannathan, T. Dierks, Event-triggered optimal control of nonlinear continuous-time systems in affine form by using 26

ACCEPTED MANUSCRIPT

neural networks, in: proceedings of the IEEE Conference on Decision and Control, 2015.

CR IP T

[12] X. Zhong, H. He, An event-triggered ADP control approach for continuous-time system with unknown internal states, IEEE Trans. Cybern. 47 (3) (2017) 683-694.

[13] K.Z. Han, J. Feng, Robust periodically time-varying horizon finite memory fault detection filter design for polytopic uncertain discrete-time systems, Int. J. Robust. Nonlinear Control 27 (17) (2017) 4116-4137.

AN US

[14] H. Liang, H. Zhang, Z. Wang, J. Wang, Consensus robust output regulation of discrete-time linear multi-agent systems, IEEE/CAA Journal of Automatica Sinica 1 (2) (2014) 204-209.

[15] D. Wang, C. Mu, H. He, D. Liu, Event-driven adaptive robust control of nonlinear systems with uncertainties through NDP strategy, IEEE Trans. Syst. Man Cybern. Syst. (2016) doi:10.1109/TSMC.2016.2592682.

ED

M

[16] L. Dong, X. Zhong, C. Sun, H. He, Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discretetime systems, IEEE Trans. Neural Netw. Learn. Syst. (2016) DOI:10.1109/TNNLS.2016.2541020.

PT

[17] L. Dong, Y. Tang, H. He, C. Sun, An event-triggered approach for load frequency control with supplementary ADP, IEEE Trans. Power Syst. 32 (1) (2017) 581-589.

CE

[18] D. Wang, C. Mu, Q. Zhang, D. Liu, Event-based input-constrained nonlinear H∞ state feedback with adaptive critic and neural implementation, Neurocomputing. 214 (2016) 848-856.

AC

[19] Y. Zhu, D. Zhao, H. He, J. Ji, Event-triggered optimal control for partially-unknown constrained-input systems via adaptive dynamic programming, IEEE Trans. Ind. Electron. (2016) doi:10.1109/TIE.2016.2597763. [20] L. Dong, X. Zhong, C. Sun, H. He, Event-triggered adaptive dynamic programming for continuous-time systems with control constraints, IEEE Trans. Neural Netw. Learn. Syst. (2016) doi:10.1109/TNNLS.2016.2586303. 27

ACCEPTED MANUSCRIPT

[21] J. Ding, S.N. Balakrishnan, Approximate dynamic programming solutions with a single network adaptive critic for a class of nonlinear systems, J. Contr. Theory & Appl. 9 (3) (2011) 370-380.

CR IP T

[22] D. Wang, D. Liu, Neuro-optimal control for a class of unknown nonlinear dynamic systems using SN-DHP technique, Neurocomputing. 121 (2013) 218-225.

AN US

[23] J. Ding, S.N. Balakrishnan, F.L. Lewis, A cost function based single network adaptive critic architecture for optimal control synthesis for a class of nonlinear systems, in: Proceeding of the International Joint Conference on Neural Networks, 2010. [24] R. Padhi, N. Unnikrishnan, X. Wang, S.N. Balakrishnan, A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems, Neural Netw. 19 (10) (2006) 1648-1660.

M

[25] P.K. Patchaikani, L. Behera, G. Prasad, A single network adaptive critic-based redundancy resolution scheme for robot manipulators, IEEE Trans. Ind. Electron. 59 (8) (2012) 3241-3253.

ED

[26] A. Heydari, S.N. Balakrishnan, Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics, IEEE Trans. Neural Netw. Learn. Syst. 24 (1) (2013) 145-157.

PT

[27] D. Vrabie, F.L. Lewis, Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems, Neural Netw. 22 (2009) 237-246.

CE

[28] H. Zhang, L. Cui, Y. Luo, Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP, IEEE Trans. Cybern. 43 (1) (2013) 206-216.

AC

[29] C. Mu, Z. Ni, C. Su, H. He, Air-breathing hypersonic vehicle tracking control based on adaptive dynamic programming, IEEE Trans. Neural Netw. Learn. Syst. 28 (3) (2017) 584-598. [30] B. Luo, D. Liu, T. Huang, D. Wang, Model-free optimal tracking control via critic-only Q-learning, IEEE Trans. Neural Netw. Learn. Syst. 27 (10) (2016) 2134-2144. 28

ACCEPTED MANUSCRIPT

[31] C. Mu, Z. Ni, C. Su, H. He, Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems, IEEE Trans. Cybern. 47 (6) (2017) 1460-1470.

CR IP T

[32] Z. Ni, H. He, J. Wen, Adaptive learning in tracking control based on the dual critic network design, IEEE Trans. Neural Netw. Learn. Syst. 24 (6) (2013) 913-928. [33] Q. Wei, D. Liu, Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification, IEEE Trans. Auto. Sci. Eng. 11 (4) (2014) 1020-1036.

AN US

[34] K. Zhang, H. Zhang, G. Xiao, H. Su, Tracking control optimization scheme of continuous-time nonlinear system via online single network adaptive critic design method, Neurocomputing 251 (2017) 127-135.

[35] Z. Gu, D. Yue, J. Liu, Z. Ding, H∞ tracking control of nonlinear networked systems with a novel adaptive event-triggered communication scheme, J. Franklin Inst. (2017) doi:10.1016/j.jfranklin.2017.02.020.

ED

M

[36] B. Kiumarsi, F.L. Lewis, M.B. Naghibi-Sistani, A. Karimpour, Optimal tracking control of unknown discrete-time linear systems using inputoutput measured data, IEEE Trans. Cybern. 45 (12) (2015) 2770-2779.

PT

[37] Q. Wei, D. Liu, Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification. IEEE Trans. Autom. Sci. Eng. 11 (4) (2014) 1020-1036.

CE

[38] H. Modares, F.L. Lewis, Z. Jiang, H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learning Syst. 26 (10) (2015) 2550-2562.

AC

[39] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Trans. Syst. Man Cybern. B Cybern. 38 (4) (2008) 937-942. [40] H. Zhang, L. Cui, X. Zhang, Y. Luo. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method, IEEE Trans. Neural Netw. 22 (12) (2011) 2226-2236. 29

ACCEPTED MANUSCRIPT

[41] H. Modares, F.L. Lewis, Linear quadratic tracking control of partiallyunknown continuous-time systems using reinforcement learning, IEEE Trans. Autom. Contr. 59 (11) (2014) 3051-3056.

CR IP T

[42] X.P. Xie, S.L. Hu, Relaxed stability criteria for discrete-time TakagiSugeno fuzzy systems via new augmented nonquadratic Lyapunov functions, Neurocomputing 166 (2015) 416-421.

[43] H. Modares, F.L. Lewis, Optimal tracking control of nonlinear partiallyunknown constrained-input systems using integral reinforcement learning, Automatica. 50 (7) (2014) 1780-1792.

AN US

[44] B. Kiumarsi, F.L. Lewis, Hamidreza Modares, A. Karimpour, M.B. Naghibi-Sistani, Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics, Automatica. 50 (4) (2014) 1167-1175.

M

[45] B. Kiumarsi, F.L. Lewis, Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Trans. Neural Netw. Learn. Syst. 26 (1) (2015) 140-151.

AC

CE

PT

ED

[46] H. K. Khalil and J. W. Grizzle, Nonlinear systems. Upper Saddle River, NJ, USA: Prentice-Hall, 1996.

30

CR IP T

ACCEPTED MANUSCRIPT

M

AN US

Kun Zhang received the B.S. degree in mathematics and applied mathematics from Hebei Normal University, Shijiazhuang, China, in 2012 and the M.S. degree in management science and engineering from Northwest University for Nationalities, Lanzhou, China, in 2015. He is currently pursuing the Ph.D. degree in control theory and control engineering at Northeastern University, Shenyang, China. His main research interests include reinforcement learning, dynamic programming, neural networks-based controls and their industrial applications.

AC

CE

PT

ED

Huaguang Zhang received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural networks based control, nonlinear control, and their applications. He has authored and coauthored over 200 journal and conference papers, four monographs and co-invented 20 patents.

31

CR IP T

ACCEPTED MANUSCRIPT

M

AN US

He Jiang received the B.S. degree in automation control in 2014 from Northeastern University, Shenyang, China, where he is currently pursuing the Ph.D. degree in control theory and control engineering. His current research interests include adaptive dynamic programming, fuzzy control, optimal control and their industrial applications.

AC

CE

PT

ED

Yingchun Wang was born in Liaoning, China, in 1974. He received the B.S., M.S., and Ph.D. degrees from Northeastern University, Shenyang, China, in 1997, 2003, and 2006, respectively. He was a Visiting Scholar with West Sydney University, Sydney, NSW, Australia, from 2015 to 2016. He is currently an Associate Professor with the School of Information Science and Engineering, Northeastern University. His current research interests include network control, multiagent systems, fuzzy control and fuzzy systems, and stochastic control.

32