Neural-network-based output-feedback control with stochastic communication protocols✩

Derui Ding a, Zidong Wang b,c,∗, Qing-Long Han a

a School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia
b College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China
c Department of Computer Science, Brunel University London, Uxbridge, Middlesex, UB8 3PH, United Kingdom

Article history: Received 7 October 2017; Received in revised form 15 January 2019; Accepted 5 April 2019; Available online 20 May 2019.

Keywords: Output-feedback control; Stochastic communication protocols; Adaptive dynamic programming; Actor–critic structures

Abstract: This paper is concerned with the neural-network-based (NN-based) output-feedback control problem for a class of nonlinear systems. For the purpose of effectively mitigating data congestion/collision, stochastic communication protocols are utilized to orchestrate the data transmissions, and the resulting closed-loop plant is represented by a so-called protocol-induced Markovian jump system with uncertain transition probability matrices. Taking such probability uncertainty into account, a novel iterative adaptive dynamic programming (ADP) algorithm is developed to obtain the desired suboptimal solution with the help of an auxiliary quasi-HJB equation, and the convergence of the algorithm is also investigated through intensive mathematical analysis. In this ADP framework, an NN-based observer with a novel adaptive tuning law is first adopted to reconstruct the system states. Then, based on the reconstructed system, an actor–critic NN scheme with online learning is developed to realize the considered control strategy. Furthermore, in light of Lyapunov theory, sufficient conditions are derived to guarantee the stability of the zero equilibrium point of the closed-loop system as well as the boundedness of the estimation errors of the critic and actor NN weights. Finally, a simulation example is employed to demonstrate the effectiveness of the developed suboptimal control scheme.

1. Introduction

Optimal control has attracted increasing research interest due to its clear engineering insights in many practical application areas such as economics, ecology, engineering, finance, and aerospace. As is well known, the optimal control strategy is often associated with the solution of a certain Hamilton–Jacobi–Bellman (HJB) equation. In the case of a linear system, this equation can be simplified to an algebraic Riccati equation whose feasibility has been well investigated. For a nonlinear system, the corresponding HJB equation is a nonlinear partial differential equation whose solvability is a non-trivial issue.

✩ This work was supported in part by the National Natural Science Foundation of China under Grants 61573246 and 61873148, the Research Fund for the Taishan Scholar Project of Shandong Province of China, the Australian Research Council Discovery Project under Grant DP160103567, the Natural Science Foundation of Shanghai, China under Grant 18ZR1427000, and the Alexander von Humboldt Foundation of Germany. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Michael V. Basin under the direction of Editor Ian R. Petersen.
∗ Corresponding author at: College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China. E-mail addresses: [email protected] (D. Ding), [email protected] (Z. Wang), [email protected] (Q.-L. Han).

In order to deal with numerical algorithms for solving general HJB equations, the so-called adaptive/approximate dynamic programming (ADP) methods have been employed for control system design with known/unknown system dynamics; see Kiumarsi and Lewis (2015), Modares, Lewis, and Naghibi-Sistani (2013), Dierks and Jagannathan (2012), Liu, Huang, Wang, and Wei (2013) and the references therein. The ADP-based methods are implemented in a forward-in-time manner, thereby getting rid of the curse of dimensionality. The convergence issue has been thoroughly examined for systems with various cost functions, including a linear quadratic cost with a discount factor (Wei, Lewis, Sun, Yan, & Song, 2017) and a generalized cost with a nonquadratic functional (Zhang, Luo, & Liu, 2009). In addition, a critic network is usually adopted in dynamic programming (DP) algorithms to estimate the cost function in the HJB equation. Particularly, for unknown system dynamics, a specified structure referred to as the actor–critic one (Vamvoudakis & Lewis, 2010; Xu, Zhao, & Jagannathan, 2015) can effectively cater for the implementation of DP. In the framework of such a structure, a critic network combined with an action network is utilized to approximate the cost function and estimate the control sequence. On another active research frontier, the communication protocol, which is widely deployed in networked systems with


the hope of alleviating data collisions, often results in inevitable changes of system parameters or random jumps of system states. Typical protocols include stochastic communication scheduling, event-triggered scheduling, and Round-Robin (RR) scheduling, to name just a few. It is noted that some initial research effort has been devoted to the performance analysis problem under communication protocols (Ding, Wang, Han, & Ge, 2019; Ding, Wang, Han, & Wei, 2019; Vamvoudakis, 2014). For instance, an event-triggered optimal control issue has been discussed in Vamvoudakis (2014) via an impulsive system approach in order to find the tuning laws of the weight matrices of actor and critic NNs.

In this paper, we intend to make one of the first few attempts to handle the NN-based control problem under stochastic communication protocols (SCPs). It is worth pointing out that SCPs serve as a widely used mechanism describing a certain class of carrier-sense multiple access with collision avoidance (CSMA/CA) protocols (Zou, Wang, & Gao, 2016). For the purpose of theoretical analysis, a control system under such a protocol can be regarded as a Markov jump system (MJS), and the corresponding analysis/synthesis problem can then be handled by looking into the effects brought about by the Markovian jumps (Patrinos, Sopasakis, Sarimveis, & Bemporad, 2014). Note that an optimal control scheme has been developed in Zhong, He, Zhang, and Wang (2014) to solve the HJB equation for discrete-time nonlinear MJSs with unknown system dynamics.

In the context of NN-based control for MJSs, the transition probabilities specified in most of the literature are assumed to be exactly known (Zhong et al., 2014). This assumption, however, is not always true in practice since accurate identification of transition probabilities is nontrivial yet costly. As such, it makes more sense to investigate the optimal control problem with uncertain transition probabilities; consequently, the HJB equation derived from the optimality principle would be highly dependent on the uncertain probabilities, which hinders the application of ADP approaches. In this regard, an effective alternative is to design an auxiliary quasi-HJB equation related to the upper bound of the probabilities so as to obtain a suboptimal control sequence, and it is also necessary to analyze the convergence of the ADP algorithm serving such an auxiliary equation. This gives rise to one of the main motivations of our current research. On the other hand, when the control task is performed in the framework of an actor–critic scheme, the corresponding weight matrices need to be estimated via designed tuning laws. If we are to deal with NN-based output-feedback control problems for MJSs under SCPs, two essential difficulties arise: (1) how to design the tuning laws of the weight matrices for the addressed control issue of MJSs? and (2) how to select the learning step length to guarantee the boundedness of the estimated weights? Motivated by the above discussions, in this paper we endeavor to study NN-based output-feedback control with SCPs subject to uncertain transition probabilities.

2. Problem formulation and preliminaries

In this paper, we consider the following class of stochastic nonlinear systems:

$$x_{k+1} = A x_k + F(x_k) + B u_k \quad (1a)$$
$$y_{i,k} = C_i x_k, \quad i = 1, 2, \ldots, n_y \quad (1b)$$

where x_k ∈ Ω_x ⊂ R^{n_x} is the system state, u_k ∈ R^{n_u} is the control input, and y_{i,k} ∈ R stands for the measurement output from sensor i. F(x_k) is an unknown discrete-time nonlinear function satisfying F(0) = 0. A, B and C_i (i = 1, 2, ..., n_y) are known constant matrices with compatible dimensions.

For (1a), under SCPs, the n_y sensors are randomly scheduled to transmit their measurements to the controller with the purpose of reducing network congestion. It is worth noting that accurate identification of the transition probability is by no means a trivial task, due primarily to cost limits from a practical point of view. In this case, without loss of generality, the transition probability is allowed to vary within an interval:

$$\mathrm{Prob}\{\sigma_{k+1} = j \,|\, \sigma_k = i\} = \tilde{\pi}_{ij}, \quad i, j \in \{1, \ldots, n_y\}$$

where π̃_ij satisfies 0 ≤ π_ij − π̄_ij ≤ π̃_ij ≤ π_ij + π̄_ij and Σ_{j=1}^{n_y} π̃_ij = 1 with known scalars π_ij and π̄_ij. Under the SCP scheduling, the information actually received through the zero-order holders, ȳ_k := [ȳ_{1,k} ȳ_{2,k} ... ȳ_{n_y,k}]^T ∈ R^{n_y}, can be expressed as

$$\bar{y}_{i,k} = \begin{cases} y_{i,k}, & \text{if } \sigma_k = i, \\ \bar{y}_{i,k-1}, & \text{otherwise}. \end{cases} \quad (2)$$

In addition, it follows from (1a), (1b) and (2) that

$$\bar{y}_k = \Phi_{\sigma_k} y_k + (I - \Phi_{\sigma_k}) \bar{y}_{k-1} \quad (3)$$

with y_k = [y_{1,k} y_{2,k} ... y_{n_y,k}]^T and Φ_{σ_k} = diag{δ(1 − σ_k), δ(2 − σ_k), ..., δ(n_y − σ_k)}, where δ(a) is a binary function equal to 1 for a = 0 and 0 otherwise.
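To make the scheduling concrete, the following minimal Python sketch (illustrative names; the mode index is 0-based here, whereas the paper uses σ_k ∈ {1, ..., n_y}) simulates the holder (2)–(3) for a given measurement sequence, with the active sensor evolving as a Markov chain whose row-stochastic transition matrix plays the role of a fixed realization of π̃_ij:

```python
import numpy as np

def simulate_scp(y, P, sigma0=0, seed=0):
    """Run the SCP zero-order holder (2)-(3) over a measurement
    sequence y of shape (T, ny): at step k only sensor sigma_k
    transmits, the remaining channels hold their previous values."""
    rng = np.random.default_rng(seed)
    T, ny = y.shape
    y_bar = np.zeros_like(y)
    held = np.zeros(ny)                      # \bar{y}_{-1} = 0
    sigma = sigma0
    for k in range(T):
        Phi = np.diag((np.arange(ny) == sigma).astype(float))
        held = Phi @ y[k] + (np.eye(ny) - Phi) @ held    # Eq. (3)
        y_bar[k] = held
        sigma = rng.choice(ny, p=P[sigma])   # Markov scheduling step
    return y_bar

# illustrative two-sensor transition matrix (rows sum to one)
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
y = np.random.default_rng(1).normal(size=(8, 2))
print(simulate_scp(y, P))
```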

In what follows, for convenience of description, we denote

$$\mathcal{A}_{\sigma_k} = \begin{bmatrix} A & 0 \\ \Phi_{\sigma_k} C & I - \Phi_{\sigma_k} \end{bmatrix}, \quad \mathcal{B} = \begin{bmatrix} B \\ 0 \end{bmatrix},$$
$$\mathcal{C}_{\sigma_k} = \begin{bmatrix} \Phi_{\sigma_k} C & I - \Phi_{\sigma_k} \end{bmatrix}, \quad C = \begin{bmatrix} C_1^T & C_2^T & \cdots & C_{n_y}^T \end{bmatrix}^T.$$
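As a quick sanity check of these definitions, a small Python sketch (names illustrative, 0-based sensor index) that assembles the protocol-induced matrices for a given active sensor could read:

```python
import numpy as np

def augmented_matrices(A, B, C_list, sigma):
    """Build A_sigma, the augmented input matrix, and C_sigma of the
    protocol-induced augmented system for active sensor `sigma`."""
    nx = A.shape[0]
    C = np.vstack(C_list)                    # C = [C_1^T ... C_ny^T]^T
    ny = C.shape[0]
    Phi = np.diag((np.arange(ny) == sigma).astype(float))
    A_sig = np.block([[A, np.zeros((nx, ny))],
                      [Phi @ C, np.eye(ny) - Phi]])
    B_aug = np.vstack([B, np.zeros((ny, B.shape[1]))])
    C_sig = np.hstack([Phi @ C, np.eye(ny) - Phi])
    return A_sig, B_aug, C_sig
```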

In light of the universal approximation property and the learning capability of NNs, the nonlinear function F(x_k) on the compact set Ω_x ⊂ R^{n_x} can be represented as W_f ξ_f(η_k) + ϖ_{η,k}, where W_f denotes the connecting weight matrix of the neurons, ξ_f(·) is the activation function vector of the NNs, and ϖ_{η,k} is the NN approximation error. Denoting η_k := [x_k^T  ȳ_{k-1}^T]^T, one has the following augmented system:

$$\begin{cases} \eta_{k+1} = \mathcal{A}_{\sigma_k} \eta_k + W_f \xi_f(\eta_k) + \mathcal{B} u_k + \varpi_{\eta,k} \\ \bar{y}_k = \mathcal{C}_{\sigma_k} \eta_k \end{cases} \quad (4)$$

Assumption 1. The approximation error, the activation function and the connecting weight matrix of all employed NNs on a compact set Ω_η ⊂ R^{n_x+n_y} (related to Ω_x) satisfy the following conditions:

• the NN approximation error ϖ_{∗,k}, the NN activation function ξ_∗(·) and their gradients ∇ϖ_{∗,k} and ∇ξ_∗(·) are bounded, that is, ∥ϖ_{∗,k}∥ ≤ ε̄_{ϖ∗}, ∥ξ_∗(·)∥ ≤ ε̄_{ξ∗}, ∥∇ϖ_{∗,k}∥ ≤ ε̄_{∇ϖ∗} and ∥∇ξ_∗(·)∥ ≤ ε̄_{∇ξ∗};
• the connecting weight matrix W_∗ is also bounded, that is, ∥W_∗∥_F ≤ w̄_∗,

where ε̄_{ξ∗}, ε̄_{ϖ∗}, ε̄_{∇ϖ∗}, ε̄_{∇ξ∗} and w̄_∗ are known positive constants. Here, ∗ stands for "f" or other symbols used in the following sections.

Since the true states might be unavailable to the controller, a Luenberger-type observer is proposed as follows:

$$\hat{\eta}_{k+1} = \mathcal{A}_{\sigma_k} \hat{\eta}_k + \hat{W}_k^s \xi_f(\hat{\eta}_k) + \mathcal{B} u_k + L_{\sigma_k} (\bar{y}_k - \mathcal{C}_{\sigma_k} \hat{\eta}_k) \quad (5)$$

where η̂_k is the reconstructed system state and Ŵ_k^s is the estimated value of W_f at time instant k. In addition, by resorting to the gradient descent approach, the adaptive tuning law for the NN weights Ŵ_k^s is selected as

$$\hat{W}_{k+1}^s = \hat{W}_k^s + \theta_1 \left( \frac{\mathcal{C}_{\sigma_{k+1}}^T (\bar{y}_{k+1} - \hat{y}_{k+1}) \,\xi_f^T(\hat{\eta}_k)}{\|1 + \xi_f^T(\hat{\eta}_k)\xi_f(\hat{\eta}_k)\| \sum_{i=1}^{n_y} \|C_i^T C_i\|^2} - \theta_2 \hat{W}_k^s \right) \quad (6)$$

where θ_1 and θ_2 are two positive scalars to be designed.
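A minimal sketch of one observer step (5) followed by one weight update (6) is given below. It assumes ŷ_{k+1} = C_{σ_{k+1}} η̂_{k+1} and that the mode-dependent matrices come from a helper such as `augmented_matrices` above; all remaining names are illustrative.

```python
import numpy as np

def observer_step(eta_hat, W_hat, u, y_bar, mats, L, sigma, xi_f):
    """One prediction step of the Luenberger-type observer (5)."""
    A_sig, B_aug, C_sig = mats(sigma)
    return (A_sig @ eta_hat + W_hat @ xi_f(eta_hat) + B_aug @ u
            + L[sigma] @ (y_bar - C_sig @ eta_hat))

def weight_update(W_hat, eta_hat, eta_hat_next, y_bar_next, mats,
                  sigma_next, xi_f, C_norm_sum, theta1, theta2):
    """Adaptive tuning law (6); C_norm_sum is sum_i ||C_i^T C_i||^2."""
    _, _, C_sig1 = mats(sigma_next)
    xi = xi_f(eta_hat)
    denom = abs(1.0 + xi @ xi) * C_norm_sum
    innov = y_bar_next - C_sig1 @ eta_hat_next   # ybar_{k+1} - yhat_{k+1}
    grad = np.outer(C_sig1.T @ innov, xi) / denom
    return W_hat + theta1 * (grad - theta2 * W_hat)
```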

Furthermore, subtracting (5) from (4) results in the following error dynamics:

$$\tilde{\eta}_{k+1} = (\mathcal{A}_{\sigma_k} - L_{\sigma_k} \mathcal{C}_{\sigma_k}) \tilde{\eta}_k + \tilde{W}_k^s \xi_f(\hat{\eta}_k) + W_f (\xi_f(\eta_k) - \xi_f(\hat{\eta}_k)) + \varpi_{\eta,k} := \tilde{\mathcal{A}}_{\sigma_k} \tilde{\eta}_k + \tilde{W}_k^s \xi_f(\hat{\eta}_k) + \tilde{\varpi}_{\eta,k} \quad (7)$$

where η̃_k = η_k − η̂_k, W̃_k^s = W_f − Ŵ_k^s, Ã_{σ_k} = A_{σ_k} − L_{σ_k} C_{σ_k} and ϖ̃_{η,k} = W_f(ξ_f(η_k) − ξ_f(η̂_k)) + ϖ_{η,k}.

Similar to the case of deterministic systems in Jagannathan, Vandegrift, and Lewis (2000), we introduce the following stability definition.

Definition 1. Let {σ_k} be a Markov process defined on the probability space (Ω_σ, F, P) with a filtration {F_k}_{k≥0}. The error dynamics (7) is said to be semi-globally ultimately bounded in the mean-square sense if there exist a compact set Ω_η̃ ⊂ R^{n_x+n_y}, a constant υ_1 > 0 and a number N(υ_1, η̃_0, σ_0) such that, for all η̃_0 ∈ Ω_η̃,

$$\mathbb{E}\{\|\tilde{\eta}_k\|^2 \,|\, \tilde{\eta}_0, \sigma_0\} < \upsilon_1, \quad \forall k \geq N(\upsilon_1, \tilde{\eta}_0, \sigma_0).$$

The purpose of this paper is to design a suboptimal control policy u_k based on the estimated state η̂_k to optimize the following cost function:

$$S(\eta_k, k, i) = \lim_{N \to \infty} \mathbb{E}\Big\{\sum_{s=k}^{N} \ell(\eta_s, u(\eta_s, \sigma_s), \sigma_s) \,\Big|\, \eta_k, \sigma_k = i\Big\} \quad (8)$$

where the utility function ℓ(η_k, u(η_k, σ_k), σ_k) with ℓ(0, 0, σ_k) = 0 is positive for any σ_k, nonzero η_k and u(η_k, σ_k).

The following assumption is needed in order to reveal the boundedness of the developed approximation scheme in the sequel.

Assumption 2. The function ℓ(η, u, i) (i = 1, 2, ..., n_y) satisfies the following conditions:

• ℓ(η, u, i) is continuously differentiable in u and its derivative is denoted as ψ_{ℓ,i}(u) := ∂ℓ(η, u, i)/∂u;
• the derivative function ψ_{ℓ,i}(u) is invertible, with its inverse function denoted as u(η, i) = ψ_{ℓ,i}^{-1}(∂ℓ(η, u, i)/∂u);
• the inverse function satisfies ∥ψ_{ℓ,i}^{-1}(η)∥² ≤ γ∥η∥² with a known positive constant γ.

Note that the function ℓ(η, u, i) is quite general, with examples including (1) κ Σ_{s=1}^{p} ∫_0^u tanh^{-1}(ν/κ) R_i dν for nonlinear systems with input constraints, and (2) η^T P_i η + u^T R_i u for linear systems.

3. Main results

3.1. Iterative ADP and its convergence analysis

In order to overcome the challenge resulting from the uncertain transition probabilities associated with SCPs, a modified concept of admissible control, named enhanced admissible control, is introduced as follows. It should be pointed out that such a modified concept, though a bit more conservative than the traditional one, plays a very important role in the convergence analysis. Denoting

$$\vartheta = \max\Big\{\sum_{j=1}^{n_y} (\pi_{1j} + \bar{\pi}_{1j}), \ldots, \sum_{j=1}^{n_y} (\pi_{n_y j} + \bar{\pi}_{n_y j})\Big\}$$

and ℓ_max(η_k, u) = max{ℓ(η_k, u(η_k, 1), 1), ..., ℓ(η_k, u(η_k, n_y), n_y)}, we come up with a new version of the traditional admissible control.

Definition 2 (Enhanced Admissible Control). Let {σ_k} be a Markov process defined on the probability space (Ω_σ, F, P) with a filtration {F_k}_{k≥0}. A control law u(η_k, σ_k) is said to be an enhanced admissible control with respect to S_E(η_k, k) = lim_{N→∞} Σ_{m=k}^{N} E{ϑ^{m−k} ℓ_max(η_m, u)} if u(η_k, σ_k) with u(0, σ_k) = 0 is continuous on a compact set Ω_u ⊂ R^{n_u}, can stabilize system (4) in the mean-square sense for all η_0 ∈ Ω_η ⊂ R^{n_x+n_y}, and S_E(η_k, k) is finite.

Definition 3. The nonlinear system (1a) is enhanced-admission controllable in the mean-square sense if there exists an enhanced admissible control law u(η_k, σ_k) such that S_E(η_k, k) along the closed-loop system (4) is finite.

Theorem 1. If there exists a sequence of pairs (u*(η_k, i), J*(η_k, i)) satisfying the following discrete-time quasi-HJB equation:

$$\begin{cases} J^*(\eta_k, i) = \min_{u(\eta_k, i)} \Big\{\ell(\eta_k, u(\eta_k, i), i) + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} J^*(\eta_{k+1}, j)\Big\} \\[1mm] u^*(\eta_k, i) = \arg\min_{u_k} \Big\{\ell(\eta_k, u_k, i) + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} J^*(\eta_{k+1}, j)\Big\} \end{cases} \quad (9)$$

then the system (4) under the policy u*(η_k, σ_k) is stochastically stable (i.e., E{Σ_{k=0}^{∞} ∥η_k∥² | η_0, σ_0} < ∞) and the minimum of the cost S(η_k, k, i) defined in (8) is no more than J*(η_k, i), where κ̃_ij = π_ij + π̄_ij for all i, j.

This theorem is trivial and therefore its proof is omitted. In the following, an ADP algorithm is developed to obtain the solution of the quasi-HJB equation (9). In this algorithm, the cost function and the control law are updated iteratively, with the iteration index s increasing from zero to infinity.

ADP Algorithm

S1. Given an error tolerance ω > 0 and the iteration index s ∈ Z, start with initial policies J^(0)(η_k, i) = 0 and u^(0)(η_k, i).
S2. Solve J^(s+1)(η_k, i) via the quasi-HJB equation
$$J^{(s+1)}(\eta_k, i) = \ell(\eta_k, u^{(s)}(\eta_k, i), i) + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} J^{(s)}(\eta_{k+1}, j). \quad (10)$$
S3. Update the control policies via
$$u^{(s)}(\eta_k, i) = \arg\min_{u_k} \Big(\ell(\eta_k, u_k, i) + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} J^{(s)}(\eta_{k+1}, j)\Big). \quad (11)$$
S4. Go to S2 if ∥J^(s+1)(η_k, i) − J^(s)(η_k, i)∥ > ω.

Now, two lemmas, which can be easily obtained via the same idea as in Al-Tamimi, Lewis, and Abu-Khalaf (2008), are provided in order to prove the convergence of the established iterative ADP algorithm.

Lemma 1. Assume that η_k is an arbitrary state vector, {μ^(s)(η_k, σ_k)} is any enhanced admissible control sequence, and the function Φ^(s)(η_k, i) is updated only by

$$\Phi^{(s+1)}(\eta_k, i) = \ell(\eta_k, \mu^{(s)}(\eta_k, i), i) + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} \Phi^{(s)}(\eta_{k+1}, j)$$

with s ∈ Z, i ∈ {1, 2, ..., n_y}, and η_{k+1} driven by (4) under {μ^(s)(η_k, σ_k)}. For the initial values J^(0)(·,·) = Φ^(0)(·,·) = 0, one has J^(s)(η_k, i) ≤ Φ^(s)(η_k, i) for all i, where J^(s)(η_k, i) is obtained via the iterative computation between (10) and (11).

Lemma 2. Let η_k be an arbitrary state vector and {μ(η_k, σ_k)} be any enhanced admissible control sequence. Then, the following are true:


• there exists an upper bound S_E(η_k, k) such that J^(s)(η_k, i) ≤ S_E(η_k, k) holds for all i; such an upper bound is determined by the enhanced admissible control sequence {μ(η_k, σ_k)};
• if the quasi-HJB equation (9) is solvable, then there exists the smallest upper bound J*(η_k, i) satisfying J^(s)(η_k, i) ≤ J*(η_k, i) ≤ S_E(η_k, k).

Remark 1. It is observed from Lemma 2 that all control sequences generated by the iterative computation (10)–(11) are enhanced admissible control sequences. In other words, if the nonlinear system (1a) is enhanced-admission controllable, then there exists a set of optimal solutions of the quasi-HJB equation (9).

The following theorem, whose proof is omitted due to space limitations, uncovers that the sequence obtained from (10) and (11) converges to an upper bound on the optimal value.

Theorem 2. For an arbitrary state vector η_k ∈ Ω_η, s ∈ Z and i ∈ {1, 2, ..., n_y}, if the closed-loop system (4) is enhanced-admission controllable, then J^(s)(η_k, i) updated by (10)–(11) is a monotonically non-decreasing sequence, that is, J^(s)(η_k, i) ≤ J^(s+1)(η_k, i). Furthermore, the sequence of pairs (J^(s)(η_k, i), u^(s)(η_k, i)) converges to (J*(η_k, i), u*(η_k, i)).
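For intuition, the iterative scheme (10)–(11) can be mimicked on a finite state grid with a finite action set. The tabular Python sketch below is purely illustrative — the paper works with NN function approximation rather than a grid — and `step`, `cost` and the grids are assumptions supplied by the user; `kappa[i, j]` plays the role of κ̃_ij = π_ij + π̄_ij.

```python
import numpy as np

def adp_value_iteration(states, actions, step, cost, kappa,
                        tol=1e-6, max_iter=500):
    """V[i, s] stands for J^{(s)}(eta, i) on a finite grid.
    step(eta, u) returns the successor state index and
    cost(eta, u, i) evaluates the utility l(eta, u, i)."""
    n_modes = kappa.shape[0]
    V = np.zeros((n_modes, len(states)))            # J^{(0)} = 0  (S1)
    for _ in range(max_iter):
        Q = np.empty((n_modes, len(states), len(actions)))
        for i in range(n_modes):
            for s, eta in enumerate(states):
                for a, u in enumerate(actions):
                    s_next = step(eta, u)
                    Q[i, s, a] = cost(eta, u, i) + kappa[i] @ V[:, s_next]
        V_new = Q.min(axis=2)                       # value update  (10)
        if np.max(np.abs(V_new - V)) < tol:         # stopping rule (S4)
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=2)                      # greedy policy (11)
```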

3.2. Performance analysis and design of observers

For the adopted Luenberger-type observer (5) with adaptive tuning law (6), one has the following result.

Theorem 3. Let the initial NN observer weight Ŵ_0^s be selected within the compact set Ω_{WF}, and let the activation function be bounded in the compact set Ω_η. For a given admissible control input u_k, if there exist positive scalars θ_1, θ_2, θ_3, μ_1, μ_2, positive-definite matrices P_i and matrices L_i (i = 1, 2, ..., n_y) satisfying

$$P_i - \sum_{j=1}^{n_y} (3 + \mu_1 \bar{\mu}) \tilde{\kappa}_{ij} \tilde{\mathcal{A}}_i^T P_j \tilde{\mathcal{A}}_i > 0, \quad \mu_1 P_i - \sum_{j=1}^{n_y} (3 \bar{\varepsilon}_{\xi f}^2 + \mu_1 \Theta_\eta) \tilde{\kappa}_{ij} P_j > 0 \quad (12)$$

subject to C_i^T C_i P_i C_i^T C_i ≤ μ_2 ∥C_i^T C_i∥² P_i, where

μ̄ = μ_2 θ_1 (1 − θ_1θ_2 + 2θ_1 + θ_3),
Θ_η = (1 − θ_1θ_2)(1 − θ_1θ_2 + θ_1 + θ_3) + θ_1 (θ_1 + μ_2 θ_1 + μ_2 θ_3) ε̄²_{ξf},

then the dynamics of the estimation errors (7) under the adaptive law (6) is semi-globally ultimately bounded in the mean-square sense.
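Condition (12) is a pair of mode-coupled matrix inequalities, so it is straightforward to verify numerically for candidate matrices. A sketch under the assumption that the P_i, the error matrices Ã_i = A_i − L_i C_i and the scalar parameters are already in hand (all names hypothetical):

```python
import numpy as np

def is_pd(M, tol=1e-9):
    """Positive definiteness via the symmetric part of M."""
    return np.linalg.eigvalsh((M + M.T) / 2).min() > tol

def check_condition_12(P, A_tilde, kappa, mu1, mu_bar, Theta_eta, eps_xif):
    """Check both inequalities of (12) for every mode i, with P[i] > 0
    candidates and kappa[i][j] = pi_ij + bar_pi_ij."""
    n = len(P)
    for i in range(n):
        S1 = sum((3 + mu1 * mu_bar) * kappa[i][j]
                 * A_tilde[i].T @ P[j] @ A_tilde[i] for j in range(n))
        S2 = sum((3 * eps_xif**2 + mu1 * Theta_eta) * kappa[i][j] * P[j]
                 for j in range(n))
        if not (is_pd(P[i] - S1) and is_pd(mu1 * P[i] - S2)):
            return False
    return True
```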

Proof. For presentation convenience, we denote

$$\tilde{\xi}_f(\hat{\eta}_k) = \frac{\xi_f(\hat{\eta}_k)}{\|1 + \xi_f^T(\hat{\eta}_k) \xi_f(\hat{\eta}_k)\| \sum_{i=1}^{n_y} \|C_i^T C_i\|^2}.$$

Then, recalling the tuning law (6), one can easily obtain

$$\hat{W}_{k+1}^s = \hat{W}_k^s + \theta_1 \mathcal{C}_{\sigma_{k+1}}^T \mathcal{C}_{\sigma_{k+1}} \tilde{\eta}_{k+1} \tilde{\xi}_f^T(\hat{\eta}_k) - \theta_1 \theta_2 \hat{W}_k^s. \quad (13)$$

Subtracting W_f from both sides of the above equation, we obtain

$$\begin{aligned} \tilde{W}_{k+1}^s &= \tilde{W}_k^s - \theta_1 \mathcal{C}_{\sigma_{k+1}}^T \mathcal{C}_{\sigma_{k+1}} \tilde{\eta}_{k+1} \tilde{\xi}_f^T(\hat{\eta}_k) + \theta_1 \theta_2 \hat{W}_k^s \\ &= (1 - \theta_1\theta_2) \tilde{W}_k^s - \theta_1 \mathcal{C}_{\sigma_{k+1}}^T \mathcal{C}_{\sigma_{k+1}} \tilde{\mathcal{A}}_{\sigma_k} \tilde{\eta}_k \tilde{\xi}_f^T(\hat{\eta}_k) - \theta_1 \mathcal{C}_{\sigma_{k+1}}^T \mathcal{C}_{\sigma_{k+1}} \tilde{W}_k^s \xi_f(\hat{\eta}_k) \tilde{\xi}_f^T(\hat{\eta}_k) + \varpi_{W,k} \end{aligned}$$

where ϖ_{W,k} = θ_1θ_2 W_f − θ_1 C_{σ_{k+1}}^T C_{σ_{k+1}} ϖ̃_{η,k} ξ̃_f^T(η̂_k).

Construct a Lyapunov function candidate as follows:

$$V_k = V_{1,k} + V_{2,k} = \tilde{\eta}_k^T P_{\sigma_k} \tilde{\eta}_k + \mu_1 \mathrm{tr}\{(\tilde{W}_k^s)^T P_{\sigma_k} \tilde{W}_k^s\}. \quad (14)$$

Taking the dynamics (7) and (13) into consideration, we calculate the differences of V_{1,k} and V_{2,k} conditioned on σ_k = i, respectively, as follows:

$$\begin{aligned} \mathbb{E}\{\Delta V_{1,k} \,|\, \tilde{\eta}_k, \sigma_k = i\} &= \mathbb{E}\{V_{1,k+1} - V_{1,k} \,|\, \tilde{\eta}_k, \sigma_k = i\} \\ &\leq -\tilde{\eta}_k^T \Big( P_i - \sum_{j=1}^{n_y} 3 \tilde{\pi}_{ij} \tilde{\mathcal{A}}_i^T P_j \tilde{\mathcal{A}}_i \Big) \tilde{\eta}_k + \bar{\varepsilon}_{\xi f}^2 \mathrm{tr}\Big\{ \tilde{W}_k^{sT} \Big( \sum_{j=1}^{n_y} 3 \tilde{\pi}_{ij} P_j \Big) \tilde{W}_k^s \Big\} + \sum_{j=1}^{n_y} 3 \tilde{\pi}_{ij} \tilde{\varpi}_{\eta,k}^T P_j \tilde{\varpi}_{\eta,k} \quad (15) \end{aligned}$$

and

$$\mathbb{E}\{\Delta V_{2,k} \,|\, \tilde{\eta}_k, \sigma_k = i\} = \mu_1 \mathrm{tr}\big\{ \mathbb{E}\big\{ \tilde{W}_{k+1}^{sT} P_{\sigma_{k+1}} \tilde{W}_{k+1}^s - \tilde{W}_k^{sT} P_{\sigma_k} \tilde{W}_k^s \,\big|\, \tilde{\eta}_k, \sigma_k = i \big\} \big\}.$$

In light of the property of the trace operator, this conditional difference can be expanded as

$$\begin{aligned} \mathbb{E}\{\Delta V_{2,k} \,|\, \tilde{\eta}_k, \sigma_k = i\} = \mu_1 \sum_{j=1}^{n_y} \tilde{\pi}_{ij} \Big( &(1 - \theta_1\theta_2)^2 \mathrm{tr}(\tilde{W}_k^{sT} P_j \tilde{W}_k^s) - 2(1 - \theta_1\theta_2)\theta_1 \tilde{\eta}_k^T \tilde{\mathcal{A}}_i^T \mathcal{C}_j^T \mathcal{C}_j P_j \tilde{W}_k^s \tilde{\xi}_f(\hat{\eta}_k) \\ &- 2(1 - \theta_1\theta_2)\theta_1 \xi_f^T(\hat{\eta}_k) \tilde{W}_k^{sT} P_j \mathcal{C}_j^T \mathcal{C}_j \tilde{W}_k^s \tilde{\xi}_f(\hat{\eta}_k) + 2(1 - \theta_1\theta_2) \mathrm{tr}\{\tilde{W}_k^{sT} P_j \varpi_{W,k}\} \\ &+ \theta_1^2 \tilde{\eta}_k^T \tilde{\mathcal{A}}_i^T \mathcal{C}_j^T \mathcal{C}_j P_j \mathcal{C}_j^T \mathcal{C}_j \tilde{\mathcal{A}}_i \tilde{\eta}_k \tilde{\xi}_f^T(\hat{\eta}_k) \tilde{\xi}_f(\hat{\eta}_k) + 2\theta_1^2 \tilde{\eta}_k^T \tilde{\mathcal{A}}_i^T \mathcal{C}_j^T \mathcal{C}_j P_j \mathcal{C}_j^T \mathcal{C}_j \tilde{W}_k^s \tilde{\xi}_f(\hat{\eta}_k) \xi_f^T(\hat{\eta}_k) \tilde{\xi}_f(\hat{\eta}_k) \\ &- 2\theta_1 \tilde{\eta}_k^T \tilde{\mathcal{A}}_i^T \mathcal{C}_j^T \mathcal{C}_j P_j \varpi_{W,k} \tilde{\xi}_f(\hat{\eta}_k) + \theta_1^2 \xi_f^T(\hat{\eta}_k) \tilde{W}_k^{sT} \mathcal{C}_j^T \mathcal{C}_j P_j \mathcal{C}_j^T \mathcal{C}_j \tilde{W}_k^s \xi_f(\hat{\eta}_k) \tilde{\xi}_f^T(\hat{\eta}_k) \tilde{\xi}_f(\hat{\eta}_k) \\ &- 2\theta_1 \xi_f^T(\hat{\eta}_k) \tilde{W}_k^{sT} \mathcal{C}_j^T \mathcal{C}_j P_j \varpi_{W,k} \tilde{\xi}_f(\hat{\eta}_k) + \mathrm{tr}\{\varpi_{W,k}^T P_j \varpi_{W,k}\} \Big) - \mu_1 \mathrm{tr}(\tilde{W}_k^{sT} P_i \tilde{W}_k^s). \quad (16) \end{aligned}$$


Considering the condition C_i^T C_i P_i C_i^T C_i ≤ μ_2 ∥C_i^T C_i∥² P_i, one has from the above equality that

$$\mathbb{E}\{\Delta V_{2,k} \,|\, \tilde{\eta}_k, \sigma_k = i\} \leq \mu_1 \sum_{j=1}^{n_y} \tilde{\pi}_{ij} \Big( \bar{\mu}\, \tilde{\eta}_k^T \tilde{\mathcal{A}}_i^T P_j \tilde{\mathcal{A}}_i \tilde{\eta}_k + \Theta_\eta \mathrm{tr}(\tilde{W}_k^{sT} P_j \tilde{W}_k^s) + \Theta_w \Big) - \mu_1 \mathrm{tr}(\tilde{W}_k^{sT} P_i \tilde{W}_k^s) \quad (17)$$

with Θ_w = (1 + (1 − θ_1θ_2 + 2θ_1) θ_3^{-1}) tr{ϖ_{W,k}^T P_j ϖ_{W,k}}.

Substituting (15) and (17) into (14) leads to

$$\mathbb{E}\{\Delta V_k\} \leq -\tilde{\eta}_k^T \Big( P_i - \sum_{j=1}^{n_y} (3 + \mu_1 \bar{\mu}) \tilde{\kappa}_{ij} \tilde{\mathcal{A}}_i^T P_j \tilde{\mathcal{A}}_i \Big) \tilde{\eta}_k - \mathrm{tr}\Big\{ \tilde{W}_k^{sT} \Big( \mu_1 P_i - \sum_{j=1}^{n_y} (3 \bar{\varepsilon}_{\xi f}^2 + \mu_1 \Theta_\eta) \tilde{\kappa}_{ij} P_j \Big) \tilde{W}_k^s \Big\} + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} (\mu_1 \Theta_w + 3 \tilde{\varpi}_{\eta,k}^T P_j \tilde{\varpi}_{\eta,k}). \quad (18)$$

Finally, we have from condition (12) that E{ΔV_k} ≤ 0 when

$$\|\tilde{\eta}_k\|^2 \geq \frac{\sum_{j=1}^{n_y} \tilde{\kappa}_{ij} (\mu_1 \Theta_w + 3 \tilde{\varpi}_{\eta,k}^T P_j \tilde{\varpi}_{\eta,k})}{\lambda_{\min}\big(P_i - \sum_{j=1}^{n_y} (3 + \mu_1 \bar{\mu}) \tilde{\kappa}_{ij} \tilde{\mathcal{A}}_i^T P_j \tilde{\mathcal{A}}_i\big)} \;\text{ or }\; \|\tilde{W}_k^s\|_F^2 \geq \frac{\sum_{j=1}^{n_y} \tilde{\kappa}_{ij} (\mu_1 \Theta_w + 3 \tilde{\varpi}_{\eta,k}^T P_j \tilde{\varpi}_{\eta,k})}{\lambda_{\min}\big(\mu_1 P_i - \sum_{j=1}^{n_y} (3 \bar{\varepsilon}_{\xi f}^2 + \mu_1 \Theta_\eta) \tilde{\kappa}_{ij} P_j\big)}, \quad (19)$$

which implies that the system is ultimately bounded in the mean-square sense (Wang, Lam, & Liu, 2007). The proof is complete. ♮

Noting that −(P_j − P_i) P_j^{-1} (P_j − P_i) ≤ 0, one has −P_i P_j^{-1} P_i ≤ P_j − 2P_i. By applying the Schur Complement Lemma and then performing the congruence transformation diag{I, P_i, ..., P_i}, the following corollary is readily accessible.

Corollary 1. Let the initial NN observer weight Ŵ_0^s be selected within the compact set Ω_{WF} and the activation function be bounded in the compact set Ω_η. For a given admissible control input u_k, if there exist positive scalars θ_1, θ_2, θ_3, μ_1, μ_2, positive-definite matrices P_i, and matrices L̃_i (i = 1, 2, ..., n_y) satisfying

$$\begin{bmatrix} -P_i & \big(I \otimes (P_i \mathcal{A}_i - \tilde{L}_i \mathcal{C}_i)\big)^T \\ * & -\Lambda_i \otimes \tilde{P}_i \end{bmatrix} < 0, \quad \mu_1 P_i - \sum_{j=1}^{n_y} (3 \bar{\varepsilon}_{\xi f}^2 + \mu_1 \Theta_\eta) \tilde{\kappa}_{ij} P_j > 0 \quad (20)$$

subject to C_i^T C_i P_i C_i^T C_i ≤ μ_2 ∥C_i^T C_i∥² P_i, where

Λ_i^{-1} = diag{(3 + μ_1 μ̄) κ̃_{i1}, ..., (3 + μ_1 μ̄) κ̃_{i,n_y}},
P̃_i = diag{2P_i − P_1, 2P_i − P_2, ..., 2P_i − P_{n_y}},

then, with the gain matrices L_i = P_i^{-1} L̃_i and the adaptive law (6), the dynamics of the estimation errors (7) is semi-globally ultimately bounded in the mean-square sense.

3.3. NN implementation based on actor–critic structures

We are now in a position to implement the ADP algorithm by utilizing the estimated state η̂_k. In light of the universal approximation property, the optimal value function and the control input can be represented by a critic NN and an actor NN, respectively, as

$$J^{(\infty)}(\eta_k, i) = W_{Ji}^T \xi_J(\eta_k) + \varpi_J(\eta_k, i) \quad (21)$$
$$u^{(\infty)}(\eta_k, i) = W_{ui}^T \xi_u(\eta_k) + \varpi_u(\eta_k, i) \quad (22)$$

where W_Ji and W_ui are the desired weights of the designed NNs, ϖ_J(η_k, i) and ϖ_u(η_k, i) are the bounded approximation errors, and ξ_J(η_k) and ξ_u(η_k) are the activation function vectors. Recalling Assumption 1, we denote ∥W_Ji∥_F ≤ w̄_J, ∥W_ui∥_F ≤ w̄_u, ∥ξ_J(η_k)∥ ≤ ε̄_{ξJ} and ∥ξ_u(η_k)∥ ≤ ε̄_{ξu}, where w̄_J, w̄_u, ε̄_{ξJ} and ε̄_{ξu} are known positive constants.

Define the residual error function on (10) as follows:

$$H^{(s)}(\eta_k, i) = \ell(\eta_k, u^{(s)}(\eta_k, i), i) - J^{(s)}(\eta_k, i) + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} J^{(s)}(\eta_{k+1}, j). \quad (23)$$

Obviously, the solution (J*(η_k, i), u*(η_k, i)) of the discrete-time quasi-HJB equation (9) satisfies H^(∞)(η_k, i) = 0. From a practical point of view, we would like to apply an online computation approach to approximate the desired convergent solution and, as such, the superscript "s" is replaced by the time index "k" in the sequel. Keeping the estimated state η̂_k in mind, we employ the following approximation Ĵ^(k)(η̂_k, i) of the cost function J^(k)(η_k, i) in (23):

$$\hat{J}^{(k)}(\hat{\eta}_k, i) = (\hat{W}_{Ji}^{(k)})^T \xi_J(\hat{\eta}_k) \quad (24)$$

where Ŵ_Ji^(k) is the estimated weight of W_Ji. In this case, the developed discrete-time quasi-HJB equation cannot be satisfied exactly, and its residual error can be written as

$$\begin{aligned} H^{(k)}(\hat{\eta}_k, i) &= \ell(\hat{\eta}_k, u^{(k)}(\hat{\eta}_k, i), i) - \xi_J^T(\hat{\eta}_k) \hat{W}_{Ji}^{(k)} + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} \xi_J^T(\hat{\eta}_{k+1}) \hat{W}_{Jj}^{(k)} \quad (25) \\ &= \ell(\hat{\eta}_k, u^{(k)}(\hat{\eta}_k, i), i) - \mathrm{Vec}_\xi^T(k) S_i \mathrm{Vec}_{\hat{W}}(k, J) + \mathrm{Vec}_\xi^T(k+1) N_i \mathrm{Vec}_{\hat{W}}(k, J) \end{aligned}$$

with N_i = diag{κ̃_{i,1} I, κ̃_{i,2} I, ..., κ̃_{i,n_y} I}, S_i = diag{0, ..., 0, I, 0, ..., 0} (with I in the i-th block), and

Vec_Ŵ(k, J) = [Ŵ_{J1}^{(k)T} Ŵ_{J2}^{(k)T} ... Ŵ_{J,n_y}^{(k)T}]^T,  Vec_ξ(k) = [ξ_J^T(η̂_k) ξ_J^T(η̂_k) ... ξ_J^T(η̂_k)]^T.

Denote S̄ = [S_1 ... S_{n_y}]^T, N̄ = [N_1 ... N_{n_y}]^T, Vec_H(k, η̂_k) = [H^(k)(η̂_k, 1) ... H^(k)(η̂_k, n_y)]^T and Ṽec_ξ(k) = I ⊗ Vec_ξ(k). It is obtained from (25) that

$$\mathrm{Vec}_H(k, \hat{\eta}_k) = \mathrm{Vec}_\ell(\hat{\eta}_k) - \widetilde{\mathrm{Vec}}_\xi^T(k)\, \bar{S}\, \mathrm{Vec}_{\hat{W}}(k, J) + \widetilde{\mathrm{Vec}}_\xi^T(k+1)\, \bar{N}\, \mathrm{Vec}_{\hat{W}}(k, J).$$

Noting that Vec_H(k, η̂_k) is a function of Vec_Ŵ(k, J), we minimize the cost ½ Vec_H^T(k, η̂_k) Vec_H(k, η̂_k) to obtain the update law for the critic NN. With the help of the gradient descent approach, one has

$$\mathrm{Vec}_{\hat{W}}(k+1, J) = (1 - \delta_0)\, \mathrm{Vec}_{\hat{W}}(k, J) - \frac{\delta_1 \sigma_1^T}{1 + \mathrm{tr}\{\sigma_1 \sigma_1^T\}} \big\{ \sigma_1 \mathrm{Vec}_{\hat{W}}(k, J) + \mathrm{Vec}_\ell(\hat{\eta}_k) \big\} \quad (26)$$

where δ_0 and δ_1 are two design parameters, and

$$\sigma_1 = \widetilde{\mathrm{Vec}}_\xi^T(k+1)\, \bar{N} - \widetilde{\mathrm{Vec}}_\xi^T(k)\, \bar{S}.$$

Noticing the boundedness of ∥ξ_J(η_k)∥, one has that the matrix (1 + tr{σ_1 σ_1^T})^{-1} σ_1^T σ_1 is positive semidefinite.
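The update (26) is a plain gradient step once σ_1 has been assembled. In the Python sketch below, σ_1 is taken directly as the regressor matrix for which the stacked residual of (25) reads Vec_H = Vec_ℓ + σ_1 Vec_Ŵ; building it from S̄, N̄ and the activation stacks is assumed to be done elsewhere.

```python
import numpy as np

def critic_update(W_vec, ell_vec, sigma1, delta0, delta1):
    """One step of the critic tuning law (26): W_vec stacks the
    per-mode critic weights Vec_W(k, J) and ell_vec stacks the
    utilities l(eta_hat, u, i) over the modes."""
    gain = delta1 / (1.0 + np.trace(sigma1 @ sigma1.T))
    return ((1.0 - delta0) * W_vec
            - gain * (sigma1.T @ (sigma1 @ W_vec + ell_vec)))
```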

Similarly, in the implementation of the controller, we have to replace the desired control policy (22) by the suboptimal control input

$$\hat{u}^{(k)}(\hat{\eta}_k, i) = (\hat{W}_{ui}^{(k)})^T \xi_u(\hat{\eta}_k) \quad (27)$$

where û^(k)(η̂_k, i) stands for the approximated value of the control input u^(k)(η̂_k, i), and Ŵ_ui^(k) is the estimated weight of W_ui.

According to (11) and (21), the corresponding control policy u^(k)(η_k, i) is given by

$$\frac{\partial \ell(\eta_k, u^{(k)}(\eta_k, i), i)}{\partial u_k} = -\sum_{j=1}^{n_y} \tilde{\kappa}_{ij} \mathcal{B}^T \frac{\partial J^{(k)}(\eta_{k+1}, j)}{\partial \eta_{k+1}}, \quad (28)$$

which can be rewritten as

$$\frac{\partial \ell(\eta_k, u^{(k)}(\eta_k, i), i)}{\partial u_k} = -\sum_{j=1}^{n_y} \tilde{\kappa}_{ij} \mathcal{B}^T \big( \nabla \xi_J^T(\eta_{k+1}) W_{Jj}^{(k)} + \nabla \varpi_J^T(\eta_{k+1}, j) \big).$$

From Assumption 2, ψ_{ℓ,i}(u^(k)(η_k, i)) := ∂ℓ(η_k, u^(k)(η_k, i), i)/∂u^(k)(η_k, i) has an inverse function, which can be denoted as u^(k)(η_k, i) = ψ_{ℓ,i}^{-1}(∂ℓ(η_k, u^(k)(η_k, i), i)/∂u_k); the approximated value of u^(k)(η_k, i) is then obtained by

$$\hat{u}_o^{(k)}(\hat{\eta}_k, i) = \psi_{\ell,i}^{-1}\Big( -\sum_{j=1}^{n_y} \tilde{\kappa}_{ij} \mathcal{B}^T \nabla \xi_J^T(\hat{\eta}_{k+1}) \hat{W}_{Jj}^{(k)} \Big). \quad (29)$$

It is not difficult to see that the suboptimal control input û^(k)(η̂_k, i) is not the same as û_o^(k)(η̂_k, i), and the control input error between them is

$$U^{(k)}(\hat{\eta}_k, i) = \hat{u}^{(k)}(\hat{\eta}_k, i) - \hat{u}_o^{(k)}(\hat{\eta}_k, i) = (\hat{W}_{ui}^{(k)})^T \xi_u(\hat{\eta}_k) - \psi_{\ell,i}^{-1}\Big( -\sum_{j=1}^{n_y} \tilde{\kappa}_{ij} \mathcal{B}^T \nabla \xi_J^T(\hat{\eta}_{k+1}) \hat{W}_{Jj}^{(k)} \Big). \quad (30)$$

By employing the gradient descent approach again to minimize ∥U^(k)(η̂_k, i)∥², the tuning law of the actor NN weights is selected as

$$\hat{W}_{ui}^{(k+1)} = \hat{W}_{ui}^{(k)} - \delta_{2i}\, \xi_u(\hat{\eta}_k)\, U^{(k)T}(\hat{\eta}_k, i) \quad (31)$$

where δ_{2i} is a positive parameter to be designed.

Theorem 4. Let the initial weights of the observer, critic and actor NNs be, respectively, selected from compact sets Ω_{WF}, Ω_{WC} and Ω_{WA} which include their ideal weights, and let the activation functions be bounded in the compact sets Ω_η and Ω_η̂ (η̂_k ∈ Ω_η̂ ⊂ R^{n_x+n_y}). If there exist a positive-definite matrix Ξ_2 and four positive constants δ_0, δ_1, β_1, β_2 such that

$$0 < \delta_0 < 1, \quad \delta_1 > 0, \quad \Psi_2 + \Psi_3^T \Psi_4^{-1} \Psi_3 < 0 \quad (32)$$

where

Ξ_2 = diag{δ_{2,1} I, δ_{2,2} I, ..., δ_{2,n_y} I},
Ψ_2 = −2Ξ_2 + β_1 Ξ_2 + ε̄²_{ξu} Ξ_2² + β_1 ε̄²_{ξu} Ξ_2²,
Ψ_3 = Ξ_2 − ε̄²_{ξu} Ξ_2²,
Ψ_4 = β_2 Ξ_2 − ε̄²_{ξu} Ξ_2² − β_1 ε̄²_{ξu} Ξ_2²,

then the estimation errors W̃_Ji^(k) = Ŵ_Ji^(k) − W_Ji (for the critic NN (24) with tuning law (26)) and W̃_ui^(k) = Ŵ_ui^(k) − W_ui (for the actor NN (27) with tuning law (31)) are bounded.

Proof. First, for the purpose of simplicity, we introduce the following notations:

Γ_i(η̂_k) = −B^T ∇ξ_J^T(η̂_{k+1}) Υ_{κ,i} Vec_Ŵ(k, J),
Γ(η̂_k) = [Γ_1^T(η̂_k) Γ_2^T(η̂_k) ... Γ_{n_y}^T(η̂_k)]^T,
Vec_W̃(k, u, η̂_k) = [(W̃_{u1}^{(k)T} ξ_u(η̂_k))^T ... (W̃_{u,n_y}^{(k)T} ξ_u(η̂_k))^T]^T,
Vec_{W*}(u, η̂_k) = [(W_{u1}^T ξ_u(η̂_k))^T ... (W_{u,n_y}^T ξ_u(η̂_k))^T]^T,
Vec_ψ(Γ(η̂_k)) = [(ψ_{ℓ,1}^{-1}(Γ_1(η̂_k)))^T ... (ψ_{ℓ,n_y}^{-1}(Γ_{n_y}(η̂_k)))^T]^T,
Vec_W̃(k, J) = [W̃_{J1}^{(k)T} W̃_{J2}^{(k)T} ... W̃_{J,n_y}^{(k)T}]^T,
Vec_{W*}(J) = [W_{J1}^T W_{J2}^T ... W_{J,n_y}^T]^T.

Then, it follows from (26) that

$$\mathrm{Vec}_{\tilde{W}}(k+1, J) = -\frac{\delta_1 \sigma_1^T \mathrm{Vec}_\ell(\hat{\eta}_k)}{1 + \mathrm{tr}\{\sigma_1 \sigma_1^T\}} - (I - \tilde{\Psi}_1)\, \mathrm{Vec}_{W^*}(J) + \tilde{\Psi}_1\, \mathrm{Vec}_{\tilde{W}}(k, J) \quad (33)$$

where Ψ̃_1 = (1 − δ_0) I − δ_1 σ_1^T σ_1 / (1 + tr{σ_1 σ_1^T}). According to Lemma 1 in Ge, Lee, Li, and Zhang (2003) combined with condition (32), we conclude that the weight estimation error W̃_Ji^(k) is bounded.

For the actor NN (27) with tuning law (31), we select a Lyapunov function candidate

$$V_{3,k} = \sum_{i=1}^{n_y} \mathrm{tr}\big((\tilde{W}_{ui}^{(k)})^T \tilde{W}_{ui}^{(k)}\big). \quad (34)$$

For presentation convenience, we let Vec_W̃^{[i]}(k, u, η̂_k) stand for the i-th block element of the vector Vec_W̃(k, u, η̂_k). With the help of (30), we have

$$\begin{aligned} \Delta V_{3,k} = V_{3,k+1} - V_{3,k} \leq \mathrm{tr}\Big\{ &-2 \mathrm{Vec}_{\tilde{W}}^T \Xi_2 \mathrm{Vec}_{\tilde{W}} + 2 \mathrm{Vec}_{\tilde{W}}^T \Xi_2 \mathrm{Vec}_\psi(\Gamma(\hat{\eta}_k)) - 2 \mathrm{Vec}_{\tilde{W}}^T \Xi_2 \mathrm{Vec}_{W^*} \\ &+ \bar{\varepsilon}_{\xi u}^2 \big[ \mathrm{Vec}_{\tilde{W}}^T \Xi_2^2 \mathrm{Vec}_{\tilde{W}} - 2 \mathrm{Vec}_{\tilde{W}}^T \Xi_2^2 \mathrm{Vec}_\psi(\Gamma(\hat{\eta}_k)) + 2 \mathrm{Vec}_{\tilde{W}}^T \Xi_2^2 \mathrm{Vec}_{W^*} \\ &+ \mathrm{Vec}_\psi^T(\Gamma(\hat{\eta}_k)) \Xi_2^2 \mathrm{Vec}_\psi(\Gamma(\hat{\eta}_k)) - 2 \mathrm{Vec}_\psi^T(\Gamma(\hat{\eta}_k)) \Xi_2^2 \mathrm{Vec}_{W^*} + \mathrm{Vec}_{W^*}^T \Xi_2^2 \mathrm{Vec}_{W^*} \big] \Big\} \end{aligned}$$

(with the arguments (k, u, η̂_k) and (u, η̂_k) suppressed), which implies that

$$\begin{aligned} \Delta V_{3,k} \leq \mathrm{tr}\Big\{ &-\mathrm{Vec}_{\tilde{W}}^T (2\Xi_2 - \beta_1 \Xi_2 - \bar{\varepsilon}_{\xi u}^2 \Xi_2^2 - \beta_1 \bar{\varepsilon}_{\xi u}^2 \Xi_2^2) \mathrm{Vec}_{\tilde{W}} + \mathrm{Vec}_{\tilde{W}}^T (2\Xi_2 - 2\bar{\varepsilon}_{\xi u}^2 \Xi_2^2) \mathrm{Vec}_\psi(\Gamma(\hat{\eta}_k)) \\ &- \mathrm{Vec}_\psi^T(\Gamma(\hat{\eta}_k)) (\beta_2 \Xi_2 - \bar{\varepsilon}_{\xi u}^2 \Xi_2^2 - \beta_1 \bar{\varepsilon}_{\xi u}^2 \Xi_2^2) \mathrm{Vec}_\psi(\Gamma(\hat{\eta}_k)) + \beta_2 \gamma\, \Gamma^T(\hat{\eta}_k) \Xi_2 \Gamma(\hat{\eta}_k) + \Pi(W_u^*, \hat{\eta}_k) \Big\} \end{aligned}$$

where Π(W_u^*, η̂_k) = Vec_{W*}^T(u, η̂_k)(β_1^{-1} Ξ_2 + (1 + 2β_1^{-1} ε̄²_{ξu}) Ξ_2²) Vec_{W*}(u, η̂_k). Defining the vector χ_k = [Vec_W̃^T(k, u, η̂_k)  Vec_ψ^T(Γ(η̂_k))]^T, the above inequality can be rewritten as follows:

$$\Delta V_{3,k} \leq \mathrm{tr}\big\{ \chi_k^T \tilde{\Psi}_2 \chi_k + \Pi(W_u^*, \hat{\eta}_k) + \beta_2 \gamma\, \Gamma^T(\hat{\eta}_k) \Xi_2 \Gamma(\hat{\eta}_k) \big\} \quad (35)$$

where Ψ̃_2 = [Ψ_2, Ψ_3; ∗, −Ψ_4] < 0. To this end, it follows from the Lyapunov stability theorem that the weight estimation error W̃_ui^(k) is also bounded. The proof is complete. ♮
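In code, one pass of the actor tuning law (31) simply compares the actor output (27) with the HJB-based input from (29) and descends on ∥U^(k)∥². A minimal Python sketch with illustrative names, where `u_hat_o` is assumed to be computed from (29) elsewhere:

```python
import numpy as np

def actor_step(W_ui, xi_u, u_hat_o, delta2i):
    """One gradient step of the actor tuning law (31)."""
    u_hat = W_ui.T @ xi_u                    # actor output, Eq. (27)
    U_err = u_hat - u_hat_o                  # control input error (30)
    return W_ui - delta2i * np.outer(xi_u, U_err)   # Eq. (31)
```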

3.4. Stability analysis of the closed-loop system

According to optimal control theory, the policy (22) stabilizes the system

$$\eta_{k+1} = \mathcal{A}_{\sigma_k} \eta_k + W_f \xi_f(\eta_k) + \mathcal{B} W_{u\sigma_k}^T \xi_u(\eta_k) + \mathcal{B} \varpi_u(\eta_k, \sigma_k) + \varpi_{\eta,k} := \Upsilon(\eta_k, \sigma_k) \quad (36)$$

on a compact set. Along a similar line to Dierks and Jagannathan (2012), it is clear that there exists a constant K* < 1 such that

$$\mathbb{E}\Big\|\sum_{j=1}^{n_y} \tilde{\pi}_{ij} \Upsilon(\eta_k, j)\Big\|^2 \leq K^* \mathbb{E}\|\eta_k\|^2.$$

In the framework of observer-based control, taking the control policy (27) into account, one has the following actual closed-loop system:

$$\eta_{k+1} = \Upsilon(\eta_k, \sigma_k) - \mathcal{B} W_{u\sigma_k}^T \Delta\tilde{\xi}_{u,k} + \mathcal{B} \tilde{W}_{u\sigma_k}^T \xi_u(\hat{\eta}_k) - \mathcal{B} \varpi_u(\eta_k, \sigma_k) \quad (37)$$

with Δξ̃_{u,k} = ξ_u(η_k) − ξ_u(η̂_k). It is not difficult to see from (36) and (37) that

$$\mathbb{E}\{\|\eta_{k+1}\|^2\} \leq K^*(1 + 3\beta_3) \mathbb{E}\|\eta_k\|^2 + \lambda_B \vartheta (3 + \beta_3^{-1}) \|\mathrm{Vec}_{\tilde{W}}(k, u, \hat{\eta}_k)\|^2 + \lambda_B \bar{\varepsilon}_{\varpi u}^2 \vartheta (3 + \beta_3^{-1}) (1 + 4 \bar{\varepsilon}_{\xi u}^2) \quad (38)$$

where λ_B = λ_max(B^T B). Then, for 1 − δ_0 − δ_1 > 0, it follows from (26) that

$$\mathrm{tr}\{\mathrm{Vec}_{\hat{W}}(k+1, J)\mathrm{Vec}_{\hat{W}}^T(k+1, J)\} \leq (1 + \beta_4)(1 - \delta_0)^2 \mathrm{tr}\{\mathrm{Vec}_{\hat{W}}(k, J)\mathrm{Vec}_{\hat{W}}^T(k, J)\} + (1 + \beta_4^{-1}) \mathrm{tr}\{\tilde{\sigma}_1^T \mathrm{Vec}_\ell(\hat{\eta}_k)\mathrm{Vec}_\ell^T(\hat{\eta}_k)\tilde{\sigma}_1\} \quad (39)$$

where σ̃_1 = δ_1 σ_1 / (1 + tr{σ_1 σ_1^T}). In what follows, construct the Lyapunov function candidate

$$V_k = M_2\, \mathrm{tr}\{\mathrm{Vec}_{\hat{W}}(k, J)\mathrm{Vec}_{\hat{W}}^T(k, J)\} + M_1 \|\eta_{k+1}\|^2 + V_{1,k} + V_{2,k} + V_{3,k} \quad (40)$$

where V_{1,k}, V_{2,k} and V_{3,k} are the same as those in (14) and (34), and the parameters M_1 and M_2 are defined by

$$M_1 = \frac{-\lambda_{\min}(\tilde{\Psi}_2)}{\lambda_B \vartheta (3 + \beta_3^{-1})}, \quad M_2 = \frac{\max_{ij}\{\tilde{\kappa}_{ij}^2\, \bar{\varepsilon}_{\nabla \xi J}^2\}}{1 - (1 + \beta_4)(1 - \delta_0)^2}.$$

It is not difficult to see from (18), (35), (38) and (40) that

$$\begin{aligned} \mathbb{E}\{\Delta V_k\} = \mathbb{E}\{V_{k+1} - V_k\} \leq &-(1 - K^*(1 + 3\beta_3)) M_1 \|\eta_k\|^2 - M_{3i} \|\tilde{\eta}_k\|^2 + \mathrm{tr}\{\Pi(W_u^*, \hat{\eta}_k)\} + \beta_2 \gamma \|\mathcal{B}\|^2 \lambda_{\max}(\Xi_2)\, \bar{\varepsilon}_{\nabla \xi J}^2 \\ &+ M_2 (1 + \beta_4^{-1}) \mathrm{tr}\{\tilde{\sigma}_1^T \mathrm{Vec}_\ell(\hat{\eta}_k) \mathrm{Vec}_\ell^T(\hat{\eta}_k) \tilde{\sigma}_1\} + \sum_{j=1}^{n_y} \tilde{\kappa}_{ij} (\mu_1 \Theta_w + 3 \tilde{\varpi}_{\eta,k}^T P_j \tilde{\varpi}_{\eta,k}) \\ &+ \lambda_B \bar{\varepsilon}_{\varpi u}^2 \vartheta (3 + \beta_3^{-1}) M_1 (1 + 4 \bar{\varepsilon}_{\xi u}^2). \quad (41) \end{aligned}$$

Noting ∥ℓ(η̂_k, u(η̂_k, i), i)∥ ≤ ∥W_Ji^T ξ_J(η̂_k) + ϖ_J(η̂_k, i)∥, one has tr{Vec_ℓ^T(η̂_k) Vec_ℓ(η̂_k)} ≤ 2 n_y (w̄_J² ε̄²_{ξJ} + ε̄²_{ϖJ}). In addition, it is not difficult to see that M_2 (1 + β_4^{-1}) tr{σ̃_1^T Vec_ℓ(η̂_k) Vec_ℓ^T(η̂_k) σ̃_1} ≤ n_y M_2 (1 + β_4) δ_1² (2 w̄_J² ε̄²_{ξJ} + 2 ε̄²_{ϖJ}).

Therefore, it follows from (41) that

$$\mathbb{E}\{\Delta V_k\} \leq -(1 - K^*(1 + 3\beta_3)) M_1 \|\eta_k\|^2 - M_{3i} \|\tilde{\eta}_k\|^2 + n_y M_4 \bar{\varepsilon}_{\xi u}^2 \bar{w}_u^2 + M_{7i} + M_8 \quad (42)$$

where

M_{3i} = λ_min(P_i − Σ_{j=1}^{n_y} (3 + μ_1 μ̄) κ̃_ij Ã_i^T P_j Ã_i),
M_4 = λ_max(β_1^{-1} Ξ_2 + (1 + 2 β_1^{-1} ε̄²_{ξu}) Ξ_2²),
M_5 = μ_1 (1 + (1 − θ_1θ_2 + 2θ_1) θ_3^{-1}) λ_max(P_i),
M_6 = 2 θ_1² (θ_2² w̄_f² ε̄²_{ξf} + w̄_f² + 2 ε̄²_{ϖη}) M_5,
M_{7i} = Σ_{j=1}^{n_y} κ̃_ij 3 λ_max(P_j) (2 w̄_f² + ε̄²_{ϖη}) + M_6,
M_8 = n_y M_2 (1 + β_4) δ_1² (2 w̄_J² ε̄²_{ξJ} + 2 ε̄²_{ϖJ}) + λ_B ε̄²_{ϖu} ϑ (3 + β_3^{-1}) M_1 (1 + 4 ε̄²_{ξu}) + β_2 γ ∥B∥² λ_max(Ξ_2) ε̄²_{∇ξJ}.

Obviously, we have E{ΔV_k} < 0 when

$$\|\eta_k\|^2 > \frac{n_y M_4 \bar{\varepsilon}_{\xi u}^2 \bar{w}_u^2 + M_{7i} + M_8}{(1 - K^*(1 + 3\beta_3)) M_1} \;\text{ or }\; \|\tilde{\eta}_k\|^2 > \frac{n_y M_4 \bar{\varepsilon}_{\xi u}^2 \bar{w}_u^2 + M_{7i} + M_8}{M_{3i}}.$$

By resorting to the last case, we can obtain a compact set Ω_η̃ such that Ω_η̃ ⊂ Ω_η. In addition, according to the modified version of the standard Lyapunov stability theorem used in Wang et al. (2007), we have the following theorem.

Theorem 5. Let the initial weights of the observer, critic and actor NNs be, respectively, selected from compact sets Ω_{WF}, Ω_{WC} and Ω_{WA} which include their ideal weights. In addition, let the gains of the Luenberger-type observer (5) be designed according to Corollary 1. If there exist a positive-definite matrix Ξ_2 and four positive constants δ_0, δ_1, β_1, β_2 such that 1 − δ_0 − δ_1 > 0 and condition (32) in Theorem 4 hold, then the closed-loop system (4) with control law (27) and the error dynamics (7) of the Luenberger-type observer with adaptive tuning law (6) are semi-globally ultimately bounded in the mean-square sense.

4. An illustrative example

In this section, a simulation example is adopted to show the effectiveness of the proposed control scheme for a class of nonlinear systems under SCPs. Consider system (1a) with two sensors and the following parameters:

$$A = \begin{bmatrix} 0.8 & 0 & -0.45 \\ 0.25 & -0.45 & 1.2 \\ 0 & -1.2 & -0.85 \end{bmatrix}, \quad B = \begin{bmatrix} 0.5 \\ -0.48 \\ 0.5 \end{bmatrix},$$
$$C_1 = \begin{bmatrix} 0.45 & 0 & -0.2 \end{bmatrix}, \quad C_2 = \begin{bmatrix} -0.58 & 0 & -0.2 \end{bmatrix}.$$

The nonlinear function is taken as f(x_k) = 0.1 [tan(x_{1,k}) sin(x_{2,k})  sin(x_{2,k}) cos(x_{3,k})  sin(x_{2,k})]^T, where x_{i,k} (i = 1, 2, 3) stands for the i-th element of the vector x_k. In addition, the transition probabilities are π_11 = 0.78, π_12 = 0.2, π_21 = 0.4, π_22 = 0.58, and their uncertainty bounds are π̄_11 = 0.02, π̄_12 = 0.01, π̄_21 = 0.01, π̄_22 = 0.02.

Firstly, by resorting to Corollary 1 with μ_1 = 0.4, μ_2 = 6, θ_1 = 0.1, θ_2 = 0.8, and θ_3 = 0.2, we obtain the desired observer gains

$$L_1 = \begin{bmatrix} 0.7032 & 0.0007 \\ 1.9575 & -0.0004 \\ 1.1466 & -0.0008 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad L_2 = \begin{bmatrix} 0.0014 & 0.0407 \\ -0.0020 & -1.4515 \\ 0.0002 & 0.7835 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.$$


Fig. 1. The state trajectories and the cost function.

Fig. 2. The estimated NN weights.

The considered cost function ℓ(η, u, i) is η^T Q_i η + u^T R u with weight matrices Q_1 = 1.6I, Q_2 = 2.4I and R = I. The positive-definite matrix and the positive constants in Theorem 4 are selected, respectively, as Ξ_2 = diag{0.4, 0.5}, δ_0 = 0.00001, δ_1 = 0.5, β_1 = 0.5 and β_2 = 0.8. For these parameters, the eigenvalues of the matrix Ψ_2 + Ψ_3^T Ψ_4^{-1} Ψ_3 are −0.1190 and −0.1075, and therefore condition (32) in Theorem 4 holds.

In what follows, we perform the simulation of the iterative control scheme developed in this paper. The activation functions of the approximation network, the action network and the critic network are chosen as σ_f = 0.05[tansig(0.25η̂_{1,k}) tansig(0.25η̂_{2,k}) tansig(0.25η̂_{3,k}) 0 0]^T, σ_u = 0.4[tanh(0.22η̂_{1,k}) tanh(0.22η̂_{2,k}) tanh(0.22η̂_{3,k}) 0 0]^T and σ_v = 0.36[η̂²_{1,k} η̂_{1,k}η̂_{2,k} η̂_{1,k}η̂_{3,k} η̂²_{2,k} η̂_{2,k}η̂_{3,k} η̂²_{3,k}]^T. The initial state is randomly generated in Matlab via x_0 = unifrnd(−0.1, 0.1, 3, 1), and the initial values of the weight matrices are Ŵ_0^s = 4I, Ŵ_{J1}^{(0)} = Ŵ_{J2}^{(0)} = [2.95 −3.1 1.8 5.26 −4.65 3.2]^T, Ŵ_{u1}^{(0)} = [−1.35 −6.00 7.20 0 0]^T, and Ŵ_{u2}^{(0)} = [−1.20 −5.50 7.00 0 0]^T.

The simulation results are shown in Figs. 1–2. First, the states of the closed-loop system and their estimates are shown in Fig. 1(a), which clearly indicates the feasibility of the proposed observer and controller. Then, the updates of the weight matrices of the actor and critic NNs are plotted in Fig. 2. Based on the results presented in the figures, the proposed suboptimal control scheme has effectively dealt with the challenges arising from SCP scheduling with uncertain transition probabilities.
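For readers who wish to reproduce the setting, the Python sketch below wires the example parameters to the SCP holder of Section 2 and propagates the plant. The control input is left as a placeholder where the trained actor output (27) would enter, the third component of the nonlinearity follows the reconstruction above, and the nominal transition rows are adjusted to sum to one for sampling — all of which are assumptions for illustration only.

```python
import numpy as np

A = np.array([[0.8, 0.0, -0.45],
              [0.25, -0.45, 1.2],
              [0.0, -1.2, -0.85]])
B = np.array([[0.5], [-0.48], [0.5]])
C = np.array([[0.45, 0.0, -0.2],     # C_1
              [-0.58, 0.0, -0.2]])   # C_2

def f(x):  # example nonlinearity (third entry is our reconstruction)
    return 0.1 * np.array([np.tan(x[0]) * np.sin(x[1]),
                           np.sin(x[1]) * np.cos(x[2]),
                           np.sin(x[1])])

# nominal pi_ij, second entries bumped so each row sums to one
P = np.array([[0.78, 0.22],
              [0.40, 0.60]])

rng = np.random.default_rng(0)
x = rng.uniform(-0.1, 0.1, 3)        # x0 = unifrnd(-0.1, 0.1, 3, 1)
y_bar, sigma = np.zeros(2), 0
for k in range(50):
    Phi = np.diag((np.arange(2) == sigma).astype(float))
    y_bar = Phi @ (C @ x) + (np.eye(2) - Phi) @ y_bar   # holder (3)
    u = np.zeros(1)                  # placeholder for the actor NN (27)
    x = A @ x + f(x) + B @ u
    sigma = rng.choice(2, p=P[sigma])
print(x)
```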

5. Conclusions

In this paper, we have investigated the neural-network-based output-feedback control problem for a class of stochastic nonlinear systems under stochastic communication scheduling, where an interval has been utilized to describe the uncertainty of the transition probabilities. The investigated system has been modeled as a Markovian jump system with uncertain transition probability matrices. A novel iterative adaptive dynamic programming algorithm has first been developed to obtain the desired suboptimal solution with the help of an auxiliary optimization problem, and its convergence has been established via intensive mathematical analysis. Then, a sufficient condition on the adopted Luenberger-type observer with the given adaptive tuning law has been proposed to guarantee the boundedness of the estimation errors. Furthermore, the implementation of the neural-network-based output-feedback control has been realized via the well-known actor–critic structure. With the help of stability theory, a sufficient condition has been established to guarantee the boundedness of the estimation errors of the critic and actor NN weights. Finally, a simulation example has demonstrated the effectiveness of the developed control technique.


References

Al-Tamimi, A., Lewis, F. L., & Abu-Khalaf, M. (2008). Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 38(4), 943–949.
Dierks, T., & Jagannathan, S. (2012). Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Transactions on Neural Networks and Learning Systems, 23(7), 1118–1129.
Ding, D., Wang, Z., Han, Q.-L., & Ge, X. (2019). A survey on model-based distributed control and filtering for industrial cyber-physical systems. IEEE Transactions on Industrial Informatics, http://dx.doi.org/10.1109/TII.2019.2905295 (in press).
Ding, D., Wang, Z., Han, Q.-L., & Wei, G. (2019). Neural-network-based output-feedback control under Round-Robin scheduling protocols. IEEE Transactions on Cybernetics, 49(6), 2372–2384.
Ge, S. S., Lee, T. H., Li, G. Y., & Zhang, J. (2003). Adaptive NN control for a class of discrete-time non-linear systems. International Journal of Control, 76(4), 334–354.
Jagannathan, S., Vandegrift, M. W., & Lewis, F. L. (2000). Adaptive fuzzy logic control of discrete-time dynamical systems. Automatica, 36(2), 229–241.
Kiumarsi, B., & Lewis, F. L. (2015). Actor–critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Transactions on Neural Networks and Learning Systems, 26(1), 140–151.
Liu, D., Huang, Y., Wang, D., & Wei, Q. (2013). Neural network observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming. International Journal of Control, 86(9), 1554–1566.
Modares, H., Lewis, F. L., & Naghibi-Sistani, M.-B. (2013). Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Transactions on Neural Networks and Learning Systems, 24(10), 1513–1525.
Patrinos, P., Sopasakis, P., Sarimveis, H., & Bemporad, A. (2014). Stochastic model predictive control for constrained discrete-time Markovian switching systems. Automatica, 50, 2504–2514.
Vamvoudakis, K. G. (2014). Event-triggered optimal adaptive control algorithm for continuous-time nonlinear systems. IEEE/CAA Journal of Automatica Sinica, 1(3), 282–293.
Vamvoudakis, K. G., & Lewis, F. L. (2010). Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5), 878–888.
Wang, Z., Lam, J., & Liu, X. (2007). Filtering for a class of nonlinear discrete-time stochastic systems with state delays. Journal of Computational and Applied Mathematics, 201, 153–163.
Wei, Q., Lewis, F. L., Sun, Q., Yan, P., & Song, R. (2017). Discrete-time deterministic Q-learning: A novel convergence analysis. IEEE Transactions on Cybernetics, 47(5), 1224–1237.
Xu, H., Zhao, Q., & Jagannathan, S. (2015). Finite-horizon near-optimal output feedback neural network control of quantized nonlinear discrete-time systems with input constraint. IEEE Transactions on Neural Networks and Learning Systems, 26(8), 1776–1788.
Zhang, H., Luo, Y., & Liu, D. (2009). Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints. IEEE Transactions on Neural Networks, 20(9), 1490–1503.
Zhong, X., He, H., Zhang, H., & Wang, Z. (2014). Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 25(12), 2141–2155.
Zou, L., Wang, Z., & Gao, H. (2016). Observer-based H∞ control of networked systems with stochastic communication protocol: The finite-horizon case. Automatica, 63, 366–373.

Derui Ding received both the B.Sc. degree in Industry Engineering in 2004 and the M.Sc. degree in Detection Technology and Automation Equipment in 2007 from Anhui Polytechnic University, Wuhu, China, and the Ph.D. degree in Control Theory and Control Engineering in 2014 from Donghua University, Shanghai, China. From July 2007 to December 2014, he was a teaching assistant and then a lecturer in the Department of Mathematics, Anhui Polytechnic University, Wuhu, China. He is currently a senior research fellow with the School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, Australia. From June 2012 to September 2012, he was a research assistant in the Department of Mechanical Engineering, the University of Hong Kong, Hong Kong. From March 2013 to March 2014, he was a visiting scholar in the Department of Information Systems and Computing, Brunel University London, UK. His research interests include nonlinear stochastic control and filtering, as well as multi-agent systems and sensor networks. He has published around 40 papers in refereed international journals. He is serving as an Associate Editor for Neurocomputing. He is also a very active reviewer for many international journals.

Zidong Wang was born in Jiangsu, China, in 1966. He received the B.Sc. degree in mathematics in 1986 from Suzhou University, Suzhou, China, and the M.Sc. degree in applied mathematics in 1990 and the Ph.D. degree in electrical engineering in 1994, both from Nanjing University of Science and Technology, Nanjing, China. He is currently Professor of Dynamical Systems and Computing in the Department of Computer Science, Brunel University London, UK. From 1990 to 2002, he held teaching and research appointments in universities in China, Germany and the UK. His research interests include dynamical systems, signal processing, bioinformatics, control theory and applications. He has published more than 400 papers in refereed international journals. He is a holder of the Alexander von Humboldt Research Fellowship of Germany, the JSPS Research Fellowship of Japan, William Mong Visiting Research Fellowship of Hong Kong. He serves (or has served) as the Editor-in-Chief for Neurocomputing, Deputy Editor-in-Chief for International Journal of Systems Science, and an Associate Editor for 12 international journals including IEEE Transactions on Automatic Control, IEEE Transactions on Control Systems Technology, IEEE Transactions on Neural Networks, IEEE Transactions on Signal Processing, and IEEE Transactions on Systems, Man, and Cybernetics-Part C. He is a Fellow of the IEEE, a Fellow of the Royal Statistical Society and a member of program committee for many international conferences.

Qing-Long Han received the B.Sc. degree in Mathematics from Shandong Normal University, Jinan, China, in 1983, and the M.Sc. and Ph.D. degrees in Control Engineering and Electrical Engineering from East China University of Science and Technology, Shanghai, China, in 1992 and 1997, respectively. From September 1997 to December 1998, he was a Post-doctoral Researcher Fellow with the Laboratoire d’Automatique et d’Informatique Industielle (currently, Laboratoire d’Informatique et d’Automatique pour les Systémes), École Supérieure d’Ingénieurs de Poitiers (currently, École Nationale Supérieure d’Ingénieurs de Poitiers), Université de Poitiers, France. From January 1999 to August 2001, he was a Research Assistant Professor with the Department of Mechanical and Industrial Engineering at Southern Illinois University at Edwardsville, USA. From September 2001 to December 2014, he was Laureate Professor, an Associate Dean (Research and Innovation) with the Higher Education Division, and the Founding Director of the Centre for Intelligent and Networked Systems at Central Queensland University, Australia. From December 2014 to May 2016, he was Deputy Dean (Research), with the Griffith Sciences, and a Professor with the Griffith School of Engineering, Griffith University, Australia. In May 2016, he joined Swinburne University of Technology, Australia, where he is currently Pro Vice-Chancellor (Research Quality) and a Distinguished Professor. In March 2010, he was appointed Chang Jiang (Yangtze River) Scholar Chair Professor by Ministry of Education, China. His research interests include networked control systems, multi-agent systems, time-delay systems, complex dynamical systems and neural networks. He is one of The World’s Most Influential Scientific Minds: 2014–2016, and 2018. He is a Highly Cited Researcher according to Clarivate Analytics (formerly Thomson Reuters). He is a Fellow of the IEEE and a Fellow of The Institution of Engineers Australia. He is an Associate Editor of several international journals, including IEEE Transactions on Cybernetics, IEEE Transactions on Industrial Electronics, IEEE Transactions on Industrial Informatics, IEEE Industrial Electronics Magazine, IEEE/CAA Journal of Automatica Sinica, Control Engineering Practice, and Information Sciences.