Robust control design for multi-player nonlinear systems with input disturbances via adaptive dynamic programming

Qiuxia Qu a,b, Huaguang Zhang a,∗, Chaomin Luo c, Rui Yu a

a College of Information Science and Engineering, Northeastern University, Shenyang, 110819, China
b College of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning 100168, China
c Department of Electrical and Computer Engineering, University of Detroit Mercy, Michigan, USA

Neurocomputing 334 (2019) 1–10

Article history: Received 16 February 2017; Revised 13 November 2017; Accepted 18 November 2018; Available online 22 January 2019. Communicated by Hak Keung Lam.

Keywords: Adaptive dynamic programming (ADP), Input disturbances, Multi-player systems, Nonzero-sum games, Nash equilibrium, Neural network

Abstract

In this paper, a novel robust control strategy based on the adaptive dynamic programming (ADP) technique is proposed for multi-player nonlinear systems with input disturbances. A pair of robust control policies is constructed by multiplying appropriate coupling gains with the Nash solution of the nominal nonlinear nonzero-sum game, whose predefined cost functions account for the uncertain system disturbances. Sufficient conditions for the existence of the robust control strategy are derived, and it is proved that the robust control strategy guarantees that the multi-player nonlinear system is stable in the sense of uniform ultimate boundedness (UUB) with disturbance rejection. A single-network ADP algorithm is employed to solve the coupled Hamilton–Jacobi equations, which requires only online tuning of the critic neural network (NN) weights for each player. By utilizing Lyapunov theory, the NN weight estimation errors are proved to be uniformly ultimately bounded, while the stability of the closed-loop nonzero-sum game system is also guaranteed. Two numerical experiments are given to demonstrate the effectiveness of the proposed approach.

1. Introduction

Multi-player systems have recently gained increasing attention due to their extensive real-world applications in both science and engineering, such as microgrid systems [1], autonomous guided vehicles [2], pursuit–evasion games [3], as well as missile guidance, military strategy, aircraft control and aerial tactics [4]. For multi-input systems, each input is computed by a player, and each player tries to influence the system state so as to minimize its own cost function. In this case, the optimization problem for each player is coupled with the optimization problems of the other players. Differential game theory provides solution concepts for many multi-player, multi-objective optimization problems [5]. As one important part of game theory, the objective of a nonzero-sum game is to find a set of optimal policies called the Nash solution. Many researchers have studied nonzero-sum games by using various analysis methods, including the game theory introduced in [6] and Lyapunov functions. As one class of reinforcement learning techniques, ADP methods have been widely studied for the forward-in-time solution of the Hamilton–Jacobi–Bellman (HJB)




equation or the coupled Hamilton–Jacobi (HJ) equations arising in optimal control problems for stochastic systems [7,8], differential games [9–14], and other fields [15–19]. A large number of results for practical applications have also been obtained [20–22]. ADP methods can be categorized into two classes, i.e., policy iteration (PI) and value iteration (VI) algorithms [23–29,32,33]. Recently, both PI and VI schemes using only input-output data were provided to obtain an optimal controller for unknown discrete-time linear systems in [30], and these methods were extended to develop a linear tracking controller in [31]. It is worth mentioning that many researchers have studied the robust adaptive control problem for linear and nonlinear systems subjected to external disturbances via ADP methods [34–38]. Besides the aforementioned works, inspired by Adhyaru et al. [39], Wang et al. designed a robust controller by multiplying a proper gain and proved that it not only makes the system with uncertain disturbances uniformly ultimately bounded but is also optimal [40]. Recently, several adaptive control algorithms have been applied to solve the nonzero-sum game for linear and nonlinear systems so as to obtain the feedback Nash equilibrium. In [41], an online PI learning algorithm was proposed for nonlinear systems which updates the critic-actor structure for each player synchronously and simultaneously. Zhang et al. presented an ADP algorithm for two-player nonlinear nonzero-sum games with a single network


in [42]. Following their studies, the coupled nonlinear HJ equations were solved by integrating an experience replay algorithm with single-network ADP in [43]. Although remarkable results have been achieved by using ADP algorithms to design optimal control pairs for multi-player systems, the effect of disturbances on the multiple controllers was ignored. In industrial applications, practical systems are often subjected to model uncertainty such as unknown or partially known time-varying process parameters. These disturbances tend to degrade the stability performance of multi-player systems. The robust control problem for uncertain nonlinear systems is challenging due to the nonlinear nature of the dynamics, the cooperative effects of multiple controllers, and the multiple uncertain disturbances, and it has thus remained an open problem. The main difficulty is that existing methods cannot be used to design multiple cooperative robust controllers, especially since the intrinsic limitations of feedback linearizing control and the difficulty of dealing with systems with parameter uncertainties became apparent [44]. Therefore, it is necessary to design robust controllers for multi-player systems subjected to disturbances that would otherwise deteriorate the nominal closed-loop performance.

In this paper, a novel control strategy design method is proposed for the first time by combining ADP and reinforcement learning techniques for multi-player nonlinear systems with multiple bounded input disturbances. Compared with traditional sliding mode controller designs [45], this method generates a continuous-time control signal while avoiding chattering. In contrast to robust controller designs based on disturbance observers [46], in which the bounds on the system disturbance and its derivatives are generally required to be known for control parameter selection, this method relaxes the requirement of a priori knowledge of the system disturbance. The recursive adaptive control design procedure named backstepping makes the design of feedback control strategies systematic; however, the drawback of the backstepping design procedure is that its computational complexity naturally increases with the system order [44]. Here, to design the control scheme for the uncertain system, the robust control problem is first transformed into a nonzero-sum game for the nominal multi-player system without disturbances, with cost functions modified to account for the disturbances on each player, and the Nash equilibrium solution of the nonzero-sum game is obtained. Then, the robust control strategy is established by multiplying appropriate feedback gains with the Nash equilibrium solution. The proposed control is verified to guarantee that the uncertain nonlinear system is stable in the sense of uniform ultimate boundedness. Next, an online single NN-based ADP algorithm is developed to find the Nash equilibrium solution of the nonzero-sum game. In addition, motivated by the results in [37], the convergence of the policy iteration algorithm for multi-player systems with modified cost functions is shown by proving that it is mathematically equivalent to Newton's method for seeking a fixed point in a Banach space. Different from [41], this online ADP learning algorithm adopts a simple critic NN structure to solve the coupled HJ equations, reducing the computational complexity and avoiding the approximation error introduced by an actor NN.
Specifically, considering the effectiveness of ADP and reinforcement learning techniques in obtaining the approximate feedback Nash equilibrium solution, the developed robust control approach is easy to understand and implement. The rest of this article is organized as follows: Section 2 presents the problem formulation. Sufficient conditions for the existence of robust control strategies are derived in Section 3. Section 4 presents the learning procedure for the approximate optimal solutions using multiple critic NNs based on an online ADP algorithm, and a convergence proof is provided. In Section 5, two numerical examples are provided to demonstrate the effectiveness of the proposed adaptive robust control scheme. Section 6 gives the conclusion.

2. Problem statement

Consider a class of nonlinear multi-player systems with input disturbances described as

\dot{x}(t) = f(x(t)) + \sum_{j=1}^{N} g_j(x(t))\big(\bar{u}_j(t) + d_j(x)\big),    (1)

where x(t) ∈ R^n is the state vector, ū_j ∈ R^{m_j} is the controller or player, d_j(x) ∈ R^{m_j} is the input disturbance, and f(x) ∈ R^n, g_j(x) ∈ R^{n×m_j} are differentiable in their arguments with f(0) = 0. Define the set I = {1, 2, ..., N}. The following assumptions are made on the system dynamics, where ‖·‖ stands for the 2-norm of a matrix or vector.

Assumption 1.
1. The internal system dynamics f(x) is locally Lipschitz, and ‖f(x)‖ ≤ b_f ‖x‖, b_f ∈ R^+.
2. The input gain matrix g_j(x) is bounded by a positive constant b_{g_j}, i.e., ‖g_j(x)‖ ≤ b_{g_j}, for each j ∈ I.

Assumption 2. The input disturbances are assumed to be bounded by known functions d_{jM}(x), i.e., ‖d_j(x)‖ ≤ d_{jM}(x), j ∈ I. Moreover, d_j(0) = 0 and d_{jM}(0) = 0, j ∈ I, so that x = 0 is an equilibrium of system (1).

In order to simplify the expressions, we use x, u_j and d_j to represent x(t), u_j(t) and d_j(x) in the following, respectively. For the dynamical system (1), one seeks a set of feedback controllers (ū_1, ū_2, ..., ū_N) such that the closed-loop system is stable under all unknown disturbance terms. Note that the multiple control inputs, the uncertain disturbance terms, and the nonlinear nature of the dynamical model (1) make it generally difficult to deal with the robust control problem directly. In this paper, we show that the robust control problem can be converted into a nonzero-sum game by appropriately selecting the value function for each player or controller. Afterwards, by solving for the Nash equilibrium, we can obtain a robust control pair that guarantees that the closed-loop system (1) achieves UUB stability. In general, the nominal nonzero-sum game model can be described by

\dot{x} = f(x) + \sum_{j=1}^{N} g_j(x)\,u_j,    (2)

where u_j ∈ R^{m_j} is the controller or player. We also assume that f + \sum_{j=1}^{N} g_j u_j is Lipschitz continuous on a set Ω ⊆ R^n containing the origin and that system (2) is controllable. In contrast to the traditional performance index associated with each player in [42], the novel local performance functions are defined as

J_i(x(0)) = \int_0^{\infty} \Big[ \delta_{iM}^2 + x^T Q_i x + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Big] dt = \int_0^{\infty} \big[ \delta_{iM}^2 + r_i(x(t), u_1(x(t)), \ldots, u_N(x(t))) \big] dt, \quad i \in I,    (3)

where δ_{iM}(x) is a known function defined by \delta_{iM}^2 = \frac{1}{2}\sum_{j=1}^{N} d_{jM}^2 \lambda_M(R_{ij}) with δ_{iM}(0) = 0, r_i(x, u_1, ..., u_N) = x^T Q_i x + \sum_{j=1}^{N} u_j^T R_{ij} u_j is the utility function, and Q_i ∈ R^{n×n}, R_{ii} ∈ R^{m_i×m_i} and R_{ij} ∈ R^{m_j×m_j} are symmetric positive definite matrices.
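For illustration only (this sketch is not part of the paper), the integrand of the modified cost (3) can be evaluated numerically as follows; the two-player data, the quadratic disturbance-bound functions d_jM, and all numerical values below are hypothetical assumptions:

```python
import numpy as np

# Hypothetical two-player data (n = 2 states, scalar inputs, so m_j = 1).
Q1 = 1.5 * np.eye(2)
R1 = {1: 2.0, 2: 2.0}                          # R_11, R_12 (scalars here)
d_M = {1: lambda x: 0.5 * np.linalg.norm(x),   # assumed bound d_1M(x)
       2: lambda x: 0.5 * np.linalg.norm(x)}   # assumed bound d_2M(x)

def delta_1M_sq(x):
    # delta_iM^2 = (1/2) sum_j d_jM(x)^2 * lambda_max(R_ij); lambda_max of a scalar is itself
    return 0.5 * sum(d_M[j](x) ** 2 * R1[j] for j in (1, 2))

def modified_utility_1(x, u):
    # integrand of (3) for player 1: delta_1M^2 + x^T Q_1 x + sum_j u_j^T R_1j u_j
    return delta_1M_sq(x) + x @ Q1 @ x + sum(R1[j] * u[j] ** 2 for j in (1, 2))

x = np.array([0.5, -0.2])
u = {1: 0.1, 2: -0.3}
print(modified_utility_1(x, u))
```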


Remark 1. The individual performance function (3) involves all of the players or controllers. The selection of the function δ_{iM}(x) is important for the robust controller design that follows.

Definition 1. [41] A control policy tuple v = (v_1, ..., v_N) is defined as admissible with respect to the cost function (3) on the set Ω, denoted by v_i ∈ ψ(Ω), i ∈ I, if v_i is continuous on Ω, v_i(0) = 0, v stabilizes system (2) locally, and the value of (3) is finite for all x(0) ∈ Ω.

The value function V_i(x(t)) associated with admissible control policies for the ith player is defined as

V_i(x(t)) = \int_t^{\infty} \big[ \delta_{iM}^2 + r_i(x(\tau), v_1(x(\tau)), \ldots, v_N(x(\tau))) \big] d\tau, \quad i \in I.    (4)

In the following, the robust control problem for uncertain multi-player systems is transformed into a nonzero-sum game. Therefore, it is desirable to seek the optimal admissible control pair (v_1^*, ..., v_i^*, ..., v_N^*) such that the performance indices (3) are minimized.

Definition 2. [47] An N-tuple of control policies (v_1^*, v_2^*, ..., v_N^*) is referred to as a global Nash equilibrium solution of an N-player game if for all i ∈ I the following inequalities are satisfied:

J_i^* \triangleq J_i(v_1^*, v_2^*, \ldots, v_i^*, \ldots, v_N^*) \le J_i(v_1^*, v_2^*, \ldots, v_i, \ldots, v_N^*).

The N-tuple of performance values (J_1^*, J_2^*, ..., J_N^*) is called a Nash equilibrium outcome of the N-player nonzero-sum game.

For any admissible controls v_i, the differential form of (4) is given as

0 = \delta_{iM}^2 + r_i(x, v_1, \ldots, v_N) + (\nabla V_i)^T \sum_{j=1}^{N} g_j(x) v_j, \quad i \in I,    (5)

with ∇V_i = ∂V_i/∂x. Moreover, this equation requires the boundary condition V_i(0) = 0. Define the Hamiltonian functions as

H_i(x, \nabla V_i, v_1, \ldots, v_N) = \delta_{iM}^2 + r_i(x, v_1, \ldots, v_N) + (\nabla V_i)^T \sum_{j=1}^{N} g_j(x) v_j, \quad i \in I.    (6)

According to optimal control theory, the minimized value functions V_i^*(x) are defined as

V_i^*(x(t)) = \min_{v_i \in \psi(\Omega)} \int_t^{\infty} \big[ \delta_{iM}^2 + r_i(x(\tau), v_1(x(\tau)), \ldots, v_N(x(\tau))) \big] d\tau, \quad i \in I,    (7)

which satisfies

0 = \min_{v_i \in \psi(\Omega)} H_i(x, \nabla V_i^*, v_1, \ldots, v_N), \quad i \in I,    (8)

where ∇V_i^* = ∂V_i^*/∂x, with the condition V_i^*(0) = 0. Further, the optimal control policy for the ith player is obtained by differentiating (8) with respect to v_i, which yields

v_i^* = -\frac{1}{2} R_{ii}^{-1} g_i^T(x) \nabla V_i^*, \quad i \in I.    (9)

Define D_{ii} = g_i(x) R_{ii}^{-1} g_i^T(x) and D_{ij} = g_j(x) R_{jj}^{-1} R_{ij} R_{jj}^{-1} g_j^T(x). Substituting v_i^* and v_j^* into (5), the coupled HJ equations become

x^T Q_i x + \delta_{iM}^2 + (\nabla V_i^*)^T f(x) - \frac{1}{4} (\nabla V_i^*)^T D_{ii} \nabla V_i^* + \frac{1}{4} \sum_{j=1, j \neq i}^{N} (\nabla V_j^*)^T D_{ij} \nabla V_j^* - \frac{1}{2} \sum_{j=1, j \neq i}^{N} (\nabla V_i^*)^T D_{jj} \nabla V_j^* = 0, \quad i \in I.    (10)

Remark 2. Note that the coupled HJ equations of the nonzero-sum game system (2) can be solved by obtaining the optimal control policies (9) under a traditional performance function for each player. However, it is inevitable that the multiple players or controllers are disturbed in real applications. Therefore, the design of control policies that can reject uncertain disturbances is of great significance. The robust control problem of nonlinear multi-player systems with input disturbances has not been investigated yet. In the following section, we develop a novel adaptive robust control scheme based on the optimal control method to address this issue.

3. Robust control design for uncertain multi-player nonlinear systems

In this section, in light of the Nash solution obtained in the previous section, a pair of feedback control policies will be developed to guarantee that the closed-loop system (1) achieves UUB stability and rejects the disturbances. Consider the following modified control policies obtained by multiplying the related positive scalars μ_i with (9):

\bar{u}_i = \mu_i v_i^* = -\frac{1}{2} \mu_i R_{ii}^{-1} g_i^T(x) \nabla V_i^*, \quad i \in I.    (11)

Here, μ_i is called the coupling gain related to the optimal control policy v_i^*. Obviously, the range of the tuple (μ_1, μ_2, ..., μ_N) indicates the gain margin of the Nash solution (v_1^*, v_2^*, ..., v_N^*). The robustness of the developed control policy pair (ū_1, ū_2, ..., ū_N) is stated in the following Theorem 1.

Theorem 1. For the nominal system (2) with cost functions (3), assume that the coupled HJ Eq. (8) has solutions V_i^*, i ∈ I. There exist coupling gains μ_i ∈ [3/2, +∞) for all i ∈ I such that the control policy pair (ū_1, ū_2, ..., ū_N) given by (11) ensures that the uncertain system states (1) are UUB, by setting \delta_{iM}^2 = \frac{1}{2}\sum_{j=1}^{N} d_{jM}^2 \lambda_M(R_{ij}), i ∈ I.

Proof: Select V(x) = \sum_{i=1}^{N} V_i^*(x) as the Lyapunov candidate. Its derivative is \dot{V}(x) = \sum_{i=1}^{N} \dot{V}_i^*(x). The derivative of V_i^* along the nonlinear system (1) is

\dot{V}_i^*(x) = (\nabla V_i^*)^T \Big( f(x) + \sum_{j=1}^{N} g_j(x)(\bar{u}_j + d_j) \Big)
             = -x^T Q_i x - \delta_{iM}^2 - \frac{1}{2}\Big(\mu_i - \frac{1}{2}\Big)(\nabla V_i^*)^T D_{ii} \nabla V_i^* - \frac{1}{2}\sum_{j=1, j \neq i}^{N} (\mu_j - 1)(\nabla V_i^*)^T D_{jj} \nabla V_j^* - \frac{1}{4}\sum_{j=1, j \neq i}^{N} (\nabla V_j^*)^T D_{ij} \nabla V_j^* + (\nabla V_i^*)^T \sum_{j=1}^{N} g_j(x) d_j.    (12)

Let λ_m(·) and λ_M(·) denote the minimum and maximum eigenvalue, respectively. Since (\nabla V_i^*)^T D_{jj} \nabla V_j^* = (\nabla V_i^*)^T g_j R_{ij}^{-1/2} R_{ij}^{1/2} R_{jj}^{-1} g_j^T \nabla V_j^*, completing the square allows (12) to be rewritten as

\dot{V}_i^*(x) = -x^T Q_i x - \delta_{iM}^2 - \frac{1}{2}\Big(\mu_i - \frac{1}{2}\Big)(\nabla V_i^*)^T D_{ii} \nabla V_i^*
  - \frac{1}{4}\sum_{j=1, j \neq i}^{N} \big[(\mu_j - 1) R_{ij}^{-1/2} g_j^T(x) \nabla V_i^* + R_{ij}^{1/2} R_{jj}^{-1} g_j^T(x) \nabla V_j^*\big]^T \big[(\mu_j - 1) R_{ij}^{-1/2} g_j^T(x) \nabla V_i^* + R_{ij}^{1/2} R_{jj}^{-1} g_j^T(x) \nabla V_j^*\big]
  + \sum_{j=1, j \neq i}^{N} \frac{(\mu_j - 1)^2}{4} (\nabla V_i^*)^T g_j R_{ij}^{-1} g_j^T(x) \nabla V_i^* + (\nabla V_i^*)^T \sum_{j=1}^{N} g_j(x) d_j
  \le -\lambda_m(Q_i)\|x\|^2 - \frac{1}{2}\Big(\mu_i - \frac{1}{2}\Big)(\nabla V_i^*)^T D_{ii} \nabla V_i^* + \sum_{j=1, j \neq i}^{N} \frac{(\mu_j - 1)^2}{4} (\nabla V_i^*)^T g_j(x) R_{ij}^{-1} g_j^T(x) \nabla V_i^* - \delta_{iM}^2 + (\nabla V_i^*)^T \sum_{j=1}^{N} g_j(x) d_j.

Applying Young's inequality to the last term on the right-hand side of the above inequality, we can obtain

\dot{V}_i^*(x) \le -\lambda_m(Q_i)\|x\|^2 - \delta_{iM}^2 + \frac{1}{2}\sum_{j=1}^{N} d_j^T R_{ij} d_j - \frac{1}{2}\Big(\mu_i - \frac{3}{2}\Big)(\nabla V_i^*)^T g_i(x) R_{ii}^{-1} g_i^T(x) \nabla V_i^* + \sum_{j=1, j \neq i}^{N} \frac{(\mu_j - 1)^2 + 2}{4} (\nabla V_i^*)^T g_j(x) R_{ij}^{-1} g_j^T(x) \nabla V_i^*.    (13)

Because (v_1, v_2, ..., v_N) is an admissible control pair for the nominal system (2) with the value functions (3) associated with each player, by the definition of admissible control [43], one can conclude that V_i^*(x) is finite for arbitrary x ∈ Ω. Moreover, according to the functional analysis in [48], ∇V_i^*(x) can be bounded by some V_{idM}^* > 0, i.e., ‖∇V_i^*(x)‖ ≤ V_{idM}^*, i ∈ I. Combining the bounds of ‖g_j‖ in Assumption 1 and the bounds of ‖d_j‖ in Assumption 2, and substituting \delta_{iM}^2 = \frac{1}{2}\sum_{j=1}^{N} d_{jM}^2 \lambda_M(R_{ij}) into (13), we obtain that \dot{V}_i^*(x) is upper bounded by

\dot{V}_i^*(x) \le -\lambda_m(Q_i)\|x\|^2 - \frac{1}{2}\Big(\mu_i - \frac{3}{2}\Big)(\nabla V_i^*)^T g_i(x) R_{ii}^{-1} g_i^T(x) \nabla V_i^* + \sum_{j=1, j \neq i}^{N} \frac{(\mu_j - 1)^2 + 2}{4} \lambda_M(R_{ij}^{-1}) V_{idM}^{*2} b_{g_j}^2.    (14)

Accordingly, provided that the coupling gains μ_i ∈ [3/2, +∞), for each value function V_i^*(x) its derivative \dot{V}_i^*(x) < 0 can be established as long as the state x(t) lies outside of the compact set

\Omega_i = \{x : \|x\| \le \omega_i\},    (15)

where \omega_i \triangleq \sqrt{ \sum_{j=1, j \neq i}^{N} ((\mu_j - 1)^2 + 2) \lambda_M(R_{ij}^{-1}) V_{idM}^{*2} b_{g_j}^2 \, / \, (4 \lambda_m(Q_i)) }.

Denote Ω_x = {x : ‖x‖ ≤ max{ω_1, ω_2, ..., ω_N}}. Consequently, \dot{V}(x) < 0 holds whenever x is outside of the compact set Ω_x, which implies that the robust control pair (ū_1, ū_2, ..., ū_N) guarantees the trajectory of system (1) to be UUB.

In the next section, an online ADP algorithm will be introduced to obtain the Nash equilibrium.

4. Approximate solution for the multi-player nonzero-sum game

Using Theorem 1, the robust control pair can be obtained indirectly by solving the N coupled HJ equations of the N-player nonzero-sum game (2). However, it is still difficult to get analytical solutions owing to the nonlinear nature of the coupled HJ equations. Next, a PI algorithm is introduced which can be applied to solve the coupled HJ equations.

4.1. Online PI algorithm

The PI algorithm includes two iteration steps: policy evaluation based on (5) and policy improvement based on (9), which is described as follows:

Algorithm 1. Start with an N-tuple of initial admissible policies (v_1^0(x), ..., v_N^0(x)). Set V_i^0(·) = 0, and let l = 0.

Step 1 (Policy Evaluation): With the N-tuple of policies (v_1^l, ..., v_N^l), solve for the N-tuple of values V_i^{l+1} using

0 = \delta_{iM}^2 + r_i(x, v_1^l, \ldots, v_N^l) + (\nabla V_i^{l+1})^T \Big( f(x) + \sum_{j=1}^{N} g_j(x) v_j^l \Big), \quad V_i^{l+1}(0) = 0, \quad i \in I.    (16)

Step 2 (Policy Improvement): Update the N-tuple of control policies using (9),

v_i^{l+1} = -\frac{1}{2} R_{ii}^{-1} g_i^T(x) \nabla V_i^{l+1}.    (17)

If max{‖V_1^{l+1}(x) - V_1^l(x)‖, ..., ‖V_N^{l+1}(x) - V_N^l(x)‖} ≤ ε and l ≤ l_M, stop and obtain the set of approximate optimal control policies v_i^{l+1}; else, set l = l + 1 and go to Step 1. The iteration continues until the v_i converge to v_i^*.
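To make the iteration pattern of Algorithm 1 concrete, the following is a hedged Python sketch for a scalar linear-quadratic nonzero-sum game of the same form as Example 1 (all numerical data are illustrative assumptions). For quadratic value functions V_i = p_i x^2, the policy-evaluation step (16) reduces to a scalar linear equation in p_i and the improvement step (17) becomes a gain update; the general nonlinear case instead requires a function approximator, as developed in Section 4.2.

```python
# Scalar two-player LQ game: x_dot = a*x + b1*u1 + b2*u2,
# integrand_i = delta_iM^2 + q_i x^2 + r_i1 u1^2 + r_i2 u2^2, delta_iM^2 = c_i x^2.
a = -0.75
b = {1: 1.0, 2: 2.0}
q = {1: 1.5, 2: 0.75}
r = {(1, 1): 2.0, (1, 2): 2.0, (2, 1): 1.0, (2, 2): 1.0}
c = {1: 0.5, 2: 0.25}                    # quadratic disturbance-bound coefficients

k = {1: 1.0, 2: 1.0}                     # initial admissible gains: u_i = -k_i * x
for _ in range(100):
    a_cl = a - b[1] * k[1] - b[2] * k[2]         # closed-loop drift under current policies
    assert a_cl < 0, "current policies must be stabilizing (admissible)"
    # Policy evaluation (16): 0 = c_i + q_i + sum_j r_ij k_j^2 + 2 p_i a_cl  =>  solve for p_i
    p = {i: (c[i] + q[i] + r[(i, 1)] * k[1] ** 2 + r[(i, 2)] * k[2] ** 2) / (-2.0 * a_cl)
         for i in (1, 2)}
    # Policy improvement (17): u_i = -(1/2) R_ii^{-1} b_i * dV_i/dx = -(b_i p_i / r_ii) x
    k_new = {i: b[i] * p[i] / r[(i, i)] for i in (1, 2)}
    if max(abs(k_new[i] - k[i]) for i in (1, 2)) < 1e-12:
        break
    k = k_new

print(p, k)   # converged quadratic value coefficients p_i and feedback gains k_i
```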

Next, based on the remarkable work of [28,37], a brief description is given to show the convergence of the online PI algorithm for the nonlinear multi-player nonzero-sum game under the modified cost functions. Consider a Banach space \mathcal{V} ⊂ {V(x) | V(x) : Ω → R, V(0) = 0} with a norm ‖·‖_{\mathcal{V}}. Define a mapping G_i : \mathcal{V} × ··· × \mathcal{V} → \mathcal{V} as follows:

G_i = x^T Q_i x + \delta_{iM}^2 + (\nabla V_i)^T f(x) - \frac{1}{4} (\nabla V_i)^T D_{ii} \nabla V_i + \frac{1}{4} \sum_{j=1, j \neq i}^{N} (\nabla V_j)^T D_{ij} \nabla V_j - \frac{1}{2} \sum_{j=1, j \neq i}^{N} (\nabla V_i)^T D_{jj} \nabla V_j, \quad i \in I.    (18)

A mapping T_i : \mathcal{V} → \mathcal{V} is defined as

T_i V_i = V_i - (G'_{iV_i})^{-1} G_i, \quad i \in I,    (19)

where G'_{iV_i} denotes the Fréchet derivative of G_i with respect to V_i. However, it is difficult to compute the Fréchet derivative. In view of the definition of the Gâteaux derivative in [37], the equivalence between the Gâteaux derivative and the Fréchet derivative under some specified conditions introduced in a lemma in [48], as well as the results in Lemma 2 given by Liu et al. [28], we give the following lemma to prove that Algorithm 1 is mathematically equivalent to a quasi-Newton iteration in the Banach space \mathcal{V}.

Lemma 1. Let T_i be the mapping defined in (19). Then, the iteration between (16) and (17) is equivalent to the following quasi-Newton iteration:

V_i^{l+1} = T_i V_i^l = V_i^l - (G'_{iV_i^l})^{-1} G_i, \quad l = 0, 1, 2, \ldots    (20)

Proof: The details of the proof are similar to reference [28] and are thus omitted.

With the results in Lemma 1, the convergence of Algorithm 1 for the multi-player nonzero-sum game of continuous-time nonlinear systems is obtained; thus the value function V_i^{l+1} will converge to the optimal value function V_i^* as l → ∞, ∀i ∈ I. In the next subsection, we use Algorithm 1 to motivate the control structure of an online synchronous approximate optimal learning algorithm based on a single critic NN for each player, from which the robust control policies can be derived indirectly.

4.2. Online single NN-based ADP algorithm

Assume the solutions of the N coupled HJ Eq. (10) are smooth functions.


According to the Weierstrass high-order approximation theorem, the value functions V_i^*(x) can be approximated on a compact set Ω by feedforward NNs as

V_i^*(x) = W_i^T \sigma_i(x) + \varepsilon_i(x), \quad \forall i \in I,    (21)

where W_i ∈ R^{L_i} is the ideal weight vector, σ_i(x) ∈ R^{L_i} is the activation function of the critic NN, L_i is the number of neurons in the hidden layer, and ε_i(x) is the approximation error. The partial derivative of V_i^*(x) with respect to x is

\nabla V_i^*(x) = \nabla \sigma_i^T(x) W_i + \nabla \varepsilon_i(x), \quad i \in I,    (22)

where the partial derivatives of σ_i(x) and ε_i(x) are ∇σ_i^T(x) = ∂σ_i^T(x)/∂x and ∇ε_i(x) = ∂ε_i(x)/∂x, respectively. In the following, σ_i, ε_i, ∇σ_i and ∇ε_i are used to represent σ_i(x), ε_i(x), ∇σ_i(x) and ∇ε_i(x) to simplify the expressions. The optimal control policies are derived as

v_i^* = -\frac{1}{2} R_{ii}^{-1} g_i^T(x) (\nabla \sigma_i^T W_i + \nabla \varepsilon_i), \quad i \in I.    (23)

In order to analyze the effect of the NN approximator, we show the approach for solving the 2-player nonzero-sum game for the nonlinear system described by

\dot{x} = f(x) + g_1(x) u_1 + g_2(x) u_2.    (24)

In addition, this approach can easily be extended to more than two players. Since the object of this section is to find the solutions of the coupled HJ equations using the NN approximators introduced above, it is necessary to show the effect of the approximation error on the coupled HJ Eq. (10). Substituting v_1^* and v_2^* into (10), the coupled HJ equations become

x^T Q_1 x + \delta_{1M}^2 + W_1^T \nabla\sigma_1 f(x) - \frac{1}{4} W_1^T \nabla\sigma_1 D_{11} \nabla\sigma_1^T W_1 + \frac{1}{4} W_2^T \nabla\sigma_2 D_{12} \nabla\sigma_2^T W_2 - \frac{1}{2} W_1^T \nabla\sigma_1 D_{22} \nabla\sigma_2^T W_2 = \varepsilon_{HJ1},    (25)

and

x^T Q_2 x + \delta_{2M}^2 + W_2^T \nabla\sigma_2(x) f(x) - \frac{1}{4} W_2^T \nabla\sigma_2 D_{22} \nabla\sigma_2^T W_2 + \frac{1}{4} W_1^T \nabla\sigma_1 D_{21} \nabla\sigma_1^T W_1 - \frac{1}{2} W_2^T \nabla\sigma_2 D_{11} \nabla\sigma_1^T W_1 = \varepsilon_{HJ2},    (26)

where δ_{1M} and δ_{2M} are defined as in (3), and ε_{HJ1} and ε_{HJ2} are the residual errors produced by the NN approximation for player 1 and player 2, respectively. As the number of neurons in the hidden layer L_i increases, these errors converge uniformly to zero, i.e., sup ‖ε_{HJi}‖ → 0 as L_i → ∞, i = 1, 2. Besides, for fixed L_i, the HJ approximation errors are bounded by constants such that ‖ε_{HJ1}‖ ≤ ε̄_1 and ‖ε_{HJ2}‖ ≤ ε̄_2.

Assume Ŵ_1 and Ŵ_2 are the estimates of the current NN weights. The critic NN for the ith player can be constructed as

\hat{V}_i(x) = \hat{W}_i^T \sigma_i(x), \quad i = 1, 2.    (27)

The corresponding partial derivative of the approximate value function is

\nabla \hat{V}_i = \nabla \sigma_i^T \hat{W}_i, \quad i = 1, 2,    (28)

and the approximated optimal control policy is

\hat{v}_i = -\frac{1}{2} R_{ii}^{-1} g_i^T(x) \nabla \sigma_i^T \hat{W}_i, \quad i = 1, 2.    (29)

Thus, the adaptive robust control policies can be obtained as follows:

\hat{\bar{u}}_i = \mu_i \hat{v}_i = -\frac{1}{2} \mu_i R_{ii}^{-1} g_i^T(x) \nabla \sigma_i^T \hat{W}_i, \quad i = 1, 2.    (30)

Define the weight estimation errors as W̃_1 = W_1 - Ŵ_1 and W̃_2 = W_2 - Ŵ_2. Eqs. (27) and (29) imply that only the critic NN is required, since the critic and actor NNs share the same weight vector Ŵ_i. Not utilizing an actor NN as in [11] leads to a simpler scheme and less computational cost, which is extremely important for multi-player nonlinear systems.
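A minimal numerical sketch of the single-critic approximation (27)-(29), assuming a quadratic activation vector and hypothetical weight estimates (not the paper's learned values), might look like:

```python
import numpy as np

def sigma(x):                               # activation vector sigma(x) = [x1^2, x1*x2, x2^2]^T
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def grad_sigma(x):                          # (3 x 2) Jacobian: row k is d(sigma_k)/dx
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2 * x[1]]])

def g1(x):                                  # assumed input matrix of player 1 (Example-2 style)
    return np.array([[0.0], [np.cos(2 * x[0]) + 2.0]])

R11 = np.array([[2.0]])

def V_hat(x, W):                            # (27): critic estimate of the value function
    return W @ sigma(x)

def v_hat1(x, W):                           # (29): approximate optimal control of player 1
    grad_V = grad_sigma(x).T @ W            # (28): gradient of the approximate value
    return -0.5 * np.linalg.inv(R11) @ g1(x).T @ grad_V

W1_hat = np.array([0.5, 0.0, 1.0])          # hypothetical weight estimate
x = np.array([-1.0, -0.5])
print(V_hat(x, W1_hat), v_hat1(x, W1_hat))
```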

The approximated Hamiltonian functions become

\hat{H}_1(x, \hat{W}_1, \hat{W}_2) = x^T Q_1 x + \delta_{1M}^2 + \hat{W}_1^T \nabla\sigma_1 f(x) - \frac{1}{4} \hat{W}_1^T \nabla\sigma_1 D_{11} \nabla\sigma_1^T \hat{W}_1 + \frac{1}{4} \hat{W}_2^T \nabla\sigma_2 D_{12} \nabla\sigma_2^T \hat{W}_2 - \frac{1}{2} \hat{W}_1^T \nabla\sigma_1 D_{22} \nabla\sigma_2^T \hat{W}_2 \triangleq \varepsilon_{b1},    (31)

and

\hat{H}_2(x, \hat{W}_1, \hat{W}_2) = x^T Q_2 x + \delta_{2M}^2 + \hat{W}_2^T \nabla\sigma_2 f(x) - \frac{1}{4} \hat{W}_2^T \nabla\sigma_2 D_{22} \nabla\sigma_2^T \hat{W}_2 + \frac{1}{4} \hat{W}_1^T \nabla\sigma_1 D_{21} \nabla\sigma_1^T \hat{W}_1 - \frac{1}{2} \hat{W}_2^T \nabla\sigma_2 D_{11} \nabla\sigma_1^T \hat{W}_1 \triangleq \varepsilon_{b2}.    (32)

It is desired to tune the critic NN weights Ŵ_1 and Ŵ_2 to minimize the squared residual error E = \frac{1}{2}(\varepsilon_{b1}^2 + \varepsilon_{b2}^2) using the gradient descent algorithm, while also guaranteeing the stability of the closed-loop system.
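For a scalar system of the Example-1 type, the residual ε_b1 in (31) can be evaluated directly from a weight estimate, as in this hedged sketch (the dynamics, weights, and basis below are assumptions used only for illustration):

```python
import numpy as np

# Scalar data in the style of Example 1 (assumed for illustration).
f = lambda x: -0.75 * x
g1v, g2v = 1.0, 2.0
Q1, R11, R12, R22 = 1.5, 2.0, 2.0, 1.0
delta1M_sq = lambda x: 0.5 * x ** 2
dsigma = lambda x: np.array([2 * x, 4 * x ** 3, 6 * x ** 5])   # gradient of [x^2, x^4, x^6]

D11 = g1v * (1.0 / R11) * g1v
D22 = g2v * (1.0 / R22) * g2v
D12 = g2v * (1.0 / R22) * R12 * (1.0 / R22) * g2v

def eps_b1(x, W1, W2):
    """Approximate Hamiltonian residual (31) for player 1 (scalar state)."""
    s = dsigma(x)                            # both players use the same basis here
    return (Q1 * x ** 2 + delta1M_sq(x) + (W1 @ s) * f(x)
            - 0.25 * (W1 @ s) * D11 * (s @ W1)
            + 0.25 * (W2 @ s) * D12 * (s @ W2)
            - 0.5 * (W1 @ s) * D22 * (s @ W2))

print(eps_b1(0.8, np.ones(3), np.ones(3)))
```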

4.3. Stability analysis

Before proceeding, the following standard assumptions are necessary for the stability analysis.

Assumption 3. The following critic NN parameters are assumed to be bounded:
1. ‖W_i‖ ≤ W_{iM};
2. ‖σ_i‖ ≤ σ_{iM}, ‖∇σ_i‖ ≤ σ_{idM};
3. ‖ε_i‖ ≤ ε_{iM}, ‖∇ε_i‖ ≤ ε_{idM}.

Theorem 2. Consider the nonzero-sum game for the nominal system (24) with two players. The critic NN is given by (27) for each player, and the associated controller is given by (29). An initial admissible control pair is generated by choosing proper initial weights of the critic NNs. Let the NN weight updating laws be given by

\dot{\hat{W}}_1 = -\alpha_1 \frac{\bar{\theta}_1}{m_1} \varepsilon_{b1} + \frac{\alpha_1}{4} \nabla\sigma_1 D_{11} \nabla\sigma_1^T \hat{W}_1 \frac{\bar{\theta}_1^T \hat{W}_1}{m_1} + \frac{\alpha_1}{4} \nabla\sigma_1 D_{21} \nabla\sigma_1^T \hat{W}_1 \frac{\bar{\theta}_2^T \hat{W}_2}{m_2} - \alpha_1 (F_1 \hat{W}_1 - S_1 \bar{\theta}_1^T \hat{W}_1),    (33)

and

\dot{\hat{W}}_2 = -\alpha_2 \frac{\bar{\theta}_2}{m_2} \varepsilon_{b2} + \frac{\alpha_2}{4} \nabla\sigma_2 D_{22} \nabla\sigma_2^T \hat{W}_2 \frac{\bar{\theta}_2^T \hat{W}_2}{m_2} + \frac{\alpha_2}{4} \nabla\sigma_2 D_{12} \nabla\sigma_2^T \hat{W}_2 \frac{\bar{\theta}_1^T \hat{W}_1}{m_1} - \alpha_2 (F_2 \hat{W}_2 - S_2 \bar{\theta}_2^T \hat{W}_2),    (34)

where \bar{\theta}_i = \theta_i / (\theta_i^T \theta_i + 1), \theta_i = \nabla\sigma_i \big( f(x) - D_{11}(x)\nabla\sigma_1^T \hat{W}_1 / 2 - D_{22}(x)\nabla\sigma_2^T \hat{W}_2 / 2 \big), and m_i = \bar{\theta}_i^T \bar{\theta}_i + 1; α_i > 0 are the tuning gains of the critic NNs, and F_i > 0 and S_i > 0 are tuning parameters, i = 1, 2. Then the closed-loop system state and the weight estimation errors W̃_1 and W̃_2 are guaranteed to be UUB.

1 T −1 1 T −1 ˜ α W ˜1 + W ˜ α W ˜2 W 2 1 1 2 2 2

(35)

where Vi∗ is the optimal value function for the ith player that is a positive definite and smooth solution of the coupled HJ Eq. (10).


By taking the time derivative of the first term along the state trajectories under the control pair (\hat{v}_1, \hat{v}_2) in (29), we get

\dot{V}_1^* = W_1^T \nabla\sigma_1 \Big( f - \frac{1}{2} D_{11} \nabla\sigma_1^T \hat{W}_1 - \frac{1}{2} D_{22} \nabla\sigma_2^T \hat{W}_2 \Big) + \dot{\varepsilon}_1 = W_1^T \eta_1 + \frac{1}{2} W_1^T \nabla\sigma_1 D_{11} \nabla\sigma_1^T \tilde{W}_1 + \frac{1}{2} W_1^T \nabla\sigma_1 D_{22} \nabla\sigma_2^T \tilde{W}_2 + \dot{\varepsilon}_1.    (36)

Similarly, for the second term,

\dot{V}_2^*(x) = W_2^T \eta_2 + \frac{1}{2} W_2^T \nabla\sigma_2 D_{22} \nabla\sigma_2^T \tilde{W}_2 + \frac{1}{2} W_2^T \nabla\sigma_2 D_{11} \nabla\sigma_1^T \tilde{W}_1 + \dot{\varepsilon}_2,    (37)

where

\eta_i = \nabla\sigma_i \Big( f - \frac{1}{2} D_{11} \nabla\sigma_1^T W_1 - \frac{1}{2} D_{22} \nabla\sigma_2^T W_2 \Big), \quad i = 1, 2,    (38)

\dot{\varepsilon}_i = \nabla\varepsilon_i^T \Big( f - \frac{1}{2} D_{11} \nabla\sigma_1^T(x) \hat{W}_1 - \frac{1}{2} D_{22} \nabla\sigma_2^T \hat{W}_2 \Big), \quad i = 1, 2.    (39)

According to [23], we can assume there exists a quadratic bound for d_j(x), i.e., d_{jM}(x) = \rho_j \|x\| with ρ_j > 0, ∀j ∈ I. Then \delta_{iM}^2 = \frac{1}{2}\sum_{j=1}^{N} \rho_j^2 \|x\|^2 \lambda_M(R_{ij}). From (25) and (26), we can obtain

W_1^T \eta_1 = -x^T Q_1 x - \frac{1}{2}\big(\rho_1^2 \lambda_M(R_{11}) + \rho_2^2 \lambda_M(R_{12})\big) \|x\|^2 + \varepsilon_{HJ1} - \frac{1}{4} W_1^T \nabla\sigma_1 D_{11} \nabla\sigma_1^T W_1 - \frac{1}{4} W_2^T \nabla\sigma_2 D_{12} \nabla\sigma_2^T W_2,    (40)

and

W_2^T \eta_2 = -x^T Q_2 x - \frac{1}{2}\big(\rho_1^2 \lambda_M(R_{21}) + \rho_2^2 \lambda_M(R_{22})\big) \|x\|^2 + \varepsilon_{HJ2} - \frac{1}{4} W_2^T \nabla\sigma_2 D_{22} \nabla\sigma_2^T W_2 - \frac{1}{4} W_1^T \nabla\sigma_1 D_{21} \nabla\sigma_1^T W_1.    (41)

Noting that \dot{\tilde{W}}_1 = -\dot{\hat{W}}_1 and \dot{\tilde{W}}_2 = -\dot{\hat{W}}_2, and using (31) and (32), we can obtain the time derivative of the last two terms in (35) as

\tilde{W}_1^T \alpha_1^{-1} \dot{\tilde{W}}_1 + \tilde{W}_2^T \alpha_2^{-1} \dot{\tilde{W}}_2
 = \tilde{W}_1^T \bar{\theta}_1 \frac{\varepsilon_{HJ1}}{m_1} + \tilde{W}_1^T \frac{\bar{\theta}_1}{m_1} \Big( \frac{1}{4} \tilde{W}_1^T \nabla\sigma_1(x) D_{11}(x) \nabla\sigma_1^T(x) \tilde{W}_1 + \frac{1}{4} \tilde{W}_2^T \nabla\sigma_2 D_{12}(x) \nabla\sigma_2^T \tilde{W}_2 + \frac{1}{2} W_1^T \nabla\sigma_1 D_{22}(x) \nabla\sigma_2^T \tilde{W}_2 - \frac{1}{2} \tilde{W}_2^T \nabla\sigma_2 D_{12}(x) \nabla\sigma_2^T W_2 \Big)
 - \frac{1}{4} \tilde{W}_1^T \nabla\sigma_1 D_{11}(x) \nabla\sigma_1^T (W_1 - \tilde{W}_1) \frac{\bar{\theta}_1^T}{m_1} (W_1 - \tilde{W}_1) - \frac{1}{4} \tilde{W}_1^T \nabla\sigma_1 D_{21}(x) \nabla\sigma_1^T (W_1 - \tilde{W}_1) \frac{\bar{\theta}_2^T}{m_2} (W_2 - \tilde{W}_2)
 + \tilde{W}_2^T \bar{\theta}_2 \frac{\varepsilon_{HJ2}}{m_2} + \tilde{W}_2^T \frac{\bar{\theta}_2}{m_2} \Big( \frac{1}{4} \tilde{W}_2^T \nabla\sigma_2 D_{22}(x) \nabla\sigma_2^T \tilde{W}_2 + \frac{1}{4} \tilde{W}_1^T \nabla\sigma_1 D_{21}(x) \nabla\sigma_1^T \tilde{W}_1 + \frac{1}{2} W_2^T \nabla\sigma_2 D_{11}(x) \nabla\sigma_1^T \tilde{W}_1 - \frac{1}{2} \tilde{W}_1^T \nabla\sigma_1 D_{21}(x) \nabla\sigma_1^T W_1 \Big)
 - \frac{1}{4} \tilde{W}_2^T \nabla\sigma_2 D_{22}(x) \nabla\sigma_2^T (W_2 - \tilde{W}_2) \frac{\bar{\theta}_2^T}{m_2} (W_2 - \tilde{W}_2) - \frac{1}{4} \tilde{W}_2^T \nabla\sigma_2 D_{12}(x) \nabla\sigma_2^T (W_2 - \tilde{W}_2) \frac{\bar{\theta}_1^T}{m_1} (W_1 - \tilde{W}_1)
 + \tilde{W}_1^T (F_1 - S_1 \bar{\theta}_1^T)(W_1 - \tilde{W}_1) + \tilde{W}_2^T (F_2 - S_2 \bar{\theta}_2^T)(W_2 - \tilde{W}_2).    (42)

In the following, some bounds are given to facilitate the analysis. Recalling Assumptions 1 and 3, it is straightforward to show that

\dot{\varepsilon}_i(x) \le b_{\varepsilon i} + \varepsilon_{idM} b_f \|x\| + \frac{1}{2} \varepsilon_{idM} b_{g_1}^2 \sigma_{1dM} \lambda_M(R_{11}) \|\tilde{W}_1\| + \frac{1}{2} \varepsilon_{idM} b_{g_2}^2 \sigma_{2dM} \lambda_M(R_{22}) \|\tilde{W}_2\|, \quad i = 1, 2,

where b_{\varepsilon i} = \frac{1}{2} \varepsilon_{idM} b_{g_1}^2 \sigma_{1dM} \lambda_M(R_{11}) W_{1M} + \frac{1}{2} \varepsilon_{idM} b_{g_2}^2 \sigma_{2dM} \lambda_M(R_{22}) W_{2M}, i = 1, 2.

Define z^T = [x^T, \tilde{W}_1^T \bar{\theta}_1, \tilde{W}_2^T \bar{\theta}_2, \tilde{W}_1^T, \tilde{W}_2^T]. Grouping terms together and arranging Eqs. (36)-(42), we can get

\dot{V}_L \le -z^T P z + z^T \varrho + c, \quad P = [P_{kl}] \in R^{5 \times 5},    (43)

where the components of the matrix P are given by

P_{11} = Q_1 + Q_2 + \frac{1}{2}\big(\rho_1^2 \lambda_M(R_{11}) + \rho_2^2 \lambda_M(R_{12}) + \rho_1^2 \lambda_M(R_{21}) + \rho_2^2 \lambda_M(R_{22})\big) I_n,
P_{22} = P_{33} = 1, \quad P_{12} = P_{21}^T = P_{13} = P_{31}^T = P_{14} = P_{41}^T = P_{15} = P_{51}^T = P_{23} = P_{32}^T = P_{45} = P_{54}^T = 0,
P_{42}^T = P_{24} = -\frac{1}{4 m_1} \nabla\sigma_1 D_{11} \nabla\sigma_1^T W_1 - \frac{1}{2} S_1,
P_{52}^T = P_{25} = -\frac{1}{4 m_1} \nabla\sigma_2 D_{22} \nabla\sigma_1^T W_1 - \frac{1}{8 m_1} \nabla\sigma_2 D_{12} \nabla\sigma_2^T W_2,
P_{43}^T = P_{34} = -\frac{1}{4 m_2} \nabla\sigma_1 D_{11} \nabla\sigma_2^T W_2 - \frac{1}{8 m_2} \nabla\sigma_1 D_{21} \nabla\sigma_1^T W_1,
P_{53}^T = P_{35} = -\frac{1}{4 m_2} \nabla\sigma_2(x) D_{22} \nabla\sigma_2^T W_2 - \frac{1}{2} S_2,
P_{44} = -\frac{1}{4 m_2} \nabla\sigma_1 D_{21} \nabla\sigma_1^T W_2 \bar{\theta}_2^T + F_1,
P_{55} = -\frac{1}{4 m_2} \nabla\sigma_2 D_{12} \nabla\sigma_2^T W_1 \bar{\theta}_1^T + F_2,

the vector \varrho = [\varrho_1, \varrho_2, \varrho_3, \varrho_4, \varrho_5]^T is given by

\varrho_1 = (\varepsilon_{1dM} + \varepsilon_{2dM}) b_f, \quad \varrho_2 = \frac{\varepsilon_{HJ1}}{m_1}, \quad \varrho_3 = \frac{\varepsilon_{HJ2}}{m_2},
\varrho_4 = -\frac{1}{4 m_1} \nabla\sigma_1 D_{11} \nabla\sigma_1^T W_1 \bar{\theta}_1^T W_1 - \frac{1}{4 m_2} \nabla\sigma_1 D_{21} \nabla\sigma_1^T W_1 \bar{\theta}_2^T W_2 + F_1 W_1 - S_1 \bar{\theta}_1^T W_1,
\varrho_5 = -\frac{1}{4 m_2} \nabla\sigma_2 D_{22} \nabla\sigma_2^T W_2 \bar{\theta}_2^T W_2 - \frac{1}{4 m_1} \nabla\sigma_2 D_{12} \nabla\sigma_2^T W_2 \bar{\theta}_1^T W_1 + F_2 W_2 - S_2 \bar{\theta}_2^T W_2,

and the last term is given by

c = -\frac{1}{4} W_1^T \nabla\sigma_1 D_{11} \nabla\sigma_1^T W_1 + b_{\varepsilon 1} + \bar{\varepsilon}_1 - \frac{1}{4} W_2^T \nabla\sigma_2 D_{12} \nabla\sigma_2^T W_2 + b_{\varepsilon 2} + \bar{\varepsilon}_2 - \frac{1}{4} W_2^T \nabla\sigma_2 D_{22} \nabla\sigma_2^T W_2 - \frac{1}{4} W_1^T \nabla\sigma_1 D_{21} \nabla\sigma_1^T W_1.

According to the above assumptions and the fact that ‖\bar{\theta}_1‖ < 1 and ‖\bar{\theta}_2‖ < 1, there exist \varrho_M ∈ R^+ and c_M ∈ R^+ such that ‖\varrho‖ ≤ \varrho_M and c ≤ c_M, respectively. Let the parameters F_1, F_2, S_1 and S_2 be chosen such that P > 0; then (43) becomes

\dot{V}_L \le -\|z\|^2 \lambda_m(P) + \|z\| \varrho_M + c_M = -\lambda_m(P) \Big( \|z\| - \frac{\varrho_M}{2\lambda_m(P)} \Big)^2 + \frac{\varrho_M^2 + 4 c_M \lambda_m(P)}{4\lambda_m(P)}.    (44)

Then the Lyapunov derivative is negative if

\|z\| \ge \frac{\varrho_M}{2\lambda_m(P)} + \sqrt{ \frac{c_M}{\lambda_m(P)} + \frac{\varrho_M^2}{4\lambda_m^2(P)} } \triangleq b_z.    (45)

Q. Qu, H. Zhang and C. Luo et al. / Neurocomputing 334 (2019) 1–10

Specifically, if ‖x‖ > b_z, or ‖\tilde{W}_1^T \bar{\theta}_1‖ > b_z, or ‖\tilde{W}_2^T \bar{\theta}_2‖ > b_z, or ‖\tilde{W}_1‖ > b_z, or ‖\tilde{W}_2‖ > b_z, then \dot{V}_L < 0. It is concluded that the system state and the critic NN estimation errors are UUB.

Theorem 3. Suppose the hypotheses of Theorem 2 hold; then the control pair (\hat{v}_1, \hat{v}_2) converges to a finite neighborhood of the Nash equilibrium.

Remark 3. Persistence of excitation (PE) is needed when training the critic NNs to approximate the value functions. Generally, proper PE can be obtained by adding probing noise to the control input and constantly testing.

Proof. According to Theorem 2, the NN weight errors \tilde{W}_1 and \tilde{W}_2 are both UUB. Recalling Assumption 3 and the bound of g_i(x), and using (23) and (29), we have

\|\hat{v}_i - v_i^*\| = \Big\| \frac{1}{2} R_{ii}^{-1} g_i^T(x) \nabla\sigma_i^T \tilde{W}_i - \frac{1}{2} R_{ii}^{-1} g_i^T(x) \nabla\varepsilon_i \Big\| \le \frac{1}{2} b_{g_i} \lambda_M(R_{ii}^{-1}) (\sigma_{idM} b_z + \varepsilon_{idM}) \triangleq b_{v_i}, \quad i = 1, 2.    (46)

Thus, ‖\hat{v}_i - v_i^*‖ is UUB for all i ∈ I. This completes the proof.

Corollary 1. According to Theorems 2 and 3, the robust control pair (ū_1, ū_2) can be derived to a finite neighborhood of the ideal value, specifically satisfying ‖\hat{\bar{u}}_i - \bar{u}_i‖ ≤ μ_i b_{v_i}, i = 1, 2.

Remark 4. Distinct from the results in [41,42], the emphasis of this paper is that the robust control problem of multi-player systems can be settled by approximately solving the coupled HJ equations of the nonzero-sum game. Under novel cost functions that include a term accounting for the bounds of the multiple input disturbances of each player, this problem is implemented for the first time based on an online single NN-based adaptive optimal learning algorithm. Evidently, this robust control scheme can in theory be extended to nonlinear systems with more than two players or controllers.

5. Simulation

In this section, two simulation examples are given to demonstrate the effectiveness of the proposed adaptive robust control approach.

Example 1. Consider the following two-player linear system with uncertain disturbances described by

\dot{x} = -\frac{3}{4} x + (\bar{u}_1 + d_1) + 2(\bar{u}_2 + d_2),    (47)

where d_1 = 0.5 p_1 x \sin^2(x)\cos(x), d_2 = 0.5 p_2 x \sin(x^3), and p_1 and p_2 are chosen randomly in [−1, 1]. Then d_{1M} = 0.5|x| and d_{2M} = 0.5|x|. The perturbation parameters are selected so that the designed value functions can be compared with the actual values in [28]. In the following, the robust control problem is transformed into the nominal nonzero-sum game [28] of the form

\dot{x} = -\frac{3}{4} x + u_1 + 2 u_2.    (48)

We first obtain the associated Nash solution by applying the single-network ADP algorithm. Choose Q_1 = 1.5, Q_2 = 0.75, R_{11} = R_{12} = 2, R_{21} = R_{22} = 1, F_1 = F_2 = 10 I_3, S_1 = S_2 = 10\mathbf{1} with \mathbf{1} ∈ R^3, and α_1 = α_2 = 1 for the simulation. Thus, we can obtain δ_{1M}^2 = 0.5 x^2 and δ_{2M}^2 = 0.25 x^2. In this case, the performance indices are defined as

J_1(x(0)) = \int_0^{\infty} \big[ 1.5 x^2 + \delta_{1M}^2 + 2 u_1^2 + 2 u_2^2 \big] dt,    (49)

J_2(x(0)) = \int_0^{\infty} \big[ 0.75 x^2 + \delta_{2M}^2 + u_1^2 + u_2^2 \big] dt,    (50)

with optimal value functions V_1^*(x) = \frac{1}{2} x^2 and V_2^*(x) = \frac{1}{4} x^2 for player 1 and player 2, respectively. The optimal weights are [0.5, 0, 0]^T and [0.25, 0, 0]^T. The critic NN activation functions are given as σ_1 = σ_2 = [x^2, x^4, x^6]^T with weights [W_{i1}, ..., W_{i3}]^T for constructing the approximate value functions \hat{V}_i(x) of the two players. In addition, both weight vectors are initialized at [1, 1, 1]^T to construct the initial admissible control pair, and the updating rates of the critic NNs are both 2. A small exploratory signal consisting of sinusoids of varying frequencies is added to the control input during the first 900 s to excite the system states and qualitatively ensure the PE condition. Fig. 1 presents the evolution of the system state.

Fig. 1. The state trajectory of nominal system (48) during the NN learning process.

Fig. 2. The convergence of critic NN weights for player 1.

Figs. 2 and 3 depict that the weight \hat{W}_1 converges to [0.5002, −0.0056, 0.0134]^T and the weight \hat{W}_2 converges to [0.2506, −0.0078, 0.0081]^T. Thus, the approximate weights converge to a small neighborhood of the optimal ones. Based on the converged weight vectors, we choose μ_1 = 1.6 and μ_2 = 1.65 to obtain the robust control pair (ū_1(x), ū_2(x)) as the stabilizing control law for the uncertain system (47). Then we apply the derived adaptive robust control pair to system (47). Fig. 4 displays that the system state of (47) is forced to be stable in the sense of UUB.
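As a reproducibility-oriented sketch (an assumption-laden illustration, not the authors' code), the converged weights reported above can be plugged into the robust policies (11) and the uncertain system (47) can be integrated forward; a simple Euler scheme and step size are used here as illustrative choices:

```python
import numpy as np

W1 = np.array([0.5002, -0.0056, 0.0134])   # converged critic weights reported in Example 1
W2 = np.array([0.2506, -0.0078, 0.0081])
mu1, mu2 = 1.6, 1.65
R11, R22 = 2.0, 1.0
g1v, g2v = 1.0, 2.0
dsigma = lambda x: np.array([2 * x, 4 * x ** 3, 6 * x ** 5])

def u_bar(x):
    dV1 = dsigma(x) @ W1                   # gradient of the approximate value functions
    dV2 = dsigma(x) @ W2
    u1 = -0.5 * mu1 / R11 * g1v * dV1      # robust policy (11), player 1
    u2 = -0.5 * mu2 / R22 * g2v * dV2      # robust policy (11), player 2
    return u1, u2

rng = np.random.default_rng(0)
p1, p2 = rng.uniform(-1, 1, 2)             # random disturbance parameters in [-1, 1]
x, dt = 0.8, 1e-3
for _ in range(int(20 / dt)):              # 20 s of simulated time
    u1, u2 = u_bar(x)
    d1 = 0.5 * p1 * x * np.sin(x) ** 2 * np.cos(x)
    d2 = 0.5 * p2 * x * np.sin(x ** 3)
    x += dt * (-0.75 * x + (u1 + d1) + 2.0 * (u2 + d2))   # uncertain system (47)
print(x)                                   # the state settles close to the origin (UUB)
```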


Fig. 3. The convergence of critic NN weights for player 2.

Fig. 4. The state trajectory of system (47) under the robust control policy pair.

Fig. 5. The state trajectory of nominal system (52) during the NN learning process.

Fig. 6. The convergence of critic NN weights for player 1.

Example 2. Consider the following nonlinear multi-player system with uncertain disturbances described by

\dot{x} = f(x) + g_1(x)(\bar{u}_1 + d_1) + g_2(x)(\bar{u}_2 + d_2),    (51)

where

f(x) = \begin{bmatrix} x_2 - 2 x_1 \\ -x_2 - 0.5 x_1 + 0.25 x_2 (\cos(2 x_1) + 2)^2 + 0.25 x_2 (\sin(4 x_1^2) + 2)^2 \end{bmatrix}, \quad g_1(x) = \begin{bmatrix} 0 \\ \cos(2 x_1) + 2 \end{bmatrix}, \quad g_2(x) = \begin{bmatrix} 0 \\ \sin(4 x_1^2) + 2 \end{bmatrix},

and d_1 = 0.5 p_1 x_1 \sin(x_2^2), d_2 = 0.5 p_2 x_2 \sin^3(8 x_1) \cos(20 x_2^5), where p_1 and p_2 are chosen randomly in [−1, 1]. Then we choose d_{1M} = ‖x‖ and d_{2M} = ‖x‖ for the simulation in order to construct the same cost function defined in [42]. In the following, we first obtain the Nash solution of the nominal nonzero-sum game system

\dot{x} = f(x) + g_1(x) u_1 + g_2(x) u_2.    (52)

In this case, the performance indices are defined as

J_1(x(0)) = \int_0^{\infty} \big[ x^T Q_1 x + \delta_{1M}^2 + u_1^T R_{11} u_1 + u_2^T R_{12} u_2 \big] dt,    (53)

and

J_2(x(0)) = \int_0^{\infty} \big[ x^T Q_2 x + \delta_{2M}^2 + u_1^T R_{21} u_1 + u_2^T R_{22} u_2 \big] dt.    (54)

Choose Q_1 = 1.5 I_2, Q_2 = 0.75 I_2, R_{11} = R_{12} = 2, R_{21} = R_{22} = 1 for the simulation. Therefore, we can obtain δ_{1M}^2 = 0.5(x_1^2 + x_2^2) and δ_{2M}^2 = 0.25(x_1^2 + x_2^2). The optimal value functions for the two players are V_1^*(x) = \frac{1}{2} x_1^2 + x_2^2 and V_2^*(x) = \frac{1}{4} x_1^2 + \frac{1}{2} x_2^2, respectively. The critic NN activation functions are selected as σ_i = [x_1^2, x_1 x_2, x_2^2]^T with weights [W_{i1}, ..., W_{i3}]^T for constructing the approximate value functions \hat{V}_i(x), i = 1, 2. In addition, \hat{W}_1 is initialized at [1, 1, 1]^T and \hat{W}_2 is initialized at [0.35, 0.8, 0.35]^T. Let the initial state of the system (52) be [−1, −0.5]^T. During the learning process, a probing noise is added to the control inputs to satisfy the PE condition. Besides, choose F_1 = F_2 = 100 I_3, S_1 = S_2 = 10\mathbf{1} with \mathbf{1} ∈ R^3, and α_1 = α_2 = 1 as the constants of the tuning law. Fig. 5 presents the evolution of the system states. Figs. 6 and 7 depict the convergence process of the critic NN weights of each player during the online learning process. When the PE condition is removed after 900 s, the tuned critic NN weights converge to [0.5001, 0.0001, 0.9999]^T and [0.2566, −0.0072, 0.4914]^T, respectively.
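For completeness, a hedged Python sketch of the Example-2 dynamics (51)-(52) that could be used to reproduce such a simulation is given below; the disturbance parameters p_1 and p_2 are drawn randomly as in the example, and the rest is an illustrative assumption:

```python
import numpy as np

def f(x):                                   # drift dynamics of Example 2
    x1, x2 = x
    return np.array([
        x2 - 2.0 * x1,
        -x2 - 0.5 * x1 + 0.25 * x2 * (np.cos(2 * x1) + 2.0) ** 2
        + 0.25 * x2 * (np.sin(4 * x1 ** 2) + 2.0) ** 2,
    ])

def g1(x):
    return np.array([0.0, np.cos(2 * x[0]) + 2.0])

def g2(x):
    return np.array([0.0, np.sin(4 * x[0] ** 2) + 2.0])

def disturbances(x, p1, p2):
    d1 = 0.5 * p1 * x[0] * np.sin(x[1] ** 2)
    d2 = 0.5 * p2 * x[1] * np.sin(8 * x[0]) ** 3 * np.cos(20 * x[1] ** 5)
    return d1, d2

def x_dot(x, u1, u2, p1, p2):               # uncertain system (51)
    d1, d2 = disturbances(x, p1, p2)
    return f(x) + g1(x) * (u1 + d1) + g2(x) * (u2 + d2)

x0 = np.array([-1.0, -0.5])                 # initial state used in Example 2
print(x_dot(x0, 0.0, 0.0, 0.3, -0.7))
```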


Fig. 7. The convergence of critic NN weights for player 2.

Fig. 8. The state trajectory of system (51) under the robust control policy pair.

It is observed that the parameters converge to the optimal values W_1 = [0.5, 0, 1]^T and W_2 = [0.25, 0, 0.5]^T. Next, we choose μ_1 = 1.6 and μ_2 = 1.65 in order to obtain the robust control pair (ū_1(x), ū_2(x)) for the system (51). The evolution of the uncertain system state trajectories is shown in Fig. 8, which again demonstrates the effectiveness of this adaptive robust control method.

6. Conclusion

A novel robust control strategy based on a single critic NN-based ADP method has been proposed for nonlinear multi-player systems with input disturbances. The robust control pair has been designed by multiplying a set of coupling gains with the Nash solution of the corresponding nominal nonzero-sum game system. The existence condition of the novel robust control has been provided by using Lyapunov techniques. By constructing a critic NN for each player, the proposed online synchronous ADP algorithm has been employed to approximately solve the nonlinear coupled HJ equations of the nominal nonzero-sum game system. A stability analysis has been given to prove that the closed-loop nonzero-sum game system states and the NN weight errors are UUB. Simulation results have been presented to validate the theoretical analysis.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (61627809, 61621004), and IAPI Fundamental Research Funds 2013ZCX14.

References

[1] A. Bidram, A. Davoudi, F.L. Lewis, J.M. Guerrero, Distributed cooperative secondary control of microgrids using feedback linearization, IEEE Trans. Power Syst. 28 (2013) 3462–3470. [2] K.R.S. Kodagoda, W.S. Wijesoma, E.K. Teoh, Fuzzy speed and steering control of an AGV, IEEE Trans. Control Syst. Technol. 10 (2002) 112–120. [3] D. Li, J. Cruz, G. Chen, C. Kwan, M. Chang, A hierarchical approach to multi-player pursuit-evasion differential games, in: Proceedings of the Forty– forth IEEE Conference on Decision and Control, IEEE, Seville, Spain, 2005, pp. 5674–5679. [4] H. Abou-Kandil, G. Freiling, G. Jank, Necessary and sufficient conditions for constant solutions of coupled Riccati equations in Nash games, Syst. Control Lett. 21 (1993) 295–306. [5] P. Morris, Introduction to Game Theory, Springer, New York, 2012. [6] A. Starr, Y. Ho, Nonzero-sum differential games, J. Optim. Theory Appl. 3 (1969) 184–206. [7] H. Zhang, C. Qin, Y. Luo, Neural-network-based constrained optimal control scheme for discrete-time switched nonlinear system using dual heuristic programming, IEEE Trans. Autom. Scie. Engi. 11 (2014) 839–849. [8] X. Zhong, H. He, H. Zhang, Z. Wang, Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 2141–2155. [9] M. Abu-Khalaf, F. Lewis, J. Huang, Neurodynamic programming and zero-sum games for constrained control systems, IEEE Trans. Neural Netw. 19 (2008) 1243–1252. [10] H. Modares, F. Lewis, M. Sistani, Online solution of nonquadratic two-player zero-sum games arising in the H∞ control of constrained input systems, Int. J. Adapt. Control Signal Process. 28 (2014) 232–254. [11] K. Vamvoudakis, F. Lewis, G. Hudas, Multi-agent differential graphical games: online adaptive learning solution for synchronization with optimality, Automatica 48 (2012) 1598–1611. [12] M. Abouheaf, F. Lewis, K. Vamvoudakis, S. Haesaert, R. Babuska, Multi-agent discrete-time graphical games and reinforcement learning solutions, Automatica 50 (2014) 3038–3053. [13] H. Zhang, T. Feng, G. Yang, H. Liang, Distributed cooperative optimal control for multi-agent systems on directed graphs: an inverse optimal approach, IEEE Trans. Cybern. 45 (2015a) 1315–1326. [14] H. Zhang, J. Zhang, G. Yang, Y. Luo, Leader-based optimal coordination control for the consensus problem of multi-agent differential games via fuzzy adaptive dynamic programming, IEEE Trans. Fuzzy Syst. 23 (2015b) 152–163. [15] H. Modares, F. Lewis, Optimal tracking control of nonlinear partially unknown constrained-input systems using integral reinforcement learning, Automatica 50 (2014) 1780–1792. [16] Q. Wei, F. Wang, D. Liu, X. Yang, Finite-approximation-error based discrete– time iterative adaptive dynamic programming, IEEE Trans. Cybern. 3 (2014) 2820–2833. [17] Z. Ni, H. He, J. Wen, Adaptive learning in tracking control based on the dual critic network design, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 913–928. [18] A. Heydari, S. Balakrishnan, Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 145–157. [19] J. Fu, H. He, X. Zhou, Adaptive learning and control for MIMO system based on adaptive dynamic programming, IEEE Trans. Neural Netw. 22 (2011) 1133–1148. [20] Q. Wei, D. Liu, Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification, IEEE Trans Autom. Scie. Eng. 
11 (2014) 1020–1036. [21] T. Dierks, B. Brenner, S. Jagannathan, Near optimal control of mobile robot formations, in: Proceedings of the IEEE Symposium on ADPRL, IEEE, Piscataway, NJ, USA, 2011, pp. 234–241. [22] H. Zargarzadeh, S. Jagannathan, J. Drallmeier, Online near optimal control of unknown nonaffine systems with application to HCCI engines, in: Proceedings of the IEEE Symposium on ADPRL, IEEE, Piscataway, NJ, USA, 2011, pp. 258–263. [23] D. Wang, D. Liu, H. Li, Policy iteration algorithm for online design of robust control for a class of continuous-time nonlinear systems, IEEE Trans.Autom. Sci.Eng. 11 (2014) 627–632. [24] H. Zhang, Y. Luo, D. Liu, Neural network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Trans. Neural Netw. 20 (2009) 1490–1503. [25] K. Vamvoudakis, F. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (2010) 878–888. [26] T. Dierks, S. Jagannathan, Online optimal control of affine nonlinear discrete– time systems with unknown internal dynamics by using time-based policy update, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1118–1129.


[27] H. Zhang, C. Qin, Y. Luo, Neural-network-based constrained optimal control scheme for discrete-time switched nonlinear system using dual heuristic programming, IEEE Trans. Autom. Scie. Eng. 11 (2014) 839–849. [28] D. Liu, H. Li, D. Wang, Online synchronous approximate optimal learning algorithm for multiplayer nonzero-sum games with unknown dynamics, IEEE Trans. Syst., Man, Cybern., Syst. 44 (2014) 1015–1027. [29] D. Liu, Q. Wei, Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 621–634. [30] F. Lewis, K. Vamvoudakis, Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data, IEEE Trans. Syst. Man Cybern. B Cybern. 41 (2011) 14–25. [31] B. Kiumarsi, F. Lewis, M. Naghibi-Sistani, A. Karimpour, Optimal tracking control of unknown discrete-time linear systems using input-output measured data, IEEE Trans. Cybern. 45 (2015) 2770–2779. [32] Q. Wei, D. Liu, H. Lin, Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems, IEEE Trans. Cybern. 46 (2016) 840–853. [33] Q. Wei, D. Liu, X. Yang, Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems, IEEE Trans. Neural Netw. Learn. Syst. 26 (2015) 866–879. [34] Y. Jiang, Z. Jiang, Robust adaptive dynamic programming and feedback stabilization of nonlinear systems, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014) 882–893. [35] Y. Jiang, Z. Jiang, Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems, IEEE Trans. Circuits Syst.II 59 (2012) 693–697. [36] Q. Wei, R. Song, P. Yan, Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP, IEEE Trans. Neural Netw. Learn. Syst. 27 (2016) 444–458. [37] H. Wu, B. Luo, Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1884–1895. [38] H. Modares, F. Lewis, Z. Jiang, H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning, IEEE Trans. Neural Netw. Learn. Syst. 26 (2015) 2550–2562. [39] D. Adhyaru, I. Kar, M. Gopal, Bounded robust control of nonlinear systems using neural network-based HJB solution, Neural Comput. Appl. 20 (2011) 91–103. [40] D. Wang, D. Liu, H. Li, H. Ma, Neural-network-based robust optimal control design for a class of uncertain nonlinear systems via adaptive dynamic programming, Inf. Sci. 282 (2014) 167–179. [41] K. Vamvoudakis, F. Lewis, Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton–Jacobi equations, Automatica 47 (2011) 1556–1569. [42] H. Zhang, L. Cui, Y. Luo, Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP, IEEE Trans. Cybern. 43 (2013) 206–216. [43] D. Zhao, Q. Zhang, D. Wang, Y. Zhu, Experience replay for optimal control of nonzero-sum game systems with unknown dynamics, IEEE Trans. Cybern. 46 (2016) 854–865. [44] A. Ferrara, On multi-input backstepping design with second order sliding modes for a class of uncertain nonlinear systems, Int. J. Control 71 (1998) 767–788. [45] S. Mobayen, Fast terminal sliding mode controller design for nonlinear second-order systems with time-varying uncertainties, Complexity 21 (2015) 239–244. [46] S. Mobayen, S. 
Javadi, Disturbance observer and finite-time tracker design of disturbed third-order nonholonomic systems using terminal sliding mode, J. Vib. Control 23 (2017) 181–189. [47] T. Basar, G. Olsder, Dynamic Noncooperative Game Theory, 2nd, PA: SIAM, Philadelphia, 1999. [48] E. Zeidler, Nonlinear Functional Analysis vol.1: Fixed Point Theorems, Springer-Verlag, New York, 1985. Qiuxia Qu received the M.S. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2010, where she is currently pursuing the Ph.D. degree in control theory and control engineering. Her current research interests include adaptive dynamic programming, neural network, optimal control, and their industrial applications.

Huaguang Zhang (M’03, SM’04, F’14) received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural networks based control, nonlinear control, and their applications. He has authored and coauthored over 280 journal and conference papers, six monographs and co-invented 90 patents. Dr. Zhang is the fellow of IEEE, the E-letter Chair of IEEE CIS Society, the former Chair of the Adaptive Dynamic Programming & Reinforcement Learning Technical Committee on IEEE Computational Intelligence Society. He is an Associate Editor of AUTOMATICA, IEEE TRANSACTIONS ON NEURAL NETWORKS, IEEE TRANSACTIONS ON CYBERNETICS, and NEUROCOMPUTING, respectively. He was an Associate Editor of IEEE TRANSACTIONS ON FUZZY SYSTEMS (2008–2013). He was awarded the Outstanding Youth Science Foundation Award from the National Natural Science Foundation Committee of China in 2003. He was named the Cheung Kong Scholar by the Education Ministry of China in 2005. He is a recipient of the IEEE Transactions on Neural Networks 2012 Outstanding Paper Award. Chaomin Luo (S’01-M’08) received the B.Eng. degree in radio engineering from Southeast University, Nanjing, China, the M.Sc. degree in engineering systems and computing from the University of Guelph, Guelph, ON, Canada, and the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 2008. He is currently an Associate Professor with the Advanced Mobility Laboratory, Department of Electrical and Computer Engineering, University of Detroit Mercy, Detroit, MI, USA. His current research interests include control and automation, computational intelligence, intelligent controls and robotics, and embedded systems. Dr. Luo was a recipient of the NSERC Postgraduate Award in Canada and President’s Graduate Awards from the University of Waterloo, and the Best Student Paper Presentation Award at 2007 SWORD. He also received Faculty Research Awards from the University of Detroit Mercy in 2009, 2015, and 2016. He serves as the Editorial Board Member of the International Journal of Complex Systems-Computing, Sensing and Control and an Associative Editor of the International Journal of Robotics and Automation. He was the general Co-chair in the IEEE International Workshop on Computational Intelligence in Smart Technologies, and Journal Special Issues Chair, IEEE 2016 International Conference on Smart Technologies, Cleveland, OH, USA. He has organized and chaired several special sessions on the topics of intelligent vehicle systems and bio-inspired intelligence in the IEEE reputed international conferences, such as the IJCNN and the IEEE-SSCI and WCCI. Rui Yu received the B.S. degree in information and computing science from Northeastern University, Shenyang, China, in 2011, and the M.S. degree in computational mathematics form Northeastern University, Shenyang, China, in 2013. She is currently working toward the Ph.D. 
degree in control theory and control engineering, Northeastern University. Her research interests include complex networks, neural networks and network control.