Data-based stable value iteration optimal control for unknown discrete-time systems with time delays

He Ren(b), Huaguang Zhang(a,b), Hanguang Su(b), Yunfei Mu(b)

a State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, Liaoning, 110004, P. R. China.
b School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110004, P. R. China.

Correspondence E-mail: [email protected].
Abstract

In this study, a novel data-based stable value iteration (SVI) optimal control scheme is presented to tackle linear discrete-time (DT) systems with multiple time delays. Since the system dynamics are difficult to acquire, the optimal control strategy is computed from historical input and output data alone by employing an estimator based on adaptive dynamic programming (ADP). By analyzing the features of time delay systems, an equivalent delay-free system notion is proposed so that the optimal control policy for systems with multiple time delays can be designed indirectly. Moreover, four equivalence conditions are deduced between the two related systems. The convergence of the proposed SVI algorithm with discount factor is analyzed according to optimal control principles, and the algorithm is proved to converge to the optimal values for a proper discount factor. In the end, two numerical examples are given, and simulation results illustrate that the presented data-based SVI method is effective.

Keywords: adaptive dynamic programming, data-based control, multiple time delays, stable value iteration, optimal control
Email addresses:
[email protected] (He Ren),
[email protected] (Huaguang Zhang),
[email protected] (Hanguang Su),
[email protected] (Yunfei Mu)
1. Introduction

Time delays inevitably exist in the great majority of practical systems, especially large-scale and complex ones, such as power systems, telecommunication, industrial production and chemical reactions [1-7]. In general, time delays depend on the response speed of the actuator, the signal transmission process, the complexity of the system and a series of other practical factors [8]. Research on time delays has never stopped because of its significant practical relevance in various areas. In the past decades, optimal control for time delay systems attracted a great number of scholars' attention, and some important contributions have been published in [9-12]. The optimal control policy is obtained according to optimality principles, generally by solving the Hamilton-Jacobi-Bellman (HJB) equation, which can hardly be calculated directly [13].

So far, adaptive (approximate) dynamic programming (ADP) algorithms are recognized as strong tools for tackling optimization problems. By utilizing ADP, optimal control policies are approximated through an iteration mechanism [14-20]. Moreover, different from classical dynamic programming, the curse of dimensionality is effectively avoided [21]. According to the iteration order, ADP methods are classified into policy iteration (PI) and value iteration (VI) [22], which are generally applied to online and off-line optimal control, respectively. Moreover, distinguished from the PI method, VI does not need an initial admissible control. In [23], an SVI method was presented for the discrete-time (DT) two-player zero-sum game for the first time. Reconstruction errors and disturbances were further taken into consideration in the SVI algorithm in [24]. However, although some contributions have been developed, solving optimal control problems with time delays is still an open and challenging problem.

In recent years, the linear quadratic tracking (LQT) problem with multiple time delays was solved by an ADP algorithm in [25] when the system states were unknown. A novel reinforcement learning method that combines off-policy learning and experience replay effectively solved the optimal tracking control problem without knowledge of the system dynamics in [26]. In [27], a novel dual iteration method was presented to design the control of time-delay systems, in which the system states and performance value are updated simultaneously. In addition, the multiobjective optimal control problem with time delays was dealt with by a multiobjective ADP method in [28]. In [29], the finite-horizon optimal control problem for a class of nonlinear systems with time delays was discussed utilizing a value iteration method.

It is worth noting that, in all of the aforementioned results, the system dynamics are supposed to be available and accurate. Iteration errors will accumulate once the system dynamics are inaccurate. Nevertheless, in most practical applications, the dynamics can hardly be obtained directly because of cost constraints and system complexity. Hence, data-based optimal ADP methods are especially significant and meaningful. Thus, some scholars have introduced data-based and model-free mechanisms into optimal ADP methods and obtained some excellent results. For example, a data-based VI method combined with Q-learning was proposed in [30] for systems with a single delay in the system state, control input and output. In [31], a model-free scheme was designed to obtain optimal control policies for linear systems with multiple time delays. Moreover, data-based PI and VI methods were presented for DT linear systems with multiple time delays using only historical input and output data in [32]. The key of the last two methods is to convert the time-delay system into an equivalent system without delays, which is constrained by the necessary and sufficient conditions discussed in the remainder of this paper.

In our study, a novel stable value iteration (SVI) method with discount factor is developed for the data-based optimal ADP control problem of linear DT systems with multiple delays. The primary highlights of this study are summarized as follows. Firstly, four essential conditions are presented under which the original system with multiple time delays can be transformed into an equivalent delay-free system. Secondly, a novel data-based SVI method is established so that the optimal control policy can be obtained from historical input and output data, which avoids the difficulty of acquiring or estimating system dynamics in practical applications. Thirdly, by means of mathematical induction, the proposed SVI method is proved to converge to the optimal control strategy and value function under some reasonable hypotheses. Finally, two simulations are displayed to demonstrate the validity of the novel SVI algorithm.

The remaining sections of this work are arranged as follows. In Section 2, the problem description and the basic knowledge about coordinate transformation are presented. The SVI algorithm for time delay systems and its convergence are considered in Section 3. In Section 4, the data-based optimal SVI method is proposed. Two simulations and their results are listed in Section 5. At last, the conclusion is given in Section 6.
2. Preliminaries

2.1. Problem Description

First of all, we focus on a class of linear DT systems with multiple time delays, formulated as equation (1). It follows from equation (1) that time delays exist in the system state, control input and output:

x_{k+1} = \sum_{m=0}^{\tau_\alpha} A_m x_{k-m} + \sum_{n=0}^{\tau_\beta} B_n u_{k-n},
y_k = \sum_{p=0}^{\tau_\gamma} C_p x_{k-p},    (1)
where x_k ∈ R^q, u_k ∈ R^s and y_k ∈ R^t represent the system state, control input and system output vectors, respectively; A_m ∈ R^{q×q} (m = 0, 1, ..., \tau_\alpha), B_n ∈ R^{q×s} (n = 0, 1, ..., \tau_\beta) and C_p ∈ R^{t×q} (p = 0, 1, ..., \tau_\gamma). Furthermore, \tau_\alpha, \tau_\beta, \tau_\gamma ∈ N.

Assumption 1. To ensure that the considered time delay system has at least one solution, we assume that the linear DT system (1) is both controllable and observable.

Let us define the performance index associated with the discount factor for system (1) as

V(x_k) = V(x_k, u(x_k)) = \sum_{t=k}^{\infty} \theta^{t-k} (x_t^T Q x_t + u_t^T R u_t) = \sum_{t=k}^{\infty} \theta^{t-k} r(x_t, u(x_t)),    (2)

in which r(x_t, u(x_t)) = x_t^T Q x_t + u_t^T R u_t, the matrices Q and R are positive definite, and \theta is the discount factor with 0 < \theta < 1. The problem of concern is thus transformed into solving for the optimal control strategy u^*(x_k) that minimizes the performance index:

V^*(x_k) = V(x_k, u^*(x_k)) \le V(x_k, u(x_k)).    (3)
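For intuition, the discounted index (2) can be approximated for any fixed control policy by truncated forward simulation of (1). The following Python sketch is only an illustration: the function name, the memoryless policy interface, the zero initial input history and the truncation horizon are assumptions of the sketch, not constructions from the paper.

```python
import numpy as np

def discounted_cost(A_list, B_list, x_hist, u_policy, Q, R, theta, horizon=200):
    """Approximate the discounted performance index (2) by forward simulation of (1).

    A_list, B_list : delayed coefficient matrices A_0..A_{tau_a}, B_0..B_{tau_b}
    x_hist         : past states [x_k, x_{k-1}, ...], at least tau_a + 1 entries
    u_policy       : callable mapping the current state to a control input
    """
    ta, tb = len(A_list) - 1, len(B_list) - 1
    xs = list(x_hist)                                   # xs[m] = x_{k-m}
    us = [np.zeros(B_list[0].shape[1]) for _ in range(tb + 1)]  # past inputs assumed zero
    cost = 0.0
    for t in range(horizon):
        x, u = xs[0], u_policy(xs[0])
        cost += theta**t * (x @ Q @ x + u @ R @ u)      # stage cost r(x_t, u_t)
        us = [u] + us[:tb]                              # shift input history
        x_next = sum(A_list[m] @ xs[m] for m in range(ta + 1)) \
               + sum(B_list[n] @ us[n] for n in range(tb + 1))
        xs = [x_next] + xs[:ta]                         # shift state history
    # Truncation error is bounded by theta**horizon times the remaining cost.
    return cost
```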
In accordance with the Bellman optimality principle, the following equation for V^*(x_k) always holds:

V^*(x_k) = \min_{u(x_k)} \{ r(x_k, u(x_k)) + \theta V^*(x_{k+1}) \}.    (4)

The optimal control strategy u^*(x_k) should satisfy the first-order necessary condition

\frac{\partial V^*(x_k)}{\partial u(x_k)} = \frac{\partial (x_k^T Q x_k + u(x_k)^T R u(x_k))}{\partial u(x_k)} + \theta \Big(\frac{\partial x_{k+1}}{\partial u(x_k)}\Big)^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}} = 0.    (5)

Then, noting that \partial x_{k+1}/\partial u(x_k) = B_0, we have

u^*(x_k) = -\frac{\theta}{2} R^{-1} B_0^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}.    (6)
Substituting equation (6) into (4), we obtain the HJB equation

V^*(x_k) = x_k^T Q x_k + \frac{\theta^2}{4} \Big(\frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}\Big)^T B_0 R^{-1} B_0^T \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}} + \theta V^*(x_{k+1}).    (7)

In principle, the optimal performance index could be obtained by solving equation (7). However, it is almost impossible to solve directly. Therefore, an ADP method is employed for this tough problem.

2.2. Coordinate Transformation

Define a time delay operator \phi as

x_{k-j} = \phi^j x_k, \quad j, k ∈ N.    (8)

Let R[\phi] denote the ring of polynomials in \phi; a polynomial matrix in \phi can then be described as

F[\phi] = F_0 + F_1 \phi + F_2 \phi^2 + \cdots + F_H \phi^H,    (9)

where F_w (w = 0, 1, ..., H) are constant matrices. Addition and multiplication are carried out according to the usual arithmetic rules.

Definition 1. A square matrix A ∈ R[\phi] is said to be unimodular if its determinant is a nonzero real constant.

Definition 2 (Smith invariant factors). A polynomial matrix M(\lambda) of rank r can be written in the equivalent form

M(\lambda) = N_1(\lambda) S(\lambda) N_2(\lambda),    (10)

in which

S(\lambda) = \begin{bmatrix} \Delta(\lambda) & 0 \\ 0 & 0 \end{bmatrix}, \quad \Delta(\lambda) = \mathrm{diag}[d_1(\lambda), d_2(\lambda), \cdots, d_r(\lambda)],

where d_f(\lambda) is divisible by d_{f-1}(\lambda) for f = 2, 3, ..., r. Moreover, N_1(\lambda) and N_2(\lambda) are unimodular matrices.
Definition 3 (Change of coordinates). A causal change of coordinates for system (1) is a transformation

\hat{x}_k = F[\phi] x_k    (11)

with F[\phi] ∈ R[\phi]^{q×q}, where the Smith invariant factors of F[\phi] are of the form \phi^H for H ∈ N.

Definition 4 (Delay-free system model). The delay-free system described by equation (12) is equivalent to system (1),

\hat{x}_{k+1} = \sum_{m=0}^{\tau_\alpha^*} \bar{A}_m \hat{x}_{k-m} + \sum_{n=0}^{\tau_\beta^*} \bar{B}_n u_{k-n} = \bar{A}_0 \hat{x}_k + \bar{B}_0 u_k,
y_k = \sum_{p=0}^{\tau_\gamma^*} \bar{C}_p \hat{x}_{k-p} = \bar{C}_0 \hat{x}_k,    (12)
if there exists a unimodular matrix F[\phi] as in equation (11). Here \bar{A}_m ∈ R^{q×q} (m = 0, 1, ..., \tau_\alpha^*), \bar{B}_n ∈ R^{q×s} (n = 0, 1, ..., \tau_\beta^*) and \bar{C}_p ∈ R^{t×q} (p = 0, 1, ..., \tau_\gamma^*). Moreover, \tau_\alpha^*, \tau_\beta^*, \tau_\gamma^* ∈ N are the delay times.

2.3. Necessary and sufficient conditions

In this section, four necessary and sufficient equivalence conditions are given for the delay-free and the multiple-delay system. The main conclusions are summarized in the following Lemma 1.

Lemma 1. The original DT system (1) with time delays is equivalent to the delay-free system (12) if and only if there exists a group of matrices F_w ∈ R^{q×q} (w = 0, 1, ..., \tau_\sigma) satisfying the following four conditions:
(1) \sum_{w=0}^{k} F_w A_{k-w} = \bar{A}_0 F_k for k = 0, 1, ..., \tau_\sigma + \tau_\alpha, with \bar{A}_0 = A_0;
(2) \sum_{w=0}^{k} F_w B_{k-w} = \bar{B}_k for k = 0, 1, ..., \tau_\sigma + \tau_\beta, with \bar{B}_0 = B_0 and \bar{B}_k = 0 for all k > 0;
(3) \sum_{w=0}^{k} \bar{C}_{k-w} F_w = C_k for k = 0, 1, ..., \tau_\sigma + \tau_\gamma, with \bar{C}_0 = C_0 and \bar{C}_k = 0 for all k > 0;
(4) \det(\sum_{w=0}^{\tau_\sigma} F_w \phi^w) ∈ R \setminus \{0\}.
Proof. According to Definition 3, system (12) can be rewritten as

\hat{x}_{k+1} = \bar{A}_0 F[\phi] x_k + \bar{B}_0 u_k,
y_k = \bar{C}_0 F[\phi] x_k.    (13)

Accordingly, we look for a unimodular matrix F[\phi] ∈ R[\phi]^{q×q} such that the following equations (14) and (15) hold:

\hat{x}_{k+1} = F[\phi] x_{k+1} = [F_0\ F_1\ \cdots\ F_{\tau_\sigma}] [\phi^0\ \phi^1\ \cdots\ \phi^{\tau_\sigma}]^T x_{k+1}
= [F_0\ F_1\ \cdots\ F_{\tau_\sigma}] \begin{bmatrix} A_0 x_k + A_1 x_{k-1} + \cdots + A_{\tau_\alpha} x_{k-\tau_\alpha} \\ A_0 x_{k-1} + A_1 x_{k-2} + \cdots + A_{\tau_\alpha} x_{k-\tau_\alpha-1} \\ \vdots \\ A_0 x_{k-\tau_\sigma} + A_1 x_{k-\tau_\sigma-1} + \cdots + A_{\tau_\alpha} x_{k-\tau_\alpha-\tau_\sigma} \end{bmatrix}
+ [F_0\ F_1\ \cdots\ F_{\tau_\sigma}] \begin{bmatrix} B_0 u_k + B_1 u_{k-1} + \cdots + B_{\tau_\beta} u_{k-\tau_\beta} \\ B_0 u_{k-1} + B_1 u_{k-2} + \cdots + B_{\tau_\beta} u_{k-\tau_\beta-1} \\ \vdots \\ B_0 u_{k-\tau_\sigma} + B_1 u_{k-\tau_\sigma-1} + \cdots + B_{\tau_\beta} u_{k-\tau_\beta-\tau_\sigma} \end{bmatrix}
= \bar{A}_0 F[\phi] x_k + \bar{B}_0 u_k,    (14)

y_k = \bar{C}_0 F[\phi] x_k = \bar{C}_0 [F_0\ F_1\ \cdots\ F_{\tau_\sigma}] [x_k^T\ \cdots\ x_{k-\tau_\sigma}^T]^T = \sum_{p=0}^{\tau_\gamma} C_p x_{k-p}.    (15)
Comparing the delay-free system (12) with the initial system (1), if they are equivalent, then the corresponding coefficients in equations (14) and (15) must be equal. Thus, we obtain conditions (1), (2) and (3). Condition (4) is imposed to ensure that F[\phi] is a unimodular matrix, so that the Smith invariant polynomials of F[\phi] have the form defined in Definition 2. Therefore, the transformation between the time-delay system (1) and the delay-free system (12) is a causal change of coordinates if and only if condition (4) is satisfied. This completes the proof.

Remark 1. The equivalent delay-free system (12) has the same nature as the original system (1). Actually, every linear DT system (1) with multiple time delays possesses such a delay-free counterpart. The transformation between the two systems in Lemma 1 simply removes the time-delay form and simplifies the system [33]. Therefore, we can obtain the optimal control strategy by solving the delay-free system (12) indirectly. It is worth mentioning that, when F_0 is set to be the identity matrix, the matrices \bar{A}_0, \bar{B}_0 and \bar{C}_0 in the delay-free system always equal A_0, B_0 and C_0 in the original system [31]. A numerical check of conditions (1)-(4) of Lemma 1 for a simple example is sketched below.
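The following Python sketch shows one way to check conditions (1)-(4) of Lemma 1 numerically. The helper name `check_lemma1`, the toy matrices (constructed from a chosen delay-free pair so that the conditions hold by design, and not taken from the paper's examples) and the sampling-based test of condition (4) are all illustrative assumptions.

```python
import numpy as np

def check_lemma1(A, B, C, F, Abar0, Bbar0, Cbar0, tol=1e-9):
    """Numerically check conditions (1)-(4) of Lemma 1.

    A, B, C, F are lists of coefficient matrices [A_0, ..., A_ta], etc.;
    matrices beyond the stored delay orders are treated as zero, and
    Cbar_k = 0 for k > 0 is used in condition (3).
    """
    get = lambda L, i: L[i] if i < len(L) else np.zeros_like(L[0])
    ts, ta, tb, tc = len(F) - 1, len(A) - 1, len(B) - 1, len(C) - 1
    q = F[0].shape[0]

    ok1 = all(np.allclose(sum(get(F, w) @ get(A, k - w) for w in range(k + 1)),
                          Abar0 @ get(F, k), atol=tol) for k in range(ts + ta + 1))
    ok2 = all(np.allclose(sum(get(F, w) @ get(B, k - w) for w in range(k + 1)),
                          Bbar0 if k == 0 else np.zeros_like(Bbar0), atol=tol)
              for k in range(ts + tb + 1))
    ok3 = all(np.allclose(Cbar0 @ get(F, k), get(C, k), atol=tol)
              for k in range(ts + tc + 1))
    # Condition (4): det(sum_w F_w phi^w) is a nonzero constant; a polynomial of
    # degree at most q*ts is constant if it is constant at q*ts + 2 sample points.
    dets = [np.linalg.det(sum(get(F, w) * phi**w for w in range(ts + 1)))
            for phi in np.linspace(-2.0, 2.0, q * ts + 2)]
    ok4 = abs(dets[0]) > tol and np.allclose(dets, dets[0], atol=tol)
    return ok1, ok2, ok3, ok4

# Toy example: generate a delayed system from a delay-free one through
# F[phi] = I + F1*phi, so that all four conditions hold by construction.
F0, F1 = np.eye(2), np.array([[0.0, 1.0], [0.0, 0.0]])
Abar0 = np.array([[0.5, 0.2], [0.1, 0.3]])
Bbar0, Cbar0 = np.array([[1.0], [0.0]]), np.array([[1.0, 0.0]])
A0, A1 = Abar0, Abar0 @ F1 - F1 @ Abar0
A2 = -F1 @ A1
B0, B1 = Bbar0, -F1 @ Bbar0
C0, C1 = Cbar0, Cbar0 @ F1
print(check_lemma1([A0, A1, A2], [B0, B1], [C0, C1], [F0, F1], Abar0, Bbar0, Cbar0))
# -> (True, True, True, True)
```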
3. SVI method for time delay systems

3.1. Derivation of the iterative VI method

A VI algorithm was presented by Liu et al. [34]. When the iteration index equals zero, one has

u_0(x_k) = \arg\min_{u(x_k)} \big( r(x_k, u(x_k)) + \theta V_0(x_{k+1}) \big),    (16)

where k and i denote the time and iteration indices, respectively. For an initial performance index V_0(\cdot) \ge 0, we obtain the initial control strategy u_0(x_k) from equation (16). After that, we update the value function with

V_1(x_k) = r(x_k, u_0(x_k)) + \theta V_0(A_0 x_k + B_0 u_0(x_k)).    (17)

Similar to the traditional VI method, the control strategy and value function at the ith step are given by

u_i(x_k) = \arg\min_{u(x_k)} \big( r(x_k, u(x_k)) + \theta V_i(x_{k+1}) \big),    (18)

V_{i+1}(x_k) = r(x_k, u_i(x_k)) + \theta V_i(x_{k+1}).    (19)

The iteration does not stop until \|V_{i+1}(x_k) - V_i(x_k)\| < \varepsilon, in which \varepsilon is a sufficiently small positive real constant; otherwise, set i = i + 1 and go back to equations (18) and (19). When \theta = 1, this iteration is equivalent to the traditional VI method, for which the following properties hold: first, when i → ∞, V_i(x_k) = V^*(x_k) and u_i(x_k) = u^*(x_k); second, equation (19) is not always a Lyapunov equation when 0 < \theta < 1. Therefore, the admissible range of \theta is worth discussing. A minimal numerical sketch of the iteration (16)-(19) is given below.
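For a delay-free linear system with quadratic value functions V_i(x) = x^T P_i x, the iteration (16)-(19) admits the closed-form updates used in the Python sketch below. The explicit LQR-style minimizer is standard linear-quadratic algebra and an assumption of the sketch rather than a formula taken from the paper; the example matrices are hypothetical.

```python
import numpy as np

def svi_lqr(A, B, Q, R, theta=0.8, eps=1e-8, max_iter=1000):
    """Value iteration (16)-(19) with discount theta for x_{k+1} = A x_k + B u_k,
    using the quadratic parameterization V_i(x) = x^T P_i x (here V_0 = 0)."""
    n = A.shape[0]
    P = np.zeros((n, n))                                      # initial value function V_0
    for i in range(max_iter):
        # Minimizer of r(x,u) + theta*V_i(Ax+Bu): u_i(x) = -K x (LQR algebra)
        K = theta * np.linalg.solve(R + theta * B.T @ P @ B, B.T @ P @ A)
        Acl = A - B @ K
        P_next = Q + K.T @ R @ K + theta * Acl.T @ P @ Acl    # value update (19)
        if np.max(np.abs(P_next - P)) < eps:                  # stop: ||V_{i+1} - V_i|| < eps
            return P_next, K, i + 1
        P = P_next
    return P, K, max_iter

# Illustrative call on hypothetical delay-free matrices (not the paper's example):
A = np.array([[0.5, 0.2], [0.1, 0.3]])
B = np.array([[1.0], [0.0]])
P_opt, K_opt, iters = svi_lqr(A, B, np.eye(2), np.eye(1))
```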
3.2. Convergence analysis

In this subsection, the convergence of the presented SVI algorithm is established; that is, when the iteration step i goes to ∞, V_i(x_k) = V^*(x_k) and u_i(x_k) = u^*(x_k). To discuss the convergence, two assumptions are made as follows.

Assumption 2. Suppose 0 < \beta < ∞; then the following inequality always holds:

0 \le \theta V^*(x_{k+1}) \le \beta r(x_k, u(x_k)).    (20)

Assumption 3. There exist 0 \le \rho_1 \le 1 and 1 \le \rho_2 < ∞ satisfying

0 \le \rho_1 V^*(x_k) \le V_0(x_k) \le \rho_2 V^*(x_k),    (21)

in which V^*(x_k) is the optimal value function and V_0(x_k) is the initial value function.

Remark 2. Obviously, from the definition of the value function, V_i(x_k) is a monotone function and V_0(\cdot) is a finite positive function whatever initial state x_k is selected. In addition, from Assumption 1, we know that there must exist a control u_k making V(x_k) finite. Thus, V^*(x_{k+1}) is also finite for an arbitrary control strategy u(x_k). For the reasons above, there must exist 0 \le \rho_1 \le 1 and 1 \le \rho_2 < ∞ satisfying Assumption 3.

Theorem 1. For the DT linear system (12), the value function V_i(x_k) and the control strategy u_i(x_k) converge to the optimal values, \lim_{i\to\infty} V_i(x_k) = V^*(x_k) and \lim_{i\to\infty} u_i(x_k) = u^*(x_k), provided the following inequality (22) is satisfied as the iteration step goes to infinity:

\Big(1 + \frac{\rho_1 - 1}{(1 + \beta^{-1})^i}\Big) V^*(x_k) \le V_i(x_k) \le \Big(1 + \frac{\rho_2 - 1}{(1 + \beta^{-1})^i}\Big) V^*(x_k).    (22)

Proof. It follows from Assumptions 2 and 3 that \beta r(x_k, u(x_k)) - \theta V^*(x_{k+1}) \ge 0, (\rho_1 - 1)/(1 + \beta) \le 0 and (\rho_2 - 1)/(1 + \beta) \ge 0. Then the following two inequalities hold:

\frac{\rho_1 - 1}{1 + \beta} \big(\beta r(x_k, u(x_k)) - \theta V^*(x_{k+1})\big) \le 0,    (23)

\frac{\rho_2 - 1}{1 + \beta} \big(\beta r(x_k, u(x_k)) - \theta V^*(x_{k+1})\big) \ge 0.    (24)
Mathematical induction is employed to prove Theorem 1; the right-hand and left-hand sides of inequality (22) are established in turn. Firstly, we prove the right-hand side of (22). When i = 1, we obtain

V_1(x_k) = \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V_0(x_{k+1})]
\le \min_{u(x_k)} [r(x_k, u(x_k)) + \theta \rho_2 V^*(x_{k+1})]
\le \min_{u(x_k)} \Big[ r(x_k, u(x_k)) + \theta \rho_2 V^*(x_{k+1}) + \frac{\rho_2 - 1}{1 + \beta}\big(\beta r(x_k, u(x_k)) - \theta V^*(x_{k+1})\big) \Big]
= \frac{1 + \rho_2 \beta}{1 + \beta} \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V^*(x_{k+1})]
= \Big(1 + \frac{\rho_2 - 1}{1 + \beta^{-1}}\Big) V^*(x_k).    (25)

Assume that the conclusion holds when i = j - 1, j = 1, 2, \cdots, i.e.,

V_{j-1}(x_k) \le \Big(1 + \frac{\rho_2 - 1}{(1 + \beta^{-1})^{j-1}}\Big) V^*(x_k).    (26)

Therefore, when i = j, we obtain

V_j(x_k) = \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V_{j-1}(x_{k+1})]
\le \min_{u(x_k)} \Big\{ r(x_k, u(x_k)) + \theta \Big(1 + \frac{(\rho_2 - 1)\beta^{j-1}}{(1 + \beta)^{j-1}}\Big) V^*(x_{k+1}) \Big\}
\le \min_{u(x_k)} \Big\{ r(x_k, u(x_k)) + \theta \Big(1 + \frac{(\rho_2 - 1)\beta^{j-1}}{(1 + \beta)^{j-1}}\Big) V^*(x_{k+1}) + \frac{(\rho_2 - 1)\beta^{j-1}}{(1 + \beta)^{j}}\big[\beta r(x_k, u(x_k)) - \theta V^*(x_{k+1})\big] \Big\}
= \Big(1 + \frac{(\rho_2 - 1)\beta^{j}}{(1 + \beta)^{j}}\Big) \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V^*(x_{k+1})]
= \Big(1 + \frac{\rho_2 - 1}{(1 + \beta^{-1})^{j}}\Big) V^*(x_k).    (27)
So far, the right-hand side of (22) is proved. The left-hand side of (22) can be proved in a similar way. When i = 1, we obtain

V_1(x_k) = \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V_0(x_{k+1})]
\ge \min_{u(x_k)} [r(x_k, u(x_k)) + \theta \rho_1 V^*(x_{k+1})]
\ge \min_{u(x_k)} \Big[ r(x_k, u(x_k)) + \theta \rho_1 V^*(x_{k+1}) + \frac{\rho_1 - 1}{1 + \beta}\big(\beta r(x_k, u(x_k)) - \theta V^*(x_{k+1})\big) \Big]
= \frac{1 + \rho_1 \beta}{1 + \beta} \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V^*(x_{k+1})]
= \Big(1 + \frac{\rho_1 - 1}{1 + \beta^{-1}}\Big) V^*(x_k).    (28)

Assume that the conclusion holds when i = j - 1, j = 1, 2, \cdots, i.e.,

\Big(1 + \frac{\rho_1 - 1}{(1 + \beta^{-1})^{j-1}}\Big) V^*(x_k) \le V_{j-1}(x_k).    (29)

Therefore, when i = j, we obtain

V_j(x_k) = \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V_{j-1}(x_{k+1})]
\ge \min_{u(x_k)} \Big\{ r(x_k, u(x_k)) + \theta \Big(1 + \frac{(\rho_1 - 1)\beta^{j-1}}{(1 + \beta)^{j-1}}\Big) V^*(x_{k+1}) \Big\}
\ge \min_{u(x_k)} \Big\{ r(x_k, u(x_k)) + \theta \Big(1 + \frac{(\rho_1 - 1)\beta^{j-1}}{(1 + \beta)^{j-1}}\Big) V^*(x_{k+1}) + \frac{(\rho_1 - 1)\beta^{j-1}}{(1 + \beta)^{j}}\big[\beta r(x_k, u(x_k)) - \theta V^*(x_{k+1})\big] \Big\}
= \Big(1 + \frac{(\rho_1 - 1)\beta^{j}}{(1 + \beta)^{j}}\Big) \min_{u(x_k)} [r(x_k, u(x_k)) + \theta V^*(x_{k+1})]
= \Big(1 + \frac{\rho_1 - 1}{(1 + \beta^{-1})^{j}}\Big) V^*(x_k).    (30)
So far, the left-hand side of (22) is proved. When the iteration step goes to infinity, we have the following two limits:

\lim_{i\to\infty} \Big(1 + \frac{\rho_1 - 1}{(1 + \beta^{-1})^{i}}\Big) V^*(x_k) = V^*(x_k),    (31)

\lim_{i\to\infty} \Big(1 + \frac{\rho_2 - 1}{(1 + \beta^{-1})^{i}}\Big) V^*(x_k) = V^*(x_k).    (32)

Therefore, V_\infty(x_k) = V^*(x_k), and from (18) we know that u_\infty(x_k) = u^*(x_k). This completes the proof.

4. Data-based SVI method for time delay systems

In this section, a novel data-based SVI algorithm for DT linear systems with multiple delays is proposed. An estimator is employed to reconstruct the states from the input and output database. In addition, the corresponding data-based Bellman equation is also given in this section.
4.1. State estimator utilizing the historical database

The DT linear time delay system (12) can be rewritten as

\hat{x}_k = [\bar{A}_0\ \bar{A}_1\ \cdots\ \bar{A}_{\tau_\alpha^*}] \begin{bmatrix} \hat{x}_{k-1} \\ \hat{x}_{k-2} \\ \vdots \\ \hat{x}_{k-\tau_\alpha^*-1} \end{bmatrix} + [\bar{B}_0\ \bar{B}_1\ \cdots\ \bar{B}_{\tau_\beta^*}] \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-\tau_\beta^*-1} \end{bmatrix},
\hat{y}_k = [\bar{C}_0\ \bar{C}_1\ \cdots\ \bar{C}_{\tau_\gamma^*}] \begin{bmatrix} \hat{x}_{k-1} \\ \hat{x}_{k-2} \\ \vdots \\ \hat{x}_{k-\tau_\gamma^*-1} \end{bmatrix}.    (33)

For simplicity, equation (33) can be written compactly as

\hat{x}_k = \tilde{A} \hat{x}_{k-1,k-\tau_\alpha^*-1} + \tilde{B} u_{k-1,k-\tau_\beta^*-1},
\hat{y}_k = \tilde{C} \hat{x}_{k-1,k-\tau_\gamma^*-1},    (34)

in which \tilde{A} ∈ R^{q×q(\tau_\alpha^*+1)}, \tilde{B} ∈ R^{q×s(\tau_\beta^*+1)} and \tilde{C} ∈ R^{t×q(\tau_\gamma^*+1)}; \hat{x}_{k-1,k-\tau_\alpha^*-1} denotes the group of states over the time interval [k-1, k-\tau_\alpha^*-1].
Remark 3. There exists a full column rank matrix \hat{C} satisfying the following equation, which is equivalent to (34) on the time interval [k - t, k - 1]:

\hat{y}_{k-1,k-t} = \hat{C} \hat{x}_{k-1,k-\tau_\gamma^*-t},    (35)

where \hat{y}_{k-1,k-t} = [y_{k-1}, y_{k-2}, \cdots, y_{k-t}]^T; the proof is given in [32].

Theorem 2. Since equation (35) holds, the state equation in (34) has the following expression:

\hat{x}_k = M_1 y_{k-1,k-N-1} + \tilde{B} u_{k-1,k-N-1},    (36)

in which M_1 = \tilde{A} \hat{C}^- and \hat{C} is defined as in equation (35); \hat{C}^- denotes the left inverse of \hat{C}, i.e., \hat{C}^- = (\hat{C}^T \hat{C})^{-1} \hat{C}^T. A minimal numerical sketch of this state reconstruction is given below; the proof follows.
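The sketch assumes that the stacked matrices \tilde{A}, \tilde{B} and \hat{C} are already available (e.g., identified offline as in [32]); the helper name is hypothetical.

```python
import numpy as np

def reconstruct_state(A_tilde, B_tilde, C_hat, y_window, u_window):
    """State reconstruction (36): x_hat_k = M1 @ y_{k-1,k-N-1} + B_tilde @ u_{k-1,k-N-1}.

    y_window, u_window : stacked output/input histories (newest entries first),
                         matching the column dimensions of C_hat and B_tilde.
    """
    # Left inverse of C_hat; C_hat has full column rank by Remark 3, so
    # pinv(C_hat) equals (C_hat^T C_hat)^{-1} C_hat^T.
    C_left_inv = np.linalg.pinv(C_hat)
    M1 = A_tilde @ C_left_inv
    return M1 @ y_window + B_tilde @ u_window
```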
Proof. The equivalent state equation (34) can be written as

\hat{x}_k = \tilde{A} \hat{x}_{k-1,k-\tau_\alpha^*-1} + \tilde{B} u_{k-1,k-N-1},    (37)

when \tau_\beta^* = N. Moreover, when t = N + 1, equation (35) can be transformed into

y_{k-1,k-N-1} = \hat{C} \hat{x}_{k-1,k-N-\tau_\gamma^*-1}.    (38)

Then, there must exist a matrix M satisfying

\tilde{A} = M \hat{C}.    (39)

The general solution of equation (39) is

M = \tilde{A} \hat{C}^- + \tilde{G}(I - \hat{C}\hat{C}^-) \triangleq M_1 + M_2,    (40)

in which \tilde{G} is an arbitrary matrix. Combining equations (38)-(40), we obtain

\tilde{A} \hat{x}_{k-1,k-\tau_\alpha^*-1} = (M_1 + M_2) y_{k-1,k-N-1} = (M_1 + M_2) \hat{C} \hat{x}_{k-1,k-N-\tau_\gamma^*-1}.    (41)

Since M_2 \hat{C} = \tilde{G}(I - \hat{C}\hat{C}^-)\hat{C} = 0, equation (41) reduces to

\tilde{A} \hat{x}_{k-1,k-\tau_\alpha^*-1} = M_1 \hat{C} \hat{x}_{k-1,k-N-\tau_\gamma^*-1} = M_1 y_{k-1,k-N-1}.    (42)
Finally, substituting (42) into (37) yields equation (36). This completes the proof.

4.2. Data-based SVI algorithm

According to Theorem 2, equation (36) can be written as

\hat{x}_k = [\tilde{B}\ \ M_1] \begin{bmatrix} u_{k-1,k-N-1} \\ y_{k-1,k-N-1} \end{bmatrix} = [\tilde{B}\ \ M_1] z_{k-1,k-N-1},    (43)

where z_{k-1,k-N-1} = [u_{k-1,k-N-1}^T\ \ y_{k-1,k-N-1}^T]^T. The performance index function V(x_k) can always be expressed in the quadratic form
V(x_k) = x_k^T P x_k.    (44)
˜T B ˜ M1 ]zk−1,k−N −1 P [B M1T T ˜ PB ˜ B ˜ T P M1 B T =zk−1,k−N −1 ˜ M T P M1 zk−1,k−N −1 MT P B T =zk−1,k−N −1
(45)
1 1 T =zk−1,k−N −1 P zk−1,k−N −1 .
and the Bellman equation: T T ∗T ∗ T zk−1,k−N −1 P zk−1,k−N −1 = xk Qxk + uk Ruk + θzk−1,k−N P zk−1,k−N .
(46)
T Then partition θzk−1,k−N P zk−1,k−N as:
T θP 11 θP 12 θP 13 u∗k T P zk−1,k−N = uk−1,k−N θP 21 θP 22 θP 23 uk−1,k−N . θzk−1,k−N yk−1,k−N yk−1,k−N θP 31 θP 32 θP 33 (47) According to optimality condition, which is ∂V ∗ (xk )/∂uk = 0, one has u∗k
0 = Ru∗k + θP 11 u∗k + θP 12 uk−1,k−N + θP 13 yk−1,k−N .
(48)
And we can get the optimal control that can minimize the value function u∗k = −(R + θP 11 )−1 × (θP 12 uk−1,k−N + θP 13 yk−1,k−N ).
(49)
It follows from equation (49) that the optimal control u∗k is calculated by discount factor θ, P 11 , P 12 , P 13 and history input/output data. Thus, solving the optimal control problem could be transformed into solving the matrix P . The specific derivation and algorithm are concluded as follows. i+1 The matrix P can be obtained from the following equation: T zk−1,k−N −1 P
i+1
i
i T zk−1,k−N −1 = xTk Qxk + uiT k Ruk + zk−1,k−N P zk−1,k−N . (50)
Therefore, we can rewrite equation (50) as stk(P
i+1
) =[zk−1,k−N −1 ⊗ zk−1,k−N −1 ]−1
i
i T × (xTk Qxk + uiT k Ruk + zk−1,k−N P zk−1,k−N )
15
(51)
Algorithm 1 Data-based SVI algorithm for time delay systems 0
Step 1: Initiate the algorithm with an arbitrary control u0k and set P = 0. i+1 from Step 2: Value update: solve P stk(P
i+1
) =[zk−1,k−N −1 ⊗ zk−1,k−N −1 ]−1
i
i T × (xTk Qxk + uiT k Ruk + zk−1,k−N P zk−1,k−N ).
(52)
Step 3: Update the control strategy using i+1
i+1
i+1
ui+1 = −(R + θP 11 )−1 × (θP 12 uk−1,k−N + θP 13 yk−1,k−N ). k i+1
(53)
i
− P k ≤ ε (ε is a positive real constant Stop the iteration when kP that small enough). Otherwise go back to Step 2. In algorithm 1, persistent excitation signal must be introduced into the control input to ensure that the system can be explored integral. In other words, sufficient numerous excitation signals can guarantee [zk−1,k−N −1 ⊗ zk−1,k−N −1 ] is invertible in every iteration step. Remark 4. The utilization of discount factor θ is mainly to suppress the impact of disturbances at subsequent times on current performance. Therefore, the presented SVI method can ensure the convergency of the system under persistent excitation when comparing with the traditional VI method. If the discount factor θ is close to 1, the system may be concussion or divergence for the existence of disturbances with some arbitrary initial controls. When θ is small or even close to zero, the convergence speed could be too slow. In this section, we present a data-based SVI algorithm for solving the DT linear optimal control problem with multiple time delays. In the next section, two numerical simulations are given to demonstrate the effectiveness of the algorithm that we proposed. 5. Simulation 5.1. The second-order time-delay system Premeditate a second-order time-delay system described as: xk+1 =A0 xk + A1 xk−1 + A2 xk−2 + B0 uk yk =C0 xk + C1 xk−1 , 16
(54)
[Figure 1: Convergence curves of \bar{P}_{11}, \bar{P}_{12}, \bar{P}_{13} and \bar{P}_{14} versus iteration number.]
The matrices Q and R are identity matrices with proper dimensions. To satisfy conditions (1)-(3), select

F[\phi] = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \phi = \begin{bmatrix} 1 & \phi \\ 0 & 1 \end{bmatrix}.

Then, setting \theta = 0.8 and \varepsilon = 0.02, the matrix \bar{P} can be obtained using Algorithm 1. Moreover, the persistent excitation signal consists of sine and cosine signals,

d_T = 0.011 e^{-0.00001 T} \big( \sin(100T)\cos(100T) + \sin(2T)\cos(0.1T) + \sin(-1.2T)\cos(0.5T) + \sin(T) + \sin(1.12T) + \cos(2.4T) + \sin(2.4T) \big),

in which e is the natural constant and T = 1150 is the number of excitation signals. We then obtain the following simulation results. In Figure 1, the convergence processes of \bar{P}_{11}, \bar{P}_{12}, \bar{P}_{13} and \bar{P}_{14} in the matrix \bar{P} are presented. We can observe that the curves converge to the optimal values at the 10th step. Figure 2 shows the control curve of u_k under the optimal weights calculated by Algorithm 1. In Figure 3, the state curves of x_1 and x_2 are displayed. Obviously, the system states converge to zero at about the 20th time step under the optimal control u_k^*. A Python sketch of this simulation setup is given below.
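The sketch below sets up system (54) and the excitation signal described above and records the input/output database; the identification of \bar{P} via Algorithm 1 (regressor assembly and least squares) is omitted for brevity, and the loop structure itself is an assumption of the sketch.

```python
import numpy as np

# System (54): matrices as listed above
A = [np.array([[1.2, -0.9], [1.0, 0.0]]),
     np.array([[-1.0, 1.1], [0.0, 1.0]]),
     np.array([[0.0, -1.0], [0.0, 0.0]])]
B0 = np.array([[2.0], [1.0]])
C = [np.array([[-0.8, 0.0]]), np.array([[0.0, -0.5]])]

def excitation(T):
    """Persistent excitation signal d_T used in Section 5.1."""
    return 0.011 * np.exp(-1e-5 * T) * (np.sin(100 * T) * np.cos(100 * T)
            + np.sin(2 * T) * np.cos(0.1 * T) + np.sin(-1.2 * T) * np.cos(0.5 * T)
            + np.sin(T) + np.sin(1.12 * T) + np.cos(2.4 * T) + np.sin(2.4 * T))

def step(x_hist, u):
    """One step of (54); x_hist = [x_k, x_{k-1}, x_{k-2}], newest first."""
    x_next = A[0] @ x_hist[0] + A[1] @ x_hist[1] + A[2] @ x_hist[2] + B0 @ u
    y = C[0] @ x_hist[0] + C[1] @ x_hist[1]
    return x_next, y

# Collect an input/output database under excitation (state history starts at zero)
x_hist = [np.zeros((2, 1))] * 3
data_u, data_y = [], []
for T in range(1, 1151):
    u = np.array([[excitation(T)]])
    x_next, y = step(x_hist, u)
    data_u.append(u)
    data_y.append(y)
    x_hist = [x_next] + x_hist[:2]
```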
[Figure 2: Curve of the approximate optimal control u_k^* versus time k.]
[Figure 3: Curves of the system states x_1 and x_2 versus time k.]
5.2. The third-order time-delay system

Consider a third-order time-delay system whose parameters are given as follows:

A_0 = \begin{bmatrix} 0 & -1 & -1 \\ 1 & 0 & 0.5 \\ 0 & 0.5 & 0.5 \end{bmatrix}, \quad A_1 = \begin{bmatrix} 0 & 0 & -1 \\ 0.5 & 0 & -1 \\ 0 & 0 & 0 \end{bmatrix}, \quad B_0 = \begin{bmatrix} 1 \\ 0 \\ 0.5 \end{bmatrix}, \quad C_0 = [0\ \ 1\ \ 0], \quad C_1 = [0\ \ 0\ \ 1].
[Figure 4: Convergence curves of \bar{P}_{11}, \bar{P}_{12}, \bar{P}_{13}, \bar{P}_{14}, \bar{P}_{15} and \bar{P}_{16} versus iteration number.]
The matrices Q and R are identity matrices with proper dimensions. To satisfy conditions (1)-(3), select

F[\phi] = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 0 & 0 \end{bmatrix} \phi = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -\phi \\ 0 & 0 & 1 \end{bmatrix}.

Then, setting \theta = 0.2 and \varepsilon = 0.8, the matrix \bar{P} can be obtained using Algorithm 1 with the persistent excitation signal

d_T = 0.01 \big( \sin^2(100T)\cos(100T) + \sin^2(2T)\cos(0.1T) - \sin^2(1.2T)\cos(0.5T) + \sin^5(T) + \sin^2(1.12T) + \cos(2.4T)\sin^3(2.4T) \big),

in which T = 2850 is the number of excitation signals. Eventually, we obtain the following simulation results. In Figure 4, the convergence processes of \bar{P}_{11}, \bar{P}_{12}, \bar{P}_{13}, \bar{P}_{14}, \bar{P}_{15} and \bar{P}_{16} in the matrix \bar{P} are presented. We can observe that the curves converge to the optimal values at about the 8th step. Figure 5 shows the control curve of u_k under the optimal weights calculated by Algorithm 1. In Figure 6, the state curves of x_1, x_2 and x_3 are displayed. Obviously, the system states converge to zero at about the 35th time step under the optimal control u_k^*.
[Figure 5: Curve of the approximate optimal control u_k^* versus time k.]
[Figure 6: Curves of the system states x_1, x_2 and x_3 versus time k.]
6. Conclusions

In this paper, a novel data-based SVI optimal control scheme has been presented to deal with linear DT systems with multiple time delays. The optimal control strategy is obtained by utilizing an ADP-based estimator with only historical input and output data when the system dynamics are completely or partially unknown. A corresponding equivalent delay-free system notion has been proposed by analyzing the features of the time delay system, so that the optimal control for systems with multiple time delays can be obtained indirectly. After that, four equivalence conditions have been deduced between the two related systems. The convergence of the proposed SVI algorithm with discount factor has been derived on the basis of optimal control principles, and the method has been proved to converge to the optimal values for a proper discount factor. At last, two numerical simulations have been given, and the results illustrate that the presented data-based SVI method is effective.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61627809, 61433004, 61621004, 61673100), the Liaoning Revitalization Talents Program (XLYC1801005), and the Fundamental Research Funds for the Central Universities (N150504011).

References

[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control, Hoboken, NJ, USA: Wiley, 2012.
[2] H. Zhang, L. Cui and Y. Luo, Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP, IEEE Transactions on Systems, Man, and Cybernetics: Systems 43 (1) (2013) 206-216.
[3] H. Zhang, T. Feng, H. Liang, and Y. Luo, LQR-based optimal distributed cooperative design for linear discrete-time multiagent systems, IEEE Transactions on Neural Networks and Learning Systems 28 (3) (2017) 599-611.
[4] Y. Wang, X. Sun, P. Shi and J. Zhao, Input-to-state stability of switched nonlinear systems with time delays under asynchronous switching, IEEE Transactions on Systems, Man, and Cybernetics: Systems 43 (6) (2013) 2261-2265.
[5] D. Wang, D. Liu, H. Li, B. Luo, and H. Ma, An approximate optimal control approach for robust stabilization of a class of discrete-time nonlinear systems with uncertainties, IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (5) (2016) 713-717.
[6] R. Song, F. L. Lewis, Q. Wei, and H. Zhang, Off-policy actor-critic structure for optimal control of unknown systems with disturbances, IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (5) (2016) 1041-1050.
[7] Z. Wu, Y. Wu, Z. Wu and J. Liu, Event-based synchronization of heterogeneous complex networks subject to transmission delays, IEEE Transactions on Systems, Man, and Cybernetics 48 (12) (2018) 2126-2134.
[8] H. Zhang, J. Zhang, G. Yang, et al., Leader-based optimal coordination control for the consensus problem of multiagent differential games via fuzzy adaptive dynamic programming, IEEE Transactions on Fuzzy Systems 23 (1) (2015) 152-163.
[9] D. Eller, J. Aggarwal and H. Banks, Optimal control of linear time-delay systems, IEEE Transactions on Automatic Control 14 (6) (1969) 678-687.
[10] H. Zhang, L. Li, J. Xu, and M. Fu, Linear quadratic regulation and stabilization of discrete-time systems with delay and multiplicative noise, IEEE Transactions on Automatic Control 60 (40) (2015) 2599-2613.
[11] T. Ma, Synchronization of multi-agent stochastic impulsive perturbed chaotic delayed neural networks with switching topology, Neurocomputing 151 (2015) 1392-1406.
[12] T. Ma, F. L. Lewis, and Y. Song, Exponential synchronization of nonlinear multiagent systems with time delays and impulsive disturbances, International Journal of Robust and Nonlinear Control 26 (8) (2016) 1615-1631.
[13] H. Zhang, Y. Luo, D. Liu, Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Transactions on Neural Networks and Learning Systems 20 (9) (2009) 1490-1503.
[14] H. Zhang, Q. Wei, D. Liu, An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games, Automatica 47 (1) (2011) 207-214.
[15] Q. Wei, F. L. Lewis, Q. Sun, P. Yan, and R. Song, Discrete-time deterministic Q-learning: A novel convergence analysis, IEEE Transactions on Systems, Man, and Cybernetics: Systems 47 (5) (2017) 1224-1237.
[16] F. Wang, N. Jin, D. Liu and Q. Wei, Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound, IEEE Transactions on Neural Networks and Learning Systems 22 (1) (2011) 24-36.
[17] L. Cui, H. Zhang, B. Chen and Q. Zhang, Asymptotic tracking control scheme for mechanical systems with external disturbances and friction, Neurocomputing 73 (2010) 1293-1302.
[18] Q. Wei, G. Shi, R. Song, and Y. Liu, Adaptive dynamic programming-based optimal control scheme for energy storage systems with solar renewable energy, IEEE Transactions on Industrial Electronics 64 (7) (2017) 5468-5478.
[19] K. G. Vamvoudakis and F. L. Lewis, Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem, Automatica 46 (5) (2010) 878-888.
[20] K. Sun, Y. Li, and S. Tong, Fuzzy adaptive output feedback optimal control design for strict-feedback nonlinear systems, IEEE Transactions on Systems, Man, and Cybernetics: Systems 47 (1) (2017) 33-44.
[21] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, London, U.K.: Springer-Verlag, 2012.
[22] G. Xiao, H. Zhang, Q. Qu, et al., General value iteration based single network approach for constrained optimal controller design of partially-unknown continuous-time nonlinear systems, Journal of the Franklin Institute 355 (5) (2018) 2610-2630.
[23] X. Zhong, H. He, D. Wang and Z. Ni, Model-free adaptive control for unknown nonlinear zero-sum differential game, IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (5) (2018) 1633-1646.
[24] Q. Wei and D. Liu, Data-driven neuro-optimal temperature control of water-gas shift reaction using stable iterative adaptive dynamic programming, IEEE Transactions on Industrial Electronics 61 (2014) 6399-6408.
[25] Y. Liu, H. Zhang, Y. Luo, and J. Han, ADP based optimal tracking control for a class of linear discrete-time system with multiple delays, Journal of the Franklin Institute 353 (9) (2016) 2117-2136.
[26] C. Chen, H. Modares, K. Xie, et al., Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics, IEEE Transactions on Automatic Control, doi: 10.1109/TAC.2019.2905215.
[27] Q. Wei, D. Wang, and D. Zhang, Dual iterative adaptive dynamic programming for a class of discrete-time nonlinear systems with time delays, Neural Computing and Applications 23 (7-8) (2013) 1851-1863.
[28] R. Song, W. Xiao, and Q. Wei, Multi-objective optimal control for a class of nonlinear time-delay systems via adaptive dynamic programming, Soft Computing 17 (11) (2013) 2109-2115.
[29] R. Song, Q. Wei and Q. Sun, Nearly finite-horizon optimal control for a class of nonaffine time-delay nonlinear systems based on adaptive dynamic programming, Neurocomputing 156 (2015) 166-175.
[30] J. Zhang, H. Zhang, B. Wang, and T. Cai, Nearly data-based optimal control for linear discrete model-free systems with delays via reinforcement learning, International Journal of Systems Science 47 (7) (2014) 1563-1573.
[31] J. Zhang, H. Zhang, Y. Luo, and T. Feng, Model-free optimal control design for a class of linear discrete-time systems with multiple delays using adaptive dynamic programming, Neurocomputing 135 (2014) 163-170.
[32] H. Zhang, Y. Liu, G. Xiao and H. Jiang, Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017), doi: 10.1109/TSMC.2017.2758849.
[33] A. Gárate-García, L. A. Márquez-Martínez and C. H. Moog, Equivalence of linear time-delay systems, IEEE Transactions on Automatic Control 56 (3) (2010) 666-670.
[34] D. Liu, H. Li and D. Wang, Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm, Neurocomputing 110 (2013) 92-100.
He Ren received the B.S. degree in automation control in 2013 and the M.S. degree in control theory and control engineering in 2016 from Northeast Dianli University, Jilin, China. He has been pursuing the Ph.D. degree at Northeastern University, Shenyang, China, since 2016. His current research covers neural adaptive dynamic programming, neural networks, optimal control and their industrial applications.
Huaguang Zhang (M'03, SM'04, F'14) received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural-networks-based control, nonlinear control, and their applications. He has authored and coauthored over 280 journal and conference papers and six monographs, and has co-invented 90 patents. Dr. Zhang is a Fellow of IEEE, the E-letter Chair of the IEEE CIS Society, and the former Chair of the Adaptive Dynamic Programming and Reinforcement Learning Technical Committee of the IEEE Computational Intelligence Society. He is an Associate Editor of Automatica, IEEE Transactions on Neural Networks, IEEE Transactions on Cybernetics, and Neurocomputing. He was an Associate Editor of IEEE Transactions on Fuzzy Systems (2008-2013). He was awarded the Outstanding Youth Science Foundation Award from the National Natural Science Foundation Committee of China in 2003, and was named a Cheung Kong Scholar by the Education Ministry of China in 2005. He is a recipient of the IEEE Transactions on Neural Networks 2012 Outstanding Paper Award and of the Andrew P. Sage Best Transactions Paper Award 2015.
Hanguang Su received the B.S. degree in automation control and the M.S. degree in control engineering from Northeastern University, Shenyang, China, in 2013 and 2015, respectively, where he is currently pursuing the Ph.D. degree in control theory and control engineering with the School of Information Science and Engineering. His current research interests include reinforcement learning, optimal control, fuzzy control, adaptive dynamic programming, and their applications.
Yunfei Mu received the B.S. degree in Mathematics and Applied Mathematics in 2015 from Jilin Normal University, Siping, China. He received the M.S. degree in Operational Research and Cybernetics from Northeastern University, Shenyang, China, in 2018. He has been pursuing the Ph.D. degree at Northeastern University, Shenyang, China, since 2018. His current research covers singular systems, observer design, fault tolerant control, optimal control and their industrial applications.
Declaration of interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.