Robust min–max optimal control design for systems with uncertain models: A neural dynamic programming approach



Journal Pre-proof Robust min-max optimal control design for systems with uncertain models: A neural dynamic programming approach Mariana Ballesteros, Isaac Chairez, Alexander Poznyak

PII: S0893-6080(20)30018-6
DOI: https://doi.org/10.1016/j.neunet.2020.01.016
Reference: NN 4379

To appear in: Neural Networks

Received date: 18 June 2019
Revised date: 7 January 2020
Accepted date: 14 January 2020

Please cite this article as: M. Ballesteros, I. Chairez and A. Poznyak, Robust min-max optimal control design for systems with uncertain models: A neural dynamic programming approach. Neural Networks (2020), doi: https://doi.org/10.1016/j.neunet.2020.01.016.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier Ltd. All rights reserved.


Robust Min-Max Optimal Control Design for Systems with Uncertain Models: a Neural Dynamic Programming Approach⋆


Mariana Ballesterosᵃ, Isaac Chairezᵇ,∗, Alexander Poznyakᵃ

ᵃ Department of Automatic Control, CINVESTAV-IPN, Mexico City
ᵇ Department of Bioprocesses, UPIBI-Instituto Politécnico Nacional, Mexico City

Abstract


The design of an artificial neural network (ANN) based sub-optimal controller that solves the finite-horizon optimization problem for a class of uncertain systems is the main outcome of this study. The optimization problem considers a convex performance index in the Bolza form. The dynamic uncertain restriction is a linear system affected by modeling uncertainties as well as by external bounded perturbations. The proposed controller implements a min-max approach based on an approximate solution obtained by neural dynamic programming. An ANN approximates the Value function to estimate the solution of the Hamilton-Jacobi-Bellman (HJB) equation. The explicit adaptive law for the weights in the ANN is obtained from the approximation of the HJB solution. A stability analysis based on Lyapunov theory confirms that the approximate Value function serves as a Lyapunov function candidate and establishes the practical stability of the equilibrium point. A simulation example illustrates the characteristics of the sub-optimal controller. The comparison of the performance indexes obtained with different controllers evaluates the effect of the perturbations and of the sub-optimal solution.


Keywords: Approximate dynamic-programming; Artificial Neural Networks; Hamilton-Jacobi-Bellman equation; Bellman function; Sub-optimal controller.

1. Introduction

1.1. Brief survey


1.1.1. Optimal control for systems with exact mathematical models

Optimal control theory has the main objective of designing the input function for a dynamic system in such a way that a given performance criterion is optimized. Consumed energy, the norm of the tracking error, or a prescribed convergence time, among others, can be formalized as the performance criterion. It is widely accepted that Dynamic

⋆ Mariana Ballesteros is sponsored by a Mexico scholarship from the National Council for Science and Technology (Scholar reference: 550803).
∗ Corresponding author.
Email addresses: [email protected] (Mariana Ballesteros), [email protected] (Isaac Chairez), [email protected] (Alexander Poznyak)



Programming (DP) and the Maximum Principle (Sage, 1968; Kirk, 2004; Liberzon, 2012) provide the fundamentals for calculating optimal controllers. The DP theory produces optimal controllers in closed-loop form, a characteristic that makes them especially useful within the automatic control area. The potential solution coming from an optimal control approach requires three principal parts: 1) an accurate mathematical model of the system to be controlled, 2) the performance criterion (usually known as the cost functional) to be optimized, and 3) the set of state and/or control restrictions for the system (Bryson, 1975). In addition, optimal control problems in most practical applications consider the optimization of the performance index over an infinite horizon (Palanisamy et al., 2015). Infinite-horizon controls are preferred because the gain of the optimal controller depends on the solution of algebraic matrix equations (the Riccati equation being the most common of such matrix relationships). Nevertheless, if the optimal control problem considers a finite-time horizon (which corresponds to many real control design problems), time-varying gains normally modify the control action (see, for example, the first four chapters in Poznyak & Boltyanski (2012)). Even in these challenging cases, there are remarkable results in optimal control theory, but the necessity of having the exact mathematical model of the system cannot be neglected. One of the major difficulties in solving optimal control problems lies in the assumption that the precise mathematical model is available, which appears to be unfeasible for most real plants. Indeed, the application of optimal control solutions to diverse real systems is limited by the lack of sufficiently accurate mathematical models.

If the system to be controlled is uncertain and/or affected by external perturbations, the classical optimal control approaches cannot be applied straightforwardly (Bertsekas, 1995). Robust controllers appeared as an option to deal with modeling uncertainties and the effect of bounded perturbations (see Chapters 5-17 in Poznyak & Boltyanski (2012)).

1.1.2. Robust optimal control for systems with uncertain mathematical models

Robust controllers aim to enforce the existence of a suitable practical equilibrium point that solves the stabilization and trajectory tracking problems for uncertain systems (including the class of systems considered in this study). There are several theories for designing robust controls, such as H∞ guaranteed cost control (Xue Anke et al., 2001; Modares et al., 2014b), the attractive ellipsoid method (Azhmyakov, 2011; Poznyak et al., 2014; Azhmyakov et al., 2019), sliding-mode control (Utkin et al., 2009; Saad et al., 2019; Edwards & Spurgeon, 1998) and Artificial Neural Networks (ANN) (Yang et al., 2015; Tang et al., 2016; Patan, 2018; Wang et al., 2018), among others. The combination of robust and optimal control theories is uncommon because they deal with systems having dissimilar characteristics. Under some given conditions on the uncertain system to be regulated, a degree of robustness has been incorporated in the design of optimal controllers; such approaches are usually known as realizable sub-optimal control methods. Systems with parametric uncertainties, multimodel representations (Poznyak & Boltyanski, 2012), or bounded perturbations (Mulje & R.M.Nagarele, 2016) have been controlled sub-optimally. In general, sub-optimal robust approaches try to compensate the unknown part of the system exactly, but the effect of the uncertainties on the optimization criterion has not been fully investigated. Other approaches use the optimal control estimated by DP theory, but assume that the system is not perturbed.


The degree of robustness is evaluated using the variation of the performance index obtained by neglecting the uncertain part of the system. Independently of the uncertainties in the model, the application of the main tools of DP theory leads to the design of the controller gains using the solution of a partial differential equation known as the Hamilton-Jacobi-Bellman (HJB) equation (Bertsekas, 1995). The HJB equation may have an analytic solution only if an exact mathematical representation of the plant is available and no perturbations affect the dynamics. The HJB equation has a continuous and differentiable solution only in some specific cases. Indeed, the effect of perturbations or uncertainties on the solution of the HJB equation has been analyzed in just a small number of studies (Lewis & Vrabie, 2009; Masmoudi et al., 2011). Furthermore, the stability (in some well-defined sense) of the equilibrium point, enforced by the application of the sub-optimal controller, has not been clearly explained yet. One way to overcome the problem of finding a solution for the HJB equation (maybe not exact, but close enough) is the Approximate Dynamic Programming (ADP) methodology (Powell, 2007; Lewis & Liu, 2013; Jiang & Jiang, 2014). This technique, aimed at attaining a justifiable approximation of the HJB solution, uses diverse methods of approximation. For example, in Beard et al. (1997), the infinite-horizon control design was realized through a suitable approximation of the so-called generalized HJB equation, implementing the Galerkin discretization technique. An alternative way to get a feasible solution for the optimal control problem is the concept of state-dependent Riccati equations (Cloutier & Cockburn, 2001; Çimen, 2008; Heydari & Balakrishnan, 2013). The studies developed in Huang et al. (2000) and Wang et al. (2003) used the method of characteristics, yielding an analytic solution linearly parametrized by some non-unique constant scalars; this work showed the handling of perturbations using non-explicit closed-loop forms. In other studies, the approximation uses ANN structures without internal feedbacks (Heydari & Balakrishnan (2013), Kim et al. (2000), Cheng et al. (2007), Yang & He (2018), among others). The aforementioned method is known as Neuro-Dynamic Programming (NDP), which uses the artificial intelligence technique known as Reinforcement Learning (RL). RL uses iterative algorithms to obtain smaller approximation errors with respect to the optimal solutions, which are also referred to as cost-to-go, Q-factors, or optimal policy. In most of such cases, the DP approach suffers from the curse of dimensionality. NDP has been successfully applied to solve optimization problems with a large number of states and/or complex dynamics (Bertsekas & Tsitsiklis, 1996). NDP is a technique that may find optimal or sub-optimal solutions based on the principle of unsupervised or semi-supervised learning. The NDP technique has been used in different applications. In Wang & Qiao (2019), the approximate optimal control design is described for a class of non-affine systems, with application to a torsional pendulum device; a class of robustness in the design is included in Wang (2019) for systems disturbed by uncertain factors.

1.2. Main contribution of the paper

The NDP may apply diverse learning techniques, such as reinforcement learning (RL), adaptive critics, integral RL, as well as the reinforcement Q-scheme, which are classified as unsupervised methods (Lewis & Vrabie, 2009; Vamvoudakis et al., 2014; Kiumarsi et al., 2014; S.Sutton & Barto, 2012). In particular, NDP has been used to solve finite-horizon optimal problems in systems affected by admissible


uncertainties (Modares et al., 2014a; Vrabie & Lewis, 2009; Masmoudi et al., 2011; Jiang & Jiang, 2014). Therefore, based on all the aforementioned works, a closed-loop solution for a finite-horizon optimal control problem can be obtained by the DP approach. However, for the class of systems tackled in this study, a robustness property must be included in the design. Furthermore, considering the uncertainties and the theoretical findings of ADP, and in particular of NDP, ANN can be used for the approximation in this class of systems. The main novelties and contributions of this paper consist of the following aspects:

• the exact HJB equation with the max-min form for the class of quasi-Lipschitz uncertain dynamics is obtained, and the final exact analytic expression for this max-min Hamiltonian is found;

• a different ANN structure for the NDP is suggested to approximate the exact solution of the max-min HJB equation, justifying the suggested finite-series ANN approximation and including the robustness property analysis;


• the learning law for the adjustment of the NDP weights is given in the explicit continuous-time form of ordinary differential equations (ODE);

• the workability analysis (the ultimate boundedness of the approximation error) is carried out using Lyapunov-like stability theory, characterizing the class of control-design parameters;


• the obtained theoretical results are also confirmed by the presented numerical examples.

1.3. Structure of the paper


The rest of this manuscript is organized as follows:
- Section II details the problem statement;
- Section III defines the specific aspects of the robust-optimal approach and presents the Theorem on the min-max HJB equation;
- Section IV introduces the ANN topology that establishes the approximation of the Value function;
- Section V details the convergence of the obtained procedure; the NDP characterization is summarized in the main Theorem;
- Section VI presents some aspects of the numerical solution for the NADP used to realize the robust optimal control;
- Section VII closes the paper with some conclusions associated with this work.

2. System Dynamics and Problem Statement

The class of studied uncertain systems satisfies:


ẋ(t) = [A0 + ∆A(x,t)]x(t) + B0 u(t) + ∆B(x,t),   t ∈ [s, T];   x(s) = y.   (1)

Here, the vector x ∈ Rn defines the system state, and the vector u ∈ Uadm ⊂ Rm corresponds to the control, where Uadm is the set of admissible controllers, which are piecewise continuous on the interval [s, T]. Both


matrices A0 ∈ Rn×n and B0 ∈ Rn×m are constant and known, and the column rank of B0 equals m. The state-associated parametric uncertainty ∆A : Rn × R+ → Rn×n depends on both time and state, while ∆B : Rn × R+ → Rn represents the external perturbations, which are unknown. In this study, we assume that the non-perturbed system in (1), that is, (1) with ∆A = 0 and ∆B = 0, is stabilizable. The state-associated uncertain part ∆A is absolutely bounded (entry by entry), while ∆B(x,t) is continuous. The class of systems in (1) includes a wide range of potential nonlinear plants; the so-called parametric robustness control design has used this class of systems for many years (Yu, 1983; Gu et al., 1990; Qu, 1993). The uncertain parts ∆A and ∆B belong to a given admissible set characterized as follows:


Assumption 1. The set of admissible uncertainties for (4) is:

Ψadm := { η = [∆A(x,t), ∆B(x,t)]ᵀ : tr(∆Aᵀ(x,t)∆A(x,t)) ≤ α², tr(∆Bᵀ(x,t)∆B(x,t)) ≤ β² },   (2)

where t ∈ [s, T], s ≥ 0, ∀x ∈ Rn, η ∈ Rn×(n+1).

The selected type of uncertain dynamics included in (1) can describe different plants, such as biotechnological, mechanical, and chemical systems, among others (Spong et al., 2006).

Problem statement. The main aim of this study is to design the controller u∗ ∈ Uadm ⊆ Rm such that its application to (1) minimizes the proposed performance criterion J, subject to the class of uncertainties included in Ψadm. The proposed design applies the min-max optimization methodology, with the minimization running over the admissible control set Uadm and the maximization calculated with respect to the uncertainties. In consequence, the value of η∗ ∈ Ψadm that maximizes the Hamiltonian should be calculated exactly. The controller should then include a given degree of robustness with respect to the bounded perturbations and the parametric uncertainties included in Ψadm. The cost functional has the so-called Bolza form:

J(s, y; u(·)) := h0(x(T)) + ∫_{t=s}^{T} h(x(t), u(t)) dt.   (3)

The functional (3) depends on the state, the control signal, and the evolution time (Poznyak & Boltyanski, 2012). The specific structure of the function h : Rn × Uadm → R corresponds to the following sum of quadratic forms:

h(x, u) = xᵀQx + uᵀRu.

Here Q ∈ Rn×n and R ∈ Rm×m are symmetric positive definite matrices. The continuous positive scalar function h0 : Rn → R defines the so-called terminal condition. Therefore, the design of the optimal controller is associated with the following min-max optimization problem for the criterion (3):

J(s, y; u(·)) → min_{u(·) ∈ Uadm[s,T]} max_{η ∈ Ψadm},   subject to (1).
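To make the min-max setting concrete, the uncertain dynamics (1) and the Bolza cost (3) can be evaluated numerically. The sketch below is illustrative only: the matrices A0, B0, Q, R, the terminal term h0(x) = xᵀx, the disturbance realizations, and the feedback law are hypothetical placeholders, not values taken from the paper.

```python
import numpy as np

# Hypothetical 2-state example (all values illustrative, not from the paper).
A0 = np.array([[0.0, 1.0], [-2.0, -1.0]])
B0 = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def simulate_cost(u_fn, x0, T=5.0, dt=1e-3):
    """Euler integration of x' = [A0 + dA(x,t)]x + B0 u + dB(x,t) as in (1),
    accumulating the Bolza cost J = h0(x(T)) + int (x'Qx + u'Ru) dt, h0(x) = x'x."""
    x = np.array(x0, dtype=float)
    J = 0.0
    for k in range(int(T / dt)):
        t = k * dt
        u = u_fn(t, x)
        dA = 0.1 * np.sin(t) * np.eye(2)          # bounded parametric uncertainty
        dB = 0.05 * np.array([np.cos(3 * t), 0])  # bounded external perturbation
        J += (x @ Q @ x + u @ R @ u) * dt         # running cost h(x, u)
        x = x + dt * ((A0 + dA) @ x + B0 @ u + dB)
    return J + x @ x                              # terminal term h0(x(T))

# A linear feedback as a placeholder admissible controller vs. the open loop.
J_fb = simulate_cost(lambda t, x: np.array([-x[0] - x[1]]), x0=[1.0, 0.0])
J_open = simulate_cost(lambda t, x: np.array([0.0]), x0=[1.0, 0.0])
```

Comparing such cost values for different controllers is exactly the evaluation performed in the simulation study of the paper; here the disturbance realizations are fixed, whereas the design maximizes over the whole admissible set Ψadm.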


3. Robust Optimal Control Design

The mathematical model (1) is equivalent to:

ẋ = A0 x + B0 u + ηᵀ [xᵀ, 1]ᵀ,   x0 = x(0), t ∈ [0, T].   (4)

Here, the dependence on time is omitted for the sake of simplicity. The application of the DP method entails finding a continuous and sufficiently smooth solution of the HJB equation evaluated on the trajectories of (4):

−∂V(t,x)/∂t + max_{u ∈ Uadm} min_{η ∈ Ψadm} H(−∂V(t,x)/∂x, x, u, η) = 0.   (5)

Here, the scalar function V(t,x) is called the value function. This formulation is a robustified version of the HJB equation in which the effect of the uncertainties is taken into account.

Definition 1. The function V is defined for any (s, y) ∈ [0, T) × Rn by:

V(s, y) := min_{u ∈ Uadm} max_{η ∈ Ψadm} J(s, y; u(·)).

The corresponding boundary condition for V is V(T, y) = h0(y). In (5), the scalar function H : Rn × Rn × Uadm × Ψadm → R is the Hamiltonian, defined as

H(ρ, x, u, η) := ρᵀẋ − h(x, u),

where ρ ∈ Rn is the adjoint variable. The first step in the optimal control design is finding η∗ that minimizes the Hamiltonian over the admissible class of uncertainties for both ∆A and ∆B. This process solves arg max_{η ∈ Ψadm} {J(t, x; u(·))}, according to the main theorem introduced in the following subsection. Considering only the terms containing η, the solution of this part of the problem is obtained as:

η∗ = argmin_{η ∈ Ψadm} { −(∂V(t,x)/∂x)ᵀ ηᵀ [xᵀ, 1]ᵀ }.   (6)

Here, ∂V(t,x)/∂x denotes the gradient of the value function V with respect to the state x. Equation (6) is equivalent to:

η∗ = argmin_{η ∈ Ψadm} ( −ηᵀ [xᵀ, 1]ᵀ, ∂V(t,x)/∂x ),   (7)

where (·,·) represents the inner product in the Euclidean space. The application of the Cauchy-Schwarz inequality to (7) leads to:

( −ηᵀ [xᵀ, 1]ᵀ, ∂V(t,x)/∂x ) ≥ −‖ηᵀ [xᵀ, 1]ᵀ‖ ‖∂V(t,x)/∂x‖.   (8)

Using the triangle inequality on (8) (with an adequate consistent norm in the space of matrices):

−‖ηᵀ [xᵀ, 1]ᵀ‖ ‖∂V(t,x)/∂x‖ ≥ −‖ηᵀ‖_F ‖[xᵀ, 1]ᵀ‖ ‖∂V(t,x)/∂x‖,


where the norm ‖ηᵀ‖_F = √tr(ηηᵀ) is compatible with the vector norm ‖·‖, with tr(ηηᵀ) = tr(ηᵀη) and tr(ηᵀη) = tr(∆Aᵀ∆A) + tr(∆Bᵀ∆B). Therefore, using the admissible set of uncertainties Ψadm, one gets:

( −ηᵀ [xᵀ, 1]ᵀ, ∂V(t,x)/∂x ) ≥ −‖∂V(t,x)/∂x‖ √(φ(‖x‖² + 1)),   (9)

where φ = α² + β², with α and β given in (2). Inequality (9) can be used to obtain the exact value of η∗ by taking into account only the equality:

η∗ = √φ (∂V(t,x)/∂x)[xᵀ, 1] ‖(∂V(t,x)/∂x)[xᵀ, 1]‖⁻¹.

The sub-optimal unconstrained control u∗ is obtained by direct differentiation of the HJB equation with respect to u:

u∗ = −(1/2) R⁻¹ B0ᵀ ∂V(t,x)/∂x.   (10)

Substituting both u∗ and η∗ into the HJB equation leads to:

−∂V(t,x)/∂t − (∂V(t,x)/∂x)ᵀ A0 x + (1/4)(∂V(t,x)/∂x)ᵀ B0 R⁻¹ B0ᵀ (∂V(t,x)/∂x) − ‖x‖²_Q − √(φ(‖x‖² + 1)) ‖∂V(t,x)/∂x‖ = 0.   (11)

The solution of the restricted optimization problem for (1) is attained by solving (11). This solution cannot be obtained analytically because (11) is nonlinear in the gradient ∂V(t,x)/∂x and in η∗. Nevertheless, Section IV describes an approximation of the solution of (11) based on the application of dynamic ANN.

3.1. Max-min HJB Equation

The DP method leads to sufficient conditions for the optimality of the designed controller selected from Uadm. This method rests on Bellman's optimality principle. The following theorem defines the form of the HJB equation for systems affected by both bounded perturbations and parametric uncertainties.

Theorem 1. Suppose that the function V is continuously differentiable with respect to both arguments t and x, and assume that it satisfies (11). Then, it is a feasible solution of the terminal optimization problem associated with the partial differential (HJB) equation (5), (t, x) ∈ (0, T] × Rn, satisfying the boundary condition V(T, x) = h0(x).

Proof. Consider a feasible control u ∈ Uadm and the following representation, supported by Bellman's optimality principle, for the so-called value function V(t, x):

V(s, y) = min_{u ∈ Uadm} max_{η ∈ Ψadm} { ∫_{t=s}^{ŝ} h(x, u, t) dt + V(ŝ, x(ŝ)) },   ∀ŝ ∈ [s, T].


Then, using the definition of the minimum operation, the following inequality is valid:

V(s, y) ≤ max_{η ∈ Ψadm} { ∫_{t=s}^{ŝ} h(x, u, t) dt + V(ŝ, x(ŝ)) }.

Considering that max(a1 + a2) ≤ max(a1) + max(a2) if a1 ≥ 0 and a2 ≥ 0, the next inequality is satisfied:

V(s, y) + min_{η ∈ Ψadm} { −∫_{t=s}^{ŝ} h(x, u, t) dt } + min_{η ∈ Ψadm} { −V(ŝ, x(ŝ)) } ≤ 0.

Multiplying by 1/(ŝ − s) and using the definition of the Value function, one may get:

(1/(ŝ − s)) [ min_{η ∈ Ψadm} { V(s, y) − V(ŝ, x(ŝ)) } + min_{η ∈ Ψadm} { −∫_{t=s}^{ŝ} h(x, u, t) dt } ] ≤ 0.

By the Mean Value Theorem, the following partial differential inequality for the Value function takes place:

−∂V(t,x)/∂t − (∂V(t,x)/∂x)ᵀ ( A0 x + B0 u + ηᵀ [xᵀ, 1]ᵀ ) + min_{η ∈ Ψadm} { −h(x, u, t) } ≤ 0.

Therefore:

−∂V(t,x)/∂t + max_{u ∈ Uadm} min_{η ∈ Ψadm} H( −∂V(t,x)/∂x, x, u, t ) ≤ 0.   (12)

Consequently, for any positive scalar ε > 0 and s close to ŝ, there exists a control u(·) := u_{ε,ŝ}(·) ∈ Uadm[s, T] satisfying:

max_{η ∈ Ψadm} { ∫_{t=s}^{ŝ} h(x, u, t) dt + V(ŝ, x_ŝ) } ≤ V(s, y) + ε(ŝ − s).

The Value function evaluated at (s, y) is independent of the perturbations; then, the following inequality is valid:

(1/(ŝ − s)) max_{η ∈ Ψadm} { ∫_{t=s}^{ŝ} h(x, u, t) dt + V(ŝ, x_ŝ) − V(s, y) } ≤ ε.   (13)

The left-hand side of (13) admits a lower bound obtained by taking the minimum operation instead of the maximum; hence, by the fundamental theorem of calculus, (13) can be replaced by:

−ε ≤ (1/(ŝ − s)) ∫_{t=s}^{ŝ} min_{η ∈ Ψadm} { H( −∂V(t,x)/∂x, x, u, η, t ) − ∂V(t,x)/∂t } dt.

Taking the maximum on the right-hand side with respect to the elements of the admissible control set yields:

−ε ≤ (1/(ŝ − s)) ∫_{t=s}^{ŝ} max_{u ∈ Uadm} min_{η ∈ Ψadm} { H( −∂V(t,x)/∂x, x, u, η, t ) − ∂V(t,x)/∂t } dt.

Letting ŝ ↓ s, then:

−ε ≤ max_{u ∈ Uadm} min_{η ∈ Ψadm} { H( −∂V(t,x)/∂x, x, u, η, t ) − ∂V(t,x)/∂t }.   (14)

From (12) and (14), when ε → 0, the Theorem is proven.
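The two closed-form expressions derived above, the worst-case uncertainty η∗ and the sub-optimal control u∗ in (10), reduce to a few lines of linear algebra once a value-function gradient is available. The sketch below assumes illustrative B0, R and φ and a stand-in gradient; only the two formulas themselves come from the text.

```python
import numpy as np

# Illustrative values (not from the paper): B0, R and phi = alpha**2 + beta**2.
B0 = np.array([[0.0], [1.0]])
R = np.array([[1.0]])
phi = 0.5

def worst_case_eta(grad_V, x):
    """eta* = sqrt(phi) * (dV/dx)[x^T, 1] / ||(dV/dx)[x^T, 1]|| (worst-case uncertainty)."""
    M = np.outer(grad_V, np.append(x, 1.0))   # (dV/dx)[x^T, 1], an n x (n+1) matrix
    return np.sqrt(phi) * M / np.linalg.norm(M)

def suboptimal_control(grad_V):
    """u* = -(1/2) R^{-1} B0^T dV/dx, eq. (10)."""
    return -0.5 * np.linalg.solve(R, B0.T @ grad_V)

x = np.array([1.0, -0.5])
grad_V = np.array([2.0, 1.0])          # stand-in for the value-function gradient
eta_star = worst_case_eta(grad_V, x)   # Frobenius norm equals sqrt(phi)
u_star = suboptimal_control(grad_V)    # -> array([-0.5]) for these numbers
```

By construction, ‖η∗‖_F = √φ, i.e., the maximizer saturates the bound of the admissible set Ψadm, which is exactly what the equality case of (9) expresses.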


4. Approximate solution of the HJB equation


The approximate solution uses an ANN structure that considers the specific properties required of the Value function (continuity, positiveness, and differentiability, at least with respect to its first argument). This strategy has been used in diverse control and estimation problems, as in (Vrabie & Lewis, 2009; Beard et al., 1998; Wang et al., 2014), among others. ANN have been recognized for their ability to approximate nonlinear continuous functions, at least locally (Hornik et al., 1989). This characteristic has served to justify the approximation of vector-valued static maps, which can even be time dependent, as well as multivariable functions. The approximation properties of ANN have been studied in several works, a fact exploited here. For example, the ANN structure needed to achieve a local approximation of a nonlinear mapping depends on the basis of activation functions, the number of layers, the configuration and, in general, the complexity of the net. In Barron (1993), approximation bounds are derived for a class of one-layer ANN with n sigmoidal nodes. In Kainen et al. (2012), the role of the input dimension is considered in the estimation of the approximation-error bounds, for structures with perceptron and Gaussian radial computational units. The quality of the approximation remains a subject of research: in Guliyev & Ismailov (2018), the bound of the approximation error for a single-hidden-layer feedforward ANN with fixed weights is analyzed. Based on all the aforementioned properties, ANN can be implemented as feasible approximations of the trajectories corresponding to well-defined solutions of either ordinary or partial differential equations. Several variants of such approximations have been proposed in the last decade to approximate the solutions of HJB equations.

Commonly, these approximate solutions did not meet the requested restrictions of the HJB equations. Moreover, the approximation ability associated with ANN is only justified if the number of basis components in the ANN grows unboundedly. This part of the article considers two main open problems in the ANN approximation of the HJB: first, how to justify correctly the properties requested in the approximation of the HJB; and second, how to characterize the effect of limiting the number of components in the basis associated with the network topology.

4.1. Weights adjustment for the ANN

The feasible approximation associated with (11) is given by:

V(t, x) = Va(t, x) + Ṽ(t, x).   (15)

Here, Va(t, x) represents an approximate solution of the HJB equation given by the following ANN structure:

Va(x, t) = ω(t)σ(x) + xᵀP(t)x,   (16)

and Ṽ : R+ × Rn → R+ corresponds to the approximation error produced by the finite number of activation functions in the ANN structure. In (16), the matrix P ∈ Rn×n is positive definite and bounded uniformly with respect to t. The structure of ω ∈ R+ and σ ∈ R+ is proposed as follows:

ω(t) = ω̃ᵀ(t)ω̃(t),   σ(x) = σ̃ᵀ(x)σ̃(x).
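The structure (16), with ω = ω̃ᵀω̃ and σ = σ̃ᵀσ̃ built from sigmoid units (Definition 2 below), can be evaluated directly; the squared forms keep the first term nonnegative by construction. A minimal sketch, in which the number of units r = 3, the unit parameters (g, ς, ϕ, c) and the weights are arbitrary placeholders:

```python
import numpy as np

def sigmoid_unit(x, g, varsigma, varphi, c):
    """sig(x) = g * (1 + exp(-varsigma^T (x - varphi)))^{-1} + c."""
    return g / (1.0 + np.exp(-varsigma @ (x - varphi))) + c

def V_approx(x, omega_tilde, P, units):
    """Va(x,t) = omega(t) sigma(x) + x^T P(t) x, eq. (16), with
    omega = omega_tilde^T omega_tilde and sigma = sigma_tilde^T sigma_tilde,
    so that Va >= 0 whenever P is positive definite."""
    sigma_tilde = np.array([sigmoid_unit(x, *u) for u in units])
    omega = omega_tilde @ omega_tilde     # scalar, nonnegative
    sigma = sigma_tilde @ sigma_tilde     # scalar, nonnegative
    return omega * sigma + x @ P @ x

# Illustrative r = 3 sigmoid units and weights (placeholders, not from the paper).
rng = np.random.default_rng(0)
units = [(1.0, rng.standard_normal(2), rng.standard_normal(2), 0.1) for _ in range(3)]
omega_tilde = np.array([0.3, -0.2, 0.5])
P = np.eye(2)

v0 = V_approx(np.zeros(2), omega_tilde, P, units)
v1 = V_approx(np.array([1.0, -1.0]), omega_tilde, P, units)
```

The design choice encoded here is the one claimed in the text: positiveness of the approximate value function is guaranteed structurally, not enforced during learning.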


The function ω̃ : R+ → Rr corresponds to the weights vector, and σ̃ : Rn → Rr is an ANN activation vector based on independent sigmoid activation functions. This particular structure ensures the positiveness of the approximate value function. For this study, the sigmoid function satisfies the following definition:

Definition 2. The sigmoid component in the ANN topology is defined as

sig(x) = g (1 + exp(−ςᵀ(x − ϕ)))⁻¹ + c.

This function is a bounded and differentiable real-valued scalar function, defined for all possible real-valued arguments, with a nonnegative derivative; x ∈ Rn comes from the system (1), g ∈ R, c is the shifting scalar, and ς ∈ Rn and ϕ ∈ Rn are real-valued constant vectors.

The following assumptions are proposed for the approximated Value function.

Assumption 2. The local approximation (Cybenko, 1989) error Ṽ(t, x) corresponding to the ANN-based approximate HJB solution is absolutely bounded, that is, Ṽ(t, x) ≤ V⁺(N). Here V⁺ ∈ R+ and N is the number of activation functions. Notice that V⁺(N) is a monotonically decreasing function of the number N of elements in the ANN. Therefore, increasing the number of activation functions can decrease the value of Ṽ(t, x) to an arbitrarily small constant.

Assumption 3. The formal connection between the approximated HJB function Va(t, x) and the approximation error function Ṽ(t, x) satisfies V⁺(N) ≪ |Va(t, x)|.

Therefore, the HJB equation based on the approximated function Va(t, x), with the approximation error function Ṽ(t, x) → 0, is governed by:

−xᵀṖ(t)x − ω̇(t)σ(x) − ω(t)∇ᵀσ(x)A0 x − 2xᵀP(t)A0 x + (1/4)(ω(t)∇ᵀσ(x) + 2xᵀP(t)) B0 R⁻¹ B0ᵀ (ω(t)∇σ(x) + 2P(t)x) − √(φ(‖x‖² + 1)) ‖ω(t)∇σ(x) + 2P(t)x‖ − ‖x‖²_Q = 0.   (17)

Here, the operator ∇ denotes the gradient ∂/∂x of the given scalar function of the vector x. Equation (17) can be split into two time-dependent differential equations by rewriting the quadratic terms as xᵀRic(P)x and the remaining elements as ω̃ᵀW̃, where

Ric(P) = −Ṗ(t) − (P(t)A0 + A0ᵀP(t)) + P(t)B0 R⁻¹ B0ᵀ P(t) − Q − (4√(φ(‖x‖² + 1)) / z(ω(t), P)) P²(t),   (18)

W̃ = −dω̃(t)/dt + (ω(t)∇ᵀσ̃(x) / (2σ(x))) [ −A0 x + (1/4) ω(t) B0 R⁻¹ B0ᵀ ∇σ(x) + B0 R⁻¹ B0ᵀ 2P(t)x − (4√(φ(‖x‖² + 1)) / z(ω, P)) P(t)x − (ω(t)√(φ(‖x‖² + 1)) / z(ω, P)) ∇σ(x) ],   (19)


with z(ω(t), P) = ‖ω(t)∇σ(x) + 2P(t)x‖, considering W̃ = 0 and Ric(P) = 0. The simultaneous on-line numerical solution of (18) and (19), together with their initial conditions, gives an approximate numerical solution of the HJB equation. The numerical solution is described in Section VI.

4.2. Characterization of the approximation quality

The DP equation under the approximation of the Value function can be summarized in the following theorem:


Theorem 2. Consider the function given in Definition 1 and equation (15) for (s, y) ∈ [0, T) × Rn. Then, the following relation holds:

Va(s, y) = min_{u ∈ Uadm} max_{η ∈ Ψadm} { ∫_{t=s}^{ŝ} h(x, u, t) dt + V(ŝ, x(ŝ)) } − Ṽ(s, y),   ∀ŝ ∈ [s, T].   (20)


Proof. Taking into account Definition 1, for any u(·) ∈ Uadm[s, T]:

Va(s, y) ≤ max_{η ∈ Ψadm} { ∫_{t=s}^{ŝ} h(x, u, t) dt + J(ŝ, x(ŝ); u(·)) } − Ṽ(s, y).   (21)

Denoting the right-hand side of inequality (21), excluding the approximation error function, by V̄(s, y), and applying the minimum operator over the set Uadm[s, T] with respect to u(·), it follows that:


Va(s, y) − V⁺(N) ≤ V̄(s, y).   (22)

Notice that, for a given κ > 0, there is an admissible control uκ(·) ∈ Uadm[s, T] such that, with xκ(·) := x(·, s, y, uκ(·), η∗):

max_{η ∈ Ψadm} { J(s, y; uκ(·)) } − Ṽ(s, y) ≤ Va(s, y) + κ.


Therefore, the following inequality holds:

Va(s, y) + κ + V⁺(N) ≥ V̄(s, y).   (23)

Letting $\kappa \to 0$, the aggregation of inequalities (22) and (23) implies the result (20) of Theorem 2, where $u^*$ corresponds to the sub-optimal admissible controller. Indeed, if $V^+ \to 0$, then $u^*$ is the optimal control. $\square$

4.3. Exact solution for the optimal controller design based on ANN

The ANN structure proposed in (16) serves to prove that the exact result for the optimal controller is attainable if the number of neurons in the ANN layer tends to infinity (Lewis & Vrabie, 2009; Vrabie & Lewis, 2009). Indeed, $V^+(N)$ is a decreasing function of its only argument, so

$$\limsup_{N \to \infty} V_a(s, y) + V^+(N) \ge \bar{V}(s, y).$$

It is then possible to justify that $V_a(s, y) \ge \bar{V}(s, y)$, which yields the optimal control $u^*$ as $N \to \infty$.
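To make the structure of the approximation concrete, the sketch below implements a Value function of the assumed form $V_a(t,x) = x^\top P x + \omega\,\sigma(x)$ together with the sub-optimal control $u^* = -\tfrac{1}{2} R^{-1} B_0^\top \partial V/\partial x$ in Python. The dimensions, the logistic activation and the projection matrix `C` are illustrative assumptions of this sketch, not the paper's exact parametrization.

```python
import numpy as np

# Illustrative dimensions: 3 states, 2 inputs, 4 sigmoidal neurons
n, m, N = 3, 2, 4
rng = np.random.default_rng(0)
C = rng.standard_normal((N, n))        # hypothetical input weights of the sigmoid layer

def sigma(x):
    """Vector of N sigmoidal (logistic) activations of the state."""
    return 1.0 / (1.0 + np.exp(-C @ x))

def grad_sigma(x):
    """Transposed Jacobian d sigma / d x, shape (n, N)."""
    s = sigma(x)
    return (C * (s * (1.0 - s))[:, None]).T

def V_a(x, P, w):
    """Approximate Value function: quadratic form plus weighted sigmoids."""
    return x @ P @ x + w @ sigma(x)

def u_subopt(x, P, w, B0, R_inv):
    """Sub-optimal controller u* = -(1/2) R^{-1} B0^T (2 P x + (d sigma/dx)^T w)."""
    dV_dx = 2.0 * P @ x + grad_sigma(x) @ w
    return -0.5 * R_inv @ B0.T @ dV_dx
```

With a positive definite $P$ and non-negative weights, $V_a$ stays positive for nonzero states, which is the property the ANN structure is designed to preserve.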


5. Practical stability analysis for the sub-optimal control solution

The presence of the uncertainties $\Delta A$ and $\Delta B$ has an impact on the estimation of the performance index (3) evaluated over the trajectories of the nonlinear system (1). Therefore, a robustness analysis for the state trajectories is required. The stability analysis for the equilibrium point of (1) is based on the Lyapunov stability method.


Proposition 1. Consider the nonlinear uncertain system given in (1) with the sub-optimal controller introduced in (10). Suppose that Assumption 1 holds. If there is a bounded, positive definite, symmetric solution $P$ ($P^- \le P(t) \le P^+$, with $P^-, P^+ \in \mathbb{R}^{n \times n}$ positive definite) of the time-varying Riccati matrix equation ($\Lambda_1 > 0$, $\Lambda_1 \in \mathbb{R}^{n \times n}$)

$$-\frac{d}{dt}P(t) - \left(P(t)A_0 + A_0^\top P(t)\right) + 2P(t)B_0 R^{-1} B_0^\top P(t) - P(t)\Lambda_1 P(t) - Q_2 = Q_4(t), \qquad (24)$$

with $Q_4 : \mathbb{R}_+ \to \mathbb{R}^{n \times n}$ a time-dependent positive definite matrix, and if the weights of the ANN approximation (16) governed by (19) are bounded by $\omega \le \omega^+$ for all $t \ge 0$, with $\sigma(x) \ne 0$ for all $x \ne 0$, then the origin defines a uniformly practically stable equilibrium point of (1) (Haddad & Chellaboina, 2011), with an ultimate bound given by

$$\beta = \frac{\varepsilon_0}{\sup_{t \ge 0} \lambda_{\max}\{Q_4(t)\}}. \qquad (25)$$

The proof of this proposition is given in the Appendix. Proposition 1 states sufficient conditions only: the boundedness of the stable trajectories under the robust optimal closed-loop control depends on the existence of a positive solution of (18). Establishing necessary conditions for this proposition requires further research (a topic beyond the scope of this study); the proposed method, which rests on a Lyapunov-like stability analysis, can provide only sufficient arguments for the main result.

6. On the numerical realization of ROC

The solution of the proposed class of robust sub-optimal controllers requires the implementation of a special class of recurrent algorithm (Algorithm 1). The algorithm is a routine that checks the terminal condition of the approximate Value function and adjusts the weight parameters associated with the learning laws. The numerical method uses a mixed strategy that executes the evolution of the states driven by the sub-optimal controller. Each run is evaluated by adjusting only the initial conditions of the weights and of the matrix $P$ estimated as the solution of the Riccati differential equation. The parameters $\mu$, $\Gamma_1$ and $\Gamma_2$ are positive real constants; the value $V_0$ is chosen to satisfy $V_0 > \mu$; $P_a \in \mathbb{R}^{4 \times 4}$ and $w_a \in \mathbb{R}^4$ are the initial conditions for the first iteration ($\omega_0$ and $P_0$). Figure 1 depicts the interaction between the numerical solution and the Simulink® simulation of the model. Notice that the numerical evaluation of each controller obtained with the ANN approximation is realized in Simulink®, while the external routine, executed in the MATLAB m-language, supervises and updates the weight values.
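The on-line solution of the time-varying Riccati matrix differential equation used by the routine can be sketched numerically as follows. This is a forward-Euler sketch only: the rearrangement of the equation for $\dot{P}$, the sign convention, and all matrix values in the usage below are assumptions of this example.

```python
import numpy as np

def riccati_rhs(P, A0, B0, R_inv, Lam1, Q2, Q4):
    """dP/dt obtained by rearranging the Riccati relation; the sign
    convention used here is an assumption of this sketch."""
    return (-(P @ A0 + A0.T @ P) + 2.0 * P @ B0 @ R_inv @ B0.T @ P
            - P @ Lam1 @ P - Q2 - Q4)

def integrate_riccati(P0, A0, B0, R_inv, Lam1, Q2, Q4, dt=1e-4, steps=50):
    """Forward-Euler integration of the matrix Riccati ODE."""
    P = P0.copy()
    for _ in range(steps):
        P = P + dt * riccati_rhs(P, A0, B0, R_inv, Lam1, Q2, Q4)
        P = 0.5 * (P + P.T)  # re-symmetrize against round-off drift
    return P
```

In practice a stiffer solver (or the Simulink integrator mentioned above) would replace the Euler step; the re-symmetrization keeps $P$ symmetric despite floating-point drift.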


7. Numerical simulations


Algorithm 1 Recurrent algorithm to adjust the initial parameters of the learning-law equations
1: Start
2: Initialization: i ← 0; ω_0(0) ← w_a; P_i(0) ← P_a; V_{i+1}(0, z(0)) ← V_i(T, z(T));
3: set μ, Γ_1, Γ_2;
4: while V_i(T, z(T)) ≥ μ do
5:    i ← i + 1;
6:    u*_i = −(1/2) R^{−1} B_0^⊤ ∂V_i(t, x)/∂x;
7:    P_{i+1}(0) = P_i(T) + Γ_1 P_a;
8:    ω_{i+1}(0) = ω_i(0) + Γ_2 dV_i(t, x)/dω_i;
9:    run the simulation of dx/dt;
10:   V_{i+1}(0, z(0)) = V_i(T, z(T));
11: end while
12: u* = u*_i
13: Stop
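The outer loop of the recurrent adjustment can be sketched as follows. Here `simulate_episode` is a hypothetical stand-in for the Simulink run: it is assumed to simulate the closed loop over $[0, T]$ and return the terminal value $V_i(T, z(T))$, the terminal Riccati matrix $P_i(T)$ and the sensitivity $dV_i/d\omega_i$.

```python
def tune_initial_conditions(simulate_episode, w0, P0, mu, gamma1, gamma2, Pa,
                            max_iter=100):
    """Sketch of the recurrent initial-condition adjustment (Algorithm 1).

    simulate_episode(w_init, P_init) -> (V_terminal, P_terminal, dV_dw)
    is a hypothetical interface assumed for this example."""
    w, P = w0, P0
    V_T = None
    for _ in range(max_iter):
        V_T, P_T, dV_dw = simulate_episode(w, P)
        if V_T < mu:                   # terminal condition of the while-loop
            break
        P = P_T + gamma1 * Pa          # line 7: P_{i+1}(0) = P_i(T) + Gamma_1 Pa
        w = w + gamma2 * dV_dw         # line 8: w_{i+1}(0) = w_i(0) + Gamma_2 dV/dw
    return w, P, V_T
```

The loop stops as soon as the terminal value of the approximate Value function drops below the threshold $\mu$, mirroring the while-condition of the pseudocode.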


To analyse the performance of the proposed sub-optimal controller algorithm, this study considered the following academic example. The nominal matrices of the system were

$$A_0 = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -8 & -4 & -12 \end{bmatrix}, \qquad B_0 = 0.34\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 7 & 5 \end{bmatrix}.$$


The uncertain matrices were

$$\Delta A = \begin{bmatrix} \sin(100t) & \sin(10000t) & 0.6\sin(10000t) \\ 0.8\cos(69t) & \sin(90t) & 0.3\sin(854t) \\ 0.5\sin(100t) & 0.4\cos(972t) & 0.6\cos(80t) \end{bmatrix}, \qquad \Delta B = \begin{bmatrix} 0.245\sin(10t) & \cos(15t) \\ 0.157\cos(5t) & \sin(0.3t) \\ \sin(0.3t) & 0.245\cos(0.2t) \end{bmatrix}.$$
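For reproducibility, the uncertainty matrices can be coded as functions of time. The element placement below is reconstructed from the garbled pre-proof layout and should be treated as an assumption of this sketch rather than the authors' exact matrices.

```python
import numpy as np

def delta_A(t):
    """Time-varying parametric uncertainty Delta A(t) (assumed element layout)."""
    return np.array([
        [np.sin(100 * t),       np.sin(10000 * t),      0.6 * np.sin(10000 * t)],
        [0.8 * np.cos(69 * t),  np.sin(90 * t),         0.3 * np.sin(854 * t)],
        [0.5 * np.sin(100 * t), 0.4 * np.cos(972 * t),  0.6 * np.cos(80 * t)],
    ])

def delta_B(t):
    """Time-varying input-matrix uncertainty Delta B(t) (same caveat)."""
    return np.array([
        [0.245 * np.sin(10 * t), np.cos(15 * t)],
        [0.157 * np.cos(5 * t),  np.sin(0.3 * t)],
        [np.sin(0.3 * t),        0.245 * np.cos(0.2 * t)],
    ])
```

Every entry is a bounded trigonometric signal, so $\Delta A$ and $\Delta B$ remain inside a bounded admissible set at all times.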


The matrices $\Delta A$ and $\Delta B$ belong to the given set $\Psi_{adm}$. The bounds for the uncertain matrices are $\alpha = 1.00$ and $\beta = 0.13$. The proposed ANN was simulated with the weight vector $\omega = [\hat{\omega}_1\ \hat{\omega}_2\ \hat{\omega}_3\ \hat{\omega}_4\ \hat{\omega}_5\ \hat{\omega}_6]^\top$. A simple validation scheme compared the states obtained with the robust sub-optimal controller against the numerical result achieved with the pole-placement technique. In this numerical case, the vector of desired poles was $[-23.68,\ -0.15 + 0.8j,\ -0.15 - 0.8j]^\top$, which leads to the control gain

$$K = \begin{bmatrix} 2.17 & -1.08 & -3.25 \\ -1.66 & -0.83 & -2.49 \end{bmatrix}.$$


The comparison of the state trajectories under both controllers (the ANN-based sub-optimal and the pole-placement-based) appears in Figure 2. This comparison demonstrates that all the states converge to a bounded region near the origin, with an upper bound of 0.03 after 1.0 second. However, the trajectories of the system controlled with the sub-optimal controller approach the origin more slowly than those controlled with the pole-placement method.

Figure 1: Structure of the routines for the numerical simulation and the adjustment of the parameters.

Figure 2: States determined with the states and control actions using the sub-optimal and pole-placement approaches.

The trajectories depicted in Figure 2 served to estimate the norm of $x$; the Euclidean norm was used for comparison purposes. This comparison confirmed that the pole-placement technique enforces a faster movement of the state trajectories toward the origin within the first 1.5 seconds. The variation of the state norm also served to estimate the variation of the value function, as well as the time evolution of the functional (Figure 3).

Figure 3: Norm of the state vector estimated with the states and control actions using the sub-optimal and pole-placement approaches.

The evolution of the states and the evaluation of the control function determine the variation of the performance index. This comparison was a major element in establishing the advantage of the control design attained in this study. The comparison of the performance index calculated with the sub-optimal controller and with the pole-placement technique showed a significant reduction of this value after 0.05 seconds, and a final reduction (after 0.5 seconds) of 33% (Figure 4). The depicted results also show that the variation of the initial condition of the ANN weights produces a significant variation in the temporal evolution of the performance index. Moreover, with the locally optimal values of the ANN weights, the lowest performance index is attained (Figure 4).

Figure 4: Cost functional estimated with the states and control actions using the sub-optimal and pole-placement approaches.
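The performance-index comparison above amounts to numerically integrating a stage cost along the sampled trajectories. The sketch below assumes a generic quadratic cost $J = \int_0^T (x^\top Q x + u^\top R u)\,dt$ with uniform sampling; the specific weighting matrices of the paper's functional (3) are not assumed.

```python
import numpy as np

def performance_index(xs, us, Q, R, dt):
    """Trapezoidal estimate of J = int (x'Qx + u'Ru) dt from uniformly
    sampled state trajectory xs (T x n) and input trajectory us (T x m)."""
    stage = np.einsum('ti,ij,tj->t', xs, Q, xs) + np.einsum('ti,ij,tj->t', us, R, us)
    # uniform-grid trapezoid rule
    return dt * (stage.sum() - 0.5 * (stage[0] + stage[-1]))
```

Evaluating this accumulator on the trajectories of each controller gives the curves compared in Figure 4.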


With the aim of showing the dependence of the performance index on the initial condition of the weights, Figure 5 presents the value of the performance index $J(u(\cdot))$ calculated at the end of the simulation process, with $T = 30$ s. This behavior confirms the usefulness of the suggested algorithm, which is in charge of adjusting the ANN weights.

Figure 5: Cost functional estimated with the states and control actions using the sub-optimal approach, presented as a function of the variation of the initial weights in the ANN.

Figure 6 shows the variation of the approximate value function $V_a$ obtained with the states and the control signal calculated from the functional $J(u(\cdot))$. The decreasing behavior of this value function confirms the applicability of the ANN-based approximation, as well as of the numerical methodology used to adjust the initial weights in each evaluation round, as described in Section 6.


Figure 6: Value function calculated with the states and control actions using the sub-optimal controller obtained by the approximation based on neural networks.

Figure 7 demonstrates the time evolution of the controllers using the sub-optimal solution and the pole-placement approach. The sub-optimal control signals were smaller than those produced by the pole placement, except during the first 0.05 seconds. This fact explains the differences between the convergence of the states to a bounded zone near the origin. It is relevant that the sub-optimal control signals reach the values obtained with the pole-placement controller; this can be confirmed after 1.0 second of simulation.

Figure 7: Comparison of the control signals obtained by the sub-optimal and pole-placement design strategies.

Figure 8 depicts the variation of the norm of the controller signal $u$. The time evolution of this function was considered because of its participation in the value function, as well as in the functional estimation. The variation of the control norm, together with the variation of the state norm, explains the increment of the function when the sub-optimal controller is considered, in comparison to the regular pole-placement method, which does not account for the effect of the perturbations.


Figure 8: Norm of the control signal.


Figure 9 shows the variation of all four components of the weight vector $\tilde{\omega}$ that were considered in the approximation of the value-function solution of the HJB equation $V_a(t, x)$. Notice that, in contrast to the usual behaviour of ANN approximations of uncertain functions, the weights obtained in this study do not converge to constant values, because of the influence of the system states $x$ and of the solution $P$ of the Riccati equation. The evolution of the weights included in the ANN structure can be complemented with the variation of the sigmoidal functions; their variation demonstrates the effect of the state evolution on the ANN structure (Figure 10).

Figure 9: Time-dependent trajectories of $\hat{\omega}(t)$.

Figure 10: Time-dependent trajectories of $\hat{\sigma}(t)$.

Figure 11 shows the time evolution of all four elements included in $P$, obtained by the on-line numerical solution of the time-varying Riccati matrix differential equation.

Figure 11: Time evolution of the components of the matrix P calculated by the numerical solution of the time-dependent Riccati matrix differential equation.

The ANN approximation proposed in this study represents a contribution to the robust realization of optimal controllers for systems with the admissible class of parametric uncertainties and external perturbations considered here. The numerical results demonstrated the existence of a practically stable equilibrium point at the origin, which justifies the theoretical result of the Lyapunov-like (energetic) stability analysis. Moreover, the approximate solution based on the Riccati matrix equation also converged to bounded and constant values. Interestingly, the weights of the ANN do not converge to constant values, in contrast to regular ANN approximations of the HJB equation for optimal control realization. The existing approximate results on optimal control for the class of systems considered here have not provided the analytic results attained in this study. Notice that this study provides the evaluation of the approximate function over the sub-optimal control, as well as its impact on the Hamiltonian associated with the ANN approximation. The quasi-linear form of the uncertain system motivated the proposal of a mixed controller using a linear form plus the approximated solution based on the ANN.

8. Conclusions


In this manuscript, the finite-horizon optimal control problem for a class of uncertain systems has been tackled by proposing a min-max sub-optimal controller with an NDP approach. The controller uses an ANN approximation of the Value function. The proposed approximate solution of the HJB equation is based on an ANN structure augmented with a classical quadratic form of the state, weighted by a time-dependent positive definite matrix. The ANN structure obeys the nature of the Value function (positive and continuous) by using quadratic terms and by choosing sigmoidal activation functions. The tuning of the free parameters of the approximation was solved with a recurrent numerical algorithm. Compared with other works, in this study the effect of the unknown terms is included in the main theorems, and the structure of the ANN takes into account the natural quadratic structure of the nominal part of the system. The proposed controller was implemented numerically to regulate a quasi-linear system with parametric uncertainties and external perturbations, and it was compared with a classical pole-placement controller. The performance of the sub-optimal controller was illustrated by the numerical simulation proposed in this study.


References


Azhmyakov, V. (2011). On the geometric aspects of the invariant ellipsoid method: Application to the robust control design. In 2011 50th IEEE Conference on Decision and Control and European Control Conference (pp. 1353–1358). doi:10.1109/CDC.2011.6161180.

Azhmyakov, V., Mera, M., & Juárez, R. (2019). Advances in attractive ellipsoid method for robust control design. International Journal of Robust and Nonlinear Control, 29, 1418–1436.


Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.


Beard, R. W., Saridis, G. N., & Wen, J. T. (1997). Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation. Automatica, 33, 2159–2177. doi:http://dx.doi.org/10.1016/S0005-1098(97)00128-3.

Beard, R. W., Saridis, G. N., & Wen, J. T. (1998). Approximate Solutions to the Time-Invariant Hamilton–Jacobi–Bellman Equation. Journal of Optimization Theory and Applications, 96, 589–626. doi:10.1023/A:1022664528457.


Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control volume 1. Athena Scientific, Belmont, MA.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming volume 5. Athena Scientific, Belmont, MA.

Bryson, A. E. (1975). Applied optimal control: Optimization, estimation and control. CRC Press.


Cheng, T., Lewis, F. L., & Abu-Khalaf, M. (2007). A neural network solution for fixed-final time optimal control of nonlinear systems. Automatica, 43, 482–490. doi:https://doi.org/10.1016/j.automatica.2006.09.021.

Cloutier, J. R., & Cockburn, J. C. (2001). The State-Dependent Nonlinear Regulator with State Constraints. In Proceedings.


Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303–314.

Edwards, C., & Spurgeon, S. (1998). Sliding mode control: theory and applications. CRC Press.


Gu, K., Zohdy, M. A., & Loh, N. K. (1990). Necessary and sufficient conditions of quadratic stability of uncertain linear systems. IEEE Transactions on Automatic Control, 35, 601–604. Guliyev, N. J., & Ismailov, V. E. (2018). On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Networks, 98, 296–304. Haddad, W. M., & Chellaboina, V. (2011). Nonlinear dynamical systems and control: a Lyapunov-based approach. Princeton University Press.


Heydari, A., & Balakrishnan, S. N. (2013). Finite-Horizon Control-Constrained Nonlinear Optimal Control Using Single Network Adaptive Critics. IEEE Transactions on Neural Networks and Learning Systems, 24, 145–157. doi:10.1109/TNNLS. 2012.2227339. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359 – 366. doi:http://dx.doi. org/10.1016/0893-6080(89)90020-8.


Huang, C.-S., Wang, S., & Teo, K. (2000). Solving Hamilton-Jacobi-Bellman equations by a modified method of characteristics. Nonlinear Analysis: Theory, Methods & Applications, 40, 279–293. doi:http://dx.doi.org/10.1016/S0362-546X(00)85016-6.


Çimen, T. (2008). State-Dependent Riccati Equation (SDRE) Control: A Survey. IFAC Proceedings Volumes, 41, 3761–3775. doi:http://dx.doi.org/10.3182/20080706-5-KR-1001.00635. 17th IFAC World Congress.


Jiang, Y., & Jiang, Z. P. (2014). Robust Adaptive Dynamic Programming and Feedback Stabilization of Nonlinear Systems. IEEE Transactions on Neural Networks and Learning Systems, 25, 882–893. doi:10.1109/TNNLS.2013.2294968.

Kainen, P. C., Kurkova, V., & Sanguineti, M. (2012). Dependence of computational models on input dimension: Tractability of approximation and optimization tasks. IEEE Transactions on Information Theory, 58, 1203–1214.

Khalil, H. K. (1996). Nonlinear systems (2nd ed.). Prentice-Hall, New Jersey.

Kim, Y. H., Lewis, F. L., & Dawson, D. M. (2000). Intelligent optimal control of robotic manipulators using neural networks. Automatica, 36, 1355 – 1364. doi:https://doi.org/10.1016/S0005-1098(00)00045-5.


Kirk, D. (2004). Optimal Control Theory: An Introduction. Dover Books on Electrical Engineering Series. Dover Publications. Kiumarsi, B., Lewis, F. L., Modares, H., Karimpour, A., & Naghibi-Sistani, M.-B. (2014). Reinforcement Q-learning for optimal tracking control of linear discretetime systems with unknown dynamics. Automatica, 50, 1167 – 1175. doi:https: //doi.org/10.1016/j.automatica.2014.02.015.


Lewis, F. L., & Liu, D. (2013). Reinforcement Learning and Approximate Dynamic Programming for Feedback Control volume 17. John Wiley & Sons. Lewis, F. L., & Vrabie, D. (2009). Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control. IEEE Circuits and Systems Magazine, 9, 32– 50. doi:10.1109/MCAS.2009.933854.


Masmoudi, N. K., Rekik, C., Djemel, M., & Derbel, N. (2011). Two coupled neuralnetworks-based solution of the Hamilton–Jacobi–Bellman equation. Applied Soft Computing, 11, 2946 – 2963. doi:https://doi.org/10.1016/j.asoc.2010.11. 015. Modares, H., Lewis, F. L., & Naghibi-Sistani, M.-B. (2014a). Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica, 50, 193 – 202. doi:https: //doi.org/10.1016/j.automatica.2013.09.043. Modares, H., Lewis, F. L., & Sistani, M.-B. N. (2014b). Online solution of nonquadratic two-player zero-sum games arising in the H∞ control of constrained input systems. International Journal of Adaptive Control and Signal Processing, 28, 232–254. doi:10.1002/acs.2348.


Liberzon, D. (2012). Calculus of variations and optimal control theory: A concise introduction. Princeton University Press.

Mulje, S. D., & Nagarele, R. M. (2016). LQR Technique based Second Order Sliding Mode Control for Linear Uncertain Systems. International Journal of Computer Applications, 137, 23–29.


Palanisamy, M., Modares, H., Lewis, F. L., & Aurangzeb, M. (2015). Continuoustime Q-learning for Infinite-Horizon Discounted Cost Linear Quadratic Regulator Problems. IEEE Transactions on Cybernetics, 45, 165–176. doi:10.1109/TCYB. 2014.2322116. Patan, K. (2018). Two stage neural network modelling for robust model predictive control. ISA Transactions, 72, 56 – 65.


Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality volume 703. John Wiley & Sons.

Poznyak, A., Polyakov, A., & Azhmyakov, V. (2014). Attractive Ellipsoids in Robust Control. Springer.

Poznyak, A. S., & Boltyanski, V. G. (2012). The Robust Maximum Principle. Springer Science & Business Media.

Qu, Z. (1993). Robust control of nonlinear uncertain systems under generalized matching conditions. Automatica, 29, 985–998.

Saad, W., Sellami, A., & Garcia, G. (2019). H∞-sliding mode control of one-sided Lipschitz nonlinear systems subject to input nonlinearities and polytopic uncertainties. ISA Transactions, 90, 19–29.


Sage, A. P. (1968). Optimum Systems Control. Prentice-Hall.


Sutton, R. S., & Barto, A. G. (2012). Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.

Tang, Z. L., Ge, S. S., Tee, K. P., & He, W. (2016). Robust Adaptive Neural Tracking Control for a Class of Perturbed Uncertain Nonlinear Systems With State Constraints. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 46, 1618–1629. doi:10.1109/TSMC.2015.2508962.


Spong, M. W., Hutchinson, S., & Vidyasagar, M. (2006). Robot Modeling and Control. John Wiley and Sons, Inc.

Utkin, V., Guldner, J., & Shi, J. (2009). Sliding mode control in electro-mechanical systems volume 34. CRC press.


Vamvoudakis, K. G., Vrabie, D., & Lewis, F. L. (2014). Online adaptive algorithm for optimal control with integral reinforcement learning. International Journal of Robust and Nonlinear Control, 24, 2686–2710. doi:10.1002/rnc.3018.


Vrabie, D., & Lewis, F. (2009). Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks, 22, 237–246. doi:https://doi.org/10.1016/j.neunet.2009.03.008. Goal-Directed Neural Systems.

Wang, D. (2019). Intelligent Critic Control With Robustness Guarantee of Disturbed Nonlinear Plants. IEEE Transactions on Cybernetics.

Wang, D., Liu, D., Li, H., & Ma, H. (2014). Neural-network-based robust optimal control design for a class of uncertain nonlinear systems via adaptive dynamic programming. Information Sciences, 282, 167–179.


Wang, D., Liu, D., Zhang, Y., & Li, H. (2018). Neural network robust tracking control with adaptive critic framework for uncertain nonlinear systems. Neural Networks, 97, 11 – 18. doi:https://doi.org/10.1016/j.neunet.2017.09.005. Wang, D., & Qiao, J. (2019). Approximate neural optimal control with reinforcement learning for a torsional pendulum device. Neural Networks, 117, 1–7.


Wang, S., Jennings, L. S., & Teo, K. L. (2003). Numerical Solution of Hamilton-Jacobi-Bellman Equations by an Upwind Finite Volume Method. Journal of Global Optimization, 27, 177–192. doi:10.1023/A:1024980623095.

pro of

505

Xue Anke, Jiang Nan, & Sun Youxian (2001). Robust guaranteed cost control with H∞ γ disturbance attenuation performance. In Proceedings of the 2001 American Control Conference. (Cat. No.01CH37148) (pp. 4218–4219 vol.6). volume 6. doi:10.1109/ ACC.2001.945638. Yang, Q., Jagannathan, S., & Sun, Y. (2015). Robust Integral of Neural Network and Error Sign Control of MIMO Nonlinear Systems. IEEE Transactions on Neural Networks and Learning Systems, 26, 3278–3286. doi:10.1109/TNNLS.2015.2470175. Yang, X., & He, H. (2018). Self-learning robust optimal control for continuous-time nonlinear systems with mismatched disturbances. Neural Networks, 99, 19 – 30.

re-

515

lP

Yu, Y. (1983). On stabilizing uncertain linear delay systems. Journal of Optimization Theory and Applications, 41, 503–508.

Appendix

Proof for Proposition 1

Proof. Introduce the approximate value function $V_a : \mathbb{R}_+ \times \mathbb{R}^n \to \mathbb{R}_+$ as a feasible Lyapunov function candidate for (1). Based on the assumptions, the candidate energetic function satisfies

$$x^\top P^- x \le V_a(t, x) \le x^\top P^+ x + \omega^+ \sigma(x).$$

Notice that both $x^\top P^- x$ and $x^\top P^+ x + \omega^+ \sigma(x)$ are class-$\mathcal{K}$ functions (Khalil, 1996). The full time derivative of $V_a$ along the trajectories of $x$ and $\omega$ is

$$\frac{d}{dt}V_a(t, x(t)) = 2x^\top(t)P(t)\frac{d}{dt}x(t) + x^\top(t)\frac{d}{dt}P(t)\,x(t) + \frac{d}{dt}\omega(t)\,\sigma(x) + \omega(t)\nabla^\top\sigma(x)\frac{d}{dt}x(t). \qquad (26)$$

Reorganizing the elements in (26), the differential equation yields

$$\frac{d}{dt}V_a(t, x) = \left[2x^\top(t)P(t) + \omega(t)\nabla^\top\sigma(x)\right]\frac{d}{dt}x(t) + x^\top(t)\frac{d}{dt}P(t)\,x(t) + 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x). \qquad (27)$$

The substitution of (1) in (27) leads to

$$\frac{d}{dt}V_a(t, x(t)) = x^\top(t)\frac{d}{dt}P(t)\,x(t) + 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x) + \left[2x^\top(t)P(t) + \omega(t)\nabla^\top\sigma(x)\right]\left(A_0 x(t) + B_0 u(t) + \eta\begin{bmatrix}x^\top(t) & 1\end{bmatrix}^\top\right). \qquad (28)$$


The application of the sub-optimal admissible controller $u^*$ in (10) transforms (28) into

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) ={}& x^\top(t)\left(P(t)A_0 + A_0^\top P(t)\right)x(t) + 2x^\top(t)P(t)B_0\left(-\frac{1}{2}R^{-1}B_0^\top\frac{\partial V(t, x)}{\partial x(t)}\right) + 2x^\top P(t)\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top \\
&+ \omega(t)\nabla^\top\sigma(x)\left(A_0 x + B_0\left(-\frac{1}{2}R^{-1}B_0^\top\frac{\partial V(t, x)}{\partial x}\right)\right) + \omega(t)\nabla^\top\sigma(x)\,\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top \\
&+ x^\top\frac{d}{dt}P(t)\,x(t) + 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x). \qquad (29)
\end{aligned}
$$

The gradient of $V$, $\partial V/\partial x$, can be estimated using the definition of the activation functions. Then, the reorganization of the terms in (29) is equivalent to

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) ={}& x^\top\left(P(t)A_0 + A_0^\top P(t) - 2P(t)B_0 R^{-1}B_0^\top P(t)\right)x - 2x^\top P(t)B_0 R^{-1}B_0^\top\omega(t)\nabla\sigma(x) + 2x^\top P(t)\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top \\
&+ \omega(t)\nabla^\top\sigma(x)A_0 x - \frac{1}{2}\omega(t)\nabla^\top\sigma(x)B_0 R^{-1}B_0^\top\omega(t)\nabla\sigma(x) + \omega(t)\nabla^\top\sigma(x)\,\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top \\
&+ x^\top\frac{d}{dt}P(t)\,x(t) + 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x). \qquad (30)
\end{aligned}
$$

The application of Young's inequality (Poznyak et al., 2014) justifies the following inequality, straightforwardly from (30):

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) \le{}& x^\top\left(P(t)A_0 + A_0^\top P(t) - 2P(t)B_0 R^{-1}B_0^\top P(t)\right)x \\
&+ \omega(t)\nabla^\top\sigma(x)A_0 x - \frac{1}{2}\omega(t)\nabla^\top\sigma(x)B_0 R^{-1}B_0^\top\omega(t)\nabla\sigma(x) \\
&- 2x^\top P(t)B_0 R^{-1}B_0^\top\omega(t)\nabla\sigma(x) + \left(\omega(t)\nabla^\top\sigma(x)\right)\Lambda_2\left(\omega(t)\nabla\sigma(x)\right) \\
&+ x^\top P(t)\Lambda_1 P(t)x + \begin{bmatrix}x^\top & 1\end{bmatrix}\eta^\top\Lambda_1^{-1}\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top + \begin{bmatrix}x^\top & 1\end{bmatrix}\eta^\top\Lambda_2^{-1}\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top \\
&+ x^\top\frac{d}{dt}P(t)\,x(t) + 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x).
\end{aligned}
$$

Factorizing with respect to $\omega$ and $\eta$ ($\Lambda^{-1} = \Lambda_1^{-1} + \Lambda_2^{-1}$),

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) \le{}& x^\top\left(\frac{d}{dt}P(t) + P(t)A_0 + A_0^\top P(t) - 2P(t)B_0 R^{-1}B_0^\top P(t) + P(t)\Lambda_1 P(t)\right)x + \begin{bmatrix}x^\top & 1\end{bmatrix}\eta^\top\Lambda^{-1}\eta\begin{bmatrix}x^\top & 1\end{bmatrix}^\top \\
&+ 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x) - \omega(t)\left(2x^\top P(t)B_0 R^{-1}B_0^\top - x^\top A_0^\top + \frac{1}{2}\nabla^\top\sigma(x)B_0 R^{-1}B_0^\top\omega(t)\right)\nabla\sigma(x) - \omega(t)\nabla^\top\sigma(x)\Lambda_2\nabla\sigma(x).
\end{aligned}
$$

Considering that $\Delta A^\top\Lambda^{-1}\Delta A \le Q_0$ and $\Delta B^\top\Lambda^{-1}\Delta B \le \varepsilon_0$, one gets

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) \le{}& x^\top\left(\frac{d}{dt}P(t) + P(t)A_0 + A_0^\top P(t) - 2P(t)B_0 R^{-1}B_0^\top P(t) + P(t)\Lambda_1 P(t) + Q_0\right)x + 2\bar{\omega}^\top(t)\frac{d}{dt}\bar{\omega}(t)\,\sigma(x) + \varepsilon_0 \\
&- \omega(t)\left(2x^\top P(t)B_0 R^{-1}B_0^\top - x^\top A_0^\top + \frac{1}{2}\nabla^\top\sigma(x)B_0 R^{-1}B_0^\top\omega(t)\right)\nabla\sigma(x) - \omega(t)\nabla^\top\sigma(x)\Lambda_2\nabla\sigma(x). \qquad (31)
\end{aligned}
$$


Substituting the adjustment law for the weights in (31) yields

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) \le{}& x^\top\left(\frac{d}{dt}P(t) + P(t)A_0 + A_0^\top P(t) - 2P(t)B_0 R^{-1}B_0^\top P(t) + P(t)\Lambda_1 P(t) + Q_0\right)x + \varepsilon_0 \\
&- \omega(t)\left(2x^\top P(t)B_0 R^{-1}B_0^\top - x^\top A_0^\top + \frac{1}{2}\nabla^\top\sigma(x)B_0 R^{-1}B_0^\top\omega(t)\right)\nabla\sigma(x) - \omega(t)\nabla^\top\sigma(x)\Lambda_2\nabla\sigma(x) \\
&+ 2\bar{\omega}^\top(t)\,\frac{\tilde{\omega}(t)\nabla^\top\sigma(x)}{2\sigma(x)}\Bigg(-A_0 x + \frac{1}{4}\omega(t)B_0 R^{-1}B_0^\top\nabla\sigma(x) + B_0 R^{-1}B_0^\top\,2P(t)x \\
&\qquad - \frac{\sqrt[4]{\varphi\left(\|x\|^2 + 1\right)}}{z(\omega, P)}\,\nabla\sigma(x) - \frac{\omega(t)\sqrt[4]{\varphi\left(\|x\|^2 + 1\right)}}{z(\omega, P)}\,P(t)x\Bigg)\sigma(x). \qquad (32)
\end{aligned}
$$

Simplifying the expression (32), one gets

$$
\begin{aligned}
\frac{d}{dt}V_a(t, x) \le{}& x^\top\left(\frac{d}{dt}P(t) + P(t)A_0 + A_0^\top P(t) - 2P(t)B_0 R^{-1}B_0^\top P(t) + P(t)\Lambda_1 P(t) + Q_0\right)x + \varepsilon_0 \\
&- \omega(t)\left(\frac{1}{2}\nabla^\top\sigma(x)B_0 R^{-1}B_0^\top\omega(t)\right)\nabla\sigma(x) - \omega(t)\nabla^\top\sigma(x)\Lambda_2\nabla\sigma(x) \\
&- \omega(t)\,x^\top P(t)\,\frac{\sqrt[4]{\varphi\left(\|x\|^2 + 1\right)}}{z(\omega, P)}\,\nabla\sigma(x) - \frac{\omega^2(t)\sqrt{\varphi\left(\|x\|^2 + 1\right)}}{z(\omega, P)}\,\nabla^\top\sigma(x)\nabla\sigma(x).
\end{aligned}
$$

Considering the extended vector $\phi := \begin{bmatrix}x^\top & \nabla^\top\sigma(x)\end{bmatrix}^\top$ yields the simplified form

$$\frac{d}{dt}V_a(t, x) \le -x^\top\left(-\frac{d}{dt}P(t) - P(t)A_0 - A_0^\top P(t) + 2P(t)B_0 R^{-1}B_0^\top P(t) - P(t)\Lambda_1 P(t) - Q_2\right)x - \omega(t)\phi^\top Q_3\phi + \varepsilon_0, \qquad (33)$$

where

$$Q_2 := Q_0 + \omega(t)Q_1, \qquad Q_3(t) := \begin{bmatrix}\Pi_{11} & \Pi_{12} \\ \Pi_{12}^\top & \Pi_{22}\end{bmatrix},$$

$$\Pi_{11} := Q_1, \quad Q_1 \in \mathbb{R}^{n \times n},\ Q_1 = Q_1^\top,\ Q_1 > 0, \qquad \Pi_{12} := 2\,\frac{\sqrt{\varphi\left(\|x\|^2 + 1\right)}}{z(\omega, P)}\,P(t),$$

$$\Pi_{22} := \frac{1}{4}B_0 R^{-1}B_0^\top\omega(t) - \left[\Lambda_2 + \frac{\omega(t)\sqrt{\varphi\left(\|x\|^2 + 1\right)}}{z(\omega, P)}\right]\omega(t).$$

Considering (24), the inequality (33) is equivalent to

$$\frac{d}{dt}V_a(t, x) \le -x^\top Q_4(t)x - \omega(t)\phi^\top Q_3\phi + \varepsilon_0.$$

If one considers the subspace of the state variables $\left\{x : x^\top Q_4(t)x \ge \varepsilon_0\right\}$ ($Q_4(t)$ is given in (24)), then

$$\frac{d}{dt}V_a(t, x) \le 0.$$

In consequence, the states $x$ are ultimately bounded, with the bound given in (25). $\square$
