A policy iteration approach to online optimal control of continuous-time constrained-input systems


ISA Transactions 52 (2013) 611–621


Hamidreza Modares (a), Mohammad-Bagher Naghibi Sistani (a), Frank L. Lewis (b)

(a) Department of Electrical Engineering, Ferdowsi University of Mashhad, Mashhad 91775-1111, Iran
(b) University of Texas at Arlington Research Institute, 7300 Jack Newell Blvd. S., Ft. Worth, TX 76118, USA

Article history: Received 15 March 2012; Received in revised form 23 January 2013; Accepted 6 April 2013; Available online 24 May 2013

Abstract

This paper is an effort towards developing an online learning algorithm to find the optimal control solution for continuous-time (CT) systems subject to input constraints. The proposed method is based on the policy iteration (PI) technique, which has recently evolved as a major technique for solving optimal control problems. Although a number of online PI algorithms have been developed for CT systems, none of them takes into account the input constraints caused by actuator saturation. In practice, however, ignoring these constraints leads to performance degradation or even system instability. In this paper, to deal with the input constraints, a suitable nonquadratic functional is employed to encode the constraints into the optimization formulation. Then, the proposed PI algorithm is implemented on an actor–critic structure to solve the Hamilton–Jacobi–Bellman (HJB) equation associated with this nonquadratic cost functional in an online fashion. That is, two coupled neural network (NN) approximators, namely an actor and a critic, are tuned online and simultaneously to approximate the HJB solution and compute the optimal control policy. The critic is used to evaluate the cost associated with the current policy, while the actor is used to find an improved policy based on information provided by the critic. Convergence to a close approximation of the HJB solution as well as stability of the proposed feedback control law are shown. Simulation results of the proposed method on a nonlinear CT system illustrate the effectiveness of the proposed approach. © 2013 ISA. Published by Elsevier Ltd. All rights reserved.

Keywords: Optimal control; Reinforcement learning; Policy iteration; Neural networks; Input constraints

1. Introduction

The optimal control of continuous-time (CT) nonlinear systems is a challenging subject in control engineering. Solving such a problem requires solving the Hamilton–Jacobi–Bellman (HJB) equation [1], which has remained intractable in all but very special cases. This has inspired researchers to propose various approaches for obtaining approximate solutions to the HJB equation. One existing approximation approach is the power-series method [2,3], which expands the system nonlinearities in a power series and then, to avoid prohibitive computational effort, computes local estimates using only a few terms of the series. A second approach is the state-dependent Riccati equation (SDRE) [4,5], which extends the well-known Riccati equation to nonlinear systems. Solving the SDRE is, however, much more difficult than solving the Riccati equation,


because the coefficients in the SDRE are functions of the states rather than constants as in the Riccati equation. Another elegant approach to approximating the HJB solution is policy iteration (PI) [6], where an iterative process is used to find a sequence of approximations converging to the solution of the HJB equation. PI is a class of reinforcement learning (RL) [7,8] methods built on two-step iterations: policy evaluation and policy improvement. In the policy evaluation step, the cost associated with a control policy is evaluated by solving a nonlinear Lyapunov equation (LE). In the policy improvement step, the algorithm finds an improved policy under which the system performs better. These two steps are repeated until the policy converges to a near-optimal policy.

Considerable research has been conducted on approximating the HJB solution of discrete-time systems using PI algorithms [9–30]. However, due to the complex nature of the HJB equation for nonlinear CT systems, only a few results are available [31–38]. The first practical PI algorithm for nonlinear CT systems was proposed by Beard [31], who utilized the Galerkin approximation method to find approximate solutions to the LE in the policy evaluation step. However, the Galerkin approximation method requires the evaluation of numerous integrals, which is computationally intensive [32]. A computationally effective algorithm to find near-optimal control laws was presented


by Abu-Khalaf and Lewis [32], who used neural network (NN) approximators to approximate solutions to the LE. Their results showed the suitability of NN approximators for PI methods. Although efficient, both methods presented in [31,32] are offline techniques.

Developing online learning algorithms for solving optimal control problems is of great interest in the control systems community, since additional approaches such as adaptive control can then be integrated with optimal control to develop adaptive optimal control algorithms for systems with parametric uncertainties or even unknown dynamics. An online PI algorithm was first presented by Doya [33] for optimal control of CT systems; however, this algorithm was not shown to guarantee the stability of the control system. Murray et al. [34] proposed a PI algorithm which converges to the optimal control solution without using an explicit, a priori obtained model of the drift dynamics of the system, but it requires measurements of the state derivatives. Vrabie and Lewis [35] presented an online PI algorithm which solves the optimal control problem using only partial knowledge of the system dynamics and without requiring state-derivative measurements; however, the inherently discrete nature of their controller prevents the development of a stability proof for the closed-loop system. Vamvoudakis and Lewis [36] proposed an online PI-based algorithm with guaranteed closed-loop stability for CT systems with completely known dynamics. Inspired by the work in [36], Dierks and Jagannathan [38] presented a single online approximator-based optimal scheme with guaranteed stability. Also motivated by [36], Bhasin et al. [37] presented an online PI algorithm in which the requirement of knowing the system drift dynamics was eliminated by employing a NN to identify the drift dynamics. Although efficient, none of these online PI algorithms takes into account the input constraints caused by actuator saturation.

The control of systems subject to input constraints is of increasing importance, since almost all actuators in real-world applications are subject to saturation. Control design methods that ignore the constraints on the magnitude of the control inputs may lead to performance degradation and even system instability. Hence, during the control development, due attention must be paid to the constraints with which the control signals must comply. This issue is even more important when one designs an online learning control method, because instability can easily occur as a result of continuing online adaptation during input saturation. This motivates our research into incorporating the actuator saturation limits when designing a PI algorithm for optimal control of CT systems.

This paper is concerned with developing an online optimal control method for CT systems in the presence of constraints on the input amplitude. To deal with actuator saturation, a suitable nonquadratic functional is used to encode the constraints into the optimization formulation. Then, a PI algorithm on an actor–critic structure is developed to solve the associated HJB equation online. That is, the optimal control law and the optimal value function are approximated as the outputs of two NNs, an actor NN and a critic NN. The problem of solving the HJB equation is thereby converted to simultaneously adjusting the weights of these two NNs. Given an arbitrary nonoptimal control policy produced by the action network, the critic network guides the action network toward the optimal solution by successive adaptation. The closed-loop stability of the overall system and the boundedness of the actor and critic NN weights are ensured using Lyapunov theory. To our knowledge, this is the first treatment in which the input constraints are considered during the design of an online PI learning algorithm for solving the optimal control problem. Note that, although the authors of [39] presented an actor–critic algorithm for control of discrete-time systems with input constraints, their method does not converge to the optimal feedback control solution for a user-defined cost function, as it only minimizes a norm of the output error.

This paper is organized as follows. In the next section, some notations and definitions are given. Section 3 gives an overview of optimal control for CT systems with input constraints, which requires a preliminary offline design. The development and implementation of the proposed online PI algorithm are presented in Section 4. Sections 5 and 6 present simulation results and conclusions, respectively.

2. Preliminaries

2.1. Notations and definitions

Throughout the paper, ℝ denotes the real numbers, ℝ^n the real n-vectors, ℝ^{m×n} the real m×n matrices, and I the identity matrix of appropriate dimension. For a scalar v, |v| denotes its absolute value; for a vector x, ∥x∥ denotes the Euclidean norm; for a matrix M, ∥M∥ denotes the induced 2-norm and tr{M} the trace. The sign function is defined as

sgn(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}

Finally, (·)^T denotes the transpose and λ_min(·) the minimum eigenvalue of a Hermitian matrix.

Lemma 1 (Young's inequality) [40]. For any two vectors x and y, it holds that

x^T y \le \frac{\|x\|^2}{2} + \frac{\|y\|^2}{2}.   (1)

Definition 1 (Uniformly ultimately bounded (UUB) stability) [41]. Consider the nonlinear system

\dot{x} = f(x, t)   (2)

with state x(t) ∈ ℝ^n. The equilibrium point x_e is said to be UUB if there exists a compact set Ω ⊂ ℝ^n so that for all x_0 ∈ Ω there exist a bound B and a time T(B, x_0) such that ∥x(t) − x_0∥ ≤ B for all t ≥ t_0 + T. That is, after a transient period T, the state remains within the ball of radius B around x_0.

Definition 2 (Exponential stability) [42]. The equilibrium state x_e of the system (2) is exponentially stable if there exists an η > 0 and, for every ε > 0, a δ(ε) > 0 such that |x(t; t_0, x_0) − x_e| ≤ ε e^{−η(t − t_0)} for all t > t_0, whenever |x_0 − x_e| < δ(ε).

Definition 3 (Zero-state observability) [41]. System (2) with measured output y = h(x) is zero-state observable if y(t) ≡ 0 for all t ≥ 0 implies x(t) ≡ 0 for all t ≥ 0.

Definition 4 (Persistently exciting (PE) signal) [42]. The bounded vector signal z(t) is PE over the interval [t, t + T_1] if there exist T_1 > 0, γ_1 > 0 and γ_2 > 0 such that, for all t,

\gamma_1 I \le \int_t^{t+T_1} z(\tau) z^T(\tau)\, d\tau \le \gamma_2 I.   (3)

Definition 5 (Lipschitz) [42]. A function f : [a, b] → ℝ is Lipschitz on [a, b] if |f(x_1) − f(x_2)| ≤ k |x_1 − x_2| for all x_1, x_2 ∈ [a, b], where k ≥ 0 is a constant.
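As a quick numerical illustration of the PE condition in Definition 4, the following Python sketch approximates the excitation Gram matrix over one window and inspects its eigenvalues. The two-dimensional test signal z(t) = [sin t, cos 2t]^T, the window length and the step size are illustrative choices of this sketch and are not taken from the paper.

```python
import numpy as np

# Minimal check of the persistence-of-excitation condition (Definition 4):
# gamma_1 I <= int_t^{t+T1} z(tau) z(tau)^T dtau <= gamma_2 I.
def excitation_gram(z, t0, T1, dt=1e-3):
    """Approximate the integral of z(tau) z(tau)^T over [t0, t0 + T1]."""
    taus = np.arange(t0, t0 + T1, dt)
    return sum(np.outer(z(tau), z(tau)) for tau in taus) * dt

z = lambda t: np.array([np.sin(t), np.cos(2.0 * t)])   # illustrative signal
G = excitation_gram(z, t0=0.0, T1=2.0 * np.pi)
eigs = np.linalg.eigvalsh(G)
print("eigenvalues of the excitation Gram matrix:", eigs)
# If the smallest eigenvalue stays above some gamma_1 > 0 for every window,
# the signal is persistently exciting in the sense of Definition 4.
```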


2.2. Function approximation by neural networks

The NN universal approximation property indicates that any continuous function f(x) can be approximated arbitrarily closely on a compact set by a two-layer NN with appropriate weights. That is, on a compact set x ∈ Ω, f(x) can be approximated as

f(x) = W^T \phi(L^T x) + \varepsilon(x)   (4)

where φ(·) is the vector of activation functions, L is a matrix of first-layer weights, W is a matrix of second-layer weights, and ε(x) is the NN functional approximation error. For a sufficiently large number of NN neurons, ε(x) is bounded; in fact, for any positive number ε_N, one can find a NN of large enough size such that ∥ε(x)∥ ≤ ε_N for all x ∈ Ω.

In recent years, the use of NNs in feedback control applications has accelerated. However, a two-layer NN is nonlinear in the first-layer parameters L, which makes adjusting these parameters difficult in feedback control applications. If the first-layer weights L are fixed, the NN is linear in the adjustable parameters W. It has been shown that, if the first-layer weights L are suitably fixed, the approximation property can be satisfied by selecting only the output weights W [41]. For this to occur, φ(L^T x) = s(x) must provide a basis. The following standard assumption is made for the NNs used in this paper.

Assumption 1. [36]

a. The NN reconstruction error and its gradient are bounded over the compact set Ω, i.e. ∥ε(x)∥ ≤ b_ε and ∥∇ε(x)∥ ≤ b_{εx}.
b. The NN activation functions and their gradients are bounded, i.e. ∥s(x)∥ ≤ b_s and ∥∇s(x)∥ ≤ b_{sx}.
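The linear-in-the-parameters property discussed above can be illustrated with a short sketch: with the first-layer weights held fixed (random values below, an assumption of this sketch), the output weights W of an approximator of the form (4) can be obtained by ordinary least squares on a hypothetical one-dimensional target. The target function and all sizes are illustrative choices, not the paper's.

```python
import numpy as np

# Sketch of the linear-in-the-parameters property: with the first-layer
# weights L fixed, f(x) ~ W^T phi(L^T x) and only the output weights W
# need to be adjusted, here by least squares on a hypothetical target.
rng = np.random.default_rng(0)
n, hidden = 1, 25
L = rng.normal(size=(n, hidden))          # fixed first-layer weights
b = rng.uniform(-1.0, 1.0, size=hidden)   # fixed biases

def phi(x):
    # tanh activations of the fixed first layer, one basis vector per sample
    return np.tanh(x @ L + b)

f = lambda x: np.sin(2.0 * x) * np.exp(-x**2)     # target to approximate
X = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)
Phi = phi(X)                                      # 200 x hidden regressor
W, *_ = np.linalg.lstsq(Phi, f(X).ravel(), rcond=None)
print("max approximation error:", np.max(np.abs(Phi @ W - f(X).ravel())))
```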

3. Optimal control for CT constrained-input systems

3.1. Problem formulation

Consider the affine CT dynamical system described by

\dot{x} = f(x) + g(x) u(x)   (5)

where x ∈ ℝ^n, f(x) ∈ ℝ^n, g(x) ∈ ℝ^{n×m}, and u(x) ∈ ℝ^m. Each component of u(x) is considered to be bounded by a positive constant, i.e.

|u_i(x)| \le \lambda, \quad i = 1, \ldots, m   (6)

where λ ∈ ℝ is the saturating bound for the actuators. The problem discussed in this paper is to find an admissible (in the sense defined below) feedback solution u(x) which satisfies the constraints (6) and minimizes the following infinite-horizon performance index associated with the system (5):

V(x(t)) = \int_t^{\infty} \big( Q(x(\tau)) + U(u(\tau)) \big)\, d\tau   (7)

where Q(x) is a positive definite monotonically increasing function and U(u) is a positive definite integrand function.

Definition 6 (Admissible control input) [31,32]. The control input u(x) is admissible with respect to the cost function (7) on Ω, written u(x) ∈ ψ(Ω), if u(x) is continuously differentiable on Ω, u(0) = 0, u(x) stabilizes the system (5), and the cost (7) is finite for all x ∈ Ω.

To ensure that the optimal control problem is well posed and the solution to (7) is positive definite, the following assumptions are made, in keeping with other work in the literature.

Assumption 2.

a. x ∈ Ω and Ω is a compact set containing the origin as an interior point.
b. f(x) and g(x) are Lipschitz on Ω and f(0) = 0, i.e. the origin is an equilibrium point of system (5).
c. The system (5) is controllable over the compact set Ω.
d. The performance functional (7) satisfies zero-state observability.

Remark 1. Assumption 2(b) indicates that the solution x(t) of the system (5) is unique for any finite initial condition and admissible control input. Assumption 2(c) states that there exists a continuous control on Ω that asymptotically stabilizes the closed-loop system. Assumption 2(d) is guaranteed by the condition that Q(x) is positive definite, i.e. Q(x) > 0 for x ∈ Ω − {0} and Q(0) = 0.

To guarantee that the control signals satisfy the input constraints, the following nonquadratic functional, proposed in [43,44] and used in [32,45], is employed:

U(u) = 2 \int_0^{u} \big( \lambda \varphi^{-1}(v/\lambda) \big)^T R\, dv, \quad \varphi(v) = [\phi(v_1), \ldots, \phi(v_m)]^T, \quad \varphi^{-1}(u) = [\phi^{-1}(u_1), \ldots, \phi^{-1}(u_m)]^T, \quad v, \varphi \in \mathbb{R}^m   (8)

where ϕ(·) is a continuous, one-to-one, bounded function satisfying |ϕ(·)| ≤ 1 and belonging to class C^p (p ≥ 1), with ϕ(0) = 0. Moreover, it is a monotonic odd function whose first derivative is bounded by a constant. Also, R is a positive definite matrix. Denoting ω(v) = φ^{-1}(v/λ) R = [ω_1(v_1), …, ω_m(v_m)] and defining ∫_0^u ω(v)^T dv = Σ_{i=1}^m ∫_0^{u_i} ω_i(v_i) dv_i, where the subscript i in ω_i, u_i and v_i denotes the ith element of the corresponding vector, it is clear that U(u) in (8) is a scalar for u ∈ ℝ^m. Also note that U(u) is positive definite because ϕ^{-1}(·) is monotonic odd and R is positive definite. In this paper the well-known hyperbolic tangent ϕ(·) = tanh(·) is used, and therefore (8) becomes

U(u) = 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv.   (9)

Using the cost functional (9) in (7) and differentiating V along the system trajectories, the following nonlinear LE is obtained:

LE(x, u, V): \; Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv + \nabla V^T(x) \big( f(x) + g(x) u(x) \big) = 0, \quad V(0) = 0   (10)

where ∇V(x) = ∂V(x)/∂x denotes the gradient of the value function V(x). Eq. (10) is an infinitesimal version of (7). Let V*(x) be the optimal cost function defined as

V^*(x(t)) = \min_{u(\tau) \in \psi(\Omega),\; t \le \tau < \infty} \int_t^{\infty} \big( Q(x(\tau)) + U(u(\tau)) \big)\, d\tau.   (11)

Then V*(x) satisfies the following HJB equation:

\min_{u \in \psi(\Omega)} \Big[ Q(x) + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv + \nabla V^{*T}(x) \big( f(x) + g(x) u(x) \big) \Big] = 0.   (12)

Assuming that the minimum on the left-hand side of (12) exists and is unique, the optimal control for the given problem is obtained by differentiating (12). The result is

u^*(x) = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T(x) \nabla V^*(x) \Big) = -\lambda \tanh(D^*)   (13)

where D* = (1/2λ) R^{-1} g(x)^T ∇V*(x). The nonquadratic cost (9) evaluated at u* is

U(u^*) = 2 \int_0^{u^*} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv = 2\lambda \big( \tanh^{-1}(u^*/\lambda) \big)^T R\, u^* + \lambda^2 R \ln\big( 1 - (u^*/\lambda)^2 \big).   (14)

Putting (13) into (14) yields

U(u^*) = \nabla V^{*T}(x) g(x) \lambda \tanh(D^*) + \lambda^2 R \ln\big( 1 - \tanh^2(D^*) \big).   (15)
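The closed-form expression used in (14) can be checked numerically. The following sketch assumes a scalar input with R = 1 and λ = 1 (illustrative values only) and compares the integral definition (9) with the closed form of (14).

```python
import numpy as np
from scipy.integrate import quad

# Numerical check (scalar input, R = 1, lambda = 1: illustrative values)
# that the integral form of the nonquadratic cost,
#   U(u) = 2 * int_0^u lambda*atanh(v/lambda)*R dv,
# matches the closed form used in (14):
#   U(u) = 2*lambda*R*u*atanh(u/lambda) + lambda^2*R*ln(1 - (u/lambda)^2).
lam, R = 1.0, 1.0

def U_integral(u):
    val, _ = quad(lambda v: 2.0 * lam * np.arctanh(v / lam) * R, 0.0, u)
    return val

def U_closed(u):
    return (2.0 * lam * R * u * np.arctanh(u / lam)
            + lam**2 * R * np.log(1.0 - (u / lam) ** 2))

for u in [0.1, 0.5, 0.9]:
    print(u, U_integral(u), U_closed(u))   # the two columns should agree
```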


Substituting u*(x) from (13) and U(u*) from (15) back into the nonlinear LE (10), the HJB equation for constrained-input systems becomes

Q(x) + \nabla V^{*T}(x) f(x) + \lambda^2 R \ln\big( 1 - \tanh^2(D^*) \big) = 0, \quad V(0) = 0.   (16)

To find the optimal control solution directly, the HJB equation (16) must first be solved for V*(x); substituting V*(x) into (13) then gives the optimal control input that achieves this minimal performance. However, the HJB equation (16) is extremely difficult to solve.

3.2. Solving the HJB equation using an offline PI algorithm

Note that the LE (10) is linear in the cost function derivative ∇V, while the HJB equation (16) is nonlinear in the value function derivative ∇V*. Hence, solving the LE (10) for V(x) requires solving a linear partial differential equation (PDE), while solving the HJB equation (16) for V*(x) requires solving a nonlinear PDE, which may be impossible. This motivates an iterative PI algorithm for approximating the HJB solution. Instead of directly solving the HJB equation for V*(x), the PI algorithm starts from a given admissible control policy and performs the following two-step iteration to find the optimal control policy.

1. (Policy evaluation) Given a control input u^i(x), find V^i(x) using the following LE:

LE(x, u^i, V^i): \; Q(x) + 2 \int_0^{u^i} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv + (\nabla V^i)^T \big( f(x) + g(x) u^i(x) \big) = 0.   (17)

2. (Policy improvement) Update the control policy using

u^{i+1}(x) = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T(x) \nabla V^i(x) \Big).   (18)

Note that, according to (18), the control signals always satisfy the input constraints (6). The policy evaluation and policy improvement steps (17) and (18) define a sequence of iterations whose solutions converge to the optimal value function V*(x) and the optimal control policy u*(x), respectively [32]. In [32], a NN was trained to obtain approximate solutions to the LE. However, in [32] the solution of the HJB equation is obtained by means of offline iterative computation. The algorithm is also sequential: while the LE is being solved, the control policy is held constant, and vice versa. Moreover, evaluating the multidimensional integrals appearing in that algorithm is computationally expensive. In the next section, we develop an online learning algorithm that solves the HJB equation without iterative offline training and with a lighter computational burden. A sketch of the policy improvement step follows.
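The sketch below builds the saturated policy u_{i+1}(x) of (18) from the gradient of a current value estimate; the input matrix g(x), the weight R, the bound λ and the quadratic value estimate used for the usage example are placeholder choices of this sketch, not the paper's.

```python
import numpy as np

# Sketch of the policy-improvement step (18): the improved policy is
# saturated by construction because |tanh(.)| <= 1. The system input
# matrix g(x), the gain R and the bound lambda are placeholders.
lam = 1.0
R_inv = np.array([[1.0]])                 # R assumed 1x1 here
g = lambda x: np.array([[0.0], [1.0]])    # example input matrix (n=2, m=1)

def improve_policy(grad_V):
    """Return u_{i+1}(x) = -lam*tanh((1/(2*lam)) R^{-1} g(x)^T grad_V(x))."""
    def u_next(x):
        D = R_inv @ g(x).T @ grad_V(x) / (2.0 * lam)
        return -lam * np.tanh(D)
    return u_next

# Usage with a quadratic value estimate V_i(x) = x^T P x as an example:
P = np.array([[2.0, 0.3], [0.3, 1.0]])
grad_V = lambda x: 2.0 * P @ x
u1 = improve_policy(grad_V)
print(u1(np.array([0.5, -1.0])))          # stays inside [-lam, lam]
```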

4. Online PI algorithm for real-time solution of optimal constrained control

A new online PI algorithm is developed in this section to solve the optimal control problem for constrained systems in real time by learning along the system trajectories. Two NNs, an actor NN and a critic NN, are used to approximate the value function and its corresponding optimal policy. Although the offline PI algorithm of the previous section motivates the control structure, that sequential algorithm itself is not implemented. Instead of sequentially updating the critic and actor NNs, both are updated simultaneously, and the learning is implemented as differential equations for tuning the NN weights. In fact, in our algorithm the critic and actor NNs are tuned at the same time to solve (17) and (18), respectively. This is the continuous-time version of generalized policy iteration (GPI) introduced in [7]: in GPI, at each step the value of a given policy is not evaluated completely; only the current estimate is updated towards that value. Fig. 1 shows the block diagram of the proposed PI-based controller.

Fig. 1. The proposed PI-based optimal controller design method.

Before presenting the main algorithm in Section 4.2, the critic NN structure and its tuning and convergence are presented. The critic is used to approximate the value function. Assuming the value function solution of the HJB equation (16) is a smooth function (which holds under Assumption 2), there exists a single-layer NN such that the solution V(x) of (16) and its gradient can be uniformly approximated as [32,36]

V(x) = W_1^T s(x) + \varepsilon(x)   (19)

\nabla V(x) = \nabla s(x)^T W_1 + \nabla \varepsilon(x)   (20)

where s(x) ∈ ℝ^l provides a suitable basis function vector, ε(x) is the approximation error, W_1 ∈ ℝ^l is a constant parameter vector, and l is the number of neurons. To see the effect of the NN reconstruction error on the HJB equation (16), assume that the optimal value function is approximated by (19). Then, using its gradient (20) in (16), the HJB equation can be written as

Q + W_1^T \nabla s\, f + \lambda^2 R \ln\big( 1 - \tanh^2(D) \big) = \varepsilon_{HJB}   (21)

where D = (1/2λ) R^{-1} g^T ∇s^T W_1 and ε_HJB, the residual error due to the function reconstruction error, consists of the term −∇ε^T f plus higher-order terms in ∇ε. In [32], the authors showed that as the number of hidden-layer neurons l increases, the error of the approximate HJB solution converges to zero. Hence, for each constant ε_h > 0, one can construct a NN such that sup_{∀x} ∥ε_HJB∥ ≤ ε_h. Note that in (21) and in the sequel, for ease of exposition, the argument x is dropped.

4.1. Critic NN weights tuning and convergence

This subsection presents the tuning and convergence of the critic NN weights for a fixed control policy, in effect designing an observer for the unknown value function for use in feedback control. Consider a fixed control policy u(x) and assume that its corresponding value function is approximated by (19). Then, using the gradient of the value function approximation (20), the LE (10) becomes


LE(x, u, W_1): \; Q + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv + W_1^T \nabla s (f + g u) = \varepsilon_{LE}   (22)

where the residual error due to the function reconstruction error is

\varepsilon_{LE} = -(\nabla \varepsilon)^T (f + g u).   (23)

Under Assumptions 1 and 2, this residual error is bounded on the compact set Ω, i.e. sup_{∀x∈Ω} ∥ε_LE∥ ≤ ε_max. However, the ideal critic NN weight vector W_1, which provides the best approximate solution of (22), is unknown and must be estimated in real time. Hence, the output of the critic NN and the approximate nonlinear LE can be written as

\hat{V}(x) = \hat{W}_1^T s   (24)

Q + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv + \hat{W}_1^T \nabla s (f + g u) = \delta   (25)

where Ŵ_1 is the current estimate of W_1. Note that the error δ is the continuous-time counterpart of the temporal difference (TD) [7]. The problem of finding the value function is now converted to adjusting the critic NN parameters such that the TD error δ is minimized. In order to bring the TD error to its minimum value, namely ε_LE (comparing (22) and (25), it is clear that this occurs when Ŵ_1 → W_1), the following objective function is considered:

E = \tfrac{1}{2} \delta^2.   (26)

From (26) and using the chain rule, the gradient-descent algorithm for E is given by

\dot{\hat{W}}_1 = -\alpha_1 \frac{1}{(1 + \beta^T \beta)^2} \frac{\partial E}{\partial \hat{W}_1} = -\alpha_1 \frac{\beta}{(1 + \beta^T \beta)^2}\, \delta   (27)

where β = ∇s(f + gu), α_1 > 0 is the learning rate, and the term (1 + β^T β)^2 is used for normalization. The next theorem is needed to show the exponential convergence of the critic NN weight error under the update law (27).

Theorem 1. [42] Consider the linear time-varying system

\dot{x} = -\alpha(t) \alpha^T(t)\, x.   (28)

The equilibrium state x_e = 0 of this system is exponentially stable if the bounded signal α(t) is PE.

Theorem 2. Let u(x) be any admissible bounded control policy and consider the adaptive law (27) for tuning the critic NN weights. If β̄ = β/(1 + β^T β) is PE, then

(a) for ε_LE = 0, Ŵ_1 converges exponentially to the unknown weights W_1;
(b) for bounded ε_LE, i.e. sup_{∀x} ∥ε_LE∥ ≤ ε_max, W̃_1 = W_1 − Ŵ_1 converges exponentially to the residual set R_s = {W̃_1 : ∥W̃_1∥ ≤ c ε_max}, where c > 0 is a constant.

Proof. To analyze the convergence of the critic NN weights, the dynamics of the critic NN weight error are first derived. From (22) it follows that

Q + 2 \int_0^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^T R\, dv = -W_1^T \nabla s (f + g u) + \varepsilon_{LE}.   (29)

Using (29) in (25) results in

\delta = -\tilde{W}_1^T \nabla s (f + g u) + \varepsilon_{LE}   (30)

where W̃_1 = W_1 − Ŵ_1. Using (30) in (27) and denoting β̄ = β/(1 + β^T β) and m_s = 1 + β^T β yields

\dot{\tilde{W}}_1 = -\alpha_1 \bar{\beta} \bar{\beta}^T \tilde{W}_1 + \alpha_1 \frac{\bar{\beta}}{m_s} \varepsilon_{LE}.   (31)

Viewing (31) as a linear time-varying system with input α_1 β̄ ε_LE / m_s, the solution for W̃_1 is given by [47]

\tilde{W}_1(t) = \varphi(t, t_0) \tilde{W}_1(0) + \int_{t_0}^{t} \varphi(t, \tau)\, \alpha_1 \frac{\bar{\beta}}{m_s} \varepsilon_{LE}\, d\tau   (32)

with the state transition matrix defined by

\frac{\partial \varphi(t, t_0)}{\partial t} = -\alpha_1 \bar{\beta} \bar{\beta}^T \varphi(t, t_0), \quad \varphi(t_0, t_0) = I.   (33)

From Theorem 1, it can be concluded that φ(t, t_0) is exponentially stable provided that β̄ is PE. Therefore, if β̄ is PE, the state transition matrix of the homogeneous part of (31) satisfies

\| \varphi(t, t_0) \| \le \eta_1 e^{-\eta_2 (t - t_0)}   (34)

for all t, t_0 > 0 and some η_1, η_2 > 0. Using (32) and (34), and the fact that ∥β̄∥ < 1, we obtain

\| \tilde{W}_1(t) \| \le \eta_0 e^{-\eta_2 t} + \frac{\alpha_1}{m_s} \int_0^{t} e^{-\eta_2 (t - \tau)} \| \varepsilon_{LE} \|\, d\tau   (35)

where η_0 = η_1 ∥W̃_1(0)∥ e^{η_2 t_0}. Eq. (35) can be written as

\| \tilde{W}_1 \| \le \eta_0 e^{-\eta_2 t} + \frac{\alpha_1 \varepsilon_{max}}{m_s \eta_2}   (36)

for some constant η_0 > 0. For ε_LE = 0, (36) becomes ∥W̃_1∥ ≤ η_0 e^{-η_2 t} and the proof of (a) is complete. For ε_LE ≠ 0, (36) becomes ∥W̃_1∥ ≤ ε_t + c ε_max, where ε_t is an exponentially decaying term and c = α_1/(m_s η_2); this completes the proof of (b).

Remark 2. Note that ε_LE is the residual error due to the NN reconstruction error and approaches zero as the number of NN neurons is chosen sufficiently large.
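The normalized gradient update (25)–(27) can be written compactly as a single Euler step, as in the sketch below. The drift f(x), input map g(x), basis s(x), cost terms and step size are placeholder choices of this sketch and are not taken from the paper.

```python
import numpy as np

# One Euler step of the critic update (27) under placeholder dynamics,
# basis and cost; delta is the continuous-time TD error of (25).
alpha1, lam, dt = 20.0, 1.0, 1e-3
f = lambda x: np.array([x[1], -x[0] - x[1]])        # placeholder drift
g = lambda x: np.array([0.0, 1.0])                  # placeholder input map
s = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])
grad_s = lambda x: np.array([[2*x[0], 0.0],
                             [x[1],   x[0]],
                             [0.0,    2*x[1]]])     # ds/dx, one row per neuron
Q = lambda x: x @ x
U = lambda u: 2*lam*u*np.arctanh(u/lam) + lam**2*np.log(1 - (u/lam)**2)

def critic_step(W1_hat, x, u):
    beta = grad_s(x) @ (f(x) + g(x) * u)            # regressor of (27)
    delta = Q(x) + U(u) + W1_hat @ beta             # TD error (25)
    W1_dot = -alpha1 * beta / (1.0 + beta @ beta)**2 * delta
    return W1_hat + dt * W1_dot

W1_hat = np.zeros(3)
x, u = np.array([0.5, -0.2]), 0.3
print(critic_step(W1_hat, x, u))
```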

4.2. Actor NN and the proposed synchronous PI algorithm

This section presents our main results. To solve the optimal control problem in real time, an online PI algorithm is given which involves simultaneous and synchronous tuning of the actor and critic NN weights.

In the policy improvement step, the actor finds an improved control policy corresponding to the current estimated value function. Assuming that Ŵ_1 is the current estimate of the optimal critic NN weights, then according to (18) the policy update law is

u_1 = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T \nabla s^T \hat{W}_1 \Big).   (37)

However, the policy update law (37) does not guarantee the stability of the closed-loop system. In practice, a control system implemented in real time is required to be stable. Therefore, to assure stability in the Lyapunov sense (as discussed later), the following policy update law is used:

\hat{u}_1 = -\lambda \tanh\Big( \frac{1}{2\lambda} R^{-1} g^T \nabla s^T \hat{W}_2 \Big)   (38)

where Ŵ_2 is the current estimate of the optimal unknown weight vector W_1.

Define the actor NN estimation error as

\tilde{W}_2 = W_1 - \hat{W}_2.   (39)

Before presenting the main results, note that since, by Assumption 2, f(x) is Lipschitz and f(0) = 0, we have

\| f(x) \| \le b_f \| x \|.   (40)

Also, similar to [36], the following assumption is made on the input gain matrix g(x).

Assumption 3. g(x) is bounded by a constant, i.e. ∥g(x)∥ ≤ b_g.

We now present the main theorem, which provides tuning laws for the actor and critic NN weights that ensure convergence of the proposed PI algorithm to a near-optimal control law while guaranteeing stability.

Theorem 3. Consider the dynamical system (5). Let the critic NN be given by (24) and the actor NN be given by (38). Let the tuning law for the critic NN weights be

\dot{\hat{W}}_1 = -\alpha_1 \frac{\beta_1}{(1 + \beta_1^T \beta_1)^2} \big( Q + \hat{U} + \hat{W}_1^T \beta_1 \big)   (41)

where β_1 = ∇s(f + g û_1), β̄_1 = β_1/(1 + β_1^T β_1), and Û (defined later in (A.11)) is obtained by substituting û_1 from (38) into the cost (9). Assume that β̄_1 is PE and let the actor NN weights be tuned as

\dot{\hat{W}}_2 = -\alpha_2 \Big( (Y_2 \hat{W}_2 - Y_1 \bar{\beta}_1^T \hat{W}_1) - \nabla s\, g \lambda \big[ \tanh(\hat{D}) - \mathrm{sgn}(\hat{D}) \big] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1 \Big)   (42)

where D̂ = (1/2λ) R^{-1} g^T ∇s^T Ŵ_2, m_{s1} = 1 + β_1^T β_1, and Y_1 > 0 and Y_2 > 0 are design parameters satisfying

Y_2 - \tfrac{1}{2} Y_1 Y_1^T > 0.   (43)

Let Assumptions 1–3 hold. Then the closed-loop system states, the critic NN error, and the actor NN error are UUB, provided the number of NN neurons is sufficiently large.

Proof. See the Appendix.

Remark 3. Note that, according to the control policy (38), the control signals always satisfy the constraints (6), so the restriction imposed by actuator saturation is handled successfully.

Remark 4. According to the error bound obtained for the critic NN weights in Theorem 3, Theorem 2 still holds with ε = sup_{∀x} ∥B∥ in place of ε_max, where B = 2c/m_{s1} and c is defined in the proof.

Remark 5. According to Theorem 3, the error bounds for the system states and the actor and critic NN weights depend on the NN approximation errors, the LE residual, the HJB residual, and the unknown critic NN weights. As the number of NN hidden-layer neurons increases, all of these go to zero except the term involving the unknown critic NN weights. These bounds are, in fact, conservative, and the simulation results show that the value function and the optimal control solution are closely identified. Deriving update laws that yield asymptotic stability of the closed-loop system and NN weights, instead of UUB stability, is still an open problem in policy-iteration-based optimal control methods.

Remark 6. The requirement that the regressors β and β_1 in Theorems 2 and 3 be PE ensures sufficient exploration of the state space and is crucial for proper convergence of the critic NN to the optimal value function. As no verifiable method exists to ensure PE in nonlinear systems, a small exploratory signal consisting of sinusoids of varying frequencies can be added to the control input to ensure PE qualitatively. Future efforts can focus on relaxing the PE assumption by replacing it with a milder condition on the regressor vectors β and β_1. For instance, the recent concurrent-learning method presented in [46] could be used instead of the gradient algorithm (27) for updating the critic NN weights to relax the PE assumption; this does not, however, form part of the novel results presented herein.
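To illustrate how the synchronous tuning could be organized in simulation, the following sketch integrates the example dynamics used later in Section 5 with the saturated actor policy (38) and the critic law (41). For brevity the actor update is reduced to a filtered copy of the critic estimate, so it is a simplification and not the stabilizing law (42); the quadratic basis s(x), R = 1, λ = 1, the probing signal and the step size are likewise choices made only for this sketch.

```python
import numpy as np

# Skeleton of a synchronous actor-critic loop on the example system (44),
# with a placeholder quadratic basis. The critic follows (41); the actor
# step is a simplified filtered copy of the critic, NOT the exact law (42).
alpha1, alpha2, lam, dt = 20.0, 20.0, 1.0, 1e-3
f = lambda x: np.array([ x[0] + x[1] - x[0]*(x[0]**2 + x[1]**2),
                        -x[0] + x[1] - x[1]*(x[0]**2 + x[1]**2)])
g = lambda x: np.array([0.0, 1.0])
grad_s = lambda x: np.array([[2*x[0], 0.0], [x[1], x[0]], [0.0, 2*x[1]]])
Q = lambda x: x @ x

W1, W2 = np.ones(3), np.ones(3)                 # critic / actor weights
x = np.array([0.5, -0.5])
for k in range(20000):
    t = k * dt
    D_hat = (g(x) @ grad_s(x).T @ W2) / (2.0 * lam)
    u = -lam * np.tanh(D_hat)                   # saturated actor policy (38)
    u += 0.1 * np.sin(7*t) * np.exp(-0.01*t)    # small probing term for excitation
    beta1 = grad_s(x) @ (f(x) + g(x) * u)
    U_hat = (W2 @ grad_s(x) @ g(x)) * lam * np.tanh(D_hat) \
            + lam**2 * np.log(max(1 - np.tanh(D_hat)**2, 1e-12))
    e = Q(x) + U_hat + W1 @ beta1               # residual of the LE, as in (41)
    W1 -= dt * alpha1 * beta1 / (1 + beta1 @ beta1)**2 * e
    W2 -= dt * alpha2 * (W2 - W1)               # simplified actor step (not (42))
    x = x + dt * (f(x) + g(x) * u)              # Euler integration of (5)
print("critic weights:", W1, "actor weights:", W2, "state:", x)
```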

5. Simulation results

In this section we provide a simulation of the new online PI algorithm on a nonlinear system to demonstrate its effectiveness. The new algorithm is compared with the offline design method of [32] and also with a standard PID controller. It is seen that the proposed algorithm outperforms both of these methods; in particular, the PID controller does not guarantee that the actuator limits are satisfied.

In order to demonstrate the feasibility of the proposed online optimal control scheme, the following nonlinear system is considered:

\dot{x}_1 = x_1 + x_2 - x_1 (x_1^2 + x_2^2), \quad \dot{x}_2 = -x_1 + x_2 - x_2 (x_1^2 + x_2^2) + u.   (44)

The aim is to control the system with the control limit |u| ≤ 1. This problem was considered by Abu-Khalaf and Lewis [32] to test their offline optimal control design algorithm for constrained-input systems. The nonquadratic cost functional is chosen as [32]

V(x_1, x_2) = \int_0^{\infty} \Big( x_1^2 + x_2^2 + 2 \int_0^{u} (\tanh^{-1}(v))^T\, dv \Big)\, dt.   (45)

Also, similar to [32], the critic NN is chosen as

V(x_1, x_2) = W_{c1} x_1^2 + W_{c2} x_2^2 + W_{c3} x_1 x_2 + W_{c4} x_1^4 + W_{c5} x_2^4 + W_{c6} x_1^3 x_2 + W_{c7} x_1^2 x_2^2 + W_{c8} x_1 x_2^3 + W_{c9} x_1^6 + W_{c10} x_2^6 + W_{c11} x_1^5 x_2 + W_{c12} x_1^4 x_2^2 + W_{c13} x_1^3 x_2^3 + W_{c14} x_1^2 x_2^4 + W_{c15} x_1 x_2^5 + W_{c16} x_1^8 + W_{c17} x_2^8 + W_{c18} x_1^7 x_2 + W_{c19} x_1^6 x_2^2 + W_{c20} x_1^5 x_2^3 + W_{c21} x_1^4 x_2^4 + W_{c22} x_1^3 x_2^5 + W_{c23} x_1^2 x_2^6 + W_{c24} x_1 x_2^7.   (46)

This is a power-series neural network with 24 activation functions containing powers of the state variables up to order eight. Note that polynomial activation functions are chosen only to allow a comparison with the results of [32]; the activation functions may instead be chosen as sigmoid functions, radial basis functions, or others satisfying the assumptions discussed previously.

Remark 7. The number of activation functions in NN approximators is commonly chosen by computer simulation. We first performed this simulation with an approximator (46) containing powers of the state variables up to order six; the algorithm was not observed to converge. The simulation was therefore repeated using powers up to order eight, which resulted in convergence. No further improvement was observed when running the simulation again with powers up to order ten, so the results presented here use powers up to order eight.

5.1. Learning process

The proposed PI algorithm is implemented as in Theorem 3. All NN weights are initialized randomly in the range [−5, 5]. The learning rates α_1 and α_2 are both selected as 20, and the design parameters Y_1 and Y_2 are selected as 2I and 40I, respectively.

Remark 8. The learning rates and the design parameters in the adaptive laws (41) and (42) are obtained by repeated simulation. It is not difficult to select these parameters to obtain convergence and stability of the proposed control law. They can be tuned further, because their specific values affect the convergence speed and transient response of the algorithm.
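A sketch of how the 24 activation functions in (46) and their gradients could be generated programmatically is given below; the monomial ordering is a choice of this sketch and need not match the ordering of W_c1 to W_c24 in (46), while g(x) = [0, 1]^T, R = 1 and λ = 1 follow the example, and the actor weights shown are placeholder values.

```python
import numpy as np

# Sketch of the 24-term polynomial basis of (46): all monomials
# x1^i * x2^j with even total degree 2, 4, 6 or 8.
EXPONENTS = [(i, d - i) for d in (2, 4, 6, 8) for i in range(d, -1, -1)]
assert len(EXPONENTS) == 24

def basis(x):
    x1, x2 = x
    return np.array([x1**i * x2**j for i, j in EXPONENTS])

def basis_grad(x):
    """Jacobian ds/dx, one row per activation function."""
    x1, x2 = x
    return np.array([[i * x1**max(i-1, 0) * x2**j,
                      j * x1**i * x2**max(j-1, 0)] for i, j in EXPONENTS])

# Saturated control (38) built from actor weights W2_hat (placeholder values)
# with g(x) = [0, 1]^T, R = 1 and lambda = 1 as in the example:
W2_hat = np.ones(24)
g = np.array([0.0, 1.0])
u = lambda x: -np.tanh(0.5 * g @ basis_grad(x).T @ W2_hat)
print(u(np.array([0.5, -0.5])))
```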


Fig. 2 shows the critic weights converging to their optimal values. To allow a clear view of the convergence, only the first 10 weights are depicted; the other weights are omitted from the figure. At the end of the learning phase, both the critic and actor weights converge to W_1 = [W_{c1}, ..., W_{c24}] = [5.6260, 5.9740, 3.8345, 4.3184, 3.5279, 2.7252, 2.8840, 3.7195, 0.0643, 0.1561, 7.7463, 2.5815, 0.1774, 4.5895, 2.4928, 1.0161, −1.0456, 5.2595, 2.0160, 4.3456, −0.0551, 2.3107, 3.4515, 0.5051]. Hence, the optimal control law is obtained as

u = -\tanh\big( 2.81 x_1 + 3.83 x_2 + 7.06 x_2^3 + 1.36 x_1^3 + 2.88 x_1^2 x_2 + 5.58 x_1 x_2^2 + 0.47 x_2^5 + 3.87 x_1^5 + 2.58 x_1^4 x_2 + 0.27 x_1^3 x_2^2 + 9.18 x_1^2 x_2^3 + 6.23 x_1 x_2^4 - 4.18 x_2^7 + 2.63 x_1^7 + 2.02 x_1^6 x_2 + 6.52 x_1^5 x_2^2 - 0.11 x_1^4 x_2^3 + 5.78 x_1^3 x_2^4 + 10.35 x_1^2 x_2^5 + 1.77 x_1 x_2^6 \big).   (47)

Fig. 2. Convergence of the first 10 critic NN weights.

The states of the system during online learning are shown in Fig. 3, where a disturbance is added to the control input to ensure that the PE condition is satisfied. After 300 s the PE condition is no longer required and the disturbance is removed; after that, the states remain very close to zero, as required. Note that once the actor weights are known, the optimal controller has been found. One can then use the actor NN weights as the final controller weights in any subsequent online runs, without intentionally injecting excitation signals into the system.

Fig. 3. The trajectory of states during online learning.

5.2. Performance evaluation

5.2.1. Comparing the performance of the proposed online algorithm with the offline method of [32]

In this section, the performance of the near-optimal control law found in [32] (control law 1) is compared with that of the proposed control law (47) (control law 2). Figs. 4–6 depict the state x_1, the state x_2 and the control effort u for both control laws, starting the system from a specific initial condition. Comparing the results for this initial condition, the performance of the proposed optimal control law is better than that of [32], as both the control effort and the states converge to zero faster. In fact, for this simulation the cost (45) for control laws 1 and 2 is 4.2896 and 3.9061, respectively.

Fig. 4. System state x_1 for a specific initial condition.
Fig. 5. System state x_2 for a specific initial condition.
Fig. 6. Control effort for a specific initial condition.

To compare the performance of control laws 1 and 2 more precisely, the cost (45) associated with each control law is computed for 2500 different initial states in the region −1 ≤ x_1 ≤ 1, −1 ≤ x_2 ≤ 1. Let V_1 and V_2 denote the approximated value functions associated with control laws 1 and 2, respectively. Fig. 7 shows the difference between these two value functions. For a fair comparison, the sum and the maximum and minimum values of all the elements of the difference V_1 − V_2 are computed; they are 939.71, 2.39 and −0.25, respectively. From these results it is clear that the cost of the proposed control law is smaller than that of the control law of [32] over a large part of the region and higher only over a small part. This confirms that, like the control law of [32], the proposed control law is a near-optimal control law, with better performance. Hence, it can be concluded that the proposed method converges to a near-optimal control solution, as the theoretical results suggest.

Fig. 7. 3D plot of the difference between the value functions of control laws 1 and 2.

5.2.2. Comparing the performance of the proposed method with a traditional PID controller

In this section, we compare the performance of the proposed control law (47) with the frequently used proportional-integral-derivative (PID) controller. For a fair comparison, the optimal PID parameters are obtained offline using a genetic algorithm [48]. The results of the PID controller are shown for two situations: assuming an unsaturated actuator, and assuming the control signal is bounded by the given limits. Figs. 8–10 show the state x_1, the state x_2 and the control effort u, respectively, for both situations. The performance of the PID controller with unconstrained input is satisfactory. However, Fig. 10 shows that the PID controller does not respect the input constraints, as the control input exceeds the given bounds. Also, comparing the unconstrained-input response with the constrained-input response, the performance of the PID controller deteriorates in the presence of the constraints. Moreover, comparing Figs. 8–10 with Figs. 4–6, the proposed optimal controller clearly outperforms the PID controller.

Fig. 8. System state x_1 using the PID controller with and without saturation.
Fig. 9. System state x_2 using the PID controller with and without saturation.
Fig. 10. Control effort using the PID controller with and without saturation.

6. Conclusion

Systems with input constraints are common in practice, and practical control systems are required to provide satisfactory performance in the presence of these constraints. In this paper, an a priori design philosophy was employed for the design of an online constrained optimal controller, in which the input constraints were taken into consideration at the outset of the control design. The proposed control method was implemented on an actor–critic structure as an extension of the work of Vamvoudakis and Lewis [36], which was presented for systems with unconstrained inputs. To show the effectiveness of the proposed method, its results on a nonlinear system were compared with the offline optimal control method of [32] and with a traditional PID controller tuned offline by a genetic algorithm. The simulation results illustrated how input saturation degrades the performance of the PID controller. Comparing the results of the proposed method with those of [32] confirmed that the proposed method converges to a near-optimal control solution. In contrast to [32], the proposed method solves the optimal control problem in an online fashion, and hence it can be extended to the control of uncertain systems. Future work will employ a third neural network in conjunction with the actor–critic structure to eliminate the requirement of knowing the system dynamics.

Acknowledgement

This work is supported by AFOSR grant FA9550-09-1-0278, NSF grants ECCS-1128050, ARO grant W91NF-05-1-0314, and China NNSF grant 61120106011.


Appendix. Proof of Theorem 3

Consider the following Lyapunov function

J(t) \equiv V(x(t)) + J_1(t) + J_2(t)   (A.1)

where V(x) is the optimal value function, which is positive definite (this holds when Assumption 2 is satisfied), J_1(t) = \tfrac{1}{2} \tilde{W}_1^T \alpha_1^{-1} \tilde{W}_1 and J_2(t) = \tfrac{1}{2} \tilde{W}_2^T \alpha_2^{-1} \tilde{W}_2. The derivative of J(t) is given by

\dot{J} \equiv \dot{V} + \dot{J}_1 + \dot{J}_2.   (A.2)

The first term of (A.2) is

\dot{V} = (W_1^T \nabla s + \nabla \varepsilon^T)(f + g \hat{u}_1) = W_1^T \nabla s\, f - W_1^T \nabla s\, g \lambda \tanh(\hat{D}) + \varepsilon_0(x)   (A.3)

where ε_0(x) = ∇ε^T (f − gλ tanh(D̂)). Using (39) and Assumption 3, and taking the norm of ε_0(x), yields

\| \varepsilon_0(x) \| = \| \nabla \varepsilon^T (f - g \lambda \tanh(\hat{D})) \| \le b_{\varepsilon x} b_f \|x\| + \lambda b_{\varepsilon x} b_g.   (A.4)

From the HJB equation (21) we have

W_1^T \nabla s\, f = -Q(x) - U + W_1^T \nabla s\, g \lambda \tanh(D) + \varepsilon_{HJB}   (A.5)

where

U = W_1^T \nabla s\, g \lambda \tanh(D) + \lambda^2 R \ln(1 - \tanh^2(D))   (A.6)

is obtained by substituting the control input u = −λ tanh((1/2λ) R^{-1} g^T ∇s^T W_1) into the cost functional (9), and hence is positive definite. Substituting (A.5) in (A.3), V̇ becomes

\dot{V} = -Q(x) - U + W_1^T \nabla s\, g \lambda [\tanh(D) - \tanh(\hat{D})] + \varepsilon_{HJB} + \varepsilon_0.   (A.7)

From (A.4), (A.7), the fact that U is positive definite and the boundedness of ε_HJB, i.e. sup_{∀x} ∥ε_HJB∥ ≤ ε_h [32], we have

\dot{V} < -Q(x) + W_1^T \nabla s\, g \lambda [\tanh(D) - \tanh(\hat{D})] + b_{\varepsilon x} b_f \|x\| + \lambda b_{\varepsilon x} b_g + \varepsilon_h \le -Q(x) + 2 \lambda b_g b_{\phi x} \|W_1\| + b_{\varepsilon x} b_f \|x\| + \lambda b_{\varepsilon x} b_g + \varepsilon_h.   (A.8)

Denoting k_1 = b_{εx} b_f and k_2 = 2λ b_g b_{φx} ∥W_1∥ + λ b_{εx} b_g + ε_h, and noting that since Q(x) > 0 there exists q such that x^T q x < Q(x) for x ∈ Ω, (A.8) becomes

\dot{V} < -x^T q x + k_1 \|x\| + k_2.   (A.9)

The second term of (A.2) is

\dot{J}_1 = \tilde{W}_1^T \alpha_1^{-1} \dot{\tilde{W}}_1 = \tilde{W}_1^T \frac{\beta_1}{(1 + \beta_1^T \beta_1)^2} \big( Q + \hat{U} + \hat{W}_1^T \nabla s (f - g \lambda \tanh(\hat{D})) \big)   (A.10)

where

\hat{U} = \hat{W}_2^T \nabla s\, g \lambda \tanh(\hat{D}) + \lambda^2 R \ln(1 - \tanh^2(\hat{D}))   (A.11)

is obtained by substituting û_1 from (38) into the cost (9). By adding and subtracting the term W̃_1^T (β_1/(1 + β_1^T β_1)^2) W_1^T ∇s(f − gλ tanh(D̂)) on the right-hand side of (A.10), using β_1 = ∇s(f − gλ tanh(D̂)), substituting Q from the HJB equation (21), and doing some manipulations, (A.10) becomes

\dot{J}_1 = \tilde{W}_1^T \frac{\beta_1}{(1 + \beta_1^T \beta_1)^2} \big( \hat{U} - U - \tilde{W}_1^T \beta_1 + W_1^T \nabla s (f - g \lambda \tanh(\hat{D})) - W_1^T \nabla s (f - g \lambda \tanh(D)) + \varepsilon_{HJB} \big).   (A.12)

Using (A.6) and (A.11), Û − U is given by

\hat{U} - U = \hat{W}_2^T \nabla s\, g \lambda \tanh(\hat{D}) + \lambda^2 R \ln(1 - \tanh^2(\hat{D})) - W_1^T \nabla s\, g \lambda \tanh(D) - \lambda^2 R \ln(1 - \tanh^2(D)).   (A.13)

The term λ^2 R ln(1 − tanh^2(D)) can be written as

\lambda^2 R \ln(1 - \tanh^2(D)) = \lambda^2 R \{ \ln 4 - 2D - 2 \ln(1 + e^{-2D}) \} = \lambda^2 R \{ \ln 4 - 2 D\, \mathrm{sgn}(D) + \varepsilon_D \} = \lambda^2 R \ln 4 - W_1^T \nabla s\, g \lambda\, \mathrm{sgn}(D) + \lambda^2 R\, \varepsilon_D   (A.14)

where the second equality follows from the fact that ln(1 + e^{-2D}) can be closely approximated by 0 for D > 0 and by −2D for D < 0, and ε_D is the approximation error. Note that ε_D is bounded and its maximum value (occurring at D = 0) is ln 4, i.e. ∥ε_D∥ < ln 4. Similarly,

\lambda^2 R \ln(1 - \tanh^2(\hat{D})) = \lambda^2 R \ln 4 - \hat{W}_2^T \nabla s\, g \lambda\, \mathrm{sgn}(\hat{D}) + \lambda^2 R\, \varepsilon_{\hat{D}}   (A.15)

where ∥ε_{D̂}∥ < ln 4. Substituting (A.14) and (A.15) in (A.13), adding and subtracting the term W_1^T ∇s g λ sgn(D̂), and performing some manipulations, yields

\hat{U} - U = \hat{W}_2^T \nabla s\, g \lambda \tanh(\hat{D}) + \tilde{W}_2^T \nabla s\, g \lambda\, \mathrm{sgn}(\hat{D}) - W_1^T \nabla s\, g \lambda \tanh(D) - W_1^T \nabla s\, g \lambda [\mathrm{sgn}(\hat{D}) - \mathrm{sgn}(D)] + \lambda^2 R (\varepsilon_{\hat{D}} - \varepsilon_D).   (A.16)

Substituting (A.16) in (A.12) and performing some manipulations yields

\dot{J}_1 = \tilde{W}_1^T \frac{\beta_1}{(1 + \beta_1^T \beta_1)^2} \big( -\beta_1^T \tilde{W}_1 - \tilde{W}_2^T \nabla s\, g \lambda \tanh(\hat{D}) + \tilde{W}_2^T \nabla s\, g \lambda\, \mathrm{sgn}(\hat{D}) - W_1^T \nabla s\, g \lambda [\mathrm{sgn}(\hat{D}) - \mathrm{sgn}(D)] + \lambda^2 R (\varepsilon_{\hat{D}} - \varepsilon_D) + \varepsilon_{HJB} \big).   (A.17)

Denoting ε_J = λ^2 R (ε_{D̂} − ε_D) + ε_HJB, β̄_1 = β_1/(1 + β_1^T β_1) and m_{s1} = 1 + β_1^T β_1, (A.17) becomes

\dot{J}_1 = -\tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 - \tilde{W}_2^T \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \tilde{W}_1 + \tilde{W}_1^T \frac{\bar{\beta}_1}{m_{s1}} \big( W_1^T \nabla s\, g \lambda [\mathrm{sgn}(D) - \mathrm{sgn}(\hat{D})] + \varepsilon_J \big)   (A.18)

where ε_J is bounded. By using W̃_1 = W_1 − Ŵ_1, the second term of (A.18) can be written as

\tilde{W}_2^T \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \tilde{W}_1 = \tilde{W}_2^T \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} W_1 - \tilde{W}_2^T \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1.   (A.19)

Substituting (A.19) in (A.18),

\dot{J}_1 = -\tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 + \tilde{W}_2^T \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1 + \tilde{W}_1^T \frac{\bar{\beta}_1}{m_{s1}} \big( W_1^T \nabla s\, g \lambda [\mathrm{sgn}(D) - \mathrm{sgn}(\hat{D})] + \varepsilon_J \big) + \tilde{W}_2^T \nabla s\, g \lambda [\mathrm{sgn}(\hat{D}) - \tanh(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} W_1.   (A.20)

Denoting c = W_1^T ∇s g λ [sgn(D) − sgn(D̂)] + ε_J and d = ∇s g λ [sgn(D̂) − tanh(D̂)] (β̄_1^T/m_{s1}) W_1, (A.20) becomes

\dot{J}_1 = -\tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 + \tilde{W}_2^T \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1 + \tilde{W}_1^T \frac{\bar{\beta}_1}{m_{s1}} c + \tilde{W}_2^T d.   (A.21)

Note that c and d are bounded, since all the terms appearing in c and d are bounded under Assumptions 1 and 3 and the boundedness of ε_HJB.

Finally, using (A.9) and (A.21), and the fact that J̇_2 = W̃_2^T α_2^{-1} Ẇ̃_2 = −W̃_2^T α_2^{-1} Ẇ̂_2, J̇ becomes

\dot{J} < -x^T q x + k_1 \|x\| + k_2 - \tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 + \tilde{W}_1^T \frac{\bar{\beta}_1}{m_{s1}} c + \tilde{W}_2^T d - \tilde{W}_2^T \Big( \alpha_2^{-1} \dot{\hat{W}}_2 - \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1 \Big).   (A.22)

Now we define the actor tuning law as

\dot{\hat{W}}_2 = -\alpha_2 \Big( (Y_2 \hat{W}_2 - Y_1 \bar{\beta}_1^T \hat{W}_1) - \nabla s\, g \lambda [\tanh(\hat{D}) - \mathrm{sgn}(\hat{D})] \frac{\bar{\beta}_1^T}{m_{s1}} \hat{W}_1 \Big).   (A.23)

This adds to J̇ the terms

\tilde{W}_2^T Y_2 \hat{W}_2 - \tilde{W}_2^T Y_1 \bar{\beta}_1^T \hat{W}_1 = \tilde{W}_2^T Y_2 (W_1 - \tilde{W}_2) - \tilde{W}_2^T Y_1 \bar{\beta}_1^T (W_1 - \tilde{W}_1) = \tilde{W}_2^T Y_2 W_1 - \tilde{W}_2^T Y_2 \tilde{W}_2 - \tilde{W}_2^T Y_1 \bar{\beta}_1^T W_1 + \tilde{W}_2^T Y_1 \bar{\beta}_1^T \tilde{W}_1   (A.24)

and, altogether, J̇ becomes

\dot{J} < -x^T q x + k_1 \|x\| + k_2 - \tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 + \tilde{W}_1^T \frac{\bar{\beta}_1}{m_{s1}} c - \tilde{W}_2^T Y_2 \tilde{W}_2 + \tilde{W}_2^T \big[ d + Y_2 W_1 + Y_1 \bar{\beta}_1^T W_1 \big] + \tilde{W}_2^T Y_1 \bar{\beta}_1^T \tilde{W}_1.   (A.25)

For the last term on the right-hand side of (A.25), applying Young's inequality (Lemma 1) gives

\tilde{W}_2^T Y_1 \bar{\beta}_1^T \tilde{W}_1 \le \frac{\| \tilde{W}_2^T Y_1 \|^2}{2} + \frac{\| \bar{\beta}_1^T \tilde{W}_1 \|^2}{2}.   (A.26)

Using (A.26) in (A.25), and the fact that x^T q x ≥ λ_min(q) ∥x∥^2, we have

\dot{J}(x) < -\lambda_{min}(q) \|x\|^2 + k_1 \|x\| + k_2 - \tfrac{1}{2} \tilde{W}_1^T \bar{\beta}_1 \bar{\beta}_1^T \tilde{W}_1 + \tilde{W}_1^T \frac{\bar{\beta}_1}{m_{s1}} c - \tilde{W}_2^T \Big( Y_2 - \tfrac{1}{2} Y_1 Y_1^T \Big) \tilde{W}_2 + \tilde{W}_2^T \big[ d + Y_2 W_1 + Y_1 \bar{\beta}_1^T W_1 \big].   (A.27)

Denoting B = Y_2 − (1/2) Y_1 Y_1^T and choosing the design parameters Y_1 and Y_2 such that B > 0, it can easily be shown that the Lyapunov derivative is negative if

\|x\| > \frac{k_1}{2 \lambda_{min}(q)} + \sqrt{ \frac{k_1^2}{4 \lambda_{min}^2(q)} + \frac{k_2}{\lambda_{min}(q)} }   (A.28)

\| \bar{\beta}_1^T \tilde{W}_1 \| > \frac{2c}{m_{s1}}   (A.29)

\| \tilde{W}_2 \| > \frac{ \| d + Y_2 W_1 + Y_1 \bar{\beta}_1^T W_1 \| }{ \lambda_{min}(B) }.   (A.30)

Equation (A.29) shows that ∥β̄_1^T W̃_1∥ is UUB. If β̄_1 is PE, then W̃_1 is also UUB; see the proof of Theorem 2. The analysis above demonstrates that the system states and the NN weights are UUB, and the proof is complete.

References

[1] Bellman R. Dynamic programming. New Jersey: Princeton University Press; 1957.
[2] Garrard WL, Enns DF, Snell SA. Nonlinear feedback control of highly maneuverable aircraft. International Journal of Control 1992;56(4):799–812.
[3] Chanane B. Optimal control of nonlinear systems: a recursive approach. Computers & Mathematics with Applications 1998;35(3):29–33.
[4] Wernli A, Cook G. Suboptimal control for nonlinear quadratic regulator problem. Automatica 1975;11(1):75–84.
[5] Beikzadeh H, Taghirad HD. Robust SDRE filter design for nonlinear uncertain systems with an H∞ performance criterion. ISA Transactions 2012;51(1):146–52.
[6] Howard RA. Dynamic programming and Markov processes. Cambridge, MA: MIT Press; 1960.
[7] Sutton RS, Barto AG. Reinforcement learning—an introduction. Cambridge, MA: MIT Press; 1998.

[8] Powell WB. Approximate dynamic programming: solving the curses of dimensionality. Wiley-Interscience; 2007.
[9] Masmoudi NK, Rekik C, Djemel M, Derbel N. Two coupled neural-network-based solution of the Hamilton–Jacobi–Bellman equation. Applied Soft Computing 2011;11(3):2946–63.
[10] Al-Tamimi A, Lewis FL, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2008;38(4):943–9.
[11] Prokhorov DV, Wunsch DC. Adaptive critic designs. IEEE Transactions on Neural Networks 1997;8(5):997–1007.
[12] Enns R, Si J. Helicopter trimming and tracking control using direct neural dynamic programming. IEEE Transactions on Neural Networks 2003;14(4):929–39.
[13] Chen Z, Jagannathan S. Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discrete time systems. IEEE Transactions on Neural Networks 2008;19(1):90–106.
[14] Li B, Si J. Robust dynamic programming for discounted infinite horizon Markov decision processes with uncertain stationary transition matrices. In: Proceedings of IEEE symposium on approximate dynamic programming and reinforcement learning; 2007. p. 96–102.
[15] Si J, Wang YT. On-line learning control by association and reinforcement. IEEE Transactions on Neural Networks 2001;12(2):264–76.
[16] Ferrari S, Steck JE, Chandramohan R. Adaptive feedback control by constrained approximate dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2008;38(4):982–7.
[17] Wei QL, Zhang HG, Dai J. Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions. Neurocomputing 2009;72:1839–48.
[18] Werbos PJ. A menu of designs for reinforcement learning over time. In: Miller WT, Sutton RS, Werbos PJ, editors. Neural networks for control. Cambridge, MA: MIT Press; 1991. p. 67–95.
[19] Zhang HG, Luo YH, Liu D. Neural network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraint. IEEE Transactions on Neural Networks 2009;20(9):1490–503.
[20] Zhang HG, Wei QL, Luo YH. A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2008;38(4):937–42.
[21] Han D, Balakrishnan SN. State-constrained agile missile control with adaptive-critic-based neural networks. IEEE Transactions on Control Systems Technology 2002;10(4):481–9.
[22] Syafiie S, Tadeo F, Villafin M, Alonso AA. Learning control for batch thermal sterilization of canned foods. ISA Transactions 2011;50:82–90.
[23] Padhi R, Unnikrishnan N, Wang X, Balakrishnan SN. A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Networks 2006;19:1648–69.
[24] Werbos PJ. Approximate dynamic programming for real time control and neural modeling. In: White DA, Sofge DA, editors. Handbook of intelligent control. Multiscience Press; 1992.
[25] Dierks T, Thumati BT, Jagannathan S. Adaptive dynamic programming-based optimal control of unknown affine nonlinear discrete-time systems. In: Proceedings of international joint conference on neural networks; 2009. p. 14–19.
[26] Si J, Barto AG, Powell WB, Wunsch D, editors. Handbook of learning and approximate dynamic programming. Wiley-IEEE Press; 2004.
[27] Govindhassamy JJ, McLoone SF, Irwin GW. Second order training of adaptive critics for online process control. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2005;35(2):381–5.
[28] Bradtke S, Ydstie BE, Barto AG. Adaptive linear quadratic control using policy iteration. In: Proceedings of American control conference; 1994. p. 3475–3479.
[29] Al-Tamimi A, Vrabie D, Abu-Khalaf M, Lewis FL. Model-free approximate dynamic programming schemes for linear systems. In: Proceedings of international joint conference on neural networks; 2007. p. 371–378.
[30] Landelius T. Reinforcement learning and distributed local model synthesis. PhD dissertation. Sweden: Linkoping University; 1997.
[31] Beard RW. Improving the closed-loop performance of nonlinear systems. PhD dissertation. Rensselaer Polytechnic Institute, Troy, NY; 1995.
[32] Abu-Khalaf M, Lewis FL. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 2005;41:779–91.
[33] Doya K. Reinforcement learning in continuous time and space. Neural Computation 2000;12(1):219–45.
[34] Murray JJ, Cox CJ, Lendaris GG, Saeks R. Adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 2002;32(2):140–53.
[35] Vrabie D, Lewis FL. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks 2009;22:237–46.
[36] Vamvoudakis K, Lewis FL. Online actor–critic algorithm to solve the continuous infinite-time horizon optimal control problem. Automatica 2010;46:878–88.
[37] Bhasin S. Reinforcement learning and optimal control methods for uncertain nonlinear systems. PhD dissertation. University of Florida; 2011.
[38] Dierks T, Jagannathan S. Optimal control of affine nonlinear continuous-time systems. In: Proceedings of American control conference; 2010. p. 1568–1573.


[39] He P, Jagannathan S. Reinforced learning neural network based controller for nonlinear discrete-time systems with input constraints. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 2007;37(2):425–36.
[40] Hardy G, Littlewood J, Polya G. Inequalities. 2nd ed. Cambridge, U.K.: Cambridge Univ. Press; 1989.
[41] Lewis FL, Jagannathan S, Yesildirek A. Neural network control of robot manipulators and nonlinear systems. Taylor & Francis; 1999.
[42] Ioannou P, Sun J. Robust adaptive control. New Jersey: Prentice Hall; 1996.
[43] Lyshevski SE. Constrained optimization and control of nonlinear systems: new results in optimal control. In: Proceedings of IEEE conference on decision and control; 1996. p. 541–546.
[44] Lyshevski SE. Optimal control of nonlinear continuous-time systems: design of bounded controllers via generalized nonquadratic functional. In: Proceedings of American control conference; 1998. p. 205–209.


[45] Adhyaru D, Kar IN, Gopal M. Bounded robust control of nonlinear systems using neural network-based HJB solution. Neural Computing & Applications 2011;20(1):91–103.
[46] Chowdhary G. Concurrent learning for convergence in adaptive control without persistency of excitation. PhD dissertation. Georgia Institute of Technology; 2010.
[47] Chen CT. Linear system theory and design. 3rd ed. Oxford University Press; 1999.
[48] Shen JC. New tuning method for PID controller. ISA Transactions 2002;41:473–84.