Neurocomputing 149 (2015) 1620–1630
Optimal switching between controlled subsystems with free mode sequence

Ali Heydari a,*, S.N. Balakrishnan b

a Mechanical Engineering Department, South Dakota School of Mines and Technology, USA
b Mechanical and Aerospace Engineering Department, Missouri University of Science and Technology, USA
Article info

Article history: Received 18 July 2013; Received in revised form 1 July 2014; Accepted 13 August 2014; Communicated by H. Zhang; Available online 28 August 2014.

Keywords: Optimal switching; Approximate dynamic programming; Switching system; Adaptive critics.

Abstract

The problem of optimal switching and control of nonlinear switching systems with controlled subsystems is investigated in this study, where the mode sequence and the switching times between the modes are unspecified. An approximate dynamic programming based method is developed which provides a feedback solution for unspecified initial conditions and different final times. The convergence of the proposed algorithm is proved. The versatility of the method and its performance are illustrated through different numerical examples. © 2014 Elsevier B.V. All rights reserved.
1. Introduction

A switching system is characterized by a group of subsystems with different dynamics, of which one is active at each time instant. Hence, in order to control such systems, one needs a switching schedule along with a control input to apply. Examples of switching systems can be found in dynamical systems in different fields, from aerospace to chemical engineering [1–5]. There are a few developments in this area [6–15]; however, many open issues remain even for the case of linear subsystems with a quadratic cost function [7,16]. It should be noted that optimal switching problems are much more complicated than conventional optimal control problems, due to the intercoupling between the effect of the applied continuous input and the discrete switching/scheduling. In other words, the presence of discrete events in the system makes it very hard to find an optimal solution. Formulating an optimal control problem as a function optimization problem, the relation between conventional optimal control problems and optimal switching problems is similar to the relation between smooth function optimization and mixed integer programming, e.g., see [15]. The restriction, in mixed integer programming, that some of the variables can only assume integer values leads to different challenges, including the non-convexity of the problem. Such challenges exist in optimal switching problems as well.
* Corresponding author. Tel.: +1 605 394 2200.
E-mail addresses: [email protected] (A. Heydari), [email protected] (S.N. Balakrishnan).
Developments in the field can mainly be classified into two categories. In the first category, the sequence of active subsystems, called the mode sequence, is selected a priori [6–11], and the remaining problem, i.e., finding the switching instants between the modes, is solved using nonlinear programming methods. In these papers, the gradient of the cost with respect to the switching instants/points is calculated; afterward, the switching instants/points are adjusted to find a local optimum. An iterative solution to a nonlinear optimization problem is suggested in [10], and by combining this control approach with ideas from model predictive control, the authors developed the so-called crawling window optimal control scheme for the optimal switching problem. The second category is based on discretizing the problem space in order to deal with a finite number of options. The authors of [12] utilized a direct search, evaluating the cost function for different randomly selected switching time sequences among the finite number of options to select the best sequence. In [13], the state and input spaces are discretized for the calculation of the value function for optimal switching through dynamic programming. In [14], a genetic algorithm is used to find the optimal switching times among the choices. All the cited methods work only for a specific initial condition; each time the initial condition is changed, a new set of computations needs to be performed to find the new optimal switching instants. In order to extend the validity of the results to different initial conditions within a pre-selected set, in [9] a solution is found that is locally optimal in the sense that it minimizes the worst possible cost over all trajectories starting in the selected set of initial states.
In the past two decades, approximate dynamic programming (ADP) has shown a lot of promise in providing comprehensive solutions to conventional optimal control problems in a feedback form [17–28]. ADP is usually carried out using two neural network (NN) syntheses called adaptive critics (ACs) [18–20]. In the heuristic dynamic programming (HDP) class with ACs, one network, called the critic network, maps the input states to the cost, and another network, called the actor network, outputs the control with the states of the system as its inputs [20,21]. Motivated by the potential of ADP in conventional problems, different ADP-based methods have been developed for solving optimal switching problems [29–33]. The authors of [29,30] developed a switching method in which the number of functions subject to approximation grows exponentially with the number of iterations, and they suggested an idea for eliminating some of these functions. The authors of the present study have also investigated ADP-based approaches to optimal switching [31–33]. The developments in [31,32] are for optimal switching problems with a fixed mode sequence; in such problems, the mode sequence and the number of switchings are given, and the problem is finding the optimal switching times. The problem in the current study, however, is much more complicated, because the mode sequence, the number of switchings, and the switching times are all unknown and must be calculated such that a cost function is optimized. Compared with [33], the main difference is the assumption of controlled subsystems in the current study, as opposed to the autonomous subsystems investigated in [33]. The investigation of controlled subsystems makes the problem more complicated due to the intercoupling between the effect of switching between the modes and that of applying different control inputs once a mode is active. A further challenge is the need to find both the switching schedule and the continuous input.

Considering this background, the contribution of this study is an ADP-based solution to switching problems with controlled subsystems and a free mode sequence. The idea is as simple as learning the optimal cost-to-go and the optimal control for each possible active mode. It is shown that, having these functions, the optimal mode can be found in a feedback form, i.e., as a function of the instantaneous state of the system and the remaining time. An algorithm which fits in the HDP category is developed for learning the desired functions, and a proof of convergence is provided. This method has several advantages over existing developments in the field: (1) it provides global optimal switching (subject to the assumed neural network structure), unlike the nonlinear programming based methods [6–11], which can provide only local optimal solutions; (2) the order of active subsystems and the number of switchings are free, as opposed to the simpler problem of a fixed mode sequence [6–14,31,32]; (3) the neurocontroller determines an optimal solution for unspecified initial conditions, without needing to retrain the networks; (4) once trained, the neurocontroller gives the solution for any other final time as well, as long as the new final time is not greater than the final time for which the network is trained; and (5) the switching is scheduled in a feedback form, hence it has the inherent robustness of feedback solutions in disturbance rejection, unlike open loop developments [6–9,11–15].
The closest development in the literature to the current study is Ref. [30]. The differences, which highlight the advantages and disadvantages of each method, are fourfold. (a) Ref. [30] presents an algorithm through which neural networks are utilized in approximating smooth functions, which potentially leads to more accurate approximations as compared with the algorithm presented in the current study, which is based on approximating possibly non-smooth functions. (b) In Ref. [30], the number of functions which need to be learned at each training iteration and stored for online control grows exponentially with the number of iterations; in the current study, only one critic and as many actors as the number of subsystems are required to be trained. (c) The development in Ref. [30] admits a hard terminal constraint on the state, while the method presented here admits soft terminal constraints. (d) The algorithm proposed in Ref. [30] trains the NNs for a single selected initial state vector, while the current study leads to a very versatile neurocontroller in terms of being able to handle different initial conditions and different final times, without requiring the weights to be re-tuned.

The rest of this paper is organized as follows: the problem formulation and the solution idea are presented in Section 2. Approximations of the optimal cost-to-go and the optimal control with neural networks are explained in Section 3. Numerical analyses are given in Section 4, and conclusions are given in Section 5.
2. Problem formulation and solution idea

A switching system with nonlinear input-affine subsystems can be represented by the set of M subsystems, or modes, modeled by

ẋ(t) = f_{v(t)}(x(t)) + g_{v(t)}(x(t)) u(t),   (1)

where the functions f_v : ℝ^n → ℝ^n and g_v : ℝ^n → ℝ^{n×m}, ∀v ∈ V ≡ {1, 2, …, M}, represent the dynamics of the subsystems and are assumed to be smooth. The integers n and m denote the dimensions of the state vector x and the control vector u, respectively. The continuous time is denoted by t, and the initial and final times are denoted by t_0 and t_f, respectively. Controlling the switching system requires a control input, u : [t_0, t_f) → ℝ^m, and a switching schedule, v : [t_0, t_f) → V. The latter determines the active subsystem at each time t and the former provides the input to the active subsystem. The optimal solution is the one that minimizes the cost function

J = ψ(x(t_f)) + ∫_{t_0}^{t_f} ( Q(x(t)) + u(t)^T R u(t) ) dt.   (2)

The convex positive semi-definite functions Q : ℝ^n → ℝ and ψ : ℝ^n → ℝ penalize the states, and R ∈ ℝ^{m×m} is a positive definite matrix penalizing the control effort in the selected cost function. The problem is determining an input history u(t) and a switching history v(t) such that cost function (2) is minimized, subject to dynamics (1).

The approach selected in this study for solving the problem is ADP [17–20], formulated with discrete-time dynamics. Therefore, the dynamics and the cost function are discretized using a small sampling time Δt:

x_{k+1} = f̄_{v_k}(x_k) + ḡ_{v_k}(x_k) u_k,   k ∈ K, v_k ∈ V,   (3)

J = ψ(x_N) + Σ_{k=0}^{N-1} ( Q̄(x_k) + u_k^T R̄ u_k ),   (4)

where N = (t_f - t_0)/Δt, x_k = x(kΔt + t_0), u_k = u(kΔt + t_0), and v_k = v(kΔt + t_0). The subscript k denotes the discrete time index and K ≡ {0, 1, …, N-1}. If Euler integration is used for the discretization, one has f̄_v(x) ≡ x + Δt f_v(x), ḡ_v(x) ≡ Δt g_v(x), Q̄(x) ≡ Δt Q(x), and R̄ ≡ Δt R; for notational simplicity, the bars are dropped in the remainder of the paper.

Defining the cost-to-go as the cost incurred from the current time k and state x_k to the final time N, denoted by J_k(x_k), one has

J_k(x_k) = ψ(x_N) + Σ_{j=k}^{N-1} ( Q(x_j) + u_j^T R u_j ).   (5)

From the form of the cost function, it directly follows that

J_N(x_N) = ψ(x_N),   J_k(x_k) = Q(x_k) + u_k^T R u_k + J_{k+1}(x_{k+1}),   ∀k ∈ K.   (6)
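For readers who want a concrete view of the discretization step, the following is a minimal Python sketch of the Euler mapping used in Eqs. (3)–(4). The callables f_v, g_v, and Q are user-supplied placeholders, not objects defined in the paper.

```python
import numpy as np

def discretize(f_v, g_v, Q, R, dt):
    """Euler discretization of Eqs. (3)-(4): build the discrete-time maps
    f_bar, g_bar and cost terms Q_bar, R_bar from continuous-time ones."""
    f_bar = lambda x: x + dt * f_v(x)   # f_bar(x) = x + dt * f_v(x)
    g_bar = lambda x: dt * g_v(x)       # g_bar(x) = dt * g_v(x)
    Q_bar = lambda x: dt * Q(x)         # Q_bar(x) = dt * Q(x)
    R_bar = dt * R                      # R_bar    = dt * R
    return f_bar, g_bar, Q_bar, R_bar

def step(f_bar, g_bar, x, u):
    """One step of Eq. (3): x_{k+1} = f_bar(x_k) + g_bar(x_k) u_k."""
    return f_bar(x) + g_bar(x) @ u
```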
Based on the Bellman principle of optimality [34], regardless of what decisions are made for v_j, j ∈ {0, 1, …, k-1}, the optimal solution for the remaining time steps, i.e., for j ∈ {k, k+1, …, N-1}, is the one which minimizes J_k(x_k). Let the optimal cost-to-go, denoted by J*_k(x_k), be defined as the cost incurred if optimal decisions and controls are applied from the current time to the final time. The method developed in this study is based on approximating the optimal cost-to-go and the optimal control 'given' the active subsystem v, denoted by u^{v,*}_k(x_k), ∀v ∈ V. Once these functions are learned, the optimal mode at the current time k and current state x_k, denoted by v*_k(x_k), is given by

v*_k(x_k) = argmin_{v∈V} ( Q(x_k) + u_k^{v,*T} R u^{v,*}_k + J*_{k+1}( f_v(x_k) + g_v(x_k) u^{v,*}_k ) )
          = argmin_{v∈V} ( u_k^{v,*T} R u^{v,*}_k + J*_{k+1}( f_v(x_k) + g_v(x_k) u^{v,*}_k ) ).   (7)

The minimization in Eq. (7) is over the finite set V and can easily be done online with a relatively small number of computations. For example, if the system has two subsystems, finding the optimal active subsystem at each instant k simplifies to evaluating the scalar u_k^{v,*T} R u^{v,*}_k + J*_{k+1}( f_v(x_k) + g_v(x_k) u^{v,*}_k ) for v = 1 and for v = 2 and comparing the two values to select the optimal v. This process needs to be done at each instant k, ∀k ∈ K. Once v*_k(x_k) is calculated, the respective control approximator can be used to output the control value; a minimal sketch of this online decision rule is given below. The next section gives an algorithm for learning the desired functions J*_k(x_k) and u^{v,*}_k(x_k), ∀v ∈ V.
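As a concrete illustration of the decision rule (7), the sketch below (Python) enumerates the modes directly. The callables u_star (the actors) and J_next (the critic) stand in for the learned approximations and are assumptions of this sketch, not names from the paper.

```python
import numpy as np

def optimal_mode(x, k, modes, u_star, J_next, f, g, R):
    """Evaluate Eq. (7): return the v in `modes` minimizing
    u^T R u + J*_{k+1}(f_v(x) + g_v(x) u), with u = u_star[v](x, k)."""
    best_v, best_cost = None, np.inf
    for v in modes:
        u = u_star[v](x, k)                      # actor for subsystem v
        x_next = f[v](x) + g[v](x) @ u           # one-step prediction, Eq. (3)
        cost = float(u @ R @ u) + J_next(x_next, k + 1)  # critic evaluation
        if cost < best_cost:
            best_v, best_cost = v, cost
    return best_v
```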
3. Approximating optimal control and optimal cost-to-go

In order to motivate the idea of using ADP for obtaining the optimal control and the optimal cost-to-go for switching problems, the use of ADP for conventional optimal control problems with fixed-final-time cost functions is discussed first.

3.1. ADP for conventional optimal control

Consider the conventional fixed-final-time optimal control problem

x_{k+1} = f(x_k) + g(x_k) u_k,   ∀k ∈ K,   (8)

with smooth functions f : ℝ^n → ℝ^n and g : ℝ^n → ℝ^{n×m}, along with the cost function given in (4). The optimal solution to the problem of minimizing cost function (4) subject to dynamics (8) is given by the Bellman equation [34]:

J*_N(x_N) = ψ(x_N),   (9)

u*_k = arginf_{u∈ℝ^m} ( u^T R u + J*_{k+1}( f(x_k) + g(x_k) u ) ),   ∀k ∈ K,   (10)

J*_k(x_k) = Q(x_k) + u_k^{*T} R u*_k + J*_{k+1}( f(x_k) + g(x_k) u*_k ),   ∀k ∈ K.   (11)

In the HDP scheme [19] for regulators, two NNs, named the actor and the critic, are trained to approximate the optimal control and the optimal cost-to-go, respectively, in order to overcome the curse of dimensionality associated with the Bellman equation. Ref. [35] extends the idea to fixed-final-time problems. Selecting linear-in-weight NNs, the control and the optimal cost-to-go are approximated by

V_k^T σ(x_k) ≅ u*_k(x_k),   k ∈ K,   (12)

W_k^T φ(x_k) ≅ J*_k(x_k),   k ∈ K ∪ {N}.   (13)

The vector-valued functions σ : ℝ^n → ℝ^p and φ : ℝ^n → ℝ^q represent the selected smooth basis functions, where p and q are the respective numbers of (linearly independent) neurons. The matrices V_k ∈ ℝ^{p×m} and W_k ∈ ℝ^q are the unknown weights of the actor and the critic networks at time step k, respectively. Utilizing different weights for different time steps provides the networks with the ability to learn the time-dependent behavior of the solution to fixed-final-time problems.

Eqs. (9)–(11) may be used to find the network weights V_k and W_k, ∀k. Substituting (12) and (13) into Eqs. (9)–(11) leads to

W_N^T φ(x_N) = ψ(x_N),   (14)

V_k^T σ(x_k) = arginf_{u∈ℝ^m} ( u^T R u + W_{k+1}^T φ( f(x_k) + g(x_k) u ) ),   ∀k ∈ K,   (15)

W_k^T φ(x_k) = Q(x_k) + σ(x_k)^T V_k R V_k^T σ(x_k) + W_{k+1}^T φ( f(x_k) + g(x_k) V_k^T σ(x_k) ),   ∀k ∈ K.   (16)

The minimization in Eq. (15) may be carried out by setting the derivative of the function subject to minimization to zero, which leads to

V_k^T σ(x_k) = -½ R^{-1} g(x_k)^T ∇φ( f(x_k) + g(x_k) V_k^T σ(x_k) )^T W_{k+1},   ∀k ∈ K,   (17)

where ∇φ(x) ≡ ∂φ(x)/∂x is formed as a column vector. Interested readers are referred to [40] for more details and for the sufficient conditions for the equivalency of Eqs. (15) and (17).

The unknowns W_k and V_k can be calculated in a backward-in-time fashion, considering Eqs. (14), (17), and (16). In other words, using (14) one can calculate W_N. Then, having W_N, one can calculate V_{N-1} using (17). Having calculated W_N and V_{N-1}, the unknown W_{N-1} can be found using (16). Repeating this process from k = N-1 to k = 0, all the unknown weights can be calculated.

There is, however, a difficulty in finding V_k from Eq. (17): Eq. (17) is a nonlinear equation in the unknown V_k. Ref. [35] suggests solving it using the successive approximation

V_k^{(i+1)T} σ(x_k) = -½ R^{-1} g(x_k)^T ∇φ( f(x_k) + g(x_k) V_k^{(i)T} σ(x_k) )^T W_{k+1},   ∀k ∈ K,   (18)

starting with an initial guess V_k^{(0)}. The superscript '(i)' denotes the index of iteration. Let the converged weight be denoted by V_k. Replacing Eq. (17) with the iterative relation (18), the same backward-in-time process can be conducted for finding the weights, with the exception that a set of iterations has to be done at each time step k for finding the respective V_k. This process is detailed in Algorithm 1.

Algorithm 1.

Step 1: Find W_N such that W_N^T φ(x_N) ≅ ψ(x_N) for different x_N ∈ Ω, where Ω denotes a compact subset of ℝ^n representing the domain of interest.
Step 2: For k = N-1 to 0 repeat {
  Step 3: Set i = 0 and select a guess for V_k^{(0)}.
  Step 4: Randomly select h different state vectors x_k^{[j]} ∈ Ω, j ∈ J ≡ {1, 2, …, h}, for h being a large positive integer.
  Step 5: Set x_{k+1}^{[j]} = f(x_k^{[j]}) + g(x_k^{[j]}) u_k^{[j]}, where u_k^{[j]} = V_k^{(i)T} σ(x_k^{[j]}), ∀j ∈ J.
  Step 6: Find V_k^{(i+1)} such that V_k^{(i+1)T} σ(x_k^{[j]}) ≅ -½ R^{-1} g(x_k^{[j]})^T ∇φ(x_{k+1}^{[j]})^T W_{k+1}, ∀j ∈ J.
  Step 7: Set i = i + 1 and repeat Steps 5 and 6, until ‖V_k^{(i+1)} - V_k^{(i)}‖ converges with a preset tolerance.
  Step 8: Set V_k = V_k^{(i)}.
  Step 9: Find W_k such that W_k^T φ(x_k^{[j]}) ≅ Q(x_k^{[j]}) + σ(x_k^{[j]})^T V_k R V_k^T σ(x_k^{[j]}) + W_{k+1}^T φ( f(x_k^{[j]}) + g(x_k^{[j]}) V_k^T σ(x_k^{[j]}) ), ∀j ∈ J.
}
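A compact Python sketch of Algorithm 1 follows, using the least-squares interpretation of Steps 1, 6, and 9 described in Appendix A. The callables sample, phi, dphi, sigma, f, g, Q, and psi are user-supplied placeholders; this is a sketch of the backward-in-time procedure, not the authors' MATLAB implementation [44].

```python
import numpy as np

def fit_lsq(A, B):
    """Least squares for W in W^T A = B (Appendix A): W = (A A^T)^+ A B^T."""
    return np.linalg.pinv(A @ A.T) @ A @ B.T

def train_fixed_final_time(N, h, sample, phi, dphi, sigma, f, g, R, Q, psi,
                           tol=1e-6, iters=50):
    """Algorithm 1 sketch. sample() draws a state from Omega; phi/sigma
    evaluate the basis vectors; dphi(x) is the Jacobian of phi (q x n);
    f, g, Q, psi are the discretized dynamics and cost terms of Eqs. (3)-(4)."""
    Rinv = np.linalg.inv(R)
    X = [sample() for _ in range(h)]
    # Step 1: critic weights at the final time, W_N^T phi(x) ~ psi(x).
    W = {N: fit_lsq(np.column_stack([phi(x) for x in X]),
                    np.array([[psi(x) for x in X]]))[:, 0]}
    V = {}
    for k in range(N - 1, -1, -1):                      # Step 2
        X = [sample() for _ in range(h)]                # Step 4
        S = np.column_stack([sigma(x) for x in X])
        Vk = np.zeros((S.shape[0], R.shape[0]))         # Step 3: V_k^{(0)} = 0
        for _ in range(iters):                          # Steps 5-7
            Y = np.column_stack(
                [-0.5 * Rinv @ g(x).T
                 @ dphi(f(x) + g(x) @ (Vk.T @ sigma(x))).T
                 @ W[k + 1] for x in X])                # targets of Eq. (18)
            Vk_new = fit_lsq(S, Y)
            done = np.linalg.norm(Vk_new - Vk) < tol
            Vk = Vk_new
            if done:
                break
        V[k] = Vk                                       # Step 8
        t = []                                          # Step 9: Eq. (16)
        for x in X:
            u = Vk.T @ sigma(x)
            t.append(Q(x) + u @ R @ u + W[k + 1] @ phi(f(x) + g(x) @ u))
        W[k] = fit_lsq(np.column_stack([phi(x) for x in X]),
                       np.array([t]))[:, 0]
    return W, V
```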
In Steps 1, 6, and 9 of Algorithm 1, the method of least squares, explained in Appendix A, can be used for finding the unknown weights in terms of the given parameters.

Remark 1. The uniform approximation capability of neural networks [36,37] in the approximation of continuous functions indicates that once the network is trained on a large enough number of samples, denoted by h, distributed throughout the domain of interest, the network is able to approximate the output for any new sample from the domain with a bounded approximation error. This error bound can be made arbitrarily small if the NNs' activation functions are rich and the number of training samples, h, is large enough. For the linear-in-weight neural networks selected in this study and the polynomial basis functions utilized in the numerical examples, the Weierstrass approximation theorem [38] provides a similar uniform approximation capability.

Once the networks are trained offline, the optimal control is given in a feedback form by the actor and can be implemented for online control of the plant. The critic network approximates the optimal cost-to-go as a function of the given state and time. This feature of the critic will be used in the next subsection, along with the actor, to find the solution to the optimal switching problem.

3.2. ADP for switching optimal control

The idea presented in the previous subsection for optimal control of conventional problems is now extended to switching problems. An algorithm is proposed for learning the optimal cost-to-go and the controls, where (M+1) neural networks are utilized. In other words, one critic is used to approximate J*_k(x_k) and M actors are used to approximate u^{v,*}_k(x_k), ∀v ∈ V:

V_k^{vT} σ(x_k) ≅ u^{v,*}_k(x_k),   v ∈ V, k ∈ K,   (19)

W_k^T φ(x_k) ≅ J*_k(x_k),   k ∈ K ∪ {N}.   (20)

Note that the superscript v in V_k^v ∈ ℝ^{p×m} relates the actor to the respective subsystem. In other words, the optimal control 'given' the active subsystem v at time k is approximated by V_k^{vT} σ(x_k), where V_k^{vT} denotes the transpose of the matrix V_k^v.

In order to obtain suitable equations for developing the training algorithm to determine W_k and V_k^v, ∀k and ∀v, the Bellman equation given by the set of Eqs. (9)–(11) may be adapted as follows. Eq. (9) can be used for finding J*_N(.). The optimal control given subsystem v can be calculated using

u^{v,*}_k = arginf_{u∈ℝ^m} ( u^T R u + J*_{k+1}( f_v(x_k) + g_v(x_k) u ) ),   ∀k ∈ K and ∀v ∈ V.   (21)

Afterwards, the optimal cost-to-go can be found using

J*_k(x_k) = Q(x_k) + min_{v∈V} ( u_k^{v,*T} R u^{v,*}_k + J*_{k+1}( f_v(x_k) + g_v(x_k) u^{v,*}_k ) ),   ∀k ∈ K.   (22)

This process may continue in a backward form from k = N-1 to k = 0. Using the network structures (19) and (20) in Eqs. (9), (21), and (22), the desired weight update laws can be obtained, as given by (14) followed by

V_k^{vT} σ(x_k) = arginf_{u∈ℝ^m} ( u^T R u + W_{k+1}^T φ( f_v(x_k) + g_v(x_k) u ) ),   ∀k ∈ K and ∀v ∈ V,   (23)

W_k^T φ(x_k) = Q(x_k) + min_{v∈V} ( σ(x_k)^T V_k^v R V_k^{vT} σ(x_k) + W_{k+1}^T φ( f_v(x_k) + g_v(x_k) V_k^{vT} σ(x_k) ) ),   ∀k ∈ K.   (24)

Considering the smoothness of φ(.), and hence the existence of its gradient, the minimization in Eq. (23) may be replaced with finding its stationary point, i.e., setting the gradient equal to zero, to get

V_k^{vT} σ(x_k) = -½ R^{-1} g_v(x_k)^T ∇φ( f_v(x_k) + g_v(x_k) V_k^{vT} σ(x_k) )^T W_{k+1},   ∀k ∈ K and ∀v ∈ V.   (25)

Moreover, the idea of using successive approximation, discussed in the previous subsection for conventional problems, may be utilized for remedying the issue of (25) being a nonlinear equation in V_k^v. To this end, Eq. (25) may be replaced with

V_k^{v,(i+1)T} σ(x_k) = -½ R^{-1} g_v(x_k)^T ∇φ( f_v(x_k) + g_v(x_k) V_k^{v,(i)T} σ(x_k) )^T W_{k+1},   ∀k ∈ K and ∀v ∈ V,   (26)

which is an iterative relation starting with the initial guess V_k^{v,(0)}, with the converged value denoted by V_k^v, ∀v ∈ V. Algorithm 2 describes the detailed learning process.

Algorithm 2.

Step 1: Find W_N such that W_N^T φ(x_N) ≅ ψ(x_N) for different x_N ∈ Ω, where Ω denotes a compact subset of ℝ^n representing the domain of interest.
Step 2: For k = N-1 to 0 repeat Steps 3 through 11 below {
  Step 3: Randomly select h different state vectors x_k^{[j]} ∈ Ω, j ∈ J ≡ {1, 2, …, h}, for h being a large positive integer.
  Step 4: For v = 1 to M repeat Steps 5 through 9 below. {
    Step 5: Set i = 0 and select a guess for V_k^{v,(0)}.
    Step 6: Set x_{k+1}^{v,[j]} = f_v(x_k^{[j]}) + g_v(x_k^{[j]}) u_k^{v,[j]}, where u_k^{v,[j]} = V_k^{v,(i)T} σ(x_k^{[j]}), ∀j ∈ J.
    Step 7: Find V_k^{v,(i+1)} such that V_k^{v,(i+1)T} σ(x_k^{[j]}) ≅ -½ R^{-1} g_v(x_k^{[j]})^T ∇φ(x_{k+1}^{v,[j]})^T W_{k+1}, ∀j ∈ J.   (27)
    Step 8: Set i = i + 1 and repeat Steps 6 and 7, until ‖V_k^{v,(i+1)} - V_k^{v,(i)}‖ converges with a preset tolerance.
    Step 9: Set V_k^v = V_k^{v,(i)}.
  }
  Step 10: Set u_k^{v,[j]} = V_k^{vT} σ(x_k^{[j]}), ∀j ∈ J, ∀v ∈ V.
  Step 11: Find W_k such that W_k^T φ(x_k^{[j]}) ≅ Q(x_k^{[j]}) + min_{v∈V} ( u_k^{v,[j]T} R u_k^{v,[j]} + W_{k+1}^T φ( f_v(x_k^{[j]}) + g_v(x_k^{[j]}) u_k^{v,[j]} ) ), ∀j ∈ J.   (28)
}

As discussed before, Eq. (26) relates V_k^{v,(i+1)} to V_k^{v,(i)}, i.e., it is an iterative equation. The following theorem provides a sufficient condition for the convergence of the iterations.

Theorem 1. The iterations given by (26) converge to the unique solution of the nonlinear Eq. (25) from any selected initial guess V_k^{v,(0)} ∈ ℝ^{p×m}, ∀v ∈ V and ∀k ∈ K, provided that the sampling time selected for the discretization of the continuous-time dynamics (1) is small enough.

The proof is given in Appendix B.

Theorem 1 emphasizes the role of the sampling time in the discretization of a continuous-time system. It should be noted, however, that the optimal weights V_k^v, ∀k ∈ K and ∀v ∈ V, can be calculated by solving the nonlinear equation (25) directly, without using the iteration given in (26); typically, one then needs to resort to numerical methods for solving the set of Eqs. (25). Theorem 1 proves that, for any given smooth dynamics and smooth basis functions, if the sampling time is small enough, the iterations given in (26) converge to the unique solution of the nonlinear Eq. (25). If the sampling time is fixed, then certain conditions on the dynamics or the cost function terms need to hold in order for the iterations to converge; these conditions can be derived from the proof of Theorem 1.

An analysis of the approximation capability of the NNs is required, since the results produced through the developed algorithms depend on this capability. As mentioned in Remark 1, NNs can provide uniform approximations with any desired degree of accuracy when the functions subject to approximation are continuous. Considering Eqs. (14), (16), (18), and (26), the continuity of the functions subject to approximation, given on the right-hand sides of these equations, follows from the convexity of ψ(.) and Q(.) as well as the smoothness of the f_v(.)'s, the g_v(.)'s, and the basis functions. For Eq. (24), however, the continuity of the right-hand side is not obvious, due to the switching between different v ∈ V. Theorem 2 establishes the desired continuity.

Theorem 2. The function F : ℝ^n → ℝ defined as

F(x) ≡ min_{v∈V} ( σ(x)^T V_k^v R V_k^{vT} σ(x) + W_{k+1}^T φ( f_v(x) + g_v(x) V_k^{vT} σ(x) ) )

is a continuous function of x at every x ∈ Ω, given continuous functions f_v, g_v, φ, and σ, ∀v ∈ V.

The proof is given in Appendix B. Considering Theorem 2, the continuity of the right-hand side of Eq. (24) follows, as it is the sum of the continuous function Q(.) and the function F(.) defined in the theorem.

Remark 2. It should be noted that the continuity of the optimal cost-to-go with respect to the states may not directly lead to the continuity of the optimal control given an active subsystem, i.e., u^{v,*}_k(.). In the approximation method proposed in this study, however, since the optimal cost-to-go is uniformly approximated with a smooth NN, the control given by the right-hand side of Eq. (25) is guaranteed to be continuous. For a detailed analysis of the characteristics of the cost-to-go function of a switching problem, interested readers are referred to [16] for linear systems with quadratic cost functions.
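A corresponding Python sketch of Algorithm 2 follows, with the same conventions and caveats as the Algorithm 1 sketch above: one critic per time step and one actor per subsystem per time step, with the critic target taken as the minimum over modes per Eq. (28). Modes are 0-indexed here, and f, g are lists of per-subsystem callables (our assumptions).

```python
import numpy as np

def train_switching(N, h, M, sample, phi, dphi, sigma, f, g, R, Q, psi,
                    tol=1e-6, iters=50):
    """Algorithm 2 sketch: critic weights W[k] and actor weights V[k, v]."""
    fit = lambda A, B: np.linalg.pinv(A @ A.T) @ A @ B.T   # Appendix A
    Rinv = np.linalg.inv(R)
    X = [sample() for _ in range(h)]
    W = {N: fit(np.column_stack([phi(x) for x in X]),
                np.array([[psi(x) for x in X]]))[:, 0]}    # Step 1
    V = {}
    for k in range(N - 1, -1, -1):                         # Step 2
        X = [sample() for _ in range(h)]                   # Step 3
        S = np.column_stack([sigma(x) for x in X])
        for v in range(M):                                 # Step 4
            Vv = np.zeros((S.shape[0], R.shape[0]))        # Step 5
            for _ in range(iters):                         # Steps 6-8
                Y = np.column_stack(
                    [-0.5 * Rinv @ g[v](x).T
                     @ dphi(f[v](x) + g[v](x) @ (Vv.T @ sigma(x))).T
                     @ W[k + 1] for x in X])               # Eq. (27)
                Vv_new = fit(S, Y)
                done = np.linalg.norm(Vv_new - Vv) < tol
                Vv = Vv_new
                if done:
                    break
            V[k, v] = Vv                                   # Step 9
        t = []                                             # Steps 10-11
        for x in X:
            costs = []
            for v in range(M):
                u = V[k, v].T @ sigma(x)                   # Step 10
                costs.append(u @ R @ u
                             + W[k + 1] @ phi(f[v](x) + g[v](x) @ u))
            t.append(Q(x) + min(costs))                    # Eq. (28) target
        W[k] = fit(np.column_stack([phi(x) for x in X]), np.array([t]))[:, 0]
    return W, V
```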
3.3. Implementation and control

Once the NNs' weights are trained using Algorithm 2, one may use them for online optimal control/scheduling of the system. This is done in real time by feeding the current state x_k and time k to

v*_k(x_k) = argmin_{v∈V} ( σ(x_k)^T V_k^v R V_k^{vT} σ(x_k) + W_{k+1}^T φ( f_v(x_k) + g_v(x_k) V_k^{vT} σ(x_k) ) ),   ∀k ∈ K,   (29)

to calculate the optimal mode, v*_k(x_k). Having calculated v*_k(x_k), the optimal control u*_k is given by u*_k = u_k^{v*_k(x_k),*} ≅ V_k^{v*_k(x_k)T} σ(x_k). Hence, the optimal solution can be found online in a feedback form.

As for the computational load of the proposed method, it should be noted that the training is supposed to be done offline. Using the least squares method described in Appendix A, each weight update consists mainly of a pseudo-inverse calculation. The numerical analyses show that the iterations converge very fast and the whole training can be conducted in a fraction of a minute for the simulated examples (more details are presented in the next section). Once the training is done, the online control and scheduling is computationally very cheap, since the set V has a finite number of elements and the minimization in (29) is as simple as comparing the values of the scalar function subject to minimization for the different elements of V, i.e., for the different subsystems, to select the minimizer.

As mentioned in the introduction, one of the features of this method is providing an approximate global optimal solution. A requirement for this characteristic is that the learned cost-to-go and controls be approximations of the global optimal cost-to-go and controls. Using Algorithm 2, the global optimality of the trained networks follows from the proof of Theorem 1: once it is proved that (26) is a contraction mapping, the uniqueness of the fixed point V_k^v of the iterative Eq. (26) follows [39]. Details of this result are beyond the scope of this study and are presented in [40]. As for the weights of the neural networks, the least squares method leads to the globally optimal weights, due to the convexity of least squares problems [41].

Looking at Eq. (7), which is 'the decision maker' for switching, one may observe high frequency switching between the modes in some problems; in fact, this behavior is observed in Example 2 of this study. The following two remedies are suggested to avoid high frequency switching; a sketch combining both appears after Section 3.3.2.
3.3.1. Minimum dwell time remedy

Dictating a minimum dwell time after each switch can help eliminate high frequency switching. That is, once a switching occurs, one may skip evaluating Eq. (7), or equivalently Eq. (29), and instead stay with the current active subsystem until the minimum dwell time has passed.
3.3.2. Threshold remedy

Selecting a positive real number as the threshold, a switch is allowed only once the cost-to-go difference between activating the new mode and staying with the current mode exceeds the threshold. To be more specific, assume the active subsystem is v right before time instant k, and by evaluating Eq. (7) one finds that switching to subsystem w leads to a smaller cost-to-go than staying with subsystem v. In such a case, switching to subsystem w is allowed only if

u_k^{w,*T} R u^{w,*}_k + J*_{k+1}( f_w(x_k) + g_w(x_k) u^{w,*}_k ) < u_k^{v,*T} R u^{v,*}_k + J*_{k+1}( f_v(x_k) + g_v(x_k) u^{v,*}_k ) - τ,
where the pre-selected threshold is denoted by the positive real number τ. The same algorithm (Algorithm 2) may still be used in the offline training stage of the NNs; the abovementioned remedies are applied only in the online control. The alterations created by the remedies result in a 'sub-optimal' control of the system. The degradation remains limited, however, because the neurocontroller calculates the solution in a feedback form. More specifically, the perturbation due to the applied remedy can be considered as a disturbance for the controller. Providing a suitable selection of the minimum dwell time or the threshold, the feedback nature of the controller can deal with the resulting disturbance without too much performance degradation. This behavior is due to the inherent capability of feedback controllers in rejecting moderate disturbances.
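The two remedies can be folded into the online loop of Eq. (29), as in the following Python sketch. It is a minimal illustration assuming the weight dictionaries W and V produced by the training sketch above; setting dwell = 0 and tau = 0 recovers the unmodified scheduler.

```python
import numpy as np

def run_closed_loop(x0, N, M, W, V, phi, sigma, f, g, R, tau=0.0, dwell=0):
    """Online feedback scheduling/control, Eq. (29), with the Section 3.3
    remedies: `dwell` is the minimum number of steps between switches
    (Section 3.3.1); `tau` is the switching threshold (Section 3.3.2)."""
    x = np.asarray(x0, dtype=float)
    v, since = None, dwell                  # steps elapsed since last switch
    traj, modes = [x.copy()], []
    for k in range(N):
        def cost(w):                        # scalar of Eq. (29) for mode w
            u = V[k, w].T @ sigma(x)
            return u @ R @ u + W[k + 1] @ phi(f[w](x) + g[w](x) @ u)
        if v is None:                       # first mode selection
            v = min(range(M), key=cost)
        elif since >= dwell:                # dwell-time remedy (3.3.1)
            w = min(range(M), key=cost)
            if w != v and cost(w) < cost(v) - tau:   # threshold remedy (3.3.2)
                v, since = w, 0
        u = V[k, v].T @ sigma(x)            # actor of the active subsystem
        x = f[v](x) + g[v](x) @ u           # propagate one step, Eq. (3)
        since += 1
        traj.append(x.copy()); modes.append(v)
    return np.array(traj), modes
```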
4. Numerical applications and analyses

The performance of the proposed method is evaluated through the following two examples. The source codes for the numerical analyses can be found at [44].

Example 1. The first example is a scalar switching system with the two modes given below:

ẋ = f_1(x) + g_1(x) u ≡ -x + u   (mode 1),
ẋ = f_2(x) + g_2(x) u ≡ -x³ + u   (mode 2).

The selected cost function is J = 50 (x(t_f) + 2)² + ∫_0^{t_f} 0.5 u(t)² dt, where t_f = 1 s; hence, the objective is directing the state toward the value -2. Selecting the domain of interest Ω = [-3, 3], the optimal switching function for the given system can be calculated analytically as

v*(t) = 1   if 0 ≤ x(t) ≤ 1 or -2 ≤ x(t) ≤ -1,
v*(t) = 2   if 1 ≤ x(t) or -1 ≤ x(t) ≤ 0 or x(t) < -2.   (30)

For example, if x(t) > 1, utilizing subsystem 2 leads to a faster convergence toward the origin with less control effort, due to |f_2(x)| > |f_1(x)|; therefore, the optimal mode in this case is subsystem 2. Note that if x(t) ∈ [-2, 0], the optimal mode is the one with the smaller |f_v(x)|, since less control effort is then required to drive the state away from its point of attraction, i.e., the origin, toward the desired terminal point of -2. The existence of the analytical optimal switching function (30) for this system makes it a suitable example for investigating the performance of the developed method.

The polynomial functions x^i, for i ∈ {0, 1, 2, …, 6} and i ∈ {0, 1, 2, …, 5}, are selected as the basis functions φ(.) and σ(.), respectively. The horizon is discretized into N = 200 time steps, i.e., Δt = 0.005 s. The training steps detailed in Algorithm 2 are carried out using h = 50. The entire training phase took less than 4 s on a machine with an Intel Core i7 3770, 3.4 GHz, running MATLAB 2013a. The time history of the weights of the trained NNs is shown in Fig. 1; it depicts the time-dependent behavior of the elements of the weights throughout the horizon, i.e., from t = 0 to t = 1 s. Moreover, Fig. 2 depicts the evolution of some of the actor weights versus the iterations. As seen in this figure, the iterations converged in fewer than 4 epochs.

Fig. 1. Time history of NN weights, Example 1.

Fig. 2. History of some of the weights of the actors versus the iterations, Example 1.

Once the networks are trained, the NNs are utilized for controlling the initial condition x(0) = 2. The results, including the histories of the state, the active mode, and the optimal control, are shown in Fig. 3. The state history shows that the controller has successfully driven the initial state close to the desired terminal point in the given time. The history of active modes shows that the switching has happened exactly at the optimal times, considering the analytical optimal switching schedule given by Eq. (30).

Fig. 3. Simulation results for x(0) = 2 and t_f = 1, Example 1.
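For reference, the following Python fragment encodes the problem data of Example 1 as reconstructed above (Euler-discretized dynamics, cost terms, and the analytical switching law (30)). It is a convenience for checking a trained scheduler, not part of the original source codes [44].

```python
import numpy as np

dt, N = 0.005, 200                        # sampling time and horizon (t_f = 1 s)

# Discretized subsystem dynamics of Example 1 (Euler, Eq. (3)).
f = [lambda x: x + dt * (-x),             # mode 1: x_dot = -x + u
     lambda x: x + dt * (-x ** 3)]        # mode 2: x_dot = -x^3 + u
g = [lambda x: dt * np.ones((1, 1))] * 2  # both modes have unit input gain
R = dt * 0.5 * np.eye(1)                  # discretized control penalty
Q = lambda x: 0.0                         # no running state penalty
psi = lambda x: 50.0 * (x[0] + 2.0) ** 2  # terminal cost, drives x toward -2

def v_star_analytic(x):
    """Analytical optimal switching law of Eq. (30), with the paper's mode
    numbering (1 and 2); useful for checking the trained scheduler."""
    if 0.0 <= x <= 1.0 or -2.0 <= x <= -1.0:
        return 1
    return 2
```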
An important feature of the developed method is providing the optimal solution for different final times, without needing to retrain the networks. Let the new final time be t_f = 0.5 s, i.e., the state should be brought close to -2 in half of the previously selected final time. Note that once the optimal weights are available for k ∈ {0, 1, …, N}, the optimal weights for a horizon of N_1 steps, where N_1 < N, are the last N_1 sets of weights, i.e., they are given by the W_k's and V_k^v's with k ∈ {N - N_1, N - N_1 + 1, …, N}, due to the Bellman principle of optimality [34]; see the re-indexing sketch below. Fig. 4 shows the result of controlling the initial condition x(0) = 2 with t_f = 0.5 s. Interestingly, the neurocontroller has successfully brought the state close to the desired terminal point in the shorter final time. To do this, a different control history and a different switching schedule are selected. The new active mode history, however, is still in accordance with the analytical v*(t) given in Eq. (30).

To further investigate the performance of the method, another initial condition, x(0) = 0.5, is selected and controlled using the same trained networks. The results are depicted in Fig. 5. The resulting state history and switching schedule show that the controller is able to solve the optimal switching problem for different initial conditions.
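The weight reuse described above amounts to simple re-indexing, e.g. (Python, assuming the dictionaries returned by the training sketch of Algorithm 2):

```python
def shift_weights(W, V, N, N1, M):
    """Reuse trained weights for a shorter horizon N1 <= N (Bellman's
    principle): the last N1 stages of weights solve the N1-step problem."""
    W_short = {k: W[k + N - N1] for k in range(N1 + 1)}
    V_short = {(k, v): V[k + N - N1, v] for k in range(N1) for v in range(M)}
    return W_short, V_short
```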
Finally, different initial conditions in the range -3 to 3 with a step size of 1, and different final times in the range 0.25–1 s with a step size of 0.125, are simulated, and the resulting state histories are depicted in Fig. 6. In all the cases, the controller has been successful in directing the states toward the proximity of the desired terminal point. This figure demonstrates the versatility of the method as compared to open loop methods [6–9,11–15] or closed loop methods which depend on a single initial condition [30].

Fig. 4. Simulation results for x(0) = 2 and t_f = 0.5, Example 1.

Fig. 5. Simulation results for x(0) = 0.5 and t_f = 1, Example 1.

Fig. 6. Simulation results for different initial conditions (top) and different final times (bottom), Example 1.

Example 2. The second example is a fourth order linear system which models the planar motion of a point mass in the absence of friction. The objective is moving the mass to the origin. The force, however, is limited to be applied either in the X or in the Y direction, where X and Y denote the perpendicular axes in the plane. The input u(t) denotes the applied force; while its magnitude is to be calculated, its orientation is limited to be parallel to the X or the Y axis. This problem is modeled as a switching problem with two modes: mode 1, in which the force steers the mass parallel to the X axis, and mode 2, in which the force steers it parallel to the Y axis. The state vector is formed as x = [x_1, x_2, x_3, x_4]^T, where x_1 and x_2, respectively, are the X and Y positions, and x_3 and x_4 are the rates of change of x_1 and x_2, respectively. The model for the system is given by

ẋ = f(x) + g_1(x) u ≡ [x_3, x_4, 0, 0]^T + [0, 0, 1, 0]^T u   (mode 1),
ẋ = f(x) + g_2(x) u ≡ [x_3, x_4, 0, 0]^T + [0, 0, 0, 1]^T u   (mode 2).

Let the cost function be J = 75 x(t_f)^T x(t_f) + ∫_0^{t_f} 0.5 u(t)² dt, where t_f = 2 s. The domain of interest is selected as Ω = {x ∈ ℝ⁴ : |x_i| < 2.5, ∀i}. Let the vector whose elements are all the non-repeating polynomials made up through multiplying the elements of vector x by those of vector y be denoted by x ∘ y. In this example the following basis functions are used:
σ(x) = [1, x^T, (x ∘ x)^T]^T,   φ(x) = [1, x^T, (x ∘ x)^T, (x ∘ (x ∘ x))^T]^T.

Using a sampling time of Δt = 0.005 s, the horizon is discretized into 400 steps. The training is carried out using Algorithm 2 with h = 200, and the successive approximations were observed to converge in 4 iterations. The training took less than 50 s on the machine with an Intel Core i7 3770, 3.4 GHz, running MATLAB 2013a.

Selecting the initial condition x(0) = [0, 2, 2, 0]^T, the trained networks were used for switching and control of the system, and the results are given in Fig. 7. It can be seen that the controller has been able to move the mass toward the origin through applying a history of force and performing a suitable switching between the applied directions. The same problem is solved with the shorter final time of t_f = 1.5 s, and the results are given in Fig. 8. As expected, the same trained networks have been able to solve the problem with the new final time as well. Fig. 9 shows the result of solving the problem with a new initial condition of x(0) = [2, 2, 0, 2]^T and the final time of t_f = 2 s. Interestingly, the controller has handled the new initial condition well, too.

Considering the switching schedule given in Fig. 7, it is observed that the controller exhibits high frequency switching between the modes in order to steer the mass in the desired direction. If this behavior is unacceptable, one may apply one of the suggested remedies in Section 3.3.
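The x ∘ y construction corresponds to taking all non-repeating monomials. A small Python helper that generates σ and φ for Example 2 could look as follows; the helper name and structure are ours, given only as an illustration.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_basis(x, degree):
    """All non-repeating monomials of x up to `degree`, i.e. the vector
    [1, x^T, (x∘x)^T, ...]^T used for sigma (degree 2) and phi (degree 3)."""
    terms = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            terms.append(np.prod([x[i] for i in idx]))
    return np.array(terms)

sigma = lambda x: poly_basis(x, 2)   # n = 4: 1 + 4 + 10 = 15 neurons
phi = lambda x: poly_basis(x, 3)     # n = 4: 1 + 4 + 10 + 20 = 35 neurons
```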
Fig. 7. Simulation results for x(0) = [0, 2, 2, 0]^T and t_f = 2, Example 2.

Fig. 8. State histories for x(0) = [0, 2, 2, 0]^T and t_f = 1.5, Example 2.

Fig. 9. State histories for x(0) = [2, 2, 0, 2]^T and t_f = 2, Example 2.

As an example, the threshold remedy is simulated here to investigate its effect on the performance of the controller. The threshold was set to τ = 0.2, and the results for the first simulated problem of Example 2 are given in Fig. 10. Comparing the state histories of Fig. 10 with those given in Fig. 7, it can be seen that the controller has still been able to bring the mass toward the origin, while the number of switchings is much smaller than in Fig. 7. Note that, due to the coupling between the selected mode and the applied input, the control history given in Fig. 10 is also different from the one given in Fig. 7.

Fig. 10. State histories for x(0) = [0, 2, 2, 0]^T, t_f = 2, and threshold τ = 0.2, Example 2.

Finally, in order to demonstrate the potential of the 'feedback scheduling' of the proposed method, a brief robustness analysis is conducted. It is assumed that when mode 1 is active, a random
disturbance force, uniformly distributed between 0 and 6 units, is applied in the positive Y direction. Afterwards, without retraining the networks, the initial condition x(0) = [0, 2, 2, 0]^T, whose nominal simulation results are presented in Fig. 7, is simulated under the disturbances. Three different scenarios are conducted under a single history of time-varying disturbances: (A) operating the system with the open loop input u_k and the open loop mode schedule v_k calculated from the simulation results presented in Fig. 7; (B) operating the system with a closed loop input but with the open loop mode schedule calculated under the no-disturbance assumption; and (C) operating the system with both closed loop input and closed loop scheduling. The resulting state trajectories for the three scenarios are presented in Fig. 11. The states diverge in scenario A, as expected, due to the fully open loop operation of the system. Comparing scenario B with scenario C, however, the interesting feature of feedback scheduling can be seen. Note that, in both of these scenarios, the input u_k is calculated in a feedback form, based on the current states influenced by the disturbance. However, in scenario B the fixed mode schedule obtained in Fig. 7 under the no-disturbance assumption is applied, while in scenario C the mode scheduling is also conducted in a feedback fashion, based on the current states of the system. The cost resulting from scenario B turned out to be more than twice the cost of scenario C. These comparisons show the advantage of feedback scheduling and the respective capability of the proposed controller, as compared with open loop methods [6–9,11–15].

Fig. 11. State histories under random disturbances for three different scenarios: (A) open loop control and mode scheduling, (B) closed loop control with open loop mode scheduling, and (C) closed loop control and mode scheduling, Example 2.
5. Conclusions

A method in the framework of approximate dynamic programming was developed for determining the optimal control and the optimal switching schedule for switching systems with controlled nonlinear subsystems and an unspecified mode sequence. The performance of the method in solving problems with different initial conditions and different final times was investigated both analytically and numerically. These results and analyses lead to the conclusion that the proposed method is versatile and suitable for solving different real-world problems, including aerospace, mechanical, and chemical engineering problems. The application of the proposed scheme to real-world problems in different disciplines is proposed as future work.

Acknowledgment

This research was partially supported by a grant from the National Science Foundation.

Appendix A

In some steps of Algorithms 1 and 2, weight update rules for the weights of the actor and critic networks, i.e., V_k and W_k, are given implicitly. The least squares method can be used for rewriting these equations such that V_k and W_k are given explicitly in terms of the known parameters. In this appendix, the process of finding such an equation for V_k from Eq. (26) is explained; one can easily find the corresponding equation for W_k. To perform least squares for the weight update of V_k^{v,(i+1)}, h random states, denoted by x^{[j]} with j ∈ {1, 2, …, h}, are selected. Denoting the right-hand side of Eq. (26) evaluated at each x^{[j]} by Y(x^{[j]}), the objective is finding V_k^{v,(i+1)} such that it solves

V_k^{v,(i+1)T} σ(x^{[j]}) = Y(x^{[j]}),   j = 1, 2, …, h.   (31)

Define

σ̄ ≡ [σ(x^{[1]}), σ(x^{[2]}), …, σ(x^{[h]})],   Ȳ ≡ [Y(x^{[1]}), Y(x^{[2]}), …, Y(x^{[h]})].

Using the method of least squares, the solution to the system of linear equations (31) is given by

V_k^{v,(i+1)} = (σ̄ σ̄^T)^{-1} σ̄ Ȳ^T.   (32)

Note that for the inverse of the p×p matrix (σ̄ σ̄^T) to exist, one needs the basis functions σ(.) to be linearly independent and h to be greater than or equal to the number of the basis functions.
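Eq. (32) translates directly into a linear solve; a minimal Python rendering, with hypothetical argument names, is:

```python
import numpy as np

def lsq_weight_update(sigma_bar, Y_bar):
    """Eq. (32): V = (sigma_bar sigma_bar^T)^{-1} sigma_bar Y_bar^T.
    sigma_bar: p x h matrix whose columns are sigma(x^[j]);
    Y_bar:     m x h matrix whose columns are Y(x^[j]).
    Requires linearly independent basis functions and h >= p."""
    S, Y = np.asarray(sigma_bar), np.asarray(Y_bar)
    return np.linalg.solve(S @ S.T, S @ Y.T)
```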
Appendix B

This appendix includes the proofs of Theorems 1 and 2.

Proof of Theorem 1. The iteration performed on V_k^{v,(i)}, given in (26) and repeated here, is a successive approximation to find a fixed point of a function:

V_k^{v,(i+1)T} σ(x_k) = -½ R^{-1} g_v(x_k)^T ∇φ( f_v(x_k) + g_v(x_k) V_k^{v,(i)T} σ(x_k) )^T W_{k+1},   ∀k ∈ K and ∀v ∈ V,

i.e., there exists a function ℱ : ℝ^{p×m} → ℝ^{p×m} such that (26) is of the form

V_k^{v,(i+1)} = ℱ( V_k^{v,(i)} ).   (33)

The claim of the theorem is proved if it can be shown that (33) is a contraction mapping [39]. Since ℝ^{p×m} with the 2-norm, denoted by ‖.‖, is a Banach space, the iterations given by (33), regardless of the selected initial V_k^{v,(0)}, converge to a unique V_k^v satisfying V_k^v = ℱ(V_k^v) if there exists a 0 ≤ ρ < 1 such that for every U_1 and U_2 in ℝ^{p×m} the following inequality holds [39]:

‖ℱ(U_1) - ℱ(U_2)‖ ≤ ρ ‖U_1 - U_2‖.   (34)

Function ℱ(.) can be formed by converting (26) to the least squares form derived in Appendix A. Rewriting Eq. (32) leads to

ℱ( V_k^{v,(i)} ) = (σ̄ σ̄^T)^{-1} σ̄ [ y^{[1]T}; y^{[2]T}; …; y^{[h]T} ],
y^{[j]} ≡ -½ R^{-1} g_v(x_k^{[j]})^T ∇φ( f_v(x_k^{[j]}) + g_v(x_k^{[j]}) V_k^{v,(i)T} σ(x_k^{[j]}) )^T W_{k+1}.   (35)

One has

‖ℱ(U_1) - ℱ(U_2)‖ ≤ √h ‖(σ̄ σ̄^T)^{-1} σ̄‖ ‖ ½ R^{-1} g_v(x_k^{[ℓ]})^T ∇φ( f_v(x_k^{[ℓ]}) + g_v(x_k^{[ℓ]}) U_1^T σ(x_k^{[ℓ]}) )^T W_{k+1} - ½ R^{-1} g_v(x_k^{[ℓ]})^T ∇φ( f_v(x_k^{[ℓ]}) + g_v(x_k^{[ℓ]}) U_2^T σ(x_k^{[ℓ]}) )^T W_{k+1} ‖,   (36)

where the integer ℓ, 1 ≤ ℓ ≤ h, is given by

ℓ = argmax_{1≤j≤h} ‖ ½ R^{-1} g_v(x_k^{[j]})^T ∇φ( f_v(x_k^{[j]}) + g_v(x_k^{[j]}) U_1^T σ(x_k^{[j]}) )^T W_{k+1} - ½ R^{-1} g_v(x_k^{[j]})^T ∇φ( f_v(x_k^{[j]}) + g_v(x_k^{[j]}) U_2^T σ(x_k^{[j]}) )^T W_{k+1} ‖.

In inequality (36), the following norm inequality is used:

‖[y_1; y_2; …; y_h]‖ ≤ √h ‖y_ℓ‖,   (37)

where the y_i's are real-valued row vectors and ℓ = argmax_{1≤i≤h} ‖y_i‖.

Smoothness of φ(.) leads to the Lipschitz continuity of ∇φ(.) on the compact set Ω [42]. Therefore, there exists some positive real number ρ_φ such that for every x_1 and x_2 in Ω one has ‖∇φ(x_1) - ∇φ(x_2)‖ ≤ ρ_φ ‖x_1 - x_2‖. Using this feature of ∇φ(.), inequality (36) can be written as

‖ℱ(U_1) - ℱ(U_2)‖ ≤ ρ_φ √h ‖(σ̄ σ̄^T)^{-1} σ̄‖ ‖½ R^{-1} g_v(x_k^{[ℓ]})^T‖ ‖g_v(x_k^{[ℓ]})‖ ‖σ(x_k^{[ℓ]})‖ ‖W_{k+1}‖ ‖U_1^T - U_2^T‖.   (38)

By defining

ρ ≡ ρ_φ √h ‖(σ̄ σ̄^T)^{-1} σ̄‖ ‖½ R^{-1} g_v(x_k^{[ℓ]})^T‖ ‖g_v(x_k^{[ℓ]})‖ ‖σ(x_k^{[ℓ]})‖ ‖W_{k+1}‖,   (39)

one can select the sampling time Δt in the discretization of the continuous dynamics (1) small enough that the condition 0 ≤ ρ < 1 is satisfied, since a smaller Δt directly results in a smaller ‖g_v(x_k^{[ℓ]})‖ while the other terms, including ‖½ R^{-1} g_v(x_k^{[ℓ]})^T‖, are not affected. Note that the smoothness, and hence continuity, of the g_v(.)'s and σ(.) in their domain results in their boundedness on the compact set Ω [43]; therefore, the x_k^{[ℓ]}-dependent terms in (39) are upper bounded.

The expression given for the contraction mapping coefficient ρ in (39) involves ‖W_{k+1}‖ as well. It should be noted that W_{k+1} is already learned in the previous step of the algorithm and is therefore bounded. In other words, starting from k = N-1, one uses the successive approximation given by (26), and once V_k^{v,(i)} converges, it is used in (24) to calculate the bounded W_k. This process is repeated until k = 0. If the selected sampling time Δt is not small enough, then at some k ∈ K the respective ρ given in (39) does not satisfy the condition 0 ≤ ρ < 1, and V_k^{v,(i)} does not converge as i → ∞. In that case, one may select a smaller sampling time and restart the algorithm, i.e., from k = N-1, to calculate the weights corresponding to the smaller sampling time. Refining the sampling time leads to a change in W_{k+1} as well; however, as the sampling time becomes smaller, W_{k+1} remains bounded. This boundedness follows from the definition of W_{k+1}, namely, the weights of the network that approximates a discretized cost-to-go:

W_{k+1}^T φ(x) ≅ ψ(x_N) + Σ_{j=k+1}^{N-1} ( Q(x_j) + u_j^T R u_j ).   (40)

As the sampling time goes to zero, the value of the discretized cost-to-go converges to the cost-to-go given by ψ(x(t_f)) + ∫_{(k+1)Δt}^{t_f} ( Q(x(t)) + u(t)^T R u(t) ) dt. On the other hand, since the system does not have a finite escape time (which follows from the smoothness of the dynamics on the compact domain of interest), the finite-horizon cost-to-go is finite for any finite control. Note that the control history included in the summation in (40) corresponds to the already converged time steps, hence it is bounded. Therefore, as Δt → 0, the value of W_{k+1}^T φ(x) remains finite. Since the basis functions φ(x) are linearly independent, a finite W_{k+1}^T φ(x) leads to a finite W_{k+1}, as seen in the least squares operation described in Appendix A. Therefore, the term ‖W_{k+1}‖ in the expression for ρ in (39) remains bounded as the sampling time is refined. This completes the proof of convergence of V_k^{v,(i)} to the unique V_k^v which satisfies Eq. (25), ∀k ∈ K, using any initial guess V_k^{v,(0)}, for any small enough sampling time. ■

Proof of Theorem 2. Let the functions F_v : ℝ^n → ℝ, ∀v ∈ V, be defined as

F_v(x) ≡ σ(x)^T V_k^v R V_k^{vT} σ(x) + W_{k+1}^T φ( f_v(x) + g_v(x) V_k^{vT} σ(x) ).

Also let the piecewise constant function v*_k : ℝ^n → V be given by

v*_k(x) = argmin_{v∈V} F_v(x);   (41)

note that F(.) = F_{v*_k(.)}(.). Let x̄ be any point in Ω and set

v̄ = v*_k(x̄).   (42)

Select an open set α ⊂ Ω such that x̄ belongs to the boundary of α and the limit

v̂ = lim_{‖x-x̄‖→0, x∈α} v*_k(x)   (43)

exists. If v̄ = v̂ for every such α, then there exists some open set β ⊂ Ω containing x̄ such that v*_k(x) is constant for all x ∈ β, because v*_k(x) only assumes integer values. In this case, the continuity of F_{v*_k(x)}(x) at x̄ follows from the fact that F_v(x) is continuous at x̄ for every fixed v ∈ V. Finally, the continuity of the function subject to investigation at every x̄ ∈ Ω leads to the continuity of the function in Ω.

Now assume v̄ ≠ v̂ for some α. From the continuity of F_{v̂}(x) at x̄, for the given v̂, one has

F_{v̂}(x̄) = lim_{δx→0} F_{v̂}(x̄ + δx).   (44)

If it can be shown that for every selected α one has

F_{v̄}(x̄) = F_{v̂}(x̄),   (45)

then the continuity of F_{v*_k(x)}(x) at x̄ follows, because from (45) and (44) one has

F_{v̄}(x̄) = lim_{δx→0} F_{v̂}(x̄ + δx),   (46)

and (46) leads to the continuity by definition [43].

The proof that (45) holds is done by contradiction. Assume that

F_{v̄}(x̄) < F_{v̂}(x̄);   (47)

then, due to the continuity of F_{v̄}(x) and F_{v̂}(x) at x̄, there exists an open set γ containing x̄ such that

F_{v̄}(x) < F_{v̂}(x),   ∀x ∈ γ.   (48)

On the other hand, Eq. (43) implies that there exists an open set with x̄ at its boundary on which v̂ = v*_k(x); hence, because x̄ ∈ γ and γ is an open set, one has

F_{v̄}(x) ≥ F_{v̂}(x),   for some x ∈ γ.   (49)

But (49) contradicts (48); hence, (47) is not possible. The impossibility of

F_{v̄}(x̄) > F_{v̂}(x̄)   (50)

directly follows from (42), because if (50) held then v̄ ≠ v*_k(x̄), which is against (42). Therefore, equality (45) holds, and hence F_{v*_k(x)}(x), or equivalently F(x), is continuous at every x̄ ∈ Ω. This completes the proof. ■

References
[1] M. Soler, A. Olivares, E. Staffetti, Framework for aircraft trajectory planning toward an efficient air traffic management, J. Aircr. 49 (2012) 985–991.
[2] K. Benmansour, A. Benalia, M. Djemaï, J. de Leon, Hybrid control of a multicellular converter, Nonlinear Anal.: Hybrid Syst. 1 (2007) 16–29.
[3] E.A. Hernandez-Vargas, R.H. Middleton, P. Colaneri, F. Blanchini, Dynamic optimization algorithms to mitigate HIV escape, in: Proceedings of the IEEE Conference on Decision and Control, 2010, pp. 827–832.
[4] Z. Gong, C. Liu, E. Feng, L. Wang, Y. Yu, Modelling and optimization for a switched system in microbial fed-batch culture, Appl. Math. Model. 35 (2011) 3276–3284.
[5] M. Rinehart, M. Dahleh, D. Reed, I. Kolmanovsky, Suboptimal control of switched systems with an application to the DISC engine, IEEE Trans. Control Syst. Technol. 16 (2008) 189–201.
[6] X. Xu, P.J. Antsaklis, Optimal control of switched systems via non-linear optimization based on direct differentiations of value functions, Int. J. Control 75 (16) (2002) 1406–1426.
[7] X. Xu, P.J. Antsaklis, Optimal control of switched systems based on parameterization of the switching instants, IEEE Trans. Autom. Control 49 (1) (2004) 2–16.
[8] M. Egerstedt, Y. Wardi, H. Axelsson, Transition-time optimization for switched-mode dynamical systems, IEEE Trans. Autom. Control 51 (1) (2006) 110–115.
[9] H. Axelsson, M. Boccadoro, M. Egerstedt, P. Valigi, Y. Wardi, Optimal mode-switching for hybrid systems with varying initial states, Nonlinear Anal.: Hybrid Syst. 2 (3) (2008) 765–772.
[10] X. Ding, A. Schild, M. Egerstedt, J. Lunze, Real-time optimal feedback control of switched autonomous systems, in: Proceedings of the IFAC Conference on Analysis and Design of Hybrid Systems, 2009, pp. 108–113.
[11] M. Kamgarpour, C. Tomlin, On optimal control of non-autonomous switched systems with a fixed mode sequence, Automatica 48 (2012) 1177–1181.
[12] R. Luus, Y. Chen, Optimal switching control via direct search optimization, Asian J. Control 6 (2004) 302–306.
[13] M. Rungger, O. Stursberg, A numerical method for hybrid optimal control based on dynamic programming, Nonlinear Anal.: Hybrid Syst. 5 (2011) 254–274.
[14] M. Sakly, A. Sakly, N. Majdoub, M. Benrejeb, Optimization of switching instants for optimal control of linear switched systems based on genetic algorithms, in: Proceedings of the IFAC International Conference on Intelligent Control Systems and Signal Processing, Istanbul, 2009.
[15] S.C. Bengea, R.A. DeCarlo, Optimal control of switching systems, Automatica 41 (2005) 11–27.
[16] W. Zhang, J. Hu, A. Abate, On the value functions of the discrete-time switched LQR problem, IEEE Trans. Autom. Control 54 (11) (2009) 2669–2674.
[17] P.J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in: D.A. White, D.A. Sofge (Eds.), Handbook of Intelligent Control, Multiscience Press, New York, NY, 1992.
[18] S.N. Balakrishnan, V. Biega, Adaptive-critic based neural networks for aircraft optimal control, J. Guid. Control Dyn. 19 (4) (1996) 893–898.
[19] D.V. Prokhorov, D.C. Wunsch II, Adaptive critic designs, IEEE Trans. Neural Netw. 8 (5) (1997) 997–1007.
[20] A. Al-Tamimi, F.L. Lewis, M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Trans. Syst. Man Cybern. Part B 38 (2008) 943–949.
[21] T. Dierks, B.T. Thumati, S. Jagannathan, Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence, Neural Netw. 22 (2009) 851–860.
[22] S. Ferrari, R.F. Stengel, Online adaptive critic flight control, J. Guid. Control Dyn. 27 (5) (2004) 777–786.
[23] G.K. Venayagamoorthy, R.G. Harley, D.C. Wunsch, Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator, IEEE Trans. Neural Netw. 13 (3) (2002) 764–773.
[24] R. Padhi, N. Unnikrishnan, X. Wang, S.N. Balakrishnan, A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems, Neural Netw. 19 (2006) 1648–1660.
[25] D. Han, S.N. Balakrishnan, State-constrained agile missile control with adaptive-critic-based neural networks, IEEE Trans. Control Syst. Technol. 10 (4) (2002) 481–489.
[26] A. Heydari, S.N. Balakrishnan, Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics, IEEE Trans. Neural Netw. Learn. Syst. 24 (1) (2013) 145–157.
[27] Q. Wei, D. Liu, An iterative ϵ-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state, Neural Netw. 32 (2012) 236–244.
[28] R. Song, H. Zhang, The finite-horizon optimal control for a class of time-delay affine nonlinear system, Neural Comput. Appl. (2011), http://dx.doi.org/10.1007/s00521-011-0706-3.
[29] N. Cao, H. Zhang, Y. Luo, D. Feng, Infinite horizon optimal control of affine nonlinear discrete switched systems using two-stage approximate dynamic programming, Int. J. Syst. Sci. 43 (9) (2012) 1673–1682.
[30] C. Qin, H. Zhang, Y. Luo, B. Wang, Finite horizon optimal control of non-linear discrete-time switched systems using adaptive dynamic programming with epsilon-error bound, Int. J. Syst. Sci. (2013), http://dx.doi.org/10.1080/00207721.2012.748945.
[31] A. Heydari, S.N. Balakrishnan, Optimal multi-therapeutic HIV treatment using a global optimal switching scheme, Appl. Math. Comput. 219 (2013) 7872–7881.
[32] A. Heydari, S.N. Balakrishnan, Optimal switching and control of nonlinear switching systems using approximate dynamic programming, IEEE Trans. Neural Netw. Learn. Syst. 25 (6) (2014) 1106–1117.
[33] A. Heydari, S.N. Balakrishnan, Optimal switching between autonomous subsystems, J. Frankl. Inst. 351 (2014) 2675–2690.
[34] D.E. Kirk, Optimal Control Theory: An Introduction, Dover Publications, 2004, pp. 53–58.
[35] A. Heydari, S.N. Balakrishnan, Fixed-final-time optimal control of nonlinear systems with terminal constraints, Neural Netw. 48 (2013) 61–71.
[36] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989) 359–366.
[37] J.G. Attali, G. Pages, Approximations of functions by a multilayer perceptron: a new approach, Neural Netw. 10 (6) (1997) 1069–1081.
[38] M. Stone, P. Goldbart, Mathematics for Physics: A Guided Tour for Graduate Students, Cambridge University Press, Cambridge, England, 2009, p. 70.
[39] H. Khalil, Nonlinear Systems, 3rd ed., Prentice Hall, New Jersey, 2002.
[40] A. Heydari, S.N. Balakrishnan, Global optimality of approximate dynamic programming and its use in non-convex function minimization, Appl. Soft Comput. 24 (2014) 291–303.
[41] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2009, pp. 4–7.
[42] J.E. Marsden, T. Ratiu, R. Abraham, Manifolds, Tensor Analysis, and Applications, 3rd ed., Springer-Verlag Publishing Co., New York, 2001, p. 73.
[43] W.F. Trench, Introduction to Real Analysis, available online at ⟨http://ramanujan.math.trinity.edu/wtrench/texts/TRENCH_REAL_ANALYSIS.PDF⟩, 2012, pp. 309, 313.
[44] Source codes, available online at ⟨http://webpages.sdsmt.edu/%7eaheydari/Research/SourceCodes/Neurocomp-SwitchingSystem-SourceCode.html⟩.
Ali Heydari received his Ph.D. degree in mechanical engineering from the Missouri University of Science and Technology in 2013. He is currently an assistant professor of mechanical engineering at the South Dakota School of Mines and Technology. He was the recipient of the Outstanding M.Sc. Thesis Award from the Iranian Aerospace Society, the Best Student Paper Runner-Up Award from the AIAA Guidance, Navigation and Control Conference, and the Outstanding Graduate Teaching Award from the Academy of Mechanical and Aerospace Engineers at Missouri S&T. His research interests include optimal control, nonlinear control, approximate dynamic programming, and control of hybrid and switching systems.
S.N. Balakrishnan received his Ph.D. degree in Aerospace Engineering from the University of Texas at Austin. He is currently Curators' Professor of Aerospace Engineering in the Department of Mechanical and Aerospace Engineering at the Missouri University of Science and Technology in Rolla, Missouri. His research interests include neural networks, optimal control, estimation, nonlinear control, control of large scale systems, and impulse driven systems. His papers mainly deal with the development of algorithms in control and estimation and applications to aircraft, missiles, spacecraft, launch vehicles, robots, manufacturing, and other interesting systems.