An adaptive optimization scheme with satisfactory transient performance

Automatica 45 (2009) 716–723

Brief paper

An adaptive optimization scheme with satisfactory transient performance✩

Elias B. Kosmatopoulos∗

Dynamic Systems and Simulation Laboratory, Department of Production & Management Engineering, Technical University of Crete, Chania 73100, Greece

Article info

Article history: Received 14 August 2007. Received in revised form 30 May 2008. Accepted 18 September 2008. Available online 23 December 2008.

Keywords: Adaptive optimization; Stochastic approximation; Simultaneous perturbation stochastic approximation (SPSA); Kiefer–Wolfowitz procedure; Adaptive fine-tuning

Abstract: Adaptive optimization (AO) schemes based on stochastic approximation principles, such as the Random Directions Kiefer–Wolfowitz (RDKW), the Simultaneous Perturbation Stochastic Approximation (SPSA) and the Adaptive Fine-Tuning (AFT) algorithms, possess the serious disadvantage of not guaranteeing satisfactory transient behavior, owing to their requirement for random or random-like perturbations of the parameter vector. Such perturbations may lead to particularly large values of the objective function, which may in turn cause severely poor performance or stability problems when these methods are applied to closed-loop controller optimization. In this paper, we introduce and analyze a new algorithm that alleviates this problem. Mathematical analysis establishes satisfactory transient performance and convergence of the proposed scheme under a general set of assumptions. Application of the proposed scheme to the adaptive optimization of a large-scale, complex control system demonstrates its efficiency. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Many controller fine-tuning as well as adaptive and learning control problems can be formulated as optimization problems where the objective function to be optimized cannot be computed directly (i.e. its analytical form is not known) but only estimated via observations. More formally, a large class of these problems can be shown – see e.g. Section 4 of this paper – to be equivalent to the problem of constructing an algorithm for updating the vector of controller parameters θk ∈ Rnθ so that an appropriately defined objective function

Jk ≡ J(θk, xk)   (1)

converges as close as possible to one of its local minima. Here, J denotes the unknown nonlinear objective function, which is assumed to be C^m, m ≥ 2, bounded from below and unbounded for unbounded θk; Jk, θk denote the objective function measurement and the value of the control parameter vector at the kth algorithm iteration, respectively; xk ∈ Rnx is a vector sequence of exogenous signals; and nθ, nx denote the dimensions of the vectors θk, xk, respectively. In a typical case, J corresponds to an appropriately defined performance index, while the entries of

I This paper was presented at Proc. 17th IFAC World Congress, pp. 5071–5076, Seoul, Korea, July 6–11, 2008. This paper was recommended for publication in revised form by Associate Editor Derong Liu under the direction of Editor Miroslav Krstic. ∗ Tel.: +30 28210 37306; fax: +30 28210 37584. E-mail address: [email protected].

0005-1098/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.automatica.2008.09.014

the vector xk may correspond to signals that are not available for measurement (e.g. sensor noise, un-measurable disturbances, etc.) as well as signals that are available for measurement (e.g. system states, measurable disturbances, reference signals, demand, etc.). Stochastic Approximation (SA) methods such as the Random Directions Kiefer–Wolfowitz (RDKW) (Kushner & Clark, 1978), the Simultaneous Perturbation Stochastic Approximation (SPSA) (Spall, 1992, 2000) and the Adaptive Fine-Tuning (Kosmatopoulos, Papageorgiou, Vakouli, & Kouvelas, 2007) algorithms are directly applicable to the above problem since – contrary to conventional optimization methods – they can efficiently deal with the optimization of functions that cannot be computed directly but only estimated via observations. A common element in the aforementioned SA algorithms is the need for evaluating the objective function at random or pseudo-random perturbations θ + ∆θ around the current vector θ. For instance, 2-sided RDKW and SPSA assume an iterative scheme of the form

θk = θk−3 − αk (Jk−2 − Jk−1)/(2ck) rk

with θk−2 = θk−3 + ck dk and θk−1 = θk−3 − ck dk being the perturbation points, αk, ck being slowly decaying to zero step-sizes, and rk, dk being random vectors. Unfortunately, the objective function values Jk−2, Jk−1 at the perturbation points may be particularly large. In other words, the aforementioned SA algorithms may introduce large ‘‘spikes’’ in the objective function, which may cause severely poor performance or stability problems when these methods are applied to closed-loop controller optimization applications (see e.g. Kosmatopoulos et al. (2007)). In this paper, we propose and analyze a new algorithm that overcomes the above problem. More precisely, as we establish by rigorous arguments, at each iteration of the proposed algorithm Jk is strictly decreasing outside a subset centered at a
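As an illustration of why such perturbation-based schemes can ‘‘spike’’, the two-sided recursion above can be sketched as follows. The quadratic objective, gains and seed below are hypothetical choices for illustration, not taken from the paper, and rk is set equal to dk (valid for ±1 Bernoulli directions):

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    # Hypothetical stand-in for the unknown objective: only noisy
    # measurements of J are available to the algorithm.
    return float(np.sum(theta ** 2)) + rng.normal(scale=0.01)

n_theta = 5
theta = np.ones(n_theta)
for k in range(1, 201):
    a_k = 0.1 / k              # slowly decaying step-size alpha_k
    c_k = 0.1 / k ** (1 / 3)   # slowly decaying perturbation size c_k
    d_k = rng.choice([-1.0, 1.0], size=n_theta)  # random direction d_k
    # Both perturbation points theta +/- c_k d_k must be evaluated on the
    # real system; these evaluations are the source of the transient spikes.
    J_plus, J_minus = J(theta + c_k * d_k), J(theta - c_k * d_k)
    theta = theta - a_k * (J_plus - J_minus) / (2 * c_k) * d_k
```

On this toy problem the recursion drives |θk| toward zero, but every iteration still pays the cost of evaluating J at the two randomly perturbed points.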


local minimum of J; the magnitude of this subset depends on the magnitude of the exogenous disturbances, the estimation accuracy xk − x̄k (where x̄k denotes an estimation/prediction of xk) and the approximation accuracy of the estimator used by the proposed algorithm; moreover, we establish that the proposed algorithm guarantees convergence under a general set of assumptions. The basic difference of the proposed algorithm as compared to RDKW, SPSA and AFT is the use of a function approximator/estimator for the approximation of the unknown objective function and for the estimation of the effect of different candidate updates of the parameter vector θk. Application of the proposed methodology to the adaptive optimization of a large-scale, complex, highly nonlinear control system demonstrates the validity of our theoretical results. It is worth noting that – due to the highly nonlinear and complex nature of the system dynamics as well as the large size of the controller parameter vector – standard SA algorithms such as the SPSA algorithm failed when applied to this particular problem for literally any choice of their design parameters (Kosmatopoulos et al., 2007), while the AFT algorithm of Kosmatopoulos et al. (2007) exhibited quite ‘‘spiky’’ performance.

1.1. Notation

A function f is said to be C^m, where m is a positive integer, if it is uniformly continuous and its first m derivatives are uniformly continuous. The notation vec(A, B, . . .), where A, B, . . . are scalars, vectors or matrices, is used to denote a vector whose elements are the entries of A, B, . . . (taken columnwise). Z+, R+ denote the sets of nonnegative integers and nonnegative real numbers, respectively. For a vector x ∈ Rn, |x| denotes the Euclidean norm of x (i.e. |x| = √(xτ x)), while for a matrix A ∈ Rn×n, |A| denotes the induced matrix norm of A. ∇J(θ, x) denotes the gradient of J(θ, x) w.r.t. θ. Also, if xk, k ∈ Z+ is a vector sequence, then x{ℓ,...,k} for ℓ < k is used to denote the vectors xℓ, xℓ+1, . . . , xk. The notation O(·) denotes the standard ‘‘order of’’ notation. Finally, we will say, see Freeman and Kokotovic (1996), that a function χ : R+ → R+ is of class K∞ (symbolically, χ ∈ K∞) when χ is continuous, strictly increasing, χ(0) = 0 and χ(r) → ∞ as r → ∞.

2. The proposed algorithm

We start this section by presenting a formal definition of the two basic criteria that will be used in this paper for the evaluation of Adaptive Optimization (AO) algorithms:

Definition 1. Consider the unknown function J defined in (1) and consider an iterative algorithm which, at the kth iteration, calculates the current vector θk as a function of (a) the past values of the parameter vectors θ0, θ1, . . . , θk−1, (b) the past values of the objective function measurements J0, J1, . . . , Jk−1, (c) some estimates x̄0, x̄1, . . . , x̄k−1 of the past values of the exogenous vectors x0, x1, . . . , xk−1, respectively, and (d) a prediction x̄k of the current exogenous vector xk. We will say that the aforementioned algorithm is a (σ, τk)-Adaptive Optimization algorithm, where σ is a nonnegative scalar and τk is a nonnegative scalar sequence, if

(1) lim supk→∞ |∇J(θk, xk)| ≤ σ if xk is a bounded deterministic vector sequence, or lim supk→∞ E[|∇J(θk, xk)| | Gk−1] ≤ σ if xk is a stochastic vector sequence, where Gk−1 is an appropriately defined σ-field (see e.g. Theorem 2 in Section 3), and
(2) Jk < Jk−1 + τk, ∀k ∈ Z+.
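Criterion (2) of Definition 1 can be checked mechanically on a recorded run of any AO algorithm. The sketch below (function and data names are ours, not the paper's) flags the kind of transient ‘‘spike’’ the criterion rules out:

```python
def satisfies_transient_criterion(J_history, tau):
    """Check criterion (2) of Definition 1: J_k < J_{k-1} + tau_k, k >= 1.

    J_history : measured objective values J_0, J_1, ...
    tau       : nonnegative slack sequence tau_k (same length as J_history).
    """
    return all(J_history[k] < J_history[k - 1] + tau[k]
               for k in range(1, len(J_history)))

# A run with a single large spike violates the criterion for small tau_k:
spiky = [5.0, 4.0, 9.0, 3.0]          # J_2 jumps upward by 5
assert not satisfies_transient_criterion(spiky, tau=[1.0] * 4)
# A monotonically improving run satisfies it:
assert satisfies_transient_criterion([5.0, 4.5, 4.1, 4.0], tau=[1.0] * 4)
```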


The criterion (1) in the above definition is used to evaluate the steady-state characteristics of AO algorithms (how close to a local minimum of J the algorithm converges), while criterion (2) is used to evaluate the worst-case transient performance characteristics of the AO algorithm. Naturally, one wishes to make σ and τk equal to zero; however, since J is an unknown function and, moreover, in most applications the exogenous terms xk are not exactly known (or, even worse, in some applications these terms are totally unknown), it is impossible, in general, to make σ and τk equal to zero; the best an AO algorithm can do in most applications is to guarantee that the magnitudes of σ and τk are upper bounded by bounds that depend on the exogenous signals xk and the estimation/prediction accuracy xk − x̄k.

Remark 1. Please note that criteria (1) and (2) of Definition 1 are used to evaluate the performance of an AO algorithm and do not impose any assumption on the unknown objective function J(·). Two further remarks are in order before we proceed to the presentation of the proposed algorithm.

Remark 2 (Availability of x̄k). In Definition 1, we have assumed that some estimates/predictions x̄0, . . . , x̄k of the exogenous vectors x0, . . . , xk are available. If such estimates/predictions are not available, as happens in many practical applications, all the results of this paper remain valid by setting x̄k = 0, ∀k and cx̄ = cx, where cx̄, cx are defined in Theorem 1 (Section 3).

Remark 3 (τk for Standard SA Algorithms). In order to evaluate the performance of standard SA algorithms under criterion (2) of Definition 1, consider a case where |∇J(θk−1, xk)| = Bk; then, it can be seen that for standard SA algorithms such as the SPSA or the RDKW algorithms, the term τk is given by

τk = αk Bk + b1 αk² + b2

with b1, b2 being two positive terms that depend on xk and xk − xk−1. In other words, in the case where θk−1 is quite far from a local minimum of J, standard SA algorithms cannot avoid large ‘‘spikes’’ of Jk. We are now ready to proceed to the presentation of the proposed algorithm. As we have already noted in Section 1, the proposed approach makes use of a function approximator for the estimation of the unknown objective function J. More precisely, the proposed approach uses a linear-in-the-parameters function approximator in order to generate – at each algorithm iteration – an estimate of the unknown function J as follows:

Ĵk(θ, x̄) = ϑkτ φ(θ, x̄).   (2)

Here Jˆk (θ , x¯ ) denotes the approximation of J (θ , x) generated at the kth algorithm iteration, x¯ denotes an estimate of the actual exogenous vector x, φ : Rnθ × Rnx 7→ RL denotes a nonlinear vector of regressor terms (which is assumed to be a smooth function of its arguments), ϑk ∈ RL denotes the vector of parameter estimates calculated at the kth algorithm iteration (e.g. in the case where the estimator (2) is a neural network, ϑk denotes the vector of neural network weights) and L is a positive user-defined integer denoting the ‘‘size’’ of the approximator (2). The parameter estimation vector ϑk is calculated according to

ϑk = arg minϑ (1/2) ∑_{ℓ=ℓk}^{k−1} (Jℓ − ϑτ φ(θℓ, x̄ℓ))²   (3)

where ℓk = max{0, k − L − Th}, with Th being a user-defined nonnegative integer. Standard least-squares optimization algorithms can be used for the solution of (3).
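A minimal sketch of the windowed least-squares fit (3), assuming the regressor values φ(θℓ, x̄ℓ) have been stacked row-wise; the names and data below are hypothetical, and the window start is taken as max{0, k − L − Th} so the index is valid:

```python
import numpy as np

def estimate_parameters(J_meas, Phi, k, L, Th):
    """Sketch of the least-squares fit (3): fit vartheta_k over the window
    {l_k, ..., k-1} with l_k = max(0, k - L - Th).

    J_meas : array of objective measurements J_0, ..., J_{k-1}
    Phi    : array whose row l is the regressor phi(theta_l, xbar_l) in R^L
    """
    l_k = max(0, k - L - Th)
    A = Phi[l_k:k]           # regressors over the time window
    b = J_meas[l_k:k]        # matching objective measurements
    # Minimize (1/2) * sum (J_l - vartheta^T phi_l)^2 over vartheta.
    vartheta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return vartheta

# Hypothetical data: J is exactly linear in the regressors, so the fit is exact.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(40, 6))
true = np.arange(1.0, 7.0)
J_meas = Phi @ true
vartheta_k = estimate_parameters(J_meas, Phi, k=40, L=6, Th=20)
assert np.allclose(vartheta_k, true)
```

In practice the measurements are noisy and the model is only approximate, so the residual of this fit plays the role of the approximation error νk discussed next.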


To better understand the usage of the estimator (2) within the proposed algorithm, assume for the time being that the exogenous sequence xk is exactly known (i.e. xk ≡ x̄k, ∀k); then, if the regressor vector φ is chosen to belong to a family of Universal Approximators (e.g. polynomials, neural networks, etc.), it can be seen using standard arguments (see e.g. Huang, Chen, and Siew (2006), Huang and Chen (2007), Kosmatopoulos, Polycarpou, Christodoulou, and Ioannou (1995), and Maiorov and Meir (1998) and the references therein) that the choice of ϑk according to (3) guarantees that, for any bounded θ ∈ Rnθ,

J(θ, xk) = Ĵk(θ, x̄k) + νk(θ, x̄k)   (4)

where νk is a term that can be made arbitrarily small (for k large enough) provided that

• the ‘‘size’’ L of the regressor vector φ is large enough,
• the vector sequence φk ≡ φ(θk, x̄k) satisfies a Persistence of Excitation (PE) condition, see e.g. Ioannou and Sun (1995).

Long-standing results in the theory of adaptive estimation and system identification (see e.g. Ioannou and Sun (1995) and the references therein) have established that PE is sufficient as well as necessary for the convergence of the parameter estimates ϑk to their optimal values; if a PE condition does not hold, then the term νk in (4) may be particularly large no matter what the choice for L is. We close this parenthesis by noting that, since in the general case the exogenous vector xk is not exactly known, Eq. (4) should be replaced by (here χ1 ∈ K∞)

J(θ, xk) = Ĵk(θ, x̄k) + νk(θ, x̄k) + χ1(|xk − x̄k|).   (5)

The proposed algorithm attempts to construct a sequence of parameter vectors θk which guarantees, on the one hand, that the vector sequence φk satisfies a PE condition and, on the other hand, that at each algorithm iteration the new parameter vector θk leads to a non-negligible decrease of the estimate Ĵk (w.r.t. Ĵk−1); if φk satisfies a PE condition, then Ĵk will be an effective approximation of Jk – see Eq. (5) – and thus the choice of θk that leads to a non-negligible decrease of the estimate Ĵk is most likely to lead to a non-negligible decrease of Jk, too.

We are now ready to describe the procedure used by the proposed algorithm for choosing θk: let αk be a user-defined positive scalar sequence and let ∆θk(j) ∈ {−αk, +αk}nθ, j ∈ {1, . . . , K}, denote a collection of K ≥ nθ vectors of candidate perturbations satisfying, ∀j ∈ {1, . . . , K},

rank[φmax{0,k−L+1}, . . . , φk−1, φ(θk−1 ± ∆θk(j), x̄k)] = min{k − 1, L}   (6)

and

|[∆θk(1), . . . , ∆θk(K)]⁻¹| ≤ Ξ/αk   (7)

where Ξ is a finite positive constant independent of αk. Let also (here ei is defined as eii = 1 and eij = 0, j ≠ i)

∇̂Jk = vec( (Ĵk(θk−1 + ck ei, x̄k) − Ĵk(θk−1, x̄k))/ck )   (8)

where ck is a user-defined positive scalar sequence. Then ∆θk ≡ θk − θk−1 is chosen according to

∆θk = arg min±∆θk(j), j∈{1,...,K} (±∆θk(j))τ ∇̂Jk.   (9)

The key idea of the proposed algorithm is to use the approximation Ĵk to estimate the effect of the candidate perturbations ±∆θk(j) on the objective function and to choose the perturbation that leads to the maximum – estimated – decrease of the objective function. This is realized in the proposed algorithm through the decision mechanism (9), which chooses one among the candidate perturbations ±∆θk(j); the motivation behind using (9) is that, if Ĵk as calculated in (2) and (3) is an accurate estimate of J (i.e. Ĵk ≈ J), then ∇̂Jk ≈ ∇J, in which case it is straightforward to see that ∆θk generated according to (9) leads to a non-negligible decrease of the objective function. Condition (6) is a standard PE condition, while condition (7) is imposed to make sure that there exists at least one candidate perturbation ±∆θk(j) that leads to a non-negligible decrease of J. Note that condition (6) renders the problem of finding ∆θk(j) a computationally hard problem. Fortunately, as we will see in Proposition 1, under appropriate selection of the regressor vector φ and of the random generator producing ∆θk(j), there is no practical need to check the computationally ‘‘heavy’’ condition (6).

3. Convergence properties of the proposed algorithm

Our first main result establishes satisfactory transient performance and convergence of the proposed algorithm in the case of (uniformly bounded) deterministic exogenous signals.

Theorem 1. Let

ϑ∗ = arg minϑ sup{x:|x|≤cx, θ:|θ|≤cθ} |J(θ, x) − ϑτ φ(θ, x)|,
ν(θ, x) = J(θ, x) − ϑ∗τ φ(θ, x),
ν = sup{x:|x|≤cx, θ:|θ|≤cθ} |J(θ, x) − ϑ∗τ φ(θ, x)|

and suppose that there exist finite nonnegative constants cx, cx̄, c∆x, cθ such that, for each k ∈ Z+, the following hold:

(A1) |xk| < cx, |xk − x̄k| < cx̄, |xk − xk−1| < c∆x.
(A2) The proposed algorithm admits a solution satisfying (6) and (7).
(A3) The proposed algorithm guarantees that |θk| ≤ cθ.
(A4) The user-defined sequences αk, ck satisfy αk ≥ α > 0, c̄ > ck ≥ c > 0.

Then, the proposed algorithm is a (σ̄, τ̄k)-Adaptive Optimization algorithm with

σ̄ = (1/α) max{δ1, δ2},   τ̄k = 0 if |∇J(θk−1, xk)| > εk and τ̄k = δ3,k otherwise,

where δ1 = c¯ O(1) + χ2 (ν) + χ3 (cx¯ ), δ2 = α 2 O(1) + χ4 (c∆x ), εk = 1 max{ck O(1)+χ2 (ν)+χ5 (ηk )+χ3 (|x{`k ,...,k} − x¯ {`k ,...,k} |), αk2 O(1) αk + χ4 (|xk − xk−1 |)}, δ3,k = αk εk + χ4 (|xk − xk−1 |) + χ3 (|x{`k ,...,k} − x¯ {`k ,...,k} |)αk2 , χi , i = 2, . . . , 5 ∈ K∞ and ηk is a bounded term satisfying ηk = 0, ∀k ≥ L. Proof. Let Πk ≡ ∇ J (θk−1 , xk ) and note that J (θ , x) = (ϑ ∗ )τ φ(θ , x¯ ) + ν¯ (θ , x, x¯ ), where ν¯ (θ , x, x¯ ) = (ϑ ∗ )τ [φ(θ , x) −φ(θ , x¯ )] + ν(θ , x). Using condition (6) it can be seen that the solution Jˆk calculated according to (2) and (3) satisfies J (θ , x) = Jˆk (θ , x¯ ) + µk (θ , x, x¯ )

(10)

where µk (θ , x, x¯ ) = ν¯¯ k φ(θ , x¯ ) + ν¯ (θ , x, x¯ ) + ηkτ φ(θ , x¯ ) with  ν¯¯ k = O ν¯ (θ{`k ,...,k} , x{`k ,...,k} , x¯ {`k ,...,k} ) and ηk = 0, ∀k ≥ L; using (10) we directly obtain that ∂θ∂ J (θ , x) − c1 Jˆk (θ + i k  R1 ck ei , x¯ ) + c1 Jˆk (θ , x¯ ) = 0 ∂θ∂ J (θ , x) − ∂θ∂ J (θ + sck ei , x) ds + k i i

E.B. Kosmatopoulos / Automatica 45 (2009) 716–723

(µk (θ + ck ei , x, x¯ ) − µk (θ , x, x¯ )) ≡ k,i (θ , x, x¯ ). Using this equality and (9), we readily obtain that ∆θkτ Πk = 1 ck

(j)

arg min±∆θ (j) [(±∆θk )τ (Πk − k )]τ Πk k



Hk where k

=

vec(k,i (θk−1 , xk , x¯ k )) which implies that there exists a χ¯ 1  ∈ K∞ such that |Πk | > ¯k ≡ χ¯ 1 maxi vec k,i (θk−1 , xk , x¯ k ) ⇒ Hk = min±∆θ (j) k

h

±∆θk(j)



Πk

i

≡ H¯ k ≤ − Ξαnkθ |Πk |, where

the last inequality was obtained by using (7). Therefore, using standard ‘‘Taylor expansion’’ arguments (see e.g. Prop. 1, Bertsekas and Tsitsiklis (2000)) it can be seen that |Πk | > ¯k ⇒ Jk ≤ α Jk−1 + 1θkτ Πk + γ1 (xk−1 )nθ αk2 + γ2 (|xk , xk−1 |) ≤ Jk−1 − Ξ nk |Πk | + θ

γ1 (xk−1 )nθ αk2 + γ2 (|xk − xk−1 |) for some positive function γ1 and γ2 ∈ K∞ , or, equivalently, √ αk |Πk | + (1 − Ik ) ¯k αk nθ Jk ≤ Jk−1 − Ik Ξ nθ + γ1 (xk−1 )nθ α + γ2 (|xk − xk−1 |) 2 k

(11)

where Ik is defined as Ik = 1 if |Πk | > ¯k and Ik = 0, otherwise. Using the above inequality together with (A1), the fact that J is atleast C 2 and the definition of ¯k , µk we can readily establish the proof after some lengthy but quite straightforward calculations. M
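One iteration of the decision mechanism analyzed above – the finite-difference gradient estimate (8) of the fitted surrogate, followed by the selection rule (9) – can be sketched as follows. The surrogate, gains and seed are hypothetical, and the PE rank condition (6) is not checked here (Proposition 1 argues this is unnecessary for Bernoulli-like perturbations):

```python
import numpy as np

def propose_update(Jhat, theta_prev, xbar_k, alpha_k, c_k, K, rng):
    """Sketch of one step: build the gradient estimate (8) from the fitted
    surrogate Jhat, then pick, among K random candidate perturbations
    +/- Delta_theta_k^(j) in {-alpha_k, +alpha_k}^n_theta, the one with
    the smallest predicted change of the objective, as in (9)."""
    n = theta_prev.size
    base = Jhat(theta_prev, xbar_k)
    # Eq. (8): one-sided finite differences of the surrogate along each e_i.
    grad_hat = np.array(
        [(Jhat(theta_prev + c_k * np.eye(n)[i], xbar_k) - base) / c_k
         for i in range(n)])
    # Bernoulli-like candidates; both signs are scored, as in (9).
    candidates = alpha_k * rng.choice([-1.0, 1.0], size=(K, n))
    signed = np.vstack([candidates, -candidates])
    best = signed[np.argmin(signed @ grad_hat)]
    return theta_prev + best

rng = np.random.default_rng(2)
# Hypothetical fitted surrogate: Jhat(theta, xbar) = |theta|^2, xbar ignored.
Jhat = lambda theta, xbar: float(np.sum(theta ** 2))
theta_new = propose_update(Jhat, np.ones(4), None, alpha_k=0.05, c_k=1e-3,
                           K=32, rng=rng)
```

Because the candidates are scored only on the surrogate, no extra evaluations of the true objective are spent on perturbations, which is what suppresses the transient spikes of RDKW/SPSA.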


Roughly speaking, Theorem 1 states that, in the case of deterministic exogenous disturbances, the terms σ̄ and τ̄k are, in the worst case, ‘‘proportional’’ to

(a) the approximation accuracy ν of the function approximator (2), which can be made arbitrarily small (see Proposition 1),
(b) the estimation accuracy xk − x̄k, and
(c) the exogenous signals' ‘‘velocity’’ xk − xk−1.

Additionally, the term τ̄k is also affected by an extra term (the term ηk) which becomes negligible for k ≥ L; the presence of ηk is unavoidable since, initially, the function J is totally unknown and it takes some iterations for the proposed algorithm to produce an effective estimate of this function. Next we present some comments regarding assumptions (A1)–(A4).

Remark 4 (Assumptions (A1)–(A4)). Assumption (A1) requires that the exogenous signal xk be uniformly bounded; note that the proposed algorithm does not require knowledge of the bounds cx, cx̄, c∆x. Assumption (A2) is quite difficult to verify for a general choice of the regressor vector φk; however, as we establish in Proposition 1, if φk is chosen to be either a polynomial or a neural network of a specific structure, then assumption (A2) is trivially satisfied if the candidate perturbations ∆θk(j) are Bernoulli-like random terms. Assumption (A3) is imposed in order to avoid lengthy technicalities in the proofs of our main results. It is not difficult to see that all of the results of this paper remain valid if we remove assumption (A3) and use a projection mechanism as in Kushner and Clark (1978) for keeping θk bounded; similarly to Kushner and Clark (1978), it can be seen that the introduction of such mechanisms does not destroy the performance and convergence properties of the proposed algorithm. Finally, assumption (A4) is a standard SA assumption on updating schemes with fixed step-sizes (see e.g. Borkar and Meyn (2000)).

The next result establishes the properties of the proposed algorithm if we remove assumption (A1) on the boundedness of the exogenous signal xk and assume instead that xk and xk − x̄k are random vectors (not necessarily uniformly bounded) that are zero-mean with finite variance (w.r.t. an appropriately defined σ-field). Note that in this case the sequences αk, ck should be chosen to be slowly decaying to zero.

Theorem 2. Suppose that (A2)–(A3) hold and additionally that the following assumptions hold:

(A1′) E[xk − x̄k | Gk−1] = 0, E[xk | Gk−1] = 0, E[|xk|² | Gk−1] < ∞, where Gk denotes the σ-field generated by {x0, . . . , xk, x̄0, . . . , x̄k, ∆θ0(j), . . . , ∆θk(j)}.
(A4′) The user-defined sequences αk, ck satisfy limk→∞ αk = 0, ∑_{k=0}^{∞} αk = ∞, ∑_{k=0}^{∞} αk² < ∞, ∑_{k=0}^{∞} αk ck < ∞, limk→∞ αk/ck = 0.

Suppose moreover that Th is chosen according to Th = k − L̄, where L̄ is any positive integer.¹ Finally, let ϑ̃k∗, ν̃k be defined according to

ϑ̃k∗ = arg minϑ E[|J(θk, xk) − ϑτ φ(θk, xk)|² | Gk−1],
ν̃k(θ, x) = supℓ∈{ℓk,...,k} |J(θℓ, xℓ) − (ϑ̃k∗)τ φ(θℓ, xℓ)|.

Then, the proposed algorithm is a (σ̃, τ̃k)-Adaptive Optimization algorithm with

σ̃ = E[χ6(ν̃k) | Gk−1],   χ6 ∈ K∞   (12)

and τ̃k defined as τ̄k in Theorem 1 by replacing αk by some positive constant α ∈ (0, α0) and ν by ν̃k.

Proof. The analysis of the proof of Theorem 1 leading to inequality (11) is valid here as well (by replacing ϑ∗, ν in the proof of Theorem 1 by ϑ̃k∗, ν̃k defined in Theorem 2, respectively). Taking conditional expectations on (11) and using (A1′) and some lengthy, but quite straightforward, calculations, we obtain (here, for notational convenience, we use the notation Ek−1[·] ≡ E[·|Gk−1])

Ek−1[Jk] ≤ Jk−1 − Ek−1[Ik (αk/(Ξ nθ)) |Πk|] + γ̄1 αk² + Ek−1[(1 − Ik) ε̃k αk]
≤ Jk−1 − Xk + γ̄1 αk² + Ek−1[(1 − Ik)] αk ck O(1) + Ek−1[(1 − Ik) αk (ψk(x{ℓk,...,k}) − ψk(x̄{ℓk,...,k}))]   (13)

where (note that the term γ̄1 is bounded since, from (A1′), Ek−1[|xk|²] < ∞) γ̄1 = nθ Ek−1[γ1(xk)], the term ε̃k is defined similarly to the term ε̄k in the proof of Theorem 1, χ̄2 ∈ K∞, ψk(·) is an appropriately defined function (that depends on φ(·, ·)) and Xk = Yk if Yk ≥ 0 and Xk = 0 otherwise, with Yk = Ek−1[Ik (αk/(Ξ nθ)) |Πk| − (1 − Ik) αk χ̄2(ν̃k)]. Assumption (A4′) guarantees that ∑_{k=0}^{∞} γ̄1 αk² and ∑_{k=1}^{∞} Ek−1[(1 − Ik)] αk ck O(1) converge; moreover, using Ek−1[xk − x̄k] = 0 (see assumption (A1′)), we have that Ek−1[(1 − Ik) αk (ψk(x{ℓk,...,k}) − ψk(x̄{ℓk,...,k}))] = 0. Therefore, application of the Robbins and Siegmund theorem (Robbins & Siegmund, 1971) on nonnegative almost-supermartingales to inequality (13) establishes that ∑_{k=0}^{∞} Xk converges. Standard arguments can now be applied – see e.g. the proof of part (b) of Proposition 3.1 of Dippon and Renz (1997) – to show that the convergence of ∑k Xk together with the facts that ∑_{k=0}^{∞} αk = ∞ and ∇J is at least C¹ imply limk→∞ (1/αk) Xk = 0, or, equivalently, limk→∞ (1/αk) Yk ≤ 0; the last inequality implies that limk→∞ Ek−1[(1 − Ik)] < 1 (unless ν̃k → 0) and thus limk→∞ Ek−1[Ik] > 0, which, in turn, implies that, for k → ∞, Ek−1[(1/(Ξ nθ)) |Πk|] ≤ Ek−1[χ̄2(ν̃k)], which establishes (12). The rest of the proof is similar to that of Theorem 1. □
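Assumption (A4′) can be met by polynomially decaying gains. The spot-check below uses the hypothetical choice αk = k⁻¹, ck = k⁻¹ᐟ³ (our choice for illustration, not prescribed by the paper) and verifies the required summability and ratio conditions numerically over a long horizon:

```python
import math

# Hypothetical gain schedules (not prescribed in the paper):
alpha = lambda k: 1.0 / k             # alpha_k -> 0, sum alpha_k = infinity
c     = lambda k: 1.0 / k ** (1 / 3)  # c_k -> 0 more slowly than alpha_k

K = 100_000
# sum alpha_k^2 and sum alpha_k c_k are partial sums of convergent p-series
# (exponents 2 and 4/3), while sum alpha_k diverges like log K.
assert sum(alpha(k) ** 2 for k in range(1, K)) < math.pi ** 2 / 6
assert sum(alpha(k) * c(k) for k in range(1, K)) < 4.0    # zeta(4/3) ~ 3.6
assert sum(alpha(k) for k in range(1, K)) > 10.0          # ~ log(1e5)
assert alpha(K) / c(K) < 1e-3                             # alpha_k/c_k -> 0
```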

¹ Due to the fact that αk → 0, it is necessary to impose the condition Th = k − L̄; if this condition does not hold, then (A2) will not hold either, since (6) will not admit a bounded solution for k → ∞.


Theorem 2 establishes that, in the case where xk is a stochastic vector sequence satisfying (A1′), Jk preserves transient performance properties similar to those in the case of deterministic sequences; moreover, Theorem 2 establishes that the parameter vector sequence θk converges (with probability 1) arbitrarily close to a local minimum of J (under appropriate selection of the approximator (2) – see Proposition 1).

Remark 5 (Assumptions (A1′), (A4′)). Assumption (A1′) is also a quite standard assumption in SA (see e.g. Bertsekas and Tsitsiklis (2000)). Assumption (A4′) is a standard assumption on SA algorithms with vanishing gains.

In the next proposition we show that if the estimator (2) is chosen to be either an Incremental Extreme Learning Machine (I-ELM) (Huang et al., 2006; Huang & Chen, 2007) or a Polynomial-Like Universal Approximator (PLUA) (Kosmatopoulos et al., 1995) and, moreover, the candidate perturbations ∆θk(j) are Bernoulli-like random terms, then there is no need to check the computationally heavy condition (6); moreover, Proposition 1 establishes that the terms ν, E[χ6(ν̃k) | Gk−1] defined in Theorems 1 and 2, respectively, can be made arbitrarily small by increasing the ‘‘size’’ L of the estimator (2).

Proposition 1. Suppose that the regressor vector φ is selected according to one of the following:

(I-ELM) φi(θ, x) = S(Aiτ vec(θ, x) + bi), i ∈ {1, . . . , L}, where S(·) is an invertible smooth nonlinear function and the vectors Ai and the real parameters bi are randomly generated (with Ai, bi being zero-mean), or
(PLUA) φi(θ, x) = S(θ1)^{dθi,1} · · · S(θnθ)^{dθi,nθ} S̄(x1)^{dxi,1} · · · S̄(xnx)^{dxi,nx}, i ∈ {1, . . . , L}, where S is any smooth monotone function and dθi,j, dxi,j are nonnegative integers such that S̄(x̄k,1)^{dxi,1} · · · S̄(x̄k,nx)^{dxi,nx} ≠ 0, ∀k, i, and moreover the integers dθi,j are such that ∃j ∈ {1, . . . , nθ} : dθi,j > 0, ∀i ∈ {1, . . . , L}.

Moreover, assume that the ∆θk(j) are random zero-mean vectors in {−αk, +αk}nθ satisfying (7).² Then, condition (6) is satisfied with probability 1. Moreover, the terms ν(·) and E[χ6(ν̃k) | Gk−1] defined in Theorems 1 and 2, respectively, satisfy, for χ7, χ8 ∈ K∞,

ν = χ7(1/L),   E[χ6(ν̃k) | Gk−1] = χ8(1/L).   (14)

Proof. We provide a sketch of the proof only for the I-ELM case; the proof for the PLUA case is similar. Since S is invertible, if (6) does not hold, then there exists a nonzero vector b such that bτ (A vec(∆θℓ, x̄ℓ) + b) = 0, ℓ ∈ {k − Lk + 1, . . . , k − 1}, and bτ (A vec(±∆θk(j), x̄k) + b) = 0, where A denotes the matrix whose rows are the vectors Aiτ and b = vec(bi). Since Ai, bi and ∆θk(j) are randomly chosen, it is quite straightforward to show that the probability that a nonzero vector b satisfies the above system of equations is zero. The proof of the first equation in (14) is based on standard approximation results over compact spaces (see e.g. Huang et al. (2006) for the case of I-ELM and Kosmatopoulos et al. (1995) for the case of PLUA), while the proof of the second equation in (14) can be obtained using the approximation results over unbounded spaces developed in Maiorov and Meir (1998). □

We close this section with some remarks regarding the choice of the proposed algorithm's design parameters:

² A choice ∆θk(j) = αk ∆k(j), where the ∆k(j) are Bernoulli-like random vectors, satisfies (7); see also Bhatnagar, Fu, Marcus, and Wang (2003) for the construction of zero-mean random or random-like sequences that satisfy condition (7).

• Contrary to other applications of function approximators, where the ‘‘size’’ L of an approximator of the form (2) should be significantly large to guarantee that it can approximate nonlinear functions over the whole input space, this is not the case here: for the proposed algorithm it is sufficient that the approximator has enough regressor terms to come up with an approximation of the unknown function J over a small neighborhood around the most recent vector θk.
• Having the above in mind, a relatively small (as compared to other applications of function approximators) number L of regressor terms should suffice for efficient algorithm performance; similarly, since the approximation required is over a small neighborhood of the current value of θk, a small time-window (determined by the parameter Th in (3)) should be chosen. As a matter of fact, in all practical applications of algorithms using function approximators for optimization purposes (see Kosmatopoulos et al. (2007)), as well as in various applications where we tested the proposed algorithm, a choice of L, Th according to L ≈ (nθ + nx)/2, Th = 50 was found to produce quite satisfactory results. Moreover, in the case where a polynomial approximator is used, we found that it suffices to use a polynomial approximator of maximum order equal to 3, with randomly chosen polynomial terms at each algorithm iteration (i.e. ∑j dθi,j + ∑ℓ dxi,ℓ = 3 with dθi,j, dxi,ℓ randomly chosen).
• Finally, for the choice of the step-sizes αk, ck, rules similar to the ones that apply in standard SA algorithms can be used.

4. Application to adaptive fine-tuning of a large-scale traffic control system

In order to evaluate the performance of the proposed algorithm, we have tested and evaluated it – by means of simulation experiments – on the problem of adaptive fine-tuning (optimization) of the control parameters of a large-scale traffic control system.
In general, the problem of adaptive controller fine-tuning can be formulated as follows: assume a nonlinear system whose dynamics are described by the following nonlinear difference equation

zt+1 = g(zt, ut, dt),   z0 = z(0)   (15)

where zt, ut, dt denote the vectors of system states, control inputs, and exogenous signals, respectively, t denotes the time-index, and g(·) is a – possibly unknown – sufficiently smooth nonlinear vector function. Moreover, assume that the following control law is applied to the system (15):

ut = ϖ(θ, zt)   (16)

where ϖ(·) is a known smooth vector function and θ is the vector of control parameters. The performance of the controller (16) when applied to system (15) is evaluated through the following objective function (performance index):

J(θ; z0, DT) = πT(zT) + ∑_{t=0}^{T−1} πt(zt, ut)   (17)

where the πt are known nonnegative functions, T denotes the time-horizon over which the control law (16) is applied, and DT ≜ [d0, d1, . . . , dT−1] denotes the time-history of the exogenous signals. By defining x = vec(z0, DT), (17) may be rewritten as J(θ, x) = J(θ; z0, DT). The problem at hand is – by evaluating the performance of the controller (16) for different values of the controller vector θ – to ‘‘tune’’ θ so that it converges close to a local minimum of J while keeping, during the fine-tuning experiments, J as small as possible. The fine-tuning process involves many different experiments (of duration T); during each experiment the vector θ remains constant and, after the experiment ends,


the performance of the controller (for the particular choice of θ) is evaluated using the performance index J. Note that the performance index J is a function of the closed-loop system (15) and (16) dynamics; as a result, J depends on the – possibly unknown – function g(·), and thus it is not possible, in general, to obtain an analytic form for J and its gradient. AO methodologies are thus well suited for the solution of the above problem.

We have picked a particularly challenging adaptive controller fine-tuning problem in order to evaluate the performance of the proposed algorithm. More precisely, we have tested the proposed algorithm on the adaptive fine-tuning of the large-scale control system of the urban traffic network of the city of Chania, Greece. Standard SA algorithms such as the SPSA algorithm failed when applied to this particular problem for literally any choice of their design parameters (Kosmatopoulos et al., 2007), due to the highly nonlinear and complex nature of the system dynamics as well as the large size of the controller parameter vector; on the other hand, SA methods involving the use of function approximators, such as the Adaptive Fine-Tuning (AFT) method, although they eventually improved the overall system performance significantly, exhibited ''spiky'' transient performance. It has to be emphasized at this point that such ''spiky'' performance is not acceptable in a real-life application, since it may lead to severe congestion problems, complaints, dangerous driving, etc., which, in turn, may force the traffic operators to cancel the fine-tuning process.

Next we briefly present some details regarding the particular application; it is worth noting that the simulation setting is the same as that of Kosmatopoulos et al. (2007). Due to space limitations, some details of the simulation experiments are not presented here; the interested reader is referred to Kosmatopoulos et al. (2007) for a more detailed description.
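To make the formulation (15)–(17) concrete, a single fine-tuning experiment – one closed-loop simulation with θ held fixed, returning one measurement of J – can be sketched as below. This is a hedged illustration: the function names are ours, and in practice g(·) is the real (possibly unknown) plant rather than a callable.

```python
def run_experiment(theta, z0, D, g, ctrl, stage_cost, terminal_cost):
    """Simulate one fine-tuning experiment of horizon T = len(D) with the
    controller parameter vector theta held fixed, and return the measured
    performance index J(theta; z0, D_T) as in (17)."""
    z, J = z0, 0.0
    for d in D:                       # d_t: exogenous signal at step t
        u = ctrl(theta, z)            # control law (16): u_t = varpi(theta, z_t)
        J += stage_cost(z, u)         # accumulate pi_t(z_t, u_t)
        z = g(z, u, d)                # plant update (15)
    return J + terminal_cost(z)       # add the terminal cost pi_T(z_T)
```

The AO algorithm then wraps this routine: it proposes a new θ after each experiment based on the measured values of J, without ever needing an analytic expression for J or its gradient.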
Traffic Network: The particular traffic control system chosen to be fine-tuned was that of the urban traffic network of the city of Chania, Greece, shown in Fig. 1; it is worth noting that this particular network is a typical urban traffic network containing all possible varieties of complex junction staging.

System Dynamics: The system dynamics for the particular application can be quite effectively approximated by the highly-nonlinear macroscopic traffic model METACOR described in Elloumi, Haj-Salem, and Papageorgiou (1994). The system states zt correspond to the total number of vehicles in each of the traffic network links over the tth cycle (where the cycle is defined as the time required for each junction to complete one sequence of traffic signaling), the entries of the exogenous input vector dt correspond to the incoming flow (in number of vehicles) at each of the network's origins over the tth cycle, and the control input vector ut corresponds to the durations of the green times of each junction's stages over the tth cycle; the cycle is constant and equal to 90 seconds for all junctions. The dimensions of the vectors zt, ut and dt are 71, 42 and 22, respectively.

Control System: The control system assumed in the simulations is the so-called TUC controller (Diakaki, Dinopoulou, Aboudolas, Papageorgiou, Ben-Shabat, Seider, & Leibov, 2003) defined according to ϖ(θ, z) = H(Lθ z), where the nonlinear operator H is used to guarantee that the control decisions satisfy minimum and maximum allowable green time constraints, as well as that the sum of the green times of the stages of each junction adds up to the cycle minus the lost times (intergreens) of the particular junction, and the entries of the matrix Lθ ∈ R^{42×71} correspond to the tuneable controller parameters θ; as was seen in Kosmatopoulos et al. (2007), it suffices to fine-tune the 544 ''most important'' entries out of the 42 × 71 = 2982 total entries of the matrix Lθ, while keeping the remaining entries equal to zero.

Traffic Demand Scenarios: A randomly perturbed periodic sequence (with period equal to 10 iterations) of exogenous vectors

Fig. 1. The traffic network of the city of Chania. The junctions are represented by nodes and the links are represented by arrows. Each network link corresponds to a particular junction stage.


xk (which correspond in the particular application to traffic demand scenarios, i.e. time-histories of incoming flow at the network origins) was used in the simulation experiments. More precisely, at the kth algorithm iteration, xk was calculated according to (please note that in all fine-tuning experiments the initial system state z0 was set equal to 0)

xk = vec(D̄_{mod(k,10),T}) + ξk

where the D̄_{i,T} correspond to 10 different traffic demand scenarios that were designed based on actual traffic data (Kosmatopoulos et al., 2007) and ξk is a zero-mean random vector of appropriate dimensions with variance equal to 0.05E[D̄_{i,T}]. The total duration T of each of the aforementioned scenarios is 14 hours; it is worth noting that each D̄_{i,T} gives rise to highly congested traffic conditions and, moreover, that the variance among the D̄_{i,T} is particularly high (namely, E[D̄_{i,T} − E[D̄_{i,T}]] ≈ 0.5E[D̄_{i,T}]).
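The demand generation just described – a periodic cycling through the stored scenarios plus a zero-mean perturbation – can be sketched as follows. This is an illustrative reading of the construction: the function name is ours, and we model the perturbation as Gaussian with variance 0.05 times the scenario's mean level, one plausible interpretation of the stated variance condition.

```python
import numpy as np

def demand_vector(k, scenarios, rng=None):
    """Exogenous vector x_k for iteration k: cycle through the stored demand
    scenarios with period len(scenarios) and add a zero-mean perturbation,
    mirroring x_k = vec(D_bar_{mod(k,10),T}) + xi_k."""
    rng = np.random.default_rng() if rng is None else rng
    D = np.asarray(scenarios[k % len(scenarios)], float)
    std = np.sqrt(0.05 * np.abs(D).mean())   # variance = 0.05 * mean demand level
    xi = rng.normal(0.0, std, size=D.shape)  # zero-mean random vector xi_k
    return (D + xi).ravel()                  # vec(...) stacking
```

With 10 scenarios, iterations k and k + 10 see the same underlying demand pattern but different random perturbations, which is what makes the objective function J(θ, xk) vary from experiment to experiment even for a fixed θ.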

Table 1
Comparison of AFT and the proposed algorithm (% of improvement w.r.t. the base case): (a) worst-case performance in iterations 50–100, (b) worst-case performance in iterations 100–400 and (c) average improvement in iterations 300–400.

    (c̄0, ᾱ0)     |      (a)       |      (b)       |      (c)
                  |  AFT   Prop.   |  AFT   Prop.   |  AFT   Prop.
    (0.01, 0.01)  | −40.5  −12.9   | −35.4  −12.2   |  9.4   12.7
    (0.01, 0.05)  | −46.7  −37     | −44.6  −18.7   | 21.4   28.8
    (0.01, 0.1)   | −87.5  −20.3   | −76.2  −10     | 28.9   29.2
    (0.1, 0.1)    | −62.2  −32     | −87.8  −23.5   | 27.4   23.6
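The three criteria reported in Table 1 can be computed from a per-iteration series of total travel times as in the following sketch; the function names are ours, `ttt_base` denotes the base-case value under θ0 for the same demand scenario, and the window boundaries follow the table caption.

```python
import numpy as np

def improvement(ttt, ttt_base):
    """Per-iteration improvement (%) of the tuned controller's TTT relative
    to the base case theta_0, evaluated on the same demand scenario;
    negative values mean a deterioration w.r.t. the base case."""
    ttt = np.asarray(ttt, float)
    ttt_base = np.asarray(ttt_base, float)
    return 100.0 * (ttt_base - ttt) / ttt_base

def table1_criteria(ttt, ttt_base):
    """The three criteria of Table 1: (a) worst case over iterations 50-100,
    (b) worst case over 100-400, (c) average over 300-400."""
    imp = improvement(ttt, ttt_base)
    return imp[50:100].min(), imp[100:400].min(), imp[300:400].mean()
```

Note how the worst-case criteria (a) and (b) directly penalize the ''spiky'' transients that the proposed algorithm is designed to avoid, while (c) measures steady-state quality.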

Initial Controller Parameters: Following the design principles of TUC (Diakaki et al., 2003), a linearized model of the closed-loop system (15) and (16) dynamics was used in order to obtain the initial Lθ matrix (or, equivalently, the control parameter vector θ0); as expected, such an initialization for Lθ leads to a performance that is far from being – locally – optimal.

Estimation of Exogenous Signals: The total dimension of the exogenous vector xk is extremely high (equal to number of origins × T/cycle = 12 320); similarly to Kosmatopoulos et al. (2007), we used a low-dimensional estimate x̄k ∈ R^88 whose entries correspond to the average incoming flow at a particular origin over the time intervals [0, T/4), [T/4, T/2), . . . , [3T/4, T). Note that such a choice for x̄k, although it addresses the problem of dealing with an xk of very large dimension, introduces a significant estimation error xk − x̄k. Finally, a moving average of the x̄ℓ, ℓ < k, was used for the calculation of the prediction x̄k for the next experiment; in this way, significant prediction uncertainty was assumed.

In order to compare the performance of the proposed algorithm with an SA algorithm incorporating random perturbations, we also simulated the AFT algorithm of Kosmatopoulos et al. (2007), whose dynamics are described according to θk = θk−1 − αk ∇̂Jk if k is a positive even number and θk = θk−1 + αk ∆k otherwise, where ∇̂Jk is defined in (8) and ∆k is a random vector defined similarly to the vectors ∆θk^{(j)} defined in Section 2. Both algorithms used a third-order PLUA with L = 300 (note that L ≈ (nθ + nx)/2 = 0.5(544 + 88)), and the sequences αk, ck were designed so that assumption (A4′) is satisfied, namely ck = c̄0(k + 1)^{−1/3}, αk = ᾱ0 ln(k + 1)/(k + 1), with c̄0, ᾱ0 being positive design constants. The system performance for θ = θ0 was used as the base case for the evaluation of the performance of the two algorithms; in other words, the TTT corresponding to θk as produced by the two algorithms was compared (in % of improvement) to the TTT that corresponds to θ0 for the same traffic demand scenario xk. Apparently, negative ''% improvements'' correspond to cases where the TTT produced under a particular θk is worse than the TTT of the base case. Table 1 summarizes the comparison results of the performance of the two algorithms for different choices of the design constants c̄0, ᾱ0 and for different performance criteria. It is worth noting that the performance of the algorithms for iterations 0–50 was not taken into consideration in the comparison, since during these iterations both algorithms behave similarly. Careful inspection of Table 1 reveals that the proposed algorithm significantly outperforms the AFT algorithm, especially as far as their worst-case performance characteristics are concerned. Moreover, the proposed algorithm appears to be more robust with respect to the choice of the design constants c̄0, ᾱ0. Note also the trade-off – in both algorithms – between satisfactory transient performance (which,

Fig. 2. Mean speed improvement (%) of the AFT algorithm (upper plot) and the proposed algorithm (lower plot).

in general, corresponds to small values for c̄0, ᾱ0) and steady-state convergence (large values for c̄0, ᾱ0). Fig. 2 exhibits the typical behavior of the two algorithms (upper plot: AFT, lower plot: proposed algorithm) for the particular choice c̄0 = 0.01, ᾱ0 = 0.05. Please notice that the AFT algorithm exhibits ''spiky'' behavior, especially at days 58 and 102, where its performance is more than 40% worse than that obtained using the initial θ = θ0; it is worth noting that these large negative spikes correspond to the emergence of grid-locks in the traffic network, where the vehicles practically stay blocked for hours. By inspecting the lower plot of Fig. 2, it is clearly seen that the proposed algorithm avoids the aforementioned ''spiky'' behavior of AFT while maintaining efficient convergence behavior. Also, note that the convergence of the proposed algorithm is faster than that of AFT.

5. Conclusions

In this paper a new adaptive optimization algorithm has been proposed that is applicable to optimization applications where the objective function to be optimized is unknown (but available for measurement) and is subject to exogenous signals which can be unknown or partially known. The main advantage of the proposed scheme is that it guarantees satisfactory transient performance, contrary to existing algorithms, which are known to exhibit ''spiky'' transient behavior.

References

Bertsekas, D. P., & Tsitsiklis, J. N. (2000). Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10, 627–642.
Bhatnagar, S., Fu, M. C., Marcus, S. I., & Wang, I. J. (2003). Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation, 13, 180–209.
Borkar, V. S., & Meyn, S. P. (2000). The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38, 447–469.

Diakaki, C., Dinopoulou, V., Aboudolas, K., Papageorgiou, M., Ben-Shabat, E., Seider, E., et al. (2003). Extensions and new applications of the traffic signal control strategy TUC. Transportation Research Record, (1856), 202–216.
Dippon, J., & Renz, J. (1997). Weighted means in stochastic approximation of minima. SIAM Journal on Control and Optimization, 35, 1811–1827.
Elloumi, N., Haj-Salem, H., & Papageorgiou, M. (1994). METACOR: A macroscopic modeling tool for urban corridors. In TRISTAN II (Triennial symposium on transportation analysis): Vol. 1 (pp. 135–150).
Freeman, R. A., & Kokotovic, P. V. (1996). Inverse optimality in robust stabilization. SIAM Journal on Control and Optimization, 34(4), 1365–1391.
Huang, G.-B., Chen, L., & Siew, C. K. (2006). Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, 17, 879–892.
Huang, G.-B., & Chen, L. (2007). Convex incremental extreme learning machine. Neurocomputing, 70, 3056–3062.
Ioannou, P. A., & Sun, J. (1995). Stable and robust adaptive control. Englewood Cliffs, NJ: Prentice Hall.
Kosmatopoulos, E. B., Polycarpou, M. M., Christodoulou, M. A., & Ioannou, P. A. (1995). High-order neural network structures for identification of dynamical systems. IEEE Transactions on Neural Networks, 6, 422–431.
Kosmatopoulos, E. B., Papageorgiou, M., Vakouli, A., & Kouvelas, A. (2007). Adaptive fine-tuning of non-linear control systems with application to the urban traffic control strategy TUC. IEEE Transactions on Control Systems Technology, 15(6), 991–1002.
Kushner, H. K., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer.
Maiorov, V., & Meir, R. S. (1998). Approximation bounds for smooth functions in C(R^d) by neural networks and mixture networks. IEEE Transactions on Neural Networks, 9, 969–978.


Robbins, H., & Siegmund, D. (1971). A convergence theorem for nonnegative almost supermartingales and some applications. In J. S. Rustagi (Ed.), Optimizing methods in statistics (pp. 233–257). New York: Academic Press.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45, 1839–1853.

Elias B. Kosmatopoulos received the Diploma, M.Sc. and Ph.D. degrees from the Technical University of Crete, Greece, in 1990, 1992, and 1995, respectively. He is currently an Assistant Professor with the Department of Production Engineering and Management, Technical University of Crete (TUC), Greece and Deputy Director of the Dynamic Systems and Simulation Laboratory at TUC. Prior to joining TUC, he was Research Assistant Professor with the Department of Electrical Engineering-Systems, University of Southern California (USC) and a Postdoctoral Fellow with the Department of Electrical & Computer Engineering, University of Victoria, B.C., Canada. Dr. Kosmatopoulos’ research interests are in the areas of neural networks, adaptive optimization and control and intelligent transportation systems. He is the author of over 30 journal papers. Among his theoretical contributions the most important are the analysis of approximation, stability and learning capabilities of Recurrent High Order Neural Networks and the development and analysis of a switching adaptive controller for unknown dynamical systems.