Journal of Mathematical Analysis and Applications 244, 333-347 (2000). doi:10.1006/jmaa.2000.6703, available online at http://www.idealibrary.com
Rates of Convergence of Adaptive Step-Size of Stochastic Approximation Algorithms

S. Shao, Department of Mathematics, Cleveland State University, Cleveland, Ohio 44115

and

Percy P. C. Yip, AI WARE, Inc., Beachwood, Ohio 44122

Submitted by Alan Schumitzky

Received October 16, 1996
We propose a new adaptive algorithm with decreasing step size for stochastic approximations. Adaptive algorithms are widely used in fields such as system identification and adaptive control. We analyze the rate of convergence of the proposed algorithms. An averaging algorithm, chosen for the optimality of its rate of convergence, is used to control the step sizes. Our proofs are based on recent results in stochastic approximations and on the Gaussian Approximation Theorem. © 2000 Academic Press
1. INTRODUCTION

Consider the classical problem
$$ y_n = x_n^T \theta + v_n, \qquad (1.1) $$
where the $y_n$ are observations, the $x_n$ are system inputs which are components of a random process, $v_n$ is an additive noise, and $\theta \in R^d$ is a parameter. In order to estimate the parameter $\theta$ of (1.1), the following well-known adaptive algorithm is often used:
$$ \hat\theta_{n+1} = \hat\theta_n + \epsilon\, x_n \big( y_n - x_n^T \hat\theta_n \big), \qquad (1.2) $$
where $\hat\theta_n$ is the sequence of vectors, updated recursively, that estimates the parameter $\theta$, and the step size $\epsilon$ is a positive constant. The algorithm (1.2) is intended to make $\hat\theta_n$ converge to the vector $\theta^*$ that optimizes the estimation of the parameter $\theta$. The idea of building a class of estimation or identification methods on stochastic approximations or adaptive algorithms goes back to the 1950s; the first article was given by Robbins and Monro [11]. In 1966, Khasminskii [3], following the fundamental work of Robbins and Monro, first gave a convergence result for stochastic approximation algorithms with constant step size and smooth vector fields depending on the parameter $\theta$. This work led to an initial series of theorems on stochastic approximations. Then in 1977, Ljung [6, 7] first formulated a general convergence result for general adaptive algorithms with decreasing step size, using the Markov setting approach for the case of conditionally linear dynamics. Ljung's work has been an important contribution to the theory of convergence under realistic assumptions; he proved the almost sure convergence of an algorithm derived from (1.2) by replacing $\epsilon$ with a decreasing sequence $\epsilon_n$ tending to zero. Kushner, Huang, and Yang [4, 5] later gave a full treatment of stochastic adaptive algorithms with step sizes $\epsilon_n = n^{-\alpha}$, for various values of $\alpha$. The results of Benveniste et al. [1] in the 1980s and 1990s treated both the Markov setting and discontinuities of the random vector field, for both constant and decreasing step sizes. In 1992, in an important paper, Polyak and Juditsky [10] showed that, for general stochastic approximation algorithms and any sequence of step sizes, if the step sizes $\epsilon_n$ tend to zero slower than $O(1/n)$, then the averaged sequence $(1/n)\sum_{i=1}^{n}\theta_i$ converges to its limit at an optimum rate. Motivated by their results, we present in this paper an analysis of the optimum rate of convergence for a new stochastic gradient algorithm that searches for the optimum vector in order to estimate the parameter $\theta$. We propose the operating adaptive algorithms with optimum asymptotic rate of convergence as
$$ \hat\theta_{n+1} = \hat\theta_n + \bar\epsilon_n x_n e_n, \qquad (1.3) $$
$$ \epsilon_{n+1} = \epsilon_n + \gamma_n x_n e_n, \qquad (1.4) $$
where $\hat\theta_n$ are the estimates of $\theta$ to be updated recursively,
$$ e_n = y_n - x_n^T \hat\theta_n, \qquad (1.5) $$
$$ \bar\epsilon_n = \frac{1}{n+1} \sum_{i=0}^{n} \epsilon_{n-i}. \qquad (1.6) $$
We rewrite (1.6) as
$$ \bar\epsilon_n = \bar\epsilon_{n-1} + \frac{1}{n+1}\big( \epsilon_n - \bar\epsilon_{n-1} \big), \qquad (1.7) $$
which can be updated conveniently. We require that the $\epsilon_n$ are positive real numbers with $\sum_{n=0}^{\infty}\epsilon_n < +\infty$, that the $\gamma_n$ are positive real numbers for all $n$ with $\sum_{n=0}^{\infty}\gamma_n = +\infty$, and that $\gamma_n$ converges to zero "slower" than $1/n$. Then the $\epsilon_n$ converge to zero. When the step sizes $\gamma_n$ converge to zero slower than $1/n$, averaged iterates of the type $\bar\epsilon_n$ have been shown to have superior properties to the basic iterate $\epsilon_n$ itself in the sense of smaller asymptotic variance [4, 5]; that is, the sequence $\bar\epsilon_n$ converges to zero at an optimum rate. We will prove the convergence of (1.3) and (1.4) and analyze their asymptotic rate of convergence by the Gaussian Approximation Theorem. The outline of this paper is as follows. In Section 2 we review relevant convergence results for stochastic approximation and adaptive algorithms. In Section 3 we present our proposed algorithm and analyze the convergence of $\hat\theta_n$. The convergence and the rate of convergence of $\epsilon_n$ are studied in Section 4. In Section 5 we close with some final remarks.
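For illustration only (this sketch is not part of the original analysis), recursions (1.3)-(1.7) can be simulated as follows. The data-generating model, the choice $\gamma_n = n^{-2/3}$, the scalar reduction used in the step-size update, and the positivity clamp are all assumptions made for the example, since (1.4) pairs a scalar step size with the vector increment $x_n e_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta_true = np.array([1.0, -2.0, 0.5])        # parameter theta of model (1.1), assumed for the demo
theta_hat = np.zeros(d)                        # estimate \hat{theta}_n
eps = 0.1                                      # epsilon_n, updated by (1.4)
eps_bar = eps                                  # averaged step size \bar{epsilon}_n of (1.6)/(1.7)

for n in range(1, 5001):
    x = rng.normal(size=d)                     # system input x_n
    y = x @ theta_true + 0.1 * rng.normal()    # observation y_n = x_n^T theta + v_n
    e = y - x @ theta_hat                      # prediction error e_n, Eq. (1.5)
    theta_hat = theta_hat + eps_bar * x * e    # main recursion (1.3), driven by the averaged step size
    gamma = n ** (-2.0 / 3.0)                  # gamma_n -> 0 slower than 1/n
    # Step-size recursion (1.4); reducing x_n e_n to a scalar via its mean is an assumption,
    # as is clamping epsilon_n to stay positive (the paper only requires epsilon_n > 0).
    eps = max(eps + gamma * float(np.mean(x * e)), 1e-6)
    eps_bar = eps_bar + (eps - eps_bar) / (n + 1)   # running average, Eq. (1.7)

print(theta_hat)                               # should end up close to theta_true
```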
2. SOME RESULTS FROM STOCHASTIC APPROXIMATION AND ADAPTIVE ALGORITHMS

In this section, we introduce some definitions and review some well-known basic results on the convergence of stochastic approximations [1, I, Chaps. 2, 3; II, Chaps. 1, 3, 4] which will be used in Sections 3 and 4 to prove our main theorems. Consider the general adaptive algorithm of the form
$$ z_{n+1} = z_n + \gamma_n H(z_n, x_n), \qquad (2.1) $$
where $z_n$ ($z_n$ is analogous to $\hat\theta_n$ of (1.2)) is a sequence of random variables in $R^d$, $\gamma_n$ is a sequence of positive real numbers with $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^{\alpha} < +\infty$ for some $\alpha > 1$, and the system inputs $x_n$ lie in $R^k$. We assume that

(i) for fixed $z$,
$$ P\big( \xi_n \in d\xi \mid \xi_{n-1}, \dots ;\ z_{n-1}, \dots \big) = \Pi_{z_{n-1}}\big( \xi_{n-1}, d\xi \big), \qquad x_n = f(\xi_n), $$
where $f$ is a function and $\{\xi_n\}$ is a Markov chain with transition probability $\Pi_z$ which is asymptotically stationary with limiting distribution $\Gamma_z(d\xi)$;

(ii) there exists a mean vector field defined by
$$ h(z) := \lim_{n\to\infty} E_z\big( H(z, x_n) \big), $$
where $E_z$ denotes expectation with respect to the distribution of the state $\{x_n\}$ for a fixed value of $z$. Now we state Theorem A.

THEOREM A. Assume the algorithm (2.1) satisfies assumptions (i) and (ii) in the domain $D^*$ of an attractor $z^*$ of the ODE defined by
$$ \frac{dz}{dt} = h(z), \qquad z(0) = a. $$
Then for any compact subset $Q$ of $D^*$ and $\xi_0 = \xi$, we have

(a) $\displaystyle P\Big\{ \lim_{n\to\infty} z_n = z^* \Big\} \ge 1 - C(\alpha, Q, |\xi|)\sum_n \gamma_n^{\alpha}$.

(b) For any $\epsilon > 0$ we have
$$ P\Big\{ \max_n \big| z_n - z(a, t_n) \big| > \epsilon \Big\} < C(\alpha, Q, |\xi|)\sum_n \gamma_n^{\alpha}, $$
where $t_n = \sum_{i=1}^{n}\gamma_i$ and $C(\alpha, Q, |\xi|)$ is a constant depending on $\alpha$, $Q$, and the norm of the initial condition $\xi$.

Theorem A shows that we can control the error of the algorithm uniformly along its trajectory. In the next section, Theorem A will be applied to prove our convergence theorem. We now introduce the definition of the class $\mathrm{Li}(R^d, L_1, L_2, p_1, p_2)$, Theorem B, and Theorem C.

DEFINITION. A function $f(\theta, x)$, differentiable with respect to $x$, is said to be of class $\mathrm{Li}(R^d, L_1, L_2, p_1, p_2)$ if, for all $\theta, \theta' \in R^d$ and all $x, x_1, x_2 \in R^d$, it satisfies the conditions
$$ \big| f(\theta, x_1) - f(\theta, x_2) \big| \le L_1 (1 + |\theta|)\,|x_1 - x_2|\,\big( 1 + |x_1|^{p_1} + |x_2|^{p_2} \big), \qquad (C1) $$
$$ \big| f(\theta, 0) - f(\theta', 0) \big| \le L_2 |\theta - \theta'|, \qquad (C2) $$
$$ \big| f'(\theta, x) - f'(\theta', x) \big| \le L_2 |\theta - \theta'| \big( 1 + |x|^{p_2} \big). \qquad (C3) $$
It is said to be of class $\mathrm{Li}(R^d)$ if it is of class $\mathrm{Li}(R^d, L_1, L_2, p_1, p_2)$ for some values of $L_j$, $p_j$.

THEOREM B. Let $\Pi$ be a transition probability on $R^k$. Suppose that for all $p \ge 0$ there exist constants $K_1$, $K_2$, $\rho < 1$ such that for all $g \in \mathrm{Li}(p)$, $x_1, x_2 \in R^k$, $n \ge 0$, we have
$$ \int \Pi^n(x_1, dx_2)\,\big( 1 + |x_2|^p \big) \le K_1 \big( 1 + |x_1|^p \big), \qquad (B1) $$
$$ \big| \Pi^n g(x_1) - \Pi^n g(x_2) \big| \le K_2\,\rho^n\,[g]_p\,|x_1 - x_2|\,\big( 1 + |x_1|^p + |x_2|^p \big), \qquad (B2) $$
where $g$ is a function on $R^k \times E$, $E$ is a finite subset of $R^r$, and $[g]_p$ is defined by
$$ [g]_p = \sup_{x_1 \ne x_2,\ e \in E} \frac{\big| g(x_1, e) - g(x_2, e) \big|}{|x_1 - x_2|\,\big( 1 + |x_1|^p + |x_2|^p \big)} . $$
Then for all $f(\theta, x)$ of class $\mathrm{Li}(R^d, L_1, L_2, p_1, p_2)$ there exist functions $h(\theta)$, $\nu_\theta$ and constants $C_i$ depending only on the $L_j$, $p_j$ such that
$$ \big| h(\theta) - h(\theta') \big| \le C_1 |\theta - \theta'|, \qquad \big| \nu_\theta(x) \big| \le C_2 (1 + |\theta|)\big( 1 + |x|^{p_1 + 1} \big), $$
$$ \big| \nu_\theta(x) - \nu_{\theta'}(x) \big| \le C_3 |\theta - \theta'| \big( 1 + |x|^{p_2 + 1} \big). $$

THEOREM C.
If (2.1) satisfies the following assumptions:

(A1) $\gamma_n \ge 0$, $\gamma_n \to 0$, and $\sum_n \gamma_n = +\infty$.

(A2) $(z_n, x_n)_{n \ge 0}$ is a Markov process.

(A3) For any compact subset $Q$ of a domain $D \subset R^d$, there exist constants $C_1$, $q_1$ depending on $Q$ such that for all $z \in Q$ and all $n$ we have $|H(z, x)| \le C_1(1 + |x|^{q_1})$.

(A4) There exists a function $h$ on $D$ and, for each $z \in D$, a function $\nu_z$ on $R^k$ such that:

(A4$_1$) $h$ is locally Lipschitz on $D$.

(A4$_2$) $(I - \Pi_z)\nu_z = H_z - h(z)$ for all $z \in D$, where $H_z$ is the function defined by $H_z(x) = H(z, x)$.

(A4$_3$) For every compact subset $Q$ of $D$, there exist constants $C_2$, $C_3$, $q_2$, $q_3$, $\lambda \in [\tfrac12, 1]$ such that for all $z, z' \in Q$
$$ \big| \nu_z(x) \big| \le C_2 \big( 1 + |x|^{q_2} \big), \qquad \big| \Pi_z \nu_z(x) - \Pi_{z'} \nu_{z'}(x) \big| \le C_3 |z - z'|^{\lambda} \big( 1 + |x|^{q_3} \big). $$
(A5) For any compact subset $Q$ of $D$ and any $s > 0$, there exists $\mu_s(Q) < +\infty$ such that for all $n$, $x \in R^k$, $a \in R^d$ we have
$$ E_{x,a}\big\{ \chi(z_k \in Q,\ k \le n)\,\big( 1 + |x_{n+1}|^s \big) \big\} \le \mu_s(Q)\big( 1 + |x|^s \big). $$
For $N = 0, 1, 2, \dots$ we rewrite (2.1) as
$$ z^N_{n+1} = z^N_n + \gamma^N_n \hat H\big( z^N_n, x^N_n \big), \qquad (2.2) $$
where $z^N_n = z_{n+N}$, $\gamma^N_n = \gamma_{n+N}$, and $x^N_n = x_{n+N}$. Then
$$ \lim_{N\to\infty} P^N_{x,a}\Big\{ \sup_{n \le m^N(T)} \big| z^N_n - z\big( t^N_n; a \big) \big| \ge \delta \Big\} = 0 $$
for all $\delta > 0$, where $P^N_{x,a}$ denotes the distribution of $(x_n, z_n)_{n \ge 0}$ for the initial condition $x_0 = x$,
$$ t^N_n = \sum_{i=0}^{n} \gamma^N_i, \qquad m^N(T) = \inf\big\{ k : k \ge 0,\ \gamma^N_1 + \cdots + \gamma^N_{k+1} \ge T \big\}, $$
$$ z^N(t) = \sum_{i=0}^{n} \chi_{[t^N_i \le t < t^N_{i+1}]}(t)\, z^N_i, $$
where $\chi_A$ is the characteristic function of $A$ defined by
$$ \chi_A(t) = \begin{cases} 1, & \text{if } t \in A; \\ 0, & \text{otherwise}, \end{cases} $$
and $z(t^N_n; a)$ is the solution of the ODE
$$ \frac{dz}{dt} = h\big( z(t, a) \big), \qquad z(0, a) = a, $$
associated with $\hat H$ and $(\Pi_z)_{z \in R^d}$.

Remarks. Theorem C shows that our algorithm (2.1) can be approximated by solutions of the ODE if Assumptions (A1) to (A5) are satisfied. Theorem B gives sufficient conditions under which Assumption (A4) of Theorem C is satisfied in the case where the Markov chains depend on the parameter $\theta$ and the transition probability $\Pi$ is independent of $\theta$.
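The ODE approximation of Theorem C can be visualized with a small numerical sketch (not from the paper). The scalar model, the noise level, and the step sizes $\gamma_n = (n+1)^{-0.7}$ below are assumptions chosen only to make the comparison concrete; $H(z, x) = x(y - xz)$ gives the mean field $h(z) = E[x^2](z_{\mathrm{true}} - z)$.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = lambda n: (n + 1) ** -0.7      # step sizes: sum diverges, gamma_n -> 0

z_true = 2.0                            # assumed scalar "true" parameter
z = 0.0                                 # stochastic approximation iterate z_n of (2.1)
z_ode = 0.0                             # Euler discretization of dz/dt = h(z)

for n in range(2000):
    x = rng.normal()
    y = x * z_true + 0.1 * rng.normal()
    g = gamma(n)
    z += g * x * (y - x * z)            # z_{n+1} = z_n + gamma_n * H(z_n, x_n)
    z_ode += g * (z_true - z_ode)       # Euler step of the ODE with h(z) = E[x^2](z_true - z), E[x^2] = 1

print(abs(z - z_ode))                   # typically small: the iterates track the ODE trajectory
```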
We now introduce the Gaussian Approximation Theorem for sequences of algorithms with decreasing step sizes.

THEOREM D. Assume that the function $H$, the sequences $(\gamma^N_n)_{n \ge 0}$, $N = 0, 1, 2, \dots$, and the transition probabilities $(\Pi_z)_{z \in R^d}$ satisfy Assumptions (A1) to (A5) and, in addition:

(A6) For all $z \in D$, $h \in C^2$, there exists a unique symmetric $d \times d$ matrix $(R_{ij}(z))$ and, for all $(z, x) \in R^d \times R^k$, a matrix $(W^z_{ij}(x))$ such that:

(A6$_1$) $R_{ij}$ is locally Lipschitz on $D$.

(A6$_2$) $(I - \Pi_z) W^z_{ij}(x) = \Pi_z\big( \nu^i_z\,\nu^j_z \big)(x) - \Pi_z \nu^i_z(x)\,\Pi_z \nu^j_z(x) - R_{ij}(z)$.

(A6$_3$) For any compact subset $Q$ of $D$, there exist constants $K_1$, $K_2$, $p_1$, $p_2$, $\lambda \in [1/2, 1]$ such that for all $z, z' \in Q$ we have
$$ \big| W^z_{ij}(x) \big| \le K_1 \big( 1 + |x|^{p_1} \big), \qquad \big| \Pi_z W^z_{ij}(x) - \Pi_{z'} W^{z'}_{ij}(x) \big| \le K_2 |z - z'|^{\lambda} \big( 1 + |x|^{p_2} \big), $$
where $\nu_z$ is the function of (A4).

(A7) $\lim_{N\to\infty} \gamma^N_n = 0$ and there exists an $\alpha \ge 0$ such that
$$ \lim_{N\to\infty} \sup_k \left| \frac{\sqrt{\gamma^N_k} - \sqrt{\gamma^N_{k+1}}}{\big( \gamma^N_{k+1} \big)^{3/2}} - \alpha \right| = 0. $$

Then the distributions $(\hat P^N_{x,a})_{N \ge 0}$ of the processes
$$ U_N(t) = \frac{z^N(t) - z(t; a)}{\sqrt{\gamma^N(t)}} $$
converge weakly, as $N \to \infty$, to the distribution of the Gaussian diffusion with initial condition $0$ and generator given by
$$ L_t \Psi(x) = \sum_{i=1}^{d} \partial_i \Psi(x)\left( \sum_{j=1}^{d} \partial_j h_i\big( z(t, a) \big)\, x_j + \alpha x_i \right) + \frac{1}{2} \sum_{i, j = 1}^{d} \partial^2_{ij}\Psi(x)\, R_{ij}\big( z(t, a) \big). $$
If the algorithm (2.1) is convergent, that is, $z_n \to z^*$ almost surely, then we have the following Theorem E.

THEOREM E. Suppose the algorithm (2.1) satisfies Assumptions (A1) to (A6) and, in addition,
$$ \lim_{n\to\infty} \frac{\sqrt{\gamma_n} - \sqrt{\gamma_{n+1}}}{\big( \gamma_{n+1} \big)^{3/2}} = \alpha, $$
$$ \big( z - z^* \big)\cdot h(z) \le -\delta\,|z - z^*|^2, \quad \text{with} \quad \liminf_{n\to\infty}\left( 2\delta\,\frac{\gamma_n}{\gamma_{n+1}} + \frac{\gamma_{n+1} - \gamma_n}{\gamma_{n+1}^{2}} \right) > 0, $$
and there also exists a matrix $B = (B_{ij})_{d\times d} = \big( \alpha\,\delta_{ij} + \partial_j h_i(z^*) \big)$ whose eigenvalues have strictly negative real parts. Then we have:

(1) The sequence $\hat P_N$ of the distributions of the processes defined by
$$ U_N(t) = \frac{z(t_N + t) - z^*}{\sqrt{\gamma(t_N + t)}} $$
converges weakly toward the distribution of a stationary Gaussian diffusion with generator
$$ L\Psi(x) = Bx\cdot\nabla\Psi(x) + \frac{1}{2} \sum_{i, j = 1}^{d} \partial^2_{ij}\Psi(x)\, R_{ij}(z^*). $$

(2) The sequence of random variables
$$ \left( \frac{z_n - z^*}{\sqrt{\gamma_n}} \right)_{n \in N} $$
converges in distribution to a zero-mean Gaussian variable with covariance
$$ C = \int_{0}^{\infty} e^{sB}\, R\, e^{sB^{T}}\, ds, $$
where $t_n = \sum_{i=1}^{n} \gamma_i$.

Remark. It is important to note that if the Markov chain $(x_n)_{n \ge 0}$ with transition probability $\Pi_z$ has an invariant
probability $\Gamma_z$, then the matrix $R = (R_{ij}(z))$ in Assumption (A6) and Theorem E is given by
$$ R_{ij}(z) = \sum_{k=-\infty}^{\infty} \operatorname{cov}\big( H_i(z, x_k),\, H_j(z, x_0) \big). $$
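As a practical aside (not part of the paper), $R_{ij}(z)$ is a long-run covariance and can be estimated from a stationary sample path of $H(z, x_n)$ at a fixed $z$ by truncating the sum over lags; the helper below and its truncation level `max_lag` are assumptions made only for illustration.

```python
import numpy as np

def long_run_cov(H_samples, max_lag=50):
    """Truncated estimate of R_ij(z) = sum_k cov(H_i(z, x_k), H_j(z, x_0))
    from a stationary sample path H_samples of shape (T, d)."""
    T, d = H_samples.shape
    Hc = H_samples - H_samples.mean(axis=0)   # center the samples
    R = Hc.T @ Hc / T                         # lag-0 covariance
    for k in range(1, max_lag + 1):
        C = Hc[k:].T @ Hc[:-k] / T            # lag-k cross-covariance
        R += C + C.T                          # contributions of lags +k and -k
    return R
```

In practice `H_samples` would be generated by holding $z$ fixed and running the chain $x_n$ until it is close to stationarity.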
3. CONVERGENCE OF $\hat\theta_n$

Consider the algorithm
$$ \hat\theta_{n+1} = \hat\theta_n + \bar\epsilon_n x_n e_n, $$
where
$$ e_n = y_n - x_n^T \hat\theta_n, \qquad \bar\epsilon_n = \bar\epsilon_{n-1} + \frac{1}{n+1}\big( \epsilon_n - \bar\epsilon_{n-1} \big). $$
We assume that

(i) $\hat\theta_0, x_1, x_2, \dots$ are defined on a probability space $(\Omega, \mathcal{A}, P)$, and $F_n$ is the $\sigma$-algebra of events generated by the random variables $\hat\theta_0, x_1, x_2, \dots, x_n$;

(ii) $(x_n, \hat\theta_n)_{n \ge 0}$ is a Markov process whose transition probability depends on $n$;

(iii) $(y_n, x_n)_{n \ge 0}$ are bounded;

(iv) for any fixed $\theta$ in the domain of the operating algorithm, there exist a function $f$ and an asymptotically stationary Markov chain $(\xi_n)$ with limiting distribution $\Gamma_\theta(d\xi)$ such that $x_n = f(\xi_n)$.

We now define
$$ h(\theta) = \int H\big( \theta, f(\xi) \big)\,\Gamma_\theta(d\xi) = \lim_{n\to\infty} E_\theta\big( H(\theta, x_n) \big) = E_\theta\big( H(\theta, X) \big), $$
where $H(\theta, X) = -X\big( X^T\theta - Y \big)$ and $E_\theta$ denotes expectation with respect to the distribution of the state $(X_n)_{n \ge 0}$ for a fixed value of $\theta$. Then we have
$$ \frac{d\theta}{dt} = h(\theta), \qquad \theta(0) = a. \qquad (3.1) $$
We know that if $\bar\epsilon_n \to 0$, the algorithm $\theta(t)$ has a tendency to follow the solution of the differential equation with the initial condition $\theta_0 = a$. In fact,
$$ h(\theta) = \lim_{n\to\infty} E\big( H(\theta, x_n) \big) = -\frac{1}{2}\,\frac{\partial}{\partial\theta}\,E\big( y_n - x_n^T\theta \big)^2, $$
so that $\theta^*$ minimizes $E\big( y_n - x_n^T\theta \big)^2$. For
$$ \frac{d\hat\theta}{dt} = h\big( \hat\theta(t) \big), $$
where $t_n = \sum_{i=0}^{n}\bar\epsilon_i$, the solution $\hat\theta(t_n)$ is close to the solution $\hat\theta_n$ of the equations
$$ \hat\theta_{n+1} = \hat\theta_n + \bar\epsilon_n h\big( \hat\theta_n \big), \qquad \hat\theta_0 = a. $$
Since $\epsilon_n > 0$ for all $n$ and $\sum_n \epsilon_n < +\infty$, we have
$$ \bar\epsilon_n \to 0 \qquad \text{and} \qquad \sum_n \bar\epsilon_n = +\infty $$
by $\bar\epsilon_n = \frac{1}{n+1}\sum_{i=0}^{n}\epsilon_{n-i}$. Therefore $\bar\epsilon_n$ decreases as $O(1/n)$ and $\sum_n \bar\epsilon_n^{\alpha} < +\infty$ for $\alpha > 1$. By Theorem A in Section 2, we have the following convergence theorem.
THEOREM 1. Suppose assumptions (i), (ii), (iii), and (iv) are satisfied, and suppose that the ODE (3.1) has an attractor $\theta^*$ in the domain $D^* \subset D$. Then for any compact subset $Q$ of $D$ and the algorithm
$$ \hat\theta_{n+1} = \hat\theta_n + \bar\epsilon_n x_n e_n $$
with $\hat\theta_0 = a \in Q$ and $\xi_0 = \xi$, we have

(A) $\displaystyle P\Big\{ \lim_{n\to\infty}\hat\theta_n = \theta^* \Big\} \ge 1 - C(\alpha, Q, |\xi|)\sum_n \bar\epsilon_n^{\alpha}$.

(B) For any $\epsilon > 0$ we have
$$ P\Big\{ \max_n \big| \hat\theta_n - \theta(t_n; a) \big| > \epsilon \Big\} < C(\alpha, Q, |\xi|)\sum_n \bar\epsilon_n^{\alpha}, $$
where $t_n = \sum_{i=1}^{n}\bar\epsilon_i$ and $C(\alpha, Q, |\xi|)$ is a constant depending on $\alpha$, $Q$, and the norm of the initial condition $\xi$.

Remark. Since $\bar\epsilon_n$ decreases as $O(1/n)$, the asymptotic rates of convergence are the same for the averaged iterates $\bar\theta_n = \frac{1}{n+1}\sum_{i=0}^{n}\theta_{n-i}$ and for $\hat\theta_n$ itself; see [5].
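To spell out the step used above (a short computation consistent with the paper's assumptions, not an addition to its results), write $S_n = \sum_{i=0}^{n}\epsilon_i$ and $S = \lim_n S_n < +\infty$, the summability required of the $\epsilon_n$:

```latex
% Behaviour of the averaged step size \bar\epsilon_n when \sum_n \epsilon_n = S < +\infty.
\[
  \bar\epsilon_n = \frac{1}{n+1}\sum_{i=0}^{n}\epsilon_{n-i} = \frac{S_n}{n+1} \le \frac{S}{n+1} = O(1/n),
\]
\[
  \sum_{n}\bar\epsilon_n^{\alpha} \le S^{\alpha}\sum_{n}(n+1)^{-\alpha} < +\infty \quad (\alpha > 1),
  \qquad
  \sum_{n}\bar\epsilon_n = \sum_{n}\frac{S_n}{n+1} = +\infty \quad (\text{since } S_n \to S > 0).
\]
```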
4. RATE OF CONVERGENCE OF $\epsilon_n$

Consider the algorithm (1.4) of Section 1,
$$ \epsilon_{n+1} = \epsilon_n + \gamma_n x_n e_n, \qquad (4.1) $$
where, as defined before, the $\epsilon_n$ are positive real numbers with $\sum_{n=0}^{\infty}\epsilon_n < +\infty$, the $\gamma_n$ are positive real numbers for all $n$ with $\sum_{n=0}^{\infty}\gamma_n = +\infty$, and $\gamma_n$ converges to zero "slower" than $1/n$. We further assume that

(I) $(\epsilon_n, x_n)$ is a Markov process;

(II) for all $\epsilon$, the Markov chain with transition probability $\Pi_\epsilon$ is positive recurrent with invariant probability $\Gamma_\epsilon$.

For $N = 0, 1, 2, 3, \dots$ we introduce a sequence $(\epsilon^N_n)$, where $\epsilon^N_n = \epsilon_{N+n}$ for all $n \ge 0$, satisfying
$$ \epsilon^N_{n+1} = \epsilon^N_n + \gamma^N_n x^N_n e^N_n, \qquad \epsilon^N_0 = \hat a, \qquad x^N_0 = x, \qquad (4.2) $$
where $e^N_n = y^N_n - x^N_n \hat\theta^N_n$, $\hat\theta^N_n = \hat\theta_{n+N}$, and $x^N_n = x_{n+N}$ are associated with the same transition probability $\Pi$. Set
$$ t^N_n = \sum_{k=0}^{n} \gamma^N_k, \qquad \epsilon^N(t) = \sum_{k=0}^{n} \chi_{[t^N_k \le t \le t^N_{k+1}]}(t)\, \epsilon^N_k, \qquad m^N(T) = \inf\big\{ k : k \ge 0,\ \gamma^N_1 + \cdots + \gamma^N_{k+1} \ge T \big\} $$
for some constant $T > 0$. Now we can prove Theorem 2 and Theorem 3 by Theorem B and Theorem C of Section 2.
THEOREM 2. Suppose the above assumptions are satisfied. Then
$$ \lim_{N\to\infty} P^N_{x,\hat a}\Big\{ \sup_{n \le m^N(T)} \big| \epsilon^N_n - \hat\epsilon\big( t^N_n; \hat a \big) \big| \ge \delta \Big\} = 0 $$
for all $\delta > 0$, where $\hat\epsilon(t^N_n; \hat a)$ is the solution of the ODE
$$ \frac{d\hat\epsilon}{dt} = \hat h\big( \hat\epsilon(t, \hat a) \big), \qquad \hat\epsilon(0, \hat a) = \hat a. $$

Proof. Theorem C of Section 2 will be applied. Assumptions (A1) and (A2) are satisfied by (I). We only need to show that (A3), (A4), and (A5) are satisfied. From (4.1) we have $H(\epsilon, x) = -x\big( x^T\hat\theta - y \big)$. Since $\hat\theta_n \to \theta^*$ and $(y_n, x_n)_{n \ge 0}$ are bounded, for any compact subset $Q$ of the domain $D^*$ there exists a constant $C_1$ such that
$$ \big| H(\epsilon, x) \big| \le C_1 \big( 1 + |x|^2 \big); $$
hence (A3) is satisfied. By Assumption (II) we can define
$$ \hat h(\epsilon) = \int H_\epsilon(y)\,\Gamma_\epsilon(dy) = \Gamma_\epsilon H_\epsilon, $$
where $H_\epsilon$ is the function defined by $H_\epsilon(x) = H(\epsilon, x)$. Then the function $H_\epsilon - \hat h(\epsilon)$ has zero integral with respect to $\Gamma_\epsilon$, and thus the Poisson equation (A4$_2$),
$$ (I - \Pi_\epsilon)\nu_\epsilon = H_\epsilon - \hat h(\epsilon), $$
has a solution $\nu_\epsilon$ of the form
$$ \nu_\epsilon(y) = \sum_{k \ge 0} \Pi^k_\epsilon\big( H_\epsilon - \hat h(\epsilon) \big)(y), $$
which is convergent. It is clear that $H(\epsilon, x)$ is of class $\mathrm{Li}(R^d)$; we choose $\lambda = 1$. By Theorem B, $\hat h$ is locally Lipschitz on $D$; therefore (A4) is satisfied. (A5) holds because $\epsilon_n$ converges to zero and $(y_n, x_n)_{n \ge 0}$ is bounded. By Theorem C, we have
$$ \lim_{N\to\infty} P^N_{x,\hat a}\Big\{ \sup_{n \le m^N(T)} \big| \epsilon^N_n - \hat\epsilon\big( t^N_n; \hat a \big) \big| \ge \delta \Big\} = 0 $$
for all $\delta > 0$, where $\hat\epsilon(t^N_n; \hat a)$ is the solution of the ODE
$$ \frac{d\hat\epsilon}{dt} = \hat h\big( \hat\epsilon(t; \hat a) \big), \qquad \hat\epsilon(0, \hat a) = \hat a, $$
associated with $\hat H$. In fact, $\hat h(\hat\epsilon) = -\frac{1}{2}\,\partial E\big( (e^N_n)^2 \big)/\partial\gamma_n$. This finishes the proof.

Since $\epsilon_n$ converges almost surely and $\gamma_n$ converges to zero slower than $1/n$, we may write $\gamma_n/\gamma_{n+1} = 1 + o(\gamma_n)$; then we have
$$ \lim_{n\to\infty} \frac{\sqrt{\gamma^N_n} - \sqrt{\gamma^N_{n+1}}}{\big( \gamma^N_{n+1} \big)^{3/2}} = 0 = \alpha. $$
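The value $\alpha = 0$ can be checked directly (a short verification under the stated condition $\gamma_n/\gamma_{n+1} = 1 + o(\gamma_n)$; it introduces no additional assumption):

```latex
\[
  \frac{\sqrt{\gamma_n}-\sqrt{\gamma_{n+1}}}{\gamma_{n+1}^{3/2}}
  = \frac{\gamma_n-\gamma_{n+1}}{\gamma_{n+1}^{3/2}\bigl(\sqrt{\gamma_n}+\sqrt{\gamma_{n+1}}\bigr)}
  = \frac{\gamma_n/\gamma_{n+1}-1}{\gamma_{n+1}\bigl(\sqrt{\gamma_n/\gamma_{n+1}}+1\bigr)}
  = \frac{o(\gamma_n)}{\gamma_{n+1}\bigl(\sqrt{\gamma_n/\gamma_{n+1}}+1\bigr)}
  \xrightarrow[n\to\infty]{} 0,
\]
% since \gamma_n/\gamma_{n+1} -> 1, so o(\gamma_n)/\gamma_{n+1} -> 0 while the bracket tends to 2.
```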
For each $\epsilon > 0$, since $\hat h < 0$, there exists a $\delta > 0$ such that
$$ \epsilon\,\hat h(\epsilon) \le -\delta\epsilon^2, \quad \text{with} \quad \liminf_{n\to\infty}\left( 2\delta\,\frac{\gamma^N_n}{\gamma^N_{n+1}} + \frac{\gamma^N_{n+1} - \gamma^N_n}{\big(\gamma^N_{n+1}\big)^{2}} \right) > 0. $$
By Theorem E, we have Theorem 3.

THEOREM 3. If Assumptions (I) and (II) are satisfied, then:

(1) The sequence $\hat P_N$ of the distributions of the processes defined by
$$ U_N(t) = \frac{\hat\epsilon(t_N + t)}{\sqrt{\gamma(t_N + t)}} $$
converges weakly toward the distribution of a stationary Gaussian diffusion with generator
$$ L\Psi(x) = \hat B x\cdot\nabla\Psi(x) + \tfrac{1}{2}\,\partial^2\Psi(x)\,R(0), $$
where $\hat B = d\hat h/d\epsilon(0) < 0$ and $(\hat P_N)$ is defined on the Skorokhod space $\hat\Omega_T$.

(2) The sequence of random variables
$$ \left( \frac{\epsilon_n}{\sqrt{\gamma_n}} \right)_{n \in N} $$
converges in distribution to a zero-mean Gaussian variable with covariance
$$ C = \int_{0}^{\infty} e^{s\hat B}\, R\, e^{s\hat B^{T}}\, ds, \qquad R_{ij}(\epsilon) = \sum_{k=-\infty}^{\infty} \operatorname{cov}\big( H_i(\epsilon, x_k),\, H_j(\epsilon, x_0) \big). $$
Remarks. We now consider the adaptive algorithm
$$ \hat\theta^N_n = \hat\theta^N_{n-1} + \bar\epsilon^N_n x^N_n e^N_n, \qquad \hat\theta^N_0 = \hat a, \qquad (4.3) $$
where $\hat\theta^N_n = \hat\theta_{n+N}$, $\bar\epsilon^N_n = \bar\epsilon_{n+N}$, $N = 0, 1, 2, \dots$. Since $\bar\epsilon_n = O(1/n)$ and $\lim_{n\to\infty}\bar\epsilon^N_n = 0$, we have
$$ \lim_{N\to\infty} \sup_k \left| \frac{\sqrt{\bar\epsilon^N_k} - \sqrt{\bar\epsilon^N_{k+1}}}{\big( \bar\epsilon^N_{k+1} \big)^{3/2}} - \frac{1}{2} \right| = 0. $$
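For the model case $\bar\epsilon_k = 1/k$ (an illustrative instance of $\bar\epsilon_n = O(1/n)$, not a claim from the paper about the general case), the centering constant $\tfrac12$ can be verified directly:

```latex
\[
  \frac{\sqrt{1/k}-\sqrt{1/(k+1)}}{\bigl(1/(k+1)\bigr)^{3/2}}
  = (k+1)^{3/2}\,\frac{\sqrt{k+1}-\sqrt{k}}{\sqrt{k}\,\sqrt{k+1}}
  = \frac{k+1}{\sqrt{k}\,\bigl(\sqrt{k+1}+\sqrt{k}\bigr)}
  \xrightarrow[k\to\infty]{} \frac{1}{2}.
\]
```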
By arguments similar to those of Theorem 3, the sequence of random variables
$$ \left( \frac{\hat\theta_n - \theta^*}{\sqrt{\bar\epsilon_n}} \right)_{n \in N} $$
converges in distribution to a zero-mean Gaussian variable with covariance matrix $C$ that is a positive symmetric solution of the Lyapunov equation
$$ \tilde B C + C \tilde B^{T} + \tilde R = 0, $$
where
$$ C = \int_{0}^{\infty} e^{s\tilde B}\, \tilde R\, e^{s\tilde B^{T}}\, ds, \qquad \tilde R_{ij}(\theta) = \sum_{k=-\infty}^{\infty} \operatorname{cov}\big( H_i(\theta, x_k),\, H_j(\theta, x_0) \big). $$
5. CONCLUSIONS AND REMARKS

We have presented a new adaptive algorithm with decreasing step size and investigated its rates of convergence. By taking advantage of the important result of Polyak and Juditsky [10], that averaged
iterates have superior properties to the basic iterate itself, in the sense of smaller asymptotic variance, when the step sizes converge to zero slower than $1/n$, we suggest using an averaging algorithm, chosen for the optimality of its rate of convergence, to control the step sizes of our main operating algorithm. This treatment leads to the optimum rate of convergence of algorithms (1.3) and (1.4). Under appropriate assumptions, our analysis shows that the error of the algorithm can be controlled uniformly along the trajectory of the main algorithm (Theorem 1). Both algorithms (1.3) and (1.4) can be approximated by solutions of ODEs (Theorems 1 and 2). Finally, by the Gaussian Approximation Theorem, Theorem 3 shows that the sequence of distributions of the processes converges weakly toward the distribution of a Gaussian diffusion, and that the sequence of random variables $(\epsilon_n/\sqrt{\gamma_n})_{n \in N}$ converges in distribution to a zero-mean Gaussian variable with the covariance provided.
REFERENCES

1. A. Benveniste, M. Métivier, and P. Priouret, "Adaptive Algorithms and Stochastic Approximations," Springer-Verlag, Berlin, 1990.
2. E. Eweda and O. Macchi, Convergence of an adaptive linear estimation algorithm, IEEE Trans. Automat. Control AC-29 (1984), 119-127.
3. R. Z. Khasminskii, On stochastic processes defined by differential equations with a small parameter, Theory Probab. Appl. 11, No. 2 (1966), 211-228.
4. H. Kushner and H. Huang, Rates of convergence for stochastic approximation type algorithms, SIAM J. Control Optim. 17, No. 5 (1979), 607-617.
5. H. Kushner and J. Yang, Stochastic approximation with averaging: Optimal asymptotic rates of convergence for general processes, SIAM J. Control Optim. 31 (1993), 1045-1062.
6. L. Ljung, On positive real transfer functions and the convergence of some recursions, IEEE Trans. Automat. Control AC-22 (1977), 539-551.
7. L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Automat. Control AC-22 (1977), 551-575.
8. L. Ljung, Analysis of stochastic gradient algorithms for linear regression problems, IEEE Trans. Inform. Theory IT-30 (1984), 151-160.
9. L. Ljung and P. E. Caines, Asymptotic normality of prediction error estimation for approximate system models, Stochastics 3 (1979), 29-49.
10. B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim. 30 (1992), 838-855.
11. H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist. 22 (1951), 400-407.