
Copyright © 1996 IFAC. 13th Triennial World Congress, San Francisco, USA.

A STOCHASTIC APPROXIMATION ALGORITHM WITH RANDOM DIFFERENCES*

H. F. Chen
Institute of Systems Science, Beijing 100080, P. R. China
hfchen%[email protected]

T. E. Duncan and B. Pasik-Duncan
Department of Mathematics, University of Kansas, Lawrence, KS 66045
[email protected], [email protected]

Abstract. A simultaneous perturbation algorithm for a Kiefer-Wolfowitz type problem that uses one-sided randomized differences and truncations at randomly varying bounds is given in this paper. At each iteration of the algorithm only two observations are required, in contrast to $2\ell$ observations, where $\ell$ is the dimension, in the classical algorithm. The algorithm given here is convergent under only some mild conditions. A rate of convergence and an asymptotic normality of the algorithm are also described. While only an algorithm with one-sided randomized differences is considered here, a corresponding algorithm with two-sided randomized differences has similar results with the same assumptions.

Key words: stochastic approximation, Kiefer-Wolfowitz algorithm, stochastic approximation algorithm with randomized differences, simultaneous perturbation stochastic approximation, constrained simultaneous perturbation stochastic approximation algorithm

1. INTRODUCTION

A Kiefer-Wolfowitz (KW) (1952) algorithm is used to find the extrema of an unknown function $L : \mathbb{R}^\ell \to \mathbb{R}$ which may be observed with some additive noise. If the gradient of $L$, $\nabla L$, can be observed, then the problem can be solved by a Robbins-Monro (RM) algorithm. Let $x_n$ be the estimate of the unique extremum of $L$ at the $n$th iteration. One approach to a KW algorithm is to observe $L$ at the following values:

$$x_n \pm c_n e_i$$

for $i = 1, 2, \dots, \ell$, where $c_n \in \mathbb{R} \setminus \{0\}$ and $e_i$ is the $i$th unit vector. Consider noisy observations of $L$ so that

$$y_{n+1}^{i+} = L(x_n + c_n e_i) + \xi_{n+1}^{i+}$$

* Research partially supported by NSF Grant DMS-9305936 and by the National Natural Science Foundation of China.

and

$$y_{n+1}^{i-} = L(x_n - c_n e_i) + \xi_{n+1}^{i-}$$

where ~~"t 1 and ~~+ 1 are observa tion noises. The ratio

To reduce the evaluat ions for a KW algorith m Spall (1992) replaced the determ inistic compon entwise differ-

where '1 E JK+. A4. The third derivative of L is bounde d. AS. ZO E IR' is an asympt otically stable point for the differential equatio n = !(,,(t)) where ! V L. A6. The sequence (,,' , kEN) is infinitely often in a compac t set that is contain ed in the domain of attracti on of ZO given in AS. A 7. The sequences (a" kEN) and (c., k E 1'1) satisfy at > 0, Cl: > 0 for all le E N, Ok - 0 a.nd Ck -- 0 as k _ 00, Ok = QC., and (%!-):2 < 00.

the asympt otic normal ity of the modified KW algorith m

Furthermore, some conditi ons are impose d on the se(a" kEN) in (Span, 1992) but this sequence can be arbitrarily chosen by the user of the algorithm in (Span, 1992).

i+

i-

Yn+l - Yn+l 2r.n

(I)

can be used as an estimat e for the ith compon ent of V L. With this approac h a KW aJgorith m requires 2l

measur ements of L. If t is large , fo, exampl e, the opti-

mizatio n of weights in a neural network, then this KW

algorith m can be rather slow.
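To make the cost of the classical scheme concrete, here is a brief sketch (ours, not from the paper; the quadratic test function, the noise level, and all numerical values are illustrative assumptions) of the componentwise estimate (1), which consumes $2\ell$ noisy evaluations of $L$ per iteration:

```python
import numpy as np

def kw_gradient_estimate(L_noisy, x, c_n, rng):
    """Classical KW estimate (1): one two-sided difference per
    coordinate, hence 2 * len(x) noisy evaluations of L per call."""
    ell = len(x)
    grad = np.zeros(ell)
    for i in range(ell):
        e_i = np.zeros(ell)
        e_i[i] = 1.0                            # i-th unit vector
        y_plus = L_noisy(x + c_n * e_i, rng)    # y_{n+1}^{i+}
        y_minus = L_noisy(x - c_n * e_i, rng)   # y_{n+1}^{i-}
        grad[i] = (y_plus - y_minus) / (2.0 * c_n)
    return grad

# Illustrative test function (an assumption, not from the paper):
# a quadratic with a unique maximum at x_star, observed with noise.
x_star = np.array([1.0, -2.0, 0.5])

def L_noisy(x, rng):
    return -np.sum((x - x_star) ** 2) + 0.1 * rng.standard_normal()

rng = np.random.default_rng(0)
print(kw_gradient_estimate(L_noisy, np.zeros(3), c_n=0.1, rng=rng))
```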

To reduce the number of evaluations for a KW algorithm, Spall (1992) replaced the deterministic componentwise differences in (1) by a symmetric random difference. Using the ordinary differential equation (ODE) method (Kushner and Clark, 1978), Spall showed the convergence and the asymptotic normality of the modified KW algorithm, though the conditions required are restrictive.

Initially Spall's KW algorithm and the conditions that he uses are described. Let $(\Delta_k^i,\ i = 1, \dots, \ell,\ k = 1, 2, \dots)$ be a sequence of mutually independent and identically distributed random variables with zero mean. Let $\Delta_k$ be given by

$$\Delta_k = [\Delta_k^1, \dots, \Delta_k^\ell]^T. \qquad (2)$$

At each iteration, two measurements are taken:

$$y_{k+1}^+ = L(x_k + c_k \Delta_k) + \xi_{k+1}^+,$$
$$y_{k+1}^- = L(x_k - c_k \Delta_k) + \xi_{k+1}^-. \qquad (3)$$

Then the vector symmetric difference

$$\frac{y_{k+1}^+ - y_{k+1}^-}{2 c_k}\, g_k, \qquad (4)$$

where

$$g_k = \left[ \frac{1}{\Delta_k^1}, \dots, \frac{1}{\Delta_k^\ell} \right]^T,$$

is used as an estimate for $\nabla L(x_k)$. The KW algorithm is formed as follows:

$$x_{k+1} = x_k + a_k\, \frac{y_{k+1}^+ - y_{k+1}^-}{2 c_k}\, g_k. \qquad (5)$$

In this form the algorithm seeks the maximum of $L$. The minimum of $L$ is found by replacing $a_k$ by $-a_k$.
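As a concrete reading of (2)-(5), the sketch below performs one iteration of the simultaneous perturbation scheme; the $\pm 1$ Bernoulli distribution for $\Delta_k$ is our illustrative choice (it has zero mean and a bounded inverse), since the distribution is otherwise left to the user:

```python
import numpy as np

def spall_step(L_noisy, x_k, a_k, c_k, rng):
    """One iteration of (5): two measurements (3) and the vector
    symmetric difference (4), independent of the dimension ell."""
    delta_k = rng.choice([-1.0, 1.0], size=len(x_k))   # Delta_k of (2)
    y_plus = L_noisy(x_k + c_k * delta_k, rng)         # y_{k+1}^+
    y_minus = L_noisy(x_k - c_k * delta_k, rng)        # y_{k+1}^-
    g_k = 1.0 / delta_k                                # g_k = [1/Delta_k^i]
    grad_est = (y_plus - y_minus) / (2.0 * c_k) * g_k  # estimate (4)
    return x_k + a_k * grad_est                        # ascent: seeks the maximum
```

Two observations per iteration, regardless of $\ell$, thus replace the $2\ell$ observations of the classical scheme.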

For the convergence of the algorithm (5), Spall (1992) required the following conditions.

A1. The random variables $(\xi_{k+1}^+ - \xi_{k+1}^-,\ k \in \mathbb{N})$ form a martingale difference sequence (mds) with uniformly bounded second moments.

A2. $\sup_{k \in \mathbb{N}} E(L^2(x_k \pm c_k \Delta_k)) < \infty$.

A3. The sequence $(x_k,\ k \in \mathbb{N})$ is assumed a priori to be uniformly bounded, that is,

$$\sup_{k \in \mathbb{N}} \|x_k\| \le \eta < \infty \quad \text{a.s.} \qquad (6)$$

where $\eta \in \mathbb{R}^+$.

A4. The third derivative of $L$ is bounded.

A5. $x^0 \in \mathbb{R}^\ell$ is an asymptotically stable point for the differential equation $\dot{x}(t) = f(x(t))$, where $f = \nabla L$.

A6. The sequence $(x_k,\ k \in \mathbb{N})$ is infinitely often in a compact set that is contained in the domain of attraction of $x^0$ given in A5.

A7. The sequences $(a_k,\ k \in \mathbb{N})$ and $(c_k,\ k \in \mathbb{N})$ satisfy $a_k > 0$, $c_k > 0$ for all $k \in \mathbb{N}$, $a_k \to 0$ and $c_k \to 0$ as $k \to \infty$, $\sum_{k=1}^\infty a_k = \infty$, and $\sum_{k=1}^\infty (a_k/c_k)^2 < \infty$.

Furthermore, some conditions are imposed on the sequence $(\Delta_k,\ k \in \mathbb{N})$ in (Spall, 1992), but this sequence can be arbitrarily chosen by the user of the algorithm.

In this paper, a one-sided randomized difference is used, that is,

$$y_{k+1}^+ = L(x_k + c_k \Delta_k) + \xi_{k+1}^+ \qquad (7)$$

and

$$y_{k+1}^0 = L(x_k) + \xi_{k+1}^0, \qquad (8)$$

and

$$\frac{y_{k+1}^+ - y_{k+1}^0}{c_k}\, g_k \qquad (9)$$

is used to estimate $f(x_k) = \nabla L(x_k)$. At each iteration only the two observations $y_{k+1}^+$ and $y_{k+1}^0$ are required.
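The one-sided estimate (9) also needs only two observations per iteration; a minimal sketch, under the same illustrative Bernoulli assumption on $\Delta_k$ as above:

```python
import numpy as np

def one_sided_estimate(L_noisy, x_k, c_k, rng):
    """One-sided randomized difference (7)-(9): observe L once at
    x_k + c_k * Delta_k and once at x_k itself."""
    delta_k = rng.choice([-1.0, 1.0], size=len(x_k))
    y_plus = L_noisy(x_k + c_k * delta_k, rng)        # (7)
    y_zero = L_noisy(x_k, rng)                        # (8)
    return (y_plus - y_zero) / c_k * (1.0 / delta_k)  # (9), estimates f(x_k)
```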

By the change in forming the differences, a modification of the algorithm (5), and the use of a direct approach to verify convergence, the conditions A2-A6 are eliminated and A1 is weakened to one that provides not only a sufficient but also a necessary condition for convergence. This result is given in Theorem 1. In Theorem 2 the observation noise is modified to one with only bounded second moments that is independent of $(\Delta_k,\ k \in \mathbb{N})$. A convergence rate and an asymptotic normality of the algorithm are given in Theorems 3 and 4, respectively. The proofs of these results are given in (Chen, et al., a).

2. THE ALGORITHM AND ITS CONVERGENCE

Initially the algorithm is precisely described. Let $(\Delta_k^i,\ i = 1, \dots, \ell,\ k \in \mathbb{N})$ be a sequence of independent and identically distributed random variables where $|\Delta_k^i| < a$, $|1/\Delta_k^i| < b$, and $E(\Delta_k^i) = 0$ for all $i \in \{1, \dots, \ell\}$ and $k \in \mathbb{N}$, where $a, b \in \mathbb{R}^+$. Furthermore, let $\Delta_k$ be independent of $\sigma(\xi_i^+, \xi_i^0,\ i \in \{0, \dots, k\})$ for $k \in \mathbb{N}$. Define $y_{k+1}$ and $\xi_{k+1}$ by the following equations:

$$y_{k+1} := \frac{(y_{k+1}^+ - y_{k+1}^0)\, g_k}{c_k}$$

and

$$\xi_{k+1} = \xi_{k+1}^+ - \xi_{k+1}^0.$$


It follows that

$$y_{k+1} = \frac{(L(x_k + c_k \Delta_k) - L(x_k))\, g_k}{c_k} + \frac{\xi_{k+1}\, g_k}{c_k}. \qquad (10)$$

Choose $x_0 \in \mathbb{R}^\ell$ and fix it. Define the following KW algorithm with randomly varying truncations and randomized differences:

$$x_{k+1} = (x_k + a_k y_{k+1})\, 1_{[\|x_k + a_k y_{k+1}\| \le M_{\sigma_k}]}, \qquad (11)$$

$$\sigma_k = \sum_{i=0}^{k-1} 1_{[\|x_i + a_i y_{i+1}\| > M_{\sigma_i}]}, \qquad \sigma_0 := 0, \qquad (12)$$

where $(M_k,\ k \in \mathbb{N})$ is a sequence of strictly positive, strictly increasing real numbers that diverges to $+\infty$. It is clear that $\sigma_k$ is the number of truncations that have occurred before time $k$. Clearly the random vector $x_k$ is measurable with respect to $\mathcal{F}_k := \sigma(\xi_i^+, \xi_i^0,\ i \le k) \vee \sigma(\Delta_i,\ i \le k-1)$. Thus the random vector $\Delta_k$ is independent of $\sigma(x_i,\ i \le k)$.
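A minimal sketch of the recursion (11)-(12) follows, assuming the illustrative truncation bounds $M_k = M_0 + k$ (any strictly increasing positive sequence diverging to $+\infty$ will do) and the $\pm 1$ Bernoulli perturbations used above; on a truncation the iterate is reset to the origin, as the indicator in (11) prescribes:

```python
import numpy as np

def truncated_kw(L_noisy, x_init, a, c, n_iter, M0=10.0, seed=0):
    """The recursion (11)-(12): one-sided randomized differences with
    randomly varying truncations; the iterate is reset to 0 and the
    truncation bound enlarged whenever the candidate step escapes it."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_init, dtype=float)
    sigma = 0                                          # sigma_k of (12)
    for k in range(1, n_iter + 1):
        delta = rng.choice([-1.0, 1.0], size=len(x))   # assumed +/-1 Bernoulli
        y_plus = L_noisy(x + c(k) * delta, rng)        # y_{k+1}^+
        y_zero = L_noisy(x, rng)                       # y_{k+1}^0
        y_next = (y_plus - y_zero) / c(k) * (1.0 / delta)  # y_{k+1} from (10)
        candidate = x + a(k) * y_next
        if np.linalg.norm(candidate) <= M0 + sigma:    # M_{sigma_k} = M0 + sigma_k (assumed)
            x = candidate
        else:
            x = np.zeros_like(x)                       # truncation in (11)
            sigma += 1
    return x, sigma
```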

The following conditions are imposed on the algorithm.

H1. The function $f = \nabla L$ is locally Lipschitz continuous. There is a unique maximum of $L$ at $x^0$ so that $f(x^0) = 0$ and $f(x) \ne 0$ for $x \ne x^0$. There is a $c_0 \in \mathbb{R}^+$ such that $\|x^0\| < c_0$ and $\sup_{\|x\| = c_0} L(x) < L(x^0)$.

H2. The two sequences of strictly positive real numbers $(a_k,\ k \in \mathbb{N})$ and $(c_k,\ k \in \mathbb{N})$ satisfy $c_k \to 0$ as $k \to \infty$, $\sum_{k=1}^\infty a_k = \infty$, and there is a $p \in (1, 2]$ such that $\sum_{k=1}^\infty a_k^p < \infty$.

Remark. If $L$ is twice continuously differentiable, then $f$ is locally Lipschitz continuous. If in H1 $x^0$ is the unique minimum of $L$, then in (11), (12) $a_k$ should be replaced by $-a_k$.

The following theorem gives necessary and sufficient conditions for the convergence of the algorithm (11).

Theorem 1. Let H1 and H2 be satisfied and $(x_k,\ k \in \mathbb{N})$ be given by (11). The sequence $(x_k,\ k \in \mathbb{N})$ satisfies

$$\lim_{k \to \infty} x_k = x^0 \quad \text{a.s.} \qquad (13)$$

if and only if the observation noise $\xi_k$ in (10) can be decomposed into the sum of two parts for each $j \in \{1, \dots, \ell\}$ as

$$\xi_k^j = e_k^j + \nu_k^j \qquad (14)$$

such that

$$\sum_{k=1}^{\infty} \frac{a_k}{c_k}\, e_{k+1}^j < \infty \quad \text{a.s.} \qquad (15)$$

and

$$\lim_{k \to \infty} \nu_k^j = 0 \quad \text{a.s. for } j = 1, \dots, \ell. \qquad (16)$$

While Theorem 1 gives necessary and sufficient conditions for the convergence of the KW algorithm (11), it may not be apparent whether there are useful noise processes that satisfy (14)-(16). The following theorem gives a large family of noise processes that satisfy (14)-(16).

Theorem 2. Let H1 and H2 be satisfied. If

$$\sum_{k=1}^{\infty} \left(\frac{a_k}{c_k}\right)^2 < \infty$$

and the observation noise $(\xi_k,\ k \in \mathbb{N})$ is independent of $(\Delta_k,\ k \in \mathbb{N})$ and satisfies one of the following two conditions:

i) $\sup_k |\xi_k| \le \zeta$ a.s. where $\zeta$ is a random variable;

ii) $\sup_k E\,\xi_k^2 < \infty$,

then

$$\lim_{k \to \infty} x_k = x^0 \quad \text{a.s.} \qquad (17)$$

where $x_k$ is given by (11).

It is important to note that the random variable $\xi_k$ may have arbitrary dependence on the family of random variables $(\xi_j;\ j \in \mathbb{N},\ j \ne k)$ and may not be zero mean. For example, a sequence of bounded deterministic observation errors satisfies condition i) or ii).

Remark. Theorems 1 and 2 can be extended to the case where $f(x) = 0$ for all $x \in J$ and $J$ is not a singleton. In this case H1 is replaced by some conditions on $v$, where $v = -L$ and $f$ is locally Lipschitz continuous [1, 2].
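To illustrate Theorem 2, the driver below (entirely our construction: the quadratic $L$, the bounded noise, and the sequences $a_k = 1/k$, $c_k = k^{-0.4}$ are assumptions chosen so that H2 and condition i) hold, with $\sum_k (a_k/c_k)^2 = \sum_k k^{-1.2} < \infty$) reuses `truncated_kw` from the sketch after (12):

```python
import numpy as np

x_star = np.array([1.0, -2.0])   # unique maximum of the test function

def L_noisy(x, rng):
    # Bounded observation noise: satisfies condition i) with zeta = 0.5.
    return -np.sum((x - x_star) ** 2) + rng.uniform(-0.5, 0.5)

x_final, n_trunc = truncated_kw(L_noisy, np.zeros(2),
                                a=lambda k: 1.0 / k,
                                c=lambda k: k ** -0.4,
                                n_iter=100000)
print(x_final, n_trunc)   # x_final should be close to x_star = (1, -2)
```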

While necessary and sufficient conditions for the convergence of the algorithm (11) are important, it is also important to have some information on the rate of convergence of the algorithm. In the following theorem a rate of convergence of the algorithm (11) is given.

Theorem 3. Assume the hypotheses of Theorem 2 and that

$$\sum_{k=1}^{\infty} \frac{a_k^2}{c_k^2} < \infty, \qquad (18)$$

$$c_k = o(a_k^\delta) \qquad (19)$$

for some $\delta \in (0, 1)$, and

$$f(x) = F(x - x^0) + \delta(x) \qquad (20)$$

where $F \in L(\mathbb{R}^\ell, \mathbb{R}^\ell)$ and $\delta(x) = o(\|x - x^0\|)$. Then

$$\|x_n - x^0\| = o(a_n^\delta) \quad \text{a.s.} \qquad (21)$$

for $\delta$ given in (19).

Remark. If $a_n = 1/n$ and $c_n = 1/n^\nu$ for some $\nu \in (0, \frac{1}{2})$ and all $n \in \mathbb{N}$, then the conditions on $(a_n,\ n \in \mathbb{N})$ and $(c_n,\ n \in \mathbb{N})$ in Theorem 3 are satisfied. Indeed, $\sum_n a_n^2/c_n^2 = \sum_n n^{2\nu - 2} < \infty$ for $\nu < \frac{1}{2}$, and $c_n = o(a_n^\delta)$ for any $\delta \in (0, \nu)$.

The following result is an asymptotic normality property of $(x_n,\ n \in \mathbb{N})$ given by the algorithm (11).

Theorem 4. Assume that the conditions of Theorem 2 are satisfied and that

i) $\lim_{n \to \infty} (a_{n+1}^{-1} - a_n^{-1}) = \alpha > 0$ and $c_n = a_n^\gamma$ for some $\gamma \in (\frac{1}{4}, \frac{1}{2})$;

ii) $\|f(x) - F(x - x^0)\| \le b\, \|x - x^0\|^{1+\beta}$ for some $\beta > 0$ and $b > 0$;

iii) $F + \alpha\gamma I$ is stable;

iv) $\xi_n = \sum_{i=0}^{r} b_i w_{n-i}$, where $w_i = 0$ for $i < 0$, $(b_i,\ i \in \mathbb{N})$ is a sequence of real numbers, $r \in \mathbb{N}$ is fixed, and $(w_i, \mathcal{F}_i^w,\ i \in \mathbb{N})$ is a martingale difference sequence that satisfies $E[w_i^2 \mid \mathcal{F}_{i-1}^w] \le \sigma_0$ for all $i \in \mathbb{N}$, where $\sigma_0 \in \mathbb{R}^+$,

$$\lim_{i \to \infty} E[w_i^2 \mid \mathcal{F}_{i-1}^w] = \sigma^2$$

where $\sigma^2 \in \mathbb{R}^+$, and

$$\lim_{N \to \infty} \sup_{i \in \mathbb{N}} E[w_i^2\, 1_{[|w_i| > N]}] = 0.$$

Then

$$a_n^{-\gamma}(x_n - x^0) \xrightarrow{d} Z$$

where $Z$ is $N(0, S)$,

$$S = \sigma^2 \sigma_g^2 \left(\sum_{i=0}^{r} b_i\right)^2 \int_0^\infty e^{t(F + \alpha\gamma I)}\, e^{t(F + \alpha\gamma I)^T}\, dt,$$

$\sigma_g^2 = E[(\Delta_1^1)^{-2}]$, and $r$ is given in iv).
3. CONCLUSION

A classical Kiefer-Wolfowitz algorithm has been modified in two ways: i) one-sided randomized differences are used instead of the two-sided deterministic differences, and ii) the estimates are truncated at randomly varying bounds. For the convergence analysis, a direct method is used rather than the classical probabilistic method or the ordinary differential equation method. By the algorithm modifications i) and ii) and a different approach to the algorithm analysis, the following improvements have been made: i) some restrictive conditions on the function L or some boundedness assumptions on the estimates have been removed, ii) some restrictive conditions on the noise process have been removed, and iii) the number of required observations at each iteration has been reduced. The algorithm has been numerically tested on a number of examples to verify the convergence to the maximum. If the function L has many extrema, then the algorithm may become stuck at a local extremum. To obtain almost sure convergence to the global extrema, some methods that combine search and stochastic approximation are needed, but it seems that there is a lack of a sufficient theory for this approach.


REFERENCES


Chen, H, F" T. E. Duncan and B. Pasik-Duncan, A Kiefer-Wolfowitz algorit.hm with randomized differences, preprint. Kiefer, K and J. Wolfowitz (1952). Stochastic approximation of a regression function, Ann. Math. Stat., 23, 462-466.

Kushner, H. J. and D. A. Clark (1978). Stochastic Approximation Methods Jor Constrained and Unconstrained Systems. Springer, New York. Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation} IEEE Trans. Autom. ControI, 37, 332-341.
