Copyright © IFAC Identification and System Parameter Estimation, Beijing, PRC 1988
STRONG CONSISTENCY OF A PARALLEL STOCHASTIC APPROXIMATION ALGORITHM WITH NON-ADDITIVE NOISE

Y. M. Zhu* and G. Yin**

*Institute of Mathematical Sciences, Academia Sinica, Chengdu Branch, Sichuan 610015, PRC
**Department of Mathematics, Wayne State University, Detroit, MI 48202, USA
Abstract. In this paper, a parallel stochastic approximation algorithm is considered. The essence of this algorithm is to bring the potential of parallel processing and asynchronous communication into full play. Instead of using a single serial processor, a collection of processors is used in such a way that the processors operate in parallel, computing and communicating asynchronously and at random times. The scope of this investigation is to extend the strong convergence result to the non-additive noise case. Under rather broad conditions, strong consistency is obtained via the method of randomly varying truncations.
Keywords. Parallel processing, stochastic approximation, strong consistency, non-additive noise, randomly varying truncations.
1. INTRODUCTION

It is well known that the methods of stochastic approximation have been widely employed in systems theory and applications, such as adaptive control, Monte Carlo optimization, system identification, estimation and detection. The pioneering work of Robbins and Monro (1951) stimulated widespread interest, and in the decades since that first paper was published, many elegant and significant contributions have been added to the literature, for example Ljung (1977), Kushner and Clark (1978), Kushner (1984) and Chen (1985), among others.

Because of the popularity of parallel processing methods, problems of designing parallel computation schemes arise naturally. Although there is a vast literature on parallel processing for deterministic algorithms, research on parallel and decentralized stochastic approximation algorithms is relatively new. Recently, Tsitsiklis (1984), Kushner and Yin (1987a, 1987b), Yin and Zhu (1987) and Zhu and Yin (1987), among others, proposed and studied several such algorithms. Their work opens up many new possibilities.

Currently, it is common practice to use a stochastic approximation algorithm to locate the zeros of a regression function from noise-corrupted observations, or to carry out Monte Carlo optimization. But if the dimension of the state space is large, a huge amount of memory and computation may be needed, which in turn raises the cost and slows down the computation. In addition, the classical stochastic approximation algorithm is centralized in nature. However, synchronization may be rather time consuming, since it may cause serious time delays. Besides, synchronous computation is not real-time implementable (cf. Li and Basar, 1987). Therefore, exchanging information in a centralized and synchronized fashion might not be desirable. Moreover, there may be limitations on the amount of communication allowed between distinct processors. Even if there were infinite capacity for communication, centralization might still be burdensome, because no single processor may have the capability to solve the overall problem by itself. To overcome these difficulties, Kushner and Yin (1987b) suggested a parallel stochastic approximation algorithm, and obtained weak convergence and rate of convergence results. Later, Yin and Zhu (1987) specialized the proposed model to a parallel RM-like algorithm with additive noise, and proved strong consistency. The essence of both papers is to take advantage of the methods of parallel processing and asynchronous communication. By splitting a large problem into small pieces via state space decomposition, several parallel processors compute in a cooperative and asynchronous way and at random times. The model is quite realistic: it is highly parallelized and asynchronized.

The present paper is an extension of Yin and Zhu (1987). We concern ourselves with the parallel stochastic approximation algorithm in a more general setting, namely the aforementioned algorithm with non-additive noise. The non-additive noise case is certainly non-vacuous in applications (cf. Kushner (1981) for the corresponding centralized procedures). In Kushner and Yin (1987b), the asymptotic behavior of the algorithm was studied in detail by the methods of weak convergence, and the point near which the iterates spend most of their time was found. However, there is still considerable interest in obtaining strong consistency for such an algorithm.

To get a w.p.1 convergence result, one normally assumes that some compact set in the domain of attraction of a stable point of a certain ODE is entered infinitely often, or requires the growth rate of the function under consideration to be of a particular form (linear growth, for instance). Since the "boundedness" of the algorithm seems to be crucial, when implementing such algorithms some kind of projection or truncation is usually used, which in turn requires prior knowledge of the bounded region. To overcome these difficulties and relax the conditions, a new method was introduced in Chen and Zhu (1986). Here, we adopt this method and use randomly varying truncations to prove strong convergence for the parallel algorithm. We organize the rest of the paper in the following way. The problem formulation and the conditions are given next. The main theorem is presented in Section 3. Finally, some concluding remarks are made in Section 4.
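To fix ideas, the classical (centralized) Robbins-Monro scheme referred to above can be sketched in a few lines. The regression function and the additive Gaussian noise below are hypothetical choices for illustration only; they are not taken from the paper.

```python
import random

def robbins_monro(b, x0, n_iters=20000, seed=0):
    """Classical Robbins-Monro iteration x_{n+1} = x_n + (1/n) * Y_n,
    where Y_n = b(x_n) + noise is a noise-corrupted observation of b."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_iters + 1):
        y = b(x) + rng.gauss(0.0, 1.0)   # noisy observation of b at the current iterate
        x = x + (1.0 / n) * y            # diminishing step size eps_n = 1/n
    return x

# Locate the zero of the (assumed) regression function b(x) = -(x - 2),
# whose unique root is x = 2.
root = robbins_monro(lambda x: -(x - 2.0), x0=0.0)
```

With step sizes $\varepsilon_n = 1/n$, the iterates converge to the root w.p.1 under standard conditions; the parallel algorithm below distributes such a recursion over several asynchronous processors.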
2. PROBLEM FORMULATION

Let there be $r$ processors, each of which controls $p_i$ components of the state vector $x$, $i \le r$. Let $x = (x^1, \ldots, x^r)'$, where $x^i$, $i \le r$, denotes the $p_i$-dimensional subvector controlled by processor $i$. As a matter of fact, from the analysis point of view, there is really no loss of generality in letting $p_i = 1$ for each $i \le r$. Henceforth, we assume that each processor controls one component of the state vector, i.e., the $i$th processor controls the $i$th component of the state vector.

For each $i \le r$, let processor $i$ take $\tau^i_j$ units of time to complete the $j$th iteration, where $\{\tau^i_j\}$ is a sequence of bounded, positive, integer-valued random variables, which may depend on the state and the noise, $\tau^i_n = \tau^i_n(x^i_{n-1}, \xi^i_n)$. Define $\gamma^i_n$ by

$$\gamma^i_n = \sum_{j=1}^{n} \tau^i_j.$$

$\{\gamma^i_n\}$ is a sequence of random computation times. It is quite similar to the conventional renewal process; however, we do not require the $\tau^i_n$ to be i.i.d. random variables. Let $\xi^i_n$ be the noise incurred in the $n$th iteration for processor $i$. Let $x_1 = (x^1_1, \ldots, x^r_1)'$ be the initial value and $x^i_n$ be the value of the $i$th component of the state at the end of the $n$th iteration (or $n$th processing time).

To proceed, we make the following definitions:

$$N^i(n) = \sup\{k;\ \gamma^i_k \le n\}, \qquad I^i_n = I_{\{\Delta^i_n = 0\}},$$

where $\Delta^i_n = n - \gamma^i_{N^i(n)}$. In renewal language, $N^i(n)$ is called the counting process: it counts the number of events or renewals (i.e., the number of iterations or computation times for processor $i$) up to time $n$. $\Delta^i_n$ is the so-called "age" or "current life" process, which represents the time elapsed since the last iteration. If $\Delta^i_n = 0$, it says nothing but that $n$ is a random computation time. To keep track of all the state values used in the $r$ processors at each time $n$, we define an augmented vector $\hat x_n = (\hat x^1_n, \ldots, \hat x^r_n)'$, where for each $i$ and $n \in [\gamma^i_j, \gamma^i_{j+1})$,

$$\hat x^i_n = x^i_j, \qquad \text{i.e.,} \qquad \hat x_n = (x^1_{N^1(n)}, \ldots, x^r_{N^r(n)})'. \tag{2.2}$$

For appropriate $b(\cdot,\cdot)$ (precise conditions for this function will be given in the sequel) and $\varepsilon_n = \frac{1}{n}$, the basic algorithm to be considered is

$$x^i_{n+1} = x^i_n + \varepsilon_n I^i_{n+1} b^i(\hat x_n, \xi_n), \qquad i \le r. \tag{2.1}$$

With these notations, writing

$$b(\hat x_n, \xi_n) = (b^1(\hat x_n, \xi_n), \ldots, b^r(\hat x_n, \xi_n))', \qquad I_n = \mathrm{diag}(I^1_n, \ldots, I^r_n),$$

we can rewrite (2.1) in vector form as

$$x_{n+1} = x_n + \varepsilon_n I_{n+1} b(\hat x_n, \xi_n). \tag{2.3}$$

Remark: Compared with the classical stochastic approximation algorithm, the algorithm here has a highly parallel nature. Starting with the initial value, new values of $x^i$ are computed based on the most recently determined values of $x^j$, $j \le r$. The newly computed values are passed to all other processors as soon as they are available. In fact, we could also allow some delays, as long as these delays are bounded.

Since each processor takes a different random time to complete each iteration, the "iteration number" is no longer a good time indicator; the asynchronization causes much of the difficulty.

In practice, when one works with stochastic approximation algorithms, some sort of truncation scheme is often used to assure the boundedness of the algorithm. However, in the usual projection algorithm an a priori bounded region is normally assumed, which, as we commented before, may be somewhat restrictive. Using the idea of random truncations, we shall modify our algorithm a little, and define and work with a truncated version of the parallel stochastic approximation algorithm. A word about notation: in the following, we shall use $K$ to denote a constant, whose value may change from usage to usage. To start with, we assume the following conditions.

(A1) For each $i \le r$, $\{\tau^i_n\}$ is bounded and

$$E(\tau^i_n - \mu^i_n \mid \tau^i_l - \mu^i_l,\ l < n) = 0,$$

where $\{\mu^i_n\}$ is a sequence of random variables such that $\mu^i_n \to \mu^i$ a.s., with $\mu^i$ a constant,

$$\sum_{n=1}^{\infty} n^{\frac{1}{2}(1-\alpha)} (\mu^i_n - \mu^i_{n-1}) \quad \text{converges a.s.,}$$

and

$$E(\tau^i_n - \mu^i_n)^2 \le \zeta_n, \qquad \sum_{n=1}^{\infty} \frac{\zeta_n}{n^{1+\alpha}} < \infty,$$

with $0 < \alpha < 1$ and $\{\zeta_n\}$ a sequence of positive real numbers.
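Before stating the remaining conditions, the asynchronous recursion (2.1)-(2.3) can be illustrated by a small simulation. The function $b$, the multiplicative (non-additive) noise, and the computation-time law $\tau \in \{1,2,3\}$ (uniform, hence bounded as in (A1), with $\mu^i = 2$) are hypothetical choices made for this sketch only.

```python
import random

def parallel_sa(r=2, horizon=50000, seed=1):
    """Simulate x^i_{n+1} = x^i_n + eps_n * I^i_{n+1} * b^i(xhat_n, xi_n):
    each processor i completes an iteration every tau in {1,2,3} clock ticks,
    updating its own component from the most recent values of all components."""
    rng = random.Random(seed)
    target = [1.0, -2.0]                 # hypothetical zeros of bbar
    x = [0.0] * r                        # state components x^i (shared vector xhat)
    next_done = [rng.randint(1, 3) for _ in range(r)]  # next computation time gamma^i
    for n in range(1, horizon + 1):
        eps = 1.0 / n
        for i in range(r):
            if n == next_done[i]:        # I^i_{n+1} = 1: a renewal for processor i
                xi = rng.gauss(0.0, 1.0)                     # noise entering b^i non-additively
                bi = -(x[i] - target[i]) * (1.0 + 0.1 * xi)  # observation b^i(xhat_n, xi_n)
                x[i] += eps * bi
                next_done[i] = n + rng.randint(1, 3)         # draw tau^i for the next iteration
    return x

x_sim = parallel_sa()
```

The processors never synchronize: each wakes up at its own random computation times and uses whatever component values are currently available, exactly the bookkeeping that the counting and age processes formalize.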
(A2) $b(\cdot, \xi)$ is continuous uniformly in $\xi$.

(A3) For each $i \le r$, there is a continuous function $\bar b^i(\cdot)$ such that for each $x$, each $n$, and each $m$ with $n \le m \le \infty$,

$$\Big|\sum_{j=n}^{m} \varepsilon_j [\,b^i(x, \xi_j) - \bar b^i(x)\,]\Big| \le K \varepsilon_n (1 + |\bar b^i(x)|).$$

Remark: Write $\bar b(x) = (\bar b^1(x), \ldots, \bar b^r(x))'$. (A3) implies that the following is true:

$$\Big|\sum_{j=n}^{m} \varepsilon_j I_{j+1} [\,b(x, \xi_j) - \bar b(x)\,]\Big| \le K \varepsilon_n (1 + |\bar b(x)|).$$

(A4) There are a twice continuously differentiable function $v(\cdot)$, a real number $M_0 > 0$, and a point $\bar x \in \{x;\ |x| < M_0\}$ such that

(i) $v_x'(x)\, D\, \bar b(x) < 0$ for all $x \notin E$, where $E = \{x \in R^r;\ \bar b(x) = 0\}$ and $D = \mathrm{diag}\big(\tfrac{1}{\mu^1}, \ldots, \tfrac{1}{\mu^r}\big)$;

(ii) $v(\bar x) < \inf\{v(x);\ |x| = M_0\} = d$;

(iii) $[v(\bar x), d] \cap v(E) \ne [v(\bar x), d]$.

Remark: Note that (A1) allows the random computation times to depend on the state as well as on the noise. (A2) and (A3) do not seem to be restrictive; similar kinds of conditions were also used in Kushner (1981) for centralized stochastic approximation algorithms. However, we do not need any restriction on the rate of growth of $b(\cdot,\cdot)$ with respect to the state variables.

Let $\{M_n\}$ be a monotone increasing sequence of positive real numbers, such that $M_n \to \infty$ as $n \to \infty$. Define $\sigma_n$ by $\sigma_0 = 0$ and

$$\sigma_{n+1} = \sigma_n + I_{\{|x_n + \varepsilon_n I_{n+1} b(\hat x_n, \xi_n)| > M_{\sigma_n}\}}, \tag{2.5}$$

and write the truncated version of the algorithm as

$$x_{n+1} = \big(x_n + \varepsilon_n I_{n+1} b(\hat x_n, \xi_n)\big)\, I_{\{|x_n + \varepsilon_n I_{n+1} b(\hat x_n, \xi_n)| \le M_{\sigma_n}\}} + \bar x\, I_{\{|x_n + \varepsilon_n I_{n+1} b(\hat x_n, \xi_n)| > M_{\sigma_n}\}}. \tag{2.6}$$

Now we have completed the formulation, and we are ready to present the main theorem.

3. THE CONVERGENCE RESULT

Theorem 3.1: If (A1)-(A4) hold, then for any initial condition $x_1$ and any $x_n$ defined by (2.6), we have

$$\lim_n d(x_n, E) = 0 \qquad \text{w.p.1,}$$

where $d(\cdot, E)$ denotes the distance from a point to the set $E$.

The proof of the theorem consists of three steps:

1. Show the boundedness of the algorithm.
2. Show that a continuous interpolation of $x_n$ converges to an ordinary differential equation.
3. Complete the proof by applying a stability argument.

To proceed, we shall prove a series of lemmas.

Lemma 1: Under (A1),

$$\frac{1}{n}\gamma^i_n - \mu^i = O\big(n^{-\frac{1}{2}(1-\alpha)}\big) \quad \text{a.s.,} \tag{3.1}$$

where $\alpha$ is the same as appeared in (A1).

Proof: Define $m^i_n = \frac{1}{n}\gamma^i_n$; we can then write $m^i_n$ as

$$m^i_n = \frac{1}{n}\sum_{j=1}^{n} (\tau^i_j - \mu^i_j) + \frac{1}{n}\sum_{j=1}^{n} \mu^i_j. \tag{3.2}$$

By (A1), it can be shown that

$$\sum_n \frac{(n+1)^{\beta}}{n+1}\,(\tau^i_{n+1} - \mu^i_{n+1}) \quad \text{converges a.s.,}$$

and also

$$\sum_n (n+1)^{\beta} (\mu^i_{n+1} - \mu^i_n) \quad \text{converges a.s.,}$$

where $\beta = \frac{1}{2}(1-\alpha)$. Hence $(n+1)^{\beta}(\mu^i_{n+1} - \mu^i_n) = o(1)$, and therefore

$$n^{\beta}(m^i_n - \mu^i) \to 0 \quad \text{as } n \to \infty.$$

The lemma is proved. Note that $0 < \mu^i < \infty$.

Since $\gamma^i_{N^i(n)} \le n < \gamma^i_{N^i(n)+1}$,

$$\frac{\gamma^i_{N^i(n)}}{N^i(n)} \le \frac{n}{N^i(n)} < \frac{\gamma^i_{N^i(n)+1}}{N^i(n)+1} \cdot \frac{N^i(n)+1}{N^i(n)};$$

these, together with Lemma 1, yield the following.

Corollary: Under (A1),

$$\frac{N^i(n)}{n} \to \frac{1}{\mu^i} \quad \text{a.s.,} \qquad \text{and} \qquad \frac{N^i(n)}{n} - \frac{1}{\mu^i} = O\big(n^{-\frac{1}{2}(1-\alpha)}\big). \tag{3.3}$$

We are now in a position to prove the boundedness of the algorithm.

Lemma 2: Assume (A1)-(A4); then $\{x_n\}$ defined by (2.6) is bounded w.p.1.

Remark: By monotonicity, $\sigma_n \to \sigma$ w.p.1, where either $\sigma$ is finite or $\sigma = \infty$. If $\sigma < \infty$, then for fixed $\omega$ there exists an $\bar n$ such that for any $n > \bar n$,

$$x_{n+1} = x_n + \varepsilon_n I_{n+1} b(\hat x_n, \xi_n),$$

and hence $x_n$ is bounded for any $n > \bar n$. Thus it suffices to show that $\sigma < \infty$ w.p.1.

To prove the lemma, we establish the following microlemmas.
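Before turning to the microlemmas, the randomly varying truncation device (2.5)-(2.6) may be illustrated schematically in the scalar case. The regression function $\bar b(x) = -(x-3)$, the reset point $\bar x = 0$, the noise level, and the bounds $M_s = 2^s$ are all hypothetical choices for this sketch.

```python
import random

def truncated_step(x, eps, obs, sigma, M, x_reset):
    """One step of (2.5)-(2.6): accept the tentative iterate if it lies
    inside the current truncation bound M(sigma); otherwise reset the
    state to x_reset and enlarge the bound (sigma -> sigma + 1)."""
    tentative = x + eps * obs
    if abs(tentative) <= M(sigma):
        return tentative, sigma       # no truncation at this step
    return x_reset, sigma + 1         # truncate and grow the bound

def run(n_iters=5000, seed=2):
    rng = random.Random(seed)
    M = lambda s: 2.0 ** s            # expanding bounds M_s -> infinity
    x, sigma = 50.0, 0                # deliberately bad initial point
    for n in range(1, n_iters + 1):
        # noisy observation of bbar(x) = -(x - 3), whose zero is x = 3
        obs = -(x - 3.0) + 0.1 * rng.gauss(0.0, 1.0)
        x, sigma = truncated_step(x, 1.0 / n, obs, sigma, M, x_reset=0.0)
    return x, sigma

x_final, sigma_final = run()
```

The bound sequence grows only when a truncation occurs, so no prior bounded region need be known: a bad initial guess forces finitely many resets, after which the untruncated recursion runs unchanged. This is precisely the mechanism behind Lemma 2.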
Microlemma 1: Under condition (A3), suppose $\hat x_n$ is bounded whenever $x_n$ is. If $\{x_{n'}\}$ is a convergent subsequence of $\{x_n\}$, then there exist constants $K > 0$, $\delta > 0$, such that for any $\eta > 0$ satisfying $0 < \eta < \delta$, there is an $n_\eta$ such that

$$\Big|\sum_{j=n'}^{m} \varepsilon_j I_{j+1} b(\hat x_j, \xi_j)\Big| \le K\eta$$

for all $n' > n_\eta$ and all $m$ with $n' \le m \le m(n', \eta)$, where

$$m(n, \eta) = \max\Big\{m;\ \sum_{j=n}^{m} \varepsilon_j \le \eta\Big\}.$$

Microlemma 2: Under (A2) and the conditions of Microlemma 1, there exist $\delta > 0$, $K > 0$ and $n_\eta$ such that for any $\eta < \delta$ and any $m$ satisfying $n' \le m \le m(n', \eta) + 1$, we have

$$|x_m - x_{n'}| \le K\eta.$$

The ideas of the proofs of these microlemmas are essentially in Chen and Zhu (1986); we omit them here.

Now, we prove Lemma 2 by contradiction. Suppose the contrary, $\sigma = \infty$; then, starting from $\bar x$, $x_n$ would cross the sphere $\{x;\ |x| = M_0\}$ infinitely often. By (A4) there exists an interval $[\delta_1, \delta_2] \subset [v(\bar x), d]$ with $\delta_1 \ne v(\bar x)$ and $[\delta_1, \delta_2] \cap v(E) = \emptyset$. If $x_n$ are the points which start from $\bar x$ and have not yet crossed the sphere $\{x;\ |x| = M_0\}$, then $|x_n| \le M_0$ implies $|\hat x_n| \le M_0$, and by (A3), $|\varepsilon_n I_{n+1} b(\hat x_n, \xi_n)| \to 0$ a.s. Hence $v(x_n)$ would cross the interval $[\delta_1, \delta_2]$ infinitely often from the left. It then follows that there exist subsequences $\{x_{n'}\}$ and $\{x_{m'}\}$ of $\{x_n\}$ satisfying

(I) $v(x_{n'-1}) < \delta_1$, and $\delta_1 \le v(x_j) \le \delta_2$ for $n' \le j \le m' - 1$;

(II) $v(x_{n'}) \to \delta_1$.

Define

$$F = \{x;\ \delta_1 \le v(x) \le \delta_2\} \cap \{x;\ |x| \le M_0\}.$$

It can be seen that $F$ is closed. Let $\{x_k\}$ be a convergent subsequence of $\{x_{n'}\}$ with limit $\tilde x$; then $v(\tilde x) = \delta_1$, $\tilde x \in F$, and hence $\tilde x \notin E$. By the assumptions, Microlemma 2, and the continuity of $v_x(\cdot)$, for $k \ge k_\eta$ we have

$$
v(x_{m(k,\eta)+1}) - v(x_k) = v_x'(\tilde x)(x_{m(k,\eta)+1} - x_k) + (x_k - \tilde x)' v_{xx}(\cdot)(x_{m(k,\eta)+1} - x_k) + \cdots, \tag{3.7}
$$

and

$$v_x'(\tilde x)(x_{m(k,\eta)+1} - x_k) = v_x'(\tilde x) \sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1} b(\hat x_j, \xi_j), \tag{3.8}$$

where

$$
\begin{aligned}
\sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1} b(\hat x_j, \xi_j)
&= \Big[\sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1}\Big] \bar b(\tilde x) \\
&\quad + \sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1} [\,b(\tilde x, \xi_j) - \bar b(\tilde x)\,] \\
&\quad + \sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1} [\,b(\hat x_j, \xi_j) - b(\tilde x, \xi_j)\,].
\end{aligned} \tag{3.4}
$$

To accomplish our goal, we assert:

Microlemma 3:

$$\sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1} \to \eta\, \mathrm{diag}\Big(\frac{1}{\mu^1}, \ldots, \frac{1}{\mu^r}\Big) \quad \text{a.s.}$$

If the assertion is true, then the first term on the right-hand side of (3.4), multiplied by $v_x'(\tilde x)$, will be less than $-\lambda\eta$ for some $\lambda > 0$, by (A4). The second term tends to 0 by (A3). For the third term, note that $x_k \to \tilde x$ as $k \to \infty$ and, by Microlemma 2, $|x_m - x_k| \le K\eta$ uniformly in $k \ge k_\eta$; consequently,

$$\limsup_k v(x_{m(k,\eta)+1}) - \delta_1 \le -\lambda\eta + \limsup_k |v_x(\tilde x)| \max_{k \le j \le m(k,\eta)} |b(\hat x_j, \xi_j) - b(\tilde x, \xi_j)|\, \eta.$$

Hence, by (A2) and the arbitrariness of $\eta$,

$$\limsup_k v(x_{m(k,\eta)+1}) < \delta_1. \tag{3.5}$$

On the other hand, if $\eta$ is sufficiently small, then for any $m$ with $k \le m \le m(k,\eta)+1$ we have $v(x_m) < \delta_2$, so that from (I) and (II), $\delta_1 \le v(x_{m(k,\eta)+1}) \le \delta_2$, which contradicts (3.5).

Thus we need only prove Microlemma 3. Note

$$\sum_{j=k}^{m(k,\eta)} \varepsilon_j I_{j+1} = \sum_{j=k}^{m(k,\eta)} \varepsilon_j\, \mathrm{diag}(I^1_{j+1}, \ldots, I^r_{j+1}),$$

hence we need only show, for each $i \le r$,

$$\sum_{j=k}^{m(k,\eta)} \varepsilon_j I^i_{j+1} \to \frac{\eta}{\mu^i} \quad \text{a.s.}$$

To this end, let

$$t_n = \sum_{j=1}^{n-1} \varepsilon_j, \qquad m(t) = \sup\{n;\ t_n \le t\}.$$

Choose a sequence $\delta_k$ such that $0 < \delta_k \to 0$ and

$$\sup_{j \ge k} \frac{j^{-\frac{1}{2}(1-\alpha)}}{\delta_k} \to 0 \quad \text{as } k \to \infty,$$

where $\alpha$ is the same as in (A1). Note that, apart from boundary terms, the sum in question runs over the renewal epochs of processor $i$ in $[k, m(k,\eta)]$, i.e., over $l \in [N^i(k), N^i(m(k,\eta)))$, and that it can be written as

$$\int_{t_k}^{t_k + \eta} g^i_k(s)\, ds \tag{3.9}$$

for a suitable piecewise constant function $g^i_k(\cdot)$. We claim that

$$g^i_k(s) \to \frac{1}{\mu^i}. \tag{3.10}$$

If this is true, then applying the Bounded Convergence Theorem in (3.9) and passing to the limit, we get the desired result. Dividing $[N^i(k), N^i(m(k,\eta)))$ into subintervals, for the reason of saving notation, we define
$$m^- = m(t_k + l\delta_k), \qquad m^+ = m(t_k + l\delta_k + \delta_k), \qquad N^-_i = N^i(m^-), \qquad N^+_i = N^i(m^+).$$

Examining $g^i_k(\cdot)$ on the interval $[N^-_i, N^+_i)$, we can write it as the product of three terms: the ratio $(N^+_i - N^-_i)/(m^+ - m^-)$; an averaged ratio of step sizes,

$$\frac{1}{N^+_i - N^-_i}\sum_{l=N^-_i}^{N^+_i - 1} \varepsilon_{\gamma^i_l} \Big/ \frac{1}{m^+ - m^-}\sum_{j=m^-}^{m^+ - 1} \varepsilon_j;$$

and the term $\frac{1}{\delta_k}\sum_{j=m^-}^{m^+ - 1} \varepsilon_j$. Call this decomposition (3.11).

The last term on the right of (3.11) tends to 1 as $k \to \infty$ by the choice of $\delta_k$. For the first term on the right of (3.11), we have

$$\frac{N^+_i - N^-_i}{m^+ - m^-} = \frac{1}{\mu^i} + \frac{1}{m^+ - m^-}\Big[\Big(N^+_i - \frac{m^+}{\mu^i}\Big) - \Big(N^-_i - \frac{m^-}{\mu^i}\Big)\Big]. \tag{3.12}$$

By the Corollary of Lemma 1,

$$\frac{1}{m^+ - m^-}\Big|N^+_i - \frac{m^+}{\mu^i}\Big| \le K\, \frac{(m^+)^{-\frac{1}{2}(1-\alpha)}}{\delta_k} \to 0$$

by the choice of $\delta_k$; by the same token, we can treat the last term in (3.12). Thus the first term in (3.11) tends to $1/\mu^i$. The second term in (3.11) tends to 1, roughly speaking, by replacing $\varepsilon_{\gamma^i_l}$ and $\varepsilon_j$ by $\varepsilon_{\gamma^i_{m^-}}$ and $\varepsilon_{m^-}$, respectively. In fact, recalling $\gamma^i_n / n \to \mu^i$, the ratios $\varepsilon_{\gamma^i_l}/\varepsilon_{\gamma^i_{m^-}}$ and $\varepsilon_j/\varepsilon_{m^-}$ tend to 1 uniformly over the subinterval, so the error terms tend to 0 and the remaining term goes to 1. Thus (3.10) holds, and Microlemma 3 is proved. The argument above is crucial in our proof; we need to use this averaging procedure twice in the proof of Theorem 3.1.

Next, let $x^0(\cdot)$ be the piecewise linear interpolation of $\{x_n\}$ such that $x^0(t_n) = x_n$, and put

$$x^n(t) = x^0(t + t_n).$$

Fix $\omega$; $x_n$ is uniformly bounded by Lemma 2, hence $\{x^n(\cdot)\}$ is uniformly bounded. Let $\delta > 0$ be given; we have

$$x^n(t + \delta) - x^n(t) = x^0(t + t_n + \delta) - x^0(t + t_n).$$

Note that the boundedness of $\{x_n\}$ implies the boundedness of $\{\hat x_n\}$. As a consequence,

$$|x^n(t + \delta) - x^n(t)| \le K\delta$$

uniformly in $n$. Therefore $\{x^n(\cdot)\}$ is an equicontinuous family. Applying the Ascoli-Arzela lemma, we can extract a convergent subsequence $x^{n'}(\cdot)$, with limit denoted by $x(\cdot)$, such that $x^{n'}(\cdot) \to x(\cdot)$ uniformly on each bounded interval.

Lemma 3: Write $x = (x^1, \ldots, x^r)'$; then the limit $x(\cdot)$ satisfies the ODE

$$\dot x^i = \frac{1}{\mu^i}\, \bar b^i(x). \tag{3.13}$$

The proof is in the spirit of the proof of Microlemma 3 and the argument of the ODE approach; we omit it here.

Lemma 4: Under (A1)-(A4), $\lim_n d(x_n, E) = 0$ w.p.1.

Proof: The function $v(\cdot)$ in (A4) can serve as a Liapunov function for the ODE (3.13). By the well-known stability theory,

$$\lim_{t \to \infty} d(x(t), E) = 0. \tag{3.14}$$

Since $x^{n'}(\cdot) \to x(\cdot)$ uniformly on any bounded interval, we can select a subsequence $\{x_{n_k}\}$ of $\{x_n\}$ such that

$$\lim_k d(x_{n_k}, E) = 0. \tag{3.15}$$

If $\lim_n d(x_n, E) \ne 0$, we would have another subsequence $x_m \to \tilde x \notin E$. Define

$$\rho = \inf\{|v(x) - v(\tilde x)|;\ x \in E\} > 0.$$

Since $E$ is closed and $\{x_n\}$ is bounded, for any $x \in E$, either

$$v(\tilde x) + \rho \le v(x) \qquad \text{or} \qquad v(x) \le v(\tilde x) - \rho.$$

Suppose $v(\tilde x) + \rho \le v(x)$ for any $x \in E$; the continuity of $v(\cdot)$, the boundedness of $\{x_n\}$, (3.14), (3.15) and the above inequality then imply that $v(\tilde x) + \frac{2}{3}\rho \le v(x_n)$ along the subsequence approaching $E$, and hence $\{v(x_n)\}$ would cross the interval

$$\Big[v(\tilde x) + \frac{1}{3}\rho,\ v(\tilde x) + \frac{2}{3}\rho\Big]$$
infinitely often. A similar argument as in the proof of Lemma 2 yields a contradiction. The case $v(x) \le v(\tilde x) - \rho$ can be handled analogously. Thus Lemma 4 is established, and the proof of Theorem 3.1 is concluded.

4. CONCLUSION

A stochastic approximation algorithm with non-additive noise and parallel structure is analyzed. A w.p.1 convergence result is obtained by truncating the iterates at randomly varying bounds. We expect the necessary and sufficient condition for convergence to be of the law-of-large-numbers type for both the noise and the random computation times, but careful analysis is needed for further consideration. Investigations of various stochastic models with parallel structure deserve further research. Problems of this kind are both interesting and challenging from the theoretical as well as the practical point of view.

REFERENCES

Chen, H.F. (1985). Recursive Estimation and Control for Stochastic Systems. John Wiley, New York.
Kushner, H.J. (1984). Approximation and Weak Convergence Methods for Random Processes. MIT Press, Cambridge, MA.

Kushner, H.J., and G. Yin (1987a). Asymptotic properties of distributed and communicating stochastic approximation algorithms. SIAM J. on Control and Optimization, 25, 1266-1290.

Kushner, H.J., and G. Yin (1987b). Stochastic approximation algorithms for parallel and distributed processing. Stochastics, 25, 219-250.

LaSalle, J.P. (1976). The Stability of Dynamical Systems. CBMS-NSF Regional Conference Series in Applied Mathematics, 25, SIAM.

Li, S., and T. Basar (1987). Asymptotic agreement and convergence of asynchronous stochastic algorithms. IEEE Trans. on Automatic Control, AC-32, 612-618.

Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans. on Automatic Control, AC-22, 551-575.

Robbins, H., and S. Monro (1951). A stochastic approximation method. Ann. Math. Statist., 22, 400-407.

Chen, H.F., and Y.M. Zhu (1986). Stochastic approximation procedures with randomly varying truncations. Scientia Sinica (Series A), Vol. XXIX, No. 9.

Tsitsiklis, J.N. (1984). Problems in Decentralized Decision Making and Computation. Ph.D. thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA.

Kushner, H.J., and D.S. Clark (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.

Yin, G., and Y.M. Zhu (1987). On w.p.1 convergence of a parallel stochastic approximation algorithm. LCDS #87-17, Brown Univ., Providence, RI.

Kushner, H.J. (1981). Stochastic approximation with discontinuous dynamics and state dependent noise: w.p.1 and weak convergence. J. Math. Anal. Appl., 82, 527-542.

Zhu, Y.M., and G. Yin (1987). Optimality of quasiconvex combinations for stochastic approximation algorithms with parallel observations. LCDS #87-18, Brown Univ., Providence, RI.