Journal of Statistical Planning and Inference 2 (1978) 153-163. © North-Holland Publishing Company
A STOCHASTIC NEWTON-RAPHSON METHOD

Dan ANBAR*
Case Western Reserve University and Tel-Aviv University

Received 26 July 1977
Recommended by Shelley Zacks

Abstract: A stochastic approximation procedure of the Robbins-Monro type is considered. The original idea behind the Newton-Raphson method is used as follows. Given n approximations X_1, ..., X_n with observations Y_1, ..., Y_n, a least squares line is fitted to the points (X_1, Y_1), ..., (X_n, Y_n), and the next approximation X_{n+1} is taken to be the point of intersection of this line with the x-axis. A truncated version of this process is shown to be strongly consistent and asymptotically efficient.
Key words: Robbins-Monro procedure, efficient estimation.
1. Introduction

The Newton-Raphson (NR) method for approximating the zero of a function is probably the best known approximation method. The idea is a very simple one. Let M(x) be a differentiable function and let θ be the unique solution of the equation M(x) = 0. Pick a point x_1 arbitrarily, draw the tangent line to M at x = x_1, and let x_2 be the point of intersection of this line with the x-axis. Continuing in this manner one obtains a sequence of points x_1, x_2, ..., where

    x_{n+1} = x_n - M(x_n)/M'(x_n),   n = 1, 2, ... .

When the function M satisfies some regularity conditions, the sequence {x_n} converges to the root θ.

The first attempt to consider a stochastic version of the NR method was made by Robbins and Monro (1951). They considered the case where M can only be observed statistically: at each point x one can observe a random variable Y(x) such that E Y(x) = M(x). They suggested the following procedure. Let x_1 be a real number. Let X_1 = x_1 and, for n = 1, 2, ...,

    X_{n+1} = X_n - a_n Y_n,

where Y_n is a random variable whose conditional distribution given (X_1 = x_1, X_2 = x_2, ..., X_n = x_n) equals the distribution of Y(x_n), and {a_n}, n = 1, 2, ..., is a sequence of real numbers. This process is known as the Robbins-Monro (RM) process. The analogy between the RM process and the NR method is clear: the function M at X_n is replaced by its estimate Y_n, and the sequence of derivatives by a fixed sequence of real numbers.
*Supported by the Office of Naval Research under Contract N00014-75-C-0529 at Case Western Reserve University.
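For concreteness, the two iterations just described can be sketched in a few lines of Python (an illustration added to this edition, not part of the original paper; the linear test function M, the gain constant A and the Gaussian noise are arbitrary choices):

```python
import random

def newton_raphson(M, M_prime, x1, n_steps):
    """Deterministic NR iteration: x_{n+1} = x_n - M(x_n)/M'(x_n)."""
    x = x1
    for _ in range(n_steps):
        x = x - M(x) / M_prime(x)
    return x

def robbins_monro(observe, x1, n_steps, A=1.0):
    """RM process with gains a_n = A/n: X_{n+1} = X_n - a_n * Y_n,
    where Y_n is a noisy observation of M at the current point."""
    x = x1
    for n in range(1, n_steps + 1):
        x = x - (A / n) * observe(x)
    return x

# Illustrative target: M(x) = 2(x - 1), so theta = 1 and M'(theta) = 2.
M = lambda x: 2.0 * (x - 1.0)
print(newton_raphson(M, lambda x: 2.0, x1=5.0, n_steps=5))        # exactly 1.0
print(robbins_monro(lambda x: M(x) + random.gauss(0.0, 1.0),
                    x1=5.0, n_steps=100_000))                     # near 1.0
```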
It was shown that under certain regularity conditions and a proper choice of the sequence {a_n}, the RM process converges to θ a.s. and in mean square (e.g. Robbins and Monro (1951), Blum (1954), Dvoretzky (1956), Robbins and Siegmund (1971)). Chung (1954) and later Sacks (1958) found conditions under which the RM process is asymptotically normal. Sacks (1958) investigated the case where a_n = A n^{-1}, n = 1, 2, ..., where A is some positive constant. He has shown that under some general conditions the sequence n^{1/2}(X_n - θ) converges in law to a normal random variable with mean zero and variance A²σ²/(2Aα - 1), where σ² = lim_{x→θ} Var Y(x) and α is the derivative of M at x = θ. It is immediate to check that the asymptotic variance is minimized when A = α^{-1}. However, without knowledge of α, this optimal choice cannot be computed. Thus the problem of constructing an efficient process (i.e. one with minimal asymptotic variance) for estimating θ becomes natural.

This problem was considered first by Albert and Gardner (1967) and by Sakrison (1965). Both Albert and Gardner and Sakrison replaced the constant A by a stochastic sequence estimating α. Since in both cases the estimating sequence depends on M, their methods are useful only when M is a known function. The case where M is unknown was considered by Venter (1967). Venter's method requires that at the nth stage (n = 1, 2, ...) the statistician make two observations Y_n' and Y_n'', at X_n - c_n and X_n + c_n, where X_n is the nth approximation and {c_n} is a given null sequence of positive numbers. This way of collecting the data made it possible for Venter to construct an efficient adaptive process. Although Venter's method is asymptotically efficient, the present author feels that taking the observations two at a time may make it unattractive in situations where the total number of experiments allowed is small.

This raises a natural question. After the nth stage of the experiment the statistician has observed the first n approximations X_1, X_2, ..., X_n and the observations on M at these points, Y_1, Y_2, ..., Y_n. Is it not possible to use these data for determining the next approximation X_{n+1} in an efficient way? The purpose of this paper is to provide a positive answer.

Motivated by the original NR method one might fit a least squares line to the points (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) and take the point of intersection of this line with the x-axis as the next approximating point. This yields the following procedure. Let X_1 be a random variable to be chosen by the statistician. For n = 1, 2, ..., let
    X_{n+1} = X_n - (n b_{1,n})^{-1} Y_n,   (1)

where

    b_{1,n} = [Σ_{i=1}^n (X_i - X̄_n) Y_i] / [Σ_{i=1}^n (X_i - X̄_n)²]   for n ≥ 2,   X̄_n = n^{-1} Σ_{i=1}^n X_i,
and b_{1,1} is an arbitrary positive number. Since M is not assumed to be linear, it seems preferable to consider the process given by
    X_{n+1} = X_n - (n b_{m,n})^{-1} Y_n,   (2)

with X_1 arbitrary and

    b_{m,n} = [Σ_{i=m+1}^n (X_i - X̄_{m,n}) Y_i] / [Σ_{i=m+1}^n (X_i - X̄_{m,n})²],   (3)

where

    X̄_{m,n} = (n - m)^{-1} Σ_{i=m+1}^n X_i.
As indicated, m may be taken to depend upon n. This is not needed in any of the proofs. However, one might want to base the estimator (3) of α only upon the later part of the sequence {X_n} and ignore the beginning. This can obviously be done if m is chosen to go to infinity with n. As will be seen later, all the results hold in this case provided m(n) = o((log n)^{1/2+ε}) for every ε > 0. In this paper a truncated version of (2) is considered. It is shown that this process yields a strongly consistent sequence of estimators for θ which is asymptotically efficient.
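Before turning to the formal results, here is a small Python sketch of the untruncated process (2)-(3) (added for illustration, with a fixed window start m for simplicity; the linear M, the Gaussian noise and the fallback slope are assumptions of this sketch, not choices made in the paper). The guard against a non-positive slope estimate is precisely the difficulty that the truncation of the next section resolves.

```python
import random

def ls_slope(xs, ys):
    """Least squares slope of (3), fitted to the points in the window."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    if sxx == 0.0:
        return None
    return sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx

def process_2(observe, x1, n_steps, m=0, b_fallback=1.0):
    """Process (2): X_{n+1} = X_n - Y_n / (n * b_{m,n}), the slope fitted to
    (X_i, Y_i), i = m+1, ..., n.  Untruncated sketch: when the slope is not
    yet defined, or comes out non-positive, a fallback value is used."""
    xs, ys = [], []
    x = x1
    for n in range(1, n_steps + 1):
        y = observe(x)
        xs.append(x); ys.append(y)
        b = ls_slope(xs[m:], ys[m:]) if n - m >= 2 else None
        if b is None or b <= 0.0:
            b = b_fallback   # crude guard; (5) below handles this properly
        x = x - y / (n * b)
    return x

alpha, theta, sigma = 2.0, 1.0, 1.0
observe = lambda x: alpha * (x - theta) + random.gauss(0.0, sigma)
print(process_2(observe, x1=5.0, n_steps=2000))   # should be near theta = 1
```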
2. The results

For the purpose of easy reference let us list all the assumptions which will be needed in the sequel.
(M1) M(x) is a measurable function satisfying (x - θ)M(x) > 0 for all x ≠ θ.

(M2) For every 0 < ε < 1, inf_{ε ≤ |x-θ| ≤ ε^{-1}} |M(x)| > 0.

(M3) M(x) = α(x - θ) + δ(x - θ) for all x, where 0 < α < ∞ and δ(y) = o(y) as y → 0.

(M4) There exist positive numbers 0 < K ≤ K' < ∞ such that
    (i) |M(x)| ≤ K'|x - θ| for all x;
    (ii) K|x - θ| ≤ |M(x)| for all x.

Let Y(x) be a random variable such that E Y(x) = M(x), and set Z(x) = Y(x) - M(x). It is assumed that for every -∞ < x < ∞:

(Z1) (i) sup_x E|Z(x)|^{2+δ} < ∞ for some δ > 0;
     (ii) E Z²(x) > 0 for all x in some neighborhood of θ.

(Z2) E Z²(x) → σ² as x → θ, where 0 < σ² < ∞.

(Z3) …

(Z4) There exist Δ > 0 and λ > 0 such that E Z²(x) ≥ λ whenever |x - θ| ≤ Δ.
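As a concrete instance of these assumptions (an illustrative Python specification added here, not part of the paper; the values of alpha, beta, theta, sigma are arbitrary): with δ(y) = βy³/(1 + y²) one has δ(y) = o(y) as y → 0 and α ≤ M(x)/(x - θ) ≤ α + β, so (M1)-(M4) hold with K = α and K' = α + β, while homoscedastic Gaussian noise makes the (Z) conditions immediate.

```python
import random

# Illustrative model satisfying (M1)-(M4) and the (Z) conditions;
# alpha, beta, theta and sigma are arbitrary values chosen for the example.
alpha, beta, theta, sigma = 2.0, 1.0, 1.0, 1.0

def M(x):
    """M(x) = alpha*(x - theta) + delta(x - theta) with
    delta(y) = beta*y**3/(1 + y**2), so delta(y)/y -> 0 as y -> 0 and
    alpha <= M(x)/(x - theta) <= alpha + beta for all x != theta."""
    y = x - theta
    return alpha * y + beta * y ** 3 / (1.0 + y ** 2)

def Y(x):
    """Observation Y(x) = M(x) + Z with Z ~ N(0, sigma^2): E Y(x) = M(x)
    and E Z^2(x) = sigma^2 for every x."""
    return M(x) + random.gauss(0.0, sigma)
```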
Remark. Condition (Z1)(ii) is obviously implied by (Z4). Both conditions are included, however, since some of the results which are of independent interest (such as Lemma 4) make use only of the weaker version (Z1)(ii). Assumption (M4) implies that the derivative of M at x = θ satisfies K ≤ α ≤ K'.
Let m = m(n) satisfy m(n) = o((log n)^{1/2+ε}) as n → ∞ for every ε > 0, and let {a_n} be a null sequence of positive numbers. The truncated version of (2) is defined by X_1 arbitrary and

    X_{n+1} = X_n - (n b*_{m,n})^{-1} Y_n,   n = 1, 2, ...,   (4)

where

    b*_{m,n} = b_{m,n-1}   if b_{m,n-1} > a_n,
    b*_{m,n} = a_n         if b_{m,n-1} ≤ a_n,   (5)

b_{m,n} is given by (3) (with b_{m,k} an arbitrary positive number while fewer than two points are available), and Y_n is a random variable whose conditional distribution given (X_1, Y_1, ..., Y_{n-1}) is the same as the distribution of Y(X_n).
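Under the reading of (5) given above, the truncation is a one-line guard in code (a sketch added here; the particular null sequence a_n = 1/log(n + 2) is an arbitrary admissible choice, and b_prev denotes the slope b_{m,n-1} computed from the data available before the nth step):

```python
import math

def truncated_slope(b_prev, n):
    """Truncation (5): use the least squares slope b_{m,n-1} when it exceeds
    the floor a_n, and the floor itself otherwise, so that the gain in (4)
    stays positive.  a_n = 1/log(n + 2) is one admissible null sequence."""
    a_n = 1.0 / math.log(n + 2)
    return b_prev if b_prev > a_n else a_n
```

The nth step of (4) then reads x -= y / (n * truncated_slope(b_prev, n)); since a_n → 0, the truncation is eventually inactive on the event where b_{m,n} → α > 0.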
Theorem 1. If Conditions (M1), (M2), (M4)(i) and (Z1)(i) are satisfied, then X_n given by (4) and (5) converges almost surely to θ.

The proof is omitted since it follows from Robbins and Siegmund's Application
2 (1971, Sec. 4).

Remark. It can be easily verified that (5) can be replaced by truncating b_{m,n} to an interval (α_n, β_n), where {α_n} and {β_n} are sequences of positive numbers satisfying Σ_n α_n^{-2} n^{-2} < ∞.

Lemma 1. …
Lemma 2. If the conditions of Theorem 1 and Conditions (M3) and (Z4) hold, and m = m(n) is such that m(n) = o((log n)^{1/2+ε}) as n → ∞ for every ε > 0, then Σ_{k=m+1}^n (X_k - θ)² → ∞ a.s. at least as fast as log n (i.e. lim inf_{n→∞} (log n)^{-1} Σ_{k=m+1}^n (X_k - θ)² > 0 a.s.).
Proof. By (M3) it follows that M(x) can be written as M(x) = (α + ψ(x))x, where ψ(x) → 0 as x → 0; here and in what follows we take θ = 0 without loss of generality. Thus

    X_{n+1} = (1 - d_n n^{-1}) X_n - A_n n^{-1} Z_n,   (6)

where A_n = (b*_{m,n})^{-1}, Z_n = Y_n - M(X_n), and d_n = A_n(α + ψ(X_n)). By Theorem 1, X_n → 0 a.s., and 0 < c₅ ≤ b*_{m,n} ≤ c₆ < ∞ uniformly in n for almost all ω, so that there exist constants 0 < a₁ ≤ a₂ < ∞ with a₁ ≤ d_n ≤ a₂ for all n ≥ n₁ = n₁(ω) and almost all ω. Thus for n ≥ n₁ and almost all ω,

    X²_{n+1} = (1 - d_n n^{-1})² X_n² + A_n² n^{-2} Z_n² - 2 A_n n^{-1}(1 - d_n n^{-1}) X_n Z_n
             ≥ (1 - a₂ n^{-1})² X_n² + A_n² n^{-2} Z_n² - 2 A_n n^{-1}(1 - d_n n^{-1}) X_n Z_n.   (7)

Let β_{kn} = ∏_{j=k+1}^n (1 - a₂ j^{-1}) for n > k ≥ j₀ and β_{nn} = 1, where j₀ is such that 1 - a₂ j^{-1} > 0 for all j ≥ j₀; there are positive constants with c(k/n)^{a₂} ≤ β_{kn} ≤ c'(k/n)^{a₂}. Iterating (7) yields, for all n ≥ m ≥ max(j₀, n₁),

    X²_{n+1} ≥ β²_{mn} X_m² + Σ_{k=m}^n β²_{kn} A_k² k^{-2} Z_k² - 2 Σ_{k=m}^n β²_{kn} A_k k^{-1}(1 - d_k k^{-1}) X_k Z_k.   (8)

Now sum (8) over n = m, ..., N. Consider first the last term of (8). By changing the order of summation one gets

    Σ_{n=m}^N Σ_{k=m}^n β²_{kn} A_k k^{-1}(1 - d_k k^{-1}) X_k Z_k = Σ_{k=m}^N Ω_{kN} A_k (1 - d_k k^{-1}) X_k Z_k,

where the Ω_{kN} are uniformly bounded positive numbers. Furthermore,

    Σ_k E[(log k)^{-(1/2+ε)} A_k (1 - d_k k^{-1}) X_k Z_k]² < ∞

by Lemma 1 and (Z1)(i) for every ε > 0. Thus Σ_k (log k)^{-(1/2+ε)} A_k (1 - d_k k^{-1}) X_k Z_k converges a.s. to a finite limit as N → ∞, by Theorem D of Loève (1963, p. 387). Next observe that Kronecker's Lemma (Loève (1963), p. 238) can be slightly modified to read: if Σ W_n = W < ∞, {b_n} is a monotone increasing sequence of positive numbers such that lim_{n→∞} b_n = ∞, and the h_n are positive and uniformly bounded numbers, then lim_{m→∞} b_m^{-1} Σ_{n=1}^m h_n b_n W_n = 0. Thus

    Σ_{k=m}^N Ω_{kN} A_k k^{-1}(1 - d_k k^{-1}) X_k Z_k = o((log N)^{1/2+ε})   a.s.

for every ε > 0. For the first term of (8),

    Σ_{n=m}^N β²_{mn} X_m² ≤ c m X_m² (1 - (m/N)^{2a₂-1}) = o((log N)^{1/2+ε})

as N → ∞ for every ε > 0, since m = m(N) = o((log N)^{1/2+ε}).

It remains to consider the middle term of (8). Again by changing the order of summation one gets

    Σ_{n=m}^N Σ_{k=m}^n β²_{kn} A_k² k^{-2} Z_k² ≥ c Σ_{k=m}^N k^{2a₂-2}(k^{-2a₂+1} - N^{-2a₂+1}) A_k² Z_k² ≥ c ρ Σ_{k=m}^{[Δ'N]} k^{-1} A_k² Z_k²,

where 0 < Δ' < 1 and ρ = 1 - (Δ')^{2a₂-1}. Let Δ > 0 and λ > 0 be as in Assumption (Z4). Since X_n → 0 a.s., |X_n| ≤ Δ for all n sufficiently large, so that the conditional expectation of Z_k² given the past is eventually at least λ; a truncation argument then shows that

    Σ_{k=m}^{[Δ'N]} k^{-1} A_k² Z_k² ≥ c₁ log N - c₂   a.s.

for all N sufficiently large, where c₁ and c₂ are positive constants. Since Σ_{n=m}^N X²_{n+1} dominates the sum of the three contributions just estimated, the proof is complete.
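Lemma 2 is easy to probe numerically (a rough Monte Carlo check added here, under an assumed linear M with Gaussian noise; nothing of the sort appears in the paper): along a simulated path of (4)-(5), the ratio Σ_{k≤n}(X_k - θ)²/log n should settle at a positive value rather than drift to zero.

```python
import math, random

def growth_check(n_steps, alpha=2.0, theta=1.0, sigma=1.0, x1=3.0, seed=0):
    """Simulate (4)-(5) for M(x) = alpha*(x - theta) with N(0, sigma^2) noise
    and report sum_{k<=n} (X_k - theta)^2 / log n at a few checkpoints; by
    Lemma 2 this ratio should stay bounded away from zero."""
    rng = random.Random(seed)
    Sx = Sy = Sxx = Sxy = 0.0          # running sums for the slope estimate
    x, b, s, out = x1, 1.0, 0.0, {}
    for n in range(1, n_steps + 1):
        y = alpha * (x - theta) + rng.gauss(0.0, sigma)
        Sx += x; Sy += y; Sxx += x * x; Sxy += x * y
        s += (x - theta) ** 2
        if n in (10**2, 10**3, 10**4, 10**5):
            out[n] = s / math.log(n)
        den = Sxx - Sx * Sx / n
        if n >= 2 and den > 0.0:
            b = (Sxy - Sx * Sy / n) / den
        x -= y / (n * max(b, 1.0 / math.log(n + 2)))   # step (4) with (5)
    return out

print(growth_check(10**5))
```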
The following result is due to Stout (1970).

Lemma 3. Let {ζ_n, F_n, n ≥ 1} be a martingale difference sequence. Let

    S_n = Σ_{k=1}^n ζ_k,   V_n² = Σ_{k=1}^n E(ζ_k² | F_{k-1}),   U_n = (2 log log V_n²)^{1/2},   n = 1, 2, ... .

Let K_n be F_{n-1}-measurable functions such that K_n → 0 a.s. as n → ∞. If V_n² → ∞ a.s. and |ζ_n| ≤ K_n V_n/U_n a.s., then lim sup_{n→∞} S_n/(U_n V_n) = 1 a.s.

Lemma 4. Let X_n be given by (4) and (5).
If the conditions of Theorem 1 and Condition (Z1) are satisfied, then

    X_n = o(η_n n^{-1/2} (log log n)^{1/2})   a.s. as n → ∞,

for every sequence {η_n} of positive numbers such that η_n → ∞ as n → ∞.
Proof. Let j₁ be an integer such that 1 - a₂ j^{-1} > 0 for all j ≥ j₁. Iterating (6), and recalling that a₁ ≤ d_n ≤ a₂ a.s. for all n sufficiently large, one obtains

    X_{n+1} = B_{mn} X_m - Σ_{k=m}^n B_{kn} A_k k^{-1} Z_k   (9)

for every n ≥ m ≥ j₁, where

    β'_{kn} = ∏_{j=k+1}^n (1 - a₁ j^{-1})   if n > k,   β'_{nn} = 1,

and

    B_{kn} = ∏_{j=k+1}^n (1 - d_j j^{-1})   if n > k,   B_{nn} = 1,

so that 0 ≤ B_{kn} ≤ β'_{kn}. Similarly, let γ_n = ∏_{j=j₁+1}^n (1 - d_j j^{-1})^{-1}, so that B_{kn} = γ_k/γ_n for every n ≥ k ≥ j₁. Notice that γ_n is a random variable measurable with respect to F(X_1, Z_1, ..., Z_{n-1}). Since m = m(n) = o((log n)^{1/2+ε}) for every ε > 0 and since γ_n ≥ c₅ n^{a₁}, the first term of (9) is B_{mn} X_m = o(n^{-a₁}(log n)^{1/2+ε}) a.s.; it is therefore sufficient to consider only the second term on the right-hand side of (9).

Now apply Lemma 3 with ζ_k = k^{-1/2+ε} Z_k, where ε > 0 is to be determined later. By (Z1) there exist positive constants c₁, c₂, c₃ and c₄ such that

    c₁ n^{2ε} ≤ V_n² ≤ c₂ n^{2ε}   and   c₃ (log log n)^{1/2} ≤ U_n ≤ c₄ (log log n)^{1/2}

for all n sufficiently large. Let {κ_n} be a sequence of positive numbers such that κ_n → 0 as n → ∞ and Σ_n κ_n^{-(2+δ)} (n^{-1} log log n)^{(2+δ)/2} < ∞ for some δ > 0. By (Z1)(i),

    Σ_n P(|Z_n| > κ_n (n/log log n)^{1/2}) ≤ [sup_x E|Z(x)|^{2+δ}] Σ_n κ_n^{-(2+δ)} (n^{-1} log log n)^{(2+δ)/2} < ∞.

Therefore by the Borel-Cantelli Lemma, with probability one,

    |Z_n| ≤ κ_n (n/log log n)^{1/2}

for all n sufficiently large, where κ_n → 0 as n → ∞. Thus the conditions of Lemma 3 are satisfied, and therefore

    Σ_{k=1}^n k^{-1/2+ε} Z_k = o(η_n n^{ε} (log log n)^{1/2})   a.s.,

where {η_n} is any sequence such that η_n → ∞ as n → ∞.

To complete the proof let θ_k = γ_k k^{-1/2-ε} for k ≥ j₁. Now γ_k ≥ c₅ k^{a₁}, hence ε can be chosen so small that a₁ - 1/2 - ε > 0, and thus θ_k ≥ c₅ k^{a₁-1/2-ε} → ∞ as k → ∞. Therefore, by Kronecker's Lemma in the modified form used in the proof of Lemma 2 (with h_k = A_k),

    θ_n^{-1} Σ_{k=m}^n θ_k A_k k^{-1/2+ε} Z_k = o(η_n n^{ε} (log log n)^{1/2})   a.s.

for n ≥ m > j₁. Or, alternatively,

    Σ_{k=m}^n γ_k A_k k^{-1} Z_k = o(γ_n η_n n^{-1/2} (log log n)^{1/2})   a.s.,
which, in view of (9) and B_{kn} = γ_k/γ_n, completes the proof.

Corollary. Let X_n and η_n be as in Lemma 4. Then

    X̄_n = o(η_n n^{-1/2} (log log n)^{1/2})   and   n X̄_n² = o(η_n log log n)   a.s. as n → ∞,

where X̄_n = n^{-1} Σ_{i=1}^n X_i.

Proof. Obvious.

The consistency of b_{m,n} can now be proved.

Theorem 2. Let X_n be given by (4) and (5) and b_{m,n} by (3), with m = m(n) = o((log n)^{1/2+ε}) as n → ∞. If Conditions (M1) through (M4), (Z1) and (Z4) are satisfied and α² < 2K, then b_{m,n} → α a.s., where α is defined in (M3).

Proof. By (M3),
    b_{m,n} = α + [Σ_{i=m+1}^n (X_i - X̄_{m,n}) δ(X_i)] / [Σ_{i=m+1}^n (X_i - X̄_{m,n})²] + [Σ_{i=m+1}^n (X_i - X̄_{m,n}) Z_i] / [Σ_{i=m+1}^n (X_i - X̄_{m,n})²].   (10)

As in the proof of Lemma 2, δ(x) can be written as δ(x) = ψ(x)x with ψ(x) → 0 as x → 0.
By Lemma 2 and the Corollary to Lemma 4, Σ_{i=m+1}^n (X_i - X̄_{m,n})² goes to ∞ at least as fast as log n. Thus by the Toeplitz Lemma (Loève (1963), p. 238) it follows that the second term on the right-hand side of (10) goes to zero a.s. as n → ∞. By the same argument given in the proof of Lemma 2 it follows that

    Σ_{i=m+1}^n X_i Z_i = o((log n)^{1/2+ε})   a.s.

for every ε > 0. Now it follows from Lemma 3 that Σ_{i=m+1}^n Z_i = O((n log log n)^{1/2}) a.s. Hence, by the Corollary, X̄_{m,n} Σ_{i=m+1}^n Z_i = o(η_n log log n) a.s. as n → ∞, where η_n is as in Lemma 4. Hence the last term on the right-hand side of (10) converges to zero a.s. as n → ∞. This completes the proof.

The following theorem states that the process X_n is asymptotically efficient. The proof is omitted since it follows exactly the proof of Sacks (1958).

Theorem 3. Let X_n be given by (4) and (5) with m(n) = o((log n)^{1/2+ε}) as n → ∞. If Conditions (M1) through (M4) and (Z1) through (Z4) are satisfied and α² < 2K, then n^{1/2}(X_n - θ) converges in law to a normal random variable with mean zero and variance σ²/α².
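Theorem 3 can likewise be illustrated by simulation (an added sketch under an assumed linear model; the paper contains no numerical work): over independent replications, the sample variance of n^{1/2}(X_n - θ) should approach σ²/α², here 0.25.

```python
import math, random

def one_path(n_steps, alpha, theta, sigma, rng, x1=3.0):
    """One path of (4)-(5) for M(x) = alpha*(x - theta); returns the final X."""
    Sx = Sy = Sxx = Sxy = 0.0
    x, b = x1, 1.0
    for n in range(1, n_steps + 1):
        y = alpha * (x - theta) + rng.gauss(0.0, sigma)
        Sx += x; Sy += y; Sxx += x * x; Sxy += x * y
        den = Sxx - Sx * Sx / n
        if n >= 2 and den > 0.0:
            b = (Sxy - Sx * Sy / n) / den
        x -= y / (n * max(b, 1.0 / math.log(n + 2)))
    return x

alpha, theta, sigma, n, reps = 2.0, 1.0, 1.0, 5000, 400
rng = random.Random(1)
vals = [math.sqrt(n) * (one_path(n, alpha, theta, sigma, rng) - theta)
        for _ in range(reps)]
print(sum(v * v for v in vals) / reps, "vs", sigma ** 2 / alpha ** 2)
```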
Finally it should be noted that once the convergence of b_{m,n} to α is proved, it follows from Gaposhkin and Krasulina (1974) that X_n obeys the law of the iterated logarithm, i.e.

    lim sup_{n→∞} [n (2 log log n)^{-1}]^{1/2} (X_n - θ) = σ/α   a.s.
References

[1] Albert, A.E. and L.A. Gardner (1967). Stochastic Approximation and Nonlinear Regression. Research Monograph No. 42. The M.I.T. Press, Cambridge, MA.
[2] Blum, J.R. (1954). Approximation methods which converge with probability one. Ann. Math. Statist. 25, 382-386.
[3] Chung, K.L. (1954). On a stochastic approximation method. Ann. Math. Statist. 25, 463-483.
[4] Dvoretzky, A. (1956). On stochastic approximation. Proc. Third Berkeley Symp. 1, 39-55.
[5] Gaposhkin, V.F. and T.P. Krasulina (1974). On the law of the iterated logarithm in stochastic approximation processes. Theor. Probability Appl. 19, 844-850.
[6] Loève, M. (1963). Probability Theory, 3rd ed. Van Nostrand, Princeton.
[7] Robbins, H. and S. Monro (1951). A stochastic approximation method. Ann. Math. Statist. 22, 400-407.
[8] Robbins, H. and D. Siegmund (1971). A convergence theorem for non-negative almost supermartingales and some applications. In: Optimizing Methods in Statistics. Academic Press, New York, 233-257.
[9] Sacks, J. (1958). Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29, 373-405.
[10] Sakrison, D.J. (1965). Efficient recursive estimation; application to estimating the parameters of a covariance function. Int. J. Engrg. Sci. 3, 461-483.
[11] Stout, W.F. (1970). A martingale analogue of Kolmogorov's law of the iterated logarithm. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 15, 279-290.
[12] Venter, J.H. (1967). An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38, 181-190.