THE CONVERGENCE OF SOME COMPUTATIONAL ALGORITHMS IN THE METHOD OF LEAST SQUARES*

V. V. IVANOV

Kiev

(Received 9 December 1960)

* Zh. vych. mat. 1: No. 6, 963-971, 1961.

LET $H_1$ and $H_2$ be Hilbert spaces, and $A$ a given linear operator with a domain of definition $G_1 \subset H_1$ and a domain of values $G_2 \subset H_2$. It is required to form a sequence $x_n \in G_1$ for which
\[ \lim_{n \to \infty} \| A x_n - y \| = \inf_{x \in G_1} \| A x - y \|, \qquad y \in H_2. \tag{1} \]

Let $e_1, e_2, \dots, e_n, \dots$ be an $A$-complete system in $G_1$ (see [1]). We shall seek $x_n$ in the form
\[ x_n = \sum_{k=1}^{n} a_k e_k \tag{2} \]

from the condition of minimization with respect to the unknowns $a_k$:
\[ \Delta(a) = \Big\| \sum_{k=1}^{n} a_k y_k - y \Big\|, \qquad y_k = A e_k \neq 0, \qquad a = (a_1, a_2, \dots, a_n). \tag{3} \]

It may easily be seen that
\[ \Delta\big( \lambda a' + (1 - \lambda) a'' \big) \le \lambda \Delta(a') + (1 - \lambda) \Delta(a''), \qquad 0 \le \lambda \le 1, \]
i.e. $\Delta$ is a convex function. It is therefore natural to use the various algorithms of descent for finding $a_k^0$.

§ 1. METHOD OF CO-ORDINATE DESCENT (GAUSS-SEIDEL)

Let $a_1^{(0)}, a_2^{(0)}, \dots, a_n^{(0)}$ be the initial values of the parameters. We form
\[ u_1^{(1)} = y - \sum_{k=2}^{n} a_k^{(0)} y_k \]
and find $a_1^{(1)}$ in accordance with the condition
\[ \Delta_1^{(1)} = \big\| u_1^{(1)} - a_1^{(1)} y_1 \big\| = \min_{a} \big\| u_1^{(1)} - a y_1 \big\|. \]
We know that $a_1^{(1)} = (u_1^{(1)}, y_1) / (y_1, y_1)$. Having found $a_1^{(1)}$, we find $a_2^{(1)}$ from the condition
\[ \Delta_2^{(1)} = \big\| u_2^{(1)} - a_2^{(1)} y_2 \big\| = \min_{a} \big\| u_2^{(1)} - a y_2 \big\|, \]


@’ = @’ -.

#'yl.


We find in the same way $a_3^{(1)}, \dots, a_n^{(1)}$. We then find $a_1^{(2)}$ in accordance with the condition
\[ \Delta_1^{(2)} = \big\| u_1^{(2)} - a_1^{(2)} y_1 \big\| = \min_{a} \big\| u_1^{(2)} - a y_1 \big\|, \]
$a_2^{(2)}$ in accordance with
\[ \Delta_2^{(2)} = \big\| u_2^{(2)} - a_2^{(2)} y_2 \big\| = \min_{a} \big\| u_2^{(2)} - a y_2 \big\|, \]
and so on. We have in general, at the $j$th step of the iterative process:
\[ \Delta_s^{(j)} = \big\| u_s^{(j)} - a_s^{(j)} y_s \big\| = \min_{a} \big\| u_s^{(j)} - a y_s \big\|, \qquad s = 1, 2, \dots, n; \quad j = 1, 2, 3, \dots, \tag{4} \]
where

\[ u_s^{(j)} = y - \sum_{k=1}^{s-1} a_k^{(j)} y_k - \sum_{k=s+1}^{n} a_k^{(j-1)} y_k, \qquad a_s^{(j)} = \frac{(u_s^{(j)}, y_s)}{(y_s, y_s)}. \tag{5} \]

THEOREM 1. The method of co-ordinate descent is always convergent in the sense that
\[ \lim_{j \to \infty} \Delta_n^{(j)} = \min_{a} \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\| \tag{6} \]
and
\[ \lim_{j \to \infty} a_k^{(j)} = a_k^0, \qquad k = 1, 2, \dots, n. \tag{7} \]
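Before turning to the proof, it may help to see the scheme (4)-(5) in a concrete finite-dimensional setting. The sketch below assumes $H_2 = \mathbb{R}^m$ and takes the $y_k$ as the columns of a matrix; the function and variable names are illustrative and not part of the original text.

```python
import numpy as np

def coordinate_descent(Y, y, a0, sweeps=50):
    """Co-ordinate (Gauss-Seidel) descent for min_a ||sum_k a_k y_k - y||, cf. (4)-(5).

    Y  : (m, n) array whose columns are y_1, ..., y_n (assumed non-zero)
    y  : (m,) right-hand side
    a0 : (n,) initial values a_1^(0), ..., a_n^(0)
    """
    a = np.array(a0, dtype=float)
    r = y - Y @ a                                   # full residual y - sum_k a_k y_k
    for _ in range(sweeps):                         # one sweep = one step j in (4)
        for s in range(Y.shape[1]):
            ys = Y[:, s]
            u = r + a[s] * ys                       # u_s^(j): residual with the s-th term removed
            a_new = np.dot(u, ys) / np.dot(ys, ys)  # a_s^(j) = (u_s^(j), y_s)/(y_s, y_s), cf. (5)
            r = u - a_new * ys                      # new residual; Delta_s^(j) = ||r||
            a[s] = a_new
    return a, np.linalg.norm(r)

# Example: an over-determined, nearly degenerate system
Y = np.array([[1.0, 1.0], [1.0, 1.0001], [0.0, 0.0]])
a, discrepancy = coordinate_descent(Y, np.array([1.0, 2.0, 3.0]), a0=np.zeros(2))
```

Theorem 1 states that the discrepancy produced by such sweeps always converges to the minimum, even when the $y_k$ are linearly dependent.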

Proof. We assume to start with that $y_1, y_2, \dots, y_n$ are linearly independent*. Since the discrepancy $\Delta_s^{(j)}$ can only diminish at each step of the algorithm and, on the other hand, never becomes less than zero, $\lim_{j \to \infty} \Delta_s^{(j)}$ exists. Hence, on making use of (5), we have
\[ \big| a_s^{(j)} - a_s^{(j-1)} \big| \, \| y_s \| = \Big[ \big( \Delta_{s-1}^{(j)} \big)^2 - \big( \Delta_s^{(j)} \big)^2 \Big]^{1/2} \to 0, \qquad \Delta_0^{(j)} \equiv \Delta_n^{(j-1)}, \tag{8} \]

uniformly with respect to $s$ and $j$. This implies that
\[ \sum_{k=1}^{s-1} a_k^{(j)} (y_k, y_s) + \sum_{k=s}^{n} a_k^{(j-1)} (y_k, y_s) = (y, y_s) + \varepsilon_s^{(j)}, \qquad s = 1, 2, \dots, n, \tag{9} \]

where
\[ \varepsilon_s^{(j)} = \big( a_s^{(j-1)} - a_s^{(j)} \big) (y_s, y_s) \to 0, \qquad j \to \infty. \tag{10} \]

* Given this assumption, the assertions of the theorem are not new. But the proof is related to what follows and is of independent interest.

We introduce the system


\[ \sum_{k=1}^{n} a_k (y_k, y_s) = (y, y_s), \qquad s = 1, 2, \dots, n. \tag{11} \]
The determinant of this system is known to be non-zero by virtue of the linear independence of the $\{y_k\}$, and the solution $\{a_k^0\}$ of the system minimizes $\Delta$. But (9) differs from (11) only in having magnitudes on the right-hand sides that tend to zero when $j \to \infty$. Hence it is clear that $|a_k^{(j)} - a_k^0| \to 0$, $j \to \infty$.

If $y_1, y_2, \dots, y_n$ are linearly dependent and $y_{k_1}, y_{k_2}, \dots, y_{k_m}$ is a maximal linearly independent subsequence of them, system (11) is soluble and any solution $\{a_k\}$ of it satisfies the relation
\[ \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\| = \min_{a'} \Big\| y - \sum_{k=1}^{n} a_k' y_k \Big\|. \tag{12} \]

Thus, if we assume for convenience in writing that $y_{k_1}, y_{k_2}, \dots, y_{k_m}$ coincide with $y_1, y_2, \dots, y_m$, (9) and (11) can be rewritten as
\[
\begin{aligned}
\sum_{k=1}^{m} a_k^{(j)} (y_k, y_s) &= (y, y_s) - \sum_{k=m+1}^{n} a_k^{(j)} (y_k, y_s) + \varepsilon_s^{(j)}, \qquad s = 1, 2, \dots, m, \\
\sum_{k=1}^{n} a_k^{(j)} (y_k, y_s) &= (y, y_s) + \varepsilon_s^{(j)}, \qquad s = m+1, m+2, \dots, n,
\end{aligned} \tag{9'}
\]
and
\[
\begin{aligned}
\sum_{k=1}^{m} a_k (y_k, y_s) &= (y, y_s) - \sum_{k=m+1}^{n} a_k (y_k, y_s), \qquad s = 1, 2, \dots, m, \\
\sum_{k=1}^{n} a_k (y_k, y_s) &= (y, y_s), \qquad s = m+1, m+2, \dots, n.
\end{aligned} \tag{11'}
\]
Given any $a_k$, $k = m+1, \dots, n$, (11') yields a set $\{a_k^0\}$ satisfying (12), since the first $m$ of equations (11') are equivalent to the system
\[ \sum_{k=1}^{m} a_k (y_k, y_s) = \Big( y - \sum_{k=m+1}^{n} a_k y_k, \; y_s \Big), \qquad s = 1, 2, \dots, m, \tag{13} \]
whilst the last $n - m$ equations are a consequence of the first $m$. We put $a_k^0 = a_k^{(j)}$, $k = m+1, \dots, n$, and take for $a_k^0$, $k = 1, \dots, m$, the corresponding solution of (13). Now,
\[ \big| a_k^{(j)} - a_k^0 \big| \to 0, \qquad j \to \infty, \qquad k = 1, 2, \dots, n, \tag{14} \]
and consequently,
\[ \Big\| y - \sum_{k=1}^{n} a_k^{(j)} y_k \Big\| - \Big\| y - \sum_{k=1}^{n} a_k^0 y_k \Big\| \le \Big\| \sum_{k=1}^{n} \big( a_k^{(j)} - a_k^0 \big) y_k \Big\| \le C_2 \sum_{k=1}^{n} \big| \varepsilon_k^{(j)} \big| \to 0, \qquad j \to \infty. \tag{15} \]
Relation (6) is proved.

To prove the stronger assertion (7), we observe that the discussion can always be reduced to the case when
\[ \Big\| y - \sum_{k=1}^{n} a_k^0 y_k \Big\| = 0. \]


Thus we can always put $y = u + v$, where $v$ is orthogonal to the subspace formed by the elements $y_1, y_2, \dots, y_n$, whilst $u$ belongs to this subspace. We now have, for any $\{a_k\}$:
\[ \Big( y - \sum_{k=1}^{n} a_k y_k, \; y_s \Big) = \Big( u - \sum_{k=1}^{n} a_k y_k, \; y_s \Big), \qquad s = 1, 2, \dots, n, \tag{16} \]
and
\[ \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\|^2 = \Big\| u - \sum_{k=1}^{n} a_k y_k \Big\|^2 + \| v \|^2. \tag{17} \]
This is equivalent to our assertion. On the basis of (10), we can now rewrite (15) as
\[ \Delta_n^{(j)} \le C_1 \sum_{s=1}^{n} \Big[ 2 (y_s, y_s) \, \Delta_{s-1}^{(j)} \big( \Delta_{s-1}^{(j)} - \Delta_s^{(j)} \big) \Big]^{1/2}, \tag{18} \]
whence
\[ \Delta_n^{(j)} \le C_2 \Big[ \Delta_n^{(j-1)} \big( \Delta_n^{(j-1)} - \Delta_n^{(j)} \big) \Big]^{1/2}. \tag{19} \]
It may readily be seen that
\[ \big| a_s^{(j)} - a_s^{(j-1)} \big| \le C_3 \Delta_n^{(j-1)}. \tag{20} \]
It follows from (19) and (20) that the series $\sum_{j=1}^{\infty} \big( a_s^{(j)} - a_s^{(j-1)} \big)$, $s = 1, 2, \dots, n$, are absolutely convergent.

This completes the proof of the theorem.

As applied to the solution of systems of linear algebraic equations, Theorem 1 implies that the familiar Seidel method is also convergent in the case when the matrix is simply positive (possibly degenerate); admittedly, the approximate solution here converges to different values, depending on the initial vector. A series of worked examples shows that the practical convergence may be extremely good in degenerate or close to degenerate cases.

It may be remarked that the method of co-ordinate descent in essence bypasses the problem of forming the algebraic system (11), and leads to an economical employment of the computer memory. For instance, when solving numerous boundary problems for partial differential equations by the net method, algebraic systems $Ax = y$ are obtained whose matrices have a fairly small number of non-zero diagonals. In such a case, as is clear from the description of the computational scheme, the memory only needs to store the vector $u_s^{(j)}$, the required approximation $\{a_k^{(j)}\}$ and the matrix $A$, not taking into account the computer program; the vectors $\{y_k\}$ can be obtained as the columns of the matrix $A$, taking $e_1 = (1, 0, \dots, 0)$, $e_2 = (0, 1, 0, \dots, 0)$, ..., $e_n = (0, \dots, 0, 1)$. At the same time the matrix $A^* A$ of system (11) will in general be completely filled, so that storing it in the memory may prove difficult. (A sketch of such a memory-economical scheme is given after inequality (21) below.)

Let us compare the method of co-ordinate descent with the method of fastest descent (see [2]). Theoretically the latter would seem to have an undoubted advantage, since at each step
the minimum $\min_{a} \| u - a A (A^* u) \|$ ($A^*$ is the operator conjugate to $A$) is sought, $A^* u$ being the direction of fastest decrease of the discrepancy. But it must be noted that more computational operations go into one step of this method: computing $A^* u$, and all the more $A A^* u$, is in general far more complicated than calculating $A e_k$, since the calculation of $A e_k$ can be substantially simplified by a suitable choice of $e_k$. Furthermore, the method of fastest descent does not provide for seeking the solution as a linear combination of the elements $\{e_k\}$, and this is often a disadvantage. Also, examples may be quoted where the advantage mentioned may not lead to more rapid convergence. Examples can easily be devised which show that, even at the first step,
\[ \min_{a} \| u - a A (A^* u) \| > \min_{a} \| u - a A e_1 \|. \tag{21} \]
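The memory economy mentioned above can be made concrete. The sketch below is illustrative only: it assumes the finite-dimensional case, stores $A$ in a standard sparse column format (scipy), and runs the co-ordinate descent of § 1 with the $y_k$ taken as the columns of $A$, so that only $A$, the residual and the current coefficients are held in memory and the filled matrix $A^* A$ of system (11) is never formed.

```python
import numpy as np
from scipy.sparse import csc_matrix

def seidel_sweeps(A, y, sweeps=100):
    """Co-ordinate descent on ||A a - y|| with y_k = k-th column of A (e_k = unit vectors)."""
    A = csc_matrix(A)                        # cheap column access; suits matrices with few diagonals
    n = A.shape[1]
    a = np.zeros(n)                          # current approximation {a_k^(j)}
    r = np.array(y, dtype=float)             # residual y - A a (a = 0 initially)
    col_norm2 = np.asarray(A.multiply(A).sum(axis=0)).ravel()   # (y_s, y_s), assumed non-zero
    for _ in range(sweeps):
        for s in range(n):
            lo, hi = A.indptr[s], A.indptr[s + 1]
            rows, vals = A.indices[lo:hi], A.data[lo:hi]        # non-zeros of the column y_s
            delta = np.dot(vals, r[rows]) / col_norm2[s]        # correction of a_s, cf. (5)
            a[s] += delta
            r[rows] -= delta * vals                             # update the residual in place
    return a, np.linalg.norm(r)
```

Only the non-zero entries of each column are touched, so for a matrix with a few non-zero diagonals the work and storage per sweep stay proportional to the number of non-zeros.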

§ 2. METHOD OF GRADIENT DESCENT (L. V. KANTOROVICH)

If the method of fastest descent, in the form in which it is treated in [3], is applied to the functional
\[ \Delta(a_1, a_2, \dots, a_n) = \Big\| \sum_{k=1}^{n} a_k y_k - y \Big\|, \]
taking $n$-dimensional Euclidean space as the domain of definition $X$, the advantages of the method of co-ordinate descent indicated above are preserved in essence and, above all, a more rapidly convergent process is obtained in a number of cases. It is well known that the direction of fastest change of $\Delta$ in the given case is the same as the direction of the gradient to the level surface $\Delta = \mathrm{const}$ and is defined by the vector with the components
\[ \frac{\partial \Delta}{\partial a_s} = \frac{1}{\Delta} \Big( \sum_{k=1}^{n} a_k y_k - y, \; y_s \Big), \qquad s = 1, 2, \dots, n. \tag{22} \]
We shall therefore refer to this particular form of the method of fastest descent as the method of gradient descent, or the gradient method.

We thus have the following computational scheme for the method of gradient descent. We take the initial point $P_0(a_1^{(0)}, a_2^{(0)}, \dots, a_n^{(0)})$ and the initial value
\[ \Delta_0 = \Big\| y - \sum_{k=1}^{n} a_k^{(0)} y_k \Big\|. \]
We then find
\[ \Big( \frac{\partial \Delta}{\partial a_s} \Big)_0 = \frac{1}{\Delta_0} \Big( \sum_{k=1}^{n} a_k^{(0)} y_k - y, \; y_s \Big), \qquad s = 1, 2, \dots, n. \]
If they are not all zero, we find

the next point $P_1(a_1^{(1)}, \dots, a_n^{(1)})$ by descending from $P_0$ in the direction opposite to the gradient, the step being chosen so as to minimize $\Delta$ along that direction, and compute
\[ \Delta_1 = \Big\| y - \sum_{k=1}^{n} a_k^{(1)} y_k \Big\|. \]
If the
\[ \Big( \frac{\partial \Delta}{\partial a_s} \Big)_1 = \frac{1}{\Delta_1} \Big( \sum_{k=1}^{n} a_k^{(1)} y_k - y, \; y_s \Big), \qquad s = 1, 2, \dots, n, \]
are not all zero, we find
\[ P_2(a_1^{(2)}, \dots, a_n^{(2)}), \qquad \Delta_2 = \Big\| y - \sum_{k=1}^{n} a_k^{(2)} y_k \Big\|, \]

and so on. In general, if, at the $(j+1)$th step, the
\[ \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j = \frac{1}{\Delta_j} \Big( \sum_{k=1}^{n} a_k^{(j)} y_k - y, \; y_s \Big), \qquad s = 1, 2, \dots, n, \]
are not all zero, we find
\[ P_{j+1}(a_1^{(j+1)}, \dots, a_n^{(j+1)}), \qquad x_n^{(j+1)} = \sum_{k=1}^{n} a_k^{(j+1)} e_k, \qquad j = 1, 2, \dots \tag{23} \]
If
\[ \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j = 0, \qquad s = 1, 2, \dots, n, \]
then $x_n^{(j)}$ is the required element and
\[ \Delta_j = \min_{P} \Delta(P). \]
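In the same finite-dimensional setting as before, this scheme can be sketched as follows. The names are illustrative; the step length here is chosen by minimizing $\Delta$ exactly along the anti-gradient direction, which is one way of realizing the displacement $P_{j+1} = P_j + t_j \Delta_j V_j$ used in the proof below.

```python
import numpy as np

def gradient_descent(Y, y, a0, steps=200, tol=1e-12):
    """Gradient (fastest) descent for Delta(a) = ||sum_k a_k y_k - y||, cf. (22)."""
    a = np.array(a0, dtype=float)
    for _ in range(steps):
        r = Y @ a - y                        # sum_k a_k y_k - y
        delta = np.linalg.norm(r)            # Delta_j
        if delta < tol:
            break                            # exact solution reached
        g = (Y.T @ r) / delta                # components (dDelta/da_s)_j, cf. (22)
        if np.linalg.norm(g) < tol:
            break                            # gradient vanished: Delta_j = min Delta(P)
        w = Y @ g                            # image of the gradient direction
        t = np.dot(r, w) / np.dot(w, w)      # step minimizing Delta along -g
        a -= t * g                           # pass to P_{j+1}
    return a, np.linalg.norm(Y @ a - y)
```

Note the divisions by $\Delta_j$ and by $(w, w)$; the latter tends to zero near the minimum, which is the loss of accuracy referred to at the end of this section.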


Let $Q_n$ denote the operator which projects $G_1$ on to the subspace $N_n$ of non-trivial solutions of system (11), and $\bar{x}_n^0$ the unique solution of system (11) orthogonal to the subspace $N_n$.

THEOREM 2. The gradient method is always convergent in the sense that
\[ \lim_{j \to \infty} \Delta_j = \min_{P} \Delta(P) \tag{24} \]
and
\[ \lim_{j \to \infty} x_n^{(j)} = Q_n x_n^{(0)} + \bar{x}_n^0, \tag{25} \]
where $x_n^{(j)} = \sum_{k=1}^{n} a_k^{(j)} e_k$.

Proof. Let us show first that
\[ \sum_{s=1}^{n} \frac{\partial \Delta}{\partial a_s} \, y_s = 0 \]
when and only when $\partial \Delta / \partial a_s = 0$, $s = 1, 2, \dots, n$. Let the dependent elements be expressed as
\[ y_s = \sum_{k=1}^{m} \beta_k^{(s)} y_k, \qquad s = m+1, \dots, n, \]
where $y_1, y_2, \dots, y_m$ are linearly independent. In this case
\[ \frac{\partial \Delta}{\partial a_s} = \sum_{k=1}^{m} \beta_k^{(s)} \frac{\partial \Delta}{\partial a_k}, \qquad s = m+1, \dots, n, \tag{26} \]
and
\[ \sum_{s=1}^{n} \frac{\partial \Delta}{\partial a_s} \, y_s = \sum_{k=1}^{m} \Big( \frac{\partial \Delta}{\partial a_k} + \sum_{s=m+1}^{n} \beta_k^{(s)} \frac{\partial \Delta}{\partial a_s} \Big) y_k. \tag{27} \]
It follows from (27) that, when $\sum_{s=1}^{n} (\partial \Delta / \partial a_s) y_s = 0$,
\[ \frac{\partial \Delta}{\partial a_r} + \sum_{s=m+1}^{n} \sum_{k=1}^{m} \beta_r^{(s)} \beta_k^{(s)} \frac{\partial \Delta}{\partial a_k} = 0, \qquad r = 1, 2, \dots, m. \tag{28} \]

On expanding the determinant of this system as a sum of determinants, it may easily be shown that it cannot be equal to zero. Consequently, $\partial \Delta / \partial a_r = 0$, $r = 1, 2, \dots, m$. But by (26) we now have $\partial \Delta / \partial a_s = 0$, $s = m+1, \dots, n$, i.e. we have proved what was required.

It is further easily seen that $\{\Delta_j\}$ is a non-increasing sequence, bounded from below. Therefore $\lim_{j \to \infty} \Delta_j = \Delta^0$ exists, and this limit is attained at some point $P^0$, $\Delta^0 = \Delta(P^0)$, which is a limit point for $\{P_j\}$. After simple calculations, we obtain
\[ \Delta_j^2 - \Delta_{j+1}^2 = \Delta_j^2 \Big[ \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j^2 \Big]^2 \Big/ \sum_{k=1}^{n} \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_k} \Big)_j \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j (y_k, y_s). \tag{29} \]
Hence it follows, since the denominator
\[ \sum_{k=1}^{n} \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_k} \Big)_j \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j (y_k, y_s) = \Big\| \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j y_s \Big\|^2 \]
is bounded $\big( |\partial \Delta / \partial a_s| \le \| y_s \| \big)$, that
\[ \lim_{j \to \infty} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j = 0, \qquad s = 1, 2, \dots, n. \tag{30} \]


But this implies that $\Delta^0 = \Delta(P^0) = \min_{P} \Delta(P)$.

In order to prove the stronger assertion (25), we observe, as in the proof of Theorem 1, that the discussion can always be reduced to the case when $y$ lies in the subspace of the vectors $y_1, y_2, \dots, y_n$. Let us suppose that $\{\xi_k\}$ has the property
\[ \sum_{k=1}^{n} \xi_k (y_k, y_s) = 0, \qquad s = 1, 2, \dots, n, \tag{31} \]

and that $x_n^{(0)} = 0$. Now, since
\[ P_{j+1} = P_j + t_j \Delta_j V_j, \qquad j = 1, 2, \dots, \qquad P_1 = t_0 \Delta_0 V_0, \tag{32} \]
where $V_j$ is the vector with the components $-(\partial \Delta / \partial a_s)_j$, and
\[ \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j \xi_s = \frac{1}{\Delta_j} \Big( \sum_{k=1}^{n} a_k^{(j)} y_k - y, \; \sum_{s=1}^{n} \xi_s y_s \Big) = 0, \qquad j = 1, 2, \dots, \tag{33} \]

all the $x_n^{(j)}$, $j = 1, 2, \dots$, are orthogonal to the subspace $N_n$. It follows from this that $x_n^{(j)}$ has a unique limiting value, orthogonal to $N_n$. If $x_n^{(0)} \neq 0$, on taking the element $y - \sum_{k=1}^{n} a_k^{(0)} y_k$ instead of $y$, we arrive at the relation
\[ \lim_{j \to \infty} x_n^{(j)} = \bar{x}_n^0 + \sum_{k} d_k u_k, \tag{34} \]
where $\{u_k\}$ is a complete orthonormal system in $N_n$. In view of the fact that $\lim_{j \to \infty} \big( x_n^{(j)} - x_n^{(0)}, u_k \big) = 0$, we have
\[ d_k = \big( x_n^{(0)}, u_k \big), \tag{35} \]
so that $\sum_{k} d_k u_k = Q_n x_n^{(0)}$. The theorem is proved.

A number of worked examples shows that, in the case of ill-conditioned problems, when the "ellipse" $\Delta = \mathrm{const}$ is considerably stretched in some direction, the gradient method gives faster convergence than the method of co-ordinate descent. In ordinary problems the speeds of convergence are roughly the same, though the accuracy with which the minimum is located is always higher with the method of co-ordinate descent, on the assumption, of course, that the computation is carried out with the same rounding-off rules. The reason for this is that, close to the minimum, we have to divide in the gradient method by the numbers
\[ \sum_{k=1}^{n} \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_k} \Big)_j \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j (y_k, y_s), \]
which tend to zero, whereas in the method of co-ordinate descent the division is by the invariant numbers $\| y_k \|^2$.


It may be remarked in conclusion that, by utilizing the ideas of [4], methods of co-ordinate and gradient descent can be introduced which seek at each step to minimize the norm of the error directly: for instance, for the method of co-ordinate descent, minimization of $\| x - x_n^{(j)} - a y_s \|$, where $x$ is the exact solution of the equation $Ax = y$ (which is assumed soluble), $x_n^{(j)}$ is the approximate solution already obtained, $y_s = A^* e_s$, and $\{e_s\}$ is an arbitrary complete system in $G_2$. The required value $a^0$ can be found from the formula
\[ a^0 = \frac{\big( x - x_n^{(j)}, y_s \big)}{(y_s, y_s)} = \frac{\big( y - A x_n^{(j)}, e_s \big)}{(y_s, y_s)}. \]
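In the finite-dimensional case with $\{e_s\}$ the unit vectors, one such co-ordinate step can be sketched as follows. The names are illustrative; here $y_s = A^* e_s$ is simply the $s$-th row of $A$, and the second form of the formula above is used, so that the unknown exact solution $x$ never appears.

```python
import numpy as np

def error_norm_sweep(A, y, x):
    """One sweep of co-ordinate descent minimizing ||x_exact - x - a*y_s||, y_s = A^T e_s."""
    x = np.array(x, dtype=float)
    for s in range(A.shape[0]):
        ys = A[s, :]                                       # y_s = A^T e_s: the s-th row of A
        a0 = (y[s] - np.dot(ys, x)) / np.dot(ys, ys)       # (y - A x, e_s) / (y_s, y_s)
        x += a0 * ys                                       # x := x + a0 * y_s
    return x
```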

Assertions similar to the above can be proved for the convergence of the methods of co-ordinate and gradient descent in the present case.

Translated by D. E. Brown

REFERENCES

1. MIKHLIN, S. G., Pryamye metody v matematicheskoi fizike. (Direct methods in mathematical physics.) Gostekhizdat, Moscow-Leningrad, 1950.
2. KANTOROVICH, L. V., Funktsional'nyi analiz i prikladnaya matematika. (Functional analysis and applied mathematics.) Usp. matem. nauk 3: No. 6 (28), 135-137, 145, 1948.
3. KANTOROVICH, L. V. and AKILOV, G. P., Funktsional'nyi analiz v normirovannykh prostranstvakh. (Functional analysis in normed spaces.) Gostekhizdat, Moscow-Leningrad, 1959.
4. FRIDMAN, V. M., Novye metody resheniya lineinogo operatornogo uravneniya. (New methods of solution of a linear operator equation.) Dokl. Akad. Nauk SSSR 128: No. 3, 482-484, 1959.