THE CONVERGENCE OF SOME COMPUTATIONAL ALGORITHMS IN THE METHOD OF LEAST SQUARES*

V. V. IVANOV

(Kiev)

(Received 9 December 1960)

* Zh. vych. mat. 1: No. 6, 963-971, 1961.
LET $H_1$ and $H_2$ be Hilbert spaces, and $A$ a given linear operator with a domain of definition $G_1 \subset H_1$ and a domain of values $G_2 \subset H_2$. It is required to form a sequence $x_n \in G_1$ for which

$$\lim_{n \to \infty} \| A x_n - y \| = \inf_{x \in G_1} \| A x - y \|, \qquad y \in H_2. \qquad (1)$$
Let $e_1, e_2, \ldots, e_n, \ldots$ be an A-complete system in $G_1$ (see [1]). We shall seek $x_n$ in the form

$$x_n = \sum_{k=1}^{n} a_k e_k \qquad (2)$$

from the condition that the following functional of the unknowns $a_k$ be minimized:

$$\Delta(a) = \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\|, \qquad y_k = A e_k \ne 0, \qquad a = (a_1, a_2, \ldots, a_n). \qquad (3)$$
It may easily be seen that

$$\Delta(\lambda a' + (1 - \lambda) a'') \le \lambda \Delta(a') + (1 - \lambda) \Delta(a''), \qquad 0 \le \lambda \le 1,$$

i.e. $\Delta$ is a convex function. It is therefore natural to use the various algorithms of descent for finding the minimizing values $a_k^0$.

§ 1. METHOD OF CO-ORDINATE DESCENT (GAUSS-SEIDEL)

Let $a_1^{(0)}, a_2^{(0)}, \ldots, a_n^{(0)}$ be the initial values of the parameters. We form

$$u_1^{(1)} = y - \sum_{k=2}^{n} a_k^{(0)} y_k$$

and find $a_1^{(1)}$ in accordance with the condition

$$\Delta_1^{(1)} = \| u_1^{(1)} - a_1^{(1)} y_1 \| = \min_a \| u_1^{(1)} - a y_1 \|.$$

We know that $a_1^{(1)} = (u_1^{(1)}, y_1)/(y_1, y_1)$. Having found $a_1^{(1)}$, we find $a_2^{(1)}$ from the condition

$$\Delta_2^{(1)} = \| u_2^{(1)} - a_2^{(1)} y_2 \| = \min_a \| u_2^{(1)} - a y_2 \|,$$

where

$$u_2^{(1)} = u_1^{(1)} - a_1^{(1)} y_1 + a_2^{(0)} y_2.$$
We find $a_3^{(1)}, \ldots, a_n^{(1)}$ in the same way. We then find $a_1^{(2)}$ in accordance with the condition

$$\Delta_1^{(2)} = \| u_1^{(2)} - a_1^{(2)} y_1 \| = \min_a \| u_1^{(2)} - a y_1 \|, \qquad u_1^{(2)} = y - \sum_{k=2}^{n} a_k^{(1)} y_k,$$

$a_2^{(2)}$ in accordance with

$$\Delta_2^{(2)} = \| u_2^{(2)} - a_2^{(2)} y_2 \| = \min_a \| u_2^{(2)} - a y_2 \|,$$

and so on up to

$$\Delta_n^{(2)} = \| u_n^{(2)} - a_n^{(2)} y_n \| = \min_a \| u_n^{(2)} - a y_n \|.$$

In general we have, at the $j$th step of the iterative process:

$$\Delta_s^{(j)} = \| u_s^{(j)} - a_s^{(j)} y_s \| = \min_a \| u_s^{(j)} - a y_s \|, \qquad s = 1, 2, \ldots, n; \quad j = 1, 2, 3, \ldots, \qquad (4)$$

where

$$u_s^{(j)} = y - \sum_{k=1}^{s-1} a_k^{(j)} y_k - \sum_{k=s+1}^{n} a_k^{(j-1)} y_k, \qquad a_s^{(j)} = \frac{(u_s^{(j)}, y_s)}{(y_s, y_s)}. \qquad (5)$$
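For concreteness, here is a minimal computational sketch of scheme (4)-(5) (ours, not part of the original paper), assuming $H_2 = \mathbb{R}^m$ with the elements $y_k$ stored as the columns of a matrix Y; all names are illustrative.

```python
import numpy as np

def coordinate_descent(Y, y, a0, sweeps=100):
    """Scheme (4)-(5): cyclically minimize ||y - sum_k a_k y_k|| over each a_s.

    Y  : (m, n) array whose columns are y_1, ..., y_n (all non-zero);
    y  : (m,) right-hand side;
    a0 : (n,) initial parameters a_1^(0), ..., a_n^(0).
    """
    a = np.asarray(a0, dtype=float).copy()
    r = y - Y @ a                    # current discrepancy y - sum_k a_k y_k
    norms2 = (Y * Y).sum(axis=0)     # the invariant divisors (y_s, y_s)
    for _ in range(sweeps):
        for s in range(Y.shape[1]):
            # a_s^(j) = (u_s^(j), y_s)/(y_s, y_s) with u_s^(j) = r + a_s y_s,
            # so the optimal change of a_s is (r, y_s)/(y_s, y_s):
            step = (r @ Y[:, s]) / norms2[s]
            a[s] += step
            r -= step * Y[:, s]      # keep the discrepancy up to date
    return a, np.linalg.norm(r)      # {a_k^(j)} and Delta_n^(j)
```

Each pass of the inner loop is one sweep $s = 1, 2, \ldots, n$ of (4); by Theorem 1 below, the returned discrepancy tends to the minimum of (3) even when the $y_k$ are linearly dependent.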
THEOREM 1. The method of co-ordinate descent is always convergent in the sense that

$$\lim_{j \to \infty} \Delta_n^{(j)} = \min_a \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\| \qquad (6)$$

and

$$\lim_{j \to \infty} a_k^{(j)} = a_k^0, \qquad k = 1, 2, \ldots, n. \qquad (7)$$
Proof. We assume to start with that $y_1, y_2, \ldots, y_n$ are linearly independent.*

* Given this assumption, the assertions of the theorem are not new. But the proof is related to what follows and is of independent interest.

Since the discrepancy $\Delta_s^{(j)}$ can only diminish at each step of the algorithm and, on the other hand, never becomes less than zero, $\lim_{j \to \infty} \Delta_s^{(j)}$ exists. Hence, on making use of (5), we have

$$\Delta_s^{(j)} - \Delta_s^{(j-1)} \to 0, \qquad j \to \infty, \qquad (8)$$

uniformly with respect to $s$. This implies that

$$\sum_{k=1}^{s-1} a_k^{(j)} (y_k, y_s) + \sum_{k=s}^{n} a_k^{(j-1)} (y_k, y_s) = (y, y_s) + \varepsilon_s^{(j)}, \qquad s = 1, 2, \ldots, n, \qquad (9)$$

where

$$\varepsilon_s^{(j)} \to 0, \qquad j \to \infty, \qquad (10)$$

uniformly with respect to $s$. We introduce the system
$$\sum_{k=1}^{n} a_k (y_k, y_s) = (y, y_s), \qquad s = 1, 2, \ldots, n. \qquad (11)$$

The determinant of this system is known to be non-zero by virtue of the linear independence of the $\{y_k\}$, and the solution $\{a_k^0\}$ of the system minimizes $\Delta$. But (9) differs from (11) only in having magnitudes on the right-hand sides that tend to zero when $j \to \infty$. Hence it is clear that $| a_k^{(j)} - a_k^0 | \to 0$, $j \to \infty$.

If $y_1, y_2, \ldots, y_n$ are linearly dependent and $y_{k_1}, y_{k_2}, \ldots, y_{k_m}$ is a maximal linearly independent subsystem of them, system (11) is soluble and any solution $\{a_k^0\}$ of it satisfies the relation

$$\Big\| y - \sum_{k=1}^{n} a_k^0 y_k \Big\| = \min_a \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\|. \qquad (12)$$
Thus, if we assume for convenience in writing that $y_1, y_2, \ldots, y_m$ coincide with $y_{k_1}, y_{k_2}, \ldots, y_{k_m}$, (9) and (11) can be rewritten as

$$\sum_{k=1}^{m} a_k^{(j)} (y_k, y_s) = (y, y_s) - \sum_{k=m+1}^{n} a_k^{(j)} (y_k, y_s) + \varepsilon_s^{(j)}, \qquad s = 1, 2, \ldots, m,$$

$$\sum_{k=1}^{n} a_k^{(j)} (y_k, y_s) = (y, y_s) + \varepsilon_s^{(j)}, \qquad s = m+1, m+2, \ldots, n, \qquad (9')$$

and

$$\sum_{k=1}^{m} a_k^0 (y_k, y_s) = (y, y_s) - \sum_{k=m+1}^{n} a_k^0 (y_k, y_s), \qquad s = 1, 2, \ldots, m,$$

$$\sum_{k=1}^{n} a_k^0 (y_k, y_s) = (y, y_s), \qquad s = m+1, m+2, \ldots, n. \qquad (11')$$
Given any $a_k^0$, $k = m+1, \ldots, n$, (11') yields a set $\{a_k^0\}$ satisfying (12), since the first $m$ of equations (11') are equivalent to the system

$$\sum_{k=1}^{m} a_k^0 (y_k, y_s) = (\tilde y, y_s), \qquad \tilde y = y - \sum_{k=m+1}^{n} a_k^0 y_k, \qquad s = 1, 2, \ldots, m, \qquad (13)$$

whilst the last $n - m$ equations are a consequence of the first $m$. We put $a_k^0 = \lim_{j \to \infty} a_k^{(j)}$, $k = m+1, \ldots, n$. Now,

$$| a_k^{(j)} - a_k^0 | \to 0, \qquad j \to \infty, \qquad k = 1, 2, \ldots, n, \qquad (14)$$

and consequently,

$$\Big\| y - \sum_{k=1}^{n} a_k^{(j)} y_k \Big\| - \Big\| y - \sum_{k=1}^{n} a_k^0 y_k \Big\| \le \Big\| \sum_{k=1}^{n} (a_k^{(j)} - a_k^0) y_k \Big\| \to 0, \qquad j \to \infty. \qquad (15)$$

Relation (6) is proved. To prove the stronger assertion (7), we observe that the discussion can always be reduced to the case when

$$\Big\| y - \sum_{k=1}^{n} a_k^0 y_k \Big\| = 0.$$
Thus we can always put $y = u + v$, where $v$ is orthogonal to the subspace formed by the elements $y_1, y_2, \ldots, y_n$, whilst $u$ belongs to this subspace. We now have, for any $\{a_k\}$:

$$\Big\| y - \sum_{k=1}^{n} a_k y_k \Big\|^2 = \Big\| u - \sum_{k=1}^{n} a_k y_k \Big\|^2 + \| v \|^2 \qquad (16)$$

and

$$\min_a \Big\| u - \sum_{k=1}^{n} a_k y_k \Big\| = 0, \qquad (17)$$

since $u$ lies in the subspace spanned by the $y_k$. This is equivalent to our assertion. On the basis of (10), we can now rewrite (15) as

$$\Delta_n^{(j)} \le C_1 \sqrt{2 (y_s, y_s) \, \Delta_n^{(j-1)} \big[ \Delta_n^{(j-1)} - \Delta_n^{(j)} \big]}, \qquad (18)$$

whence

$$\Delta_n^{(j)} \le C_2 \big[ \Delta_n^{(j-1)} - \Delta_n^{(j)} \big]. \qquad (19)$$

It may readily be seen that

$$| a_s^{(j)} - a_s^{(j-1)} | \, \| y_s \| = \Big[ \big( \Delta_{s-1}^{(j)} \big)^2 - \big( \Delta_s^{(j)} \big)^2 \Big]^{1/2} \le \Delta_n^{(j-1)}, \qquad \Delta_0^{(j)} = \Delta_n^{(j-1)}. \qquad (20)$$

It follows from (19) and (20) that the series $\sum_{j=1}^{\infty} \big( a_s^{(j)} - a_s^{(j-1)} \big)$, $s = 1, 2, \ldots, n$, are absolutely convergent.
This completes the proof of the theorem.

As applied to the solution of systems of linear algebraic equations, Theorem 1 implies that the familiar Seidel method is also convergent in the case when the matrix is simply positive (possibly degenerate); admittedly, the approximate solution here converges to different values, depending on the initial vector. A series of worked examples shows that the practical convergence may be extremely good in degenerate or nearly degenerate cases.

It may be remarked that the method of co-ordinate descent in essence bypasses the problem of forming the algebraic system (11), and leads to an economical employment of the computer memory. For instance, when numerous boundary-value problems for partial differential equations are solved by the net method, algebraic systems $Ax = y$ are obtained whose matrices have a fairly small number of non-zero diagonals. In such a case, as is clear from the description of the computational scheme, the memory need store only the vector $u_s^{(j)}$, the current approximation $\{a_k^{(j)}\}$ and the matrix $A$ (not counting the computer program); the vectors $\{y_k\}$ can be obtained as the columns of the matrix $A$ by taking $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, 0, \ldots, 0)$, $\ldots$, $e_n = (0, \ldots, 0, 1)$. At the same time the matrix $A^*A$ of system (11) will in general be completely filled, so that storing it in the memory may prove difficult.

Let us compare the method of co-ordinate descent with the method of fastest descent (see [2]). Theoretically the latter would seem to have an undoubted advantage, since at each step

$$\min_a \| u - a A (A^* u) \|$$

is sought, where $A^*$ is the operator conjugate to $A$ and $A^* u$ is the direction of fastest decrease of the discrepancy. But it must be noted that more computational operations go into one step of this method: computing $A^* u$, and all the more $A A^* u$, is in general far more complicated than calculating $A e_k$, since the calculation of $A e_k$ can be substantially simplified by a suitable choice of $e_k$. Furthermore, the method of fastest descent does not provide for seeking the solution as a linear combination of prescribed elements $\{e_k\}$, and this is often a disadvantage. Also, examples may be quoted where the advantage mentioned does not lead to more rapid convergence. Examples can easily be devised which show that, even at the first step,

$$\min_a \| u - a A (A^* u) \| > \min_a \| u - a A e_s \|. \qquad (21)$$
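The memory economy just described is easy to make concrete. The following sketch (again ours) runs the same co-ordinate descent on a matrix stored only by its non-zero diagonals, reconstructing each column $y_k = A e_k$ on the fly; the storage convention (each diagonal indexed by min(row, column)) is an assumption made for the illustration.

```python
import numpy as np

def coordinate_descent_banded(diags, offsets, y, a0, sweeps=100):
    """Co-ordinate descent for A x = y with A stored by its non-zero diagonals.

    diags   : list of 1-D arrays, the non-zero diagonals of the n x n matrix A,
              each indexed by min(row, column);
    offsets : their offsets (0 = main, +1 = first super-, -1 = first sub-diagonal).
    The completely filled matrix A*A of system (11) is never formed.
    """
    n = len(y)
    a = np.asarray(a0, dtype=float).copy()

    def column(k):                   # y_k = A e_k, recovered from the diagonals
        col = np.zeros(n)
        for d, off in zip(diags, offsets):
            i = k - off              # the row this diagonal meets in column k
            if 0 <= i < n:
                col[i] = d[min(i, k)]
        return col

    r = y - sum(a[k] * column(k) for k in range(n))   # initial discrepancy
    for _ in range(sweeps):
        for s in range(n):
            ys = column(s)
            step = (r @ ys) / (ys @ ys)
            a[s] += step
            r -= step * ys
    return a, np.linalg.norm(r)

# Example: a tridiagonal net-type system with a positive (here even positive
# definite) matrix, as in the remark on the Seidel method above.
n = 50
main, off = 2.0 * np.ones(n), -1.0 * np.ones(n - 1)
a, res = coordinate_descent_banded([main, off, off], [0, 1, -1],
                                   y=np.ones(n), a0=np.zeros(n))
```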
§ 2. METHOD OF GRADIENT DESCENT (L. V. KANTOROVICH)

If the method of fastest descent, in the form in which it is treated in [3], is applied to the functional

$$\Delta(a) = \Big\| y - \sum_{k=1}^{n} a_k y_k \Big\|,$$

taking $n$-dimensional Euclidean space as the domain of definition $X$, the advantages of the method of co-ordinate descent indicated above are preserved in essence and, above all, a more rapidly convergent process is obtained in a number of cases. It is well known that the direction of fastest change of $\Delta$ in the given case is the same as the direction of the gradient to the level surface $\Delta = \mathrm{const}$, and is defined by the vector with the components

$$\frac{\partial \Delta}{\partial a_s} = \frac{1}{\Delta} \Big( \sum_{k=1}^{n} a_k y_k - y, \; y_s \Big), \qquad s = 1, 2, \ldots, n. \qquad (22)$$
We shall therefore refer to this particular form of the method of fastest descent as the method of gradient descent, or the gradient method. We thus have the following computational scheme for the method of gradient descent. We take the initial point $P_0(a_1^{(0)}, a_2^{(0)}, \ldots, a_n^{(0)})$ and the initial value

$$\Delta_0 = \Big\| y - \sum_{k=1}^{n} a_k^{(0)} y_k \Big\|.$$

We then find

$$\Big( \frac{\partial \Delta}{\partial a_s} \Big)_0 = \frac{1}{\Delta_0} \Big( \sum_{k=1}^{n} a_k^{(0)} y_k - y, \; y_s \Big), \qquad s = 1, 2, \ldots, n.$$

If these are not all zero, we find
$$P_1(a_1^{(1)}, \ldots, a_n^{(1)}), \qquad \Delta_1 = \Big\| y - \sum_{k=1}^{n} a_k^{(1)} y_k \Big\|.$$

If the

$$\Big( \frac{\partial \Delta}{\partial a_s} \Big)_1 = \frac{1}{\Delta_1} \Big( \sum_{k=1}^{n} a_k^{(1)} y_k - y, \; y_s \Big), \qquad s = 1, 2, \ldots, n,$$

are not all zero, we find

$$P_2(a_1^{(2)}, \ldots, a_n^{(2)}), \qquad \Delta_2 = \Big\| y - \sum_{k=1}^{n} a_k^{(2)} y_k \Big\|,$$
and so on. In general if, at the $(j+1)$th step, the

$$\Big( \frac{\partial \Delta}{\partial a_s} \Big)_j = \frac{1}{\Delta_j} \Big( \sum_{k=1}^{n} a_k^{(j)} y_k - y, \; y_s \Big), \qquad s = 1, 2, \ldots, n,$$

are not all zero, we find the point $P_{j+1}(a_1^{(j+1)}, \ldots, a_n^{(j+1)})$ by descending from $P_j$ against this gradient with the step that minimizes $\Delta$, and put

$$x_n^{(j+1)} = \sum_{k=1}^{n} a_k^{(j+1)} e_k, \qquad j = 1, 2, \ldots \qquad (23)$$

If

$$\Big( \frac{\partial \Delta}{\partial a_s} \Big)_j = 0, \qquad s = 1, 2, \ldots, n,$$

then $x_n^{(j)}$ is the required element and $\Delta_j = \min_P \Delta(P)$.
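A compact realization of this scheme (ours, under the same finite-dimensional assumptions as before) descends along the gradient of $\Delta^2/2$, which differs from (22) only by the positive factor $\Delta_j$ and therefore produces the same points $P_j$:

```python
import numpy as np

def gradient_descent(Y, y, a0, steps=500, tol=1e-12):
    """Gradient (fastest) descent for Delta(a) = ||y - Y a||.

    g = grad(Delta^2 / 2) = Y^T (Y a - y) is collinear with the vector (22);
    the step length along -g is chosen by exact minimization of Delta.
    """
    a = np.asarray(a0, dtype=float).copy()
    for _ in range(steps):
        g = Y.T @ (Y @ a - y)
        if np.linalg.norm(g) <= tol:      # all (dDelta/da_s)_j vanish: minimum found
            break
        Yg = Y @ g
        a -= (g @ g) / (Yg @ Yg) * g      # exact line search along -g
    return a, np.linalg.norm(y - Y @ a)   # the point P_j and the value Delta_j
```

The division by $(Yg, Yg)$ in the step length is precisely the division by quantities tending to zero near the minimum that is mentioned at the end of the paper.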
Let $Q_n$ denote the operator which projects $G_1$ on to the subspace $N_n$ of non-trivial solutions of the homogeneous system corresponding to (11), and $\bar x_n^0$ the unique solution of system (11) that is orthogonal to the subspace $N_n$.

THEOREM 2. The gradient method is always convergent in the sense that

$$\lim_{j \to \infty} \Delta_j = \min_P \Delta(P) \qquad (24)$$

and

$$\lim_{j \to \infty} x_n^{(j)} = Q_n x_n^{(0)} + \bar x_n^0, \qquad (25)$$

where $x_n^{(j)} = \sum_{k=1}^{n} a_k^{(j)} e_k$.
Proof. Let us show first that

$$\Big\| \sum_{s=1}^{n} \frac{\partial \Delta}{\partial a_s} y_s \Big\| = 0$$

when and only when $\partial \Delta / \partial a_s = 0$, $s = 1, 2, \ldots, n$. Let

$$y_s = \sum_{k=1}^{m} d_k^{(s)} y_k, \qquad s = m+1, \ldots, n,$$

where $y_1, y_2, \ldots, y_m$ are linearly independent. In this case

$$\frac{\partial \Delta}{\partial a_s} = \sum_{k=1}^{m} d_k^{(s)} \frac{\partial \Delta}{\partial a_k}, \qquad s = m+1, \ldots, n, \qquad (26)$$

and

$$\sum_{s=1}^{n} \frac{\partial \Delta}{\partial a_s} y_s = \sum_{r=1}^{m} \Big[ \frac{\partial \Delta}{\partial a_r} + \sum_{s=m+1}^{n} d_r^{(s)} \frac{\partial \Delta}{\partial a_s} \Big] y_r. \qquad (27)$$

It follows from (27), by the linear independence of $y_1, \ldots, y_m$ and by (26), that

$$\frac{\partial \Delta}{\partial a_r} + \sum_{s=m+1}^{n} \sum_{k=1}^{m} d_r^{(s)} d_k^{(s)} \frac{\partial \Delta}{\partial a_k} = 0, \qquad r = 1, 2, \ldots, m. \qquad (28)$$

On expanding the determinant of this system as a sum of determinants, it may easily be shown that it cannot be equal to zero. Consequently, $\partial \Delta / \partial a_r = 0$, $r = 1, 2, \ldots, m$. But by (26) we now have $\partial \Delta / \partial a_s = 0$, $s = m+1, \ldots, n$, i.e. we have proved what was required.

It is further easily seen that $\{\Delta_j\}$ is a non-increasing sequence, bounded from below. Therefore $\lim_{j \to \infty} \Delta_j = \Delta^0$ exists, and this limit is attained at some point $P^0$, $\Delta^0 = \Delta(P^0)$, which is a limit point for $\{P_j\}$. After simple calculations, we obtain

$$\Delta_j^2 - \Delta_{j+1}^2 = \Delta_j^2 \Big[ \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j^2 \Big]^2 \Big/ \Big\| \sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j y_s \Big\|^2. \qquad (29)$$
Hence it follows, since the denominator in (29) is bounded ($| \partial \Delta / \partial a_s | \le \| y_s \|$), that

$$\Big( \frac{\partial \Delta}{\partial a_s} \Big)_j \to 0, \qquad j \to \infty, \qquad s = 1, 2, \ldots, n. \qquad (30)$$
But this implies that

$$\Delta^0 = \Delta(P^0) = \min_P \Delta(P), \qquad P^0(a_1^0, a_2^0, \ldots, a_n^0).$$
In order to prove the stronger assertion (25), we observe, as in the proof of Theorem 1, that the discussion can always be reduced to the case when $y$ lies in the subspace of the vectors $y_1, y_2, \ldots, y_n$. Let us suppose that $\{c_k\}$ has the property

$$\sum_{k=1}^{n} c_k (y_k, y_s) = 0, \qquad s = 1, 2, \ldots, n, \qquad (31)$$

and that $x_n^{(0)} = 0$. Now, since

$$P_{j+1} = P_j + t_j \Delta_j V_j, \qquad j = 1, 2, \ldots; \qquad P_1 = t_0 \Delta_0 V_0 \qquad (32)$$

($V_j$ being the unit vector of the direction of descent at $P_j$), and

$$\sum_{s=1}^{n} \Big( \frac{\partial \Delta}{\partial a_s} \Big)_j c_s = \frac{1}{\Delta_j} \Big( \sum_{k=1}^{n} a_k^{(j)} y_k - y, \; \sum_{s=1}^{n} c_s y_s \Big) = 0, \qquad j = 1, 2, \ldots, \qquad (33)$$

all the $x_n^{(j)}$, $j = 1, 2, \ldots$, are orthogonal to the subspace $N_n$. It follows from this that $x_n^{(j)}$ has a unique limiting value, orthogonal to $N_n$. If $x_n^{(0)} \ne 0$, on taking the element $y - \sum_{k=1}^{n} a_k^{(0)} y_k$ instead of $y$, we arrive at the relation

$$\lim_{j \to \infty} x_n^{(j)} = \bar x_n^0 + \sum_k d_k u_k, \qquad (34)$$

where $\{u_k\}$ is a complete orthonormal system in $N_n$. In view of the fact that $\lim_{j \to \infty} (x_n^{(j)} - x_n^{(0)}, u_k) = 0$, we have

$$d_k = (x_n^{(0)}, u_k), \qquad (35)$$
so that $\sum_k d_k u_k = Q_n x_n^{(0)}$. The theorem is proved.

A number of worked examples shows that, in the case of ill-conditioned problems, when the "ellipse" $\Delta = \mathrm{const}$ is considerably stretched in some direction, the gradient method gives faster convergence than the method of co-ordinate descent. In ordinary problems the speeds of convergence are roughly the same, though the accuracy with which the minimum is located is always higher with the method of co-ordinate descent, on the assumption, of course, that the computation is carried out with the same rounding-off rules. The reason for this is that, close to the minimum, the gradient method requires division by numbers (the denominators in (29)) which tend to zero, whereas in the method of co-ordinate descent the division is by the invariant numbers $\| y_k \|^2$.
It may be remarked in conclusion that, by utilizing the ideas of [4], methods of co-ordinate and gradient descent can be introduced which seek at each step to minimize directly the norm of the error: for instance, for the method of co-ordinate descent, minimization of $\| x - x_n^{(j)} - a \bar y_s \|$, where $x$ is the exact solution of the equation $Ax = y$ (which is assumed soluble), $x_n^{(j)}$ is the approximate solution already obtained, and $\bar y_s = A^* e_s$, $\{e_s\}$ being an arbitrary complete system in $G_2$. The required value $a^0$ can be found from the formula

$$a^0 = \frac{(x - x_n^{(j)}, \bar y_s)}{(\bar y_s, \bar y_s)} = \frac{(y - A x_n^{(j)}, e_s)}{(\bar y_s, \bar y_s)},$$

in which the unknown solution $x$ no longer appears.
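A sketch of this error-descent variant (ours; it assumes real matrices and takes the complete system $\{e_s\}$ to be the unit vectors of $H_2$, so that $\bar y_s = A^* e_s$ is the $s$th row of $A$):

```python
import numpy as np

def error_descent(A, y, x0, sweeps=100):
    """Co-ordinate descent on the error norm ||x - x_j - a*z_s||, z_s = A^T e_s.

    The exact solution x of A x = y is unknown, but the optimal step
        a0 = (x - x_j, z_s)/(z_s, z_s) = (y - A x_j, e_s)/(z_s, z_s)
    is computable, since (x - x_j, A^T e_s) = (A x - A x_j, e_s).
    """
    m = A.shape[0]
    x = np.asarray(x0, dtype=float).copy()
    r = y - A @ x                      # residual y - A x_j
    for _ in range(sweeps):
        for s in range(m):             # e_s runs over the unit vectors of H_2
            zs = A[s, :]               # z_s = A^T e_s is the s-th row of A
            a0 = r[s] / (zs @ zs)      # (y - A x_j, e_s)/(z_s, z_s)
            x += a0 * zs               # x_{j+1} = x_j + a0 * z_s
            r -= a0 * (A @ zs)         # keep the residual consistent
    return x
```

With this choice of $\{e_s\}$ each sub-step is the familiar Kaczmarz projection onto the hyperplane $(Ax)_s = y_s$, so it is the error norm itself, rather than the discrepancy, that decreases monotonically.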
Assertions similar to those above can be proved for the convergence of the methods of co-ordinate and gradient descent in the present case.

Translated by D. E. BROWN

REFERENCES

1. MIKHLIN, S. G., Pryamye metody v matematicheskoi fizike. (Direct methods in mathematical physics.) Gostekhizdat, Moscow-Leningrad, 1950.
2. KANTOROVICH, L. V., Funktsional'nyi analiz i prikladnaya matematika. (Functional analysis and applied mathematics.) Usp. matem. nauk 3: No. 6 (28), 135-137, 145, 1948.
3. KANTOROVICH, L. V. and AKILOV, G. P., Funktsional'nyi analiz v normirovannykh prostranstvakh. (Functional analysis in normed spaces.) Gostekhizdat, Moscow-Leningrad, 1959.
4. FRIDMAN, V. M., Novye metody resheniya lineinogo operatornogo uravneniya. (New methods of solution of a linear operator equation.) Dokl. Akad. Nauk SSSR 128: No. 3, 482-484, 1959.