CHAPTER 1
ITERATIVE METHODS ON NORMED LINEAR SPACES
1.1. INTRODUCTION
The first result in nonlinear functional analysis which is likely to attract attention is the contraction mapping or fixed point principle, because it is a simple and constructive approach to finding the roots of equations f(x) = 0 in normed spaces. Many problems in analysis are of this type and the method allows one to prove the existence of a solution x^0 to f(x) = 0 by explicitly constructing a sequence of iterates {x_n} which converge to x^0. The fixed point argument and several of its applications are studied in Section 1.2. Following this, there is a discussion of Fréchet differentials, with the important specialization to scalar or vector mappings on E^n where we can explicitly identify the Fréchet derivative with the gradient (Jacobian) of the scalar-valued (vector-valued) mapping. With this precise notion of differentiability it is possible to give a simple treatment of gradient methods in Section 1.4, with application to the questions of finding the roots of a nonlinear mapping of E^n into itself and of locating the minimum of an unconstrained function on E^n. One of the numerical pitfalls of gradient techniques is the problem of ill conditioning, which is also examined here. We then introduce the concept of generalized inverse of a matrix in order to construct an important approach to the least squares
solution of f(x) = 0. This is done in Section 1.5; in the section following, Newton's method is presented in a function space setting in order to obtain a rapidly convergent algorithm for solving nonlinear two-point boundary value problems. The method reduces the original problem to one of solving a sequence of linear two-point boundary value problems. In the last section a constructive approach to the question of determining the extent of asymptotic stability for nonlinear differential equations is briefly examined.

We assume that the reader has some familiarity with finite-dimensional vector spaces and with linear mappings between such spaces, as well as a basic knowledge of elementary functional analysis, even though not much is really needed for an understanding of this book. For references to background material or to valuable supplementary reading, we suggest that the reader consult the list of books given in the bibliography.

In the remainder of this section, we summarize some of the notation used, and remind the reader of certain theorems. All linear or vector spaces (denoted by E, F, V, W, etc.) are invariably taken over the real scalar field in this book. R^n denotes any n-dimensional vector space and R^1 is the real axis; the letters u, v, w, x, y, and z are used to denote vectors. If a real scalar product is introduced on R^n, then the space is denoted by E^n. It is convenient to distinguish between various inner products on E^n by writing (u, v) to denote the Euclidean inner product Σ_i u^i v^i (it is generally understood that a basis has been fixed in E^n and that {u^i}, {v^i} are the components of u, v relative to this basis) and by writing (u, v)_G to mean (u, Gv), for any positive definite matrix G. No such distinction is usually necessary between norms, all of which are denoted by || ||. On occasion, however, we write ||u||_G to mean (u, Gu)^{1/2}. Relative to a fixed basis in E^n and E^m, the set Hom(E^n, E^m) of all linear maps from E^n to E^m is concretely identified with the collection of m by n (m rows, n columns) matrices. All linear maps between finite-dimensional spaces are bounded in the sense that if f is such a map (or matrix), then ||f(u)|| ≤ ||f|| ||u||, where

    ||f|| = sup_{||u||=1} ||f(u)||.        (1.1.1)
Bounded linear maps are strongly continuous, i.e., ||f(u_n) - f(u)|| → 0 as ||u_n - u|| → 0, and, conversely, continuity implies boundedness. The expression (1.1.1) defines a norm and therefore Hom(E^n, E^m) is an nm-dimensional normed linear space; if m = n, this space is denoted simply by Hom(E^n). Relative to a fixed basis in E^n, every u ∈ E^n (read: u in E^n) can be written as a column vector

    u = (u^1, ..., u^n)^T,

and u^T denotes the corresponding row vector. Thus, for example, (u, Gv) can be written as u^T G v. A^T is the transpose of a matrix A and Det A is its determinant. The reader recalls what is meant by an orthogonal set of vectors, relative to a given inner product, and he remembers the indispensable Schwarz inequality. Convergence u_n → u^0 of a sequence in a normed linear space means, unless otherwise specified, that ||u_n - u^0|| → 0. A mapping f (not necessarily linear) between two (not necessarily linear) spaces E, F is often denoted by f : E → F and we say that f is injective if it is one-one into F, surjective if it maps onto F, and bijective if it is injective and surjective. Thus, a linear map between finite-dimensional spaces is bijective if and only if the corresponding matrix is invertible (nonsingular). If f is scalar valued on R^n, then f ∈ C^m means that f is m-times continuously differentiable, m ≥ 0.
The notation B_δ(u) denotes either an open or closed ball of radius δ about u in any normed linear space. For example, the closed ball of radius δ about the origin is {u ∈ B : ||u|| ≤ δ}, which reads as "the set of all u in B such that ||u|| ≤ δ." We use the symbol ∎ to signal the end of a proof or definition. In this volume a Hilbert space always is assumed to be real, complete, and separable. An integer n is used in different contexts throughout the book, but its meaning should be clear in each instance. For example, n may be the dimension of E^n, or it may denote the nth component
of a vector u (either as sub- or superscript), or it can indicate the nth term in a sequence of vectors.
EXERCISES
1.1.1. Verify that ( , )_G is an inner product on E^n if G is positive definite.
1.1.2. Let {e_i} be an orthonormal basis in E^n relative to ( , )_G. Show that relative to this basis one obtains (u, v)_G = Σ_i ū_i v̄_i, where {ū_i}, {v̄_i} are the components of u, v relative to {e_i}.
1.1.3. All norms on E^n are equivalent in the sense that if || ||_1 and || ||_2 are any two such norms, then || ||_1 ≤ ξ || ||_2 for some scalar ξ. Prove this for the case in which || ||_2 is the Euclidean norm ( , )^{1/2}.
1.1.4. If A is any m by m matrix, then the expression

    ||A|| = sup_{||u||=1} ||Au||

is a norm for any such matrix. Prove this and show that ||AB|| ≤ ||A|| ||B|| if B is any other m by m matrix.

1.1.5. If {A_n} is a sequence of m by m matrices, show that ||A_n|| → 0 [for any norm || || on Hom(E^m)] if and only if each component a^i_j of A_n tends to zero.
1.2. CONTRACTION MAPPINGS
A number of important problems in analysis center around the solution of the equation

    u = f(u) + v,        (1.2.1)
where f is a mapping (generally nonlinear) from a normed space B into itself and v is a known element of B. Several concrete examples of (1.2.1) are given below and in the exercises, as well as in Sections 1.4 and 1.6. If g : B → B is defined by

    g(u) = f(u) + v,        (1.2.2)

then u^0 is a solution of (1.2.1) if and only if it is a fixed point of the mapping g, i.e., whenever g(u^0) = u^0. There exists an extensive literature on the existence of fixed points but most theorems simply resolve the question "does (1.2.1) have a solution" without indicating how such a solution can be found. However, there is one such theorem which has the added virtue of being constructive in its proof, since it finds u^0 by the familiar method of successive approximations. This result is known as the contraction mapping principle and we begin by proving it in its simplest form as Lemma 1.2.5. First, however, a definition is necessary.

Definition 1.2.3. A mapping f of a normed space B into itself is contractive if for all u, v in B

    ||f(u) - f(v)|| ≤ α ||u - v||,        (1.2.4)

where 0 ≤ α < 1. ∎
The inequality (1.2.4) clearly implies that a contraction mapping is continuous on B. The case in which α is allowed to equal one is studied in Chapter 5.

Lemma 1.2.5. Let f be a contraction map from a Banach space B into itself. Then f has a unique fixed point.
Proof. For arbitrary u_0 in B let u_1 = f(u_0), u_2 = f(u_1) = f^2(u_0), and in general let

    u_n = f(u_{n-1}) = f^n(u_0).

Then

    ||u_n - u_m|| = ||f(u_{n-1}) - f(u_{m-1})|| ≤ α ||u_{n-1} - u_{m-1}||
and therefore if m ≥ n, we obtain

    ||u_n - u_m|| ≤ α^n ||u_0 - u_{m-n}||.

Also

    ||u_0 - u_{m-n}|| ≤ ||u_0 - u_1|| + ||u_1 - u_2|| + ... + ||u_{m-n-1} - u_{m-n}||
                     ≤ ||u_0 - u_1|| (1 + α + ... + α^{m-n-1})
                     ≤ ||u_0 - u_1|| / (1 - α),

by repeatedly using the contraction property and the fact that Σ α^i is a convergent geometric series. Hence

    ||u_n - u_m|| ≤ α^n ||u_0 - u_1|| / (1 - α),        (1.2.6)
which shows that {u_n} is a Cauchy sequence, since α < 1. Using the completeness of B, we know that there exists an element u^0 in B for which ||u_n - u^0|| → 0. Moreover, since f is continuous,

    f(u^0) = lim_{n→∞} f(u_n) = lim_{n→∞} u_{n+1} = u^0,

and thus u^0 is a fixed point. To show that this point is unique, suppose that f(u') = u'. Then

    ||u^0 - u'|| = ||f(u^0) - f(u')|| ≤ α ||u^0 - u'|| < ||u^0 - u'||,

which is a contradiction unless u' = u^0. ∎
It should be noted that the proof of Lemma 1.2.5 did not make use of the linearity of the space B but depended only on its completeness. Since any closed subset of a complete space is itself complete, it follows that Lemma 1.2.5 remains valid if B is replaced by any closed subset. In particular, if the contraction f maps the closed ball B_δ ⊂ B into itself, then it has a unique fixed point in B_δ. From a computational point of view, the main attraction of Lemma 1.2.5 is the estimate (1.2.6), for as m → ∞, we obtain

    ||u_n - u^0|| ≤ const · α^n,        (1.2.7)

which shows that u_n → u^0 at a rate not slower than that of α^n for α < 1. This rate of convergence suggests that the constructive
proof of the lemma can be utilized as a feasible algorithm for solving equations of the form (1.2.1). In fact, if u_0 is some initial guess at the solution u^0 and if u_0 lies within a closed subset of B in which f is contractive, then the iterative scheme

    u_{n+1} = g(u_n) = f(u_n) + v        (1.2.8)
is norm convergent to a unique solution u^0 of (1.2.1), where g is given by (1.2.2); note that g is contractive wherever f is contractive and hence Lemma 1.2.5 is applicable. The iterative procedure of (1.2.8) is generally known as the method of successive approximations. As we saw, the success of this method is dependent on being able to find an initial approximation u_0 which lies within the domain of attraction of the fixed point u^0. This domain is defined as the set of initial points u_0 for which {u_n} converges to u^0. Lemma 1.2.5 states that any closed set on which f is contractive belongs to the domain of attraction. Moreover, the point u^0, being unique, does not depend on the initial approximation chosen within the set of points attracted to u^0. The problem then is to find the domain. At best, an estimate is obtained, usually quite restrictive and narrow, of a region in which f has to be contractive; Lemma 1.2.5 merely gives a sufficient but by no means necessary condition for the convergence of (1.2.8) and in practice the iterates may converge in a much larger region. For example, in Section 1.4 we encounter Newton's method for finding roots of a mapping h on the space E^k. This leads us to an iterative scheme of the type (1.2.8) and when k = 1, for example, the iterates are given by

    x_{n+1} = g(x_n) = x_n - h(x_n)/h'(x_n).        (1.2.9)
One would like to know when {x_n} is convergent to a fixed point of g which, in turn, would satisfy h(x) = 0 [the case in which v = 0 in (1.2.1)]. We leave as a simple exercise the following demonstration: if

    | h(x) h''(x) / [h'(x)]² | ≤ α < 1        (1.2.10)
in some closed subset Q of R^1 which is mapped into itself by g, then there exists a fixed point x^0 of g which is a root of h. However, it is not hard to find examples in which the iterates {x_n} converge, even though (1.2.10) is violated. One such example is provided by h(x) = x² - 2, whose root is clearly x^0 = √2. Thus, x^0 is a fixed point of

    g(x) = x - h(x)/h'(x) = x/2 + 1/x.

It is easy to see that g(x) ≥ √2 for x > 0 and therefore the closed set x ≥ 1/2 is mapped into itself by g. Moreover, if the iterates are started at x_0 = 1/2, then x_1 = g(x_0) = 9/4, and within the next few iterates x_n ≈ √2, as expected. However,

    | h(x) h''(x) / [h'(x)]² | = | 1/2 - 1/x² | = 7/2 > 1

at x = 1/2 and (1.2.10) is not satisfied. Of course, if the iterate x_0 is sufficiently close to √2, then |1/2 - 1/x_0²| < 1 and g is contractive.
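This behavior is easy to check numerically. The short sketch below (the function name, starting point, and iteration count are our own choices, not part of the text) runs x_{n+1} = g(x_n) = x_n/2 + 1/x_n from x_0 = 1/2; the iterates settle down to √2 even though (1.2.10) fails at the starting point.

```python
import math

def g(x):
    # iteration function for h(x) = x**2 - 2:
    # g(x) = x - h(x)/h'(x) = x/2 + 1/x
    return x / 2.0 + 1.0 / x

x = 0.5                      # starting point where (1.2.10) is violated
for n in range(6):
    print(n, x)
    x = g(x)

print("limit ~", x, "   sqrt(2) =", math.sqrt(2.0))
```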
Lemma 1.2.5 admits several refinements, one of which states that it is not necessary to know that f is contractive, provided some sufficiently large iterate of f (in some cases this may be f itself) is contractive. This is stated as a corollary in which f^m denotes f iterated m times.

Corollary 1.2.11. Suppose f : B → B, where B is a Banach space, and f^m is contractive for some m ≥ 1. Then f has a unique fixed point.
Proof. By Lemma 1.2.5, f^m has a unique fixed point u^0 and thus

    f(u^0) = f(f^m(u^0)) = f^{m+1}(u^0) = f^m(f(u^0)),

from which it is seen that f(u^0) is also a fixed point of f^m. By uniqueness it follows that u^0 = f(u^0). Now any fixed point of f is a fixed point of f^m, which proves that u^0 uniquely satisfies u = f(u). ∎
An Application to Integral Equations
We pause here to examine one application of the above corollary. Consider the Volterra integral equation defined by u = f(u) = λKu or, more specifically, by

    u(x) = λ ∫_0^x K(x, ξ, u(ξ)) dξ,        (1.2.12)
where λ is a scalar. The problem is to find a solution of (1.2.12). If K is a continuous and nonlinear function in x, then (1.2.12) is a special case of (1.2.1) in which B is taken to be the space of continuous functions on [0, 1] with uniform norm. It is assumed, moreover, that K satisfies the uniform Lipschitz condition

    | K(x, ξ, u_1(ξ)) - K(x, ξ, u_2(ξ)) | ≤ β | u_1(ξ) - u_2(ξ) |        (1.2.13)
on [0, 1]. We wish to show now that f^m is contractive on B for any λ, when m is sufficiently large. To do this, first note that if u_1, u_2 are two continuous functions on [0, 1] and if γ = ||u_1 - u_2|| (the uniform norm in this case), then

    | f(u_1)(x) - f(u_2)(x) | ≤ |λ| β ∫_0^x | u_1(ξ) - u_2(ξ) | dξ ≤ |λ| β γ x,

using (1.2.13). Also, since

    f²(u)(x) = λ ∫_0^x K(x, ξ, f(u)(ξ)) dξ = λ ∫_0^x K(x, ξ, λ ∫_0^ξ K(ξ, η, u(η)) dη) dξ,

it follows that

    | f²(u_1)(x) - f²(u_2)(x) | ≤ |λ|² β² ∫_0^x ∫_0^ξ | u_1(η) - u_2(η) | dη dξ ≤ |λ|² β² γ x²/2;

and, in general,

    | f^m(u_1)(x) - f^m(u_2)(x) | ≤ |λ|^m β^m γ x^m / m!,

or

    || f^m(u_1) - f^m(u_2) || ≤ ( |λ|^m β^m / m! ) || u_1 - u_2 ||.        (1.2.14)
But |λ|^m β^m / m! is the mth term of a convergent series and so for each λ there exists some m for which |λ|^m β^m / m! < 1. This shows that f^m is contractive on B. It follows from Corollary 1.2.11 that f has a unique fixed point u^0 which is a continuous solution to the Volterra equation.
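As a concrete numerical illustration of this successive-approximation argument, the sketch below applies Picard iteration to a simple linear Volterra equation with an inhomogeneous term, u(x) = 1 + λ ∫_0^x u(ξ) dξ; the kernel, inhomogeneous term, grid, and tolerance are our own choices (the exact solution for λ = 1 is e^x), and the integral is approximated by the trapezoid rule.

```python
import numpy as np

# Picard (successive approximation) iteration u_{n+1} = f(u_n) + v for
#     u(x) = 1 + lam * \int_0^x u(xi) d(xi),  i.e. K(x, xi, u) = u, v(x) = 1.
lam = 1.0
x = np.linspace(0.0, 1.0, 101)
h = x[1] - x[0]
v = np.ones_like(x)

u = np.zeros_like(x)                 # initial guess u_0 = 0
for n in range(30):
    Ku = np.zeros_like(x)
    for i in range(1, len(x)):       # trapezoid rule for \int_0^{x_i} u d(xi)
        Ku[i] = Ku[i - 1] + 0.5 * h * (u[i - 1] + u[i])
    u_new = v + lam * Ku
    if np.max(np.abs(u_new - u)) < 1e-12:
        break
    u = u_new

print("max error vs exp(x):", np.max(np.abs(u - np.exp(x))))
```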
Extensions of the Fixed Point Argument

In a number of problems of interest f is only given as a mapping from a closed ball B_δ ⊂ B into B. In order to exhibit a fixed point by means of the contraction mapping principle it is first necessary to ascertain whether B_δ maps into itself under f. Lemma 1.2.5 can be refined to account for mappings on B_δ, provided that f does not move the center of B_δ too far. More precisely, we have the following result.
Lemma 1.2.15. Let f : B_δ → B be contractive and suppose that

    || f(v) - v || ≤ (1 - α) δ,        (1.2.16)

where α is the contraction constant and δ is the radius of the closed ball B_δ(v) with center v. Then f has a unique fixed point in B_δ.

Proof.
For any u in B_δ, using (1.2.16),

    || f(u) - v || ≤ || f(u) - f(v) || + || f(v) - v || ≤ α || u - v || + (1 - α) δ ≤ α δ + (1 - α) δ = δ

is obtained and f(u) also belongs to B_δ. Hence, Lemma 1.2.5 is applicable. ∎

It is a useful observation that if B_δ is an open ball, then Lemma 1.2.15 is still valid, provided (1.2.16) is taken to be a strict inequality, for then f can be restricted to a suitable closed ball contained within B_δ and the proof completed as before. Another useful fact connected with the last lemma is as follows: if f is contractive on B and if f moves a point u by a distance r, then the distance of u from the fixed point of f is at most r/(1 - α), where α
is the contraction constant. In fact, if B_δ(u) is the closed ball about u of radius δ = r/(1 - α), then Lemma 1.2.15 shows that the fixed point of f also lies in B_δ. Yet another significant improvement of the contraction mapping principle is based on the consideration of a family of maps f(s, ·) from B_δ → B which depend on a parameter s belonging to some normed space S.
Definition 1.2.17. A family of maps f(s, ·) with s ∈ S are uniformly contractive on B if

    || f(s, u) - f(s, v) || ≤ α || u - v ||        (1.2.18)

for all u, v in B and s ∈ S; 0 ≤ α < 1. ∎
Theorem 1.2.19. Let f(s, ·) be a family of uniformly contractive maps from an open ball B_δ ⊂ B to B and suppose that f(·, u) is continuous in s belonging to S for each u ∈ B_δ. If f moves the center of B_δ by less than (1 - α)δ uniformly on S, then f(s, ·) has a unique fixed point u^0(s) in B_δ for each s and u^0 is continuous on S.
Proof. The existence of the fixed point u^0(s) for each s is guaranteed by Lemma 1.2.15. It suffices to show that u^0 defines a continuous map from S to B. Since f(·, u) is continuous in s for each fixed u in B_δ, we know that ||s - t|| < η, with norm taken in S and η sufficiently small, implies that

    || f(s, u^0(s)) - f(t, u^0(s)) || < ε

for any given ε > 0 (the last norm is understood in the sense of B). However, f(s, u^0(s)) = u^0(s) and therefore the contraction f(t, ·) with parameter t moves the point u^0(s) at most a distance ε. As observed above (in a remark made after the proof of Lemma 1.2.15), this shows that the distance between u^0(s) and the fixed point u^0(t) of f(t, ·) is at most ε/(1 - α). Hence, ||s - t|| < η implies

    || u^0(s) - u^0(t) || ≤ ε/(1 - α),

and this establishes the continuity of u^0 in s. ∎
The Inverse Mapping Theorem
An immediate application of the last theorem is the inverse
mapping theorem, which is stated here in its simplest form: Suppose f is a real-valued C^1 map from an open subset of the real axis into itself such that f(0) = 0. If f'(0) is nonzero, then f possesses a local continuous inverse at x = 0 in the sense that it maps a sufficiently small open ball about the origin bijectively on an open set containing the origin and such that the inverse of f is continuous on this set. In fact, if

    g(s, x) = x - [f'(0)]^{-1} [f(x) - s],
where s is a real parameter, then f(x) = s has a unique solution for each s whenever g(s, x) has a fixed point in x. Using the mean value theorem on R^1 (Apostol [A1, p. 117]), for 0 < α < 1,

    | g(s, x) - g(s, y) | = | 1 - f'(ξ)/f'(0) | | x - y | ≤ α | x - y |

is obtained if x and y lie within a sufficiently small open ball B_δ about the origin, for then ξ (which lies on the line joining x to y) will be close to zero. Thus, g(s, ·) is uniformly contracting on B_δ. Also g moves the center of B_δ by less than (1 - α)δ, provided s also lies in a small enough ball about the origin:
    | g(s, 0) - 0 | = | s | / | f'(0) | < (1 - α) δ

for |s| < δ_1. Hence, by Theorem 1.2.19, there exists a continuous function x^0(s) defined on B_{δ_1} such that g(s, x^0(s)) = x^0(s), i.e., for which f(x^0(s)) = s. This completes the proof of the inverse mapping theorem in this case.
It is worth noting that the roots of f(x) = s can be found iteratively by

    x_{n+1} = x_n - [f'(0)]^{-1} [f(x_n) - s],

which is a special case of Newton's method, discussed earlier, except that there f' varied with each iterate as f'(x_n) rather than being fixed at one point. In any event, the proof of the inverse mapping theorem is simply another application of root finding techniques, a topic which will be explored in more detail in Sections 1.4-1.6.

A Useful Lemma
We conclude this section by an application of fixed point ideas to a linear problem, since it leads to a result which will be useful in the next chapter. Suppose A is an n by n matrix with ||A|| < 1. Then the equation u = ±Au + v has a unique solution in E^n for each vector v, since f(u) = ±Au + v is contractive. This shows that (I ∓ A)^{-1} exists and u = (I ∓ A)^{-1} v. More generally, since

    Iλ - A = λ (I - λ^{-1} A),

it follows that Iλ - A is invertible for all λ which satisfy ||A|| < |λ|. Thus, if μ is an eigenvalue of A, then |μ| ≤ ||A||. Note, in particular, that a unique solution to Au = v can always be found by writing this system of equations as u = Bu + v, where B = I - A, provided, of course, that ||B|| < 1. This observation is the basis for the familiar iterative method for inverting A.
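A minimal numerical sketch of this last observation (the matrix, right-hand side, and tolerance below are chosen only for illustration): when ||I - A|| < 1 the successive approximations u_{k+1} = Bu_k + v with B = I - A converge to the solution of Au = v.

```python
import numpy as np

# Solve Au = v by successive approximations u_{k+1} = B u_k + v, B = I - A,
# which is contractive whenever ||B|| < 1 (here measured in the matrix 2-norm).
A = np.array([[1.2, 0.3],
              [0.1, 0.9]])
v = np.array([1.0, 2.0])

B = np.eye(2) - A
assert np.linalg.norm(B, 2) < 1.0        # contraction condition

u = np.zeros(2)
for k in range(200):
    u_next = B @ u + v
    if np.linalg.norm(u_next - u) < 1e-12:
        break
    u = u_next

print(u, np.linalg.solve(A, v))          # the two results agree
```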
In the space Hom(E^n) we can speak of open sets of matrices relative to the norm imposed on Hom(E^n). We now wish to prove the following result.
Lemma 1.2.20. The set of invertible matrices forms an open set in Hom(E^n).
Proof. Suppose A is invertible. It suffices to show that all B in some ball about A are also invertible. In fact, suppose
||B - A|| < δ and write B as B = A - (A - B). Then A^{-1}B = I - A^{-1}(A - B) = I - Q. But

    ||Q|| = ||A^{-1}(A - B)|| < 1

if δ is small enough. Hence, as shown above, A^{-1}B = I - Q has an inverse. Then B = A(I - Q) is also invertible. ∎
EXERCISES
1.2.1. A Fredholm integral equation can be treated as a special case of the equation (1.2.12) when x is taken to be the upper limit unity. Assume again that (1.2.13) holds and show now that a unique solution exists in the space of continuous functions on [0, 1] with uniform norm for the equation

    u(x) = λ ∫_0^1 K(x, ξ, u(ξ)) dξ

whenever |λ| β < 1.
1.2.2. If ||A|| < 1, show that

    (I - A)^{-1} = Σ_{n=0}^∞ A^n,

where the series is interpreted in the sense that Σ_{n=0}^m A^n possesses a limit in norm as m → ∞.

1.2.3.
Give an example of a continuous map of B into itself which possesses a fixed point but is not contractive.
1.2.4. Suppose f maps R^1 into itself. Show that the condition (1.2.10) is sufficient to guarantee that f be contractive.

1.2.5. Let f be a real-valued C^1 map on some compact subset K of R^1 containing the origin and suppose that f' ≠ 0. Show that f(x) = 0 has at most a finite number of solutions in K.
1.2.6. Suppose the map f has a fixed point u^0 in a Banach space B and suppose also that f is contractive on some ball B_δ(u^0). Show that f maps the ball into itself and hence that u_{n+1} = f(u_n) converges to u^0 whenever u_0 ∈ B_δ.
1.3. DIFFERENTIATION ON NORMED SPACES
Fréchet Differentiability
This section is devoted to a discussion of some topics of rather basic importance to that which follows in this and in subsequent chapters. Recall first that if f is a real-valued function on R^1, then it can be said that f is differentiable at some x whenever [f(x + u) - f(x)]/u has a limit as u → 0. This definition does not readily generalize to mappings on higher-dimensional spaces. However, note that if f'(x) is the derivative of f at x, then

    df(x, u) = f'(x) · u

is a linear mapping in u for which

    | f(x + u) - f(x) - df(x, u) | / | u | → 0        (1.3.1)
as |u| → 0. In fact (1.3.1) can be taken as the definition of differentiability on R^1 and, indeed, it is in this form that the concept of differentiation can be extended to arbitrary normed linear spaces.

Definition 1.3.2. Let f be defined as a map between two normed linear spaces E, F. Let U be an open subset of E. Then f is strong or F differentiable on U if for each x in U there exists some λ(x, ·) in Hom(E, F) for which

    || f(x + u) - f(x) - λ(x, u) || / || u || → 0        (1.3.3)
as ||u|| → 0 (the norms, of course, are understood in the sense of
E or F, as the case may be). The term λ(x, ·) is called the strong or F differential of f at x. ∎
For each x, λ(x, ·) is defined and linear on E and it plays the same role as df did on R^1. In general λ is not linear in x nor need it even be defined for all x. The name "F differential" is a short version of Fréchet differential and its definition was originally given by Fréchet.¹ The following theorem summarizes some of its properties.

Theorem 1.3.4. Let f be a mapping from E to F; if the F differential λ exists for some x in E, it is unique. If f is a constant map, then λ is the zero map, and if f is linear, then λ(x, u) = f(u), independent of x.
Proof. Suppose λ, μ are both F differentials of f at x and let

    Δf(x, u) = f(x + u) - f(x).

Then

    || λ(x, u) - μ(x, u) || / || u || ≤ || Δf(x, u) - λ(x, u) || / || u || + || Δf(x, u) - μ(x, u) || / || u || → 0

as ||u|| → 0. Now fix u_1 in E and let u = αu_1 (α is a real scalar). By the linearity of λ, μ in u,

    || λ(x, u) - μ(x, u) || / || u || = || λ(x, u_1) - μ(x, u_1) || / || u_1 ||.

But ||u|| → 0 as α → 0 and hence

    || λ(x, u_1) - μ(x, u_1) || = 0

for each nonzero u_1 in E. It follows that λ = μ and hence the F differential is uniquely defined at x. When f is constant, then Δf(x, u) = 0 for all u and thus λ = 0 satisfies (1.3.3). Finally, if f is linear, then Δf(x, u) = f(u) and (1.3.3) is satisfied by λ(x, u) = f(u). ∎
¹ M. Fréchet, Ann. Sci. École Norm. Sup. 42, 293-323 (1925).
Definition 1.3.2 is given in terms of a differential; in certain cases it is possible and even convenient to speak of a derivative. Thus, for example, if E is a Hilbert space and if f is real valued, then λ(x, ·) is a bounded linear functional on E for each x in U ⊂ E. By a theorem of Riesz (Theorem 5.2.10), every such functional is of the form λ(x, u) = (u, v), where v belongs to E. The term v may be thought of as the strong or F derivative of f at x ∈ U, in analogy to the R^1 case discussed above in which λ(x, u) = u · v, with v = f'(x). Two particular cases will be discussed, in each of which the notion of derivative can be isolated from that of the differential. We begin by looking at real-valued functions on finite-dimensional spaces.

Lemma 1.3.5. Let f be a real-valued C^1 function on an open subset U of E^n. Then the F differential of f exists on U and is given by

    λ(x, u) = (∇f(x), u),

where ∇f(x) is the vector

    ∇f(x) = (∂f/∂x_1, ..., ∂f/∂x_n)^T

evaluated at x (the F derivative or gradient of f).

Proof.
Δf(x, u) = f(x_1 + u_1, x_2, ..., x_n) - f(x_1, ..., x_n)
         + f(x_1 + u_1, x_2 + u_2, x_3, ..., x_n) - f(x_1 + u_1, x_2, ..., x_n)
         + ...
         + f(x_1 + u_1, ..., x_n + u_n) - f(x_1 + u_1, ..., x_{n-1} + u_{n-1}, x_n).
Now apply the classical mean value theorem (Apostol [A1, p. 117]) to each of these pairs to obtain

    Δf(x, u) = u_1 ∂f/∂x_1 (x_1 + ξ_1, x_2, ..., x_n) + ... + u_n ∂f/∂x_n (x_1 + u_1, ..., x_{n-1} + u_{n-1}, x_n + ξ_n),
where 0 < ξ_i < u_i and u = (u_1, ..., u_n)^T; for convenience of notation the last sum is written as

    Δf(x, u) = (∇f(x + ξ), u).

Then

    | Δf(x, u) - (∇f(x), u) | / || u || ≤ || ∇f(x + ξ) - ∇f(x) || → 0

as ||u|| → 0 (since ||u|| → 0 implies ||ξ|| → 0 and f is C^1). The F differential is unique when it exists and, therefore, by (1.3.3),

    λ(x, u) = (∇f(x), u),

as asserted. ∎
Using a somewhat lengthier but analogous argument, one can show that if f maps E^n into E^m and if f is C^1 in each of its m components on an open subset U of E^n, then the F differential exists on U and is given by

    λ(x, u) = J_f(x) u,

where J_f is the m by n F derivative or Jacobian matrix defined by

    J_f(x) = [ ∂f^i/∂x_j ],   i = 1, ..., m,   j = 1, ..., n,        (1.3.6)

and f is the vector (f^1, ..., f^m)^T. That J_f is a matrix is already clear from the fact that λ(x, ·) belongs
to Hom(E^n, E^m) for each x in U. In the two examples just discussed the F derivatives ∇f and J_f were identified with specific elements of Hom(E, F) and they gave rise to differentials which not only are linear and continuous in u for each x, but which are continuous on U for each u; i.e., ∇f and J_f are continuous on U since f is C^1. When f is real valued on a Hilbert space, then the F derivative of f can be identified with some element v in that space. By analogy with the special setting of finite-dimensional Hilbert spaces E^n, v is called the gradient of f. Since the infinite-dimensional Hilbert spaces of interest are function spaces, this definition allows us to speak of a gradient in function space (a specific example is given in Section 4.3).
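A small numerical check of (1.3.3) and of the identification λ(x, u) = (∇f(x), u) of Lemma 1.3.5; the function, point, and direction below are arbitrary choices made only for illustration.

```python
import numpy as np

# f(x) = x1^2 + sin(x1*x2) on E^2, with its gradient coded analytically.
def f(x):
    return x[0]**2 + np.sin(x[0] * x[1])

def grad_f(x):
    return np.array([2*x[0] + x[1]*np.cos(x[0]*x[1]),
                     x[0]*np.cos(x[0]*x[1])])

x = np.array([1.0, 0.5])
u = np.array([0.3, -0.2])
for t in [1.0, 0.1, 0.01, 0.001]:
    h = t * u
    remainder = f(x + h) - f(x) - grad_f(x) @ h
    print(np.linalg.norm(h), abs(remainder) / np.linalg.norm(h))
# the ratio |remainder| / ||h|| tends to zero as ||h|| -> 0, as (1.3.3) requires
```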
Directional Derivatives

Definition 1.3.2 does not restrict the manner in which u → 0 in E. A weaker notion of differential is obtained by insisting that u → 0 along certain directions given by u = αu_1 as α → 0 (α is a real scalar and u_1 is a fixed element of E). First, a definition is given for scalar mappings on E^n.
Definition 1.3.7. The directional derivative at some x in E^n in the direction u (with ||u|| = 1) of a real-valued function f is given by the limit, when it exists,

    f_u(x) = lim_{α→0} [ f(x + αu) - f(x) ] / α.        (1.3.8)
Lemma 1.3.9. Let f be C^1 and real valued on an open subset U of E^n. Then the directional derivative exists on U and is given by

    f_u = (∇f, u).        (1.3.10)

Proof. By Lemma 1.3.5 the F differential exists on U and therefore if ||u|| = 1,

    [ f(x + αu) - f(x) ] / α → (∇f(x), u)

is obtained as α → 0, for then ||αu|| → 0. ∎
Because of (1.3.10) the directional derivative is often called the weak differential of f. In Chapter 4 this concept is extended to a Hilbert space setting and it is then applied to certain optimization questions in function space. Note that f_u becomes the jth partial derivative of f when u is taken to be the direction vector which is zero in all components except for a one in the jth place.

Given a vector x in E^n, the set of all vectors u ∈ E^n for which (u, x) = b (b is a real scalar) defines a hyperplane. For example, if x = (x_1, x_2)^T is given in E^2 (the plane), then (u, x) = b means

    u_1 x_1 + u_2 x_2 = b,

which is the equation of a straight line (see Fig. 1.3.1). In E^3, (u, x) = b means

    u_1 x_1 + u_2 x_2 + u_3 x_3 = b,

which describes a plane in space.

FIG. 1.3.1. Half-spaces and hyperplanes in E^2. Shaded area is the half-space (u, x) - b ≥ 0.

Now in E^2 the set of vectors u for which

    u_1 x_1 + u_2 x_2 - b ≥ 0

defines a half-plane (Fig. 1.3.1). This motivates the definition of a half-space in E^n as the set of vectors u which satisfy (u, x) - b ≥ 0 [or (u, x) - b ≤ 0]. If S is the surface in E^n defined by the locus of points for which a C^1 function f = constant (equipotential surface) and if x ∈ S, then the tangent hyperplane at x is the hyperplane described by the set of points u for which

    (∇f(x), u) = 0.

Thus, the tangent hyperplane at x lies along those directions u (||u|| = 1) for which the directional derivative is zero at x, viz. along directions which are orthogonal to ±∇f(x) (see Fig. 1.3.2 for an example in E^2).

FIG. 1.3.2. Tangent hyperplane and negative gradient in E^2. Shaded area is half-space of downhill directions for f as seen locally at x: (∇f(x), u) < 0.
Since the directions orthogonal to ∇f(x) are locally in the direction of no change in f (zero directional derivative), it seems plausible that ∇f(x) should itself point locally in the direction of greatest change in f. To prove this, use Schwarz's inequality, together with ||u|| = 1, to obtain

    -||∇f(x)|| ≤ (∇f(x), u) ≤ ||∇f(x)||,

with equality if and only if u = ±γ∇f(x) [hence, γ = 1/||∇f(x)||]. It follows that the direction which is locally of steepest descent at x, relative to the Euclidean norm, is given by u = -∇f(x)/||∇f(x)|| and then

    (∇f(x), u) = -||∇f|| < 0

[unless ∇f(x) = 0]. Now consider the half-space of directions u for which (∇f(x), u) < 0 (which includes the vector u = -∇f(x); see shaded area in Fig. 1.3.2). Since

    f(x + αu) - f(x) = α(∇f(x), u) + o(α),

then for all α which are small enough and positive, f(x + αu) < f(x). This shows that all u such that (∇f(x), u) < 0 are locally downhill directions for f at x. Hence, there exist two half-spaces separated by the tangent hyperplane to the surface f = constant at x, such that the half-space (∇f, u) < 0 defines descent paths and (∇f, u) > 0 defines ascent paths.

Steepest Descent with Respect to Non-Euclidean Norms
The F differential and directional derivatives of f on E^n do not depend on the choice of norm, as shown previously. However, both of these expressions are written as (∇f, u), which is in terms of the Euclidean inner product on E^n. If the norm on E^n is non-Euclidean, say

    ||u||_G = (u, Gu)^{1/2}
(see remarks in Section 1.1), then the question "what constitutes a steepest descent path at x, with respect to the non-Euclidean norm?" is not readily determined as it was in the Euclidean case, where Schwarz's inequality could be applied to (∇f, u). Nonetheless, it is of interest to determine what this path will be, since the character of the surface f = constant varies with the norm chosen. For example, the surface ||u||_G = constant represents an ellipse in E^2 while in terms of the Euclidean norm ||u|| = constant is a circle in E^2. In order to find the steepest descent path, the Lagrange multiplier technique is used to minimize the directional derivative over all u for which ||u||_G = 1 (the Lagrange method is fully discussed in the next chapter): for fixed x, the function f_u (linear in u) satisfies

    ∇f_u + λ ∇(||u||²_G - 1) = 0,

where λ is a scalar multiplier and ||u||²_G - 1 = 0 is the constraint. Hence, ∇f(x) + 2λGu = 0 [we leave it as an exercise to show that the gradient of the quadratic form (u, Gu) is the vector 2Gu], from which it follows that

    u = -γ G^{-1} ∇f(x),
and γ is the normalizing factor ||∇f(x)||^{-1}_{G^{-1}}. Thus, we see that the paths which are locally of steepest descent at x in E^n depend on the norm chosen, and if this norm is || ||_G, then the path is in the direction

    u = -G^{-1} ∇f(x).
When G = I (Euclidean norm), we recover the result obtained earlier. This fact will be of importance later.

Taylor's Theorem and Some Applications
Let U be an open interval on R^1. If f is a real-valued C^2 function on U, then the usual Taylor's theorem (Apostol [A1, p. 96]) states that

    Δf(x, u) = df(x, u) + (u²/2) f''(ξ),        (1.3.11)

where

    Δf(x, u) = f(x + u) - f(x)
and where ξ lies on the line joining x to x + u (i.e., ξ = x + ηu, where 0 ≤ η ≤ 1). To extend this result to E^n, it is necessary to define the following.

Definition 1.3.12. A subset C of a normed linear space E is convex if any point on the line segment joining two points in C also lies in C; that is, if u, v ∈ C, then αu + (1 - α)v ∈ C for all 0 ≤ α ≤ 1. ∎

We can now state the following version of Taylor's theorem on E^n.
Theorem 1.3.13. Let a real-valued f be C^2 on an open convex subset C of E^n. Then if x and x + u belong to C,

    Δf(x, u) = (∇f(x), u) + (1/2)(u, H_f(ξ)u),        (1.3.14)

where ξ lies on the line segment joining x and x + u (i.e., ξ = x + ηu, where 0 ≤ η ≤ 1) and H_f(x) is the Hessian matrix of f defined by

    H_f = [ ∂²f / ∂x_i ∂x_j ],   i, j = 1, ..., n,        (1.3.15)

evaluated at x. If f is only C^1, then

    Δf(x, u) = (∇f(ξ), u).        (1.3.16)

Proof. The result is classical. (See, for example, Apostol [A1, pp. 123-124].) ∎

Note that Theorem 1.3.13 reduces to the previously stated version of Taylor's theorem on R^1 since all convex sets on R^1 are intervals (Exercise 1.3.4). Other examples of convex sets include half-spaces in E^n and the ball B_δ(x) of radius δ about any given x in a normed linear space. These facts are also left as simple exercises.
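A quick numerical check of (1.3.14) on an arbitrarily chosen C^2 function; since the Hessian below is evaluated at x rather than at the intermediate point ξ, the two sides agree only up to third order in ||u||.

```python
import numpy as np

# f(x) = x1^3 + x1*x2 + x2^2, with its analytic gradient and Hessian.
def f(x):    return x[0]**3 + x[0]*x[1] + x[1]**2
def grad(x): return np.array([3*x[0]**2 + x[1], x[0] + 2*x[1]])
def hess(x): return np.array([[6*x[0], 1.0],
                              [1.0,    2.0]])

x = np.array([1.0, -0.5])
u = np.array([0.2, 0.1])
lhs = f(x + u) - f(x)
rhs = grad(x) @ u + 0.5 * (u @ (hess(x) @ u))   # Hessian taken at x, not at xi
print(lhs, rhs, abs(lhs - rhs))                 # difference is third order in ||u||
```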
Definition 1.3.17. Let f be real valued on E^n. Then f has a weak minimum at x^0 on E^n if df(x^0, u) ≥ 0 for all u in some open ball B_δ(x^0); x^0 is a strong minimum if df(x^0, u) > 0 in B_δ. The element x^0 is a local minimum (strong or weak) if f(x') ≥ f(x^0) for all x' in some ball B_δ(x^0). ∎

Theorem 1.3.18. If a real-valued C^1 function f has a local minimum at x^0, then ∇f(x^0) = 0.

Proof. By Theorem 1.3.13,

    0 ≤ Δf(x^0, u) = (∇f(ξ), u)

for all ||u|| < δ. Let u = -α ∇f(x^0) with α > 0 small enough so that ||u|| < δ. Then

    -α (∇f(ξ), ∇f(x^0)) ≥ 0.

If ∇f(x^0) ≠ 0, then, since

    (∇f(ξ), ∇f(x^0)) → ||∇f(x^0)||²

as α → 0 (recall: f is C^1),

    -α (∇f(ξ), ∇f(x^0)) < 0

for all sufficiently small α: this is a contradiction and therefore ∇f(x^0) = 0. ∎
+
26
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
I n the Euclidean case,
and this is viewed geometrically in the plane by drawing a straight line from any point x, along the negative gradient path, until the line is tangent to some contour described by f = constant. T h e tangent line is perpendicular to the gradient o f f at the point of contact with this contour.
EXERCISES
1.3.1. T h e C1 assumption made on mappings on En guarantees that the F derivative is continuous. O n the real axis the usual definition of derivative allows for certain pathological examples in which the derivative is not continuous. Find such an example.
1.3.2. If a norm / j IIG is introduced in En, show that the tangent hyperplane at x on a surface S is defined by those directions u for which u is orthogonal to + G Of (x). 1.3.3. Show that the following sets are convex: (a) half-spaces in En;(b) the ball B,(x) = {x I /I x - x I/ in any normed linear space.
< S}
1.3.4. Show that all convex sets on R1 are intervals. 1.3.5. Compute the gradient of a quadratic form (u, Gu), where G is positive definite. 1.3.6. Iff is C2, then the Hessian matrix H , is symmetric. Why ? 1.3.7. Suppose f is C1 on E2 and that we wish to minimize rnaxoGrGlf ( x , y ) over all y in [0, 11. I f f is unconstrained,
1.4.
27
GRADIENT TECHNIQUES
then the minimum can be sought over those points in [0, 11 for which
YoZ:&f(x,
d
Y)) = 6(,F:&f(X,
Y ) ) = 0.
What is wrong with this reasoning ?
1.4.
GRADIENT T E CHN I Q U ES
Newton’s Method
I n this section we examine iterative root finding techniques in more detail, with particular emphasis on descent methods to find the roots of Of = 0. Suppose f :R1+ R1 is C1 and that we wish to find that xo for whichf(xo) = 0. Replace this nonlinear problem by a sequence of linear root-finding problems by approximating f at some initial guess xo with a tangent line to f at xo and then find where that line is zero. Call the zero x1 and repeat, obtaining in this manner a sequence (xn} which hopefully will converge to a root xo off itself. T hi s algorithm is the essence of Newton’s method (see Fig. 1.4.1 for a graphical illustration). Analytically what one does is to approximatefin the vicinity of xo by its first-order Taylor expansion f [Eq. (1.3.1 l)] and then seek the zero of the linear functionf. This problem, however, has an explicit solution: f(x)
= f(xo>
+f’(xo)(x
- X”) =
0
implies that
iff’(xo) # 0; label the root so found by x1 and repeat as before to obtain the sequence (1.4.1)
28
1.
ITERATIVE METHODS O N NORMED LINEAR SPACES
T h e question of convergence for this iterative process is equivalent to finding the domain of attraction for a fixed point of the mapping g(x) : x - f (x) ‘f‘(x) in a region where f ’ # 0. I n fact, as has been shown in Section 1.2, if the condition (1 2.10) holds in some neighborhood of the root x* o f f , then g is contractive there and t h e iterates (xrL} will be convergent to xo. However, the condition (1.2.10) is awkward to verify and not often satisfied in practice. An even stronger assertion can be made here, viz. if Q is the domain
FIG. 1.4.1.
Illustration of Sex\ton’s method.
of attraction for x0 and if xo E B,then {xn>converges t o xo with the rapidity at least that of a geometric progression (assuming f is C2; see Exercise 1.4.1). T h e main difficulty in implementing Newton’s method is the following: Q is generally unknown even if the region of contraction (usually much smaller than the domain of attraction) can be estimated. hZoreover, f‘ needs to be computed a t each iterate and, even more serious from a computational point of view, l f ’ may be numerically unstable whenf‘ is close to zero. T h e r e is a variant of Newton’s method which eliminates the inversion problem, which makes it more convenient to use, but which may
1.4.
29
GRADIENT TECHNIQUES
have slower convergence than Newton’s method. This scheme is defined by (1.4.2)
where f’ is computed only once, at xo . I n Figs. 1.4.1 and 1.4.2, Newton’s method and its modification are compared for the function f (x) = x2 whose root is obviously xo = 0 and it is seen that the modified algorithm converges more slowly.
I
FIG. 1.4.2.
Illustration of a modification of Newton’s method.
Suppose now that a system of nonlinear equations is given: fz(ul
,..., u,)
= 0,
i
=
1 ,..., n.
T h e problem is one of finding the roots of the nonlinear map ; En + En,where
f
30
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
is C1. By analogy with the scalar Newton's method the iterative scheme is defined un+i
= un
-
J;'(u%>f(un),
(1.4.3)
where J , plays the role of derivative forf. This algorithm also arises in the context of minimization of an unconstrained and real-valued C2 function f on En; in this case the root sought is that xo for which Of = 0 (by virtue of Theorem 1.3.18). Note that Of : En + En and i f f is C2, then the Jacobian of the gradient is found by simple calculation to be the same as the Hessian off : Jvr = H , . Hence, (1.4.3) now becomes u,+l = u, - FZ;l(u,) Vf(un).
(1.4.4)
I n either setting for Newton's method on En the familiar difficulties reappear, perhaps even more severely. T h e domain of attraction is unknown; also either Ht or J , needs to calculated and inverted, and if either of these matrices is numerically singular or ill conditioned, then inversion may lead to numerically poor results or even be impossible altogether (an example of this is encountered in Section 1.6). However, when the method is applicable, one can again expect it to be rapidly convergent. Descent M e t hods I n this book the most interesting applications of Newton's algorithm in fact are to minimization problems in which the zeros of V f are sought. I n this framework the notion of domain of attraction D has another interpretation as the set of initial points uo from which the iterates (xn} converge downhill to a minimum of f (i.e., to a root of Of = 0). This suggests that the iterative procedure (1.4.1) be modified so that it describes a descent process and, in fact, u n f l should be guaranteed as far as possible downhill from u, at each step. Letting Au = -H-l f
Vf(u0)
and assuming that f is locally downhill at u,, in the direction du, then, as shown in the previous section, f reaches a minimum in
1.4.
31
GRADIENT TECHNIQUES
that direction at a point u1 where the directional derivative is zero:
(Vf(uA m u , ) V f ( U 0 ) ) If u is written as uo
(1.4.5)
= 0.
+
01du for any scalar 01, then we see that the directional derivative in the direction du is zero at uo 0 1 ~d u = u1 for some suitable uo > 0 (unless, of course, uo is already the minimum o f f ) . Repeating this procedure at u1 allows one to find u2 by choosing a value of 01, call it a l ,for whichf(u,) is as far as possible downhill fromf(ul) in the new direction d u . I n this way sequences {u,} and {a,} are obtained in such a manner that the algorithm defined by
+
un+1 = U n
-
%fG1(un) W
n
)
(1.4.6)
is a descent algorithm for finding uo (the modiJied Newton’s method). T h e {a,} are said to be optimal when they are chosen, as above, to minimize f in the direction d u at each step (a numerical technique for finding the a’s is described in Chapter 3 ) . Note that at each iterate V’(U,+~) is orthogonal in the Euclidean sense to -H3Un)
VfW
whenever 01, is chosen optimally; this follows from (1.4.5). I n order that the directions du =
-HFl(U)
Vf(U)
be locally downhill at u, , it is sufficient that the n by n matrices Hr(un)[and hence the matrices Ht’(u,)] be positive definite. This is a special case of the following result. Lemma 1.4.7. T h e direction du = -H Of(.) is locally downhill at u for any value of Of if and only if H i s positive definite.
Proof. T h e direction d u is locally downhill at u if and only if (Vf(u),d u ) < 0 as we saw in the previous section. But du = -H Of(.) and therefore one is guaranteed of having locally downhill directions for all possible choices of Of if and only if (Vf(u),H Vf(u)) > 0. T h e last inequality is equivalent to asserting the positive-definite character of H. I
32
1.
ITERATIVE LIETHODS O N NORMED LIKEAR SPACES
Through the last lemma a class of iterative schemes can be defined for finding the roots of Y f by means of u,+1 = u,,
- .-
(1.4.8)
anfI;l Y f ( u n ) ,
where the {H,} [and hence {H;'}] are chosen to be positive-definite n by n matrices. T h e algorithm (1.4.8) is called a gradient descent technique since ZI;' T f ( u 1 8is) the gradient off at u,, with respect to the non-Euclidean norm 11 ilH,L (see the discussion in Section 1.3). It is understood of course that the {xn) are to be chosen optimally. If H , = H f ( x r L ) then , (1.4.8) is the modified Newton's method, and if 11, = I (the identity matrix), then (1.4.8) is known as the metliod of steepest descent (in this case descent is along the negative gradient in the Euclidean norm). I n any event the directions Au,, Z L , , + ~ u , are downhill, and this suggests that iff is bounded below, then . [ f ( u n ) )will descend to some value y as n a3 [recall: f ( ~ ~ -.+: f~( u) l l ) ]so that Y f ( u n )+ 0 and u,, + uo. An attempt to justify this heuristic reasoning is given below. First, however, note that since any C 2 function on E n can be approximated in some ball about u by the quadratic Taylor expansion (1.3.14), then the behavior of an iterative descent scheme in the vicinity of u can be inferred by looking at the way the algorithm performs on quadratic functions. I Ience, the behavior of gradient descent methods on quadratic functions is of particular significance from a practical point of view. Xow i f f is quadratic on E n , it coincides with its Taylor expansion (1.3.14) about any point and 2
--f
O f ( u ) = f ( ~-1- A u ) - - - - f ( ~= ) ('Tf(u),Au)
4-~ ( O UGdu), , (1.4.9)
where G is the (constant) Hessian matrix of f(see Exercise 1.4.3). Moreover, 'I will be optimal \vith respect to the iterative process (1.4.8) if Of(.) is as small as possible in the direction AZL= -ZZF1 Tf(u)for any positive definite If. But Of
-E(Tf(M), I / -
Yf(2l))
'
2.1
2)(Yf(U), I 1 -'GIl-1 T f ( U ) )
and it suffices to find that I for which g'( Y) the root of g'( Y) =: 0, \ve see that
= g(.)
0. If we solve for
:
(1.4.10)
1.4.
33
GRADIENT TECHNIQUES
I n particular, for the steepest descent method (in which H
=I
)
and for the modified Newton's method (in which H = G), 01 = 1. Hence, for any quadratic f , the modified Newton's method coincides with Newton's method discussed at the beginning of this section. Since H,(u) is continuous in u for any C2 function, in the vicinity of a minimum uo the modified Newton's method can be expected to perform much like Newton's method with its characteristically rapid convergence because in the vicinity of u* the function f is roughly quadratic and H,(u) H,(uo) = G. For the same reason one also expects that when n is large, an 1 in the iterative scheme (1.4.6). T h e next theorem tells us something about the behavior of gradient techniques when applied to quadratic functions.
- -
Theorem 1.4.11. Let f be a quadratic function on En with positive-definite Hessian G. Suppose {Hn}is a sequence of positivedefinite matrices satisfying 0 < A, 11 H;' /I Ao and let
<
<
f2 = {u I f ( U ) G f ( u 0 ) ) . T h e n the sequence {un>starting at u,, and defined by (1.4.8) has a subsequence which converges to a root uo of V f in Q, and f ( u n ) - f f ( u o ) .If uo is a unique root, then u, + uo.
Proof. such that
Since G is positive definite, there exists some p
>0
from which we see that f is bounded below (why?) and that f ( u )4 +co as 11 u 11 co. This shows that Q is bounded, for otherwise a sequence j j uk /I co can be found for which f ( u A ) < f ( u o ) and this is not possible. From the compactness of Q a subsequence fun,) of the iterates fun} defined by (1.4.8) is obtained such that un Y+ uo for some uo E Q. Lemma 1.4.7 shows that f ( ~ ~ + f~( u) n ) . Hence, since f is bounded below, f ( u J tends to some value y and it must follow that y = f ( u o ) . What remains to be ---f
--f
<
34
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
shown is that Yf(uo)= 0. T o this end we use Theorems 2.2.14 and 2.2.18 of the next chapter which state that since the Hessian o f f is positiL e definite on any open convex ball containing 52, thenf must satisfy 0
< ( W n ) , Un
tl
~
Un)
< f(Un+d
-
-f @ n )
0
as 12 + cx). Suppose that C f ( u o ) # 0. T h e n it must follow that u,,~ - u, + 0 (why ?). Now from (1.4.10) we see that or, =
(Vf(Un), (V(U,),
K1T u n ) )
f f , - ' G f y V ( U ))
.
<
However, Cf is continuous on Q . and /I H i 1 11 Xo. Therefore, the denominator of the last expression is bounded above by some constant. Also ( W U n ) ,
K1W
n ) )
3 A0 I1 Vf(%1I2 2 6 > 0
+ 0 means that
for all sufficiently large n, since Vf(un)-+ V f ( u o )
/I Yf(u,)Il is ultimately bounded away from zero. I n conclusion, Y , 2 6' > 0 for some 6' and hence u,+~ - u,, = -%H,l 0.t %)
does not tend to zero with n. However, this contradicts the fact that ti,, ~-u, + 0 and thus one must conclude that V f ( u o )= 0. Finally, if u" is a unique root of Vf, then u, + uo, for if (uJ has a subsequence u hich is convergent to some other u', one can repeat the argument above to show that u' is also a root. Hence, u' = uo. I I n the particular case of steepest descent, Theorem 1.4.11 is applicable since 11;' - I . However, in this case the rate of convergence of the iterates in (1.4.8) is often too slow to be of practical interest. I n Chapter 3 we encounter an explicit example in which steepest descent fails as a numerically useful descent method. Its typical convergence behavior in E 2 is illustrated by drawing the elliptical contour lines of f == constant, where f is given by A,s: 1 A,xi and A, > A, >: 0. As we know, the steepest descent path a t a point A , in the Euclidean sense is a direction orthogonal to the tangent line to this contour at A , and it terminates where the directional derivative in this direction is zero, viz. at some point A , for which Cf is orthogonal to the given direction. Continuing
1.4.
GRADIENT TECHNIQUES
35
in this manner, one obtains a series of zigzag lines linking points A,, A,,A , ,..., and the more the elongation of contour, the more frequent is the zigzag stitching. If the contours are circular (A, = A,), the steepest descent algorithm terminates in one step since any normal to the circle passes through the origin. I t also converges in one step if V f is an eigenvector of G-l at the point A , . T o prove the last assertion, let us first note the following result concerning Newton's algorithm. Lemma 1.4.12. I f f is a quadratic on En with positive-definite Hessian G, then the modified Newton's algorithm [algorithm (1.4.8)with H , = GI converges in one step to a root of V f .
Proof.
Let
f(.)
=c
+
( a , ).
+ h(u, w.
Theorem 1.4.11 is clearly applicable to f and there exists a root of V f . Now Vf(zi,) = a Gu, and 0 = a GuO from which one obtains u0 = U, - Gp' Vf(u0). I (1.4.13)
+
u0
+
Note that 01, = 1 in (1.4.13), which is another proof that the modified Newton's method coincides with the usual Newton's method for a quadratic. Since u0 - u0 = - 0 1 ~ Vf(u0) = -G-l Vf(uo), then the steepest descent algorithm has one-step convergence if and only if V f (u,) is an eigenvector of G-l with eigenvalue
I n particular, if the contours o f f are circular and centered at uo u - u0 [i.e., if f ( u ) = f (uo) then G = I , yo = 1, and Vf(u,) is an eigenvalue of G-l for any u,. I n this case steepest descent has one-step convergence for all u, since it coincides with Newton's method. I n this regard we note without proof that for a
+
36
1.
ITERATIVE METHODS O N NORMED LINEAR SPACES
quadratic the steepest descent method either converges in one step or it converges only in the limit as n One way to accelerate convergence of the steepest descent method is to distort the space E n by viewing it is a different norm. If the distortion is such that an ellipsoid /I u ;1 = constant is warped to look like a sphere in E n , then the steepest descent method should now behave like Yewton’s method, at least for a quadratic. This can be accomplished as follows. Assuming G positive definite, let 2 Z/Cu (the reader should be aware that positive-definite matrices have positive-definite square roots; see, for example, Bellman [Bl, pp. 92-93]). T h e n the quadratic ---f
2
f ( u ) -= c
becomes g(2)
+ ( a , + A(u,
-=f(dG-’z)= c
U)
Gu)
( Z / E ’ a , z ) -+ f ( 2 , 2).
~’~
T h e IIessian of g is now the identity and therefore steepest descent converges in one step, starting at z,, = Z/cuo. I n fact, from zo N nO - ‘C,g(zo) after multiplying through by & - l , uo uo - G - - lYf(uo)and therefore zn must be a root of Tg by virtue of Lemma 1.4.12 [note that Tg(z) = 4G-l Tf(u)]. Another method of approach is a s follows. T h e steepest descent method in E” is taken along the negative gradient path relative to the Euclidean norm. If a non-Euclidean norm Ij u l j G in E n is introduced, then the negative gradient path becomes - G-l Tf(see Section 1.3), which is the same as Newton’s method, for a quadraticf. Therefore, after warping a Euclidean space with a non-Euclidean norm, the steepest descent path has one-step convergence. Since ~
2
ii u
:=
(.\/cu, Z / c u )
-:
the non-Euclidean norm in “u space” can be viewed as a Euclidean norm in “2 space,” and the steepest descent in the non-Euclidean “ zi space” is equivalent to Newton’s method in Euclidean “ z space.” If the matrices { E f r < jin (1.4.8) are chosen properly, then convergence can be improved considerably over that of steepest descent. As shown in Chapter 3, a suitable choice of {H,} makes
* W. E. Langlois, IBM J. Res. Develop. 10, 98-99 (1966).
the algorithm converge in a finite number of steps to the minimum of a quadratic function. However, a direction -H-l Vf is a steepest descent path relative to the non-Euclidean norm 11 l l H and thus the statement about “suitable choice” means that the norm in En at each iterate is varied in such a way as to distort the surface of a quadratic function in some favorable manner, i.e., at each iterate an attempt is made to accelerate convergence. This question is examined again in more detail in Chapter 3, but one can begin to appreciate now how the convergence of algorithm (1.4.8) may be affected by the choice of scale for the surface of a C 2 function, at least locally where a quadratic approximation is valid. Indeed, improper scaling can even affect the efficiency of Newton’s method by causing the matrix G to be ill conditioned, and this, in turn, may result in serious error magnification when the descent algorithm is handled numerically. I t is appropriate to insert some remarks here on the concept of ill conditioning. A Digression on 111 Conditioning Consider the system of equations Ax two straight lines
=
b in E2 defined by the (1.4.14)
or (1.4.15 )
I n our discussion it will be convenient to assume that the matrix A is positive definite and that lengths of the rows of A have been normalized [the rows, of course, are the normals to the straight lines defined by (1.4.14)]. If A is not already normalized, one can multiply through the first equation in (1.4.14) by the norm da$ a:2 and the second equation by daEl a i 2 . I n any case, if A is invertible, then (1.4.15) possesses a unique solution given by
+
+
(1.4.1 6)
38
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
It is clear that the solution, when it exists, is the point of intersection between the straight lines. It is also apparent that a unique solution exists if and only if the two lines are not parallel or, equivalently, if and only if the their normals (described by the rows of A ) are linearly independent or noncollinear. Numerically, however, one may have difficulty if the lines are nearly parallel because the solution
may be highly sensitive to small changes in the positions of the lines caused by roundoff in the computation of the coefficients of A and b. This is illustrated in Figs. 1.4.3 and 1.4.4. I n Fig. 1.4.3 the lines are nearly orthogonal and errors in the intersection point x are comparable (i.c., roughly of the same magnitude) as the errors in the coefficients of the vector 6, whereas in Fig. 1.4.4 the lines are nearly collinear and small changes in b (such changes may be caused, for example, by the lack of numerical precision used in computing b ) result in highly magnified errors in x. This kind of
FIG. 1.4.3. Example of a well-conditioned problem.
FIG. 1.4.4. Example of an ill-conditioned problem. Shaded area is error spread in solution to Ax = b.
numerical instability leads to poor resolution of the solution of the system (1.4.14) and is obviously undesirable in most computations. Now the fact that the lines are nearly collinear is reflected by the knowledge that Det A is almost zero. This can be shown in several ways. Since Det A = a11a22(recall that A is symmetric), the near collinearity of the lines means that the angle between the normalized rows of A is nearly zero and hence that a,, a12and aI2 aZ2or Det A 0. If the rows are orthogonal, then aI2 = 0 and Det A = 1 [proof: orthogonality means that
-
-
-
%Q12
+
a12%!2
=d
a 1 1
+ azz)
= 0;
40
1.
ITERATIVE hlETHODS O N NORMED LINEAR SPACES
hence a I 2 = : 0-otherwise T r A = 0, which is impossible for a positive definite matrix (why I n fact, since n e t A is the area of the parallelogram whose sides are the rows of A (Exercise 1.4.6) and since this area tends to zero as the parallelogram becomes increasingly flat (which occurs as the angle between the rows of A tends to zero), we have another interpretation of the same idea. Now Det A u11a22(Exercise 1.4.6) and Det A attains its maxim u m with value unity when the inequality becomes an equality, which occurs when the rows are orthogonal. Hence, as the lines become more and more collinear, Det A becomes increasingly smaller, and the case in which the lines are perpendicular corresponds to Det A = 1. A small determinant then signifies that the solution
?)I.
<
C:I is sensitive to small changes in the coefficients of A and b and, in this case, we also see from (1.4.16) that the elements of A-l tend to be quite large. Under these circumstances the matrix A is said to be ill conditioned. Another way to detect ill conditioning is in terms of the eigenvalues of A. If h satisfies Det(A - XI) = 0, then, in the 2 by 2 case, one can solve the resulting quadratic in h to obtain
-
2h
=
TrA
-
d ( T r A)2
~
4 Det A
-
and if I k t A 0, then A, T r A and h, 0 and the ratio hl/X2is large. Again, when the lines are perpendicular, Det A = 1 and T r 13 2, or A, = A, = 1. As the lines become increasingly parallel, h,/A, + X. Since A is positive definite, there exist two orthonormal basis vectors e l , e2 in E 2 relative to which ,4 has diagonal form (see Bellnian [ E l, pp. 50-511). Hence, if x = x1e1 x2e2 , A x = X,x,Ae, -{- h,x,Ae, . Suppose h2 = 0 ( A is of rank 1). T h e n A has a nontrivial null space and the range of A is restricted to lie along the direction Ae, . If A is merely ill conditioned so that, numerically at least, A, is very small, then A x is strongly biased in favor of a single direction and ultimately, as X,/h, --t m, A begins to behave more and more like a singular matrix. I n fact, severe ill conditioning is a special case of looking singular on a computer 7~
+
1.4.
GRADIENT TECHNIQUES
41
(where numerical precision is limited by roundoff). One consequence of this is with regard to descent algorithms. I n fact, if H is ill conditioned, then -H Of will tend to lie along certain restricted directions in the range of H since the null space of H is effectively no longer trivial. But this means that the algorithm may “bog down” or “hang up” if the direction Vf lies within this null space. Referring again to Fig. 1.4.4, we see that the solution x is not only unstable in the presence of roundoff noise but also is very nearly restricted to lie along the direction of the roughly collinear lines no matter how the vector b varies (by moving 6 we move the lines parallel to themselves). I11 conditioning also affects the efficiency of Newton’s method, in the following way. T h e quadratic surface f ( x ) = (x, A x ) = const is an ellipse in the plane. If A,/X, is large, the ellipse is thin and elongated (Exercise 1.4.7). We know from Newton’s method on a quadratic function that if we start anywhere on the ellipse then -A-l Of points directly to the center of it (here the Hessian off is A itself). Numerically, however, Of is known only within roundoff and if A is ill conditioned, then small errors can magnify themselves to produce drastic over- or underestimates of the direction -kl Vf. If the problem is properly scaled so that the ellipse is nearly circular (Al/A, I), then of course no ill conditioning need be expected. T o summarize the discussion: a matrix A is ill conditioned if it represents a system of nearly collinear equations. Assuming the rows of A to be normalized, we saw that ill conditioning reflects itself by having Det A small or A,/& large or by improper scaling of the quadratic form (x,Ax):the surface described by this form is a thin, elongated ellipse. T h e effect is solely numerical since in principle a solution to Ax = b can always be obtained by inverting A. Thus, ill conditioning is a measure of numerical instability in the presence of roundoff, and if Det A 0, for example, then the solution x will be highly sensitive to the numerical noise generated by roundoff. Computationally, at least, solutions which deteriorate in this manner are essentially useless. It also was shown that ill conditioning can affect descent algorithms by stalling the vectors
-
-
42
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
-A-l Of along directions which may not be emphatically enough downhill. T h e practitioner should be aware of the numerical pathology associated with ill conditioning when implementing algorithms computationally. This subject is encountered later in this chapter, as well as in Chapter 3, where some indications are given as to its solution. One remedy is apparent however; simply compute with greater precision to cut down the effect of roundoff.
A Question of Stability
We return here to the question of domain of attraction and briefly look at another setting for the same idea. If dx dt
=
f(4
is a system of ordinary nonlinear differential equations with
f : En-+ En and
then an equilibrium solution to this system is a trajectory z ( t ) for which f ( z ( t ) )= 0, t 3 0. Without any loss of generality the equilibrium state can be taken to be zero solution z = 0 (Exercise 1.4.9). Then it can be stated that the origin is asymptotically stable relative to some set 9 containing the origin if /I x(t)JI+ 0 as t + +a, whenever x is a solution to the differential equation and the initial vector x(0) lies within Q. T h e practical significance of this concept is explored briefly in Section 1.7; here we simply want to prove a result concerning the size of this domain. Theorem 1.4.17. and let
Let V be a real-valued C1 function on En
n
:= {x
1
V(x)
< V(X,)}.
Suppose that Q contains the origin and that V
0 on Q. Let
x(t)
1.4.
43
GRADIENT TECHNIQUES
be a solution to the system i = f ( x ) starting at x(0) assume that V satisfies
= xo
and
on Q, with v(0)= 0. Finally, suppose that Y ( x )4 +GO 11 x 11 + 00. T hen x ( t ) +0 as t + 00.
as
V(x(t)) = ( V V , f ) < 0
+
Proof.
By the mean value theorem on R1, V(x(t))- W O ) )
for all t
> 0. Hence, 0
<0
< V(.(t)) < V(%)
and the trajectory x lies within L2. Moreover, Q is bounded; otherwise V ( x )-+ +00 for a suitable choice of x I] + co, which is impossible. It is clear from the monotonic character of Y that V(x(t))tends to some nonnegative limit as t + co. Hence
+
V ( x ( t ) )+ 0
(Exercise 1.4.10). Since Q is compact, the set {x(tn)} has a cluster point xo in L2 as t, -+00 through any sequence of values. I t follows = 0 or that i = 0. Now x ( t ) + 0 as t + co; that v ( x ( t ) )-+ otherwise there exists some E > 0 and tk + oc, such that
v(xo)
+
11 4t;)ll 2 E . However, the above reasoning shows that any cluster point of the trajectory {x(tA)} must be the origin, which is a contradiction.
a
T h e concept of domain of attraction has now appeared in three frameworks which, in a sense, are all related. For example, if f : En 4El is C1 and if xo is a strong minimum o ff, then it is a root of Vf [hence a fixed point of g(x) = x - Vf(x)] and it is also an asymptotically stable equilibrium point of 2 = -Vf. I n fact, if =
f(4 -f(X0),
then V ( x ) >, 0 in some ball about xo and
V
=
(Vf,
X) =
- 1 1 Of112 < 0
44
1.
ITERATIVE hIETHODS ON NORMED LINEAR SPACES
for x # xo (and hence Theorem 1.4.17 is applicable). I n each instance one wishes to determine a region Q about xo which determines the extent of attraction. Any further interconnections between these concepts are left to the reader. However, the problem of finding roots o f f or, equivalently, of minimizing I l f l i z is explored further in the next section. Also the topic of stability is examined in more detail in Section 1.7, especially with regard to the practical question of finding a function V which satisfies Theorem 1.4.17. A feasible algorithm for accomplishing this is discussed there. T h e entire subject of iterative descent methods (apparently first considered by Cauchy3) is thoroughly explored in the first chapter of a volume by Goldstein [GI]. This reference is recommended as a useful supplement to Sections I .2-1.4.
EXERCISES
1.4.1.
Assume f is C 2 on R1 and that Newton's scheme (1.4.1) converges to some xn. Show that x,, -+ xo at a rate at least that of a geometric progression, viz. s,
-
I
xo
~
< const
,
s,,- .tn i2.
1.4.2. Show that a matrix N is positive definite if and only if its inverse is positive definite. 1.4.3.
Consider the quadratic function on E" given by f(.)
=c
+ (a,
.Y)
1
;(.T,
Gs).
Show that G is the Hessian off and that if so is any point, then f ( X ) = .f(.Y") -~(T~(.\."),1') ; ( / I , Ch),
-
.x s,,. IIence, f coincides with its Taylor where / I --: expansion of second order about a n y point. ~
A . Cauchy, C'ovipf. R e d . 25 (1847)
1.5. LEAST
SQUARES APPROXIMATION
45
1.4.4.
Show that a quadratic on En has a strong global minimum where Of = 0 if and only if its Hessian is positive definite.
1.4.5.
Show that iff ( u ) is a quadratic on En with positive-definite Hessian and ifg(z) =f (dc-lz), then Og(z) = 2/GP1Vf(u).
1.4.6. Show (a) if A is a 2 by 2 positive-definite matrix, then Det A u11a22and (b) the maximum is achieved when the rows of A are orthogonal. Also prove that Det A is the area of the parallelogram whose sides are formed with the rows of A.
<
1.4.7.
If the eigenvalue ratio AJA, is large, show that a quadratic form (x, Ax) represents a thin, elongated ellipse in the plane [i.e., consider the surface contours (x, Ax) = constant]. Here A , , A, are the eigenvalues of A.
1.4.8.
Let A be a 2 by 2 nonsingular matrix whose rows are normalized. If we normalize A-l in the same way, then Det A-l = Det A. Also, the ratio of largest to smallest eigenvalue is the same for A and A-l. Show this.
1.4.9.
Let z ( t )be an equilibrium solution to the system a? = f (x). Show that z = 0 is an equilibrium state after a suitable change of variables in En.
1.4.10. Suppose V ( t ) >, 0 is a C1 function and that V strictly decreases as t increases. Prove that v(t)-+ 0 as t + + co.
1.5.
LEAST SQUARES APPROXIMATION
Let
f= be a C1 function on En and suppose that we seek a root off (u)= p,
46
1.
ITERATIVE METHODS O N NORMED L I N E A R SPACES
with p E Em. This describes a system of m nonlinear equations in n unknowns given by f”ul )...)u,)
-
p
= 0.
(1.5.1)
We know that if m = n and if J f is nonsingular, then the modified Newton’s scheme can be expected to yield a solution via the iterates
where the {all} are chosen optimally (this much is known from the previous section). However, if is not of maximal rank, if m > n (overdetermined system) or if m < n (underdetermined system), then not only is Newton’s algorithm (1.5.2) not applicable, but the system (1.5.1) may not even possess a solution. Consider, for example, whether three straight lines in the plane always need to have a common point of intersection (if not, such a system of three linear equations in two unknowns is said to be inconsistent). However, an alternate procedure for “solving” (1.5.1) in all cases involves minimization of the Euclidean norm of f ( u ) - p. If uo furnishes a solution to (1.5. l), then uo also yields a global minimum to i l f ( u ) - /3 I/ [in fact i ~ f ( u o) p 11 = 01 and conversely. If (1.5.1) does not possess an exact solution, it is still possible to speak of a least squures solution to the nonlinear system as some uo which minimizes Jf
(without loss of generality we may assume that /3 = 0). I n the sense of (1.5.3) one obtains a best norm approximation to (1.5.1) and if the system possesses a solution, then the approximation is exact (i.e., the norm is zero). I n the following we attempt to develop an iterative procedure for obtaining a least squares approximation in all cases, and this requires that the concept of generalized inverse of a matrix be introduced. We begin with a short introduction to this notion.
1.5.
47
LEAST SQUARES APPROXIMATION
Some Remarks on Generalized Inverses
To simplify notation the spaces En,Em will be denoted by V , W , respectively, and an element A in Hom ( V ,W ) will be an m by n matrix, defined in the usual way as a representation of a linear map between V and W relative to the usual orthonormal basis in V , W (see Section 1.1). If M is any r-dimensional subspace of V , then the orthogonal complement of M in V will be the n - r subspace, denoted by M I . V can then be written as the direct sum V = M 0M i and every v E V is uniquely represented as v o no, with vo E M and uo E M I . A mapping P in Hom( V ) is called an orthogonal projection onto M along N if Pv, = v o and Pvo = 0. The mapping I - P is then an orthogonal projection on N along M.Referring to the matrix of P (also denoted by P ) , we know that P is an orthogonal projection if and only if P 2 = P and P = PT.(We urge the reader who is not familiar with the notions of direct sum, projections, and orthogonal complements to consult the volume by Halmos [H2, especially Sections 18, 21, 22, 41, and 751.) I n Section 4.4 we will have occasion to use these concepts again. As an exercise it is left to the reader to show that if el ,..., e , is a basis in M and e, ,..., e, is a basis in ML, then the matrix of P relative to the basis is diagonal with ones in the first r positions and zeros thereafter. I n order t o define a generalized inverse of A E Horn( V , W ) , consider first the system Ax = b defined by m >, n equations in n unknowns:
+
a i j x j = bi
,
i = 1,..., m.
(1.5.4)
j=1
or
kl all
a22
... ...
am2
..*
a12
"')("')
=
a2n
amn
xn
(b').
(1.5.5)
bm
Since b is a linear combination of the n columns of A, the columns generate the range or column space of A [denoted R(A)]. Also, 6, is the Euclidean inner product between the first row of A and the vector x so that if b, = 0, then this row is orthogonal to x. A similar statement can be made for 6, through b,, . I t follows that if b = 0, then x belongs to the null space of A [denoted N(A)], and the rows
4x
1.
I T E R A T I V E M E T I I O I X O N N O R M E D L I S E A R SPACES
of A generate the orthogonal complement in V , relative to the
Euclidean inner product, of the null space of A . T h e adjoint of A is a mapping from TT’ t o V and is represented (relative to the same orthonormal basis which is used to represent A ) by the n by m matrix AT. ‘I‘he row space of A is clearly the same as the column space of A’, and hence R(nlr) is N ( A ) l .Sirnilarily, the orthogonal complement of N ( A T )must be the column space of A . Hence, one obtains the direct sum decompositions (1.5.6)
If m
= n, then A is invertible, provided that the null space of A is trivial “ ( A ) = {O)] or, equivalently, that the rank of A [i.e., the dimension of R ( A ) ]is maximal. T h e condition m = n implies that R ( A ) -= It’ and in general we can say that A E Hom( V , W ) is invertible if it is bijective; when rank A is maximal, then A is injective but not surjective unless rn - n, and if rank iz < n, then the map is not even surjective. A vector v in c’ belongs to N ( A ) if A z : ~: 0. ‘I‘herefore, if A is restricted to i\’(A)-L, then it is injective. Suppose, on the contrary, that 21, z’ t :V(A): and that A u = Av. T h e n u -~ 7’ E N ( A ) , which is impossible (unless u = e) since il;(A)I is a vector space. n’(A)l is often identified with the quotient space V)lli(A)and the fact that A is injective on V / N ( A )is well known (see, for example, Malmos [H2, Sections 21 and 221). More generally one easily shows (Exercise 1.5.2) that A is injective on a given space if and only if its null space is trivial. I,et us now denote the restriction of A to iV(A)’by A. T h e n 2 maps N(A)Lbijectively on R ( A ) and therefore A has an inverse defined on its range. T h e following definition is now meaningful.
A
E
Definition 1.5.7. T h e generalized ineerse A + of a matrix Horn( TJ7, W ) is the matrix in Hom( W , V) given by
(1.5.8)
Since A+ annilhilates the orthogonal complement of R(A) in W , its null space coincides with that of Ar. From the way A+ is
1.5.
49
LEAST SQUARES APPROXIMATION
defined it is also clear that A+ and AT have the same range. However, A+ and AT need not be the same mappings. They can behave differently in the following way: if w = Av with v E N ( A ) I , then w E R(A) and A+w = v, whereas ATw may be any other element in R ( A ) l .Our discussion thus far is compactly summarized in Fig. 1.5.1.
FIG.1.5.1. The range and null spaces of the adjoint and generalized inverse of A .
+
Let z, = v0 v0, with vo E N ( A ) and vo E N(A)L = R(A+). T h e n since Avo = 0, one obtains A+Av = A+Az,O = vo = PR(a+) [projection onto range of A+ along N ( A ) ] .I n the same way, if w E W , then w = wo wo with wo E N(A+)and wo E R ( A ) so that
+
AA+w = AA+w0 = W O
== PR(a)
[projection onto range of A along N(A+)].Hence, the relations (1.5.9)
are obtained which were apparently first used by Moore4 to define E. H. Moore, Bull. Amer. Math. SOC.26, 394-395 (1920).
50
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
A+. We leave it to the reader as an exercise to show that relations (1.5.9) are equivalent to the defining relations of P e n r o ~ egiven , ~ by A+AA+= A'
AA+A = A (AA+)*= AA+ (A+A)' = A+A.
( 1.5.10)
It is not hard to see that A+ is uniquely defined since the inverse of A in Definition 1.5.7 is itself unique. When A is of maximal rank, then N ( A ) = (0) and therefore R(A+)= N ( A ) I = V . It follows that A+A = I and A possesses a left inverse. If m = n, then it follows in the same way that AA+ = I or A+ = A-l. When m > n and rank A is maximal, then ATA is invertible (Lemma 4.4.8). Hence, if Q = (ATA)-l AT, then Q A = I. Also AQw E R(A) for any w E W [again, since N ( A ) = {O}] and by the uniqueness of A+, A+ = (ATA)-'A*.
(1.5.1 1)
A useful review of generalized inverses is given by Desoer and Whalen [7]. An Iterative Procedure for Finding Least Squares Solutions
T o return to the original problem, Newton's method is directly applicable to finding the roots off only when m = n and rank J f is maximal. I t may happen, as already noted, that J r is not bijective in one of several ways. Moreover , even if invertible in principle, may be numerically singular because of ill conditioning. I n all these cases the generalized inverse of J r should allow the scope of Newton's method to be extended. T h e following discussion is closely patterned on that of Fletcher [8]. First, note that if g : En + El is given by g(u) = Ilf(u) - p [I2, then J f
R. Penrose, Proc. Cambridge Philos. SOC.51, 406-413 (1955).
1.5.
51
LEAST SQUARES APPROXIMATION
But the Jacobian off has m rows {Ofi} and hence
v llf(4 B 112 -
=2Jf(U)T(f(4
-
B).
(1.5.12)
If uo is a least squares solution to a linear system described by (1.5.13)
Au-P=O,
then the norm of Au - /3 is minimized at uo and V jj Au0 - /3 11
= 2AT(Au0- /?) = 0
(in this case the Jacobian is simply A). The converse is also true since I( Au - /3 11 is a C1and convex function of u [this is established in Section 2.2 (Corollary 2.2.17)]. Thus, uo is a least squares solution to Au = ,t?if and only if AT(Au0- 8)
(1.5.14)
= 0.
The least squares solution to (1.5.13) can be obtained in one step starting at any uo in En by letting UO =
u,, - A+f(u0),
(1 515)
where f ( u ) = Au - /3. In fact, by using the second and third relations in (1.5.10), one finds that since un = ug - A+Au, A+P, then A T A 0 = ATAun- ATAAfAu, ATAA+/3 = ATAA+P = AT(AA+)TP = (AA+A)T/3. Hence AT(Au0- 8) = 0 or V /I Auo - P 11 = 0.
+
+
As we saw, this condition is sufficient to guarantee that uo is a least squares solution. I n particular, if m = n and A possesses an inverse, then (1.5.15) becomes u0 = A-l/3, and if A is of maximal rank ( m 2 n), then, using (1.5.11), UO =
(ATA)-'ATP.
52
1.
ITERATIVE RlETIIODS ON NORMED LINEAR SPACES
This suggests that a suitable iterative procedure for finding a root of a nonlinear f is given by un , 1
:=
u,,
(1.5.16)
an/r+(~4f(~J,
~
where the {I,,}are chosen optimally (i.e., they are chosen so as to niininiizef in the direction du,, = z / , + ~ - u,, starting at un). Again special cases of interest arise when either m = n and JT' exists (modified Newton's method) or rn > n and rank J , = n (generalized Newton's method). As we know, the latter case leads to (1 5 1 7 )
I t remains t o be shown that the directions (1.5.16) define downhill paths. T o see this, let ilu
=
-./;(u)f(u)
YiAZL)
=
2J:Wf(U).
Fg, 4
=
-2a(f,
and Then
JfJTf).
But ],IT is a projection matrix with its first Y diagonal elements nonzero by virtue of (1.5.9); Y is the rank of (assumed nonzero). Hence J f
( f , J f J f ' f )=
c (f'). r
1-1
and (Tg, 1121) - . 0 when >; 0. It folloLvs that d u lies in the halfspace of descent directions at ZL [unless, of course, Tg(u) : 01. In summary, a root finding problem for nonlinear f : E" + E n has been converted into a minimization or least squares problem for the nonlinear function g : Ett -+ E' defined as g(u) 11 f(u)I!'. T h e iterates (1.5.16) always define dolvnhill paths which, hopefully, converge to a vector un for which L~
2
J;(u").f(Zl") =
Y;
',f(u")l:' -= 0.
T h e question of convergence nil1 not be pursued here except to reinark that such a proof \voulcl then hold in particular for the
1.5.
LEAST SQUARES APPROXIMATION
53
Newton's method in which JT = 1; .' This shows, incidently, that Newton's method is actually a descent method for the norm g, which is a sort of converse to the situation encountered in the previous section in which the problem of minimizing a scalar-valued function f on En was reduced to the question of finding the roots of Vf. I n the latter setting one still had an iterative descent process, provided that the Hessian H , was positive definite. It should be emphasized that a solution uo to the nonlinear least squares problem can be expected to provide at best a local minimum to the norm lif(u)li and we cannot rule out the possibility that even better local approximations will exist. With regard to the actual computation of a generalized inverse, we refer the reader to a paper by Tewarson'j where an explicit algorithm is given.
EXERCISES
1.5.1. If P is a projection on a subspace ilf of V along M , show that relative to a suitable basis, the matrix of I' has the form
where the number of nonzero ones equals the dimension of M .
1.5.2.
1.5.3.
A matrix A in Hom(V, W ) is injective if and only if N ( A ) = (0) or, equivalently, whenever rank A is maximal. Prove this. Exhibit a proof of the fact that if A iare the rows of a matrix
A and if Bj are the columns of B , then the elements of A B are the scalars ( A i ,Bj).
R. Tewarson, Comput.
I. 10, 41 1-413
(1968).
54
1.
1.5.4.
We know that rank A = rank A T for any A Show that rank A = rank AT = T r AA+.
1.5.5.
Verify that relations (1.5.9) are equivalent to relations (1.5. lo).
ITERATIVE METHODS O N NORMED LINEAR SPACES
E
Hom(V, W ) .
1.5.6. Extend Lemma 1.2.20 by proving the following result: all m by n matrices of maximal rank form an open set in Horn (V,W). 1.5.7.
Prove that Ax = b has a solution if and only if b E R(A),and show that if u is any element in V , then x = A'6
+ ( I - A+A)u
[hint: I - A+A is a projection on N ( A ) along N(A)J-1.
1.6.
TWO-POINT BOUNDARY VALUE PROBLEMS
One of the more useful applications of Newton's method occurs in the framework of infinite-dimensional normed spaces. T h e setting for this application is the solution of a system of ordinary nonlinear first-order differential equations xz
= f"X1
,...) x, , t ) ,
i = 1,..., n
(1.6.1)
with t belonging to an interval [0, TI. Moreover, it is assumed that the trajectory
);I?(
x(t) =
will satisfy conditions imposed on the initial and terminal values of t (a two-point boundary value problem). If
1.6.
TWO-POINT BOUNDARY VALUE PROBLEMS
55
then the solution of the problem requires that we find a root of the mapping x -+ A? -f (x, -) over all smooth trajectories x(.) which satisfy the prescribed boundary conditions. T o make the framework for this problem more precise, let B be the space of vectorvalued C1 functions on [0, TI normed by (1.6.2)
Let g(x, .) = A? - f ( x , It is clear that g maps B into its completion B [ B consists of all vector-valued Co functions on [0, TI under norm (1.6.2)]. T h e following is merely a formal argument to support the use of Newton’s method in finding the roots of g. First, we wish to compute the F differential of g in order to speak sensibly about an analog of (1.4.1). With this in mind, let us define 0 ) .
where it is assumed that f is C1in x and Co in t. Since
for any u E B [here V’i(x, =)is the gradient of f i with respect to the variables in x], a simple computation shows that as 11 u Ij + 0,
This follows since for each t in [0, TI, it is true that
56
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
and (using Lemma 1.3.5),
Also, the mapping D : x xo is linear, and if it is also bounded, then D E Hom(B, B ) and the mapping coincides with its F differential independent of x in B (Theorem 1.3.4). Of course, in most cases D is not a bounded map but when it is, the F derivative of g is, defined in B by (1.6.3) g'(x, .) = I1 - Jr(X, .). ---f
Still proceeding formally, this suggests that the following iterative algorithm for finding the roots of g be defined xn
i-1
=% -
k'(% > .)I-'g(GI
3
(1.6.4)
Under somewhat restrictive conditions it may be shown7 that the mapping ,j: x + [g'(., .)]-'g(x, .) is contractive on a subset of B consisting of functions x(.) which satisfy the given boundary conditions, and hence that the iterative scheme (1.6.4) converges in B to a fixed point o f f or, what is the same thing, to a solution of (1.6.1). As immediate consequence, if x,,is a starting function for (1.6.4) which lies in the domain of attraction for the solution xu,then {xn} will converge uniformly in t [because of norm (1.6.2)] at a rate no less than that of a geometric progression. I t is clear that since each x, satisfies the boundary conditions, xn will also satisfy the boundary conditions. Finally, it is necessary to implement the algorithm just described. T h e crucial fact here is the following: although 2 is a nonlinear map, the iterates (1.6.4) w i l l define linear two-point boundary value problems, each of which is accessible to numerical techniques, as shown below. IIence, the analogy with Newton's method is complete in that a nonlinear root-finding problem is replaced by a sequence of linear ones. I<. illcG111 and P. Kenneth, Proc. Internat. Astronaut. Congr., 14th, Paris, 173-188 (1963).
1.6.
57
TWO-POINT BOUNDARY VALUE PROBLEMS
If (1.6.3) is substituted into (1.6.4), then %+l
=
Jf(% .) %+l >
- Jf(Xn
7
.) xn
+f(% .) 9
(1.6.5)
which for each n is a system of linear nonhomogeneous differential equations of the form ( I .6.6) z i = A (i )u b(t),
+
where A is the matrix
4 4 = Jf(u(t)?4
(1.6.7)
and b is the vector b(4
=
J f ( 4 4 , 4 44 + f ( u ( t ) ,
t).
(1.6.8)
Solution of Linear Two-Point Boundary Value Problems
If x,,is given as an initial guess, then A and b can be computed and the system (1.6.6) solved to obtain xl, and so on. However, it is still necessary to exhibit the manner in which a system of the form (1.6.6) can be numerically solved subject to conditions on the endpoints of the interval [0, TI. T o fix ideas, suppose that u(0) is to agree with the vector
in its last n - k components and that u( T ) is to agree with c in its first K values. Now integrate forward the homogeneous system zi = A ( t ) u k times, using at each new integration the initial condition d ( 0 ) which is a zero vector except for a one in i t s j t h place, 1 j k. Call the resulting solution y j , none of which will in general satisfy the terminal conditions. Now integrate the nonhomogeneous system with initial condition
< <
58
1.
ITERATIVE METHODS O N NORMED LINEAR SPACES
to obtain the particular solution up. By the principle of superposition for linear systems,
c
+ UP(.)
k
ajuj(*)
3=1
will also be a solution to (1.6.6) for any set of real numbers al,..., ak. Now let L be a k by k matrix whose j t h column consists of the first k components of uj(T). We wish to find a vector
such that k
olW(T)
+ uP(T)
j=1
agrees with the vector c in its first K places. Therefore, the system of k linear equations defined by La = /3
(1.6.9)
must be solved, where
If we can solve for a , then we have obtained a solution to the linear two-point boundary value problem. Returning now to the original system (1.6.5), we can summarize the computational procedure to be used. Beginning with the initial guess xo , compute A and 6 from (1.6.7) and (1.6.8) and solve the resulting linear system (1.6.6). This gives a vector x l , which is then used to recompute A , 6, and (1.6.6) is again solved. Proceed in this manner until two successive iterates agree to some desired accuracy [uniformly over [0, TI according to norm ( I .6.2); recall that if 11 x, - xo 11 4 0, then 11 xTc.,l- x, / / + 01. T h e final iterate is the desired solution to within the given accuracy. When integrating the nonhomogeneous system, it is numerically expedient to replace the choice of zeros in
1.6.
TWO-POINT BOUNDARY VALUE PROBLEMS
59
the first k places of up(0)by the most recent values d,..., ak since this provides the best guess at an initial value for u p . Thus, for example, if is the vector used to find x1 , then, solving for x2 , let (Y
and proceed as above to find a' =
i"), ak'
Then, to find x 3 , let
Numerically one can expect some difficulty with the above algorithm if the matrix L is ill conditioned because 01 may not be accurately determined (see Section 1.4 for a discussion of ill conditioning). This difficulty can arise in the following way. T h e columns of L are determined by integrating the homogeneous system with initial condition uj(0). Now the vectors d ( 0 ) are linearly independent and, in fact, orthogonal. Hence, the solutions d(.)remain independent at subsequent values of t , at least in principle. In fact, the d(-) may become less and less orthogonal (therefore more and more dependent) as integration forward results in an illconditioned L. One attempt to remedy this involves orthogonalizaat several or all integration steps by a tion of the vectors d(.) Gram-Schmidt orthogonalization process. I n doing so, one must be careful to carry out the Gram-Schmidt operations on the d ( 0 )
60
1.
ITERATIVE METHODS O N NORMED LINEAR SPACES
vectors as well in order to be consistent. For example, if n if
are two homogeneous solutions at t orthogonal initial conditions
=
=
3 and
t , , corresponding to
respectively, then the orthogonalization of u,v leads to
(see Bellman [Bl, pp. 44461). Performing the same operations on the given initial conditions leads to two new initial conditions
and
T h e independent vectors (1.6.11) will then produce the orthogonal vectors (1.6.10) (within roundoff) when the homogeneous system is integrated forward to f = t , . I n this way the final integration step will assure us of nearly orthogonal vectors u'( T ) , as is numerically desirable. However, even this procedure will not always insure that L is properly conditioned and it may be necessary to employ an even more effecti\ e orthogonalization procedure such as the one described by Conte [5]. T h e fact that I , can be ill behaved is illustrated by an example given in the same Conte paper. Here, the determinant of the normalized L matrix is roughly 10-lo,which
1.6.
TWO-POINT
BOUNDARY VALUE PROBLEMS
61
suggests ill conditioning. I n fact, for Conte’s problem the solution to La = p leads to a poor resolution of 01. An Example
T h e method outlined above appears to work well in practice and in several examples the domain of attraction has shown itself to be considerably broader than most sufficient conditions would lead one to believe. T h e algorithm itself is due to McGill and Kenneth [13], at least in the form given above, and the next example is a modification of one of the examples given by them. Consider the problem of determining the transfer trajectory of a space vehicle from one orbit about the earth to another such orbit, where the initial and terminal positions as well as the transit time are fixed. Ignoring lunar perturbations, the equations of motion become (after suitable normalization of constants)
x2=
where r(4
=
dx:(t)
(1.6.12)
x2 --
r
+x;(4 + g ( t )
and t E [0, 21. T h e equations (1.6.12) are of course equivalent to a system of six first-order equations. T h e boundary conditions are given by ~ ~ (= 0 1.076000, ) x1(2) = 0,
~ ~ (= 0 ~) ~ (= 0 0, )
~ ~ (= 2 0.576000, )
x3(2) = 0.997661
and an initial guess is the straight line path between the initial and terminal orbit points. Using the computer program described in the Appendix, the algorithm converged in three iterates to a trajectory whose terminal conditions matched the required terminal conditions to within an accuracy of lou5. This example illustrates the effectiveness of the method.
62
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
We should mention finally that a somewhat similar approach to the one described above for solving two-point boundary value problems has also been developed by Bellman and Kalaba under the name of “quasilinearization.”*
1.7.
F U R T H E R REMARKS ON STABILITY
I t is worth making a few more comments about stability questions for differential equations. Theorem 1.4.17 is a special case of a theorem given by LaSalleg who, in turn, extended some earlier work of A, Liapunov done around 1892. A very readable introduction to the subject is given by LaSalle and Lefschetz.lO Suppose that the origin is an equilibrium point for the vectorvalued system of differential equations 2 = f ( x ) . Then, by Taylor’s Theorem [Eq. (1.3.14)], f“x>
=
(Vf”O),
4 + g”x>,
where g(x) represents higher-order nonlinear terms and f is one of the components of the vector f. Letting A be the real n by n matrix with components aii = afi(0)/axi , the differential equations can be written as = A x +g(x). (1.7.1)
*
I n engineering analysis of nonlinear systems it is a frequent practice to ignore the terms g and to determine the behavior of the system in terms of its linear part i - Ax = 0. If the nonlinear terms g are “not too large,” then it might be expected that the stability of the origin for the linearized equations will imply the stability of the overall system. Under suitable conditions this kind of heuristic reasoning can be vindicated, as seen from the next
* See R. Bellman and R. Kalaba, “Quasilinearization and Nonlinear Boundary Value Problems.” Elsevier, New York, 1965. J. P. LaSalle, IEEE Trans. Circuit Theory 97, 520-527 (1960). lo J. P. LaSalle and S. Lefschetz, “Stability by Liapunov’s Direct Method.” Academic Press, New York, 1961. @
1.7.
FURTHER REMARKS ON STABILITY
63
theorem. Let the reader recall first that the origin is asymptotically stable for a linear system i- Ax = 0 if and only if the eigenvalues of A all have negative real parts (such a matrix is termed a stability matrix). Moreover, in the linear case, the domain of attraction for asymptotic stability is all of En (for a proof of these results see Bellman [BI, p. 2411). Theorem 1.7.2. Let A be a stability matrix and suppose that for each 71 > 0 there is a ball B, about the origin for which 11 g(x)lI 7 11 x I\ when x E B , . Then the origin is asymptotically stable for Eq. (1.7.1).
<
Proof. We sketch a proof. Let Q be any positive-definite n by n matrix. Then there exists a positive-definite matrix P for which
ATP + PA
=z
-8
(1.7.3)
(Bellman [BI, p. 2451). Let V ( x ) = (x, Px). Then V is a nonnegative C1 function on En. If x(.) is any solution to (1.7.1), then V ( X ( t ) ) = (k,Px) = (x, =
+ (x, P3i)
+ (x, PAX)+ ( g ( 4 p1. + (x, Pg(4 + 2(x, Pg(.)>.
A=Px)
4 x 7
81 .
If A, , A, are the smallest and largest eigenvalues of Q, P, respectively, then -(x, Qx)
Hence
< -A,
11 x /I2
and
@>d (217AI - ' A d 1
(2,
x
Pg) < A I/ x l12.
/I2
(1.7.4)
and therefore P < 0 in some sufficiently small ball about the origin with V(0) = 0. By Theorem 1.4.17 the origin is asymptotically stable relative to the ball B,. I Estimation of Domains of Attraction
Theorem 1.7.2 suggests a method for estimating the domain of attraction for the asymptotic stability of the origin for equations of type (1.7.1). First pick Q and solve for P in (1.7.3). Effective
64
1.
ITERATIVE METHODS ON NORMED LINEAR SPACES
numerical procedures for doing this are described in a paper by James0n.l’ Next form the function V as in Theorem 1.7.2. T h e contours V(x) : (x, Px)= constant describe ellipsoids in En. We would like to find the largest such ellipsoid within which P < 0, for then, by Theorem 1.4.17, this ellipsoid would lie within the domain of attraction for the origin. Theorem 1.7.2 states that if A is a stability matrix, then there always exists such a region of attraction (nontrivial). T h e problem then is the maximization of V(x) over x E ErL while maintaining P < 0. Equivalently, we would like to find the smallest ellipsoid V(x) = constant having a point of contact with the surface = 0. Figure 1.7.1 illustrates Contour
V=o
( x , P x ) = constont
Equilibrium point
FIG. 1.7.1.
Domain of attraction in
E 2for an equilibrium point.
this in a typical setting for n = 2. Now the problem of minimizing V(x)subject to the constraint = 0 is often accessible to numerical techniques for its solution, and the next two chapters will be devoted almost exclusively to this question. I n particular, it now can be seen that the problem of estimating a region of asymptotic stability for equations of the form (1.7.1) is reducible to a constrained minimization problem. T h e size of the estimated region also depends on the choice of the matrix 0,and therefore it is necessary to vary over Q to obtain a best estimate of the set J2 of l l A . Jameson, S I A M
I. 16,
1020-1023 (1968).
1.7. Theorem has been following system of
FURTHER REMARKS ON STABILITY
65
1.4.17. T h e method outlined here is due to Geiss and tested numerically by him with some ~ u c c e s s T. h~ e~ ~ ~ ~ simple example illustrates Geiss’ approach. Consider the equations i= Ax g(x) given by
+
(1.7.5)
It is clear that the origin in E2 is an equilibrium state and that A is a stability matrix. In order to estimate the domain of attraction, let Q = I and solve for P in ( I .7.3). A simple calculation shows that P
(i
=
and hence
Moreover,
(1.7.6)
0 %
P = X1kl + x2x2 -
-x;
-
+ x,x;,
x;
I n order to find a point of contact between the circular contour V ( x ) = constant and the locus = 0, apply the Lagrange multiplier rule, which is discussed in the next chapter, to conclude that there exists a scalar X for which
vv +AVV
=0
(1.7.7)
or, more explicitly, XI
x2
4-h(x2, - 2x1) = 0
+ A(2x1x,
~
2.4
= 0.
G. Geiss, GAEC Rep. R M 3161 (1966). For details in the numerical testing of the outlined method, see G. Geiss, “Analysis and Design of Nonlinear Systems via Liapunov’s Method.” PrenticeHall, Engelwood Cliffs, New Jersey, 1970. l2
l3
66
1.
ITERATIVE METHODS O N NORMED LINEAR SPACES
T hi s leads to the solution x;
= 25: =
9 -. 2
Therefore, the circular contour having point of contact with has a radius 27 x: x: 2 8 ’
+
I
=0
-
T h e interior of this circle is a domain of attraction for the origin. We leave the reader to try to improve this estimate by varying Q. We conclude this section with a brief discussion of a stability question in economics. Consider a nonlinear model relating price to demand and supply as described by p = f ( p ) , where p is a price vector and f ( p ) is a vector-valued function of excess supply or demand. A solution of the equations describes price adjustment over time as a function of supply and demand in a competitive economy, and an equilibrium point is determined by the price vector for which f(p) = 0 (supply equals demand). As a basic probblem, one can determine the extent to which such an economy is stable with respect to its equilibium state. If the linearized system is asymptotically stable, then one may begin to estimate the stability domain by applying the algorithmic procedure described above.