U.S.S.R. Comput. Maths Math. Phys. Vol. 17, pp. 35-52. © Pergamon Press Ltd. 1978. Printed in Great Britain
ON OPTIMIZATION OF ALGORITHMS FOR FUNCTION MINIMIZATION*

G. SONNEVEND

Budapest, Hungary

THE PROBLEM of optimization of algorithms for the ordered search for the minimum points of functions is investigated. Optimal algorithms are constructed when the following a priori information about the function is known: the eigenvalues of its second derivative matrix lie between two known non-negative numbers.
Introduction

In the last two decades there has been great activity in finding good function minimization techniques. Many methods ("algorithms") have been introduced, some of them carefully analyzed, see e.g. [1]. However, the question of optimization of algorithms (with respect to a given class of algorithms, A, a given class of functions, F, and a given criterion) has been dealt with to a lesser extent. Optimization problems were considered mostly qualitatively, that is, from the point of view of the "order" (type) of convergence, or of the behaviour on quadratic functions. The only generally known, yet very simple, example of an optimal algorithm is the so-called Fibonacci search method (and its limit, the "golden section" search method; see also [2-4]). In practice, when for example the minimization problem arises from the problem of planning a sequence of highly expensive "experiments", the need for optimization of the corresponding algorithm is obvious. (By an experiment we shall mean, in what follows, the computation of the function, or of its derivative, at a given point.) On the other hand, each "standard program" (which is used many times in a computing centre) should be, in some sense, optimal. This sense should be determined by the deterministic or stochastic class of minimization problems which are to be solved.

In this contribution we give a general formulation, and methods of solution, of the optimization problem for algorithms searching for the minimum (point) of a function and for those searching for the root (solution) of a system of nonlinear equations. First we deal with the case of functions of one real variable; in § 4 we present a method to construct optimal algorithms in the multidimensional case, when the functions involved have, roughly speaking, second derivatives whose eigenvalues lie between two known nonnegative numbers (and also when the system of nonlinear equations corresponds to monotone operators).

*Zh. vychisl. Mat. mat. Fiz., 17, 3, 591-609, 1977.
In stating the problem of optimization we follow the line of the works of N. S. Bakhvalov, see [5, 6]. The main features of our approach are the following.

1. We use the number of computations of the values of the function as the quantity which should be minimized to yield a given accuracy ε; or, more precisely, fixing this number we try to get the "greatest" accuracy. The error measures, by some real number, the goodness of approximation of the "minimal situation" (we take it as the diameter of the set of localization of all possible points of minimum, or as the deviation of the minimal value of the function from the last (smallest) computed one, ...).

2. The class of functions, F, is described by an inequality, of local or nonlocal type, connecting the values of the function and its derivatives, so that in the case of local characterization each function is determined by some scalar or vector function of one or several real variables which assumes its values from a (compact) set. This class expresses the a priori information on the function and identifies it with the trajectory of a "control process". In order to solve a concrete minimization problem we try to get as much a priori information as possible on the function to be minimized, and actually regard it as an element of a class F.

3. The algorithms use, at successive steps (experiments), all the a posteriori information on the concrete function which was gathered in the previous steps (and, of course, which is relevant for the future search). Practically, in many cases, at each step we can compute only the value of the function (partly because to compute the gradient of the function by some additional measurements would not be optimal). In such cases it is important to know (remember) the previously computed values of the function. Throughout this paper we assume that the values of the function can be computed exactly; the more realistic case of imprecise measurements could be dealt with by minor modifications, see e.g. [7].

4. The method of dynamic programming leads to the construction of the optimal algorithms through the solution of some optimal control problems, related to the control process, F, and through finding the minimum of some "standard" (that is, not depending on the concrete element, f(·), of the class F) functions. The arising control problems can be solved using extensions of the maximum principle, see [8, 9]; the multidimensional control problems are solved by a new non-local technique, i.e. not by a maximum principle. Here we present the general method illustrated on a nontrivial example, where the a priori information is given by known, positive lower and upper bounds, m, M, for the (eigen)values of the second derivative of the function, which we do not assume to be continuous. We compute the optimal algorithm in special cases. As important tools of our methods we mention the theory of convex bodies and their functions, and the theory of linear and quadratic programming (inequalities).
1. Definition of the problem

Suppose we have to minimize a function f(t): R^1 → R^1 about which the a priori information

m ≤ f''(t) = u(t) ≤ M,   u(t) measurable,   (1.1)

is given, where m, M are known nonnegative numbers. Assume that

f'(θ) = 0,   θ ∈ [a_1, a_2],   f(a_i) = y_i,  i = 1, 2,   (1.2)

where the minimum point, θ, is uniquely determined if m > 0. Let N, the number of admissible steps, be fixed. As an admissible algorithm we regard a sequence of functions

a_{n+1} = A_{n+1}^N (a_j, f(a_j), j = 1, 2, ..., n; m, M),   n ≥ 2.   (1.3)
Remark that the algorithms (1.3) make no use of the values f'(a_j). This is motivated by practice: if the only way of computing (approximately) f'(τ) is to compute two values of f(t) in a small neighbourhood of τ, then in view of the supposed expensiveness of the computation of f(t) it is clear (and will be demonstrated in the sequel on examples) that there is no sense in computing f'(τ). This shows that many popular "gradient" methods are far from optimal. However, it should be noted that the algorithms which use the derivatives are much more easily optimized, as our example will show. (There are cases in which the computation of f'(τ) or grad f(τ) is approximately of the same "complexity" as that of f(τ).)

The criterion of approximation, the error, let be one of the following (here f(a_i) = y_i, i = 1, 2, ..., N, fixed):

J^l = J^l (a_i, f(a_i), i = 1, 2, ..., N; m, M) = d_2 - d_1,

d_1 = min_{u(·)} { t | f'(t) = 0, f(a_i) = y_i, i = 1, 2, ..., N },   (1.4)

d_2 = max_{u(·)} { t | f'(t) = 0, f(a_i) = y_i, i = 1, 2, ..., N };

J^v = min_{1≤i≤N} f(a_i) - min_{u(·)} min_t { f(t) | f(a_i) = y_i, i = 1, 2, ..., N };   (1.5)

J^d = min_t max_{u(·)} { |f'(t)|^2 | f(a_i) = y_i, i = 1, 2, ..., N }.   (1.6)

It is easy to see that it is legitimate here to substitute inf (sup) by min (max).
From the practical point of view criterion (1.5) seems to be the most important one. There is a principal difference between criteria (1.4) and (1.5) on one side and (1.6) on the other: the criterion (1.6) cannot be obtained by taking the supremum of a functional, J(f), independent of a_i, f(a_i) (because here min max is not equal to max min). Remark that the criterion (1.6) has independent interest in connection with the following characterization of the minimum point (for m > 0, M < ∞) as the solution of the problem

min_t max_{|h| ≤ δ} |f(t + h) - f(t)|.

Criteria (1.5) and (1.6) make sense for functions of many variables, while for the criterion (1.4) in the multidimensional case there are several possible generalizations (diameter or measure of the set of localization of possible points of minimum).
The functional (1.4) seems to be adequate if the minimization problem arises from a gradient-type multidimensional search algorithm whose subalgorithm is a one-dimensional search. The class (1.1) arises that way if the minimized function satisfies, for each z,

Sp (F_ij(z)) ⊂ [m, M],   F_ij(z) = ∂^2 f / ∂z_i ∂z_j.   (1.7)

In § 4 we give a nonlocal characterization of this class of functions, and based on it we deal with the problems analogous to (1.4)-(1.6).
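To fix ideas, here is a minimal sketch (in Python; the scheme and all names are ours, not the paper's) of the sequential protocol (1.3): the (n+1)-th experiment may depend on the whole measurement history and on the a priori constants m, M.

```python
# Hypothetical sketch of the admissible-algorithm protocol (1.3):
# an N-step search in which the next measurement point depends on
# all previously computed values and on the a priori bounds m, M.

from typing import Callable, List, Tuple

History = List[Tuple[float, float]]              # pairs (a_i, f(a_i))
Rule = Callable[[History, float, float], float]  # plays the role of A_{n+1}^N

def run_search(f: Callable[[float], float], rule: Rule,
               a1: float, a2: float, m: float, M: float, N: int) -> History:
    """Evaluate f at N points; points 3, ..., N are chosen by `rule`."""
    history: History = [(a1, f(a1)), (a2, f(a2))]
    for _ in range(2, N):
        a_next = rule(history, m, M)     # uses all a posteriori information
        history.append((a_next, f(a_next)))
    return history

# Example rule (not optimal): bisect the interval spanned so far.
def midpoint_rule(history: History, m: float, M: float) -> float:
    xs = [a for a, _ in history]
    return 0.5 * (min(xs) + max(xs))
```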
2. Methods for the construction of optimal algorithms

1. First let us remark that the calculation of the error functions J_0^r = J^r (a_i, f(a_i), i = 1, 2, ..., N), r = l, v, d, is equivalent to the solution of an optimal control problem related to the control process (1.1) with objective functionals J^r(·). In the cases r = v, d we shall solve a sequence of control problems depending on the parameter t, and then fulfil an operation of minimization with respect to t. Similarly, it turns out that the computation of the functions A_n^N(·) can be achieved by the solution of some optimal control problems and by the minimization of some standard functions. This is done by using the principle of "optimal step" of dynamic programming. Let us denote the best guaranteed value of J_0^r(·) by J_{N-n}^r(·) when we know only a_i, f(a_i) = y_i for i = 1, 2, ..., n. Then a_{n+1} = a* is determined as the solution of the equality

inf_a sup_f J_{N-n-1}^r (a_i, f(a_i), a, f(a), 1 ≤ i ≤ n) = J_{N-n}^r (a_i, f(a_i), 1 ≤ i ≤ n).   (2.1)

(For n = N - 1, in some cases, the inf (sup) cannot be attained, e.g. see Theorem 3.) Thus the functions A_n^N(·) can be computed for n = N, N - 1, ..., 3.

2. The control problems involved in (1.4)-(1.6), (2.1) are not quite of the usual type, yet the necessary conditions from the maximum principle can be obtained for them, see [8, 9]. To do this there are several possibilities. Let us remark first that the equality constraints, f(a_i) = y_i for "inner" points a_i, can be incorporated into the integral cost function by adding a term

ε^{-1} δ_ε(t - a_i) |x - y_i|^2 = f_ε'(t, x),   x(·) = f(·),

where δ_ε(τ) is an approximation of the δ-function, ε → 0. Or one can transform the conditions f(a_i) = y_i, with the exception of a_i = a_1, a_2, a_3, into two conditions on terminal manifolds (2.2), where a_1^n, (a_2^n), are the nearest to θ among the a_i from the left (right) side.
The determination of the numbers α_k, β_k, k = 1, 2, is equivalent to the consecutive solution of control problems with "terminal" cost functions ±x'(a). (It is well known how to transform a terminal cost function into an integral cost function.) The cost functionals (1.4) or (1.5) can be expressed by piecewise constant integral cost functions; for example,

d_1 = ∫_{a_1}^{a_2} h(x') dt,   where   h(x') = 1 for x' ≤ 0,  h(x') = 0 for x' > 0.

We can approximate h(·) by a continuously differentiable function h_ε(·), and going to the limit ε → 0 we obtain that the second component of the ψ-function, ψ_2(τ), has a jump at the point d_1, ψ_2(d_1 + 0) - ψ_2(d_1 - 0) ≤ 0; otherwise it satisfies the equations, see [8],

ψ_1' = 0,   ψ_2' = -ψ_1.   (2.3)

If d_1 ≤ a_3^n (a_3^n ≤ d_1), we can give the "role" of a_2^n, (a_1^n), to a_3^n. The transversality conditions for f(a_1^n) arising from (2.2) can be obtained by approximating the terminal manifolds in (2.2) (line segments) by smooth curves, so we get

ψ_2(a_2^n) = 0  if  x'(a_2^n) = α_2,   ψ_2(a_1^n) ≤ 0  if  x'(a_1^n) = β_1.

In that way a complete characterization of the optimal paths can be given in the problem (1.4). The path which gives the optimal value for d_1 is characterized by the following: the function u(τ) is piecewise constant and takes the values m, M, m, M on subsequent (from the left) intervals, and the first change from M to m occurs at the minimum point θ. By introducing the variables θ = ξ_1, f(θ) = ξ_2, and suitable variables ξ_3, ξ_4 to represent the solution ψ(τ) (by its "jump quantities")

ξ_4 = [ψ_2(θ + 0) - ψ_2(θ - 0)] (ψ_1)^{-1},   ξ_3 = ψ_2(θ - 0) (ψ_1)^{-1},

we get a system of equations linear in ξ_2 and quadratic in ξ_1. However, the "nonlinearity" comes in with the variables ξ_3, ξ_4. As "known" quantities we have f(a_k), k = 1, 2, 3 (one of them is always missing), and the two transversality conditions, so that there is, in general, one and only one solution. The above analysis shows that d_2 - d_1 = J^l (a_i, y_i, 1 ≤ i ≤ N).
In problem (1.6) the functional |f'(t)|^2 is, algebraically, one of the simplest possible; the corresponding integral cost function is linear. The one-dimensional minimization problems with respect to t and a arising in (1.6) and (2.1) can be reduced to simpler problems of solving an equation by the following important observation. As seen from the above, in general the exact characterization of the extremal solutions is cumbersome because of the necessity of distinguishing many possible situations (the transversality conditions, the sign of f'(θ), ...). "Usually", in many cases, for a given value of t, (a), there are two values of f(t), (f(a)), at which the function |f'(t)|^2, respectively J_{N-n-1}^r (a, f(a), ·), becomes critical. If these values are monotone decreasing and monotone increasing functions of t, (a), then, in general, their maximum becomes minimal if and only if they are equal. (In the concrete examples below this fact will often be used.)
3. Before turning to examples let us make some general remarks, which will be illustrated by the examples below. As seen from the above discussion, the exact solution of the optimality system (2.1) is, in general, quite complicated. In such cases we try to compute only the optimal onestep algorithm. In some cases it can be proved that the optimal N-step algorithm is the N-th iteration of the optimal onestep algorithm, or at least can be approximated by it, especially for N → ∞. Generally the limit of the optimal N-step algorithms for N → ∞ is easier to find. (Practically, the solution of the system (2.1) is found by guessing the solution and proving it to be such by induction with respect to (N - n) = 1, 2, ....) More generally, if the problem contains some "small parameter", p, for example p = M/m - 1 or N^{-1}, we can try to find the solution for p → 0.

Even the optimal onestep algorithm may turn out to be very complicated; then we try to find an approximation to it by a computationally simpler one. The operation of taking the minimum with respect to a, t cannot always be fulfilled exactly (by a formula giving a_{n+1} in (2.1)); therefore in practice we should use a minimization procedure for this standard function J(a) (i.e. not depending on the concrete function f(·)). It is important to know whether J(a) has a unique minimum, and mainly: when is J(a) a convex function of a?
3. Optimal algorithms in the one-dimensional case

1. First let us turn to the simplest case, when we are able (ready) to compute only the derivatives g(t) = f'(t). Then the problem consists in finding optimal algorithms for the solution of the equation

f'(t) = g(t) = 0,   m ≤ g'(t) ≤ M.   (3.1)

(While admitting m = 0, M = ∞, we assume that θ, g(θ) = 0, exists and is unique.)

The criteria (1.4) and (1.6) have their obvious counterparts; for simplicity we use the same notation for them, as well as for the admissible algorithms in (1.3), taking in them g(a_i) instead of f(a_i). Let us denote the value of the error after N measurements by J^r(N) = J_0^r (a_i, g(a_i), i = 1, 2, ..., N). It is obvious that it depends only on a_1^N, a_2^N, the two measurements at which g(a) changes sign.
Theorem 1. (a) The optimal algorithm for the problem J^l actually does not depend on N. The following estimate is exact:

J^l(N) ≤ κ^{N-2} J^l(2),   κ = 2^{-1} (1 - m/M).   (3.2)

(b) The optimal onestep algorithm for the problem J^d is not the beginning of the optimal algorithms for N ≥ 3; its convergence is estimated by the inequality

J^d(N) ≤ [(M - 2m)/(3M - 2m)]^{N-2} (M - m) J^l(2).
Remark that part (a) of the Theorem was proved in [7] (in the slightly more general case when the measurements of the value of g(t) contain some error, δg(t) ≤ δ_0).

To prove the theorem, note first that the interval of localization of the roots, [d_1^n, d_2^n], is equal to the intersection of the abscissa, g ≡ 0, with a parallelogram whose opposite vertices are (a_1^n, g(a_1^n)), (a_2^n, g(a_2^n)) and whose sides have the directions corresponding to the values m, M:

d_1^2 = max { a_1 - g(a_1)/M,  a_2 - g(a_2)/m },

d_2^2 = min { a_1 - g(a_1)/m,  a_2 - g(a_2)/M },

J^l(2) = d_2^2 - d_1^2.
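For 0 < m ≤ M < ∞ this localization step is easy to express in code. The following sketch (ours, not the paper's) computes [d_1, d_2] from two measurements of g = f' bracketing the root, exactly as described above.

```python
# Sketch: localization interval [d1, d2] for the root of g = f',
# given g(x1) < 0 < g(x2) and slope bounds 0 < m <= g' <= M
# (the intersection of the parallelogram with the t-axis).

def localize(x1, g1, x2, g2, m, M):
    """Return (d1, d2), the interval of possible roots of g."""
    assert g1 < 0.0 < g2 and 0.0 < m <= M
    d1 = max(x1 - g1 / M, x2 - g2 / m)   # the root cannot lie to the left
    d2 = min(x1 - g1 / m, x2 - g2 / M)   # the root cannot lie to the right
    return d1, d2
```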
In problem J’ using the argument of the end of the previous section it is easy to see that the optimal onestep choice of the following measurement and so for arbitrary
r(n+l)~J*(n> which equality is exact, that is for appropriate Now the statement inequality
consists in taking a3 = 2-l
(dl 2 + d22)
n:
f( i-3)
9
value of g (a3) we have equality.
of the theorem follows from the fact that for each n > 2 the above
is valid and it is possible that we have equality for one and the same function,
g( .),
for each value of 12. Let us remark here that for a function g ( .) whose derivative is discontinuous at the point 0, where g (0) = 0, the well known “algorithm of secants” converges more slowly than above. However for a function which is continuously differentiable at the neighborhood of its root, the optimal algorithms converges more slowly than the algorithms of secants. To prove the second part of the theorem let us remark first that the value t = t*, at which g(t) is, in absolute value, bounded by the smallest number is characterized by the property that the intersection of the parallelogramm with the vertical line at t* is symmetrical to the t-axis. This is an application of the observation from the end of the previous section. Now the same observation
allows to find out the optimal onestep algorithm, N = 3.
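Before turning to part (b), the optimal algorithm of part (a) admits a compact sketch (again ours, reusing `localize` from above): measure g at the midpoint of the current localization interval; by (3.2) the interval contracts at least by the factor κ = (1 - m/M)/2 per step.

```python
# Sketch of the J^l-optimal search of Theorem 1(a): repeatedly measure
# g at the midpoint of the current localization interval.

def root_search(g, x1, x2, m, M, N):
    g1, g2 = g(x1), g(x2)
    for _ in range(N - 2):
        d1, d2 = localize(x1, g1, x2, g2, m, M)
        a = 0.5 * (d1 + d2)              # the optimal onestep choice
        ga = g(a)
        if ga == 0.0:
            return a, a                  # root found exactly
        if ga < 0.0:
            x1, g1 = a, ga               # root lies to the right of a
        else:
            x2, g2 = a, ga               # root lies to the left of a
    return localize(x1, g1, x2, g2, m, M)
```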
Consider the intersection P ∩ P* of the parallelogram, P, with its symmetric image, P* (the axis of symmetry being the t-axis). We say that a rectangle which lies symmetrically to the t-axis is of the type (m, M) if it is similar to a rectangle D(m, M) with the following property: there is a point on the vertical line (t = const) going through the centre of D(m, M) which forms, with two (symmetrical) vertices of D(m, M), lines whose direction tangents are equal to m and M. Inscribe in P ∩ P* the greatest possible rectangle of the type (m, M), and take a_3 to be its centre (first suppose m > 0). The inequality of statement (b) follows from this construction; it is not exact for N > 3. Here the optimal N-step algorithm depends on N (and not only on the interval of localization of the roots!). In principle it could be computed using the dynamic programming equations [4]. If m = 0, M < ∞, then the iteration of the optimal onestep algorithm is also an optimal N-step algorithm for all N, and is identical with the optimal algorithm for the problem J^l of localization of the root: it consists in halving [d_1^n, d_2^n]. Here the convergence of the algorithm may be estimated exactly:

J^d(N) ≤ 2^{-1} M 2^{-N} J^l(2).

This inequality is exact in the sense that for each initial situation there is a number, N_0, such that we may have equality for all N ≥ N_0. If M = ∞, m > 0, the optimal onestep algorithm, which consists in taking a_3 = t* if t* ≠ a_1, a_2, may not exist.

2. Let us turn to the case when we can calculate (measure) at the same step the function f(t) and its derivative f'(t). In this case all the information which is relevant for the future is contained in the results of the measurements at the consecutive points a_1^n, a_2^n, at which the derivative, f'(t), changes sign.
The optimal onestep algorithm for the problem J^v can be easily constructed, if m = 0, M = ∞, making use of the observation above.

As noted earlier, the characterization of the optimal paths which supply the values d_1^n, d_2^n in problem J^l is simpler than before, because there are no "transversality" problems here. Let us denote the interval of localization after n measurements by [d_1^n, d_2^n].
Theorem 2. The optimal choice of a_{n+1} in problem J^l (1.4) is, for the cases m = 0, M arbitrary, independent of (N - n), and consists in taking the next measurement at the midpoint of the interval of localization of possible minimum points:

a_{n+1} = 2^{-1} (d_1^n + d_2^n),   d_2^n - d_1^n ≤ 2^{2-n} (d_2^2 - d_1^2).   (3.3)
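A minimal sketch of the rule (3.3) (ours; the paper gives no code): when f' is measured, each step halves the interval of localization of the minimum point.

```python
# Sketch of the choice (3.3): bisect the interval of localization of
# the minimum point, keeping the half indicated by the sign of f'.

def minimize_with_derivative(fprime, d1, d2, n):
    for _ in range(n):
        a = 0.5 * (d1 + d2)
        if fprime(a) > 0.0:    # the minimum lies to the left of a
            d2 = a
        else:
            d1 = a
    return d1, d2              # the length d2 - d1 was halved n times
```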
For the case m > 0, the optimal N-step method at step n depends not only on the interval of localization and the points a_1^n, a_2^n, but also depends "directly" on the values of f(·), f'(·); yet in many cases the choice (3.3) is a good one.

3. The computation of optimal N-step algorithms for the problem J^l is even more complicated if we cannot compute derivatives. The simplest case, m = 0, M = ∞, can be solved (see also [2], where Theorem 3 is stated (without proof) and, for problem J^v, the optimal onestep algorithm and its limit, for (d_2^n - d_1^n) → 0, are computed).
Using the same notation as in § 2, we can, using the convexity of f(x), easily determine the interval of localization [d_1^n, d_2^n] at step n = 3, 4, ... (assuming that initially at least n_0 ≥ 3 values of f(a_i), i = 1, 2, ..., n_0, are given and f(a_3) ≤ f(a_i), i = 1, 2, e.g. by thinking that at the points a_1, a_2 we have f'(a_1) = -∞, f'(a_2) = ∞). In the last, N-th, step of the algorithm we have

f(a_s^{N-1}) = min_{1≤i≤N-1} f(a_i)

computed, and we take a measurement of f(a) at a point a of [d_1^{N-1}, d_2^{N-1}] infinitesimally near to a_s^{N-1}, on that part of it which is greater; we use the notation j = 1 or 2 depending on whether d_j^{N-1} lies on this part or not. Therefore we have

J^l = max ( |d_1^{N-1} - a_s^{N-1}|, |d_2^{N-1} - a_s^{N-1}| ).   (3.4)
Theorem 3. The optimal N-step algorithm for the problem J^l (1.4) in the case m = 0, M = ∞ (assuming the uniqueness of θ) is given by

a_{n+1} = c_{N-n+1}^{-1} (a_s^n c_{N-n} + d_j^n c_{N-n-1}),   k ≠ j,   (3.5)

where c_1 = c_2 = 1, c_{k+2} = c_k + c_{k+1}, k = 1, 2, ..., is the Fibonacci sequence. We have, for k = 1 or 2,

J_{N-n}^l (a_i, f(a_i), 1 ≤ i ≤ n) = max ( c_{N-n+1}^{-1} |d_j^n - a_s^n|, c_{N-n}^{-1} |a_s^n - d_k^n| ).   (3.6)
We prove (3.6) by induction on s = N - n; for s = 1 it is true by (3.4). First remark that J_{N-n}^l(·) is a monotone function of |d_j^n - a_s^n| for fixed a_s^n; therefore the "worst" cases of the value f(a) in (2.1) are f(a) = f(a_s^n) ± ε, ε a small value. (We have assumed θ, f'(θ) = 0, to be unique in order to be able to guarantee J^l(N) → 0 for N → ∞; therefore the case ε = 0 in the above relation is indeed a good case, provided of course that we assume the measurement of f(a) to be without error.) Now, taking into account the different possibilities, ε ≷ 0, a ≷ a_s^n, the inductive part of the proof of (3.6) is equivalent to a min-max relation over a ∈ [d_j, d_k] between the ratios c_{s-1}^{-1} |d_j - a|, c_s^{-1} |d_k - a| and c_{s-1}^{-1} |a_s - a|, the validity of which is easily verified by the "observation" above: we should have

(d_j - a)/c_{N-n} = (a_s - a)/c_{N-n-1},   i.e.   a = (c_{N-n} + c_{N-n-1})^{-1} (d_j c_{N-n-1} + a_s c_{N-n}).
For the general case, m, M arbitrary, we propose to choose a_{n+1} according to (3.3) or (3.5), according as we use derivatives too or not. The reason behind this is that the "main difference" between the cases with different (m, M) lies in the determination of the intervals of localization. By the known asymptotic formula c_n ≈ ((√5 + 1)/2)^n (√5)^{-1}, we can easily pass in (3.5) to the limit N → ∞, and so get the generalization of the search method called "golden section search" for the case m = 0, M = ∞, and an approximation of the limit algorithm for m, M arbitrary. We state as a conjecture that if m ≠ M, there exists a constant κ(m, M) > 0 such that J^l(N) ≥ κ^{N-3} J^l(3), κ ≤ κ(0, ∞) = (√5 - 1) 2^{-1} (recall that f''(t) is not assumed to be continuous).
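The algorithm (3.5) is, in the case m = 0, M = ∞, the classical Fibonacci search. The following simplified sketch (ours) forgoes the final ε-displacement step, so it uses N - 1 evaluations (N ≥ 3); in the limit N → ∞ the placement ratio tends to the golden section.

```python
# Simplified sketch of the Fibonacci search (3.5) for a unimodal f
# on [lo, hi]; c_1 = c_2 = 1, c_{k+2} = c_k + c_{k+1}.

def fib_search(f, lo, hi, N):
    c = [1.0, 1.0]
    while len(c) < N + 1:
        c.append(c[-1] + c[-2])          # Fibonacci numbers c_k
    k = N
    x1 = lo + (c[k - 2] / c[k]) * (hi - lo)
    x2 = lo + (c[k - 1] / c[k]) * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while k > 3:
        k -= 1
        if f1 <= f2:                     # minimum localized in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = lo + (c[k - 2] / c[k]) * (hi - lo)
            f1 = f(x1)
        else:                            # minimum localized in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + (c[k - 1] / c[k]) * (hi - lo)
            f2 = f(x2)
    return (lo, x2) if f1 <= f2 else (x1, hi)
```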
4. Construction of optimal algorithms in the multidimensional case

1. Now we deal with the analogues of the problems (1.4)-(1.6) in the multidimensional case for the class of functions defined by (1.7). More exactly, let us define the class of functions f(z) ∈ W(m, M), z ∈ R^p, ‖z‖ = ((z^1)^2 + ... + (z^p)^2)^{1/2}, 0 ≤ m ≤ M ≤ ∞, by

m ‖x - y‖^2 ≤ (g(x) - g(y), x - y) ≤ M ‖x - y‖^2,   x, y ∈ R^p,   g(z) = grad f(z).   (4.1)

In the case M = ∞ we do not require g(z) to be single-valued; i.e. g(·): R^p → R^p is a possibly multivalued mapping, and the η ∈ g(z) are called subdifferentials: η ∈ ∂f(z) if

f(z + w) ≥ f(z) + (η, w),   w ∈ R^p.
It is known, see [10], § 24, that the necessary and sufficient condition for the existence of a convex function f(z), such that g(z) ⊂ ∂f(z) for each z, is that for every sequence of pairs (z_i, η_i), η_i ∈ g(z_i), i = 1, 2, ..., n,

(z_2 - z_1, η_1) + ... + (z_1 - z_n, η_n) ≤ 0,   (4.2)

which condition is called the cyclical monotony of g(·). Thus, in the case m = 0, M = ∞, W(0, ∞) is the class of convex functions. On the generalization of the condition (4.2) see below. Remark that, in contrast to the relation (1.1), we do not require f(z) to have (continuous) second derivatives (yet it is true that these derivatives exist in the sense of generalized functions; this is not important for us here). Note that in Newton-type methods, see [1], it is usually assumed that the functions F_ij(z) are continuous. Of course, in practice the function f(z) is often locally characterized by the local condition (1.7)

Sp (F_ij(z)) ⊂ [m, M],   Sp = spectrum,   (4.3)
from which (4.1) is a consequence, see below. The important technical distinction between (4.1) and (4.3) is in respect of compactness: the uniform limit f_∞(z) of functions f_n(z), n = 1, 2, ..., having continuous second derivatives satisfying (4.3), may not have (continuous) second derivatives, while f_∞(z) satisfies (4.1). Remark that if we are interested in the solution of a nonlinear equation

g(z) = 0,   g: R^p → R^p,   (4.4)

where g(z) is not required to be the gradient of a function, but only assumed to satisfy (4.1) and to be continuous, then for the class defined by (4.1) we may formulate, as in the case of (3.1), an optimization problem for root-finding algorithms, see below. The mappings g(·) satisfying (4.1) are called monotone mappings, and it is known, see [7], that the condition (4.1), together with the continuity of g(z), assures, if m > 0, the existence of a unique solution, θ, g(θ) = 0, of (4.4). The following theorem gives a characterization of the class W(m, M) which will be used throughout in the construction of optimal algorithms.
Theorem 4. The necessary and sufficient condition for the existence of a function f(z), belonging to W(m, M) and assuming prescribed values f(a_i) = ξ_i, grad f(a_i) = g(a_i) = η_i, i = 1, 2, ..., N, is given by the following N(N - 1) relations:

f(a_j) ≥ f(a_i) + (g(a_i), a_j - a_i) + (m/2) ‖a_i - a_j‖^2 + [2(M - m)]^{-1} ‖g(a_i) - g(a_j) - m(a_i - a_j)‖^2,   i, j = 1, 2, ..., N.   (4.5)
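Conditions (4.5) are directly checkable. The following sketch (ours; it assumes 0 ≤ m < M < ∞ and NumPy vectors) tests whether given data (a_i, f(a_i), g(a_i)) can be interpolated by some f ∈ W(m, M).

```python
# Sketch: test the N(N - 1) interpolation conditions (4.5) of Theorem 4.

import numpy as np

def fits_W(points, fvals, grads, m, M, tol=1e-12):
    """points, grads: lists of p-vectors; fvals: list of floats; m < M."""
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = points[j] - points[i]                       # a_j - a_i
            dg = grads[i] - grads[j] - m * (points[i] - points[j])
            rhs = (fvals[i] + grads[i] @ d + 0.5 * m * (d @ d)
                   + (dg @ dg) / (2.0 * (M - m)))
            if fvals[j] < rhs - tol:
                return False                                # (4.5) violated
    return True
```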
To explore the relation between the conditions (4.1) and (4.3), let us introduce the function of one real variable, s ∈ R^1, 0 ≤ s ≤ ‖a_i - a_j‖,

h(s) = f(a_j + s w_ij),   w_ij = ‖a_i - a_j‖^{-1} (a_i - a_j).   (4.6)

The mean value theorem for h(s) yields

h(s) = h(0) + h'(0) s + 2^{-1} h''(ϑs) s^2,   0 < ϑ < 1,

which is, for s = ‖a_i - a_j‖, equivalent to (4.5); indeed,

h'(0) = ‖a_i - a_j‖^{-1} (grad f(a_j), a_i - a_j),   m ≤ h''(τ) ≤ M.   (4.7)
The validity of (4.7) follows from the fact that the function h(s), being convex, has measurable (integrable) second derivatives. If on a set, Γ, of positive measure h''(τ) does not belong to [m, M], we get a contradiction to (4.1), using the fact that almost every point, t, of Γ is a density point of Γ, i.e.

(2ε)^{-1} μ(s ∈ Γ, t - ε ≤ s ≤ t + ε) → 1,   ε → 0;

therefore, using

h'(t + ε) = h'(t - ε) + ∫_{t-ε}^{t+ε} h''(τ) dτ,

we have, if h''(τ) ≥ M + δ on Γ for some δ > 0,

h'(t + ε) ≥ h'(t - ε) + 2(M + δ)ε + o(ε),   ε ≤ ε_0(δ),

from which we have a contradiction to (4.1) by (h'(t + ε) - h'(t - ε))·2ε = (g(z_1) - g(z_2), z_1 - z_2), where z_1 = a_j + (t + ε) w_ij, z_2 = a_j + (t - ε) w_ij, ‖z_1 - z_2‖ = 2ε.
Conversely, from the validity of (4.7) for all possible values of a_j = x and a_i = y, the validity of (4.1) follows by the mean value theorem

h'(s) = h'(0) + s h''(ϑs),   0 < ϑ < 1,   s = ‖x - y‖.   (4.8)
Thus we see that the local condition (4.3) implies the global condition (4.1). To prove the sufficiency of the conditions (4.5), let us suppose m = 0, M < ∞. This is no restriction of generality, because by subtracting the function f_0(z) = 2^{-1} m ‖z‖^2 from each function of the class W(m, M) we get the class W(0, M - m), and conversely, adding f_0(z) to the elements of the class W(0, M - m) we get the class W(m, M). Now the relations (4.5) express that the convex hull of the N paraboloids

P(a_i) = { f(a_i) + (g(a_i), z - a_i) + (M/2) ‖z - a_i‖^2 }

is a surface f*(z) which goes through the points (f(a_i), a_i), i = 1, 2, ..., N, having normal vectors corresponding to g(a_i):

grad f*(a_i) = g(a_i).

The function grad f*(z) = g*(z) satisfies (4.1). Indeed, the left part of (4.1) is simply a consequence of the convexity of f*(z), while the right part of (4.1) follows from the fact that h*(s) = f*(z + sw) satisfies (4.3) for every z, ‖w‖ = 1, in R^1. This is because at every point, b, f*(z) has a supporting hyperplane defined by g*(b), and f*(z) lies below the paraboloid P(b), where

f*(b) = Σ λ_i P(a_i; b_i),   g*(b) = Σ λ_i grad P(a_i; b_i).   (4.9)

Indeed, by the theorem on the representation of a convex hull through the extreme points of its constituents, see [10], § 27, these sums may be chosen, for each point b, to contain no more than r = p + 2 summands; the convex combination of a finite number, r, of the above paraboloids P(a_i), being the paraboloid P(b), is obviously included in the convex hull, and has the supporting hyperplane defined in (4.9). The necessity of (4.5) follows from the corresponding inequalities valid for any function f ∈ W(0, M), i, j = 1, 2, ..., N.
As noted earlier, from (4.7) follows (4.1).

2. In order to define a problem of optimization of algorithms for finding the solution of (4.4), θ, f(θ) (for more general definitions see [6]), we should fix the form of the algorithm:

a_{n+1} = A_{n+1} (I(n), m, M),   n = n_0, ..., N - 1,   (4.10)
where

I(n) = I(n, f, a^n) = { f(a_i) = ξ_i, g(a_i) = η_i, 1 ≤ i ≤ n },   a^n = (a_1, ..., a_n),

is the information on the functions f(·), g(·) gained in the first n steps. The initial information I(n_0) is supposed to be given so that the set of functions f(z) ∈ W(m, M) with the given I(n_0) = [ξ, η](n_0) is compact with respect to (uniform) convergence of f(z) on some convex subset which contains the set K(n_0) defined in (4.14). Remark that if f_n(z), f(z) ∈ W(m, M), M < ∞, then uniform convergence of f_n to f implies convergence of the gradients as well. The information patterns considered below are

I(n) = (a_i, f(a_i), 1 ≤ i ≤ n),   (4.11)

I(n) = (a_i, f(a_i), grad f(a_i), 1 ≤ i ≤ n),   (4.12)

I(n) = (a_i, grad f(a_i), 1 ≤ i ≤ n).   (4.13)
The measure of approximation yielded by the N measurements, I(N), may be given, in analogy with (1.4)-(1.6), as one of the following:

J^l(N) = J^l (I(N, f, a^N)) = μ(K(N)),   K(N) = { θ | g(θ) = 0, I(N) given },   (4.14)

where μ(·) is a functional depending on sets in R^p, e.g. volume, diameter, width, see [11]; the set K(N) may be called the set of localization for θ. If m = 0 we cannot guarantee μ(K(N)) → 0 for N → ∞;

J^v(N) = min_{1≤i≤N} f(a_i) - inf { f(z) | I(N, f, a^N) given };   (4.15)

this functional, in contrast to (4.14), (4.16), makes sense only for the algorithms of minimization;

J^d(N) = inf_z sup { ‖g(z)‖^2 | I(N) given }.   (4.16)

For this functional, in the case M = ∞, we cannot guarantee J^d(N) ≤ J^d(n_0); and if m = 0, then it gives no information on the set K(N). The optimality conditions corresponding to the problem of minimization of J^r(N), r = l, v, d, by choosing A_n(·), n = n_0 + 1, ..., N, are, as in § 3, easily described:

inf_a sup_f J_{N-n-1}^r (I(n), a, f(a), g(a)) = J_{N-n}^r (I(n)).   (4.17)

The corresponding "optimal control" problems are more difficult. We show in the following that they can be solved using Theorem 4. Our method of solving (4.14)-(4.17) will differ from that used in § 3 in that we do not try to generalize the maximum principle to these multidimensional control problems.

3. Suppose first that we have the information pattern (4.12). Let the f(a_i) be fixed. We show how Theorem 4, i.e. the relations (4.5), enables us to evaluate the functionals J^r (I(N)), r = l, v, d.
Let us first investigate the case of the functional J^l. From the relations of Theorem 4, written as the linear conditions (4.18), (4.19) on the unknown values f(z), g(z), we see that K(N) is a convex set, in general. The case m = 0, M = ∞ is of special interest. In order to generalize Theorem 2, let us suppose that the equation (4.4) has a unique solution, θ, for the given function f(z) ∈ W(0, ∞). The set K(N) is here a convex polyhedron, the projection onto the z-space of the intersection of the supporting hyperplanes with a hyperplane, Γ_0, through the point of the current minimal value; note that K(N) lies in the hyperplane Γ_0. Taking a measurement of f(x), g(x) = φ, at x, the worst case is obviously that f(x) lies on Γ_0, and g(x) may be arbitrary; then K(n + 1) will be one of the sets K_±(x, g(x)), or x = θ. The set of those points x of K for which the maximum of μ_2(x, φ) with respect to φ is smaller than a constant is convex; these sets degenerate into a point, a(K(n)), which gives the unique solution of the optimal onestep search problem. The point a(K(n)) is characterized, in virtue of its being the degeneration of convex sets, by the property that there are at least p + 1 hyperplanes Γ_φ which give the maximal value to μ_1(x, φ).
The computation of a(K) is in general difficult; a relatively good approximation to a(K) is given by c(K), the centre of gravity of K. c(K) can be computed by dividing K into simplexes K_i, i = 1, 2, ..., s, and computing the c(K_i) (each of which is the vectorial (arithmetical) mean of the vertices of K_i); then

c(K) = [ Σ_i μ(K_i) c(K_i) ] / Σ_i μ(K_i)   (4.22)

(μ(K_i) is computed by evaluating a determinant). In order to determine the K_i we should find the vertices, i.e. the extremal points, of K(n); in the theory of linear programming there are known methods (due to Uzawa, Klee, ...) for the solution of this problem, see [12, 13].

The example of a convex 9-gon, L*, which arises from a centrally symmetric 6-gon, L, by building up triangles of equal area on three nonconsecutive sides of L, shows that the centre of gravity c(L*) is, in general, not equal to a(L*), which here is the centre of L, by (4.21). It is easy to see that there exist 9-gons of the above type for which the optimal twostep algorithm in problem J^l, μ(K(N)), is not given by the iteration of optimal onestep algorithms.

Theorem 5. By choosing a_{n+1} = c(K(n)), the centre of gravity of K(n), we have the following estimations
2^{-2(N-p-1)/(p+1)} μ(K(n_0)) ≤ μ(K(N)) ≤ β^{(N-p-1)/(p+1)} μ(K(n_0)),   β = 1 - (p/(p+1))^p,   (4.23)

which are exact from below, for all N ≥ p + 2.
To obtain (4.23) we use the fact that the centre of gravity, c(K), has distances λ_1(φ), λ_2(φ) from two parallel supporting hyperplanes of K in the directions φ, (-φ), for which p min(λ_1, λ_2) ≥ max(λ_1, λ_2), for all φ, see [11], § 7.34. Obviously μ_1(x, φ) increases, because of the convexity of K, when we substitute for K a cone K*, which is generated by K_+(x, φ) ∩ K_-(x, φ) and an extremal point of K (as the vertex of K*) on one of the supporting hyperplanes, and is based on the other supporting hyperplane.
The body K(n) can remain, for all n, a simplex; then a(K) = c(K), and therefore the estimate on the left side of (4.23) is exact. The right side is exact only for the first step. The optimization problem for the solution of the equation (4.4) for the class of monotone operators g ∈ W(0, ∞) (assuming the uniqueness of θ), in the case of the information pattern (4.13) and error functional μ(K(N)), can be dealt with by the same method; the relations (4.21)-(4.23) are not changed. For the case of the information pattern (4.11) and functional J^l, m = 0, M = ∞, the set K(N) is determined, by convexity, by the intersection of Γ_0 with the supporting hyperplanes of those faces of the convex hull of the points (f(a_i), a_i) which do not contain the vertex (f(z_N), z_N). Here K(N) contains z_N, in general, as an inner point. Theorem 3 could be generalized to this case. We do not recommend the use of the above algorithms if M is relatively small (or if m is large). It should be noted that the convergence of the algorithm (4.22) is better than the rate given by (4.23) if 0 < m ≤ M < ∞.
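The centre-of-gravity computation (4.22) can be sketched as follows (our illustration, for p ≥ 2; SciPy's Delaunay triangulation is used as one possible simplex decomposition, and the factor 1/p! in the simplex volumes cancels in the weighted mean).

```python
# Sketch of (4.22): centre of gravity of a convex polytope K, given by
# its vertices, via decomposition into simplexes K_i.

import numpy as np
from scipy.spatial import Delaunay

def centre_of_gravity(vertices):
    """vertices: (n, p) array of the extreme points of K, p >= 2."""
    tri = Delaunay(vertices)
    total, weighted = 0.0, np.zeros(vertices.shape[1])
    for simplex in tri.simplices:
        pts = vertices[simplex]                      # p + 1 vertices of K_i
        vol = abs(np.linalg.det(pts[1:] - pts[0]))   # mu(K_i) up to 1/p!
        weighted += vol * pts.mean(axis=0)           # mu(K_i) * c(K_i)
        total += vol
    return weighted / total
```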
4. Let us turn to the case of the functional J^v. We have

J^v(N) = inf_z J^v(N, z),   J^v(N, z) = min_{1≤i≤N} f(a_i) - inf_f { f(z) | I(N, f, a^N) given }.

In the case of the information pattern (4.12) the computation of J^v(N, z) is equivalent to the solution of a simple linear programming problem for y = (f(z), g(z)), with objective functional min y_1 and the linear conditions (4.18), (4.19). The function J^v(n, z) is convex in z. To see this, define the function

conv (f_1(z), f_2(z)) = f̄(z) = sup { h(z, z_0) | z_0 ∈ R^p },   h(z, z_0) = min_{i=1,2} [ f_i(z_0) + (g_i(z_0), z - z_0) + 2^{-1} m ‖z - z_0‖^2 ]

(in the case m = 0, the epigraph of the function f̄(z) is the convex hull of the epigraphs of the functions f_i(z), i = 1, 2). In the same way as in Theorem 4 it can be proved that f̄(z) ∈ W(m, M) and I(f̄, a^N) = I(f_i, a^N), i = 1, 2. The choice (algorithm) (4.24) seems to be a good one. The determination of a^v(n), in the case m = 0, M = ∞, is equivalent to the solution of a linear programming problem, where the variable (unknown) vector is y = (f(z), z), the objective functional is min y_1, and the side conditions are given by (4.18), (4.19) (obviously we can set g(z) = 0 in (4.18)). In the case of the information pattern (4.11), the computation (for a given value of z) of the value J^v(n, z) is equivalent to the solution of a quadratic programming problem, where y = (f(x), g(x), g(a_i), 1 ≤ i ≤ n) is the variable vector, min y_1 the objective functional, and (4.5), (4.18), (4.19) are side conditions. J^v(n, z) is, in general, not convex in z. If m = 0, M = ∞, the relations (4.5) determine the g(a_i), i = 1, 2, ..., n, which belong to a convex polyhedron, L_i. Then the computation of a^v(n), (4.24), is equivalent to the solution of a finite number of quadratic programming problems, given by (4.18), (4.19) as side conditions for the unknowns y = (f(x), g(x)), with objective functional min y_1, where in (4.19) g(a_i) is one of the extreme points of L_i, i = 1, 2, ..., n. The choice (4.24) might again be proposed.

5. Now let us turn to the case of the error functional J^d(N) (4.16). In the case of the information pattern (4.12) we can determine the value of J^d(N) by solving a quadratic programming problem:

ε(z, N) = sup { ‖η‖^2 | d_i(z) ≤ (A_i(z), η) + b_i(z) ≤ c_i(z), i = 1, 2, ..., N },   (4.25)

J^d(N) = min_z ε(z, N) = ε(z(N), N),   (4.26)
where the constants c_i(z), d_i(z) are determined from (4.18), in which the bounds for f(z) resulting from the N relations (4.19) are used. The solution of the problem (4.25) requires finding the point of a convex polyhedron L(z) which is farthest from the origin η = 0. For the computation of ε(z, n) an appropriate modification of the simplex method might be used to compute consecutive extreme points of L(z). We propose to choose a_{n+1} = z(n). A simpler and easily implementable algorithm consists in choosing a_{n+1} (4.27) as the solution of the "convex" problem of minimization of (y + λ‖η‖^2) in the variables (z, y = f(z), η = g(z)) under the 2N inequality constraints (4.18), (4.19), for some fixed λ > 0.

In the case of the information pattern (4.11) we should add the restrictions (4.5) for the g(a_i) to (4.18) and (4.19), and for the determination of the bounds for f(z), i.e. for the computation of c_i(z), d_i(z), we should solve two quadratic programming problems for the unknowns y = (f(z), g(a_i), i = 1, 2, ..., N), with side conditions (4.19), (4.5) and objective functionals min y_1 and max y_1. The same choice as in (4.27) may be recommended. In the case of the information pattern (4.13), when we are interested in the solution of the equation (4.4), where the monotone operator g(·) satisfies (4.1), the functional J^d(N) can be computed as in (4.25), (4.26), using (4.1) instead of (4.5), (4.18), (4.19), and the choice (4.27) might be recommended. In all three cases we suppose that θ is unique if m = 0, and remark that it is especially reasonable to use the algorithm (4.27) if m > 0 is large enough.

If we have the information pattern (4.13), where g(x) is the gradient of a function f ∈ W(m, M) (such cases may arise in applications), then besides the conditions (4.1) we also have the following conditions: for every sequence of pairs (a_j, g(a_j)), j = 1, 2, ..., n ≤ N,

Σ_k [ (g(a_k), a_{k+1} - a_k) + (m/2) ‖a_{k+1} - a_k‖^2 + [2(M - m)]^{-1} ‖g(a_k) - g(a_{k+1}) - m(a_k - a_{k+1})‖^2 ] ≤ 0,   (4.28)
where the sum is cyclical (a_{n+1} = a_1), and we write a_k instead of a_{i_k}, i_k ∈ {1, 2, ..., N}. These relations are consequences of the relations (4.5). They give the necessary and sufficient condition for the existence of a function f(z) ∈ W(m, M) for which grad f(a_j) = g(a_j), j = 1, 2, ..., N. This fact was mentioned in (4.2) for the case m = 0, M = ∞. In the general case it follows from Theorem 4 (if M < ∞), because for the existence of numbers f_i for which f_i - f_j ≥ l_ij, i, j = 1, 2, ..., N, the conditions Σ l_ij ≤ 0, where the sum is cyclical in the indexes, are necessary and sufficient. It is easy to see that it is enough to require (4.28) for cyclical sums with two or three summands: n = 2 or 3. The proof for the case m > 0, M = ∞ is easily obtained by the same reasoning as in the case m = 0, M = ∞, see [10], § 27.

Remark that if g(z) = grad f(z) and f ∈ W(m, M) is twice continuously differentiable, we have the relations

g(u) - g(v) = H(u - v),   H = ∫_0^1 F(v + t(u - v)) dt,   H ∈ H(m, M) = { F | Sp F ⊂ [m, M] },   (4.29)
and the set of matrices H(m, M) is convex and compact, so that the integral, as a convex combination, belongs to H(m, M). It can be proved that (4.29) is a consequence of (4.28), for all (m, M).
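The condition (4.2) (equivalently, (4.28) with m = 0, M = ∞) can be tested numerically. The sketch below (ours; NumPy vectors assumed) checks cycles of length 2 and 3 only, following the remark above on (4.28); a full test of (4.2) would range over all cycle lengths.

```python
# Sketch: cyclical-monotony test (4.2) over all cycles of two or
# three summands.

import numpy as np
from itertools import permutations

def cyclically_monotone(zs, etas, tol=1e-12):
    """zs, etas: lists of p-vectors with eta_i in g(z_i)."""
    n = len(zs)
    for r in (2, 3):
        for cyc in permutations(range(n), r):
            s = sum((zs[cyc[(k + 1) % r]] - zs[cyc[k]]) @ etas[cyc[k]]
                    for k in range(r))
            if s > tol:
                return False        # a cyclic sum (4.2) is violated
    return True
```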
REFERENCES

1. DANILIN, Yu. M. and PSHENICHNYI, B. N., Numerical methods in extremal problems (Chislennye metody v ekstremal'nykh zadachakh), Nauka, Moscow, 1975.

2. CHERNOUS'KO, F. L., Optimal search for the minimum of convex functions, Zh. vychisl. Mat. mat. Fiz., 10, No. 6, 1355-1366, 1970.

3. VASIL'EV, F. P., Lectures on methods of solving extremal problems (Lektsii po metodam resheniya ekstremal'nykh zadach), Izd. Mosk. Univ., Moscow, 1974.

4. SUKHAREV, A. G., The best strategy for the sequential search for an extremum, Zh. vychisl. Mat. mat. Fiz., 12, No. 1, 35-50, 1972.

5. BAKHVALOV, N. S., Numerical methods, Pt. 1 (Chislennye metody), Nauka, Moscow, 1973.

6. BAKHVALOV, N. S., The properties of optimal methods of solving problems of mathematical physics, Zh. vychisl. Mat. mat. Fiz., 10, No. 3, 555-568, 1970.

7. CHERNOUS'KO, F. L., Optimal algorithm for searching for the root of a function calculated approximately, Zh. vychisl. Mat. mat. Fiz., 8, No. 4, 705-724, 1968.

8. PONTRYAGIN, L. S., et al., Mathematical theory of optimal processes (Matematicheskaya teoriya optimal'nykh protsessov), Nauka, Moscow, 1971.

9. GIRSANOV, I. V., Lectures on the mathematical theory of extremal problems, Izd. Mosk. Univ., Moscow, 1970.

10. ROCKAFELLAR, R., Convex analysis, Mir, Moscow, 1973.

11. BONNESEN, T. and FENCHEL, W., Theorie der konvexen Körper, Springer, Berlin, 1934.

12. DANTZIG, G. B. and VAN SLYKE, R., Generalized linear programming, in: Optimization methods for large scale systems with applications, McGraw-Hill, New York, 1971.

13. ARROW, K. J., et al., Studies in linear and non-linear programming, Stanford Univ. Press, Stanford, 1958.

14. SONNEVEND, G., On the optimization of adaptive algorithms, Publs Dept Comput. and Numer. Math., L. Eötvös Univ., Budapest, No. 3, 1977.