CHAPTER III

INTRODUCTION TO OPTIMIZATION TECHNIQUES
In our study we will need to refer to several basic optimization techniques. This chapter outlines some approaches to optimization; the reader is referred to the works in the bibliography for greater detail.
3.1 INDIRECT METHODS
Suppose we wish to find an extremum (a maximum or minimum) of F(x), where x = (x_1, x_2, ..., x_n). A necessary condition for an extremum x̂ is that

\nabla F(\hat{x}) = 0,   (3.1)

that is, that the gradient of F evaluated at x̂ be 0. The gradient of F is

\nabla F = (\partial F/\partial x_1, \partial F/\partial x_2, \ldots, \partial F/\partial x_n).   (3.2)
Equation (3.1) gives a set of n equations in n unknowns (which may or may not be linear). There are many numerical methods for solving both linear and nonlinear sets of equations [2, 4]; the nonlinear methods are often equivalent to using direct methods immediately. Solution of Eq. (3.1) gives a candidate or candidates for an extremum; a solution may be a maximum, a minimum, or a point of inflection, and must be tested by examination of the second partial derivatives or by investigating nearby points.
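As a concrete illustration (not from the original text; the function F and the use of SymPy are assumptions of the example), the following Python sketch solves Eq. (3.1) symbolically and then classifies each candidate by its second partial derivatives:

```python
# A minimal sketch of the indirect method: solve grad F = 0 (Eq. 3.1),
# then test each candidate with the Hessian of second partials.
# The particular F below is a made-up example, not from the text.
import sympy as sp

x1, x2 = sp.symbols("x1 x2", real=True)
F = (x1 - 1)**2 + x1 * x2 + 2 * x2**2   # hypothetical F(x)

grad = [sp.diff(F, v) for v in (x1, x2)]           # the gradient, Eq. (3.2)
candidates = sp.solve(grad, [x1, x2], dict=True)   # solve Eq. (3.1)

H = sp.hessian(F, (x1, x2))
for pt in candidates:
    eigs = H.subs(pt).eigenvals()
    if all(e > 0 for e in eigs):
        kind = "minimum"
    elif all(e < 0 for e in eigs):
        kind = "maximum"
    else:
        kind = "saddle or inflection"
    print(pt, kind)
```

For this F the single candidate turns out to be a minimum; in general each solution of Eq. (3.1) must be tested in this way, as noted above.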
3.2 DIRECT METHODS
Direct methods are based upon a search for an extremum. The search may be completely random: we may simply evaluate F(x) at a large number of randomly generated points, assuming the extremum is close to that point which gives the smallest value of F(x) (for minimization). Gradient techniques use more information about the local character of the function. One starts at some initial guess x^(1) of the optimum and attempts to improve this guess successively, making the next estimate x^(i+1) such that F(x^(i+1)) is less than F(x^(i)) (for minimization). In gradient techniques, the choice is made so that it tends to decrease (or increase) F as quickly as possible. Suppose x^(i+1) = x^(i) + Δx for Δx of small magnitude, where Δx = (Δx_1, Δx_2, ..., Δx_n). Then
F(x^{(i+1)}) = F(x^{(i)} + \Delta x) \approx F(x^{(i)}) + \sum_{j=1}^{n} \left. \frac{\partial F}{\partial x_j} \right|_{x^{(i)}} \Delta x_j,

using Taylor's expansion. It follows that the change in F(x) is

\Delta F(x^{(i)}) \equiv F(x^{(i+1)}) - F(x^{(i)}) \approx \sum_{j=1}^{n} \left. \frac{\partial F}{\partial x_j} \right|_{x^{(i)}} \Delta x_j.   (3.3)
Noting Eq. (3.2), we can write this more simply as†
AF (x")) % VF * Ax.
(3.4)
We are looking for the direction of Δx which will maximize (or minimize) ΔF for Δx of constant length. Since the Schwarz inequality

-\|x\| \, \|y\| \le x \cdot y \le \|x\| \, \|y\|   (3.5)

implies by Eq. (3.4) that

-\|\nabla F\| \, \|\Delta x\| \le \nabla F \cdot \Delta x \le \|\nabla F\| \, \|\Delta x\|,   (3.6)
ΔF will be maximized (minimized) for Δx such that the right-hand (left-hand) inequality holds with equality. This is the case for
\Delta x = \pm K \, \nabla F,   (3.7)

† The expression x · y is the dot product (or inner product) of x and y: x \cdot y = \sum_{i=1}^{n} x_i y_i. The norm of x is \|x\| = (\sum_{i=1}^{n} x_i^2)^{1/2} in this chapter.
where, for K positive, the positive sign is for maximization and the negative sign for minimization. This yields the general form of the gradient technique: given an initial guess x^(1),
x^{(i+1)} = x^{(i)} \pm c_i \, \nabla F(x^{(i)}),   i = 1, 2, 3, \ldots,   (3.8)
where c_i is a positive constant that may be different for each iteration, and the sign is fixed throughout the procedure. Several possible choices of c_i are

c_i = \varepsilon \quad \text{(a constant)},   (3.9)

c_i = \varepsilon / \|\nabla F(x^{(i)})\|,   (3.10)

c_i = 1/i,   (3.11)

and

c_i = \varepsilon \, / \, \left\| \partial^2 F(x^{(i)}) / \partial x^2 \right\|   (3.12)

if we define \partial^2 F/\partial x^2 = (\partial^2 F/\partial x_1^2, \ldots, \partial^2 F/\partial x_n^2).
Combinations of these and other methods can often be used to advantage. More sophisticated approaches vary the step size as a function of the rate of convergence [3]. A common intuitive interpretation of the gradient technique is that it corresponds (for maximization) to climbing a hill by taking each step in the direction of steepest rise. Generally speaking, there is a trade-off between speed of convergence and stability of convergence: the smaller the magnitude of c_i, the slower the convergence; but if c_i is too large, the algorithm may never converge. The choice of Eq. (3.11) yields particularly slow convergence, but it will yield convergence in most cases. This is the case because of the divergence of the harmonic series
\sum_{i=1}^{\infty} 1/i = \infty.   (3.13)

If

\sum_{i=1}^{\infty} c_i = c < \infty,   (3.14)

the algorithm may converge to a point which is not an extremum. For example, suppose \nabla F to be a nonzero constant vector K over a large region; then
x^{(M)} \to x^{(1)} - cK

as M → ∞ if x^(i) never leaves the region of constant gradient. Since the minimum cannot occur in this region, the c_i should not be chosen to form a convergent series. Although Eq. (3.11) converges too slowly to be the best choice when ∇F can be calculated accurately, it converges despite random noise in the estimation of ∇F (if the mean of such noise is zero), and is often chosen in such cases. This is a loosely stated result from the area of stochastic approximation [1, 6]. Equation (3.10) yields an algorithm in which each successive iteration is changed by an increment of constant magnitude ε in the direction of the gradient. Equation (3.12) decreases the step size in proportion to the curvature of the function at x^(i). As we may readily visualize by the hill-climbing analogy, we may obtain a local maximum rather than a global maximum by the gradient technique. We would thus prefer to minimize functions with no stationary points (points where ∇F = 0) other than the global optimum. We would further prefer functions whose derivatives are continuous; otherwise the function may exhibit "edges" which frustrate rapid convergence.
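The following Python sketch (an illustration, not the text's own program; the quadratic test function, ε = 0.1, and the iteration count are invented for the example) implements Eq. (3.8) for minimization with the step-size choices of Eqs. (3.9)-(3.11):

```python
# A minimal sketch of the gradient technique, Eq. (3.8), for minimization.
import numpy as np

def grad_descent(grad_F, x0, step_rule, iters=100):
    """x^(i+1) = x^(i) - c_i * grad F(x^(i)), Eq. (3.8) with the minus sign."""
    x = np.asarray(x0, dtype=float)
    for i in range(1, iters + 1):
        g = grad_F(x)
        x = x - step_rule(i, g) * g
    return x

eps = 0.1
rules = {
    "(3.9)  c_i = eps":          lambda i, g: eps,
    "(3.10) c_i = eps/||grad||": lambda i, g: eps / max(np.linalg.norm(g), 1e-12),
    "(3.11) c_i = 1/i":          lambda i, g: 1.0 / i,
}

# Hypothetical test function: F(x) = 0.25*x1^2 + 0.4*x2^2, minimum at the origin.
grad_F = lambda x: np.array([0.5 * x[0], 0.8 * x[1]])
for name, rule in rules.items():
    print(name, "->", grad_descent(grad_F, [2.0, 1.0], rule))
```

For this quadratic, choice (3.9) converges geometrically; choice (3.10) takes steps of fixed length ε and so can only come within roughly ε of the minimum; and choice (3.11) is still making progress after a hundred iterations but, as discussed above, only slowly.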
3.3 LINEAR PROGRAMMING
Optimization with constraints yields another wide class of methods. We will be particularly interested in inequality constraints. Linear programming provides a method of minimizing a linear cost function with variables constrained by linear inequalities [4, 5]. The usual form of the problem solved by linear programming algorithms is the following: find a solution z_1, z_2, ..., z_n ≥ 0 which minimizes

\sum_{i=1}^{n} c_i z_i

subject to the constraints that

\sum_{i=1}^{n} a_{ij} z_i \ge b_j \quad \text{for } j = 1, 2, \ldots, m.

If a problem is posed in this form, with m and n sufficiently small, a linear programming algorithm will either give a solution or indicate that none exists. The dual problem in linear programming is a formulation which transposes the matrix of elements a_{ij}, so that the roles of m and n are reversed; its solution can be used to obtain the solution of the original problem. This transposition is often of value because it is the number of constraints m which largely determines the computational cost of the procedure.
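A small sketch of this standard form follows (the numbers are invented, and SciPy's linprog is used as one readily available solver; it expects "≤" constraints, so the "≥" constraints are negated):

```python
# Hedged sketch: minimize sum(c_i * z_i) subject to sum_i(a_ij * z_i) >= b_j,
# z >= 0, using scipy.optimize.linprog. All data below are made up.
import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 3.0])          # cost coefficients c_i
A = np.array([[1.0, 1.0],         # one row per constraint j: coefficients a_ij
              [1.0, 3.0]])
b = np.array([4.0, 6.0])          # right-hand sides b_j

# linprog enforces A_ub @ z <= b_ub, so negate to express ">=" constraints.
res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)
print(res.x, res.fun)             # a solution, or res.status != 0 if none exists
```

Here the solver returns z = (3, 1) with cost 9; an infeasible problem would instead be reported through the status flag, matching the "solution or indication that none exists" behavior described above.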
3.4 A LOCALIZED RANDOM SEARCH TECHNIQUE
The following algorithm will find a local minimum of a continuous function without requiring an analytic expression for the gradient, and is presented as an example of this type of algorithm. Let x^(i) be the ith guess of the minimum of F(x). We wish to generate a sequence x^(0), x^(1), ..., x^(i), ..., such that F(x^(0)) > F(x^(1)) > ... > F(x^(i)) > ... . We do so by first choosing an initial guess x^(0) by intuition. The procedure is then as follows:
(1) Calculate Δx = (Δx_1, Δx_2, ..., Δx_n) by taking n samples from the probability density of Fig. 3.1a, by means to be described.
FIG. 3.1 Generating random numbers: (a) probability density function desired; (b) one-sided density function.
(2) Set x_TEST = x* + Δx, where x* = x^(i), the last calculated member of the sequence {x^(i)} (and x* = x^(0) initially).
(3) If (for minimization) F(x_TEST) < F(x*), then x* is set to x_TEST (i.e., x^(i+1) = x_TEST) and the procedure continues from step (1).
(4) If F(x_TEST) ≥ F(x*), then x′_TEST = x* - Δx is similarly tested. If F(x′_TEST) < F(x*), x* is set to x′_TEST (i.e., x^(i+1) = x′_TEST), and the procedure continues from step (1). (The theory is that, if a step in one direction is "uphill," a step in the opposite direction is likely to be "downhill.")
( 5 ) If F(x&)
2 F(x*), x* remains unchanged and step (1) is initiated.
The width parameter τ in the distribution should in general shrink as the solution is approached; it may in fact be different for each component of x. We will see that τ appears only multiplicatively in the equations generating the samples of the distributions, and so may easily be imposed after the random number is generated. Constraints are easily handled by testing each x before using it. Reasonable stopping rules can be devised based on the number of successes in the last N tries and on the step size of successful tries. The overall process is diagrammed in Fig. 3.2.
FIG. 3.2 The localized random search procedure; the quantities shown are vectors.
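A minimal Python sketch of steps (1)-(5) is given below. Since the density of Fig. 3.1 (with its parameters k_1, k_2) is not reproduced here, a zero-mean normal density with width τ stands in for it; that substitution, the fixed iteration count, and the test function are all assumptions of the sketch, not the text's specification:

```python
# Sketch of the localized random search, steps (1)-(5).
import numpy as np

def local_random_search(F, x0, tau=0.5, iters=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    x_star = np.asarray(x0, dtype=float)
    for _ in range(iters):
        dx = tau * rng.standard_normal(x_star.shape)   # step (1); normal stands in
        x_test = x_star + dx                           # step (2)
        if F(x_test) < F(x_star):                      # step (3): accept forward step
            x_star = x_test
        elif F(x_star - dx) < F(x_star):               # step (4): try the reverse step
            x_star = x_star - dx
        # step (5): otherwise keep x_star and draw a new dx
    return x_star

# Example: F(x) = (x1 - 1)^2 + (x2 + 2)^2, minimum at (1, -2).
F = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
print(local_random_search(F, [0.0, 0.0]))
```

Shrinking τ as successes accumulate, as suggested above, would sharpen the final accuracy; the sketch keeps τ fixed for brevity.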
GENERATING RANDOM SAMPLES OF THE GIVEN DISTRIBUTION

Let z be a random number uniformly distributed in [0, 1]. Then x′ has the distribution in Fig. 3.1b if

(3.15)

where

z_1 = 1 - k_2(k_1 - 1)/(1 + k_1 k_2),
K_5 = (k_2 - 1)(1 + k_1 k_2)/k_1,  and
K_6 = (1 - k_2)(1 + k_1 k_2).
If z′ is a random number which takes the values 1 and -1 with equal probability, then x = z′x′ has the distribution in Fig. 3.1a, as desired.
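The scheme just described, inverse-transform sampling of a one-sided density followed by a random sign, can be sketched as follows. Since Eq. (3.15) is specific to the density of Fig. 3.1b, a one-sided exponential density is assumed in its place here:

```python
# Sketch of the sampling scheme: invert a one-sided CDF, then symmetrize
# with a random sign z'. The exponential density is an assumption standing
# in for Fig. 3.1b; note that tau enters only multiplicatively, as the
# text observes.
import numpy as np

rng = np.random.default_rng(0)

def sample(tau=1.0):
    z = rng.uniform(0.0, 1.0)           # z uniform on [0, 1]
    x_prime = -tau * np.log(1.0 - z)    # inverse transform of the one-sided CDF
    z_sign = 1 if rng.uniform() < 0.5 else -1   # z' = +/-1 with equal probability
    return z_sign * x_prime             # x = z' x', symmetric about 0

print([round(sample(), 3) for _ in range(5)])
```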
EXERCISES

3.1. One value of x which minimizes F(x) = (2 - x^2)^2 is clearly x = √2. Starting with x^(1) = 1.5 and using the gradient technique

x^{(i+1)} = x^{(i)} - c_i \, \nabla F(x^{(i)}),   i = 1, 2, \ldots,

give five iterations each for (a) c_i = 0.1/i, (b) c_i = 0.1, (c) c_i = 1. Explain any unusual results.

3.2. Use the gradient technique to find the minimum of F(x) = |2 - x^2|. Do five iterations with c_i = 0.5 and x^(1) = 1.5.
3.3. Using the gradient technique with c_i as in Eq. (3.10) with ε = 1, do five iterations toward finding the minimum of
3.4. Derive Eq. (3.15).
SELECTED BIBLIOGRAPHY

1. Dvoretzky, A., On Stochastic Approximation, in Proc. 3rd Berkeley Symp. Math. Stat. Prob. (J. Neyman, ed.), pp. 95-104, Univ. of California Press, Berkeley, 1965.
2. Isaacson, E., and Keller, H. B., "Analysis of Numerical Methods," Wiley, New York, 1966.
3. McMurtry, G. J., Adaptive Optimization Procedures, in "Adaptive, Learning, and Pattern Recognition Systems" (J. M. Mendel and K. S. Fu, eds.), Academic Press, New York, 1970.
4. Ralston, A., and Wilf, H. S. (eds.), "Mathematical Methods for Digital Computers," Wiley, New York, 1960.
5. Simonnard, M., "Linear Programming," Prentice-Hall, Englewood Cliffs, New Jersey, 1966.
6. Wilde, D. J., "Optimum Seeking Methods," Prentice-Hall, Englewood Cliffs, New Jersey, 1964.
7. Wilde, D. J., and Beightler, C. S., "Foundations of Optimization," Prentice-Hall, Englewood Cliffs, New Jersey, 1967.