1.18 Steepest Ascent, Steepest Descent, and Gradient Methods
R. G. Brereton, University of Bristol, Bristol, UK
© 2009 Elsevier B.V. All rights reserved.
Contents
1.18.1 Introduction
1.18.1.1 Formulation of Problem
1.18.1.2 Literature
1.18.1.3 Applications in Chemometrics
1.18.1.4 Notation
1.18.2 Method
1.18.2.1 One Variable
1.18.2.2 More than One Variable
1.18.2.3 Influence of the Parameters
1.18.2.4 Limitations
1.18.3 Examples
1.18.3.1 Algebraic
1.18.3.2 Numerical
1.18.3.3 Literature
1.18.4 Conclusion
References

Symbols
a      step size parameter
b1     estimated coefficient for factorial design
i      factor number
t      iteration number
xi     value of factor i (any iteration)
xit    value of factor i for iteration t
xt     vector of factor values for iteration t
y      response vector (t iterations)
y      value of response (any iteration)
yt     response for iteration t
δ      distance between two levels in a one-factor design
1.18.1 Introduction

1.18.1.1 Formulation of Problem
The steepest ascent (or steepest descent, or gradient) method is used to determine optimum conditions. One of the best-cited early descriptions in the literature was by Box and Wilson in 1951,1 who were working at ICI in the UK, Box by training a statistician and Wilson a chemist. A major driving force for fundamental work in the area of optimization was post-World War II industry, with a requirement for very efficient processes as part of economic reconstruction. Chemists, in particular, encounter multifactorial problems where it is necessary to optimize, for example, a yield or the quality of a product as a function of many factors. A traditional approach to searching for optima is to perform a grid search, that is, to run experiments or processes over a series of conditions; for example, if there are three factors, a 10 × 10 × 10 grid would involve performing 1000 experiments under different conditions. The problems with grid search approaches are numerous.
The first is that the number of experiments can be prohibitive if there are several factors. The second is that there may be significant experimental error, meaning that if experiments are repeated under identical conditions, different answers might be obtained; hence, simply choosing the best point on the grid may be misleading, especially if the optimum is quite flat. The third is that the initial grid may have to be sparse in order to keep the number of experiments feasible, and so could miss features that are close to the optimum or find false (local) optima. The steepest ascent method is based on the premise that the response changes fastest in the direction of the optimum, so it is mainly necessary to determine the gradient and use this to determine the direction of the next point. This method requires only the gradient to be calculated and a step size that relates to how far the new point is from the old one, although the original variables may have to be coded in the experimental space. Figure 1 illustrates the principle for a simple optimization of a parabolic function. At the bottom left, the gradient is positive and large, so the gradient method tells the optimizer to move by a large amount in the positive direction. In contrast, at the top right, the gradient is small and negative, telling the optimizer to move in the negative direction by a small amount. Of course, in real situations we do not know the shape or reproducibility of the curve, and as such steepest ascent methods provide pointers as to what conditions are necessary for the next optimization experiment. The methods described in this chapter are sequential methods in that single experiments are performed in sequence.

Figure 1 Illustration of the principle of steepest ascent.
1.18.1.2 Literature
There is a large literature on this method. The most recognized and best-cited early reference in the statistics and chemometrics literature is by Box and Wilson.1 As often happens, however, there are earlier and less well-cited papers on the same theme, often in diverse literatures; for example, Booth in the late 1940s published numerous papers on this method, which he called the method of 'steepest descents', primarily in crystallography.2,3 Mathematicians have often attributed the method of steepest descent to the physicist Debye,4 who in 1909 worked it out in an asymptotic study of Bessel functions. Debye himself commented that he had borrowed the idea of the method from a paper of Riemann published in 1863, and the first paper on the method is generally
accepted to be that of Cauchy in the 1820s,5 in the context of solving integrals. A good historical perspective was published in 1997.6 In the 1940s and 1950s it was less common for there to be cross-disciplinary communication, especially because journals were less openly accessible and were probably read mostly by specialist peer groups. For example, at the time of writing, Booth's paper in Nature2 has been cited only 30 times, primarily in the crystallographic literature, whereas Box and Wilson's paper has been cited over 1100 times. Most of the subsequent papers were published in the statistical literature and can be quite daunting for chemists. Brooks and Mickey published the first paper with steepest ascent in the title.7 The method is widely reported in most classical statistical textbooks on experimental design and optimization.8–10 In the mainstream chemical literature, this approach was slow to become recognized, in contrast to the simplex method in the 1970s, when analytical chemists in particular started to recognize the importance of optimization. This is largely due to advocacy, in that there were powerful advocates of the simplex methods, especially Deming and coworkers, and these methods were incorporated into software. An ISI survey of papers published in 2006 with keywords steepest ascent/descent/descents revealed 122 papers published in that year, of which only 8 had any chemical content, primarily at the interface between chemistry and other disciplines. Of these, 5 use the classically accepted and defined steepest ascent algorithm.11–15 The few papers using the steepest ascent or descent method in experimental applications relating to analytical chemistry were mainly published in the mid-1990s, of which three are cited in this chapter.16–18 In contrast, 26 out of 117 papers on simplex optimization were published in journals with chemical content over that period, including mainstream analytical chemistry, chemometrics, chromatography, and mass spectrometry journals. Part of this discrepancy is likely due to readily available software for laboratory-based optimization. In fact, steepest ascent methods, being very simple, are widely available and commonly incorporated into algorithms, but chemists who are laboratory based and not primarily programmers are less aware of the potential of this approach for optimization.
1.18.1.3 Applications in Chemometrics
There are two distinctly different reasons why a chemometrician may wish to use the methods described in this chapter. The first is computational optimization. This does not involve performing experiments and may involve trying to find the solution to a numerical problem, for example in curve resolution, or optimizing a model. It can also be useful if the relationship between a response and the values of several factors is already known, for example from an experimental design such as a central composite design, and the optimal conditions are to be obtained from this relationship. This type of application is the most common, primarily involves incorporation into algorithms, and could be used to obtain the minimum or maximum of a response surface. The second, and less common, application is as an aid to experimental optimization. In this case, steepest ascent (or descent) methods can be used to determine the conditions for a sequence of experiments. Normally, a small experiment is performed (such as a two-level factorial design) to obtain some feel for the relationship between the response and the experimental conditions; after this, steepest ascent suggests new conditions for the next set of experiments, and so on.
1.18.1.4 Notation
In this chapter, we will employ the following notation. We will exemplify the methods using a single response vector y, which is a function of one or more (= I) factors to be optimized, denoted by x1, …, xi, …, xI, or in vector form x. Each step in the optimization will be denoted by t, so that yt is the tth value of y, xt of the factors, and xit of factor i. Note that a scalar value of y without a subscript does not refer to a specific step, and similarly a scalar value of x with a single subscript. When a specific step is referenced, then y is subscripted by the iteration number, the scalar x has two subscripts, and the vector x one subscript. The symbol ∂ denotes a partial derivative, so that ∂y/∂xi is the partial derivative of the response with respect to the ith factor. The notation ∇y denotes the vector of partial derivatives of y with respect to x, so that ∇y = [∂y/∂x1 … ∂y/∂xi … ∂y/∂xI]. Other variables will be defined where necessary.
1.18.2 Method

1.18.2.1 One Variable
The method for one variable is straightforward. The first step is to choose (a) a starting condition for the factor to be optimized, x11 (factor 1 and iteration 1; in this case I = 1), and (b) a step size parameter a, which determines how fast the optimization progresses. The condition for the next step is then obtained very simply: x12 = x11 + a dy/dx1 for finding a maximum, and x12 = x11 − a dy/dx1 for finding a minimum. The gradient of the response is obtained at the new value x12 and the procedure continues. Most of the original applications of this approach were in numerical computing and simulations, usually trying to find the optimum of a function efficiently. Under such circumstances, it is easy to calculate a derivative numerically or algebraically. If the equation of the function is known, this can be performed algebraically. Otherwise, a large variety of approaches are available, one of the simplest and fastest being the Savitzky–Golay method.19 In experimental work, however, this is not straightforward, and one commonly recommended approach is to perform a small factorial design at each point. For example, we could perform a one-factor, two-level design at each point x1t by recording the responses at x1t − δ and x1t + δ. These can then be modeled by the equation y = b0 + b1x1, so the derivative dy/dx1 = b1. Therefore, the next point for determining a maximum is x1,t+1 = x1t + ab1. A quadratic model would be possible if three experimental points were measured, of the form y = b0 + b1x1 + b11x1², giving a new point at x1,t+1 = x1t + a(b1 + 2b11x1t). This would add to the number of experiments required; however, even if the overall response curve is quadratic or higher in nature, a linear local approximation is likely to be adequate in many cases provided the step size (a) is sufficiently small. The benefits of a better local approximation using more experiments and a higher order model, probably resulting in faster progress toward the optimum, should be balanced against the extra experimentation required to achieve this improved model.

Figure 2 Illustration of steepest descent for a single-parameter optimization, involving eight steps, a value of a of 0.1 and of δ of 0.05, and a starting point of x1 = 0.6.
Table 1 Main adjustable parameters

Starting condition(s)                              x1 (one factor); x1 (vector, several factors)
Rate of optimization                               a
Window size for determination of local gradient   δ (usually only necessary for experimental implementation)
Stopping criterion                                 ε (could relate to difference in response or in the values of the factors)
As an example of a single-parameter model, Figure 2 shows the first eight steps of a steepest descent optimization for a single-factor response, where the starting point is x1 = 0.6, the true optimum is x1 = 2.33, the value of a is 0.1, and the gradient at each point is estimated by a linear approximation involving experiments performed at x1t ± 0.05. There are various stopping rules to determine whether an optimum has been found, but in numerical optimization usually a small value ε is set up in advance and convergence is assumed when |x1,t+1 − x1t| < ε or when the value of the optimum does not vary much. For experimental optimization, convergence is probably best judged by eye, when it looks as if the process has converged to the satisfaction of the experimenter, and the convergence can be influenced by noise: if there is a high noise level, the optimum will often oscillate around a central point. Hence oscillation is also a sign that the optimum has been found, provided the rate of oscillation does not change much. Sometimes, these new conditions could lead to a new experimental optimization around the point determined by steepest ascent, or to performing the optimization in a second stage using fresh optimization parameters (usually values of a). The procedure will be illustrated numerically in Section 1.18.3.2. The main parameters are summarized in Table 1.
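To make the one-variable procedure concrete, the following short Python sketch (not code from the original chapter) applies the update x1,t+1 = x1,t + a·b1 with the gradient b1 estimated from two measurements at x1 ± δ; the quadratic response function and all parameter values are invented purely for illustration.

```python
def steepest_ascent_1d(response, x1, a, delta=0.05, eps=1e-3, max_iter=100):
    """One-factor steepest ascent: estimate the local gradient b1 from a
    two-level design (responses at x1 - delta and x1 + delta), then move
    by a*b1; stop when the step size falls below eps."""
    for t in range(1, max_iter + 1):
        y_minus = response(x1 - delta)
        y_plus = response(x1 + delta)
        b1 = (y_plus - y_minus) / (2.0 * delta)   # slope of the local linear model
        step = a * b1                             # use -a*b1 for steepest descent
        x1 += step
        if abs(step) < eps:
            return x1, t
    return x1, max_iter

# Hypothetical single-factor response with a maximum near x1 = 2.33
response = lambda x1: 81.0 - 8.0 * (x1 - 2.33) ** 2
x_opt, n_steps = steepest_ascent_1d(response, x1=0.6, a=0.01)
print(f"converged to x1 = {x_opt:.3f} after {n_steps} steps")
```

With a larger value of a the same sketch oscillates or diverges, as discussed in Section 1.18.2.3.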
1.18.2.2 More than One Variable
The equations when I > 1 are a relatively straightforward extension of the method described above. The first step is to choose (a) starting conditions for the factors to be optimized, x1, and (b) a step size parameter a, which determines how fast the optimization progresses. In experimental situations, the factors should be coded to ensure that they are on a similar scale and to produce a sensible experimental design at each point in the optimization. This is because a single step size a (and design spacing δ) is applied to all factors; for example, if the original parameters are recorded on very different scales, such as pH and temperature, a change of 1 °C in temperature may have much less influence than a change in pH of 1 unit. For numerical computing, this may not matter so much, especially in the case of optimization of functions. The equations then become x2 = x1 + a ∇y = x1 + a[∂y/∂x1 … ∂y/∂xi … ∂y/∂xI] (the derivatives being evaluated at x1), where xt is the vector of conditions for step t, for finding a maximum, with an equivalent equation (with a minus sign) for a minimum. Stopping rules similar to those of the one-factor example can be used, with the Euclidean distance between the new and old vectors of conditions, and also between the new and old responses, being the common convergence criteria. As usual, if this is to be performed experimentally, several experiments are required at each new point in the optimization in order to determine the gradient. A recommended simple strategy involves a two-level, two-factor design consisting of four experiments, the distance between the levels being defined by δ. This allows the response to be fitted to the equation y = b0 + b1x1 + b2x2 + b12x1x2. Sometimes, the interaction term can be neglected. Note that even if the overall response surface has obvious curvature, a local linear model is usually adequate, as the aim of the model is to determine the direction and rate of the search for the next point and not to serve as an overall model for the response surface. Note that there are several alternative approaches in the literature, especially where there are several factors, one being to perform a factorial (or fractional factorial) design to obtain the coefficients of a model that are used to determine the gradient for several steps (e.g., five or six) and then, at the new optimum, to perform a similar design again to obtain a revised model, perform further optimization, and so on.
Figure 3 Illustration of steepest descent for a two-factor optimization, involving 14 steps, a value of a of 0.05 and of δ of 0.05, and a starting point of x1 = [0.7 0.8].
This modification may require more steps to reach an optimum than the alternative of performing a factorial design at each step, as the experimental model for determining the gradient is not re-estimated at each step, but this is balanced against the number of experiments required to obtain an accurate measure of the gradient at each step; for more than two factors it is often preferable to reduce the number of factorial designs at the cost of slightly less efficient convergence to the optimum, owing to inaccuracies in the estimation of the gradient at every step. The progress of a typical optimization for two factors is given in Figure 3. In this case, a two-level, two-factor design is performed at each experimental point.
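The same idea for two factors can be sketched as follows (again, illustrative code rather than anything from the chapter): at each point a 2² factorial design of four experiments spaced by δ is fitted by least squares to y = b0 + b1c1 + b2c2 + b12c1c2 in the coded variables, and the coefficients b1 and b2, rescaled by δ, provide the local gradient estimate. The response surface at the end is invented for the example.

```python
import numpy as np

def steepest_ascent_two_factors(response, x, a=0.05, delta=0.05, eps=1e-4, max_iter=100):
    """Two-factor steepest ascent. At each point a two-level, two-factor design
    (four experiments at x +/- delta in each factor, coded as +/-1) is fitted by
    least squares to y = b0 + b1*c1 + b2*c2 + b12*c1*c2 in the coded variables;
    at the centre of the design the local gradient estimate is [b1, b2] / delta."""
    x = np.asarray(x, dtype=float)
    coded = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
    for t in range(1, max_iter + 1):
        y = np.array([response(x + delta * c) for c in coded])
        X = np.column_stack([np.ones(4), coded[:, 0], coded[:, 1], coded[:, 0] * coded[:, 1]])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        grad = b[1:3] / delta              # estimated [dy/dx1, dy/dx2] at the centre
        x_new = x + a * grad               # x - a*grad for steepest descent
        if np.linalg.norm(x_new - x) < eps:
            return x_new, t
        x = x_new
    return x, max_iter

# Hypothetical two-factor response surface with a single maximum
surface = lambda p: 12.0 - 2.0 * (p[0] - 0.3) ** 2 - 5.0 * (p[1] - 0.1) ** 2
x_opt, n = steepest_ascent_two_factors(surface, x=[0.7, 0.8], a=0.05, delta=0.05)
print(x_opt, n)
```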
1.18.2.3 Influence of the Parameters
The most crucial parameter is a, which should be chosen with some care. The influence of a on the optimization is illustrated in Figure 4, with all runs starting at the same point (see Table 2). It can be seen that there is only a quite small region of values for which optimization is possible. It is particularly important to note how different the behavior of the method is when a is increased from 0.1 to 0.115. In the latter case the method diverges and will never find an optimum, whereas in the former it oscillates around the optimum but finds it within 10–15 steps, dependent on the stopping criterion. Note that a value of a equal to 0.05 is extremely efficient, but the problem is that the response surface is not known in advance, and as such it is not possible to know the correct value before experimentation or computation. In a computational algorithm, some checks are often needed to see whether the optimization appears to be converging and, if so, how fast, and then perhaps to modify a if there is a problem. In experimental science, it should be obvious if there are difficulties. If the value of the response is getting worse, this is likely to be a sign of lack of convergence (although a noisy response surface may cause confusion; hence some idea of the experimental error is useful). If the value of the response is improving only very slowly, possibly a is too small. In contrast, the starting conditions are not too crucial, even if they are very far from the optimum (see Figure 5); they normally just add a few extra steps to the optimization procedure. This is because steepest ascent (or descent) methods do not have fixed step sizes, unlike methods such as simplex optimization, so they can move rapidly toward the optimum provided the value of a is set sensibly.
Figure 4 Influence of the value of a on optimization: (a) 0.001, (b) 0.01, (c) 0.05, (d) 0.1, (e) 0.11, and (f) 0.115. All iterations start at a value of x1 = 0.6. Note that the scale for (f) is different to illustrate divergence. The horizontal axis represents x1 and the vertical axis y.
Table 2 Influence of a on the optimization of Figure 4

Value of a   Comments
0.001        Is unlikely to attain the optimum within a realistic timescale
0.01         Slowly finds optimum within 10–15 steps dependent on stopping criterion
0.05         Very rapidly finds optimum within 3–4 steps
0.1          Oscillates around optimum but finds it within 10–15 steps
0.11         Wide oscillation around optimum, very slow convergence
0.115        Does not converge, oscillates wildly
Figure 5 Influence of starting conditions on the progress of the optimization, using a = 0.1, and a starting value of (a) x1 = 0 and (b) x1 = 1.5.
1.18.2.4 Limitations
Although this method may appear simple, a problem is that it can be quite slow to converge, and it will often take many small steps when moving along a valley toward a minimum. It is also quite strongly dependent on the choice of a. For fairly straightforward optimizations it is adequate, but for more elaborate optimizations it can be slow and sometimes never reaches an optimum. There are various modifications, of which the conjugate gradient method20 is probably the most common.
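For reference, a minimal sketch of the conjugate gradient modification (the Fletcher–Reeves variant cited above20) is given below; it uses a crude backtracking line search and an analytical gradient, and is intended only to show how each new search direction mixes the current gradient with the previous direction. In practice a library routine (e.g., scipy.optimize.minimize with method='CG') would normally be preferred. The test function is the quadratic used in Section 1.18.3.1 below.

```python
import numpy as np

def fletcher_reeves_cg(f, grad, x, eps=1e-4, max_iter=500):
    """Minimal Fletcher-Reeves conjugate gradient sketch with a crude
    backtracking line search; returns the final point and iteration count."""
    x = np.asarray(x, dtype=float)
    g = grad(x)
    d = -g                                   # first direction: steepest descent
    for t in range(1, max_iter + 1):
        step = 1.0                           # halve the step until f decreases
        while f(x + step * d) > f(x) and step > 1e-12:
            step *= 0.5
        x_new = x + step * d
        g_new = grad(x_new)
        if np.linalg.norm(g_new) < eps:
            return x_new, t
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves beta
        d = -g_new + beta * d                # conjugate direction update
        x, g = x_new, g_new
    return x, max_iter

# Quadratic from Section 1.18.3.1: y = 3 - 3*x1 - 2*x2 + 2*x1**2 + 7*x2**2 - 2*x1*x2
f = lambda p: 3 - 3*p[0] - 2*p[1] + 2*p[0]**2 + 7*p[1]**2 - 2*p[0]*p[1]
grad = lambda p: np.array([-3 + 4*p[0] - 2*p[1], -2 + 14*p[1] - 2*p[0]])
x_min, n = fletcher_reeves_cg(f, grad, x=[0.0, 0.0])
print(x_min, n)
```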
1.18.3 Examples

1.18.3.1 Algebraic
The simplest type of example involves the optimization of a function. In chemometrics, it is often possible to determine a relationship between the levels of the factors that determine a response and the response itself. As an example, consider trying to find the minimum of the function

y = 3 − 3x1 − 2x2 + 2x1² + 7x2² − 2x1x2
We can calculate the derivatives

∂y/∂x1 = −3 + 4x1 − 2x2
∂y/∂x2 = −2 + 14x2 − 2x1
Table 3 Numerical example of optimization of the function y = 3 − 3x1 − 2x2 + 2x1² + 7x2² − 2x1x2

Iteration   x1      x2      y        ∂y/∂x1   ∂y/∂x2
1           0.000   0.000   3.0000   −3.000   −2.000
2           0.300   0.200   2.0400   −2.200    0.200
3           0.520   0.180   1.6604   −1.280   −0.520
4           0.648   0.232   1.5079   −0.872   −0.048
5           0.735   0.237   1.4462   −0.533   −0.155
6           0.788   0.252   1.4211   −0.351   −0.044
7           0.824   0.257   1.4109   −0.219   −0.052
8           0.845   0.262   1.4067   −0.142   −0.023
9           0.860   0.264   1.4050   −0.090   −0.019
10          0.869   0.266   1.4043   −0.058   −0.010
11          0.874   0.267   1.4040   −0.037   −0.007
12          0.878   0.268   1.4039   −0.024   −0.004
13          0.880   0.268   1.4039   −0.015   −0.003
14          0.882   0.269   1.4039   −0.010   −0.002
15          0.883   0.269   1.4039   −0.006   −0.001
The calculations are presented in Table 3 for a value of a = 0.1. It can be seen that the response has effectively converged within 10 iterations, although there is still a slight improvement in the optimal values of x1 and x2, as the optimum is probably very flat. Whereas in the example above it is possible to obtain the value of the optimum by simple differentiation of the original response function, without the need for steepest descent methods, when there are several factors and several terms, including high-order interactions and functions such as higher order polynomials or exponentials, it may be hard or impossible to solve the equations algebraically, and the methods described in this chapter provide valuable alternatives.
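The iterations in Table 3 can be reproduced directly with a short sketch, using the analytical derivatives given above and a = 0.1:

```python
# Steepest descent on y = 3 - 3*x1 - 2*x2 + 2*x1**2 + 7*x2**2 - 2*x1*x2,
# starting from (x1, x2) = (0, 0); each printed line corresponds to a row of Table 3.
a = 0.1
x1, x2 = 0.0, 0.0
for it in range(1, 16):
    y = 3 - 3*x1 - 2*x2 + 2*x1**2 + 7*x2**2 - 2*x1*x2
    dy_dx1 = -3 + 4*x1 - 2*x2
    dy_dx2 = -2 + 14*x2 - 2*x1
    print(f"{it:2d}  x1={x1:.3f}  x2={x2:.3f}  y={y:.4f}  dy/dx1={dy_dx1:.3f}  dy/dx2={dy_dx2:.3f}")
    x1 -= a * dy_dx1      # steepest descent step in each factor
    x2 -= a * dy_dx2
```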
1.18.3.2 Numerical
In most practical cases, the underlying relationship between the response to be optimized and the levels of the factors is unknown, and optimization consists of performing a series of experiments in sequence until an optimum is found. In order to understand this, we will illustrate with a simple one-factor example. If we use a numerical approach for optimization (which could be performed experimentally as well as by computational modeling), a typical procedure might be as follows.
1. Choose a starting value of x1.
2. In order to obtain an approximate value of the gradient at this starting point, some tests need to be performed. A simple approach is to perform a two-level factorial design,21 involving, in this case, just two experiments at x1 + δ and x1 − δ. The values of the response y can be determined and a linear model of the form y = b0 + b1x1 obtained.
3. The value of b1 is the value of the gradient, and so a new value of x1 + ab1 is used for the next step.
4. The response at this new value is then determined in the next step, to give an estimate of the gradient at x1 + ab1, and so on.
Note that it is not actually necessary to perform an experiment at x1 itself to determine the gradient in the region of the new response. Note also that a factorial design is very simple and corresponds to a linear model, even though the response surface is likely to be curved (and cannot be linear if there is an optimum): this is an approximation to the gradient. Better approximations are possible if more experiments are performed at each stage, but this requires more time and effort, and in most cases the estimate of the gradient will differ by only a small amount; hence there is a balance between performing an elaborate design and a very basic experiment that may involve an approximation but is faster.
Table 4 Numerical example: noise free, δ = 0.05, a = 0.02

x1      y        x1 − δ   y        x1 + δ   y        b1       y − b1δ   y + b1δ
1.000   70.971   0.950    70.186   1.050    71.744   15.583   70.192    71.750
1.312   75.515   1.262    74.839   1.362    76.166   13.275   74.852    76.179
1.577   78.626   1.527    78.111   1.627    79.104    9.930   78.129    79.122
1.776   80.280   1.726    79.927   1.826    80.587    6.598   79.950    80.610
1.908   80.983   1.858    80.758   1.958    81.157    3.992   80.784    81.183
1.988   81.234   1.938    81.093   2.038    81.320    2.263   81.121    81.348
2.033   81.314   1.983    81.223   2.083    81.347    1.232   81.252    81.376
2.057   81.337   2.007    81.275   2.107    81.341    0.655   81.305    81.370
2.071   81.344   2.021    81.297   2.121    81.331    0.344   81.327    81.361
2.077   81.346   2.027    81.307   2.127    81.325    0.179   81.337    81.355
2.081   81.346   2.031    81.312   2.131    81.321    0.093   81.342    81.351
2.083   81.347   2.033    81.314   2.133    81.319    0.048   81.344    81.349
2.084   81.347   2.034    81.315   2.134    81.318    0.025   81.345    81.348
2.084   81.347   2.034    81.316   2.134    81.317    0.013   81.346    81.347
2.085   81.347   2.035    81.316   2.135    81.317    0.007   81.346    81.347

The y values at the central point (second column) and the y − b1δ and y + b1δ columns are never obtained experimentally and are given for reference as discussed in the text.
An example of a simulated and noise-free numerical optimization is given in Table 4. The starting point of this optimization is x1 = 1. A value of δ = 0.05 is chosen and two experiments are performed at x1 = 0.95 and 1.05, giving values of the response y of 70.186 and 71.744, respectively. Using a simple linear model, the value of the gradient is b1 = 15.583, so the next value of x1 is 1 + a × 15.583. Setting a to 0.02, we obtain a new value of x1 = 1.312, and so test the response at x1 = 1.262 and x1 = 1.362, and so on. Three of the columns in Table 4 are not directly observed but are presented for reference: the response at the central value of x1 can be calculated, as can the values of y ± b1δ, which are the values that would be obtained at the factorial points if the linear model were perfect; this is not the case because the surface is curved. The first six steps of this optimization are illustrated in Figure 6, with the six values of x1 for each step represented using squares and the factorial points that are actually measured represented using circles.

Figure 6 Numerical optimization of one-factor response: noise free.

In normal practice, there is noise imposed upon an experiment, and the understanding of the optimization process becomes more complex. Table 5 illustrates a similar optimization to that of Table 4 except that there is some noise in all the measurements. We can see that the values of the gradient are quite different in this case, and the progress of the optimization is at a different rate.
Table 5 Numerical example: with noise, δ = 0.05, a = 0.02

x1      y        x1 − δ   y        x1 + δ   y        b1        y − b1δ   y + b1δ
1.000   70.974   0.950    69.926   1.050    71.806   18.801    70.034    71.914
1.376   76.421   1.326    75.656   1.426    77.054   13.982    75.722    77.120
1.656   79.402   1.606    79.003   1.706    79.722    7.185    79.042    79.761
1.799   80.390   1.749    80.038   1.849    80.982    9.443    79.918    80.863
1.988   81.251   1.938    80.938   2.038    81.117    1.792    81.161    81.340
2.024   81.310   1.974    81.198   2.074    81.431    2.323    81.194    81.426
2.071   81.256   2.021    81.111   2.121    81.345    2.339    81.139    81.372
2.117   81.380   2.067    81.357   2.167    81.024   −3.329    81.546    81.213
2.051   81.418   2.001    81.184   2.101    81.219    0.356    81.400    81.436
2.058   81.266   2.008    81.345   2.108    81.377    0.322    81.250    81.282
2.064   81.259   2.014    81.181   2.114    81.467    2.862    81.116    81.403
2.122   81.437   2.072    81.392   2.172    81.204   −1.880    81.531    81.343
2.084   81.385   2.034    81.284   2.134    81.287    0.028    81.383    81.386
2.084   81.280   2.034    81.387   2.134    81.314   −0.730    81.317    81.244
2.070   81.309   2.020    81.329   2.120    81.227   −1.020    81.360    81.258

The y values at the central point (second column) and the y − b1δ and y + b1δ columns are never obtained experimentally and are given for reference as discussed in the text.
When near the optimum, the optimization oscillates because the estimation of the gradient is extremely sensitive to noise; hence quite large gradients, both positive and negative, appear close to the optimum. This is illustrated in Figure 7 for the first six steps. Note that the two factorial points are sometimes completely out of line with the central point. This is because of the influence of noise, which can have the effect of shifting these points either both above or both below the curve, especially when the curve is quite flat, often resulting in quite false estimates of the gradient and hence oscillation. A comparison of the change in the value of x1 at each successive step for both the noise-free and the noisy process is given in Figure 8.

Figure 7 Numerical optimization of one-factor response in the presence of noise.

Figure 8 Value of x1 for each successive step of optimization for the noisy and noise-free numerical examples: the underlying process has the same behavior.
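The effect of noise on the gradient estimate can be mimicked with a few lines of code; the response function, noise level, and random seed below are invented for illustration (they are not the chapter's actual simulation), but the behavior is the same: far from the optimum the estimated b1 is dominated by the true slope, whereas near the optimum it is dominated by noise and can change sign from step to step.

```python
import numpy as np

rng = np.random.default_rng(0)
surface = lambda x1: 81.35 - 8.0 * (x1 - 2.085) ** 2          # illustrative response
measure = lambda x1: surface(x1) + rng.normal(scale=0.1)      # each 'experiment' is noisy

a, delta, x1 = 0.02, 0.05, 1.0
for t in range(1, 16):
    b1 = (measure(x1 + delta) - measure(x1 - delta)) / (2 * delta)  # noisy gradient estimate
    x1 += a * b1
    print(f"step {t:2d}: b1 = {b1:7.3f}, new x1 = {x1:.3f}")
```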
1.18.3.3 Literature
There are relatively few examples of this technique reported in the literature that are likely to be of interest to chemometricians, but we will review one application in this section.16 This work optimized the chromatographic separation of two enantiomers on a chiral stationary-phase column using supercritical fluid chromatography (SFC), and is exemplified by the separation
of ethyl-(2R,3S)-dihydroxyoctanoate and ethyl-(2S,3R)-dihydroxyoctanoate. Three factors were considered, namely (1) the start density of the mobile phase, (2) the temperature, and (3) the density gradient. The aim was to obtain a good chiral resolution (CR), given by

CR = (tR2 − tR1)/(wh2 + wh1) − 1
where tR is the retention time and wh the peak width at half-height for each compound, such that both peaks are resolved (set by the authors at a CR of around 0.274) while minimizing the retention time, so achieving a good resolution at maximum speed.
1. The first stage was to perform a half-factorial design with central points (five experiments) to determine conditions that would lead to the desired resolution of 0.274, using coded values as in Table 6. These conditions were found by the model to be x1 = 0.344 g ml−1, x2 = 52 °C, and x3 = 0.013 g ml−1 min−1, giving an experimental value of tR2 = 12.251 min and CR = 0.281 (owing to experimental uncertainty, this is not exactly equal to the estimated value).
2. Once this was achieved, another half-factorial design (this time with just four points) was performed around the optimum of step 1, to determine the rate of change of both the resolution and the retention time (RT) of the slowest eluting compound, which gives a model that can be used to determine the derivatives for the steepest ascent. Note that in this application the authors do not perform factorial designs afresh at each step of the optimization.
3. Six experiments were performed along the line of steepest ascent, using the model of step 2 to compute the derivatives, to give an optimum value of x1 = 0.311 g ml−1, x2 = 58 °C, and x3 = 0.019 g ml−1 min−1, with a value of CR = 0.255 and tR2 = 10.704 min.
4. Steps 2 and 3 were repeated again to give final values of x1 = 0.276 g ml−1, x2 = 62 °C, and x3 = 0.022 g ml−1 min−1, with a value of CR = 0.265 and tR2 = 10.293 min.

Table 6 Three variables for the worked example of Section 1.18.3.3, coded for the first half-factorial design

Variable                             −1      0       +1
x1, start density, g ml−1            0.270   0.300   0.330
x2, temperature, °C                  47      50      53
x3, density gradient, g ml−1 min−1   0.007   0.010   0.013
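As a small worked illustration of the resolution criterion (the retention times and half-height widths below are invented, not values from the original paper, and the formula follows the reconstruction of the CR expression given above):

```python
def chiral_resolution(t_r1, t_r2, w_h1, w_h2):
    """CR = (tR2 - tR1)/(wh2 + wh1) - 1, as given in the text above."""
    return (t_r2 - t_r1) / (w_h2 + w_h1) - 1

# Illustrative values only: two peaks about 0.9 min apart, half-height widths ~0.35 min
print(round(chiral_resolution(t_r1=9.8, t_r2=10.7, w_h1=0.36, w_h2=0.35), 3))  # ~0.268
```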
1.18.4 Conclusion

The steepest ascent (or descent) method is one of the oldest and most straightforward approaches to optimization, based on ideas first described almost two centuries ago. The method is quite commonly used in computational optimization algorithms, whereas it is relatively rare in chemometrics, probably largely owing to the lack of publicity in the literature compared with other methods such as simplex optimization. It does have some limitations, and is very sensitive to the value of a, which relates to the step size of the optimization; however, other approaches such as simplex have similar limitations. When used experimentally or numerically, there needs to be a way of estimating gradients, which may involve performing an experimental design, such as a factorial design, at each successive step. Nevertheless, this is a simple approach with a long pedigree, and as such it is worth considering as one of the tools of the chemometrician.
References

1. Box, G. E. P.; Wilson, K. B. On the Experimental Attainment of Optimum Conditions. J. R. Stat. Soc. B Stat. Methodol. 1951, 13, 1–45.
2. Booth, A. D. Application of the Method of Steepest Descents to X-Ray Structure Analysis. Nature 1947, 160, 196.
3. Booth, A. D. The Refinement of Atomic Parameters by the Technique Known in X-Ray Crystallography as the Method of Steepest Descents. Proc. R. Soc. Lond. A Math. Phys. Sci. 1949, 197, 336–355.
4. Debye, P. Naeherungsformeln fuer die Zylinderfunktionen fuer große Werte des Arguments und unbeschraenkt veraenderliche Werte des Index. Math. Ann. 1909, 67, 535–558.
5. Cauchy, A. L. Memoire sur divers points d'analyse. Mem. Acad. (France) 1829, 8, 130–138.
6. Petrova, S. S.; Solov'ev, A. D. The Origin of the Method of Steepest Descent. Historia Math. 1997, 24, 361–375.
7. Brooks, S. H.; Mickey, M. R. Optimum Estimation of Gradient Direction in Steepest Ascent Experiments. Biometrics 1961, 17, 48–56.
8. Davies, O. L., Ed. Design and Analysis of Industrial Experiments; Hafner: New York, 1954.
9. Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters; Wiley: New York, 1978.
10. Bayne, C. K.; Rubin, I. B. Practical Experimental Designs and Optimization Methods for Chemists; VCH Publishers: Deerfield Beach, FL, 1986.
11. Aguilar-Mogas, A.; Gimenez, X.; Bofill, J. On the Implementation of the Runge–Kutta–Fehlberg Algorithm to Integrate Intrinsic Reaction Coordinate Paths. Chem. Phys. Lett. 2006, 432, 375–382.
12. Ma, H. T.; Liu, X. J.; Bian, W. S.; Meng, L. P.; Zheng, S. J. A Theoretical Study of the Mechanism and Kinetics of F + N3 Reactions. ChemPhysChem 2006, 7, 1786–1794.
13. Liu, T.; Ye, L.; Chen, H. J.; Li, J. Y.; Wu, Z. H.; Zhou, R. H. A Combined Steepest Descent and Genetic Algorithm (SD/GA) Approach for the Optimization of Solvation Parameters. Mol. Simul. 2006, 32, 427–435.
14. Boye-Peronne, S.; Gauyacq, D.; Lievin, J. Vinylidene–Acetylene Cation Isomerization Investigated by Large Scale ab initio Calculations. J. Chem. Phys. 2006, 124, 214305.
15. Mezey, P. G. Rules on the Changes of Approximate Symmetry Measures along Reaction Paths. Mol. Phys. 2006, 104, 723–729.
16. Petersson, P.; Lundell, N.; Markides, K. E. Chiral Separations in Supercritical Fluid Chromatography: A Multivariate Optimization Method. J. Chromatogr. A 1992, 623, 129–137.
17. Duineveld, C. A. A.; Bruins, C. H. P.; Smilde, A. K.; Bolhuis, G. K.; Zuurman, K.; Doornbos, D. A. Multicriteria Steepest Ascent. Chemom. Intell. Lab. Syst. 1994, 25, 183–201.
18. Ghosh, B.; Agarwal, D. C.; Bhatia, S. Synthesis of Zeolite A from Calcined Diatomaceous Clay: Optimization Studies. Ind. Eng. Chem. Res. 1994, 33, 2107–2110.
19. Savitzky, A.; Golay, M. J. E. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal. Chem. 1964, 36, 1627–1639.
20. Fletcher, R.; Reeves, C. M. Function Minimisation by Conjugate Gradients. Comput. J. 1964, 7, 149–154.
21. Brereton, R. G. Chemometrics: Data Analysis for the Laboratory and Chemical Plant; Wiley: Chichester, 2003; Chapter 2.
Biographical Sketch
Richard Brereton is Professor of Chemometrics and Director of the Centre for Chemometrics in Bristol. He obtained his BA, MA, and Ph.D. from the University of Cambridge and was subsequently employed by the University of Bristol. He is Fellow of the Royal Society of Chemistry, a Chartered Chemist, Fellow of the Royal Statistical Society, and Fellow of the Royal Society of Medicine. He was awarded the Royal Society of Chemistry's 2006 Theophilus Redwood lectureship. He is the author of the book Chemometrics: Data Analysis for the Laboratory and Chemical Plant (Wiley 2003) and the author of Applied Chemometrics for Scientists (Wiley 2007), and was fortnightly Feature Writer in Chemometrics for the Alchemist and Associate Editor of Chemometrics and Intelligent Laboratory Systems. He is Chemometrics editor for Chemistry Central Journal. By early 2008, he had published 311 articles, 136 of which are refereed papers, and 6 edited/authored books. His work has been cited in 1792 publications. He has given 142 invited lectures in 23 countries worldwide and supervised 113 coworkers in Bristol. His current research interests include pattern recognition in forensics, biology, pharmaceuticals, medicine, and plastics using data obtained from chromatography, thermal analysis, acoustic spectroscopy, and mass spectrometry.