Computers chem. Engng, Vol. 19, No. 6/7, pp. 807-825, 1995
Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved.

PARALLEL NONLINEAR OPTIMIZATION TECHNIQUES FOR CHEMICAL PROCESS DESIGN PROBLEMS

K. A. HIGH*† and R. D. LAROCHE‡

Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802, U.S.A.

(Received May 1993; received for publication 24 June 1994)

* Author to whom all correspondence should be addressed.
† Present address: School of Chemical Engineering, Oklahoma State University, Stillwater, OK 74078, U.S.A.
‡ Present address: Cray Research, Inc., 655E Lone Oak Drive, Eagan, MN 55121, U.S.A.
Abstract-Parallel processing methods with constrained nonlinear optimization were developed to show the potential to efficiently perform chemical process design calculations. A sequential optimization subroutine was developed that used an SQP algorithm and the BFGS inverse Hessian update. This subroutine had parallelism introduced at various points. Most of these methods examined ways to perform and use parallel simultaneous function and gradient evaluations. Function evaluations are generally expensive calculations in chemical process optimization. Algorithms using a parallel finite difference Hessian (PH), Straeter's parallel variable metric (PVM) update, and Freeman's projected parallel variable metric (PPVM) update were investigated. Other algorithms investigated included Schnabel's parallel partial speculative gradient evaluation and parallel line searching. The success of a global optimization algorithm using simultaneous minimization shows potential for robust and reliable global minimization of complex multiextremal problems. The results of this work show the potential of decreasing large scale problem optimization time with parallel processing.
1. INTRODUCTION

Parallel processing has the potential to improve the current state of the art in chemical process design research. Parallel processing can allow for the solution of problems that have never before been attempted due to limitations in computational power and speed. This is the major motivation for the solution of large scale chemical process design and optimization problems.

Over the past decade, parallel processing as a computational tool has become readily available to many researchers through the use of networks. The national Internet computer network allows researchers access to advanced architecture machines at national laboratories and centers (Comer, 1988). The introduction of the Internet network has enabled researchers to conduct algorithm development on machines of differing architectures and numbers of processors.

There have been major advances in general parallel computing and algorithm development. However, there has been little specific effort in numerical optimization, mainly because of the unavailability of multiprocessors to researchers in this area. Some limited parallel optimization techniques were developed as early as 1960. Only recently, however, has a major emphasis on parallel nonlinear optimization research been realized as machines are becoming available to more researchers. The development of parallel nonlinear optimization will allow for a robust and reliable solution of those problems that were unsolvable before. Also, it may be possible to perform on-line optimization of processes and to potentially allow for advanced architecture machines to become involved in computer integrated manufacturing (CIM).

There are many reasons to investigate the use of optimization algorithms for use on parallel processing machines. These include the ability to: (1) have faster computation times; (2) perform more accurate, rigorous calculations to reduce the number of iterations required; (3) try more initial guesses, giving a more robust solution; (4) find more local minima of multiextremal nonlinear problems; and (5) perform simultaneous objective function evaluations.

Chemical engineering process optimization has typically focused on the use of Newton or quasi-Newton techniques for successive quadratic programming (SQP) strategies (Chen and Stadtherr, 1983, 1984). These methods tend to require fewer iterations and, therefore, fewer function evaluations
than other algorithms. Quasi-Newton methods require the determination of the gradients of the objective function to be optimized. Additionally, an approximation to the second derivative matrix (the Hessian) or its inverse is required. This information provides a direction in which to search for the optimal point. SQP methods can be applied to constrained and unconstrained problems. Although the focus of this work will be on parallel SQP strategies, the authors would like to note that there is also much research being done in other areas of optimization, particularly in the operations field (Pardalos et al., 1992; Meyer and Zenios, 1988). In this paper, a background for parallel SQP strategies and the potential for inclusion of these strategies into general chemical engineering optimization routines will be discussed. It is important to keep in mind that in chemical process optimization, one of the most expensive calculations is the function evaluation. This typically requires the simulation of an entire chemical process plant. Many types of optimization problems are performed where expensive function evaluations can require over 90% of the computational time (Byrd et al., 1988). The opportunities for parallelism to be incorporated into an SQP algorithm include: (1) calculation of the inverse Hessian second derivative matrix; (2) calculation of numerical gradients; (3) line search procedure with expensive function evaluations; (4) evaluation of complex functions by breaking up the function into sub-functions and evaluating these mini-functions concurrently; (5) optimization of functions where good initial guesses are not known; and (6) optimization of functions that might have many minima. SQP methods that use parallel gradient calculations for the inverse Hessian calculation (1 above) will be examined. Additionally, parallel function evaluations will be used for line searching strategies as well as the calculation of the numerical gradient (2 and 3 above). Finally, parallel global routines (using SQP approaches) will be examined for points 5 and 6 above. These techniques will be employed for the minimization of constrained objective functions. As the function evaluation is one of the most expensive calculations, those methods that use an SQP algorithm and can minimize the number of iterations are important. These techniques are for objective functions that are single valued and incorporate continuous variables. It is computationally expensive to determine the inverse Hessian matrix that is used to calculate the
direction vector for Newton and quasi-Newton optimization methods (point 1 above). One approach to the estimation of the Hessian is to modify or “update” the matrix from the previous iteration’s matrix using gradient information from the current and previous points. Efficient algorithms exist to directly update the inverse Hessian matrix (as opposed to updating the Hessian directly) eliminating the need for a calculationally intensive matrix inversion. The BFGS method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) is a generally accepted approximation method for the updating of the Hessian (or its inverse). Straeter (1973) developed a parallel variable metric (PVM) technique that performed multiple updates at each iteration. The first derivative information that was needed for the updates was obtained from the simultaneous parallel evaluation of the gradients in independent directions. The number of directions is based on the dimension of the inverse Hessian and therefore the dimension of the objective function. The metric is then modified by multiple rank-one (one updating term added) corrections from the gradient information. The goal with this approach was to determine a more appropriate direction from a more accurate Hessian approximation. This would minimize the number of iterations taken and hopefully reduce the calculation time. The PVM was first applied to unconstrained optimization problems. Dayde (1989) expanded the use of the PVM update for constrained optimization by incorporating the Bartholomew-Biggs (1980, 1982) method based on the use of the classical penalty function and the Lagrangian function. He used an algorithm developed by Berger et al. (1986a, b). This formulation is an equality constrained quadratic programming formulation (EQP) of the successive quadratic programming (SQP) problem. In the EQP formulation, only those constraints that are active are included in the penalty or Lagrangian function. Freeman (1989, 1991) generalized Straeter’s algorithm for use with Davidon’s (1975) projection matrices. This created a positive definite matrix from multiple rank-two (two correction terms added) updates. This algorithm is termed the projected parallel variable metric method (PPVM). This use of a positive definite matrix helped to solve the problem of ill-conditioning. Freeman applied this technique to unconstrained problems. An implementation of parallel numerical Hessian evaluation using a finite difference approach has been described by Lootsma and Ragsdell (1988). The disadvantage of this technique is that the
inverse matrix has to be determined before the direction can be calculated using a Newton (or quasi-Newton) approach. This is in contrast to the other update methods described above (PVM, PPVM, or BFGS) that are able to directly update the inverse Hessian. To reduce the time required to calculate numerical gradients, Schnabel (1987) developed the technique of parallel partial speculative numerical gradient evaluations. This strategy provides for the simultaneous evaluation of the objective function at a line search trial point and at other points. These evaluations at other points are those that are necessary for the determination of the numerical gradient. The philosophy behind this method is that since all of the processors are idle except for the one that is used to calculate the objective function at the trial point, it is probably more prudent to use the extra processors for function evaluations that may be used later. After the line search phase is complete, when a new point has been selected as successful in minimizing the function in the given direction, any remaining function evaluations that are necessary for the gradient evaluation are performed. The necessity of these extra function evaluations arises when the dimension of the problem, and therefore the number of functions needed to calculate the gradient numerically, is larger than the number of processors available. The speculative gradient method is in contrast to a parallel gradient evaluation in which the function evaluations for the numerical gradient are only calculated after the line search phase is complete (Lootsma and Ragsdell, 1988). To reduce the time required for line searching, parallel simultaneous function evaluations could also be performed at various trial points along the search direction. Several chemical engineers have looked at techniques to break up the function evaluations into smaller parts and perform these evaluations in parallel. Chen and Stadtherr (1985a-c) and Chen (1988) worked on problems of this type. Investigators have to be cautious, however, because this technique can be highly problem specific. To be more general, parallelism should be introduced in the optimization algorithm and not in the objective function evaluation. Investigators (Vegeais, 1987; Coon and Stadtherr, 1988; Zitney and Stadtherr, 1993) are beginning to look at general techniques that use parallel processing to perform the sparse matrix techniques necessary to simultaneously solve the thousands of objective function and constraint equations associated with a chemical process problem. Many researchers have looked at the use of global optimization to help in solving problems where good
initial guesses are not known and/or the function may have several minimum values. Dixon and Szego (1978), Dixon et al. (1984) and Schnabel (1985, 1987) investigated the use of simultaneous local minimizations from several initial guesses. Some experiments have been carried out using parallel processing machines. This global optimization allows for the robust and reliable solution of those optimization problems where an effective initial guess is not known and also of those problems that may have many minima (multiextremal functions). This research was initiated to analyze the parallel nonlinear optimization techniques described above for their appropriateness to chemical process design and optimization (High, 1991). The major emphasis of this work was to develop parallel nonlinear optimization techniques. A sequential algorithm was developed that used the BFGS inverse Hessian update in the Bartholomew-Biggs formulation (1980, 1982). Powell's (1978) quadratic line search technique as well as Chamberlain et al.'s (1979) watchdog technique were also used for efficient line search procedures. A suite of classical constrained test problems as well as small chemical process optimization problems were obtained from the literature. These problems were minimized using the sequential algorithm as well as with the algorithms that were developed using the various parallel inverse Hessian evaluation, parallel speculative gradient and numerical gradient evaluation, parallel line searching, and global minimization techniques described. Parallelization was not applied to the quadratic problem. This work represents one of the first uses of the PPVM (Freeman, 1989, 1991) in a constrained optimization algorithm. The major goal of this work is to bring appropriate applied mathematical techniques for optimization to the attention of chemical engineers. This article will review the state of the art for parallel SQP algorithms. Many other parallel algorithms exist (Pardalos et al., 1992; Meyer and Zenios, 1988) but these will not be discussed because of space limitations. The implementation of these algorithms on an Alliant FX/8 (a parallel machine with eight processors) will be discussed. Results and conclusions based on the use of these algorithms on a set of test problems will be discussed.
2. PARALLEL OPTIMIZATION BACKGROUND
Tables 1 and 2 show typical, general algorithms for global and local optimization routines that are or can be used in chemical engineering process optimization. Note that the minimization can be applied to unconstrained functions
Table 1. Global optimization routine

(1) Divide the region to examine for optimum(s) into N divisions (not necessarily equal to the number of processors).
(2) Select starting point(s) in each region. Either add a new point if the last point was a failure or continue with the previous minimum.
(3) Perform local minimization from each starting point. Stop at a given criterion, i.e. number of iterations, time, when a minimum has been reached, etc.
(4) Determine if the convergence criteria have been met. If they have, stop. If not, go to step 2.
(5) Select the global optimum from the local minima.
(where the Hessian matrix is the second derivative matrix of the objective function) as well as for constrained optimization (where the Hessian is of either a penalty or a Lagrangian function which incorporates the constraints into a new function with the objective function). Upon examination of these routines, it becomes evident that there are several different opportunities for the incorporation of parallelism. These can be summarized as:
(1) Start several simultaneous minimization routines on different processors with varied initial guesses, as described in Table 1 and shown in Fig. 1. The global optimum is the local minimum with the lowest value.
(2) Perform simultaneous function evaluations on a number of processors at different steps during the local optimization routine as described in Table 2 (this is the typical approach for a Newton or quasi-Newton approach, i.e. SQP methods). Figure 2 shows the calculation of simultaneous function evaluations that will later be used for: (a) numerical gradient evaluation (∇f(x) or g(x)), Step 1; (b) numerical Hessian evaluation (H(x), exact, or W(x), approximate), Step 1; (c) line search, Step 3.
(3) Perform simultaneous gradient evaluations for the Hessian evaluation or approximation, Step 1 in Table 2 (see also Fig. 3).
(4) Parallelize or break up the function evaluations into smaller subproblems.

Table 2. Local optimization routine

(1) At a given point, x_k, calculate f(x_k) and the gradient g(x_k) (exact or finite difference approximation) and H(x_k) (exact Hessian) or W(x_k) (finite difference Hessian or other approximation). The gradient may be used for the W(x_k) estimation. The iteration number is indicated by k.
(2) Calculate the search direction, d_k, from the H(x_k) (or W(x_k)) and g(x_k) information. Generally, this is a quasi-Newton approach.
(3) Perform a line search to find (x_k + λd_k) and f(x_k + λd_k) where the minimum function value is found. λ is the step length in the direction calculated in Step 2. The gradient g(x_k + λd_k) may need to be calculated depending on the line search technique.
(4) Determine if the convergence criteria have been met. If they have, stop. If not, go to step 1.
(5) Use available parallel techniques for the various matrix operations.
(6) Vectorize calculations at as many opportunities as possible.

The degree of granularity (size of parallel task) decreases from number 1 above (coarse grain) to 6 (fine grain). For this study, parallel optimization levels 1, 2 and 3 above were examined and will be described in more detail. These opportunities for parallelism have the potential to be incorporated into SQP (successive quadratic programming) optimization for chemical engineering problems. Additionally, these levels are where the optimization algorithm developer has the most opportunity for improvement. Levels 4, 5 and 6 tend to be problem specific.

2.1. Global optimization

Global minimization involves the assumption that the objective function is nonconvex and possesses more than one minimum. Global optimization problems can be deterministic (i.e. using a mathematical formulation from a given point to find the next point) or probabilistic (stochastic or random in the selection of the next point). Dixon et al. (1984) point out that it is generally accepted that deterministic methods can only be applied to a restricted set of objective functions whereas the probabilistic approach is more general. Most algorithms for global optimization involve clustering techniques as follows:
(1) N trial point function evaluations are performed over a finite region (random or uniform distribution), which can be done in parallel.
(2) Perform a local search sequence from the trial points. This also can be done in parallel.
(3) Invoke a clustering phase to eliminate all but one out of a group of points that are gravitating to the same point. Return to step 2.
Schnabel (1985) improved on this method by eliminating those clusters where the function value is high compared to other clusters and developed it for parallel processing. Another technique is the controlled random search (CRS) method developed by Dixon et al. (1984), which allows for the optimization of multiextremal functions with parallel processing. Byrd et al. (1990) also describe a global parallel optimization technique that does not use a clustering technique. It is possible to take these ideas for global optimization and extend them to chemical engineering problems.
Fig. 1. Global optimization routine with parallel processors: min(f(x))_1, min(f(x))_2 and min(f(x))_3 are the minima obtained from starting points 1, 2 and 3 of dimension n, and min f(x) = min[min(f(x))_1, min(f(x))_2, min(f(x))_3].
The most appropriate uses for global optimization in process design are: (1) when a good initial guess is not known; (2) when robustness and reliability are required; (3) when the function may have many minima. The algorithm used in this study for global optimization is as follows: (1) determine the number of subregions needed (usually equal to the number of processors); (2) determine the start points for the local simultaneous optimizations; (3) initiate the parallel tasks, i.e. the simultaneous local minimizations; (4) stop the minimizations when the maximum number of iterations is reached. Many more sophisticated approaches exist but will not be described here.
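As a minimal illustration of this multistart strategy, the simultaneous local minimizations can be dispatched to a pool of worker processes. This is a sketch only, not the authors' implementation: the objective, constraint and starting points are placeholders, and SciPy's SLSQP routine stands in for the SQP/BFGS code developed in this work.

# Hypothetical sketch of parallel multistart global optimization (Table 1 style).
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Stand-in objective; a real application would call a flowsheet simulation.
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + np.sin(3.0 * x[0])

constraints = [{"type": "ineq", "fun": lambda x: 4.0 - x[0] ** 2 - x[1] ** 2}]

def local_minimize(x0):
    # One local minimization per starting point (Step 3 of Table 1).
    res = minimize(objective, x0, method="SLSQP", constraints=constraints,
                   options={"maxiter": 100})
    return res.x, res.fun

if __name__ == "__main__":
    # Steps 1-2: one starting point per subregion (here, one per processor).
    starts = [np.array([x1, x2]) for x1 in (-2.0, 0.0, 2.0) for x2 in (-2.0, 2.0)]
    with ProcessPoolExecutor(max_workers=len(starts)) as pool:
        results = list(pool.map(local_minimize, starts))
    # Step 5: the global optimum is the best of the local minima.
    x_best, f_best = min(results, key=lambda r: r[1])
    print(x_best, f_best)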
2.2. Parallel function evaluations

Objective function evaluations are some of the most expensive calculations for chemical engineering process design optimization. Typically, they require much more time to complete as compared to other optimization calculations (which are generally linear algebra type calculations) (Byrd et al., 1988). This is because each objective function evaluation requires the simulation of the entire process plant at the desired conditions. This can involve the need to solve tens of thousands to hundreds of thousands of equations. Numerical gradient evaluations are even more expensive to calculate.
Fig. 2. Parallel function evaluations: f(x_1), f(x_2), f(x_3) computed simultaneously for later use in the line search or in the gradient evaluation (g(x) = ∇f(x)).

Fig. 3. Parallel gradient evaluations.
They involve the determination of n + 1 function evaluations, where n is the dimension of the optimization problem. Most optimization routines require at least a function evaluation and gradient evaluation at each iteration. A major reduction in optimization time, therefore, can be realized if function evaluations can be computed simultaneously using parallel processing.

2.2.1. Numerical gradient evaluations.

One of the easiest ways to implement parallel processing into nonlinear optimization is for the calculation of the various function values for the numerical derivatives of the gradient necessary for Newton-type methods. These are typically a major computational stumbling block in the solution of optimization problems because of the time associated with the function evaluations. The first derivative can be approximated from:

$$\frac{\partial f(x)}{\partial x_i} = \frac{f(x + c_i e_i) - f(x - c_i e_i)}{2 c_i} \qquad (1)$$
where c is a small constant and e_i is the elementary vector in the ith direction. The dimension of the problem is n, i.e. the number of variables. For a four dimensional problem and for i = 1, the elementary vector is (1, 0, 0, 0)^T. The above formulation is termed the central difference. This requires that, for each dimension, two functions must be evaluated, one at each of the two perturbations from x (+c_i e_i and -c_i e_i). For the gradient to be calculated, therefore, 2n + 1 function evaluation calculations must be performed (including f(x)). Using parallel processing, it would be possible to use 2n + 1 processors to perform each of the function evaluations. The forward difference:

$$\frac{\partial f(x)}{\partial x_i} = \frac{f(x + c_i e_i) - f(x)}{c_i} \qquad (2)$$
requires that only one function be evaluated at each (x + c_i e_i) as well as f(x). For an n-dimensional problem, n + 1 processors could be simultaneously used to calculate the n + 1 functions (including f(x)) needed to solve for all the elements of the gradient vector when performing the forward difference calculation. For small values of c, the forward difference method can approximate the central difference and thus save n calculations. For large n or computationally expensive functions, this difference can be quite appreciable. As mentioned before, this is one of the simplest methods to decrease computation time, since at every iteration of a local minimization routine at least one gradient evaluation of the function must generally be taken. Schnabel (1985) considers this type of parallelism for optimization. Lootsma and Ragsdell (1988) discuss the parallelization of the Newton method for the first derivatives. Another important note to consider is that using numerical gradients versus analytical gradients does not necessarily mean that there will be an associated increase in computational effort. Analytical gradient evaluations can be just as expensive as finite difference gradients. For example, if the problem is three dimensional, the function and gradient evaluation requires four calculations. These are the function calculation, f, as well as the calculation of the analytical derivatives of f with respect to x_1, x_2 and x_3, or g_1(x), g_2(x), g_3(x), respectively. For the finite difference approach, four calculations are also necessary. These are: f(x), f(x_1), f(x_2), f(x_3), where f(x_i) refers to function evaluations at new vectors where x_i is perturbed in the ith dimension. However, the effort for gradient evaluation can be only a small multiple of the function evaluation even if n is large. This can happen with "reverse modes" of differentiation if implemented properly.
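A minimal sketch of the forward-difference gradient of equation (2), with the n + 1 function evaluations dispatched in parallel, is given below. This is an illustration only: the objective function and perturbation size are placeholders, and a standard process pool stands in for the SCHEDULE-style task dispatch used in this work.

# Hypothetical sketch: forward-difference gradient, equation (2), with the
# n + 1 function evaluations dispatched to parallel worker processes.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def f(x):
    # Placeholder objective; in practice this would be a full process simulation.
    return float(np.sum((x - 1.0) ** 2) + x[0] * x[1])

def parallel_forward_gradient(x, c=1e-6, max_workers=None):
    n = len(x)
    # Points to evaluate: x itself plus one perturbation per dimension.
    points = [x] + [x + c * np.eye(n)[i] for i in range(n)]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(f, points))          # n + 1 evaluations in parallel
    f0, f_pert = values[0], values[1:]
    return np.array([(fi - f0) / c for fi in f_pert])

if __name__ == "__main__":
    x = np.array([0.5, 2.0, -1.0])
    print(parallel_forward_gradient(x))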
Parallel nonlinear optimization An example of this automatic approach for exact derivatives is JAKE-F which is available on NETLIB, a library of a wide variety of mathematical codes available via Internet from Oak Ridge National Laboratories (the mail address on Internet is
[email protected]). The developers of this feature state that for scalar functions, the ratio between the gradient evatuation and function evaluation is fixed by a bound of five. Finite difference approaches also heip to eliminate the user error that can typically accompany the coding of tedious equations as well as the more fundamental error in deriving the analytical gradients of very complex functions. However, analytical methods can sometimes do better with respect to accuracy when the equations are coded correctly. As is generally the case in process optimization, the function cannot generally be written explicitly in analytical form. Thus, it is necessary in this case to determine numerical gradients. The x’s (independent variables in the optimization problem) are typically flowrates, compositions, temperatures, pressures of input streams to various process units. These affect the overall economics of the plant, but it is not possible to determine the objective function as a function of these x’s directly. The x’s affect the output streams of the various unit operations that are part of the plant. The output streams are calculated from flowsheet simulators such as ASPEN and PRO/II. These outputs are complex PLUSW formulations (i.e. thermodynamic equations of state) of the input x values. It is these output stream values that become variables in the objective function. 2.2.2. Numerical Hessian evaluations. The second derivative information for the determination of the Hessian matrix requires that the diagonal elements: a’f(x) -=
f(x+c,e,)
- 2f(x) + f(x-ciei)
al,
and the off-diagonal 8f(X) -= axsrj
f(X+ Ciei+
ti
(3)
elements: Cjej)
-
f(X + ciej) - f(X+C,ej)
+ f(X)
CKj (4)
be calculated. For each of the diagonal elements, two function evaluations (of the objective or the Lagrangian function) are needed, at (x + c_i e_i) and (x - c_i e_i). The entire calculation requires 2n + 1 function evaluations to be performed for the diagonal elements of the Hessian matrix. Additionally, for each off-diagonal element, the function at (x + c_i e_i + c_j e_j) needs to be calculated. To calculate the Hessian off-diagonal terms, it is required that just (n² - n)/2 additional function evaluations be performed because: (1) the 2nd, 3rd and 4th terms of the numerator of equation (4) had been calculated for the previous equation (3); (2) there are n² - n off-diagonal terms; and (3) the Hessian matrix is symmetric and hence only 1/2 of the off-diagonal terms have to be calculated to determine the full matrix. Therefore, for the numerical determination of the Hessian using the above equations, one evaluation at x, the 2n contributions of equation (3) for the diagonal elements, and the (n² - n)/2 contributions of equation (4) for the off-diagonal terms yield a total of:

$$1 + \frac{n(n + 3)}{2} \qquad (5)$$
function evaluations are necessary. This many processors could be used to perform the calculations in parallel.

2.2.3. Line search.

There are several opportunities for the use of parallelism in the line searching phase. One is to use parallel processing to allow for the calculation of the extra functions needed for the evaluation of the numerical gradient. Another technique is to perform simultaneous function evaluations at various λ (step size) values in the improving direction to find a line minimum.

2.2.4. Speculative gradient evaluation.

Schnabel (1987) has developed some elegant techniques for using parallel processing during the line search phase. This involves the calculation of the gradient at a speculative (x + λd) point, where λ is the step length along a previously determined improving direction d away from the current point x. The point (x + λd) is speculative because it has not been determined that the function is improved at that point. An example of one of his methods is given in Tables 3 and 4. This example is for a six dimensional problem where a four processor machine is available. A function evaluation at the point (x_k + λd_k) is f, and g_i refers to the function evaluation for the numerical derivative at that point with respect to the ith variable. It is important to remember that the derivative is not necessary for the line search but is needed in the next iteration to determine a new line search direction.
Table 3. Parallel finite difference evaluation for f, f, g, f, g

Step number      1    2    3    4    5    6    7
Processor 1      f    f    g1   g5   f    g1   g5
Processor 2      -    -    g2   g6   -    g2   g6
Processor 3      -    -    g3   -    -    g3   -
Processor 4      -    -    g4   -    -    g4   -
Table 4. Partial speculative gradient evaluation for f, f, g, f, g

Step number      1    2    3    4    5    6    7
Processor 1      f    f    g4   f    g4   -    -
Processor 2      g1   g1   g5   g1   g5   -    -
Processor 3      g2   g2   g6   g2   g6   -    -
Processor 4      g3   g3   -    g3   -    -    -
In determining the advantages of using the speculative gradient evaluation, a hypothetical sequence of function and gradient evaluations, f(x_k + λ_1 d_k), f(x_k + λ_2 d_k), g(x_k + λ_2 d_k), f(x_{k+1} + λ_3 d_{k+1}), g(x_{k+1} + λ_3 d_{k+1}) (or f, f, g, f, g), as may be needed in a sequence of two iterations, is reviewed in Table 3. The value f(x_k + λ_1 d_k) was determined not to be an appropriate reduction over f(x_k), so a new step length, λ_2, was chosen for the line search. The function was reduced adequately at that point, x_k + λ_2 d_k, so the gradient was then evaluated. The next iteration was then started. The first point of the line search at λ_3 in the new direction d_{k+1} was a success and the gradient was then evaluated. As can be seen in Table 3, it takes 7 steps to complete the calculations for the sequence f, f, g, f, g using the standard parallel finite difference approach. Some of the processors were idle during these calculations. In Table 4, at each evaluation of f, additional function evaluation calculations for the gradient were performed. Since it cannot be determined whether the step length λ was a failure until after the function calculation, the risk is taken that the calculations for the gradient are wasted, as in Step 1. However, in two steps (2 and 4) these calculations were not wasted. The entire process took 5 steps instead of 7. Assuming that there is no computational cost associated with the overhead of setting up the parallel tasks, the second approach will take less computational time than the first, typical parallel approach to a gradient evaluation.
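A compact sketch of one partial speculative step is given below. It is illustrative only: the objective is a placeholder, the simple decrease test stands in for the actual line search acceptance criterion, and the worker pool stands in for the SCHEDULE utility used in this work.

# Hypothetical sketch of partial speculative gradient evaluation: while the
# trial point f(x + lambda*d) is being evaluated, otherwise-idle workers start
# on forward-difference gradient components that are only needed if the trial
# point is accepted.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def f(x):
    return float(np.sum(x ** 2) + np.prod(np.cos(x)))   # placeholder objective

def speculative_step(x, d, lam, f_x, c=1e-6, workers=4):
    trial = x + lam * d
    n = len(trial)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # One worker for the trial point, the rest speculate on gradient pieces.
        fut_f = pool.submit(f, trial)
        spec = [pool.submit(f, trial + c * np.eye(n)[i])
                for i in range(min(workers - 1, n))]
        f_trial = fut_f.result()
        if f_trial >= f_x:                     # step rejected: speculative work is wasted
            return False, f_trial, None
        done = [s.result() for s in spec]      # speculative evaluations were useful
        # "Clean-up": remaining components once the step is known to be acceptable.
        rest = [pool.submit(f, trial + c * np.eye(n)[i]) for i in range(len(done), n)]
        values = done + [r.result() for r in rest]
    grad = (np.array(values) - f_trial) / c
    return True, f_trial, grad

if __name__ == "__main__":
    x = np.array([1.0, -0.5, 2.0, 0.3, -1.2, 0.8])       # n = 6, as in Tables 3 and 4
    d = -x / np.linalg.norm(x)
    print(speculative_step(x, d, lam=0.5, f_x=f(x)))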
2.3. Simultaneous gradient evaluations

Simultaneous gradient evaluations allow for the Hessian (or approximate) matrix evaluation calculations. Three such approaches are the parallel Hessian evaluation (PH), the parallel variable metric evaluation (PVM) and the parallel projected variable metric evaluation (PPVM), described below. The PVM and the PPVM are updating techniques similar to the BFGS formulation. They can be used to update the Hessian or the inverse Hessian directly.

2.3.1. Parallel Hessian (PH).

The formulation:

$$\frac{\nabla f(x + c_i e_i) - \nabla f(x)}{c_i} \qquad (6)$$
can be used to approximate the ith row and column of the Hessian matrix if the user can supply the first derivative information analytically, and thus n + 1 processors can be used. If the gradient information is not available analytically, then (n + 1)(n + 1) processors could be used to perform the necessary calculations to approximate the Hessian for this formulation, since n + 1 function evaluations would have to be performed for each of the n + 1 gradients. This procedure is shown in Fig. 4. The p's and c's are small scalars, not necessarily equal to each other.

Fig. 4. Finite difference Hessian calculation (where x_i = x + c_i e_i).

It should be noted that there is generally not a preference for using the equation (6) formulation as opposed to the method described in equations (3) and (4) for determining the Hessian matrix. At this point it is important to note that this is only an approximation to the Hessian for use in a quasi-Newton method, because the finite difference approach may have a small amount of error associated with it. Once this approximate Hessian (W) is calculated (of the objective function for unconstrained optimization and of the Lagrangian function for constrained minimization), the inverse is necessary to calculate the direction:

$$d = -W^{-1} \nabla f(x) \qquad (7)$$
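A sketch of this construction is shown below (illustrative only; the objective, gradient routine and perturbation sizes are placeholders). Note that the direction of equation (7) can be obtained by solving the linear system W d = -∇f(x) rather than explicitly forming W⁻¹, which is numerically equivalent here.

# Hypothetical sketch: parallel Hessian (PH) approximation from gradient
# differences, equation (6), followed by the quasi-Newton direction of
# equation (7) obtained by solving W d = -grad f rather than inverting W.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def f(x):
    return float(0.5 * x @ x + np.sin(x[0]) * x[1])      # placeholder objective

def grad(x, c=1e-7):
    # Forward-difference gradient (each call could itself be parallelized).
    n = len(x)
    f0 = f(x)
    return np.array([(f(x + c * np.eye(n)[i]) - f0) / c for i in range(n)])

def parallel_hessian_direction(x, c=1e-5):
    n = len(x)
    points = [x] + [x + c * np.eye(n)[i] for i in range(n)]
    with ProcessPoolExecutor() as pool:
        grads = list(pool.map(grad, points))              # n + 1 gradients in parallel
    g0 = grads[0]
    W = np.column_stack([(gi - g0) / c for gi in grads[1:]])
    W = 0.5 * (W + W.T)                                   # symmetrize the approximation
    d = np.linalg.solve(W, -g0)                           # equation (7) without forming W^-1
    return d, g0

if __name__ == "__main__":
    d, g = parallel_hessian_direction(np.array([1.0, -2.0, 0.5]))
    print(d)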
The inversion calculation can be quite appreciable for large n. The quasi-Newton methods that use an approximation (usually an updating approach) to the inverse Hessian, as opposed to the Hessian itself, are attractive for this reason, since inversion of large matrices is avoided at this step. Although it is possible to evaluate the Hessian numerically as described above, there are several reasons why it might be better to use a different approach. These include: (1) Those places where the inverse of the Hessian does not exist (i.e. the matrix is singular). In these cases, updating techniques can at least give an approximation that allows the iteration to continue. (2) The function may be locally concave, yielding a negative definite matrix for the Hessian, and an updating method may be able to give a positive definite approximation. (3) The need to invert the Hessian matrix. Variable metric methods (quasi-Newton methods) have been the subject of intense research as they take advantage of the Newton methods yet do not require the determination of the exact second derivative matrix. The parallel variable metric methods that will be examined approximate the inverse metric or Hessian using (n + 1) gradient evaluations, as was shown in Fig. 4. If n + 1 processors are available, the work can be divided and it would be necessary only to wait for the time it takes to calculate the n + 1 functions for a gradient evaluation. Each processor would thus be responsible for the calculation of the gradient at x or at a slight perturbation from x in a given direction. With the calculation of the Hessian using these parallel variable metric approximations, it may be possible to eliminate some function evaluations as discussed above, since the Hessian is always symmetric and some calculations are repeated for the off-diagonal terms. However, as shown in Fig. 4,
it is more straightforward to allow each processor to calculate its own gradient for its input x.

2.3.2. Parallel Variable Metric (PVM).

Straeter (1973) developed the parallel variable metric (PVM) technique for use with unconstrained problems. He showed that his new algorithm was competitive with the Davidon-Fletcher-Powell (DFP) update (a rank two update) even when the PVM was performed serially, because the use of multiple gradient information at each x provided for a more accurate Hessian (metric) calculation. His results are based on the use of a sequential computer. He hypothesized the performance of the algorithms on a real parallel processor assuming no overhead to set up the parallel jobs. The PVM is based on a rank one update (SR1) and can be summarized as follows for a problem of n dimensions:

$$(W_{k+1})^{-1} = (W_k)^{-1} - \sum_{j=1}^{n} \frac{r_{kj} r_{kj}^{T}}{(\Delta g_{kj})^{T} r_{kj}} \qquad (8)$$

where W_k is the estimate of the Hessian from the previous iteration and r_kj is the residual vector in the jth direction, as follows:

$$r_{kj} = (W_k)^{-1} \Delta g_{kj} - \Delta x_{kj}, \qquad j = 1, \ldots, n. \qquad (9)$$

Δg_kj is the difference in the gradients at the current point and at a perturbation from x in the jth direction:

$$\Delta g_{kj} = \nabla f(x_{kj}) - \nabla f(x_k), \qquad j = 1, \ldots, n. \qquad (10)$$

The perturbations from the current x are found from the following formula:

$$x_{kj} = x_k + c e_j, \qquad j = 1, \ldots, n \qquad (11)$$

where c is a small scalar. The difference in the current point and the perturbation is given as:

$$\Delta x_{kj} = x_{kj} - x_k, \qquad j = 1, \ldots, n. \qquad (12)$$
The philosophy of this algorithm is that gradients obtained simultaneously at different points may be used to yield a more accurate inverse Hessian approximation. If an accurate Hessian can be obtained, then a more appropriate direction can be calculated (see Fig. 5). This would potentially lead to a reduction in the number of iterations and, more importantly, in the number of function evaluations that need to be taken for the optimal value to be found. The authors would like to note that the exact Hessian (PH) and the PVM approaches are unreliable because they do not necessarily yield a positive definite matrix. The line searches in the direction calculated may fail, or a direction may not be obtained at all.
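A compact sketch of one PVM update, equations (8)-(12), is given below. It is illustrative only: the gradient routine is a placeholder, and corrections whose denominator in equation (8) nearly vanishes are simply skipped.

# Hypothetical sketch of Straeter's PVM update, equations (8)-(12): n gradients
# at perturbed points are evaluated in parallel and applied as rank-one
# corrections to the current inverse Hessian estimate.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def grad_f(x):
    # Placeholder analytical gradient of f(x) = 0.5 x'Ax with a fixed A.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 0.5], [0.0, 0.5, 2.0]])
    return A @ x

def pvm_update(W_inv, x_k, c=1e-4):
    n = len(x_k)
    points = [x_k] + [x_k + c * np.eye(n)[j] for j in range(n)]   # eq. (11)
    with ProcessPoolExecutor() as pool:
        grads = list(pool.map(grad_f, points))                    # parallel gradients
    g0 = grads[0]
    for j in range(n):
        dg = grads[j + 1] - g0                                    # eq. (10)
        dx = c * np.eye(n)[j]                                     # eq. (12)
        r = W_inv @ dg - dx                                       # eq. (9)
        denom = dg @ r
        if abs(denom) > 1e-12:                                    # skip ill-defined corrections
            W_inv = W_inv - np.outer(r, r) / denom                # eq. (8)
    return W_inv

if __name__ == "__main__":
    W_inv = np.eye(3)
    W_inv = pvm_update(W_inv, np.array([1.0, -1.0, 0.5]))
    print(W_inv)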
Fig. 5. Quasi-Newton technique: x_{k+1} = x_k + λ_k d_k.
2.3.3. Projected Parallel Variable Metric (PPVM).

Freeman's (1989, 1991) projected parallel variable metric technique (PPVM) also was shown to be highly effective. The major advantage of the PPVM over the PVM is that, by the use of projected vectors, it is possible to overcome the major disadvantage of the PVM's rank one update, which does not maintain positive definiteness. The estimate of the inverse of the approximate Hessian, W, is given by:

(13)

where

$$y = P \Delta g = P(\nabla f(x_{k+1}) - \nabla f(x_k)) \qquad (14)$$

$$z = P^{T} \Delta x = P^{T}(x_{k+1} - x_k) \qquad (15)$$
and P is the projection matrix given by Davidon (1975). Updates are done in n dimensions as in the PVM. This approximation to the inverse Hessian provides a symmetric, positive definite matrix provided that:

$$y^{T} z > 0. \qquad (16)$$
If this is not the case, the PPVM cannot guarantee positive definiteness, and a symmetric rank one update based on the SR1 rank one update (the PVM) is attempted. The algorithm used by the authors did not update the matrix if it did not yield an improving direction. Another feature of the PPVM is that, after the line search phase is complete, another update is attempted using the projected rank-two updating formula described above with:

$$y = P \Delta g = P(\nabla f(x_{k+1}) - \nabla f(x_k)) \qquad (17)$$

and

$$z = P^{T} \Delta x = P^{T}(x_{k+1} - x_k) \qquad (18)$$
since the updating information exists. Therefore, at each iteration, two sets of updates are performed. One is obtained using information from the parallel gradient evaluations, and the other uses information from the gradients in moving from one point to another in the line search phase. The following results from Freeman's work are important to analyze when considering the use of the PPVM in an optimization algorithm for chemical process design:
(1) The PVM is unreliable since it is a rank one update, but may involve less time since there are fewer "housekeeping" chores to perform. When it does succeed, it sometimes takes the same number of iterations as the PPVM.
(2) The use of the PPVM requires fewer iterations than the BFGS, and for some problems this difference is enough to offset the difference in calculation time needed (generally for very complex functions). For simple functions, the BFGS is more efficient.
(3) A decrease in time was seen using Schnabel's (1987) speculative gradient evaluation.
These results come from the use of eight complex unconstrained optimization problems. Thus, the conclusions drawn from these observations are that the PPVM is more efficient than the PVM and BFGS in terms of the number of iterations required for convergence but, in order for a reduction in time to be realized, the function to be minimized and the gradients must be expensive enough to dominate the execution time. As this requirement is generally true of chemical process design, it is likely that the use of the PPVM can show an increase in efficiency in terms of both number of iterations and execution time for these large problems. It should be noted at this time that researchers are limited in obtaining large chemical problems to which these parallel optimization techniques can be applied. At the current time, it is impossible to execute parallel flowsheet simulations using PRO/II or ASPEN PLUS™. Thus, researchers must develop their own modules for large scale problems.

3. TEST SET AND IMPLEMENTATION
Table 5 identifies and characterizes those test problems that were used to evaluate the methods selected for parallel nonlinear optimization. In addition to these test problems, three "chemical engineering" type problems were chosen to be used for the evaluation of the parallel optimization routines (see Table 6).
Table 5. Test problem specifications: number of variables, equality constraints, inequality constraints and bounds for test problems 1-13 (sources: Dayde, 1989; Himmelblau, 1972).
These are a distillation problem (DIST) given in Edgar and Himmelblau (1988), an alkylation problem (ALK) proposed by Bracken and McCormick (1968), which was studied by Vasantharajan and Biegler (1987), Vinante and Valladares (1985), Berna et al. (1980), and Westerberg and Debrosse (1973), and the Williams-Otto plant problem (WOP), which was studied by Vasantharajan and Biegler (1987), Vinante and Valladares (1985), Christensen (1970) and di Bella and Stevens (1965). All of these problems involve functions that are single valued and use continuous variables. Table 7 lists the experiments that were performed with the various options for parallelism (High, 1991).
Table 6. Small scale chemical engineering problem specifications (number of variables, equality and inequality constraints, and bounds) for DIST (Edgar and Himmelblau, 1988; 4 variables), ALK (Vasantharajan and Biegler, 1987; 10 variables) and WOP (Vasantharajan and Biegler, 1987; 11 variables).
A constrained optimization algorithm using the BFGS as the Hessian evaluation technique (of the Lagrangian function), which did not include any parallelism in the line searching, gradients and local minimizations, was developed. Using this algorithm, it was possible to determine the effect of several parameters and to develop a general algorithm for the comparison of the various parallel techniques. Once this algorithm was developed, the parallel features were added to ascertain the potential benefits. These parallel features included:
(1) Parallelism in the Hessian evaluation of the Lagrangian function: this algorithm replaced the BFGS method for determining the Hessian with either the PVM, PPVM or PH approximations.
(2) Parallelism in the gradient/speculative gradient: parallel function evaluations were performed either after the line search (gradient) or during the line search (speculative gradient) for the gradient calculation, to be used to determine if a minimum had been reached or in the next iteration. A BFGS inverse Hessian approximation was used in the quasi-Newton approach.
(3) Parallelism in the line search: function evaluations were simultaneously determined at various step lengths in the direction determined using a quasi-Newton approach with the BFGS inverse Hessian approximation.
(4) Global optimization: several minimizations, using the BFGS update for the Hessian and with all other operations sequential, were executed in parallel.
The first algorithms to be tested were those described in Step 1 above on the 13 test problems.
Table 7. Experiments performed: for each test problem (1-13, DIST, ALK and WOP), the table marks which parallel options were investigated, namely the Hessian update (BFGS, PVM, PPVM, PH), Hessian evaluation, gradient/speculative gradient (BFGS), line search (BFGS) and global minimization (BFGS).
The BFGS sequential algorithm was then applied to the three chemical engineering problems. Because it was found that, for specific problems, the function calculation time was long, the numerical gradient calculation time was long, or some line searching was being accomplished, the other three options of parallelism (2, 3 and 4 above) were used on those problems only. Several other points must be considered when developing algorithms for nonlinear optimization. These include: (1) the convergence criteria to use; and (2) the method of determining initial guesses. The criteria developed for these constrained problems were that the norm of the gradient of the Lagrangian function (which incorporates the function and the constraints into one function) as well as the norm of the active constraint vector must be less than a specified user input ε value for convergence to occur. The active constraints were all equality constraints and those inequality constraints that were violated or binding (equal to zero). In these algorithms, the constraints were not scaled and the ε value was taken to be 1 × 10⁻⁴. In all experiments, the initial guesses provided by the references were used. In the global optimization routine, more initial guesses were examined. The major focus of the research is to determine the effectiveness of the parallel optimization routines discussed. Those parts of the algorithm that involve sequential operation and have been studied extensively by other researchers were not emphasized. However, at as many points as possible, the best sequential mode for a particular part of the algorithm was selected. For example, to perform the various linear algebra equations in the algorithm, LINPACK and BLAS routines were used. To perform the parallel portions of the various function and gradient evaluations, the SCHEDULE utility (Dongarra and Sorensen, 1987) was used on an eight-processor Alliant FX/8 at Argonne National Laboratories. The parallel tasks could also have been accomplished using various implementations of parallel FORTRAN that are available. Other parallel computers, such as Cray supercomputers, could have been used to perform this work as it is not architecture dependent. For more details on implementation, see High (1991).
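A minimal sketch of the convergence test described above is given below. The helper names are hypothetical, the inequality constraints are assumed to be written in the "g(x) ≥ 0 is feasible" form, and the gradient and constraint values would in practice come from the optimization code.

# Hypothetical sketch of the convergence criterion used here: both the norm of
# the Lagrangian gradient and the norm of the active constraint vector must
# fall below the user-supplied tolerance epsilon.
import numpy as np

def converged(grad_lagrangian, equality_values, inequality_values, eps=1e-4):
    # Active constraints: all equalities plus violated or binding inequalities
    # (assuming inequalities are feasible when their value is positive).
    active = list(equality_values) + [c for c in inequality_values if c <= 0.0]
    return (np.linalg.norm(grad_lagrangian) < eps
            and np.linalg.norm(np.array(active)) < eps)

# Example: small Lagrangian gradient, equalities nearly satisfied, and one
# inactive inequality (positive value, so it is ignored).
print(converged(np.array([1e-5, -2e-5]), [3e-5], [0.7]))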
4. NUMERICAL RESULTS
4.1. Parallel gradient evaluations for variable metric techniques

Four algorithms for the determination of the inverse Hessian matrix of the Lagrangian function
were examined. These included the PVM, PPVM and the PH for the parallel algorithms, and the BFGS for the sequential algorithm. The results using the various algorithms are listed in Table 8. The PH and the PVM were not successful on all runs. The number of iterations and times for all of the methods are competitive, however, when the PH and the PVM were successful. It appears that the PVM rank one update is not as successful as the PPVM's rank two update procedure. One of the reasons that the PH may have encountered difficulty might have been at points where the numerical Hessian was unable to be inverted and the previous matrix had to be used (or the steepest descent direction employed). In general, as indicated in Table 8, the PPVM was the most efficient in terms of iterations and the BFGS was the most efficient in terms of computational time. For larger problems with expensive function evaluations, it is expected that the PPVM will be most efficient in terms of iterations and time. As noted before, with current computing and software capabilities, it is difficult to develop large scale chemical engineering optimization problems that can be run in parallel. This situation will continue until it is possible to easily interface a process flowsheet simulator with parallel capabilities. Figures 6, 7 and 8 show the functional values, active constraint norm, and Lagrangian function gradient norm versus iterations for Problem 4 using the PPVM update and the BFGS update. This representative problem was selected to show the typical progression of these values during the course of the problem computation. For Problem 4 (as was generally the case), the PPVM was the most efficient in terms of number of iterations. Figure 6 shows the values of the objective function of Problem 4 for all iterations. For Problem 4, the functional value starts out at 0. Although the function value is low at this point, the constraints are violated, making the Lagrangian function value high. The function reaches a maximum before decreasing in value to the final value of 3.401 × 10⁵.
Table 8. Summary of results with various Hessian update algorithms in local minimization routine

                               BFGS     PVM      PPVM     PH
For problems 1-13
  Ave # of iterations          31.?     -        24.46    -
  Ave time (s)                 0.82     -        1.18     -
For 1-7, 9-13
  Ave # of iterations          32.17    32.00    24.56    -
  Ave time (s)                 0.7?     1.11     1.16     -
For 1-5, 9, 10, 12
  Ave # of iterations          37.25    32.25    25.63    29.25
  Ave time (s)                 0.83     1.17     1.32     0.91
At this point the Lagrangian and penalty functions are minimized. The norm of the active constraints (Fig. 7) decreases gradually until the final few iterations and then decreases dramatically. The value of the gradient of the Lagrangian function norm (Fig. 8) oscillates at relatively high values until finally precipitously dropping to a value below 1 × 10⁻⁴ (again, the value for convergence, ε). It is the norm of the gradient of the Lagrangian function convergence criterion (Fig. 8) that is the last to be satisfied.

4.2. Speculative and numerical gradient evaluations
Fig. 6. PPVM and BFGS function value for Problem 4.

Fig. 7. PPVM and BFGS active constraint norm value for Problem 4.

Table 9 summarizes the results for performing the partial speculative gradient algorithm as described in Section 2.2. At each λ (step length) value during the line search phase, all available processors are used to calculate functional values at the current point x and at small perturbations from x for use in the numerical gradient evaluation. The number of simultaneous function evaluations equaled the number of parallel processors available. If the function value was appropriately minimized at this step length, the rest of the function evaluations needed for the gradient evaluation were performed. This number of extra "clean-up" function evaluations is equal to the dimension of the problem plus one (n + 1) minus the number of processors. This method does not change the point (x) at each iteration or decrease the total number of iterations, but potentially will allow for an acceleration in the calculation time. In addition to the chemical process problems, Problem 10 was chosen to be examined because of its larger required computation time as compared to the other test problems during the local minimization routines (described in Section 4.1). As seen in Table 9, the use of the speculative gradient algorithm requires more time than the sequential version for all four of the problems. At each λ, there are more calculations (the number of function evaluations is equal to the number of processors used) than there were in using the sequential version (only one function evaluation), but in performing these in parallel there should not be an increase in computational time seen. However, the computation time in using the speculative gradient was larger than for the sequential algorithm. This is mainly because the overhead time required to set up the parallel tasks is large relative to the time for function evaluations and the other calculations for the optimization routine. For those situations where the number of processors is less than the number of function evaluations necessary for the gradient evaluation, SCHEDULE will again have to be invoked after the line search phase to calculate the extra functions required. This would create extra overhead. The dimensions of problem 10, ALK, and WOP are greater than the eight processors that are available, so SCHEDULE would have to be invoked after the line search phase. For problem DIST, however, when the number of processors is five there is no need for the clean-up step since all of the functions for the gradient could simultaneously be calculated during the line search. A drop in the time for the algorithm is seen in using five processors instead of four. For all problems, using more than one processor with the speculative gradient algorithm allows for speedup as compared to using one processor. However, a leveling off effect is seen where increasing the
number of processors does not decrease the computation time. It may be that, at this point, the inefficiencies created when invoking extra processors equal the extra time gained in using more processors to perform the computations. These inefficiencies suggest that the best use of speculative gradient searches may come with the use of large problems. Table 10 explores various algorithms for determining the numerical gradient of the function. The results presented in the second column are for the speculative gradient as presented in Table 9 for problem WOP. Speculative function values at perturbations from x are determined for future use in the numerical gradient evaluation. This problem (WOP) was examined because it required the most
computational time of any of the problems. Another algorithm was used where the speculative gradient is performed only when λ is 1 (i.e. at a full Newton step), and the results are presented in the third column of Table 10. The fourth column results are for the situation where parallel functions are not calculated until the line search phase is over. This is a traditional use of parallel processing for gradient evaluation. As the number of times SCHEDULE needs to be invoked is reduced (for any number of processors), the computation time decreases (from column to column). For each algorithm (each column), a reduction in computation time is seen with a leveling off effect, as was seen in Table 9.
Fig. 8. PPVM and BFGS Lagrangian function gradient norm value for Problem 4.
Table 9. Results with partial speculative gradient algorithms

Test problem #                        10      DIST    ALK     WOP
Dimension                             10      4       10      11
Sequential algorithm¹ time (s)        2.26    0.58    2.67    9.32
Parallel algorithm² time (s) using:
  1 PROCESSOR                         5.66    1.18    6.36    21.50
  2 PROCESSORS                        5.06    0.90    4.32    18.32
  3 PROCESSORS                        4.86    0.88    4.08    17.37
  4 PROCESSORS                        4.74    0.87    3.88    17.37
  5 PROCESSORS                        4.22    0.70    3.85    17.20
  6 PROCESSORS                        4.82    0.70    3.87    17.26
  7 PROCESSORS                        4.80    0.70    3.82    17.53
  8 PROCESSORS                        4.81    0.70    3.89    17.83

¹ No function evaluations for future gradient evaluation at speculative points during line search phase; BFGS update for inverse Hessian.
² Function evaluations for future gradient evaluation at speculative points during line search phase; BFGS update for inverse Hessian.
For the sequential version of WOP, 9.32 s of computation time were used. In using the traditional parallel gradient evaluation technique (fourth column of Table 10), using one processor for sequential operation with SCHEDULE creates an extra 7.55 s of computational effort, for a total of 16.87 s. This is an 81% increase in time over the sequential version. These two algorithms use identical numbers of calculations. Thus, it can be seen that the overhead time for setting up parallel tasks is not insignificant. The 7.55 s of overhead give an idea of the magnitude of the overhead. The results of this part of the study may appear discouraging. However, the authors are confident that with much larger problems, the parallel speculative and parallel gradient evaluations will be shown to be advantageous.

4.3. Parallel line search calculation

Table 11 presents the results for using a parallel line search strategy for those problems that had large amounts of line searching (8 and 10 as well as ALK and WOP) during the local optimization routine. During the line search phase, after the Newton
step was determined not to be effective, parallel function evaluations were performed at the five preselected step length λ values (or eight for problem 8, as there was failure with five). Unfortunately, all of the problems required more iterations and computation time with the parallel line search than with the sequential version using quadratic interpolation. These results show the high effectiveness of the quadratic interpolation scheme. Also, the number of function evaluations per iteration needed for the problems studied is not enough to warrant parallel processing (a maximum of 2.785 function evaluations per iteration for WOP with quadratic interpolation). Nevertheless, an effective strategy was developed that can be used for large-scale problems where much line searching is needed. Typical chemical process design problems may not require extensive line searching, so this method does not seem appropriate for further consideration unless deemed necessary.
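A minimal sketch of this parallel line search strategy is given below in Python. The five step lengths shown are illustrative placeholders (the preselected values used in the study are not listed here), and the merit function f, current point x and search direction d are assumed names, not the authors' implementation.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def parallel_line_search(f, x, d, steps=(0.1, 0.25, 0.5, 0.75, 1.0), workers=5):
        # Evaluate the merit function at several preselected step lengths
        # along the search direction simultaneously and keep the best one.
        trial_points = [x + lam * np.asarray(d) for lam in steps]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            values = list(pool.map(f, trial_points))
        best = int(np.argmin(values))
        return steps[best], trial_points[best], values[best]

With quadratic interpolation the sequential code needed at most about 2.8 function evaluations per iteration on these problems, which is why the parallel variant does not pay off at this scale.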
Table 10. Results of Williams-Otto plant problem WOP with various parallel gradient algorithms

                                      Speculative gradient   Speculative gradient   Parallel
                                      at all λ values        at λ = 1               gradients
Sequential algorithm time (s)         9.32                   9.32                   9.32
Parallel algorithm time (s) using:
  1 processor                         21.50                  18.35                  16.87
  2 processors                        18.32                  15.67                  13.86
  3 processors                        17.37                  14.72                  12.72
  4 processors                        17.37                  14.40                  12.55
  5 processors                        17.20                  14.09                  12.45
  6 processors                        17.26                  13.89                  12.27
  7 processors                        17.53                  14.06                  12.19
  8 processors                        17.83                  14.19                  12.16
Table 11. Results with parallel line search algorithms

Test problem #                         8              10         ALK        WOP
# Sequential algorithm iterations¹     23             82         83         257
# Function evaluations²                42             129        173        716
Sequential time (s)                    1.2635         2.2597     2.6720     9.3189
Function value                         -47.7611       3.6131     -1.7650    -99.3481
# Iterations for parallel algorithm³   54             82         85         934
# Function evaluations²                281 (8 proc)   313        366        4775
Time for 1 processor (s)               3.1531         2.8035     5.7295     43.0057
Time for 5 processors (s)              2.8035         2.6604     3.1816     37.2014

¹ Function evaluations performed sequentially at each speculative point during a quadratic line search.
² Including those for line searching and numerical gradient evaluation.
³ Function evaluations performed in parallel at five predetermined points (or eight for problem 8; failure occurred with five parallel function evaluations) if the function is not minimized enough at λ = 1.
4.4. Global optimization

The final effort of this work involved the use of global optimization, in which simultaneous local minimization searches (using the BFGS updating formula) were conducted from various initial points. The results are shown in Table 12. Problem 11 was chosen because none of the Hessian updating techniques (local minimization techniques) were able to find the published minimum (-0.8660) from the given starting point (Dayde, 1989). It was hoped that using initial guesses different from the given starting point would allow for the solution. Unfortunately, this was not the case. The authors believe this value is a local minimum of the Lagrangian function; however, it is possible that there was premature termination of the code. This is doubtful, as the code was specifically written to stop at a point where the gradient of the Lagrangian function and the active constraints were minimized. It was possible to find a lower minimum (-0.5000) using starting Points 1, 6 and 7 than was found in the local minimization routine (-0.4330). For the DIST problem, from starting point Number 4 (the one used in the local minimization routines), 25 iterations were required. However, from starting point 6, 15 iterations were needed to reach the same minimum, and from starting point 8, 20 iterations were needed. This helps to show that the solution 47199.9951 is probably the global one. The local minimizations from the other starting points were either not complete after 100 iterations or had been declared failures. The next problem examined with the global optimization routine was the ALK problem. Minimizations from starting points 1, 2, 5, 6 and 7 found a local minimum at -1.5475.
Table 12. Results of global minimization

Problem                          11            DIST          ALK         WOP
Starting point #
1  # iterations                  42            100           63          0
   function value                -0.5000       Not done      -1.5475     Failure
   time¹ (s)                     0.89          1.76          1.89        0.44
2  # iterations                  11            53            84          0
   function value                -0.4330       Failure       -1.5475     Failure
   time¹ (s)                     0.22          0.98          2.06        0.34
3  # iterations                  27            100           27          0
   function value                -0.4330       Not done      -0.4540     Failure
   time¹ (s)                     0.51          1.58          0.62        0.34
4  # iterations                  27            25            22          257
   function value                -0.4330       47199.9951    -1.3398     -99.3481
   time¹ (s)                     0.52          0.27          0.50        7.14
5  # iterations                  8             58            71          674
   function value                -3.2692E-6    Failure       -1.5475     -99.3481
   time¹ (s)                     0.14          0.92          2.05        17.79
6  # iterations                  24            15            89          0
   function value                -0.5000       47199.9951    -1.5475     Failure
   time¹ (s)                     0.45          0.20          2.26        0.34
7  # iterations                  62            100           65          0
   function value                -0.5000       Not done      -1.5475     Failure
   time¹ (s)                     1.26          1.58          1.73        0.34
8  # iterations                  36            20            88          0
   function value                Failure       47199.9951    -1.7650     Failure
   time¹ (s)                     0.84          0.29          1.26        0.34

Local min time²                  0.84          0.58          2.67        9.32
Global min time, 1 processor³    4.86          7.63          12.42       27.12
Global min time, 8 processors³   1.64          3.14          3.10        19.04

¹ Time for this individual minimization to complete (part of the entire minimization routine).
² Time for entire local minimization routine to complete.
³ Time for entire minimization routine to complete.
However, when the global minimization routine was started at Point 8, a lower minimum (-1.7650) was found. For problem WOP, six of the initial points produced failures. The starting point (#4) used in the local minimization routines found the minimum in 257 iterations. The same minimum was found from starting point number 5 in 674 iterations.

In running the local minimization algorithms, x and function values, as well as other quantities, were written to an output file. This created extra overhead. When performing the global minimization routine, output was not written to a file, since input/output is not allowed while performing parallel tasks. This also made the algorithm more efficient. For Problem 11, the local minimization routine required 0.84 s, while in the global routine the same minimization from the same starting point required only 0.52 s; the input/output created an extra 0.32 s of overhead. Many sequential routines use reading/writing procedures that would have to be eliminated in moving to a global minimization format. The time for one processor is slightly larger than the sum of the times for the eight individual jobs because the time to set up the parallel tasks is not accounted for in the individual jobs. For Problem 11, in roughly twice the time it took to perform the local minimization (with I/O), it was possible to perform eight minimizations (without I/O) using eight processors. Using the time required for starting point 4 as an indication of the time the local minimization would have required without I/O (0.52 s), in roughly three times that computational time it was possible to analyze seven other initial guesses using eight processors. This phenomenon would have a greater impact with larger problems, as the overhead time to set up the parallel tasks would be small compared to the local minimization time. More sophisticated global optimization approaches may show greater benefits.

This global optimization algorithm shows some exciting possibilities for ensuring robustness. When a good initial guess is not known, it is possible to divide the bounded region into small subsections and to perform local minimizations over these regions without an undue increase in computational time.
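A minimal Python sketch of this multistart idea, under the assumption that each worker runs an independent quasi-Newton local minimization and performs no file I/O, is given below. SciPy's BFGS minimizer stands in for the SQP/BFGS code used in this work, and the coarse grid of starting points is only an illustrative way of subdividing the bounded region.

    import numpy as np
    from functools import partial
    from concurrent.futures import ProcessPoolExecutor
    from scipy.optimize import minimize

    def parallel_multistart(f, bounds, points_per_dim=2, workers=8):
        # Build a coarse grid of starting points over the bounded region.
        axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in bounds]
        grid = np.meshgrid(*axes)
        starts = np.stack(grid, axis=-1).reshape(-1, len(bounds))
        # Each worker runs an independent local minimization (no file I/O).
        local_min = partial(minimize, f, method="BFGS")
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(local_min, starts))
        return min(results, key=lambda r: r.fun)  # best local minimum found

The attraction is robustness: with enough processors, several initial guesses can be examined in little more than the time of a single local minimization.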
5. CONCLUDING REMARKS

To establish a basis for evaluation of parallel nonlinear optimization techniques, a sequential optimization algorithm was developed that successfully solved the test problems. The minima found for 11 of the 13 problems were the same as the published optima. This conclusion can be expanded as follows:

(1) The equality-constrained quadratic programming (EQP) formulation was an effective method for solution of the constrained problems using the Bartholomew-Biggs (1980, 1982) algorithm for successive quadratic programming.
(2) The incorporation of the BFGS (Broyden, Fletcher, Goldfarb, Shanno) inverse Hessian updating formula into the algorithm was successful.
(3) The quadratic interpolation scheme developed by Powell (1978) and the watchdog technique of Chamberlain et al. (1979) were successful strategies for the line search procedure.
(4) LINPACK and BLAS routines for various linear algebra operations were incorporated into the algorithms.

Parallel nonlinear optimization has been shown to be a potentially effective tool for chemical process design:

(1) The Alliant FX/8 was an appropriate machine that allowed for parallel computations. The SCHEDULE commands were effectively included in the algorithms and the utility was successful at queuing parallel tasks.
(2) The algorithm that incorporated the PPVM formulation was successful at solving all of the problems. The PVM and PH algorithms were unable to solve all of the problems, but were competitive in iterations and computation time with the PPVM and BFGS algorithms on the problems they did solve. When a minimum was obtained with all of these procedures, it was the same as with the BFGS.
(3) The BFGS algorithm was the fastest in terms of computational time.
(4) The PPVM was the most efficient in terms of number of iterations.
(5) The techniques developed for speculative gradient (Schnabel, 1987) and numerical gradient evaluation were successfully incorporated into the EQP algorithm. The amount of overhead for the parallel tasks was greater than the time savings from parallelism for the small-scale problems, increasing the computation time relative to the sequential algorithm.
(6) The parallel line search technique was successfully introduced into the EQP formulation. However, the technique was not as efficient as the quadratic line search technique of Powell (1978).
(7) The global optimization strategy using multiple simultaneous minimizations was the most effective strategy developed for the small-scale problems, as it allowed several initial guesses to be investigated at a substantial time savings over several sequential minimizations. This shows the possibility of global optimization allowing for the robust and reliable solution of multiextremal problems. Other, more sophisticated algorithms for global minimization are available and may show even greater potential.
(8) For the benefits of parallel processing to truly be seen, the function evaluations must be expensive enough to overcome the extra calculations needed for some of the techniques, as well as the overhead required to perform parallel tasks.
(9) The incorporation of parallelism to perform simultaneous function and gradient evaluations within the optimization subroutine, as opposed to parallelism within the function and gradient subroutines themselves, eliminated the need to develop problem-specific programs. The algorithms that were developed are applicable to all types of chemical engineering optimization problems.

The most important conclusion of this work is that there are several opportunities for parallelism in nonlinear optimization that can effectively be used for chemical process design research. The major objective of this work was to bring parallel optimization strategies to chemical engineering researchers that are not problem specific. These strategies take into account that the function and gradient evaluations tend to be among the most expensive calculations. This work shows the potential for parallel processing to reduce the computation time associated with large-scale optimization problems. This potential will be realized when it is possible to run parallel ASPEN PLUS™ or PRO/II simulations, or when researchers can incorporate these parallel algorithms into their own simulation software and run them on parallel machines and supercomputers. This work also represents one of the first times that the PPVM was incorporated into a constrained algorithm. Additionally, global optimization shows potential for improving the robustness of minimization routines.
Acknowledgements-Acknowledgement is made to the donors of the Petroleum Research Fund, administered by the American Chemical Society, for partial support of this research.
NOMENCLATURE
c = scalar (small)
d = direction vector
e = elementary vector (i.e. [1, 0, 0, 0]^T)
f(x) = objective function evaluated at x
g = partial derivative of the objective function (with respect to a given variable)
g(x) = gradient of the objective function evaluated at x
H(x) = Hessian (second-derivative) matrix (metric) of the objective function (or Lagrangian) evaluated at x
N = number of divisions for global optimization
n = dimension of the objective function
P = projection vector
p = scalar (small)
r = residual vector
W(x) = estimate of the Hessian matrix evaluated at x
x = vector of independent variables
y = vector used in PPVM
z = vector used in PPVM

Greek letters
Δ = difference operator
ε = scalar for convergence determination
λ = step length
∂ = partial derivative operator
∇ = gradient operator

Subscripts
i = direction indicator
j = direction indicator
k = iteration counter

Superscripts
-1 = inverse
T = transpose

REFERENCES
Bartholomew-Biggs M. C., Recursive quadratic programming based on penalty functions for constrained minimization. Nonlinear Optimization Theory and Algorithms (L. C. W. Dixon, E. Spedicato and G. P. Szego, Eds). Birkhäuser, Boston (1980).
Bartholomew-Biggs M. C., Recursive quadratic programming for nonlinear constraints. Nonlinear Optimization 1981 (M. J. D. Powell, Ed.). Academic Press, London (1982).
Berger Ph., M. Dayde and J.-C. Dunyach, Evaluation and comparison of parallel algorithms in structural optimization. Innovative Numerical Methods in Engineering (R. P. Shaw, J. Periaux, A. Chaudouet, J. Wu, C. Marino and C. A. Brebbia, Eds). Springer, Berlin (1986a).
Berger Ph., M. Dayde and J.-C. Dunyach, Optimum design using a parallel algorithm for nonlinear programming problems. Parallel Computing 85 (M. Feilmeier, G. Joubert and U. Schendel, Eds). Elsevier, Amsterdam (1986b).
Berna T. J., M. H. Locke and A. W. Westerberg, A new approach to optimization of chemical processes. AIChE J. 26, 37-43 (1980).
Bracken J. and G. P. McCormick, Selected Applications of Nonlinear Programming. Wiley, New York (1968).
Broyden C. G., The convergence of a class of double-rank minimization algorithms, parts I and II. J. Inst. Math. Appl. 6, 222 (1970).
Byrd R. H., R. B. Schnabel and G. A. Schultz, Parallel quasi-Newton methods for unconstrained optimization. Math. Prog. 42, 273-306 (1988).
Byrd R. H., C. L. Dert, A. H. G. Rinnooy Kan and R. B. Schnabel, Concurrent stochastic methods for global optimization. Math. Prog. 46, 1-29 (1990).
Chamberlain R. M., C. Lemarechal, H. C. Pedersen and M. J. D. Powell, The watchdog technique for forcing convergence in algorithms for constrained optimization. Presented at the 10th International Symp. on Mathematical Programming, Montreal (1979).
Chen H.-S. and M. A. Stadtherr, SQPHP, a code for successive quadratic programming by the Han-Powell method: theory and performance. Report MAS-183, Department of Chemical Engineering, University of Illinois, Urbana (1983).
Chen H.-S. and M. A. Stadtherr, Enhancements of the Han-Powell method for successive quadratic programming. Computers Chem. Engng 8, 229 (1984).
Chen H.-S. and M. A. Stadtherr, A simultaneous-modular approach to process flowsheeting and optimization: I. Theory and implementation. AIChE J. 31, 1843-1856 (1985a).
Chen H.-S. and M. A. Stadtherr, A simultaneous-modular approach to process flowsheeting and optimization: II. Performance on simulation problems. AIChE J. 31, 1857-1867 (1985b).
Chen H.-S. and M. A. Stadtherr, A simultaneous-modular approach to process flowsheeting and optimization: III. Performance on optimization problems. AIChE J. 31, 1868-1881 (1985c).
Chen Y.-T., A parallel computation study of simultaneous modular chemical process flowsheeting on the Cray X-MP. M.S. Thesis, The Pennsylvania State University, University Park (1988).
Christensen J. H., The structuring of process optimization. AIChE J. 16, 177-184 (1970).
Comer D., Internetworking with TCP/IP: Principles, Protocols, and Architecture. Prentice-Hall, New York (1988).
Coon A. B. and M. A. Stadtherr, Parallel implementation of sparse LU decomposition for chemical engineering applications. Draft Report, Department of Chemical Engineering, University of Illinois, Urbana (1988).
Davidon W. C., Optimally conditioned optimization algorithms without line searches. Math. Prog. 9, 1-30 (1975).
Dayde M., Parallel algorithms for nonlinear programming problems. J. Opt. Th. Appl. 61, 23-46 (1989).
di Bella C. W. and W. F. Stevens, Process optimization by nonlinear programming. Ind. Eng. Chem. Proc. Des. Dev. 4, 16-20 (1965).
Dixon L. C. W. and G. P. Szego, The global optimization problem: an introduction. Towards Global Optimization 2 (L. C. W. Dixon and G. P. Szego, Eds). North-Holland, Amsterdam (1978).
Dixon L. C. W., K. D. Patel and P. G. Ducksbury, Experience running optimisation algorithms on parallel processing systems. System Modelling and Optimization, Proceedings of the 11th IFIP Conference (P. Thoft-Christensen, Ed.). Springer, Berlin (1984).
Dongarra J. J. and D. C. Sorensen, SCHEDULE user's guide. Argonne National Laboratory, Argonne (1987).
Edgar T. F. and D. M. Himmelblau, Optimization of Chemical Processes, pp. 386-387. McGraw-Hill, New York (1988).
Fletcher R., A new approach to variable metric algorithms. Comput. J. 13, 317 (1970).
Freeman T. L., A parallel unconstrained quasi-Newton algorithm and its performance on a local memory parallel computer. Appl. Num. Math. 7, 369-378 (1991).
Freeman T. L., Parallel projected variable metric algorithms for unconstrained optimization. NASA Contractor Report CR-181930, Langley Research Center, Hampton (1989).
Goldfarb D., A family of variable metric methods derived by variational means. Math. Comput. 24, 23 (1970).
High K. A., Nonlinear optimization and parallel processing for chemical process design. Ph.D. Thesis, The Pennsylvania State University, University Park (1991).
Himmelblau D. M., Applied Nonlinear Programming. McGraw-Hill, New York (1972).
Lootsma F. A. and K. M. Ragsdell, State-of-the-art in parallel nonlinear optimization. Parallel Comput. 6, 133-155 (1988).
Meyer R. R. and S. A. Zenios, Parallel optimization on novel computer architectures. Ann. Opns Res. 14 (Baltzer, Basel, 1988).
Pardalos P. M., A. T. Phillips and J. B. Rosen, Topics in Parallel Computing in Mathematical Programming. Science Press, New York (1992).
Powell M. J. D., A fast algorithm for nonlinearly constrained optimization calculations. Numerical Analysis, Dundee 1977, Lecture Notes in Mathematics 630 (G. A. Watson, Ed.). Springer, Berlin (1978).
Schnabel R. B., Concurrent function evaluations in local and global optimization. Comput. Meth. Appl. Mech. Engng 64, 537-552 (1987).
Schnabel R. B., Parallel computing in optimization. Computational Mathematical Programming (K. Schittkowski, Ed.). Springer, Berlin (1985).
Shanno D. F., Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24, 647 (1970).
Straeter T. A., A parallel variable metric optimization algorithm. NASA Technical Note D-7329, Langley Research Center, Hampton, VA (1973).
Vasantharajan S. and L. Biegler, Large-scale decomposition for successive quadratic programming. Engineering Design Research Center, EDRC-06-32-87, Carnegie-Mellon University (1987).
Vegeais J. A., Vector and parallel strategies for sparse matrix problems in equation-based chemical process flowsheeting. Ph.D. Thesis, University of Illinois, Urbana (1987).
Vinante C. and E. Valladares, Application of the method of multipliers to the optimization of chemical processes. Computers Chem. Engng 9, 83-87 (1985).
Westerberg A. W. and C. J. Debrosse, An optimization algorithm for structured design systems. AIChE J. 19, 335-343 (1973).
Zitney S. E. and M. A. Stadtherr, Supercomputing strategies for the design and analysis of complex separation systems. Ind. Eng. Chem. Res. 32, 604-612 (1993).