Operations Research Letters 10 (1991) 75-81
North-Holland, March 1991
A tilted cutting plane proximal bundle method for convex nondifferentiable optimization

Krzysztof C. Kiwiel
Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland

Received September 1989
Revised February 1990
A proximal bundle method is given for minimizing a convex function f. It accumulates so-called tilted cutting planes in polyhedral approximations to f that introduce some second-order information and interior point features absent in usual methods. Global convergence of the method is proved.

Keywords: nondifferentiable optimization; convex programming; descent methods; cutting planes
1. Introduction
This paper considers the nondifferentiable optimization (NDO) problem of minimizing a convex function $f: \mathbb{R}^N \to \mathbb{R}$. We assume that at each $x \in \mathbb{R}^N$ we can compute $f(x)$ and an arbitrary subgradient $g(x) \in \partial f(x)$. We present a bundle method of descent that generates a sequence $\{x^k\}_{k=1}^{\infty} \subset \mathbb{R}^N$ converging to a minimizer of $f$ (if any), with $f(x^{k+1}) < f(x^k)$ if $x^{k+1} \neq x^k$, and trial points $\{y^k\}$ at which $f$ and $g$ are calculated. At iteration $k$,

$$y^{k+1} = \arg\min\{\hat f^k(x) + \tfrac{1}{2}u_k|x - x^k|^2 : x \in \mathbb{R}^N\}, \tag{1}$$

where $u_k > 0$ and $\hat f^k$ is a piecewise linear (polyhedral) approximation to $f$ described below. A serious step from $x^k$ to $x^{k+1} = y^{k+1}$ occurs if $y^{k+1}$ is significantly better than $x^k$ in the sense that $f(y^{k+1}) \le f(x^k) + m_L v_k$, where $m_L \in (0, 1)$ is fixed, and

$$v_k = \hat f^k(y^{k+1}) - f(x^k) \tag{2}$$
is the (negative) predicted descent. Otherwise, a null step $x^{k+1} = x^k$ improves the next model $\hat f^{k+1}$ of $f$. The $k$-th approximation to $f$, of the form

$$\hat f^k(x; \theta^k) = \max\{f(x^k) - \alpha(x^k, y^j, \theta_j^k) + (1 - \theta_j^k)\langle g(y^j), x - x^k\rangle : j \in J^k\}, \tag{3}$$
depends on a tilting vector $\theta^k = (\theta_j^k : j \in J^k)$ with $\theta_j^k \in [0, 1-\kappa]$, $\forall j \in J^k \subset \{1, \ldots, k\}$, where $\kappa \in (0, 1]$ and, for $x, y \in \mathbb{R}^N$ and $0 \le \theta < 1$,

$$\alpha(x, y, \theta) = \max\{f(x) - \bar f(x; y, \theta),\ \kappa[f(x) - \bar f(x; y, 0)]\}, \tag{4}$$

$$\bar f(x; y, \theta) = f(y) + (1 - \theta)\langle g(y), x - y\rangle, \quad \forall x \in \mathbb{R}^N. \tag{5}$$
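To fix ideas, here is a minimal Python sketch (ours, not the paper's) of the tilted linearization (5) and the tilted error (4); the oracle functions f and g are assumed to be supplied by the caller:

```python
import numpy as np

def f_bar(x, y, theta, f, g):
    """Tilted linearization (5): f(y) + (1 - theta) * <g(y), x - y>."""
    return f(y) + (1.0 - theta) * np.dot(g(y), x - y)

def alpha(x, y, theta, kappa, f, g):
    """Tilted linearization error (4). The second (kappa) term uses the
    untilted plane and keeps alpha >= 0, so the plane shifted down by
    alpha passes weakly below the point (x, f(x))."""
    return max(f(x) - f_bar(x, y, theta, f, g),
               kappa * (f(x) - f_bar(x, y, 0.0, f, g)))
```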
The hyperplanes of the linear pieces of $\hat f^k(\cdot\,; \theta^k)$ may cut off parts of the epigraph of $f$, $\mathrm{epi}\, f = \{(\xi, x) \in \mathbb{R}^{1+N} : \xi \ge f(x)\}$,
since they are obtained by tilting those of $\bar f(\cdot\,; y^j, 0)$, and possibly shifting them 'downwards' to obtain $\hat f^k(x^k; \theta^k) \le f(x^k)$ due to $\kappa > 0$ in (4) (so that $v_k \le 0$ in (2)). In contrast, the usual bundle methods (see, e.g., the reviews in [6,8]) use the outer cutting plane approximation

$$\hat f^k(\cdot\,; 0) = \max\{\bar f(\cdot\,; y^j, 0) : j \in J^k\} \le f,$$

with $\mathrm{epi}\, \hat f^k(\cdot\,; 0) \supset \mathrm{epi}\, f$ (since $\bar f(\cdot\,; y^j, 0) \le f$). Tilting with $\theta_j^k > 0$ may enable the method to skip visiting successive 'corners' of $\mathrm{epi}\, f$, by going inside $\mathrm{epi}\, f$ in the spirit of interior point methods [1,2,10]. This may speed up its convergence. Another motivation for tilting hyperplanes is given in the following lemma. Its part (i) was proved and parts (ii) and (iii) were hinted at by Tarasov and Popova in [12], who considered $\hat f^k(\cdot\,; \theta^k)$ defined by (3) and (4) with the second max term deleted, i.e.

$$\hat f_P^k(x) = \max\{\bar f(x; y^j, \theta_j^k) : j \in J^k\}, \quad \forall x. \tag{6}$$
Lemma 1. Suppose $f$ is strictly convex quadratic with a minimizer $\bar x$. Then:
(i) $\bar f(\bar x; y, \theta) \le \bar f(\bar x; y, \tfrac{1}{2}) = f(\bar x)$, $\forall y \in \mathbb{R}^N$ and $0 \le \theta \le \tfrac{1}{2}$.
(ii) If $\theta_j^k = \tfrac{1}{2}$, $\forall j \in J^k$, and for some $\hat y$ and $\hat J^k = \{j \in J^k : \bar f(\hat y; y^j, \tfrac{1}{2}) = \hat f_P^k(\hat y)\}$ we have $\mathrm{rank}\{(1, g(y^j)) : j \in \hat J^k\} = N + 1$, then $\hat y = \bar x$.
(iii) If $\theta_j^k = \tfrac{1}{2}$, $\forall j \in J^k$, and $\hat f_P^k$ has a unique minimizer $\hat y$, then $\hat y = \bar x$.

Proof. (i) Do simple calculations with $\bar f(x; y, \tfrac{1}{2}) = f(\bar x) + \tfrac{1}{2}\langle g(y), x - \bar x\rangle$. (ii) Part (i) and the rank condition imply that the linear system $\bar f(x; y^j, \tfrac{1}{2}) = \xi$, $\forall j \in \hat J^k$, has a unique solution $(x, \xi)$ with $x = \bar x$ and $\xi = f(\bar x) = \hat f_P^k(\hat y)$. (iii) $0 \in \partial \hat f_P^k(\hat y)$ implies the rank condition of part (ii). $\square$
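Part (i) is easy to verify numerically. A small check (our sketch, reusing f_bar from the code above) on a random strictly convex quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)      # symmetric positive definite Hessian
x_bar = rng.standard_normal(N)   # minimizer of f

f = lambda x: 0.5 * (x - x_bar) @ A @ (x - x_bar)
g = lambda x: A @ (x - x_bar)    # gradient of f (its unique subgradient)

for _ in range(5):
    y = rng.standard_normal(N)
    # the half-tilted plane interpolates f at the minimizer ...
    assert abs(f_bar(x_bar, y, 0.5, f, g) - f(x_bar)) < 1e-10
    # ... and smaller tilts lie (weakly) below f there
    assert f_bar(x_bar, y, 0.25, f, g) <= f(x_bar) + 1e-10
```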
It is worthwhile to consider a version of Lemma 1 in which the variables are restricted to an affine subspace on which $f$ is quadratic. This suggests that $\hat f^k(\cdot\,; \theta^k)$ with $\theta_j^k = \tfrac{1}{2}$, $\forall j \in J^k$, may employ some second-order information in determining the component of $d^k = y^{k+1} - x^k$ in the ridge subspace [8] at $x^k$, along which $f$ is smooth and nearly quadratic.

A comment on the relation between our approach and that of Tarasov and Popova [12] is in order. They proposed using (6) in a tilted cutting plane nondescent method based on (1) with the quadratic term deleted and $J^k = \{1, \ldots, k\}$, but without the second max term in (4) they had to assume that $\bar f(\bar x; y^j, \theta_j^k) \le f(\bar x)$, $\forall j \ge 1$. They showed that such $\theta_j^k > 0$ exist if the $y^j$ are in a sufficiently small neighborhood of $\bar x$ in which $f$ is smooth and strongly convex, but gave no constructive rules. Hence their method is applicable to very special problems only, and suffers from the usual difficulties of cutting plane methods with unbounded storage.

Our method modifies the one in [5], which stemmed from the pioneering work of Lemarechal [7] and is now considered the most efficient bundle method [11]. Our tilted approximations may be used in other bundle methods [6,8]. In particular, (4) is an $\alpha$-function in the sense of [9], and extensions to nonconvex and constrained cases can easily be developed as in [3,9]. Natural questions arise about the choices of $\theta$ and $\kappa$ (fixed or variable). In our present modification of [5], the analysis is general enough to leave room for (hopefully) efficient implementations; a similar path could be followed for other methods.

The paper is organized as follows. The algorithm is derived in Section 2. Its global convergence is studied in Section 3. Some modifications and extensions are described in Section 4. Section 5 reports some numerical results. Finally, Section 6 concludes.
We denote by $\langle\cdot\,,\cdot\rangle$ and $|\cdot|$, respectively, the usual inner product and norm in $\mathbb{R}^N$. For $\varepsilon \ge 0$, the $\varepsilon$-subdifferential of $f$ at $x$ is defined by $\partial_\varepsilon f(x) = \{p \in \mathbb{R}^N : f(y) \ge f(x) + \langle p, y - x\rangle - \varepsilon, \forall y \in \mathbb{R}^N\}$, and $\partial f = \partial_0 f$.
2. The method

We start by analyzing subproblem (1). We assume that $0 \le \theta_j^k \le \theta_{\max} < 1$, $\forall j \in J^k$, and $\kappa \in (0, 1]$ in (3) and (4). Defining $d^k = y^{k+1} - x^k$ and $v_k$ by (2), we observe that $(d^k, v_k)$ solves the quadratic programming (QP) subproblem

$$\text{minimize } v + \tfrac{1}{2}u_k|d|^2 \text{ over all } (d, v) \in \mathbb{R}^{N+1}$$
$$\text{satisfying } -\alpha(x^k, y^j, \theta_j^k) + (1 - \theta_j^k)\langle g(y^j), d\rangle \le v, \quad \forall j \in J^k, \tag{7}$$
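For illustration only, (7) can be handed to any convex QP solver. A sketch using SciPy's SLSQP, with variables $z = (d, v)$ and hypothetical bundle arrays subgrads (rows $g(y^j)$), alphas and thetas:

```python
import numpy as np
from scipy.optimize import minimize

def solve_subproblem(u_k, subgrads, alphas, thetas):
    """Solve (7): min v + (u_k/2)|d|^2 subject to
    -alpha_j + (1 - theta_j) <g_j, d> <= v for all j in the bundle."""
    m, N = subgrads.shape
    z0 = np.zeros(N + 1)                       # z = (d, v); feasible since alpha_j >= 0
    obj = lambda z: z[N] + 0.5 * u_k * (z[:N] @ z[:N])
    cons = [{'type': 'ineq',                   # SLSQP convention: fun(z) >= 0
             'fun': lambda z, j=j: z[N] + alphas[j]
                    - (1.0 - thetas[j]) * (subgrads[j] @ z[:N])}
            for j in range(m)]
    res = minimize(obj, z0, method='SLSQP', constraints=cons)
    return res.x[:N], res.x[N]                 # (d^k, v_k)
```

A production code would of course use the specialized dual method of [4], which exploits the structure of (7).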
with Lagrange multipliers, say $\lambda_j^k \ge 0$, $j \in J^k$. As in [5], from the Karush-Kuhn-Tucker conditions for (7) we deduce that

$$p^k + u_k d^k = 0, \tag{8}$$

$$-v_k = u_k|d^k|^2 + \hat\alpha_p^k = |p^k|^2/u_k + \hat\alpha_p^k, \tag{9}$$

where $p^k$, $\hat\alpha_p^k \ge 0$ (so that $v_k \le 0$) and $\lambda_j^k$ satisfy

$$(p^k, \hat\alpha_p^k, 1) = \sum_{j \in J^k} \lambda_j^k\big((1 - \theta_j^k)g(y^j),\ \alpha(x^k, y^j, \theta_j^k),\ 1\big).$$
Hence, with the following scaled multipliers,

$$\lambda_{\mathrm{sum}}^k = \sum_{j \in J^k} (1 - \theta_j^k)\lambda_j^k \in [1 - \theta_{\max}, 1], \qquad \hat\lambda_j^k = (1 - \theta_j^k)\lambda_j^k/\lambda_{\mathrm{sum}}^k, \quad j \in J^k,$$

we may define the aggregate subgradient $(\tilde p^k, \tilde\alpha_p^k)$ of [3, Ch. 2] satisfying

$$(\tilde p^k, \tilde\alpha_p^k) = \sum_{j \in J^k} \hat\lambda_j^k\big(g(y^j), \alpha(x^k, y^j, 0)\big),$$

$$f(x) \ge f(x^k) + \langle \tilde p^k, x - x^k\rangle - \tilde\alpha_p^k, \quad \forall x \in \mathbb{R}^N. \tag{10}$$

By construction and (4) with $\kappa > 0$, $\tilde p^k = p^k/\lambda_{\mathrm{sum}}^k$ and

$$(1 - \theta_{\max})|\tilde p^k| \le |p^k| \le |\tilde p^k|, \qquad \tilde\alpha_p^k \le \hat\alpha_p^k/\kappa\lambda_{\mathrm{sum}}^k \le \hat\alpha_p^k/\kappa(1 - \theta_{\max}),$$
so (10), the Cauchy-Schwarz inequality and (9) imply

$$f(x) \ge f(x^k) + \big(\langle p^k, x - x^k\rangle - \hat\alpha_p^k/\kappa\big)/\lambda_{\mathrm{sum}}^k$$
$$\ge f(x^k) - \big(|p^k|\,|x - x^k| + \hat\alpha_p^k/\kappa\big)/(1 - \theta_{\max})$$
$$\ge f(x^k) - \big(|u_k v_k|^{1/2}|x - x^k| + \hat\alpha_p^k/\kappa\big)/(1 - \theta_{\max}), \quad \forall x \in \mathbb{R}^N. \tag{11}$$

We may now state the method in detail.
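Relations (8)-(9) give a cheap consistency check on any solver's output. A sketch (ours) that forms $p^k$ and $\hat\alpha_p^k$ from hypothetical multipliers lam returned for (7), with the bundle arrays as above:

```python
import numpy as np

def aggregate(u_k, lam, subgrads, alphas, thetas):
    """Aggregate the multipliers of (7); lam is assumed to sum to 1."""
    p = (lam * (1.0 - thetas)) @ subgrads      # p^k
    alpha_hat = lam @ alphas                   # hat-alpha_p^k >= 0
    d = -p / u_k                               # (8): d^k = -p^k / u_k
    v = -(p @ p / u_k + alpha_hat)             # (9): v_k <= 0
    return p, alpha_hat, d, v
```

The quantity $|p^k| + \hat\alpha_p^k$ then serves as the optimality measure updated at Step 5 below.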
Algorithm 2.1.

Step 0 (Initialization). Select an initial point $x^1 \in \mathbb{R}^N$, a final optimality tolerance $\varepsilon_{\mathrm{opt}} \ge 0$, an improvement parameter $m_L \in (0, \tfrac{1}{2}]$, an initial weight $u_1 > 0$, a lower bound for weights $u_{\min} > 0$, an upper tilting bound $0 \le \theta_{\max} < 1$, a safeguarding parameter $0 < \kappa \le 1 - \theta_{\max}$, and the maximum number of stored subgradients $M_g \ge N + 2$. Set $y^1 = x^1$, $J^1 = \{1\}$, $\bar f_1^1 = f(y^1)$ and $g^1 = g(y^1)$. Choose an initial tilting coefficient $0 \le \theta_1^1 \le \theta_{\max}$, and set the optimality measure $\varepsilon^1 = \infty$. Set the counters $k = 1$, $l = 0$ and $k(0) = 1$.

Step 1 (Direction finding). Find the solution $(d^k, v_k)$ of (7) and its multipliers $\lambda_j^k$ such that the set $\hat J^k = \{j \in J^k : \lambda_j^k \neq 0\}$ satisfies $|\hat J^k| \le N + 1$. Compute $|p^k|$ and $\hat\alpha_p^k$ from (8) and (9).

Step 2 (Stopping criterion). If $v_k \ge -\varepsilon_{\mathrm{opt}}$, terminate; otherwise, continue.

Step 3 (Descent test). Set $y^{k+1} = x^k + d^k$. If $f(y^{k+1}) \le f(x^k) + m_L v_k$, set $t_k = 1$, $k(l+1) = k + 1$ and increase the counter of serious steps $l$ by 1; otherwise, set $t_k = 0$ (null step). Set $x^{k+1} = x^k + t_k d^k$.

Step 4 (Linearization updating). Select a set $\hat J_k$ such that $\hat J^k \subset \hat J_k \subset J^k$ and $|\hat J_k| \le M_g - 1$, and set $J^{k+1} = \hat J_k \cup \{k+1\}$. Set $g^{k+1} = g(y^{k+1})$, $\bar f_{k+1}^{k+1} = \bar f(x^{k+1}; y^{k+1}, 0)$ and $\bar f_j^{k+1} = \bar f_j^k + \langle g^j, x^{k+1} - x^k\rangle$ for
$j \in \hat J_k$.

Step 5 (Weight updating). Set $\varepsilon^{k+1} = \min\{\varepsilon^k, |p^k| + \hat\alpha_p^k\}$. If $x^{k+1} \neq x^k$, select $u_{k+1} \in [u_{\min}, u_k]$; otherwise, either set $u_{k+1} = u_k$, or choose $u_{k+1} \in [u_k, 10u_k]$ if $\alpha(x^k, y^{k+1}, 0) > \max\{\varepsilon^{k+1}, -10v_k\}$.

Step 6 (Linearization tilting). Select $\theta_{k+1}^{k+1} \in [0, \theta_{\max}]$. If $x^{k+1} \neq x^k$, choose $\theta_j^{k+1} \in [0, \theta_{\max}]$ for $j \in \hat J_k$; otherwise, set $\theta_j^{k+1} = \theta_j^k$ for $j \in \hat J_k$.

Step 7. Increase $k$ by 1 and go to Step 1.

A few comments on the method are in order (see [5] for details). Step 1 may use the dual QP method of [4]. Step 2 is justified by the optimality estimate (11). Note that the $y^j$ need not be stored, since

$$\bar f(x; y, \theta) = (1 - \theta)\bar f(x; y, 0) + \theta f(y) \quad \text{and} \quad \bar f_j^k = \bar f(x^k; y^j, 0)$$

at Step 4. Step 5 uses the safeguard of [5]. Its essence is that $u_k$ can grow to $\infty$ only if $x^k$ approaches some $\bar x \in X$, whereas uniform strong convexity in (1), i.e. $u_k \ge u_{\min}$, ensures convergence of $x^k$ to a minimizer of $f$, if any. One may choose $u_{k+1}$ as in [5] via safeguarded quadratic interpolation to estimate the curvature of $f$ between $x^k$ and $y^{k+1}$. The rules of Step 6 will be justified by our convergence analysis; it would still go through if the $\theta_j^k$ converged to a common limit over infinitely many consecutive null steps.
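Putting the pieces together, here is a deliberately simplified Python sketch of Algorithm 2.1 (fixed weight $u_k \equiv u$ and fixed tilts $\theta_j^k \equiv \theta$, no subgradient selection and none of the Step 5/6 safeguards; alpha and solve_subproblem are the earlier sketches). It is meant only to make the control flow concrete:

```python
import numpy as np

def tilted_bundle(f, g, x0, u=1.0, theta=0.25, kappa=0.5,
                  m_L=0.1, eps_opt=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    ys = [x.copy()]                              # trial points y^j
    subgrads = [np.asarray(g(x), dtype=float)]   # subgradients g(y^j)
    for _ in range(max_iter):
        # Tilted errors alpha(x^k, y^j, theta) at the current center.
        alphas = np.array([alpha(x, y, theta, kappa, f, g) for y in ys])
        thetas = np.full(len(ys), theta)
        d, v = solve_subproblem(u, np.array(subgrads), alphas, thetas)
        if v >= -eps_opt:                        # Step 2: stopping criterion
            return x
        y_new = x + d                            # Step 3: trial point
        if f(y_new) <= f(x) + m_L * v:           # descent test
            x = y_new                            # serious step
        # on a null step x is kept; either way the bundle is enriched
        ys.append(y_new)
        subgrads.append(np.asarray(g(y_new), dtype=float))
    return x
```

Note that storing the $y^j$ and re-evaluating alpha from scratch wastes oracle calls; the paper's Step 4 instead updates the stored linearization values $\bar f_j^k$ recursively via the identity displayed above, so the trial points never need to be kept.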
3. Convergence
In this section we show that $x^k \to \bar x \in X := \mathrm{Argmin}\, f$ if $X \neq \emptyset$. We assume, of course, that the tolerance $\varepsilon_{\mathrm{opt}} = 0$. Then (11) implies that upon termination $x^k \in X$. Hence we may suppose that the algorithm does not terminate. To save space, we shall only indicate how to modify the analysis in [5]. Consider the following condition:

$$f(x^k) \ge f(\bar x) \quad \text{for some fixed } \bar x \text{ and all } k, \tag{12}$$
which holds if $X \neq \emptyset$ or $\bar x$ is a cluster point of $\{x^k\}$. First, we note that by (11), $\langle p^k, \bar x - x^k\rangle \le \hat\alpha_p^k/\kappa$ if (12) holds, and this suffices for the proof of Lemma 3.1 in [5].

Lemma 2. (i) Let $w_k = \tfrac{1}{2}u_k|d^k|^2 + \hat\alpha_p^k$. Then $v_k \le -w_k \le \tfrac{1}{2}v_k \le 0$. If $k(l) \le k \le n < k(l+1)$, then

$$w_n \le w_k \le (1 - \theta_{k(l)}^{k(l)})^2|g(x^{k(l)})|^2/2u_{\min}.$$

Moreover, if $x^{k+1} = x^k$, then $0 \le w_{k+1} \le w_k - \tfrac{1}{2}u_k|y^{k+2} - y^{k+1}|^2$.
(ii) If (12) holds, then there exists $C < \infty$ such that $\alpha(x^k, y^{k+1}, 0) \le C w_k^{1/2}$, $\forall k$.
Proof. Use (1) and (3) with $\theta_j^{k+1} = \theta_j^k$ if $x^{k+1} = x^k$, as in [5]. $\square$
The following results are crucial for analyzing null steps.

Lemma 3. If $0 < m_L < 1$, $0 \le \theta < 1$, $0 < \kappa \le 1 - \theta$, $v < 0$, and $x$, $y$ and $d = y - x$ in $\mathbb{R}^N$ are such that $f(y) > f(x) + m_L v$, then $-\alpha(x, y, \theta) + (1 - \theta)\langle g(y), d\rangle > m_L v$.

Proof. If $\alpha(x, y, \theta) = f(x) - \bar f(x; y, \theta)$ in (4), then

$$-\alpha(x, y, \theta) + (1 - \theta)\langle g(y), d\rangle = f(y) - f(x) > m_L v.$$

If $\alpha(x, y, \theta) = \kappa\alpha(x, y, 0)$, then, since $\kappa \le 1 - \theta$ and $\alpha(x, y, 0) \ge 0$,

$$-\kappa\alpha(x, y, 0) + (1 - \theta)\langle g(y), d\rangle \ge (1 - \theta)\big[-\alpha(x, y, 0) + \langle g(y), d\rangle\big] = (1 - \theta)\big[f(y) - f(x)\big] > (1 - \theta)m_L v \ge m_L v. \quad \square$$
Lemma 4. If $x^k = x^{k(l)} = \bar x$ for some fixed $l$ and all $k \ge k(l)$, then $w_k \downarrow 0$ and $v_k \to 0$.

Proof. By the algorithm's rules and Lemma 2(i), $u_{k+1} \ge u_k$ and $w_{k+1} \le w_k$, $\forall k \ge k(l)$, and $|y^{k+2} - y^{k+1}| \to 0$. Then $\tfrac{1}{2}|d^k|^2 \le w_k/u_{\min}$ and $y^{k+1} = x^k + d^k$ are bounded. Let $\bar v = \limsup_{k \to \infty} v_k$ and let $K \subset \{1, 2, \ldots\}$ satisfy $v_k \to_K \bar v$. For $k \ge k(l)$, let

$$e_k = -\alpha(x^k, y^{k+1}, \theta_{k+1}^{k+1}) + (1 - \theta_{k+1}^{k+1})\langle g(y^{k+1}), d^k\rangle - v_k.$$

Then, by (1), (2) and (3) with $x^{k+1} = x^k$ and $k + 1 \in J^{k+1}$,

$$e_k + v_k \le \max\{-\alpha(x^{k+1}, y^j, \theta_j^{k+1}) + (1 - \theta_j^{k+1})\langle g(y^j), d^{k+1}\rangle : j \in J^{k+1}\} - (1 - \theta_{k+1}^{k+1})\langle g(y^{k+1}), y^{k+2} - y^{k+1}\rangle,$$

so

$$e_k \le v_{k+1} - v_k + |g(y^{k+1})|\,|y^{k+2} - y^{k+1}|,$$

and $\limsup\{e_k : k \in K\} \le 0$ from the local boundedness of $g$. But by Lemma 3, $e_k > m_L v_k - v_k = (1 - m_L)|v_k|$ with $0 < m_L < 1$, $\forall k \ge k(l)$, so $\bar v = 0$. Then $w_k \downarrow 0$ and $v_k \to 0$ from Lemma 2(i). $\square$

Collecting the preceding results, one may prove the following lemma and theorem as in [5].

Lemma 5. If (12) holds and $\liminf_{k \to \infty} |v_k| = 0$, then $x^k \to \bar x$ for some $\bar x \in X$. $\square$
Theorem 6. Either $x^k \to \bar x \in X$, or $X = \emptyset$ and $|x^k| \to +\infty$. In both cases, $f(x^k) \downarrow \inf f$. $\square$
4. Modifications and extensions
To trade off storage and work per iteration for speed of convergence, one may replace subgradient selection with aggregation as in [5]. Then only one aggregate subgradient and $M_g > 1$ 'ordinary' ones need be stored (in contrast with the previous requirement $M_g \ge N + 2$).

Following [5], one can easily extend our method to the problem of minimizing $f$ on a nonempty closed convex subset $S$ of $\mathbb{R}^N$. To this end, choose $x^1 \in S$ and add $\delta_S(x)$ to $\hat f^k$ in (1), where $\delta_S$ is the indicator function of $S$ (zero on $S$, $+\infty$ outside). By combining our preceding results with those in [5], one may verify Theorem 6 for $X = \mathrm{Argmin}(f + \delta_S)$. Moreover, one can replace $\mathbb{R}^N$ with $S$ in (11). Note that (7) is then augmented with the constraint $x^k + d \in S$, and can be solved by the QP method of [4] if $S$ is described by finitely many linear inequalities.
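For a polyhedral set $S = \{x : Ax \le b\}$ (a hypothetical description, for illustration), the extra constraint $x^k + d \in S$ just adds linear rows to the sketch of (7) given in Section 2:

```python
import numpy as np

def feasibility_cons(x_k, A, b):
    """Rows of A(x^k + d) <= b as SLSQP-style constraints on z = (d, v)."""
    N = len(x_k)
    return [{'type': 'ineq',
             'fun': lambda z, i=i: b[i] - A[i] @ (x_k + z[:N])}
            for i in range(A.shape[0])]
```

These dictionaries are simply appended to the cons list of solve_subproblem; the dual method of [4] handles such rows directly.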
Volume 10, Number 2
OPERATIONS RESEARCH LETTERS
March 1991
Finally, consider the case where $f(x) = \sum_{i=1}^n f_i(x)$ and the $f_i : \mathbb{R}^N \to \mathbb{R}$ are convex, with subgradients $g_i(x) \in \partial f_i(x)$ for $i = 1, \ldots, n$. Then, as in [5], we may use $\hat f^k = \sum_{i=1}^n \hat f_i^k$ in (1), where the $\hat f_i^k$ are constructed by replacing $f$, $g$ and $J^k$ with $f_i$, $g_i$ and $J_i^k$ in (3)-(5), with $\sum_{i=1}^n |J_i^k| \le M_g$ and $M_g \ge N + 2n$; see [5]. This modification increases storage and QP work per iteration, but may converge faster, because $\hat f^k$ can now incorporate many more linear pieces for modelling $f$.
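Under the same hypothetical data layout as before, the disaggregate model leads to a QP with one $v_i$ per component; a sketch of its assembly (ours, not the paper's):

```python
import numpy as np
from scipy.optimize import minimize

def solve_sum_subproblem(u_k, bundles):
    """bundles[i] = (subgrads_i, alphas_i, thetas_i) for f_i;
    variables z = (d, v_1, ..., v_n), objective sum(v_i) + (u_k/2)|d|^2."""
    n = len(bundles)
    N = bundles[0][0].shape[1]
    obj = lambda z: z[N:].sum() + 0.5 * u_k * (z[:N] @ z[:N])
    cons = [{'type': 'ineq',                 # one row per plane of each f_i
             'fun': lambda z, i=i, G=G, a=a, t=t, j=j:
                 z[N + i] + a[j] - (1.0 - t[j]) * (G[j] @ z[:N])}
            for i, (G, a, t) in enumerate(bundles)
            for j in range(G.shape[0])]
    res = minimize(obj, np.zeros(N + n), method='SLSQP', constraints=cons)
    return res.x[:N], res.x[N:]              # (d^k, (v_1, ..., v_n))
```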
5. Numerical results

The purpose of this paper is to establish some general convergence theory, and it will be worthwhile to report extensive tests only after several implementation issues (concerning the choices of $\theta^k$ and $u_k$) have been satisfactorily resolved. We may, however, briefly describe results for three standard test problems obtained by a very rough implementation with fixed $\theta_j^k = 0.25$ and $\kappa = 0.5$. For the Shor minimax problem (see [5]) the method terminated with $k = 26$ and $f(x^k) = 22.600164$. For the MAXQUAD problem [5] we have $k = 33$ and $f(x^k) = -0.8414066$. Finally, for the TR48 problem [5] we got $k = 156$ and $f(x^k) = -638564.9896$. The optimal values are 22.60016, $-0.841408$ and $-638565$, respectively, and we used the stopping criterion $v_k \ge -10^{-6}(1 + |f(x^k)|)$. With the method of [5] we had $k = 29$, 41 and 199, respectively. Of course these results are preliminary, but a comparison with [3,5,11] suggests some potential in the method for developing efficient implementations.
6. Conclusions

We have introduced tilted polyhedral approximations that can be used in bundle NDO methods to provide some second-order information and interior point features. An application to the method of [5] has been discussed, and some encouraging numerical experience has been reported. Implementation issues and more extensive tests will be described elsewhere.
Acknowledgement

This research has been supported by the Polish Academy of Sciences.
References

[1] J.-L. Goffin, "Affine methods in nondifferentiable optimization", CORE Discussion Paper No. 8744, CORE, Université Catholique de Louvain, Louvain-la-Neuve, 1987.
[2] N. Karmarkar, "A new polynomial-time algorithm for linear programming", Combinatorica 4, 373-395 (1984).
[3] K.C. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Lecture Notes in Mathematics 1133, Springer, Berlin, 1985.
[4] K.C. Kiwiel, "A dual method for solving certain positive semidefinite quadratic programming problems", SIAM J. Sci. Statist. Comput. 10, 175-186 (1989).
[5] K.C. Kiwiel, "Proximity control in bundle methods for convex nondifferentiable minimization", Math. Programming 46, 105-122 (1990).
[6] K.C. Kiwiel, "A survey of bundle methods for nondifferentiable optimization", in: M. Iri and K. Tanabe (eds.), Mathematical Programming: Recent Developments and Applications, KTK/Kluwer, Tokyo, 1989, 263-282.
[7] C. Lemarechal, "Nonsmooth optimization and descent methods", Research Report RR-78-4, International Institute for Applied Systems Analysis, Laxenburg, 1977.
[8] C. Lemarechal, "Nondifferentiable optimization", in: G.L. Nemhauser, A.H.G. Rinnooy Kan and M.J. Todd (eds.), Optimization, Handbooks in Operations Research and Management Science, Vol. 1, North-Holland, Amsterdam, 1989, 529-572.
[9] R. Mifflin, "A modification and an extension of Lemarechal's algorithm for nonsmooth minimization", Math. Programming Stud. 17, 77-90 (1982).
[10] N. Megiddo and M. Shub, "Boundary behavior of interior point algorithms in linear programming", Math. Oper. Res. 14, 97-146 (1989).
[11] H. Schramm, Eine Kombination von Bundle- und Trust-Region-Verfahren zur Lösung nichtdifferenzierbarer Optimierungsprobleme, Bayreuther Mathematische Schriften, Heft 30, Bayreuth, 1989.
[12] V.N. Tarasov and N.K. Popova, "A modification of the cutting-plane method with accelerated convergence", in: V.F. Demyanov and D. Pallaschke (eds.), Nondifferentiable Optimization: Motivations and Applications, Lecture Notes in Economics and Mathematical Systems 255, Springer, Berlin, 1986, 284-290.