A note on robust descent in differentiable optimization

Jean-Pierre Dussault
Département d'Informatique, Université de Sherbrooke, Sherbrooke (Québec), Canada J1K 2R1

Article history: Received 27 February 2017; Received in revised form 15 August 2017; Accepted 15 August 2017; Available online 22 August 2017.

Abstract

In this note, we show how to alleviate the catastrophic cancellations that occur when comparing function values in trust-region algorithms. The main original contribution is to successfully adapt the line search strategy of Hager and Zhang (2005) for use within trust-region-like algorithms.

Keywords: Unconstrained nonlinear optimization; Descent algorithms; Numerical accuracy

0. Introduction

We consider descent algorithms for solving

$$ \min_{x \in \mathbb{R}^n} f(x) \qquad (1) $$

where it is understood that we are seeking a local minimum of f, and that f is C²(Rⁿ). We adopt the convention that variables x are column vectors while gradients ∇f(x) are row vectors. Two main algorithm categories have been developed for differentiable unconstrained optimization: line search methods and trust-region algorithms. In line search methods, a search direction d ∈ Rⁿ and a step t > 0 are computed ensuring sufficient decrease of f along d. The direction d is typically required to be a direction of sufficient descent, i.e. ∇f(x)d ≤ −c₁∥∇f(x)∥² and ∥d∥ ≤ c₂∥∇f(x)∥ for some bounded c₁, c₂ > 0. Sometimes, a simple Armijo backtracking procedure is applied to ensure plain descent using a sufficient descent direction d [10]. In trust-region methods [2], a quadratic model q_x(d) = f(x) + ∇f(x)d + ½ dᵗ∇²f(x)d is used to compute a direction ensuring q_x(d) < q_x(0) = f(x). Then, the algorithm assesses whether the step d is accepted based on the ratio r of achieved vs. predicted reduction,

$$ r = \frac{\Delta f(x)}{\Delta q_x(d)} \;\stackrel{\mathrm{def}}{=}\; \frac{f(x) - f(x+d)}{q_x(0) - q_x(d)} . $$

In both approaches, descent must be assessed, i.e.

$$ f(x+d) - f(x) < 0 \qquad (2) $$

is enforced. Non-monotone approaches [6] have been devised to ensure descent periodically instead of at every step; descent has to be enforced in those non-monotone variants as well. Since f(x) − f(x + d) ~ O(∥d∥²), when d is smaller than the square root of the machine precision, usually because x is close to a minimizer or sometimes for some other reason, ∆f(x) may not be computed reliably, being obtained by subtracting two close quantities. This phenomenon is referred to as catastrophic cancellation. It was observed in [4] and [7], and two different strategies were developed to alleviate it. In [4], automatic finite differentiation was used to clean the descent test, an approach useful and tested both for trust-region algorithms and line search variants. In [7], a clever reformulation of the descent test was used to improve the reliability of their line search procedure. We present in this note the adaptation of the reformulation of [7] to trust-region-like algorithms.
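As a concrete illustration of this loss of accuracy, the following Julia sketch (Julia being the language used for the experiments of Section 3) compares the naive difference f(x) − f(x + d) with the gradient-based estimate derived in Section 1 below. The quadratic test function and the step sizes are illustrative choices made here; they are not part of the paper's test set.

```julia
using LinearAlgebra, Printf

# f(x) = ½ xᵀx, so the decrease f(x) - f(x+d) = -(xᵀd + ½ dᵀd) is known in closed form.
f(x)  = 0.5 * dot(x, x)
∇f(x) = x'                       # gradients are row vectors, as in the paper's convention

x = ones(100)
for t in (1e-2, 1e-5, 1e-8, 1e-11)
    d = -t * x                                       # small steepest-descent step
    exact   = -(dot(x, d) + 0.5 * dot(d, d))         # closed-form decrease f(x) - f(x+d)
    naive   = f(x) - f(x + d)                        # difference of two close quantities
    average = -(∇f(x + d) * d + ∇f(x) * d) / 2       # gradient average, cf. formula (3) in Section 1
    @printf("‖d‖ = %.1e   rel. error naive = %.1e   rel. error average = %.1e\n",
            norm(d), abs(naive - exact) / exact, abs(average - exact) / exact)
end
```

As the step shrinks, the naive difference loses roughly one significant digit per decade while the gradient average remains accurate to machine precision.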

1. Reformulation of the descent

The descent condition f(x + d) − f(x) < 0 may be expressed, using the function φ(α) := f(x + αd), as φ(α) − φ(0) < 0. The Armijo criterion f(x + αd) − f(x) ≤ δα∇f(x)d may be rewritten as φ(α) − φ(0) ≤ δαφ′(0). When φ is quadratic (f is quadratic along d), φ(α) = q(α) = ½ a₂α² + a₁α + a₀ with a₂ = dᵗ∇²f(x)d, a₁ = ∇f(x)d and a₀ = f(x). The Armijo condition may then be rewritten as ½ a₂α + a₁ ≤ δa₁, which yields ½(a₂α + a₁) + ½ a₁ ≤ δa₁, i.e., as expressed in [7],

$$ a_2 \alpha + a_1 = q'(\alpha) \le (2\delta - 1)\, q'(0) = (2\delta - 1)\, a_1 . $$


Equivalently, we may write (α/2)(q′(α) + q′(0)) ≤ δα q′(0). Thus, for a quadratic function f, the descent condition may be expressed as

$$ f(x+d) - f(x) = \frac{q'(1) + q'(0)}{2} < 0 . $$

The descent condition when f is not quadratic may be approximated by the formula

$$ 0 > f(x+d) - f(x) \approx \frac{\nabla f(x+d)\, d + \nabla f(x)\, d}{2} \qquad (3) $$

and the approximation should be accurate when d is small.
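To connect with the line search setting of [7], the following Julia sketch shows how the rewritten test q′(α) ≤ (2δ − 1)q′(0) could replace the plain Armijo comparison inside a simple backtracking procedure once f(x) and f(x + td) become numerically indistinguishable. It is an illustrative sketch only: the function name armijo_backtrack, the switching tolerance ϵ_switch, and the backtracking-by-halving loop are choices made here, not the procedure of [7] or of this note.

```julia
# Sketch of Armijo backtracking with a switch to the derivative-based test
# (exact when f is quadratic along d) once function values are too close to compare.
function armijo_backtrack(f, ∇f, x, d; δ = 1e-4, ϵ_switch = 1e4 * eps())
    fx    = f(x)
    slope = ∇f(x) * d                 # φ′(0) = ∇f(x) d, assumed negative (descent direction)
    t = 1.0
    for _ in 1:60                     # at most 60 halvings
        ft = f(x + t * d)
        if abs(ft - fx) > ϵ_switch * abs(fx)
            ok = ft - fx ≤ δ * t * slope                  # plain Armijo test
        else
            ok = ∇f(x + t * d) * d ≤ (2δ - 1) * slope     # rewritten test, robust to cancellation
        end
        ok && return t
        t /= 2
    end
    return t
end
```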

1.1. Accuracy of formula (3)

Since the pth order expansion of φ has an error term of O(∥d∥^{p+1}), assuming f ∈ C³(Rⁿ), we may write the Taylor expansion φ(α) = q(α) + O((α∥d∥)³). Therefore, the degree of approximation of (3) may be expressed as

$$ \varphi(\alpha) - \varphi(0) = q(\alpha) - q(0) + O(\|\alpha d\|^3) = \alpha\, \frac{q'(\alpha) + q'(0)}{2} + O(\|\alpha d\|^3) . \qquad (4) $$

Now, φ′(0) = q′(0) and φ′(α) = ∇f(x + αd)d, while q′(α) = (∇f(x) + αdᵗ∇²f(x))d and

$$ \nabla f(x + \alpha d) = \nabla f(x) + \alpha\, d^{t} \nabla^2 f(x) + O(\|\alpha d\|^2) , $$

so that φ′(α) = q′(α) + O(α²∥d∥³) and

$$ \alpha\, \frac{\varphi'(\alpha) + \varphi'(0)}{2} = \alpha\, \frac{q'(\alpha) + q'(0)}{2} + O(\|\alpha d\|^3) . \qquad (5) $$

Setting α = 1 in (5) and combining with (4) yields

$$ \left| \bigl( f(x+d) - f(x) \bigr) - \frac{\nabla f(x+d)\, d + \nabla f(x)\, d}{2} \right| = O(\|d\|^3) . $$

When d becomes small, roughly of the order of the square root of the machine precision, the expression f(x + d) − f(x) suffers from catastrophic cancellation, but the right-hand side of (3) can still be evaluated accurately.
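The cubic order of this bound is easy to check numerically. The following Julia sketch evaluates the gap between f(x + d) − f(x) and the right-hand side of (3) for a non-quadratic function as ∥d∥ shrinks; the test function and the point x are arbitrary choices made for this illustration and do not come from the paper's test set.

```julia
using LinearAlgebra, Printf

# For a non-quadratic f, the gap between f(x+d) - f(x) and the gradient average
# of formula (3) should behave like O(‖d‖³), i.e. shrink by ~1000 when ‖d‖ is divided by 10.
f(x)  = sum(exp, x) + 0.25 * sum(x .^ 4)
∇f(x) = (exp.(x) .+ x .^ 3)'                 # gradient as a row vector

x  = 0.1 * collect(1.0:5.0)
g  = exp.(x) .+ x .^ 3                       # gradient as a column vector
d0 = -g / norm(g)                            # unit-norm steepest-descent direction
for h in (1e-1, 1e-2, 1e-3)
    d   = h * d0
    gap = abs((f(x + d) - f(x)) - (∇f(x + d) * d + ∇f(x) * d) / 2)
    @printf("‖d‖ = %.0e    |Δf − gradient average| = %.2e\n", h, gap)
end
```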

2. Implementation within a trust-region framework

The actual use of formula (3) involves some trade-off related to the magnitude of d with respect to the machine precision. Inspired by [7], we recommend using (3) when ∆f becomes small, |∆f(x)| ≤ ϵ|f(x)|, and when ∆q becomes small, ∆q(x) ≤ ϵ, where ϵ is a prescribed tolerance. The resulting algorithm is only slightly modified, as seen in Algorithm 1.

Model_Algorithm(x, β, f, m)
    { Given: x; objective function f; quadratic model q; initial value for ∆. }
    repeat
        d ← arg min_{∥d∥ ≤ ∆} q_x(d)
        ∆q ← −(∇f(x) + ½ dᵗ∇²f(x)) d           { = q(0) − q(d) }
        ∆f ← f(x) − f(x + d)
        { Here is the numerical switch to (3) }
        if ( ∆q < ϵ  ∨  |∆f| ≤ ϵ|f(x)| ) then
            ∆f ← −(∇f(x)d + ∇f(x + d)d)/2
        r ← ∆f / ∆q
        if ( r < 0.25 ) then
            ∆ ← ∆/2                             { Unsuccessful }
        else
            x ← x + d                           { Successful }
            if ( r > 0.75 ) then
                ∆ ← 2 ∗ ∆                       { Very successful }
    until ( termination_criterion )
    Result ← x

Algorithm 1: A standard trust-region method for unconstrained optimization.

Observe that

$$ \Delta q = q(0) - q(d) = f(x) - \Bigl( f(x) + \nabla f(x)\, d + \tfrac{1}{2} d^{t} \nabla^2 f(x)\, d \Bigr) = -\Bigl( \nabla f(x) + \tfrac{1}{2} d^{t} \nabla^2 f(x) \Bigr) d $$

is computed without cancellation.
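A compact Julia rendering of the acceptance test of Algorithm 1, with the numerical switch to formula (3), might look as follows. The subproblem solver is abstracted away and only the computation of r = ∆f/∆q is shown; the function name reduction_ratio and the keyword ϵ_switch are illustrative, and the default 10⁴·eps() follows the switching tolerance reported in Section 3.

```julia
using LinearAlgebra

# ∆q = q(0) - q(d) is evaluated directly, without cancellation; ∆f switches to the
# gradient average of formula (3) when the naive difference cannot be trusted.
function reduction_ratio(f, ∇f, H, x, d; ϵ_switch = 1e4 * eps())
    gx = ∇f(x)                                   # row vector
    Δq = -(gx * d + 0.5 * (d' * H * d))          # predicted reduction q(0) - q(d)
    Δf = f(x) - f(x + d)                         # achieved reduction, naive evaluation
    if Δq < ϵ_switch || abs(Δf) ≤ ϵ_switch * abs(f(x))
        Δf = -(gx * d + ∇f(x + d) * d) / 2       # numerical switch to formula (3)
    end
    return Δf / Δq
end

# Inside the loop of Algorithm 1 one would then update x and ∆ from
#   r = reduction_ratio(f, ∇f, ∇²f(x), x, d)
# exactly as in the pseudocode above (halve ∆ if r < 0.25, accept the step otherwise,
# and double ∆ if r > 0.75).
```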

3. Numerical observations

The improvement resulting from the use of formula (3) is apparent on specific instances where the catastrophic cancellations indeed result in bad behavior. In most instances, the behavior of the improved variant is exactly the same as that of the plain vanilla version. Critical instances benefit from the improved version by reaching a tighter tolerance. Since formula (3) uses a further gradient evaluation, it may happen that its use triggers more gradient evaluations than the naive descent computation. Our code reuses the additional gradient for successful steps. For unsuccessful steps, this additional gradient evaluation is not used but is nevertheless counted as an evaluation. Therefore, when both versions

succeed in solving a given instance, they usually share exactly the same number of evaluations, but in a few instances there is a slight difference, sometimes in favor of formula (3), sometimes against it.

We present observations on a set of test problems. The 60 examples are taken from the Lukšan et al. collection [12] of scalable versions (some are modified) of CUTEst problems; the collection has been implemented in the Julia language [1,11] using JuMP [3]. The problem dimensions are fixed to n = 100 (the dixmaan family requires n to be a multiple of 3, so those are generated with n = 99). The switching ϵ in Algorithm 1 is 10⁴ ϵ_machine, and our machine precision is ϵ_machine = 2.220446049250313 × 10⁻¹⁶. The variant for solving the model is a standard trust-region algorithm using the so-called absolute value scaling [5]. The trust-region subproblems are solved using the Moré–Sorensen approach [9]. Our implementation is coded in the Julia language [1]; an LDLᵗ factorization of the Hessian is obtained with HSL MA57 [8], interfaced within Julia [11].

Table 1 reports on the improved reliability provided by using formula (3). For tolerances of 10⁻⁴ or higher, both variants solve all 60 problems of the test set. An interesting point is that when using formula (3), 55 out of the 60 problems are solved to 10⁻¹², a remarkable fact in itself; the naive descent test (2) achieves only 26 successes. Most examples where formula (3) converges but formula (2) does not are caused by cancellation: f(x + d) − f(x) numerically vanishes. Also, in most examples, only one iteration, the last one, uses the improved formula (3).

Table 1. Failure and success rate for various tolerances (number of problems solved, out of 60).

    Stopping tolerance ∥∇f(x)∥∞ ≤ ϵ     Solved with formula (3)     Solved with f(x+d) − f(x)
    ϵ = 10⁻¹⁵                            39                          15
    ϵ = 10⁻¹²                            55                          26
    ϵ = 10⁻⁸                             57                          43
    ϵ = 10⁻⁶                             60                          54
    ϵ = 10⁻⁵                             60                          57
    ϵ = 10⁻⁴                             60                          60

For the penalty2 problem (see Table 2), solved to a precision of 10⁻⁸, two iterations switch to (3).

The last ratio for the improved formula is 1.000011, in very good agreement with the prediction. This is an example where the difference between f(x + d) and f(x) computed with (2) does not even have the right sign. The use of (2) stalls, producing unsuccessful steps until ∆ becomes too small and the algorithm declares a failure.

Table 2. Last two iterations compared on the penalty2 example.

    Iter    f            ∆f via (2)       ∆f via (3)      ∥d∥           ∆f(2)/∆q      ∆f(3)/∆q
    21      9.71 × 10⁴    1.59 × 10⁻⁷      1.59 × 10⁻⁷     1.14 × 10⁻⁴    1.001712      1.002567
    22      9.71 × 10⁴   −1.46 × 10⁻¹¹     1.88 × 10⁻¹²    5.64 × 10⁻⁷   −7.73029       1.000011

4. Conclusion

Numerical evaluation of descent is prone to numerical cancellation. The modified formula (3), derived from [7] and adapted here to trust-region algorithms, proves useful in alleviating this cancellation phenomenon. Improved descent evaluation should be part of any robust implementation. The use of automatic finite differencing as advocated in [4] is arguably preferable, since its accuracy does not depend on d becoming small; on the downside, it involves elaborate software development, while the approximation just described is readily and easily added to any existing code.

Acknowledgments

This research was partially supported by NSERC grant OGP0005491. Most of this work was completed while on sabbatical leave at INSA, Rennes. I very warmly thank the associate editor for his very careful reading and suggestions, which led to a much improved manuscript.

References

[1] Jeff Bezanson, Alan Edelman, Stefan Karpinski, Viral B. Shah, Julia: A fresh approach to numerical computing, SIAM Rev. 59 (1) (2017) 65–98.
[2] Andrew R. Conn, Nicholas I.M. Gould, Philippe L. Toint, Trust-Region Methods, MPS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics, 2000.
[3] Iain Dunning, Joey Huchette, Miles Lubin, JuMP: A modeling language for mathematical optimization, SIAM Rev. 59 (2) (2017) 295–320.
[4] Jean-Pierre Dussault, Benoit Hamelin, Robust descent in differentiable optimization using automatic finite differences, Optim. Methods Softw. 21 (5) (2006) 769–777.
[5] Nicholas I.M. Gould, Jorge Nocedal, The modified absolute-value factorization norm for trust-region minimization, in: Renato De Leone, Almerico Murli, Panos M. Pardalos, Gerardo Toraldo (Eds.), High Performance Algorithms and Software in Nonlinear Optimization, Applied Optimization, vol. 24, Springer US, 1998, pp. 225–241.
[6] Luigi Grippo, F. Lampariello, Stefano Lucidi, A nonmonotone line search technique for Newton's method, SIAM J. Numer. Anal. 23 (4) (1986) 707–716.
[7] William W. Hager, Hongchao Zhang, A new conjugate gradient method with guaranteed descent and an efficient line search, SIAM J. Optim. 16 (1) (2005) 170–192.
[8] HSL, A collection of Fortran codes for large scale scientific computation, http://www.hsl.rl.ac.uk/.
[9] Jorge J. Moré, Danny C. Sorensen, Computing a trust region step, SIAM J. Sci. Stat. Comput. 4 (3) (1983) 553–572.
[10] Jorge Nocedal, Stephen J. Wright, Numerical Optimization, second ed., Springer, New York, 2006.
[11] JuliaSmoothOptimizers, https://github.com/JuliaSmoothOptimizers.
[12] Ladislav Lukšan, Ctirad Matonoha, Jan Vlček, Modified CUTE Problems for Sparse Unconstrained Optimization, Technical Report 1081, Institute of Computer Science, Academy of Sciences of the Czech Republic, 2010.