Journal of the Korean Statistical Society 38 (2009) 29–39
New line search methods for unconstrained optimization

Gonglin Yuan*, Zengxin Wei
College of Mathematics and Information Science, Guangxi University, Nanning, Guangxi, 530004, PR China

Article history: Received 16 January 2008; accepted 20 May 2008; available online 7 July 2008.
AMS 2000 subject classifications: 65H10; 90C06.
Keywords: Unconstrained optimization; line search method; global convergence; R-linear convergence; probability.
Abstract: It is well known that the search direction plays a main role in a line search method. In this paper, we propose a new search direction, combined with the Wolfe line search technique and a nonmonotone line search technique, for solving unconstrained optimization problems. The resulting methods possess the sufficient descent property without carrying out any line search rule. Convergence results are established under suitable conditions. In the numerical experiments, an analysis based on performance-profile probabilities shows that the new methods are more effective, robust, and stable than other similar methods, and results on two statistical problems likewise favor the presented methods over the other standard methods. © 2008 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved.
1. Introduction

Consider the unconstrained optimization problem
$$\min_{x\in\mathbb{R}^n} f(x), \tag{1.1}$$
where $f:\mathbb{R}^n\to\mathbb{R}$ is continuously differentiable. A line search method for (1.1) usually takes the iterative form
$$x_{k+1}=x_k+\alpha_k d_k,\qquad k=0,1,2,\ldots, \tag{1.2}$$
where $x_k$ is the current iterate, $\alpha_k>0$ is a steplength, and $d_k$ is a search direction. Different choices of $d_k$ and $\alpha_k$ determine different line search methods (Raydan, 1997; Schropp, 1997, 2000; Van Wyk, 1984; Vrahatis, Androulakis, Lambrinos, & Magoulas, 2000; Yuan & Lu, 2008). We denote $f(x_k)$ by $f_k$, $\nabla f(x_k)$ by $g_k$, and $\nabla f(x_{k+1})$ by $g_{k+1}$; $\|\cdot\|$ denotes the Euclidean norm of a vector.

The method with $d_k=-g_k$ as the search direction at every iteration is the steepest descent method, which has wide applications in solving large-scale minimization problems (Raydan, 1997; Raydan & Svaiter, 2002; Vrahatis et al., 2000; Yuan & Sun, 1999). One drawback of this method is that it often zigzags on practical problems, so that the algorithm converges to an optimal solution very slowly, or even fails to converge (Horst, Pardalos, & Thoai, 1995; Luenberger, 1989; Nocedal & Wright, 1999). If instead we take $d_k=-H_kg_k$ at each iteration, where $H_k$ is an $n\times n$ matrix approximating $[\nabla^2 f(x_k)]^{-1}$, the corresponding method is called a Newton-like method (Horst et al., 1995; Luenberger, 1989; Nocedal & Wright, 1999; Yuan & Sun, 1999); examples include the Newton method, quasi-Newton methods, and variable metric methods.
Footnote: This work was supported by China NSF grant 10761001.
* Corresponding author. E-mail addresses: [email protected], [email protected] (G. Yuan).
doi:10.1016/j.jkss.2008.05.004
Many papers have studied such methods for optimization problems (Byrd & Nocedal, 1989; Dai, 2003; Dennis & Schnabel, 1983; Dennis & Moré, 1974, 1977; Fletcher, 1997; Griewank & Toint, 1982; Li & Fukushima, 2001; Powell, 1970; Wei, Li, & Qi, 2006; Wei, Yu, Yuan, & Lian, 2004; Zhang, Deng, & Chen, 1999). However, a Newton-like method must store and update the matrix $H_k$ at each iteration, which adds to the cost of storage and computation; accordingly, it is often unsuitable for large-scale optimization problems.

Compared with line search methods, trust region methods possess desirable theoretical features: they have very strong convergence properties and can be applied to ill-conditioned problems. At each iteration, however, a trust region method must solve a subproblem (Byrd, Schnabel, & Schultz, 1987; Fan, 2003; Fu & Sun, 2005; Horst et al., 1995; Luenberger, 1989; Nocedal & Wright, 1999; Rojas, Santos, & Sorensen, 2000; Schultz, Schnabel, & Byrd, 1985; Vardi, 1985; Yuan & Sun, 1999), and this is the main drawback: if a trial step is not accepted, the subproblem is re-solved with a reduced radius until an acceptable step is found. The subproblem may therefore be solved several times per iteration, which can considerably increase the total cost of computation.

Due to its simplicity and very low memory requirement, the conjugate gradient method is a powerful line search method for large-scale optimization. Like the steepest descent method, it avoids the computation and storage of matrices associated with the Hessian of the objective function. The conjugate gradient method has the form
$$d_k=\begin{cases}-g_k+\beta_k d_{k-1}, & \text{if } k\ge 1,\\ -g_k, & \text{if } k=0,\end{cases} \tag{1.3}$$
where $\beta_k\in\mathbb{R}$ is a scalar whose choice determines the particular conjugate gradient method (Dai, 2002; Dai & Yuan, 2000; Fletcher, 1997; Fletcher & Reeves, 1964; Hestenes & Stiefel, 1952; Liu & Storey, 1992; Polak & Ribière, 1969; Wei, Li, & Qi, 2006; Wei, Yao, & Liu, 2006). However, the sufficient descent condition
$$g_k^Td_k\le -c\|g_k\|^2,\qquad\text{for all } k\ge 1 \text{ and some constant } c>0, \tag{1.4}$$
which is very important for ensuring global convergence, is difficult for a nonlinear conjugate gradient method to satisfy, and this condition may be crucial for conjugate gradient methods (Gilbert & Nocedal, 1992). At present, the global convergence of the PRP conjugate gradient method under the inexact Wolfe–Powell line search rule is still an open problem. These observations motivate us to propose a new method that retains the simplicity and low memory requirement while also possessing desirable theoretical features.

Generally, when (1.1) is solved by a conjugate gradient method with an exact line search, the condition
$$g_{k+1}^Td_k=0,\qquad\text{for all } k, \tag{1.5}$$
must hold in theory; in practice, however, this case hardly ever occurs. We can then define the following descent direction as the search direction in (1.2):
$$d_{k+1}=\begin{cases}-g_{k+1}+\dfrac{\|g_{k+1}\|^2}{-d_k^Tg_{k+1}}\,d_k, & \text{if } d_k^Tg_{k+1}\ne 0,\\[4pt] -g_{k+1}, & \text{otherwise,}\end{cases} \tag{1.6}$$
where $d_0=-\nabla f(x_0)=-g_0$. If $d_k^Tg_{k+1}\ne 0$, the search direction $d_{k+1}$ is the sum of the negative gradient $-g_{k+1}$ and a multiple of the previous search direction $d_k$, which is similar to a conjugate gradient direction; otherwise, the steepest descent direction is used as a restart. We therefore expect the method to be computationally effective. Moreover, the sufficient descent condition (1.4) holds for this direction without carrying out any line search technique (see Lemma 3.1 below), so the theoretical properties are interesting too.

It is well known that many statistical problems are unconstrained problems, for example in regression analysis (see Lin and Ying (2001), Pan, Louis, and Connett (2000) and Bůžková and Lumley (2008), among others). The given method can thus be used to solve statistical problems. In Section 4, we present two statistical examples and formulate them as unconstrained optimization problems; the numerical results show that the new algorithms are competitive with other methods.

This paper is organized as follows. In the next section, the algorithms are stated. The global convergence and the R-linear convergence of the new methods are established in Section 3. Numerical results and conclusions are presented in Sections 4 and 5, respectively.

2. Algorithms

In this section, two algorithms are presented. First, we give two line search rules for choosing the stepsize $\alpha_k$ (a code sketch of both rules appears after the definitions below):

(i) Wolfe rule: find a steplength $\alpha_k$ such that
$$f(x_k+\alpha_k d_k)\le f_k+\sigma_1\alpha_k g_k^Td_k, \tag{2.1}$$
$$g(x_k+\alpha_k d_k)^Td_k\ge \sigma_2\, g_k^Td_k; \tag{2.2}$$
(ii) Nonmonotone Wolfe line search rule (Zhang & Hager, 2004): find a steplength $\alpha_k$ satisfying
$$f(x_k+\alpha_k d_k)\le C_k+\sigma_1\alpha_k g_k^Td_k, \tag{2.3}$$
$$g(x_k+\alpha_k d_k)^Td_k\ge \sigma_2\, g_k^Td_k, \tag{2.4}$$
where $0<\sigma_1<\sigma_2<1$,
$$C_{k+1}=\frac{\mu_kQ_kC_k+f(x_k+\alpha_k d_k)}{Q_{k+1}},\qquad Q_{k+1}=\mu_kQ_k+1,$$
$C_0=f(x_0)$, $Q_0=1$, and $\mu_k\in[\mu_{\min},\mu_{\max}]$ with $0\le\mu_{\min}\le\mu_{\max}\le 1$.
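As promised above, the following sketch illustrates how a step satisfying either rule might be located in practice. It is our illustration, not the authors' implementation: `wolfe_ok` checks (2.1)–(2.2) against a reference value `C` (taking `C = f(x_k)` gives the standard Wolfe rule, while `C = C_k` gives the nonmonotone rule (2.3)–(2.4)), and `find_step` brackets an acceptable step, accepting the current one after 25 trials as described in Section 4. The default parameters follow the choices $\sigma_1=0.1$, $\sigma_2=0.9$ used in the experiments.

```python
import numpy as np

def wolfe_ok(f, g, x, d, alpha, C, sigma1=0.1, sigma2=0.9):
    """Check (2.1)-(2.2) against the reference value C:
    C = f(x) gives the standard Wolfe rule, C = C_k the nonmonotone rule (2.3)-(2.4)."""
    gd = np.dot(g(x), d)                       # g_k^T d_k (negative for a descent direction)
    sufficient = f(x + alpha * d) <= C + sigma1 * alpha * gd
    curvature = np.dot(g(x + alpha * d), d) >= sigma2 * gd
    return sufficient and curvature

def find_step(f, g, x, d, C, sigma1=0.1, sigma2=0.9, max_tries=25):
    """Locate an acceptable step by bisection/expansion; as in Section 4 of the
    paper, the current step is simply accepted after 25 trials."""
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_tries):
        if wolfe_ok(f, g, x, d, alpha, C, sigma1, sigma2):
            break
        if f(x + alpha * d) > C + sigma1 * alpha * np.dot(g(x), d):
            hi = alpha                         # sufficient decrease failed: shrink
        else:
            lo = alpha                         # curvature failed: move right
        alpha = 2.0 * alpha if hi == np.inf else 0.5 * (lo + hi)
    return alpha
```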
Combining (1.6) with the above two line search rules, we state the algorithms as follows.

New Algorithm 1 (NA1).
Step 0: Choose an initial point $x_0\in\mathbb{R}^n$, $0<\varepsilon<1$, $0<\sigma_1<\sigma_2<1$. Set $d_0=-g_0=-\nabla f(x_0)$, $k:=0$.
Step 1: If $\|g_k\|^2\le\varepsilon$, stop; otherwise go to Step 2.
Step 2: Compute the steplength $\alpha_k$ by the Wolfe line search (2.1) and (2.2), and let $x_{k+1}=x_k+\alpha_k d_k$.
Step 3: Calculate the search direction $d_{k+1}$ by (1.6).
Step 4: Set $k:=k+1$ and go to Step 1.

New Algorithm 2 (NA2).
Step 0: Choose an initial point $x_0\in\mathbb{R}^n$, $0<\mu<1$, $0<\varepsilon<1$, $0<\sigma_1<\sigma_2<1$. Set $C_0=f(x_0)$, $Q_0=1$, $d_0=-g_0=-\nabla f(x_0)$, $k:=0$.
Step 1: If $\|g_k\|^2\le\varepsilon$, stop; otherwise go to Step 2.
Step 2: Compute the steplength $\alpha_k$ by the nonmonotone Wolfe line search (2.3) and (2.4), and let $x_{k+1}=x_k+\alpha_k d_k$.
Step 3: Calculate the search direction $d_{k+1}$ by (1.6).
Step 4: Let
$$Q_{k+1}=\mu Q_k+1,\qquad C_{k+1}=\frac{\mu Q_kC_k+f(x_{k+1})}{Q_{k+1}}. \tag{2.5}$$
Step 5: Set $k:=k+1$ and go to Step 1.
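The two algorithms differ only in the reference value passed to the line search and in the update (2.5). A minimal sketch, in our notation and reusing `find_step` and the `numpy` import from the sketch above; the absolute-value cutoff `1e-10` is our numerical stand-in for the exact test $d_k^Tg_{k+1}\ne 0$ in (1.6), in the spirit of the threshold used in (4.3) of Section 4:

```python
def na_method(f, g, x0, nonmonotone=False, mu=1e-4, eps=1e-5, max_iter=1000):
    """Sketch of NA1 (nonmonotone=False) and NA2 (nonmonotone=True)."""
    x = np.asarray(x0, dtype=float)
    gk = g(x)
    d = -gk                                   # Step 0: d_0 = -g_0
    C, Q = f(x), 1.0                          # C_0 = f(x_0), Q_0 = 1
    for _ in range(max_iter):
        if np.dot(gk, gk) <= eps:             # Step 1: stop once ||g_k||^2 <= eps
            break
        ref = C if nonmonotone else f(x)
        alpha = find_step(f, g, x, d, ref)    # Step 2: (2.1)-(2.2) or (2.3)-(2.4)
        x = x + alpha * d
        g_new = g(x)
        dg = np.dot(d, g_new)                 # d_k^T g_{k+1}
        if abs(dg) > 1e-10:                   # Step 3: direction (1.6)
            d = -g_new + (np.dot(g_new, g_new) / (-dg)) * d
        else:
            d = -g_new                        # steepest descent restart
        if nonmonotone:                       # Step 4 of NA2: update (2.5)
            Q_next = mu * Q + 1.0
            C = (mu * Q * C + f(x)) / Q_next
            Q = Q_next
        gk = g_new
    return x
```

For example, `na_method(lambda v: v @ v, lambda v: 2.0 * v, [3.0, -4.0], nonmonotone=True)` should return a point near the origin.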
We will concentrate on the convergence results of NA1 and NA2 in the following section.

3. Convergence analysis

In order to establish the convergence of NA1 and NA2, the following assumptions are needed.

Assumption 3.1. (i) $f$ is bounded below on the level set $\varsigma=\{x\in\mathbb{R}^n: f(x)\le f(x_1)\}$; (ii) in some neighborhood $\zeta$ of $\varsigma$, $f$ is differentiable and its gradient is Lipschitz continuous, namely, there exists a constant $L>0$ such that
$$\|g(x)-g(y)\|\le L\|x-y\|,\qquad\text{for all } x,y\in\zeta.$$

In the following, we assume that $g_k\ne 0$ for all $k$; otherwise a stationary point has been found.

Lemma 3.1. Let $d_k$ be defined by (1.6). Then (1.4) holds.

Proof. If $k=0$, then $d_0^Tg_0=-\|g_0\|^2$, so (1.4) holds. For any $k\ge 0$, by (1.6), if $d_k^Tg_{k+1}\ne 0$ we have
$$d_{k+1}^Tg_{k+1}=-g_{k+1}^Tg_{k+1}+\frac{\|g_{k+1}\|^2}{-d_k^Tg_{k+1}}\,d_k^Tg_{k+1}=-2\|g_{k+1}\|^2;$$
on the other hand, if $d_k^Tg_{k+1}=0$, we have $d_{k+1}^Tg_{k+1}=-\|g_{k+1}\|^2$. Therefore, we can choose a constant $c\in(0,1]$ such that (1.4) holds. The proof is complete. □
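As a quick numerical sanity check of this identity (ours, not part of the paper), one can verify $d_{k+1}^Tg_{k+1}=-2\|g_{k+1}\|^2$ on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
g_next, d = rng.standard_normal(5), rng.standard_normal(5)
d_next = -g_next + (np.dot(g_next, g_next) / -np.dot(d, g_next)) * d
# Both printed values agree up to rounding: d_{k+1}^T g_{k+1} == -2 ||g_{k+1}||^2.
print(np.dot(d_next, g_next), -2 * np.dot(g_next, g_next))
```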
The above lemma shows that the search direction $d_k$ satisfies the sufficient descent condition (1.4). Based on Lemma 3.1 and Assumption 3.1, we now prove the global convergence theorem for NA1.
Theorem 3.1. Let $\{\alpha_k, d_k, x_{k+1}, g_{k+1}\}$ be generated by NA1, and let Assumption 3.1 hold. Then
$$\sum_{k=0}^{\infty}\left(\frac{g_k^Td_k}{\|d_k\|}\right)^2<+\infty, \tag{3.1}$$
and thus
$$\lim_{k\to\infty}\left(\frac{g_k^Td_k}{\|d_k\|}\right)^2=0. \tag{3.2}$$
Proof. By (2.2) and Assumption 3.1(ii), we have
$$(\sigma_2-1)g_k^Td_k\le \left(g(x_k+\alpha_k d_k)-g_k\right)^Td_k\le\|g(x_k+\alpha_k d_k)-g_k\|\,\|d_k\|\le\alpha_kL\|d_k\|^2.$$
Then we get
$$\alpha_k\ge\frac{(\sigma_2-1)g_k^Td_k}{L\|d_k\|^2}. \tag{3.3}$$
By (2.1), (3.3), and Lemma 3.1, we obtain
$$f_{k+1}\le f_k+\sigma_1\alpha_kg_k^Td_k\le f_k-\frac{\sigma_1(1-\sigma_2)}{L}\left(\frac{g_k^Td_k}{\|d_k\|}\right)^2. \tag{3.4}$$
Since $f$ is bounded from below, we conclude from (3.4) that
$$\sum_{k=0}^{\infty}\left(\frac{g_k^Td_k}{\|d_k\|}\right)^2<+\infty. \tag{3.5}$$
Therefore (3.1) holds, and (3.1) implies (3.2). The proof is complete. □
In Zhang and Hager (2004), condition (1.4) is obtained by assumption; our method guarantees it, which is an improvement in theory. Similarly to Lemma 1.1 in Zhang and Hager (2004), we can get the following lemma.

Lemma 3.2. Let $\{\alpha_k, d_k, x_{k+1}, g_{k+1}\}$ be generated by NA2, and let Assumption 3.1(i) be satisfied. Then $f_k\le C_k\le A_k$ for each $k$, where $A_k=\frac{1}{k+1}\sum_{i=0}^{k}f(x_i)$, and there exists $\alpha_k$ satisfying the nonmonotone conditions of the line search update.

Proof. Define $T_k:\mathbb{R}\to\mathbb{R}$ by
$$T_k(t)=\frac{tC_{k-1}+f_k}{t+1},\qquad\text{so that}\qquad T_k'(t)=\frac{C_{k-1}-f_k}{(t+1)^2}.$$
Using Lemma 3.1, we have $d_k^Tg_k\le 0$. From (2.3), we obtain $f_k\le C_{k-1}$, which means that $T_k'(t)\ge 0$ for all $t\ge 0$. So $T_k$ is nondecreasing, and $f_k=T_k(0)\le T_k(t)$ for all $t\ge 0$. In particular, taking $t=\mu Q_{k-1}$ gives
$$f_k=T_k(0)\le T_k(\mu Q_{k-1})=C_k,$$
which establishes the lower bound on $C_k$ in this lemma.

Now we prove the upper bound $C_k\le A_k$ by induction. For $k=0$ it holds by the initialization $C_0=f(x_0)$. Assume that $C_j\le A_j$ for all $0\le j\le k-1$. According to (2.5), the initialization $Q_0=1$, and the fact that $\mu\in(0,1)$, we obtain
$$Q_{i+1}=1+\sum_{j=0}^{i}\prod_{m=0}^{j}\mu\le i+2. \tag{3.6}$$
This inequality, together with the monotonicity of $T_k$, implies that
$$C_k=T_k(\mu Q_{k-1})=T_k(Q_k-1)\le T_k(k). \tag{3.7}$$
Using the induction hypothesis,
$$T_k(k)=\frac{kC_{k-1}+f_k}{k+1}\le\frac{kA_{k-1}+f_k}{k+1}=A_k. \tag{3.8}$$
Combining (3.7) and (3.8) gives the upper bound on $C_k$. Finally, since the Wolfe conditions can be satisfied whenever $g_k^Td_k<0$ and $f$ is bounded from below, and since $f_k\le C_k$, for each $k$ there exists $\alpha_k$ satisfying the nonmonotone conditions of the line search update. This concludes the lemma. The proof is complete. □

Now we prove the global convergence of NA2.

Theorem 3.2. Let $\{\alpha_k, d_k, x_{k+1}, g_{k+1}\}$ be generated by NA2, and let Assumption 3.1 hold. Then (3.1) and (3.2) hold.

Proof. By (2.5), (2.3), (3.3), and Lemma 3.1, we obtain
$$C_{k+1}=\frac{\mu Q_kC_k+f_{k+1}}{Q_{k+1}}\le\frac{\mu Q_kC_k+C_k+\sigma_1\alpha_kg_k^Td_k}{Q_{k+1}}\le C_k-\frac{\sigma_1(1-\sigma_2)}{LQ_{k+1}}\left(\frac{g_k^Td_k}{\|d_k\|}\right)^2. \tag{3.9}$$
Since $f$ is bounded from below and $f_k\le C_k$ holds for all $k$, the sequence $\{C_k\}$ is also bounded from below. From (3.9), we obtain
$$\sum_{k=0}^{\infty}\frac{1}{Q_{k+1}}\left(\frac{g_k^Td_k}{\|d_k\|}\right)^2<+\infty. \tag{3.10}$$
Using (2.5), the initialization $Q_0=1$, and the fact that $\mu\in(0,1)$, we get
$$Q_{j+1}=1+\sum_{i=0}^{j}\prod_{m=0}^{i}\mu\le 1+\sum_{i=0}^{\infty}\mu^{i+1}=\frac{1}{1-\mu}. \tag{3.11}$$
Consequently, since the $Q_{k+1}$ are uniformly bounded, (3.10) implies (3.1), and hence (3.2). The proof is complete. □
Assumption 3.2. There exists a constant $c_0>0$ such that
$$\|d_k\|\le c_0\|g_k\| \tag{3.12}$$
for all sufficiently large $k$. Combining (3.12) with (1.4) and (3.2), it is easy to obtain $\|g_k\|\to 0$ as $k\to\infty$ (or see Shi (2004)).

The following assumption is further needed to obtain the linear convergence of NA2.

Assumption 3.3. $f$ is strongly convex, which means that there exists a scalar $\gamma>0$ such that
$$f(x)\ge f(y)+g(y)^T(x-y)+\frac{1}{2\gamma}\|x-y\|^2,\qquad\text{for all } x,y\in\mathbb{R}^n. \tag{3.13}$$
Interchanging $x$ and $y$ in (3.13) and adding the two inequalities, we have
$$\left(g(x)-g(y)\right)^T(x-y)\ge\frac{1}{\gamma}\|x-y\|^2.$$
Let $x^*$ denote the unique minimizer of $f$. It follows from the above inequality, with $y=x^*$, that
$$\|x-x^*\|\le\gamma\|g(x)\|. \tag{3.14}$$
Since $f$ is convex, $\phi(t)=f(x(t))$ is a convex function of $t$, where $x(t)=x^*+t(x-x^*)$, $t\in[0,1]$; hence the derivative $\phi'(t)$ is an increasing function of $t\in[0,1]$ with $\phi'(0)=0$. This observation combined with (3.14) gives
$$f(x)-f(x^*)=\int_0^1\phi'(t)\,dt\le\phi'(1)=g(x)^T(x-x^*)\le\gamma\|g(x)\|^2. \tag{3.15}$$
Similarly to Zhang and Hager (2004), we can obtain the following theorem; we also state the proof here.
Theorem 3.3. Let $\{\alpha_k, d_k, x_{k+1}, g_{k+1}\}$ be generated by NA2. Suppose that Assumptions 3.1–3.3 are satisfied and that there exists $a>0$ such that $\alpha_k\le a$ for all $k$. Then there exists $\theta\in(0,1)$ such that
$$f(x_k)-f(x^*)\le\theta^k\left(f(x_0)-f(x^*)\right)\qquad\text{for each } k. \tag{3.16}$$
This means that the sequence $\{f_k\}$ converges R-linearly to $f(x^*)$.

Proof. Since $f(x_{k+1})\le C_k$ and $C_{k+1}$ is a convex combination of $C_k$ and $f(x_{k+1})$, we have $C_{k+1}\le C_k$ for each $k$. Hence $f(x_{k+1})\le C_k\le C_{k-1}\le\cdots\le C_0=f(x_0)$, which implies that all the iterates $x_k$ are contained in the level set $\varsigma$. By (2.3), (3.3), Lemma 3.1, and (3.12), we get
$$f(x_{k+1})\le C_k-\beta\|g_k\|^2,\qquad\text{where }\beta=\frac{\sigma_1(1-\sigma_2)c^2}{Lc_0^2}. \tag{3.17}$$
Also, by Assumption 3.2 and the upper bound $a$ on $\alpha_k$, $x_{k+1}$ satisfies
$$\|x_{k+1}-x_k\|=\alpha_k\|d_k\|\le ac_0\|g_k\|.$$
Combining this with Assumption 3.1(ii) gives $\|g_{k+1}-g_k\|\le L\|x_{k+1}-x_k\|\le Lac_0\|g_k\|$, from which it follows that
$$\|g_{k+1}\|\le\|g_{k+1}-g_k\|+\|g_k\|\le b_0\|g_k\|,\qquad b_0=1+Lac_0. \tag{3.18}$$
Now we show that, for each $k$,
$$C_{k+1}-f(x^*)\le\theta\left(C_k-f(x^*)\right), \tag{3.19}$$
where $\theta=1-\beta b_1(1-\mu)$ and $b_1=\frac{1}{\beta+\gamma b_0^2}$; this immediately yields (3.16), since $f_k\le C_k$ and $C_0=f_0$.

Case 1: $\|g_k\|^2\ge b_1\left(C_k-f(x^*)\right)$. Using (2.5), (3.17), and (3.11), we obtain
$$C_{k+1}-f(x^*)=\frac{\mu Q_k\left(C_k-f(x^*)\right)+\left(f(x_{k+1})-f(x^*)\right)}{Q_{k+1}}\le\frac{\mu Q_k\left(C_k-f(x^*)\right)+\left(C_k-f(x^*)\right)-\beta\|g_k\|^2}{1+\mu Q_k}$$
$$=C_k-f(x^*)-\frac{\beta\|g_k\|^2}{Q_{k+1}}\le C_k-f(x^*)-\beta(1-\mu)\|g_k\|^2. \tag{3.20}$$
Since $\|g_k\|^2\ge b_1\left(C_k-f(x^*)\right)$, the relation (3.19) follows in Case 1.

Case 2: $\|g_k\|^2<b_1\left(C_k-f(x^*)\right)$. By (3.15) and (3.18), we get
$$f(x_{k+1})-f(x^*)\le\gamma\|g_{k+1}\|^2\le\gamma b_0^2\|g_k\|^2\le\gamma b_0^2b_1\left(C_k-f(x^*)\right).$$
Together with the first equality of (3.20), this yields
$$C_{k+1}-f(x^*)\le\frac{\left(\mu Q_k+\gamma b_0^2b_1\right)\left(C_k-f(x^*)\right)}{1+\mu Q_k}=\left(1-\frac{1-\gamma b_0^2b_1}{Q_{k+1}}\right)\left(C_k-f(x^*)\right). \tag{3.21}$$
Rearranging the expression for $b_1$, we have $\gamma b_0^2b_1=1-\beta b_1$. Inserting this relation into (3.21) and using the bound (3.11) again, we get (3.19), and hence (3.16). This completes the proof. □

In a way similar to the proof of the above theorem (or see, e.g., Shi (2004)), it is easy to obtain the R-linear convergence of NA1. We state the theorem as follows but omit the proof.

Theorem 3.4. Let $\{\alpha_k, d_k, x_{k+1}, g_{k+1}\}$ be generated by NA1, and let the conditions of Theorem 3.3 hold. Then the sequence $\{f_k\}$ converges R-linearly to $f(x^*)$.
Fig. 1. Performance profiles of methods with the Wolfe rule (CPU Time).
4. Numerical results

In this section, we report numerical results for NA1, NA2, the steepest descent method, the FR method, and the PRP method. We test these algorithms on problems from Moré, Garbow, and Hillstrom (1981), coded in MATLAB with the given initial points, and on two statistical problems. The parameters were chosen as $\sigma_1=0.1$, $\sigma_2=0.9$, $\mu=10^{-4}$, $\varepsilon=10^{-5}$. Since the line search cannot always ensure the descent condition $d_k^Tg_k<0$, an uphill search direction may occur in the numerical experiments, in which case the line search rule may fail. To avoid this, the stepsize $\alpha_k$ is accepted once the number of searches in the line search exceeds twenty-five. We also stop the program if the number of iterations exceeds one thousand, in which case the corresponding method is considered to have failed. All codes were written in MATLAB and run on a PC with a 2.60 GHz CPU, 256 MB of memory, and the Windows XP operating system.

(i) Problems from Moré et al. (1981). The following Himmelblau stop rule (Yuan & Sun, 1999) is used: if $|f(x_k)|>e_1$, let
$$stop1=\frac{|f(x_k)-f(x_{k+1})|}{|f(x_k)|};$$
otherwise, let $stop1=|f(x_k)-f(x_{k+1})|$. The program stops if $\|g(x)\|<\varepsilon$ or $stop1<\varepsilon$ is satisfied, where $e_1=10^{-4}$. The detailed numerical results are listed on the web site Http://blog.sina.com.cn/gonglinyuan.

Dolan and Moré (2002) gave a new tool to analyze the efficiency of algorithms: the performance profile, a means to evaluate and compare the performance of a set of solvers $S$ on a test set $P$. Assuming that there are $n_s$ solvers and $n_p$ problems, for each problem $p$ and solver $s$ they define
$$t_{p,s}=\text{computing time (number of function evaluations, or similar) required to solve problem } p \text{ by solver } s.$$
As a baseline for comparison, the performance of solver $s$ on problem $p$ is compared with the best performance by any solver on that problem, via the performance ratio
$$r_{p,s}=\frac{t_{p,s}}{\min\{t_{p,s}: s\in S\}}.$$
A parameter $r_M\ge r_{p,s}$ for all $p,s$ is chosen, with $r_{p,s}=r_M$ if and only if solver $s$ does not solve problem $p$. Although the performance of solver $s$ on any given problem may be of interest, an overall assessment of the solver is obtained by defining
$$\rho_s(t)=\frac{1}{n_p}\,\mathrm{size}\{p\in P: r_{p,s}\le t\};$$
$\rho_s(t)$ is the probability for solver $s\in S$ that its performance ratio $r_{p,s}$ is within a factor $t\in\mathbb{R}$ of the best possible ratio, so $\rho_s$ is the (cumulative) distribution function of the performance ratio. The performance profile $\rho_s:\mathbb{R}\mapsto[0,1]$ of a solver is a nondecreasing, piecewise constant function, continuous from the right at each breakpoint, and $\rho_s(1)$ is the probability that the solver beats the rest of the solvers. By these rules, the solver whose performance profile plot lies on the top right wins over the rest of the solvers. (A small sketch of the computation follows.)

In Fig. 1, "NA1", "TSDA1", "FRA1", and "PRPA1" stand for New Algorithm 1, the steepest descent method, the FR method, and the PRP method with the Wolfe line search rule, respectively. In Fig. 2, "NA2", "TSDA2", "FRA2", and "PRPA2" stand for New Algorithm 2, the steepest descent method, the FR method, and the PRP method with the nonmonotone Wolfe line search rule, respectively.
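The profile $\rho_s(t)$ can be computed directly from the matrix of timings $t_{p,s}$. The sketch below is our illustration (not the authors' code); failures are encoded as `np.inf` so that a failed run never attains the minimum and its ratio exceeds every finite $t$ (it assumes every problem is solved by at least one solver):

```python
import numpy as np

def performance_profile(T, ts):
    """T: (n_p problems, n_s solvers) array of costs t_{p,s}; failures = np.inf.
    Returns rho[s, j] = fraction of problems with ratio r_{p,s} <= ts[j]."""
    best = T.min(axis=1, keepdims=True)       # best solver per problem
    R = T / best                              # performance ratios r_{p,s}
    return np.array([[np.mean(R[:, s] <= t) for t in ts]
                     for s in range(T.shape[1])])

# Example: 3 problems, 2 solvers; solver 1 fails on problem 2.
T = np.array([[1.0, 2.0], [3.0, np.inf], [2.0, 2.0]])
rho = performance_profile(T, ts=np.linspace(1, 4, 7))
# rho[s, 0] (i.e., at t = 1) is the probability that solver s is the fastest.
```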
Fig. 2. Performance profiles of methods with the nonmonotone Wolfe rule (CPU Time).

Table 1
Price p_i ($)        1     2     2     2.3   2.5   2.6   2.8   3     3.3   3.5
Demand d_i (500 g)   5     3.5   3     2.7   2.4   2.5   2     1.5   1.2   1.2

Table 2
x_i   2     3     4      5      6      7      8      9      10     11
h_i   5.6   8     10.4   12.8   15.3   17.8   19.9   21.4   22.4   23.2
Figs. 1 and 2 show the performance of the FR method, the PRP method, New Algorithm 1, and New Algorithm 2 using total CPU time as the measure. From Figs. 1 and 2 it is easy to see that NA1 and NA2, respectively, are the best among these methods; the PRP method is faster than the other two methods, and the steepest descent method is the slowest. In a word, the given methods are competitive with the other similar methods, and the new direction is notable.

(ii) Two statistical problems.

1. Problem 1 (linear regression analysis). Table 1 gives data on the yearly demand and price of a certain commodity. The statistics indicate that the demand may change even when the price does not, and may remain the same even when the price changes; overall, however, the demand decreases as the price increases. Our objective is to find an approximate functional relation between the demand and the price, namely, the regression equation of $d$ on $p$. It is not difficult to see that the price $p$ and the demand $d$ are approximately linearly related. Denote the regression function by $\hat d=\beta_0+\beta_1p$, where $\beta_0$ and $\beta_1$ are the regression parameters; our task is to determine $\beta_0$ and $\beta_1$. By the least squares method, we need to solve
$$\min Q=\sum_{i=1}^{n}\left[d_i-(\beta_0+\beta_1p_i)\right]^2$$
for $\beta_0$ and $\beta_1$, where $n=10$. The corresponding unconstrained optimization problem is
$$\min_{\beta\in\mathbb{R}^2} f(\beta)=\sum_{i=1}^{n}\left[d_i-\beta^T(1,p_i)^T\right]^2. \tag{4.1}$$
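Because (4.1) is a linear least-squares problem, its minimizer can be computed directly and used to check the iterative methods. The following sketch is our code (the use of numpy's `lstsq` is our choice); with the data of Table 1 it reproduces the converged value $\grave\beta\approx(6.4383,-1.5753)$ reported in Table 3:

```python
import numpy as np

p = np.array([1, 2, 2, 2.3, 2.5, 2.6, 2.8, 3, 3.3, 3.5])    # prices, Table 1
d = np.array([5, 3.5, 3, 2.7, 2.4, 2.5, 2, 1.5, 1.2, 1.2])  # demands, Table 1

A = np.column_stack([np.ones_like(p), p])     # rows (1, p_i)
beta, *_ = np.linalg.lstsq(A, d, rcond=None)  # minimizes (4.1)
print(beta)                                   # approx. [ 6.438285 -1.575314]
```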
2. Problem 2 (quadratic regression analysis). Table 2 gives data on the age $x$ and the average height $h$ of pine trees. Similarly to Problem 1, it is easy to see that the age $x$ and the average height $h$ are related approximately parabolically. Denote the regression function by $\hat h=\beta_0+\beta_1x+\beta_2x^2$, where $\beta_0$, $\beta_1$, and $\beta_2$ are the regression parameters. Using the least squares method, we need to solve
$$\min Q=\sum_{i=1}^{n}\left[h_i-(\beta_0+\beta_1x_i+\beta_2x_i^2)\right]^2$$
for $\beta_0$, $\beta_1$, and $\beta_2$, where $n=10$. The corresponding unconstrained optimization problem is
$$\min_{\beta\in\mathbb{R}^3} f(\beta)=\sum_{i=1}^{n}\left[h_i-\beta^T(1,x_i,x_i^2)^T\right]^2. \tag{4.2}$$
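Problem 2 admits the same direct check; fitting the quadratic to the data of Table 2 (again our sketch, not the authors' code) yields a minimizer close to the approximate solution $\beta^*=(-1.33, 3.46, -0.11)$ used in Table 4 below:

```python
import numpy as np

x = np.arange(2, 12)                          # ages 2..11, Table 2
h = np.array([5.6, 8, 10.4, 12.8, 15.3, 17.8, 19.9, 21.4, 22.4, 23.2])

A = np.column_stack([np.ones_like(x, dtype=float), x, x**2])  # rows (1, x_i, x_i^2)
beta, *_ = np.linalg.lstsq(A, h, rcond=None)  # minimizes (4.2)
print(beta)                                   # approx. [-1.33  3.46 -0.11]
```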
Table 3
Test results for Problem 1; β* = (6.5, −1.6), ε* = ‖β̀ − β*‖/‖β*‖, and β̆ is the initial point.

With the Wolfe rule:
β̆             NA1                        TSDA1                      FRA1                         PRPA1
(1, −0.01)     β̀:  (6.438301, −1.575289)  (6.438283, −1.575313)      Fail                         Fail
               ε*: 0.009929                0.009930                   Fail                         Fail
(−10, 0.04)    β̀:  (6.438280, −1.575313)  (6.438303, −1.575255)      Fail                         Fail
               ε*: 0.009930                0.009930                   Fail                         Fail
(−2, −1.0)     β̀:  (6.438285, −1.575314)  (6.438287, −1.575314)      Fail                         Fail
               ε*: 0.009930                0.009929                   Fail                         Fail
(15, 15)       β̀:  (6.438287, −1.575316)  (6.438286, −1.575315)      (77.597388, −33.906776)      Fail
               ε*: 0.009929                0.009929                   11.666119                    Fail

With the nonmonotone Wolfe rule:
β̆             NA2                        TSDA2                      FRA2                         PRPA2
(1, −0.01)     β̀:  (6.438301, −1.575289)  (6.438283, −1.575313)      (6.438305, −1.575322)        (6.438285, −1.575317)
               ε*: 0.009929                0.009930                   0.009926                     0.009929
(−10, 0.04)    β̀:  (6.438280, −1.575313)  (6.438303, −1.575255)      (−2.058130, 1.412430)        (6.438290, −1.575318)
               ε*: 0.009930                0.009930                   1.355363                     0.009929
(−2, −1.0)     β̀:  (6.438285, −1.575314)  (6.438287, −1.575314)      (6.134345, −1.438036)        (6.438286, −1.575314)
               ε*: 0.009930                0.009929                   0.059743                     0.009930
(15, 15)       β̀:  (6.438287, −1.575316)  (6.438286, −1.575315)      (77.597388, −33.906776)      (6.438286, −1.575314)
               ε*: 0.009929                0.009929                   11.666119                    0.009930
It is well known that problems (4.1) and (4.2) can be solved by the extreme-value methods of calculus. Here we solve these two problems by our methods and by the other three methods, respectively. We stop the program if the number of iterations exceeds one thousand or if the condition $\|\nabla f(\beta)\|\le 10^{-5}$ is satisfied. In this experiment, the direction is defined by
$$d_{k+1}=\begin{cases}-g_{k+1}+\dfrac{\|g_{k+1}\|^2}{-d_k^Tg_{k+1}}\,d_k, & \text{if } d_k^Tg_{k+1}<10^{-10},\\[4pt] -g_{k+1}, & \text{otherwise.}\end{cases} \tag{4.3}$$
The other conditions are the same as in the tests on the problems of Moré et al. (1981). The entries of Tables 3 and 4 have the following meaning: β* is the approximate optimal solution; β̀ is the solution at which the program terminates; ε* = ‖β̀ − β*‖/‖β*‖ is the relative error between β̀ and β*; β̆ is the initial point; and "Fail" means that the method fails.

The numerical results in Tables 3 and 4 indicate that our algorithms are the best among these eight algorithms; moreover, the initial points do not noticeably influence the results. The steepest descent method is better than the FR method and the PRP method, and the FR and PRP methods are not so effective on these two statistical problems. According to the numerical results on the problems of Moré et al. (1981) and these two statistical examples, it is not difficult to conclude that the given methods are competitive with the other similar methods and that the new direction is notable.

5. Conclusions

This paper proposes a search direction generated in a new way, combines it with the Wolfe rule and a nonmonotone line search technique (Zhang & Hager, 2004), and thereby presents two new methods for unconstrained optimization. Global and R-linear convergence are established under weak assumptions on the search direction $d_k$. Moreover, the direction $d_k$ ensures the sufficient descent condition (1.4) without any line search technique. For the test problems, the comparison of numerical results shows that the new search direction of the new algorithms is a good search direction at every iteration. For further research, the performance of the new algorithms should be studied under different stop rules and in different testing environments (such as Gould, Orban, and Toint (2003)), and more numerical experiments on large practical problems should be carried out.
Table 4
Test results for Problem 2; β* = (−1.33, 3.46, −0.11), ε* = ‖β̀ − β*‖/‖β*‖, and β̆ is the initial point.

Method   β̆ (initial point)      β̀ (solution)                          ε*
NA1      (−1.1, 3.0, −0.5)      (−1.296574, 3.450247, −0.107896)      0.009407
         (−1.2, 3.2, −0.3)      (−1.328742, 3.460876, −0.108650)      0.000551
         (−0.003, 7.0, −0.8)    (−1.328504, 3.460798, −0.108646)      0.000585
         (−0.001, 7.0, −0.5)    (−1.321726, 3.458558, −0.108483)      0.002301
TSDA1    (−1.1, 3.0, −0.5)      (−1.330790, 3.461553, −0.108699)      0.000586
         (−1.2, 3.2, −0.3)      (−1.331004, 3.461624, −0.108704)      0.000622
         (−0.003, 7.0, −0.8)    (−1.331257, 3.461707, −0.108710)      0.000669
         (−0.001, 7.0, −0.5)    (−1.102648, 3.702885, −0.137055)      0.090007
FRA1     (−1.1, 3.0, −0.5)      (−3.681636, −2.573035, 0.543468)      1.754925
         (−1.2, 3.2, −0.3)      (−2.444390, 0.536744, 0.245644)       0.849038
         (−0.003, 7.0, −0.8)    Fail                                   Fail
         (−0.001, 7.0, −0.5)    Fail                                   Fail
PRPA1    (−1.1, 3.0, −0.5)      Fail                                   Fail
         (−1.2, 3.2, −0.3)      Fail                                   Fail
         (−0.003, 7.0, −0.8)    Fail                                   Fail
         (−0.001, 7.0, −0.5)    Fail                                   Fail
NA2      (−1.1, 3.0, −0.5)      (−1.296574, 3.450247, −0.107896)      0.009407
         (−1.2, 3.2, −0.3)      (−1.328742, 3.460876, −0.108650)      0.000551
         (−0.003, 7.0, −0.8)    (−1.328504, 3.460798, −0.108646)      0.000585
         (−0.001, 7.0, −0.5)    (−1.321726, 3.458558, −0.108483)      0.002301
TSDA2    (−1.1, 3.0, −0.5)      (−1.330790, 3.461553, −0.108699)      0.000586
         (−1.2, 3.2, −0.3)      (−1.331004, 3.461624, −0.108704)      0.000622
         (−0.003, 7.0, −0.8)    (−1.331257, 3.461707, −0.108710)      0.000669
         (−0.001, 7.0, −0.5)    (−1.102648, 3.702885, −0.137055)      0.090007
FRA2     (−1.1, 3.0, −0.5)      (−3.681636, −2.573035, 0.543468)      1.754925
         (−1.2, 3.2, −0.3)      (−2.444390, 0.536744, 0.245644)       0.849038
         (−0.003, 7.0, −0.8)    (−1.336538, 3.476674, −0.110213)      0.004830
         (−0.001, 7.0, −0.5)    (−1.333833, 3.461110, −0.108608)      0.001140
PRPA2    (−1.1, 3.0, −0.5)      (−1.362139, 3.472004, −0.109442)      0.009253
         (−1.2, 3.2, −0.3)      (−1.235903, 3.430201, −0.106479)      0.026632
         (−0.003, 7.0, −0.8)    (−1.250025, 3.434870, −0.106811)      0.022622
         (−0.001, 7.0, −0.5)    (−1.337666, 3.463824, −0.108859)      0.002331
Acknowledgments

We would like to thank the two referees for their many valuable suggestions and comments, which improved this paper greatly.

References

Bůžková, P., & Lumley, T. (2008). Semiparametric log-linear regression for longitudinal measurements subject to outcome-dependent follow-up. Journal of Statistical Planning and Inference, 138, 2450–2461.
Byrd, R., & Nocedal, J. (1989). A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM Journal on Numerical Analysis, 26, 727–739.
Byrd, R. H., Schnabel, R. B., & Schultz, G. A. (1987). A trust-region algorithm for nonlinearly constrained optimization. SIAM Journal on Numerical Analysis, 24, 1152–1170.
Dai, Y. (2003). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 13, 693–701.
Dai, Y. (2002). A nonmonotone conjugate gradient algorithm for unconstrained optimization. Journal of Systems Science and Complexity, 15, 139–145.
Dai, Y., & Yuan, Y. (2000). A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on Optimization, 10, 177–182.
Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice-Hall.
Dennis, J. E., & Moré, J. J. (1974). A characterization of superlinear convergence and its application to quasi-Newton methods. Mathematics of Computation, 28, 549–560.
Dennis, J. E., Jr., & Moré, J. J. (1977). Quasi-Newton methods, motivation and theory. SIAM Review, 19, 46–89.
Dolan, E. D., & Moré, J. J. (2002). Benchmarking optimization software with performance profiles. Mathematical Programming, 91, 201–213.
Fan, J. Y. (2003). A modified Levenberg–Marquardt algorithm for singular system of nonlinear equations. Journal of Computational Mathematics, 21, 625–636.
Fu, J. H., & Sun, W. Y. (2005). Nonmonotone adaptive trust-region method for unconstrained optimization problems. Applied Mathematics and Computation, 163, 489–504.
Fletcher, R. (1997). Practical methods of optimization, vol. I: Unconstrained optimization (2nd ed.). New York: Wiley.
Fletcher, R., & Reeves, C. (1964). Function minimization by conjugate gradients. Computer Journal, 7, 149–154.
Gilbert, J. C., & Nocedal, J. (1992). Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization, 2, 21–42.
Griewank, A., & Toint, Ph. L. (1982). Local convergence analysis for partitioned quasi-Newton updates. Numerische Mathematik, 39, 429–448.
Gould, N. I. M., Orban, D., & Toint, Ph. L. (2003). CUTEr (and SifDec): A constrained and unconstrained testing environment, revisited. ACM Transactions on Mathematical Software, 29, 373–394.
Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49, 409–436.
Horst, R., Pardalos, P. M., & Thoai, N. V. (1995). Introduction to global optimization. Dordrecht: Kluwer.
Li, D., & Fukushima, M. (2001). A modified BFGS method and its global convergence in nonconvex minimization. Journal of Computational and Applied Mathematics, 129, 15–35.
Lin, D. Y., & Ying, Z. (2001). Semiparametric and nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association, 96, 103–126.
Liu, Y., & Storey, C. (1992). Efficient generalized conjugate gradient algorithms, part 1: Theory. Journal of Optimization Theory and Applications, 69, 17–41.
Luenberger, D. G. (1989). Linear and nonlinear programming (2nd ed.). Reading, MA: Addison-Wesley.
Moré, J. J., Garbow, B. S., & Hillstrom, K. E. (1981). Testing unconstrained optimization software. ACM Transactions on Mathematical Software, 7, 17–41.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Pan, W., Louis, T. A., & Connett, J. E. (2000). A note on marginal linear regression with correlated response data. The American Statistician, 54, 191–195.
Polak, E., & Ribière, G. (1969). Note sur la convergence de méthodes de directions conjuguées. Revue Française d'Informatique et de Recherche Opérationnelle, 3e Année, 16, 35–43.
Powell, M. J. D. (1970). A new algorithm for unconstrained optimization. In J. B. Rosen, O. L. Mangasarian, & K. Ritter (Eds.), Nonlinear programming. New York: Academic Press.
Raydan, M. (1997). The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7, 26–33.
Raydan, M., & Svaiter, B. F. (2002). Relaxed steepest descent and Cauchy–Barzilai–Borwein method. Computational Optimization and Applications, 21, 155–167.
Rojas, M., Santos, S. A., & Sorensen, D. C. (2000). A new matrix-free algorithm for the large-scale trust-region subproblem. SIAM Journal on Optimization, 11(3), 611–646.
Schropp, J. (1997). A note on minimization problems and multistep methods. Numerische Mathematik, 78, 87–101.
Schropp, J. (2000). One-step and multistep procedures for constrained minimization problems. IMA Journal of Numerical Analysis, 20, 135–152.
Schultz, G. A., Schnabel, R. B., & Byrd, R. H. (1985). A family of trust-region-based algorithms for unconstrained minimization with strong global convergence properties. SIAM Journal on Numerical Analysis, 22, 47–67.
Shi, Z. J. (2004). Convergence of line search methods for unconstrained optimization. Applied Mathematics and Computation, 157, 393–405.
Van Wyk, D. J. (1984). Differential optimization techniques. Applied Mathematical Modelling, 8, 419–424.
Vardi, A. (1985). A trust-region algorithm for equality constrained minimization: Convergence properties and implementation. SIAM Journal on Numerical Analysis, 22, 575–579.
Vrahatis, M. N., Androulakis, G. S., Lambrinos, J. N., & Magoulas, G. D. (2000). A class of gradient unconstrained minimization algorithms with adaptive stepsize. Journal of Computational and Applied Mathematics, 114, 367–386.
Wei, Z., Li, G., & Qi, L. (2006). New quasi-Newton methods for unconstrained optimization problems. Applied Mathematics and Computation, 175, 1156–1188.
Wei, Z., Li, G., & Qi, L. (2006). New nonlinear conjugate gradient formulas for large-scale unconstrained optimization problems. Applied Mathematics and Computation, 179, 407–430.
Wei, Z., Yao, S., & Liu, L. (2006). The convergence properties of some new conjugate gradient methods. Applied Mathematics and Computation, 183, 1341–1350.
Wei, Z., Yu, G., Yuan, G., & Lian, Z. (2004). The superlinear convergence of a modified BFGS-type method for unconstrained optimization. Computational Optimization and Applications, 29, 315–332.
Yuan, G. L., & Lu, X. W. (2008). A new line search method with trust region for unconstrained optimization. Communications on Applied Nonlinear Analysis, 15(1), 35–49.
Yuan, Y., & Sun, W. (1999). Theory and methods of optimization. Beijing: Science Press of China.
Zhang, J., Deng, N., & Chen, L. (1999). New quasi-Newton equation and related methods for unconstrained optimization. Journal of Optimization Theory and Applications, 102(1), 147–167.
Zhang, H. C., & Hager, W. W. (2004). A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization, 14(4), 1043–1056.