Applied Mathematics Letters 45 (2015) 12–17
An efficient monotone projected Barzilai–Borwein method for nonnegative matrix factorization

Yakui Huang*, Hongwei Liu, Sha Zhou
School of Mathematics and Statistics, Xidian University, Xi'an 710126, PR China
* Corresponding author. E-mail address: [email protected] (Y. Huang).
Article history: Received 14 October 2014; Received in revised form 10 January 2015; Accepted 10 January 2015; Available online 20 January 2015

Keywords: Nonnegative matrix factorization; Alternating nonnegative least squares; Projected Barzilai–Borwein method; Monotone
Abstract

In this paper, we present an efficient method for nonnegative matrix factorization based on the alternating nonnegative least squares framework. Our approach adopts a monotone projected Barzilai–Borwein (MPBB) method as an essential subroutine, where the step length is determined without line search. The Lipschitz constant of the gradient is exploited to accelerate convergence. Global convergence of the proposed MPBB method is established. Numerical results are reported to demonstrate the efficiency of our algorithm.
1. Introduction

Nonnegative matrix factorization (NMF) [1,2] approximates a given matrix V ∈ R^{m×n}, V ≥ 0, by the product of two nonnegative matrices W ∈ R^{m×r} and H ∈ R^{r×n}, where r ≪ min{m, n} is a specified positive integer. Using the Frobenius norm to measure the distance between V and WH, NMF can be formulated as

$$\min_{W,H} F(W, H) := \frac{1}{2}\|V - WH\|_F^2 \quad \text{s.t.}\ W \in \mathbb{R}^{m\times r}_+,\ H \in \mathbb{R}^{r\times n}_+. \tag{1}$$
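As a concrete illustration of (1), the following MATLAB fragment evaluates the objective F for randomly generated factors; the dimensions below are arbitrary choices for illustration, not values used in our experiments.

```matlab
% Illustration of the objective in (1); the sizes are arbitrary.
m = 100; n = 80; r = 10;
V = rand(m, n);                        % nonnegative data matrix
W = rand(m, r); H = rand(r, n);        % nonnegative factors, r << min(m, n)
F = 0.5 * norm(V - W * H, 'fro')^2;    % Frobenius-norm objective F(W, H)
```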
NMF has received considerable attention in the past decade due to its usefulness in dimension reduction of image, text, and signal data; see [3–8] and references therein. Numerous algorithms have been proposed for solving (1). Lee and Seung [9] developed the multiplicative update (MU) algorithm, which updates the two matrices by multiplying each entry by a positive factor in each iteration. Although MU is easy to implement, it has been observed to converge relatively slowly, especially when dealing with dense matrices [3,10,11]. Paatero and Tapper [2] suggested using the alternating nonnegative least squares (ANLS) framework:

$$W^{k+1} = \arg\min_{W \ge 0} F(W, H^k), \tag{2}$$
$$H^{k+1} = \arg\min_{H \ge 0} F(W^{k+1}, H). \tag{3}$$
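A minimal sketch of the ANLS outer loop is given below; `nnls_w` stands for any solver of the convex subproblem with one factor fixed (for instance, the `mpbb` function sketched after Algorithm 1 in Section 2), and the fixed iteration count is a placeholder for the stopping criterion (14) discussed in Section 3.

```matlab
% A sketch of the ANLS framework (2)-(3); nnls_w(V, H, W0) is a placeholder
% for any solver of min_{W >= 0} 0.5*||V - W*H||_F^2 with H fixed.
W = rand(m, r); H = rand(r, n);        % random nonnegative initial point
for k = 1:100                          % placeholder for criterion (14)
    W = nnls_w(V, H, W);               % subproblem (2): update W
    H = nnls_w(V', W', H')';           % subproblem (3): update H, by symmetry
end
```

The transposition trick in the last line uses the identity ∥V − WH∥_F = ∥V^T − H^T W^T∥_F, so a single subproblem solver suffices for both updates.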
Notice that problem (1) is nonconvex and NP-hard [12]. The ANLS framework allows us to solve two convex subproblems (2) and (3), for which optimal solutions can be found. Recently, Grippo and Sciandrone [13] proved the convergence of the ANLS framework to a stationary point of (1). Clearly, at each iteration, the main cost of the ANLS framework lies in solving the subproblems (2) and (3). Many algorithms aimed at solving the two subproblems efficiently have been developed, for example, the active set method [14], the projected
gradient (PG) method [11], the projected Barzilai–Borwein (BB) method [15], the projected Newton method [16], and the projected quasi-Newton method [10,17]. Guan et al. [18] pointed out that these methods may be inefficient because they use time-consuming line searches. They applied Nesterov's optimal gradient method (OGM) [19] to solve the subproblems without line search and developed the NeNMF solver. However, it was observed by Huang et al. [20] that OGM may take thousands of iterations to reach a given tolerance, which degrades the efficiency of NeNMF. In [20], the authors proposed a quadratic regularization projected Barzilai–Borwein (QRPBB) method. By making use of the Lipschitz constant of the gradient, the QRPBB method improves the performance of the projected BB method significantly and outperforms three other solvers: PG, APBB2 [15], and NeNMF. However, the QRPBB method has to calculate two projections and two gradients at each iteration, which is expensive for a large matrix. Moreover, it needs to perform a nonmonotone line search to determine the step length at each iteration.

In this paper, following the ANLS framework, we present an efficient monotone projected BB (MPBB) method to solve the subproblems. The MPBB method exploits the Lipschitz constant of the gradient and uses the two BB stepsizes [21] alternately to accelerate convergence. Unlike the QRPBB method, our MPBB method computes two projections and two gradients only at even steps. Moreover, the proposed MPBB method determines the step length without any line search. Global convergence of the MPBB method is established, and numerical results are reported to demonstrate the efficiency of our algorithm. It is well known that BB-like methods are often more efficient with nonmonotone schemes. We will show by experiments that our MPBB method outperforms the APBB2 method, which employs the Grippo–Lampariello–Lucidi nonmonotone line search [22]. Our results suggest that, with proper modification, monotone BB-like methods can outperform nonmonotone ones in some cases.

The rest of this paper is organized as follows. Section 2 introduces the MPBB method for solving the subproblems and presents its global convergence result. Experimental comparisons among several NMF solvers are presented in Section 3.

2. Monotone projected Barzilai–Borwein method and its convergence

Since the roles of W and H are perfectly symmetric, we focus only on the update of the matrix W. At the kth iteration of the ANLS framework, we have to solve

$$\min_{W \ge 0} f_k(W) := F(W, H^k) = \frac{1}{2}\|V - W H^k\|_F^2. \tag{4}$$
To simplify the notation, we write f(W) rather than f_k(W) in the rest of the paper. Let P(·) be the operator that projects all the negative entries of an input matrix to zero. It is well known that W is a stationary point of (4) if and only if, for any fixed α > 0,

$$\|P[W - \alpha \nabla f(W)] - W\|_F = 0. \tag{5}$$

By Lemma 1 in [20], f(W) is convex and its gradient ∇f(W) is Lipschitz continuous with constant L = ∥H^k(H^k)^T∥_2. Since H^k(H^k)^T is an r × r matrix and r ≪ min{m, n}, the Lipschitz constant L is not expensive to obtain.
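In MATLAB, both ingredients are one-liners; a small sketch (the variable names are our own choices):

```matlab
% Projection onto the nonnegative orthant and the Lipschitz constant.
P  = @(X) max(X, 0);        % P(.) sets all negative entries to zero
Lc = norm(Hk * Hk', 2);     % L = ||H^k (H^k)^T||_2, spectral norm of an r-by-r matrix
```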
Our monotone projected Barzilai–Borwein (MPBB) method is presented in Algorithm 1, where W^k and H^k are obtained from the previous iteration in the ANLS framework.

Algorithm 1. Monotone projected BB method

Step 1. Choose constants σ ∈ (0, 1) and α_max > α_min > 0. Compute L = ∥H^k(H^k)^T∥_2. Set W_0 = W^k, α_0 = 1, and t = 0.
Step 2. If W_t is a stationary point of (4), stop.
Step 3. If t is odd, set Z_t = W_t; otherwise, compute Z_t by
$$Z_t = P\Big[W_t - \frac{1}{L}\nabla f(W_t)\Big]. \tag{6}$$
Step 4. Compute D_t = P[Z_t − α_t ∇f(Z_t)] − Z_t and δ_t = ⟨D_t, H^k(H^k)^T D_t⟩. If δ_t = 0, set λ_t = 1; otherwise, set λ_t = min{λ̃_t, 1}, where
$$\tilde{\lambda}_t = -\frac{(1-\sigma)\langle \nabla f(Z_t), D_t\rangle}{\delta_t}. \tag{7}$$
Set W_{t+1} = Z_t + λ_t D_t.
Step 5. Define S_t = W_{t+1} − W_t and Y_t = ∇f(W_{t+1}) − ∇f(W_t). If ⟨S_t, Y_t⟩ ≤ 0, set α_{t+1} = α_max; otherwise, compute
$$\alpha^{BB}_{t+1} = \begin{cases} \dfrac{\langle S_t, S_t\rangle}{\langle S_t, Y_t\rangle}, & \text{for odd } t; \\[2mm] \dfrac{\langle S_t, Y_t\rangle}{\langle Y_t, Y_t\rangle}, & \text{for even } t, \end{cases} \tag{8}$$
and set α_{t+1} = min{α_max, max{α_min, α^{BB}_{t+1}}}.
Step 6. Set t = t + 1 and go to Step 2.
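The following MATLAB sketch implements Algorithm 1 under mild assumptions: the stationarity test of Step 2 is replaced by the practical criterion (10) introduced below, the values of σ, α_min, α_max and the iteration cap are our own choices, and the gradient ∇f(W) = W H^k(H^k)^T − V(H^k)^T is formed from the precomputed products H^k(H^k)^T and V(H^k)^T.

```matlab
function [W, t] = mpbb(V, Hk, W0, epsW, maxit)
% A minimal sketch of Algorithm 1 (names and constants are our assumptions).
% Solves min_{W >= 0} 0.5*||V - W*Hk||_F^2 without line search.
sigma = 1e-4; amin = 1e-10; amax = 1e10;    % assumed constants
G   = Hk * Hk';                 % r-by-r Gram matrix, reused throughout
VHt = V * Hk';                  % so that grad f(W) = W*G - VHt
L   = norm(G, 2);               % Lipschitz constant of the gradient
W = W0; alpha = 1;              % alpha_0 = 1
gW = W * G - VHt;
for t = 0:maxit-1
    if norm(max(W - gW, 0) - W, 'fro') <= epsW, break; end  % criterion (10)
    if mod(t, 2) == 1
        Z = W; gZ = gW;                      % odd t: reuse W_t and its gradient
    else
        Z = max(W - gW / L, 0);              % even t: gradient projection (6)
        gZ = Z * G - VHt;
    end
    D = max(Z - alpha * gZ, 0) - Z;          % projected BB direction
    delta = sum(sum(D .* (D * G)));          % delta_t = <D_t, D_t*G>
    gd = sum(sum(gZ .* D));                  % <grad f(Z_t), D_t>
    if delta == 0
        lambda = 1;
    else
        lambda = min(-(1 - sigma) * gd / delta, 1);   % step length (7)
    end
    Wnew  = Z + lambda * D;
    gWnew = Wnew * G - VHt;
    S = Wnew - W; Y = gWnew - gW;
    sy = sum(sum(S .* Y));
    if sy <= 0
        alpha = amax;
    elseif mod(t, 2) == 1
        alpha = min(amax, max(amin, sum(sum(S .* S)) / sy));   % BB1, odd t
    else
        alpha = min(amax, max(amin, sy / sum(sum(Y .* Y))));   % BB2, even t
    end
    W = Wnew; gW = gWnew;
end
end
```

Note how the odd/even branching realizes the cost saving stated above: an odd step computes one projection and one new gradient, whereas an even step computes two of each.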
The idea of using the projected BB method without line search was introduced by Han et al. [15]. However, their approach (APBB4) uses α^{BB}_{t+1} = ⟨S_t, S_t⟩/⟨S_t, Y_t⟩ throughout the iterative process, whereas Dai and Fletcher [23] observed that the alternating strategy (8) is better than using a single stepsize rule. Moreover, APBB4 does not make use of the Lipschitz constant of the gradient. In Section 3, we will show by experimental results that our MPBB method outperforms APBB2, which was shown to be better than APBB4 in [15].

The step length λ_t in Step 4 satisfies

$$f(W_{t+1}) = f(Z_t + \lambda_t D_t) \le f(Z_t) + \sigma \lambda_t \langle \nabla f(Z_t), D_t\rangle. \tag{9}$$
In fact, since f is quadratic, f(Z_t + λD_t) = f(Z_t) + λ⟨∇f(Z_t), D_t⟩ + (λ²/2)⟨D_t, H^k(H^k)^T D_t⟩, so if we denote

$$\varphi(\lambda) = f(Z_t + \lambda D_t) - f(Z_t) - \sigma\lambda\langle\nabla f(Z_t), D_t\rangle = \frac{\lambda^2}{2}\delta_t + (1-\sigma)\lambda\langle\nabla f(Z_t), D_t\rangle,$$

then (9) is equivalent to φ(λ_t) ≤ 0. If δ_t = 0, then λ_t = 1; since ⟨∇f(Z_t), D_t⟩ ≤ 0 (see [20]) and σ ∈ (0, 1), we have φ(1) ≤ 0. If δ_t > 0, then λ̃_t defined by (7) is the minimizer of φ(λ), so we need only show that (9) holds in the case λ_t = 1, i.e. λ̃_t ≥ 1. By the definition of λ̃_t, we then have −(1 − σ)⟨∇f(Z_t), D_t⟩ ≥ δ_t, and therefore φ(1) ≤ −(1/2)δ_t < 0.

Notice that using (5) exactly as the stopping condition would keep the algorithm running unnecessarily long. In practice, similar to [11], we use

$$\|P[W_t - \nabla f(W_t)] - W_t\|_F \le \epsilon_W, \tag{10}$$

where ε_W = max(10^{−3}, ε)∥P[W_0 − ∇f(W_0)] − W_0∥_F. If Algorithm 1 solves (4) without performing any iterations, we decrease the stopping tolerance by ε_W = 0.1ε_W. The same strategy is adopted to update H.
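In an implementation, the tolerance logic around (10) might look as follows; this is a sketch under our own naming, with `eps_outer` denoting the overall tolerance ε and `mpbb` the solver sketched above.

```matlab
% Sketch of the adaptive inner tolerance around (10); names are our own.
gW0  = W0 * (Hk * Hk') - V * Hk';                 % gradient of f at W0
epsW = max(1e-3, eps_outer) * norm(max(W0 - gW0, 0) - W0, 'fro');
[W, t] = mpbb(V, Hk, W0, epsW, 1000);             % hypothetical solver call
if t == 0                                         % (4) met (10) immediately:
    epsW = 0.1 * epsW;                            % tighten the tolerance
    W = mpbb(V, Hk, W0, epsW, 1000);              % and solve again
end
```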
For a given W_0 ≥ 0, define the level set L(W_0) = {W | f(W) ≤ f(W_0), W ≥ 0}. Now we establish the global convergence of the MPBB method.

Theorem 1. Suppose that the level set L(W_0) is bounded. Then any accumulation point of the sequence {Z_t} generated by Algorithm 1 is a stationary point of (4).

Proof. From Lemma 3 of [20], we have, for even t, f(Z_t) ≤ f(W_t) − (L/2)∥Z_t − W_t∥². Notice that Z_t = W_t when t is odd. Thus, we always have

$$f(Z_t) \le f(W_t). \tag{11}$$
It follows from Lemma 2 of [20] and the definitions of α_t and D_t that

$$\langle \nabla f(Z_t), D_t\rangle \le -\frac{1}{\alpha_t}\|D_t\|^2 \le -\frac{1}{\alpha_{\max}}\|D_t\|^2, \tag{12}$$
which together with (9) and (11) implies that

$$f(Z_{t+1}) \le f(W_{t+1}) = f(Z_t + \lambda_t D_t) \le f(Z_t) - \frac{\sigma\lambda_t}{\alpha_{\max}}\|D_t\|^2. \tag{13}$$
Therefore, the sequence {f(Z_t)} is nonincreasing. Again from (11), we know that {Z_t} ⊆ L(W_0). Since L(W_0) is bounded, {Z_t} admits an accumulation point. Assume that {t_j} is a subsequence such that Z_{t_j} → Z̄. By the definition of λ̃_t and (12), we obtain

$$\tilde{\lambda}_{t_j} \ge \frac{(1-\sigma)\frac{1}{\alpha_{\max}}\|D_{t_j}\|^2}{\langle D_{t_j}, H^k(H^k)^T D_{t_j}\rangle} \ge \frac{1-\sigma}{\alpha_{\max}\,\rho} := \bar{\lambda},$$

where ρ > 0 is the maximum eigenvalue of H^k(H^k)^T. Therefore, it follows from the definition of λ_t that λ_{t_j} ≥ min{1, λ̄}, which together with (13) gives lim_{j→∞} ∥D_{t_j}∥² = 0. Recalling that the sequence {α_t} is bounded, by taking a subsequence if necessary, we have α_{t_j} → ᾱ. Thus, ∥P(Z̄ − ᾱ∇f(Z̄)) − Z̄∥ = 0, i.e. Z̄ is stationary. □
By (5) and (6), if W_t is a stationary point of (4), then Z_t = W_t is stationary. On the other hand, if Z_t is a stationary point of (4), then D_t = 0, which implies that W_{t+1} is stationary.

3. Experiments

In this section, we compare the performance of our MPBB method under the ANLS framework with three other ANLS-based methods: NeNMF [18], APBB2¹ [15], and the QRPBB method [20]. Our method is implemented in MATLAB (codes are available upon request). All the experiments were executed on a 3.10 GHz Core i5 PC with 4 GB RAM and Windows 7.

¹ Available at: http://homepages.umflint.edu/~lxhan/software.html.
Table 1. Experimental results on synthetic datasets (dense matrices) with ε = 10^{−7}.

(m, n, r)          Alg     iter    niter    pgn       Time   Residual
(100, 50, 5)       NeNMF   281.1   11414.5  7.98E−05  0.23   0.4481
                   APBB2   421.7   5306.9   7.18E−05  0.26   0.4481
                   QRPBB   343.6   3127.7   4.20E−05  0.18   0.4481
                   MPBB    253.0   2513.1   5.60E−05  0.13   0.4481
(200, 100, 10)     NeNMF   638.6   27940.5  6.73E−04  1.69   0.4392
                   APBB2   591.0   9375.3   5.69E−04  0.88   0.4393
                   QRPBB   739.6   7757.1   6.90E−04  0.95   0.4393
                   MPBB    518.1   6433.7   5.82E−04  0.69   0.4392
(300, 100, 20)     NeNMF   831.0   34054.0  2.92E−03  5.49   0.4057
                   APBB2   304.2   5789.9   3.09E−03  1.02   0.4060
                   QRPBB   447.8   4448.6   3.13E−03  1.27   0.4058
                   MPBB    553.6   6146.5   2.93E−03  1.58   0.4058
(500, 300, 25)     NeNMF   874.4   29416.0  1.58E−02  8.28   0.4471
                   APBB2   216.8   3811.9   1.22E−02  1.72   0.4473
                   QRPBB   224.9   2909.3   1.11E−02  1.83   0.4473
                   MPBB    248.1   2986.7   1.34E−02  1.74   0.4473
(1000, 500, 50)    NeNMF   466.3   15471.5  1.16E−01  15.99  0.4465
                   APBB2   49.6    1357.1   1.03E−01  1.84   0.4477
                   QRPBB   51.8    888.5    6.79E−02  1.75   0.4477
                   MPBB    58.3    886.9    1.00E−01  1.68   0.4477
(2000, 1000, 50)   NeNMF   500.1   16643.0  3.29E−01  37.84  0.4713
                   APBB2   55.3    1537.6   2.90E−01  4.65   0.4721
                   QRPBB   32.1    520.3    2.37E−01  2.55   0.4722
                   MPBB    30.6    433.8    2.39E−01  2.04   0.4722
(3000, 1000, 100)  NeNMF   218.5   8224.0   1.33E+00  59.79  0.4560
                   APBB2   92.6    2877.1   1.03E+00  27.40  0.4564
                   QRPBB   97.5    1727.3   7.63E−01  27.30  0.4564
                   MPBB    119.7   1550.4   9.81E−01  21.78  0.4564
(5000, 1000, 100)  NeNMF   124.4   4367.8   2.10E+00  60.76  0.4617
                   APBB2   74.9    2665.7   1.59E+00  33.45  0.4617
                   QRPBB   16.0    311.3    1.78E+00  9.61   0.4643
                   MPBB    15.3    282.1    1.66E+00  7.90   0.4644
Table 2. Experimental results on the CBCL and ORL databases with ε = 10^{−7}. The CBCL image database is represented by a 361 × 2429 matrix, while the ORL image database is represented by a 10,304 × 400 matrix.

(m, n, r)          Alg     iter    niter    pgn       Time   Residual
(361, 2429, 49)    NeNMF   128.7   4623.1   2.08E−01  10.83  0.1947
                   APBB2   109.9   4490.9   1.69E−01  7.69   0.1955
                   QRPBB   94.1    2041.7   1.04E−01  5.16   0.1942
                   MPBB    62.5    1267.0   1.63E−01  3.34   0.1955
(10,304, 400, 25)  NeNMF   11.4    492.8    3.02E−01  3.21   0.2075
                   APBB2   14.9    345.8    2.83E−01  1.77   0.1883
                   QRPBB   24.5    344.5    1.89E−01  2.18   0.1830
                   MPBB    10.0    149.4    2.72E−01  1.07   0.1871
By the optimality conditions for constrained optimization, (W, H) is a stationary point of (1) if and only if

$$\nabla^P_W F(W, H) = 0, \qquad \nabla^P_H F(W, H) = 0,$$

where

$$[\nabla^P_W F(W, H)]_{ij} = \begin{cases} [\nabla_W F(W, H)]_{ij}, & (W)_{ij} > 0, \\ \min\{0, [\nabla_W F(W, H)]_{ij}\}, & (W)_{ij} = 0, \end{cases}$$

and ∇^P_H F(W, H) is defined in the same way. Similar to [11], we use the following stopping criterion:

$$\big\|\big[\nabla^P_H F(W^k, H^k),\ \nabla^P_W F(W^k, H^k)^T\big]\big\|_F \le \epsilon\,\big\|\big[\nabla^P_H F(W^1, H^1),\ \nabla^P_W F(W^1, H^1)^T\big]\big\|_F, \tag{14}$$

where ε > 0 is a tolerance. For all test problems, we set ε = 10^{−7}.
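A sketch of how the projected gradient norm in (14) can be computed (the function name and the orientation of the stacked matrix are our own choices):

```matlab
function pgn = proj_grad_norm(V, W, H)
% Sketch: projected-gradient norm used in the stopping criterion (14).
R  = W * H - V;                     % residual, shared by both gradients
gW = R * H';                        % gradient of F with respect to W
gH = W' * R;                        % gradient of F with respect to H
gW(W == 0) = min(gW(W == 0), 0);    % projected-gradient entries where W_ij = 0
gH(H == 0) = min(gH(H == 0), 0);    % projected-gradient entries where H_ij = 0
pgn = norm([gH, gW'], 'fro');       % norm of [grad_H^P F, (grad_W^P F)^T]
end
```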
Table 3. Experimental results on the Reuters-21578 and TDT-2 datasets with ε = 10^{−7}. The Reuters-21578 corpus is represented by an 18,933 × 8293 matrix, while the TDT-2 corpus is represented by a 36,771 × 9394 matrix.

(m, n, r)           Alg     iter   niter   pgn       Time   Residual
(18,933, 8293, 50)  NeNMF   10.0   249.7   1.29E−01  5.52   0.9241
                    APBB2   13.3   260.7   7.27E+00  8.10   0.9320
                    QRPBB   10.2   118.8   6.16E+00  4.54   0.9239
                    MPBB    10.0   120.4   6.97E+00  3.63   0.9242
(36,771, 9394, 100) NeNMF   10.0   306.6   1.64E−01  22.63  0.9094
                    APBB2   10.0   288.7   4.67E+01  22.87  0.9323
                    QRPBB   10.0   110.3   3.74E+01  16.55  0.9088
                    MPBB    10.0   112.3   3.25E+01  13.82  0.9096
We compared these methods on both synthetic datasets and real-world problems using the ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) and CBCL (http://cbcl.mit.edu/software-datasets/FaceData2.html) face image databases, and the Reuters-21578 [24] and TDT-2 [25] text corpora (MATLAB format data is available at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html). As observed in [26], the performance of NMF solvers may differ when applied to V and its transpose V^T. Similar to [26], we apply NeNMF and APBB2 to the transpose matrix V^T when m > n; on the other hand, the QRPBB method and our MPBB method are applied to V^T in the case m < n. We do not compare the method in [26] because its performance is close to that of the QRPBB method.

In the following tables, iter denotes the number of iterations needed when the termination criterion (14) is met. We denote the total number of sub-iterations for solving (2) and (3), the final value of the projected gradient norm as defined in (14), the final value of ∥V − W^k H^k∥_F/∥V∥_F, and the CPU time used at the termination of each algorithm by niter, pgn, residual, and time, respectively.

For each random problem, we use the rand function in MATLAB to generate an m × n matrix V. We then generate 10 different random initial points and present the average results in Table 1. For most of the test problems, our MPBB method outperforms the three other solvers in terms of the number of iterations, sub-iterations, and CPU time. Moreover, the final values of pgn and residual obtained by MPBB are as good as those of the other solvers.

For the CBCL and ORL image databases and the Reuters-21578 and TDT-2 text corpora, we follow the same setting as in [26]. Average results over 10 different randomly generated initial iterates are reported. Table 2 shows that, for the CBCL and ORL image databases, the compared algorithms compute fairly good solutions with regard to the projected gradient norm. Considering the residual, NeNMF yields a larger value for the ORL image database, which means the reconstruction obtained by NeNMF is worse. We can observe from Table 3 that, for the Reuters-21578 and TDT-2 text corpora, the values of residual obtained by NeNMF, QRPBB, and MPBB are close to each other and smaller than those of APBB2. NeNMF gives the smallest projected gradient norm. This is possibly because problem (1), as well as the subproblems (2) and (3), is not strongly convex and may have multiple minimizers.

Acknowledgments

The authors are very grateful to Dr. Naiyang Guan of National University of Defense Technology for providing us the codes of [18]. They would also like to thank the Editor-in-Chief, Prof. Alan Tucker, and the anonymous referees for their helpful comments and suggestions. This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61072144 and No. 61179040 and the Fundamental Research Funds for the Central Universities (K50513100007).

References

[1] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788–791.
[2] P. Paatero, U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (2) (1994) 111–126.
[3] M.W. Berry, M. Browne, A.N. Langville, V.P. Pauca, R.J. Plemmons, Algorithms and applications for approximate nonnegative matrix factorization, Comput. Statist. Data Anal. 52 (1) (2007) 155–173.
[4] S. Bonettini, R. Zanella, L. Zanni, A scaled gradient projection method for constrained image deblurring, Inverse Problems 25 (1) (2009) 015002.
[5] A. Cichocki, R. Zdunek, S.I. Amari, New algorithms for non-negative matrix factorization in applications to blind source separation, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2006, pp. 621–624.
[6] A. Cichocki, R. Zdunek, A.H. Phan, S.I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, Chichester, 2009.
[7] S.Z. Li, X.W. Hou, H.J. Zhang, Q.S. Cheng, Learning spatially localized, parts-based representation, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, pp. 207–212.
[8] V.P. Pauca, Text mining using nonnegative matrix factorizations, in: Proceedings of the 2004 SIAM International Conference on Data Mining, 2004, pp. 22–24.
[9] D.D. Lee, H.S. Seung, Algorithms for nonnegative matrix factorization, in: Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[10] D. Kim, S. Sra, I.S. Dhillon, Fast Newton-type methods for the least squares nonnegative matrix approximation problem, in: SIAM International Conference on Data Mining, 2007, pp. 343–354.
[11] C.-J. Lin, Projected gradient methods for nonnegative matrix factorization, Neural Comput. 19 (10) (2007) 2756–2779.
[12] S. Vavasis, On the complexity of nonnegative matrix factorization, SIAM J. Optim. 20 (3) (2009) 1364–1377.
[13] L. Grippo, M. Sciandrone, On the convergence of the block nonlinear Gauss–Seidel method under convex constraints, Oper. Res. Lett. 26 (3) (2000) 127–136.
[14] H. Kim, H. Park, Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method, SIAM J. Matrix Anal. Appl. 30 (2) (2008) 713–730.
[15] L. Han, M. Neumann, U. Prasad, Alternating projected Barzilai–Borwein methods for nonnegative matrix factorization, Electron. Trans. Numer. Anal. 36 (6) (2009) 54–82.
[16] P. Gong, C. Zhang, Efficient nonnegative matrix factorization via projected Newton method, Pattern Recognit. 45 (9) (2012) 3557–3565.
[17] R. Zdunek, A. Cichocki, Non-negative matrix factorization with quasi-Newton optimization, in: Artificial Intelligence and Soft Computing, 2006, pp. 870–879.
[18] N. Guan, D. Tao, Z. Luo, B. Yuan, NeNMF: an optimal gradient method for nonnegative matrix factorization, IEEE Trans. Signal Process. 60 (6) (2012) 2882–2898.
[19] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), in: Soviet Mathematics Doklady, 1983, pp. 372–376.
[20] Y. Huang, H. Liu, S. Zhou, Quadratic regularization projected Barzilai–Borwein method for nonnegative matrix factorization, Data Min. Knowl. Discov. (2014) http://dx.doi.org/10.1007/s10618-014-0390-x.
[21] J. Barzilai, J.M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal. 8 (1) (1988) 141–148.
[22] L. Grippo, F. Lampariello, S. Lucidi, A nonmonotone line search technique for Newton's method, SIAM J. Numer. Anal. 23 (4) (1986) 707–716.
[23] Y.-H. Dai, R. Fletcher, Projected Barzilai–Borwein methods for large-scale box-constrained quadratic programming, Numer. Math. 100 (1) (2005) 21–47.
[24] D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004) 361–397.
[25] C. Cieri, D. Graff, M. Liberman, N. Martey, S. Strassel, The TDT-2 text and speech corpus, in: Proceedings of the DARPA Broadcast News Workshop, 1999, pp. 57–60.
[26] Y. Huang, H. Liu, S. Zhou, Quadratic regularization projected alternating Barzilai–Borwein method for constrained optimization, Optimization Online (2014), preprint.