Journal of Computational and Applied Mathematics 371 (2020) 112679
Majorized Proximal Alternating Imputation for regularized rank constrained matrix completion
Shenfen Kuang a,b, Hongyang Chao c,d,∗, Qia Li a,e
a School of Data and Computer Science, Sun Yat-sen University, Guangzhou, PR China
b School of Mathematics and Statistics, Shaoguan University, Shaoguan, PR China
c CVTE Research, Guangzhou, PR China
d Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, PR China
e Guangdong Province Key Laboratory of Computational Science, Guangzhou, PR China
∗ Corresponding author at: Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, PR China. E-mail address: [email protected] (H. Chao).
Article info
Article history: Received 9 May 2018; Received in revised form 15 April 2019.
Keywords: Matrix completion; Bilinear factorization; Majorization minimization; Proximal alternating minimization; Nuclear norm; Non-convex regularization
Abstract

Low rank approximation in the presence of missing data is ubiquitous in many areas of science and engineering. Since the underlying rank minimization is NP-hard, spectral regularization (e.g., the nuclear norm or the Schatten p-norm) is widely used. However, solving the regularized problem directly can be computationally expensive, because the rank of the variable is unknown and full rank Singular Value Decompositions (SVDs) are required. Bilinear factorization is efficient and benefits from fast numerical optimization, but most existing factorization methods require explicit knowledge of the rank and are prone to flatlining or overfitting. To combine the advantages of both, we formulate matrix completion as a regularized rank constrained problem. We relax this problem to a sequence of surrogates via Majorization Minimization (MM). By optimizing the surrogates with a proximal alternating minimization method, we obtain an efficient and accurate algorithm named Majorized Proximal Alternating Imputation (MPA-IMPUTE). The proposed method iteratively replaces the missing data at each step with the most recent estimate, and only involves simple matrix computations and small scale SVDs. Theoretically, the method is guaranteed to converge to a stationary point under mild conditions. Moreover, it can be generalized to a family of non-convex regularization problems. Extensive experiments on synthetic data and real-world datasets demonstrate the superior performance of our method.
1. Introduction

Matrix completion amounts to estimating the missing entries of a matrix from a very limited number of observed entries under a low rank assumption. The movie rating completion task known as the Netflix competition [1] greatly motivated and popularized this problem. It has since attracted considerable interest in machine learning, computer vision and related communities, since a great variety of tasks can be formulated as estimating a low rank matrix or tensor with missing data, such as image restoration [2], structure from motion [3,4], image retrieval and tag completion [5], photometric stereo [6], and so on.
For a given matrix X ∈ R^{m×n} with part of its entries observed, matrix completion is often posed as a convex optimization problem with nuclear norm regularization [7]. In this paper, we further consider this formulation by adding an explicit low rank constraint. Mathematically, the problem we are interested in can be formulated as follows:

$$\min_{Y} \; f(Y) + \lambda\|Y\|_{*}, \quad \text{s.t. } \operatorname{rank}(Y) \le r, \qquad (1)$$

where f(Y) = (1/2)‖P_Ω(X − Y)‖_F^2, Ω ⊂ {1, 2, ..., m} × {1, 2, ..., n} denotes the indices of the known entries, and P_Ω is the orthogonal sampling operator defined by

$$P_\Omega(Y)_{i,j} = \begin{cases} Y_{i,j}, & \text{if } (i,j) \in \Omega, \\ 0, & \text{otherwise.} \end{cases}$$
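For concreteness, the sampling operator P_Ω and its complement can be realized with a boolean mask; the following NumPy sketch is ours (function names are not from the paper) and only illustrates the definition above.

```python
import numpy as np

def P_Omega(Y, mask):
    """Orthogonal sampling operator: keep observed entries, zero out the rest.

    mask is a boolean array of the same shape as Y, True on the index set Omega.
    """
    return np.where(mask, Y, 0.0)

def P_Omega_perp(Y, mask):
    """Complement operator, so that P_Omega(Y) + P_Omega_perp(Y) = Y."""
    return np.where(mask, 0.0, Y)
```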
‖Y‖_* [8] is the nuclear norm, defined as the sum of the singular values of Y; λ is a regularization parameter; r is an estimated rank; rank(Y) is the number of non-zero singular values of Y.

Matrix rank minimization is well known to be NP-hard. At present, there are mainly two types of approaches to obtain approximate solutions: spectral regularization and bilinear factorization. Spectral regularization, e.g., the nuclear norm, total variation regularization [9–11], and non-convex regularization [12,13], replaces the rank by a surrogate function that enforces a low rank solution. Among these, nuclear norm regularization is the most prominent approach for matrix completion. It has been shown that the nuclear norm is the tightest convex relaxation of the rank over the unit ball [14]. Candès et al. [8] further showed that, under certain conditions, the incomplete data matrix can be exactly recovered by nuclear norm minimization. However, in many learning tasks the matrix rank is predetermined by a domain specific constraint, e.g., in rigid structure from motion the rank is at most 4 [4], and nuclear norm regularization cannot incorporate this prior information directly. More importantly, nuclear norm minimization often involves a high or full rank Singular Value Decomposition (SVD) at each iteration, which is computationally expensive.

Another widely used technique to estimate missing data is bilinear factorization, e.g., [15–17], which directly factorizes the original matrix into two (or more) much smaller ones. A major advantage of factorization is that each update is computationally cheap and has a small memory requirement. However, most existing methods require explicit knowledge of the matrix rank. Besides, when regularization is not used, they may be more prone to flatlining or overfitting [18,19].

To better solve problem (1), we combine and take advantage of both spectral regularization and bilinear factorization, so that a predetermined rank can naturally be incorporated into the traditional regularization problem. We first reparameterize the unknown low rank matrix as the product of two much smaller matrices: for a given estimated rank r, Y is decomposed as Y = UV, where U ∈ R^{m×r} and V ∈ R^{r×n}, so that the low-rank constraint is automatically satisfied. Note that there is a gauge freedom: for any invertible r × r matrix G, (UG)(G^{-1}V) = UV. To reduce this freedom, it is reasonable to further restrict U to a (compact) Stiefel manifold, defined as St(r, m) := {U ∈ R^{m×r} : U^T U = I_r}. The orthonormality constraint shrinks the solution space so that completing the missing data becomes a better posed problem, since U and V are now determined only up to matrices G with G^T G = I_r. The underlying idea is a low rank factorization of the unknown matrix similar to the SVD, Y = USD^T: as in the SVD, U lies on a Stiefel manifold, but SD^T is treated as a single matrix, denoted by V in this paper. A similar setting is considered in [20] for the L1 norm case within an ADMM framework, while [21–23] exploit the manifold structure with Riemannian optimization. With the factorization Y = UV (U^T U = I_r), the nuclear norm satisfies ‖Y‖_* = ‖V‖_*. We thus recast Problem (1) as follows:

$$\min_{U \in St(r,m),\, V \in \mathbb{R}^{r \times n}} \; F(U, V) = f(UV) + \lambda\|V\|_{*}. \qquad (2)$$
To meet the ever growing size and number of large-scale datasets, we aim to develop an efficient algorithm for Problem (2) that can deal with large matrices while still maintaining high precision. Motivated by Majorization Minimization (MM) [24] as well as the ideas introduced in SoftImpute [25] and SoftImpute-ALS [17], we relax (2) to a sequence of surrogates. The main idea is to impute the missing data iteratively; every imputation step then yields an upper bound (surrogate) function. The resulting surrogates are solved via the proximal alternating minimization method [26] with convergence guarantees. In summary, the contributions of this paper are as follows:
• We propose to use MM and proximal alternating minimization to solve the regularized rank constrained matrix completion problem. The proposed Majorized Proximal Alternating Imputation (MPA-IMPUTE) algorithm only involves small scale SVDs and simple matrix computations. In theory, we prove that MPA-IMPUTE achieves sufficient decrease of the objective function value and that any limit point of MPA-IMPUTE is a stationary point of Problem (2). In addition, we establish an O(1/n) convergence rate for MPA-IMPUTE.
• We further extend MPA-IMPUTE to solve non-convex (concave) regularization problems. By forming the corresponding surrogate function at each iteration, the resulting subproblem can be addressed efficiently. Finally, we adapt our method to Truncated Nuclear Norm Regularization (TNNR) and show that it indeed leads to a better solution.
The rest of this paper is organized as follows. We introduce the proposed method and its convergence analysis in Section 2, generalize it to non-convex regularization in Section 3, report the experimental results in Section 4, and conclude the paper in Section 5.

1.1. Notation

For a given m × n matrix X, Ω ⊂ {1, ..., m} × {1, ..., n} denotes the indices of observed entries and |Ω| denotes the number of observed entries. P_Ω(X)_{i,j} = X_{ij} if (i, j) ∈ Ω, and 0 otherwise. P_Ω^⊥(·) denotes the complement of P_Ω(·), which satisfies P_Ω(X) + P_Ω^⊥(X) = X. UΣV^T is the SVD of X, where Σ = diag(σ_i(X)) and σ_i(X), i = 1, 2, ..., min{m, n}, denotes the ith singular value of X. U_r Σ_r V_r^T is the truncated SVD of X with the first r components. ‖X‖_* = Σ_{i=1}^{min{m,n}} σ_i(X) is the nuclear norm. The scalar product is defined as ⟨X, Y⟩ = tr(X^T Y), where tr is the trace function. ‖X‖_F = (Σ_{i=1}^{m} Σ_{j=1}^{n} X_{ij}^2)^{1/2} is the Frobenius norm. St(p, n) (p ≤ n) denotes the set of all n × p orthonormal matrices, i.e., St(p, n) := {Y ∈ R^{n×p} : Y^T Y = I_p}. For any square matrix A, Sym(A) = (A + A^T)/2.
2. Proposed method

This section presents an algorithm for solving Problem (2). We first consider nuclear norm regularization and then generalize to non-convex regularization in the next section.

2.1. Surrogate function

Due to its nonconvexity and nonsmoothness, as well as the bilinear term UV, Problem (2) is hard to optimize directly. A possible way is to relax it to a surrogate. Majorization Minimization (MM) [24,27] works by finding a surrogate function that minorizes or majorizes the objective function; a successful MM algorithm substitutes a simple optimization problem for the original difficult one, which often leads to a more efficient algorithm. Similar to [17,25], our main idea is to impute the missing data after solving for U or V; every imputation step then yields an upper bound function. Instead of the original problem we solve the upper bound (surrogate) function, and U and V are updated accordingly.

To relax (2) with MM, first observe that P_Ω(X − UV) = P_Ω(X) + P_Ω^⊥(UV) − UV. We denote by X^* the completion of the missing data of X with the most recent estimate. Given the latest U^k ∈ R^{m×r} and V^k ∈ R^{r×n}, the completion of X can be represented as

$$X^{*} = P_\Omega(X) + P_\Omega^{\perp}(U^{k}V^{k}) = P_\Omega(X - U^{k}V^{k}) + U^{k}V^{k}. \qquad (3)$$

Suppose |Ω| ≪ m × n; then P_Ω(X − U^kV^k) is sparse, and U^kV^k is low rank if r ≪ min{m, n}. This decomposes X^* into a sparse plus a low rank matrix, which is efficient to store and also efficient for left and right matrix multiplication with large scale matrices. We define the surrogates in Lemma 1.

Lemma 1. Given U^k, V^k, define

$$Q_1(U \mid U^{k}, V^{k}) = \tfrac{1}{2}\|X^{*} - UV^{k}\|_F^2 + \lambda\|V^{k}\|_{*}, \qquad (4)$$

$$Q_2(V \mid U^{k}, V^{k}) = \tfrac{1}{2}\|X^{*} - U^{k}V\|_F^2 + \lambda\|V\|_{*}, \qquad (5)$$

where U ∈ St(r, m), V ∈ R^{r×n}, and X^* = P_Ω(X) + P_Ω^⊥(U^kV^k). Then we have

$$F(U, V^{k}) \le Q_1(U \mid U^{k}, V^{k}), \qquad F(U^{k}, V) \le Q_2(V \mid U^{k}, V^{k}). \qquad (6)$$

Proof. First, observe that

$$\|P_\Omega(X - UV^{k})\|_F^2 \le \|P_\Omega(X - UV^{k}) + P_\Omega^{\perp}(U^{k}V^{k} - UV^{k})\|_F^2 = \|(P_\Omega(X) + P_\Omega^{\perp}(U^{k}V^{k})) - (P_\Omega(UV^{k}) + P_\Omega^{\perp}(UV^{k}))\|_F^2 = \|X^{*} - UV^{k}\|_F^2.$$

Hence, this inequality leads to F(U, V^k) ≤ Q_1(U | U^k, V^k).
Similarly, we have

$$\|P_\Omega(X - U^{k}V)\|_F^2 \le \|P_\Omega(X - U^{k}V) + P_\Omega^{\perp}(U^{k}V^{k} - U^{k}V)\|_F^2 = \|X^{*} - U^{k}V\|_F^2.$$

This inequality leads to F(U^k, V) ≤ Q_2(V | U^k, V^k). □
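As an illustrative sanity check of Lemma 1 (our own, not from the paper), the following NumPy snippet verifies numerically on random data that Q_1 upper bounds F(·, V^k) and that the two coincide at U = U^k.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, lam = 30, 20, 3, 0.1
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.4                        # observed index set Omega

# current iterate (U^k, V^k) with orthonormal U^k
U_k, _ = np.linalg.qr(rng.standard_normal((m, r)))
V_k = rng.standard_normal((r, n))
X_star = np.where(mask, X, U_k @ V_k)                  # imputation, Eq. (3)

def F(U, V):                                           # objective of Problem (2)
    return 0.5 * np.sum(np.where(mask, X - U @ V, 0.0) ** 2) \
           + lam * np.linalg.svd(V, compute_uv=False).sum()

def Q1(U):                                             # surrogate (4)
    return 0.5 * np.sum((X_star - U @ V_k) ** 2) \
           + lam * np.linalg.svd(V_k, compute_uv=False).sum()

U_try, _ = np.linalg.qr(rng.standard_normal((m, r)))   # arbitrary feasible U
assert F(U_try, V_k) <= Q1(U_try) + 1e-9               # majorization holds
assert abs(F(U_k, V_k) - Q1(U_k)) < 1e-9               # equality at U = U^k
```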
Lemma 1 shows that Q_1 majorizes (upper bounds) F(U, V^k) and Q_2 majorizes (upper bounds) F(U^k, V). Given U^k, V^k, instead of directly minimizing F(U, V) alternately to obtain U^{k+1} and V^{k+1}, we minimize the surrogate functions (4) and (5), respectively. Moreover, the inequalities in (6) hold with equality at the current iterate:

$$F(U^{k}, V^{k}) = Q_1(U^{k} \mid U^{k}, V^{k}), \qquad F(U^{k}, V^{k}) = Q_2(V^{k} \mid U^{k}, V^{k}).$$

2.2. Update U

Note that traditional MM only ensures that the objective function does not increase, and hence offers no convergence guarantee for the iterates. Given U^k, V^k, we therefore add the proximal term ‖U − U^k‖_F^2 to (4). The goal is to make the subproblem well defined so that the generated sequence achieves sufficient decrease; this is motivated by the proximal alternating algorithm, in which the proximal terms are critical to ensure convergence of the sequence. We thus consider the following optimization:

$$U^{k+1} = \arg\min_{U \in St(r,m)} Q_1(U \mid U^{k}, V^{k}) + \frac{\beta_1}{2}\|U - U^{k}\|_F^2, \qquad (7)$$

where β_1 is a constant balancing the surrogate function and the proximal term. Using U^T U = I_r and ‖A − B‖_F^2 = tr(A^T A) − 2 tr(A^T B) + tr(B^T B), (7) can be written in a simple form: after removing constants, minimizing (7) is equivalent to the maximization

$$U^{k+1} = \arg\max_{U \in St(r,m)} \operatorname{tr}\big((X^{*}V^{kT} + \beta_1 U^{k})U^{T}\big). \qquad (8)$$

Assume the SVD of X^*V^{kT} + β_1 U^k is Û Σ̂ V̂^T. Then

$$\operatorname{tr}\big((X^{*}V^{kT} + \beta_1 U^{k})U^{T}\big) = \operatorname{tr}(\hat{U}\hat{\Sigma}\hat{V}^{T}U^{T}) = \operatorname{tr}(\hat{U}^{T}U\hat{V}\hat{\Sigma}).$$

Moreover,

$$\operatorname{tr}(\hat{U}^{T}U\hat{V}\hat{\Sigma}) = \operatorname{tr}(T\hat{\Sigma}) = \sum_{i=1}^{r} t_{ii}\,\sigma_i,$$

where T = Û^T U V̂ = (t_{ij}) satisfies T^T T ⪯ I_r, so that |t_{ii}| ≤ 1. The trace above is therefore maximized when t_{ii} = 1 for all i, which gives the desired solution

$$U^{k+1} = \hat{U}_{m \times r}\hat{V}^{T}, \qquad (9)$$

where Û Σ̂ V̂^T is the SVD of X^*V^{kT} + β_1 U^k and Û_{m×r} consists of the first r columns of Û. Note that the solution of (7) can also be derived from its Lagrangian form [23,28]; we use the latter for the convergence analysis.
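A minimal NumPy sketch of the U-step (9), assuming X^* has been formed as a dense array for clarity (Section 2.5 describes the sparse-plus-low-rank form used in practice); the function name is ours.

```python
import numpy as np

def update_U(X_star, U_k, V_k, beta1):
    """Solve (7)/(8): maximize tr((X* V^{kT} + beta1 U^k) U^T) subject to U^T U = I_r."""
    M = X_star @ V_k.T + beta1 * U_k                  # the m x r matrix in Eq. (8)
    Uh, _, Vh_t = np.linalg.svd(M, full_matrices=False)
    return Uh @ Vh_t                                  # U^{k+1} = U_hat V_hat^T, Eq. (9)
```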
2.3. Update V

Similar to (7), we add a proximal term when updating V. Given the latest U^{k+1} and V^k, we consider the following optimization problem:

$$V^{k+1} = \arg\min_{V} Q_2(V \mid U^{k+1}, V^{k}) + \frac{\beta_2}{2}\|V - V^{k}\|_F^2. \qquad (10)$$

We first simplify (10). Since

$$Q_2(V \mid U^{k+1}, V^{k}) + \frac{\beta_2}{2}\|V - V^{k}\|_F^2 = \frac{\beta_2 + 1}{2}\operatorname{tr}(V^{T}V) - (\beta_2 + 1)\operatorname{tr}(Y^{T}V) + \lambda\|V\|_{*} + C = (\beta_2 + 1)\left(\frac{1}{2}\|V - Y\|_F^2 + \frac{\lambda}{\beta_2 + 1}\|V\|_{*}\right) + C, \qquad (11)$$

where Y = (1/(β_2 + 1))(U^{k+1 T}X^* + β_2 V^k) and C is a constant given U^{k+1} and V^k, minimizing (11) can be addressed by the Singular Value Thresholding (SVT) operator [29]. Therefore, the optimal solution of (10) is given by

$$V^{k+1} = \hat{U}D\hat{V}^{T}, \qquad (12)$$

where Û Σ̂ V̂^T is the thin SVD of Y and D_{ii} = max(Σ̂_{ii} − λ/(β_2 + 1), 0).
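The V-step (12) thus reduces to one singular value thresholding of the small r × n matrix Y; a hedged NumPy sketch (our function name, not from the paper):

```python
import numpy as np

def update_V(X_star, U_next, V_k, lam, beta2):
    """Solve (10) via (11)-(12): SVT applied to Y = (U^{k+1,T} X* + beta2 V^k)/(beta2 + 1)."""
    Y = (U_next.T @ X_star + beta2 * V_k) / (beta2 + 1.0)
    Uh, s, Vh_t = np.linalg.svd(Y, full_matrices=False)
    d = np.maximum(s - lam / (beta2 + 1.0), 0.0)      # soft-threshold the singular values
    return (Uh * d) @ Vh_t                            # V^{k+1} = U_hat D V_hat^T
```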
Algorithm 1 Majorized Proximal Alternating Imputation (MPA-IMPUTE) for nuclear norm regularization
Input: X ∈ R^{m×n}, Ω, λ, β_1, β_2, r.
1: Initialize k = 0.
2: Compute the truncated SVD X = U_r Σ_r V_r^T, then initialize U^k = U_r, V^k = Σ_r V_r^T.
3: repeat
4:   Given U^k, V^k, update U^{k+1}:
     (a) Impute X^* = P_Ω(X) + P_Ω^⊥(U^k V^k).
     (b) U^{k+1} = argmin_{U ∈ St(r,m)} Q_1(U | U^k, V^k) + (β_1/2)‖U − U^k‖_F^2.
5:   Given U^{k+1}, V^k, update V^{k+1}:
     (a) Impute X^* = P_Ω(X) + P_Ω^⊥(U^{k+1} V^k).
     (b) V^{k+1} = argmin_V Q_2(V | U^{k+1}, V^k) + (β_2/2)‖V − V^k‖_F^2.
6:   k = k + 1.
7: until convergence
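Putting the pieces together, the following is a compact sketch of Algorithm 1 under simplifying assumptions: X^* is formed as a dense array (the actual implementation keeps it in sparse-plus-low-rank form), the stopping test is a simple change-based check rather than the criteria of Section 4, and update_U / update_V are the hypothetical helpers sketched earlier.

```python
import numpy as np

def mpa_impute(X, mask, r, lam=1e-3, beta1=1e-3, beta2=1e-3, max_iter=500, tol=1e-4):
    """MPA-IMPUTE sketch for nuclear norm regularization (Algorithm 1)."""
    # Step 2: warm start from the rank-r truncated SVD of the zero-filled data.
    U0, s0, V0t = np.linalg.svd(np.where(mask, X, 0.0), full_matrices=False)
    U, V = U0[:, :r], s0[:r, None] * V0t[:r]
    for _ in range(max_iter):
        X_star = np.where(mask, X, U @ V)                # step 4a: impute, Eq. (3)
        U_new = update_U(X_star, U, V, beta1)            # step 4b: Eq. (9)
        X_star = np.where(mask, X, U_new @ V)            # step 5a: re-impute with U^{k+1}
        V_new = update_V(X_star, U_new, V, lam, beta2)   # step 5b: Eq. (12)
        change = np.linalg.norm(U_new - U) + np.linalg.norm(V_new - V)
        U, V = U_new, V_new
        if change < tol:
            break
    return U, V
```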
The complete procedure of the proposed method is depicted in Algorithm 1. Algorithm 1 repeatedly replaces the missing data at each step with the most recent estimates of U and V, and then updates the estimates by solving Problem (7) for U and Problem (10) for V. We give the convergence analysis and the computational complexity in the following subsections.

2.4. Convergence analysis

Theorem 1. The sequence {(U^k, V^k)} generated by Algorithm 1 satisfies the following properties:

1. F(U, V) has sufficient decrease on the sequence {(U^k, V^k)}, that is,

$$F(U^{k}, V^{k}) - F(U^{k+1}, V^{k+1}) \ge \frac{\beta_1}{2}\|U^{k+1} - U^{k}\|_F^2 + \frac{\beta_2}{2}\|V^{k+1} - V^{k}\|_F^2.$$
2. lim_{k→∞}(U^{k+1} − U^k) = 0 and lim_{k→∞}(V^{k+1} − V^k) = 0.
3. The sequence {(U^k, V^k)} is bounded.

Proof. Since

$$U^{k+1} = \arg\min_{U \in St(r,m)} Q_1(U \mid U^{k}, V^{k}) + \frac{\beta_1}{2}\|U - U^{k}\|_F^2,$$

i.e., (7) attains its minimum at U = U^{k+1}, we have

$$\frac{1}{2}\|X^{*} - U^{k+1}V^{k}\|_F^2 + \frac{\beta_1}{2}\|U^{k+1} - U^{k}\|_F^2 \le \frac{1}{2}\|X^{*} - U^{k}V^{k}\|_F^2 + \frac{\beta_1}{2}\|U^{k} - U^{k}\|_F^2,$$

that is, (1/2)‖X^* − U^kV^k‖_F^2 − (1/2)‖X^* − U^{k+1}V^k‖_F^2 ≥ (β_1/2)‖U^{k+1} − U^k‖_F^2. Together with F(U^k, V^k) = Q_1(U^k | U^k, V^k) and F(U, V^k) ≤ Q_1(U | U^k, V^k), we have

$$F(U^{k}, V^{k}) - F(U^{k+1}, V^{k}) \ge Q_1(U^{k} \mid U^{k}, V^{k}) - Q_1(U^{k+1} \mid U^{k}, V^{k}) = \frac{1}{2}\|X^{*} - U^{k}V^{k}\|_F^2 - \frac{1}{2}\|X^{*} - U^{k+1}V^{k}\|_F^2 \ge \frac{\beta_1}{2}\|U^{k+1} - U^{k}\|_F^2. \qquad (13)$$
Similarly,

$$F(U^{k+1}, V^{k}) - F(U^{k+1}, V^{k+1}) \ge Q_2(V^{k} \mid U^{k+1}, V^{k}) - Q_2(V^{k+1} \mid U^{k+1}, V^{k}) \ge \frac{\beta_2}{2}\|V^{k+1} - V^{k}\|_F^2.$$

Therefore, combining the above two inequalities, we have

$$F(U^{k}, V^{k}) - F(U^{k+1}, V^{k+1}) = \big(F(U^{k}, V^{k}) - F(U^{k+1}, V^{k})\big) + \big(F(U^{k+1}, V^{k}) - F(U^{k+1}, V^{k+1})\big) \ge \frac{\beta_1}{2}\|U^{k+1} - U^{k}\|_F^2 + \frac{\beta_2}{2}\|V^{k+1} - V^{k}\|_F^2. \qquad (14)$$

Thus, F(U, V) is monotonically decreasing along the iterates. Summing the above inequalities over k ≥ 1, it follows that

$$F(U^{1}, V^{1}) \ge \sum_{k=1}^{\infty}\left(\frac{\beta_1}{2}\|U^{k+1} - U^{k}\|_F^2 + \frac{\beta_2}{2}\|V^{k+1} - V^{k}\|_F^2\right). \qquad (15)$$
This implies that lim_{k→∞}(U^{k+1} − U^k) = 0 and lim_{k→∞}(V^{k+1} − V^k) = 0. Since U^{kT}U^k = I_r for every k, {U^k} is bounded. Furthermore, ‖V‖_* → ∞ as ‖V‖_F → ∞, so F(U, V) → ∞ as ‖V‖_F → ∞; since {F(U^k, V^k)} is non-increasing, {V^k} is bounded. Consequently, the sequence {(U^k, V^k)} is bounded. □

Theorem 2. Assume (U^*, V^*) is a stationary point of Problem (2). Then the following first order necessary optimality conditions are satisfied:

$$Z - U^{*}\operatorname{Sym}(U^{*T}Z) = 0, \qquad (16)$$

$$0 \in -U^{*T}P_\Omega(X - U^{*}V^{*}) + \lambda\,\partial\|V^{*}\|_{*}, \qquad (17)$$

where Z = P_Ω(X − U^*V^*)V^{*T} and ∂‖·‖_* is the subdifferential of the nuclear norm.
Note that Eq. (16) can be derived for U either from the KKT conditions or from the Riemannian gradient on the Stiefel manifold, as detailed in [23]. Eq. (17) is obtained directly by taking the (sub)derivative with respect to V and setting it to zero. Details on the subdifferential of the nuclear norm can be found in [25,29].

Theorem 3. Let {(U^k, V^k)} be the sequence generated by Algorithm 1. Then any accumulation point of {(U^k, V^k)} is a stationary point of Problem (2).

Proof. The theorem is equivalent to proving that any accumulation point of {(U^k, V^k)} generated by Algorithm 1 satisfies (16) and (17). As Theorem 1 shows, the sequence {(U^k, V^k)} is bounded and therefore has accumulation points. For any accumulation point (U^*, V^*), there exists a subsequence {(U^{k_j}, V^{k_j})} such that (U^{k_j}, V^{k_j}) → (U^*, V^*). Note that

$$U^{k_j+1} = \arg\min_{U \in St(r,m)} Q_1(U \mid U^{k_j}, V^{k_j}) + \frac{\beta_1}{2}\|U - U^{k_j}\|_F^2. \qquad (18)$$

We consider its Lagrangian form, defined as

$$L(U) = \frac{1}{2}\|X^{*} - UV^{k_j}\|_F^2 + \frac{\beta_1}{2}\|U - U^{k_j}\|_F^2 - \operatorname{tr}\big(C(U^{T}U - I_r)\big), \qquad (19)$$

where X^* = P_Ω(X) + P_Ω^⊥(U^{k_j}V^{k_j}) and C is the Lagrange multiplier, a symmetric matrix. Setting dL(U)/dU = 0, U^{k_j+1} satisfies

$$\big(X^{*} - U^{k_j+1}V^{k_j}\big)V^{k_j T} - \beta_1\big(U^{k_j+1} - U^{k_j}\big) - U^{k_j+1}C = 0, \qquad (20)$$

where the constant factor arising from the multiplier term has been absorbed into C. Since U^{k_j+1} − U^{k_j} → 0 and (U^{k_j}, V^{k_j}) → (U^*, V^*), the above equation simplifies in the limit to

$$P_\Omega(X - U^{*}V^{*})V^{*T} - U^{*}C = 0, \qquad (21)$$

where C = Sym(U^{*T}P_Ω(X − U^*V^*)V^{*T}). Therefore, the optimality condition (16) is satisfied.

We now consider (10). Since

$$V^{k_j+1} = \arg\min_{V} Q_2(V \mid U^{k_j+1}, V^{k_j}) + \frac{\beta_2}{2}\|V - V^{k_j}\|_F^2, \qquad (22)$$

where X^* = P_Ω(X) + P_Ω^⊥(U^{k_j+1}V^{k_j}), define

$$G(V) = Q_2(V \mid U^{k_j+1}, V^{k_j}) + \frac{\beta_2}{2}\|V - V^{k_j}\|_F^2. \qquad (23)$$

Setting its (sub)derivative to zero, i.e., 0 ∈ ∂G(V^{k_j+1}), we have

$$0 \in -U^{k_j+1\,T}\big(X^{*} - U^{k_j+1}V^{k_j+1}\big) + \beta_2\big(V^{k_j+1} - V^{k_j}\big) + \lambda\,\partial\|V^{k_j+1}\|_{*}. \qquad (24)$$

Letting k_j → ∞, we obtain

$$0 \in -U^{*T}P_\Omega(X - U^{*}V^{*}) + \lambda\,\partial\|V^{*}\|_{*}, \qquad (25)$$

which is exactly (17). This completes the proof of the theorem. □

Define Δ_k = (β_1/2)‖U^{k+1} − U^k‖_F^2 + (β_2/2)‖V^{k+1} − V^k‖_F^2. From Theorem 1 it is clear that Δ_k → 0 as k → ∞, which drives (U^k, V^k) toward (U^*, V^*) along convergent subsequences. Therefore, Δ_k can be used to measure how close (U^k, V^k) is to a stationary point. The following theorem characterizes the convergence rate of the MPA-IMPUTE algorithm.

Theorem 4. Let {(U^k, V^k)} be the sequence generated by Algorithm 1 and let (U^*, V^*) be an accumulation point of {(U^k, V^k)}. Then we have the following finite convergence rate of the MPA-IMPUTE algorithm:

$$\min_{1 \le k \le n} \Delta_k \le \frac{F(U^{1}, V^{1}) - F(U^{*}, V^{*})}{n}. \qquad (26)$$

Proof. According to (14), we have

$$\Delta_k \le F(U^{k}, V^{k}) - F(U^{k+1}, V^{k+1}). \qquad (27)$$

Summing both sides of the above inequalities over k = 1 to n, we obtain

$$\min_{1 \le k \le n} \Delta_k \le \frac{1}{n}\sum_{k=1}^{n}\Delta_k \le \frac{1}{n}\sum_{k=1}^{n}\big(F(U^{k}, V^{k}) - F(U^{k+1}, V^{k+1})\big) = \frac{1}{n}\big(F(U^{1}, V^{1}) - F(U^{n+1}, V^{n+1})\big) \le \frac{1}{n}\big(F(U^{1}, V^{1}) - F(U^{*}, V^{*})\big). \qquad \square$$

The above theorem characterizes the rate at which Δ_k converges to 0, and establishes an O(1/n) convergence rate for the MPA-IMPUTE algorithm.

2.5. Computational complexity

At each iteration, the computational cost of Algorithm 1 consists of two parts. First consider the cost of updating U at iterate k, which has two sub-steps. In step 4a, note that

$$X^{*} = P_\Omega(X) + P_\Omega^{\perp}(U^{k}V^{k}) = P_\Omega(X - U^{k}V^{k}) + U^{k}V^{k}, \qquad (28)$$

so X^* is represented as a sparse plus low rank matrix, i.e., it is stored as three parts: P_Ω(X − U^kV^k), U^k ∈ R^{m×r}, and V^k ∈ R^{r×n}. Step 4a only needs to form P_Ω(X − U^kV^k), which requires O(|Ω|r) flops. Note that Algorithm 1 does not explicitly compute X^*, since the multiplication U^kV^k would require O(mnr) flops. In addition, the storage cost of X^* is O(|Ω| + mr + nr), which is efficient for large matrices when the rank r is small. In step 4b, the cost mainly comes from matrix multiplication and the SVD computation in (9):

$$X^{*}V^{kT} + \beta_1 U^{k} = P_\Omega(X - U^{k}V^{k})V^{kT} + U^{k}V^{k}V^{kT} + \beta_1 U^{k}. \qquad (29)$$

The multiplication P_Ω(·)V^{kT} requires O(|Ω|r) flops, and U^kV^kV^{kT} = U^k(V^kV^{kT}) requires O(mr^2 + nr^2) flops. Hence forming X^*V^{kT} + β_1U^k requires O(|Ω|r + mr^2 + nr^2) flops, and computing the SVD of the m × r matrix X^*V^{kT} + β_1U^k requires O(mr^2) flops. So the cost of step 4b is O(|Ω|r + 2mr^2 + nr^2). Similarly, step 5a requires O(|Ω|r) flops to form P_Ω(X − U^{k+1}V^k). In step 5b, the cost comes from forming Y and computing its SVD in (12), which is O(|Ω|r + 2nr^2 + mr^2). The total cost of one iteration (U^k, V^k) → (U^{k+1}, V^{k+1}) is O(4|Ω|r + 3(m + n)r^2). This implies that the computational cost of Algorithm 1 is linear in the matrix dimensions if r ≪ min{m, n}.
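To illustrate the cost analysis above, the following sketch (our own; the use of scipy.sparse and the function names are assumptions, not prescribed by the paper) keeps X^* implicitly as a sparse residual plus the low rank factors and forms X^*V^{kT} + β_1U^k without ever building the m × n matrix.

```python
import numpy as np
import scipy.sparse as sp

def residual_sparse(X_obs, rows, cols, U_k, V_k):
    """Sparse part P_Omega(X - U^k V^k): only the |Omega| observed residuals are stored."""
    vals = X_obs - np.einsum('ij,ji->i', U_k[rows], V_k[:, cols])   # O(|Omega| r) flops
    return sp.csr_matrix((vals, (rows, cols)), shape=(U_k.shape[0], V_k.shape[1]))

def xstar_times_Vt(S, U_k, V_k, beta1):
    """Form X* V^{kT} + beta1 U^k with X* = S + U^k V^k, as in Eq. (29)."""
    # S @ V_k.T costs O(|Omega| r); U_k @ (V_k V_k^T) costs O(m r^2 + n r^2).
    return S @ V_k.T + U_k @ (V_k @ V_k.T) + beta1 * U_k
```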
3. Optimization for non-convex regularization

Despite its solid theoretical foundation and efficient optimization, nuclear norm minimization often tends to over-shrink the rank components and treats the different rank components equally [12]. This limits its flexibility in practical applications. Recent work, e.g., [12,30,31], showed that non-convex regularization has better empirical performance than the nuclear norm. However, non-convex regularization often leads to non-convex, non-smooth, and even non-Lipschitz optimization problems, which are much more difficult to solve. A popular relaxation is to use the local linear approximation technique [32] to obtain an upper bound. Lu et al. [33] observed that most existing non-convex regularization functions are concave and monotonically increasing, so their gradients are decreasing. Based on this property, they proposed the Iterative Reweighted Nuclear Norm (IRNN) algorithm for a wide range of non-convex spectral regularization problems. At each iteration, IRNN solves a weighted nuclear norm minimization [34]. However, IRNN, like most existing algorithms, cannot be applied to high dimensional data due to the high rank SVD and memory requirements. Our extension aims to fill this gap.

We extend Algorithm 1 to non-convex regularization beyond the nuclear norm. Given an estimated rank r, we restrict U ∈ St(r, m), so that σ_i(UV) = σ_i(V). We propose minimizing

$$\min_{U \in St(r,m),\, V \in \mathbb{R}^{r \times n}} \; F(U, V) = f(UV) + \lambda\sum_{i=1}^{r} g\big(\sigma_i(V)\big), \qquad (30)$$

where g(·) is the non-convex penalty function. Note that the regularization term is independent of U, hence the U update step is the same as for nuclear norm regularization. The remaining issue is to define a suitable surrogate function for the V update step. As concrete examples, we consider the following two popular non-convex penalty functions; other widely studied ones can be found in [35].
• Truncated nuclear norm regularization (TNNR) [2]: Given a matrix X ∈ R^{m×n} and a truncated rank t, the truncated nuclear norm is defined as

$$\|X\|_t = \sum_{i=1}^{\min\{m,n\}} g\big(\sigma_i(X)\big), \qquad (31)$$

where the penalty applied to the ith singular value is

$$g\big(\sigma_i(X)\big) = \begin{cases} 0, & i \le t, \\ \sigma_i(X), & i > t. \end{cases}$$

Thus ‖X‖_t = Σ_{i=t+1}^{min{m,n}} σ_i(X). The corresponding subgradient is ∂g(σ_i(X)) = 0 for i ≤ t and ∂g(σ_i(X)) = 1 for i > t.

• Logarithm penalty (reweighted ℓ1 [34]): g(x) = C log(x + ϵ), where C and ϵ are given constants. Its subdifferential is ∂g(x) = C/(x + ϵ). For a given matrix X ∈ R^{m×n}, ∂g(σ_i(X)) = C/(σ_i(X) + ϵ) for i = 1 to min{m, n}.
Given V^k ∈ R^{r×n}, by the concavity of g we have

$$g\big(\sigma_i(V)\big) \le g\big(\sigma_i(V^{k})\big) + \partial g\big(\sigma_i(V^{k})\big)\big(\sigma_i(V) - \sigma_i(V^{k})\big). \qquad (32)$$

Denote w_i = ∂g(σ_i(V^k)) (i = 1, 2, ..., r); then w = (w_1, w_2, ..., w_r)^T ∈ R^r is the weight vector for the singular values of V. Since g(x) is concave and the singular values computed by the SVD are non-increasing, the weights satisfy w_1 ≤ w_2 ≤ ··· ≤ w_r. Therefore, the surrogate (upper bound) function with respect to V can be defined as

$$Q_3(V \mid U^{k+1}, V^{k}, w^{k}) = \frac{1}{2}\|X^{*} - U^{k+1}V\|_F^2 + \lambda\sum_{i=1}^{r} w_i\,\sigma_i(V) + C, \qquad (33)$$

where X^* = P_Ω(X) + P_Ω^⊥(U^{k+1}V^k) and C = λ Σ_{i=1}^{r} (g(σ_i(V^k)) − ∂g(σ_i(V^k)) σ_i(V^k)) is a constant given V^k. Together with (5), (32), and (33), we have

$$F(U^{k+1}, V) \le Q_3(V \mid U^{k+1}, V^{k}, w^{k}). \qquad (34)$$

This implies that Q_3 majorizes F(U^{k+1}, ·), and equality holds iff V = V^k. For the truncated nuclear norm, the subgradients are constants, so the weights are fixed. For the logarithm penalty, the weight vector is computed from the previous solution and dynamically updated at each iteration. Adding the proximal term to Q_3, the V step is defined as

$$V^{k+1} = \arg\min_{V} Q_3(V \mid U^{k+1}, V^{k}, w^{k}) + \frac{\beta_2}{2}\|V - V^{k}\|_F^2. \qquad (35)$$
Algorithm 2 Majorized Proximal Alternating Imputation with non-convex regularization
Input: X ∈ R^{m×n}, Ω, λ, β_1, β_2, r.
1: Initialize k = 0, w^0 ∈ R^r.
2: Compute the truncated SVD X = U_r Σ_r V_r^T, then initialize U^k = U_r, V^k = Σ_r V_r^T.
3: repeat
4:   Impute X^* = P_Ω(X) + P_Ω^⊥(U^k V^k).
5:   Update U^{k+1} by solving (7).
6:   Impute X^* = P_Ω(X) + P_Ω^⊥(U^{k+1} V^k).
7:   Update V^{k+1} by solving (35).
8:   (Optional) Update the weights w_i^{k+1} = ∂g(σ_i(V^{k+1})) (i = 1, ..., r).
9:   k = k + 1.
10: until convergence
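A hedged sketch of the V-step of Algorithm 2 (function and parameter names are ours): the weights come from the subgradient of the chosen penalty (constants for TNNR, data-dependent for the logarithm penalty), followed by one weighted SVT on the small matrix Y.

```python
import numpy as np

def penalty_weights(sigma, penalty="tnnr", t=5, C=1.0, eps=1e-6):
    """w_i = dg(sigma_i(V^k)) for the two penalties discussed above."""
    if penalty == "tnnr":                 # 0 for the t largest singular values, 1 otherwise
        w = np.ones_like(sigma)
        w[:t] = 0.0
    else:                                 # logarithm penalty g(x) = C log(x + eps)
        w = C / (sigma + eps)
    return w                              # nondecreasing, since sigma is nonincreasing

def update_V_nonconvex(X_star, U_next, V_k, lam, beta2, **kw):
    """Solve (35) via the weighted SVT of Lemma 2 below (closed form (37))."""
    w = penalty_weights(np.linalg.svd(V_k, compute_uv=False), **kw)
    Y = (U_next.T @ X_star + beta2 * V_k) / (beta2 + 1.0)
    Uh, s, Vh_t = np.linalg.svd(Y, full_matrices=False)
    d = np.maximum(s - lam * w / (beta2 + 1.0), 0.0)
    return (Uh * d) @ Vh_t
```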
Problem (35) can be simplified in the same way as (11), so that it only involves a weighted nuclear norm minimization, which is solved by the following lemma.

Lemma 2 (Weighted singular value thresholding operator [34]). For the problem

$$\min_{V \in \mathbb{R}^{r \times n}} \; \frac{1}{2}\|V - Y\|_F^2 + \lambda\sum_{i=1}^{r} w_i\,\sigma_i(V), \qquad (36)$$

where the weights satisfy w_1 ≤ w_2 ≤ ··· ≤ w_r, r ≤ n, and Y ∈ R^{r×n} has thin SVD Û Σ̂ V̂^T, the minimum of (36) is achieved at V = Û D V̂^T, where D is a diagonal matrix with D_{ii} = max(Σ̂_{ii} − λw_i, 0) for 1 ≤ i ≤ r.

Based on Lemma 2 and (11), given U^{k+1} and V^k, the closed form solution of (35) is

$$V^{k+1} = \hat{U}D\hat{V}^{T}, \qquad (37)$$

where Û Σ̂ V̂^T is the thin SVD of Y = (1/(β_2 + 1))(U^{k+1 T}X^* + β_2 V^k) and D_{ii} = max(Σ̂_{ii} − λw_i/(β_2 + 1), 0).

The complete procedure is outlined in Algorithm 2. Together with (5), (30), and (35), the same techniques show that Algorithm 2 enjoys the convergence properties established in Theorems 1–4; we therefore omit the proof. The advantage of this formulation is that the non-convex regularization only requires the SVD of a much smaller matrix compared with IRNN and TNNR, which are prohibitively expensive in terms of both computational time and memory. Furthermore, our method naturally incorporates the given rank information into the non-convex regularization problem. Due to the shrinkage and thresholding induced by the non-convex regularization, Algorithm 2 is, to some degree, not sensitive to the estimated rank. This is a clear advantage over many factorization approaches that only work well at a fixed rank.

4. Experiments

We evaluate the performance of the proposed method on synthetic data and real world problems. We compare our method with several state-of-the-art methods representing different types of optimization, including APG1 (nuclear norm regularization [36]), LMaFit2 (alternating minimization [15]), IRNN3 (non-convex regularization [12]), OptSpace4 (double Grassmannian optimization [37]), LRGeomCG5 (fixed rank manifold [38]), and Grouse6 (single Grassmannian optimization [39]). All the experimental codes are available from the websites of the corresponding authors. All experiments are run in Matlab on a desktop PC with a 3.0 GHz CPU and 8 GB memory. The demo code of our method can be found at https://github.com/shfkuang/mpa-impute.

In the following experiments, the sampling ratio is defined as SR = |Ω|/(mn). The stopping criteria and testing error used in our experiments are defined as follows:

Training error = ‖P_Ω(X − UV)‖_F / ‖P_Ω(X)‖_F < tol;
Relative change = |1 − ‖P_Ω(X − U^{k+1}V^{k+1})‖_F / ‖P_Ω(X − U^kV^k)‖_F| < tol/2;
Testing error (relative error) = ‖X − UV‖_F / ‖X‖_F.
1 http://www.math.nus.edu.sg/~mattohkc/NNLS.html. 2 http://lmafit.blogs.rice.edu/. 3 https://sites.google.com/site/canyilu/. 4 http://swoh.web.engr.illinois.edu/software/optspace/code.html. 5 http://www.unige.ch/math/vandereycken 6 http://sunbeam.ece.wisc.edu/grouse/.
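For completeness, the three quantities above can be computed as follows (a hedged sketch; the boolean-mask convention matches the earlier snippets, and the function names are ours):

```python
import numpy as np

def training_error(X, mask, U, V):
    R = np.where(mask, X - U @ V, 0.0)
    return np.linalg.norm(R) / np.linalg.norm(np.where(mask, X, 0.0))

def relative_change(X, mask, U_old, V_old, U_new, V_new):
    num = np.linalg.norm(np.where(mask, X - U_new @ V_new, 0.0))
    den = np.linalg.norm(np.where(mask, X - U_old @ V_old, 0.0))
    return abs(1.0 - num / den)

def testing_error(X_full, U, V):
    # relative error with respect to the full (ground truth) matrix
    return np.linalg.norm(X_full - U @ V) / np.linalg.norm(X_full)
```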
Fig. 1. Convergence analysis, (a) fix SR = 0.3, and set rank r with 5, 20, 50, respectively, (b) fix r = 20, and set SR with 0.1, 0.3, 0.5 respectively.
Note that the training error and the relative change are controlled by the parameter tol: the iteration terminates once they fall below the given threshold (e.g., tol = 10^{-4}). The relative error serves as the testing error and is used on the synthetic data to evaluate the performance of the compared algorithms. For the proposed method, we set β_1 = β_2 = 0.001 and λ = 0.001 for the fixed rank problem. When the matrix rank is only given by an estimate, we set λ_0 = ‖P_Ω(X)‖_∞ and let it decrease at each iteration, e.g., λ_{k+1} = 0.98 × λ_k, where 0.98 is the decay factor; the decay stops once a predefined target λ_∞ is reached (e.g., λ_∞ = 0.001). In all of our experiments, the data are partitioned into three parts: training set (50%), validation set (10%), and test set (40%). The main purpose of the validation set is to tune the estimated rank. We use the warm-starts strategy [25] for tuning the estimated rank r, which allows us to efficiently compute an entire path of solutions over a grid of rank values.

4.1. Synthetic data

We generate an m × n matrix X of rank r by the following procedure. First generate Gaussian random matrices U ∈ R^{m×r} and V ∈ R^{r×n}. Then construct the synthetic data as X = UV + γG, where G is Gaussian white noise, γ controls the noise level, and we fix γ = 0.01. The locations of the observed indices Ω are sampled uniformly at random from X. The sampling ratio |Ω|/(mn) is denoted by SR. The parameters are set to maxiter = 500 and tol = 10^{-4} by default.

Our first experiment tests the convergence of the proposed algorithm under different ranks and SR. Consider a 1000 × 1000 data matrix X. We plot the training error at each iteration for different SR and ranks r; the results are reported in Fig. 1. As the figure shows, when the rank is fixed, decreasing the sampling rate slows down convergence, and when SR is fixed, the higher the rank, the harder it is for the algorithm to converge. Nevertheless, in all test cases our algorithm converges linearly, in accordance with our convergence analysis.

The second experiment tests how the running time scales as the matrix dimensions increase. We consider square matrices for simplicity, with three different ranks and dimensions ranging from 500 × 500 to 5000 × 5000. The results are reported in Fig. 2. As we can see, the running time is approximately linear in the matrix dimensions, and the lower the matrix rank, the more closely the linear relation holds. This is consistent with the discussion in Section 2.5.

The third experiment tests the sensitivity of Algorithm 1 to the estimated rank r. Suppose the true rank is 20, and the estimated rank r is set to 25 and 50. We compare our method with two baselines, IRNN (non-convex regularization with the log penalty) and LMaFit (a factorization method). As IRNN does not use the known rank information, we truncate the SVD at the estimated rank at each iteration, as reimplemented in [30]. We plot the training error against time under the different estimated ranks; the results are reported in Fig. 3. Since IRNN requires a full SVD at each iteration, LMaFit and our method are much faster than IRNN (Fig. 3a). Fig. 3b shows that directly incorporating the rank information prevents IRNN from converging to a satisfactory solution. LMaFit converges faster than our method, as it only involves a QR factorization and a linear system solve; however, its solution quality is much lower than ours, since it converges too early and gets trapped in local minima. Under both estimated ranks, our method, which combines regularization and factorization, outperforms these two methods. Furthermore, our method does not get trapped in local minima even though the estimated rank is much larger than the true rank. The effectiveness of the proposed method comes from both ingredients: the nuclear norm regularization makes the optimization well posed and largely avoids overfitting, while the factorization greatly reduces the computational cost.
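For reference, a sketch of the synthetic data generation described above (rank-r Gaussian factors, additive Gaussian noise scaled by γ = 0.01, uniformly sampled observations); the function name is ours.

```python
import numpy as np

def make_synthetic(m, n, r, sr, gamma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # exact rank r
    X += gamma * rng.standard_normal((m, n))                       # Gaussian white noise
    mask = rng.random((m, n)) < sr                                 # observed set, SR = |Omega|/(mn)
    return X, mask
```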
Fig. 2. The running time as orders of the matrix increase with three different ranks (r).
Fig. 3. Test of the sensitivity to the estimated rank, true rank r = 20, tol = 10−6, (a) estimated rank r = 25, (b) estimated rank r = 50.
The fourth experiment compares our algorithm with existing algorithms in terms of convergence speed under different estimated ranks. We test two cases, m = n = 500 and m = n = 1000; results are shown in Figs. 4 and 5. When the estimated rank equals the true rank, Fig. 4a shows that LMaFit and LRGeomCG perform better than the other methods; these two methods are designed for the fixed rank problem and use no regularization. However, when the estimated rank increases to 12, both LMaFit and LRGeomCG converge to local minima, as seen in Fig. 4b, whereas the other methods, including ours, reach the stopping criterion even though the estimated rank is larger than the true rank. Therefore, introducing the regularization term may require extra running time but leads to a more satisfactory solution. In Fig. 5a, when the matrix size increases to 1000 × 1000 and the true rank to 20, LMaFit and LRGeomCG fail on this test with estimated ranks 25 and 40, respectively, while our method and the other regularization-based methods achieve better performance. Empirically, we conclude that methods without regularization are more suitable for the fixed rank problem, while regularized methods are more robust and can handle more general problems. Note that Grouse works well on all these test cases even though it has no regularization term: since the synthetic data is exactly low rank, increasing the rank makes the Grouse subspace redundant without changing the subspace structure. Our method is not sensitive to the estimated rank thanks to the shrinkage and thresholding of the nuclear norm, and its computational cost remains low thanks to the bilinear factorization.
Fig. 4. Test the convergence speed of different algorithms, m = n = 500, r = 10, SR = 0.3, (a) estimated rank r = 10, (b) estimated rank r = 12.
Fig. 5. Test the convergence speed of different algorithms, m = n = 1000, r = 20, SR = 0.3, (a) estimated rank r = 25, (b) estimated rank r = 40.
4.2. Collaborative filtering

In this section, we evaluate the proposed method on real world recommender system datasets: MovieLens7 [40] (100K, 1M and 10M) and the Jester jokes dataset8 [41]. The statistics of the datasets used in our experiments are shown in Table 1. As these datasets change over time, the description may differ slightly from other publications. For the MovieLens datasets, the rows correspond to viewers and the columns to movies, with entry X_ij ∈ {1, ..., 5} being the rating given by viewer i to movie j. The Jester jokes dataset contains 100 jokes rated by 73,421 users, with ratings in [−10, 10]. In both tasks the goal is to predict the ratings that viewers would give to movies (jokes) they have not yet rated. To make such inference meaningful, it is commonly assumed that the data lie on a much lower dimensional manifold, so the problem can be posed as rank constrained matrix completion. Since the datasets provide no ground truth for the unobserved entries (e.g., in MovieLens 100K only about 6% of the entries are observed, see Table 1), we partition the observed data into three parts: training set (6% × 50% = 3%), validation set (6% × 10% = 0.6%), and test set (6% × 40% = 2.4%). We use the following Root Mean Square Error (RMSE) on the test set to evaluate the performance:

$$\mathrm{RMSE} = \frac{\|P_\Omega(X - UV)\|_F}{\sqrt{|\Omega|}},$$

where Ω here denotes the test set.

Completing the data matrix of a recommender system usually relies on a low rank assumption, yet the true rank is unknown. Therefore, we generate a series of estimated ranks ranging from 2 to 100 to evaluate the performance. For the different estimated ranks, we plot the running time and RMSE of the compared methods. Note that IRNN and APG do not incorporate the rank prior, so the estimated rank does not affect their results. We report the results in Fig. 6. The results show that the proposed method and APG are faster than the other methods, and the RMSE of our method is much better than that of APG.

7 https://grouplens.org/datasets/movielens/.
8 http://eigentaste.berkeley.edu/dataset/.
Table 1. Basic information of the recommendation datasets used in our experiments.

Dataset          m        n        #ratings      SR
Movielens 100K   943      1,682    100,000       0.063
Movielens 1M     6,040    3,706    1,000,209     0.0447
Movielens 10M    71,567   10,677   10,000,054    0.0131
Jester-all       73,421   100      2,430,997     0.3311
Fig. 6. RMSE and Training time on Movielens—100K dataset.
Fig. 7. RMSE and training time as the estimated ranks increase on Movielens—1M dataset.
Furthermore, OptSpace and IRNN have RMSE similar to ours but require more running time. Grouse seems to fail on this test, as reflected in its RMSE. As the estimated rank increases, the results of our method are hardly affected, which again indicates that our method is not sensitive to the estimated rank.

For MovieLens 1M and 10M, since we only use 50% of the known entries as the training set, only about 2% of the entries are available in MovieLens 1M and less than 1% in MovieLens 10M, so many iterations are needed to complete the unknown entries. We only compare with LRGeomCG and LMaFit, since the other methods are too slow for this task. To recover such datasets the rank has to be very low, otherwise the matrix cannot be recovered. We compute the RMSE and running time for a series of estimated ranks; the results are shown in Figs. 7 and 8. The proposed method performs much better than the other methods in terms of both RMSE and running time. Since the maximum number of iterations is set to 500 and observations are scarce, increasing the estimated rank results in inferior RMSE; LMaFit, LRGeomCG, and our method all recover better at very low rank. For the different estimated ranks, we record the best approximate rank of the computed solution, defined as the number of singular values larger than 10^{-8}; the results are reported in Table 2. Although our method involves two SVD factorizations per iteration, it imputes the missing data at each sub-step after updating U or V, so its convergence is greatly accelerated: the running time of our method is very close to that of LMaFit, which involves only a single QR factorization per iteration, while our method achieves better RMSE and training error. For the Jester jokes dataset, better results are obtained when the estimated rank is below 10, as seen in Table 2. The RMSE is large for all three methods, yet our method still performs better than the other two. As we do not pre-process the data, it remains unclear whether the negative ratings affect the completion results.

4.3. Image recovery

In this section, we demonstrate the accuracy of the proposed algorithm on the image recovery problem. We mainly test image inpainting, which concerns filling in the unknown pixels of an image from an incomplete set of observed entries.
Fig. 8. RMSE and training time as the estimated ranks increase on Movielens—10M dataset.
Fig. 9. Low rank test images. Source: Originated from [2].

Table 2. Numeric results for Movielens and Jester datasets. Results are averaged over 10 runs.

                                Ours      LMaFit    LRGeomCG
Movielens 1M    Training error  0.014     0.050     0.017
                RMSE            0.902     0.934     1.015
                Rank            4         4         4
                Time            36.91     44.82     54.13
Movielens 10M   Training error  0.168     0.192     0.192
                RMSE            0.843     0.871     0.869
                Rank            5         5         5
                Time            214.45    249.58    236.13
Jester-all      Training error  0.011     0.032     0.134
                RMSE            4.151     5.220     4.923
                Rank            7         8         8
                Time            10.88     10.16     12.04
We formulate the image inpainting problem as a matrix completion problem and test it on several benchmark images9 shown in Fig. 9. Most natural images do not have exactly low-rank structure, but their top few singular values carry most of the information, so matrix completion solvers can be applied to obtain a low rank approximation. We follow the experimental settings in [2] and consider two types of corruption of the real images: the first has 50% of the pixels missing, and the second overlays unrelated text on the image. The goal is to remove the corruption by low rank matrix completion. Color images have three channels, and matrix completion is applied to each channel independently. The test images are all cropped to 300 × 300. We set tol = 10^{-3}, maxrank = 20, and maxiter = 200. We use the well known Peak Signal-to-Noise Ratio (PSNR) to evaluate the performance,

$$\mathrm{PSNR} = 10\log_{10}\big(255^2/\mathrm{MSE}\big), \qquad \mathrm{MSE} = \frac{1}{3|\Omega^{\perp}|}\sum_{i=r,g,b}\|P_\Omega^{\perp}(X_i - \hat{X}_i)\|_F^2,$$

where X̂_i is the recovered matrix of channel i.

We first perform rank 10 recovery of the test images with both the random mask and the text mask; the results are reported in Tables 3 and 4. When the estimated rank is 10, LMaFit and our method give the best recovery in terms of PSNR and running time. LRGeomCG cannot compete with LMaFit and our method on this test; moreover, the results of IRNN are unstable, and Grouse again fails to recover the noisy images. We point out that IRNN equipped with other non-convex regularizers, e.g., SCAD or the capped norm, may perform better than the logarithm penalty used in this experiment; however, these regularizers are sensitive to their parameters, so we do not compare them here.

9 https://sites.google.com/site/zjuyaohu/.
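The PSNR/MSE evaluation above, written out for a three-channel image with a missing-pixel mask (a hedged sketch, names ours):

```python
import numpy as np

def psnr_masked(X_true, X_rec, mask):
    """PSNR over the unobserved pixels of an RGB image of shape (m, n, 3), values in [0, 255].

    mask is a 2D boolean array of observed pixels; the MSE is averaged over Omega-complement.
    """
    miss = ~mask
    mse = sum(np.sum((X_true[..., c][miss] - X_rec[..., c][miss]) ** 2)
              for c in range(3)) / (3 * miss.sum())
    return 10.0 * np.log10(255.0 ** 2 / mse)
```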
Table 3. Rank 10 recovery of the five test images with random mask (SR = 0.5). Results are averaged over 10 runs.

PSNR
Method      (a)      (b)      (c)      (d)      (e)
Ours        27.90    27.82    21.45    27.45    33.89
APG         21.24    24.01    20.51    27.03    15.94
IRNN-log    26.43    24.44    21.23    27.23    32.94
LMaFit      27.67    27.60    21.29    27.34    33.23
OptSpace    21.00    21.98    18.22    23.95    28.55
LRGeomCG    27.70    27.61    21.35    27.34    28.32
Grouse      19.41    19.91    16.80    21.56    25.21

Running time (s)
Method      (a)      (b)      (c)      (d)      (e)
Ours        0.51     0.60     0.57     5.54     0.74
APG         15.41    14.61    14.88    13.30    14.41
IRNN-log    36.32    45.44    37.56    21.13    33.05
LMaFit      0.58     0.42     0.72     6.08     0.54
OptSpace    27.33    26.78    27.44    26.97    27.01
LRGeomCG    10.97    9.61     24.69    12.97    35.22
Grouse      4.90     4.83     4.86     4.41     4.84
Table 4. Rank 10 recovery of the five test images with text mask. Results are averaged over 10 runs.

PSNR
Method      (a)      (b)      (c)      (d)      (e)
Ours        25.94    26.83    20.93    24.47    34.29
APG         20.89    24.08    19.96    23.43    15.65
IRNN-log    21.07    24.73    20.75    23.56    32.65
LMaFit      25.72    26.54    20.65    24.66    32.83
OptSpace    17.19    17.54    16.21    20.14    28.19
LRGeomCG    25.74    26.54    20.63    24.33    28.38
Grouse      16.09    16.40    15.28    19.78    25.69

Running time (s)
Method      (a)      (b)      (c)      (d)      (e)
Ours        1.81     0.65     3.97     7.93     0.56
APG         13.16    15.17    12.82    13.09    14.85
IRNN-log    28.53    36.12    29.00    18.90    26.86
LMaFit      2.48     0.43     3.80     8.38     1.06
OptSpace    25.39    27.49    27.30    26.56    25.81
LRGeomCG    21.68    8.51     22.87    64.24    27.12
Grouse      6.38     6.80     6.61     7.27     6.75
We also try all rank values from 2 to 20 and report the best PSNR as the recovery result; with the warm-starts strategy this process runs quickly. We visualize the recovery results of Image (b) with random mask noise in Fig. 10, and of Image (d) with text mask noise in Fig. 11.

4.4. Non-convex regularization (TNNR) test

Finally, we verify the efficiency of Algorithm 2 with non-convex regularization. In particular, we consider Truncated Nuclear Norm Regularization (TNNR), defined as

$$\min_{Y} \; \frac{1}{2}\|P_\Omega(X - Y)\|_F^2 + \lambda\|Y\|_t. \qquad (38)$$

The original implementation of TNNR [2] includes the ADMM and APGL algorithms. Given a matrix Y ∈ R^{m×n}, TNNR only penalizes the smallest min(m, n) − t singular values for a given truncation parameter t, since the rank of a matrix only depends on its nonzero singular values. Let U_t Σ_t V_t^T be the truncated SVD of Y. Hu et al. [2] observed that ‖Y‖_t = ‖Y‖_* − tr(U_t^T Y V_t), so an iterative algorithm can be formulated via convex optimization: given Y^k and its truncated SVD U_t^k Σ_t^k V_t^{kT}, Y^{k+1} is obtained from

$$\min_{Y} \; \|Y\|_{*} - \operatorname{tr}\big(U_t^{kT} Y V_t^{k}\big), \quad \text{s.t. } P_\Omega(X) = P_\Omega(Y). \qquad (39)$$

This problem can be formulated as an augmented Lagrangian optimization, so that the ADMM or APGL algorithm can be applied. For an optimal parameter t, both algorithms show state-of-the-art performance for the image recovery problem.
Fig. 10. Image (with random mask noise) recovery comparison with different algorithms. Their PSNR is shown correspondingly.
Fig. 11. Image (with text mask noise) recovery comparison with different algorithms. Their PSNR is shown correspondingly.
However, since both TNNR-APGL and TNNR-ADMM involve inner and outer iterations, their convergence is very slow. We now use Algorithm 2 to better solve Problem (38), i.e., we consider the problem

$$\min_{U \in St(r,m),\, V \in \mathbb{R}^{r \times n}} \; \frac{1}{2}\|P_\Omega(X - UV)\|_F^2 + \lambda\|V\|_t. \qquad (40)$$

Note that r is an estimated rank and t is a truncated rank with t ≤ r. Since TNNR needs a truncated rank to achieve the best PSNR, we search for the best truncated rank over the values 1 to 20, and we fix the factorization rank of the proposed method to 20. For all methods, the best truncated rank is selected first and the algorithm is then run to obtain the final result. We conduct the experiment with different SR on Image (e); the results are reported in Fig. 12. Algorithm 2 is several times faster than ADMM and APGL while achieving similar or even slightly better recovery performance: all three implementations of TNNR reach satisfactory PSNR, but our method is several times faster. We then recover Image (a) with random mask noise; the results are shown in Fig. 13. As we can see, our method is better than ADMM and APGL. Finally, we evaluate the performance with text mask noise on all test images; the results are shown in Table 5 and further confirm that our method outperforms ADMM and APGL in terms of running time and PSNR.
Fig. 12. PSNR and running time with different SR on Image (e).
Fig. 13. Image recovery comparison with different TNNR algorithms. Their PSNR is shown correspondingly.

Table 5. Comparison of different TNNR algorithms on the images with text mask noise. Results are averaged over 10 runs.

        ADMM                     APGL                     Ours
Image   Time    Rank   PSNR      Time    Rank   PSNR      Time    Rank   PSNR
(a)     20.6    8      26.67     50.4    5      28.45     1.3     5      27.32
(b)     23.8    15     32.14     31.4    9      31.70     4.3     8      31.91
(c)     32.1    9      25.45     47.5    8      25.41     3.6     7      25.70
(d)     25.2    9      28.15     61.9    7      28.05     2.5     7      28.20
(e)     10.9    2      41.64     22.7    2      41.50     3.4     2      41.88
5. Conclusion

In this paper, we investigate the regularized rank constrained matrix completion problem, combining spectral regularization and bilinear factorization to better solve it. The proposed MPA-IMPUTE algorithm solves the problem faster and more accurately, and we validate it both through theoretical analysis and through experiments. The proposed method only involves simple matrix computations and small scale SVDs, and is therefore well suited for large scale applications, e.g., collaborative filtering.
We also extend our method to non-convex regularization and adapt truncated nuclear norm regularization to image inpainting; the results show that our method indeed yields better results than the original implementations. However, the proposed method is only guaranteed to converge to stationary points (local minima) because of the non-convex bilinear factorization. Nonetheless, recent advances [42,43] showed that global convergence guarantees can possibly be established under certain rigorous conditions. For future work, we plan to study the global convergence of the proposed method and to extend it to tensor completion, e.g., [44–46].

Acknowledgments

This work is partially supported by NSF of China under Grants 61672548, U1611461, 61173081, 11971499, and 11571385; the Young Creative Talent Project of Guangdong Provincial Department of Education, China, under Grant 2018KQNCX233; the Guangzhou Science and Technology Program, China, under Grant 201510010165; and the Guangdong Natural Science Foundation, China, under Grant 2017A030313017.

References

[1] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Comput. 42 (8) (2009) 30–37.
[2] Y. Hu, D. Zhang, J. Ye, X. Li, X. He, Fast and accurate matrix completion via truncated nuclear norm regularization, IEEE Trans. Pattern Anal. Mach. Intell. 35 (9) (2013) 2117–2130.
[3] Y.-X. Wang, C.M. Lee, L.-F. Cheong, K.-C. Toh, Practical matrix completion and corruption recovery using proximal alternating robust subspace minimization, Int. J. Comput. Vis. 111 (3) (2015) 315–344.
[4] A.M. Buchanan, A.W. Fitzgibbon, Damped newton algorithms for matrix factorization with missing data, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 316–322.
[5] L. Wu, R. Jin, A.K. Jain, Tag completion for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 716–727.
[6] R. Basri, D. Jacobs, I. Kemelmacher, Photometric stereo with general, unknown lighting, Int. J. Comput. Vis. 72 (3) (2007) 239–257.
[7] E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM 58 (3) (2011) 11.
[8] E. Candes, B. Recht, Exact matrix completion via convex optimization, Commun. ACM 55 (6) (2012) 111–119.
[9] X.-L. Zhao, F. Wang, T.-Z. Huang, M.K. Ng, R.J. Plemmons, Deblurring and sparse unmixing for hyperspectral images, IEEE Trans. Geosci. Remote Sens. 51 (7) (2013) 4045–4058.
[10] T.-Y. Ji, T.-Z. Huang, X.-L. Zhao, T.-H. Ma, G. Liu, Tensor completion using total variation and low-rank matrix factorization, Inform. Sci. 326 (2016) 243–257.
[11] X. Li, Y. Ye, X. Xu, Low-rank tensor completion with total variation for visual data inpainting, in: Proceedings of AAAI, 2017, pp. 2210–2216.
[12] C. Lu, J. Tang, S. Yan, Z. Lin, Nonconvex nonsmooth low rank minimization via iteratively reweighted nuclear norm, IEEE Trans. Image Process. 25 (2) (2016) 829–839.
[13] A. Cui, J. Peng, H. Li, C. Zhang, Y. Yu, Affine matrix rank minimization problem via non-convex fraction function penalty, J. Comput. Appl. Math. 336 (2018) 353–374.
[14] M. Fazel, Matrix Rank Minimization with Applications (Ph.D. thesis), Stanford University, 2002.
[15] Z. Wen, W. Yin, Y. Zhang, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm, Math. Program. Comput. 4 (4) (2012) 333–361.
[16] Y. Xu, R. Hao, W. Yin, Z. Su, Parallel matrix factorization for low-rank tensor completion, Inverse Probl. Imaging 9 (2) (2015) 601–624.
[17] T. Hastie, R. Mazumder, J.D. Lee, R. Zadeh, Matrix completion and low-rank SVD via fast alternating least squares, J. Mach. Learn. Res. 16 (1) (2015) 3367–3402.
[18] R. Cabral, F. De la Torre, J.P. Costeira, A. Bernardino, Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition, in: IEEE International Conference on Computer Vision, 2013, pp. 2488–2495.
[19] N. Boumal, P.-A. Absil, Low-rank matrix completion via preconditioned optimization on the Grassmann manifold, Linear Algebra Appl. 475 (2015) 200–239.
[20] Y. Zheng, G. Liu, S. Sugimoto, S. Yan, M. Okutomi, Practical low-rank matrix approximation under robust l1-norm, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1410–1417.
[21] B. Mishra, G. Meyer, S. Bonnabel, R. Sepulchre, Fixed-rank matrix factorizations and Riemannian low-rank optimization, Comput. Statist. 29 (3–4) (2014) 591–621.
[22] B. Mishra, G. Meyer, F. Bach, R. Sepulchre, Low-rank optimization with trace norm penalty, SIAM J. Optim. 23 (4) (2013) 2124–2149.
[23] G. Meyer, S. Bonnabel, R. Sepulchre, Regression on fixed-rank positive semidefinite matrices: a Riemannian approach, J. Mach. Learn. Res. 12 (Feb) (2011) 593–625.
[24] D.R. Hunter, R. Li, Variable selection using MM algorithms, Ann. Statist. 33 (4) (2005) 1617.
[25] R. Mazumder, T. Hastie, R. Tibshirani, Spectral regularization algorithms for learning large incomplete matrices, J. Mach. Learn. Res. 11 (Aug) (2010) 2287–2322.
[26] H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Lojasiewicz inequality, Math. Oper. Res. 35 (2) (2010) 438–457.
[27] Z. Lin, C. Xu, H. Zha, Robust matrix factorization by majorization minimization, IEEE Trans. Pattern Anal. Mach. Intell. 40 (1) (2018) 208–220.
[28] Z. Zhang, The singular value decomposition, applications and beyond, arXiv preprint arXiv:1510.08532.
[29] J.-F. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[30] Q. Yao, J.T. Kwok, W. Zhong, Fast low-rank matrix learning with nonconvex regularization, in: IEEE International Conference on Data Mining, IEEE, 2015, pp. 539–548.
[31] Y.-F. Li, Y.-J. Zhang, Z.-H. Huang, A reweighted nuclear norm minimization algorithm for low rank matrix recovery, J. Comput. Appl. Math. 263 (2014) 338–350.
[32] H. Zou, R. Li, One-step sparse estimates in nonconcave penalized likelihood models, Ann. Statist. 36 (4) (2008) 1509.
[33] C. Lu, J. Tang, S. Yan, Z. Lin, Generalized nonconvex nonsmooth low-rank minimization, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4130–4137.
[34] S. Gu, Q. Xie, D. Meng, W. Zuo, X. Feng, L. Zhang, Weighted nuclear norm minimization and its applications to low level vision, Int. J. Comput. Vis. (2016) 1–26.
[35] C.-H. Zhang, T. Zhang, A general theory of concave regularization for high-dimensional sparse estimation problems, Statist. Sci. 27 (4) (2012) 576–593.
[36] K.-C. Toh, S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, Pac. J. Optim. 6 (615–640) (2010) 15.
[37] R.H. Keshavan, A. Montanari, S. Oh, Matrix completion from noisy entries, J. Mach. Learn. Res. 11 (Jul) (2010) 2057–2078.
[38] B. Vandereycken, Low-rank matrix completion by Riemannian optimization, SIAM J. Optim. 23 (2) (2013) 1214–1236.
[39] L. Balzano, R. Nowak, B. Recht, Online identification and tracking of subspaces from highly incomplete information, in: Allerton Conference on Communication, Control, and Computing, IEEE, 2010, pp. 704–711.
[40] F.M. Harper, J.A. Konstan, The movielens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (4) (2016) 19.
[41] N. Srebro, T. Jaakkola, Weighted low-rank approximations, in: International Conference on Machine Learning, vol. 3, 2003, pp. 720–727.
[42] P. Jain, P. Netrapalli, S. Sanghavi, Low-rank matrix completion using alternating minimization, in: ACM Symposium on Theory of Computing, ACM, 2013, pp. 665–674.
[43] R. Sun, Z.-Q. Luo, Guaranteed matrix completion via non-convex factorization, IEEE Trans. Inform. Theory 62 (11) (2016) 6535–6579.
[44] T. Yokota, Q. Zhao, A. Cichocki, Smooth parafac decomposition for tensor completion, IEEE Trans. Signal Process. 64 (20) (2016) 5423–5436.
[45] P. Zhou, C. Lu, Z. Lin, C. Zhang, Tensor factorization for low-rank tensor completion, IEEE Trans. Image Process. 27 (3) (2018) 1152–1163.
[46] T.-X. Jiang, T.-Z. Huang, X.-L. Zhao, T.-Y. Ji, L.-J. Deng, Matrix factorization for low-rank tensor completion using framelet prior, Inform. Sci. 436 (2018) 403–417.