Knowledge-Based Systems 177 (2019) 68–81
Latent graph-regularized inductive robust principal component analysis
Lai Wei a, Rigui Zhou a, Jun Yin a, Changming Zhu a, Xiafen Zhang a, Hao Liu b
a Department of Computer Science, Shanghai Maritime University, Haigang Avenue 1550, Shanghai, China
b Wuhan Digit Engineering Institute, Canglong North Road 709, Wuhan, Hubei, China
Article history: Received 19 August 2018; Received in revised form 10 April 2019; Accepted 11 April 2019; Available online 17 April 2019
Keywords: Low-rank subspace; Manifold structure; Weighted sparse constraint; Robust principal component
Abstract: Recovering low-rank subspaces of data sets has become an attractive problem in recent years. In this paper, we propose a new low-rank subspace learning algorithm, termed latent graph-regularized inductive robust principal component analysis (LGIRPCA). Different from existing low-rank subspace learning methods, LGIRPCA considers the feature manifold structure of a given data set and designs a new Laplacian regularizer to characterize this structure information. We prove that the devised Laplacian regularizer can be transformed into a weighted sparse constraint on the required low-rank projection matrix. Moreover, the relationships between LGIRPCA and some related algorithms are discussed. An optimization algorithm based on the augmented Lagrange multiplier method is presented to solve the LGIRPCA problem. Finally, extensive experiments on several benchmark databases demonstrate the effectiveness of our method for image recovery and image classification. © 2019 Elsevier B.V. All rights reserved.
1. Introduction

In computer vision and image processing, high-dimensional data are usually assumed to lie in a union of low-rank linear subspaces [1–5]. This implies that high-dimensional data often have redundant features or are corrupted by errors. Therefore, research on how to remove irrelevant features or corruptions from the original data samples and recover their low-rank subspace structures has attracted much attention in the past years [3–8]. Principal Component Analysis (PCA) [7,9] is one of the most popular algorithms; it finds appropriate low-rank representations of the original data samples in an l2-sense [10]. The major drawback of classical PCA is that it is very sensitive to noise, outliers and gross corruptions [10–14]. In order to enhance the robustness of PCA, a robust principal component analysis algorithm (RPCA) has been proposed [10,11]. RPCA assumes that a data matrix is composed of a low-rank matrix and a sparse error matrix. By directly recovering the low-rank matrix, RPCA obtains the low-rank representation of each data sample.
Owing to its promising performance in several image processing applications [15,16], RPCA has proven more powerful than classical PCA for discovering accurate low-rank representations of the original data. The success of RPCA has also encouraged related works such as low-rank and sparse principal feature coding (LSPFC) [17], robust PCA via outlier pursuit [18] and reinforced robust principal component pursuit [19]. However, RPCA is a transductive algorithm, which means it cannot handle the out-of-sample problem [20]. Hence, Bao et al. proposed an inductive robust principal component analysis (IRPCA) [20] method whose goal is to seek a low-rank projection matrix for a given data set. With the low-rank projection matrix, the low-rank representations of new data samples can also be obtained easily. Inspired by IRPCA, Zhang et al. developed an inductive version of LSPFC (ILSPFC) [17]. According to the experimental results presented in the corresponding references, IRPCA and ILSPFC are superior to RPCA. Recently, Zhu et al. devised an unsupervised feature selection approach using a regularized self-representation (RSR) technique [21]. A close look at RSR shows that it actually has a formulation very similar to that of IRPCA. However, different from IRPCA, RSR reformulates a feature selection task as a feature reconstruction problem: RSR claims that the irrelevant features can be linearly reconstructed by some important features in the feature space of a given data set. Then, from each feature's reconstruction coefficient vector, a weight indicating the importance of that feature can be computed, and the vital features with larger weights can be determined, as sketched below.
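To illustrate the weighting scheme just described, here is a minimal numpy sketch (our illustration, not code from [21]); it assumes a reconstruction matrix W has already been learned by some RSR-style solver and simply ranks features by the l2-norms of the rows of W.

```python
import numpy as np

def rsr_feature_weights(W):
    """Score each original feature by the l2-norm of its row in the learned
    reconstruction matrix W; larger norms mean the feature contributes more
    to reconstructing the other features."""
    return np.linalg.norm(W, axis=1)

# toy usage with a random stand-in for W (m features)
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))
weights = rsr_feature_weights(W)
ranked = np.argsort(-weights)   # feature indices, most important first
print(ranked, np.round(weights[ranked], 3))
```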
It should also be pointed out that RSR establishes connections between feature selection methods and some subspace clustering algorithms such as sparse subspace clustering (SSC) [1,2] and low-rank representation (LRR) [3,4].

Enlightened by the above algorithms, in this paper we propose a latent graph-regularized inductive robust principal component analysis (LGIRPCA) algorithm. Firstly, different from IRPCA, LGIRPCA forces the obtained projection matrix to be low-rank through matrix factorization instead of minimizing the nuclear norm of the projection matrix. Secondly, as stated in RSR, the projection matrix can also be interpreted as a reconstruction matrix in the feature space of the given data set. However, unlike RSR, LGIRPCA computes the reconstruction coefficient vector of each original feature in a latent feature space: one matrix factor of the projection matrix is used to find the latent feature space, and the other contains the reconstruction coefficient vectors of the original features in that space. Thirdly, following existing co-clustering algorithms [22–24], the geometric structure information of the features should also be considered; hence LGIRPCA designs a graph regularizer to characterize the feature manifold in the latent feature space. Fourthly, the devised Laplacian regularizer can be transformed into a sparse constraint on the projection matrix, which allows us to illustrate the relationships between LGIRPCA and related algorithms such as IRPCA and RSR. Finally, an optimization algorithm for solving the LGIRPCA problem is presented, and extensive experiments are conducted to show the superiority of LGIRPCA. The main contributions of the paper can be summarized as follows:
1. We clearly illustrate the connections between PCA-related algorithms (including RPCA, IRPCA, PCA-L1, JSPCA and ILSPFC) and the self-representation based feature selection method RSR.
2. We propose a new latent graph-regularized inductive robust principal component analysis (LGIRPCA) algorithm by combining the methodologies used in RSR and the PCA-related algorithms.
3. Different from existing Laplacian regularizer construction methods, in LGIRPCA we design a new Laplacian regularizer for the projection matrix by using one of its matrix factors.

The rest of the paper is organized as follows: Section 2 briefly reviews some RPCA-related algorithms, namely RPCA, IRPCA and RSR. Section 3 introduces the idea of latent graph-regularized inductive robust principal component analysis (LGIRPCA). In Section 4, we state the relationships between LGIRPCA and closely related algorithms. Extensive experiments on both synthetic and real-world data sets are performed to show the effectiveness of LGIRPCA in Section 5. Finally, Section 6 gives the conclusions.

2. Preliminaries

In this section, we briefly introduce the RPCA, IRPCA and RSR algorithms and state their relationships.
2.1. Robust principal component analysis (RPCA)

Suppose the data matrix X = [x_1, x_2, . . . , x_n] ∈ R^{m×n}, where x_i ∈ R^m (i = 1, 2, . . . , n) is the ith data point in the given data set X. RPCA assumes that X is generated by a low-rank matrix D (which can be regarded as the clean data matrix) together with some corruptions E (assumed to be sparse), namely X = D + E. The objective of RPCA is then to seek the low-rank matrix D and the sparse matrix E from the given data matrix X:

min_{D,E} rank(D) + λ∥E∥_0,  s.t.  X = D + E,  (1)

where λ > 0 is a parameter, rank(D) is the rank of D and ∥E∥_0 is the l0-norm of matrix E. Because Problem (1) is NP-hard, rank(D) and ∥E∥_0 are often replaced by ∥D∥_∗ and ∥E∥_1 respectively, where ∥·∥_∗ denotes the nuclear norm of a matrix (also known as the trace norm, namely the sum of the singular values of a matrix) and ∥·∥_1 is the l1-norm of a matrix. The convex surrogate of RPCA's objective function can then be expressed as follows:

min_{D,E} ∥D∥_∗ + λ∥E∥_1,  s.t.  X = D + E.  (2)

The above problem can be solved by an augmented Lagrange multiplier (ALM) method [25]. Because of the excellent performance of RPCA, many follow-up studies have been proposed. Besides the related algorithms mentioned in Section 1, Huang et al. developed a tensor-based RPCA-related model to handle hyperspectral image denoising [26]; it was proven capable of removing different kinds of noise in hyperspectral images. Zhang et al. combined RPCA and LRR and proposed a robust alternating low-rank representation for image recovery and outlier detection [27].

2.2. Inductive robust principal component analysis (IRPCA)

Clearly, RPCA is a transductive algorithm, i.e., it fails to compute the low-rank representations of new data samples. Bao et al. proposed inductive RPCA (IRPCA) [20] to alleviate this drawback. IRPCA assumes that the low-rank matrix D can be obtained by projecting the original data matrix X onto a low-rank subspace, namely D = PX, where P ∈ R^{m×m} is a low-rank projection matrix. So IRPCA actually aims to solve the following problem:

min_{P,E} ∥P∥_∗ + λ∥E∥_1,  s.t.  X = PX + E.  (3)

2.3. Regularized self-representation (RSR)

Suppose the singular value decomposition (SVD) of the low-rank projection matrix is P = USV^t (V^t is the transpose of V; S is a diagonal matrix, so S^t = S); then P^t = VSU^t. Based on the definition of the nuclear norm, we have ∥P^t∥_∗ = ∥P∥_∗. Moreover, ∥E^t∥_1 = ∥E∥_1, so Problem (3) can be transferred to the following equivalent problem:

min_{P,E} ∥P^t∥_∗ + λ∥E^t∥_1,  s.t.  X^t = X^t P^t + E^t.  (4)

Let X̂ = X^t = [x̂_1, x̂_2, . . . , x̂_m], where x̂_j (j = 1, 2, . . . , m) is the jth feature of the original data set X. Hence, X̂ spans the feature space of data set X. Assume W = [w_1, w_2, . . . , w_m] = P^t and Ê = E^t. Then we have

min_{W,Ê} ∥W∥_∗ + λ∥Ê∥_1,  s.t.  X̂ = X̂W + Ê.  (5)

Finally, by replacing the nuclear norm and the l1-norm in the above problem with the l2,1-norm, we obtain the RSR problem [21]:

min_{W,Ê} ∥W∥_{2,1} + λ∥Ê∥_{2,1},  s.t.  X̂ = X̂W + Ê.  (6)
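The ALM/ADM solvers referenced for Problems (2)–(6) alternate closed-form proximal steps with multiplier updates. The numpy sketch below (our illustration, not code from the cited papers) collects the three proximal operators those steps rely on — entrywise soft-thresholding for the l1-norm, singular value thresholding for the nuclear norm and row-wise shrinkage for the l2,1-norm — and shows one plain inexact-ALM loop for Problem (2); the λ = 1/√max(m, n) default and the fixed µ are assumptions made for brevity.

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau*||.||_1: entrywise shrinkage."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    """Proximal operator of tau*||.||_*: shrink the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def row_shrink_l21(A, tau):
    """Proximal operator of tau*||.||_{2,1}: scale each row towards zero."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

# one possible inexact-ALM loop for Problem (2): X = D + E
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 40))
lam, mu = 1.0 / np.sqrt(max(X.shape)), 1.0        # common heuristic choices
Y, E = np.zeros_like(X), np.zeros_like(X)
for _ in range(100):
    D = svt(X - E + Y / mu, 1.0 / mu)             # nuclear-norm step
    E = soft_threshold(X - D + Y / mu, lam / mu)  # l1 step
    Y = Y + mu * (X - D - E)                      # multiplier update
```

Replacing the SVT and soft-thresholding steps by the row-wise operator above gives the corresponding updates for the l2,1-norm terms in the RSR problem (6).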
According to the constraint in Problem (6), we can see that x̂_j = X̂w_j + ê_j, which means the jth feature can be linearly reconstructed by the other features, and ê_j ∈ Ê indicates the reconstruction errors.

3. Latent graph-regularized inductive robust principal component analysis

3.1. Motivation

Based on the descriptions in the previous section, we can see the connections among the three algorithms RPCA, IRPCA and RSR. Now we return to the problem of classical PCA. PCA seeks an orthogonal transformation matrix Q that minimizes the reconstruction error of X in an l2-sense:

min_Q ∥X − Q^t QX∥_F^2,  s.t.  QQ^t = I_k,  (7)

where Q ∈ R^{k×m} is the required projection matrix, k is the dimension of the low-dimensional subspace, I_k ∈ R^{k×k} is an identity matrix and ∥·∥_F is the Frobenius norm of a matrix. Letting P = Q^t Q, we obtain an equivalent problem of classical PCA as follows:

min_{Q,P} λ∥X − PX∥_F^2 + ∥P − Q^t Q∥_F^2,  s.t.  QQ^t = I_k.  (8)

We then abandon the constraint QQ^t = I_k and relax the second term of Problem (8) to ∥P − Q^t R∥_F^2, where R ∈ R^{k×m}. This gives

min_{Q,R,P} λ∥X − PX∥_F^2 + ∥P − Q^t R∥_F^2.  (9)

Similar to IRPCA introduced in Section 2, the error term can be defined as E = X − PX, and the l1-norm can be used to fit more complex corruptions in data sets. Hence, Problem (9) is transformed into the following formulation:

min_{Q,R,P,E} λ∥E∥_1 + ∥P − Q^t R∥_F^2,  s.t.  X = PX + E.  (10)

We can see that Eq. (10) is quite similar to the objective function of IRPCA. In Eq. (10), P is approximated by two matrix factors with k < m, hence P is also low-rank. Then we borrow the idea of RSR:

X = PX + E ⇒ X̂ = X̂P^t + Ê ⇒ X̂ ≈ X̂R^t Q + Ê ⇒ X̂ ≈ AQ + Ê,  (11)

where X̂R^t = X^t R^t = (RX)^t = A. Here RX can be regarded as the projection of X into a latent data subspace determined by R, and (RX)^t can be considered the feature space corresponding to this latent data subspace. Therefore, the original features of data set X can be linearly reconstructed by the latent features, and Q = [q_1, q_2, . . . , q_m] is the reconstruction coefficient matrix.

Furthermore, based on existing co-clustering algorithms, the manifold structure of the features should also be considered in the design of machine learning algorithms [22–24]. Hence, we hope that similar original features have similar reconstruction coefficient vectors in the latent feature subspace, namely, if ∥x̂_i − x̂_j∥_2 → 0, then ∥q_i − q_j∥_2 → 0 (i, j = 1, 2, . . . , m). The most popular way to achieve this is to devise a Laplacian regularizer [28–31] on Q, which helps Q characterize the manifold structure of the original features [22–24]. The construction of the Laplacian regularizer in existing co-clustering algorithms usually contains two steps: (1) constructing the affinity matrix from the original features by the KNN (K-nearest-neighbors) method [32]; (2) computing the corresponding Laplacian matrix.

As mentioned in the previous sections, high-dimensional data are often corrupted by errors and the neighborhood size K in KNN is usually difficult to choose, hence the Laplacian regularizer constructed in this way may not recover the manifold structure of the original features truthfully. Fortunately, according to the methodology of RSR, in Eq. (11), P^t is the reconstruction coefficient matrix of the original features. From the viewpoint of subspace clustering algorithms [1–4], P^t can be regarded as a better representation of the original feature matrix X̂ which carries more geometric structure information of the original features. Moreover, the absolute value of the (i, j)th element of P^t (denoted |[P^t]_{ij}|) also indicates the similarity between features x̂_i and x̂_j. Hence, we propose to use P^t to construct the Laplacian regularizer. In particular, we start with an affinity graph G with [G]_{ij} = (|[P^t]_{ij}| + |[P^t]_{ji}|)/2 = (|[P]_{ji}| + |[P]_{ij}|)/2. Then, in order to make similar original features have similar coefficients in the latent subspace, the Laplacian regularizer is defined as Σ_{i,j=1}^m ∥q_i − q_j∥_2^2 [G]_{ij} = tr(Q(D − G)Q^t) = tr(QLQ^t), where D is a diagonal matrix with [D]_{ii} = Σ_{j=1}^m [G]_{ij} and L = D − G is a Laplacian matrix (tr(·) denotes the trace of a matrix). We can analyze the constructed Laplacian regularizer in more depth:

tr(QLQ^t) = tr(Q(D − G)Q^t) = Σ_{i,j=1}^m ∥q_i − q_j∥_2^2 [G]_{ij}
 = Σ_{i,j=1}^m ∥q_i − q_j∥_2^2 (|[P]_{ji}| + |[P]_{ij}|)/2
 = (1/2) Σ_{i,j=1}^m ∥q_i − q_j∥_2^2 |[P]_{ji}| + (1/2) Σ_{i,j=1}^m ∥q_i − q_j∥_2^2 |[P]_{ij}|
 = ∥M ⊙ P∥_1,  (12)

where M is a symmetric matrix with [M]_{ij} = ∥q_i − q_j∥_2^2, and M ⊙ P denotes the Hadamard product of M and P with [M ⊙ P]_{ij} = [M]_{ij}[P]_{ij}. Hence, the Laplacian regularizer of Q can be transformed into a weighted l1-norm constraint on P. Let Z(Q, P) = tr(QLQ^t) = ∥M ⊙ P∥_1 and add it to Problem (10); we finally obtain the objective function of latent graph-regularized inductive robust principal component analysis (LGIRPCA):

min_{Q,R,P,E} ∥P − Q^t R∥_F^2 + βZ(Q, P) + λ∥E∥_1,  s.t.  X = PX + E,  (13)

where β > 0 is a parameter.
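As a quick numerical check on the reformulation in Eq. (12), the numpy sketch below (ours, not from the paper) builds the affinity graph G from a projection matrix P, the Laplacian L, and the weight matrix M from Q, and verifies that the graph-weighted smoothness Σ_{i,j} [M]_{ij}[G]_{ij} coincides with the weighted l1 term ∥M ⊙ P∥_1; the random P and Q are stand-ins for the variables learned by LGIRPCA.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 3
P = rng.standard_normal((m, m))      # stand-in for the projection matrix
Q = rng.standard_normal((k, m))      # columns q_i: latent reconstruction coefficients

# affinity graph and Laplacian built from P, as in Section 3.1
G = (np.abs(P) + np.abs(P).T) / 2.0
L = np.diag(G.sum(axis=1)) - G       # the Laplacian entering Z(Q, P)

# weight matrix M with [M]_{ij} = ||q_i - q_j||_2^2
M = ((Q[:, :, None] - Q[:, None, :]) ** 2).sum(axis=0)

smoothness = float((M * G).sum())            # sum_{i,j} [M]_{ij}[G]_{ij}
weighted_l1 = float(np.abs(M * P).sum())     # ||M o P||_1
print(np.isclose(smoothness, weighted_l1))   # True: a weighted sparse term on P
```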
3.2. Optimization

The LGIRPCA problem can be solved by the alternating direction method (ADM) [33]. First, by introducing an auxiliary variable J, Problem (13) can be converted into the following equivalent problem:

min_{Q,R,P,J,E} ∥P − Q^t R∥_F^2 + βZ(Q, J) + λ∥E∥_1,  s.t.  X = PX + E, P = J.  (14)

The augmented Lagrange function of the above problem can then be expressed as follows:

L = ∥P − Q^t R∥_F^2 + βZ(Q, J) + λ∥E∥_1 + ⟨Y_1, X − PX − E⟩ + ⟨Y_2, P − J⟩ + (µ/2)(∥X − PX − E∥_F^2 + ∥P − J∥_F^2),  (15)

where Y_1, Y_2 are two Lagrange multipliers and µ > 0 is a parameter. By minimizing L, the five variables Q, R, P, J, E can be
updated in sequence, each with the other variables fixed. We first initialize the five variables Q, R, P, J, E to zero, namely Q = R = 0 ∈ R^{k×m}, P = J = 0 ∈ R^{m×m} and E = 0 ∈ R^{m×n}, where 0 denotes a matrix with every element equal to zero. Each iteration consists of the following six steps:

1. Update R with the other variables fixed. Only one term in Eq. (15) is relevant to R, namely ∥P − Q^t R∥_F^2, so the optimal R^∗ is obtained as R^∗ = (Q^t)^† P, where (Q^t)^† is the pseudo-inverse of Q^t.

2. Update Q with the other variables fixed. Neglecting the terms irrelevant to Q in Eq. (15), we have

min_Q ∥P − Q^t R∥_F^2 + βZ(Q, J) = min_Q ∥P − Q^t R∥_F^2 + β tr(QLQ^t).  (16)

Here, L is defined using J. By taking the derivative of the above objective w.r.t. Q and setting it to zero, the following equation holds:

2RR^t Q + βQ(L + L^t) − 2RP^t = 0.  (17)

Eq. (17) is a Sylvester equation [34] w.r.t. Q, so the optimal Q^∗ can be obtained by solving Eq. (17) with, e.g., the Matlab function lyap().

3. Update J with the other variables fixed. Once Q^∗ is computed, Z(Q, J) = ∥M ⊙ J∥_1 with M computed from Q^∗. Then, collecting the terms relevant to J in Eq. (15), we have

min_J β∥M ⊙ J∥_1 + ⟨Y_2, P − J⟩ + (µ/2)∥P − J∥_F^2 = min_J β∥M ⊙ J∥_1 + (µ/2)∥P − J + Y_2/µ∥_F^2.  (18)

The optimal solution J^∗ to Eq. (18) satisfies [J^∗]_{ij} = S_{ε_{ij}}([P + Y_2/µ]_{ij}), where S_ε(x) = max(x − ε, 0) + min(x + ε, 0) and ε_{ij} = (β/µ)[M]_{ij}.

4. Update P with the other variables fixed. Dropping the terms irrelevant to P, the following optimization problem is obtained:

min_P ∥P − Q^t R∥_F^2 + ⟨Y_1, X − PX − E⟩ + ⟨Y_2, P − J⟩ + (µ/2)(∥X − PX − E∥_F^2 + ∥P − J∥_F^2)
 = min_P ∥P − Q^t R∥_F^2 + (µ/2)(∥X − PX − E + Y_1/µ∥_F^2 + ∥P − J + Y_2/µ∥_F^2).  (19)

By taking the derivative of the above problem w.r.t. P and setting it to zero, the following equation is obtained:

P(2I_m + µ(XX^t + I_m)) = 2Q^t R + µ((X − E + Y_1/µ)X^t + J − Y_2/µ).  (20)

Hence, the optimal value of P is

P^∗ = (2Q^t R + µ((X − E + Y_1/µ)X^t + J − Y_2/µ))(2I_m + µ(XX^t + I_m))^{-1}.  (21)

5. Update E with the other variables fixed. Similar to the previous steps, we drop the terms irrelevant to E; minimizing Eq. (15) is then equal to solving the following problem:

min_E λ∥E∥_1 + ⟨Y_1, X − PX − E⟩ + (µ/2)∥X − PX − E∥_F^2 = min_E λ∥E∥_1 + (µ/2)∥X − PX − E + Y_1/µ∥_F^2.  (22)

Similar to the method for updating J, the optimal solution E^∗ to Eq. (22) satisfies [E^∗]_{ij} = S_ε([X − PX + Y_1/µ]_{ij}), where ε = λ/µ.

6. Update the parameters with the other variables fixed. The updating schemes for the parameters appearing in Eq. (15) are summarized as follows:

Y_1^∗ ← Y_1 + µ(X − PX − E),  Y_2^∗ ← Y_2 + µ(P − J),  µ^∗ ← min(µ_max, ρµ),  (23)

where µ_max and ρ are two given positive parameters.

3.3. Algorithm

The algorithmic procedure of LGIRPCA is summarized in Algorithm 1. Once P^∗ is computed, the reconstruction of a new sample x_new can be obtained as P^∗ x_new.

Algorithm 1 Latent graph-regularized inductive robust principal component analysis (LGIRPCA)
Input: Data set X = [x_1, x_2, . . . , x_n] ∈ R^{m×n}, the dimension of the latent subspace l, parameters β, λ and the maximal number of iterations Maxiter;
Output: The coefficient matrix P^∗, the two factors Q^∗, R^∗ and the error term E^∗;
1: Initialize the parameters: Y_1^0 = Y_2^0 = 0, µ^0 = 10^{-8}, µ_max = 10^{30}, ρ = 1.1, ϵ = 10^{-6}, k = 0.
2: Initialize the variables: P^0 = J^0 = 0 ∈ R^{m×m}, Q^0 = R^0 = 0 ∈ R^{l×m}.
3: while ∥X − P^k X − E^k∥_∞ > ϵ do
4:   Update R^{k+1} = ((Q^k)^t)^† P^k;
5:   Update Q^{k+1} by solving Eq. (17);
6:   Update J^{k+1} with [J^{k+1}]_{ij} = S_{ε_{ij}}([P^k + Y_2^k/µ^k]_{ij}), where S_ε(x) = max(x − ε, 0) + min(x + ε, 0) and ε_{ij} = (β/µ^k)[M^k]_{ij};
7:   Update P^{k+1} = (2(Q^{k+1})^t R^{k+1} + µ^k((X − E^k + Y_1^k/µ^k)X^t + J^{k+1} − Y_2^k/µ^k))(2I_m + µ^k(XX^t + I_m))^{-1};
8:   Update E^{k+1} with [E^{k+1}]_{ij} = S_ε([X − P^{k+1}X + Y_1^k/µ^k]_{ij}), where S_ε(x) = max(x − ε, 0) + min(x + ε, 0) and ε = λ/µ^k;
9:   Update the parameters Y_1^{k+1} = Y_1^k + µ^k(X − P^{k+1}X − E^{k+1}), Y_2^{k+1} = Y_2^k + µ^k(P^{k+1} − J^{k+1}), µ^{k+1} = min(µ_max, ρµ^k);
10:  Update k = k + 1.
11: end while
12: return the coefficient matrix P^∗ = P^k, Q^∗ = Q^k, R^∗ = R^k, E^∗ = E^k.

Fig. 1. Sample images from the five benchmark databases.

Fig. 2. Corrupted sample images from the five benchmark databases. From the top line to the bottom line of each sub-figure, the corrupted percentages are 10%, 20%, 30%, 40% and 50% respectively.

4. Further discussions

In this section, we first analyze the complexity of the optimization algorithm of LGIRPCA (i.e., Algorithm 1) and then expound the relationships between LGIRPCA and some closely related algorithms.

4.1. Complexity analysis

We discuss the computational burden of Algorithm 1. Suppose the data matrix X ∈ R^{m×n}; the computational burden of Algorithm 1 is determined by the updating of five variables: P, J ∈ R^{m×m}, Q, R ∈ R^{k×m} and E ∈ R^{m×n}, where k is the dimension of the low-dimensional latent subspace. We analyze the complexity of updating these variables in each step. First, updating R requires computing the pseudo-inverse of an m × k matrix (k < m), whose complexity is O(m^3), and the subsequent multiplication with the m × m matrix P is also O(m^3); hence, the total time complexity of updating R is O(m^3). Second, it takes O(k^3) time to solve the Sylvester equation for updating Q. Third, updating J performs an m × m matrix addition and a shrinkage operation, which is O(m^2). Fourth, similar to the first part, the complexity of updating P is O(m^3). Fifth, the shrinkage operation on an m × n matrix costs O(m × n) and the m × m matrix multiplication is O(m^3); hence, the time complexity of updating E is O(m^3). Taken together, the time complexity of Algorithm 1 in each iteration is O(m^3). Suppose the number of iterations is N; the total complexity of Algorithm 1 is then O(N × m^3).
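To make Algorithm 1 concrete, below is a compact numpy/scipy sketch of the ADM iteration (our illustration, not the authors' code). It solves the Sylvester equation (17) with scipy.linalg.solve_sylvester instead of Matlab's lyap(), uses small random initial values for P, J and Q rather than all zeros so that the pseudo-inverse and Sylvester steps are non-degenerate, and the default parameter values are assumptions.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def shrink(A, tau):
    """Entrywise soft-thresholding S_tau; tau may be a scalar or a matrix."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def lgirpca(X, k, beta=0.01, lam=0.1, rho=1.1, mu=1e-4, mu_max=1e30,
            tol=1e-6, max_iter=200, seed=0):
    """Sketch of the ADM updates of Algorithm 1 (not the authors' code).
    Small random initialization replaces the all-zero initialization so the
    pseudo-inverse and Sylvester steps are well defined from the start."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((m, m)) * 1e-2
    J = P.copy()
    Q = rng.standard_normal((k, m)) * 1e-2
    E = np.zeros((m, n))
    Y1, Y2 = np.zeros((m, n)), np.zeros((m, m))
    I_m = np.eye(m)

    for _ in range(max_iter):
        R = np.linalg.pinv(Q.T) @ P                  # step 1: R = (Q^t)^+ P
        G = (np.abs(J) + np.abs(J).T) / 2.0          # affinity graph from J
        L = np.diag(G.sum(axis=1)) - G               # Laplacian used in Eq. (16)
        # step 2: Sylvester equation (17): 2RR^t Q + beta Q (L + L^t) = 2RP^t
        Q = solve_sylvester(2.0 * R @ R.T, beta * (L + L.T), 2.0 * R @ P.T)
        # step 3: weighted shrinkage (18), [M]_{ij} = ||q_i - q_j||_2^2
        M = ((Q[:, :, None] - Q[:, None, :]) ** 2).sum(axis=0)
        J = shrink(P + Y2 / mu, (beta / mu) * M)
        # step 4: closed-form P update, Eq. (21)
        rhs = 2.0 * Q.T @ R + mu * ((X - E + Y1 / mu) @ X.T + J - Y2 / mu)
        P = rhs @ np.linalg.inv(2.0 * I_m + mu * (X @ X.T + I_m))
        # step 5: shrinkage for the error term, Eq. (22)
        E = shrink(X - P @ X + Y1 / mu, lam / mu)
        # step 6: multiplier and penalty updates, Eq. (23)
        Y1 = Y1 + mu * (X - P @ X - E)
        Y2 = Y2 + mu * (P - J)
        mu = min(mu_max, rho * mu)
        if np.max(np.abs(X - P @ X - E)) < tol:
            break
    return P, Q, R, E
```

For a new sample x_new, its reconstruction is then P @ x_new as described in Section 3.3; the O(m^3) cost per iteration discussed above shows up in the pseudo-inverse and the m × m matrix inverse.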
4.2. The relationships among LGIRPCA, IRPCA and RSR

Based on the descriptions in the previous sections, we can easily illustrate the relationship between LGIRPCA and IRPCA. For a given data matrix X ∈ R^{m×n}, LGIRPCA approximates the projection matrix P ∈ R^{m×m} by two matrices Q ∈ R^{k×m} and R ∈ R^{k×m}, namely P = Q^t R. Because k < m, P must be low-rank, whereas in IRPCA, P is forced to be low-rank by minimizing its nuclear norm. Hence, when β = 0, LGIRPCA degenerates to an alternative formulation of IRPCA. When β ≠ 0, Z(Q, P) = ∥M ⊙ P∥_1 is a weighted sparse constraint on P. As described in [17,35], a sparse projection matrix has the ability to identify important variables, hence LGIRPCA is expected to outperform IRPCA.

In Section 2, we have stated the connection between IRPCA and RSR. The main difference between them is that RSR uses the l2,1-norm to characterize the corruptions and requires the feature reconstruction matrix (namely, the projection matrix in IRPCA) to be row-sparse. As mentioned above, RSR can also be considered a special case of LGIRPCA. Furthermore, RSR assumes that each feature can be linearly represented by the other features in the original feature space. Because data are usually corrupted by noise or outliers, the reconstruction matrix obtained by RSR may not reveal the structure of the original features faithfully. LGIRPCA instead reconstructs each original feature in a learned latent feature space, where we believe the feature structures can be discovered more truthfully.

Fig. 3. Image recovery results of the evaluated algorithms on different databases. The first and second lines of each sub-figure illustrate the recovery results obtained by the evaluated algorithms on the training images and the testing images of each database respectively. The second to the seventh column in each sub-figure show the results obtained by the different algorithms. The values of the parameters corresponding to the evaluated algorithms are reported in supplementary table 1 in the supplementary materials. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.3. The relationships among LGIRPCA, PCA-L1 and JSPCA

The problem of PCA-L1 [12] is defined as follows:

min_Q ∥X − Q^t QX∥_1,  s.t.  QQ^t = I_k,  (24)

where Q ∈ R^{k×m}. Suppose P = Q^t Q and E = X − PX; PCA-L1 then has the following alternative form:

min_Q ∥P − Q^t Q∥_F^2 + λ∥E∥_1,  s.t.  X = PX + E, QQ^t = I_k.  (25)

Yi et al. recently proposed a joint sparse principal component analysis (JSPCA) method [14], whose original objective function is defined as follows:

min_{Q,R} ∥X − Q^t RX∥_{2,1} + η∥R^t∥_{2,1},  (26)

where η > 0 is a parameter. Following the methodology used above, JSPCA can be transferred to the following equivalent problem:

min_{P,Q,R} ∥P − Q^t R∥_F^2 + β∥R^t∥_{2,1} + λ∥E∥_{2,1},  s.t.  X = PX + E.  (27)

By analyzing PCA-L1, JSPCA and LGIRPCA, we find that:
(1) PCA-L1 performs a symmetric decomposition of the projection matrix P, while JSPCA and LGIRPCA allow two different matrices to approximate P;
(2) PCA-L1 seeks an orthogonal latent low-dimensional subspace, whereas JSPCA learns a sparse low-dimensional subspace by minimizing the l2,1-norm of R;
(3) from the viewpoint of feature reconstruction, PCA-L1 and JSPCA reconstruct each original feature in the original feature space, while LGIRPCA regards the original features as better reconstructed in a learned latent subspace;
(4) PCA-L1 and LGIRPCA use the l1-norm to characterize the corruptions, while the l2,1-norm is used in JSPCA;
(5) compared with the original formulations of PCA-L1 and JSPCA, LGIRPCA has a more flexible formulation and can be used in more applications, such as determining the importance of each original feature.

4.4. The relationship between LGIRPCA and ILSPFC

ILSPFC [17] considers the following problem:

min_{P,E} (1 − β)∥PX∥_∗ + β∥PX∥_1 + λ∥E∥_1,  s.t.  X = PX + E,  (28)

where 0 < β < 1 is a parameter. Different from our LGIRPCA, ILSPFC requires the reconstruction term PX of the data matrix X to be simultaneously low-rank and sparse, while the projection matrix P itself is not constrained. Because rank(PX) ≤ min(rank(P), rank(X)), the projection matrix P obtained by ILSPFC may not be low-rank. Moreover, P is learned jointly with the training data X, hence it may fit the training data well but perform poorly on new data samples. On the other hand, for outside samples X_test, because the inequality rank(PX_test) ≤ min(rank(P), rank(X_test)) still holds and P may not be low-rank, we cannot conclude that the features PX_test obtained by ILSPFC are low-rank. In contrast, LGIRPCA obtains a low-rank projection matrix, so the features it produces must be low-rank.

5. Experiments

5.1. Experiment setup

Comparison methods: In this section, we perform experiments to verify the effectiveness of LGIRPCA and compare it with other closely related methods, including IRPCA [20], RSR [21], PCA-L1 [12], JSPCA [14] and ILSPFC [17]. (The equivalent formulations of the original RSR, PCA-L1 and JSPCA problems, i.e., Problems (6), (25) and (27), are used in our experiments; they can be solved by ADM.)

Parameter setting: The experiments conducted in this section are roughly divided into two groups: image recovery and image classification. In these experiments, we carefully tune the corresponding parameters of the six evaluated algorithms. The regularization parameter β in JSPCA, ILSPFC and LGIRPCA is
chosen from {0.0001, 0.001, 0.01, 0.1, 1}, and the parameter λ in all algorithms is selected from {0.0001, 0.001, 0.01, 0.1, 1, 10, 20}. We fix the dimension k of the latent subspaces found by PCA-L1, JSPCA and LGIRPCA to n − 1 (n is the number of samples in a training data set). Finally, we also perform experiments to test how the performance of LGIRPCA varies with its parameters.

Databases: Several real-world benchmark databases are used in our experiments: the ORL database [36], the extended Yale B database [37], the AR database [38], the MNIST database (http://yann.lecun.com/exdb/mnist/) and the COIL-20 database [39]. Brief information on these databases is summarized as follows.

The ORL database includes 10 images of each of 40 individuals. These images were taken at different times, varying the lighting, facial expressions and facial details. Each image was resized to 32 × 32 pixels in our experiments.

The extended Yale B face database consists of 38 individuals with around 64 near-frontal images under different illuminations for each individual. In this database, some images are corrupted by shadows. We used the images from the first 10 persons (640 images in total) and resized each image to 32 × 32 pixels.

The AR database is composed of more than 4000 face images of 126 persons. Each person has 26 pictures, which include frontal views of faces with different expressions, illuminations and occlusions. In our experiments, we took the pictures of the first 20 persons (520 images) of AR and resized each image to 50 × 40 pixels.

The MNIST database has 10 subjects, corresponding to the 10 handwritten digits 0–9. It has a training set of 60 000 images and a testing set of 10 000 images. We formed a subset consisting of the first 50 samples of each digit's training set and resized each image to 28 × 28 pixels.

The COIL-20 database contains 1440 images of 20 different objects. The images of each object are taken 5° apart as the object rotates on a turntable, and each object has 72 images. We chose the first 10 images of each object to form a subset and resized each image to 32 × 32 pixels.

The sample images from the above databases are illustrated in Fig. 1(a)–(e).

Fig. 4. Reconstruction errors obtained by the evaluated algorithms versus the percentage of corruption on the five databases. The values of the parameters corresponding to the evaluated algorithms are reported in supplementary table 2 in the supplementary materials.

5.2. Image recovery

In this subsection, we test the performance of the evaluated algorithms on image recovery tasks. For each of the above databases, we randomly selected 40% of the images and added white Gaussian noise with SNR = 20 to them. The corrupted pixels were also randomly chosen, and the percentage of corrupted pixels varied from 5% to 50%. We then divided each database into two sub-groups with equal numbers of samples, namely the training data set X_train and the testing data set X_test. Some corrupted images from each database are illustrated in Fig. 2.

We first visualize the reconstruction results obtained by each evaluated algorithm; here, the corruption percentage of each database is set to 15%. For each algorithm, we use the obtained low-rank projection matrix P to produce the reconstructed images, namely the reconstructions of X_train and X_test are computed as PX_train and PX_test respectively. Since the parameter values of each algorithm greatly influence its performance, we use the following strategy to find the optimal P for each algorithm: firstly, each algorithm is run on every clean training database with its parameter(s) varying in the given intervals (i.e., β ∈ {0.0001, 0.001, 0.01, 0.1, 1} and λ ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10, 20}) and the obtained projection matrices are recorded; secondly, we choose the 5 projection matrices with the lowest ranks obtained by each algorithm and select, among these 5, the one achieving the minimum reconstruction error on the training data set (i.e., ∥X_train − PX_train∥_F^2) as the optimal P_0; thirdly, for the corrupted databases, we select for each algorithm the optimal P_cp (cp denotes the corruption percentage) that attains the minimal approximate reconstruction error on the training data set, defined as ∥P_0 X_train − P_cp X_train∥_F^2. (In some databases, e.g., the extended Yale B and AR, some images are already corrupted by shadows or scarves, hence the projection matrix with the minimum reconstruction error on the training data set may not be the best one; we therefore use this compromise to choose an optimal P.) The visualization results obtained by each algorithm are illustrated in Fig. 3(a)–(e).
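For reproducibility, here is one plausible reading of the corruption protocol above, as a numpy sketch (ours): a random 40% of the images receive white Gaussian noise at SNR = 20 dB on a randomly chosen fraction of their pixels, and the data are then split into equally sized training and testing halves. The per-image SNR scaling is our assumption.

```python
import numpy as np

def corrupt_images(X, frac_images=0.4, frac_pixels=0.15, snr_db=20.0, seed=0):
    """X holds one vectorized image per column; noise is added to a random
    subset of pixels of a random subset of images at the given SNR."""
    rng = np.random.default_rng(seed)
    Xc = X.astype(float).copy()
    m, n = X.shape
    for j in rng.choice(n, size=int(frac_images * n), replace=False):
        idx = rng.choice(m, size=int(frac_pixels * m), replace=False)
        signal_power = np.mean(Xc[idx, j] ** 2)
        noise_std = np.sqrt(signal_power / 10.0 ** (snr_db / 10.0))
        Xc[idx, j] += rng.normal(0.0, noise_std, size=idx.size)
    return Xc

def split_half(X, seed=0):
    """Random split into equally sized training and testing column sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    half = X.shape[1] // 2
    return X[:, perm[:half]], X[:, perm[half:]]
```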
Fig. 5. Classification accuracies obtained by the evaluated algorithms versus the percentage of corruption on the five databases with different training samples. The values of parameters corresponding to the evaluated algorithms were reported in the supplementary table 3 in the supplementary materials.
In each sub-figure, we use red rectangles to indicate the differences between the results obtained by LGIRPCA and those of the other evaluated algorithms. By observing these figures, we reach several interesting conclusions: (1) based on the results on the ORL, MNIST and COIL-20 databases, LGIRPCA is capable of recovering the corrupted images while some algorithms (e.g., IRPCA, PCA-L1 and JSPCA) fail; (2) from sub-figures (b) and (c), we can see that although RSR and ILSPFC can also remove some corruptions from the images, the recovered images obtained by these two algorithms lose some discriminative features, so the recovered images of different persons look similar to each other, whereas the reconstructed images obtained by LGIRPCA maintain the discriminative features of the original images. Besides the visual comparison, for a precise comparison we use the reconstruction error [17] to quantify the experimental result obtained by each algorithm, namely:
R_train = ∥P_0 X_train − P_cp X_train∥_F^2 / ∥P_0 X_train∥_F^2,
R_test = ∥P_0 X_test − P_cp X_test∥_F^2 / ∥P_0 X_test∥_F^2,  (29)
where R_train and R_test denote the reconstruction errors on the training data set X_train and the testing data set X_test respectively. R_train and R_test for each evaluated algorithm, as the percentage of corruption varies on the five databases, are recorded in the left and right sub-figures of Fig. 4(a)–(e). Clearly, from Fig. 4 we can see that LGIRPCA always achieves the lowest reconstruction errors on both the training and testing data sets of the five databases.

5.3. Image classification

In this subsection, we test the performance of LGIRPCA and the comparative algorithms on image classification tasks. Firstly, for each of the above databases, we randomly chose q sample images of every subject to form a training data set (denoted q-train), and the remaining samples of every subject were used to form a testing data set. Secondly, for a training data set, the six evaluated algorithms were adopted to find the different low-rank projection matrices. Thirdly, using an obtained low-rank projection matrix, both the training samples and the testing samples were projected into a latent subspace. Fourthly, in the latent subspace, the nearest neighbor method [32] was used to classify the testing samples and the classification accuracy was recorded. The experiments were repeated 20 times, and the average classification accuracies of each evaluated algorithm versus the percentage of corruption on the five databases are plotted in Fig. 5(a)–(e).

From Fig. 5, we can see that LGIRPCA achieved the highest classification accuracies in almost all the experiments; especially on the MNIST and COIL-20 databases, LGIRPCA obtains much better results than the other evaluated algorithms. For a clearer comparison, we also record in Table 1 the average classification accuracies obtained by the evaluated algorithms on each database when the corruption percentage is 0. From Table 1, we can clearly see that LGIRPCA is superior to the related algorithms.

Fig. 6. Classification accuracies obtained by LGIRPCA vary with the parameters.
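The two evaluation measures used above are easy to state in code. The sketch below (ours) computes the reconstruction errors of Eq. (29) and the 1-nearest-neighbor accuracy after projecting training and testing samples with a learned projection matrix; samples are stored as columns.

```python
import numpy as np

def reconstruction_errors(P0, Pcp, X_train, X_test):
    """R_train and R_test as defined in Eq. (29)."""
    r_train = np.linalg.norm(P0 @ X_train - Pcp @ X_train, 'fro') ** 2 \
              / np.linalg.norm(P0 @ X_train, 'fro') ** 2
    r_test = np.linalg.norm(P0 @ X_test - Pcp @ X_test, 'fro') ** 2 \
             / np.linalg.norm(P0 @ X_test, 'fro') ** 2
    return r_train, r_test

def nn_accuracy(P, X_train, y_train, X_test, y_test):
    """Project with P, then assign each test sample the label of its nearest
    (Euclidean) training neighbor, as in Section 5.3."""
    A, B = P @ X_train, P @ X_test
    d2 = ((B[:, None, :] - A[:, :, None]) ** 2).sum(axis=0)  # (n_train, n_test)
    pred = np.asarray(y_train)[np.argmin(d2, axis=0)]
    return float(np.mean(pred == np.asarray(y_test)))
```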
Table 1
The average classification accuracies (%) of different algorithms on the five image databases. The optimal values are emphasized in bold style.

Database             LGIRPCA   IRPCA     RSR       PCA-L1    JSPCA     ILSPFC
ORL       4-train    0.8333    0.7625    0.8111    0.8250    0.7833    0.8250
          5-train    0.8650    0.8100    0.8450    0.8200    0.8600    0.8550
          6-train    0.8653    0.8313    0.8479    0.8375    0.8563    0.8500
Yale B    25-train   0.7500    0.7423    0.7423    0.7462    0.7346    0.7500
          32-train   0.7808    0.7782    0.7731    0.7782    0.7577    0.7808
          38-train   0.7938    0.7844    0.7781    0.7844    0.7875    0.7875
AR        10-train   0.8500    0.8409    0.8182    0.8409    0.8455    0.8545
          13-train   0.8769    0.8563    0.8000    0.8344    0.8594    0.8625
          10-train   0.8813    0.8731    0.8462    0.8654    0.8769    0.8800
MNIST     20-train   0.8100    0.7800    0.7767    0.7767    0.7967    0.7733
          25-train   0.8320    0.8160    0.8160    0.8160    0.8280    0.8000
          30-train   0.8550    0.8500    0.8400    0.8375    0.8550    0.8450
COIL-20   4-train    0.9667    0.9500    0.9500    0.9667    0.9683    0.9000
          5-train    0.9800    0.9800    0.9600    0.9800    0.9800    0.8900
          6-train    0.9900    0.9500    0.9750    0.9900    0.9900    0.9250
5.4. Parameter sensitivity

Finally, we show how the performance of LGIRPCA varies with the two parameters β and λ. We recorded the classification accuracies obtained by LGIRPCA on the five uncorrupted databases and illustrate the experimental results in Fig. 6(a)–(e). From Fig. 6, it can be seen that the performance of LGIRPCA varies with β to some extent, but the degree of variation is acceptable. Hence, we conclude that LGIRPCA is not very sensitive to its parameters.
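The sensitivity study above amounts to a simple grid evaluation; a minimal sketch (ours) is given below, reusing the lgirpca and nn_accuracy sketch functions defined earlier and the grid values listed in Section 5.1.

```python
import numpy as np

betas = [0.0001, 0.001, 0.01, 0.1, 1]
lams = [0.0001, 0.001, 0.01, 0.1, 1, 10, 20]

def sensitivity_grid(X_train, y_train, X_test, y_test, k):
    """Classification accuracy of LGIRPCA over the (beta, lambda) grid."""
    acc = np.zeros((len(betas), len(lams)))
    for i, b in enumerate(betas):
        for j, l in enumerate(lams):
            P, _, _, _ = lgirpca(X_train, k, beta=b, lam=l)
            acc[i, j] = nn_accuracy(P, X_train, y_train, X_test, y_test)
    return acc
```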
6. Conclusion

In this paper, we proposed a latent graph-regularized inductive robust principal component analysis algorithm (LGIRPCA) to find the low-rank representations of high-dimensional data samples. When seeking the low-rank projection matrix for a given data set, LGIRPCA considers the feature manifold structure of the data set and designs a Laplacian regularizer to characterize the latent feature manifold. Different from existing methods, the Laplacian regularizer is constructed from the low-rank projection matrix itself, which allows it to recover the manifold structure of the original features more truthfully. Moreover, in LGIRPCA the low-rank projection matrix is approximated by the product of two matrix factors, and the Laplacian regularizer can be reformulated as a weighted sparse constraint on the low-rank projection matrix. We presented an optimization algorithm for solving the LGIRPCA problem and conducted extensive experiments to show the effectiveness of LGIRPCA from different aspects. Based on the experimental results, we conclude that LGIRPCA is an effective low-rank subspace learning method across different databases.
Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.knosys.2019.04.007.

References

[1] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2009, Miami, Florida, USA, 2009, pp. 2790–2797.
[2] E. Elhamifar, R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2765–2781.
[3] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: J. Frnkranz, T. Joachims (Eds.), Proc. 27th International Conference on Machine Learning, ICML-10, Haifa, Israel, 2010, pp. 663–670.
[4] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 171–184.
[5] F. De la Torre, M.J. Black, Robust principal component analysis for computer vision, in: Proc. 8th Int. Conf. Computer Vision, ICCV-01, Barcelona, Spain, 2001, pp. 362–369.
[6] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[7] I. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[8] L. Wei, X. Wang, J. Yin, A. Wu, Self-regularized fixed-rank representation for subspace segmentation, Inform. Sci. 412–413 (2017) 194–209.
[9] M.A. Turk, A.P. Pentland, Face recognition using eigenfaces, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991, pp. 586–591.
[10] J. Wright, Y. Peng, Y. Ma, A. Ganesh, S. Rao, Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization, in: NIPS, 2009.
[11] E.J. Candes, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, J. ACM 58 (3) (2011) 1–37.
[12] F. Nie, H. Huang, C. Ding, D. Luo, H. Wang, Robust principal component analysis with non-greedy l1-norm maximization, in: Proc. 22nd Int. Joint Conf. Artif. Intell., Barcelona, Spain, 2011, pp. 1434–1438.
[13] N. Kwak, Principal component analysis based on L1-norm maximization, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1672–1680.
[14] S. Yi, Z. Lai, Z. He, Y. Cheung, Y. Liu, Joint sparse principal component analysis, Pattern Recognit. 61 (2017) 524–536.
[15] G. Zhu, S. Yan, Y. Ma, Image tag refinement toward low-rank, content-tag prior and error sparsity, in: Proc. Int. Conf. Multimedia, 2010, pp. 461–470.
[16] Z. Zhang, X. Liang, A. Ganesh, Y. Ma, TILT: transform invariant low-rank textures, in: Proc. Asian Conf. Comput. Vis., 2010, pp. 314–328.
[17] Z. Zhang, F. Li, M. Zhao, L. Zhang, S. Yan, Joint low-rank and sparse principal feature coding for enhanced robust representation and visual classification, IEEE Trans. Image Process. 25 (6) (2016) 2429–2444.
[18] X. Huan, C. Caramanis, S. Sanghavi, Robust PCA via outlier pursuit, in: Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 2496–2504.
[19] P.P. Brahma, Y. She, S. Li, J. Li, D. Wu, Reinforced robust principal component pursuit, IEEE Trans. Neural Netw. Learn. Syst. (2017).
[20] B.K. Bao, G. Liu, C. Xu, S. Yan, Inductive robust principal component analysis, IEEE Trans. Image Process. 21 (8) (2012) 3794–3800.
[21] P. Zhu, W. Zuo, L. Zhang, Q. Hu, S.C.K. Shiu, Unsupervised feature selection by regularized self-representation, Pattern Recognit. 48 (2) (2015) 438–446.
[22] Q. Gu, J. Zhou, Co-clustering on manifolds, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 359–368.
[23] F. Shang, L.C. Jiao, F. Wang, Graph dual regularization non-negative matrix factorization for co-clustering, Pattern Recognit. 45 (6) (2012) 2237–2250.
[24] C. Ding, T. Li, W. Peng, H. Park, Orthogonal nonnegative matrix t-factorizations for clustering, 2006, pp. 126–135.
[25] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, Y. Ma, Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix, J. Mar. Biol. Assoc. Uk 56 (3) (2009) 707–722.
[26] Z. Huang, S. Li, L. Fang, H. Li, J.A. Benediktsson, Hyperspectral image denoising with group sparse and low-rank tensor decomposition, IEEE Access 6 (2018) 1380–1390.
[27] Z. Zhang, M. Zhao, F. Li, L. Zhang, S. Yan, Robust alternating low-rank representation by joint Lp- and L2,p-norm minimization, Neural Netw. 96 (2017) 55–70.
[28] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[29] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[30] H. Wang, Y. Yang, B. Liu, H. Fujita, A study of graph-based system for multi-view clustering, Knowl.-Based Syst. 163 (2019) 1009–1019.
[31] Y. Zhang, Y. Yang, T. Li, H. Fujita, A multitask multiview clustering algorithm in heterogeneous situations based on LLE and LE, Knowl.-Based Syst. 163 (2019) 776–786.
[32] R. Duda, P. Hart, D. Stork, Pattern Classification, second ed., Wiley-Interscience, 2000.
[33] Z. Lin, M. Chen, L. Wu, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, Tech. Rep. UILU-ENG-09-2215, UIUC, Champaign, IL, USA, 2009.
[34] R. Bartels, G. Stewart, Solution of the matrix equation AX + XB = C, Commun. ACM 19 (3) (1976) 275–276.
[35] H. Zou, T. Hastie, R. Tibshirani, Sparse principal component analysis, J. Comput. Graph. Statist. 15 (2) (2006) 265–286.
[36] F. Samaria, A. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994.
[37] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.
[38] A.M. Martinez, R. Benavente, The AR Face Database, CVC Technical Report, Univ. Autonoma de Barcelona, Barcelona, Spain, 1998.
[39] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, 1996.