Pattern Recognition 60 (2016) 813–823
Contents lists available at ScienceDirect
Pattern Recognition journal homepage: www.elsevier.com/locate/pr
Flexible constrained sparsity preserving embedding
L. Weng a,b, F. Dornaika b,c,*, Z. Jin a
a School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
b University of the Basque Country UPV/EHU, San Sebastian, Spain
c IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
Article info

Article history: Received 29 February 2016; Received in revised form 15 June 2016; Accepted 28 June 2016; Available online 30 June 2016

Keywords: Constrained embedding; Sparsity preserving projections; Flexible manifold embedding; Semi-supervised learning; Out-of-sample problem

Abstract

In this paper, two semi-supervised embedding methods are proposed, namely Constrained Sparsity Preserving Embedding (CSPE) and Flexible Constrained Sparsity Preserving Embedding (FCSPE). CSPE is a semi-supervised embedding method which can be considered as a semi-supervised extension of Sparsity Preserving Projections (SPP) integrated with the idea of in-class constraints. Both the labeled and unlabeled data can be utilized within the CSPE framework. However, CSPE does not have an out-of-sample extension since the projection of unseen samples cannot be obtained directly. In order to obtain inductive semi-supervised learning, i.e. the ability to handle unseen samples, we propose FCSPE, which simultaneously provides a non-linear embedding and an approximate linear projection in one regression function. FCSPE simultaneously achieves the following: (i) the local sparse structure is preserved, (ii) the data samples with the same label are mapped onto one point in the projection space, and (iii) a linear projection that is the closest one to the non-linear embedding is estimated. Experimental results on eight public image data sets demonstrate the effectiveness of the proposed methods as well as their superiority to many competitive semi-supervised embedding techniques. © 2016 Elsevier Ltd. All rights reserved.
* Corresponding author at: University of the Basque Country UPV/EHU, Manuel Lardizabal 1, 20018 San Sebastian, Spain.
http://dx.doi.org/10.1016/j.patcog.2016.06.027
0031-3203/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

In many real-world applications, such as face recognition and text categorization, the data are usually provided in a high-dimensional space. In many real-world problems, collecting a large number of labeled samples is practically impossible. The reasons are twofold. Firstly, the available labeled samples can be very few. Secondly, acquiring labels requires expensive human labor. To deal with this problem, semi-supervised embedding methods can be used to project the data from the high-dimensional space into a space with fewer dimensions.

Many methods for dimension reduction have been proposed. Principal Component Analysis [1] (PCA) and Multidimensional Scaling [2] (MDS) are two classic linear unsupervised embedding methods. Linear Discriminant Analysis [1] (LDA) is a supervised method. In 2000, Locally Linear Embedding [3] (LLE) and Isometric Feature Mapping (ISOMAP) [4] were published separately in Science and laid a foundation for manifold learning. Soon afterward, Belkin et al. proposed Laplacian Eigenmaps [5] (LE). He et al. proposed both Locality Preserving Projection [6] (LPP), essentially a linearized version of LE, and Neighborhood Preserving Embedding [7] (NPE), a linearized version of LLE. LPP and NPE can be interpreted in a general graph embedding framework with different choices of graph structure. Most of these methods are unsupervised.

Afterwards, methods based on sparse representation [8–10] attracted extensive attention. Lai et al. proposed a 2-D feature extraction method called sparse 2-D projections for image feature extraction [11]. In [12], a robust tensor learning method called sparse tensor alignment (STA) is proposed for unsupervised tensor feature extraction based on the alignment framework. In [13], multilinear sparse principal component analysis (MSPCA) inherits the sparsity from sparse PCA and iteratively learns a series of sparse projections that capture most of the variation of the tensor data. Sparsity Preserving Projection (SPP) is an unsupervised learning method [10]. It can be considered as an extension of NPE since the latter has a similar objective function. However, SPP utilizes sparse representation over the whole data to obtain the affinity matrix.

In the last decade, semi-supervised learning algorithms have been developed to effectively utilize a large amount of unlabeled samples as well as the limited number of labeled samples for real-world applications [14–22]. In the past years, many graph-based methods for semi-supervised learning have been developed [23–35]. Constrained Laplacian Eigenmaps [36] (CLE) is a semi-supervised embedding method. CLE constrains the solution space of Laplacian Eigenmaps only to contain embedding results that are
consistent with the labels. Labeled points belonging to the same class are merged together, labeled points belonging to different classes are separated, and similar points are close to one another. Similarly, Constrained Graph Embedding [37] (CGE) tries to project the data points from the same class onto one single point in the projection space with a constraint matrix. Flexible Manifold Embedding [38] (FME) is a label propagation method. FME simultaneously estimates the non-linear embedding of unlabeled samples and the linear regression over these non-linear representations. In [39], the authors propose a whole learning process that can provide the data graph and a linear regression within the same framework.

SPP is a successful unsupervised learning method. To extend SPP to a semi-supervised embedding method, we introduce the idea of in-class constraints from CGE into SPP and propose a new semi-supervised method for data embedding named Constrained Sparsity Preserving Embedding (CSPE). The weakness of CSPE is that it cannot handle newly arriving samples, which means that a cascade regression must be performed after the non-linear mapping is obtained by CSPE over the whole training set. Inspired by FME, we add a regression term to the objective function to obtain an approximate linear projection at the same time as the non-linear embedding is estimated, and propose Flexible Constrained Sparsity Preserving Embedding (FCSPE). So in this paper, two semi-supervised embedding methods, namely CSPE and FCSPE, are proposed. Compared to the existing works, the proposed CSPE retains the advantages of both CGE and SPP. On the other hand, the proposed FCSPE simultaneously estimates the non-linear mapping over the training samples and the linear projection for solving the out-of-sample problem, which is usually not provided by existing graph-based semi-supervised non-linear mapping methods.

This paper is organized as follows. Section 2 reviews the related methods including LPP, SPP, CGE and FME. Section 3 introduces the two proposed semi-supervised methods. Section 4 presents performance evaluations on six face image databases: Yale, ORL, FERET, PIE, Extended Yale B and LFW (the original version and the aligned version), one handwriting image database USPS and an object image database COIL-20. Section 5 presents some concluding remarks.
2. Related work

Some mathematical notations are listed here and will be used in the next several sections. Let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ be the data matrix, where $n$ is the number of training samples and $m$ is the dimension of each sample. Let $y = [y_1, y_2, \ldots, y_n]^T$ be a one-dimensional map of $X$. Under a linear projection $y^T = p^T X$, each data point $x_i$ in the input space $\mathbb{R}^m$ is mapped into $y_i = p^T x_i$ on the real line. Here $p \in \mathbb{R}^m$ is a projection axis. Let $Y \in \mathbb{R}^{d \times n}$ be the data projections in a $d$-dimensional space.

2.1. Locality preserving projection

Locality Preserving Projection [6] (LPP) is a classic unsupervised embedding method which aims to preserve the local structure of the data by keeping two sample points close in the projection space when they are similar in the original space. The criterion of LPP is to optimize the following objective function under some constraints:

$$\min \sum_{i,j} (y_i - y_j)^2 W_{ij}, \qquad (1)$$

where $W$ is the affinity matrix associated with the data and $W_{ij}$ represents the similarity between sample $x_i$ and sample $x_j$. Estimating the graph affinity $W$ from data can be carried out by many graph construction methods [40]. The simplest method is based on the KNN graph, which is defined as follows:

$$W_{ij} = \begin{cases} 1, & x_i \in \delta_k(x_j) \ \text{or} \ x_j \in \delta_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$

where $\delta_k(x_i)$ denotes the set of the $k$ nearest neighbors of $x_i$. After some simple algebraic formulations, we obtain:

$$\sum_{i,j} (y_i - y_j)^2 W_{ij} = 2\, p^T X L X^T p, \qquad (3)$$

where $L = D - W$ is the Laplacian matrix and $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. With the constraint $p^T X D X^T p = 1$, the problem becomes:

$$\min_p \frac{p^T X L X^T p}{p^T X D X^T p}. \qquad (4)$$

The optimal $p$ is given by solving the minimum eigenvalue problem:

$$X L X^T p = \lambda\, X D X^T p. \qquad (5)$$

The eigenvectors $p_1, \ldots, p_d$ corresponding to the $d$ smallest eigenvalues are then used as the columns of the projection matrix $P$, i.e. $P = [p_1, \ldots, p_d]$. The projected samples are obtained by $Y = P^T X$.

2.2. Sparsity preserving projection

Whereas LPP tries to preserve the neighborhood structure, Sparsity Preserving Projection [9,10] (SPP) aims to keep the structure over the whole data set by using sparse representation, instead of the linear representation of the $k$ nearest neighbors, to get the weight matrix. For $x_i$, the representative coefficients of the remaining samples are obtained by solving an $\ell_1$ problem:

$$\min_{s_i} \| s_i \|_1, \quad \text{s.t.} \quad x_i = X s_i, \qquad (6)$$

where $s_i = [s_{i1}, \ldots, s_{i(i-1)}, 0, s_{i(i+1)}, \ldots, s_{in}]^T$. The problem of SPP is:
$$\min_p \sum_i \Big( y_i - \sum_j s_{ij} y_j \Big)^2 = \min_p \sum_i \left( p^T x_i - p^T X s_i \right)^2. \qquad (7)$$
With the constraint $p^T X X^T p = 1$, Eq. (7) becomes:

$$\min_p \frac{p^T X (I - S - S^T + S^T S) X^T p}{p^T X X^T p} = \max_p \frac{p^T X \tilde{S} X^T p}{p^T X X^T p}, \qquad (8)$$

where $\tilde{S} = S + S^T - S^T S$, and $S = [s_1, \ldots, s_n]^T$. The corresponding eigenvalue problem is:

$$X \tilde{S} X^T p = \lambda\, X X^T p. \qquad (9)$$
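The SPP pipeline of Eqs. (6)–(9) can be illustrated with a small NumPy sketch. It is not the authors' implementation. Two choices in it are assumptions made only for illustration: the equality-constrained $\ell_1$ problem of Eq. (6) is replaced by a Lasso relaxation solved with a few proximal-gradient (ISTA) iterations, and a tiny ridge term is added to $X X^T$ so that the generalized eigenproblem of Eq. (9) can be whitened with a Cholesky factor. The names `sparse_weights`, `spp_projection`, `lam`, `n_iter` and `ridge` are hypothetical.

```python
import numpy as np

def sparse_weights(X, lam=0.05, n_iter=300):
    """Approximate the sparse codes of Eq. (6), one row of S per sample.

    Assumption: the constraint x_i = X s_i is relaxed to the Lasso problem
    0.5 * ||x_i - X s||^2 + lam * ||s||_1, solved by ISTA with s[i] fixed to 0.
    X is m x n with one sample per column.
    """
    n = X.shape[1]
    G = X.T @ X                                   # n x n Gram matrix
    step = 1.0 / max(np.linalg.eigvalsh(G)[-1], 1e-12)   # 1 / Lipschitz constant
    S = np.zeros((n, n))
    for i in range(n):
        s = np.zeros(n)
        for _ in range(n_iter):
            grad = G @ s - G[:, i]                # gradient of the quadratic term
            s = s - step * grad
            s = np.sign(s) * np.maximum(np.abs(s) - step * lam, 0.0)  # soft threshold
            s[i] = 0.0                            # a sample may not represent itself
        S[i] = s
    return S

def spp_projection(X, S, d, ridge=1e-8):
    """Solve X S~ X^T p = lambda X X^T p (Eq. (9)) for the d largest eigenvalues."""
    S_tilde = S + S.T - S.T @ S
    M = X @ S_tilde @ X.T
    C = X @ X.T + ridge * np.eye(X.shape[0])      # ridge is an added safeguard
    R = np.linalg.cholesky(C)                     # C = R R^T
    R_inv = np.linalg.inv(R)
    vals, vecs = np.linalg.eigh(R_inv @ M @ R_inv.T)
    top = np.argsort(vals)[::-1][:d]
    return R_inv.T @ vecs[:, top]                 # columns satisfy p^T C p = 1

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))                  # toy data: m = 5, n = 20
S = sparse_weights(X)
P = spp_projection(X, S, d=2)
Y = P.T @ X                                       # 2 x 20 embedding
```

The whitening step turns the generalized problem into an ordinary symmetric one, which is why the returned columns automatically satisfy the normalization used in Eq. (8).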
The eigenvectors $p_1, \ldots, p_d$ corresponding to the $d$ largest eigenvalues are the columns of the sought linear transform, i.e., $P = [p_1, \ldots, p_d]$, and $Y = P^T X$.

2.3. Constrained Graph Embedding

Constrained Graph Embedding [37] (CGE) is a semi-supervised non-linear embedding method which uses the label information as additional constraints, mapping the samples with the same label to one point in the projection space. We assume that the first $l$ samples are labeled and come from $c$ classes. In the projection space, a constraint matrix $U$ is used to keep the samples with a same label
in one point. The definition of $U$ is as follows:

$$U = \begin{pmatrix} J & 0 \\ 0 & I_{n-l} \end{pmatrix}, \qquad (10)$$

where $U \in \mathbb{R}^{n \times (c + (n - l))}$ and the $i$-th row of $J$ is an indicator vector of $x_i$:

$$J_{ij} = \begin{cases} 1, & \text{if } x_i \text{ is labeled from class } j, \\ 0, & \text{otherwise}, \end{cases} \qquad (11)$$

where $j = 1, \ldots, c$. An auxiliary vector $z$ is adopted to implement the constraint ($y^*$ is the one-dimensional map of the data matrix $X$):

$$y^* = U z. \qquad (12)$$

With the above constraint, it is clear that if $x_i$ and $x_j$ share the same label, then $y_i = y_j$. With simple algebraic formulation, we have:

$$\sum_{i,j} (y_i - y_j)^2 W_{ij} = 2\, y^{*T} L\, y^* = 2\, z^T U^T L U z, \qquad (13)$$

and

$$y^{*T} D\, y^* = z^T U^T D U z, \qquad (14)$$

where the affinity matrix $W$ can be given by the KNN graph as defined in Eq. (2), $L = D - W$ and $D$ is diagonal with $D_{ii} = \sum_j W_{ij}$. The problem of CGE is as follows:

$$\min_z z^T U^T L U z, \quad \text{s.t.} \quad z^T U^T D U z = 1. \qquad (15)$$

The optimal vector $z$ is given by the minimum eigenvalue solution to the generalized eigenvalue problem:

$$U^T L U z = \lambda\, U^T D U z. \qquad (16)$$

The eigenvectors $z_1, \ldots, z_d$ corresponding to the $d$ smallest eigenvalues yield the auxiliary matrix $Z = [z_1, \ldots, z_d]$. We have $Y^* = (UZ)^T$. $Y^*$ is a $d \times n$ matrix and it represents the data projection of $X$.

2.4. Flexible Manifold Embedding

Flexible Manifold Embedding [38] (FME) is a label propagation method. FME can simultaneously estimate a prediction matrix over the whole input samples and an approximate linear projection from the original samples to their prediction vectors by a regression term. The FME algorithm solves the following problem:

$$\min_{F, P, b} \mathrm{Tr}\left( (F - T)^T \Lambda (F - T) \right) + \mathrm{Tr}\left( F^T L F \right) + \mu \left( \| P \|^2 + \gamma \| X^T P + 1 b^T - F \|^2 \right), \qquad (17)$$

where $\Lambda$ is a diagonal matrix whose first $l$ diagonal elements are 1 and whose remaining $u$ diagonal elements are 0. $L$ is a graph Laplacian matrix of some similarity matrix. $P$ is the unknown projection matrix and $b$ is a bias vector. Assuming $c$ is the number of classes, $T$ in Eq. (17) is an $n \times c$ matrix: $T = [t_1, t_2, \ldots, t_n]^T \in \mathbb{R}^{n \times c}$, where $t_{ij} = 1$ if $x_i$ belongs to class $j$, and 0 otherwise. The task is to estimate the label indicator matrix $F = [f_1, f_2, \ldots, f_n]^T \in \mathbb{R}^{n \times c}$ so that the labels of unlabeled samples can be inferred. The solution to Eq. (17) is as follows:

$$F = \left( \Lambda + L + \mu\gamma H_c - \mu\gamma^2 N \right)^{-1} \Lambda T, \qquad (18)$$

where $N = X_c^T (\gamma X_c X_c^T + I)^{-1} X_c$ and $X_c = X H_c$ with $H_c = I - \frac{1}{n} 1 1^T$, $I$ being the identity matrix of size $n$.

3. Proposed methods

In this section, we propose a semi-supervised learning method named Constrained Sparsity Preserving Embedding (CSPE). Afterwards, we introduce another flexible semi-supervised embedding method named Flexible Constrained Sparsity Preserving Embedding (FCSPE). The framework of CSPE does not provide a straightforward solution to the out-of-sample problem. Indeed, the regression is carried out as an extra step. With the flexible method, FCSPE, both the non-linear mapping and the regression are simultaneously estimated within one single framework. This leads to more flexibility in the final solution, as shown by the experimental results.

3.1. Constrained Sparsity Preserving Embedding (CSPE)

In this section, we introduce a semi-supervised embedding method named Constrained Sparsity Preserving Embedding (CSPE). In CSPE, the construction of the affinity matrix is parameter free and the sparse structure is preserved in the projection space. In addition, the idea of in-class constraints from CGE is also integrated in the algorithm, which merges the sample points with the same label together in the projection space.

The affinity matrix is obtained by sparse representation as in Eq. (6). In this way, the affinity matrix can be obtained without setting any parameters. To keep the sparse representation in the projection space as explained in SPP, every $y_i$ should have a similar combination of the remaining samples in the projection space as $x_i$ has in the original space. So the problem of CSPE is the same as in SPP:

$$\min_y \sum_i (y_i - y^T s_i)^2, \qquad (19)$$

where $s_i$ is the same as in Eq. (6). The definition of the constraint matrix $U$ is the same as in Eq. (10) to constrain the projected sample points, i.e. let $y^* = U z$. We note that $y_i = y^{*T} e_i$, where the $i$-th entry of the vector $e_i$ is 1 and the other entries are 0. Then, Eq. (19) becomes:

$$\min_z \sum_i (z^T U^T e_i - z^T U^T s_i)^2. \qquad (20)$$

With simple algebraic manipulations, the objective function can be rewritten as:

$$\sum_i (z^T U^T e_i - z^T U^T s_i)^2 = z^T U^T (I - S - S^T + S^T S) U z = z^T U^T U z - z^T U^T \tilde{S} U z, \qquad (21)$$
where $\tilde{S} = S + S^T - S^T S$. With the constraint $z^T U^T U z = 1$, the objective function in Eq. (21) can be recast into:

$$\max_z \frac{z^T U^T \tilde{S} U z}{z^T U^T U z}. \qquad (22)$$

The optimal vector $z$ is given by the maximum eigenvalue solution to the following generalized eigenvalue problem:

$$U^T \tilde{S} U z = \lambda\, U^T U z. \qquad (23)$$

The eigenvectors $z_1, \ldots, z_d$ corresponding to the $d$ largest eigenvalues yield the auxiliary matrix $Z = [z_1, \ldots, z_d]$. The data projection of $X$ in the $d$-dimensional space is given by:

$$Y^* = (U Z)^T. \qquad (24)$$
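As a rough illustration of Eqs. (10) and (22)–(24), the NumPy sketch below builds the constraint matrix $U$ and solves the CSPE eigenproblem. It is a minimal sketch under stated assumptions, not the authors' code: labeled samples come first, every class has at least one labeled sample (so that $U^T U$ is an invertible diagonal matrix), and the matrix `S` in the toy run is a random placeholder standing in for the sparse codes of Eq. (6). The function names and the `-1` convention for unlabeled samples are hypothetical.

```python
import numpy as np

def constraint_matrix(labels, n_classes):
    """Build U of Eq. (10).

    `labels` holds the class index (0..c-1) of the first l samples and -1 for
    the unlabeled ones; labeled samples are assumed to come first.
    """
    labels = np.asarray(labels)
    n = labels.size
    l = int(np.sum(labels >= 0))
    U = np.zeros((n, n_classes + n - l))
    U[np.arange(l), labels[:l]] = 1.0                         # block J (Eq. (11))
    U[np.arange(l, n), n_classes + np.arange(n - l)] = 1.0    # block I_{n-l}
    return U

def cspe_embedding(S, U, d):
    """Solve U^T S~ U z = lambda U^T U z (Eq. (23)); return Y* = (U Z)^T and Z."""
    S_tilde = S + S.T - S.T @ S
    g = np.sum(U * U, axis=0)          # diagonal of U^T U: class sizes, then ones
    w = 1.0 / np.sqrt(g)
    M = (U.T @ S_tilde @ U) * np.outer(w, w)   # symmetric whitened problem
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(vals)[::-1][:d]           # d largest eigenvalues
    Z = w[:, None] * vecs[:, top]              # satisfies Z^T U^T U Z = I
    return (U @ Z).T, Z

rng = np.random.default_rng(0)
n = 8
labels = [0, 0, 1, 1, -1, -1, -1, -1]          # 4 labeled samples, 4 unlabeled
S = rng.random((n, n)) * (rng.random((n, n)) < 0.4)   # placeholder sparse codes
np.fill_diagonal(S, 0.0)
U = constraint_matrix(labels, n_classes=2)
Y, Z = cspe_embedding(S, U, d=2)               # Y is 2 x 8
```

Because every labeled sample of a class shares one row pattern in $U$, the columns of `Y` belonging to the same class coincide exactly, which is the in-class constraint described above.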
The projection matrix of CSPE cannot be obtained directly since CSPE provides a non-linear mapping. The traditional way to deal with a new incoming sample is to rerun the whole algorithm, which is time-consuming. We assume a linear projection $y(x) = P^T x$ to approximate the
Fig. 1. Six image samples are shown for FERET, Extended Yale, LFW-a, COIL-20, and USPS datasets.
original non-linear mapping, where $P = [p_1, \ldots, p_d]$. One simple idea to calculate the matrix $P$ is to fit the following function, which minimizes the least-squares error on the existing samples:

$$P = \arg\min_P \left( \| P^T X - Y^* \|^2 + \gamma \| P \|^2 \right), \qquad (25)$$

where $\gamma$ is a positive balance parameter that controls the regularization. By setting the derivative of the right-hand side w.r.t. $P$ to zero, the optimal $P$ is obtained as:

$$P = (X X^T + \gamma I)^{-1} X Y^{*T}. \qquad (26)$$

For a new incoming sample $x_{test}$, its embedding $y_{test}$ is given by

$$y_{test} = P^T x_{test} = \left( (X X^T + \gamma I)^{-1} X U Z \right)^T x_{test}. \qquad (27)$$

Over the training samples, the approximate linear projection $P$ can also be used.

3.2. Flexible Constrained Sparsity Preserving Embedding (FCSPE)

CSPE is a semi-supervised embedding method that cannot obtain a linear projection within its criterion. Thus, in order to deal with unseen data samples, regression should be applied to estimate a linear projection as a cascade process after the non-linear embedding is obtained. To deal with the out-of-sample problem, we introduce Flexible Constrained Sparsity Preserving Embedding (FCSPE), a new flexible embedding method based on CSPE which simultaneously utilizes the core idea of CSPE and generates a projection matrix by optimizing one objective criterion. This criterion aims at getting both the non-linear representations and the linear regression based on minimizing the residual errors (choosing a linear subspace that is close to the non-linear one). This flexible optimization, which simultaneously computes the non-linear embedding of the samples and the regression over these non-linear representations, is usually better than a cascade estimation. Assuming $Y^T = X^T P + 1 b^T$, we try to minimize the following function:

$$\min_{Z, P, b} \mathrm{Tr}\left( Z^T U^T (I - S - S^T + S^T S) U Z \right) + \mu \left( \| P \|^2 + \gamma \| X^T P + 1 b^T - U Z \|^2 \right), \quad \text{s.t.} \quad Z^T U^T U Z = I. \qquad (28)$$
Define the cost function $Q(Z, P, b)$ as follows:

$$Q(Z, P, b) = \mathrm{Tr}\left( Z^T U^T (I - S - S^T + S^T S) U Z \right) + \mu \left( \| P \|^2 + \gamma \| X^T P + 1 b^T - U Z \|^2 \right). \qquad (29)$$

Setting $\partial Q(Z, P, b) / \partial b = 0$, we obtain

$$b = \frac{1}{n} \left( Z^T U^T 1 - P^T X 1 \right). \qquad (30)$$

Then we can have $Q(Z, P)$. Again setting $\partial Q(Z, P) / \partial P = 0$, we get

$$P = \gamma (\gamma X H_c X^T + I)^{-1} X H_c U Z = A U Z, \qquad (31)$$

where $H_c = I - \frac{1}{n} 1 1^T$ and $A = \gamma (\gamma X H_c X^T + I)^{-1} X H_c$. We have $X^T P + 1 b^T = H_c X^T A U Z + \frac{1}{n} 1 1^T U Z = B U Z$, where $B = H_c X^T A + \frac{1}{n} 1 1^T$. Hence,

$$Q(Z) = \mathrm{Tr}(Z^T E Z), \qquad (32)$$
Table 1. Average recognition rates (%) on Yale (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 66.5 ± 12.0 | – | 70.8 ± 12.1 | – | 70.4 ± 12.5 | –
RMGT | 69.9 ± 10.4 | – | 72.3 ± 11.1 | – | 71.1 ± 12.0 | –
LPP | 79.2 ± 7.4 | 82.5 ± 6.8 | 81.5 ± 5.9 | 86.9 ± 3.8 | 83.6 ± 6.4 | 87.9 ± 4.2
SPP | 80.8 ± 5.3 | 83.6 ± 8.1 | 86.0 ± 5.8 | 88.9 ± 4.6 | 86.7 ± 4.2 | 91.5 ± 5.4
SPDA | 76.5 ± 7.1 | 80.7 ± 8.1 | 83.0 ± 7.7 | 88.6 ± 3.1 | 84.4 ± 7.8 | 91.4 ± 2.9
SDE | 63.6 ± 12.0 | 72.2 ± 13.4 | 71.5 ± 8.8 | 80.4 ± 6.1 | 73.4 ± 9.8 | 86.1 ± 5.5
FME | 74.1 ± 8.8 | 75.6 ± 8.3 | 78.8 ± 9.4 | 82.6 ± 4.8 | 79.8 ± 9.9 | 85.4 ± 4.4
CGE | 77.1 ± 9.0 | 73.8 ± 6.9 | 82.5 ± 7.8 | 82.1 ± 6.9 | 82.9 ± 12.8 | 86.3 ± 7.1
CSPE | 83.1 ± 4.8 | 83.5 ± 6.7 | 90.5 ± 5.0 | 89.4 ± 3.0 | 94.9 ± 3.6 | 93.5 ± 5.2
FCSPE | 83.6 ± 4.3 | 82.9 ± 7.6 | 90.3 ± 5.3 | 89.6 ± 3.4 | 95.1 ± 3.3 | 92.4 ± 5.2
Table 2. Average recognition rates (%) on ORL (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 59.1 ± 6.1 | – | 68.7 ± 5.1 | – | 75.4 ± 4.8 | –
RMGT | 60.1 ± 5.9 | – | 69.9 ± 4.9 | – | 76.9 ± 4.4 | –
LPP | 65.5 ± 4.0 | 61.1 ± 4.1 | 73.1 ± 5.0 | 69.1 ± 4.5 | 77.4 ± 4.3 | 75.3 ± 3.0
SPP | 67.9 ± 2.8 | 63.6 ± 4.8 | 79.0 ± 4.4 | 76.4 ± 5.5 | 87.6 ± 3.1 | 84.0 ± 2.1
SPDA | 56.8 ± 4.0 | 56.4 ± 4.9 | 72.3 ± 2.8 | 74.4 ± 5.2 | 83.5 ± 3.3 | 84.8 ± 1.7
SDE | 38.9 ± 2.4 | 44.9 ± 6.1 | 50.1 ± 5.2 | 60.4 ± 4.1 | 62.0 ± 5.4 | 72.6 ± 3.4
FME | 58.6 ± 5.2 | 56.0 ± 5.0 | 76.0 ± 4.3 | 75.3 ± 4.8 | 84.8 ± 3.7 | 84.6 ± 1.9
CGE | 61.3 ± 6.5 | 56.4 ± 3.6 | 74.7 ± 7.0 | 70.8 ± 5.3 | 83.5 ± 5.5 | 81.8 ± 5.4
CSPE | 66.9 ± 3.7 | 62.0 ± 4.3 | 79.8 ± 5.7 | 78.9 ± 3.7 | 88.4 ± 2.8 | 88.5 ± 1.8
FCSPE | 68.5 ± 3.1 | 64.5 ± 4.1 | 82.8 ± 3.0 | 80.0 ± 3.7 | 88.8 ± 2.4 | 88.8 ± 2.0
Table 3. Average recognition rates (%) on FERET (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 17.8 ± 10.8 | – | 24.5 ± 15.3 | – | 31.1 ± 22.4 | –
RMGT | 18.6 ± 10.6 | – | 24.9 ± 15.5 | – | 32.0 ± 23.3 | –
LPP | 17.3 ± 11.7 | 18.3 ± 8.8 | 21.3 ± 14.4 | 26.0 ± 9.0 | 27.2 ± 22.6 | 31.5 ± 6.5
SPP | 37.7 ± 16.9 | 29.0 ± 15.0 | 50.4 ± 20.0 | 42.0 ± 13.9 | 57.3 ± 30.5 | 53.4 ± 12.2
SPDA | 24.3 ± 15.6 | 23.6 ± 12.9 | 45.5 ± 21.0 | 45.9 ± 16.8 | 53.3 ± 33.0 | 65.7 ± 17.0
SDE | 20.6 ± 13.2 | 20.0 ± 11.8 | 30.8 ± 15.5 | 33.6 ± 13.4 | 39.2 ± 27.3 | 47.6 ± 12.3
FME | 25.3 ± 17.5 | 24.5 ± 12.3 | 44.2 ± 22.9 | 46.6 ± 13.6 | 57.3 ± 34.9 | 64.0 ± 13.3
CGE | 24.7 ± 11.5 | 24.3 ± 12.5 | 40.1 ± 14.4 | 40.3 ± 15.1 | 55.1 ± 26.5 | 57.6 ± 13.2
CSPE | 39.6 ± 15.6 | 34.4 ± 15.5 | 58.3 ± 20.2 | 55.7 ± 17.1 | 67.5 ± 31.2 | 69.0 ± 13.7
FCSPE | 40.8 ± 13.0 | 37.3 ± 13.2 | 60.7 ± 18.3 | 62.2 ± 15.9 | 70.4 ± 28.2 | 73.3 ± 15.2
Table 4. Average recognition rates (%) on PIE (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 9.8 ± 5.8 | – | 16.9 ± 6.8 | – | 22.2 ± 5.9 | –
RMGT | 10.3 ± 5.7 | – | 17.3 ± 6.7 | – | 22.5 ± 5.9 | –
LPP | 11.0 ± 5.2 | 11.3 ± 2.7 | 16.4 ± 7.4 | 17.0 ± 4.0 | 21.9 ± 6.8 | 18.5 ± 4.5
SPP | 20.7 ± 6.8 | 25.0 ± 5.6 | 32.0 ± 7.2 | 36.0 ± 5.1 | 38.7 ± 6.6 | 41.9 ± 7.0
SPDA | 14.3 ± 6.0 | 17.9 ± 3.7 | 29.4 ± 10.7 | 35.1 ± 9.7 | 41.8 ± 7.0 | 46.8 ± 9.6
SDE | 15.2 ± 5.5 | 19.8 ± 5.0 | 24.5 ± 6.6 | 29.3 ± 5.6 | 31.8 ± 5.7 | 35.3 ± 6.8
FME | 9.6 ± 4.7 | 10.5 ± 2.7 | 20.5 ± 9.4 | 23.2 ± 6.9 | 29.6 ± 7.6 | 29.5 ± 7.6
CGE | 18.7 ± 6.3 | 22.5 ± 5.5 | 29.7 ± 6.4 | 34.0 ± 5.8 | 37.1 ± 5.2 | 39.9 ± 7.4
CSPE | 27.6 ± 9.4 | 31.0 ± 7.5 | 36.9 ± 8.6 | 41.1 ± 7.6 | 46.7 ± 6.1 | 50.4 ± 8.3
FCSPE | 27.7 ± 9.3 | 31.0 ± 4.3 | 39.3 ± 7.9 | 43.0 ± 6.4 | 48.2 ± 7.7 | 51.9 ± 8.1
Table 5. Average recognition rates (%) on Extended Yale B (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 25.0 ± 19.8 | – | 41.7 ± 16.8 | – | 50.1 ± 8.8 | –
RMGT | 30.0 ± 19.2 | – | 44.2 ± 16.3 | – | 53.4 ± 6.4 | –
LPP | 25.8 ± 17.2 | 24.1 ± 13.1 | 40.2 ± 15.6 | 38.6 ± 14.3 | 47.6 ± 8.7 | 43.8 ± 8.9
SPP | 36.8 ± 17.4 | 34.7 ± 14.3 | 52.4 ± 18.4 | 51.6 ± 19.2 | 63.4 ± 8.2 | 60.0 ± 10.1
SPDA | 36.5 ± 17.3 | 34.7 ± 13.9 | 52.8 ± 18.6 | 50.9 ± 18.3 | 61.8 ± 12.0 | 57.7 ± 11.9
SDE | 33.6 ± 15.2 | 32.4 ± 13.3 | 50.9 ± 18.6 | 50.4 ± 18.4 | 61.9 ± 9.2 | 59.1 ± 9.9
FME | 28.8 ± 20.2 | 27.1 ± 16.5 | 47.7 ± 20.4 | 46.3 ± 19.8 | 58.3 ± 12.0 | 54.5 ± 11.7
CGE | 36.6 ± 14.9 | 34.9 ± 12.7 | 51.8 ± 16.3 | 49.7 ± 15.9 | 62.5 ± 9.6 | 58.4 ± 10.1
CSPE | 41.6 ± 17.0 | 39.1 ± 15.1 | 57.4 ± 18.1 | 56.0 ± 18.6 | 68.4 ± 8.4 | 64.5 ± 9.9
FCSPE | 41.6 ± 16.7 | 39.3 ± 15.2 | 57.5 ± 18.1 | 55.8 ± 18.7 | 68.8 ± 8.8 | 64.7 ± 9.9
Table 6. Average recognition rates (%) on USPS (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 33.4 ± 8.6 | – | 46.8 ± 6.9 | – | 56.2 ± 5.7 | –
RMGT | 48.5 ± 5.5 | – | 55.6 ± 6.6 | – | 61.8 ± 5.3 | –
LPP | 43.4 ± 7.0 | 43.3 ± 5.9 | 54.3 ± 5.9 | 52.8 ± 5.6 | 61.6 ± 4.5 | 59.8 ± 3.7
SPP | 40.8 ± 5.0 | 40.7 ± 4.5 | 52.3 ± 4.4 | 50.6 ± 4.4 | 59.3 ± 4.3 | 57.3 ± 3.6
SPDA | 33.0 ± 5.4 | 33.1 ± 6.0 | 43.5 ± 3.7 | 42.8 ± 4.5 | 51.3 ± 3.1 | 51.1 ± 3.3
SDE | 29.0 ± 6.7 | 29.3 ± 5.8 | 35.6 ± 5.6 | 35.7 ± 4.5 | 40.6 ± 3.6 | 40.3 ± 4.1
FME | 31.4 ± 6.0 | 31.3 ± 6.5 | 45.3 ± 4.6 | 44.4 ± 4.9 | 53.9 ± 4.1 | 52.5 ± 3.0
CGE | 37.9 ± 9.0 | 38.3 ± 4.3 | 49.3 ± 5.5 | 49.7 ± 4.4 | 57.6 ± 4.9 | 56.1 ± 3.3
CSPE | 43.0 ± 6.6 | 42.6 ± 6.6 | 56.9 ± 5.7 | 55.1 ± 5.0 | 64.4 ± 3.1 | 62.4 ± 3.7
FCSPE | 42.9 ± 7.2 | 41.5 ± 6.7 | 57.8 ± 5.4 | 55.0 ± 5.1 | 65.5 ± 3.8 | 62.2 ± 3.5
Table 7. Average recognition rates (%) on COIL-20 (mean ± standard deviation) with 1, 2 and 3 labeled samples per class.

Method | 1: Unlabeled | 1: Test | 2: Unlabeled | 2: Test | 3: Unlabeled | 3: Test
GFHF | 58.4 ± 4.8 | – | 64.4 ± 4.6 | – | 68.3 ± 4.4 | –
RMGT | 60.7 ± 4.6 | – | 66.1 ± 4.2 | – | 72.3 ± 4.5 | –
LPP | 61.9 ± 4.7 | 56.7 ± 3.8 | 68.6 ± 3.8 | 63.4 ± 4.9 | 72.7 ± 5.0 | 67.1 ± 3.7
SPP | 64.8 ± 6.8 | 60.2 ± 4.4 | 72.2 ± 3.8 | 66.5 ± 5.1 | 76.1 ± 6.3 | 71.2 ± 5.2
SPDA | 56.4 ± 4.7 | 53.7 ± 4.4 | 64.1 ± 3.3 | 63.6 ± 4.9 | 68.8 ± 3.9 | 69.2 ± 5.0
SDE | 42.2 ± 4.5 | 41.8 ± 3.0 | 46.6 ± 3.4 | 50.6 ± 3.8 | 53.4 ± 5.1 | 58.2 ± 5.6
FME | 57.9 ± 5.2 | 54.6 ± 3.4 | 66.4 ± 2.9 | 64.8 ± 5.4 | 71.0 ± 4.5 | 71.7 ± 5.5
CGE | 63.6 ± 8.8 | 60.2 ± 6.3 | 71.3 ± 5.1 | 69.8 ± 8.8 | 75.5 ± 4.9 | 74.3 ± 6.1
CSPE | 67.1 ± 6.8 | 63.7 ± 5.0 | 70.6 ± 5.1 | 69.7 ± 5.4 | 76.3 ± 5.3 | 73.6 ± 5.7
FCSPE | 65.8 ± 6.9 | 60.9 ± 5.6 | 73.6 ± 6.5 | 70.9 ± 8.5 | 77.7 ± 6.5 | 73.3 ± 6.0
Table 8. Recognition rates (%) on LFW-a.

Method | Raw images | LBP feature
SPDA | 43.6 | 23.2
SDE | 20.9 | 18.4
FME | 35.3 | 22.7
CGE | 30.9 | 41.0
CSPE | 41.3 | 45.9
FCSPE | 45.0 | 48.9

Table 9. Recognition rates (%) on LFW (non-aligned version).

Method | Raw images | LBP feature
SPDA | 15.1 | 12.1
SDE | 16.0 | 14.5
FME | 23.0 | 12.8
CGE | 23.9 | 29.6
CSPE | 30.9 | 33.0
FCSPE | 33.0 | 33.2
Fig. 2. Recognition rates of different embedding methods as a function of feature dimension for FERET, PIE, Extended Yale B and USPS data sets (test evaluation). Three samples per class are labeled. The classifier is NN.
where $E = U^T (I - S - S^T + S^T S) U + \mu\, U^T A^T A U + \mu\gamma\, U^T (B - I)^T (B - I) U$. Now we want to solve the following problem, which depends only on $Z$:

$$\min_Z \mathrm{Tr}(Z^T E Z), \quad \text{s.t.} \quad Z^T U^T U Z = I. \qquad (33)$$

Then a Lagrangian multiplier is used. The optimal vector $z$ is given by the minimum eigenvalue solution to the following generalized eigenvalue problem:

$$E z = \lambda\, U^T U z. \qquad (34)$$

The eigenvectors $z_1, \ldots, z_d$ corresponding to the $d$ smallest eigenvalues yield the auxiliary matrix $Z = [z_1, \ldots, z_d]$. Then the approximate linear projection matrix $P$ is given by Eq. (31), i.e. $P = A U Z$. The linear projection can be used over both the training samples and the testing samples.
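The closed-form steps of Eqs. (30)–(34) can be put together in a short NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: $U$ has the block form of Eq. (10) with at least one labeled sample per class (so $U^T U$ is diagonal and invertible), `S` in the toy run is a random placeholder for the sparse codes of Eq. (6), and an unseen sample is embedded as $P^T x + b$, which follows the model $Y^T = X^T P + 1 b^T$. The function name `fcspe` and its argument names are hypothetical.

```python
import numpy as np

def fcspe(X, S, U, d, mu=1.0, gamma=1.0):
    """Sketch of Eqs. (30)-(34). X is m x n, S is n x n, U is n x q."""
    m, n = X.shape
    Hc = np.eye(n) - np.ones((n, n)) / n                     # centering matrix H_c
    A = gamma * np.linalg.solve(gamma * X @ Hc @ X.T + np.eye(m), X @ Hc)  # Eq. (31)
    B = Hc @ X.T @ A + np.ones((n, n)) / n                   # X^T P + 1 b^T = B U Z
    R = np.eye(n) - S - S.T + S.T @ S
    BI = B - np.eye(n)
    E = U.T @ (R + mu * A.T @ A + mu * gamma * BI.T @ BI) @ U
    E = (E + E.T) / 2                                        # remove round-off asymmetry
    g = np.sum(U * U, axis=0)                                # diagonal of U^T U
    w = 1.0 / np.sqrt(g)
    vals, vecs = np.linalg.eigh(E * np.outer(w, w))          # whitened form of Eq. (34)
    Z = w[:, None] * vecs[:, np.argsort(vals)[:d]]           # d smallest eigenvalues
    P = A @ U @ Z                                            # Eq. (31)
    b = (Z.T @ U.T @ np.ones(n) - P.T @ X @ np.ones(n)) / n  # Eq. (30)
    return Z, P, b

rng = np.random.default_rng(1)
m, n, c = 4, 10, 2
X = rng.standard_normal((m, n))                              # toy data, one sample per column
S = rng.random((n, n)) * (rng.random((n, n)) < 0.3)          # placeholder sparse codes
np.fill_diagonal(S, 0.0)
labels = np.array([0, 0, 1, 1])                              # first l = 4 samples are labeled
l = labels.size
U = np.zeros((n, c + n - l))
U[np.arange(l), labels] = 1.0                                # block J of Eq. (10)
U[np.arange(l, n), c + np.arange(n - l)] = 1.0               # block I_{n-l}
Z, P, b = fcspe(X, S, U, d=2)
y_new = P.T @ rng.standard_normal(m) + b                     # embedding of an unseen sample
```

Once `P` and `b` are available, embedding a new sample costs a single matrix-vector product, which is the practical benefit of estimating the regression jointly with the non-linear embedding.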
4. Performance evaluation

In this section, we evaluate the proposed methods on eight real image databases: Yale, ORL, FERET, PIE, Extended Yale B, LFW (the original data set and the aligned version), COIL-20, and USPS.

4.1. Data sets

Yale¹: This database contains 165 face images from 15 people and each person has 11 images. The images are resized to 32 × 32 for processing.

ORL²: This database contains 400 face images from 40 people and each person has 10 images. The size of the original image is 92 × 112. The images are resized to 32 × 32 for processing.

FERET³: The subset used contains 1400 face images of 200 individuals and each person has 7 images. The images are resized to 32 × 32 for processing.

PIE⁴: This database contains 1926 images from 68 individuals. The images are resized to 32 × 32 for processing.

Extended Yale B⁵: The subset used contains 1680 face images of 28 individuals. The images are resized to 32 × 32 for processing.

¹ http://vision.ucsd.edu/content/yale-face-database
² http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
³ http://www.nist.gov/itl/iad/ig/colorferet.cfm
⁴ http://vasc.ri.cmu.edu/idb/html/face/index.html
⁵ http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html
Fig. 3. Recognition rates of different embedding methods as a function of feature dimension for FERET, PIE, Extended Yale B and USPS data sets (test evaluation). Three samples per class are labeled. The classifier is SVM.
LFW⁶: The LFW (Labeled Faces in the Wild) data set is designed for unconstrained face verification tasks. Images of this data set are collected from the web. The original version contains more than 13 000 images. The faces in this data set are not aligned. Our experiments are made on a set of 1551 images referring to 141 subjects (11 images for each subject). This selection of images is motivated by the fact that the LFW data set is designed for face verification rather than for face identification, which is our case.

LFW-a⁷: The LFW-a data set is the aligned version of the LFW data set. Our experiments are made on a set of 1551 images referring to 141 subjects (11 images for each subject).

COIL-20⁸: A subset of this database contains 360 images of 20 objects with different rotations. The images are of size 51 × 51 pixels.

USPS⁹: A subset of this database contains 1100 images of 10 handwritten digits. The images are of size 16 × 16 pixels.

Fig. 1 illustrates six images from FERET, Extended Yale, LFW-a, COIL-20, and USPS databases.

⁶ http://vis-www.cs.umass.edu/lfw
⁷ http://www.openu.ac.il/home/hassner/data/lfwa/
⁸ http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
⁹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
4.2. Comparisons of effectiveness

To evaluate the proposed methods, we compare them with several competing methods including LPP, SPP, SPDA [10], SDE [41], and CGE. We also use three label propagation methods, GFHF [42], RMGT [43], and FME [38], which can be regarded as classifiers. The FCSPE method has two parameters to tune: μ and γ. In the experiments, they are chosen from {10^-6, 10^-3, 1, 10^3, 10^6}. For every method, several values for the parameters are used. We then report the top-1 recognition accuracy (best average recognition rate) of all methods from the best parameter configuration. In all experiments, PCA is used as a preprocessing step to preserve 98% of the energy of the data. For every data set except LFW-a, we randomly choose 50% of the samples to form the training set, and the remaining 50% form the test set. For each class, l samples are randomly chosen (from the training set) as labeled samples (l = 1, 2 and 3). All compared methods use the training set (labeled and unlabeled samples) to build the projection model. Then, the obtained projections of the unlabeled training samples and of the test samples are classified using the Nearest Neighbor (NN) classifier. We repeat every experiment 10 times, i.e., we generate 10 random splits (Train/Test) for every data set. We report the average recognition rates over the 10 splits. In general, the recognition rate varies with the dimension of the
Fig. 4. Variation of the recognition rates with different values of the parameters μ and γ on Yale, ORL and COIL-20.
Fig. 5. Variation of the recognition rates with different values of the parameters μ and γ on Yale, ORL and COIL-20.
subspace. Thus, the average recognition rate is given by a curve depicting the rate as a function of the dimension of the new subspace. Tables 1–7 show the average recognition rates and the standard
deviations for six databases. For each database, 1–3 samples for each class are labeled and recognition rates are shown on both unlabeled training sets and testing sets. In the tables, ‘Unlabeled’
822
L. Weng et al. / Pattern Recognition 60 (2016) 813–823
Table 10 The comparison between the recognition rates (%) of the optimal parameters and the fixed parameters ( μ = 1, γ ¼ 1). Parameters
Optimal Fixed
Datasets Yale
ORL
COIL-20
92.2 91.1
88.8 88.1
76.3 74.3
means the unlabeled training set and ‘Test’ means the testing set.
We conducted one experiment on LFW and LFW-a. The main purpose of this data set is to evaluate face verification in the wild. Since we are addressing the face recognition problem (one-to-many matching), we use another protocol for evaluating the proposed methods. For every person, we randomly select 7 images, and the remaining 4 images are used for testing. These test images are used as unlabeled data samples. In other words, the whole data set is used for training. We test on a single split using the original color feature and the LBP image [44,45], respectively. For comparison, we use SPDA, SDE, FME and CGE. The NN classifier is used for classification. Table 8 shows the recognition rates for LFW-a and Table 9 shows the recognition rates for LFW.
Fig. 2 illustrates the average recognition rate curves over the range of feature dimensions. The NN classifier is used for classification. These curves are obtained on the test set with three labeled samples per class. We recall that the FME method does not depend on the dimension since it is a label propagation technique. We stress that the range of feature dimensions is not the same for all compared methods; in particular, their maximum dimensions differ. The maximum dimension of the SPDA method is c − 1, and that of SDE is the dimension of the input samples. For CGE, CSPE and FCSPE, the maximum dimension, which depends on the constraint matrix U, is n − l + c. Fig. 3 also illustrates the average recognition rate curves over the range of reduced dimensions. The classifier is a non-linear SVM, and the remaining settings are the same as those in Fig. 2.
Analysis of results: According to the results reported in the previous tables and figures, we can make the following observations:
• The proposed CSPE and FCSPE algorithms can outperform the other embedding algorithms on both the unlabeled training sets and the test sets. These results show the effectiveness of CSPE and FCSPE.
• Almost all algorithms perform better on the unlabeled training sets than on the test sets for most of the data sets. This is intuitive since the unlabeled training sets are used in learning the model.
• The CSPE and FCSPE algorithms can outperform the other algorithms with both the NN and SVM classifiers. According to the figures, the superiority of the proposed methods is independent of the classifier used.
• In most cases, the proposed FCSPE algorithm provides better performance than the proposed CSPE algorithm, even on the unlabeled training set. As mentioned in the previous section, FCSPE simultaneously computes a non-linear embedding of the training samples and the linear transform based on a regression over these non-linear representations. This provides a better data mapping than that obtained by CSPE, which performs a cascaded estimation in the sense that it calculates the non-linear embedding first and then performs a linear regression over this embedding.
• According to the depicted curves, the two proposed methods provide good performance even with small feature dimensions. This means that the proposed methods can outperform the other methods even in a low-dimensional projection space.
• For the face images captured in the wild, both proposed methods retain their superiority over the competing methods, even for misaligned faces. Moreover, the results show that the proposed methods can also outperform the competing methods when other types of image descriptors are used.
4.3. Sensitivity to parameters
The FCSPE method has two parameters, μ and γ. In this section, we study how the recognition rates obtained by FCSPE vary with these two parameters. Fig. 4 illustrates the recognition rates with different parameter values for the Yale, ORL and COIL-20 data sets. Fig. 5 illustrates the recognition rates over a smaller range of the parameter μ (from 10⁻² to 10²) for Yale, ORL and COIL-20. We can observe that the optimal region for the two parameters μ and γ is almost the same for the three data sets. Generally, μ should be close to 1, and the influence of γ is not significant when μ is fixed. We can conclude that, although the proposed method has two parameters, their optimal values are confined to a small interval, so setting them is not a difficult task. Table 10 compares the recognition rates obtained with fixed parameters (μ = 1, γ = 1) with the rates obtained with the optimal ones. It shows that fixing the two parameters to μ = 1, γ = 1 yields performance close to the optimal results. Hence, one can simply fix the parameters when the proposed method is used.
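The parameter study above can be mimicked with a small grid sweep. The following Python sketch is illustrative only: `sweep_parameters` and `toy_rate` are hypothetical names, the grid is assumed to be log-spaced between 10⁻² and 10², and `toy_rate` is a synthetic stand-in for training FCSPE and measuring its recognition rate, not data from the experiments.

```python
import itertools
import math


def sweep_parameters(evaluate, exponents=range(-2, 3)):
    """Evaluate every (mu, gamma) pair on a log-spaced grid 10^-2 ... 10^2.

    `evaluate` is a hypothetical callable (mu, gamma) -> recognition rate;
    in practice it would train the model and score a validation split.
    """
    grid = [10.0 ** e for e in exponents]
    return {(mu, gamma): evaluate(mu, gamma)
            for mu, gamma in itertools.product(grid, grid)}


def toy_rate(mu, gamma):
    # Synthetic stand-in (NOT measured data): the rate peaks at mu = 1 and
    # depends only weakly on gamma, mimicking the trend described in the text.
    return 0.9 - 0.1 * abs(math.log10(mu)) - 0.01 * abs(math.log10(gamma))


rates = sweep_parameters(toy_rate)
best_pair = max(rates, key=rates.get)

# Gap between the best grid point and the fixed setting mu = 1, gamma = 1.
gap_to_fixed = rates[best_pair] - rates[(1.0, 1.0)]
print(best_pair, gap_to_fixed)
```

With a real evaluation function in place of `toy_rate`, a small `gap_to_fixed` would support simply fixing both parameters to 1.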
5. Conclusion and discussion
In this paper, two semi-supervised methods for data embedding are proposed. The proposed methods utilize the label information from the labeled data and a manifold regularization (derived from the sparsity preserving criterion) on both labeled and unlabeled training data. The FCSPE method can generate a linear projection for unseen data points through a linear regression term in the objective function. The experimental results on eight real image databases demonstrate that the proposed methods can outperform the competing embedding methods. Moreover, for misaligned faces, the proposed methods retained their superiority. The experimental results also give some hints for parameter selection in the proposed methods.
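To illustrate how a learned linear projection handles unseen data points, the following Python sketch maps a new sample with z = Wᵀx + b and classifies it with the NN rule in the projected space. The names `project` and `nearest_label`, the matrix `W`, the offset `b` (assumed here for generality) and all numeric values are hypothetical; in practice the projection would come from the FCSPE optimization.

```python
def project(sample, weights, bias):
    """Linear map z = W^T x + b, with W stored as a list of D rows of length d."""
    dim_out = len(bias)
    return [sum(weights[i][j] * sample[i] for i in range(len(sample))) + bias[j]
            for j in range(dim_out)]


def nearest_label(query, embedded, labels):
    """1-NN rule in the projected space, using squared Euclidean distance."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(embedded)), key=lambda k: dist(query, embedded[k]))
    return labels[best]


# Hypothetical toy values, not results from the paper.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # D = 3 inputs, d = 2 outputs
b = [0.0, 0.5]
train = [[0.0, 0.0, 0.0], [2.0, 2.0, 1.0]]
labels = ["class_a", "class_b"]

# Project the labeled training samples once, then classify an unseen sample.
embedded = [project(x, W, b) for x in train]
unseen = [1.9, 2.1, 1.0]
predicted = nearest_label(project(unseen, W, b), embedded, labels)
print(predicted)
```

Because the map is linear, an unseen sample needs only one matrix-vector product and no re-training, which is the practical meaning of the out-of-sample extension.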
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 61373063, 61420201, 61472187, 61233011, 61375007, 61220301, and by the National Basic Research Program of China under Grant No. 2014CB349303.
References
[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, New York, USA, 2012.
[2] I. Borg, P.J. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer Science & Business Media, New York, USA, 2005.
[3] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[4] J.B. Tenenbaum, V. De Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[5] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[6] X. He, P. Niyogi, Locality preserving projections, Neural Inf. Process. Syst. 16 (2003) 234–241.
[7] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: Tenth IEEE International Conference on Computer Vision, vol. 2, IEEE, 2005, Beijing, China, pp. 1208–1213.
[8] Z. Lai, W.K. Wong, Y. Xu, J. Yang, D. Zhang, Approximate orthogonal sparse embedding for dimensionality reduction, IEEE Trans. Neural Netw. Learn. Syst. 27 (4) (2016) 723–735, http://dx.doi.org/10.1109/TNNLS.2015.2422994.
[9] S. Yan, H. Wang, Semi-supervised learning by sparse representation, in: International Conference on Data Mining, SIAM, 2009, Sparks, USA, pp. 792–801.
[10] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. 43 (1) (2010) 331–341.
[11] Z. Lai, W.K. Wong, Z. Jin, J. Yang, Y. Xu, Sparse approximation to the eigensubspace for discrimination, IEEE Trans. Neural Netw. Learn. Syst. 23 (12) (2012) 1948–1960, http://dx.doi.org/10.1109/TNNLS.2012.2217154.
[12] Z. Lai, W.K. Wong, Y. Xu, C. Zhao, M. Sun, Sparse alignment for robust tensor learning, IEEE Trans. Neural Netw. Learn. Syst. 25 (10) (2014) 1779–1792, http://dx.doi.org/10.1109/TNNLS.2013.2295717.
[13] Z. Lai, Y. Xu, Q. Chen, J. Yang, D. Zhang, Multilinear sparse principal component analysis, IEEE Trans. Neural Netw. Learn. Syst. 25 (10) (2014) 1942–1950, http://dx.doi.org/10.1109/TNNLS.2013.2297381.
[14] D. Zhou, J. Huang, B. Schölkopf, Learning from labeled and unlabeled data on a directed graph, in: International Conference on Machine Learning, ACM, 2005, Bonn, Germany, pp. 1036–1043.
[15] F. Nie, S. Xiang, Y. Jia, C. Zhang, Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recognit. 42 (11) (2009) 2615–2627.
[16] X. Zhu, Semi-supervised learning, in: Encyclopedia of Machine Learning, Springer-Verlag, New York, 2010, pp. 892–897.
[17] F. Nie, H. Wang, H. Huang, C. Ding, Adaptive loss minimization for semi-supervised elastic embedding, in: International Joint Conference on Artificial Intelligence, AAAI Press, 2013, Beijing, China, pp. 1565–1571.
[18] S. Yang, X. Wang, M. Wang, Y. Han, L. Jiao, Semi-supervised low-rank representation graph for pattern recognition, IET Image Process. 7 (2) (2013) 131–136.
[19] Y. Cui, X. Cai, Z. Jin, Semi-supervised classification using sparse representation for cancer recurrence prediction, in: IEEE International Workshop on Genomic Signal Processing and Statistics, 2013, pp. 102–105.
[20] C.-L. Liu, W.-H. Hsaio, C.-H. Lee, F.-S. Gou, Semi-supervised linear discriminant clustering, IEEE Trans. Cybern. 44 (7) (2014) 989–1000.
[21] S. Sun, Z. Hussain, J. Shawe-Taylor, Manifold-preserving graph reduction for sparse semi-supervised learning, Neurocomputing 124 (2014) 13–21.
[22] Z. Li, Z. Lai, Y. Xu, J. Yang, D. Zhang, A locality-constrained and label embedding dictionary learning algorithm for image classification, IEEE Trans. Neural Netw. Learn. Syst. PP (99) (2015) 1–16, http://dx.doi.org/10.1109/TNNLS.2015.2508025.
[23] T. Zhang, A. Popescul, B. Dom, Linear prediction models with graph regularization for web-page categorization, in: International Conference on Knowledge Discovery and Data Mining, ACM, 2006, Philadelphia, USA, pp. 821–826.
[24] G. Camps-Valls, T.V.B. Marsheva, D. Zhou, Semi-supervised graph-based hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 45 (10) (2007) 3044–3054.
[25] J. Abernethy, O. Chapelle, C. Castillo, Web spam identification through content and hyperlinks, in: International Workshop on Adversarial Information Retrieval on the Web, ACM, Beijing, China, 2008, pp. 41–44.
[26] W. Yang, S. Zhang, W. Liang, A graph based subspace semi-supervised learning framework for dimensionality reduction, in: Computer Vision–ECCV, Springer, Marseille, France, 2008, pp. 664–677.
[27] W. Liu, J. He, S.-F. Chang, Large graph construction for scalable semi-supervised learning, in: International Conference on Machine Learning (ICML10), 2010, pp. 679–686.
[28] Z. Xu, I. King, M.R.-T. Lyu, R. Jin, Discriminative semi-supervised feature selection via manifold regularization, IEEE Trans. Neural Netw. 21 (7) (2010) 1033–1047.
[29] F. Nie, D. Xu, X. Li, S. Xiang, Semisupervised dimensionality reduction and classification through virtual label regression, IEEE Trans. Syst. Man Cybern. Part B 41 (3) (2011) 675–685.
[30] F. Pan, J. Wang, X. Lin, Local margin based semi-supervised discriminant embedding for visual recognition, Neurocomputing 74 (5) (2011) 812–819.
[31] Y. Huang, D. Xu, F. Nie, Semi-supervised dimension reduction using trace ratio criterion, IEEE Trans. Neural Netw. Learn. Syst. 23 (3) (2012) 519–526.
[32] T. Zhang, R. Ji, W. Liu, D. Tao, G. Hua, Semi-supervised learning with manifold fitted graphs, in: International Joint Conference on Artificial Intelligence, AAAI Press, Beijing, China, 2013, pp. 1896–1902.
[33] F. Dornaika, A. Bosaghzadeh, H. Salmane, Y. Ruichek, Graph-based semi-supervised learning with local binary patterns for holistic object categorization, Exp. Syst. Appl. 41 (17) (2014) 7744–7753.
[34] Y.-M. Zhang, K. Huang, X. Hou, C.-L. Liu, Learning locality preserving graph from data, IEEE Trans. Cybern. 44 (11) (2014) 2088–2098.
[35] I. Triguero, S. García, F. Herrera, Seg-ssc: a framework based on synthetic examples generation for self-labeled semi-supervised classification, IEEE Trans. Cybern. 45 (4) (2015) 622–634.
[36] C. Chen, L. Zhang, J. Bu, C. Wang, W. Chen, Constrained Laplacian eigenmap for dimensionality reduction, Neurocomputing 73 (4) (2010) 951–958.
[37] X. He, M. Ji, H. Bao, Graph embedding with constraints, in: International Joint Conference on Artificial Intelligence, vol. 9, 2009, pp. 1065–1070.
[38] F. Nie, D. Xu, I.W.-H. Tsang, C. Zhang, Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. 19 (7) (2010) 1921–1932.
[39] X. Fang, Y. Xu, X. Li, Z. Lai, W. Wong, Learning a nonnegative sparse graph for linear regression, IEEE Trans. Image Process. 24 (9) (2015) 2760–2771, http://dx.doi.org/10.1109/TIP.2015.2425545.
[40] C. Sousa, S. Rezende, G. Batista, Influence of graph construction on semi-supervised learning, in: European Conference on Machine Learning, 2013, pp. 160–175.
[41] G. Yu, G. Zhang, C. Domeniconi, Z. Yu, J. You, Semi-supervised classification based on random subspace dimensionality reduction, Pattern Recognit. 45 (3) (2012) 1119–1135.
[42] X. Zhu, Z. Ghahramani, J. Lafferty, et al., Semi-supervised learning using Gaussian fields and harmonic functions, in: International Conference on Machine Learning, vol. 3, 2003, pp. 912–919.
[43] W. Liu, S.-F. Chang, Robust multi-class transductive learning with graphs, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 381–388.
[44] T. Ojala, M. Pietikäinen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 971–987.
[45] V. Takala, T. Ahonen, M. Pietikäinen, Block-based methods for image retrieval using local binary patterns, in: Image Analysis, SCIA, Lecture Notes in Computer Sciences, vol. 3540, 2005, pp. 882–891.
Libo Weng received his B.S. degree in Mathematics and Applied Mathematics from Nanjing University of Science and Technology, Nanjing, China, in 2011. He is currently pursuing the Ph.D. degree in Pattern Recognition and Intelligent Systems at Nanjing University of Science and Technology, Nanjing, China. He is also an international joint Ph.D. student at the University of the Basque Country UPV/EHU, San Sebastian, Spain. His current research interests include pattern recognition and machine learning.
Fadi Dornaika received the M.S. degree in signal, image and speech processing from Grenoble Institute of Technology, France, in 1992, and the Ph.D. degree in computer science from Grenoble Institute of Technology, France, in 1995. He is currently a Research Professor at IKERBASQUE (Basque Foundation for Science) and the University of the Basque Country. Prior to joining IKERBASQUE, he held numerous research positions in Europe, China, and Canada. He has published more than 200 papers in the field of computer vision and pattern recognition. His current research interests include pattern recognition, machine learning and data mining.
Zhong Jin received the B.S. degree in mathematics, the M.S. degree in applied mathematics and the Ph.D. degree in pattern recognition and intelligent systems from Nanjing University of Science and Technology, Nanjing, China, in 1982, 1984 and 1999, respectively. His current interests are in the areas of pattern recognition and face recognition.