Neurocomputing 128 (2014) 341–362
Sparse regularization discriminant analysis for face recognition
Fei Yin*, L.C. Jiao, Fanhua Shang, Lin Xiong, Xiaodong Wang
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, China
Article info
Article history: Received 31 October 2012; Received in revised form 26 August 2013; Accepted 30 August 2013; Communicated by D. Tao; Available online 23 October 2013

Abstract
Recently the underlying sparse representation structure in high dimensional data has attracted considerable interest in pattern recognition and computer vision. Sparse representation structure means the sparse reconstructive relationship of the data. In this paper, we propose a novel dimensionality reduction method called Sparse Regularization Discriminant Analysis (SRDA), which aims to preserve the sparse representation structure of the data when learning an efficient discriminating subspace. More specifically, SRDA first constructs a concatenated dictionary through class-wise PCA decompositions, which conduct PCA on the data from each class separately, and learns the sparse representation structure under the constructed dictionary quickly through matrix–vector multiplications. Then SRDA takes into account both the sparse representation structure and the discriminating efficiency by using the learned sparse representation structure as a regularization term of linear discriminant analysis. Finally, the optimal embedding of the data is learned by solving a generalized eigenvalue problem. The extensive and promising experimental results on four publicly available face data sets (Yale, Extended Yale B, ORL and CMU PIE) validate the feasibility and effectiveness of the proposed method. © 2013 Elsevier B.V. All rights reserved.
Keywords: Subspace learning Sparse representation Class-wise PCA decompositions Linear discriminant analysis Face recognition
1. Introduction

In various fields of scientific research such as face recognition [1], text classification [2], and information retrieval [3], one is often confronted with data lying in a very high dimensional space. Generally, high dimensional data bring many problems in practical applications: (1) the computational complexity limits the application of many practical technologies; (2) model estimation deteriorates when the number of samples is small compared to the dimensionality (the so-called small sample size problem). These problems are referred to as "the curse of dimensionality" in machine learning and data mining [4]. In practice, dimensionality reduction has been employed as an effective way to handle these problems [5–7]. In the past few decades, many dimensionality reduction methods have been designed. Based on the data structure considered, these methods can be divided into three categories: global structure based methods, local neighborhood structure based methods, and the recently presented sparse representation structure based methods. Principal Component Analysis (PCA) [8] and Linear Discriminant Analysis (LDA) [9] belong to the global structure based methods.
* Correspondence to: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Mailbox 224, Xidian University, No. 2 South TaiBai Road, Xi'an 710071, China. Tel.: +86 29 88202661; fax: +86 29 88201023.
E-mail address: [email protected] (F. Yin).
0925-2312/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.neucom.2013.08.032
PCA, which aims at maximizing the variance of the projected data, has been applied widely in science and engineering. The classical Eigenface method [10] for face recognition is an application of PCA. PCA is a good method for representation, but it does not consider the label information, so it may not be reliable for classification [11]. LDA tries to simultaneously maximize the between-class scatter and minimize the within-class scatter. Because the label information is employed, LDA has proven to be more effective than PCA for classification. LDA is used in face recognition under the name of Fisherface [12]. However, LDA can extract at most K − 1 features (K is the number of classes), which is undesirable in many cases. Moreover, it confronts the Small Sample Size (SSS) problem [13] when dealing with high dimensional data. To address these problems, Regularized LDA (RLDA) has been presented [14,15]. Another strategy is to incorporate a locality preserving term into LDA; the representative method is Laplacian Linear Discriminant Analysis (LapLDA) [16]. Besides, the projection of LDA tends to merge classes which are close together in the original feature space. To tackle this problem, Tao et al. [17,18] proposed Geometric Mean Divergence Analysis (GMDA) from the viewpoint of maximizing the geometric mean of the Kullback–Leibler (KL) divergences between different classes. Local neighborhood structure based methods include Locality Preserving Projections (LPP) [19] and Neighborhood Preserving Embedding (NPE) [20]. LPP finds an embedding that preserves the local neighborhood structure, and obtains a face subspace best fitting the essential face manifold. NPE also aims at preserving the
local neighborhood structure of the data. Different from LPP, the affinity matrix in NPE is obtained by solving local least squares problems instead of being defined directly. In essence, LPP is the optimal linear approximation to the classical manifold learning method called Laplacian Eigenmaps (LE) [21]. It has also been proved that NPE is a linearized version of the popular manifold learning method called Locally Linear Embedding (LLE) [22]. As a new signal representation method, sparse representation has received considerable attention in recent years, especially in compressed sensing [23–27]. It models a signal as a sparse linear combination of elementary signals from a dictionary. The coefficient vector of the sparse linear combination is called the sparse coefficient vector. Sparse representation has proven to be a powerful tool for signal processing and pattern classification [28]. Under the ℓ1 minimization framework, sparse representation has been successfully applied to denoising and inpainting [29–31], image super-resolution [32,33], image classification [34,35], etc. Sparse models have also been employed in dimensionality reduction [36–38]. Representative methods include Sparsity Preserving Projections (SPP) [39], Sparsity Preserving Discriminant Analysis (SPDA) [40], Fast Fisher Sparsity Preserving Projections (FFSPP) [41], Double Shrinking Sparse Dimension Reduction [42], and Group Sparse Manhattan Non-negative Matrix Factorization [43]. In this paper, sparse representation structure means that every sample can be linearly represented by other samples from the same class, and thus can be sparsely represented by samples from all classes. That is, the sparse representation structure is the sparse reconstructive relationship of the data, which originates from the subspace assumption: samples from a single class lie on a low dimensional linear subspace. Recently, Sparsity Preserving Projections (SPP) [39], which considers the sparse representation structure, has been proposed. It aims to preserve the sparse reconstructive relationship of the data, which is achieved by minimizing an ℓ1 regularization-related objective function. SPP is a novel and effective approach to dimensionality reduction, but it has two shortcomings: (1) its computational complexity is high, because n (the number of training samples) ℓ1 norm optimization problems need to be solved to learn the sparse representation structure; (2) it does not take advantage of the label information, which is very useful for classification problems such as face recognition. An ℓ1 norm optimization problem needs to be solved to obtain the sparse coefficient vector of a sample, whose computational cost is approximately O(n³) [44]; the cost of obtaining the sparse coefficient vectors for all samples is therefore O(n⁴). For large-scale problems, SPP is thus computationally prohibitive. In this paper, in order to learn an efficient discriminating subspace which can preserve the sparse representation structure of the data and avoid the disadvantages of SPP, we propose a novel dimensionality reduction method: Sparse Regularization Discriminant Analysis (SRDA). Particularly, SRDA first constructs a concatenated dictionary through class-wise PCA decompositions and learns the sparse representation structure under the constructed dictionary quickly through matrix–vector multiplications.
Then SRDA takes into account both the sparse representation structure and the discriminating efficiency by using the learned sparse representation structure as a regularization term of linear discriminant analysis. Finally, the optimal embedding of the data is learned by solving a generalized eigenvalue problem. It is worthwhile to highlight some aspects of SRDA as follows:
(1) SRDA is a novel dimensionality reduction method which simultaneously seeks an efficient discriminating subspace and preserves the sparse representation structure.
(2) Based on the special concatenated dictionary constructed via class-wise PCA decompositions, the sparse coefficient vector in SRDA can be obtained via a basic matrix operation: matrix–vector multiplication. Compared to SPP, in which the sparse coefficient vector is computed via ℓ1 norm optimization, the computational cost of learning the sparse representation structure is greatly reduced.
(3) Compared to LDA, the number of features that SRDA can find is not limited to K − 1. Besides, SRDA can effectively avoid the SSS problem, as the singularity problem of the within-class scatter matrix vanishes due to the regularization term in the SRDA formulation.
(4) In SRDA, label information is used in two aspects. First, it is used in building the dictionary for sparse representation and computing the sparse coefficient vector, which may lead to a more discriminating sparse representation structure. Second, it is used in computing the within-class scatter matrix and the between-class scatter matrix.
The rest of the paper is organized as follows. SRDA is proposed in Section 2 and compared with other related works in Section 3. The experimental results and discussions are presented in Section 4. Finally, Section 5 provides some concluding remarks and future work.
2. Sparse regularization discriminant analysis (SRDA)

In this section, we describe the proposed SRDA in detail. SRDA consists of three steps: (1) learning the sparse representation structure; (2) constructing the sparsity preserving regularizer; and (3) integrating the learned sparsity preserving regularizer into the objective function of LDA to form the SRDA objective function and solving the optimization problem to obtain the embedding functions.

2.1. Learning the sparse representation structure

Given a set of training samples $\{x_i\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^m$, let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ be the data matrix consisting of all the training samples, and write $X = [X_1, X_2, \ldots, X_K]$, where $X_j \in \mathbb{R}^{m \times n_j}$ contains the samples from class $j$. Assume that the samples from a single class lie on a linear subspace. The subspace model is flexible enough to capture much of the variation in real-world data sets and is a simple and effective model in the context of face recognition [45]; it has been observed that face images under various lighting conditions and expressions lie on a special low-dimensional subspace [12,46]. We conduct a PCA decomposition for each class $X_j$, $j = 1, 2, \ldots, K$, whose objective function is

$\max_{\|g\|=1} g^T \Sigma_j g, \qquad (1)$

where $\Sigma_j$ is the data covariance matrix of $X_j$. For each class $j$, the first $l_j$ principal components are preserved to constitute $G_j = [g_1, g_2, \ldots, g_{l_j}]$. Considering that the sample mean of $X_j$ usually is not at the origin, in order to fully represent a sample from class $j$, $G_j$ is extended to $\tilde{G}_j = [g_1, g_2, \ldots, g_{l_j}, x_{mean}^j]$, where $x_{mean}^j$ is the mean vector of the samples from class $j$ (in fact, through some preprocessing, $x_{mean}^j$ can be removed, which is detailed in Section 2.2). So, a sample $x$ from class $j$ can be represented as

$x = [g_1, g_2, \ldots, g_{l_j}, x_{mean}^j] s_j = \tilde{G}_j s_j = [\tilde{G}_1, \tilde{G}_2, \ldots, \tilde{G}_j, \ldots, \tilde{G}_K] s = \tilde{G} s, \qquad (2)$

where $\tilde{G} = [\tilde{G}_1, \tilde{G}_2, \ldots, \tilde{G}_K]$ and $s = [0^T, 0^T, \ldots, s_j^T, \ldots, 0^T]^T$. Here $x = \tilde{G} s$ is a sparse representation of $x$, $\tilde{G}$ is the concatenated dictionary constructed from the principal components of the class-wise PCA decompositions of
each class, and $s$ is the sparse coefficient vector. The process of constructing the dictionary is shown in Fig. 1. According to the previous procedure, each training sample corresponds to a sparse representation under the dictionary $\tilde{G}$. Once $\tilde{G}$ is obtained from the $K$ class-wise PCA decompositions, the sparse coefficient vector $s$ of a training sample from class $j$ can be computed quickly by a matrix–vector multiplication, because according to Eq. (2) the computation of $s$ involves only $G_j$, and $G_j$ is column orthogonal. For example, if $x$ is a sample from class $j$, the $s$ corresponding to $x$ can be obtained as

$s = [0^T, 0^T, \ldots, (G_j^T x)^T, 1, \ldots, 0^T]^T, \qquad (3)$

where the '1' corresponds to the sample mean $x_{mean}^j$. To sum up, the process of learning the sparse representation structure is shown in Fig. 2: first, the dictionary for sparse representation is constructed through class-wise PCA decompositions as in Fig. 1; then the sparse coefficient vector of each sample is computed quickly through a matrix–vector multiplication to obtain the sparse representation structure.

Fig. 1. The process of constructing the dictionary for sparse representation.

Fig. 2. The process of learning the sparse representation structure.
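To make this construction concrete, the following NumPy sketch builds the concatenated dictionary of Fig. 1 and computes the block-sparse coefficient vectors of Fig. 2 by plain matrix–vector products. It is only our illustration of Eqs. (1)–(3), not code released with the paper; the function names (classwise_pca_dictionary, sparse_codes) and the use of an SVD to obtain the leading principal components of each class are assumptions made for the sketch.

```python
import numpy as np

def classwise_pca_dictionary(X_by_class, l):
    """For each class keep the first l_j principal components and append the
    class mean, giving G~_j = [g_1, ..., g_lj, x_mean^j] as in Eq. (2).
    X_by_class is a list of m x n_j arrays (samples as columns), l a list of l_j."""
    blocks = []
    for X_j, l_j in zip(X_by_class, l):
        mean_j = X_j.mean(axis=1, keepdims=True)          # class mean x_mean^j
        U, _, _ = np.linalg.svd(X_j - mean_j, full_matrices=False)
        blocks.append(np.hstack([U[:, :l_j], mean_j]))    # leading PCs plus the mean column
    return blocks

def sparse_codes(X_by_class, dict_blocks):
    """Block-sparse coefficient vector of every sample, obtained by a plain
    matrix-vector product as in Eq. (3): s = [0, ..., (G_j^T x, 1), ..., 0]^T."""
    sizes = [B.shape[1] for B in dict_blocks]
    cols = []
    for j, X_j in enumerate(X_by_class):
        G_j = dict_blocks[j][:, :-1]                      # orthonormal principal components
        C = np.vstack([G_j.T @ X_j,                       # coefficients on g_1, ..., g_lj
                       np.ones((1, X_j.shape[1]))])       # the '1' on the class mean
        offset = sum(sizes[:j])
        for c in C.T:
            s = np.zeros(sum(sizes))
            s[offset:offset + sizes[j]] = c
            cols.append(s)
    return np.column_stack(cols)                          # S = [s_1, ..., s_n]
```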
2.2. Sparsity preserving regularization

It is expected that the sparse representation structure in the original high dimensional space can be preserved in the projected low dimensional space. Therefore, the following objective function can be used to seek a projection which best preserves the sparse representation structure:

$\min_{w} \sum_{i=1}^{n} \| w^T x_i - w^T \tilde{G} s_i \|^2, \qquad (4)$

where $s_i$ is the sparse coefficient vector corresponding to $x_i$. The term $\sum_{i=1}^{n} \| w^T x_i - w^T \tilde{G} s_i \|^2$ is the total sparse representation error of all samples in the projected space, so minimizing this term can ensure that the sparse representation structure is preserved in the projected space. We use this objective function as the sparsity preserving regularizer, which is defined by

$J(w) = \sum_{i=1}^{n} \| w^T x_i - w^T \tilde{G} s_i \|^2. \qquad (5)$

With some algebraic reformulation, the above sparsity preserving regularizer is rewritten as

$J(w) = \sum_{i=1}^{n} \| w^T x_i - w^T \tilde{G} s_i \|^2 = w^T \Big( \sum_{i=1}^{n} (x_i - \tilde{G} s_i)(x_i - \tilde{G} s_i)^T \Big) w = w^T \Big( \sum_{i=1}^{n} \big( x_i x_i^T - x_i s_i^T \tilde{G}^T - \tilde{G} s_i x_i^T + \tilde{G} s_i (\tilde{G} s_i)^T \big) \Big) w = w^T \big( X X^T - X S^T \tilde{G}^T - \tilde{G} S X^T + \tilde{G} S S^T \tilde{G}^T \big) w, \qquad (6)$

where $S = [s_1, s_2, \ldots, s_n]$. It should be noted that if the samples from each class are first translated to be centered at the origin, the $x_{mean}^j$ in Eq. (2) and the term '1' in Eq. (3) will disappear while the problem in Eq. (4) does not change. This is presented in the following theorem.

Theorem 1. For a data set $X = [X_1, X_2, \ldots, X_K]$, if we translate each class to make it centered at the origin, that is, to make the translated data set $\tilde{X} = [\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_K]$ have the property $\tilde{x}_{mean}^j = 0$, $j = 1, \ldots, K$, where $\tilde{x}_{mean}^j$ is the sample mean of $\tilde{X}_j$, then the problem in Eq. (4) will stay unchanged under the translated data set.

Proof. $\tilde{X}_j$ is centered at the origin and translated from $X_j$; for a sample $x_i$ from class $j$, its corresponding $\tilde{x}_i = x_i - x_{mean}^j$, which indicates $x_i = \tilde{x}_i + x_{mean}^j$. So, starting from the problem in Eq. (4), we have

$\min_{w} \sum_{i=1}^{n} \| w^T x_i - w^T \tilde{G} s_i \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T (\tilde{x}_i + x_{mean}^j) - w^T [\tilde{G}_1, \tilde{G}_2, \ldots, \tilde{G}_j, \ldots, \tilde{G}_K] s_i \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T (\tilde{x}_i + x_{mean}^j) - w^T \tilde{G}_j (s_i)_j \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T (\tilde{x}_i + x_{mean}^j) - w^T [g_1, g_2, \ldots, g_{l_j}, x_{mean}^j] (s_i)_j \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T \tilde{x}_i - w^T [g_1, g_2, \ldots, g_{l_j}] (\hat{s}_i)_j \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T \tilde{x}_i - w^T G_j (\hat{s}_i)_j \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T \tilde{x}_i - w^T [G_1, G_2, \ldots, G_j, \ldots, G_K] \tilde{s}_i \|^2$
$= \min_{w} \sum_{i=1}^{n} \| w^T \tilde{x}_i - w^T G \tilde{s}_i \|^2,$

where $(s_i)_j$ consists of the components of $s_i$ corresponding to $\tilde{G}_j$, $(\hat{s}_i)_j$ consists of the first $l_j$ components of $(s_i)_j$, that is, $(\hat{s}_i)_j$ is the coefficient vector corresponding to $[g_1, g_2, \ldots, g_{l_j}]$, $\tilde{s}_i = [0^T, 0^T, \ldots, (\hat{s}_i)_j^T, \ldots, 0^T]^T$, and $G = [G_1, G_2, \ldots, G_K]$. Thus, the proof is completed. □
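Since Eq. (6) shows that $J(w) = w^T M w$ with $M = (X - \tilde{G}S)(X - \tilde{G}S)^T$, the regularizer can be evaluated numerically through the residual matrix. The short sketch below is an illustrative helper of our own (the name sparsity_preserving_matrix is hypothetical), not part of the authors' implementation.

```python
import numpy as np

def sparsity_preserving_matrix(X, G_tilde, S):
    """Matrix form of the sparsity preserving regularizer in Eq. (6):
    J(w) = w^T M w with M = X X^T - X S^T G~^T - G~ S X^T + G~ S S^T G~^T,
    which factors as (X - G~ S)(X - G~ S)^T."""
    R = X - G_tilde @ S   # per-sample sparse reconstruction residuals (m x n)
    return R @ R.T        # m x m positive semidefinite matrix M
```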
Algorithm: Sparse Regularization Discriminant Analysis (SRDA)
Step 1: Calculate $S_B$ and $S_W$ by Eqs. (8) and (9), respectively;
Step 2: Conduct a PCA decomposition for every $X_j$, $j = 1, 2, \ldots, K$, using Eq. (1) to obtain the sparse representation dictionary $G$;
Step 3: Calculate the sparse coefficient vector $s$ for every sample based on Eq. (3) to obtain $S$;
Step 4: Calculate the projecting vectors by the generalized eigenvalue problem in Eq. (12).
Fig. 3. SRDA algorithm.
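Putting the steps of Fig. 3 together, a minimal end-to-end sketch might look as follows. It reuses the helper sketches above (classwise_pca_dictionary, sparse_codes, sparsity_preserving_matrix), assumes the training samples are stacked as columns and grouped by class, and is our reading of the algorithm rather than the authors' code; the defaults for lam1, lam2 and d are placeholders.

```python
import numpy as np
from scipy.linalg import eigh

def srda_fit(X_by_class, l, lam1=0.01, lam2=1.0, d=30):
    """Minimal sketch of the SRDA algorithm of Fig. 3; returns an m x d
    projection matrix W whose columns solve Eq. (12)."""
    X = np.hstack(X_by_class)                   # m x n training matrix, samples as columns
    m = X.shape[0]
    mean_all = X.mean(axis=1, keepdims=True)
    # Step 1: between-class and within-class scatter matrices, Eqs. (8)-(9)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for X_j in X_by_class:
        mean_j = X_j.mean(axis=1, keepdims=True)
        diff = mean_all - mean_j
        S_B += X_j.shape[1] * (diff @ diff.T)
        Xc = X_j - mean_j
        S_W += Xc @ Xc.T
    # Steps 2-3: class-wise PCA dictionary and block-sparse codes (earlier sketches)
    blocks = classwise_pca_dictionary(X_by_class, l)
    G_tilde = np.hstack(blocks)
    S = sparse_codes(X_by_class, blocks)
    M = sparsity_preserving_matrix(X, G_tilde, S)
    # Step 4: generalized eigenproblem S_B w = eta (S_W + lam1*I + lam2*M) w, Eq. (12)
    evals, evecs = eigh(S_B, S_W + lam1 * np.eye(m) + lam2 * M)
    order = np.argsort(evals)[::-1]             # keep the d largest generalized eigenvalues
    return evecs[:, order[:d]]
```

The regularized matrix on the right-hand side is symmetric positive definite whenever lam1 > 0, which is what allows the symmetric generalized eigensolver to be used here.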
Fig. 4. (a) All the 11 faces of the first person in the Yale data set. (b) Partial faces of the first person in the Extended Yale B data set. (c) All the 10 faces of the first person in the ORL data set. (d) Some sample faces of the first person in CMU PIE data set.
Fig. 5. The first 10 basis vectors of PCA, LDA, RLDA, LPP, NPE, and SRDA calculated from the training set of the ORL data set.
Fig. 6. Recognition rate versus dimensions on Yale with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.
Therefore, if we first center the samples from each class at the origin, the computation of the class-wise PCA decompositions and the matrix–vector multiplications can be simplified and the presentation can be more concise without the consideration of $x_{mean}^j$, while Theorem 1 guarantees that this preprocessing of the data does not change the constructed sparsity preserving regularizer.

2.3. Sparse regularization discriminant analysis

SRDA aims to simultaneously seek an efficient discriminating subspace and preserve the sparse representation structure. According to Section 2.2, minimizing the sparsity preserving regularizer $J(w)$ can preserve the sparse representation structure, so the objective function of SRDA can be formulated as follows:

$\max_{w} \frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w + \lambda_2 J(w)}, \qquad (7)$

$S_B = \sum_{k=1}^{K} N_k (m - m_k)(m - m_k)^T, \qquad (8)$

$S_W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - m_k)(x_i - m_k)^T, \qquad (9)$
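For a single candidate direction w, the criterion of Eq. (7) is a generalized Rayleigh quotient and can be evaluated directly once $S_B$, $S_W$ and the regularizer matrix $M$ of Eq. (6) are available. The small helper below is our own hypothetical illustration (the name srda_objective is not from the paper); it simply makes the role of λ1 and λ2 in the denominator explicit.

```python
import numpy as np

def srda_objective(w, S_B, S_W, M, lam1, lam2):
    """Evaluate the SRDA criterion of Eq. (7) for one direction w,
    using J(w) = w^T M w from Eq. (6)."""
    numerator = w @ S_B @ w
    denominator = w @ S_W @ w + lam1 * (w @ w) + lam2 * (w @ M @ w)
    return numerator / denominator
```

The leading generalized eigenvector of Eq. (12) is the direction that maximizes this quotient.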
Fig. 7. Recognition rate versus dimensions on Extended Yale B with 10, 20, 30, 40 and 50 images per individual randomly selected for training.
Fig. 8. Recognition rate versus dimensions on ORL with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.
Fig. 9. Recognition rate versus dimensions on CMU PIE with 5, 10 and 20 images per individual randomly selected for training.
Table 1
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on Yale. Columns: Baseline | PCA | LDA | RLDA | LPP | NPE | SRDA.
G2: 43.93±2.73 (1024) | 43.93±2.73 (29) | 53.59±6.01 (14) | 55.37±4.11 (15) | 57.89±4.49 (14) | 56.41±5.02 (14) | 59.93±4.47 (15)
G3: 49.00±3.76 (1024) | 49.00±3.49 (43) | 64.92±5.17 (14) | 68.79±4.51 (16) | 68.21±4.46 (15) | 68.17±4.98 (14) | 72.71±4.84 (17)
G4: 52.19±3.72 (1024) | 52.19±3.72 (59) | 72.29±3.64 (14) | 76.62±3.58 (14) | 74.76±4.64 (14) | 74.62±3.51 (14) | 80.38±2.46 (16)
G5: 56.22±4.95 (1024) | 56.44±4.80 (72) | 78.00±3.54 (14) | 81.44±4.11 (15) | 80.33±4.02 (14) | 79.61±3.41 (14) | 84.50±3.84 (21)
G6: 59.27±3.86 (1024) | 59.47±3.85 (40) | 80.40±3.75 (14) | 86.07±2.95 (14) | 80.47±4.50 (14) | 80.20±4.18 (14) | 88.00±3.15 (19)
G7: 60.33±4.24 (1024) | 61.67±4.29 (29) | 83.83±3.90 (14) | 88.50±2.41 (30) | 82.92±3.78 (33) | 82.67±4.24 (14) | 89.83±3.33 (21)
G8: 62.44±3.88 (1024) | 63.33±4.59 (38) | 86.22±4.76 (14) | 88.89±4.08 (17) | 84.00±4.36 (14) | 84.11±4.22 (16) | 91.56±3.65 (17)
Table 2
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on Extended Yale B. Columns: Baseline | PCA | LDA | RLDA | LPP | NPE | SRDA.
G10: 53.00±1.22 (1024) | 49.36±1.14 (122) | 86.51±1.22 (36) | 87.38±1.03 (60) | 87.89±1.09 (65) | 83.14±1.64 (95) | 92.45±0.64 (219)
G20: 69.46±1.19 (1024) | 66.09±1.18 (165) | 94.84±0.67 (36) | 95.09±0.62 (48) | 95.72±0.46 (121) | 93.18±0.66 (137) | 97.03±0.59 (198)
G30: 77.20±0.99 (1024) | 74.15±1.25 (190) | 97.47±0.45 (35) | 97.58±0.47 (54) | 97.72±0.36 (122) | 95.81±0.51 (139) | 98.53±0.39 (166)
G40: 81.41±0.93 (1024) | 78.63±1.01 (205) | 98.36±0.43 (36) | 98.43±0.42 (36) | 98.39±0.40 (100) | 96.96±0.63 (170) | 99.23±0.29 (57)
G50: 84.22±1.19 (1024) | 81.61±1.42 (216) | 98.81±0.61 (37) | 98.88±0.54 (32) | 98.78±0.46 (120) | 97.62±0.68 (144) | 99.50±0.28 (110)
where $S_B$ and $S_W$ are respectively the between-class and within-class scatter matrices, which are calculated using the training data, $w^T w$ is the Tikhonov regularizer [47], and $\lambda_1$ and $\lambda_2$ are two parameters which control the tradeoff between the three terms in the denominator.

Substituting Eq. (6) into Eq. (7) and making some algebraic reformulation, we have

$\frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w + \lambda_2 J(w)} = \frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w + \lambda_2 w^T (X X^T - X S^T \tilde{G}^T - \tilde{G} S X^T + \tilde{G} S S^T \tilde{G}^T) w} = \frac{w^T S_B w}{w^T \big( S_W + \lambda_1 I + \lambda_2 (X X^T - X S^T \tilde{G}^T - \tilde{G} S X^T + \tilde{G} S S^T \tilde{G}^T) \big) w}. \qquad (10)$
Table 3
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on ORL. Columns: Baseline | PCA | LDA | RLDA | LPP | NPE | SRDA.
G2: 67.16±2.59 (1024) | 67.16±2.59 (79) | 72.48±2.91 (24) | 79.11±2.23 (43) | 78.41±2.27 (39) | 78.59±2.78 (39) | 82.00±2.62 (47)
G3: 75.84±3.70 (1024) | 75.84±3.70 (119) | 82.84±3.12 (39) | 88.59±3.24 (47) | 86.66±1.96 (39) | 86.50±2.09 (39) | 90.93±2.11 (49)
G4: 82.12±2.71 (1024) | 82.12±2.71 (159) | 89.67±2.17 (39) | 93.73±1.86 (54) | 90.67±1.82 (39) | 90.56±1.65 (39) | 94.92±1.24 (42)
G5: 86.35±2.31 (1024) | 86.35±2.31 (199) | 92.40±1.31 (39) | 95.95±1.81 (54) | 92.98±2.16 (39) | 93.20±1.80 (39) | 97.00±1.42 (56)
G6: 88.75±2.58 (1024) | 88.75±2.58 (239) | 93.69±2.62 (37) | 97.12±1.37 (40) | 94.41±1.45 (39) | 94.59±1.59 (38) | 98.38±1.04 (45)
G7: 90.54±2.76 (1024) | 90.54±2.76 (276) | 95.54±2.27 (39) | 98.12±1.11 (39) | 95.37±1.82 (40) | 95.21±1.75 (38) | 98.92±0.82 (42)
G8: 92.25±2.58 (1024) | 92.25±2.58 (316) | 96.63±2.19 (39) | 98.31±1.73 (39) | 95.56±2.55 (36) | 95.50±2.48 (41) | 99.50±0.75 (52)
Table 4
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on CMU PIE. Columns: Baseline | PCA | LDA | RLDA | LPP | NPE | SRDA.
G5: 29.55±0.65 (1024) | 28.17±0.60 (130) | 72.50±1.52 (38) | 73.56±1.20 (34) | 72.53±1.53 (37) | 71.48±1.59 (51) | 75.13±1.05 (30)
G10: 44.13±0.83 (1024) | 42.60±0.88 (172) | 86.53±0.93 (36) | 87.13±1.15 (38) | 86.59±0.94 (35) | 86.01±0.83 (70) | 89.45±0.55 (37)
G20: 61.80±0.68 (1024) | 60.28±0.60 (206) | 93.37±0.32 (44) | 93.58±0.35 (59) | 93.42±0.17 (52) | 92.32±0.58 (63) | 94.24±0.11 (48)
With $M = X X^T - X S^T \tilde{G}^T - \tilde{G} S X^T + \tilde{G} S S^T \tilde{G}^T$, the problem in Eq. (7) can be recast as

$\max_{w} \frac{w^T S_B w}{w^T (S_W + \lambda_1 I + \lambda_2 M) w}, \qquad (11)$

where $I$ is an identity matrix related to the Tikhonov regularizer and $M$ is the matrix corresponding to the sparsity preserving regularizer. The problem in Eq. (11) can be solved via the following generalized eigenvalue problem:

$S_B w = \eta (S_W + \lambda_1 I + \lambda_2 M) w. \qquad (12)$
Then the projecting matrix $W = [w_1, w_2, \ldots, w_d]$ consists of the eigenvectors corresponding to the largest $d$ eigenvalues. Based on the above discussion, the proposed SRDA is summarized in Fig. 3.

2.4. Computational complexity analysis

The computation of $S_B$ requires $O(m^2 K)$, while the computation of $S_W$ requires $O(m^2 n)$. The computational complexity of $K$ full PCA decompositions for matrices of size $m \times m$ is $O(K m^3)$. Here we only need to find the first $l_j$ eigenvalues and eigenvectors for class $j$, $j = 1, \ldots, K$, which can be done with the more efficient power method [48] in $O(m^2 l_j)$, so the $K$ PCA decompositions can be done in $O(m^2 \sum_{j=1}^{K} l_j)$. Calculating the sparse coefficient vector $s$ for a sample from class $j$ involves only a matrix–vector multiplication which scales as $O(m l_j)$, hence the sparse coefficient vectors for all samples can be obtained in $O(m \sum_{j=1}^{K} n_j l_j)$. The evaluation of $M$ requires $O(m^2 n \sum_{j=1}^{K} l_j)$. The generalized eigenvalue problem in Eq. (12) can be solved in $O(m^3)$. So the overall computational complexity of our proposed SRDA method is $O(m^2 n \sum_{j=1}^{K} l_j + m^3)$.

3. Comparison with related methods

In the proposed SRDA method, PCA is used as a basic procedure in the construction of the dictionary for sparse representation, and
if $\lambda_1 = \lambda_2 = 0$ in Eq. (7), SRDA becomes the standard LDA [9]; if $\lambda_1 \neq 0$ and $\lambda_2 = 0$, it reduces to RLDA [49]. LPP [19] and NPE [20] are two linear manifold learning methods aiming at preserving the local neighborhood structure of the data, while the regularization term in SRDA tries to preserve the sparse representation structure of the data. Qiao et al. [40] presented a semi-supervised dimensionality reduction method, called Sparsity Preserving Discriminant Analysis (SPDA). SPDA has a formulation similar to SRDA, but it differs from SRDA in the method for learning the sparse representation structure. In SPDA, $n$ time-consuming $\ell_1$ norm minimization problems need to be solved to learn the sparse representation structure, whose computational cost is $O(n^4)$, while the proposed SRDA method can obtain the sparse representation structure much faster through $K$ PCA decompositions and $n$ matrix–vector multiplications, whose computational cost is $O(m^2 \sum_{j=1}^{K} l_j + m \sum_{j=1}^{K} n_j l_j)$ with $n_j \ll n$, $l_j \ll n$ and $K \ll n$. Moreover, SRDA uses label information in both learning the sparse representation structure and the Fisher terms, while in SPDA label information is used only in the Fisher terms. Ref. [41] proposed a sparse representation structure based dimensionality reduction method, named Fast Fisher Sparsity Preserving Projections (FFSPP), which employs both the sparse representation structure and the Fisher term. Although the objective functions of FFSPP and SRDA share some common terms, the models of the two methods are different: SRDA is a quotient model while FFSPP is an additive model. The quotient model of SRDA can lead to better recognition performance, as is shown in Section 4. Zhang et al. [50,51] proposed a unifying framework, named "patch alignment", for dimensionality reduction and developed a new dimensionality reduction algorithm, Discriminative Locality Alignment (DLA), based on this framework. This framework consists of two stages: part optimization and whole alignment. The patch alignment framework reveals that (1) algorithms are intrinsically different in the patch optimization stage and (2) all algorithms share an almost identical whole alignment stage. Our proposed SRDA can be explained under this framework in a similar way as LDA [51]. Manifold elastic net (MEN) [52] is a unified framework for sparse dimensionality reduction, which incorporates the merits of both manifold learning based and sparse learning based dimensionality reduction. SRDA can also be unified under the framework of MEN through the patch alignment framework. The proposed SRDA is a supervised dimensionality reduction method. Another popular class of supervised dimensionality reduction methods is based on Non-negative Matrix Factorization (NMF) [53]. In [54], discriminant constraints are incorporated inside the NMF decomposition to enhance the classification accuracy of the NMF algorithm; this method is named Discriminant Non-negative Matrix Factorization (DNMF). Guan et al. [55] presented a Non-negative Patch Alignment Framework (NPAF) to unify popular NMF related dimensionality reduction methods. Based on NPAF, they developed a supervised dimensionality reduction method named Non-negative Discriminative Locality Alignment
Fig. 10. Boxplots of recognition rate for compared methods on Yale with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.
(NDLA). Guan et al. [56] introduced the manifold regularization and the margin maximization to NMF and obtained the Manifold regularized Discriminative NMF (MD-NMF) which is an effective supervised variant of NMF.
4. Experiments

As popular dimensionality reduction methods, PCA, LDA, Regularized LDA (RLDA), Locality Preserving Projections (LPP), and
Fig. 11. Boxplots of recognition rate for compared methods on Extended Yale B with 10, 20, 30, 40 and 50 images per individual randomly selected for training.
Neighborhood Preserving Embedding (NPE) have been successfully applied to face recognition under the names of Eigenface [10], Fisherface [12], R_Fisherface [15], Laplacianface [57], and NPEface [20]. Typically, the recognition process has three steps: (1) calculating the face subspace from the training samples using PCA, LDA, RLDA, LPP, or NPE; (2) projecting the testing face image into the d-dimensional face subspace; and (3) identifying the testing face image with a nearest neighbor classifier. In this section, we compare the recognition performance of these methods with our proposed SRDA method on four publicly available face data sets. In addition, the proposed SRDA method is compared with SPP, FFSPP, DLA, two representative regularized discriminant analysis methods: Laplacian Linear Discriminant Analysis (LapLDA) [16] and Locality Preserving Discriminant Projections (LPDP) [58], and two NMF related supervised dimensionality reduction methods: DNMF [54] and MD-NMF [56]. Finally, the sensitivity of our proposed SRDA method to its parameter $\lambda_2$ is studied. All experiments are performed with Matlab 7.8.0 on a personal computer with a Pentium-IV 3.20 GHz processor, 2 GB main memory and the Windows XP operating system.
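A minimal sketch of this three-step pipeline is given below. Here W stands for the projection matrix learned by any of the compared methods (for instance the srda_fit sketch in Section 2), images are stacked as unit-norm columns, and the function name recognize is only illustrative, not from the paper.

```python
import numpy as np

def recognize(W, X_train, y_train, X_test):
    """Project gallery and probe images with the learned subspace W and
    classify each probe image by its nearest gallery neighbor."""
    Z_train = W.T @ X_train                    # d x n gallery embeddings
    Z_test = W.T @ X_test                      # d x n_test probe embeddings
    preds = []
    for z in Z_test.T:
        dists = np.linalg.norm(Z_train - z[:, None], axis=0)
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)
```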
4.1. Database description

The Yale face database [12] contains 165 images of 15 individuals, with 11 images per individual. These images are taken under different facial expressions or configurations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised and wink. In our experiments, the cropped images of size 32 × 32 are used. Extended Yale B [59] consists of 2414 front-view face images of 38 individuals. There are about 64 images under different laboratory-controlled lighting conditions for each individual. In our experiments, each image is cropped to a resolution of 32 × 32. ORL¹ contains 400 images of 40 persons, with 10 images per person. The images are taken at different times, under different lighting and facial expressions. The faces are in an upright, frontal
1 ORL dataset is at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedata base.html.
Fig. 12. Boxplots of recognition rate for compared methods on ORL with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.
Fig. 13. Boxplots of recognition rate for compared methods on CMU PIE with 5, 10 and 20 images per individual randomly selected for training.
position, with tolerance for some side movement. The cropped images of size 32 × 32 are used. The CMU PIE [60] face database contains 41,368 face images of 68 individuals. The face images were captured by 13 synchronized cameras and 21 flashes, under varying pose, illumination, and expression. In our experiments, five near-frontal poses (C05, C07, C09, C27, and C29) are selected. This leaves us 170 face images per individual, and each image was resized to 32 × 32. Fig. 4 shows some faces from these face data sets. For all our experiments, each face image is normalized to have unit norm.

4.2. Experimental setting

In our experiments, each face database is randomly partitioned into gallery and probe sets of different sizes. For ease of representation, GN means that N images per person are randomly selected to form the gallery set and the remaining images are used as the probe set. The gallery set is used for training while the probe set is used for testing. Twenty random training/testing splits are used and the average classification accuracies are reported. PCA and LDA are parameter-free. For LPP and NPE, the supervised versions are used. In particular, the neighbor mode in LPP and NPE is set to 'supervised'; the weight mode in LPP is set to 'Cosine'. For RLDA, the regularization parameter $\lambda_1$ is set to 0.01 as in [40,61,62]. For SRDA, the number of leading principal components $l_j$ for class $j$ is set to $n_j - 1$, the regularization parameter $\lambda_1$ is set to be the same as in RLDA, and the regularization parameter $\lambda_2$ is set by 5-fold cross-validation, where $\lambda_2$ is tuned over $\{0.1, 0.2, \ldots, 2\}$. Since the dimensionality of the face vector space is much larger than the number of training samples, the un-regularized methods LDA, LPP, and NPE all involve a PCA preprocessing phase, that is, the training set X is projected onto a PCA subspace spanned by the leading eigenvectors. For Yale and ORL, which are relatively small databases, 100% energy is kept in the PCA preprocessing
phase, and for Extended Yale B and CMU PIE, which are relatively large-scale databases, 98% energy is kept in the PCA preprocessing phase in order to obtain all the experimental results in a reasonable time.

4.3. Experimental results and discussions

As shown in the proposed SRDA method, the eigenvectors spanning the learned subspace can be obtained via the generalized eigenvalue problem in Eq. (12). These eigenvectors can be resized and displayed as images. The first 10 Eigenfaces, Fisherfaces, R_Fisherfaces, Laplacianfaces, NPEfaces, and SRDAfaces of the training set of the ORL data set (5 images per individual are used for training) are shown in Fig. 5. From the results shown in the figures, we can see that the discriminative information contained in the SRDAfaces is richer than that in the other faces. In general, the performance of a dimensionality reduction method varies with the number of dimensions. Figs. 6–9 show the plots of recognition rate versus dimensions for PCA, LDA, RLDA, LPP, NPE and SRDA on the four tested data sets. The best results and the associated dimensionality of these compared methods on the Yale, Extended Yale B, ORL and CMU PIE data sets are summarized in Tables 1–4, respectively, in which the best performance for each data set is highlighted, Baseline represents the classification results of the nearest neighbor classifier without dimensionality reduction, and the number in brackets is the corresponding optimal projecting dimension. Moreover, it needs to be noted that the upper bound for the dimensionality of LDA is K − 1, because there are at most K − 1 generalized nonzero eigenvalues [12]. From these experimental results we observe that:
(1) The performance of PCA is much worse than that of the other methods and is only comparable to the Baseline method, because it is unsupervised and the label information is not considered.
Fig. 14. Recognition rate versus dimensions on Yale with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.
Fig. 15. Recognition rate versus dimensions on Extended Yale B with 10, 20, 30, 40 and 50 images per individual randomly selected for training.
(2) In most cases, SRDA and RLDA are superior to the other compared methods. This is because regularization techniques are used in them, which is consistent with existing results [14,15,61,62].
(3) The recognition rate of SRDA is better than that of all the other compared methods, especially when the size of the training set is small. This is because SRDA simultaneously seeks an efficient discriminating subspace and preserves the sparse representation structure. The superiority of SRDA under the small sample setting validates that SRDA is an effective method for the SSS problem.
(4) In most cases, the recognition rate of SRDA increases more quickly than that of the other methods as the dimension increases at the beginning, since label information is used both in learning the sparse representation structure and in computing the Fisher terms. This indicates that SRDA can obtain better performance than the other methods with a very low dimensional subspace.
(5) The discrimination power of SRDA is enhanced as the projecting dimension increases, but only up to some point. When the projecting dimension is higher than that point, the classification accuracy stands still or even decreases. How to determine the optimal dimension is still an open problem.
Boxplots of the experimental results of these compared methods on the four face data sets are shown in Figs. 10–13. A boxplot produces a box-and-whisker plot for each method. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, and the whiskers extend from each end of the box to the adjacent values in the data (by default, the most extreme values within 1.5 times the interquartile range from the ends of the box). As is shown in Figs. 10–13, for the most part, RLDA achieves a more robust recognition rate with respect to random training/testing splits than LDA because of the use of regularization
Fig. 16. Recognition rate versus dimensions on ORL with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.
Fig. 17. Recognition rate versus dimensions on CMU PIE with 5, 10 and 20 images per individual randomly selected for training.

Table 5
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on Yale. Columns: SPP | FFSPP | DLA | SRDA.
G2: 52.07±3.84 (29) | 55.19±3.67 (14) | 57.70±6.27 (14) | 59.93±4.47 (15)
G3: 65.32±3.82 (44) | 69.29±2.90 (15) | 71.00±3.56 (19) | 72.71±4.84 (17)
G4: 72.05±4.37 (59) | 73.38±2.63 (15) | 78.29±3.70 (28) | 80.38±2.46 (16)
G5: 74.78±4.33 (67) | 77.72±3.60 (16) | 81.78±3.92 (20) | 84.50±3.84 (21)
G6: 76.58±5.02 (89) | 82.33±4.28 (20) | 85.40±3.16 (16) | 88.00±3.15 (19)
G7: 77.78±4.64 (104) | 83.58±3.98 (20) | 88.00±3.09 (41) | 89.83±3.33 (21)
G8: 80.45±5.82 (89) | 84.89±5.08 (14) | 90.44±3.31 (15) | 91.56±3.65 (17)
Table 6
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on Extended Yale B. Columns: SPP | FFSPP | DLA | SRDA.
G10: 83.87±0.97 (124) | 86.42±0.88 (107) | 89.43±0.94 (76) | 92.45±0.64 (219)
G20: 92.06±0.90 (164) | 94.92±0.75 (36) | 96.18±0.53 (109) | 97.03±0.59 (198)
G30: 94.82±0.60 (189) | 97.32±0.47 (36) | 97.98±0.49 (142) | 98.53±0.39 (166)
G40: 95.93±0.49 (205) | 98.37±0.43 (37) | 98.76±0.43 (169) | 99.23±0.29 (57)
G50: 96.49±0.71 (215) | 98.93±0.53 (34) | 99.32±0.51 (181) | 99.50±0.28 (110)
Table 7
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on ORL. Columns: SPP | FFSPP | DLA | SRDA.
G2: 73.31±2.93 (69) | 77.39±2.39 (41) | 79.72±2.25 (48) | 82.00±2.62 (47)
G3: 80.63±2.22 (119) | 85.98±2.37 (47) | 89.04±2.24 (64) | 90.93±2.11 (49)
G4: 84.50±2.42 (146) | 90.48±2.69 (46) | 93.10±1.84 (88) | 94.92±1.24 (42)
G5: 87.13±2.17 (148) | 93.77±1.78 (73) | 96.43±0.73 (119) | 97.00±1.42 (56)
G6: 88.44±2.64 (181) | 94.50±1.50 (61) | 97.69±1.15 (114) | 98.38±1.04 (45)
G7: 89.71±2.80 (179) | 96.21±2.19 (101) | 98.04±1.06 (147) | 98.92±0.82 (42)
G8: 91.06±3.72 (196) | 97.31±2.12 (70) | 98.25±1.18 (178) | 99.50±0.75 (52)
techniques. In most cases, SRDA is more robust than the other compared methods because two regularization terms (the Tikhonov regularizer and the sparsity preserving regularizer) are used in SRDA.
In addition, for the most part, PCA performs worse than the other compared methods in robustness with respect to random training/testing splits, because it does not consider the label information.
Table 8
The best recognition rate (mean ± std, %) and the associated dimensionality (in parentheses) on CMU PIE. Columns: SPP | FFSPP | DLA | SRDA.
G5: 63.91±1.31 (130) | 70.61±1.32 (37) | 73.16±1.68 (66) | 75.13±1.05 (33)
G10: 77.20±1.23 (169) | 85.97±1.03 (47) | 87.19±1.05 (126) | 89.45±0.55 (37)
G20: 86.07±0.59 (203) | 93.32±0.26 (54) | 93.12±0.49 (176) | 94.24±0.11 (52)
Table 9
The time (s) for learning the embedding function on Yale.
Method | G2 | G3 | G4 | G5 | G6 | G7 | G8
SPP | 0.2371 | 0.5594 | 0.9502 | 1.8754 | 2.6294 | 5.0403 | 7.4958
SRDA | 0.0984 | 0.1484 | 0.1969 | 0.2544 | 0.3326 | 0.4240 | 0.5186
Table 10
The time (s) for learning the embedding function on Extended Yale B.
Method | G10 | G20 | G30 | G40 | G50
SPP | 766.2118 | 2760 | 5793 | 8450 | 10113
SRDA | 1.0218 | 2.3161 | 5.3356 | 5.4323 | 9.1306
Table 11
The time (s) for learning the embedding function on ORL.
Method | G2 | G3 | G4 | G5 | G6 | G7 | G8
SPP | 2.0176 | 6.8909 | 20.8091 | 60.8904 | 107.2234 | 251.1639 | 360.8560
SRDA | 1.0224 | 1.2106 | 1.4683 | 1.7137 | 1.8829 | 2.0477 | 2.2986
Table 12
The time (s) for learning the embedding function on CMU PIE.
Method | G5 | G10 | G20
SPP | 983.3388 | 4343.7 | 9298.6
SRDA | 0.4228 | 1.1214 | 2.9944
Table 13
Recognition rate (mean ± std, %) comparison of RLDA, LapLDA, LPDP and SRDA on Yale; the associated dimensionality is given in parentheses. Columns: RLDA | LapLDA [16] | LPDP [58] | SRDA.
G2: 55.37±4.11 (15) | 56.04±4.94 (13) | 57.74±5.90 (14) | 59.93±4.47 (15)
G3: 68.79±4.51 (16) | 69.17±3.60 (14) | 71.75±4.50 (14) | 72.71±4.84 (17)
G4: 76.62±3.58 (14) | 75.86±3.86 (14) | 78.90±3.86 (16) | 80.38±2.46 (16)
G5: 81.44±4.11 (15) | 80.06±4.21 (14) | 81.78±3.75 (13) | 84.50±3.84 (21)
G6: 86.07±2.95 (14) | 82.53±4.08 (15) | 86.73±3.95 (14) | 88.00±3.15 (19)
G7: 88.50±2.41 (30) | 85.08±4.38 (15) | 88.17±3.24 (14) | 89.83±3.33 (21)
G8: 88.89±4.08 (17) | 89.00±4.17 (14) | 90.67±2.35 (14) | 91.56±3.65 (17)

4.4. Comparisons with other state-of-the-art dimensionality reduction methods

In this part, we compare the proposed method with the recently proposed sparse representation structure based methods SPP [39] and FFSPP [41], the patch alignment based method DLA [51], two representative regularized discriminant analysis methods: Laplacian Linear Discriminant Analysis (LapLDA) [16] and Locality Preserving Discriminant Projections (LPDP) [58], and two NMF related supervised dimensionality reduction methods, DNMF [54] and MD-NMF [56]. Figs. 14–17 show the plots of recognition rate versus dimensions for SPP, FFSPP, DLA and SRDA on the four tested data sets. The parameter settings for SPP, FFSPP and DLA are the same as in [39,41,51]. The best results and the associated dimensionality of these compared methods on the Yale, Extended Yale B, ORL and CMU PIE data sets are summarized in Tables 5–8, respectively. From these experimental results, SRDA achieves the best performance, followed by DLA, FFSPP and SPP. Moreover, the time comparisons for learning the embedding functions of SPP and SRDA on the four tested data sets are shown in Tables 9–12. As shown in Tables 9–12, SRDA is much faster than SPP due to the efficient evaluation of the sparse coefficient vectors in SRDA. Therefore, SRDA not only achieves better recognition performance than SPP, but is also more time efficient. LapLDA is a regularized LDA method aiming at integrating the global and local structures by introducing a locality preserving regularization term into LDA. LPDP is proposed by adding the maximum margin criterion (MMC) into the objective function of
Table 14
Recognition rate (mean ± std, %) comparison of RLDA, LapLDA, LPDP and SRDA on Extended Yale B; the associated dimensionality is given in parentheses. Columns: RLDA | LapLDA [16] | LPDP [58] | SRDA.
G10: 87.38±1.03 (60) | 87.22±0.99 (58) | — | 92.45±0.64 (219)
G20: 95.09±0.62 (48) | 94.66±0.80 (52) | — | 97.03±0.59 (198)
G30: 97.58±0.47 (54) | 97.16±0.49 (36) | — | 98.53±0.39 (166)
G40: 98.43±0.42 (36) | 98.34±0.36 (38) | — | 99.23±0.29 (57)
G50: 98.88±0.54 (32) | 98.87±0.32 (29) | — | 99.50±0.28 (110)
Table 15
Recognition rate (mean ± std, %) comparison of RLDA, LapLDA, LPDP and SRDA on ORL; the associated dimensionality is given in parentheses. Columns: RLDA | LapLDA [16] | LPDP [58] | SRDA.
G2: 79.11±2.23 (43) | 79.56±2.77 (39) | 81.11±3.10 (37) | 82.00±2.62 (47)
G3: 88.59±3.24 (47) | 89.18±1.98 (41) | 91.14±1.69 (39) | 90.93±2.11 (49)
G4: 93.73±1.86 (54) | 92.60±2.21 (39) | 95.15±1.38 (39) | 94.92±1.24 (42)
G5: 95.95±1.81 (54) | 95.47±1.90 (53) | 97.70±1.06 (40) | 97.00±1.42 (56)
G6: 97.12±1.37 (40) | 96.00±2.39 (46) | 98.25±1.16 (40) | 98.38±1.04 (45)
G7: 98.12±1.11 (39) | 97.42±1.75 (40) | 98.63±1.39 (42) | 98.92±0.82 (42)
G8: 98.31±1.73 (39) | 98.06±1.88 (56) | 99.06±0.90 (40) | 99.50±0.75 (52)
Fig. 18. Recognition rate comparison of DNMF, MD-NMF, and SRDA on Yale with 3, 5 and 7 images per individual randomly selected for training.
Fig. 19. Recognition rate comparison of DNMF, MD-NMF, and SRDA on ORL with 3, 5 and 7 images per individual randomly selected for training.
Locality Preserving Projections, aiming at retaining the locality preserving characteristic of LPP and using the global discriminative structure from MMC. The experimental results are presented in Tables 13–15 for Yale, Extended Yale B and ORL respectively. As a classical regularized LDA method, the results of RLDA are also listed in the tables for comparison. Thus, here we essentially compare three kinds of regularized discriminant analysis methods: non-data-dependent regularized discriminant analysis (RLDA), local structure preserving discriminant analysis (LapLDA and LPDP), and sparse structure preserving discriminant analysis (SRDA). From the experimental results, RLDA is superior to LapLDA in most cases, which is consistent with the empirical study in [62]. LPDP achieves better performance than RLDA and LapLDA consistently. Moreover, for the most part, SRDA performs better than the other three compared regularized discriminant analysis methods. The recognition rate comparisons of DNMF, MD-NMF, and SRDA on Yale and ORL with 3, 5 and 7 images per individual randomly selected for training are shown in Figs. 18 and 19 respectively. From the experimental results in Figs. 18 and 19, SRDA achieves better recognition performance than these two NMF related supervised dimensionality reduction methods DNMF and MD-NMF.
4.5. Analysis of the parameter λ2

There are two parameters in our proposed SRDA algorithm: the regularization parameters $\lambda_1$ and $\lambda_2$. In the experiments, $\lambda_1$ is set to be the same as in the previous works [40,61,62]. In this part, the influence of $\lambda_2$ on the performance of SRDA is investigated. Fig. 20 shows the influence of $\lambda_2$ on the performance of SRDA on Yale (G6), Extended Yale B (G30), ORL (G5) and CMU PIE (G10), respectively, when $\lambda_2$ ranges from 0.1 to 2 with an interval of 0.1. From the experimental results, the performance of SRDA does not fluctuate greatly with the variation of $\lambda_2$ on any of the four tested data sets, so SRDA is robust to the regularization parameter $\lambda_2$.
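As an illustration of how λ2 can be chosen from this range, the sketch below scores each candidate value on a single held-out validation split. The paper itself uses 5-fold cross-validation over {0.1, 0.2, …, 2}, so this single-split version, built on the hypothetical helpers srda_fit and recognize from the earlier sketches, is only a simplified stand-in.

```python
import numpy as np

def tune_lambda2(X_by_class, l, y_train, X_val, y_val, lam1=0.01, d=30):
    """Grid search for lambda_2 over 0.1, 0.2, ..., 2.0, scored by the
    recognition rate of the projected nearest-neighbor classifier on a
    held-out split (a simplified stand-in for 5-fold cross-validation)."""
    X_train = np.hstack(X_by_class)   # y_train must follow the same class-wise ordering
    best_lam2, best_acc = 0.1, -1.0
    for lam2 in np.arange(0.1, 2.01, 0.1):
        W = srda_fit(X_by_class, l, lam1=lam1, lam2=lam2, d=d)
        acc = np.mean(recognize(W, X_train, y_train, X_val) == y_val)
        if acc > best_acc:
            best_lam2, best_acc = lam2, acc
    return best_lam2
```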
5. Conclusions and future work

In this paper, we proposed a novel dimensionality reduction method for face recognition, named Sparse Regularization Discriminant Analysis (SRDA), which aims at simultaneously seeking an efficient discriminating subspace and preserving the sparse representation structure. Specifically, SRDA first constructs a concatenated dictionary through class-wise PCA decompositions and learns the sparse representation structure under the constructed
Fig. 20. Influence of λ2 on the performance of SRDA on Yale, Extended Yale B, ORL and CMU PIE respectively.
dictionary quickly through matrix–vector multiplications. Then, SRDA takes into account both the sparse representation structure and the discriminating efficiency by using the learned sparse representation structure as a regularization term of linear discriminant analysis. Finally, the optimal embedding of the data is learned by solving a generalized eigenvalue problem. The extensive experiments on four publicly available face data sets validate the promising performance of our proposed SRDA approach. Our SRDA method may not perform well on class-imbalanced problems, since it needs to perform a PCA decomposition for each class. For a class whose sample size is small, PCA on this class cannot capture the subspace structure of the class well, which will deteriorate the performance of our SRDA method. How to tackle this problem is one of our future focuses. Another topic worth exploring is how to extend our SRDA method to the semi-supervised scenario, an important scenario in which a lot of unlabeled face images can be obtained easily due to the popularity of digital cameras.
Acknowledgments The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to significantly improve the quality of this paper. This work is supported by the National Natural Science Foundation of China, Nos. 61072106, 60971112, 60971128, 60970067, 61072108; the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project), No. B07048; the Program for Cheung Kong Scholars and Innovative Research Team in University, No. IRT1170.
References

[1] W. Zhao, R. Chellappa, P. Phillips, A. Rosenfeld, Face recognition: a literature survey, ACM Comput. Surv. 35 (4) (2003) 399–458.
[2] M. Berry, Survey of Text Mining: Clustering, Classification and Retrieval, Springer, New York, 2003.
[3] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008.
[4] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[5] L.O. Jimenez, D.A. Landgrebe, Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data, IEEE Trans. Syst. Man Cybern. 28 (1) (1997) 39–54.
[6] F. Shang, L.C. Jiao, J. Shi, J. Chai, Robust positive semidefinite L-Isomap ensemble, Pattern Recognit. Lett. 32 (4) (2011) 640–649.
[7] S. Gunal, R. Edizkan, Subspace based feature selection for pattern recognition, Inf. Sci. 178 (19) (2008) 3716–3726.
[8] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[9] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1990.
[10] M.A. Turk, A.P. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
[11] F. Wang, X. Wang, D. Zhang, C. Zhang, T. Li, MarginFace: a novel face recognition method by average neighborhood margin maximization, Pattern Recognit. 42 (11) (2009) 2863–2875.
[12] P.N. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[13] S.J. Raudys, A.K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991) 252–264.
[14] J.H. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc. 84 (405) (1989) 165–175.
[15] S. Ji, J. Ye, Generalized linear discriminant analysis: a unified framework and efficient model selection, IEEE Trans. Neural Netw. 19 (10) (2008) 1768–1782.
[16] J. Chen, J. Ye, Q. Li, Integrating global and local structures: a least squares framework for dimensionality reduction, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[17] D. Tao, X. Li, X. Wu, General averaged divergence analysis, in: Proceedings of the IEEE International Conference on Data Mining (ICDM), 2007, pp. 2272–2279.
[18] D. Tao, X. Li, X. Wu, Geometric mean for subspace selection, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 260–274.
[19] X. He, P. Niyogi, Locality preserving projections, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), vol. 16, 2003, pp. 585–591.
[20] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood preserving embedding, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2005, pp. 1208–1213.
[21] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[22] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[23] E. Candès, Compressive sampling, in: Proceedings of the International Congress of Mathematicians, 2006, pp. 1433–1452.
[24] D. Donoho, For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution, Comm. Pure Appl. Math. 59 (6) (2006) 797–829.
[25] D. Donoho, Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006) 1289–1306.
[26] R.G. Baraniuk, M.B. Wakin, Random projections of smooth manifolds, Found. Comput. Math. 9 (1) (2009) 51–77.
[27] M.A. Davenport, P.T. Boufounos, M.B. Wakin, R.G. Baraniuk, Signal processing with compressive measurements, IEEE J. Sel. Top. Signal Process. 4 (2) (2010) 445–460.
[28] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[29] J. Cai, H. Ji, X. Liu, Z. Shen, Blind motion deblurring from a single image using sparse approximation, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 104–111.
[30] X. Lu, Y. Yuan, P. Yan, Sparse coding for image denoising using spike and slab prior, Neurocomputing 106 (2013) 12–20.
[31] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Non-local sparse models for image restoration, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2272–2279.
[32] J. Yang, J. Wright, T. Huang, Y. Ma, Image super-resolution as sparse representation of raw patches, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[33] W. Liu, S. Li, Multi-morphology image super-resolution via sparse representation, Neurocomputing (2013), http://dx.doi.org/10.1016/j.neucom.2013.04.002.
[34] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Supervised dictionary learning, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1033–1040.
[35] K. Huang, S. Aviyente, Sparse representation for signal classification, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2006, pp. 585–591.
[36] Y. Xu, Q. Zhu, Z. Fan, Y. Wang, J. Pan, From the idea of “sparse representation” to a representation-based transformation method for feature extraction, Neurocomputing 113 (2013) 168–176.
[37] F. Zang, J. Zhang, Discriminative learning by sparse representation for classification, Neurocomputing 74 (12–13) (2011) 2176–2183.
[38] Y. Han, F. Wu, D. Tao, Sparse unsupervised dimensionality reduction for multiple view data, IEEE Trans. Circuits Syst. Video Technol. 22 (10) (2012) 1485–1496.
[39] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. 43 (1) (2010) 331–341.
[40] L. Qiao, S. Chen, X. Tan, Sparsity preserving discriminant analysis for single training image face recognition, Pattern Recognit. Lett. 31 (5) (2010) 422–429.
[41] F. Yin, L.C. Jiao, F. Shang, S. Wang, B. Hou, Fast Fisher sparsity preserving projections, Neural Comput. Appl. 23 (3–4) (2013) 691–705.
[42] T. Zhou, D. Tao, Double shrinking sparse dimension reduction, IEEE Trans. Image Process. 22 (1) (2013) 224–257.
[43] N. Guan, D. Tao, Z. Luo, et al., MahNMF: Manhattan non-negative matrix factorization, CoRR abs/1207.3438, 2012.
[44] R. Baraniuk, Compressive sensing, IEEE Signal Process. Mag. 24 (4) (2007) 118–121.
[45] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[46] R. Basri, D. Jacobs, Lambertian reflectance and linear subspaces, IEEE Trans. Pattern Anal. Mach. Intell. 25 (3) (2003) 218–233.
[47] A.N. Tikhonov, V.Y. Arsenin, Solution of Ill-posed Problems, Winston & Sons, Washington, 1977.
[48] G. Golub, C. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, 1996.
[49] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer, New York, 2009.
[50] T. Zhang, D. Tao, J. Yang, Discriminative locality alignment, in: Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2008, pp. 725–738.
[51] T. Zhang, D. Tao, X. Li, et al., Patch alignment for dimensionality reduction, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1299–1313.
[52] T. Zhou, D. Tao, X. Wu, Manifold elastic net: a unified framework for sparse dimension reduction, Data Min. Knowl. Discov. 22 (3) (2011) 340–371.
[53] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (21) (1999) 788–791.
[54] S. Zafeiriou, A. Tefas, I. Buciu, I. Pitas, Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification, IEEE Trans. Neural Netw. 17 (3) (2006) 683–695.
[55] N. Guan, D. Tao, Z. Luo, B. Yuan, Non-negative patch alignment framework, IEEE Trans. Neural Netw. 22 (8) (2011) 1218–1230.
[56] N. Guan, D. Tao, Z. Luo, B. Yuan, Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent, IEEE Trans. Image Process. 20 (7) (2011) 2030–2048.
[57] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[58] J. Gui, W. Jia, L. Zhu, S. Wang, D. Huang, Locality preserving discriminant projections for face and palmprint recognition, Neurocomputing 73 (13–15) (2010) 2696–2707.
[59] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.
[60] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression database, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1615–1618.
[61] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[62] L. Qiao, L. Zhang, S. Chen, An empirical study of two typical locality preserving linear discriminant analysis methods, Neurocomputing 73 (10–12) (2010) 1587–1594.
Fei Yin was born in Shaanxi, China, on October 8, 1984. He received the B.S. degree in computer science from Xidian University, Xi'an, China, in 2006. He is currently working toward the Ph.D. degree in pattern recognition and intelligent systems at the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an, China. His current research interests include pattern recognition, machine learning, data mining, and computer vision.

L.C. Jiao received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982, and the M.S. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 1984 and 1990, respectively. He is the author or coauthor of more than 200 scientific papers. His current research interests include signal and image processing, nonlinear circuit and systems theory, learning theory and algorithms, optimization problems, wavelet theory, and data mining.

Fanhua Shang is currently pursuing the Ph.D. degree in circuits and systems at the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an, China. His current research interests include pattern recognition, machine learning, data mining, and computer vision.
Lin Xiong was born in Shaanxi, China, on July 15, 1981. He received the B.S. degree from Shaanxi University of Science & Technology, Xi'an, China, in 2003, and worked for Foxconn Co. from 2003 to 2005. Since 2006, he has been working toward the M.S. and Ph.D. degrees in pattern recognition and intelligent information at Xidian University. His research interests include active learning, ensemble learning, low-rank and sparse matrix factorization, subspace tracking, and background modeling in video surveillance.
Xiaodong Wang received the B.S. degree from Harbin Institute of Technology, Harbin, China, in 1988, and the M.S. degree from Inner Mongolia University of Technology, Hohhot, China, in 2007. He is currently working toward the Ph.D. degree in computer application technology at the School of Computer Science and Technology, Xidian University, and the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xi'an, China. His current research interests include convex optimization, compressive sensing and pattern recognition.