Sparse regularization discriminant analysis for face recognition


Neurocomputing 128 (2014) 341–362


Fei Yin*, L.C. Jiao, Fanhua Shang, Lin Xiong, Xiaodong Wang
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, China



Article history: Received 31 October 2012; received in revised form 26 August 2013; accepted 30 August 2013; available online 23 October 2013. Communicated by D. Tao.

Abstract: Recently the underlying sparse representation structure in high dimensional data has attracted considerable interest in pattern recognition and computer vision. Sparse representation structure means the sparse reconstructive relationship of the data. In this paper, we propose a novel dimensionality reduction method called Sparse Regularization Discriminant Analysis (SRDA), which aims to preserve the sparse representation structure of the data while learning an efficient discriminating subspace. More specifically, SRDA first constructs a concatenated dictionary through class-wise PCA decompositions, which conduct PCA on the data of each class separately, and learns the sparse representation structure under the constructed dictionary quickly through matrix–vector multiplications. Then SRDA takes into account both the sparse representation structure and the discriminating efficiency by using the learned sparse representation structure as a regularization term of linear discriminant analysis. Finally, the optimal embedding of the data is learned by solving a generalized eigenvalue problem. The extensive and promising experimental results on four publicly available face data sets (Yale, Extended Yale B, ORL and CMU PIE) validate the feasibility and effectiveness of the proposed method. © 2013 Elsevier B.V. All rights reserved.

Keywords: Subspace learning; Sparse representation; Class-wise PCA decompositions; Linear discriminant analysis; Face recognition

1. Introduction

In various fields of scientific research such as face recognition [1], text classification [2], and information retrieval [3], one is often confronted with data lying in a very high dimensional space. Generally, high dimensional data bring many problems in practical applications: (1) the computational complexity limits the application of many practical technologies; and (2) model estimation deteriorates when the number of samples is small compared to the dimensionality (the so-called small sample size problem). These problems are referred to as "the curse of dimensionality" in machine learning and data mining [4]. In practice, dimensionality reduction has been employed as an effective way of handling these problems [5–7]. In the past few decades, many dimensionality reduction methods have been designed. Based on the data structure they consider, these methods can be divided into three categories: global structure based methods, local neighborhood structure based methods, and the recently presented sparse representation structure based methods. Principal Component Analysis (PCA) [8] and Linear Discriminant Analysis (LDA) [9] belong to the global structure based methods.

* Correspondence to: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Mailbox 224, Xidian University, No. 2 South TaiBai Road, Xi'an 710071, China. Tel.: +86 29 88202661; fax: +86 29 88201023. E-mail address: [email protected] (F. Yin).

http://dx.doi.org/10.1016/j.neucom.2013.08.032

PCA, which aims at maximizing the variance of the projected data, has been applied widely in science and engineering. The classical Eigenface method [10] for face recognition is an application of PCA. PCA is a good method for representation, but it does not consider the label information, so it may not be reliable for classification [11]. LDA tries to simultaneously maximize the between-class scatter and minimize the within-class scatter. Because the label information is employed, LDA has proven to be more effective than PCA for classification; it is used in face recognition under the name of Fisherface [12]. However, LDA can extract at most K−1 features (K is the number of classes), which is undesirable in many cases. Moreover, it confronts the Small Sample Size (SSS) problem [13] when dealing with high dimensional data. To address these problems, Regularized LDA (RLDA) has been presented [14,15]. Another strategy is to incorporate a locality preserving term into LDA, and the representative method is Laplacian Linear Discriminant Analysis (LapLDA) [16]. Besides, the projection of LDA tends to merge classes which are close together in the original feature space. To tackle this problem, Tao et al. [17,18] proposed Geometric Mean Divergence Analysis (GMDA) from the viewpoint of maximizing the geometric mean of the Kullback–Leibler (KL) divergences between different classes. Local neighborhood structure based methods include Locality Preserving Projections (LPP) [19] and Neighborhood Preserving Embedding (NPE) [20]. LPP finds an embedding that preserves the local neighborhood structure, and obtains a face subspace best fitting the essential face manifold.


NPE also aims at preserving the local neighborhood structure of the data. Different from LPP, the affinity matrix in NPE is obtained by solving local least squares problems instead of being defined directly. In essence, LPP is the optimal linear approximation to the classical manifold learning method Laplacian Eigenmaps (LE) [21], and it has been proved that NPE is a linearized version of the popular manifold learning method Locally Linear Embedding (LLE) [22].

As a new signal representation method, sparse representation has received considerable attention in recent years, especially in compressed sensing [23–27]. It models a signal as a sparse linear combination of elementary signals from a dictionary; the coefficient vector of this sparse linear combination is called the sparse coefficient vector. Sparse representation has proven to be a powerful tool for signal processing and pattern classification [28]. Under the ℓ1 minimization framework, sparse representation has been successfully applied to denoising and inpainting [29–31], image super-resolution [32,33], image classification [34,35], etc. Sparse models have also been employed in dimensionality reduction [36–38]; representative methods include Sparsity Preserving Projections (SPP) [39], Sparsity Preserving Discriminant Analysis (SPDA) [40], Fast Fisher Sparsity Preserving Projections (FFSPP) [41], Double Shrinking Sparse Dimension Reduction [42], and Group Sparse Manhattan Non-negative Matrix Factorization [43].

In this paper, sparse representation structure means that every sample can be linearly represented by other samples from the same class, and thus can be sparsely linearly represented by samples from all classes. That is, sparse representation structure means the sparse reconstructive relationship of the data, which originates from the subspace assumption: samples from a single class lie on a low dimensional linear subspace. Recently, Sparsity Preserving Projections (SPP) [39], which considers the sparse representation structure, has been proposed. It aims to preserve the sparse reconstructive relationship of the data, which is achieved by minimizing an ℓ1 regularization-related objective function. SPP is a novel and effective approach to dimensionality reduction, but it has some shortcomings: (1) its computational complexity is high, because n (the number of training samples) ℓ1 norm optimization problems need to be solved to learn the sparse representation structure; and (2) it does not take advantage of the label information, which is very useful for classification problems such as face recognition. An ℓ1 norm optimization problem needs to be solved to obtain the sparse coefficient vector of a single sample, at a cost of approximately O(n^3) [44], so the cost of obtaining the sparse coefficient vectors for all samples is O(n^4). For large-scale problems, SPP is therefore computationally prohibitive.

In this paper, in order to learn an efficient discriminating subspace which preserves the sparse representation structure of the data while avoiding the disadvantages of SPP, we propose a novel dimensionality reduction method: Sparse Regularization Discriminant Analysis (SRDA). SRDA first constructs a concatenated dictionary through class-wise PCA decompositions and learns the sparse representation structure under the constructed dictionary quickly through matrix–vector multiplications. Then SRDA takes into account both the sparse representation structure and the discriminating efficiency by using the learned sparse representation structure as a regularization term of linear discriminant analysis. Finally, the optimal embedding of the data is learned by solving a generalized eigenvalue problem. It is worthwhile to highlight some aspects of SRDA:
(1) SRDA is a novel dimensionality reduction method which simultaneously seeks an efficient discriminating subspace and preserves the sparse representation structure.
(2) Based on the special concatenated dictionary constructed via class-wise PCA decompositions, the sparse coefficient vector in SRDA can be obtained via a basic matrix operation: matrix–vector multiplication. Compared to SPP, in which the sparse coefficient vector is computed via ℓ1 norm optimization, the computational cost of learning the sparse representation structure is greatly reduced.
(3) Compared to LDA, the number of features that SRDA can find is not limited to K−1. Besides, SRDA effectively avoids the SSS problem, since the singularity of the within-class scatter matrix vanishes thanks to the regularization term in the SRDA formulation.
(4) In SRDA, label information is used in two ways. First, it is used in building the dictionary for sparse representation and computing the sparse coefficient vectors, which may lead to a more discriminating sparse representation structure. Second, it is used in computing the within-class and between-class scatter matrices.

The rest of the paper is organized as follows. SRDA is proposed in Section 2 and compared with related work in Section 3. Experimental results and discussions are presented in Section 4. Finally, Section 5 provides some concluding remarks and future work.

2. Sparse regularization discriminant analysis (SRDA)

In this section, we describe the proposed SRDA in detail. SRDA consists of three steps: (1) learning the sparse representation structure; (2) constructing the sparsity preserving regularizer; and (3) integrating the learned sparsity preserving regularizer into the objective function of LDA to form the SRDA objective function and solving the optimization problem to obtain the embedding functions.

2.1. Learning the sparse representation structure

Given a set of training samples {x_i}, i = 1, ..., n, with x_i in R^m, let X = [x_1, x_2, ..., x_n] in R^{m×n} be the data matrix consisting of all the training samples, and write X = [X_1, X_2, ..., X_K], where X_j in R^{m×n_j} contains the samples of class j. Assume that the samples of a single class lie on a linear subspace. The subspace model is flexible enough to capture much of the variation in real-world data sets and is a simple and effective model in the context of face recognition [45]; it has been observed that face images under various lighting conditions and expressions lie on a special low-dimensional subspace [12,46]. We conduct a PCA decomposition for each class X_j, j = 1, 2, ..., K, whose objective function is

\max_{\|g\| = 1} g^{T} \Sigma_j g,   (1)

where Σ_j is the data covariance matrix of X_j. For each class j, the first l_j principal components are preserved to constitute G_j = [g_1, g_2, ..., g_{l_j}]. Considering that the sample mean of X_j is usually not at the origin, in order to fully represent a sample of class j, G_j is extended to G̃_j = [g_1, g_2, ..., g_{l_j}, x_mean^j], where x_mean^j is the mean vector of the samples of class j (in fact, through some preprocessing, x_mean^j can be removed, which is detailed in Section 2.2). So, a sample x from class j can be represented as

x = [g_1, g_2, \ldots, g_{l_j}, x_{\mathrm{mean}}^{j}]\, s_j = \tilde{G}_j s_j = [\tilde{G}_1, \tilde{G}_2, \ldots, \tilde{G}_j, \ldots, \tilde{G}_K]\, s = \tilde{G} s,   (2)

where G̃ = [G̃_1, G̃_2, ..., G̃_j, ..., G̃_K] and s = [0^T, 0^T, ..., s_j^T, ..., 0^T]^T. Here x = G̃s is a sparse representation of x, G̃ is the concatenated dictionary constructed from the principal components of the class-wise PCA decompositions, and s is the sparse coefficient vector. The process of constructing the dictionary is shown in Fig. 1.

Fig. 1. The process of constructing the dictionary for sparse representation.

According to the previous procedure, each training sample corresponds to a sparse representation under the dictionary G̃. Once G̃ is obtained from the K PCA decompositions, the sparse coefficient vector s of a training sample from class j can be computed quickly by a matrix–vector multiplication, because according to Eq. (2) the computation of s involves only G_j and G_j is column orthogonal. For example, if x is a sample from class j, the s corresponding to x can be obtained as

s = [0^{T}, 0^{T}, \ldots, (G_j^{T} x)^{T}, 1, \ldots, 0^{T}]^{T},   (3)

where the '1' corresponds to the sample mean x_mean^j. To sum up, the process of learning the sparse representation structure is shown in Fig. 2. First, the dictionary for sparse representation is constructed through class-wise PCA decompositions as in Fig. 1. Then the sparse coefficient vector of each sample is computed quickly through a matrix–vector multiplication to obtain the sparse representation structure.

Fig. 2. The process of learning the sparse representation structure.

2.2. Sparsity preserving regularization

It is expected that the sparse representation structure in the original high dimensional space can be preserved in the projected low dimensional space. Therefore, the following objective function can be used to seek a projection which best preserves the sparse representation structure:

\min_{w} \sum_{i=1}^{n} \| w^{T} x_i - w^{T} \tilde{G} s_i \|^{2},   (4)

where s_i is the sparse coefficient vector corresponding to x_i. The sum in Eq. (4) is the total sparse representation error of all samples in the projected space, so minimizing it ensures that the sparse representation structure is preserved in the projected space. We use this objective as the sparsity preserving regularizer, which is defined by

J(w) = \sum_{i=1}^{n} \| w^{T} x_i - w^{T} \tilde{G} s_i \|^{2}.   (5)

With some algebraic reformulation, the sparsity preserving regularizer can be rewritten as

J(w) = \sum_{i=1}^{n} \| w^{T} x_i - w^{T} \tilde{G} s_i \|^{2}
     = w^{T} \Big( \sum_{i=1}^{n} (x_i - \tilde{G} s_i)(x_i - \tilde{G} s_i)^{T} \Big) w
     = w^{T} \Big( \sum_{i=1}^{n} \big( x_i x_i^{T} - x_i s_i^{T} \tilde{G}^{T} - \tilde{G} s_i x_i^{T} + \tilde{G} s_i (\tilde{G} s_i)^{T} \big) \Big) w
     = w^{T} \big( X X^{T} - X S^{T} \tilde{G}^{T} - \tilde{G} S X^{T} + \tilde{G} S S^{T} \tilde{G}^{T} \big) w,   (6)

where S = [s_1, s_2, ..., s_n]. It should be noted that if the samples of each class are first translated to be centered at the origin, the x_mean^j in Eq. (2) and the term '1' in Eq. (3) disappear while the problem in Eq. (4) does not change. This is stated in the following theorem.

Theorem 1. For a data set X = [X_1, X_2, ..., X_K], if we translate each class so that it is centered at the origin, that is, if the translated data set X̃ = [X̃_1, X̃_2, ..., X̃_K] has the property x̃_mean^j = 0, j = 1, ..., K, where x̃_mean^j is the sample mean of X̃_j, then the problem in Eq. (4) stays unchanged under the translated data set.

Proof. X̃_j is centered at the origin and translated from X_j; for a sample x_i of class j, its translated counterpart is x̃_i = x_i − x_mean^j, which means x_i = x̃_i + x_mean^j. So, starting from the problem in Eq. (4), we have

\min_{w} \sum_{i=1}^{n} \| w^{T} x_i - w^{T} \tilde{G} s_i \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} (\tilde{x}_i + x_{\mathrm{mean}}^{j}) - w^{T} [\tilde{G}_1, \tilde{G}_2, \ldots, \tilde{G}_j, \ldots, \tilde{G}_K] s_i \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} (\tilde{x}_i + x_{\mathrm{mean}}^{j}) - w^{T} \tilde{G}_j (s_i)_j \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} (\tilde{x}_i + x_{\mathrm{mean}}^{j}) - w^{T} [g_1, g_2, \ldots, g_{l_j}, x_{\mathrm{mean}}^{j}] (s_i)_j \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} \tilde{x}_i - w^{T} [g_1, g_2, \ldots, g_{l_j}] (s_i)_{j}^{-} \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} \tilde{x}_i - w^{T} G_j (s_i)_{j}^{-} \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} \tilde{x}_i - w^{T} [G_1, G_2, \ldots, G_j, \ldots, G_K] \tilde{s}_i \|^{2}
  = \min_{w} \sum_{i=1}^{n} \| w^{T} \tilde{x}_i - w^{T} G \tilde{s}_i \|^{2},

where (s_i)_j consists of the components of s_i corresponding to G̃_j, (s_i)_j^- consists of the first l_j components of (s_i)_j (that is, (s_i)_j^- is the coefficient vector corresponding to [g_1, g_2, ..., g_{l_j}]), s̃_i = [0^T, 0^T, ..., ((s_i)_j^-)^T, ..., 0^T]^T, and G = [G_1, G_2, ..., G_j, ..., G_K]. Thus, the proof is completed. □

Therefore, if we first center the samples of each class at the origin, the computation of the class-wise PCA decompositions and of the matrix–vector multiplications is simplified and the presentation becomes more concise without the consideration of x_mean^j, while Theorem 1 guarantees that this preprocessing of the data does not change the constructed sparsity preserving regularizer.

Fig. 3. SRDA algorithm:
  Step 1: Calculate S_B and S_W by Eqs. (8) and (9), respectively.
  Step 2: Conduct a PCA decomposition for every X_j, j = 1, 2, ..., K, using Eq. (1) to obtain the sparse representation dictionary G̃.
  Step 3: Calculate the sparse coefficient vector s of every sample based on Eq. (3) to obtain S.
  Step 4: Calculate the projecting vectors from the generalized eigenvalue problem in Eq. (12).

Fig. 4. (a) All the 11 faces of the first person in the Yale data set. (b) Partial faces of the first person in the Extended Yale B data set. (c) All the 10 faces of the first person in the ORL data set. (d) Some sample faces of the first person in the CMU PIE data set.

Fig. 5. The first 10 basis vectors of PCA, LDA, RLDA, LPP, NPE, and SRDA calculated from the training set of the ORL data set.
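As a concrete illustration of Eq. (6), the regularizer matrix (the matrix M used later in Eq. (11)) is simply the scatter of the reconstruction residuals. A minimal sketch, assuming the dictionary G, code matrix S and per-class-centered data Xc produced as in the previous snippet:

```python
import numpy as np

def sparsity_regularizer_matrix(Xc, G, S):
    """M = X X^T - X S^T G^T - G S X^T + G S S^T G^T (matrix form of Eq. (6)),
    so that J(w) = w^T M w.
    Xc: (m, n) per-class-centered data, G: (m, L) dictionary, S: (L, n) codes."""
    R = Xc - G @ S          # column-wise reconstruction residuals x_i - G s_i
    return R @ R.T          # sum_i (x_i - G s_i)(x_i - G s_i)^T

# The expanded form in Eq. (6) gives the same matrix:
# M == Xc @ Xc.T - Xc @ S.T @ G.T - G @ S @ Xc.T + G @ S @ S.T @ G.T
```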



Fig. 6. Recognition rate versus dimensions on Yale with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.


2.3. Sparse regularization discriminant analysis

SRDA aims to simultaneously seek an efficient discriminating subspace and preserve the sparse representation structure. According to Section 2.2, minimizing the sparsity preserving regularizer J(w) preserves the sparse representation structure, so the objective function of SRDA can be formulated as follows:

\max_{w} \frac{w^{T} S_B w}{w^{T} S_W w + \lambda_1 w^{T} w + \lambda_2 J(w)},   (7)

S_B = \sum_{k=1}^{K} N_k (m - m_k)(m - m_k)^{T},   (8)

S_W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - m_k)(x_i - m_k)^{T},   (9)

where S_B and S_W are respectively the between-class and within-class scatter matrices, calculated from the training data (m is the mean of all training samples, m_k and N_k are the mean and the number of the training samples of class k, and C_k is the index set of class k), w^T w is the Tikhonov regularizer [47], and λ1 and λ2 are two parameters which control the tradeoff between the three terms in the denominator.
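For completeness, the scatter matrices of Eqs. (8) and (9) can be computed directly from the labeled training data; the sketch below is ours and uses plain NumPy:

```python
import numpy as np

def scatter_matrices(X, y):
    """S_B (Eq. (8)) and S_W (Eq. (9)) for data X (m, n) with labels y (n,)."""
    m_total = X.mean(axis=1, keepdims=True)          # overall mean m
    SB = np.zeros((X.shape[0], X.shape[0]))
    SW = np.zeros_like(SB)
    for c in np.unique(y):
        Xk = X[:, y == c]
        mk = Xk.mean(axis=1, keepdims=True)          # class mean m_k
        d = m_total - mk
        SB += Xk.shape[1] * (d @ d.T)                # N_k (m - m_k)(m - m_k)^T
        Dk = Xk - mk
        SW += Dk @ Dk.T                              # sum over class k
    return SB, SW
```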


Fig. 7. Recognition rate versus dimensions on Extended Yale B with 10, 20, 30, 40 and 50 images per individual randomly selected for training.



Fig. 8. Recognition rate versus dimensions on ORL with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.



Fig. 9. Recognition rate versus dimensions on CMU PIE with 5, 10 and 20 images per individual randomly selected for training.

Table 1. The best recognition rate (mean ± std, %) and the associated dimensionality on Yale (columns: Baseline, PCA, LDA, RLDA, LPP, NPE, SRDA).
G2: 43.93±2.73 (1024) | 43.93±2.73 (29) | 53.59±6.01 (14) | 55.37±4.11 (15) | 57.89±4.49 (14) | 56.41±5.02 (14) | 59.93±4.47 (15)
G3: 49.00±3.76 (1024) | 49.00±3.49 (43) | 64.92±5.17 (14) | 68.79±4.51 (16) | 68.21±4.46 (15) | 68.17±4.98 (14) | 72.71±4.84 (17)
G4: 52.19±3.72 (1024) | 52.19±3.72 (59) | 72.29±3.64 (14) | 76.62±3.58 (14) | 74.76±4.64 (14) | 74.62±3.51 (14) | 80.38±2.46 (16)
G5: 56.22±4.95 (1024) | 56.44±4.80 (72) | 78.00±3.54 (14) | 81.44±4.11 (15) | 80.33±4.02 (14) | 79.61±3.41 (14) | 84.50±3.84 (21)
G6: 59.27±3.86 (1024) | 59.47±3.85 (40) | 80.40±3.75 (14) | 86.07±2.95 (14) | 80.47±4.50 (14) | 80.20±4.18 (14) | 88.00±3.15 (19)
G7: 60.33±4.24 (1024) | 61.67±4.29 (29) | 83.83±3.90 (14) | 88.50±2.41 (30) | 82.92±3.78 (33) | 82.67±4.24 (14) | 89.83±3.33 (21)
G8: 62.44±3.88 (1024) | 63.33±4.59 (38) | 86.22±4.76 (14) | 88.89±4.08 (17) | 84.00±4.36 (14) | 84.11±4.22 (16) | 91.56±3.65 (17)

Table 2. The best recognition rate (mean ± std, %) and the associated dimensionality on Extended Yale B (columns: Baseline, PCA, LDA, RLDA, LPP, NPE, SRDA).
G10: 53.00±1.22 (1024) | 49.36±1.14 (122) | 86.51±1.22 (36) | 87.38±1.03 (60) | 87.89±1.09 (65) | 83.14±1.64 (95) | 92.45±0.64 (219)
G20: 69.46±1.19 (1024) | 66.09±1.18 (165) | 94.84±0.67 (36) | 95.09±0.62 (48) | 95.72±0.46 (121) | 93.18±0.66 (137) | 97.03±0.59 (198)
G30: 77.20±0.99 (1024) | 74.15±1.25 (190) | 97.47±0.45 (35) | 97.58±0.47 (54) | 97.72±0.36 (122) | 95.81±0.51 (139) | 98.53±0.39 (166)
G40: 81.41±0.93 (1024) | 78.63±1.01 (205) | 98.36±0.43 (36) | 98.43±0.42 (36) | 98.39±0.40 (100) | 96.96±0.63 (170) | 99.23±0.29 (57)
G50: 84.22±1.19 (1024) | 81.61±1.42 (216) | 98.81±0.61 (37) | 98.88±0.54 (32) | 98.78±0.46 (120) | 97.62±0.68 (144) | 99.50±0.28 (110)

Substituting Eq. (6) into Eq. (7) and making some algebraic reformulation, we have

\frac{w^{T} S_B w}{w^{T} S_W w + \lambda_1 w^{T} w + \lambda_2 J(w)}
  = \frac{w^{T} S_B w}{w^{T} S_W w + \lambda_1 w^{T} w + \lambda_2 w^{T} (X X^{T} - X S^{T} \tilde{G}^{T} - \tilde{G} S X^{T} + \tilde{G} S S^{T} \tilde{G}^{T}) w}
  = \frac{w^{T} S_B w}{w^{T} \big( S_W + \lambda_1 I + \lambda_2 (X X^{T} - X S^{T} \tilde{G}^{T} - \tilde{G} S X^{T} + \tilde{G} S S^{T} \tilde{G}^{T}) \big) w}.   (10)

With M = X X^T − X S^T G̃^T − G̃ S X^T + G̃ S S^T G̃^T, the problem in Eq. (7) can be recast as

\max_{w} \frac{w^{T} S_B w}{w^{T} (S_W + \lambda_1 I + \lambda_2 M) w},   (11)

where I is the identity matrix related to the Tikhonov regularizer and M is the matrix corresponding to the sparsity preserving regularizer. The problem in Eq. (11) can be solved via the following generalized eigenvalue problem:

S_B w = \eta (S_W + \lambda_1 I + \lambda_2 M) w.   (12)

The projecting matrix W = [w_1, w_2, ..., w_d] then consists of the eigenvectors corresponding to the d largest eigenvalues. Based on the above discussion, the proposed SRDA is summarized in Fig. 3.
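The whole training procedure of Fig. 3 then reduces to assembling these matrices and calling a generalized symmetric eigensolver for Eq. (12). The sketch below uses scipy.linalg.eigh together with the hypothetical helper functions from the earlier snippets; λ1, λ2 and the output dimension d are user-chosen, and this is our illustrative reconstruction rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def srda_fit(X, y, l, lam1=0.01, lam2=1.0, d=50):
    """Learn the SRDA projection W (m, d) by solving Eq. (12)."""
    SB, SW = scatter_matrices(X, y)                    # Eqs. (8)-(9)
    G, S, Xc = classwise_pca_dictionary(X, y, l)       # Eqs. (1)-(3)
    M = sparsity_regularizer_matrix(Xc, G, S)          # Eq. (6)
    A = SW + lam1 * np.eye(X.shape[0]) + lam2 * M      # regularized denominator matrix
    # eigh solves SB w = eta * A w for symmetric SB and positive definite A;
    # keep the eigenvectors of the d largest generalized eigenvalues.
    eta, V = eigh(SB, A)
    W = V[:, np.argsort(eta)[::-1][:d]]
    return W

# Projection of data into the learned subspace: Z = W.T @ X
```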


Table 3. The best recognition rate (mean ± std, %) and the associated dimensionality on ORL (columns: Baseline, PCA, LDA, RLDA, LPP, NPE, SRDA).
G2: 67.16±2.59 (1024) | 67.16±2.59 (79) | 72.48±2.91 (24) | 79.11±2.23 (43) | 78.41±2.27 (39) | 78.59±2.78 (39) | 82.00±2.62 (47)
G3: 75.84±3.70 (1024) | 75.84±3.70 (119) | 82.84±3.12 (39) | 88.59±3.24 (47) | 86.66±1.96 (39) | 86.50±2.09 (39) | 90.93±2.11 (49)
G4: 82.12±2.71 (1024) | 82.12±2.71 (159) | 89.67±2.17 (39) | 93.73±1.86 (54) | 90.67±1.82 (39) | 90.56±1.65 (39) | 94.92±1.24 (42)
G5: 86.35±2.31 (1024) | 86.35±2.31 (199) | 92.40±1.31 (39) | 95.95±1.81 (54) | 92.98±2.16 (39) | 93.20±1.80 (39) | 97.00±1.42 (56)
G6: 88.75±2.58 (1024) | 88.75±2.58 (239) | 93.69±2.62 (37) | 97.12±1.37 (40) | 94.41±1.45 (39) | 94.59±1.59 (38) | 98.38±1.04 (45)
G7: 90.54±2.76 (1024) | 90.54±2.76 (276) | 95.54±2.27 (39) | 98.12±1.11 (39) | 95.37±1.82 (40) | 95.21±1.75 (38) | 98.92±0.82 (42)
G8: 92.25±2.58 (1024) | 92.25±2.58 (316) | 96.63±2.19 (39) | 98.31±1.73 (39) | 95.56±2.55 (36) | 95.50±2.48 (41) | 99.50±0.75 (52)

Table 4. The best recognition rate (mean ± std, %) and the associated dimensionality on CMU PIE (columns: Baseline, PCA, LDA, RLDA, LPP, NPE, SRDA).
G5: 29.55±0.65 (1024) | 28.17±0.60 (130) | 72.50±1.52 (38) | 73.56±1.20 (34) | 72.53±1.53 (37) | 71.48±1.59 (51) | 75.13±1.05 (30)
G10: 44.13±0.83 (1024) | 42.60±0.88 (172) | 86.53±0.93 (36) | 87.13±1.15 (38) | 86.59±0.94 (35) | 86.01±0.83 (70) | 89.45±0.55 (37)
G20: 61.80±0.68 (1024) | 60.28±0.60 (206) | 93.37±0.32 (44) | 93.58±0.35 (59) | 93.42±0.17 (52) | 92.32±0.58 (63) | 94.24±0.11 (48)

2.4. Computational complexity analysis

The computation of S_B requires O(m^2 K), while the computation of S_W requires O(m^2 n). The computational complexity of K full PCA decompositions for matrices of size m × m is O(K m^3). Here we only need the first l_j eigenvalues and eigenvectors for class j, j = 1, ..., K, which can be found with the more efficient power method [48] in O(m^2 l_j), so the K PCA decompositions can be done in O(m^2 Σ_{j=1}^K l_j). Calculating the sparse coefficient vector s of a sample from class j involves only a matrix–vector multiplication, which scales as O(m l_j); hence the sparse coefficient vectors of all samples can be obtained in O(m Σ_{j=1}^K n_j l_j). The evaluation of M requires O(m^2 n Σ_{j=1}^K l_j). The generalized eigenvalue problem in Eq. (12) can be solved in O(m^3). So the overall computational complexity of the proposed SRDA method is O(m^2 n Σ_{j=1}^K l_j + m^3).

3. Comparison with related methods

In the proposed SRDA method, PCA is used as a basic procedure in the construction of the dictionary for sparse representation. If λ1 = λ2 = 0 in Eq. (7), SRDA becomes standard LDA [9]; if λ1 ≠ 0 and λ2 = 0, it reduces to RLDA [49]. LPP [19] and NPE [20] are two linear manifold learning methods aiming at preserving the local neighborhood structure of the data, while the regularization term in SRDA tries to preserve the sparse representation structure of the data.

Qiao et al. [40] presented a semi-supervised dimensionality reduction method called Sparsity Preserving Discriminant Analysis (SPDA). SPDA has a formulation similar to that of SRDA, but it differs in how the sparse representation structure is learned. In SPDA, n time-consuming ℓ1 norm minimization problems need to be solved to learn the sparse representation structure, at a computational cost of O(n^4), while the proposed SRDA method obtains the sparse representation structure much faster through K PCA decompositions and n matrix–vector multiplications, at a cost of O(m^2 Σ_{j=1}^K l_j + m Σ_{j=1}^K n_j l_j), with n_j ≪ n, l_j ≪ n and K ≪ n. Moreover, SRDA uses label information both in learning the sparse representation structure and in the Fisher terms, while in SPDA label information is used only in the Fisher terms. Ref. [41] proposed a sparse representation structure based dimensionality reduction method named Fast Fisher Sparsity Preserving Projections (FFSPP), which employs both the sparse representation structure and the Fisher term. Although the objective functions of FFSPP and SRDA share similar terms, the models of the two methods are different: SRDA is a quotient model while FFSPP is an additive model. The quotient model of SRDA leads to better recognition performance, as shown in Section 4.

Zhang et al. [50,51] proposed a unifying framework, named "patch alignment", for dimensionality reduction and developed a new dimensionality reduction algorithm, Discriminative Locality Alignment (DLA), based on this framework. The framework consists of two stages: part optimization and whole alignment. The patch alignment framework reveals that (1) algorithms are intrinsically different in the patch optimization stage and (2) all algorithms share an almost identical whole alignment stage. Our proposed SRDA can be explained under this framework in a similar way to LDA [51]. Manifold elastic net (MEN) [52] is a unified framework for sparse dimensionality reduction, which incorporates the merits of both manifold learning based and sparse learning based dimensionality reduction. SRDA can also be unified under the framework of MEN through the patch alignment framework.

The proposed SRDA is a supervised dimensionality reduction method. Another popular class of supervised dimensionality reduction methods is based on Non-negative Matrix Factorization (NMF) [53]. In [54], discriminant constraints are incorporated inside the NMF decomposition to enhance the classification accuracy of the NMF algorithm, and this method is named Discriminant Non-negative Matrix Factorization (DNMF). Guan et al. [55] presented a Non-negative Patch Alignment Framework (NPAF) to unify popular NMF related dimensionality reduction methods and, based on NPAF, developed a supervised dimensionality reduction method named Non-negative Discriminative Locality Alignment (NDLA).



Fig. 10. Boxplots of recognition rate for compared methods on Yale with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.

Guan et al. [56] introduced manifold regularization and margin maximization into NMF and obtained Manifold regularized Discriminative NMF (MD-NMF), which is an effective supervised variant of NMF.

4. Experiments



Fig. 11. Boxplots of recognition rate for compared methods on Extended Yale B with 10, 20, 30, 40 and 50 images per individual randomly selected for training.

As popular dimensionality reduction methods, PCA, LDA, Regularized LDA (RLDA), Locality Preserving Projections (LPP), and Neighborhood Preserving Embedding (NPE) have been successfully applied to face recognition under the names of Eigenface [10], Fisherface [12], R_Fisherface [15], Laplacianface [57], and NPEface [20]. Typically, the recognition process has three steps: (1) calculating the face subspace from the training samples using PCA, LDA, RLDA, LPP, or NPE; (2) projecting the test face image into the d-dimensional face subspace; and (3) identifying the test face image with a nearest neighbor classifier. In this section, we compare the recognition performance of these methods with the proposed SRDA method on four publicly available face data sets. In addition, SRDA is compared with SPP, FFSPP, DLA, two representative regularized discriminant analysis methods, Laplacian Linear Discriminant Analysis (LapLDA) [16] and Locality Preserving Discriminant Projections (LPDP) [58], and two NMF related supervised dimensionality reduction methods, DNMF [54] and MD-NMF [56]. Finally, the sensitivity of SRDA to its parameter λ2 is studied. All experiments are performed with Matlab 7.8.0 on a personal computer with a Pentium-IV 3.20 GHz processor, 2 GB main memory and Windows XP.

4.1. Database description

The Yale face database [12] contains 165 images of 15 individuals, with 11 images per individual. These images are taken under different facial expressions or configurations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised and wink. In our experiments, the cropped images of size 32 × 32 are used. Extended Yale B [59] consists of 2414 front-view face images of 38 individuals, with about 64 images per individual taken under different laboratory-controlled lighting conditions. In our experiments, each image is cropped to a resolution of 32 × 32. ORL (see footnote 1) contains 400 images of 40 persons, with 10 images per person. The images are taken at different times, under different lighting and with different facial expressions.

1 The ORL data set is available at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.



Fig. 12. Boxplots of recognition rate for compared methods on ORL with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.



Fig. 13. Boxplots of recognition rate for compared methods on CMU PIE with 5, 10 and 20 images per individual randomly selected for training.

The faces are in an upright, frontal position, with some tolerance for side movement. The cropped images of size 32 × 32 are used. The CMU PIE face database [60] contains 41 368 face images of 68 individuals. The face images were captured by 13 synchronized cameras and 21 flashes, under varying pose, illumination, and expression. In our experiments, five near-frontal poses (C05, C07, C09, C27, and C29) are selected, which leaves 170 face images per individual, and each image is resized to 32 × 32. Fig. 4 shows some faces from these data sets. For all our experiments, each face image is normalized to have unit norm.

4.2. Experimental setting

In our experiments, each face database is randomly partitioned into gallery and probe sets of different sizes. For ease of presentation, GN means that N images per person are randomly selected to form the gallery set and the remaining images are used as the probe set. The gallery set is used for training while the probe set is used for testing. Twenty random training/testing splits are used and the average classification accuracies are reported. PCA and LDA are parameter-free. For LPP and NPE, the supervised versions are used; in particular, the neighbor mode in LPP and NPE is set to 'supervised' and the weight mode in LPP is set to 'Cosine'. For RLDA, the regularization parameter λ1 is set to 0.01 as in [40,61,62]. For SRDA, the number of leading principal components l_j of class j is set to n_j − 1, the regularization parameter λ1 is set as in RLDA, and the regularization parameter λ2 is set by 5-fold cross-validation, where λ2 is tuned over {0.1, 0.2, ..., 2}. Since the dimensionality of the face vector space is much larger than the number of training samples, the unregularized methods LDA, LPP, and NPE all involve a PCA preprocessing phase, that is, the training set X is projected onto a PCA subspace spanned by the leading eigenvectors. For Yale and ORL, which are relatively small databases, 100% of the energy is kept in the PCA preprocessing phase; for Extended Yale B and CMU PIE, which are relatively large-scale databases, 98% of the energy is kept in order to obtain all the experimental results in reasonable time.

4.3. Experimental results and discussions

As shown in the proposed SRDA method, the eigenvectors spanning the learned subspace are obtained from the generalized eigenvalue problem in Eq. (12). These eigenvectors can be resized and displayed as images. The first 10 Eigenfaces, Fisherfaces, R_Fisherfaces, Laplacianfaces, NPEfaces, and SRDAfaces of the training set of the ORL data set (5 images per individual used for training) are shown in Fig. 5. From these results, we can see that the discriminative information contained in the SRDAfaces is richer than that in the other faces. In general, the performance of a dimensionality reduction method varies with the number of dimensions. Figs. 6–9 show the plots of recognition rate versus dimensions for PCA, LDA, RLDA, LPP, NPE and SRDA on the four tested data sets. The best results and the associated dimensionality of these compared methods on the Yale, Extended Yale B, ORL and CMU PIE data sets are summarized in Tables 1–4 respectively, where Baseline denotes the classification results of the nearest neighbor classifier without dimensionality reduction and the number in brackets is the corresponding optimal projecting dimension. Moreover, it should be noted that the upper bound for the dimensionality of LDA is K−1, because there are at most K−1 generalized nonzero eigenvalues [12].
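The evaluation protocol of Section 4.2 (random GN splits, subspace learning on the gallery set, nearest neighbor classification of the probe set, and averaging over 20 splits) can be sketched as follows. The fit argument stands for any of the compared subspace learners (e.g. the srda_fit sketch above); the code is our illustration, not the original Matlab scripts.

```python
import numpy as np

def evaluate(X, y, n_train_per_class, fit, n_splits=20, seed=0):
    """Random GN splits: N images per person for training, the rest for testing;
    classification by a 1-nearest-neighbor rule in the learned subspace.
    X: (m, n) data, y: (n,) integer labels, fit: callable (Xtr, ytr) -> W."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.where(y == c)[0])
            train_idx.extend(idx[:n_train_per_class])
            test_idx.extend(idx[n_train_per_class:])
        Xtr, ytr = X[:, train_idx], y[train_idx]
        Xte, yte = X[:, test_idx], y[test_idx]
        W = fit(Xtr, ytr)                          # learn the projection
        Ztr, Zte = W.T @ Xtr, W.T @ Xte
        # squared Euclidean distances between probe and gallery projections
        d2 = ((Zte[:, :, None] - Ztr[:, None, :]) ** 2).sum(axis=0)
        pred = ytr[d2.argmin(axis=1)]
        accs.append((pred == yte).mean())
    return np.mean(accs), np.std(accs)
```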



Fig. 14. Recognition rate versus dimensions on Yale with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.



Fig. 15. Recognition rate versus dimensions on Extended Yale B with 10, 20, 30, 40 and 50 images per individual randomly selected for training.

From these experimental results we observe the following:
(1) The performance of PCA is much worse than that of the other methods and is only comparable to the Baseline, because PCA is unsupervised and the label information is not considered.
(2) In most cases, SRDA and RLDA are superior to the other compared methods. This is because regularization techniques are used in them, and it is consistent with existing results [14,15,61,62].
(3) The recognition rate of SRDA is better than that of all the other compared methods, especially when the size of the training set is small. This is because SRDA simultaneously seeks an efficient discriminating subspace and preserves the sparse representation structure. The superiority of SRDA in the small-sample setting validates that SRDA is an effective method for the SSS problem.
(4) In most cases, the recognition rate of SRDA increases more quickly than that of the other methods as the dimension increases at the beginning, since label information is used both in learning the sparse representation structure and in computing the Fisher terms. This indicates that SRDA can reach better performance than the other methods with a very low dimensional subspace.
(5) The discrimination power of SRDA is enhanced as the projecting dimension increases, but only up to a point; beyond that point the classification accuracy levels off or even decreases. How to determine the optimal dimension is still an open problem.

Boxplots of the experimental results of the compared methods on the four face data sets are shown in Figs. 10–13. For each method, the boxplot shows the median as the central mark, the 25th and 75th percentiles as the box edges, and whiskers extending to the most extreme values within 1.5 times the interquartile range from the ends of the box. As shown in Figs. 10–13, for the most part RLDA achieves a more robust recognition rate than LDA with respect to random training/testing splits, because of the use of regularization techniques.



Fig. 16. Recognition rate versus dimensions on ORL with 2, 3, 4, 5, 6, 7 and 8 images per individual randomly selected for training.



Fig. 17. Recognition rate versus dimensions on CMU PIE with 5, 10 and 20 images per individual randomly selected for training.

Table 5. The best recognition rate (mean ± std, %) and the associated dimensionality on Yale (columns: SPP, FFSPP, DLA, SRDA).
G2: 52.07±3.84 (29) | 55.19±3.67 (14) | 57.70±6.27 (14) | 59.93±4.47 (15)
G3: 65.32±3.82 (44) | 69.29±2.90 (15) | 71.00±3.56 (19) | 72.71±4.84 (17)
G4: 72.05±4.37 (59) | 73.38±2.63 (15) | 78.29±3.70 (28) | 80.38±2.46 (16)
G5: 74.78±4.33 (67) | 77.72±3.60 (16) | 81.78±3.92 (20) | 84.50±3.84 (21)
G6: 76.58±5.02 (89) | 82.33±4.28 (20) | 85.40±3.16 (16) | 88.00±3.15 (19)
G7: 77.78±4.64 (104) | 83.58±3.98 (20) | 88.00±3.09 (41) | 89.83±3.33 (21)
G8: 80.45±5.82 (89) | 84.89±5.08 (14) | 90.44±3.31 (15) | 91.56±3.65 (17)

Table 6. The best recognition rate (mean ± std, %) and the associated dimensionality on Extended Yale B (columns: SPP, FFSPP, DLA, SRDA).
G10: 83.87±0.97 (124) | 86.42±0.88 (107) | 89.43±0.94 (76) | 92.45±0.64 (219)
G20: 92.06±0.90 (164) | 94.92±0.75 (36) | 96.18±0.53 (109) | 97.03±0.59 (198)
G30: 94.82±0.60 (189) | 97.32±0.47 (36) | 97.98±0.49 (142) | 98.53±0.39 (166)
G40: 95.93±0.49 (205) | 98.37±0.43 (37) | 98.76±0.43 (169) | 99.23±0.29 (57)
G50: 96.49±0.71 (215) | 98.93±0.53 (34) | 99.32±0.51 (181) | 99.50±0.28 (110)

Table 7. The best recognition rate (mean ± std, %) and the associated dimensionality on ORL (columns: SPP, FFSPP, DLA, SRDA).
G2: 73.31±2.93 (69) | 77.39±2.39 (41) | 79.72±2.25 (48) | 82.00±2.62 (47)
G3: 80.63±2.22 (119) | 85.98±2.37 (47) | 89.04±2.24 (64) | 90.93±2.11 (49)
G4: 84.50±2.42 (146) | 90.48±2.69 (46) | 93.10±1.84 (88) | 94.92±1.24 (42)
G5: 87.13±2.17 (148) | 93.77±1.78 (73) | 96.43±0.73 (119) | 97.00±1.42 (56)
G6: 88.44±2.64 (181) | 94.50±1.50 (61) | 97.69±1.15 (114) | 98.38±1.04 (45)
G7: 89.71±2.80 (179) | 96.21±2.19 (101) | 98.04±1.06 (147) | 98.92±0.82 (42)
G8: 91.06±3.72 (196) | 97.31±2.12 (70) | 98.25±1.18 (178) | 99.50±0.75 (52)

In most cases, SRDA is more robust than the other compared methods because two regularization terms (the Tikhonov regularizer and the sparsity preserving regularizer) are used in SRDA. In addition, for the most part, PCA is less robust than the other compared methods with respect to random training/testing splits, because it does not consider the label information.


Table 8. The best recognition rate (mean ± std, %) and the associated dimensionality on CMU PIE (columns: SPP, FFSPP, DLA, SRDA).
G5: 63.91±1.31 (130) | 70.61±1.32 (37) | 73.16±1.68 (66) | 75.13±1.05 (33)
G10: 77.20±1.23 (169) | 85.97±1.03 (47) | 87.19±1.05 (126) | 89.45±0.55 (37)
G20: 86.07±0.59 (203) | 93.32±0.26 (54) | 93.12±0.49 (176) | 94.24±0.11 (52)

Table 9. The time (s) for learning the embedding function on Yale (columns: SPP, SRDA).
G2: 0.2371 | 0.0984
G3: 0.5594 | 0.1484
G4: 0.9502 | 0.1969
G5: 1.8754 | 0.2544
G6: 2.6294 | 0.3326
G7: 5.0403 | 0.4240
G8: 7.4958 | 0.5186

Table 10. The time (s) for learning the embedding function on Extended Yale B (columns: SPP, SRDA).
G10: 766.2118 | 1.0218
G20: 2760 | 2.3161
G30: 5793 | 5.3356
G40: 8450 | 5.4323
G50: 10113 | 9.1306

Table 11. The time (s) for learning the embedding function on ORL (columns: SPP, SRDA).
G2: 2.0176 | 1.0224
G3: 6.8909 | 1.2106
G4: 20.8091 | 1.4683
G5: 60.8904 | 1.7137
G6: 107.2234 | 1.8829
G7: 251.1639 | 2.0477
G8: 360.8560 | 2.2986

Table 12. The time (s) for learning the embedding function on CMU PIE (columns: SPP, SRDA).
G5: 983.3388 | 0.4228
G10: 4343.7 | 1.1214
G20: 9298.6 | 2.9944

4.4. Comparisons with other state-of-the-art dimensionality reduction methods

In this part, we compare the proposed method with the recently proposed sparse representation structure based methods SPP [39] and FFSPP [41], the patch alignment based method DLA [51], two representative regularized discriminant analysis methods, Laplacian Linear Discriminant Analysis (LapLDA) [16] and Locality Preserving Discriminant Projections (LPDP) [58], and two NMF related supervised dimensionality reduction methods, DNMF [54] and MD-NMF [56].

Figs. 14–17 show the plots of recognition rate versus dimensions for SPP, FFSPP, DLA and SRDA on the four tested data sets. The parameter settings for SPP, FFSPP and DLA are the same as in [39,41,51]. The best results and the associated dimensionality of these compared methods on the Yale, Extended Yale B, ORL and CMU PIE data sets are summarized in Tables 5–8 respectively. From these experimental results, SRDA achieves the best performance, followed by DLA, FFSPP and SPP. Moreover, the time comparisons for learning the embedding functions of SPP and SRDA on the four tested data sets are shown in Tables 9–12. As shown in Tables 9–12, SRDA is much faster than SPP due to the efficient evaluation of the sparse coefficient vectors in SRDA. Therefore, SRDA not only achieves a better recognition performance than SPP, but is also more time efficient.

Table 13. Recognition rate (mean ± std, %) comparison of RLDA, LapLDA, LPDP and SRDA on Yale (columns: RLDA, LapLDA [16], LPDP [58], SRDA).
G2: 55.37±4.11 (15) | 56.04±4.94 (13) | 57.74±5.90 (14) | 59.93±4.47 (15)
G3: 68.79±4.51 (16) | 69.17±3.60 (14) | 71.75±4.50 (14) | 72.71±4.84 (17)
G4: 76.62±3.58 (14) | 75.86±3.86 (14) | 78.90±3.86 (16) | 80.38±2.46 (16)
G5: 81.44±4.11 (15) | 80.06±4.21 (14) | 81.78±3.75 (13) | 84.50±3.84 (21)
G6: 86.07±2.95 (14) | 82.53±4.08 (15) | 86.73±3.95 (14) | 88.00±3.15 (19)
G7: 88.50±2.41 (30) | 85.08±4.38 (15) | 88.17±3.24 (14) | 89.83±3.33 (21)
G8: 88.89±4.08 (17) | 89.00±4.17 (14) | 90.67±2.35 (14) | 91.56±3.65 (17)


Table 14. Recognition rate (mean ± std, %) comparison of RLDA, LapLDA, LPDP and SRDA on Extended Yale B (columns: RLDA, LapLDA [16], LPDP [58], SRDA).
G10: 87.38±1.03 (60) | 87.22±0.99 (58) | – | 92.45±0.64 (219)
G20: 95.09±0.62 (48) | 94.66±0.80 (52) | – | 97.03±0.59 (198)
G30: 97.58±0.47 (54) | 97.16±0.49 (36) | – | 98.53±0.39 (166)
G40: 98.43±0.42 (36) | 98.34±0.36 (38) | – | 99.23±0.29 (57)
G50: 98.88±0.54 (32) | 98.87±0.32 (29) | – | 99.50±0.28 (110)

Table 15. Recognition rate (mean ± std, %) comparison of RLDA, LapLDA, LPDP and SRDA on ORL (columns: RLDA, LapLDA [16], LPDP [58], SRDA).
G2: 79.11±2.23 (43) | 79.56±2.77 (39) | 81.11±3.10 (37) | 82.00±2.62 (47)
G3: 88.59±3.24 (47) | 89.18±1.98 (41) | 91.14±1.69 (39) | 90.93±2.11 (49)
G4: 93.73±1.86 (54) | 92.60±2.21 (39) | 95.15±1.38 (39) | 94.92±1.24 (42)
G5: 95.95±1.81 (54) | 95.47±1.90 (53) | 97.70±1.06 (40) | 97.00±1.42 (56)
G6: 97.12±1.37 (40) | 96.00±2.39 (46) | 98.25±1.16 (40) | 98.38±1.04 (45)
G7: 98.12±1.11 (39) | 97.42±1.75 (40) | 98.63±1.39 (42) | 98.92±0.82 (42)
G8: 98.31±1.73 (39) | 98.06±1.88 (56) | 99.06±0.90 (40) | 99.50±0.75 (52)


Fig. 18. Recognition rate comparison of DNMF, MD-NMF, and SRDA on Yale with 3, 5 and 7 images per individual randomly selected for training.

Fig. 19. Recognition rate comparison of DNMF, MD-NMF, and SRDA on ORL with 3, 5 and 7 images per individual randomly selected for training.

LapLDA is a regularized LDA method aiming at integrating the global and local structures by introducing a locality preserving regularization term into LDA. LPDP is obtained by adding the maximum margin criterion (MMC) to the objective function of Locality Preserving Projections, aiming at retaining the locality preserving characteristic of LPP while using the global discriminative structure from MMC. The experimental results are presented in Tables 13–15 for Yale, Extended Yale B and ORL respectively. As a classical regularized LDA method, RLDA is also listed in the tables for comparison. Thus, we essentially compare three kinds of regularized discriminant analysis methods: non-data-dependent regularized discriminant analysis (RLDA), local structure preserving discriminant analysis (LapLDA and LPDP), and sparse structure preserving discriminant analysis (SRDA). From the experimental results, RLDA is superior to LapLDA in most cases, which is consistent with the empirical study in [62]. LPDP consistently achieves better performance than RLDA and LapLDA. Moreover, for the most part, SRDA performs better than the other three compared regularized discriminant analysis methods. The recognition rate comparisons of DNMF, MD-NMF, and SRDA on Yale and ORL with 3, 5 and 7 images per individual randomly selected for training are shown in Figs. 18 and 19 respectively. From these results, SRDA achieves better recognition performance than the two NMF related supervised dimensionality reduction methods DNMF and MD-NMF.

4.5. Analysis of the parameter λ2

There are two parameters in the proposed SRDA algorithm: the regularization parameters λ1 and λ2. In the experiments, λ1 is set as in the previous works [40,61,62]. In this part, the influence of λ2 on the performance of SRDA is investigated. Fig. 20 shows the influence of λ2 on the performance of SRDA on Yale (G6), Extended Yale B (G30), ORL (G5) and CMU PIE (G10), when λ2 ranges from 0.1 to 2 in steps of 0.1. From the experimental results, the performance of SRDA does not fluctuate greatly as λ2 varies on any of the four tested data sets, so SRDA is robust to the regularization parameter λ2.
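The sensitivity study above amounts to sweeping λ2 over the grid {0.1, 0.2, ..., 2} with the other settings of Section 4.2 fixed and recording the recognition rate. A small sketch using the hypothetical srda_fit and evaluate helpers from the earlier snippets:

```python
import numpy as np

def lambda2_sweep(X, y, n_train_per_class, grid=None):
    """Mean/std recognition rate of SRDA for each lambda_2 on the grid,
    with lambda_1 = 0.01 and l_j = n_j - 1 fixed as in Section 4.2."""
    grid = np.arange(0.1, 2.01, 0.1) if grid is None else grid
    l = {c: n_train_per_class - 1 for c in np.unique(y)}   # l_j on the gallery set
    results = []
    for lam2 in grid:
        fit = lambda Xtr, ytr, lam2=lam2: srda_fit(Xtr, ytr, l, lam1=0.01, lam2=lam2, d=50)
        mean_acc, std_acc = evaluate(X, y, n_train_per_class, fit)
        results.append((float(lam2), mean_acc, std_acc))
    return results
```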

5. Conclusions and future work

In this paper, we proposed a novel dimensionality reduction method for face recognition, named Sparse Regularization Discriminant Analysis (SRDA), which aims at simultaneously seeking an efficient discriminating subspace and preserving the sparse representation structure. Specifically, SRDA first constructs a concatenated dictionary through class-wise PCA decompositions and learns the sparse representation structure under the constructed dictionary quickly through matrix–vector multiplications.



Fig. 20. Influence of λ2 on the performance of SRDA on Yale, Extended Yale B, ORL and CMU PIE respectively.

dictionary quickly through matrix–vector multiplications. Then, SRDA takes into account both the sparse representation structure and the discriminating efficiency by using the learned sparse representation structure as a regularization term of linear discriminant analysis. Finally, the optimal embedding of the data is learned via solving a generalized eigenvalue problem. The extensive experiments on four publicly available face data sets validate the promising performance of our proposed SRDA approach. Our SRDA method may not perform well on class imbalance problem, since it needs to perform PCA decomposition for each class. For a class whose sample size is small, PCA on this class cannot well capture the subspace structure of this class, which will deteriorate the performance of our SRDA method. How to tackle this problem is one of our future focuses. Another topic worthwhile to explore is how to extend our SRDA method to the semisupervised scenario, which is an important scenario where we can obtain a lot of unlabeled face images easily due to the popularity of digital camera.
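For readers who want an operational picture of the pipeline summarised above, the following is a minimal sketch of one plausible realisation in NumPy/SciPy. Several details are assumptions on our part rather than the paper's exact formulation: the energy threshold used to truncate each class-wise PCA, the rule that each sample's code is obtained by projecting it onto its own class's basis only (which is what makes the coding step a handful of matrix products), the quadratic form chosen for the sparsity regularizer R, and the placement of λ1 and λ2 in the generalized eigenvalue problem.

```python
import numpy as np
from scipy.linalg import eigh

def srda_sketch(X, y, lambda1=0.1, lambda2=1.0, n_components=40, energy=0.95):
    """Hypothetical SRDA-style pipeline (a sketch, not the authors' code).

    X: (n_samples, n_features) data matrix; y: integer class labels.
    Each class should contain at least two samples for its PCA step.
    Returns a projection matrix W with `n_components` columns.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, d = X.shape
    mu = X.mean(axis=0)

    # 1. Class-wise PCA: one sub-dictionary (orthonormal basis) per class.
    bases, means = {}, {}
    for c in classes:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        _, s, Vt = np.linalg.svd(Xc - means[c], full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
        bases[c] = Vt[:k].T                      # (d, k), columns span class c

    # 2. Block-sparse codes via projections only: each sample is reconstructed
    #    from its own class's basis, so all other blocks of its code are zero.
    X_rec = np.empty_like(X)
    for c in classes:
        idx = np.where(y == c)[0]
        B = bases[c]
        X_rec[idx] = means[c] + (X[idx] - means[c]) @ B @ B.T

    # 3. Assumed regularizer: penalize losing the sparse reconstruction
    #    relationship after projection, R = sum_i (x_i - rec_i)(x_i - rec_i)^T.
    E = X - X_rec
    R = E.T @ E

    # 4. LDA scatter matrices.
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        diff = X[y == c] - means[c]
        Sw += diff.T @ diff
        m = (means[c] - mu)[:, None]
        Sb += np.sum(y == c) * (m @ m.T)

    # 5. Regularized generalized eigenproblem: Sb w = eta (Sw + l1*I + l2*R) w.
    M = Sw + lambda1 * np.eye(d) + lambda2 * R
    eta, V = eigh(Sb, M)                          # eigenvalues in ascending order
    W = V[:, np.argsort(eta)[::-1][:n_components]]
    return W
```

Under these assumptions, recognition can then be carried out, for example, with a nearest-neighbour classifier applied to the projected features X @ W.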

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which significantly improved the quality of this paper. This work is supported by the National Natural Science Foundation of China, Nos. 61072106, 60971112, 60971128, 60970067, 61072108; the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project), No. B07048; and the Program for Cheung Kong Scholars and Innovative Research Team in University, No. IRT1170.

References

[1] W. Zhao, R. Chellappa, P. Phillips, A. Rosenfeld, Face recognition: a literature survey, ACM Comput. Surv. 35 (4) (2003) 399–458.
[2] M. Berry, Survey of Text Mining: Clustering, Classification and Retrieval, Springer, New York, 2003.
[3] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008.
[4] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[5] L.O. Jimenez, D.A. Landgrebe, Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data, IEEE Trans. Syst. Man Cybern. 28 (1) (1997) 39–54.
[6] F. Shang, L.C. Jiao, J. Shi, J. Chai, Robust positive semidefinite L-Isomap ensemble, Pattern Recognit. Lett. 32 (4) (2011) 640–649.
[7] S. Gunal, R. Edizkan, Subspace based feature selection for pattern recognition, Inf. Sci. 178 (19) (2008) 3716–3726.
[8] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[9] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1990.
[10] M.A. Turk, A.P. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
[11] F. Wang, X. Wang, D. Zhang, C. Zhang, T. Li, MarginFace: a novel face recognition method by average neighborhood margin maximization, Pattern Recognit. 42 (11) (2009) 2863–2875.
[12] P.N. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[13] S.J. Raudys, A.K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991) 252–264.
[14] J.H. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc. 84 (405) (1989) 165–175.
[15] S. Ji, J. Ye, Generalized linear discriminant analysis: a unified framework and efficient model selection, IEEE Trans. Neural Netw. 19 (10) (2008) 1768–1782.
[16] J. Chen, J. Ye, Q. Li, Integrating global and local structures: a least squares framework for dimensionality reduction, in: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[17] D. Tao, X. Li, X. Wu, General averaged divergence analysis, in: Proceedings of IEEE International Conference on Data Mining (ICDM), 2007, pp. 2272–2279.


[18] D. Tao, X. Li, X. Wu, Geometric mean for subspace selection, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 260–274.
[19] X. He, P. Niyogi, Locality preserving projections, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), vol. 16, 2003, pp. 585–591.
[20] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood preserving embedding, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2005, pp. 1208–1213.
[21] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[22] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[23] E. Candès, Compressive sampling, in: Proceedings of the International Congress of Mathematicians, 2006, pp. 1433–1452.
[24] D. Donoho, For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution, Comm. Pure Appl. Math. 59 (6) (2006) 797–829.
[25] D. Donoho, Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006) 1289–1306.
[26] R.G. Baraniuk, M.B. Wakin, Random projections of smooth manifolds, Found. Comput. Math. 9 (1) (2009) 51–77.
[27] M.A. Davenport, P.T. Boufounos, M.B. Wakin, R.G. Baraniuk, Signal processing with compressive measurements, IEEE J. Sel. Top. Signal Process. 4 (2) (2010) 445–460.
[28] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[29] J. Cai, H. Ji, X. Liu, Z. Shen, Blind motion deblurring from a single image using sparse approximation, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 104–111.
[30] X. Lu, Y. Yuan, P. Yan, Sparse coding for image denoising using spike and slab prior, Neurocomputing 106 (2013) 12–20.
[31] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Non-local sparse models for image restoration, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2272–2279.
[32] J. Yang, J. Wright, T. Huang, Y. Ma, Image super-resolution as sparse representation of raw patches, in: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[33] W. Liu, S. Li, Multi-morphology image super-resolution via sparse representation, Neurocomputing (2013), http://dx.doi.org/10.1016/j.neucom.2013.04.002.
[34] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Supervised dictionary learning, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1033–1040.
[35] K. Huang, S. Aviyente, Sparse representation for signal classification, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2006, pp. 585–591.
[36] Y. Xu, Q. Zhu, Z. Fan, Y. Wang, J. Pan, From the idea of “sparse representation” to a representation-based transformation method for feature extraction, Neurocomputing 113 (2013) 168–176.
[37] F. Zang, J. Zhang, Discriminative learning by sparse representation for classification, Neurocomputing 74 (12–13) (2011) 2176–2183.
[38] Y. Han, F. Wu, D. Tao, Sparse unsupervised dimensionality reduction for multiple view data, IEEE Trans. Circuits Syst. Video Technol. 22 (10) (2012) 1485–1496.
[39] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. 43 (1) (2010) 331–341.
[40] L. Qiao, S. Chen, X. Tan, Sparsity preserving discriminant analysis for single training image face recognition, Pattern Recognit. Lett. 31 (5) (2010) 422–429.
[41] F. Yin, L.C. Jiao, F. Shang, S. Wang, B. Hou, Fast Fisher sparsity preserving projections, Neural Comput. Appl. 23 (3–4) (2013) 691–705.
[42] T. Zhou, D. Tao, Double shrinking sparse dimension reduction, IEEE Trans. Image Process. 22 (1) (2013) 224–257.
[43] N. Guan, D. Tao, Z. Luo, et al., MahNMF: Manhattan non-negative matrix factorization, CoRR abs/1207.3438, 2012.
[44] R. Baraniuk, Compressive sensing, IEEE Signal Process. Mag. 24 (4) (2007) 118–121.
[45] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[46] R. Basri, D. Jacobs, Lambertian reflection and linear subspaces, IEEE Trans. Pattern Anal. Mach. Intell. 25 (3) (2003) 218–233.
[47] A.N. Tikhonov, V.Y. Arsenin, Solution of Ill-posed Problems, Winston & Sons, Washington, 1977.
[48] G. Golub, C. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, 1996.
[49] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer, New York, 2009.
[50] T. Zhang, D. Tao, J. Yang, Discriminative locality alignment, in: Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2008, pp. 725–738.


[51] T. Zhang, D. Tao, X. Li, et al., Patch alignment for dimensionality reduction, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1299–1313.
[52] T. Zhou, D. Tao, X. Wu, Manifold elastic net: a unified framework for sparse dimension reduction, Data Min. Knowl. Discov. 22 (3) (2011) 340–371.
[53] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (21) (1999) 788–791.
[54] S. Zafeiriou, A. Tefas, I. Buciu, I. Pitas, Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification, IEEE Trans. Neural Netw. 17 (3) (2006) 683–695.
[55] N. Guan, D. Tao, Z. Luo, B. Yuan, Non-negative patch alignment framework, IEEE Trans. Neural Netw. 22 (8) (2011) 1218–1230.
[56] N. Guan, D. Tao, Z. Luo, B. Yuan, Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent, IEEE Trans. Image Process. 20 (7) (2011) 2030–2048.
[57] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[58] J. Gui, W. Jia, L. Zhu, S. Wang, D. Huang, Locality preserving discriminant projections for face and palmprint recognition, Neurocomputing 73 (13–15) (2010) 2696–2707.
[59] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.
[60] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression database, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1615–1618.
[61] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[62] L. Qiao, L. Zhang, S. Chen, An empirical study of two typical locality preserving linear discriminant analysis methods, Neurocomputing 73 (10–12) (2010) 1587–1594.

Fei Yin was born in Shaanxi, China, on October 8, 1984. He received the B.S. degree in computer science from Xidian University, Xi‘an, China, in 2006. He is currently working towards the Ph.D. degree in pattern recognition and intelligent systems at the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi‘an, China. His current research interests include pattern recognition, machine learning, data mining, and computer vision.

L.C. Jiao received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982, and the M.S. and Ph.D. degrees from Xi‘an Jiaotong University, Xi‘an, China, in 1984 and 1990 respectively. He is the author or coauthor of more than 200 scientific papers. His current research interests include signal and image processing, nonlinear circuit and systems theory, learning theory and algorithms, optimization problems, wavelet theory, and data mining.

Fanhua Shang is currently pursuing the Ph.D. degree in circuits and systems at the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi‘an, China. His current research interests include pattern recognition, machine learning, data mining, and computer vision.


Lin Xiong was born in Shaanxi, China, on July 15, 1981. He received the B.S. degree from Shaanxi University of Science & Technology, Xi‘an, China, in 2003. After graduation, he worked for Foxconn Co. from 2003 to 2005. Since 2006, he has been working toward the M.S. and Ph.D. degrees in pattern recognition and intelligent information at Xidian University. His research interests include active learning, ensemble learning, low-rank and sparse matrix factorization, subspace tracking, and background modeling in video surveillance.

Xiaodong Wang received the B.S. degree from Harbin Institute of Technology, Harbin, China, in 1988, and the M.S. degree from Inner Mongolia University of Technology, Hohhot, China, in 2007. He is currently working toward the Ph.D. degree in computer application technology at the School of Computer Science and Technology, Xidian University, and the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xi‘an, China. His current research interests include convex optimization, compressive sensing, and pattern recognition.