Double linear regressions for single labeled image per person face recognition

Pattern Recognition 47 (2014) 1547–1558

Fei Yin a,*, L.C. Jiao a, Fanhua Shang b, Lin Xiong a, Shasha Mao a

a Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Mailbox 224, No. 2 South TaiBai Road, Xi'an 710071, PR China
b Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

Article history: Received 12 March 2013; Received in revised form 10 September 2013; Accepted 19 September 2013; Available online 9 October 2013

Abstract

Recently the underlying sparse representation structure in high dimensional data has received considerable attention in pattern recognition and computer vision. In this paper, we propose a novel semi-supervised dimensionality reduction (SDR) method, named Double Linear Regressions (DLR), to tackle the Single Labeled Image per Person (SLIP) face recognition problem. DLR simultaneously seeks the best discriminating subspace and preserves the sparse representation structure. Specifically, a Subspace Assumption based Label Propagation (SALP) method, which is accomplished using Linear Regressions (LR), is first presented to propagate the label information to the unlabeled data. Then, based on the propagated labeled dataset, a sparse representation regularization term is constructed via Linear Regressions (LR). Finally, DLR takes into account both the discriminating efficiency and the sparse representation structure by using the learned sparse representation regularization term as a regularization term of Linear Discriminant Analysis (LDA). The extensive and encouraging experimental results on three publicly available face databases (CMU PIE, Extended Yale B and AR) demonstrate the effectiveness of the proposed method. © 2013 Elsevier Ltd. All rights reserved.

Keywords: Semi-supervised dimensionality reduction; Label propagation; Sparse representation; Linear regressions; Linear discriminant analysis; Face recognition

1. Introduction

In many fields of scientific research such as face recognition [1], bioinformatics [2], and information retrieval [3], the data are usually presented in a very high dimensional form. This confronts researchers with the problem of "the curse of dimensionality" [4], which limits the application of many practical technologies due to the heavy computational cost in high dimensional spaces, and deteriorates the performance of model estimation when the number of training samples is small compared to the number of features. In practice, dimensionality reduction has been employed as an effective way to deal with "the curse of dimensionality". In the past years, a variety of dimensionality reduction methods have been proposed [5–10]. According to the geometric structure considered, the existing dimensionality reduction methods can be categorized into three types: global structure based methods, local neighborhood structure based methods, and the recently proposed sparse representation structure [11,12] based methods. Two classical dimensionality reduction methods, Principal Component Analysis (PCA) [13] and Linear Discriminant Analysis (LDA) [14], belong to global structure based methods. In the field of face recognition, they are known as "Eigenfaces" [15] and "Fisherfaces" [16]. Two popular local neighborhood structure based

* Corresponding author. Tel.: +86 29 88202661; fax: +86 29 88201023. E-mail address: [email protected] (F. Yin).

0031-3203/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2013.09.013

methods are Locality Preserving Projections (LPP) [17] and Neighborhood Preserving Embedding (NPE) [18]. LPP and NPE are named "Laplacianfaces" [19] and "NPEfaces" [18] in face recognition. The representative sparse representation structure based methods include Sparsity Preserving Projections (SPP) [20], Sparsity Preserving Discriminant Analysis (SPDA) [21] and Fast Fisher Sparsity Preserving Projections (FFSPP) [22]. They have also been successfully applied to face recognition. In order to deal with the nonlinear structure in data, most of the above linear dimensionality reduction methods have been extended to their kernelized versions which perform in Reproducing Kernel Hilbert Space (RKHS) [23]. Kernel PCA (KPCA) [24] and Kernel LDA (KLDA) [25] are the nonlinear dimensionality reduction methods corresponding to PCA and LDA. Kernel LPP (KLPP) [17,26] and Kernel NPE (KNPE) [27] are the kernelized versions of LPP and NPE. The nonlinear version of SPDA is Kernel SPDA [21].

One of the major challenges to appearance-based face recognition is recognition from a single training image [28,29]. This problem is called the "one sample per person" problem: given a stored database of faces, the goal is to identify a person from the database later in time in any different and unpredictable poses, lighting, etc. from just one image per person [28]. Under many practical scenarios, such as law enforcement, driver license and passport card identification, in which there is usually only one labeled sample per person available, the classical appearance-based methods including Eigenfaces and Fisherfaces suffer a big performance drop or tend to fail to work. LDA fails to work since the within-class scatter matrix degenerates to a zero matrix when only one


sample per person is available. Zhao et al. [30] suggested replacing the within-class scatter matrix with an identity matrix to make LDA work in this setting, although the performance of this Remedied LDA (ReLDA) is still not satisfying. Due to its importance and difficulty, the one sample per person problem has aroused much interest in the face recognition community. To attack this problem, many ad hoc techniques have been developed, including synthesizing virtual samples [31,32], localizing the single training image [33], probabilistic matching [34] and neural network methods [35]. More details on the single training image problem can be found in a recent survey [28].

With the fast development of the digital photography industry, it is possible to have a large set of unlabeled images. Against this background, a more natural and promising way to attack the one labeled sample per person problem is semi-supervised dimensionality reduction (SDR). Semi-supervised Discriminant Analysis (SDA) [29] is one SDR method which has been successfully applied to single labeled image per person face recognition. SDA first learns the local neighborhood structure using the unlabeled data and then uses the learned local neighborhood structure to regularize LDA to obtain a discriminant function which is as smooth as possible on the data manifold. Laplacian LDA (LapLDA) [36], Semi-supervised LDA (SSLDA) [37], and Semi-supervised Maximum Margin Criterion (SSMMC) [37] are all reported semi-supervised dimensionality reduction methods which can improve the performance of their supervised counterparts like LDA and Maximum Margin Criterion (MMC) [38]. These methods consider the local neighborhood structure and can be unified under the graph embedding framework [37,39]. Despite the success of these SDR methods, there are still some disadvantages: (1) these SDR methods are based on the manifold assumption, which requires sufficiently many samples to characterize the data manifold [40]; (2) the adjacency graphs constructed in these methods are artificially defined, which brings the difficulty of parameter selection of neighborhood size and edge weights. To resolve these issues, Sparsity Preserving Discriminant Analysis (SPDA) [21] was presented. SPDA first learns the sparse representation structure through solving n (number of training samples) ℓ1 norm optimization problems, and then uses the learned sparse representation structure to regularize LDA. SPDA has achieved a good performance on single labeled image per person face recognition, but it still has some shortcomings: (1) it is computationally expensive, since n ℓ1 norm optimization problems need to be solved in learning the sparse representation structure, and (2) the label information is not taken advantage of in learning the sparse representation structure.

To tackle the above problems, we propose a novel SDR method, named Double Linear Regressions (DLR), which simultaneously seeks the best discriminating subspace and preserves the sparse representation structure. More specifically, a Subspace Assumption Based Label Propagation (SALP) method, which is accomplished using Linear Regressions (LR), is first presented to propagate the label information to the unlabeled data. Then, based on the propagated labeled dataset, a sparse representation regularization term is constructed via Linear Regressions (LR).
Finally, DLR takes into account both the discriminating efficiency and the sparse representation structure by using the learned sparse representation regularization term as a regularization term of linear discriminant analysis. It is worthwhile to highlight some aspects of DLR as follows:

(1) DLR is a novel semi-supervised dimensionality reduction method aiming at simultaneously seeking the best discriminating subspace and preserving the sparse representation structure.
(2) DLR can obtain the sparse representation structure via n small class-specific linear regressions. Thus, it is more time efficient than SPDA.
(3) In DLR, label information is first propagated to the whole training set. Then it is used in learning a more discriminative sparse representation structure.
(4) Unlike SDA, there are no graph construction parameters in DLR. The difficulty of selecting these parameters is avoided.
(5) Our proposed label propagation method SALP is quite general. It can be combined with other graph-based SDR methods to construct a more discriminative graph.

The rest of the paper is organized as follows. Section 2 gives a brief review of LDA and RDA. DLR is proposed in Section 3. DLR is compared with some related works in Section 4. The experimental results and discussions are presented in Section 5. Finally, Section 6 gives some concluding remarks and future work.

2. A brief review of LDA and RDA

Before we go into the details of our proposed DLR algorithm, a brief review of LDA and RDA is given in the following.

2.1. LDA

Given a set of samples $\{x_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$, let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ be the data matrix consisting of all samples. Suppose the samples are from $K$ classes. LDA aims at simultaneously maximizing the between-class scatter and minimizing the within-class scatter. The objective function of LDA is defined as follows:

$$\max_{w} \frac{w^T S_B w}{w^T S_W w}, \qquad (1)$$

$$S_B = \sum_{k=1}^{K} N_k (m - m_k)(m - m_k)^T, \qquad (2)$$

$$S_W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - m_k)(x_i - m_k)^T, \qquad (3)$$

where $m_k = \frac{1}{N_k}\sum_{i \in C_k} x_i$, $m = \frac{1}{n}\sum_{i=1}^{n} x_i$, $C_k$ is the index set of samples from class $k$, and $N_k$ is the number of samples in class $k$. $S_B$ is called the between-class scatter matrix and $S_W$ is called the within-class scatter matrix. The optimal $w$ can be computed as the eigenvector of $S_W^{-1} S_B$ that corresponds to the largest eigenvalue [14]. When there is only one labeled sample per class, LDA fails to work because $S_W$ is a zero matrix. The Remedied LDA (ReLDA) for the one labeled sample per class scenario was proposed by Zhao et al. [30], in which $S_W$ is replaced by an identity matrix.

2.2. RDA

Despite its simplicity and effectiveness for classification, LDA suffers from the Small Sample Size (SSS) problem [41]. Among the methods designed to attack this problem, Regularized Discriminant Analysis (RDA) [42,43] is a simple and effective one, whose objective function is defined as follows:

$$\max_{w} \frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w}, \qquad (4)$$

where $\lambda_1$ is the tradeoff parameter. The optimal $w$ can be computed as the eigenvector of $(S_W + \lambda_1 I)^{-1} S_B$ that corresponds to the largest eigenvalue.
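To make the review above concrete, the following is a minimal NumPy/SciPy sketch of the RDA projection of Eq. (4); it is our illustration rather than code from the paper, and the interface (X with samples as columns, a label vector, and the defaults for λ1 and d) is an assumption for the example.

```python
import numpy as np
from scipy.linalg import eigh

def rda_projection(X, labels, lam1=0.01, d=10):
    """RDA, Eq. (4): columns of W are the leading eigenvectors of
    (S_W + lam1*I)^{-1} S_B.  X is m x n (samples in columns), labels has
    length n.  At most K-1 directions carry discriminative information."""
    m, n = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        S_B += Xc.shape[1] * (mean_all - mc) @ (mean_all - mc).T   # Eq. (2)
        S_W += (Xc - mc) @ (Xc - mc).T                             # Eq. (3)
    # generalized symmetric eigenproblem  S_B w = eta (S_W + lam1 I) w
    eta, W = eigh(S_B, S_W + lam1 * np.eye(m))
    order = np.argsort(eta)[::-1]          # largest eigenvalues first
    return W[:, order[:d]]
```

With lam1 = 0 and S_W replaced by the identity, the same routine degenerates to ReLDA, and hence (per Theorem 1 below) to PCA in the one labeled sample per class setting.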


3. Double Linear Regressions

In this section, we propose a novel semi-supervised dimensionality reduction method, named Double Linear Regressions (DLR), which simultaneously seeks the best discriminating subspace and preserves the sparse representation structure. DLR consists of four steps: (1) Subspace Assumption based Label Propagation using linear regressions; (2) learning the sparse representation structure using linear regressions; (3) constructing the sparsity preserving regularizer; and (4) integrating the sparsity preserving regularizer into the objective function of linear discriminant analysis to form the DLR objective function and solving the optimization problem to obtain the embedding function.

3.1. Subspace assumption based label propagation

Assume the training sample set $X = [x_1, x_2, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}] = [X_L, X_U] \in \mathbb{R}^{m \times n}$, where $X_L$ denotes the first $l$ labeled samples, $X_U$ denotes the last $u$ unlabeled samples and $n = l + u$. Let $X_L = [X_{L1}, X_{L2}, \ldots, X_{LK}]$, where $X_{Lj} \in \mathbb{R}^{m \times l_j}$ is a matrix which consists of the $l_j$ labeled samples from class $j$. Assume that samples from a single class lie on a linear subspace. The subspace model is flexible enough to capture much of the variation in real-world data sets and is a simple and effective model in the context of face recognition [44]. It has been observed that face images under various lightings and expressions lie on a special low-dimensional subspace [16,45]. In the following, based on the labeled samples $X_L$ and the subspace assumption, the unlabeled samples $X_U$ are to be labeled.

First, the distance between a sample $x$ and a class of labeled samples $X_{Lj}$ is defined as the distance between $x$ and the subspace spanned by $X_{Lj}$:

$$d(x, X_{Lj}) = d(x, \mathrm{span}(X_{Lj})), \qquad (5)$$

where $\mathrm{span}(X_{Lj})$ denotes the subspace spanned by the columns of $X_{Lj}$. To compute this distance, the projection $\mathrm{proj}(x, \mathrm{span}(X_{Lj}))$ of $x$ on the subspace $\mathrm{span}(X_{Lj})$ is first evaluated and then $d(x, X_{Lj})$ can be evaluated as

$$d(x, X_{Lj}) = d(x, \mathrm{span}(X_{Lj})) = \|x - \mathrm{proj}(x, \mathrm{span}(X_{Lj}))\|_2, \qquad (6)$$

where $\|\cdot\|_2$ is the Euclidean norm. To compute the projection $\mathrm{proj}(x, \mathrm{span}(X_{Lj}))$, we first solve the following Linear Regression (LR) problem:

$$\min_{\beta} \|x - X_{Lj}\beta\|_2. \qquad (7)$$

$\beta$ can be obtained through the least-squares estimation [46,47]:

$$\hat{\beta} = (X_{Lj}^T X_{Lj})^{-1} X_{Lj}^T x. \qquad (8)$$

Then $\mathrm{proj}(x, \mathrm{span}(X_{Lj}))$ can be computed as

$$\mathrm{proj}(x, \mathrm{span}(X_{Lj})) = X_{Lj}\hat{\beta} = X_{Lj}(X_{Lj}^T X_{Lj})^{-1} X_{Lj}^T x. \qquad (9)$$

To propagate the label, the most reliable unlabeled sample to be labeled is selected and labeled as follows:

$$d(\hat{x}, X_{Lj^*}) = \min_{x \in X_U,\; 1 \le j \le K} d(x, X_{Lj}). \qquad (10)$$

That is, the distances between all unlabeled samples and all classes are computed first. Then we find the unique couple of an unlabeled sample $\hat{x}$ and a class $X_{Lj^*}$ which makes this distance the minimum. $\hat{x}$ is selected as the most reliable unlabeled sample to be labeled and labeled as class $j^*$. $\hat{x}$ is added into $X_{Lj^*}$ and deleted from the unlabeled sample set $X_U$. The next unlabeled sample to be labeled is selected and labeled based on the updated $X_L$ and $X_U$ in the same manner. This process continues until $X_U$ is empty, that is, all unlabeled samples are labeled. This process is summarized in Algorithm 1.

Algorithm 1. Subspace Assumption based Label Propagation (SALP)

Input: Training sample set $X = [x_1, x_2, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}] = [X_L, X_U] \in \mathbb{R}^{m \times n}$, where $X_L$ denotes labeled samples and $X_U$ denotes unlabeled samples.
Output: Labels of all training samples.
Step 1: Calculate $\hat{\beta} = (X_{Lj}^T X_{Lj})^{-1} X_{Lj}^T x$ and $\mathrm{proj}(x, \mathrm{span}(X_{Lj})) = X_{Lj}\hat{\beta}$ for all $x \in X_U$ and $1 \le j \le K$.
Step 2: Calculate $d(x, X_{Lj})$ for all $x \in X_U$ and $1 \le j \le K$.
Step 3: Calculate $d(\hat{x}, X_{Lj^*}) = \min_{x \in X_U, 1 \le j \le K} d(x, X_{Lj})$, and select $\hat{x}$ as the most reliable unlabeled sample to be labeled and label it as class $j^*$.
Step 4: Add $\hat{x}$ into $X_{Lj^*}$ and delete it from $X_U$.
Step 5: If $X_U$ is empty, go to Step 6, else go to Step 1.
Step 6: End.
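The following is a minimal Python sketch of Algorithm 1, assuming data matrices with samples as columns and integer class labels; it is an illustrative implementation of SALP rather than the authors' code, and it recomputes the subspace distances in every round exactly as the loop back to Step 1 prescribes.

```python
import numpy as np

def salp(X_lab, y_lab, X_unlab):
    """Subspace Assumption based Label Propagation (Algorithm 1), sketched.
    X_lab (m x l) holds the labeled columns with labels y_lab; X_unlab (m x u)
    holds the unlabeled columns.  Returns the propagated labels of X_unlab."""
    classes = np.unique(y_lab)
    cols = {j: [X_lab[:, i] for i in range(X_lab.shape[1]) if y_lab[i] == j]
            for j in classes}
    remaining = list(range(X_unlab.shape[1]))
    y_new = [None] * X_unlab.shape[1]
    while remaining:
        best = None                       # (distance, sample index, class)
        for i in remaining:
            x = X_unlab[:, i]
            for j in classes:
                A = np.column_stack(cols[j])                  # span(X_Lj)
                beta, *_ = np.linalg.lstsq(A, x, rcond=None)  # Eqs. (7)-(8)
                dist = np.linalg.norm(x - A @ beta)           # Eqs. (6), (9)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i_star, j_star = best          # Eq. (10): most reliable sample
        y_new[i_star] = j_star
        cols[j_star].append(X_unlab[:, i_star])               # Step 4
        remaining.remove(i_star)
    return np.array(y_new)
```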

3.2. Learning the sparse representation structure

After the label propagation, we now have the training sample set $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ and the corresponding labels of all training samples. Let $X = [X_1, X_2, \ldots, X_K]$, where $X_j = [x_{j1}, x_{j2}, \ldots, x_{jn_j}] \in \mathbb{R}^{m \times n_j}$ consists of the samples from class $j$. Assume that samples from a single class lie on a linear subspace. A sample $x_{ji}$ from class $j$ can be represented as

$$x_{ji} = X_j s_j = [x_{j1}, x_{j2}, \ldots, x_{jn_j}] s_j, \qquad (11)$$

where $s_j = [s_{j1}, s_{j2}, \ldots, s_{j,i-1}, 0, s_{j,i+1}, \ldots, s_{jn_j}]^T$ is an $n_j$ dimensional column vector in which the $i$th element is equal to zero, implying $x_{ji}$ is removed from $X_j$. $s_j$ can be computed as follows: (1) delete $x_{ji}$ from $X_j$ to form $\tilde{X}_j = [x_{j1}, x_{j2}, \ldots, x_{j,i-1}, x_{j,i+1}, \ldots, x_{jn_j}]$; (2) solve the following Linear Regression (LR) problem to obtain $\tilde{s}_j$:

$$\min_{\tilde{s}_j} \|x_{ji} - \tilde{X}_j \tilde{s}_j\|_2. \qquad (12)$$

This linear regression problem can be resolved via the least-squares estimation [46,47]:

$$\tilde{s}_j = (\tilde{X}_j^T \tilde{X}_j)^{-1} \tilde{X}_j^T x_{ji}; \qquad (13)$$

(3) insert 0 into the $i$th position of $\tilde{s}_j$ to obtain $s_j$. Now, $x_{ji}$ can be represented as

$$x_{ji} = X_j s_j = [X_1, X_2, \ldots, X_K] s = X s, \qquad (14)$$

where $s = [0^T, 0^T, \ldots, s_j^T, \ldots, 0^T]^T$. Here $x_{ji} = Xs$ is a sparse representation of $x_{ji}$, $X$ is the dictionary, and $s$ is the sparse representation coefficient vector. Then the sparse representation coefficient vectors of all training samples are computed as above. It should be noted that the linear reconstruction in Eq. (14) is sparse relative to samples from all classes and it is not necessarily sparse within the class which $x_{ji}$ belongs to. Once the sparse representation coefficient vector $s_i$ is computed for each training sample $x_i$, $i = 1, 2, \ldots, n$, we can define the sparse representation coefficient matrix as follows:

$$S = [s_1, s_2, \ldots, s_n]. \qquad (15)$$

The process of learning the sparse representation structure is shown in Fig. 1.
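Below is a sketch of how the coefficient matrix S of Eq. (15) can be assembled from the n class-specific linear regressions of Eqs. (11)–(14); again this is our illustration, with X holding the samples as columns and y holding the propagated labels.

```python
import numpy as np

def class_specific_coefficients(X, y):
    """Build S of Eq. (15) via class-specific linear regressions
    (Eqs. (11)-(14)).  Column s_i of S reconstructs x_i from the other
    samples of its own class and is zero elsewhere, so x_i ~ X @ S[:, i]."""
    m, n = X.shape
    S = np.zeros((n, n))
    for j in np.unique(y):
        idx = np.flatnonzero(y == j)          # columns of class j
        for local_i, global_i in enumerate(idx):
            others = np.delete(idx, local_i)  # remove x_ji from X_j, Eq. (12)
            A = X[:, others]
            s, *_ = np.linalg.lstsq(A, X[:, global_i], rcond=None)  # Eq. (13)
            S[others, global_i] = s           # re-insert the zero, Eq. (14)
    return S
```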


Fig. 1. The process of learning the sparse representation structure: each sample $x_i$ of the dictionary $X$ is passed through a linear regression (LR) on the remaining samples of its class to produce its coefficient vector $s_i$, and the vectors are collected into the sparse coefficient matrix $S = [s_1, s_2, \ldots, s_n]$.

3.3. Sparsity preserving regularizer

According to the above design, the sparse representation coefficient matrix $S$ reflects to some extent the intrinsic geometric structure of the data and encodes the discriminating information of the training samples. So it is expected that the sparse representation structure in the original high dimensional space can be preserved in the projected low dimensional space. To seek the projection which best preserves the sparse representation structure, the following objective function is defined:

$$\min_{w} \sum_{i=1}^{n} \|w^T x_i - w^T X s_i\|^2, \qquad (16)$$

where $s_i$ is the sparse representation coefficient vector corresponding to $x_i$. This objective function is used to form the sparsity preserving regularizer, which is defined by

$$J(w) = \sum_{i=1}^{n} \|w^T x_i - w^T X s_i\|^2. \qquad (17)$$

With some algebraic manipulation, the sparsity preserving regularizer can be recast as:

$$J(w) = \sum_{i=1}^{n} \|w^T x_i - w^T X s_i\|^2 = w^T \left( \sum_{i=1}^{n} (x_i - X s_i)(x_i - X s_i)^T \right) w = w^T \left( \sum_{i=1}^{n} \left( x_i x_i^T - x_i s_i^T X^T - X s_i x_i^T + X s_i (X s_i)^T \right) \right) w = w^T \left( X X^T - X S^T X^T - X S X^T + X S S^T X^T \right) w. \qquad (18)$$
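Since Eq. (18) is a quadratic form, the regularization matrix can be built in a few lines. The sketch below (our illustration, with an assumed function name) uses the equivalent factored form R = X(I − S)(I − S)ᵀXᵀ, which expands to exactly the four terms of Eq. (18).

```python
import numpy as np

def sparsity_regularizer(X, S):
    """Matrix of the regularizer J(w) = w^T R w in Eq. (18):
    R = X (I - S)(I - S)^T X^T
      = X X^T - X S^T X^T - X S X^T + X S S^T X^T."""
    I = np.eye(S.shape[0])
    M = X @ (I - S)
    return M @ M.T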

3.4. Double Linear Regressions based SDR

We try to simultaneously seek the best discriminating subspace and preserve the sparse representation structure. The objective function is formulated as follows:

$$\max_{w} \frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w + \lambda_2 J(w)}, \qquad (19)$$

where $S_B$ and $S_W$ are the between-class and within-class scatter matrices defined in Eqs. (2) and (3), respectively, $w^T w$ is the Tikhonov regularizer [48], and $J(w)$ is the sparsity preserving regularizer. $\lambda_1$ and $\lambda_2$ are two regularization parameters. Substituting Eq. (18) into Eq. (19) and making some algebraic reformulations, we have

$$\frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w + \lambda_2 J(w)} = \frac{w^T S_B w}{w^T S_W w + \lambda_1 w^T w + \lambda_2 w^T (X X^T - X S^T X^T - X S X^T + X S S^T X^T) w}. \qquad (20)$$

With $R = X X^T - X S^T X^T - X S X^T + X S S^T X^T$, the problem in Eq. (19) can be rewritten as

$$\max_{w} \frac{w^T S_B w}{w^T (S_W + \lambda_1 I + \lambda_2 R) w}, \qquad (21)$$

where $I$ is an identity matrix related to the Tikhonov regularizer and $R$ is the matrix corresponding to the sparsity preserving regularizer. The problem in Eq. (21) can be solved via the following generalized eigenvalue problem:

$$S_B w = \eta (S_W + \lambda_1 I + \lambda_2 R) w, \qquad (22)$$

where $\eta$ denotes the eigenvalue of the above generalized eigenvalue problem. Then the projecting matrix $W = [w_1, w_2, \ldots, w_d]$ consists of the eigenvectors corresponding to the largest $d$ eigenvalues. Based on the above discussion, the proposed DLR algorithm is summarized in Algorithm 2.

Algorithm 2. Double Linear Regressions (DLR)

Input: Training sample set $X = [x_1, x_2, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}] = [X_L, X_U] \in \mathbb{R}^{m \times n}$, where $X_L$ are labeled samples and $X_U$ are unlabeled samples, and regularization parameters $\lambda_1$ and $\lambda_2$.
Output: Projecting matrix $W = [w_1, w_2, \ldots, w_d]$.
Step 1: Conduct SALP (Algorithm 1) to obtain the labels of all the training samples $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$.
Step 2: Calculate $S_B$ and $S_W$ by Eqs. (2) and (3), respectively.
Step 3: Calculate the sparse representation coefficient vector $s$ for every sample using linear regression (LR) to obtain the sparse representation coefficient matrix $S$.
Step 4: Calculate the projecting matrix $W = [w_1, w_2, \ldots, w_d]$ by the generalized eigenvalue problem in Eq. (22).
Step 5: End.
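Putting the pieces together, a compact sketch of Algorithm 2 might look as follows. It reuses the helper functions sketched in the previous subsections (salp, class_specific_coefficients, sparsity_regularizer), and the parameter defaults are our assumptions, so it should be read as an illustration of the pipeline rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def dlr(X_lab, y_lab, X_unlab, lam1=0.01, lam2=0.1, d=10):
    """Double Linear Regressions (Algorithm 2), sketched."""
    X = np.hstack([X_lab, X_unlab])
    y = np.concatenate([y_lab, salp(X_lab, y_lab, X_unlab)])   # Step 1
    m = X.shape[0]
    mean_all = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for c in np.unique(y):                                     # Step 2
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)
        S_B += Xc.shape[1] * (mean_all - mc) @ (mean_all - mc).T
        S_W += (Xc - mc) @ (Xc - mc).T
    S = class_specific_coefficients(X, y)                      # Step 3
    R = sparsity_regularizer(X, S)
    eta, W = eigh(S_B, S_W + lam1 * np.eye(m) + lam2 * R)      # Step 4, Eq. (22)
    return W[:, np.argsort(eta)[::-1][:d]]                     # projecting matrix
```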

3.5. Kernel DLR

DLR proposed above is a linear dimensionality reduction method, so it may fail to deal with highly nonlinear structure in data. To tackle this problem, DLR is extended to its nonlinear version via the kernel method in this part. For the convenience of derivation, we first make a reformulation of problem (21). Let $S_T = S_B + S_W$, where $S_T$ is called the total scatter matrix; problem (21) is equivalent to the following problem [49]:

$$\max_{w} \frac{w^T S_B w}{w^T (S_T + \lambda_1 I + \lambda_2 R) w}. \qquad (23)$$

Without loss of generality, we assume the mean of the samples is zero; then $S_B$ and $S_T$ can be rewritten as follows:

$$S_B = X H X^T, \quad S_T = X X^T, \qquad (24)$$

where $H = \mathrm{diag}(H_1, H_2, \ldots, H_K)$ is a block diagonal matrix and $H_k$ is an $n_k \times n_k$ matrix with all its elements equal to $1/n_k$. Now, problem (21) can be recast as:

$$\max_{w} \frac{w^T X H X^T w}{w^T (X X^T + \lambda_1 I + \lambda_2 (X X^T - X S^T X^T - X S X^T + X S S^T X^T)) w}. \qquad (25)$$

Let $\phi(x)$ be a nonlinear mapping which maps the data points from the input space to the feature space. Following the kernel trick [50], we want to use the inner product $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ instead of the explicit mapping $\phi(x)$. Let $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$; the between-class scatter matrix and the total scatter matrix in the

feature space can respectively be denoted by

$$S_B^F = \Phi H \Phi^T, \quad S_T^F = \Phi \Phi^T. \qquad (26)$$

According to the Representer Theorem [51] in RKHS, the projection $w^F$ in the feature space can be represented as $w^F = \Phi a$, in which $a$ is the coefficient vector that represents $w^F$ under the basis $\Phi$ in the feature space. The objective function of kernel DLR is as follows:

$$\max_{a} \frac{a^T K H K a}{a^T (K K + \lambda_1 K + \lambda_2 K (I - S - S^T + S S^T) K) a}, \qquad (27)$$

where $K = \Phi^T \Phi$ is the kernel matrix. Then, the optimal $\hat{a}$ can be computed via the following generalized eigenvalue problem:

$$K H K a = \eta (K K + \lambda_1 K + \lambda_2 K (I - S - S^T + S S^T) K) a. \qquad (28)$$

For a given sample $x$, its low-dimensional representation can be computed as follows:

$$(w^F)^T \phi(x) = \hat{a}^T K(\cdot, x), \qquad (29)$$

where $K(\cdot, \cdot)$ is a kernel function.
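A corresponding sketch of the kernel variant, Eqs. (27)–(28), is given below. The kernel matrix K, the propagated labels y and the coefficient matrix S are taken as given, and the small ridge added to the denominator matrix is our own numerical safeguard for the generalized eigensolver, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_dlr(K, y, S, lam1=0.01, lam2=0.1, d=10):
    """Kernel DLR sketch for Eqs. (27)-(28).  K is the n x n kernel matrix of
    the training samples, y their (propagated) labels, S the matrix of
    Eq. (15).  Returns A whose columns are the leading eigenvectors a; a new
    sample x is embedded as A.T @ k_x (Eq. (29)), where k_x[i] = K(x_i, x)."""
    n = K.shape[0]
    # H of Eq. (24): within each class all entries equal 1/n_k
    H = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        H[np.ix_(idx, idx)] = 1.0 / idx.size
    I = np.eye(n)
    top = K @ H @ K
    bottom = K @ K + lam1 * K + lam2 * K @ (I - S) @ (I - S).T @ K
    eta, A = eigh(top, bottom + 1e-8 * I)   # ridge keeps bottom positive definite
    return A[:, np.argsort(eta)[::-1][:d]]
```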

3.6. Computational complexity analysis

The computational complexity of label propagation mainly concentrates on the computation of $\mathrm{proj}(x, \mathrm{span}(X_{Lj})) = X_{Lj}(X_{Lj}^T X_{Lj})^{-1} X_{Lj}^T x$ for all $x \in X_U$ and $1 \le j \le K$. The computational cost of $\mathrm{proj}(x, \mathrm{span}(X_{Lj}))$ for all $x \in X_U$ and $1 \le j \le K$ is $O(\sum_{j=1}^{K}(m n_{ji}^2 + m^2 n_{ji} + n_{ji}^3) + (u - i + 1)m^2)$, where $i$ is the iteration number in Algorithm 1, $n_{ji}$ is the number of labeled samples in class $j$ when the iteration number is $i$, and $u$ is the size of the unlabeled sample set. Since the total number of iterations is $u$, the computational cost of label propagation is $O(\sum_{i=1}^{u}(\sum_{j=1}^{K}(m n_{ji}^2 + m^2 n_{ji} + n_{ji}^3) + (u - i + 1)m^2))$. The computation of $S_B$ requires $O(m^2 K)$, while the computation of $S_W$ requires $O(m^2 n)$. The computation of the sparse representation coefficient matrix $S$ mainly involves $n$ class-specific Linear Regression problems whose computational cost is $O(\sum_{j=1}^{K} n_j (m n_j^2 + n_j^3))$. The evaluation of $R$ requires $O(m^2 n)$. The generalized eigenvalue problem in Eq. (22) can be solved in $O(m^3)$. So the overall computational complexity of our proposed DLR method is $O(\sum_{i=1}^{u}(\sum_{j=1}^{K}(m n_{ji}^2 + m^2 n_{ji} + n_{ji}^3) + (u - i + 1)m^2) + \sum_{j=1}^{K} n_j (m n_j^2 + n_j^3) + m^2 n + m^3)$.

4. Comparison with related methods

According to the "Supervise mode", dimensionality reduction methods can be categorized into 3 classes: unsupervised methods, supervised methods and semi-supervised methods [37,20]. PCA, LPP and SPP are unsupervised dimensionality reduction methods; LDA and RDA are supervised dimensionality reduction methods; SDA, SPDA and our proposed DLR are semi-supervised dimensionality reduction methods. According to the "Geometric structure" considered, dimensionality reduction methods can be divided into 3 types: global structure based methods, local neighborhood structure based methods and sparse representation structure based methods. PCA, LDA and RDA are global structure based methods; LPP and SDA are local neighborhood structure based methods; SPP, SPDA and the proposed DLR are sparse representation structure based methods. These two taxonomies are summarized in Fig. 2.

Fig. 2. (a) Taxonomy of related dimensionality reduction methods according to "Supervise mode" and (b) taxonomy of related dimensionality reduction methods according to the "Geometric structure" considered.

There is an interesting relationship between PCA, ReLDA and RDA when only one labeled sample per class is available for training. To demonstrate this relationship, we have the following theorem:

Theorem 1. When only one labeled sample per class is available for training, PCA, ReLDA and RDA reduce to the same method.

Proof. The objective function of LDA is $\max_{w} w^T S_B w / w^T S_W w$. When there is only one labeled sample per class, LDA fails to work because $S_W$ is a zero matrix. Zhao et al. [30] presented the Remedied LDA (ReLDA) for the one labeled sample per class scenario, where $S_W$ is replaced by an identity matrix, whose objective function is $\max_{w} w^T S_B w / w^T I w = \max_{\|w\|=1} w^T S_B w$. Since $S_B$ is exactly the data covariance matrix of the training set under the one labeled sample per class scenario, the objective function of ReLDA reduces to that of PCA. Thus PCA is equivalent to ReLDA under the one labeled sample per class scenario. The objective function of RDA is $\max_{w} w^T S_B w / (w^T S_W w + \lambda_1 w^T w)$. When there is only one labeled sample per class, the objective function reduces to $\max_{w} w^T S_B w / (\lambda_1 w^T w)$ as $S_W$ is a zero matrix. This objective function has the same optimal solution as that of ReLDA. Therefore, RDA is equivalent to ReLDA when there is only one labeled sample per class for training. So, when there is only one labeled sample per class, PCA, ReLDA and RDA reduce to the same method.

SDA, SPDA, and DLR all reduce to LDA when $\lambda_1 = \lambda_2 = 0$, and reduce to RDA when $\lambda_1 \neq 0$, $\lambda_2 = 0$. SDA employs the local neighborhood structure to regularize LDA, while SPDA and DLR try to capture the sparse representation structure to regularize LDA.
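Theorem 1 is also easy to check numerically; the short snippet below (our illustration on random data, not from the paper) verifies that the leading PCA, ReLDA and RDA directions coincide up to sign when every class contributes exactly one sample, so that S_W vanishes.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                       # one sample per class, K = 8
Xc = X - X.mean(axis=1, keepdims=True)
S_B = Xc @ Xc.T                                    # S_W is the zero matrix here

w_pca = np.linalg.svd(Xc, full_matrices=False)[0][:, 0]   # first PCA axis
w_relda = eigh(S_B)[1][:, -1]                              # max w'S_B w, ||w|| = 1
w_rda = eigh(S_B, 0.01 * np.eye(50))[1][:, -1]             # RDA with S_W = 0

cos = lambda a, b: abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(w_pca, w_relda), cos(w_pca, w_rda))      # both ~1.0 (same direction)
```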


Fig. 3. (a) Some faces of the first person in the PIE database, (b) partial faces of the first person in the Extended Yale B database and (c) all the 14 faces of the first person in the AR database.

DLR is different from SPDA in how it learns the sparse representation structure. SPDA learns the sparse representation structure through solving n time-consuming ℓ1 norm optimization problems, while the proposed DLR method can obtain the sparse representation structure more easily via n small class-specific linear regressions, which are more time efficient. In addition, the label information is not used by SPDA in learning the sparse representation structure, while it is employed by DLR, which may lead to a more discrimination-oriented sparse representation structure.

5. Experiments

In this section, we validate the performance of the proposed DLR method on three publicly available face databases: CMU PIE, Extended Yale B and AR. First, the proposed DLR is compared with PCA, ReLDA, RDA, LPP, SPP, SDA and SPDA. Here we essentially compare methods considering three kinds of geometric structures: global structure (PCA, ReLDA and RDA), local neighborhood structure (LPP and SDA), and sparse representation structure (SPP, SPDA and DLR). From the viewpoint of "supervise mode", we compare unsupervised methods (PCA, LPP and SPP), supervised methods (ReLDA and RDA) and semi-supervised methods (SDA, SPDA and DLR). Typically, the recognition process consists of three steps: (1) calculating the face subspace using the subspace learning method (PCA, ReLDA, RDA, LPP, SPP, SDA, SPDA or DLR) based on the training samples; (2) projecting the test face image into the learned face subspace; and (3) identifying the test face image in the face subspace by the nearest neighbor classifier. Besides, DLR is also compared with state-of-the-art classification error based methods. These methods include Nonparametric Discriminant Analysis (NDA) [52], Neighborhood Components Analysis (NCA) [53] and Large Margin Nearest Neighbor (LMNN) [54]. Moreover, we investigate the influence of the number of unlabeled samples on our proposed DLR and discuss the sensitivity of DLR in relation to its parameter in this section. All experiments were performed with Matlab 7.8.0 on a personal computer with a Pentium-IV 3.20 GHz processor, 2 GB main memory and the Windows XP operating system.

5.1. Database description

The CMU PIE face database [55] contains 41,368 face images of 68 individuals. These images were taken under varying pose, illumination and expression. As in [29], we choose the frontal pose (C27) with varying lighting in our experiments, which leaves us 43 images per person. Each face image is cropped to have a resolution of 32 × 32.
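Steps (2) and (3) of the recognition process described above amount to a projection followed by a nearest neighbor search; the following few lines are our illustrative sketch of that evaluation step (Euclidean nearest neighbor in the learned subspace), not the authors' test code.

```python
import numpy as np

def recognize(W, X_gallery, y_gallery, X_test):
    """Project the labeled gallery faces and the test faces with the learned
    projecting matrix W, then assign each test face the label of its nearest
    gallery neighbor in the subspace.  W may come from any compared method."""
    G = W.T @ X_gallery                                       # d x n_gallery
    T = W.T @ X_test                                          # d x n_test
    d2 = ((T[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)   # squared distances
    return y_gallery[d2.argmin(axis=1)]
```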

Table 1. The specific setting for the two groups of experiments.

Database   Num of samples per class   Num of classes   Experiment 1 (E1)       Experiment 2 (E2)
                                                        Unlabeled   Labeled    Unlabeled   Labeled
PIE        43                         68               2           1          29          1
Yale B     64                         38               2           1          29          1
AR         14                         100              2           1          9           1

Extended Yale B [56] consists of 2414 front-view face images of 38 individuals. There are about 64 images under different laboratory-controlled lighting conditions for each individual. In our experiments, the cropped images of size 32 × 32 are used. Due to larger illumination variations, this database may be more challenging than the PIE database.

The AR database contains over 4000 face images of 126 persons. For each person, there are 26 pictures taken in two sessions (separated by two weeks) and each session contains 13 images. These images include front-view faces with different expressions, illuminations and occlusions. In our experiments, we use the images without occlusion in the AR database provided and preprocessed by [57]. This subset contains 1400 face images of 100 persons (50 men and 50 women), where each person has 14 different images taken in two sessions. The original resolution of these face images is 165 × 120. In our experiments, for computational convenience, they were resized to 66 × 48. Fig. 3 shows some faces from these three face databases. For all our experiments, each face image is normalized to have unit norm.

5.2. Experimental setting

On each database, two groups of experiments with different numbers of unlabeled training samples are performed. In the first group, 3 face images are randomly selected from each class to form the training set, and the rest of the face images are used as the testing set. For each class, one face image is randomly selected from the training set and labeled, and the other two are left unlabeled. In the second group, the selection method is the same as in the first group, while the number of training samples per individual increases from 3 to 30 for PIE and Yale B, and to 10 for AR. The specific setting for the two groups of experiments on the three databases is given in Table 1. For all the experiments, 30 random training/testing splits are used and the average classification accuracy as well as the standard deviation is reported. In the two groups of experiments, the proposed DLR method is compared with Baseline, PCA, ReLDA, RDA, LPP, SPP, SDA and SPDA. As proved in Section 4, PCA, ReLDA and RDA reduce to the same method under the one labeled sample per class scenario.

The Baseline method represents the nearest neighbor classifier on the original face space without dimensionality reduction, where only one labeled sample per class can be used as the training set. For ReLDA/RDA/PCA, the face subspace is learned based only on the labeled samples. For LPP and SPP, the face subspace is learned using all the training samples without label information. For SDA, SPDA and DLR, the face subspace is learned employing both labeled and unlabeled samples. ReLDA and SPP are both parameter-free. Two parameters need to be set in LPP: the Neighborhood size k and the Edge weights parameter. There are 4 parameters in SDA: two regularization parameters, λ1 and λ2, and two parameters for graph construction, the Neighborhood size k and the Edge weights parameter, which are set as in [29]. For both SPDA and DLR, there are two regularization parameters: λ1 and λ2. For the convenience of comparison, the two graph construction parameters in LPP and the two regularization parameters in SPDA and DLR are assigned the same values as in SDA. The specific parameter values for LPP, SDA, SPDA and DLR are shown in Table 2. "Cosine" means the edge weight between two samples is computed as the cosine of the angle between them. The "Cosine" function is defined as follows:

$$\mathrm{Cosine}(x_1, x_2) = \frac{x_1^T x_2}{\|x_1\|_2 \cdot \|x_2\|_2}. \qquad (30)$$

"Auto" means the sparse reconstructive coefficients are used as the edge weights in SPDA, so the Neighborhood size k is automatically the number of samples whose corresponding coefficients are nonzero, and the Edge weights in SPDA need not be defined artificially.

Table 2. Parameter setting for LPP, SDA, SPDA and DLR.

Method   λ1     λ2     Neighbor size, k   Edge weights
LPP      None   None   2                  Cosine
SDA      0.01   0.1    2                  Cosine
SPDA     0.01   0.1    Auto               Auto
DLR      0.01   0.1    None               None

5.3. Experimental results and discussions

The recognition rates of the compared methods corresponding to the three databases and two groups of experiments are summarized in Table 3, in which E1 and E2 denote respectively the first and second groups of experiments. As in [58], the standard deviation, the difference in the average classification accuracies, and the statistical significance based on a paired t-test with significance level α = 0.05 are also tabulated in the same table. Table 4 summarizes the win/tie/loss counts of DLR versus the other compared methods based on the same paired t-tests. It has been shown in Section 4 that PCA, ReLDA, and RDA reduce to the same method when only one labeled sample per class is available for training, so they have the same experimental results. The times consumed by SPDA and DLR for learning the embedding functions of the two groups of experiments on the three tested databases are shown in Table 5, in which the best performance for each data set is highlighted. From the experimental results shown in Tables 3–5, we can observe the following:

Table 3. Recognition rates (mean ± std, %) of the compared methods corresponding to three databases and two groups of experiments. For each compared method x, the difference y − x between the average accuracy of DLR (y) and that of x is given in parentheses, together with H; H: "1" rejects the null hypothesis that the two compared means are equal at significance level of 0.05, "0" otherwise.

PIE E1: Baseline (x1) 25.43 ± 1.15 (37.78, 1); PCA/ReLDA/RDA (x2) 25.43 ± 1.15 (37.78, 1); LPP [19] (x3) 38.15 ± 2.81 (25.06, 1); SPP [20] (x4) 62.55 ± 2.00 (0.66, 0); SDA [29] (x5) 30.51 ± 1.97 (32.70, 1); SPDA [21] (x6) 67.47 ± 1.80 (−4.26, 1); DLR (y) 63.21 ± 1.76.
PIE E2: Baseline 25.60 ± 1.65 (67.81, 1); PCA/ReLDA/RDA 25.60 ± 1.65 (67.81, 1); LPP 57.90 ± 1.94 (35.51, 1); SPP 51.29 ± 3.10 (42.12, 1); SDA 59.46 ± 3.12 (33.95, 1); SPDA 70.44 ± 3.00 (22.97, 1); DLR 93.41 ± 1.84.
Yale B E1: Baseline 12.95 ± 1.10 (20.85, 1); PCA/ReLDA/RDA 12.95 ± 1.10 (20.85, 1); LPP 22.32 ± 2.48 (11.48, 1); SPP 17.95 ± 3.10 (15.85, 1); SDA 16.41 ± 2.12 (17.39, 1); SPDA 31.27 ± 3.60 (2.53, 0); DLR 33.80 ± 3.28.
Yale B E2: Baseline 12.90 ± 1.19 (42.97, 1); PCA/ReLDA/RDA 12.90 ± 1.19 (42.97, 1); LPP 25.36 ± 2.57 (30.51, 1); SPP 14.28 ± 3.20 (41.59, 1); SDA 27.00 ± 3.96 (28.87, 1); SPDA 35.44 ± 3.30 (20.43, 1); DLR 55.87 ± 4.26.
AR E1: Baseline 26.96 ± 1.58 (30.49, 1); PCA/ReLDA/RDA 26.96 ± 1.58 (30.49, 1); LPP 33.52 ± 2.05 (23.93, 1); SPP 44.57 ± 2.60 (12.88, 1); SDA 24.58 ± 1.92 (32.87, 1); SPDA 58.46 ± 2.00 (−1.01, 0); DLR 57.45 ± 2.95.
AR E2: Baseline 26.98 ± 1.96 (42.85, 1); PCA/ReLDA/RDA 26.98 ± 1.96 (42.85, 1); LPP 36.28 ± 2.39 (33.55, 1); SPP 55.06 ± 3.10 (14.77, 1); SDA 29.38 ± 2.87 (40.45, 1); SPDA 61.23 ± 2.50 (8.60, 1); DLR 69.83 ± 3.39.

(1) ReLDA achieves relatively low recognition rates because it uses only one labeled sample per class to learn the face subspace. In contrast, despite being unsupervised methods, LPP and SPP outperform ReLDA due to the help of extra unlabeled samples.

Table 4. Win/tie/loss counts of DLR versus the other methods based on paired t-tests with significance level α = 0.05.

Method         Baseline   PCA/ReLDA/RDA   LPP     SPP     SDA     SPDA    In All
Win/tie/loss   6/0/0      6/0/0           6/0/0   5/1/0   6/0/0   3/2/1   32/3/1

Table 5. The time (s) for learning the embedding functions of SPDA and DLR. The best performance for each data set is highlighted.

Method   PIE E1   PIE E2   Yale B E1   Yale B E2   AR E1   AR E2
SPDA     82.99    729.6    7.189       208.8       147.2   1577
DLR      10.13    503.2    0.8327      119.0       15.05   191.1

(2) Semi-supervised dimensionality reduction methods always perform better than the methods using only labeled samples. This indicates that unlabeled samples play an important role in obtaining a good description of the data and achieving a good performance. (3) For all semi-supervised methods (SDA, SPDA, DLR), the performance in the second group of experiments (E2) is better than that in the first group of experiments (E1) on all three databases, because the more unlabeled samples are employed, the better the underlying geometrical structure of the data can be captured by semi-supervised methods. (4) SPDA and our proposed DLR always achieve a better performance than SDA under both the first group of experiments (E1) and the second group of experiments (E2) on all three databases. This illustrates that a well-constructed data-dependent regularizer is important in the ultimate recognition performance. (5) DLR performs consistently better than SPDA in the second group of experiments (E2). This demonstrates that DLR can achieve a better recognition rate than SPDA if unlabeled samples are enough and validates that label information is important in learning the sparse representation structure. In the first group of experiments (E1), DLR is superior to SPDA on Extended Yale B while slightly inferior to SPDA on PIE and AR. This is because, under E1 setting, only few unlabeled samples are available, which cannot well describe the underlying subspace structure in data. This may deteriorate the performance of label propagation and the learning of sparse representation structure in DLR. On the other side, collaborative representation mechanism [59] helps SPDA when unlabeled samples are scarce. In addition, when the size of training set is enlarged from 3 to 30/10 (from E1 setting to E2 setting), the recognition rate of SPDA on three tested databases increases only 2–4 percent while the recognition rate of DLR increases more than 12 percent on all three tested databases, especially achieving a 30 percent improvement on PIE. This indicates that when unlabeled samples are sufficient, DLR can employ the unlabeled samples more effectively than SPDA. (6) DLR is faster than SPDA in all experiments. Specifically, DLR is 8–10 times faster than SPDA in the first group of experiments (E1), and it is 1.5–8 times faster than SPDA in the second group of experiments (E2).1

Besides, we compare the average performance of SPDA and DLR on the two groups of experiments. The results are shown in Figs. 4 and 5. (As suggested in [21], in Experiment 2, where unlabeled samples are sufficient, the faster ensemble version of SPDA is used.)

Fig. 4. The average recognition rate of SPDA and DLR on two groups of experiments.

Fig. 5. The average time consumed by SPDA and DLR on two groups of experiments.

Fig. 6. Average recognition rate on all data sets.

From the results, in the first group of experiments (E1), DLR is slightly inferior to SPDA on average, but it is much faster than SPDA. In the second group of experiments (E2), DLR not only achieves a much better average recognition rate than SPDA but also takes less time than SPDA. The average recognition rates on all tested data sets of all compared methods are plotted in Fig. 6. From this figure, we can see that the best average performance is achieved by the proposed DLR method, followed by SPDA, and DLR is more than 8 percent better than the second best method, SPDA. So, on the average, the proposed DLR achieves the best performance of all compared methods.

Furthermore, DLR is compared with three state-of-the-art classification error based methods: NDA, NCA, and LMNN. For


these methods to work, at least two labeled samples per class are needed [52,60], so we compare DLR and these methods in a different setting. 30/10 face images (for PIE, Yale B/AR) are randomly selected from each class to form the training set, and the rest of the face images are used as the testing set. For each class, two face images are randomly selected from the training set and labeled, and the others are left unlabeled. For each of these compared methods, the corresponding algorithm parameters are properly adjusted and the best results obtained are recorded. 30 random training/testing splits are used and the average classification accuracies as well as the standard deviations are reported in Table 6. From Table 6, the proposed DLR achieves a statistically significantly better performance than the state-of-the-art classification error based methods on the three tested databases. This is because when only an extremely limited number of labeled samples are available, classification error based methods tend to overfit the labeled samples [54], while DLR can avoid overfitting via effectively employing unlabeled samples.

Table 6. Recognition rates (mean ± std, %) of classification error based methods and DLR. The best performances based on paired t-tests with significance level α = 0.05 are highlighted in boldface.

Database   NDA            NCA            LMNN           DLR
PIE        74.86 ± 3.49   83.63 ± 1.91   87.94 ± 2.47   98.67 ± 1.12
Yale B     39.25 ± 3.09   44.57 ± 2.94   49.03 ± 3.51   79.29 ± 3.33
AR         64.20 ± 2.97   71.85 ± 2.04   77.33 ± 1.82   81.50 ± 1.42

5.4. Further explorations of the DLR algorithm

First, we investigate the influence of the number of unlabeled samples on the performance of DLR. On PIE and Extended Yale B,

we observe the performance of DLR when the number of labeled samples is 1 and the number of unlabeled samples is 2, 5, 8, 11, 14, 17, 20, 23, 26, and 29. On AR, we consider the performance of DLR when the number of labeled samples is 1 and the number of unlabeled samples is 2, 3, 4, 5, 6, 7, 8, and 9. The experimental results are shown in Fig. 7. From the experimental results, we can see that as the number of unlabeled samples increases, the performance of DLR increases consistently on the three databases. This validates the usefulness of unlabeled samples and indicates that the more unlabeled samples we can secure, the better performance we can achieve. There are two parameters in DLR, the regularization parameters λ1 and λ2. As in other regularized LDA methods [21,29,61], λ1 is set to be 0.01 in DLR. In this part, we study the influence of λ2 on the performance of DLR. Figs. 8–10 show the influence of λ2 on the performance of DLR on PIE, Extended Yale B, and AR respectively, when λ2 ranges from 0.1 to 2, equally spaced with interval 0.2. As is shown in the experimental results, the recognition rate of DLR does not fluctuate greatly with the variation of λ2 on all three databases, so it is robust to the regularization parameter λ2.

Fig. 7. Recognition rate versus the number of unlabeled samples on PIE, Extended Yale B and AR.

Fig. 8. The performance of DLR varies with the regularization parameter λ2 on PIE with 3 (left) and 30 (right) training samples.

Fig. 9. The performance of DLR varies with the regularization parameter λ2 on Extended Yale B with 3 (left) and 30 (right) training samples.

Fig. 10. The performance of DLR varies with the regularization parameter λ2 on AR with 3 (left) and 10 (right) training samples.

6. Conclusions and future work

In this paper, we proposed a novel semi-supervised dimensionality reduction method, named Double Linear Regressions (DLR), to attack the single labeled image per person face recognition problem. DLR simultaneously seeks the best discriminating subspace and preserves the sparse representation structure. A Subspace Assumption based Label Propagation (SALP) method, which is accomplished using Linear Regressions (LR), is first presented to propagate the label information to the unlabeled data. Then, based on the propagated labeled dataset, a sparse representation regularization term is constructed via Linear Regressions (LR). Finally, DLR takes into account both the discriminating efficiency and the sparse representation structure by using the learned sparse representation regularization term as a regularization term of linear discriminant analysis. The extensive experiments on three publicly available face databases demonstrate the promising performance of our proposed DLR method, from which we also find that the proposed DLR method can better employ unlabeled samples than SDA and SPDA, and has high parameter stability.

According to the experimental results, our proposed DLR outperforms all the other compared methods when unlabeled samples are sufficient (E2). However, when unlabeled samples are scarce (E1), the performance of DLR is not good enough on PIE and AR. This is because when unlabeled samples are too few, the subspace structure may not be well captured. This will deteriorate the performance of label propagation and the learning of the sparse representation structure. How to tackle this problem is one of our future focuses. One possible strategy is to combine the collaborative representation mechanism [59] into DLR when unlabeled samples are scarce. Another interesting direction is to design or explore other label propagation methods which can propagate the labels more accurately, because obtaining more reliable labels in the label propagation phase would be very beneficial.

Conflict of interest statement

None declared.


Acknowledgments This work is supported by the National Natural Science Foundation of China Nos. 61173090, 61072106, 60971112, 60971128, 60970067, and 61072108; the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) No. B07048; the Program for Cheung Kong Scholars and Innovative Research Team in University No. IRT1170.

References [1] W. Zhao, R. Chellappa, P. Phillips, A. Rosenfeld, Face recognition: a literature survey, ACM Computing Surveys 35 (4) (2003) 399–458. [2] P. Baldi, S. Brunak, Bioinformatics: The Machine Learning Approach, second ed., MIT Press, Cambridge, 2001. [3] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008. [4] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006. [5] L.O. Jimenez, D.A. Landgrebe, Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data, IEEE Transactions on Systems, Man and Cybernetics 28 (1) (1997) 39–54. [6] T. Zhou, D. Tao, X. Wu, Manifold elastic net: a unified framework for sparse dimension reduction, Data Mining and Knowledge Discovery 22 (3) (2011) 340–371. [7] S. Gunal, R. Edizkan, Subspace based feature selection for pattern recognition, Information Sciences 178 (19) (2008) 3716–3726. [8] H. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, A survey of multilinear subspace learning for tensor data, Pattern Recognition 44 (7) (2011) 1540–1551. [9] D. Zhang, Z. Zhou, S. Chen, Semi-supervised dimensionality reduction, in: Proceedings of the 7th SIAM International Conference on Data Mining (SDM), 2007, pp. 629–634. [10] L. Zhang, S. Chen, L. Qiao, Graph optimization for dimensionality reduction with sparsity constraints, Pattern Recognition 45 (3) (2012) 1205–1210. [11] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE 98 (6) (2010) 1031–1044. [12] H. Cheng, Z. Liu, L. Yang, X. Chen, Sparse representation and learning in visual recognition: theory and applications, Signal Processing 93 (6) (2013) 1408–1425. [13] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986. [14] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1990. [15] M.A. Turk, A.P. Pentland, Eigenfaces for recognition, Journal of Congnitive Neuroscience 3 (1) (1991) 71–86. [16] P.N. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 711–720. [17] X. He, P. Niyogi, Locality preserving projections, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2003, pp. 585–591. [18] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood preserving embedding, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2005, pp. 1208–1213. [19] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using laplacianfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3) (2005) 328–340. [20] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognition 43 (1) (2010) 331–341. [21] L. Qiao, S. Chen, X. Tan, Sparsity preserving discriminant analysis for single training image face recognition, Pattern Recognition Letters 31 (5) (2010) 422–429. [22] F. Yin, L.C. Jiao, F. Shang, S. Wang, B. Hou, Fast fisher sparsity preserving projections, Neural Computing and Applications 23 (3–4) (2013) 691–705. [23] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society 68 (3) (1950) 337–404. [24] B. Scholkopf, A.J. Smola, K. 
Muller, Kernel principal component analysis, in: Proceedings of Advances in Kernel Methods-Support Vector Learning, 1999, pp. 327–352. [25] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, K. Muller, Fisher discriminant analysis with kernels, in: Proceedings of IEEE International Workshop on Neural Networks for Signal Processing, volume IX, 1999, pp. 41–48. [26] J. Li, J. Pan, S. Chu, Kernel class-wise locality preserving projection, Information Sciences 178 (7) (2008) 1825–1835. [27] Z. Wang, X. Sun, Face recognition using kernel-based NPE, in: Proceedings of IEEE International Conference on Computer Science and Software Engineering (CSSE), 2008, pp. 802–805. [28] X. Tan, S. Chen, Z.H. Zhou, F. Zhang, Face recognition from a single image per person: a survey, Pattern Recognition 39 (9) (2006) 1725–1745. [29] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.


[30] W. Zhao, R. Chellappa, P.J. Phillips, Subspace linear discriminant analysis for face recognition, Technical Report CAR-TR-914, Center for Automation Research, University of Maryland, 1999. [31] D. Beymer, T. Poggio, Face recognition from one example view, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 1995, pp. 500– 507. [32] P. Niyogi, F. Girosi, T. Poggio, Incorporating prior information in machine learning by creating virtual examples, Proceedings of the IEEE 86 (11) (1998) 2196–2209. [33] S.C. Chen, J. Liu, Z.H. Zhou, Making FLDA applicable to face recognition with one sample per person, Pattern Recognition 37 (7) (2004) 1553–1555. [34] A.M. Martinez, Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (6) (2002) 748–763. [35] X. Tan, S.C. Chen, Z.H. Zhou, F. Zhang, Recognizing partially occluded, expression variant faces from single training image per person with SOM and soft kNN ensemble, IEEE Transactions on Neural Networks 16 (4) (2005) 875–886. [36] J.H. Chen, J.P. Ye, Q. Li, Integrating global and local structures: a least squares framework for dimensionality reduction, in: Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8. [37] Y.Q. Song, F.P. Nie, C.S. Zhang, S.M. Xiang, A unified framework for semisupervised dimensionality reduction, Pattern Recognition 41 (9) (2008) 2789–2799. [38] H.F. Li, T. Jiang, K.S. Zhang, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Neural Networks 17 (1) (2006) 157–165. [39] S.C. Yan, D. Xu, B.Y. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 40–51. [40] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006) 2399–2434. [41] S.J. Raudys, A.K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (3) (1991) 252–264. [42] J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84 (405) (1989) 165–175. [43] S. Ji, J. Ye, Generalized linear discriminant analysis: a unified framework and efficient model selection, IEEE Transactions on Neural Networks 19 (10) (2008) 1768–1782. [44] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227. [45] R. Basri, D. Jacobs, Lambertian reflection and linear subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (3) (2003) 218–233. [46] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2001. [47] G.A.F. Seber, Linear Regression Analysis, Wiley-Interscience, Hoboken, NJ, 2003. [48] A.N. Tikhonov, V.Y. Arsenin, Solution of Ill-posed Problems, Winston & Sons, Washington, 1977. [49] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognition using LDA based algorithms, IEEE Transactions on Neural Networks 14 (1) (2003) 195–200. [50] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002. [51] B. Scholkopf, R. 
Herbrich, A.J. Smola, A generalized representer theorem, in: Proceedings of 14th Annual Conference on Computational Learning Theory (COLT), 2001, pp. 416–426. [52] M. Bressan, J. Vitrià, Nonparametric discriminant analysis and nearest neighbor classification, Pattern Recognition Letters 24 (15) (2003) 2743–2749. [53] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighbourhood components analysis, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2005, pp. 513–520. [54] K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (2009) 207–244. [55] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression database, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (12) (2003) 1615–1618. [56] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (5) (2005) 684–698. [57] A.M. Martinez, A.C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2) (2001) 228–233. [58] K. Toh, H. Eng, Between classification-error approximation and weighted least-squares learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (4) (2008) 658–669. [59] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: which helps face recognition? in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2011, pp. 471–478. [60] M. Butman, J. Goldberger, Face recognition using classification-based linear projections, EURASIP Journal on Advances in Signal Processing (2008) 1–7. [61] L. Qiao, L. Zhang, S. Chen, An empirical study of two typical locality preserving linear discriminant analysis methods, Neurocomputing 73 (10–12) (2010) 1587–1594.


Fei Yin was born in Shaanxi, China, on October 8, 1984. He received the B.S. degree in Computer Science from Xidian University, Xi'an, China, in 2006. Currently he is working towards the Ph.D. degree in pattern recognition and intelligent systems at the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an, China. His current research interests include pattern recognition, machine learning, data mining, and computer vision.

L. C. Jiao received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982, and the M.S. and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 1984 and 1990, respectively. He is the author or coauthor of more than 200 scientific papers. His current research interests include signal and image processing, nonlinear circuit and systems theory, learning theory and algorithms, optimization problems, wavelet theory, and data mining.

Fanhua Shang received his Ph. D. degree from the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi’an, China, in 2012. Currently, he is a postdoctoral research associate in the Department of Electrical and Computer Engineering at Duke University, Durham, NC, USA. His current research interests include pattern recognition, machine learning, data mining, and computer vision.

Lin Xiong was born in Shaanxi, China, on July 15, 1981. He received the B.S. degree from Shaanxi University of Science & Technology, Xi’an, China, in 2003. He worked for Foxconn Co. after graduation from 2003 to 2005. Since 2006, he has been working toward the M.S. and Ph.D. degree in pattern recognition and intelligent information at Xidian University. His research interests include active learning, ensemble learning, low-rank and sparse matrix factorization, Subspace tracking and Background modeling in video surveillance.

Shasha Mao was born in Shaanxi, China, on August 1, 1985. She received the B.S. degree from Xidian University, Xi’an, China, in 2006. Since 2006, she has been working toward the M.S. and Ph.D. degrees in circuit and system at Xidian University. Her research interests include ensemble learning, active learning and data mining.