Optimal feature extraction methods for classification methods and their applications to biometric recognition


ARTICLE IN PRESS

JID: KNOSYS

[m5G;February 20, 2016;14:1]

Knowledge-Based Systems 0 0 0 (2016) 1–11

Contents lists available at ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

Jun Yin∗, Weiming Zeng, Lai Wei

College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China

Article info

Article history:
Received 24 September 2015
Revised 6 January 2016
Accepted 30 January 2016
Available online xxx

Keywords: Nearest feature line; Sparse representation; Linear discriminant analysis; Discriminative projection

Abstract

Classification is often performed after feature extraction. To improve the recognition performance, we can develop the optimal feature extraction method for a given classification method. In this paper, we propose three feature extraction methods, Discriminative Projection for Nearest Neighbor (DP-NN), Discriminative Projection for Nearest Mean (DP-NM) and Discriminative Projection for Nearest Feature Line (DP-NFL), which are optimal for the classification methods Nearest Neighbor (NN), Nearest Mean (NM) and Nearest Feature Line (NFL), respectively. We also prove that DP-NN and DP-NM are equivalent to Linear Discriminant Analysis (LDA) under a certain condition. In the experiments, LDA, DP-NFL and SRC steered Discriminative Projection (SRC-DP) are used for feature extraction, and the extracted features are then classified by NN, NM, NFL, Sparse Representation based Classification (SRC) and Collaborative Representation Classifier (CRC). Experimental results of biometric recognition show that the proposed DP-NFL performs well, and that combining an effective classification method with the optimal feature extraction method for it can perform best. © 2016 Elsevier B.V. All rights reserved.

1. Introduction

Feature extraction and classification are two essential procedures of pattern recognition. The original data usually have high dimensionality. Because of the computational cost and the curse of dimensionality [1], it is not suitable to recognize these data directly. Feature extraction reduces the dimensionality and retains the most important features at the same time. In the past few decades, numerous feature extraction techniques have been proposed. Principal Component Analysis (PCA) [2] and Linear Discriminant Analysis (LDA) [3] are two of the most classical methods. PCA is an unsupervised method that projects the data along the directions of maximal variance. LDA is a supervised method that seeks the projection that maximizes the between-class scatter and minimizes the within-class scatter simultaneously. PCA and LDA are linear feature extraction methods, and they cannot discover the underlying manifold structure of the data.

Subsequently, a number of manifold learning methods were proposed to analyze high dimensional data that lie on or near a low dimensional manifold, such as Locally Linear Embedding (LLE) [4], Isometric Mapping (Isomap) [5], Laplacian Eigenmaps (LE) [6] and Local Tangent Space Alignment (LTSA) [7]. These manifold learning methods are effective in representing nonlinear data, but they are defined only on the training samples and cannot find the low dimensional representation of new test samples. Therefore, they cannot be directly applied to the pattern recognition problem. Neighborhood Preserving Embedding (NPE) [8], Isometric Projection (IsoProjection) [9], Locality Preserving Projections (LPP) [10] and Linear Local Tangent Space Alignment (LLTSA) [11] solve this problem by learning explicit projections from the original high dimensional space to the low dimensional embedding space, and they can be seen as the linear versions of LLE, Isomap, LE and LTSA.

However, these linear feature extraction methods based on manifold learning are unsupervised, and they are designed to preserve the locality of samples in the low dimensional space rather than to provide good discriminating ability. To increase the discriminating ability, some supervised feature extraction methods based on manifold learning were proposed. The representative methods include Local Discriminant Embedding (LDE) [12], Marginal Fisher Analysis (MFA) [13], Discriminant Simplex Analysis (DSA) [14], Neighborhood Discriminant Projection (NDP) [15], Local Sensitive Discriminant Analysis (LSDA) [16] and Multi-Manifold Discriminant Analysis (MMDA) [17]. These methods consider both the within-class information and the between-class information.

For the feature extraction methods based on manifold learning, it is unclear how to select the neighborhood size and how to define the adjacent weight matrix, which are key problems for these methods. Recently, with the development of sparse representation theory [18–22], some feature extraction methods used sparse representation coefficients to construct the adjacent weight matrix.

∗ Corresponding author: J. Yin. Tel.: +86 02138282823.

http://dx.doi.org/10.1016/j.knosys.2016.01.043 0950-7051/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: J. Yin et al., Optimal feature extraction methods for classification methods and their applications to biometric recognition, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.01.043


Sparsity Preserving Projection (SPP) [23] and Sparse Neighborhood Preserving Embedding (SNPE) [24] are the representative methods. Both SPP and SNPE use sparse representation coefficients to construct the adjacent weight matrix. Unlike manifold learning based feature extraction methods, these sparse representation based methods do not need to select model parameters when constructing the adjacent weight matrix. To improve the recognition ability, some works introduced class label information into sparse representation, such as Discriminant Sparsity Preserving Embedding (DSPE) [25], Discriminant Sparse Neighborhood Preserving Embedding (DSNPE) [26, 27] and Weighted Discriminative Sparsity Preserving Embedding (WDSPE) [28].

Classification is an indispensable procedure for recognizing the data. After feature extraction, a classification method is used to recognize the extracted features. Several classification approaches have been proposed over the past few decades [29, 30]. Among them, the classification methods Nearest Neighbor (NN) and Nearest Mean (NM) are widely used because of their simplicity and availability. NN and NM are based on the nearest distance between two data points, while the classification method Nearest Feature Line (NFL) [31] is based on the nearest distance from a data point to the feature line passing through two data points. Motivated by sparse representation, Wright et al. proposed Sparse Representation based Classification (SRC) [18]. SRC assigns a test sample to the class that has the minimal sparse reconstruction residual, where the L1-norm is adopted in the sparse representation. Zhang et al. pointed out that the collaborative representation strategy is more important than the L1-norm-based sparsity constraint and proposed the Collaborative Representation Classifier (CRC) [32]. Yin et al. proposed Kernel Sparse Representation based Classification (KSRC) [33]. They considered that data in a higher dimensional space probably have better linear separability and performed SRC in the transformed higher dimensional space by the kernel trick.

Recently, based on SRC, Yang et al. proposed SRC steered Discriminative Projection (SRC-DP) [34, 35]. They tried to give the feature extraction method a direct connection to SRC. The connection between SRC and SRC-DP is constructed by utilizing sparse representation coefficients in the transformed low dimensional space.

Classification is performed after feature extraction. Therefore, to improve the recognition performance, we can strengthen the connection between classification and feature extraction and make the feature extraction method fit the classification method as much as possible. This idea is not limited to SRC. In this paper, we start from the classification methods NN, NM and NFL, and develop the feature extraction methods Discriminative Projection for NN (DP-NN), Discriminative Projection for NM (DP-NM) and Discriminative Projection for NFL (DP-NFL), which are optimal for NN, NM and NFL, respectively. For DP-NN and DP-NM, we also prove that they are equivalent to LDA under a certain condition. For a pattern recognition system, it is a good scheme to combine an effective classification method with the optimal feature extraction method for this classification method. Experimental results of biometric recognition support this viewpoint.

The remainder of this paper is organized as follows: Section 2 briefly reviews the related classification and feature extraction methods. Section 3 describes the DP-NN and DP-NM algorithms and gives the relationship among DP-NN, DP-NM and LDA. The DP-NFL algorithm is also presented in this section. Section 4 presents experiments and discussions of the experimental results. The conclusions are summarized in Section 5. The important notations used throughout the rest of the paper are listed in Table 1.

Table 1
Summary of the notations.

Notations                                                      Description
$n$                                                            Number of training samples
$n_i$                                                          Number of training samples in class $i$
$c$                                                            Number of classes
$d$                                                            Dimension of samples
$m$                                                            Center of all the training samples
$m_i$                                                          Center of training samples in class $i$
$X = [X_1, X_2, \ldots, X_c] \in R^{d \times n}$               Training sample matrix
$X_i = [x_{i1}, x_{i2}, \ldots, x_{in_i}] \in R^{d \times n_i}$   Training sample matrix of class $i$

2. Related classification and feature extraction methods

In this section, we first review three classification methods, NM, NN and NFL, and then outline the popular feature extraction method LDA.

2.1. Classification methods

NM: For a test sample $y$, NM calculates the distance between $y$ and the center of each class $m_i$ $(i = 1, 2, \ldots, c)$, where $m_i = \left(\sum_{j=1}^{n_i} x_{ij}\right)/n_i$. $y$ is assigned to the class with the minimal distance:

$$\operatorname{identity}(y) = \arg\min_i \|y - m_i\|_2 \tag{1}$$

NN: For a test sample $y$, NN calculates the distance between $y$ and each training sample $x_{ij}$ $(i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n_i)$. $y$ is assigned to the class that the training sample with the minimal distance belongs to:

$$\operatorname{identity}(y) = \arg\min_i \|y - x_{ij}\|_2 \tag{2}$$

NFL: For a test sample $y$, NFL calculates the distance from $y$ to each feature line. A feature line is obtained by connecting two training samples from the same class. Suppose $x_{ij}$ and $x_{ik}$ are two training samples from class $i$. The feature line passing through $x_{ij}$ and $x_{ik}$ is denoted as $\overline{x_{ij} x_{ik}}$, and the distance from $y$ to $\overline{x_{ij} x_{ik}}$ is denoted as $d(y, \overline{x_{ij} x_{ik}})$. To compute $d(y, \overline{x_{ij} x_{ik}})$, $y$ is projected onto $\overline{x_{ij} x_{ik}}$ as the point $\tilde{y}$, and we have $d(y, \overline{x_{ij} x_{ik}}) = \|y - \tilde{y}\|_2$. $\tilde{y}$ can be computed as $\tilde{y} = x_{ij} + \mu (x_{ik} - x_{ij})$, where $\mu = (y - x_{ij})^T (x_{ik} - x_{ij}) / \left((x_{ik} - x_{ij})^T (x_{ik} - x_{ij})\right)$. The test sample $y$ is assigned to the class that has the minimal $d(y, \overline{x_{ij} x_{ik}})$:

$$\operatorname{identity}(y) = \arg\min_i d(y, \overline{x_{ij} x_{ik}}) \tag{3}$$

where $1 \le i \le c$ and $1 \le j < k \le n_i$.

2.2. LDA

LDA tries to find the projection by maximizing the between-class scatter and minimizing the within-class scatter simultaneously. The between-class scatter matrix $S_b$, the within-class scatter matrix $S_w$ and the global scatter matrix $S_t$ can be defined by

$$S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T \tag{4}$$

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m_i)(x_{ij} - m_i)^T \tag{5}$$


and

$$S_t = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^T \tag{6}$$

We have $S_t = S_b + S_w$. The criterion of LDA is defined as:

$$p = \arg\max_p \frac{p^T S_b p}{p^T S_w p} \tag{7}$$

where $p$ is the projection vector.

3. Optimal feature extraction methods for classification methods

Based on the idea that the feature extraction method should fit the classification method as much as possible, we propose Discriminative Projection for NM (DP-NM), Discriminative Projection for NN (DP-NN) and Discriminative Projection for NFL (DP-NFL) in this section. They are derived from NM, NN and NFL separately and are optimal for these classification methods. In addition, we prove that DP-NM and DP-NN are equivalent to LDA under a certain condition.

3.1. Discriminative projection for NM

According to the decision rule of NM, a sample should be as close as possible to the center of the class that it belongs to and as far as possible from the centers of other classes. To make NM obtain good results on all the training samples, we characterize the within-class scatter based on NM and the between-class scatter based on NM. Suppose that each training sample $x_{ij}$ $(i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n_i)$ is transformed to $y_{ij} = p^T x_{ij}$ by the projection vector $p$ and each class center $m_i$ $(i = 1, 2, \ldots, c)$ is transformed to $\tilde{m}_i = p^T m_i$. The within-class scatter based on NM after projection is defined as:

$$\begin{aligned}
\sum_{i=1}^{c} \sum_{j=1}^{n_i} \|y_{ij} - \tilde{m}_i\|^2
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} (y_{ij} - \tilde{m}_i)^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} (p^T x_{ij} - p^T m_i)^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} p^T (x_{ij} - m_i)(x_{ij} - m_i)^T p \\
&= p^T S_w^{NM} p
\end{aligned} \tag{8}$$

where $S_w^{NM} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m_i)(x_{ij} - m_i)^T$ is the within-class scatter matrix based on NM. Here we find that $S_w^{NM}$ is equivalent to $S_w$. The between-class scatter based on NM after projection is defined as:

$$\begin{aligned}
\sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \|y_{ij} - \tilde{m}_k\|^2
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} (y_{ij} - \tilde{m}_k)^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} (p^T x_{ij} - p^T m_k)^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} p^T (x_{ij} - m_k)(x_{ij} - m_k)^T p \\
&= p^T S_b^{NM} p
\end{aligned} \tag{9}$$

where $S_b^{NM} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{k=1, k \ne i}^{c} (x_{ij} - m_k)(x_{ij} - m_k)^T$ is the between-class scatter matrix based on NM. Discriminative Projection for NM (DP-NM) seeks the projection that minimizes the within-class scatter based on NM and simultaneously maximizes the between-class scatter based on NM. The criterion of DP-NM is defined as:

$$p = \arg\max_p \frac{p^T S_b^{NM} p}{p^T S_w^{NM} p} \tag{10}$$

The projection vector $p$ in Eq. (10) can be obtained by solving the following generalized characteristic equation:

$$S_b^{NM} p = \lambda S_w^{NM} p \tag{11}$$

3.2. Discriminative projection for NN

Considering the decision rule of NN, a sample should be as close as possible to the samples belonging to the same class and as far as possible from the samples belonging to different classes. To realize this target, similar to DP-NM, Discriminative Projection for NN (DP-NN) defines the within-class scatter based on NN and the between-class scatter based on NN. The within-class scatter based on NN in the transformed space is defined as:

$$\begin{aligned}
\sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i} \|y_{ij} - y_{il}\|^2
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i} (y_{ij} - y_{il})^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i} (p^T x_{ij} - p^T x_{il})^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i} p^T (x_{ij} - x_{il})(x_{ij} - x_{il})^T p \\
&= p^T S_w^{NN} p
\end{aligned} \tag{12}$$

where $S_w^{NN} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{l=1, l \ne j}^{n_i} (x_{ij} - x_{il})(x_{ij} - x_{il})^T$ is the within-class scatter matrix based on NN. The between-class scatter based on NN in the transformed space is defined as:

$$\begin{aligned}
\sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k} \|y_{ij} - y_{kl}\|^2
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k} (y_{ij} - y_{kl})^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k} (p^T x_{ij} - p^T x_{kl})^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k} p^T (x_{ij} - x_{kl})(x_{ij} - x_{kl})^T p \\
&= p^T S_b^{NN} p
\end{aligned} \tag{13}$$

where $S_b^{NN} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{k=1, k \ne i}^{c} \sum_{l=1}^{n_k} (x_{ij} - x_{kl})(x_{ij} - x_{kl})^T$ is the between-class scatter matrix based on NN. The criterion of DP-NN is defined as:

$$p = \arg\max_p \frac{p^T S_b^{NN} p}{p^T S_w^{NN} p} \tag{14}$$

The projection vector $p$ in Eq. (14) can be obtained by solving the following generalized characteristic equation:

$$S_b^{NN} p = \lambda S_w^{NN} p \tag{15}$$

3.3. The relationship among DP-NM, DP-NN and LDA

In order to show the relationship among DP-NM, DP-NN and LDA, we give three Lemmas.

Lemma 1. The between-class scatter matrix based on NM $S_b^{NM}$ can be converted to $c \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^T + n \sum_{i=1}^{c} (m - m_i)(m - m_i)^T - S_w^{NM}$.
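Lemma 1 can be checked numerically before turning to the remaining results. The following is a minimal NumPy sketch, not code from the paper: it builds $S_w^{NM}$ and $S_b^{NM}$ by brute force from their definitions in Section 3.1 and compares $S_b^{NM}$ with the closed form stated in Lemma 1. The function names (`class_means`, `scatter_nm`, `lemma1_closed_form`) and the random test data are assumptions made for illustration only.

```python
import numpy as np

def class_means(X, labels):
    """Return the class ids and a d x c matrix whose columns are the class centers m_i."""
    classes = np.unique(labels)
    M = np.stack([X[:, labels == c].mean(axis=1) for c in classes], axis=1)
    return classes, M

def scatter_nm(X, labels):
    """Brute-force S_w^NM and S_b^NM; X is d x n with samples as columns."""
    classes, M = class_means(X, labels)
    d = X.shape[0]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for j in range(X.shape[1]):
        for k, c in enumerate(classes):
            diff = (X[:, j] - M[:, k])[:, None]
            if labels[j] == c:
                Sw += diff @ diff.T   # distance to the sample's own class center
            else:
                Sb += diff @ diff.T   # distance to the centers of the other classes
    return Sw, Sb

def lemma1_closed_form(X, labels):
    """Right-hand side of Lemma 1: c*S_t + n*sum_i (m - m_i)(m - m_i)^T - S_w^NM."""
    classes, M = class_means(X, labels)
    n, c = X.shape[1], len(classes)
    m = X.mean(axis=1, keepdims=True)
    St = (X - m) @ (X - m).T
    centers = (M - m) @ (M - m).T
    Sw, _ = scatter_nm(X, labels)
    return c * St + n * centers - Sw

# Hypothetical toy data with deliberately unequal class sizes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2])
Sw_nm, Sb_nm = scatter_nm(X, labels)
print(np.allclose(Sb_nm, lemma1_closed_form(X, labels)))
```

The class sizes in the toy data are unequal on purpose: the identity in Lemma 1 does not rely on balanced classes, since the cross terms vanish whenever $m$ is the mean of all training samples.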


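Lemma 2. The within-class scatter matrix based on NN $S_w^{NN}$ can be converted to $2 \sum_{i=1}^{c} \sum_{j=1}^{n_i} n_i (x_{ij} - m_i)(x_{ij} - m_i)^T$.

Lemma 3. The between-class scatter matrix based on NN $S_b^{NN}$ can be converted to $2n \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^T - S_w^{NN}$.

The proofs of Lemmas 1–3 are given in the appendix. Now we present a Theorem which describes the relationship among DP-NM, DP-NN and LDA.

Theorem 1. DP-NM and DP-NN are both equivalent to LDA when all the classes have the same number of training samples.

Proof. When all the classes have the same number of training samples, we have $n = c n_i$ $(i = 1, 2, \ldots, c)$. Based on Lemma 1, Eq. (6) and Eq. (10), the criterion of DP-NM is

$$\begin{aligned}
p &= \arg\max_p \frac{p^T S_b^{NM} p}{p^T S_w^{NM} p} \\
&= \arg\max_p \frac{p^T \left( c \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^T + n \sum_{i=1}^{c} (m - m_i)(m - m_i)^T - S_w^{NM} \right) p}{p^T S_w^{NM} p} \\
&= \arg\max_p \frac{p^T \left( c S_t + n \sum_{i=1}^{c} (m - m_i)(m - m_i)^T \right) p}{p^T S_w p} \\
&= \arg\max_p \frac{p^T \left( c (S_b + S_w) + c \sum_{i=1}^{c} n_i (m - m_i)(m - m_i)^T \right) p}{p^T S_w p} \\
&= \arg\max_p \frac{p^T \left( c (S_b + S_w) + c S_b \right) p}{p^T S_w p} \\
&= \arg\max_p \frac{p^T S_b p}{p^T S_w p}
\end{aligned} \tag{16}$$

The third line uses $S_w^{NM} = S_w$; dropping the term $-S_w^{NM}$ from the numerator only shifts the ratio by the constant 1 and therefore does not change the maximizer. According to the criterion of LDA, we can conclude that DP-NM is equivalent to LDA when all the classes have the same number of training samples. Now based on Lemmas 2 and 3, Eq. (6), Eq. (14) and $n = c n_i$ $(i = 1, 2, \ldots, c)$, the criterion of DP-NN is

$$\begin{aligned}
p &= \arg\max_p \frac{p^T S_b^{NN} p}{p^T S_w^{NN} p} \\
&= \arg\max_p \frac{p^T \left( 2n \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^T - S_w^{NN} \right) p}{p^T S_w^{NN} p} \\
&= \arg\max_p \frac{p^T \left( 2n \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^T \right) p}{p^T \left( 2 \sum_{i=1}^{c} \sum_{j=1}^{n_i} n_i (x_{ij} - m_i)(x_{ij} - m_i)^T \right) p} \\
&= \arg\max_p \frac{p^T (n S_t) p}{p^T \left( n_i \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m_i)(x_{ij} - m_i)^T \right) p} \\
&= \arg\max_p \frac{c\, p^T (S_b + S_w) p}{p^T S_w p} \\
&= \arg\max_p \frac{p^T S_b p}{p^T S_w p}
\end{aligned} \tag{17}$$

So we can draw the conclusion that DP-NN is equivalent to LDA when all the classes have the same number of training samples. □

3.4. Discriminative projection for NFL

To make NFL perform well, for a sample $x_{ij}$ from class $i$, we should make the distance between $x_{ij}$ and the feature line passing through any two samples from class $i$ except $x_{ij}$ as small as possible, and the distance between $x_{ij}$ and the feature line passing through any two samples from another class $k$ $(k \ne i)$ as big as possible. Discriminative Projection for NFL (DP-NFL) defines the within-class scatter based on NFL and the between-class scatter based on NFL. The within-class scatter based on NFL after projection is defined as:

$$\begin{aligned}
&\sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i - 1} \sum_{\substack{m=l+1 \\ m \ne j}}^{n_i} \left( d(y_{ij}, \overline{y_{il} y_{im}}) \right)^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i - 1} \sum_{\substack{m=l+1 \\ m \ne j}}^{n_i} \|y_{ij} - \tilde{y}_{ij}\|^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i - 1} \sum_{\substack{m=l+1 \\ m \ne j}}^{n_i} \|y_{ij} - (y_{il} + \mu_{ijlm} (y_{im} - y_{il}))\|^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i - 1} \sum_{\substack{m=l+1 \\ m \ne j}}^{n_i} \left( P^T x_{ij} - P^T (x_{il} + \mu_{ijlm} (x_{im} - x_{il})) \right)^T \left( P^T x_{ij} - P^T (x_{il} + \mu_{ijlm} (x_{im} - x_{il})) \right) \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \ne j}}^{n_i - 1} \sum_{\substack{m=l+1 \\ m \ne j}}^{n_i} \operatorname{tr}\left( P^T (x_{ij} + (\mu_{ijlm} - 1) x_{il} - \mu_{ijlm} x_{im})(x_{ij} + (\mu_{ijlm} - 1) x_{il} - \mu_{ijlm} x_{im})^T P \right) \\
&= \operatorname{tr}(P^T S_w^{NFL} P)
\end{aligned} \tag{18}$$

where $\overline{y_{il} y_{im}}$ is the feature line passing through $y_{il}$ and $y_{im}$, $\tilde{y}_{ij}$ is the projection point of $y_{ij}$ on $\overline{y_{il} y_{im}}$, $\mu_{ijlm} = (y_{ij} - y_{il})^T (y_{im} - y_{il}) / \left((y_{im} - y_{il})^T (y_{im} - y_{il})\right)$, $P$ is the projection matrix from the original space to the transformed space and $S_w^{NFL} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{l=1, l \ne j}^{n_i - 1} \sum_{m=l+1, m \ne j}^{n_i} (x_{ij} + (\mu_{ijlm} - 1) x_{il} - \mu_{ijlm} x_{im})(x_{ij} + (\mu_{ijlm} - 1) x_{il} - \mu_{ijlm} x_{im})^T$ is the within-class scatter matrix based on NFL. Since NFL can only be performed when the dimension of the samples is at least 2, the number of columns of the projection matrix $P$ cannot be smaller than 2. The between-class scatter based on NFL after projection is defined as:

$$\begin{aligned}
&\sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k - 1} \sum_{m=l+1}^{n_k} \left( d(y_{ij}, \overline{y_{kl} y_{km}}) \right)^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k - 1} \sum_{m=l+1}^{n_k} \|y_{ij} - \tilde{y}_{ij}\|^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k - 1} \sum_{m=l+1}^{n_k} \|y_{ij} - (y_{kl} + \mu_{ijklm} (y_{km} - y_{kl}))\|^2 \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k - 1} \sum_{m=l+1}^{n_k} \left( P^T x_{ij} - P^T (x_{kl} + \mu_{ijklm} (x_{km} - x_{kl})) \right)^T \left( P^T x_{ij} - P^T (x_{kl} + \mu_{ijklm} (x_{km} - x_{kl})) \right) \\
&= \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{\substack{k=1 \\ k \ne i}}^{c} \sum_{l=1}^{n_k - 1} \sum_{m=l+1}^{n_k} \operatorname{tr}\left( P^T (x_{ij} + (\mu_{ijklm} - 1) x_{kl} - \mu_{ijklm} x_{km})(x_{ij} + (\mu_{ijklm} - 1) x_{kl} - \mu_{ijklm} x_{km})^T P \right) \\
&= \operatorname{tr}(P^T S_b^{NFL} P)
\end{aligned} \tag{19}$$

where $\overline{y_{kl} y_{km}}$ is the feature line passing through $y_{kl}$ and $y_{km}$, $\tilde{y}_{ij}$ is the projection point of $y_{ij}$ on $\overline{y_{kl} y_{km}}$, $\mu_{ijklm} = (y_{ij} - y_{kl})^T (y_{km} - y_{kl}) / \left((y_{km} - y_{kl})^T (y_{km} - y_{kl})\right)$, and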


$S_b^{NFL} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{k=1, k \ne i}^{c} \sum_{l=1}^{n_k - 1} \sum_{m=l+1}^{n_k} (x_{ij} + (\mu_{ijklm} - 1) x_{kl} - \mu_{ijklm} x_{km})(x_{ij} + (\mu_{ijklm} - 1) x_{kl} - \mu_{ijklm} x_{km})^T$ is the between-class scatter matrix based on NFL. DP-NFL maximizes the between-class scatter based on NFL and minimizes the within-class scatter based on NFL simultaneously. Its criterion is defined as:

$$P = \arg\max_P \frac{\operatorname{tr}(P^T S_b^{NFL} P)}{\operatorname{tr}(P^T S_w^{NFL} P)} \tag{20}$$

The projection matrix $P$ in Eq. (20) can be obtained by solving the following generalized characteristic equation:

$$S_b^{NFL} p = \lambda S_w^{NFL} p \tag{21}$$

In this generalized characteristic equation, the two quantities $\mu_{ijlm}$ and $\mu_{ijklm}$ are unknown. As $\mu_{ijlm}$ and $\mu_{ijklm}$ are determined by the samples in the transformed space, they depend on the projection matrix $P$. Therefore, this generalized characteristic equation cannot be solved directly. We can use an iterative method to solve this problem. It is performed as follows: Given an initial projection matrix $P_1$, the samples $y_{ij} = P_1^T x_{ij}$ $(i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n_i)$ in the transformed space can be acquired, and then we can compute $\mu_{ijlm}$ and $\mu_{ijklm}$. With the known $\mu_{ijlm}$ and $\mu_{ijklm}$, the generalized characteristic Eq. (21) can be solved and the new projection matrix $P_2$ is obtained. Based on this idea, in the $k$th iteration, we can obtain the new projection matrix $P_{k+1}$ by using $P_k$ as the initial projection matrix. The iteration continues until the algorithm converges. The convergence of the algorithm is determined by the value of the following formula:

$$J(P) = \frac{\operatorname{tr}(P^T S_b^{NFL} P)}{\operatorname{tr}(P^T S_w^{NFL} P)} \tag{22}$$

We can conclude that the algorithm converges when the following inequality holds:

$$(J(P_{k+1}) - J(P_k)) / J(P_{k+1}) < \varepsilon \tag{23}$$
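The iterative scheme described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the target dimension `r`, the small ridge added to $S_w^{NFL}$ before the Cholesky factorization, the use of an absolute value in the stopping test, and the choice to evaluate $J(P_{k+1})$ with the scatter matrices built from $P_k$ are all assumptions that the text does not specify. The starting point computes $\mu$ from the samples in the original space, i.e. with the identity as projection.

```python
import numpy as np
from itertools import combinations

def nfl_scatter(X, labels, P):
    """S_w^NFL and S_b^NFL for the mu values induced by the projection P (d x r)."""
    Y = P.T @ X
    d, n = X.shape
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    members = {c: np.flatnonzero(labels == c) for c in np.unique(labels)}
    for j in range(n):
        for c, idx in members.items():
            same_class = labels[j] == c
            for l, m in combinations(idx, 2):
                if same_class and j in (l, m):
                    continue  # the feature line must not pass through x_ij itself
                direction = Y[:, m] - Y[:, l]
                denom = direction @ direction
                mu = (Y[:, j] - Y[:, l]) @ direction / denom if denom > 0 else 0.0
                # residual x_ij + (mu - 1) x_l - mu x_m, taken in the original space
                r = (X[:, j] + (mu - 1.0) * X[:, l] - mu * X[:, m])[:, None]
                if same_class:
                    Sw += r @ r.T
                else:
                    Sb += r @ r.T
    return Sw, Sb

def top_generalized_eigvecs(Sb, Sw, r, ridge=1e-8):
    """Eigenvectors of Sb p = lambda Sw p for the r largest eigenvalues.
    The ridge on Sw is an added numerical safeguard (assumption)."""
    d = Sw.shape[0]
    L = np.linalg.cholesky(Sw + ridge * np.trace(Sw) / d * np.eye(d))
    L_inv = np.linalg.inv(L)
    w, V = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(w)[::-1][:r]
    return L_inv.T @ V[:, order]

def criterion(P, Sb, Sw):
    """Trace ratio J(P) of Eq. (22)."""
    return np.trace(P.T @ Sb @ P) / np.trace(P.T @ Sw @ P)

def dp_nfl(X, labels, r, eps=1e-3, max_iter=20):
    """Alternate between updating mu (via the scatter matrices) and the projection P."""
    d = X.shape[0]
    Sw, Sb = nfl_scatter(X, labels, np.eye(d))  # mu from the original space
    P = top_generalized_eigvecs(Sb, Sw, r)
    history = [criterion(P, Sb, Sw)]
    for _ in range(max_iter):
        Sw, Sb = nfl_scatter(X, labels, P)
        P = top_generalized_eigvecs(Sb, Sw, r)
        history.append(criterion(P, Sb, Sw))
        if abs(history[-1] - history[-2]) / history[-1] < eps:
            break
    return P, history

# Hypothetical toy data: 3 classes, 4 samples each, 5 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 4)
X = rng.normal(size=(5, 12)) + 2.0 * labels
P, history = dp_nfl(X, labels, r=2)
print(P.shape, len(history))
```

The brute-force loops enumerate every sample against every feature line, so the cost grows quickly with the number of samples per class; the sketch is meant for small problems such as those obtained after a preliminary dimensionality reduction.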

The iteration algorithm of DP-NFL is summarized as follows:

Input: training samples $x_{ij}$ $(i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n_i)$
Output: projection matrix $P$
Step 1. Choose an initial projection matrix $P_1$ and set $k = 1$.
Step 2. Use the known projection matrix $P_k$ to calculate new samples $y_{ij} = P_k^T x_{ij}$ in the transformed space.
Step 3. Use the new samples $y_{ij}$ to construct $S_b^{NFL}$ and $S_w^{NFL}$ and then solve the generalized characteristic Eq. (21). Form the projection matrix $P_{k+1}$ from the generalized eigenvectors corresponding to the d largest eigenvalues.
Step 4. If $(J(P_{k+1}) - J(P_k)) / J(P_{k+1}) < \varepsilon$, the algorithm converges and the final projection matrix is $P = P_{k+1}$. Otherwise, set $k = k + 1$, go to Step 2 and continue the algorithm.

To implement this iteration algorithm, the initial projection matrix $P_1$ should be chosen. One simple method is to use a random matrix. Here, we use a better method to choose the initial projection matrix. In this method, we use the samples in the original space to calculate $S_b^{NFL}$ and $S_w^{NFL}$, and then the initial projection matrix is computed by the criterion (20). This method is called the heuristic method and the acquired matrix is called the heuristic matrix. The heuristic method uses the same criterion as the DP-NFL algorithm when computing the initial projection matrix, so it may be more suitable for the DP-NFL algorithm.

4. Experiments

In this section, we perform experiments on several common databases. In these experiments, we compare the performances of three different feature extraction methods, LDA, DP-NFL and SRC-DP, under five classification methods, NN, NM, NFL, SRC and CRC. To solve the singularity problem, PCA is performed to reduce dimensionality before the implementation of LDA, DP-NFL and SRC-DP. From the above paragraphs, we know that DP-NFL is optimal for NFL and SRC-DP is optimal for SRC. In the experiments, each class has the same number of training samples. Under this condition, LDA is optimal for NN and NM.

Fig. 1. Sample images of one person on Yale face database.
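The PCA step mentioned above addresses the fact that $S_w$ is singular when the sample dimension exceeds the number of training samples. The following NumPy sketch shows one common way to chain PCA and LDA. The number of retained principal components, $n - c$, is an assumption borrowed from the usual Fisherface-style convention and is not stated in the text; the function names and the random data are likewise hypothetical.

```python
import numpy as np

def lda_scatter(X, labels):
    """S_b and S_w of Eqs. (4) and (5); X is d x n with samples as columns."""
    m = X.mean(axis=1, keepdims=True)
    d = X.shape[0]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        Sb += Xc.shape[1] * (mc - m) @ (mc - m).T
        Sw += (Xc - mc) @ (Xc - mc).T
    return Sb, Sw

def pca_basis(X, k):
    """Orthonormal basis (d x k) of the k leading principal directions."""
    U, _, _ = np.linalg.svd(X - X.mean(axis=1, keepdims=True), full_matrices=False)
    return U[:, :k]

def pca_lda(X, labels, r):
    """PCA to n - c dimensions (an assumed choice), then LDA to r dimensions."""
    n, c = X.shape[1], len(np.unique(labels))
    W = pca_basis(X, n - c)
    Sb, Sw = lda_scatter(W.T @ X, labels)
    # Solve Sb p = lambda Sw p through a Cholesky factor of Sw.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    w, V = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    top = np.argsort(w)[::-1][:r]
    return W @ (L_inv.T @ V[:, top])

# Hypothetical small-sample data: 40 dimensions, 3 classes with 5 samples each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 5)
X = rng.normal(size=(40, 15)) + 3.0 * labels
Sb_raw, Sw_raw = lda_scatter(X, labels)
print(np.linalg.matrix_rank(Sw_raw), "of", X.shape[0])  # rank of S_w in the raw space
P = pca_lda(X, labels, r=2)
print(P.shape)
```

The returned matrix maps the original samples directly to the low dimensional space, so `P.T @ x` combines the PCA and LDA projections in a single step.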

4.1. Experiment on Yale database

The Yale face database was constructed at the Yale Center for Computational Vision and Control. It contains 165 gray-scale images of 15 individuals. The images demonstrate variations in lighting, facial expression and with/without glasses. In our experiment, every image was manually cropped and resized to 100 × 80 pixels. Fig. 1 shows the eleven images of one person.

First we evaluate the convergence of the proposed DP-NFL. The first four images of each person are used for training and the remaining images for testing. For DP-NFL, the heuristic matrix and a random matrix are used as initial projection matrices, respectively. Fig. 2 shows the value of the criterion function J(P) versus the number of iterations. From Fig. 2, we can see that DP-NFL converges quickly no matter which initial projection matrix is used. It converges within 10 iterations. Then we compare the performance of DP-NFL using the heuristic matrix and using the random matrix. Fig. 3 shows the recognition rates of DP-NFL versus the dimensions. It can be seen from Fig. 3 that DP-NFL using the heuristic matrix has higher recognition rates than using the random matrix. This indicates that the heuristic matrix is more suitable for DP-NFL. In all the following experiments, we use the heuristic matrix as the initial projection matrix of DP-NFL.

Now three images per person are randomly selected for training and the remaining images are used for testing. The random selection of training samples is repeated 20 times. NN, NM, NFL, SRC and CRC are used for classification. For each classification method, LDA, DP-NFL and SRC-DP are used to extract low dimensional features before classification. Table 2 lists the maximal average recognition rate and the corresponding dimension of each combination of classification method and feature extraction method. From Table 2, we can see that the combination of NM and LDA achieves the best result. On this database, NM performs better than the other classification methods. For NM, combining it with its optimal feature extraction method can improve the performance, and under the equal-class-size condition of this experiment LDA is that method. LDA plus NM indeed obtains the highest recognition rate. It can also be seen from Table 2 that the proposed DP-NFL performs better than LDA and SRC-DP when the recent classification method CRC is employed. DP-NFL is not the optimal feature extraction method for CRC. This indicates that DP-NFL can also perform well with a classification method it was not designed for.
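The evaluation protocol used here (a random selection of training images per person, repeated several times, with the recognition rate averaged over the repetitions) can be sketched as follows. This is a hedged illustration, not the authors' code: the three classifiers follow the decision rules of Section 2.1, while the function names, the optional `extract` callback that returns a projection matrix, and the synthetic data standing in for face images are assumptions.

```python
import numpy as np
from itertools import combinations

def nn_predict(Xtr, ytr, y):
    """Nearest Neighbor: label of the closest training sample."""
    return ytr[np.argmin(np.linalg.norm(Xtr - y[:, None], axis=0))]

def nm_predict(Xtr, ytr, y):
    """Nearest Mean: label of the closest class center."""
    classes = np.unique(ytr)
    dists = [np.linalg.norm(y - Xtr[:, ytr == c].mean(axis=1)) for c in classes]
    return classes[int(np.argmin(dists))]

def nfl_predict(Xtr, ytr, y):
    """Nearest Feature Line: label of the class owning the closest feature line."""
    best, best_c = np.inf, None
    for c in np.unique(ytr):
        Xc = Xtr[:, ytr == c]
        for j, k in combinations(range(Xc.shape[1]), 2):
            direction = Xc[:, k] - Xc[:, j]
            mu = (y - Xc[:, j]) @ direction / (direction @ direction)
            dist = np.linalg.norm(y - (Xc[:, j] + mu * direction))
            if dist < best:
                best, best_c = dist, c
    return best_c

def average_recognition_rate(X, labels, n_train, predict, extract=None,
                             repeats=20, seed=0):
    """Average accuracy over random per-class train/test splits."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(repeats):
        train = np.zeros(labels.size, dtype=bool)
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            train[rng.choice(idx, size=n_train, replace=False)] = True
        Xtr, ytr, Xte, yte = X[:, train], labels[train], X[:, ~train], labels[~train]
        if extract is not None:
            P = extract(Xtr, ytr)          # projection learned on training data only
            Xtr, Xte = P.T @ Xtr, P.T @ Xte
        preds = np.array([predict(Xtr, ytr, Xte[:, t]) for t in range(Xte.shape[1])])
        rates.append(np.mean(preds == yte))
    return float(np.mean(rates))

# Hypothetical stand-in data: 3 well-separated classes with 11 samples each.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(3), 11)
X = rng.normal(size=(5, 33)) + 50.0 * labels
for name, rule in [("NN", nn_predict), ("NM", nm_predict), ("NFL", nfl_predict)]:
    print(name, average_recognition_rate(X, labels, n_train=3, predict=rule, repeats=5))
```

Passing a feature extractor as `extract` keeps the projection fitted on the training split only, which mirrors the order of operations described in the text: feature extraction first, classification afterwards.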


[Plot: value of the criterion function J(P) versus No. of iteration, with one curve for the random matrix and one for the heuristic matrix.]

Fig. 2. The convergence of DP-NFL using different initial projection matrices.

[Plot: recognition rate versus dimension, with one curve for DP-NFL (random matrix) and one for DP-NFL (heuristic matrix).]

Fig. 3. The recognition rates of DP-NFL using different initial projection matrices versus the dimensions on Yale database.

Table 2
The maximal average recognition rate of each combination of classification method and feature extraction method and the corresponding dimension on Yale database.

Classification method   Feature extraction method   Maximal average recognition rate   Dimension
NN                      LDA                         0.905                              14
NN                      DP-NFL                      0.900                              14
NN                      SRC-DP                      0.898                              16
NM                      LDA                         0.923                              14
NM                      DP-NFL                      0.908                              16
NM                      SRC-DP                      0.897                              16
NFL                     LDA                         0.883                              14
NFL                     DP-NFL                      0.898                              16
NFL                     SRC-DP                      0.896                              16
SRC                     LDA                         0.878                              14
SRC                     DP-NFL                      0.878                              15
SRC                     SRC-DP                      0.873                              14
CRC                     LDA                         0.881                              14
CRC                     DP-NFL                      0.898                              15
CRC                     SRC-DP                      0.878                              15

4.2. Experiment on FERET database

The FERET face database was sponsored by the US Department of Defense through the DARPA Program. It has become a standard database for testing and evaluating face recognition algorithms. We performed the algorithms on a subset of the FERET database. The subset is composed of 1400 images of 200 individuals, and each individual has seven images. It involves variations in facial expression, pose and illumination. In the experiment, the facial portion of the original image was cropped based on the location of the eyes and mouth. Then we resized the cropped images to 80 × 80 pixels. Seven sample images of one individual are shown in Fig. 4.

Fig. 4. Sample images of one person on FERET database.

Three images per person are randomly selected for training and the remaining images for testing. The experiment is repeated 20 times. LDA, DP-NFL and SRC-DP are used to extract low dimensional features and NN, NM, NFL, SRC and CRC are used to classify these new features. Table 3 lists the maximal average recognition rate and the corresponding dimension of each combination of feature extraction method and classification method. Table 3 shows that NFL performs better than the other four classification methods on this database, no matter which feature extraction method is used to extract low dimensional features. If NFL is combined with its optimal feature extraction method DP-NFL, the maximal recognition rate is obtained. Moreover, we can see that the proposed DP-NFL outperforms the other two feature extraction methods LDA and SRC-DP in general, no matter which classification method is used. If a classification method has a good performance on one database, the optimal feature extraction method for this classification method often performs well too. Thus, their combination could have the best performance.

Table 3
The maximal average recognition rate of each combination of classification method and feature extraction method and the corresponding dimension on FERET database.

Classification method   Feature extraction method   Maximal average recognition rate   Dimension
NN                      LDA                         0.600                              20
NN                      DP-NFL                      0.603                              19
NN                      SRC-DP                      0.455                              34
NM                      LDA                         0.610                              20
NM                      DP-NFL                      0.613                              22
NM                      SRC-DP                      0.471                              37
NFL                     LDA                         0.615                              34
NFL                     DP-NFL                      0.627                              32
NFL                     SRC-DP                      0.522                              38
SRC                     LDA                         0.546                              34
SRC                     DP-NFL                      0.546                              34
SRC                     SRC-DP                      0.504                              38
CRC                     LDA                         0.448                              37
CRC                     DP-NFL                      0.488                              65
CRC                     SRC-DP                      0.478                              38

4.3. Experiment on AR database

The AR face database contains over 4000 color face images of 126 people, including 26 frontal views of faces with different facial expressions, lighting conditions, and occlusions for each person. The pictures of 120 individuals were collected in two sessions (14 days apart) and each session contains 13 color images. In our experiment, fourteen face images (seven from each session) of these 120 individuals are selected. The images are converted to grayscale. The face portion of each image is manually cropped and normalized to 50 × 40 pixels. Sample images of one person are shown in Fig. 5. These images vary as follows: neutral expression, smiling, angry, screaming, left light on, right light on, all sides light on.

Fig. 5. Sample images of one person on AR face database.

Table 4
The maximal recognition rate of each combination of classification method and feature extraction method and the corresponding dimension on AR database.

Classification method   Feature extraction method   Maximal recognition rate   Dimension
NN                      LDA                         0.661                      63
NN                      DP-NFL                      0.662                      63
NN                      SRC-DP                      0.677                      60
NM                      LDA                         0.646                      62
NM                      DP-NFL                      0.639                      62
NM                      SRC-DP                      0.661                      63
NFL                     LDA                         0.662                      61
NFL                     DP-NFL                      0.662                      61
NFL                     SRC-DP                      0.692                      63
SRC                     LDA                         0.714                      63
SRC                     DP-NFL                      0.718                      63
SRC                     SRC-DP                      0.752                      63
CRC                     LDA                         0.740                      61
CRC                     DP-NFL                      0.754                      86
CRC                     SRC-DP                      0.751                      63

The seven images of each person taken in the first session are used for training and the seven images taken in the second session for testing. The same feature extraction and classification methods as for the foregoing two databases are employed. Table 4 lists the maximal recognition rate and the corresponding dimension. It

shows that the classification method CRC performs very well on this database, irrespective of the feature extraction method. The combinations of CRC with the three feature extraction methods LDA, DP-NFL and SRC-DP outperform all the other combinations of classification methods and feature extraction methods except the combination of SRC and SRC-DP. SRC plus SRC-DP performs well on this database. However, the combination of CRC and DP-NFL obtains the highest recognition rate. DP-NFL performs well again.

4.4. Experiment on PolyU palmprint database

The PolyU palmprint database contains 600 images of 100 different palms, six images per palm. The images of each palm were collected in two sessions: the first three were captured in the first session and the other three in the second session. The central part of each original image was automatically cropped using the algorithm described in [36]. The cropped images were resized to 128 × 128 pixels and preprocessed by histogram equalization. Fig. 6 shows six sample images of one palm. We use the first three images of each palm, captured in the first session, for training and the remaining three images for testing. NN, NM, NFL, SRC and CRC are used to classify the low-dimensional features extracted by LDA, DP-NFL and SRC-DP. Table 5 lists the experimental results. Table 5 shows that SRC outperforms the other four classification methods. SRC plus DP-NFL and SRC plus SRC-DP both obtain the highest recognition rate, 0.993. As the optimal feature extraction method for SRC, SRC-DP performs well on this database: it obtains high recognition rates under different classification methods. Table 5 also shows that the proposed DP-NFL performs better than LDA irrespective of the classification method. DP-NFL plus CRC also obtains the highest recognition rate, 0.993. This indicates that DP-NFL performs well in palmprint recognition.

Fig. 6. Sample images of one palm on PolyU palmprint database.

Table 5
The maximal recognition rate of each combination of classification method and feature extraction method and the corresponding dimension on PolyU palmprint database.

Classification method   Feature extraction method   Maximal recognition rate   Dimension
NN                      LDA                         0.917                      94
NN                      DP-NFL                      0.923                      100
NN                      SRC-DP                      0.96                       100
NM                      LDA                         0.923                      98
NM                      DP-NFL                      0.953                      134
NM                      SRC-DP                      0.967                      100
NFL                     LDA                         0.927                      98
NFL                     DP-NFL                      0.94                       149
NFL                     SRC-DP                      0.967                      101
SRC                     LDA                         0.983                      80
SRC                     DP-NFL                      0.993                      51
SRC                     SRC-DP                      0.993                      83
CRC                     LDA                         0.98                       74
CRC                     DP-NFL                      0.993                      121
CRC                     SRC-DP                      0.987                      72

4.5. Discussions

For a given classification method, using its optimal feature extraction method does not always perform better than using another feature extraction method in the previous experiments. However, this does not conflict with the theoretical motivation for developing the optimal feature extraction method for a classification method. A classification method cannot perform well on all databases. For each database, there is a classification method that is effective for it, that is, one that performs better than the other classification methods on this database. Take the experiment on the FERET database as an example: NFL is effective for the FERET database. When a classification method is effective for a database, the optimal feature extraction method for this classification method can also be effective on this database. As the optimal feature extraction method for NFL, DP-NFL is effective on the FERET database. Conversely, if a classification method is not effective on a database, the optimal feature extraction method for this classification method is probably not effective either; it does not fit this database. For example, SRC is not effective on the FERET database and, as the optimal feature extraction method for SRC, SRC-DP is not effective on it either. On the one hand, the effective feature extraction method is powerful on the corresponding database. On the other hand, on this database, the optimal feature extraction method for an ineffective classification method is weak. Combining this classification method with the effective feature extraction method can therefore perform better than combining it with its own optimal feature extraction method. This is why SRC plus DP-NFL performs better than SRC plus SRC-DP on the FERET database. However, when a classification method is effective for a database, using the optimal feature extraction method for this classification method can improve its performance. Here, the optimal feature extraction method is exactly the effective feature extraction method. For example, in the experiment on the FERET database, NFL plus DP-NFL obtains the maximal recognition rate. The combination of an effective classification method and the optimal feature extraction method for it can perform best. It is therefore valuable to develop the optimal feature extraction method for a classification method.

5. Conclusions

Starting from the idea that the feature extraction method should fit the classification method, we develop three feature extraction methods: DP-NN, DP-NM and DP-NFL. They are derived from the classification methods NN, NM and NFL, respectively, and they are optimal for these classification methods. In addition, DP-NN and DP-NM are proved to be equivalent to LDA when each class has the same number of training samples. Different combinations of classification methods and feature extraction methods are employed in biometric recognition experiments. NN, NM, NFL, SRC and CRC are used as classification methods, and LDA, DP-NFL and SRC-DP are used as feature extraction methods. The proposed DP-NFL performs very well on the FERET, AR and PolyU palmprint databases. In the experiments, the effectiveness of a classification method changes with the data. However, for a classification method that is effective, combining it with its optimal feature extraction method can achieve the highest recognition rate.

Acknowledgments

This work is supported by the Shanghai Municipal Natural Science Foundation under Grant no. 13ZR1455600 and the National Natural Science Foundation of China under Grant nos. 31470954, 61203240 and 61403251.
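The closed forms proved in Appendices A to C, Eqs. (A.8), (B.8) and (C.8), can be checked numerically. The following is a minimal illustrative sketch rather than the authors' code. It assumes NumPy, and it assumes that the scatter matrices are defined by the sums that open (A.1), (B.1) and (C.1), with S_w^NM taken as the sum of (x_ij − m_i)(x_ij − m_i)^T over all samples. The function names `scatter_matrices` and `closed_forms` and the random data are hypothetical and chosen only for illustration.

```python
import numpy as np


def outer_sum(diffs):
    """Return the sum of d d^T over the rows d of a 2-D array."""
    return diffs.T @ diffs


def pair_diffs(A, B):
    """All differences a - b for a in rows of A and b in rows of B."""
    return (A[:, None, :] - B[None, :, :]).reshape(-1, A.shape[1])


def scatter_matrices(classes):
    """Brute-force scatter matrices from their summation definitions.

    `classes` is a list of (n_i, d) arrays, one array per class.
    """
    c = len(classes)
    means = [Xi.mean(axis=0) for Xi in classes]
    # S_w^NM: each sample against its own class mean.
    Sw_nm = sum(outer_sum(Xi - mi) for Xi, mi in zip(classes, means))
    # S_b^NM: each sample against the means of the other classes (k != i).
    Sb_nm = sum(outer_sum(classes[i] - means[k])
                for i in range(c) for k in range(c) if k != i)
    # S_w^NN: ordered sample pairs inside one class (l = j adds zero).
    Sw_nn = sum(outer_sum(pair_diffs(Xi, Xi)) for Xi in classes)
    # S_b^NN: ordered sample pairs taken from two different classes.
    Sb_nn = sum(outer_sum(pair_diffs(classes[i], classes[k]))
                for i in range(c) for k in range(c) if k != i)
    return Sw_nm, Sb_nm, Sw_nn, Sb_nn


def closed_forms(classes):
    """Closed forms of S_b^NM, S_w^NN and S_b^NN as in (A.8), (B.8), (C.8)."""
    X = np.vstack(classes)
    n, c = X.shape[0], len(classes)
    m = X.mean(axis=0)  # global mean over all n samples
    means = [Xi.mean(axis=0) for Xi in classes]
    St = outer_sum(X - m)  # sum of (x_ij - m)(x_ij - m)^T
    Sw_nm = sum(outer_sum(Xi - mi) for Xi, mi in zip(classes, means))
    # (A.8): c * St + n * sum_i (m - m_i)(m - m_i)^T - S_w^NM
    Sb_nm = c * St + n * sum(np.outer(m - mi, m - mi) for mi in means) - Sw_nm
    # (B.8): 2 * sum_i n_i * sum_j (x_ij - m_i)(x_ij - m_i)^T
    Sw_nn = 2 * sum(len(Xi) * outer_sum(Xi - mi)
                    for Xi, mi in zip(classes, means))
    # (C.8): 2 * n * St - S_w^NN
    Sb_nn = 2 * n * St - Sw_nn
    return Sb_nm, Sw_nn, Sb_nn


# Three classes with different sizes, so no equal-size assumption is used.
rng = np.random.default_rng(0)
classes = [rng.normal(size=(n_i, 4)) for n_i in (3, 5, 2)]
_, Sb_nm, Sw_nn, Sb_nn = scatter_matrices(classes)
Sb_nm_cf, Sw_nn_cf, Sb_nn_cf = closed_forms(classes)
print(np.allclose(Sb_nm, Sb_nm_cf),
      np.allclose(Sw_nn, Sw_nn_cf),
      np.allclose(Sb_nn, Sb_nn_cf))
```

The class sizes are deliberately unequal: the three identities hold for any class sizes, whereas the equivalence of DP-NN and DP-NM to LDA stated in the conclusions requires the same number of training samples in each class.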


Appendix A. Proof of Lemma 1

According to the definition of $S_b^{NM}$, we have

\[
\begin{aligned}
S_b^{NM} &= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1,\,k\neq i}^{c}(x_{ij}-m_k)(x_{ij}-m_k)^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m_k)(x_{ij}-m_k)^T-\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m_i)(x_{ij}-m_i)^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m_k)(x_{ij}-m_k)^T-S_w^{NM} \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m+m-m_k)(x_{ij}-m+m-m_k)^T-S_w^{NM} \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m)(x_{ij}-m)^T+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m)(m-m_k)^T \\
&\quad+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(m-m_k)(x_{ij}-m)^T+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(m-m_k)(m-m_k)^T-S_w^{NM}.
\end{aligned}
\tag{A.1}
\]

Here we have

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m)(x_{ij}-m)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T\sum_{k=1}^{c}1
=c\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T,
\tag{A.2}
\]

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m)(m-m_k)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)\sum_{k=1}^{c}(m-m_k)^T,
\tag{A.3}
\]

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(m-m_k)(x_{ij}-m)^T
=\sum_{k=1}^{c}(m-m_k)\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)^T
\tag{A.4}
\]

and

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(m-m_k)(m-m_k)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}1\sum_{k=1}^{c}(m-m_k)(m-m_k)^T
=n\sum_{k=1}^{c}(m-m_k)(m-m_k)^T.
\tag{A.5}
\]

According to $\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)=0$, Eq. (A.3) and Eq. (A.4), we have

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(x_{ij}-m)(m-m_k)^T=0
\tag{A.6}
\]

and

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}(m-m_k)(x_{ij}-m)^T=0.
\tag{A.7}
\]

According to Eq. (A.1), Eq. (A.2), Eq. (A.5), Eq. (A.6) and Eq. (A.7), it follows that

\[
S_b^{NM}=c\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T+n\sum_{i=1}^{c}(m-m_i)(m-m_i)^T-S_w^{NM}.
\tag{A.8}
\]

Appendix B. Proof of Lemma 2

From the definition of $S_w^{NN}$, we can obtain

\[
\begin{aligned}
S_w^{NN} &= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1,\,l\neq j}^{n_i}(x_{ij}-x_{il})(x_{ij}-x_{il})^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-x_{il})(x_{ij}-x_{il})^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-m_i+m_i-x_{il})(x_{ij}-m_i+m_i-x_{il})^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-m_i)(x_{ij}-m_i)^T+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-m_i)(m_i-x_{il})^T \\
&\quad+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(m_i-x_{il})(x_{ij}-m_i)^T+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(m_i-x_{il})(m_i-x_{il})^T.
\end{aligned}
\tag{B.1}
\]

Then we have

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-m_i)(x_{ij}-m_i)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m_i)(x_{ij}-m_i)^T\sum_{l=1}^{n_i}1
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}n_i(x_{ij}-m_i)(x_{ij}-m_i)^T,
\tag{B.2}
\]

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-m_i)(m_i-x_{il})^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m_i)\sum_{l=1}^{n_i}(m_i-x_{il})^T,
\tag{B.3}
\]

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(m_i-x_{il})(x_{ij}-m_i)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}\Bigl[\sum_{l=1}^{n_i}(m_i-x_{il})\Bigr](x_{ij}-m_i)^T
\tag{B.4}
\]

and

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(m_i-x_{il})(m_i-x_{il})^T
=\sum_{i=1}^{c}\sum_{l=1}^{n_i}(m_i-x_{il})(m_i-x_{il})^T\sum_{j=1}^{n_i}1
=\sum_{i=1}^{c}\sum_{l=1}^{n_i}n_i(x_{il}-m_i)(x_{il}-m_i)^T.
\tag{B.5}
\]


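According to $\sum_{l=1}^{n_i}(m_i-x_{il})^T=0$, Eq. (B.3) and Eq. (B.4), we obtain

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-m_i)(m_i-x_{il})^T=0
\tag{B.6}
\]

and

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(m_i-x_{il})(x_{ij}-m_i)^T=0.
\tag{B.7}
\]

From Eq. (B.1), Eq. (B.2), Eq. (B.5), Eq. (B.6) and Eq. (B.7), it follows that

\[
S_w^{NN}=\sum_{i=1}^{c}\sum_{j=1}^{n_i}n_i(x_{ij}-m_i)(x_{ij}-m_i)^T+\sum_{i=1}^{c}\sum_{l=1}^{n_i}n_i(x_{il}-m_i)(x_{il}-m_i)^T
=2\sum_{i=1}^{c}\sum_{j=1}^{n_i}n_i(x_{ij}-m_i)(x_{ij}-m_i)^T.
\tag{B.8}
\]

Appendix C. Proof of Lemma 3

According to the definition of $S_b^{NN}$, we obtain

\[
\begin{aligned}
S_b^{NN} &= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1,\,k\neq i}^{c}\sum_{l=1}^{n_k}(x_{ij}-x_{kl})(x_{ij}-x_{kl})^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-x_{kl})(x_{ij}-x_{kl})^T-\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{l=1}^{n_i}(x_{ij}-x_{il})(x_{ij}-x_{il})^T \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-x_{kl})(x_{ij}-x_{kl})^T-S_w^{NN} \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-m+m-x_{kl})(x_{ij}-m+m-x_{kl})^T-S_w^{NN} \\
&= \sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-m)(x_{ij}-m)^T+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-m)(m-x_{kl})^T \\
&\quad+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(x_{ij}-m)^T+\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(m-x_{kl})^T-S_w^{NN}.
\end{aligned}
\tag{C.1}
\]

Then we have

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-m)(x_{ij}-m)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T\sum_{k=1}^{c}\sum_{l=1}^{n_k}1
=n\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T,
\tag{C.2}
\]

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-m)(m-x_{kl})^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})^T,
\tag{C.3}
\]

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(x_{ij}-m)^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}\Bigl[\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})\Bigr](x_{ij}-m)^T
\tag{C.4}
\]

and

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(m-x_{kl})^T
=\sum_{i=1}^{c}\sum_{j=1}^{n_i}1\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(m-x_{kl})^T
=n\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(m-x_{kl})^T.
\tag{C.5}
\]

Based on $\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})^T=0$, Eq. (C.3) and Eq. (C.4), we have

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(x_{ij}-m)(m-x_{kl})^T=0
\tag{C.6}
\]

and

\[
\sum_{i=1}^{c}\sum_{j=1}^{n_i}\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(x_{ij}-m)^T=0.
\tag{C.7}
\]

Then according to Eq. (C.1), Eq. (C.2), Eq. (C.5), Eq. (C.6) and Eq. (C.7), it follows that

\[
\begin{aligned}
S_b^{NN} &= n\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T+n\sum_{k=1}^{c}\sum_{l=1}^{n_k}(m-x_{kl})(m-x_{kl})^T-S_w^{NN} \\
&= 2n\sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij}-m)(x_{ij}-m)^T-S_w^{NN}.
\end{aligned}
\tag{C.8}
\]

References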

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Boston, 1990.
[2] M.A. Turk, A.P. Pentland, Face recognition using eigenfaces, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '91), 1991, pp. 586–591.
[3] P.N. Belhumeur, J.P. Hespanha, D. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 711–720.
[4] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[5] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[6] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003) 1373–1396.
[7] Z.-y. Zhang, H.-y. Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, J. Shanghai Univ. (English Edition) 8 (2004) 406–424.
[8] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV 2005), IEEE, 2005, pp. 1208–1213.
[9] D. Cai, X. He, J. Han, Isometric projection, in: Proceedings of the National Conference on Artificial Intelligence, 2007, pp. 528–533.
[10] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 328–340.
[11] T.H. Zhang, J. Yang, D.L. Zhao, X.L. Ge, Linear local tangent space alignment and application to face recognition, Neurocomputing 70 (2007) 1547–1553.
[12] C. Hwann-Tzong, C. Huang-Wei, L. Tyng-Luh, Local discriminant embedding and its variants, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 2005, pp. 846–853.


[13] Y. Shuicheng, X. Dong, Z. Benyu, Z. Hong-Jiang, Y. Qiang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 40–51.
[14] F. Yun, Y. Shuicheng, T.S. Huang, Classification and feature extraction by simplexization, IEEE Trans. Inf. Forensics Secur. 3 (2008) 91–100.
[15] Q. You, N. Zheng, S. Du, Y. Wu, Neighborhood discriminant projection for face recognition, Pattern Recognit. Lett. 28 (2007) 1156–1163.
[16] D. Cai, X. He, K. Zhou, J. Han, H. Bao, Locality sensitive discriminant analysis, in: IJCAI, 2007, pp. 708–713.
[17] W. Yang, C. Sun, L. Zhang, A multi-manifold discriminant analysis method for image feature extraction, Pattern Recognit. 44 (2011) 1649–1657.
[18] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, M. Yi, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 210–227.
[19] D.L. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Commun. Pure Appl. Math. 59 (2006) 797–829.
[20] E.J. Candes, J.K. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Commun. Pure Appl. Math. 59 (2006) 1207–1223.
[21] E.J. Candes, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory 52 (2006) 5406–5425.
[22] P. Zhao, B. Yu, On model selection consistency of Lasso, J. Mach. Learn. Res. 7 (2006) 2541–2563.
[23] L.S. Qiao, S.C. Chen, X.Y. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. 43 (2010) 331–341.
[24] C. Bin, Y. Jianchao, Y. Shuicheng, F. Yun, T.S. Huang, Learning with l1-graph for image analysis, IEEE Trans. Image Process. 19 (2010) 858–866.


[25] J. Lai, X. Jiang, Discriminative sparsity preserving embedding for face recognition, in: ICIP, 2013, pp. 3695–3699.
[26] J. Gui, Z.A. Sun, W. Jia, R.X. Hu, Y.K. Lei, S.W. Ji, Discriminant sparse neighborhood preserving embedding for face recognition, Pattern Recognit. 45 (2012) 2884–2893.
[27] G.-F. Lu, Z. Jin, J. Zou, Face recognition using discriminant sparsity neighborhood preserving embedding, Knowl. Based Syst. 31 (2012) 119–127.
[28] L. Wei, F. Xu, A. Wu, Weighted discriminative sparsity preserving embedding for face recognition, Knowl. Based Syst. 57 (2014) 136–145.
[29] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 4–37.
[30] S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: a review of classification techniques, Informatica 31 (2007) 249–268.
[31] S.Z. Li, J. Lu, Face recognition using the nearest feature line method, IEEE Trans. Neural Netw. 10 (1999) 439–443.
[32] D. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: which helps face recognition?, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 471–478.
[33] J. Yin, Z. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (2012) 120–128.
[34] J. Yang, D. Chu, Sparse representation classifier steered discriminative projection, in: Proceedings of the 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp. 694–697.
[35] J. Yang, D.L. Chu, L. Zhang, Y. Xu, J.Y. Yang, Sparse representation classifier steered discriminative projection with application to face recognition, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 1023–1035.
[36] D. Zhang, Palmprint Authentication, Kluwer Academic, 2004.
