Parameterless reconstructive discriminant analysis for feature extraction


Pu Huang a,*, Guangwei Gao b

a School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China
b Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

* Corresponding author. Tel.: +86 15295539600. E-mail addresses: [email protected], [email protected] (P. Huang).

Article history: Received 25 June 2015; received in revised form 25 October 2015; accepted 7 January 2016. Communicated by Jiayu Zhou.

Abstract

Reconstructive discriminant analysis (RDA) is an effective dimensionality reduction method that can match well with linear regression classification (LRC). RDA seeks to find projections that minimize the intra-class reconstruction scatter and simultaneously maximize the inter-class reconstruction scatter of samples. However, RDA needs to select the k heterogeneous nearest subspaces of each sample to construct the inter-class reconstruction scatter, and it is very difficult to predefine the parameter k in practical applications. To deal with this problem, we propose a novel method called parameterless reconstructive discriminant analysis (PRDA) in this paper. Compared to traditional RDA, our proposed RDA variant not only fits LRC well but also has two important characteristics: (1) the performance of RDA depends on a parameter k that requires manual tuning, while ours is parameter-free, and (2) it adaptively estimates the heterogeneous nearest classes for each sample to construct the inter-class reconstruction scatter. To evaluate the performance of the proposed algorithm, we test PRDA and some other state-of-the-art algorithms on benchmark datasets such as the FERET, AR and ORL face databases. The experimental results demonstrate the effectiveness of our proposed method.

Keywords: Dimensionality reduction; Feature extraction; Linear regression classification; Reconstructive discriminant analysis; Parameterless

1. Introduction

In many computer vision and pattern recognition problems, the input data is often very high-dimensional, so it is desirable to reduce the dimensionality of the data to allow more efficient analysis and learning. The goal of dimensionality reduction (DR) is to find a meaningful low-dimensional representation of high-dimensional data while preserving most of the information contained in the original data. With respect to pattern recognition, DR is an effective approach to overcome “the curse of dimensionality” [1] and improve the computational efficiency of pattern matching [2].

Over the past few decades, techniques for DR have attracted considerable interest and many useful DR methods have been developed. Among them, principal component analysis (PCA) [3] and linear discriminant analysis (LDA) [4] are the two most representative linear subspace learning techniques. PCA is an unsupervised method which seeks a set of projection axes such that the variance of the data is maximized after projection. LDA is a supervised method which attempts to find a set of optimal projection axes such that the Fisher criterion (i.e. the ratio of the between-class scatter to the within-class scatter) is maximized after projection. Unlike PCA, which takes no consideration of label information, LDA takes advantage of the label information to find the discriminant structure of the data. In many classification tasks such as face recognition, it is generally believed that encoding label information enhances the discriminative ability; thus LDA often achieves significantly better performance than PCA for classification tasks. Despite the success of LDA in many applications, it still has some limitations. For example, (1) it can only extract at most C-1 available features (C is the number of classes), which is suboptimal for many applications; (2) it cannot be directly applied to small sample size (SSS) problems [5] because the within-class scatter matrix is singular; and (3) it is only optimal in cases where the data of each class is approximately Gaussian distributed, a property that cannot always be satisfied in real-world applications. To overcome these limitations, numerous LDA variants [6–11] have been developed and have achieved remarkable performance.

Some studies [12–14] have shown that large volumes of high-dimensional data possibly reside on a non-linear submanifold. The two classical methods PCA and LDA may fail to discover the essential manifold structure of the data since they can see only its global Euclidean structure. In order to discover the manifold structure of the data, many manifold learning based methods have been proposed, such as locally linear embedding (LLE) [12], Isomap [13] and Laplacian Eigenmap [14].


These methods produce impressive visualization results on some benchmark data (e.g., facial images and handwritten digits). However, they are defined only on the training data points, and how to evaluate the map for novel testing points remains unclear. To obtain an effective map for new testing data, He et al. [15,16] proposed the locality preserving projection (LPP) method, which searches for a set of projection axes such that the local structure of the data is preserved in a certain sense after projection. However, LPP is a completely unsupervised method whose purpose is to preserve the neighborhood relationships between data samples. If close points come from different classes, preserving the neighborhood relationships will lead to large between-class overlaps, so LPP may not be a suitable feature extraction algorithm. Many previous studies [17–26] have demonstrated that the recognition performance can be improved significantly via manifold learning based discriminant approaches, which are straightforward in preserving both the intrinsic structure and the discriminant structure of the data. These methods share a common point: they all combine locality and label information to characterize intra-class compactness and inter-class separability, seeking a subspace where data points from the same class are close to each other while data points from different classes are far away from each other. In [17], Yan et al. proposed a general framework called graph embedding for DR. The graph embedding framework provides a powerful platform to develop various kinds of DR algorithms; the aforementioned algorithms can be integrated into this framework, and their differences lie only in the strategy used to construct the graphs.

Although the above mentioned algorithms have achieved promising results in practical applications, they were designed independently of classifiers. In a pattern recognition system, the classifier is usually selected by experience. Obviously, the subspaces learned by different DR methods have different characteristics that are invisible to the classifiers, and a specific classifier just explores the subspace following its classification rule rather than the characteristics of the subspace [27]. Therefore, the DR method may not match the randomly selected classifier perfectly, which potentially degrades the performance of the pattern recognition system. In order to connect DR methods well with classifiers, some researchers focus on developing DR methods according to the classification rule of a specific classifier. For example, in [28] and [29], Yang et al. designed discriminant analysis methods according to the minimal local reconstruction error (MLRE) measure based classifier and the local mean based nearest neighbor classifier (LM-NNC), respectively, and demonstrated remarkable improvements over conventional DR methods. Recently, motivated by Yang's work, Chen et al. proposed reconstructive discriminant analysis (RDA) [27], which can fit linear regression classification (LRC) [30] well. RDA seeks the projections that maximize the inter-class reconstruction scatter and simultaneously minimize the intra-class reconstruction scatter, so RDA can produce better performance with LRC than with other classifiers. Unfortunately, RDA needs to choose the k heterogeneous nearest subspaces for each sample to construct the inter-class reconstruction scatter, and it is very difficult to set an optimal k in practical applications.
To overcome this limitation of RDA, we introduce an efficient approach called parameterless reconstructive discriminant analysis (PRDA) in this paper. Based on the decision rule of LRC that a testing sample is always assigned to the class which leads to the minimum reconstruction error, PRDA seeks a set of projection directions that can separate each sample from any heterogeneous class which yields a smaller reconstruction error than its own class. PRDA defines the intra-class reconstruction scatter and the inter-class reconstruction scatter to characterize the compactness and the separability of samples, respectively. Specifically, the intra-class reconstruction scatter is constructed according to the distance between each sample and its own class, and the inter-class reconstruction scatter is constructed according to the distance between each sample and the heterogeneous classes which have a smaller reconstruction error than its own class. After characterizing the inter-class and intra-class reconstruction scatters, the feature extraction criterion is derived by maximizing the ratio between them.

The main contributions of this paper can be summarized as follows: (1) similar to RDA, the proposed method PRDA owns the good property of matching well with LRC; (2) different from RDA, which suffers from the difficulty of parameter selection, PRDA is completely parameter-free; and (3) RDA chooses the same number (i.e. k) of heterogeneous nearest subspaces for each sample, while PRDA adaptively determines the heterogeneous nearest subspaces by taking special consideration of the relationship between the inter-class reconstruction error and the intra-class reconstruction error for each sample.

The remainder of the paper is organized as follows. In Section 2, we briefly review LRC and RDA. In Section 3, we introduce the new method PRDA in detail. In Section 4, we compare PRDA with some related algorithms. In Section 5, experiments on the FERET, AR and ORL face databases are presented to demonstrate the effectiveness of our proposed method. Finally, conclusions are drawn in Section 6.

2. Related works

Suppose there are n training samples from c classes. Let $n_i$ denote the number of samples from the ith class, and let $x_i^j \in \mathbb{R}^N$ be the jth sample in the ith class, $i = 1, 2, \ldots, c$, $j = 1, 2, \ldots, n_i$. In this section, we briefly review linear regression classification (LRC) and reconstructive discriminant analysis (RDA).

2.1. Linear regression classification (LRC)

LRC is a fairly simple but efficient classification approach and it has been successfully applied to face recognition. LRC is developed based on the assumption that samples from a specific class lie on a linear subspace. Using this concept, LRC codes a probe image as a linear combination of class-specific samples, so the task of face recognition is cast as a problem of linear regression. Least-squares estimation (LSE) [31,32] is used to estimate the vectors of reconstruction coefficients for a given probe image against all class models. Finally, the label of the probe image is assigned in favor of the class with the most precise estimation.

Let $X_i$ denote a class-specific model obtained by stacking the N-dimensional image vectors of the ith class:

$$X_i = \left[x_i^1, x_i^2, \ldots, x_i^{n_i}\right] \in \mathbb{R}^{N \times n_i}, \quad i = 1, 2, \ldots, c \quad (1)$$

Let y be an unlabeled test image; the problem is to classify y as one of the classes $i = 1, 2, \ldots, c$. If y belongs to the ith class, it should be representable as a linear combination of the training images from the same class (lying in the same subspace), i.e.,

$$y = X_i \beta_i, \quad i = 1, 2, \ldots, c \quad (2)$$

where $\beta_i \in \mathbb{R}^{n_i}$ is the vector of reconstruction coefficients. Given that $N \ge n_i$, the system of equations in Eq. (2) is well conditioned and $\beta_i$ can be estimated using LSE:

$$\beta_i = \left(X_i^T X_i\right)^{-1} X_i^T y \quad (3)$$


The probe sample can then be reconstructed by

$$\hat{y}_i = X_i \beta_i = X_i \left(X_i^T X_i\right)^{-1} X_i^T y, \quad i = 1, 2, \ldots, c \quad (4)$$

The distance between the probe sample y and each reconstructed sample $\hat{y}_i$ is computed, and the label of y is assigned as the class with the minimum distance, i.e.,

$$l(y) = \arg\min_i \|y - \hat{y}_i\|^2, \quad i = 1, 2, \ldots, c \quad (5)$$

where $l(y)$ denotes the predicted label of y and $\|y - \hat{y}_i\|$ is the Euclidean distance between y and $\hat{y}_i$.
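As a concrete illustration of Eqs. (1)–(5), the following minimal NumPy sketch (our own illustration, not part of the original paper; the layout with one training sample per column of each class model is an assumption) classifies a probe vector by its smallest class-wise reconstruction error:

```python
import numpy as np

def lrc_predict(y, class_models):
    """Linear regression classification (Eqs. (1)-(5)).

    y            : (N,) probe vector.
    class_models : list of (N, n_i) arrays X_i, one per class.
    Returns the index of the class with the smallest reconstruction error.
    """
    errors = []
    for X_i in class_models:
        # LSE estimate of the reconstruction coefficients, Eq. (3).
        beta_i, *_ = np.linalg.lstsq(X_i, y, rcond=None)
        # Class-specific reconstruction of the probe, Eq. (4).
        y_hat = X_i @ beta_i
        # Squared distance between the probe and its reconstruction, Eq. (5).
        errors.append(np.linalg.norm(y - y_hat) ** 2)
    return int(np.argmin(errors))
```

For example, calling lrc_predict(y, [X_1, ..., X_c]) returns the index of the class whose subspace reconstructs y most precisely.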

2.2. Reconstructive discriminant analysis (RDA)

As described in Section 2.1, LRC assumes that each class lies on a linear subspace and finds the class which leads to the minimum reconstruction error (or equivalently, the nearest subspace) for a given sample. However, the original space cannot guarantee that a given sample matches its correct class, since the intra-class reconstruction error may be larger than an inter-class reconstruction error due to large variations in the images. To overcome this limitation, Chen et al. [27] proposed RDA. RDA aims to find an efficient subspace for LRC so that LRC performs better in the reduced subspace than in the original space. Concretely, RDA characterizes the intra-class reconstruction scatter as well as the inter-class reconstruction scatter, and seeks the projections that simultaneously maximize the inter-class reconstruction scatter and minimize the intra-class reconstruction scatter.

For a given sample $x_i^j$, its intra-class reconstruction error is the distance of $x_i^j$ to its own class, i.e.,

$$\varepsilon_{ij} = \|x_i^j - X_i \beta_i^j\|^2 \quad (6)$$

Based on Eq. (6), the intra-class reconstruction scatter of samples in the original space is defined by

$$\sum_i \sum_j \varepsilon_{ij} = \sum_i \sum_j \|x_i^j - X_i \beta_i^j\|^2 \quad (7)$$

Similarly, the inter-class reconstruction error of $x_i^j$ with respect to the pth ($p \ne i$) class is

$$\varepsilon_{ij}^p = \|x_i^j - X_p \beta_{ip}^j\|^2 \quad (8)$$

Based on Eq. (8), RDA characterizes the inter-class reconstruction error of $x_i^j$ by the sum of distances between $x_i^j$ and its k heterogeneous nearest subspaces (the k classes with the smallest inter-class reconstruction errors). Assuming the subspace spanned by $X_m$ ($m \ne i$) is one of the k heterogeneous nearest subspaces of $x_i^j$, this inter-class reconstruction error is calculated as

$$\sum_m \|x_i^j - X_m \beta_{im}^j\|^2 \quad (9)$$

The inter-class reconstruction scatter of samples in the original space is then defined by

$$\sum_i \sum_j \sum_m \|x_i^j - X_m \beta_{im}^j\|^2 \quad (10)$$

Let $A = [a_1, a_2, \ldots, a_d]$ denote the projection matrix. Projecting each data point $x_i^j$ and the ith class samples $X_i$ onto the subspace, we get $y_i^j = A^T x_i^j$ and $Y_i = A^T X_i$, where $Y_i = [y_i^1, y_i^2, \ldots, y_i^{n_i}] \in \mathbb{R}^{d \times n_i}$. The intra-class reconstruction scatter of samples in the subspace is then calculated as

$$\sum_i \sum_j \|y_i^j - Y_i \beta_i^j\|^2 = \sum_i \sum_j \left(A^T x_i^j - A^T X_i \beta_i^j\right)^T \left(A^T x_i^j - A^T X_i \beta_i^j\right) = \mathrm{tr}\left(A^T S_w^R A\right) \quad (11)$$

where $\beta_i^j$ is the vector of reconstruction coefficients obtained by Eq. (3), $\mathrm{tr}(\cdot)$ denotes the trace operator, and

$$S_w^R = \sum_i \sum_j \left(x_i^j - X_i \beta_i^j\right)\left(x_i^j - X_i \beta_i^j\right)^T \quad (12)$$

is called the intra-class reconstruction scatter matrix. Likewise, the inter-class reconstruction scatter of samples in the subspace is computed as

$$\sum_i \sum_j \sum_m \|y_i^j - Y_m \beta_{im}^j\|^2 = \sum_i \sum_j \sum_m \left(A^T x_i^j - A^T X_m \beta_{im}^j\right)^T \left(A^T x_i^j - A^T X_m \beta_{im}^j\right) = \mathrm{tr}\left(A^T S_b^R A\right) \quad (13)$$

where

$$S_b^R = \sum_i \sum_j \sum_m \left(x_i^j - X_m \beta_{im}^j\right)\left(x_i^j - X_m \beta_{im}^j\right)^T \quad (14)$$

is called the inter-class reconstruction scatter matrix.

RDA aims to find a transformation matrix that minimizes the intra-class reconstruction scatter and simultaneously maximizes the inter-class reconstruction scatter. The natural formulation for a large family of DR algorithms is a trace ratio optimization problem, and the trace ratio problem is generally simplified into a ratio trace problem. The ratio trace problem can be efficiently solved with the generalized eigenvalue decomposition (GEVD) method [2]. However, its solution may deviate from the original objective and is invariant under any non-singular transformation, which may lead to uncertainty in subsequent processing such as classification and clustering [33–35].


Therefore, using the GEVD method to calculate each projection direction, the objective function of RDA is defined in the following form:

$$a_{RDA} = \arg\max_a \frac{a^T S_b^R a}{a^T S_w^R a} \quad (15)$$

where $a \in \mathbb{R}^N$ is a column vector. The optimal projection matrix $A_{RDA}$ is composed of the eigenvectors associated with the d largest eigenvalues of the following generalized eigenvalue problem:

$$S_b^R a = \lambda S_w^R a \quad (16)$$

where $\lambda$ is the eigenvalue and a is the corresponding eigenvector.
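To make the construction of Eqs. (12), (14) and (16) concrete, the following rough NumPy sketch (our own illustration, not the authors' code; class_models, k and d are assumed inputs with one sample per column) builds the two scatter matrices and extracts the RDA projections:

```python
import numpy as np

def rda_projections(class_models, k, d):
    """Sketch of RDA (Eqs. (6)-(16)) with k heterogeneous nearest subspaces per sample."""
    N = class_models[0].shape[0]
    # Hat matrices H_i = X_i (X_i^T X_i)^{-1} X_i^T reused for every reconstruction.
    hats = [X @ np.linalg.pinv(X) for X in class_models]
    Sw = np.zeros((N, N))
    Sb = np.zeros((N, N))
    for i, X_i in enumerate(class_models):
        for x in X_i.T:                          # each sample x_i^j
            r_intra = x - hats[i] @ x            # intra-class residual, Eq. (6)
            Sw += np.outer(r_intra, r_intra)     # Eq. (12)
            # Inter-class residuals and errors for all other classes, Eq. (8).
            residuals = [x - hats[p] @ x
                         for p in range(len(class_models)) if p != i]
            errors = [np.sum(r ** 2) for r in residuals]
            # The k heterogeneous nearest subspaces (smallest errors), Eqs. (9)-(10).
            for idx in np.argsort(errors)[:k]:
                Sb += np.outer(residuals[idx], residuals[idx])   # Eq. (14)
    # Generalized eigenproblem S_b^R a = lambda S_w^R a, Eq. (16).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)[:d]
    return eigvecs[:, order].real                # columns are the projection directions
```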

3. Parameterless reconstructive discriminant analysis (PRDA)

From the introduction of RDA, we can clearly see that the performance of RDA depends on the parameter k used to construct the inter-class reconstruction scatter. Specifically, for a given data set with c classes, the parameter k can be selected from 1 to c-1, and it is quite difficult to set the best k in advance when c is very large. To overcome this limitation, we propose parameterless reconstructive discriminant analysis (PRDA). The details are given as follows.

3.1. Basic idea

According to the decision rule of LRC in Eq. (5), the label of a probe image is always assigned to the class which leads to the minimum reconstruction error. This implies that, for a sample x, if its intra-class reconstruction error is not the smallest among all the classes, x will be misclassified into a heterogeneous class rather than its own class when LRC is used. In other words, for a sample x, if there is a heterogeneous class which leads to a smaller reconstruction error than its homogeneous class, the label of x will be predicted wrongly.

For ease of understanding, we take Fig. 1 as an example. Fig. 1 shows three classes (Class 1, Class 2 and Class 3) denoted by circles, triangles and squares respectively, and A is a sample from Class 1. Let A_intra be the intra-class reconstruction vector of A, and let A2_inter and A3_inter be the inter-class reconstruction vectors of A with respect to Class 2 and Class 3, respectively. In Fig. 1, d_A1 is the distance between A and A_intra (i.e. the intra-class reconstruction error of A), d_A2 is the distance between A and A2_inter (i.e. the inter-class reconstruction error of A with respect to Class 2), d_A3 is the distance between A and A3_inter (i.e. the inter-class reconstruction error of A with respect to Class 3), and d_A2 < d_A1 < d_A3. In such a case, if we apply LRC to predict the label of sample A, we can easily find that A will be misclassified into Class 2 rather than its own class; that is, Class 2 seriously degrades the performance of LRC. Hence, to obtain a high classification performance, we should derive an algorithm that learns a projection separating each sample from the subspaces spanned by the classes which yield smaller reconstruction errors than its own class.

Fig. 1. An example for illustrating the basic idea of PRDA.

3.2. PRDA

Following the above discussion, based on Eqs. (6) and (8), for a given sample $x_i^j$ we can find its heterogeneous classes whose reconstruction error is smaller than the intra-class reconstruction error, i.e., $\|x_i^j - X_p \beta_{ip}^j\|^2 < \|x_i^j - X_i \beta_i^j\|^2$ ($p \ne i$). The inter-class reconstruction scatter of the nearest subspaces in the original space is then defined as follows:

$$\sum_i \sum_j \sum_p w_{ij}^p \varepsilon_{ij}^p = \sum_i \sum_j \sum_p w_{ij}^p \|x_i^j - X_p \beta_{ip}^j\|^2 \quad (17)$$

where $w_{ij}^p$ is a weight used to decide whether the heterogeneous class $X_p$ belongs to the heterogeneous nearest classes of $x_i^j$ according to the inequality $\|x_i^j - X_p \beta_{ip}^j\|^2 < \|x_i^j - X_i \beta_i^j\|^2$, and its value is defined by

$$w_{ij}^p = \begin{cases} 1, & \text{if } \|x_i^j - X_p \beta_{ip}^j\|^2 < \|x_i^j - X_i \beta_i^j\|^2 \ (p = 1, 2, \ldots, c \text{ and } p \ne i) \\ 0, & \text{otherwise} \end{cases} \quad (18)$$

In fact, the inter-class reconstruction scatter of RDA in the original space (i.e. Eq. (10)) can also be calculated according to Eq. (17) by using a different weight $w_{ij}^p$, defined as

$$w_{ij}^p = \begin{cases} 1, & \text{if } X_p \text{ belongs to the } k \text{ heterogeneous nearest subspaces of } x_i^j \ (p = 1, 2, \ldots, c \text{ and } p \ne i) \\ 0, & \text{otherwise} \end{cases} \quad (19)$$

From Eqs. (18) and (19), we can see that PRDA adaptively estimates the heterogeneous nearest classes of each sample, while RDA cannot. Like RDA, we use the intra-class reconstruction scatter and the inter-class reconstruction scatter to characterize the compactness of intra-class samples and the separability of inter-class samples, respectively. Naturally, a sample should be close to the subspace spanned by its intra-class samples and, at the same time, far away from the subspaces that the other classes lie on. A larger inter-class reconstruction scatter and a smaller intra-class reconstruction scatter clearly lead to better classification results. Therefore, the goal of PRDA is to find a low-dimensional subspace where the intra-class reconstruction scatter is minimized and the inter-class reconstruction scatter is maximized at the same time.

In PRDA, the intra-class reconstruction scatter of samples in the subspace is defined exactly as in RDA (i.e., Eq. (11)), while the inter-class reconstruction scatter of samples in the subspace is defined by

$$\sum_i \sum_j \sum_p w_{ij}^p \|y_i^j - Y_p \beta_{ip}^j\|^2 = \sum_i \sum_j \sum_p w_{ij}^p \left(A^T x_i^j - A^T X_p \beta_{ip}^j\right)^T \left(A^T x_i^j - A^T X_p \beta_{ip}^j\right) = \mathrm{tr}\left(A^T \tilde{S}_b^R A\right) \quad (20)$$

where $w_{ij}^p$ is the weight obtained by Eq. (18) and

$$\tilde{S}_b^R = \sum_i \sum_j \sum_p w_{ij}^p \left(x_i^j - X_p \beta_{ip}^j\right)\left(x_i^j - X_p \beta_{ip}^j\right)^T \quad (21)$$

is the inter-class reconstruction scatter matrix. In order to maximize the inter-class reconstruction scatter and simultaneously minimize the intra-class reconstruction scatter in the low-dimensional subspace, we obtain the objective function of PRDA as follows:

$$a_{PRDA} = \arg\max_a \frac{a^T \tilde{S}_b^R a}{a^T S_w^R a} \quad (22)$$

The optimal projection matrix $A_{PRDA}$ then consists of the d eigenvectors corresponding to the d largest eigenvalues of the following generalized eigenvalue problem:

$$\tilde{S}_b^R a = \lambda S_w^R a \quad (23)$$
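A minimal sketch of how the parameter-free weights of Eq. (18) and the scatter matrices of Eqs. (12) and (21) could be assembled is given below (again our own illustration, under the same assumed one-sample-per-column data layout as the earlier sketches):

```python
import numpy as np

def prda_scatter_matrices(class_models):
    """Sketch of the PRDA scatter matrices (Eqs. (12), (17), (18) and (21))."""
    N = class_models[0].shape[0]
    hats = [X @ np.linalg.pinv(X) for X in class_models]   # X_i (X_i^T X_i)^{-1} X_i^T
    Sw = np.zeros((N, N))
    Sb = np.zeros((N, N))
    for i, X_i in enumerate(class_models):
        for x in X_i.T:
            r_intra = x - hats[i] @ x
            e_intra = np.sum(r_intra ** 2)                  # intra-class error, Eq. (6)
            Sw += np.outer(r_intra, r_intra)                # Eq. (12)
            for p, H_p in enumerate(hats):
                if p == i:
                    continue
                r_inter = x - H_p @ x
                # w_ij^p = 1 only if class p reconstructs x better than its own class, Eq. (18).
                if np.sum(r_inter ** 2) < e_intra:
                    Sb += np.outer(r_inter, r_inter)        # Eq. (21)
    return Sw, Sb
```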

3.3. Algorithm of PRDA

Summarizing the previous description of PRDA, the algorithm of PRDA is as follows:

Step 1: If the dimension of the data is too high, we first project each sample into a PCA subspace to avoid the SSS problem, and let $A_{PCA}$ denote the transformation matrix of PCA. Note that the intra-class reconstruction scatter of PRDA is defined as in RDA, so the dimension of the PCA subspace should be larger than the number of training samples in each class and smaller than the total number of training samples, as discussed in [27].
Step 2: Compute the vectors of reconstruction coefficients of each sample with regard to the different classes using Eq. (3).
Step 3: Construct the intra-class reconstruction scatter matrix $S_w^R$ using Eq. (12) and the inter-class reconstruction scatter matrix $\tilde{S}_b^R$ using Eq. (21).
Step 4: Solve the generalized eigenvalue problem $\tilde{S}_b^R a = \lambda S_w^R a$. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_d$ be the d largest eigenvalues of $(S_w^R)^{-1} \tilde{S}_b^R$ and $a_1, a_2, \ldots, a_d$ the associated eigenvectors.
Step 5: The final projection matrix is $A = A_{PCA} A_{PRDA}$, where $A_{PRDA} = [a_1, a_2, \ldots, a_d]$. For a test sample x, its low-dimensional representation is obtained by $y = A^T x$.
Step 6: Adopt a suitable classifier to predict the label of a test sample.

3.4. Comparison of PRDA and RDA

Both PRDA and RDA construct the intra-class reconstruction scatter and the inter-class reconstruction scatter to characterize the compactness and separability of samples. They share the same motivation, which is to find the projections that maximize the inter-class reconstruction errors and minimize the intra-class reconstruction errors at the same time, so that the algorithms can fit LRC well. The constructed intra-class reconstruction scatters are the same in the two methods, and the main difference lies in the strategy of selecting heterogeneous classes to calculate the inter-class reconstruction scatters. For each sample, RDA simply chooses its k heterogeneous nearest classes with the smallest reconstruction errors, while PRDA adaptively determines the heterogeneous nearest classes by evaluating which classes lead to a smaller inter-class reconstruction error than the intra-class reconstruction error. In the procedure of constructing the inter-class reconstruction scatter, PRDA does not need any predefined parameter, while RDA needs to set the parameter k in advance, which is intractable in practical applications. On the other hand, PRDA makes use of prior information (i.e. whether the inter-class reconstruction error is smaller than the intra-class reconstruction error) to decide the heterogeneous nearest classes for each sample, while RDA does not.

Let n, c and N be the number of total samples, the number of classes and the dimension of the data, respectively. Then the computational cost of RDA is $O(nc^3N + nc^2N^2 + ncN + nc\log c + nkN^2 + dN^2)$ and the cost of PRDA is $O(nc^3N + nc^2N^2 + ncN + nc + \sum_i k_i N^2 + dN^2)$. The term $O(nc^3N + nc^2N^2)$ accounts for calculating the c vectors of reconstruction coefficients for each sample, $O(ncN)$ for calculating the c reconstruction errors for each sample, $O(nc\log c)$ for finding the k heterogeneous nearest classes for all n samples, $O(nkN^2)$ for calculating the scatter matrix of RDA, and $O(dN^2)$ for computing the first d generalized eigenvectors. In PRDA, $O(nc)$ is used to decide whether the inter-class reconstruction error is smaller than the intra-class reconstruction error for the n samples, and $O(\sum_i k_i N^2)$ is used to compute the scatter matrix of PRDA, where $k_i$ denotes the number of adaptively determined heterogeneous nearest classes of $x_i$. From this analysis of the computational complexity of RDA and PRDA, we can find that: (1) the parameter k affects the computational time of RDA, and the time consumed by RDA increases as k increases, and (2) since $O(nc) < O(nc\log c)$ and if $\sum_i k_i < nk$, the computational cost of PRDA must be less than that of RDA.
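The steps above can be strung together roughly as follows. This is only a sketch under the assumption that prda_scatter_matrices is the helper shown after Eq. (23), and the PCA dimension pca_dim and the target dimension d are chosen by the user:

```python
import numpy as np

def prda_fit(class_models, pca_dim, d):
    """Sketch of Steps 1-5: PCA preprocessing, PRDA scatters, generalized eigenproblem."""
    X = np.hstack(class_models)                        # (N, n) all training samples
    mean = X.mean(axis=1, keepdims=True)
    # Step 1: PCA transformation matrix A_PCA (top pca_dim principal directions).
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    A_pca = U[:, :pca_dim]
    reduced = [A_pca.T @ (Xi - mean) for Xi in class_models]
    # Steps 2-3: reconstruction coefficients and scatter matrices (see sketch above).
    Sw, Sb = prda_scatter_matrices(reduced)
    # Step 4: d largest eigenpairs of (S_w^R)^{-1} tilde{S}_b^R.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    A_prda = eigvecs[:, np.argsort(-eigvals.real)[:d]].real
    # Step 5: overall projection A = A_PCA * A_PRDA; in this sketch the features
    # of a test sample x are computed as A^T (x - mean).
    return A_pca @ A_prda, mean
```

Step 6 is then carried out by applying a classifier such as LRC (Section 2.1), NN or MD to the projected features.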

4. Comparison of PRDA and other algorithms

In this section, we compare PRDA with some other DR algorithms, namely locality preserving discriminant analysis (LPDA) [24,25] and locality sensitive discriminant analysis (LSDA) [26].

4.1. Comparison of PRDA and LPDA

LPDA is a manifold based discriminant analysis method which performs dimensionality reduction by maximizing the locality preserving between-class scatter and meanwhile minimizing the locality preserving within-class scatter. Let $S_b^L$ and $S_w^L$ denote the locality preserving between-class scatter matrix and the locality preserving within-class scatter matrix, respectively; they are defined by

$$S_b^L = \sum_{c_i=1}^{C} \sum_{c_j=1}^{C} \left(\bar{x}_{c_i} - \bar{x}_{c_j}\right) W_{c_i c_j} \left(\bar{x}_{c_i} - \bar{x}_{c_j}\right)^T \quad (24)$$

and

$$S_w^L = \sum_{c=1}^{C} \sum_{l(x_i) = l(x_j) = c} \left(x_i - x_j\right) W_{ij}^c \left(x_i - x_j\right)^T \quad (25)$$

where $\bar{x}_{c_i}$ represents the mean vector of the $c_i$th class samples, $W_{c_i c_j}$ is the weight measuring the importance of $\bar{x}_{c_i}$ and $\bar{x}_{c_j}$ for characterizing the between-class scatter, and $W_{ij}^c$ is the weight measuring the importance of the data pair $x_i$ and $x_j$ within the cth class for characterizing the within-class scatter:

$$W_{c_i c_j} = \begin{cases} \exp\left(-\dfrac{\|\bar{x}_{c_i} - \bar{x}_{c_j}\|^2}{2\sigma^2}\right), & \text{if } \bar{x}_{c_i} \text{ is among the } k \text{ nearest neighbors of } \bar{x}_{c_j} \text{ or vice versa} \\ 0, & \text{otherwise} \end{cases} \quad (26)$$

$$W_{ij}^c = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right), & \text{if } x_i \text{ is among the } k \text{ nearest neighbors of } x_j \text{ or vice versa, and } l(x_i) = l(x_j) \\ 0, & \text{otherwise} \end{cases} \quad (27)$$

where $\sigma$ is an empirically determined parameter. To obtain the optimal transform from the high-dimensional image space to the low-dimensional feature space, LPDA solves the following optimization problem:

$$a_{LPDA} = \arg\max_a \frac{a^T S_b^L a}{a^T S_w^L a} \quad (28)$$

The purpose of LPDA is to find a subspace where the nearby samples from the same class are close to each other and the nearby different classes are separated from each other at the same time. However, LRC assumes that the overall samples from the same class lie in a subspace, which means that the subspace learned by LPDA may not fit LRC well. Furthermore, compared to LPDA, the proposed algorithm PRDA has the following advantages: (1) the number of available projection vectors of PRDA is larger than that of LPDA because the rank of the locality preserving between-class scatter matrix is at most c-1 [25], and (2) LPDA also suffers from the parameter selection problem, while PRDA is parameter-free.
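For reference, a rough sketch of the LPDA scatters of Eqs. (24)–(27) might look as follows (our own illustration; the column-per-sample layout, the integer label array and the k/σ values are assumptions):

```python
import numpy as np

def lpda_scatters(X, labels, k, sigma):
    """Sketch of the LPDA scatter matrices of Eqs. (24)-(27).

    X      : (N, n) data matrix, one sample per column.
    labels : (n,) integer class labels as a NumPy array.
    """
    def heat(a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

    def knn_pairs(points):
        # True at (i, j) if points[i] is among the k nearest neighbours of points[j] or vice versa.
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        nn = np.argsort(dists, axis=1)[:, :k]
        adj = np.zeros_like(dists, dtype=bool)
        for i, idx in enumerate(nn):
            adj[i, idx] = True
        return adj | adj.T

    N = X.shape[0]
    classes = np.unique(labels)
    means = np.stack([X[:, labels == c].mean(axis=1) for c in classes])    # class means, (C, N)
    Sb = np.zeros((N, N))
    adj_means = knn_pairs(means)
    for a in range(len(classes)):
        for b in range(len(classes)):
            if a != b and adj_means[a, b]:
                diff = means[a] - means[b]
                Sb += heat(means[a], means[b]) * np.outer(diff, diff)       # Eqs. (24), (26)
    Sw = np.zeros((N, N))
    for c in classes:
        Xc = X[:, labels == c].T                                            # class-c samples as rows
        adj_c = knn_pairs(Xc)
        for i in range(len(Xc)):
            for j in range(len(Xc)):
                if i != j and adj_c[i, j]:
                    diff = Xc[i] - Xc[j]
                    Sw += heat(Xc[i], Xc[j]) * np.outer(diff, diff)         # Eqs. (25), (27)
    return Sb, Sw
```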

4.2. Comparison of PRDA and LSDA

LSDA finds projections which maximize the margin between data points from different classes at each local area. For each data point $x_i$, let $N(x_i) = \{x_i^1, \ldots, x_i^k\}$ be the set of its k nearest neighbors. In order to discover both the geometrical and the discriminant structure of the data, LSDA finds a map that optimizes the following two objective functions:

$$\min \sum_{ij} \|y_i - y_j\|^2 W_{ij}^w \quad (29)$$

$$\max \sum_{ij} \|y_i - y_j\|^2 W_{ij}^b \quad (30)$$

where $y_i = a^T x_i$, and $W^w$ and $W^b$ are two weight matrices defined by

$$W_{ij}^w = \begin{cases} 1, & \text{if } x_i \in N_w(x_j) \text{ or } x_j \in N_w(x_i) \\ 0, & \text{otherwise} \end{cases} \quad (31)$$

$$W_{ij}^b = \begin{cases} 1, & \text{if } x_i \in N_b(x_j) \text{ or } x_j \in N_b(x_i) \\ 0, & \text{otherwise} \end{cases} \quad (32)$$

Here, $N_w(x_i)$ contains the neighbors sharing the same label as $x_i$, while $N_b(x_i)$ contains the neighbors having different labels, with $N_w(x_i) \cap N_b(x_i) = \emptyset$ and $N_w(x_i) \cup N_b(x_i) = N_k(x_i)$; they are respectively defined by

$$N_w(x_i) = \left\{x_i^j \mid l(x_i^j) = l(x_i),\ 1 \le j \le k\right\} \quad (33)$$

$$N_b(x_i) = \left\{x_i^j \mid l(x_i^j) \ne l(x_i),\ 1 \le j \le k\right\} \quad (34)$$

By simple algebraic formulation, the objective functions (29) and (30) can be reduced to

$$\frac{1}{2}\sum_{ij} \|y_i - y_j\|^2 W_{ij}^w = a^T X D^w X^T a - a^T X W^w X^T a \quad (35)$$

$$\frac{1}{2}\sum_{ij} \|y_i - y_j\|^2 W_{ij}^b = a^T X D^b X^T a - a^T X W^b X^T a = a^T X L^b X^T a \quad (36)$$

where $D^w$ and $D^b$ are diagonal matrices with $D_{ii}^w = \sum_j W_{ij}^w$, $D_{ii}^b = \sum_j W_{ij}^b$, and $L^b = D^b - W^b$. By imposing the constraint $a^T X D^w X^T a = 1$ on (35), the objective function (29) becomes

$$\max_a\ a^T X W^w X^T a \quad (37)$$

and the objective function (30) can be rewritten as

$$\max_a\ a^T X L^b X^T a \quad (38)$$

Finally, the optimization problem reduces to finding

$$a_{LSDA} = \arg\max_{a^T X D^w X^T a = 1}\ a^T X \left(\alpha L^b + (1 - \alpha) W^w\right) X^T a \quad (39)$$

where $0 \le \alpha \le 1$ is a suitable constant. The projections can then be obtained by solving the following generalized eigenvalue problem:

$$X \left(\alpha L^b + (1 - \alpha) W^w\right) X^T a = \lambda X D^w X^T a \quad (40)$$

LSDA seeks to find a subspace where the nearby samples from the same class are close to each other and the marginal samples are far away from each other. Obviously, the purpose of LSDA does not follow the decision rule of LRC, so LSDA may not match well with LRC. Besides, in contrast to PRDA, which is parameter-free, LSDA needs to predefine two parameters k and $\alpha$, which are intractable in practical applications.
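Similarly, a compact sketch of LSDA (Eqs. (31)–(40)) could be written as below; this is only an illustration under the same assumed column-per-sample data layout, with α and k supplied by the user:

```python
import numpy as np

def lsda_projections(X, labels, k, alpha, d):
    """Sketch of LSDA: within/between neighbourhood graphs and the eigenproblem of Eq. (40)."""
    n = X.shape[1]
    dists = np.linalg.norm(X.T[:, None, :] - X.T[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    knn = np.argsort(dists, axis=1)[:, :k]                  # k nearest neighbours of each sample
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            if labels[i] == labels[j]:
                Ww[i, j] = Ww[j, i] = 1.0                   # Eq. (31)
            else:
                Wb[i, j] = Wb[j, i] = 1.0                   # Eq. (32)
    Dw = np.diag(Ww.sum(axis=1))
    Db = np.diag(Wb.sum(axis=1))
    Lb = Db - Wb
    # Generalized eigenproblem X(alpha*Lb + (1-alpha)*Ww)X^T a = lambda X Dw X^T a, Eq. (40).
    left = X @ (alpha * Lb + (1 - alpha) * Ww) @ X.T
    right = X @ Dw @ X.T
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(right) @ left)
    order = np.argsort(-eigvals.real)[:d]
    return eigvecs[:, order].real
```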

5. Experiments

In this section, in order to demonstrate the effectiveness of the proposed algorithm PRDA, we conduct experiments and compare the performance of PRDA with that of some other algorithms, namely LDA [4], LPP [15], LPDA [25], LSDA [26] and RDA [27], on publicly available databases including the FERET, AR and ORL face databases. For LPP, we use the k nearest-neighbor method to characterize the adjacency graph and predefine the Gaussian kernel parameter as t = +1. The three methods LPP, LPDA and LSDA all have parameter selection problems, and their parameters are selected by exhaustive search. After the DR algorithms have been used for feature extraction, three classifiers (the nearest neighbor (NN) classifier [36], the minimum distance (MD) classifier [37] and LRC [30]) are employed for classification. All the experiments are programmed in the MATLAB language (version 2011b) and performed on the following machine: an Intel Core i5-5200U 2.2 GHz CPU with 4 GB of RAM.
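The evaluation protocol described above can be summarized schematically as follows (a sketch only; fit_dr and classify are placeholders for any of the compared DR methods and classifiers, e.g. the LRC sketch from Section 2.1):

```python
import numpy as np

def recognition_rate(train_X, train_y, test_X, test_y, fit_dr, classify):
    """Sketch of the evaluation protocol of Section 5.

    fit_dr   : callable returning a projection matrix A learned from the training data.
    classify : callable (train features, train labels, probe feature) -> predicted label.
    """
    A = fit_dr(train_X, train_y)                       # learn the subspace on training data only
    F_train, F_test = A.T @ train_X, A.T @ test_X      # project both sets
    predictions = [classify(F_train, train_y, f) for f in F_test.T]
    return np.mean(np.asarray(predictions) == test_y)  # fraction of correctly labelled probes
```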

Fig. 2. 7 images of one person in the FERET database.


5.1. Experiments on the FERET face database

Fig. 3. Recognition rates of PRDA and RDA with varied k by using different classifiers on FERET database: (a) l = 5 and (b) l = 6. (Curves shown for RDA+NN, RDA+MD, RDA+LRC, PRDA+NN, PRDA+MD and PRDA+LRC; x-axis: k, y-axis: recognition rate.)

Table 1. Maximum recognition rates (%) of PRDA and RDA with varied k on FERET database.

|               | l = 5: NN      | l = 5: MD       | l = 5: LRC      | l = 6: NN       | l = 6: MD       | l = 6: LRC      |
|---------------|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| PRDA          | 73.25          | 78.25           | 79.00           | 73.50           | 69.50           | 78.50           |
| RDA (lowest)  | 72.50 (k = 39) | 72.75 (k = 149) | 74.75 (k = 179) | 71.00 (k = 199) | 63.00 (k = 199) | 72.00 (k = 199) |
| RDA (highest) | 74.50 (k = 99) | 78.00 (k = 9)   | 79.75 (k = 9)   | 74.50 (k = 59)  | 72.50 (k = 29)  | 79.50 (k = 29)  |

The FERET face image database (http://www.itl.nist.gov/iad/humanid/feret/feret_master.html) has become a standard database for testing and evaluating state-of-the-art face recognition algorithms. The proposed method is tested on a subset of the FERET database. This subset includes 1400 images of 200 individuals (seven images per person) and is composed of the images whose names are marked with the two-character strings "ba", "bd", "be", "bf", "bg", "bj" and "bk". It involves two facial expression images, two left pose images, two right pose images and an illumination image per person. All the images are in grayscale and manually cropped to 80 × 80 pixels. Fig. 2 shows some sample images of one person from the FERET database. Since the images in the FERET database are high-dimensional, we first apply the PCA method to preprocess the data before feature extraction, and the dimension of the PCA subspace is set to 150.

Firstly, to test the performance of PRDA and RDA with varied values of k, we choose the first l = (5, 6) images of each person for training and the remaining images for testing. For RDA, since the sample images come from 200 persons, the parameter k can range from 1 to 199; in the experiments, k is selected from 9 to 199 with increments of 10. Fig. 3 shows the recognition rates of PRDA and RDA with varied k by using different classifiers, and the best results of PRDA and RDA are reported in Table 1, where RDA (lowest) is the lowest recognition rate of RDA and RDA (highest) is the highest recognition rate of RDA. From the results shown in Fig. 3, we see that: (1) the performance of RDA changes with different values of the parameter k, and (2) using a specific classifier, our method PRDA generally performs better than RDA for most values of k. According to the results in Table 1, we can find that: (1) the parameter k plays an important role in the performance of RDA, since the difference between the highest and lowest recognition rates is obvious, especially when l = 6, (2) RDA achieves its best results with different k when different training sets are used, which indicates that it is difficult to decide the optimal parameter k in practical applications, and (3) the proposed method PRDA consistently obtains competitive results in comparison with RDA.

Secondly, we conduct experiments to evaluate the performance of PRDA and the other methods with different training set sizes. We randomly select l = (3, 4, and 5) images of each person to form the training set and use the remaining images for testing. Since the selection of the parameter k for RDA is intractable, we set k = 190 as in the literature [27] for simplicity. In the experiments, we run each method 10 times for each l and report the maximum average recognition rates in Table 2. From the results shown in Table 2, we can find that: (1) with increasing training set size, the recognition rate of each method increases, (2) among the different classifiers, our algorithm PRDA performs best when LRC is applied for classification, and (3) PRDA with LRC obtains the best performance with varied training data sets. Note that the parameter k for RDA is set to 190 for convenience, and our method performs better than RDA, which indicates that our algorithm is more suitable for classification than RDA when there are many classes in the data set.

Table 2. Maximum average recognition rates (%) and corresponding feature dimensions (shown in the parentheses) of all the methods on FERET database.

|       |     | LDA         | LPP         | LPDA        | LSDA        | RDA         | PRDA        |
|-------|-----|-------------|-------------|-------------|-------------|-------------|-------------|
| l = 3 | NN  | 53.01 (50)  | 48.65 (70)  | 57.66 (45)  | 38.44 (150) | 61.28 (45)  | 62.39 (35)  |
|       | MD  | 60.66 (150) | 56.85 (120) | 60.99 (45)  | 51.89 (150) | 62.40 (25)  | 64.20 (60)  |
|       | LRC | 61.01 (150) | 62.61 (150) | 59.85 (150) | 55.76 (150) | 62.68 (45)  | 64.38 (55)  |
| l = 4 | NN  | 62.15 (45)  | 52.68 (90)  | 67.72 (45)  | 45.46 (150) | 68.35 (45)  | 65.82 (45)  |
|       | MD  | 66.68 (150) | 51.88 (150) | 67.33 (40)  | 49.47 (150) | 67.05 (55)  | 66.85 (45)  |
|       | LRC | 68.98 (150) | 63.98 (150) | 69.33 (150) | 58.80 (150) | 70.17 (110) | 71.62 (150) |
| l = 5 | NN  | 70.27 (45)  | 60.58 (85)  | 74.82 (40)  | 52.65 (150) | 75.60 (60)  | 74.08 (65)  |
|       | MD  | 76.17 (150) | 62.20 (150) | 75.80 (50)  | 59.65 (150) | 76.48 (55)  | 75.77 (110) |
|       | LRC | 79.10 (145) | 74.63 (150) | 78.65 (150) | 70.68 (150) | 79.23 (145) | 81.20 (150) |


5.2. Experiments on the AR face database

The AR face database (http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html) contains over 4000 face images of 126 people (70 men and 56 women), including frontal views of faces with different facial expressions, lighting conditions and occlusions. For each individual, 26 images were taken in two sessions (separated by two weeks), and each session contains 13 images. In our experiments, we use the images without occlusion in the preprocessed AR face database. This subset includes 1680 face images corresponding to 120 persons (65 men and 55 women), where each person has 14 different images taken in two sessions, with 7 images per session. All the images are in grayscale and manually cropped to 50 × 45 pixels. Some sample images of one person are shown in Fig. 4.

Fig. 4. 14 images of one person in the AR database.

Table 3. Maximum recognition rates (%) of our method PRDA and other methods in the first case on AR database (shown in the parentheses are the corresponding k for RDA and the feature dimensions for all methods).

|      | NN                  | MD                   | LRC                 |
|------|---------------------|----------------------|---------------------|
| LDA  | 70.12 (99)          | 70.60 (114)          | 74.14 (119)         |
| LPP  | 55.95 (150)         | 55.48 (150)          | 67.38 (140)         |
| LPDA | 68.93 (54)          | 66.90 (59)           | 73.93 (119)         |
| LSDA | 57.98 (145)         | 62.86 (145)          | 73.57 (145)         |
| RDA  | 69.88 (k = 109, 90) | 69.05 (k = 104, 145) | 75.95 (k = 89, 145) |
| PRDA | 70.00 (80)          | 69.52 (90)           | 75.12 (150)         |

Table 4. Maximum recognition rates (%) of our method PRDA and other methods in the second case on AR database (shown in the parentheses are the corresponding k for RDA and the feature dimensions for all methods).

|      | NN                 | MD                   | LRC                 |
|------|--------------------|----------------------|---------------------|
| LDA  | 68.33 (94)         | 68.21 (109)          | 71.19 (114)         |
| LPP  | 55.60 (150)        | 55.00 (150)          | 65.24 (145)         |
| LPDA | 67.26 (44)         | 65.95 (44)           | 71.43 (79)          |
| LSDA | 56.19 (145)        | 60.60 (145)          | 71.43 (145)         |
| RDA  | 68.57 (k = 19, 40) | 65.95 (k = 114, 125) | 73.33 (k = 99, 140) |
| PRDA | 69.29 (80)         | 65.71 (110)          | 72.62 (145)         |

Table 5. Computational time (s) of all the methods in two cases on AR database.

|        | LDA    | LPP     | LPDA   | LSDA    | RDA              | PRDA   |
|--------|--------|---------|--------|---------|------------------|--------|
| Case 1 | 2.5440 | 28.1830 | 4.7170 | 51.1450 | 13.0630 (k = 89) | 6.3680 |
| Case 2 | 2.4640 | 28.0350 | 4.7490 | 50.9450 | 13.6990 (k = 99) | 6.3420 |

On this database, we do experiments in two cases: (1) in the first case, we select the first 7 images per person for training and the remaining images for testing, and (2) in the second case, we select the last 7 images per person for training and the remaining images for testing. We first use the PCA method to preprocess the image data, and the dimension of the PCA subspace is set to 150. Note that the parameter k in RDA can be selected from 1 to 119, so we set k from 4 to 119 with increments of 5 and report the best recognition results of RDA. Tables 3 and 4 show the maximum recognition rates of our method PRDA and the other methods in the two cases. From the results shown in Tables 3 and 4, we can see that: (1) among the different classifiers, our algorithm PRDA obtains its best performance with LRC for classification, which further shows the good property that PRDA matches LRC well, and (2) with LRC for classification, RDA and PRDA achieve the two highest recognition rates among all the methods and RDA performs better than PRDA, but it is worth noting that the optimal parameter k for RDA is selected by exhaustive search.

Besides the recognition performance, the computational cost is also important in practical applications, so we report the computational time of each method in Table 5. From the results shown in Table 5, we can find that: (1) the proposed algorithm PRDA consumes less time than RDA, and (2) the computational time of RDA with k = 99 is larger than that of RDA with k = 89, since the time consumed by RDA increases with k.

5.3. Experiments on the ORL face database

The ORL database (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) contains 400 images from 40 subjects, with 10 images per subject. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). The heads in the images are slightly tilted or rotated. All images are grayscale and normalized to a resolution of 56 × 46 pixels. Fig. 5 shows some sample images of one person from the ORL database.


Fig. 5. 10 images of one person in the ORL database.

Table 6. Maximum average recognition rates (%) and corresponding feature dimensions (shown in the parentheses) of all the methods on ORL database.

|       |     | LDA        | LPP        | LPDA       | LSDA       | RDA        | PRDA       |
|-------|-----|------------|------------|------------|------------|------------|------------|
| l = 3 | NN  | 83.50 (39) | 73.64 (46) | 80.29 (23) | 74.04 (60) | 84.61 (32) | 84.18 (26) |
|       | MD  | 84.79 (39) | 80.89 (58) | 81.50 (33) | 82.96 (60) | 85.11 (46) | 83.96 (26) |
|       | LRC | 84.29 (39) | 84.68 (60) | 84.64 (27) | 85.04 (60) | 87.71 (26) | 87.68 (34) |
| l = 4 | NN  | 89.62 (39) | 82.96 (32) | 89.25 (21) | 82.46 (60) | 89.38 (24) | 90.58 (26) |
|       | MD  | 90.83 (39) | 85.17 (60) | 88.29 (25) | 85.79 (60) | 89.96 (42) | 89.83 (24) |
|       | LRC | 90.17 (19) | 90.79 (60) | 91.04 (31) | 90.17 (60) | 91.79 (52) | 92.33 (46) |
| l = 5 | NN  | 93.90 (37) | 87.65 (50) | 93.25 (25) | 87.65 (60) | 93.40 (24) | 93.60 (24) |
|       | MD  | 93.95 (39) | 87.70 (60) | 91.60 (29) | 89.35 (60) | 92.00 (30) | 92.65 (34) |
|       | LRC | 94.05 (39) | 93.40 (60) | 94.30 (31) | 92.95 (60) | 95.00 (46) | 94.75 (44) |

Table 7. Computational time (s) of all the methods on ORL database.

|       | LDA    | LPP    | LPDA   | LSDA   | RDA    | PRDA   |
|-------|--------|--------|--------|--------|--------|--------|
| l = 3 | 0.1413 | 0.6654 | 0.2582 | 2.1905 | 0.2187 | 0.2172 |
| l = 4 | 0.1877 | 1.0965 | 0.3302 | 3.9203 | 0.2885 | 0.2898 |
| l = 5 | 0.2233 | 1.6832 | 0.4194 | 6.0895 | 0.4062 | 0.3975 |

On the ORL database, we randomly select l (= 3, 4, and 5) images of each subject for training and use the remaining images for testing. To avoid the singularity problem of the scatter matrix, we first apply the PCA method to reduce the dimension of the image data, and the number of principal components is set to 60. For simplicity, we set k = 3 for RDA as in the literature [27]. In the experiments, we run each method 10 times for each l and report the maximum average recognition rates in Table 6. As we can see from Table 6, the two methods RDA and PRDA with LRC achieve the two highest recognition rates among all the methods, which validates that both RDA and PRDA can fit LRC well. In addition, we can find that, no matter what the training set size is, PRDA obtains competitive results in comparison with RDA.

To evaluate the computational complexity of each method, we list the computational time of each method in Table 7. Since the computational cost of RDA is related to the parameter k, and k is set to a quite small value (i.e. 3) in these experiments, RDA does not spend too much time on computation, as shown in Table 7. Besides, from Table 7, we can find that PRDA consumes almost the same time as RDA.

6. Conclusions

In this paper, in order to overcome the difficulty of parameter selection in RDA, we propose a novel algorithm, namely PRDA, for feature extraction. Compared with RDA, which needs a predefined parameter k, the proposed algorithm PRDA is completely parameter-free and computationally simpler. To test the effectiveness of PRDA, we conduct experiments on the FERET, AR and ORL databases, and the experimental results show that PRDA not only inherits the good property of RDA (i.e. matching well with LRC) but is also more suitable for recognition tasks when the parameter k is very difficult to choose in practical applications.

Acknowledgment

This work is sponsored by the National Natural Science Foundation of China (Grant nos. 61503195 and 61502245), NUPTSF (Grant no. NY214165) and the Natural Science Foundation of Jiangsu Province (Grant no. BK20150849).

References

[1] O. Egecioglu, H. Ferhatosmanoglu, U. Ogras, Dimensionality reduction and similarity computation by inner-product approximations, IEEE Trans. Knowl. Data Eng. 16 (6) (2014) 714–725.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, Boston, MA, 1990.
[3] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognit. Neurosci. 3 (1) (1991) 71–86.
[4] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[5] S.J. Raudys, A.K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991) 252–264.
[6] J.P. Ye, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, J. Mach. Learn. Res. 7 (2006) 1183–1204.
[7] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data - with application to face recognition, Pattern Recognit. 34 (2001) 2067–2070.
[8] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Trans. Neural Netw. 17 (1) (2006) 1157–1165.
[9] J.P. Ye, Q. Li, A two-stage linear discriminant analysis via QR-decomposition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (6) (2005) 929–941.
[10] J.W. Lu, K. Plataniotis, A. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognit. Lett. 26 (2005) 181–191.
[11] Z. Jin, J.Y. Yang, Z. Hu, Z. Lou, Face recognition based on the uncorrelated discrimination transformation, Pattern Recognit. 34 (7) (2001) 1405–1416.
[12] S.T. Roweis, L.K. Saul, Nonlinear dimension reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[13] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[14] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[15] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacian faces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[16] X. He, S. Yan, Y. Hu, H. Zhang, Learning a locality preserving subspace for visual recognition, ICCV (2003) 385–393.
[17] S. Yan, D. Xu, B. Zhang, H.J. Zhang, Q. Yang, S. Lin, Graph embedding and extension: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 40–51.
[18] P. Huang, C.K. Chen, Z.M. Tang, Z.J. Yang, Discriminant similarity and variance preserving projection for feature extraction, Neurocomputing 139 (9) (2014) 180–188.
[19] P. Huang, C.K. Chen, Z.M. Tang, Z.J. Yang, Feature extraction using local structure preserving discriminant analysis, Neurocomputing 140 (9) (2014) 104–113.
[20] Y. Chen, W.S. Zheng, X.H. Xu, J.H. Lai, Discriminant subspace learning constrained by locally statistical uncorrelation for face recognition, Neural Netw. 42 (2013) 28–43.
[21] Q. Gao, H. Zhang, J. Liu, Two-dimensional margin, similarity and variation embedding, Neurocomputing 86 (2012) 179–183.


[22] Q. Hua, L.J. Bai, X.Z. Wang, Y.C. Liu, Local similarity and diversity preserving discriminant projection for face and handwriting digits recognition, Neurocomputing 86 (2012) 150–157.
[23] Q. Gao, H. Xu, Y. Li, D. Xie, Two-dimensional supervised similarity and diversity projection, Pattern Recognit. 43 (10) (2010) 3359–3363.
[24] L.P. Yang, W.G. Gong, X.H. Gu, W.H. Li, Y.F. Liu, Bagging null space locality preserving discriminant classifiers for face recognition, Pattern Recognit. 42 (2009) 1853–1858.
[25] L.P. Yang, W.G. Gong, X.H. Gu, Extended locality preserving discriminant analysis for face recognition, ICPR, 2010, pp. 539–542.
[26] D. Cai, X. He, K. Zhou, J.W. Han, H.J. Bao, Locality sensitive discriminant analysis, IJCAI, 2007, pp. 708–713.
[27] Y. Chen, Z. Jin, Reconstructive discriminant analysis: a feature extraction method induced from linear regression classification, Neurocomputing 87 (2012) 41–50.
[28] J. Yang, Z. Lou, Z. Jin, J.Y. Yang, Minimal local reconstruction error measure based discriminant feature extraction and classification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), Alaska, June 23–28, 2008, pp. 1–6.
[29] J. Yang, L. Zhang, J.Y. Yang, D. Zhang, From classifiers to discriminators: a nearest neighbor rule induced discriminant analysis, Pattern Recognit. 44 (7) (2011) 1387–1402.
[30] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 2106–2112.
[31] G.A.F. Seber, Linear Regression Analysis, Wiley Interscience, New Jersey, United States, 2003.
[32] T.P. Ryan, Modern Regression Methods, Wiley Interscience, New Jersey, United States, 1997.
[33] H. Wang, S. Yan, D. Xu, X. Tang, T. Huang, Trace ratio vs. ratio trace for dimensionality reduction, CVPR, 2007, pp. 1–8.
[34] F. Nie, S. Xiang, Y. Jia, C. Zhang, Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recognit. 42 (2009) 2615–2627.
[35] M. Zhao, Z. Zhang, T. Chow, Trace ratio criterion based generalized discriminative learning for semi-supervised dimensionality reduction, Pattern Recognit. (2012) 1482–1499.
[36] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1) (1967) 21–27.
[37] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Addison Wesley, Boston, USA, 1997.

Pu Huang received his B.S. and M.S. degrees in computer applications from Yangzhou University, PR China, in 2007 and 2010, respectively, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from Nanjing University of Science and Technology, PR China, in 2014. He is now a lecturer in the School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, PR China. His current research interests include pattern recognition, computer vision and machine learning.

Guangwei Gao received the B.S. degree in information and computation science from Nanjing Normal University, Nanjing, China, in 2009, and the Ph.D. degree from the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, in 2014. From March 2011 to September 2011 and from February 2013 to August 2013, he was an exchange student at the Department of Computing, The Hong Kong Polytechnic University. He is now an Assistant Professor in the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, PR China. His current research interests include biometrics, pattern recognition and computer vision.
