Information Sciences 232 (2013) 11–26
K-local hyperplane distance nearest neighbor classifier oriented local discriminant analysis

Jie Xu a,b, Jian Yang a,*, Zhihui Lai c

a School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, PR China
b Shaoguan University, School of Mathematics and Information Science, Guangdong 512000, PR China
c The Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, PR China
Article history: Received 1 April 2011; Received in revised form 8 December 2012; Accepted 31 December 2012; Available online 16 January 2013

Keywords: Regularization; HKNN classifier; Discriminant analysis; Pattern recognition system

Abstract

K-local hyperplane distance nearest neighbor (HKNN) classifier is an improved K-nearest neighbor (KNN) algorithm that has been successfully applied to pattern classification. This paper embeds the decision rule of HKNN classifier into the discriminant analysis model to develop a new feature extractor. The obtained feature extractor is called K-local hyperplane distance nearest neighbor classifier oriented local discriminant analysis (HOLDA), in which a regularization item is imposed on the original HKNN algorithm to obtain a more reliable distance metric. Based on this distance metric, the homo-class and hetero-class local scatters are characterized in HOLDA. By maximizing the ratio of the hetero-class local scatter to the homo-class local scatter, we obtain a subspace which is suitable for feature extraction and classification. In general, this paper provides a framework for building a feature extractor from the decision rule of a classifier. By this means, the feature extractor and classifier can be seamlessly integrated. Experimental results on four databases demonstrate that the integrated pattern recognition system is effective.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

Feature extraction has been an important topic and has attracted considerable attention in the field of pattern recognition. In particular, subspace learning [2] based feature extraction methods have become popular over the past few decades. The goal of subspace learning methods is to find a set of basis vectors that yields low-dimensional compact representations of the data. The most popular subspace learning methods [2] include Principal Component Analysis (PCA) [1], Linear Discriminant Analysis (LDA) [1,10], also called Fisher's Linear Discriminant Analysis (FLDA) [1], and the Non-negative Matrix Factorization (NMF) based feature extraction methods [3,9,28]. PCA, as an unsupervised method, aims to find a set of basis vectors (or projection axes) maximizing the data's variance. PCA might not be very suitable for classification because maximizing the variance of the data cannot guarantee the separability of different classes. NMF is optimal for learning the parts of objects, but its discriminative ability is limited because it fails to capture the geometrical structure of the data, which is crucial for classification problems [3]. In contrast, LDA is a supervised method and aims to maximize the ratio of the between-class scatter to the within-class scatter in the low-dimensional subspace. The LDA projection makes the data in the same class as close as possible and the data in different classes as far from each other as possible. Therefore, LDA is generally superior to PCA in terms of recognition rate.
* Corresponding author at: School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, PR China. Fax: +86 25 8431 7297x307. E-mail address: [email protected] (J. Xu).
http://dx.doi.org/10.1016/j.ins.2012.12.045
After extracting features, we need to choose a proper classifier. When the chosen classifier suits the extracted features, good recognition performance will be achieved. To improve recognition performance, one can consider integrating the feature extractor and the classifier into a single system. That is, one can design a feature extractor to match the decision rule of a given classifier. In this case, the designed feature extractor and classifier can be seamlessly integrated.

The K-nearest neighbor (KNN) classifier [7] is a classical classifier. To enhance the performance of KNN, Vincent and Bengio proposed the K-local Hyperplane Distance Nearest Neighbor (HKNN) algorithm [23]. The principle of HKNN is to approximate the potential instances of each class using a local affine hyperplane on the class manifold [23], and then to classify a query sample according to the distance between the query sample and the local affine hyperplane. Specifically, for a given query sample, we find its K nearest neighbors in each class by the KNN algorithm, and then construct a local affine hyperplane from these neighbors to approximate the class manifold. Finally, the query sample is assigned to the class whose local affine hyperplane is closest to it. The experimental results in [23] indicate that the HKNN classifier often gives a dramatic improvement over the standard KNN classifier.

As we know, a proper distance metric [11,12,24,34] is crucial for improving classification performance. Inspired by the advantages of the HKNN classifier, we try to embed its classification rule into the design of a discriminant analysis method. In this way, the resulting discriminator and the classifier share a common distance metric, so the extracted features are naturally suitable for the HKNN classifier. Actually, a similar idea of combining a feature extractor and a classifier has been validated in [29]. Unlike the method in [29], which takes the point-to-local-mean distance as the metric, we here use the point-to-hyperplane distance as the metric. The point-to-hyperplane metric (distance) is more dependable than the point-to-local-mean metric, since the former can largely reduce the negative influence of outliers. Despite this, the point-to-hyperplane metric sometimes also encounters unreliable cases and leads to misclassification. A specific example is shown in Section 2.4. To address this problem, a regularization item [25,30] is imposed on the HKNN algorithm to rectify the distance metric and to yield the regularized HKNN (RHKNN) classifier [23]. As stated in [23], RHKNN achieves a dramatic improvement over HKNN on many classification tasks.

Using the rectified point-to-hyperplane distance metric, we develop a novel discriminant analysis algorithm under the guidance of the classification rule of RHKNN, which is called K-local Hyperplane Distance Nearest Neighbor Classifier Oriented Local Discriminant Analysis (HOLDA). Intuitively, the proposed discriminant analysis is expected to enable the RHKNN classifier to achieve better classification performance in the reduced-dimensional feature space, since the feature extractor and classifier are integrated into a pattern recognition system with a common distance metric. In the integrated HOLDA and RHKNN system, the regularized point-to-hyperplane distance is used to characterize the homo-class and hetero-class local scatters.
Then, by maximizing the hetero-class local scatter and simultaneously minimizing the homo-class local scatter, we obtain the optimal transform matrix. The criterion, similar to the classical Fisher criterion, is a Rayleigh quotient in form. Thus, it is easy to compute its optimal solutions by solving a generalized eigen-equation.

From the locality point of view, HOLDA can be treated as a local subspace learning method. HOLDA uses the constructed point on the local affine hyperplane as the class prototype, and then adopts the point-to-hyperplane distance to characterize the hetero-class and homo-class scatters. Recently, a number of local methods, such as Local Fisher Discriminant Analysis (LFDA) [22], Local Discriminant Models and Global Integration (LDMGI) [31] and Local Similarity Discriminant Analysis (LSDA) [4], have been developed. These methods adopt different distance metrics in modeling and solve different types of problems. Specifically, LFDA is for feature extraction, LDMGI is for clustering and LSDA is for classification. Obviously, LFDA is the one most similar to HOLDA, since both are developed as feature extractors. LFDA can be viewed as a localized LDA variant that combines the ideas of LDA and Locality Preserving Projection (LPP) [8] to solve multimodal problems, i.e. problems where the samples of some classes form several separate clusters. Specifically, LFDA first uses a weighted graph to characterize the hetero-class and homo-class point-pair relations and then adopts the point-to-point distance to characterize the hetero-class and homo-class scatters. The design of LFDA is independent of the design of the classifier. In other words, the connection between LFDA and the classifiers is loose. In contrast, our proposed recognition system combines the feature extractor HOLDA and the RHKNN classifier together. As a result, the features extracted by HOLDA should match RHKNN well.

The rest of the paper is organized as follows. In Section 2, we review the HKNN classifier, the related classifiers and its regularized version. Section 3 introduces the proposed feature extraction algorithm HOLDA and a unified framework for building a feature extractor from a classifier. Section 4 presents our experimental results. Finally, our conclusions are drawn in Section 5.
2. Outline of the HKNN classifier and the related classifiers

2.1. Local affine hyperplane

In HKNN [23], classification is done based on the distance from a point to its corresponding local affine hyperplane in each class. We now present the notion of an affine hyperplane (affine subspace) [13,19,35] as follows: for a given set S, the points belonging to S span an affine subspace when they are not co-linear. In other words, the affine subspace of S can be represented by linear combinations of the data in S whose coefficients sum to one. It means that any point p in the affine subspace of S can be
calculated by a linear combination of the elements of S with a set of coefficients ak. Now, we rewrite the affine subspace of S in the following form:
$$\left\{ p = \sum_{k} a_k x_k \;\middle|\; x_k \in S \ \text{and}\ \sum_{k} a_k = 1 \right\}. \qquad (1)$$
As a linear approximation of the data, the affine hyperplane has been widely used in many classification methods. More details can be found in Refs. [5,13,14].

2.2. K-local hyperplane distance nearest neighbor classifier (HKNN)

Suppose that there is a training set $X = [x_1, \ldots, x_M] \in \mathbb{R}^{d\times M}$, where d is the dimensionality and M is the number of training samples. For a given sample $x \in X$, we find its K nearest neighbors from each class and denote the K nearest neighbors of x in Class i by $x_{ik}$ ($k = 1, \ldots, K$). These K nearest neighbors determine a local affine subspace, which is called the "local affine hyperplane". This affine hyperplane is defined by
$$LH_i(x) = \left\{ p \;\middle|\; p = \bar{x}_i + \sum_{k=1}^{K} a_{ik}(x_{ik} - \bar{x}_i),\ a_{ik} \in \mathbb{R},\ \sum_{k=1}^{K} a_{ik} = 1 \right\}, \qquad (2)$$
where $\bar{x}_i = \frac{1}{K}\sum_{k=1}^{K} x_{ik}$. The squared Euclidean distance from x to $LH_i(x)$ is defined by
$$d_i(x) = \|x - p\|^2 = \left\| x - \bar{x}_i - \sum_{k=1}^{K} a_{ik}(x_{ik} - \bar{x}_i) \right\|^2 = \left\| x - \bar{x}_i - X_i a_i \right\|^2, \qquad (3)$$
where $\|\cdot\|$ denotes the Euclidean norm, $X_i = [x_{i1} - \bar{x}_i, \ldots, x_{iK} - \bar{x}_i]$ and $a_i = (a_{i1}, \ldots, a_{iK})^T$. The optimal reconstruction weight vector $a_i$ in Eq. (3) can be obtained by solving the following problem:
$$\min_{a_i}\ g(a_i) = \min_{a_i}\ \left\| x - \bar{x}_i - X_i a_i \right\|^2 \qquad \text{s.t.}\ \sum_{k=1}^{K} a_{ik} = 1. \qquad (4)$$
Eq. (4) amounts to solving a linear system in $a_i$. Using a gradient search procedure [6], the following necessary condition is obtained:
$$X_i^T X_i a_i = X_i^T (x - \bar{x}_i). \qquad (5)$$
If $X_i^T X_i$ is nonsingular, we can solve for $a_i$ uniquely as
$$a_i^*(x) = \left( X_i^T X_i \right)^{-1} X_i^T (x - \bar{x}_i). \qquad (6)$$
Once $a_i^*(x)$ is obtained, one can substitute it into Eq. (2), and then the local affine hyperplane with respect to x in Class i is determined. The Euclidean distance from x to $LH_i(x)$ can be calculated by Eq. (3). If
$$d_l(x) = \min_i d_i(x), \qquad (7)$$
x will be assigned to Class l according to the decision rule of the HKNN classifier.

2.3. The related classifiers

2.3.1. The special cases of the HKNN classifier

Taking a close look at Eq. (3), we find that the HKNN classifier can be understood as choosing the class that provides the best reconstruction (the smallest reconstruction error) of the test pattern through a linear combination of K particular neighbors. Geometrically, minimizing the point-to-hyperplane distance in Eq. (3) amounts to finding the point x′ on the affine hyperplane that is closest to x, as shown in Fig. 1. This point x′ is the projection of x onto the affine hyperplane S. The distance from x to its nearest affine hyperplane S, i.e. the length of the line segment xx′, is exactly the reconstruction error.

In addition, we notice that, if the nearest-neighbor parameter K = 1, the HKNN algorithm degrades into the Nearest Neighbor (NN) classifier; when K = 2, the "hyperplane" is spanned by two nearest neighbors and degenerates into a nearest neighbor line (NNL) [36], and thus the HKNN algorithm becomes the NNL classifier; when K = 3, the HKNN classifier is the Nearest Neighbor Plane (NNP) classifier [36]; when K equals the number of training samples per class, the HKNN classifier is the Linear Regression Classifier (LRC) [18]. From this point of view, the NN classifier, the NNL classifier, the NNP classifier and LRC can all be viewed as special cases of the HKNN classifier.
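For concreteness, the decision rule of Eqs. (2)-(7) can be sketched in a few lines of NumPy. The function and variable names below are ours rather than the paper's, the data are assumed to be stored as row-vector arrays, and the normal equation (5) is solved with a least-squares routine so that rank-deficient neighborhoods are handled gracefully; this is a sketch, not the authors' implementation.

```python
import numpy as np

def hknn_distance(x, class_samples, K):
    """Squared point-to-hyperplane distance from x to the local affine hyperplane
    spanned by the K nearest neighbors of x within one class (Eqs. (2)-(6))."""
    d2 = np.sum((class_samples - x) ** 2, axis=1)
    nbrs = class_samples[np.argsort(d2)[:K]]       # K nearest neighbors in this class
    xbar = nbrs.mean(axis=0)                       # local centroid
    Xi = (nbrs - xbar).T                           # d x K matrix of centered neighbors
    # Least-squares solution of X_i^T X_i a = X_i^T (x - xbar), cf. Eqs. (5)-(6)
    a, *_ = np.linalg.lstsq(Xi, x - xbar, rcond=None)
    r = x - xbar - Xi @ a                          # reconstruction error vector
    return float(r @ r)                            # d_i(x) of Eq. (3)

def hknn_predict(x, X_train, y_train, K):
    """Assign x to the class whose local affine hyperplane is closest, Eq. (7)."""
    labels = np.unique(y_train)
    dists = [hknn_distance(x, X_train[y_train == c], K) for c in labels]
    return labels[int(np.argmin(dists))]
```

Setting K = 1, 2 or 3 in this sketch reproduces the NN, NNL and NNP special cases mentioned above, and using all samples of a class recovers an LRC-like rule (without the sum-to-one constraint).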
Fig. 1. Illustration of the distance from one point x to an affine hyperplane S.
2.3.2. LRC and HKNN

LRC is typically a nearest subspace algorithm and uses linear regression in modeling. In contrast, HKNN aims to find the class with the nearest local affine hyperplane. Although LRC and HKNN can be put into a unified framework of finding, within each class, an affine hyperplane with the minimal reconstruction error, they have the following differences:

(1) In HKNN, the sum-to-one constraint is imposed, whereas such a constraint is not included in LRC. Actually, the sum-to-one constraint provides an important symmetry property: for any particular data point, the weights are invariant to rotations, rescaling, and translations of that data point and its neighbors. In this way, the obtained weights preserve the intrinsic geometric characteristics of each neighborhood [20].

(2) In HKNN only K nearest neighbors from each class are selected to reconstruct the local affine hyperplane, while in LRC the hyperplane is spanned by all class samples. For small sample size problems like face recognition, it is feasible to choose all class samples to reconstruct the local affine hyperplane. However, for large sample size problems (i.e. when the number of training samples in each class is larger than the dimension of the samples), it is better to choose K nearest neighbors to approximate the local affine hyperplane. Obviously, in such a case, $X_i^T X_i$ (in Eq. (5)) is usually singular (or nearly singular) if all class samples are used [27]. Therefore, from the numerical computation point of view, selecting K nearest neighbors to reconstruct the local affine hyperplane produces a more stable solution than using all of the class training samples.

2.3.3. SRC and HKNN

The Sparse Representation based Classification (SRC) [26] is another classifier related to HKNN. The basic idea of SRC is to represent a query sample as a sparse linear combination of all training samples; the nonzero representation coefficients are supposed to concentrate on the training samples with the same class label as the query sample. From the experimental results reported in [26], it can be seen that, when the data are well clustered, the samples associated with the dominant nonzero representation coefficients mostly share the class label of the query sample. From the reconstruction point of view, SRC aims at finding a class in which the sparsely selected samples form a local affine hyperplane as close to the query sample as possible. This is very similar to the HKNN classifier. SRC has the advantage that it does not need the K-nearest-neighbor searching step. However, SRC is generally more time consuming because it needs to solve an l1-regularized least squares optimization problem.

2.4. The roles of regularization in the HKNN algorithm

For a given point x, if it is close to a data set, the HKNN classifier provides a good distance metric and yields correct classification. However, once the given point x is far away from the data set, HKNN may lead to misclassification. As shown in Fig. 2, when K = 2, the two nearest neighbors y1 and y2 in Class 2 span a line provided that the sum-to-one constraint is enforced. As we know, x belongs to Class 1. However, the point x is closer to x′ than to x″. In such a case, the HKNN classifier yields a wrong result. To avoid this problem, the idea in LLE [20] can be borrowed to modify the formula of the local reconstruction
Fig. 2. An example where the HKNN classifier fails while the RHKNN classifier succeeds.
error by introducing a regularization term to penalize larger weights. The modified formula is a function of the weight vector ai and is defined as follows:
$$f(a_i) = \left\| x - \bar{x}_i - \sum_{k=1}^{K} a_{ik}(x_{ik} - \bar{x}_i) \right\|^2 + \mu \sum_{k=1}^{K} (a_{ik})^2 = \left\| x - \bar{x}_i - X_i a_i \right\|^2 + \mu\, a_i^T a_i, \qquad (8)$$
where the parameter $\mu > 0$ and I is the identity matrix. By minimizing $f(a_i)$ in Eq. (8) subject to $\sum_{k=1}^{K} a_{ik} = 1$, one can obtain a set of rectified reconstruction weights $\hat{a}_i = \left( X_i^T X_i + \mu I \right)^{-1} X_i^T (x - \bar{x}_i)$, based on which the regularized point-to-hyperplane distance can be calculated as follows:
$$\hat{d}_i(x) = \left\| x - \bar{x}_i - \sum_{k=1}^{K} \hat{a}_{ik}(x_{ik} - \bar{x}_i) \right\|^2. \qquad (9)$$
Similarly, if $\hat{d}_l(x) = \min_i \hat{d}_i(x)$, x will be assigned to Class l.

It should be stressed that the role of the regularization term $\mu\sum_{k=1}^{K}(a_{ik})^2 = \mu\, a_i^T a_i$ in Eq. (8) is twofold. First, from the numerical computation point of view, the term makes the optimal solution more robust, since it adds a small multiple of the identity matrix to the local Gram matrix $X_i^T X_i$ and thus stabilizes the solution when $X_i^T X_i$ is singular or nearly singular. Second, and more importantly, from the geometric point of view, the regularization term helps HKNN produce a rectified distance measure. As shown in Fig. 2, when the HKNN measure causes misclassification, the projection of x onto the nearest line L, i.e. x′, is far away from y1 and y2, which leads to large values in $a_i$. To avoid this, the regularization term $\mu\, a_i^T a_i$ penalizes weight vectors with large values. In other words, minimizing the function $f(a_i)$ in Eq. (8) yields a rectified weight vector $\hat{a}_i$ with relatively small values. As shown in Eq. (9), the rectified weight vector leads to a regularized HKNN measure, which should be more reasonable. For the case shown in Fig. 2, the rectified weight vector makes the reconstruction of x on the nearest line L, namely x‴, closer to y1 and y2, and at the same time the regularized HKNN distance between x and Class 2 (i.e. the distance between x and x‴) is larger than the regularized HKNN distance between x and Class 1 (i.e. the distance between x and x″). Thus, the regularized HKNN algorithm provides a correct distance measure.

3. K-local hyperplane distance nearest neighbor classifier oriented local discriminant analysis (HOLDA)

3.1. Calculation of the point-to-hyperplane distance

Suppose there are M samples from C pattern classes. The collection of all samples is denoted by $X = [x_1, \ldots, x_M] \in \mathbb{R}^{d\times M}$. For a given sample $x_i \in X$, we find its K nearest neighbors in Class c, and the index set of these K nearest neighbors is denoted by $\Omega_i^c = \{ j \mid x_j \text{ is one of the } K \text{ nearest neighbors of } x_i \text{ in Class } c \}$. Let $d_c(x_i)$ be the Euclidean distance from $x_i$ to its nearest local affine hyperplane in Class c. The optimal reconstruction weights can be calculated by solving the following optimization problem:
$$\hat{a}_i = \arg\min_{a_i},\quad d_c(x_i) = \min_{a_i} \left\{ \left\| x_i - \sum_{j\in\Omega_i^c} a_{ij} x_j \right\|^2 + \mu \sum_{j\in\Omega_i^c} (a_{ij})^2 \right\} \quad \text{s.t.}\ \sum_{j\in\Omega_i^c} a_{ij} = 1, \qquad (10)$$
where $a_i = [a_{i1}, \ldots, a_{iK}]$ and $\mu > 0$. Once the optimal reconstruction weights are obtained, the corresponding nearest point-to-hyperplane distance $d_c(x_i)$ can be calculated.

For each $x_i$, we find its K homo-class neighbors, and the collection of the indexes of these K homo-class neighbors is denoted by $\Omega_i^W = \{ j \mid x_j \text{ is one of the } K \text{ homo-class nearest neighbors of } x_i \}$. It is obvious that the size of $\Omega_i^W$ is K. Similarly, we select the hetero-class whose local affine hyperplane is closest to $x_i$ as the nearest hetero-class. Assume that $d_r(x_i) = \min_{s \neq l(x_i)} d_s(x_i)$, where $l(x_i)$ denotes the class label of $x_i$; this means that the rth class is the nearest hetero-class of $x_i$. The index set of the hetero-class neighbors of $x_i$ is the collection of indexes of the K neighbors of $x_i$ in Class r, i.e. $\Omega_i^B = \{ j \mid x_j \text{ is one of the } K \text{ nearest hetero-class neighbors of } x_i \text{ in Class } r \text{ satisfying } d_r(x_i) = \min_{s \neq l(x_i)} d_s(x_i) \}$. Notice that the size of $\Omega_i^B$ is also K. Based on the obtained reconstruction weights and the corresponding homo-class and hetero-class index sets, we define the homo-class local distance from $x_i$ to its nearest homo-class local affine hyperplane as follows:
$$d_i^W(x_i) = \left\| x_i - \sum_{j\in\Omega_i^W} \hat{a}_{ij} x_j \right\|^2. \qquad (11)$$

The hetero-class local distance from $x_i$ to its nearest hetero-class local affine hyperplane is defined as follows:

$$d_i^B(x_i) = \left\| x_i - \sum_{j\in\Omega_i^B} \hat{a}_{ij} x_j \right\|^2. \qquad (12)$$
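The per-sample quantities of Eqs. (10)-(12) can be computed with a short NumPy routine. The sketch below is ours: it solves the sum-to-one constrained problem (10) in the standard LLE-style closed form (weights proportional to the inverse of the regularized local Gram matrix applied to a vector of ones), which is one natural way to realize Eq. (10) and is not necessarily identical to the authors' implementation; all function names are hypothetical.

```python
import numpy as np

def regularized_weights(xi, nbrs, mu):
    """Rectified weights of Eq. (10): minimize ||x_i - sum_j a_ij x_j||^2 + mu*sum_j a_ij^2
    subject to sum_j a_ij = 1, solved in closed form (LLE-style)."""
    D = nbrs - xi                                   # rows are (x_j - x_i)
    G = D @ D.T + mu * np.eye(len(nbrs))            # regularized local Gram matrix
    a = np.linalg.solve(G, np.ones(len(nbrs)))
    a /= a.sum()                                    # enforce the sum-to-one constraint
    dist = float(np.sum((xi - a @ nbrs) ** 2))      # reconstruction error, as in Eqs. (11)-(12)
    return a, dist

def homo_hetero_sets(i, X, y, K, mu):
    """Index sets Omega_i^W / Omega_i^B, their weights and the distances d_i^W, d_i^B."""
    xi, ci = X[i], y[i]
    per_class = {}
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        idx = idx[idx != i]                         # exclude the sample itself
        nn = idx[np.argsort(np.sum((X[idx] - xi) ** 2, axis=1))[:K]]
        a, dist = regularized_weights(xi, X[nn], mu)
        per_class[c] = (nn, a, dist)
    r = min((c for c in per_class if c != ci), key=lambda c: per_class[c][2])
    return per_class[ci], per_class[r]              # homo-class and nearest hetero-class triples
```

Here the homo-class set comes from the sample's own class (its K homo-class nearest neighbors), and the hetero-class set comes from the class r whose regularized point-to-hyperplane distance to $x_i$ is smallest, exactly as described above.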
Generally, the sum of homo-class local distances $\sum_{i=1}^{M} d_i^W(x_i)$ characterizes the compactness of nearby samples of the same class, and the sum of hetero-class local distances $\sum_{i=1}^{M} d_i^B(x_i)$ characterizes the separability of nearby samples with different class labels. Making the sum of hetero-class local distances larger and the sum of homo-class local distances smaller leads to a better classification result than classifying in the input space.

3.2. Characterization of the homo-class and hetero-class local scatters

Assume that there is a linear transformation $y_i = W^T x_i$. Under this transformation, the local reconstruction weights are preserved, so that points that are close in the high-dimensional space remain within the same neighborhood in the low-dimensional transformed space. In the low-dimensional space, the average homo-class local distance is the average distance of all samples to their nearest homo-class local affine hyperplanes. This average distance can be viewed as the homo-class local scatter, which is defined as follows:
$$J_W(W) = \frac{1}{M}\sum_{i=1}^{M} d_i^W(y_i) = \frac{1}{M}\sum_{i=1}^{M}\left\| y_i - \sum_{j\in\Omega_i^W} A_{ij}^W y_j \right\|^2, \qquad (13)$$
where $A^W \in \mathbb{R}^{M\times M}$ is the homo-class weight coefficient matrix (with at most K nonzero entries per row), whose elements are given by

$$A_{ij}^W = \begin{cases} \hat{a}_{ij}, & \text{if } j \in \Omega_i^W, \\ 0, & \text{otherwise.} \end{cases} \qquad (14)$$
$J_W(W)$ can be derived as follows:

$$
\begin{aligned}
J_W(W) &= \frac{1}{M}\sum_{i=1}^{M}\left\| W^T x_i - \sum_{j=1}^{M} A_{ij}^W W^T x_j \right\|^2 \\
&= W^T\left[ \frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{M}\left( \delta_{ij} - A_{ij}^W - A_{ji}^W + \sum_{t=1}^{M} A_{ti}^W A_{tj}^W \right) x_i x_j^T \right] W \\
&= W^T\left[ \frac{1}{M}\, X (I - A^W)^T (I - A^W) X^T \right] W = W^T S_W^L W, \qquad (15)
\end{aligned}
$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise, and

$$S_W^L = \frac{1}{M}\, X (I - A^W)^T (I - A^W) X^T \qquad (16)$$
is called the homo-class local scatter matrix.

Similar to the homo-class local scatter, the hetero-class local scatter can be characterized by the average nearest hetero-class point-to-hyperplane distance of all points. Specifically, it can be defined by

$$J_B(W) = \frac{1}{M}\sum_{i=1}^{M} d_i^B(y_i) = \frac{1}{M}\sum_{i=1}^{M}\left\| y_i - \sum_{j\in\Omega_i^B} A_{ij}^B y_j \right\|^2, \qquad (17)$$
where ABij (i = 1, . . . , M and j = 1, . . . , K) is the element of the hetero-class coefficient matrix, which is defined by
( ABij
¼
a^ ij ; if j 2 XBi 0
otherwise:
ð18Þ
Similarly, $J_B(W)$ can be derived as follows:

$$J_B(W) = \frac{1}{M}\sum_{i=1}^{M}\left\| y_i - \sum_{j=1}^{M} A_{ij}^B y_j \right\|^2 = W^T\left[ \frac{1}{M}\, X (I - A^B)^T (I - A^B) X^T \right] W = W^T S_B^L W, \qquad (19)$$
where

$$S_B^L = \frac{1}{M}\, X (I - A^B)^T (I - A^B) X^T \qquad (20)$$
is called the hetero-class local scatter matrix.

3.3. Criterion of HOLDA

For the purpose of classification, we try to find a projection that maps the homo-class samples as close to each other and the hetero-class samples as far from each other as possible. From this point of view, a desirable projection of HOLDA can be obtained by minimizing the homo-class local scatter $W^T S_W^L W$ and at the same time maximizing the hetero-class local scatter $W^T S_B^L W$, i.e. maximizing the following criterion:
$$J(W) = \frac{J_B(W)}{J_W(W)} = \frac{W^T S_B^L W}{W^T S_W^L W}. \qquad (21)$$
The optimal projections of Eq. (21) are the generalized eigenvectors of the following generalized eigen-equation:

$$S_B^L W = \lambda S_W^L W. \qquad (22)$$

The projection axes of HOLDA can be selected as the generalized eigenvectors $w_1, w_2, \ldots, w_d$ of $S_B^L W = \lambda S_W^L W$ corresponding to the d largest positive eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_d$, i.e. $W = [w_1, w_2, \ldots, w_d]$.
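A compact sketch of Eqs. (16), (20) and (22) is given below. It assumes the coefficient matrices $A^W$ and $A^B$ have already been filled according to Eqs. (14) and (18) (for instance with the weights from the sketch in Section 3.1); the small ridge added to $S_W^L$ is our own device to keep the generalized eigensolver well posed, playing the role of the PCA preprocessing discussed in Section 3.4. Function and argument names are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def holda_projection(X, AW, AB, d, ridge=1e-6):
    """Build the local scatter matrices of Eqs. (16) and (20) and solve the
    generalized eigen-equation S_B^L w = lambda S_W^L w of Eq. (22).

    X  : (dim, M) data matrix with samples as columns
    AW : (M, M) homo-class coefficient matrix, Eq. (14)
    AB : (M, M) hetero-class coefficient matrix, Eq. (18)
    """
    M = X.shape[1]
    I = np.eye(M)
    SW = X @ (I - AW).T @ (I - AW) @ X.T / M        # homo-class local scatter, Eq. (16)
    SB = X @ (I - AB).T @ (I - AB) @ X.T / M        # hetero-class local scatter, Eq. (20)
    lam, V = eigh(SB, SW + ridge * np.eye(X.shape[0]))
    order = np.argsort(lam)[::-1]                   # keep the d largest eigenvalues
    return V[:, order[:d]]                          # projection matrix W = [w_1, ..., w_d]
```

After fitting, the training data are projected as $Y = W^T X$; a query is projected in the same way and then classified with the RHKNN rule of Section 2.4, which is what makes the feature extractor and the classifier share one distance metric.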
3.4. Algorithm of HOLDA

As a summary of the preceding description, the HOLDA algorithm is given below:

Step 1. Find the local affine hyperplanes. For each training sample, find its K nearest neighbors in every class and then calculate the optimal rectified weight coefficients by solving the optimization problem of Eq. (10). The homo-class and hetero-class coefficient matrices can be calculated by using Eqs. (14) and (18), respectively.
Step 2. Construct the homo-class local scatter matrix $S_W^L$ and the hetero-class local scatter matrix $S_B^L$ by using Eqs. (16) and (20), respectively.
Step 3. Calculate the generalized eigenvectors $w_1, w_2, \ldots, w_d$ of $S_B^L W = \lambda S_W^L W$ corresponding to the d largest positive eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_d$.

Note that, in real-world biometric applications, such as face recognition and palm recognition, the homo-class local scatter matrix $S_W^L$ is always singular because the total number of training samples is much smaller than the dimension of the images. In this case, $S_W^L$ cannot be directly used to compute the transformation matrix based on Eq. (22). To overcome this problem, PCA can first be used to reduce the dimension of the data, and then our method is performed in the PCA-transformed space (a minimal sketch of this PCA step is given at the end of this section). One may be concerned about whether discriminative information is lost after the PCA transformation. This problem has been addressed in [32]: the obtained optimal discriminant vectors keep all discriminatory information with respect to the Fisher criterion [32].

3.5. A unified framework of developing a feature extractor from a classifier

Recalling the design of HOLDA, we find that it can be put into a unified framework of developing a feature extractor from a classifier. HOLDA can be treated as a paradigm under this framework, and its successful application on different types of databases demonstrates that this framework is feasible. The unified framework can be described as follows:

(1) Apply the decision rule of a given classifier to obtain the homo-class and hetero-class distance metrics. Based on these distance metrics, the corresponding scatter matrices of the data set can be constructed.
(2) Select a discriminative criterion. The quotient criterion and the discrepancy criterion are two frequently used discriminative criteria, and both are ultimately transformed into the form of a generalized eigen-equation or a standard eigen-equation.
(3) Solve the generalized eigen-equation (or standard eigen-equation) to obtain the desirable projection directions and map the data into the transformed space.
(4) Based on the extracted features, use the given classifier for classification.

Obviously, the designed feature extractor is bound to a classifier. Since a common distance metric is used in the discriminator and classifier, the extracted features will be very suitable for the given classifier. Such an integration of discriminator and classifier yields an effective pattern recognition system.
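As a companion to the note on singular $S_W^L$ above, the following is a minimal sketch of the PCA preprocessing step, assuming samples are stored as columns; the function name and the SVD-based implementation are ours, not taken from the paper.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the columns of X onto the leading principal components so that
    HOLDA can be performed in the PCA-transformed space (Section 3.4)."""
    mean = X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(X - mean, full_matrices=False)
    P = U[:, :n_components]                      # dim x n_components PCA basis
    return P.T @ (X - mean), P, mean             # reduced data, basis and mean for queries

# Example: in the AR experiments of Section 4.1.2, 220 principal components are kept
# before HOLDA and RHKNN are applied in the reduced space.
```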
4. Experiments

In this section, the AR face database, the Yale face database, the PolyU palmprint database and the CENPARMI handwritten numeral database are used to compare the proposed HOLDA with the following algorithms: PCA [1], FLDA [1], Locality Preserving Projection (LPP) [8], Maximum Margin Criterion (MMC) [15] and LFDA [22]. For LPP, we select the Gaussian kernel with t = 1 and use the K-nearest-neighbor method to build the adjacency (undirected) graph. After feature extraction, NNC, LRC [18], SRC [26] and RHKNN are employed for classification. In addition, the publicly available Matlab function "BPDN_homotopy_function" from [33] is used to calculate the sparse representation coefficients in SRC. In LPP, LFDA, HOLDA and RHKNN, the neighbor parameter K can be chosen differently. For distinction, in the following experiments the neighbor parameter is denoted by K1 in LPP, LFDA and HOLDA and by K2 in RHKNN.

4.1. Experiments on the AR face database

The AR face database [16,17] contains over 4000 color face images of 126 people, including frontal views of faces with different facial expressions, lighting conditions and occlusions. The images of 120 persons, including 65 males and 55 females, are selected and used in our experiments. The pictures of each person were taken in two sessions (separated by two weeks) and each session contains seven color images without occlusions. The face portion of each image is manually cropped and then normalized to 50 × 40 pixels. Sample images of one person are shown in Fig. 3.
The maximal recognition rates of the combination of HOLDA and RHKNN over the variations of K1 and K2 are shown in Table 1. From Table 1, we find that the performance of HOLDA is very robust to the variation of the parameters, and the optimal parameter combination is chosen as K1 = 4 and K2 = 2. This example provides us a flexible way for parameter selection. Similarly, we can find the optimal parameter for differently combined recognition systems. Based on these optimal parameters, we vary the dimension of the extracted features from 5 to 220 with an interval of 5 and record the recognition rates of each method in Fig. 5. From Fig. 5, we find the combined system of HOLDA and RHKNN outperforms the other combined recognition systems.
Fig. 3. Sample images of one person in the AR face database.
Fig. 4. The PCA (a), LPP (b), FLDA (c), MMC (d), LFDA (e) and HOLDA (f) embeddings of four classes of face images from the AR face database.

Table 1. The maximal recognition rates (%) under the variation of K1 in HOLDA and K2 in the RHKNN classifier. The bold value indicates the optimal parameter combination of K1 and K2, with which the recognition system of HOLDA and RHKNN achieves the best recognition rate on the AR face database, when the first seven images from the first session are used for training and the first seven images from the second session are used for test.

K1    K2=2    K2=3    K2=4    K2=5    K2=6
2     74.52   74.05   74.17   73.81   72.98
3     74.52   74.50   73.93   73.21   72.26
4     75.36   75.24   74.76   74.05   73.57
5     74.29   74.52   74.17   73.21   72.86
6     74.88   74.76   74.05   72.98   72.38
Fig. 5. The recognition rates of the six methods versus the dimension on the AR face database, when the first seven face images from the first session are used for training and the first seven face images from the second session for test, with the NNC, LRC, SRC and RHKNN classifiers.
Precision and Recall (PR) [21] are a pair of widely used criteria for evaluating the performance of a pattern recognition algorithm. In particular, when dealing with highly skewed datasets, the Precision-Recall (PR) curve provides more information about an algorithm's performance. In order to evaluate the verification performance of the proposed method, we calculate the precision and recall of the six methods. First, we convert the multi-class cases into two-class cases in the following way: the collection of samples from the ith class is viewed as one class and the remaining samples are viewed as the other class, where i = 1, . . . , C. If we take the samples from the ith class as the positive instances and the remaining samples from the other C − 1 classes as the negative instances, the number of positive instances is smaller than the number of negative instances. Under this circumstance, we plot the corresponding PR curves of each algorithm on this skewed data and show them in Fig. 6. As we know, in PR space the goal is to be in the upper-right corner. From Fig. 6, we can see that (i) HOLDA has a significant advantage over PCA, FLDA, LPP, MMC and LFDA; (ii) the PR curves of PCA and MMC are the furthest from the upper-right corner, and thus the verification performance of PCA and MMC is the worst among the six methods.

To further evaluate the performance of the proposed method, l (l = 2, 4 and 6) images per individual are randomly selected from the 14 images for training and the remaining 14 − l images for test. We run the system 10 times. For FLDA, LPP, LFDA and HOLDA, we first project the data set into a 220-dimensional PCA subspace. As for the setting of the neighbor parameter, we use the following method: when l = 2, the nearest-neighbor parameters in the feature extractor and the classifier can only be set to 1. For fair comparison, as l varies, the nearest-neighbor parameters in the feature extractor and the classifier are both set to l − 1. After extracting features, we use NNC, LRC, SRC and RHKNN for classification. The maximal average recognition rates of each method with the different classifiers are listed in Table 2. The average PR curves of each method with different numbers of training samples are shown in Fig. 7. From Table 2 and Fig. 7, we can draw the following conclusions.

(1) All experimental results verify that the combination of HOLDA and RHKNN outperforms PCA, FLDA, LPP, MMC and LFDA with other classifiers, irrespective of the variation in the size of the training set.
Fig. 6. The PR curves of the six methods on the AR face database when the first seven image samples from session one are used for training and the first seven image samples from session two for test.
Table 2. The maximal average recognition rates (%) and the corresponding standard deviations of six methods with four classifiers on the AR database across 10 runs.

No.  Classifier  PCA            FLDA           LPP            MMC             LFDA           HOLDA
2    NNC         69.63 ± 4.91   80.89 ± 4.97   77.75 ± 4.65   72.85 ± 6.00    –              84.37 ± 3.79
2    LRC         70.65 ± 5.15   72.23 ± 7.61   75.09 ± 6.88   70.39 ± 5.45    –              83.15 ± 4.07
2    SRC         79.79 ± 5.62   80.44 ± 5.78   78.10 ± 5.16   80.58 ± 5.29    –              83.22 ± 4.01
2    RHKNN       70.23 ± 5.26   74.25 ± 7.4    74.56 ± 6.49   72.85 ± 6.00    –              84.88 ± 3.60
4    NNC         77.86 ± 9.22   84.04 ± 8.04   83.35 ± 9.24   80.16 ± 11.19   84.02 ± 7.58   89.47 ± 4.77
4    LRC         79.55 ± 10.60  81.90 ± 8.06   84.67 ± 6.89   78.36 ± 12.17   82.98 ± 7.95   89.16 ± 4.98
4    SRC         86.50 ± 8.43   86.62 ± 7.45   85.82 ± 5.05   86.86 ± 8.73    86.99 ± 7.94   89.80 ± 4.82
4    RHKNN       80.02 ± 10.44  83.83 ± 7.72   84.53 ± 9.26   80.28 ± 11.25   84.31 ± 7.59   90.22 ± 4.55
6    NNC         86.88 ± 10.24  92.58 ± 8.45   89.84 ± 10.15  90.08 ± 10.90   92.89 ± 8.34   94.08 ± 5.88
6    LRC         89.55 ± 10.89  91.68 ± 9.36   90.90 ± 6.93   89.49 ± 11.50   92.62 ± 8.71   93.89 ± 6.08
6    SRC         92.20 ± 9.42   93.20 ± 7.63   92.03 ± 5.16   92.25 ± 9.46    93.52 ± 7.45   94.44 ± 5.73
6    RHKNN       88.81 ± 11.02  92.55 ± 8.20   90.76 ± 10.55  90.05 ± 10.80   92.68 ± 8.40   93.90 ± 5.55
(2) In Table 2, the recognition results of LFDA are not reported when the number of training samples is 2. In our experiments, the dimension of the PCA-transformed space is set to 220. When there are 2 samples per class for training, the neighbor parameter is set to 1. In this case, the local within-class scatter matrix of LFDA is singular and LFDA does not work.

(3) From Fig. 7, one easily finds that the verification performance of HOLDA is better than that of PCA, FLDA, LPP, MMC and LFDA. When the number of training samples per class is small, the advantage of HOLDA becomes more evident. This property is useful in real-world applications, since the number of available training samples is often very limited. Thus HOLDA is more suitable for small sample size problems.

4.2. Experiments on the Yale face database

The Yale face database was constructed at the Yale Center for Computational Vision and Control. It contains 165 grayscale images of 15 individuals. The images demonstrate variations in lighting condition, facial expression, and with/without glasses. The size of each cropped image is 100 × 80 pixels. Fig. 8 shows some sample images of one individual. These images vary as follows: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and winking. In our experiments, l (= 3, 4) images are randomly selected from the image gallery of each individual to form the training sample set. The remaining 11 − l images are used for test. We independently run the system 10 times. Table 3 shows the average recognition rates across the 10 runs of each method with NNC, LRC, SRC and RHKNN and the corresponding standard deviations. The average PR curves of each algorithm are plotted in Fig. 9. From Table 3 and Fig. 9, we can see that:

(1) HOLDA combined with RHKNN performs best among all combined systems, irrespective of the variation in the number of training samples per class;
Fig. 7. The average PR curves of each method on the AR face database when the number of training samples is 2 (a), 4 (b) and 6 (c).
Fig. 8. Sample face images from the Yale database.
Table 3. The maximal average recognition rates (%) and the corresponding standard deviations of six methods with four classifiers on the Yale database.

No.  Classifier  PCA            FLDA            LPP            MMC            LFDA           HOLDA
3    NNC         89.50 ± 3.38   86.00 ± 13.1    86.33 ± 2.33   89.17 ± 3.75   88.67 ± 4.18   92.17 ± 3.29
3    LRC         85.58 ± 3.58   84.50 ± 5.16    86.17 ± 3.52   86.08 ± 4.73   86.50 ± 4.66   91.25 ± 3.89
3    SRC         86.83 ± 3.37   75.33 ± 12.32   85.83 ± 3.42   87.50 ± 4.50   87.00 ± 4.83   91.25 ± 3.87
3    RHKNN       89.58 ± 3.02   87.08 ± 3.83    89.25 ± 3.34   89.50 ± 3.02   87.60 ± 4.98   92.17 ± 3.29
4    NNC         89.90 ± 4.67   86.67 ± 5.18    90.86 ± 4.43   90.48 ± 5.16   89.33 ± 5.33   93.14 ± 5.15
4    LRC         86.86 ± 4.35   84.48 ± 6.09    87.43 ± 5.59   87.33 ± 5.35   85.14 ± 6.12   92.00 ± 5.26
4    SRC         87.43 ± 5.12   71.90 ± 14.06   87.14 ± 5.68   89.14 ± 6.43   87.52 ± 6.32   92.00 ± 5.28
4    RHKNN       90.48 ± 3.62   86.67 ± 5.18    91.90 ± 5.28   89.81 ± 3.21   86.86 ± 5.10   93.24 ± 5.15
Fig. 9. The average PR curves of each method on the Yale face database when the number of training samples is 3 (a) and 4 (b).
(2) By comparing the recognition results on the Yale face database and the AR face database, we find that NNC and RHKNN seem to be more suitable for the features extracted by each method on the Yale face database, while on the AR face database most methods combined with SRC and LRC achieve higher recognition rates. The possible reason is that both LRC and SRC use regression to find the class label of the query sample, and they achieve excellent performance when the training sample size is large enough. Compared with the AR face database, the Yale face database is much smaller. Therefore, the recognition systems with LRC and SRC are not the best on the Yale face database.

(3) In the PR space shown in Fig. 9, except for LFDA, the PR curves of most methods are close to the upper-right corner. This indicates that these methods achieve comparable verification performance on the Yale database, particularly when the training sample size per class is 4.

4.3. Experiment on the PolyU palmprint database

The PolyU palmprint database contains 600 gray-scale images of 100 different palms with six samples per palm (http://www4.comp.polyu.edu.hk/biometrics/). The six samples from each palm were collected in two sessions: the first three were captured in the first session and the other three in the second session. The average interval between the first and the second sessions is two months. In our experiments, by using the algorithm mentioned in [37], the central part of each original image is automatically cropped to a size of 64 × 64 pixels and preprocessed using histogram equalization. Fig. 10 shows some sample images of one palm.

Our first experiment uses the three palmprint images from Session 1 of each class for training and the remaining images for test. Our second experiment uses the three palmprint images from Session 2 of each class for training and the remaining images for test. Both experiments are performed in 150-dimensional PCA subspaces. Since there are three training samples per class, the neighbor parameters in feature extraction and classification are set to 2. After feature extraction, NNC, LRC, SRC and RHKNN are, respectively, used for classification. The maximal recognition rates of each method in both experiments are listed in Table 4. The PR curves of the six methods are shown in Fig. 11. From Table 4 and Fig. 11, we can see that the pattern recognition system of HOLDA and RHKNN still outperforms the others. In addition, the PR curves of HOLDA are the closest to the upper-right corner, which is the best among all methods.

4.4. Experiment on the CENPARMI database for handwritten numeral recognition

This experiment is done on the Concordia University CENPARMI handwritten numeral database. The database contains 6000 samples of 10 numeral classes (each class has 600 samples). Fig. 12 shows sample digit images "2" from the CENPARMI handwritten numeral database.
Fig. 10. Samples of the cropped images in the PolyU Palmprint database.
Table 4. The maximal recognition rates (%) of six methods on the PolyU palmprint database and the corresponding dimensions (in parentheses) in the two experiments.

Training   Classifier  PCA         FLDA        LPP         MMC         LFDA        HOLDA
Session 1  NNC         86.0 (85)   95.3 (60)   99.0 (105)  86.3 (85)   98.0 (85)   99.3 (45)
Session 1  LRC         83.3 (145)  97.7 (90)   98.7 (90)   81.0 (90)   97.7 (105)  97.7 (105)
Session 1  SRC         92.3 (105)  98.0 (55)   98.7 (95)   93.0 (100)  99.0 (105)  99.3 (100)
Session 1  RHKNN       86.7 (120)  98.0 (70)   98.3 (105)  86.0 (100)  98.0 (70)   99.3 (55)
Session 2  NNC         82.7 (100)  89.0 (95)   96.7 (105)  82.7 (100)  95.3 (90)   97.0 (85)
Session 2  LRC         83.7 (85)   89.0 (95)   97.0 (105)  83.0 (70)   94.7 (75)   96.7 (90)
Session 2  SRC         90.7 (105)  87.3 (70)   96.7 (105)  90.0 (75)   96.7 (125)  97.0 (90)
Session 2  RHKNN       83.0 (145)  89.0 (95)   96.3 (115)  81.7 (100)  95.3 (80)   97.0 (90)
Fig. 11. The PR curves of six methods on the PolyU database when image samples from Session 1 (a) and image samples from Session 2 (b) per class are used for training.
Fig. 12. Sample digit images "2" from the CENPARMI handwritten numeral database.
In our experiments, the first 100 and 200 training samples per class are, respectively, selected for training and the remaining samples for test. We first perform the six feature extraction methods on the original 121-dimensional Legendre moment feature space, and then use NNC and RHKNN for classification. SRC and LRC are suited to face recognition, which is a small sample size problem, whereas this experiment concerns handwritten numeral recognition, which is a large sample size problem; therefore, we only use NNC and RHKNN here. The nearest-neighbor parameters are set using the methodology adopted in Section 4.1.2. The maximal recognition rates of each method and the corresponding dimensions are shown in Table 5. The PR curves of each method are plotted in Fig. 13.

From the classification results listed in Table 5, we find that the combinations of each method with RHKNN achieve higher recognition rates than those with NNC. In addition, among the six methods, HOLDA performs best regardless of whether NNC
Table 5. The maximal recognition rates (%) of six methods on the CENPARMI handwritten numeral database and the corresponding dimensions (in parentheses).

No.  Classifier  PCA          FLDA        LPP         MMC          LFDA         HOLDA
100  NNC         84.6 (40)    85.1 (45)   84.5 (40)   89.4 (115)   85.1 (25)    87.7 (50)
100  RHKNN       90.3 (115)   85.3 (45)   89.9 (25)   90.3 (120)   90.7 (120)   92.2 (50)
200  NNC         88.4 (40)    87.6 (45)   88.7 (70)   91.2 (115)   88.1 (25)    91.8 (50)
200  RHKNN       93.2 (25)    87.8 (5)    93.5 (35)   93.2 (20)    92.9 (120)   94.7 (50)
Fig. 13. The PR curves of six methods on the CENPARMI handwritten numeral database when the number of training samples is 100 (a) and 200 (b).
or RHKNN is used. FLDA is generally supposed to outperform PCA; however, in these experiments FLDA does not show a performance advantage over PCA. This is due to the limited number of features (only nine) available to FLDA, which might not be enough to represent the patterns well in such a low-dimensional space. From the PR curves of each method shown in Fig. 13, we find that when the verification performance is evaluated on skewed data, the advantage of HOLDA is evident. Unlike the results reported in Table 5, the performance of FLDA is better than that of PCA. In addition, in the PR space there is still vast room for improvement, since the PR curves of all methods are far from the upper-right corner.
5. Conclusions

The main contribution of this paper is to offer a framework for developing a feature extractor from the decision rule of a classifier. The proposed HOLDA is a paradigm under this framework, that is, HOLDA is oriented for the RHKNN classifier. Experimental results on four databases demonstrate that the integrated pattern recognition system, HOLDA and RHKNN, is effective.

Acknowledgements

The authors would like to thank the anonymous reviewers for their critical and constructive comments and suggestions. This work was partially supported by the National Science Fund for Distinguished Young Scholars under Grant No. 61125305, the Key Project of Chinese Ministry of Education No. 61233011, the National Science Foundation of China under Grant Nos. 61203376 and 61233011, the China Postdoctoral Science Foundation funded project 2012M511479, and the Guangdong Natural Science Foundation under Project S2012040007289.

References

[1] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 711–720.
[2] H. Cheng, K. Vu, K.A. Hua, SubSpace projection: a unified framework for a class of partition-based dimension reduction techniques, Information Sciences 179 (2009) 1234–1248.
[3] D. Cai, X.F. He, J.W. Han, T. Huang, Graph regularized non-negative matrix factorization for data representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8) (2011) 1548–1560.
[4] L. Cazzanti, M.R. Gupta, Local similarity discriminant analysis, in: Proceedings of the 24th International Conference on Machine Learning, pp. 137–144.
[5] J.T. Chien, C.C. Wu, Discriminant wavelet faces and nearest feature classifiers for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (12) (2002) 1644–1649.
[6] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, NY, 2001. ISBN: 0471056693.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[8] X. He, P. Niyogi, Locality preserving projections, in: Proceedings of the 16th Conference on Neural Information Processing Systems, 2003.
[9] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[10] G.F. Lu, Y. Wang, Feature extraction using a fast null space based linear discriminant analysis algorithm, Information Sciences 193 (2012) 72–80.
[11] B. Liu, M. Wang, R. Hong, Z. Zha, X.S. Hua, Joint learning of labels and distance metric, IEEE Transactions on Systems, Man and Cybernetics: Part B 40 (3) (2010) 973–978.
[12] W. Liu, S.C.H. Hoi, J. Liu, Output regularized metric learning with side information, in: Proceedings of the 10th European Conference on Computer Vision, 2008, pp. 358–371.
[13] D.D. Lee, H.S. Seung, Unsupervised learning by convex and conic coding, Proceedings of the Conference on Neural Information Processing Systems (NIPS) 9 (1996) 515–521.
[14] S.Z. Li, Face recognition based on nearest linear combinations, in: Proceedings of the 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998, pp. 839–844.
[15] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Transactions on Neural Networks 17 (1) (2006) 157–165.
[16] A.M. Martinez, R. Benavente, The AR Face Database, CVC Technical Report #24, June 1998.
[17] A.M. Martinez, R. Benavente, The AR Face Database, 2006.
[18] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (11) (2010) 2106–2112.
[19] E. Oja, Subspace Methods of Pattern Recognition, Research Studies Press Ltd, Letchworth, England, 1983.
[20] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[21] V. Raghavan, P. Bollmann, G.S. Jung, A critical investigation of recall and precision as measures of retrieval system performance, ACM Transactions on Information Systems 7 (3) (1989) 205–229.
[22] M. Sugiyama, Local Fisher discriminant analysis for supervised dimensionality reduction, in: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, 2006, pp. 905–912.
[23] P. Vincent, Y. Bengio, K-local hyperplane and convex distance nearest neighbor algorithms, in: Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2002, pp. 985–992.
[24] M. Wang, H. Li, D. Tao, K. Lu, X. Wu, Multimodal graph-based reranking for web image search, IEEE Transactions on Image Processing 21 (2012) 4649–4661.
[25] M. Wu, B. Schölkopf, Transductive classification via local learning regularization, in: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007, pp. 628–635.
[26] J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227.
[27] J. Xu, J. Yang, Mean representation based classifier with its applications, IET Electronics Letters 47 (18) (2011) 1024–1026.
[28] Z.R. Yang, E. Oja, Quadratic nonnegative matrix factorization, Pattern Recognition 45 (4) (2012) 1500–1510.
[29] J. Yang, L. Zhang, J.Y. Yang, D. Zhang, From classifiers to discriminators: a nearest neighbor rule induced discriminant analysis, Pattern Recognition 44 (7) (2011) 1387–1402.
[30] Y. Yang, D. Xu, F.P. Nie, J.B. Luo, Y.T. Zhuang, Ranking with local regression and global alignment for cross media retrieval, ACM Multimedia (2009) 175–184.
[31] Y. Yang, D. Xu, F.P. Nie, S.C. Yan, Y.T. Zhuang, Image clustering using local discriminant models and global integration, IEEE Transactions on Image Processing (TIP) 10 (2010) 2761–2773.
[32] J. Yang, J.Y. Yang, Why can LDA be performed in PCA transformed space?, Pattern Recognition 36 (2) (2003) 563–566.
[33] A. Yang, A. Ganesh, S. Sastry, Y. Ma, Fast l1-minimization algorithms and an application in robust face recognition: a review, ICIP 2010.
[34] W. Zuo, D. Zhang, K. Wang, Bidirectional PCA with assembled matrix distance metric for image recognition, IEEE Transactions on Systems, Man and Cybernetics, Part B 36 (4) (2006) 863–872.
[35] X.F. Zhou, W.H. Jiang, Y. Shi, Y.J. Tian, Credit risk evaluation with kernel-based affine subspace nearest points learning method, Expert Systems with Applications 38 (4) (2011) 4272–4279.
[36] W. Zheng, L. Zhao, C. Zou, Locally nearest neighbour classifiers for pattern classification, Pattern Recognition 37 (6) (2004) 1307–1309.
[37] D. Zhang, Palmprint Authentication, Kluwer Academic, 2004.