Pattern Recognition Letters 30 (2009) 1166–1174
A decision-boundary-oriented feature selection method and its application to face recognition

Jae Hee Park *, Seong Dae Kim, Wook-Joong Kim

Division of Electrical Engineering, School of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), 373-1 Guseong-Dong, Yuseong-Gu, Daejeon 305-701, Republic of Korea

* Corresponding author. Tel.: +82 42 350 5430; fax: +82 42 350 8030. E-mail addresses: [email protected], [email protected] (J.H. Park), [email protected] (S.D. Kim), [email protected] (W.-J. Kim).
Article history: Received 21 April 2008; Received in revised form 2 May 2009; Available online 10 June 2009.
Abstract: A novel feature selection scheme is proposed. We construct a piecewise linear decision boundary and find a feature sub-space suitable for the constructed boundary. Experimental results show that the proposed scheme outperforms conventional algorithms from the viewpoint of recognition accuracy.
© 2009 Elsevier B.V. All rights reserved.
Communicated by W. Zhao

Keywords: Piecewise linear decision boundary; Face recognition; Linear discriminant analysis; Elemental direction preserving discriminant analysis
1. Introduction

The goal in pattern recognition problems is to achieve high recognition performance. In order to achieve high accuracy, every single step of a recognition process, from feature extraction to classification, has to perform its own role faithfully. Within a recognition process, feature selection (also known as feature dimensionality reduction) brings computational efficiency by selecting meaningful components from the original feature vectors. Since recognition performance is highly dependent on the characteristics of the feature vectors, it is important to identify the meaningful components in order to obtain high recognition accuracy.

Linear projection is a popular approach to feature selection. For a given $M$-dimensional feature vector $x$, the feature selection of $x$ into an $N$ ($N < M$)-dimensional feature vector $y$ is defined as

$y = W^T x, \qquad (1)$

where $W = (w_1, w_2, \ldots, w_N)$ is an $M \times N$ ($M > N$) linear projection matrix and each $M$-dimensional column vector $w_i$ is a basis vector of the $N$-dimensional feature sub-space in which the feature vectors lie after feature selection. Such feature selection methods have been widely studied due to their computational effectiveness and extensibility (e.g., Bressan and Vitria, 2003; Li et al., 2005; Chen and Huang, 2003; Xiang and Huang, 2006).

Among feature selection methods, PCA (principal component analysis) (Turk and Pentland, 1991) and LDA (linear discriminant analysis, also known as FLD) (Belhumeur et al., 1997) are the most widely used. While PCA maximizes the distances between samples without considering sample classes, LDA aims to minimize the distances between samples within a class as well as to maximize the distances between samples of different classes. Though LDA requires more computation than PCA, the higher discriminability of LDA has gained more attention in academia and motivated additional studies such as Direct LDA (Chen et al., 2000) and Kernel FDA (Liu et al., 2004). The projection matrix $W$ of LDA maximizes Fisher's criterion (Belhumeur et al., 1997), which is defined as
$J(W) = \frac{\det(W^T S_B W)}{\det(W^T S_W W)}, \qquad (2)$

where $S_B$ is the between-class scatter matrix (3) and $S_W$ is the within-class scatter matrix (4):

$S_B = \sum_{i} n_i (m_i - m)(m_i - m)^T, \qquad (3)$

$S_W = \sum_{i} \sum_{x \in X_i} (x - m_i)(x - m_i)^T, \qquad (4)$

where $m$ is the mean vector of all training patterns, $X_i$ represents the set of patterns from the $i$th class, $m_i$ is the mean vector
of the patterns in $X_i$, and $n_i$ is the number of patterns in $X_i$.

Fukunaga and Mantock (1983) introduced a modification of LDA, non-parametric discriminant analysis (NDA), with an emphasis on local discrimination characteristics (Bressan and Vitria, 2003; Li et al., 2005). NDA computes the between-class scatter matrix while imposing more weight on training patterns which are adjacent to patterns of other classes. Yan et al. (2007) proposed another linear-projection-based feature selection method named marginal Fisher analysis (MFA). MFA also focuses on local discrimination: it minimizes the distances between neighboring intra-class samples while maximizing the distances between neighboring samples from different classes. Chen and Huang (2003) and Xiang and Huang (2006) presented cluster-based linear discriminant analysis (CDA or CLD) with a multimodal assumption on class-conditional sample distributions. CLD computes the between-class scatter matrix with the mean vectors of the clusters obtained from a class-wise clustering process. In addition to these methods, a variety of feature selection methods have also been derived from the basic principle of LDA. These methods commonly intend that all training patterns within a class lie close together and apart from the patterns of other classes. Such clear discrimination of training patterns is obviously very helpful for achieving high recognition performance; however, it is not always obtainable. Especially when sample distributions are complicated, these methods often fail to produce accurate recognition results.

Conventionally, when constructing a recognition system, the feature selection process is designed prior to constructing the classifier, and the final recognition rate of the target recognition system is measured only after the full chain of the system has been constructed. Hence, at the moment of designing feature selection, it is impossible to identify the overall recognition performance exactly. For this reason, feature selection processes are generally designed under alternative criteria such as Fisher's. However, these criteria are sometimes insufficient to represent the final recognition performance of practical problems precisely.

In another type of feature selection, which selects a sub-set of the original features, there are some studies on the utilization of a decision boundary. For instance, Guo and Dyer (2005) concurrently generate a linear decision boundary and select a sub-set of features via linear programming. Even though such approaches are hard to apply directly to the linear projection type, they showed that decision boundaries can be utilized for organizing feature vectors.

In this paper, we propose a novel linear-projection-based feature selection scheme which is derived from a decision boundary. We construct a piecewise linear decision boundary (PLDB) and find a feature sub-space which produces decision results similar to those of the constructed PLDB. In this way, with reduced-dimensional feature vectors, we can minimize the degradation of recognition accuracy compared to other existing methods. Moreover, the proposed strategy is effective even with complicated or hard-to-estimate sample distributions.

This paper is organized as follows. In Section 2, we present an analytical expression of a PLDB and derive the optimal feature selection for a given PLDB. In Section 3, a new feature selection method, elemental direction preserving discriminant analysis (EDPDA), is proposed. EDPDA produces a near-optimal result for the nearest neighbor classifier.
In Section 4, we show experimental results on face recognition problems, and conclude in Section 5.
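For concreteness, a minimal Python sketch of the linear-projection feature selection in (1) combined with Fisher's criterion (2)-(4) is given below. It only illustrates the background LDA method, not the proposed one; the ridge term added to $S_W$ and the function name are assumptions made here to keep the sketch self-contained.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, n_components):
    """Sketch of linear-projection feature selection (1) with Fisher's criterion (2)-(4).

    X: (n_samples, M) array of training patterns; labels: class index per sample.
    Returns a projection matrix W of shape (M, n_components), used as y = W.T @ x.
    """
    M = X.shape[1]
    m = X.mean(axis=0)                      # mean vector of all training patterns
    S_B = np.zeros((M, M))                  # between-class scatter, Eq. (3)
    S_W = np.zeros((M, M))                  # within-class scatter, Eq. (4)
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        d = (mc - m)[:, None]
        S_B += len(Xc) * (d @ d.T)
        S_W += (Xc - mc).T @ (Xc - mc)
    # Maximizing (2) leads to the generalized eigenproblem S_B w = lambda S_W w.
    # The small ridge is an assumption to keep S_W invertible for high-dimensional data.
    evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(M))
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order]                  # columns play the role of w_1, ..., w_N
```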
2. Piecewise linear decision boundary and feature selection

2.1. Piecewise linear decision boundary (PLDB)

It is known that the decision boundary generated by the nearest neighbor classifier (Duda et al., 2001) or the piecewise linear machine (Duda and Fossum, 1966) consists of multiple hyper-plane segments. In this paper, we define a decision boundary composed of multiple hyper-plane segments as a piecewise linear decision boundary (PLDB). A PLDB $B$ is defined as the union of its hyper-plane segments:

$B = \bigcup_{i=1}^{L} H_i, \qquad (5)$

where $L$ is the number of hyper-plane segments on $B$ and $H_i$ is the $i$th hyper-plane segment. Fig. 1 shows a typical form of a PLDB generated by the nearest neighbor classifier or the piecewise linear machine. The PLDB in the figure consists of three hyper-plane segments, $H_1$, $H_2$, and $H_3$, and they define the decision regions $R_1$ and $R_2$ of the two classes.

Fig. 1. An example of a PLDB and corresponding decision regions.

A hyper-plane segment $H_i$ composing a PLDB in an $M$-dimensional feature space is a set of points $x \in \mathbb{R}^M$ defined as

$H_i = \{x \mid a_i^T x + b_i = 0\} \cap \bigcap_{j \ne i} \{x \mid \alpha_{ij}(a_j^T x + b_j) \ge 0\}, \qquad (6)$

where $a_i^T x + b_i = 0$ is the hyper-plane which includes the $i$th hyper-plane segment $H_i$, and $a_j^T x + b_j = 0$ ($j \ne i$) is a hyper-plane including $H_j$ which determines a part of the boundary of $H_i$. $\alpha_{ij}$ is a pre-determined value of 1, -1, or 0 according to the relative position of the points on $H_i$ with respect to the hyper-plane $a_j^T x + b_j = 0$ ($\alpha_{ij} = 1$ if all points on $H_i$ satisfy $a_j^T x + b_j \ge 0$, $\alpha_{ij} = -1$ if all points on $H_i$ satisfy $a_j^T x + b_j \le 0$, and $\alpha_{ij} = 0$ otherwise). Fig. 2 shows an example of a hyper-plane segment on a PLDB. As shown in the figure, a hyper-plane segment on a PLDB is contained in one hyper-plane, and its boundary is defined by the other hyper-planes.

Fig. 2. A hyper-plane segment on a PLDB.

Eq. (6) implies that we can determine whether a point $x$ is on a hyper-plane segment by observing the output of each linear function $a_i^T x + b_i$ ($1 \le i \le L$). Now, we define a function $g$ as
$g(a_1^T x + b_1, a_2^T x + b_2, \ldots, a_L^T x + b_L) \begin{cases} = 0 & \text{if } x \text{ is on } B \\ \ne 0 & \text{otherwise,} \end{cases} \qquad (7)$

where $a_i^T x + b_i$ ($1 \le i \le L$) is the linear function corresponding to each hyper-plane segment on $B$. Using (7), we can rewrite the definition (5) of $B$ as

$B = \{x \mid g(a_1^T x + b_1, a_2^T x + b_2, \ldots, a_L^T x + b_L) = 0\}. \qquad (8)$
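To make the set-based definitions (5)-(8) concrete, the following sketch (not from the original paper) represents a PLDB by its linear functions and tests whether $g(\cdot) = 0$, i.e., whether $x$ lies on some hyper-plane segment according to (6). The data layout, the $\alpha_{ij}$ table, and the numerical tolerance are illustrative assumptions.

```python
import numpy as np

def is_on_pldb(x, planes, alpha, tol=1e-9):
    """Test whether x lies on the PLDB B, i.e. whether g(...) = 0 in the sense of (7)-(8).

    planes: list of (a_i, b_i) pairs, a_i a 1-D numpy array (hyper-plane a_i^T x + b_i = 0).
    alpha:  L x L array with alpha[i][j] in {1, -1, 0} as in Eq. (6).
    """
    vals = np.array([a @ x + b for a, b in planes])   # the L linear function values
    L = len(planes)
    for i in range(L):
        if abs(vals[i]) > tol:                        # x must lie on the i-th hyper-plane
            continue
        # ... and on the correct side of every bounding hyper-plane, Eq. (6)
        if all(alpha[i][j] * vals[j] >= -tol
               for j in range(L) if j != i and alpha[i][j] != 0):
            return True
    return False
```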
Similarly, for a PLDB $B$, we define the decision rule $\xi(x)$ corresponding to $B$ as

$\xi(x) = f(a_1^T x + b_1, a_2^T x + b_2, \ldots, a_L^T x + b_L), \qquad (9)$
where $f$ is a decision rule whose domain consists of the linear functions corresponding to the hyper-planes. (For further details regarding the representation of $\xi(x)$ by $f$, refer to Appendix A.1.) Eqs. (8) and (9) imply that both the description of $B$ and the decision rule corresponding to $B$ can be represented using the linear functions $a_i^T x + b_i$ ($1 \le i \le L$) of the hyper-planes of $B$.

2.2. Optimal feature selection for a PLDB

Feature selection causes information loss by reducing the feature vector dimension. When a feature selection process minimizes the loss of useful information, it can be considered optimal. In this subsection, we present an optimal feature selection method for a given PLDB. For this purpose, we start by defining the optimal feature selection for a decision rule. After feature selection, decision rules defined on the original feature space are usually irreproducible, because the decision rule's information is also lost through the dimensionality reduction. Therefore, if we can construct a decision rule equivalent to the given decision rule after feature selection, such a feature selection can be considered optimal for the given decision rule.

Definition (Optimal feature selection for a decision rule). A feature selection $\rho: \mathbb{R}^M \to \mathbb{R}^N$ ($M > N$) is optimal for a decision rule $\xi: \mathbb{R}^M \to U$ ($U$: the set of possible decisions) if there exists another decision rule $\xi': \mathbb{R}^N \to U$ such that $\xi'(\rho(x)) = \xi(x)$ for any arbitrary $x \in \mathbb{R}^M$.

Suppose that a PLDB $B$ and the decision rule $\xi(x)$ corresponding to $B$ are given, and that the normal vectors of the hyper-planes composing $B$ are denoted by $a_i$ ($1 \le i \le L$). For the sake of compact explanation, we also define the dimension of a PLDB.

Definition (Dimension of a PLDB). The dimension of a PLDB $B$, denoted $\dim(B)$, is defined as $\dim(B) = \dim(\mathrm{span}\{a_1, \ldots, a_L\})$.

Using these definitions, we derive the optimal feature selection for $B$ based on the theorem below.

Theorem 1. For a linear projection $y = p(x) = W^T x$, if $\mathrm{span}\{a_1, \ldots, a_L\} \subseteq \mathrm{range}(W)$, there exists a decision rule $\xi'(y)$ such that $(\xi' \circ p)(x) = \xi(x)$ for any arbitrary $x$.

Proof. Suppose that $\xi(x)$ is described as $\xi(x) = f(a_1^T x + b_1, \ldots, a_L^T x + b_L)$. If $\mathrm{span}\{a_1, \ldots, a_L\} \subseteq \mathrm{range}(W)$, we can find $k_1, \ldots, k_L$ such that $W k_i = a_i$ for $1 \le i \le L$. Let $\xi'(y)$ be a decision rule defined as

$\xi'(y) = f(k_1^T y + b_1, k_2^T y + b_2, \ldots, k_L^T y + b_L). \qquad (10)$

For any arbitrary $x$, $(\xi' \circ p)(x) = f(k_1^T(W^T x) + b_1, \ldots, k_L^T(W^T x) + b_L)$ is equal to $\xi(x)$. Therefore, we can always find $\xi'(y)$ such that $(\xi' \circ p)(x) = \xi(x)$ for any arbitrary $x$ if $\mathrm{span}\{a_1, \ldots, a_L\} \subseteq \mathrm{range}(W)$. □

A decision boundary is determined by the corresponding decision rule. This implies that if a feature selection is optimal for a decision rule, it is also optimal for the corresponding decision boundary. As described in Theorem 1, if the range of $W$ includes the span of the $a_i$ ($1 \le i \le L$), the linear projection $p(x) = W^T x$ is an optimal feature selection for the decision rule $\xi(x)$, and consequently it is optimal for the PLDB $B$.

Here is an example of a PLDB and its optimal feature selection. Suppose that a PLDB $B$ consisting of three hyper-plane segments, $H_1 \subset \{x \mid (1, 0, 1)x + 2 = 0\}$, $H_2 \subset \{x \mid (2, 1, 1)x - 1 = 0\}$, and $H_3 \subset \{x \mid (0, 1, -1)x + 1 = 0\}$, and the corresponding decision rule $\xi(x \in \mathbb{R}^3) = f((1, 0, 1)x + 2,\ (2, 1, 1)x - 1,\ (0, 1, -1)x + 1)$ are given. The normal vectors of the segments, $(1, 0, 1)^T$, $(2, 1, 1)^T$, and $(0, 1, -1)^T$, reside in the range of the matrix $W = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$, and each normal vector is equal to $W(1, 0)^T$, $W(1, 1)^T$, and $W(-1, 1)^T$, respectively. In this case, the decision rule $\xi'(y \in \mathbb{R}^2) = f((1, 0)y + 2,\ (1, 1)y - 1,\ (-1, 1)y + 1)$ satisfies $\xi'(y = W^T x) = \xi(x)$ for any arbitrary $x$. Since we can find a decision rule $\xi'$ such that $\xi'(W^T x) = \xi(x)$, the linear projection $y = W^T x$ is an optimal feature selection for $\xi(x)$ and $B$. Through the feature selection $y = W^T x$, we can reduce the dimension of the feature vectors without losing any information about the PLDB $B$ and its corresponding decision rule $\xi(x)$.

Now, we present two more properties regarding the existence condition (Corollary 1) and a special case (Corollary 2) of optimal feature selection.

Corollary 1. For a linear projection $y = p(x) = W^T x$, if there exists a decision rule $\xi'(y)$ such that $(\xi' \circ p)(x) = \xi(x)$ for any $x$, then $\mathrm{rank}(W)$ is equal to or larger than $\dim(B)$.

Corollary 2. If $p(x)$ is an orthonormal projection onto $\mathrm{span}\{a_1, \ldots, a_L\}$, there exists a decision rule $\xi'(y)$ such that $(\xi' \circ p)(x) = \xi(x)$ for any $x$.

Corollary 1 implies that a feature selection $p(x) = W^T x$ cannot be optimal for $B$ if the dimension reduced to by $p(x)$ is smaller than $\dim(B)$. Corollary 2, which is a specific case of Theorem 1, implies that $p(x)$ is an optimal feature selection for $B$ if $p(x)$ is an orthonormal projection onto $\mathrm{span}\{a_1, \ldots, a_L\}$.

2.3. Dimensionality reduction of a PLDB

In the previous subsection, we introduced the definition of optimal feature selection for a PLDB. By definition, an optimal feature selection causes no degradation of recognition performance. However, when the dimension of the feature space is reduced from $M$ to $N$, an optimal feature selection for a PLDB $B$ is unobtainable if $\dim(B) > N$, as shown in Corollary 1. In this subsection, we describe a near-optimal feature selection method which minimizes the degradation of recognition performance.

Our approach to obtaining a near-optimal feature selection consists of two steps. First, a PLDB $B$ whose dimension is higher than $N$ is transformed into a new $N$-dimensional PLDB $\hat{B}$ which produces decision results similar to those of $B$. Second, an optimal feature selection method for the transformed PLDB $\hat{B}$ is obtained.

In this paper, the original PLDB $B$ is transformed into $\hat{B}$ by modifying each hyper-plane of $B$ as shown in Fig. 3, i.e., $B = \{x \mid g(a_1^T x + b_1, \ldots, a_L^T x + b_L) = 0\}$ is transformed into an $N$-dimensional PLDB $\hat{B} = \{x \mid g(\hat{a}_1^T x + \hat{b}_1, \ldots, \hat{a}_L^T x + \hat{b}_L) = 0\}$. Note that the function $g$ is maintained for both $B$ and $\hat{B}$.

Since $\dim(\hat{B})$ is equal to $N$, there exists an orthonormal matrix $W \in \mathbb{R}^{M \times N}$ such that $\hat{a}_i \in \mathrm{range}(W)$ for $1 \le i \le L$. This implies that there are $N$-dimensional vectors $k_1, \ldots, k_L$ such that $\hat{a}_i = W k_i$ for $1 \le i \le L$. Hence, $\hat{B}$ can be described as

$\hat{B} = \{x \mid g(k_1^T W^T x + \hat{b}_1, \ldots, k_L^T W^T x + \hat{b}_L) = 0\}. \qquad (11)$
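As a numerical sanity check of Theorem 1, the sketch below reproduces the three-segment example above (with the signs reconstructed as in the text) and verifies that every linear function, and hence any decision rule $f$ built on them, is unchanged after the projection $y = W^T x$. This is an illustration only, not part of the original paper.

```python
import numpy as np

# Hyper-planes of the example PLDB in R^3: a_i^T x + b_i = 0 (rows of A are a_1, a_2, a_3).
A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 1.0],
              [0.0, 1.0, -1.0]])
b = np.array([2.0, -1.0, 1.0])

# Projection matrix W (3 x 2) and coefficient vectors k_i with a_i = W k_i (rows of K).
W = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
K = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [-1.0, 1.0]])

assert np.allclose(A, K @ W.T)             # span{a_1, a_2, a_3} lies in range(W)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # arbitrary points x in R^3
Y = X @ W                                  # y = W^T x for each point

original = X @ A.T + b                     # a_i^T x + b_i
reduced = Y @ K.T + b                      # k_i^T y + b_i
assert np.allclose(original, reduced)      # same arguments to f, hence xi'(W^T x) = xi(x)
```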
Fig. 3. PLDB transformation.

When $B$ is transformed into $\hat{B}$, each hyper-plane $a_i^T x + b_i = 0$ of $B$ is replaced by $k_i^T W^T x + \hat{b}_i = 0$. Such a replacement changes the decisions for samples which reside between $a_i^T x + b_i = 0$ and $k_i^T W^T x + \hat{b}_i = 0$, as shaded in Fig. 4. This region is named the "decision changed region" in this paper. Assuming that the probability density of the sample distribution near $a_i^T x + b_i = 0$ is uniform, the occurrence of decision changes is minimized by making the area of the decision changed region as small as possible.

Fig. 4. Decision change caused by hyper-plane replacement.

In order to minimize the area of the decision changed region, we have to find an appropriate normal vector $W k_i$ and a proper value of $\hat{b}_i$ for $1 \le i \le L$, i.e., we have to find the three variables $W$, $k_i$, and $\hat{b}_i$. For any arbitrary $\hat{b}_i$, the area of the decision changed region is minimized by making the angle between the two normal vectors $W k_i$ and $a_i$ as small as possible. In order to minimize this angle, the vector $k_i$ must be $W^T a_i$. The angle between $W k_i$ and $a_i$, when $k_i = W^T a_i$, can be considered proportional to the norm of the vector difference $a_i - W W^T a_i$ if the angle is close to zero. Therefore, the orthonormal matrix $W$ which minimizes the angles between the normal vectors, and consequently minimizes the area of the decision changed region, is obtained by

$\hat{W} = \arg\min_{W} \sum_{i=1}^{L} \omega(a_i) \left\| \frac{a_i}{\|a_i\|} - W W^T \frac{a_i}{\|a_i\|} \right\|^2, \qquad (12)$

where the positive value $\omega(a_i)$ is a weight for $a_i$. In this paper, to find a proper $\omega(a_i)$, we assume that it depends only on the number of samples adjacent to the hyper-plane segment on $a_i^T x + b_i = 0$. The detailed description of how $\omega(a_i)$ is set is presented in the next section.

We have thus described the transformation of a PLDB $B$ into a new $N$-dimensional PLDB $\hat{B} = \{x \mid g(a_1^T \hat{W}\hat{W}^T x + \hat{b}_1, \ldots, a_L^T \hat{W}\hat{W}^T x + \hat{b}_L) = 0\}$ which minimizes the occurrence of decision changes. Meanwhile, the constants $\hat{b}_1, \ldots, \hat{b}_L$, which were left unspecified, are not required to be known because they make no contribution to the feature selection method.

The linear projection $y = \hat{W}^T x$ is an optimal feature selection method for $\hat{B}$ since it is the orthonormal projection onto $\mathrm{span}\{\hat{W}\hat{W}^T a_1, \ldots, \hat{W}\hat{W}^T a_L\}$. Therefore, when $\sum_{i=1}^{L} \omega(a_i) \| a_i/\|a_i\| - \hat{W}\hat{W}^T a_i/\|a_i\| \|^2$ is small enough, the feature selection $y = \hat{W}^T x$ produces near-optimal results for the given PLDB $B$.

3. Elemental direction preserving discriminant analysis

In order to apply the feature selection method introduced in the previous section, the normal vectors of an accurate PLDB are required. In this section, we obtain direction vectors whose statistical characteristics are similar to those of the normal vectors of an accurate PLDB. Based on these direction vectors, we propose a new feature selection method named elemental direction preserving discriminant analysis (EDPDA).
3.1. Elemental direction and (basic) EDPDA

The decision boundary of the nearest neighbor (NN) classifier (Duda et al., 2001) consists of hyper-plane segments, i.e., it is a PLDB. Since the NN classifier separates its own prototypes perfectly, a PLDB separating all training samples perfectly can be generated by the NN classifier by assigning all training samples as its prototypes. The normal vector of a hyper-plane segment on a PLDB generated by the NN classifier is the difference vector between two adjacent training samples from different classes. In order to obtain the normal vectors, training samples which are adjacent to a training sample of a different class are found. For two training samples $x_j$ and $x_k$ from different classes, $x_j$ and $x_k$ are determined to be "adjacent" if the distance between them is smaller than a proper value $\varepsilon > 0$:

$\|x_j - x_k\| < \varepsilon. \qquad (13)$

After acquiring all adjacent training samples, the difference vector of each adjacent sample pair is obtained. When there are $L'$ adjacent sample pairs, the normalized difference vector corresponding to the $i$th pair $x_j$ and $x_k$ is denoted $e_i$. In this paper, $e_i$ is named an "elemental direction" and is computed as

$e_i = \frac{x_j - x_k}{\|x_j - x_k\|} \quad (1 \le i \le L'). \qquad (14)$

Fig. 5 shows an example of the PLDB generated by the NN classifier and the elemental directions obtained from the training samples. As shown in the figure, we can observe the following:
Fig. 5. PLDB of NN classifier (a) and elemental directions (b).
1. With an appropriate $\varepsilon$, each $a_i/\|a_i\|$ is one of the elemental directions, where $a_i$ is the $i$th normal vector of the PLDB generated by the NN classifier.
2. Some of the elemental directions are identical to the normal vectors of the PLDB. The others, though not identical, are similar in direction to the normal vectors of the neighboring hyper-plane segments.
3. As marked with a circle in Fig. 5, we obtain more elemental directions from regions around hyper-planes where the density of training samples is higher.
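A brute-force sketch of how the elemental directions of (13) and (14) can be collected from a labeled training set is shown below (the cluster-based variant of Section 3.2 avoids scanning all sample pairs). The function name and data layout are assumptions made for illustration.

```python
import numpy as np

def elemental_directions(X, labels, eps):
    """Collect elemental directions, Eqs. (13)-(14), by scanning all sample pairs.

    X: (n_samples, M) array of training patterns; labels: class index per sample.
    Returns an array of unit difference vectors between adjacent pairs of
    samples that belong to different classes.
    """
    dirs = []
    n = len(X)
    for j in range(n):
        for k in range(j + 1, n):
            if labels[j] == labels[k]:
                continue
            diff = X[j] - X[k]
            dist = np.linalg.norm(diff)
            if 0.0 < dist < eps:            # adjacency test, Eq. (13)
                dirs.append(diff / dist)    # elemental direction, Eq. (14)
    return np.asarray(dirs)
```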
Based on these observations, we propose a novel feature selection method: elemental direction preserving discriminant analysis (EDPDA). In this method, the projection matrix $\hat{W}$ minimizing the reconstruction error $\|e_i - WW^T e_i\|$ of each elemental direction is obtained as

$\hat{W} = \arg\min_{W} \sum_{i=1}^{L'} \|e_i - WW^T e_i\|^2. \qquad (15)$

Let $E$ be the set of elemental directions $\{e_i \mid 1 \le i \le L'\}$ and $A$ be $\{a_i/\|a_i\| \mid 1 \le i \le L\}$, where $a_i$ ($1 \le i \le L$) are the normal vectors of the PLDB generated by the NN classifier. Since each $a_i/\|a_i\|$ is an elemental direction, as mentioned in property 1, $A$ is a subset of $E$. Then, (15) can be described as

$\hat{W} = \arg\min_{W} \left( \sum_{i=1}^{L} \left\| \frac{a_i}{\|a_i\|} - WW^T \frac{a_i}{\|a_i\|} \right\|^2 + \sum_{e_i \in E \setminus A} \left\| e_i - WW^T e_i \right\|^2 \right). \qquad (16)$

Since each $e_i$ belonging to $E \setminus A$ is similar to the normal vector of a neighboring hyper-plane segment, $e_i \in E \setminus A$ can be replaced by one of the $a_j/\|a_j\| \in A$. Then, (16) can be replaced with

$\hat{W} = \arg\min_{W} \sum_{i=1}^{L} n(a_i) \left\| \frac{a_i}{\|a_i\|} - WW^T \frac{a_i}{\|a_i\|} \right\|^2, \qquad (17)$

where $n(a_i)$ is the number of sample pairs generating elemental directions adjacent to the corresponding hyper-plane segment. Note that (17) is equal to (12) when $\omega(a_i) = n(a_i)$. Consequently, we can obtain a feature sub-space which produces decision results similar to those of the NN classifier by minimizing the reconstruction errors of the elemental directions. When the sum of the reconstruction errors of the elemental directions is small enough, EDPDA produces near-optimal results for the NN classifier.

Although the derivation of EDPDA is distinctive, EDPDA has characteristics similar to NDA and MFA. NDA, MFA, and EDPDA all focus on local discrimination and intend to maximize the distances between neighboring samples from different classes. However, while NDA and MFA emphasize sample pairs with a relatively long distance between them, since they minimize the magnitude reduction of the difference vectors between neighboring samples, EDPDA normalizes the magnitudes and utilizes only the directions of the difference vectors. Taking only the directions helps avoid being dominated by a small number of sample pairs with relatively long distances.

3.2. EDPDA: a cluster-based approach

EDPDA produces accurate recognition results by finding a feature sub-space suitable for the NN classifier; however, it requires a high computational cost when there are a large number of training samples. In this paper, a cluster-based approach is proposed to alleviate this computational burden. In this approach, after class-wise clustering, the mean vector of each cluster is treated as a weighted sample. The weight for a mean vector is set to be proportional to the population of its corresponding cluster.

First, class-wise clustering is performed and the sample distribution of each cluster is approximated by a high-dimensional hyper-sphere. The $j$th cluster of the $i$th class $S_{ij}$, its mean vector $m_{ij}$, and the radius of the approximating hyper-sphere $r_{ij}$ are defined, respectively, as (18)-(20):

$S_{ij} = \{x_k \mid x_k \in X_i,\ \|x_k - m_{ij}\| < \|x_k - m_{il}\| \text{ for all } l \ne j\}, \qquad (18)$

$m_{ij} = \frac{1}{n(S_{ij})} \sum_{x_k \in S_{ij}} x_k, \qquad (19)$

$r_{ij} = \max_{x_k \in S_{ij}} \|x_k - m_{ij}\|, \qquad (20)$
where $n(S_{ij})$ is the number of samples in $S_{ij}$, and each cluster is constructed by the K-means algorithm (Linde et al., 1980). In this paper, the number of clusters is determined manually.

Then, clusters which are adjacent to a cluster of a different class are found. If the distance between the mean vectors of two clusters from different classes satisfies

$\|m_{ij} - m_{lk}\| < \gamma (r_{ij} + r_{lk}) + \varepsilon, \qquad (21)$

where $\gamma$ and $\varepsilon$ are constants which define the adjacency of two clusters, the two clusters are considered "adjacent." If there are a large number of training samples and the class-conditional probability density functions overlap significantly, it is better to assign relatively small values to $\gamma$ and $\varepsilon$. In many applications we dealt with, a value of $\gamma$ in the range of 1.5-3.0 showed the best performance.

Suppose that there are $\hat{L}$ adjacent cluster pairs and that the $p$th cluster pair consists of two adjacent clusters $S_{ij}$ and $S_{lk}$. The $p$th elemental direction (i.e., the normalized difference vector between $S_{ij}$ and $S_{lk}$) is computed as

$\hat{e}_p = \frac{m_{ij} - m_{lk}}{\|m_{ij} - m_{lk}\|} \quad (1 \le p \le \hat{L}). \qquad (22)$

Then, we replace (15) with

$\hat{W} = \arg\min_{W} \sum_{p=1}^{\hat{L}} n(\hat{e}_p) \|\hat{e}_p - WW^T \hat{e}_p\|^2, \qquad (23)$

where $n(\hat{e}_p)$ is the number of sample pairs which contribute to the generation of $\hat{e}_p$. If an elemental direction $\hat{e}_p$ is generated by two clusters $S_1$ and $S_2$, $n(\hat{e}_p)$ has the value $n(S_1) n(S_2)$, where $n(S_k)$ is the number of samples in $S_k$.

The matrix $\hat{W}$ in (23) can be obtained as

$\hat{W} = \arg\max_{W} W^T R_{ED} W, \qquad (24)$

where $R_{ED}$ is the weighted correlation matrix of the elemental directions, defined as

$R_{ED} = \sum_{p=1}^{\hat{L}} n(\hat{e}_p) \hat{e}_p \hat{e}_p^T. \qquad (25)$

Similarly to the KL transform (Jain, 1989), each column vector $\hat{w}_i$ of $\hat{W} = (\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_N)$ is the eigen-vector of the correlation matrix $R_{ED}$ corresponding to one of the $N$ largest eigen-values. The process of cluster-based EDPDA described above is summarized in Table 1. Not only does the cluster-based EDPDA retain the strengths of the basic EDPDA, it also brings higher computational efficiency and more robustness to abnormal training samples than the basic EDPDA.
Table 1
Process of cluster-based EDPDA.

Step 1: Divide each class into multiple clusters.
Step 2: Find the adjacent cluster pairs.
Step 3: Extract elemental directions $\hat{e}_p$ (22) from the adjacent cluster pairs.
Step 4: Form the weighted correlation matrix $R_{ED}$ (25) of the elemental directions.
Step 5: Compose $\hat{W}$ from the eigen-vectors of $R_{ED}$ corresponding to the $N$ largest eigen-values.
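A compact sketch of the Table 1 procedure might look as follows; the use of scikit-learn's KMeans, the default values of $\gamma$ and $\varepsilon$, and the parameter names are assumptions made here and are not prescribed by the paper.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def cluster_based_edpda(X, labels, n_components, n_clusters=5, gamma=2.0, eps=0.0):
    """Sketch of the cluster-based EDPDA procedure of Table 1 (Eqs. (18)-(25))."""
    clusters = []                                    # (class, mean, radius, size) per cluster
    for c in np.unique(labels):                      # Step 1: class-wise clustering
        Xc = X[labels == c]
        km = KMeans(n_clusters=min(n_clusters, len(Xc)), n_init=10).fit(Xc)
        for j in range(km.n_clusters):
            Sij = Xc[km.labels_ == j]
            m_ij = Sij.mean(axis=0)                              # Eq. (19)
            r_ij = np.max(np.linalg.norm(Sij - m_ij, axis=1))    # Eq. (20)
            clusters.append((c, m_ij, r_ij, len(Sij)))

    M = X.shape[1]
    R_ed = np.zeros((M, M))                          # weighted correlation matrix, Eq. (25)
    for (ci, mi, ri, ni), (cj, mj, rj, nj) in combinations(clusters, 2):
        if ci == cj:
            continue
        d = np.linalg.norm(mi - mj)
        if 0.0 < d < gamma * (ri + rj) + eps:        # Step 2: cluster adjacency, Eq. (21)
            e = (mi - mj) / d                        # Step 3: elemental direction, Eq. (22)
            R_ed += (ni * nj) * np.outer(e, e)       # Step 4: weight n(e_p) = n(S1) n(S2)

    evals, evecs = np.linalg.eigh(R_ed)              # Step 5: eigen-vectors of R_ED
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order]                           # columns of W_hat; use y = W_hat.T @ x
```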
4. Experimental results

4.1. Experimental environments

The proposed method is not limited to a specific application since it is developed without any assumption about the characteristics of the feature vectors. In order to demonstrate the advantages of the proposed method reasonably, it is necessary to evaluate its recognition accuracy on several frequently addressed applications. In this paper, the proposed method and other feature selection methods were compared on several face recognition problems. In the experiments, the intensity values of each face image, after size normalization and histogram equalization, were used as the original feature vector. We compared the performance of the proposed method to four conventional feature selection methods: PCA (Turk and Pentland, 1991), (generalized) NDA (Bressan and Vitria, 2003), RCLD (Xiang and Huang, 2006), and RMLD (Xiang and Huang, 2006). Well-known and publicly accessible face image databases were used to obtain reproducible experimental results. In measuring the recognition performance, we used the k-NN (k-nearest neighbor) classifier (Duda et al., 2001) and the support vector machine (SVM) (Vapnik, 1998). Through this procedure, the performance of each method was evaluated by comparing the error rates versus the feature dimension.

4.2. Face identity recognition

Face (identity) recognition is one of the most widely researched problems in the field of pattern recognition. Due to the non-rigid appearance of human faces, many statistical feature selection methods have been evaluated on face recognition problems over the last several decades. Besides the statistical approaches, other methods which are restricted to face recognition problems, such as Shin et al. (2007), have also been proposed. However, in this paper, only linear-projection-based feature selection methods using intensity features are compared, because the experimental goal is to observe the usefulness of each feature selection method for achieving accurate recognition results rather than to achieve the best recognition results on specific applications.

4.2.1. Experiment on AR DB

The AR face database (Martinez and Benavente, 1998) contains a large number of images with various lighting conditions, facial expressions, and occlusions. In the experiment, among the 133 individuals' face sets in the AR DB, we selected 20 face sets randomly. Each set consists of 26 face images of one person. The images were aligned based on eye positions and scaled to 46 × 56 pixels at the initial step. Fig. 6 shows 12 images of the same person. In the experiment, error rates were determined by the "leave-one-out" strategy (Belhumeur et al., 1997; Xiang and Huang, 2006). In order to compute the projection matrices of RCLD and EDPDA, each class was divided into five clusters.

Table 2 shows the lowest recognition error rates obtained with the k-NN classifier. EDPDA achieved a perfect recognition result.
Fig. 6. Face images in AR face DB.
Table 2
Lowest error rates on identity recognition using AR DB.

Method    Lowest error rate (%)    Feature dimension
PCA       2.88                     30
NDA       2.69                     30
RCLD      2.73                     30
RMLD      43.92                    24
EDPDA     0                        22
The error rates versus feature dimension are shown in Fig. 7. As shown in the figure, EDPDA achieved a noticeable improvement over the other methods when the dimension is higher than 3. To achieve an error rate below 2%, the required feature dimension of EDPDA is only half that of the other methods. When the dimension is 11, the error rate of EDPDA is just 16% of the others'. As mentioned previously, EDPDA is derived with consideration of the decision boundary required for proper classification. This unique feature contributes to improving the recognition accuracy and, consequently, leads to outperforming the other methods.

4.2.2. Experiment on Yale face database B

Yale face database B (Georghiades et al., 2000) includes a number of face images under variable lighting and pose conditions. In this paper, we used 268 frontal face images of 10 people under various lighting conditions. Fig. 8 shows some of the images used in our experiment. We selected all frontal face images with light-source directions of 0°, ±10°, ±20°, ±50°, ±60°, ±70°, and ±110° in azimuth and 0°, ±10°, ±20°, and ±40° in elevation. Images in which the face silhouette cannot be recognized by human eyes were excluded from both the training and test steps. In computing RCLD and EDPDA, the images of each person were divided into five clusters.

The error rate versus feature dimension curves generated by the k-NN classifier are shown in Fig. 9. Considering only the lowest error rate, the accuracy of EDPDA was similar to or slightly better than the others. However, the error rate of EDPDA converged much faster than that of the others. This implies that the elemental directions are helpful for identifying meaningful sub-spaces and that they can be described properly with a relatively small number of bases. RMLD was the most accurate when the feature dimension was less than 5; however, after some iterative projections in RMLD, it became the worst. Table 3 shows the error rates around a feature vector dimension of 10. The error rates of EDPDA are 45%, 19%, and 15% of those of the others when the feature vector dimension is 8, 11, and 13, respectively.
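For reference, a rough sketch of the evaluation protocol of Section 4.1 (error rate versus feature dimension with a leave-one-out k-NN test) is given below. It is a simplified illustration: the projection matrix is learned once on all samples here, whereas the exact protocol of the paper may recompute it per fold; the scikit-learn classes and function names are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut

def error_rate_vs_dimension(X, labels, W_full, dims, k=1):
    """Leave-one-out k-NN error rate for several feature dimensions.

    X: (n_samples, M) array; labels: numpy array of class indices.
    W_full: projection matrix whose leading columns are used for each tested dimension.
    """
    rates = {}
    for n in dims:
        Y = X @ W_full[:, :n]                        # y = W^T x for every sample
        errors = 0
        for train_idx, test_idx in LeaveOneOut().split(Y):
            clf = KNeighborsClassifier(n_neighbors=k)
            clf.fit(Y[train_idx], labels[train_idx])
            errors += int(clf.predict(Y[test_idx])[0] != labels[test_idx][0])
        rates[n] = errors / len(Y)
    return rates
```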
Fig. 7. Comparative identity recognition results on AR DB.
Fig. 8. Face images in Yale-B DB.
Fig. 9. Comparative identity recognition results on Yale-B DB.
Table 3
Performance comparison on identity recognition using Yale-B DB (error rates, %).

Feature dimension    PCA      NDA      RCLD     RMLD     EDPDA
8                    29.48    45.90    38.43    27.99    12.69
11                   17.91    34.33    22.76    22.76    3.36
13                   12.69    24.25    19.03    23.13    1.87

4.3. Gender recognition

Gender recognition is the process of deciding the gender of people from their face images. In gender recognition, due to the variety of human face appearances, it is hard to describe the sample distribution of each class (male or female) with parametric models. In this paper, we organized an experiment with the frontal face images of 106 males and 88 females from a sub-set of the FERET database (Phillips et al., 2000) called FERET-B (Yang et al., 2005). Fig. 10 shows 10 male and 10 female face images out of the total used in our experiment. No extra images from the same person were used, and each class was divided into 20 clusters for RCLD and EDPDA.

Fig. 11 shows the curves of error rate versus feature dimension generated by a support vector machine with a radial basis kernel. Among the compared methods, EDPDA showed the most accurate performance. Although the differences in error rate between EDPDA and NDA are not significant, EDPDA was not worse than the other methods at almost every dimension, and it achieved low error rates at lower feature vector dimensions.

5. Conclusions
In this paper, we proposed a feature selection scheme which produces more accurate recognition results. We introduced a novel strategy for designing feature selection methods: construct a PLDB on the original feature space and find a feature sub-space which produces decision results similar to those of the constructed PLDB. Based on this strategy, we introduced a new feature selection method, EDPDA, which produces near-optimal results for the nearest neighbor classifier.

Unlike other feature selection methods based on distances between samples, the proposed method focuses on the constructability of accurate decision boundaries in order to produce accurate recognition results. Furthermore, since it is less influenced by the characteristics of sample distributions, the proposed method can produce more reliable results than conventional methods even in difficult situations, such as when the sample distribution of each class is significantly non-normal, when the class-conditional density functions are hard to estimate, or when a precise decision boundary is needed for proper classification.

In the experiments with well-known face databases, EDPDA showed higher recognition accuracy with a smaller number of features than the other linear-projection-based methods. In particular, in the identity recognition experiments with the AR DB and the Yale-B DB, EDPDA showed impressive recognition performance. Since recognition accuracy highly depends on the choice of classifier and the training strategy, EDPDA cannot guarantee reliable recognition results by itself. However, the closer the performance of the classifier is to optimal, the more clearly the superiority of EDPDA over the other methods stands out.
Fig. 10. Face images in FERET-B.
Fig. 11. Comparative gender recognition results.
Appendix A

A.1. The decision rule corresponding to a PLDB

Suppose that a PLDB consisting of L hyper-plane segments $H_i \subset \{x \mid a_i^T x + b_i = 0\}$ ($1 \le i \le L$) is given. Then, the feature space is divided into at most $2^L$ convex regions defined as

$R_\beta = \{x \mid \beta_i (a_i^T x + b_i) > 0,\ 1 \le i \le L\}, \qquad (26)$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_L)^T$ is an L-dimensional vector and each element $\beta_i$ has the value 1 or -1. If $\beta_i$ equals 1, all points in $R_\beta$ satisfy $a_i^T x + b_i > 0$; otherwise they satisfy $a_i^T x + b_i < 0$. Fig. 12 shows an example of the regions $R_\beta$ when the PLDB consists of three hyper-plane segments.

Fig. 12. Convex regions defined by linear functions.

As shown in the figure, each region $R_\beta$ is included in the decision region of one class, and thus all points $x \in R_\beta$ receive the same decision. Therefore, after finding the decision for each $R_\beta$, the decision for an input vector $x$ can be obtained by determining which $R_\beta$ the input $x$ belongs to. The region $R_\beta$ to which the input $x$ belongs can be found by observing each value $a_i^T x + b_i$ ($1 \le i \le L$). For example, when $L = 3$, if the input $x$ satisfies $a_1^T x + b_1 > 0$, $a_2^T x + b_2 < 0$, and $a_3^T x + b_3 > 0$, then $x$ belongs to $R_{(1, -1, 1)^T}$. Therefore, the decision rule $\xi(x)$ corresponding to $B$ can be described by a function $f$ whose domain consists of the linear functions corresponding to the hyper-plane segments, as in (9).

References
Belhumeur, P., Hespanha, J., Kriegman, D., 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.
Bressan, M., Vitria, J., 2003. Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recognition Lett. 24, 2743–2749.
Chen, X., Huang, T., 2003. Facial expression recognition: A clustering-based approach. Pattern Recognition Lett. 24, 1295–1302.
Chen, L.-F., Liao, H.-Y.M., Ko, M.-T., Lin, J.-C., Yu, G.-J., 2000. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33, 1713–1726.
Duda, R., Fossum, H., 1966. Pattern classification by iteratively determined linear and piecewise linear discriminant functions. IEEE Trans. Electronic Computers EC-15 (2), 220–232.
Duda, R., Hart, P., Stork, D., 2001. Pattern Classification, second ed. John Wiley and Sons, New York.
Fukunaga, K., Mantock, J., 1983. Nonparametric discriminant analysis. IEEE Trans. Pattern Anal. Machine Intell. 5 (6), 671–678.
Georghiades, A., Kreigman, D., Belhumeur, P., 2000. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Machine Intell. 23 (10), 1090–1104.
Guo, G., Dyer, C.R., 2005. Learning from examples in the small sample case: Face expression recognition. IEEE Trans. Systems Man Cybernet.–Part B 35 (3), 477–488.
Jain, A.K., 1989. Fundamentals of Digital Image Processing. Prentice Hall, New Jersey.
Li, Z., Liu, W., Lin, D., Tang, X., 2005. Nonparametric subspace analysis for face recognition. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 961–966.
Linde, Y., Buzo, A., Gray, R., 1980. An algorithm for vector quantizer design. IEEE Trans. Commun. COM-28, 84–95.
Liu, Q., Lu, H., Ma, S., 2004. Improving kernel fisher discriminant analysis for face recognition. IEEE Trans. Circ. Syst. Video Technol. 14 (1), 42–49.
Martinez, A., Benavente, R., 1998. The AR Face Database. CVC Technical Report #24.
Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J., 2000. The FERET evaluation methodology for face recognition algorithms. IEEE Trans. Pattern Anal. Machine Intell. 22 (10), 1090–1104.
Shin, H.-C., Park, J.H., Kim, S.-D., 2007. Combination of warping robust elastic graph matching and kernel-based projection discriminant analysis for face recognition. IEEE Trans. Multimedia 9 (6), 1125–1136.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. J. Cognition Neurosci. 3 (1), 71–86.
Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.
Xiang, C., Huang, D., 2006. Feature extraction using recursive cluster-based linear discriminant with application to face recognition. IEEE Trans. Image Process. 15 (12), 3824–3832.
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., Lin, S., 2007. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Machine Intell. 29 (1), 40–51.
Yang, J., Frangi, A.F., Yang, J.-Y., Zhang, D., Jin, Z., 2005. KPCA Plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans. Pattern Anal. Machine Intell. 27 (2), 230–244.