Neurocomputing 69 (2006) 1806–1810 www.elsevier.com/locate/neucom
Letters
A reformative kernel Fisher discriminant algorithm and its application to face recognition

Yu-jie Zheng (a), Jian Yang (a,b), Jing-yu Yang (a), Xiao-jun Wu (c)

(a) Department of Computer Science, Nanjing University of Science and Technology, Nanjing 210094, PR China
(b) Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
(c) School of Electronics and Information, Jiangsu University of Science and Technology, Zhenjiang 212003, PR China

Received 10 July 2005; received in revised form 14 July 2005; accepted 16 August 2005
Available online 7 March 2006
Communicated by R.W. Newcomb
Abstract

In this paper, a reformative kernel Fisher discriminant (KFD) algorithm based on fuzzy set theory is studied. The KFD algorithm effectively extracts nonlinear discriminative features of the input samples using the kernel trick. However, the conventional KFD algorithm assumes the same level of relevance of each sample to its class. In this paper, a fuzzy kernel Fisher discriminant (FKFD) algorithm is proposed. The distribution information of the samples is represented with fuzzy membership degrees, and this information is used to redefine the corresponding scatter matrices, which differ from those of the conventional KFD algorithm and are effective for extracting discriminative features from overlapping (outlier) samples. Experimental results on the ORL face database demonstrate the effectiveness of the proposed method.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Kernel linear discriminant analysis; Fuzzy kernel linear discriminant analysis; Fuzzy K-nearest neighbor; Feature extraction; Face recognition
1. Introduction

Kernel-based learning machines, e.g., support vector machines (SVMs) [8], kernel principal component analysis (KPCA) [7], and kernel Fisher discriminant analysis (KFD) [1,4,9,10], have recently aroused considerable interest in pattern recognition and machine learning. KFD has become one of the most effective algorithms in many real-world applications owing to its power of extracting the most discriminatory nonlinear features. However, KFD may encounter ill-posed problems in real-world applications [5], and a number of regularization techniques have been suggested to alleviate this problem [1,4,9]. Regretfully, all existing KFD-based algorithms rely on a crisp, binary (yes/no) class assignment, meaning that the samples
come fully assigned to the given classes (categories). Evidently, since samples are significantly affected by numerous environmental conditions (for face images: illumination, expression, etc.), it is advantageous to investigate these factors and quantify their impact on the "internal" class assignment [3]. Interestingly, the idea of such class assignment has been around for a long time and dates back to the results published by Keller et al. [2] under the notion of a fuzzy k-nearest neighbor classifier. The objective of this study is to reexamine the KFD technique (which is well established and has already enjoyed a significant level of success) and augment it with mechanisms from fuzzy sets. By taking advantage of fuzzy set theory [11], a new kernel Fisher discriminant analysis algorithm, named fuzzy kernel Fisher discriminant (FKFD) analysis, is presented in this paper. Experimental results on the ORL [6] face database demonstrate the effectiveness of the proposed method.
2. Outline of the KFD algorithm (KPCA plus LDA)

2.1. Fundamentals

A pattern in the original input space $\mathbb{R}^n$ can be mapped into a higher-dimensional feature space H through a given nonlinear mapping $\Phi$:
$$\Phi: \mathbb{R}^n \to H, \quad x \mapsto \Phi(x). \qquad (1)$$
The idea is to solve the problem of linear discriminant analysis (LDA) in the higher-dimensional feature space H, thereby yielding a set of nonlinear discriminant vectors in the input space. This can be achieved by maximizing the following Fisher criterion in the space H:
$$J_F(\varphi) = \frac{\varphi^T S_b^{\Phi} \varphi}{\varphi^T S_t^{\Phi} \varphi}, \quad \varphi \neq 0, \qquad (2)$$
where $S_b^{\Phi}$ and $S_t^{\Phi}$ are the between-class and total scatter matrices defined in the feature space H:
$$S_b^{\Phi} = \frac{1}{M} \sum_{i=1}^{c} l_i (m_i^{\Phi} - m_0^{\Phi})(m_i^{\Phi} - m_0^{\Phi})^T, \qquad (3)$$
$$S_t^{\Phi} = \frac{1}{M} \sum_{i=1}^{M} (\Phi(x_i) - m_0^{\Phi})(\Phi(x_i) - m_0^{\Phi})^T. \qquad (4)$$
Here, $x_1, x_2, \ldots, x_M$ is a set of M training samples in the input space; $l_i$ is the number of training samples of class i and satisfies $\sum_{i=1}^{c} l_i = M$; $m_i^{\Phi}$ is the mean vector of the mapped training samples of class i; and $m_0^{\Phi}$ is the mean vector of all mapped training samples. However, the nonlinear transformation $\Phi$ is in general too complicated to compute explicitly, and extracting the optimal discriminant vectors under the Fisher criterion in the higher-dimensional (possibly infinite-dimensional) space H is difficult. How can we avoid these shortcomings and still obtain nonlinear discriminant vectors for the original samples? The kernel trick is a natural choice.

2.2. Kernel Fisher discriminant (KFD) algorithm

The optimal discriminant vectors with respect to the Fisher criterion are the eigenvectors of the generalized eigenvalue problem $S_b^{\Phi} \varphi = \lambda S_t^{\Phi} \varphi$. Since any of these eigenvectors can be expressed as a linear combination of the observations in feature space, we have
$$\varphi = \sum_{j=1}^{M} a_j \Phi(x_j) = Q a, \qquad (5)$$
where $Q = [\Phi(x_1), \ldots, \Phi(x_M)]$ and $a = (a_1, \ldots, a_M)^T$. Substituting Eq. (5) into Eq. (2), the Fisher criterion in the space H is converted to
$$J_K(a) = \frac{a^T (K W K) a}{a^T (K K) a}, \qquad (6)$$
where the matrix K is defined as
$$K = \tilde{K} - I_M \tilde{K} - \tilde{K} I_M + I_M \tilde{K} I_M. \qquad (7)$$
Here, $I_M = \frac{1}{M} \mathbf{1}_{M \times M}$ is the scaled all-ones matrix, and $\tilde{K} = Q^T Q$ is an $M \times M$ matrix whose elements are determined by
$$\tilde{K}_{ij} = \Phi(x_i)^T \Phi(x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j), \qquad (8)$$
where $k(x, y)$ is the kernel function corresponding to the given nonlinear mapping $\Phi$. Furthermore, $W = \mathrm{diag}(W_1, \ldots, W_c)$, where $W_j$ is an $l_j \times l_j$ matrix with all entries equal to $1/l_j$; thus W is an $M \times M$ block-diagonal matrix.

Now consider the eigendecomposition (spectral decomposition) of the matrix K. Suppose $\gamma_1, \gamma_2, \ldots, \gamma_m$ are K's orthonormal eigenvectors corresponding to the m largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m > 0$. Then K can be expressed as $K = P \Lambda P^T$, where $P = [\gamma_1, \gamma_2, \ldots, \gamma_m]$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; obviously $P^T P = I$, where I is the identity matrix. Substituting $K = P \Lambda P^T$ into Eq. (6), we have
$$J_K(a) = \frac{(\Lambda^{1/2} P^T a)^T (\Lambda^{1/2} P^T W P \Lambda^{1/2}) (\Lambda^{1/2} P^T a)}{(\Lambda^{1/2} P^T a)^T \Lambda (\Lambda^{1/2} P^T a)}. \qquad (9)$$
Let
$$\beta = \Lambda^{1/2} P^T a. \qquad (10)$$
Then Eq. (9) becomes
$$J(\beta) = \frac{\beta^T S_b \beta}{\beta^T S_t \beta}, \qquad (11)$$
where
$$S_b = \Lambda^{1/2} P^T W P \Lambda^{1/2} \quad \text{and} \quad S_t = \Lambda. \qquad (12)$$
It is easy to see that $S_t$ is positive definite and $S_b$ is positive semi-definite, so Eq. (11) is a standard generalized Rayleigh quotient. By maximizing this Rayleigh quotient we obtain a set of optimal solutions $\beta_1, \beta_2, \ldots, \beta_d$, which are the eigenvectors of $S_t^{-1} S_b$ corresponding to the d ($d \leq c - 1$) largest eigenvalues. From Eq. (10) we know that, for a given $\beta$, there exists at least one a satisfying $a = P \Lambda^{-1/2} \beta$. Thus, after determining $\beta_1, \beta_2, \ldots, \beta_d$, we obtain a set of optimal solutions $a_j = P \Lambda^{-1/2} \beta_j$ ($j = 1, 2, \ldots, d$) with respect to the criterion in Eq. (6). Thereby, the optimal discriminant vectors with respect to the Fisher criterion of Eq. (2) in feature space are
$$\varphi_j = Q a_j = Q P \Lambda^{-1/2} \beta_j, \quad j = 1, 2, \ldots, d. \qquad (13)$$
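To make the above derivation concrete, the following NumPy sketch implements Eqs. (7), (8) and (10)–(13). It is only an illustration under stated assumptions: the names (kfd_fit, poly_kernel) are ours rather than the paper's, labels are assumed to be integers 0, ..., c-1, a second-order polynomial kernel is used as a placeholder, and a numerical threshold replaces the exact choice of m retained eigenvalues.

```python
import numpy as np

def poly_kernel(X, Y, degree=2):
    # Second-order polynomial kernel k(x, y) = (x . y + 1)^2
    return (X @ Y.T + 1.0) ** degree

def kfd_fit(X, labels, kernel=poly_kernel, n_components=None):
    """Solve the kernelized Fisher criterion of Eq. (6) via Eqs. (7)-(13).

    X : (M, n) training samples, labels : (M,) integer labels 0..c-1.
    Returns A = [a_1, ..., a_d]; the discriminant vectors are phi_j = Q a_j
    (Eq. (13)), so the nonlinear feature of a sample x is
    A.T @ [k(x_1, x), ..., k(x_M, x)].
    """
    M = X.shape[0]
    classes = np.unique(labels)
    d = n_components or (len(classes) - 1)

    K_tilde = kernel(X, X)                                     # Eq. (8): Gram matrix
    I_M = np.full((M, M), 1.0 / M)                             # scaled all-ones matrix
    K = K_tilde - I_M @ K_tilde - K_tilde @ I_M + I_M @ K_tilde @ I_M   # Eq. (7)

    # Block-diagonal W: within class i, all entries equal 1 / l_i
    W = np.zeros((M, M))
    for ci in classes:
        idx = np.where(labels == ci)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)

    # Eigendecomposition K = P Lambda P^T, keeping the positive eigenvalues
    lam, P = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]
    lam, P = lam[order], P[:, order]
    keep = lam > 1e-10 * lam[0]
    lam, P = lam[keep], P[:, keep]

    # Reduced scatter matrices of Eq. (12): S_b = L^{1/2} P^T W P L^{1/2}, S_t = L
    L_half = np.diag(np.sqrt(lam))
    L_inv_half = np.diag(1.0 / np.sqrt(lam))
    S_b = L_half @ P.T @ W @ P @ L_half

    # Maximize the Rayleigh quotient of Eq. (11): whiten with L^{-1/2} so the
    # problem becomes a symmetric eigenproblem, then keep the d leading vectors.
    mu, Uvec = np.linalg.eigh(L_inv_half @ S_b @ L_inv_half)
    B = L_inv_half @ Uvec[:, np.argsort(mu)[::-1][:d]]         # beta_1, ..., beta_d

    # Expansion coefficients a_j = P L^{-1/2} beta_j (Eq. (13))
    return P @ L_inv_half @ B
```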
However, previous derivations of KFD are somewhat involved. Yang [9] revealed the essence of KFD and proposed a concise KFD framework: KPCA plus LDA. In this formulation, Eq. (13) can be divided into two steps.
Step 1: KPCA transformation from the feature space H into the Euclidean space $\mathbb{R}^m$, i.e.,
$$y = \left(\frac{\gamma_1}{\sqrt{\lambda_1}}, \ldots, \frac{\gamma_m}{\sqrt{\lambda_m}}\right)^T (\Phi(x_1), \ldots, \Phi(x_M))^T \Phi(x) = \left(\frac{\gamma_1}{\sqrt{\lambda_1}}, \ldots, \frac{\gamma_m}{\sqrt{\lambda_m}}\right)^T [k(x_1, x), \ldots, k(x_M, x)]^T. \qquad (14)$$

Step 2: LDA transformation in the KPCA-transformed space $\mathbb{R}^m$, i.e.,
$$\varphi = B^T y, \qquad (15)$$
where $B = [\beta_1, \ldots, \beta_d]$.
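As a small illustration of this two-step view, the sketch below maps a new sample through Eq. (14) and then applies Eq. (15). The names kpca_transform and lda_transform are hypothetical helpers of ours; P, lam and B are assumed to come from a KFD fit such as the one sketched earlier.

```python
import numpy as np

def kpca_transform(X_train, x, P, lam, kernel):
    # Step 1, Eq. (14): y = Lambda^{-1/2} P^T [k(x_1, x), ..., k(x_M, x)]^T
    k_x = kernel(X_train, x.reshape(1, -1)).ravel()
    return (P / np.sqrt(lam)).T @ k_x

def lda_transform(y, B):
    # Step 2, Eq. (15): final discriminant feature B^T y
    return B.T @ y
```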
3. Fuzzy kernel Fisher discriminant (FKFD)

Up to now, all existing KFD methods treat class assignment as crisp (binary). Where overlapping (outlier) samples exist, the performance of such hard feature extraction methods may degrade. How can we represent the distribution of these samples and improve classification performance by extracting discriminative information from them? Fuzzy set theory is a natural choice. In this paper, we propose a new KFD algorithm that makes full use of the distribution of the samples. The sample distribution information is represented by fuzzy membership degrees with respect to every class, so each sample can be assigned to multiple classes according to its membership degrees. The corresponding scatter matrices can then be redefined so that the distribution information is incorporated, which helps extract discriminative features from the samples, especially from overlapping (outlier) ones.

3.1. Fuzzy k-nearest neighbor (FKNN)

In our method, the fuzzy membership degrees and the class centers are obtained through the FKNN algorithm [2]. Suppose we have c known classes $\omega_1, \omega_2, \ldots, \omega_c$, and $X = \{x_i\}$, $i = 1, 2, \ldots, M$, is a set of N-dimensional samples. Every sample in X belongs to one of the known classes, i.e., $x_i \in \omega_j$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, c$. With the FKNN algorithm, the fuzzy membership degrees are computed through the following sequence of steps:

Step 1: Compute the Euclidean distance matrix between pairs of feature vectors in the training set.

Step 2: Since the diagonal elements of this matrix are the distances of the samples $x_i$ ($i = 1, 2, \ldots, M$) from themselves, they equal zero. To avoid collecting a sample's own class label in the next step, set the diagonal elements to infinity.

Step 3: Sort the distance matrix (treating each of its columns separately) in ascending order and collect the class labels of the patterns located in the closest neighborhood of the pattern under consideration (as we are concerned with k neighbors, this returns a list of k integers).

Step 4: Compute the membership degree to class i for the jth pattern using the expression proposed in [2]:
$$m_{ij} = \begin{cases} 0.51 + 0.49\,(n_{ij}/k), & \text{if } i \text{ equals the label of the } j\text{th pattern}, \\ 0.49\,(n_{ij}/k), & \text{otherwise}, \end{cases} \qquad (16)$$
where $n_{ij}$ stands for the number of the neighbors of the jth pattern that belong to the ith class.

Taking the fuzzy membership degrees into account, the mean vector of each class is
$$o_i = \frac{\sum_{j=1}^{M} m_{ij} x_j}{\sum_{j=1}^{M} m_{ij}}. \qquad (17)$$

Therefore, the class center matrix o and the fuzzy membership matrix U are obtained from the FKNN result:
$$U = [m_{ij}], \quad i = 1, 2, \ldots, c, \; j = 1, 2, \ldots, M, \qquad (18)$$
$$o = [o_i], \quad i = 1, 2, \ldots, c. \qquad (19)$$
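The following NumPy sketch follows Steps 1–4 and Eqs. (16)–(19) above. The function name fknn_memberships and the label convention (integers 0, ..., c-1) are our assumptions, not the paper's.

```python
import numpy as np

def fknn_memberships(Y, labels, k=5):
    """FKNN of Section 3.1: fuzzy memberships (Eq. (16)) and class centres (Eq. (17)).

    Y : (M, m) training samples (e.g. in the KPCA-transformed space),
    labels : (M,) integer class labels 0..c-1.
    Returns U (c x M membership matrix, Eq. (18)) and o (c x m centres, Eq. (19)).
    """
    M = Y.shape[0]
    c = labels.max() + 1

    # Step 1: pairwise Euclidean distance matrix.
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    # Step 2: exclude each sample from its own neighbourhood.
    np.fill_diagonal(D, np.inf)

    U = np.zeros((c, M))
    for j in range(M):
        # Step 3: labels of the k nearest neighbours of pattern j.
        nn = np.argsort(D[:, j])[:k]
        for i in range(c):
            n_ij = np.sum(labels[nn] == i)
            # Step 4: membership degree, Eq. (16).
            if i == labels[j]:
                U[i, j] = 0.51 + 0.49 * n_ij / k
            else:
                U[i, j] = 0.49 * n_ij / k

    # Fuzzy class centres, Eq. (17).
    o = (U @ Y) / U.sum(axis=1, keepdims=True)
    return U, o
```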
3.2. Fuzzy kernel Fisher discriminant (FKFD)

The key step of FKFD is how to incorporate the contribution of each training sample into the redefinition of the scatter matrices. Under fuzzy set theory, every sample can be assigned to multiple classes through its fuzzy membership degrees, so the contribution of each sample to classification can be represented by its corresponding membership degree. In the redefinition of the fuzzy within-class scatter matrix, samples that are closer to a class center contribute more to classification. Taking the membership degree of each sample (its contribution to each class) into account, the fuzzy within-class scatter matrix is redefined as
$$FS_w = \sum_{i=1}^{c} \sum_{x_j \in \omega_i} m_{ij}^{p} (x_j - o_i)(x_j - o_i)^{T}, \qquad (20)$$
where p is a constant that controls the influence of the fuzzy membership degrees. In the redefinition of the between-class scatter matrix, in contrast, a class that is far from the total center contributes more to classification. Thus, the fuzzy between-class scatter matrix is redefined as
$$FS_b = \sum_{i=1}^{c} \frac{1}{\sum_{x_j \in \omega_i} m_{ij}^{p}} \left( \sum_{j=1}^{M} m_{ij}^{p} \right) (o_i - \bar{x})(o_i - \bar{x})^{T}, \qquad (21)$$
where p is the constant chosen above that controls the influence of the fuzzy membership degrees and $\bar{x}$ is the mean of all samples. Similarly, the fuzzy total scatter matrix is obtained as
$$FS_t = FS_b + FS_w. \qquad (22)$$
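A sketch of Eqs. (20)–(22) is given below. Note that the between-class weighting follows our reconstruction of Eq. (21) from the printed formula; the helper name fuzzy_scatter_matrices and the argument layout are assumptions of ours.

```python
import numpy as np

def fuzzy_scatter_matrices(Y, labels, U, o, p=2):
    """Fuzzy scatter matrices of Eqs. (20)-(22).

    Y : (M, m) samples, labels : (M,) integer labels 0..c-1,
    U : (c, M) fuzzy memberships, o : (c, m) fuzzy class centres,
    p : fuzziness exponent.
    """
    M, m = Y.shape
    c = o.shape[0]
    x_bar = Y.mean(axis=0)
    Up = U ** p

    FS_w = np.zeros((m, m))
    FS_b = np.zeros((m, m))
    for i in range(c):
        idx = np.where(labels == i)[0]
        # Eq. (20): fuzzy within-class scatter, summed over x_j in class i.
        diff = Y[idx] - o[i]
        FS_w += (Up[i, idx][:, None] * diff).T @ diff
        # Eq. (21) (as reconstructed): weighted between-class scatter term.
        weight = Up[i].sum() / Up[i, idx].sum()
        db = (o[i] - x_bar)[:, None]
        FS_b += weight * (db @ db.T)

    FS_t = FS_b + FS_w          # Eq. (22)
    return FS_w, FS_b, FS_t
```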
Therefore, all scatter matrices are redefined with fuzzy set theory and the contribution of each sample is incorporated. Furthermore, $FS_t$ is positive definite and $FS_b$ is positive semi-definite, so we can use them instead of $S_t$ and $S_b$ to design an LDA algorithm based on the Fisher criterion. Based on the above descriptions, the FKFD algorithm can be summarized as follows:

Step 1: KPCA transformation is applied to the original image space using Eq. (14), transforming all samples into an m-dimensional space with the kernel trick.

Step 2: The class center matrix o and the fuzzy membership matrix U of the training samples are obtained with the FKNN algorithm in the KPCA-transformed space $\mathbb{R}^m$.

Step 3: From o and U, calculate the fuzzy within-class scatter matrix $FS_w$, the fuzzy between-class scatter matrix $FS_b$, and the fuzzy total scatter matrix $FS_t$ in the m-dimensional space. Then the optimal discriminant vectors $B = [\beta_1, \ldots, \beta_d]$ are calculated with $FS_b$ and $FS_t$ in place of $S_b$ and $S_t$ under the Fisher criterion (the total scatter matrix being nonsingular), and the LDA transformation is applied in the KPCA-transformed space $\mathbb{R}^m$ using Eq. (15). Here d, the number of optimal discriminant vectors, is at most $c - 1$ and no larger than the rank of $FS_b$.

Step 4: Project all samples onto the obtained optimal discriminant vectors and classify.
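The four steps can be strung together roughly as follows. This is an end-to-end sketch, not the authors' implementation: it reuses the hypothetical helpers from the earlier sketches (poly_kernel, fknn_memberships, fuzzy_scatter_matrices), assumes integer labels 0, ..., c-1, and leaves out details such as the exact number of retained KPCA eigenvectors.

```python
import numpy as np

def fkfd_train(X, labels, kernel=poly_kernel, k=5, p=2, d=None):
    """FKFD training following Steps 1-4 of Section 3.2 (illustrative sketch)."""
    M = X.shape[0]
    c = labels.max() + 1
    d = d or (c - 1)

    # Step 1: KPCA transformation of the training set (Eqs. (7), (8), (14)).
    K_tilde = kernel(X, X)
    I_M = np.full((M, M), 1.0 / M)
    K = K_tilde - I_M @ K_tilde - K_tilde @ I_M + I_M @ K_tilde @ I_M
    lam, P = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]
    lam, P = lam[order], P[:, order]
    keep = lam > 1e-10 * lam[0]
    lam, P = lam[keep], P[:, keep]
    Y = K @ (P / np.sqrt(lam))          # KPCA features of the training samples

    # Step 2: fuzzy memberships and class centres via FKNN in the KPCA space.
    U, o = fknn_memberships(Y, labels, k=k)

    # Step 3: fuzzy scatter matrices and LDA under the Fisher criterion.
    FS_w, FS_b, FS_t = fuzzy_scatter_matrices(Y, labels, U, o, p=p)
    evals, evecs = np.linalg.eig(np.linalg.solve(FS_t, FS_b))
    B = np.real(evecs[:, np.argsort(np.real(evals))[::-1][:d]])

    # Step 4: project the training samples; new samples would be mapped with
    # kpca_transform(X, x, P, lam, kernel) and then lda_transform(y, B).
    return Y @ B, B, (P, lam)
```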
4. Experimental results

To demonstrate the effectiveness of our method, experiments were conducted on the ORL face database [6]. The database consists of 40 distinct subjects, each with 10 images taken under different expressions and views, with tolerance for some tilting and rotation of up to about 20 degrees. Fig. 1 shows some images from the ORL face database. The training and testing sets are selected randomly for each subject. The number of training samples per subject, W, increases from 4 to 6. In each round, the training samples are selected randomly and the remaining samples are used for testing. This procedure was repeated 10 times with different randomly chosen training and testing sets. Here, 70 eigenvectors were retained, which gave the best performance over the ten experiments; the discriminant vectors correspond to the 39 (i.e., $c - 1$) largest eigenvalues. The parameter of the fuzzy scatter matrices is fixed as p = 2. Two popular kernels are used: the second-order polynomial kernel $k(x, y) = (x \cdot y + 1)^2$ and the sigmoid kernel $k(x, y) = \tanh(q (x \cdot y) + \Theta)$, whose parameters are fixed as q = 0.01 and $\Theta$ = 4. In both cases the mapping $\Phi$ never needs to be computed explicitly; all calculations are carried out through the kernel function. Finally, a nearest-neighbor classifier is employed for classification.

Table 1 reports the mean and standard deviation of the obtained recognition rates. The results show that the FKFD algorithm outperforms the conventional KFD algorithm under the same kernel function. With the introduction of fuzzy set theory, distribution information is incorporated into the feature extraction process through the fuzzy membership degrees, and more effective features are extracted by the proposed method.
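For reference, the two kernel functions used in the experiments could be written as below; the function names are ours, and the parameter values are those stated above.

```python
import numpy as np

def polynomial_kernel(X, Y):
    # Second-order polynomial kernel: k(x, y) = (x . y + 1)^2
    return (X @ Y.T + 1.0) ** 2

def sigmoid_kernel(X, Y, q=0.01, theta=4.0):
    # Sigmoid kernel with the reported parameters: k(x, y) = tanh(q (x . y) + theta)
    return np.tanh(q * (X @ Y.T) + theta)
```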
Table 1
Recognition rates (%) on the ORL face database (mean and standard deviation)

# Training samples/class (W)    4             5             6
KFD (Polynomial)                90.17±1.43    92.20±0.86    94.25±1.35
FKFD (Polynomial)               91.38±1.44    93.35±0.88    95.13±1.25
KFD (Sigmoid)                   88.71±1.23    92.10±1.85    94.19±1.91
FKFD (Sigmoid)                  89.12±1.19    93.00±1.68    94.94±1.04

Fig. 1. Some images from the ORL face database.

5. Conclusions

In this paper, we proposed a new subspace method, fuzzy kernel Fisher discriminant (FKFD), to extract features from samples, especially from overlapping (outlier) samples. This algorithm is based on the conventional framework (KPCA plus LDA), which can extract nonlinear discriminative features of the input samples. Furthermore, the distribution information of overlapping (outlier) samples is
incorporated in the redefinition of the corresponding scatter matrices, which is important for classification.

Acknowledgements

We would like to thank all anonymous reviewers for their constructive advice on the original submission of this work. This work was supported in part by the following projects: the National Natural Science Foundation of P.R. China (Grant Nos. 60472060, 60503026, 60572034), the Robotics Laboratory, Shenyang Institute of Automation, Chinese Academy of Sciences Foundation (Grant No. RL200108), the Natural Science Foundation of Jiangsu Province (Grant Nos. BK2002001, BK2004058), and the Open Foundation of the Image Processing and Image Communication Lab (Grant No. KJS03038).

References

[1] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Comput. 12 (10) (2000) 2385–2404.
[2] J.M. Keller, M.R. Gray, J.A. Givens, A fuzzy k-nearest neighbor algorithm, IEEE Trans. Syst. Man Cybernet. 15 (4) (1985) 580–585.
[3] K.C. Kwak, W. Pedrycz, Face recognition using a fuzzy Fisherface classifier, Pattern Recognition 38 (10) (2005) 1717–1732.
[4] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, in: IEEE International Workshop on Neural Networks for Signal Processing IX, Madison, USA, August 1999, pp. 41–48.
[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, K.-R. Müller, Constructing descriptive and discriminative non-linear features: Rayleigh coefficients in kernel feature spaces, IEEE Trans. Pattern Anal. Machine Intell. 25 (5) (2003) 623–628.
[6] ORL face database, http://www.uk.research.att.com/facedatabase.html.
[7] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[8] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[9] J. Yang, Z. Jin, J.Y. Yang, D. Zhang, A.F. Frangi, Essence of kernel Fisher discriminant: KPCA plus LDA, Pattern Recognition 37 (10) (2004) 2097–2100.
[10] J. Yang, A.F. Frangi, J.Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Trans. Pattern Anal. Machine Intell. 27 (2) (2005) 230–244.
[11] L.A. Zadeh, Fuzzy sets, Inform. Control 8 (1965) 338–353.

Yu-jie Zheng was born in Zhejiang, China, in February 1977. He received his Bachelor's degree in Computer Science from Jiangsu University in 1999 and his Master's degree in Computer Science from the Jiangsu University of Science and Technology in 2004. Since 2004 he has been a Ph.D. candidate in the Department of Computer Science at the Nanjing University of Science and Technology, working on pattern recognition and intelligent systems. His current research interests include pattern recognition, computer vision and machine learning.

Jian Yang was born in Jiangsu, China, in June 1973. He received his Bachelor of Science in Mathematics from Xuzhou Normal University in 1995, his Master of Science in Applied Mathematics from Changsha Railway University in 1998, and his Ph.D. in Pattern Recognition and Intelligent Systems from the Department of Computer Science, Nanjing University of Science and Technology (NUST), in 2002. From January to December 2003 he was a postdoctoral researcher at the University of Zaragoza, affiliated with the Division of Bioengineering of the Aragon Institute of Engineering Research (I3A). In the same year he was awarded the RyC program research fellowship sponsored by the Spanish Ministry of Science and Technology. He is currently a professor in the Department of Computer Science of NUST and, at the same time, a postdoctoral research fellow at the Biometrics Centre of the Hong Kong Polytechnic University. He is the author of more than 30 scientific papers in pattern recognition and computer vision. His current research interests include pattern recognition, computer vision and machine learning.

Jing-yu Yang received the B.S. degree in Computer Science from the Nanjing University of Science and Technology (NUST), Nanjing, China. From 1982 to 1984 he was a visiting scientist at the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. From 1993 to 1994 he was a visiting professor in the Department of Computer Science, University of Missouri, and in 1998 he was a visiting professor at Concordia University in Canada. He is currently a professor and chairman of the Department of Computer Science at NUST. He is the author of over 300 scientific papers in computer vision, pattern recognition, and artificial intelligence, and has won more than 20 provincial and national awards. His current research interests are in the areas of pattern recognition, robot vision, image processing, data fusion, and artificial intelligence.

Xiao-jun Wu received the B.S. degree in Mathematics from Nanjing Normal University, Nanjing, P.R. China, in 1991, and the M.S. degree in 1996 and the Ph.D. degree in Pattern Recognition and Intelligent Systems in 2002, both from the Nanjing University of Science and Technology, Nanjing, P.R. China. Dr. Wu was a fellow of the United Nations University, International Institute for Software Technology (UNU/IIST) from 1999 to 2000. He won the most outstanding postgraduate award from the Nanjing University of Science and Technology in 1996. Since 1996 he has been teaching in the School of Electronics and Information, Jiangsu University of Science and Technology, where he is an exceptionally promoted professor in charge of the Department of Computer Science and Technology. He has published more than 70 papers. He was a visiting researcher in the Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK, from 2003 to 2004. His current research interests are pattern recognition, fuzzy systems, neural networks and intelligent systems.