Neurocomputing 69 (2006) 949–953 www.elsevier.com/locate/neucom
Letters
A new nonlinear feature extraction method for face recognition

Yanwei Pang, Zhengkai Liu, Nenghai Yu

MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, PR China

Received 24 June 2005; received in revised form 27 July 2005; accepted 28 July 2005. Available online 6 October 2005. Communicated by R.W. Newcomb.
Abstract

Feature extraction is a crucial step in pattern recognition. In this paper, a nonlinear feature extraction method is proposed. The objective function of the proposed method is formed by combining the ideas of locally linear embedding (LLE) and linear discriminant analysis (LDA). By optimizing the objective function in a kernel feature space, nonlinear features can be extracted. A major advantage of the proposed method is that it makes full use of both the nonlinear structure and the class-specific information of the training data. Experimental results on the AR face database demonstrate the effectiveness of the proposed method.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Face recognition; Subspace learning; Linear discriminant analysis; Manifold learning; Feature extraction
1. Introduction

Feature extraction is a crucial step in pattern recognition tasks such as face recognition. Eigenfaces and fisherfaces are popular for facial feature extraction; their underlying ideas are principal component analysis (PCA) [10] and linear discriminant analysis (LDA) [1], respectively. PCA seeks the subspace that best represents the data in a least-squares sense, so the features extracted by PCA are called the most expressive features [9]. However, PCA cannot discover the local structure of the data. Because of the large variations of face appearance caused by changes in expression, illumination, and pose, the manifold of the face space is believed to be very complex. Although the locally linear structure of this manifold can be captured by locally linear embedding (LLE) [6], LLE cannot map a new test point directly, which is referred to as the out-of-sample problem. In addition, both LLE and PCA are unsupervised learning methods, so the information carried by class labels is lost. LDA is a well-known supervised technique for
dealing with classification problems. It selects a linear transformation matrix such that the ratio of the between-class scatter to the within-class scatter is maximized; the features extracted by LDA are therefore called the most discriminating features [9]. Compared with unsupervised methods such as PCA, however, LDA is prone to overfitting when the training set is small, which is often the case in face recognition [4]. To take advantage of both LLE and LDA while avoiding their disadvantages, we first formulate an objective function that minimizes a term similar to that of LLE and, at the same time, another term related to LDA. This yields a formula for extracting linear features. We then generalize the linear setting to a nonlinear one, which is implemented by implicitly mapping the input space into a high-dimensional feature space through the kernel matrix. The actual computation of the subspace reduces to an eigenvalue problem. For convenience, we denote the proposed method by KGAD. The remainder of this paper is organized as follows. In Section 2, an overview of the algorithm is given. The justification of the proposed method is presented in Section 3, with the linear form derived in Section 3.1 and the nonlinear form in Section 3.2. The proposed method is
evaluated on the AR face database in Section 4. Finally, a brief conclusion is drawn in Section 5.

2. Overview of the proposed method (KGAD)

Let $X = [x_1, x_2, \ldots, x_N]$ be a data set of $D$-dimensional vectors of face images. Each data point belongs to one of $C$ object classes $\{X_1, \ldots, X_C\}$, and $N_c$ denotes the number of points in the $c$th class. Feature extraction maps these points to new points $Y = [y_1, y_2, \ldots, y_N]$ in a $d$-dimensional space, where $d \ll D$. Suppose that the input space is mapped into a Hilbert space $F$ through a nonlinear mapping $\phi\colon X \to F$. Before presenting a detailed derivation of the proposed algorithm, we give an overview of it.

Step 1: Assign neighbors to each mapped point $\phi(x_i)$ using the $k$ nearest neighbors. Distances are computed from the kernel matrix $K \in \mathbb{R}^{N \times N}$ with elements $K_{ij} = \phi(x_i)^{\mathrm T}\phi(x_j) = \phi(x_i)\cdot\phi(x_j)$.

Step 2: Compute the weights $W_{ij}$ that best linearly reconstruct $\phi(x_i)$ from its neighbors by solving the constrained least-squares problem $J_1(W) = \sum_{i=1}^{N} \lVert \phi(x_i) - \sum_{j=1}^{k} W_{ij}\,\phi(x_j) \rVert^2$. Owing to space limitations, the details of this step, which are similar to those of LLE, are omitted.

Step 3: Compute the matrices $M$, $G$, $L$, $H$, and $V$ as follows: $M = (I - W)^{\mathrm T}(I - W)$, $G = I - (1/N)\,ee^{\mathrm T}$, $L = I - E$, $H = G - L$, and $V = \theta M + (1-\theta)L$, where $I$ is the identity matrix, $e = (1, \ldots, 1)^{\mathrm T}$, $E_{ij} = 1/N_c$ if $x_i$ and $x_j$ both belong to the $c$th class and $E_{ij} = 0$ otherwise, and $\theta$ is a weight parameter.

Step 4: Compute the matrix $B = [b_1, \ldots, b_d]$ by solving the generalized eigenvalue problem $KHKb_i = \lambda_i KVKb_i$ with $\lambda_1 > \lambda_2 > \cdots > \lambda_d > 0$.

Step 5: The nonlinear features are extracted as $y_i^n = \sum_{j=1}^{N} b_j^n K_{ij}$, $n = 1, 2, \ldots, d$. A code sketch of Steps 1–5 is given below.
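To make the procedure concrete, the following NumPy sketch mirrors Steps 1–5 for a Gaussian kernel. It is a minimal illustration rather than the authors' implementation: the function names (gaussian_kernel, kgad_fit) are ours, the kernel-space LLE weights in Step 2 follow the standard local-Gram-matrix recipe, and small regularizers are added in two places for numerical stability, a detail the paper does not specify.

```python
import numpy as np
from scipy.linalg import eigh, solve

def gaussian_kernel(X, sigma2):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2) for the columns x_i of X (D x N)."""
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def kgad_fit(X, labels, d, k=11, theta=0.5, sigma2=1e4, reg=1e-3):
    """Steps 1-5 of the overview. X is D x N (columns are samples), labels has length N."""
    N = X.shape[1]
    K = gaussian_kernel(X, sigma2)                      # Step 1: kernel matrix

    # Step 1: k nearest neighbors in the feature space; squared distances follow
    # from the kernel, ||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij.
    diag = np.diag(K)
    D2 = diag[:, None] + diag[None, :] - 2.0 * K
    np.fill_diagonal(D2, np.inf)
    nbrs = np.argsort(D2, axis=1)[:, :k]

    # Step 2: reconstruction weights W (rows sum to one), from the local Gram
    # matrix C_jl = <phi(x_i) - phi(x_j), phi(x_i) - phi(x_l)>.
    W = np.zeros((N, N))
    for i in range(N):
        J = nbrs[i]
        C = K[i, i] - K[i, J][None, :] - K[i, J][:, None] + K[np.ix_(J, J)]
        C = C + reg * np.trace(C) * np.eye(k)           # regularizer (assumption)
        w = solve(C, np.ones(k))
        W[i, J] = w / w.sum()

    # Step 3: M, G, E, L, H, V.
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    G = I - np.ones((N, N)) / N
    E = np.zeros((N, N))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        E[np.ix_(idx, idx)] = 1.0 / len(idx)
    L = I - E
    H = G - L
    V = theta * M + (1.0 - theta) * L

    # Step 4: generalized eigenproblem K H K b = lambda K V K b; keep the top d.
    A = K @ H @ K
    Bmat = K @ V @ K + reg * np.eye(N)                  # small ridge (assumption)
    vals, vecs = eigh(A, Bmat)
    B = vecs[:, np.argsort(vals)[::-1][:d]]             # columns b_1, ..., b_d

    # Step 5: nonlinear features of the training points, y_i^n = sum_j b_j^n K_ij.
    Y = (K @ B).T                                       # d x N
    return B, Y, K
```

With the settings reported in Section 4 (k = 11, sigma2 = 1e4, theta = 0.5), kgad_fit(X, labels, d) would return the coefficient matrix B of Step 4 together with the d-dimensional training features of Step 5.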
3. Justification

Linear features $y_i$ are obtained by a linear transformation matrix $A = [a_1, \ldots, a_d]$: $y_i = A^{\mathrm T} x_i$, $i = 1, \ldots, N$. Nonlinear features are extracted by implicitly mapping the input space into a high-dimensional feature space through the kernel matrix $K$. The justification directly related to Steps 3–5 is given in Section 3.2, which builds on the derivation of the linear form in Section 3.1.

3.1. The linear form

The basic idea behind LLE is that the same weights $W_{ij}$ that reconstruct the $i$th data point $x_i$ in $D$ dimensions should also reconstruct its embedded coordinates $y_i$ in $d$ dimensions. Hence, the $y_i$ can be obtained by minimizing the embedding cost function
$J_2(Y) = \sum_{i=1}^{N} \lVert y_i - \sum_{j=1}^{k} W_{ij}\, y_j \rVert^2 = \operatorname{trace}(YMY^{\mathrm T})$, where $M = (I - W)^{\mathrm T}(I - W)$.
Note that $J_2$ depends only on the weights $W_{ij}$. This leads to the out-of-sample problem: the embedding of samples outside the training set cannot be computed directly. To solve this problem, we propose to substitute $y_i = A^{\mathrm T} x_i$ into $J_2$, which gives
$J_3(A) = \operatorname{trace}\big((A^{\mathrm T} X)\, M\, (A^{\mathrm T} X)^{\mathrm T}\big) = \operatorname{trace}\big(A^{\mathrm T} (XMX^{\mathrm T}) A\big)$.
By virtue of its similarity to $J_2$, the subspace obtained by minimizing $J_3$ has the same neighborhood-preserving property as LLE.

To incorporate discriminant information, we add a new term, related to the Fisher criterion, to $J_3$. According to the Fisher criterion, the ratio of the between-class scatter to the within-class scatter should be maximized. Alternatively, we can minimize the within-class scatter while constraining the between-class scatter. Therefore, a new objective function is formed:
$J_4(A) = \theta\,\operatorname{trace}\big(A^{\mathrm T} (XMX^{\mathrm T}) A\big) + (1-\theta)\,\operatorname{trace}\big(A^{\mathrm T} S_w A\big)$, subject to $A^{\mathrm T} S_b A = I$,
where $S_w$ and $S_b$ denote the within-class and between-class scatter matrices, respectively, and $0 \le \theta \le 1$ is a weight parameter that balances the contributions of the intrinsic geometric structure and the class-specific information. The constrained minimization can be carried out with Lagrange multipliers:
$L(a_i) = \theta\, a_i^{\mathrm T} (XMX^{\mathrm T}) a_i + (1-\theta)\, a_i^{\mathrm T} S_w a_i + \lambda\,(1 - a_i^{\mathrm T} S_b a_i)$.
Computing the gradient with respect to $a_i$ and setting it to zero yields the minimum eigenvalue problem

$\big(\theta (XMX^{\mathrm T}) + (1-\theta) S_w\big)\, a_i = \lambda_i\, S_b\, a_i$, with $\lambda_1 < \lambda_2 < \cdots < \lambda_d$,   (1)

or, equivalently, the maximum eigenvalue problem

$S_b\, a_i = \lambda_i\, \big(\theta (XMX^{\mathrm T}) + (1-\theta) S_w\big)\, a_i$, with $\lambda_1 > \lambda_2 > \cdots > \lambda_d$.   (2)

3.2. The nonlinear form

We now generalize the linear form to a nonlinear one. We first reformulate Eq. (2) in a compact manner and then solve the corresponding equation in a Hilbert space $F$, which is related to the input space by a nonlinear map $\phi\colon x \in \mathbb{R}^D \mapsto \phi(x) \in \mathbb{R}^f$. Note that $F$, which we will refer to as the feature space, can have an arbitrarily large, possibly infinite, dimensionality. Following [2], the total scatter matrix $S_t$, the within-class scatter matrix $S_w$, and the between-class scatter matrix $S_b$ can be written, respectively, as

$S_t = \sum_{i=1}^{N} (x_i - u)(x_i - u)^{\mathrm T} = X\big(I - (1/N)\,ee^{\mathrm T}\big)X^{\mathrm T} = XGX^{\mathrm T}$,   (3)

$S_w = \sum_{c=1}^{C} \sum_{x \in X_c} (x - u_c)(x - u_c)^{\mathrm T} = X(I - E)X^{\mathrm T} = XLX^{\mathrm T}$,   (4)

$S_b = S_t - S_w = X(G - L)X^{\mathrm T} = XHX^{\mathrm T}$,   (5)

where $u$ is the global mean, $u_c$ is the mean of the $c$th class, $G = I - (1/N)\,ee^{\mathrm T}$, $L = I - E$, and $H = G - L$; the vector $e$ serves as a summing vector. We can then rewrite Eq. (2) as $XHX^{\mathrm T} a = \lambda\,\big[\theta (XMX^{\mathrm T}) + (1-\theta) XLX^{\mathrm T}\big] a$, or equivalently as

$XHX^{\mathrm T} a = \lambda\, X\big(\theta M + (1-\theta) L\big)X^{\mathrm T} a$.   (6)

Defining $V \triangleq \theta M + (1-\theta) L$, we have

$XHX^{\mathrm T} a = \lambda\, XVX^{\mathrm T} a$.   (7)
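Before kernelizing, the linear problem of Eq. (7) can already be solved directly in the input space. The sketch below does this with a generalized symmetric eigensolver; it is our own illustration under the notation above (the LLE weight matrix W is assumed to be precomputed in the input space, the function name linear_kgad is ours, and a small ridge term keeps the right-hand matrix positive definite, a detail the paper does not discuss).

```python
import numpy as np
from scipy.linalg import eigh

def linear_kgad(X, W, labels, d, theta=0.5, reg=1e-6):
    """Solve Eq. (7): X H X^T a = lambda X V X^T a, keeping the top-d eigenvectors.

    X is D x N (columns are samples), W is the N x N LLE reconstruction-weight
    matrix computed in the input space, labels has length N.
    """
    N = X.shape[1]
    I = np.eye(N)

    M = (I - W).T @ (I - W)                    # LLE embedding matrix
    G = I - np.ones((N, N)) / N                # centering matrix, S_t = X G X^T
    E = np.zeros((N, N))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        E[np.ix_(idx, idx)] = 1.0 / len(idx)
    L = I - E                                  # S_w = X L X^T
    H = G - L                                  # S_b = X H X^T
    V = theta * M + (1.0 - theta) * L

    Sb = X @ H @ X.T
    Sv = X @ V @ X.T + reg * np.eye(X.shape[0])   # ridge for definiteness (assumption)
    vals, vecs = eigh(Sb, Sv)                  # generalized symmetric eigenproblem
    A = vecs[:, np.argsort(vals)[::-1][:d]]    # columns a_1, ..., a_d
    return A                                   # linear features: y_i = A^T x_i
```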
We now attempt to solve Eq. (7) in the feature space $F$ by introducing the nonlinear mapping $\phi$:
$[\phi(x_1), \ldots, \phi(x_N)]\, H\, [\phi(x_1), \ldots, \phi(x_N)]^{\mathrm T} a = \lambda\, [\phi(x_1), \ldots, \phi(x_N)]\, V\, [\phi(x_1), \ldots, \phi(x_N)]^{\mathrm T} a$.
Since the eigenvectors are linear combinations of the mapped training points, there exist coefficients $b_i$ ($i = 1, \ldots, N$) such that $a = \sum_{i=1}^{N} b_i\, \phi(x_i)$. Thus,
$[\phi(x_1), \ldots, \phi(x_N)]\, H\, [\phi(x_1), \ldots, \phi(x_N)]^{\mathrm T} \sum_{i=1}^{N} b_i\, \phi(x_i) = \lambda\, [\phi(x_1), \ldots, \phi(x_N)]\, V\, [\phi(x_1), \ldots, \phi(x_N)]^{\mathrm T} \sum_{i=1}^{N} b_i\, \phi(x_i)$.
Multiplying both sides by $\phi(x_j)^{\mathrm T}$, we get
$[\phi(x_j)\cdot\phi(x_1), \ldots, \phi(x_j)\cdot\phi(x_N)]\, H \sum_{i=1}^{N} b_i\, [\phi(x_1)\cdot\phi(x_i), \ldots, \phi(x_N)\cdot\phi(x_i)]^{\mathrm T} = \lambda\, [\phi(x_j)\cdot\phi(x_1), \ldots, \phi(x_j)\cdot\phi(x_N)]\, V \sum_{i=1}^{N} b_i\, [\phi(x_1)\cdot\phi(x_i), \ldots, \phi(x_N)\cdot\phi(x_i)]^{\mathrm T}$.

We have shown that, given the matrices $H$ and $V$, all the information the method needs is the inner products $\phi(x_i)\cdot\phi(x_j)$ between data points in the feature space $F$. The cost of evaluating each inner product directly is, however, usually proportional to the dimension of the feature space. The inner products can instead be computed as a direct function of the input features, without explicitly computing the mapping $\phi$; a function that performs this direct computation is known as a kernel function. Given data $\{x_i\}_{i=1}^{N} \subset \mathbb{R}^D$, a kernel function $k$ is defined as $k\colon \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$, $(x_i, x_j) \mapsto k(x_i, x_j)$, and it satisfies $k(x_i, x_j) = \phi(x_i)\cdot\phi(x_j)$. Mercer's theorem of functional analysis gives conditions under which the mapping $\phi$ can be constructed from the eigenfunction decomposition of $k$ (refer to Appendix C of [7] for details). The polynomial and Gaussian kernels are among the classical kernels that meet these conditions. The Gaussian kernel $k(x_i, x_j) = \exp(-\lVert x_i - x_j \rVert^2/\sigma^2) = \phi(x_i)\cdot\phi(x_j)$ was adopted in our experiments. Although it can be hard to give the exact form of $\phi$ (see p. 297 of [8]), Mercer's theorem tells us that we need not do so. The parameter $\sigma$ controls the flexibility of the kernel and was determined empirically in our experiments. Note that the corresponding feature space has infinite dimension for every value of $\sigma$ [8].

By defining the kernel matrix $K \in \mathbb{R}^{N \times N}$ with elements $K_{ij} = k(x_i, x_j)$ and the vector $b = [b_1, \ldots, b_N]^{\mathrm T}$, the above equation can be reformulated as

$(KHK)\, b = \lambda\, (KVK)\, b$.   (8)

Note that this $K$ is identical to the $K$ used in Section 2. The computation is thus reduced to the generalized eigenvalue problem of Eq. (8). Once the solutions of Eq. (8) are found, the nonlinear features are given by $y_i^n = a_n \cdot \phi(x_i) = \sum_{j=1}^{N} b_j^n K_{ij}$, where $z^n$ denotes the $n$th element of a vector $z$. As stated in Section 1, we denote the proposed method by KGAD.
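Because $a_n = \sum_j b_j^n\, \phi(x_j)$, the same projection rule extends to unseen samples through kernel evaluations against the training set, which is exactly the out-of-sample behavior motivated in Section 1. The sketch below illustrates this reading; the function names (gaussian_kernel_rows, kgad_transform) are ours, and the extension of Step 5 to test points is our interpretation rather than a formula stated in the paper.

```python
import numpy as np

def gaussian_kernel_rows(X_test, X_train, sigma2):
    """k(x, x_j) = exp(-||x - x_j||^2 / sigma^2); rows index test points."""
    d2 = (np.sum(X_test**2, axis=0)[:, None]
          + np.sum(X_train**2, axis=0)[None, :]
          - 2.0 * (X_test.T @ X_train))
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

def kgad_transform(X_test, X_train, B, sigma2=1e4):
    """Project test images: y^n = a_n . phi(x) = sum_j B[j, n] * k(x, x_j).

    X_test is D x N_test, X_train is D x N, and B is the N x d coefficient
    matrix obtained from Eq. (8). Returns a d x N_test feature matrix.
    """
    K_test = gaussian_kernel_rows(X_test, X_train, sigma2)   # N_test x N
    return (K_test @ B).T                                    # d x N_test
```

Classification can then be performed with a nearest-neighbor rule in the d-dimensional feature space, as done in Section 4.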
4. Experimental results

A large subset of the AR face database [3] was used to evaluate the proposed method (KGAD). One hundred and seventeen subjects were selected from the total of 126 subjects. Images in the AR database were recorded in two different sessions; only the seven non-occluded images per person from the first session (e.g., Fig. 1) are used in our experiments. Among the seven images per person, two were used for training and five for testing. There are 21 different ways of selecting two images for training and five for testing from the seven images. As in [4], the resulting data sets were indexed 1, ..., 21, and the results for the ith data set are denoted test #i. The images were cropped based on the centers of the eyes, resized to 60 × 60 pixels, and then normalized to have zero mean and unit variance. The parameter σ² of the Gaussian kernel was set to 10,000, and the number of neighbors k was set to 11. To show how the weight parameter θ affects the recognition performance, we randomly selected two data sets and varied θ from 0.0 to 1.0 in steps of 0.1. Fig. 2 shows the average recognition rate; θ = 0.5 achieves the best result. We then set θ = 0.0, 0.5, and 1.0 and conducted experiments on all 21 data sets.
Fig. 1. Example images of one subject used in the experiments.
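For concreteness, the 21 train/test splits described above are the C(7, 2) = 21 ways of choosing two of the seven images per person for training, with the remaining five used for testing. The snippet below sketches this bookkeeping together with the per-image normalization (zero mean, unit variance); it is illustrative only and assumes the images have already been cropped and resized to 60 × 60 pixels.

```python
import numpy as np
from itertools import combinations

def normalize_image(img):
    """Flatten a cropped 60x60 face and normalize it to zero mean, unit variance."""
    v = img.astype(np.float64).ravel()
    return (v - v.mean()) / (v.std() + 1e-12)

def ar_splits(n_images_per_person=7, n_train=2):
    """Enumerate the 21 (train, test) index splits used to build data sets #1..#21."""
    all_idx = set(range(n_images_per_person))
    for train_idx in combinations(range(n_images_per_person), n_train):
        test_idx = sorted(all_idx - set(train_idx))
        yield list(train_idx), test_idx

# len(list(ar_splits())) == 21, matching data sets test #1 through test #21.
```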
Fig. 2. Recognition rates as a function of the weight parameter θ.

Fig. 3 shows the results. In most cases, as can be seen, θ = 0.5 gives the best accuracy, so we fixed θ = 0.5 for the proposed method, KGAD.

Fig. 3. Recognition rates on the different data sets for three typical weight values (θ = 0.0, 0.5, 1.0).

Fig. 4 shows the average accuracy over the 21 data sets for PCA (eigenface), LDA (fisherface), KPCA [7], KDA [5], and KGAD. KGAD clearly outperforms all the other methods. These results provide evidence that the nonlinear features extracted by KGAD are more useful for classification. Note that, since our focus is on how the features perform, we employ the simplest classifier, the nearest-neighbor rule, as the final classifier.

Fig. 4. Performance comparison on the AR database (PCA, LDA, KPCA, KDA, and KGAD).

5. Conclusions

We have presented a nonlinear feature extraction method for face recognition. The extracted features make use of not only the intrinsic geometric structure of the data manifold but also the class-specific information. By optimizing the objective function in a kernel feature space, nonlinear features can be extracted efficiently. All the information is encoded in the matrices H and V and the kernel matrix K, and the actual computation reduces to a maximum eigenvalue problem. Although the weight parameter θ is not determined theoretically, the proposed method works well even when simply setting θ = 0.5.
Acknowledgements

The authors are very grateful for the comments and suggestions of Prof. Robert W. Newcomb and the anonymous reviewers. The work was supported by the Open Fund of the MOE-Microsoft Key Laboratory under Grant no. 050718-6 and the Open Fund of the Image Processing & Image Communication Lab, Nanjing University of Posts & Telecommunications, under Grant no. KJS03039.

References

[1] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[2] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[3] A. Martinez, R. Benavente, The AR face database, CVC Technical Report No. 24, 1998, http://rvl1.ecn.purdue.edu/aleix/aleix_face_DB.html.
[4] A. Martinez, A. Kak, PCA versus LDA, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 228–233.
[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, in: Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, vol. IX, Madison, USA, 1999, pp. 41–48.
[6] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[7] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[8] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[9] D. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831–836.
[10] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
Yanwei Pang was born in 1976 in Hebei, China. He received his B.E. degree from Hebei Normal University in 1998, his M.E. degree from the North China Institute of Technology in 2001, and his Ph.D. degree from the University of Science and Technology of China (USTC) in 2004, all in Electronic Engineering. He also spent 10 months at Microsoft Research Asia as a visiting student. He is currently a postdoctoral researcher at USTC. His research interests include face recognition, manifold learning and computer vision.
Zhengkai Liu received his B.E. degree in Electronic Engineering from the University of Science and Technology of China, Hefei, China, in 1964. He is currently a Professor at the Department of Electronic Engineering and Information Science (EEIS), University of Science and Technology of China (USTC). He has been the Chairman of the Research Committee in the Department of EEIS of the USTC, and the head of the Multimedia Communication Lab and the Information Processing Center (USTC). His research interests include artificial neural networks, image processing and computer vision.
Nenghai Yu received his B.E. degree from Nanjing University of Posts and Telecommunications in 1987, his M.E. degree from Tsinghua University in 1992, and his Ph.D. degree from the University of Science and Technology of China in 2004. His research interests include communication networks and multimedia communication.