Neurocomputing 71 (2007) 455–458 www.elsevier.com/locate/neucom
Letters
Kernel subclass discriminant analysis

Bo Chen*, Li Yuan, Hongwei Liu, Zheng Bao

National Lab of Radar Signal Processing, Xidian University, Xi'an, Shaanxi 710071, China

Received 3 February 2007; received in revised form 2 June 2007; accepted 23 July 2007
Communicated by L.C. Jain
Available online 2 September 2007

*Corresponding author. Tel.: +86 29 88209504. E-mail address: [email protected] (B. Chen).
Abstract

In order to overcome the restrictions of linear discriminant analysis (LDA), such as the assumption of multivariate normally distributed classes with a common covariance matrix but different means and the single-cluster structure of each class, subclass discriminant analysis (SDA) was recently proposed. In this paper a kernel version of SDA, called KSDA, is presented. Moreover, we reformulate SDA so as to avoid the complicated derivation in the feature space. Encouraging experimental results on eight UCI data sets demonstrate the efficiency of our method.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Linear discriminant analysis (LDA); Kernel linear discriminant analysis (KLDA); Subclass discriminant analysis (SDA); Feature space; Kernel methods
1. Introduction

Linear discriminant analysis (LDA) is a popular method for linear dimensionality reduction, which maximizes the between-class scatter while minimizing the within-class scatter. However, LDA is optimal only when all classes are generated from underlying multivariate normal distributions with a common covariance matrix but different means, and when each class forms a single cluster [4,12]. Therefore, LDA cannot give the desired results for multimodally distributed data sets, which arise in face recognition [5], radar automatic target recognition [3], and so on. In order to overcome these limitations, Zhu and Martinez [12] recently proposed subclass discriminant analysis (SDA). As its name implies, SDA divides each class into suitable subclasses so as to approximate the underlying distribution with a mixture of Gaussians and then performs LDA among these subclasses; moreover, the authors employ a stability criterion [8] to determine the optimal subclass division. In this letter we develop SDA into kernel SDA (KSDA) in the feature space, which can yield a better subspace for the classification task, since a nonlinear clustering technique can find the underlying
subclasses more accurately in the feature space and nonlinear LDA [5] can provide a nonlinear discriminant hyperplane. Furthermore, a reformulation of SDA is given to avoid the complicated derivation in the feature space.

2. Subclass discriminant analysis

In SDA, a new between-subclass scatter matrix is defined as

S^{(b)}_{\mathrm{SDA}} = \sum_{i=1}^{C-1} \sum_{j=1}^{H_i} \sum_{k=i+1}^{C} \sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (\mu_{ij} - \mu_{kl})(\mu_{ij} - \mu_{kl})^{\mathrm{T}},    (1)
where C is the number of data classes, H_i the number of subclass divisions in class i, n the total number of samples, n_{ij} the number of samples in the jth subclass of class i, and p_{ij} = n_{ij}/n and μ_{ij} are the prior and the mean of the jth subclass of class i, respectively. In order to determine the optimal subclass division, the authors propose a discriminant stability criterion [12,8], which evaluates whether LDA can work:

G = \sum_{i=1}^{m} \sum_{j=1}^{i} \bigl(u_j^{\mathrm{T}} w_i\bigr)^2,    (2)
where w_i is the ith eigenvector of the between-subclass scatter matrix S^{(b)}_{SDA} and u_j the jth eigenvector of the covariance matrix of the data [12] (which is defined as
S^{(m)}_{\mathrm{SDA}} = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\mathrm{T}}, with μ the global mean), and m ≤ rank(S^{(b)}_{SDA}). The smaller the value of the cost function (2), the better SDA works with the current subclass division. Since SDA reduces to LDA when H_i = 1 for i = 1, 2, …, C, LDA can be regarded as a special case of SDA. From (1) we know that rank(S^{(b)}_{SDA}) ≤ min(H − 1, p), where H = \sum_{i=1}^{C} H_i is the total number of subclasses and p is the dimensionality of the range of the covariance matrix. Therefore, SDA can also overcome the rank deficiency of the ordinary between-class scatter matrix, as shown by the example in [12].
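To make the construction above concrete, the following minimal NumPy sketch (ours, not from the paper; the function names and the choice m = rank(S^{(b)}_{SDA}) for the upper limit in the criterion are illustrative assumptions) computes the between-subclass scatter of (1) and the stability criterion of (2) for a given class/subclass labelling.

import numpy as np

def sda_between_subclass_scatter(X, y, z):
    # Eq. (1): X is (n, d); y[i] is the class label of x_i, z[i] its subclass label within that class.
    n, d = X.shape
    subclasses = []                                   # (class, prior p_ij, mean mu_ij) per subclass
    for c in np.unique(y):
        for s in np.unique(z[y == c]):
            idx = (y == c) & (z == s)
            subclasses.append((c, idx.sum() / n, X[idx].mean(axis=0)))
    Sb = np.zeros((d, d))
    for a in range(len(subclasses)):
        for b in range(a + 1, len(subclasses)):
            ca, pa, ma = subclasses[a]
            cb, pb, mb = subclasses[b]
            if ca != cb:                              # only pairs of subclasses from different classes
                diff = (ma - mb)[:, None]
                Sb += pa * pb * (diff @ diff.T)
    return Sb

def stability_criterion(X, Sb):
    # Eq. (2): G = sum_{i=1}^{m} sum_{j=1}^{i} (u_j^T w_i)^2; the smaller, the better SDA works.
    Xc = X - X.mean(axis=0)
    u = np.linalg.eigh(Xc.T @ Xc)[1][:, ::-1]         # eigenvectors of the data scatter, descending
    w = np.linalg.eigh(Sb)[1][:, ::-1]                # eigenvectors of the between-subclass scatter
    m = np.linalg.matrix_rank(Sb)                     # illustrative choice of the upper limit m
    return sum((u[:, j] @ w[:, i]) ** 2 for i in range(m) for j in range(i + 1))

With H_i = 1 in every class the double sum in (1) reduces to the usual LDA between-class scatter, matching the remark above that LDA is a special case of SDA.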
3. Kernel SDA

In this section we present KSDA, a nonlinear extension of SDA based on kernel functions. The main idea of the kernel method is that, without knowing the nonlinear feature mapping explicitly, we can work in the feature space through kernel functions, as long as the problem formulation depends only on inner products between data points. This is based on the fact that for any kernel function k satisfying Mercer's condition [3] there exists a mapping Φ such that ⟨Φ(x), Φ(y)⟩ = k(x, y), where ⟨·,·⟩ is the inner product in the feature space F induced by Φ. We apply the kernel method to perform SDA in the feature space instead of the original input space.

Given a kernel function k, let Φ be such a mapping. Then S^{(b)}_{KSDA} and S^{(m)}_{KSDA} in the feature space F can be expressed as

S^{(b)}_{\mathrm{KSDA}} = \sum_{i=1}^{C-1} \sum_{j=1}^{H_i} \sum_{k=i+1}^{C} \sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (\Phi_{ij} - \Phi_{kl})(\Phi_{ij} - \Phi_{kl})^{\mathrm{T}},    (3)

S^{(m)}_{\mathrm{KSDA}} = \sum_{i=1}^{n} (\Phi_i - \bar{\Phi})(\Phi_i - \bar{\Phi})^{\mathrm{T}},    (4)

where Φ_i = Φ(x_i), Φ_{ij} denotes the mean vector of the jth subclass of the ith class, and Φ̄ is the global mean in the feature space. Similar to SDA, KSDA maximizes |V^T S^{(b)}_{KSDA} V| / |V^T S^{(m)}_{KSDA} V| to find a transformation matrix V whose columns are the eigenvectors corresponding to the r ≤ min(H − 1, p) largest eigenvalues of

S^{(b)}_{\mathrm{KSDA}} V = \lambda S^{(m)}_{\mathrm{KSDA}} V.    (5)

Let the transformation matrix V be represented as

V = U\alpha,    (6)

where U = [Φ_1, …, Φ_n] and α = [α_1, …, α_r]. Usually one substitutes (3), (4), and (6) into (5) to obtain an equation expressed in terms of the kernel Gram matrix; however, the resulting derivation and the representations of the kernel scatter matrices are complicated and not intuitive [1]. Therefore, in order to simplify the derivation, we first give a new representation of SDA based on the following scatter matrices:

S^{(b)}_{\mathrm{SDA}} = X D^{(b)}_{\mathrm{SDA}} X^{\mathrm{T}},    (7)

S^{(m)}_{\mathrm{SDA}} = X D^{(m)}_{\mathrm{SDA}} X^{\mathrm{T}},    (8)

where
D^{(b)}_{\mathrm{SDA}(i,j)} =
\begin{cases}
(n - n_k)/(n^2 n_{kl}) & \text{if } z_i = z_j = C_{kl}, \\
0 & \text{if } z_i \neq z_j \text{ but } y_i = y_j, \\
-1/n^2 & \text{if } y_i \neq y_j,
\end{cases}    (9)

D^{(m)}_{\mathrm{SDA}(i,j)} =
\begin{cases}
1 - 1/n & \text{if } x_i = x_j, \\
-1/n & \text{otherwise},
\end{cases}    (10)

where X = [x_1, …, x_n], y_i ∈ {1, …, C} is the class label of sample x_i, n_k = \sum_{l=1}^{H_k} n_{kl}, C_{kl} denotes the lth subclass of the kth class, and z_i denotes the subclass to which x_i belongs.
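As a quick sanity check (our own verification, not part of the original text), note that for distinct samples (10) simply says D^{(m)}_{SDA} = I_n − (1/n)·11^T, so (8) reproduces the data scatter used in Section 2:

X D^{(m)}_{\mathrm{SDA}} X^{\mathrm{T}}
  = X\Bigl(I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\mathrm{T}}\Bigr)X^{\mathrm{T}}
  = \sum_{i=1}^{n} x_i x_i^{\mathrm{T}} - n\,\mu\mu^{\mathrm{T}}
  = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\mathrm{T}}
  = S^{(m)}_{\mathrm{SDA}},
\qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i .

An analogous (longer) expansion of (1) recovers the entries of D^{(b)}_{SDA} in (9).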
If SDA is transformed into the feature space, (7) and (8) can be modified as

S^{(b)}_{\mathrm{KSDA}} = U D^{(b)}_{\mathrm{KSDA}} U^{\mathrm{T}},    (11)

S^{(m)}_{\mathrm{KSDA}} = U D^{(m)}_{\mathrm{KSDA}} U^{\mathrm{T}}.    (12)

Substituting (6), (11), and (12) into (5), we obtain

U D^{(b)}_{\mathrm{KSDA}} U^{\mathrm{T}} U\alpha = \lambda\, U D^{(m)}_{\mathrm{KSDA}} U^{\mathrm{T}} U\alpha.    (13)

Multiplying (13) by U^{\mathrm{T}} from the left-hand side, we have

K D^{(b)}_{\mathrm{KSDA}} K\alpha = \lambda\, K D^{(m)}_{\mathrm{KSDA}} K\alpha,    (14)

where K ∈ R^{n×n} is the kernel Gram matrix. Compared with the traditional formulation of GDA [1], this reformulation of KSDA is more concise and easier to work with. Note that D^{(b)}_{KSDA} and D^{(m)}_{KSDA} should reflect the relations of the samples in the feature space; therefore, the subclass division has to be obtained by a kernel clustering technique. In this paper we use the kernel k-means method [10]. In addition, the stability criterion should also be reformulated in the feature space. Although the kernel scatter matrices may have unknown dimensionality, we can compute the eigenvectors corresponding to their leading eigenvalues using the kernel Gram matrix; the details can be found in [9].
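To show how the reformulation is used in practice, here is a minimal NumPy/SciPy sketch (ours; the helper names are illustrative, and the subclass labels z are assumed to come from kernel k-means as described above) that builds D^{(b)}_{KSDA} and D^{(m)}_{KSDA} from the labels and solves the generalized eigenproblem (14).

import numpy as np
from scipy.linalg import eigh

def d_matrices(y, z):
    # Eqs. (9)-(10): the entries depend only on class labels y and subclass labels z,
    # so the same formulas apply in the feature space (11)-(12).
    n = len(y)
    n_class = {c: int(np.sum(y == c)) for c in np.unique(y)}                     # n_k
    n_sub = {(c, s): int(np.sum((y == c) & (z == s))) for c, s in zip(y, z)}     # n_kl
    Db = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if y[i] == y[j] and z[i] == z[j]:          # same subclass C_kl
                Db[i, j] = (n - n_class[y[i]]) / (n ** 2 * n_sub[(y[i], z[i])])
            elif y[i] == y[j]:                         # same class, different subclasses
                Db[i, j] = 0.0
            else:                                      # different classes
                Db[i, j] = -1.0 / n ** 2
    Dm = np.eye(n) - np.full((n, n), 1.0 / n)          # Eq. (10)
    return Db, Dm

def ksda_coefficients(K, y, z, r):
    # Solve Eq. (14): K Db K alpha = lambda K Dm K alpha; return the r leading columns of alpha.
    Db, Dm = d_matrices(y, z)
    A = K @ Db @ K
    B = K @ Dm @ K + 1e-8 * np.eye(len(y))             # small ridge for numerical stability
    evals, evecs = eigh(A, B)                          # generalized symmetric eigenproblem
    return evecs[:, np.argsort(evals)[::-1][:r]]

# A new sample x is then embedded as alpha^T k_x, where k_x = (k(x_1, x), ..., k(x_n, x))^T.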
4. Experimental results

We demonstrate that the proposed KSDA is an effective nonlinear extension of SDA by comparing the classification performance of KSDA with that of other kernel-based nonlinear discriminant analysis algorithms, namely GDA [1], kernel direct LDA (KDDA) [7], complete kernel Fisher discriminant analysis (CKFD) [11], and kernel principal component analysis (KPCA) [10]. Eight data sets collected from the UCI machine learning repository [2] and the ETH-80 database [6] were used, all of which were centered and normalized to zero mean and unit variance. In addition, for the ETH-80 database we downsampled each image to 26 × 26 pixels and used PCA to obtain 80-dimensional samples. In each category the first five objects were used as the training set and the remaining objects as the test set.
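For concreteness, the following sketch outlines the evaluation protocol used in this section: a Gaussian kernel, γ chosen by five-fold cross-validation, and a 3-nearest-neighbor classifier in the reduced space, as described later in this section. It is our own illustration; ksda_coefficients refers to the sketch at the end of Section 3, and the candidate grid of γ values is an assumption.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def ksda_accuracy(X_tr, y_tr, z_tr, X_te, y_te, gamma, r):
    # Train KSDA with a Gaussian kernel and score a 3-NN classifier in the reduced space.
    K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)                 # K(x, y) = exp(-gamma ||x - y||^2)
    alpha = ksda_coefficients(K_tr, y_tr, z_tr, r)             # sketch from Section 3
    K_te = rbf_kernel(X_te, X_tr, gamma=gamma)                 # kernel values against training samples
    knn = KNeighborsClassifier(n_neighbors=3).fit(K_tr @ alpha, y_tr)
    return knn.score(K_te @ alpha, y_te)

def select_gamma(X, y, z, gammas, r, folds=5):
    # Five-fold cross-validation over a candidate grid of gamma values.
    # For simplicity the subclass labels z are kept fixed here; in the paper they are
    # re-estimated by kernel k-means for every kernel under consideration.
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    mean_acc = {g: np.mean([ksda_accuracy(X[tr], y[tr], z[tr], X[te], y[te], g, r)
                            for tr, te in cv.split(X, y)]) for g in gammas}
    return max(mean_acc, key=mean_acc.get)

Here r would be set to rank(S^{(b)}_{KSDA}), as described below.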
Table 1
Recognition accuracies (%). For the Letter data set only the samples of 'A', 'B' and 'C' were used.

Dataset      LDA    SDA    KPCA   KLDA   KDDA   CKFD   KSDA
Ionosphere   89.7   89.7   95.4   96     96.6   97.1   97.1
Monks        57.8   61.0   54.6   75     67.9   81.5   87.3
Pima         71.7   74.1   73.5   75.3   74.7   77.4   78.1
Liver        66.9   66.9   56.4   70.9   73.3   73.8   73.8
WDBC         94.0   95.0   92.3   96.1   94.0   97.5   96.1
WBC          96.7   96.2   96.5   97.1   97.1   96.8   97.1
Letter       97.7   97.7   86.8   99.3   99.3   99.4   99.7
DNA          88.3   88.3   70.9   88.6   90.4   90.4   89.0
ETH-80       60.2   62.0   55.5   62.9   62.0   59.0   66.9
Table 2
The number of subclasses in each data class

Method   Ionosphere   Monks   Pima   Liver   WDBC   WBC   Letter   DNA   ETH-80
SDA      1            2       2      1       8      5     1        1     5
KSDA     6            2       5      2       1      1     5        4     5
In this experiment we only consider the Gaussian kernel K(x, y) = exp(−γ‖x − y‖²). For all the methods, the optimal value of γ was determined by five-fold cross-validation. Since, as mentioned above, rank(S^{(b)}_{KSDA}) ≤ min(H − 1, p), the number of reduced dimensions in KSDA is not less than that of KLDA. Hence we selected rank(S^{(b)}_{KSDA}) as the reduced dimension of KSDA, and also as that of KPCA; similarly, rank(S^{(b)}_{SDA}) was chosen as the reduced dimension of SDA. After dimension reduction, a k-nearest-neighbor classifier with k = 3 was used.

Table 1 shows the recognition accuracies obtained by the seven different methods. The experimental results demonstrate that KSDA and CKFD achieve superior recognition accuracies compared with the other methods. However, CKFD requires optimizing three parameters: the kernel parameter γ, the dimensionality of the projection subspace, and a fusion coefficient that combines the results obtained in the range and null spaces of the within-class scatter matrix. Although this brings more freedom, the computational cost of CKFD is rather high, whereas in KSDA only the kernel parameter γ needs tuning. Table 2 shows the number of subclasses in each data class. According to the motivation of SDA, its performance ideally cannot be worse than that of LDA, since the optimal subclass division is determined automatically.
Fig. 1. The distributions of Monks and Iris data sets after SDA and KSDA in the two-dimensional feature space. (a) SDA on Monks data set, (b) KSDA on Monks data set, (c) SDA on Iris data set, (d) KSDA on Iris data set.
However, there is an important premise: the underlying distribution of each data class must be well approximated. The structures of the data classes are sometimes too complex to be captured by clustering techniques in the input space, which may lead to a poor subspace. From the results for WBC in Tables 1 and 2, we find that SDA even performs worse than LDA, while KSDA achieves the best classification performance, together with KLDA, because the kernel stability criterion exploits the single-modal structure of each class in the feature space. In detail, KSDA first maps the data into a suitable high-dimensional feature space using the kernel trick, which usually simplifies the relations and structures of the data classes [1,10], i.e., it enlarges the margins between different classes, may even increase their linear separability, and makes the structure of each class easier to exploit for classification. Then, in this kernel-induced feature space, SDA is applied to find a nonlinear discriminant hyperplane.

Fig. 1 also shows the embedded samples in the two-dimensional space obtained by SDA and KSDA. We see that for the Monks data set SDA mixes samples of different classes, while in the feature space KSDA separates the classes very well. Moreover, a multimodal structure within each class is clearly observed in the results of KSDA, which implies that the complex structures and relations of the data classes in the input space may limit the classification performance of SDA. Similar results are observed for the Iris data set, as shown in Fig. 1(c) and (d).
5. Conclusions

We have introduced KSDA, the kernel version of SDA, which takes advantage of nonlinear clustering techniques and nonlinear LDA to find the underlying distribution of each data class in the feature space and to provide a nonlinear discriminant hyperplane. Meanwhile, a reformulation of KSDA is given that makes the computation more intuitive and simpler. The comparison with other methods on classification tasks demonstrates that KSDA is an effective nonlinear dimensionality reduction method.

Acknowledgments

The authors thank the reviewers for their helpful comments, which greatly improved the article. This work was supported by the National Science Foundation of China under Grant 60302009.

References

[1] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Comput. 12 (2000) 2385–2404.
[2] C. Blake, E. Keogh, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998. Available from <http://www.ics.uci.edu/mlearn>.
[3] B. Chen, H. Liu, Z. Bao, A kernel optimization method based on the localized kernel Fisher criterion, Pattern Recognition (2007), doi:10.1016/j.patcog.2007.08.009.
[4] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Boston, 1990.
[5] T.K. Kim, J. Kittler, Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image, IEEE Trans. PAMI 27 (3) (2005) 318–327.
[6] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2003.
[7] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognition using kernel discriminant analysis algorithms, IEEE Trans. Neural Networks 14 (1) (2003) 117–126.
[8] A.M. Martinez, M. Zhu, Where are linear feature extraction methods applicable?, IEEE Trans. PAMI 27 (12) (2005) 1934–1944.
[9] S. Roweis, Finding the First Few Eigenvectors in a Large Space. Available from <http://www.cs.toronto.edu/roweis/>.
[10] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[11] J. Yang, A.F. Frangi, J. Yang, D. Zhang, Z. Jin, KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Trans. PAMI 27 (2) (2005) 230–244.
[12] M. Zhu, A.M. Martinez, Subclass discriminant analysis, IEEE Trans. PAMI 28 (8) (2006) 1274–1286.

Bo Chen received his B.Eng. and M.Eng. degrees in electronic engineering from Xidian University in 2003 and 2006, respectively. He is currently a Ph.D. student in the National Lab of Radar Signal Processing, Xidian University. His research interests include radar signal processing, radar automatic target recognition, and kernel machines.
Li Yuan is currently a Ph.D. student in the National Lab of Radar Signal Processing, Xidian University. Her research interests include radar signal processing and radar automatic target recognition.
Hongwei Liu received his M.S. and Ph.D. degrees, both in electronic engineering, from Xidian University in 1995 and 1999, respectively. He has been with the National Lab of Radar Signal Processing, Xidian University, since 1999. From 2001 to 2002, he was a visiting scholar at the Department of Electrical and Computer Engineering, Duke University, USA. He is currently a professor and the director of the National Key Lab of Radar Signal Processing, Xidian University. His research interests are radar automatic target recognition, radar signal processing, and adaptive signal processing.

Zheng Bao graduated from the Communication Engineering Institution of China in 1953. Currently, he is a professor at Xidian University and an academician of the Chinese Academy of Sciences. He is the author or co-author of six books and has published more than 300 papers. His research work now focuses on the areas of space-time adaptive processing, radar imaging, and radar automatic target recognition.