Neurocomputing 73 (2010) 3334–3337
Reformative nonlinear feature extraction using kernel MSE

Qi Zhu

Key Laboratory of Network Oriented Intelligent Computation, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
Article history: Received 22 May 2009; received in revised form 14 March 2010; accepted 2 April 2010; available online 12 May 2010. Communicated by M.S. Bartlett.

Abstract

In this paper, we propose an efficient nonlinear feature extraction method using kernel-based minimum squared error (KMSE). The improved method is referred to as reformative KMSE (RKMSE). In RKMSE, we use a linear combination of a small portion of samples selected from the training sample set, the "significant nodes", to approximate the transform vector of KMSE in kernel space. As a result, RKMSE is much superior to naive KMSE in the computational efficiency of feature extraction. Experimental results on several benchmark datasets illustrate that RKMSE classifies the data efficiently with a high recognition rate.

Keywords: Pattern classification; Feature extraction; Nonlinear discriminant analysis; Kernel MSE; Reformative kernel MSE
1. Introduction

KMSE has attracted much attention and has been applied to many pattern recognition problems [1,2] because of its concise formal description and high computational efficiency. KMSE is a regression method in kernel space, and two procedures are implicitly contained in its implementation. The first procedure transforms the original sample space into a high-dimensional kernel space via a nonlinear mapping induced by the kernel function [3], and the second procedure carries out minimum squared error (MSE) regression in the kernel space. For two-class classification problems, the two class labels are coded as '1' and '−1', respectively. Under the assumption that each sample in kernel space can be transformed to its class label by a linear transform, KMSE establishes a linear equation group with all the training samples, containing l equations (l is the number of training samples). The least squares solution of this equation group is taken as the linear transform vector. KMSE views the dot product between a sample and the transform vector as the feature extraction result. Because the explicit form of the sample in kernel space is unknown, this result cannot be computed directly from the above definition; by the reproducing kernel theory, however, it is equal to a combination of kernel functions, and the details of the derivation are given in the next section. From the above implementation we see that KMSE is essentially a regression method in kernel space. In KMSE, the independent variables
are the features of the sample in kernel space, and the dependent variable is its class label. The most obvious difference between ordinary regression and KMSE is that the dependent variable in KMSE is discrete. Once we obtain the feature extraction result Y(x) of a testing sample x using KMSE, it is easy to classify it: if Y(x) > 0, we assign x to the first class; otherwise, we assign it to the second class. KMSE has several advantages. Firstly, compared with ordinary nonlinear methods, KMSE has a lower computational cost of feature extraction, because it does not need to implement explicitly the nonlinear mapping induced by the kernel function. Secondly, compared with most kernel-based methods, KMSE has higher computational efficiency in the training and classification procedures. In training, the primary problem of the nonlinear support vector machine (SVM) is solving a quadratic optimization problem [3], and the primary problem of kernel Fisher discriminant analysis (KFDA) is computing eigenvectors [4]; KMSE, in contrast, only needs to solve a linear equation group, which is much simpler than either of these problems. In classification, even though the KMSE classifier described above is essentially a special case of the nearest neighbour classifier, KMSE does not need to compute the distances between the testing sample and all the training samples. In addition, it has been proven that KMSE is a unified framework of KFDA, least squares SVM (LS-SVM) and kernel ridge regression (KRR) [1,2]. Hence, KMSE plays an important role in the field of pattern recognition. However, the efficiency of KMSE-based feature extraction is constrained by the size of the training sample set. Indeed, as shown by the reproducing kernel theory, the transform vector of naive
KMSE is a linear combination of all the training samples in kernel space. In order to extract the feature of an arbitrary sample, we must first compute the kernel functions between this sample and all the training samples [5]. Consequently, if the training sample set is very large, the feature extraction procedure of KMSE will be very time consuming or even unfeasible, which is disadvantageous for applications of KMSE. Hence, it is necessary and important to accelerate KMSE-based feature extraction. The same problem of inefficient feature extraction occurs in other naive kernel methods, and many researchers have studied it. A basic principle in practical nonlinear data modeling is the parsimonious principle, which seeks the smallest possible model that explains the data [6]. Combining this idea with the nature of kernel methods, Xu et al. developed the reformative KFDA [5] and IKPCA [7] to speed up their feature extraction; however, these techniques are unsuitable for optimizing other kernel methods such as KMSE. We also note that regularization is an effective and widely used technique to improve the robustness of kernel methods. Adopting Bayesian learning theory, Chen et al. proposed the locally regularized orthogonal least squares (LROLS) algorithm [8], which introduces an individual regularizer for each weight. However, the orthogonal decomposition that it uses is sensitive to progressive error. In this paper, we propose to use a linear combination of "significant nodes" to approximate the transform vector of KMSE in kernel space, which overcomes the inefficiency of naive KMSE-based feature extraction. The "significant node" selection criterion is as follows: in KMSE, the feature extraction result of a sample x is a linear combination of the kernel functions between x and all the training samples, and the absolute value of the combination coefficient attached to the kernel function of x and a given training sample shows how much that training sample contributes to the feature extraction result of x. In addition, we show that KMSE is equivalent to a neural network, so our improvement can also be viewed as a simplification of this network, which improves the robustness and generalization of KMSE.
2. KMSE model

In this paper, we consider the two-class classification task. Assume that the training samples $x_1, x_2, \ldots, x_{l_1}$ belong to the first class (coded as '1') and the others $x_{l_1+1}, \ldots, x_{l_1+l_2}$ belong to the second class (coded as '−1'), where $l_1 + l_2 = l$. By the nonlinear mapping $\phi$, the original space $R$ ($x_i \in R$) is transformed into a high-dimensional kernel space $F$ ($\phi(x_i) \in F$). $\phi(x_1), \phi(x_2), \ldots, \phi(x_l)$ are the training samples in kernel space, and $\phi_i(x)$ denotes the $i$th feature of sample $x$ in kernel space. We consider the input-output data model

$$Y(x) = w_0 + \sum_i w_i \phi_i(x) + e, \qquad (1)$$

where $x$, $Y(x)$ and $e$ are the system input, output and error, respectively. With all the training samples, this model can be formulated as

$$Y = \Phi W + E, \qquad (2)$$

where

$$\Phi = \begin{bmatrix} 1 & \phi(x_1)^T \\ \vdots & \vdots \\ 1 & \phi(x_{l_1})^T \\ 1 & \phi(x_{l_1+1})^T \\ \vdots & \vdots \\ 1 & \phi(x_l)^T \end{bmatrix}, \qquad W = \begin{bmatrix} w_0 \\ w \end{bmatrix}, \qquad E = \begin{bmatrix} e_1 \\ \vdots \\ e_{l_1} \\ e_{l_1+1} \\ \vdots \\ e_l \end{bmatrix}, \qquad Y = \begin{bmatrix} 1 \\ \vdots \\ 1 \\ -1 \\ \vdots \\ -1 \end{bmatrix},$$

and $w$ is a column vector whose dimension is equal to that of $\phi(x)$.
According to the reproducing kernel theory, there exist coefficients $a_i$, $i = 1, 2, \ldots, l$, satisfying

$$w = \sum_{i=1}^{l} a_i \phi(x_i). \qquad (3)$$

Substituting Eq. (3) into Eq. (2), we obtain

$$KA + E = Y, \qquad A = (w_0\; a_1\; a_2\; \cdots\; a_l)^T, \qquad (4)$$

where

$$K = \begin{bmatrix} 1 & k(x_1, x_1) & \cdots & k(x_1, x_l) \\ 1 & k(x_2, x_1) & \cdots & k(x_2, x_l) \\ \vdots & \vdots & & \vdots \\ 1 & k(x_l, x_1) & \cdots & k(x_l, x_l) \end{bmatrix}.$$

$K$ is referred to as the kernel matrix of KMSE. Eq. (4) can be solved under the least squared error criterion. Suppose that $K^T K$ is nonsingular; then

$$A = (K^T K)^{-1} K^T Y. \qquad (5)$$

Indeed, Eq. (5) is also the solution that minimizes $(Y - KA)^T (Y - KA)$. For an arbitrary sample $x$, its projection along $A$, i.e. its feature extraction result obtained using KMSE, is

$$Y(x) = w_0 + \sum_{i=1}^{l} a_i k(x, x_i). \qquad (6)$$

The design of the classifier for KMSE has been introduced in Section 1. The number of coefficients to be determined, $w_0, a_1, \ldots, a_l$, is $l+1$, which is larger than the number of equations. Hence, Eq. (4) is an underdetermined system and has no unique solution; even though a numerical solution can be obtained via Eq. (5), it tends to generalize poorly. Moreover, in the feature extraction procedure, KMSE must compute the kernel functions between the given sample and all the training samples, so the efficiency of KMSE-based feature extraction decreases as the number of training samples increases.
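To make Eqs. (4)-(6) concrete, the following sketch (not the paper's code; a minimal NumPy illustration under an assumed Gaussian kernel and toy data) builds the augmented kernel matrix of Eq. (4), obtains the coefficient vector of Eq. (5) with a least squares call (which also covers the case where $K^T K$ is close to singular), and evaluates the feature of Eq. (6) for a new sample.

```python
import numpy as np

def gaussian_kernel(x, y, sigma2=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma2))

def kmse_train(X, labels, sigma2=1.0):
    """Eq. (5): least squares solution of K A = Y for the l x (l+1) matrix K of Eq. (4).
    X: (l, d) training samples; labels: (l,) vector of +1 / -1 class codes."""
    l = X.shape[0]
    K = np.ones((l, l + 1))                      # first column of ones carries the bias w0
    for i in range(l):
        for j in range(l):
            K[i, j + 1] = gaussian_kernel(X[i], X[j], sigma2)
    A, *_ = np.linalg.lstsq(K, labels.astype(float), rcond=None)
    return A                                     # A = [w0, a_1, ..., a_l]

def kmse_feature(x, X, A, sigma2=1.0):
    """Eq. (6): Y(x) = w0 + sum_i a_i k(x, x_i); the sign of Y(x) gives the class decision."""
    w0, a = A[0], A[1:]
    return w0 + sum(a[i] * gaussian_kernel(x, X[i], sigma2) for i in range(len(a)))

# toy usage: first class (coded +1) around the origin, second class (coded -1) around (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
A = kmse_train(X, y)
print(kmse_feature(np.array([3.0, 3.0]), X, A))  # expected to be negative, i.e. second class
```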
3. RKMSE algorithm

3.1. RKMSE model based on "significant nodes"

To construct a more efficient feature extraction model based on KMSE, we give the following proposition:

Proposition 1. The transform vector of KMSE can be approximately expressed by a combination of a small portion of the training samples, instead of all training samples, in kernel space.

We assume that the "significant nodes" $x_1', x_2', \ldots, x_r'$, i.e. the training samples contributing the most to the feature extraction result, have been selected from the training set. According to Proposition 1, there is a linear combination of the $r$ "significant nodes", $\sum_{i=1}^{r} b_i \phi(x_i')$, which can well approximate $w$ in Eq. (2). In other words,

$$w \approx \sum_{i=1}^{r} b_i \phi(x_i'), \qquad r < l. \qquad (7)$$

Substituting Eq. (7) into Eq. (2), we obtain the simplified KMSE data model

$$\begin{bmatrix} 1 & k(x_1, x_1') & \cdots & k(x_1, x_r') \\ 1 & k(x_2, x_1') & \cdots & k(x_2, x_r') \\ \vdots & \vdots & & \vdots \\ 1 & k(x_l, x_1') & \cdots & k(x_l, x_r') \end{bmatrix} \begin{bmatrix} w_0 \\ b_1 \\ b_2 \\ \vdots \\ b_r \end{bmatrix} + E' = Y, \qquad (8)$$

where $E'$ is the error of the new model. The simplified kernel matrix of KMSE (the first matrix in Eq. (8)) is denoted by $K_r$.
We refer to the new KMSE model based on $K_r$ as RKMSE. The coefficient vector $[w_0\; b_1\; b_2\; \cdots\; b_r]^T$ is denoted by $A_r$ and can be solved using

$$A_r = (K_r^T K_r)^{-1} K_r^T Y. \qquad (9)$$
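As an illustration of Eqs. (8) and (9) only (not the paper's implementation), the sketch below assumes the indices of the $r$ "significant nodes" are already known, builds $K_r$, and solves the reduced system by least squares; the kernel is passed in as a function, and all names are placeholders.

```python
import numpy as np

def reduced_kernel_matrix(X, node_idx, kernel):
    """K_r of Eq. (8): an l x (r+1) matrix [1 | k(x_i, x'_j)] built only from the
    selected 'significant nodes' X[node_idx]."""
    l, r = X.shape[0], len(node_idx)
    Kr = np.ones((l, r + 1))
    for i in range(l):
        for j, idx in enumerate(node_idx):
            Kr[i, j + 1] = kernel(X[i], X[idx])
    return Kr

def rkmse_solve(X, labels, node_idx, kernel):
    """Eq. (9): A_r = (K_r^T K_r)^{-1} K_r^T Y, computed here via a least squares call."""
    Kr = reduced_kernel_matrix(X, node_idx, kernel)
    Ar, *_ = np.linalg.lstsq(Kr, labels.astype(float), rcond=None)
    return Ar                                    # A_r = [w0, b_1, ..., b_r]

# hypothetical usage with a Gaussian kernel of unit width:
# Ar = rkmse_solve(X, y, node_idx=[0, 5, 9],
#                  kernel=lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))
```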
3.2. "Significant node" selection algorithm and the RKMSE framework

The "significant nodes" should contribute much to the feature extraction result of every training sample. The elements of the $(i+1)$th kernel column vector $(k(x_1, x_i), k(x_2, x_i), \ldots, k(x_l, x_i))^T$ take different values, but they share the same coefficient $a_i$, so we exploit $a_i$ to evaluate the significance of this column vector in KMSE. The absolute value $|a_i|$ indeed measures the contribution of this kernel column vector to the feature extraction result of each training sample: if $|a_i|$ is smaller than a relatively low threshold, the corresponding column vector contributes little to the feature extraction results of all training samples. If $|a_j|$ is the minimum among $|a_1|, |a_2|, \ldots, |a_l|$, we eliminate the kernel column vector $(k(x_1, x_j), k(x_2, x_j), \ldots, k(x_l, x_j))^T$ from the kernel matrix. Based on the remaining kernel matrix, we construct a new KMSE model using the method of Section 3.1 and compute $A_{l-1}$ for the selection of the next "non-significant node". By running this loop repeatedly, we can eliminate a defined number of "non-significant nodes". Finally, we treat the remaining kernel column vectors as significant kernel column vectors and the corresponding training samples as "significant nodes". The RKMSE-based feature extraction procedure is then as follows:

1. Use the method above to select the significant kernel column vectors $\{(k(x_1, x_i'), k(x_2, x_i'), \ldots, k(x_l, x_i'))^T,\; i = 1, 2, \ldots, r\}$; the corresponding training samples $\{x_1', x_2', \ldots, x_r'\}$ are treated as "significant nodes".
2. Determine the RKMSE model by Eq. (8) and calculate $A_r$.
3. Calculate the feature extraction result $Y(x)$ of an arbitrary sample $x$ using

$$Y(x) = w_0 + \sum_{i=1}^{r} b_i k(x, x_i'). \qquad (10)$$

$Y(x)$ is indeed the projection of the sample $\phi(x)$ along the transform vector in kernel space. Clearly, the computation of Eq. (10) is more efficient than that of Eq. (6) because $r < l$. As a result, the RKMSE-based feature extraction procedure is much more computationally efficient than the KMSE-based one.
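The backward elimination just described can be written compactly as in the sketch below. This is only one reading of the procedure (refit after each removal and drop the column whose coefficient has the smallest absolute value, until a user-chosen number r of nodes remains); the function names and the least squares refit are assumptions, not the paper's code.

```python
import numpy as np

def select_significant_nodes(K, labels, r):
    """Backward elimination of Section 3.2.
    K: l x (l+1) kernel matrix of Eq. (4), first column = ones for w0.
    Repeatedly refit the reduced model and drop the kernel column whose coefficient
    has the smallest absolute value, until only r 'significant nodes' remain."""
    keep = list(range(K.shape[1] - 1))               # indices of candidate training samples
    y = labels.astype(float)
    while len(keep) > r:
        Kc = np.hstack([K[:, [0]], K[:, [i + 1 for i in keep]]])
        A, *_ = np.linalg.lstsq(Kc, y, rcond=None)   # current coefficients [w0, a_i, ...]
        j = int(np.argmin(np.abs(A[1:])))            # least significant remaining node
        keep.pop(j)
    return keep                                      # indices of the 'significant nodes'

def rkmse_feature(x, nodes, Ar, kernel):
    """Eq. (10): Y(x) = w0 + sum_{i=1}^r b_i k(x, x'_i), using only r << l kernel evaluations."""
    w0, b = Ar[0], Ar[1:]
    return w0 + sum(b[i] * kernel(x, nodes[i]) for i in range(len(b)))
```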
3.3. The equivalence to a network

We identify that the KMSE model is equivalent to the network in Fig. 1. In this network, the input is an arbitrary sample $x$, the neuron units compute the kernel functions $k(x, x_1), \ldots, k(x, x_l)$, and the summing unit $\Sigma$ produces the feature extraction result $Y(x)$ in the form of Eq. (6). The activation function is

$$\varphi(x) = \begin{cases} 1, & Y(x) \ge 0 \\ -1, & Y(x) < 0 \end{cases},$$

and its output is the classification decision for $x$. Our improvement of KMSE can also be viewed as an optimization of the construction of the network in Fig. 1: after the optimization, the number of neuron units decreases from $l$ to $r$, because in the simplified network corresponding to RKMSE the summing unit deals only with the neurons for $x$ and the "significant nodes". This class of neurons contributes the most to the feature extraction result $Y(x)$, i.e. the network corresponding to RKMSE approximately reproduces the feature extraction result of the network in Fig. 1. From the viewpoint of simplifying the structure and scale of the network, our improvement may improve the generalization of KMSE.

Fig. 1. The network for KMSE (input $x$; kernel units $k(x, x_1), \ldots, k(x, x_l)$ with weights $w_0, a_1, \ldots, a_l$; summing unit producing the feature extraction result; activation; classification decision as output).
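As a toy illustration of this network view (assumed names, not the paper's code), the decision stage simply applies the threshold activation to the summed feature $Y(x)$:

```python
def network_decision(x, nodes, A, kernel):
    """Summing unit Y(x) = w0 + sum_i a_i k(x, x_i) followed by the activation
    phi(x) = 1 if Y(x) >= 0 and -1 otherwise, as in Fig. 1."""
    w0, a = A[0], A[1:]
    Y = w0 + sum(a[i] * kernel(x, nodes[i]) for i in range(len(a)))
    return 1 if Y >= 0 else -1
```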
4. Experiments

We carried out experiments on several benchmark datasets. Each dataset includes 100 partitions (except "Image" and "Splice", which have 20 partitions). The Gaussian kernel function $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ is adopted, where $\sigma^2$ is set to the variance of the first training partition.
Table 1. Experimental results of the three methods.

Dataset        Training samples  "Significant nodes"  Average accuracy (%)      Feature extraction time (s)
                                                      KMSE    DKMSE   RKMSE     KMSE     DKMSE    RKMSE
Titanic             150                17             77.28   76.91   78.44     217.58     3.88     4.00
Splice             1000               266             91.79   91.32   92.10     458.70    21.42    20.08
Breast-cancer       200                76             86.43   84.66   86.85      13.13     2.81     2.69
Image              3000               225             96.03   95.56   96.29     398.06     4.40     4.49
Diabetis            468               117             83.90   83.63   83.91      86.48     3.05     3.19
Banana              400               120             87.73   87.74   87.90     656.08    42.98    43.06
Flare-solar         666               123             68.34   68.06   68.37     197.66     4.13     4.30
Thyroid             140                28             98.65   98.14   98.65      10.36     1.47     1.63
German              700               204             91.22   84.72   88.98     202.28     6.09     6.33
To guarantee that $K^T K$ is nonsingular, we reform Eq. (5) as $A = (K^T K + uI)^{-1} K^T Y$, where $I$ is the identity matrix and $u$ is set to 0.001 in our experiments. The training process runs on the first training partition, while the testing process runs on all testing partitions. Since there are 100 (or 20) testing subsets and each subset yields a classification accuracy, we can compute the average accuracy and its deviation on each dataset. The experimental results of naive KMSE, DKMSE and RKMSE [9] are shown in Table 1. Comparing naive KMSE and RKMSE in Table 1, it is clear that the feature extraction procedure of RKMSE is much more efficient than that of naive KMSE. In most cases, namely the datasets "Titanic", "Splice", "Breast-cancer", "Image", "Diabetis", "Banana" and "Flare-solar", RKMSE achieves higher accuracy than naive KMSE; naive KMSE and RKMSE obtain the same classification accuracy on "Thyroid", and RKMSE obtains slightly lower accuracy than naive KMSE on "German". Table 1 also compares RKMSE with DKMSE. We note that DKMSE does not update the new model, so it cannot produce the optimal "significant nodes", because of the correlation among different kernel column vectors. To exclude the influence caused by differences in the sizes of the "significant node" sets, we extract the same number of "significant nodes" with RKMSE and DKMSE. Compared with DKMSE, our RKMSE achieves higher classification accuracy on all nine datasets.
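For reference, the kernel width and the regularized solution described above can be computed as in the following sketch; this paraphrases the stated setup (Gaussian kernel with $\sigma^2$ equal to the variance of the training partition, $u = 0.001$) and is not the original experiment code; data loading and partitioning are omitted.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    """Augmented kernel matrix of Eq. (4) with k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-sq_dists / (2.0 * sigma2))])

def regularized_kmse(X, labels, u=0.001):
    """A = (K^T K + u I)^{-1} K^T Y, with sigma^2 set from the training partition."""
    sigma2 = X.var()          # one reading of "the variance of the first training partition"
    K = gaussian_kernel_matrix(X, sigma2)
    n = K.shape[1]
    A = np.linalg.solve(K.T @ K + u * np.eye(n), K.T @ labels.astype(float))
    return A, sigma2
```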
5. Conclusion

In this paper, we develop the RKMSE algorithm to make KMSE-based feature extraction more efficient. To extract features from an arbitrary sample, RKMSE needs to compute only the kernel functions between this sample and the "significant nodes"; as a result, the feature extraction of KMSE is substantially accelerated. We also show that KMSE is equivalent to a network, and from the viewpoint of simplifying the construction and scale of this network, our improvement also improves the generalization of KMSE. Moreover, the experimental
results on the benchmark datasets show that RKMSE obtains satisfactory classification performance.

References

[1] Y. Xu, J.-Y. Yang, Z. Jin, Z. Lou, A fast kernel-based nonlinear discriminant analysis method, Journal of Computer Research and Development 42 (3) 367–374.
[2] J.-H. Xu, X. Zhang, Y. Li, Kernel MSE algorithm: a unified framework for KFD, LS-SVM and KRR, in: Proceedings of the International Joint Conference on Neural Networks, Washington, DC, 2001, pp. 1486–1491.
[3] N. Cristianini, J. Shawe-Taylor, C. Saunders, Kernel methods: a paradigm for pattern analysis, in: Kernel Methods in Bioengineering, Signal and Image Processing, 2007, pp. 1–41.
[4] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, K.-R. Muller, Fisher discriminant analysis with kernels, in: Neural Networks for Signal Processing IX, IEEE, 1999, pp. 41–48.
[5] Y. Xu, J.-Y. Yang, J. Yang, A reformative kernel Fisher discriminant analysis, Pattern Recognition (2004) 1299–1302.
[6] S. Chen, X. Hong, C.J. Harris, Sparse kernel regression modeling using combined locally regularized orthogonal least squares and D-optimality experimental design, IEEE Transactions on Automatic Control 48 (6) (2003) 1029–1036.
[7] Y. Xu, D. Zhang, F. Song, J.-Y. Yang, Z. Jing, L. Miao, A method for speeding up feature extraction based on KPCA, Neurocomputing (2007) 1056–1061.
[8] S. Chen, Kernel-based data modeling using orthogonal least squares selection with local regularization, in: Proceedings of the Seventh Annual Chinese Automation and Computer Science Conference, Nottingham, UK, September 2001, pp. 27–30.
[9] Y. Xu, J.-Y. Yang, J.-F. Lu, An efficient kernel-based nonlinear regression method for two-class classification, in: Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, August 2005.
Qi Zhu received his Master's degree from Harbin Institute of Technology in 2009. He is currently a Ph.D. candidate at Harbin Institute of Technology. His research interests include digital image processing, pattern recognition, machine learning and bioinformatics.