Expert Systems with Applications 36 (2009) 8714–8718
Fuzzy multi-class classifier based on support vector data description and improved PCM

Yong Zhang a,b,*, Zhong-Xian Chi b, Ke-Qiu Li b

a Department of Computer, Liaoning Normal University, Dalian 116081, China
b Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China
Keywords: Support vector data description; Possibilistic c-means algorithm; Minimum enclosing sphere; Classifier; SVM
Abstract

In this paper, a novel fuzzy classifier for multi-class classification problems, based on support vector data description (SVDD) and an improved possibilistic c-means (PCM) algorithm, is proposed. The proposed method is a robust version of SVDD that assigns a weight to each data point, representing the fuzzy membership degree of that point in its cluster as computed by the improved PCM method. Accordingly, this paper presents a multi-class classification algorithm based on the robust weighted SVDD and gives a simple classification rule. Experimental results show that the proposed method can reduce the effect of outliers and yield a higher classification rate.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

SVMs for pattern classification are based on two-class classification problems. An interesting property of the SVM is that it is an approximate implementation of the structural risk minimization principle of statistical learning theory (SLT) (Vapnik, 1995), rather than of the empirical risk minimization method. In recent years, the SVM has drawn considerable attention due to its high generalization ability over a wide range of applications and its better performance compared with other traditional learning machines (Muller, Mike, & Ratsch, 2001; Schölkopf & Smola, 2002). However, unclassifiable regions exist when SVMs are extended to multi-class problems. Several methods have been proposed to solve this problem. One is to construct several two-class classifiers and then assemble a multi-class classifier, as in one-against-one, one-against-all and DAGSVM (Hsu & Lin, 2002; Platt, Cristianini, & Shawe-Taylor, 2002). The second approach is to construct a multi-class classifier directly, such as the k-SVM proposed by Weston and Watkins (1998). Zhang, Chi, Liu, and Wang (2007) also proposed a fuzzy compensation multi-class SVM based on Weston and Watkins' idea. Recently, Zhu, Chen, and Liu (2003) proposed a multi-class classification algorithm that adopts minimum enclosing spheres to classify a new example and showed that the resulting classifier performs comparably to standard SVMs. Building on Zhu et al. (2003), Wang, Neskovic, and Cooper (2007) and Lee and Lee (2007) each proposed a new classification rule on the basis of Bayesian optimal decision theory.
The minimum enclosing sphere of a set of data, defined as the smallest sphere enclosing the data, was first used by Schölkopf et al. to estimate the VC-dimension of support vector classifiers (Schölkopf, Burges, & Vapnik, 1995; Schölkopf et al., 2001) and was later applied by Tax and Duin to data domain description (Tax & Duin, 1999, 2004), where it is named support vector data description (SVDD). From a computational point of view, there is a great advantage in using minimum enclosing spheres for pattern classification.

In many practical engineering applications, the available training data are often contaminated by noise. Furthermore, some points in the training set are misplaced far away from the main body, or even lie on the wrong side, in feature space. The standard SVM is very sensitive to such outliers and noise. Several techniques have been proposed in the data classification literature to tackle this problem for the SVM. One approach is to use an averaging algorithm to relax the effect of outliers (Song, Hu, & Xie, 2002; Hu & Song, 2004), but the additional parameter in this kind of method is difficult to tune. A second approach, proposed in Lin and Wang (2002, 2003) and Inoue and Abe (2001) and named the fuzzy SVM (FSVM), applies fuzzy memberships to the training data to relax the effect of outliers; however, the selection of the membership function remains an open problem for FSVM. A third interesting method, support vector data description (SVDD) (Tax & Duin, 1999, 2004), has been shown to be effective at detecting outliers among normal data points, but SVDD is designed specifically for the one-class classification problem.

The original possibilistic c-means (PCM) algorithm (Krishnapuram & Keller, 1993, 1996) has been shown to handle noise and outliers well. In addition, the induced partition assigns each data point a relative value according to the compatibility of the point with the class prototypes: important data points have high values, whereas outliers and noise have
low values. The weights used in our proposed method are generated by an improved possibilistic c-means algorithm, which extends the original PCM algorithm into the kernel space by means of kernel methods. That is to say, the partitioned relative values produced by the improved PCM algorithm serve as the weights of the training samples. In this paper we propose a robust version of SVDD by assigning a weight to each data point, representing the fuzzy membership degree of the point in its cluster as computed by the improved PCM method. Accordingly, we propose a multi-class classification algorithm based on the robust weighted SVDD and give its classification rule.

The rest of this paper is organized as follows. In Section 2 we give a brief description of support vector data description (SVDD) and possibilistic c-means (PCM) clustering, then extend the PCM algorithm into the kernel space and present the weight-generating algorithm. In Section 3 we define the objective functions of the multi-class classification method based on the weighted SVDD and present the proposed algorithm. In Section 4 we illustrate the new algorithm with examples, comparing its performance with that of other methods. Finally, the conclusion is given in Section 5.

2. Previous works

In this section, we first review support vector data description and the original possibilistic c-means algorithm. Then we extend the PCM algorithm into the kernel space and present the weight-generating algorithm.

2.1. Support vector data description
Over the last several years, the SVM has been generalized to solve one-class problems. Tax and Duin (1999, 2004) presented the support vector data description (SVDD) method for classifier design and outlier detection. The main idea in Tax and Duin (1999, 2004) is to map the data points by means of a nonlinear transformation to a high-dimensional feature space and to find the smallest sphere that contains most of the mapped data points in the feature space. This sphere, when mapped back to the data space, can separate into several components, each enclosing a separate cluster of points. In practice, the soft-margin idea and kernel techniques from the SVM are introduced, allowing some data points to lie outside the sphere. The advantage of SVDD lies in the fact that the algorithm is intuitive and comprehensible.

More specifically, given a set of training data {x_i, i = 1, 2, ..., l} ⊂ R^n, the minimum enclosing sphere S, characterized by its center c and radius R, can be found by solving the following constrained quadratic optimization problem:

\min_{c,R} R^2    (1)

subject to the constraints

\|x_i - c\|^2 \le R^2    (2)

However, in real-world applications, the training data of a class are rarely distributed spherically, even if the outermost samples are excluded. To obtain a more flexible description of a class, one can first transform the training samples into a high-dimensional feature space using a nonlinear mapping Φ and then compute the minimum enclosing sphere S in the feature space. Furthermore, to allow for the possibility of some samples falling outside the sphere, one can relax the constraints (2), which require the distance from x_i to the center c to be smaller than R for all x_i, into a set of soft constraints

\|\Phi(x_i) - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l    (3)

where c is the center of the minimum enclosing sphere S and the ξ_i are slack variables allowing for soft boundaries. To penalize large distances to the center of the sphere, the quadratic optimization problem in Eq. (1) accordingly becomes

\min_{c,R} R^2 + C \sum_i \xi_i    (4)

under the constraints (3), where C > 0 is a penalty constant that controls the trade-off between the size of the sphere and the number of samples that may fall outside it. To solve this problem, we introduce the Lagrangian

L = R^2 - \sum_i (R^2 + \xi_i - \|\Phi(x_i) - c\|^2)\alpha_i - \sum_i \beta_i \xi_i + C \sum_i \xi_i    (5)

Using the Lagrange multiplier method (Tax & Duin, 2004), the above constrained quadratic optimization problem can be formulated in its Wolfe dual form. Differentiating (5) with respect to R, c and ξ_i and setting the derivatives to zero, we obtain

\frac{\partial L}{\partial R} = 2R\Big(1 - \sum_i \alpha_i\Big) = 0 \;\Rightarrow\; \sum_i \alpha_i = 1    (6)

\frac{\partial L}{\partial c} = -2\sum_i \alpha_i (\Phi(x_i) - c) = 0 \;\Rightarrow\; c = \sum_i \alpha_i \Phi(x_i)    (7)

\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; C = \alpha_i + \beta_i    (8)

Substituting (6)-(8) back into (5), we obtain the dual formulation

\max \; W = \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)
\text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i = 1, \quad i = 1, 2, \ldots, l    (9)

where the kernel K(x_i, x_j) = Φ(x_i) · Φ(x_j) satisfies Mercer's condition. Points x_i with α_i > 0, called support vectors (SVs), are needed in the description and lie on the boundary of the sphere. To test a point x, the distance to the center of the sphere has to be calculated. A test point x is accepted when this distance is smaller than or equal to the radius:

\|\Phi(x) - c\|^2 = K(x, x) - 2\sum_i \alpha_i K(x_i, x) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \le R^2    (10)

where R^2 is the squared distance from the center of the sphere c to any support vector x_i on the boundary.
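To make the formulation concrete, the following Python sketch solves the dual problem (9) for a small synthetic data set with a general-purpose optimizer and then applies the test rule (10). It is only an illustration: the data, the penalty C, the kernel width sigma and the helper names (gaussian_kernel, sq_dist_to_centre) are assumptions of this sketch, not values or code from the paper.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))            # toy training set in R^2
C, sigma = 0.5, 1.0                     # penalty constant and Gaussian kernel width (assumed)

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)     # K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)

K = gaussian_kernel(X, X, sigma)
l = len(X)

def neg_dual(a):
    # negative of the dual objective in Eq. (9): W = sum_i a_i K_ii - sum_ij a_i a_j K_ij
    return a @ K @ a - a @ np.diag(K)

res = minimize(neg_dual, np.full(l, 1.0 / l),
               bounds=[(0.0, C)] * l,
               constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
               method="SLSQP")
alpha = res.x

def sq_dist_to_centre(x):
    # ||Phi(x) - c||^2 of Eq. (10); K(x, x) = 1 for the Gaussian kernel
    k = gaussian_kernel(x[None, :], X, sigma)[0]
    return 1.0 - 2.0 * alpha @ k + alpha @ K @ alpha

# R^2 is the squared distance from the centre to a boundary support vector (0 < alpha_i < C)
sv = int(np.argmax(np.minimum(alpha, C - alpha)))
R2 = sq_dist_to_centre(X[sv])
print("test point inside the sphere:", sq_dist_to_centre(np.zeros(2)) <= R2)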
2.2. Improved possibilistic c-means algorithm

In SVDD, the training process is very sensitive to outliers and noisy data. The quadratic form (4) uses the parameter C to control the degree to which misclassified samples are penalized. A larger C imposes a heavier penalty on misclassified samples and thus reduces the number of misclassified data points, while a smaller C ignores some "negligible" misclassified data and thereby yields a larger classification margin. However, the value of C is fixed throughout the SVDD training process, which means that all training data are treated equally. The SVDD training process is therefore sensitive to certain data, such as outliers and noisy points, and this may result in over-fitting. To reduce the effect of such outliers and noise, one can assign each data point in the training set a membership and sum the deviations weighted by these memberships. If a data point is detected as an outlier, it is assigned a low membership, so its contribution to the total error term is decreased. Unlike the equal treatment in the standard SVM, this approach fuzzifies the penalty term to reduce the sensitivity to less important data points.
The possibilistic c-means (PCM) algorithm, first proposed in Krishnapuram and Keller (1993) and further explored in Krishnapuram and Keller (1996) to overcome the relative-membership problem of the fuzzy c-means (FCM) algorithm, has been shown to handle noise and outliers well. Its basic idea is to relax the probabilistic constraint of FCM, which forces the memberships of each point to sum to one across clusters, and to adopt a possibilistic approach to clustering by minimizing the following objective function:
J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \|x_k - v_i\|^2 + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} (1 - u_{ik})^m    (11)
where c is the number of clusters (a value specified in advance in this paper), n is the number of data points, u_ik ∈ [0, 1] is the membership of x_k in class i, m is the quantity controlling clustering fuzziness, V is the set of cluster centers or prototypes (v_i ∈ R^p), and the η_i are suitable positive numbers. The function J(U, V) is minimized by the well-known alternating iterative algorithm (Krishnapuram & Keller, 1993, 1996). The first term demands that the distances from the data points to the prototypes be as low as possible, whereas the second term forces the u_ik to be as large as possible, thus avoiding the trivial solution.

To make use of the PCM algorithm in the proposed weighted SVDD classifier, we extend the original PCM into the kernel space by means of kernel methods and develop an improved kernel-based PCM algorithm that generates the weights in the kernel space instead of in the input space. Define a nonlinear map φ: x → φ(x) ∈ F, where x ∈ X, X denotes the data space and F denotes the transformed feature space of higher dimension. Kernel-based PCM minimizes the following objective function:
J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \|\varphi(x_k) - \varphi(v_i)\|^2 + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} (1 - u_{ik})^m    (12)

where ||φ(x_k) - φ(v_i)||² = K(x_k, x_k) + K(v_i, v_i) - 2K(x_k, v_i). Compared with Eq. (11), Eq. (12) uses the squared Euclidean distance between the transformed pattern φ(x_k) and the cluster prototype φ(v_i) in the kernel space. In this paper, we adopt the Gaussian function as the kernel function:

K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)    (13)

This kernel is independent of the position of the data with respect to the origin, since it uses only the distances between objects. The parameter σ is a width parameter that controls how tight the description is around the data.
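As a small illustration of the kernel in Eq. (13), the following sketch (with an arbitrary toy data set and an assumed width) checks the two properties used below: K(x, x) = 1 and dependence only on pairwise distances.

import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

X = np.random.default_rng(1).normal(size=(5, 3))        # arbitrary toy data
K = gaussian_kernel(X, X, sigma=2.0)
print(np.allclose(np.diag(K), 1.0))                     # K(x, x) = 1
print(np.allclose(K, gaussian_kernel(X + 10.0, X + 10.0, sigma=2.0)))  # depends only on pairwise distances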
Obviously, when the Gaussian function is used as the kernel, K(x, x) = 1 and ||φ(x_k) - φ(v_i)||² = 2 - 2K(x_k, v_i). Accordingly, Eq. (12) can be rewritten as
J(U, V) = 2\sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m (1 - K(x_k, v_i)) + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{n} (1 - u_{ik})^m    (14)
Following the recommendation in Krishnapuram and Keller (1993), η_i can be obtained from the average possibilistic intra-cluster distance of cluster i as

\eta_i = K \frac{\sum_{k=1}^{n} u_{ik}^m \|\varphi(x_k) - \varphi(v_i)\|^2}{\sum_{k=1}^{n} u_{ik}^m} = K \frac{2\sum_{k=1}^{n} u_{ik}^m (1 - K(x_k, v_i))}{\sum_{k=1}^{n} u_{ik}^m}    (15)

Typically K is chosen to be 1, and we set K to the constant value 1 in this paper. The minimization of (14) gives the following updating equations for the memberships and the cluster prototypes (Krishnapuram & Keller, 1993):

u_{ik} = \frac{1}{1 + \left(\|\varphi(x_k) - \varphi(v_i)\|^2 / \eta_i\right)^{1/(m-1)}} = \frac{1}{1 + \left(2(1 - K(x_k, v_i)) / \eta_i\right)^{1/(m-1)}}    (16)

v_i = \frac{\sum_{k=1}^{n} u_{ik}^m \varphi(x_k)}{\sum_{k=1}^{n} u_{ik}^m}    (17)

The kernel-based PCM algorithm is implemented by iteratively updating (16) and (17) until the stopping criterion is satisfied. After partitioning, the weights of outliers and noisy points are very small, or even close to zero, so that their effect on the decision hyperplane is significantly reduced. To reduce the training time of the weighted SVDD classifier, we set a threshold r_t for the weights u_ik. If u_ik > r_t, the training sample x_k is regarded as affecting the ith class to some degree; otherwise, the training algorithm ignores the effect of x_k on the ith class.
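The following sketch illustrates one possible implementation of this weight-generation step, iterating (16) and (17) with the Gaussian kernel and keeping the prototypes φ(v_i) implicit through the kernel trick. The data set, the number of clusters, m, sigma, the threshold r_t, the seed points and the fixed iteration count are all illustrative assumptions rather than settings from the paper.

import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def kernel_pcm_weights(X, seeds, m=2.0, sigma=1.0, r_t=0.1, n_iter=50):
    K = gaussian_kernel(X, X, sigma)
    # initialise memberships from distances to the (assumed) seed points, one per cluster
    d0 = ((X[None, :, :] - X[seeds][:, None, :]) ** 2).sum(axis=2)
    U = 1.0 / (1.0 + d0)
    for _ in range(n_iter):
        W = U ** m
        W = W / W.sum(axis=1, keepdims=True)            # prototype coefficients of Eq. (17), kept implicit
        # squared feature-space distance ||phi(x_k) - phi(v_i)||^2 via the kernel trick (K(x, x) = 1)
        d2 = 1.0 - 2.0 * W @ K + np.sum((W @ K) * W, axis=1, keepdims=True)
        d2 = np.maximum(d2, 1e-12)
        # eta_i: average possibilistic intra-cluster distance, Eq. (15) with the constant K = 1
        eta = (U ** m * d2).sum(axis=1) / (U ** m).sum(axis=1)
        eta = np.maximum(eta, 1e-12)
        U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # membership update, Eq. (16)
    U[U <= r_t] = 0.0                                   # threshold r_t (Step 1 of Section 3.2)
    return U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),           # cluster 1
               rng.normal(3.0, 0.3, (30, 2)),           # cluster 2
               [[10.0, 10.0]]])                          # one outlier
U = kernel_pcm_weights(X, seeds=[0, 30])
print(U[:, -1])                                          # the outlier receives (near-)zero weight in both clusters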
3. Weighted SVDD based on improved PCM

3.1. Weighted SVDD

Let {(x_i, m_i), i = 1, 2, ..., l} ⊂ R^n be a given training data set, where m_i is the weight of training sample x_i. Once the weights m_i of the training samples have been computed by the improved PCM method of Section 2, the weighted SVDD problem can be described as follows:
\min_{c,R} R^2 + C \sum_i m_i \xi_i    (18)

subject to the constraints

\|\Phi(x_i) - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l    (19)
where m_i is the weight generated by the improved PCM method in Section 2. We can find the solution to this optimization problem in the dual variables by finding the saddle point of the Lagrangian. The corresponding Lagrangian is
L = R^2 - \sum_i (R^2 + \xi_i - \|\Phi(x_i) - c\|^2)\alpha_i - \sum_i \beta_i \xi_i + C \sum_i m_i \xi_i    (20)
Differentiating (20) with respect to R, c and ξ_i and setting the derivatives to zero, we obtain
\frac{\partial L}{\partial R} = 2R\Big(1 - \sum_i \alpha_i\Big) = 0    (21)

\frac{\partial L}{\partial c} = -2\sum_i \alpha_i (\Phi(x_i) - c) = 0    (22)

\frac{\partial L}{\partial \xi_i} = C m_i - \alpha_i - \beta_i = 0    (23)
which lead to \sum_i \alpha_i = 1 and c = \sum_i \alpha_i \Phi(x_i), respectively. Substituting (21)-(23) back into (20), we obtain the dual formulation
\max \; W = \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)
\text{s.t.} \quad 0 \le \alpha_i \le C m_i, \quad \sum_i \alpha_i = 1, \quad i = 1, 2, \ldots, l    (24)
We replace the inner products in (24) by a kernel function K(x_i, x_j) = Φ(x_i) · Φ(x_j). Several kernel functions have been proposed, including the linear kernel, the polynomial kernel, the radial basis function kernel and the sigmoid kernel. In our experiments we use the Gaussian kernel of the form (13). Solving the dual QP problem, we obtain the Lagrange multipliers α_i, which give the center c of the minimum enclosing sphere S as a linear combination of the mapped training samples Φ(x_i),
c = \sum_i \alpha_i \Phi(x_i)    (25)

According to the Karush-Kuhn-Tucker (KKT) optimality conditions, we have

\alpha_i = 0 \;\Rightarrow\; \|\Phi(x_i) - c\|^2 < R^2 \text{ and } \xi_i = 0
0 < \alpha_i < C m_i \;\Rightarrow\; \|\Phi(x_i) - c\|^2 = R^2 \text{ and } \xi_i = 0
\alpha_i = C m_i \;\Rightarrow\; \|\Phi(x_i) - c\|^2 \ge R^2 \text{ and } \xi_i \ge 0    (26)

Therefore, only the α_i that correspond to training examples x_i lying either on or outside the sphere are nonzero. All the remaining α_i are zero, and the corresponding training samples are irrelevant to the final solution. It is clear that the only difference between standard SVDD and weighted SVDD lies in the upper bounds of the Lagrange multipliers α_i in the dual problem. In standard SVDD the α_i are bounded above by a constant C, whereas in weighted SVDD they are bounded by the dynamic, sample-dependent values C m_i.
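A minimal sketch of the weighted SVDD training problem is given below; it is identical to a standard SVDD solver except that the box constraints become 0 ≤ α_i ≤ C·m_i, as in Eq. (24). The toy data, the PCM-style weights and the parameter values are assumptions of the sketch.

import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def weighted_svdd(X, m_weights, C=1.0, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)
    l = len(X)
    ub = C * m_weights                                   # per-sample upper bounds C * m_i of Eq. (24)

    def neg_dual(a):
        return a @ K @ a - a @ np.diag(K)                # negative of the dual objective

    res = minimize(neg_dual, ub / ub.sum(),              # feasible starting point
                   bounds=list(zip(np.zeros(l), ub)),    # 0 <= alpha_i <= C * m_i
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
                   method="SLSQP")
    alpha = res.x
    sv = int(np.argmax(np.minimum(alpha, ub - alpha)))   # a boundary support vector
    k_sv = gaussian_kernel(X[sv:sv + 1], X, sigma)[0]
    R2 = 1.0 - 2.0 * alpha @ k_sv + alpha @ K @ alpha    # squared radius, K(x, x) = 1
    return alpha, R2

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), [[5.0, 5.0]]])   # one cluster plus an outlier
m_weights = np.r_[np.ones(30), 0.01]                           # PCM-style weights: outlier nearly ignored
alpha, R2 = weighted_svdd(X, m_weights)
print(alpha[-1] <= 0.01 + 1e-9, round(R2, 3))                  # the outlier's multiplier is capped at C * m_i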
3.2. The proposed algorithm description

In this section, we present the method for multi-class classification problems. Suppose that a set of training data (x_1, y_1), (x_2, y_2), ..., (x_l, y_l) is given, where l denotes the number of training samples, x_i ∈ R^n denotes an input pattern, and y_i ∈ {1, 2, ..., C} denotes its output class, C being the number of classes. The central idea of the proposed method is to use the fuzzy domain description generated by the fuzzy SVDD to estimate the distribution of each partitioned class and then to use these estimates to classify a data point via the given decision rule. The detailed procedure of the proposed method is as follows:

Step 1 (Computing the weights u_ik): Set a threshold r_t. Use the improved kernel-based PCM method to compute the weight u_ik of each training sample x_i with respect to each class k. In the weight matrix U = {u_ik}, set u_ik to 0 if u_ik ≤ r_t.

Step 2 (Data partitioning): Partition the given training data into C subsets {D_p}_{p=1}^{C} according to their weights u_ik. For example, the pth subset D_p contains l_p elements as follows:

D_p = \{(x_{i_1}, p), \ldots, (x_{i_{l_p}}, p)\}    (27)
where u_{i_c p} ≥ r_t for c = 1, ..., l_p.

Step 3 (Weighted SVDD for each class data set): For each class data set D_p, we build a trained kernel support function via the fuzzy (weighted) SVDD; that is, we solve the dual problem (24). Let its solution be α_i^*, i = i_1, ..., i_{l_p}, and let J_p ⊂ {1, ..., l_p} be the index set of the nonzero α_i^*. The trained kernel support function for the class data set D_p is then given by

f_p(x) = K(x, x) - 2\sum_{i \in J_p} \alpha_i^* K(x_i, x) + \sum_{i,j \in J_p} \alpha_i^* \alpha_j^* K(x_i, x_j)    (28)
Step 4 (Constructing a classification rule): We construct the following classification rule for a new sample x:

F(x) = \arg\max_p \frac{l_p \left((R_p)^2 - f_p(x)\right)}{l\,(R_p)^2}    (29)

where p ∈ {1, 2, ..., C}, (R_p)^2 = f_p(x_i) for any support vector x_i of the pth class, l denotes the total number of training samples, and l_p denotes the number of training samples in the pth class.

In Step 4, we adopt a classification rule based on SVDD that is an optimal classifier according to Bayesian decision theory, as indicated by Lee and Lee (2007), and the experiments in Section 4 show that it leads to better generalization accuracy than the classification rules in Zhu et al. (2003) and Wang et al. (2007).
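The decision rule of Eq. (29) can be sketched as follows, assuming that each class has already been described by a (weighted) SVDD so that its support points, multipliers α_i^* and radius R_p^2 are available; the helper names f_p and classify and the toy models are purely illustrative assumptions of this sketch.

import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def f_p(x, X_p, alpha_p, sigma):
    # trained kernel support function of Eq. (28) for one class (K(x, x) = 1 for the Gaussian kernel)
    k = gaussian_kernel(x[None, :], X_p, sigma)[0]
    K = gaussian_kernel(X_p, X_p, sigma)
    return 1.0 - 2.0 * alpha_p @ k + alpha_p @ K @ alpha_p

def classify(x, models, l_total, sigma=1.0):
    # models: list of (X_p, alpha_p, R2_p, l_p); returns the arg max of Eq. (29)
    scores = [l_p * (R2_p - f_p(x, X_p, a_p, sigma)) / (l_total * R2_p)
              for (X_p, a_p, R2_p, l_p) in models]
    return int(np.argmax(scores))

# toy usage with two hand-made single-support-vector "spheres" (purely illustrative)
models = [(np.array([[0.0, 0.0]]), np.array([1.0]), 0.5, 10),
          (np.array([[3.0, 3.0]]), np.array([1.0]), 0.5, 10)]
print(classify(np.array([0.2, -0.1]), models, l_total=20))      # -> 0: closer to the first sphere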
4. Experimental results

The effectiveness of the proposed weighted SVDD algorithm on classification problems is tested on benchmark data sets from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), namely glass, ionosphere, sonar, pima-diabetes, iris and vehicle. Their characteristics are listed in Table 1, where #pts denotes the number of data points, #att the number of attributes, and #class the number of classes.

Table 1
Data sets from UCI.

Data set        #pts    #att    #class
Glass           214     9       6
Ionosphere      351     34      2
Sonar           208     60      2
Pima-diabetes   768     8       2
Iris            150     4       3
Vehicle         846     18      4

On each data set, we compared our method with the standard SVM classifier and with the sphere-based classifier using the decision rules of Zhu et al. (2003) and Wang et al. (2007). We used the Gaussian kernel for all algorithms and fivefold cross-validation to estimate the generalization errors of the classifiers. In the cross-validation process, we ensured that each training set and each test set were the same for all algorithms. Note that the generalization errors of the classifiers depend on the values of the kernel parameter σ and the regularization parameter C, and the boundary description is strongly influenced by the width parameter σ. For small values of σ, all the objects tend to become support vectors and the description approximates a Parzen density estimate with a small width parameter. For large values of σ, the SVDD description approximates the original spherical solution. For intermediate values of σ, a weighted Parzen density is obtained. Accordingly, the threshold r_t in Step 1 of the proposed algorithm is set to 0.1; the value of r_t influences the algorithm's runtime, since for smaller values of r_t more training samples are retained in each class subset.

In the experiments we standardized the data to zero mean. For the Gaussian kernel, we determined the tuning parameters σ² and C by standard fivefold cross-validation. The initial search for the optimal parameters σ² and C was carried out on a 10 × 10 uniform coarse grid in the (σ², C) space, namely σ² ∈ {2^-3, 2^-1, ..., 2^13, 2^15} and C ∈ {2^-15, 2^-13, ..., 2^1, 2^3}.
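A sketch of this model-selection protocol is shown below. Because the weighted-SVDD classifier of this paper is not a library model, scikit-learn's Gaussian-kernel SVC is used here only as a stand-in to illustrate the zero-mean standardization, the fivefold cross-validation and the 10 × 10 coarse grid over (σ², C); the data set and all other choices are assumptions of the sketch.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("center", StandardScaler(with_std=False)),       # zero-mean standardisation only
                 ("svc", SVC(kernel="rbf"))])                      # Gaussian-kernel stand-in classifier
param_grid = {
    "svc__C": [2.0 ** k for k in range(-15, 4, 2)],                # 10 values: 2^-15, 2^-13, ..., 2^3
    "svc__gamma": [1.0 / 2.0 ** k for k in range(-3, 16, 2)],      # gamma = 1/sigma^2, sigma^2 = 2^-3, ..., 2^15
}
search = GridSearchCV(pipe, param_grid, cv=5)                      # fivefold cross-validation
search.fit(X, y)
print(search.best_params_, round(1.0 - search.best_score_, 4))     # error rate at the best (sigma^2, C)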
Table 2
Comparison between SVMs and the sphere-based classifiers: classification error rate (%).

Data set        SVM     Sphere-based classifier    Proposed method (rt = 0.1)
Glass           27.57   28.54                      27.51
Ionosphere      5.12    4.84                       4.68
Sonar           10.62   15.12                      11.38
Pima-diabetes   27.48   27.00                      26.82
Iris            2.67    2.72                       2.68
Vehicle         12.64   12.60                      12.60
Iris*           5.42    5.64                       3.89
Vehicle*        19.86   19.32                      17.58
The experimental results are summarized in Table 2; the values are classification error rates. As can be seen, except on the Sonar and Iris data sets, the proposed algorithm improves the performance of the resulting classifier significantly and outperforms the standard SVM classifier on four of the six data sets tested. Moreover, the experimental results show that the proposed method also improves upon the sphere-based classifiers of Zhu et al. (2003) and Wang et al. (2007). Additionally, to test the algorithm's sensitivity to outliers, we randomly added 10% "outlier" samples to the Iris and Vehicle data sets; these variants are labeled Iris* and Vehicle* in Table 2 and their results are shown in the last two rows. The results indicate that the proposed method successfully reduces the effect of outliers and accordingly yields good generalization performance, especially when outliers are present in the data.

5. Conclusions

In this paper, a weighted support vector data description (SVDD) is proposed, which can be used to deal with the outlier-sensitivity problem in traditional multi-class classification. The robust possibilistic c-means clustering algorithm is improved to generate different weight values for the main training data points and the outliers according to their relative importance in the training set. Experimental results show that the proposed method can reduce the effect of outliers and yield a higher classification rate than other existing methods.

Acknowledgements

This work is supported by the Liaoning Doctoral Research Foundation of China (Grant No. 20081079) and the University Scientific Research Project of the Liaoning Education Department of China (Grant No. 2008347). The authors are grateful to the anonymous reviewers for their constructive comments.

References

Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2), 415-425.
Hu, W. J., & Song, Q. (2004). An accelerated decomposition algorithm for robust support vector machines. IEEE Transactions on Circuits and Systems II: Express Briefs, 51(5), 234-240.
Inoue, T., & Abe, S. (2001). Fuzzy support vector machines for pattern classification. In Proceedings of the international joint conference on neural networks (IJCNN '01) (Vol. 2, pp. 1449-1454).
Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2), 98-110.
Krishnapuram, R., & Keller, J. M. (1996). The possibilistic c-means algorithm: Insights and recommendations. IEEE Transactions on Fuzzy Systems, 4(3), 385-393.
Lee, D., & Lee, J. (2007). Domain described support vector classifier for multi-classification problems. Pattern Recognition, 40, 41-51.
Lin, C. F., & Wang, S. D. (2002). Fuzzy support vector machines. IEEE Transactions on Neural Networks, 13(2), 464-471.
Lin, C. F., & Wang, S. D. (2003). Training algorithms for fuzzy support vector machines with noisy data. In Proceedings of the IEEE 8th workshop on neural networks for signal processing (pp. 517-526).
Muller, K. R., Mike, S., Ratsch, G., et al. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181-201.
Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2002). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems (Vol. 12, pp. 547-553). Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In Proceedings of the first international conference on knowledge discovery and data mining (pp. 252-257).
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443-1471.
Song, Q., Hu, W. J., & Xie, W. F. (2002). Robust support vector machine with bullet hole image classification. IEEE Transactions on Systems, Man and Cybernetics, 32(4), 440-448.
Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191-1199.
Tax, D., & Duin, R. (2004). Support vector data description. Machine Learning, 54, 45-66.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Wang, J., Neskovic, P., & Cooper, L. N. (2007). Bayes classification based on minimum bounding spheres. Neurocomputing, 70, 801-808.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines. Technical report, Department of Computer Science, Royal Holloway, University of London.
Zhang, Y., Chi, Z. X., Liu, X. D., & Wang, X. H. (2007). A novel fuzzy compensation multi-class support vector machines. Applied Intelligence, 27(1), 21-28.
Zhu, M. L., Chen, S. F., & Liu, X. D. (2003). Sphere-structured support vector machines for multi-class pattern recognition. Lecture Notes in Computer Science (Vol. 2639, pp. 589-593). Berlin, Heidelberg: Springer.