Fuzzy support vector machines with the uncertainty of parameter C

Che-Chang Hsu, Ming-Feng Han, Shih-Hsing Chang, Hung-Yuan Chung

Department of Mechanical Engineering, National Central University, No. 300, Jhongda Road, Jhongli City, Taoyuan County, Taiwan
Department of Electrical Engineering, National Central University, No. 300, Jhongda Road, Jhongli City, Taoyuan County 32001, Taiwan
Institute of Business and Management, Vanung University, No. 1, Van-Nung Road, Chung-Li, Tao-Yuan 320614, Taiwan
Keywords: Pattern recognition; Fuzzy support vector machines; Uncertainty
Abstract

In typical pattern recognition applications there is usually only vague and general knowledge about the situation, and an optimal classifier is hard to develop when the decision function lacks sufficient knowledge. The aim of our experiments is to extract additional features by an appropriate transformation of the training set. In this paper we assume that the training samples are drawn from a Gaussian distribution, and that data sets in an imprecise situation, such as overlapping classes, can be represented by fuzzy sets. On this basis we propose the fuzzy support vector machines with the uncertainty of parameter C (FSVMs-UPC), in which each data point carries its own penalty coefficient. The experimental results show that the proposed method is a better way to postpone or avoid overfitting, and it also gives a measure of the quality of the ultimately chosen model. © 2008 Published by Elsevier Ltd.
1. Introduction

Support vector machines (SVMs) are extensively used as a classification technique derived from the idea of a generalized optimal hyperplane with maximum margin between two classes. This idea is implemented through the structural risk minimization (SRM) principle (Vapnik, 1995, 1998, 1999). Because SVM training reduces to a convex optimization problem, the method has received much attention in recent years. Briefly, the SVM algorithm embodies the SRM principle by describing a trade-off between model complexity and the quality of the ultimately chosen model (Vapnik, 1998). In general, however, we cannot expect the SVM classifier to avoid misclassification (overfitting) while still preserving the quality of the model. These two goals are somewhat contradictory, yet the generalization performance of the learning process is what determines its prediction capability on independent test data. For this reason, classification performance matters more than the training error estimate.

In recent years, several works in the machine learning literature have addressed how to improve class separability by feature selection or feature extraction. One line of work analyzes the training data themselves, aiming for better classification performance by preprocessing noise or outliers, which are assigned lower degrees of membership as proposed by Lin and Wang (1999).
A similar issue based on the training data themselves has been studied by Tax and Duin (2005), who focused on the sample size and the degree of overlap; these two characteristics of the data are quite likely to degrade the performance of a learning system. Prati, Batista, and Monard (2004) likewise reported that class imbalance and the degree of overlap among classes are obstacles to effective learning algorithms. Furthermore, Schölkopf, Simard, Smola, and Vapnik (1998) extracted regularities from training data in order to incorporate prior knowledge into SVMs; in that case, the additional information does contribute to improving separability. However, relatively little research has reported how to use statistical analysis and fuzzy set theory to extract the properties of the overlapping region. The purpose of the feature extraction in this work is to learn more from an analysis of the overlapping region, in the hope that the extracted information is useful for improving separability. In this feature extraction we observe that the boundaries of the overlap in the initial set are not clearly distinguishable, so fuzzy set theory (Zadeh, 1965) is a natural tool for filling this gap in previous research. In this work, we assume that the training samples are drawn from a Gaussian distribution and that data sets in an imprecise situation can be represented by fuzzy sets. Under these hypotheses, the added features help to compensate for the information that the data lack. We then propose a simple SVM programming formulation with an imprecise parameter C, called the fuzzy support vector machines with the uncertainty of parameter C (FSVMs-UPC).
2. Theoretical overview
In this section, we briefly review representative work on the idea of structural risk minimization and the SVM method.
2.1. The concept of structural risk minimization
The SRM principle (Vapnik, 1998) is expressed through a nested sequence of hypothesis spaces $S = \{f(x, \theta), \theta \in \Lambda\}$, where each hypothesis $f(x, \theta)$ depends on an adjustable parameter vector $\theta$ and $x \in \mathbb{R}^d$ denotes the input features. The capacity of the function set $S$ can be decomposed into a nested sequence of subsets of increasing size
$$S_i = \left\{\, w^{T}\varphi(x) + b \;:\; \|w\|_2^2 \le c_i \,\right\}, \qquad i = 1, 2, \ldots, n, n+1, \ldots, \qquad (1)$$
where $S_i$ is a nested set of models of increasing complexity, $w$ is a vector orthogonal to the hyperplane, and $b$ is a bias scalar. The mapping function $\varphi(x)$ maps the input features into a higher-dimensional feature space, and each hypothesis satisfies $\|w\|_2^2 \le c_i$. Vapnik and Chervonenkis (1971) showed that the VC-dimension is a property of a set of functions $f(x, \theta)$, and the VC-dimension $h_i$ of each subset $S_i$ satisfies the bound
$$h_i \le \min\!\left(\left[r^2 c_i\right],\, d\right) + 1, \qquad i = 1, 2, \ldots, n, n+1, \qquad (2)$$
where $r$ is the radius of the smallest sphere containing the points $\varphi(x_1), \varphi(x_2), \ldots, \varphi(x_n)$ in the high-dimensional feature space.

2.2. An overview of SVMs

SVMs are developed on the basis of the structural risk minimization rule and possess an excellent learning capacity and generalization performance on finite samples. The goal of the SVM classifier (Vapnik, 1995, 1998, 1999) is to find a separating hyperplane with as large a margin as possible so as to minimize the classification error. The decision function of the SVM, with a non-linear mapping $\varphi(x)$ that maps the input data into a high-dimensional feature space, is given as
$$f(x) = \operatorname{sign}\bigl(\langle w, \varphi(x)\rangle + b\bigr). \qquad (3)$$
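To make Eq. (3) concrete, the following minimal Python sketch (not from the paper; the synthetic data and RBF kernel settings are our own assumptions) fits a standard SVM with scikit-learn and checks that the sign of the decision values reproduces the predicted labels.

```python
# Minimal sketch (not from the paper) of the SVM decision rule in Eq. (3):
# f(x) = sign(<w, phi(x)> + b), here with an RBF kernel and synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two illustrative Gaussian classes in R^2 (our assumption, not the paper's data).
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=2.0, gamma="scale").fit(X, y)

# decision_function returns <w, phi(x)> + b; its sign is the predicted label.
scores = clf.decision_function(X[:5])
print(np.sign(scores))   # matches clf.predict(X[:5])
```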
3. The fuzzy analysis of data for support vector machines

In the feature extraction we analyze the overlap region in order to design a new classifier. Fuzzy set theory can be used to extract useful features from the data, and this approach is practical and inexpensive.
Fig. 1. Overlapping Gaussian class densities f(x) of Class 1 and Class 2, with the vague region between them (Gaussian signal-detection view).
3.1. Analysis of the training data

Real-world data sets usually involve class overlap. The overlap problem can be described as a region in which samples of at least two different classes coexist, which distinguishes it from a crisp set. To illustrate the distribution of the overlapping region, we reasonably assume that both class 1 and class 2 follow Gaussian density distributions, so the individual normal densities f(x) can be drawn as in Fig. 1, where f(x) denotes the continuous univariate normal density at x. In the vague region, however, a sample cannot be assigned definitely to class 1 or class 2; it is better interpreted as an uncertain measure. Fuzzy set theory is therefore a natural tool for characterizing such an imprecise region.

3.2. Setting the fuzzy membership function

Let us reconsider the hypothetical problem posed in Section 3.1, under the assumption that the overlap of classes is posed in fuzzy terms. We can then set up a membership function to handle the information in the vague region. Moreover, most of the support vectors (SVs) are found in the overlap region, and the SVs are the most crucial features for constructing the decision function (Vapnik, 1999). In such a situation, samples in the overlap region should be assigned a higher degree of membership than samples lying away from it. The triangular membership function satisfies these properties and allows us to quantify the treatment of the overlap problem. Let $U$ denote a universe; every element $x$ in $U$ is assigned a real number $u(x) \in [0, 1]$. Given an interval $X_o = \langle a, b\rangle$ and $M \in X_o$, the triangular membership function has the form
$$u(x) = \begin{cases} \dfrac{x-a}{M-a}, & x \le M,\\[6pt] \dfrac{b-x}{b-M}, & x \ge M, \end{cases} \qquad (4)$$
Eq. (4) satisfies three properties: $u(x)$ attains its maximum at the point $x = M$, where $u(M) = 1$; for $x \in X_o$ with $x \neq a, b$, we have $0 < u(x) < 1$; and $u(x) = 0$ if and only if $x = a$ or $x = b$. The graph of this function is sketched in Fig. 2. Using this function, samples located away from the overlap region receive lower membership grades. The solid line expresses a triangular membership function characterized by the three parameters $a$, $b$, and $M$; the membership reaches its maximum value at the point $M$.
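As an illustration, a minimal Python sketch of the triangular membership function in Eq. (4) is given below; the endpoints a and b and the clipping outside [a, b] are our assumptions, while M = 0.1783 is the intersection point reported later for the TwoNorm data.

```python
# Minimal sketch (our own) of the triangular membership function in Eq. (4).
# Values outside [a, b] are clipped to 0, so samples far from the overlap
# region receive low membership grades; a and b are illustrative endpoints.
import numpy as np

def triangular_membership(x, a, M, b):
    x = np.asarray(x, dtype=float)
    left = (x - a) / (M - a)      # rising edge, u(M) = 1
    right = (b - x) / (b - M)     # falling edge, u(b) = 0
    return np.clip(np.where(x <= M, left, right), 0.0, 1.0)

# Example with M = 0.1783 (the TwoNorm intersection point reported in Section 4.2).
print(triangular_membership([-1.0, 0.1783, 1.0], a=-1.0, M=0.1783, b=1.0))
# -> [0. 1. 0.]
```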
Fig. 2. A simple triangular membership function, characterized by the parameters a, M, and b.
3.3. Fuzzy objective coefficients in the tolerance parameter C

In the SVM approach, the regularization parameter C determines the trade-off between margin width and model complexity. As discussed in Section 3.2, samples in the overlap region are given higher membership grades. We therefore assume that the effective parameter C may vary over some range according to the importance of each sample. The optimization problem becomes
$$\min_{w, b, \xi}\;\; \frac{1}{2} w^{T} w + \sum_{i=1}^{n} \bigl(C \cdot u(x_i)\bigr)\, \xi_i, \qquad (5)$$

subject to

$$y_i\bigl(w^{T} x_i + b\bigr) \ge 1 - \xi_i, \qquad i = 1, \ldots, n, \qquad (6)$$

$$\xi_i \ge 0, \qquad i = 1, \ldots, n. \qquad (7)$$

According to Eq. (5), we introduce, for simplicity, the abbreviation

$$u_C(x_i) = C \cdot u(x_i), \qquad i = 1, \ldots, n, \qquad (8)$$
where $\xi_i$ is a slack variable, $y_i$ is the class label of each training sample, and $u_C(x_i)$ is regarded as a fuzzy penalizing parameter that controls the trade-off between the margin and the misclassification. Now consider the product $u_C(x_i)\,\xi_i$ in Eq. (5): it can be thought of as a measure of misclassification, and its value is never heavier than in the original SVMs. Consequently, Eq. (5) is less sensitive to the presence of misclassifications, and our formulation should be useful for postponing or suppressing overfitting.
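The paper does not describe an implementation, but the per-sample penalty $u_C(x_i) = C \cdot u(x_i)$ in (5) can be approximated in scikit-learn through the sample_weight argument of SVC.fit, which rescales C for each training point. The sketch below reflects this assumption and is not the authors' code.

```python
# Hedged sketch: the FSVMs-UPC primal (5)-(7) uses a per-sample penalty
# u_C(x_i) = C * u(x_i).  In scikit-learn, SVC.fit(..., sample_weight=w)
# rescales C per sample, which approximates the same effect.  This is our
# reconstruction, not the authors' implementation.
from sklearn.svm import SVC

def fit_fsvm_upc(X, y, C, membership):
    """membership: array of grades u(x_i) in [0, 1], e.g. from Eq. (4)."""
    clf = SVC(kernel="rbf", C=C, gamma="scale")
    clf.fit(X, y, sample_weight=membership)   # effective penalty ~ C * u(x_i)
    return clf

# Usage (hypothetical helper from the earlier sketch):
# u = triangular_membership(z, a, M, b)
# model = fit_fsvm_upc(X, y, C=2.0, membership=u)
```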
Solving the primal problem (5)–(7) by the Lagrangian method converts it into the following dual problem:

$$\max_{\tilde{\alpha}}\;\; L_D(\tilde{\alpha}) = \sum_{i=1}^{n} \tilde{\alpha}_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{\alpha}_i \tilde{\alpha}_j\, y_i y_j\, K(x_i, x_j), \qquad (9)$$

subject to
$$0 \le \tilde{\alpha}_i \le u_C(x_i), \qquad i = 1, 2, \ldots, n, \qquad \sum_{i=1}^{n} \tilde{\alpha}_i y_i = 0. \qquad (10)$$
Eq. (9) employs a kernel function $K(x_i, x_j)$ (Vapnik, 1999) that computes the dot product of the samples in a non-linear SVM. The multipliers $\tilde{\alpha}_i$ are the solutions of the quadratic program (9) and (10), and this is where the formulation differs from the original SVMs. The non-linear decision function is
$$f(x) = \operatorname{sign}\!\left( \sum_{k=1}^{\#SV} \tilde{\alpha}_k y_k K(x, x_k) + b \right), \qquad (11)$$
where $b$ is obtained according to the Karush–Kuhn–Tucker (KKT) conditions.

4. Experimental results and discussions

This section presents simulations of the original SVMs and the FSVMs-UPC on the same data sets. Details of the training procedure, the model assessment, and other aspects of the FSVMs-UPC development are given below.

4.1. The data sets

In our experiments, four datasets are used for the generalization performance test: TwoNorm (Breiman, 1996), Bupa (Murphy, 1995), Diabetes (Murphy, 1995), and Biomed (Vlachos & Meyer, 1989). To estimate the prediction error we adopt the most widely used method, K-fold cross-validation (Hastie, Tibshirani, & Friedman, 2001). The details of all these datasets and the parameters used are given in Table 1. The point M is estimated at the intersection of the two class densities, as noted in Eq. (4); if the densities intersect at more than one point, we take the average. The label $y_j = 1$ is assigned to Class 1 and $y_j = -1$ to Class 2, and the other relevant parameter setting is $\sigma = \sqrt{d}$ in the RBF kernel.

Table 1
Details of datasets

Dataset     Class 1   Class 2   Total   Number of features   K-fold   Point M
Twonorm     60        60        120     2                    10       0.1779
Bupa        145       200       345     6                    5        0.4292
Diabetes    268       500       768     8                    4        0.3743
Biomed      66        126       192     4                    6        0.0428

4.2. Changes in training phase

To clearly understand the simulations of the original SVMs and the FSVMs-UPC, consider the TwoNorm dataset restricted to a two-dimensional input space with two classes. In this dataset, each class is drawn from a multivariate normal distribution with unit standard deviation; one center is located at $(2/\sqrt{20},\, 2/\sqrt{20})$ and the other at $(-2/\sqrt{20},\, -2/\sqrt{20})$ (Murphy, 1995), as shown in Fig. 3. The intersection point is found at $M = 0.1783$ in Fig. 4. In this paper, we suggest setting the parameter $\sigma = \sqrt{d}$ in the RBF kernel function. The separating hyperplanes of the TwoNorm dataset produced by the original SVMs and the FSVMs-UPC are shown in Fig. 5. In this example, the parameter C of the competing models is set at the same level of 2, as noted in Eq. (5), and the misclassified points are marked with solid symbols (filled circles and squares). The misclassification rate increases slightly from 19.17% to 20.83%, but the FSVMs-UPC classifier produces a less wiggly boundary, owing to a margin growth from 0.511 to 0.673 between the original SVMs and the FSVMs-UPC. Fig. 5b shows that the boundary of the FSVMs-UPC is smoother. Clearly, the margin growth trades off against the misclassification count (Vapnik, 1999).
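Section 4.1 estimates the point M at the intersection of the two fitted class densities, averaging the roots when there is more than one. The following sketch is one possible reconstruction (our own, not the authors' procedure), fitting univariate Gaussians to one-dimensional class projections and locating the crossings numerically.

```python
# Hedged reconstruction (not the authors' code) of the estimation of M in
# Section 4.1: fit univariate Gaussians to 1-D class projections, locate the
# crossings of the two densities numerically, and average them if there is
# more than one.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def estimate_M(z1, z2, n_grid=2000):
    """z1, z2: 1-D projections of the Class 1 and Class 2 samples."""
    z1, z2 = np.asarray(z1, dtype=float), np.asarray(z2, dtype=float)
    p1 = norm(np.mean(z1), np.std(z1))
    p2 = norm(np.mean(z2), np.std(z2))
    diff = lambda x: p1.pdf(x) - p2.pdf(x)

    grid = np.linspace(min(z1.min(), z2.min()), max(z1.max(), z2.max()), n_grid)
    vals = diff(grid)
    roots = [brentq(diff, grid[i], grid[i + 1])
             for i in range(n_grid - 1)
             if np.sign(vals[i]) != np.sign(vals[i + 1])]
    # Average multiple intersections, as described in Section 4.1.
    return float(np.mean(roots)) if roots else float(0.5 * (np.mean(z1) + np.mean(z2)))
```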
Fig. 3. TwoNorm data set.
Fig. 4. The overlapping distribution of the TwoNorm data set.
Fig. 5. Comparison of decision boundaries between the original SVMs and the FSVMs-UPC.
As an example of the benefit of the fuzzy objective coefficients, we return to the TwoNorm data. Fig. 6 shows the training error of the classifier as the parameter C is varied from $2^0$ to $2^{28}$ in multiplicative steps of 2. As mentioned earlier for C = 2, the training misclassification rate of the FSVMs-UPC can be higher than that of the original SVMs, but an accordingly larger margin is obtained. Fig. 7 shows the corresponding results. Referring to Vapnik (1999), a little thought suggests that a larger margin will lead to good separation on the test data, which encourages this innovation.

Fig. 6. Classification accuracy for the TwoNorm dataset with different parameter C.

Fig. 7. Margin for the TwoNorm dataset with different parameter C.
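The paper does not state how the margin in Fig. 7 is computed; for a kernel SVM it can be recovered from the dual coefficients, since $\|w\|^2 = \sum_{i,j}(\alpha_i y_i)(\alpha_j y_j)K(x_i, x_j)$. The sketch below assumes this and takes the margin width as $2/\|w\|$; the exponential grid of C and the kernel width are likewise our assumptions.

```python
# Hedged sketch (our reconstruction): recover the margin plotted in Fig. 7 from
# a trained kernel SVM.  ||w||^2 = sum_{i,j} (alpha_i y_i)(alpha_j y_j) K(x_i, x_j),
# and the margin width is taken here as 2 / ||w|| (convention assumed by us).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def margin_width(clf, gamma):
    sv = clf.support_vectors_
    coef = clf.dual_coef_.ravel()          # alpha_i * y_i for each support vector
    K = rbf_kernel(sv, sv, gamma=gamma)
    return 2.0 / np.sqrt(coef @ K @ coef)

# Sweep C over an exponential grid as in Figs. 6 and 7 (grid bounds assumed);
# gamma = 1 / (2 * d) corresponds to the paper's sigma = sqrt(d) convention.
# d = X_train.shape[1]; gamma = 1.0 / (2 * d)
# for C in [2.0 ** k for k in range(0, 29)]:
#     clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
#     print(C, margin_width(clf, gamma), clf.score(X_train, y_train))
```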
4.3. Generalization performance via K-fold cross-validation

The generalization performance of a classifier is an important issue. In general, the expected prediction error over many independent test sets is not easy to obtain, so K-fold cross-validation (Hastie et al., 2001) on a single dataset is often used instead to assess the generalization performance; the value of K for each dataset is given in Table 1. According to the K-fold cross-validation curves, this behavior can be used in practice to determine an optimal point at which the learning process can be terminated. Fig. 8 shows the optimal point of the FSVMs-UPC at $C = 2^{10}$ (a), $C = 2^{4}$ (b), and $C = 2^{3}$ (c), respectively. Comparing the original SVMs with the FSVMs-UPC at the minimum testing error, the test error rate of the FSVMs-UPC is slightly better: from 26.67% to 26.38% (a), from 22.92% to 22.01% (b), and from 12.5% to 10.42% (c). Fig. 8 also suggests an important observation: even as the training error rate becomes lower, the test error rate of the FSVMs-UPC remains smaller than that of the original SVMs, because the FSVMs-UPC is less prone to overfitting. To sum up, our experiments indicate that the FSVMs-UPC not only has better testing accuracy but also mitigates the drawback of overfitting.

Fig. 8. Assessments and comparisons of generalization performance over exponential grids of C in the K-fold cross-validation.
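A hedged sketch of the K-fold assessment described above is given below: the number of splits K follows Table 1, while the membership vector, the kernel width, and the comparison loop are our assumptions rather than the authors' exact protocol.

```python
# Hedged sketch of the K-fold assessment in Section 4.3 (our reconstruction):
# compare the cross-validated test error of a plain SVM with that of the
# membership-weighted variant.  K follows Table 1; the membership vector and
# kernel width are assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def kfold_error(X, y, C, K, memberships=None, gamma="scale"):
    errors = []
    for train, test in StratifiedKFold(n_splits=K, shuffle=True,
                                       random_state=0).split(X, y):
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        if memberships is None:
            clf.fit(X[train], y[train])                                    # original SVM
        else:
            clf.fit(X[train], y[train], sample_weight=memberships[train])  # FSVMs-UPC style
        errors.append(1.0 - clf.score(X[test], y[test]))
    return float(np.mean(errors))

# e.g. kfold_error(X, y, C=2.0, K=10) vs kfold_error(X, y, C=2.0, K=10, memberships=u)
```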
5. Concluding remarks

In this work, fuzzy set theory was used to explore overlapping class boundaries in SVMs, and a fuzzy decision-making formulation was proposed. The FSVMs-UPC classifier achieved better testing accuracy and postponed overfitting. One possible conclusion is that feature extraction can compensate for the lack of data, and thus an improvement in classification performance is possible. Furthermore, we assumed that the imprecise coefficients in the parameter u_C(x_i) make different contributions to the programming problem. In future work, the FSVMs-UPC should be extended to multi-class classification and applied to fingerprint and face recognition.

Acknowledgement

The authors wish to acknowledge the financial support of the National Science Council of the Republic of China under Contract NSC 96-2221-E-008-116.

References

Breiman, L. (1996). Bias, variance and arcing classifiers. Technical report 460. CA: Statistics Department, University of California.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. Berlin, Heidelberg, New York: Springer-Verlag (pp. 214–217).
Lin, C. F., & Wang, S. D. (1999). Training algorithms for fuzzy support vector machine with noisy data. Pattern Recognition Letters, 25(14), 1647–1656.
Tax, D. M. J., & Duin, R. P. W. (2005). Characterizing one-class datasets. In Proceedings of the 16th annual symposium of the pattern recognition association of South Africa (pp. 21–26).
Murphy, P. M. (1995). UCI-benchmark repository of artificial and real data sets. http://www.ics.uci.edu/~mlearn. CA: University of California Irvine.
Prati, R. C., Batista, G. E. A. P. A., & Monard, M. C. (2004). Class imbalances versus class overlapping: An analysis of a learning system behavior. In MICAI (pp. 312–321).
Schölkopf, B., Simard, P., Smola, A., & Vapnik, V. (1998). Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems 10 (pp. 312–321). MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin, Heidelberg, New York: Springer-Verlag.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10, 988–999.
Vapnik, V. N., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264–280.
Vlachos, P., & Meyer, M. (1989). StatLib. http://lib.stat.cmu.edu/. Department of Statistics, Carnegie Mellon University.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.