Applied Mathematics and Computation 283 (2016) 141–152
Distance-based margin support vector machine for classification

Yan-Cheng Chen a, Chao-Ton Su b,∗

a National Chung-Shan Institute of Science and Technology, Taoyuan City, Taiwan
b Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Room 820, Engineering Building I, 101, Sec. 2, Kuang Fu Road, Hsinchu 30013, Taiwan
∗ Corresponding author. E-mail address: [email protected] (C.-T. Su).

Keywords: Support vector machine; Class imbalance; Class overlapping; Classification
Abstract

Recently, the development of machine-learning techniques has provided an effective analysis tool for classification problems. The support vector machine (SVM) is one of the most popular supervised learning techniques. However, the SVM may not effectively detect instances of the minority class and may achieve lower classification performance in the overlap region when learning from complicated data sets. Complicated data sets with imbalanced and overlapping class distributions are common in most practical applications, and they negatively affect the classification performance of the SVM. The present study proposes the use of modified slack variables within the SVM (MS-SVM) to solve complex data problems, including class imbalance and overlapping. Artificial and UCI data sets are used to evaluate the effectiveness of the MS-SVM model. Experimental results indicate that the MS-SVM performed better than the other methods in terms of accuracy, sensitivity, and specificity. In addition, the proposed MS-SVM is a robust approach for handling different levels of complex data sets.

© 2016 Elsevier Inc. All rights reserved.
1. Introduction

The SVM, proposed by Vapnik [1], has become one of the most powerful classification techniques in the data mining field. The SVM is grounded in statistical learning theory and has shown excellent classification results in many applications, from handwritten digit recognition [2] to text categorization [3]. The SVM also works very well with high-dimensional data sets and avoids the dimensionality problem [4]. The aim of the SVM is to minimize classification errors by maximizing the margin between the separating hyperplane and the data sets. A special property of the SVM is that it simultaneously minimizes the empirical classification error and maximizes the geometric margin. However, the SVM may not effectively detect instances of the minority class when learning from complex data sets. Complicated data sets with class imbalance and overlapping distributions are common in most practical applications and degrade the classification performance of the SVM. In predicting the rarest objects, the hyperplane or decision boundary generated by the SVM can be severely skewed toward the majority class, especially under class imbalance, when the training instances of the majority class outnumber those of the minority class. For the class overlapping problem, a clear interval of separation between the two classes is almost non-existent. In complex overlapping regions, a decision boundary without flexibility has difficulty separating the classes correctly because the regions of the different classes usually have the same ranges in a single attribute or in multiple attributes.
The SVM usually produces high predictive values for the majority class and poor values for the minority class when learning from complex data sets. This condition results in a serious bias against the minority class, especially in medical disease and credit scoring problems. For example, if 1% of the patients have a medical disease, then a model that predicts every patient as healthy has 99% accuracy, although it fails to detect any of the disease conditions.

Recent studies have suggested four main directions for coping with data complexity problems: sampling methods, modified kernel boundaries, trade-off cost matrices, and adaptive margins. Preprocessing the data by down-sampling the majority class or over-sampling the minority class is adopted as the first approach. Several researchers have studied over-sampling and down-sampling techniques for skewed or imbalanced data sets [5,6]. With these sampling methods, the performance of the classification methods changes as the data set is transformed from imbalanced to newly balanced. The synthetic minority over-sampling technique (SMOTE) is a well-known algorithm for fighting unbalanced classification problems [5]. The general idea of this method is to generate new examples of the minority class artificially using the nearest neighbors. Furthermore, the majority class examples are also under-sampled, leading to a more balanced data set. The lack of rigorous and systematic treatment for imbalanced data is a common problem in sampling methods [7]. For example, down-sampling training data sets can potentially remove important information, whereas over-sampling may introduce noise. The goal of the second approach is to place the learned hyperplane farther from the positive class. Wu and Chang [8] proposed adjusting the class boundary by modifying the kernel matrix according to the class distributions, to keep the hyperplane from moving closer to the positive instances. The original kernel matrix and the conformal transformation kernel matrix will produce bias during the iterative procedure. In the third approach, Veropoulos et al. [9] suggested controlling the trade-off between false positives (FPs) and false negatives (FNs) within the learning technique to overcome class imbalance problems. For the fourth approach, Trafalis and Gilbert [10,11] and Song et al. [12] proposed replacing the original elements by modifying the slack variables and augmenting the data points. Replacing the slack variables with modified slack variables makes the adaptive margin and decision function less affected by abnormal examples. Replacing each original data point with a composite of its mean and standard deviation overcomes data perturbations [10,11] for linear and non-linear data sets. Augmented terms designed around large geometric margins have been proposed by reconsidering the original elements of the SVM in a convex quadratic programming model.

The present study aims to develop modified slack variables within the SVM to enhance its classification performance and to adjust a suitable distance between the decision boundary and the margin. The main idea of the proposed method follows previous studies [12,13], which use a modified margin to reformulate the model.
The modified margins use distance metrics to replace the original slack variables and to enlarge the distance between the rarest objects and the normal groups when learning from complex data sets. The original slack variables in the SVM model are not sensitive to the rarest objects during the learning stage. By taking advantage of distance metrics, the proposed method can achieve good performance by reducing the influence of the class distributions and separating the majority and minority classes effectively. Slack variables based on distance metrics are included in the objective function of the dual problem, and the associated parameters are determined by the gradient descent method. The proposed method, with the aligned slack variables, is not easily affected by complex data sets. The proposed method is evaluated in the same manner as the standard SVM on synthetic and UCI data sets. Data complexity metrics are used to evaluate the performance of the proposed method in various scenarios. The experimental results of these metrics provide a qualitative description of the training data characteristics and help explain the good performance of the proposed method under different data set scenarios.

The present paper is structured as follows. Section 2 presents the proposed modified SVM. Section 3 consists of the systematic experiments on the synthetic and UCI data sets based on the measured performance metrics. Section 4 summarizes the experimental results. Finally, the main remarks and future works are presented in Section 5.

2. Proposed modified slack variables within the SVM (MS-SVM)

2.1. Original SVM

Vapnik [1] first proposed the SVM, a set of related machine-learning techniques, to solve classification problems. In a simple pattern classification problem, a hyperplane separates two classes of patterns based on given examples (x_i, y_i), for i = 1, …, n, where x_i is a vector in the input space S ⊆ Rⁿ and y_i denotes the class index, taking the value +1 or −1. The kernel trick transforms the data x_i from S into a feature space F ⊆ R^N (N may be infinite) using a nonlinear mapping φ(x). In machine learning, the kernel trick allows a classifier algorithm to solve a nonlinear problem by mapping the original nonlinear observations into the higher-dimensional space F. It then searches for a linear decision function
f(x) = w · φ(x) + b   (2.1)
in the feature space. Patterns are classified according to the sign of Eq. (2.1). If no hyperplane can split the positive and negative instances, the soft margin method selects a hyperplane that splits the instances as cleanly as possible while the distance to the nearest cleanly split instances is maximized.
Fig. 1. The calculation of the distance metric for normal and abnormal examples.

This is accomplished by introducing positive-valued slack variables (ξ_i) into the constraints of the optimization problem, as shown in the following equations:
w · φ(x_i) + b ≥ 1 − ξ_i,  if y_i = +1,   (2.2)

w · φ(x_i) + b ≤ −1 + ξ_i,  if y_i = −1,   (2.3)
where ∀i : ξ_i ≥ 0. The objective function is then modified to penalize nonzero slack variables. Mathematically, the SVM solves the optimization problem
Min  ||w||²/2 + C Σ_{i=1}^{n} ξ_i,
subject to  y_i (w · φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  1 ≤ i ≤ n,   (2.4)

where C is the soft margin constant parameter.
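As an illustration of the role of C in Eq. (2.4), the soft-margin SVM can be fitted with an off-the-shelf library. This is purely a sketch, not the authors' implementation (their experiments were run in R); the toy data, parameter values, and variable names below are assumptions.

```python
# Illustrative soft-margin SVM, Eq. (2.4), fitted with scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two hypothetical Gaussian classes (not the paper's data).
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(C=1.0, kernel="rbf", gamma=0.5)  # C trades margin width against slack
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

A larger C penalizes the slack variables more heavily and narrows the margin, whereas a smaller C tolerates more misclassified points.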
Some common kernel functions are listed below:
1. Polynomial:  k(x_i, x_j) = (x_i · x_j + 1)^p   (2.5)

2. Radial basis:  k(x_i, x_j) = exp(−||x_i − x_j||²/2σ²)   (2.6)

3. Sigmoid:  k(x_i, x_j) = tanh(κ x_i · x_j + c)   (2.7)

where p, κ, σ, and c are constants.
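The kernels in Eqs. (2.5)–(2.7) translate directly into code. The sketch below is a minimal NumPy rendering; the function and default parameter names are chosen for illustration only.

```python
# Minimal NumPy versions of the kernels in Eqs. (2.5)-(2.7); names are illustrative.
import numpy as np

def polynomial_kernel(xi, xj, p=2):
    return (np.dot(xi, xj) + 1.0) ** p                            # Eq. (2.5)

def rbf_kernel(xi, xj, sigma=1.0):
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))       # Eq. (2.6)

def sigmoid_kernel(xi, xj, kappa=1.0, c=0.0):
    return np.tanh(kappa * np.dot(xi, xj) + c)                    # Eq. (2.7)
```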
For more details on the SVM, refer to [14,15].

2.2. Proposed MS-SVM

The present study proposes the MS-SVM, which replaces the original slack variables (ξ_i) with distance metrics. The advantage of the proposed MS-SVM is the ability of the distance metric to enlarge the geometric distance between the abnormal and normal examples. The distance metric is incorporated into the constraints of a reformulated quadratic problem of the SVM. Fig. 1 shows the special property of the distance metric for normal and abnormal examples. The hyperplane lies at the center of the normal and abnormal examples when the training data set is balanced and non-overlapped. The main idea of the MS-SVM is to find a hyperplane that correctly separates the binary-labeled training data with the optimal margin, yielding maximal robustness to perturbation and reducing the risk of future misclassification. The modified slack variables can sensitively detect the abnormal examples in the imbalanced and overlapped cases. Fig. 2 shows the change generated by the MS-SVM from the original hyperplane to the modified hyperplane.

In the original SVM model for non-separable data sets, Eq. (2.4) represents the primal convex quadratic problem. A new slack variable D(x_i)², instead of {ξ_i}_{i=1}^{l}, is introduced into the standard SVM training to obtain a formal setting for non-separable training data points. The MS-SVM model is developed by minimizing only the margin term of the weight vector w, instead of minimizing the sum of the margin term and the misclassification error, as in the original SVM model. The modified model controls the margin for each training point, as presented below.
Min  ||w||²/2
s.t.  y_i f(x_i) ≥ 1 − θ D(x_i)²,   (2.8)
Fig. 2. The change of the hyperplane and margins.
where θ is a preselected parameter measuring the influence of averaged information and D(x_i)² represents the normalized distance between each data point and the center of the respective class, which is calculated by
D(x_i)² = (x_i − μ)ᵗ S⁻¹ (x_i − μ) / max D(x_i)²,   (2.9)
where S⁻¹ is the inverse of the covariance matrix, μ is the mean vector, t denotes the transpose, and max D(x_i)² represents the maximum value of D(x_i)². The distance D(x_i)² satisfies the following equation:
0 ≤ D(x_i)² ≤ 1.   (2.10)
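A minimal sketch of how the normalized distance in Eqs. (2.9)–(2.10) could be computed is given below. It assumes, as Fig. 1 and the use of μ_i in Eq. (2.20) suggest, that each example is compared with the mean vector and covariance matrix of its own class; all names are illustrative.

```python
# Sketch of the normalized distance in Eqs. (2.9)-(2.10); each example is assumed
# to be compared with the statistics of its own class (cf. Fig. 1).
import numpy as np

def normalized_class_distance(X, y):
    """Return D(x_i)^2 in [0, 1] for every training example."""
    d2 = np.empty(len(X))
    for label in np.unique(y):
        mask = (y == label)
        Xc = X[mask]
        mu = Xc.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
        diff = Xc - mu
        d2[mask] = np.einsum("ij,jk,ik->i", diff, S_inv, diff)   # Mahalanobis distances
    return d2 / d2.max()                                         # normalization, Eq. (2.9)
```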
The SVs satisfy the following equation:
y_i f(x_i) = 1 − θ D(x_i)².   (2.11)
The Lagrangian function for this optimization problem is

L_p = ||w||²/2 − Σ_{i=1}^{n} α_i [y_i f(x_i) − 1 + θ D(x_i)²],   (2.12)
where α_i is called the Lagrange multiplier. The Lagrange multipliers α_i determine the weight vector of the training set in the dual optimization problem. Setting the first-order derivatives of L_p with respect to w and b to zero results in the following equations:

∂L_p/∂w = w − Σ_{i=1}^{n} α_i y_i x_i = 0,   (2.13)

∂L_p/∂b = − Σ_{i=1}^{n} α_i y_i = 0.   (2.14)
From the above equations, the parameter w can be derived as

w = Σ_{i=1}^{n} α_i y_i x_i,   (2.15)
and the bias b can be computed from the average of the values satisfying

α_i [y_i f(x_i) − 1 + θ D(x_i)²] = 0   (2.16)
for each SV. Substituting Eq. (2.15) into Eq. (2.12), we obtain
w(α, θ) = Σ_{i=1}^{l} α_i (1 − θ D(x_i)²) − (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j k(x_i, x_j).   (2.17)
The dual problem for the non-separable pattern is presented as follows:
Maximize  w(α, θ) = Σ_{i=1}^{l} α_i (1 − θ D(x_i)²) − (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j k(x_i, x_j)

s.t.  Σ_{i=1}^{l} α_i y_i = 0,  α_i ≥ 0.   (2.18)
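The dual in Eq. (2.18) is a standard quadratic program. The sketch below poses it for a generic QP solver; CVXOPT is used only as an example, since the paper does not specify a solver, and the function and variable names are assumptions.

```python
# Hedged sketch: the MS-SVM dual, Eq. (2.18), as a standard quadratic program.
import numpy as np
from cvxopt import matrix, solvers

def solve_ms_svm_dual(K, y, D2, theta):
    """K: l x l kernel matrix, y: labels in {-1,+1}, D2: normalized distances D(x_i)^2."""
    l = len(y)
    P = matrix(np.outer(y, y) * K)                    # quadratic term  y_i y_j k(x_i, x_j)
    q = matrix(-(1.0 - theta * D2))                   # linear term  -(1 - theta * D(x_i)^2)
    G = matrix(-np.eye(l))                            # enforces alpha_i >= 0
    h = matrix(np.zeros(l))
    A = matrix(y.reshape(1, -1).astype(float))        # equality constraint  sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).flatten()               # Lagrange multipliers alpha
```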
After obtaining the optimal parameters w and b by solving the dual problem, patterns can be classified according to the sign of Eq. (2.1). The parameter θ is determined by the gradient descent method, a first-order optimization algorithm. The present study tries to maximize Eq. (2.18) and find a suitable parameter θ. A good strategy in this case is to maximize the value of Eq. (2.18) on the training examples, which can be achieved by using the gradient descent method to adjust the parameter after each step by a small amount along the gradient direction:

∂w(α, θ)/∂θ = − Σ_{i=1}^{l} α_i (x_i − μ_i)ᵗ S⁻¹ (x_i − μ_i) / max D(x)².   (2.19)
The general gradient descent method yields the estimated θ̂:

θ̂_{t+1} = θ̂_t + ρ Σ_{i=1}^{l} α_i (x_i − μ_i)ᵗ S⁻¹ (x_i − μ_i) / max D(x)²,  t ≥ 0,   (2.20)
where ρ is the learning rate. The effects of the regularization parameter θ are similar to those in previous works [12,13] and are discussed as follows:

• Case 1: If θ = 0, no margin modification is performed, and the MS-SVM reduces to the hard-margin SVM.
• Case 2: If θ = 1, the constraint function of the MS-SVM is equal to the original margin of the SVM. In this condition, there is no difference between the MS-SVM and the standard SVM.
• Case 3: If 0 < θ < 1, then 0 < θ × D(x_i)² < 1. The algorithm performs well for abnormal examples lying on the correct side of the decision boundary in the decision region. The influence of each class center is limited.
• Case 4: If θ > 1, then θ × D(x_i)² may exceed 1. The algorithm is robust against abnormal examples falling in the wrong regions. The SVs tend to be data points relatively closer to the center of the class. The larger the parameter θ, the closer the support vectors are to the class center. The algorithm becomes more robust against abnormal examples and yields a smoother decision boundary. However, the classification error may increase because the decision boundary is adjusted toward the detected abnormal examples.
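To illustrate Cases 3 and 4 with hypothetical numbers: for a point with D(x_i)² = 0.8, choosing θ = 0.5 relaxes the required margin in Eq. (2.8) to 1 − 0.5 × 0.8 = 0.6, whereas θ = 2 gives 1 − 2 × 0.8 = −0.6, so a point far from its class center can satisfy its constraint even when it lies on the wrong side of the hyperplane.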
The parameter σ̂² is also estimated by the gradient descent method:

∂w(α, θ)/∂σ² = −(1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j ||x_i − x_j||² k(x_i, x_j),   (2.21)

where the kernel function k(x_i, x_j) is exp(−||x_i − x_j||²/2σ²). The general gradient descent method yields the estimated σ̂²:

σ̂²_{t+1} = σ̂²_t + ρ (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j ||x_i − x_j||² k(x_i, x_j),  t ≥ 0.   (2.22)
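A compact sketch of the update rules in Eqs. (2.20) and (2.22) is given below. The learning rate ρ = 0.02 follows Section 3.3, while the function names and precomputed inputs are assumptions made for illustration.

```python
# Sketch of the gradient-descent updates in Eqs. (2.20) and (2.22); names illustrative.
import numpy as np

def update_theta(theta, alpha, d2, rho=0.02):
    # d2[i] is the normalized distance D(x_i)^2 of Eq. (2.9)
    return theta + rho * np.sum(alpha * d2)                       # Eq. (2.20)

def update_sigma2(sigma2, alpha, y, sq_dists, rho=0.02):
    # sq_dists[i, j] = ||x_i - x_j||^2; the RBF kernel is re-evaluated at the current sigma^2
    K = np.exp(-sq_dists / (2.0 * sigma2))
    step = 0.5 * np.sum(np.outer(alpha * y, alpha * y) * sq_dists * K)
    return sigma2 + rho * step                                    # Eq. (2.22)
```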
This algorithm performs robustly against abnormal examples by adjusting the distance-based margin. Fig. 3 summarizes the MS-SVM algorithm. We apply MS-SVM to the training data set x_i for T iterations or until the terminal condition is reached. In each iteration, MS-SVM adaptively calculates the function D(x_i)² and updates the parameters θ and σ.

Regarding computational complexity, the MS-SVM adds an augmented term to the convex quadratic problem. The number of operations must account for the computation of the modified slack variable and for the number of iterations needed to update the estimated parameters. Let h be the number of iterations. An inner product in the modified slack variable requires O(m × m) operations. In the case where the SVs are not at the upper bound and N_SV/m ≪ 1, the number of operations is O(N_SV³ × h + N_SV² × nh + N_SV² × mnh + m × h). However, if N_SV/m ≈ 1, the number of operations is O(N_SV³ × h + N_SV² × nh + N_SV × mnh + m × h). If most SVs are at the upper bound and N_SV/m ≪ 1, the number of operations is O(N_SV³ × h + N_SV² × mnh + m × h). Finally, if most SVs are at the upper bound and N_SV/m ≈ 1, the number of operations is O(mh × n² + m² × h).

3. Performance evaluation

This section evaluates the effectiveness of the proposed method. Several metrics are adopted to measure the effect of data complexity and to evaluate the performance of the classifiers, as described below. The overall experimental procedure is presented in Fig. 4.
Fig. 3. The MS-SVM algorithm.
Fig. 4. The experimental procedure.
3.1. Measure of data complexity

The behavior of classifiers is highly related to specific data characteristics. Past studies commonly relied on weak comparisons of classifier accuracies over a reduced collection of unexplored data sets. These studies did not consider the statistical and geometrical concepts of class distributions to explain the classification results. Several researchers [11,16–18] have recently proposed a number of measures to characterize data complexity and tried to relate such descriptions to practical classifier performance. The present study uses data complexity measures to analyze the behavior of the proposed method in a number of situations.

3.1.1. Fisher's discriminant ratio

To measure data complexity, Fisher's discriminant ratio [16] is used to measure the separation of two classes according to a given feature. Eq. (3.1) evaluates the effectiveness of a single feature in separating the corresponding classes:
Fisher's discriminant ratio = (μ₁ − μ₂)² / (σ₁² + σ₂²),   (3.1)
where μ₁, μ₂, σ₁², and σ₂² are the means of the two classes and their variances, respectively. Eq. (3.2) considers the all-feature space for multiple classes. The class overlapping ratio index (R1) adapts Fisher's discriminant ratio to multi-dimensional attributes and multiple classes:
R1 = Σ_{i=1}^{Class} n_i · δ(μ, μ_i) / Σ_{i=1}^{Class} Σ_{j=1}^{n_i} δ(x_ij, μ_i),   (3.2)
where n_i indicates the number of samples in class i, δ is the Euclidean distance metric, μ is the overall mean, μ_i corresponds to the mean of class i, and x_ij represents sample j belonging to class i. A smaller value of R1 indicates a higher degree of class imbalance in the data set.

3.1.2. Volume of overlap region

The volume of the overlapping region (R2) [17] was proposed for each feature f_i, in which the length of the overlap range is normalized by the length of the total range over which the values of both classes are distributed. For multi-dimensional attributes, the index in Eq. (3.3) is obtained as the product of the normalized lengths of the overlapping ranges over all features:
R2 = Π_{i=1}^{m} (min max_i − max min_i) / (max max_i − min min_i),   (3.3)

where min max_i = min{max(f_i, c_1), max(f_i, c_2)}, max min_i = max{min(f_i, c_1), min(f_i, c_2)}, max max_i = max{max(f_i, c_1), max(f_i, c_2)}, and min min_i = min{min(f_i, c_1), min(f_i, c_2)}, with max(f_i, c_k) and min(f_i, c_k) denoting the maximum and minimum values of feature f_i in class c_k (k = 1, 2), respectively. A smaller value of R2 indicates a higher degree of class overlap in the data set.
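Both complexity measures can be computed directly from a labeled sample. The sketch below follows Eqs. (3.2) and (3.3) for the two-class case; clamping negative overlap lengths to zero is an added assumption, since the paper does not state how non-overlapping features are handled, and all names are illustrative.

```python
# Sketch of the data complexity measures R1 (Eq. (3.2)) and R2 (Eq. (3.3)).
import numpy as np

def r1_ratio(X, y):
    mu = X.mean(axis=0)                                    # overall mean
    num, den = 0.0, 0.0
    for label in np.unique(y):
        Xc = X[y == label]
        mu_c = Xc.mean(axis=0)
        num += len(Xc) * np.linalg.norm(mu - mu_c)         # n_i * delta(mu, mu_i)
        den += np.linalg.norm(Xc - mu_c, axis=1).sum()     # sum_j delta(x_ij, mu_i)
    return num / den

def r2_volume(X, y):
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    min_max = np.minimum(X1.max(axis=0), X2.max(axis=0))
    max_min = np.maximum(X1.min(axis=0), X2.min(axis=0))
    max_max = np.maximum(X1.max(axis=0), X2.max(axis=0))
    min_min = np.minimum(X1.min(axis=0), X2.min(axis=0))
    overlap = np.clip(min_max - max_min, 0.0, None)        # per-feature overlap length
    return np.prod(overlap / (max_max - min_min))
```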
Table 1
Different outcomes of a binary classification problem.

                          True condition
Test outcome              Positive                 Negative
Positive                  True positive (TP)       False positive (FP)
Negative                  False negative (FN)      True negative (TN)
                          ↓ Sensitivity            ↓ Specificity
3.2. Performance metrics

Different methods should be compared on the same data set to identify the more effective method. In a binary classification problem with classes yes and no, incidence or absence, and so on, a single prediction has four different possible outcomes, as presented in Table 1. For more details on the definitions of the four possible outcomes, refer to [4]. To measure performance on overlapped and skewed data sets, the concepts of sensitivity and specificity are used in the current experiments to evaluate the proposed method. Sensitivity is defined as the proportion of true positives (TP) among all positive instances in the data set, whereas specificity is the proportion of true negatives (TN) among all negative instances. The formulations of sensitivity and specificity are as follows:
Sensitivity = TP / (TP + FN),   (3.4)

Specificity = TN / (TN + FP).   (3.5)
Sensitivity and specificity are used in the present study to evaluate the classification performance of the classifiers because accuracy alone does not reflect high levels of overlap in overlapped data sets or class imbalance in skewed data sets. Accuracy is the number of correct classifications divided by the total number of classifications, that is,

Accuracy = (TP + TN) / (TP + FN + FP + TN).   (3.6)
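For completeness, Eqs. (3.4)–(3.6) translate directly into code; the helper names below are illustrative.

```python
# Direct translation of Eqs. (3.4)-(3.6) from confusion-matrix counts.
def sensitivity(tp, fn):
    return tp / (tp + fn)                      # Eq. (3.4)

def specificity(tn, fp):
    return tn / (tn + fp)                      # Eq. (3.5)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fn + fp + tn)     # Eq. (3.6)
```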
The three indexes of sensitivity, specificity, and accuracy are used in the current study to measure the performance of the classification techniques.

3.3. Experiments

In this section, the performance of the proposed method, MS-SVM, is evaluated against the original SVM and SMOTE on data complexity problems. In the first experiment, artificial data sets are used to allow all variables to be controlled and analyzed. A three-factor design is used to study the interaction between class imbalance and class overlapping. The first experimental design considers three levels of the "class imbalance" factor, three levels of the "class overlapping" factor, and two levels of the "classifiers" factor, with all three factors fixed and arranged in a factorial experiment. Synthetic data sets were generated according to the different combinations of treatments to evaluate the effect of class imbalance and overlapping on classification accuracy. Each data set is described by two classes and is randomly generated from two-dimensional normal distributions. The three levels of class overlapping are (1) low level: μ₁ = 40, μ₂ = 60; (2) medium level: μ₁ = 45, μ₂ = 55; and (3) high level: μ₁ = 50, μ₂ = 50. The covariance matrix for the synthetic data sets at all three levels is defined by
σ₁₁ = 16,  σ₁₂ = 20,
σ₂₁ = 20,  σ₂₂ = 64.
Each data set contains a total of 100 instances. The three levels of class imbalance are (1) low: 50% of the instances are from class 1 and 50% from class 2; (2) medium: 75% from class 1 and 25% from class 2; and (3) high: 90% from class 1 and 10% from class 2. The third factor (classifiers) includes the original SVM and the proposed MS-SVM. Fig. 5 illustrates nine different examples of the synthetic data sets. The Y axis is irrelevant and does not affect class imbalance and overlapping. The notations "◦" and "×" denote the majority and minority classes, respectively. Table 2 presents the two indexes of the measure of data complexity for the synthetic data sets.
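As an illustration of the synthetic design described above, the following sketch generates one data set for the high-overlap, high-imbalance cell. The mean of the irrelevant Y coordinate and the random seed are assumptions, since the paper specifies only the class means of the discriminating attribute and the covariance matrix.

```python
# Illustrative generation of one synthetic data set (high overlap, high imbalance).
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[16.0, 20.0],
                [20.0, 64.0]])                 # sigma11, sigma12, sigma21, sigma22
n_major, n_minor = 90, 10                      # high imbalance: 90% vs. 10% of 100 instances
mu1, mu2 = 50.0, 50.0                          # high overlap: identical class means

X = np.vstack([rng.multivariate_normal([mu1, 50.0], cov, n_major),
               rng.multivariate_normal([mu2, 50.0], cov, n_minor)])
y = np.hstack([-np.ones(n_major), np.ones(n_minor)])   # -1 = majority, +1 = minority
```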
Fig. 5. Examples of the synthetic data sets.

Table 2
The indexes of R1 and R2 for synthetic data sets.

            Imbalance
            Low               Medium            High
Overlap     R1       R2       R1       R2       R1       R2
Low         60.71    0.82     1.47     0.85     0.28     0.89
Medium      60.71    0.54     1.47     0.56     0.28     0.58
High        60.71    0.28     1.47     0.29     0.28     0.29

Table 3
UCI data sets.

Data      Attributes   Classes         Examples   R1       R2
BCW       10           2 (305:195)     400        0.279    0.258
German    25           2 (700:300)     1000       0.0274   0.712
Glass     10           2 (185:29)      214        0.0254   0.000
Car       6            2 (1659:69)     1728       0.0106   0.667
Yeast     8            2 (1433:51)     1484       0.0006   0.000
Abalone   8            2 (4145:32)     4177       0.0012   0.031

The second experiment considers six data sets from the UCI databank. These data sets are highly skewed. Four of them, namely, abalone (19), car (3), yeast (5), and glass (7), are originally multi-class data sets.
Hence, the class label in parentheses indicates the specified class regarded as the minority class, whereas the other classes are considered the majority classes. A data preprocessing step is needed to construct an appropriate format for the subsequent analysis. Detailed information on the six data sets after the data preprocessing step is presented in Table 3. The parentheses in the second column of Table 3 indicate the characteristics of these six data sets, which are organized according to their ratios of negative-to-positive training examples. According to these ratios, the top three data sets are slightly imbalanced, the middle two data sets (car and yeast) are moderately imbalanced, and the abalone data set is the most imbalanced. The R1 and R2 data complexity measures are also listed in Table 3. The most imbalanced and overlapping data set is yeast. Merely considering the imbalance ratio is not enough to understand the pattern of a data set; hence, the indexes R1 and R2 must also be considered to evaluate the robust performance of the proposed method on complex data sets.
Table 4
After over-sampling data sets.

          Original data sets           Over-sampling
Data      Classes        Examples      "over"    Classes          Examples
BCW       2 (444:239)    683           100       2 (444:478)      922
German    2 (383:307)    1000          25        2 (383:383)      766
Glass     2 (185:29)     214           550       2 (185:188)      373
Car       2 (1659:69)    1728          2400      2 (1659:1725)    3384
Yeast     2 (1433:51)    1484          2800      2 (1433:1479)    2912
Abalone   2 (4145:32)    4177          12,950    2 (4145:4176)    8321
Table 5
Performance metrics for different levels of class imbalance and overlapping.

Factors               SVM                      MS-SVM
Overlap   Imbalance   Acc.    Sen.    Spe.     Acc.    Sen.    Spe.
Low       Low         100     100     100      100     100     100
Low       Med.        99      100     98.3     99      100     98.3
Low       High        74      99      71.1     85      100     83.3
Med.      Low         99      100     98.3     99      100     98.3
Med.      Med.        97      96      97.3     98      96      98.6
Med.      High        70      100     67       80      100     77.8
High      Low         85      100     83.3     85      100     83.3
High      Med.        78      76      78.6     83      76      84
High      High        63      100     58.8     68      100     75.5
Ave.                  85.00   96.78   83.63    88.56   96.89   88.79
A four-fold cross-validation approach is used in the present study to evaluate the performances of the original SVM, SMOTE, and the proposed MS-SVM on the UCI data sets. The multi-fold cross-validation approach has the advantage of utilizing as much data as possible for training. In addition, the test sets are mutually exclusive and effectively cover the entire data set. The second experiment was repeated four times; the average accuracy, sensitivity, and specificity were computed, and the results were compared. The radial basis kernel function was selected. The parameter settings of θ and σ were obtained using the gradient descent method. For the suitable parameters θ and σ, the termination condition was set at 500 iterations or when the same results were reached for 50 consecutive iterations. In addition, the learning rate was set at 0.02; the effect of the learning rate was not considered in the present study.

The third experiment evaluated SMOTE and the proposed MS-SVM on the sampling data sets. For the parameter settings of SMOTE, new examples are generated using information from the nearest neighbors of each example of the minority class. For each existing minority class example, new examples are created according to the parameters "over" and "under", which control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively. The parameter "over" is typically a number above 100; with this type of value, for each case in the original data set belonging to the minority class, "over"/100 new examples of that class are created. The "under" proportion is calculated with respect to the number of newly generated minority class cases. For instance, if 200 new examples were generated for the minority class, a value of "under" of 100 would randomly select exactly 200 cases belonging to the majority classes from the original data set for the final data set. Detailed information on the data sets after over-sampling of the minority class is presented in Table 4. The data sets in Table 4 were used only for SMOTE. In the following comparison, the proposed MS-SVM, SMOTE, and the original SVM were built with the R software (http://cran.r-project.org/). All experiments were conducted on an Intel Core(TM) 2 CPU notebook with T7200 @ 2.00 GHz, 2.00 GB RAM, and 120 GB disk space.

4. Results

Table 5 shows the results of the first experiment, in which the proposed MS-SVM performed better than the original SVM in terms of accuracy and sensitivity, especially at high levels of class imbalance and overlapping. At low levels of class imbalance, for each level of class overlapping, the two classifiers obtained the same accuracy values. The results show a clear and strong relationship between the classifiers and the effect of data complexity. However, both class imbalance and class overlapping appear to be negatively related to the accuracy of the two classifiers.
Table 6
Comparison of MS-SVM and SVM on the UCI data sets.

          SVM                        MS-SVM                     Parameters
Data      Sen.    Spe.    Acc.      Sen.    Spe.    Acc.       θ      σ
BCW       99.09   94.56   97.62     99.09   95.82   98.05      1.03   0.05
Ger.      90.71   51.67   78.98     91.42   53.33   80.21      1.43   0.25
Glass     98.98   93.10   98.53     100.0   96.55   99.02      1.02   0.01
Car       100.0   11.59   96.46     99.81   27.53   96.93      2.00   0.65
Yeast     100.0   7.84    96.83     100.0   15.68   97.10      2.45   0.20
Aba.      100.0   0       99.23     100.0   30.43   99.46      4.42   0.05
Ave.      98.13   43.13   94.61     98.39   53.22   95.13
Table 7
Comparison of MS-SVM and SMOTE on the over-sampling data sets.

          SMOTE                      MS-SVM                     Parameters
Data      Sen.    Spe.    Acc.      Sen.    Spe.    Acc.       θ      σ
BCW       97.6    97.6    97.61     99.5    99.5    99.45      1      0.05
Ger.      85.2    85.7    85.17     85.9    86.1    85.90      1      0.75
Glass     97.1    97.1    97.05     100     100     100        1.05   0.2
Car       94.8    94.3    94.26     98.8    98.8    98.78      1.5    0.5
Yeast     85.6    85.7    85.61     85.9    85.9    85.88      2.03   0.5
Aba.      85.3    86.4    85.29     85.9    87.2    85.90      3.5    0.5
Ave.      90.93   91.13   90.83     92.67   92.92   92.65
The results of the second experiment are presented in Table 6. The proposed MS-SVM measured specificity well, an indication that MS-SVM can effectively detect the rarest objects in the minority class, whereas the original SVM could hardly obtain good specificity on the more complex data sets, such as the bottom three. Moreover, the original SVM was easily affected by the complicated data sets, especially the abalone data set. Table 6 shows that the proposed MS-SVM obtained better performance than the SVM on all three indexes. The suitable parameters θ and σ presented in Table 6 were selected from the fold with the best accuracy among the four folds. The results reveal that the parameter θ is related to the data complexity, whereas σ depends on the data characteristics. The parameter θ tends toward higher values at high levels of class imbalance and overlapping. For the top three data sets, which are regarded as having slight data complexity, θ was smaller than 1.5; for the bottom three data sets, which evidently have high data complexity, θ was greater than 2. The parameter σ evidently did not increase with the data complexity and was affected by the data characteristics.

From the two experimental results, MS-SVM always performs much better than the original SVM in terms of specificity. For MS-SVM, the average specificity values of the two experiments were 88.79 and 53.22, which are better than those of the original SVM. MS-SVM demonstrates that the flexible distance margin is suitable for predicting the rarest objects from a class-imbalanced distribution in the overlap region.

Table 7 presents the sensitivity, specificity, and accuracy on the over-sampling data sets. The average accuracy, sensitivity, and specificity of MS-SVM were 92.65, 92.92, and 92.67, respectively. From the third experimental results, MS-SVM obtained a robust performance on the balanced data sets and better classification results than the sampling method, SMOTE.
5. Conclusions

The present study presents a new quadratic optimization model for the SVM. The proposed MS-SVM method introduces a modified margin for capacity control by adjusting a suitable distance from the decision boundary to the margins in the feature space. The adaptive margin fits the training model well at each training data point and achieves a good performance across different scenarios of data complexity. The parameter settings were obtained by the gradient descent method. The experimental results show that the proposed MS-SVM gives relatively accurate results and performs better on class imbalance and overlapping problems, where a correct classification of the minority class has greater value than a correct classification of the majority class, especially in highly overlapping regions. These results indicate that the proposed MS-SVM method is a useful classification and forecasting technique. These experimental results are the main contributions of the present study. Systematic experiments and analyses were used to evaluate the performance of the proposed method. The experimental results indicate that the different levels of complex data sets affect the classification accuracy of the classifiers.
Nevertheless, the proposed MS-SVM still performed well and remained stable. The MS-SVM may be used with artificial, UCI, and over-sampled data sets. Future works can delve into a more robust approach for the application of the proposed MS-SVM on imbalanced or overlapping data sets. Data sets that are more complicated warrant further study. Moreover, extending the theory of the SVM and dealing with extremely rare positive examples in real applications are crucial issues that need to be clarified.

Acknowledgment

This work is partially supported by the National Science Council, Ministry of Science and Technology (Taiwan) under grant no. 104-2221-E-007-064-MY3.

References

[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[2] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognit. 36 (2003) 2271–2285.
[3] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: C. Nédellec, C. Rouveirol (Eds.), Proceedings of ECML-98, 10th European Conference on Machine Learning, Springer Verlag, Heidelberg, DE, 1998, pp. 137–142.
[4] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2005.
[5] N. Chawla, K. Bowyer, L. Hall, W. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[6] N. Japkowicz, The class imbalance problem: significance and strategies, in: Proceedings of the 2000 International Conference on Artificial Intelligence, 2000, pp. 111–117.
[7] K. Huang, H. Yang, I. King, M. Lyu, Learning classifiers from imbalanced data based on biased minimax probability machine, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 558–563.
[8] G. Wu, E.Y. Chang, KBA: kernel boundary alignment considering imbalanced data distribution, IEEE Trans. Knowl. Data Eng. 17 (2005) 786–794.
[9] K. Veropoulos, C. Campbell, N. Cristianini, Controlling the sensitivity of support vector machines, in: Proceedings of the International Joint Conference on AI, 1999, pp. 55–60.
[10] T.B. Trafalis, R.C. Gilbert, Robust classification and regression using support vector machines, Eur. J. Oper. Res. 173 (2006) 893–909.
[11] T.B. Trafalis, R.C. Gilbert, Robust support vector machines for classification and computational issues, Optim. Methods Softw. 22 (2007) 187–198.
[12] Q. Song, W. Hu, W. Xie, Robust support vector machine with bullet hole image classification, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 32 (2002) 440–448.
[13] R. Herbrich, J. Weston, Adaptive margin support vector machines for classification, in: Proceedings of the 9th ICANN, 1999, pp. 880–885.
[14] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.
[15] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002.
[16] R.A. Mollineda, J.S. Sánchez, J.M. Sotoca, Data Characterization for Effective Prototype Selection, Springer, Berlin/Heidelberg, 2005.
[17] T.K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 289–300.
[18] J. Gama, P. Brazdil, Characterization of classification algorithms, in: Proceedings of the 7th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence, Springer-Verlag, 1995.