Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs

Neural Networks 70 (2015) 39–52


Shounak Datta, Swagatam Das*
Electronics and Communication Sciences Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700 108, India


Article history: Received 12 February 2015; Received in revised form 19 May 2015; Accepted 30 June 2015; Available online 8 July 2015.
Keywords: Support Vector Machines; Bayes error; Imbalanced data; Decision boundary shift; Unequal costs; Multi-class classification

Abstract

Support Vector Machines (SVMs) form a family of popular classifier algorithms originally developed to solve two-class classification problems. However, SVMs are likely to perform poorly in situations with data imbalance between the classes, particularly when the target class is under-represented. This paper proposes a Near-Bayesian Support Vector Machine (NBSVM) for such imbalanced classification problems, by combining the philosophies of decision boundary shift and unequal regularization costs. Based on certain assumptions which hold true for most real-world datasets, we use the fractions of representation from each of the classes to achieve the boundary shift as well as the asymmetric regularization costs. The proposed approach is extended to the multi-class scenario and also adapted for cases with unequal misclassification costs for the different classes. Extensive comparison with standard SVM and some state-of-the-art methods is furnished as a proof of the ability of the proposed approach to perform competitively on imbalanced datasets. A modified Sequential Minimal Optimization (SMO) algorithm is also presented to solve the NBSVM optimization problem in a computationally efficient manner.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Overview

Support Vector Machines (SVMs) are very popular classifiers, proposed in 1995 (Cortes & Vapnik, 1995). The popularity of SVMs is due to their elegant mathematical basis and their ability to model both linear and non-linear decision boundaries (using the kernel trick). An SVM aims to choose the maximum-margin hyperplane (expressed as a linear combination of training points known as the Support Vectors (SVs)) which is equidistant from the two classes, while putting equal penalty on the regularization of both classes. This is suitable when both classes have comparable numbers of training points (the classes are equally represented) and identical costs of misclassification. As a result, SVM finds the maximum-margin hyperplane equidistant from both classes when the data is linearly separable. For overlapping classes, the placement or orientation of the hyperplane may be slightly changed to minimize the regularization cost. However, problems arise when the data points belonging




to one of the classes greatly outnumber those of the other class. Such datasets are called imbalanced datasets and the traditional formulation of SVM is not likely to perform well on such problems, because the chosen hyperplane will be skewed towards the minority class in order to minimize the cost of regularization. This results in higher misclassification of the minority class, and any new datum is more likely to be assigned to the dominant class. Credit-card fraud detection is an example of such class-imbalanced classification problems, where the target class is the minority, as most customers are genuine and fraudulent customers are hard to come by. In such a situation, an SVM is likely to misclassify a fraudulent customer as genuine, possibly incurring financial loss upon the bank.

1.2. Literature

There have been various departures from the traditional formulation of SVM to handle asymmetry in distribution, prior probability, and/or misclassification cost between the two classes in a binary classification problem. These can be broadly classified into two separate classes with distinct aims. The first class of algorithms attempts to find a hyperplane which minimizes the Bayes error of the classifier. Finding the Bayesian decision boundary is useful when the two classes are equally important and misclassification of points from either class may have serious consequences. For example, in a biometric security system designed to grant access (to


a certain critical facility) to a few key persons, the misidentification of a genuine key person may result in critical delay to urgent proceedings while granting access to a non-key person may facilitate a burglary! Tao, Wu, Wang, and Wang (2005) proposed Posterior Probability SVM (PPSVM) to meet this requirement, which was later extended to the multi-class scenario by Gonen, Tanugur, and Alpaydin (2008). Another endeavor towards a similar goal was by Yu, Cheng, Xiong, Qu, and Chen (2008), who tried to locate the optimal Bayesian hyperplane by using the Reversible Jump Markov Chain Monte Carlo (RJMCMC). The second class of algorithms aims to minimize the misclassification for a particular target class (usually the minority class), because misclassification for target points is costlier than that of non-target points. For example, identification of some spam emails as ordinary emails is trivial while a failure to identify a genuine email may result in that email being deleted, without being read; possibly leading to loss of valuable information. A common approach to these problems is to resample the data so as to undo the data imbalance; either by oversampling the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002; KoknarTezel & Latecki, 2009), undersampling the majority class (Kubat et al., 1997; Peng, Ting-ting, & Yang, 2015), or some combination thereof (Wang, 2014; Zughrat, Mahfouf, Yang, & Thornton, 2014). Oversampling is generally preferred over undersampling which may result in the loss of some important majority instances. However, oversampling increases the computational cost by increasing the number of training instances. To overcome these disadvantages of random resampling, various improvements (which adaptively resample the data) have been proposed (Bao-Liang, Xiao-Lin, Yang, & Hai, 2011; Batuwita & Palade, 2010; Cervantes, Li, & Yu, 2014; Choi, 2010; Peng, Xiao-yang, Ting-ting, & Jiu-ling, 2014; Stecking & Schebesch, 2012; Tang, Zhang, Chawla, & Krasser, 2009; Wang, Li, Chao, & Cao, 2012). Active learning techniques for SVMs have also been combined with random (Lee, Wu, & Kim, 2014) as well as selective (Ertekin, 2009) oversampling techniques to utilize their ability to learn dynamically from limited amount of training data. These resampling approaches are not specific to SVMs and Masnadi-Shirazi and Vasconcelos (2010) argue that these may not affect the decision boundary if new SVs are not created. On the other hand, Akbani, Kwek, and Japkowicz (2004) have shown that it may still affect the orientation of the separating hyperplane. Alternatively, when the costs of misclassification of the individual classes are known, the SVM formulation can be modified to take this into account (Chang & Lin, 2011; Veropoulos, Campbell, & Cristianini, 1999). Cao, Zhao, and Zaiane (2013) and Duan, Jing, and Lu (2014) have proposed methods to estimate the misclassification costs (when the costs are unknown), based respectively on the information entropies of the individual classes, and on the maximization of an appropriate fitness function by using the Particle Swarm Optimization (PSO) algorithm. The effectiveness of boosting for imbalanced classification is attested by Galar, Fernandez, Barrenechea, Bustince, and Herrera (2012); and was combined with cost-sensitive SVM by Wang and Japkowicz (2010). Unfortunately, cost-sensitive approaches are only useful when the training data from different classes are overlapping. 
Some other methods utilize the ability of SVMs to work with kernels to improve performance on imbalanced datasets by modifying the kernel function or the kernel matrix. Wu et al., drawing inspiration from Amari and Wu (1999), modified the SVM kernel based on conformal transformations, to enable better handling of imbalanced datasets, by increasing the resolution of the kernels near the minority SVs (Wu & Chang, 2003, 2005). The slightly different Asymmetric Kernel Scaling (AKS), which also differently enlarges areas on both sides of the separating hyperplane so as to compensate for the data skew, was proposed by Maratea, Petrosino, and Manzo (2014) and later adapted for multi-class imbalanced problems by Zhang, Fu, Liu, and Chen (2014). Zhao, Zhong,

and Zhao (2011) maximized the Weighted Maximum Margin Criterion (WMMC) to select the proper kernel for imbalanced datasets. Phoungphol (2013) discussed a modification of SVM which minimizes an approximation of the gmeans measure of Kubat et al. (1997) along with margin maximization. Another approach is to shift the decision boundary to account for the data imbalance between the two classes (Imam, Ting, & Kamruzzaman, 2006; Lin, Lee, & Wahba, 2002). This requires information about the distribution of the two classes, which is often unavailable or costly. Wu, Lin, Chien, Chen, and Chen (2013) proposed Asymmetric SVM (ASVM), which attempts to minimize false assignments to a certain target class. Raskutti and Kowalczyk (2004) published a comparative study which demonstrated the superiority of one-class SVM (trained as a similarity detector based only on instances from the target class) over two-class SVM with resampling or unequal costs in high-dimensional and/or noisy feature spaces. An extensive review of the various SVM-based imbalanced data classifiers has been written by Batuwita and Palade (2013). Finally, hybrid methods combining two or more of the above approaches have also been proposed, such as Yang, Yang, and Wang (2009), which combines margin shift with unequal costs, and works like Akbani et al. (2004), Wang et al. (2012), and Stecking and Schebesch (2012), which integrate resampling with unequal costs.

1.3. Motivation

It is clear that most of the existing literature is focused on minimizing the misclassification of the costlier target class (at the cost of greater misclassification of the non-target class), instead of minimizing the Bayes error. Interestingly, the former problem can also be looked at as a derivative of the latter problem where the prior probabilities have been scaled by different misclassification costs for the different classes. Therefore, it makes more sense to minimize the Bayes error. This can be achieved by appropriately shifting the decision boundary within the margin. Ideally, this shift should be based on the knowledge of posterior probability distributions, which is costly. But if the two classes are both characterized by similar demographics (as is the case for most real-world datasets), prior probabilities can be used to induce the decision boundary shift (see next section). Moreover, it also makes sense to further combine cost-sensitivity with this arrangement, to prevent undue misclassification of the minority class. This motivates us to combine decision boundary shifting with cost-sensitivity for class-imbalanced classification with SVMs, in this paper.

1.4. Contribution

In the present paper, we propose a new variant of SVM for classification of imbalanced data, called Near-Bayesian Support Vector Machines (NBSVMs). In the new classifier, we combine the decision boundary shift philosophy with varying misclassification penalties. While the shifting is expected to achieve better inductive performance, even for training datasets that are linearly separable, the unequal regularization penalties (higher penalty for the minority class) are expected to ensure that the shift does not result in undue misclassification of the minority class. Alternatively, one can think of the cost asymmetry as compensating for the unequal representation from the two classes, so that the decision boundary is not closer to the minority class. The margin shift then offsets the separating hyperplane so that it (almost) coincides with the Bayesian decision boundary.
When the two classes are deemed to have equal costs of misclassification, the above approach places the decision boundary so that the Bayes error is reduced (as long as the assumption about the dataset holds true). However, if the two classes have distinct costs, the new classifier reduces the


asymmetric-cost scaled Bayes error, thus placing more emphasis on the correct classification of points from the costlier class (known as the target class), at the expense of points from the other class. The important contributions of this paper are as follows:
(i) We develop NBSVM, a modification of SVM to reduce Bayes error in imbalanced data classification, which combines decision boundary shift with cost-sensitivity. We also provide mathematical results supporting the philosophy behind the new method, along with a proof of existence of a unique solution.
(ii) While our philosophy is remotely similar to the approach in Yang et al. (2009), we make certain assumptions about the dataset (see next section), which enables us to make use of the fraction of representatives from each class to shift the decision boundary and to impose unequal costs. Unlike Yang et al. (2009), there is no need to run costly Evolutionary Algorithm (EA) based optimization to ascertain the values of the parameters.
(iii) Unlike previous methods such as Gonen et al. (2008) and Tao et al. (2005), the proposed method does not require costly posterior probability approximation and makes use of the fractions of representation from the different classes (which serve as estimates of prior probabilities).
(iv) We also extend NBSVM to the inherently imbalanced one-versus-all approach for multi-class classification and also to cases where the two classes have different misclassification costs.
(v) We make modifications to the Sequential Minimal Optimization (SMO) method (Platt, 1999) for solving the dual optimization problems posed by all of the methods proposed in this article.
(vi) We undertake extensive experimentation on artificial as well as real datasets to evaluate the performance of the proposed methods. We also study the effects of conformation/non-conformation of the tested datasets to the assumptions underlying the proposed methods.

1.5. Organization

The rest of this paper is arranged in the following way. In Section 2, we discuss the assumption that has been made about the dataset and the consequences of the same. Section 3 introduces the reader to NBSVM, while the two succeeding sections extend it for application to multi-class problems (Section 4) and to unequal-costs problems (Section 5). We show the modifications that must be made to SMO for solving the dual optimization problem for the new method in Section 6. NBSVM is tested on standard real datasets and the results are compared to those of traditional SVM as well as other state-of-the-art methods for imbalanced data. The results obtained by NBSVM are very encouraging. NBSVM is also found to be very useful for multi-class classification using the one-versus-all approach, which becomes inherently unbalanced as multiple classes are combined to form the negative class. These results are listed and discussed in Section 7. Section 8 concludes the article by discussing the usefulness and applicability of the new method, based on the findings of this study.

2. Effects of hyperplane shift and unequal regularization on SVM

Let X = {(xi, yi) | i = 1, 2, . . . , N} be a two-class imbalanced dataset, where xi ∈ Rᵐ are the data points and yi ∈ {1, −1} are the corresponding labels. Let the number of representative data points from the positive class be n+ and that for the negative class be n−. Therefore, the total number of data points is N = n+ + n−. The Representation Ratio (RR) for a class k is defined as follows:

ρk = (No. of data points belonging to class k) / (Total no. of data points in the dataset).    (1)

Therefore, the RRs ρ+ and ρ−, respectively for the positive class and the negative class, are:

ρ+ = n+/N, and ρ− = n−/N.    (2)
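The representation ratios of (1)–(2) are trivial to compute from a label vector. A minimal sketch (our own helper, assuming labels coded as +1/−1 in a NumPy array):

```python
import numpy as np

def representation_ratios(y):
    """Compute rho_+ and rho_- from a vector of +1/-1 labels, cf. Eqs. (1)-(2)."""
    y = np.asarray(y)
    N = len(y)
    n_pos = int(np.sum(y == 1))   # n_+
    n_neg = int(np.sum(y == -1))  # n_-
    return n_pos / N, n_neg / N   # rho_+, rho_-

# Example: 100 minority (+1) and 200 majority (-1) points
y = np.array([1] * 100 + [-1] * 200)
print(representation_ratios(y))   # (0.333..., 0.666...)
```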

Traditional SVM attempts to find the max-margin hyperplane wᵀx − b = 0 (w ∈ Rᵐ and b ∈ R), such that the points in the positive class have a positive margin and points in the negative class have a negative margin. The resulting optimization problem can be written in the following form:

min_{w,b} (1/2)∥w∥²    (3a)
subject to yi(wᵀxi − b) ≥ 1, ∀(xi, yi) ∈ X.    (3b)

However, in many real-world problems, the different classes in the dataset are not completely separable. In such cases, the constraints in the aforementioned optimization problem are relaxed by introducing a slack variable ξi ≥ 0 for each data point xi. Then, the soft-margin version of SVM is formulated as follows:

min_{w,ξ,b} (1/2)∥w∥² + C Σ_{i=1}^{N} ξi    (4a)
subject to yi(wᵀxi − b) ≥ 1 − ξi, ∀(xi, yi) ∈ X,    (4b)

where C is the cost of regularization. Introducing the Lagrange multipliers αi and βi, the previous optimization problem can be rewritten as:

min_{w,ξ,b} max_{α,β} (1/2)∥w∥² + C Σ_{i=1}^{N} ξi − Σ_{i=1}^{N} αi (yi(wᵀxi − b) − 1 + ξi) − Σ_{i=1}^{N} βi ξi,    (5)

where αi ≥ 0 and βi ≥ 0, ∀i ∈ {1, . . . , N}. Wolfe's dual of the above problem is as follows:

max_{α} Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj K(xi, xj)    (6a)
subject to 0 ≤ αi ≤ C, ∀i ∈ {1, . . . , N},    (6b)
and Σ_{i=1}^{N} αi yi = 0,    (6c)

where K(xi, xj) is the kernel function, evaluated at the points xi and xj. Since the majority class (having greater RR than the other class) is likely to have more representatives in the region of overlap, classical SVM moves the separating hyperplane towards the minority class, by an arbitrary amount. Depending on the extent of regularization and the actual distribution of the data in the region of overlap, this shifted boundary may or may not coincide with the Bayes optimal decision boundary.
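To make this effect concrete, the following sketch trains a standard soft-margin SVM on a synthetic imbalanced two-class problem and reports the per-class accuracies; the skew of the boundary towards the minority class typically shows up as a much lower minority-class recall. This is an illustrative experiment of ours (using scikit-learn), not code from the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Overlapping Gaussian classes: 1000 majority (-1) vs. 100 minority (+1) points
X_maj = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))
X_min = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([X_maj, X_min])
y = np.array([-1] * 1000 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
pred = clf.predict(X)
for label in (-1, 1):
    mask = y == label
    print(f"class {label:+d} accuracy: {np.mean(pred[mask] == label):.3f}")
# The minority (+1) class usually shows noticeably lower accuracy,
# reflecting the boundary being pushed towards it.
```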


Fig. 1. Effects of hyperplane shift and unequal costs on SVM. (a) Class-conditional probability distributions for the majority class and minority class. (b) Posterior probability distributions for the two classes: the separating hyperplane obtained by SVM is closer to the minority class, but does not coincide with the ideal decision boundary. (c) Adding unequal costs for the two classes moves the hyperplane close to the midway line; this hyperplane can be shifted towards the ideal decision boundary using proper biasing.

One approach to get a better shifting of the hyperplane is to combine margin shifting with unequal regularization costs for the two classes. The resulting optimization problem can be written as:

min_{w,ξ,b} (1/2)∥w∥² + C Σ_{i=1}^{N} f(xi) ξi    (7a)
subject to yi(wᵀxi − b) ≥ q(xi) − ξi, ∀(xi, yi) ∈ X,    (7b)

where q : {xi | i = 1, . . . , N} → [0, 1] is the minimum margin function which gives the minimum margin over all points belonging to the class of a particular point; f : {xi | i = 1, . . . , N} → [0, ∞) is the cost of misclassification, which attains a higher value for points in the minority class and a lower value for points in the majority class. Notice that q(xi) gives the same value for all xi belonging to a particular class. So, it is always possible to scale w and b (by the same factor, so that the decision boundary remains unchanged), such that q(xi) + q(xj) = 1 for all xi ∈ I+ (the positive class) and xj ∈ I− (the negative class). The effect of combining decision boundary shifting with unequal regularization costs for the two classes (of an example problem) is explained in Fig. 1. In Fig. 1(a), we see the class-conditional probability distributions for the two classes. For the sake of simplicity, the two classes are shown to have identical distributions. But, since the majority class (shown on the left side) has a greater prior probability than the minority class (shown on the right side), the optimal decision boundary is closer to the minority class. Classical SVM also finds a decision boundary closer to the minority class, but it is not necessarily optimal, as it depends on the choice of C. This situation is illustrated in Fig. 1(b). It is seen in Fig. 1(c) that choosing a higher regularization cost for the minority class results in the separating hyperplane moving towards the majority class. It was empirically observed by Akbani et al. (2004) that the optimal choice of f(xi) is:

f(xi) = 1/ρ+, if xi ∈ I+;  1/ρ−, if xi ∈ I−.    (8)

This results in a separating hyperplane midway between the two classes, as f(xi) compensates for the difference in representation of the two classes in the region of overlap. Then a margin shift can be induced by using a proper function q(xi) to obtain the optimal separating hyperplane, as shown in Fig. 1(c). Thus, the philosophy of combining boundary shift with unequal regularization costs is used in the proposed NBSVM method (see the next section). Ideally, the posterior probabilities should be used to choose f(xi) to shift the decision boundary (Gonen et al., 2008; Tao et al., 2005). But we can make use of the RRs (which are essentially estimates of the prior probabilities) to move the separating hyperplane if the two classes are similarly distributed, i.e. the classes have equal monotonic spread in the direction of maximum margin. This is often true for real-world datasets in which target as well as non-target classes are drawn from the same demographic. In fact, Lemma 1 shows that the proposed approach is sound whenever the class-conditional distributions of the two classes have identical rates of decrement (gradients) in the region of overlap.

Lemma 1. Let b1 be the offset of the furthest regularized point from the class k1, from the origin (in the direction perpendicular to the max-margin separating hyperplane wᵀx − b = 0). Let b2 be the same for the furthest regularized point from class k2. Then [b2, b1] is the region of overlap between two classes k1 and k2, which have prior probabilities P1 and P2, and class-conditional probability distributions p(x|k1) and p(x|k2), respectively. Let p(x|k1) decrease monotonically from b2 to b1 and let p(x|k2) increase monotonically from b2 to b1, both with constant average slope m (assuming the region of overlap to be sufficiently narrow). Then the separating hyperplane is Bayes optimal, if b − b2 = P1 × (b1 − b2) and b1 − b = P2 × (b1 − b2).


Proof. For the decision boundary wᵀx − b = 0, the misclassification error is given by:

ϵ(b) = P2 ∫_{b2}^{b} m(t − b2) dt + P1 ∫_{b}^{b1} m(b1 − t) dt    (9a)
     = P2 × m × [t²/2 − b2·t]_{b2}^{b} + P1 × m × [b1·t − t²/2]_{b}^{b1}    (9b)
     = P2 × m × (b²/2 − b2·b + b2²/2) + P1 × m × (b1²/2 − b1·b + b²/2).    (9c)

For the decision boundary to be Bayesian, the error must be minimized at b. So, we get:

dϵ(b)/db = P2 × m × (b − b2) + P1 × m × (b − b1) = 0.    (10)

Since m ≠ 0, and putting P2 = 1 − P1, we have:

(1 − P1) × (b − b2) + P1 × (b − b1) = 0  ⇒  b − b2 = P1 × (b1 − b2).    (11)

Similarly, we can show that b1 − b = P2 × (b1 − b2).

In practice, even when the two classes do not have similar distributions, if one class has a sufficiently larger number of representatives, the prior probabilities become more important than the class-conditional probabilities. Therefore, NBSVM (which uses the RRs as estimates of prior probabilities to shift the decision boundary) still places the decision boundary near the Bayes optimal in such cases. But sometimes the minority class is more compact than the majority class, resulting in a greater ratio of target class to non-target class representation in the region of overlap (than implied by the RRs). In such a case, NBSVM (which moves the decision boundary towards the minority class based on the RRs) tends to misclassify fewer points from the majority class, at the cost of points from the minority class. Thus, the accuracy of NBSVM on the minority class deteriorates with increase in the compactness of the minority class. On the other hand, if the minority class has a greater spread than the majority class, then the accuracy of NBSVM on the minority class further improves at the cost of majority instances. This is advantageous when the minority class is the target class (which is often the case). Therefore, NBSVM should not be used when the minority (target) class is known to be highly compact in comparison to the majority class.

3. Near-Bayesian Support Vector Machines

Using the results obtained in the previous section, we now present the formulation for NBSVM, as follows:

min_{w,ξ,b} (1/2)∥w∥² + (C/ρ−) Σ_{xi∈I−} ξi + (C/ρ+) Σ_{xi∈I+} ξi    (12a)
subject to yi(wᵀxi − b) ≥ ρ+ − ξi, ∀xi ∈ I+,    (12b)
and yi(wᵀxi − b) ≥ ρ− − ξi, ∀xi ∈ I−.    (12c)

Introducing Lagrange multipliers αi and βi, the previous optimization problem can be rewritten as:

min_{w,ξ,b} max_{α,β} (1/2)∥w∥² + (C/ρ−) Σ_{xi∈I−} ξi + (C/ρ+) Σ_{xi∈I+} ξi − Σ_{xi∈I+} αi (yi(wᵀxi − b) − ρ+ + ξi) − Σ_{xi∈I−} αi (yi(wᵀxi − b) − ρ− + ξi) − Σ_{i=1}^{N} βi ξi,    (13)

where αi ≥ 0 and βi ≥ 0, ∀i ∈ {1, . . . , N}. Therefore, Wolfe's dual of the above problem is as follows:

max_{α} Σ_{xi∈I+} ρ+ αi + Σ_{xi∈I−} ρ− αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj K(xi, xj)    (14a)
subject to 0 ≤ αi ≤ C/ρ+, ∀xi ∈ I+,    (14b)
0 ≤ αi ≤ C/ρ−, ∀xi ∈ I−,    (14c)
and Σ_{i=1}^{N} αi yi = 0.    (14d)

According to the interrelation between hinge loss, decision boundary shift and cost-sensitive regularization (Yang et al., 2009), it can be easily seen that the above formulation of NBSVM minimizes the modified hinge loss function for imbalanced data, defined as:

H(yi, f(xi)) = max(0, ρk − yi f(xi)) / ρk, for all xi ∈ Ik,    (15)

where ρk = ρ+ if xi ∈ I+ and ρk = ρ− otherwise.

The new formulations boil down to traditional SVM when ρ+ = ρ−, i.e. n+ = n− = N/2. In fact, the new objective function in (14) yields the same Hessian as that of (6) (which is positive semi-definite (Burges, 1998)), and therefore has a unique solution. The simple theorem that follows states the positive semi-definiteness of the NBSVM Hessian.

Theorem 2. Let the objective function for the NBSVM dual optimization problem be Q(α) = Σ_{xi∈I+} ρ+ αi + Σ_{xi∈I−} ρ− αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj K(xi, xj). Then, the Hessian matrix H of −Q(α) is positive semi-definite.

Proof. This proof closely follows a similar proof by Abe (2002) for classical SVM. Let us rewrite Q(α) as follows:

Q(α) = Σ_{xi∈I+} ρ+ αi + Σ_{xi∈I−} ρ− αi − (1/2) (Σ_{i=1}^{N} αi yi φ(xi))ᵀ × (Σ_{i=1}^{N} αi yi φ(xi)),    (16)

where φ(xi) is the image of the data point xi in the higher-dimensional feature space, i.e.

K(xi, xj) = φᵀ(xi) φ(xj).    (17)

Then, from the constraint (14d), we can write:

αi = −yi Σ_{g=1, g≠i}^{N} αg yg.    (18)


Substituting (18) in (16), we get:

Q(α) = Q1(α) − (1/2) Q2ᵀ(α) Q2(α),    (19)

where

Q1(α) = Σ_{i=1}^{N} Σ_{g=1, g≠i}^{N} (1 − yi yg) αg,    (20)
and Q2(α) = Σ_{i=1}^{N} Σ_{g=1, g≠i}^{N} αg yg (φ(xg) − φ(xi)).    (21)

Therefore, the Hessian matrix H of −Q(α) can be written as:

H = ∇ Q2ᵀ(α) Q2(α) ∇ᵀ    (22a)
  = (Q2(α) ∇ᵀ)ᵀ (Q2(α) ∇ᵀ),    (22b)

where ∇ = [δ/δα1, . . . , δ/δαN]ᵀ. Since H is expressed as the product of the transpose of a matrix and the matrix itself, H is positive semi-definite.

The reader should note that the selection of kernel for NBSVM is not trivial (just like SVM). NBSVM relies on certain properties of the distributions of the two classes, which may change due to the mapping to a higher-dimensional feature space. Therefore, one should be careful while choosing a kernel for NBSVM. On the other hand, suitable kernels can be used to improve the performance of NBSVM by mapping to feature spaces where the data distribution is in keeping with the assumptions despite being adversely distributed in the input space. The effects of choosing suitable or unsuitable kernels are illustrated in Fig. 2.

Fig. 2. Effects of suitable and unsuitable kernels on the performance of NBSVM.
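The exact NBSVM problem (12)–(14) needs the modified SMO solver of Section 6, but its two ingredients — per-class penalties C/ρk and class-dependent margins ρ+ and ρ− — can be approximated with off-the-shelf tools for quick experimentation. The sketch below is our own rough approximation, not the authors' implementation: it uses scikit-learn's class_weight to realize the 1/ρk penalties and then shifts the decision threshold by a heuristic amount to mimic the margin asymmetry.

```python
import numpy as np
from sklearn.svm import SVC

def fit_nbsvm_like(X, y, C=2.0, kernel="rbf", gamma="scale"):
    """Rough stand-in for NBSVM (sketch, not the paper's SMO-based solver):
    per-class penalties C/rho_k (cf. Eq. (12)) via class_weight, plus a
    heuristic threshold shift standing in for the asymmetric margins."""
    y = np.asarray(y)
    rho_pos = np.mean(y == 1)
    rho_neg = np.mean(y == -1)
    clf = SVC(C=C, kernel=kernel, gamma=gamma,
              class_weight={1: 1.0 / rho_pos, -1: 1.0 / rho_neg}).fit(X, y)
    # Assumption of this sketch: shift the cut-off away from the midpoint of
    # the margin band in proportion to the representation imbalance.
    shift = rho_neg - rho_pos

    def predict(X_new):
        return np.where(clf.decision_function(X_new) > shift, 1, -1)

    return clf, predict

# Toy usage on an imbalanced 2-D problem
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([-1] * 300 + [1] * 60)
_, predict = fit_nbsvm_like(X, y)
print("training accuracy:", np.mean(predict(X) == y))
```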

4. Multi-class NBSVM

The SVM classifier is inherently a binary classifier, i.e. it can only distinguish between two classes. However, there are ways to adapt SVMs for multi-class classification. One of these is known as the one-versus-all (OVA) approach. In OVA-SVM for an M-class problem, M different SVMs are trained (one for each class) with that class as the positive (target) class and all the rest of the (M − 1) classes combined to form the negative class. In this way, M different labels are obtained for any data point. Ideally, each data point should receive a positive label only from the classifier which targets its class. So, if the data point only has a single positive label, it can be assigned to the corresponding class. However, if it receives multiple positive labels, it has to be randomly assigned to one of the corresponding classes. It should be noted that the training of the OVA approach is inherently imbalanced, as the set of all data points from all other classes is likely to outnumber the representatives of the target class, for each sub-classifier. Therefore, using the NBSVM philosophy for such multi-class OVA-SVMs can prove to be advantageous. Hence, in this section, we extend NBSVM to a multi-class framework (OVA-NBSVM).

Let X = {(xi, yi) | i = 1, . . . , N} be the multi-class training dataset, where xi ∈ Rᵐ are the data points and yi ∈ {1, . . . , M} are the corresponding labels. Let the number of representative data points from the kth class be nk. Therefore, the total number of data points is N = Σ_{k=1}^{M} nk, and the RRs of each class are defined as follows:

ρk = nk/N.    (23)

Therefore, Σ_{j=1, j≠k}^{M} ρj = 1 − ρk. Then the primal form of the multi-class OVA-NBSVM, with the kth class as the target, is given by:

min_{w,ξ,b} (1/2)∥w∥² + (C/ρk) Σ_{xi∈Ik} ξi + (C/(1 − ρk)) Σ_{xi∉Ik} ξi    (24a)

subject to yi(wᵀxi − b) ≥ ρk − ξi, ∀xi ∈ Ik,    (24b)
and yi(wᵀxi − b) ≥ 1 − ρk − ξi, ∀xi ∉ Ik,    (24c)

where Ik denotes the kth class.
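A one-versus-all wrapper of this kind is easy to express in code. The sketch below is our own illustration, with a weighted SVC standing in for the per-class sub-problem of Eq. (24); ties between multiple positive votes are broken by the largest decision value, a common practical variant of the random assignment described in the text.

```python
import numpy as np
from sklearn.svm import SVC

class OneVersusAll:
    """Illustrative one-versus-all wrapper (not the authors' OVA-NBSVM solver)."""

    def __init__(self, C=2.0, kernel="rbf"):
        self.C, self.kernel = C, kernel

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for k in self.classes_:
            yk = np.where(y == k, 1, -1)          # kth class vs. the rest
            rho_k = np.mean(yk == 1)              # RR of the target class
            weights = {1: 1.0 / rho_k, -1: 1.0 / (1.0 - rho_k)}  # cf. C/rho_k terms
            self.models_.append(
                SVC(C=self.C, kernel=self.kernel, class_weight=weights).fit(X, yk))
        return self

    def predict(self, X):
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]

# usage: OneVersusAll().fit(X_train, y_train).predict(X_test)
```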

5. NBSVM with unequal costs

As seen in Section 2, NBSVM places the decision boundary closer to the minority class to minimize the Bayes error of the classifier. However, this is not desired when the minority class is the target class and misclassification of minority points incurs a greater cost than the misclassification of majority points. In this section, we modify NBSVM for such situations by using the different misclassification costs of the two classes. The following lemma states the change in the position of the optimal decision boundary (as described in Lemma 1) due to unequal costs of misclassification of the two classes.

Lemma 3. Let [b2, b1] be the region of overlap between two classes k1 and k2 with prior probabilities P1 and P2, class-conditional probability distributions p(x|k1) and p(x|k2), and distinct misclassification costs C1 and C2, respectively. Let p(x|k1) decrease monotonically from b2 to b1 and let p(x|k2) increase monotonically from b2 to b1, both with constant average slope m. Then the optimal decision boundary is wᵀx − b = 0, if b − b2 = (P1C1 / Σ_{i=1,2} PiCi) × (b1 − b2) and b1 − b = (P2C2 / Σ_{i=1,2} PiCi) × (b1 − b2).

Proof. From Lemma 1, we have b − b2 = P1 × (b1 − b2), and b1 − b = P2 × (b1 − b2). But, because of the distinct misclassification costs, the effective prior probabilities for the two classes are:

P1,eff = P1C1 / Σ_{i=1,2} PiCi, and P2,eff = P2C2 / Σ_{i=1,2} PiCi.    (25)

So, it follows from the above expressions that:

b − b2 = P1,eff × (b1 − b2)  ⇒  b − b2 = (P1C1 / Σ_{i=1,2} PiCi) × (b1 − b2),    (26a)
and b1 − b = P2,eff × (b1 − b2)  ⇒  b1 − b = (P2C2 / Σ_{i=1,2} PiCi) × (b1 − b2).    (26b)

Thus, the modified NBSVM for unequal costs (uNBSVM) can be formulated as follows:

min_{w,ξ,b} (1/2)∥w∥² + (C/ρ−) Σ_{xi∈I−} ξi + (C/ρ+) Σ_{xi∈I+} ξi    (27a)
subject to yi(wᵀxi − b) ≥ ρ+C+/(ρ+C+ + ρ−C−) − ξi, ∀xi ∈ I+,    (27b)
and yi(wᵀxi − b) ≥ ρ−C−/(ρ+C+ + ρ−C−) − ξi, ∀xi ∈ I−,    (27c)

where C+ and C− are the respective misclassification costs for the classes I+ and I−, while C is the cost of regularization. Introducing the Lagrange multipliers αi and βi, we get the Lagrangian:

min_{w,ξ,b} max_{α,β} (1/2)∥w∥² + (C/ρ−) Σ_{xi∈I−} ξi + (C/ρ+) Σ_{xi∈I+} ξi − Σ_{xi∈I+} αi (yi(wᵀxi − b) − ρ+C+/(ρ+C+ + ρ−C−) + ξi) − Σ_{xi∈I−} αi (yi(wᵀxi − b) − ρ−C−/(ρ+C+ + ρ−C−) + ξi) − Σ_{i=1}^{N} βi ξi,    (28)

where αi ≥ 0 and βi ≥ 0, ∀i ∈ {1, . . . , N}. Wolfe's dual of the above problem is:

max_{α} Σ_{xi∈I+} (ρ+C+/(ρ+C+ + ρ−C−)) αi + Σ_{xi∈I−} (ρ−C−/(ρ+C+ + ρ−C−)) αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj K(xi, xj),    (29)

subject to the constraints (14b)–(14d). Both NBSVM (binary as well as OVA) and uNBSVM can be solved by slightly modifying the SMO of Platt (1999), as discussed in the next section.

6. Sequential minimal optimization for NBSVM

The SMO algorithm for the optimization of the SVM dual problem was proposed by Platt (1999) and later fine-tuned by Keerthi, Shevade, Bhattacharyya, and Murthy (2001). SMO owes its popularity to the fact that the amount of memory required by it scales linearly with the training set size, and it qualifies as the fastest solver for linear and sparse classification problems (Platt, 1999). Both NBSVM as well as uNBSVM can be solved using the SMO algorithm, by making some modifications to the original algorithm, which are described in this section.

SMO consists of two main steps which are repeated alternately until convergence. The first step is concerned with choosing a pair of multipliers to be optimized, called the working set. The working set for the modified SMO can be chosen along the lines of Keerthi et al. (2001). Let us, without loss of generality, denote the pair of multipliers to be selected as α1 (belonging to class k1) and α2 (belonging to class k2). The following index sets are defined at the current α: I0 = {i | 0 < αi < C}, I1 = {i | yi = 1, αi = 0}, I2 = {i | yi = −1, αi = C}, I3 = {i | yi = 1, αi = C}, I4 = {i | yi = −1, αi = 0}, Iup = I0 ∪ I1 ∪ I2 and Idown = I0 ∪ I3 ∪ I4. Then α1 and α2 are chosen as follows:

α1 = arg max_{i∈Iup} yi Gi,    (30a)
and α2 = arg max_{j∈Idown} (y_{α1} G_{α1} − yj Gj)² / [2(K(x_{α1}, x_{α1}) + K(xj, xj) − 2K(x_{α1}, xj))],    (30b)

where Gi = ρ_{ki} − yi Σ_{g=1}^{N} αg yg K(xi, xg), G_{αi} is the Gi corresponding to αi, and ki denotes the class to which xi belongs. The next step analytically optimizes the chosen pair of multipliers to minimize the objective function, while keeping the other Lagrange multipliers constant. To maintain the linear constraint in (14d), the values of the Lagrange multipliers α1 and α2 must satisfy the equation:

y1 α1^new + y2 α2^new = y1 α1^old + y2 α2^old,    (31)


in addition to the box constraints:

0 ≤ α1^new ≤ C/ρ_{k1},    (32a)
and 0 ≤ α2^new ≤ C/ρ_{k2}.    (32b)

Without loss of generality, we can assume that α2 is calculated first. Then, the resultant constraint on the multiplier α2 can be written as L ≤ α2^new ≤ U, where:

L = max{0, α2^old + α1^old − C/ρ_{k1}}, if y1 = y2;  max{0, α2^old − α1^old}, otherwise,    (33a)
U = min{C/ρ_{k2}, α2^old + α1^old}, if y1 = y2;  min{C/ρ_{k2}, C/ρ_{k1} + α2^old − α1^old}, otherwise.    (33b)

Let Ei denote the difference between the current function output and the target output for the training datum xi. Then, Ei is given by the following formula:

Ei = wᵀxi − b − yi ρ_{ki}, for NBSVM;  Ei = wᵀxi − b − yi ρ_{ki}C_{ki} / Σ_{j∈{ki, k̄i}} ρj Cj, for uNBSVM,    (34)

where ki denotes the class to which xi belongs and k̄i denotes the opposite class. Then, to maximize the objective function while only changing α1 and α2, the parameters must be updated using the following steps:

1. Find the new, unclipped value of α2:
α2^new = α2^old − y2(E1 − E2)/η,    (35)
where η = −K(x1, x1) + 2K(x1, x2) − K(x2, x2).

2. Find the clipped value of α2:
α2^{new,clipped} = L, if α2^new ≤ L;  α2^new, if L < α2^new < U;  U, if U ≤ α2^new.    (36)

3. Find the new value of α1:
α1^new = α1^old + y1 y2 (α2^old − α2^{new,clipped}).    (37)

4. Update the threshold in the following way:
b1 = E1 + y1(α1^new − α1^old)K(x1, x1) + y2(α2^{new,clipped} − α2^old)K(x1, x2) + b^old,    (38a)
b2 = E2 + y1(α1^new − α1^old)K(x2, x1) + y2(α2^{new,clipped} − α2^old)K(x2, x2) + b^old,    (38b)
and b^new = b1, if L < α1^new < U and α2^{new,clipped} ∈ {L, U};  b2, if L < α2^{new,clipped} < U and α1^new ∈ {L, U};  (b1 + b2)/2, if L < α1^new, α2^{new,clipped} < U;  b^old, otherwise.    (38c)

5. Update Ei for all the data points:
Ei^new = Ei^old + y1(α1^new − α1^old)K(x1, xi) + y2(α2^{new,clipped} − α2^old)K(x2, xi) + b^old − b^new.    (39)
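The analytic update of steps 1–4 can be written compactly. The sketch below is a plain-Python rendering of Eqs. (35)–(38) for a single working-set pair, assuming the kernel values, errors and old multipliers are already available; it is meant only to make the clipping logic explicit, not to be a complete solver.

```python
def smo_pair_update(a1_old, a2_old, y1, y2, E1, E2, K11, K12, K22, L, U, b_old):
    """One analytic update of a working-set pair (sketch of Eqs. (35)-(38))."""
    eta = -K11 + 2.0 * K12 - K22                        # Eq. (35) denominator
    a2_new = a2_old - y2 * (E1 - E2) / eta              # Eq. (35)
    a2_clipped = min(max(a2_new, L), U)                 # Eq. (36)
    a1_new = a1_old + y1 * y2 * (a2_old - a2_clipped)   # Eq. (37)
    # Threshold candidates, Eqs. (38a)-(38b)
    b1 = E1 + y1 * (a1_new - a1_old) * K11 + y2 * (a2_clipped - a2_old) * K12 + b_old
    b2 = E2 + y1 * (a1_new - a1_old) * K12 + y2 * (a2_clipped - a2_old) * K22 + b_old
    if L < a1_new < U and not (L < a2_clipped < U):
        b_new = b1
    elif L < a2_clipped < U and not (L < a1_new < U):
        b_new = b2
    elif L < a1_new < U and L < a2_clipped < U:
        b_new = 0.5 * (b1 + b2)
    else:
        b_new = b_old                                   # Eq. (38c)
    return a1_new, a2_clipped, b_new
```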

It can be easily seen that NBSVM solves a linearly constrained Quadratic Programming (QP) problem, like traditional SVM. It has the same number of variables and constraints as SVM, both in the primal space as well as the dual space. Hence, the computational complexity of NBSVM and the modified SMO are approximately the same as those of classical SVM and original SMO, respectively.

7. Experimental results

In order to demonstrate the performance of NBSVM on different types of data distributions, we present a few artificial examples in the first part of this section. Later, we present the results obtained by applying NBSVM to some popular real datasets (with and without unequal costs) and compare the results obtained with those of the canonical SVM (Cortes & Vapnik, 1995) and other state-of-the-art methods such as PPSVM (Tao et al., 2005) for problems with equal costs, and SMOTE with Different Costs (SDC) of Akbani et al. (2004) for problems with unequal costs. A non-parametric statistical test called Wilcoxon's rank sum test for independent samples (García, Fernández, Luengo, & Herrera, 2010; Wilcoxon, 1992) is conducted at the 5% significance level to judge whether the results obtained by NBSVM or uNBSVM differ from those of the competing classifiers to a statistically significant extent. If the obtained P-values are less than 0.05 (5% significance level), it is strong evidence against the null hypothesis, indicating that the results obtained by NBSVM or uNBSVM are statistically significant and have not occurred by chance (García et al., 2010). Finally, in an attempt to understand the extent of dependence of performance on adherence of the dataset to the underlying assumptions, we study the ratio of the gradients (within the margin) of the target class to those of the non-target class for all the tested real datasets.

7.1. Artificial datasets

The performance of NBSVM on four different artificial datasets is demonstrated, to show how it behaves under ideal conditions, and also when the basic assumptions are not met exactly. SVM is also run on the same data, to serve as a baseline algorithm. Each of the first 3 datasets, namely ArtSet1, ArtSet2, and ArtSet3, consists of two classes; the minority class consisting of 100 data points (with labels +1) and the majority class consisting of 200 data points (labeled −1), both drawn from classes (class 1 and class 2, respectively) characterized by right-circular conic probability distribution functions:

pk(x) = γk × (R − d(x, µk)), if d(x, µk) ≤ R;  0, otherwise,    (40)

where R = (3/(πγk))^(1/3), d(x, µk) = √((x − µk)ᵀ(x − µk)), µk and γk are respectively the class mean and gradient, and k denotes the class. ArtSet4, on the other hand, consists of 3 classes of 100 points each, also drawn from right-circular conic probability distributions. The class means µk and gradients γk for all the datasets are shown in Table 1. A right-circular conic probability distribution function is shown in Fig. 3. These distributions are chosen for the artificial classes because of the simple structure and radially constant gradients, which makes it easy to understand the effects of the proposed methods, in comparison to that of SVM. SVM and NBSVM are run on all four datasets on a linear kernel with cost of regularization C = 10 (using 10-fold cross-validation), with the OVA approach being used for the fourth dataset. SDC and uNBSVM are also run on the third dataset, with class 1 as the target class and C = 10. Target class to non-target class cost ratios of 2:1, 5:1 and 10:1 are used for the experiments, and the cost ratio of 2:1 is found to be the best for both SDC as well as uNBSVM.




Fig. 3. Right-circular conic probability distribution function with µk = [0, 0] and γk = 75.

Table 1. µk and γk for all artificial datasets.
Dataset   µ1      µ2          µ3         γ1   γ2   γ3
ArtSet1   [0, 0]  [1.5, 1.5]  –          75   75   –
ArtSet2   [0, 0]  [1.5, 1.5]  –          250  75   –
ArtSet3   [0, 0]  [1.5, 1.5]  –          75   250  –
ArtSet4   [0, 0]  [2, 2]      [4, 0.06]  75   75   75

Table 3. Accuracy (%) and gmeans (%) values for SVM, SDC and uNBSVM for ArtSet3.
Algorithm  Cost ratio  acc.   gmeans
SVM        –           95.00  91.55 (a)
SDC        2:1         95.67  94.72 (a)
uNBSVM     2:1         95.67  95.28
(a) Significantly worse than NBSVM.
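Table 3 reports both accuracy and gmeans (formally defined in Eq. (41) later in this section). For reference, a small helper of ours that computes the two measures from confusion-matrix counts:

```python
import math

def accuracy_and_gmeans(tp, tn, fp, fn):
    """Accuracy and gmeans from confusion-matrix counts, cf. Eqs. (41a)-(41b)."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate on the target class
    specificity = tn / (tn + fp)   # true-negative rate on the non-target class
    return acc, math.sqrt(sensitivity * specificity)

print(accuracy_and_gmeans(tp=80, tn=180, fp=20, fn=20))
```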

Table 2. Accuracies (%) for NBSVM and SVM over the artificial datasets.
Dataset   SVM        NBSVM
ArtSet1   88.00 (a)  89.67
ArtSet2   85.67 (b)  86.33
ArtSet3   95.33 (a)  96.67
ArtSet4   77.33 (a)  79.33
(a) Significantly worse than NBSVM. (b) Statistically equal to NBSVM.

The results obtained for NBSVM are listed in Table 2. The performance of uNBSVM in comparison to that of SDC and SVM can be seen in Table 3. The performance metrics used are (He & Garcia, 2009; Kubat et al., 1997):

accuracy = (tp + tn) / (tp + fp + tn + fn),    (41a)
and gmeans = √((tn/(tn + fp)) × (tp/(tp + fn))),    (41b)

where tp, tn, fp and fn are respectively the number of correct positive (target) predictions, true negative (non-target) predictions, false positive predictions, and false negative predictions by the classifier in question. As the first artificial dataset, ArtSet1, has both classes with identical gradients, it is ideal for the application of NBSVM. Indeed, it is able to outperform SVM by a statistically significant margin on this dataset. A comparison of the corresponding decision boundaries can be seen in Fig. 4(a). The second dataset has a minority class (class 1) which has higher gradient compared to majority class (class 2). In other words, the minority class is more compact than the majority class. The third dataset, on the other hand, is characterized by greater gradient of the majority class. As explained earlier, the performance of NBSVM is expected to deteriorate on the second dataset while it is expected to achieve greater accuracy than SVM on the third dataset. Both these expectations are found to be true. NBSVM is unable to significantly


outperform SVM on ArtSet2 (even though it is able to achieve slightly better accuracy), while it achieves statistically significant improvement on ArtSet3. The resulting hyperplanes can be seen in Fig. 4(b) and (c), respectively. Finally, the new method also significantly outperforms SVM on the fourth, three-class dataset ArtSet4. It can be seen from Table 3 that both SDC as well as uNBSVM outperform SVM, while uNBSVM is able to achieve a statistically better gmeans value compared to both SVM and SDC. This demonstrates the capability of uNBSVM to achieve better target class classification compared to SDC, even for an identical set of costs (the cost set of 2:1 was found to be the best for both SDC and uNBSVM). A comparison between the decision boundaries can be seen in Fig. 4(d).

7.2. Real datasets for NBSVM

Experiments are performed using NBSVM on 15 two-class and multi-class datasets from the University of California at Irvine (UCI) repository (Lichman, 2013), the IDA benchmark repository (Rätsch, 2001), and the Statlog Collection (Brazdil & Gama, 1991). The names and relevant details of the used datasets are provided in Table 4. Each dataset is normalized so that each feature has zero mean and unit standard deviation. Then, each dataset is 10-fold cross-validated such that the fractions of representatives from each class are preserved in the sub-partitions. The following kernels are used for the experiments (reproduced as code below):
1. Linear kernel: K(x, y) = (x·y);
2. Polynomial kernel: K(x, y) = (x·y + 1)^p;
3. Radial Basis Function (RBF) kernel: K(x, y) = exp(−∥x − y∥²/2σ²).
The parameter p for the polynomial kernel is varied as integers in the range [2, 5], while σ for the RBF kernel is varied as multiples of 0.1 in the interval (0, 5], and as multiples of 10 in the interval [10, 50]. All kernels are used for experimentation with all the 15 datasets, with regularization cost C = 2.
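For reference, the three kernels listed above written out as plain functions (a sketch of ours; the reported experiments were run with the authors' own implementation):

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def polynomial_kernel(x, y, p=3):
    return float((np.dot(x, y) + 1.0) ** p)

def rbf_kernel(x, y, sigma=1.0):
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
```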


Fig. 4. Comparison on artificial datasets. (a) SVM (dotted line) and NBSVM (solid line) for ArtSet1. (b) SVM (dotted line) and NBSVM (solid line) for ArtSet2. (c) SVM (dotted line) and NBSVM (solid line) for ArtSet3. (d) SDC (dotted line) and uNBSVM (solid line) for ArtSet3.

Table 4. Summary of the benchmark datasets used in the experiments for NBSVM.
Name            #data-points  #classes  #attributes
Banana (b)          5300         2          2
German (c)          1000         2         24
Diabetes (a)         768         2          8
Ionosphere (a)       351         2         34
Liver (a)            345         2          6
Sonar (a)            208         2         60
Spambase (a)        4601         2         57
Titanic (b)         2201         2          3
Balance (a)          625         3          4
Car Eval. (a)       1728         4         21
Glass (a)            214         6          9
Iris (a)             150         3          4
Vehicle (c)          946         4         18
Waveform (a)        5000         3         21
Wine (a)             178         3         13
(a) Lichman (2013). (b) Rätsch (2001). (c) Brazdil and Gama (1991).

The modified version of SMO discussed in Section 6 is used to solve the QP problem posed by NBSVM, with the OVA approach being used for experiments with multi-class datasets. The best test accuracies obtained by NBSVM are compared with those of SVM and PPSVM in Table 5. The statistical significance of the results is tested using Wilcoxon’s rank sum test. The number of wins (W), ties (T) and loses (L) of NBSVM with respect to the competing methods are listed at the bottom of the table. It is seen from Table 5 that NBSVM consistently produces better accuracy compared to SVM and PPSVM, except for the Waveform dataset (for which PPSVM is the best). The rank sum test indicates that SVM is statistically equal to NBSVM on 6 of the tested datasets, while it is significantly worse on the other 9 datasets. PPSVM is found to be statistically equal to NBSVM on 7 of the tested datasets (including waveform), while being significantly worse on the remaining datasets. This demonstrates the superiority of

NBSVM, in comparison to SVM and PPSVM. We explore the reason behind the slightly poor performance of NBSVM on the Waveform dataset in Section 7.4.

7.3. Real datasets for uNBSVM

For comparison with SDC, experiments are conducted with the 10 datasets listed in Table 6. When the datasets used are originally multi-class, only one of the classes is chosen as the target class and the rest of the classes are combined to form the non-target class. The resulting imbalance in the number of representatives between the target and non-target classes can also be seen in the same table. Like the datasets from the previous set of experiments, these datasets are also normalized so as to achieve zero mean and unit standard deviation for each feature. The datasets are then subjected to 10-fold cross-validation by SVM, SDC, and uNBSVM. Since there are no strict guidelines to select the set of costs for SDC (or uNBSVM), we test on different sets of misclassification costs for each dataset, so that the ratio of the costs of the target class and the non-target class is greater than or equal to the inverse of the ratio of the RRs of the respective classes. The cost of regularization was fixed at C = 2. The cost sets, the percentage of oversampling and the kernel used for each of the datasets can also be seen in Table 6. Akbani et al. (2004) argue that gmeans (Kubat et al., 1997) is a good measure of performance for imbalanced datasets, as it reports the geometric mean of the sensitivity and specificity of a classifier. So, classifiers which are likely to assign all data points to the dominant class will have a low gmeans value, despite having a high accuracy value. Hence, gmeans has been used to compare the performance of uNBSVM with that of SDC (in terms of the best result obtained over the different cost sets for both SDC and uNBSVM) and SVM, in Table 7. A scrutiny of Table 7 indicates that uNBSVM is able to produce satisfactory results for all datasets except Glass (with target class


Table 5. Comparison between SVM, PPSVM and NBSVM based on kernels yielding best test accuracy.
Dataset     SVM kernel (σ/p)  SVM acc. (%)  PPSVM kernel (σ/p)  PPSVM acc. (%)  NBSVM kernel (σ/p)  NBSVM acc. (%)
Banana      RBF(0.1)          89.79 (b)     RBF(8r0)            89.87 (b)       RBF(1)              90.58
German      Linear            72.11 (a)     Linear              73.29 (a)       Linear              75.23
Diabetes    Linear            76.47 (b)     Linear              77.32 (b)       RBF(10)             77.77
Ionosphere  RBF(10)           92.63 (a)     RBF(r0)             94.05 (a)       RBF(5)              95.17
Liver       Linear            64.50 (a)     Linear              65.83 (a)       RBF(5)              70.81
Sonar       Poly(3)           77.12 (a)     Poly(3)             78.12 (b)       RBF(20)             79.88
Spambase    Poly(2)           87.08 (a)     Linear              90.35 (a)       RBF(20)             92.91
Titanic     RBF(1)            77.57 (b)     RBF(1)              77.60 (b)       RBF(5)              77.96
Balance     Linear            84.90 (a)     Linear              86.87 (a)       RBF(3)              92.49
Car Eval.   Poly(3)           85.99 (a)     Linear              81.15 (a)       RBF(1)              93.89
Glass       RBF(1.6)          60.91 (a)     RBF(r0)             62.27 (a)       RBF(2)              66.07
Iris        RBF(2)            96.00 (b)     Poly(3)             95.60 (b)       RBF(2.7)            96.00
Vehicle     Linear            72.38 (a)     Linear              70.00 (a)       RBF(4)              77.89
Waveform    Linear            86.50 (b)     Poly(3)             86.71 (b)       RBF(7)              85.29
Wine        RBF(10)           96.05 (b)     RBF(2r0)            96.05 (b)       RBF(5)              96.67
W-T-L (NBSVM vs. SVM): 9-6-0; W-T-L (NBSVM vs. PPSVM): 8-7-0.

r0 is the average distance of a point with its nearest neighbor. (a) Significantly worse than NBSVM. (b) Statistically equal to NBSVM.

Table 6. Details of the benchmark datasets used for the experiments on uNBSVM.
Name               #points  #attributes  Target class  Ratio (a)  % Oversampling  Target:non-target cost sets
Abalone (b)        4177     8            19            01:99      1000            99:1, 200:1, 500:1
Breast Cancer (b)   286     9            M             35:65       100            1.86:1, 5:1, 10:1
Car Eval. (b)      1728     21           3             04:96       400            24:1, 50:1, 100:1
Glass (b)           214     9            7             14:86       200            6:1, 10:1, 20:1
Haberman (b)        306     3            2             27:73       100            3:1, 5:1, 10:1
Letter (c)         20000    16           26            04:96       200            26:1, 50:1, 75:1
Liver (b)           345     6            1             42:58       100            1.37:1, 5:1, 10:1
Segment (c)        2310     19           1             14:86       100            6:1, 10:1, 20:1
Sick (b)           3772     21           2             06:94       100            15:1, 20:1, 50:1
Soybean (b)         683     35           12            06:94       100            20:1, 50:1, 100:1
(a) Ratio of target:non-target representation (approx.). (b) Lichman (2013). (c) Brazdil and Gama (1991).

Table 7. Comparison between SVM, SDC and uNBSVM based on kernels yielding best test accuracy.
Dataset        SVM kernel (σ/p)  SVM gmeans (%)  SDC kernel (σ/p)  SDC cost set  SDC gmeans (%)  uNBSVM kernel (σ/p)  uNBSVM cost set  uNBSVM gmeans (%)
Abalone        L                 0 (a)           L                 99:1          71.17 (a)       L                    200:1            79.61
Breast Cancer  L                 96.59 (b)       L                 1.86:1        97.67 (b)       L                    1.86:1           97.91
Car Eval.      R(1.4)            93.24 (a)       R(1.4)            100:1         99.52 (b)       R(1.4)               24:1             99.67
Glass          L                 92.39 (b)       R(1.4)            20:1          91.01 (a)       L                    10:1             92.19
Haberman       R(1.4)            45.01 (a)       L                 3:1           61.51 (a)       L                    5:1              65.18
Letter         R(1.4)            88.11 (a)       R(1.4)            26:1          94.42 (a)       R(1.4)               75:1             95.17
Liver          R(1.4)            65.77 (a)       R(1.4)            1.37:1        64.30 (a)       R(5)                 1.37:1           69.97
Segment        R(15)             98.85 (b)       R(15)             10:1          98.91 (b)       R(15)                20:1             99.27
Sick           L                 89.12 (b)       L                 15:1          88.85 (b)       R(1.4)               50:1             89.55
Soybean        L                 98.00 (a)       L                 100:1         98.39 (a)       L                    50:1             99.85
W-T-L (uNBSVM vs. SVM): 6-4-0; W-T-L (uNBSVM vs. SDC): 6-4-0.
(a) Significantly worse than NBSVM. (b) Statistically equal to NBSVM.

7), where SDC also performs poorly. In fact, the performance of uNBSVM is much closer to that of SVM (which achieves the best gmeans value on this dataset). Wilcoxon’s rank sum test results are summarized as the number of wins (W), ties (T) and loses (L) of uNBSVM with respect to the competing methods, at the bottom of the table. Both SVM and SDC are statistically equal to uNBSVM on 4 of the 10 tested datasets and are significantly worse on 6 datasets. Even for the Glass dataset, the gmeans achieved by uNBSVM (while being slightly lower) is statistically equal to that of SVM. The slight deterioration on this dataset may be due to various reasons, including the possibility that the said dataset

does not adhere to the assumptions of similar gradients made by uNBSVM. However, the failure of SDC on the same dataset points in a different direction. The target class is likely to have a complex distribution which requires more sophisticated algorithms (this is confirmed later in Section 7.4).

7.4. Study of class gradients within the margin

NBSVM (and its variants) are based on the assumption that the class-conditional distributions for both the target class and non-target classes are monotonically decaying at similar rates


in the region of overlap. Therefore, it makes sense to study the relationship between the average gradients of the two classes within the region of overlap. However, this is non-trivial when the classes are multivariate and/or the distributions in the higher-dimensional feature space (to which the data is mapped using the kernel trick) are unknown. It is easier to study the projections of the classes in the direction of the margin. Thus, in order to understand how much the performance of the proposed algorithms depends on adherence to these assumptions, we now look at the ratios of the average gradients of the projections of the two classes within the margin (due to the implicit assumption that the margin coincides with the region of overlap). Let Γk be the set of points from class Ik within the margin. Then Γ+ and Γ−, respectively for the classes I+ (target class) and I− (non-target class), are given by:

Γ+ = {xi ∈ I+ | −ρ− ≤ Σ_{j=1}^{N} αj yj K(xi, xj) − b ≤ ρ+},    (42a)
and Γ− = {xi ∈ I− | −ρ− ≤ Σ_{j=1}^{N} αj yj K(xi, xj) − b ≤ ρ+}.    (42b)

Since the class-conditional distributions are assumed to be monotonically decreasing in the region of overlap, the average class-conditional gradients of I+ and I− (within the margin) can be respectively estimated as:

γ̂+ = [Σ_{xi∈Γ+} 2(Σ_{j=1}^{N} αj yj K(xi, xj) − b + ρ−)] / |Γ+|,    (43a)
and γ̂− = [Σ_{xi∈Γ−} −2(Σ_{j=1}^{N} αj yj K(xi, xj) − b − ρ+)] / |Γ−|.    (43b)

Notice that the normalization is required to compensate for the unequal prior probabilities. So, the estimated ratio of the gradients of the target class and non-target class can be expressed as:

γ̂ = γ̂+ / γ̂−.    (44)

Table 8. Ratios of target:non-target class gradients of benchmark datasets within the margin.
Dataset        Classifier  Ratio(s) γ̂ in order of class labels
Banana         NBSVM       1.25
German         NBSVM       1.02
Diabetes       NBSVM       0.89
Ionosphere     NBSVM       1.18
Liver          NBSVM       1.02
Sonar          NBSVM       0.94
Spambase       NBSVM       1.21
Titanic        NBSVM       3.12
Balance        OVA-NBSVM   1.03, 2.99, 1.07
Car Eval.      OVA-NBSVM   1.99, 1.05, 1.03, 0.99
Glass          OVA-NBSVM   1.39, 1.36, 1.47, –(a), 1.26, 0.99, 0.99
Iris           OVA-NBSVM   0.99, 1.12, 1.25
Vehicle        OVA-NBSVM   1.62, 1.65, 1.15, 1.27
Waveform       OVA-NBSVM   1.20, 1.40, 1.40
Wine           OVA-NBSVM   1.05, 1.14, 1.36
Abalone        uNBSVM      1.14
Breast Cancer  uNBSVM      0.99
Car Eval.      uNBSVM      1.14
Glass          uNBSVM      1.25
Haberman       uNBSVM      0.43
Letter         uNBSVM      1.00
Liver          uNBSVM      1.08
Segment        uNBSVM      1.01
Sick           uNBSVM      1.35
Soybean        uNBSVM      0.99
(a) There are no points belonging to class 4.

(44)

The γˆ values for all the real datasets used for experiments with NBSVM, OVA-NBSVM and uNBSVM are listed in Table 8. Multiple γˆ values (in order of class labels) are listed for OVA datasets, one for each of the M classifiers. The ratios for all the datasets (except Titanic and Balance (class 2)) fall within the range [0.89,1.99]. Since the proposed methods performed well on most of these datasets, it suggests that they are not overtly sensitive to the γˆ value (which should ideally have been 1). The slightly high values for the Titanic dataset and class 2 of the Balance dataset are due to the predominance of the target class within the region of overlap. However, OVA-NBSVM and uNBSVM respectively perform slightly worse than the competing methods on the Waveform and Glass (class 7) datasets, inspite of having moderate γˆ values. To help understand the situation better, we present histographic approximation of the projected posterior (cost-scaled) distributions of some of the datasets in Fig. 5. Fig. 5(a) shows the plot for the Banana dataset. It is seen that the margin approximately corresponds to the region of overlap and NBSVM is able to place the decision boundary close to the Bayes optimal. The plot for class 2 of the Car Eval. dataset (Fig. 5(b)) illustrates a different situation. The target and non-target classes appear to be completely separable and the decision boundary is placed closer to the minority class to achieve better inductive

Fig. 5(a) shows the plot for the Banana dataset. The margin approximately corresponds to the region of overlap, and NBSVM is able to place the decision boundary close to the Bayes optimum. The plot for class 2 of the Car Eval. dataset (Fig. 5(b)) illustrates a different situation: the target and non-target classes appear to be completely separable, and the decision boundary is placed closer to the minority class to achieve better inductive performance. Class 7 of the Glass dataset is also seen to be mostly separable from the non-target classes (Fig. 5(c)); however, there are a sizeable number of outliers beyond the peak of the non-target class, which are misclassified. Finally, Fig. 5(d) (with class 1 of the Waveform dataset as the target) shows that there is considerable overlap between the two classes and that the primary peaks of both classes lie within this region. In other words, the assumption of monotonic decay does not hold in this case. Consequently, the margin corresponds to only a subset of the actual region of overlap, resulting in a miscalibrated decision boundary shift. Similar situations are observed for all the other classes of this dataset. Hence, the slightly poor performance on the latter two datasets can be ascribed to their complex distributions in the tested feature spaces.

8. Conclusions

In this paper, we proposed a new modification of SVMs, called NBSVM. The new method combines margin shift with asymmetric regularization costs, using the number of representatives from each class. NBSVM can be used to meet both of the distinct objectives of an imbalanced classification problem: with equal misclassification costs it minimizes the overall Bayes error, and with unequal costs it minimizes the misclassification of the target class. It also offers advantages over existing methods, as it neither requires costly probability estimation nor oversamples the minority class (which would increase the time complexity by increasing the number of training points). Instead, NBSVM relies on certain assumptions about the dataset which are often found to hold in practice. When all the classes have an equal number of representatives, it reduces to the traditional SVM classifier. The new method is extended to multi-class problems using the OVA approach, and the modifications to Platt's SMO (Platt, 1999) needed to implement the new classifier are also presented. NBSVM is tested on various artificial and real datasets and compared with the classical SVM and other state-of-the-art methods, demonstrating its classification ability on imbalanced datasets.


Fig. 5. Histographic approximations of the projected posterior (cost-scaled) distributions: (a) Banana dataset; (b) Car Eval. dataset (class 2); (c) Glass dataset (class 7; cost-scaled); (d) Waveform dataset (class 1).

The simplicity and elegance of NBSVM are expected to facilitate its extension to various aspects of imbalanced data learning, such as identifying and learning from small disjuncts in the dataset, identifying a minority target class in the case of extensive overlap between the target and non-target classes, or learning from noisy and imbalanced datasets (Fernández, García, & Herrera, 2011). Another interesting line of research could be the construction of special kernel functions tailor-made to exploit the assumptions of NBSVM, i.e. kernel functions which distort the distributions of the classes in the feature space so that the gradients of the two classes in the direction normal to the separating hyperplane become similar and monotonic.

References

Abe, S. (2002). Analysis of support vector machines. In Proceedings of the 2002 12th IEEE workshop on neural networks for signal processing (pp. 89–98). IEEE. http://dx.doi.org/10.1109/NNSP.2002.1030020.
Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In Lecture notes in computer science: Vol. 3201. Machine learning: ECML 2004 (pp. 39–50). Springer. http://dx.doi.org/10.1007/978-3-540-30115-8_7.
Amari, S.-i., & Wu, S. (1999). Improving support vector machine classifiers by modifying kernel functions. Neural Networks, 12, 783–789. http://dx.doi.org/10.1016/S0893-6080(99)00032-5.
Bao-Liang, L., Xiao-Lin, W., Yang, Y., & Hai, Z. (2011). Learning from imbalanced datasets with a min–max modular support vector machine. Frontiers of Electrical and Electronic Engineering in China, 6, 56–71. http://dx.doi.org/10.1007/s11460-011-0127-1.
Batuwita, R., & Palade, V. (2010). Efficient resampling methods for training support vector machines with imbalanced datasets. In The 2010 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE. http://dx.doi.org/10.1109/IJCNN.2010.5596787.
Batuwita, R., & Palade, V. (2013). Class imbalance learning methods for support vector machines. In Imbalanced learning: foundations, algorithms, and applications (pp. 83–99). John Wiley & Sons, Inc. http://dx.doi.org/10.1002/9781118646106.ch5.
Brazdil, P., & Gama, J. (1991). Statlog repository. URL: http://www.liacc.up.pt/ML/statlog/datasets.html [2007-10-22].

Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Cao, P., Zhao, D., & Zaiane, O. (2013). An optimized cost-sensitive SVM for imbalanced data learning. In Lecture notes in computer science: Vol. 7819. Advances in knowledge discovery and data mining (pp. 280–292). Springer. http://dx.doi.org/10.1007/978-3-642-37456-2_24.
Cervantes, J., Li, X., & Yu, W. (2014). Imbalanced data classification via support vector machines and genetic algorithms. Connection Science, 26, 335–348. http://dx.doi.org/10.1080/09540091.2014.924902.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2, 27:1–27:27.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. http://dx.doi.org/10.1613/jair.953.
Choi, J. M. (2010). A selective sampling method for imbalanced data learning on support vector machines (Ph.D. thesis). Ames, IA, USA.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. http://dx.doi.org/10.1023/A:1022627411411.
Duan, W., Jing, L., & Lu, X. Y. (2014). Imbalanced data classification using cost-sensitive support vector machine based on information entropy. Advanced Materials Research, 989, 1756–1761. http://dx.doi.org/10.4028/www.scientific.net/AMR.989-994.1756.
Ertekin, S. (2009). Learning in extreme conditions: Online and active learning with massive, imbalanced and noisy data (Ph.D. thesis). University Park, PA, USA.
Fernández, A., García, S., & Herrera, F. (2011). Addressing the classification with imbalanced data: open problems and new challenges on class distribution. In Lecture notes in computer science: Vol. 6678. Hybrid artificial intelligent systems (pp. 1–10). Springer. http://dx.doi.org/10.1007/978-3-642-21219-2_1.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42, 463–484. http://dx.doi.org/10.1109/TSMCC.2011.2161285.
García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180, 2044–2064. http://dx.doi.org/10.1016/j.ins.2009.12.010. Special Issue on Intelligent Distributed Information Systems.
Gonen, M., Tanugur, A. G., & Alpaydin, E. (2008). Multiclass posterior probability support vector machines. IEEE Transactions on Neural Networks, 19, 130–139. http://dx.doi.org/10.1109/TNN.2007.903157.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21, 1263–1284. http://dx.doi.org/10.1109/TKDE.2008.239.


Imam, T., Ting, K. M., & Kamruzzaman, J. (2006). z-SVM: An SVM for improved classification of imbalanced data. In Lecture notes in computer science: Vol. 4304. AI 2006: advances in artificial intelligence (pp. 264–273). Springer. http://dx.doi.org/10.1007/11941439_30.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637–649. http://dx.doi.org/10.1162/089976601300014493.
Koknar-Tezel, S., & Latecki, L. J. (2009). Improving SVM classification on imbalanced data sets in distance spaces. In Ninth IEEE international conference on data mining, 2009, ICDM'09 (pp. 259–267). IEEE. http://dx.doi.org/10.1109/ICDM.2009.59.
Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the fourteenth international conference on machine learning (pp. 179–186). Nashville, USA: Morgan Kaufmann.
Lee, J., Wu, Y., & Kim, H. (2014). Unbalanced data classification using support vector machines with active learning on scleroderma lung disease patterns. Journal of Applied Statistics, 42, 676–689. http://dx.doi.org/10.1080/02664763.2014.978270.
Lichman, M. (2013). UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
Lin, Y., Lee, Y., & Wahba, G. (2002). Support vector machines for classification in nonstandard situations. Machine Learning, 46, 191–202. http://dx.doi.org/10.1023/A:1012406528296.
Maratea, A., Petrosino, A., & Manzo, M. (2014). Adjusted f-measure and kernel scaling for imbalanced data learning. Information Sciences, 257, 331–341. http://dx.doi.org/10.1016/j.ins.2013.04.016.
Masnadi-Shirazi, H., & Vasconcelos, N. (2010). Risk minimization, probability elicitation, and cost-sensitive SVMs. In Proceedings of the 27th international conference on machine learning (pp. 759–766). Omnipress.
Peng, L., Ting-ting, B., & Yang, L. (2015). SVM classification for high-dimensional imbalanced data based on SNR and under-sampling. International Journal of Multimedia and Ubiquitous Engineering, 10, 105–112. http://dx.doi.org/10.14257/ijmue.2015.10.4.11.
Peng, L., Xiao-yang, Y., Ting-ting, B., & Jiu-ling, H. (2014). Imbalanced data SVM classification method based on cluster boundary sampling and DT-KNN pruning. International Journal of Signal Processing, Image Processing and Pattern Recognition, 7, 61–68.
Phoungphol, P. (2013). A classification framework for imbalanced data (Ph.D. thesis).
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in kernel methods: Support vector learning. Vol. 3.
Raskutti, B., & Kowalczyk, A. (2004). Extreme re-balancing for SVMs: A case study. SIGKDD Explorations Newsletter, 6, 60–69. http://dx.doi.org/10.1145/1007730.1007739.
Rätsch, G. (2001). IDA benchmark repository. URL: http://ida.first.fhg.de/projects/bench/benchmarks.htm.
Stecking, R., & Schebesch, K. B. (2012). Classification of large imbalanced credit client data with cluster based SVM. In Studies in classification, data analysis, and knowledge organization. Challenges at the interface of data analysis, computer science, and optimization (pp. 443–451). Springer. http://dx.doi.org/10.1007/978-3-642-24466-7_45.

Tang, Y., Zhang, Y.-Q., Chawla, N. V., & Krasser, S. (2009). SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 39, 281–288. http://dx.doi.org/10.1109/TSMCB.2008.2002909.
Tao, Q., Wu, G.-W., Wang, F.-Y., & Wang, J. (2005). Posterior probability support vector machines for unbalanced data. IEEE Transactions on Neural Networks, 16, 1561–1573. http://dx.doi.org/10.1109/TNN.2005.857955.
Veropoulos, K., Campbell, C., Cristianini, N., et al. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the international joint conference on AI, IJCAI (pp. 55–60).
Wang, Q. (2014). A hybrid sampling SVM approach to imbalanced data classification. In Abstract and applied analysis. Hindawi Publishing Corporation. http://dx.doi.org/10.1155/2014/972786.
Wang, B. X., & Japkowicz, N. (2010). Boosting support vector machines for imbalanced datasets. Knowledge and Information Systems, 25, 1–20. http://dx.doi.org/10.1007/s10115-009-0198-y.
Wang, S., Li, Z., Chao, W., & Cao, Q. (2012). Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning. In The 2012 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE. http://dx.doi.org/10.1109/IJCNN.2012.6252696.
Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Springer series in statistics. Breakthroughs in statistics (pp. 196–202). Springer. http://dx.doi.org/10.1007/978-1-4612-4380-9_16.
Wu, G., & Chang, E. Y. (2003). Adaptive feature-space conformal transformation for imbalanced-data learning. In Proceedings of the twentieth international conference on machine learning (pp. 816–823).
Wu, G., & Chang, E. Y. (2005). KBA: Kernel boundary alignment considering imbalanced data distribution. IEEE Transactions on Knowledge and Data Engineering, 17, 786–795. http://dx.doi.org/10.1109/TKDE.2005.95.
Wu, S.-H., Lin, K.-P., Chien, H.-H., Chen, C.-M., & Chen, M.-S. (2013). On generalizable low false-positive learning using asymmetric support vector machines. IEEE Transactions on Knowledge and Data Engineering, 25, 1083–1096. http://dx.doi.org/10.1109/TKDE.2012.46.
Yang, C.-Y., Yang, J.-S., & Wang, J.-J. (2009). Margin calibration in SVM class-imbalanced learning. Neurocomputing, 73, 397–411. http://dx.doi.org/10.1016/j.neucom.2009.08.006.
Yu, J., Cheng, F., Xiong, H., Qu, W., & Chen, X.-w. (2008). A Bayesian approach to support vector machines for the binary classification. Neurocomputing, 72, 177–185. http://dx.doi.org/10.1016/j.neucom.2008.06.010.
Zhang, Y., Fu, P., Liu, W., & Chen, G. (2014). Imbalanced data classification based on scaling kernel-based support vector machine. Neural Computing and Applications, 25, 927–935. http://dx.doi.org/10.1007/s00521-014-1584-2.
Zhao, Z., Zhong, P., & Zhao, Y. (2011). Learning SVM with weighted maximum margin criterion for classification of imbalanced data. Mathematical and Computer Modelling, 54, 1093–1099. http://dx.doi.org/10.1016/j.mcm.2010.11.040.
Zughrat, A., Mahfouf, M., Yang, Y., & Thornton, S. (2014). Support vector machines for class imbalance rail data classification with bootstrapping-based over-sampling and under-sampling. In 19th world congress of the international federation of automatic control (pp. 8756–8761). Cape Town, South Africa. http://dx.doi.org/10.3182/20140824-6-ZA-1003.00794.