Affinity and class probability-based fuzzy support vector machine for imbalanced data sets


Xinmin Tao, Qing Li, Chao Ren, Wenjie Guo, Qing He, Rui Liu, Junrong Zou
College of Engineering and Technology, Northeast Forestry University, Harbin, Heilongjiang 150040, China


Article history: Received 15 November 2018; Received in revised form 13 September 2019; Accepted 28 October 2019; Available online 2 November 2019.

Keywords: Imbalanced data; Fuzzy support vector machine (FSVM); Affinity; Class probability; Kernelknn

Abstract

The learning problem from imbalanced data sets poses a major challenge to the data mining community. Although the conventional support vector machine generally shows relatively robust performance in dealing with classification problems on imbalanced data sets, it treats all training samples as making the same contribution to learning, which results in the final decision boundary being biased toward the majority class, especially in the presence of outliers or noises. In this paper, we propose a new affinity and class probability-based fuzzy support vector machine technique (ACFSVM). The affinity of a majority class sample is calculated according to a support vector data description (SVDD) model trained only on the given majority class training samples, in a kernel space similar to that used for FSVM learning. The obtained affinity can be used for identifying possible outliers and some border samples existing in the majority class training samples. In order to eliminate the effect of noises, we employ the kernel k-nearest neighbor method to determine the class probability of the majority class samples in the same kernel space as before. The samples with lower class probabilities are more likely to be noises, and their contribution to learning is reduced by the low memberships constructed by combining the affinities and the class probabilities. Thus, ACFSVM pays more attention to the majority class samples with higher affinities and class probabilities while reducing the effects of the ones with lower affinities and class probabilities, eventually skewing the final classification boundary toward the majority class. In addition, the minority class samples are assigned relatively high memberships to guarantee their importance for model learning. Extensive experimental results on different imbalanced datasets from the UCI repository demonstrate that the proposed approach can achieve better generalization performance in terms of G-Mean, F-Measure, and AUC as compared to other existing imbalanced dataset classification techniques.

1. Introduction

The problem of imbalanced datasets occurs when the size of one class is much larger than that of the other classes (Raghuwanshi & Shukla, 2018; Tao et al., 2019). More specifically, when a dataset has only two classes, this problem occurs when one class outnumbers the other; the two classes are usually called the majority class and the minority class, respectively (Du, Vong, Pun, Wong, & Ip, 2017; Tao et al., 2019). Imbalanced datasets exist in many real-world applications such as medical diagnosis (Hassan, Huda, Yearwood, Jelinek, & Almogren, 2018; Khatami et al., 2018), risk management (Alonso-Ayuso, Escudero, Guignard, & Weintraub, 2018; Papadopoulos, Kyriacou, & Nicolaides,

2017), face recognition (Mamta & Hanmandlu, 2014; Romani et al., 2018), and intrusion detection (Gautam & Om, 2018; Jokar & Leung, 2018). When dealing with imbalanced datasets, conventional classification algorithms, which are designed for balanced datasets or equal misclassification costs, often fail to achieve good performance, especially on the minority class, due to the dominating influence of the majority class (Tao, Li, Ren et al., 2019). As a representative kernel-based learning algorithm, the support vector machine (SVM) often shows more robust performance than other methods in imbalanced dataset classification (Tao, Li, Guo et al., 2019). The objective of SVM is to find an optimal hyperplane that separates the two classes with maximal margin (Liu, Zhang, Tao and Cheng, 2017; Tao, Jin, Liu, & Li, 2013). This objective originates from the fact that a maximal margin guarantees a lower Vapnik–Chervonenkis (VC) dimension and consequently leads to improved generalization performance (Liu, Ma, Zhou, Tao and Cheng, 2018). Since SVM treats all samples as making the same contribution, it ignores the difference between the majority and minority classes, which can result in the learned classification boundary being biased toward the majority class when dealing with imbalanced datasets (Tao, Li, Guo et al., 2019).


In addition, possible outliers or class noises can greatly distort the margin and the location of the classification boundary. In order to ensure SVM's performance on imbalanced datasets, especially in the presence of outliers or noises, numerous variations have been introduced. The most popular version is fuzzy SVM, also known as FSVM (Lin & Wang, 2002). FSVM is a fuzzified variant of SVM, which allows the incorporation of fuzzy membership values (MVs) into SVM and reformulates SVM so that different samples have a different impact on the construction of the classification boundary. Although many other variants of FSVM have been developed in the past few decades (Dheeba, Jaya, & Singh, 2017; Jiaqiang et al., 2017; Hang, Zhang, & Cheng, 2016; Khemchandani & Pal, 2018; Liu, Song, Zhang and Zhao, 2017; Liu et al., 2018; Moteghaed, Maghooli, & Garshasbi, 2018; Ni, He, & Jiang, 2017; Sampath & Gomathi, 2017; Wang et al., 2018), only Lin et al.'s formulation is followed here. In Hang et al. (2016), Liu, Ma et al. (2018), Liu, Wang, Cai, Chen and Qin (2018), Liu, Zhang et al. (2018), Moteghaed et al. (2018), Ni et al. (2017), and Wang et al. (2018), various FSVM versions based on the same underlying principle as Lin et al.'s have been introduced, differing only in the formulation of the objective function. On the other hand, in the variations of Dheeba et al. (2017), Jiaqiang et al. (2017), Liu, Song et al. (2017), Liu, Zhang et al. (2017), and Sampath and Gomathi (2017), the fuzzy measures are introduced only for class assignment. During class determination for a candidate test sample, each assigned class is associated with an MV, and the class with the highest MV is chosen as the final class for the sample. Those variations are useful for multiclass classification of samples, especially in unclassifiable regions. Different from Lin et al.'s FSVM, in those variations all samples have equal importance for the construction of the classification boundary and the fuzzy measure is relevant only during class assignment. Devising a strategy for identifying possible outliers and class noises and assigning them low MVs is the key issue for FSVM to improve generalization performance in imbalanced dataset classification. Most of the literature using FSVM has employed membership functions (MFs) that are designed specifically for the application under consideration (Abe, 2015; Chaudhuri, 2014; Hsu, Lin, Chou, Hsiao, & Liu, 2017; Liu, Wang et al., 2018; Naderian & Salemnia, 2017; Wu & Yap, 2006). For the credit risk identification task (Chaudhuri, 2014), each customer was assigned two MVs according to their good and bad repayment abilities via exponential MFs. Similarly, for content-based image retrieval (Wu & Yap, 2006), exponential-based MFs have also been adopted. For EEG signal classification (Hsu et al., 2017; Liu, Ma et al., 2018; Liu, Wang et al., 2018; Liu, Zhang et al., 2018), MVs were determined based on the density of samples. For text classification and the classification of power quality disturbances (Abe, 2015; Naderian & Salemnia, 2017), quadratic-based and linear-based MFs have been used. For general-purpose MFs, Yang et al. (Yang, Song, & Wang, 2007; Yang, Zhang, Lu, & Ma, 2011) applied fuzzy C-means (FCM), kernel FCM (KFCM) (Yang et al., 2011), and kernel possibilistic C-means (KPCM) clustering (Yang et al., 2007) for assigning MVs to all samples.
Alternatively, Batuwita and Palade (2010) incorporated a differential error cost within the MVs of each sample for imbalanced dataset classification, where four MFs are introduced and evaluated on numerous imbalanced datasets. Recently, in order to down-weight correctly classified samples and increase the importance of boundary and easily misclassified cases, Maldonado, Merigo, and Miranda (2018) presented a two-step SVM-based methodology incorporating the OWA operator (OWASVM). In the proposed method, the soft-margin SVM model is first applied to obtain the distance between each training sample and the trained classification hyperplane. Then, the soft-margin SVM formulation is redefined

by aggregating with the OWA operator in ascending order of the obtained distances and is finally trained to obtain the resulting decision hyperplane. By dividing samples into two fuzzy sets using clustering techniques and assigning a normal MV to the set containing possible outliers and a constant 1 to the other set containing non-outliers, Sevakula and Verma (2017) designed a general membership function, and the FSVM based on this function (GPFSVM) was demonstrated to perform better in handling outliers. However, this method does not take into account the effect of majority class noises among the minority class. In order to avoid the effect of noises, Fan, Wang, Li, Gao, and Zha (2017) proposed an entropy-based FSVM (EFSVM) that pays more attention to the samples with higher class probabilities calculated by k-nearest neighbors. Unfortunately, the method does not consider the impact of possible outliers and border samples. Based on a similar idea, Gupta and Richhariya (2018) proposed two efficient variants of the entropy-based fuzzy SVM (EFSVM): an entropy-based fuzzy least squares support vector machine (EFLSSVM-CIL) and an entropy-based fuzzy least squares twin support vector machine (EFLSTWSVM-CIL). Compared to EFSVM, the two proposed methods significantly reduce the training cost, as demonstrated in the experimental results, by solving a set of linear equations instead of the large quadratic programming problem used in conventional SVM. Similar to EFSVM, however, the two methods also fail to further consider the impact of possible outliers and border samples on the generation of the final classification boundary. It is worth noting that, unlike normal classification tasks, when dealing with imbalanced classification issues the learned classification boundary of SVM tends to be biased toward the majority class due to the dominating impact of the majority class on learning, thus reducing the generalization performance of the classifier. In order to ensure that the final classification boundary is not biased toward the majority class, inspired by under-sampling, we propose a novel fuzzy SVM based on a combination of affinity and class probability as membership values (ACFSVM), where we not only attempt to eliminate the effect of possible outliers and class noises but also reduce the contribution of majority border samples. In the proposed approach, we apply a support vector data description (SVDD) model to obtain a hypersphere enclosing most of the majority samples in an appropriate kernel space. The affinities of the majority training samples are calculated according to their distances to the center of the hypersphere relative to the radius of the hypersphere. This strategy can guarantee that border samples, possible outliers, or even class noises are assigned lower affinities and thus have less impact on the construction of the classifier. In order to further eliminate the effect of class noises, we use kernel k-nearest neighbors to determine the corresponding class probabilities of the majority training samples instead of k-nearest neighbors in the input space, so that the estimation of the samples' MVs is based on their locations in the same space as that used for training FSVM. This is more reasonable for kernel SVM. In order to evaluate the performance of the proposed ACFSVM, we carried out several classification experiments on datasets available from the UCI repository.
The experimental results demonstrate that the proposed ACFSVM approach is superior to other existing imbalanced classification methods in terms of different evaluation metrics. The main contributions of this work are as follows. We present a new affinity and class probability-based fuzzy support vector machine technique (ACFSVM) to solve the learning problem from imbalanced data sets, and we conduct extensive experiments on several datasets to verify the proposed algorithm. ACFSVM pays more attention to the majority class samples with higher affinities and class probabilities while reducing the effects of the ones with lower affinities and class probabilities, eventually skewing the final classification boundary toward the majority class.


In the presented algorithm, we employ the kernel k-nearest neighbor method to determine the class probability of the majority class samples in the same kernel space as before, which guarantees that the space used for estimating the samples' MVs coincides with the space used by the SVM classifier. The remainder of the paper is organized as follows. In Section 2, we provide an overview of relevant previous works and other existing imbalanced classification methods. Section 3 reviews the fuzzy support vector machine and existing membership functions. Our proposed approach is described in detail in Section 4. Section 5 presents extensive experimental results and discussions. Finally, conclusions are drawn in Section 6.

2. Related work

In this section, we briefly introduce the related work on imbalanced dataset classification problems. Generally, the state-of-the-art approaches addressing imbalanced dataset classification problems can be broadly divided into two categories: data-level and algorithm-level (Zhang, Zhu, Wu, Liu, & Xu, 2017). Considering that our research focus is the algorithm level, we only provide a brief review of the data level. A more detailed and extensive review can be found in Galar, Fernández, Barrenechea, Bustince, and Herrera (2012) and Zhu, Wright, Wang, and Wang (2018).


2.1. Data-level approaches

Commonly, data-level methods can be categorized into under-sampling and over-sampling. Under-sampling aims to re-balance datasets by removing majority class instances from the training sets. Random under-sampling (RU) is a non-heuristic method that re-balances the class distribution by randomly throwing away some of the majority class instances. Because no selection criterion is applied, this method can lead to the loss of information that is potentially useful for classification in the training phase, under the assumption that all majority class instances make the same contribution to classification. In order to address this drawback, several informative under-sampling techniques have been proposed, such as the Condensed Nearest Neighbor (CNN) rule (Hart, 1968; Liang, Xu, & Xiao, 2017), Tomek Links (Devi, Biswas, & Purkayastha, 2017; Tomek, 1976), One-Sided Selection (OSS) (Kubat & Matwin, 1997; Zuo & Jia, 2017), and the Neighborhood Cleaning Rule (Laurikkala, 2001). In contrast to under-sampling, over-sampling increases the number of minority class instances by generating artificial instances in the training sets to reduce the imbalance of the distribution. Similar to random under-sampling, random over-sampling (RO) is a non-informative method that re-balances the class distribution by randomly duplicating minority class instances. Its drawback is that exact replication with replacement can result in overfitting of the subsequent supervised classification algorithm due to the duplicated information in the obtained training set. To overcome this drawback, an informative over-sampling method, called the Synthetic Minority Over-sampling Technique (SMOTE), was proposed in Chawla, Bowyer, Hall, and Kegelmeyer (2002). SMOTE generates new minority class instances by interpolation among several k-nearest minority class neighbors. However, despite its advantages, the generation of artificial minority class instances can lead to overlap between the two classes, especially when there is no clear separation between them in the input space and the minority class is not a convex set. In order to accommodate this scenario, several other informative over-sampling approaches have been proposed. SMOTE combined with the Edited Nearest Neighbor rule applies this rule to remove the instances misclassified by their corresponding three nearest neighbors after the generation of artificial minority class instances using SMOTE. Alternatively, Bunkhumpornpat, Sinapiromsaran, and Lursinsap (2009) presented Safe-Level SMOTE, which modifies SMOTE by assigning different safe-level weights to discriminate noisy and safe instances during the data generation process. Other heuristic over-sampling methods such as Borderline-SMOTE (BSMOTE) (Han, Wang, & Mao, 2005), the Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning (MWMOTE) (Barua, Islam, Yao, & Murase, 2014), ADASYN (He, Bai, Garcia, & Li, 2008), and its variant KernelADASYN (Tang & He, 2015) aim to first identify the informative minority class instances and then generate artificial instances among them, which can effectively avoid the effect of noises during the data generation process. Recently, to address the limitation of SMOTE for nonlinear problems, Mathew, Luo, Pang, and Chan (2015) proposed a kernel-SMOTE (K-SMOTE) version that synthetically generates minority instances in the feature space of SVM. To further improve K-SMOTE, they developed an enhanced weighted K-SMOTE (WK-SMOTE) (Mathew, Pang, Luo, & Leong, 2018) by introducing a cost-sensitive weighting factor to bias the majority, minority, and synthetic instances differently. The experimental results show that WK-SMOTE can obtain better classification performance in addressing some real-world imbalanced problems.

2.2. Algorithm-level approaches

Algorithm-level approaches involve creating new algorithms or modifying existing ones to deal with imbalanced dataset classification problems. They mainly include cost-sensitive learning and ensemble learning. As one of the most representative algorithm-level methods, the cost-sensitive SVM (CSVM) incorporates different misclassification costs for each class during learning and improves generalization performance by assigning higher misclassification costs to the minority class. Based on similar principles, Veropoulos, Campbell, and Cristianini (1999) introduced a biased support vector machine that assigns different costs to the majority class and the minority class, which is useful for skewing the final hyperplane away from the minority class. However, this method does not consider the different contributions of samples within the same class to the formation of the classifier, which can lead to overfitting caused by noises and outliers. An ensemble of classifiers is an effective strategy for improving generalization performance by combining the decisions of individual classifiers into a final voting result. Inspired by the ensemble approach, Jian, Gao, and Ao (2016) proposed a novel ensemble SVM based on AdaBoost (AdaSVM) to improve classification accuracy on imbalanced datasets by adaptively resampling the instances according to their weights. Its main drawback is that the performance improvement partly depends on the diversity of the individual classifiers, whereas this diversity is difficult to define and enlarge. In this paper, we propose a novel affinity and class probability-based fuzzy SVM to deal with imbalanced datasets. In the proposed approach, the majority samples with high affinities and class probabilities are assigned higher MVs and thus have more impact on the formation of the classifier. On the contrary, the majority samples with low affinities and class probabilities, which are likely to be border samples, outliers, or class noises, are assigned lower MVs to reduce or eliminate their contribution to learning. This is more beneficial for skewing the final classification boundary away from the minority class and thus improves the classification accuracy, especially for the minority class.
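To make the cost-sensitive idea discussed above concrete, the following is a minimal sketch, assuming an off-the-shelf RBF SVM with class-dependent penalties; it is not the CSVM formulation of the cited works, and the use of the imbalance ratio as the minority-class weight is an assumption for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Toy imbalanced data set: roughly 90% majority (relabelled +1), 10% minority (-1).
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
y = np.where(y == 0, 1, -1)          # majority -> +1, minority -> -1

# Class-dependent penalty: the minority class gets a cost scaled by the
# imbalance ratio, so its misclassifications are penalized more heavily.
r_im = np.sum(y == 1) / np.sum(y == -1)
clf = SVC(kernel='rbf', C=1.0, gamma='scale',
          class_weight={1: 1.0, -1: r_im})   # biased penalties, cf. CSVM
clf.fit(X, y)
```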


3. Fuzzy support vector machine and membership functions

3.1. Fuzzy support vector machine

Unlike conventional SVM, FSVM treats the belongingness of samples to a class as fuzzy in its objective function. Assume a binary classification problem with the training sample set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i$ denotes the feature vector of the $i$th sample and $y_i \in \{1, -1\}$ refers to its corresponding class; $m$ denotes the cardinality of the set. The objective function of Lin et al.'s FSVM is given in the following:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} s_i \xi_i$$
$$\text{subject to } y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\ 0 \le s_i \le 1,\ \forall i = 1, 2, \ldots, m \tag{1}$$

Its corresponding dual form can be expressed by

$$\max_{a}\ \sum_{i=1}^{m} a_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j a_i a_j \langle x_i, x_j\rangle$$
$$\text{subject to } 0 \le a_i \le s_i C,\ \forall i = 1, 2, \ldots, m,\quad \sum_{i=1}^{m} y_i a_i = 0 \tag{2}$$

where $a = [a_1, a_2, \ldots, a_m]^T$ is the vector of Lagrange multipliers. Note that samples are never accessed directly, but only in the form of inner products in Eq. (2). This property facilitates using the kernel trick to solve the dual problem and allows SVM to flexibly obtain a nonlinear classification boundary in a higher-dimensional space. Therefore, the dot product $\langle x_i, x_j\rangle$ in Eq. (2) is usually replaced with $K(x_i, x_j)$ so that a kernel function $K(\cdot)$ can be used to solve the dual problem. $s_i$ denotes the MV of the $i$th sample and characterizes its belongingness to its own class. It can be inferred that a training sample's MV takes effect only when the sample is misclassified. Therefore, to reduce the impact of possible outliers or class noises, we should set their MVs smaller so that their penalty cost terms become low when they are misclassified, which prevents the final classifier from overfitting to them. Unlike CSVM, FSVM lets different samples make different contributions to the formation of the classifier, as shown in the definition of its objective function, which enables it to deal effectively with outliers with the help of well-defined MFs. Accordingly, defining a suitable heuristic for MFs becomes a key issue for improving the generalization performance of FSVM.
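As a rough illustration of how the per-sample memberships $s_i$ of Eqs. (1)–(2) can be plugged into an existing solver, the sketch below uses scikit-learn's sample_weight argument, which scales the penalty C per sample. This is an approximation of FSVM rather than the authors' implementation, and the membership values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def fit_fsvm_like(X, y, memberships, C=1.0, gamma=1.0):
    """Approximate FSVM: the per-sample membership s_i scales the penalty to C*s_i.

    X           : (m, d) training features
    y           : (m,) labels in {+1, -1}
    memberships : (m,) values s_i in (0, 1]
    """
    clf = SVC(kernel='rbf', C=C, gamma=gamma)
    # sample_weight multiplies C for each sample, mimicking the C*s_i term
    # in the FSVM primal objective (Eq. (1)).
    clf.fit(X, y, sample_weight=np.asarray(memberships))
    return clf

# Example call with placeholder memberships (all ones reduces to plain SVM).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = fit_fsvm_like(X, y, memberships=np.ones(len(y)))
```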

3.2. Available MFs and their limitations

Here, some classical informative heuristics for general-purpose MFs (GPMFs) defined in previous studies (Sevakula & Verma, 2017) are provided as follows:

$$\mu^{cen}_{lin}(x_i) = 1 - \frac{d_i^{cen}}{r + \Delta} \tag{3}$$
$$\mu^{cen}_{exp}(x_i) = \frac{2}{1 + \exp\left(\beta d_i^{cen}\right)},\quad \beta \in [0, 1] \tag{4}$$
$$\mu^{hyp}_{lin}(x_i) = 1 - \frac{d_i^{hyp}}{\max\left(d_i^{hyp}\right) + \Delta} \tag{5}$$
$$\mu^{hyp}_{exp}(x_i) = \frac{2}{1 + \exp\left(\beta d_i^{hyp}\right)},\quad \beta \in [0, 1] \tag{6}$$

where $\mu(x_i)$ denotes the MV assigned to the $i$th training sample and its value ranges from 0 to 1. $d_i^{cen}$ denotes the distance of the $i$th training sample to its own class center, where the class center is determined by taking the mean of all samples in the class. $r$ represents the radius of the class, which is defined as $r = \max(d_i^{cen})$. Thus, according to the definition of Eq. (3), we can conclude that if a training sample is far from the class center, its belongingness to its own class is reduced. This MF was originally defined in Lin and Wang (2002) and was later used in Batuwita and Palade (2010). Eq. (4) was designed in a similar manner to Eq. (3), the only difference being that the MVs in Eq. (3) decay linearly with distance, while in Eq. (4) they decay exponentially. For Eq. (5) and Eq. (6), an initial separation hyperplane is generated by conventional SVM, and $d_i^{hyp}$ denotes the absolute value of the functional margin of the $i$th training sample from this initial separation hyperplane. This design is based on the notion that training samples closer to the separation hyperplane are more informative than others and thus should be assigned higher MVs. In Eq. (5), the MVs decay linearly with $d_i^{hyp}$, while in Eq. (6) they decay exponentially. $\Delta$ is a small value greater than 0 so that $\mu(x_i)$ is always greater than zero, and $\beta$ is used to control the extent of exponential decay in the MFs. A small code sketch of these classical MFs is given after this discussion.

As mentioned previously, informative heuristics for MFs like $\mu^{cen}_{exp}(x_i)$ and $\mu^{hyp}_{exp}(x_i)$ have been demonstrated empirically to obtain good results. However, they still suffer from some serious limitations, detailed in the following.

1. The MFs $\mu^{cen}_{lin}(x_i)$ and $\mu^{cen}_{exp}(x_i)$ heavily depend on the center of the class. Both MFs are defined based on the assumption that a training sample's belongingness to its own class is higher when it is closer to the center of the class. This works well when the class contains only one dense partition, but may not work well when there are multiple separable dense partitions with irregularly shaped structure. In addition, the Euclidean distance is commonly used as the distance measure for both MFs, which imposes ellipsoid-shaped clusters on the data. Thus, this works well if the class is shaped like an ellipsoid, but it cannot accommodate data with arbitrarily shaped structures.

2. The notion behind the MFs $\mu^{hyp}_{lin}(x_i)$ and $\mu^{hyp}_{exp}(x_i)$ is that a training sample closer to the initial hyperplane is more informative than those far away from it. Therefore, the training samples closer to the initial hyperplane are assigned higher MVs than others. However, the two MFs are based on the assumption that the initially learned hyperplane is an accurate estimate of the final hyperplane and is not affected by possible outliers or class noises. Unfortunately, this assumption is not satisfied in most cases.

3. These MFs only consider the distance from the class center or from the initial hyperplane, but training samples at equal distances sometimes make different contributions to the formation of the classifier. For example, class noises at the same distance as normal training samples should have lower MVs, as shown in Figs. 1 and 2. This indicates that MVs based on distance alone are not enough to accurately describe the belongingness of a sample to its own class. Such is the case with Sevakula and Verma (2017), which uses the DBSCAN method to determine the densities of the training samples, where the training samples with lower densities are considered possible outliers.
Therefore, we should also take into account the relation of a candidate training sample with other classes when determining its belongingness to its own class. Fan et al. (2017) considered such relations and proposed an entropy-based FSVM, where the relations with other classes are expressed by the proportion of its own class neighbors among its all k-nearest neighbors, often called class probabilities. However, the method cannot consider the factor of the distance such that it fails to eliminate the impact of possible outliers and border samples, producing unsatisfactory results in some imbalanced cases.
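The four classical GPMFs of Eqs. (3)–(6) are simple enough to state directly in code. The sketch below is a literal transcription under the stated definitions (class-center distance with radius r = max d_i^cen, and absolute functional margin from an initial SVM); the default values of delta and beta are the values the paper later uses for GPFSVM and are otherwise placeholders.

```python
import numpy as np

def center_mfs(X, delta=0.2, beta=0.5):
    """Center-distance based MFs, Eqs. (3) and (4)."""
    center = X.mean(axis=0)
    d_cen = np.linalg.norm(X - center, axis=1)
    r = d_cen.max()
    mu_lin = 1.0 - d_cen / (r + delta)            # Eq. (3): linear decay
    mu_exp = 2.0 / (1.0 + np.exp(beta * d_cen))   # Eq. (4): exponential decay
    return mu_lin, mu_exp

def hyperplane_mfs(d_hyp, delta=0.2, beta=0.5):
    """Hyperplane-distance based MFs, Eqs. (5) and (6).

    d_hyp : absolute functional margins of the samples with respect to an
            initial SVM hyperplane, e.g. np.abs(svc.decision_function(X)).
    """
    d_hyp = np.asarray(d_hyp)
    mu_lin = 1.0 - d_hyp / (d_hyp.max() + delta)  # Eq. (5): linear decay
    mu_exp = 2.0 / (1.0 + np.exp(beta * d_hyp))   # Eq. (6): exponential decay
    return mu_lin, mu_exp
```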


Fig. 1. Two scenarios in which MFs based only on distance are insufficient. (a) The class noise has the same distance to the class center (denoted by the circle point) as the normal sample (the two distances are joined by lines), yet the class noise should have a lower MV than its counterpart sample. (b) The class noise has the same distance to the initial hyperplane trained by SVM as the normal sample (joined by lines), but the class noise should have a lower MV than its counterpart sample.

Fig. 2. The affinities of the samples with distances larger than R on the artificial dataset.

4. It should be noted that the clustering techniques and other heuristics discussed here only use the data distribution properties in the input space. The validity of assigning MVs using data properties in the input space and later performing classification in kernel space may be questioned. We believe that estimating the training samples' MVs based on their locations in the input space is unreasonable for a kernel SVM, which is also the reason for applying a kernel-version SMOTE for the kernel SVM classifier in Mathew et al. (2018).

4. Fuzzy support vector machine based on affinity and class probability

In the previous section, we discussed some scenarios in which existing MF strategies exhibit insufficiencies. In order to deal with those drawbacks, we propose a novel MV determination approach based on affinity and class probability. Then, by incorporating the affinity and class probability-based membership into the objective function, the corresponding FSVM-based classification model is presented.

4.1. Determination of affinity based on support vector data description

As discussed before, most MFs based on distance rely on an ellipsoid-shaped cluster assumption, and the clustering methods are usually performed in the input space, so that the estimated MVs of training samples fail to accurately express their belongingness to their own classes. In order to accommodate arbitrarily shaped clusters and to use the properties of the data in the same kind of space as the kernel SVM used later, we apply support vector data description to model the majority class and calculate the relation of a training sample to the final model as one factor of its MV, called the affinity to its own class. The detailed procedures are presented in the following. In the past few decades, SVM has been reformulated to solve one-class classification tasks. As an extended version of SVM, support vector data description (SVDD) was first proposed by Tax and Duin (2004) for outlier detection. Its main idea is to map the data samples in the input space into a higher-dimensional feature space through a nonlinear transformation and to find a minimum hypersphere enclosing most of the mapped data samples in the feature space. This hypersphere, when mapped back to the input space, can separate into several components, each enclosing a separable, arbitrarily shaped cluster of data samples, which allows it to handle multiple separable dense partitions with irregularly shaped structure and data with arbitrarily shaped structures. More specifically, consider a given training set $X = \{x_1, x_2, \ldots, x_l\}$ in the input space, where $l$ refers to the cardinality of the set. Let $\phi: X \to F$ be the nonlinear mapping function from $X$ to the feature space $F$. Then, the minimum hypersphere enclosing most of those samples' images in the feature space is identified, characterized by its center $c$ and radius $R$. Furthermore, to allow some samples outside of the hypersphere and to obtain a more general hypersphere, penalty terms are added to SVDD's cost function, which relaxes the constraints. The concrete expression of the cost function is as follows:

$$\min_{c,R}\ R^2 + C\sum_{i=1}^{l} \xi_i$$
$$\text{subject to } \|\phi(x_i) - c\|^2 \le R^2 + \xi_i,\quad \xi_i \ge 0,\ \forall i = 1, 2, \ldots, l \tag{7}$$


where $C > 0$ denotes a penalty constant that controls the trade-off between the size of the hypersphere and the number of samples falling outside of the hypersphere. To solve this constrained quadratic optimization problem, we introduce the Lagrangian

$$L(c, R, \xi, \alpha, \beta) = R^2 + C\sum_{i=1}^{l}\xi_i + \sum_{i=1}^{l}\alpha_i\left(\|\phi(x_i) - c\|^2 - R^2 - \xi_i\right) - \sum_{i=1}^{l}\beta_i\xi_i \tag{8}$$

with Lagrange multipliers $\alpha_i \ge 0$ and $\beta_i \ge 0$. Setting the derivatives of $L(c, R, \xi, \alpha, \beta)$ with respect to the primal variables to zero, we obtain:

$$\frac{\partial L}{\partial R} = 2R\left(1 - \sum_{i=1}^{l}\alpha_i\right) = 0 \;\Longrightarrow\; \sum_{i=1}^{l}\alpha_i = 1 \tag{9}$$
$$\frac{\partial L}{\partial c} = -2\sum_{i=1}^{l}\alpha_i\left(\phi(x_i) - c\right) = 0 \;\Longrightarrow\; c = \sum_{i=1}^{l}\alpha_i\phi(x_i) \tag{10}$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Longrightarrow\; C = \alpha_i + \beta_i \tag{11}$$

Substituting (9)–(11) back into (8), we obtain the dual formulation:

$$\max_{a}\ \sum_{i=1}^{l} a_i K(x_i, x_i) - \sum_{i=1}^{l}\sum_{j=1}^{l} a_i a_j K(x_i, x_j)$$
$$\text{subject to } 0 \le a_i \le C,\ \forall i = 1, 2, \ldots, l,\quad \sum_{i=1}^{l} a_i = 1 \tag{12}$$

where $K(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle$ is a kernel function satisfying Mercer's condition. The data samples with $a_i > 0$ are called support vectors (SVs), which lie on the boundary of the hypersphere. To calculate the affinity of a given sample to its own class, its distance to the center of the trained hypersphere needs to be determined, which is expressed by

$$d_i^{svdd} = \|\phi(x_i) - c\|^2 = K(x_i, x_i) - 2\sum_{j=1}^{l}\alpha_j K(x_i, x_j) + \sum_{j=1}^{l}\sum_{k=1}^{l}\alpha_j\alpha_k K(x_j, x_k) \tag{13}$$

$R$ can be computed as the mean distance of all SVs on the boundary from the center of the hypersphere. Subsequently, the affinity of a given training sample is calculated by the following expression:

$$\mu_{affinity}(x_i) = \begin{cases} 0.8 \times \dfrac{d_i^{svdd} - \min_i\left(d_i^{svdd}\right)}{\max_i\left(d_i^{svdd}\right) - \min_i\left(d_i^{svdd}\right)} + 0.2, & d_i^{svdd} < \rho R \\ 0.2 \times \exp\!\left(\gamma\left(1 - d_i^{svdd}/(\rho R)\right)\right), & d_i^{svdd} \ge \rho R \end{cases} \tag{14}$$

where $\min_i(d_i^{svdd})$ denotes the minimum distance of all training samples in the same class, while $\max_i(d_i^{svdd})$ denotes the maximum distance of all training samples in the same class. $\rho \in (0, 1]$ refers to a trade-off parameter that controls the proportion of possible outliers and border samples. This parameter is also related to the imbalance ratio of the majority class to the minority class. If the imbalance ratio is large, $\rho$ should be set smaller so that more outliers and border samples are assigned small affinities, which makes those samples have little influence on the formation of the hyperplane and thus skews the final hyperplane more toward the majority class. $\gamma > 0$ is a decay weight parameter that controls the extent of the decay with $d_i^{svdd}$.

In order to intuitively illustrate the affinity determination strategy based on SVDD, we take an artificial dataset as an example, which contains 172 normal majority class samples together with 8 abnormal ones, and 20 minority class samples. Specifically, these abnormal majority class samples include 5 outliers and 3 class noises. Because the minority class is more important than the majority class when dealing with imbalanced datasets, to guarantee the contribution of the minority class to learning, we usually assign a relatively large constant MV to the minority class samples and only calculate different MVs for the majority class samples. Thus, we only present the affinity results of the majority class samples here. For SVDD, the penalty constant C is set to 1 and the Gaussian kernel with width 1 is selected as the kernel function. The trade-off parameter $\rho$ is set to 1 and the decay weight parameter $\gamma$ is set to 10. The final affinities for the majority class samples are shown in Fig. 2. The circle points denote the majority class samples, and the solid line refers to the final SVDD boundary trained on all majority class samples, with the SVs circled in black. From the results, it can be seen that the 5 outliers and 3 class noises are assigned lower affinities than the other samples, so that the contribution of these abnormal samples to the formation of the classifier is significantly reduced, thus eliminating the effect of outliers or class noises. In addition, some border samples are also assigned relatively low affinities, which is beneficial to skewing the final hyperplane toward the majority class and consequently improves the generalization performance, especially for the minority class.
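A minimal sketch of the affinity computation is given below. Since a dedicated SVDD solver is not assumed to be available, it uses scikit-learn's OneClassSVM with an RBF kernel as a stand-in (with an RBF kernel the one-class SVM and SVDD duals coincide up to a scaling of the coefficients), normalizes the dual coefficients so they sum to one, and then evaluates Eqs. (13)–(14). The choice of OneClassSVM, taking R as the mean support-vector distance, comparing distances rather than squared distances with rho*R, and all hyperparameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

def svdd_affinity(X_maj, gamma=1.0, nu=0.05, rho=1.0, decay=10.0):
    """Affinity of majority samples following Eqs. (13)-(14), via an RBF one-class SVM."""
    oc = OneClassSVM(kernel='rbf', gamma=gamma, nu=nu).fit(X_maj)
    sv = oc.support_vectors_
    alpha = oc.dual_coef_.ravel()
    alpha = alpha / alpha.sum()                   # normalize so sum(alpha) = 1

    # Squared distance to the center c = sum_j alpha_j * phi(sv_j), Eq. (13);
    # K(x, x) = 1 for the RBF kernel.
    K_x_sv = rbf_kernel(X_maj, sv, gamma=gamma)
    K_sv_sv = rbf_kernel(sv, sv, gamma=gamma)
    const = alpha @ K_sv_sv @ alpha
    d2 = 1.0 - 2.0 * K_x_sv @ alpha + const

    # Radius R: mean distance of the support vectors to the center (approximation).
    d2_sv = 1.0 - 2.0 * K_sv_sv @ alpha + const
    R = np.sqrt(np.maximum(d2_sv, 0)).mean()

    # Piecewise affinity of Eq. (14).
    d = np.sqrt(np.maximum(d2, 0))
    dmin, dmax = d.min(), d.max()
    inside = d < rho * R
    aff = np.empty_like(d)
    aff[inside] = 0.8 * (d[inside] - dmin) / (dmax - dmin + 1e-12) + 0.2
    aff[~inside] = 0.2 * np.exp(decay * (1.0 - d[~inside] / (rho * R)))
    return aff
```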

4.2. Determination of class probability based on kernel k-nearest neighbors

As described in Section 3, training samples at equal distances are likely to make different contributions to the formation of the classifier, indicating that MVs based on distance alone are not enough to accurately describe the belongingness of a sample to its own class. In order to make a further discrimination, we also take into account the relation of a candidate training sample with other classes when determining its belongingness to its own class. Here, we use the kernel k-nearest neighbor (kernelknn) method to express the samples' class probabilities with respect to their own classes, which allows us to use the data properties in the same space as the kernel SVM. More specifically, consider the given training sample set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i$ denotes the feature vector of the $i$th sample and $y_i \in \{+1, -1\}$ refers to its corresponding class. For a candidate sample $x_i$, we first select its k-nearest neighbors $\{x_{i1}, x_{i2}, \ldots, x_{ik}\}$ in the same kernel space as the SVDD and the kernel SVM. The corresponding pairwise distance measure in the kernel space is defined as

$$\left(d_{ij}^{kernel}\right)^2 = K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j) \tag{15}$$

Then we count the number of samples among these k selected neighbors that belong to its own class. Assume that this number is $num_i^{own}$. Eventually, the class probability of $x_i$ is calculated as

$$p_i = \frac{num_i^{own}}{k} \tag{16}$$

A larger $p_i$ indicates a higher probability that $x_i$ belongs to its own class, and vice versa. In order to intuitively illustrate the strategy of class probability determination based on kernelknn, we adopt an artificial dataset similar to the one used before as an example. The Gaussian kernel function with width 1 is


Fig. 3. The class probabilities of samples with their values smaller than 1 on the artificial datasets.

selected as the kernel function, which guarantees that the kernelknn method is performed in the same kernel space as that used by SVDD. The number of nearest neighbors k is set to 7. The calculated class probabilities for the majority class samples with $p_i < 1$ are shown in Fig. 3. From the result, we can see that the outliers and class noises are assigned relatively lower class probabilities than the other, normal samples, indicating that these samples will have little impact on the formation of the classifier. On the contrary, the samples far away from the counterpart class are assigned a class probability of 1 and are not affected by the class probability term. In addition, because the minority class is more important than, and much smaller than, the majority class when dealing with imbalanced datasets, we assign relatively large MVs to the minority class samples, e.g., 1, to guarantee their contribution to learning, while we assign MVs to the majority class training samples according to their affinities and class probabilities. Based on these facts, we propose a combined formulation of the MF based on affinity and class probability, which is expressed by

$$\mu(x_i) = \begin{cases} 1, & x_i \in \text{minority class} \\ \mu_{affinity}(x_i) \cdot p_i, & x_i \in \text{majority class} \end{cases} \tag{17}$$

Fig. 4. The membership values of samples with class probabilities smaller than 1 or distances larger than R in the artificial dataset.

Fig. 5. The membership values of all majority samples in the artificial dataset with threshold value equal to 0.2 and k value equal to 7.

The results of combining affinity and class probability as MVs for the majority class samples in the above artificial dataset are shown in Fig. 4. From the results, we can clearly see that the MVs assigned to the outliers and class noises are further reduced, while the other samples are not affected by the class probability term, indicating that the class probability can further discriminate between normal and abnormal samples, especially class noises. As a special case, the MV assigned to one of the class noises is set to 0, as shown in Fig. 4. Finally, we show the final MVs for all majority class samples in the above artificial dataset in Fig. 5. Unlike other existing approaches that usually set the MVs of most majority class samples to the constant 1, the MV assigned to each majority class sample is set according to its affinity and class probability, which is more beneficial to skewing the final hyperplane toward the majority class, like an under-sampling method, thus improving the generalization performance in dealing with imbalanced datasets.
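To summarize Section 4.2 in code, the sketch below computes the kernel-space k-nearest-neighbor class probability of Eq. (16) with the distance of Eq. (15) and combines it with the SVDD affinity as in Eq. (17). The function svdd_affinity refers to the illustrative sketch given after Eq. (14); all parameter values are placeholders and the whole listing is an assumption-laden illustration rather than the authors' code.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_knn_class_probability(X_maj, X_all, y_all, k=7, gamma=1.0, maj_label=1):
    """Class probability p_i of each majority sample, Eqs. (15)-(16).

    Assumes X_maj is a subset of X_all, so each sample's own entry (distance 0)
    is skipped when selecting the k nearest neighbors.
    """
    # Squared kernel-space distance, Eq. (15): for RBF, K(x,x) = 1, so d^2 = 2 - 2K.
    d2 = 2.0 - 2.0 * rbf_kernel(X_maj, X_all, gamma=gamma)
    probs = np.empty(len(X_maj))
    for i, row in enumerate(d2):
        nn = np.argsort(row)[1:k + 1]                 # skip the sample itself
        probs[i] = np.mean(y_all[nn] == maj_label)    # Eq. (16)
    return probs

def acfsvm_memberships(X_maj, X_min, affinities, probs):
    """Combined membership values, Eq. (17): affinity * p for majority, 1 for minority."""
    s_maj = affinities * probs
    s_min = np.ones(len(X_min))
    return s_maj, s_min
```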

4.3. Computational complexity analysis

Here, we discuss the computational complexity of the MV determination strategy based on affinity and class probability. The entire operation of the proposed MFs can be divided into two parts: affinity calculation and class probability estimation. According to the defined distance to the center of the SVDD model, the time complexity of calculating the affinities of all majority class samples is $O(l \times n_{SVs})$, where $l$ denotes the number of majority class samples and $n_{SVs}$ is the number of SVs in the trained SVDD model. Since we use kernelknn to determine the class probabilities and its time complexity is $O(k \times m \times l)$, the time complexity of the class probability estimation is also $O(k \times m \times l)$, where $k$ is the number of nearest neighbors, and $m$ and $l$ denote the numbers of all training samples and of majority samples, respectively. In addition, the time complexity involved in calculating the maximum and minimum of the affinities is $2O(l)$. Thus, the whole computational complexity of the MF determination approach becomes $O(l \times n_{SVs}) + O(k \times m \times l) + 2O(l)$. Since $n_{SVs} < k \times m$, it comes out to be $O(k \times m \times l)$.

4.4. FSVM based on affinity and class probability for imbalanced datasets

By incorporating the MVs based on affinity and class probability described above into FSVM, we propose a novel affinity and class probability-based FSVM for imbalanced datasets.


Its objective function is reformulated into the following expression:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C^{+}\sum_{i=1,\, y_i=+1}^{m} s_i^{+}\xi_i + C^{-}\sum_{i=1,\, y_i=-1}^{m} s_i^{-}\xi_i$$
$$\text{subject to } y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\quad s_i^{+} = \mu\left(x_i^{+}\right),\ s_i^{-} = \mu\left(x_i^{-}\right) = 1,\quad \xi_i \ge 0,\ \forall i = 1, 2, \ldots, m \tag{18}$$

where $y_i = +1$ indicates that the sample $x_i$ belongs to the majority class, while $y_i = -1$ indicates that $x_i$ belongs to the minority class. $s_i^{+}$ and $s_i^{-}$ are the final MVs assigned to the $i$th majority class sample and the $i$th minority class sample, respectively. $\mu(x_i^{+})$ lies in the range $[0, 1]$ and reflects the importance of the $i$th majority class sample within its own class for classification. To ensure the contribution of the minority class samples, $s_i^{-}$ is constantly set to 1. Let $r_{im}$ represent the imbalance ratio between the majority class and the minority class; we empirically set $C^{-} = C^{+} \cdot r_{im}$, which has previously been shown to be effective for improving classification accuracy (Akbani, Kwek, & Japkowicz, 2004). The detailed flowchart of the proposed method is shown in Fig. 6.
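Putting the pieces together, a minimal end-to-end sketch of the training step of Eq. (18) is given below: majority samples receive weight C+ * s_i+ and minority samples C- * s_i- with C- = C+ * r_im, realized through scikit-learn's per-sample weights. It relies on the illustrative helpers svdd_affinity and kernel_knn_class_probability sketched earlier, assumes the memberships are passed in the order of the majority rows of X, and is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_acfsvm(X, y, s_maj, C_plus=1.0, gamma=1.0, maj_label=1, min_label=-1):
    """Train the weighted SVM of Eq. (18).

    s_maj : membership values of the majority samples (affinity * class probability),
            ordered consistently with the rows of X where y == maj_label.
    """
    r_im = np.sum(y == maj_label) / np.sum(y == min_label)   # imbalance ratio
    C_minus = C_plus * r_im                                   # C- = C+ * r_im

    weights = np.empty(len(y))
    weights[y == maj_label] = C_plus * np.asarray(s_maj)      # C+ * s_i+
    weights[y == min_label] = C_minus * 1.0                   # s_i- = 1

    clf = SVC(kernel='rbf', C=1.0, gamma=gamma)
    clf.fit(X, y, sample_weight=weights)   # effective per-sample penalty = C * weight
    return clf
```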

5. Experimental results and analyses

5.1. Performance evaluation for imbalanced classification

Evaluation measures play a key role in assessing the classification performance of a classification model. The traditional accuracy-based evaluation measure, which is the most commonly used one, is no longer a proper measure for the classification of imbalanced data, since the minority class has very little effect on the accuracy compared to the majority class. However, in imbalanced classification problems, the correct classification of instances of the minority class is more important than the contrary case. For example, in a disease diagnostic problem where the disease cases are usually quite rare compared with the normal population, the recognition goal is to detect people with the disease. Hence, a favorable classification model is one that provides a high identification rate on the disease category. To quantitatively evaluate the classification performance on imbalanced problems, several alternative classification evaluation measures have been defined. Most of those evaluation measures depend on the confusion matrix illustrated in Table 1, where the columns are the predicted class and the rows are the true class.

Table 1
Confusion matrix.

                        Predict positive class    Predict negative class
True positive class     TP                        FN
True negative class     FP                        TN

In the confusion matrix, TP (True Positives) is the number of positive instances (minority class) correctly classified, FN (False Negatives) is the number of positive instances incorrectly classified as negative (majority class), FP (False Positives) is the number of negative instances incorrectly classified as positive, and TN (True Negatives) is the number of negative instances correctly classified. If only the performance of the minority (positive) class is considered, two measures are important: Sensitivity and Precision. Sensitivity, also called the positive class accuracy, is defined as the ratio of True Positives (TP) to the number of all positive instances. Precision is defined as the ratio of True Positives (TP) to the number of all instances predicted as positive. F-Measure is suggested to integrate these two measures into an average; in principle, F-Measure is the harmonic mean of Sensitivity and Precision. The harmonic mean of two numbers tends to be closer to the smaller of the two, so a high F-Measure value ensures that both Sensitivity and Precision are reasonably high. When the performance of both classes is of concern, an additional measure called Specificity is required, which represents the ratio of True Negatives (TN) to the number of all negative instances. G-Mean is suggested as the balanced performance between the two classes and is defined as the geometric mean of Sensitivity and Specificity. If the G-Mean value is high, both the True Positive Rate (Sensitivity) and the True Negative Rate (Specificity) are expected to be high simultaneously. The five performance evaluation measures are defined as follows:

Sensitivity of the positive class (Sensitivity): Sensitivity = TP/(TP + FN)

Precision of the positive class (Precision): Precision = TP/(TP + FP)

Specificity of the negative class (Specificity): Specificity = TN/(TN + FP)

Geometric mean accuracy (G-Mean): $G = \sqrt{\text{Sensitivity} \cdot \text{Specificity}}$

F-Measure index of the positive class: $F = \dfrac{2 \times \text{Sensitivity} \times \text{Precision}}{\text{Sensitivity} + \text{Precision}}$
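The measures above are straightforward to compute from a confusion matrix; a small sketch follows, using scikit-learn only for the confusion matrix and (optionally) the AUC discussed next. The positive class is taken to be the minority class, and the default labels (+1 majority, -1 minority, as in Eq. (18)) are an assumption to be adjusted as needed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score=None, pos_label=-1, neg_label=1):
    """Sensitivity, Specificity, Precision, G-Mean, F-Measure and, if scores are given, AUC.

    y_score, if supplied, must be a score that is larger for the positive (minority) class.
    """
    tn, fp, fn, tp = confusion_matrix(
        y_true, y_pred, labels=[neg_label, pos_label]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    metrics = {
        'Sensitivity': sensitivity,
        'Specificity': specificity,
        'Precision': precision,
        'G-Mean': np.sqrt(sensitivity * specificity),
        'F-Measure': (2 * sensitivity * precision / (sensitivity + precision)
                      if (sensitivity + precision) else 0.0),
    }
    if y_score is not None:
        metrics['AUC'] = roc_auc_score(
            (np.asarray(y_true) == pos_label).astype(int), y_score)
    return metrics
```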

The Area Under the Receiver Operating Characteristic curve (AUC) is another widely used evaluation metric for the performance of classifiers, especially in imbalanced dataset scenarios. It refers to the area under the ROC graph and is not sensitive to the distribution of the two classes. The ROC graph is obtained by plotting the True Positive Rate (Sensitivity) over the False Positive Rate (1 − Specificity). In this study, F-Measure, G-Mean, and AUC are used as the performance measures to compare the different methods.

5.2. Experiments on the artificial imbalanced dataset

In order to intuitively demonstrate the effectiveness of the proposed FSVM based on affinity and class probability (ACFSVM) in dealing with outliers and class noises, we carried out classification experiments on a two-dimensional imbalanced artificial dataset similar to the one used previously, comparing against traditional SVM, the entropy-based FSVM (EFSVM) proposed in Fan et al. (2017), and the general purpose function-based FSVM (GPFSVM) presented in Sevakula and Verma (2017). The dataset contains 200 majority class instances and 200 minority class instances. In order to eliminate randomness, we employ 10-fold cross-validation to obtain the experimental results and then calculate their average values. The imbalance ratio is set to 10:1. Specifically, before training, we split the majority class into 10 folds; 9 folds are then used as the training majority data, with the remaining fold as the testing majority data. In addition, in order to validate the robustness to outliers and class noises, we add 5 outliers and 3 class noises to the training majority data for each split. The 20 training minority data are generated by random selection from all minority data, and the rest are used for testing. Because all compared approaches are SVM-based classifiers, we set the same parameters for them for the convenience of comparison.


Fig. 6. The flowchart of the proposed method.

Table 2
G-Mean, F-Measure, and AUC metric values of different algorithms on both training and testing sets of the artificial dataset (bold denotes the best result in each column).

Algorithms   Training                           Testing
             G-Mean   F-Measure   AUC           G-Mean   F-Measure   AUC
SVM          0.9773   0.9535      0.9612        0.9158   0.8517      0.8951
EFSVM        0.9875   0.9777      0.9739        0.9526   0.9551      0.9523
GPFSVM       0.9831   0.9779      0.9813        0.9611   0.9672      0.9531
ACFSVM       0.9817   0.9815      0.9897        0.9867   0.9835      0.9775

Specifically, the Gaussian kernel with width 0.5 is selected as the kernel function, and the penalty constant for the majority class is set to 1 according to 5-fold stratified cross-validation results, while for the minority class the penalty constant is set to 9 according to the imbalance ratio. For EFSVM and the proposed approach, the number of nearest neighbors k for calculating the class probability of each majority class sample is set to 7. For GPFSVM, $\Delta$ is set to 0.2 and $\beta$ is set to 0.5. For the proposed ACFSVM, the trade-off parameter $\rho$ is set to 1 and the decay weight parameter is set to 10. The classification boundaries trained by the different methods in one training run are shown in Fig. 7. In addition, the averaged G-Mean, F-Measure, and AUC values are presented in Table 2. From the results shown in Fig. 7, we can see that the classification boundary of traditional SVM skews toward the minority class, affected by several abnormal training majority samples and by the imbalanced training data, which leads to poor classification accuracy, especially for the minority class. From the results obtained by EFSVM, it can be seen that the final classification boundary moves slightly away from the minority class and effectively avoids the impact of the class noises. Unfortunately, since it only considers the class probabilities of samples and ignores their distance (density) information in the membership function, EFSVM fails to eliminate the effect of outliers, and thus part of the obtained classification boundary still skews toward the minority class. In contrast to EFSVM, although GPFSVM incorporates the distance information of samples in its membership function, it does not take the class probabilities into consideration, which leads to its failure to avoid the influence of class noises and to partial skewing of the final classification boundary around the class noises, as shown in Fig. 7(c). As expected, compared to the other algorithms, the proposed approach uses not only the distance information of samples but also their class probabilities in the membership function, so that ACFSVM can effectively avoid the effect of outliers and class noises and thus produces the most reasonable classification boundary. In terms of classification performance, as shown in Table 2, the proposed ACFSVM outperforms SVM, EFSVM, and GPFSVM in G-Mean, F-Measure, and AUC for both training and testing samples, except for the G-Mean values on the training samples. Although EFSVM obtains the best results in terms of G-Mean on the training samples, it performs poorly in terms of all metrics on the testing samples compared to the proposed ACFSVM, which indicates that EFSVM fails to avoid overfitting in the training phase due to the influence of outliers and class noises. This result further confirms that the proposed affinity and class probability-based FSVM is effective in dealing with imbalanced datasets, especially with outliers and class noises.

5.3. Experimental configurations on UCI datasets

In order to evaluate the performance of the proposed ACFSVM approach in dealing with imbalanced dataset classification tasks, we use 27 different benchmark datasets selected from the UCI Machine Learning Repository; a detailed description of the datasets is listed in Table 3. Datasets with more than two classes were converted into bi-class datasets by means of a one-versus-others strategy, where one of the classes with a relatively small size is labeled as the minority class and the remaining classes are merged into the majority class. In addition, in order to obtain imbalanced datasets with higher imbalance ratios, we select two UCI datasets with high dimensions and large sizes, namely Pageblock and Yeast, to generate additional imbalanced datasets through different class combinations. The detailed class combinations regarding these generated datasets are indicated in the Category column of Table 3.

Table 3
Description of the experimental datasets.

Dataset      Attributes   Minority/Majority   Category           Imbalance ratio
Wine         13           48/130              1: others          1:2.71
Iris         4            50/100              2: others          1:2.00
German       24           300/700             B: G               1:2.33
Abalone      8            67/259              16: 6              1:3.86
Pima         8            268/500             1: 0               1:1.87
Ecoli        7            52/284              'pp': others       1:5.46
Libra        90           24/336              15: others         1:14.00
Vehicle      18           199/647             'van': others      1:3.25
Balance      4            49/576              'B': others        1:11.76
Haberman     3            81/225              2: 1               1:2.78
Car          6            65/1663             'v-good': others   1:25.58
Liver        6            145/200             1: 2               1:1.38
Seed         7            70/140              1: others          1:2.00
Spect        22           55/212              1: 2               1:3.85
Pageblock    10           560/4913            others: 1          1:8.77
Pageblock1   10           329/5144            2: others          1:15.63
Pageblock2   10           115/4913            5: 1               1:42.72
Pageblock3   10           231/4913            3,4,5: 1           1:21.27
Yeast        8            429/1055            2: others          1:2.46
Yeast1       8            244/1240            3: others          1:5.08
Yeast2       8            163/1321            4: others          1:8.10
Yeast3       8            51/463              5: 1               1:9.07
Yeast4       8            35/429              7: 2               1:12.25
Yeast5       8            20/463              9: 1               1:23.15
Yeast6       8            51/1433             5: others          1:28.09
Yeast7       8            44/1440             6: others          1:32.72
Yeast8       8            35/1449             7: others          1:41.40


Fig. 7. Decision boundaries obtained by SVM and FSVMs with different membership values. (a) Traditional SVM. (b) EFSVM with membership values determined only by class probabilities. (c) GPFSVM with membership values determined only by distances. (d) ACFSVM with membership values determined by the proposed approach.

5.4. Influences of parameters on the performance of the proposed approach In the proposed MFs based on the affinity and class probability, there are three parameters to be pre-specified, which include the penalty constant C for the SVDD model, the trade-off parameter ρ for the calculation of affinities and the number of nearest neighbors k for the class probabilities. In addition, since the Gaussian kernel K xi , xj = exp(−

(

)

(xi −xj )2 2σ 2

) is usually selected as the kernel

function for all kernel methods in the proposed approach, the width parameter σ for Gaussian kernel is also a key factor for improving the classification performance. In order to investigate the effect of different C , ρ , k and σ values on the classification performance of the proposed ACFSVM, we conducted the imbalanced classification experiments on four datasets selected from the ones listed in Table 3, which contain Pima, Yeast, Haberman and German. In order to eliminate the randomness, we employ 10-fold cross validation to obtain the experimental results and then calculate their average values under different parameters settings. In order to avoid no minority class instance available for training, before training, we only split the majority class into 10 folds and then 9 folds are used as training majority data with the remaining one fold as the testing majority data. Subsequently, to significantly observe the change of classification performance with different parameter settings, the minority class is randomly selected with the ratio of 10:1 relative to the number of training majority data to generate high imbalance ratio scenario. Finally,

Finally, the selected minority samples are used as the training minority data and the remaining minority samples as the testing minority data. G-Mean and F-Measure are selected as the cross-validation criteria, since G-Mean is the only criterion that considers all values in the confusion matrix and thus provides a more reliable overall measure, while F-Measure captures the change of classification performance with respect to the minority class.

5.4.1. Influences of C for SVDD model on the performance

In order to discuss the impact of the penalty constant C of the SVDD model on the classification performance of the proposed approach, we carried out classification experiments on the four selected datasets using ACFSVM with different C values ranging from 0.3 to 200. The averaged F-Measure and G-Mean values obtained by 10-fold cross validation under different C values are shown in Fig. 8(a) and (b), respectively. Note that when discussing the effect of C, the other parameters are set to the best ones determined by 10-fold cross validation. Specifically, for the Pima dataset, the trade-off parameter ρ and the number of nearest neighbors k are set to 1 and 7, respectively, and the Gaussian width is set to 3. For the Yeast dataset, ρ and k are set to 1 and 7, respectively, and the Gaussian width is set to 2. For the Haberman and German datasets, the Gaussian widths are 1 and 10, respectively, and the other parameters are the same as in the above cases.
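For reference, the two cross-validation criteria used throughout this section can be computed from the confusion matrix as follows. This is a minimal sketch using the standard definitions (G-Mean as the geometric mean of sensitivity and specificity, F-Measure as the harmonic mean of precision and recall with the minority class as the positive class); it may differ in minor details from the exact implementation used in the experiments.

```python
import numpy as np

def gmean_fmeasure(y_true, y_pred, pos=1):
    """G-Mean = sqrt(sensitivity * specificity); F-Measure = harmonic mean of
    precision and recall, with the minority class treated as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == pos) & (y_pred == pos))
    fn = np.sum((y_true == pos) & (y_pred != pos))
    fp = np.sum((y_true != pos) & (y_pred == pos))
    tn = np.sum((y_true != pos) & (y_pred != pos))
    sens = tp / (tp + fn) if tp + fn else 0.0      # recall on the minority class
    spec = tn / (tn + fp) if tn + fp else 0.0      # recall on the majority class
    prec = tp / (tp + fp) if tp + fp else 0.0
    gmean = np.sqrt(sens * spec)
    fmeas = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return gmean, fmeas
```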


Fig. 8. (a) F-Measure metrics and (b) G-Mean metrics obtained by ACFSVM with respect to different C values of SVDD model on pima, yeast, haberman and german datasets.

From the results shown in Fig. 8, it can be seen that the F-Measure and G-Mean values do not fluctuate significantly with the change of C and that the classification performance reaches its best when C lies around 10. Since the final penalty parameter of the SVDD model is set to C/r_im, the best final penalty constant should be 1 for all the datasets, which is consistent with our expectation. The experimental results indicate that the penalty constant C has little effect on the classification performance of our proposed approach and can usually be set so that the final penalty constant is 1, without the need to consider the effect of the imbalance ratio.

5.4.2. Influences of ρ on the performance

In order to investigate the influence of the trade-off parameter ρ on the performance, we conducted classification experiments on the four selected datasets with different ρ values from 0.5 to 20. The averaged F-Measure and G-Mean values obtained by 10-fold cross validation are shown in Fig. 9. For all datasets, the penalty constant for the SVDD model is set to 10 and the other parameter settings are the same as in the above cases. From the results, we can find that when ρ is smaller than 1, although the F-Measure values reach their best, the G-Mean values are nearly 0, which indicates that the final classification boundary severely skews toward the majority class due to the small MVs assigned to almost all training majority samples, leading to the misclassification of the majority samples. When ρ is 1, the classification performance of our proposed approach reaches its best in terms of both F-Measure and G-Mean, and it decreases slightly as ρ increases further. The experimental results demonstrate that a too small ρ value results in severe skewing of the hyperplane toward the majority class, while a too large ρ value means that the MFs play no role in FSVM. Therefore, ρ should be set to around 1 for superior classification performance.

5.4.3. Influences of k on the performance

The F-Measure and G-Mean values with respect to different k values ranging from 1 to 21 are shown in Fig. 10(a) and (b), respectively. It can be found that the F-Measure and G-Mean values do not change significantly with the modification of k. This is mainly because the possibility of majority samples locating within the minority class region is quite low due to the rare minority training samples available. The experimental results also substantiate that, for imbalanced classification problems, the impact of border samples and possible outliers in the majority class on the classification performance is significantly larger than that of class noises, indicating that the affinity factors should dominate in the MVs.
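To make the role of k concrete, the following sketch shows one common way to estimate the class probability of each majority sample via kernel k-NN in the same Gaussian kernel space, namely as the fraction of its k nearest neighbors (under the kernel-induced distance) that carry the majority label. The function, its arguments and the −1/+1 label convention are illustrative assumptions, and the exact formulation used in the paper may differ.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_knn_class_probability(X_maj, X_all, y_all, k=7, sigma=2.0):
    """Estimate, for each majority sample, the fraction of its k nearest
    neighbors (in the Gaussian-kernel-induced distance) that share the
    majority label (-1). Assumes X_maj is a subset of X_all."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    K_mm = rbf_kernel(X_maj, X_maj, gamma=gamma)
    K_ma = rbf_kernel(X_maj, X_all, gamma=gamma)
    K_aa = rbf_kernel(X_all, X_all, gamma=gamma)
    # squared feature-space distance: K(x,x) - 2 K(x,z) + K(z,z)
    d2 = np.diag(K_mm)[:, None] - 2.0 * K_ma + np.diag(K_aa)[None, :]
    y_all = np.asarray(y_all)
    probs = np.empty(len(X_maj))
    for i in range(len(X_maj)):
        neighbors = np.argsort(d2[i])[1:k + 1]   # index 0 is the sample itself
        probs[i] = np.mean(y_all[neighbors] == -1)
    return probs
```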

5.4.4. Influences of Gaussian width on the performance

Since the Gaussian kernel is the most widely used kernel function, we employ it as the kernel function for all kernel methods. In order to discuss the effect of the Gaussian width on the classification performance of our proposed approach, we carried out classification experiments on the four selected datasets under different Gaussian widths varying from 0.1 to 20. The corresponding experimental results are shown in Fig. 11. From the results, we can find that the F-Measure and G-Mean values fluctuate significantly with different Gaussian widths on all datasets, indicating that the kernel parameter is still a key factor affecting the classification performance of our proposed approach. Specifically, the best classification performance for the Pima, Yeast, Haberman and German datasets is achieved when the Gaussian width reaches 3, 2, 1 and 10, respectively.

5.5. The performance comparison of different approaches on UCI imbalanced datasets

5.5.1. Experimental procedure and results

In order to further verify the effectiveness of our proposed approach in handling imbalanced datasets with different imbalance ratios, we performed comparative experiments on all datasets listed in Table 3 with the following imbalanced classification techniques: (1) SVM, (2) cost-sensitive SVM (CSVM), (3) Hessian regularized support vector machine (HesSVM), (4) random under-sampling based SVM (RUSVM), (5) random over-sampling based SVM (ROSVM), (6) SMOTE over-sampling based SVM (SMOTESVM), (7) BSMOTE over-sampling based SVM (BSMOTESVM), (8) WKSMOTE over-sampling based SVM (WKSMOTESVM), (9) AdaBoost SVM ensemble (AdaSVM), (10) OWASVM proposed in Maldonado et al. (2018), (11) GPFSVM, (12) EFSVM, (13) EFLSSVM-CIL, (14) EFLSTWSVM-CIL. Conventional SVM serves as the baseline method. CSVM is the most representative algorithm-level imbalanced classification method. HesSVM is a state-of-the-art manifold regularization semi-supervised classification algorithm proposed in Tao et al. (2013), which has been shown to perform better than other such methods. RUSVM and ROSVM are used for comparison due to their prevalence in imbalanced classification applications and their occasional effectiveness in special cases. SMOTE is regarded as the most popular over-sampling method, as evidenced by the large number of citations of the original paper. BSMOTE is an improved SMOTE variant which only generates synthetic samples among borderline minority instances.


Fig. 9. (a) F-Measure metrics and (b) G-Mean metrics obtained by ACFSVM with respect to different ρ values on pima, yeast, haberman and german datasets.

Fig. 10. (a) F-Measure metrics and (b) G-Mean metrics obtained by ACFSVM with respect to different k values on pima, yeast, haberman and german datasets.

Fig. 11. (a) F-Measure metrics and (b) G-Mean metrics obtained by ACFSVM with respect to different Gaussian widths on pima, yeast, haberman and german datasets.

Similarly, WKSMOTE is a kernel-based SMOTE variant whose generated synthetic samples are more consistent with the subsequently used kernel SVM model, since the feature space it operates in is similar to that of the kernel SVM. AdaSVM is a hybrid ensemble method which uses the AdaBoost strategy to improve the classification performance of the base classifier on imbalanced datasets. Finally, GPFSVM, EFSVM, OWASVM, EFLSSVM-CIL and EFLSTWSVM-CIL are selected since they are the most widely used fuzzy variants of SVM and are the most closely related to the proposed method. For all comparative imbalanced classification methods, we applied SVM as the subsequent classifier after data preprocessing.

To avoid the effect of parameter settings on the performance comparison, the parameters of all compared methods are optimized using 5-fold stratified cross-validation on the training dataset based on the G-Mean performance measure. For SMOTE, BSMOTE and WKSMOTE, the number of nearest minority neighbors (NN) found for each minority instance to generate synthetic instances is selected among the values (3, 5, 7, 9). Similarly, the number of nearest neighbors (k) used to identify borderline instances for BSMOTE is also selected among the values (3, 5, 7, 9). For WKSMOTE, the other parameters are set the same as those advised in Mathew et al. (2018).


For CSVM, the penalty constant weights for the majority class and the minority class are set inversely proportional to the imbalance ratio. For GPFSVM, the maximum ∆ is set to 0.2 and β is set to 0.5. The maximum iteration number is set to 15 for the ensemble SVM (AdaSVM) so as to avoid overfitting the minority class due to the effect of different cost values (Sevakula & Verma, 2017). For EFSVM, EFLSSVM-CIL and EFLSTWSVM-CIL, the K value is selected from the set {3, 5, 7, 9, 11}. For the proposed approach, the C for the SVDD model is set to the imbalance ratio of each imbalanced dataset according to the previous experimental results, so that the final penalty constant C/r_im remains 1. Similarly, the trade-off parameter ρ for the calculation of affinities and the number of nearest neighbors k for the class probabilities are set to 1 and 7, respectively. For all data preprocessing methods, the training data are first balanced and the SVM is then trained on the balanced data. For the base SVM classifier, the penalty constant C and σ are selected by grid search from the sets {10^−3, 10^−2, 10^−1, 10^0, 10^1, 10^2, 10^3} and {2^−5, 2^−3, 2^−2, 2^−1, 2^0, 2^1, 2^2, 2^3}, respectively. In order to eliminate randomness, 5-fold stratified cross validation was used and each experiment was repeated 3 times to report the averaged metric values and their standard deviations.

Table 4 shows the experimental results of all compared techniques on the 27 datasets. The best measures are highlighted in bold. From the experimental results shown in Table 4, we can clearly find that traditional SVM produces lower G-Mean, F-Measure and AUC metrics than the other techniques on all used imbalanced datasets, especially on highly imbalanced ones such as Pageblock, Pageblock1, Pageblock2, Pageblock3 and Yeast6, indicating that the classification performance of traditional SVM deteriorates in dealing with imbalanced data due to the extreme domination of the majority class. More specifically, for SVM, influenced by the imbalanced data during training, the proportion of majority class training samples around the margin between the two classes is much larger than that of minority class training samples, so that majority support vectors outnumber minority support vectors in the final SVM model. According to the definition of the final decision function of SVM, the hyperplane then skews toward the minority class or even over the region of the minority class, thus producing poor generalization capability. According to the results shown in Table 4, the performance of manifold regularization SVM algorithms such as HesSVM is not significantly improved when dealing with imbalanced datasets, since they are developed mainly to address semi-supervised issues rather than imbalanced issues. Compared to conventional SVM, CSVM, as one of the most representative cost-sensitive algorithms, obtains relatively good classification performance on some datasets. However, it only pre-specifies two fixed cost-sensitive penalty constants for the two classes and neglects the different contribution of each sample to the formation of the classifier, which leads to no significant improvement on most datasets. Although RUSVM and ROSVM can obtain relatively higher G-Mean, F-Measure and AUC metrics compared to traditional SVM on some datasets such as Pageblock, Pageblock1 and Pageblock2, they show poor classification performance on other datasets such as Yeast3, Yeast4 and Yeast5. This is primarily due to the fact that RUSVM and ROSVM adopt a random sampling technique.
Concretely, if the under-sampled or over-sampled samples are located around the boundary and are informative for learning, the obtained classification boundary tends to shift toward the majority class, producing good generalization performance on the imbalanced problems, as on the Pageblock1 dataset. Otherwise, good generalization performance cannot be expected, as on the Yeast3 and Yeast4 datasets. In summary, the generalization performances of RUSVM and ROSVM are significantly affected by the uncertainty of the under-sampled or over-sampled instances.


Unlike ROSVM, SMOTESVM and BSMOTESVM belong to heuristic over-sampling techniques, and hence their performance is superior to traditional SVM, RUSVM and ROSVM on most datasets. However, like RUSVM and ROSVM, SMOTE and BSMOTE also rely considerably on the distribution of the minority class training samples. Sometimes, an inappropriate k-nearest neighbor value can also cause the generation of noisy samples that penetrate the majority class region, eventually decreasing the classification performance of the obtained model. As an advanced SMOTE variant, WKSMOTE sufficiently considers the consistency of the feature space with the subsequent SVM and introduces different weighting factors, so that the classification performance is significantly improved on most of the imbalanced datasets, as indicated in Table 4. As an ensemble method, AdaSVM obtains relatively good performance on some imbalanced datasets due to the effect of the boosting scheme, indicating that the boosting strategy can improve the classification performance on imbalanced datasets with the help of iteratively weighted sampling. From the results shown in Table 4, we can also find that OWASVM does not show good classification performance on most imbalanced datasets and even performs poorly relative to conventional SVM. This is because the distance calculated from the first-phase trained hyperplane usually tends to be inaccurate due to the impact of the imbalanced data on that hyperplane, which eventually deteriorates the classification accuracy of the second-phase trained model. Although GPFSVM, EFSVM and its two improved versions, EFLSSVM-CIL and EFLSTWSVM-CIL, overcome the disadvantages of OWASVM by introducing MVs, they still face the inconsistency between the space used for MV determination and the space used for the final classifier. In addition, not considering the impact of possible outliers and border samples in the majority class also restricts further improvement of their classification performance. From the comparison between GPFSVM and EFSVM, we also find that GPFSVM generally outperforms EFSVM on some highly imbalanced datasets. This is because EFSVM adopts class probabilities based on k-nearest neighbors to discriminate the noisy samples, and its estimated class probabilities become inaccurate when only a few minority samples exist, which deteriorates its classification performance. Compared with the other imbalanced classification techniques, the proposed ACFSVM shows the best classification performance on nearly all datasets in terms of the G-Mean, F-Measure and AUC metrics. In addition, the standard deviations of the evaluation metrics obtained by the proposed method are relatively small on all imbalanced datasets except Yeast5, which indicates that the proposed approach possesses good stability. This is because the proposed approach considers not only the influence of possible outliers and border samples but also the class noises, which allows the final hyperplane to skew toward the majority class and improves the classification accuracy, especially for the minority class, as shown in Table 4. In addition, all kernel spaces used for determining the MVs and for classification, including the SVDD model for affinities, the kernel k-NN for class probabilities, and the kernel SVM, are similar to each other. This strategy appears to be more reasonable and more effective than the other MFs for fuzzy SVM.
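As a small illustration of the two points just made (the shared kernel space and the way memberships enter the classifier), the sketch below builds one Gaussian kernel matrix, reuses it for a one-class description of the majority class and for the final weighted SVM, and passes membership values as per-sample weights, which scales the penalty of each sample to m_i·C exactly as in the fuzzy SVM formulation. The toy data, the random placeholder memberships and the use of scikit-learn's OneClassSVM as a stand-in for the SVDD model are assumptions made only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM, SVC

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))      # majority class, label -1
X_min = rng.normal(2.5, 0.8, size=(20, 2))       # minority class, label +1
X = np.vstack([X_maj, X_min])
y = np.r_[-np.ones(200), np.ones(20)]

sigma = 1.0
K = rbf_kernel(X, X, gamma=1.0 / (2.0 * sigma ** 2))   # shared kernel matrix

# One-class description of the majority class in the same kernel space
# (OneClassSVM is only a convenient stand-in for the SVDD model here).
maj = y == -1
describer = OneClassSVM(kernel='precomputed', nu=0.05)
describer.fit(K[np.ix_(maj, maj)])

# Placeholder memberships: minority samples get 1, majority samples get values
# in (0, 1]; in ACFSVM these would come from the affinity/class-probability MF.
membership = np.where(y == 1, 1.0, rng.uniform(0.5, 1.0, size=len(y)))

# Weighted SVM in the same kernel space: sample_weight scales C per sample.
clf = SVC(C=10.0, kernel='precomputed')
clf.fit(K, y, sample_weight=membership)
print('support vectors per class:', clf.n_support_)
```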
In order to evaluate the classification performance of the different methods across multiple datasets, we calculate and analyze the mean rankings of the performance measures for the different methods on these datasets instead of directly comparing the obtained performance measures, following Jian et al. (2016). Fig. 12 shows the mean rankings of our proposed method and the other compared ones on the 27 datasets in terms of G-Mean, F-Measure and AUC, respectively. For each dataset, the best performing method is assigned a ranking of 1 while the worst performing method is assigned a ranking of 15.
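A minimal sketch of how such mean rankings can be computed from a matrix of per-dataset scores is given below; the variable names are illustrative, and ties are handled with average ranks as is usual for Friedman-type analyses.

```python
import numpy as np
from scipy.stats import rankdata

def mean_rankings(scores):
    """scores: (n_datasets, n_methods) array of one performance measure,
    higher is better. Returns the average rank of each method, where the
    best method on a dataset receives rank 1."""
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
    return ranks.mean(axis=0)

# Example with 3 datasets and 4 methods (G-Mean-like scores).
scores = np.array([[0.81, 0.84, 0.86, 0.90],
                   [0.70, 0.75, 0.74, 0.78],
                   [0.92, 0.91, 0.95, 0.96]])
print(mean_rankings(scores))   # [3.67 3.   2.33 1.  ]
```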

Table 4
Results of different imbalanced classification methods on all used datasets. [G-Mean, F-Measure and AUC values (mean ± standard deviation) of SVM, CSVM, HesSVM, RUSVM, ROSVM, SMOTESVM, BSMOTESVM, WKSMOTESVM, AdaSVM, OWASVM, GPFSVM, EFSVM, EFLSSVM-CIL, EFLSTWSVM-CIL and the proposed ACFSVM on the 27 datasets; the best measures are highlighted in bold. Detailed per-dataset entries omitted.]

From the results in Fig. 12, we can find that conventional SVM produces greater mean rankings in terms of the G-Mean and F-Measure metrics than the other techniques, indicating that the classification performance of traditional SVM deteriorates severely under the influence of the imbalanced data.


In addition, HesSVM, RUSVM and OWASVM obtain higher mean rankings relative to the other compared algorithms or even conventional SVM, which shows that HesSVM, RUSVM and OWASVM are not well suited to dealing with imbalanced datasets. As we know, HesSVM was originally developed to address semi-supervised issues and is thus not suitable for imbalanced classification applications.


Fig. 12. Mean ranking of all compared imbalanced classification techniques on all tested datasets.

Because RUSVM belongs to the under-sampling techniques, the removal of samples often causes the loss of important information in the majority class, thus deteriorating the classification performance. Since OWASVM utilizes the distances to the initially trained hyperplane as weights, and this hyperplane is usually trained inaccurately on imbalanced datasets, the final model risks a loss of classification performance due to the inaccurately estimated weights. Similarly, CSVM cannot show good performance when dealing with imbalanced datasets due to its application dependence and the effect of improper misclassification cost values. In addition, compared to RUSVM, all over-sampling methods such as ROSVM, SMOTESVM, BSMOTESVM and WKSMOTESVM obtain better results in the mean rankings of all three performance metrics. Among them, WKSMOTESVM performs better due to the consistency of the feature space and the introduction of different weights. All weighted methods except OWASVM obtain lower mean rankings, indicating that the introduction of MFs is beneficial for improving the classification performance of SVM on imbalanced datasets. Compared to the other weighted SVM methods, the proposed method significantly outperforms them with regard to the mean rankings of all evaluation metrics. This is primarily because the proposed approach considers not only the influence of possible outliers and border samples but also the class noises, and thus its classification boundary can skew toward the majority class. In addition, the space consistency is another crucial factor contributing to the classification performance improvement. In order to test whether the differences in mean rankings among the different methods are merely a matter of chance, we perform the Friedman test followed by Holm's test to verify the statistical significance of the proposed method compared with the other imbalanced classification methods with regard to the calculated mean rankings.

The null hypothesis is that all compared methods perform similarly in mean rankings without significant difference. After the Friedman test, the p-values for the three performance measures are 1.615e−32, 7.157e−25 and 1.063e−14, respectively, all much smaller than the significance level of α = 0.05, which indicates that there is sufficient evidence to reject the hypothesis and thus the compared methods do not perform similarly. Since the null hypothesis is rejected for all performance measures, a post-hoc test is applied to make pairwise comparisons between the proposed method and the other imbalanced classification methods. Holm's test was used in this study, with the proposed method treated as the control method. Holm's test is a non-parametric equivalent of multiple t-tests that adjusts α to compensate for multiple comparisons in a step-down procedure. The null hypothesis is that the proposed method, as the control algorithm, does not perform better than each of the other methods. Table 5 shows the adjusted α values and the corresponding p-values for each comparison. From the results, we can find that the null hypothesis is rejected for all pairwise comparisons at a significance level of α = 0.05 except for WKSMOTESVM regarding G-Mean, indicating that the proposed method outperforms the other compared methods with significant difference. Although the p-value of the pairwise comparison with WKSMOTESVM regarding G-Mean, 0.056044, is higher than 0.05, the proposed algorithm still outperforms it with weak predominance according to the above results.
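For reference, the statistical procedure described above can be sketched as follows: the Friedman test is applied across the methods, and Holm's step-down procedure then compares the ordered pairwise p-values against increasingly lenient thresholds α/m, α/(m−1), …, α. This sketch uses SciPy and the standard formulas; the exact adjusted α values reported in Table 5 may come from a slightly different variant of the adjustment, and the pairwise p-values are assumed to be given.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def holm_stepdown(p_values, alpha=0.05):
    """Holm's step-down test: returns the per-comparison alpha threshold and
    whether each null hypothesis is rejected."""
    p_values = np.asarray(p_values, dtype=float)
    m = len(p_values)
    order = np.argsort(p_values)          # smallest p-value first
    thresholds = np.empty(m)
    reject = np.zeros(m, dtype=bool)
    still_rejecting = True
    for i, j in enumerate(order):
        thresholds[j] = alpha / (m - i)   # alpha/m, alpha/(m-1), ..., alpha
        if still_rejecting and p_values[j] <= thresholds[j]:
            reject[j] = True
        else:
            still_rejecting = False       # once a test fails, accept the rest
    return thresholds, reject

# Friedman test over a (n_datasets, n_methods) score matrix, then Holm's test
# on illustrative pairwise p-values (the last one mirrors the 0.056044 case).
scores = np.array([[0.81, 0.84, 0.90],
                   [0.70, 0.75, 0.78],
                   [0.92, 0.91, 0.96],
                   [0.66, 0.69, 0.73],
                   [0.88, 0.86, 0.93]])
stat, p_friedman = friedmanchisquare(*[scores[:, j] for j in range(3)])
print(p_friedman, holm_stepdown([0.001, 0.02, 0.056044]))
```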

Table 5
Results of Holm's test with the proposed method as the control algorithm.

Performance metric: G-Mean
Classification methods    α0.05       p-value
SVM                       0.0039379   3.982e−46
CSVM                      0.0051162   2.5989e−26
HesSVM                    0.0063912   2.7746e−20
RUSVM                     0.0042653   2.4646e−35
ROSVM                     0.0063912   5.2008e−20
SMOTESVM                  0.016952    2.9502e−14
BSMOTESVM                 0.005683    1.8704e−20
WKSMOTESVM                0.05        0.056044
AdaSVM                    0.0085124   1.3859e−16
OWASVM                    0.0046522   1.0268e−34
GPFSVM                    0.025321    5.6146e−05
EFSVM                     0.0073008   6.455e−17
EFLSSVM-CIL               0.012741    1.1981e−14
EFLSTWSVM-CIL             0.010206    6.9442e−15

Performance metric: F-Measure
Classification methods    α0.05       p-value
SVM                       0.0046522   4.352e−24
CSVM                      0.0051162   9.2694e−24
HesSVM                    0.005683    6.9687e−19
RUSVM                     0.0042653   1.3919e−24
ROSVM                     0.005683    7.2919e−23
SMOTESVM                  0.0063912   1.2321e−16
BSMOTESVM                 0.016952    1.3656e−10
WKSMOTESVM                0.05        0.032203
AdaSVM                    0.0073008   1.4555e−16
OWASVM                    0.0039379   3.9662e−28
GPFSVM                    0.025321    4.7474e−05
EFSVM                     0.0085124   4.1621e−13
EFLSSVM-CIL               0.012741    1.244e−11
EFLSTWSVM-CIL             0.010206    2.1602e−12

Performance metric: AUC
Classification methods    α0.05       p-value
SVM                       0.0073008   1.3629e−08
CSVM                      0.0051162   1.6799e−11
HesSVM                    0.005683    8.4886e−11
RUSVM                     0.0042653   8.9426e−18
ROSVM                     0.010206    2.9915e−08
SMOTESVM                  0.016952    1.7845e−06
BSMOTESVM                 0.0051162   9.9704e−12
WKSMOTESVM                0.025321    0.00092169
AdaSVM                    0.012741    5.1953e−08
OWASVM                    0.0039379   8.8793e−23
GPFSVM                    0.05        0.0017438
EFSVM                     0.0063912   1.3252e−09
EFLSSVM-CIL               0.0085124   1.3629e−08
EFLSTWSVM-CIL             0.0046522   7.6682e−12

6. Conclusion

In this paper, a novel affinity and class probability-based FSVM (ACFSVM) approach has been proposed for imbalanced datasets classification tasks. In ACFSVM, the SVDD model is first trained only on the given majority samples, and the corresponding formulation of affinity with respect to the trained SVDD model is presented to calculate a different affinity for each majority class sample. This strategy can effectively identify the possible outliers and border samples existing in the majority class. In order to further avoid the effect of class noises in the majority class, we use the kernel k-NN technique to determine the class probability of each majority class sample, which is combined with its corresponding affinity as its membership value for the fuzzy SVM. When dealing with imbalanced classification problems with rare minority data available, the proposed ACFSVM usually assigns relatively low MVs to possible abnormal majority samples based on their corresponding affinities and class probabilities, while assigning high MVs to the rare minority samples, which allows the final classification boundary to skew toward the majority class and thus produces more satisfactory classification results. The experimental results on the synthetic datasets show that the proposed ACFSVM approach can obtain a better classification boundary than the SVM, EFSVM and GPFSVM methods. In addition, the extensive experimental results on UCI imbalanced datasets with different imbalance ratios demonstrate that the proposed ACFSVM approach obtains higher classification accuracy than the other existing imbalanced classification methods in terms of the G-Mean, F-Measure and AUC metrics, with good robustness. Finally, we want to emphasize that the proposed method can easily be integrated into any other SVM variant, such as kernel least squares SVMs, since it essentially belongs to a wrapper framework. In addition, having discussed the effect of the Gaussian kernel parameter on the performance in this study, we found that it has a significant impact on the classification performance, which has also been demonstrated by previous studies. Better classification accuracy would be expected if the optimal parameters were found. Therefore, further efforts can still be made to potentially improve the classification accuracy by tuning the kernel parameters of the ACFSVM algorithm when dealing with imbalanced datasets.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities no. 2572017EB02, 2572017CB07, Innovative talent fund of Harbin science and technology Bureau (No. 2017RAXXJ018), Double first-class scientific research foundation of Northeast Forestry University (411112438). References Abe, S. (2015). Fuzzy Support vector machines for multilabel classification. Pattern Recognition, 48(6), 2110–2117. http://dx.doi.org/10.1016/j.patcog.2015. 01.009. Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In Lecture notes in computer science: machine learning: ECML 2004, proceedings, Vol. 3201 (pp. 39-50). Alonso-Ayuso, A., Escudero, L. F., Guignard, M., & Weintraub, A. (2018). Risk management for forestry planning under uncertainty in demand and prices. European Journal of Operational Research, 267(3), 1051–1074. http://dx.doi. org/10.1016/j.ejor.2017.12.022. Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE-Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405–425. http://dx. doi.org/10.1109/TKDE.2012.232. Batuwita, R., & Palade, V. (2010). FSVM-CIL: Fuzzy support vector machines for class imbalance learning. IEEE Transactions on Fuzzy Systems, 18(3), 558–571. http://dx.doi.org/10.1109/TFUZZ.2010.2042721. Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-levelsmote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics): vol. 5476, Lecture notes in computer science (pp. 475–482). Chaudhuri, A. (2014). Modified fuzzy support vector machine for credit approval classification. AI Communication, 27(2), 189–211. http://dx.doi.org/10.3233/ AIC-140597. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. Devi, D., Biswas, S. K., & Purkayastha, B. (2017). Redundancy-driven modified tomek-link based undersampling: A solution to class imbalance. Pattern Recognition Letters, 93, 1–12. http://dx.doi.org/10.1016/j.patrec.2016.10.006. Dheeba, J., Jaya, T., & Singh, N. A. (2017). Breast cancer risk assessment and diagnosis model using fuzzy support vector machine based expert system. Journal of Experimental and Theoretical Artificial Intelligence, 29(5), 1011–1021. http://dx.doi.org/10.1080/0952813X.2017.1280088. Du, J., Vong, C. M., Pun, C. M., Wong, P. K., & Ip, W. F. (2017). Post-boosting of classification boundary for imbalanced data using geometric mean. Neural Networks, 96, 101–114. http://dx.doi.org/10.1016/j.neunet.2017.09.004. E, J. Q., Qian, C., Zhu, H., Peng, Q. G., Zuo, W., & Liu, G. L. (2017). Parameteridentification investigations on the hysteretic preisach model improved by the fuzzy least square support vector machine based on adaptive variable chaos immune algorithm. Journal of Low Frequency Noise Vibration and Active Control, 36(3), 227–242. http://dx.doi.org/10.1177/0263092317719634. Fan, Q., Wang, Z., Li, D. D., Gao, D. Q., & Zha, H. Y. (2017). Entropy-based fuzzy support vector machine for imbalanced datasets. Knowledge-Based Systems, 115, 87–99. http://dx.doi.org/10.1016/j.knosys.2016.09.032.


Galar, M., Fernández, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting, and hybrid-based approaches. IEEE Transactions on Systems Man and Cybernetics Part C-Applications and Reviews, 42(4), 463–484. http://dx.doi.org/ 10.1109/TSMCC.2011.2161285. Gautam, S. K., & Om, H. (2018). Intrusion detection in RFID system using computational intelligence approach for underground mines. International Journal of Communication Systems, 31(8), http://dx.doi.org/10.1002/dac.3532. Gupta, D., & Richhariya, B. (2018). Entropy based fuzzy least squares twin support vector machine for class imbalance learning. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies, 48(11), 4212–4231. http://dx.doi.org/10.1007/ s10489-018-1204-4. Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new oversampling method in imbalanced data sets learning. In Lecture notes in computer science: advances in intelligent computing, Pt 1, proceedings, Vol. 3644 (pp. 878-887). Hang, J., Zhang, J. Z., & Cheng, M. (2016). Application of multi-class fuzzy support vector machine classifier for fault diagnosis of wind turbine. Fuzzy Sets and Systems, 297, 128–140. http://dx.doi.org/10.1016/j.fss.2015.07.005. Hart, P. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14, 515–516. Hassan, M. M., Huda, S., Yearwood, J., Jelinek, H. F., & Almogren, A. (2018). Multistage fusion approaches based on a generative model and multivariate exponentially weighted moving average for diagnosis of cardiovascular autonomic nerve dysfunction. Information Fusion, 41, 105–118. http://dx.doi. org/10.1016/j.inffus.2017.08.004. He, H., Bai, Y., Garcia, E., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE international joint conference on neural networks (pp. 1322–1328). IEEE. Hsu, W. C., Lin, L. F., Chou, C. W., Hsiao, Y. T., & Liu, Y. H. (2017). EEG Classification of imaginary lower limb stepping movements based on Fuzzy support vector machine with kernel-induced membership function. International Journal of Fuzzy Systems, 19(2), 566–579. http://dx.doi.org/10.1007/ s40815-016-0259-9. Jian, C. X., Gao, J., & Ao, Y. H. (2016). A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing, 193, 115–122. http://dx.doi.org/10.1016/j.neucom.2016.02.00. Jokar, P., & Leung, V. C. M. (2018). Intrusion detection and prevention for zigbeebased home area networks in smart grids. IEEE Transactions on Smart Grid, 9(3), 1800–1811. http://dx.doi.org/10.1109/TSG.2016.2600585. Khatami, A., Babaie, M., Tizhoosh, H. R., Khosravi, A., Nguyen, T., & Nahavandi, S. (2018). A sequential search-space shrinking using CNN transfer learning and a radon projection pool for medical image retrieval. Expert Systems With Applications, 100, 224–233. http://dx.doi.org/10.1016/j.eswa.2018.01.056. Khemchandani, R., & Pal, A. (2018). Chandra s. Fuzzy least squares twin support vector clustering. Neural Computing and Applications, 29(2), 553–563. http: //dx.doi.org/10.1007/s00521-016-2468-4. Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: one sided selection. In ICML, 97 (pp. 179–186). Laurikkala, J. (2001). Improving identification of difficult small classes by balancing class distribution. In Conference on Ai in medicine in Europe: Artificial intelligence medicine (pp. 
63–66). Springer-Verlag. Liang, T. M., Xu, X. Z., & Xiao, P. C. (2017). A new image classification method based on modified condensed nearest neighbor and convolutional neural networks. Pattern Recognition Letters, 94, 105–111. http://dx.doi.org/10.1016/ j.patrec.2017.05.019. Lin, C.-F., & Wang, S.-D. (2002). Fuzzy Support vector machines. IEEE Transactions on Neural Networks, 13(2), 462–471. http://dx.doi.org/10.1109/72.991432. Liu, W. F., Ma, X. Q., Zhou, Y. C., Tao, D. P., & Cheng, J. (2018). P-Laplacian regularization for scene recognition. IEEE Transactions on Cybernetics, 1–14. http://dx.doi.org/10.1109/tcyb.2018.2833843. Liu, Z. B., Song, W. A., Zhang, J., & Zhao, W. J. (2017). Classification of stellar spectra with Fuzzy minimum within-class support vector machine. Journal of Astrophysics and Astronomy, 38(2), http://dx.doi.org/10.1007/s12036-0179441-1. Liu, Y., Wang, J., Cai, L. H., Chen, Y. Y., & Qin, Y. M. (2018). Epileptic seizure detection from EEG signals with phase–amplitude cross-frequency coupling and support vector machine. International Journal of Modern Physics B. Condensed Matter Physics. Statistical Physics. Applied Physics., 32(8), http: //dx.doi.org/10.1142/S0217979218500868. Liu, T. P., Zhang, W. T., McLean, P., Ueland, M., Forbes, S. L., & Su, S. W. (2018). Electronic nose-based odor classification using genetic algorithms and Fuzzy support vector machines. International Journal of Fuzzy Systems, 20(4), 1309–1320. http://dx.doi.org/10.1007/s40815-018-0449-8. Liu, W. F., Zhang, L. B., Tao, D. P., & Cheng, J. (2017). Support vector machine active learning by hessian regularization. Journal of Visual Communication and Image Representation , 49, 47–56. http://dx.doi.org/10.1016/j.jvcir.2017. 08.001.

Maldonado, S., Merigo, J., & Miranda, J. (2018). Redefining support vector machines with the ordered weighted average. Knowledge-based Systems, 148, 41–46. http://dx.doi.org/10.1016/j.knosys.2018.02.025. Mamta, & Hanmandlu, M. (2014). A new entropy function and a classifier for thermal face recognition. Engineering Applications of Artificial Intelligence, 36, 269–286. http://dx.doi.org/10.1016/j.engappai.2014.06.028. Mathew, J., Luo, M., Pang, C. K., & Chan, T. L. (2015). Kernel-based SMOTE for SVM classification of imbalanced datasets. In IECON 2015-41ST annual conference of the IEEE industrial electronics society (pp. 1127–1132). IEEE. Mathew, J., Pang, C. K., Luo, M., & Leong, W. H. (2018). Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 29(9), 4065–4076. http://dx.doi.org/10.1109/TNNLS.2017.2751612. Moteghaed, N. Y., Maghooli, K., & Garshasbi, M. (2018). Improving classification of Cancer and mining biomarkers from gene expression profiles using hybrid optimization algorithms and Fuzzy support vector machine. Journal of Medical Signals and Sensors, 8(1), 1–11. Naderian, S., & Salemnia, A. (2017). An implementation of type-2 fuzzy kernel based support vector machine algorithm for power quality events classification. International Transactions on Electrical Energy Systems, 27(5), http: //dx.doi.org/10.1002/etep.2303. Ni, F., He, Y. Z., & Jiang, F. (2017). Fuzzy Support vector machine based on hyperbolas optimized by the quantum-inspired gravitational search algorithm. Turkish Journal Electrical Engineering and Computer Sciences, 25(4), 3073–3084. http://dx.doi.org/10.3906/elk-1604-260. Papadopoulos, H., Kyriacou, E., & Nicolaides, A. (2017). Unbiased confidence measures for stroke risk estimation based on ultrasound carotid image analysis. Neural Computing and Applications, 28(6), 1209–1223. http://dx.doi. org/10.1007/s00521-016-2590-3. Raghuwanshi, B. S., & Shukla, S. (2018). Class-specific extreme learning machine for handling binary class imbalance problem. Neural Networks, 105, 206–217. http://dx.doi.org/10.1016/j.neunet.2018.05.011. Romani, M., Vigliante, M., Faedda, N., Rossetti, S., Pezzuti, L., Guidetti, V., et al. (2018). Face memory and face recognition in children and adolescents with attention deficit hyperactivity disorder: A systematic review. Neuroscience and Biobehavioral Reviews, 89, 1–12. http://dx.doi.org/10.1016/j.neubiorev. 2018.03.026. Sampath, A. K., & Gomathi, N. (2017). Fuzzy-Based multi-kernel spherical support vector machine for effective handwritten character recognition. SadhanaAcademy Proceedings in Engineering Sciences, 42(9), 1513–1525. http://dx.doi. org/10.1007/s12046-017-0706-9. Sevakula, R. K., & Verma, N. K. (2017). Compounding general purpose membership functions for Fuzzy support vector machine under noisy environment. IEEE Transactions on Fuzzy Systems, 25(6), 1446–1459. http://dx.doi.org/10. 1109/TFUZZ.2017.2722421. Tang, B., & He, H. (2015). Kerneladasyn: Kernel based adaptive synthetic data generation for imbalanced learning. In IEEE congress on evolutionary computation (CEC) (pp. 664–671). Tao, D. P., Jin, L. W., Liu, W. F., & Li, X. L. (2013). Hessian regularized support vector machines for mobile image annotation on the cloud. IEEE Transactions on Multimedia, 15(4), 833–844. http://dx.doi.org/10.1109/TMM.2013.2238909. Tao, X. M., Li, Q., Guo, W. J., Ren, C., Li, C. X., Liu, R., et al. (2019). 
Selfadaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification. Information Sciences, 487, 31–56. http: //dx.doi.org/10.1016/j.ins.2019.02.062. Tao, X. M., Li, Q., Ren, C., Guo, W. J., Li, C. X., He, Q., et al. (2019). Realvalue negative selection over-sampling for imbalanced data set learning. Expert Systems with Applications, 129, 118–134. http://dx.doi.org/10.1016/j. eswa.2019.04.011. Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54(1), 45–66. http://dx.doi.org/10.1023/B:MACH.0000008084.60811. 49. Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems Man and Cybernetics, 6, 769–772. Veropoulos, K., Campbell, C., & Cristianini, N. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the international joint conference on AI (pp. 55–60). Wang, C. L., Li, Z. R., Dey, N., Li, Z. C., Ashour, A. S., Fong, S. J., et al. (2018). Histogram of oriented gradient based plantar pressure image feature extraction and classification employing Fuzzy support vector machine. Journal of Medical Imaging and Health Informatics, 8(4), 842–854. http://dx.doi.org/10. 1166/jmihi.2018.2310. Wu, K., & Yap, K. H. (2006). Fuzzy SVM For content-based image retrieval. IEEE Computational Intelligence Magazine, 1(2), 10–16. Yang, X., Song, Q., & Wang, Y. (2007). A weighted support vector machine for data classification. International Journal of Pattern Recognition and Artificial Intelligence, 21(5), 961–976.

Yang, X., Zhang, G., Lu, J., & Ma, J. (2011). A kernel fuzzy c-means clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises. IEEE Transactions on Fuzzy Systems, 19(1), 105–115. Zhang, X., Zhu, C., Wu, H. G., Liu, Z., & Xu, Y. Y. (2017). An imbalance compensation framework for background subtraction. IEEE Transactions on Multimedia, 19(11), 2425–2438. http://dx.doi.org/10.1109/TMM.2017.2701645.


Zhu, J. X., Wright, G., Wang, J., & Wang, X. Y. (2018). A critical review of the integration of geographic information system and building information modelling at the data level. ISPRS International Journal of Geo-Information, 7(2), http://dx.doi.org/10.3390/ijgi7020066. Zuo, Y., & Jia, C. Z. (2017). CaRsite: identifying carbonylated sites of human proteins based on a one-sided selection resampling method. Molecular Biosystems, 13, 2362–2369. http://dx.doi.org/10.1039/c7mb00363c.