A three-way decision ensemble method for imbalanced data oversampling

International Journal of Approximate Reasoning 107 (2019) 1–16
Yuan Ting Yan a,b, Zeng Bao Wu a,b, Xiu Quan Du a,b, Jie Chen a,b, Shu Zhao a,b, Yan Ping Zhang a,b,*

a Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, Anhui Province 230039, PR China
b School of Computer Science and Technology, Anhui University, Hefei, Anhui Province 230601, PR China

Article history: Received 13 March 2018; Received in revised form 29 October 2018; Accepted 12 December 2018; Available online 18 December 2018

Keywords: SMOTE; Three-way decision; CCA; Imbalanced data; Ensemble learning

Abstract: Synthetic Minority Over-sampling Technique (SMOTE) is an effective method for imbalanced data classification, and many variants of SMOTE have been proposed in the past decade. These methods focus mainly on how to select the crucial minority samples, and they implicitly assume that the selection of key minority samples is binary; the cost of key sample selection is therefore seldom considered. To this end, this paper proposes a three-way decision model (CTD) that takes the differences in the cost of selecting key samples into account. CTD first uses the Constructive Covering Algorithm (CCA) to divide the minority samples into several covers. Then, a three-way decision model for key sample selection is constructed according to the density of the covers on the minority samples. Finally, the corresponding thresholds α and β of CTD are obtained from the pattern of the cover distribution on the minority samples, after which the key samples can be selected for SMOTE oversampling. Moreover, to overcome the shortcoming of CCA that randomly selected cover centers may be non-optimal, an ensemble model based on CTD (CTDE) is further proposed to improve the performance of CTD. Numerical experiments on 10 imbalanced datasets show that our method is superior to the comparison methods: by ensembling the three-way decision based key sample selection, the performance of the model is effectively improved compared with several state-of-the-art methods. © 2018 Elsevier Inc. All rights reserved.

1. Introduction

Imbalanced data exists in many practical applications, such as text classification [1], credit card fraud detection [2], identification of malicious harassing phone calls [3] and medical diagnosis [4–6]. The characteristic of an imbalanced dataset is that the number of instances of one class is significantly smaller than that of another [7]. For convenience, the former is known as the minority (or positive) class, and the latter is called the majority (or negative) class. Traditional classification methods tend to improve the recognition rate over all samples, so the recognition rate of the minority class is easily ignored. However, in many cases it is more important to identify the positive samples than the negative ones. For example, in medical diagnosis, if a healthy person is misdiagnosed as a patient, it will bring a mental



✩ This paper is part of the Virtual special issue on Uncertainty in Granular Computing, Edited by Duoqian Miao and Yiyu Yao.
* Corresponding author at: School of Computer Science and Technology, Anhui University, Hefei, Anhui Province 230601, PR China. E-mail addresses: [email protected] (Y.T. Yan), [email protected] (Y.P. Zhang).

https://doi.org/10.1016/j.ijar.2018.12.011
0888-613X/© 2018 Elsevier Inc. All rights reserved.


burden to that person. However, if a patient is misdiagnosed as healthy, he or she may miss the best treatment period, which can have very serious consequences. The cost of misdiagnosing a patient as healthy is much greater than the cost of misdiagnosing a healthy person as a patient. Therefore, how to simultaneously improve the classification accuracy on minority samples and the overall accuracy on imbalanced datasets has become a research hotspot in machine learning.

Current research on imbalanced data classification can be roughly divided into two levels: the algorithm level and the dataset level. For the former, designing more efficient classification algorithms is the mainstream approach; for the latter, the focus is on how to generate a balanced dataset from the original imbalanced one. There is no consensus on which of the two approaches is better, but there are more studies at the dataset level than at the algorithm level. In other words, researchers are more concerned with dataset-level techniques, which include oversampling, under-sampling and hybrids of both. The solution at the dataset level is to balance the distribution of the minority and majority classes [8]. Synthetic Minority Over-sampling Technique (SMOTE) [3] is one of the most representative oversampling techniques. SMOTE effectively avoids the over-fitting problem of random oversampling and improves the learning ability of the downstream classifier trained on the balanced data it generates. However, SMOTE generates the same number of synthetic samples for each minority sample, ignoring the distribution of the nearest neighbors. Overlapping samples and small disjuncts can also be generated by SMOTE, which increases the difficulty of the downstream classification. To overcome these shortcomings, many improved SMOTE algorithms have been proposed in recent years.
Borderline-SMOTE was proposed by Han et al. [9]; it uses KNN to find minority samples near the boundary and generates synthetic samples from them. It overcomes the blindness of SMOTE in selecting minority samples and, to some extent, improves the recognition rate of the minority class. However, the algorithm still does not consider the spatial data distribution. He et al. [10] proposed ADASYN, an adaptive synthetic sampling algorithm. It mainly uses the density distribution as the standard to automatically determine the number of synthetic samples, adaptively changing the weights of different minority samples when generating synthetic samples for each of them. Barua et al. [11] proposed the Majority Weighted Minority Oversampling Technique (MWMOTE). The method first identifies the hard-to-learn informative minority class samples and assigns them weights according to their Euclidean distance from the nearest majority class samples. It then generates synthetic samples from the weighted informative minority samples using a clustering approach, in such a way that all generated samples lie inside some minority-class cluster; this is claimed to avoid synthetic noise samples. However, effectively detecting the hard-to-learn samples is difficult, so the method still cannot select informative samples adequately. Batista et al. [12] proposed an improvement of SMOTE (SMOTE+TomekLink) from a point of view different from the above methods. It uses TomekLink as a data cleaning method applied to the balanced dataset obtained by SMOTE; the TomekLink strategy is to delete nearest-neighbor sample pairs of opposite classes. To some extent, the erroneous samples introduced by SMOTE can thus be removed, improving the performance of the subsequent classification model.
However, this method may delete representative minority samples near the classification boundary of the original dataset. In addition, it is difficult to determine a distance criterion for heterogeneous samples, which affects the stability of the algorithm. The above algorithms improve SMOTE from different points of view, but they are all binary in the selection of key samples, and the relationship between key sample selection and final classification performance has seldom been discussed. When some of the minority samples are relatively concentrated, SMOTE cannot effectively improve the classification performance by oversampling these samples; on the contrary, it increases the complexity of the algorithm. In other words, the cost of key sample selection has received insufficient consideration. To solve these problems, this paper proposes a three-way decision sampling method (CTD) based on the Constructive Covering Algorithm (CCA) [13,14]. CTD first uses CCA to construct covers of the imbalanced data. Then, the covers of minority samples are selected and divided into three regions according to the density of each cover. Finally, the corresponding thresholds α and β are obtained from the cover distribution patterns, after which key samples can be selected for SMOTE oversampling. Taking into account the uncertainty caused by CCA's random choice of cover centers, this paper further proposes an ensemble model (CTDE) based on CTD to improve the performance of the algorithm.

The rest of the paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces our approach. Section 4 presents numerical experiments validating the effectiveness of CTDE, and Section 5 concludes the paper.

2. Preliminary

2.1. SMOTE

SMOTE [3] is a classical oversampling method, which uses linear interpolation to synthesize new samples for the minority class.
First, the Euclidean distance is used to find the k nearest neighbors of each original minority sample. Then, SMOTE uses linear interpolation [15] to synthesize new minority samples between each minority sample and its k minority-class nearest neighbors. Finally, the synthetic samples are added to the original dataset. SMOTE thus reduces the imbalance degree by increasing the number of minority samples in the dataset. The procedure of SMOTE is shown in Table 1.


Table 1. Pseudo-code of SMOTE.

Algorithm 1: SMOTE(T, N, k)
Input: N_atr: number of attributes; T: number of minority class samples; N%: amount of SMOTE; k: number of nearest neighbors; SV[][]: array of original minority class samples.
Output: S[][]: array of synthetic minority class samples.
1.  If N < 100 then
2.    Randomize the T minority class samples;
3.    T = (N/100) * T;
4.    N = 100;
5.  EndIf
6.  N = (int)(N/100);
7.  For m → 1 to T
8.    Find the k nearest neighbors of sample m, save the indices in array M;
9.    Populate(N, m, M);
10. EndFor

11. Procedure Populate(N, m, M)  // Generate the synthetic samples.
12. While N ≠ 0
13.   Choose a random number mm between 1 and k;
14.   For i → 1 to N_atr
15.     Compute: dif = SV[M[mm]][i] − SV[m][i];
16.     Generate a random number g between 0 and 1;
17.     S[j][i] = SV[m][i] + g * dif;  // linear interpolation
18.   EndFor
19.   j++;
20.   N = N − 1;
21. EndWhile
22. End
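As a concrete illustration, the interpolation procedure of Table 1 can be sketched in Python. This is a simplified sketch, not the authors' code: it assumes N is a multiple of 100, represents samples as plain lists, and computes neighbors by brute-force Euclidean distance.

```python
import random

def smote(minority, N, k, rng=None):
    """Generate (N/100) synthetic samples per minority sample by linear
    interpolation toward one of its k nearest minority-class neighbors."""
    rng = rng or random.Random(0)
    n_per_sample = N // 100
    synthetic = []
    for m, x in enumerate(minority):
        # k nearest minority neighbors by squared Euclidean distance
        neighbors = sorted(
            (i for i in range(len(minority)) if i != m),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(minority[i], x)),
        )[:k]
        for _ in range(n_per_sample):
            nn = minority[rng.choice(neighbors)]
            g = rng.random()  # random gap in [0, 1)
            # linear interpolation: s = x + g * (nn - x)
            synthetic.append([a + g * (b - a) for a, b in zip(x, nn)])
    return synthetic
```

Each synthetic sample lies on the segment between a minority sample and one of its minority-class neighbors, which is exactly the step `S[j][i] = SV[m][i] + g * dif` of Table 1.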

2.2. Constructive covering algorithm

The constructive covering algorithm (CCA) [13] can be regarded as a three-layer neural network, realized by constructively covering samples with the same label. Let X = {(x_1, y_1), (x_2, y_2), . . . , (x_p, y_p)} be a given dataset, where x_i = (x_i^1, x_i^2, . . . , x_i^n) is the feature vector and y_i the decision attribute of the i-th sample.

Input layer: a total of n neurons, one per dimension of the sample, i.e., per component of x_i = (x_i^1, x_i^2, . . . , x_i^n). Neurons in this layer are only responsible for receiving external information.

Hidden layer: a total of s neurons. The number of hidden neurons is 0 at the beginning and increases monotonically as CCA constructs covers, until all samples are covered. After that, a set of covers C = {C_1^1, . . . , C_1^{n_1}, C_2^1, . . . , C_2^{n_2}, . . . , C_m^1, . . . , C_m^{n_m}} is obtained, where C_i^j is the j-th cover of the i-th class and corresponds to one neuron of the hidden layer.

Output layer: a total of m neurons. The input of the t-th neuron is the set of covers with the same class label, and its output is the corresponding class label, O_t = (o_1 = 1, . . . , o_t = t, . . . , o_m = m).

CCA is a kind of supervised learning, which can be described as follows [13]:

Step 1. Normalization. Normalize the data to [0, 1].

Step 2. Project the samples to a spherical surface S^{n+1} of dimension n + 1:

T : X → S^{n+1},  T(x) = (x, √(R² − |x|²))    (2.1)

where R ≥ max{|x| : x ∈ X}.

Step 3. Construct hidden neurons. A sample x_k in X is randomly selected as the cover center, that is, as the weight w of a hidden layer neuron. CCA uses the inner product instead of the Euclidean distance to measure the distance between samples. The threshold θ (radius of the cover) corresponding to w is calculated by the following steps.

Inner product:

⟨x_k, x_i⟩ = x_k^1 x_i^1 + · · · + x_k^{n+1} x_i^{n+1},  i ∈ {1, 2, . . . , p}    (2.2)

Calculate the maximum inner product with samples of a different class (equivalent to the minimum distance):

d_1(k) = max{⟨x_k, x_i⟩ | y_i ≠ y_k},  i ∈ {1, 2, . . . , p}    (2.3)

Calculate the minimum inner product among same-class samples whose inner product exceeds d_1(k) (equivalent to the maximum distance):

d_2(k) = min{⟨x_k, x_i⟩ | ⟨x_k, x_i⟩ > d_1(k), y_i = y_k},  i ∈ {1, 2, . . . , p}    (2.4)

Calculate the cover radius:

θ = (d_1(k) + d_2(k)) / 2    (2.5)

A neuron (w, θ) obtained according to (2.2)–(2.5) contains a set of samples with the same class label as x_k, and the samples within the cover are marked as "learned".

Step 4. Remove the samples in (w, θ) and repeat Step 3 until all samples are removed from the space S^{n+1}.

From this process it can be seen that once a sample is covered it no longer participates in subsequent rounds of cover construction, which greatly reduces the computation time of the algorithm.

2.3. Three-way decision

Three-way decision is an extension of the traditional two-way decision [16–23,27,28]. In a two-way decision only two choices, acceptance or rejection, are considered. In practical applications, however, it is often impossible to simply accept or reject because of inaccurate or incomplete information; in such cases a three-way decision is used, often unconsciously. Three-way decision theory was proposed in rough set [24,25] and decision-theoretic rough set [26–33] research to provide a reasonable semantic explanation for the three regions of a rough set: the positive, negative and boundary regions can be interpreted as the results of three decisions, acceptance, rejection and non-commitment. Yao proposed the decision-theoretic rough set (DTRS) model by introducing Bayesian risk minimization into rough set research [16,24,26]. By calculating the risk loss of each classification decision, DTRS finds the decision with minimum expected risk, which is then used to divide the universe into the positive region (POS), the boundary region (BND) and the negative region (NEG). Many scholars have further studied the three-way decision method and its applications [34–47].

Without loss of generality, let Ω = {S, ¬S} be the state set, where S and ¬S are complementary. The action set is A = (a_P, a_B, a_N), where a_P, a_B and a_N denote the actions of assigning an object to the positive region POS(X), the boundary region BND(X) and the negative region NEG(X), respectively.

Table 2. Loss function matrix of three-way decision.

       S       ¬S
a_P    λ_PP    λ_PN
a_B    λ_BP    λ_BN
a_N    λ_NP    λ_NN
The loss function matrix for the actions in the two states is given in Table 2. In Table 2, λ_PP, λ_BP and λ_NP are the losses incurred by taking actions a_P, a_B and a_N, respectively, when the object is in state S; similarly, λ_PN, λ_BN and λ_NN are the losses of the same actions when the object is in state ¬S. From the loss function matrix, a standard derivation gives the DTRS decision rules [16,26,48,49]:

(1) If P(S|X) ≥ α, then decide POS(X);
(2) If β < P(S|X) < α, then decide BND(X);
(3) If P(S|X) ≤ β, then decide NEG(X);

where

α = (λ_PN − λ_BN) / ((λ_PN − λ_BN) + (λ_BP − λ_PP)),
β = (λ_BN − λ_NN) / ((λ_BN − λ_NN) + (λ_NP − λ_BP)).
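These threshold formulas and the three decision rules can be written down directly; the following is a minimal sketch (the function and variable names are illustrative, and the example loss values are hypothetical):

```python
def dtrs_thresholds(l_PP, l_BP, l_NP, l_PN, l_BN, l_NN):
    """Compute the DTRS thresholds (alpha, beta) from the six losses of
    Table 2, assuming the usual ordering l_PP <= l_BP < l_NP and
    l_NN <= l_BN < l_PN (which guarantees 0 <= beta < alpha <= 1)."""
    alpha = (l_PN - l_BN) / ((l_PN - l_BN) + (l_BP - l_PP))
    beta = (l_BN - l_NN) / ((l_BN - l_NN) + (l_NP - l_BP))
    return alpha, beta

def decide(p, alpha, beta):
    """Three-way decision rule for an object with P(S|X) = p."""
    if p >= alpha:
        return "POS"
    if p <= beta:
        return "NEG"
    return "BND"
```

For example, with the hypothetical losses λ_PP = 0, λ_BP = 2, λ_NP = 10, λ_PN = 10, λ_BN = 4, λ_NN = 0, one gets α = 6/8 = 0.75 and β = 4/12 ≈ 0.33, so probabilities above 0.75 are accepted, below 1/3 rejected, and the rest deferred.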

3. Proposed method

3.1. CCA based three-way decision model (CTD)

For imbalanced data classification, the recognition rate of the minority samples and the recognition rate of all samples should be considered simultaneously. Oversampling key minority class samples is one way to improve the accuracy on the minority class, so how to select the key samples is very important. Within the minority class, different samples have different impacts on oversampling performance. Generally, samples on the boundary of the minority class are more critical than those near the minority centers. However, existing oversampling methods only consider the boundary region samples: they assume the selection of key minority samples is binary, and the effect of non-boundary samples on performance is ignored. Actually, when selecting minority samples for imbalanced data oversampling, the samples can be classified into three categories:

(1) Samples with an obvious aggregation effect in the minority class; they have little impact on model performance.

(2) Noise or isolated samples in the minority class, which are far from the aggregated minority samples. Selecting these samples not only fails to improve model performance but actively harms it.


Fig. 1. Schematic diagram of the construction process of CTD.

(3) The minority samples belonging to neither of the above two categories, which play a vital role in improving the recognition of the minority class. They are important in supporting the delineation of the classification boundary, and their efficient selection can effectively improve the downstream classification performance.

Inspired by three-way decision, the three kinds of minority samples above naturally correspond to the three regions of a three-way decision model: the first, second and third categories correspond to the positive region (POS), the negative region (NEG) and the boundary region (BND), respectively. Fig. 1 gives a schematic diagram of the construction process of CTD. From the above description, how to partition the minority class samples and thereby construct the POS, NEG and BND of the three-way decision model is the research priority of this study.

For a given dataset X = {(x_1, y_1), (x_2, y_2), . . . , (x_p, y_p)} containing p samples with n dimensions and m classes, let x_i = (x_i^1, x_i^2, . . . , x_i^n) represent the n-dimensional feature vector and y_i the decision attribute of the i-th sample, i = 1, 2, . . . , p. A set of covers can be obtained by CCA: C = {C_1^1, C_1^2, . . . , C_1^{n_1}, C_2^1, C_2^2, . . . , C_2^{n_2}, . . . , C_m^1, C_m^2, . . . , C_m^{n_m}}. Let C_i = ∪_j C_i^j, where j = 1, 2, . . . , n_i; C_i is the set of covers with class label i, and the three regions are generated from these covers. For convenience, assume the imbalanced dataset has two classes (minority and majority), and let C_0 = (C_0^1, C_0^2, . . . , C_0^{n_0}) and C_1 = (C_1^1, C_1^2, . . . , C_1^{n_1}) be all the covers of the minority class and the majority class, respectively. Let C_0^t be the t-th cover of C_0; the density of C_0^t is defined as follows:

Density(C_0^t) = |C_0^t| / (π · R_0^t · R_0^t),  t ∈ [1, 2, . . . , n_0]    (3.1)

where R_0^t is the radius of C_0^t and |C_0^t| is the cardinality of C_0^t. Let x_i denote the i-th sample in C_0^t, where i ∈ [1, 2, . . . , |C_0^t|], t ∈ [1, 2, . . . , n_0]. Then we have the following definitions.

The positive region (POS) of the minority class:

POS(C_0) = {x_i | x_i ∈ C_0^t ∧ Density(C_0^t) ≥ β}    (3.2)

The negative region (NEG) of the minority class:

NEG(C_0) = {x_i | x_i ∈ C_0^t ∧ Density(C_0^t) ≤ α}    (3.3)

The boundary region (BND) of the minority class:

BND(C_0) = {x_i | x_i ∈ C_0^t ∧ α ≤ Density(C_0^t) ≤ β}    (3.4)
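Definitions (3.1)–(3.4) amount to a simple density-based partition of the minority covers. A minimal sketch (names are illustrative; a cover is assumed to be given as a pair of its sample list and its radius, with dense covers routed to POS first so that the region assignments do not overlap):

```python
import math

def partition_minority_covers(covers, alpha, beta):
    """Split minority samples into POS/NEG/BND by cover density, following
    (3.1)-(3.4): density = |C| / (pi * R * R), with alpha < beta."""
    pos, neg, bnd = [], [], []
    for samples, radius in covers:
        density = len(samples) / (math.pi * radius * radius)
        if density >= beta:        # dense, aggregated cover -> POS (3.2)
            pos.extend(samples)
        elif density <= alpha:     # sparse cover, treated as noise -> NEG (3.3)
            neg.extend(samples)
        else:                      # in-between cover -> BND (3.4), kept for SMOTE
            bnd.extend(samples)
    return pos, neg, bnd
```

Only the samples returned in the BND list are later passed to SMOTE for oversampling.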

We first apply CCA to the imbalanced dataset; then the three regions (POS, NEG and BND) of the minority samples are obtained according to definitions (3.1)–(3.4). The minority samples in the POS region have an obvious aggregation effect and are not selected for oversampling in the CTD model. The minority samples in the NEG region have a high degree of dispersion; the CTD model treats them as noise samples, and they are not selected for oversampling either. The model holds that the minority samples


Table 3. CTD algorithm pseudo-code.

Algorithm 3: CCA based Three-way Decision model (CTD)
Input: Dataset D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}; thresholds α and β; sampling rate Rate.
Output: Sample set S of synthetic minority class samples.
1.  Initialize C = {}, S = {};
2.  For a random sample x_i in D
3.    Construct a cover C_i on x_i;  // x_i is the center of C_i
4.    C ← C_i;
5.  End For
6.  Get the set C_0 = {C_0^1, C_0^2, C_0^3, . . . , C_0^{n_0}} from C;
7.  For each C_0^j in C_0
8.    Compute Density(C_0^j) according to (3.1);
9.  End For
10. Get BND(C_0) according to (3.4);
11. For x_k in BND(C_0)
12.   Synthesize samples using SMOTE with sampling rate Rate;
13.   Obtain the synthetic minority class samples [s_k^1, s_k^2, . . . , s_k^Rate];
14.   S ← [s_k^1, s_k^2, . . . , s_k^Rate];
15. End For
16. Output S.

Fig. 2. The diversity caused by the randomly choosing cover center of CCA.

in the BND region will improve the classifier performance on the dataset when SMOTE sampling is applied to them. Algorithm 3 (see Table 3) gives the CTD algorithm based on CCA.

3.2. CTD ensemble

Ensemble learning is an efficient technique in machine learning [50–54]. Its underlying mechanism is to train multiple classifiers instead of a single one and then combine their learning results. The performance of an ensemble is better than that of a single classifier, and the generalization ability of the learning system can be significantly improved. If a base classifier is regarded as a decision maker, then ensemble learning is equivalent to a decision made jointly by multiple decision makers. The diversity of the base classifiers is a crucial factor for ensemble performance.

As shown in Fig. 2, in subfigure (a) CCA chooses A and D as cover centers and constructs two covers: samples A, B and C belong to the first cover and samples D, E and F to the other. In our method, the samples in these two covers are assigned to the BND region. However, as shown in subfigure (b), if CCA chooses C and E as cover centers, it constructs two covers containing A and C, and D and E, respectively; these two covers are assigned to the BND region, while samples B and F fall into two covers that are assigned to the NEG region of CTD. Since CCA constructs covers by randomly choosing cover centers, different initializations of CCA lead to different sample sets in the BND region. This is exactly what provides diversity among the downstream classifiers trained on the balanced datasets generated by applying SMOTE to the BND sample sets.


Fig. 3. Framework of CTDE.

To this end, we further propose CTDE, a three-way decision ensemble approach that improves the performance of CTD. Fig. 3 gives the framework of the CTDE method. By running CCA T times on the original imbalanced dataset (each time with randomly chosen cover centers), we obtain T BND regions. SMOTE is then applied to obtain a balanced dataset for each BND region. After that, T classifiers are obtained by applying the base learning algorithm to the T balanced datasets, respectively. Finally, majority voting combines the T classifiers as follows:

Y_x = arg max_{k ∈ [1, . . . , C]} Σ_{j : tc_x^j = k} w_j    (3.5)
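This weighted vote, with w_j and tc_x^j as defined in the surrounding text, can be sketched as follows (the function name is illustrative; equal weights reproduce the simple majority used in this paper):

```python
from collections import defaultdict

def majority_vote(predictions, weights=None):
    """Combine the T sub-classifier predictions for one sample x according
    to (3.5); with no weights given, w_j = 1/T (simple majority)."""
    T = len(predictions)
    weights = weights or [1.0 / T] * T
    score = defaultdict(float)
    for pred, w in zip(predictions, weights):
        score[pred] += w  # accumulate w_j over classifiers voting for pred
    return max(score, key=score.get)
```

For example, with five sub-classifiers predicting [1, 0, 1, 1, 0], class 1 accumulates weight 3/5 and wins the vote.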

where k denotes the predicted class value for x, tc_x^j denotes the prediction for sample x by the j-th sub-classifier, and w_j denotes the weight of the j-th sub-classifier. In this paper we consider binary classification, so C = 2. Decisions are made by simple majority, so w_j = 1/T.

4. Experiments and analysis

4.1. Evaluation criteria

For imbalanced data classification, the common criterion "Accuracy" alone is not sufficient to measure the performance of an algorithm. Several more informative evaluation criteria have been proposed, such as Recall, Precision, F-measure and AUC. The results of an imbalanced classification algorithm can be expressed by a confusion matrix. In this article the minority class is defined as the positive class and the majority class as the negative class. The confusion matrix [55] is shown in Table 4.

Table 4. Confusion matrix for the classification problem.

                  Predicted positive   Predicted negative
Positive class    TP                   FN
Negative class    FP                   TN
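All the criteria defined below are derived from these four counts; a minimal sketch (the function name is illustrative, and F-measure is given here with β = 1, i.e. the F1 value used in this paper):

```python
import math

def imbalance_metrics(TP, FN, FP, TN):
    """Evaluation criteria derived from the confusion matrix of Table 4."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)              # TP_rate
    tn_rate = TN / (TN + FP)             # recognition rate of the majority class
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    f1 = 2 * recall * precision / (recall + precision)  # F-measure, beta = 1
    g_mean = math.sqrt(recall * tn_rate)
    return {"Precision": precision, "Recall": recall, "Accuracy": accuracy,
            "F1": f1, "G-mean": g_mean}
```

Note how G-mean only becomes large when both TP_rate and TN_rate are large, which is why it is preferred over plain accuracy on imbalanced data.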

The following formulas can be obtained from the confusion matrix.

The precision of the minority class:

Precision = TP / (TP + FP)

The recall of the minority class:

Recall = TP_rate = TP / (TP + FN)

The overall accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall (R1), also known as the true positive rate TP_rate, indicates the proportion of positive samples that are correctly classified. The ideal classification result is a high precision and a high recall simultaneously, but in some cases the two indicators are contradictory. Thus, another criterion, F-measure (F1), is defined:


Table 5. Summary description of the datasets.

Datasets           Abb   #Ex.   #Atts.   (mr, MR)         IR
Balance_0_B        B0b   625    4        (7.84, 92.16)    11.76
Vehicle_0_van      Veh   846    18       (23.52, 76.48)   3.25
Yeast05679vs4      Y04   528    8        (9.66, 90.34)    9.35
Wdbc_0_malignant   W0m   569    30       (37.26, 62.74)   1.68
Breast_0_CARFAD    B0c   106    9        (33.96, 66.04)   1.95
Wisconsin          Wis   683    9        (34.99, 65.01)   1.86
Segment0           Seg   2308   19       (14.25, 85.75)   6.02
Page-blocks0       Pb    5472   10       (10.22, 89.78)   8.78
Wine_0_3           W03   178    13       (26.97, 73.03)   2.71
Yeast2vs4          Y24   514    8        (9.92, 90.08)    9.08

Table 6. Thresholds (α, β) corresponding to the different datasets.

          B0b               Veh                Y04            W0m                B0c
[α, β]    [5e−5, 1.5e−3]    [1.8e−13, 5e−13]   [0.09, 0.54]   [2e−15, 1.5e−14]   [7.9e−18, 1.1e−14]

          Wis                 Seg               Pb                    W03                  Y24
[α, β]    [4.97e−7, 2.6e−6]   [8e−10, 1.3e−8]   [2.3e−17, 1.38e−13]   [1.2e−12, 6.3e−12]   [0.004, 0.2]

F-measure = ((1 + β²) · Recall · Precision) / (β² · Recall + Precision)

Here β represents the relative importance of Recall versus Precision; in this paper its value is 1. From the definition it can be seen that Recall and Precision together evaluate the classification performance on the minority class: a larger F-measure is obtained only when both Recall and Precision are high. Therefore, the F-measure, the weighted harmonic mean of Recall and Precision, is commonly used to evaluate the recognition rate of classifiers on the minority class; when β = 1 it is called the F1 value.

G-mean is the equilibrium value of the classification rates of the positive and negative samples:

G-mean = √(TP_rate · TN_rate)

TP_rate and TN_rate evaluate the classification performance on the minority class and the majority class, respectively. From the definition, a large G-mean value is obtained only when TP_rate and TN_rate are both relatively large. AUC, the area under the ROC curve [56–58], is another commonly used criterion for classifier performance: the larger the AUC value, the better the classifier. Hence, Precision, Recall, F-measure, G-mean, Accuracy and AUC are used as the main evaluation criteria in this paper.

4.2. Dataset

In this experiment we use 10 imbalanced datasets that are publicly available at http://www.keel.es/dataset.php. Table 5 shows the details of the 10 imbalanced datasets from the KEEL database, including the name and abbreviation (Abb), the number of samples (#Ex.), the number of attributes (#Atts.), the percentage of samples of each class (mr, MR) and the imbalance ratio (IR) between the minority and majority samples.

4.3. Relevant parameter setting experiments

The performance of CTDE is sensitive to the parameters (α, β), whose values differ across datasets because of differences in the imbalanced data distribution; the chosen thresholds also affect the subsequent oversampling results. Therefore, the AUC value of the classifier after oversampling is used as the criterion for selecting (α, β). Table 6 gives the thresholds (α, β) of CTDE on the ten datasets. The thresholds (α, β) control the construction of the boundary region; in general, smaller thresholds lead to a smaller number of minority samples in the boundary region. From Table 6 it can be seen that the thresholds on the ten datasets differ, and differ markedly, which is determined by the imbalanced patterns of the datasets.
Table 7 reports the average number of samples in the POS, BND and NEG regions corresponding to the thresholds (α, β) in Table 6. N_POS, N_BND and N_NEG represent the number of minority samples in the POS, BND and NEG regions, respectively. The numbers in Table 7 are the averages over ten runs of tenfold cross validation (rounded to the nearest integer). As can be seen from Table 7, the number of minority samples in the three regions varies across datasets; this may be caused by the complicated internal distribution of the data space.


Table 7. The number of minority samples in the three regions on the ten datasets.

         B0b   Veh   Y04   W0m   B0c   Wis   Seg   Pb    W03   Y24
N_POS    38    197   43    193   18    199   637   520   37    32
N_BND    9     14    7     15    9     35    14    34    9     16
N_NEG    3     7     1     4     9     5     4     5     2     3

Fig. 4. Relationship between the testing error and T on five datasets.

Table 8. Independent training number (T) on the ten datasets.

     B0b   Veh   Y04   W0m   B0c   Wis   Seg   Pb   W03   Y24
T    14    23    14    24    24    20    21    18   27    14

Another important parameter to be considered is the independent training number of the ensemble. The underlying idea of ensemble learning is that even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error. Thus it is essential to study the relationship between the independent training number T and the overall performance of CTDE. In this study, T is gradually increased from 2 to 50 in steps of 2; 50 repetitions are conducted for each value of T, and the average results are reported. To make the presentation clearer, we give the relationship between T and the testing error on the first five datasets (B0b, Veh, Y04, W0m and B0c); the results on the remaining five datasets are similar. Fig. 4 shows the relationship between T and the testing error on the five datasets. It can be seen that the errors on all five datasets decrease monotonically with T, and the error reduction at the beginning is more pronounced than toward the end. For example, on dataset W0m, when T increases from 2 to 25 the error decreases from 0.3515 to 0.2225 (about 13 percentage points), whereas as T increases further from 26 to 50 the additional reduction is only 0.00277, less than 0.3%. As the training time for T = 50 is about twice that for T = 25, we believe it is reasonable to choose T around 25 for dataset W0m. Similar phenomena can be seen on the other four datasets; for clarity we do not enumerate them here, and details are given in Fig. 4. The chosen independent training number T for the 10 datasets is shown in Table 8.

SMOTE is applied to generate new synthetic minority class samples on the key sample set selected by our three-way decision based method. Considering the differences in the imbalance degree of the datasets, the sampling rate of SMOTE on the 10 datasets is reported in Table 9.


Y.T. Yan et al. / International Journal of Approximate Reasoning 107 (2019) 1–16

Table 9
The sampling rate on ten datasets.

Dataset   B0b   Veh   Y04   W0m   B0c   Wis   Seg   Pb    W03   Y24
Rate       20    15    10     2     4     3    34   14     12    30
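For reference, the SMOTE step that these rates parameterize creates each synthetic point by interpolating between a minority sample and one of its k nearest minority neighbors. A minimal pure-Python sketch (the naive neighbor search and the helper names are our own simplifications, not the authors' implementation):

```python
import random

def smote_sample(x, neighbor):
    """Interpolate a synthetic point between x and a nearby minority sample."""
    gap = random.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

def k_nearest_minority(x, minority, k=5):
    """Naive k-NN among the minority samples, excluding x itself."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    others = [m for m in minority if m is not x]
    return sorted(others, key=lambda m: dist(x, m))[:k]

def oversample(minority, rate, k=5):
    """Generate rate * len(minority) synthetic minority samples."""
    synthetic = []
    for _ in range(int(rate * len(minority))):
        x = random.choice(minority)
        nb = random.choice(k_nearest_minority(x, minority, k))
        synthetic.append(smote_sample(x, nb))
    return synthetic
```

Because each synthetic point is a convex combination of two real minority samples, all generated points stay inside the convex hull of the minority class.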

Table 10
Experimental comparison between the CTD and the POS method.

        Precision         Recall            F-measure         G-mean            Accuracy          AUC
Data    POS     CTD       POS     CTD       POS     CTD       POS     CTD       POS     CTD       POS     CTD
B0b     0.4750  0.6087    0.5372  0.6100    0.5611  0.6084    0.5231  0.7601    0.5054  0.9386    0.5369  0.7369
Veh     0.8750  0.8817    0.8482  0.9163    0.8866  0.8971    0.8964  0.9349    0.9095  0.9458    0.8995  0.9048
Y04     0.7503  0.6851    0.7738  0.6000    0.8316  0.6185    0.8427  0.9283    0.8519  0.9283    0.8547  0.8301
W0m     0.6895  0.8301    0.8077  0.6224    0.7458  0.5817    0.6038  0.6476    0.6824  0.7438    0.7251  0.7658
B0c     0.9662  0.7931    0.4888  0.8214    0.6425  0.8070    0.6347  0.7849    0.7505  0.7885    0.6596  0.6267
Wis     0.7178  0.7786    0.8760  0.8916    0.8076  0.8151    0.8932  0.8458    0.8802  0.8467    0.8749  0.8913
Seg     0.9621  0.9698    0.9064  0.9265    0.9291  0.9475    0.9501  0.9578    0.9625  0.9745    0.9498  0.9662
Pb      0.8818  0.9169    0.8371  0.6675    0.7326  0.7726    0.7964  0.8124    0.8366  0.9386    0.8801  0.8946
W03     0.8889  0.8947    0.7889  0.7900    0.8575  0.7719    0.8724  0.8416    0.8901  0.8694    0.8748  0.8431
Y24     0.9601  0.8414    0.6750  0.7000    0.7224  0.7580    0.7906  0.8279    0.8602  0.9551    0.8500  0.8513
Avg     0.8167  0.8200    0.7539  0.7546    0.7951  0.7744    0.8089  0.8424    0.8471  0.8879    0.8409  0.8415

Table 11
Experimental results on precision.

Datasets   C4.5     SMOTE    BSMO     MWMO     ASYN     SM-T     CTD      CTDE
B0b        0.0669   0.4707   0.6230   0.6965   0.5197   0.5076   0.6087   0.7553
Veh        0.8247   0.9539   0.9485   0.9368   0.9451   0.9620   0.8817   0.9610
Y04        0.3636   0.9183   0.9107   0.9315   0.8910   0.9261   0.6851   0.8619
W0m        0.6295   0.8144   0.8399   0.7618   0.8200   0.6446   0.8301   0.9121
B0c        0.3571   0.6750   0.7586   0.8102   0.8086   0.8000   0.7931   0.8697
Wis        0.7419   0.9593   0.9356   0.9280   0.9345   0.9415   0.7786   0.9108
Seg        0.9416   0.9685   0.9674   0.9729   0.9920   1.0000   0.9698   0.9766
Pb         0.8823   0.9931   0.9441   0.8130   0.9747   0.9846   0.9169   0.9798
W03        0.6670   0.9032   0.8542   0.9218   0.8903   0.8965   0.8947   0.9488
Y24        0.6773   0.9545   0.9736   0.9684   0.9701   0.9737   0.8414   0.9685
Average    0.6152   0.8641   0.8756   0.8741   0.8746   0.8636   0.8200   0.9145

4.4. Comparison between CTD and positive region oversampling

In this experiment, the C4.5 decision tree [59] is used as the base classifier. We compare our algorithm with several oversampling methods, including SMOTE, Borderline-SMOTE (abbreviated BSMO), MWMOTE (abbreviated MWMO), ADASYN (abbreviated ASYN) and SMOTE+Tomeklink (abbreviated SM-T), and report the average results of ten runs of tenfold cross-validation [60]. In order to verify the validity of the CTD method, the results of CTD and of conducting SMOTE on the minority samples in the POS domain (abbreviated POS) are also reported in Table 10. The last row gives the average result over the ten datasets for each evaluation criterion, and the best results are reported in bold. As can be seen, compared with the POS method, CTD performs better on most of the datasets (7 of 10 for Precision, 7 of 10 for Recall, 7 of 10 for F-measure, 8 of 10 for G-mean, 8 of 10 for Accuracy, 7 of 10 for AUC). Generally speaking, CTD outperforms the POS method, and the average results are consistent with this. Therefore, compared with the POS method, which oversamples only the samples in the POS domain, the CTD method proposed in this paper is a more favorable approach.

4.5. Experiment results and analysis

Tables 11–16 give the results of the 7 algorithms on the 10 datasets for Precision, Recall, F-measure, G-mean, Accuracy and AUC. The last row of each table gives the average results over the 10 datasets for the respective criterion, and the best performance is highlighted in bold. Generally, according to the average results, the CTDE method proposed in this paper achieves better results than the comparative algorithms. Specifically, for Precision, Recall, F-measure, G-mean, Accuracy and AUC, the CTDE method achieves the best results in 4 of 10, 6 of 10, 5 of 10, 9 of 10, 10 of 10 and 6 of 10 datasets,


Table 12
Experimental results on recall.

Datasets   C4.5     SMOTE    BSMO     MWMO     ASYN     SM-T     CTD      CTDE
B0b        0.4200   0.7600   0.7130   0.7000   0.5690   0.7414   0.6100   0.7800
Veh        0.7431   0.8735   0.8638   0.7658   0.7805   0.8554   0.9163   0.9165
Y04        0.1905   0.5138   0.6296   0.5445   0.3689   0.3814   0.6000   0.7619
W0m        0.4612   0.6171   0.7955   0.6014   0.5507   0.5542   0.6224   0.8553
B0c        0.3334   0.9310   0.8461   0.4678   0.7686   0.7231   0.8214   0.9267
Wis        0.9583   0.8593   0.8988   0.8857   0.8266   0.9332   0.8916   0.9146
Seg        0.8560   0.9389   0.9823   0.9974   0.9000   0.9057   0.9265   0.9613
Pb         0.5357   0.8338   0.8073   0.7373   0.7522   0.7542   0.6675   0.8375
W03        0.6000   0.7179   0.8723   0.7976   0.6154   0.6794   0.7900   0.8250
Y24        0.4428   0.5122   0.7400   0.6784   0.5065   0.7233   0.7000   0.8095
Average    0.5541   0.7558   0.8149   0.7176   0.6638   0.7251   0.7546   0.8484

Table 13
Experimental results on F-measure.

Datasets   C4.5     SMOTE    BSMO     MWMO     ASYN     SM-T     CTD      CTDE
B0b        0.1148   0.5814   0.6653   0.1266   0.5235   0.6174   0.6084   0.7570
Veh        0.7810   0.9120   0.9042   0.8479   0.8528   0.9037   0.8971   0.9188
Y04        0.2500   0.6667   0.7445   0.6765   0.4061   0.5076   0.6185   0.7898
W0m        0.5126   0.7022   0.7936   0.6643   0.6406   0.5733   0.5817   0.7838
B0c        0.3448   0.7826   0.8000   0.5179   0.7673   0.6768   0.8070   0.8901
Wis        0.8363   0.9065   0.9169   0.9060   0.8764   0.9361   0.8151   0.9026
Seg        0.8968   0.9534   0.9748   0.9850   0.9401   0.9504   0.9475   0.9631
Pb         0.6667   0.9065   0.8703   0.7487   0.8527   0.8541   0.7726   0.8561
W03        0.6315   0.8000   0.8632   0.8574   0.7166   0.7380   0.7719   0.8947
Y24        0.5273   0.6667   0.8409   0.7951   0.6552   0.8092   0.7580   0.8282
Average    0.5562   0.7878   0.8384   0.7125   0.7231   0.7567   0.7578   0.8585

respectively. According to the average results, the CTDE proposed in this paper achieves the best results among the compared algorithms for every evaluation criterion.

Table 11 shows the experimental results on Precision. As can be seen, for datasets B0b, W0m and B0c, the CTDE method improves on the other comparison algorithms by more than 5.88%. For dataset W03, CTDE also has a 2.3% improvement over the MWMO approach. For datasets Veh, Y04, Wis, Seg, Pb and Y24, the CTDE results are not as good as the best comparison methods on Precision, but the differences are small; for example, SM-T improves on CTDE by about 1% on dataset Veh, and SMOTE by about 1.33% on dataset Pb. Generally speaking, however, the average Precision of the CTDE method proposed in this paper is the best among all the contrast methods.

Tables 12–13 give the experimental results on Recall and F-measure, respectively. Recall and F-measure similarly reflect the classification performance on the minority class of imbalanced data. As can be seen from Table 12, for datasets Y04, Y24 and W0m, CTDE improves on the other comparison algorithms by about 5.98%. For datasets Seg and B0c, the results obtained by CTDE are lower than the best values of the other comparison algorithms, but the differences are relatively small; for example, on dataset B0c, the result of the SMOTE method is only 0.43% higher than that of CTDE. Moreover, as Table 13 shows, CTDE achieves the best results on datasets B0b, Veh, B0c, W03 and Y04; for datasets Wis, Seg, Pb, Y24 and W0m, the results of CTDE are not as good as the best comparison methods, but differ little from them. Similarly, CTDE achieves the best average results on the ten datasets; in particular, compared with MWMO, ASYN and SM-T, CTDE improves by more than 10%.

Tables 14–15 show the results on G-mean and Accuracy, which measure the overall classification performance on imbalanced datasets. It can be seen from Table 14 that CTDE achieves the best results on most datasets (9 of 10). For dataset Seg, MWMO obtains the best result, improving on CTDE by about 0.71% and on BSMO by about 0.41%. For the average G-mean, CTDE achieves the best result; specifically, compared with CTD, SM-T, ASYN, MWMO, BSMO and SMOTE, CTDE improves by about 8.22%, 16.99%, 16.5%, 16.16%, 9.04% and 13.6%, respectively. It is notable that CTDE achieves the best Accuracy on all ten datasets. In particular, for dataset B0b, CTDE greatly improves accuracy compared with the other methods, by more than 30%, which may be caused by the spatial distribution of the data; it also reflects the effectiveness of using CCA and three-way decision to select the key samples of the minority class in some cases. More specifically, compared with CTD, SM-T, ASYN, MWMO, BSMO and SMOTE, CTDE improves average Accuracy by about 4.81%, 16.66%, 17.49%, 17.56%, 9.3% and 13.51%, respectively.
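All the criteria discussed above derive from the binary confusion matrix, with the minority class taken as positive. For reference, a sketch of the standard definitions (the function name is ours, not from the paper):

```python
from math import sqrt

def imbalance_metrics(tp, fp, tn, fn):
    """Precision, Recall, F-measure, G-mean and Accuracy from a binary
    confusion matrix, treating the minority class as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = sqrt(recall * specificity)  # balances both class rates
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "g_mean": g_mean, "accuracy": accuracy}
```

G-mean is the geometric mean of the per-class rates, which is why it (unlike plain Accuracy) collapses to zero when a classifier ignores the minority class entirely.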


Table 14
Experimental results on G-mean.

Datasets   C4.5     SMOTE    BSMO     MWMO     ASYN     SM-T     CTD      CTDE
B0b        0.4440   0.3904   0.6283   0.3651   0.4948   0.4182   0.7601   0.8869
Veh        0.8372   0.9139   0.9089   0.8522   0.8609   0.9083   0.9349   0.9426
Y04        0.8867   0.7797   0.8713   0.7523   0.6240   0.5839   0.9283   0.9610
W0m        0.5942   0.7329   0.7183   0.6808   0.6744   0.5903   0.6476   0.8241
B0c        0.4618   0.6823   0.6249   0.5022   0.7237   0.6392   0.7849   0.9178
Wis        0.8876   0.9086   0.9183   0.9077   0.8806   0.9205   0.8458   0.9262
Seg        0.9211   0.9640   0.9817   0.9847   0.9434   0.9516   0.9578   0.9776
Pb         0.7289   0.9012   0.8916   0.8265   0.8600   0.8632   0.8124   0.9083
W03        0.7285   0.8215   0.8647   0.8631   0.7521   0.7607   0.8416   0.9279
Y24        0.6514   0.7080   0.8508   0.8126   0.6995   0.8283   0.8279   0.8909
Average    0.7141   0.7803   0.8259   0.7547   0.7513   0.7464   0.8341   0.9163

Table 15
Experimental results on accuracy.

Datasets   C4.5     SMOTE    BSMO     MWMO     ASYN     SM-T     CTD      CTDE
B0b        0.4828   0.4711   0.6421   0.2311   0.5407   0.5112   0.9386   0.9602
Veh        0.8923   0.9140   0.9117   0.8571   0.8643   0.9110   0.9458   0.9585
Y04        0.8867   0.7797   0.8713   0.7523   0.6240   0.6715   0.9283   0.9610
W0m        0.6917   0.7491   0.7765   0.6898   0.6965   0.6199   0.7438   0.8241
B0c        0.5250   0.7273   0.7179   0.5846   0.7203   0.6583   0.7885   0.9186
Wis        0.8695   0.9081   0.9185   0.9082   0.8819   0.9111   0.8467   0.9303
Seg        0.9718   0.9770   0.9815   0.9848   0.9465   0.9528   0.9745   0.9895
Pb         0.9452   0.9286   0.9417   0.9523   0.8694   0.8712   0.9386   0.9714
W03        0.8056   0.8426   0.8646   0.8667   0.7715   0.7884   0.8694   0.9299
Y24        0.9216   0.7614   0.8541   0.8275   0.7457   0.8488   0.9551   0.9661
Average    0.7992   0.8059   0.8480   0.7654   0.7661   0.7744   0.8929   0.9410

Table 16
Experimental results on AUC.

Datasets   C4.5     SMOTE    BSMO     MWMO     ASYN     SM-T     CTD      CTDE
B0b        0.5068   0.5146   0.5287   0.5767   0.5224   0.5112   0.7369   0.7557
Veh        0.7102   0.8873   0.9072   0.9024   0.8993   0.9111   0.9048   0.9185
Y04        0.5221   0.8839   0.8071   0.8042   0.7510   0.6716   0.8301   0.8891
W0m        0.5088   0.8035   0.8214   0.8611   0.7083   0.6199   0.7658   0.8002
B0c        0.5078   0.5439   0.5833   0.5714   0.5625   0.6583   0.6267   0.6602
Wis        0.8416   0.8811   0.9204   0.9122   0.9083   0.9365   0.8913   0.9551
Seg        0.9848   0.9898   0.9846   0.9974   0.9672   0.9528   0.9662   0.9859
Pb         0.7896   0.9100   0.9462   0.8561   0.8406   0.8711   0.8946   0.9064
W03        0.8000   0.8223   0.7971   0.8589   0.8205   0.7884   0.8431   0.8809
Y24        0.5727   0.8457   0.9037   0.8671   0.8394   0.8489   0.8513   0.8610
Average    0.7135   0.8024   0.8199   0.8108   0.7820   0.7769   0.8311   0.8613
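AUC values like those above can be computed without explicitly tracing the ROC curve, via the rank-statistic (Mann–Whitney) interpretation: AUC is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal sketch (quadratic in the sample sizes; a rank-based version would be faster):

```python
def auc(pos_scores, neg_scores):
    """AUC as P(score of a positive > score of a negative),
    counting ties as one half (Mann-Whitney U / (n_pos * n_neg))."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect ranking gives 1.0, a random one 0.5, which is why AUC is insensitive to the class imbalance ratio itself.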

Experimental results on AUC are given in Table 16. AUC denotes the area under the ROC curve; it is a standard criterion for evaluating binary prediction models and measures the generalization ability of a learning algorithm on binary classification problems. CTDE achieves good results on datasets B0b, Wis, Veh, B0c, W03 and Y04; notably, CTDE improves on MWMO by about 17.9% on dataset B0b. On some datasets the results of CTDE are slightly worse than those of the comparison methods: MWMO improves on CTDE by about 1.15% on dataset Seg, BSMO by about 3.98% on dataset Pb and about 4.27% on dataset Y24, and MWMO by about 6.09% on dataset W0m. Nevertheless, the average result (an improvement of about 3%) again shows the superiority of CTDE. Table 17 reports the ranking of the performance of CTD relative to the other algorithms (except CTDE). As one can see, the performance of CTD on the evaluation metrics Recall, G-mean, Accuracy and AUC is in the front rank compared with the other six algorithms. Moreover, as pointed out above, no algorithm is superior to the others on all datasets with respect to the six evaluation metrics. CTD has the best performance in 13 of the 60 cells of Table 17, and reaches the top three in 17 of the remaining 47 cells. These results show that CTD can be an alternative method for imbalanced data oversampling. From Tables 11–16, one can see that, compared with CTD, CTDE has consistently better performance on all datasets. Specifically, the improvements in the average results are apparent (about 9.5% in Precision, 9.4% in Recall, 10.1% in F-measure,


Table 17
Performance ranking on six evaluation metrics and ten datasets.

Datasets   Precision   Recall   F-measure   G-mean   Accuracy   AUC
B0b        3           5        3           1        1          1
Veh        6           1        4           1        1          3
Y04        6           2        4           1        1          2
W0m        2           2        5           5        3          4
B0c        4           3        1           1        1          2
Wis        6           4        7           7        7          5
Seg        4           4        5           4        4          6
Pb         5           6        5           6        4          3
W03        4           3        4           3        1          2
Y24        6           3        4           3        1          3

Fig. 5. Performance comparison of CTD and CTDE on dataset B0b.

8.2% in G-mean, 4.9% in Accuracy and 3% in AUC). This indicates that the ensemble strategy can effectively improve our CCA-based three-way decision model. Moreover, taking Table 14 as an example, the performance gap (G-mean) between CTD and CTDE can be roughly divided into three categories. The first is an improvement in G-mean of more than 10%, on the three datasets B0b, W0m and B0c. The second is an improvement of about 5%–8%, on the four datasets Wis, Pb, W03 and Y24. The third is an improvement of less than 4%, on the three datasets Veh, Y04 and Seg. For the first category, the performance of CTD is relatively low (G-mean below 80%); in this circumstance the improvement is apparent, which is consistent with ensemble theory. For the second category, the performance of CTD is better than in the first category (G-mean about 80%–85%), and CTDE still achieves a certain degree of improvement. For the third category, CTD already achieves relatively high performance (more than 92%); although CTDE still performs better than CTD, the difference between the two methods is very small (less than 1% on dataset Veh, about 3% on dataset Y04, less than 2% on dataset Seg). Considering that the computational time of CTDE is about 10 times that of CTD, we believe CTDE is more appropriate when CTD is relatively ‘weak’. The same phenomenon can be found for the other metrics, and it is easily explained by ensemble learning, which aims to obtain a strong learner by integrating multiple weak learners. We have further investigated the performance of CTD and CTDE on each of the datasets. Fig. 5 reports the comparison on dataset B0b as an example. As it shows, CTDE improves on CTD on all the metrics; specifically, the improvements in Precision, Recall, F-measure and G-mean exceed 10 percent. The same phenomenon can be found on the rest of the datasets (for clarity, we do not enumerate them here).


Table 18
Welch p-values between CTD and the six algorithms.

Methods   Precision   Recall   F-measure   G-mean   Accuracy   AUC
C4.5      0.0058      0.0031   0.0063      0.0113   0.0796     0.0017
SMOTE     0.3030      0.9742   0.2115      0.2523   0.1063     0.3873
BSMO      0.0785      0.0183   0.0034      0.7647   0.1970     0.6675
MWMO      0.1342      0.3977   0.4720      0.1405   0.1021     0.6392
ASYN      0.0812      0.0136   0.2491      0.0593   0.0156     0.0435
SM-T      0.2978      0.3929   0.9647      0.0947   0.0230     0.0935

Table 19
Welch p-values between CTDE and the seven algorithms.

Methods   Precision    Recall       F-measure    G-mean       Accuracy   AUC
C4.5      0.0014       8.9772e−04   0.0012       0.0020       0.0209     5.4691e−04
SMOTE     0.1578       0.0191       0.0166       0.0189       0.0156     0.0577
BSMO      0.0742       0.0251       0.1506       0.0224       0.0145     0.1418
MWMO      0.1129       0.0119       0.0489       0.0167       0.0319     0.0855
ASYN      0.1481       3.3906e−04   0.0051       0.0021       0.0015     0.0044
SM-T      0.1969       0.0084       0.0180       0.0095       0.0037     0.0197
CTD       2.2536e−04   0.0023       5.3721e−04   9.1666e−04   0.0040     7.6085e−04

4.6. Significance testing

The algorithm performances are also compared statistically. Table 18 gives the p-values of Welch's t-test between CTD and the other six algorithms, and Table 19 gives the p-values of Welch's t-test between CTDE and the other seven algorithms, for each of the metrics. The p-values below 0.05 are reported in bold. From Table 18, we can see that CTD is significantly different from the six comparison methods in about one third (11 of 36) of the cases across the six evaluation criteria. The underlying reasons for this are manifold; roughly, there are three. The first is that no algorithm is uniformly superior to all the others on all the datasets, i.e. the performance of the algorithms across datasets is not consistent. The second is the inherent randomness of the CTD algorithm, which uses CCA to constructively form the covers; the random initialization of CCA affects the subsequent construction of the three domains in CTD. The third is that the performance of CTD on datasets Y04 and W0m has a large gap with most of the comparison methods on all six evaluation metrics, which significantly affects the results of Welch's t-test; we believe this is highly related to the data distribution of these two datasets. As pointed out above, the randomness of CCA and the complicated data distributions hamper the detection of the hard-to-learn samples, which is why we further propose the CTDE method. As shown in Table 19, generally speaking, CTDE is significantly different from the other algorithms on Recall, G-mean and Accuracy (7 of 7 for each), and also on F-measure and AUC for most of them (6 of 7 on F-measure, 4 of 7 on AUC).
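For reference, the statistic behind Tables 18–19 compares two means without assuming equal variances. A sketch of Welch's t and its Welch–Satterthwaite degrees of freedom; the p-value is then obtained from the t-distribution (e.g. scipy.stats.ttest_ind with equal_var=False):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Unlike Student's t-test, the degrees of freedom here depend on the observed variances, which makes the test robust when the compared algorithms have different result spreads.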
Table 19 also shows that there is a significant difference between CTDE and CTD. It is worth noting that, on Precision, CTDE is significantly different from only two comparison methods (C4.5 and CTD). From the definition of the Precision metric, either an increase in FP or a decrease in TP leads to a small Precision value, but it is complicated to analyze the underlying reason for its variation. As Table 11 shows, no algorithm (SMOTE, BSMO, MWMO, ASYN, SM-T or CTDE) is uniformly superior on all datasets in terms of Precision; we believe this phenomenon is highly related to the complicated data distributions of the imbalanced datasets. Moreover, CTDE has significant differences with the compared methods on the other five evaluation metrics. From the above analysis, we can see that CTDE can serve as an alternative method for imbalanced data oversampling, which also verifies the validity of our three-way ensemble model.

5. Conclusions

The three-way decision, as a development and expansion of the traditional two-way decision, has shown its advantages over the latter: it can effectively improve problem-solving performance by mining the boundary domain, and it has been widely used in many fields. The classification of imbalanced data is a hot issue in machine learning and is meaningful in practical applications. As one of the most classical approaches to imbalanced data processing, minority oversampling has drawn wide attention from scholars in recent years. Inspired by three-way decision theory, this paper proposes a three-way decision based method (CTD) to improve the performance of the SMOTE oversampling method.
In our method, CCA is applied to construct the covers of the minority samples, and then the cover density is used to construct the three domains POS, NEG and BND of the minority samples. Due to the randomness of CCA, the learner trained on the samples obtained by CTD oversampling is sometimes


a “weak” learner. Thus, to overcome the randomness of CTD, we further introduce the ensemble technique to integrate multiple weak learners into a stronger learner. The idea of constructing a three-way decision ensemble to obtain a more robust decision model, instead of a single three-way decision, is in some sense consistent with the practice of “voting” in daily life. Numerical experiments on ten typical imbalanced datasets show that CTDE achieves better classification performance than the comparison methods; they also show the effectiveness of the ensemble model in improving the performance of a single three-way decision. This paper not only provides an efficient imbalanced data processing method from the three-way decision perspective, but also enriches the application scope of the three-way decision. The approach proposed in this paper is employed in the binary classification scenario of imbalanced data, but it could also be extended to the multiclass scenario: for a multiclass problem, we can adopt the “one versus rest” strategy to transform it into binary classification problems, which is a common practice in the study of imbalanced data. This is one of the topics that we will study in the future.

Acknowledgements

The authors would like to thank all the anonymous reviewers for their valuable comments on an earlier draft of this paper. This work was supported by the National Natural Science Foundation of China (Nos. 61673020, 61602003, 61876001 and 61806002) and the Doctoral Scientific Research Startup Foundation of Anhui University (No. J01003253).

References

[1] Z. Zheng, X. Wu, R. Srihari, Feature selection for text categorization on imbalanced data, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 80–89.
[2] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[3] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[4] A. Anand, G. Pugalenthi, G.B. Fogel, P.N. Suganthan, An approach for classification of highly imbalanced data using weighting and undersampling, Amino Acids 39 (5) (2010) 1385–1391.
[5] L. Liu, Y. Cai, W. Lu, K. Feng, C. Peng, B. Niu, Prediction of protein–protein interactions based on PseAA composition and hybrid feature selection, Biochem. Biophys. Res. Commun. 380 (2) (2009) 318–322.
[6] H. He, X. Shen, A ranked subspace learning method for gene expression data classification, in: IC-AI, 2007, pp. 358–364.
[7] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (2009) 1263–1284.
[8] Q. Wang, A hybrid sampling SVM approach to imbalanced data classification, Abstr. Appl. Anal. 5 (2014) 22–35.
[9] H. Han, W.Y. Wang, B.H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, Springer, Berlin, Heidelberg, 2005, pp. 878–887.
[10] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: IEEE International Joint Conference on Neural Networks, 2008, IJCNN 2008 (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1322–1328.
[11] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 405–425.
[12] G.E. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 20–29.
[13] L. Zhang, B. Zhang, A geometrical representation of McCulloch–Pitts neural model and its applications, IEEE Trans. Neural Netw. 10 (4) (1999) 925–929.
[14] Y. Zhang, H. Xing, H. Zou, S. Zhao, X. Wang, A three-way decisions model based on constructive covering algorithm, in: 8th International Conference on Rough Sets and Knowledge Technology, in: LNAI, vol. 8171, 2013, pp. 346–353.
[15] B. Senjean, E.D. Hedegard, M.M. Alam, S. Knecht, E. Fromager, Combining linear interpolation with extrapolation methods in range-separated ensemble density functional theory, Mol. Phys. 114 (7–8) (2016) 968–981.
[16] Y. Yao, Three-way decision: an interpretation of rules in rough set theory, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2009, pp. 642–649.
[17] D. Liu, D. Liang, C. Wang, A novel three-way decision model based on incomplete information system, Knowl.-Based Syst. 91 (2016) 32–45.
[18] Y. Yao, An outline of a theory of three-way decisions, in: International Conference on Rough Sets and Current Trends in Computing, Springer, Berlin, Heidelberg, 2012, pp. 1–17.
[19] Y. Yao, Three-way decisions and cognitive computing, Cogn. Comput. 8 (4) (2016) 543–554.
[20] Y. Yao, Rough sets and three-way decisions, in: International Conference on Rough Sets and Knowledge Technology, Springer, Cham, 2015, pp. 62–73.
[21] Y. Yao, C. Gao, Statistical interpretations of three-way decisions, in: International Conference on Rough Sets and Knowledge Technology, Springer, Cham, 2015, pp. 309–320.
[22] H. Yu, P. Jiao, Y. Yao, G. Wang, Detecting and refining overlapping regions in complex networks with three-way decisions, Inf. Sci. 373 (2016) 21–41.
[23] Y. Li, L. Zhang, Binary classification by modeling uncertain boundary in three-way decisions, IEEE Trans. Knowl. Data Eng. 29 (7) (2017) 1438–1451.
[24] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (5) (1982) 341–356.
[25] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, Dordrecht, 1991.
[26] Y. Yao, Decision-theoretic rough set models, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2007, pp. 1–12.
[27] Y. Yao, Y. Zhao, Attribute reduction in decision-theoretic rough set models, Inf. Sci. 178 (17) (2008) 3356–3373.
[28] Y. Yao, The superiority of three-way decisions in probabilistic rough set models, Inf. Sci. 181 (6) (2011) 1080–1096.
[29] Y. Yao, Three-way decisions with probabilistic rough sets, Inf. Sci. 180 (3) (2010) 341–353.
[30] J. Xu, D. Miao, Y. Zhang, Z. Zhang, A three-way decisions model with probabilistic rough sets for stream computing, Int. J. Approx. Reason. 88 (2017) 1–22.
[31] Y. Yao, Two semantic issues in a probabilistic rough set model, Fundam. Inform. 108 (3–4) (2011) 249–265.
[32] X. Zhou, H. Li, A multi-view decision model based on decision-theoretic rough set, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2009, pp. 650–657.
[33] M.T. Khan, N. Azam, S. Khalid, J. Yao, A three-way approach for learning rules in automatic knowledge-based topic models, Int. J. Approx. Reason. 82 (2017) 210–226.


[34] J.P. Herbert, J. Yao, Learning optimal parameters in decision-theoretic rough sets, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2009, pp. 610–617.
[35] J.P. Herbert, J. Yao, Game-theoretic rough sets, Fundam. Inform. 108 (3–4) (2011) 267–286.
[36] H. Li, L. Zhang, X. Zhou, B. Huang, Cost-sensitive sequential three-way decision modeling using a deep neural network, Int. J. Approx. Reason. 85 (2017) 68–78.
[37] H. Li, X. Zhou, Risk decision making based on decision-theoretic rough set: a three-way view decision model, Int. J. Comput. Intell. Syst. 4 (1) (2011) 1–11.
[38] H. Li, X. Zhou, J. Zhao, B. Huang, Cost-sensitive classification based on decision-theoretic rough set model, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2012, pp. 379–388.
[39] H. Li, X. Zhou, J. Zhao, D. Liu, Attribute reduction in decision-theoretic rough set model: a further investigation, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2011, pp. 466–475.
[40] D. Liu, T. Li, H. Li, A multiple-category classification approach with decision-theoretic rough sets, Fundam. Inform. 115 (2–3) (2012) 173–188.
[41] X. Li, H. Yi, Y. She, B. Sun, Generalized three-way decision models based on subset evaluation, Int. J. Approx. Reason. 83 (2017) 142–159.
[42] D. Liu, T. Li, D. Liang, Decision-theoretic rough sets with probabilistic distribution, in: International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, 2012, pp. 389–398.
[43] B. Hu, H. Wong, K.C. Yiu, On two novel types of three-way decisions in three-way decision spaces, Int. J. Approx. Reason. 82 (2017) 285–306.
[44] X. Jia, K. Zheng, W. Li, T. Liu, L. Shang, Three-way decisions solution to filter spam email: an empirical study, in: International Conference on Rough Sets and Current Trends in Computing, Springer, Berlin, Heidelberg, 2012, pp. 287–296.
[45] X. Li, B. Sun, Y. She, Generalized matroids based on three-way decision models, Int. J. Approx. Reason. 90 (2017) 192–207.
[46] B. Zhou, Y. Yao, J. Luo, A three-way decision approach to email spam filtering, in: Canadian Conference on Artificial Intelligence, Springer, Berlin, Heidelberg, 2010, pp. 28–39.
[47] Y. Zhang, J. Yao, Gini objective functions for three-way classifications, Int. J. Approx. Reason. 81 (2017) 103–114.
[48] J.P. Herbert, J. Yao, Criteria for choosing a rough set model, Comput. Math. Appl. 57 (6) (2009) 908–918.
[49] Y. Yao, Probabilistic approaches to rough sets, Expert Syst. 20 (5) (2003) 287–297.
[50] Z. Chen, T. Lin, X. Xia, H. Xu, S. Ding, A synthetic neighborhood generation based ensemble learning for the imbalanced data classification, Appl. Intell. 48 (2018) 2441–2457.
[51] T.G. Dietterich, Ensemble learning, in: The Handbook of Brain Theory and Neural Networks, 2nd edn., 2002, pp. 110–125.
[52] Z. Zhou, Ensemble learning, in: Encyclopedia of Biometrics, 2015, pp. 411–416.
[53] G. Nápoles, R. Falcon, E. Papageorgiou, R. Bello, K. Vanhoof, Rough cognitive ensembles, Int. J. Approx. Reason. 85 (2017) 79–96.
[54] B. Liu, S. Wang, R. Long, K. Chou, iRSpot-EL: identify recombination spots with an ensemble learning approach, Bioinformatics 33 (1) (2016) 35–41.
[55] K. Jiang, J. Lu, K. Xia, A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE, Arab. J. Sci. Eng. 41 (8) (2016) 3255–3266.
[56] T. Saito, M. Rehmsmeier, Precrec: fast and accurate precision-recall and ROC curve calculations in R, Bioinformatics 33 (1) (2017) 145–147.
[57] T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets, PLoS ONE 10 (3) (2015) e0118432.
[58] J. Huang, C.X. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Trans. Knowl. Data Eng. 17 (3) (2005) 299–310.
[59] Z. Mašetic, A. Subasi, J. Azemovic, Malicious web sites detection using C4.5 decision tree, Southeast Eur. J. Soft Comput. 5 (1) (2016).
[60] P. Refaeilzadeh, L. Tang, H. Liu, Cross-validation, in: Encyclopedia of Database Systems, Springer US, 2009, pp. 532–538.