Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification

Samad Nejatian a,b, Hamid Parvin c,d,∗, Eshagh Faraji d,e

a Department of Electrical Engineering, Yasooj Branch, Islamic Azad University, Yasooj, Iran
b Young Researchers and Elite Club, Yasooj Branch, Islamic Azad University, Yasooj, Iran
c Department of Computer Engineering, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran
d Young Researchers and Elite Club, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran
e Department of Electrical Engineering, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran
Article history: Received 26 January 2016; Revised 13 April 2017; Accepted 10 June 2017; Available online xxx

Keywords: Imbalanced learning; Neural networks; Decision tree; Cancer diagnosis
Abstract

Abundant patient data are recorded within the health care system. Through data mining, useful knowledge and hidden patterns can be extracted from these data. The discovered knowledge can be used by physicians and health care managers to improve the quality of their services and to reduce the number of medical errors. Since it is difficult to diagnose or predict diseases with a single data mining algorithm, this research combines the advantages of several algorithms in order to achieve better results. Most standard learning algorithms are designed for balanced data (data with roughly the same number of samples in each class), where the cost of misclassification is the same for all classes. These algorithms cannot properly represent the data distribution when the dataset is imbalanced. In some cases, the cost of misclassifying a sample of a particular class can be very high, such as wrongly labeling cancerous patients as healthy. This article presents a fast and efficient way to learn from imbalanced data, especially suited to datasets with very few minority-class samples. Experiments show that the proposed method is more efficient than traditional machine learning algorithms as well as several learning algorithms designed specifically for imbalanced data. In addition, it has lower computational complexity and a shorter running time.
1. Introduction

Data mining methods can help predict diseases automatically with a high accuracy rate. Moreover, the additional costs of irrelevant clinical trials are reduced through this process. It also reduces wrong predictions caused by human fatigue and consequently improves the quality of services. Data mining methods that have been successfully applied to medical data include neural networks, decision trees (DT), association rule mining, Bayesian networks, support vector machines (SVM) and clustering. Depending on the application, one of these methods will be more useful than the others. However, it is very hard to choose a single data mining algorithm that is suitable to diagnose or predict all diseases.
∗ Corresponding author at: Department of Computer Engineering, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran. E-mail addresses: [email protected], [email protected] (H. Parvin).
Some algorithms are better than others for certain purposes; bringing the advantages of several algorithms together results in better performance. Performance criteria will be discussed later in this study. In any case, it is almost impossible to choose the single best data mining method to predict diseases for a specific criterion such as accuracy, sensitivity or specificity. Analyzing the data and the confusion among them is a problem that prevents remarkable diagnostic results, because the knowledge within the data must be used properly. In fact, data mining is a response to the needs of health care organizations: the more data there are and the more complex their relations, the more difficult it is to access the hidden information within them. It is often assumed that the class distribution is balanced or nearly balanced, and, in general, that the cost of misclassification is the same for all classes. When the dataset is imbalanced, such algorithms cannot properly capture the data distribution characteristics; in a sense, they tend to assign an unknown sample to the most frequent classes, and as a result they provide unacceptable accuracy across the data classes.
Fig. 1. (A) An imbalanced interclass dataset (left). (B) Dataset with high complexity, intra-class and interclass imbalance, multi-concept, overlapping of classes, noise (right).
An imbalanced dataset is any dataset exhibiting a severely unequal distribution among its classes. This type of imbalance is called inter-class imbalance (for example, a one-to-one-thousand (1:1000) distribution, where one class completely overwhelms the other). The imbalanced distribution is not necessarily between two classes; it may hold among several classes. In the literature, a dataset in which one class exceeds a 65% share may already be considered imbalanced [14,19,23,24]. The distributions of many real datasets are imbalanced, so learning algorithms must be modified in order to extract knowledge from them. One example of such imbalanced datasets is the data of patients with breast cancer. These data are often labeled with positive (cancer) and negative (healthy) classes. As expected, the number of healthy people is much higher than the number of cancer patients. Therefore, a classifier is required that achieves appropriate and balanced prediction accuracy for both the minority and the majority class. Since diagnosing a cancer patient as a healthy individual is unacceptable (and similarly, diagnosing a healthy person as a patient), modified classifiers are required to build decision support systems. The applied classifiers must provide high validity for the minority class without harming the validity of the majority class. For example, healthy samples may be diagnosed 100% correctly while the correct classification accuracy for patients is only 10%, so a patient sample is very likely to be misdiagnosed. In this regard, single evaluation criteria such as overall accuracy and error rate clearly do not provide enough information about the quality of imbalanced learning. When the imbalance is a direct result of the nature of the data space, it is called intrinsic imbalance. Imbalance is not always intrinsic, however; it can also be relative, that is, the number of minority samples is naturally large but very low compared to the majority class. Data complexity is another important issue, covering class overlapping, missing data and so on. This concept is shown in Fig. 1, where the stars and circles represent the minority and majority classes, respectively. Both distributions shown in parts (A) and (B) are imbalanced, but part (B) additionally exhibits sample overlapping and multiple concepts. In part (B), sub-concept C may not be learned because of the lack of data. Another form of imbalance is intra-class imbalance, which concerns the distribution of data representing sub-concepts within a class. In Fig. 1(B), B and C represent the dominant concept and a sub-concept of the minority class, respectively, while A and D are the dominant concept and a sub-concept of the majority class, respectively.
For each class, the samples in the dominant cluster of that class overwhelm those of the sub-concepts. As is clear, this data space exhibits both inter-class and intra-class imbalance. In this paper, we present a new method to classify imbalanced training data, and we compare it with standard methods such as the nearest neighbor classifier, the decision tree and the multi-layer perceptron neural network (MLP). In the following, we review the literature and introduce related work in this area. Then, we examine the evaluation criteria for these methods and the classification test procedure. Finally, we discuss the results of the tests and conclude the paper. In general, the contributions of this article include:

• A new method for learning from imbalanced data.
• An efficient method to be used in a decision support system for breast cancer diagnosis.
• The results of the proposed method on a real breast cancer dataset.
• A method for the diagnosis of cardiovascular patients.

2. Related works

In this section, we review the literature on the topic and previous work. In this paper, the training set and the number of its samples are denoted by $S$ and $m$: $S = \{(x_i, y_i)\,|\, i = 1, \ldots, m\}$, where $x_i \in X$ is a sample in the $n$-dimensional feature space $X = \{(f_1, f_2, \ldots, f_n)\,|\, f_i \in \mathbb{R}\}$ and $y_i \in Y = \{1, \ldots, c\}$ is the class label associated with the sample $x_i$; for example, $c = 2$ indicates a two-class problem. $S_{min}$ and $S_{max}$ are the sample sets of the minority and majority classes; their union is the training set and their intersection is empty. We also denote by $E$ a set sampled from $S$. As discussed earlier, when a standard learning algorithm is applied to imbalanced data, the minority class is often not learned well, because the induced rules describing the minority concept are often much weaker than those describing the majority concept. To illustrate the effect of the imbalanced learning problem on standard learning algorithms, consider the generic decision tree algorithm. A decision tree is built by a recursive top-down greedy search which uses a feature selection criterion to choose the best feature as the splitting criterion at each node. Nodes are then created for the possible values of the splitting feature. As a result, at each stage the training set is divided into smaller subsets, which yield separate rules for the class concepts. Finally, these rules are combined into the hypothesis that results in the lowest error rate over the classes. Using this process in the presence of imbalanced data causes problems in two directions. First, repeated partitioning of the data space leads to ever fewer observations of minority samples, which reduces the number of leaves describing the minority concept, and the resulting rules carry less confidence.
Secondly, concepts that depend on combinations of different features remain unlearned because of the sparseness caused by partitioning. The first issue relates to absolute and relative imbalance, while the second concerns between-class imbalance and high dimensionality. In any case, imbalanced data has a negative effect on decision tree classification performance. Later in this paper, a classifier proposed to overcome the effects of imbalanced data is presented and investigated in detail. In general, the solutions proposed for the imbalanced learning problem follow two directions. The first category changes the dataset in order to make it balanced; the other adapts the learning algorithms themselves so that they can learn from imbalanced data [5,22,23,29–31]. To classify cardiovascular disease, artificial neural networks with back-propagation of error have been used on a dataset of 100 medical records, 60 of men and 40 of women, with 16 input features used for prediction; the training speed is between 1.0 and 9.0, and the resulting degree of accuracy is measured [15]. One of the methods used to reduce the size and complexity of an algorithm is feature subset selection. Feature selection is the process of identifying and removing weak, irrelevant or redundant dimensions or features in a dataset; its purpose is to find the minimal feature subset such that the resulting data distribution stays close to the original one. For example, an optimal subset of features sufficient to predict heart disease can be obtained with a genetic algorithm: the number of features can be reduced from 13 to 6, which reduces the number of tests a patient must undergo [16]. Sampling methods for the imbalanced learning problem change an imbalanced dataset by some mechanism in order to obtain a balanced one, and studies on several basic classifiers have shown that such mechanisms improve results on imbalanced datasets. In the random over-sampling method, a set $E$ is sampled from $S_{min}$ and added to the dataset $S$; the number of minority samples thus grows by $|E|$, and the dataset moves toward balance. Using this method, it is possible to reach a classifier with an acceptable degree of balance. In the sub-sampling (random under-sampling) method, by contrast, data is removed from the dataset: a subset is randomly selected from the majority class $S_{max}$ and removed from $S$, which establishes balance in the dataset [1]. Although both sampling methods aim to improve imbalanced learning, each has problems of its own. In sub-sampling, samples removed from the majority class may carry important concepts, which are then lost. In over-sampling, overfitting may occur because of data replication. Another category of methods is informed sub-sampling; examples are the EasyEnsemble and BalanceCascade algorithms [2,3], whose purpose is to overcome the data-loss problem of the random methods.
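As a concrete illustration of the two plain random schemes just described, the following Python sketch balances a two-class dataset by duplicating minority samples or discarding majority samples. The function names and the equal-class-size target are our choices for illustration, not the paper's exact procedure.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, seed=0):
    """Randomly discard majority samples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]
```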
In the EasyEnsemble algorithm, an ensemble learning system is created by sampling several subsets from the majority class and building a classifier on the combination of each subset with the minority class data. The size of each subset drawn from the majority class equals the size of the minority class, and each draw is made randomly, with replacement. Another example of these methods uses a KNN classifier for sub-sampling [4,24,25]. A further sampling-based approach to imbalanced learning is hybrid sampling with data generation. An example is the Synthetic
Minority Over-Sampling Technique (SMOTE). This algorithm creates synthetic data based on feature-space similarities among the available minority samples: for each sample $x_i \in S_{min}$, a certain number of nearest neighbors (by Euclidean distance) is determined, and a new pattern is created from a relation between the specified points [7]. Another proposed method in this area is adaptive combined sampling. The previous methods generate the same number of synthetic samples for each minority sample without considering the neighboring samples, which may increase class overlapping. Different adaptive methods have been introduced to overcome this problem; examples are the ADASYN and borderline-SMOTE algorithms [8]. Sampling with data cleansing is another proposal for imbalanced learning. Data cleansing has been introduced to remove the overlapping introduced by sampling methods. One of the cleansing tools is the Tomek link [9], defined as a pair of nearest neighbors from opposite classes. If two samples form a Tomek link, either one of them is noise or both are close to the decision border. One application of these links is to eliminate undesirable class overlapping after combining: links are removed until all nearest-neighbor pairs belong to the same class. Fig. 2 shows how SMOTE works and how the Tomek links are then determined and removed. Clustering-based sampling is another approach to the imbalanced learning problem. One proposed algorithm is cluster-based over-sampling (CBO), which uses the K-means clustering algorithm [10]. In this algorithm, first k samples are selected from each cluster and their average feature vector is computed to determine the cluster centers; then the Euclidean distance of each sample to each cluster center is calculated, the sample is assigned to the cluster with the nearest center, and the cluster center is updated. In the CBO algorithm, over-sampling expands all clusters of the majority class to the size of the largest majority cluster; then over-sampling is applied to the clusters of the minority class, increasing their size. Ultimately, a strong representation of small concepts is obtained in the final dataset. Finally, a series of algorithms combine sampling with boosting, an ensemble technique; examples are the SMOTEBoost and DataBoost-IM algorithms [11]. Unlike sampling methods, which seek to balance the class distribution, cost-sensitive learning methods consider the cost of misclassifying samples: a solution to imbalanced learning is obtained by creating a cost matrix for the misclassification of each sample. These methods are not relevant to the application in this paper and are not discussed further. In recent years, methods such as the one-class SVM and SVDD have been proposed. In particular, Raskutti and Kowalczyk suggested that one-class learning is especially suited to imbalanced datasets with high-dimensional feature spaces.
In addition, Japkowicz presented an approach in which an autoassociator is trained to reconstruct the positive class at its output layer, and suggested that under certain circumstances, for example in multi-modal domains, one-class learning may perform better than other methods [17]. A recognition-based approach has also been examined against discrimination-based techniques; the authors recommended recognition-based methods for highly imbalanced datasets [17], while decision tree classification is suitable for relatively balanced datasets [18].
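As a rough sketch of the clustering-based over-sampling idea described above, the following simplified CBO variant clusters each class with K-means and replicates samples until every cluster reaches the size of the largest cluster. This is a simplified reading of CBO [10]; the expansion targets in the original algorithm differ slightly, so treat this as illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cbo_oversample(X, y, k=3, seed=0):
    """Simplified CBO: cluster each class with K-means, then replicate
    samples so every cluster reaches the size of the largest cluster."""
    rng = np.random.default_rng(seed)
    clusters = {}
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10,
                    random_state=seed).fit(X[idx])
        for c in range(km.n_clusters):
            clusters[(label, c)] = idx[km.labels_ == c]
    target = max(len(v) for v in clusters.values())   # largest cluster size
    parts = []
    for idx in clusters.values():
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]
```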
Fig. 2. (A) Basic dataset. (B) Dataset after applying SMOTE. (C) Tomek link. (D) Dataset after removing the links.
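The two operations illustrated in Fig. 2 can be sketched in a few lines of NumPy. The SMOTE interpolation rule and the mutual-nearest-neighbor definition of a Tomek link follow [7] and [9]; the helper names and the brute-force distance computation are our own choices for a small illustration.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """SMOTE: each synthetic point lies on the segment between a minority
    sample and one of its k nearest minority neighbours (Euclidean)."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]           # skip self at column 0
    base = rng.integers(len(X_min), size=n_new)      # random minority samples
    mate = nn[base, rng.integers(k, size=n_new)]     # one of their neighbours
    lam = rng.random((n_new, 1))                     # interpolation factor
    return X_min[base] + lam * (X_min[mate] - X_min[base])

def tomek_links(X, y):
    """Index pairs (i, j) that are mutual nearest neighbours with opposite
    labels -- the Tomek links removed in the cleansing step."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    return [(i, int(nn[i])) for i in range(len(X))
            if nn[int(nn[i])] == i and y[i] != y[nn[i]] and i < nn[i]]
```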
Finally, we note that although current efforts focus on two-class imbalanced problems, multi-class imbalanced learning problems also exist and are important.
3. Evaluation criteria for imbalanced learning

Given the growth of research in the field of imbalanced learning, criteria are needed for evaluating the effectiveness of imbalanced learning algorithms. In this section, we examine such evaluation criteria. The conventional criteria are the accuracy rate and the error rate. Although these are simple ways to describe the performance of a classifier on a dataset, they are not suitable for imbalanced data. Fig. 3 shows the confusion matrix from which these criteria are computed. For example, if a dataset contains 5% minority-class and 95% majority-class samples, a classifier that assigns all samples to the majority class attains an accuracy of 95% despite correctly diagnosing 0% of the minority-class samples.

Fig. 3. Confusion matrix.

Studying the confusion matrix, the first column contains the positive samples and the second column the negative samples, while the first row counts the samples the classifier labels as the minority class and the second row those it labels as the majority class. The ratio between the columns therefore reflects the class distribution of the dataset, and any criterion that uses values from both columns is inherently sensitive to imbalance. For example, the accuracy criterion uses both columns, so it changes as the class distribution changes even if the underlying performance of the classifier does not. Other evaluation criteria have been adapted to the imbalanced learning problem, including accuracy, precision, recall, F-measure and G-mean [1]. Accuracy is obtained from Eq. (1):

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

Precision is obtained from Eq. (2):

$$\text{precision} = \frac{TP}{TP + FP} \quad (2)$$

Recall is obtained from Eq. (3):

$$\text{recall} = \frac{TP}{TP + FN} \quad (3)$$

F-measure is obtained from Eq. (4):

$$F\text{-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (4)$$
The ROC evaluation method uses the two single-column criteria, the TP rate and the FP rate, and obtains a graph by plotting the TP rate against the FP rate. Each point in this space represents the efficiency of a classifier for one distribution, so the ROC diagram is a strong method for evaluating efficiency visually. In such cases, precision-recall diagrams can provide additional information for efficiency evaluation; these diagrams can be regarded as among the best representations of classification performance in imbalanced applications. Precision is inherently a criterion of exactness (of the samples labeled positive, how many are labeled correctly), while recall is a criterion of completeness (how many of the positive-class samples are labeled correctly). These two criteria, much like accuracy and error, have an inverse relation.

3.1. Possible classification error

The possibility of wrong labeling is another source of variation in results on test data. Several testing and training sets must therefore be used, and training must be repeated several times when the classifier has a random component. To compare two classifiers A and B, a test procedure is recommended consisting of K iterations, in each of which 33% of the data is used for testing and the rest for training. In each of the K iterations, we divide the data into training and testing parts; classification models A and B are first trained on the training part and then tested on the testing part, yielding the accuracies $p_A^1$ and $p_B^1$ of classifiers A and B, respectively. From the second random training/testing split, the estimates $p_A^2$ and $p_B^2$ are obtained, and so on. The differences are defined in Eq. (5):
$$P^i = p_A^i - p_B^i \quad (5)$$

The estimated average and variance over the $K$ runs of cross-validation yield the statistic in Eq. (6):

$$t = \frac{\bar{P}\sqrt{K}}{\sqrt{\sum_{i=1}^{K}\left(P^i - \bar{P}\right)^2 / (K-1)}} \quad (6)$$

where $\bar{P}$ is obtained from Eq. (7):

$$\bar{P} = \frac{1}{K}\sum_{i=1}^{K} P^i \quad (7)$$
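A minimal sketch of this resampled paired t-test, assuming SciPy's Student-t quantile function for the table lookup; acc_a and acc_b are the per-iteration accuracies of the two classifiers.

```python
import numpy as np
from scipy import stats

def paired_resampled_ttest(acc_a, acc_b, alpha=0.05):
    """Eqs. (5)-(7): t-test on per-iteration accuracy differences of two
    classifiers over K random 67/33 train/test splits."""
    p = np.asarray(acc_a) - np.asarray(acc_b)            # Eq. (5)
    k = len(p)
    p_bar = p.mean()                                     # Eq. (7)
    t = p_bar * np.sqrt(k) / np.sqrt(((p - p_bar) ** 2).sum() / (k - 1))  # Eq. (6)
    t_crit = stats.t.ppf(1 - alpha / 2, df=k - 1)        # e.g. 2.045 for K = 30
    return t, t_crit, abs(t) > t_crit                    # significant difference?
```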
We now look up the t-distribution table with $K - 1$ degrees of freedom. Under the null hypothesis, if the computed value of $t$ is less than the table value, the difference between the two classifiers is not significant; otherwise it is significant. Using this test procedure with 30 iterations, we obtained $t = 1.9796$, while the table value at significance level 0.05 with 29 degrees of freedom is 2.045. Because the computed value is lower than the table value, we cannot reject the null hypothesis: this test suggests that there is no significant difference in accuracy between the compared classifiers on our data.

4. The proposed method

The main structure of the proposed algorithm, named ModifiedBagging, is similar to that of the EasyEnsemble algorithm. Ensemble clustering has been used many times in medical problems [25–28,32,33]. In this algorithm, we first select a series of sub-samples from $S_{max}$, called $E_i$, where $|E_i| = |S_{min}|$. Then we define the subsets $S_i \subset S$ as $S_i = S_{min} \cup E_i$ and train a weak classifier, such as a decision tree, on each $S_i$; this classifier is denoted $DT_i$.
Fig. 4. The pseudo code of the proposed algorithm.
In the end, we treat all of these $DT_i$ as an ensemble. The pseudo code of the proposed algorithm is presented in Fig. 4. Although there are many approaches to improving imbalanced learning, as mentioned earlier, we consider only the category of sub-sampling algorithms in this article. In this category, the best representatives are the EasyEnsemble and BalanceCascade algorithms, which fall in the informed sampling category [2,3]. As shown in [3], these methods are superior to the others in both efficiency and training speed, and EasyEnsemble and BalanceCascade behave similarly in these two respects. Since both algorithms have very similar structures and the EasyEnsemble structure resembles the proposed algorithm, EasyEnsemble is the one compared against the proposed algorithm. The EasyEnsemble algorithm operates as follows. In the first step, it creates a random sub-sample in which all minority class data are present and in which the majority class data, selected at random, have the same cardinality as the minority class. The AdaBoost procedure is then applied to this sub-sample, and the resulting ensemble is called AdaBoost_1. In the next step, another sub-sample is drawn and the AdaBoost_2 ensemble is generated on it. After T steps, T ensembles have been acquired; it is like having an ensemble of several ensembles, where each base classifier is itself a powerful AdaBoost classifier. This method has a fundamental weakness: an ensemble is strong when its base classifiers are weak [12], because they must have diversity (that is why powerful, stable classifiers like SVM are rarely used in ensembles). Since EasyEnsemble, and likewise BalanceCascade, uses powerful base classifiers, these methods are often no better than bagging and AdaBoost; they are just slower. The reason for the superiority of the proposed ModifiedBagging over EasyEnsemble should be sought in their differences, illustrated in line 6 of the pseudo code: the EasyEnsemble algorithm uses a classification algorithm of high time complexity, AdaBoost, instead of a simple classifier [5]. Using such a complex classifier not only incurs a large time overhead but is actually unjustified, because a voting mechanism is applied after the classifiers $C_i$ are generated. In addition, the ensemble classifiers may not be properly trained on the $S_i$ due to the small sample size of the minority class, $|S_{min}|$. A sketch of the proposed procedure follows the list of models below. After selecting useful features, the PCA technique is used to reduce dimensionality. Finally, several classification models are used; we tried a range of models, listed here: (1) DT; (2) one-class SVDD; (3) two-class SVDD; (4) two-class Parzen-DD; (5) two-class KNN-DD; (6) an ensemble of classifiers 1 to 5 combined by averaging; (7) an ensemble of the 6 classifiers PARZENC, FISHERC, QDC, SVDD, KNNDD and RBNC combined by averaging; (8) a combination of the 6 classifiers PARZENC, FISHERC, QDC, SVDD, KNNDD and RBNC by optimal classifier selection; (9) a boosting ensemble of 21 FISHERC classifiers; (10) a boosting ensemble of 21 QDC classifiers; (11) a boosting ensemble of 21 DT classifiers; (12) a boosting ensemble of 21 naive Bayes classifiers; (13) a bagged ensemble of 21 QDC classifiers; (14) a bagged ensemble of 21 DT classifiers; (15) MLP; (16) NUSVM; (17) SVM; (18) an ensemble of the 6 classifiers PARZENC, FISHERC, QDC, SVDD, KNNDD and RBNC combined by a majority-vote consensus function.
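The sketch announced above: a minimal Python rendering of ModifiedBagging as described in this section and in Fig. 4, using scikit-learn decision trees and soft voting across the T trees. Whether the majority-class draw uses replacement is not stated in the text, so sampling without replacement is an assumption here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ModifiedBagging:
    """T balanced subsets, each the whole minority class plus an equal-sized
    random draw from the majority class; one decision tree per subset."""
    def __init__(self, T=25, seed=0):
        self.T = T
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y, minority_label=1):
        min_idx = np.flatnonzero(y == minority_label)
        maj_idx = np.flatnonzero(y != minority_label)
        for _ in range(self.T):
            e = self.rng.choice(maj_idx, size=len(min_idx), replace=False)
            s = np.concatenate([min_idx, e])          # S_i = S_min U E_i
            self.trees.append(DecisionTreeClassifier().fit(X[s], y[s]))
        return self

    def predict_proba(self, X):
        # soft vote: average the class-probability outputs of the T trees
        return np.mean([t.predict_proba(X) for t in self.trees], axis=0)
```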
5. Experiments and results

In this article, we aim to help physicians by providing a machine learning system to diagnose cancer in patients.
5.1. Dataset

The first tested dataset is a real set collected from a hospital [6]. This dataset contains the information of 369 clients, 17 of whom are diagnosed as breast cancer patients and 352 of whom are healthy; that is, the dataset contains 352 negative (majority) samples and 17 positive (minority) samples. The maximum number of features extracted for these samples is 26. The feature values are non-numeric and were coded to numeric values to facilitate the implementation. After coding, each feature was normalized so that its values lie in the range [0, 1]. Normalization is calculated by Eq. (8):
$$nf_{x,i} = \frac{f_{x,i}}{\max_y f_{y,i} + \min_y f_{y,i}} \quad (8)$$
In Eq. (8), $f_{x,i}$ is the ith feature of the xth sample and $nf_{x,i}$ is its normalized value. The second dataset is a real one that had already been collected. It contains the information of 1282 clients: 120 of them are cardiovascular patients who did not have the opportunity to be treated, and 1162 are cardiovascular patients who were treated. That is, the dataset contains 1162 negative (majority) samples and 120 positive (minority) samples. The maximum number of features extracted for these samples is 72, of which only 52 were kept after preprocessing (the feature selection phase); the rest were discarded. Some feature values are missing (undefined) and were set to numerical values to facilitate the implementation, as follows. Let the label of the jth data point be denoted by $l_j$. If the ith feature of the jth data point is missing, we first select all the data points that share the label $l_j$, and then use the average of their ith feature as the ith feature of the jth data point (Duda and Hart, 1973). The third dataset, like the second, is a real collection of cardiovascular patients that had already been gathered. It consists of the information of 11,541 clients: 700 of them are cardiovascular patients who did not have the opportunity to be treated and 10,841 are cardiovascular patients who did. In other words, the dataset includes 10,841 negative (majority) samples and 700 positive (minority) samples. The maximum number of features extracted for these samples is 86; after preprocessing (the feature selection phase), only 79 features were kept and the rest discarded. Some feature values are missing (undefined) and were set to numerical values as described above.
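The preprocessing just described can be sketched as follows; normalize implements Eq. (8) exactly as printed (note that min-max scaling would instead subtract the minimum and divide by the range), and impute_class_mean fills each missing value with the class-conditional feature mean. Both helpers and their names are ours.

```python
import numpy as np

def normalize(F):
    """Eq. (8): divide each feature column by (max + min) over the dataset."""
    return F / (F.max(axis=0) + F.min(axis=0))

def impute_class_mean(F, labels):
    """Replace each NaN with the mean of that feature over samples sharing
    the same class label, as described for datasets 2 and 3."""
    F = F.copy()
    for c in np.unique(labels):
        rows = labels == c
        col_means = np.nanmean(F[rows], axis=0)
        nan_r, nan_c = np.where(np.isnan(F) & rows[:, None])
        F[nan_r, nan_c] = col_means[nan_c]
    return F
```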
Fig. 5. ROC curve of the proposed method with 25 DTs.
5.2. Experiments

In this paper, the experimental data were trained with the decision tree learning algorithm, multi-layer neural networks and the proposed algorithm. The Decision Tree (DT) used in this paper is a decision tree with the Gini index; the Gini index threshold is set to 2 throughout the paper. K-Nearest Neighbors (KNN) is another classifier used, with K set to 5. We also use an Artificial Neural Network (ANN) as one of our base learners. All ANNs used are MLPs with two hidden layers; the numbers of neurons in the first and second hidden layers are 10 and 5, respectively, with "linear" and "tangent-sigmoid" activation functions. All parameters of the MLPs and DTs are kept fixed throughout the experiments. In the next step, the above experiments were repeated with ensembles of classifiers and the results are presented. Finally, the proposed method is compared with the EasyEnsemble method.

5.3. Results

In this section, the results are provided and elaborated in three stages, as follows.

5.3.1. Results of the experiments on the first dataset

Table 1 shows the results of the first stage of tests. As seen in Table 1, although the recognition rate of the simple methods (DT, KNN and MLP) is very high, they are not effective: although their overall validity is acceptable, they cannot diagnose the patients. This is not unexpected, because these classifiers gain very high accuracy by assigning almost all data to the same class. Looking at columns 5 and 6, it is clear that the diagnostic performance of these classifiers is much higher when sub-sampling (the proposed algorithm) is used, while their overall accuracy decreases significantly. As expected, the DT classifier has a significant advantage over the MLP classifier. These results were also not unexpected: previous work shows that sub-sampling reduces the efficiency of neural networks relative to decision trees [13], and that neural networks have a better recognition rate than decision trees in imbalanced environments but lower precision and F-measure [13]. As seen in Table 2, using an ensemble without the proposed approach for balancing the training data does not solve the problem.
Table 1. Results of experiments with one classifier using the leave-one-out (LOO) technique.

| Evaluation measure | DT | MLP | NN | ModifiedBagging (T = 1, DT) | ModifiedBagging (T = 1, MLP) |
| TP | 1/17 = 5.88 | 0/17 = 0.00 | 4/17 = 23.53 | 10/17 = 58.82 | 4/17 = 23.53 |
| FP | 0/352 = 0.00 | 0/352 = 0.00 | 7/352 = 1.99 | 82/352 = 23.30 | 116/352 = 32.95 |
| TN | 352/352 = 100.0 | 352/352 = 100.0 | 345/352 = 98.01 | 270/352 = 76.70 | 236/352 = 67.05 |
| FN | 16/17 = 94.12 | 17/17 = 100.0 | 13/17 = 76.47 | 7/17 = 41.18 | 13/17 = 76.47 |
| Recognition rate | 353/369 = 95.66 | 352/369 = 95.39 | 349/369 = 94.58 | 280/369 = 75.88 | 240/369 = 65.04 |
| Precision | 100.00 | ∝ (50) | 36.36 | 71.63 | 41.66 |
| Recall | 5.88 | 0.00 | 23.53 | 58.82 | 23.53 |
| F-measure | 7.14 | 0.00 | 28.57 | 64.60 | 30.07 |
| Accuracy | 52.94 | 50.00 | 60.77 | 67.76 | 45.29 |
Table 2. Results of experiments with multiple classifiers using the LOO technique.

| Evaluation measure | ModifiedBagging (T = 25, DT, best cutting) | ModifiedBagging (T = 25, MLP, best cutting) | ModifiedBagging (T = 25, DT, median cutting) | ModifiedBagging (T = 25, MLP, median cutting) |
| TP | 1/17 = 5.88 | 0/17 = 0.00 | 13/17 = 76.47 | 11/17 = 64.71 |
| FP | 0/352 = 0.00 | 0/352 = 0.00 | 71/352 = 20.17 | 116/352 = 32.95 |
| TN | 352/352 = 100.0 | 352/352 = 100.0 | 281/352 = 79.83 | 236/352 = 67.05 |
| FN | 16/17 = 94.12 | 17/17 = 100.0 | 4/17 = 23.53 | 6/17 = 35.29 |
| Recognition rate | 353/369 = 95.66 | 352/369 = 95.39 | 294/369 = 79.67 | 247/369 = 66.94 |
| Precision | 100.00 | ∝ (50) | 79.12 | 66.63 |
| Recall | 5.88 | 0.00 | 76.47 | 64.71 |
| F-measure | 7.14 | 0.00 | 77.77 | 65.66 |
| Accuracy | 52.94 | 50.00 | 78.15 | 65.88 |
Fig. 6. ROC curve of the proposed method with 25 MLPs.
However, applying the proposed method improves the performance significantly. This is not unexpected: as all references make clear, simple methods (even classic ensemble methods) cannot succeed on imbalanced problems [8,9,13,14]. Fig. 5 shows the ROC curve of the proposed algorithm with T = 25 and DT as the base classifier. The lack of sufficient data is why the presented ROC curves are not smooth. According to Fig. 5, by selecting a better cutting level, the accuracy can be increased to some extent while keeping TP high; this improvement is not very large, because the accuracy increases only as TP is reduced. The above experiments indicate that the accuracy of the proposed method is much better with an ensemble of T = 25. The efficiency of simple classifiers, or of ensembles with T = 25 trained on all samples of the training dataset, is not comparable with the ensembles of the proposed method. Another point is that using the DT classifier as the base classifier is more efficient than the MLP neural network. For a better comparison, Fig. 6 illustrates the ROC curve of the proposed algorithm with T = 25 and multilayer neural networks.
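One plausible way to pick such a "cutting level" programmatically is to maximize TPR - FPR (Youden's J) along the ROC curve; the paper does not specify its criterion for the "best cutting", so this is only an illustrative choice.

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_cutting_level(y_true, scores):
    """Return the threshold on the ensemble's output scores that maximises
    TPR - FPR (Youden's J), together with the TPR and FPR at that point."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    i = np.argmax(tpr - fpr)
    return thresholds[i], tpr[i], fpr[i]
```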
Table 3. Comparison of the proposed method with other methods such as EasyEnsemble.

| Evaluation measure | EasyEnsemble of 25 classifiers (mean ROC cutting) | BalanceCascade of 25 classifiers (mean ROC cutting) | ModifiedBagging of 25 classifiers (mean ROC cutting) | SMOTE-Tomek method | CBO method |
| TP | 3/17 = 17.65 | 5/17 = 29.41 | 13/17 = 76.47 | 6/17 = 35.29 | 4/17 = 23.53 |
| FP | 31/352 = 8.81 | 43/352 = 12.22 | 71/352 = 20.17 | 51/352 = 14.49 | 93/352 = 26.42 |
| TN | 321/352 = 91.19 | 309/352 = 87.78 | 281/352 = 79.83 | 309/352 = 85.51 | 259/352 = 73.58 |
| FN | 14/17 = 82.35 | 12/17 = 70.59 | 4/17 = 23.53 | 11/17 = 64.71 | 13/17 = 76.47 |
| Recognition rate | 324/369 = 87.80 | 314/369 = 85.09 | 294/369 = 79.67 | 315/369 = 85.37 | 263/369 = 71.27 |
| Precision | 66.70 | 70.44 | 79.13 | 70.90 | 47.11 |
| Recall | 17.65 | 29.41 | 76.47 | 35.29 | 23.53 |
| F-measure | 27.91 | 41.50 | 77.78 | 47.13 | 31.38 |
| Accuracy | 54.42 | 58.60 | 78.15 | 60.40 | 48.55 |
Fig. 7. Fisher criterion derived from the output of an ensemble of DTs in terms of number of DTs.
To compare the EasyEnsemble algorithm with the proposed method, we note that satisfactory results were not obtained by applying EasyEnsemble with the simple linear classifiers used in [3]. Comparing the proposed algorithm with EasyEnsemble in Table 3, we conclude that the accuracy of EasyEnsemble is weak on datasets where the minority class is very small, so it is crucial not to follow boosting methods on such datasets. Given the long training time the EasyEnsemble algorithm needs, it can be argued that the proposed method is better than EasyEnsemble in both performance and learning speed on datasets similar to those used in this paper. In addition, a general framework is proposed for obtaining a learning model on such data. Looking at the last two columns of Table 3, the method based on ensemble sampling, over-sampling of the minority class and cleansing of Tomek links performs better than the EasyEnsemble and BalanceCascade methods. In the last column, the CBO method [10], an over-sampling method, leads to poorer results than the BalanceCascade and EasyEnsemble algorithms. This is not far from our expectations either, because it is established in the literature that over-sampling techniques work well when the number of minority-class samples is not small, a condition violated in our problem [9]. Perhaps the most important reason for the failure of the EasyEnsemble method (and BalanceCascade) is that with very little data, boosting is not only meaningless but actually acts as a deceptive factor. Fig. 7 shows the effect of the number of participating decision trees on the efficiency of the proposed method; the performance criterion in this example is the Fisher index. As can be seen, when the number of classifiers reaches 20, the performance reaches its peak; increasing the number of classifiers in the ensemble beyond 20 has little impact on its performance.
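The paper does not give a formula for the Fisher index of Fig. 7; a standard two-class Fisher discriminant ratio on the ensemble's output scores would look like this, and using it here is an assumption on our part.

```python
import numpy as np

def fisher_criterion(scores, y, minority_label=1):
    """Two-class Fisher discriminant ratio of the ensemble's output scores:
    (mu1 - mu0)^2 / (var1 + var0). The definition is assumed, not the
    paper's; it measures how well the scores separate the two classes."""
    s1 = scores[y == minority_label]
    s0 = scores[y != minority_label]
    return (s1.mean() - s0.mean()) ** 2 / (s1.var() + s0.var())
```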
5.3.2. Results of the experiment on the second dataset

Fig. 8 shows the results of methods 1 through 18, the EE method and the MB method on the second dataset, without deleting missing values or removing meaningless features. In this figure, EE is EasyEnsemble with 21 DT classifiers and MB is ModifiedBagging with 21 DT classifiers. Clearly, these results are not suitable, especially for a two-class problem, where random prediction itself has a performance near the same values. Fig. 9 shows the corresponding results for this dataset after applying PCA; clearly, these results are not good either, especially for a two-class problem.
Fig. 8. The Results on the second set without deleting the missing values.
Fig. 11. After the stage of eliminating the missing data and without applying PCA.
Fig. 9. The results of data after applying PCA.
Fig. 12. Box plot.
Fig. 13. The adjusted results.
Fig. 10. After two steps of eliminating the missing values and applying PCA.
Clearly, these results are even worse than the previous ones. After removing the missing values and applying PCA, we reach the following conclusion: as Fig. 10 shows, these results are still worse than the previous ones. Next, the results on this dataset are obtained after eliminating the missing data and without applying PCA (Figs. 10 and 11). Clearly, the results are better now; the proposed method performs better than all the other methods. We now turn to the statistical analysis of these results. First, we obtain the box plot of the accuracies of the methods, illustrated in Fig. 12; the vertical axis shows the method number and the horizontal axis the accuracy of the method. The 19th method is EE. As can be seen, the SVM, EE and MB methods all have the same distribution. Next, Fig. 13 shows the results based on AUC; they confirm the same conclusions, with the SVM, MB and EE methods giving the best performance, in that order. Again, the vertical axis shows the method number and the horizontal axis the accuracy of the method; the 19th method is EE.
Fig. 14. Dispersion based on AUC.
Fig. 14 further shows that the SVM, EE and MB methods have the same dispersion based on AUC; the vertical axis shows the method number and the horizontal axis the AUC of the method. Fig. 15 was obtained by applying a paired t-test to the accuracies of the various methods. If the entry in the ith row and jth column equals one, the ith method significantly outperforms the jth method; if it equals minus one, the ith method significantly underperforms the jth method; otherwise the difference between the two methods is not significant. By counting the ones in a row and subtracting the number of minus ones, we obtain the score of a method. Note that the 19th and 20th methods are MB and EE, respectively. Sorting the methods by their scores, the 19th, 17th and 20th methods are the best. The ROC curves of the best methods on this dataset are displayed in Fig. 16; the figure shows the clear superiority of the proposed method, followed by the 17th and then the 20th method.
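The scoring rule just described can be computed directly from the ±1 matrix of Fig. 15; a small sketch, where M is assumed to be the pairwise win/loss matrix that the paper shows only as a figure.

```python
import numpy as np

def method_scores(M):
    """Score each method from the paired t-test matrix of Fig. 15:
    M[i, j] = +1 if method i significantly beats method j, -1 if it
    significantly loses, 0 otherwise. Score = wins minus losses per row."""
    M = np.asarray(M)
    return (M == 1).sum(axis=1) - (M == -1).sum(axis=1)

# ranking methods best-first: np.argsort(-method_scores(M))
```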
5.3.3. Results of the experiment on the third dataset

Following the results of the previous subsections, we carry out the last stage of this study. Fig. 17 shows the results of methods 1 through 18 and of the EE, BC, CBO and MB methods on the third dataset, after deleting missing data and removing unnecessary features.
Fig. 16. ROC curves of the best methods.
Fig. 17. The results on the third dataset with deleting missing data.
Fig. 15. The results of paired t-test on accuracies of various methods.
Table 4. Comparison of the proposed method with state-of-the-art methods.

| Evaluation measure | IESOM [21] | KernelADASYN [20] | ModifiedBagging |
| TP | 6/17 = 35.29 | 10/17 = 58.82 | 13/17 = 76.47 |
| FP | 14/352 = 3.98 | 37/352 = 10.51 | 71/352 = 20.17 |
| TN | 338/352 = 96.02 | 315/352 = 89.49 | 281/352 = 79.83 |
| FN | 11/17 = 64.71 | 7/17 = 41.18 | 4/17 = 23.53 |
| Recognition rate | 344/369 = 93.22 | 325/369 = 88.08 | 294/369 = 79.67 |
| Precision | 89.87 | 84.84 | 79.13 |
| Recall | 35.29 | 58.82 | 76.47 |
| F-measure | 50.68 | 69.47 | 77.78 |
| Accuracy | 65.66 | 74.16 | 78.15 |
In Fig. 17, EE stands for the EasyEnsemble method with 21 DT classifiers, BC stands for the BalanceCascade method with 21 DT classifiers, CBO is the clustering-based over-sampling algorithm with 21 DT classifiers, and MB is ModifiedBagging with 21 DT classifiers. The MB method turns out to be the best. The 10th method, the boosting ensemble of 21 QDC classifiers, is second best, and the 11th method, the boosting ensemble of 21 DT classifiers, is third best. Fig. 18 illustrates the box plots of these methods; looking at the figure, the MB method is the best (because it has minimal variance and the highest ceiling), while the 10th and 11th methods are the second and third best, respectively. The ROC curves of the superior methods, i.e., the 10th and 11th methods and the EE, BC, CBO and MB methods, are provided in Fig. 19. As the figure shows, the ROC curve of the MB method is the best, followed in turn by the curves of the 10th and 11th methods, the CBO method and the EE method. Table 4 compares our method with some state-of-the-art methods in the field of imbalanced learning.

6. Conclusion
Fig. 18. Box plots of the different methods on the third dataset.
In this paper, a new method was presented for imbalanced learning. This type of learning targets datasets in which the minority class is much smaller than the majority one. The method was applied to the breast cancer detection problem. The inability of simple classic learning techniques to learn from this type of dataset (imbalanced cancer datasets) was also shown. In addition, due to the lack of minority class data, special-purpose imbalanced learning methods also underperform. The results of this research can be used in the field of medicine for screening people: considering an individual's characteristics and history in health centers, automated methods can be designed to identify high-risk patients using this method. This can help diagnose and treat disease early, bringing significant savings in health care costs.

Acknowledgement

We thank Yasooj Branch, Islamic Azad University, Yasooj, Iran, for supporting this research.
Fig. 19. ROC curves of the best methods.
References
[1] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[2] B. Minaei-Bidgoli, H. Parvin, H. Alinejad-Rokny, H. Alizadeh, W. Punch, Effects of resampling method and adaptation on clustering ensemble efficacy, Artif. Intell. Rev. 41 (1) (2014) 27–48.
[3] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst. Man Cybern. Part B Cybern. (2009).
[4] J. Zhang, I. Mani, KNN approach to imbalanced data distributions: a case study involving information extraction, in: Proceedings of the International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets, 2003.
[5] M. Hamzei, M.R. Kangavari, Learning from Imbalanced Data, Technical Report, Iran University of Science and Technology, Iran, 2010.
[6] F. Minaei, M. Soleimanian, D. Kheirkhah, Investigation of the relationship between risk factors of occurrence of breast tumor in women, Aranobidgol, Iran, in: 3rd National Conference on Data Mining, Kashan (in Persian), 2009.
[7] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[8] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: Proceedings of the International Joint Conference on Neural Networks, 2008, pp. 1322–1328.
[9] G.E.A.P.A. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 20–29.
[10] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 40–49.
[11] N.V. Chawla, A. Lazarevic, L.O. Hall, K.W. Bowyer, SMOTEBoost: improving prediction of the minority class in boosting, in: Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 2003, pp. 107–119.
[12] H. Parvin, H. Alinejad-Rokny, S. Parvin, A classifier ensemble of binary classifier ensembles, Int. J. Learn. Manag. Syst. 1 (2) (2013) 37–47.
[13] D. Thammasiri, D. Delen, P. Meesad, N. Kasap, A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition, Expert Syst. Appl. 41 (2014) 321–330.
[14] H. Li, J. Sun, Forecasting business failure: the use of nearest-neighbour support vectors and correcting imbalanced samples-evidence from the Chinese hotel industry, Tourism Manag. 33 (3) (2012) 622–634.
[15] O. Olabode, B.T. Olabode, Cerebrovascular accident attack classification using multilayer feed forward artificial neural network with back propagation error, J. Comput. Sci. 8 (1) (2012) 18–25.
[16] R. Chitra, V. Seenivasagam, Review of heart disease prediction system using data mining and hybrid intelligent techniques, ICTACT J. Soft Comput. 3 (4) (2013) 605–609.
[17] L. Lam, Classifier combinations: implementations and theoretical issues, in: J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 1857, Springer, Cagliari, Italy, 2000, pp. 78–86.
[18] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory undersampling for class imbalance learning, in: Proceedings of the International Conference on Data Mining, 2006, pp. 965–969.
[19] S. Li, B. Tang, H. He, An imbalanced learning based MDR-TB early warning system, J. Med. Syst. 40 (7) (2016) 164:1–164:9.
[20] B. Tang, H. He, KernelADASYN: kernel based adaptive synthetic data generation for imbalanced learning, in: Proceedings of the Congress on Evolutionary Computation (CEC), 2015, pp. 664–671.
[21] Q. Cai, H. He, H. Man, Imbalanced evolving self-organizing learning, Neurocomputing 133 (2014) 258–270.
[22] H. Parvin, M. MirnabiBaboli, H. Alinejad-Rokny, Proposing a classifier ensemble framework based on classifier selection and decision tree, Eng. Appl. Artif. Intell. 37 (2015) 34–42.
[23] H. Parvin, B. Minaei-Bidgoli, H. Alinejad-Rokny, A new imbalanced learning and decision tree method for breast cancer diagnosis, J. Bionanosci. 7 (6) (2013) 673–678.
[24] H. Parvin, B. Minaei-Bidgoli, H. Alinejad-Rokny, W.F. Punch, Data weighing mechanisms for clustering ensembles, Comput. Electr. Eng. 39 (5) (2013) 1433–1450.
[25] H. Parvin, H. Alinejad-Rokny, N. Seyedaghaee, S. Parvin, A heuristic scalable classifier ensemble of binary classifier ensembles, J. Bioinf. Intell. Control 1 (2) (2013) 163–170.
[26] H. Parvin, H. Alinejad-Rokny, B. Minaei-Bidgoli, S. Parvin, A new classifier ensemble methodology based on subspace learning, J. Exp. Theor. Artif. Intell. 25 (2) (2013) 227–250.
[27] H. Parvin, H. Alinejad-Rokny, S. Parvin, Divide and conquer classification, Austr. J. Basic Appl. Sci. 5 (12) (2011) 2446–2452.
[28] H. Parvin, H. Alinejad-Rokny, M. Asadi, An ensemble based approach for feature selection, J. Appl. Sci. Res. 7 (9) (2011) 33–43.
[29] R. Barandela, J.S. Sánchez, V. García, E. Rangel, Strategies for learning in class imbalance problems, Pattern Recogn. 36 (3) (2003) 849–851.
[30] R. Batuwita, V. Palade, FSVM-CIL: fuzzy support vector machines for class imbalance learning, IEEE Trans. Fuzzy Syst. 18 (3) (2010) 558–571.
[31] Y. Sun, M.S. Kamel, Y. Wang, Boosting for learning multiple classes with imbalanced class distribution, in: Proceedings of the Sixth International Conference on Data Mining (ICDM'06), IEEE, December 2006, pp. 592–602.
[32] D. Greene, A. Tsymbal, N. Bolshakova, P. Cunningham, Ensemble clustering in medical diagnostics, in: Proceedings of the Seventeenth IEEE Symposium on Computer-Based Medical Systems (CBMS 2004), IEEE, June 2004, pp. 576–581.
[33] V. Singh, L. Mukherjee, J. Peng, J. Xu, Ensemble clustering using semidefinite programming with applications, Mach. Learn. 79 (1) (2010) 177–200.

S. Nejatian obtained a Bachelor's degree in Electrical Engineering. He received the Master's degree (M.Eng) in Telecommunication Technology and the PhD degree in Data Communication from University Technology Malaysia in 2008 and 2014, respectively. He holds an assistant professor position at the Faculty of Electrical Engineering, Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in cognitive radio networks, software defined radio and wireless sensor networks. He is a registered member of professional organizations such as IEEE and IET.
Eshagh Faraji is a PhD student in the Electrical Engineering Department of Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in the areas of data mining, artificial intelligence and dispatching.
H. Parvin received a B.E. degree from Shahid Chamran University, Ahvaz, Iran, in 2006 and an M.S. degree from Iran University of Science and Technology, Tehran, Iran, in 2008. From 2008 to 2013, he worked in the Data Mining Research Lab, Iran University of Science and Technology, Tehran, Iran. He then received his Ph.D. degree from Iran University of Science and Technology, Tehran, Iran. His research interests include data mining, machine learning and ensemble learning.