Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification

Samad Nejatian a,b, Hamid Parvin c,d,∗, Eshagh Faraji d,e

a Department of Electrical Engineering, Yasooj Branch, Islamic Azad University, Yasooj, Iran
b Young Researchers and Elite Club, Yasooj Branch, Islamic Azad University, Yasooj, Iran
c Department of Computer Engineering, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran
d Young Researchers and Elite Club, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran
e Department of Electrical Engineering, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran
Article history: Received 26 January 2016; Revised 13 April 2017; Accepted 10 June 2017; Available online xxx

Keywords: Imbalanced learning; Neural networks; Decision tree; Cancer diagnosis
Abstract

Abundant patient data are recorded within the health care system. Through data mining, useful knowledge and hidden patterns can be extracted from these data. The discovered knowledge can be used by physicians and health care managers to improve the quality of their services and to reduce the number of medical errors. Since it is difficult to diagnose or predict diseases with a single data mining algorithm, this research combines the advantages of several algorithms in order to achieve better results. Most standard learning algorithms are designed for balanced data (data with roughly the same number of samples in each class), where the cost of misclassification is the same for all classes. These algorithms cannot properly represent the data distribution when the dataset is imbalanced. In some cases, the cost of misclassifying a sample of a particular class can be very high, such as wrongly labeling cancerous patients as healthy. This article presents a fast and efficient way to learn from imbalanced data, especially suited to datasets with very few minority-class samples. Experiments show that the proposed method is more efficient than traditional machine learning algorithms as well as several learning algorithms designed specifically for imbalanced data. In addition, it has lower computational complexity and a shorter running time.
1. Introduction

Data mining methods can help predict diseases automatically with a high accuracy rate. Moreover, the additional costs of irrelevant clinical trials are reduced through this process. It also reduces wrong predictions caused by human fatigue and consequently improves the quality of services. Data mining methods that have been successfully applied to medical data include neural networks, decision trees (DT), association rule mining, Bayesian networks, support vector machines (SVM) and clustering. Depending on the application, one of these methods will be more useful than the others. However, it is very hard to choose a single data mining algorithm that is suitable to diagnose or predict all diseases.
∗ Corresponding author at: Department of Computer Engineering, Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran. E-mail addresses: [email protected], [email protected] (H. Parvin).
Some algorithms are better than others for certain purposes; bringing the advantages of several algorithms together results in better performance. Performance criteria will be discussed later in this study. In any case, it is almost impossible to choose the single best data mining method to predict diseases for a specific criterion such as accuracy, sensitivity or specificity. Analyzing the data and the confusion among them is a problem that prevents remarkable diagnostic results, because the knowledge within the data must be used properly. In fact, data mining is a response to the needs of health care organizations: the more data there are and the more complex their relations, the more difficult it is to access the hidden information within them. It is often assumed that the class distribution is balanced or nearly balanced, and, in general, that the cost of misclassification is the same for all classes. When the dataset is imbalanced, such algorithms cannot properly capture the data distribution characteristics; in a sense, they tend to assign an unknown sample to the most frequent classes, and as a result they provide unacceptable accuracy across the data classes.
Fig. 1. (A) An imbalanced interclass dataset (left). (B) Dataset with high complexity, intra-class and interclass imbalance, multi-concept, overlapping of classes, noise (right).
An imbalanced dataset is any dataset exhibiting a severely unequal distribution among its classes. This type of imbalance is called inter-class imbalance (for example, a one-to-one-thousand (1:1000) distribution, where one class completely overwhelms the other). The imbalanced distribution is not necessarily between two classes; it may hold among several classes. In the literature, a dataset in which one class exceeds a 65% share may already be considered imbalanced [14,19,23,24]. The distributions of many real datasets are imbalanced, so learning algorithms must be modified in order to extract knowledge from them. One example of such imbalanced datasets is the data of patients with breast cancer. These data are often labeled with positive (cancer) and negative (healthy) classes. As expected, the number of healthy people is much higher than the number of cancer patients. Therefore, a classifier is required that achieves appropriate and balanced prediction accuracy for both the minority and the majority class. Since diagnosing a cancer patient as a healthy individual is unacceptable (and similarly, diagnosing a healthy person as a patient), modified classifiers are required to build decision support systems. The applied classifiers must provide high validity for the minority class without harming the validity of the majority class. For example, healthy samples may be diagnosed 100% correctly while the correct classification accuracy for patients is only 10%, so a patient sample is very likely to be misdiagnosed. In this regard, single evaluation criteria such as overall accuracy and error rate clearly do not provide enough information about the quality of imbalanced learning. When the imbalance is a direct result of the nature of the data space, it is called intrinsic imbalance. Imbalance is not always intrinsic, however; it can also be relative, that is, the number of minority samples is naturally large but very low compared to the majority class. Data complexity is another important issue, covering class overlapping, missing data and so on. This concept is shown in Fig. 1, where the stars and circles represent the minority and majority classes, respectively. Both distributions shown in parts (A) and (B) are imbalanced, but part (B) additionally exhibits sample overlapping and multiple concepts. In part (B), sub-concept C may not be learned because of the lack of data. Another form of imbalance is intra-class imbalance, which concerns the distribution of data representing sub-concepts within a class. In Fig. 1(B), B and C represent the dominant concept and a sub-concept of the minority class, respectively, while A and D are the dominant concept and a sub-concept of the majority class, respectively.
For each class, the samples in the dominant cluster of that class overwhelm those of the sub-concepts. As is clear, this data space exhibits both inter-class and intra-class imbalance. In this paper, we present a new method to classify imbalanced training data, and we compare it with standard methods such as the nearest neighbor classifier, the decision tree and the multi-layer perceptron neural network (MLP). In the following, we review the literature and introduce related work in this area. Then, we examine the evaluation criteria for these methods and the classification test procedure. Finally, we discuss the results of the tests and conclude the paper. In general, the contributions of this article include:

• A new method for learning from imbalanced data.
• An efficient method to be used in a decision support system for breast cancer diagnosis.
• The results of the proposed method on a real breast cancer dataset.
• A method for the diagnosis of cardiovascular patients.

2. Related works

In this section, we review the literature on the topic and previous work. In this paper, the training set and the number of its samples are denoted by $S$ and $m$: $S = \{(x_i, y_i)\,|\, i = 1, \ldots, m\}$, where $x_i \in X$ is a sample in the $n$-dimensional feature space $X = \{(f_1, f_2, \ldots, f_n)\,|\, f_i \in \mathbb{R}\}$ and $y_i \in Y = \{1, \ldots, c\}$ is the class label associated with the sample $x_i$; for example, $c = 2$ indicates a two-class problem. $S_{min}$ and $S_{max}$ are the sample sets of the minority and majority classes; their union is the training set and their intersection is empty. We also denote by $E$ a set sampled from $S$. As discussed earlier, when a standard learning algorithm is applied to imbalanced data, the minority class is often not learned well, because the induced rules describing the minority concept are often much weaker than those describing the majority concept. To illustrate the effect of the imbalanced learning problem on standard learning algorithms, consider the generic decision tree algorithm. A decision tree is built by a recursive top-down greedy search which uses a feature selection criterion to choose the best feature as the splitting criterion at each node. Nodes are then created for the possible values of the splitting feature. As a result, at each stage the training set is divided into smaller subsets, which yield separate rules for the class concepts. Finally, these rules are combined into the hypothesis that results in the lowest error rate over the classes. Using this process in the presence of imbalanced data causes problems in two directions. First, repeated partitioning of the data space leads to ever fewer observations of minority samples, which reduces the number of leaves describing the minority concept, and the resulting rules carry less confidence.
Secondly, concepts that depend on combinations of different features remain unlearned because of the sparseness caused by partitioning. The first issue relates to absolute and relative imbalance, while the second concerns between-class imbalance and high dimensionality. In any case, imbalanced data has a negative effect on decision tree classification performance. Later in this paper, a classifier proposed to overcome the effects of imbalanced data is presented and investigated in detail. In general, the solutions proposed for the imbalanced learning problem follow two directions. The first category changes the dataset in order to make it balanced; the other adapts the learning algorithms themselves so that they can learn from imbalanced data [5,22,23,29–31]. To classify cardiovascular disease, artificial neural networks with back-propagation of error have been used on a dataset of 100 medical records, 60 of men and 40 of women, with 16 input features used for prediction; the training speed is between 1.0 and 9.0, and the resulting degree of accuracy is measured [15]. One of the methods used to reduce the size and complexity of an algorithm is feature subset selection. Feature selection is the process of identifying and removing weak, irrelevant or redundant dimensions or features in a dataset; its purpose is to find the minimal feature subset such that the resulting data distribution stays close to the original one. For example, an optimal subset of features sufficient to predict heart disease can be obtained with a genetic algorithm: the number of features can be reduced from 13 to 6, which reduces the number of tests a patient must undergo [16]. Sampling methods for the imbalanced learning problem change an imbalanced dataset by some mechanism in order to obtain a balanced one, and studies on several basic classifiers have shown that such mechanisms improve results on imbalanced datasets. In the random over-sampling method, a set $E$ is sampled from $S_{min}$ and added to the dataset $S$; the number of minority samples thus grows by $|E|$, and the dataset moves toward balance. Using this method, it is possible to reach a classifier with an acceptable degree of balance. In the sub-sampling (random under-sampling) method, by contrast, data is removed from the dataset: a subset is randomly selected from the majority class $S_{max}$ and removed from $S$, which establishes balance in the dataset [1]. Although both sampling methods aim to improve imbalanced learning, each has problems of its own. In sub-sampling, samples removed from the majority class may carry important concepts, which are then lost. In over-sampling, overfitting may occur because of data replication. Another category of methods is informed sub-sampling; examples are the EasyEnsemble and BalanceCascade algorithms [2,3], whose purpose is to overcome the data-loss problem of the random methods.
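As a concrete illustration of the two plain random schemes just described, the following Python sketch balances a two-class dataset by duplicating minority samples or discarding majority samples. The function names and the equal-class-size target are our choices for illustration, not the paper's exact procedure.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, seed=0):
    """Randomly discard majority samples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]
```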
In the EasyEnsemble algorithm, an ensemble learning system is created by sampling several subsets from the majority class and building a classifier on the combination of each subset with the minority class data. The size of each subset drawn from the majority class equals the size of the minority class, and each draw is made randomly, with replacement. Another example of these methods uses a KNN classifier for sub-sampling [4,24,25]. A further sampling-based approach to imbalanced learning is hybrid sampling with data generation. An example is the Synthetic
Minority Over-Sampling Technique (SMOTE). This algorithm creates synthetic data based on feature-space similarities among the available minority samples: for each sample $x_i \in S_{min}$, a certain number of nearest neighbors (by Euclidean distance) is determined, and a new pattern is created from a relation between the specified points [7]. Another proposed method in this area is adaptive combined sampling. The previous methods generate the same number of synthetic samples for each minority sample without considering the neighboring samples, which may increase class overlapping. Different adaptive methods have been introduced to overcome this problem; examples are the ADASYN and borderline-SMOTE algorithms [8]. Sampling with data cleansing is another proposal for imbalanced learning. Data cleansing has been introduced to remove the overlapping introduced by sampling methods. One of the cleansing tools is the Tomek link [9], defined as a pair of nearest neighbors from opposite classes. If two samples form a Tomek link, either one of them is noise or both are close to the decision border. One application of these links is to eliminate undesirable class overlapping after combining: links are removed until all nearest-neighbor pairs belong to the same class. Fig. 2 shows how SMOTE works and how the Tomek links are then determined and removed. Clustering-based sampling is another approach to the imbalanced learning problem. One proposed algorithm is cluster-based over-sampling (CBO), which uses the K-means clustering algorithm [10]. In this algorithm, first k samples are selected from each cluster and their average feature vector is computed to determine the cluster centers; then the Euclidean distance of each sample to each cluster center is calculated, the sample is assigned to the cluster with the nearest center, and the cluster center is updated. In the CBO algorithm, over-sampling expands all clusters of the majority class to the size of the largest majority cluster; then over-sampling is applied to the clusters of the minority class, increasing their size. Ultimately, a strong representation of small concepts is obtained in the final dataset. Finally, a series of algorithms combine sampling with boosting, an ensemble technique; examples are the SMOTEBoost and DataBoost-IM algorithms [11]. Unlike sampling methods, which seek to balance the class distribution, cost-sensitive learning methods consider the cost of misclassifying samples: a solution to imbalanced learning is obtained by creating a cost matrix for the misclassification of each sample. These methods are not relevant to the application in this paper and are not discussed further. In recent years, methods such as the one-class SVM and SVDD have been proposed. In particular, Raskutti and Kowalczyk suggested that one-class learning is especially suited to imbalanced datasets with high-dimensional feature spaces.
In addition, Japkowicz presented an approach in which an autoassociator is trained to reconstruct the positive class at its output layer, and suggested that under certain circumstances, for example in multi-modal domains, one-class learning may perform better than other methods [17]. A recognition-based approach has also been examined against discrimination-based techniques; the authors recommended recognition-based methods for highly imbalanced datasets [17], while decision tree classification is suitable for relatively balanced datasets [18].
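As a rough sketch of the clustering-based over-sampling idea described above, the following simplified CBO variant clusters each class with K-means and replicates samples until every cluster reaches the size of the largest cluster. This is a simplified reading of CBO [10]; the expansion targets in the original algorithm differ slightly, so treat this as illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cbo_oversample(X, y, k=3, seed=0):
    """Simplified CBO: cluster each class with K-means, then replicate
    samples so every cluster reaches the size of the largest cluster."""
    rng = np.random.default_rng(seed)
    clusters = {}
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10,
                    random_state=seed).fit(X[idx])
        for c in range(km.n_clusters):
            clusters[(label, c)] = idx[km.labels_ == c]
    target = max(len(v) for v in clusters.values())   # largest cluster size
    parts = []
    for idx in clusters.values():
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]
```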
Fig. 2. (A) Basic dataset. (B) Dataset after applying SMOTE. (C) Tomek link. (D) Dataset after removing the links.
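The two operations illustrated in Fig. 2 can be sketched in a few lines of NumPy. The SMOTE interpolation rule and the mutual-nearest-neighbor definition of a Tomek link follow [7] and [9]; the helper names and the brute-force distance computation are our own choices for a small illustration.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """SMOTE: each synthetic point lies on the segment between a minority
    sample and one of its k nearest minority neighbours (Euclidean)."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]           # skip self at column 0
    base = rng.integers(len(X_min), size=n_new)      # random minority samples
    mate = nn[base, rng.integers(k, size=n_new)]     # one of their neighbours
    lam = rng.random((n_new, 1))                     # interpolation factor
    return X_min[base] + lam * (X_min[mate] - X_min[base])

def tomek_links(X, y):
    """Index pairs (i, j) that are mutual nearest neighbours with opposite
    labels -- the Tomek links removed in the cleansing step."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    return [(i, int(nn[i])) for i in range(len(X))
            if nn[int(nn[i])] == i and y[i] != y[nn[i]] and i < nn[i]]
```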
Finally, we note that although current efforts focus on two-class imbalanced problems, multi-class imbalanced learning problems also exist and are important.
3. Evaluation criteria for imbalanced learning

Given the growth of research in the field of imbalanced learning, criteria are needed for evaluating the effectiveness of imbalanced learning algorithms. In this section, we examine such evaluation criteria. The conventional criteria are the accuracy rate and the error rate. Although these are simple ways to describe the performance of a classifier on a dataset, they are not suitable for imbalanced data. Fig. 3 shows the confusion matrix from which these criteria are computed. For example, if a dataset contains 5% minority-class and 95% majority-class samples, a classifier that assigns all samples to the majority class attains an accuracy of 95% despite correctly diagnosing 0% of the minority-class samples.

Fig. 3. Confusion matrix.

Studying the confusion matrix, the first column contains the positive samples and the second column the negative samples, while the first row counts the samples the classifier labels as the minority class and the second row those it labels as the majority class. The ratio between the columns therefore reflects the class distribution of the dataset, and any criterion that uses values from both columns is inherently sensitive to imbalance. For example, the accuracy criterion uses both columns, so it changes as the class distribution changes even if the underlying performance of the classifier does not. Other evaluation criteria have been adapted to the imbalanced learning problem, including accuracy, precision, recall, F-measure and G-mean [1]. Accuracy is obtained from Eq. (1):

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

Precision is obtained from Eq. (2):

$$\text{precision} = \frac{TP}{TP + FP} \quad (2)$$

Recall is obtained from Eq. (3):

$$\text{recall} = \frac{TP}{TP + FN} \quad (3)$$

F-measure is obtained from Eq. (4):

$$F\text{-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (4)$$
The ROC evaluation method uses the two single-column criteria, the TP rate and the FP rate, and obtains a graph by plotting the TP rate against the FP rate. Each point in this space represents the efficiency of a classifier for one distribution, so the ROC diagram is a strong method for evaluating efficiency visually. In such cases, precision-recall diagrams can provide additional information for efficiency evaluation; these diagrams can be regarded as among the best representations of classification performance in imbalanced applications. Precision is inherently a criterion of exactness (of the samples labeled positive, how many are labeled correctly), while recall is a criterion of completeness (how many of the positive-class samples are labeled correctly). These two criteria, much like accuracy and error, have an inverse relation.

3.1. Possible classification error

The possibility of wrong labeling is another source of variation in results on test data. Several testing and training sets must therefore be used, and training must be repeated several times when the classifier has a random component. To compare two classifiers A and B, a test procedure is recommended consisting of K iterations, in each of which 33% of the data is used for testing and the rest for training. In each of the K iterations, we divide the data into training and testing parts; classification models A and B are first trained on the training part and then tested on the testing part, yielding the accuracies $p_A^1$ and $p_B^1$ of classifiers A and B, respectively. From the second random training/testing split, the estimates $p_A^2$ and $p_B^2$ are obtained, and so on. The differences are defined in Eq. (5):
$$P^i = p_A^i - p_B^i \quad (5)$$

The estimated average and variance over the $K$ runs of cross-validation yield the statistic in Eq. (6):

$$t = \frac{\bar{P}\sqrt{K}}{\sqrt{\sum_{i=1}^{K}\left(P^i - \bar{P}\right)^2 / (K-1)}} \quad (6)$$

where $\bar{P}$ is obtained from Eq. (7):

$$\bar{P} = \frac{1}{K}\sum_{i=1}^{K} P^i \quad (7)$$
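A minimal sketch of this resampled paired t-test, assuming SciPy's Student-t quantile function for the table lookup; acc_a and acc_b are the per-iteration accuracies of the two classifiers.

```python
import numpy as np
from scipy import stats

def paired_resampled_ttest(acc_a, acc_b, alpha=0.05):
    """Eqs. (5)-(7): t-test on per-iteration accuracy differences of two
    classifiers over K random 67/33 train/test splits."""
    p = np.asarray(acc_a) - np.asarray(acc_b)            # Eq. (5)
    k = len(p)
    p_bar = p.mean()                                     # Eq. (7)
    t = p_bar * np.sqrt(k) / np.sqrt(((p - p_bar) ** 2).sum() / (k - 1))  # Eq. (6)
    t_crit = stats.t.ppf(1 - alpha / 2, df=k - 1)        # e.g. 2.045 for K = 30
    return t, t_crit, abs(t) > t_crit                    # significant difference?
```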
We now look up the t-distribution table with $K - 1$ degrees of freedom. Under the null hypothesis, if the computed value of $t$ is less than the table value, the difference between the two classifiers is not significant; otherwise it is significant. Using this test procedure with 30 iterations, we obtained $t = 1.9796$, while the table value at significance level 0.05 with 29 degrees of freedom is 2.045. Because the computed value is lower than the table value, we cannot reject the null hypothesis: this test suggests that there is no significant difference in accuracy between the compared classifiers on our data.

4. The proposed method

The main structure of the proposed algorithm, named ModifiedBagging, is similar to that of the EasyEnsemble algorithm. Ensemble clustering has been used many times in medical problems [25–28,32,33]. In this algorithm, we first select a series of sub-samples from $S_{max}$, called $E_i$, where $|E_i| = |S_{min}|$. Then we define the subsets $S_i \subset S$ as $S_i = S_{min} \cup E_i$ and train a weak classifier, such as a decision tree, on each $S_i$; this classifier is denoted $DT_i$.
Fig. 4. The pseudo code of the proposed algorithm.
In the end, we treat all of these $DT_i$ as an ensemble. The pseudo code of the proposed algorithm is presented in Fig. 4. Although there are many approaches to improving imbalanced learning, as mentioned earlier, we consider only the category of sub-sampling algorithms in this article. In this category, the best representatives are the EasyEnsemble and BalanceCascade algorithms, which fall in the informed sampling category [2,3]. As shown in [3], these methods are superior to the others in both efficiency and training speed, and EasyEnsemble and BalanceCascade behave similarly in these two respects. Since both algorithms have very similar structures and the EasyEnsemble structure resembles the proposed algorithm, EasyEnsemble is the one compared against the proposed algorithm. The EasyEnsemble algorithm operates as follows. In the first step, it creates a random sub-sample in which all minority class data are present and in which the majority class data, selected at random, have the same cardinality as the minority class. The AdaBoost procedure is then applied to this sub-sample, and the resulting ensemble is called AdaBoost_1. In the next step, another sub-sample is drawn and the AdaBoost_2 ensemble is generated on it. After T steps, T ensembles have been acquired; it is like having an ensemble of several ensembles, where each base classifier is itself a powerful AdaBoost classifier. This method has a fundamental weakness: an ensemble is strong when its base classifiers are weak [12], because they must have diversity (that is why powerful, stable classifiers like SVM are rarely used in ensembles). Since EasyEnsemble, and likewise BalanceCascade, uses powerful base classifiers, these methods are often no better than bagging and AdaBoost; they are just slower. The reason for the superiority of the proposed ModifiedBagging over EasyEnsemble should be sought in their differences, illustrated in line 6 of the pseudo code: the EasyEnsemble algorithm uses a classification algorithm of high time complexity, AdaBoost, instead of a simple classifier [5]. Using such a complex classifier not only incurs a large time overhead but is actually unjustified, because a voting mechanism is applied after the classifiers $C_i$ are generated. In addition, the ensemble classifiers may not be properly trained on the $S_i$ due to the small sample size of the minority class, $|S_{min}|$. A sketch of the proposed procedure follows the list of models below. After selecting useful features, the PCA technique is used to reduce dimensionality. Finally, several classification models are used; we tried a range of models, listed here: (1) DT; (2) one-class SVDD; (3) two-class SVDD; (4) two-class Parzen-DD; (5) two-class KNN-DD; (6) an ensemble of classifiers 1 to 5 combined by averaging; (7) an ensemble of the 6 classifiers PARZENC, FISHERC, QDC, SVDD, KNNDD and RBNC combined by averaging; (8) a combination of the 6 classifiers PARZENC, FISHERC, QDC, SVDD, KNNDD and RBNC by optimal classifier selection; (9) a boosting ensemble of 21 FISHERC classifiers; (10) a boosting ensemble of 21 QDC classifiers; (11) a boosting ensemble of 21 DT classifiers; (12) a boosting ensemble of 21 naive Bayes classifiers; (13) a bagged ensemble of 21 QDC classifiers; (14) a bagged ensemble of 21 DT classifiers; (15) MLP; (16) NUSVM; (17) SVM; (18) an ensemble of the 6 classifiers PARZENC, FISHERC, QDC, SVDD, KNNDD and RBNC combined by a majority-vote consensus function.
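The sketch announced above: a minimal Python rendering of ModifiedBagging as described in this section and in Fig. 4, using scikit-learn decision trees and soft voting across the T trees. Whether the majority-class draw uses replacement is not stated in the text, so sampling without replacement is an assumption here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ModifiedBagging:
    """T balanced subsets, each the whole minority class plus an equal-sized
    random draw from the majority class; one decision tree per subset."""
    def __init__(self, T=25, seed=0):
        self.T = T
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y, minority_label=1):
        min_idx = np.flatnonzero(y == minority_label)
        maj_idx = np.flatnonzero(y != minority_label)
        for _ in range(self.T):
            e = self.rng.choice(maj_idx, size=len(min_idx), replace=False)
            s = np.concatenate([min_idx, e])          # S_i = S_min U E_i
            self.trees.append(DecisionTreeClassifier().fit(X[s], y[s]))
        return self

    def predict_proba(self, X):
        # soft vote: average the class-probability outputs of the T trees
        return np.mean([t.predict_proba(X) for t in self.trees], axis=0)
```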
5. Experiments and results

In this article, we aim to help physicians by providing a machine learning system to diagnose cancer in patients.
5.1. Dataset

The first tested dataset is a real set collected from a hospital [6]. This dataset contains the information of 369 clients, 17 of whom are diagnosed as breast cancer patients and 352 of whom are healthy; that is, the dataset contains 352 negative (majority) samples and 17 positive (minority) samples. The maximum number of features extracted for these samples is 26. The feature values are non-numeric and were coded to numeric values to facilitate the implementation. After coding, each feature was normalized so that its values lie in the range [0, 1]. Normalization is calculated by Eq. (8):
$$nf_{x,i} = \frac{f_{x,i}}{\max_y f_{y,i} + \min_y f_{y,i}} \quad (8)$$
In Eq. (8), $f_{x,i}$ is the ith feature of the xth sample and $nf_{x,i}$ is its normalized value. The second dataset is a real one that had already been collected. It contains the information of 1282 clients: 120 of them are cardiovascular patients who did not have the opportunity to be treated, and 1162 are cardiovascular patients who were treated. That is, the dataset contains 1162 negative (majority) samples and 120 positive (minority) samples. The maximum number of features extracted for these samples is 72, of which only 52 were kept after preprocessing (the feature selection phase); the rest were discarded. Some feature values are missing (undefined) and were set to numerical values to facilitate the implementation, as follows. Let the label of the jth data point be denoted by $l_j$. If the ith feature of the jth data point is missing, we first select all the data points that share the label $l_j$, and then use the average of their ith feature as the ith feature of the jth data point (Duda and Hart, 1973). The third dataset, like the second, is a real collection of cardiovascular patients that had already been gathered. It consists of the information of 11,541 clients: 700 of them are cardiovascular patients who did not have the opportunity to be treated and 10,841 are cardiovascular patients who did. In other words, the dataset includes 10,841 negative (majority) samples and 700 positive (minority) samples. The maximum number of features extracted for these samples is 86; after preprocessing (the feature selection phase), only 79 features were kept and the rest discarded. Some feature values are missing (undefined) and were set to numerical values as described above.
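The preprocessing just described can be sketched as follows; normalize implements Eq. (8) exactly as printed (note that min-max scaling would instead subtract the minimum and divide by the range), and impute_class_mean fills each missing value with the class-conditional feature mean. Both helpers and their names are ours.

```python
import numpy as np

def normalize(F):
    """Eq. (8): divide each feature column by (max + min) over the dataset."""
    return F / (F.max(axis=0) + F.min(axis=0))

def impute_class_mean(F, labels):
    """Replace each NaN with the mean of that feature over samples sharing
    the same class label, as described for datasets 2 and 3."""
    F = F.copy()
    for c in np.unique(labels):
        rows = labels == c
        col_means = np.nanmean(F[rows], axis=0)
        nan_r, nan_c = np.where(np.isnan(F) & rows[:, None])
        F[nan_r, nan_c] = col_means[nan_c]
    return F
```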
Fig. 5. ROC curve of the proposed method with 25 DTs.
5.2. Experiments

In this paper, the experimental data were trained with the decision tree learning algorithm, multi-layer neural networks and the proposed algorithm. The Decision Tree (DT) used in this paper is a decision tree with the Gini index; the Gini index threshold is set to 2 throughout the paper. K-Nearest Neighbors (KNN) is another classifier used, with K set to 5. We also use an Artificial Neural Network (ANN) as one of our base learners. All ANNs used are MLPs with two hidden layers; the numbers of neurons in the first and second hidden layers are 10 and 5, respectively, with "linear" and "tangent-sigmoid" activation functions. All parameters of the MLPs and DTs are kept fixed throughout the experiments. In the next step, the above experiments were repeated with ensembles of classifiers and the results are presented. Finally, the proposed method is compared with the EasyEnsemble method.

5.3. Results

In this section, the results are provided and elaborated in three stages, as follows.

5.3.1. Results of the experiments on the first dataset

Table 1 shows the results of the first stage of tests. As seen in Table 1, although the recognition rate of the simple methods (DT, KNN and MLP) is very high, they are not effective: although their overall validity is acceptable, they cannot diagnose the patients. This is not unexpected, because these classifiers gain very high accuracy by assigning almost all data to the same class. Looking at columns 5 and 6, it is clear that the diagnostic performance of these classifiers is much higher when sub-sampling (the proposed algorithm) is used, while their overall accuracy decreases significantly. As expected, the DT classifier has a significant advantage over the MLP classifier. These results were also not unexpected: previous work shows that sub-sampling reduces the efficiency of neural networks relative to decision trees [13], and that neural networks have a better recognition rate than decision trees in imbalanced environments but lower precision and F-measure [13]. As seen in Table 2, using an ensemble without the proposed approach for balancing the training data does not solve the problem.
Table 1. Results of experiments with one classifier using the leave-one-out (LOO) technique.

| Evaluation measure | DT | MLP | NN | ModifiedBagging (T = 1, DT) | ModifiedBagging (T = 1, MLP) |
| TP | 1/17 = 5.88 | 0/17 = 0.00 | 4/17 = 23.53 | 10/17 = 58.82 | 4/17 = 23.53 |
| FP | 0/352 = 0.00 | 0/352 = 0.00 | 7/352 = 1.99 | 82/352 = 23.30 | 116/352 = 32.95 |
| TN | 352/352 = 100.0 | 352/352 = 100.0 | 345/352 = 98.01 | 270/352 = 76.70 | 236/352 = 67.05 |
| FN | 16/17 = 94.12 | 17/17 = 100.0 | 13/17 = 76.47 | 7/17 = 41.18 | 13/17 = 76.47 |
| Recognition rate | 353/369 = 95.66 | 352/369 = 95.39 | 349/369 = 94.58 | 280/369 = 75.88 | 240/369 = 65.04 |
| Precision | 100.00 | ∝ (50) | 36.36 | 71.63 | 41.66 |
| Recall | 5.88 | 0.00 | 23.53 | 58.82 | 23.53 |
| F-measure | 7.14 | 0.00 | 28.57 | 64.60 | 30.07 |
| Accuracy | 52.94 | 50.00 | 60.77 | 67.76 | 45.29 |
Table 2. Results of experiments with multiple classifiers using the LOO technique.

| Evaluation measure | ModifiedBagging (T = 25, DT, best cutting) | ModifiedBagging (T = 25, MLP, best cutting) | ModifiedBagging (T = 25, DT, median cutting) | ModifiedBagging (T = 25, MLP, median cutting) |
| TP | 1/17 = 5.88 | 0/17 = 0.00 | 13/17 = 76.47 | 11/17 = 64.71 |
| FP | 0/352 = 0.00 | 0/352 = 0.00 | 71/352 = 20.17 | 116/352 = 32.95 |
| TN | 352/352 = 100.0 | 352/352 = 100.0 | 281/352 = 79.83 | 236/352 = 67.05 |
| FN | 16/17 = 94.12 | 17/17 = 100.0 | 4/17 = 23.53 | 6/17 = 35.29 |
| Recognition rate | 353/369 = 95.66 | 352/369 = 95.39 | 294/369 = 79.67 | 247/369 = 66.94 |
| Precision | 100.00 | ∝ (50) | 79.12 | 66.63 |
| Recall | 5.88 | 0.00 | 76.47 | 64.71 |
| F-measure | 7.14 | 0.00 | 77.77 | 65.66 |
| Accuracy | 52.94 | 50.00 | 78.15 | 65.88 |
Fig. 6. ROC curve of the proposed method with 25 MLPs.
However, applying the proposed method improves the performance significantly. This is not unexpected: as all references make clear, simple methods (even classic ensemble methods) cannot succeed on imbalanced problems [8,9,13,14]. Fig. 5 shows the ROC curve of the proposed algorithm with T = 25 and DT as the base classifier. The lack of sufficient data is why the presented ROC curves are not smooth. According to Fig. 5, by selecting a better cutting level, the accuracy can be increased to some extent while keeping TP high; this improvement is not very large, because the accuracy increases only as TP is reduced. The above experiments indicate that the accuracy of the proposed method is much better with an ensemble of T = 25. The efficiency of simple classifiers, or of ensembles with T = 25 trained on all samples of the training dataset, is not comparable with the ensembles of the proposed method. Another point is that using the DT classifier as the base classifier is more efficient than the MLP neural network. For a better comparison, Fig. 6 illustrates the ROC curve of the proposed algorithm with T = 25 and multilayer neural networks.
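One plausible way to pick such a "cutting level" programmatically is to maximize TPR - FPR (Youden's J) along the ROC curve; the paper does not specify its criterion for the "best cutting", so this is only an illustrative choice.

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_cutting_level(y_true, scores):
    """Return the threshold on the ensemble's output scores that maximises
    TPR - FPR (Youden's J), together with the TPR and FPR at that point."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    i = np.argmax(tpr - fpr)
    return thresholds[i], tpr[i], fpr[i]
```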
Table 3. Comparison of the proposed method with other methods such as EasyEnsemble.

| Evaluation measure | EasyEnsemble of 25 classifiers (mean ROC cutting) | BalanceCascade of 25 classifiers (mean ROC cutting) | ModifiedBagging of 25 classifiers (mean ROC cutting) | SMOTE-Tomek method | CBO method |
| TP | 3/17 = 17.65 | 5/17 = 29.41 | 13/17 = 76.47 | 6/17 = 35.29 | 4/17 = 23.53 |
| FP | 31/352 = 8.81 | 43/352 = 12.22 | 71/352 = 20.17 | 51/352 = 14.49 | 93/352 = 26.42 |
| TN | 321/352 = 91.19 | 309/352 = 87.78 | 281/352 = 79.83 | 309/352 = 85.51 | 259/352 = 73.58 |
| FN | 14/17 = 82.35 | 12/17 = 70.59 | 4/17 = 23.53 | 11/17 = 64.71 | 13/17 = 76.47 |
| Recognition rate | 324/369 = 87.80 | 314/369 = 85.09 | 294/369 = 79.67 | 315/369 = 85.37 | 263/369 = 71.27 |
| Precision | 66.70 | 70.44 | 79.13 | 70.90 | 47.11 |
| Recall | 17.65 | 29.41 | 76.47 | 35.29 | 23.53 |
| F-measure | 27.91 | 41.50 | 77.78 | 47.13 | 31.38 |
| Accuracy | 54.42 | 58.60 | 78.15 | 60.40 | 48.55 |
Fig. 7. Fisher criterion derived from the output of an ensemble of DTs in terms of number of DTs.
To compare the EasyEnsemble algorithm with the proposed method, we note that satisfactory results were not obtained by applying EasyEnsemble with the simple linear classifiers used in [3]. Comparing the proposed algorithm with EasyEnsemble in Table 3, we conclude that the accuracy of EasyEnsemble is weak on datasets where the minority class is very small, so it is crucial not to follow boosting methods on such datasets. Given the long training time the EasyEnsemble algorithm needs, it can be argued that the proposed method is better than EasyEnsemble in both performance and learning speed on datasets similar to those used in this paper. In addition, a general framework is proposed for obtaining a learning model on such data. Looking at the last two columns of Table 3, the method based on ensemble sampling, over-sampling of the minority class and cleansing of Tomek links performs better than the EasyEnsemble and BalanceCascade methods. In the last column, the CBO method [10], an over-sampling method, leads to poorer results than the BalanceCascade and EasyEnsemble algorithms. This is not far from our expectations either, because it is established in the literature that over-sampling techniques work well when the number of minority-class samples is not small, a condition violated in our problem [9]. Perhaps the most important reason for the failure of the EasyEnsemble method (and BalanceCascade) is that with very little data, boosting is not only meaningless but actually acts as a deceptive factor. Fig. 7 shows the effect of the number of participating decision trees on the efficiency of the proposed method; the performance criterion in this example is the Fisher index. As can be seen, when the number of classifiers reaches 20, the performance reaches its peak; increasing the number of classifiers in the ensemble beyond 20 has little impact on its performance.
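The paper does not give a formula for the Fisher index of Fig. 7; a standard two-class Fisher discriminant ratio on the ensemble's output scores would look like this, and using it here is an assumption on our part.

```python
import numpy as np

def fisher_criterion(scores, y, minority_label=1):
    """Two-class Fisher discriminant ratio of the ensemble's output scores:
    (mu1 - mu0)^2 / (var1 + var0). The definition is assumed, not the
    paper's; it measures how well the scores separate the two classes."""
    s1 = scores[y == minority_label]
    s0 = scores[y != minority_label]
    return (s1.mean() - s0.mean()) ** 2 / (s1.var() + s0.var())
```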
5.3.2. Results of the experiment on the second dataset

Fig. 8 shows the results of methods 1 through 18, the EE method and the MB method on the second dataset, without deleting missing values or removing meaningless features. In this figure, EE is EasyEnsemble with 21 DT classifiers and MB is ModifiedBagging with 21 DT classifiers. Clearly, these results are not suitable, especially for a two-class problem, where random prediction itself has a performance near the same values. Fig. 9 shows the corresponding results for this dataset after applying PCA; clearly, these results are not good either, especially for a two-class problem.
Fig. 8. The Results on the second set without deleting the missing values.
Fig. 11. After the stage of eliminating the missing data and without applying PCA.
Fig. 9. The results of data after applying PCA.
Fig. 12. Box plot.
Fig. 13. The adjusted results.
Fig. 10. After two steps of eliminating the missing values and applying PCA.
Clearly, these results are even worse than the previous ones. After removing the missing values and applying PCA, we reach the following conclusion: as Fig. 10 shows, these results are still worse than the previous ones. Next, the results on this dataset are obtained after eliminating the missing data and without applying PCA (Figs. 10 and 11). Clearly, the results are better now; the proposed method performs better than all the other methods. We now turn to the statistical analysis of these results. First, we obtain the box plot of the accuracies of the methods, illustrated in Fig. 12; the vertical axis shows the method number and the horizontal axis the accuracy of the method. The 19th method is EE. As can be seen, the SVM, EE and MB methods all have the same distribution. Next, Fig. 13 shows the results based on AUC; they confirm the same conclusions, with the SVM, MB and EE methods giving the best performance, in that order. Again, the vertical axis shows the method number and the horizontal axis the accuracy of the method; the 19th method is EE.
Fig. 14. Dispersion based on AUC.
Fig. 14 further shows that the SVM, EE and MB methods have the same dispersion based on AUC; the vertical axis shows the method number and the horizontal axis the AUC of the method. Fig. 15 was obtained by applying a paired t-test to the accuracies of the various methods. If the entry in the ith row and jth column equals one, the ith method significantly outperforms the jth method; if it equals minus one, the ith method significantly underperforms the jth method; otherwise the difference between the two methods is not significant. By counting the ones in a row and subtracting the number of minus ones, we obtain the score of a method. Note that the 19th and 20th methods are MB and EE, respectively. Sorting the methods by their scores, the 19th, 17th and 20th methods are the best. The ROC curves of the best methods on this dataset are displayed in Fig. 16; the figure shows the clear superiority of the proposed method, followed by the 17th and then the 20th method.
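The scoring rule just described can be computed directly from the ±1 matrix of Fig. 15; a small sketch, where M is assumed to be the pairwise win/loss matrix that the paper shows only as a figure.

```python
import numpy as np

def method_scores(M):
    """Score each method from the paired t-test matrix of Fig. 15:
    M[i, j] = +1 if method i significantly beats method j, -1 if it
    significantly loses, 0 otherwise. Score = wins minus losses per row."""
    M = np.asarray(M)
    return (M == 1).sum(axis=1) - (M == -1).sum(axis=1)

# ranking methods best-first: np.argsort(-method_scores(M))
```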
5.3.3. Results of the experiment on the third dataset

Following the results of the previous subsections, we carry out the last stage of this study. Fig. 17 shows the results of methods 1 through 18 and of the EE, BC, CBO and MB methods on the third dataset, after deleting missing data and removing unnecessary features.
Fig. 16. ROC curves of the best methods.
Fig. 17. The results on the third dataset with deleting missing data.
Fig. 15. The results of paired t-test on accuracies of various methods.
Table 4. Comparison of the proposed method with state-of-the-art methods.

| Evaluation measure | IESOM [21] | KernelADASYN [20] | ModifiedBagging |
| TP | 6/17 = 35.29 | 10/17 = 58.82 | 13/17 = 76.47 |
| FP | 14/352 = 3.98 | 37/352 = 10.51 | 71/352 = 20.17 |
| TN | 338/352 = 96.02 | 315/352 = 89.49 | 281/352 = 79.83 |
| FN | 11/17 = 64.71 | 7/17 = 41.18 | 4/17 = 23.53 |
| Recognition rate | 344/369 = 93.22 | 325/369 = 88.08 | 294/369 = 79.67 |
| Precision | 89.87 | 84.84 | 79.13 |
| Recall | 35.29 | 58.82 | 76.47 |
| F-measure | 50.68 | 69.47 | 77.78 |
| Accuracy | 65.66 | 74.16 | 78.15 |
In Fig. 17, EE stands for the EasyEnsemble method with 21 DT classifiers, BC stands for the BalanceCascade method with 21 DT classifiers, CBO is the clustering-based over-sampling algorithm with 21 DT classifiers, and MB is ModifiedBagging with 21 DT classifiers. The MB method turns out to be the best. The 10th method, the boosting ensemble of 21 QDC classifiers, is second best, and the 11th method, the boosting ensemble of 21 DT classifiers, is third best. Fig. 18 illustrates the box plots of these methods; looking at the figure, the MB method is the best (because it has minimal variance and the highest ceiling), while the 10th and 11th methods are the second and third best, respectively. The ROC curves of the superior methods, i.e., the 10th and 11th methods and the EE, BC, CBO and MB methods, are provided in Fig. 19. As the figure shows, the ROC curve of the MB method is the best, followed in turn by the curves of the 10th and 11th methods, the CBO method and the EE method. Table 4 compares our method with some state-of-the-art methods in the field of imbalanced learning.

6. Conclusion
Fig. 18. Box plots of the different methods on the third dataset.
In this paper, a new method was presented for imbalanced learning. This type of learning targets datasets in which the minority class is much smaller than the majority one. The method was applied to the breast cancer detection problem. The inability of simple classic learning techniques to learn from this type of dataset (imbalanced cancer datasets) was also shown. In addition, due to the lack of minority class data, special-purpose imbalanced learning methods also underperform. The results of this research can be used in the field of medicine for screening people: considering an individual's characteristics and history in health centers, automated methods can be designed to identify high-risk patients using this method. This can help diagnose and treat disease early, bringing significant savings in health care costs.

Acknowledgement

We thank Yasooj Branch, Islamic Azad University, Yasooj, Iran, for supporting this research.
Fig. 19. ROC curves of the best methods.
References
[1] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[2] B. Minaei-Bidgoli, H. Parvin, H. Alinejad-Rokny, H. Alizadeh, W. Punch, Effects of resampling method and adaptation on clustering ensemble efficacy, Artif. Intell. Rev. 41 (1) (2014) 27–48.
[3] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst. Man Cybern. Part B Cybern. (2009).
[4] J. Zhang, I. Mani, KNN approach to imbalanced data distributions: a case study involving information extraction, in: Proceedings of the International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets, 2003.
[5] M. Hamzei, M.R. Kangavari, Learning from Imbalanced Data, Technical Report, Iran University of Science and Technology, Iran, 2010.
[6] F. Minaei, M. Soleimanian, D. Kheirkhah, Investigation of the relationship between risk factors of occurrence of breast tumor in women, Aranobidgol, Iran, in: 3rd National Conference on Data Mining, Kashan (in Persian), 2009.
[7] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[8] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: Proceedings of the International Joint Conference on Neural Networks, 2008, pp. 1322–1328.
[9] G.E.A.P.A. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 20–29.
[10] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 40–49.
[11] N.V. Chawla, A. Lazarevic, L.O. Hall, K.W. Bowyer, SMOTEBoost: improving prediction of the minority class in boosting, in: Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 2003, pp. 107–119.
[12] H. Parvin, H. Alinejad-Rokny, S. Parvin, A classifier ensemble of binary classifier ensembles, Int. J. Learn. Manag. Syst. 1 (2) (2013) 37–47.
[13] D. Thammasiri, D. Delen, P. Meesad, N. Kasap, A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition, Expert Syst. Appl. 41 (2014) 321–330.
[14] H. Li, J. Sun, Forecasting business failure: the use of nearest-neighbour support vectors and correcting imbalanced samples-evidence from the Chinese hotel industry, Tourism Manag. 33 (3) (2012) 622–634.
[15] O. Olabode, B.T. Olabode, Cerebrovascular accident attack classification using multilayer feed forward artificial neural network with back propagation error, J. Comput. Sci. 8 (1) (2012) 18–25.
[16] R. Chitra, V. Seenivasagam, Review of heart disease prediction system using data mining and hybrid intelligent techniques, ICTACT J. Soft Comput. 3 (4) (2013) 605–609.
[17] L. Lam, Classifier combinations: implementations and theoretical issues, in: J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 1857, Springer, Cagliari, Italy, 2000, pp. 78–86.
[18] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory undersampling for class imbalance learning, in: Proceedings of the International Conference on Data Mining, 2006, pp. 965–969.
[19] S. Li, B. Tang, H. He, An imbalanced learning based MDR-TB early warning system, J. Med. Syst. 40 (7) (2016) 164:1–164:9.
[20] B. Tang, H. He, KernelADASYN: kernel based adaptive synthetic data generation for imbalanced learning, in: Proceedings of the Congress on Evolutionary Computation (CEC), 2015, pp. 664–671.
[21] Q. Cai, H. He, H. Man, Imbalanced evolving self-organizing learning, Neurocomputing 133 (2014) 258–270.
[22] H. Parvin, M. MirnabiBaboli, H. Alinejad-Rokny, Proposing a classifier ensemble framework based on classifier selection and decision tree, Eng. Appl. Artif. Intell. 37 (2015) 34–42.
[23] H. Parvin, B. Minaei-Bidgoli, H. Alinejad-Rokny, A new imbalanced learning and decision tree method for breast cancer diagnosis, J. Bionanosci. 7 (6) (2013) 673–678.
[24] H. Parvin, B. Minaei-Bidgoli, H. Alinejad-Rokny, W.F. Punch, Data weighing mechanisms for clustering ensembles, Comput. Electr. Eng. 39 (5) (2013) 1433–1450.
[25] H. Parvin, H. Alinejad-Rokny, N. Seyedaghaee, S. Parvin, A heuristic scalable classifier ensemble of binary classifier ensembles, J. Bioinf. Intell. Control 1 (2) (2013) 163–170.
[26] H. Parvin, H. Alinejad-Rokny, B. Minaei-Bidgoli, S. Parvin, A new classifier ensemble methodology based on subspace learning, J. Exp. Theor. Artif. Intell. 25 (2) (2013) 227–250.
[27] H. Parvin, H. Alinejad-Rokny, S. Parvin, Divide and conquer classification, Austr. J. Basic Appl. Sci. 5 (12) (2011) 2446–2452.
[28] H. Parvin, H. Alinejad-Rokny, M. Asadi, An ensemble based approach for feature selection, J. Appl. Sci. Res. 7 (9) (2011) 33–43.
[29] R. Barandela, J.S. Sánchez, V. García, E. Rangel, Strategies for learning in class imbalance problems, Pattern Recogn. 36 (3) (2003) 849–851.
[30] R. Batuwita, V. Palade, FSVM-CIL: fuzzy support vector machines for class imbalance learning, IEEE Trans. Fuzzy Syst. 18 (3) (2010) 558–571.
[31] Y. Sun, M.S. Kamel, Y. Wang, Boosting for learning multiple classes with imbalanced class distribution, in: Proceedings of the Sixth International Conference on Data Mining (ICDM'06), IEEE, December 2006, pp. 592–602.
[32] D. Greene, A. Tsymbal, N. Bolshakova, P. Cunningham, Ensemble clustering in medical diagnostics, in: Proceedings of the Seventeenth IEEE Symposium on Computer-Based Medical Systems (CBMS 2004), IEEE, June 2004, pp. 576–581.
[33] V. Singh, L. Mukherjee, J. Peng, J. Xu, Ensemble clustering using semidefinite programming with applications, Mach. Learn. 79 (1) (2010) 177–200.

S. Nejatian obtained a Bachelor's degree in Electrical Engineering. He received the Master's degree (M.Eng) in Telecommunication Technology and the PhD degree in Data Communication from University Technology Malaysia in 2008 and 2014, respectively. He holds an assistant professor position at the Faculty of Electrical Engineering, Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in cognitive radio networks, software defined radio and wireless sensor networks. He is a registered member of professional organizations such as IEEE and IET.
Eshagh Faraji is a PhD student in the Electrical Engineering Department of Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in the areas of data mining, artificial intelligence and dispatching.
H. Parvin received a B.E. degree from Shahid Chamran University, Ahvaz, Iran, in 2006 and an M.S. degree from Iran University of Science and Technology, Tehran, Iran, in 2008. From 2008 to 2013, he worked in the Data Mining Research Lab, Iran University of Science and Technology, Tehran, Iran. He then received his Ph.D. degree from Iran University of Science and Technology, Tehran, Iran. His research interests include data mining, machine learning and ensemble learning.