Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis

Dan Gan (a), Jiang Shen (a), Bang An (a), Man Xu (b,*), Na Liu (a)

(a) College of Management and Economics, Tianjin University, Tianjin 300072, China
(b) Business School, Nankai University, Tianjin 300071, China
ABSTRACT: For imbalanced classification problems, most traditional classification models focus only on searching for an excellent classifier that maximizes classification accuracy under a fixed misclassification cost, and do not take into account that the misclassification cost can change with the sample probability distribution. As far as we know, cost-sensitive learning can be used effectively to solve imbalanced data classification problems. In this regard, we propose an integrated TANBN with cost-sensitive classification algorithm (AdaC-TANBN) to overcome the above drawback and improve classification accuracy. The AdaC-TANBN algorithm employs a variable misclassification cost, determined by the sample distribution probability, to train the classifier, and then performs classification for imbalanced data in medical diagnosis. The effectiveness of the proposed approach is examined on the Cleveland heart dataset (Heart), the Indian liver patient dataset (ILPD), the Dermatology dataset and the Cervical cancer risk factors dataset (CCRF) from the UCI machine learning repository. The experimental results indicate that the AdaC-TANBN algorithm outperforms other state-of-the-art comparative methods.

Keywords: TANBN; Cost sensitive; Integrated learning; Classification algorithm; Imbalanced data; Medical diagnosis
1. Introduction

Medical diagnostic classification can effectively assist physicians in diagnosing disease and predicting outcomes in response to treatment (Zhu et al., 2018). Many efforts have been made to improve medical diagnostic classification performance. However, class imbalance problems are widespread in the practice of medical diagnostic classification (Bak & Jensen, 2016), meaning that the class distributions are not uniform (Maurya, Toshniwal, & Venkoparao, 2016). Class imbalance problems have therefore become a focus of the medical diagnostic classification field in recent years.

Real-world medical diagnostic samples are imbalanced: the data contain a majority class and a minority class. In binary classification, the class with the larger share of samples is called the majority, while the other is called the minority (Mirza et al., 2016). When dealing with imbalanced data, traditional algorithms tend to treat minority observations
as noise or outliers and to ignore them in the classification (Mahmoud, 2017). Samples then tend to be classified into the majority class (Lu, Yang, & Shi, 2016). Consequently, the classification accuracy of the minority class is much lower than that of the majority class (Pouyanfar & Chen, 2017). In view of this, it is necessary to improve accuracy on both the minority and the majority class. The existing literature shows that imbalanced data classification has attracted wide attention, so it is necessary to design a suitable classification method for imbalanced data in medical diagnosis. Various algorithms have been put forward to solve this problem; typical examples are summarized in Table 1.

Table 1
Summary of typical previous studies on imbalanced data in medical diagnosis.

| Author | Year | Methods | Results |
|---|---|---|---|
| R. Blagus | 2013 | Improved SMOTE | SMOTE does not change the class-specific mean values on high-dimensional data, while it reduces data variability and introduces correlation between samples. |
| N. Herndon | 2016 | Logistic regression | Achieves high accuracy, with the highest areas under the precision-recall curve between 50.83% and 82.61%. |
| Jinyan Li | 2016 | Bat-inspired algorithm and PSO | Outperforms other class balancing methods on five imbalanced medical datasets. |
| Sakyajit | 2017 | Feature transformation | Surpasses the best competing methods for mortality prediction on 4000 ICU patients. |
| R.J. Kuo | 2018 | IG concept algorithm | The proposed algorithms have a lower error rate on a prostate cancer patient dataset. |
| Na Liu | 2019 | IGSAGAW hybrid algorithm | Classification accuracy is 97.5% on the Wisconsin original and diagnostic breast cancer datasets. |
The disadvantage of the above intelligent classification methods is that they only pursue a suitable classifier that maximizes classification accuracy under an equal or fixed misclassification cost, without taking into account that the misclassification costs of the minority and majority classes change with the sample probability distribution. Obviously, the cost associated with missing a case (false negative) is much higher than that of mislabeling a benign one (false positive). Therefore, class imbalance is a major problem in the medical diagnostic classification field, and classification algorithms for imbalanced data play a pivotal role. To overcome the above drawbacks and more accurately reflect the penalty for minority misclassification, this research proposes an integrated TANBN and cost-sensitive classification algorithm, which uses a variable misclassification cost determined by the sample distribution probability to train classifiers and then performs classification for imbalanced data in medical diagnosis using the integrated AdaCost algorithm. The main contributions of our research can be summarized as follows:
- We propose an integrated TANBN with cost-sensitive classification algorithm (AdaC-TANBN) for imbalanced data in medical diagnosis.
- The algorithm uses a variable misclassification cost, determined by the sample distribution probability after each iteration, to train classifiers.
- The variable misclassification cost more accurately reflects the penalty for minority misclassification.
- Experimental results on the Heart, ILPD, Dermatology and CCRF datasets indicate that the proposed algorithm achieves good performance and can be applied to real clinical diagnosis to help physicians make the right decisions.

The rest of this paper is organized as follows. Section 2 reviews related work on imbalanced data processing methods. Section 3 describes the preliminaries of the proposed method. Section 4 presents the framework of the proposed method. Section 5 presents the experimental analysis and discussion. Finally, Section 6 concludes the paper.
2. Related works

This section reviews state-of-the-art imbalanced data classification methods of recent years and comprehensively analyzes and compares them at the data preprocessing level, the feature level and the algorithm level (Haixiang et al., 2017).

2.1. Data preprocessing level

Resampling is the most representative imbalanced data classification approach at the data preprocessing level. The main resampling-based methods are undersampling, oversampling and hybrid sampling. Undersampling improves minority classification accuracy by reducing the number of majority samples. Two undersampling strategies using clustering techniques were proposed and provide optimal performance on small-scale and large-scale datasets with a single multilayer perceptron classifier and a C4.5 decision tree classifier (Lin et al., 2017). A diversified sensitivity-based undersampling method demonstrated good generalization capability on 14 UCI datasets (Ng, Hu, & Yeung, 2015). Moreover, further undersampling methods, including cluster-based strategies (Sun et al., 2015; Lin, Tang, & Yao, 2013; Kumar et al., 2015), distance-based strategies (D'Addabbo et al., 2015; Anand et al., 2010) and evolutionary strategies (Galar et al., 2012; Krawczyk, Wozniak, & Schaefer, 2014), have been proposed for imbalanced data classification. In contrast to undersampling, oversampling does not affect the majority samples but increases the minority samples to achieve better classification performance. A synthetic oversampling algorithm based on k-nearest neighbors (k-NN) called SMOM was proposed to deal with multiclass imbalance problems (Zhu, Lin, & Liu, 2017). Apart from the above, hybrid sampling combining cluster-based oversampling and undersampling has been proposed for imbalanced data
classification problems (Peng et al., 2014; Sáez et al., 2015; Song et al., 2016). From the above resampling research, it can be concluded that when there are fewer than 100 samples, undersampling is superior to oversampling in terms of computation time. The oversampling method SMOTE is usually considered the better choice when there are only a few dozen samples, while a combination of SMOTE and undersampling can be used as an alternative when the training sample size is very large. In particular, SMOTE is slightly more efficient at recognizing outliers.

2.2. Feature level

Compared with resampling, there is much less research at the feature level for imbalanced data classification. The minority samples in imbalanced datasets can easily be ignored as noise; screening the irrelevant features out of the feature space can avoid this risk (Li et al., 2016). As can be seen from the existing literature, feature selection methods usually fall into two types, filters and wrappers. Wu et al. (2014) proposed the ForesTexter feature selection method for imbalanced text classification. The truncated gradient method has been used to select relevant features and has been effectively applied to online imbalanced data processing (Han et al., 2016). A feature selection method based on the Hellinger distance has been proposed for imbalanced data (Yin et al., 2013). Maldonado et al. (2014) proposed an embedded feature selection method for high-dimensional imbalanced data. Genetic programming ideas have been introduced into wrapper feature selection and can effectively perform high-dimensional imbalanced feature selection (Viegas et al., 2018). Furthermore, a k-nearest neighbor correlation method (Zhou et al., 2017) and optimized classification methods (Liu et al., 2018; Moayedikia et al., 2017) have been proposed to deal with high-dimensional imbalanced data. The feature level is mainly applied to imbalanced data with high-dimensional features. However, the feature selection process may lose information and affect subsequent classification. At the same time, feature-level research is mainly based on feature selection; research on feature extraction is rare.

2.3. Algorithm level

At the algorithm level, research on imbalanced data has focused on two aspects: classifier modifications and ensemble methods. For classifier modifications, kernel and activation function conversion methods based on logistic regression (Maalouf & Trafalis, 2011), SVM (Zhang et al., 2014; Maratea, Petrosino, & Manzo, 2014), ELM (Wu et al., 2016) and neural networks (Raj et al., 2016) have been proposed to enhance classifier discrimination and increase the separability of the original training space. Fuzzy methods based on SVM (Cheng et al.,
2015), KNN (Liao, 2008) and neural networks (Gao, Hong, & Harris, 2014) have been proposed to extract classification rules from imbalanced datasets. For ensemble classification methods, Galar et al. (2012), combining and summarizing the existing literature, divided them into two categories: methods combining data preprocessing with ensemble learning, and cost-sensitive ensemble learning methods. Typical methods of the first category include SMOTEBagging and UnderOverBagging (Wang et al., 2009), SMOTEBoost (Chawla et al., 2003) and RUSBoost (Seiffert et al., 2010), while representative cost-sensitive ensemble learning methods are AdaCost and the AdaC1, AdaC2 and AdaC3 methods (Sun & Kamel, 2007). Cost-sensitive ensemble learning methods based on Bayesian theory, which incorporate the cost matrix into the Bayesian decision boundary, have also been proposed (Ali et al., 2016; Datta et al., 2015; Bahnsen et al., 2013; Moepya, Akhoury, & Nelwamondo, 2014). In addition, deep learning methods are gradually being used to solve imbalanced classification problems (Zhang et al., 2018; Bhatnagar et al., 2018; Hendry, Chen, & Liao, 2018). A cost-sensitive deep neural network (CSDNN) and a cost-sensitive deep neural network ensemble (CSDE) have been proposed to address the class imbalance problem, and on six large real-life datasets from different business domains they outperform all the other compared methods (Wong, Seng, & Wong, 2020).

The existing cost-sensitive integrated classification algorithms share a limitation: they usually aim only to find a suitable classifier that maximizes classification accuracy under a fixed misclassification cost, without considering that the misclassification cost is variable. This paper therefore applies a cost-sensitive integrated classification method, whose misclassification cost varies with changes in the distribution probability of the positive and negative samples, to imbalanced data in medical diagnosis.

3. Preliminaries

3.1. Bayesian classification

Bayes' theorem relates the conditional and marginal probabilities of random variables and is an important theorem in probability theory and statistics (Chen, Liu, & Wang, 2016; Park et al., 2019). For a sample ⟨x, c⟩, x = (x_1, x_2, ..., x_n) denotes the sample's feature variables and c denotes its categorical variable. Pr(x) denotes the probability distribution of the feature variables and Pr(c_j) the probability distribution of the category variable. The prior probability of c_j can be expressed as:

$$\Pr(c_j) = \frac{N_i}{N}, \qquad N_i = \left|\{x_i \mid f(x_i) = c_j,\ i = 1, 2, \ldots, N\}\right| \tag{1}$$

where N_i is the number of samples in the sub-dataset of class c_j, N is the total number of samples,
and f denotes the mapping function. The attribute variables in the dataset are distributed according to constrained conditional probabilities to form the Bayesian network B; the conditional probability is then defined by the standard Bayesian network factorization:

$$\Pr\nolimits_B(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \Pr(x_i \mid Parents(x_i)) \tag{2}$$

More classification information is gained by learning from the dataset: the full probability formula and Bayes' theorem are used to obtain the posterior probability, which then corrects the prior probability (Anagaw & Chang, 2019). For a sample, the prior probability of the category variable c_j is Pr(c_j); Bayesian network classification yields the posterior probability Pr(f(x) = c_j | X = x_i). Bayes' theorem can be expressed as:

$$\Pr(f(x) = c_j \mid X = x_i) = \frac{\Pr(f(x) = c_j)\,\Pr(X = x_i \mid f(x) = c_j)}{\Pr(X = x_i)} \tag{3}$$

where Pr(f(x) = c_j | X = x_i) denotes the conditional probability. Assuming that µ is a regularization factor, by the chain rule of probability formula (3) can be represented as:

$$\Pr(f(x) = c_j \mid X = x_i) = \mu \Pr(f(x) = c_j) \prod_{i=1}^{n} \Pr(x_i \mid x_1, \ldots, x_{i-1}, c_j) \tag{4}$$

Assuming each feature variable x_i is related only to the category variable c, formula (4) simplifies further to:

$$\Pr(f(x) = c_j \mid X = x_i) = \mu \Pr(f(x) = c_j) \prod_{i=1}^{n} \Pr(x_i \mid c_j) \tag{5}$$

For a sample ⟨x, c⟩, the training set is first divided into subsets according to the category labels, and these subsets are used to estimate the prior probability of each category. The final classification is made by the Bayesian classifier according to the maximum posterior probability criterion.
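To make formulas (1) and (5) concrete, the following sketch estimates the priors and class-conditional probabilities from a discretized dataset and classifies by the maximum posterior criterion. This is our illustration rather than the authors' code (the paper's experiments were run in R; Python is used here for brevity), and the function names are ours:

```python
# A minimal naive-Bayes sketch of formulas (1) and (5); our illustration,
# not the authors' implementation. X holds discretized integer features.
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}        # formula (1)
    cond = {}                                             # Pr(x_i | c_j)
    for c in classes:
        Xc = X[y == c]
        cond[c] = []
        for j in range(X.shape[1]):
            vals, counts = np.unique(Xc[:, j], return_counts=True)
            n_vals = len(np.unique(X[:, j]))
            denom = len(Xc) + alpha * n_vals
            # Laplace smoothing keeps unseen values at nonzero probability
            probs = {v: (k + alpha) / denom for v, k in zip(vals, counts)}
            cond[c].append((probs, alpha / denom))
    return classes, priors, cond

def predict_naive_bayes(x, classes, priors, cond):
    # Maximum posterior criterion over formula (5), in log space.
    best, best_score = None, -np.inf
    for c in classes:
        s = np.log(priors[c])
        for j, (probs, unseen) in enumerate(cond[c]):
            s += np.log(probs.get(x[j], unseen))
        if s > best_score:
            best, best_score = c, s
    return best
```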
3.2. TAN tree Bayesian classification

The fatal shortcoming of the naive Bayesian classification process is that it supposes the feature variables to be mutually independent, failing to consider that some features in a sample may not be independent of each other. The TAN (tree-augmented naive Bayes) classification process solves this problem well (Wu, 2018; Long et al., 2019). The mutual information between two features that are not independent of each other can be expressed as:

$$I(x_i, x_j) = \sum_{n=1}^{N} \sum_{m=1}^{M} p(x_i^n, x_j^m) \log \frac{p(x_i^n, x_j^m)}{p(x_i^n)\,p(x_j^m)}, \qquad i \neq j \tag{6}$$
where x_i and x_j denote the two feature variables, N and M denote the numbers of values the two feature variables take, and x_i^n and x_j^m denote the corresponding variable values. The TAN tree Bayesian classification process additionally takes the category variable into account, so the conditional mutual information given the category variable can be expressed as:

$$I(x_i, x_j \mid c) = \sum_{x_i, x_j, c} p(x_i, x_j, c) \log \frac{p(x_i, x_j \mid c)}{p(x_i \mid c)\,p(x_j \mid c)}, \qquad i \neq j \tag{7}$$
After determining the conditional mutual information value between each pair of feature variables, the maximum weighted spanning tree is constructed under the principle of not generating a loop. The edge weights are the conditional mutual information values between the feature variables, and the result is an undirected graph with n − 1 edges. After establishing the undirected graph, a node is arbitrarily selected as the root and edges are directed outward from it, fixing the directions between the feature nodes. A parent node, namely the category node, is then added for each feature node. Finally, the classification result is determined by the joint probability distribution formula, which in the standard TAN formulation reads

$$\Pr(c_j, x) = \Pr(c_j) \prod_{i=1}^{n} \Pr(x_i \mid Parents(x_i))$$

in which Parents(x_i) denotes the parent nodes of feature variable x_i (its tree parent together with the category node).
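A brief sketch of the structure-learning step just described, under the same assumptions as the earlier sketch (our illustration; the spanning tree is built with Kruskal's no-loop rule):

```python
# Sketch of TAN structure learning (ours): weigh each feature pair by the
# conditional mutual information of formula (7), then keep the heaviest
# edges that generate no loop, yielding the maximum weighted spanning
# tree with n - 1 edges over the features.
import numpy as np
from collections import Counter

def cond_mutual_info(xi, xj, y):
    n = len(y)
    p_abc = Counter(zip(xi, xj, y))
    p_ac, p_bc, p_c = Counter(zip(xi, y)), Counter(zip(xj, y)), Counter(y)
    cmi = 0.0
    for (a, b, c), k in p_abc.items():
        # p(a,b|c) / (p(a|c) p(b|c)) rewritten in terms of joint counts
        cmi += (k / n) * np.log(k * p_c[c] / (p_ac[(a, c)] * p_bc[(b, c)]))
    return cmi

def tan_spanning_tree(X, y):
    d = X.shape[1]
    edges = sorted(((cond_mutual_info(X[:, i], X[:, j], y), i, j)
                    for i in range(d) for j in range(i + 1, d)),
                   reverse=True)
    parent = list(range(d))                 # union-find for loop detection
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge closes no loop: keep it
            parent[ri] = rj
            tree.append((i, j, w))
    return tree                             # the n - 1 undirected edges
```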
3.3. Cost-sensitive integrated classification algorithms

AdaBoost was proposed in 1995 (Freund & Schapire, 1995); AdaCost, a cost-sensitive boosting variant of AdaBoost based on misclassification costs, was proposed shortly afterwards (Fan et al., 1999). AdaCost uses the misclassification cost to update the training distribution on successive boosting rounds, achieving a cumulative misclassification cost lower than that of AdaBoost. For imbalanced data, the misclassification cost of each type of sample is different; in view of this, the AdaCost algorithm introduces the concept of cost sensitivity on top of AdaBoost. The distribution probability of each sample also changes during the iterative process of the AdaCost algorithm, which increases the proportion of misclassified samples in the new sample set so as to strengthen the learning of misclassified samples. The iterative formula for the sample distribution can be expressed as:

$$D_t(i) = \frac{D_{t-1}(i)\exp\!\left(-\alpha_{t-1}\, y_i\, h_{t-1}(x_i)\, C_i\right)}{Z_{t-1}} \tag{8}$$

where D_t(i) is the weight of sample i at round t, h_{t-1} is the weak classifier of the previous round and α_{t-1} its voting weight. Another form of the iterative formula, derived from the prior probability of the category variable in formula (1), can be expressed as follows (Fan et al., 1999):

$$D_t(i) = \frac{D_{t-1}(i)\exp\!\left(-\alpha_{t-1}\, y_i\, h_{t-1}(x_i)\, C(i)\right)}{Z_{t-1}} \tag{9}$$
The cost adjustment function can be expressed as C(i) = C(sign(y_i h_{t-1}(x_i)), C_i). Here C_i ∈ (0, 1) is the adjustment cost when sign(y_i h_{t-1}(x_i)) is positive, that is, when the classifier correctly predicts the sample, and Z_{t-1} denotes the normalization factor, whose purpose is to make the probabilities of all samples sum to one. Three frameworks for the distribution-function iteration, AdaC1, AdaC2 and AdaC3, are summarized as follows (Sun et al., 2007):

$$D_t(i) = \frac{D_{t-1}(i)\exp\!\left(-\alpha_{t-1}\, y_i\, h_{t-1}(x_i)\, C_i\right)}{Z_{t-1}} \tag{10}$$

$$D_t(i) = \frac{C_i\, D_{t-1}(i)\exp\!\left(-\alpha_{t-1}\, y_i\, h_{t-1}(x_i)\right)}{Z_{t-1}} \tag{11}$$

$$D_t(i) = \frac{C_i\, D_{t-1}(i)\exp\!\left(-\alpha_{t-1}\, y_i\, h_{t-1}(x_i)\, C_i\right)}{Z_{t-1}} \tag{12}$$

C_i denotes the cost item used in cost-sensitive boosting to minimize the classification cost; in AdaC1, AdaC2 and AdaC3 it is generally a fixed value determined by learning the dataset.
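The three updates differ only in where the cost item enters; a compact sketch (ours), using the convention y_i, h_t(x_i) ∈ {−1, +1}:

```python
# Weight updates (10)-(12) for AdaC1, AdaC2 and AdaC3; our sketch, with
# margin = y * h(x) in {-1, +1}, alpha the round weight, c the cost items.
import numpy as np

def adac_update(D, margin, alpha, c, variant="AdaC1"):
    if variant == "AdaC1":          # cost inside the exponent, eq. (10)
        w = D * np.exp(-alpha * margin * c)
    elif variant == "AdaC2":        # cost outside the exponent, eq. (11)
        w = c * D * np.exp(-alpha * margin)
    else:                           # AdaC3 uses both placements, eq. (12)
        w = c * D * np.exp(-alpha * margin * c)
    return w / w.sum()              # dividing by Z keeps a distribution
```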
C_i increases the proportion of misclassified samples in the training set during the learning process, thereby raising the weight of the incorrectly classified samples. Fixing C_i means that the cost of every misclassification is the same and that this cost is known in advance for a given classification task.

4. The framework of the proposed approach

4.1. The cost-sensitive integrated classification algorithm AdaC-TANBN

The disadvantage of these cost-sensitive integrated classification methods is that they only pursue a fixed value of C_i, failing to consider that misclassification costs themselves differ: in imbalanced medical datasets the majority and minority classes carry different misclassification costs. We therefore propose the integrated TANBN and cost-sensitive classification algorithm, which uses variable cost-sensitive values that are associated with the classification results of the two types of samples and change continually over the iterative process. The purpose of the AdaC-TANBN algorithm is to associate the misclassification cost with the proportion of misclassifications within each category. We introduce TP_ratio to denote the correctly classified proportion of positive samples and TN_ratio to denote the correctly classified proportion of negative samples:

$$TP_{ratio} = \frac{TP}{TP + FN} \tag{13}$$

$$TN_{ratio} = \frac{TN}{TN + FP} \tag{14}$$
From the two formulas above, the misclassification proportion of the positive samples is 1 − TP_ratio and that of the negative samples is 1 − TN_ratio. Assume that the misclassification cost of a positive sample is X and that of a negative sample is Y; we then obtain formula (15):

$$\frac{FN \cdot X}{FP \cdot Y} = \frac{1 - TP_{ratio}}{1 - TN_{ratio}} \tag{15}$$

Substituting formulas (13) and (14) and simplifying gives X/Y = (TN + FP)/(TP + FN), so the misclassification cost for each positive sample is TN + FP and the misclassification cost for each negative sample is TP + FN, up to a common scale. The higher the misclassification proportion, the higher the cost. With TP denoting the number of correctly classified positive samples and (TP + FN)/(TN + FP) denoting the ratio between the positive and negative samples, the relative classification cost can be expressed on this basis as:

$$\frac{TN + FP}{TP\,(TN + FP)/(TP + FN)} \tag{16}$$
When TP = TP + FN, i.e., when FN = 0, formula (16) equals one, which means that the cost under no misclassification is one. Therefore, the relative cost of misclassifying the positive samples is (TP + FN)/TP, which can also be expressed as 1/TP_ratio; similarly, the relative cost of misclassifying the negative samples is 1/TN_ratio. The proposed AdaC-TANBN method is based on the AdaCost algorithm, but its cost adjustment formula is variable, and it increases the weight of higher-cost samples by changing the sample distribution probability. The cost adjustment formula of the proposed AdaC-TANBN method can therefore be expressed as:

$$C(i) = \begin{cases} TP_{ratio}, & \text{if sample } i \text{ is a } TN \\ TN_{ratio}, & \text{if sample } i \text{ is a } TP \\ 1, & \text{otherwise} \end{cases} \tag{17}$$
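Formulas (13), (14) and (17) translate directly into code; the sketch below (ours; the names are hypothetical) computes the variable cost items from the current round's predictions:

```python
# Variable cost items of formula (17), computed from the round's confusion
# counts via formulas (13) and (14); our sketch, labels in {-1, +1}.
import numpy as np

def cost_adjustment(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    tp_ratio = tp / (tp + fn)                      # formula (13)
    tn_ratio = tn / (tn + fp)                      # formula (14)
    c = np.ones(len(y_true))                       # misclassified: cost 1
    c[(y_true == -1) & (y_pred == -1)] = tp_ratio  # true negatives
    c[(y_true == 1) & (y_pred == 1)] = tn_ratio    # true positives
    return c
```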
4.2. The main steps of our proposed method

In our work, we first employ the cost adjustment formula (17) to calculate the variable cost determined by the sample distribution probability. Next, we perform classification learning for imbalanced data in medical diagnosis using the integrated AdaCost algorithm. The main steps of the AdaC-TANBN algorithm are as follows (a code sketch follows the flow chart below):

Step 1: Obtain the medical imbalanced datasets and related parameters; each sample has the format ⟨x, c⟩.
Step 2: Discretize the datasets and divide the processed datasets into training sets and testing sets.
Step 3: Learn from the training sets until the iterations end and update the weights of the weak classifiers formed on each training set.
Step 4: Predict the testing sets using each weak classifier and determine the final classification results from the weight of each weak classifier.
Step 5: Provide support for intelligent diagnosis on medical imbalanced data based on the final classification results.

The overall flow chart of our proposed AdaC-TANBN cost-sensitive integrated classification algorithm is shown in Fig. 1.
Fig. 1. The overall flow chart of the method in this paper.
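To make Steps 1-4 concrete, the following minimal sketch shows one plausible form of the AdaC-TANBN training loop. It is our reading of the method, not the authors' released code: `TANClassifier` stands for a hypothetical weight-aware TAN base learner, and the distribution update applies formula (9) with the variable costs of formula (17) via the `cost_adjustment` helper sketched in Section 4.1.

```python
# Hedged sketch of the AdaC-TANBN boosting loop (Steps 1-4).
import numpy as np

def adac_tanbn_fit(X, y, n_rounds=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                 # initial sample distribution
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = TANClassifier().fit(X, y, sample_weight=D)
        pred = h.predict(X)                 # labels in {-1, +1}
        err = np.sum(D[pred != y])
        if err == 0.0 or err >= 0.5:        # no usable weak learner left
            break
        alpha = 0.5 * np.log((1 - err) / err)
        c = cost_adjustment(y, pred)        # variable costs, formula (17)
        D = D * np.exp(-alpha * y * pred * c)   # update per formula (9)
        D /= D.sum()                        # normalize by Z_t
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adac_tanbn_predict(X, learners, alphas):
    # Step 4: weighted vote of the weak classifiers.
    score = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.where(score >= 0, 1, -1)
```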
5. Experiment analysis and discussion

The experimental analysis in this paper was carried out in the R 3.5 mathematical development environment. Comparative experiments were executed between the AdaC-TANBN algorithm and six other methods, namely TANBN, BN, SVM, AdaC1, AdaC2 and AdaC3, to evaluate the method's effectiveness on four medical imbalanced datasets. The experimental analysis covers four aspects:

- The datasets used in the experiments.
- The overall flow and evaluation measures of the experiments.
- The description of the experimental details on the four medical imbalanced datasets.
- The discussion of the experiments.
5.1. Data sets

To evaluate the proposed method's performance, our experiments were executed on the Cleveland heart dataset (Heart), the Indian liver patient dataset (ILPD), the Dermatology dataset, and the Cervical cancer risk factors dataset (CCRF) from the UCI machine learning repository. The details of the four datasets are presented in Table 2.

Table 2
The details of the four datasets.

| Datasets | Cases | Attributes | Class distribution | Missing values |
|---|---|---|---|---|
| Heart | 303 | 13 | 163/55/36/35/13 | 1 |
| ILPD | 583 | 10 | 415/167 | 4 |
| Dermatology | 366 | 33 | 112/61/72/49/52/20 | 8 |
| CCRF | 858 | 36 | 803/55 | 179 |
5.2. Evaluation measures

Classification accuracy (ACC), sensitivity, specificity, the area under the curve (AUC) and the ROC curve are used as the evaluation measures in this paper (Karabatak, 2015; Sharma & Juglan, 2018). A sample that is actually positive and predicted positive is a true positive, and TP denotes the number of true positives. Actually positive and predicted negative is a false negative, with FN the number of false negatives. Actually negative and predicted positive is a false positive, with FP the number of false positives. Actually negative and predicted negative is a true negative, with TN the number of true negatives (Zhang, Yang, & Jiang, 2018). The calculation formulas are:

$$Acc = \frac{TP + TN}{TP + FN + FP + TN} \tag{18}$$

$$Sensitivity = \frac{TP}{TP + FN} \tag{19}$$

$$Specificity = \frac{TN}{FP + TN} \tag{20}$$
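The three measures follow directly from the confusion counts; a small sketch (ours), with the same {−1, +1} label convention as the earlier sketches:

```python
# Evaluation measures (18)-(20) from confusion counts; our sketch.
import numpy as np

def evaluate(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    acc = (tp + tn) / (tp + fn + fp + tn)          # formula (18)
    sensitivity = tp / (tp + fn)                   # formula (19)
    specificity = tn / (fp + tn)                   # formula (20)
    return acc, sensitivity, specificity
```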
5.3. Comparative experiments between AdaC-TANBN and TANBN, BN, SVM

Table 3 reports the detailed ACC and AUC results of the AdaC-TANBN, TANBN, BN and SVM methods on the four datasets. The AdaC-TANBN method achieves the best performance among the four methods on three datasets (Heart, ILPD and CCRF), with average results of 80.27% ACC and 88.87% AUC on Heart, 68.26% ACC and 69.53% AUC on ILPD, and 92.84% ACC and 87.50% AUC on CCRF. Compared with SVM, AdaC-TANBN achieves slightly inferior performance on the Dermatology dataset, with average results of 91.24% ACC and 98.03% AUC, but it still clearly outperforms the TANBN and BN methods on that dataset.

From these results it can be concluded that the standard deviation produced by the AdaC-TANBN method is far smaller than that of the other three comparative methods for ACC and AUC on most of the medical imbalanced datasets. On a small number of datasets AdaC-TANBN is slightly inferior to SVM, but it is still significantly better than the common classification methods. This demonstrates that the proposed AdaC-TANBN method has better stability and consistency than the others, and that its performance is superior for medical imbalanced dataset classification.

Table 3
The detailed results of the different classification methods on the four datasets.

| Datasets | Metrics | AdaC-TANBN | TANBN | BN | SVM |
|---|---|---|---|---|---|
| Heart | ACC | 0.8027±0.0371 | 0.4716±0.5074 | 0.4756±0.0219 | 0.7878±0.0285 |
| Heart | AUC | 0.8887±0.0268 | 0.8521±0.0369 | 0.8370±0.0321 | 0.8640±0.0349 |
| ILPD | ACC | 0.6826±0.0303 | 0.7174±0.0215 | 0.6903±0.0288 | 0.6757±0.0268 |
| ILPD | AUC | 0.6953±0.0249 | 0.6023±0.0418 | 0.6212±0.0452 | 0.6359±0.3100 |
| Dermatology | ACC | 0.9124±0.0317 | 0.6921±0.0424 | 0.7000±0.0453 | 0.9483±0.0206 |
| Dermatology | AUC | 0.9803±0.0116 | 0.9363±0.0245 | 0.9368±0.0234 | 0.9919±0.0037 |
| CCRF | ACC | 0.9284±0.0234 | 0.9305±0.0215 | 0.9365±0.0224 | 0.9263±0.0189 |
| CCRF | AUC | 0.8750±0.0341 | 0.7305±0.0489 | 0.7274±0.0577 | 0.8248±0.0944 |
[Fig. 2 panels: (a) Heart dataset; (b) ILPD dataset; (c) Dermatology dataset; (d) CCRF dataset]
Fig. 2. The ROC curves for different common comparative methods based on the four datasets.
To evaluate the classification specificity and sensitivity of the AdaC-TANBN method, we compare it with the other three methods (TANBN, BN and SVM) using ROC curves. Fig. 2 shows the corresponding ROC curves on the four datasets. As shown in Fig. 2, on a given dataset the AdaC-TANBN curve encloses a much larger area toward the lower right axis than TANBN, BN or SVM; that is, the AUC of our proposed approach on the four datasets is larger than that of the three common comparative methods. Accordingly, our proposed method achieves promising classification results for imbalanced data in medical diagnosis.

5.4. Comparative experiments between AdaC-TANBN and AdaC1, AdaC2, AdaC3

Table 4 reports the detailed ACC and AUC results of the AdaC-TANBN, AdaC1, AdaC2 and AdaC3 methods on the four datasets. AdaC-TANBN achieves the best performance among the four methods, with average results of 80.27% ACC and 88.87% AUC on Heart, 68.26% ACC and 69.53% AUC on ILPD, 91.24% ACC and 98.03% AUC on Dermatology, and 92.84% ACC and 87.50% AUC on CCRF. The standard deviation produced by AdaC-TANBN is also far smaller than that of the other three comparative methods for ACC and AUC, which demonstrates that the proposed method has better stability and consistency than the others, and that its performance is superior for medical imbalanced dataset classification.

Table 4
The detailed results of the various cost-sensitive integrated classification methods on the four datasets.

| Datasets | Metrics | AdaC-TANBN | AdaC1 | AdaC2 | AdaC3 |
|---|---|---|---|---|---|
| Heart | ACC | 0.8027±0.0371 | 0.7987±0.0346 | 0.7933±0.0479 | 0.7907±0.0308 |
| Heart | AUC | 0.8887±0.0268 | 0.8687±0.0461 | 0.8384±0.0374 | 0.8147±0.0387 |
| ILPD | ACC | 0.6826±0.0303 | 0.6694±0.0299 | 0.6694±0.0321 | 0.6583±0.0459 |
| ILPD | AUC | 0.6953±0.0249 | 0.6713±0.0566 | 0.6499±0.0901 | 0.6383±0.0343 |
| Dermatology | ACC | 0.9124±0.0317 | 0.9056±0.0185 | 0.9011±0.0350 | 0.8989±0.0295 |
| Dermatology | AUC | 0.9803±0.0116 | 0.9693±0.0119 | 0.9773±0.0143 | 0.9716±0.0160 |
| CCRF | ACC | 0.9284±0.0234 | 0.9219±0.0133 | 0.9201±0.0248 | 0.9254±0.0183 |
| CCRF | AUC | 0.8750±0.0341 | 0.8357±0.0499 | 0.7888±0.0489 | 0.8118±0.0718 |
[Fig. 3 panels: (a) Heart dataset; (b) ILPD dataset; (c) Dermatology dataset; (d) CCRF dataset]

Fig. 3. The ROC curves for different cost-sensitive integrated methods based on the four datasets.
To evaluate the classification specificity and sensitivity of the AdaC-TANBN method, we compare it with the other three cost-sensitive integrated methods using ROC curves. Fig. 3 shows the corresponding ROC curves on the four datasets. As shown in Fig. 3, on a given dataset the AdaC-TANBN curve encloses a much larger area toward the lower right axis than the others; that is, the AUC of our proposed approach on the four datasets is larger than that of the other three cost-sensitive integrated comparative methods. Accordingly, our proposed method achieves promising classification results for imbalanced data in medical diagnosis.

5.5. Discussion

As can be seen from the experimental analysis above, the standard deviation produced by the AdaC-TANBN method is far smaller than that of the other six comparative methods for ACC and AUC on most of the medical imbalanced datasets, indicating that AdaC-TANBN has better stability and consistency than both the common classification methods and the cost-sensitive integrated classification methods. Apart from this, the accuracy of the AdaC-TANBN method is superior to that of the other methods for medical imbalanced dataset classification. We can therefore conclude that AdaC-TANBN is the most appropriate of the compared methods for classifying medical imbalanced datasets.

The strong performance of the AdaC-TANBN method shows that it obtains better cost-sensitive learning weights for medical imbalanced datasets. Obviously, the costs associated with missing a case (false negative) are much higher than those of mislabeling a benign one (false positive). The AdaC-TANBN method employs a variable misclassification cost, determined by the sample distribution probability after each iteration, to train the classifiers; it thereby overcomes this drawback and further improves classification accuracy.

6. Conclusions

The integrated TANBN with cost-sensitive classification algorithm (AdaC-TANBN) proposed in this paper is a high-performing method for solving imbalanced data problems in medical diagnosis. It employs a variable cost determined by the sample distribution probability to train the classifier, and then performs classification for imbalanced data in medical diagnosis using the integrated AdaCost algorithm. The effectiveness of the proposed approach was tested on the Cleveland heart dataset (Heart), the Indian liver patient dataset (ILPD), the Dermatology dataset and the Cervical cancer risk factors dataset (CCRF) from the UCI machine learning repository. The results indicate that the proposed AdaC-TANBN algorithm is superior to the other comparison methods, and this research can improve the performance of clinical assisted diagnostic systems and help clinicians make efficient decisions.
Despite the improvements produced in this research, there are still enhancements that could be made in the future. For instance, deep learning imbalance classification approaches could be introduced to provide a broader view of this field. In addition, one critical limitation of this research is that the experiments involved only four medical imbalanced datasets from the UCI machine learning repository. It is therefore necessary to collect more datasets in order to make the classification evaluation more reliable.

References

Ali, S., Majid, A., Javed, S. G., & Sattar, M. (2016). Can-CSC-GBE: Developing cost-sensitive classifier with gentle boost ensemble for breast cancer classification using protein amino acids and imbalanced data. Computers in Biology and Medicine, 73, 38-46. https://doi.org/10.1016/j.compbiomed.2016.04.002

Anagaw, A., & Chang, Y. L. (2019). A new complement naive Bayesian approach for biomedical data classification. Journal of Ambient Intelligence and Humanized Computing, 10, 3889-3897. https://doi.org/10.1007/s12652-018-1160-1

Anand, A., Pugalenthi, G., Fogel, G. B., & Suganthan, P. (2010). An approach for classification of highly imbalanced data using weighting and undersampling. Amino Acids, 39, 1385-1391. https://doi.org/10.1007/s00726-010-0595-2

Bahnsen, A. C., Stojanovic, A., Aouada, D., & Ottersten, B. (2013, December). Cost sensitive credit card fraud detection using Bayes minimum risk. 12th International Conference on Machine Learning and Applications, Miami, FL.

Bak, A., & Jensen, J. (2016). High dimensional classifiers in the imbalanced case. Computational Statistics & Data Analysis, 98, 46-59. https://doi.org/10.1016/j.csda.2015.12.009

Bhatnagar, R., Hu, V., Ratnagiri, M., et al. (2019, March). Deep learning for precision medicine: stacked autoencoders overcome classification imbalance in gene expression profiling of systemic lupus erythematosus treatments. Annual Meeting of the American Society for Clinical Pharmacology and Therapeutics, Washington, DC.

Bhattacharya, S., Rajan, V., & Shrivastava, H. (2017, February). ICU mortality prediction: A classification algorithm for imbalanced datasets. 31st AAAI Conference on Artificial Intelligence, San Francisco, CA.

Blagus, R., & Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14, 106-122. http://www.biomedcentral.com/1471-2105/14/106

Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003, September). SMOTEBoost: Improving prediction of the minority class in boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat, Croatia.

Chen, H., Liu, W., & Wang, L. (2016). Naive Bayesian classification of uncertain objects based on the theory of interval probability. International Journal on Artificial Intelligence Tools, 25, 1650012. https://doi.org/10.1142/S0218213016500123

Cheng, J., & Liu, Y. (2015). Affective detection based on an imbalanced fuzzy support vector machine. Biomedical Signal Processing and Control, 18, 118-126. https://doi.org/10.1016/j.bspc.2014.12.006

D'Addabbo, A., & Maglietta, R. (2015). Parallel selective sampling method for imbalanced and large data classification. Pattern Recognition Letters, 62, 61-67. https://doi.org/10.1016/j.patrec.2015.05.008

Datta, S., & Das, S. (2015). Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks, 70, 39-52. https://doi.org/10.1016/j.neunet.2015.06.005

Fan, W., Stolfo, S. J., Zhang, J., & Chan, P. K. (1999, June). AdaCost: Misclassification cost-sensitive boosting. 16th International Conference on Machine Learning, San Francisco, USA.

Freund, Y., & Schapire, R. E. (1995, June). A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Artificial Intelligence, Berlin Heidelberg, Germany.

Galar, M., Fernandez, A., Barrenechea, E., et al. (2012). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42, 463-484. https://doi.org/10.1109/TSMCC.2011.2161285

Gao, M., Hong, X., & Harris, C. J. (2014). Construction of neurofuzzy models for imbalanced data classification. IEEE Transactions on Fuzzy Systems, 22, 1472-1488. https://doi.org/10.1109/TFUZZ.2013.2296091

Haixiang, G., Yijing, L., & Shang, J. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239. https://doi.org/10.1016/j.eswa.2016.12.035

Han, C., Tan, Y. K., Zhu, J. H., et al. (2016). Online feature selection of class imbalance via PA algorithm. Journal of Computer Science and Technology, 31, 673-682. https://doi.org/10.1007/s11390-016-1656-0

Hendry, Chen, R. C., & Liao, C. Y. (2018, April). Deep learning to predict user rating in imbalance classification data incorporating ensemble methods. 4th IEEE International Conference on Applied System Invention, Tokyo, Japan.

Herndon, N., & Caragea, D. (2016). A study of domain adaptation classifiers derived from logistic regression for the task of splice site prediction. IEEE Transactions on NanoBioscience, 15, 75-83. https://doi.org/10.1109/TNB.2016.2522400

Karabatak, M. (2015). A new classifier for breast cancer detection based on naïve Bayesian. Measurement, 72, 32-36. https://doi.org/10.1016/j.measurement.2015.04.028

Krawczyk, B., Wozniak, M., & Schaefer, G. (2014). Cost-sensitive decision tree ensembles for effective imbalanced classification. Applied Soft Computing, 14, 554-562. https://doi.org/10.1016/j.asoc.2013.08.014

Kumar, N. S., Rao, K. N., & Govardhan, A. (2015, July). Under sampled k-means approach for handling imbalanced distributed data. 2nd International Conference on Computer and Communication Technologies, Hyderabad, India.

Kuo, R. J., Su, P. Y., Zulvia, F. E., & Lin, C. C. (2018). Integrating cluster analysis with granular computing for imbalanced data classification problem: A case study on prostate cancer prognosis. Computers & Industrial Engineering, 125, 319-332. https://doi.org/10.1016/j.cie.2018.08.031

Li, J., Fong, S., Mohammed, S., & Fiaidhi, J. (2016). Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. Journal of Supercomputing, 72, 3708-3728. https://doi.org/10.1007/s11227-015-1541-6

Liao, T. W. (2008). Classification of weld flaws with imbalanced class data. Expert Systems with Applications, 35, 1041-1052. https://doi.org/10.1016/j.eswa.2007.08.044

Lin, M., Tang, K., & Yao, X. (2013). Dynamic sampling approach to training neural networks for multiclass imbalance classification. IEEE Transactions on Neural Networks and Learning Systems, 24, 647-660. https://doi.org/10.1109/TNNLS.2012.2228231

Lin, W. C., Tsai, C. F., Hu, Y. H., et al. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409, 17-26. https://doi.org/10.1016/j.ins.2017.05.008

Liu, M., Xu, C., Luo, Y., et al. (2018). Cost-sensitive feature selection by optimizing F-measures. IEEE Transactions on Image Processing, 27, 1323-1335. https://doi.org/10.1109/TIP.2017.2781298

Liu, N., Qi, E. S., Xu, M., Gao, B., & Liu, G. (2019). A novel intelligent classification model for breast cancer diagnosis. Information Processing and Management, 56, 609-623. https://doi.org/10.1016/j.ipm.2018.10.014

Long, Y. G., Wang, L. M., & Sun, M. H. (2019). Structure extension of tree-augmented naive Bayes. Entropy, 21, 1-20. https://doi.org/10.3390/e21080721

Lu, H., Yang, K., & Shi, J. C. (2016, July). Constraining the water imbalance in a land data assimilation system through a recursive assimilation scheme. 36th IEEE International Geoscience and Remote Sensing Symposium, Beijing, China.

Maalouf, M., & Trafalis, T. B. (2011). Robust weighted kernel logistic regression in imbalanced and rare events data. Computational Statistics & Data Analysis, 55, 168-183. https://doi.org/10.1016/j.csda.2010.06.014

Mahmoud, E. B. (2017). Modified Mahalanobis Taguchi system for imbalance data classification. Computational Intelligence and Neuroscience, 2017, 1-15. https://doi.org/10.1155/2017/5874896

Maldonado, S., Weber, R., & Famili, F. (2014). Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Information Sciences, 286, 228-246. https://doi.org/10.1016/j.ins.2014.07.015

Maratea, A., Petrosino, A., & Manzo, M. (2014). Adjusted F-measure and kernel scaling for imbalanced data learning. Information Sciences, 257, 331-341. https://doi.org/10.1016/j.ins.2013.04.016

Maurya, C. K., Toshniwal, D., & Venkoparao, V. (2016). Online sparse class imbalance learning on big data. Neurocomputing, 216, 250-260. https://doi.org/10.1016/j.neucom.2016.07.040

Mirza, B., Kok, S., Lin, Z., et al. (2016, October). Efficient representation learning for high-dimensional imbalance data. IEEE International Conference on Digital Signal Processing, Beijing, China.

Moayedikia, A., Ong, K. L., Boo, Y. L., et al. (2017). Feature selection for high dimensional imbalanced class data using harmony search. Engineering Applications of Artificial Intelligence, 57, 38-49. https://doi.org/10.1016/j.engappai.2016.10.008

Moepya, O., Akhoury, S. S., & Nelwamondo, F. V. (2014, December). Applying cost-sensitive classification for financial fraud detection under high class-imbalance. 14th IEEE International Conference on Data Mining, Shenzhen, China.

Ng, W. Y., Hu, J., Yeung, D. S., et al. (2015). Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Transactions on Cybernetics, 45, 2402-2412. https://doi.org/10.1109/TCYB.2014.2372060

Park, S., Chung, W. K., & Kim, K. (2019). Training-free Bayesian self-adaptive classification for sEMG pattern recognition including motion transition. IEEE Transactions on Biomedical Engineering, 1-12. https://doi.org/10.1109/TBME.2019.2947089

Peng, Y., Li, W., Zhao, D., & Zaiane, O. (2014). Ensemble-based hybrid probabilistic sampling for imbalanced data learning in lung nodule CAD. Computerized Medical Imaging and Graphics, 38, 137-150. https://doi.org/10.1016/j.compmedimag.2013.12.003

Pouyanfar, S., & Chen, S. C. (2017). Automatic video event detection for imbalance data using enhanced ensemble deep learning. International Journal of Semantic Computing, 11, 85-109. https://doi.org/10.1142/S1793351X17400050

Raj, V., Magg, S., & Wermter, S. (2016, September). Towards effective classification of imbalanced data with convolutional neural networks. 7th IAPR TC3 Workshop on Artificial Neural Networks in Pattern Recognition, Ulm University, Ulm, Germany.

Sáez, J. A., Luengo, J., Stefanowski, J., & Herrera, F. (2015). SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a resampling method with filtering. Information Sciences, 291, 184-203. https://doi.org/10.1016/j.ins.2014.08.051

Seiffert, C., Khoshgoftaar, T. M., Hulse, J. V., et al. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 40, 185-197. https://doi.org/10.1109/TSMCA.2009.2029559

Sharma, V., & Juglan, K. C. (2018). Automated classification of fatty and normal liver ultrasound images based on mutual information feature selection. IRBM, 39, 313-323. https://doi.org/10.1016/j.irbm.2018.09.006

Song, J., Huang, X., Qin, S., & Song, Q. (2016, June). A bi-directional sampling based on K-means method for imbalance text classification. 15th IEEE/ACIS International Conference on Computer and Information Science, Okayama, Japan.

Sun, Y., Kamel, M. S., Wong, A. K. C., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40, 3358-3378. https://doi.org/10.1016/j.patcog.2007.04.009

Sun, Z., Song, Q., Zhu, X., et al. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48, 1623-1637. https://doi.org/10.1016/j.patcog.2014.11.014

Viegas, F., Rocha, L., Gonalves, M., et al. (2018). A genetic programming approach for feature selection in highly dimensional skewed data. Neurocomputing, 273, 554-569. https://doi.org/10.1016/j.neucom.2017.08.050

Wang, S., & Yao, X. (2009, March). Diversity analysis on imbalanced data sets by using ensemble models. IEEE Symposium on Computational Intelligence and Data Mining, Nashville, TN.

Wong, M. L., Seng, K., & Wong, P. K. (2020). Cost-sensitive ensemble of stacked denoising autoencoders for class imbalance problems in business domain. Expert Systems with Applications, 141, 1-18. https://doi.org/10.1016/j.eswa.2019.112918

Wu, D., Wang, Z., Chen, Y., & Zhao, H. (2016). Mixed kernel based weighted extreme learning machine for inertial sensor based human activity recognition with imbalanced dataset. Neurocomputing, 190, 35-49. https://doi.org/10.1016/j.neucom.2015.11.095

Wu, J. H. (2018). A generalized tree augmented naive Bayes link prediction model. Journal of Computational Science, 27, 206-217. https://doi.org/10.1016/j.jocs.2018.04.006

Wu, Q., Ye, Y., Zhang, H., et al. (2014). ForesTexter: An efficient random forest algorithm for imbalanced text categorization. Knowledge-Based Systems, 67, 105-116. https://doi.org/10.1016/j.knosys.2014.06.004

Yin, L., Ge, Y., Xiao, K., et al. (2013). Feature selection for high-dimensional imbalanced data. Neurocomputing, 105, 3-11. https://doi.org/10.1016/j.neucom.2012.04.039

Zhang, C., Tavanapong, W., Kijkul, G., et al. (2018, November). Similarity-based active learning for image classification under class imbalance. 18th IEEE International Conference on Data Mining Workshops, Singapore.

Zhang, L., Yang, H., & Jiang, Z. (2018). Imbalanced biomedical data classification using self-adaptive multilayer ELM combined with dynamic GAN. BioMedical Engineering OnLine, 17, 1-23. https://doi.org/10.1186/s12938-018-0604-3

Zhang, Y., Fu, P., Liu, W., & Chen, G. (2014). Imbalanced data classification based on scaling kernel-based support vector machine. Neural Computing and Applications, 25, 927-935. https://doi.org/10.1007/s00521-014-1584-2

Zhou, P., Hu, X., Li, P., & Wu, X. (2017). Online feature selection for high-dimensional class imbalanced data. Knowledge-Based Systems, 136, 187-199. https://doi.org/10.1016/j.knosys.2017.09.006

Zhu, M., Xia, J., Jin, X., et al. (2018). Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access, 6, 4641-4652. https://doi.org/10.1109/ACCESS.2018.2789428

Zhu, T., Lin, Y., & Liu, Y. (2017). Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recognition, 72, 327-340. https://doi.org/10.1016/j.patcog.2017.07.024
Highlights

- We propose an integrated TANBN with cost-sensitive classification algorithm (AdaC-TANBN) for imbalanced data in medical diagnosis.
- The algorithm uses a variable misclassification cost, determined by the sample distribution probability after each iteration, to train classifiers.
- The variable misclassification cost more accurately reflects the penalty for minority misclassification.
- Experimental results indicate that the proposed method achieves good performance.
Author Contributions

Dan Gan wrote the manuscript; Dan Gan, Bang An and Man Xu contributed to data collection and cleaning, performed the experiments and analyzed the results; Dan Gan and Na Liu assisted in interpreting the data and revised the manuscript; Dan Gan, Jiang Shen and Man Xu supervised and supported the research. All the authors reviewed and approved the final manuscript.