Medical decision support system for extremely imbalanced datasets

Swati Shilaskar, Ashok Ghatol, Prashant Chatur
Government College of Engineering, Amravati, India

Article history: Received 29 October 2015; Revised 2 August 2016; Accepted 21 August 2016.

Keywords: Imbalanced dataset; Evolutionary algorithm; Medical diagnosis; Particle swarm; Physiological parameters; Synthetic sampling

Abstract

Advanced biomedical instruments and data acquisition techniques generate large amounts of physiological data. For accurate diagnosis of the related pathology, it has become necessary to develop new methods for analyzing and understanding this data. Clinical decision support systems are designed to provide real time guidance to healthcare experts, and are evolving as an alternate strategy to increase the exactness of diagnostic testing. The generalization ability of these systems is governed by the characteristics of the dataset used during their development. Sub-pathologies are observed to occur at widely varying rates in the population, making the datasets extremely imbalanced. This problem can be addressed at the data level as well as the algorithmic level. This work proposes a synthetic sampling technique to balance datasets, along with a Modified Particle Swarm Optimization (M-PSO) technique. A comparative study of multiclass support vector machine (SVM) classifier optimization algorithms based on grid selection (GSVM), hybrid feature selection (SVMFS), a genetic algorithm (GA) and M-PSO is presented. Empirical analysis of five machine learning algorithms demonstrates that M-PSO statistically outperforms the others.

© 2016 Elsevier Inc. All rights reserved.

1. Introduction

A medical dataset is constituted by the parameters of patients, either measured or observed during examination, together with the diagnosis of each sample by a healthcare expert. Its sample count depends on the number of patients visiting a clinic, so the numbers of samples generated for the various pathologies are likely to differ, which makes the dataset imbalanced. Evaluating imbalanced datasets is an important problem from both the performance and the algorithmic perspective. If the classification categories are not represented in an approximately equal proportion, i.e. if there are significantly more data points of one class and fewer occurrences of another, the dataset is said to be imbalanced. The non-uniform distribution of samples develops a bias towards the majority class. Many real world problems are characterized by imbalanced data. For pathology detection, the minority class is of particular interest, as it indicates the disease diagnosis capability of the system; in such cases, misclassifying the minority class carries a higher cost than misclassifying the majority class.




Our work focuses on the classification of multiclass imbalanced datasets from the medical domain. The paper is organized as follows: the literature survey carried out for data conditioning with synthetic sampling techniques is given in Section 2. A description of the imbalance and dimensions of the datasets is given in Section 3, together with a comparative study of the classification performance achieved by various algorithms on benchmark datasets in the literature. The data conditioning scheme is described in Section 4. The experimental setup for the multiclass classifiers and graphical illustrations of observations noted during the optimization process are given in Section 5. Section 6 presents the performance evaluation methods for multiclass classification and the results.

2. Literature survey

Many machine learning approaches have been developed to deal with imbalanced data: either the classifier algorithm is designed to handle the bias, or data is synthesized externally and the dataset is modified into a balanced one. Oversampling of minority classes gives more accurate results than under sampling of majority classes. Under sampling by deleting weakly discriminating samples away from the hyperplane, and oversampling by generating virtual samples in the proximity of the SVM hyperplane, are suggested in [16]. A synthetic sampling technique for binary classification with fivefold cross validation is used in [31]. B. Das et al. [12] suggested a one class classifier approach, making sensitivity a key evaluation parameter for imbalanced data. B. Almogahed et al. [2] used a semi-supervised learning method to identify the most relevant instances and establish a well-defined training set. According to J. F. Diez-Pastor [15] and S. Barua et al. [8], diversity increasing techniques for synthetic samples improve classification performance. The literature indicates the absence of a fixed rule for finding the correct distribution of samples for a learning algorithm [11]. X. Wan et al. [46] used the difference between majority and minority class samples to define a cost function without a priori knowledge.

The literature suggests eliminating samples which do not constitute support vectors of the classifier. For a multiclass classification system this technique may not be appropriate, as a sample insignificant for one pathology might be vital for classifying another. For balancing the dataset, under sampling, oversampling or a combination of both can be used. Under sampling may remove important examples, while oversampling introduces an additional computational burden. The class imbalance and the nature of the learning algorithm are strongly related. Artificially generated instances are not real data, hence their true class may be questioned. There are two types of feature vectors: (1) directly collected, e.g. symptoms, age, temperature; and (2) extracted by signal processing techniques, e.g. features extracted from speech, ECG, EEG, EMG signals or medical images. For the first type it is appropriate to rely on domain experts' opinion about the true class of synthetic samples; for the second type, where features are transformed, one may need to rely on the data itself [40]. In this work, we propose a technique to balance the dataset with minimal compromise to the diversity of the feature space. We use multiclass medical datasets with varied degrees of imbalance.
Classifier optimization is carried out using a cross validation technique [28]. The performance measures for multiclass classification need to be selected carefully; evaluation of the classifiers is based on the performance for reserved samples which were not involved in the training process.

3. Dataset description and related work

We employed pathological multiclass imbalanced datasets for analysis, chosen on the basis of their class distributions and sizes. Various physiological parameters collected during observation and treatment form part of a medical dataset. The performance of various classification methods in the literature, implemented on the relevant datasets, is given below.

Vani dataset: 'Vani' is a word of Sanskrit origin meaning 'voice'. This database was developed at the Vani speech therapy center, Jabalpur, India. It contains 16-bit resolution speech samples with a sampling rate of 22,050 Hz. The samples are sustained phonations of /a/ recorded from patients with a variety of vocal cord pathologies, including organic, neurological and traumatic disorders. The dataset contains 11 samples with mild pathology, 19 with severe pathology and 93 healthy speakers' samples. The pathological class is categorized after an expert's opinion about the perceived pathology; an expert medical practitioner's opinion and suggestions from a previous study [5] were applied for the categorization of samples. Healthy voice samples are categorized in the following ways:

• The subjective feeling that the speaker has no perceived laryngeal pathology.
• An adequate voice for the age, gender and cultural group of the speaker.
• An adequate pitch, tone, volume and flexibility of diction.
• No history of surgery related to any laryngeal pathology.

From this dataset, 22 features based on voice quality are extracted from the speech samples. The features are based on fundamental frequency, harmonic to noise ratio (HNR), normalized noise energy (NNE) and glottal-to-noise excitation ratio (GNE), combined into a single feature set for evaluation. For the removal of gender bias, statistical techniques such as the standard deviation and the interquartile range of the feature values are used; details of the feature enhancement techniques may be found in [38]. The smallest class sample count is 11% of the largest class.
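The paper gives the gender-bias conditioning only in prose; the following is a minimal sketch, assuming simple per-feature scaling by the interquartile range around the median (the function and variable names are ours, not from the paper):

```python
import numpy as np

def iqr_scale(X):
    """Scale each feature of X (samples x features) by its interquartile
    range around the median, reducing speaker-dependent spread such as
    gender-related differences; a sketch, not the authors' exact step."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = np.where(q3 - q1 > 0, q3 - q1, 1.0)   # guard against zero spread
    return (X - np.median(X, axis=0)) / iqr
```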


Table 1
Class count of datasets.

Dataset     Attributes  Class 1  Class 2  Class 3  Class 4  Class 5  Class 6  Class 7  Class 8  Class 9  IRPLMax
Vani        22          94       11       19       -        -        -        -        -        -        8.5
Thyroid     5           150      35       30       -        -        -        -        -        -        5.0
PdA         22          239      29       28       29       22       -        -        -        -        10.9
Cleveland   13          160      54       35       35       12       -        -        -        -        13.3
Audiology   22          24       68       46       24       23       -        -        -        -        2.8
Vertigo     40          130      146      313      41       65       120      -        -        -        7.6
SVD         22          70       75       142      686      185      62       51       82       34       20.1

Thyroid dataset: This dataset, donated by Stefan Aeberhard, is available at the UCI online repository. With five attributes and a total of 215 samples, the task is to find out whether a given patient is normal or suffering from hyperthyroidism or hypothyroidism. The smallest class sample count is 20% of the largest class. Alexandridis and Chondrodima [1] demonstrated an evolutionary simulated annealing algorithm as a medical diagnostic tool, giving a test accuracy of 95.7% and a Matthews correlation coefficient (MCC) of 79% for thyroid classification. Saez et al. [35] used a synthetic over-sampling scheme and achieved an accuracy of 94.45%. The modified synthetic minority over-sampling technique (SMOTE) [28] gave an area under curve (AUC) of 97.18% [36]. Mikel Galar et al. [19] implemented binary classifier ensembles for under sampled as well as over sampled databases and achieved a highest area under curve of 98.33%. Wing et al. [47] implemented dual autoencoder based features for a neural network, giving an AUC of 99.9% for binary classification.

PdA dataset: This speech pathology dataset was generated at the Principe de Asturias (PdA) Hospital in Alcala de Henares, Madrid [5]. It consists of recordings of the sustained vowel /a/, with the first and last parts of each utterance removed to avoid onset and offset effects [4]. The speech signals are categorized according to the pathology detected by health experts. It has 238 samples from normal speakers and 201 samples from speakers with a wide range of speech disorders. We extracted 22 voice quality features for this dataset using a procedure similar to that for the Vani dataset. Pathologies with more than 20 samples are included in this work, giving 347 samples in all. The smallest class sample count is 9% of the largest class. Binary classification of the PdA speech dataset on 1908 cepstral and spectral features gave 80% accuracy for a Gaussian mixture model (GMM) [4]. A weighted sum of both feature types is implemented in [5], resulting in an improved accuracy of 84%. Mekyska et al. [33] used 36 pathological speech measures with a binary SVM classifier and achieved 82.1% accuracy.

Cleveland dataset: The Cleveland heart dataset is a benchmark dataset collected at the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation by Robert Detrano. It has 303 samples with 13 features and a number of missing values; we removed the rows with NaN values, leaving 296 samples. Four classes indicating heart disease and one healthy class constitute this dataset. The smallest class sample count is 7.5% of the largest class. Saez et al. [35] reported an accuracy of 34.50%. Risk level identification using fuzzy rules [3] resulted in an accuracy of 57.85%. Synthetic sampling using the SMOTE technique [36] gave an AUC of 64.33%. Mikel Galar et al. [19] implemented a binary classifier and achieved an accuracy of 86.73%.

Audiology dataset: This dataset was donated by Bruce Porter and collected at Baylor College of Medicine by Professor Jergen. It is a widely used benchmark dataset. It includes 24 pathology classes listing 18 pathologies, several with very small sample counts. We combined the training and testing datasets and selected the classes with individual sample counts of more than 20, generating 5 classes. The smallest class sample count is 34% of the largest class. H. Sug [41] categorized the Audiology dataset as a small imbalanced one and used 100 trees of a random forest to obtain a classification accuracy of 81.4%.
Vertigo dataset: The Vertigo dataset is credited to Martti Juhola. It includes six pathologies, amounting to a total of 815 samples with 40 features. It differs from the other datasets under study in that it lists only pathological samples. The smallest class sample count is 13% of the largest class. The Vertigo multiclass dataset was employed in [45] using neural networks and a genetic algorithm (GA) for one versus all (OVA) classifiers, achieving an average prediction accuracy of 82.65% for 812 samples.

SVD dataset: The Saarbrucken Voice Database (SVD), a German database built by Manfred Pützer and William J. Barry [7], consists of voice recordings from more than 2000 persons. SVD is a large dataset with proper categorization and labeling. We selected 9 classes with sample counts of more than 31, including 1387 samples in all. This dataset has an imbalance ratio of 4.9%. We extracted 22 voice quality features for this dataset, using the feature extraction technique applied to the Vani and PdA datasets described above. G. Muhammad et al. [34] selected 100 healthy and 100 pathological speakers and accomplished 100% accuracy using an RBF kernel SVM classifier. J. P. Teixeira and P. O. Fernandes [42] analyzed the SVD dataset [13,7] for statistical analysis of healthy voices. [14] evaluated the SVD database for individual vowels as well as for the fusion of the normal, low, high and low-high-low intonations of 3 vowels for the task of pathology detection, achieving an area under curve (AUC) of 87.9% for binary classification. Table 1 lists all the datasets and sample counts.


Fig. 1. Sample space after synthetic sampling.

Table 2
Effect of synthetic sampling on the Vani speech dataset.

Class    Original count  Synthetic count  Under sampled count  Crossvalidation train+test count  Hold out count
Class 1  94              -                76                   69                                7
Class 2  11              66               -                    70                                7
Class 3  19              57               -                    69                                7

4. Data conditioning with synthetic sampling

We formed a collection of seven medical datasets: three taken from the UCI data repository, one by Martti Juhola, and one each from PdA, SVD and Vani. A detailed description of these datasets is given in the previous section; the sample count of every dataset is given in Table 1.

Balancing the dataset. Our proposed technique for data balancing employs synthetic oversampling as well as under sampling. The count of synthetic samples in a class depends upon the difference between the smallest and largest classes. The datasets under consideration are multiclass imbalanced datasets, and it is difficult to find an exact measure of their imbalance. The imbalance ratio per label (IRPL) is calculated as the ratio between the majority class count and the original count 'O' of a given class [10]; thus IRPL = 1 for the most frequent class and greater for the rest. The literature exhibits classification of multiclass datasets with imbalance ratios ranging from 9.22 to 128 in [28] and from 1.4 to 15 in [49]. The sparseness of minority samples necessitates working with small sample counts. Due to the extremely imbalanced nature of the multiclass datasets under study, the IRPL value ranges from 20 down to 1.5 amongst the datasets; the maximum IRPL of each dataset is given as IRPLMax in Table 1. Based on the IRPL value, the required oversynthesis count 'R' is determined. 'R' is aimed at reducing the IRPL into the range of 2 down to 1 and is selected manually.

Synthetic oversampling using a Euclidean distance criterion (see Algorithm 1) is carried out for the minority classes. The count of synthetic samples generated is n times the original sample count, where n is an integer. The sum of the synthetic and original counts of a minority class is not allowed to exceed the count of the majority class. We avoid oversampling with replication, as replication increases training time without any gain in classification performance; instead, we generate synthetic minority class samples. The Euclidean distance criterion generates synthetic samples in the neighborhood of the existing minority class examples. After generating synthetic samples for all the smaller classes, under sampling of the larger classes is carried out: the larger classes are under sampled until their count equals the balanced (original plus synthetic) count of the smallest class. Table 2 shows the conditioned dataset after balancing. Fig. 1 shows the sample space for original and synthetic data; it can be observed that the synthetic samples are generated within the feature space periphery of the original minority samples. We ensured the randomness of the feature values by subtracting a random value from the original feature value, as given in Algorithm 1. Oversampling may generate artificial samples belonging to a different class even though they are generated based on nearest neighbors; it is necessary that synthetic and real data have similar patterns. X. Zhang [50] suggested bridging the 'synthetic gap' between the two in such a way that the feature distribution of the generated synthetic data does not shift away from that of the real data.
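The IRPL bookkeeping described above fits in a few lines. The sketch below is hedged: irpl follows the definition in [10], while oversynthesis_counts replaces the paper's manual choice of 'R' with an illustrative target-ratio rule (all names are ours):

```python
from collections import Counter

def irpl(labels):
    """Imbalance ratio per label: majority class count / class count [10]."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {c: majority / n for c, n in counts.items()}

def oversynthesis_counts(labels, target_irpl=2.0):
    """Synthetic count R per class needed to pull its IRPL down to roughly
    target_irpl; an illustrative stand-in for the paper's manual selection."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {c: max(0, round(majority / target_irpl) - n)
            for c, n in counts.items()}
```

For the Vani dataset (94/11/19 samples), irpl returns 1, 8.5 and 4.9, matching the IRPLMax value of 8.5 reported for the smallest class in Table 1.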


Fig. 2. Experimental setup of our system for SVM based classifiers. Left: training setup; right: testing based on multiple optimized classifiers.

We have tried to avoid synthetic outliers by generating synthetic samples within the periphery of the original features.

The literature suggests the use of cross validation techniques (also known as the leave-one-out approach or k-fold technique) for multiclass datasets. In this technique, the dataset is split into k folds, each fold containing 100/k % of the samples; for each fold, the algorithm is trained with the examples contained in the remaining folds and then tested on the current fold [28]. Cross validation testing is thus carried out on 33% of the samples for 3 folds, 20% for 5 folds, 10% for 10 folds, and so on. Smaller datasets are evaluated with this approach [15,18,25,26,29,50]. Since the real class of synthetic samples is in question, it is not recommended to use synthetic samples for classifier evaluation. In order to have an appropriate number of test samples from the minority classes, we reserved from each class a number of original samples equal to 2/3 of the original count of the smallest class. These samples are kept aside and do not contribute to the training process for classifier model optimization. The remaining original plus synthetic samples are used with a 5-fold stratified cross validation technique for classifier model optimization, and the reserved samples are used to evaluate the optimized model. The Vani dataset represents a combination of imbalanced data and small sample size; Table 2 shows the effect of synthetic sampling on it. Column 2 lists the original samples, column 3 the synthetic samples generated for the minority classes, and column 4 the under sampled count of the majority class used to balance the dataset. The original plus synthetic sample counts used together for cross validation are given in column 5, and the 2/3 of the smallest class reserved as holdout data in column 6. The setup for generation, training and testing of classifiers using OVO SVMs is given in Fig. 2.

5. Multiclass classification

The Levenberg-Marquardt algorithm is widely used for neural network parameter selection; hence we used it for one of the classification techniques in our analysis. The other techniques are SVM based, with various strategies for initialization and selection. SVM is a very powerful binary classification algorithm; we used OVO as a decomposition strategy with SVMs as base classifiers. Based on statistical learning theory, the SVM is a novel type of learning machine which contains polynomial, neural network and radial basis function (RBF) machines as special cases. A Gaussian kernel is commonly used for the RBF, and the spread parameter of the Gaussian kernel is selected for the generalization performance of the SVM. The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters and the soft margin parameter [20,30]. The cost (penalty) parameter C creates a soft margin that allows misclassification to a certain extent. The Parzen window parameter sigma of the RBF and C are tuned to optimize the classifiers. If sigma is very low, all training vectors function as support vectors, giving very high training accuracy and over fitting; such a system cannot identify test vectors and offers very low test accuracy. If sigma is very high, all the training data points are treated as one point and the SVM fails to recognize any new test data point. C affects the tradeoff between complexity and the proportion of non-separable samples and must be selected carefully.
A large value of C imposes a high penalty on non-separable classes; the model then stores an increased number of support vectors and over fits, and we observed that convergence is not achieved even after the maximum number of iterations. A small value of C may cause under fitting. We optimized the RBF kernel based SVM classifiers using grid search, vector search, particle swarm optimization and genetic algorithm based search techniques. Classifiers are trained using 5-fold cross validation and tested using the holdout approach. The following techniques are implemented for classification:

1. Feed forward neural networks (NEU)
2. Grid based SVM classifiers (GSVM)
3. Forward feature selection based SVM classifier (SVMFS)



4. Genetic algorithm based SVM classifiers (GA)
5. Modified PSO algorithm based SVM classifiers (M-PSO)

Techniques 2 to 5 are implemented using the one versus one (OVO) method. This gives c optimized classifiers,

c = \frac{n(n-1)}{2}    (1)

where n is the number of classes. These optimized classifiers are arranged graphically, this arrangement is subjected to the reserved unseen original samples, and the performance is evaluated. The optimized feed forward neural network model is likewise subjected to the reserved unseen original samples and its performance evaluated. We used both the original imbalanced and the conditioned balanced datasets for a comparative analysis of these classification methods.

Maximum vote wins technique. For SVM, the one-against-one approach generally performs better than the one-against-rest approach. For one-against-one, all the included binary classifiers should be sufficiently competent; if some of them are not, invalid classification results may follow [23]. We optimize each classifier to address this problem. Amongst the c classifiers of Eq. (1), every class appears as one of the participating pair for (n-1) classifiers during evaluation; for the other classifiers it belongs to neither participating class. Correct prediction is expected only when the input vector belongs to one of the two classes of an OVO classifier. We give equal weight to every classifier and select as winner the output class that obtains the maximum number of votes (see Fig. 2).

Experimental setup: We implemented all the algorithms from scratch. The empirical experiments were conducted on an Intel Core 2 Duo CPU (2.66 GHz) with 4 GB of RAM. The literature suggests search ranges for C and sigma of about [10^-2, 10^2]. In our preliminary experiments we worked with wider parameter ranges for every dataset and found that the best performance is achieved with parameters in the range [0.1, 5]. This keeps the penalty factor low, giving a lower cost of misclassification. We used the 5-fold cross validation technique for training and the holdout technique for testing the model.

5.1. Feedforward neural network

Neural network training is carried out with the Levenberg-Marquardt algorithm, which trains the network by a combination of the gradient descent method and the Gauss-Newton method. The selection of the optimum number of neurons is crucial for a neural network model, and some thumb rules are suggested in the literature. The geometric pyramid rule proposed by Masters [32] states that for a three layer network with n input and m output neurons, the hidden layer would have sqrt(n*m) neurons. In this work, the feedforward neural network is optimized with two hidden layers for multiclass classification, with all features at the input.

5.2. Grid based SVM classifier (GSVM)

This algorithm works on the complete feature set and selects appropriate SVM hyperparameters for the Gaussian kernel using grid search. For each hyperparameter combination, c OVO GSVM classifiers are built on the training sets using the 5-fold cross validation technique. In each iteration, we select the individual classifier hyperparameters which contribute to the best average classification accuracy of the group of classifiers; our focus is the selection of hyperparameters giving group best rather than individual best performance. The optimized model performance is assessed on the reserved unseen data using the maximum vote wins technique. The box plots in Fig. 3 represent the range of parameter values of the SVM classifiers for one dataset; an improvement can be seen in the classifiers for balanced data.

Fig. 3. Parameter range of OVO classifiers for the original dataset (left) and for the balanced dataset (right).
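A minimal sketch of the per-pair grid search of Section 5.2 together with the maximum vote wins prediction, assuming scikit-learn's SVC (whose RBF kernel uses gamma rather than sigma, with gamma = 1/(2 sigma^2)). For brevity the sketch picks each pair's hyperparameters individually, whereas the paper selects them for the best group average accuracy:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_score(C, gamma, X, y):
    """Mean 5-fold cross validation accuracy of an RBF SVM."""
    return cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"), X, y, cv=5).mean()

def grid_search_ovo(X, y, grid=np.arange(0.1, 5.1, 0.5)):
    """Grid search (C, gamma) separately for every OVO class pair."""
    models = {}
    for a, b in combinations(np.unique(y), 2):          # c = n(n-1)/2 pairs
        m = np.isin(y, [a, b])
        Xp, yp = X[m], y[m]
        C, g = max(((C, g) for C in grid for g in grid),
                   key=lambda p: cv_score(p[0], p[1], Xp, yp))
        models[(a, b)] = SVC(C=C, gamma=g, kernel="rbf").fit(Xp, yp)
    return models

def predict_mvw(models, x):
    """Maximum vote wins over the optimized OVO classifiers."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models.values()]
    return max(set(votes), key=votes.count)
```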
Shilaskar et al., Medical decision support system for extremely imbalanced datasets, Information Sciences (2016), http://dx.doi.org/10.1016/j.ins.2016.08.077


Fig. 4. Feature selection performance with SVMFS. Graph to the left shows performance for original and graph to the right shows performance for balanced dataset.

Grid search is often criticized for getting trapped in local optima; its failure to achieve the global optimum prompts us to explore hybrid, evolutionary and random search techniques.

5.3. Forward feature selection based SVM classifier (SVMFS)

This algorithm selects both discriminative features and appropriate SVM hyperparameters for the Gaussian kernel. The hybrid search combines the following [43]:

- Grid search
- Vector search

Grid search is conducted for individual classifier hyperparameter selection; vector search is conducted for feature selection. For this hybrid search technique, we arrange the features in descending order of relevance to multiclass discrimination using the t-test criterion, making the first feature the most discriminative and the last the least. One feature at a time is appended to the dataset and the individually optimized classifiers are evaluated for model performance. If the performance with the added feature is better than without it, the feature is retained; otherwise it is removed from the dataset. Forward feature selection is implemented for the group of classifiers. The mean performance of the classifiers towards disease diagnosis for the imbalanced and balanced datasets is shown in Fig. 4. A forward feature selection algorithm for binary classifiers is given in [37]; in the present work we developed a multiclass version of this hybrid forward selection technique. As may be seen in Fig. 4, for imbalanced datasets the bias of the majority class is visible in two of the parameters. The optimized model performance is assessed on the reserved unseen data using the maximum vote wins (MVW) technique.

5.4. Genetic algorithm

A genetic algorithm (GA), an evolutionary algorithm, applies the principles of evolution found in nature to the problem of finding an optimal solution. In a GA, the problem is encoded in a series of bit strings that are manipulated by the algorithm [22]. This differs from classical algorithms in that it relies on random sampling and maintains a population of candidate solutions, one of which is the current best; it is this population of solutions that keeps the GA from getting trapped at a local optimum. Chromosomes hold the key to classifier model optimization; subsequent generations of chromosomes are produced by the following operations.

Mutation: Inspired by natural evolution, a periodic, low probability, random alteration of the chromosome. The new chromosome may be better or worse than the existing population.

Crossover: Characteristics of the parents are passed on to the offspring.

Selection: Natural selection is performed by a selection process in which the fittest members of the population survive and the weakest are eliminated.

In this work we optimize the classifiers for different random values of sigma and penalty and the same set of randomly selected features. The chromosome generated has three parts: C, sigma and a feature mask. Fig. 5 shows a chromosome with different model parameter values and common features for each classifier of the OVO model. Genome encoding is used for the real valued SVM parameters; we implemented the scheme by taking a mask of the genome to select the parameter values in the search space, as given in Table 3.


Table 3
Genome encoding.

Genome mask      1    0    0    1    0    1    1    1    1
Parameter value  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
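Read column-wise, Table 3 keeps exactly those parameter values whose mask bit is 1; a short illustration with the values of the table:

```python
# Decode the genome mask of Table 3 against the discretized search space.
mask = [1, 0, 0, 1, 0, 1, 1, 1, 1]
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
candidates = [v for m, v in zip(mask, values) if m]
print(candidates)   # -> [0.1, 0.4, 0.6, 0.7, 0.8, 0.9]
```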

Fig. 5. Chromosome for GA based multiclass classification.

Fig. 6. GA based hyperparameter optimization.

Fig. 6 shows the optimization of one of the classifiers participating in training: the layers indicate the parameters searched by a chromosome, and vertical movement shows the improvement in performance obtained by evolving healthier chromosomes. As shown in Fig. 7, there are several evaluations during each chromosome's lifetime, because the duration of mutation depends upon the improved evaluation of the chromosome. The probability of mutation is set very low. If the performance of both offspring chromosomes is poor compared to their ancestors, they are considered unfit and eliminated from the process, and the next crossover is initiated. The best chromosomes of a generation participate in the production of offspring. A drawback of any evolutionary algorithm is that a solution is better only in comparison to the other presently known solutions.
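A compact sketch of the GA loop described above, evolving chromosomes of (C, sigma, feature mask); the population size, crossover and mutation details are illustrative assumptions, and the fitness function is expected to wrap the cross-validated accuracy of the OVO classifier set:

```python
import random

def ga_optimize(fitness, n_features, pop_size=20, gens=30, p_mut=0.05):
    """Evolve (C, sigma, feature mask) chromosomes by selection,
    crossover and low-probability mutation; an illustrative sketch."""
    def rand_chrom():
        return (random.uniform(0.1, 5.0),                     # C
                random.uniform(0.1, 5.0),                     # sigma
                [random.randint(0, 1) for _ in range(n_features)])

    population = [rand_chrom() for _ in range(pop_size)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)            # fittest first
        parents = population[:pop_size // 2]                  # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_features)
            mask = a[2][:cut] + b[2][cut:]                    # one-point crossover
            if random.random() < p_mut:                       # rare mutation
                i = random.randrange(n_features)
                mask[i] ^= 1
            children.append(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, mask))
        population = parents + children
    return max(population, key=fitness)
```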


Fig. 7. GA population performance: new chromosome generation, mutation, crossover and selection.

5.5. Modified particle swarm optimization

The particle swarm concept originated as a simulation of a simplified social system [21]. Like evolutionary algorithms, particle swarm optimization (PSO) is initialized with a population of random solutions; however, it is motivated by the simulation of social behavior rather than by natural selection and evolutionary mechanisms. In PSO, each particle is treated as a point in the d-dimensional problem space. The ith particle is represented as x_i = (x_{i1}, x_{i2}, ..., x_{id}), its previous best position as p_i = (p_{i1}, p_{i2}, ..., p_{id}), and g denotes the index of the best particle of the population. The velocity (rate of position change) of particle i is v_i = (v_{i1}, v_{i2}, ..., v_{id}). The particles are manipulated according to the following two equations [9,22]:

v_{id} = w \, v_{id} + c_1 \phi_1 (p_{id} - x_{id}) + c_2 \phi_2 (p_{gd} - x_{id})    (2)

x_{id} = x_{id} + v_{id}    (3)

where d is the dimension of the problem space, w the inertia weight factor, p_{id} the best neighborhood particle position, x_{id} the current position of the particle, p_{gd} the global best fitness particle location, c_1 and c_2 the acceleration coefficients, and \phi_1, \phi_2 random values in the range (0, 1).

Modifications to PSO: PSO is modified by implementing dynamic decision making during the flight of the particles. The algorithm generates 30 particles and selects the top 10% of them. Every particle is allowed to accelerate, unless terminated owing to degraded performance, until the completion of its acceleration count. The two important steps taken are:

1. Two step velocity modification based on classifier performance.
2. Termination of the particle when the performance achieved after t/2 acceleration steps is less than the mean performance of the previous particles.


Fig. 8. PSO swarm particles spanning the model hyperparameters for the thyroid dataset. Left: search on the imbalanced set; right: search on the balanced set. Straight lines indicate the path traversed by a particle during optimization; the change in resolution of the search grid indicates acceleration and deceleration of the particle.

Here t is the number of times the particle is allowed to accelerate. While maintaining the Pbest (particle best) and Gbest (group best) concepts of traditional PSO, the algorithm is modified as given below:

- Initialize and evaluate the performance of a random population; select the particles giving high accuracy.
- Every particle accelerates at most t times, but is terminated at step t/2 if

  \frac{2}{t} \sum_{q=1}^{t/2} X_{pq} < \frac{1}{pcnt} \sum_{k=1}^{pcnt} X_{pk}

  where pcnt is the current particle count and X_p the particle performance.
- The velocity of the particle is computed dynamically.
- The fine tuned particle with the best performance is selected for every classifier.
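A simplified sketch of the update rules of Eqs. (2) and (3) combined with the early termination test above (swarm size, parameter bounds and the fitness function are illustrative assumptions; the paper additionally applies a two step velocity modification and keeps the top 10% of particles per classifier):

```python
import numpy as np

def mpso(fitness, dim, n_particles=30, t=20, w=0.7, c1=1.5, c2=1.5,
         rng=np.random.default_rng()):
    """Sketch of M-PSO: Eqs. (2)-(3) plus early termination of a particle
    whose mean performance at step t/2 falls below the mean performance
    of the previously evaluated particles."""
    x = rng.uniform(0.1, 5.0, (n_particles, dim))   # positions, e.g. (C, sigma)
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_f)].copy()
    gbest_f = pbest_f.max()
    finished = []                                    # mean perf of past particles
    for i in range(n_particles):
        perf = []
        for step in range(t):
            r1, r2 = rng.random(dim), rng.random(dim)
            v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i]) \
                   + c2 * r2 * (gbest - x[i])        # Eq. (2)
            x[i] = x[i] + v[i]                       # Eq. (3)
            f = fitness(x[i])
            perf.append(f)
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = x[i].copy(), f
            if f > gbest_f:
                gbest, gbest_f = x[i].copy(), f
            if step == t // 2 and finished and np.mean(perf) < np.mean(finished):
                break                                # terminate weak particle
        finished.append(np.mean(perf))
    return gbest
```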

In Modified PSO (M-PSO), feature selection is carried out for the group of classifiers, whereas hyperparameter selection is per individual classifier. M-PSO borrows an evolution-like feature from the GA in that it explores the top 10% best fitting particles. The feature set is represented similarly to the GA [27,48]: each particle is encoded by a p-bit binary string, where p is the number of available features; a bit with value 1 indicates that the feature is selected in the subset, and 0 otherwise. For example, for a feature set with 5 features the binary string is S = (F1, F2, F3, F4, F5); we can select any number of features smaller than p, e.g. three random features giving the subset S = (F2, F3, F5). The M-PSO technique tries to find the optimum model parameters by venturing further into the model space with changed velocity. As seen in Fig. 8, M-PSO optimizes the imbalanced dataset classifier with higher values of sigma and penalty than for the balanced dataset. Fig. 9 shows the flight of particles optimizing three classifiers simultaneously in the model search space.

6. Performance evaluation

Multiclass classification problems incur more intricate conditions than binary classification, hence selecting an appropriate metric is critical to avoid drawing misleading conclusions. When the evaluation is based on accuracy, enhancement comes from a higher classification rate on the easily recognizable majority classes; other evaluation criteria may diminish this claim and show the weakness of a classifier owing to the presence of difficult minority classes [17]. The area defined by the true positive rate and false positive rate, the area under curve (AUC), is suggested in [25] for multiclass problem evaluation. Ensemble methods are analyzed in [15] based on average scores of AUC, geometric mean and F-measure (F-mea) to compare the performance of classifiers on imbalanced datasets. Sokolova and Lapalme [39] analyzed twenty-four performance measures used across the spectrum of machine learning classification tasks: binary, multi-class, multi-labeled and hierarchical. We selected the measures most relevant to multiclass classification for the analysis of the imbalanced and balanced dataset versions. G. Jurman et al. [42], Alexandridis and Chondrodima [1], and G. Armano et al. [6] suggested MCC as a crucial performance measure for evaluating the outcome of a classification task, on both binary and multiclass problems. F-mea cares about the positive versus the negative class, whereas for MCC the positive and negative distinction is merely semantic: the meaning of positive and negative can be switched, keeping the MCC value the same. MCC is regarded as a balanced measure which can be used even if the classes are of very different sizes.


Fig. 9. M-PSO swarm spanning the model hyperparameters for three classifiers simultaneously (thyroid dataset).

We used the macro averaging technique suggested in [39,44] for calculating accuracy, sensitivity, specificity, AUC, precision and F-mea as performance metrics. The multiclass version of MCC given in [42] is also employed as a performance measure in this work. Apart from these measures, we focus on the percentage of support vectors generated by the SVM and the number of features used by each machine learning algorithm. The expressions below represent the evaluation methods for multiclass classification; in the expressions, l stands for the class count. A two class confusion matrix shows the positions of actual versus predicted samples:

Confusion matrix

                  Predicted positive     Predicted negative
Actual positive   True positive (TP)     False negative (FN)
Actual negative   False positive (FP)    True negative (TN)

The average per class accuracy (ACC) is given as [24]

Accuracy = \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}

Sensitivity (SEN), also referred to as recall, is the number of correctly classified positive examples divided by the number of positive examples in the data:

Sensitivity = \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i}{TP_i + FN_i}

Specificity (SPEC) measures the proportion of negatives that are correctly identified:

Specificity = \frac{1}{l} \sum_{i=1}^{l} \frac{TN_i}{TN_i + FP_i}

Precision (PREC) is the number of correctly classified positive examples divided by the number labeled by the system as positive:

Precision = \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i}{TP_i + FP_i}

The F score or F measure (F-MEA) is the harmonic mean of precision and recall:

F\text{-}Measure = \frac{1}{l} \sum_{i=1}^{l} \frac{2 \cdot precision_i \cdot recall_i}{precision_i + recall_i}
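The macro-averaged expressions above translate directly into code; a sketch, assuming an l x l confusion matrix with rows as actual classes and columns as predicted classes (the names are ours):

```python
import numpy as np

def macro_metrics(C):
    """Macro-averaged metrics from an l x l confusion matrix C."""
    C = np.asarray(C, dtype=float)
    TP = np.diag(C)
    FN = C.sum(axis=1) - TP          # actual class i, predicted as another
    FP = C.sum(axis=0) - TP          # predicted class i, actually another
    TN = C.sum() - TP - FN - FP
    sen = TP / (TP + FN)             # per-class recall
    spec = TN / (TN + FP)
    prec = TP / (TP + FP)
    return {"ACC": np.mean((TP + TN) / (TP + FN + FP + TN)),
            "SEN": sen.mean(), "SPEC": spec.mean(), "PREC": prec.mean(),
            "F-mea": np.mean(2 * prec * sen / (prec + sen))}
```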



Table 4
Classifier performance for the Vertigo (6 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.67   0.68   0.93   0.68   0.64   0.58   0.65   40   -
GSVM     0.78   0.75   0.95   0.83   0.83   0.71   0.78   40   81%
SVMFS    0.89   0.90   0.98   0.90   0.90   0.86   0.90   15   46%
GA       0.80   0.81   0.95   0.82   0.84   0.74   0.82   19   49%
M-PSO    0.87   0.90   0.97   0.88   0.88   0.83   0.89   28   59%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.84   0.84   0.97   0.84   0.85   0.81   0.84   40   -
GSVM     0.91   0.91   0.98   0.91   0.92   0.89   0.91   40   54%
SVMFS    0.90   0.90   0.98   0.90   0.91   0.88   0.90   20   37%
GA       0.92   0.92   0.98   0.92   0.92   0.90   0.92   26   45%
M-PSO    0.94   0.94   0.99   0.94   0.95   0.93   0.94   33   40%

Table 5
Classifier performance for the Vani (3 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.76   0.59   0.81   0.59   0.55   0.39   0.56   22   -
GSVM     0.78   0.44   0.73   0.58   0.61   0.27   0.48   22   60%
SVMFS    0.84   0.58   0.81   0.71   0.56   0.50   0.57   8    57%
GA       0.81   0.51   0.77   0.65   0.61   0.40   0.55   9    59%
M-PSO    0.81   0.55   0.77   0.64   0.73   0.40   0.60   15   58%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.86   0.86   0.93   0.86   0.88   0.80   0.86   22   -
GSVM     0.90   0.90   0.95   0.90   0.90   0.86   0.90   22   50%
SVMFS    0.95   0.95   0.98   0.95   0.96   0.93   0.95   10   63%
GA       0.95   0.95   0.98   0.95   0.96   0.93   0.95   10   48%
M-PSO    1.00   1.00   1.00   1.00   1.00   1.00   1.00   22   43%

Table 6
Classifier performance for the Cleveland (5 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.72   0.42   0.94   0.37   0.43   0.58   0.41   13   -
GSVM     0.61   0.41   0.89   0.49   0.42   0.39   0.39   13   76%
SVMFS    0.55   0.29   0.87   0.40   0.26   0.26   0.26   5    79%
GA       0.56   0.25   0.83   0.45   0.27   0.17   0.23   6    90%
M-PSO    0.63   0.36   0.88   0.46   0.35   0.38   0.33   10   83%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.62   0.62   0.90   0.62   0.60   0.52   0.61   13   -
GSVM     0.73   0.73   0.93   0.73   0.75   0.66   0.72   13   60%
SVMFS    0.75   0.75   0.94   0.75   0.77   0.69   0.75   13   54%
GA       0.80   0.80   0.95   0.80   0.83   0.73   0.80   10   59%
M-PSO    0.83   0.83   0.96   0.83   0.84   0.78   0.83   10   60%

Table 7
Classifier performance for the Thyroid (3 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.94   0.91   0.96   0.91   0.92   0.87   0.90   5    -
GSVM     0.94   0.91   0.95   0.91   0.91   0.86   0.91   5    75%
SVMFS    0.97   0.99   0.99   0.98   0.94   0.94   0.96   3    67%
GA       0.94   0.91   0.95   0.91   0.91   0.86   0.91   4    48%
M-PSO    0.94   0.91   0.95   0.91   0.91   0.86   0.91   4    66%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.98   0.98   0.99   0.98   0.98   0.98   0.98   5    -
GSVM     0.98   0.98   0.99   0.98   0.98   0.98   0.98   5    53%
SVMFS    1.00   1.00   1.00   1.00   1.00   1.00   1.00   4    34%
GA       0.98   0.98   0.99   0.98   0.98   0.98   0.98   4    27%
M-PSO    1.00   1.00   1.00   1.00   1.00   1.00   1.00   4    34%

For the binary classification setup, MCC has the following shape:

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

For multiclass classification, the following equation over the N x N confusion matrix C is used to calculate MCC:

MCC = \frac{\sum_{k,l,m=1}^{N} \left( C_{kk} C_{ml} - C_{lk} C_{km} \right)}{\sqrt{\sum_{k=1}^{N} \left( \sum_{l=1}^{N} C_{kl} \right) \left( \sum_{\substack{f,g=1 \\ f \neq k}}^{N} C_{gf} \right)} \; \sqrt{\sum_{k=1}^{N} \left( \sum_{l=1}^{N} C_{lk} \right) \left( \sum_{\substack{f,g=1 \\ f \neq k}}^{N} C_{fg} \right)}}

We calculated the mean percentage of support vectors (SVs) required by each algorithm and the number of feature attributes (Fea) for the comparative study. Tables 4-10 show the holdout test performance parameters for the imbalanced and balanced sets. Neural networks treat the test vectors quite differently from SVMs: during training, 60% of the samples were used for training the neural network, 20% for validation and 20% for testing. For evaluating OVO classifiers the literature uses k-fold cross validation test performance as the evaluating criterion, but since these samples are used for assessment during training, we believe they indirectly participate in shaping the classifier model. For multiclass classification, it is not individual best performance but group best performance that matters for classifying an unknown sample; the holdout test results represent the classification performance for unknown samples. Fig. 10 shows the 5-fold cross validation test and holdout test performances. It is seen that when the dataset is balanced by implementing the technique given in Algorithm 1, the test performance improves.


Table 8
Classifier performance for the SVD (9 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.35   0.13   0.89   0.13   0.13   0.09   0.12   22   -
GSVM     0.51   0.15   0.91   0.53   0.12   0.19   0.12   22   96%
SVMFS    0.50   0.16   0.91   0.51   0.11   0.20   0.12   11   94%
GA       0.49   0.12   0.89   0.52   0.13   0.04   0.09   8    93%
M-PSO    0.50   0.14   0.90   0.52   0.09   0.16   0.11   15   93%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.39   0.39   0.92   0.39   0.39   0.31   0.38   22   -
GSVM     0.79   0.79   0.97   0.79   0.88   0.78   0.81   22   50%
SVMFS    0.80   0.80   0.98   0.80   0.89   0.79   0.82   15   48%
GA       0.80   0.80   0.98   0.80   0.86   0.79   0.82   16   50%
M-PSO    0.87   0.87   0.98   0.87   0.91   0.86   0.88   18   50%

Table 9
Classifier performance for the Audiology (5 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.75   0.67   0.93   0.67   0.73   0.67   0.66   22   -
GSVM     0.84   0.81   0.95   0.89   0.88   0.79   0.83   22   70%
SVMFS    0.93   0.89   0.98   0.96   0.95   0.91   0.91   11   57%
GA       0.85   0.80   0.96   0.91   0.87   0.81   0.81   13   62%
M-PSO    0.87   0.81   0.97   0.93   0.90   0.84   0.83   16   69%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.73   0.73   0.93   0.73   0.77   0.67   0.74   22   -
GSVM     0.88   0.88   0.97   0.88   0.89   0.85   0.88   22   63%
SVMFS    0.92   0.92   0.98   0.92   0.93   0.90   0.92   12   47%
GA       0.87   0.87   0.97   0.87   0.87   0.84   0.86   14   60%
M-PSO    0.95   0.95   0.99   0.95   0.95   0.94   0.90   15   55%

Table 10
Classifier performance for the PdA (5 class) dataset.

Imbalanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.39   0.39   0.85   0.39   0.44   0.24   0.40   22   -
GSVM     0.71   0.27   0.82   0.69   0.54   0.27   0.28   22   93%
SVMFS    0.71   0.28   0.82   0.80   0.34   0.27   0.27   5    92%
GA       0.69   0.26   0.81   0.75   0.44   0.18   0.27   15   88%
M-PSO    0.70   0.25   0.82   0.75   0.44   0.22   0.25   15   90%

Balanced dataset
Method   ACC    SEN    SPEC   AUC    PREC   MCC    F-mea  Fea  SVs
NEU      0.72   0.72   0.93   0.72   0.71   0.65   0.71   22   -
GSVM     0.89   0.89   0.97   0.89   0.92   0.87   0.90   22   47%
SVMFS    0.91   0.91   0.98   0.91   0.93   0.88   0.91   13   49%
GA       0.86   0.87   0.97   0.87   0.89   0.84   0.86   12   52%
M-PSO    0.91   0.91   0.98   0.91   0.92   0.88   0.91   16   45%

Algorithm 1. Synthetic oversampling.

Input: S, the original minority class data (O samples)
Output: synth, the synthetic data

1:  for j = 1 to R/O do
2:    for i = 1 to O do
3:      vec = S(i)
4:      find the Euclidean distance between vec and the other samples
5:      sort the samples in ascending order of distance
6:      find vecns, the nearest neighbor of vec
7:      compute the difference in feature values, diff = vec - vecns
8:      synth(i) = vec - diff * rand(1)
9:    end for
10:   store the synthetic samples
11: end for
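A runnable NumPy sketch of Algorithm 1 (the helper name and the random generator are ours; step 8 is the random scaling along the direction to the nearest neighbor):

```python
import numpy as np

def synthetic_oversample(S, R, rng=np.random.default_rng()):
    """Generate about R synthetic samples near the O originals in S (O x d)."""
    O = len(S)
    synth = []
    for _ in range(R // O):
        for i in range(O):
            vec = S[i]
            dist = np.linalg.norm(S - vec, axis=1)   # Euclidean distances
            dist[i] = np.inf                          # exclude vec itself
            vecns = S[np.argmin(dist)]                # nearest neighbor
            diff = vec - vecns
            synth.append(vec - diff * rng.random())   # random point near vec
    return np.array(synth)
```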

Though the results of GSVM, SVMFS, GA and M-PSO are almost equivalent in the cross validation test (see Fig. 10), in the holdout test the results of M-PSO and SVMFS show an improvement over the rest. We calculated the mean performance measures over all datasets and depict them in the column charts of Fig. 11. For the imbalanced datasets, due to the inherent bias towards the bigger class, there is a large difference between the sensitivity and specificity values. AUC and MCC are considered balanced evaluators for imbalanced data. We found SVMFS and M-PSO performing better than the other algorithms even in the biased imbalanced environment (see Fig. 11, left side graph). SVMFS is based on a two stage search, whereas M-PSO is based on random initialization of the feature vector; thus M-PSO performs the optimization in reduced time compared to SVMFS. M-PSO is the clear winner for balanced datasets, as it outperforms all the other algorithms. With a clinical diagnostic support system in mind, we put less emphasis on minimization of the feature descriptors, as accurate diagnosis of the pathology is essential.


Fig. 10. k-fold test accuracy achieved during optimization and holdout test accuracy of the optimized classifiers.


Fig. 11. Holdout performance over all datasets. Left: classifier performance on the original imbalanced datasets; right: classifier performance on the balanced datasets.

7. Conclusions

The purpose of this paper has been to investigate the medical diagnosis of pathological samples under class imbalance. The work covers three main research lines: a resampling technique, multiclass classifier optimization, and the selection of performance measures. The proposed resampling technique is simple: it implements oversampling as well as under sampling, thus restricting the size of the dataset. SVMFS is based on a hybrid filter-wrapper approach with two stages, grid search and vector search; M-PSO is based on random initialization of the feature vector and swarm based optimization. Both techniques give improved classification accuracy, and M-PSO is superior to SVMFS with respect to the training time of the classifier set. The proposed classification techniques, M-PSO and SVMFS, together with synthetic sampling, are found to be very effective for multiclass classification problems, and the empirical results show that they yield robust classification results. Classifier assessment is based on nine performance measures, evaluated on held-out unseen data. Our goal is to apply the proposed algorithms in real clinical diagnostic systems in future.

References

[1] A. Alexandridis, E. Chondrodima, A medical diagnostic tool based on radial basis function classifiers and evolutionary simulated annealing, J. Biomed. Inf. 49 (2014) 61-72.
[2] B.A. Almogahed, I.A. Kakadiaris, Empowering imbalanced data in supervised learning: a semi-supervised learning approach, in: Artificial Neural Networks and Machine Learning ICANN, Springer International Publishing, 2014, pp. 523-530, doi:10.1007/978-3-319-11179-7-66.
[3] P. Anooj, Clinical decision support system: risk level prediction of heart disease using weighted fuzzy rules, J. King Saud University - Comput. Inf. Sci. 24 (1) (2012) 27-40, doi:10.1016/j.jksuci.2011.09.002.
[4] J.D. Arias-Londono, J.I. Godino-Llorente, N. Saenz-Lechon, V. Osma-Ruiz, G. Castellanos-Dominguez, An improved method for voice pathology detection by means of a hmm-based feature space transformation, Pattern Recognit. 43 (9) (2010) 3100-3112, doi:10.1016/j.patcog.2010.03.019.
[5] J.D. Arias-Londono, J.I. Godino-Llorente, M. Markaki, Y. Stylianou, On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices, Logopedics Phoniatrics Vocology 36 (2) (2011) 60-69, doi:10.3109/14015439.2010.528788.



[6] G. Armano, A direct measure of discriminant and characteristic capability for classifier building and assessment, Inf. Sci. 325 (2015) 466-483.
[7] W. Barry, M. Putzer, Saarbrucken Voice Database, Institute of Phonetics, Univ. of Saarland. http://www.stimmdatenbank.coli.uni-saarland.de/
[8] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE - majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 405-425.
[9] P. Chang, J. Lin, C. Liu, An attribute weight assignment and particle swarm optimization algorithm for medical database classifications, Comput. Methods Programs Biomed. 107 (3) (2012) 382-392, doi:10.1016/j.cmpb.2010.12.004.
[10] F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera, Addressing imbalance in multilabel classification: measures and random resampling algorithms, Neurocomputing 163 (2015) 3-16.
[11] N.V. Chawla, in: Data Mining and Knowledge Discovery Handbook, Springer, US, 2010, pp. 875-886, doi:10.1007/0-387-25465-X_40.
[12] B. Das, N.C. Krishnan, D.J. Cook, RACOG and wRACOG: two probabilistic oversampling techniques, IEEE Trans. Knowl. Data Eng. 27 (1) (2015) 222-234.
[13] M.G. David, E. Lleida, A. Ortega, A. Miguel, J.A. Villalba, Voice pathology detection on the Saarbrucken voice database with calibration and fusion of scores using multifocal toolkit, in: Advances in Speech and Language Technologies for Iberian Languages - IberSPEECH 2012, Madrid, Spain, 2012, pp. 99-109, doi:10.1007/978-3-642-35292-8_11.
[14] D.M. Gonzalez, E. Lleida, A. Ortega, A. Miguel, Score level versus audio level fusion for voice pathology detection on the Saarbrucken voice database, in: Advances in Speech and Language Technologies for Iberian Languages - IberSPEECH 2012, Madrid, Spain, 2012, pp. 110-120, doi:10.1007/978-3-642-35292-8_12.
[15] J.F. Diez-Pastor, J.J. Rodriguez, C.I. Garcia-Osorio, L.I. Kuncheva, Diversity techniques improve the performance of the best imbalance learning ensembles, Inf. Sci. 325 (2015) 98-117.
[16] S. Ertekin, Adaptive oversampling for imbalanced data classification, 264 (2013) 261-269, doi:10.1007/978-3-319-01604-7_26.
[17] M. Galar, A. Fernandez, E. Barrenechea, F. Herrera, Empowering difficult classes with a similarity based aggregation in multi-class classification problems, Inf. Sci. 264 (2014) 135-157.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes, Pattern Recognit. 44 (8) (2011) 1761-1776.
[19] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets, Inf. Sci. 354 (2016) 178-196.
[20] P. Guo, P. Bhattacharya, An evolutionary framework for detecting protein conformation defects, Inf. Sci. 276 (2014) 332-342.
[21] Y. Hsieh, M. Su, P. Wang, A PSO-based rule extractor for medical diagnosis, J. Biomed. Inf. 49 (2014) 53-60.
[22] I. Ilhan, G. Tezel, A genetic algorithm-support vector machine method with parameter optimization for selecting the tag SNPs, J. Biomed. Inf. 46 (2) (2013) 328-340.
[23] S. Kang, S. Cho, P. Kang, Constructing a multi-class classifier using one-against-one approach with different binary classifiers, Neurocomputing 149 (2015) 677-682.
[24] L. Li, Y. Wu, M. Ye, Experimental comparisons of multi-class classifiers, Informatica 39 (1) (2015) 71.
[25] H. Lin, Efficient classifiers for multi-class classification problems, Decis. Support Syst. 53 (3) (2012) 473-481.
[26] Y. Liu, X. Yu, J.X. Huang, A. An, Combining integrated sampling with SVM ensembles for learning from imbalanced datasets, Inf. Process. Manage. 47 (4) (2011) 617-631.
[27] N.C. Long, P. Meesad, H. Unger, A highly accurate firefly based algorithm for heart disease prediction, Expert Syst. Appl. 42 (21) (2015) 8221-8231.
[28] V. Lopez, A. Fernandez, S. Garcia, V. Palade, F. Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113-141.
[29] V. Lopez, I. Triguero, C.J. Carmona, S. Garcia, F. Herrera, Addressing imbalanced classification with instance generation techniques: IPADE-ID, Neurocomputing 126 (2014) 15-28.
[30] S. Maldonado, R. Weber, J. Basak, Simultaneous feature selection and classification using kernel penalized support vector machines, Inf. Sci. 181 (1) (2011) 115-128.
[31] A.I. Marques Marzal, V. Garcia Jimenez, J.S. Sanchez Garreta, On the suitability of resampling techniques for the class imbalance problem in credit scoring, J. Oper. Res. Soc. 64 (7) (2013) 1060-1070.
[32] T. Masters, Practical Neural Network Recipes in C++, Morgan Kaufmann, 1993.
[33] J. Mekyska, E. Janousova, P. Gomez-Vilda, Z. Smekal, I. Rektorova, I. Eliasova, M. Kostalova, M. Mrackova, J.B. Alonso-Hernandez, M. Faundez-Zanuy, et al., Robust and complex approach of pathological speech signal analysis, Neurocomputing 167 (2015) 94-111.
[34] G. Muhammad, M. Alsulaiman, A. Mahmood, M. Almojali, B.M. Abdelkader, Voice pathology detection using multiresolution technique, Pathology 16 (1) (2014) 185-189.
[35] J.A. Saez, B. Krawczyk, M. Wozniak, Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets, Pattern Recognit. (2016) 164-178, doi:10.1016/j.patcog.2016.03.012.
[36] J.A. Saez, J. Luengo, J. Stefanowski, F. Herrera, SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf. Sci. 291 (2015) 184-203, doi:10.1016/j.ins.2014.08.051.
[37] S. Shilaskar, A. Ghatol, Feature selection for medical diagnosis: evaluation for cardiovascular diseases, Expert Syst. Appl. 40 (10) (2013) 4146-4153.
[38] S. Shilaskar, A. Ghatol, Feature enhancement for classifier optimization and dimensionality reduction, in: India Conference (INDICON), 2014 Annual IEEE, IEEE, 2014, pp. 1-6.
[39] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45 (4) (2009) 427-437.
[40] H. Sug, D.D. Douglas II, More reliable oversampled synthetic data instances by using artificial neural networks for a minority class, in: Proceedings of the 2014 World Congress in Computer Science, Computer Engineering, and Applied Computing, 2014, pp. 1-4. http://worldcomp-proceedings.com/proc/p2014/DMI.html
[41] H. Sug, Towards effective data mining using random forests, Audiology 71 (2013) 24-28. http://www.wseas.us/e-library/conferences/2013/Nanjing/SCIE/SCIE-23.pdf
[42] J.P. Teixeira, P.O. Fernandes, Jitter, shimmer and HNR classification within gender, tones and vowels in healthy voices, Procedia Technol. 16 (2014) 1228-1237.
[43] A. Unler, A. Murat, R.B. Chinnam, mr2PSO: a maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification, Inf. Sci. 181 (20) (2011) 4625-4641.
[44] V. Van Asch, Macro- and micro-averaged evaluation measures [basic draft], 2013, pp. 1-27. http://www.clips.ua.ac.be/~vincent/pdf/microaverage.pdf
[45] K. Varpa, K. Iltanen, M. Juhola, Genetic algorithm based approach in attribute weighting for a medical data set, J. Comput. Med. (2014) 1-11, doi:10.1155/2014/526801.
[46] X. Wan, J. Liu, W.K. Cheung, T. Tong, Learning to improve medical decision making from imbalanced data without a priori cost, BMC Med. Inf. Decis. Making 14 (1) (2014) 1-13.
[47] W.Y. Ng Wing, Zeng Guangjun, Zhang Jiangjun, S. Yeung Daniel, Pedrycz Witold, Dual autoencoders features for imbalance classification problem, Pattern Recognit. 60 (2016) 875-889, doi:10.1016/j.patcog.2016.06.013.
[48] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimization for feature selection in classification: a multi-objective approach, IEEE Trans. Cybern. 43 (6) (2013) 1656-1671.
[49] H. Yu, S. Hong, X. Yang, J. Ni, Y. Dan, B. Qin, Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers, BioMed Res. Int. (2013) 1-14.
[50] X. Zhang, Y. Fu, A. Zang, L. Sigal, G. Agam, Learning classifiers from synthetic data using a multichannel autoencoder, CoRR abs/1503.03163 (2015) 1-11.
