Boosting Support Vector Machines for Imbalanced Microarray Data


Procedia Computer Science 144 (2018) 174–183

INNS Conference on Big Data and Deep Learning 2018

Risky Frasetio Wahyu Pratama, Santi Wulan Purnami* and Santi Puteri Rahayu

Department of Statistics, Institut Teknologi Sepuluh Nopember, Sukolilo, Surabaya 60111, Indonesia

Abstract

Nowadays, microarray data plays an important role in the detection and classification of almost all types of cancer tissue. The gene expression produced by microarray technology, which carries the information from genes, is matched to a specific cancer condition. The problems that often appear in classification using microarray data are high-dimensional data and imbalanced class. The problem of high-dimensional data can be solved by using Fast Correlation-Based Filter (FCBF) feature selection. In this paper, the Support Vector Machine (SVM) classifier is used because of its advantages. However, some studies mention that almost all classifier models, including SVM, are sensitive to imbalanced class. The Synthetic Minority Oversampling Technique (SMOTE) is one of the data preprocessing methods for handling imbalanced class; it is based on a sampling approach that increases the number of samples from the minority class. This method often works well, but it may sometimes suffer from over-fitting. One other alternative approach to improving the performance of imbalanced data classification is boosting. This method constructs a powerful final classifier by combining a set of SVMs as base classifiers during the iteration process, so it can improve the classification performance. In this study, the colon cancer and myeloma data are used in the analysis. The results show that SMOTEBoost with SVM as base classifier outperforms SVM, SMOTE-SVM, and AdaBoost with SVM as base classifier in terms of the G-mean metric.

© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Selection and peer-review under responsibility of the INNS Conference on Big Data and Deep Learning 2018.

Keywords: Microarray Data; Imbalanced Class; SVM; SMOTE; AdaBoost; SMOTEBoost.

* Corresponding author. Tel.: +6281234158275; fax: +62315943352. E-mail address: [email protected]

1877-0509 © 2018 The Authors. Published by Elsevier Ltd.
doi: 10.1016/j.procs.2018.10.517


1. Introduction

Recently, many researchers use microarray data to collect information from tissue and cell samples regarding different levels of gene expression that could be useful in disease diagnosis or cancer classification [4]. However, microarray data form a matrix with thousands of columns and hundreds of rows, where the columns and rows represent genes and samples, respectively. Using such data in a classification task therefore raises the high-dimensional data issue [19, 21]. Many studies have shown that feature (gene) selection can be used to address this issue [4, 17, 25]. The purpose of feature selection is to reduce the complexity of a classification algorithm and to identify the features that influence the class labels. Feature selection can also improve classification accuracy by removing redundant and irrelevant features [20]. One of the filter methods that has been widely used is FCBF (Fast Correlation-Based Filter) by [29]. FCBF has been shown to work fast and to choose the best features while keeping the running time low, so we use this method to deal with the high-dimensional problem.

Another issue in microarray data classification is imbalanced class. Imbalanced class is a condition in which the class (case) to be studied is distributed far more or less frequently than the other class. In this case, the classification accuracy on the majority class tends to be high, whereas the accuracy on the minority class tends to be low. Thus, the cost is large when a classifier misclassifies observations of the rare (minority) class [18, 22].

In this paper, we use the Support Vector Machine (SVM) algorithm to classify the data. SVM has been used in many studies of microarray data classification, such as [5, 25]. The advantages of SVM are its solid mathematical background, high generalization, and ability to find global and non-linear classification solutions [3]. Although SVM often works effectively on balanced datasets, it can produce suboptimal results on imbalanced datasets. Imbalanced data weakens the soft-margin optimization: the separating hyperplane of an SVM model developed with an imbalanced dataset can be skewed towards the minority class [26]. A second reason is that the ratio between positive and negative support vectors also becomes unbalanced. As a result, the neighbourhood of a test instance close to the boundary is more likely to be dominated by negative support vectors, so the decision function is more likely to classify a boundary point as the negative class [28].

There are several ways of handling the imbalanced class problem in classification using SVM. One of them is the data preprocessing approach. The Synthetic Minority Oversampling Technique (SMOTE) by [7] is one of the preprocessing methods that deals with imbalanced data classification using SVM, and it has achieved success in this setting [1, 20]. The algorithm creates synthetic data in the minority class by interpolating among minority class observations, where the interpolation is based on the k-nearest neighbours of each observation. This method often works well, but it may sometimes suffer from over-fitting. One other alternative for improving the performance of imbalanced data classification is boosting. Many boosting algorithms have been proposed; one of them is SMOTEBoost, proposed by [8].
The SMOTEBoost algorithm utilizes the SMOTE algorithm to modify the training set in each boosting iteration so that the prediction accuracy on the minority class can be increased. SMOTEBoost originally uses the AdaBoost.M2 boosting procedure by [10]. In this study, we use two imbalanced microarray datasets: colon cancer data by [2] and myeloma data by [24]. Both datasets are analyzed using SMOTEBoost-SVM and compared with SVM, SMOTE-SVM, and AdaBoost-SVM. The results are compared using the G-mean metric. Feature selection using FCBF is carried out first, before the classification process.

2. Literature Review

2.1. Support Vector Machine (SVM)

The Support Vector Machine (SVM) was first introduced by [9] and obtains good predictions for classification and regression problems. SVM works by mapping training data into a high-dimensional feature space. We first describe SVM in the linearly non-separable case and then the non-linear classification case. An objective function corresponding to penalized margin maximization is formulated as follows:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} \qquad (1)$$

subject to

$$y_{i}(\mathbf{x}_{i}^{T}\mathbf{w} + b) \geq 1 - \xi_{i}, \qquad \xi_{i} \geq 0. \qquad (2)$$

Fig. 1. The separating hyperplane and the margin in the linearly non-separable case [12].

The parameter C > 0 is a constant that controls the trade-off between minimizing the training error and maximizing the margin. Non-negative slack variables $\xi_{i}$ allow points to be on the wrong side of their soft margin ($\mathbf{x}_{i}^{T}\mathbf{w} + b = \pm 1$).

The dual optimization problem is

$$\max_{\mathbf{a}} \; L_{D}(\mathbf{a}) = \sum_{i=1}^{n} a_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{i}a_{j}y_{i}y_{j}\,\mathbf{x}_{i}^{T}\mathbf{x}_{j} \qquad (3)$$

subject to

$$0 \leq a_{i} \leq C, \qquad \sum_{i=1}^{n} a_{i}y_{i} = 0. \qquad (4)$$

The dual problem can be solved numerically using quadratic programming to find the solution $a_{i}$. In cases where a linear hyperplane is not applicable, SVM can transform the input vectors $\mathbf{x}$ into a higher-dimensional feature space by using the kernel trick, so that the data can be separated linearly in the new feature space. In the non-linear classification, the scalar product $\mathbf{x}_{i}^{T}\mathbf{x}_{j}$ is transformed into $k(\mathbf{x}_{i}, \mathbf{x}_{j}) = \phi(\mathbf{x}_{i})^{T}\phi(\mathbf{x}_{j})$. The dual optimization then becomes

$$\max_{\mathbf{a}} \; L_{D} = \max_{\mathbf{a}} \left[ \sum_{i=1}^{n} a_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{i}a_{j}y_{i}y_{j}\,k(\mathbf{x}_{i}, \mathbf{x}_{j}) \right] \qquad (5)$$

subject to

$$0 \leq a_{i} \leq C, \qquad \sum_{i=1}^{n} a_{i}y_{i} = 0. \qquad (6)$$

The classification function has the following form:

$$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i \in \mathrm{SVs}} \hat{a}_{i}\,y_{i}\,k(\mathbf{x}_{i}, \mathbf{x}) + \hat{b} \right) \qquad (7)$$

The kernel functions used in this study are

1. Linear kernel: $k(\mathbf{x}_{i}, \mathbf{x}_{j}) = \mathbf{x}_{i}^{T}\mathbf{x}_{j}$  (8)
2. Radial basis function: $k(\mathbf{x}_{i}, \mathbf{x}_{j}) = \exp\!\left( -\gamma \lVert \mathbf{x}_{i} - \mathbf{x}_{j} \rVert^{2} \right), \; \gamma > 0$  (9)
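To make the formulation concrete, the following is a minimal sketch (not the authors' code) of fitting the soft-margin SVM of Eqs. (1)-(9) in Python with scikit-learn, whose SVC estimator wraps LIBSVM [6]; the data arrays and parameter values are placeholders.

```python
# Minimal sketch: soft-margin SVM with linear and RBF kernels via scikit-learn.
# X and y are placeholder arrays standing in for selected gene features and labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))                          # stand-in feature matrix
y = np.where(X[:, 0] + rng.normal(size=60) > 0, 1, -1) # stand-in labels in {-1, +1}

# Linear kernel, Eq. (8): k(x_i, x_j) = x_i^T x_j
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# RBF kernel, Eq. (9): k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
rbf_svm = SVC(kernel="rbf", C=10.0, gamma=0.1).fit(X, y)

# Eq. (7): the prediction is the sign of a kernel expansion over support vectors
print(rbf_svm.predict(X[:5]))
print(np.sign(rbf_svm.decision_function(X[:5])))
```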

2.2. AdaBoost

In principle, AdaBoost works by assigning an equal weight to each observation at the beginning; the weights are then updated during the iteration process. Observations that are classified incorrectly are given higher weights than observations that are classified correctly, which prompts the classifier to focus more on the observations that are difficult to learn. At the end of the boosting rounds, AdaBoost constructs a powerful final classifier by combining the set of weak classifiers obtained during the iterations. The AdaBoost algorithm [23] is shown in Fig. 2.







Input: a set of training samples with labels $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$, a base (weak) classifier, and the number of iterations $T$.

Initialize the weight distribution $D_1$ over the examples such that $D_1(i) = 1/m$.

For $t = 1, 2, \ldots, T$:
1. Train a weak learner on the weighted training set and get a weak hypothesis $h_t : X \rightarrow \{-1, +1\}$.
2. Calculate the training error of $h_t$, $e_t = \sum_{i:\, y_i \neq h_t(\mathbf{x}_i)} D_t(i)$, and calculate $a_t = \frac{1}{2}\ln\!\left(\frac{1 - e_t}{e_t}\right)$.
3. Update the weights of the training samples: $D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(\mathbf{x}_i))}{Z_t}$, where $Z_t$ is a normalization constant such that $\sum_{i=1}^{m} D_{t+1}(i) = 1$.

Output the final hypothesis: $H(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{t=1}^{T} a_t h_t(\mathbf{x})\right)$.

Fig. 2. AdaBoost algorithm
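As an illustration of Fig. 2, the following is a minimal sketch (assuming a Python/scikit-learn environment; not the authors' implementation) of AdaBoost with an SVM as the weak learner. The weak learner is trained on the weighted set through SVC's sample_weight argument, and the early-stopping rule when the weighted error reaches 0 or 0.5 is a choice made for the sketch.

```python
# Minimal AdaBoost sketch following Fig. 2, with an RBF-kernel SVM as the weak learner.
import numpy as np
from sklearn.svm import SVC

def adaboost_svm(X, y, T=10, C=1.0, gamma=0.1):
    """y must be coded as -1/+1. Returns a list of (a_t, h_t) pairs."""
    m = len(y)
    D = np.full(m, 1.0 / m)            # D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        h = SVC(kernel="rbf", C=C, gamma=gamma)
        h.fit(X, y, sample_weight=D)   # train the weak learner on the weighted set
        pred = h.predict(X)
        e = np.sum(D[pred != y])       # weighted training error e_t
        if e <= 0 or e >= 0.5:         # stop if perfect or no better than chance
            if e <= 0:
                ensemble.append((1.0, h))
            break
        a = 0.5 * np.log((1 - e) / e)  # a_t = 1/2 ln((1 - e_t) / e_t)
        D = D * np.exp(-a * y * pred)  # D_{t+1}(i) proportional to D_t(i) exp(-a_t y_i h_t(x_i))
        D /= D.sum()                   # normalize by Z_t
        ensemble.append((a, h))
    return ensemble

def adaboost_predict(ensemble, X):
    """Final hypothesis: H(x) = sign(sum_t a_t h_t(x))."""
    score = sum(a * h.predict(X) for a, h in ensemble)
    return np.sign(score)
```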

2.3. Synthetic Minority Oversampling Technique (SMOTE)

The Synthetic Minority Oversampling Technique (SMOTE), first introduced by [7], is one of several techniques for handling imbalanced class. The method replicates samples from the minority (positive) class so that the number of minority class observations becomes comparable to the number of majority (negative) class observations. The SMOTE algorithm begins by defining the k-nearest neighbours of each minority sample and then constructs synthetic duplicates through these neighbours, up to the desired percentage of the minority class observations. The neighbour is drawn randomly from the k nearest neighbours. In general, a synthetic sample is generated as

$$\mathbf{x}_{syn} = \mathbf{x}_{i} + \gamma\,(\mathbf{x}_{knn} - \mathbf{x}_{i}), \qquad (10)$$

where $\gamma$ is a random number between 0 and 1.

2.4. SMOTEBoost

SMOTEBoost is a boosting algorithm proposed by [8] that combines SMOTE with the AdaBoost boosting procedure. SMOTE is applied in each boosting round to modify the weight distribution of the training set. The purpose of using SMOTE inside the boosting procedure is to increase the probability of selecting the difficult minority class samples in the training set at each boosting iteration. The SMOTEBoost algorithm is shown in Fig. 3, and a code sketch of the procedure follows the figure.

Input: a set of training samples with labels $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$, a base classifier, and the number of iterations $T$.

Initialize the weight distribution $D_1$ over the examples such that $D_1(i) = 1/m$.

For $t = 1, 2, \ldots, T$:
1. Modify the distribution $D_t$ by creating $N$ synthetic examples from the minority class using the SMOTE algorithm.
2. Train a weak learner on the weighted training set.
3. Compute a weak hypothesis $h_t : X \rightarrow \{-1, +1\}$.
4. Calculate the training error of $h_t$, $e_t = \sum_{i:\, y_i \neq h_t(\mathbf{x}_i)} D_t(i)$, and calculate $a_t = \frac{1}{2}\ln\!\left(\frac{1 - e_t}{e_t}\right)$.
5. Update the weights of the training samples: $D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(\mathbf{x}_i))}{Z_t}$, where $Z_t$ is a normalization constant such that $\sum_{i=1}^{m} D_{t+1}(i) = 1$.

Output the final hypothesis: $H(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{t=1}^{T} a_t h_t(\mathbf{x})\right)$.

Fig. 3. SMOTEBoost algorithm
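The following Python sketch illustrates Fig. 3. It is an assumption-laden illustration rather than the authors' code (which builds on AdaBoost.M2): each round generates synthetic minority samples with the SMOTE interpolation of Eq. (10) before fitting the SVM weak learner, the weight given to the synthetic samples is a simplifying choice, and the AdaBoost-style update of Fig. 3 is applied to the original samples only. Libraries such as imbalanced-learn provide a full SMOTE implementation.

```python
# Sketch of SMOTEBoost with an SVM base classifier, following Fig. 3.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def smote(X_min, n_synthetic, k=5, rng=None):
    """Interpolate n_synthetic new samples among minority points, Eq. (10)."""
    if rng is None:
        rng = np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=min(k, len(X_min) - 1) + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[i][0] is the point itself
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))       # a minority sample x_i
        j = rng.choice(idx[i][1:])         # one of its nearest minority neighbours
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.vstack(out)

def smoteboost_svm(X, y, T=10, n_synthetic=50, C=1.0, gamma=0.1, seed=0):
    """y coded as -1/+1 with +1 the minority class; returns [(a_t, h_t), ...]."""
    rng = np.random.default_rng(seed)
    m = len(y)
    D = np.full(m, 1.0 / m)                # D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        # (1) add N synthetic minority examples created by SMOTE
        X_syn = smote(X[y == 1], n_synthetic, rng=rng)
        X_t = np.vstack([X, X_syn])
        y_t = np.concatenate([y, np.ones(len(X_syn))])
        # synthetic points receive the average minority weight (a simplifying choice)
        w_t = np.concatenate([D, np.full(len(X_syn), D[y == 1].mean())])
        # (2)-(3) train the SVM weak learner and obtain its hypothesis
        h = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_t, y_t, sample_weight=w_t)
        pred = h.predict(X)
        # (4) weighted error and a_t, computed on the original samples
        e = np.sum(D[pred != y])
        if e <= 0 or e >= 0.5:
            break
        a = 0.5 * np.log((1 - e) / e)
        # (5) weight update and normalisation by Z_t
        D = D * np.exp(-a * y * pred)
        D /= D.sum()
        ensemble.append((a, h))
    return ensemble                        # H(x) = sign(sum_t a_t h_t(x))
```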

2.5. Performance Evaluation

The performance of a classification model is often measured through the accuracy rate, where a higher accuracy rate indicates a better model. It is calculated as

$$\text{accuracy rate} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (11)$$

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. Sensitivity is the true positive rate, or the accuracy on the positive (minority) class, while specificity is the true negative rate, or the accuracy on the negative (majority) class. They are calculated as

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (12)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (13)$$

Another metric that can be used to evaluate the classification performance on an imbalanced dataset is the G-mean, proposed by [14]:

$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \qquad (14)$$
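A short helper, sketched in Python under the assumption that +1 codes the minority (positive) class and -1 the majority (negative) class, computes Eqs. (11)-(14) from the prediction counts:

```python
# Sketch of the evaluation metrics in Eqs. (11)-(14).
import numpy as np

def imbalance_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # Eq. (11)
    specificity = tn / (tn + fp)                      # Eq. (12)
    sensitivity = tp / (tp + fn)                      # Eq. (13)
    g_mean = np.sqrt(sensitivity * specificity)       # Eq. (14)
    return accuracy, specificity, sensitivity, g_mean
```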

3. Methodology

In this study, the colon cancer and myeloma datasets are used. The descriptions of the data are shown in the following table.

Table 1. Microarray datasets description

Datasets        Number of samples   Features   Class distribution                                   Imbalance ratio
Colon Cancer    62                  2000       Tumour = 40, Normal = 22                             1.82
Myeloma         173                 12625      MRI lytic lesion = 137, MRI no lytic lesion = 36     3.81

We divide the data into training and testing sets using stratified 5-fold cross-validation to ensure that the training and testing data have the same proportion of majority and minority classes. Feature selection is done using the FCBF filter method, with 0.05 chosen as the threshold parameter. We use LibSVM [6] as the algorithm for SVM classification. For the SVM kernel function, we use the linear kernel and the radial basis function (RBF) kernel, with cost parameter C ∈ {0.1, 1, 10, 100} and kernel parameter γ ∈ {0.01, 0.1, 1, 10}. SVM models with the RBF kernel are fitted for every combination of C and γ. The maximum number of iterations for AdaBoost and SMOTEBoost is varied from 2 to 20, and the model with the highest mean G-mean is chosen. The maximum is restricted to 20 iterations because using too many boosting iterations tends to decrease performance. These methods are summarized in the following figure.

Fig. 4. Research flowchart
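The evaluation protocol above can be sketched in Python as follows. This is a hypothetical illustration, not the authors' code: X and y stand for the FCBF-selected features and class labels (coded -1/+1 with +1 the minority class), the grid matches the C and γ values listed above, and the best setting is chosen by the mean G-mean over the stratified folds.

```python
# Sketch: stratified 5-fold cross-validation over a grid of (C, gamma),
# selecting the RBF-SVM setting with the highest mean G-mean.
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def mean_gmean(model_factory, X, y, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = model_factory().fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        sens = recall_score(y[test_idx], y_pred, pos_label=1)    # minority class
        spec = recall_score(y[test_idx], y_pred, pos_label=-1)   # majority class
        scores.append(np.sqrt(sens * spec))                      # G-mean per fold
    return np.mean(scores)

def select_rbf_svm(X, y):
    """Return the (C, gamma) pair with the highest mean G-mean."""
    grid = itertools.product([0.1, 1, 10, 100], [0.01, 0.1, 1, 10])
    return max(grid, key=lambda cg: mean_gmean(
        lambda: SVC(kernel="rbf", C=cg[0], gamma=cg[1]), X, y))
```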

4. Results and Discussion

In this section, we provide the analysis, which starts with feature selection using FCBF and then classifies the colon cancer and myeloma data using SVM and SMOTE-SVM, where we either include all features or only the selected features in order to assess the effect of the feature selection method. Next, we classify the data using AdaBoost-SVM and SMOTEBoost-SVM on the selected features only. We compare the classification performance using the G-mean, accuracy, sensitivity, and specificity metrics. In order to obtain the informative features for the classification task, feature selection is done before the analysis. The result of feature selection using FCBF on the colon cancer and myeloma datasets is shown in the following table.

Table 2. Number of features before and after feature selection using FCBF

Datasets        Number of features   Number of selected features
Colon cancer    2000                 15
Myeloma         12625                59

The effect of feature selection using FCBF is checked by comparing the SVM and SMOTE-SVM classification results using all features against those using only the selected features. The result is shown in the following table.

Table 3. The effect of FCBF feature selection on SVM classification performance

Data           Feature             Model            Accuracy   Specificity   Sensitivity   G-mean
Colon Cancer   all features        SVM              0.8218     0.875         0.74          0.7926
Colon Cancer   all features        SMOTE-SVM        0.8385     0.9           0.74          0.8055
Colon Cancer   selected features   SVM              0.8872     0.925         0.83          0.8699
Colon Cancer   selected features   SMOTE-SVM        0.8846     0.875         0.91          0.8873
Myeloma        all features        SVM              0.8094     0.9636        0.2143        0.3935
Myeloma        all features        SMOTE-SVM        0.8094     0.9638        0.2143        0.3935
Myeloma        selected features   SVM              0.8549     0.9122        0.6357        0.7519
Myeloma        selected features   SMOTE-SVM        0.8842     0.9269        0.725         0.8141

Based on Table 3, FCBF feature selection improves the classification performance of SVM, with or without SMOTE, for both datasets. Thus, discarding redundant and irrelevant features in the classification task makes it easier for the classifiers to recognize the minority class (positive samples). Next, the boosting methods are applied using only the selected (informative) features.

Fig. 5. Box plot of G-mean for all parameters

Fig. 5 presents box-and-whisker plots of the G-mean values in the experiment. The observations are the G-mean values obtained over all values of C for the linear kernel and all combinations of C and γ for the RBF kernel. Based on Fig. 5, the median G-mean obtained by the SMOTEBoost-SVM model is relatively higher than the others for both datasets. In addition, the variation of the G-mean obtained by the SMOTEBoost-SVM algorithm is smaller than that of the others, which indicates that the G-mean values of SMOTEBoost-SVM are more stable across all parameters used in the model. However, the variation of the G-mean looks wider on the myeloma data, because the myeloma data has a higher imbalanced class ratio than the colon cancer data. Model performances are summarized in the following table. The values listed in the table are the means of the classification results over the five-fold cross-validation sets. The models listed are the best models, with the highest G-mean over all possible pairs of C and γ for the RBF kernel and all costs C for the linear kernel. Based on the G-mean metric listed in Table 4, SMOTEBoost with SVM as base classifier obtains the highest value for both datasets. It can be seen that SMOTEBoost-SVM is able to increase the performance on the minority class (sensitivity), thereby improving the G-mean. Based on accuracy, AdaBoost-SVM obtains the highest value, because AdaBoost focuses more on increasing accuracy by improving the classification performance on the majority class (specificity).

Table 4. Performance evaluation of models with selected features

Data           Model              Accuracy   Specificity   Sensitivity   G-mean
Colon Cancer   SVM                0.8872     0.925         0.83          0.8699
Colon Cancer   SMOTE-SVM          0.8846     0.875         0.91          0.8873
Colon Cancer   AdaBoost-SVM       0.9026     0.925         0.87          0.8929
Colon Cancer   SMOTEBoost-SVM     0.8859     0.825         1             0.9055
Myeloma        SVM                0.8549     0.9122        0.6357        0.7519
Myeloma        SMOTE-SVM          0.8842     0.9269        0.725         0.8141
Myeloma        AdaBoost-SVM       0.9017     0.9492        0.7179        0.8102
Myeloma        SMOTEBoost-SVM     0.8726     0.8979        0.7786        0.832

The variances of the G-mean over the five-fold cross-validation sets are also evaluated. The following table shows the G-mean per fold, together with its mean and variance, for the models with the optimum parameters.

Table 5. G-mean over the five folds of cross-validation, with mean and variance

Data           Model              Fold 1    Fold 2    Fold 3    Fold 4    Fold 5    Mean      Variance
Myeloma        SVM                0.8296    0.8166    0.8748    0.4828    0.7557    0.7519    0.0244
Myeloma        SMOTE-SVM          0.9638    0.7454    0.8922    0.7271    0.7415    0.8141    0.0115
Myeloma        SMOTEBoost-SVM     0.9061    0.7453    0.845     0.8296    0.8334    0.8319    0.0033
Myeloma        AdaBoost-SVM       0.945     0.8908    0.655     0.6431    0.9179    0.8102    0.0221
Colon Cancer   SVM                0.9354    0.9354    0.8101    0.7746    0.8944    0.8699    0.0055
Colon Cancer   SMOTE-SVM          0.866     0.866     0.8101    1         0.8944    0.8873    0.0049
Colon Cancer   SMOTEBoost-SVM     1         0.7907    0.866     0.9354    0.9354    0.9055    0.0064
Colon Cancer   AdaBoost-SVM       1         0.866     0.8101    0.8944    0.8944    0.8923    0.0049

AdaBoost-SVM and SMOTEBoost-SVM have the smallest variances of G-mean in the five-fold validation test for the colon data and the myeloma data, respectively.

5. Conclusions

Boosting with SVM as base classifier obtains good classification performance on imbalanced microarray datasets. The results of the study show that the SMOTEBoost method with SVM as base classifier has a small variation of G-mean on the two datasets. With the optimum parameters, SMOTEBoost-SVM outperforms the other methods in terms of the G-mean metric, especially on the myeloma data, which has a higher imbalanced class ratio than the colon cancer data, and it does so with a small variance of G-mean over the five-fold cross-validation. AdaBoost with SVM as base classifier also performs well, especially in terms of accuracy. In this study, the effect of FCBF feature selection was also examined. The results show that using only the informative features in the classification task produces better classification performance than using all features on imbalanced microarray datasets.

References

[1] Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. Proceedings of the 15th European Conference on Machine Learning, pp. 39-50.
[2] Alon, U. B. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
[3] Batuwata, R., & Palade, V. (2013). Class imbalance learning methods for support vector machines. In H. He & M. Yunqian (Eds.), Imbalance Learning: Foundation, Algorithms, and Applications (pp. 83-99). Berlin: John Wiley & Sons.
[4] Bolón-Canedo, V., Sánchez-Marono, N., Alonso-Betanzos, A., Benítez, J. M., & Herrera, F. (2014). A review of microarray datasets and applied feature selection methods. Information Sciences, 111-135.
[5] Brown, M. P. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 97(1), 262-267.
[6] Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
[7] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[8] Chawla, N., Lazarevic, A., Hall, L., & Bowyer, K. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. Knowledge Discovery in Databases: PKDD 2003, 107-119.
[9] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
[10] Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML, 96, pp. 148-156.
[11] Golub, T. R. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
[12] Härdle, W. K., Prastyo, D. D., & Hafner, C. M. (2012). Support vector machines with evolutionary model selection for default prediction.
[13] Kim, M. J., Kang, D. K., & Kim, H. B. (2015). Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction. Expert Systems with Applications, 42(3), 1074-1082.
[14] Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In ICML, 97, pp. 179-186.
[15] Ladayya, F., Purnami, S. W., & Irhamah. (2017). Fuzzy support vector machine for microarray imbalanced data classification. In AIP Conference Proceedings (Vol. 1905, No. 1, p. 030022). AIP Publishing.
[16] Li, X., Wang, L., & Sung, E. (2008). AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence, 21(5), 785-795.
[17] Liu, H., Sun, J., Liu, L., & Zhang, H. (2009). Feature selection with dynamic mutual information. Pattern Recognition, 42(7), 1330-1339.
[18] Liu, Y., An, A., & Huang, X. (2006). Boosting prediction accuracy on imbalanced datasets with SVM ensembles. PAKDD, 6, pp. 107-118.
[19] Purnami, S. P., Andari, S., & Rusydiana, A. (2017). On selecting features for binary classification in microarray data analyses. In Proceedings of the 9th International Conference on Machine Learning and Computing (pp. 133-136). ACM.
[20] Purnami, S. W., & Trapsilasiwi, R. K. (2017). SMOTE-least square support vector machine for classification of multiclass imbalanced data. In Proceedings of the 9th International Conference on Machine Learning and Computing (pp. 107-111). ACM.
[21] Purnami, S. W., Andari, S., & Pertiwi, Y. D. (2015). High-dimensional data classification based on smooth support vector machines. Procedia Computer Science, 72, 477-484.
[22] Sain, H., & Purnami, S. W. (2015). Combine sampling support vector machine for imbalanced data classification. Procedia Computer Science, 72, 59-66.
[23] Schapire, R. E. (1999). A brief introduction to boosting. IJCAI, 1999.
[24] Tian, E. Z. (2003). The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine, 349(26), 2483-2494.
[25] Vanitha, C. D., Devaraj, D., & Venkatesulu, M. (2015). Gene expression data classification using support vector machine and mutual information-based gene selection. Procedia Computer Science, 47, 13-21.
[26] Veropoulos, K., Campbell, C., & Cristianini, N. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the International Joint Conference on AI, pp. 55-60.
[27] Wang, X., Liu, X., Matwin, S., & Japkowicz, N. (2014). Applying instance-weighted support vector machines to class imbalanced datasets. In Big Data (Big Data), 2014 IEEE International Conference on (pp. 112-118). IEEE.
[28] Wu, G., & Chang, E. Y. (2003). Adaptive feature-space conformal transformation for imbalanced-data learning. Proceedings of the 20th International Conference on Machine Learning, pp. 816-823.
[29] Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning (pp. 856-863). ICML-03.