Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 144 (2018) 194–202

www.elsevier.com/locate/procedia
INNS Conference on Big Data and Deep Learning 2018
Comparison of Feature Selection Methods to Classify Inhibitors in DUD-E Database
Heri Kuswanto a,*, Renny Yunia Nurhidayah a, Hayato Ohwada b
a Department of Statistics, Faculty of Mathematics, Computing and Data Science, Institut Teknologi Sepuluh Nopember (ITS), Kampus ITS Sukolilo, Surabaya 60111, Indonesia
b Department of Industrial Administration, Faculty of Science and Technology, Tokyo University of Science, Chiba, Japan
Abstract

In designing a new drug, an inhibitor compound is usually used to control the enzyme work to recover a particular disease. In the drug design technique, the classification of inhibitors is carried out by docking software to simulate the bounding of the mixing (new inhibitor candidate) with the targeted enzyme. DUD-E is a database to simulate docking with a high dimensional data characteristic, which leads to the feasibility of the machine learning approach as the analytical tool. A compound with specific characteristics can be classified into ligand or decoy by using many characteristics, leading to a problem in the machine learning algorithm. This paper discusses feature selection analysis to obtain the compound characteristics which effectively determine ligand or decoy. This paper examined Mutual Information-based Feature Selection (MIFS), Correlation-based Feature Selection (CFS) as well as Fast Correlation-Based Filter (FCBF), and the results show that FCBF always selects the smallest number of features with the fastest runtime of classification. The highest classification accuracy is obtained when all features are used in the classification by k-NN. However, the accuracy is only slightly different from classification using selected features. The CFS method performs well for Data-A with an accuracy of 89.55%, while MIFS outperforms the others for Data-B and Data-C with classification accuracies of 92.34% and 95.20% consecutively.

© 2018 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the INNS Conference on Big Data and Deep Learning 2018.
Keywords: DUD-E; k-NN; feature; accuracy; runtime
1. Introduction

Inhibitor is used to control the enzyme work to recover a particular disease during the drug design process. A survey in 2000 reported that almost 30% of clinical drugs are made by enzyme inhibitors [1]. The process of designing a drug is a long process (5-7 years) with an expensive cost [2].
* Corresponding author. Tel.: +62-818513223; fax: +62-31-5943352.
E-mail address: [email protected]
To overcome this problem, in silico screening has been introduced and proven to be an effective way, i.e. by utilizing computers to simulate the filtering of lead compounds. Finding a lead compound is one of the important steps in making drugs, i.e. to analyze the fit of the compound structure which may cause biological activities and to identify the general active substructure to be synthesized. In the drug making techniques, the classification of inhibitors is done by the docking score as the main tool to simulate the compound from mixing (candidate of a new inhibitor) with the target enzyme or protein based on the molecular structure [3]. DUD-E is a database to simulate docking with high dimension, so that the analysis is conducted by machine learning, because using classical statistical approaches may lead to bias and overfitting. A compound with specific characteristics can be classified as ligand (good inhibitor) or decoy (bad inhibitor) using a number of compound characteristics. This fact may cause a problem in some machine learning algorithms. For high dimensional data, it is possible that there are a number of irrelevant or redundant variables that can reduce the performance of learning. To deal with this, feature selection is required to obtain the compound characteristics which effectively determine ligand or decoy, in association with the determination of a new inhibitor in the new drug design process with a targeted enzyme.

Feature selection is a process of choosing a subgroup of the available features so that the features can be optimally reduced based on certain evaluation criteria [4]. Koller and Sahami [5] stated that the purpose of feature selection is to choose a group of features to increase the prediction accuracy, or to reduce the number of features without significantly reducing the prediction accuracy of a classifier constructed with only the selected features. The filter model is one of the approaches usually used in feature selection when the number of features is very large, due to its simple algorithm; nevertheless it is able to omit irrelevant and redundant features effectively [6]. Mutual Information-based Feature Selection (MIFS), Correlation-based Feature Selection (CFS), and Fast Correlation-Based Filter (FCBF) are feature selection methods belonging to the filter model. This paper examines MIFS because it has been proven to be faster than wrapper-based methods [7]. Meanwhile, the correlation-based approaches (CFS and FCBF) are able to quickly identify and screen irrelevant, redundant, and noisy features, and identify relevant features as long as their relevance does not strongly depend on other features. Moreover, they also execute faster than wrappers [8]. Considering the comparative strengths of those methods, it is necessary to deeply investigate their performance, an issue which has never been explored so far.

In order to investigate the performance of these feature selection methods, the classification is carried out by k-Nearest Neighbors (k-NN), which is a relatively simple classification method developed by Fix and Hodges in 1951 and has been proven to have a good performance. This work compares the performance of MIFS, CFS, and FCBF combined with k-NN to classify the inhibitors in the DUD-E database. The performance is represented by the ability of the method to accurately classify the inhibitors. Moreover, the execution time of the classification is reported as well.

2. Literature Review

2.1. Mutual Information-based Feature Selection (MIFS)

MIFS is a filter-type feature selection method that was introduced by Battiti [9].
This method is based on the concept of mutual information (MI). The MI is a quantity which measures how much one random variable explains another. Another quantity needed for the MI is the entropy, which is used to measure the heterogeneity (variability) of a group of sample data. The entropy of the class C can be calculated by

H(C) = -\sum_{c} P(c) \log P(c)    (1)

while the entropy of C given the feature vector X, called the conditional entropy, is defined as

H(C|X) = -\sum_{x} P(x) \sum_{c} P(c|x) \log P(c|x)    (2)

where P(c|x) is the probability of class c given the input vector (predictor variable) x. The entropy and the conditional entropy are used in the calculation of the MI between C and X, which is formulated as

I(C; X) = H(C) - H(C|X)    (3)
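To make the quantities in Eqs. (1)-(3) concrete, the following is a minimal Python sketch (not part of the original study) of how the entropy, conditional entropy and mutual information of discretized features can be computed; all function names are illustrative and the inputs are assumed to be NumPy arrays of discrete values.

import numpy as np

def entropy(labels):
    # H(C), Eq. (1): entropy of a vector of discrete class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(labels, feature):
    # H(C|X), Eq. (2): entropy of the class given one discrete (binned) feature
    values, counts = np.unique(feature, return_counts=True)
    return sum((n / len(feature)) * entropy(labels[feature == v])
               for v, n in zip(values, counts))

def mutual_information(labels, feature):
    # I(C; X) = H(C) - H(C|X), Eq. (3)
    return entropy(labels) - conditional_entropy(labels, feature)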
The algorithm of MIFS can be described as follows (a minimal code sketch follows the list):

1. (Initialisation). Set F = "initial set of p features" and S = the empty set of selected features.
2. (Computation of the MI with the output class). For every feature f in F, compute I(C; f).
3. (Choice of the first feature). Find the feature f that maximizes I(C; f); set F <- F \ {f} and S <- {f}.
4. (Greedy selection). Repeat until the desired number of features is selected:
   a. (Computation of the MI between features). For all pairs (f, s) with f in F and s in S, compute I(f; s) if it is not yet available.
   b. (Selection of the next feature). Choose the feature f that maximizes I(C; f) - β \sum_{s \in S} I(f; s); set F <- F \ {f} and S <- S ∪ {f}.
5. Output the set S containing the selected features.
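As an illustration only, the greedy procedure above can be sketched in Python by reusing the mutual_information helper defined after Eq. (3). The stopping rule is simplified here to a fixed number of features k, whereas in the study the number of selected features varies with β; k and beta are user-supplied, and this is not the authors' implementation.

def mifs(X, y, k, beta):
    # Greedy MIFS selection over the columns of a discrete feature matrix X.
    # X: NumPy array (n_samples, n_features); y: class labels.
    remaining = list(range(X.shape[1]))
    relevance = {f: mutual_information(y, X[:, f]) for f in remaining}

    first = max(remaining, key=relevance.get)       # step 3: maximum I(C; f)
    selected = [first]
    remaining.remove(first)

    while len(selected) < k and remaining:          # step 4: greedy selection
        def score(f):
            redundancy = sum(mutual_information(X[:, s], X[:, f]) for s in selected)
            return relevance[f] - beta * redundancy  # step 4b criterion
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected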
The parameter β is used to regulate the relative importance of a candidate feature's redundancy with the already selected features with respect to its mutual information with the output class. We can see that the parameter β has a significant influence on selecting the right features in MIFS. Unfortunately, the exact value of β is unknown, and hence it can be found by a grid search approach.

2.2. Correlation-based Feature Selection (CFS)

CFS is a feature selection method belonging to the filter model which was developed by Hall and Smith [10]. CFS grades subgroups of features based on correlation using a heuristic evaluation function. An irrelevant feature is ignored because it has a weak correlation with the class, while a redundant feature is excluded because it has a strong correlation with other features. First, CFS calculates the correlations between the class and the features and between the features on the training data, and then searches the feature subset space using best first search. Figure 1 shows the algorithm of best first search.

1. Start with the OPEN list containing the initial state, the CLOSED list empty, and BEST set to the initial state.
2. Let s be the state in OPEN with the highest evaluation.
3. Remove s from OPEN and add it to CLOSED.
4. If the evaluation of s is higher than that of BEST, then BEST <- s.
5. For every child of s which does not belong to the OPEN or CLOSED list, evaluate it and add it to OPEN.
6. If BEST changed in the last set of expansions, repeat from step 2.
7. Return BEST.
Figure 1: Algorithm of best first search used by CFS
The score of the best feature subset is evaluated heuristically and defined as

Merit_S = \frac{p \, \overline{r_{cf}}}{\sqrt{p + p(p-1)\,\overline{r_{ff}}}}    (4)

where Merit_S is the heuristic merit of a feature subgroup S consisting of p features, \overline{r_{cf}} is the average correlation between the features and the class output, and \overline{r_{ff}} is the average correlation between features. The correlation between a feature and the class uses the Symmetrical Uncertainty (SU), defined as

SU(X, Y) = 2 \left[ \frac{IG(X|Y)}{H(X) + H(Y)} \right]    (5)

where IG(X|Y) = H(X) - H(X|Y) is the information gain.
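A minimal sketch of how the SU measure in Eq. (5) and the merit in Eq. (4) could be coded, reusing the entropy and conditional entropy helpers defined earlier; a full CFS implementation would couple cfs_merit with the best first search of Figure 1, which is omitted here for brevity.

import itertools
import numpy as np

def symmetrical_uncertainty(a, b):
    # SU(A, B) = 2 * IG(A|B) / (H(A) + H(B)), Eq. (5), for discrete vectors
    gain = entropy(a) - conditional_entropy(a, b)   # information gain IG(A|B)
    denom = entropy(a) + entropy(b)
    return 2.0 * gain / denom if denom > 0 else 0.0

def cfs_merit(X, y, subset):
    # Heuristic merit of a feature subset, Eq. (4)
    p = len(subset)
    r_cf = np.mean([symmetrical_uncertainty(X[:, f], y) for f in subset])
    if p == 1:
        return r_cf
    r_ff = np.mean([symmetrical_uncertainty(X[:, i], X[:, j])
                    for i, j in itertools.combinations(subset, 2)])
    return (p * r_cf) / np.sqrt(p + p * (p - 1) * r_ff)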
2.3. Fast Correlation-based Filter (FCBF)

FCBF is a filter-type feature selection algorithm which was developed by Yu and Liu [6]. The main concepts used in the FCBF algorithm are the predominant correlation and a set of heuristics. The predominant correlation concept is used to determine the correlation between a feature and the class, while the heuristics are used to omit redundant features. A feature is used in the classification step if its correlation with the class is higher than its correlation with the other features.

Definition 1 (Predominant Correlation): The correlation between a feature F_i and the class C is predominant if SU_{i,c} ≥ δ and there is no feature F_j (j ≠ i) that satisfies SU_{j,i} ≥ SU_{i,c}. If such an F_j exists, it is called a redundant peer of F_i, and S_{Pi} is used to denote the set of all redundant peers of F_i. Given F_i and S_{Pi}, S_{Pi} is divided into two parts, S_{Pi}+ and S_{Pi}-, with S_{Pi}+ containing the peers F_j with SU_{j,c} > SU_{i,c} and S_{Pi}- containing the peers with SU_{j,c} ≤ SU_{i,c}.

Definition 2 (Predominant Feature): A feature is said to be predominant to the class if its correlation to the class is predominant, or can become predominant after omitting its redundant peers.

Based on those two definitions, a feature is good if it is predominant in predicting the class, and feature selection for classification is defined as a process to identify all predominant features to the class and omit the rest. Furthermore, to identify the predominant features and omit the redundancy among features, the following three heuristics are used:

Heuristic 1 (if S_{Pi}+ is empty). Treat F_i as a predominant feature, omit all features in S_{Pi}-, and skip identifying redundant peers for them.

Heuristic 2 (if S_{Pi}+ is not empty). Process all features in S_{Pi}+ before making a decision on F_i. If none of them becomes predominant, follow Heuristic 1; otherwise, only omit F_i and decide whether to omit the features in S_{Pi}- based on the other features in the remaining list.

Heuristic 3 (starting point). The feature with the largest SU_{i,c} value is a predominant feature and can be the initial feature used to omit the others.

The FCBF algorithm is shown in Figure 2 below (Yu and Liu, 2003):
input:  S(F_1, F_2, ..., F_N, C)   // training data set
        δ                          // a predefined threshold
output: S_best                     // the selected subset of features

begin
  for i = 1 to N do begin
    calculate SU_{i,c} for F_i;
    if (SU_{i,c} ≥ δ) append F_i to S'_list;
  end;
  order S'_list in descending SU_{i,c} value;
  F_p = getFirstElement(S'_list);
  do begin
    F_q = getNextElement(S'_list, F_p);
    if (F_q <> NULL)
      do begin
        F'_q = F_q;
        if (SU_{p,q} ≥ SU_{q,c}) begin
          remove F_q from S'_list;
          F_q = getNextElement(S'_list, F'_q);
        end
        else F_q = getNextElement(S'_list, F_q);
      end until (F_q == NULL);
    F_p = getNextElement(S'_list, F_p);
  end until (F_p == NULL);
  S_best = S'_list;
end;

Figure 2: Algorithm of FCBF
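The procedure of Figure 2 can be sketched in Python as follows. This is an illustrative translation reusing the symmetrical_uncertainty helper from the CFS sketch, not a reference implementation; delta plays the role of the threshold δ.

def fcbf(X, y, delta=0.0):
    # FCBF of Figure 2: keep features whose SU with the class reaches delta,
    # then walk the list in decreasing relevance and drop redundant peers.
    su_c = {f: symmetrical_uncertainty(X[:, f], y) for f in range(X.shape[1])}
    s_list = sorted((f for f, su in su_c.items() if su >= delta),
                    key=su_c.get, reverse=True)
    selected = []
    while s_list:
        fp = s_list.pop(0)              # current predominant candidate F_p
        selected.append(fp)
        # remove every F_q with SU(p, q) >= SU(q, c), i.e. a redundant peer of F_p
        s_list = [fq for fq in s_list
                  if symmetrical_uncertainty(X[:, fp], X[:, fq]) < su_c[fq]]
    return selected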
2.4. Equal Width Binning

Equal width binning is one of the unsupervised methods to transform a continuous variable into a simpler categorical (discrete) variable. The equal width binning method divides the data into l bins of equal width, formulated as

w = \frac{\max(x) - \min(x)}{l}    (6)
where l is the number of bins, which can be determined by the researcher. The bin limits are called cut points.
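A minimal illustration of Eq. (6) in Python; the function name is illustrative and x is assumed to be a NumPy array holding one continuous feature.

import numpy as np

def equal_width_binning(x, l=3):
    # Eq. (6): bin width w = (max - min) / l; np.digitize maps each value to
    # one of the l categories 0 .. l-1 using the l - 1 interior cut points.
    width = (x.max() - x.min()) / l
    cut_points = x.min() + width * np.arange(1, l)
    return np.digitize(x, cut_points), cut_points

With l = 3 this yields two cut points per feature, analogous to the pairs of cut points reported later in Table 1.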
2.5. k-Nearest Neighbors (k-NN)

k-NN is a popular classification technique introduced by [11]. Hassanat [12] argued that k-NN has high efficiency in machine learning, text categorization and data mining. The goal of this algorithm is to classify a new object based on its distances to the training data; the classifier uses only the distance function from the new data to the training data. The principle of k-NN is to find the shortest distances between the data to be evaluated and its k closest neighbors in the training data. The training data are projected into a multidimensional space, where each dimension represents a feature of the data, and this space is divided into parts based on the training data classification. A point in this space is assigned to class c if class c is the most frequent classification among the k neighbors closest to that point. The closeness of the neighbors is calculated with the Euclidean distance, as the most used distance [13]. The formula to calculate the Euclidean distance between two points x_i and x_j is [14]

d(x_i, x_j) = \sqrt{\sum_{m=1}^{p} (x_{im} - x_{jm})^2}    (7)

where p is the number of features.
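A plain NumPy sketch of the k-NN rule with the Euclidean distance of Eq. (7) and majority voting over the k nearest training points; names are illustrative and the arrays are assumed to be NumPy arrays.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=5):
    # Classify each test point by majority vote of its k nearest training points.
    predictions = []
    for x in X_test:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Eq. (7)
        nearest = np.argsort(dist)[:k]
        predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(predictions)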
2.6. K-Fold Cross Validation

K-fold cross validation (KCV) is one of the methods used to estimate the misclassification rate associated with the learning process [15]. This method randomly divides the data into K folds (partitions) of equal size. The first fold is treated as testing data and the rest are training data. This procedure is repeated K times, so that all folds have an equal opportunity to be treated as training and testing data. This process yields K classification accuracies, and their average is used to evaluate the performance, as shown in formula (8):

CV_K = \frac{1}{K} \sum_{i=1}^{K} \text{accuracy}_i    (8)

where accuracy_i is the accuracy resulting from each fold. In practice, 10-fold cross validation has become a standard which has been proven to give a good performance.

3. Data and Methodology

The data used in this research are secondary data previously used by Okada, Ohwada, and Aoki [3], i.e. the structure of compounds with a docking simulation which are classified into good inhibitors (ligand) and bad inhibitors (decoy). The data consist of three different enzymes, i.e. aofb, cah2, and hs90a. The three data sets used in this research are hereafter denoted as Data-A, Data-B and Data-C for aofb, cah2 and hs90a consecutively. Data-A consists of 672 observations, Data-B consists of 3340 observations and Data-C has 500 observations. The variables used in this research are the response variable, i.e. the inhibitor classification (ligand or decoy), and the predictor variables (features), i.e. the compound characteristics consisting of 70 features for Data-A, 71 features for Data-B and 69 features for Data-C.

The steps of the analysis can be described as follows. The feature selection step is carried out by discretizing the features using the equal width binning method, followed by applying CFS to select the features. MIFS and FCBF require the determination of a threshold prior to selecting the features. Furthermore, the features selected by the three approaches are used in the classification step. In this case we use k-NN with a K-fold evaluation procedure. The cross validation uses 10 folds and hence the data are divided into training and testing sets with a proportion of 90:10. This step is repeated 10 times so that all data have an equal chance of being training and testing data. The k-NN method for classification uses k = 5. Furthermore, the Euclidean distance between each testing point and every training point is calculated, and the 5 training points with the smallest Euclidean distance to the testing point are chosen. The inhibitor is classified into the majority class of these 5 closest neighbors. These procedures are repeated 10 times by the K-fold cross validation procedure. A sketch of this evaluation protocol is given below.
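A hedged sketch of the evaluation protocol (10-fold cross validation of k-NN with k = 5), assuming scikit-learn is available; the feature matrix X and labels y stand for the selected DUD-E descriptors and the ligand/decoy response and must be supplied by the reader.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cross_validated_accuracy(X, y, n_folds=10, k=5):
    # Randomly split the data into n_folds parts; each fold is used once as the
    # test set (a 90:10 split per repetition) and the fold accuracies are
    # averaged as in Eq. (8).
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in folds.split(X):
        model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        model.fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(accuracies)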
4. Results and Discussion

4.1. Data Description

The inhibitors in the DUD-E database are classified as ligand or decoy. The response variable consists of 75% decoys and 25% ligands for all types of enzymes. The descriptive statistics show that there is a strong positive correlation between the feature ALogP and the features ALogP_MR and ALogP98. A similar pattern is observed between ALogP_MR and ALogP98. These facts show that there is redundancy between the features, and some of them might need to be eliminated. The method used to discretize the data is equal width binning. The number of bins is 3, meaning that there will be three categories. The cut points for Data-A on every feature are listed in Table 1. Only three features are shown in the table.

Table 1: Cut points to discretize Data-A

Feature      Cut points
ALogP        (-1.1565 ; 0.7929)
ALogP_MR     (-0.5529 ; 1.7154)
ALogP98      (-1.1584 ; 0.7945)
For the feature ALogP, we obtain cut points of (-1.1565 ; 0.7929), so that ALogP will be discretized into 3 categories, i.e. less than -1.1565 for the first category, -1.1565 to 0.7929 for the second category, and above 0.7929 for the third category.

4.2. Feature Selection Using MIFS, CFS and FCBF

The first feature selection is carried out by the MIFS method. The feature selection is applied to the three datasets, with β in the interval 0.5-1. The choice of β refers to the work of [16], which conducted a simulation study by generating artificial data and found that β is optimal within this interval. Table 2 shows the results of feature selection using MIFS. Based on the values in Table 2, we can see that the higher the β, the fewer features are selected.

Table 2: Feature selection using MIFS

Data      Number of initial features    Number of final features (β increasing from 0.5 to 1)
Data-A    70                            29   22   19   16   12   10
Data-B    71                            24   23   23   23   21   21
Data-C    69                            42   29   28   25   24   22
The second feature selection is done by CFS. Table 3 summarizes the results of feature selection by CFS.

Table 3: Feature selection using CFS

Data      Number of initial features    Number of final features
Data-A    70                            14
Data-B    71                            6
Data-C    69                            9
In the table, we see that the number of features selected by CFS is relatively smaller than by MIFS. The feature selection analysis carried out by FCBF uses threshold values in the interval 0 to 0.2. The number of selected features for various thresholds is depicted in Figure 3.
(Plot: number of selected features versus threshold for Data-A, Data-B and Data-C.)

Figure 3: Feature selection using FCBF
Figure 3 shows that the bigger the threshold, the fewer features are selected. For all data, the number of features selected by FCBF is smaller than for feature selection by MIFS and CFS. With the threshold at zero, the numbers of selected features for Data-A, Data-B and Data-C are 7, 6, and 6 consecutively. The selected features are further analyzed to determine the dominant features which have a significant role in classifying ligand or decoy. Table 4 below lists the features which are always selected in all data.

Table 4: Selected features on the feature selection analysis

Data      Features
Data-A    Num_ChainAssemblies, Num_Hydrogens, Num_StereoAtoms, Num_ExplicitHydrogens
Data-B    Molecular_3D_PolarSASA, AverageBondLength, IsChiral, Num_SpiroAtoms, Num_StereoBonds, Num_Rings3
Data-C    Num_H_Acceptors, IsChiral, Num_Rings9Plus, Num_PositiveAtom, Num_RingBonds, RadOfGyration
Table 4 shows the features selected in all feature selection analyses. For Data-A, no feature is always selected by all three methods; the listed features are those which are always selected when the data are analyzed by MIFS and CFS. For Data-B and Data-C, the listed features are always selected when they are analyzed by MIFS, CFS, and FCBF. From those results we can conclude that the always-selected features are the most informative features to determine whether a compound will be classified into ligand or decoy.

4.3. Inhibitor Classification with All Features and Selected Features

The inhibitors are first analyzed by k-NN using all features to classify each inhibitor as ligand or decoy. The classification results are presented in Table 5.

Table 5: Classification results using all features

Data      Accuracy    Sensitivity    Specificity    Runtime (ms)
Data-A    91.94%      80.72%         95.63%         11,965
Data-B    95.51%      92.34%         96.57%         338,366
Data-C    96.20%      96.00%         96.27%         4,664
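For reference, a small sketch of how the accuracy, sensitivity and specificity reported here can be computed from predicted and true labels; treating "ligand" as the positive class is an assumption about the label coding, not stated explicitly in the paper.

import numpy as np

def classification_metrics(y_true, y_pred, positive="ligand"):
    # Confusion-matrix based measures with the ligand class taken as positive.
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)    # proportion of ligands correctly recovered
    specificity = tn / (tn + fp)    # proportion of decoys correctly recovered
    return accuracy, sensitivity, specificity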
Table 5 shows that the inhibitor classification using k-NN leads to good results. All data have accuracy, sensitivity and specificity greater than 90%, except the sensitivity of Data-A, i.e. 80.72%. This result indicates that k-NN is a good method for inhibitor classification on the DUD-E database. Figure 4 shows the inhibitor classification using k-NN with only the features selected by MIFS, CFS and FCBF. Figure 4 shows that the highest accuracy of inhibitor classification on Data-A is obtained by CFS, with MIFS using 19 features, while FCBF yields the lowest accuracy using 4 features. For Data-B, the highest accuracy is obtained with the 23 features generated by MIFS, while FCBF uses 7 features. For Data-C, using the 25 features generated by MIFS leads to the highest accuracy, while FCBF selects only 6 features.
(Three panels plot accuracy ("akurasi"), sensitivity ("sensi") and specificity ("speci") of CFS, FCBF and MIFS against the number of selected features.)
Figure 4: Inhibitor classification by k-NN with selected features on Data-A, Data-B and Data-C consecutively
4.4. Comparison of the Feature Selection Methods

The results of the feature selection analysis are summarized in Table 6.

Table 6: Results of feature selection

                                        Number of final features      Runtime (ms)
Data      Number of initial features    MIFS    CFS    FCBF           MIFS    CFS    FCBF
Data-A    70                            19      14     4              172     94     16
Data-B    71                            23      6      7              687     93     16
Data-C    69                            25      9      6              203     31     16
Table 6 shows that the runtime is linearly related to the number of selected features. Among the three methods, we can see that FCBF yields the smallest number of features with the shortest runtime. Besides the runtime criterion, the classification result is also one of the criteria. The classification results are presented in Table 7.

Table 7: Comparison of best classification results

Data      Classification        Accuracy    Sensitivity    Specificity    Runtime (ms)
Data-A    All features          91.94%      80.72%         95.63%         11,965
          Features by MIFS      88.66%      73.49%         93.65%         2,730
          Features by CFS       89.55%      74.10%         94.64%         2,090
          Features by FCBF      75.20%      62.65%         72.09%         874
Data-B    All features          95.51%      92.34%         96.57%         338,366
          Features by MIFS      92.34%      86.23%         94.37%         68,547
          Features by CFS       86.65%      73.41%         91.06%         20,030
          Features by FCBF      89.76%      82.51%         92.18%         22,261
Data-C    All features          96.20%      96.00%         96.27%         4,664
          Features by MIFS      95.20%      96.00%         94.93%         2,012
          Features by CFS       94.20%      95.20%         93.87%         1,030
          Features by FCBF      94.40%      93.60%         94.67%         749
Table 7 shows that overall the classification results are good. For all three data sets, the highest accuracy, sensitivity and specificity are obtained when the classification is carried out with all features. However, if we compare them with the classification results using selected features, the performances are only slightly different. Moreover, the runtime of the classification using selected features is significantly shorter than with all features. For Data-A, the highest accuracy is obtained when we use CFS, i.e. 89.55% with 14 features. For Data-B and Data-C, the highest accuracy is obtained with MIFS, i.e. 92.34% and 95.20% using 23 features and 25 features consecutively. FCBF yields the fewest features.
5. Conclusion

Based on the analysis, we conclude that MIFS selects more features than CFS and FCBF. The execution time is influenced by the number of selected features, and FCBF runs the process faster than the others. Although the classification carried out with all features yields the highest values of the classification criteria, the values are not significantly different from classification using selected features. Moreover, feature selection reduces the runtime significantly. The main point of conducting feature selection in a classification problem is to obtain the best method with the highest accuracy. Based on this, CFS slightly outperforms MIFS for Data-A, while MIFS performs best for Data-B and Data-C. This research therefore suggests using MIFS combined with k-NN to obtain the best accuracy. Considering the fact that the DUD-E database originally contains big datasets, the execution time can be one of the considerations to effectively run the classification. If one is really concerned about time efficiency, FCBF combined with k-NN will be the best choice, as FCBF always has the shortest execution time.
Acknowledgements

The authors acknowledge the financial support from the Ministry of Research, Technology and Higher Education Indonesia through the International Collaboration and Scientific Publication research scheme.

References
1. Drews, J. (2000). Drug Discovery: A Historical Perspective. Drug Discovery Review.
2. DiMasi, J. A., Hansen, R. W., and Grabowski, H. G. (2003). The Price of Innovation: New Estimates of Drug Development Costs. Journal of Health Economics 22, 151-185.
3. Okada, M., Ohwada, H., and Aoki, S. (2013). Docking Score Calculation Using Machine Learning with an Enhanced Inhibitor Database. Journal of Bioinformatics and Computational Biology. Imperial College Press.
4. Narendra, P. M. and Fukunaga, K. (1977). A Branch and Bound Algorithm for Feature Selection. IEEE Transactions on Computers 9, 917-922.
5. Koller, D. and Sahami, M. (1996). Toward Optimal Feature Selection. Proceedings of the International Conference on Machine Learning.
6. Yu, L. and Liu, H. (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Proceedings of the Twentieth International Conference on Machine Learning (ICML).
7. Hoque, N., Bhattacharya, D. K., and Kalita, J. K. (2014). MIFS-ND: A Mutual Information-based Feature Selection Method. Expert Systems with Applications 41(14).
8. Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. PhD Thesis, University of Waikato.
9. Battiti, R. (1994). Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Transactions on Neural Networks 5(4), 537-550.
10. Hall, M. A. and Smith, L. A. (1999). Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper. Proceedings of the Twelfth International FLAIRS Conference.
11. Dougherty, J., Kohavi, R., and Sahami, M. (1996). Supervised and Unsupervised Discretization of Continuous Features. Computer Science Department, Stanford University.
12. Hassanat, A. B. (2014). Solving the Problem of the K Parameter in the KNN Classifier Using an Ensemble Learning Approach. International Journal of Computer Science and Information Security 12(8), 33-39.
13. Bramer, M. (2007). Principles of Data Mining. London: Springer.
14. Han, J., Kamber, M., and Pei, J. (2012). Data Mining: Concepts and Techniques, Third Edition. USA: Morgan Kaufmann Publishers.
15. James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
16. Huang, J., Cai, Y., and Xu, X. (2006). A Filter Approach to Feature Selection Based on Mutual Information. Proceedings of the 5th IEEE International Conference on Cognitive Informatics (ICCI'06), in Y. Y. Yao, Z. Z. Shi, Y. Wang, and W. Kinsner (Eds.).