Applied Soft Computing 51 (2017) 39–48
Kernel-based learning and feature selection analysis for cancer diagnosis

Seyyid Ahmed Medjahed a,*, Tamazouzt Ait Saadi b, Abdelkader Benyettou a, Mohammed Ouali c,d

a Université des Sciences et de la Technologie d'Oran Mohamed Boudiaf, USTO-MB, BP 1505, El M'naouer, 31000 Oran, Algeria
b University of Le Havre, France, and University of Abdelhamid Ibn Badis Mostaganem, Algeria
c Computer Science Department, University of Sherbrooke, J1K 2R1, Canada
d College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
Article info

Article history: Received 2 June 2016; Received in revised form 5 November 2016; Accepted 5 December 2016; Available online 9 December 2016.

Keywords: Classification; Feature selection; Kernel-based learning; Support vector machines recursive feature elimination; Binary dragonfly
Abstract

DNA microarray is a very active area of research in the molecular diagnosis of cancer. Microarray data consist of many thousands of features and only tens to hundreds of instances, which makes the analysis and diagnosis of cancer very complex. In this case, gene/feature selection becomes an elemental and essential task in data classification. In this paper, we propose a complete cancer diagnostic process through kernel-based learning and feature selection. First, support vector machines recursive feature elimination (SVM-RFE) is used to prefilter the genes. Second, SVM-RFE is enhanced by using the binary dragonfly algorithm (BDF), a recently developed metaheuristic that has never been benchmarked in the context of feature selection. The objective function is the average of the classification accuracy rates generated by three kernel-based learning methods. We conducted a series of experiments on six microarray datasets often used in the literature. Experiment results demonstrate that this approach is efficient and provides a higher classification accuracy rate using a reduced number of genes.
1. Introduction

Cancer is a dangerous disease characterized by abnormally large cell proliferation in normal tissue. It is the most lethal disease in the world, and the diagnosis of cancer is a very difficult and complex task. Early detection of cancerous cells can increase survival rates for patients by more than 97% (American Cancer Society, http://www.cancer.org) [1]. To aid in this, classifier systems are commonly used in cancer classification to help experts make a well-informed diagnosis. Recently, DNA microarray technology for cancer diagnosis has become a very popular topic of research [2-4]. It measures the expression of large numbers of genes simultaneously and results in high-quality tumor identification. However, the number of genes is between 20,000 and 30,000, and the number of samples is often
less than 100, which results in the Hughes phenomenon (curse of dimensionality) and, consequently, an incorrect diagnosis. The term "curse of dimensionality" was coined by Richard E. Bellman in 1961 and refers to phenomena that arise when analyzing data in high-dimensional spaces. To resolve this, a gene/feature selection step is necessary. Feature selection is an essential step not only in gene selection but also in many other applications, such as pattern recognition, hyperspectral imagery, and bioinformatics [5-8]. In supervised classification, feature selection is a major step in providing a high accuracy rate, improving the quality of classification, and reducing the computational complexity of the classification algorithm. The principal aim is to reduce the dimensionality by finding the smallest subset of the original features that can achieve maximum classification accuracy. This task aids classification performance by eliminating the irrelevant and redundant features. Feature selection approaches are characterized by their objective function, search strategy, subset generator and stop criterion, and they can be classified into three classes: filter, wrapper and embedded [2,3,9]. In the filter approach, the optimal feature subset is obtained by using a measure of relevance independent from the classifier system [2]. In the wrapper approach, the objective function is computed by using
a classifier: for example, the classification accuracy rate, i.e., the percentage of examples that are correctly classified [5]. In the embedded approach, the feature selection process is integrated into the classifier system, so the search is guided by the classifier. Although the filter approach is much faster, the selected features are independent of the classification model. The wrapper approach is not as fast but provides very good results [3]. In this paper, we propose a complete cancer diagnostic process through kernel-based learning and feature selection. In addition, we propose a new feature selection approach composed of two steps. The first step consists of selecting a subset of candidate genes using SVM-RFE [10]. Since SVM-RFE does not take into account the redundancy of genes and becomes unstable at some values [11], we add a second step of gene selection based on the BDF optimizer to locate the optimal subset of genes. BDF is a new swarm intelligence optimization method inspired by the static and dynamic swarming behaviors of dragonflies in nature, developed by Seyedali Mirjalili in 2015. The second step is applied to the subset of candidate genes selected by SVM-RFE and selects the most important and informative genes from the candidate subset by minimizing the classification error rate. The problem of gene selection is a binary optimization problem, which is solved using the BDF algorithm. The proposed approach is benchmarked on six microarray datasets widely used in the literature. Experiment results show that our approach outperforms several alternative methods and demonstrate its feasibility and effectiveness. The rest of the paper is organized as follows: Section 2 details the proposed approach; Section 3 describes and analyzes the experiment results; and Section 4 concludes this study and offers some overall perspective.
2. The proposed approach

The proposed approach uses a combination of SVM-RFE and the BDF algorithm, since SVM-RFE becomes unstable at some values of the genes and does not take the redundancy of the genes into account [11]. The proposed approach seeks to improve the results obtained by SVM-RFE alone. The subset of genes generated by SVM-RFE is considered the candidate subset of genes. The second step selects the optimal, smallest subset of genes from the candidate subset by minimizing the objective function, which represents the classification error rate (classification error rate = 1 - classification accuracy rate). In other terms, we attempt to select the smallest subset of features that minimizes the classification error rate. The objective function is optimized by using BDF. Fig. 1 illustrates the proposed approach.

Fig. 1. General schema of the proposed approach.

2.1. First phase

In this pretreatment phase, SVM-RFE selects a subset of k candidate genes from the initial number of genes N [10]. The support vector machine (SVM) is a learning algorithm based on the principle of structural risk minimization [12]; it analyzes data and is used for classification and regression analysis. The SVM predicts the class of each given input. To classify the data, the SVM constructs a hyperplane in a high-dimensional space that provides good separation: the hyperplane with the largest distance to the nearest training point of any class (i.e., the one that maximizes the margin between the classes). Clearly, there are several valid hyperplanes, but a remarkable property of the SVM is that the chosen hyperplane is optimal. The points that lie closest to the maximum-margin hyperplane are called the support vectors [12]. Consider a training set of N points (x_i, y_i) with input data x_i, i = 1, ..., N and output labels y_i ∈ {-1, 1} given by an expert. The margin is the distance of the closest examples to the decision hyperplane. The primal problem of the optimal hyperplane is given as follows:
\min_{w,b} \frac{1}{2}\|w\|^2
subject to  y_i(\langle w, x_i \rangle + b) \ge 1, \quad i \in \{1, \dots, N\}   (1)
The optimal hyperplane problem can be translated into a quadratic optimization problem, defined as follows in dual form by introducing the Lagrange multipliers \alpha_i:

\min_{\alpha} \; -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle   (2)
subject to  \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0 \;\; \forall i \in \{1, \dots, N\}

The weight vector of the linear SVM is given as follows:

w = \sum_{i=1}^{N} \alpha_i y_i x_i   (3)
where \alpha_i are the Lagrange multipliers, x_i is the gene expression vector, and y_i is the class of x_i. SVM-RFE is an embedded approach developed by Guyon et al. [10]; it performs feature selection by generating a ranking of the features through backward feature elimination. The ranking score is calculated from the components of the weight vector w of the SVM defined in Eq. (3). The SVM-RFE procedure repeats three steps (train the SVM, compute w, remove the lowest-weighted genes) until all genes are ranked, as shown in Algorithm 1.
Algorithm 1. SVM-RFE Algorithm
1: repeat
2:    Train SVM on the training set
3:    Compute w
4:    Remove the genes with the smallest weight
5: until all genes are ranked
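As an illustration (the paper itself gives no implementation), this ranking loop can be reproduced with scikit-learn's RFE wrapper around a linear SVM; the data shapes, C value and elimination step below are placeholders, not values from the study.

# Minimal SVM-RFE sketch with scikit-learn; data and parameters are placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.random((62, 2000))              # e.g., 62 samples x 2000 genes (colon-sized)
y = rng.choice([-1, 1], size=62)        # binary class labels

# RFE repeatedly trains the linear SVM, scores genes by the components of w,
# and drops the lowest-weighted 10% per round until 60% of the genes remain.
estimator = LinearSVC(C=1.0, max_iter=10000)
rfe = RFE(estimator=estimator, n_features_to_select=1200, step=0.1)
rfe.fit(X, y)

candidate_genes = np.flatnonzero(rfe.support_)   # indices of the retained genes
print(len(candidate_genes), "candidate genes selected")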
SVM-RFE was first developed to solve the problem of gene selection for cancer classification [10]. The algorithm is iterative: in each iteration, the SVM classifier is trained to compute the vector w, and the genes with the lowest weights are removed. Finally, the genes are sorted according to their weights, and the k genes with the highest weights are selected. In this paper, this phase is used as a preprocessing phase to reduce the number of features: it selects k ≪ N features and passes them on to the second phase.

2.2. Second phase

The wrapper approach selects features using a classifier algorithm as an evaluation function. In this phase, we use BDF as the search strategy and kernel-based learning methods to compute the objective function, which is the classification error rate. Our wrapper approach is defined as follows. Suppose D = {f_1, ..., f_k} is the feature set obtained after the first phase. The principle is to decide, for each feature f_i, whether it will be used in the classification model. Therefore, we define binary variables B = {B_1, ..., B_k} such that if B_i = 1 the feature f_i is selected; otherwise the feature f_i is not selected. The objective function is the classification error rate, computed as the average of the classification error rates obtained by ν-SVM, C-SVM and LS-SVM; taking the average of three classifiers avoids generating a subset of features that depends on a single classifier. This formulation can be seen as a combinatorial optimization problem and can be solved by a stochastic approach. We propose to use the dragonfly algorithm, a new swarm intelligence optimization method inspired by the static and dynamic swarming behaviors of dragonflies in nature, developed by Seyedali Mirjalili in 2015 [13]. We chose the dragonfly algorithm for two major reasons:
• The first is that the dragonfly algorithm has never been applied or benchmarked in the context of gene selection and cancer classification.
• The second is that the dragonfly algorithm is a new optimization algorithm that is more efficient than Particle Swarm Optimization and the Genetic Algorithm [13].

Dragonflies are insects of the order Odonata and are considered small predators that hunt almost all other small insects. The dragonfly algorithm is composed of two major phases: exploration and exploitation [13]. Reynolds [13,14] defined the behaviors of swarms as follows.

Separation of the ith dragonfly, which is the avoidance of other individuals in the neighborhood, is defined as:

S_i = -\sum_{j=1}^{n} (X_i - X_j)   (4)
where X_i is the position of the current dragonfly, X_j is the position of the jth neighboring dragonfly, and n is the total number of neighbors [13].

Alignment of the ith dragonfly, which indicates the velocity matching with the other dragonflies in the neighborhood, can be computed as follows:

A_i = \frac{\sum_{j=1}^{n} X_j}{n}   (5)
where X_j here denotes the velocity of the jth neighboring dragonfly [13].

Cohesion of the ith dragonfly, which is the tendency of the swarm towards the center of mass of the neighborhood, is given as follows:

C_i = \frac{\sum_{j=1}^{n} X_j}{n} - X_i   (6)
Attraction of the ith dragonfly, which is the attraction towards the food source, can be described as:

F_i = X^+ - X_i   (7)

where X^+ is the position of the food source.

Distraction of the ith dragonfly, which is the movement outwards from an enemy, can be computed as:

E_i = X^- + X_i   (8)

where X^- is the position of the enemy.

In this stage, we use S_i, A_i, C_i, F_i and E_i to update the position of each dragonfly. The step and position updates are defined as follows:

\Delta X_{t+1} = (s S_i + a A_i + c C_i + d F_i + e E_i) + w \Delta X_t   (9)

X_{i,t+1} = X_{i,t} + \Delta X_{i,t+1}   (10)

where s, a, c, d and e are the weights of the separation, alignment, cohesion, food source and enemy position, respectively; \Delta X is the step, w is the inertia weight and t is the iteration [13].
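As a sketch (not the authors' code), Eqs. (4)-(10) for a single dragonfly can be written in a few lines of numpy. The behavior weights below are placeholder values, and the step array dX plays the role of the velocity in Eq. (5).

# One continuous dragonfly update following Eqs. (4)-(10); weights are placeholders.
import numpy as np

def dragonfly_step(X, dX, i, neighbors, X_food, X_enemy,
                   s=0.1, a=0.1, c=0.7, d=1.0, e=1.0, w=0.9):
    S = -np.sum(X[i] - X[neighbors], axis=0)       # separation, Eq. (4)
    A = np.mean(dX[neighbors], axis=0)             # alignment (velocities), Eq. (5)
    C = np.mean(X[neighbors], axis=0) - X[i]       # cohesion, Eq. (6)
    F = X_food - X[i]                              # attraction to food, Eq. (7)
    E = X_enemy + X[i]                             # distraction from enemy, Eq. (8)
    dX[i] = s * S + a * A + c * C + d * F + e * E + w * dX[i]   # step, Eq. (9)
    X[i] = X[i] + dX[i]                            # position, Eq. (10)
    return X, dX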
This algorithm is formulated for continuous variables. For binary variables, a transfer function T converts the step into a flip probability, and the position is updated stochastically [13]:

T(\Delta X_i) = \left| \frac{\Delta X_i}{\sqrt{\Delta X_i^2 + 1}} \right|   (11)

X_{i,t+1} =
\begin{cases}
1, & r < T(\Delta X_{i,t+1}) \\
0, & r \ge T(\Delta X_{i,t+1})
\end{cases}   (12)

where r is a random variable between 0 and 1.
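The binarization of Eqs. (11) and (12) can be sketched in a few lines of numpy; this illustrates the rule as stated above and is not the authors' code.

# Map continuous steps to binary gene-selection bits via Eqs. (11)-(12).
import numpy as np

def binarize_step(dX, rng=None):
    rng = rng or np.random.default_rng()
    T = np.abs(dX / np.sqrt(dX**2 + 1.0))   # transfer function, Eq. (11)
    r = rng.random(dX.shape)                # r ~ U(0, 1), one draw per gene
    return (r < T).astype(int)              # 1 = gene selected, 0 = gene dropped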
The schema of the BDA algorithm for gene selection is given as follows:

Algorithm 2. BDA Algorithm for Gene Selection
1:  Split the microarray dataset into Z1 (training set), Z2 (validation set) and Z3 (testing set)
2:  k is the number of genes after the first phase
3:  i indexes the ith dragonfly and g the gth gene
4:  Initialize the dragonfly population X_{i,g} (i = 1, ..., n)
5:  Initialize the step vectors ΔX_{i,g} (i = 1, ..., n)
6:  Initialize max_iteration
7:  Initialize Err_Optimal = ∞
8:  for t = 1 to max_iteration do
9:     for i = 1 to n do
10:       Randomly generate Z1′ and Z2′ from Z1 and Z2
11:       Z1′ = Z1′ − {genes with X_{i,g} = 0}
12:       Z2′ = Z2′ − {genes with X_{i,g} = 0}
13:       Train C-SVM using Z1′ and evaluate it using Z2′
14:       Err_C-SVM = classification error rate
15:       Train ν-SVM using Z1′ and evaluate it using Z2′
16:       Err_ν-SVM = classification error rate
17:       Train LS-SVM using Z1′ and evaluate it using Z2′
18:       Err_LS-SVM = classification error rate
19:       Err = (Err_C-SVM + Err_ν-SVM + Err_LS-SVM) / 3
20:       if Err < Err_Optimal then
21:          Err_Optimal = Err
22:          X_Optimal = X_{i,g}
23:       end if
24:       Compute S, A and C using Eqs. (4), (5) and (6)
25:       Compute the food source F_i and the enemy E_i using Eqs. (7) and (8)
26:       Update w, s, a, c, d and e
27:       Compute the step vector ΔX_{t+1} using Eq. (9)
28:       Compute the probabilities T(ΔX) using Eq. (11)
29:       Compute the position vector X_{t+1} using Eq. (12)
30:    end for
31: end for
32: Output: X_Optimal
Algorithm 2 describes the gene selection algorithm based on the BDF optimization algorithm, which is the second phase of the approach. The algorithm takes as input the dataset containing the k genes selected in the first phase, along with the parameters of the support vector machine and the Gaussian kernel. Lines 1-7 perform the data initialization. The dataset is divided into three sets: training set (Z1), validation set (Z2) and testing set (Z3). The training and validation sets are used for gene selection, and the testing set is used to compute the classification accuracy rate on the final subset of genes. Line 8 is the main loop of the algorithm, and line 9 is the loop over the dragonfly population (the search agents). Line 10 randomly generates a training set Z1′ and a validation set Z2′ from the initial training set Z1 and validation set Z2. In lines 11 and 12, the algorithm removes the genes with X_{i,g} = 0 from these sets. In lines 13 to 18, the classification error rate is computed using the three kernel-based methods, and in line 19 the algorithm computes the average classification error rate provided by the three classifiers. In lines 20-23, the objective function is evaluated, and in lines 24-29, the positions of the dragonfly population are updated. The output of the algorithm is the subset of genes (i.e., those with X_{i,g} = 1) that produced the lowest average classification error rate. The selected genes are then evaluated on the testing set.
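To make lines 10-19 concrete, the sketch below scores one dragonfly's binary gene mask by averaging the validation error of the SVM variants. It is illustrative only: scikit-learn's SVC and NuSVC stand in for C-SVM and ν-SVM, the LS-SVM term (e.g., solved via Eq. (17) below) is omitted because scikit-learn has no LS-SVM, and γ = 1/(2σ²) with σ = 2 follows the parameter setting of Section 3.2.

# Fitness of one binary gene mask (lines 10-19 of Algorithm 2); LS-SVM term omitted.
import numpy as np
from sklearn.svm import SVC, NuSVC

def fitness(mask, X_train, y_train, X_val, y_val, gamma=1.0 / (2 * 2.0**2)):
    cols = mask.astype(bool)
    if not cols.any():                        # an empty gene subset is invalid
        return 1.0
    Xtr, Xva = X_train[:, cols], X_val[:, cols]
    errors = []
    for clf in (SVC(C=1.0, kernel="rbf", gamma=gamma),        # C-SVM
                NuSVC(nu=0.5, kernel="rbf", gamma=gamma)):    # nu-SVM
        clf.fit(Xtr, y_train)
        errors.append(1.0 - clf.score(Xva, y_val))            # error = 1 - accuracy
    return float(np.mean(errors))             # average classification error rate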
This study's major contribution is achieving a high classification accuracy rate by solving the problem of gene selection. Our approach combines the advantages of two different approaches, the embedded and wrapper approaches. The embedded approach, SVM-RFE, is used to reduce the number of genes, selecting k genes from the N total genes. The second phase uses a wrapper approach based on BDF. Generally, the standard wrapper approach uses the classification error rate produced by a single classifier, which gives rise to three disadvantages: the subset of selected genes is specific to this single classifier; overfitting increases when the number of samples is insufficient; and the computational time is too large when the number of genes is large. The proposed approach overcomes these disadvantages as follows:

• SVM-RFE reduces the number of genes and, consequently, the computational time of the second step.
• Randomly choosing new training and validation sets in each iteration of the algorithm prevents overfitting.
• Using the average of the classification error rates generated by three kernel-based classifiers ensures that the subset of selected genes is not specific to, nor dependent on, a single classifier.

2.3. The objective function

In this study, we propose an objective function that represents the average of the classification error rates computed by three classifiers: C-SVM, ν-SVM and LS-SVM. We have chosen three different variants of SVM. LS-SVM differs from C-SVM and ν-SVM in the resolution of the quadratic problem: LS-SVM transforms the dual problem into a set of linear equations, while C-SVM and ν-SVM solve the dual problem using SMO, active set or other methods. ν-SVM, compared to C-SVM, has the advantage of controlling the number of support vectors used. Note that, while other classifiers can be used with this approach, we focused our study on kernel-based classifiers.
2.3.1. C-SVM

C-SVM was developed to classify data that are not linearly separable by introducing the slack variables \xi_i, which represent the distance between wrongly classified points and the optimal hyperplane. Another parameter, called C, controls the trade-off between the slack variables and the margin size [12]. The quadratic optimization problem is given as follows:

\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i   (13)
subject to  y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \;\; i \in \{1, \dots, N\}

The dual problem of (13) has the following form:

\min_{\alpha} \; -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle   (14)
subject to  \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \;\; \forall i \in \{1, \dots, N\}
2.3.2. ν-SVM

ν-SVM controls the complexity (the number of support vectors) of the SVM by introducing a constant ν and a margin variable ρ [15]. The primal form of the quadratic optimization problem is as follows:

\min_{w,b,\xi,\rho} \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{N} \sum_{i=1}^{N} \xi_i   (15)
subject to  y_i(\langle w, x_i \rangle + b) \ge \rho - \xi_i, \quad \xi_i \ge 0 \text{ and } \rho \ge 0, \;\; i \in \{1, \dots, N\}

The dual problem is given as follows:

\min_{\alpha} \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle   (16)
subject to  \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \sum_{i=1}^{N} \alpha_i \ge \nu, \quad 0 \le \alpha_i \le \frac{1}{N} \;\; \forall i \in \{1, \dots, N\}
Problems (14) and (16) can be solved using Sequential Minimal Optimization, active set or interior point methods, among others.

2.3.3. LS-SVM

LS-SVM is a least-squares version of the SVM. Its dual problem is defined as a system of linear equations; in other terms, LS-SVM finds the solution by solving a linear system instead of a quadratic program [16]. The dual problem (14) is transformed into the following linear system:

\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ Y \end{bmatrix}   (17)

where Y = [y_1, ..., y_N]^T, 1_N = [1, ..., 1]^T, \alpha = [\alpha_1, ..., \alpha_N], I_N is the identity matrix, \Omega is the kernel matrix, and \gamma is the regularization parameter.
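Eq. (17) can be solved directly with a dense linear solver. The sketch below is an illustrative numpy implementation of Eq. (17) as written, with Ω the Gaussian kernel matrix and with placeholder values for the regularization γ and the kernel width σ; prediction then uses f(x) = sign(Σ_i α_i K(x, x_i) + b).

# Illustrative LS-SVM solver for Eq. (17); gamma and sigma are placeholder values.
import numpy as np

def rbf_kernel(A, B, sigma=2.0):
    sq = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
    return np.exp(-sq / (2 * sigma**2))

def lssvm_train(X, y, gamma=1.0, sigma=2.0):
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)            # kernel matrix
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                             # 1_N^T
    A[1:, 0] = 1.0                             # 1_N
    A[1:, 1:] = Omega + np.eye(N) / gamma      # Omega + gamma^{-1} I_N
    rhs = np.concatenate(([0.0], y.astype(float)))
    sol = np.linalg.solve(A, rhs)              # one dense solve, no QP needed
    return sol[0], sol[1:]                     # b, alpha

def lssvm_predict(Xq, X, b, alpha, sigma=2.0):
    return np.sign(rbf_kernel(Xq, X, sigma) @ alpha + b)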
3. Experimental results

In this section, we present the results obtained by the experiments. The performance evaluation is conducted in terms of classification accuracy rate, number of selected genes and computational time. We conducted two experiments.

The first experiment was performed to demonstrate the relevance of the proposed approach. We ran it on five synthetic (artificial benchmark) datasets widely used for testing the performance of feature selection approaches, called CorrAL, m-of-n-3-7-10, Monk1, Monk2 and Monk3 [17]. These datasets can be obtained from https://www.sgi.com/tech/mlc/db/. The first dataset, CorrAL, is composed of 6 features (a0, a1, b0, b1, Irr, Corr) and represents the target concept (a0 ∩ a1) ∪ (b0 ∩ b1); Irr is an irrelevant feature and Corr is a feature highly correlated with the class label. The second dataset, m-of-n-3-7-10, is composed of 10 features, including 3 irrelevant ones; features 2, 3, 4, 5, 6, 7 and 8 are relevant to the class label.
The datasets Monk1, Monk2 and Monk3 are composed of 6 features (a1, ..., a6). The target concept of each dataset is defined as follows:

• Monk1: (a1 = a2) ∨ (a5 = 1). Relevant features: a1, a2, a5.
• Monk2: (an = 1) for exactly two choices of n. Relevant features: a1, a2, a3, a4, a5, a6.
• Monk3: (a5 = 3 ∧ a4 = 1) ∨ (a5 ≠ 1 ∧ a2 ≠ 3). Relevant features: a2, a4, a5.

Table 1 shows the selected features and the classification accuracy rate obtained by the proposed approach for each artificial dataset.

Table 1. Classification accuracy rate (%) and the selected features obtained by the proposed approach for each artificial dataset.

Datasets          Selected features        Classification accuracy rate
CorrAL            1, 2, 3, 4, 6            90.62
m-of-n-3-7-10     3, 4, 5, 6, 7, 8, 9      100
Monk1             1, 2, 5                  97.22
Monk2             1, 2, 3, 6               68.00
Monk3             2, 4, 5                  80.56

As observed in Table 1, for the CorrAL dataset, the proposed algorithm selected a0, a1, b0, b1 and Corr as the optimal feature subset, with a classification accuracy rate of 90.62%. The results obtained for m-of-n-3-7-10 are very satisfactory: the classification accuracy rate reached 100%, with 7 selected features (3, 4, 5, 6, 7, 8, 9). Since the relevant features are 2, 3, 4, 5, 6, 7 and 8, the proposed approach selected 6 relevant features among the 7. For the Monk1 dataset, our approach selected features 1, 2 and 5, which correspond to the target concept, with an accuracy rate of 97.22%. For the Monk2 dataset, the approach selected features 1, 2, 3 and 6 and reached an accuracy rate of 68%. The results obtained for Monk3 are satisfactory: the approach selected features 2, 4 and 5 as the optimal feature subset, which matches the target concept.

The second experiment was conducted on real datasets, namely the DNA microarray datasets.

3.1. Datasets

The microarray datasets used in our study are common in the literature. We chose six DNA microarray datasets for this research: breast cancer, colon cancer, DLBCL (diffuse large B-cell lymphoma), leukemia, lung cancer, and ovarian cancer. Table 2 presents information about the microarray datasets: the name of each dataset, the number of genes, the number of samples, and the number of classes.

Table 2. Information about the microarray datasets.

Dataset name          Number of genes   Number of samples   Number of classes
Breast cancer [18]    24,481            97                  2
Colon cancer [19]     2000              62                  2
DLBC [20]             4026              47                  2
Leukemia [21]         5147              72                  2
Lung cancer [22]      12,533            181                 2
Ovarian cancer [23]   15,154            253                 2

The datasets are taken from the Kent Ridge (KR) Bio-Medical Data
Set Repository and can be found at http://datam.i2r.a-star.edu.sg/datasets/krbd/.

3.2. Parameter settings

To assess the performance of the approach, we used an explicit investigative setting. Each microarray dataset was split into three subsets: 50% of the instances were used for the training phase, 30% for the validation phase, and the remaining 20% for the testing phase. To overcome the problem of overfitting, we randomly regenerated the training and validation sets in each iteration of the algorithm.

In the first phase, SVM-RFE selected 60% of the initial genes (k = 60%). The value of k was determined empirically: we tested 30%, 40%, 50%, 60%, 70% and 80%, and the value that gave good results is 60%. A small value of k (20%, 40%) selects a small number of genes in the first phase, leaving the second phase with little to contribute; a large value of k (70%, 80%) makes the second step very computationally intensive, and the processing time can be prohibitive. Since the second phase is a wrapper approach and is very computationally intensive, the idea is to balance the number of selected genes between the two phases. In addition, the value of 60% produces good classification accuracy. Fig. 2 shows the classification accuracy rate obtained for different percentages of selected genes in the first step (SVM-RFE); as seen in the figure, the highest classification accuracy rate for each dataset is achieved by using 60% of the genes.

Fig. 2. Classification accuracy rate for each percentage of selected genes obtained by the first step (SVM-RFE).

The SVM classifier was used with the Gaussian kernel, and the parameter σ was initialized to 2. Tuning the hyper-parameters of the SVM and the kernel function is a crucial task; many optimization algorithms, such as PSO, SA-SVM and DIRECT, can be used to optimize the hyper-parameters automatically [24,25]. The parameters of the BDF algorithm were as follows: the dragonfly population size was set to 60, and the algorithm stops when the objective function is equal to 0 or when the maximum number of iterations, fixed to 500, is reached. All of these parameters were chosen by experimentation and have proven their performance.
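The 50/30/20 protocol above can be expressed compactly with scikit-learn's train_test_split; this sketch is an illustration of the stated protocol (the stratification is an added assumption, not stated in the paper).

# 50% train / 30% validation / 20% test split, regenerated with a new seed each run.
from sklearn.model_selection import train_test_split

def split_50_30_20(X, y, seed=None):
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=seed)
    # 30/20 of the whole dataset = 60/40 of the remaining half
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)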
3.3. Results and discussion

The experimentation is conducted in terms of classification accuracy rate, computational time and number of selected genes. The classification accuracy rate was computed under different training and testing sets: we ran 100 executions of the proposed approach and, at each execution, we randomly split the dataset into training, validation and testing sets, computed the classification accuracy rate, and retained the best, the worst, and the mean values. Tables 3 and 4 report the classification accuracy rate, the number of selected genes and the computational time obtained by the proposed approach for each microarray dataset under the testing set. Table 3 presents the best, mean and worst classification accuracy rates for each microarray dataset at each step: the second column gives the rates for the first step (SVM-RFE) and the third column the rates for the second step (BDF). Table 4 describes the number of selected genes and the computational time obtained in each step: the initial number of genes before treatment, the number of genes selected by SVM-RFE (the 60% of genes with the highest SVM weights), the number of genes after applying the BDF algorithm, and the computational time.
Table 3. Classification accuracy rate (%) obtained by the proposed approach in the first and second step.

Dataset name     Step 1: SVM-RFE              Step 2: BDF
                 Best    Mean    Worst        Best    Mean    Worst
Breast cancer    78.94   64.71   47.36        89.47   86.22   84.11
Colon cancer     95.65   81.08   60.86        100     97.46   95.04
DLBC             96.87   80.62   59.37        100     89.44   86.67
Leukemia         96.20   81.48   61.55        100     95.81   88.89
Lung cancer      95.34   80.73   60.44        100     99.14   97.18
Ovarian cancer   99.00   93.80   85.07        100     98.19   94.00
Table 4. Number of selected genes and computational time obtained by the proposed approach in the first and second step.

Dataset name     Initial number of genes   Step 1: SVM-RFE   Step 2: BDF   Computational time (s)
Breast cancer    24,481                    14,688            7237          1149.19
Colon cancer     2000                      1200              510           287.99
DLBC             4026                      2415              1210          381.02
Leukemia         5147                      3088              1522          498.49
Lung cancer      12,533                    7519              3737          831.33
Ovarian cancer   15,154                    9092              4573          939.80
From Table 3, we can observe that the classification accuracy rate is significantly improved in the second step compared to the first: in the first step, SVM-RFE selects the top 60% of the genes, and in the second step the BDF algorithm selects the smallest subset among that 60% that achieves a high classification accuracy rate. Analysis of the results presented in Tables 3 and 4 demonstrates the performance of the proposed approach. The classification accuracy rate obtained by our approach reached 100% for five microarray datasets: colon, DLBCL, leukemia, lung and ovarian. For the breast cancer dataset, the classification accuracy rate achieved was 89.47%. The number of genes was significantly reduced: SVM-RFE removes the 40% of genes that are considered irrelevant, and the BDF algorithm finds the optimal subset of genes to improve the classification accuracy rate. The performance on the breast cancer data was the least satisfying: 7237 genes were selected from 24,481 (29.56% of the genes), and the mean classification accuracy rate was 86.22%, so the results for this dataset were not as good as for the other datasets. As seen in Tables 3 and 4, the performance on the colon cancer data was very satisfying, as a maximum accuracy of 100% was reached with 510 of the 2000 genes (25%). The results obtained for the DLBCL dataset were also good, as 100% accuracy was achieved using 30% of the genes (1210 of 4026). For the leukemia dataset, 1522 genes were selected from 5147 (29.58%), attaining a 100% classification accuracy rate. The analysis of the lung cancer results shows that the proposed approach reached maximum accuracy with 3737 of 12,533 genes (29.81%). The last dataset is ovarian cancer, for which the proposed approach produced 100% accuracy with 4573 of 15,154 genes (30.17%). In terms of computational time, wrapper approaches are demanding because they make repetitive calls to the classifier. Our approach is not overwhelming in this respect because the first phase eliminates 40% of the genes using SVM-RFE; we use SVM-RFE in the first step as a pre-processing step precisely because the BDF algorithm of the second step becomes very computationally expensive when the number of features is large. We also tested another filtering technique, mRMR, instead of SVM-RFE for selecting the top k = 60% of genes from the initial gene set. The results are described in Table 5, which shows the best, average and worst classification accuracy rates obtained by the proposed approach at each step on the six datasets. We observe from Table 5 that the results obtained using mRMR as the pre-processing stage are quite similar to those obtained with SVM-RFE. As a pre-processing stage, one can therefore use either SVM-RFE or mRMR: both methods produce a decent seed point from which the BDF method takes over, and the choice of initialization method matters little since much of the work is done by BDF in the second stage. Nevertheless, mRMR has a slight advantage over SVM-RFE in terms of accuracy and is slightly faster; hence, we propose to use mRMR instead of SVM-RFE in the pre-filtering stage. A comparison with other approaches is given in Table 6; the results are reported from [26].
Table 6 compares the classification accuracy rate obtained by our approach with other approaches suggested in the literature. We note that the methods described in the table are reported from [26] and use cross validation to compute the classification accuracy rate; these papers do not detail how the experiments were conducted [26]. In addition, we compare the proposed approach with seven filter approaches: Max-Relevance Min-Redundancy (mRMR) [9,17], Mutual Information Maximization (MIM) [45], Mutual Information Feature Selection (MIFS) [45], Conditional Mutual Information Maximization (CMIM) [45], Joint Mutual Information (JMI) [9,17], Gini Index and Relief. Four wrapper approaches are also used for comparison: Particle Swarm Optimization (PSO) [46], Binary Bat Algorithm (BBA) [47], Binary Gravitational Search Algorithm (BGSA) [48] and Genetic Algorithm (GA) [37]. The proposed approach is also compared to two classifiers, SVM and KNN, using all the features. Table 7 presents the resulting classification accuracy rates. The comparison between the proposed approach and the other feature selection techniques was conducted with the same training and testing samples, and SVM with a Gaussian kernel (σ = 2) was used to report the final performance after the selection of the optimal subset of features. For the filter methods, we tested different percentages of selected features and retained the subset that provided the best classification accuracy rate. As seen in Table 7, our proposed approach achieved a high classification accuracy rate while selecting the fewest genes; most remarkably, on nearly all of the datasets we obtained the maximum classification accuracy rate (100%). To validate the performance of our approach, we tested its stability by executing the algorithm 100 times with randomly generated training and testing sets, using the holdout method. Stability can be defined as the sensitivity of the feature preference (i.e., how different training sets affect the feature preference) [49]. The results are illustrated in Fig. 3. From the experiment results, we can conclude that our approach produced satisfying results compared to other approaches. This paper can be summarized with the following points:
Table 5. Classification accuracy rate (%) obtained by the proposed approach with mRMR in the first step and BDF in the second step.

Dataset name     Step 1: mRMR                 Step 2: BDF
                 Best    Mean    Worst        Best    Mean    Worst
Breast cancer    77.29   71.31   65.12        88.31   83.78   80.01
Colon cancer     90.61   81.65   70.00        100     96.66   93.42
DLBC             95.83   89.70   83.19        100     93.75   88.02
Leukemia         97.07   92.99   89.10        100     95.53   90.99
Lung cancer      95.62   90.05   84.65        100     96.98   94.33
Ovarian cancer   91.50   85.21   79.88        100     97.16   94.00
Table 6. Comparison between the proposed approach and some previous work.

Authors                     Colon   DLBC    Leukemia   Lung    Ovarian
Muchenxuan et al. [27]      90.30   96.10   97.20      –       –
Tan et al. [28]             80.70   80.50   73.60      –       –
Ye et al. [29,26]           85.00   –       97.50      –       –
Liu et al. [30,26]          91.90   98.00   100        100     99.20
Tan and Gilbert [31,26]     95.10   –       91.10      93.20   –
Ding and Peng [32,26]       93.50   –       100        97.20   –
Cho and Won [33,26]         87.70   93.00   95.90      –       –
Yang et al. [34,26]         84.80   –       73.20      –       –
Peng et al. [35,26]         87.00   –       98.60      100     –
Wang et al. [36,26]         100     95.60   95.80      –       –
Huerta et al. [37,26]       91.40   –       100        –       –
Pang et al. [38,26]         83.80   –       94.10      91.20   98.80
Li et al. [39,26]           83.50   93.00   97.10      –       99.90
Zhang et al. [40,26]        90.30   92.20   100        100     –
Yue et al. [41,26]          85.40   –       91.50      –       –
Hernandez et al. [42,26]    84.60   –       100        –       –
Li et al. [43,26]           93.60   –       100        –       –
Wang et al. [44,26]         93.50   –       100        –       –
Edmundo et al. [26]         93.50   100     100        98.30   98.80
This study (average)        97.46   89.44   95.81      99.14   98.19
This study (best)           100     100     100        100     100
Table 7. Comparison between the best classification accuracy rate obtained by the proposed approach and several feature selection approaches.

Datasets                Colon   Leukemia   Ovarian   Breast   DLBC    Average performance   Standard deviation

Filter approaches
MIM                     82.61   70.37      89.47     74.03    95.67   82.43                 10.49
mRMR                    90.61   97.07      91.50     77.29    95.83   90.46                 7.85
MIFS                    82.61   77.78      87.20     66.16    87.50   80.25                 8.81
JMI                     73.91   66.67      91.81     69.89    86.11   77.67                 10.80
CMIM                    56.62   62.69      91.25     65.16    85.67   72.27                 15.22
Gini Index              80.50   94.00      99.40     72.95    94.44   88.25                 11.06
Relief                  81.46   91.61      95.82     47.37    91.00   81.45                 19.76

Wrapper approaches
PSO                     99.83   100        100       76.32    99.44   95.11                 10.51
BGSA                    67.50   85.00      92.60     48.15    73.33   73.31                 17.14
BBA                     77.08   90.35      69.00     57.10    77.22   74.15                 12.22
GA                      98.30   97.20      98.83     95.86    98.27   97.69                 1.18

Using all features
KNN                     91.67   96.43      93.00     63.16    72.22   83.29                 14.70
SVM                     95.33   100        100       72.42    94.44   92.43                 11.48

This study
Best values             100     100        100       89.47    100     97.89                 4.70
Average values          97.46   95.81      98.19     86.22    89.44   93.42                 5.30
Standard deviation      1.31    3.43       1.50      2.52     8.18    –                     –
1. The proposed approach is an amalgamation of embedded and wrapper approaches (SVM-RFE and BDF).
2. The embedded approach is SVM-RFE, which is used to reduce the initial number of genes (60% of genes selected).
3. The wrapper approach is based on the BDF algorithm, a recent metaheuristic never before used in the context of gene selection.
4. The objective function is the average classification error rate obtained by three variants of SVM (C-SVM, ν-SVM and LS-SVM).
5. The proposed approach is not time-consuming compared to other wrapper approaches defined in the literature.
6. Compared to other methods, the proposed approach is powerful and efficient in selecting a subset of genes that achieves a high classification accuracy rate.
Fig. 3. Classification accuracy rate obtained by different training and test sets.
4. Conclusion

In this paper, we present a cancer diagnostic process based on feature selection. The proposed approach incorporates two phases: the first selects 60% of the genes as candidates using SVM-RFE, and the second selects the optimal subset of genes from the candidates using the BDF algorithm. In the second phase, we optimize the objective function, which is the mean of the three classification error rates obtained by three kernel-based learning methods. The performance of our approach was demonstrated on six microarray datasets: breast cancer, colon cancer, ovarian cancer, DLBCL, leukemia and lung cancer. Additionally, our approach was compared with several approaches from the literature. Compared to previous work, our approach provides a high classification accuracy rate with a much smaller gene subset. According to our experimentation, our approach can successfully address the problem of gene selection and the diagnosis of cancer. We also tested the stability of our approach by running the algorithm 100 times, and the results show that it is stable. Gene selection is an important issue in medical research: it improves the quality and reduces the complexity of the classifier model and, consequently, significantly improves the classification accuracy rate. For future work, instead of minimizing the classification error rate, another objective function could be designed from class separability distances or the correlation between genes.
References

[1] K.D. Miller, R.L. Siegel, C.C. Lin, A.B. Mariotto, J.L. Kramer, J.H. Rowland, K.D. Stein, R. Alteri, A. Jemal, Cancer treatment and survivorship statistics, 2016, CA Cancer J. Clin. 66 (4) (2016) 271-289.
[2] R. Sheikhpour, M.A. Sarram, R. Sheikhpour, Particle swarm optimization for bandwidth determination and feature selection of kernel density estimation based classifiers in diagnosis of breast cancer, Appl. Soft Comput. 40 (2016) 113-131.
[3] J. Apolloni, G. Leguizamón, E. Alba, Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Appl. Soft Comput. 38 (2016) 922-932.
[4] B.A. Garro, K. Rodríguez, R.A. Vázquez, Classification of DNA microarrays using artificial neural networks and ABC algorithm, Appl. Soft Comput. 38 (2016) 548-560.
[5] A.A. Abdoos, P.K. Mianaei, M.R. Ghadikolaei, Combined VMD-SVM based feature selection method for classification of power quality events, Appl. Soft Comput. 38 (2016) 637-646.
[6] Y. Lin, Q. Hu, J. Liu, J. Chen, J. Duan, Multi-label feature selection based on neighborhood mutual information, Appl. Soft Comput. 38 (2016) 244-256.
[7] J. Pérez-Rodríguez, A.G. Arroyo-Peña, N. García-Pedrajas, Simultaneous instance and feature selection and weighting using evolutionary computation: proposal and study, Appl. Soft Comput. 37 (2015) 416-443.
[8] S.A. Medjahed, T.A. Saadi, A. Benyettou, M. Ouali, Gray Wolf Optimizer for hyperspectral band selection, Appl. Soft Comput. 40 (2016) 178-186.
[9] O.M. Soufan, D. Kleftogiannis, P. Kalnis, V.B. Bajic, DWFS: a wrapper feature selection tool based on a parallel genetic algorithm, PLOS ONE 10 (2) (2015) 1-23.
[10] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1) (2002) 389-422.
[11] P.A. Mundra, J.C. Rajapakse, SVM-RFE with MRMR filter for gene selection, IEEE Trans. Nanobiosci. 9 (1) (2010) 31-37.
[12] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273-297.
[13] S. Mirjalili, Dragonfly algorithm: a new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems, Neural Comput. Appl. (2015) 1-21.
[14] C. Reynolds, Flocks, herds and schools: a distributed behavioral model, SIGGRAPH Comput. Graph. 21 (4) (1987) 25-34.
[15] B. Schölkopf, A.J. Smola, R.C. Williamson, P.L. Bartlett, New support vector algorithms, Neural Comput. 12 (5) (2000) 1207-1245.
[16] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[17] O.M. Soufan, An Empirical Study of Wrappers for Feature Subset Selection Based on a Parallel Genetic Algorithm: The Multi-Wrapper Model, Master's thesis, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, 2012.
[18] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, S.H. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530-536.
[19] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, A. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. 96 (12) (1999) 6745-6750.
[20] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J.J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, L.M. Staudt, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403 (6769) (2000) 503-511.
[21] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531-537.
[22] G.J. Gordon, R.V. Jensen, L.-L. Hsiao, S.R. Gullans, J.E. Blumenstock, S. Ramaswamy, W.G. Richards, D.J. Sugarbaker, R. Bueno, Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma, Cancer Res. 62 (17) (2002) 4963-4967.
[23] E.F. Petricoin, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, L.A. Liotta, Use of proteomic patterns in serum to identify ovarian cancer, Lancet 359 (9306) (2002) 572-577.
[24] S.-W. Lin, K.-C. Ying, S.-C. Chen, Z.-J. Lee, Particle swarm optimization for parameter determination and feature selection of support vector machines, Expert Syst. Appl. 35 (4) (2008) 1817-1824.
[25] S.-W. Lin, Z.-J. Lee, S.-C. Chen, T.-Y. Tseng, Parameter determination of support vector machine and feature selection using simulated annealing approach, Appl. Soft Comput. 8 (4) (2008) 1505-1512.
[26] E.B. Huerta, B. Duval, J.-K. Hao, Gene selection for microarray data by a LDA-based genetic algorithm, in: Pattern Recognition in Bioinformatics: Third IAPR International Conference, PRIB 2008, Melbourne, Australia, October 15-17, 2008, Proceedings, Springer Berlin Heidelberg, 2008, pp. 250-261.
[27] M. Tong, K.-H. Liu, C. Xu, W. Ju, An ensemble of SVM classifiers based on gene pairs, Comput. Biol. Med. 43 (2013) 729-737.
[28] A.C. Tan, D.Q. Naiman, L. Xu, R.L. Winslow, D. Geman, Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics 21 (2005) 3896-3904.
[29] J. Ye, T. Li, T. Xiong, R. Janardan, Using uncorrelated discriminant analysis for tissue classification with gene expression data, IEEE Trans. Comput. Biol. Bioinform. 1 (4) (2004) 181-190.
[30] B. Liu, Q. Cui, T. Jiang, S. Ma, A combinational feature selection and ensemble neural network method for classification of gene expression data, BMC Bioinform. 5 (136) (2004) 1-12.
[31] A.C. Tan, D. Gilbert, Ensemble machine learning on gene expression data for cancer classification, Appl. Bioinform. 2 (3) (2003) 75-83.
[32] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, J. Bioinform. Comput. Biol. 3 (2) (2005) 185-206.
[33] S.B. Cho, H.-H. Won, Cancer classification using ensemble of neural networks with multiple significant gene subsets, Appl. Intell. 26 (3) (2007) 243-250.
[34] W.-H. Yang, D.-Q. Dai, H. Yan, Generalized discriminant analysis for tumor classification with gene expression data, Mach. Learn. Cybern. (2006) 4322-4327.
[35] Y. Peng, W. Li, Y. Liu, A hybrid approach for biomarker discovery from microarray gene expression data, Cancer Inform. (2006) 301-311.
[36] Z. Wang, V. Palade, Y. Xu, Neuro-fuzzy ensemble approach for microarray cancer gene expression data analysis, in: Proc. International Symposium on Evolving Fuzzy Systems, 2006.
[37] E.B. Huerta, B. Duval, J.-K. Hao, A hybrid GA/SVM approach for gene selection and classification of microarray data, in: EvoWorkshops 2006, LNCS, 2006, pp. 34-44.
[38] S. Pang, I. Havukkala, Y. Hu, N. Kasabov, Classification consistency analysis for bootstrapping gene selection, Neural Comput. Appl. (2007) 527-539.
[39] G.-Z. Li, X.-Q. Zeng, J.Y. Yang, M.Q. Yang, Partial least squares based dimension reduction with gene selection for tumor classification, in: Proc. of 7th IEEE Intl. Symposium on Bioinformatics and Bioengineering, 2007, pp. 1439-1444.
[40] L.-J. Zhang, Z.-J. Li, H.-W. Chen, An effective gene selection method based on relevance analysis and discernibility matrix, in: PAKDD 2007, LNCS 4426, 2007, pp. 1088-1095.
[41] F. Yue, K. Wang, W. Zuo, Informative gene selection and tumor classification by null space LDA for microarray data, in: ESCAPE 2007, LNCS 4614, 2007, pp. 435-446.
[42] J.C.H. Hernandez, B. Duval, J.-K. Hao, A genetic embedded approach for gene selection and classification of microarray data, in: EvoBIO 2007, LNCS 4447, 2007, pp. 90-101.
[43] S. Li, X. Wu, X. Hu, Gene selection using genetic algorithm and support vector machines, Soft Comput. 12 (7) (2008) 693-698.
[44] S. Wang, H. Chen, S. Li, D. Zhang, Feature extraction from tumor gene expression profiles using DCT and DFT, in: EPIA 2007, LNCS 4874, 2007, pp. 485-496.
[45] G. Brown, A. Pocock, M.-J. Zhao, M. Luján, Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J. Mach. Learn. Res. 13 (2012) 27-66.
[46] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of IEEE International Conference on Neural Networks, vol. 4, 1995, pp. 1942-1948.
[47] S. Mirjalili, S.M. Mirjalili, X.-S. Yang, Binary bat algorithm, Neural Comput. Appl. 25 (3-4) (2013) 663-681.
[48] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, GSA: a gravitational search algorithm, Inf. Sci. 179 (13) (2009) 2232-2248.
[49] A. Kalousis, Stability of feature selection algorithms, in: Fifth IEEE International Conference on Data Mining, 2005.