Accepted Manuscript

Tuning parameter estimation in SCAD-support vector machine using firefly algorithm with application in gene selection and cancer classification

Niam Abdulmunim Al-Thanoon, Omar Saber Qasim, Zakariya Yahya Algamal

PII: S0010-4825(18)30336-6
DOI: https://doi.org/10.1016/j.compbiomed.2018.10.034
Reference: CBM 3129
To appear in: Computers in Biology and Medicine
Received Date: 1 August 2018
Revised Date: 28 October 2018
Accepted Date: 29 October 2018

Please cite this article as: N.A. Al-Thanoon, O.S. Qasim, Z.Y. Algamal, Tuning parameter estimation in SCAD-support vector machine using firefly algorithm with application in gene selection and cancer classification, Computers in Biology and Medicine (2018), doi: https://doi.org/10.1016/j.compbiomed.2018.10.034.
Tuning parameter estimation in SCAD-support vector machine using firefly algorithm with application in gene selection and cancer classification
Niam Abdulmunim Al-Thanoon
Department of Operations Research and Artificial Intelligence, University of Mosul, Mosul, Iraq
E-mail: [email protected]

Omar Saber Qasim
Department of Mathematics, University of Mosul, Mosul, Iraq
E-mail: [email protected]

Zakariya Yahya Algamal*
Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
E-mail: [email protected]
ORCID: 0000-0002-0229-7958

*Corresponding Author: Zakariya Yahya Algamal, E-mail: [email protected], Telephone number: +964 7701640834
Abstract

In cancer classification, gene selection is one of the most important bioinformatics-related topics. The selection of genes can be considered a variable selection problem, which aims to find a small subset of genes that has the most discriminative information for the classification target. The penalized support vector machine (PSVM) has proved its effectiveness at creating a strong classifier that combines the advantages of the support vector machine and penalization. PSVM with a smoothly clipped absolute deviation (SCAD) penalty is the most widely used method. However, the efficiency of PSVM with SCAD depends on choosing the appropriate tuning parameter involved in the SCAD penalty. In this paper, a firefly algorithm, which is a metaheuristic continuous algorithm, is proposed to determine the tuning parameter in PSVM with the SCAD penalty. Our proposed algorithm can efficiently help to find the most relevant genes with high classification performance. The experimental results from four benchmark gene expression datasets show the superior performance of the proposed algorithm in terms of classification accuracy and the number of selected genes compared with competing methods.
Keywords: SCAD; gene selection; cancer classification; penalized support vector machine; firefly algorithm.
1. Introduction

With the development of DNA microarray technologies in biology, the resulting datasets naturally have a small sample size with a high dimension: the sample size is usually in the range of hundreds, while the number of genes is in the tens of thousands [1-4]. This makes classical classification methods difficult to apply for correct classification. Cancer classification datasets often contain a large number of irrelevant or redundant genes that may significantly degrade the classifier accuracy and increase the time required for computation [5]. Identifying an optimal gene subset is a very complex task. Gene selection, which is also known as dimensionality reduction, is the method of selecting an optimal subset of relevant genes that can improve the computational efficiency of the classification method and lower the classification error rate [6, 7]. Consequently, several gene selection methods have been proposed and studied in the literature. These methods can be divided into three broad categories: filter, wrapper, and embedded methods [6].

Filter methods are among the most popular gene selection methods; they rely on a specific criterion that measures the information gained from each gene. These methods work separately from, and do not depend on, the classification method. In wrapper methods, on the other hand, the gene selection process is driven by the performance of a classification algorithm, so as to optimize the classification performance. In embedded methods, the gene selection process is incorporated into the classification method, which can then simultaneously perform gene selection and classification [8].

The support vector machine (SVM) has attracted much attention from many scientific fields in recent years because of its theoretical and practical advantages, which result in its improved performance in classification [9, 10]. Despite the excellent characteristics of the SVM, there are still several drawbacks, including the selection of genes. In other words, the SVM cannot perform gene selection [11]. The penalized support vector machine (PSVM), which is one of the most effective embedded methods, is preferable to the SVM because the PSVM combines the standard SVM with a penalty to simultaneously perform both gene selection and classification [12]. With different penalties, numerous PSVMs can be applied; among them are the L1-norm penalty, known as the least absolute shrinkage and selection operator (lasso) [11], and the smoothly clipped absolute deviation (SCAD) penalty [13]. However, the efficiency of PSVM with the SCAD penalty depends on choosing the appropriate tuning parameter involved in the SCAD penalty.

In this paper, a firefly optimization method, which is a metaheuristic continuous algorithm, is proposed to determine the tuning parameter in PSVM with the SCAD penalty. The proposed method will efficiently help to find the most significant genes for constructing cancer classifiers with high classification performance. The experimental results show the favorable performance of the proposed method when the number of genes is high and the sample size is small.

2. Methods
2.1 Penalized support vector machine

The support vector machine is an efficient and powerful classification method for binary classification problems [14]. In cancer classification, the gene matrix can be described as a matrix $X = (x_{ij})_{n \times d}$, where each column represents a gene and each row represents a sample (patient). The numerical value of $x_{ij}$ denotes the value of a specific gene $j\ (j = 1, \dots, d)$ in a specific sample $i\ (i = 1, \dots, n)$. For a binary classification problem, a typical training dataset is $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$ represents the gene expression vector of the $i$th sample, and $y_i \in \{-1, +1\}$ for $i = 1, \dots, n$, where $y_i = +1$ indicates that the $i$th sample is in class 1 and $y_i = -1$ indicates that it is in class 2. An SVM generates a real-valued function $\varphi(X)$ as a hyperplane that maximizes the distance (margin) between the data to be separated.

Depending on the Lagrangian, solving this problem can be written as a quadratic dual optimization problem:

$$\min_{\boldsymbol{\alpha}}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j \mathbf{x}_i^{T}\mathbf{x}_j - \sum_{i=1}^{n}\alpha_i
\quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0,\ \alpha_i \ge 0,\ i = 1, 2, \dots, n, \qquad (1)$$

where $\boldsymbol{\alpha}$ is a vector of Lagrange multipliers and each $\alpha_i$ corresponds to a training observation $(\mathbf{x}_i, y_i)$. Equation (1) is used for linearly separable training observations. However, to extend the SVM to linearly non-separable training observations, each observation $(\mathbf{x}_i, y_i)$ is associated with a slack variable $\zeta_i \ge 0$. Then, the Lagrangian becomes:

$$\min_{\boldsymbol{\alpha}}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) - \sum_{i=1}^{n}\alpha_i
\quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0,\ 0 \le \alpha_i \le C,\ i = 1, 2, \dots, n, \qquad (2)$$

where $C$ is a parameter that controls the tradeoff between the maximum margin and the minimum classification error, and $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j)$ is the kernel function.
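To make Eq. (2) concrete, the following minimal sketch (not part of the original study) fits a soft-margin, linear-kernel SVM on toy data standing in for a gene matrix; it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data standing in for an n x d gene expression matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {-1, +1}

# C is the margin/error tradeoff of Eq. (2); a linear kernel gives K(xi, xj) = xi^T xj
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```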
Although the SVM has proven useful in binary classification, it cannot perform feature selection because it uses the L2-norm, $\|\mathbf{w}\|_2^2$. Typically, any classification problem includes a number of features, many of which may be noisy or redundant, leading to degradation of the performance of the classification algorithm. Therefore, reducing dimensions is an essential step, which can be achieved through feature selection strategies.

Bradley and Mangasarian [15] and Zhu, Rosset, Hastie and Tibshirani [11] proved that the SVM optimization problem is equivalent to a penalization problem, which has the form:

$$\min_{\mathbf{w},b}\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]_{+} + \operatorname{Pen}_{\lambda}(\mathbf{w}), \qquad (3)$$

where $[1 - y_i f(\mathbf{x}_i)]_{+} = \max(1 - y_i f(\mathbf{x}_i), 0)$ represents the hinge loss term and $\operatorname{Pen}_{\lambda}(\mathbf{w})$ represents the penalty term.

Several penalties have been proposed, including the L1-norm [11, 15] and the Lq-norm with $q < 1$ [16-18]. Furthermore, Zhang, Ahn, Lin and Park [19] proposed using the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li [13] with the SVM. In addition, Wang, Zhu and Zou [12] proposed a hybrid huberized SVM using the elastic net penalty, whereas Becker, Toedt, Lichter and Benner [20] proposed a combination of ridge and SCAD with the SVM.

The L1-norm penalty, proposed by Bradley and Mangasarian [15] and Zhu, Rosset, Hastie and Tibshirani [11], is one of the most popular penalty functions because the SVM with the L1-norm can automatically select genes by shrinking the hyperplane coefficients to zero. The SVM-L1 is defined as:

$$\min_{\mathbf{w},b}\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]_{+} + \lambda\sum_{j=1}^{d}|w_j|, \qquad (4)$$

where $\lambda$ is a positive tuning parameter, which controls the amount of shrinkage, and $f(\mathbf{x}_i)$ is the function of the hyperplane. Equation (4) is a convex optimization problem and can be solved by the method of Lagrange multipliers.
The SCAD penalty function has better theoretical characteristics than the L1-norm. Zhang, Ahn, Lin and Park [19] suggested the hybridization of the SVM with the non-convex SCAD penalty for feature selection and proved that their method performed better than the L1-norm SVM. The penalization term of SCAD in Eq. (3) has the following form:

$$\operatorname{Pen}_{\lambda}(\mathbf{w}) = \sum_{j=1}^{d} p_{\mathrm{SCAD}(\lambda)}(w_j), \qquad (5)$$

where

$$p_{\mathrm{SCAD}(\lambda)}(w_j) =
\begin{cases}
\lambda |w_j| & \text{if } |w_j| \le \lambda,\\[4pt]
-\dfrac{|w_j|^2 - 2a\lambda |w_j| + \lambda^2}{2(a-1)} & \text{if } \lambda < |w_j| \le a\lambda,\\[4pt]
\dfrac{(a+1)\lambda^2}{2} & \text{if } |w_j| > a\lambda,
\end{cases} \qquad (6)$$

where $a > 2$ [13, 19]. The two tuning parameters, $\lambda$ and $a$, play an important role in determining an accurate classification. Thus, Eq. (4) with the penalty term of Eq. (5) can be written as:

$$\min_{\mathbf{w},b}\ \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]_{+} + \lambda\sum_{j=1}^{d} p_{\mathrm{SCAD}(\lambda)}(w_j). \qquad (7)$$

Compared to the L1-norm penalty, the SCAD penalty applies a constant penalty to large coefficients. This decreases the estimation bias, in contrast to the L1-norm penalty, which increases linearly as the coefficient increases. Also, the SCAD penalty yields sparse solutions by thresholding small estimates to zero, and it provides approximately unbiased estimates and a model that is continuous in the data, which leads to consistently selecting the most important features [13, 19].
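For illustration, the SCAD penalty of Eq. (6) can be coded directly; this is a minimal NumPy sketch, with $a$ defaulting to the value 3.7 used by Fan and Li [13]:

```python
import numpy as np

def scad_penalty(w, lam, a=3.7):
    """Elementwise SCAD penalty p_SCAD(lambda)(w_j) of Eq. (6)."""
    w = np.abs(np.asarray(w, dtype=float))
    small = lam * w                                    # |w_j| <= lambda
    mid = -(w**2 - 2*a*lam*w + lam**2) / (2*(a - 1))   # lambda < |w_j| <= a*lambda
    large = (a + 1) * lam**2 / 2                       # |w_j| > a*lambda
    return np.where(w <= lam, small, np.where(w <= a * lam, mid, large))
```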
2.2 Firefly algorithm
Swarm intelligence algorithms have been widely studied and successfully applied to a variety of complex optimization problems. The firefly algorithm (FFA), developed by Yang [21], is one of the most recent swarm intelligence methods and among the most powerful optimization algorithms.

The firefly algorithm has shown good performance and effectiveness in solving various optimization problems [22]. It was inspired by simulating the social behavior of fireflies on the basis of their flashing lights. The firefly flash acts as a signal system used to attract other fireflies; by representing the attractiveness of the flashing characteristics of fireflies, it is possible to model how fireflies interact through these flashing lights [23]. In the implementation of the FFA, each member of the swarm is a firefly, and each firefly represents a candidate solution in the dimensional search space. Brighter locations are assumed to represent better solutions, and the algorithm tries to help the fireflies find these locations in the search space. The attractiveness of a firefly is determined by its brightness, which in turn is associated with the objective function of the given optimization problem. The brightness decreases as the distance between a firefly and the target location increases. The attraction between fireflies is based on differences in brightness: a less bright firefly moves toward a brighter firefly owing to this attraction. If none of the fireflies is brighter than the others, each firefly moves randomly. During the search process, because of the attractions among fireflies, the fireflies move toward new positions and thus find new candidate solutions.
Mathematically, assume that there are $n_f$ fireflies in the swarm (the population size), randomly distributed in the $D$-dimensional search space. During the evolutionary process, each firefly has a position vector denoted by $\mathbf{x}_i = \{x_{i1}, x_{i2}, \dots, x_{id}\}$, where $i = 1, 2, \dots, n_f$ and $d \in D$ indexes the dimensions of the solutions.

The distance between any two fireflies $i$ and $j$, at positions $\mathbf{x}_i$ and $\mathbf{x}_j$ in the search space, respectively, is the Cartesian distance, which can be calculated using the following equation:

$$r_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\| = \sqrt{\sum_{d=1}^{D}(x_{id} - x_{jd})^2}. \qquad (8)$$

The brightness of firefly $i$ at a particular (current) position $\mathbf{x}$ is given by the objective function value:

$$I(\mathbf{x}_i) = f(\mathbf{x}_i). \qquad (9)$$

The light intensity of a firefly is directly proportional to its brightness and is related to the objective values. When two fireflies are compared, both are attracted, but the firefly with the lower light intensity moves toward the firefly with the higher light intensity. The light intensity of a firefly depends on the intensity $I_0$ of the light it emits and the distance $r_{ij}$ between the two fireflies. The light intensity $I(r)$ can be described by a monotonically decreasing function of $r_{ij}$, which can be formulated as follows:

$$I(r) = I_0 e^{-\gamma r^2}, \qquad (10)$$

where $\gamma$ is used to control the decrease in light intensity (brightness) and can be taken as a constant.

Each firefly has its distinctive attractiveness, which indicates how powerfully it attracts other members of the swarm. Attractiveness, $\beta$, is relative: it must be judged by other fireflies, and it therefore varies with the distance $r_{ij}$. The attractiveness must also be allowed to vary with differing degrees of light absorption [24]. Thus, the main form of the attractiveness of a firefly is defined by the following equation:

$$\beta(r) = \beta_0 e^{-\gamma r^2}, \qquad (11)$$

where $\beta(r)$ represents the attractiveness of a firefly at a distance $r$, and $\beta_0$ denotes the initial attractiveness at distance $r = 0$, which can be taken as constant. In implementations, $\beta_0$ is usually set to 1 for most problems.
The fireflies try to move to the best positions in the search space: a firefly with lower light intensity is attracted by a brighter firefly. The location update is applied for each pair of fireflies $i$ and $j$: each firefly $\mathbf{x}_i$ is compared to all other fireflies $\mathbf{x}_j$, $j = 1, 2, \dots, n_f$, and if firefly $j$ at position $\mathbf{x}_j$ is brighter than firefly $i$, then $\mathbf{x}_i$ moves towards $\mathbf{x}_j$ by the attraction. The movement is defined as:

$$x_{id}^{(t+1)} = x_{id}^{(t)} + \beta_0 e^{-\gamma r_{ij}^2}\left(x_{jd}^{(t)} - x_{id}^{(t)}\right) + \alpha_t\,\varepsilon_{id}^{(t)}, \qquad (12)$$

where $\alpha_t$ is the randomization parameter, $\gamma$ is an absorption coefficient that controls the decrease in light intensity, and $\varepsilon_{id}^{(t)} = (\mathrm{rand} - 0.5)$, where $\mathrm{rand}$ is a random number drawn from the uniform distribution on $[0, 1]$. The flow diagram of the FFA is shown in Figure 1.
Figure 1: The flow diagram of the FFA.
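The update rules of Eqs. (8)-(12) translate into a compact loop. The following is a minimal sketch of the FFA for maximizing an objective over a box-constrained search space; it is an illustration under the stated equations, not the authors' exact implementation:

```python
import numpy as np

def firefly_optimize(objective, bounds, n_fireflies=50, t_max=100,
                     beta0=1.0, gamma=0.2, alpha=0.1, seed=0):
    """Minimal firefly algorithm sketch that maximizes `objective`.

    bounds: sequence of (low, high) pairs, one per search dimension.
    Parameter names follow Eqs. (8)-(12) and the settings of Section 2.3.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, hi = bounds[:, 0], bounds[:, 1]
    dim = len(bounds)
    # Random initial positions in the search space
    x = lo + rng.random((n_fireflies, dim)) * (hi - lo)
    light = np.array([objective(xi) for xi in x])       # Eq. (9): I(x_i) = f(x_i)
    for _ in range(t_max):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:                 # firefly j is brighter
                    r2 = np.sum((x[i] - x[j]) ** 2)     # squared distance, Eq. (8)
                    beta = beta0 * np.exp(-gamma * r2)  # attractiveness, Eq. (11)
                    eps = rng.random(dim) - 0.5         # random step of Eq. (12)
                    x[i] += beta * (x[j] - x[i]) + alpha * eps
                    x[i] = np.clip(x[i], lo, hi)        # keep within bounds
                    light[i] = objective(x[i])
    best = int(np.argmax(light))
    return x[best], light[best]
```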
2.3 The proposed method
The efficiency of the penalized support vector machine with the SCAD penalty largely depends on choosing the appropriate tuning parameter, $\lambda$. The tuning parameter controls the tradeoff between classification and the number of selected genes. As a result, selecting a suitable value of the tuning parameter is an important part of the fitting [25-27].

In addition, the SCAD penalty in Equation (6) depends on the quantity $a$. As suggested by Fan and Li [13], the value of this quantity should be $a \ge 2$, and they used $a = 3.7$. In the penalization, $\lambda$ controls the tradeoff between classification and the number of selected genes, so it is of crucial importance to select a suitable value of $\lambda$. If $\lambda$ is small, the data are overfitted because a large number of genes are not removed; if $\lambda$ is large, a large number of genes are removed.

In the literature, the most widely used method for selecting $\lambda$ is cross-validation (CV), which is a data-driven approach. However, it has been pointed out that CV usually identifies too many irrelevant genes when the number of genes is large [25, 26] and can be very time consuming [27]. Consequently, several modifications of the CV approach to estimating $\lambda$ have been suggested by researchers [28-32].

Due to the drawbacks of the CV approach, in this paper, the FFA is proposed to determine the tuning parameter in PSVM with the SCAD penalty. Additionally, the term $a$ will also be determined using the FFA. In other words, the FFA will be used to find the values of $\lambda$ and $a$ simultaneously. The proposed method will efficiently help to find the most significant genes related to cancer classification with high classification performance. The parameter configurations for our proposed method were as follows.
(1) The number of fireflies was $n_f = 50$, with $\beta_0 = 1$, $\gamma = 0.2$, $\alpha = 0.1$, and a maximum number of iterations $t_{\max} = 100$.

(2) Two positions were set up for each firefly. The first position represents the tuning parameter, $\lambda$, which was randomly generated from a uniform distribution between 0 and 100. The second position represents the value of $a$, which was randomly generated from a uniform distribution between 2 and 3.7, as suggested by Fan and Li [13]. The positions of the firefly are depicted in Figure 2.

(3) The fitness function, calculated for all fireflies, is defined as

$$\text{fitness} = 0.8 \times CA + 0.2 \times \frac{g - \tilde{g}}{g}, \qquad (13)$$

where $CA$ is the obtained classification accuracy, $g$ represents the number of genes in the dataset, and $\tilde{g}$ represents the number of selected genes (a code sketch of this fitness evaluation is given after Figure 2).

(4) The positions of the fireflies were updated using Eq. (12).

(5) Steps 3 and 4 were repeated until $t_{\max}$ was reached.
Figure 2: Representation of a firefly in the swarm.
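As an illustration of how steps (2)-(4) fit together, the sketch below scores a firefly position $(\lambda, a)$ with the fitness of Eq. (13). The helper `train_scad_svm` and its attributes are hypothetical stand-ins for fitting PSVM-SCAD, not an existing API:

```python
# Hypothetical glue code: a firefly position encodes (lambda, a) and is scored
# by Eq. (13). `train_scad_svm`, `model.accuracy`, and `model.n_selected_genes`
# are assumed stand-ins, not part of any specific library.
def fitness(position, X_train, y_train, g):
    lam, a = position
    model = train_scad_svm(X_train, y_train, lam=lam, a=a)  # assumed PSVM-SCAD fit
    ca = model.accuracy                                     # CA, here taken in [0, 1]
    g_sel = model.n_selected_genes                          # number of selected genes
    return 0.8 * ca + 0.2 * (g - g_sel) / g                 # fitness of Eq. (13)

# Search bounds from Section 2.3: lambda in (0, 100) and a in (2, 3.7), e.g.
# best, score = firefly_optimize(lambda p: fitness(p, X, y, X.shape[1]),
#                                bounds=[(0, 100), (2, 3.7)])
```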
2.4 Evaluation criteria
The classification performance of the methods used was measured by classification accuracy (CA), sensitivity (SE), specificity (SP), the Matthews correlation coefficient (MCC), and the area under the curve (AUC). These criteria are defined as:

$$CA = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%,$$

$$SE = \frac{TP}{TP + FN} \times 100\%,$$

$$SP = \frac{TN}{FP + TN} \times 100\%,$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives of the confusion matrix, respectively. The higher the values of these evaluation criteria, the better the classification performance.
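These criteria follow directly from the confusion-matrix counts; a minimal sketch:

```python
import numpy as np

def evaluation_criteria(tp, tn, fp, fn):
    """Compute CA, SE, SP, and MCC from confusion-matrix counts (Section 2.4)."""
    ca = (tp + tn) / (tp + fp + fn + tn) * 100
    se = tp / (tp + fn) * 100
    sp = tn / (fp + tn) * 100
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return ca, se, sp, mcc
```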
3. Datasets

Four benchmark gene expression datasets with binary classes and different numbers of genes and sample sizes, namely, ovarian, breast, CNS (central nervous system), and autism disorder, were tested. These datasets are publicly available and were downloaded from the GEO (NCBI) repository [33]. The main characteristics of the four datasets are summarized in Table 1.
Table 1: The characteristics of the four datasets used

Dataset   # of samples (n)   # of genes (g)   Class
Ovarian   253                15154            91 normal / 162 ovarian cancer
Breast    97                 24481            51 healthy / 46 unhealthy
CNS       60                 7129             21 survivors / 39 failures
Autism    146                54613            64 healthy / 82 autism
4. Results
With the aim of correctly assessing the performance of our proposed method, comparative experiments were carried out against the original CV method of estimating the tuning parameter. The methods used were as follows (the last two are our proposed algorithm):

(1) CV: We set $a = 3.7$ and used CV to estimate $\lambda$.
(2) FFA1: We set $a = 3.7$ and used the FFA to estimate $\lambda$.
(3) FFA2: The FFA was used to estimate both $\lambda$ and $a$ simultaneously.

In our experiments, 10-fold cross-validation was used, and the range of the tuning parameter for the CV method was fixed between 0 and 100. In addition, the linear kernel function was employed. To obtain reliable classification performance, for each dataset, 70% of the samples were used as a training dataset and the remaining 30% as a test dataset. This partition was repeated 20 times, and the averaged evaluation criteria are reported in Table 2. The numbers in parentheses are standard errors.
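For reference, the repeated 70/30 evaluation protocol can be sketched as follows; `fit_and_score` is a hypothetical stand-in for training and testing one of the compared methods, and the snippet assumes scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_holdout(X, y, fit_and_score, n_repeats=20, seed=0):
    """Average a score over 20 random, stratified 70/30 train/test partitions."""
    scores = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed + r, stratify=y)
        scores.append(fit_and_score(X_tr, y_tr, X_te, y_te))
    return np.mean(scores), np.std(scores)  # mean and spread of the criterion
```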
As can be seen from Table 2, both FFA2 and FFA1 selected fewer genes than CV for all datasets. In the ovarian dataset, for instance, FFA2 and FFA1 selected 22 and 26 genes, respectively, compared with 37 genes for the CV method. Compared to FFA1, FFA2 shows comparable results; in terms of gene selection, FFA2 performs slightly better than FFA1 for the ovarian and autism datasets. Importantly, FFA2 had the potential to select fewer genes than the other two methods, indicating that most of the genes additionally selected by those methods were probably not highly relevant to the classification study.

In terms of classification accuracy, FFA2 achieved maximum accuracies of 97.04% and 98.82% for the ovarian and autism datasets, respectively. In contrast, FFA1 showed the best classification accuracy in the breast and CNS datasets. Furthermore, it is clear from the results that both FFA2 and FFA1 outperformed the CV method in terms of classification accuracy for all datasets. This improvement was mainly due to the improved ability of our proposed method (FFA1 or FFA2) to select the tuning parameter. Moreover, FFA1 slightly improved the classification accuracy compared with FFA2 in some datasets; the improvement in the CNS dataset, for example, was 0.171%. In the other direction, FFA2 showed substantial improvements compared with FFA1, especially in the autism dataset, where FFA2 improved the classification accuracy by 5.535%.
It can be seen from Table 2 that both FFA1 and FFA2 have the best results in terms of sensitivity and specificity. FFA2 had the highest sensitivities, 96.52% and 97.24%, for the ovarian and autism datasets, respectively. On the other hand, FFA1 had the highest sensitivities, 95.12% and 96.87%, for the breast and CNS datasets. This indicates that FFA2 and FFA1 significantly succeeded in identifying cases of cancer with a probability greater than 0.947.

On the other hand, the results for the specificity (SP) represent the probability that our proposed method would identify cases that are actually healthy. In terms of the SP, FFA2 and FFA1 significantly outperformed CV for all datasets. In the breast dataset, for example, FFA2 and FFA1 had the highest probabilities, 0.931 and 0.933, of identifying healthy patients, compared with 0.896 for CV.
Looking at the Matthews correlation coefficient (MCC), the classification performance of FFA2 and FFA1 was comparable, with CV performing the worst. In the autism dataset, the MCC value was 0.979 for FFA2, which was higher than those for FFA1 (MCC = 0.945) and CV (MCC = 0.881). In general, an algorithm with a higher MCC value is considered to be a more predictive classification algorithm.

Further, on the testing datasets, FFA2 and FFA1 achieved the best classification results for the four datasets, whereas CV attained poor classification results. For instance, in the breast dataset, the CA on the test dataset was 93.21% and 93.85% for FFA2 and FFA1, respectively, which was higher than the 85.36% obtained by CV.
Table 2: Experimental results of the methods used (standard errors in parentheses).

Dataset   Method   # selected genes   Training CA     SE              SP              MCC             Testing CA
Ovarian   FFA2     22 (0.077)         97.04 (0.038)   96.52 (0.032)   95.80 (0.035)   0.964 (0.037)   94.83 (0.041)
Ovarian   FFA1     26 (0.082)         96.22 (0.071)   95.92 (0.074)   94.31 (0.072)   0.955 (0.071)   93.07 (0.044)
Ovarian   CV       37 (1.15)          90.35 (0.381)   90.51 (0.387)   90.31 (0.375)   0.901 (0.382)   87.57 (0.408)
Breast    FFA2     25 (0.117)         96.22 (0.121)   94.74 (0.124)   93.18 (0.123)   0.958 (0.121)   93.21 (0.217)
Breast    FFA1     25 (0.141)         96.72 (0.124)   95.12 (0.123)   93.36 (0.124)   0.962 (0.125)   93.85 (0.216)
Breast    CV       51 (1.113)         88.90 (0.355)   88.13 (0.372)   89.66 (0.441)   0.888 (0.361)   85.36 (0.402)
CNS       FFA2     11 (0.005)         98.68 (0.005)   96.54 (0.006)   95.84 (0.007)   0.977 (0.003)   96.22 (0.008)
CNS       FFA1     10 (0.003)         98.84 (0.006)   96.87 (0.006)   95.93 (0.006)   0.981 (0.005)   96.34 (0.007)
CNS       CV       19 (0.092)         90.74 (0.071)   90.11 (0.077)   90.34 (0.073)   0.898 (0.078)   87.11 (0.085)
Autism    FFA2     18 (0.051)         98.82 (0.066)   97.24 (0.071)   95.81 (0.088)   0.979 (0.071)   97.32 (0.069)
Autism    FFA1     21 (0.055)         93.35 (0.068)   93.20 (0.080)   91.37 (0.094)   0.945 (0.074)   92.64 (0.073)
Autism    CV       47 (1.013)         88.64 (0.516)   87.84 (0.538)   84.21 (0.614)   0.881 (0.522)   86.71 (0.537)
According to the AUC criterion, a non-parametric Friedman test was employed to check whether the differences among FFA2, FFA1, and CV were statistically significant. The post hoc Bonferroni test was then computed when the null hypothesis was rejected; this test was computed under different significance levels (0.01, 0.05, and 0.1). Table 3 reports the statistical test results. Based on the obtained results, the null hypothesis was rejected at the 0.05 significance level using the Friedman test statistic; the results thus showed statistically significant differences between the methods used. In addition, FFA2 had the lowest average rank, 3.112, compared with FFA1 and CV. From the Bonferroni test results, it is clear that the average rank of the CV method was higher than the critical values $\alpha_{0.05}$, $\alpha_{0.01}$, and $\alpha_{0.10}$. These results suggest that the CV method was significantly worse than both FFA1 and FFA2 over the four datasets.
Table 3: Friedman and Bonferroni test results for the methods used over the four datasets

Method   Friedman average rank
FFA2     3.112
FFA1     3.174
CV       10.325

Friedman test results: $\chi^2_{\text{Friedman}} = 14.746$, p-value (0.05) = 0.0011
Bonferroni test results: $\alpha_{0.05} = 6.185$, $\alpha_{0.01} = 6.839$, $\alpha_{0.10} = 5.907$
Further, a stability test, which is an indicator of gene selection consistency, based on the Jaccard index was utilized to highlight the performance of our proposed method. Let $D_1$ and $D_2$ be subsets of the selected genes such that $D_1, D_2 \subseteq D$. For a number of solutions $D = \{D_1, \dots, D_r\}$, the stability test is defined as:

$$\text{Stability} = \frac{2}{r(r-1)}\sum_{i=1}^{r-1}\sum_{j=i+1}^{r} I_J(D_i, D_j), \qquad (14)$$

where $I_J(D_i, D_j)$ is the Jaccard index, defined as the size of the intersection between any two groups divided by the size of their union. Mathematically, it is defined as:

$$I_J(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}. \qquad (15)$$
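Eqs. (14) and (15) amount to averaging the Jaccard index over all pairs of selected-gene subsets; a minimal sketch:

```python
from itertools import combinations

def jaccard(d1, d2):
    """Jaccard index of two gene subsets, Eq. (15)."""
    d1, d2 = set(d1), set(d2)
    return len(d1 & d2) / len(d1 | d2)

def stability(subsets):
    """Average pairwise Jaccard index over r selected-gene subsets, Eq. (14)."""
    pairs = list(combinations(subsets, 2))  # r(r-1)/2 pairs, matching the 2/(r(r-1)) factor
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```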
The higher the stability test value, the more stable the gene selection. Figure 3 shows the stability test values on the four datasets for FFA2, FFA1, and CV. As can be seen from Figure 3, both FFA1 and FFA2 display a high rate of stability compared with CV. Further, FFA2 is the most stable gene selection method. This means that FFA2 is more consistent than CV and slightly more consistent than FFA1 in gene selection.
Figure 3: Stability test results of the gene selection consistency for the used methods.
5. Discussion

This study presents an improved version of PSVM with the SCAD penalty. The efficiency of the penalized support vector machine with the SCAD penalty largely depends on choosing the appropriate tuning parameter, $\lambda$. The tuning parameter controls the tradeoff between classification and the number of selected genes. As a result, selecting a suitable value of the tuning parameter is an important part of the fitting.

In this paper, an optimization algorithm, the FFA, is proposed to determine the tuning parameter in PSVM with the SCAD penalty. In this section, the main characteristics of the FFA are highlighted. With this approach, the proposed algorithm can exploit not only the strengths of the PSVM but also those of the SCAD penalty.

Several studies have shown that the FFA can perform superiorly compared with the genetic algorithm (GA) and particle swarm optimization (PSO) [34-36], and that it is applicable to a large number of real optimization problems, because the FFA converges fast, obtains good results on function optimization, and is well suited to combinatorial optimization [22].

Compared with GA and PSO, our proposed algorithm needs fewer iterations to converge. The FFA yielded its best results, on average, when $t_{\max} \le 32$; in contrast, GA and PSO yielded their results when $t_{\max} \le 74$ and $t_{\max} \le 67$, respectively. Another strength of the FFA is the possibility of reducing computational complexity: in general, the complexity of the FFA was significantly lower than those of GA and PSO on all the datasets used.

One interesting observation is that our proposed algorithm is still more powerful than GA and PSO even when the number of genes is high. This indicates that the FFA is more appropriate than the others for gene expression data in cancer classification, although the FFA depends on calculating the distance (Eq. (8)) among the fireflies [37, 38].

As a result, the proposed method outperforms the CV approach on all datasets in terms of classification performance. Considering the comparison results, it can be said that the proposed FFA2 and FFA1 produce better and more effective results, and the approaches are more convenient.

6. Conclusion
The efficiency of PSVM with the SCAD penalty depends on choosing the appropriate tuning parameter involved in the SCAD penalty. This paper presented a new tuning parameter selection method for the penalized support vector machine with the SCAD penalty. A firefly algorithm was proposed for determining the tuning parameter and was compared with the classical CV method. Based on four freely accessible gene expression benchmark datasets, the results show that cancer classification using our proposed algorithm achieves higher classification accuracy with fewer selected genes and yields better results than the CV method.
REFERENCES
[1] Z.Y. Algamal, M.H. Lee, Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification, Comput. Biol. Med., 67 (2015) 136-145.
[2] T. Latkowski, S. Osowski, Data mining for feature selection in gene expression autism data, Expert. Syst. Appl., 42 (2015) 864-872.
[3] S.S. Hameed, R. Hassan, F.F. Muhammad, Selection and classification of gene expression in autism disorder: Use of a combination of statistical filters and a GBPSO-SVM algorithm, PLoS One, 12 (2017) e0187371.
[4] Z.Y. Algamal, M.H. Lee, A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification, Advances in Data Analysis and Classification, (2018).
[5] H. Motieghader, A. Najafi, B. Sadeghi, A. Masoudi-Nejad, A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata, Informatics in Medicine Unlocked, 9 (2017) 246-254.
[6] Z.Y. Algamal, M.H. Lee, Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification, Expert. Syst. Appl., 42 (2015) 9326-9332.
[7] L.-Y. Chuang, C.-S. Yang, K.-C. Wu, C.-H. Yang, Gene selection and classification using Taguchi chaotic binary particle swarm optimization, Expert. Syst. Appl., 38 (2011) 13367-13377.
[8] Y. Liang, C. Liu, X.-Z. Luan, K.-S. Leung, T.-M. Chan, Z.-B. Xu, H. Zhang, Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification, BMC Bioinformatics, 14 (2013) 198-211.
[9] Q. Shen, W.M. Shi, W. Kong, B.X. Ye, A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification, Talanta, 71 (2007) 1679-1683.
[10] Y. Cong, B.-k. Li, X.-g. Yang, Y. Xue, Y.-z. Chen, Y. Zeng, Quantitative structure-activity relationship study of influenza virus neuraminidase A/PR/8/34 (H1N1) inhibitors by genetic algorithm feature selection and support vector regression, Chemom. Intell. Lab. Syst., 127 (2013) 35-42.
[11] J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, Advances in Neural Information Processing Systems, 16 (2004) 49-56.
[12] L. Wang, J. Zhu, H. Zou, Hybrid huberized support vector machines for microarray classification and gene selection, Bioinformatics, 24 (2008) 412-419.
[13] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc., 96 (2001) 1348-1360.
[14] H. Dong, G. Jian, Parameter selection of a support vector machine, based on a chaotic particle swarm optimization algorithm, Cybernetics and Information Technologies, 15 (2015).
[15] P.S. Bradley, O.L. Mangasarian, Feature selection via concave minimization and support vector machines, ICML, 1998, pp. 82-90.
[16] K. Ikeda, N. Murata, Geometrical properties of Nu support vector machines with different norms, Neur. Comput., 17 (2005) 2508-2529.
[17] Z. Liu, S. Lin, M.T. Tan, Sparse support vector machines with Lp penalty for biomarker identification, IEEE Trans. Comput. Bi., 7 (2010) 100-107.
[18] Y. Liu, H. Helen Zhang, C. Park, J. Ahn, Support vector machines with adaptive Lq penalty, Comput. Stat. Data. Anal., 51 (2007) 6380-6394.
[19] H.H. Zhang, J. Ahn, X. Lin, C. Park, Gene selection using support vector machines with non-convex penalty, Bioinformatics, 22 (2006) 88-95.
[20] N. Becker, G. Toedt, P. Lichter, A. Benner, Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data, BMC Bioinformatics, 12 (2011) 138-151.
[21] X.-S. Yang, Multiobjective firefly algorithm for continuous optimization, Engineering with Computers, 29 (2013) 175-184.
[22] I. Fister, I. Fister, X.-S. Yang, J. Brest, A comprehensive review of firefly algorithms, Swarm and Evolutionary Computation, 13 (2013) 34-46.
[23] A. Yelghi, C. Köse, A modified firefly algorithm for global minimum optimization, Appl. Soft. Comput., 62 (2018) 29-44.
[24] S. Karthikeyan, P. Asokan, S. Nickolas, A hybrid discrete firefly algorithm for multi-objective flexible job shop scheduling problem with limited resource constraints, The International Journal of Advanced Manufacturing Technology, 72 (2014) 1567-1579.
[25] K.W. Broman, T.P. Speed, A model selection approach for the identification of quantitative trait loci in experimental crosses, J. Roy. Statist. Soc. Ser. B, 64 (2002) 641-656.
[26] J. Chen, Z. Chen, Extended Bayesian information criteria for model selection with large model spaces, Biometrika, 95 (2008) 759-771.
[27] H. Park, F. Sakaori, S. Konishi, Robust sparse regression and tuning parameter selection via the efficient bootstrap information criteria, Journal of Statistical Computation and Simulation, 84 (2013) 1596-1607.
[28] S. Roberts, G. Nowak, Stabilizing the lasso against cross-validation variability, Computational Statistics & Data Analysis, 70 (2014) 198-211.
[29] J.A. Sabourin, W. Valdar, A.B. Nobel, A permutation approach for selecting the penalty parameter in penalized model selection, Biometrics, 71 (2015) 1185-1194.
[30] Y. Jung, J. Hu, A K-fold averaging cross-validation procedure, J. Nonparametr. Stat., 27 (2015) 167-179.
[31] R.J. Meijer, J.J. Goeman, Efficient approximate k-fold and leave-one-out cross-validation for ridge regression, Biom. J., 55 (2013) 141-155.
[32] Z. Pang, B. Lin, J. Jiang, Regularisation parameter selection via bootstrapping, Australian & New Zealand Journal of Statistics, 58 (2016) 335-356.
[33] NCBI GEO database, http://www.ncbi.nlm.nih.gov/sites/GDSbrowser, 2017.
[34] S. Yu, S. Zhu, Y. Ma, D. Mao, Enhancing firefly algorithm using generalized opposition-based learning, Computing, 97 (2015) 741-754.
[35] L. Zhang, W. Srisukkham, S.C. Neoh, C.P. Lim, D. Pandit, Classifier ensemble reduction using a modified firefly algorithm: An empirical evaluation, Expert. Syst. Appl., 93 (2018) 395-422.
[36] O.S. Qasim, Z.Y. Algamal, Feature selection using particle swarm optimization-based logistic regression model, Chemom. Intell. Lab. Syst., 182 (2018) 41-46.
[37] L. Zhang, L. Shan, J. Wang, Optimal feature selection using distance-based discrete firefly algorithm with mutual information criterion, Neural. Comput. Applic., 28 (2016) 2795-2808.
[38] S.L. Tilahun, J.M.T. Ngnotchouye, N.N. Hamadneh, Continuous versions of firefly algorithm: a review, Artificial Intelligence Review, (2017).
Highlights

• The proposed method has better performance than the CV.
• The classification ability of the proposed method is quite high.
• The proposed method performed remarkably well in the gene selection stability test.