Tuning parameter estimation in SCAD-support vector machine using firefly algorithm with application in gene selection and cancer classification

Computers in Biology and Medicine (2018)
DOI: https://doi.org/10.1016/j.compbiomed.2018.10.034
Received: 1 August 2018; Revised: 28 October 2018; Accepted: 29 October 2018

Niam Abdulmunim Al-Thanoon
Department of Operations Research and Artificial Intelligence, University of Mosul, Mosul, Iraq
E-mail: [email protected]

Omar Saber Qasim
Department of Mathematics, University of Mosul, Mosul, Iraq
E-mail: [email protected]

Zakariya Yahya Algamal*
Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
E-mail: [email protected]
ORCID: 0000-0002-0229-7958

* Corresponding author: Zakariya Yahya Algamal. E-mail: [email protected]; telephone: +964 7701640834.

Abstract

In cancer classification, gene selection is one of the most important bioinformatics-related topics. The selection of genes can be treated as a variable selection problem, which aims to find a small subset of genes that carries the most discriminative information for the classification target. The penalized support vector machine (PSVM) has proved its effectiveness at creating a strong classifier that combines the advantages of the support vector machine and penalization. PSVM with the smoothly clipped absolute deviation (SCAD) penalty is the most widely used variant. However, the efficiency of PSVM with SCAD depends on choosing the appropriate tuning parameter involved in the SCAD penalty. In this paper, a firefly algorithm, which is a metaheuristic continuous algorithm, is proposed to determine the tuning parameter in PSVM with the SCAD penalty. Our proposed algorithm can efficiently help to find the most relevant genes with high classification performance. The experimental results on four benchmark gene expression datasets show the superior performance of the proposed algorithm in terms of classification accuracy and the number of selected genes compared with competing methods.

Keywords: SCAD; gene selection; cancer classification; penalized support vector machine; firefly algorithm.

1. Introduction

With the development of DNA microarray technologies in biology, the resulting datasets naturally have a small sample size with a high dimension: the sample size is usually in the range of hundreds, while the number of genes is in the tens of thousands [1-4]. This makes classical classification methods difficult to apply for correct classification. Cancer classification datasets often contain a large number of irrelevant or redundant genes that may significantly degrade the classifier accuracy and increase the time required for computation [5]. Identifying an optimal gene subset is a very complex task. Gene selection, also known as dimensionality reduction, is the process of selecting an optimal subset of relevant genes that can improve the computational efficiency of the classification method and lower the classification error rate [6, 7]. Consequently, several gene selection methods have been proposed and studied in the literature. These methods can be divided into three broad categories: filter, wrapper, and embedded methods [6].

Filter methods are among the most popular gene selection methods; they score each gene according to a specific information criterion. These methods operate independently of the classification method. In the wrapper methods, on the other hand, the gene selection process is driven by the performance of a classification algorithm, which it seeks to optimize. In embedded methods, the gene selection process is incorporated into the classification method itself, so that gene selection and classification are performed simultaneously [8].

The support vector machine (SVM) has attracted much attention in many scientific fields in recent years because of its theoretical and practical advantages, which result in

its improved performance in classification [9, 10]. Despite the excellent characteristics of SVM, there are still several drawbacks, including the selection of genes. In other words, SVM cannot perform gene selection [11]. The penalized support vector machine

(PSVM), which is one of the most effective embedded methods, is preferable to the SVM because PSVM combines the standard SVM with a penalty to simultaneously perform both gene selection and classification [12]. With different penalties, numerous PSVMs can be applied; among them are the L1-norm, known as the least absolute shrinkage and selection operator (lasso) [11], and the smoothly clipped absolute deviation (SCAD)

penalty [13]. However, the efficiency of PSVM with SCAD penalty depends on choosing the appropriate tuning parameter involved in the SCAD penalty.

In this paper, a firefly optimization method, which is a metaheuristic continuous algorithm, is proposed to determine the tuning parameter in PSVM with the SCAD penalty. The proposed method efficiently helps to find the most significant genes for constructing cancer classifications with high classification performance. The experimental results show the favorable performance of the proposed method when the number of genes is high and the sample size is small.

2. Methods

2.1 Penalized support vector machine

The support vector machine is an excellent, efficient, effective, and powerful classification method for binary classification problems [14]. In cancer classification, the gene matrix can be described as a matrix $X = (x_{ij})_{n \times d}$, where each column represents a gene and each row represents a sample (patient). The numerical value of $x_{ij}$ denotes the value of a specific gene $j$ ($j = 1, \dots, d$) in a specific sample $i$ ($i = 1, \dots, n$). For a binary classification problem, a typical training dataset is $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$ represents the gene vector of the $i$th sample, and $y_i \in \{-1, +1\}$ for $i = 1, \dots, n$, where $y_i = +1$ indicates that the $i$th sample is in class 1 and $y_i = -1$ indicates that it is in class 2. An SVM generates a real-valued function $\varphi(X)$ defining a hyperplane, with weight vector $\mathbf{w}$, that maximizes the margin between the data to be separated.

Depending on the Lagrangian, solving this problem can be written as a quadratic dual optimization problem:

$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^{T}\mathbf{x}_j \;-\; \sum_{i=1}^{n}\alpha_i \qquad (1)$$

$$\text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \ge 0, \; i = 1, 2, \dots, n,$$

where $\boldsymbol{\alpha}$ is a vector of Lagrange multipliers and each $\alpha_i$ corresponds to a training observation $(\mathbf{x}_i, y_i)$. Equation (1) is used for linearly separable training observations.


However, to extend the SVM to linearly non-separable training observations, each observation $(\mathbf{x}_i, y_i)$ is associated with a slack variable $\zeta_i \ge 0$. Then, the dual problem becomes:

$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \;-\; \sum_{i=1}^{n}\alpha_i \qquad (2)$$

$$\text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, 2, \dots, n,$$

where $C$ is a parameter that controls the tradeoff between the maximum margin and the minimum classification error, and $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j)$ is the kernel function.
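For concreteness, the dual problem in Eq. (2) is what off-the-shelf SVM solvers optimize internally. The following is a minimal sketch, assuming scikit-learn as tooling (the paper does not prescribe an implementation), showing how $C$ and the kernel of Eq. (2) appear in practice on synthetic data:

```python
# Minimal sketch: the soft-margin SVM of Eq. (2) solved with scikit-learn.
# The data here are synthetic; C and the kernel map directly onto Eq. (2).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # 100 samples, 20 "genes"
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # labels in {-1, +1}

clf = SVC(C=1.0, kernel="linear")              # C bounds the duals: 0 <= alpha_i <= C
clf.fit(X, y)

# clf.dual_coef_ holds y_i * alpha_i for the support vectors of Eq. (2).
print(clf.support_.size, "support vectors; train accuracy:", clf.score(X, y))
```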

Although the SVM has proven useful in binary classification, it cannot perform feature selection because it uses the L2-norm, $\|\mathbf{w}\|_2^2$. Typically, any classification problem includes a number of features, many of which may be noisy or redundant, leading to degradation of the performance of the classification algorithm. Therefore, reducing dimensions is an essential step, which can be achieved through feature selection strategies.

Bradley and Mangasarian [15] and Zhu, Rosset, Hastie and Tibshirani [11] proved that the SVM optimization problem is equivalent to a penalization problem of the form:

$$\min_{\mathbf{w},b} \; \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]_{+} + \operatorname{Pen}_{\lambda}(\mathbf{w}), \qquad (3)$$

where $[1 - y_i f(\mathbf{x}_i)]_{+} = \max(1 - y_i f(\mathbf{x}_i), 0)$ represents the hinge loss term and $\operatorname{Pen}_{\lambda}(\mathbf{w})$ represents the penalty term.

Several penalties have been proposed, including the L1-norm [11, 15] and the Lq-norm with q < 1 [16-18]. Furthermore, Zhang, Ahn, Lin and Park [19] proposed using the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li [13] with the SVM. In addition, Wang, Zhu and Zou [12] proposed a hybrid huberized SVM using the elastic net penalty, whereas Becker, Toedt, Lichter and Benner [20] proposed a combination of ridge and SCAD penalties with the SVM.

The L1-norm penalty, proposed by Bradley and Mangasarian [15] and Zhu, Rosset, Hastie and Tibshirani [11], is one of the most popular penalty functions because the SVM

with the L1-norm can automatically select genes by shrinking the hyperplane coefficients to zero. The SVM-L1 is defined as:

$$\min_{\mathbf{w},b} \; \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]_{+} + \lambda\sum_{j=1}^{d}|w_j|, \qquad (4)$$

where $\lambda$ is a positive tuning parameter, which controls the amount of shrinkage, and $f(\mathbf{x}_i)$ is the function of the hyperplane. Equation (4) is a convex optimization problem and can be solved by the method of Lagrange multipliers.
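To illustrate how the L1-penalized SVM of Eq. (4) performs gene selection by zeroing hyperplane coefficients, here is a small sketch using scikit-learn. This is an assumption about tooling (the paper names no library), and LinearSVC pairs the L1 penalty with the squared hinge loss, so it approximates rather than exactly solves Eq. (4):

```python
# Approximate L1-penalized SVM (Eq. (4)) with scikit-learn.
# Note: LinearSVC with penalty="l1" requires the squared hinge loss and the
# primal solver, so this is a close cousin of Eq. (4), not an exact match.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 500))                  # n = 80 samples, d = 500 genes
y = np.where(X[:, :5].sum(axis=1) > 0, 1, -1)   # only 5 genes are informative

# Smaller C means stronger shrinkage (C plays, roughly, the role of 1/lambda).
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.05,
                max_iter=5000)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])         # genes with nonzero weights
print("selected genes:", selected.size, "of", X.shape[1])
```

The inverse relationship between C and $\lambda$ means stronger shrinkage, and hence fewer selected genes, as C decreases.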

The SCAD penalty function has better theoretical characteristics than the L1-norm. Zhang, Ahn, Lin and Park [19] suggested hybridizing the SVM with the non-convex SCAD penalty for feature selection and proved that their method performs better than the L1-norm SVM. The penalization term of SCAD in Eq. (3) has the following form:

$$\operatorname{Pen}_{\lambda}(\mathbf{w}) = \sum_{j=1}^{d} p_{SCAD(\lambda)}(w_j), \qquad (5)$$

where

$$p_{SCAD(\lambda)}(w_j) = \begin{cases} \lambda |w_j| & \text{if } |w_j| \le \lambda, \\ -\dfrac{|w_j|^2 - 2a\lambda |w_j| + \lambda^2}{2(a-1)} & \text{if } \lambda < |w_j| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |w_j| > a\lambda, \end{cases} \qquad (6)$$

where $a > 2$ [13, 19]. The two tuning parameters, $\lambda$ and $a$, play an important role in determining an accurate classification. Thus, Eq. (4) with the penalty term of Eq. (5) can be written as:

$$\min_{\mathbf{w},b} \; \frac{1}{n}\sum_{i=1}^{n}\left[1 - y_i f(\mathbf{x}_i)\right]_{+} + \sum_{j=1}^{d} p_{SCAD(\lambda)}(w_j). \qquad (7)$$

Compared to the L1-norm penalty, the SCAD penalty applies a constant penalty to large coefficients. This decreases the estimation bias, in contrast to the L1-norm penalty, which increases linearly as the coefficient increases. The SCAD penalty also yields sparse solutions by thresholding small estimates to zero and provides approximately unbiased estimates, giving a model that is continuous in the data, which leads to consistently selecting the most important features [13, 19].
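The piecewise form of Eq. (6) translates directly into code. The following is a small self-contained sketch, assuming NumPy as tooling:

```python
# The SCAD penalty of Eq. (6), vectorized over the hyperplane weights w.
import numpy as np

def scad_penalty(w, lam, a=3.7):
    """Elementwise SCAD penalty p_{SCAD(lambda)}(w_j) from Eq. (6), a > 2."""
    absw = np.abs(w)
    linear = lam * absw                                  # |w_j| <= lambda
    quad = -(absw**2 - 2 * a * lam * absw + lam**2) / (2 * (a - 1))
    const = (a + 1) * lam**2 / 2                         # |w_j| > a * lambda
    return np.where(absw <= lam, linear,
                    np.where(absw <= a * lam, quad, const))

w = np.array([0.05, 0.5, 2.0, 10.0])
print(scad_penalty(w, lam=0.5))
```

The printout illustrates the defining behavior: small weights are penalized linearly, as under the lasso, while all weights beyond $a\lambda$ receive the same constant penalty $(a+1)\lambda^2/2$, which is the source of the reduced bias.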

2.2 Firefly algorithm

Swarm intelligence algorithms have been widely studied and successfully applied to a variety of complex optimization problems. The firefly algorithm (FFA), developed by Yang [21], is one of the most recent swarm intelligence methods and one of the most powerful optimization algorithms.

The firefly algorithm has shown good performance and effectiveness in solving various optimization problems [22]. It was inspired by the social behavior of fireflies, namely their flashing lights and flash attractiveness. The firefly flash acts as a signal system used to attract other fireflies, and the way fireflies interact through these flashing lights can be modeled [23].

In the implementation of the FFA, each member of the swarm is a firefly, and each firefly represents a candidate solution in the dimensional search space. Brighter locations are assumed to represent better solutions, and the algorithm tries to help fireflies find these locations in the search space. The attractiveness of a firefly is determined by its brightness, which in turn is associated with the objective function of the given optimization problem. The brightness decreases as the distance between a firefly and the target location increases. The attraction between fireflies is based on their differences in brightness; this means that a less bright firefly moves toward a brighter firefly owing to attraction. If none of the fireflies is brighter than the others, an individual firefly moves randomly. During the search process, because of the attractions among fireflies, fireflies move toward new positions and thus find new candidate solutions.

Mathematically, assume that there are $n_f$ fireflies in the swarm (the population size), randomly distributed in the $D$-dimensional search space. During the evolutionary process, each firefly has a position vector denoted as $\mathbf{x}_i = \{x_{i1}, x_{i2}, \dots, x_{id}\}$, where $i = 1, 2, \dots, n_f$ and $d \in D$ indexes the dimensions of the solutions.

The distance between any two fireflies $i$ and $j$, at positions $\mathbf{x}_i$ and $\mathbf{x}_j$ in the search space, is the Cartesian distance, which can be calculated using the following equation:

$$r_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\| = \sqrt{\sum_{d=1}^{D}\left(x_{id} - x_{jd}\right)^2}. \qquad (8)$$

The brightness of firefly $i$ at its current position $\mathbf{x}$ can be defined by the objective function value as follows:

$$I(\mathbf{x}_i) = f(\mathbf{x}_i). \qquad (9)$$

The light intensity of a firefly is directly proportional to its brightness and is related to the objective values. When two fireflies are compared, both are attracted, but the firefly with the lower light intensity is attracted toward the firefly with the higher light intensity. The light intensity of a firefly depends on the intensity $I_0$ of the light it emits and the distance $r_{ij}$ between the two fireflies. The light intensity $I(r)$ can be described by a monotonically decreasing function of $r_{ij}$, formulated as follows:

$$I(r) = I_0 e^{-\gamma r^2}, \qquad (10)$$

where γ is used to control the decrease in light intensity or brightness and can be taken as a constant.


Each firefly has its distinctive attractiveness, which indicates how powerfully it attracts other members of the swarm. Attractiveness, $\beta$, is relative, which means that it must be judged by other fireflies and therefore varies with the distance $r_{ij}$. The attractiveness must also be allowed to vary with differing degrees of absorption [24]. Thus, the main form of the attractiveness of a firefly is defined by the following equation:

$$\beta(r) = \beta_0 e^{-\gamma r^2}, \qquad (11)$$

where $\beta(r)$ represents the attractiveness function of a firefly at a distance $r$, and $\beta_0$ denotes the initial attractiveness at distance $r = 0$, which can be taken as constant; in implementations, $\beta_0$ is usually set to 1 for most problems.

The fireflies try to move to the best positions in the search space; that is, a firefly with lower light intensity is attracted by a brighter firefly. The location update applies to each pair of fireflies $i$ and $j$: each firefly $\mathbf{x}_i$ is compared to all other fireflies $\mathbf{x}_j$, $j = 1, 2, \dots, n_f$, and if firefly $j$ at position $\mathbf{x}_j$ is brighter than firefly $i$, then $\mathbf{x}_i$ moves towards $\mathbf{x}_j$ by attraction. The movement is defined as:

$$x_{id}^{(t+1)} = x_{id}^{(t)} + \beta_0 e^{-\gamma r_{ij}^2}\left(x_{jd}^{(t)} - x_{id}^{(t)}\right) + \alpha_t\, \varepsilon_{id}^{(t)}, \qquad (12)$$

where $\alpha_t$ is the randomization parameter, $\gamma$ is an absorption coefficient that controls the decrease in light intensity, and $\varepsilon_{id}^{(t)} = (\text{rand} - 0.5)$, where rand is a random number drawn from the uniform distribution on [0, 1]. The flow diagram of the FFA is shown in Figure 1.
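Equations (8)-(12) assemble into a short algorithm. The following is a minimal sketch of the FFA for continuous minimization, with a lower objective value treated as higher brightness; it is an illustrative implementation under our own default parameters, not the authors' code:

```python
# A compact firefly algorithm for continuous minimization, following
# Eqs. (8)-(12); parameter names (beta0, gamma, alpha) match the text.
import numpy as np

def firefly_minimize(f, bounds, n_fireflies=25, t_max=100,
                     beta0=1.0, gamma=1.0, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T          # (low, high) per dimension
    dim = lo.size
    x = rng.uniform(lo, hi, size=(n_fireflies, dim))    # random initial positions
    intensity = np.array([f(p) for p in x])             # Eq. (9): brightness

    for _ in range(t_max):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if intensity[j] < intensity[i]:         # firefly j is "brighter"
                    r2 = np.sum((x[i] - x[j]) ** 2)     # squared distance, Eq. (8)
                    beta = beta0 * np.exp(-gamma * r2)  # attractiveness, Eq. (11)
                    eps = rng.random(dim) - 0.5         # randomization of Eq. (12)
                    x[i] += beta * (x[j] - x[i]) + alpha * eps
                    x[i] = np.clip(x[i], lo, hi)
                    intensity[i] = f(x[i])
    best = np.argmin(intensity)
    return x[best], intensity[best]

# Example: minimize the sphere function in 2-D.
xbest, fbest = firefly_minimize(lambda p: np.sum(p ** 2), [(-5, 5), (-5, 5)])
print(xbest, fbest)
```

Note the pairwise comparison loop: it is this step that ties the FFA's computational cost to the distance calculation of Eq. (8).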

Figure 1: The flow diagram of the FFA.


2.3 The proposed method

The efficiency of the penalized support vector machine with the SCAD penalty largely depends on choosing the appropriate tuning parameter, $\lambda$. The tuning parameter controls the tradeoff between classification and the number of selected genes. As a result, selecting a suitable value of the tuning parameter is an important part of the fitting [25-27].

In addition, the SCAD penalty in Equation (6) depends on the quantity $a$. As suggested by Fan and Li [13], the value of this quantity should satisfy $a \ge 2$; they used $a = 3.7$. In the penalization, $\lambda$ controls the tradeoff between classification and the number of selected genes, so it is of crucial importance to select a suitable value of $\lambda$. If $\lambda$ is small, the data are overfitted because a large number of genes are not removed; when $\lambda$ is large, a large number of genes are removed.


In the literature, the most widely used method for selecting $\lambda$ is cross-validation (CV), which is a data-driven approach. However, it has been pointed out that CV usually identifies too many irrelevant genes when the number of genes is large [25, 26] and can be very time consuming [27]. Consequently, several modifications of the CV approach to estimating $\lambda$ have been suggested [28-32].
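For reference, the CV baseline used later in the comparisons amounts to a grid search over $\lambda$ with k-fold accuracy as the score. The sketch below makes this explicit; `fit_scad_svm` is a hypothetical stand-in for any PSVM-SCAD solver and is not an API from the paper or from a named library:

```python
# Sketch of the classical CV baseline for choosing lambda.
import numpy as np
from sklearn.model_selection import KFold

def cv_select_lambda(fit_scad_svm, X, y, lambdas, k=10, seed=0):
    """Return the lambda with the highest mean k-fold CV accuracy."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    mean_acc = []
    for lam in lambdas:
        scores = []
        for train_idx, test_idx in kf.split(X):
            # fit_scad_svm is a hypothetical PSVM-SCAD fitting routine.
            model = fit_scad_svm(X[train_idx], y[train_idx], lam=lam, a=3.7)
            scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
        mean_acc.append(np.mean(scores))
    return lambdas[int(np.argmax(mean_acc))]
```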


Due to the drawbacks of the CV approach, in this paper an FFA is proposed to determine the tuning parameter in PSVM with the SCAD penalty. Additionally, the term $a$ will also be determined using the FFA; in other words, the FFA is used to find the values of $\lambda$ and $a$ simultaneously. The proposed method will efficiently help to find the most significant genes related to cancer classification with high classification performance. The parameter configurations for our proposed method were as follows.

(1) The number of fireflies was $n_f = 50$, with $\beta_0 = 1$, $\gamma = 0.2$, $\alpha = 0.1$, and a maximum number of iterations $t_{max} = 100$.

(2) Two positions were set up for each firefly. The first position represents the tuning parameter, $\lambda$, which was randomly generated from a uniform distribution between 0 and 100. The second position represents the value of $a$, which was randomly generated from a uniform distribution between 2 and 3.7, as suggested by Fan and Li [13]. The positions of the firefly are depicted in Figure 2.


(3) The fitness function is defined as

$$\text{fitness} = 0.8 \times CA + 0.2 \times \left(\frac{g - \tilde{g}}{g}\right), \qquad (13)$$

where $CA$ is the classification accuracy obtained, $g$ represents the number of genes in the dataset, and $\tilde{g}$ represents the number of selected genes. The fitness function was calculated for all fireflies (a code sketch of this fitness evaluation is given after Figure 2).

(4) The positions of the fireflies were updated using Eq. (12).


(5) Steps 3 and 4 were repeated until $t_{max}$ was reached.

Figure 2: Representation of a firefly in swarm
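As referenced in step (3), the fitness of Eq. (13) scores a firefly's two-dimensional position $(\lambda, a)$. A minimal sketch follows; `fit_scad_svm` is the same hypothetical PSVM-SCAD solver as in the CV sketch above, and only the 0.8/0.2 weighting and the position ranges come from the text:

```python
# Sketch of the fitness of Eq. (13) for a firefly position (lambda, a).
import numpy as np

def fitness(position, fit_scad_svm, X, y):
    lam, a = position                     # position 1: lambda in (0, 100];
                                          # position 2: a in (2, 3.7]
    model = fit_scad_svm(X, y, lam=lam, a=a)   # hypothetical solver
    ca = np.mean(model.predict(X) == y)   # classification accuracy CA
    g = X.shape[1]                        # total number of genes
    g_sel = np.count_nonzero(model.coef_) # number of selected genes (g-tilde)
    return 0.8 * ca + 0.2 * (g - g_sel) / g
```

Maximizing this fitness with the FFA amounts to minimizing its negative, so the earlier `firefly_minimize` sketch can be reused unchanged over the box $[0, 100] \times [2, 3.7]$.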


2.4 Evaluation criteria

The classification performance of the methods used was measured by classification accuracy (CA), sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC), and area under the curve (AUC). These criteria are defined as:

$$CA = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%$$

$$SE = \frac{TP}{TP + FN} \times 100\%$$

$$SP = \frac{TN}{FP + TN} \times 100\%$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix, respectively. The higher the values of these evaluation criteria, the better the classification performance.
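These definitions map one-to-one onto code; a small self-contained sketch:

```python
# The evaluation criteria computed from confusion-matrix counts.
import math

def evaluation_criteria(tp, tn, fp, fn):
    ca = (tp + tn) / (tp + fp + fn + tn) * 100          # classification accuracy (%)
    se = tp / (tp + fn) * 100                           # sensitivity (%)
    sp = tn / (fp + tn) * 100                           # specificity (%)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0 # Matthews corr. coefficient
    return ca, se, sp, mcc

print(evaluation_criteria(tp=45, tn=40, fp=5, fn=7))
```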

3. Datasets

Four benchmark gene expression datasets with binary classification and different numbers of genes and sample sizes, namely ovarian, breast, CNS (central nervous system), and autism disorder, were tested. These datasets are publicly available and were downloaded from the GEO (NCBI) repository [33]. The main characteristics of the four datasets are summarized in Table 1.

Table 1: The characteristics of the four datasets used

Dataset   # of samples (n)   # of genes (g)   Class
Ovarian   253                15154            91 normal / 162 ovarian cancer
Breast    97                 24481            51 healthy / 46 unhealthy
CNS       60                 7129             21 survivors / 39 failures
Autism    146                54613            64 healthy / 82 autism

4. Results

With the aim of correctly assessing the performance of our proposed method, comparative experiments were carried out against the original CV method of estimating the tuning parameter. The methods used were as follows (the last two are our proposed algorithm):

(1) CV: we set $a = 3.7$ and used CV to estimate $\lambda$.
(2) FFA1: we set $a = 3.7$ and used the FFA to estimate $\lambda$.
(3) FFA2: the FFA was used to estimate both $\lambda$ and $a$ simultaneously.


In our experiments, 10-fold CV was used, and the range of the tuning parameter for the CV method was fixed between 0 and 100. In addition, the linear kernel function was employed. To obtain reliable classification performance, for each dataset 70% of the samples were used as a training set and the remaining 30% as a test set. This partition was repeated 20 times, and the averaged evaluation criteria are reported in Table 2; the numbers in parentheses are standard errors.
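The evaluation protocol is easy to misread, so here is a compact sketch of it. The `evaluate` argument is a hypothetical callback that fits the tuned model on the training split and returns one criterion on the test split, and stratified splitting is our assumption rather than a stated detail of the paper:

```python
# Sketch of the protocol: 20 random 70/30 train/test partitions, averaged.
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_splits(evaluate, X, y, n_repeats=20, test_size=0.30):
    scores = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=r)
        scores.append(evaluate(X_tr, y_tr, X_te, y_te))  # hypothetical callback
    # Mean criterion and its standard error over the 20 repetitions.
    return np.mean(scores), np.std(scores) / np.sqrt(n_repeats)
```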

As can be seen from Table 2, both FFA2 and FFA1 selected fewer genes than CV for all datasets. In the ovarian dataset, for instance, FFA2 and FFA1 selected 22 and 26 genes, respectively, compared with 37 genes for the CV method. Compared to FFA1, FFA2 shows comparable results; in terms of gene selection, FFA2 performs slightly better than FFA1 for the ovarian and autism datasets. Importantly, FFA2 had the potential to select fewer genes than the other two methods, indicating that most of the genes additionally selected by those methods were probably not highly relevant to the classification study.

In terms of classification accuracy, FFA2 achieved maximum accuracies of 97.04% and 98.82% for the ovarian and autism datasets, respectively. In contrast, FFA1 showed the best classification accuracy on the breast and CNS datasets. Furthermore, it is clear from the results that both FFA2 and FFA1 outperformed the CV method in terms of classification accuracy for all datasets. This improvement in classification accuracy was mainly due to the improved ability of our proposed method (FFA1 or FFA2) to select the tuning parameter. Moreover, FFA1 slightly improved the classification accuracy compared with FFA2; the improvement in the CNS dataset, for example, was 0.171%.


Conversely, FFA2 showed substantial improvements compared with FFA1, especially in the autism dataset, where FFA2 improved the classification accuracy by 5.535%.

It can be seen from Table 2 that both FFA1 and FFA2 achieved the best results in terms of sensitivity and specificity. FFA2 had the highest sensitivities, 96.52% and 97.25%, for the ovarian and autism datasets, respectively. On the other hand, FFA1 had the highest sensitivities, 94.74% and 96.54%, for the breast and CNS datasets. This indicates that FFA2 and FFA1 succeeded in identifying cases of cancer with a probability greater than 0.947.

The results for specificity (SP), on the other hand, represent the probability that our proposed method identifies cases that are actually healthy. In terms of SP, FFA2 and FFA1 significantly outperformed CV for all datasets. In the breast dataset, for example, FFA2 and FFA1 had the highest probabilities, 0.931 and 0.933, of identifying healthy patients, compared with 0.896 for CV.

Looking at the Matthews correlation coefficient (MCC), the classification performance of FFA2 and FFA1 was comparable, with CV performing the worst. In the CNS dataset, the MCC value was 0.979 for FFA2, which was higher than that for FFA1 (MCC = 0.945) and CV (MCC = 0.881). In general, an algorithm with a higher Matthews correlation coefficient is considered to be a more predictive classification algorithm.


Further, on the testing datasets, FFA2 and FFA1 achieved the best classification results for the four datasets, whereas CV attained poor classification results. For instance, in the breast dataset, the CA on the test set was 93.21% and 93.85% for FFA2 and FFA1, respectively, which was higher than the 85.36% obtained by CV.


Table 2: Experimental results of the methods used (standard errors in parentheses).

                          Training dataset                                            Testing dataset  # selected
Dataset  Method  CA             SE             SP             MCC            CA               genes
Ovarian  FFA2    97.04 (0.038)  96.52 (0.032)  95.80 (0.035)  0.964 (0.037)  94.83 (0.041)    22 (0.077)
         FFA1    96.22 (0.071)  95.92 (0.074)  94.31 (0.072)  0.955 (0.071)  93.07 (0.044)    26 (0.082)
         CV      90.35 (0.381)  90.51 (0.387)  90.31 (0.375)  0.901 (0.382)  87.57 (0.408)    37 (1.15)
Breast   FFA2    96.22 (0.121)  94.74 (0.124)  93.18 (0.123)  0.958 (0.121)  93.21 (0.217)    25 (0.117)
         FFA1    96.72 (0.124)  95.12 (0.123)  93.36 (0.124)  0.962 (0.125)  93.85 (0.216)    25 (0.141)
         CV      88.90 (0.355)  88.13 (0.372)  89.66 (0.441)  0.888 (0.361)  85.36 (0.402)    51 (1.113)
CNS      FFA2    98.68 (0.005)  96.54 (0.006)  95.84 (0.007)  0.977 (0.003)  96.22 (0.008)    11 (0.005)
         FFA1    98.84 (0.006)  96.87 (0.006)  95.93 (0.006)  0.981 (0.005)  96.34 (0.007)    10 (0.003)
         CV      90.74 (0.071)  90.11 (0.077)  90.34 (0.073)  0.898 (0.078)  87.11 (0.085)    19 (0.092)
Autism   FFA2    98.82 (0.066)  97.24 (0.071)  97.32 (0.069)  0.979 (0.071)  95.81 (0.088)    18 (0.051)
         FFA1    93.35 (0.068)  93.20 (0.080)  92.64 (0.073)  0.945 (0.074)  91.37 (0.094)    21 (0.055)
         CV      88.64 (0.516)  87.84 (0.538)  86.71 (0.537)  0.881 (0.522)  84.21 (0.614)    47 (1.013)

According to the AUC criterion, a non-parametric Friedman test was employed to check whether the differences among FFA2, FFA1, and CV were statistically significant. The post hoc Bonferroni test was then computed when the null hypothesis was rejected; this test was computed under different critical values (0.01, 0.05, and 0.1). Table 3 reports the statistical test results. Based on the obtained results, the null hypothesis was rejected at the 0.05 significance level using the Friedman test statistic; thus, the results show statistically significant differences between the methods used. In addition, FFA2 had the lowest average rank, 3.112, compared with FFA1 and CV. From the Bonferroni test results, it is clear that the average rank of the CV method was higher than $\alpha_{0.05}$, $\alpha_{0.01}$, and $\alpha_{0.10}$. These results suggest that the CV method was significantly worse than both FFA1 and FFA2 over the four datasets.

Table 3: Friedman and Bonferroni test results for the methods used over the four datasets

Method   Friedman average rank
FFA2     3.112
FFA1     3.174
CV       10.325

Friedman test results: $\chi^2_{Friedman} = 14.746$, p-value = 0.0011
Bonferroni test results: $\alpha_{0.05} = 6.185$, $\alpha_{0.01} = 6.839$, $\alpha_{0.10} = 5.907$
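The Friedman step can be reproduced with SciPy; a minimal sketch with placeholder AUC values (the paper's per-dataset AUCs are not listed, so these numbers are illustrative only):

```python
# Sketch of the significance test with SciPy's Friedman test.
from scipy.stats import friedmanchisquare

auc_ffa2 = [0.97, 0.96, 0.98, 0.98]   # placeholder AUCs over the four datasets
auc_ffa1 = [0.96, 0.97, 0.98, 0.94]
auc_cv   = [0.90, 0.88, 0.90, 0.88]

stat, p = friedmanchisquare(auc_ffa2, auc_ffa1, auc_cv)
print(f"Friedman chi-square = {stat:.3f}, p-value = {p:.4f}")
# Post hoc pairwise comparisons (e.g., Bonferroni-corrected) would follow
# only when this null hypothesis is rejected.
```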


Further, a stability test based on the Jaccard index, which is an indicator of gene selection consistency, was utilized to highlight the performance of our proposed method.


Let $D_1$ and $D_2$ be subsets of the selected genes such that $D_1, D_2 \subseteq D$. For a set of solutions $\mathcal{D} = \{D_1, \dots, D_r\}$, the stability test is defined as:

$$\text{Stability} = \frac{2}{r(r-1)} \sum_{i=1}^{r-1}\sum_{j=i+1}^{r} I_J(D_i, D_j), \qquad (14)$$

where $I_J(D_i, D_j)$ is the Jaccard index, which is defined as the size of the intersection between any two groups divided by the size of their union. Mathematically, it is defined as:

$$I_J(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}. \qquad (15)$$
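Equations (14)-(15) amount to the average pairwise Jaccard index over the gene subsets selected in r runs; a short sketch:

```python
# The stability test of Eqs. (14)-(15) over selected-gene subsets.
from itertools import combinations

def jaccard(d1, d2):
    d1, d2 = set(d1), set(d2)
    return len(d1 & d2) / len(d1 | d2) if d1 | d2 else 1.0

def stability(subsets):
    """Mean Jaccard index over all pairs of subsets; equals Eq. (14),
    since 2 / (r * (r - 1)) is one over the number of pairs."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

runs = [{1, 2, 3, 7}, {1, 2, 3}, {1, 2, 5, 7}]   # genes selected in three runs
print(round(stability(runs), 3))
```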


The higher the stability test value, the more stable the gene selection. Figure 3 shows the stability test values on the four datasets for FFA2, FFA1, and CV. As can be seen from Figure 3, both FFA1 and FFA2 display a high rate of stability compared with CV. Further, FFA2 is the most stable gene selection method; that is, FFA2 is more consistent than CV and slightly more consistent than FFA1 in gene selection.

Figure 3: Stability test results of the gene selection consistency for the methods used.

5. Discussion

The study presents an improved version of PSVM with the SCAD penalty. The efficiency of the penalized support vector machine with the SCAD penalty largely depends on choosing the appropriate tuning parameter, $\lambda$, which controls the tradeoff between classification and the number of selected genes. As a result, selecting a suitable value of the tuning parameter is an important part of the fitting.

In this paper, an optimization algorithm, the FFA, is proposed to determine the tuning parameter in PSVM with the SCAD penalty. In this section, the main characteristics of the FFA are highlighted. With this approach, the proposed algorithm could exploit not only the strengths of PSVM but also those of the SCAD penalty.

Several studies have shown that the FFA can perform superiorly compared with the genetic algorithm (GA) and particle swarm optimization (PSO) [34-36], and it is applicable to a large number of real optimization problems because the FFA converges quickly, obtains good results in function optimization, and is well suited to combinatorial optimization [22].


Compared with GA and PSO, our proposed algorithm requires fewer iterations to converge. The FFA yielded its best results, on average, when $t_{max} \le 32$; by contrast, GA and PSO yielded their results when $t_{max} \le 74$ and $t_{max} \le 67$, respectively. Another strength of the FFA is the possibility of reducing computational complexity: in general, the complexity of the FFA was significantly lower than those of GA and PSO on all datasets used.

than GA and PSO even when the number of genes is high. It indicates that FFA is more appropriate than others for gene expression data in cancer classification although FFA depends on calculating the distance (Eq. 10) among the fireflies [37, 38].

23

ACCEPTED MANUSCRIPT

As a result, the proposed method outperforms the CV approach on all datasets in terms of classification performance. By considering the comparison results of the our

effective results, and the approaches become more convenient.

SC

6. Conclusion

RI PT

proposed approach, it can be said that the proposed FFA2 and FFA1 produce better and

The efficiency of PSVM with SCAD penalty depends on choosing the appropriate

M AN U

tuning parameter involved in the SCAD penalty. This paper presented a new tuning parameter selection method for the penalized support vector machine with SCAD penalty. A firefly algorithm was proposed for determining the tuning parameter and was compared with the classical CV method. Based on four freely accessible gene expression benchmark datasets, the results show that cancer classification using our proposed

TE D

algorithm has higher classification accuracy with fewer selected genes, and yields better

AC C

EP

results than the CV method.

REFERENCES

[1] Z.Y. Algamal, M.H. Lee, Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification, Comput. Biol. Med., 67 (2015) 136-145.
[2] T. Latkowski, S. Osowski, Data mining for feature selection in gene expression autism data, Expert Syst. Appl., 42 (2015) 864-872.
[3] S.S. Hameed, R. Hassan, F.F. Muhammad, Selection and classification of gene expression in autism disorder: Use of a combination of statistical filters and a GBPSO-SVM algorithm, PLoS One, 12 (2017) e0187371.
[4] Z.Y. Algamal, M.H. Lee, A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification, Advances in Data Analysis and Classification, (2018).
[5] H. Motieghader, A. Najafi, B. Sadeghi, A. Masoudi-Nejad, A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata, Informatics in Medicine Unlocked, 9 (2017) 246-254.
[6] Z.Y. Algamal, M.H. Lee, Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification, Expert Syst. Appl., 42 (2015) 9326-9332.
[7] L.-Y. Chuang, C.-S. Yang, K.-C. Wu, C.-H. Yang, Gene selection and classification using Taguchi chaotic binary particle swarm optimization, Expert Syst. Appl., 38 (2011) 13367-13377.
[8] Y. Liang, C. Liu, X.-Z. Luan, K.-S. Leung, T.-M. Chan, Z.-B. Xu, H. Zhang, Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification, BMC Bioinformatics, 14 (2013) 198-211.
[9] Q. Shen, W.M. Shi, W. Kong, B.X. Ye, A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification, Talanta, 71 (2007) 1679-1683.
[10] Y. Cong, B.-k. Li, X.-g. Yang, Y. Xue, Y.-z. Chen, Y. Zeng, Quantitative structure–activity relationship study of influenza virus neuraminidase A/PR/8/34 (H1N1) inhibitors by genetic algorithm feature selection and support vector regression, Chemom. Intell. Lab. Syst., 127 (2013) 35-42.
[11] J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, Advances in Neural Information Processing Systems, 16 (2004) 49-56.
[12] L. Wang, J. Zhu, H. Zou, Hybrid huberized support vector machines for microarray classification and gene selection, Bioinformatics, 24 (2008) 412-419.
[13] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc., 96 (2001) 1348-1360.
[14] H. Dong, G. Jian, Parameter selection of a support vector machine, based on a chaotic particle swarm optimization algorithm, Cybernetics and Information Technologies, 15 (2015).
[15] P.S. Bradley, O.L. Mangasarian, Feature selection via concave minimization and support vector machines, ICML, 1998, pp. 82-90.
[16] K. Ikeda, N. Murata, Geometrical properties of Nu support vector machines with different norms, Neural Comput., 17 (2005) 2508-2529.
[17] Z. Liu, S. Lin, M.T. Tan, Sparse support vector machines with Lp penalty for biomarker identification, IEEE/ACM Trans. Comput. Biol. Bioinform., 7 (2010) 100-107.
[18] Y. Liu, H. Helen Zhang, C. Park, J. Ahn, Support vector machines with adaptive Lq penalty, Comput. Stat. Data Anal., 51 (2007) 6380-6394.
[19] H.H. Zhang, J. Ahn, X. Lin, C. Park, Gene selection using support vector machines with non-convex penalty, Bioinformatics, 22 (2006) 88-95.
[20] N. Becker, G. Toedt, P. Lichter, A. Benner, Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data, BMC Bioinformatics, 12 (2011) 138-151.
[21] X.-S. Yang, Multiobjective firefly algorithm for continuous optimization, Engineering with Computers, 29 (2013) 175-184.
[22] I. Fister, I. Fister, X.-S. Yang, J. Brest, A comprehensive review of firefly algorithms, Swarm and Evolutionary Computation, 13 (2013) 34-46.
[23] A. Yelghi, C. Köse, A modified firefly algorithm for global minimum optimization, Appl. Soft Comput., 62 (2018) 29-44.
[24] S. Karthikeyan, P. Asokan, S. Nickolas, A hybrid discrete firefly algorithm for multi-objective flexible job shop scheduling problem with limited resource constraints, The International Journal of Advanced Manufacturing Technology, 72 (2014) 1567-1579.
[25] K.W. Broman, T.P. Speed, A model selection approach for the identification of quantitative trait loci in experimental crosses, J. Roy. Statist. Soc. Ser. B, 64 (2002) 641-656.
[26] J. Chen, Z. Chen, Extended Bayesian information criteria for model selection with large model spaces, Biometrika, 95 (2008) 759-771.
[27] H. Park, F. Sakaori, S. Konishi, Robust sparse regression and tuning parameter selection via the efficient bootstrap information criteria, Journal of Statistical Computation and Simulation, 84 (2013) 1596-1607.
[28] S. Roberts, G. Nowak, Stabilizing the lasso against cross-validation variability, Computational Statistics & Data Analysis, 70 (2014) 198-211.
[29] J.A. Sabourin, W. Valdar, A.B. Nobel, A permutation approach for selecting the penalty parameter in penalized model selection, Biometrics, 71 (2015) 1185-1194.
[30] Y. Jung, J. Hu, A K-fold averaging cross-validation procedure, J. Nonparametr. Stat., 27 (2015) 167-179.
[31] R.J. Meijer, J.J. Goeman, Efficient approximate k-fold and leave-one-out cross-validation for ridge regression, Biom. J., 55 (2013) 141-155.
[32] Z. Pang, B. Lin, J. Jiang, Regularisation parameter selection via bootstrapping, Australian & New Zealand Journal of Statistics, 58 (2016) 335-356.
[33] NCBI GEO database, http://www.ncbi.nlm.nih.gov/sites/GDSbrowser, 2017.
[34] S. Yu, S. Zhu, Y. Ma, D. Mao, Enhancing firefly algorithm using generalized opposition-based learning, Computing, 97 (2015) 741-754.
[35] L. Zhang, W. Srisukkham, S.C. Neoh, C.P. Lim, D. Pandit, Classifier ensemble reduction using a modified firefly algorithm: An empirical evaluation, Expert Syst. Appl., 93 (2018) 395-422.
[36] O.S. Qasim, Z.Y. Algamal, Feature selection using particle swarm optimization-based logistic regression model, Chemom. Intell. Lab. Syst., 182 (2018) 41-46.
[37] L. Zhang, L. Shan, J. Wang, Optimal feature selection using distance-based discrete firefly algorithm with mutual information criterion, Neural Comput. Appl., 28 (2016) 2795-2808.
[38] S.L. Tilahun, J.M.T. Ngnotchouye, N.N. Hamadneh, Continuous versions of firefly algorithm: a review, Artificial Intelligence Review, (2017).

Highlights

• The proposed method has better performance than the CV.
• The classification ability of the proposed method is quite high.
• The proposed method performed remarkably well in the gene selection stability test.